Vectorized String Operations
1. INTRODUCING PANDAS STRING OPERATIONS
import numpy as np
import pandas as pd
Vectorization is the process of performing an operation on multiple items (in an array, for example) in one go.
x = np.array([1,2,3,4,5])
# performing vectorization of operations
x * 10
array([10, 20, 30, 40, 50])
However, vectorizing operations on an array of strings is not as straightforward. Pandas addresses this need with the vectorized string methods available through the str attribute of a Series.
names_series = pd.Series(['tom','JOhn','MARIA'])
names_series
0 tom
1 JOhn
2 MARIA
dtype: object
names_series.str.capitalize()
0 Tom
1 John
2 Maria
dtype: object
2. STRING OPERATIONS
Let’s first define a Pandas Series to work with:
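A minimal sketch, assuming a small Series of hypothetical names (the values are illustrative only; any Series of strings would do). One missing entry is included because the str methods are designed to pass missing values through rather than raise:

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
names = pd.Series(["ada lovelace", "Alan TURING", "grace Hopper", None])
print(names)
```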
2.1. Methods Similar to Python String Methods
Nearly all of Python’s built-in string methods are mirrored by a Pandas vectorized string method. Visit this link to get the complete list.
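As a rough illustration of this mirroring, the sketch below applies a few str methods that share names with Python string methods. The Series and its values are hypothetical; the methods themselves (lower, title, len, startswith, split) are part of the Pandas str accessor.

```python
import pandas as pd

# Hypothetical sample data, including one missing value.
names = pd.Series(["ada lovelace", "Alan TURING", "grace Hopper", None])

lowered = names.str.lower()            # element-wise equivalent of str.lower()
titled = names.str.title()             # element-wise equivalent of str.title()
lengths = names.str.len()              # number of characters in each element
starts_a = names.str.startswith("a")   # case-sensitive prefix test
parts = names.str.split()              # split each element on whitespace

print(titled)
print(lengths)
```

Each call returns a new Series aligned with the original index, so the results can be assigned straight back into a DataFrame column.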
2.2. String Methods using Regular Expressions
A regular expression is a special syntax for matching a string or a set of strings. The topic is broad and can be dry, so we only sample the plain-vanilla flavor here.
Several str methods accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python’s built-in re module.
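A few of the regex-aware methods can be sketched as follows. The Series is hypothetical and the patterns are deliberately simple; contains, extract, findall, count and replace are existing str methods that take a pattern.

```python
import pandas as pd

# Hypothetical sample data.
names = pd.Series(["ada lovelace", "Alan TURING", "grace Hopper"])

# True where the element begins with an uppercase ASCII letter.
starts_upper = names.str.contains(r"^[A-Z]")

# Capture the first run of word characters; expand=False returns a Series.
first_word = names.str.extract(r"^(\w+)", expand=False)

# List of every lowercase vowel found in each element.
vowels = names.str.findall(r"[aeiou]")

# Number of lowercase vowels in each element.
vowel_counts = names.str.count(r"[aeiou]")

# Replace runs of whitespace with an underscore (regex=True is explicit here).
underscored = names.str.replace(r"\s+", "_", regex=True)

print(first_word)
print(underscored)
```

Passing regex=True explicitly to replace avoids depending on the default, which has changed between Pandas versions.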
There are some good introductory examples of regular expression usage in Python here.
2.3. Vectorized indexing and slicing
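The str accessor also supports Python-style indexing and slicing applied to every element. A minimal sketch, assuming a hypothetical Series of names:

```python
import pandas as pd

# Hypothetical sample data.
names = pd.Series(["ada lovelace", "Alan TURING", "grace Hopper"])

first_three = names.str[0:3]             # shorthand for names.str.slice(0, 3)
first_char = names.str[0]                # shorthand for names.str.get(0)

# Indexing also works on the lists produced by split():
# take the last whitespace-separated token of each element.
last_token = names.str.split().str[-1]

print(first_three)
print(last_token)
```

Chaining split() with str[-1] is a common way to pull out one field without writing an explicit loop.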
2.4. get_dummies
The get_dummies() method lets you quickly split out indicator variables into a DataFrame.
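A minimal sketch of this, assuming a hypothetical Series in which each entry holds category codes separated by a pipe character. The codes and the variable names are made up for illustration; str.get_dummies and its sep parameter are part of Pandas.

```python
import pandas as pd

# Hypothetical data: each entry lists one or more codes separated by "|".
codes = pd.Series(["A|B", "B", "A|C", "C"])

# One column per distinct code; a 1 marks rows where that code is present.
indicators = codes.str.get_dummies(sep="|")
print(indicators)
```

The resulting DataFrame has one row per original element and one column per distinct code, which makes it convenient to join back onto the source data.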