Vectorized String Operations

1. INTRODUCING PANDAS STRING OPERATIONS

import numpy as np
import pandas as pd 

Vectorization is the process of performing an operation on multiple items (for example, every element of an array) in one go.

x = np.array([1, 2, 3, 4, 5])

# performing vectorization of operations
x * 10
array([10, 20, 30, 40, 50])

However, vectorizing operations on an array of strings is not as straightforward. Pandas addresses this need with the vectorized string methods available through the str attribute of Series and Index objects.

names_series = pd.Series(['tom', 'JOhn', 'MARIA'])
names_series
0      tom
1     JOhn
2    MARIA
dtype: object
names_series.str.capitalize()
0      Tom
1     John
2    Maria
dtype: object
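
One practical benefit of the str methods is that they tolerate missing values, which a plain Python loop does not. The following is a minimal sketch; the Series names_with_gap is hypothetical data made up for illustration.

```python
import pandas as pd

# Hypothetical data with a missing entry (not from the original text).
names_with_gap = pd.Series(['tom', None, 'MARIA'])

# A plain list comprehension would fail on the missing entry:
# [s.capitalize() for s in names_with_gap]  # AttributeError on None

# The .str accessor applies the method element-wise and
# leaves the missing entry as a missing value.
capitalized = names_with_gap.str.capitalize()
print(capitalized)
```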

2. STRING OPERATIONS

Let’s first define a Pandas Series to work with:
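
The Series used by the original example is not shown in the text, so the names below are hypothetical stand-ins chosen only for illustration.

```python
import pandas as pd

# Hypothetical example data; any Series of strings would work equally well.
people = pd.Series(['Ada Lovelace', 'Alan Turing',
                    'Grace Hopper', 'Linus Torvalds'])
print(people)
```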

2.1. Methods Similar to Python String Methods

Nearly all of Python’s built-in string methods are mirrored by a Pandas vectorized string method. The complete list is available in the Pandas documentation.
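
A few of these mirrored methods can be sketched as follows. The Series people is hypothetical example data; the methods themselves (lower, len, startswith, split) are standard members of the str accessor.

```python
import pandas as pd

# Hypothetical example data.
people = pd.Series(['Ada Lovelace', 'Alan Turing',
                    'Grace Hopper', 'Linus Torvalds'])

lowered = people.str.lower()           # like str.lower() on each element
lengths = people.str.len()             # like len() on each element
starts_a = people.str.startswith('A')  # boolean Series
parts = people.str.split()             # each element becomes a list of words

print(lowered)
print(lengths)
print(starts_a)
print(parts)
```

Each call returns a new Series aligned with the original index, so the results can be assigned straight back to DataFrame columns.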

2.2. String Methods using Regular Expressions

A regular expression is a special syntax for matching a string or a set of strings. The topic is broad and can be dry, so only the plain-vanilla basics are covered here.

Several str methods accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python’s built-in re module.

Good introductory examples of regular expression usage in Python can be found in the Regular Expression HOWTO in the Python documentation.
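
As a rough illustration of the regex-aware methods, the sketch below uses contains, extract, findall, replace, and count. The Series people and the patterns are hypothetical choices made for this example.

```python
import pandas as pd

# Hypothetical example data.
people = pd.Series(['Ada Lovelace', 'Alan Turing',
                    'Grace Hopper', 'Linus Torvalds'])

# True where the string starts with a capital A.
starts_with_a = people.str.contains(r'^A')

# Capture the first run of word characters (the first name).
first_names = people.str.extract(r'^(\w+)', expand=False)

# All uppercase letters in each string, as a list per element.
capitals = people.str.findall(r'[A-Z]')

# Replace each run of whitespace with an underscore.
underscored = people.str.replace(r'\s+', '_', regex=True)

# Number of lowercase vowels in each string.
vowel_counts = people.str.count(r'[aeiou]')

print(starts_with_a, first_names, capitals, underscored, vowel_counts, sep='\n')
```

Passing regex=True to replace makes the intent explicit, since the default treatment of the pattern has differed between Pandas versions.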

2.3. Vectorized indexing and slicing
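
The str accessor also supports element-wise indexing and slicing. The sketch below assumes a hypothetical Series named people; str[...], str.slice(), and str.get() are the standard accessor operations.

```python
import pandas as pd

# Hypothetical example data.
people = pd.Series(['Ada Lovelace', 'Alan Turing',
                    'Grace Hopper', 'Linus Torvalds'])

# First three characters of each element; same as people.str.slice(0, 3).
first_three = people.str[0:3]

# Last character of each element; same as people.str.get(-1).
last_char = people.str[-1]

# get() also indexes into lists, so it combines well with split():
# here it picks the last word of each name.
surnames = people.str.split().str.get(-1)

print(first_three, last_char, surnames, sep='\n')
```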

2.4. get_dummies

The get_dummies() method quickly splits the delimiter-separated codes in each string into indicator (0/1) variables, returned as a DataFrame.
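
A minimal sketch of this, assuming a hypothetical Series of pipe-separated codes named tags:

```python
import pandas as pd

# Hypothetical data: each entry holds one or more codes separated by '|'.
tags = pd.Series(['B|C', 'A', 'A|C|D'])

# One column per distinct code; 1 where the code is present, 0 otherwise.
dummies = tags.str.get_dummies(sep='|')
print(dummies)
```

The separator defaults to '|', so sep is spelled out here only to make the assumption about the data format visible.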
