Vectorized String Operations

1. INTRODUCING PANDAS STRING OPERATIONS

import numpy as np
import pandas as pd 

Vectorization is the process of applying an operation to multiple items (the elements of an array, for example) in one go, without writing an explicit loop.

x = np.array([1, 2, 3, 4, 5])

# vectorized multiplication: every element is multiplied by 10
x * 10
array([10, 20, 30, 40, 50])
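
To make the idea concrete, the sketch below compares an explicit element-by-element loop with the vectorized expression. The variable names `looped` and `vectorized` are illustrative only and do not appear elsewhere in this section.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# Explicit loop: multiply one element at a time
looped = np.array([value * 10 for value in x])

# Vectorized: a single expression applied to the whole array
vectorized = x * 10

# Both approaches produce the same values
print(np.array_equal(looped, vectorized))
```

The two results are identical; the vectorized form is shorter and leaves the looping to NumPy.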

However, vectorizing operations on an array of strings is not as straightforward. Pandas addresses this need with vectorized string methods, available through the str attribute of a Series.

names_series = pd.Series(['tom', 'JOhn', 'MARIA'])
names_series
0      tom
1     JOhn
2    MARIA
dtype: object
names_series.str.capitalize()
0      Tom
1     John
2    Maria
dtype: object
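
One practical reason to prefer the str methods over a plain Python loop is missing data. The sketch below assumes a hypothetical list `raw` that contains a `None` entry: the list comprehension fails on it, while the str accessor passes the missing value through.

```python
import pandas as pd

# Hypothetical data with a missing entry (not from the examples above)
raw = ['tom', None, 'MARIA']

# A plain list comprehension breaks on the missing value
try:
    [s.capitalize() for s in raw]
    loop_failed = False
except AttributeError:
    loop_failed = True

# The .str accessor leaves the missing value as missing
capitalized = pd.Series(raw).str.capitalize()

print(loop_failed)
print(capitalized)
```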

2. STRING OPERATIONS

Let’s first define a Pandas Series to work with:

# Pandas Series used in this section
names = pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White', 'Hank Shrader', 'Mike Ehrmantraut', 'Gus Fring']) 
names
0        Walter White
1       Jesse Pinkman
2        Skyler White
3        Hank Shrader
4    Mike Ehrmantraut
5           Gus Fring
dtype: object

2.1. Methods Similar to Python String Methods

Nearly all of Python’s built-in string methods are mirrored by a Pandas vectorized string method. The complete list is in the string-handling section of the pandas documentation.

# let's apply some of these string methods to a Pandas Series
# to upper case
names.str.upper()
0        WALTER WHITE
1       JESSE PINKMAN
2        SKYLER WHITE
3        HANK SHRADER
4    MIKE EHRMANTRAUT
5           GUS FRING
dtype: object
# check whether each string consists only of digits
names.str.isdigit()
0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool
# length of each string in the Series
names.str.len()
0    12
1    13
2    12
3    12
4    16
5     9
dtype: int64
# boolean Series: True where the string starts with 'W'
names.str.startswith('W')
0     True
1    False
2    False
3    False
4    False
5    False
dtype: bool
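
A boolean result like the one above is most useful as a mask for filtering. The following sketch uses str.endswith(), the counterpart of str.startswith(), to keep only the matching rows; the names `is_white` and `whites` are illustrative.

```python
import pandas as pd

names = pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White',
                   'Hank Shrader', 'Mike Ehrmantraut', 'Gus Fring'])

# Boolean mask: True where the name ends with 'White'
is_white = names.str.endswith('White')

# Index the Series with the mask to keep only the matching rows
whites = names[is_white]
print(whites)
```

The filtered Series keeps the original index labels of the rows that matched.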

2.2. String Methods using Regular Expressions

A regular expression is a special syntax for matching a string or a set of strings. The topic is broad and can be dry, so we will only taste the plain-vanilla flavor of it here.

The following methods accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python’s built-in re module.

# let's apply the str.extract() method with a regular expression to extract the first names
names.str.extract('([A-Za-z]+)')
        0
0  Walter
1   Jesse
2  Skyler
3    Hank
4    Mike
5     Gus
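
As a further sketch of the regex-aware methods, the example below uses named capture groups with str.extract() to get first and last names as labelled columns, and str.contains() to build a boolean mask. The group names `first` and `last` and the pattern choices are assumptions made for illustration.

```python
import pandas as pd

names = pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White',
                   'Hank Shrader', 'Mike Ehrmantraut', 'Gus Fring'])

# Two named capture groups produce two named columns
parts = names.str.extract(r'(?P<first>[A-Za-z]+) (?P<last>[A-Za-z]+)')

# str.contains() returns a boolean mask; here: names ending in 'man' or 'ing'
mask = names.str.contains(r'(?:man|ing)$')

print(parts)
print(names[mask])
```

The non-capturing group `(?:...)` is used in the str.contains() pattern because only a match test is needed there, not an extracted value.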

Good introductory examples of regular expression usage in Python can be found in the Regular Expression HOWTO in the official Python documentation.

2.3. Vectorized Indexing and Slicing

# getting first letter of each element in the array
# using standard indexing method
names.str[0]
0    W
1    J
2    S
3    H
4    M
5    G
dtype: object
# getting first letter of each element in the array
# using  str.get() method
names.str.get(0)
0    W
1    J
2    S
3    H
4    M
5    G
dtype: object
# str.slice(0, 2): first two characters of each element, same as names.str[0:2]
names.str.slice(0,2)
0    Wa
1    Je
2    Sk
3    Ha
4    Mi
5    Gu
dtype: object
# str.split()
names.str.split()
0        [Walter, White]
1       [Jesse, Pinkman]
2        [Skyler, White]
3        [Hank, Shrader]
4    [Mike, Ehrmantraut]
5           [Gus, Fring]
dtype: object
# str.split() with str.get(0) to get first name
names.str.split().str.get(0)
0    Walter
1     Jesse
2    Skyler
3      Hank
4      Mike
5       Gus
dtype: object
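
Two variations on the split examples above may be handy, shown here as a sketch: passing expand=True to str.split() returns a DataFrame with one column per piece, and a negative index picks the last piece. The column labels `first` and `last` are hypothetical.

```python
import pandas as pd

names = pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White',
                   'Hank Shrader', 'Mike Ehrmantraut', 'Gus Fring'])

# expand=True returns a DataFrame with one column per piece
parts = names.str.split(expand=True)
parts.columns = ['first', 'last']  # hypothetical column labels

# Negative indices work as they do on plain Python lists
last_names = names.str.split().str[-1]

print(parts)
print(last_names)
```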

2.4. get_dummies

The str.get_dummies() method splits each string on a separator and returns a DataFrame of indicator variables, with one column per distinct value.

dummy = pd.DataFrame({'info': ['A|B|C','A','A|C'],
                     'name': ['tom','dick','harry']})
print(dummy)
    info   name
0  A|B|C    tom
1      A   dick
2    A|C  harry
# using get_dummies
print(dummy['info'].str.get_dummies('|'))
   A  B  C
0  1  1  1
1  1  0  0
2  1  0  1
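
The indicator columns are usually more useful next to the rest of the data. As a sketch, they can be joined back onto the other columns by index; the names `indicators` and `combined` are illustrative.

```python
import pandas as pd

dummy = pd.DataFrame({'info': ['A|B|C', 'A', 'A|C'],
                      'name': ['tom', 'dick', 'harry']})

# One indicator column per distinct value found in 'info'
indicators = dummy['info'].str.get_dummies('|')

# Join the indicators onto the 'name' column using the shared index
combined = dummy[['name']].join(indicators)
print(combined)
```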
