# Vectorized String Operations

## 1. INTRODUCING PANDAS STRING OPERATIONS

```python
import numpy as np
import pandas as pd 
```

> Vectorization is the process of applying an operation to multiple items (in an array, for example) in one go.

```python
x = np.array([1, 2, 3, 4, 5])

# vectorized operation: every element is multiplied by 10 at once
x * 10
```

```
array([10, 20, 30, 40, 50])
```

However, vectorizing operations on an “array of strings” is less straightforward. Pandas addresses this need with vectorized string methods, available through the `str` accessor of a Series.

```python
names_series  = pd.Series(['tom','JOhn','MARIA'])
names_series
```

```
0      tom
1     JOhn
2    MARIA
dtype: object
```

```python
names_series.str.capitalize()
```

```
0      Tom
1     John
2    Maria
dtype: object
```
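One practical benefit of the `str` methods is that they tolerate missing values. The sketch below illustrates this under the assumption of a small hypothetical Series, `with_missing`, which is not part of the data used elsewhere in these notes:

```python
import pandas as pd

# hypothetical data containing a missing value (illustration only)
with_missing = pd.Series(['tom', None, 'MARIA'])

# a plain Python loop breaks on the missing entry
try:
    [s.capitalize() for s in with_missing]
    comprehension_failed = False
except AttributeError:
    comprehension_failed = True

# the vectorized method passes the missing value through instead of raising
capitalized = with_missing.str.capitalize()
print(comprehension_failed)
print(capitalized)
```

The missing entry stays missing in `capitalized`, while the two real strings are capitalized as before.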

## 2. STRING OPERATIONS

Let’s first define a Pandas Series to work with:

```python
# Pandas Series used in this section
names = pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White', 'Hank Shrader', 'Mike Ehrmantraut', 'Gus Fring']) 
names
```

```
0        Walter White
1       Jesse Pinkman
2        Skyler White
3        Hank Shrader
4    Mike Ehrmantraut
5           Gus Fring
dtype: object
```

### 2.1. Methods Similar to Python String Methods

Nearly all of [Python’s built-in string methods](https://www.w3schools.com/python/python_ref_string.asp) are mirrored by a Pandas vectorized string method. Visit this [link](https://www.w3schools.com/python/python_ref_string.asp) for the complete list of the Python methods.

```python
# let's apply some of these string methods to the Pandas Series
# convert to upper case
names.str.upper()
```

```
0        WALTER WHITE
1       JESSE PINKMAN
2        SKYLER WHITE
3        HANK SHRADER
4    MIKE EHRMANTRAUT
5           GUS FRING
dtype: object
```

```python
# check whether each string consists only of digits
names.str.isdigit()
```

```
0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool
```

```python
# get the length of each item in the Series
names.str.len()
```

```
0    12
1    13
2    12
3    12
4    16
5     9
dtype: int64
```

```python
# get a boolean Series that is True where the item starts with 'W'
names.str.startswith('W')
```

```
0     True
1    False
2    False
3    False
4    False
5    False
dtype: bool
```
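A boolean Series like the one above is typically used as a mask to filter the original data. A minimal sketch, reusing the same `names` Series; the variable name `whites` is introduced here only for illustration:

```python
import pandas as pd

names = pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White',
                   'Hank Shrader', 'Mike Ehrmantraut', 'Gus Fring'])

# build a boolean mask, then use it to keep only the matching rows
mask = names.str.endswith('White')
whites = names[mask]
print(whites)
```

The filtered Series keeps the original index labels, so the surviving rows can still be traced back to their positions in `names`.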

### 2.2. String Methods using Regular Expressions

A regular expression is a special syntax for matching a string or a set of strings. The topic is broad and can be dry, so we will only sample the plain-vanilla flavor of it here.

Several `str` methods accept regular expressions **to examine** ***the content of each string element***, and follow some of the API conventions of Python’s built-in `re` module.

```python
# apply str.extract() with a regular expression to extract the first names
names.str.extract('([A-Za-z]+)')
```

```
        0
0  Walter
1   Jesse
2  Skyler
3    Hank
4    Mike
5     Gus
```

There are some good introductory examples of regular expression usage in Python [here](https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/).
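Beyond `str.extract()`, a few other regex-aware methods are worth a quick look. The sketch below is illustrative rather than exhaustive; the patterns and the variable names (`starts_g_or_h`, `parts`, `swapped`) are chosen here for the example and do not come from the linked article:

```python
import pandas as pd

names = pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White',
                   'Hank Shrader', 'Mike Ehrmantraut', 'Gus Fring'])

# str.contains(): True where the pattern matches (here: starts with G or H)
starts_g_or_h = names.str.contains(r'^[GH]')

# str.extract() with named groups: one DataFrame column per group
parts = names.str.extract(r'(?P<first>[A-Za-z]+) (?P<last>[A-Za-z]+)')

# str.replace() with regex=True: backreferences reorder the two words
swapped = names.str.replace(r'^(\w+) (\w+)$', r'\2, \1', regex=True)

print(starts_g_or_h)
print(parts)
print(swapped)
```

Named groups make the extracted columns self-describing, which is usually easier to work with than the default integer column labels.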

### 2.3. Vectorized indexing and slicing

```python
# getting first letter of each element in the array
# using standard indexing method
names.str[0]
```

```
0    W
1    J
2    S
3    H
4    M
5    G
dtype: object
```

```python
# getting first letter of each element in the array
# using  str.get() method
names.str.get(0)
```

```
0    W
1    J
2    S
3    H
4    M
5    G
dtype: object
```

```python
# str.slice(0, 2) returns the first two characters of each element
names.str.slice(0,2)
```

```
0    Wa
1    Je
2    Sk
3    Ha
4    Mi
5    Gu
dtype: object
```

```python
# str.split()
names.str.split()
```

```
0        [Walter, White]
1       [Jesse, Pinkman]
2        [Skyler, White]
3        [Hank, Shrader]
4    [Mike, Ehrmantraut]
5           [Gus, Fring]
dtype: object
```

```python
# str.split() with str.get(0) to get first name
names.str.split().str.get(0)
```

```
0    Walter
1     Jesse
2    Skyler
3      Hank
4      Mike
5       Gus
dtype: object
```
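When both halves of the split are needed, `str.split()` can return a DataFrame directly through its `expand` parameter. A short sketch, with the column names `first` and `last` assigned here purely for illustration:

```python
import pandas as pd

names = pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White',
                   'Hank Shrader', 'Mike Ehrmantraut', 'Gus Fring'])

# expand=True spreads the split pieces across DataFrame columns
parts = names.str.split(' ', expand=True)
parts.columns = ['first', 'last']

# negative indexing on the list-valued Series picks the last piece
last_names = names.str.split().str[-1]

print(parts)
print(last_names)
```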

### 2.4. `get_dummies`

The `str.get_dummies()` method splits each string on a separator and returns the resulting indicator (0/1) variables as a DataFrame.

```python
dummy = pd.DataFrame({'info': ['A|B|C','A','A|C'],
                     'name': ['tom','dick','harry']})
print(dummy)
```

```
    info   name
0  A|B|C    tom
1      A   dick
2    A|C  harry
```

```python
# using get_dummies
print(dummy['info'].str.get_dummies('|'))
```

```
   A  B  C
0  1  1  1
1  1  0  0
2  1  0  1
```
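The indicator columns are often more useful next to the rest of the data. One possible way to attach them, sketched with the same `dummy` DataFrame; the names `indicators` and `combined` are introduced here for the example:

```python
import pandas as pd

dummy = pd.DataFrame({'info': ['A|B|C', 'A', 'A|C'],
                      'name': ['tom', 'dick', 'harry']})

# build the indicator columns, then join them to the name column by index
indicators = dummy['info'].str.get_dummies('|')
combined = dummy[['name']].join(indicators)
print(combined)
```

Because `get_dummies()` preserves the original index, a plain index-based `join()` lines the rows up without any extra key.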

