Io File Read And Write

1. READING THE FILE

We will learn how to read the three most commonly used file format - csv, xls(xlsx) and json

1.1 Reading a CSV file

  • CSV is a comma separated key-value pairs, where key is the column name and value its corresponding column values

  • pd.read_csv() constructor is used to read the csv file

  • First (and only required) keyword argument is the location of the file. The returned output is the DataFrame with integer indexes and comma-separated column names as column labels for the DataFrame

  • However, when we specify the keyword argument index_col, we can specify which columns we want to use as row labels

In the example below, we are going to read the file which is available here. I dropped some columns in the file to make it presentable because our main purpose is to read/write files rather than working with actual content of the file. In addition, following short terms are used column labels:

  • Number of firms = No

  • Average Unlevered Beta = AU Beta

  • Average Levered Beta = AL Beta

import numpy as np
import pandas as pd 

# location of csv file we are going to read
csv_location = "data/totalbeta.csv"

csv_df = pd.read_csv(csv_location)

# .head() shows only the first five rows in output
csv_df.head() 
IndustryNo.AU BetaAL Beta

0

Advertising

47

0.93

1.44

1

Aerospace/Defense

77

1.08

1.23

2

Air Transport

18

0.84

1.44

3

Apparel

51

0.83

1.06

4

Auto & Truck

13

0.53

1.10

Let use the kwarg index_col to use specified column value as row label:

# Assigning column "Industry Name" as row labels
# Index positon of "industry Name" is 0 

pd.read_csv(csv_location, index_col=0).head()
No.AU BetaAL Beta

Industry

---

---

---

---

Advertising

47

0.93

1.44

Aerospace/Defense

77

1.08

1.23

Air Transport

18

0.84

1.44

Apparel

51

0.83

1.06

Auto & Truck

13

0.53

1.10

1.2. Reading an Excel File

  • We can read all spreadsheets in an excel workbook or individual sheets inside the workbook

  • pd.read_excel constructor is used to read an excel file

  • We can provide keyword argument, sheet_name to read a specific sheet in workbook — it can be either integer index or exact sheet name as string. sheet_name=None returns all sheets

  • We can provide keyword argument, index_col to set the column name as row label (just like we do in pd.read_csv)

xls_location  = "data/totalbeta.xls"

xls_df = pd.read_excel(xls_location)
xls_df.head()
IndustryNo.AU BetaAL Beta

0

Advertising

47

0.934953

1.439612

1

Aerospace/Defense

77

1.078522

1.231584

2

Air Transport

18

0.843673

1.435348

3

Apparel

51

0.829641

1.055097

4

Auto & Truck

13

0.525735

1.095074

  • Let’s fetch all sheet names inside the workbook using kwarg sheet_name=None and .keys method:

# reading by sheet_name
# first, let fetch all the sheet names in the workbook

pd.read_excel(xls_location, sheet_name=None).keys()
odict_keys(['US', 'Global', 'Europe', 'Emerging'])
  • Let’s store each sheet content inside the variable

# storing each sheet in its own dataframe

us_sheet = pd.read_excel(xls_location, sheet_name='US')
global_sheet = pd.read_excel(xls_location, sheet_name='Global')
europe_sheet = pd.read_excel(xls_location, sheet_name='Europe')
emerging_sheet = pd.read_excel(xls_location, sheet_name='Emerging')
# read the first sheet by integer indexing

pd.read_excel(xls_location, sheet_name=0).head()
IndustryNo.AU BetaAL Beta

0

Advertising

47

0.934953

1.439612

1

Aerospace/Defense

77

1.078522

1.231584

2

Air Transport

18

0.843673

1.435348

3

Apparel

51

0.829641

1.055097

4

Auto & Truck

13

0.525735

1.095074

# read the first sheet by its sheet name

pd.read_excel(xls_location, sheet_name='US').head()
IndustryNo.AU BetaAL Beta

0

Advertising

47

0.934953

1.439612

1

Aerospace/Defense

77

1.078522

1.231584

2

Air Transport

18

0.843673

1.435348

3

Apparel

51

0.829641

1.055097

4

Auto & Truck

13

0.525735

1.095074

# reading more than one sheets

excel_file = pd.read_excel(xls_location, sheet_name=[0,-1])
excel_file.keys()
odict_keys([0, -1])
excel_file[-1].head()
Industry NameNo.AU BetaAL Beta

0

Advertising

103

1.463880

1.495675

1

Aerospace/Defense

89

1.099578

1.175447

2

Air Transport

86

0.590283

1.111339

3

Apparel

907

0.635054

0.749096

4

Auto & Truck

83

1.127485

1.379411

1.3. Reading a JSON file

  • Think of json file as a python dictionary with key-value pair

  • pd.read_json() constructor is used to read json file

  • When no keyword argument is used: outer keys are used as column labels and inner keys are used as row labels

  • When we use orient='index' keyword argument, outer keys are used as row labels and inner keys are used as column labels

json_location = "data/totalbeta.json"

json_df = pd.read_json(json_location)
json_df.head()
IndustryNo.AU BetaAL Beta

0

Advertising

47

0.93

1.44

1

Aerospace/Defense

77

1.08

1.23

2

Air Transport

18

0.84

1.44

3

Apparel

51

0.83

1.06

4

Auto & Truck

13

0.53

1.10

2. Writing to File

We will write in three most commonly used formats - csv, xls(xlsx) and json

2.1. Writing to a CSV file

  • .to_csv() constructor is used to write in csv format

  • The first keyword argument is the name of file

  • By default, to.csv() uses the index labels as first column in the csv file, which can be non-meaningful if just integer, we can provide index=False keyword argument to not write the row label into a csv file

  • for the purpose of demonstration, we will use the data we have stored above under variables csv_df, excel_df and json_df

# writing
csv_df.to_csv('data/tocsv_file.csv') 

# reading the written file
pd.read_csv("data/tocsv_file.csv").head()
Unnamed: 0IndustryNo.AU BetaAL Beta

0

0

Advertising

47

0.93

1.44

1

1

Aerospace/Defense

77

1.08

1.23

2

2

Air Transport

18

0.84

1.44

3

3

Apparel

51

0.83

1.06

4

4

Auto & Truck

13

0.53

1.10

We can see that the index label is used as first column, which is meaningless in our case, let use index=False to not use the index as first column

# setting index=False

# writing
csv_df.to_csv('data/tocsv_index_false.csv', index=False) 

# reading the written file
pd.read_csv("data/tocsv_index_false.csv").head()
IndustryNo.AU BetaAL Beta

0

Advertising

47

0.93

1.44

1

Aerospace/Defense

77

1.08

1.23

2

Air Transport

18

0.84

1.44

3

Apparel

51

0.83

1.06

4

Auto & Truck

13

0.53

1.10

2.2. Writing to Excel File

  • .to_excel() constructor is used to write single sheet to the excel workbook

  • To write multiple sheets in excel workbook, first we need to load the excel file using pd.ExcelWriter, then use it as first argument in the .to_excel()

  • default label for sheet is Sheet1, Sheet2 unless we provide it explicitly using sheet_name keyword argument

  • to avoid writing row label as first column, use index=False

# writing
xls_df.to_excel('data/toexcel_file.xls', index=False)

# reading the written file
pd.read_excel('data/toexcel_file.xls').head()
IndustryNo.AU BetaAL Beta

0

Advertising

47

0.934953

1.439612

1

Aerospace/Defense

77

1.078522

1.231584

2

Air Transport

18

0.843673

1.435348

3

Apparel

51

0.829641

1.055097

4

Auto & Truck

13

0.525735

1.095074

# writing individual sheets to excel workbook

with pd.ExcelWriter('data/toexcel_file_multiple_sheets.xls') as writer:
    us_sheet.to_excel(writer, index=False, sheet_name='usa')
    europe_sheet.to_excel(writer, index=False, sheet_name='europe')
    global_sheet.to_excel(writer, index=False, sheet_name='global')
    emerging_sheet.to_excel(writer, index=False, sheet_name='emerging')
    
# reading the written file

xls_df_new = pd.read_excel('data/toexcel_file_multiple_sheets.xls', sheet_name=None)
xls_df_new.keys()
odict_keys(['usa', 'europe', 'global', 'emerging'])
xls_df_new['usa'].head()
IndustryNo.AU BetaAL Beta

0

Advertising

47

0.934953

1.439612

1

Aerospace/Defense

77

1.078522

1.231584

2

Air Transport

18

0.843673

1.435348

3

Apparel

51

0.829641

1.055097

4

Auto & Truck

13

0.525735

1.095074

2.3. Write to JSON File

  • .to_json() constructor is used

  • Meaning and application of orient='index' in .to_json will remain the same as explained in pd.read_json()

# writing
json_df.to_json('data/tojson_file.json')

# reading the written file
pd.read_json('data/tojson_file.json').head()
IndustryNo.AU BetaAL Beta

0

Advertising

47

0.93

1.44

1

Aerospace/Defense

77

1.08

1.23

2

Air Transport

18

0.84

1.44

3

Apparel

51

0.83

1.06

4

Auto & Truck

13

0.53

1.10

Last updated