Python version of Omni Analytics' Data Science on the Ethereum Blockchain Module 1-2: Reading Files.¶

All datasets, code ideas, concepts, and text used are based on Omni Analytics' great work that can be found here: https://github.com/Omni-Analytics-Group/eth-data-science-course ¶

Translation from R into Python by Jeanna Schoonmaker, Oct 2021¶

Outline¶

Reading files: Excel and .csv

Importing Data¶

We often need to import our own data rather than just using built-in datasets. The following steps are usually taken:

Import the pandas library for reading data files
Use pandas functions such as:
- pd.read_csv(filename)
- pd.read_excel(filename)
Assign the data to a variable name when reading in the file.

Importing Data Demo¶

We load a file in by identifying where in our system the file is stored and then enclosing that filepath in the pd.read_csv parentheses. For this example, we will load the dataset from the github directory where it is stored for this course.

Common practice is to load dataframes (tabular data from a csv) with the name "df."

In [1]:

import pandas as pd
crypto_art_filename = 'https://raw.githubusercontent.com/Omni-Analytics-Group/eth-data-science-course/master/courses/Module%201/crypto_art_pulse.csv'
crypto_art_df = pd.read_csv(crypto_art_filename)

Data in Excel¶

Excel has file formats such as xls and csv - so what are the differences? File extensions .xls or .xlsx are proprietary Excel formats, and they are binary files. .csv is the extension for comma separated value files. They are text files and are directly readable.

Excel file in Excel¶

.csv file in notepad¶

Reading Files in Python¶

Comma separated or tabular separated files can both be loaded with pd.read_csv

This code is used for a comma separated file:

pd.read_csv(filename)

This code is used for a tab separated file:

pd.read_csv(filename, sep="\t")

Reading .csv Files¶

The pd.read_csv also needs special instructions on how to handle some of the unicode characters in the data file, which is why "encoding = 'unicode_escape'" is included.

In [7]:

punks_file = 'https://raw.githubusercontent.com/Omni-Analytics-Group/eth-data-science-course/master/courses/Module%201/punks.csv'
df_punks = pd.read_csv(punks_file, encoding='unicode_escape')
df_punks.head()

Out[7]:

	Transaction	From	To	Crypto	USD	Txn	ID	Sex	Type	Skin	Slots	Rank
0	Sold	0xf5099e	14715954	25.00	2822.0	2018-11-30	0	Girl	Female	Mid	3	3682560000%
1	Sold	0x00d7c9	10528156	1.60	386.0	2017-07-07	0	Girl	Female	Mid	3	3682560000%
2	Sold	0xc352b5	55241	0.98	320.0	2017-06-23	0	Girl	Female	Mid	3	3682560000%
3	Claimed	NaN	12800693	NaN	NaN	2017-06-23	0	Girl	Female	Mid	3	3682560000%
4	Sold	EliteCat	0xcf6165	60.00	36305.0	2020-11-30	1	Guy	Male	Dark	2	2050240500%

In [8]:

df_punks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17554 entries, 0 to 17553
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Transaction  17554 non-null  object 
 1   From         7554 non-null   object 
 2   To           17554 non-null  object 
 3   Crypto       7554 non-null   float64
 4   USD          7554 non-null   float64
 5   Txn          17554 non-null  object 
 6   ID           17554 non-null  int64  
 7   Sex          17554 non-null  object 
 8   Type         17554 non-null  object 
 9   Skin         17554 non-null  object 
 10  Slots        17554 non-null  int64  
 11  Rank         17554 non-null  object 
dtypes: float64(2), int64(2), object(8)
memory usage: 1.6+ MB

Reading Excel Files¶

The pandas library can also handle reading excel files.

In order to access an Excel file in Google colab, first go to this link and click "Download." Make note of where your downloaded file is stored: https://github.com/Omni-Analytics-Group/eth-data-science-course/blob/master/courses/Module%201/punks.xlsx
Click on the file folder icon in the left hand sidebar on your Google Colab screen.
Click on the "Upload" icon, then navigate to the folder where the punks.xlsx file you just downloaded is stored and click on the punks.xlsx file.
When the punks.xlsx file shows up in the Files list, hover your mouse over it and click on the 3 vertical dots to the right of the filename, then click on "Copy path."
Paste the copied path after the equal sign in the first line of the code below:

In [16]:

punks_excel_file = '/content/punks.xlsx' # post your copied path between the single quotation marks
df_punks_excel = pd.read_excel(punks_excel_file, 'Worksheet')
df_punks_excel.head()

Out[16]:

	Transaction	From	To	Crypto	USD	Txn	ID	Sex	Type	Skin	Slots	Rank
0	Sold	0xf5099e	14715954	25.00	2822.0	2018-11-30	0	Girl	Female	Mid	3	3682560000%
1	Sold	0x00d7c9	10528156	1.60	386.0	2017-07-07	0	Girl	Female	Mid	3	3682560000%
2	Sold	0xc352b5	55241	0.98	320.0	2017-06-23	0	Girl	Female	Mid	3	3682560000%
3	Claimed	NaN	12800693	NaN	NaN	2017-06-23	0	Girl	Female	Mid	3	3682560000%
4	Sold	False	0xcf6165	60.00	36305.0	2020-11-30	1	Guy	Male	Dark	2	2050240500%

Your Turn¶

Using the crypto_art_pulse.csv file, do the following:

Read the data into a dataframe called crypto_art_df
Use the head() function to display some rows
Use the info() function to see each variable type

Answers¶

1. Read the data into a dataframe called crypto_art_df¶

In [17]:

crypto_art_filename = 'https://raw.githubusercontent.com/Omni-Analytics-Group/eth-data-science-course/master/courses/Module%201/crypto_art_pulse.csv'
crypto_art_df = pd.read_csv(crypto_art_filename)

2. Use the head() function to display some rows¶

In [18]:

crypto_art_df.head()

Out[18]:

	asset_token_id	asset_contract_address	asset_collection	asset_name	asset_artist	asset_image	asset_link	asset_external_link	asset_first_sale_price	asset_last_sale_price	asset_last_sale_owner_name	asset_last_sale_owner_address	asset_num_sales	asset_last_sale_date
0	211201	0xfbeef911dc5821886e1dda71586d90ed28174b7d	Known-origin	Wi-Five / Cuban Impressions	Adoption	https://lh3.googleusercontent.com/RAcLH_YifV2a...	https://opensea.io/assets/0xfbeef911dc5821886e...	https://knownorigin.io/token/211201	800000000000000000	800000000000000000	mattiac	0x576a655161b5502dcf40602be1f3519a89b71658	1	2020-12-31 14:02:57
1	16773	0xb932a70a57673d89f4acffbe830e8ed7f75fb9e0	Superrare	SUBTERRANEAN HOMESICK ALIEN (TRIBUTE)	rubahitam	https://lh3.googleusercontent.com/pKhMQQNVUt3J...	https://opensea.io/assets/0xb932a70a57673d89f4...	https://superrare.co/?origin=https%3A%2F%2Fope...	1500000000000000000	1500000000000000000	NaN	0x9dac76101b5ccda5970e60402de31a6649115b53	1	2020-12-31 13:35:20
2	16113	0xb932a70a57673d89f4acffbe830e8ed7f75fb9e0	Superrare	Somewhere	MrY	https://lh3.googleusercontent.com/9ycKTkeIZTro...	https://opensea.io/assets/0xb932a70a57673d89f4...	https://superrare.co/?origin=https%3A%2F%2Fope...	500000000000000000	500000000000000000	endless_traveler	0x423a0a0f3479ab24b2b04c8dba8bbe0e2bbb5cc6	1	2020-12-31 13:19:51
3	17443	0xb932a70a57673d89f4acffbe830e8ed7f75fb9e0	Superrare	TINY ROOMS TWO - IN DA CLUB 2020 EDITION	MrY	https://lh3.googleusercontent.com/WgD9n7W0B8Et...	https://opensea.io/assets/0xb932a70a57673d89f4...	https://superrare.co/?origin=https%3A%2F%2Fope...	500000000000000000	500000000000000000	endless_traveler	0x423a0a0f3479ab24b2b04c8dba8bbe0e2bbb5cc6	1	2020-12-31 13:19:51
4	38926	0x2a46f2ffd99e19a89476e2f62270e0a35bbf0756	Makersplace	Boredoom	wizmo	https://lh3.googleusercontent.com/HY5jSHOtLn3g...	https://opensea.io/assets/0x2a46f2ffd99e19a894...	https://makersplace.com/product/token/0x2a46f2...	500000000000000000	500000000000000000	brandonkangfilms	0x720a4fab08cb746fc90e88d1924a98104c0822cf	1	2020-12-31 09:28:46

3. Use the info() function to see each variable type¶

In [19]:

crypto_art_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25336 entries, 0 to 25335
Data columns (total 15 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   asset_token_id                 25336 non-null  int64  
 1   asset_contract_address         25336 non-null  object 
 2   asset_collection               25336 non-null  object 
 3   asset_name                     25335 non-null  object 
 4   asset_artist                   25299 non-null  object 
 5   asset_image                    25336 non-null  object 
 6   asset_link                     25336 non-null  object 
 7   asset_external_link            25215 non-null  object 
 8   asset_first_sale_price         25336 non-null  object 
 9   asset_last_sale_price          25336 non-null  object 
 10  asset_last_sale_owner_name     1556 non-null   object 
 11  asset_last_sale_owner_address  25308 non-null  object 
 12  asset_sale_grows               25336 non-null  float64
 13  asset_num_sales                25336 non-null  int64  
 14  asset_last_sale_date           25336 non-null  object 
dtypes: float64(1), int64(2), object(12)
memory usage: 2.9+ MB

Thanks for sticking with us and learning about reading .csv and Excel files!¶