Python version of Omni Analytics' Data Science on the Ethereum Blockchain Module 1-2: Reading Files.¶

All datasets, code ideas, concepts, and text used are based on Omni Analytics' great work that can be found here: https://github.com/Omni-Analytics-Group/eth-data-science-course¶

Translation from R into Python by Jeanna Schoonmaker, Oct 2021¶

Outline¶

  • Reading files: Excel and .csv

Importing Data¶

We often need to import our own data rather than just using built-in datasets. The following steps are usually taken:

  • Import the pandas library for reading data files
  • Use pandas functions such as:
    • pd.read_csv(filename)
    • pd.read_excel(filename)
  • Assign the data to a variable name when reading in the file.

Importing Data Demo¶

We load a file in by identifying where in our system the file is stored and then enclosing that filepath in the pd.read_csv parentheses. For this example, we will load the dataset from the github directory where it is stored for this course.

Common practice is to load dataframes (tabular data from a csv) with the name "df."

In [1]:
import pandas as pd
crypto_art_filename = 'https://raw.githubusercontent.com/Omni-Analytics-Group/eth-data-science-course/master/courses/Module%201/crypto_art_pulse.csv'
crypto_art_df = pd.read_csv(crypto_art_filename)

Data in Excel¶

Excel has file formats such as xls and csv - so what are the differences? File extensions .xls or .xlsx are proprietary Excel formats, and they are binary files. .csv is the extension for comma separated value files. They are text files and are directly readable.

Excel file in Excel¶

excelinexcel.png

.csv file in notepad¶

csvexample.png

Reading Files in Python¶

Comma separated or tabular separated files can both be loaded with pd.read_csv

This code is used for a comma separated file:

pd.read_csv(filename)

This code is used for a tab separated file:

pd.read_csv(filename, sep="\t")

Reading .csv Files¶

The pd.read_csv also needs special instructions on how to handle some of the unicode characters in the data file, which is why "encoding = 'unicode_escape'" is included.

In [7]:
punks_file = 'https://raw.githubusercontent.com/Omni-Analytics-Group/eth-data-science-course/master/courses/Module%201/punks.csv'
df_punks = pd.read_csv(punks_file, encoding='unicode_escape')
df_punks.head()
Out[7]:
Transaction From To Crypto USD Txn ID Sex Type Skin Slots Rank
0 Sold 0xf5099e 14715954 25.00 2822.0 2018-11-30 0 Girl Female Mid 3 3682560000%
1 Sold 0x00d7c9 10528156 1.60 386.0 2017-07-07 0 Girl Female Mid 3 3682560000%
2 Sold 0xc352b5 55241 0.98 320.0 2017-06-23 0 Girl Female Mid 3 3682560000%
3 Claimed NaN 12800693 NaN NaN 2017-06-23 0 Girl Female Mid 3 3682560000%
4 Sold EliteCat 0xcf6165 60.00 36305.0 2020-11-30 1 Guy Male Dark 2 2050240500%
In [8]:
df_punks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17554 entries, 0 to 17553
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Transaction  17554 non-null  object 
 1   From         7554 non-null   object 
 2   To           17554 non-null  object 
 3   Crypto       7554 non-null   float64
 4   USD          7554 non-null   float64
 5   Txn          17554 non-null  object 
 6   ID           17554 non-null  int64  
 7   Sex          17554 non-null  object 
 8   Type         17554 non-null  object 
 9   Skin         17554 non-null  object 
 10  Slots        17554 non-null  int64  
 11  Rank         17554 non-null  object 
dtypes: float64(2), int64(2), object(8)
memory usage: 1.6+ MB

Reading Excel Files¶

The pandas library can also handle reading excel files.

  1. In order to access an Excel file in Google colab, first go to this link and click "Download." Make note of where your downloaded file is stored: https://github.com/Omni-Analytics-Group/eth-data-science-course/blob/master/courses/Module%201/punks.xlsx

  2. Click on the file folder icon in the left hand sidebar on your Google Colab screen.

  3. Click on the "Upload" icon, then navigate to the folder where the punks.xlsx file you just downloaded is stored and click on the punks.xlsx file.

  4. When the punks.xlsx file shows up in the Files list, hover your mouse over it and click on the 3 vertical dots to the right of the filename, then click on "Copy path."

  5. Paste the copied path after the equal sign in the first line of the code below:

In [16]:
punks_excel_file = '/content/punks.xlsx' # post your copied path between the single quotation marks
df_punks_excel = pd.read_excel(punks_excel_file, 'Worksheet')
df_punks_excel.head()
Out[16]:
Transaction From To Crypto USD Txn ID Sex Type Skin Slots Rank
0 Sold 0xf5099e 14715954 25.00 2822.0 2018-11-30 0 Girl Female Mid 3 3682560000%
1 Sold 0x00d7c9 10528156 1.60 386.0 2017-07-07 0 Girl Female Mid 3 3682560000%
2 Sold 0xc352b5 55241 0.98 320.0 2017-06-23 0 Girl Female Mid 3 3682560000%
3 Claimed NaN 12800693 NaN NaN 2017-06-23 0 Girl Female Mid 3 3682560000%
4 Sold False 0xcf6165 60.00 36305.0 2020-11-30 1 Guy Male Dark 2 2050240500%

Your Turn¶

Using the crypto_art_pulse.csv file, do the following:

  1. Read the data into a dataframe called crypto_art_df

  2. Use the head() function to display some rows

  3. Use the info() function to see each variable type

Answers¶

1. Read the data into a dataframe called crypto_art_df¶

In [17]:
crypto_art_filename = 'https://raw.githubusercontent.com/Omni-Analytics-Group/eth-data-science-course/master/courses/Module%201/crypto_art_pulse.csv'
crypto_art_df = pd.read_csv(crypto_art_filename)

2. Use the head() function to display some rows¶

In [18]:
crypto_art_df.head()
Out[18]:
asset_token_id asset_contract_address asset_collection asset_name asset_artist asset_image asset_link asset_external_link asset_first_sale_price asset_last_sale_price asset_last_sale_owner_name asset_last_sale_owner_address asset_sale_grows asset_num_sales asset_last_sale_date
0 211201 0xfbeef911dc5821886e1dda71586d90ed28174b7d Known-origin Wi-Five / Cuban Impressions Adoption https://lh3.googleusercontent.com/RAcLH_YifV2a... https://opensea.io/assets/0xfbeef911dc5821886e... https://knownorigin.io/token/211201 800000000000000000 800000000000000000 mattiac 0x576a655161b5502dcf40602be1f3519a89b71658 0.0 1 2020-12-31 14:02:57
1 16773 0xb932a70a57673d89f4acffbe830e8ed7f75fb9e0 Superrare SUBTERRANEAN HOMESICK ALIEN (TRIBUTE) rubahitam https://lh3.googleusercontent.com/pKhMQQNVUt3J... https://opensea.io/assets/0xb932a70a57673d89f4... https://superrare.co/?origin=https%3A%2F%2Fope... 1500000000000000000 1500000000000000000 NaN 0x9dac76101b5ccda5970e60402de31a6649115b53 0.0 1 2020-12-31 13:35:20
2 16113 0xb932a70a57673d89f4acffbe830e8ed7f75fb9e0 Superrare Somewhere MrY https://lh3.googleusercontent.com/9ycKTkeIZTro... https://opensea.io/assets/0xb932a70a57673d89f4... https://superrare.co/?origin=https%3A%2F%2Fope... 500000000000000000 500000000000000000 endless_traveler 0x423a0a0f3479ab24b2b04c8dba8bbe0e2bbb5cc6 0.0 1 2020-12-31 13:19:51
3 17443 0xb932a70a57673d89f4acffbe830e8ed7f75fb9e0 Superrare TINY ROOMS TWO - IN DA CLUB 2020 EDITION MrY https://lh3.googleusercontent.com/WgD9n7W0B8Et... https://opensea.io/assets/0xb932a70a57673d89f4... https://superrare.co/?origin=https%3A%2F%2Fope... 500000000000000000 500000000000000000 endless_traveler 0x423a0a0f3479ab24b2b04c8dba8bbe0e2bbb5cc6 0.0 1 2020-12-31 13:19:51
4 38926 0x2a46f2ffd99e19a89476e2f62270e0a35bbf0756 Makersplace Boredoom wizmo https://lh3.googleusercontent.com/HY5jSHOtLn3g... https://opensea.io/assets/0x2a46f2ffd99e19a894... https://makersplace.com/product/token/0x2a46f2... 500000000000000000 500000000000000000 brandonkangfilms 0x720a4fab08cb746fc90e88d1924a98104c0822cf 0.0 1 2020-12-31 09:28:46

3. Use the info() function to see each variable type¶

In [19]:
crypto_art_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25336 entries, 0 to 25335
Data columns (total 15 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   asset_token_id                 25336 non-null  int64  
 1   asset_contract_address         25336 non-null  object 
 2   asset_collection               25336 non-null  object 
 3   asset_name                     25335 non-null  object 
 4   asset_artist                   25299 non-null  object 
 5   asset_image                    25336 non-null  object 
 6   asset_link                     25336 non-null  object 
 7   asset_external_link            25215 non-null  object 
 8   asset_first_sale_price         25336 non-null  object 
 9   asset_last_sale_price          25336 non-null  object 
 10  asset_last_sale_owner_name     1556 non-null   object 
 11  asset_last_sale_owner_address  25308 non-null  object 
 12  asset_sale_grows               25336 non-null  float64
 13  asset_num_sales                25336 non-null  int64  
 14  asset_last_sale_date           25336 non-null  object 
dtypes: float64(1), int64(2), object(12)
memory usage: 2.9+ MB

Thanks for sticking with us and learning about reading .csv and Excel files!¶