We often need to import our own data rather than just using built-in datasets. The following steps are usually taken:
We load a file in by identifying where in our system the file is stored and then enclosing that filepath in the pd.read_csv parentheses. For this example, we will load the dataset from the github directory where it is stored for this course.
Common practice is to load dataframes (tabular data from a csv) with the name "df."
import pandas as pd
crypto_art_filename = 'https://raw.githubusercontent.com/Omni-Analytics-Group/eth-data-science-course/master/courses/Module%201/crypto_art_pulse.csv'
crypto_art_df = pd.read_csv(crypto_art_filename)
Excel has file formats such as xls and csv - so what are the differences? File extensions .xls or .xlsx are proprietary Excel formats, and they are binary files. .csv is the extension for comma separated value files. They are text files and are directly readable.
Comma separated or tabular separated files can both be loaded with pd.read_csv
This code is used for a comma separated file:
pd.read_csv(filename)
This code is used for a tab separated file:
pd.read_csv(filename, sep="\t")
The pd.read_csv also needs special instructions on how to handle some of the unicode characters in the data file, which is why "encoding = 'unicode_escape'" is included.
punks_file = 'https://raw.githubusercontent.com/Omni-Analytics-Group/eth-data-science-course/master/courses/Module%201/punks.csv'
df_punks = pd.read_csv(punks_file, encoding='unicode_escape')
df_punks.head()
Transaction | From | To | Crypto | USD | Txn | ID | Sex | Type | Skin | Slots | Rank | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sold | 0xf5099e | 14715954 | 25.00 | 2822.0 | 2018-11-30 | 0 | Girl | Female | Mid | 3 | 3682560000% |
1 | Sold | 0x00d7c9 | 10528156 | 1.60 | 386.0 | 2017-07-07 | 0 | Girl | Female | Mid | 3 | 3682560000% |
2 | Sold | 0xc352b5 | 55241 | 0.98 | 320.0 | 2017-06-23 | 0 | Girl | Female | Mid | 3 | 3682560000% |
3 | Claimed | NaN | 12800693 | NaN | NaN | 2017-06-23 | 0 | Girl | Female | Mid | 3 | 3682560000% |
4 | Sold | EliteCat | 0xcf6165 | 60.00 | 36305.0 | 2020-11-30 | 1 | Guy | Male | Dark | 2 | 2050240500% |
df_punks.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 17554 entries, 0 to 17553 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Transaction 17554 non-null object 1 From 7554 non-null object 2 To 17554 non-null object 3 Crypto 7554 non-null float64 4 USD 7554 non-null float64 5 Txn 17554 non-null object 6 ID 17554 non-null int64 7 Sex 17554 non-null object 8 Type 17554 non-null object 9 Skin 17554 non-null object 10 Slots 17554 non-null int64 11 Rank 17554 non-null object dtypes: float64(2), int64(2), object(8) memory usage: 1.6+ MB
The pandas library can also handle reading excel files.
In order to access an Excel file in Google colab, first go to this link and click "Download." Make note of where your downloaded file is stored: https://github.com/Omni-Analytics-Group/eth-data-science-course/blob/master/courses/Module%201/punks.xlsx
Click on the file folder icon in the left hand sidebar on your Google Colab screen.
Click on the "Upload" icon, then navigate to the folder where the punks.xlsx file you just downloaded is stored and click on the punks.xlsx file.
When the punks.xlsx file shows up in the Files list, hover your mouse over it and click on the 3 vertical dots to the right of the filename, then click on "Copy path."
Paste the copied path after the equal sign in the first line of the code below:
punks_excel_file = '/content/punks.xlsx' # post your copied path between the single quotation marks
df_punks_excel = pd.read_excel(punks_excel_file, 'Worksheet')
df_punks_excel.head()
Transaction | From | To | Crypto | USD | Txn | ID | Sex | Type | Skin | Slots | Rank | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sold | 0xf5099e | 14715954 | 25.00 | 2822.0 | 2018-11-30 | 0 | Girl | Female | Mid | 3 | 3682560000% |
1 | Sold | 0x00d7c9 | 10528156 | 1.60 | 386.0 | 2017-07-07 | 0 | Girl | Female | Mid | 3 | 3682560000% |
2 | Sold | 0xc352b5 | 55241 | 0.98 | 320.0 | 2017-06-23 | 0 | Girl | Female | Mid | 3 | 3682560000% |
3 | Claimed | NaN | 12800693 | NaN | NaN | 2017-06-23 | 0 | Girl | Female | Mid | 3 | 3682560000% |
4 | Sold | False | 0xcf6165 | 60.00 | 36305.0 | 2020-11-30 | 1 | Guy | Male | Dark | 2 | 2050240500% |
Using the crypto_art_pulse.csv file, do the following:
Read the data into a dataframe called crypto_art_df
Use the head() function to display some rows
Use the info() function to see each variable type
crypto_art_filename = 'https://raw.githubusercontent.com/Omni-Analytics-Group/eth-data-science-course/master/courses/Module%201/crypto_art_pulse.csv'
crypto_art_df = pd.read_csv(crypto_art_filename)
crypto_art_df.head()
asset_token_id | asset_contract_address | asset_collection | asset_name | asset_artist | asset_image | asset_link | asset_external_link | asset_first_sale_price | asset_last_sale_price | asset_last_sale_owner_name | asset_last_sale_owner_address | asset_sale_grows | asset_num_sales | asset_last_sale_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 211201 | 0xfbeef911dc5821886e1dda71586d90ed28174b7d | Known-origin | Wi-Five / Cuban Impressions | Adoption | https://lh3.googleusercontent.com/RAcLH_YifV2a... | https://opensea.io/assets/0xfbeef911dc5821886e... | https://knownorigin.io/token/211201 | 800000000000000000 | 800000000000000000 | mattiac | 0x576a655161b5502dcf40602be1f3519a89b71658 | 0.0 | 1 | 2020-12-31 14:02:57 |
1 | 16773 | 0xb932a70a57673d89f4acffbe830e8ed7f75fb9e0 | Superrare | SUBTERRANEAN HOMESICK ALIEN (TRIBUTE) | rubahitam | https://lh3.googleusercontent.com/pKhMQQNVUt3J... | https://opensea.io/assets/0xb932a70a57673d89f4... | https://superrare.co/?origin=https%3A%2F%2Fope... | 1500000000000000000 | 1500000000000000000 | NaN | 0x9dac76101b5ccda5970e60402de31a6649115b53 | 0.0 | 1 | 2020-12-31 13:35:20 |
2 | 16113 | 0xb932a70a57673d89f4acffbe830e8ed7f75fb9e0 | Superrare | Somewhere | MrY | https://lh3.googleusercontent.com/9ycKTkeIZTro... | https://opensea.io/assets/0xb932a70a57673d89f4... | https://superrare.co/?origin=https%3A%2F%2Fope... | 500000000000000000 | 500000000000000000 | endless_traveler | 0x423a0a0f3479ab24b2b04c8dba8bbe0e2bbb5cc6 | 0.0 | 1 | 2020-12-31 13:19:51 |
3 | 17443 | 0xb932a70a57673d89f4acffbe830e8ed7f75fb9e0 | Superrare | TINY ROOMS TWO - IN DA CLUB 2020 EDITION | MrY | https://lh3.googleusercontent.com/WgD9n7W0B8Et... | https://opensea.io/assets/0xb932a70a57673d89f4... | https://superrare.co/?origin=https%3A%2F%2Fope... | 500000000000000000 | 500000000000000000 | endless_traveler | 0x423a0a0f3479ab24b2b04c8dba8bbe0e2bbb5cc6 | 0.0 | 1 | 2020-12-31 13:19:51 |
4 | 38926 | 0x2a46f2ffd99e19a89476e2f62270e0a35bbf0756 | Makersplace | Boredoom | wizmo | https://lh3.googleusercontent.com/HY5jSHOtLn3g... | https://opensea.io/assets/0x2a46f2ffd99e19a894... | https://makersplace.com/product/token/0x2a46f2... | 500000000000000000 | 500000000000000000 | brandonkangfilms | 0x720a4fab08cb746fc90e88d1924a98104c0822cf | 0.0 | 1 | 2020-12-31 09:28:46 |
crypto_art_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 25336 entries, 0 to 25335 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 asset_token_id 25336 non-null int64 1 asset_contract_address 25336 non-null object 2 asset_collection 25336 non-null object 3 asset_name 25335 non-null object 4 asset_artist 25299 non-null object 5 asset_image 25336 non-null object 6 asset_link 25336 non-null object 7 asset_external_link 25215 non-null object 8 asset_first_sale_price 25336 non-null object 9 asset_last_sale_price 25336 non-null object 10 asset_last_sale_owner_name 1556 non-null object 11 asset_last_sale_owner_address 25308 non-null object 12 asset_sale_grows 25336 non-null float64 13 asset_num_sales 25336 non-null int64 14 asset_last_sale_date 25336 non-null object dtypes: float64(1), int64(2), object(12) memory usage: 2.9+ MB