Since this notebook uses Python instead of R, we need to start with importing pandas, a Python library for data analysis. It is common practice to import pandas with the alias "pd"
import pandas as pd
First, we set the url where the dataset can be found as the "punks_url" variable.
Next, we use pandas' read_csv method to load the dataset.
punks_url = 'https://raw.githubusercontent.com/Omni-Analytics-Group/eth-data-science-course/master/courses/Module%201/punks.csv'
df_punks = pd.read_csv(punks_url, encoding='unicode_escape')
Use Python to look at the first few rows of data in the Punks data set. The head() function will allow you to look at the first 5 rows of the dataset. The tail() function will allow you to look at the last 5 rows of the dataset.
df_punks.head()
Transaction | From | To | Crypto | USD | Txn | ID | Sex | Type | Skin | Slots | Rank | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sold | 0xf5099e | 14715954 | 25.00 | 2822.0 | 2018-11-30 | 0 | Girl | Female | Mid | 3 | 3682560000% |
1 | Sold | 0x00d7c9 | 10528156 | 1.60 | 386.0 | 2017-07-07 | 0 | Girl | Female | Mid | 3 | 3682560000% |
2 | Sold | 0xc352b5 | 55241 | 0.98 | 320.0 | 2017-06-23 | 0 | Girl | Female | Mid | 3 | 3682560000% |
3 | Claimed | NaN | 12800693 | NaN | NaN | 2017-06-23 | 0 | Girl | Female | Mid | 3 | 3682560000% |
4 | Sold | EliteCat | 0xcf6165 | 60.00 | 36305.0 | 2020-11-30 | 1 | Guy | Male | Dark | 2 | 2050240500% |
df_punks.tail()
Transaction | From | To | Crypto | USD | Txn | ID | Sex | Type | Skin | Slots | Rank | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
17549 | Sold | TJ2010 | gaus | 59.0 | 21062.0 | 2020-09-27 | 9997 | Guy | Zombie | Zombie | 2 | 0.023188000% |
17550 | Claimed | NaN | TJ2010 | NaN | NaN | 2017-06-23 | 9997 | Guy | Zombie | Zombie | 2 | 0.023188000% |
17551 | Sold | 7595170 | TokenAng | 15.0 | 9499.0 | 2020-12-27 | 9998 | Girl | Female | Mid | 3 | 1452800000% |
17552 | Claimed | NaN | 0x73e4a2 | NaN | NaN | 2017-06-23 | 9998 | Girl | Female | Mid | 3 | 1452800000% |
17553 | Claimed | NaN | 8269084 | NaN | NaN | 2017-06-23 | 9999 | Girl | Female | Dark | 2 | 1752960000% |
The command .info() gives us information about the dataframe including number of rows and name, number, and data type for each column.
df_punks.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 17554 entries, 0 to 17553 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Transaction 17554 non-null object 1 From 7554 non-null object 2 To 17554 non-null object 3 Crypto 7554 non-null float64 4 USD 7554 non-null float64 5 Txn 17554 non-null object 6 ID 17554 non-null int64 7 Sex 17554 non-null object 8 Type 17554 non-null object 9 Skin 17554 non-null object 10 Slots 17554 non-null int64 11 Rank 17554 non-null object dtypes: float64(2), int64(2), object(8) memory usage: 1.6+ MB
Let's summarize the value for each variable in Punks by using .describe().
Note that stats such as mean, min, and max are shown only for columns containing numeric data.
Columns containing string, or text data are summarized with a count, count of unique items, the most common value (top) and the frequency at which the top value occurs.
df_punks.describe(include='all')
Transaction | From | To | Crypto | USD | Txn | ID | Sex | Type | Skin | Slots | Rank | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 17554 | 7554 | 17554 | 7554.000000 | 7554.000000 | 17554 | 17554.000000 | 17554 | 17554 | 17554 | 17554.000000 | 17554 |
unique | 2 | 682 | 1524 | NaN | NaN | 898 | NaN | 2 | 5 | 7 | NaN | 4007 |
top | Claimed | Hemba | 12800693 | NaN | NaN | 2017-06-23 | NaN | Guy | Male | Mid | NaN | 2213293500% |
freq | 10000 | 1058 | 990 | NaN | NaN | 10019 | NaN | 11379 | 11152 | 5307 | NaN | 45 |
mean | NaN | NaN | NaN | 2.389854 | 867.305033 | NaN | 5193.788481 | NaN | NaN | NaN | 2.778227 | NaN |
std | NaN | NaN | NaN | 6.118331 | 2833.265913 | NaN | 2757.690393 | NaN | NaN | NaN | 0.799278 | NaN |
min | NaN | NaN | NaN | 0.010000 | 0.010000 | NaN | 0.000000 | NaN | NaN | NaN | 0.000000 | NaN |
25% | NaN | NaN | NaN | 0.300000 | 72.000000 | NaN | 2906.000000 | NaN | NaN | NaN | 2.000000 | NaN |
50% | NaN | NaN | NaN | 1.000000 | 232.500000 | NaN | 5189.000000 | NaN | NaN | NaN | 3.000000 | NaN |
75% | NaN | NaN | NaN | 2.900000 | 1003.250000 | NaN | 7557.000000 | NaN | NaN | NaN | 3.000000 | NaN |
max | NaN | NaN | NaN | 189.990000 | 137522.000000 | NaN | 9999.000000 | NaN | NaN | NaN | 7.000000 | NaN |
Let's look at the relationship between Crypto and USD.
matplotlib and seaborn are both open source libraries for creating visualizations in python.
import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot(x=df_punks.Crypto, y=df_punks.USD).set(title='USD vs. Crypto (ETH)')
[Text(0.5, 1.0, 'USD vs. Crypto (ETH)')]
sns.scatterplot(x=df_punks.Crypto, y=df_punks.USD, hue = df_punks.Type).set(title='USD vs. Crypto (ETH)')
[Text(0.5, 1.0, 'USD vs. Crypto (ETH)')]
We will make a new variable in the Punks data set to account for the ratio between Crypto and USD that is Ratio of Currency = USD / Crypto
.
df_punks['Ratio of Currency'] = df_punks['USD']/df_punks['Crypto']
One way we can interpret this is 1 ETH is worth $328.00 (the mean at the time of writing).
df_punks['Ratio of Currency'].describe()
count 7554.000000 mean 327.962994 std 188.430207 min 1.000000 25% 202.857143 50% 326.887464 75% 367.200000 max 1409.090909 Name: Ratio of Currency, dtype: float64
We can look at all of the column names found in the dataset with this method.
df_punks.columns
Index(['Transaction', 'From', 'To', 'Crypto', 'USD', 'Txn', 'ID', 'Sex', 'Type', 'Skin', 'Slots', 'Rank', 'Ratio of Currency'], dtype='object')
We now plot a histogram of Crypto to see its distribution.
sns.histplot(x=df_punks.Crypto, binwidth=2.5).set(title='Histogram of Crypto') # binwidth is the length of each rectangular bar
[Text(0.5, 1.0, 'Histogram of Crypto')]
We can look at the most expensive punk that was last sold by using the .loc (location) command and the .idxmax (index for the max value) command.
df_punks.loc[df_punks.Crypto.idxmax()]
Transaction Sold From jmg To 0x7224a1 Crypto 189.99 USD 137522 Txn 2020-12-30 ID 3306 Sex Guy Type Male Skin Mid Slots 3 Rank 6437574000% Ratio of Currency 723.838 Name: 5234, dtype: object
df_punks.loc - locates the values in the dataframe meeting the criteria inside the brackets
df_punks.Sex == 'XX' - subsets the dataset with the rows that meet the critera
'Slots' - selects the Slots column from the dataset
.mean() - provides the mean
df_punks.loc[df_punks.Sex == 'Girl', 'Slots'].mean()
2.763076923076923
df_punks.loc[df_punks.Sex == 'Guy', 'Slots'].mean()
2.7864487213287634
We can label Alien, Ape and Zombie as non_human, and similarly we can label Female and Male as human.
This code works by creating new columns "Non_human" and "Human" by selecting the rows that contain "Alien", "Ape", "Zombie" and "Female", "Male" in the Type column.
Note the | operator is interpreted as 'or', so the non_human column assignment looks at the Type column in the Punks data and assigns a 'True' value if the Punk Type is Alien OR Ape OR Zombie. Similarly, the Punk is assigned a 'True' value in the Human column if the Type is Female OR Male.
df_punks['Non_human'] = (df_punks['Type'] == 'Alien') | (df_punks['Type'] == 'Ape') | (df_punks['Type'] == 'Zombie')
df_punks
Transaction | From | To | Crypto | USD | Txn | ID | Sex | Type | Skin | Slots | Rank | Ratio of Currency | Non_human | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sold | 0xf5099e | 14715954 | 25.00 | 2822.0 | 2018-11-30 | 0 | Girl | Female | Mid | 3 | 3682560000% | 112.880000 | False |
1 | Sold | 0x00d7c9 | 10528156 | 1.60 | 386.0 | 2017-07-07 | 0 | Girl | Female | Mid | 3 | 3682560000% | 241.250000 | False |
2 | Sold | 0xc352b5 | 55241 | 0.98 | 320.0 | 2017-06-23 | 0 | Girl | Female | Mid | 3 | 3682560000% | 326.530612 | False |
3 | Claimed | NaN | 12800693 | NaN | NaN | 2017-06-23 | 0 | Girl | Female | Mid | 3 | 3682560000% | NaN | False |
4 | Sold | EliteCat | 0xcf6165 | 60.00 | 36305.0 | 2020-11-30 | 1 | Guy | Male | Dark | 2 | 2050240500% | 605.083333 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
17549 | Sold | TJ2010 | gaus | 59.00 | 21062.0 | 2020-09-27 | 9997 | Guy | Zombie | Zombie | 2 | 0.023188000% | 356.983051 | True |
17550 | Claimed | NaN | TJ2010 | NaN | NaN | 2017-06-23 | 9997 | Guy | Zombie | Zombie | 2 | 0.023188000% | NaN | True |
17551 | Sold | 7595170 | TokenAng | 15.00 | 9499.0 | 2020-12-27 | 9998 | Girl | Female | Mid | 3 | 1452800000% | 633.266667 | False |
17552 | Claimed | NaN | 0x73e4a2 | NaN | NaN | 2017-06-23 | 9998 | Girl | Female | Mid | 3 | 1452800000% | NaN | False |
17553 | Claimed | NaN | 8269084 | NaN | NaN | 2017-06-23 | 9999 | Girl | Female | Dark | 2 | 1752960000% | NaN | False |
17554 rows × 14 columns
df_punks['Human'] = (df_punks['Type'] == 'Female')| (df_punks['Type'] == 'Male')
df_punks
Transaction | From | To | Crypto | USD | Txn | ID | Sex | Type | Skin | Slots | Rank | Ratio of Currency | Non_human | Human | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sold | 0xf5099e | 14715954 | 25.00 | 2822.0 | 2018-11-30 | 0 | Girl | Female | Mid | 3 | 3682560000% | 112.880000 | False | True |
1 | Sold | 0x00d7c9 | 10528156 | 1.60 | 386.0 | 2017-07-07 | 0 | Girl | Female | Mid | 3 | 3682560000% | 241.250000 | False | True |
2 | Sold | 0xc352b5 | 55241 | 0.98 | 320.0 | 2017-06-23 | 0 | Girl | Female | Mid | 3 | 3682560000% | 326.530612 | False | True |
3 | Claimed | NaN | 12800693 | NaN | NaN | 2017-06-23 | 0 | Girl | Female | Mid | 3 | 3682560000% | NaN | False | True |
4 | Sold | EliteCat | 0xcf6165 | 60.00 | 36305.0 | 2020-11-30 | 1 | Guy | Male | Dark | 2 | 2050240500% | 605.083333 | False | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
17549 | Sold | TJ2010 | gaus | 59.00 | 21062.0 | 2020-09-27 | 9997 | Guy | Zombie | Zombie | 2 | 0.023188000% | 356.983051 | True | False |
17550 | Claimed | NaN | TJ2010 | NaN | NaN | 2017-06-23 | 9997 | Guy | Zombie | Zombie | 2 | 0.023188000% | NaN | True | False |
17551 | Sold | 7595170 | TokenAng | 15.00 | 9499.0 | 2020-12-27 | 9998 | Girl | Female | Mid | 3 | 1452800000% | 633.266667 | False | True |
17552 | Claimed | NaN | 0x73e4a2 | NaN | NaN | 2017-06-23 | 9998 | Girl | Female | Mid | 3 | 1452800000% | NaN | False | True |
17553 | Claimed | NaN | 8269084 | NaN | NaN | 2017-06-23 | 9999 | Girl | Female | Dark | 2 | 1752960000% | NaN | False | True |
17554 rows × 15 columns
Finding the mean of each based on the newly created Types
df_punks.loc[df_punks['Non_human'] == True, 'Slots'].mean()
2.2246696035242293
df_punks.loc[df_punks['Human'] == True, 'Slots'].mean()
2.7854793097477923
We can see that human type punks have more slots than non_human type punks.
We could compare the slots for the different types of punks with a side by side boxplot.
sns.boxplot(x=df_punks.Type, y=df_punks.Slots).set(title='Box Plot of Slots by Types')
[Text(0.5, 1.0, 'Box Plot of Slots by Types')]
Try playing with chunks of code from this session to further investigate the Punks data:
df_punks['Crypto'].describe()
count 7554.000000 mean 2.389854 std 6.118331 min 0.010000 25% 0.300000 50% 1.000000 75% 2.900000 max 189.990000 Name: Crypto, dtype: float64
sns.boxplot(x=df_punks.Skin, y=df_punks.Crypto).set(title='Box Plot of Crypto by Skin')
[Text(0.5, 1.0, 'Box Plot of Crypto by Skin')]
df_punks['Pale'] = (df_punks['Skin'] == 'Alien') | (df_punks['Skin'] == 'Albino') | (df_punks['Skin'] == 'Light')
df_punks['Dark'] = (df_punks['Skin'] == 'Dark') | (df_punks['Skin'] == 'Mid')
df_punks.head()
Transaction | From | To | Crypto | USD | Txn | ID | Sex | Type | Skin | Slots | Rank | Ratio of Currency | Non_human | Human | Pale | Dark | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sold | 0xf5099e | 14715954 | 25.00 | 2822.0 | 2018-11-30 | 0 | Girl | Female | Mid | 3 | 3682560000% | 112.880000 | False | True | False | True |
1 | Sold | 0x00d7c9 | 10528156 | 1.60 | 386.0 | 2017-07-07 | 0 | Girl | Female | Mid | 3 | 3682560000% | 241.250000 | False | True | False | True |
2 | Sold | 0xc352b5 | 55241 | 0.98 | 320.0 | 2017-06-23 | 0 | Girl | Female | Mid | 3 | 3682560000% | 326.530612 | False | True | False | True |
3 | Claimed | NaN | 12800693 | NaN | NaN | 2017-06-23 | 0 | Girl | Female | Mid | 3 | 3682560000% | NaN | False | True | False | True |
4 | Sold | EliteCat | 0xcf6165 | 60.00 | 36305.0 | 2020-11-30 | 1 | Guy | Male | Dark | 2 | 2050240500% | 605.083333 | False | True | False | True |
df_punks.loc[df_punks['Pale'] == True, 'Slots'].mean()
2.748796498905908
df_punks.loc[df_punks['Dark'] == True, 'Slots'].mean()
2.80833412796796