class: center, middle, inverse, title-slide # Motivating Example
### Omni Analytics Group --- ## Motivating Example - Explore a real data set using R. - The goal is to have a good understanding of using R for data management and exploration. - Don't worry about understanding all of the coding right away. - We will go back and explain how it all works in detail. ## CryptoPunks The CryptoPunks are 10,000 uniquely generated characters. Each of them is officially owned by a single person on the Ethereum blockchain. Initially, they were all claimed for free. Now, you can buy, bid on, and offer punks for sale via a marketplace that is also embedded in the blockchain. <p align="center"> <img src="images/punk-image.png" width="800px" height="200px"> </p> --- ## CryptoPunks Dataset - Shows sales of CryptoPunks since June 23rd, 2017 to December 30th, 2020. - The dataset includes the following: * Details of sales * Details of the Punk * Rarity Rank of the Punk Note that the data may not be 100% accurate! #### Let's get started! -- ## Load the Data First, we load Punks using `read.csv()`: ```r Punks <- read.csv("punks.csv") ``` --- ## First look at data in R Let's use R to look at the top few rows of the Punks data set. The `head()` function allows you to look at first 6 rows of the data and `tail()` allows you to look at last 6 rows. ```r head(Punks) ``` ``` ## Transaction From To Crypto USD Txn ID Sex Type Skin ## 1 Sold 0xf5099e 14715954 25.00 2822 2018-11-30 0 Girl Female Mid ## 2 Sold 0x00d7c9 10528156 1.60 386 2017-07-07 0 Girl Female Mid ## 3 Sold 0xc352b5 55241 0.98 320 2017-06-23 0 Girl Female Mid ## 4 Claimed <NA> 12800693 NA NA 2017-06-23 0 Girl Female Mid ## 5 Sold EliteCat… 0xcf6165 60.00 36305 2020-11-30 1 Guy Male Dark ## 6 Sold 0xf5099e GoWest23 31.00 5155 2019-04-06 1 Guy Male Dark ## Slots Rank ## 1 3 3682560000% ## 2 3 3682560000% ## 3 3 3682560000% ## 4 3 3682560000% ## 5 2 2050240500% ## 6 2 2050240500% ``` --- ## CryptoPunks Data Attributes The command `str()`, short for structure, gives us a summary of each variable along with the size of the "data frame". ```r str(Punks) ``` ``` ## 'data.frame': 17554 obs. of 12 variables: ## $ Transaction: chr "Sold" "Sold" "Sold" "Claimed" ... ## $ From : chr "0xf5099e" "0x00d7c9" "0xc352b5" NA ... ## $ To : chr "14715954" "10528156" "55241" "12800693" ... ## $ Crypto : num 25 1.6 0.98 NA 60 31 0.42 NA NA NA ... ## $ USD : num 2822 386 320 NA 36305 ... ## $ Txn : chr "2018-11-30" "2017-07-07" "2017-06-23" "2017-06-23" ... ## $ ID : int 0 0 0 0 1 1 1 1 2 3 ... ## $ Sex : chr "Girl" "Girl" "Girl" "Girl" ... ## $ Type : chr "Female" "Female" "Female" "Female" ... ## $ Skin : chr "Mid" "Mid" "Mid" "Mid" ... ## $ Slots : int 3 3 3 3 2 2 2 2 1 3 ... ## $ Rank : chr "3682560000%" "3682560000%" "3682560000%" "3682560000%" ... ``` The Punks data frame has 17554 observations (rows) and 12 variables (columns). --- ## CryptoPunks Variables Summary Let's summarize the values for each variable in Punks with the `summary()` command. With this command we immediately have summary statistics of each variable. ```r summary(Punks) ``` <br> <br> <br> <br> <br> <p align="center"> <img src="images/Cut_outs/Cut_out_16.png" width="200px" height="150px"> </p> --- ``` ## Transaction From To Crypto ## Length:17554 Length:17554 Length:17554 Min. : 0.01 ## Class :character Class :character Class :character 1st Qu.: 0.30 ## Mode :character Mode :character Mode :character Median : 1.00 ## Mean : 2.39 ## 3rd Qu.: 2.90 ## Max. :189.99 ## NA's :10000 ## USD Txn ID Sex ## Min. : 0.01 Length:17554 Min. : 0 Length:17554 ## 1st Qu.: 72.00 Class :character 1st Qu.:2906 Class :character ## Median : 232.50 Mode :character Median :5189 Mode :character ## Mean : 867.31 Mean :5194 ## 3rd Qu.: 1003.25 3rd Qu.:7557 ## Max. :137522.00 Max. :9999 ## NA's :10000 ## Type Skin Slots Rank ## Length:17554 Length:17554 Min. :0.000 Length:17554 ## Class :character Class :character 1st Qu.:2.000 Class :character ## Mode :character Mode :character Median :3.000 Mode :character ## Mean :2.778 ## 3rd Qu.:3.000 ## Max. :7.000 ## ``` --- ## Scatterplots Let's look at the relationship between Crypto and USD. First, we need to install and load ggplot2, a special package for plotting and visualizations. ```r install.packages("ggplot2") ``` ```r library(ggplot2) ``` Using the `qplot()` command we can create a simple scatter plot. Do not let the `qplot()` command or any of the parameters confuse you. We will discuss this function in detail in later lessons. --- ## Scatter Plot ```r qplot(Crypto, USD, geom="point", data = Punks, main = "USD vs. Crypto (ETH)") ``` <img src="1-1-Motivating-Example-Crypto-Punk_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> --- ## More Scatterplots ```r qplot(Crypto, USD, geom="point", data = Punks, main = "USD vs. Crypto (ETH)", colour = Type) ``` <img src="1-1-Motivating-Example-Crypto-Punk_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> --- ## Even More Scatterplots ```r qplot(Crypto, USD, geom="point", data = Punks, main = "USD vs. Crypto (ETH) with Regression Line")+ geom_smooth(method = "lm") ``` <img src="1-1-Motivating-Example-Crypto-Punk_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- ## Creating A New Variable We will make a new variable in the Punks data set to account for ratio between Crypto and USD that is, `Ratio of Currency = USD / Crypto`. ```r # Creates new column Ratio.of.Currency in Punks data Punks$Ratio.of.Currency <- Punks$USD / Punks$Crypto ``` Notice that we had to place periods between words. R does not allow users to separate phrases with spaces. <p align="right"> <img src="images/Cut_outs/Cut_out_06.png" width="200px" height="150px"> </p> --- ## Summary ```r summary(Punks$Ratio.of.Currency) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 1.0 202.9 326.9 328.0 367.2 1409.1 10000 ``` One way we can interpret this as 1 ETH is worth $328.00 (at the time of writing). ## Column Names ```r names(Punks) # Provides the name of each column ``` ``` ## [1] "Transaction" "From" "To" ## [4] "Crypto" "USD" "Txn" ## [7] "ID" "Sex" "Type" ## [10] "Skin" "Slots" "Rank" ## [13] "Ratio.of.Currency" ``` --- ## Crypto Histogram We now plot a histogram of Crypto to see its distribution. ```r qplot(Crypto, data = Punks, binwidth = 2.5, main = "Histogram of Crypto") # binwidth is the length of each rectangular bar ``` <img src="1-1-Motivating-Example-Crypto-Punk_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> --- ## Most expensive Punk that was last sold in the sample data We can look at the most expensive Punk that was last sold using `which.max()`. ```r Punks[which.max(Punks$Crypto),] ``` ``` ## Transaction From To Crypto USD Txn ID Sex Type Skin ## 5235 Sold jmg 0x7224a1 189.99 137522 2020-12-30 3306 Guy Male Mid ## Slots Rank Ratio.of.Currency ## 5235 3 6437574000% 723.8381 ``` Thus, the most expensive Punk is <b> Punk 3306 </b> sold at 189.99 ETH on December 30, 2020. -- ## Wait a minute! -- Did you fact check this by visiting the official CryptoPunks website? Did you notice something strange when you looked up <b> Punk 3306 </b>? -- It turns out that it's <b> Punk 3307 </b> that sold at 189.99 ETH on December 30, 2020. This is a common 'off by one' mistake made in the data science world and it is something we need to be aware of as we learn more about the data. --- ## Average Slots By Sex Looking at the average number of slots, we noticed that <b> Girl </b> and <b> Guy </b> punks have a similar number of slots. ```r mean((Punks$Slots)[Punks$Sex == "Girl"]) ``` ``` ## [1] 2.763077 ``` ```r mean((Punks$Slots)[Punks$Sex == "Guy"]) ``` ``` ## [1] 2.786449 ``` As a breakdown: - `mean()` - provides the mean of argument - `Punks$Slots` - selects the Slots column in the Punks dataset - `[Punks$Sex == "XX"]` - subsets the column with the chosen Sex XX --- ## Combine Types We can label Alien, Ape and Zombie as non_human, and similarly we can label Female and Male as human. This code works by creating new columns **non_human** and **human** by selecting the rows that contain `c("Alien", "Ape", "Zombie")` and `c("Female", "Male")` in the Position column, respectfully ```r non_human <- (Punks$Type %in% c("Alien", "Ape", "Zombie")) head(non_human) ``` ``` ## [1] FALSE FALSE FALSE FALSE FALSE FALSE ``` ```r human <- (Punks$Type %in% c("Female", "Male")) head(human) ``` ``` ## [1] TRUE TRUE TRUE TRUE TRUE TRUE ``` --- ## Average Slots by Types Finding the mean of each based on the newly created Types ```r mean(Punks$Slots[non_human]) ``` ``` ## [1] 2.22467 ``` ```r mean(Punks$Slots[human]) ``` ``` ## [1] 2.785479 ``` We can see that, **human** type punks have more slots than **non_human** type punks. --- ## Box Plots We could compare the slots for the different types of punks with a side by side boxplot. ```r qplot(Type, Slots, geom = "boxplot", data = Punks, main = "Box Plot of Slots by Types") ``` <img src="1-1-Motivating-Example-Crypto-Punk_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> --- ## Your Turn Try playing with chunks of code from this session to further investigate the Punks data: 1. Summarize the Crypto values. 2. Make a boxplot comparing Crypto for different Skins. 3. Find the average slots for the pale (Alien, Albino, Light) and dark (Dark, Mid). <br> <br> <p align="left"> <img src="images/Cut_outs/Cut_out_01.png" width="200px" height="150px"> </p> --- ## Answers ### 1 ```r summary(Punks$Crypto) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 0.01 0.30 1.00 2.39 2.90 189.99 10000 ``` --- ### 2 ```r qplot(Skin, Crypto, geom = "boxplot", data = Punks, main = "Box Plot of Crypto by Skins") ``` <img src="1-1-Motivating-Example-Crypto-Punk_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> --- ### 3 ```r pale <- (Punks$Skin %in% c("Alien", "Albino", "Light")) dark <- (Punks$Skin %in% c("Dark", "Mid")) mean(Punks$Slots[pale]) ``` ``` ## [1] 2.748796 ``` ```r mean(Punks$Slots[dark]) ``` ``` ## [1] 2.808334 ```