class: center, middle, inverse, title-slide # Data Structures
### Omni Analytics Group --- ## Data Frames - Data frames are the work horse of R objects. - They are structured by rows and columns and can be indexed. - Each column is a specified variable type. - Column names can be used to index a variable. - Advice for naming variables, stated earlier in the course, applies to editing column names. - Can be specified by grouping vectors of equal length as columns. <br> <br> <p align="right"> <img src="images/Cut_outs/Cut_out_07.png" width="200px" height="150px"> </p> --- ## Data Frame Indexing - Elements indexed similar to a vector using `[` `]`. - `df[i,j]` will select the element in the `\(i^{th}\)` row and `\(j^{th}\)` column. - `df[ ,j]` will select the entire `\(j^{th}\)` column and treat it as a vector. - `df[i ,]` will select the entire `\(i^{th}\)` row and treat it as a vector. - Logical vectors can be used in place of i and j to subset the row and columns. -- ## Adding a New Variable to a Data Frame - Create a new vector that is the same length as other columns. - Append the new column to the data frame using the `$` operator. - The new data frame column will adopt the name of the vector. --- ## Data Frame Demo Loading the previously used Punks data set: ```r Punks <- read.csv("punks.csv") ``` Select the date column (6th column): ```r Punks[,6] ``` Select the date and skin columns (6th and 10th columns): ```r Punks[,c(6,10)] ``` --- ## Demo (Continued) Select the ID column with the `$` operator: ```r Punks$ID ``` We now determine the row location in our data where the ID is 9126. ```r punk_9126 <- Punks$ID == 9126 # Creates vector of T/F values if the entry is 9126. head(punk_9126) ``` ``` ## [1] FALSE FALSE FALSE FALSE FALSE FALSE ``` This output doesn't show much. It would be much easier if we could see which positions are `TRUE`! ```r which(punk_9126 == TRUE) # Tells row number where the ID is 9126 ``` ``` ## [1] 15965 15966 15967 15968 15969 ``` --- ## Demo (Continued) Displaying part of the Punks data set where the ID is 9126 by subsetting rows. ```r Punks[Punks$ID == 9126, ] #or Punks[punk_9126, ] ``` ``` ## Transaction From To Crypto USD Txn ID Sex Type ## 15965 Sold 0x81ed84 0x3e17fa 6.00 2111 2020-09-27 9126 Girl Female ## 15966 Sold 0xab96b9 0x81ed84 3.00 963 2020-07-28 9126 Girl Female ## 15967 Sold Pranksy 0xab96b9 1.99 454 2020-06-19 9126 Girl Female ## 15968 Sold 0x00d7c9 Pranksy 0.25 45 2019-09-02 9126 Girl Female ## 15969 Claimed <NA> 0x00d7c9 NA NA 2017-06-23 9126 Girl Female ## Skin Slots Rank ## 15965 Mid 4 3780480000% ## 15966 Mid 4 3780480000% ## 15967 Mid 4 3780480000% ## 15968 Mid 4 3780480000% ## 15969 Mid 4 3780480000% ``` --- ## Creating our own Data Frame Creating our own data frame using the `data.frame()` function: ```r mydf <- data.frame(NUMS = 1:5, LETS = letters[1:5], PUNK_TYPES = c("Male", "Female", "Alien", "Ape", "Zombie")) mydf ``` ``` ## NUMS LETS PUNK_TYPES ## 1 1 a Male ## 2 2 b Female ## 3 3 c Alien ## 4 4 d Ape ## 5 5 e Zombie ``` Note that in a data frame, each column has to have the same length! --- ## Renaming columns We can use the `names()` function to set that first column to lowercase: ```r names(mydf)[1] <- "nums" # Changes the names of the first column in mydf mydf ``` ``` ## nums LETS PUNK_TYPES ## 1 1 a Male ## 2 2 b Female ## 3 3 c Alien ## 4 4 d Ape ## 5 5 e Zombie ``` --- ## Renaming columns (Continued) We can also rename all the columns at once using the `colnames()` command. ```r colnames(mydf) <- c("numbers","letters","punk_types") # Changes all columns at once mydf ``` ``` ## numbers letters punk_types ## 1 1 a Male ## 2 2 b Female ## 3 3 c Alien ## 4 4 d Ape ## 5 5 e Zombie ``` --- ## Your Turn 1. Construct a data frame where column 1 contains any 5 Punks and column 2 is their Slot number. 2. Select only the rows where the Slot number is even. 3. Determine which rows of the Punks data set contains the Alien Type. <br> <br> <br> <p align="left"> <img src="images/Cut_outs/Cut_out_03.png" width="200px" height="150px"> </p> --- ## Answers ### 1. ```r mydf <- data.frame(Punk = c(0,1,2,3,4), Slot = c(3,2,1,3,4) ) mydf ``` ``` ## Punk Slot ## 1 0 3 ## 2 1 2 ## 3 2 1 ## 4 3 3 ## 5 4 4 ``` --- ### 2. ```r mydf[c(2,5),] ``` ``` ## Punk Slot ## 2 1 2 ## 5 4 4 ``` -- ### 3. ```r alien <- Punks$Type == "Alien" which(alien == TRUE) ``` ``` ## [1] 698 4358 4802 5536 5537 9983 9984 10121 10122 10457 10458 13103 ## [13] 13104 13595 13596 ``` --- ## Lists - Lists are a structured collection of R objects. - R objects in a list do not need to be the same type. - Create lists using the `list` function. - Lists are indexed using double square brackets `[[ ]]` to select an object. --- ## List Example Creating a list containing a matrix of size 2 by 5, and a vector of length 5, and a string: ```r mylist <- list(matrix(letters[1:10], nrow = 2, ncol = 5), c("Alien", "Ape", "Female", "Male", "Zombie"), "CryptoPunks are awesome!") mylist ``` ``` ## [[1]] ## [,1] [,2] [,3] [,4] [,5] ## [1,] "a" "c" "e" "g" "i" ## [2,] "b" "d" "f" "h" "j" ## ## [[2]] ## [1] "Alien" "Ape" "Female" "Male" "Zombie" ## ## [[3]] ## [1] "CryptoPunks are awesome!" ``` --- ## List Example (Continued) Note that unlike data frames, lists can contain elements of varying sizes and structures. Use indexing to select the second list element: ```r mylist[[3]] # Selections third argument in mylist ``` ``` ## [1] "CryptoPunks are awesome!" ``` --- ## Your Turn 1. Create a list containing `mydf` as well as a vector of length 7 containing different Punk Skins. 2. Use indexing to select mydf from your list. <br> <br> <br> <br> <p align="right"> <img src="images/Cut_outs/Cut_out_05.png" width="200px" height="150px"> </p> --- ## Answers ### 1. ```r mylist <- list(mydf, c("Albino","Alien","Ape","Dark","Light", "Mid", "Zombie")) ``` -- ### 2. ```r mylist[[1]] ``` ``` ## Punk Slot ## 1 0 3 ## 2 1 2 ## 3 2 1 ## 4 3 3 ## 5 4 4 ``` --- ## Examining Objects - `head(x)` - View top 6 rows of a data frame - `tail(x)` - View bottom 6 rows of a data frame - `summary(x)` - Summary statistics - `str(x)` - View structure of object - `dim(x)` - View dimensions of object - `length(x)` - Returns the length of a vector --- ## Examining Objects Demo We can examine the first two values of an object by passing the `n` parameter to the `head()` function: ```r head(Punks, n = 2) # n = 2 displays only the first two rows. ``` ``` ## Transaction From To Crypto USD Txn ID Sex Type Skin ## 1 Sold 0xf5099e 14715954 25.0 2822 2018-11-30 0 Girl Female Mid ## 2 Sold 0x00d7c9 10528156 1.6 386 2017-07-07 0 Girl Female Mid ## Slots Rank ## 1 3 3682560000% ## 2 3 3682560000% ``` --- ## Examining Objects Demo (Continued) What is Punks structure? ```r str(Punks) ``` ``` ## 'data.frame': 17554 obs. of 12 variables: ## $ Transaction: chr "Sold" "Sold" "Sold" "Claimed" ... ## $ From : chr "0xf5099e" "0x00d7c9" "0xc352b5" NA ... ## $ To : chr "14715954" "10528156" "55241" "12800693" ... ## $ Crypto : num 25 1.6 0.98 NA 60 31 0.42 NA NA NA ... ## $ USD : num 2822 386 320 NA 36305 ... ## $ Txn : chr "2018-11-30" "2017-07-07" "2017-06-23" "2017-06-23" ... ## $ ID : int 0 0 0 0 1 1 1 1 2 3 ... ## $ Sex : chr "Girl" "Girl" "Girl" "Girl" ... ## $ Type : chr "Female" "Female" "Female" "Female" ... ## $ Skin : chr "Mid" "Mid" "Mid" "Mid" ... ## $ Slots : int 3 3 3 3 2 2 2 2 1 3 ... ## $ Rank : chr "3682560000%" "3682560000%" "3682560000%" "3682560000%" ... ``` --- ## Your Turn 1. View the last 4 rows of the Punks data set. 2. How many rows are in Punks data set? (try finding this using dim or indexing + length). <br> <br> <br> <p align="left"> <img src="images/Cut_outs/Cut_out_15.png" width="200px" height="200px"> </p> --- ## Answers ### 1. ```r tail(Punks, n=4) ``` ``` ## Transaction From To Crypto USD Txn ID Sex Type ## 17551 Claimed <NA> TJ2010 NA NA 2017-06-23 9997 Guy Zombie ## 17552 Sold 7595170 TokenAng… 15 9499 2020-12-27 9998 Girl Female ## 17553 Claimed <NA> 0x73e4a2 NA NA 2017-06-23 9998 Girl Female ## 17554 Claimed <NA> 8269084 NA NA 2017-06-23 9999 Girl Female ## Skin Slots Rank ## 17551 Zombie 2 0.023188000% ## 17552 Mid 3 1452800000% ## 17553 Mid 3 1452800000% ## 17554 Dark 2 1752960000% ``` --- ### 2. ```r dim(Punks) ``` ``` ## [1] 17554 12 ``` ```r dim(Punks)[1] #Picks first output element ``` ``` ## [1] 17554 ``` --- ## Working with Output from a Function - You can save the output from a function as an object. - An object is generally a list of output objects. - You can extract items from the output for further computing. - Examine objects using functions like `str()`. -- ## Saving Output Demo - Apply a t-test using the Punks data set to see if the Sale Price in Crypto for punks with 1 slot and 7 slots are statistically different - `t.test()` can only handle two groups, so we use a subset of the other slots. --- ## Demo (Continued) Save the output of the t-test to an object: ```r tout <- t.test(Crypto ~ Slots, data = Punks[Punks$Slots %in% c(1,7), ]) tout ``` ``` ## ## Welch Two Sample t-test ## ## data: Crypto by Slots ## t = -1.3373, df = 2.0036, p-value = 0.3127 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -139.46701 73.24138 ## sample estimates: ## mean in group 1 mean in group 7 ## 4.520519 37.633333 ``` An interpretation of this is that there is not a statistical difference in the average sale price in Crypto between the 1 and 7 slot Punks. --- Let's look at the structure of this object: ```r str(tout) ``` ``` ## List of 10 ## $ statistic : Named num -1.34 ## ..- attr(*, "names")= chr "t" ## $ parameter : Named num 2 ## ..- attr(*, "names")= chr "df" ## $ p.value : num 0.313 ## $ conf.int : num [1:2] -139.5 73.2 ## ..- attr(*, "conf.level")= num 0.95 ## $ estimate : Named num [1:2] 4.52 37.63 ## ..- attr(*, "names")= chr [1:2] "mean in group 1" "mean in group 7" ## $ null.value : Named num 0 ## ..- attr(*, "names")= chr "difference in means" ## $ stderr : num 24.8 ## $ alternative: chr "two.sided" ## $ method : chr "Welch Two Sample t-test" ## $ data.name : chr "Crypto by Slots" ## - attr(*, "class")= chr "htest" ``` --- ## Demo: Extracting the P-Value Since this is simply a list, we can use our regular indexing: ```r tout$p.value ``` ``` ## [1] 0.3127299 ``` ```r tout[[3]] ``` ``` ## [1] 0.3127299 ``` --- ## Your Turn 1. Return the p-value from t.test comparing the difference between Sale Price in USD from the 2 and 4 Slots. 2. What does this p-value imply? <br> <br> <br> <br> <p align="center"> <img src="images/Cut_outs/Cut_out_09.png" width="200px" height="200px"> </p> --- ## Answer ### 1. ```r tout <- t.test(USD ~ Slots, data = Punks[Punks$Slots %in% c(2,4), ]) tout ``` ``` ## ## Welch Two Sample t-test ## ## data: USD by Slots ## t = 0.54773, df = 2123.8, p-value = 0.5839 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -141.4595 251.1010 ## sample estimates: ## mean in group 2 mean in group 4 ## 903.6992 848.8785 ``` --- ### 2. Since p = 0.5839 > 0.05, there is no significant difference in the means of the two groups. From this we can claim that the Punks with 2 slots and 4 slots were sold at similar price.