class: center, middle, inverse, title-slide .title[ # Introduction to Machine Learning
] .subtitle[ ## Workflow and Data Exploration
] .author[ ### Omni Analytics Group ] --- ## Outline: - Framing ML Projects with the DDEMA Process - Definition - D - Problem statement and objective - Data - D - Integrity checks and cleaning - Exploration - E - Answering questions via data wrangling, and initial hypotheses - Action - A - Being Productive --- ## The DDEMA Process Value extraction from a data science project doesn't come just from a few printed out reports, but during the entire process where the problem is defined, structured, the data is procured, analyzed and insights are put into action. This module on supervised and unsupervised learning will help you to apply these techniques to data projects and expose you to a few interesting crypto concepts in the process. <p align="center"> <img src="images/ddema.png" height="160px" caption="See Appendix"> </p> - Definition - define the problem well to create mathematical objective - Data - collection process, completeness, - Exploration - data integrity, univariate and multivariate analysis - Modeling - map the business problem to an appropriate statistical technique - Action - bring model and results into action --- ## Definition Let's imagine for a moment the marketing team of a fictional crypto project has approached us to help them understand Ethereum holders. They would like us to find the characteristics of Ethereum holders that differentiate them from non-holders. They were able to collect responses from some loyal Ethereum enthusiasts, and would like to be able to predict whether someone is likely to own Ethereum given their survey answers so that they can target them with rewards and other adoption incentivization methods. In this module, our objectives are to - - Find distinguishing characteristics of Ethereum owners: We will find this using statistical analysis and unsupervised machine learning techniques. - Predict whether a new individual is an Ethereum owner or not: We will find this using a number of supervised learning techniques. --- ## Tidyverse metapackage As our main tool, we will use the tidyverse set of packages for data reading, manipulations and visualizations. https://www.tidyverse.org/packages/ The following packages will be used in this analysis: - readr: for data reading - dplyr: for creating, modifying, deleting, grouping and summarising data - ggplot2: for data visualisation Let us first install it by running ```r install.packages("tidyverse") ``` --- ## Data Our dataset is a survey response dataset containing certain demographic details of adults and an indicator to whether or not they own Ethereum. Let's - load **tidyverse** library for data analysis. - read in data and look at the dimensions of the data ```r library(tidyverse) data_raw <- read_csv("data/data_raw.csv") dim(data_raw) ``` ``` ## [1] 48842 13 ``` There are 48842 observations, or adults, and 13 characteristics collected for each adult --- Looking at the first few rows ```r head(data_raw) ``` ``` ## # A tibble: 6 × 13 ## age workclass education_num marital_status occupation relationship race ## <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> ## 1 39 State-gov 13 Never-married Adm-cleri… Not-in-fami… White ## 2 50 Self-emp-not… 13 Married-civ-s… Exec-mana… Husband White ## 3 38 Private 9 Divorced Handlers-… Not-in-fami… White ## 4 53 Private 7 Married-civ-s… Handlers-… Husband Black ## 5 28 Private 13 Married-civ-s… Prof-spec… Wife Black ## 6 37 Private 14 Married-civ-s… Exec-mana… Wife White ## # … with 6 more variables: sex <chr>, capital_gain <dbl>, capital_loss <dbl>, ## # hours_per_week <dbl>, owns_ethereum <lgl>, donated <lgl> ``` Notice that the columns containing text are shown as `<chr>` (character) variables. We will convert these to factors so that we can more easily create tables and graphs. --- ### Assigning data types The columns which have categorical data are not represented well by the 'character' data type. In R we use the 'factor' data structure to represent these columns correctly. ```r data <- data_raw %>% mutate(across(where(is.character), as_factor)) ``` Let's now look at the summary statistics ```r summary(data) ``` --- ``` ## age workclass education_num ## Min. :17.00 Private :33906 Min. : 1.00 ## 1st Qu.:28.00 Self-emp-not-inc: 3862 1st Qu.: 9.00 ## Median :37.00 Local-gov : 3136 Median :10.00 ## Mean :38.64 State-gov : 1981 Mean :10.08 ## 3rd Qu.:48.00 Self-emp-inc : 1695 3rd Qu.:12.00 ## Max. :90.00 (Other) : 1463 Max. :16.00 ## NA's : 2799 ## marital_status occupation relationship ## Never-married :16117 Prof-specialty : 6172 Not-in-family :12583 ## Married-civ-spouse :22379 Craft-repair : 6112 Husband :19716 ## Divorced : 6633 Exec-managerial: 6086 Wife : 2331 ## Married-spouse-absent: 628 Adm-clerical : 5611 Own-child : 7581 ## Separated : 1530 Sales : 5504 Unmarried : 5125 ## Married-AF-spouse : 37 (Other) :16548 Other-relative: 1506 ## Widowed : 1518 NA's : 2809 ## race sex capital_gain capital_loss ## White :41762 Male :32650 Min. : 0 Min. : 0.0 ## Black : 4685 Female:16192 1st Qu.: 0 1st Qu.: 0.0 ## Asian-Pac-Islander: 1519 Median : 0 Median : 0.0 ## Amer-Indian-Eskimo: 470 Mean : 1079 Mean : 87.5 ## Other : 406 3rd Qu.: 0 3rd Qu.: 0.0 ## Max. :99999 Max. :4356.0 ## ## hours_per_week owns_ethereum donated ## Min. : 1.00 Mode :logical Mode :logical ## 1st Qu.:40.00 FALSE:32562 FALSE:29501 ## Median :40.00 TRUE :16280 TRUE :19341 ## Mean :40.42 ## 3rd Qu.:45.00 ## Max. :99.00 ## ``` Notice max values of '99999' in *capital_gain* and '99' in *hours_per_week* --- ### Data Integrity When we look at the summary statistics for *capital_gain* and *hours_per_week*, we notice max values look suspicious. We can see this clearly using a boxplot. ```r ggplot(data, aes(y = capital_gain)) + geom_boxplot() ``` <img src="3-1-Intro_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> --- ### Missing Values On cross checking this with the appropriate data collection team they tell us that 9999 in capital_gain and 99 in hours_per_week were encoded to denote missing values. Let's correct this discrepancy in the data. ```r data <- data %>% mutate(capital_gain = ifelse(capital_gain == 99999, NA, capital_gain)) %>% mutate(hours_per_week = ifelse(hours_per_week == 99, NA, hours_per_week)) ``` --- ## Summary using skimr `skimr` is a useful package to view summary statistics, distributions and missing values ```r library(skimr) data %>% skim() ``` --- Table: Data summary | | | |:------------------------|:----------| |Name |Piped data | |Number of rows |48842 | |Number of columns |13 | |_______________________ | | |Column type frequency: | | |factor |6 | |logical |2 | |numeric |5 | |________________________ | | |Group variables |None | **Variable type: factor** |skim_variable | n_missing| complete_rate|ordered | n_unique|top_counts | |:--------------|---------:|-------------:|:-------|--------:|:--------------------------------------------| |workclass | 2799| 0.94|FALSE | 8|Pri: 33906, Sel: 3862, Loc: 3136, Sta: 1981 | |marital_status | 0| 1.00|FALSE | 7|Mar: 22379, Nev: 16117, Div: 6633, Sep: 1530 | |occupation | 2809| 0.94|FALSE | 14|Pro: 6172, Cra: 6112, Exe: 6086, Adm: 5611 | |relationship | 0| 1.00|FALSE | 6|Hus: 19716, Not: 12583, Own: 7581, Unm: 5125 | |race | 0| 1.00|FALSE | 5|Whi: 41762, Bla: 4685, Asi: 1519, Ame: 470 | |sex | 0| 1.00|FALSE | 2|Mal: 32650, Fem: 16192 | **Variable type: logical** |skim_variable | n_missing| complete_rate| mean|count | |:-------------|---------:|-------------:|----:|:----------------------| |owns_ethereum | 0| 1| 0.33|FAL: 32562, TRU: 16280 | |donated | 0| 1| 0.40|FAL: 29501, TRU: 19341 | **Variable type: numeric** |skim_variable | n_missing| complete_rate| mean| sd| p0| p25| p50| p75| p100|hist | |:--------------|---------:|-------------:|------:|-------:|--:|---:|---:|---:|-----:|:-----| |age | 0| 1| 38.64| 13.71| 17| 28| 37| 48| 90|▇▇▅▂▁ | |education_num | 0| 1| 10.08| 2.57| 1| 9| 10| 12| 16|▁▁▇▃▁ | |capital_gain | 244| 1| 582.41| 2536.65| 0| 0| 0| 0| 41310|▇▁▁▁▁ | |capital_loss | 0| 1| 87.50| 403.00| 0| 0| 0| 0| 4356|▇▁▁▁▁ | |hours_per_week | 137| 1| 40.26| 12.01| 1| 40| 40| 45| 98|▁▂▇▁▁ | --- ## Exploration After the initial summary statistics and integrity checks on the columns, we can look at individual characteristics and see how they vary between the Ethereum owner vs non owner groups. ### Univariate analysis What is the proportion of owners vs non owners? ```r data %>% group_by(owns_ethereum) %>% summarise(n = n(), nperc = n/nrow(data)*100) ``` ``` ## # A tibble: 2 × 3 ## owns_ethereum n nperc ## <lgl> <int> <dbl> ## 1 FALSE 32562 66.7 ## 2 TRUE 16280 33.3 ``` --- For every 2 non-Ethereum holders, there is a single Ethereum holder. ```r ggplot(data, aes(x = owns_ethereum)) + geom_bar() ``` <img src="3-1-Intro_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> --- Distribution of Respondent Age ```r ggplot(data, aes(x = age)) + geom_histogram() ``` <img src="3-1-Intro_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> --- What is the percentage of Males vs Females in the data? ```r data %>% group_by(sex) %>% summarise(n = n(), nperc = n/nrow(data)*100) ``` ``` ## # A tibble: 2 × 3 ## sex n nperc ## <fct> <int> <dbl> ## 1 Male 32650 66.8 ## 2 Female 16192 33.2 ``` --- ## Exploration - Multivariate Analysis Correlations ```r cor(data %>% select(where(is.numeric), donated, owns_ethereum) , use = "complete.obs") ``` ``` ## age education_num capital_gain capital_loss ## age 1.00000000 0.02781145 0.11393840 0.05814910 ## education_num 0.02781145 1.00000000 0.14492726 0.08265118 ## capital_gain 0.11393840 0.14492726 1.00000000 -0.04997409 ## capital_loss 0.05814910 0.08265118 -0.04997409 1.00000000 ## hours_per_week 0.06692050 0.14313917 0.08546265 0.05783405 ## donated 0.14561433 0.27445697 -0.04772665 -0.17546048 ## owns_ethereum -0.24821172 0.34212081 -0.04179162 0.14265086 ## hours_per_week donated owns_ethereum ## age 0.06692050 0.14561433 -0.24821172 ## education_num 0.14313917 0.27445697 0.34212081 ## capital_gain 0.08546265 -0.04772665 -0.04179162 ## capital_loss 0.05783405 -0.17546048 0.14265086 ## hours_per_week 1.00000000 -0.39189907 -0.04060555 ## donated -0.39189907 1.00000000 0.48975608 ## owns_ethereum -0.04060555 0.48975608 1.00000000 ``` From the numeric variables - Age, Education and Donations are correlated to *owns_ethereum* --- We can add in the owns_ethereum variable to understand distributional differences. ```r ggplot(data, aes(x = age, fill = owns_ethereum)) + geom_histogram(alpha = 0.5) ``` <img src="3-1-Intro_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> Ethereum owners seem to be younger. Try using `geom_density()` to make a neater plot! --- What is the percentage of men who are Ethereum owners? ```r data %>% filter(sex == "Male") %>% group_by(owns_ethereum) %>% summarise(n = n(), nperc = n/nrow(.)*100) ``` ``` ## # A tibble: 2 × 3 ## owns_ethereum n nperc ## <lgl> <int> <dbl> ## 1 FALSE 22674 69.4 ## 2 TRUE 9976 30.6 ``` What about women? ```r data %>% filter(sex == "Female") %>% group_by(owns_ethereum) %>% summarise(n = n(), nperc = n/nrow(.)*100) ``` ``` ## # A tibble: 2 × 3 ## owns_ethereum n nperc ## <lgl> <int> <dbl> ## 1 FALSE 9888 61.1 ## 2 TRUE 6304 38.9 ``` --- Answering this visually ```r ggplot(data = data) + aes(x = sex, fill = owns_ethereum) + geom_bar() ``` <img src="3-1-Intro_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> is the plot useful to answer the question? --- ## Visualizing Proportions When visualizing categorical variables, it is useful to look at proportions. Especially if there are more observations in one level compared to the other levels, like this case where there are more males. ```r ggplot(data = data) + aes(x = sex, fill = owns_ethereum) + geom_bar(position = "fill") ``` <img src="3-1-Intro_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> --- ## Your Turn Find what percentage of ethereum owners are also donors? --- ## Answer ```r data %>% filter(owns_ethereum == "TRUE") %>% group_by(donated) %>% summarise(n = n(), nperc = n/nrow(.)*100) ``` ``` ## # A tibble: 2 × 3 ## donated n nperc ## <lgl> <int> <dbl> ## 1 FALSE 4335 26.6 ## 2 TRUE 11945 73.4 ``` --- ```r ggplot(data = data) + aes(x = donated, fill = owns_ethereum) + geom_bar(position = "fill") ``` <img src="3-1-Intro_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> --- ## Mean We can also find the average of each characteristic for ethereum owners vs non owners. Since the mean only works for numeric data types, we first `select` all the numeric columns in the data, and then use the familiar `group` and `summarise` command. ```r data %>% select(owns_ethereum, where(is.numeric)) %>% group_by(owns_ethereum) %>% summarise_all(funs(mean(., na.rm = TRUE))) %>% data.frame() ``` ``` ## owns_ethereum age education_num capital_gain capital_loss hours_per_week ## 1 FALSE 41.01317 9.464069 658.1090 47.41358 40.59572 ## 2 TRUE 33.90412 11.306204 430.9576 167.68471 39.57719 ``` --- ## Your Turn - Mode - Find the mode or the most frequently occurring characteristics of the ethereum holders and the non holders. You can use the helper function 'get_mode()' given below ```r get_mode <- function(v) { all_levels <- unique(v) all_levels[which.max(tabulate(match(v, all_levels)))] } ``` - How do the majority characteristics vary between the two groups? --- ## Answer ```r data %>% group_by(owns_ethereum) %>% summarise_all(funs(get_mode(.))) %>% data.frame() ``` ``` ## owns_ethereum age workclass education_num marital_status occupation ## 1 FALSE 36 Private 9 Married-civ-spouse Craft-repair ## 2 TRUE 23 Private 13 Never-married Prof-specialty ## relationship race sex capital_gain capital_loss hours_per_week donated ## 1 Husband White Male 0 0 40 FALSE ## 2 Husband White Male 0 0 40 TRUE ``` We see that there are more donors, and more higher educated, lower age adults in the owns_ethereum = TRUE group --- ## Statistical significance testing Let's test our hypothesis by applying a t test to see difference between ages of Ethereum owners and non owners ```r t.test(age ~ owns_ethereum, data = data) ``` ``` ## ## Welch Two Sample t-test ## ## data: age by owns_ethereum ## t = 59.084, df = 38110, p-value < 2.2e-16 ## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0 ## 95 percent confidence interval: ## 6.873225 7.344894 ## sample estimates: ## mean in group FALSE mean in group TRUE ## 41.01317 33.90412 ``` as the p < 0.05, we reject the null hypothesis. The mean ages are significantly different between Ethereum owners and non-owners. --- ## Chi-square test of independence Let us test the relationship between marital_status and ownership ```r chisq.test(table(data$owns_ethereum, data$marital_status)) ``` ``` ## ## Pearson's Chi-squared test ## ## data: table(data$owns_ethereum, data$marital_status) ## X-squared = 2617.6, df = 6, p-value < 2.2e-16 ``` X squared value is high and as the p < 0.05, we reject the null hypothesis. There is a significant difference between expected and observed data showing a relationship between these variables. --- ## Your Turn Is there a statistically significant difference between education levels of ethereum owners and non owners? Use an appropriate visualization to showcase this difference. --- ## Answer ```r t.test(education_num ~ owns_ethereum, data = data) ``` ``` ## ## Welch Two Sample t-test ## ## data: education_num by owns_ethereum ## t = -80.107, df = 33443, p-value < 2.2e-16 ## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0 ## 95 percent confidence interval: ## -1.887208 -1.797063 ## sample estimates: ## mean in group FALSE mean in group TRUE ## 9.464069 11.306204 ``` We see that there is a significant difference (p<0.05) between the age, and education of ethereum owners vs non owners --- ```r ggplot(data, aes(x = owns_ethereum, y = education_num)) + geom_boxplot() ``` <img src="3-1-Intro_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" /> --- ## Multivariate Visualizations - Pair plots One way to look at the interactions between multiple variables at the same time, is to use pair plots. When we specify a color variable, the correlation for each level is shown. We load the GGally package, installed using `install.packages("GGally")`, to create pair plots of all the numeric variables ```r library(GGally) # numerical vars data %>% select(owns_ethereum, age, education_num, capital_gain, capital_loss, hours_per_week) %>% ggpairs(columns = 2:6, aes(color = owns_ethereum, alpha = 0.5)) ``` --- <img src="3-1-Intro_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" /> --- Finally we can take a look at all the categorical variables using faceting. The proportions are visualized to understand which variables contribute more to each group ```r data %>% select( workclass, marital_status, occupation, relationship, race, sex, donated, owns_ethereum ) %>% pivot_longer(workclass:donated) %>% ggplot(aes(y = value, fill = owns_ethereum)) + geom_bar(position = "fill", na.rm = FALSE) + facet_wrap(vars(name), scales = "free") + labs(x = NULL, y = NULL, fill = NULL) ``` --- <img src="3-1-Intro_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" />