class: center, middle, inverse, title-slide

.title[
# Unsupervised Learning
]
.author[
### Omni Analytics Group
]

---

## Outline

- Objective
- Data preparation
- Normalization
- K-means clustering
- 3 ways of visualizing clusters
- Naming clusters
- Other methods: pam, hclust

---

## Glossary

- Observation
- Cluster

---

## Objective

The marketing team for our fictional crypto project would like to understand more about the kinds of Ethereum holders in their survey. They are wondering if there are specific profiles within the Ethereum holder base that distinguish holders from one another.

*One way of understanding the kinds of Ethereum holders is to categorize them into groups, based on the similarity of characteristics between customers.*

Since the marketing team does not already have defined profiles that they are looking to identify, they will be attempting to solve a `latent variable` problem, where the unobserved variable is the proposed grouping of holders. When given unlabelled data such as this, we use unsupervised methods to cluster individual observations within the dataset.

We will go through 3 different clustering techniques to help the marketing team!

---

## Dataset Preparation

Let us read in the dataset, filter it to take only the Ethereum owners, remove outliers, and prepare it by doing the following:

- create a numeric donated column that has a value of 1 if donated == TRUE and 0 if donated == FALSE
- select only the numeric columns
- omit all rows that have NA values

```r
library(tidyverse)

data_raw <- read_csv("data/data_raw.csv")

data_clus <- data_raw %>%
  filter(owns_ethereum == "TRUE") %>%
  mutate(capital_gain = ifelse(capital_gain == 99999, NA, capital_gain)) %>%
  mutate(hours_per_week = ifelse(hours_per_week == 99, NA, hours_per_week)) %>%
  mutate(donated = as.numeric(donated)) %>% # change to 0 and 1
  select(where(is.numeric)) %>% # select only the numeric columns
  na.omit()
```

---

## Normalization

The variables are measured in differing units. For example, age ranges from roughly 17 to 90 years, whereas donated is just an indicator variable of 0 and 1. When using analysis techniques like clustering, we need to make sure each variable has the same influence. Hence, we normalize all the numeric values.

```r
norm_mat <- as.data.frame(sapply(data_clus, function(x) {
  (x - min(x)) / (max(x) - min(x))
}))
```

---

## K-means clustering

K-means is a technique that groups data points, or observations, into k groups based on feature similarity. The algorithm allocates k centroids to the data and tries to minimize the distance from each data point to its centroid. In other words, it tries to find the most similar data points around a representative centroid.

The formal algorithm is as follows:

- Choose the number of clusters k
- Select k random points from the data as cluster centroids
- Assign all the other data points to the closest centroid
- Recompute the centroids of the newly formed clusters
- Repeat the 3rd and 4th steps until the cluster assignments no longer change, i.e., until the centroids stabilize. Alternatively, we can use a stopping criterion such as a maximum number of iterations, e.g., 100 iterations

A from-scratch sketch of a single iteration is shown on the next slide.
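---

## K-means by hand (sketch)

Before relying on the built-in `kmeans()` function, it can help to see the steps written out. Below is a minimal, illustrative sketch of a single iteration (steps 2-4 of the algorithm) in base R. The object names `centroids` and `assignments` are chosen here for illustration only and are not used elsewhere in these slides.

```r
set.seed(1234)
k <- 3

# Step 2: pick k random observations as the starting centroids
centroids <- norm_mat[sample(nrow(norm_mat), k), ]

# Step 3: assign every observation to its closest centroid
# (squared Euclidean distance)
assignments <- apply(norm_mat, 1, function(obs) {
  which.min(colSums((t(centroids) - obs)^2))
})

# Step 4: recompute each centroid as the mean of its assigned observations
centroids <- aggregate(norm_mat, by = list(cluster = assignments), FUN = mean)[, -1]

# Steps 3 and 4 are repeated until the assignments stop changing;
# kmeans() handles this loop and the convergence check for us.
```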
---

## Deciding the number of clusters

The value of k is commonly decided by trying different numbers of groups: we decrease k if there are many highly similar groups, or increase k if we think there should be more variety of groups.

Another way to decide on the number of clusters visually is the 'elbow method', where we fit a clustering for each candidate k and extract the total within-cluster sum of squares from each model. We want the value of k that decreases the within-cluster sum of squares the most (i.e., gives clusters whose members are most similar to one another) while the k clusters are still not redundant. This is commonly found at the elbow of the plot, which is the last value of k before the slope levels off.

---

```r
set.seed(1234)

# run many models with varying value of k
tot_withinss <- map_dbl(1:10, function(k) {
  model <- kmeans(x = data_clus, centers = k)
  model$tot.withinss
})

# create a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
)

# make the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:10)
```

---

<img src="3-2-Unsupervised_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## Running the K-means algorithm

Now, using the value of k chosen from the elbow plot, we carry out a k-means cluster analysis on the normalized customer dataset.

```r
set.seed(1234)
myclust <- kmeans(x = norm_mat, centers = 3)
```

---

## Visualizations

### Tabular Cluster Summary

One way to visualize the clusters is by using tables to display group averages.

```r
data_clus %>%
  mutate(kmeans = myclust$cluster) %>%
  group_by(kmeans) %>%
  mutate(Size = n()) %>%
  summarise_all(mean) %>%
  round() %>%
  data.frame()
```

```
##   kmeans age education_num capital_gain capital_loss hours_per_week donated
## 1      1  37            13          670            1             42       1
## 2      2  23             9           89            0             20       1
## 3      3  40            10          371          645             56       0
##    Size
## 1  7445
## 2  4415
## 3  4224
```

---

### Parallel Coordinate Plot

This is another way to create side-by-side visualizations to compare the average feature values in each cluster. Similar to how we needed to scale the data before clustering, we need to make sure the values are on a similar scale so that we can compare proportions.

```r
scaled_clus_avg <- norm_mat %>%
  mutate(kmeans = myclust$cluster) %>%
  group_by(kmeans) %>%
  mutate(Size = n()) %>%
  unite(col = kmeans, kmeans, Size, sep = " size:") %>%
  group_by(kmeans) %>%
  summarise_all(mean)
```

---

```r
melted_df <- scaled_clus_avg %>%
  gather(key = variable, value = value, 2:(ncol(.))) %>%
  mutate(Cluster = factor(kmeans))

ggplot(melted_df, aes(variable, value, color = Cluster, group = Cluster)) +
  geom_line(size = 1.3) +
  theme_bw()
```

---

<img src="3-2-Unsupervised_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

### Pair plots

We can also use pair plots to visualize groups that are well separated.

```r
library(GGally)

data_clus %>%
  mutate(kmeans = myclust$cluster) %>%
  ggpairs(columns = 1:6, aes(color = as_factor(kmeans), alpha = 0.5))
```

---

<img src="3-2-Unsupervised_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

## Naming Clusters

A common practice is to name the different clusters with a tagline or a persona that characterizes the group of people well. We find there are 3 distinguishable groups:

- Cluster 1 are donors, highly educated, have capital gains, and are in their middle age.
- Cluster 2 are donors, work few hours, have some capital gains, and are the youngest in age.
- Cluster 3 are not donors, work the highest number of hours per week, have capital losses, and are in their middle to senior ages.

The personas corresponding to the groups:

- Cluster 1: "Savvy Investors"
- Cluster 2: "Benevolent Youth"
- Cluster 3: "Hardworking Speculators"

A short sketch of attaching these persona labels to the data follows on the next slide.
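---

### Attaching the persona labels (sketch)

One way to carry these personas into later summaries is to attach them to the data as a label column. Below is a minimal sketch, assuming the k-means assignments in `myclust$cluster`; the label vector `persona_labels` and the column name `Persona` are illustrative choices, not part of the original analysis.

```r
# map each k-means cluster number to its persona label
persona_labels <- c("1" = "Savvy Investors",
                    "2" = "Benevolent Youth",
                    "3" = "Hardworking Speculators")

data_personas <- data_clus %>%
  mutate(Persona = persona_labels[as.character(myclust$cluster)])

# quick check of the group sizes under their new names
count(data_personas, Persona)
```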
---

## PAM

The k-medoids algorithm is a clustering algorithm related to the k-means algorithm. It is more robust to noise and outliers than k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances.

```r
library(cluster)

myclust <- pam(norm_mat, k = 3)

scaled_clus_avg <- norm_mat %>%
  mutate(pam = factor(myclust$clustering)) %>%
  group_by(pam) %>%
  mutate(Size = n()) %>%
  summarise_all(mean)
```

---

```r
melted_df <- scaled_clus_avg %>%
  gather(key = variable, value = value, 2:(ncol(.) - 1))

ggplot(melted_df, aes(variable, value, color = pam, group = pam)) +
  geom_line(size = 1.3) +
  theme_bw()
```

<img src="3-2-Unsupervised_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

## HCLUST

Hierarchical clustering requires a distance matrix that represents the pairwise dissimilarity between observations, calculated here with the `dist()` function.

```r
hclust_complete <- norm_mat %>%
  dist() %>%
  hclust()
```

---

### Choosing the number of clusters

We use a dendrogram to find the best allocation of observations to clusters; a sketch of marking the chosen cut on the tree follows the plot.

```r
plot(hclust_complete)
```

<img src="3-2-Unsupervised_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
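---

### Marking the clusters on the dendrogram (sketch)

To make the link between the dendrogram and a concrete number of clusters more visible, we can draw the cut directly on the tree. This is a small optional sketch using base R's `rect.hclust()`; the choice of `k = 4` simply mirrors the cut we make with `cutree()` on the next slide, and the height in the comment is a hypothetical value you would read off the plot.

```r
# redraw the dendrogram and outline the k = 4 grouping that cutree() produces
plot(hclust_complete)
rect.hclust(hclust_complete, k = 4, border = "red")

# cutree() can also cut by height instead of by number of clusters, e.g.:
# cutree(hclust_complete, h = 3)
```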
---

```r
clusters <- cutree(hclust_complete, k = 4)
```

Now let's view a cluster summary table.

```r
data_clus %>%
  mutate(hclust = clusters) %>%
  group_by(hclust) %>%
  mutate(Size = n()) %>%
  summarise_all(mean) %>%
  round() %>%
  data.frame()
```

```
##   hclust age education_num capital_gain capital_loss hours_per_week donated
## 1      1  32            12          391            1             34       1
## 2      2  41            10            0         1855             41       0
## 3      3  39            10          555           41             64       0
## 4      4  40            13        29012            0             41       1
##    Size
## 1 11834
## 2  1406
## 3  2818
## 4    26
```

---

```r
melted_df <- norm_mat %>%
  mutate(hclust = factor(clusters)) %>%
  gather(key = variable, value = value, 1:(ncol(.) - 1)) %>%
  group_by(hclust, variable) %>%
  summarise(value = mean(value))

ggplot(melted_df, aes(variable, value, color = hclust, group = hclust)) +
  geom_line(size = 1.3) +
  theme_bw()
```

<img src="3-2-Unsupervised_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

## Your Turn

The marketing team loved the groups of Ethereum holders. Now they want to segment the entire customer base to find unique groups.

Task: Find well-separated groups using the entire dataset.

- How many different clusters have you settled on?
- Are you able to see which features distinguish the Ethereum holders from the non-Ethereum holders?

---

## Answer

```r
## read in data
data_raw <- read_csv("data/data_raw.csv")

data_clus <- data_raw %>%
  mutate(donated = as.numeric(donated)) %>% # change to 0 and 1
  select(where(is.numeric)) %>%
  na.omit()

# normalize numeric values
norm_mat <- as.data.frame(sapply(data_clus, function(x) {
  (x - min(x)) / (max(x) - min(x))
}))

## Clustering
set.seed(1234)
myclust <- kmeans(x = norm_mat, centers = 5)

# add the cluster assignment
data_clus$Cluster <- myclust$cluster

## Produce group averages
data_clus %>%
  group_by(Cluster) %>%
  mutate(Size = n()) %>%
  summarise_all(mean) %>%
  round()
```

```
## # A tibble: 5 × 8
##   Cluster   age education_num capital_gain capital_loss hours_per_week donated
##     <dbl> <dbl>         <dbl>        <dbl>        <dbl>          <dbl>   <dbl>
## 1       1    48             6          722          128             41       0
## 2       2    35            10         1313          148             45       0
## 3       3    22             9          183            0             20       1
## 4       4    56            10          649            1             36       1
## 5       5    36            13         1516            1             41       1
## # … with 1 more variable: Size <dbl>
```

---

```r
norm_mat$Cluster <- myclust$cluster

scaled_clus_avg <- norm_mat %>%
  group_by(Cluster) %>%
  mutate(Size = n()) %>%
  unite(col = Cluster, Cluster, Size, sep = " size:") %>%
  group_by(Cluster) %>%
  summarise_all(mean)

melted_df <- scaled_clus_avg %>%
  gather(key = variable, value = value, 2:(ncol(.))) %>%
  mutate(Cluster = factor(Cluster))
```

---

<img src="3-2-Unsupervised_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />
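---

### Which clusters hold Ethereum? (sketch)

The second "Your Turn" question asked whether any features distinguish the Ethereum holders. One quick, optional check is to compare cluster membership with ownership directly. The sketch below re-runs the same preparation but keeps `owns_ethereum`; it assumes `owns_ethereum` has no missing values in the retained rows, otherwise the rows will not line up with `myclust$cluster`. The object name `data_with_eth` is just for illustration.

```r
# keep owns_ethereum alongside the numeric columns used for clustering
# (assumes owns_ethereum has no NAs in the retained rows, so the row
#  order matches myclust$cluster from the 5-cluster solution above)
data_with_eth <- data_raw %>%
  mutate(donated = as.numeric(donated)) %>%
  select(owns_ethereum, where(is.numeric)) %>%
  na.omit() %>%
  mutate(Cluster = myclust$cluster)

# proportion of Ethereum owners within each of the 5 clusters
data_with_eth %>%
  group_by(Cluster) %>%
  summarise(prop_owns_ethereum = mean(owns_ethereum == "TRUE"))
```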