Numerai is a crowdsourced hedge fund that hosts data science tournaments, attracting thousands of data scientists around the world to compete for Numeraire cryptocurrency. The company provides clean, regularized, and obfuscated data, so anyone with expertise in machine learning can participate for free. It also publishes various guides and code samples to help novice data scientists get started. One way to get the data is to simply visit https://numer.ai/.
In this article, we will perform exploratory data analysis on numerai_training_data.csv using R to find out more about each feature and the target. In a recent development, starting with round 238, which opens on November 14th, 2020, the official target will migrate from “Kazutsugi” to the newer “Nomi”. In preparation for the migration, we’ll perform our analysis on the new Nomi target.
The Data
The training data is 1.3 GB and contains 501,808 rows across 314 columns. The first three columns are as follows:
- ‘id’ – a unique identifier for each row
- ‘era’ – a time period corresponding to a trading day
- ‘data_type’ – indication of whether the row is part of train/test/validation/live
These three columns are followed by 310 feature columns and a final column for the target. Our very first observation is that all the feature columns and the target column take on only 5 numeric values: 0, 0.25, 0.5, 0.75, and 1. A sample of the first few rows and columns is shown below.
#Load libraries needed
library(dplyr)
library(mosaic)
library(skimr)
library(ggplot2)
library(GGally)
library(reshape)
library(ppcor)
library(knitr)
library(ggfortify)
library(stringr)
library(gridExtra)
library(tidyr)
library(janitor)
#import data
training_and_val_data<-read.csv("training_and_val.csv", header=T)
training_data<-training_and_val_data %>% filter(data_type=="train")
#training_data<-training_data %>% sample_frac(0.01)
head(training_data[1:7],3)
id | era | data_type | feature_intelligence1 | feature_intelligence2 | feature_intelligence3 | feature_intelligence4 |
---|---|---|---|---|---|---|
n000315175b67977 | era1 | train | 0.00 | 0.5 | 0.25 | 0.00 |
n0014af834a96cdd | era1 | train | 0.00 | 0.0 | 0.00 | 0.25 |
n001c93979ac41d4 | era1 | train | 0.25 | 0.5 | 0.25 | 0.25 |
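To verify the observation that only five values occur, a quick sanity check such as the following sketch (assuming the column layout above; cols is a helper introduced here) can be run:

#Sketch: confirm that every feature column and the target take only the five values
cols <- c(grep("^feature", names(training_data)), which(names(training_data) == "target"))
all(sapply(training_data[, cols], function(x) all(x %in% c(0, 0.25, 0.5, 0.75, 1))))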
The features fall into 6 categories: intelligence, charisma, strength, dexterity, constitution, and wisdom. Note that these categories are not the same size: there are 12 “Intelligence”, 86 “Charisma”, 38 “Strength”, 14 “Dexterity”, 114 “Constitution”, and 46 “Wisdom” variables. Let’s now take a look at the summary statistics of these features.
#all features
features<-training_data %>% dplyr::select(starts_with("feature"))
#6 categories of features
intelligence<-training_data %>% dplyr::select(starts_with("feature_intelligence"))
charisma<-training_data %>% dplyr::select(starts_with("feature_charisma"))
strength<-training_data %>% dplyr::select(starts_with("feature_strength"))
dexterity<-training_data %>% dplyr::select(starts_with("feature_dexterity"))
constitution<-training_data %>% dplyr::select(starts_with("feature_constitution"))
wisdom<-training_data %>% dplyr::select(starts_with("feature_wisdom"))
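#Sanity check (sketch): confirm the category sizes quoted above
sapply(list(intelligence = intelligence, charisma = charisma, strength = strength,
            dexterity = dexterity, constitution = constitution, wisdom = wisdom), ncol)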
summary<-skim(features)
head(summary,5)
Table: Data summary
Name | features |
Number of rows | 501808 |
Number of columns | 310 |
_______________________ | |
Column type frequency: | |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
feature_intelligence1 | 0 | 1 | 0.5 | 0.35 | 0 | 0.25 | 0.5 | 0.75 | 1 | ▇▇▇▇▇ |
feature_intelligence2 | 0 | 1 | 0.5 | 0.35 | 0 | 0.25 | 0.5 | 0.75 | 1 | ▇▇▇▇▇ |
feature_intelligence3 | 0 | 1 | 0.5 | 0.35 | 0 | 0.25 | 0.5 | 0.75 | 1 | ▇▇▇▇▇ |
feature_intelligence4 | 0 | 1 | 0.5 | 0.35 | 0 | 0.25 | 0.5 | 0.75 | 1 | ▇▇▇▇▇ |
feature_intelligence5 | 0 | 1 | 0.5 | 0.35 | 0 | 0.25 | 0.5 | 0.75 | 1 | ▇▇▇▇▇ |
All the columns are in the correct format, and the data is clean and ready for analysis. There are no missing values, and at first glance many of the features share similar distributions. Histograms of the actual values point to an almost universally uniform distribution for each variable, while the distributions of the summary statistics across features appear either roughly normal or heavily skewed.
p1 <- ggplot(summary, aes(x=numeric.mean)) +
geom_histogram()+
labs(title="Distribution of Mean",x="Mean", y = "Frequency")
p2 <- ggplot(summary, aes(x=numeric.sd)) +
geom_histogram()+
labs(title="Distribution of Standard Deviation",x="Standard Deviation", y = "Frequency")
p3 <- ggplot(summary, aes(x=numeric.p25)) +
geom_histogram()+
labs(title="Distribution of First Quartile",x="First Quartile", y = "Frequency")
p4 <- ggplot(summary, aes(x=numeric.p75)) +
geom_histogram()+
labs(title="Distribution of Third Quartile",x="Third Quartile", y = "Frequency")
grid.arrange(p1,p2,p3,p4, ncol=2, nrow=2)
The mean and median of every feature are very close to 0.5. Most of the features have a standard deviation of around 0.35, while a few have exceptionally low standard deviations of less than 0.2. Almost all of the features have their first and third quartiles exactly at 0.25 and 0.75 respectively; curiously, a few have both quartiles at 0.50.
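The typical standard deviation of 0.35 is no accident: a uniform distribution over the five values produces almost exactly that spread. A quick sketch:

#Population sd of a uniform distribution over the five feature values
vals <- c(0, 0.25, 0.5, 0.75, 1)
sqrt(mean((vals - mean(vals))^2)) #about 0.3536, matching the typical feature sd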
How do these distributions look for each feature category? Let’s find out:
#Create a category column
summary$category <- dplyr::case_when(
  startsWith(summary$skim_variable, 'feature_intelligence') ~ 'Intelligence',
  startsWith(summary$skim_variable, 'feature_charisma') ~ 'Charisma',
  startsWith(summary$skim_variable, 'feature_strength') ~ 'Strength',
  startsWith(summary$skim_variable, 'feature_dexterity') ~ 'Dexterity',
  startsWith(summary$skim_variable, 'feature_constitution') ~ 'Constitution',
  TRUE ~ 'Wisdom'
)
ggplot(summary, aes(x=numeric.mean)) +
geom_histogram()+
labs(title="Distribution of Mean",x="Mean", y = "Frequency") +facet_wrap(~category) +theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 9, hjust = 1))
ggplot(summary, aes(x=numeric.sd)) +
geom_histogram()+
labs(title="Distribution of Standard Deviation",x="Standard Deviation", y = "Frequency") +facet_wrap(~category)
We note that outliers seem to appear in the “Strength” and “Charisma” categories. Let’s look more closely to find out which ones.
#filter to see the outliers
summary %>% filter(numeric.sd<0.2 | numeric.p25==0.5 | numeric.p75==0.5) %>% arrange(numeric.sd)
Table: Data summary
Name | features |
Number of rows | 501808 |
Number of columns | 310 |
_______________________ | |
Column type frequency: | |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist | category |
---|---|---|---|---|---|---|---|---|---|---|---|
feature_strength20 | 0 | 1 | 0.5 | 0.18 | 0 | 0.5 | 0.5 | 0.5 | 1 | ▁▁▇▁▁ | Strength |
feature_strength38 | 0 | 1 | 0.5 | 0.18 | 0 | 0.5 | 0.5 | 0.5 | 1 | ▁▁▇▁▁ | Strength |
feature_charisma25 | 0 | 1 | 0.5 | 0.27 | 0 | 0.5 | 0.5 | 0.5 | 1 | ▂▂▇▂▂ | Charisma |
feature_charisma47 | 0 | 1 | 0.5 | 0.27 | 0 | 0.5 | 0.5 | 0.5 | 1 | ▂▂▇▂▂ | Charisma |
We can see that feature_strength20 and feature_strength38 have the lowest standard deviations (almost half the norm). It is also very interesting that all of the outliers come from just two categories: “Charisma” and “Strength”.
Though we have many features, their distributions are all symmetric, and each resembles one of the following types.
h1<-ggplot(features, aes(x=feature_intelligence1))+geom_histogram()+labs(title="Type 1")+ theme(axis.title.x = element_blank())
h2<-ggplot(features, aes(x=feature_constitution2))+geom_histogram()+labs(title="Type 2")+ theme(axis.title.x = element_blank())
h3<-ggplot(features, aes(x=feature_strength29))+geom_histogram()+labs(title="Type 3")+ theme(axis.title.x = element_blank())
h4<-ggplot(features, aes(x=feature_charisma48))+geom_histogram()+labs(title="Type 4")+ theme(axis.title.x = element_blank())
h5<-ggplot(features, aes(x=feature_charisma25))+geom_histogram()+labs(title="Type 5")+ theme(axis.title.x = element_blank())
h6<-ggplot(features, aes(x=feature_strength20))+geom_histogram()+labs(title="Type 6")+ theme(axis.title.x = element_blank())
grid.arrange(h1,h2,h3,h4,h5,h6)
Type 1 is a uniform distribution, and Types 2 to 6 are simply various degrees of spread (measured by standard deviation) centered at 0.5. Note that Type 5 and Type 6 are the outliers we mentioned earlier.
The two smallest categories, “Intelligence” and “Dexterity”, consist entirely of Type 1 features, while “Strength” and “Charisma” contain a mix of almost all the types.
Here is a frequency table showing exactly how many features of each category belong to each type.
#Assign a type to each feature based on its standard deviation
summary$type <- dplyr::case_when(
  summary$numeric.sd >= 0.35 ~ 'Type 1',
  summary$numeric.sd >= 0.34 ~ 'Type 2',
  summary$numeric.sd >= 0.31 ~ 'Type 3',
  summary$numeric.sd >= 0.28 ~ 'Type 4',
  summary$numeric.sd >= 0.26 ~ 'Type 5',
  TRUE ~ 'Type 6'
)
category_type<-summary %>% group_by(category) %>% count(type) %>% spread(key = type, value = n) %>% replace_na(list('Type 1' = 0, 'Type 2' = 0, 'Type 3' = 0, 'Type 4' = 0, 'Type 5' = 0, 'Type 6' = 0)) %>% mutate(Total = sum(c_across(starts_with('Type'))))
category_type %>% adorn_totals() %>% kable()
category | Type 1 | Type 2 | Type 3 | Type 4 | Type 5 | Type 6 | Total |
---|---|---|---|---|---|---|---|
Charisma | 22 | 24 | 24 | 14 | 2 | 0 | 86 |
Constitution | 62 | 22 | 30 | 0 | 0 | 0 | 114 |
Dexterity | 14 | 0 | 0 | 0 | 0 | 0 | 14 |
Intelligence | 12 | 0 | 0 | 0 | 0 | 0 | 12 |
Strength | 14 | 6 | 10 | 6 | 0 | 2 | 38 |
Wisdom | 30 | 12 | 2 | 2 | 0 | 0 | 46 |
Total | 154 | 64 | 66 | 22 | 2 | 2 | 310 |
How about the target? Let’s take a look at its summary statistics:
skim(training_data['target'])
Table: Data summary
Name | training_data[“target”] |
Number of rows | 501808 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
target | 0 | 1 | 0.5 | 0.22 | 0 | 0.5 | 0.5 | 0.5 | 1 | ▁▃▇▃▁ |
The target does not belong to any of the aforementioned types; its distribution looks roughly normal.
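A frequency table of the target values makes the bell shape concrete; a minimal sketch:

#Sketch: proportion of rows at each target value
round(prop.table(table(training_data$target)), 3)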
Pearson Correlation Coefficient
We now take a look at the correlations between the features.
cm<-round(cor(features[,1:310]),2) #correlation between features
melted_cm<-melt(cm)
ggplot(data = melted_cm, aes(x=X1, y=X2, fill=value)) +
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme( axis.text.x = element_blank(), axis.text.y= element_blank())
Though it is not easy to get much out of the heatmap, we can see that features within each category are mostly correlated with each other (some negatively).
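To make this impression quantitative, we can compare the average absolute correlation within and across categories; a minimal sketch, where cat1 and cat2 are helper columns introduced here:

#Sketch: average |correlation| within vs. across feature categories
melted_cm$cat1 <- str_extract(as.character(melted_cm$X1), "(?<=feature_)[a-z]+")
melted_cm$cat2 <- str_extract(as.character(melted_cm$X2), "(?<=feature_)[a-z]+")
melted_cm %>%
  filter(as.character(X1) != as.character(X2)) %>% #drop self-correlations
  group_by(same_category = (cat1 == cat2)) %>%
  summarise(mean_abs_cor = mean(abs(value)))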
melted_cm$absvalue<-abs(melted_cm$value)
#sum(melted_cm$value==1) # returns 310 -- only the diagonal, so no two distinct features have correlation exactly 1
melted_cm_filtered<-melted_cm %>% filter(value!=1) %>% arrange(-absvalue) #drop the diagonal and order by absvalue in descending order
kable(head(melted_cm_filtered,5))
X1 | X2 | value | absvalue |
---|---|---|---|
feature_wisdom12 | feature_wisdom2 | 0.97 | 0.97 |
feature_wisdom2 | feature_wisdom12 | 0.97 | 0.97 |
feature_wisdom46 | feature_wisdom7 | 0.96 | 0.96 |
feature_wisdom7 | feature_wisdom46 | 0.96 | 0.96 |
feature_charisma69 | feature_charisma9 | 0.95 | 0.95 |
#filter rows with zero correlation
melted_cm_zero<-melted_cm %>% filter(value==0)
#look at constitution
conzero<-melted_cm_zero %>% filter(grepl('feature_constitution',X1)) %>% filter(grepl('feature_constitution',X2))
#conzero %>% group_by(X2) %>%tally() %>% arrange(-n)
The table confirms that features within the same category are often highly correlated with each other (mostly positively). We also see that some features in “Intelligence” are highly correlated with some features in “Dexterity” (such pairs can be found programmatically, as sketched after this list). For example:
- feature_dexterity1 and feature_intelligence10 have correlation 0.82
- feature_dexterity3 and feature_intelligence2 have correlation 0.82
- feature_dexterity8 and feature_intelligence3 have correlation 0.81
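A minimal sketch for locating these cross-category pairs, reusing melted_cm_filtered from above:

#Sketch: top correlations between features from different categories
melted_cm_filtered %>%
  filter(str_extract(as.character(X1), "(?<=feature_)[a-z]+") !=
           str_extract(as.character(X2), "(?<=feature_)[a-z]+")) %>%
  head(3)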
On the other hand, we noticed that many feature pairs have zero correlation (to two decimal places), and it is surprising that some of them belong to the same category. For example, feature_constitution3 has zero correlation with 14 other features in its own category.
We can also explore correlations between linear combinations of features within the same category. Here, we use an equal-weighted linear combination (i.e., the mean) of the features within each group. The correlation heatmap is shown below:
features$intelligence_mean<-rowMeans(intelligence)
features$charisma_mean<-rowMeans(charisma)
features$strength_mean<-rowMeans(strength)
features$dexterity_mean<-rowMeans(dexterity)
features$constitution_mean<-rowMeans(constitution)
features$wisdom_mean<-rowMeans(wisdom)
cm_lc<-round(cor(features[,(ncol(features)-5):ncol(features)]),2)
melted_cm_lc<-melt(cm_lc)
ggplot(data = melted_cm_lc, aes(x=X1, y=X2, fill=value)) +
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+ # minimal theme
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
labs(title="Correlations of linear combination of features within a group")
We can see that the linear combination of “Intelligence” has essentially no correlation with “Constitution” (0.01). Another interesting observation is that “Dexterity” is moderately correlated with both “Intelligence” and “Charisma” (0.5 and 0.55 respectively), yet “Intelligence” and “Charisma” are not very correlated with each other (0.27). We can see these relationships in the scatter plots as well.
p5<-ggplot(data = features) +
geom_point(mapping = aes(x = intelligence_mean, y = constitution_mean))
p6<-ggplot(data = features) +
geom_point(mapping = aes(x = intelligence_mean, y = charisma_mean))
p7<-ggplot(data = features) +
geom_point(mapping = aes(x = intelligence_mean, y = dexterity_mean))
p8<-ggplot(data = features) +
geom_point(mapping = aes(x = charisma_mean, y = dexterity_mean))
grid.arrange(p5,p6,p7,p8, ncol=2, nrow=2)
When we look at the correlations between the features and the target, we see that none of the features are correlated with the target; the highest correlation is only 0.01 in absolute value. This implies only that the features have no linear relationship with the target, not that they have no relationship with it at all.
cm_target<-round(cor(training_data[,4:314]),2)
melted_cm_target<-melt(cm_target) %>% filter(X2=='target')
melted_cm_target$absvalue<-abs(melted_cm_target$value)
kable(head(melted_cm_target %>% arrange(-absvalue),5))
X1 | X2 | value | absvalue |
---|---|---|---|
target | target | 1.00 | 1.00 |
feature_intelligence2 | target | -0.01 | 0.01 |
feature_intelligence3 | target | -0.01 | 0.01 |
feature_charisma1 | target | 0.01 | 0.01 |
feature_charisma2 | target | 0.01 | 0.01 |
As expected, when we look at correlations between linear combinations of features and the target, we see that none of them are correlated with the target.
cm_target_lc<-round(cor(features[,(ncol(features)-5):ncol(features)],training_data$target),2)
kable(cm_target_lc)
group mean | correlation with target |
---|---|
intelligence_mean | 0.00 |
charisma_mean | 0.01 |
strength_mean | 0.01 |
dexterity_mean | -0.01 |
constitution_mean | 0.00 |
wisdom_mean | 0.01 |
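Since zero linear correlation does not rule out other kinds of structure, one simple check beyond Pearson correlation is to compare the mean target within each level of a feature; a minimal sketch:

#Sketch: mean target at each level of one feature
training_data %>%
  group_by(feature_intelligence1) %>%
  summarise(mean_target = mean(target), n = n())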
Scatter Plots
Almost all of the pairwise scatter plots look like the following, even for pairs with high correlation. Unfortunately, we are not able to say much about the features from these plots other than that nearly every possible combination of values occurs in the data.
pairs(intelligence[1:5])
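Because the five discrete levels cause massive overplotting, a count-based plot recovers a little more information; a minimal sketch using ggplot2's geom_count, which sizes points by frequency:

#Sketch: point size encodes how many rows share each (x, y) combination
ggplot(features, aes(x = feature_intelligence1, y = feature_intelligence2)) +
  geom_count()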
Mosaic Plots
Let’s explore some mosaic plots. For instance, going back to the outliers we discussed, this mosaic plot shows that feature_strength20 is heavily concentrated at 0.5 while feature_intelligence1 is evenly distributed across its values.
mosaicplot(~feature_strength20+feature_intelligence1,data=features)
Here is another interesting mosaic plot, which clearly shows the strong correlation between feature_intelligence10 and feature_dexterity1 that we saw when we looked at the correlations.
mosaicplot(~feature_intelligence10+feature_dexterity1,data=features)
Principal Components Analysis
Since we have many features, principal component analysis could be useful. We find that the first 50 principal components explain 82% of the variance, which is not bad considering we started with 310 features.
pca<-prcomp(features[,1:310], scale=TRUE) #pca
VE<-pca$sdev^2
pca_tve<-round(VE/sum(VE),2)
pca_tve[1:10]
## [1] 0.10 0.08 0.05 0.04 0.03 0.03 0.02 0.02 0.02 0.02
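The 82% figure can be checked directly from the unrounded variances; a minimal sketch:

#Cumulative proportion of variance explained by the first 50 components
cumsum(VE / sum(VE))[50]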
Here are two plots of PC1 vs PC2, in which no clear clusters can be seen. A similar situation occurs across the rest of the components as well. We can only conclude that every target value occurs throughout the input space, which is unfortunate, as it implies that predicting the target could be a difficult task.
training_data$target_factor=factor(training_data$target)
x <- pca$x %>%
as_tibble() %>%
mutate(target=training_data$target_factor)
p9<-ggplot(data = x, aes(x = PC2, y = PC1, colour =target)) +
geom_point()
p10<-ggplot(data = x, aes(x = PC2, y = PC1, colour =target)) +
geom_point()+facet_wrap(~target)
p9
p10
Conclusion
From the analysis, we found that many of the features are very similar to each other, and even though there are 6 categories of features, there is no clear distinction between them. Here are some interesting things we found:
- About 50% of the features are uniformly distributed, while the target is roughly normally distributed.
- “Charisma” and “Strength” features are the categories that vary the most in their distribution.
- Many features within the same category are highly correlated with each other. However, there are also features within the same category that have zero correlation.
- “Dexterity” is moderately correlated with both “Intelligence” and “Charisma”, which suggests that dimension reduction is likely to play a significant role in the modeling process.
- Due to the discrete structure of the features and target, the scatter plots we produced were not very helpful.
We also found that none of the features are correlated with the target, which was indeed very curious; however, when performing the same analysis across different eras, we found trends that inspired our next post analyzing time dependencies in the dataset. We leave you with the following example of the correlation between feature_intelligence1 and the target across eras, where we clearly see that the relationship is a function of time.
See Codetraining_data<-training_data %>% dplyr::mutate(era = as.numeric(str_extract(era, "[^era]+
quot;))) eras<-training_data %>% distinct(era) result <- by(training_data, training_data$era, function(x) {cor(x$feature_intelligence1, x$target)}) result.dataframe <- as.data.frame(as.matrix(result)) result.dataframe$era <- rownames(result) result.dataframe$era <- factor(result.dataframe$era, levels=eras[,1]) ggplot(data = result.dataframe, aes(x = era, y = V1, group = 1)) + geom_line()+ theme(axis.text.x = element_text(angle = 90, vjust = 1, size = 7, hjust = 1))+ labs(title="Correlation between feature_intelligence1 and target across eras")
References
- Numerai Sample Scripts – https://github.com/numerai/example-scripts
- New Target Nomi Release – https://forum.numer.ai/t/new-target-nomi-release/959
- Numerai – https://numer.ai/
- Numerai – Wikipedia – https://en.wikipedia.org/wiki/Numerai