[PYTHON] Numerai Tournament: Fusion of Traditional Quants and Machine Learning


This article is a Japanese translation of "[Numerai Tournament: Blending Traditional Quantitative Approach & Modern Machine Learning](https://towardsdatascience.com/numerai-tournament-blending-traditional-quantitative-approach-modern-machine-learning-67ebbb69e00c)", which I contributed to Medium's Towards Data Science.

About the Numerai tournament

Numerai is a crowdsourced hedge fund: it is managed based on stock price forecasts submitted by a large, open pool of participants. Numerai runs a tournament in which participants compete on predictive performance. Participants build a predictive model from the dataset Numerai provides and submit its predictions. They are ranked by predictive performance and receive payouts (or, in some cases, have money collected from them). Numerai's investors include Howard Morgan, co-founder of Renaissance Technologies; Paul Tudor Jones, founder of Tudor Investment Corporation; and Union Square Ventures, a long-established US venture capital firm. With experienced VCs and hedge fund figures among its backers, the dataset is also supervised by an advisor who specializes in financial machine learning. Total payouts to participants have so far exceeded $34 million, which suggests that the project is progressing well.

(Image: Provided by Numerai)

About the author

I manage Japanese equities using a market-neutral strategy. A market-neutral strategy predicts the relative rise and fall of stock prices within the universe (the stocks eligible for investment) and combines long and short positions to target absolute returns that do not depend on the direction of the market. I build the stock price forecasting model with machine learning, on a foundation of traditional quant methods and statistics. Results so far have been good, with a return of about 40%.

Purpose of this article

In this article, I share what I learned while building my investment model. I first explain the concepts behind traditional quant investing and then discuss how to blend them with machine learning to build a modern predictive model.


The Numerai dataset is obfuscated, and I have no inside information about it. The content of this article reflects my own perspective, based on my investment and modeling experience.

Traditional quant method

Research on forecasting stock returns has a long history. Let me start by explaining the traditional quant method and its background.

BARRA risk model

The prototype of today's quant methods is probably the risk model proposed by Barr Rosenberg [1]. Accounts differ on this point, but to learn the history of Wall Street in this period, you should definitely read Peter Bernstein's book Capital Ideas (Japanese title: "The Thought Revolution of Securities Investment") [2].

In the 1960s, Rosenberg devised a method to explain the risk of individual companies using various factors, building on Markowitz's covariance model. He found that these risk factors were linked to excess stock returns (risk premiums). In 1975, Rosenberg founded the consulting firm Barr Rosenberg Associates, which became known to asset managers around the world as BARRA.

Today, the BARRA model, offered by MSCI as a vendor, is the best-known risk model. Other risk models include Axioma. The BARRA model comes in several variants; the BARRA Global Equity Model (GEM) is a risk model for equities in the major stock markets around the world [3]. In this model, stock returns are broken down into country factors, industry factors, risk factors, and specific (stock-level) factors as follows.

(Figure: decomposition of stock returns in the BARRA GEM)

This is described by a multiple regression model as follows. Rn is the excess return of stock n (over the risk-free rate), x is the exposure of stock n to each factor (k, j, i), f is the factor return, and en is the specific return. The important idea here is the factor return.

(Equation: multiple regression model of the BARRA GEM)
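Written out, the regression described above takes roughly the following form. This is a reconstruction from the symbol definitions in the text, not a copy of the original equation image. In particular, which of the indices k, j, and i runs over country, industry, and risk factors is an assumption made here for illustration.

```latex
% Assumed index convention: k = country factors, j = industry factors, i = risk factors
R_n = \sum_{k} x_{nk} f_k + \sum_{j} x_{nj} f_j + \sum_{i} x_{ni} f_i + e_n
```

Each sum pairs the exposures x of stock n with the corresponding factor returns f, and whatever the factors do not explain is left in the specific return e_n.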

Factor return

For simplicity, I use a single-factor model instead of a multi-factor model, and I explain it with the structure of the Numerai dataset as a concrete example. The factor return is the regression coefficient f in the following cross-section regression, where r is the target vector in eraX and x is the vector of featureA in eraX.

(Equation: single-factor cross-section regression for eraX)
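As a minimal sketch of this idea, the code below estimates a single-factor return for each era by ordinary least squares and accumulates the results over eras. The data are synthetic, the function name `factor_return` is hypothetical, and the inclusion of an intercept in the regression is an assumption, since the original equation is not reproduced here.

```python
import numpy as np


def factor_return(x, r):
    """Slope f of the cross-sectional OLS fit r = a + f * x + e for one era.

    The intercept a is an assumption of this sketch.
    """
    x = np.asarray(x, dtype=float)
    r = np.asarray(r, dtype=float)
    xc = x - x.mean()
    return float(xc @ (r - r.mean()) / (xc @ xc))


# Synthetic stand-in for the tournament data (all values are made up).
rng = np.random.default_rng(0)
n_eras, n_stocks = 12, 500
levels = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # five quantile values

factor_returns = []
for era in range(n_eras):
    x = rng.choice(levels, size=n_stocks)                     # featureA in this era
    r = 0.5 + 0.02 * x + rng.normal(0.0, 0.1, size=n_stocks)  # target in this era
    factor_returns.append(factor_return(x, r))

factor_returns = np.array(factor_returns)
cumulative = factor_returns.cumsum()  # factor return accumulated era by era
print(cumulative.round(3))
```

Plotting `cumulative` against the era index gives the kind of time series that is usually inspected when judging whether a factor earns a stable return.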

The factor return indicates how much return you can expect from betting on that risk factor in that universe. The factor exposure indicates how strongly a stock is exposed to the risk factor; the larger the exposure, the more the stock benefits from the factor return. As the equation above shows, this regression is a cross-section model for one specific period (eraX). In actual validation, the result is computed for each period (for example, monthly), accumulated as a time series, and its characteristics are then examined.

Below is a partial excerpt of factor returns from the BARRA GEM material. A factor return that moves steadily upward means that betting on that factor would have earned a stable return. Conversely, if it moves steadily downward, you can bet against that factor (swap long and short). As of 2020, few factor returns move strongly in one direction. Portfolios are therefore constructed, with the factor exposure of each stock in mind, so that bets are spread over various factors.

(Figure: Created by the author from reference [3])

Relationship between factor return and Correlation

Since the factor return is a regression coefficient, it can be converted to Correlation using the volatilities of the objective and explanatory variables. In the equation below, b is the regression coefficient of the objective variable y on the explanatory variable x, σxy is the covariance of x and y, and σx and σy are the standard deviations of x and y, respectively. Correlation is the factor-return regression coefficient rescaled by volatility and standardized to lie between -1 and 1.

(Equation: relationship between the regression coefficient b and Correlation)
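For a simple regression, the standard identities are b = σxy / σx² and Correlation = σxy / (σx σy), so Correlation = b · σx / σy. The sketch below checks this numerically on synthetic data; the variable names and the effect size are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)             # explanatory variable (a feature)
y = 0.03 * x + rng.normal(size=1000)  # objective variable (a noisy target)

cov_xy = np.cov(x, y, ddof=1)[0, 1]
sigma_x = x.std(ddof=1)
sigma_y = y.std(ddof=1)

b = cov_xy / sigma_x**2              # regression coefficient of y on x
corr_from_b = b * sigma_x / sigma_y  # rescale the slope by the two volatilities
corr_direct = np.corrcoef(x, y)[0, 1]
print(b, corr_from_b, corr_direct)
```

The last two printed values agree, which is the sense in which Correlation is a volatility-adjusted factor return.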

Correlation is a very important indicator in risk models and, by extension, in active management theory. In active management theory, Correlation is called the Information Coefficient and is a measure of an investor's skill. I omit the details here; interested readers should refer to the best-known book on active management theory [4].

The figure below shows the factor return (computed as Correlation) of each Numerai feature. Each is calculated on a single-factor basis, not a multi-factor one. The figure shows at a glance what characteristics each feature has and how much explanatory power it carries.

(Figure: factor returns of each Numerai feature)

Note that these factor returns include variation due to randomness. The following is a Monte Carlo simulation for Correlation = 0.0 and Correlation = 0.005 (100 trials each). Always keep in mind that random variation of this magnitude occurs. Judging statistical significance from a sample of only about 120 periods is a very difficult problem. Even so, dexterity 4 and 7 show the most pronounced factor returns.

(Figure: Monte Carlo simulation for Correlation = 0.0 and Correlation = 0.005)
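A simulation of this kind can be sketched as follows. This is not the code behind the original figure: it assumes that the correlation measured in each era is normally distributed around the true value, and `era_std = 0.03` is a hypothetical noise level, not a number from the Numerai data.

```python
import numpy as np


def simulate_cumulative_corr(true_corr, era_std=0.03, n_eras=120,
                             n_trials=100, seed=0):
    """Cumulative per-era correlation paths, one row per trial.

    era_std is an assumed standard deviation of the per-era correlation.
    """
    rng = np.random.default_rng(seed)
    per_era = rng.normal(true_corr, era_std, size=(n_trials, n_eras))
    return per_era.cumsum(axis=1)


paths_zero = simulate_cumulative_corr(0.0, seed=0)
paths_small = simulate_cumulative_corr(0.005, seed=1)

for name, paths in [("corr=0.000", paths_zero), ("corr=0.005", paths_small)]:
    final = paths[:, -1]  # cumulative value after the last era
    print(name,
          "mean:", round(float(final.mean()), 3),
          "5%-95% range:", np.percentile(final, [5, 95]).round(3))
```

Comparing the two printed ranges shows how much a path with no true signal can wander over about 120 eras, and how much the two cases overlap under the assumed noise level.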

About evaluation by Correlation

Seen this way, it becomes clear why Numerai evaluates submissions by Correlation. The predictions each tournament participant submits are themselves individual factors, and to Numerai they are more informative than the existing features. Numerai is looking for excellent factor returns that participants have created independently. If the factor returns are good, Numerai may simply combine them for investment, or in some cases it may train a further model on the collected factors to improve performance.

Risk factors as a feature

In this chapter, I consider how conventional risk factors might be incorporated as features for machine learning. First and foremost are the Country feature and the Industry feature.

Country Feature

Numerai's universe is thought to consist of stocks in major markets around the world. In the Numerai tournament data, the ids of individual stocks are encrypted, so there is no way to confirm this. However, the list of target stocks has been published for Numerai Signals, so I aggregated it. Judging from the number of stocks, I suspect it is the same universe as the current Numerai tournament. The Numerai Signals list covers 41 countries, with the US having the most stocks, followed by Japan, South Korea, and the United Kingdom. These may be built into a feature not simply as a Country but as a Region (North America, South America, Pacific, and so on).

(Figure: number of Numerai Signals stocks by country)

In an ordinary risk model, the Country feature is introduced as a 0/1 categorical variable. In Numerai's dataset, however, features basically take five quantile values, and each quantile often holds the same number of stocks. If I were to build such a feature myself, I would run a multiple regression of each stock's returns on the index of each Country (or each Region) and bin the resulting betas into quantiles to form the feature.

(Figure: country betas binned into quantiles)
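The idea can be sketched with synthetic data as follows. Everything here is hypothetical: the returns are simulated, each stock is assumed to load only on its own country index, and `to_quantile_feature` is an illustrative helper, not Numerai's actual feature construction.

```python
import numpy as np

rng = np.random.default_rng(2)
n_days, n_stocks, n_countries = 250, 200, 5

# Hypothetical data: each stock belongs to one country and loads on that index.
country_of_stock = np.arange(n_stocks) % n_countries
index_returns = rng.normal(0.0, 0.01, size=(n_days, n_countries))
stock_returns = (index_returns[:, country_of_stock]
                 + rng.normal(0.0, 0.01, size=(n_days, n_stocks)))

# Multiple regression of every stock on all country indexes at once.
design = np.column_stack([np.ones(n_days), index_returns])
coef, *_ = np.linalg.lstsq(design, stock_returns, rcond=None)
betas = coef[1:].T  # shape (n_stocks, n_countries), intercept row dropped


def to_quantile_feature(values, n_bins=5):
    """Rank values and map them to n_bins equally populated levels in [0, 1]."""
    ranks = values.argsort().argsort()
    return (ranks * n_bins // len(values)) / (n_bins - 1)


country0_feature = to_quantile_feature(betas[:, 0])
print(np.unique(country0_feature, return_counts=True))
```

In this toy setup the stocks of country 0 all land in the top level (1.0), while the remaining levels only sort estimation noise, which illustrates why only the extreme quantile of such a feature would be informative.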

Done this way, Japanese stocks, for example, have a large beta to the TSE index and gather in the top quantile of that feature (or in the bottom quantile, depending on how the bins are coded). If a Country feature of this kind exists, the quantile that matters is the most extreme one, and the other quantiles carry no useful information. In Numerai's analysis_and_tips, there was a report of features whose effect shows up sharply at the values 0 or 1, and I think this could be the reason.

For reference, the relative returns of each country since 2010 are shown below.

(Figure: relative returns by country since 2010)

Industry Feature

The next important one is the Industry feature. In Jack Schwager's Market Wizards series, Steve Cohen states that 40% of a stock's price movement comes from the market, 30% from its industry, and the remaining 30% from stock-specific factors. It is hard to imagine that this feature has not been incorporated. Industry definitions vary: BARRA GEM defines 38 industries, GICS defines 60 sectors, and RBICS, provided by FactSet, defines 12 Economies, 31 Sectors, and 89 Subsectors. For reference, the number of stocks by Economy in the US market is shown below.

(Figure: number of US stocks by RBICS Economy)

As with Country, Industry may be turned into a feature by quantizing multiple regression betas to industry indexes. In this case as well, the quantile that matters is the most extreme one, and the other quantiles carry no useful information.

(Figure: industry betas binned into quantiles)

For reference, the relative returns of each industry in the US market since 2010 are shown below.

(Figure: relative returns by industry in the US market since 2010)

Risk Index Feature

The Risk Index features very likely include those used in BARRA: size, value, success (momentum), and volatility. These can be used as they are, but they are often normalized to account for biases across groups such as Country and Industry. For size, not only market capitalization but also sales, total assets, and number of employees can be considered. For value, PBR, PER, PCFR, and so on can be considered. Other Risk Indexes include liquidity, growth, dividends, and financial leverage. In addition to these traditional Risk Indexes, alternative variables such as analyst information and sentiment indexes extracted from news can also be incorporated.
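One common way to normalize a Risk Index for group bias is to z-score it within each group. The sketch below does this per industry on made-up data; the function name `neutralize_by_group` and the sample numbers are hypothetical, and this is only one of several possible normalization schemes.

```python
import numpy as np


def neutralize_by_group(values, groups):
    """Z-score values within each group (e.g. industry or country)."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    out = np.empty_like(values)
    for g in np.unique(groups):
        mask = groups == g
        std = values[mask].std()
        out[mask] = (values[mask] - values[mask].mean()) / (std if std > 0 else 1.0)
    return out


# Made-up book-to-price ratios: banks look "cheap" across the board.
rng = np.random.default_rng(3)
industry = np.array(["bank"] * 50 + ["software"] * 50)
book_to_price = np.concatenate([rng.normal(1.2, 0.2, 50),
                                rng.normal(0.3, 0.1, 50)])

value_score = neutralize_by_group(book_to_price, industry)
for g in ("bank", "software"):
    print(g, round(float(value_score[industry == g].std()), 6))
```

After this step a high value score means cheap relative to industry peers, so the feature no longer doubles as an industry indicator.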

For reference, the relative returns of each Risk Index in the US market since 2010 are shown below.

(Figure: relative returns by Risk Index in the US market since 2010)

Fusion of traditional quants and machine learning

This chapter describes how machine learning can be used to improve performance over traditional quant methods.

Tree model

The Barra model is simply a weighted sum of individual risk factors. There is a simple and convenient way to take this one step further: consider interactions. As a simple example, value works in some industries and not in others. Looking at size rather than industry, some factors tend to work for large-cap stocks and others for small-cap stocks. In addition, the industries that outperform differ from country to country. Linear models are poorly suited to such interactions, because the interaction terms must be specified by a human and added as features. A tree-based method can learn interactions on its own, without anyone specifying them. On the other hand, because a tree-based method partitions the feature space in a grid pattern, it is poor at capturing linear relationships, and therefore poor at capturing the risk premium itself as in the original BARRA model.
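The limitation of the linear model can be illustrated with a toy example in which value pays off in only one of two industries. The data and effect sizes are made up, and `fit_r2` is a hypothetical helper; the point is only that the linear fit recovers the effect once a human adds the interaction term by hand.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
value = rng.normal(size=n)             # value exposure
industry = rng.integers(0, 2, size=n)  # 0 or 1: two hypothetical industries
# Value pays off only in industry 1 (made-up effect size).
ret = 0.05 * value * industry + rng.normal(0.0, 0.02, size=n)


def fit_r2(features, target):
    """OLS with intercept; returns (coefficients, R^2)."""
    design = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    r2 = 1.0 - resid.var() / target.var()
    return coef, float(r2)


coef_plain, r2_plain = fit_r2([value, industry], ret)
coef_inter, r2_inter = fit_r2([value, industry, value * industry], ret)
print("without interaction term: R^2 =", round(r2_plain, 3))
print("with interaction term:    R^2 =", round(r2_inter, 3))
```

With many features, the number of candidate interaction terms grows quickly, which is why a model that discovers them on its own is attractive.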

The solution is to ensemble or stack linear and tree models. In the actual Two Sigma competition held on Kaggle, an ensemble of Ridge regression (a linear model) and Extra Trees (a tree model) finished among the prize winners [5].

(Figure: From reference [5])
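A minimal sketch of such a blend with scikit-learn is shown below. It is not the winning team's pipeline: the data are synthetic, the hyperparameters are arbitrary, and the blend is a plain equal-weight average rather than stacking.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
X = rng.uniform(-1.0, 1.0, size=(3000, 3))
# Made-up target: a linear premium on feature 0 plus an interaction of features 1 and 2.
y = 0.5 * X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(0.0, 0.3, size=3000)
X_train, X_test, y_train, y_test = X[:2000], X[2000:], y[:2000], y[2000:]

ridge = Ridge(alpha=1.0).fit(X_train, y_train)
trees = ExtraTreesRegressor(n_estimators=100, min_samples_leaf=5,
                            random_state=0).fit(X_train, y_train)

pred_ridge = ridge.predict(X_test)
pred_trees = trees.predict(X_test)
pred_blend = 0.5 * pred_ridge + 0.5 * pred_trees  # equal weights are an arbitrary choice


def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


for name, pred in [("ridge", pred_ridge), ("extra trees", pred_trees),
                   ("blend", pred_blend)]:
    print(name, round(corr(pred, y_test), 3))
```

The Ridge model can only pick up the linear premium, while the tree model can also pick up the interaction; in practice the blend weights would be chosen by validation rather than fixed at one half each.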

Deep Factor model

Deep learning can also be used as the model. One such technique is the Deep Factor model [6]. In conventional quant investing, the fund manager handles everything from factor creation to factor selection based on experience. The Deep Factor model replaces this process with deep learning, with the aim of removing human judgment and capturing the non-linearity of individual factors.

This method predicts monthly returns from 80 factors, and the authors confirm that it outperforms predictions from linear models and other machine learning methods (SVR and random forest).

(Figure: From reference [6])

With machine learning used in this way, I believe it is relatively easy to surpass the traditional quant model. On the other hand, there are pitfalls, such as loss of interpretability as the model grows more complex, overfitting, and snooping bias, so building the model requires knowledge and intuition specific to finance. For techniques to deal with these issues, refer to the book Advances in Financial Machine Learning by Marcos Lopez de Prado, an advisor to Numerai [7].

In conclusion

In this article, I explained the concepts behind traditional quant investing, described how conventional risk factors can be incorporated as features, and explained how conventional quant methods and machine learning can be blended. Blending traditional quant methods with modern machine learning can further improve investment performance.

I also hope that learning to look at the market through the conventional quant lens makes readers more interested in real markets, which should make analysis in Numerai more enjoyable. I hope this article sparks your curiosity and inspires your models. Thank you for reading to the end.


I would like to thank the Numerai team for providing images and proofreading the text of this article.


[1] Barr Rosenberg, Vinay Marathe, "The prediction of investment risk: Systematic and residual risk", 1975
[2] Peter Bernstein, "Capital ideas: The improbable origins of modern Wall Street", 1992
[3] Barra global equity model handbook
[4] Richard Grinold, Ronald Kahn, "Active portfolio management", 1995
[5] Team Best Fitting, "Two Sigma Financial Modeling Code Competition, 5th Place Winners' Interview", 2017
[6] Kei Nakagawa, Takumi Uchida, "Deep Factor Model: Explaining deep learning decisions for forecasting stock returns with LRP", 2018
[7] Marcos Lopez de Prado, "Advances in financial machine learning", 2018
