This article is a Japanese translation of "[Numerai Tournament: Blending Traditional Quantitative Approach & Modern Machine Learning](https://towardsdatascience.com/numerai-tournament-blending-traditional-quantitative-approach-modern-machine-learning-67ebbb69e00c)", which was contributed to Medium's Towards Data Science.
Numerai is a crowdsourced hedge fund: it is managed on the basis of stock price forecasts submitted by a large, open pool of participants. Numerai holds a tournament in which participants compete on predictive performance. Participants build predictive models from the dataset Numerai provides and submit their predictions. They are ranked by predictive performance and paid accordingly (or, conversely, may have money collected from them). Numerai's investors include Howard Morgan, co-founder of Renaissance Technologies; Paul Tudor Jones, founder of Tudor Investment Corporation; and Union Square Ventures, a long-established US venture capital firm. The backers include experienced VCs and hedge fund veterans, and the dataset is supervised by an advisor specializing in financial machine learning. The total prize money paid to participants so far exceeds $34 million, and the project appears to be progressing well. (Image: Provided by Numerai)
I manage assets in Japanese stocks using a market-neutral strategy. A market-neutral strategy predicts the relative rise and fall of stock prices within the universe (the set of stocks eligible for investment) and combines long and short positions to aim for absolute returns that do not depend on overall market movements. I build the stock price forecasting model with machine learning, on a foundation of traditional quant methods and statistics. The results so far have been good, with a return of about 40%.
In this article, I share the knowledge I gained while building that investment model. I first explain the concepts of traditional quant investing and then discuss how to blend them with machine learning to build modern predictive models.
The Numerai dataset is obfuscated, and I have no inside information about it. The content of this article reflects my own perspective, based on my investment and modeling experience.
Research on forecasting stock returns has a long history. Let me first explain what the traditional quant approach is, starting from its background.
The prototype of today's quant investing is probably the risk model proposed by Barr Rosenberg. There are various views on this, but to learn the Wall Street history of this period, you should definitely read Peter Bernstein's book Capital Ideas (Japanese title: "The Thought Revolution of Securities Investment").
In the 1960s, Rosenberg devised a method for explaining the risk of individual companies with a set of factors, building on Markowitz's covariance model. He found that these risk factors were linked to the excess return of stock prices (the risk premium). In 1975, Rosenberg founded the consulting firm Barr Rosenberg Associates, which became known to investment managers around the world as BARRA.
Today, the BARRA model, provided by MSCI, is the best-known risk model. Other vendors' risk models include Axioma. There are several BARRA models; the BARRA Global Equity Model (GEM) is a risk model for equities in the major stock markets around the world. In this model, stock returns are decomposed into country factors, industry factors, risk factors, and stock-specific factors.
This decomposition is expressed as a multiple regression model, where Rn is the excess return of stock n (over the risk-free rate), x is the exposure of stock n to each factor (indexed by k, j, and i), f is the factor return, and en is the specific return. The key idea here is the factor return.
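The equation itself is not reproduced in this text. A form consistent with the definitions above would be the following, where assigning k, j, and i to the country, industry, and risk factors respectively is an assumption made here for illustration:

```latex
R_n = \sum_{k} x_{n,k}\, f_k + \sum_{j} x_{n,j}\, f_j + \sum_{i} x_{n,i}\, f_i + e_n
```

Each sum is the contribution of one factor group (exposure times factor return), and whatever the factors do not explain is left in the specific return.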
For simplicity, I use a single-factor model rather than a multi-factor model, and I use the structure of the Numerai dataset as a concrete example. The factor return is the regression coefficient f in a cross-sectional regression of r on x, where r is the target vector in eraX and x is the vector of featureA in eraX.
The factor return indicates how much return you can expect from betting on that risk factor in that universe. The factor exposure indicates how strongly a stock is exposed to the risk factor; the larger the exposure, the more the stock benefits from the factor return. This regression is a cross-sectional model for one specific period (eraX). In practice, the factor return is computed for each period (for example, monthly) and accumulated as a time series so that its behavior can be observed.
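As a rough illustration of this procedure, the sketch below estimates a single-factor return per era and accumulates it over time. It runs on synthetic data only: the regression form r = a + f·x + e (including the intercept a), the five-bin feature values, the noise level, and the value of `true_f` are all assumptions for the example, not properties of the real Numerai data.

```python
import numpy as np

rng = np.random.default_rng(0)


def factor_return(x, r):
    """OLS slope f of the cross-sectional regression r = a + f * x + e."""
    x_centered = x - x.mean()
    return float(x_centered @ (r - r.mean()) / (x_centered @ x_centered))


n_eras, n_stocks = 120, 500
true_f = 0.02  # assumed "true" factor return used to generate the toy data

factor_returns = []
for _ in range(n_eras):
    # featureA for one era: five bins, as in the Numerai feature format
    x = rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=n_stocks)
    # target for the same era: weak linear signal plus noise
    r = true_f * x + rng.normal(0.0, 0.1, size=n_stocks)
    factor_returns.append(factor_return(x, r))

factor_returns = np.array(factor_returns)
cumulative = factor_returns.cumsum()  # time series accumulated over eras

print("mean factor return per era:", factor_returns.mean())
print("cumulative factor return:", cumulative[-1])
```

Plotting `cumulative` against the era index gives the kind of cumulative factor-return curve discussed in the following paragraphs.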
Below is an excerpt of factor returns from the BARRA GEM material. A factor whose cumulative return trends steadily upward delivers stable returns to anyone who bets on it. Conversely, if it trends steadily downward, you can bet against the factor (swap long and short). As of 2020, few factor returns move strongly in one direction. Portfolios are therefore constructed with each stock's factor exposures in mind, so that bets are spread across a variety of factors. (Figure: Created by the author from the reference)
Because the factor return is a regression coefficient, it can be converted to a correlation using the volatilities of the dependent and explanatory variables. Let b be the regression coefficient of the explanatory variable x for the dependent variable y, σxy the covariance of x and y, and σx and σy the standard deviations of x and y. The correlation is then the factor-return regression coefficient rescaled by volatility and standardized to lie between -1 and 1.
Correlation is a very important indicator in risk models and, by extension, in active management theory, where it is called the Information Coefficient and serves as a measure of investor skill. I omit the details here; interested readers should refer to the best-known book on active management theory (Grinold and Kahn, Active Portfolio Management).
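Written out with these symbols, and writing ρ for the correlation (a symbol introduced here), the standard relationship between the regression slope and the correlation is:

```latex
b = \frac{\sigma_{xy}}{\sigma_x^{2}}, \qquad
\rho = \frac{\sigma_{xy}}{\sigma_x \, \sigma_y} = b \, \frac{\sigma_x}{\sigma_y}
```

Multiplying the slope by σx/σy removes the units of both variables, which is why the correlation is comparable across features and eras while the raw slope is not.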
Here I show the factor return (computed as a correlation) of each Numerai feature. Each is computed with a single-factor model, not a multi-factor model. From this figure you can see at a glance how each feature behaves and how much explanatory power it has.
Note that these factor returns include variation due to randomness. The following is a Monte Carlo simulation with Correlation = 0.0 and Correlation = 0.005 (100 trials each). You should always keep in mind that this much random variation occurs. Judging statistical significance from a sample of only about 120 periods is very difficult. Even so, the dexterity4 and dexterity7 features show the most pronounced factor returns.
Seen this way, it becomes clear why Numerai evaluates submissions by correlation. The predictions each tournament participant submits are themselves individual factors, ones that are more informative to Numerai than the existing features. Numerai is looking for excellent factor returns that participants have created independently. If the factor returns are good, Numerai may trade by simply combining them, or it may train a further model on the collected factors to improve performance.
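A minimal sketch of how such a per-feature, per-era correlation table could be computed with pandas is shown below. The column names (`era`, `target`, `feature_a`, `feature_b`) and the synthetic data are hypothetical stand-ins; the real tournament file has its own column names and far more features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_eras, n_stocks = 12, 400
features = ["feature_a", "feature_b"]  # hypothetical feature names

frames = []
for era in range(1, n_eras + 1):
    df = pd.DataFrame(
        rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=(n_stocks, len(features))),
        columns=features,
    )
    df["era"] = f"era{era}"
    # Toy target: driven by feature_a only, feature_b is pure noise.
    df["target"] = 0.2 * df["feature_a"] + rng.normal(0.0, 0.3, size=n_stocks)
    frames.append(df)
data = pd.concat(frames, ignore_index=True)

# One Pearson correlation per (era, feature): rows are eras, columns are features.
era_corr = pd.DataFrame(
    {
        era: group[features].corrwith(group["target"])
        for era, group in data.groupby("era", sort=False)
    }
).T

cumulative_corr = era_corr.cumsum()  # the curve usually plotted per feature
print(era_corr.mean())
```

In this toy setup `feature_a` accumulates steadily while `feature_b` wanders around zero, which is the visual pattern to look for when reading a factor-return chart.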
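The kind of simulation described here can be sketched as follows. The per-era dispersion `era_std` is an assumed value chosen only for illustration (the article does not state the one used for its figure), and per-era correlations are drawn from a normal distribution for simplicity.

```python
import numpy as np

rng = np.random.default_rng(42)
n_eras, n_trials = 120, 100
era_std = 0.03  # assumed per-era dispersion of correlation, not from the article


def simulate(mean_corr):
    """Cumulative correlation paths, one row per trial."""
    per_era = rng.normal(mean_corr, era_std, size=(n_trials, n_eras))
    return per_era.cumsum(axis=1)


paths_zero = simulate(0.0)     # no skill at all
paths_skill = simulate(0.005)  # small but real edge

# How often does a zero-skill path end above the expected end point of a skilled one?
expected_skill_end = 0.005 * n_eras
share_zero_beating = float((paths_zero[:, -1] > expected_skill_end).mean())

# t-statistic a true mean of 0.005 would produce on average over 120 eras
t_stat_skill = 0.005 / (era_std / np.sqrt(n_eras))

print("share of zero-skill paths above the skilled expectation:", share_zero_beating)
print("expected t-statistic for mean correlation 0.005:", round(t_stat_skill, 2))
```

Under these assumptions the expected t-statistic is below 2, which illustrates why 120 periods are not enough to separate a small genuine edge from luck with much confidence.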
In this chapter, I consider what it looks like to incorporate conventional risk factors as features for machine learning. The first and most important are the Country feature and the Industry feature.
Country Feature
Numerai's universe is thought to consist of stocks in the major markets around the world. In the Numerai tournament data, the ids of individual stocks are encrypted, so there is no way to confirm this. However, the list of target stocks has been published for Numerai Signals, so I aggregated it. Judging by the number of stocks, I suspect it is the same universe as the current Numerai tournament. Numerai Signals covers 41 countries; the US has the most stocks, followed by Japan, South Korea, and the United Kingdom. It is possible that these enter the features not simply as a Country but as a Region (North America, South America, Pacific, and so on).
In a normal risk model, the Country feature is introduced as a 0/1 categorical variable. However, Numerai's features are basically split into about five quantiles, and the number of stocks in each quantile is often the same. If I were building such a feature myself, I would therefore run a multiple regression of each stock's returns on the index of each Country (or Region) and bin the resulting beta into quantiles to form the feature.
For example, with this approach, Japanese stocks will have large betas to the TSE index and will cluster in the highest quantile of that feature (or the lowest, depending on the sign convention of the encoding). So, if a Country feature exists, the only informative quantile is the most extreme one, and the remaining quantiles carry essentially no information. In Numerai's analysis_and_tips there was a report that some features matter only at their extreme values of 0 or 1, and I think this is a plausible explanation.
For reference, the relative returns of each country since 2010 are shown.
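The beta-based construction described above can be sketched on synthetic data as follows. This is only a guess at one possible construction, not Numerai's actual method: the two country indexes, the share of "Japanese" stocks, and the noise levels are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
n_days, n_stocks, n_japan = 250, 200, 40

# Toy daily returns of two hypothetical country indexes: column 0 = JP, column 1 = US.
index_returns = rng.normal(0.0, 0.01, size=(n_days, 2))

# The first 40 stocks load on the JP index, the rest on the US index.
true_beta = np.zeros((n_stocks, 2))
true_beta[:n_japan, 0] = 1.0
true_beta[n_japan:, 1] = 1.0
noise = rng.normal(0.0, 0.01, size=(n_days, n_stocks))
stock_returns = index_returns @ true_beta.T + noise

# Multiple regression of every stock on all indexes at once (with an intercept).
X = np.column_stack([np.ones(n_days), index_returns])
coef, *_ = np.linalg.lstsq(X, stock_returns, rcond=None)
beta_jp = coef[1]  # estimated beta to the JP index, one value per stock

# Rank the betas into five equal-sized bins mapped to 0, 0.25, 0.5, 0.75, 1.
ranks = beta_jp.argsort().argsort()
feature_jp = (ranks * 5 // n_stocks) / 4

print("bin values of the Japanese stocks:", np.unique(feature_jp[:n_japan]))
```

All Japanese stocks land in the top bin, while the assignment of the remaining stocks to the lower four bins is driven by estimation noise, which matches the point that only the extreme quantile carries information.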
Industry Feature
The next most important is the Industry feature. In Stock Market Wizards, Steve Cohen states that 40% of a stock's price movement is determined by the market, 30% by its industry, and the remaining 30% by stock-specific factors. It is hard to imagine that this feature has not been incorporated. There are various industry definitions: BARRA GEM defines 38 industries, GICS defines 60 sectors, and RBICS, provided by FactSet, defines 12 Economies, 31 Sectors, and 89 Subsectors. For reference, the number of stocks by Economy in the US market is shown.
As with Country, Industry may be turned into features by quantizing the multiple-regression betas to industry indexes. Here too, the informative quantile is the most extreme one, and the other quantiles carry little information.
For reference, the relative returns of each industry in the US market since 2010 are shown.
Risk Index Feature
It is highly likely that the Risk Index features include those used in BARRA: size, value, success (momentum), and volatility. These can be used as they are, but they are often normalized to account for biases across groups such as Country and Industry. For size, not only market capitalization but also sales, total assets, and number of employees can be considered. For value, PBR, PER, PCFR, and so on can be considered. Other Risk Indexes include liquidity, growth, dividends, and financial leverage. In addition to these traditional Risk Indexes, alternative variables such as analyst information and sentiment indexes extracted from news can also be incorporated.
For reference, the relative returns of each Risk Index in the US market since 2010 are shown.
This chapter describes how machine learning can be used to improve performance over traditional quant methods.
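One common way to do the normalization mentioned above is to standardize a raw factor within each group. The sketch below z-scores a toy book-to-price ratio within industry; the three industries, the distribution of the ratio, and the helper name `zscore_within` are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_stocks = 300
industry = rng.integers(0, 3, size=n_stocks)  # three hypothetical industries

# Toy book-to-price ratio: higher-numbered industries look structurally "cheaper".
book_to_price = rng.lognormal(mean=-0.5 + 0.4 * industry, sigma=0.3)


def zscore_within(values, groups):
    """Standardize values to mean 0 and standard deviation 1 inside each group."""
    out = np.empty(len(values), dtype=float)
    for g in np.unique(groups):
        mask = groups == g
        out[mask] = (values[mask] - values[mask].mean()) / values[mask].std()
    return out


value_neutral = zscore_within(book_to_price, industry)

for g in np.unique(industry):
    raw_mean = book_to_price[industry == g].mean()
    print(f"industry {g}: raw mean {raw_mean:.2f}, "
          f"neutralized mean {value_neutral[industry == g].mean():.2f}")
```

After this step a stock counts as "cheap" only relative to its own industry, so the value feature no longer doubles as a hidden industry bet.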
The Barra model is simply a weighted average of individual risk factors. There is a simple and convenient way to take it a step further: consider interactions. To give a simple example, value works in some industries and not in others. Taking size rather than industry as the example, some factors tend to work for large-cap stocks and others for small-cap stocks. Likewise, the industries that outperform differ from country to country. Linear models are poorly suited to capturing such interactions, because every interaction term has to be specified by a human and added as a feature. Tree-based methods can learn interactions on their own, without being told which ones to look for. On the other hand, because tree-based methods partition the feature space into a grid of axis-aligned regions, they are not good at representing linear relationships, and therefore not good at capturing the risk premium itself that the original BARRA model describes.
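The limitation of a linear model without interaction terms can be seen in a small experiment. In the synthetic data below, value pays off only inside one hypothetical industry; the effect size and noise level are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 2000
value = rng.normal(size=n)                            # toy value exposure
in_industry_a = rng.integers(0, 2, size=n).astype(float)  # 1 if in hypothetical industry A

# Value earns a premium only inside industry A.
ret = 0.5 * value * in_industry_a + rng.normal(0.0, 0.5, size=n)


def r_squared(features, y):
    """In-sample R^2 of an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), features])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var()


# Main effects only: the model has to average the premium across both industries.
r2_main = r_squared(np.column_stack([value, in_industry_a]), ret)

# With the hand-crafted interaction term value * in_industry_a added.
r2_inter = r_squared(
    np.column_stack([value, in_industry_a, value * in_industry_a]), ret
)

print("R^2 without interaction term:", round(r2_main, 3))
print("R^2 with interaction term:", round(r2_inter, 3))
```

The linear model recovers the full effect only after a human supplies the interaction column; a tree-based model could find the same split on `in_industry_a` by itself.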
The solution is to ensemble and stack linear and tree models. In the Two Sigma competition held on Kaggle, an ensemble of Ridge regression (a linear model) and Extra Trees (a tree model) won one of the top prizes. (Figure: From the reference)
Deep learning is also sometimes used as the model, in a technique called the Deep Factor model. In conventional quant investing, the fund manager carries out everything from factor creation to factor selection on the basis of experience. The Deep Factor model replaces this process with deep learning, with the aim of removing the dependence on individual human judgment and capturing the non-linearity of the factors.
This method uses 80 factors to predict monthly returns, and the authors confirm that it outperforms predictions from linear models and other machine learning methods (SVR and random forest). (Figure: From the reference)
With machine learning used in this way, I believe it is relatively easy to surpass the traditional quant model. On the other hand, there are pitfalls, such as loss of interpretability as the model grows more complex, overfitting, and data-snooping bias, so building the model requires knowledge and intuition specific to finance. For techniques in this area, refer to the book Advances in Financial Machine Learning by Marcos Lopez de Prado, an advisor to Numerai.
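A minimal sketch of such a blend with scikit-learn is shown below. It is not the winning team's pipeline: the synthetic data, the hyperparameters, and the equal 50/50 weights are assumptions chosen to keep the example short.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
n, n_features = 3000, 5
X = rng.normal(size=(n, n_features))
# Toy target: one linear premium plus one interaction effect, buried in noise.
y = 0.3 * X[:, 0] + 0.5 * X[:, 1] * (X[:, 2] > 0) + rng.normal(0.0, 1.0, size=n)

X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

ridge = Ridge(alpha=1.0).fit(X_train, y_train)
trees = ExtraTreesRegressor(
    n_estimators=100, min_samples_leaf=20, random_state=0
).fit(X_train, y_train)

pred_ridge = ridge.predict(X_test)
pred_trees = trees.predict(X_test)
pred_blend = 0.5 * pred_ridge + 0.5 * pred_trees  # simple equal-weight ensemble


def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])


scores = {
    "ridge": corr(pred_ridge, y_test),
    "extra_trees": corr(pred_trees, y_test),
    "blend": corr(pred_blend, y_test),
}
print(scores)
```

Stacking would go one step further and fit a second-stage model on out-of-fold predictions instead of using fixed weights; whether the blend beats both components depends on the data, so it is worth checking on held-out eras.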
In this article, I explained the concepts of traditional quant investing, described how conventional risk factors can be incorporated as features, and explained how conventional quant methods and machine learning can be blended. Blending traditional quant methods with modern machine learning can further improve investment performance.
I also hope that learning to look at the market through conventional quant ideas makes readers more interested in actual markets, which should make analysis on Numerai more enjoyable. I hope this article sparks your curiosity and inspires your models. Thank you for reading to the end.
I would like to take this opportunity to thank the Numerai team for providing images and proofreading the text.
Barr Rosenberg, Vinay Marathe, "The prediction of investment risk: Systematic and residual risk", 1975
Peter Bernstein, "Capital ideas: The improbable origins of modern Wall Street", 1992
Barra global equity model handbook
Richard Grinold, Ronald Kahn, "Active portfolio management", 1995
Team Best Fitting, "Two Sigma Financial Modeling Code Competition, 5th Place Winners’ Interview", 2017
Kei Nakagawa, Takumi Uchida, "Deep Factor Model: Explaining deep learning decisions for forecasting stock returns with LRP", 2018
Marcos Lopez de Prado, "Advances in financial machine learning", 2018