[PYTHON] Stock price forecasting with machine learning: the true Numerai Signals

Introduction

This is a follow-up to my previous article. The specifications for Numerai Signals, which was previously in beta, are now almost finalized. They differ substantially from the beta version, in which participants predicted stock returns and competed on Sharpe ratio: the new specification is very demanding and asks participants to find an original Signal that no one has seen before. I consider this the most advanced financial data tournament in the world, and I would like to explain why while going through each part of the specification. I am publishing this as a new article under the title "True Numerai Signals".

This article is intended for readers who have already participated in the Numerai Tournament and assumes that prior knowledge.

Numerai Signals specifications

Signals overview

The Signals documentation is here (https://docs.numer.ai/numerai-signals/signals-overview). Unlike the Tournament, Signals is not about predicting the rise and fall of stock prices in markets around the world; it aims to find original investment indicators (Signals) that no one has ever seen. The participants' ultimate goal is to have a data-driven hedge fund like Numerai "buy" the Signal they submit. Participants access various data sources around the world, find features rich in alpha, and extract from them Signals with high predictive performance and originality. The Signal then becomes part of the hedge fund's brain. What an exciting endeavor.

But a half-baked Signal will not be accepted. Hedge funds, of course, do not want Signals generated from known information. The Signals specification is designed to make you search for a Signal of your own. Let's go through it.

Target assets

Numerai Signals covers stocks in markets around the world, approximately 5,200 in total at the time of writing. The list changes daily, but most stocks carry over, and only illiquid ones are replaced. The latest list is available here (https://numerai-quant-public-data.s3-us-west-2.amazonaws.com/example_predictions/latest.csv).

For reference, I tabulated how many stocks belong to each market. The largest is the US market, with more than 2,000 stocks, followed by the Japanese, Korean, and London markets.

Participants do not have to submit forecasts for all of these stocks. A submission covering at least 100 stocks qualifies for evaluation. However, stocks without a forecast are uniformly assigned the median, which lowers predictive performance when measured across the whole universe. If you want high performance, you should submit forecasts for as many stocks as possible (the effect of missing forecasts is discussed later).

02.png

About data acquisition

Participants must collect the data needed to forecast these stocks themselves. Numerai Signals is a platform for users who have already built their own forecasting system and have access to market data. Quandl is the official data source that the organizers use to evaluate participants' predictive performance. Other data sources include Quantopian and Alpaca. The Numerai Forum shares a list of inexpensive data sources, which is worth consulting. I currently use Yahoo Finance.

The Signals example model also includes a pipeline that downloads stock prices from Yahoo Finance. That is worth consulting as well.

Submission time schedule

Signals runs in weekly rounds. A round opens at 18:00 UTC on Saturday (03:00 on Sunday, Japan time), and the submission deadline is 14:30 UTC on the following Monday (23:30 on Monday, Japan time, the same as the submission deadline for the Numerai Tournament). The forecast window runs from Tuesday's closing price to the following Monday's closing price in each country's market. In other words, counting from the weekend on which the round opened, it is the return over the next 6 business days minus the first 2 days. This lag allows for the time needed to build a portfolio, but in short, Numerai wants alpha with little time decay. Predictive power that lasts only a very short time is of no value. This point alone makes the specification demanding.
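The conversion between the UTC times above and Japan time can be checked with a short sketch. The two dates are arbitrary examples, chosen only because they fall on a Saturday and the following Monday; JST is modelled as a fixed UTC+9 offset.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
JST = timezone(timedelta(hours=9), "JST")  # Japan has no daylight saving time

# Example dates only: 2020-11-07 is a Saturday, 2020-11-09 the following Monday.
round_open = datetime(2020, 11, 7, 18, 0, tzinfo=UTC)
deadline = datetime(2020, 11, 9, 14, 30, tzinfo=UTC)


def describe_in_jst(moment):
    """Return the weekday and time of `moment` as seen in Japan."""
    return moment.astimezone(JST).strftime("%A %H:%M")


round_open_jst = describe_in_jst(round_open)
deadline_jst = describe_in_jst(deadline)
submission_window = deadline - round_open  # time available to submit

print(round_open_jst, "/", deadline_jst, "/", submission_window)
```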

01.png

Orthogonalization of prediction results

Signals is looking for entirely new Signals that do not correlate with existing factors or Signals. This is achieved by neutralizing the submitted predictions against known factors and Signals.

Think of the submitted Signal as an N-dimensional vector. If it is orthogonalized against a known factor, its correlation with that factor becomes 0 while as much as possible of the information (the linear relationships) in the original Signal is preserved. In other words, the component that is original with respect to the known factor is extracted. An easy-to-understand two-dimensional (N = 2) example is shown below. The correlation coefficient between the submitted Signal and the known factor corresponds to the angle between the two vectors (to be exact, to cosθ). The correlation can be set to 0 (that is, cosθ' = 0) by making the Signal vector perpendicular to the known factor vector, as follows.
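In symbols, this is the usual vector projection. Writing s for the submitted Signal and f for the known factor, and assuming both have been centered to mean zero (an assumption made here so that the dot product lines up with the correlation coefficient), the relationship can be sketched as:

$$
\cos\theta = \frac{s \cdot f}{\lVert s \rVert \, \lVert f \rVert},
\qquad
s_\perp = s - \frac{s \cdot f}{f \cdot f}\, f,
\qquad
s_\perp \cdot f = 0 .
$$

Here s_⊥ is the neutralized Signal, and the last equality is what makes cosθ' = 0.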

04.png

Importantly, this orthogonalization can be applied to several vectors at once. Imagine a three-dimensional space: from the Signal vector we can extract the component perpendicular to the plane spanned by known factor vectors 1 and 2. In general, an N-dimensional vector can be made orthogonal to up to N-1 linearly independent vectors. Since the dimension of a Signal is about 5,000, orthogonalization against at least several thousand factors at the same time is possible.

The Signals specification states that submitted forecasts are orthogonalized against the Barra factors, country, industry, and all other proprietary factors owned by Numerai. This orthogonalization probably removes not only the linear information of individual factors but also the predictions of models built on known factors. By training non-linear models such as tree-based models and neural networks on the features it holds, Numerai can remove every component that simple modeling of known information would generate.
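Orthogonalizing against several factors at once can be sketched in plain Python with a Gram-Schmidt procedure. This is only an illustration on synthetic data, not Numerai's actual neutralization code; the function names, the factor weights, and the sample size of 500 are all assumptions made for the example.

```python
import random


def center(values):
    """Subtract the mean so that dot products behave like covariances."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def correlation(a, b):
    a, b = center(a), center(b)
    return dot(a, b) / (dot(a, a) * dot(b, b)) ** 0.5


def neutralize(signal, factors):
    """Return the part of `signal` that is orthogonal to every vector in `factors`."""
    basis = []  # orthonormal basis of the space spanned by the factors
    for factor in factors:
        vec = center(factor)
        for b in basis:
            coef = dot(vec, b)
            vec = [x - coef * y for x, y in zip(vec, b)]
        norm = dot(vec, vec) ** 0.5
        if norm > 1e-12:  # skip factors that add no new direction
            basis.append([x / norm for x in vec])
    residual = center(signal)
    for b in basis:
        coef = dot(residual, b)
        residual = [x - coef * y for x, y in zip(residual, b)]
    return residual


random.seed(0)
n_stocks = 500  # hypothetical universe size, kept small for speed
factor_1 = [random.gauss(0, 1) for _ in range(n_stocks)]
factor_2 = [random.gauss(0, 1) for _ in range(n_stocks)]
own_idea = [random.gauss(0, 1) for _ in range(n_stocks)]

# A submitted Signal that mixes two known factors with something original.
signal = [0.6 * a + 0.3 * b + 0.5 * c
          for a, b, c in zip(factor_1, factor_2, own_idea)]
neutral = neutralize(signal, [factor_1, factor_2])

print("before:", round(correlation(signal, factor_1), 3))
print("after: ", round(abs(correlation(neutral, factor_1)), 3))
```

After neutralization the correlation with both factors is zero up to floating-point error, while the neutralized vector remains strongly correlated with the original component.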

Again, in Signals, submitted predictions are orthogonalized using all of Numerai's information prior to evaluation.

Predictive target

The prediction target is likewise built from market returns that have been orthogonalized in advance against all the information Numerai holds. Naturally, it is never shared with users. The target of Signals is a black box.

Because the target is not available, users normally cannot judge whether their predictions are any good. To help with this, if you submit your past predictions (validation) together with the latest predictions (live), you receive an evaluation on historical data. The historical evaluation period runs from January 4, 2013 to February 28, 2020. Naturally, this historical evaluation is for reference only, and it is better not to put effort into improving it. Overfitting is a concern, and above all, it is noted that results that were good in the past tend to deteriorate in the future.

07.png

Forecast evaluation and leaderboard

Forecasts are evaluated as follows. First, Numerai orthogonalizes the predictions submitted by the participant against all of its information. It then calculates the correlation coefficient COR between the result and Numerai's custom target (which is also orthogonalized). In active portfolio theory this correlation coefficient is called the information coefficient (IC), and it is taken as the Signal's predictive power.

The leaderboard ranking uses the average COR over the last 20 rounds (that is, 20 weeks).

08.png

Reward system

As for the reward system, participants receive (or lose) their stake multiplied by 2 * COR. In the Tournament, the average COR of the top performers was about 0.03 (about 3%). COR in Signals is expected to be lower than this, which is why it is multiplied by a factor of two. For example, if the average weekly COR in Signals is 0.015, an average weekly profit of 3% on the stake is expected. If such performance could be sustained, the annual return would be a sizable 156% with simple interest and 365% with compounding.

Like the Tournament, Signals also has MMC rewards. MMC stands for Meta Model Contribution; put simply, it measures how original your predictions are relative to those of other participants. In the COR calculation described in the previous section, the information held by Numerai is removed in advance, whereas in MMC the predictions submitted by other participants are removed. The meta model of all participants (here, the stake-weighted average of the Signals after neutralization) is used for this removal.

MMC rewards are optional. MMC is a very strict specification: those who have already searched for originality compete for still more of it.

09.png
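The arithmetic behind these figures can be reproduced in a few lines. The sketch assumes 52 rounds per year and a constant weekly COR of 0.015, which is of course an idealization; the function name is made up for the example.

```python
WEEKS_PER_YEAR = 52  # assumption: one round per week, every week


def weekly_payout_rate(cor, multiplier=2.0):
    """Weekly payout as a fraction of the stake: multiplier * COR."""
    return multiplier * cor


weekly_rate = weekly_payout_rate(0.015)

# Simple interest: the same payout on the original stake every week.
simple_annual = weekly_rate * WEEKS_PER_YEAR

# Compound interest: each week's payout is added to the stake.
compound_annual = (1 + weekly_rate) ** WEEKS_PER_YEAR - 1

print(f"weekly {weekly_rate:.1%}, simple {simple_annual:.0%}, "
      f"compound {compound_annual:.0%}")
```

Both annual figures are gains on top of the original stake, and a negative COR flips the sign of the weekly rate in the same way.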

Points to keep in mind in Signals

Missing prediction

Since Signals covers more than 5,000 stocks, data will naturally be unavailable for some of them, and many participants will want to restrict the stocks they predict. If you are working with alternative data in the first place, collecting it for every stock is impossible. Participants must predict at least 100 stocks, but the missing values are then uniformly replaced by the median, and the COR value deteriorates. I estimated the size of this effect.

10.png

The figure on the right shows a random simulation of how COR changes when 50% of the submitted predictions are missing. The slope of the regression line is 0.715, so with 50% missing the COR is about 0.7 times its value when all predictions are submitted. In the figure on the left, the horizontal axis is the missing rate; it confirms that COR deteriorates gradually as more predictions are missing.
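A simulation of this kind can be sketched as follows. It is not the code behind the figures: the data are synthetic, the signal is deliberately much stronger than a realistic one so that a single run gives a stable ratio, COR is taken to be a plain Pearson correlation, and missing values are filled with the median of the submitted predictions. All of these are assumptions made for illustration.

```python
import random
import statistics


def pearson(a, b):
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def fill_missing_with_median(predictions, missing_rate, rng):
    """Drop a random share of predictions and fill the gaps with the median."""
    n = len(predictions)
    missing = set(rng.sample(range(n), int(n * missing_rate)))
    submitted = [p for i, p in enumerate(predictions) if i not in missing]
    median = statistics.median(submitted)
    return [median if i in missing else p for i, p in enumerate(predictions)]


rng = random.Random(0)
n_stocks = 5000  # roughly the size of the Signals universe

# Synthetic target and a noisy prediction of it.
target = [rng.gauss(0, 1) for _ in range(n_stocks)]
predictions = [t + rng.gauss(0, 1) for t in target]

full_cor = pearson(predictions, target)
half_cor = pearson(fill_missing_with_median(predictions, 0.5, rng), target)
ratio = half_cor / full_cor

print(f"full {full_cor:.3f}, 50% missing {half_cor:.3f}, ratio {ratio:.3f}")
```

Under these assumptions the ratio is expected to be close to the square root of the submitted share, sqrt(0.5) ≈ 0.71, because a constant fill halves the covariance with the target but also halves the variance of the prediction. That is in line with the slope of 0.715 reported above.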

In conclusion,

- If predictions are missing, the magnitude of COR shrinks.
- This happens on both the positive and the negative side (that is, losses also shrink on the negative side).
- In other words, there is no asymmetry; from the reward perspective, the original COR is simply de-leveraged.
- On the other hand, the absolute value of the resulting COR is lower, which is a disadvantage when aiming for the top of the leaderboard.

Participants therefore need to decide carefully how many stocks to predict.

How to set the target

The target of Signals is a black box. So what should participants use as their target? At the very least, simple stock price movements should not be the target. Most of the predictable part of stock price movements is made up of known factors (especially the influence of the market and the industry). In other words, targeting a simple return produces a model that correlates well with known factors, which results in a low score in Signals. The documentation also notes that a signal strongly correlated with ordinary returns is likely to be evaluated poorly.

This means that, to build a predictive model, participants must first create their own custom target. In practice this is a daunting task. Any factor you fail to remove when creating the custom target will be removed on Numerai's side after you submit. So however well your own model predicts, there is a good chance that little more than residue remains by the time it is scored.

So how to approach it

The conclusion I have reached so far is not to build a predictive model at all. Instead, think of alternative data that Numerai is unlikely to have, and structure it. Then simply submit it. If you cannot create a target in the first place, you cannot do proper modeling, let alone rejoice or despair over the results. I do submit the validation data as well, but the evaluation that comes back is only a reference value, and even if it is bad I keep submitting without worrying about it.

Here is an example. One kind of alternative data that is easy to come up with is image recognition and feature extraction on price charts. Prepare tens of thousands of charts and extract features by unsupervised learning. Then compress the chart features of each stock to one dimension and submit that value as the prediction. Whether it has predictive power for the stock price movement itself does not matter; it is enough if it relates in some way to the residual information (the custom target) that Numerai is predicting.

I will keep submitting whatever alternative data I can think of. That should be what Numerai, which wants to collect a wide variety of data, appreciates most, and if some of it proves useful to Numerai, an offer to buy it will surely follow.
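The "compress to one dimension and submit" step might look like the sketch below. It does not cover chart image processing at all: the feature rows are random stand-ins for features that some unsupervised extractor would produce, the first principal component (found by power iteration) is just one possible way to reduce them to a single number, and the rank scaling into the open interval (0, 1) is an assumption about what a submission should look like. All names are hypothetical.

```python
import random


def first_component_scores(rows, iterations=100):
    """Project each row onto the first principal component (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    direction = [1.0 / d ** 0.5] * d  # arbitrary starting direction
    for _ in range(iterations):
        projected = [sum(x * w for x, w in zip(row, direction)) for row in centered]
        updated = [sum(p * row[j] for p, row in zip(projected, centered))
                   for j in range(d)]
        norm = sum(x * x for x in updated) ** 0.5
        direction = [x / norm for x in updated]
    # The sign of a principal component is arbitrary, so scores may be flipped.
    return [sum(x * w for x, w in zip(row, direction)) for row in centered]


def to_unit_interval(scores):
    """Rank-scale scores so that every value lies strictly between 0 and 1."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    scaled = [0.0] * len(scores)
    for rank, index in enumerate(order):
        scaled[index] = (rank + 0.5) / len(scores)
    return scaled


rng = random.Random(0)
loadings = [1.0, 0.8, -0.5, 0.3]  # made-up structure shared by the features
tickers = [f"STOCK_{i:03d}" for i in range(200)]  # hypothetical identifiers

# Stand-in for the features extracted from each stock's chart.
features = []
for _ in tickers:
    latent = rng.gauss(0, 1)
    features.append([latent * w + rng.gauss(0, 0.3) for w in loadings])

signal = to_unit_interval(first_component_scores(features))
submission = dict(zip(tickers, signal))
```

In practice the compression method matters less than the data itself; any reduction that preserves the ordering information in the features could take the place of the principal component here.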

Incentive to participate in Signals

To be honest, the incentive from stake rewards in Signals is small. I have no idea how my submission will be evaluated in the first place, and it makes more sense to earn returns directly from the market using known factors than to search for esoteric Signals.

I think the incentive to participate in Signals is the "honor" of being part of a hedge fund once you have submitted a good Signal.

Conclusion

In this article I explained how strict the specifications of Numerai Signals are and described my thinking on strategy. Again, Signals is not about predicting stock prices; it is about discovering unknown data from somewhere in the world.

Hedge funds around the world are searching for alternative data. Signals is more than just a financial data tournament: taking part means joining hedge funds around the world in that search. Signals is a platform that lets data scientists around the world explore alternative data and have it evaluated automatically.

That is why I think Signals is the most advanced financial data tournament. The bar is surprisingly high, of course, but if you strike a vein of gold, the highest honor awaits you.

Let's embark on a journey to find Signals buried all over the world.
