[PYTHON] Stock price forecasting with machine learning: getting started with Numerai

Introduction

The general public may not have heard of it, but there is a hedge fund called "Numerai". It became somewhat known in certain circles after being featured in media such as Wired and Forbes from the second half of 2016 to the first half of 2017. It is a so-called crowdsourced fund, managed on the basis of stock price forecasts submitted by a large number of anonymous participants.

I also participated in Numerai around 2017. Numerai runs a tournament in which participants are ranked by their prediction results, much like Kaggle. Tournaments are held weekly, and top-ranked participants are paid in cryptocurrency. However, in the tournament at that time the ranking criteria were unclear and the rankings fluctuated wildly (for example, dropping below 100th place the week after being in the top 10), so I quit almost immediately because it felt like a game of luck.

Recently, for various reasons, I took a look at Numerai for the first time in about three years, and found that the tournament specifications had become far more sophisticated than they were in 2017. Downloading the dataset and examining the features turned out to be quite interesting. I thought this would be a great test of my skills as someone working in financial ML, so I holed up for about a week and wrestled with the dataset.

I consider Numerai's datasets and tournaments to be very useful teaching material for anyone interested in financial ML, and I would like to take this opportunity to introduce them to readers.

About Numerai

Let me explain a little more about Numerai. For an overview, please first refer to the past blog post that I wrote in 2017. Building on that, the features of Numerai can be summarized as follows.

--Numerai performs meta-learning on the predictions collected from participants and invests based on the result.
--Numerai's dataset is obfuscated, so participants do not know what the features represent or which stocks are being predicted.
--Only predictions are submitted to Numerai; the prediction model itself does not need to be submitted. Because participants' intellectual property is protected, Numerai can collect a large number of predictions.
--Marcos López de Prado, author of "Advances in Financial Machine Learning" and head of AQR's ML division, has been appointed as an advisor.
--Numerai is also funded by a well-known hedge fund, Renaissance Technologies.
--Numerai's own assets under management and returns are not disclosed, but the total prize money paid to participants so far has exceeded 2.5 billion yen, which suggests that the fund is performing well.

Tournament specifications

Now let's take a look at the tournament specifications. Numerai tournaments are held weekly, and a new dataset becomes available for download over the weekend. The submission deadline is 23:30 (Japan time) on the following Monday. The procedure for creating the predictions to upload is as follows.

Dataset download

When you log in, there is a download button on the left side of the screen. Just press this. The downloaded Zip file contains the following datasets:

training_data is the dataset for training and tournament_data is the dataset for prediction. Both are provided as CSV, but they are too large to open comfortably in Excel. For reference, training_data has (310 features + α) columns x about 500,000 samples and is about 770MB, while tournament_data has (310 features + α) columns x about 1.7 million samples and reaches about 2.6GB. The contents of the dataset are discussed in more detail in the sections below.

Model creation and submission of predictions

Python or R is appropriate for building the model, and sample code is provided for both. Sample code is also included in the dataset download. For reference, the Python version of the code provided on Numerai's website is listed below.

Just upload the CSV created by the following procedure to Numerai's website. It's easy.

import pandas as pd
from xgboost import XGBRegressor

# Read the csv file into a pandas Dataframe
training_data = pd.read_csv("numerai_training_data.csv").set_index("id")
tournament_data = pd.read_csv("numerai_tournament_data.csv").set_index("id")
feature_names = [f for f in training_data.columns if "feature" in f]

# train a model to make predictions on tournament data
model = XGBRegressor(max_depth=5, learning_rate=0.01, \
                     n_estimators=2000, colsample_bytree=0.1)
model.fit(training_data[feature_names], training_data["target"])

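# write the predictions to a CSV file for upload to numer.ai
# model.predict returns a NumPy array, so wrap it in a Series indexed by id
# (the column name "prediction" is an assumption; use the name the submission format requires)
predictions = model.predict(tournament_data[feature_names])
pd.Series(predictions, index=tournament_data.index, name="prediction").to_csv("predictions.csv", header=True)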

Model evaluation

Accuracy and log loss are commonly used to evaluate general machine learning models, whereas Numerai tournament models are evaluated by Rank Correlation. This is a very good choice, because skill in trading is measured by the Information Coefficient, that is, the correlation coefficient between the forecast target and the investment signal used for forecasting. This idea comes from traditional active management theory; if you are interested, please refer to my past blog post. Rank Correlation can be calculated with the following code, which is also provided by Numerai.

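import numpy as np
# assumes a "prediction" column holding the model output has been added to training_data
ranked_prediction = training_data["prediction"].rank(pct=True, method="first")
correlation = np.corrcoef(training_data["target"], ranked_prediction)[0, 1]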
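
Because the data is organized by era, it can be more informative to compute the same Rank Correlation separately for each era and then look at the average and the spread. The following is only a minimal sketch on synthetic data: the toy DataFrame, its column values, and the helper name rank_correlation are assumptions made for illustration, not part of Numerai's sample code.

```python
import numpy as np
import pandas as pd

def rank_correlation(target, prediction):
    """Pearson correlation between the target and the percentile-ranked prediction."""
    ranked = prediction.rank(pct=True, method="first")
    return np.corrcoef(target, ranked)[0, 1]

# Hypothetical toy data standing in for training_data with a "prediction" column.
rng = np.random.default_rng(0)
n = 3000
toy = pd.DataFrame({
    "era": rng.choice(["era1", "era2", "era3"], size=n),
    "target": rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=n),
})
# The prediction is a noisy copy of the target, so a positive correlation is expected.
toy["prediction"] = toy["target"] + rng.normal(scale=2.0, size=n)

# One Rank Correlation value per era.
per_era = pd.Series({
    era: rank_correlation(group["target"], group["prediction"])
    for era, group in toy.groupby("era")
})
print(per_era)
print("mean over eras:", per_era.mean())
```

With real data, toy would be replaced by training_data (or the validation rows of tournament_data) after adding the model's predictions as a column.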

Leaderboard ranking

Leaderboards are ranked by reputation. Reputation is the average Rank Correlation of forward predictions over the last 100 days. (Note: the tournament specification was updated recently. Under the old specification it was the average over the last 20 tournaments. The documentation on this does not appear to have been updated yet.)

There is another metric on the leaderboard called MMC (Meta Model Contribution). This is the correlation value with the metamodel actually operated by Numerai. This metric was only just implemented in the latest update; at the time of publication it is displayed but does not yet seem to affect anything. (Figure: 01.png)

Reward system

Now I will explain Numerai's reward system, which is probably the topic of most interest to readers. The first major premise is that Numerai's rewards are paid in Numerai's own cryptocurrency, NMR (Numeraire). If at this point you feel that cryptocurrency is hard to understand and a hassle, feel free to skip this chapter.

Another major premise is that you need to stake your NMR in order to receive rewards. I will explain this in a little more detail.

About staking

Talking in terms of NMR may put some readers off, so here I will explain in yen.

Suppose a participant has produced predictions they are confident in and wants to earn a reward from them. In that case, they bet, say, 10,000 yen on their own predictions. If the predictions turn out well, they receive a reward in proportion to the 10,000 yen bet. Conversely, if the predictions turn out badly, part of the 10,000 yen is collected by the operator.

This is a cumbersome mechanism, but on reflection it is unavoidable. If users bore no risk, someone could, for example, create about 1,000 accounts, keep submitting arbitrary predictions, and collect rewards whenever one of them happened to do well by chance. Staking is meant to prevent this. In addition, since the stake size reflects the participant's confidence, it is very important information for Numerai when building the metamodel.

The reward for staking works as follows. Whether you receive a reward or have part of your stake collected is determined by the Rank Correlation in that tournament round. Both rewards and collections are capped at 25% of the stake. (Figure: 02.png)

More specifically, the Rank Correlation in each tournament round typically falls between -0.15 and +0.15 (that is, the collection or reward falls within -15% to +15% of the stake). The variation between rounds is large, but for a skilled participant it averages out to about 0.03. In other words, over the long run you can expect a reward of about +3% per round.
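
The arithmetic above can be written down in a few lines. This is only a sketch of the rule as described in this article (payout proportional to the round's Rank Correlation, capped at 25% of the stake); the function name round_payout is hypothetical and the real settlement logic may differ in detail.

```python
def round_payout(stake, correlation, cap=0.25):
    """Reward (positive) or collection (negative) for one round.

    The payout is stake x correlation, with the correlation clipped
    to the range [-cap, +cap] as described in the text.
    """
    clipped = max(-cap, min(cap, correlation))
    return stake * clipped

print(round_payout(10_000, 0.03))   # a typical round for a skilled participant
print(round_payout(10_000, -0.15))  # a bad round: part of the stake is collected
print(round_payout(10_000, 0.40))   # beyond the cap: limited to 25% of the stake
```

For a 10,000 yen stake, a correlation of 0.03 therefore corresponds to a reward of roughly 300 yen for that round.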

Daily bonus

"Only 3%?" many readers may think. There is another way to earn large rewards with Numerai: the ranking. If you rank high on the leaderboard, you receive a daily bonus in proportion to your stake: 5% at the top of the ranking and 4% in the top 10. If you can stay in the top 10, you can enjoy a huge yield of 4% per day. However, the following conditions apply.

--Payouts are made 5 days a week
--Earnings can be compounded every Thursday
--Payouts apply from 100 days after staking (after 100 days of forward predictions)

Of course, there is also an upper limit on the total amount Numerai pays to all participants: 250 NMR per day (about 450,000 yen at the current market price). When the total exceeds this, rewards are scaled down in proportion to the amount each participant would otherwise receive, so that the total payment stays within the limit. If you have the nerve to keep re-staking and stay at the top of the ranking, earning tens of thousands of yen per day is quite possible. A reasonable target is about 1 million yen per month, which is plenty as side income for a data scientist.

However, note that NMR is a low-credibility cryptocurrency, a so-called "grass coin". Its price fluctuates so much that it could become worthless at any time. Although it is listed on cryptocurrency exchanges, liquidity is low, so you may not be able to convert a given amount into cash when you want to. Cryptocurrencies are also disadvantageous from a tax perspective, and if you make a mistake in a transfer, the funds are lost and cannot be recovered. It is important to enjoy this with surplus funds and at your own risk. Whatever you do, do not expect to make money from trading NMR itself. (Figure: 03.png)
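
To make the daily cap concrete, here is a small sketch of proportional scaling. It reflects my reading of the rule described above, not a confirmed implementation: the function name scale_bonuses, the user names, and the stake amounts are all hypothetical.

```python
def scale_bonuses(raw_bonuses, daily_cap=250.0):
    """Scale daily bonuses proportionally so that their total does not exceed daily_cap (in NMR)."""
    total = sum(raw_bonuses.values())
    if total <= daily_cap:
        return dict(raw_bonuses)
    factor = daily_cap / total
    return {user: bonus * factor for user, bonus in raw_bonuses.items()}

# Hypothetical stakes (NMR) and bonus rates: 5% for the top rank, 4% for the rest of the top 10.
stakes = {"user_a": 3000.0, "user_b": 2000.0, "user_c": 5000.0}
rates = {"user_a": 0.05, "user_b": 0.04, "user_c": 0.04}

raw = {user: stakes[user] * rates[user] for user in stakes}
paid = scale_bonuses(raw)

print("raw total :", sum(raw.values()))
print("paid total:", sum(paid.values()))
print(paid)
```

In this toy case the raw bonuses exceed the cap, so every participant's bonus shrinks by the same factor while the ratios between participants are preserved.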

Dataset

Next, let me explain the contents of the dataset. The dataset has been supervised by Marcos López de Prado since 2019. The Japanese translation of his book, released in December 2019 under the title "Finance Machine Learning", sold remarkably well on Amazon, so many readers have probably seen it. Under his supervision, the total number of features was expanded from about 40 to about 300. He has also published a paper on Numerai's dataset (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3454234). Those who are interested should read it.

About items

The items of the dataset are as follows. The number of samples is about 500,000 for training_data and about 1.7 million for tournament_data. (Figure: 04.png)

--id is an identifier assigned individually to each sample, with no duplicates across all samples. It is thought to be an encrypted combination of the date and the stock.
--era indicates the time period of the data. training_data contains 120 periods, era1 to era120. Since the stock market changes over time, EDA must take the period structure of the dataset into account. There is no explanation of whether era1 to era120 are in chronological order. Also note that the number of samples differs from era to era.
--data_type is only train in training_data. In tournament_data there are three types: validation, test, and live. validation is data for model validation, and target is provided. test is the set used by Numerai for scoring, and target is NaN. live is the ongoing market data, and naturally target is NaN. The number of features is different for each.
--The total number of features is 310. Each feature is given a group name: intelligence, charisma, strength, dexterity, constitution, or wisdom. These are taken from the character attributes of the RPG Dungeons & Dragons.
--target is the label used for supervised training.

About the numerical values of feature and target

As you may have noticed from the data above, the feature and target values are discrete. There are five possible values, 0, 0.25, 0.5, 0.75, and 1; in short, this is the quintile data that is common in quantitative investing. The histograms for each era look as follows. Some features have the same number of samples in each quantile and some do not. The original series behind features such as feature_constitution1 in the figure below may have a fat-tailed distribution, or may be a categorical variable in which some values are extremely rare. We proceed with EDA while keeping such possibilities in mind. (Figure: 05.png)

About the period structure

Next, let's look at how the number of samples changes from era to era. The number of samples per era varies considerably, as shown below. Correlations and other statistics need to be tracked in chronological order. Since the number of samples changes almost continuously from one era to the next, it is highly likely that era1 to era120 are arranged in chronological order. (Figure: 06.png)
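
The two checks described in this chapter (samples per era, and the share of each quintile value within an era) take only a few lines of pandas. The sketch below runs on a small synthetic frame; the era sizes and the random values are made up, and only the column names era, feature_constitution1, and target follow the dataset described in this article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
levels = [0.0, 0.25, 0.5, 0.75, 1.0]
eras = [f"era{i}" for i in range(1, 6)]
sizes = [200, 220, 250, 270, 300]  # hypothetical number of samples per era

# Build a toy frame that mimics the layout of training_data.
frames = []
for era, size in zip(eras, sizes):
    frames.append(pd.DataFrame({
        "era": era,
        "feature_constitution1": rng.choice(levels, size=size),
        "target": rng.choice(levels, size=size),
    }))
toy = pd.concat(frames, ignore_index=True)

# Sort eras by their number rather than alphabetically ("era10" would otherwise come before "era2").
era_number = toy["era"].str.replace("era", "", regex=False).astype(int)

# Number of samples in each era.
samples_per_era = toy.groupby(era_number).size()
print(samples_per_era)

# Share of each quintile value within each era for one feature (each row sums to 1).
quintile_share = pd.crosstab(era_number, toy["feature_constitution1"], normalize="index")
print(quintile_share.round(2))
```

On the real training_data, plotting samples_per_era as a line chart is a quick way to see whether the era sizes change smoothly.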

Further EDA

The above covers the basics; proceeding further with EDA reveals various things. Compared to 2017, the dataset is very sophisticated and interesting. "As expected of something supervised by Prado. This tournament should not be a game of luck." That is what I have come to believe.

Tournament participation results

I will participate in Numerai starting from the next round, ROUND 208. The model has already been built and the validation results were good. The reputation used for ranking is based on the forward predictions of the last 100 days, with -0.1 applied uniformly for days without data. Therefore, new participants start with a reputation of -0.1 and gradually move up the ranking as forward prediction results accumulate.

This chapter will be updated as results come in.

(Added on April 19, 2020) I submitted the predictions for ROUND 208 as scheduled. The Validation Correlation is 0.034, which is a reasonable value. Incidentally, I staked all the NMR I have: 26 NMR (about 60,000 yen at the market price as of April 19), which is the reward I received when I participated in 2017. Again, the payout will be applied after 100 days.

(Added on April 24, 2020) The result of ROUND 208, my first submission, was good. While the top 10 users all had a negative ROUND CORRELATION, my model managed to finish positive (the table below is the author's own tabulation). (Figure: ROUND208-2.png)

(Added on April 30, 2020) The result above was only for the first day of ROUND 208 (I had mistakenly thought it was the final value for the round; my apologies). A round actually lasts 20 days, and the correlation changes every day during that period. Below I show the distribution of correlation and MMC for the top 100 users, and the author's position, after 5 days of ROUND 208. So far it is going well. I have also added notes on MMC in the chapter below, so please check that as well. (Figure: round208_submission_dist.png)

(Added on May 22, 2020) I updated the model version several times and, from ROUND 212, settled on a nearly final model. I will watch how it performs for a while. (Figure: 21.png)
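
The effect of the -0.1 padding on a new participant can be illustrated with a short calculation. This is a sketch based only on the description above (a 100-day window with -0.1 for days without data); the function name reputation and the example correlations are hypothetical.

```python
import numpy as np

def reputation(daily_corr, window=100, missing_value=-0.1):
    """Average of the last `window` daily correlations, with days without data filled by missing_value."""
    recent = list(daily_corr)[-window:]
    padded = [missing_value] * (window - len(recent)) + recent
    return float(np.mean(padded))

print(reputation([]))             # brand-new participant
print(reputation([0.03] * 20))    # after 20 days of forward results
print(reputation([0.03] * 100))   # after a full 100 days
```

Even with a steady correlation of 0.03, the reputation stays negative until most of the 100-day window has been filled with real results.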

Additional information (established on April 25, 2020)

Since I have learned a lot by actually participating in ROUND 208, I am adding this chapter. Anything I learn in the future will also be added here.

(Added on 2020/4/25)

--I wrote that payouts apply from 100 days later, but it turns out that this condition applies only to the daily bonus; rewards and collections based on Correlation begin immediately from the round in which you stake (see the figure below).
--~~I wrote that the Correlation-based payout is determined by the Rank Correlation in each tournament round, but this was not accurate. I had assumed that the reward or collection happens only once, when the round ends and the Correlation for the round is finalized, but in fact rewards and collections occur every day according to the daily Correlation during the round (see the figure below).~~
--Now that I understand the specific schedule of a round, I will describe it (all times below are Japan time). The dataset is updated before dawn on Sunday. I thought the deadline for submitting predictions was Monday 23:30, but the deadline in ROUND 208 was Tuesday 0:30. Also, the round appears to start on Thursday. Taking ROUND 208 as an example, the dataset became available in the early morning of 4/19 (Sun), the submission deadline was 0:30 on 4/21 (Tue), and the round started on 4/23 (Thu) (see the figure below). (Figure: 07.png)

--training_data is basically unchanged from round to round, so you do not have to rebuild the model every time you download the dataset. You may improve the model based on the results of previous rounds, but results vary from round to round and should not be judged on a short-term basis.
--About cheating. As discussed in the Numerai forum, it has been confirmed that some users run three accounts so that the average of their combined Correlation stays close to 0 (that is, reducing the staking risk to zero) while hoping that one of the accounts luckily ranks high enough to earn the daily bonus (the 100-day delay before the daily bonus applies seems to be intended to deter such attacks in the first place). Therefore, the reward system may change in the future.

(Added on 2020/4/30)

--The validation data structure will change from ROUND 210. There is no change to tournament_data as a result of this. Specifically, as shown in the figure below, validation data for two periods will be provided, with a gap between them. The detailed announcement is here. It is said that data from the COVID-19 crash period is included in one of them. The test data and live data are weekly, but since an actual round runs for 4 weeks (= 1 month), this is no different from the training data.

(Figure: 08.png)

--For reference, I tabulated the Validation Correlation of all ROUND 208 participants. Looking at the whole distribution (left figure), there are users with extremely high correlation; this is because they also include the validation data from tournament_data in the training data when building their models. If you have your own way of assessing out-of-sample prediction performance, I think this approach is also fine. Looking at a realistic range (right figure), many users are around 0.03. Users with extremely low Validation Correlation may have other intentions, such as cheating. (Figure: round208_validation_dist.png)

--Payouts by MMC will start at the end of May 2020. Users will be able to choose whether payouts are based on correlation or on MMC (the MMC payout is 2 x MMC). Along with this, the ranking bonus will be abolished by September 2020. The background to the changes in the reward system is described below.
--If the ranking bonus is abolished to prevent cheating, the only payout to users is the correlation-based one. If payouts are granted or collected purely on the basis of how good the forecasts are, users have no incentive to participate in the tournament (users with high predictive performance can simply invest in stocks themselves; there is no need to bother making an opaque bet on a Correlation that the operator could in principle set arbitrarily). If payouts are instead based on MMC (contribution to the metamodel), rewards are earned according to the contribution to the metamodel even when predictive performance is low. Under the MMC basis rather than the correlation basis, 86 of the current TOP100 users would receive a higher reward. Click here for details (https://forum.numer.ai/t/mmc-payout-details-and-analysis/220).
--Click here for the detailed calculation method of MMC (https://forum.numer.ai/t/mmc2-announcement/93). Simply put, it is the correlation between the returns and the residual of your model after removing the part correlated with the metamodel. It indicates how much the original part of your model contributes to predicting returns.

(Added on May 1, 2020)

--The format of Numerai's tournament page and of each user's profile page changed recently, which has helped deepen my understanding of the tournament specifications and reward system. Many of Numerai's documents lack information, so I work out the actual tournament specifications and reward system by checking against the live site. For this reason the content of this article may be incomplete or may change; thank you for your understanding. If you have any questions, I would be very grateful for your comments.
--In Numerai, four rounds appear to run in parallel. This is because one round lasts 4 weeks and a new round starts every week. The payout shown for an ongoing round (Resolving Round) is only provisional (Pending); the actual payout is settled only after the round has completely finished (that is, 4 weeks after the round starts).

(Added on May 5, 2020)

--From ROUND 210, there are now two validation periods. Both carry the same label, "validation", but for convenience in this article I will refer to the period used so far as validation1 and the newly added period as validation2. Performance on validation2 is overwhelmingly worse than on validation1. validation2 is truly out-of-sample data. To be honest, I think this was a mistake on the operator's part: the addition of validation2 reveals that validation1 is completely unfit to serve as validation data. In light of this change, I am also unsure how to proceed. For now, I estimated the Validation Correlation on validation2 for the TOP 100 users, shown below. The missing parts are those that could not be estimated. I believe 0.01 to 0.015 is a reasonable Validation Correlation, at least for the validation2 period. In the end, the Validation Correlation on validation2 of the ROUND 210 predictions I submitted was about 0.013.

(Figure: 12.png)

(Added on May 22, 2020) Numerai allows you to submit multiple models, but until now each model required a separate account registered with its own email address. This has now changed so that you can submit up to 10 models per account. The stake can also be allocated and managed per model. The setup instructions are here.

The author has submitted two models. The sub-model is for monitoring, and I plan to combine it with the main model if it looks good after a few rounds. (Figure: 20.png)

Numerai API (established on May 1, 2020)

Numerai has an API, so let me introduce it. You can use it to download datasets and submit predictions. More useful still is the ability to aggregate user information. Below is sample code that retrieves the Round Correlation and MMC of the TOP 100 users on the last day of each round. I am not sure about the API rate limit, so be careful not to hit it too hard. If I find other convenient uses, I will introduce them.

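import time
import numerapi
import matplotlib.pyplot as plt

round_number = 208  # named round_number to avoid shadowing the built-in round()
api = numerapi.NumerAPI()

leaderboard = api.get_leaderboard(limit=100)  # TOP100 users
users = [entry["username"] for entry in leaderboard]

submission_corr = []
mmc = []

for user in users:
    sub = api.daily_submissions_performances(username=user)
    sub_round = [s for s in sub if s["roundNumber"] == round_number]
    if sub_round:  # skip users with no submission in this round
        submission_corr.append(sub_round[0]["correlation"])
        mmc.append(sub_round[0]["mmc"])
    time.sleep(0.5)  # pause between requests to go easy on the API

# draw the two histograms in separate figures so that they do not overlap
plt.hist(submission_corr)
plt.figure()
plt.hist(mmc)
plt.show()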

Numerai community (established on May 22, 2020)

Numerai has an open chat community on a platform called Rocket.Chat, which is similar to Discord. Important tournament changes and the data science behind the models are discussed there. I will pick out useful information and introduce it in this chapter, but anyone who is serious about the tournament should definitely join the community. My ID in the community is uki1.

community.numer.ai

(Added on May 23, 2020) Payouts by MMC have begun, and members of the community appear to be neutralizing their own predictions. MMC is computed by neutralizing against the metamodel, but since the metamodel's predictions are of course unknown, the predictions of a representative model such as the Example model are used as a stand-in, and one's own predictions are neutralized against them. Doing so keeps the correlation with typical models low (whether predictive performance survives is another matter). Below is the code for neutralization. There have been reports that predictive performance can still be maintained when neutralizing with proportion = 0.5. For the next round, ROUND 213, it is worth trying neutralization, and if predictive performance survives, submitting one set of neutralized predictions is not a bad idea.

import pandas as pd
import numpy as np

# series is your own prediction result; by is the prediction result to neutralize against (the metamodel's predictions in the case of MMC)
def neutralize_series(series, by, proportion=1.0):
    scores = series.values.reshape(-1, 1)
    exposures = by.values.reshape(-1, 1)
    # append a constant column so that the least-squares fit includes an intercept
    exposures = np.hstack((exposures, np.array([np.mean(series)] * len(exposures)).reshape(-1, 1)))
    # rcond=None selects NumPy's current default cutoff and avoids a FutureWarning
    correction = proportion * (exposures.dot(np.linalg.lstsq(exposures, scores, rcond=None)[0]))
    corrected_scores = scores - correction
    neutralized = pd.Series(corrected_scores.ravel(), index=series.index)
    return neutralized
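
To see what neutralization does, the following sketch applies the same function to synthetic predictions and prints the remaining correlation with the reference predictions for several values of proportion. The data is random and purely illustrative: example_preds stands in for the Example model's predictions and own_preds for your own, and both names are hypothetical. The function is repeated so that the block runs on its own.

```python
import numpy as np
import pandas as pd

def neutralize_series(series, by, proportion=1.0):
    scores = series.values.reshape(-1, 1)
    exposures = by.values.reshape(-1, 1)
    # constant column so that the least-squares fit includes an intercept
    exposures = np.hstack((exposures, np.full((len(exposures), 1), np.mean(series))))
    correction = proportion * exposures.dot(np.linalg.lstsq(exposures, scores, rcond=None)[0])
    return pd.Series((scores - correction).ravel(), index=series.index)

rng = np.random.default_rng(0)
n = 5000
# Synthetic stand-in for the reference (Example model) predictions.
example_preds = pd.Series(rng.normal(size=n))
# Own predictions that are deliberately correlated with the reference.
own_preds = 0.7 * example_preds + pd.Series(rng.normal(size=n)) + 0.5

for proportion in (0.0, 0.5, 1.0):
    neutral = neutralize_series(own_preds, example_preds, proportion)
    corr = np.corrcoef(neutral, example_preds)[0, 1]
    print(f"proportion={proportion}: correlation with reference = {corr:.3f}")
```

With proportion = 1.0 the linear dependence on the reference predictions is removed entirely, while proportion = 0.5 removes only half of the fitted component, which is why some correlation (and possibly some predictive performance) remains.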

Advantageous strategy in the tournament (established on May 2, 2020)

As mentioned above, the ranking bonus will be abolished and payouts will move to a choice between the Correlation basis and the MMC basis. Correlation-based payouts depend entirely on each participant's predictive skill, offer no payoff asymmetry, and give participants no incentive to join other than testing their skill. If participants choose the MMC basis instead, can they really obtain a payoff asymmetry? In this chapter, I analyze MMC and consider the advantageous strategy for participants whose primary goal is reward.

TOP100 analysis

Before analyzing MMC, I first analyze the models built by the current TOP100 users (as of May 2, 2020). First, the TOP100 users' prediction models are clustered based on the time series of their Submission Correlation over the latest 20 rounds (see the figure below), because many users clearly show similar trajectories. (Figure: 09.png)

A SOM (self-organizing map) was used as the clustering method. In addition, the map was graded by predictive skill (reputation) and by correlation with the metamodel (corr w/ metamodel) so that the differences could be seen as characteristics of each region. (Figure: 10.png) Looking at this, some users are clearly different from the rest. At the bottom left of the SOM map are madmin and wilfred, who are ranked first and second. madmin, currently ranked first, is suspected of cheating (reference site). Unfortunately, the same probably applies to wilfred in second place. This can be inferred from the fact that their Validation Correlation is clearly low (see the figure below). For reference, I think the author's model is located in the upper-left group on the map (due to lack of data, this is only an estimate from its behavior so far). (Figure: 11.png)

The question is, excluding these two users, **where on the map rewards can be earned most effectively**. One could simply improve predictive skill, but if the correlation with the metamodel becomes large, MMC may decrease and the reward with it. Based on the MMC specification, I will consider what the advantageous strategy for maximizing rewards looks like (to be reported next time).

(Added on May 19, 2020) I report my findings on MMC payouts. First, the definition of MMC is here. Briefly, if each set of predictions is regarded as an N-dimensional vector, MMC is the correlation between the target vector and the component of the user's prediction vector that is orthogonal to the metamodel's vector. The following is a schematic diagram for the two-dimensional case (N = 2). Here the correlation coefficient, which is the predictive performance, corresponds to the angle θ each vector makes with the target (to be exact, cos θ). In this figure the target vector is fixed to (1, 0), so if the prediction vector lies in the first quadrant, the predictive performance is positive. Conversely, if the prediction vector lies in the second quadrant, the predictive performance is negative. In the figure below, the predictive performance of the user model is positive, but its component orthogonal to the metamodel is negative, which means that the payout would be negative under MMC. This relationship depends heavily on the relative positions of the target, metamodel, and user vectors. (Figure: 18.png)

As the figure above shows, the 2D case is too extreme, so here I generated random vectors in N = 5000 dimensions and observed the relationship between each parameter and the payout by Monte Carlo simulation. The figure below compares the old payout with the new payout. The parameters are the predictive performance of the metamodel and the correlation between the metamodel and the user model. The red diagonal line in each graph is the line where old payout : new payout = 1 : 1; points above and to the left of this line are samples whose reward increases under the new payout, and points below and to the right are samples whose reward decreases.

The first thing to note is that when the metamodel is performing well (top row), MMC payouts tend to be lower than traditional payouts (i.e. simple predictive performance). Conversely, when the metamodel is performing poorly (bottom row), MMC payouts tend to be higher than traditional payouts. In other words, MMC has the effect of smoothing the payout to the user between periods when prediction is easy and periods when prediction is difficult.

Next, looking across the correlation with the metamodel, the regression coefficient increases as the correlation decreases (moving toward the right column). When the correlation is low, the MMC payout swings more widely relative to the traditional payout (that is, simple predictive performance). This means that lowering the correlation with the metamodel effectively applies leverage to your own predictive performance.

Finally, regarding payoff asymmetry: as the graphs show, the MMC payout has a linear relationship with the old payout (that is, simple predictive performance), so there is no free lunch. (Figure: 19.png)

To summarize the above,

--MMC payouts stabilize payouts across easy and difficult forecasting periods
--Leverage can be applied by lowering the correlation with the metamodel (naturally, this makes the problem harder because of the added constraint)
--There are no asymmetric profit opportunities, so there appears to be no advantageous strategy in the tournament

In short, the incentive to join Numerai is the boosted payout during periods that are hard to predict. A quantitative comparison is difficult, but it may be a good option if you treat it as one component of your investment portfolio (provided you can tolerate NMR risk). The most stylish approach is to make accurate predictions with a model that has low correlation with the metamodel. Participants who are confident in their skills are encouraged to take up the challenge.
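
The kind of Monte Carlo experiment described above can be sketched roughly as follows. This is not the author's original code: the way the target, metamodel, and user vectors are generated, the assumed spread of the user's own skill, and all function names are assumptions chosen for illustration. MMC is computed here as defined in this article, i.e. the correlation between the target and the component of the user's prediction orthogonal to the metamodel.

```python
import numpy as np

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def orthogonal_component(user, meta):
    """Part of `user` that is orthogonal to `meta` (both are centered first)."""
    u = user - user.mean()
    m = meta - meta.mean()
    return u - (u @ m) / (m @ m) * m

def simulate(meta_skill, user_meta_corr, n_dim=5000, n_trials=200, seed=0):
    """Return arrays of old payouts (plain correlation) and new payouts (MMC)."""
    rng = np.random.default_rng(seed)
    old, new = [], []
    for _ in range(n_trials):
        target = rng.normal(size=n_dim)
        # Metamodel whose correlation with the target is roughly meta_skill.
        meta = meta_skill * target + np.sqrt(1 - meta_skill**2) * rng.normal(size=n_dim)
        # The user's independent signal, with an assumed small random skill.
        own_skill = rng.uniform(-0.05, 0.05)
        own = own_skill * target + np.sqrt(1 - own_skill**2) * rng.normal(size=n_dim)
        # User prediction whose correlation with the metamodel is roughly user_meta_corr.
        user = user_meta_corr * meta + np.sqrt(1 - user_meta_corr**2) * own
        old.append(corr(user, target))
        new.append(corr(orthogonal_component(user, meta), target))
    return np.array(old), np.array(new)

results = {}
for meta_skill in (0.05, -0.05):      # metamodel doing well / doing poorly
    for rho in (0.9, 0.5, 0.1):       # correlation between user and metamodel
        old, new = simulate(meta_skill, rho)
        results[(meta_skill, rho)] = (old.mean(), new.mean())
        print(f"meta_skill={meta_skill:+.2f} rho={rho}: "
              f"mean corr={old.mean():+.4f}  mean MMC={new.mean():+.4f}")
```

Under these assumptions, the average MMC comes out below the average plain correlation when the metamodel has positive skill and above it when the metamodel has negative skill, which is the smoothing effect discussed above.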

(Added on May 22, 2020) Payout by MMC has started. You can choose between payout by Corr and payout by MMC. The setting method is here.

(Added on June 4, 2020) About a month and a half has passed since I entered the war. During this period, the author conducted a large amount of verification from various perspectives. Based on the results and the author's knowledge, we have summarized the incentives to participate in Numerai.

**Incentive 1. Great leverage** Numerai's payout is proportional to CORR. If CORR = 0.1, the payout is 10% (for MMC, multiply by a further 2). This specification is highly leveraged to begin with. Numerai's model is market-neutral, and while such a strategy eliminates market risk, it naturally yields lower returns. In my experience, the long/short return on large-cap stocks in the actual stock market is about 1% per CORR = 0.1 (see the figure below). In other words, a CORR payout is already leveraged about 10 times relative to actual stock price movements. With MMC the leverage is up to twice that, so about 20 times. In addition, the same stake can be bet for 4 consecutive weeks without running short of margin, so the effective leverage is about 80 times. High leverage is a double-edged sword, but it is a very welcome specification because it dramatically improves the capital efficiency of a market-neutral strategy. (Figure: 23.png)
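The leverage argument above is simple arithmetic, and it can be laid out step by step. The following is only a sketch using the figures quoted in the text (the author's own estimates, not official Numerai numbers); the constant and function names are made up for illustration.

```python
# Figures quoted in the text (the author's estimates, not official numbers).
NUMERAI_PAYOUT_PCT = 10.0   # payout in percent when CORR = 0.1
MARKET_RETURN_PCT = 1.0     # real-market long/short return in percent per CORR = 0.1
MMC_MULTIPLIER = 2.0        # MMC pays up to twice the CORR payout
CONSECUTIVE_WEEKS = 4       # the same stake can be bet for 4 consecutive weeks


def leverage_breakdown():
    """Return the approximate leverage at each step of the argument."""
    corr_leverage = NUMERAI_PAYOUT_PCT / MARKET_RETURN_PCT
    mmc_leverage = corr_leverage * MMC_MULTIPLIER
    four_week_leverage = mmc_leverage * CONSECUTIVE_WEEKS
    return {
        "corr": corr_leverage,
        "mmc": mmc_leverage,
        "four_weeks": four_week_leverage,
    }


if __name__ == "__main__":
    for step, value in leverage_breakdown().items():
        print(f"{step}: about {value:.0f}x")
```

Changing `MARKET_RETURN_PCT` shows how sensitive the conclusion is to the 1% estimate, which comes from the author's experience rather than from a published figure.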

**Incentive 2. Portfolio construction at zero cost** Users do not actually buy stocks; they are simply paid based on their forecast results. This means there are no transaction fees and no market impact. In other words, you do not have to worry at all about live performance degrading relative to the backtest. This is a unique benefit for system traders, especially for users with large stakes.

**Incentive 3. Stabilization of P/L by MMC** With a highly leveraged strategy, large P/L swings will force you out of the game quickly, because the accumulated rewards can be wiped out by just a few losses. Betting on MMC smooths rewards between periods of good and bad forecasts. This makes a highly leveraged compounding strategy sustainable without lowering its expected return.
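Why smoothing matters for a compounding strategy can be seen with a toy calculation. The sketch below compares two hypothetical return streams with the same arithmetic mean but different swings; the numbers are invented for illustration and are not Numerai payouts.

```python
import math


def compound(returns):
    """Final wealth multiple after reinvesting everything each period."""
    return math.prod(1.0 + r for r in returns)


# Two hypothetical return streams with the same arithmetic mean (zero).
volatile = [0.5, -0.5] * 5   # large swings
smooth = [0.1, -0.1] * 5     # small swings

volatile_final = compound(volatile)
smooth_final = compound(smooth)

if __name__ == "__main__":
    print(f"volatile stream: {volatile_final:.3f}x of the initial stake")
    print(f"smooth stream:   {smooth_final:.3f}x of the initial stake")
```

With the same average return per period, the stream with smaller swings keeps far more of the stake after compounding, which is the sense in which P/L smoothing supports sustainability.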

**Incentive 4. Excellent features** In general, the major barrier to building an investment strategy is the search for investment indicators. This is a challenge faced especially by beginners in the market, who do not know which indicators have explanatory power or how to find them. The predictive power of the features contained in Numerai's data is very good. I have checked this from many angles, and using these features will almost certainly make it possible to achieve a positive P/L in the long run.

To summarize the above, the incentives to participate in Numerai are as follows.

--Leverage that dramatically improves the capital efficiency of a market-neutral strategy
--Ideal returns at zero cost
--Sustainability for a highly leveraged compounding strategy, thanks to the P/L smoothing effect of MMC
--Excellent features that can be used free of charge

It is a dream platform for system traders. That said, it would not be fair to explain only the incentives, so let's also review the disadvantages. The biggest risks in participating in the tournament are operational risk, NMR price fluctuation risk, and NMR liquidity risk.

Operational risk is the risk that the tournament is suddenly abolished, or that the specifications change and the expected return can no longer be obtained. Nothing can be done about this. If it happens, you simply withdraw your NMR (I do not think it would be confiscated). Next is the risk of NMR price fluctuations, which can only be accepted. There is no free lunch in the first place, and a high return always comes with risk. However, since top players have quintupled their staked funds in about 30 weeks, even a drop in the NMR price to about 1/5 is tolerable. Conversely, the price of NMR may rise sharply, so I think NMR is worth betting on. Finally, regarding liquidity risk: few exchanges list NMR and their order books are thin, so it is difficult to convert a very large amount of NMR into cash in a short period of time. I think this is the biggest bottleneck. I would be very grateful if Numerai's management offered an OTC swap from NMR to BTC or USDT, even at a high fee.
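The "quintupled in about 30 weeks versus a drop to 1/5" comparison can be checked with a short calculation. This is a sketch using only the two figures quoted in the text; it assumes the growth compounds evenly week by week, which is a simplification, and the function names are illustrative.

```python
STAKE_MULTIPLE = 5.0   # stake growth quoted in the text for top players
WEEKS = 30             # period quoted in the text


def break_even_price_ratio(stake_multiple):
    """NMR price ratio at which the fiat value of the grown stake is unchanged."""
    return 1.0 / stake_multiple


def implied_weekly_growth(stake_multiple, weeks):
    """Weekly growth rate if the multiple were earned by even compounding."""
    return stake_multiple ** (1.0 / weeks) - 1.0


price_ratio = break_even_price_ratio(STAKE_MULTIPLE)
weekly_growth = implied_weekly_growth(STAKE_MULTIPLE, WEEKS)

if __name__ == "__main__":
    print(f"break-even NMR price ratio: {price_ratio:.2f}")
    print(f"implied weekly growth: {weekly_growth:.2%}")
```

A stake that has grown fivefold in NMR terms is worth the same in fiat terms only if the NMR price has fallen to one fifth, which is the break-even point the paragraph refers to.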

Tips for forecasting

This chapter will be updated when we are able to reign supreme in the ranking.

(Added on May 11, 2020) Although I am still far from the top of the ranking, I would like to introduce one of the EDAs I am doing. The basic idea is to observe how the correlation between features changes over time (many readers may already have checked this). What is different in this EDA is that I looked at the test data of tournament_data instead of training_data. As readers of Numerai's Forum may know, the test data in tournament_data, ERA854-899 (weekly), roughly corresponds to ERA 197-206 (monthly) of Validation2. Furthermore, this data roughly corresponds to ROUND168 to ROUND204 of the actual tournament. This is the 40 weeks leading up to April 2020 and includes the corona shock (the COVID-19 market crash).

The graph below shows how the correlation between dexterity1 and the other dexterity features (left figure) and between constitution1 and the other constitution features (right figure) changes over ERA853 to ERA904 (weekly).

As you can see, the correlation coefficient drops sharply at the right end of the dexterity chart. This part is the corona shock. As will be described later, dexterity is an indicator of price increase/decrease; its 14 features are built by combining a calculation period (1W, 4W, 52W, etc.) with a calculation method (simple rate of change, deviation from a moving average, etc.).

Also, unlike dexterity, whose mutual correlations fluctuate widely, constitution shows an almost constant correlation over time. Constitution is presumably an indicator of a stock's attributes (financial statements, sector, region, etc.). Some features drift slightly up and down during the corona shock; these are financial indicators that incorporate price, such as PER and PBR. (Figure: 13.png)
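The per-era correlation transition described above can be computed with a few lines of pandas. The following is a minimal sketch, not the author's actual code: it assumes a table with an `era` column and feature columns named like `feature_dexterity1`, and it uses a tiny hand-made frame in place of the real tournament data.

```python
import pandas as pd


def era_feature_correlations(df, base, others, era_col="era"):
    """Correlation between `base` and each column in `others`, one row per era."""
    eras, records = [], []
    for era, group in df.groupby(era_col, sort=False):
        eras.append(era)
        records.append({col: group[base].corr(group[col]) for col in others})
    return pd.DataFrame(records, index=pd.Index(eras, name=era_col))


# Tiny stand-in for the tournament data (values are invented).
demo = pd.DataFrame({
    "era": ["era1"] * 5 + ["era2"] * 5,
    "feature_dexterity1": [0.0, 0.25, 0.5, 0.75, 1.0] * 2,
    "feature_dexterity2": [0.0, 0.25, 0.5, 0.75, 1.0] + [1.0, 0.75, 0.5, 0.25, 0.0],
    "feature_dexterity3": [0.0, 0.5, 0.25, 1.0, 0.75] + [0.25, 0.0, 0.75, 0.5, 1.0],
})

corr_by_era = era_feature_correlations(
    demo, "feature_dexterity1", ["feature_dexterity2", "feature_dexterity3"]
)

if __name__ == "__main__":
    print(corr_by_era)
```

Plotting each column of `corr_by_era` against the era index gives one line per feature, the same kind of chart as the one discussed in the text.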

Based on these data, I tried to reproduce dexterity, which has great explanatory power and is the key to improving correlation, from actual market data. Covering stocks from around the world would be a hassle, so I focused on S&P 500 stocks here. The figure below shows how the mutual correlation of the rate-of-change indicators evolves. Features 1 to 4 are the rates of change over 1W, 4W, 12W, and 52W, respectively. A shape similar to the dexterity correlation in the figure above appears in places. (Figure: 14.png)
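The reproduction described above (rates of change over 1W, 4W, 12W, and 52W, then their mutual correlation over time) can be sketched as follows. This is an illustration under stated assumptions, not the author's code: synthetic random-walk prices stand in for real S&P 500 data, and the names `rate_of_change` and `cross_sectional_correlation` are made up here.

```python
import numpy as np
import pandas as pd

# Horizons in weeks for features 1 to 4, as listed in the text.
HORIZONS = {"feature1": 1, "feature2": 4, "feature3": 12, "feature4": 52}


def rate_of_change(weekly_close, weeks):
    """Simple rate of increase/decrease over the given number of weeks."""
    return weekly_close / weekly_close.shift(weeks) - 1.0


def cross_sectional_correlation(a, b):
    """Per-date correlation across stocks between two indicator tables."""
    return a.corrwith(b, axis=1)


# Synthetic random-walk prices (assumption: replace with real weekly closes).
rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-05", periods=120, freq="W-FRI")
tickers = [f"STOCK{i:02d}" for i in range(30)]
log_returns = rng.normal(0.0, 0.03, size=(len(dates), len(tickers)))
weekly_close = pd.DataFrame(
    100.0 * np.exp(log_returns.cumsum(axis=0)), index=dates, columns=tickers
)

features = {name: rate_of_change(weekly_close, n) for name, n in HORIZONS.items()}

# Correlation over time between the 52W and 12W indicators.
corr_52w_vs_12w = cross_sectional_correlation(
    features["feature4"], features["feature3"]
).dropna()

if __name__ == "__main__":
    print(corr_52w_vs_12w.tail())
```

On random-walk data the resulting series carries no market meaning; the point is only the shape of the calculation, which can then be applied to real price history.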

Next, I compared the cumulative return curves of these reproduced indicators with those of dexterity. In conclusion, because the curves are very similar, I think dexterity4 and dexterity7, which are particularly powerful, are most likely rate-of-change indicators over 52W (or a similar period). (Figure: 15.png)

In addition, top-ranked users are benefiting greatly from dexterity4. The following compares the Round Correlation over time of madmin, hb, and steve2, who are ranked high, with the Correlation of dexterity4 over time. (Figure: 16.png)

Now, the question is whether dexterity is a truly robust indicator at all times. As the last step of this analysis, I checked how S&P 500 stocks behave when selected by the rate-of-change indicator over the period since 2010. In conclusion, it is difficult to get sufficient performance just by betting on long-term price changes (as one would expect). Although dexterity4 looks like an excellent indicator, the validation2 period actually covers only the red-framed part of the figure below. (Figure: 17.png)

From this analysis, I think building a model that is stable on a weekly basis is a very difficult task. To work around this, I introduced a pilot model in ROUND211. It is a model for checking behavior, and I do not expect much from its results.

(Added on May 29, 2020) I received a comment that the monitor model introduced in ROUND212 is extremely strong, so I will explain it. It has not been long since UKI_MONITOR1 was submitted, but its recent performance does indeed look excellent (see the figure below). (Figure: 22.png)

In short, UKI_MONITOR1 is simply dexterity7 with its sign flipped. The code is as follows. The only other step is scaling the values to between 0.4 and 0.6.

tournament_data["prediction"] = -tournament_data["feature_dexterity7"]

The reason for doing this is that dexterity7 (or dexterity4) has the highest variance of all the features and is essential for boosting the model's predictive performance (i.e., Correlation). However, this feature is strongly time-dependent, and simply incorporating it into the model is risky, so I decided to just watch its behavior for the time being. Users who have recently achieved high predictive performance are benefiting from dexterity7, but I believe their performance is likely to deteriorate significantly in the future. When that happens, this monitor should be useful for isolating the factor responsible.
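The text mentions that the monitor's values are scaled to between 0.4 and 0.6 but does not show how. One plausible way is min-max scaling, sketched below; this is an assumption, not the author's confirmed method, and `scale_to_range` is a hypothetical helper. The small frame stands in for the real tournament_data.

```python
import pandas as pd


def scale_to_range(values, low=0.4, high=0.6):
    """Min-max scale a Series into [low, high] (assumed scaling method)."""
    v_min, v_max = values.min(), values.max()
    if v_max == v_min:
        # Constant input: place everything at the midpoint of the range.
        return pd.Series((low + high) / 2.0, index=values.index)
    return low + (values - v_min) * (high - low) / (v_max - v_min)


# Stand-in for the real tournament_data (values are invented).
tournament_data = pd.DataFrame(
    {"feature_dexterity7": [0.0, 0.25, 0.5, 0.75, 1.0]}
)

# Flip the sign of the feature, then squeeze it into [0.4, 0.6].
raw = -tournament_data["feature_dexterity7"]
tournament_data["prediction"] = scale_to_range(raw)

if __name__ == "__main__":
    print(tournament_data)
```

Because the scaling is monotonic, it does not change the ranking of the predictions; it only keeps the submitted values inside a narrow band.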

To tell the truth, I thought about making a monitor like this for all 310 features, but I decided not to because it would be a nuisance to the operators and would not be well received. Readers are encouraged to try making monitors too, in numbers that stay within the bounds of common sense.

In conclusion

How was it? This article provided an overview of Numerai, its tournaments, its reward structure, and its datasets. Numerai's dataset, supervised by an advisor specializing in finance M / L, is a great subject for practical learning of finance M / L, even more so than Kaggle. There are also ample rewards available. The battlefield for those involved in finance M / L is here.

Let's get started with Numerai.

Reference article

Stock price forecast by machine learning "I"
Stock price forecast by machine learning "Ro"
