[PYTHON] [For beginners] How to read Numerai's HP + Submit + Convenient links

** Introduction **

Nice to meet you. My name is tit_BTCQASH. (https://twitter.com/tit_BTCQASH)

In this article, I will introduce the parts of the Numerai project that are hard to understand, explain them, and walk through submitting prediction results. (https://numer.ai/tournament)

Numerai is a project run by a hedge fund that trades stocks based on submitted forecasts. We do not have to prepare the data ourselves; instead, we are asked to build models on the data provided by the team and submit our prediction results. (In Japan, blog_UKI's article ** Stock price forecasting by machine learning: Let's get started with Numerai ** https://qiita.com/blog_UKI/items/fb401725288e58c92bd6 is well known, so please read it for details.)

In this article, I will explain everything as carefully as possible, from how to read the Numerai homepage to how to submit prediction results using Google Colaboratory. Please do not hesitate to contact me on Twitter etc. if you have any questions or points that should be added.

** Table of contents of this article **

① How to read Numerai's homepage
② How to optimize the features (explanatory variables) and stock price data given by the Numerai team and submit forecasts
③ Convenient links related to Numerai (to be added; comments welcome)
④ How to read DIAGNOSTICS
⑤ What I want from Numerai
⑥ In the end

** ① How to read Numerai's homepage **

The URL of the Numerai homepage is https://numer.ai/tournament. The page has various sections, and I will explain each of them.

** Frequently used terms **

** NMR Token **: An ERC-20-based cryptocurrency used for staking in Numerai. NMR is issued as Numerai payouts and as rewards for technical contributions to Numerai. The first way to get NMR tokens is to buy them on an exchange. However, NMR is not listed on Japanese exchanges, so it cannot be purchased there. The exchanges that list NMR tokens can be found at this link: https://coinmarketcap.com/ja/currencies/numeraire/ You can also use Uniswap (a DEX) and Coinlist. Uniswap: https://uniswap.org/ Coinlist: https://coinlist.co/dashboard

NMR tokens can also be used in erasure, not just Numerai. https://erasure.world/

** Correlation **: The correlation coefficient between your submitted stock price predictions and the true targets. The larger the value, the higher the reward.
** Reputation **: The average correlation over the last 20 rounds.
** MMC **: Abbreviation for Meta Model Contribution. A measure of how much your model contributes to the meta model (the model obtained by weighted-averaging all prediction results submitted to Numerai). Predictions that outperform the meta model earn positive MMC.
** Stake **: Depositing (betting) NMR on your model.
** Payouts **: Rewards received according to the staked NMR.
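
To make the Payouts definition concrete, here is a rough sketch of how a round's payout relates to correlation and stake. It mirrors the (corrs / 0.2).clip(-1, 1) calculation used later in this article; approximate_payout is a hypothetical helper of mine, and the exact rules are described in the official docs.

def approximate_payout(corr: float, stake: float) -> float:
    """Rough per-round payout: stake * corr / 0.2, clipped to +/- the stake."""
    return stake * max(-1.0, min(1.0, corr / 0.2))

print(approximate_payout(0.03, 100))  # staking 100 NMR at corr 0.03 -> about +15 NMR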

** HP top **


** DOCS **: A link to Numerai's documentation (in English). The documentation explains Numerai's rules and more. A Japanese translation is in progress at https://jp.docs.numer.ai/ , but it appears to be only partially complete.

** CHAT **: A chat space for information about Numerai, casual conversation, feedback, and so on. Available only in English (there is no Japanese version).

** FORUM **: A space for discussing Numerai, mainly used for information sharing with the community. Available only in English (there is no Japanese version).

** LEADERBOARD **: Ranking of stock price forecast results submitted to Numerai

** ACCOUNT **: Contains four links: WALLET, MODELS, SETTINGS, and LOGOUT.
** WALLET **: A page for depositing and withdrawing NMR tokens. You can deposit by sending NMR tokens obtained on Binance etc. to the wallet address shown there. The Withdraw tab lets you withdraw your NMR tokens.

** MODELS **: A page for adding and deleting the models you submit to Numerai. Models can be added or removed by pressing ADD NEW MODEL / ABSORB EXISTING ACCOUNT.
** SETTINGS **: A page for configuring your e-mail address, password, two-factor authentication, and API keys.

** Bottom of HP **

** Model Information **: Shows the ranking, Reputation, and MMC Rep of the selected model (TIT_BTCQASH in this example). You can switch models by pressing the ↓ button.
** Data Information **: The download link for the latest round's data and the upload link for your forecast results.
** Stake **: Settings for staking NMR tokens. From Manage Stake, you can set the amount of NMR to bet on the current round. Several stake types are available, such as Corr and Corr + MMC: Corr bets NMR only on correlation, while Corr + MMC bets NMR on both correlation and MMC.
** Pending Payouts **: A table of projected payouts for each round, summarizing the rewards you are scheduled to receive.

** ② How to optimize the features (explanatory variables) and stock price data given by the Numerai team **

If you want to submit to Numerai without knowing anything yet, katsu1110's article https://www.kaggle.com/code1110/numerai-tournament and Carlo Lepelaars's article https://www.kaggle.com/carlolepelaars/how-to-get-started-with-numerai are very helpful. In this article, I would like to go one step further than those articles and the official example model (pointing out where the code can be improved, etc.) and offer sample code that produces prediction data at the press of a button. After reading this article, I hope the number of model submissions will increase as much as possible. (If you find a good way to raise Corr, please tell me secretly.)

** The code introduced in this article can be run on Google Colaboratory. Pressing the Run button creates a submission file, so please give it a try. ** https://colab.research.google.com/drive/1u5Cc3NlJQZJwJNmOrPjjqBchk928gT4C?usp=sharing

** Basics before explaining the code **

** i) Structure of the Numerai dataset ** The dataset can be downloaded from the data download link for the latest round (see the top of this article). UKI's article mentioned above (https://qiita.com/blog_UKI/items/fb401725288e58c92bd6) has a detailed explanation, so I will only briefly describe the contents.

numerai_training_data.csv is a CSV file containing the training data. numerai_tournament_data.csv is a CSV file containing the data for validation and submission. The columns are:

id: A label for the encrypted stock.
era: A label for the period in which the data was collected. Rows with the same era were collected during the same period.
data_type: One of four values: train, validation, test, live. train is for training, validation is for verification, test is data Numerai uses for its own testing, and live is the data for the current round.
feature: Binned features, taking the five quantile values 0, 0.25, 0.5, 0.75, 1. Features are grouped by tags: "feature_intelligence", "feature_wisdom", "feature_charisma", "feature_dexterity", "feature_strength", "feature_constitution".
target: Binned teacher data, also taking the values 0, 0.25, 0.5, 0.75, 1. The target is given in numerai_training_data.csv, but it is NaN for the live rows of numerai_tournament_data.csv.
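
If you want to confirm this structure yourself, here is a minimal sketch (my own addition; it assumes the two CSV files have already been downloaded to the working directory — the loading code in ① below handles that automatically):

import pandas as pd

train_df = pd.read_csv("numerai_training_data.csv")
tournament_df = pd.read_csv("numerai_tournament_data.csv")
print(train_df["data_type"].unique())          # ['train']
print(tournament_df["data_type"].unique())     # ['validation', 'test', 'live']
print(train_df.filter(like="feature_").shape)  # rows x number of feature columns
print(sorted(train_df["target"].unique()))     # [0.0, 0.25, 0.5, 0.75, 1.0]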

** ii) Flow of data submission **
① Read the data
② Feature engineering
③ Machine learning
④ Evaluating the strength of the model
⑤ Preparing the CSV file containing the prediction results
⑥ Neutralize

** ① Data reading ** I quote (and partially edit) the data-reading part of Carlo Lepelaars's article. Call download_current_data(DIR) to download the latest round's data into the directory specified by DIR. Calling train, val, test = load_data(DIR, reduce_memory=True) loads the data split into train, val, and test DataFrames.


!pip install numerapi
import numerapi
NAPI = numerapi.NumerAPI(verbosity="info")
import numpy as np
import random as rn
import pandas as pd
import seaborn as sns
import lightgbm as lgb
import matplotlib.pyplot as plt
from scipy.stats import spearmanr, pearsonr
from sklearn.metrics import mean_absolute_error
import os
DIR = "/kaggle/working"
def download_current_data(directory: str):
        """
        Downloads the data for the current round
        :param directory: The path to the directory where the data needs to be saved
        """
        current_round = NAPI.get_current_round()
        if os.path.isdir(f'{directory}/numerai_dataset_{current_round}/'):
            print(f"You already have the newest data! Current round is: {current_round}")
        else:
            print(f"Downloading new data for round: {current_round}!")
            NAPI.download_current_dataset(dest_path=directory, unzip=True)

def load_data(directory: str, reduce_memory: bool=True) -> tuple:
        """
        Get data for current round
        :param directory: The path to the directory where the data needs to be saved
        :return: A tuple containing the datasets
        """
        print('Loading the data')
        full_path = f'{directory}/numerai_dataset_{NAPI.get_current_round()}/'
        train_path = full_path + 'numerai_training_data.csv'
        test_path = full_path + 'numerai_tournament_data.csv'
        train = pd.read_csv(train_path)
        test = pd.read_csv(test_path)
        # Reduce all features to 32-bit floats
        if reduce_memory:
            num_features = [f for f in train.columns if f.startswith("feature")]
            train[num_features] = train[num_features].astype(np.float32)
            test[num_features] = test[num_features].astype(np.float32)
        val = test[test['data_type'] == 'validation']
        test = test[test['data_type'] != 'validation']
        return train, val, test
# Download, unzip and load data
download_current_data(DIR)
train, val, test = load_data(DIR, reduce_memory=True)

② Feature engineering. The features in the Numerai dataset have low correlation with each other, and you can get reasonable results without any feature engineering. Also, reducing the number of features with dimensionality-reduction techniques such as PCA tends to lower Corr, so it is not a very good idea.

(* This is just the result of my own verification. I do not deny the possibility that Corr could improve.)
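
If you want to check the inter-feature correlations yourself, here is a minimal sketch (my own addition; it assumes train has been loaded as in ① and samples rows only to keep memory usage down):

import numpy as np

sample = train.filter(like="feature_").sample(10000, random_state=0)
corr = sample.corr().abs().values
np.fill_diagonal(corr, 0)                          # ignore self-correlation
print(corr.max())                                  # largest absolute pairwise correlation
print(corr[np.triu_indices_from(corr, 1)].mean())  # average pairwise correlation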

What I think works in Numerai is to add features while keeping the correlation between features low. Officially, the number of features will be increased from 310 to 3100 (https://twitter.com/numerai/status/1347361350205415425), so feature engineering may become unnecessary before long, but I will briefly introduce how to handle the features.

First of all, if you look at the train data, you can see that the features are roughly divided into six groups: "feature_intelligence", "feature_wisdom", "feature_charisma", "feature_dexterity", "feature_strength", and "feature_constitution".

Quoting again from Carlo Lepelaars's article: the mean, standard deviation, skewness, etc. of each feature group make useful features. Call train = get_group_stats(train) to add these statistics to the train data (and likewise for val and test).


def get_group_stats(df: pd.DataFrame) -> pd.DataFrame:
        """Add the per-group mean, standard deviation, and skew as new features"""
        for group in ["intelligence", "wisdom", "charisma", "dexterity", "strength", "constitution"]:
            cols = [col for col in df.columns if group in col]
            df[f"feature_{group}_mean"] = df[cols].mean(axis=1)
            df[f"feature_{group}_std"] = df[cols].std(axis=1)
            df[f"feature_{group}_skew"] = df[cols].skew(axis=1)
        return df
train = get_group_stats(train)
val = get_group_stats(val)
test = get_group_stats(test)

If you have a PC with plenty of memory, feature differences, interaction features, and so on also make good features (in my case Corr increased by about 20%). Running this on Google Colaboratory crashes the session, so I will only post the code here:


from sklearn import preprocessing
# List the features for which you want to create interaction terms in ft_corr_list
ft_corr_list = ['feature_dexterity7', 'feature_charisma18', 'feature_charisma63', 'feature_dexterity14']
# The original snippet omitted the definition of `interactions`; pairwise
# PolynomialFeatures, as in Carlo Lepelaars's article, is assumed here
interactions = preprocessing.PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interactions.fit(train[ft_corr_list], train["target"])
X_train_interact = pd.DataFrame(interactions.transform(train[ft_corr_list]))
X_best_val_inter = pd.DataFrame(interactions.transform(val[ft_corr_list]))
X_best_test_inter = pd.DataFrame(interactions.transform(test[ft_corr_list]))
train = pd.concat([train, X_train_interact], axis=1)
val = val.reset_index().drop(columns='index')
val = pd.concat([val, X_best_val_inter], axis=1)
test = test.reset_index().drop(columns='index')
test = pd.concat([test, X_best_test_inter], axis=1)

That is all you need to add.

Since feature engineering techniques used on Kaggle can be applied to Numerai as they are, you can obtain a good Corr and Sharpe ratio by experimenting with the train, val, and test data. ** Feature engineering is one of the tasks required to get good results with Numerai **, so this is one of the areas worth tinkering with repeatedly.

③ Machine learning
What you need to consider when applying the Numerai dataset to machine learning:
i) which machine learning method to use (LightGBM, XGBoost, NLP, etc.)
ii) which hyperparameters to use
iii) whether to stack the prediction results

This time, I will use LightGBM in consideration of computation time. Apart from the target, the id, era, and data_type columns are not needed for training. Use the remaining feature_* columns as explanatory variables and target as the teacher data. With the trained model, we also create predictions for the Validation data and Live data contained in val.

** Working through i) to iii) improves Corr and the other metrics **, so this part is also worth repeated experimentation.


# Everything except id, era, data_type and target is used as a feature
feature_cols = train.columns.drop(['id', 'era', 'data_type', 'target'])
dtrain = lgb.Dataset(train[feature_cols].fillna(0), label=train["target"])
dvalid = lgb.Dataset(val[feature_cols].fillna(0), label=val["target"])
best_config = {"objective": "regression", "num_leaves": 31, "learning_rate": 0.01, "n_estimators": 2000, "max_depth": 5, "metric": "mse", "verbosity": 10, "random_state": 0}
model = lgb.train(best_config, dtrain)
train.loc[:, "prediction"] = model.predict(train[feature_cols])
val.loc[:, "prediction"] = model.predict(val[feature_cols])
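
As one illustration of iii), here is a minimal blending sketch (my own example with assumed hyperparameters, not part of the pipeline above): train a second LightGBM model with different hyperparameters and average the two predictions, which often stabilizes the per-era correlation.

# A hypothetical second configuration; only num_leaves, learning_rate and the seed differ
second_config = dict(best_config, num_leaves=63, learning_rate=0.005, random_state=1)
model2 = lgb.train(second_config, dtrain)
# Equal-weight blend on the validation data; compare its evaluate() metrics
# against the single model before adopting it
blend_val = (model.predict(val[feature_cols]) + model2.predict(val[feature_cols])) / 2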

④ Evaluating the strength of the model. Calculate spearman, payout, numerai_sharpe, and mae on the Validation data to estimate the strength of the model. The larger the spearman, payout, and numerai_sharpe, the better. Start by looking for conditions under which spearman is large (0.025 or more as a guide); that alone will get you a decent model. (* If you focus only on Corr, various problems can occur. Those familiar with Numerai may disagree, but since this is an article for people submitting prediction results for the first time, let me put it this way.)

The explanation of terms is as follows.

spearman: average per-era correlation; the higher the better (0.022 to 0.04 as a reference)
payout: average return
numerai_sharpe: the average return divided by its standard deviation; the higher the better (1 or more as a guide)
mae: mean absolute error


def sharpe_ratio(corrs: pd.Series) -> np.float32:
        """
        Calculate the Sharpe ratio for Numerai by using grouped per-era data

        :param corrs: A Pandas Series containing the Spearman correlations for each era
        :return: A float denoting the Sharpe ratio of your predictions.
        """
        return corrs.mean() / corrs.std()


def evaluate(df: pd.DataFrame) -> tuple:
        """
        Evaluate and display relevant metrics for Numerai 

        :param df: A Pandas DataFrame containing the columns "era", "target" and "prediction"
        :return: A tuple of floats containing the metrics
        """
        def _score(sub_df: pd.DataFrame) -> np.float32:
            """Calculates Spearman correlation"""
            return spearmanr(sub_df["target"], sub_df["prediction"])[0]

        # Calculate metrics
        corrs = df.groupby("era").apply(_score)
        print(corrs)
        payout_raw = (corrs / 0.2).clip(-1, 1)
        spearman = round(corrs.mean(), 4)

        payout = round(payout_raw.mean(), 4)
        numerai_sharpe = round(sharpe_ratio(corrs), 4)
        mae = mean_absolute_error(df["target"], df["prediction"]).round(4)

        # Display metrics
        print(f"Spearman Correlation: {spearman}")
        print(f"Average Payout: {payout}")
        print(f"Sharpe Ratio: {numerai_sharpe}")
        print(f"Mean Absolute Error (MAE): {mae}")
        return spearman, payout, numerai_sharpe, mae
# All feature columns, used to measure feature exposure
feature_list = [c for c in val.columns if c.startswith("feature")]
feature_spearman_val = [spearmanr(val["prediction"], val[f])[0] for f in feature_list]
feature_exposure_val = np.std(feature_spearman_val).round(4)
spearman, payout, numerai_sharpe, mae = evaluate(val)

⑤ Preparing the CSV file containing the prediction results. Write the file to be neutralized to submission_file.csv. This file needs id and prediction columns, and the ids must be ordered as Validation data followed by test data (+ Live data). Note that if the order is different, the submission will be rejected by Numerai.


# Predict on the test (+ live) rows and write the val/test predictions to CSV
test.loc[:, "prediction"] = 0
test.loc[:, "prediction"] = model.predict(test[feature_list])
test[['id', "prediction"]].to_csv("submission_test.csv", index=False)
val[['id', "prediction"]].to_csv("submission_val.csv", index=False)

# Free memory before reloading the tournament file
test = 0
val = 0

directory = "/kaggle/working"
full_path = f'{directory}/numerai_dataset_{NAPI.get_current_round()}/'
test_path = full_path + 'numerai_tournament_data.csv'
tournament_data = pd.read_csv(test_path)
# Keep the id column (plus one dummy feature column) to preserve the tournament row order
tournament_data_id = tournament_data['id']
tournament_data_id2 = tournament_data['feature_dexterity7']
tournament_data_id = pd.concat([tournament_data_id, tournament_data_id2], axis=1)
val = pd.read_csv("submission_val.csv")
test = pd.read_csv("submission_test.csv")

# Concatenate val and test predictions, then align them with the tournament file's id order
test_val_concat = pd.concat([val[['id', "prediction"]], test[['id', "prediction"]]], axis=0).set_index('id')
tournament_data_id = tournament_data_id.set_index('id')
conc_submit = pd.concat([tournament_data_id, test_val_concat], axis=1).drop(columns='feature_dexterity7').reset_index()
conc_submit = conc_submit.rename(columns={'index': 'id'})
conc_submit.to_csv("submission_file" + ".csv", index=False)

⑥ Neutralize. By linearly regressing your predictions against example_model (the sample model officially distributed by Numerai) and subtracting a proportion of the fit, you can improve the Sharpe ratio while reducing the correlation between individual features and your prediction results. However, if you overdo it, Corr drops significantly, so I think a proportion of about 0.3 to 0.5 is good. Which model to neutralize against, and by how much, is another element you can tune.


def neutralize(series, by, proportion):
    """Subtract `proportion` of the linear projection of `series` onto `by`."""
    scores = series.values.reshape(-1, 1)
    exposures = by.values.reshape(-1, 1)
    # Add a constant column so the regression also accounts for the mean
    exposures = np.hstack((exposures, np.array([np.mean(series)] * len(exposures)).reshape(-1, 1)))
    correction = proportion * (exposures.dot(np.linalg.lstsq(exposures, scores, rcond=None)[0]))
    corrected_scores = scores - correction
    neutralized = pd.Series(corrected_scores.ravel(), index=series.index)
    return neutralized

by = pd.read_csv('/kaggle/working/numerai_dataset_' + str(NAPI.get_current_round()) + '/example_predictions.csv')
neut = pd.read_csv("submission_file.csv")
# Change the proportion (0.3 here) to adjust the amount of neutralization
neut = pd.DataFrame({'prediction': neutralize(neut['prediction'], by['prediction'], 0.3)})
conc = pd.concat([by.drop(columns="prediction"), neut], axis=1)
conc.to_csv("neutralized_submission_file.csv", index=False)  # submission file

Submit the resulting neutralized_submission_file.csv from Upload predictions on the Numerai homepage, and you are done.
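
Alternatively, you can submit from a script using numerapi. A minimal sketch, assuming you have created API keys on the SETTINGS page; "your_model_name" is a placeholder:

import numerapi

napi = numerapi.NumerAPI(public_id="YOUR_PUBLIC_ID", secret_key="YOUR_SECRET_KEY")
model_id = napi.get_models()["your_model_name"]  # get_models() maps model names to ids
napi.upload_predictions("neutralized_submission_file.csv", model_id=model_id)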

** How to read DIAGNOSTICS **

** Validation Sharpe **: Sharpe ratio on the Validation data; 1 or higher is good.
** Validation Mean **: Average Corr on the Validation data; around 0.025 or higher is a guide.
** Feature Neutral Mean **: Average Corr when all features are neutralized (not very helpful).
** Validation SD **: Standard deviation of the per-era correlation between the Validation data and the predictions (not very helpful).
** Feature Exposure **: An index of how balanced the predictions are across features; the smaller the better.
** Max Drawdown **: Maximum drawdown; no worse than about -0.05 is a guide.
** Corr + MMC Sharpe **: Sharpe ratio of Corr and MMC combined.
** MMC Mean **: Average of MMC.
** Corr With Example Preds **: Correlation with the example model; 0.5 to 0.8 is a guide.
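
Several of these values can be approximated locally from the per-era validation correlations. A rough sketch (my own approximation; Numerai's exact definitions may differ, and it assumes val still holds the era, target, and prediction columns from step ④):

# Recompute per-era Spearman correlations on the validation data, as in evaluate()
corrs = val.groupby("era").apply(lambda d: spearmanr(d["target"], d["prediction"])[0])
validation_mean = corrs.mean()                       # Validation Mean
validation_sd = corrs.std()                          # Validation SD
validation_sharpe = validation_mean / validation_sd  # Validation Sharpe
cum = corrs.cumsum()
max_drawdown = -(cum.cummax() - cum).max()           # Max Drawdown (shown as a negative value)
print(validation_mean, validation_sd, validation_sharpe, max_drawdown)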

** Convenient links related to numerai (will be added, please comment) **

A tool that makes it easy to see your current rank: https://dashboard.numeraipayouts.com/
A tool to calculate total payouts: https://apps.apple.com/app/id1522158691
Numerai Advent Calendar 2020 (a collection of information on raising Corr, organized by @kunigaku): https://adventar.org/calendars/5031
numerati: https://github.com/woobe/numerati

** What I want from Numerai **

(1) Since NMR is a minor altcoin, its price is not stable. The price fluctuates between roughly 20 and 60 USD, so depending on when you enter, the NMR price may drop and you may lose money even if the model you submit to Numerai is excellent. I hope the token price stabilizes.
(2) There are few Japanese documents. If you do not follow the forum, you cannot respond immediately to changes in functionality (for example, when the parameter calculation method was changed: https://forum.numer.ai/t/model-diagnostics-update/902). I would be happy if Japanese articles were expanded and the homepage were translated into Japanese. (It seems this is currently being worked on...)

** In the end **

I am currently around 800th place, which is not that high, but my returns are positive. https://numer.ai/tit_btcqash Although it differs considerably from the code posted here, I use a model that applies seven kinds of neutralization while maintaining Corr to some extent. If you send me about 3 NMR, I will send you an .ipynb file with the code, so please let me know if you are interested.

For tips, NMR: 0x0000000000000000000000000000000000021d96
