[PYTHON] What is labeling in financial forecasting?

Reference: [Finance Machine Learning](https://www.amazon.co.jp/%E3%83%95%E3%82%A1%E3%82%A4%E3%83%8A%E3%83%B3%E3%82%B9%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E2%80%95%E9%87%91%E8%9E%8D%E5%B8%82%E5%A0%B4%E5%88%86%E6%9E%90%E3%82%92%E5%A4%89%E3%81%88%E3%82%8B%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%82%A2%E3%83%AB%E3%82%B4%E3%83%AA%E3%82%BA%E3%83%A0%E3%81%AE%E7%90%86%E8%AB%96%E3%81%A8%E5%AE%9F%E8%B7%B5-%E3%83%9E%E3%83%AB%E3%82%B3%E3%82%B9%E3%83%BB%E3%83%AD%E3%83%9A%E3%82%B9%E3%83%BB%E3%83%87%E3%83%BB%E3%83%97%E3%83%A9%E3%83%89-ebook/dp/B0834XJQTY)

Motivation for this article

When forecasting financial data, you first need to define what you want to forecast, and the approach differs completely depending on that choice. The most familiar target is probably whether the stock price at time $T + 1$ goes up or down, defined by the rate of price change or by its sign. However, such a target can be hard to predict, and even when the accuracy is high, the average return and the Sharpe ratio may still be terrible. Labeling alone cannot solve this problem, but labeling is often neglected even though it carries real meaning.

Examples of labeling in time series data and their interpretation

For example, suppose you have daily OHLC data for the Nikkei Stock Average, and let the closing prices be $X_1, X_2, \ldots, X_T$. If you want to predict the closing price on the next business day, the forecast label is the one-day return

$$Y_n = \frac{X_{n+1} - X_n}{X_n} \quad (1 \leq n \leq T - 1)$$

Predicting this label corresponds to a strategy that opens a new long (or short) position in a product whose underlying asset is the Nikkei Stock Average at today's closing price and closes it at the close of the next business day. Even if you predict this label with an accuracy of 55%, the strategy is not necessarily profitable. Ignore costs for now. If the probability of success is $p$, the profit on success is $\mu_+$, and the loss on failure is $\mu_-$ (both positive), the expected value is $\mu = p \mu_+ - (1 - p) \mu_-$, and $\mu > 0$ requires $p > \frac{\mu_-}{\mu_+ + \mu_-}$. Setting $\mu_- = 1$ without loss of generality gives $p > \frac{1}{\mu_+ + 1}$. With $p = 0.55$, this means $\mu_+ > 0.8181\ldots$, so the average win must be at least about 82% of the average loss. In this way, you have to decide what to predict according to your purpose. In other words, I interpret a forecast label as an investment strategy.
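The label and the break-even condition above can be checked numerically. The following is a minimal sketch in plain Python; the function names and the closing prices are made up for illustration and are not from the book.

```python
def next_day_returns(closes):
    """Y_n = (X_{n+1} - X_n) / X_n, giving one fewer label than prices."""
    return [(nxt - cur) / cur for cur, nxt in zip(closes, closes[1:])]


def expected_value(p, mu_plus, mu_minus):
    """mu = p * mu_plus - (1 - p) * mu_minus (costs ignored)."""
    return p * mu_plus - (1.0 - p) * mu_minus


def breakeven_accuracy(mu_plus, mu_minus):
    """Accuracy at which mu = 0; the strategy needs p strictly above it."""
    return mu_minus / (mu_plus + mu_minus)


closes = [100.0, 102.0, 101.0, 101.0]  # hypothetical closing prices
labels = next_day_returns(closes)      # three labels for four prices

# With mu_minus = 1 and p = 0.55, break-even needs mu_plus = 1/p - 1 = 0.8181...
mu_plus_needed = 1.0 / 0.55 - 1.0
print(labels)
print(mu_plus_needed, expected_value(0.55, mu_plus_needed, 1.0))
```

At `mu_plus_needed` the expected value is zero up to floating-point error, so any smaller average win makes a 55% hit rate a losing strategy.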

Labeling application example

Building on the example above, is there a case where an accuracy above 50% on the forecast label is enough? Consider the following strategy. We make the strong assumption that we can trade at the asset's mid price and that liquidity is excellent (the price does not jump). Open a new position at $T = 0$ and close it when the mid price has moved by +1 bps or -1 bps. This is the simplest binomial model introduced in finance. Because the profit and the loss have the same size, any accuracy above 50% on the forecast label gives a positive expected value.
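To see why the symmetric payoff moves the break-even point to exactly 50%, here is a small sketch that compares the closed-form expected value with a simulation. It assumes independent trades, a fixed hit rate `p`, and no costs; the function names and parameters are illustrative only.

```python
import random


def expected_pnl_bps(p, step_bps=1.0):
    """Expected P&L per trade: +step_bps on a win, -step_bps on a loss."""
    return p * step_bps - (1.0 - p) * step_bps  # equals (2p - 1) * step_bps


def simulate_mean_pnl_bps(p, n_trades=100_000, step_bps=1.0, seed=0):
    """Average P&L over n_trades independent trades with hit rate p."""
    rng = random.Random(seed)
    total = sum(step_bps if rng.random() < p else -step_bps
                for _ in range(n_trades))
    return total / n_trades


for p in (0.50, 0.51, 0.55):
    print(p, expected_pnl_bps(p), simulate_mean_pnl_bps(p))
```

The expected value is $(2p - 1)$ bps per trade, so it is zero at $p = 0.5$ and positive for any accuracy above that.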

So how do you label it?

The data is tick data of order book information (the mid price).

Let me explain using Python code.

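label.py

```python
import numpy as np

# df: tick-level DataFrame with a "mid" column (mid price from the order book)
labels = df["mid"].diff().shift(-1).replace(0, np.nan).bfill()
labels = labels / abs(labels)  # keep only the sign: +1 or -1
```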

- The mid price changes in 1 bps steps, so use `diff` to get the change.
- Next, we want the difference between $X_T$ and $X_{T+1}$, so move the values one step earlier with `shift(-1)`.
- If the difference is 0, no position is closed, so replace 0 with NaN.
- The position is closed only when the difference is non-zero, so fill each NaN backward with the next non-zero move (`bfill`). If the stopping time is $t$, the profit of the strategy opened at time $T = 0$ is $X_t - X_0$.
- Finally, we want two labels (1 or -1), so keep only the sign.
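The steps above can be traced on a tiny example. This sketch assumes a hypothetical mid-price series quoted so that one unit equals 1 bps; the values are made up, and the last rows stay NaN because no later move is observed for them.

```python
import numpy as np
import pandas as pd

# Hypothetical mid prices; one unit is treated as 1 bps for illustration.
df = pd.DataFrame({"mid": [10000.0, 10000.0, 10001.0, 10001.0,
                           10001.0, 10000.0, 10000.0]})

step = df["mid"].diff()            # change from the previous tick
nxt = step.shift(-1)               # change from this tick to the next one
moves = nxt.replace(0, np.nan)     # ticks with no move carry no outcome yet
labels = moves.bfill()             # use the next non-zero move as the outcome
labels = labels / abs(labels)      # keep only the sign: +1 or -1

clean = labels.dropna()            # drop trailing rows with no later move
print(pd.DataFrame({"mid": df["mid"], "label": labels}))
```

Rows 0 and 1 are labeled +1 because the next move is up, rows 2 to 4 are labeled -1 because the next move is down, and rows 5 and 6 have no label.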

Other

The book also introduces other labeling methods, such as the Triple Barrier method and the Trend-Scanning method, so why not try them as a reference?
