[Python] Financial Forecasting Feature Engineering: What are the features in financial forecasting?

Preface

Hi, this is KO. Thank you for always reading. If you have been following the posts on financial labeling, you are probably already thinking about your investment strategy and how to analyze it. Many investors analyze the market first and then design an investment strategy around it, but in financial machine learning the opposite is more common: you first formulate an investment strategy and then fit it to the market. At least, that is how I see it. (Maybe it only applies to me, even if I state it as a generality (laughs).)

References are here.

Basics of the concept of features

First, decide on an investment strategy and label the data; then backtest. By itself, this accomplishes very little. Even if you get good results here, they may be coincidental, and what you really have to ask before operating the strategy is: "Will this strategy still be valid in the future?" In other words, a backtest does not guarantee performance; it merely shows how the investment strategy would have behaved in past markets. There is an important logical relationship here:

"A strategy that is always profitable" ⇒ "The strategy was also profitable in the past"

This implication holds, but the converse clearly does not. It sounds harsh, but to apply a strategy in practice, you need to analyze it with this principle in mind. From this standpoint, I think of features as the variables that make an investment strategy fit the market better. As the saying goes, they are like giving an iron club to a demon: good features make a sound strategy even stronger.

Definition in feature engineering

- In sample: training data
- Out of sample: test data
- Substitution effect: the reduction in one feature's estimated importance caused by the presence of another, related feature

Feature importance and the substitution effect

Mean Decrease Impurity

MDI (Mean Decrease Impurity) is an in-sample method for measuring feature importance that is specific to tree-based classifiers such as random forests (RF). In other words, it is what scikit-learn exposes as `feature_importances_` on a random forest. Let me first explain the MDI method through the code.

import numpy as np
import pandas as pd

def featImpMDI(fit, featNames):
    # Collect feature_importances_ from every tree in the fitted ensemble
    df0 = {i: tree.feature_importances_ for i, tree in enumerate(fit.estimators_)}
    df0 = pd.DataFrame.from_dict(df0, orient='index')
    df0.columns = featNames
    # With max_features=1, an importance of 0 means the feature was never
    # selected by that tree, so treat it as missing rather than as zero
    df0 = df0.replace(0, np.nan)
    # Mean and standard error of each feature's importance across trees
    imp = pd.concat({'mean': df0.mean(), 'std': df0.std()*df0.shape[0]**-.5}, axis=1)
    imp /= imp['mean'].sum()  # normalize so the mean importances sum to 1
    return imp

First, prepare some features. This function builds a dataframe from the `feature_importances_` of each tree in the already fitted model: since a random forest consists of many trees, an importance is computed per tree for every feature. Importances reported as 0 are replaced with np.nan. This is because, with max_features=1, a zero means the feature was simply never chosen by that tree, and averaging in those zeros would bias the result. Finally, the mean and standard deviation of each feature's importance are computed across trees, and the means are normalized so that they sum to 1. For those who have never used a random forest, the basic usage is shown below.

from sklearn.ensemble import RandomForestClassifier as RF

model = RF(max_features=1)  # consider one random feature per split
model.fit(X_train, y_train)
model.feature_importances_  # MDI importances, averaged over the trees
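Putting the two pieces together, here is a minimal sketch on a synthetic dataset (the `make_classification` setup and the feature names `I_*`/`N_*` are illustrative assumptions, not from the original post). With `shuffle=False`, the first five columns are informative and the last five are pure noise, so we can check that MDI ranks them sensibly:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def featImpMDI(fit, featNames):
    # Repeated here so the snippet is self-contained (same function as above)
    df0 = {i: tree.feature_importances_ for i, tree in enumerate(fit.estimators_)}
    df0 = pd.DataFrame.from_dict(df0, orient='index')
    df0.columns = featNames
    df0 = df0.replace(0, np.nan)
    imp = pd.concat({'mean': df0.mean(),
                     'std': df0.std() * df0.shape[0] ** -.5}, axis=1)
    imp /= imp['mean'].sum()
    return imp

# Synthetic problem: 5 informative features (I_*) followed by 5 noise features (N_*)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)
featNames = ['I_%d' % i for i in range(5)] + ['N_%d' % i for i in range(5)]

model = RandomForestClassifier(n_estimators=100, max_features=1, random_state=0)
model.fit(X, y)

imp = featImpMDI(model, featNames)
print(imp.sort_values('mean', ascending=False))
```

If MDI is working as intended, the informative features should, on average, receive higher importance than the noise features, and the 'mean' column should sum to 1.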

How to choose random forest parameters, and caveats of MDI

- Set max_features=1 to avoid the masking effect (systematically ignoring some features while emphasizing others). This forces each split to consider only one randomly chosen feature.
- This technique should only be used in-sample. Because MDI is computed in-sample, every feature receives some importance even if it has no predictive power at all.
- MDI cannot be generalized to non-tree-based classifiers.
- By construction, the MDI importances sum to 1, and each lies between 0 and 1.
- This method does not account for the substitution effect among correlated features. For example, if two identical features are present, each will receive roughly half the importance the feature would receive on its own, so be careful. This is a fairly important point, so I plan to write a separate article about it in the future.

- There may be a bias toward some predictors. In a single decision tree, this bias arises because commonly used impurity functions unfairly favor predictors with many categories (Strobl et al. [2007]).
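The substitution effect in the last two points can be seen directly: if we add an exact copy of a feature, the two copies split the credit between them, so each copy's MDI importance drops well below what the feature earned on its own. A minimal sketch (the synthetic dataset and forest parameters below are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem where every feature is informative
X, y = make_classification(n_samples=1000, n_features=4, n_informative=4,
                           n_redundant=0, shuffle=False, random_state=0)

# Baseline: MDI importance of feature 0 on its own
rf = RandomForestClassifier(n_estimators=200, max_features=1, random_state=0)
rf.fit(X, y)
base_imp = rf.feature_importances_[0]

# Append an exact copy of feature 0 as a fifth column;
# the two identical columns now share feature 0's importance
X_dup = np.column_stack([X, X[:, 0]])
rf2 = RandomForestClassifier(n_estimators=200, max_features=1, random_state=0)
rf2.fit(X_dup, y)

print("feature 0 alone:  %.3f" % base_imp)
print("copy A:           %.3f" % rf2.feature_importances_[0])
print("copy B:           %.3f" % rf2.feature_importances_[-1])
```

Each copy ends up with noticeably less importance than the original feature had by itself, which is exactly the halving effect described above.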

Conclusion

This time, I introduced the role that features play in financial forecasting and what to watch out for. How feature evaluation methods can contribute to an investment strategy is a more technical topic, which I would like to introduce gradually in future posts.
