Features in Predictive Statistics (Python)

I wrote this as a study note, so it may contain mistakes. Feel free to comment. This time, we will further improve the accuracy of the multiple regression done in Python in "Predictive Statistics (Practice Edition Multiple Regression) python". In short, what is needed to improve prediction accuracy is features. This article deals with features.

Contents

・ What is a feature?
・ How to improve prediction accuracy using features
・ Data processing

What is a feature?

Features are explanatory variables. In the world of machine learning, they are usually called features rather than explanatory variables. Features are indispensable for improving the accuracy of an analysis.

How to improve prediction accuracy

There are two ways to improve prediction accuracy using features.
① Make features
② Select features

Make features

What does it mean to make a feature? It means processing the given data and external data to create new features. For example, for regression you might compute means and standard deviations, and for classification you might aggregate the data for people in their twenties only. By doing this, useless data can be eliminated and prediction accuracy can be improved.
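As a minimal sketch of the two examples above, the following uses pandas on a small made-up table. The column names (store, age, sales) and the values are assumptions for illustration only and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "age": [23, 41, 27, 35],
    "sales": [100.0, 140.0, 80.0, 120.0],
})

# New features: per-store mean and standard deviation of sales.
# transform() returns one value per original row, so the result
# can be assigned straight back as a new column.
df["store_sales_mean"] = df.groupby("store")["sales"].transform("mean")
df["store_sales_std"] = df.groupby("store")["sales"].transform("std")

# Restrict the data to people in their twenties only.
twenties = df[(df["age"] >= 20) & (df["age"] < 30)]
```

Whether such aggregates actually help depends on the data, so it is worth checking the model's accuracy before and after adding them.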

Select features

This means selecting features with neither excess nor deficiency. There are three methods for selecting features.
① Univariate analysis
② Model-based selection
③ Iterative selection

Univariate analysis

This analyzes the relationship between the objective variable and each explanatory variable one at a time. In effect, it is a simple regression analysis. Analysis of variance is one example.
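One simple way to look at each feature against the target on its own is sketched below. It scores each feature by its absolute Pearson correlation with the target, which is a stand-in chosen for brevity and not the analysis of variance mentioned above. The data and column names are made up.

```python
import pandas as pd

# Hypothetical toy data: three candidate features and a target y.
df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 1, 4, 3, 5],
    "x3": [3, 5, 1, 4, 2],
    "y": [2, 4, 6, 8, 10],
})

X = df.drop(columns="y")
y = df["y"]

# Score every feature independently against the target,
# then keep the two features with the highest scores.
scores = X.corrwith(y).abs().sort_values(ascending=False)
top_features = scores.index[:2].tolist()
```

Because each feature is scored in isolation, this approach cannot see interactions between features.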

Model-based selection

This method calculates the importance of each feature from the model being built and uses that importance to choose features.
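As a rough sketch of the idea, the following fits a linear model with NumPy on standardized features and treats the absolute size of each coefficient as its importance. Using coefficient magnitude as the importance measure is an assumption made here for illustration; other models define importance differently. The data is synthetic.

```python
import numpy as np

# Synthetic data: feature 0 matters most, feature 1 a little, feature 2 not at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Standardize so that coefficient sizes are comparable across features.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Fit ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(X_std)), X_std])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Importance = absolute coefficient (intercept excluded), largest first.
importance = np.abs(coef[1:])
ranking = np.argsort(importance)[::-1]
```

Features at the end of the ranking are candidates for removal.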

Iterative selection

This improves prediction accuracy by repeatedly adding or removing features. The stepwise method is an example.
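The following is a minimal sketch of one iterative approach, forward selection, which only adds features. The stopping rule (stop when the best candidate reduces the residual sum of squares by less than 10%) and the function names rss and forward_select are assumptions made for this example, not a standard definition of stepwise selection.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of a least-squares fit on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def forward_select(X, y, min_gain=0.1):
    """Add one feature at a time while the relative RSS reduction is large enough."""
    selected = []
    remaining = list(range(X.shape[1]))
    best = rss(X, y, selected)  # intercept-only baseline
    while remaining:
        scores = {c: rss(X, y, selected + [c]) for c in remaining}
        cand = min(scores, key=scores.get)
        if best - scores[cand] < min_gain * best:
            break  # no remaining feature improves the fit enough
        selected.append(cand)
        remaining.remove(cand)
        best = scores[cand]
    return selected

# Synthetic data: only features 0 and 2 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

selected = forward_select(X, y)
```

In practice the gain would usually be measured on held-out data and not on the training fit, since training error can only go down as features are added.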

Data processing

I explained that thinking about features is important for improving prediction accuracy. Here, I will explain how to actually process the data to make features. There are various methods for selecting features, so I will cover them in a separate article at a later date.

Convenient function

There are functions that are useful for processing data. This time, I would like to introduce the following two.
・ split function
・ apply function

split function

This is a function that splits a string. If you pass the delimiter character as the argument, the string is split at that character, and the delimiter itself is removed from the result.
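As a small illustration, here is Python's built-in str.split applied to a date string. The variable names are only for this example.

```python
date = "2019-12-12"

# Split at "-"; the delimiter does not appear in the result.
parts = date.split("-")  # ["2019", "12", "12"]

# Unpack the pieces into separate variables.
year, month, day = parts
```

Note that every piece is still a string, so year is "2019" and not the number 2019.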

apply function

This is a function that applies a given function to each value in the data. In data processing, you can easily transform values by passing an anonymous function (lambda function) as the argument.
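A minimal sketch with a pandas Series follows. The values and variable names are made up for illustration.

```python
import pandas as pd

prices = pd.Series([100, 250, 400])

# The lambda is called once per value and the results form a new Series.
doubled = prices.apply(lambda x: x * 2)

# Any callable works, not only a lambda; here the built-in len is used.
names = pd.Series(["apple", "fig"])
lengths = names.apply(len)
```

The original Series is left unchanged; apply returns a new one.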

Code

I will explain how to actually use these functions. For example, suppose the date column contains date strings such as "2019-12-12". If you want to put only the year into a column called year, write the following.

df["year"] = df["date"].apply(lambda x: x.split("-")[0])
