[PYTHON] Need for random_state for train_test_split in sklearn

What is train_test_split?

train_test_split is a commonly used scikit-learn function in machine learning. It takes a NumPy array or a pandas DataFrame and splits it into train data and test data in a single call.

What is random_state?

random_state is the seed for the random shuffle that train_test_split performs before dividing the data. Fixing it means the same random division is returned no matter how many times the code is executed. If you do not pass random_state to this function, the train data changes on every execution. Hyperparameter tuning then becomes meaningless, because scores from different runs are measured on different splits. It also causes trouble when you build multiple models and take a majority vote (an ensemble method): each model is trained on different rows, so rows that serve as test data in one split have already been learned in another, which leads to overfitting.

First, create a data frame for the experiment.

import pandas as pd
from sklearn.model_selection import train_test_split

# Small 5-row DataFrame so that each split is easy to read
df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [1, 2, 3, 4, 5]})
print(df)
   a  b
0  1  1
1  2  2
2  3  3
3  4  4
4  5  5

Next, call train_test_split. Passing a float to train_size or test_size specifies a proportion of the data, while passing an int specifies an absolute number of rows. Because this is easy to confuse, I think it is better to specify both explicitly, here train_size=4 and test_size=1.

train_x, test_x = train_test_split(df, train_size=4, test_size=1)
print(train_x)
print(test_x)
   a  b
1  2  2
2  3  3
0  1  1
3  4  4
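
The difference between float and int sizes described above can be checked with a minimal sketch. It reuses the same 5-row DataFrame; the variable names and random_state=0 are illustrative assumptions, chosen only so that the two calls are comparable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [1, 2, 3, 4, 5]})

# int: absolute number of rows in each part
train_int, test_int = train_test_split(
    df, train_size=4, test_size=1, random_state=0
)

# float: proportion of rows in each part (0.8 * 5 = 4 rows, 0.2 * 5 = 1 row)
train_float, test_float = train_test_split(
    df, train_size=0.8, test_size=0.2, random_state=0
)

print(len(train_int), len(test_int))      # 4 1
print(len(train_float), len(test_float))  # 4 1
```

On this 5-row frame both forms produce 4 training rows and 1 test row; the int form simply makes the row counts visible in the call itself.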

Code without random_state

This code returns different train and test data each time.

for i in range(3):
    train_x, test_x = train_test_split(df, train_size=4, test_size=1)
    print()
    print(i,"Time")
    print(train_x)
    print(test_x)
0 Time
   a  b
2  3  3
3  4  4
4  5  5
0  1  1
   a  b
1  2  2

1 Time
   a  b
2  3  3
4  5  5
1  2  2
0  1  1
   a  b
3  4  4

2 Time
   a  b
4  5  5
0  1  1
1  2  2
3  4  4
   a  b
2  3  3

For example, suppose you train three different models on these three splits and take a majority vote (an ensemble method). The train data learned by each model is different: the row held out as test data in one split appears in the train data of the other two. Taken together, the models have no unseen rows left, so the majority vote is expected to score deceptively high on the test data. random_state is required to avoid this kind of overfitting.
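
The overlap described above can be checked programmatically. The following is a minimal sketch, not the author's code: the helper rows_seen_in_training is hypothetical, and three different explicit seeds (0, 1, 2) are assumed as a reproducible stand-in for calling train_test_split without random_state.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [1, 2, 3, 4, 5]})


def rows_seen_in_training(seeds):
    """Hypothetical helper: split once per seed and collect the indices.

    Returns the set of row indices used for training in any split and
    the list of row indices held out as test data, one per split.
    """
    seen = set()
    held_out = []
    for seed in seeds:
        train_x, test_x = train_test_split(
            df, train_size=4, test_size=1, random_state=seed
        )
        seen.update(int(i) for i in train_x.index)
        held_out.append(int(test_x.index[0]))
    return seen, held_out


# A different seed per model stands in for omitting random_state
seen_varied, held_varied = rows_seen_in_training([0, 1, 2])
# The same seed for every model
seen_fixed, held_fixed = rows_seen_in_training([42, 42, 42])

# Test rows that some model has already been trained on
leaked_varied = [i for i in held_varied if i in seen_varied]
leaked_fixed = [i for i in held_fixed if i in seen_fixed]

print("varied seeds: held out", held_varied, "leaked", leaked_varied)
print("fixed seed:   held out", held_fixed, "leaked", leaked_fixed)
```

With a fixed seed the held-out row is the same in every split, so it never appears in any training set. With varying seeds, any two splits that hold out different rows each train on the other's test row.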

Code with random_state

for i in range(3):
    train_x, test_x = train_test_split(df, train_size=4, test_size=1, random_state=42)
    print()
    print(i,"Time")
    print(train_x)
    print(test_x)
0 Time
   a  b
4  5  5
2  3  3
0  1  1
3  4  4
   a  b
1  2  2

1 Time
   a  b
4  5  5
2  3  3
0  1  1
3  4  4
   a  b
1  2  2

2 Time
   a  b
4  5  5
2  3  3
0  1  1
3  4  4
   a  b
1  2  2

Because test_x is fixed to the row with a=2, b=2 (index 1), the same random division is returned no matter how many times the code is executed. That row stays unseen during training, so the overfitting described earlier can be avoided.
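
Rather than comparing printed output by eye, the reproducibility can also be asserted in code. This is a minimal sketch under the assumption that a single seed constant is shared by every call; the name SEED is illustrative and not part of scikit-learn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [1, 2, 3, 4, 5]})

SEED = 42  # illustrative constant; any fixed integer works

first_train, first_test = train_test_split(
    df, train_size=4, test_size=1, random_state=SEED
)
second_train, second_test = train_test_split(
    df, train_size=4, test_size=1, random_state=SEED
)

# DataFrame.equals compares both the values and the row order
same_split = first_train.equals(second_train) and first_test.equals(second_test)
print(same_split)  # True
```

Keeping the seed in one named constant makes it harder to accidentally use different values in different parts of an experiment.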
