[PYTHON] Random seed research in machine learning

Introduction

While thinking about random number generation, I got worried enough that I couldn't sleep, so I wrote up this summary.

The conclusion first

In machine learning code, running a function like the following at the start usually makes results reproducible.

seal_seed.py


import random

import numpy as np
import tensorflow as tf
import torch


def fix_seed(seed):
    # random
    random.seed(seed)
    # NumPy
    np.random.seed(seed)
    # PyTorch
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    # TensorFlow
    tf.random.set_seed(seed)

SEED = 42
fix_seed(SEED)

Is this really enough? I was worried, but for fixing seeds this is all you need. However, there are a few things to be aware of, such as the difference between the global seed and RandomState, and behavior around the GPU, so I will explain a little.

Fixing the seed of Python's built-in random module

random --- Generate Pseudo-Random Numbers — Python 3.8.3 Documentation

random.seed(seed)

By default, the current system time is used, but some OSs have OS-specific random number sources.

A pseudo-random number generator called the [Mersenne Twister](https://ja.wikipedia.org/wiki/%E3%83%A1%E3%83%AB%E3%82%BB%E3%83%B3%E3%83%8C%E3%83%BB%E3%83%84%E3%82%A4%E3%82%B9%E3%82%BF) is used.
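
As a quick check (a minimal sketch using only the standard library), resetting the seed reproduces the same sequence:

import random

random.seed(42)
first = [random.randint(0, 999) for _ in range(3)]

random.seed(42)  # reset to the same seed
second = [random.randint(0, 999) for _ in range(3)]

print(first == second)  # -> True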

Fixing the seed in NumPy

Note that NumPy's random number generation produces different values on each call: setting the seed once does not make every subsequent call return the same result, because the generator's internal state advances.

import numpy as np
np.random.seed(42)
# First time
print(np.random.randint(0, 1000, 10))
# -> [102 435 860 270 106  71 700  20 614 121]

# Second time
print(np.random.randint(0, 1000, 10))
# -> [466 214 330 458  87 372  99 871 663 130]

If you want identical values, set the seed again before each call.

import numpy as np
np.random.seed(42)
# First time
print(np.random.randint(0, 1000, 10))
# -> [102 435 860 270 106  71 700  20 614 121]

# Second time
np.random.seed(42)
print(np.random.randint(0, 1000, 10))
# -> [102 435 860 270 106  71 700  20 614 121]

Even across different environments and OSs, as long as the initial seed is the same, the subsequent outputs appear to be the same.

If all you need is experimental reproducibility, fixing the seed once at the beginning, as shown above, should be enough.

Libraries that use NumPy

np.random.seed(42) is basically fine, but be careful when a seed is also fixed inside an external module. If the module overwrites it with something like np.random.seed(43), the caller's seed is overwritten as well.
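
Here is a small illustration; external_module_function is a hypothetical stand-in for library code that resets the seed:

import numpy as np

def external_module_function():
    # Hypothetical library code that carelessly resets the global seed
    np.random.seed(43)

np.random.seed(42)
external_module_function()
# This now draws from the state seeded with 43, not the caller's 42
print(np.random.randint(0, 1000, 3))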

Libraries such as Optuna and pandas take this into account and create a separate random number generator with numpy.random.RandomState.
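
A minimal sketch of that pattern: a numpy.random.RandomState instance carries its own state, so drawing from it leaves the global generator untouched.

import numpy as np

np.random.seed(42)
rs = np.random.RandomState(43)  # independent generator with its own state

print(rs.randint(0, 1000, 3))         # drawn from the generator seeded with 43
print(np.random.randint(0, 1000, 3))  # global state seeded with 42 is untouched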

np.random.seed(42)
'''
Some processing
'''
df.sample(frac=0.5, replace=True, random_state=43)

Passing random_state=43 as an argument fixes the seed that pandas uses.

This way, the NumPy seed fixed at the beginning is not overwritten by 43.

import numpy as np
import pandas as pd

s = pd.Series(np.arange(100))
np.random.seed(42)
# First run with seed 42
print(s.sample(n=3))  # -> (83, 53, 70)
# The second time, a different random state is used
print(s.sample(n=3))  # -> (79, 37, 65)

print(s.sample(n=3, random_state=42))  # -> (83, 53, 70)
print(s.sample(n=3, random_state=42))  # -> (83, 53, 70)

Also, just like with NumPy, note that the seed is not fixed from the second call onwards. Either keep a generator in a variable or set random_state on every call, as in the sketch below.
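
As a sketch of the "save it in a variable" option: sample also accepts a RandomState instance, so each call draws different rows, but the sequence as a whole repeats from run to run.

import numpy as np
import pandas as pd

s = pd.Series(np.arange(100))

rng = np.random.RandomState(42)  # keep the generator in a variable
print(s.sample(n=3, random_state=rng))  # first draw
print(s.sample(n=3, random_state=rng))  # different rows, but the whole
                                        # sequence repeats from run to run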

If you execute a Jupyter notebook from top to bottom and the total number of random calls ends up the same, you can keep reproducibility by setting np.random.seed(42) once at the beginning.

However, note that reproducibility may still break slightly when using a GPU, as described later.

Fixing the seed in scikit-learn

You can specify random_state in scikit-learn's train_test_split function, but there is no way to fix the seed for scikit-learn as a whole.

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=SEED)

How to set the global random_state in Scikit Learn | Bartosz Mikulski

According to the link above, fixing NumPy's random seed is enough, but be careful: the result still changes every time you execute split from the second time onwards unless you pass random_state on each call.
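
A minimal check of that behavior: passing random_state on every call makes each split come out identical.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
Y = np.arange(10)

# Passing random_state on every call makes each split reproducible
X_train1, _, _, _ = train_test_split(X, Y, random_state=42)
X_train2, _, _, _ = train_test_split(X, Y, random_state=42)
print(np.array_equal(X_train1, X_train2))  # -> True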

Fixing the seed in Optuna

How can I obtain reproducible optimization results?

import optuna
from optuna.samplers import TPESampler

sampler = TPESampler(seed=SEED)  # Make the sampler behave in a deterministic way.
study = optuna.create_study(sampler=sampler)
study.optimize(objective)

Optuna prepares a separate RandomState instance internally, which is why you can specify the seed this way; RandomState is used under the hood.
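
Below is a toy end-to-end sketch; the objective function is made up for illustration, and it assumes a reasonably recent Optuna where suggest_float is available (runs should also stay single-process for determinism):

import optuna
from optuna.samplers import TPESampler

def objective(trial):
    # A made-up toy objective, just to illustrate determinism
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2

study = optuna.create_study(sampler=TPESampler(seed=42))
study.optimize(objective, n_trials=20)
print(study.best_params)  # identical on every run with the same seed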

Fixing the seed in LightGBM

When using cross-validation, the seed can be set like this:

import lightgbm as lgb

lgb.cv(lgbm_params,
       lgb_train,
       early_stopping_rounds=10,
       nfold=5,
       shuffle=True,
       seed=42,
       callbacks=callbacks,
       )

The manual says:

Seed used to generate the folds (passed to numpy.random.seed)

Because of that wording, I thought, "Oh! Is the global seed going to be rewritten?" But looking at the source code, it is randidx = np.random.RandomState(seed).permutation(num_data), so a temporary RandomState is used and there is no problem.
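
A quick way to convince yourself: drawing through a throwaway RandomState leaves the global NumPy stream exactly where it was.

import numpy as np

np.random.seed(42)
a = np.random.randint(0, 1000, 3)

np.random.seed(42)
_ = np.random.RandomState(99).permutation(10)  # the pattern used in lgb.cv
b = np.random.randint(0, 1000, 3)

print(np.array_equal(a, b))  # -> True: the throwaway RandomState did not
                             #    disturb the global stream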

Also, when using the scikit-learn API, the seed can be set like this:

clf = lgb.LGBMClassifier(random_state=42)

The manual states that the C++ default seed is used if it is not set:

If None, default seeds in C++ code are used.

If you start wondering what the C++ default seed is, there is no end to it, so I will stop here.

Fixing the seed in PyTorch

Reproducibility — PyTorch 1.5.0 documentation

import torch

torch.manual_seed(seed)
# For cuDNN
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

There is also a method called torch.cuda.manual_seed_all(seed), but with recent PyTorch, torch.manual_seed(seed) alone is enough.
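
A minimal check of that claim on the CPU side (on recent PyTorch the same call also covers the CUDA generators):

import torch

torch.manual_seed(42)
a = torch.rand(3)

torch.manual_seed(42)
b = torch.rand(3)

print(torch.equal(a, b))  # -> True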

Also, the manual says:

Deterministic operation may have a negative single-run performance impact, depending on the composition of your model. Due to different underlying operations, which may be slower, the processing speed (e.g. the number of batches trained per second) may be lower than when the model functions nondeterministically. However, even though single-run speed may be slower, depending on your application determinism may save time by facilitating experimentation, debugging, and regression testing.

In short, note that making GPU processing deterministic may slow it down.

When reproducibility does not matter and the network structure (computation graph) does not change, setting torch.backends.cudnn.benchmark = True can speed things up.

Fixing the seed in TensorFlow

Basically, fix the seed as shown below:

tf.random.set_seed(seed)

However, you can also specify the seed value at the operation level as shown below.

tf.random.uniform([1], seed=1)
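
A minimal sketch combining the two levels: the pair of global seed and operation-level seed determines the sequence, so resetting both reproduces the same values.

import tensorflow as tf

tf.random.set_seed(42)
a = tf.random.uniform([3], seed=1)

tf.random.set_seed(42)
b = tf.random.uniform([3], seed=1)

print(bool(tf.reduce_all(a == b)))  # -> True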

Deep learning frameworks and GPU seeding

To be honest, I could not find much information about TensorFlow on the GPU. Random number generation on GPUs seems to have some deep-rooted problems, and the situation is completely different between software and hardware.

NVIDIA/tensorflow-determinism: Tracking, debugging, and patching non-determinism in TensorFlow

Just as PyTorch runs the risk of slowing down, you should assume there is a trade-off between reproducibility and GPU performance.

Since data types such as FP16 and INT8 may be used inside the GPU for speed, rounding errors may not be negligible. There seem to be many things to think about in order to maintain reproducibility; a toy illustration follows.
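
As a toy illustration of why reduced precision matters (plain NumPy scalars, nothing GPU-specific): accumulating many small values in float16 eventually rounds the additions away entirely.

import numpy as np

total16 = np.float16(0.0)
for _ in range(10000):
    total16 += np.float16(0.1)  # accumulate in half precision

total32 = np.float32(0.0)
for _ in range(10000):
    total32 += np.float32(0.1)  # accumulate in single precision

print(total16)  # far below the true value of 1000; rounding error dominates
print(total32)  # close to 1000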

Where did seed = 42 come from?

"The answer to the ultimate question about life, the universe, and all things was released by the supercomputer Deep Thought in the novel The Hitchhiker's Guide to the Galaxy. % 94% 9F% E5% 91% BD% E3% 80% 81% E5% AE% 87% E5% AE% 99% E3% 80% 81% E3% 81% 9D% E3% 81% 97% E3% 81 % A6% E4% B8% 87% E7% 89% A9% E3% 81% AB% E3% 81% A4% E3% 81% 84% E3% 81% A6% E3% 81% AE% E7% A9% B6 % E6% A5% B5% E3% 81% AE% E7% 96% 91% E5% 95% 8F% E3% 81% AE% E7% AD% 94% E3% 81% 88) "is 42.

What is it about the random seed "4242"? | Kaggle

On Kaggle, code is frequently ~~copied~~ reused, so seed = 42, which someone originally used as a joke, became popular.

Nowadays, people sometimes ensemble the predictions of models trained with different seed values.

Summary

- NumPy-related random number generation returns different values on every call, so be careful.
- Reproducibility cannot be maintained unless you set the seed explicitly, especially when random numbers are drawn on every method call or when you do not know how many times something will be called.
- When using an external library, set random_state on every call.
- When writing your own module, create a separate RandomState so that you do not overwrite NumPy's global seed.
- Random number generation around the GPU is quite complicated; there is a trade-off between processing speed and reproducibility (or rather, accuracy?).

For simple experiment code, see machine_leraning_experiments/random_seed_experiment.ipynb at master · si1242/machine_leraning_experiments
