[PYTHON] Make sure preprocessing is consistent between prediction model creation and prediction

Introduction

Building a predictive model is all well and good, but when you make predictions, are the inputs prepared under exactly the same conditions as when the model was built? That is the topic of this post. It seems to be a very important point when operating a machine learning system.

Especially in the field of chemoinformatics, models are often built by combining various commercial and free software: the compounds are preprocessed with tool A, the descriptors are calculated with tool B, the prediction model is built with tool C, and so on. Building a model that way is fine, but this time I verified what happens if the user does not apply the same preprocessing at prediction time.

Environment

Verification scenario

There are many kinds of preprocessing, but since I happened to run into this one, I used the following scenario.

- When handling compound data, hydrogens may or may not be added explicitly (obvious ones are often omitted).
- In RDKit's Morgan fingerprint, the values of the explanatory variables that are output differ to a fair extent depending on whether hydrogens are explicitly added.
- This time, I verify how much the predicted values fluctuate depending on whether the explicit-hydrogen condition is the same for the compounds given when building the prediction model and the compounds given when predicting.

RDKit Morgan Fingerprint Calculation Method

What is RDKit's Morgan fingerprint in the first place? In code, it looks like this.

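from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCC")  # build a molecule object from the SMILES string
mol = Chem.AddHs(mol)  # explicitly add hydrogens
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048, useFeatures=False, useChirality=False)  # 2048-bit descriptor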

`Chem.MolFromSmiles` creates a compound object from "CCC", a SMILES string that represents a compound. `Chem.AddHs` then explicitly adds hydrogens to the compound, and `GetMorganFingerprintAsBitVect` performs the descriptor calculation. The result of the descriptor calculation is a 2048-bit array in which each bit is 0 or 1.

Prediction models are built and predictions are made from this fingerprint. I checked how the prediction results differ among the cases where `Chem.AddHs(mol)` is applied neither when building the prediction model nor when predicting, applied at both times, and applied only when building the prediction model.
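To check the claim that explicit hydrogens change the fingerprint itself, the two variants can be compared directly for one molecule. This is a minimal sketch, not the code used for the experiment: the helper name `morgan_fp` and its `add_hs` flag are introduced here for illustration only, and the Tanimoto similarity is just one convenient way to summarize the difference.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def morgan_fp(smiles, add_hs):
    """Morgan fingerprint with the parameters used above.

    `morgan_fp` and `add_hs` are hypothetical names for this sketch;
    `add_hs` toggles the explicit-hydrogen preprocessing step.
    """
    mol = Chem.MolFromSmiles(smiles)
    if add_hs:
        mol = Chem.AddHs(mol)  # explicit hydrogens become real atoms in the graph
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048)


fp_without_h = morgan_fp("CCC", add_hs=False)
fp_with_h = morgan_fp("CCC", add_hs=True)

# Compare which bit positions are set in each fingerprint.
bits_without_h = set(fp_without_h.GetOnBits())
bits_with_h = set(fp_with_h.GetOnBits())
similarity = DataStructs.TanimotoSimilarity(fp_without_h, fp_with_h)

print("on bits without explicit H:", len(bits_without_h))
print("on bits with explicit H:   ", len(bits_with_h))
print("bits only set with explicit H:", sorted(bits_with_h - bits_without_h))
print("Tanimoto similarity:", similarity)
```

Keeping the hydrogen step behind a single flag like this makes it easier to record which setting a model was built with.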

Results and discussion

Using about 100 training compounds and about 10,000 prediction target compounds, the correlations between the predictions obtained with the following three patterns are summarized in a table.

- Prediction model creation: without hydrogen, prediction: without hydrogen
- Prediction model creation: with hydrogen, prediction: with hydrogen
- Prediction model creation: with hydrogen, prediction: without hydrogen

The results are as follows.

image.png

For a prediction model built from descriptors calculated with hydrogens explicitly added to the training data, the predicted values obtained when hydrogens are omitted from the prediction target data have a correlation of only about 0.48 with the predicted values obtained when hydrogens are explicitly added. The relationship between the two is plotted below. That is a considerable discrepancy.

image.png

This value of 0.48 is even lower than the correlation of 0.58 between the predictions made with hydrogens and those made without hydrogens when the conditions are kept the same at model creation and at prediction. There is some debate about whether input with or without hydrogens is more appropriate for the Morgan fingerprint (in some cases it is not specified at all), but first of all it seems important to keep the input conditions properly aligned.
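The article does not say which correlation measure was used. Assuming it is the Pearson correlation coefficient, the comparison between two sets of predictions can be sketched as follows. The numbers below are made up purely for illustration and are not the article's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up predicted values for the same five compounds (illustration only):
# one model, descriptors computed with and without explicit hydrogens.
pred_with_h = [1.2, 0.8, 2.5, 1.9, 0.3]
pred_without_h = [0.9, 1.4, 1.1, 2.2, 0.7]

r = pearson(pred_with_h, pred_without_h)
print(f"correlation between the two sets of predictions: {r:.2f}")
```

A value close to 1 would mean the preprocessing difference barely matters for the model; the lower the value, the more the predictions depend on it.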

Conclusion

Make sure that the preprocessing conditions are the same when building a prediction model and when making predictions. Ideally the system should provide the whole pipeline, preprocessing included, but if that is not possible for some reason, document the conditions clearly.
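One way to provide preprocessing "on the system side" is to store the descriptor calculation together with the model, so that training and prediction cannot use different settings. The following is a minimal sketch under stated assumptions: `PreprocessedModel`, `toy_featurize`, and `NearestNeighbourRegressor` are hypothetical names invented for this example, and the featurizer is a toy stand-in, not a real fingerprint. In practice the featurizer would be the actual descriptor calculation, including the hydrogen handling.

```python
from typing import Callable, List, Sequence


def toy_featurize(smiles: str) -> List[float]:
    """Toy stand-in for a descriptor calculation (NOT a real fingerprint)."""
    return [float(smiles.count("C")), float(smiles.count("O")), float(len(smiles))]


class NearestNeighbourRegressor:
    """Tiny 1-nearest-neighbour regressor, only here to keep the sketch self-contained."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[float]):
        self._X = [list(row) for row in X]
        self._y = list(y)
        return self

    def predict(self, X: Sequence[Sequence[float]]) -> List[float]:
        predictions = []
        for row in X:
            # Squared Euclidean distance to every training row.
            distances = [sum((a - b) ** 2 for a, b in zip(row, ref)) for ref in self._X]
            predictions.append(self._y[distances.index(min(distances))])
        return predictions


class PreprocessedModel:
    """Bundles the featurizer with the model so both phases share one code path."""

    def __init__(self, featurize: Callable[[str], List[float]], model):
        self.featurize = featurize
        self.model = model

    def fit(self, raw_inputs: Sequence[str], targets: Sequence[float]):
        self.model.fit([self.featurize(r) for r in raw_inputs], targets)
        return self

    def predict(self, raw_inputs: Sequence[str]) -> List[float]:
        # Same featurizer as in fit(); callers never preprocess by themselves.
        return self.model.predict([self.featurize(r) for r in raw_inputs])


wrapped = PreprocessedModel(toy_featurize, NearestNeighbourRegressor())
wrapped.fit(["C", "CC", "CCO"], [1.0, 2.0, 3.0])
predictions = wrapped.predict(["CC", "CCO"])
print(predictions)
```

Because callers only ever pass raw SMILES strings to `fit` and `predict`, there is no place where a user could forget or change a preprocessing step.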
