[PYTHON] Learning scikit-learn with chemoinformatics

Introduction

Following on from "Matplotlib learned with chemoinformatics", this article explains scikit-learn, one of Python's representative libraries, using lipidomics (the comprehensive analysis of lipids) as the theme. It focuses on practical examples from chemoinformatics, so if you want to check the basics first, please read the following article before this one.

Pharmaceutical researcher summarized scikit-learn

Data set preparation

scikit-learn is a library for machine learning.

Here, we consider predicting the retention time (RT) in liquid chromatography (LC) from the physical properties of compounds using partial least squares (PLS) regression.

First, create a dataset for machine learning.

import pandas as pd


params_fatty_acids = ['Heavy atoms', 'Rotatable Bonds', 'van der Waals Molecular Volume', 'logP', 'Molar Refractivity']

lauric = [14, 10, 231.10, 3.99, 59.48]
myristic = [16, 12, 265.70, 4.77, 68.71]
palmitic = [18, 14, 300.30, 5.55, 77.95]
palmitoleic = [18, 13, 297.66, 5.33, 77.85]
stearic = [20, 16, 334.90, 6.33, 87.18]
oleic = [20, 15, 332.26, 6.11, 87.09]
linoleic = [20, 14, 329.62, 5.88, 86.99]
linolenic = [20, 13, 326.98, 5.66, 86.90]
stearidonic = [20, 12, 324.34, 5.44, 86.81]
arachidic = [22, 18, 369.50, 7.11, 96.42]
bishomo_gamma_linolenic = [22, 15, 361.58, 6.44, 96.13]
arachidonic = [22, 14, 358.94, 6.22, 96.04]
eicosapentaenoic = [22, 13, 356.30, 5.99, 95.95]
behenic = [24, 20, 404.10, 7.89, 105.65]
adrenic = [24, 16, 393.54, 7.00, 105.27]
docosapentaenoic = [24, 15, 390.90, 6.77, 105.18]
docosahexaenoic = [24, 14, 388.26, 6.55, 105.09]

df_fatty_acids = pd.DataFrame([lauric, myristic, palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, stearidonic, arachidic, bishomo_gamma_linolenic, arachidonic, eicosapentaenoic, behenic, adrenic, docosapentaenoic, docosahexaenoic], columns=params_fatty_acids)
df_fatty_acids['Experimental Retention Time (min)'] = [4.53, 7.52, 11.02, 10.59, 14.45, 11.86, 9.76, 8.31, 6.71, 17.52, 11.20, 9.96, 8.27, 20.40, 12.75, 11.52, 9.84]

print(df_fatty_acids)

Here, params_fatty_acids is the list of physical property parameter names used as explanatory variables. The physical property values are taken from the LIPID MAPS database. The RT values are taken from the reversed-phase LC RT data published on RIKEN's PRIMe website (.riken.jp/Metabolomics_Software/MrmDatabase/Detail%20of%20LCQqQMS%20method%20(ODS-lipids).xlsx). In practice, data are more often read from a CSV file with pandas.read_csv or similar, and preprocessing such as missing value imputation is often required as well.
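As a rough illustration of the CSV route mentioned above, here is a minimal sketch. The CSV content, the reduced set of columns, and the blank logP cell are hypothetical and exist only to show the mechanics. Filling with the column mean is just one simple choice of imputation, not necessarily the right one for real data.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file. The empty logP cell
# for myristic acid is an artificial missing value added for illustration.
csv_text = """Name,Heavy atoms,Rotatable Bonds,logP
lauric,14,10,3.99
myristic,16,12,
palmitic,18,14,5.55
"""

# pandas.read_csv accepts a file path or a file-like object.
df = pd.read_csv(io.StringIO(csv_text), index_col='Name')

# Count the missing values in each column.
print(df.isna().sum())

# One simple option: replace each missing value with its column mean.
df_filled = df.fillna(df.mean())
print(df_filled)
```

With a real file, the io.StringIO object would be replaced by the file path.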

Model building

Next, we will build a prediction model and calculate the prediction value using the model.

from sklearn.cross_decomposition import PLSRegression


X = df_fatty_acids[params_fatty_acids] #Explanatory variable
y = df_fatty_acids['Experimental Retention Time (min)'] #Objective variable

pls_rt = PLSRegression()
pls_rt.fit(X, y) #Build a PLS prediction model

y_pred = pls_rt.predict(X) #Calculate the predicted value

df_fatty_acids['Predicted Retention Time (min)'] = y_pred
# Signed prediction error: predicted minus experimental RT
df_fatty_acids['Diff (min)'] = df_fatty_acids['Predicted Retention Time (min)'] - df_fatty_acids['Experimental Retention Time (min)']
# Despite the column name, this is the signed relative error in percent (closer to 0 is better)
df_fatty_acids['Accuracy (%)'] = (df_fatty_acids['Diff (min)'] / df_fatty_acids['Experimental Retention Time (min)']) * 100

print(df_fatty_acids)

The relationship between the measured value and the predicted value is shown below.

%matplotlib inline
import matplotlib.pyplot as plt


plt.scatter(y, y_pred)
plt.xlabel('Experimental Retention Time (min)')
plt.ylabel('Predicted Retention Time (min)')

plt.savefig('rts_fatty_acids.png')
plt.show()

[Figure: rts_fatty_acids.png, scatter plot of predicted versus experimental retention time (min)]

For this data, the measured and predicted values appear to agree well. The goodness of fit of the model can be checked with r2_score.

from sklearn.metrics import r2_score


print(r2_score(y, y_pred))

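r2_score returns the coefficient of determination (R²). Its best possible value is 1, and the closer it is to 1, the better the measured and predicted values agree; it can be negative for a model that fits worse than simply predicting the mean. For this data, r2_score exceeds 0.98, so the model fits the data it was built on quite well.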
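To make the meaning of r2_score concrete, here is a minimal sketch that computes the coefficient of determination from its usual definition, 1 - SS_res / SS_tot, and compares it with scikit-learn's result. The toy values and the helper name r2_manual are made up for illustration and are not taken from the fatty acid dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values chosen only for illustration.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_good = np.array([1.1, 1.9, 3.2, 3.8])  # close to the true values
y_bad = np.array([4.0, 3.0, 2.0, 1.0])   # reversed order


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


print(r2_manual(y_true, y_good), r2_score(y_true, y_good))
print(r2_manual(y_true, y_bad), r2_score(y_true, y_bad))
```

The second pair of values is negative, which shows that the score is not bounded below by 0.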

We used five physical property parameters to predict RT; let's see which of them contributes most to the prediction.

print(pls_rt.coef_)

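From this result, the coefficient for Rotatable Bonds has the largest absolute value in this data, 3.44, which indicates that this physical property contributes strongly to the prediction of RT. Note that coefficient magnitudes can only be compared fairly when the explanatory variables are on a comparable scale.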

Prediction using a model

So far we have discussed the prediction accuracy for the data used to build the model. Finally, let's see how accurately the model can predict data that was not used to build it.

lignoceric = [26, 22, 438.70, 8.67, 114.88]
x_lignoceric = pd.DataFrame([lignoceric], columns=params_fatty_acids)
y_pred_lignoceric = pls_rt.predict(x_lignoceric)

y_exp_lignoceric = 22.31 #Measured value

print(y_exp_lignoceric)
print(y_pred_lignoceric)

Here, I predicted the RT of lignoceric acid (FA 24:0). The difference between the predicted and measured values is about 1.2 minutes. Opinions may differ on whether this difference is large or small, but I personally think the prediction accuracy is rather low. The likely reason is that lignoceric acid is more hydrophobic than any fatty acid molecular species in the dataset used for model construction, so the model was never fitted to data with physical properties close to those of lignoceric acid.
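One simple way to check whether a new sample lies outside the range covered by the training data is to compare each of its property values with the minimum and maximum of the training set. The sketch below is my own illustration of that check and is not part of the original analysis. The names train, new and report are made up for this example.

```python
import pandas as pd

params = ['Heavy atoms', 'Rotatable Bonds', 'van der Waals Molecular Volume',
          'logP', 'Molar Refractivity']
# The 17 fatty acids used to build the model.
rows = [
    [14, 10, 231.10, 3.99, 59.48], [16, 12, 265.70, 4.77, 68.71],
    [18, 14, 300.30, 5.55, 77.95], [18, 13, 297.66, 5.33, 77.85],
    [20, 16, 334.90, 6.33, 87.18], [20, 15, 332.26, 6.11, 87.09],
    [20, 14, 329.62, 5.88, 86.99], [20, 13, 326.98, 5.66, 86.90],
    [20, 12, 324.34, 5.44, 86.81], [22, 18, 369.50, 7.11, 96.42],
    [22, 15, 361.58, 6.44, 96.13], [22, 14, 358.94, 6.22, 96.04],
    [22, 13, 356.30, 5.99, 95.95], [24, 20, 404.10, 7.89, 105.65],
    [24, 16, 393.54, 7.00, 105.27], [24, 15, 390.90, 6.77, 105.18],
    [24, 14, 388.26, 6.55, 105.09],
]
train = pd.DataFrame(rows, columns=params)

# Lignoceric acid, the sample that was not used for model building.
new = pd.Series([26, 22, 438.70, 8.67, 114.88], index=params)

# Compare the new sample with the range seen during training.
report = pd.DataFrame({
    'train min': train.min(),
    'train max': train.max(),
    'new sample': new,
})
report['outside range'] = (new < train.min()) | (new > train.max())
print(report)
```

Every property of lignoceric acid exceeds the training maximum, so the prediction is an extrapolation rather than an interpolation.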

PLS regression can be performed with the procedure above. Although not covered here, the number of latent variables, n_components, is also important in PLS regression. This time I used the default value of 2; changing it shifts the prediction accuracy somewhat. I would like to explain this another time.

Summary

Here, we have explained scikit-learn, focusing on practical knowledge that can be used in chemoinformatics. Let's review the main points again.

- By using scikit-learn, you can easily perform machine learning.
- Machine learning proceeds in the order of data preprocessing, model construction, and prediction.
- Proceed while checking the r2_score of the constructed model and the difference between the predicted and measured values.

