[PYTHON] A Study on Visualization of the Scope of Prediction Models

Introduction

In the article "AI drug discovery started free of charge using papers and public databases", as a bonus, I visualized the applicability domain (scope) of the prediction model. There, the prediction target data was projected through PCA and UMAP dimensionality-reduction models that had been fitted only on the training data. At first glance I concluded that **the prediction target data appears to fall within the range of the training data**, but I wondered whether that was really the case, so I verify it here.
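For context, the pattern in question, fitting the reducer on the training data only and merely projecting the prediction data through it, looks roughly like this minimal sketch (synthetic random matrices stand in for the fingerprint data):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.random((100, 2048))    # stand-in for training fingerprints
X_predict = rng.random((50, 2048))   # stand-in for prediction fingerprints

pca = PCA(n_components=2)
pca.fit(X_train)                        # fitted on the training data only
emb_train = pca.transform(X_train)
emb_predict = pca.transform(X_predict)  # prediction data is only projected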

Verification content

In the article I wrote: "The figure suggests that the prediction target data does not deviate significantly from the range of the training data, but this may be affected by the fact that the dimensions were reduced using only the training data." So this time I perform the dimensionality reduction using **all of the training data plus the prediction target data**, and compare the result with the figure obtained when fitting on the training data only.

Source

As the source, I post the previous script modified to perform the dimensionality reduction using all of the training data and the prediction target data.

view_ad.py


import argparse
import csv

import umap
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

from rdkit import Chem
from rdkit.Chem import AllChem

def main():

    parser = argparse.ArgumentParser()
    parser.add_argument("-train", type=str, required=True)
    parser.add_argument("-predict", type=str)
    parser.add_argument("-result", type=str)
    parser.add_argument("-method", type=str, default="PCA", choices=["PCA", "UMAP"])

    args = parser.parse_args()

    # Combined training + prediction fingerprints, used to fit the model
    all_datas = []

    # Load the training CSV and compute Morgan fingerprints
    train_datas = []
    train_datas_active = []
    train_datas_inactive = []

    with open(args.train, "r") as f:
        reader = csv.DictReader(f)
        for row in reader:
            smiles = row["canonical_smiles"]

            mol = Chem.MolFromSmiles(smiles)
            if mol is None:  # skip unparsable SMILES
                continue
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048, useFeatures=False, useChirality=False)
            train_datas.append(fp)

            if int(row["outcome"]) == 1:
                train_datas_active.append(fp)
            else:
                train_datas_inactive.append(fp)

            all_datas.append(fp)



    if args.predict and args.result:
        result_outcomes = []
        result_ads = []

        # Load the prediction results (Prediction / Confidence columns)
        with open(args.result, "r", encoding="utf-8-sig") as f:
            reader = csv.DictReader(f)
            for row in reader:
                if row["Prediction"] == "Active":
                    result_outcomes.append(1)
                else:
                    result_outcomes.append(0)

                result_ads.append(float(row["Confidence"]))


        # Load the prediction-target CSV (e.g. DrugBank) and compute fingerprints
        predict_datas = []
        predict_datas_active = []
        predict_datas_inactive = []
        with open(args.predict, "r") as f:
            reader = csv.DictReader(f)
            for i, row in enumerate(reader):
                smiles = row["smiles"]
                mol = Chem.MolFromSmiles(smiles)
                if mol is None:  # skip unparsable SMILES; i still tracks the CSV row, keeping result_outcomes aligned
                    continue
                fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048, useFeatures=False, useChirality=False)
                predict_datas.append(fp)

                if result_outcomes[i] == 1:
                    predict_datas_active.append(fp)
                else:
                    predict_datas_inactive.append(fp)

                all_datas.append(fp)

    # Fit the dimensionality-reduction model. This is the modification under test:
    # fit on all data (training + prediction) instead of the training data only.
    model = None
    if args.method == "PCA":
        model = PCA(n_components=2)
        #model.fit(train_datas)
        model.fit(all_datas)

    if args.method == "UMAP":
        model = umap.UMAP()
        #model.fit(train_datas)
        model.fit(all_datas)

    result_train = model.transform(train_datas)
    result_train_active = model.transform(train_datas_active)
    result_train_inactive = model.transform(train_datas_inactive)

    plt.title(args.method)
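    # Plot the training data in blue: o = active, x = inactive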
    #plt.scatter(result_train[:, 0], result_train[:, 1], c="blue", alpha=0.1, marker="o")
    plt.scatter(result_train_active[:, 0], result_train_active[:, 1], c="blue", alpha=0.5, marker="o")
    plt.scatter(result_train_inactive[:, 0], result_train_inactive[:, 1], c="blue", alpha=0.5, marker="x")

    # Plot the prediction-target data in red: o = predicted active, x = predicted inactive
    if args.predict and args.result:

        result_predict = model.transform(predict_datas)
        result_predict_active = model.transform(predict_datas_active)
        result_predict_inactive = model.transform(predict_datas_inactive)

        #plt.scatter(result_predict[:, 0], result_predict[:, 1], c=result_ads, alpha=0.1, cmap='viridis_r')
        plt.scatter(result_predict_active[:, 0], result_predict_active[:, 1], c="red", alpha=0.1, marker="o")
        plt.scatter(result_predict_inactive[:, 0], result_predict_inactive[:, 1], c="red", alpha=0.1, marker="x")

    plt.show()


if __name__ == "__main__":
    main()
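For reference, a hypothetical invocation is shown below (the CSV file names are placeholders): the training CSV needs canonical_smiles and outcome columns, the prediction CSV a smiles column, and the result CSV Prediction and Confidence columns, as assumed by the code above.

     python view_ad.py -train train.csv -predict drugbank.csv -result result.csv -method UMAP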

The only modification to the program is that the data passed when fitting the model was changed from the training data to all the data (training data + prediction target data):

     #model.fit(train_datas)
     model.fit(all_datas)

Results

To repeat the color coding: blue is the training data and red is the prediction target data. Please forgive the overlapping points that are difficult to make out.

PCA

When fitting with only training data

[Figure: PCA projection (model fitted on training data only)]

When fitting with all training data + prediction target data

[Figure: PCA projection (model fitted on training + prediction data)]

UMAP

When fitting with only training data

[Figure: UMAP projection (model fitted on training data only)]

When fitting with all training data + prediction target data

[Figure: UMAP projection (model fitted on training + prediction data)]

Consideration

- With both PCA and UMAP, when the model is fitted on the training data only, most of the prediction target data appears to fall within the applicable range.
- However, when the model is fitted on all of the training data plus the prediction target data, a large amount of the prediction data lies well outside the region of the training data.
- In other words, it is **extremely dangerous** to judge whether the prediction target data is inside the applicability domain by looking at the former kind of figure.
- The reason this happens is that the former figure comes from a dimensionality-reduction model that captures only the overall tendencies of the training data, so when data that does not follow those tendencies is projected through it, the model cannot represent the deviation.
- So what should be done to determine the applicability domain? One option is to stop relying on figures: define some formula that measures the distance to the training data and rely on the numbers (a sketch follows this list).
- The other is to build the dimensionality-reduction model from a larger compound set that encompasses both the training data and the prediction target data.
- For the latter, one might think of preparing every compound that exists in nature, but then the number of data points would be too large to realistically build a model, and the space would be too vast to distinguish the training data from the prediction data. So I think an appropriate compound set should be prepared to match the domain of the prediction model (I believe there are papers on this).
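As a minimal sketch of the first option (a numeric distance to the training data), the following assumes the fingerprint lists train_datas and predict_datas built in view_ad.py above, and scores each prediction compound by its maximum Tanimoto similarity to the training set; the 0.4 cutoff is an arbitrary placeholder, not a recommended value:

from rdkit import DataStructs

def max_tanimoto_to_train(predict_fps, train_fps):
    # For each prediction fingerprint, return the highest Tanimoto
    # similarity to any training fingerprint (1.0 = identical bit vectors).
    scores = []
    for fp in predict_fps:
        sims = DataStructs.BulkTanimotoSimilarity(fp, train_fps)
        scores.append(max(sims))
    return scores

# Hypothetical usage with the lists from view_ad.py:
# scores = max_tanimoto_to_train(predict_datas, train_datas)
# outside_ad = [s < 0.4 for s in scores]  # flag compounds far from the training data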
