Python hand play (descriptor calculation: serious version)

What is this article?

Long ago, when I had just started writing Python, I wrote some code to calculate compound descriptors. Back then, though, I was still at the level of asking "What is Pandas?" A lot of time has passed since then, so I rewrote it.

The plan is to follow this up with DB input and output in a later post.

There is still a lot I would like to do, but in the spirit of "Done is better than perfect", I'll wrap it up at this point.

Overview

Prepare an SDF file containing the compound information, then use RDKit to load it into a variable like this.

# Compound acquisition
sdfpath = 'xxx.sdf'
mols = get_mols(sdfpath)

Next, with one "row" per compound, write several functions that each return a group of "columns". Since they all do the same kind of work, I gave the functions a uniform shape. Each one returns a Pandas DataFrame. I only noticed it recently, but this is convenient.

Finally, combine everything and output it as a CSV.
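To make the "same rows, different columns" idea concrete, here is a minimal sketch that uses only pandas. The two small DataFrames, their column names, and their values are made-up stand-ins for the results of the calculation functions, not real descriptor output.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two calculation results (values are made up).
# Each has one row per compound and shares the same 0..n-1 index.
df_base = pd.DataFrame({'Name': ['mol_a', 'mol_b'], 'Atoms': [6, 7]}, index=[0, 1])
df_desc = pd.DataFrame({'desc_x': [0.1, 0.2], 'desc_y': [1.5, 2.5]}, index=[0, 1])

# axis=1 places the frames side by side, aligning rows by index
df_all = pd.concat([df_base, df_desc], axis=1)

# Write as CSV (to an in-memory buffer here instead of a file)
buffer = io.StringIO()
df_all.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Because `pd.concat` aligns on the index, this only works cleanly when every function returns its rows in the same compound order with the same index.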

Code

import os
import pandas as pd


# Returns the compound
# I: SDF path
# O: Compound object list
def get_mols(sdfpath):
    from rdkit import Chem
    mols = [mol for mol in Chem.SDMolSupplier(sdfpath) if mol is not None]
    return mols


# Returns basic information about a compound [Compound name, structural information, number of atoms, number of bonds, SMILES, InChI]
# I: Compound object list
# O: Result data
def get_values_base(mols):
    from rdkit import Chem
    columns = ['Name', 'Structure', 'Atoms', 'Bonds', 'SMILES', 'InChI']
    values = list()
    for mol in mols:
        tmp = list()
        tmp.append(mol.GetProp('_Name'))
        tmp.append(Chem.MolToMolBlock(mol))
        tmp.append(mol.GetNumAtoms())
        tmp.append(mol.GetNumBonds())
        tmp.append(Chem.MolToSmiles(mol))
        tmp.append(Chem.MolToInchi(mol))
        values.append(tmp)
    index = [i for i in range(len(mols))]
    df = pd.DataFrame(values, columns=columns, index=index)
    return df


# Returns the external parameters of the compound
# I: Compound object list
# O: Result data
def get_values_external(mols):
    from rdkit import Chem
    columns = ['ID', 'NAME', 'SOL', 'SMILES', 'SOL_classification']
    values = list()
    for mol in mols:
        tmp = list()
        for column in columns:
            tmp.append(mol.GetProp(column))
        values.append(tmp)
    columns = ['ext_' + column for column in columns]
    index = [i for i in range(len(mols))]
    df = pd.DataFrame(values, columns=columns, index=index)
    return df


# Calculate descriptor: RDKit
# I: Compound object list
# O: Result data
def get_rdkit_descriptors(mols):
    from rdkit.Chem import AllChem, Descriptors
    from rdkit.ML.Descriptors import MoleculeDescriptors
    # RDKit descriptor calculation
    # names = [mol.GetProp('_Name') for mol in mols]
    descLists = [desc_name[0] for desc_name in Descriptors._descList]
    calcs = MoleculeDescriptors.MolecularDescriptorCalculator(descLists)
    values = [calcs.CalcDescriptors(mol) for mol in mols]
    # Convert to DataFrame
    index = [i for i in range(len(mols))]
    df = pd.DataFrame(values, columns=descLists, index=index)
    return df


# Calculate descriptor: mordred
# I: Compound object list
# O: Result data
def get_mordred_descriptors(mols):
    # Calculation of mordred descriptors
    from mordred import Calculator, descriptors
    calcs = Calculator(descriptors, ignore_3D=False)
    df = calcs.pandas(mols)
    df['index'] = [i for i in range(len(mols))]
    df.set_index('index', inplace=True)
    return df


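# Calculate descriptor: CDK
# I: SDF path
#    Work folder path (for the temporary output file)
#    java executable file path
#    CDK jar file path
# O: Result data
def get_cdk_descriptors(sdfpath, workfolderpath, java_path, cdk_jar_path):
    import subprocess
    filepath = os.path.join(workfolderpath, 'tmp.csv')
    # Pass the arguments as a list so that paths containing spaces
    # (such as the one under "Program Files") work with shell=False
    command = [java_path, '-jar', cdk_jar_path, '-b', sdfpath, '-t', 'all', '-o', filepath]
    print(' '.join(command))
    subprocess.run(command, check=True)
    # The output file is read as tab-separated text
    df = pd.read_table(filepath)
    os.remove(filepath)
    return df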


# Main processing
def main():
    data_folderpath = 'D:\\data\\python_data\\chem'
    sdfpath = os.path.join(data_folderpath, 'sdf\\solubility.test.20.sdf')
    csvpath = 'solubility.test.csv'

    java_path = 'C:\\Program Files\\Java\\jdk-14.0.1\\bin\\java.exe'
    workfolderpath = os.path.dirname(os.path.abspath(__file__))
    cdk_jar_path = os.path.join(data_folderpath, 'jar\\CDKDescUI-1.4.6.jar')

    # Compound acquisition
    mols = get_mols(sdfpath)

    # Get each value
    # (python library)
    dfs = list()
    for calcs in [get_values_base, get_values_external, get_rdkit_descriptors, get_mordred_descriptors]:
        dfs.append(calcs(mols))

    # (jar file calculation)
    dfs.append(get_cdk_descriptors(sdfpath, workfolderpath, java_path, cdk_jar_path))

    # Combine all
    df = pd.concat(dfs, axis=1)
    df.to_csv('all_parameters.csv')
    print(df)


# Start process
if __name__ == '__main__':
    main()

(Output: partially omitted)

>python CalculateDescriptors.py
 100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:01<00:00, 17.22it/s]

...

   Name                                          Structure  Atoms  Bonds  ...   ALogP     ALogp2      AMR nAcid
0     1  1\n     RDKit          2D\n\n  6  5  0  0  0  ...      6      5  ... -0.3400   0.115600  26.1559     0
1     2  2\n     RDKit          2D\n\n  7  6  0  0  0  ...      7      6  ...  1.2082   1.459747  33.4010     0
2     3  3\n     RDKit          2D\n\n  5  4  0  0  0  ...      5      4  ...  0.7264   0.527657  23.4093     0
3     4  4\n     RDKit          2D\n\n  6  6  0  0  0  ...      6      6  ...  0.4030   0.162409  25.0454     0
4     5  5\n     RDKit          2D\n\n  5  4  0  0  0  ...      5      4  ...  1.4774   2.182711  25.1598     0
5     6  6\n     RDKit          2D\n\n  7  7  0  0  0  ...      7      7  ...  1.4658   2.148570  35.8212     0
6     7  7\n     RDKit          2D\n\n  8  7  0  0  0  ...      8      7  ... -0.2734   0.074748  30.1747     0
7     8  8\n     RDKit          2D\n\n  8  8  0  0  0  ...      8      8  ...  1.5147   2.294316  40.0862     0
8     9  9\n     RDKit          2D\n\n  9  9  0  0  0  ...      9      9  ...  2.7426   7.521855  43.8018     0
9    10  10\n     RDKit          2D\n\n  9 10  0  0  0 ...      9     10  ...  0.8490   0.720801  41.1580     0
10   11  11\n     RDKit          2D\n\n 10 10  0  0  0 ...     10     10  ...  2.1019   4.417984  48.7581     0
11   12  12\n     RDKit          2D\n\n 12 12  0  0  0 ...     12     12  ...  0.1695   0.028730  52.1462     0
12   13  13\n     RDKit          2D\n\n 14 15  0  0  0 ...     14     15  ...  2.5404   6.453632  69.2022     0
13   14  14\n     RDKit          2D\n\n 12 13  0  0  0 ...     12     13  ...  2.0591   4.239893  58.2832     0
14   15  15\n     RDKit          2D\n\n 12 13  0  0  0 ...     12     13  ...  2.8406   8.069008  57.7168     0
15   16  16\n     RDKit          2D\n\n 14 16  0  0  0 ...     14     16  ...  2.4922   6.211061  67.3498     0
16   17  17\n     RDKit          2D\n\n 16 18  0  0  0 ...     16     18  ...  3.3850  11.458225  75.9138     0
17   18  18\n     RDKit          2D\n\n 18 21  0  0  0 ...     18     21  ...  3.0366   9.220940  85.5468     0
18   19  19\n     RDKit          2D\n\n 18 21  0  0  0 ...     18     21  ...  3.0366   9.220940  85.5468     0
19   20  20\n     RDKit          2D\n\n 14 16  0  0  0 ...     14     16  ... -0.5223   0.272797  60.8303     0

[20 rows x 2322 columns]


Oh, I hard-coded the assumption that the SDF file has 5 external parameters... I'll fix that soon.
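One possible way to drop that assumption is to collect the property names from the molecules themselves. The sketch below is not the author's fix: `get_values_external_auto` and its `prefix` argument are hypothetical names, and it assumes the objects behave like RDKit `Mol` objects, which provide `GetPropNames()`, `HasProp()`, and `GetProp()`.

```python
import pandas as pd


def get_values_external_auto(mols, prefix='ext_'):
    """Collect every property found on the molecules into a DataFrame.

    Hypothetical variant of get_values_external: instead of a fixed column
    list, property names are gathered via GetPropNames() in first-seen order.
    """
    names = []
    for mol in mols:
        for name in mol.GetPropNames():
            if name not in names:
                names.append(name)
    # Use None where a molecule lacks a property that others have
    values = [
        [mol.GetProp(name) if mol.HasProp(name) else None for name in names]
        for mol in mols
    ]
    columns = [prefix + name for name in names]
    return pd.DataFrame(values, columns=columns, index=range(len(mols)))
```

With this shape, an SDF file with more or fewer data fields would simply yield more or fewer `ext_` columns, and the function could replace `get_values_external` in the loop in `main()`.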

Including those, I got 6 values from RDKit's basic compound information, 5 from the external parameters above, 200 from RDKit descriptors, 286 from CDK, and 1824 from mordred, for a total of 2322 values. ...Hmm? Isn't that off by one? Oh, the index? I see.
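The off-by-one can be checked with a few lines of arithmetic. This sketch only adds up the counts quoted in the text and compares them with the column count pandas reported. It does not identify which column is the extra one.

```python
# Column counts per source, as stated in the text
counts = {
    'base (RDKit compound info)': 6,
    'external parameters': 5,
    'RDKit descriptors': 200,
    'CDK descriptors': 286,
    'mordred descriptors': 1824,
}

expected = sum(counts.values())
reported = 2322  # from "[20 rows x 2322 columns]" in the output
difference = reported - expected
print(expected, reported, difference)
```

Listing `df.columns` and comparing it against the column names returned by each function would show exactly where the extra column comes from.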

Impressions

Well, not bad, I think. There are many other things I want to build, so I'll think about the function design more seriously once I have a few more of them.

I plan to continue this in the next post.
