Python hand play (RDKit descriptor calculation: SDF to CSV using Pandas)

What is this article?

This is a write-up of a slightly more serious attempt at Python, built on RDKit, a library that understands chemical compounds and returns a large number of numeric values (descriptors) for them. I'm still working out how to organize my common functions and how to structure things into classes, and some of the restrictions from the previous post remain, but I think I'm starting to see a direction.

So what does it do?

It creates a CSV file from an SDF file, a file format that holds compound information. RDKit computes about 200 columns of descriptor values (the exact number depends on the RDKit version), and the script adds 10 more columns, such as names, for a total of about 210. The code is not fully generalized yet, so it only works with SDF files of a specific form. I plan to improve this later.
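
As a rough check of the column arithmetic above, the following sketch counts the descriptors that the installed RDKit exposes and adds the 10 fixed columns used by the script further down. The names BASE_COLUMNS and EXTRA_COLUMNS are made up for this illustration. The printed numbers depend on the RDKit version, so no expected output is shown.

```python
from rdkit.Chem import Descriptors

# Descriptors._descList is a list of (name, function) pairs.
# Its length depends on the installed RDKit version.
descriptor_names = [name for name, _func in Descriptors._descList]

# Hypothetical constants mirroring the 5 + 5 fixed columns of the script.
BASE_COLUMNS = ['SampleID', 'SampleName', 'Structure', 'Atoms', 'Bonds']
EXTRA_COLUMNS = ['ID', 'NAME', 'SOL', 'SMILES', 'SOL_classification']

total_columns = len(BASE_COLUMNS) + len(EXTRA_COLUMNS) + len(descriptor_names)
print(len(descriptor_names), total_columns)
```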

Limitations

- Every compound in the SDF file must have the following properties: ['ID', 'NAME', 'SOL', 'SMILES', 'SOL_classification']

- Unusual compounds are not supported (separated structures, ions, and so on). When in doubt, use only compounds for which RDKit does not raise a calculation error.
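
The limitations above could be checked before running the conversion. The sketch below is not part of the original script. find_problems and REQUIRED_PROPS are hypothetical names, and reading "separation" as "more than one disconnected fragment" is my assumption. The demo molecules are built from SMILES so the example runs without an SDF file.

```python
from rdkit import Chem

# Properties the script expects on every compound (from the list above).
REQUIRED_PROPS = ['ID', 'NAME', 'SOL', 'SMILES', 'SOL_classification']


def find_problems(mol, required=REQUIRED_PROPS):
    """Return a list of reasons why mol may not fit the limitations."""
    if mol is None:
        return ['could not be parsed by RDKit']
    problems = ['missing property: ' + name
                for name in required if not mol.HasProp(name)]
    # Assumption: "separation" means several disconnected fragments.
    if len(Chem.GetMolFrags(mol)) > 1:
        problems.append('more than one fragment')
    if any(atom.GetFormalCharge() != 0 for atom in mol.GetAtoms()):
        problems.append('charged atoms')
    return problems


# Demo: a plain hydrocarbon with dummy properties, and a salt without any.
ok = Chem.MolFromSmiles('CCC(C)CC')
for name in REQUIRED_PROPS:
    ok.SetProp(name, 'dummy')
salt = Chem.MolFromSmiles('[Na+].[Cl-]')

print(find_problems(ok))
print(find_problems(salt))
```

With an SDF file, the same function could be applied to each entry yielded by Chem.SDMolSupplier before computing descriptors.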

Code

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors


def get_basevalues(sampleid, mol):
    tmps = list()
    tmps.append(('SampleID', sampleid))
    tmps.append(('SampleName', mol.GetProp('_Name')))
    tmps.append(('Structure', Chem.MolToMolBlock(mol)))
    tmps.append(('Atoms', len(mol.GetAtoms())))
    tmps.append(('Bonds', len(mol.GetBonds())))
    names = [tmp[0] for tmp in tmps]
    values = [tmp[1] for tmp in tmps]
    return names, values


def get_exvalues(sampleid, mol):
    names = ['ID', 'NAME', 'SOL', 'SMILES', 'SOL_classification']
    values = list()
    for name in names:
        values.append(mol.GetProp(name))
    return names, values


# Calculate descriptors from an SDF file and write them to a CSV file
# sdfpath: path to the input compound (SDF) file
# csvpath: path to the output CSV file
def ExportCSVFromSDF(sdfpath, csvpath):

    # Load the compounds
    mols = Chem.SDMolSupplier(sdfpath)

    # Prepare the RDKit descriptor calculator
    descLists = [desc_name[0] for desc_name in Descriptors._descList]
    desc_calc = MoleculeDescriptors.MolecularDescriptorCalculator(descLists)

    # Serial numbers used as sample IDs
    sampleids = list()
    # Compound name and other basic values
    values_base = list()
    # External properties (currently a fixed set of 5)
    values_ex = list()

    # Collect the values of each compound
    for i, mol in enumerate(mols, 1):
        sampleids.append(i)
        names_base, values = get_basevalues(i, mol)
        values_base.append(values)
        names_ex, values = get_exvalues(i, mol)
        values_ex.append(values)

    # Calculate the RDKit descriptors
    values_rdkit = [desc_calc.CalcDescriptors(mol) for mol in mols]

    # Convert to DataFrames
    df_base = pd.DataFrame(values_base, columns=names_base, index=sampleids)
    df_ex = pd.DataFrame(values_ex, columns=names_ex, index=sampleids)
    df_rdkit = pd.DataFrame(values_rdkit, columns=descLists, index=sampleids)

    # Combine everything column-wise
    df = pd.concat([df_base, df_ex, df_rdkit], axis=1)

    # Print for confirmation
    print(df)

    # Write to CSV
    df.to_csv(csvpath, index=False)


def main():
    sdfpath = 'solubility.test.sdf'
    csvpath = 'solubility.test.csv'
    ExportCSVFromSDF(sdfpath, csvpath)


if __name__ == '__main__':
    main()
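
The script above assumes every record in the SDF file can be read. As a hedged variation, not the code above, the descriptor step could skip entries that RDKit returns as None and compute everything in a single pass. descriptor_frame is a hypothetical helper name, and the demo list stands in for the molecules an SDMolSupplier would yield.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors


def descriptor_frame(mols):
    """Compute all RDKit descriptors in one pass, skipping None entries."""
    names = [name for name, _func in Descriptors._descList]
    calc = MoleculeDescriptors.MolecularDescriptorCalculator(names)
    ids, rows = [], []
    for i, mol in enumerate(mols, 1):
        if mol is None:  # RDKit yields None for records it cannot parse
            continue
        ids.append(i)
        rows.append(calc.CalcDescriptors(mol))
    # The index keeps the original serial numbers, so skipped rows leave gaps.
    return pd.DataFrame(rows, columns=names, index=ids)


# Demo input: two valid molecules with an unreadable entry in between.
demo = [Chem.MolFromSmiles('CCC(C)CC'), None, Chem.MolFromSmiles('CC(C)CC(C)C')]
df_demo = descriptor_frame(demo)
print(df_demo.shape)
```

Keeping the serial number as the index makes it easy to see which records were dropped.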


Output example

SampleID  SampleName           Structure  Atoms  Bonds  ID  NAME                 SOL    SMILES       SOL_classification  MaxEStateIndex      MinEStateIndex
1         3-methylpentane                 6      5      5   3-methylpentane      -3.68  CCC(C)CC     (A) low             2.2777777777777777  0.9351851851851851
2         2,4-dimethylpentane             7      6      10  2,4-dimethylpentane  -4.26  CC(C)CC(C)C  (A) low             2.263888888888889   0.8749999999999998
3         ...
4
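
One thing to keep in mind about the Structure column: a mol block is a multi-line string, so one compound spans several physical lines in the CSV file. The small sketch below uses pandas only, with a made-up stand-in string instead of a real mol block, to show that such cells are quoted on output and read back intact by read_csv.

```python
import io

import pandas as pd

# Stand-in for a mol block: any string with embedded newlines.
fake_block = 'line 1\nline 2\nM  END\n'
df_out = pd.DataFrame({'SampleID': [1], 'Structure': [fake_block], 'Atoms': [6]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)

# Read it back; the quoted multi-line cell comes back as one value.
buffer.seek(0)
df_in = pd.read_csv(buffer)
print(df_in.shape)
```

Tools that read the file line by line, rather than with a CSV parser, will not see one row per compound.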

Impressions

I think I've gotten a little better at Pandas. I plan to extend this in various ways from here. Probably.
