Once upon a time, when I started writing Python, I wrote calculations for compound descriptors. But, well, "What is Pnadas?" "Is it okay to answer?" It's been a long time since then, so I rewrote it.
Then, it continues to I / O to the DB to be continuously thrown.
Well, there are a lot of things to do, but I'm going to say "Done is better than perfect", so I'll break it down here.
Prepare an SDF file as compound information. So, using RDKit, make it a variable like this.
# Compound acquisition
sdfpath = 'xxx.sdf'
mols = get_mols(sdfpath)
So, make a "row" for each compound and make some functions that return some kinds of "columns". Since the things to do are the same, I tried to arrange the shape of the function. By the way, the return value is Pandas DataFrame type. I finally noticed it recently, but this is convenient.
So, output as csv collectively.
import os
import pandas as pd
# Returns the compound
# I: SDF path
# O: Compound object list
def get_mols(sdfpath):
from rdkit import Chem
mols = [mol for mol in Chem.SDMolSupplier(sdfpath) if mol is not None]
return mols
# Returns basic information about a compound [Compound name, structural information, number of atoms, number of bonds, SMILES, InChI]
# I: Compound object list
# O: Result data
def get_values_base(mols):
from rdkit import Chem
columns = ['Name', 'Structure', 'Atoms', 'Bonds', 'SMILES', 'InChI']
values = list()
for mol in mols:
tmp = list()
tmp.append(mol.GetProp('_Name'))
tmp.append(Chem.MolToMolBlock(mol))
tmp.append(mol.GetNumAtoms())
tmp.append(mol.GetNumBonds())
tmp.append(Chem.MolToSmiles(mol))
tmp.append(Chem.MolToInchi(mol))
values.append(tmp)
index = [i for i in range(len(mols))]
df = pd.DataFrame(values, columns=columns, index=index)
return df
# Returns the external parameters of the compound
# I: Compound object list
# O: Result data
def get_values_external(mols):
from rdkit import Chem
columns = ['ID', 'NAME', 'SOL', 'SMILES', 'SOL_classification']
values = list()
for mol in mols:
tmp = list()
for column in columns:
tmp.append(mol.GetProp(column))
values.append(tmp)
columns = ['ext_' + column for column in columns]
index = [i for i in range(len(mols))]
df = pd.DataFrame(values, columns=columns, index=index)
return df
# Calculate descriptor: RDKit
# I: Compound object list
# O: Result data
def get_rdkit_descriptors(mols):
from rdkit.Chem import AllChem, Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors
# RDKit descriptor calculation
# names = [mol.GetProp('_Name') for mol in mols]
descLists = [desc_name[0] for desc_name in Descriptors._descList]
calcs = MoleculeDescriptors.MolecularDescriptorCalculator(descLists)
values = [calcs.CalcDescriptors(mol) for mol in mols]
Convert to #DataFrame
index = [i for i in range(len(mols))]
df = pd.DataFrame(values, columns=descLists, index=index)
return df
# Calculate descriptor: mordred
# I: Compound object list
# O: Result data
def get_mordred_descriptors(mols):
Calculation of # mordred descriptor
from mordred import Calculator, descriptors
calcs = Calculator(descriptors, ignore_3D=False)
df = calcs.pandas(mols)
df['index'] = [i for i in range(len(mols))]
df.set_index('index', inplace=True)
return df
# Calculate descriptor: CDK
# I: SDF file
# java executable file path
# CDK jar file path
# O: Result data
def get_cdk_descriptors(sdfpath, workfolderpath, java_path, cdk_jar_path):
filepath = os.path.join(workfolderpath, 'tmp.csv')
import subprocess
command = f'{java_path} -jar {cdk_jar_path} -b {sdfpath} -t all -o {filepath}'
print(command)
subprocess.run(command, shell=False)
df = pd.read_table(filepath)
os.remove(filepath)
return df
# Main processing
def main():
data_folderpath = 'D:\\data\\python_data\\chem'
sdfpath = os.path.join(data_folderpath, 'sdf\\solubility.test.20.sdf')
csvpath = 'solubility.test.csv'
java_path = 'C:\\Program Files\\Java\\jdk-14.0.1\\bin\\java.exe'
workfolderpath = os.path.dirname(os.path.abspath(__file__))
cdk_jar_path = os.path.join(data_folderpath, 'jar\\CDKDescUI-1.4.6.jar')
# Compound acquisition
mols = get_mols(sdfpath)
# Get each value
# (python library)
dfs = list()
for calcs in [get_values_base, get_values_external, get_rdkit_descriptors, get_mordred_descriptors]:
dfs.append(calcs(mols))
# (jar file calculation)
dfs.append(get_cdk_descriptors(sdfpath, workfolderpath, java_path, cdk_jar_path))
# Combine all
df = pd.concat(dfs, axis=1)
df.to_csv('all_parameters.csv')
print(df)
# Start process
if __name__ == '__main__':
main()
(Output: Omitted)
>python CalculateDescriptors.py
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:01<00:00, 17.22it/s]
...
Name Structure Atoms Bonds ... ALogP ALogp2 AMR nAcid
0 1 1\n RDKit 2D\n\n 6 5 0 0 0 ... 6 5 ... -0.3400 0.115600 26.1559 0
1 2 2\n RDKit 2D\n\n 7 6 0 0 0 ... 7 6 ... 1.2082 1.459747 33.4010 0
2 3 3\n RDKit 2D\n\n 5 4 0 0 0 ... 5 4 ... 0.7264 0.527657 23.4093 0
3 4 4\n RDKit 2D\n\n 6 6 0 0 0 ... 6 6 ... 0.4030 0.162409 25.0454 0
4 5 5\n RDKit 2D\n\n 5 4 0 0 0 ... 5 4 ... 1.4774 2.182711 25.1598 0
5 6 6\n RDKit 2D\n\n 7 7 0 0 0 ... 7 7 ... 1.4658 2.148570 35.8212 0
6 7 7\n RDKit 2D\n\n 8 7 0 0 0 ... 8 7 ... -0.2734 0.074748 30.1747 0
7 8 8\n RDKit 2D\n\n 8 8 0 0 0 ... 8 8 ... 1.5147 2.294316 40.0862 0
8 9 9\n RDKit 2D\n\n 9 9 0 0 0 ... 9 9 ... 2.7426 7.521855 43.8018 0
9 10 10\n RDKit 2D\n\n 9 10 0 0 0 ... 9 10 ... 0.8490 0.720801 41.1580 0
10 11 11\n RDKit 2D\n\n 10 10 0 0 0 ... 10 10 ... 2.1019 4.417984 48.7581 0
11 12 12\n RDKit 2D\n\n 12 12 0 0 0 ... 12 12 ... 0.1695 0.028730 52.1462 0
12 13 13\n RDKit 2D\n\n 14 15 0 0 0 ... 14 15 ... 2.5404 6.453632 69.2022 0
13 14 14\n RDKit 2D\n\n 12 13 0 0 0 ... 12 13 ... 2.0591 4.239893 58.2832 0
14 15 15\n RDKit 2D\n\n 12 13 0 0 0 ... 12 13 ... 2.8406 8.069008 57.7168 0
15 16 16\n RDKit 2D\n\n 14 16 0 0 0 ... 14 16 ... 2.4922 6.211061 67.3498 0
16 17 17\n RDKit 2D\n\n 16 18 0 0 0 ... 16 18 ... 3.3850 11.458225 75.9138 0
17 18 18\n RDKit 2D\n\n 18 21 0 0 0 ... 18 21 ... 3.0366 9.220940 85.5468 0
18 19 19\n RDKit 2D\n\n 18 21 0 0 0 ... 18 21 ... 3.0366 9.220940 85.5468 0
19 20 20\n RDKit 2D\n\n 14 16 0 0 0 ... 14 16 ... -0.5223 0.272797 60.8303 0
[20 rows x 2322 columns]
Oh, I have assumed that there are 5 external parameters in the SDF file ... Let's fix it soon. .. ..
Including that, 6 from the compound information of RDKit, 5 from the above external parameters, 200 from RDKit, 286 from CDK, 1824 from mordred, a total of 2322 values were obtained. ··Hmm? Is it different by one? Oh, Index? I see.
Well, maybe not bad. There are many other things I want to make, so I'll think about the functions seriously after I have a little more.
So, I think I will continue to the next post.
Recommended Posts