I looked into how to query ChEMBL by ChEMBL ID to retrieve structural data such as SDF and SMILES.
First, to get MOL data for a single ChEMBL ID, just open a URL like the one below, substituting the ChEMBL ID you want. The key point is to append format=sdf at the end.
https://www.ebi.ac.uk/chembl/api/data/molecule/CHEMBL1607289?format=sdf
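The same request can be made from Python. The following is a minimal sketch, not the official client: the helper names molecule_url and fetch_text are hypothetical, and it uses the standard-library urllib only to stay dependency-free. It assumes the response body is UTF-8 text.

```python
from urllib.request import urlopen

# Endpoint taken from the URL shown above.
BASE_URL = "https://www.ebi.ac.uk/chembl/api/data/molecule/"


def molecule_url(chembl_id, fmt=None):
    """Build the URL for one ChEMBL ID; fmt="sdf" appends ?format=sdf."""
    url = BASE_URL + chembl_id
    if fmt:
        url += "?format=" + fmt
    return url


def fetch_text(url, timeout=30):
    """Download the URL and return the body as text (assumed to be UTF-8)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_text(molecule_url("CHEMBL1607289", "sdf")) should then return the MOL data for that compound, provided the service is reachable.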
Next, to get the SDF for multiple ChEMBL IDs at once, use a URL like the following. The key point is that multiple ChEMBL IDs are separated by semicolons.
https://www.ebi.ac.uk/chembl/api/data/molecule/set/CHEMBL1607289;CHEMBL1607290?format=sdf
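When the ID list is long, one URL may become unwieldy, so it can help to split the IDs into batches. The sketch below only builds the URLs. The function name set_urls is hypothetical, and the default batch size of 50 is an arbitrary assumption, not a limit taken from the ChEMBL documentation.

```python
# Endpoint taken from the URL shown above.
SET_URL = "https://www.ebi.ac.uk/chembl/api/data/molecule/set/"


def set_urls(chembl_ids, fmt=None, batch_size=50):
    """Return one "set" URL per batch of IDs, joined with semicolons.

    batch_size=50 is an arbitrary assumption, not a documented limit.
    """
    urls = []
    for start in range(0, len(chembl_ids), batch_size):
        batch = chembl_ids[start:start + batch_size]
        url = SET_URL + ";".join(batch)
        if fmt:
            url += "?format=" + fmt
        urls.append(url)
    return urls
```

For the two IDs above, set_urls(["CHEMBL1607289", "CHEMBL1607290"], fmt="sdf") yields a single URL identical to the one shown.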
To get other data such as molecular weight and SMILES instead of SDF, simply omit format=sdf, as follows.
https://www.ebi.ac.uk/chembl/api/data/molecule/set/CHEMBL1607289
The compound data is then returned in XML format, as follows.
<?xml version="1.0" encoding="utf-8"?>
<response>
<molecules>
<molecule>
<atc_classifications/>
<availability_type>-1</availability_type>
<biotherapeutic/>
<black_box_warning>0</black_box_warning>
<chebi_par_id/>
<chirality>-1</chirality>
<cross_references>
<molecule>
<xref_id>26754105</xref_id>
<xref_name>SID: 26754105</xref_name>
<xref_src>PubChem</xref_src>
</molecule>
</cross_references>
<dosed_ingredient/>
<first_approval/>
<first_in_class>-1</first_in_class>
<helm_notation/>
<indication_class/>
<inorganic_flag>-1</inorganic_flag>
<max_phase/>
<molecule_chembl_id>CHEMBL1607289</molecule_chembl_id>
<molecule_hierarchy>
<molecule_chembl_id>CHEMBL1607289</molecule_chembl_id>
<parent_chembl_id>CHEMBL1607289</parent_chembl_id>
</molecule_hierarchy>
<molecule_properties>
<acd_logd>-2.11</acd_logd>
<acd_logp>-2.11</acd_logp>
<acd_most_apka>13.22</acd_most_apka>
<acd_most_bpka>5</acd_most_bpka>
<alogp>-1.58</alogp>
<aromatic_rings>1</aromatic_rings>
<full_molformula>C16H22N2O6</full_molformula>
<full_mwt>338.36</full_mwt>
<hba>8</hba>
<hba_lipinski>8</hba_lipinski>
<hbd>4</hbd>
<hbd_lipinski>4</hbd_lipinski>
<heavy_atoms>24</heavy_atoms>
<molecular_species>NEUTRAL</molecular_species>
<mw_freebase>338.36</mw_freebase>
<mw_monoisotopic>338.1478</mw_monoisotopic>
<num_lipinski_ro5_violations/>
<num_ro5_violations/>
<psa>123.35</psa>
<qed_weighted>0.49</qed_weighted>
<ro3_pass>N</ro3_pass>
<rtb>3</rtb>
</molecule_properties>
<molecule_structures>
<canonical_smiles>COC(=O)[C@@H]1C[C@@]2(O)[C@H](O)[C@H](O)[C@H](O)C[C@H]2N1Cc3cccnc3</canonical_smiles>
<standard_inchi>InChI=1S/C16H22N2O6/c1-24-15(22)10-6-16(23)12(5-11(19)13(20)14(16)21)18(10)8-9-3-2-4-17-7-9/h2-4,7,10-14,19-21,23H,5-6,8H2,1H3/t10-,11+,12+,13+,14+,16-/m0/s1</standard_inchi>
<standard_inchi_key>JKLLFWDPMLWZFY-SFEJEDPTSA-N</standard_inchi_key>
</molecule_structures>
<molecule_synonyms/>
<molecule_type>Small molecule</molecule_type>
<natural_product>-1</natural_product>
<oral/>
<parenteral/>
<polymer_flag/>
<pref_name/>
<prodrug>-1</prodrug>
<structure_type>MOL</structure_type>
<therapeutic_flag/>
<topical/>
<usan_stem/>
<usan_stem_definition/>
<usan_substem/>
<usan_year/>
<withdrawn_class/>
<withdrawn_country/>
<withdrawn_flag/>
<withdrawn_reason/>
<withdrawn_year/>
</molecule>
</molecules>
</response>
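Individual values can be pulled out of this XML with the standard-library ElementTree module. The following is a rough sketch: SAMPLE_XML is a trimmed-down copy of the response above (same element names, most fields dropped), and parse_molecules is a hypothetical helper name. Note that a nested molecule element also appears under cross_references, so the path is written to match only the direct children of molecules.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample using the element names from the response shown above.
SAMPLE_XML = """<response>
<molecules>
<molecule>
<cross_references>
<molecule><xref_src>PubChem</xref_src></molecule>
</cross_references>
<molecule_chembl_id>CHEMBL1607289</molecule_chembl_id>
<molecule_properties>
<full_mwt>338.36</full_mwt>
</molecule_properties>
<molecule_structures>
<canonical_smiles>COC(=O)[C@@H]1C[C@@]2(O)[C@H](O)[C@H](O)[C@H](O)C[C@H]2N1Cc3cccnc3</canonical_smiles>
</molecule_structures>
</molecule>
</molecules>
</response>"""


def parse_molecules(xml_text):
    """Return one dict per top-level molecule in the response."""
    root = ET.fromstring(xml_text)
    rows = []
    # "molecules/molecule" skips the <molecule> nested in cross_references.
    for mol in root.findall("molecules/molecule"):
        mwt = mol.findtext("molecule_properties/full_mwt")
        rows.append({
            "chembl_id": mol.findtext("molecule_chembl_id"),
            "smiles": mol.findtext("molecule_structures/canonical_smiles"),
            # Missing or empty weight becomes None instead of raising.
            "full_mwt": float(mwt) if mwt else None,
        })
    return rows


rows = parse_molecules(SAMPLE_XML)
```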
Finally, using the approach worked out above, I wrote a Python program that fetches SDF or SMILES in bulk for a list of ChEMBL IDs stored in a text file.
GetStructureFromChEMBLE.py
import argparse
import requests
import xml.etree.ElementTree as ET


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-input", type=str, required=True)
    parser.add_argument("-type", type=str, default="sdf", choices=["sdf", "smiles"])
    parser.add_argument("-output", type=str, required=True)
    args = parser.parse_args()

    # Read one ChEMBL ID per line, skipping blank lines.
    with open(args.input) as f:
        ids = [line.strip() for line in f if line.strip()]

    # Multiple IDs are joined with semicolons for the "set" endpoint.
    url = "https://www.ebi.ac.uk/chembl/api/data/molecule/set/" + ";".join(ids)
    if args.type == "sdf":
        url += "?format=sdf"

    result = requests.get(url)
    result.raise_for_status()  # stop on HTTP errors instead of saving an error page

    with open(args.output, "w") as f:
        if args.type == "sdf":
            f.write(result.text)
        else:
            root = ET.fromstring(result.text)
            # Only direct children of <molecules>; <cross_references> also
            # contains nested <molecule> elements that must be skipped.
            for molecule in root.findall("molecules/molecule"):
                # Take the ID from the response itself rather than assuming
                # the molecules come back in input order.
                f.write("{0}\t{1}\t{2}\n".format(
                    molecule.findtext("molecule_chembl_id", default=""),
                    molecule.findtext("molecule_structures/canonical_smiles", default=""),
                    molecule.findtext("molecule_structures/standard_inchi", default="")))


if __name__ == "__main__":
    main()
To use it, pass a file listing ChEMBL IDs one per line with the -input option, choose sdf or smiles with the -type option, and give the output file with the -output option.
Input file example
input.txt
CHEMBL6329
CHEMBL6328
CHEMBL265667
CHEMBL6362
CHEMBL267864
Output file example
When -type smiles is specified, the ChEMBL ID, SMILES, and InChI are written in tab-delimited format, as shown below.
output.tsv
CHEMBL6329 Cc1cc(ccc1C(=O)c2ccccc2Cl)N3N=CC(=O)NC3=O InChI=1S/C17H12ClN3O3/c1-10-8-11(21-17(24)20-15(22)9-19-21)6-7-12(10)16(23)13-4-2-3-5-14(13)18/h2-9H,1H3,(H,20,22,24)
CHEMBL6328 Cc1cc(ccc1C(=O)c2ccc(cc2)C#N)N3N=CC(=O)NC3=O InChI=1S/C18H12N4O3/c1-11-8-14(22-18(25)21-16(23)10-20-22)6-7-15(11)17(24)13-4-2-12(9-19)3-5-13/h2-8,10H,1H3,(H,21,23,25)
CHEMBL265667 Cc1cc(cc(C)c1C(O)c2ccc(Cl)cc2)N3N=CC(=O)NC3=O InChI=1S/C18H16ClN3O3/c1-10-7-14(22-18(25)21-15(23)9-20-22)8-11(2)16(10)17(24)12-3-5-13(19)6-4-12/h3-9,17,24H,1-2H3,(H,21,23,25)
CHEMBL6362 Cc1ccc(cc1)C(=O)c2ccc(cc2)N3N=CC(=O)NC3=O InChI=1S/C17H13N3O3/c1-11-2-4-12(5-3-11)16(22)13-6-8-14(9-7-13)20-17(23)19-15(21)10-18-20/h2-10H,1H3,(H,19,21,23)
CHEMBL267864 Cc1cc(ccc1C(=O)c2ccc(Cl)cc2)N3N=CC(=O)NC3=O InChI=1S/C17H12ClN3O3/c1-10-8-13(21-17(24)20-15(22)9-19-21)6-7-14(10)16(23)11-2-4-12(18)5-3-11/h2-9H,1H3,(H,20,22,24)
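Because the output is plain tab-delimited text, it is easy to load back for later processing. A small sketch, assuming exactly three tab-separated columns per line as described above; read_structures is a hypothetical helper name and SAMPLE_TSV reuses one row from the example output.

```python
# One row copied from the example output, with explicit tab separators.
SAMPLE_TSV = (
    "CHEMBL6362\t"
    "Cc1ccc(cc1)C(=O)c2ccc(cc2)N3N=CC(=O)NC3=O\t"
    "InChI=1S/C17H13N3O3/c1-11-2-4-12(5-3-11)16(22)13-6-8-14(9-7-13)"
    "20-17(23)19-15(21)10-18-20/h2-10H,1H3,(H,19,21,23)\n"
)


def read_structures(tsv_text):
    """Map each ChEMBL ID to a (SMILES, InChI) tuple."""
    table = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        chembl_id, smiles, inchi = line.split("\t")
        table[chembl_id] = (smiles, inchi)
    return table


structures = read_structures(SAMPLE_TSV)
```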
The reference URL includes an API description, but it lacks detail, so I worked this out by trial and error while reading the source of the ChEMBL webresource client I had installed earlier. I have also installed ChEMBL V25 locally, so I would like to dig into that next.