Introduction

I contacted ChEMBL from CHEMBLID to find out how to retrieve structural data such as SDF and SMILES.

Acquisition of single compound SDF (MOL)

First, to get MOL data by specifying a single ChEMBLID, just type the URL as follows. CHEMBLID specifies a predetermined one. The point here is to add format = sdf at the end.

https://www.ebi.ac.uk/chembl/api/data/molecule/CHEMBL1607289?format=sdf

Acquisition of SDF for multiple compounds

Next, to specify multiple ChEMBLIDs and get the SDF at once, just type the following URL. The point here is that when specifying multiple ChEMBLE IDs, they are separated by a semicolon.

https://www.ebi.ac.uk/chembl/api/data/molecule/set/CHEMBL1607289;CHEMBL1607290?format=sdf

Acquisition of compound data

To get various data such as molecular weight and SMILES instead of SDF, you do not have to specify format = sdf as follows.

https://www.ebi.ac.uk/chembl/api/data/molecule/set/CHEMBL1607289

Then, the compound data can be obtained in the following XML format.

<?xml version="1.0" encoding="utf-8"?>
<response>
  <molecules>
    <molecule>
      <atc_classifications/>
      <availability_type>-1</availability_type>
      <biotherapeutic/>
      <black_box_warning>0</black_box_warning>
      <chebi_par_id/>
      <chirality>-1</chirality>
      <cross_references>
        <molecule>
          <xref_id>26754105</xref_id>
          <xref_name>SID: 26754105</xref_name>
          <xref_src>PubChem</xref_src>
        </molecule>
      </cross_references>
      <dosed_ingredient/>
      <first_approval/>
      <first_in_class>-1</first_in_class>
      <helm_notation/>
      <indication_class/>
      <inorganic_flag>-1</inorganic_flag>
      <max_phase/>
      <molecule_chembl_id>CHEMBL1607289</molecule_chembl_id>
      <molecule_hierarchy>
        <molecule_chembl_id>CHEMBL1607289</molecule_chembl_id>
        <parent_chembl_id>CHEMBL1607289</parent_chembl_id>
      </molecule_hierarchy>
      <molecule_properties>
        <acd_logd>-2.11</acd_logd>
        <acd_logp>-2.11</acd_logp>
        <acd_most_apka>13.22</acd_most_apka>
        <acd_most_bpka>5</acd_most_bpka>
        <alogp>-1.58</alogp>
        <aromatic_rings>1</aromatic_rings>
        <full_molformula>C16H22N2O6</full_molformula>
        <full_mwt>338.36</full_mwt>
        <hba>8</hba>
        <hba_lipinski>8</hba_lipinski>
        <hbd>4</hbd>
        <hbd_lipinski>4</hbd_lipinski>
        <heavy_atoms>24</heavy_atoms>
        <molecular_species>NEUTRAL</molecular_species>
        <mw_freebase>338.36</mw_freebase>
        <mw_monoisotopic>338.1478</mw_monoisotopic>
        <num_lipinski_ro5_violations/>
        <num_ro5_violations/>
        <psa>123.35</psa>
        <qed_weighted>0.49</qed_weighted>
        <ro3_pass>N</ro3_pass>
        <rtb>3</rtb>
      </molecule_properties>
      <molecule_structures>
        <canonical_smiles>COC(=O)[C@@H]1C[C@@]2(O)[C@H](O)[C@H](O)[C@H](O)C[C@H]2N1Cc3cccnc3</canonical_smiles>
        <standard_inchi>InChI=1S/C16H22N2O6/c1-24-15(22)10-6-16(23)12(5-11(19)13(20)14(16)21)18(10)8-9-3-2-4-17-7-9/h2-4,7,10-14,19-21,23H,5-6,8H2,1H3/t10-,11+,12+,13+,14+,16-/m0/s1</standard_inchi>
        <standard_inchi_key>JKLLFWDPMLWZFY-SFEJEDPTSA-N</standard_inchi_key>
      </molecule_structures>
      <molecule_synonyms/>
      <molecule_type>Small molecule</molecule_type>
      <natural_product>-1</natural_product>
      <oral/>
      <parenteral/>
      <polymer_flag/>
      <pref_name/>
      <prodrug>-1</prodrug>
      <structure_type>MOL</structure_type>
      <therapeutic_flag/>
      <topical/>
      <usan_stem/>
      <usan_stem_definition/>
      <usan_substem/>
      <usan_year/>
      <withdrawn_class/>
      <withdrawn_country/>
      <withdrawn_flag/>
      <withdrawn_reason/>
      <withdrawn_year/>
    </molecule>
  </molecules>
</response>

Obtained from Python

Finally, using the method examined this time, I wrote a Python program that collectively obtains SDF and SMILES from the list of CHEMBLE IDs described in the text file.

`GetStructureFromChEMBLE.py`


import argparse
import requests
import xml.etree.ElementTree as ET


def main():

    parser = argparse.ArgumentParser()
    parser.add_argument("-input", type=str, required=True)
    parser.add_argument("-type", type=str, default="sdf", choices=["sdf", "smiles"]),
    parser.add_argument("-output", type=str, required=True)
    args = parser.parse_args()

    ids_str = ""
    ids = []
    with open(args.input) as f:
        lines = f.readlines();
        for line in lines:
            line = line.rstrip()
            if len(ids_str) > 0:
                ids_str += ";"
            ids_str += line
            ids.append(line)

    url = "https://www.ebi.ac.uk/chembl/api/data/molecule/set/" + ids_str
    if args.type == "sdf":
        url += "?format=sdf"

    result = requests.get(url)

    if args.type == "sdf":
        with open(args.output, "w") as f:
            f.write(result.text)
    else:
        root = ET.fromstring(result.text)
        with open(args.output, "w") as f:
            for i in range(len(root[0])):
                structures = root[0][i].find("molecule_structures")
            
                f.write("{0}\t{1}\t{2}\n".format(
                    ids[i],
                    structures.find("canonical_smiles").text,
                    structures.find("standard_inchi").text))


if __name__ == "__main__":
    main()

To use, specify a file that describes the CHEMBLE ID list line by line with the -input option, and specify sdf or smiles with the -type option. Then specify the output file with the -output option.

Input file example

`input.txt`


CHEMBL6329
CHEMBL6328
CHEMBL265667
CHEMBL6362
CHEMBL267864

Output file example When -type = smiles is specified, CHEMLEID, SMILES, and Inchi are output in tab-delimited format as shown below.

`output.tsv`


CHEMBL6329	Cc1cc(ccc1C(=O)c2ccccc2Cl)N3N=CC(=O)NC3=O	InChI=1S/C17H12ClN3O3/c1-10-8-11(21-17(24)20-15(22)9-19-21)6-7-12(10)16(23)13-4-2-3-5-14(13)18/h2-9H,1H3,(H,20,22,24)
CHEMBL6328	Cc1cc(ccc1C(=O)c2ccc(cc2)C#N)N3N=CC(=O)NC3=O	InChI=1S/C18H12N4O3/c1-11-8-14(22-18(25)21-16(23)10-20-22)6-7-15(11)17(24)13-4-2-12(9-19)3-5-13/h2-8,10H,1H3,(H,21,23,25)
CHEMBL265667	Cc1cc(cc(C)c1C(O)c2ccc(Cl)cc2)N3N=CC(=O)NC3=O	InChI=1S/C18H16ClN3O3/c1-10-7-14(22-18(25)21-15(23)9-20-22)8-11(2)16(10)17(24)12-3-5-13(19)6-4-12/h3-9,17,24H,1-2H3,(H,21,23,25)
CHEMBL6362	Cc1ccc(cc1)C(=O)c2ccc(cc2)N3N=CC(=O)NC3=O	InChI=1S/C17H13N3O3/c1-11-2-4-12(5-3-11)16(22)13-6-8-14(9-7-13)20-17(23)19-15(21)10-18-20/h2-10H,1H3,(H,19,21,23)
CHEMBL267864	Cc1cc(ccc1C(=O)c2ccc(Cl)cc2)N3N=CC(=O)NC3=O	InChI=1S/C17H12ClN3O3/c1-10-8-13(21-17(24)20-15(22)9-19-21)6-7-14(10)16(23)11-2-4-12(18)5-3-11/h2-9H,1H3,(H,20,22,24)

in conclusion

There is an API description in the reference URL, but there is not enough information, so I had to think and make mistakes while hacking the source of the previously installed ChEMBL webresource client. I also installed ChEMBL V25 locally, so I'd like to hack it from now on.

reference

ChEMBL web services API live documentation Explorer

[PYTHON] Get structural data from CHEMBLID