[PYTHON] Get structural data from a ChEMBL ID

Introduction

I looked into how to query ChEMBL with a ChEMBL ID to retrieve structural data such as SDF and SMILES.

Acquisition of single compound SDF (MOL)

First, to get the MOL data for a single ChEMBL ID, just open a URL like the one below, substituting the ChEMBL ID you want. The key point is to append ?format=sdf to the end.

https://www.ebi.ac.uk/chembl/api/data/molecule/CHEMBL1607289?format=sdf
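
The same request can be made from a script. The following is a minimal sketch, not a definitive implementation: the helper names build_sdf_url and fetch_sdf are made up for illustration, and the standard-library urllib.request is used here so that the snippet has no third-party dependencies.

```python
import urllib.request

# Base of the single-molecule endpoint shown in the URL above.
BASE_URL = "https://www.ebi.ac.uk/chembl/api/data/molecule"


def build_sdf_url(chembl_id):
    """Return the URL for the MOL/SDF record of one ChEMBL ID (hypothetical helper)."""
    return "{0}/{1}?format=sdf".format(BASE_URL, chembl_id)


def fetch_sdf(chembl_id, timeout=30):
    """Download the SDF text for one compound (hypothetical helper, needs network access)."""
    with urllib.request.urlopen(build_sdf_url(chembl_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Building the URL needs no network access.
print(build_sdf_url("CHEMBL1607289"))
```

Calling fetch_sdf("CHEMBL1607289") should then return the same text that the browser shows for the URL above, assuming the service is reachable.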

Acquisition of SDF for multiple compounds

Next, to get the SDF for several ChEMBL IDs at once, use the following URL. The key point is that the ChEMBL IDs are separated by semicolons.

https://www.ebi.ac.uk/chembl/api/data/molecule/set/CHEMBL1607289;CHEMBL1607290?format=sdf
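
As a rough sketch of how such a URL could be assembled from a list of IDs: the helper name build_set_url is hypothetical, and the check that every ID is "CHEMBL" followed by digits is an assumption added here to catch typos early.

```python
import re

# Set endpoint shown in the URL above.
SET_URL = "https://www.ebi.ac.uk/chembl/api/data/molecule/set/"

# Assumption: a ChEMBL ID is the prefix "CHEMBL" followed by digits only.
CHEMBL_ID_PATTERN = re.compile(r"CHEMBL\d+")


def build_set_url(chembl_ids, sdf=True):
    """Join ChEMBL IDs with semicolons into one set URL (hypothetical helper)."""
    if not chembl_ids:
        raise ValueError("at least one ChEMBL ID is required")
    for chembl_id in chembl_ids:
        if not CHEMBL_ID_PATTERN.fullmatch(chembl_id):
            raise ValueError("not a ChEMBL ID: {0!r}".format(chembl_id))
    url = SET_URL + ";".join(chembl_ids)
    if sdf:
        # Ask for SDF output; without this the compound data is returned instead.
        url += "?format=sdf"
    return url


print(build_set_url(["CHEMBL1607289", "CHEMBL1607290"]))
```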

Acquisition of compound data

To get various data such as molecular weight and SMILES instead of SDF, simply omit format=sdf, as follows.

https://www.ebi.ac.uk/chembl/api/data/molecule/set/CHEMBL1607289

Then, the compound data can be obtained in the following XML format.

<?xml version="1.0" encoding="utf-8"?>
<response>
  <molecules>
    <molecule>
      <atc_classifications/>
      <availability_type>-1</availability_type>
      <biotherapeutic/>
      <black_box_warning>0</black_box_warning>
      <chebi_par_id/>
      <chirality>-1</chirality>
      <cross_references>
        <molecule>
          <xref_id>26754105</xref_id>
          <xref_name>SID: 26754105</xref_name>
          <xref_src>PubChem</xref_src>
        </molecule>
      </cross_references>
      <dosed_ingredient/>
      <first_approval/>
      <first_in_class>-1</first_in_class>
      <helm_notation/>
      <indication_class/>
      <inorganic_flag>-1</inorganic_flag>
      <max_phase/>
      <molecule_chembl_id>CHEMBL1607289</molecule_chembl_id>
      <molecule_hierarchy>
        <molecule_chembl_id>CHEMBL1607289</molecule_chembl_id>
        <parent_chembl_id>CHEMBL1607289</parent_chembl_id>
      </molecule_hierarchy>
      <molecule_properties>
        <acd_logd>-2.11</acd_logd>
        <acd_logp>-2.11</acd_logp>
        <acd_most_apka>13.22</acd_most_apka>
        <acd_most_bpka>5</acd_most_bpka>
        <alogp>-1.58</alogp>
        <aromatic_rings>1</aromatic_rings>
        <full_molformula>C16H22N2O6</full_molformula>
        <full_mwt>338.36</full_mwt>
        <hba>8</hba>
        <hba_lipinski>8</hba_lipinski>
        <hbd>4</hbd>
        <hbd_lipinski>4</hbd_lipinski>
        <heavy_atoms>24</heavy_atoms>
        <molecular_species>NEUTRAL</molecular_species>
        <mw_freebase>338.36</mw_freebase>
        <mw_monoisotopic>338.1478</mw_monoisotopic>
        <num_lipinski_ro5_violations/>
        <num_ro5_violations/>
        <psa>123.35</psa>
        <qed_weighted>0.49</qed_weighted>
        <ro3_pass>N</ro3_pass>
        <rtb>3</rtb>
      </molecule_properties>
      <molecule_structures>
        <canonical_smiles>COC(=O)[C@@H]1C[C@@]2(O)[C@H](O)[C@H](O)[C@H](O)C[C@H]2N1Cc3cccnc3</canonical_smiles>
        <standard_inchi>InChI=1S/C16H22N2O6/c1-24-15(22)10-6-16(23)12(5-11(19)13(20)14(16)21)18(10)8-9-3-2-4-17-7-9/h2-4,7,10-14,19-21,23H,5-6,8H2,1H3/t10-,11+,12+,13+,14+,16-/m0/s1</standard_inchi>
        <standard_inchi_key>JKLLFWDPMLWZFY-SFEJEDPTSA-N</standard_inchi_key>
      </molecule_structures>
      <molecule_synonyms/>
      <molecule_type>Small molecule</molecule_type>
      <natural_product>-1</natural_product>
      <oral/>
      <parenteral/>
      <polymer_flag/>
      <pref_name/>
      <prodrug>-1</prodrug>
      <structure_type>MOL</structure_type>
      <therapeutic_flag/>
      <topical/>
      <usan_stem/>
      <usan_stem_definition/>
      <usan_substem/>
      <usan_year/>
      <withdrawn_class/>
      <withdrawn_country/>
      <withdrawn_flag/>
      <withdrawn_reason/>
      <withdrawn_year/>
    </molecule>
  </molecules>
</response>
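
To show how values can be pulled out of this XML, here is a small sketch using the standard-library xml.etree.ElementTree. The SAMPLE_XML string is a trimmed copy of the response above, and parse_molecules is a hypothetical helper name. Note that cross_references also contains molecule tags, so the sketch only looks at the direct children of molecules.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response shown above (most fields omitted for brevity).
SAMPLE_XML = """<response>
  <molecules>
    <molecule>
      <cross_references>
        <molecule>
          <xref_id>26754105</xref_id>
          <xref_src>PubChem</xref_src>
        </molecule>
      </cross_references>
      <molecule_chembl_id>CHEMBL1607289</molecule_chembl_id>
      <molecule_properties>
        <full_mwt>338.36</full_mwt>
      </molecule_properties>
      <molecule_structures>
        <canonical_smiles>COC(=O)[C@@H]1C[C@@]2(O)[C@H](O)[C@H](O)[C@H](O)C[C@H]2N1Cc3cccnc3</canonical_smiles>
      </molecule_structures>
    </molecule>
  </molecules>
</response>"""


def parse_molecules(xml_text):
    """Return one dict per compound in the response (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    records = []
    # "molecules/molecule" matches direct children only, so the <molecule>
    # entries nested inside <cross_references> are not picked up.
    for mol in root.findall("molecules/molecule"):
        mwt = mol.findtext("molecule_properties/full_mwt")
        records.append({
            "chembl_id": mol.findtext("molecule_chembl_id"),
            "full_mwt": float(mwt) if mwt else None,
            "smiles": mol.findtext("molecule_structures/canonical_smiles"),
        })
    return records


records = parse_molecules(SAMPLE_XML)
print(records[0]["chembl_id"], records[0]["full_mwt"])
```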

Retrieval from Python

Finally, using the method worked out above, I wrote a Python program that fetches the SDF or SMILES in one go for a list of ChEMBL IDs given in a text file.

GetStructureFromChEMBLE.py


import argparse
import xml.etree.ElementTree as ET

import requests

BASE_URL = "https://www.ebi.ac.uk/chembl/api/data/molecule/set/"


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-input", type=str, required=True)
    parser.add_argument("-type", type=str, default="sdf", choices=["sdf", "smiles"])
    parser.add_argument("-output", type=str, required=True)
    args = parser.parse_args()

    # Read one ChEMBL ID per line, skipping blank lines
    with open(args.input) as f:
        ids = [line.strip() for line in f if line.strip()]

    # Multiple IDs are joined with semicolons
    url = BASE_URL + ";".join(ids)
    if args.type == "sdf":
        url += "?format=sdf"

    result = requests.get(url)
    result.raise_for_status()  # stop on an HTTP error instead of writing an error page

    if args.type == "sdf":
        with open(args.output, "w") as f:
            f.write(result.text)
    else:
        root = ET.fromstring(result.text)
        with open(args.output, "w") as f:
            # Direct children of <molecules> only; <cross_references> also contains <molecule> tags
            for molecule in root.findall("molecules/molecule"):
                structures = molecule.find("molecule_structures")
                if structures is None:
                    continue  # no structure available for this entry
                # Take the ID from the response itself, so each row does not
                # depend on the response order matching the input order
                f.write("{0}\t{1}\t{2}\n".format(
                    molecule.findtext("molecule_chembl_id"),
                    structures.findtext("canonical_smiles"),
                    structures.findtext("standard_inchi")))


if __name__ == "__main__":
    main()

To use it, pass a file listing the ChEMBL IDs one per line with the -input option, and choose sdf or smiles with the -type option. Then specify the output file with the -output option.

Input file example

input.txt


CHEMBL6329
CHEMBL6328
CHEMBL265667
CHEMBL6362
CHEMBL267864

Output file example

When smiles is specified for -type, the ChEMBL ID, SMILES, and InChI are output in tab-delimited format as shown below.

output.tsv


CHEMBL6329	Cc1cc(ccc1C(=O)c2ccccc2Cl)N3N=CC(=O)NC3=O	InChI=1S/C17H12ClN3O3/c1-10-8-11(21-17(24)20-15(22)9-19-21)6-7-12(10)16(23)13-4-2-3-5-14(13)18/h2-9H,1H3,(H,20,22,24)
CHEMBL6328	Cc1cc(ccc1C(=O)c2ccc(cc2)C#N)N3N=CC(=O)NC3=O	InChI=1S/C18H12N4O3/c1-11-8-14(22-18(25)21-16(23)10-20-22)6-7-15(11)17(24)13-4-2-12(9-19)3-5-13/h2-8,10H,1H3,(H,21,23,25)
CHEMBL265667	Cc1cc(cc(C)c1C(O)c2ccc(Cl)cc2)N3N=CC(=O)NC3=O	InChI=1S/C18H16ClN3O3/c1-10-7-14(22-18(25)21-15(23)9-20-22)8-11(2)16(10)17(24)12-3-5-13(19)6-4-12/h3-9,17,24H,1-2H3,(H,21,23,25)
CHEMBL6362	Cc1ccc(cc1)C(=O)c2ccc(cc2)N3N=CC(=O)NC3=O	InChI=1S/C17H13N3O3/c1-11-2-4-12(5-3-11)16(22)13-6-8-14(9-7-13)20-17(23)19-15(21)10-18-20/h2-10H,1H3,(H,19,21,23)
CHEMBL267864	Cc1cc(ccc1C(=O)c2ccc(Cl)cc2)N3N=CC(=O)NC3=O	InChI=1S/C17H12ClN3O3/c1-10-8-13(21-17(24)20-15(22)9-19-21)6-7-14(10)16(23)11-2-4-12(18)5-3-11/h2-9H,1H3,(H,20,22,24)
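
The tab-delimited output above is easy to read back for later processing. The following is a sketch only, using the standard-library csv module: load_structures is a hypothetical helper name, and SAMPLE_TSV repeats two rows of the output above so that the snippet runs without the file.

```python
import csv
import io

# Two rows copied from the output above (ChEMBL ID, SMILES, InChI).
SAMPLE_TSV = (
    "CHEMBL6329\tCc1cc(ccc1C(=O)c2ccccc2Cl)N3N=CC(=O)NC3=O\t"
    "InChI=1S/C17H12ClN3O3/c1-10-8-11(21-17(24)20-15(22)9-19-21)6-7-12(10)"
    "16(23)13-4-2-3-5-14(13)18/h2-9H,1H3,(H,20,22,24)\n"
    "CHEMBL6362\tCc1ccc(cc1)C(=O)c2ccc(cc2)N3N=CC(=O)NC3=O\t"
    "InChI=1S/C17H13N3O3/c1-11-2-4-12(5-3-11)16(22)13-6-8-14(9-7-13)"
    "20-17(23)19-15(21)10-18-20/h2-10H,1H3,(H,19,21,23)\n"
)


def load_structures(handle):
    """Map each ChEMBL ID to its SMILES and InChI (hypothetical helper)."""
    # QUOTE_NONE keeps every character as-is; fields are split on tabs only.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    structures = {}
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        chembl_id, smiles, inchi = row
        structures[chembl_id] = {"smiles": smiles, "inchi": inchi}
    return structures


structures = load_structures(io.StringIO(SAMPLE_TSV))
print(structures["CHEMBL6329"]["smiles"])
```

For a real file, open output.tsv with open("output.tsv", newline="") and pass the file object to load_structures instead of the io.StringIO wrapper.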

Conclusion

The reference URL includes an API description, but it does not give enough detail, so I worked by trial and error while reading through the source of the ChEMBL webresource client that I had installed earlier. I have also installed ChEMBL V25 locally, so I'd like to dig into that next.

Reference
