[PYTHON] Collect machine learning data by scraping from bio-based public databases

Overview

A memo from when I investigated how to scrape data from a public database.

Motivation

- I wanted to try machine learning with the data described in a paper.
- It turned out that the necessary data can be retrieved with the search function on the web site, but there were so many entries that I wanted to fetch them automatically to avoid copy-and-paste mistakes.
- The site also has a REST API, but verifying that it returns the same results as the specification and the web screen seemed like a hassle.

Data you want to get

I want to obtain compounds (ligands) that inhibit coronavirus proteins from the Protein Data Bank (PDB).

The procedure for manually acquiring data is shown below.

1. Search screen

Enter the conditions on the search screen (https://www.rcsb.org/search/advanced) and click the search button (magnifying glass icon).


2. Search result list

A list of protein structures will be displayed as a search result, so click the ID you want to obtain. By the way, this time, I would like to collect the data obtained on the subsequent screens for all IDs.


3. Protein structure details screen

Obtain the ligand ID and the binding affinity (type of inhibition constant, value, and unit) from the protein structure details screen, then click the ligand ID.


4. Ligand details screen

Obtain structural information such as SMILES and InChI from the ligand details screen.

Preliminary survey

I checked, at the HTML level, how the screen transitions work and how the data needed for scraping is represented.

- For the search on the first screen, checking the traffic in the browser's developer tools showed that the search conditions are POSTed as JSON and the results come back as a JSON response; the communication and screen rendering appear to be done in JavaScript.
- On the protein structure details screen, the ligand URL and the inhibition data are stored in td tags inside tr tags whose id is "binding_row_N" (N is a number).
- On the ligand details screen, items such as SMILES are output in a td inside a tr tag with an id such as "chemicalIsomeric"; the same applies to InChI and InChIKey.
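
As a quick check of this behaviour before writing the full scraper, the endpoint can presumably be exercised directly with requests. The simplified query below is hypothetical; only the endpoint URL and the shape of the response (a "result_set" list of entries with an "identifier") come from the observations above, and the real query used later is much larger.


import requests

search_url = "https://www.rcsb.org/search/data"
# Hypothetical reduced query: entries that have an IC50 binding affinity.
query = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_binding_affinity.type",
            "operator": "exact_match",
            "value": "IC50"
        }
    },
    "return_type": "entry",
    "request_options": {"pager": {"start": 0, "rows": 5}}
}

response = requests.post(search_url, json=query)
print(response.status_code)
print([hit["identifier"] for hit in response.json()["result_set"]])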

Try

With the preliminary survey done, I gave it a try. Since I was also interested in Scrapy, I tried two approaches: one with Scrapy and one without (raw Python).

Method 1: Raw Python

code

get_pdb.py


import requests
import json
import time
import lxml.html
import argparse
import csv


def get_ligand(ligand_id):
    tmp_url = "https://www.rcsb.org" + ligand_id

    response = requests.get(tmp_url)
    if response.status_code != 200:
        return response.status_code, []

    html = response.text
    root = lxml.html.fromstring(html)
    print(tmp_url)

    smiles = root.xpath("//tr[@id='chemicalIsomeric']/td[1]/text()")[0]
    inchi = root.xpath("//tr[@id='chemicalInChI']/td[1]/text()")[0]
    inchi_key = root.xpath("//tr[@id='chemicalInChIKey']/td[1]/text()")[0]

    return response.status_code, [smiles, inchi, inchi_key]


def get_structure(structure_id):
    structure_url = "https://www.rcsb.org/structure/"
    tmp_url = structure_url + structure_id
    print(tmp_url)
    html = requests.get(tmp_url).text
    root = lxml.html.fromstring(html)
    binding_trs = root.xpath("//tr[contains(@id,'binding_row')]")

    datas = []
    ids = []

    for tr in binding_trs:
        print(tr.xpath("@id"))
        d1 = tr.xpath("td[position()=1]/a/@href")
        if d1[0] in ids:
            continue

        ids.append(d1[0])
        status_code, values = get_ligand(d1[0])
        ligand_id = d1[0][(d1[0].rfind("/") + 1):]
        if status_code == 200:
            smiles, inchi, inchi_key = values
            item = tr.xpath("td[position()=2]/a/text()")[0]
            item = item.strip()
            value = tr.xpath("td[position()=2]/text()")[0]
            value = value.replace(":", "")
            value = value.replace(";", "")
            value = value.replace("&nbsp", "")
            value = value.replace("\n", "")
            print(value)
            values = value.split(" ", 1)
            print(values)
            value = values[0].strip()
            unit = values[1].strip()

            datas.append([ligand_id, smiles, inchi, inchi_key, item, value, unit])
        time.sleep(1)

    return datas


def main():

    parser = argparse.ArgumentParser()
    parser.add_argument("-output", type=str, required=True)
    args = parser.parse_args()

    base_url = "https://www.rcsb.org/search/data"
    payloads = {"query":{"type":"group","logical_operator":"and","nodes":[{"type":"group","logical_operator":"and","nodes":[{"type":"group","logical_operator":"or","nodes":[{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.value","negation":False,"operator":"exists"},"node_id":0},{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.type","operator":"exact_match","value":"IC50"},"node_id":1}],"label":"nested-attribute"},{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.value","negation":False,"operator":"exists"},"node_id":2},{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.type","operator":"exact_match","value":"Ki"},"node_id":3}],"label":"nested-attribute"}]},{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"operator":"exact_match","negation":False,"value":"Severe acute respiratory syndrome-related coronavirus","attribute":"rcsb_entity_source_organism.taxonomy_lineage.name"},"node_id":4}]}],"label":"text"}],"label":"query-builder"},"return_type":"entry","request_options":{"pager":{"start":0,"rows":100},"scoring_strategy":"combined","sort":[{"sort_by":"score","direction":"desc"}]},"request_info":{"src":"ui","query_id":"e757fdfd5f9fb0efa272769c5966e3f4"}}
    print(json.dumps(payloads))

    response = requests.post(
        base_url,
        json.dumps(payloads),
        headers={'Content-Type': 'application/json'})

    datas = []
    for a in response.json()["result_set"]:
        structure_id = a["identifier"]
        datas.extend(get_structure(structure_id))
        time.sleep(1)

    with open(args.output, "w") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["ligand_id", "canonical_smiles", "inchi", "inchi_key", "item", "value", "unit"])

        for data in datas:
            writer.writerow(data)


if __name__ == "__main__":
    main()

Commentary

- The main function performs the search with a JSON request. Since it receives an array of IDs, it loops over them and calls get_structure for each.
- get_structure picks up the binding-affinity tr tags, extracts the inhibition data from them, and calls get_ligand.
- get_ligand retrieves SMILES, InChI, and so on for the given ligand ID.
- Finally, all the data collected in main is written out to a CSV file.
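
For reference, assuming the script is saved as get_pdb.py, it can presumably be run as `python get_pdb.py -output output.csv`, which writes the collected rows to the specified CSV file.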

Method 2: Using Scrapy

code

items.py


import scrapy


class PdbGetItem(scrapy.Item):
    ligand_id = scrapy.Field()
    canonical_smiles = scrapy.Field()
    inchi = scrapy.Field()
    inchi_key = scrapy.Field()
    item = scrapy.Field()
    value = scrapy.Field()
    unit = scrapy.Field()

get_pdb.py


import scrapy
from scrapy.http import Request, JsonRequest
import json
from ..items import PdbGetItem

class LigandsSpider(scrapy.Spider):
    name = 'get_pdb'
    allowed_domains = ['www.rcsb.org']

    def start_requests(self):
        base_url = "https://www.rcsb.org/search/data"
        payloads = {"query":{"type":"group","logical_operator":"and","nodes":[{"type":"group","logical_operator":"and","nodes":[{"type":"group","logical_operator":"or","nodes":[{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.value","negation":False,"operator":"exists"},"node_id":0},{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.type","operator":"exact_match","value":"IC50"},"node_id":1}],"label":"nested-attribute"},{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.value","negation":False,"operator":"exists"},"node_id":2},{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.type","operator":"exact_match","value":"Ki"},"node_id":3}],"label":"nested-attribute"}]},{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"operator":"exact_match","negation":False,"value":"Severe acute respiratory syndrome-related coronavirus","attribute":"rcsb_entity_source_organism.taxonomy_lineage.name"},"node_id":4}]}],"label":"text"}],"label":"query-builder"},"return_type":"entry","request_options":{"pager":{"start":0,"rows":100},"scoring_strategy":"combined","sort":[{"sort_by":"score","direction":"desc"}]},"request_info":{"src":"ui","query_id":"e757fdfd5f9fb0efa272769c5966e3f4"}}
        yield JsonRequest(url=base_url, data=payloads, callback=self.parse)

    def parse(self, response):
        jsonresponse = json.loads(response.text)
        for a in jsonresponse["result_set"]:
            structure_id = a['identifier']
            structure_url = "https://www.rcsb.org/structure/"
            structure_url += structure_id
            yield Request(url=structure_url, callback=self.parse_structure)

    def parse_structure(self, response):
        ids = response.xpath("//tr[contains(@id,'binding_row')]/td[position()=1]/a/@href").getall()
        items = response.xpath("//tr[contains(@id,'binding_row')]/td[position()=2]/a/text()").getall()
        values = response.xpath("//tr[contains(@id,'binding_row')]/td[position()=2]/text()").getall()

        for ligand_id, item, value in zip(ids, items, values):
            ligand_url = "https://www.rcsb.org" + ligand_id
            item = item.strip()
            value = value.replace(":", "")
            value = value.replace(";", "")
            value = value.replace("&nbsp", "")
            value = value.replace("\n", "")
            values = value.split(" ", 1)
            value = values[0].strip()
            unit = values[1].strip()
            pdb_item = PdbGetItem()
            ligand_id = ligand_id[(ligand_id.rfind("/") + 1):]
            pdb_item["ligand_id"] = ligand_id
            pdb_item["item"] = item
            pdb_item["value"] = value
            pdb_item["unit"] = unit

            request = Request(url=ligand_url, callback=self.parse_ligand)
            request.meta["item"] = pdb_item
            yield request

    def parse_ligand(self, response):
        smiles = response.xpath("//tr[@id='chemicalIsomeric']/td[1]/text()").get()
        inchi = response.xpath("//tr[@id='chemicalInChI']/td[1]/text()").get()
        inchi_key = response.xpath("//tr[@id='chemicalInChIKey']/td[1]/text()").get()

        pdb_item = response.meta['item']
        pdb_item["canonical_smiles"] = smiles
        pdb_item["inchi"] = inchi
        pdb_item["inchi_key"] = inchi_key
        yield pdb_item

settings.py


ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'pdb_get.pipelines.PdbGetPipeline': 300,
}

pipelines.py


from scrapy.exceptions import DropItem
import csv

class PdbGetPipeline:

    def open_spider(self, spider):
        self.f = open("output.csv", "w")
        self.writer = csv.writer(self.f, lineterminator="\n")
        self.writer.writerow(["ligand_id", "canonical_smiles", "inchi", "inchi_key", "item", "value", "unit"])

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        if item["canonical_smiles"]:
            self.writer.writerow([item["ligand_id"], item["canonical_smiles"], item["inchi"],
                                  item["inchi_key"], item["item"], item["value"], item["unit"]])
            return item
        else:
            raise DropItem(item["ligand_id"])

Commentary

- get_pdb.py is the scraper (spider), items.py defines the item class that holds the data, settings.py holds the settings, and pipelines.py contains the pipeline class that processes items.
- settings.py specifies the per-request sleep (3 seconds), whether to obey robots.txt, and enables the pipeline in pipelines.py.
- The start_requests method of LigandsSpider in get_pdb.py issues the search request.
- The parse method of LigandsSpider processes the search results and requests each protein structure details screen.
- The parse_structure method of LigandsSpider extracts the inhibition data from the protein structure details screen and moves on to the ligand details screen.
- The parse_ligand method of LigandsSpider extracts the structural information from the ligand details screen.
- The data for one item is gathered in two steps, parse_structure and parse_ligand, so the partially filled item has to be handed over between them: it is attached with `request.meta["item"] = pdb_item` in parse_structure and picked up with `pdb_item = response.meta['item']` in parse_ligand. This hand-off is the key trick here.
- pipelines.py opens the output file, writes each item to it, and closes it, producing the CSV result.
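
For reference, assuming the files above sit in a standard Scrapy project named pdb_get (which matches the pipeline path in settings.py and can be created with `scrapy startproject pdb_get`), the spider would be run from the project directory with `scrapy crawl get_pdb`, and the pipeline writes the results to output.csv.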

Output CSV

The output looks like this. Since we now have both structural and inhibition data, we can go on and build a model that predicts drug candidates (a small featurization sketch follows the CSV).

ligand_id,canonical_smiles,inchi,inchi_key,item,value,unit
CYV,CCOC(=O)CC[C@H](C[C@H]1CCNC1=O)NC(=O)[C@H](CC=C(C)C)CC(=O)[C@@H](NC(=O)[C@H](CO)NC(=O)OC(C)(C)C)C(C)C,"InChI=1S/C32H54N4O9/c1-9-44-26(39)13-12-23(16-22-14-15-33-28(22)40)34-29(41)21(11-10-19(2)3)17-25(38)27(20(4)5)36-30(42)24(18-37)35-31(43)45-32(6,7)8/h10,20-24,27,37H,9,11-18H2,1-8H3,(H,33,40)(H,34,41)(H,35,43)(H,36,42)/t21-,22-,23-,24+,27+/m1/s1",CSVHTXSZRNRRSB-AHPXAKOISA-N,IC50,80000,nM
GRM,C[C@@H](N1CC[C@@H](CC1)C(=O)NCc2ccc3OCOc3c2)c4cccc5ccccc45,"InChI=1S/C26H28N2O3/c1-18(22-8-4-6-20-5-2-3-7-23(20)22)28-13-11-21(12-14-28)26(29)27-16-19-9-10-24-25(15-19)31-17-30-24/h2-10,15,18,21H,11-14,16-17H2,1H3,(H,27,29)/t18-/m1/s1",IVXBCFLWMPMSAP-GOSISDBHSA-N,IC50,320,nM
XP1,CN(C)c1ccc(cc1)C(O)=O,"InChI=1S/C9H11NO2/c1-10(2)8-5-3-7(4-6-8)9(11)12/h3-6H,1-2H3,(H,11,12)",YDIYEOMDOWUDTJ-UHFFFAOYSA-N,IC50,5000,nM
S88,C[C@@H](N1CC[C@H](CC1)C(=O)NCc2cccc(F)c2)c3cccc4ccccc34,"InChI=1S/C25H27FN2O/c1-18(23-11-5-8-20-7-2-3-10-24(20)23)28-14-12-21(13-15-28)25(29)27-17-19-6-4-9-22(26)16-19/h2-11,16,18,21H,12-15,17H2,1H3,(H,27,29)/t18-/m1/s1",XJTWGMHOQKGBDO-GOSISDBHSA-N,IC50,150,nM
P85,C[C@@H](N1CC[C@@H](CC1)C(=O)NCc2ccc(F)cc2)c3cccc4ccccc34,"InChI=1S/C25H27FN2O/c1-18(23-8-4-6-20-5-2-3-7-24(20)23)28-15-13-21(14-16-28)25(29)27-17-19-9-11-22(26)12-10-19/h2-12,18,21H,13-17H2,1H3,(H,27,29)/t18-/m1/s1",VTAUBQMDNJOEAK-GOSISDBHSA-N,IC50,490,nM
3X5,O=C[C@H](Cc1[nH]cnc1)NC[C@H]2C[C@@H]3CCCC[C@H]3CN2C(=O)c4ccc(cc4)c5ccccc5,"InChI=1S/C29H34N4O2/c34-19-27(15-26-16-30-20-32-26)31-17-28-14-24-8-4-5-9-25(24)18-33(28)29(35)23-12-10-22(11-13-23)21-6-2-1-3-7-21/h1-3,6-7,10-13,16,19-20,24-25,27-28,31H,4-5,8-9,14-15,17-18H2,(H,30,32)/t24-,25-,27-,28+/m0/s1",VZCULZJNALRGNB-DNZWLJDLSA-N,IC50,240000,nM
3A7,Brc1ccc(cc1)C(=O)N2C[C@H]3CCCC[C@@H]3C[C@H]2CN[C@@H](Cc4[nH]cnc4)C=O,"InChI=1S/C23H29BrN4O2/c24-19-7-5-16(6-8-19)23(30)28-13-18-4-2-1-3-17(18)9-22(28)12-26-21(14-29)10-20-11-25-15-27-20/h5-8,11,14-15,17-18,21-22,26H,1-4,9-10,12-13H2,(H,25,27)/t17-,18-,21+,22+/m1/s1",SKLHMRHVVDDIOX-UBBRYJJRSA-N,IC50,63000,nM
SFG,N[C@@H](CC[C@H](N)C(O)=O)C[C@H]1O[C@H]([C@H](O)[C@@H]1O)n2cnc3c(N)ncnc23,"InChI=1S/C15H23N7O5/c16-6(1-2-7(17)15(25)26)3-8-10(23)11(24)14(27-8)22-5-21-9-12(18)19-4-20-13(9)22/h4-8,10-11,14,23-24H,1-3,16-17H2,(H,25,26)(H2,18,19,20)/t6-,7-,8+,10+,11+,14+/m0/s1",LMXOHSDXUQEUSF-YECHIGJVSA-N,IC50,740,nM
D3F,Cc1cc(c(Cl)cc1Cl)[S](=O)(=O)c2c(cc(cc2[N+]([O-])=O)C(F)(F)F)[N+]([O-])=O,"InChI=1S/C14H7Cl2F3N2O6S/c1-6-2-12(9(16)5-8(6)15)28(26,27)13-10(20(22)23)3-7(14(17,18)19)4-11(13)21(24)25/h2-5H,1H3",INAZPZCJNPPHGV-UHFFFAOYSA-N,IC50,300,nM
F3F,FC(F)(F)c1[nH]c(SC(=O)c2oc(cc2)C#Cc3ccccc3)nn1,"InChI=1S/C16H8F3N3O2S/c17-16(18,19)14-20-15(22-21-14)25-13(23)12-9-8-11(24-12)7-6-10-4-2-1-3-5-10/h1-5,8-9H,(H,20,21,22)",VNGWUVBXUIDQTK-UHFFFAOYSA-N,IC50,3000,nM
23H,CCC(C)(C)NC(=O)[C@H](N(C(=O)Cn1nnc2ccccc12)c3ccc(NC(C)=O)cc3)c4cccn4C,"InChI=1S/C28H33N7O3/c1-6-28(3,4)30-27(38)26(24-12-9-17-33(24)5)35(21-15-13-20(14-16-21)29-19(2)36)25(37)18-34-23-11-8-7-10-22(23)31-32-34/h7-17,26H,6,18H2,1-5H3,(H,29,36)(H,30,38)/t26-/m1/s1",BCIIGGMNYNWRQK-AREMUKBSSA-N,IC50,6200,nM
CY6,CCOC(=O)/C=C/[C@H](C[C@@H]1CCNC1=O)NC(=O)[C@H](CC=C(C)C)CC(=O)[C@@H](NC(=O)c2cc(C)on2)C(C)C,"InChI=1S/C29H42N4O7/c1-7-39-25(35)11-10-22(15-21-12-13-30-27(21)36)31-28(37)20(9-8-17(2)3)16-24(34)26(18(4)5)32-29(38)23-14-19(6)40-33-23/h8,10-11,14,18,20-22,26H,7,9,12-13,15-16H2,1-6H3,(H,30,36)(H,31,37)(H,32,38)/b11-10+/t20-,21+,22-,26+/m1/s1",CSNQHKJCKPMZCY-YFUAOJPXSA-N,IC50,70000,nM
0EN,CC(C)(C)NC(=O)[C@H](N(C(=O)c1occc1)c2ccc(cc2)C(C)(C)C)c3cccnc3,"InChI=1S/C26H31N3O3/c1-25(2,3)19-11-13-20(14-12-19)29(24(31)21-10-8-16-32-21)22(18-9-7-15-27-17-18)23(30)28-26(4,5)6/h7-17,22H,1-6H3,(H,28,30)/t22-/m1/s1",JXGIYKRRPGCLFV-JOCHJYFZSA-N,IC50,4800,nM
TTT,C[C@@H](NC(=O)c1cc(N)ccc1C)c2cccc3ccccc23,"InChI=1S/C20H20N2O/c1-13-10-11-16(21)12-19(13)20(23)22-14(2)17-9-5-7-15-6-3-4-8-18(15)17/h3-12,14H,21H2,1-2H3,(H,22,23)/t14-/m1/s1",UVERBUNNCOKGNZ-CQSZACIVSA-N,IC50,2640,nM
WR1,C[C@@H](NC(=O)[C@H](Cc1ccccc1)NC(=O)OCc2ccccc2)C(=O)CCO,"InChI=1S/C22H26N2O5/c1-16(20(26)12-13-25)23-21(27)19(14-17-8-4-2-5-9-17)24-22(28)29-15-18-10-6-3-7-11-18/h2-11,16,19,25H,12-15H2,1H3,(H,23,27)(H,24,28)/t16-,19+/m1/s1",UUOOAGBWJUGBMV-APWZRJJASA-N,Ki,2200,nM
S89,OC[C@H](Cc1ccccc1)NC(=O)[C@H](Cc2ccccc2)NC(=O)/C=C/c3ccccc3,"InChI=1S/C27H28N2O3/c30-20-24(18-22-12-6-2-7-13-22)28-27(32)25(19-23-14-8-3-9-15-23)29-26(31)17-16-21-10-4-1-5-11-21/h1-17,24-25,30H,18-20H2,(H,28,32)(H,29,31)/b17-16+/t24-,25-/m0/s1",GEVQDXBVGFGWFA-KQRRRSJSSA-N,Ki,2240,nM
ZU3,CC(C)C[C@H](NC(=O)[C@H](CNC(=O)C(C)(C)C)NC(=O)OCc1ccccc1)C(=O)N[C@H](CCC(C)=O)C[C@@H]2CCNC2=O,"InChI=1S/C32H49N5O7/c1-20(2)16-25(28(40)35-24(13-12-21(3)38)17-23-14-15-33-27(23)39)36-29(41)26(18-34-30(42)32(4,5)6)37-31(43)44-19-22-10-8-7-9-11-22/h7-11,20,23-26H,12-19H2,1-6H3,(H,33,39)(H,34,42)(H,35,40)(H,36,41)(H,37,43)/t23-,24+,25-,26-/m0/s1",IEQRDAZPCPYZAJ-QYOOZWMWSA-N,Ki,38,nM
ZU5,CC(C)C[C@H](NC(=O)[C@@H](NC(=O)OCc1ccccc1)[C@@H](C)OC(C)(C)C)C(=O)N[C@H](CCC(=O)C2CC2)C[C@@H]3CCNC3=O,"InChI=1S/C34H52N4O7/c1-21(2)18-27(31(41)36-26(14-15-28(39)24-12-13-24)19-25-16-17-35-30(25)40)37-32(42)29(22(3)45-34(4,5)6)38-33(43)44-20-23-10-8-7-9-11-23/h7-11,21-22,24-27,29H,12-20H2,1-6H3,(H,35,40)(H,36,41)(H,37,42)(H,38,43)/t22-,25+,26-,27+,29+/m1/s1",QIMPWBPEAHOISN-XSLDCGIXSA-N,Ki,99,nM
TLD,Cc1ccc(S)c(S)c1,"InChI=1S/C7H8S2/c1-5-2-3-6(8)7(9)4-5/h2-4,8-9H,1H3",NIAAGQAEVGMHPM-UHFFFAOYSA-N,Ki,1400,nM
PMA,OC(=O)c1cc(C(O)=O)c(cc1C(O)=O)C(O)=O,"InChI=1S/C10H6O8/c11-7(12)3-1-4(8(13)14)6(10(17)18)2-5(3)9(15)16/h1-2H,(H,11,12)(H,13,14)(H,15,16)(H,17,18)",CYIDZMCFTVVTJO-UHFFFAOYSA-N,Ki,700,nM
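
As a rough illustration of that next step, here is a minimal sketch of turning the output CSV into features and a regression target. It is not part of the scraper: it assumes pandas, RDKit, and scikit-learn are installed, and the Morgan-fingerprint features and random forest model are just one arbitrary choice.


import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("output.csv")
df = df[df["unit"] == "nM"].copy()
# Convert nM activities (IC50 / Ki) to a log scale, a common regression target.
df["pact"] = 9.0 - np.log10(df["value"].astype(float))


def featurize(smiles, n_bits=2048):
    # Morgan (ECFP-like) fingerprint as a simple fixed-length descriptor.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    return np.array(list(fp), dtype=np.int8)


features = [featurize(s) for s in df["canonical_smiles"]]
mask = np.array([f is not None for f in features])
X = np.stack([f for f in features if f is not None])
y = df["pact"].values[mask]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
print(model.predict(X[:3]))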

Consideration

Let's compare the cases with and without Scrapy.

Benefits of using Scrapy

- As you can see from the logs, it recognizes requests to URLs it has already visited and does not access them needlessly.
- When an error occurs, it retries automatically (the default seems to be three attempts).
- The sleep between requests can be configured in one place, in settings.py.
- The scraping behaviour can be fine-tuned, for example whether to obey robots.txt.
- It produces useful log output.
- The source code ends up cleanly structured.
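
For reference, the behaviours listed above correspond to standard Scrapy settings along these lines. This is only a sketch: apart from DOWNLOAD_DELAY and ROBOTSTXT_OBEY, which appear in the settings.py above, the values shown are Scrapy defaults rather than something set in this project.


DOWNLOAD_DELAY = 3          # wait between requests, as used above
ROBOTSTXT_OBEY = False      # set True to obey robots.txt
RETRY_ENABLED = True        # retry failed requests automatically
RETRY_TIMES = 2             # retries in addition to the first attempt (default: 2)
DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"  # default duplicate-request filter
LOG_LEVEL = "DEBUG"         # logging verbosity (default: DEBUG)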

Disadvantages of using Scrapy

- Fine-grained control, such as aborting when an error occurs, is harder (although in the end it can be done with a pipeline).
- The learning cost is high (once you have learned it, it becomes enjoyable, though...).
- As in this search step, more and more sites exchange structured data such as JSON in requests and responses; in that case you only have to manipulate JSON, so the benefit may be small.

Scrapy has few disadvantages. All right, from now on let's keep collecting data with Scrapy.

Finally, the environment

This work was carried out in the following environment.
