[PYTHON] Collect machine learning data by scraping from bio-based public databases

Overview

A memo from when I investigated how to scrape data from a public database.

Motivation

- I wanted to try machine learning with the data described in a paper.
- It turned out that the necessary data can be retrieved with the search function on the web site, but there were so many entries that I wanted to fetch them automatically to avoid copy-and-paste mistakes.
- The site also has a REST API, but verifying that it returns the same results as the specification and the web screen seemed like a hassle.

Data you want to get

I want to obtain compounds (ligands) that inhibit coronavirus proteins from the Protein Data Bank (PDB).

The procedure for manually acquiring data is shown below.

1. Search screen

Enter the conditions on the search screen (https://www.rcsb.org/search/advanced) and click the search button (magnifying glass icon).


2. Search result list

A list of protein structures will be displayed as a search result, so click the ID you want to obtain. By the way, this time, I would like to collect the data obtained on the subsequent screens for all IDs.


3. Protein structure details screen

Obtain the ligand ID and the binding affinity (type of inhibition constant, value, and unit) from the protein structure details screen, then click the ligand ID.


4. Ligand details screen

Obtain structural information such as SMILES and InChI from the ligand details screen.

Preliminary survey

I checked, at the HTML level, how the screen transitions work and how the data needed for scraping is represented.

- For the search on the first screen, checking the traffic in the browser's developer tools showed that the search conditions are POSTed as JSON and the results come back as a JSON response; the communication and screen rendering appear to be done in JavaScript.
- On the protein structure details screen, the ligand URL and the inhibition data are stored in td tags inside tr tags whose id is "binding_row_N" (N is a number).
- On the ligand details screen, items such as SMILES are output in a td inside a tr tag with an id such as "chemicalIsomeric"; the same applies to InChI and InChIKey.
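
As a quick check of this behaviour before writing the full scraper, the endpoint can presumably be exercised directly with requests. The simplified query below is hypothetical; only the endpoint URL and the shape of the response (a "result_set" list of entries with an "identifier") come from the observations above, and the real query used later is much larger.


import requests

search_url = "https://www.rcsb.org/search/data"
# Hypothetical reduced query: entries that have an IC50 binding affinity.
query = {
    "query": {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": "rcsb_binding_affinity.type",
            "operator": "exact_match",
            "value": "IC50"
        }
    },
    "return_type": "entry",
    "request_options": {"pager": {"start": 0, "rows": 5}}
}

response = requests.post(search_url, json=query)
print(response.status_code)
print([hit["identifier"] for hit in response.json()["result_set"]])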

Try

With the preliminary survey done, I gave it a try. Since I was also interested in Scrapy, I tried two approaches: one with Scrapy and one without (raw Python).

Method 1: Raw Python

code

get_pdb.py


import requests
import json
import time
import lxml.html
import argparse
import csv


def get_ligand(ligand_id):
    tmp_url = "https://www.rcsb.org" + ligand_id

    response = requests.get(tmp_url)
    if response.status_code != 200:
        return response.status_code, []

    html = response.text
    root = lxml.html.fromstring(html)
    print(tmp_url)

    smiles = root.xpath("//tr[@id='chemicalIsomeric']/td[1]/text()")[0]
    inchi = root.xpath("//tr[@id='chemicalInChI']/td[1]/text()")[0]
    inchi_key = root.xpath("//tr[@id='chemicalInChIKey']/td[1]/text()")[0]

    return response.status_code, [smiles, inchi, inchi_key]


def get_structure(structure_id):
    structure_url = "https://www.rcsb.org/structure/"
    tmp_url = structure_url + structure_id
    print(tmp_url)
    html = requests.get(tmp_url).text
    root = lxml.html.fromstring(html)
    binding_trs = root.xpath("//tr[contains(@id,'binding_row')]")

    datas = []
    ids = []

    for tr in binding_trs:
        print(tr.xpath("@id"))
        d1 = tr.xpath("td[position()=1]/a/@href")
        if d1[0] in ids:
            continue

        ids.append(d1[0])
        status_code, values = get_ligand(d1[0])
        ligand_id = d1[0][(d1[0].rfind("/") + 1):]
        if status_code == 200:
            smiles, inchi, inchi_key = values
            item = tr.xpath("td[position()=2]/a/text()")[0]
            item = item.strip()
            value = tr.xpath("td[position()=2]/text()")[0]
            value = value.replace(":", "")
            value = value.replace(";", "")
            value = value.replace("&nbsp", "")
            value = value.replace("\n", "")
            print(value)
            values = value.split(" ", 1)
            print(values)
            value = values[0].strip()
            unit = values[1].strip()

            datas.append([ligand_id, smiles, inchi, inchi_key, item, value, unit])
        time.sleep(1)

    return datas


def main():

    parser = argparse.ArgumentParser()
    parser.add_argument("-output", type=str, required=True)
    args = parser.parse_args()

    base_url = "https://www.rcsb.org/search/data"
    payloads = {"query":{"type":"group","logical_operator":"and","nodes":[{"type":"group","logical_operator":"and","nodes":[{"type":"group","logical_operator":"or","nodes":[{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.value","negation":False,"operator":"exists"},"node_id":0},{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.type","operator":"exact_match","value":"IC50"},"node_id":1}],"label":"nested-attribute"},{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.value","negation":False,"operator":"exists"},"node_id":2},{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.type","operator":"exact_match","value":"Ki"},"node_id":3}],"label":"nested-attribute"}]},{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"operator":"exact_match","negation":False,"value":"Severe acute respiratory syndrome-related coronavirus","attribute":"rcsb_entity_source_organism.taxonomy_lineage.name"},"node_id":4}]}],"label":"text"}],"label":"query-builder"},"return_type":"entry","request_options":{"pager":{"start":0,"rows":100},"scoring_strategy":"combined","sort":[{"sort_by":"score","direction":"desc"}]},"request_info":{"src":"ui","query_id":"e757fdfd5f9fb0efa272769c5966e3f4"}}
    print(json.dumps(payloads))

    response = requests.post(
        base_url,
        json.dumps(payloads),
        headers={'Content-Type': 'application/json'})

    datas = []
    for a in response.json()["result_set"]:
        structure_id = a["identifier"]
        datas.extend(get_structure(structure_id))
        time.sleep(1)

    with open(args.output, "w") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["ligand_id", "canonical_smiles", "inchi", "inchi_key", "item", "value", "unit"])

        for data in datas:
            writer.writerow(data)


if __name__ == "__main__":
    main()

Commentary

- The main function performs the search with a JSON request. Since it receives an array of IDs, it loops over them and calls get_structure for each.
- get_structure picks up the binding-affinity tr tags, extracts the inhibition data from them, and calls get_ligand.
- get_ligand retrieves SMILES, InChI, and so on for the given ligand ID.
- Finally, all the data collected in main is written out to a CSV file.
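
For reference, assuming the script is saved as get_pdb.py, it can presumably be run as `python get_pdb.py -output output.csv`, which writes the collected rows to the specified CSV file.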

Method 2: Using Scrapy

code

items.py


import scrapy


class PdbGetItem(scrapy.Item):
    ligand_id = scrapy.Field()
    canonical_smiles = scrapy.Field()
    inchi = scrapy.Field()
    inchi_key = scrapy.Field()
    item = scrapy.Field()
    value = scrapy.Field()
    unit = scrapy.Field()

get_pdb.py


import scrapy
from scrapy.http import Request, JsonRequest
import json
from ..items import PdbGetItem

class LigandsSpider(scrapy.Spider):
    name = 'get_pdb'
    allowed_domains = ['www.rcsb.org']

    def start_requests(self):
        base_url = "https://www.rcsb.org/search/data"
        payloads = {"query":{"type":"group","logical_operator":"and","nodes":[{"type":"group","logical_operator":"and","nodes":[{"type":"group","logical_operator":"or","nodes":[{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.value","negation":False,"operator":"exists"},"node_id":0},{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.type","operator":"exact_match","value":"IC50"},"node_id":1}],"label":"nested-attribute"},{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.value","negation":False,"operator":"exists"},"node_id":2},{"type":"terminal","service":"text","parameters":{"attribute":"rcsb_binding_affinity.type","operator":"exact_match","value":"Ki"},"node_id":3}],"label":"nested-attribute"}]},{"type":"group","logical_operator":"and","nodes":[{"type":"terminal","service":"text","parameters":{"operator":"exact_match","negation":False,"value":"Severe acute respiratory syndrome-related coronavirus","attribute":"rcsb_entity_source_organism.taxonomy_lineage.name"},"node_id":4}]}],"label":"text"}],"label":"query-builder"},"return_type":"entry","request_options":{"pager":{"start":0,"rows":100},"scoring_strategy":"combined","sort":[{"sort_by":"score","direction":"desc"}]},"request_info":{"src":"ui","query_id":"e757fdfd5f9fb0efa272769c5966e3f4"}}
        yield JsonRequest(url=base_url, data=payloads, callback=self.parse)

    def parse(self, response):
        jsonresponse = json.loads(response.text)
        for a in jsonresponse["result_set"]:
            structure_id = a['identifier']
            structure_url = "https://www.rcsb.org/structure/"
            structure_url += structure_id
            yield Request(url=structure_url, callback=self.parse_structure)

    def parse_structure(self, response):
        ids = response.xpath("//tr[contains(@id,'binding_row')]/td[position()=1]/a/@href").getall()
        items = response.xpath("//tr[contains(@id,'binding_row')]/td[position()=2]/a/text()").getall()
        values = response.xpath("//tr[contains(@id,'binding_row')]/td[position()=2]/text()").getall()

        for ligand_id, item, value in zip(ids, items, values):
            ligand_url = "https://www.rcsb.org" + ligand_id
            item = item.strip()
            value = value.replace(":", "")
            value = value.replace(";", "")
            value = value.replace("&nbsp", "")
            value = value.replace("\n", "")
            values = value.split(" ", 1)
            value = values[0].strip()
            unit = values[1].strip()
            pdb_item = PdbGetItem()
            ligand_id = ligand_id[(ligand_id.rfind("/") + 1):]
            pdb_item["ligand_id"] = ligand_id
            pdb_item["item"] = item
            pdb_item["value"] = value
            pdb_item["unit"] = unit

            request = Request(url=ligand_url, callback=self.parse_ligand)
            request.meta["item"] = pdb_item
            yield request

    def parse_ligand(self, response):
        smiles = response.xpath("//tr[@id='chemicalIsomeric']/td[1]/text()").get()
        inchi = response.xpath("//tr[@id='chemicalInChI']/td[1]/text()").get()
        inchi_key = response.xpath("//tr[@id='chemicalInChIKey']/td[1]/text()").get()

        pdb_item = response.meta['item']
        pdb_item["canonical_smiles"] = smiles
        pdb_item["inchi"] = inchi
        pdb_item["inchi_key"] = inchi_key
        yield pdb_item

settings.py


ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'pdb_get.pipelines.PdbGetPipeline': 300,
}

pipelines.py


from scrapy.exceptions import DropItem
import csv

class PdbGetPipeline:

    def open_spider(self, spider):
        self.f = open("output.csv", "w")
        self.writer = csv.writer(self.f, lineterminator="\n")
        self.writer.writerow(["ligand_id", "canonical_smiles", "inchi", "inchi_key", "item", "value", "unit"])

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        if item["canonical_smiles"]:
            self.writer.writerow([item["ligand_id"], item["canonical_smiles"], item["inchi"],
                                  item["inchi_key"], item["item"], item["value"], item["unit"]])
            return item
        else:
            raise DropItem(item["ligand_id"])

Commentary

- get_pdb.py is the scraper (spider), items.py defines the item class that holds the data, settings.py holds the settings, and pipelines.py contains the pipeline class that processes items.
- settings.py specifies the per-request sleep (3 seconds), whether to obey robots.txt, and enables the pipeline in pipelines.py.
- The start_requests method of LigandsSpider in get_pdb.py issues the search request.
- The parse method of LigandsSpider processes the search results and requests each protein structure details screen.
- The parse_structure method of LigandsSpider extracts the inhibition data from the protein structure details screen and moves on to the ligand details screen.
- The parse_ligand method of LigandsSpider extracts the structural information from the ligand details screen.
- The data for one item is gathered in two steps, parse_structure and parse_ligand, so the partially filled item has to be handed over between them: it is attached with `request.meta["item"] = pdb_item` in parse_structure and picked up with `pdb_item = response.meta['item']` in parse_ligand. This hand-off is the key trick here.
- pipelines.py opens the output file, writes each item to it, and closes it, producing the CSV result.
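
For reference, assuming the files above sit in a standard Scrapy project named pdb_get (which matches the pipeline path in settings.py and can be created with `scrapy startproject pdb_get`), the spider would be run from the project directory with `scrapy crawl get_pdb`, and the pipeline writes the results to output.csv.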

Output CSV

The output looks like this. Since we now have both structural and inhibition data, we can go on and build a model that predicts drug candidates (a small featurization sketch follows the CSV).

ligand_id,canonical_smiles,inchi,inchi_key,item,value,unit
CYV,CCOC(=O)CC[C@H](C[C@H]1CCNC1=O)NC(=O)[C@H](CC=C(C)C)CC(=O)[C@@H](NC(=O)[C@H](CO)NC(=O)OC(C)(C)C)C(C)C,"InChI=1S/C32H54N4O9/c1-9-44-26(39)13-12-23(16-22-14-15-33-28(22)40)34-29(41)21(11-10-19(2)3)17-25(38)27(20(4)5)36-30(42)24(18-37)35-31(43)45-32(6,7)8/h10,20-24,27,37H,9,11-18H2,1-8H3,(H,33,40)(H,34,41)(H,35,43)(H,36,42)/t21-,22-,23-,24+,27+/m1/s1",CSVHTXSZRNRRSB-AHPXAKOISA-N,IC50,80000,nM
GRM,C[C@@H](N1CC[C@@H](CC1)C(=O)NCc2ccc3OCOc3c2)c4cccc5ccccc45,"InChI=1S/C26H28N2O3/c1-18(22-8-4-6-20-5-2-3-7-23(20)22)28-13-11-21(12-14-28)26(29)27-16-19-9-10-24-25(15-19)31-17-30-24/h2-10,15,18,21H,11-14,16-17H2,1H3,(H,27,29)/t18-/m1/s1",IVXBCFLWMPMSAP-GOSISDBHSA-N,IC50,320,nM
XP1,CN(C)c1ccc(cc1)C(O)=O,"InChI=1S/C9H11NO2/c1-10(2)8-5-3-7(4-6-8)9(11)12/h3-6H,1-2H3,(H,11,12)",YDIYEOMDOWUDTJ-UHFFFAOYSA-N,IC50,5000,nM
S88,C[C@@H](N1CC[C@H](CC1)C(=O)NCc2cccc(F)c2)c3cccc4ccccc34,"InChI=1S/C25H27FN2O/c1-18(23-11-5-8-20-7-2-3-10-24(20)23)28-14-12-21(13-15-28)25(29)27-17-19-6-4-9-22(26)16-19/h2-11,16,18,21H,12-15,17H2,1H3,(H,27,29)/t18-/m1/s1",XJTWGMHOQKGBDO-GOSISDBHSA-N,IC50,150,nM
P85,C[C@@H](N1CC[C@@H](CC1)C(=O)NCc2ccc(F)cc2)c3cccc4ccccc34,"InChI=1S/C25H27FN2O/c1-18(23-8-4-6-20-5-2-3-7-24(20)23)28-15-13-21(14-16-28)25(29)27-17-19-9-11-22(26)12-10-19/h2-12,18,21H,13-17H2,1H3,(H,27,29)/t18-/m1/s1",VTAUBQMDNJOEAK-GOSISDBHSA-N,IC50,490,nM
3X5,O=C[C@H](Cc1[nH]cnc1)NC[C@H]2C[C@@H]3CCCC[C@H]3CN2C(=O)c4ccc(cc4)c5ccccc5,"InChI=1S/C29H34N4O2/c34-19-27(15-26-16-30-20-32-26)31-17-28-14-24-8-4-5-9-25(24)18-33(28)29(35)23-12-10-22(11-13-23)21-6-2-1-3-7-21/h1-3,6-7,10-13,16,19-20,24-25,27-28,31H,4-5,8-9,14-15,17-18H2,(H,30,32)/t24-,25-,27-,28+/m0/s1",VZCULZJNALRGNB-DNZWLJDLSA-N,IC50,240000,nM
3A7,Brc1ccc(cc1)C(=O)N2C[C@H]3CCCC[C@@H]3C[C@H]2CN[C@@H](Cc4[nH]cnc4)C=O,"InChI=1S/C23H29BrN4O2/c24-19-7-5-16(6-8-19)23(30)28-13-18-4-2-1-3-17(18)9-22(28)12-26-21(14-29)10-20-11-25-15-27-20/h5-8,11,14-15,17-18,21-22,26H,1-4,9-10,12-13H2,(H,25,27)/t17-,18-,21+,22+/m1/s1",SKLHMRHVVDDIOX-UBBRYJJRSA-N,IC50,63000,nM
SFG,N[C@@H](CC[C@H](N)C(O)=O)C[C@H]1O[C@H]([C@H](O)[C@@H]1O)n2cnc3c(N)ncnc23,"InChI=1S/C15H23N7O5/c16-6(1-2-7(17)15(25)26)3-8-10(23)11(24)14(27-8)22-5-21-9-12(18)19-4-20-13(9)22/h4-8,10-11,14,23-24H,1-3,16-17H2,(H,25,26)(H2,18,19,20)/t6-,7-,8+,10+,11+,14+/m0/s1",LMXOHSDXUQEUSF-YECHIGJVSA-N,IC50,740,nM
D3F,Cc1cc(c(Cl)cc1Cl)[S](=O)(=O)c2c(cc(cc2[N+]([O-])=O)C(F)(F)F)[N+]([O-])=O,"InChI=1S/C14H7Cl2F3N2O6S/c1-6-2-12(9(16)5-8(6)15)28(26,27)13-10(20(22)23)3-7(14(17,18)19)4-11(13)21(24)25/h2-5H,1H3",INAZPZCJNPPHGV-UHFFFAOYSA-N,IC50,300,nM
F3F,FC(F)(F)c1[nH]c(SC(=O)c2oc(cc2)C#Cc3ccccc3)nn1,"InChI=1S/C16H8F3N3O2S/c17-16(18,19)14-20-15(22-21-14)25-13(23)12-9-8-11(24-12)7-6-10-4-2-1-3-5-10/h1-5,8-9H,(H,20,21,22)",VNGWUVBXUIDQTK-UHFFFAOYSA-N,IC50,3000,nM
23H,CCC(C)(C)NC(=O)[C@H](N(C(=O)Cn1nnc2ccccc12)c3ccc(NC(C)=O)cc3)c4cccn4C,"InChI=1S/C28H33N7O3/c1-6-28(3,4)30-27(38)26(24-12-9-17-33(24)5)35(21-15-13-20(14-16-21)29-19(2)36)25(37)18-34-23-11-8-7-10-22(23)31-32-34/h7-17,26H,6,18H2,1-5H3,(H,29,36)(H,30,38)/t26-/m1/s1",BCIIGGMNYNWRQK-AREMUKBSSA-N,IC50,6200,nM
CY6,CCOC(=O)/C=C/[C@H](C[C@@H]1CCNC1=O)NC(=O)[C@H](CC=C(C)C)CC(=O)[C@@H](NC(=O)c2cc(C)on2)C(C)C,"InChI=1S/C29H42N4O7/c1-7-39-25(35)11-10-22(15-21-12-13-30-27(21)36)31-28(37)20(9-8-17(2)3)16-24(34)26(18(4)5)32-29(38)23-14-19(6)40-33-23/h8,10-11,14,18,20-22,26H,7,9,12-13,15-16H2,1-6H3,(H,30,36)(H,31,37)(H,32,38)/b11-10+/t20-,21+,22-,26+/m1/s1",CSNQHKJCKPMZCY-YFUAOJPXSA-N,IC50,70000,nM
0EN,CC(C)(C)NC(=O)[C@H](N(C(=O)c1occc1)c2ccc(cc2)C(C)(C)C)c3cccnc3,"InChI=1S/C26H31N3O3/c1-25(2,3)19-11-13-20(14-12-19)29(24(31)21-10-8-16-32-21)22(18-9-7-15-27-17-18)23(30)28-26(4,5)6/h7-17,22H,1-6H3,(H,28,30)/t22-/m1/s1",JXGIYKRRPGCLFV-JOCHJYFZSA-N,IC50,4800,nM
TTT,C[C@@H](NC(=O)c1cc(N)ccc1C)c2cccc3ccccc23,"InChI=1S/C20H20N2O/c1-13-10-11-16(21)12-19(13)20(23)22-14(2)17-9-5-7-15-6-3-4-8-18(15)17/h3-12,14H,21H2,1-2H3,(H,22,23)/t14-/m1/s1",UVERBUNNCOKGNZ-CQSZACIVSA-N,IC50,2640,nM
WR1,C[C@@H](NC(=O)[C@H](Cc1ccccc1)NC(=O)OCc2ccccc2)C(=O)CCO,"InChI=1S/C22H26N2O5/c1-16(20(26)12-13-25)23-21(27)19(14-17-8-4-2-5-9-17)24-22(28)29-15-18-10-6-3-7-11-18/h2-11,16,19,25H,12-15H2,1H3,(H,23,27)(H,24,28)/t16-,19+/m1/s1",UUOOAGBWJUGBMV-APWZRJJASA-N,Ki,2200,nM
S89,OC[C@H](Cc1ccccc1)NC(=O)[C@H](Cc2ccccc2)NC(=O)/C=C/c3ccccc3,"InChI=1S/C27H28N2O3/c30-20-24(18-22-12-6-2-7-13-22)28-27(32)25(19-23-14-8-3-9-15-23)29-26(31)17-16-21-10-4-1-5-11-21/h1-17,24-25,30H,18-20H2,(H,28,32)(H,29,31)/b17-16+/t24-,25-/m0/s1",GEVQDXBVGFGWFA-KQRRRSJSSA-N,Ki,2240,nM
ZU3,CC(C)C[C@H](NC(=O)[C@H](CNC(=O)C(C)(C)C)NC(=O)OCc1ccccc1)C(=O)N[C@H](CCC(C)=O)C[C@@H]2CCNC2=O,"InChI=1S/C32H49N5O7/c1-20(2)16-25(28(40)35-24(13-12-21(3)38)17-23-14-15-33-27(23)39)36-29(41)26(18-34-30(42)32(4,5)6)37-31(43)44-19-22-10-8-7-9-11-22/h7-11,20,23-26H,12-19H2,1-6H3,(H,33,39)(H,34,42)(H,35,40)(H,36,41)(H,37,43)/t23-,24+,25-,26-/m0/s1",IEQRDAZPCPYZAJ-QYOOZWMWSA-N,Ki,38,nM
ZU5,CC(C)C[C@H](NC(=O)[C@@H](NC(=O)OCc1ccccc1)[C@@H](C)OC(C)(C)C)C(=O)N[C@H](CCC(=O)C2CC2)C[C@@H]3CCNC3=O,"InChI=1S/C34H52N4O7/c1-21(2)18-27(31(41)36-26(14-15-28(39)24-12-13-24)19-25-16-17-35-30(25)40)37-32(42)29(22(3)45-34(4,5)6)38-33(43)44-20-23-10-8-7-9-11-23/h7-11,21-22,24-27,29H,12-20H2,1-6H3,(H,35,40)(H,36,41)(H,37,42)(H,38,43)/t22-,25+,26-,27+,29+/m1/s1",QIMPWBPEAHOISN-XSLDCGIXSA-N,Ki,99,nM
TLD,Cc1ccc(S)c(S)c1,"InChI=1S/C7H8S2/c1-5-2-3-6(8)7(9)4-5/h2-4,8-9H,1H3",NIAAGQAEVGMHPM-UHFFFAOYSA-N,Ki,1400,nM
PMA,OC(=O)c1cc(C(O)=O)c(cc1C(O)=O)C(O)=O,"InChI=1S/C10H6O8/c11-7(12)3-1-4(8(13)14)6(10(17)18)2-5(3)9(15)16/h1-2H,(H,11,12)(H,13,14)(H,15,16)(H,17,18)",CYIDZMCFTVVTJO-UHFFFAOYSA-N,Ki,700,nM
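
As a rough illustration of that next step, here is a minimal sketch of turning the output CSV into features and a regression target. It is not part of the scraper: it assumes pandas, RDKit, and scikit-learn are installed, and the Morgan-fingerprint features and random forest model are just one arbitrary choice.


import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("output.csv")
df = df[df["unit"] == "nM"].copy()
# Convert nM activities (IC50 / Ki) to a log scale, a common regression target.
df["pact"] = 9.0 - np.log10(df["value"].astype(float))


def featurize(smiles, n_bits=2048):
    # Morgan (ECFP-like) fingerprint as a simple fixed-length descriptor.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    return np.array(list(fp), dtype=np.int8)


features = [featurize(s) for s in df["canonical_smiles"]]
mask = np.array([f is not None for f in features])
X = np.stack([f for f in features if f is not None])
y = df["pact"].values[mask]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
print(model.predict(X[:3]))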

Consideration

Let's compare the cases with and without Scrapy.

Benefits of using Scrapy

- As you can see from the logs, it recognizes requests to URLs it has already visited and does not access them needlessly.
- When an error occurs, it retries automatically (the default seems to be three attempts).
- The sleep between requests can be configured in one place, in settings.py.
- The scraping behaviour can be fine-tuned, for example whether to obey robots.txt.
- It produces useful log output.
- The source code ends up cleanly structured.
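
For reference, the behaviours listed above correspond to standard Scrapy settings along these lines. This is only a sketch: apart from DOWNLOAD_DELAY and ROBOTSTXT_OBEY, which appear in the settings.py above, the values shown are Scrapy defaults rather than something set in this project.


DOWNLOAD_DELAY = 3          # wait between requests, as used above
ROBOTSTXT_OBEY = False      # set True to obey robots.txt
RETRY_ENABLED = True        # retry failed requests automatically
RETRY_TIMES = 2             # retries in addition to the first attempt (default: 2)
DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"  # default duplicate-request filter
LOG_LEVEL = "DEBUG"         # logging verbosity (default: DEBUG)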

Disadvantages of using Scrapy

- Fine-grained control, such as aborting when an error occurs, is harder (although in the end it can be done with a pipeline).
- The learning cost is high (once you have learned it, it becomes enjoyable, though...).
- As in this search step, more and more sites exchange structured data such as JSON in requests and responses; in that case you only have to manipulate JSON, so the benefit may be small.

Scrapy has few disadvantages. All right, from now on let's keep collecting data with Scrapy.

Finally, the environment

This work was carried out in the following environment.
