Someone at my company asked me to find out which companies are doing IoT. Searching for "IoT company" turns up all sorts of information, and I eventually arrived at https://hnavi.co.jp (order navigation). Company name, capital, address, and homepage should be about the right amount of information. Instead of using bs4, I'll try Scrapy this time. (Environment: Python 3.8 + VS Code)
pip3 install scrapy
When installing Twisted, it complained that the Visual C++ 14 build tools for Windows were missing, so I installed those and the install went through OK.

2. Scrapy project creation
scrapy startproject <project_name>
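Inside the project, genspider can scaffold a spider skeleton as well (the name and domain here are just examples):

scrapy genspider iot hnavi.co.jp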
Scrapy shell launch
scrapy shell https://hnavi.co.jp/search/iot
2020-12-29 18:46:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://hnavi.co.jp/search/iot> (referer: None)
2020-12-29 18:46:15 [asyncio] DEBUG: Using proactor: IocpProactor
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
[s]   item       {}
[s]   request    <GET https://hnavi.co.jp/search/iot>
[s]   response   <200 https://hnavi.co.jp/search/iot/>
[s]   settings   <scrapy.settings.Settings object at 0x...>
[s]   spider     <DefaultSpider 'default' at 0x...>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser
2020-12-29 18:46:15 [asyncio] DEBUG: Using proactor: IocpProactor
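From the shell you can poke at the page right away. For example (just a sanity check; the title text is whatever the page actually returns):

>>> response.status
200
>>> response.css('title::text').get()
# → the page title as a plain string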
Take a look at the HTML
<div class="page-skill__content__company__head">
<h3><a href="https://hnavi.co.jp/spa/02357/">DesignOne Japan Co., Ltd.</a></h3>
<div class="page-skill__content__company__head__review-count">
<i class="icon icon--bubble-20x19"></i>
<a href="https://hnavi.co.jp/spa/02357/review/">Word-of-mouth communication(3 cases)</a>
</div>
</div>
This gives me only the company name, but I want more details. Each link has the form https://hnavi.co.jp/spa/XXXX/. So first, extract the links: the target is the href attribute of the a tag inside the h3, under the div whose class is page-skill__content__company__head. Try it in the shell:
response.css('div .page-skill__content__company__head').css('h3 a::attr(href)')
result
[<Selector ... data='https://hnavi.co.jp/...'>, <Selector ... data='https://hnavi.co.jp/...'>, ...]
The data value of each Selector is the link, and it can be pulled out with the get() method.
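get() returns only the first match; getall() returns every match as a list of strings. A quick check in the shell with the selector above:

links = response.css('div .page-skill__content__company__head').css('h3 a::attr(href)')
links.get()      # first href only
links.getall()   # all hrefs on the page, as a list of strings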
iotSpider.py
import scrapy

class IotSpider(scrapy.Spider):
    name = 'hnav-iot'
    # Derive the search keyword ("iot") from the spider name
    k = name.split('-')[1]
    # Pages 1-4 of the search results, hard-coded for now
    start_urls = ['http://hnavi.co.jp/search/' + k,
                  'http://hnavi.co.jp/search/' + k + '/2',
                  'http://hnavi.co.jp/search/' + k + '/3',
                  'http://hnavi.co.jp/search/' + k + '/4']

    def parse(self, response):
        # Print the detail-page link for each company on the page
        for company in response.css('div .page-skill__content__company__head'):
            print(company.css('h3 a::attr(href)').get())
I hard-coded the page numbers rather than handling pagination (see the sketch at the end), and got the links as a first pass.
scrapy crawl hnav-iot --nolog
https://hnavi.co.jp/web/03779/
https://hnavi.co.jp/spa/02360/
https://hnavi.co.jp/spa/02426/
https://hnavi.co.jp/spa/02470/
https://hnavi.co.jp/spa/02445/
https://hnavi.co.jp/web/03648/
https://hnavi.co.jp/spa/02435/
https://hnavi.co.jp/web/03701/
https://hnavi.co.jp/spa/02290/
https://hnavi.co.jp/spa/02232/
https://hnavi.co.jp/spa/02357/
https://hnavi.co.jp/spa/02190/
https://hnavi.co.jp/ecommerce/01191/
https://hnavi.co.jp/spa/02440/
https://hnavi.co.jp/spa/02447/
https://hnavi.co.jp/ecommerce/01216/
https://hnavi.co.jp/web/03759/
https://hnavi.co.jp/spa/02442/
https://hnavi.co.jp/spa/02458/
https://hnavi.co.jp/spa/02351/
https://hnavi.co.jp/spa/02427/
https://hnavi.co.jp/web/03491/
https://hnavi.co.jp/spa/02341/
https://hnavi.co.jp/web/03498/
https://hnavi.co.jp/ecommerce/01204/
https://hnavi.co.jp/spa/02349/
https://hnavi.co.jp/spa/02446/
https://hnavi.co.jp/spa/02418/
https://hnavi.co.jp/spa/02448/
https://hnavi.co.jp/spa/02331/
https://hnavi.co.jp/spa/02452/
https://hnavi.co.jp/spa/02365/
https://hnavi.co.jp/spa/02413/
https://hnavi.co.jp/web/03529/
https://hnavi.co.jp/spa/02388/
https://hnavi.co.jp/spa/02309/
https://hnavi.co.jp/web/03752/
https://hnavi.co.jp/spa/02353/
Next, fetch each of the links above and extract the company details.
iotDetail.py
import scrapy

class IotDetailSpider(scrapy.Spider):
    name = 'iotDetail'
    start_urls = [
        'https://hnavi.co.jp/spa/02445/',  # Copy the collected links here
        'https://hnavi.co.jp/spa/02351/',
        'https://hnavi.co.jp/web/03701/',
        'https://hnavi.co.jp/web/03498/',
        'https://hnavi.co.jp/spa/02341/',
        'https://hnavi.co.jp/web/03648/',
        'https://hnavi.co.jp/spa/02349/',
        'https://hnavi.co.jp/spa/02418/',
        'https://hnavi.co.jp/spa/02309/',
        'https://hnavi.co.jp/spa/02360/',
        'https://hnavi.co.jp/spa/02458/',
        'https://hnavi.co.jp/ecommerce/01204/',
        'https://hnavi.co.jp/spa/02446/',
        'https://hnavi.co.jp/spa/02435/',
        'https://hnavi.co.jp/web/03491/',
        'https://hnavi.co.jp/web/03752/',
        'https://hnavi.co.jp/spa/02353/',
        'https://hnavi.co.jp/web/03759/',
        'https://hnavi.co.jp/spa/02331/',
        'https://hnavi.co.jp/spa/02448/',
        'https://hnavi.co.jp/spa/02365/',
        'https://hnavi.co.jp/spa/02452/',
        'https://hnavi.co.jp/web/03529/',
        'https://hnavi.co.jp/spa/02413/',
        'https://hnavi.co.jp/spa/02290/',
        'https://hnavi.co.jp/spa/02232/',
        'https://hnavi.co.jp/spa/02357/',
        'https://hnavi.co.jp/spa/02190/',
        'https://hnavi.co.jp/ecommerce/01191/',
        'https://hnavi.co.jp/web/03542/',
        'https://hnavi.co.jp/spa/02447/',
        'https://hnavi.co.jp/spa/02440/',
        'https://hnavi.co.jp/spa/02388/',
        'https://hnavi.co.jp/spa/02427/',
        'https://hnavi.co.jp/spa/02426/',
        'https://hnavi.co.jp/spa/02326/',
        'https://hnavi.co.jp/spa/02442/',
    ]

    def parse(self, response):
        # Company name comes from the page heading
        cp_name = response.css('h2.company-info__title::text').get(default='N/A')
        # Default every field so the print below never concatenates None
        cp_date = 'N/A'
        cp_capital = 'N/A'
        cp_ceo = 'N/A'
        cp_add = 'N/A'
        cp_hp = 'N/A'
        cp_person = 'N/A'
        cp_div = 'N/A'
        # Walk the rows of the table that follows the "About us" heading;
        # each row is a th (label) / td (value) pair
        for dtl in response.xpath('//h2[contains(text(), "About us")]/following-sibling::table').css('tr'):
            key = dtl.css('th::text').get()
            if key == 'Capital':
                cp_capital = dtl.css('td::text').get(default='N/A')
            elif key == 'Established':
                cp_date = dtl.css('td::text').get(default='N/A')
            elif key == 'number of employees':
                cp_person = dtl.css('td::text').get(default='N/A')
            elif key == 'CEO':
                cp_ceo = dtl.css('td::text').get(default='N/A')
            elif key == 'location':
                cp_add = dtl.css('td::text').get(default='N/A')
            elif key == 'home page':
                cp_hp = dtl.css('td a::attr(href)').get(default='N/A')
            elif key == 'Branch information':
                cp_div = dtl.css('td::text').get(default='N/A')
        # TODO: yield {cp_name: {...}} as an item instead of printing
        print('\t'.join([cp_name, cp_date, cp_capital, cp_ceo,
                         cp_person, cp_add, cp_div, cp_hp]))
Run
scrapy crawl iotDetail --nolog
result
Each company comes out as one tab-separated line: name, established, capital, CEO, employees, address, branches, homepage. To dump the result to a file (tab-separated, despite the .csv name):
scrapy crawl iotDetail --nolog > iotd.csv
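Alternatively, if parse() yields a dict instead of printing, Scrapy's built-in feed export writes a proper CSV without shell redirection (a sketch; the field names are my own choice):

        # at the end of parse(), instead of print():
        yield {
            'name': cp_name,
            'established': cp_date,
            'capital': cp_capital,
            'ceo': cp_ceo,
            'employees': cp_person,
            'address': cp_add,
            'branches': cp_div,
            'homepage': cp_hp,
        }

scrapy crawl iotDetail -o iotd.csv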
Next I plan to upgrade this: combine the two scripts into one, get rid of the hard-coded URL list, and follow the paging automatically; a rough sketch is below. I'll leave this post here as a memo.
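A minimal sketch of that combined spider, reusing the same selectors as above. The next-page selector (a.next) is a guess and would need checking against the site's actual markup:

import scrapy

class IotAllInOneSpider(scrapy.Spider):
    name = 'hnav-iot-all'
    start_urls = ['https://hnavi.co.jp/search/iot/']

    def parse(self, response):
        # Follow each company's detail link on the results page
        for href in response.css('div .page-skill__content__company__head h3 a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_detail)
        # Follow the next results page if there is one (hypothetical selector)
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        # Same field extraction as iotDetail.py; abbreviated to the name here
        yield {'name': response.css('h2.company-info__title::text').get()}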