Someone at my company asked me to find out which companies are doing IoT. Searching for "IoT company" turns up all sorts of information, and I eventually arrived at https://hnavi.co.jp (order navigation). Company name, capital, address, and homepage should be about the right amount of information. Instead of using bs4, I'll try Scrapy this time. (Environment: Python 3.8 + VS Code)
pip3 install scrapy
When installing Twisted, it complained that the Visual C++ 14 build tools for Windows were missing, so I installed those and the install went through OK.

2. Scrapy project creation
scrapy startproject <project_name>
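Inside the project, genspider can scaffold a spider skeleton as well (the name and domain here are just examples):

scrapy genspider iot hnavi.co.jp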
Scrapy shell launch
scrapy shell https://hnavi.co.jp/search/iot
2020-12-29 18:46:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://hnavi.co.jp/search/iot> (referer: None)
2020-12-29 18:46:15 [asyncio] DEBUG: Using proactor: IocpProactor
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
[s]   item       {}
[s]   request    <GET https://hnavi.co.jp/search/iot>
[s]   response   <200 https://hnavi.co.jp/search/iot/>
[s]   settings   <scrapy.settings.Settings object at 0x...>
[s]   spider     <DefaultSpider 'default' at 0x...>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser
2020-12-29 18:46:15 [asyncio] DEBUG: Using proactor: IocpProactor
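From the shell you can poke at the page right away. For example (just a sanity check; the title text is whatever the page actually returns):

>>> response.status
200
>>> response.css('title::text').get()
# → the page title as a plain string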
Take a look at the HTML
<div class="page-skill__content__company__head">
<h3><a href="https://hnavi.co.jp/spa/02357/">DesignOne Japan Co., Ltd.</a></h3>
<div class="page-skill__content__company__head__review-count">
<i class="icon icon--bubble-20x19"></i>
<a href="https://hnavi.co.jp/spa/02357/review/">Word-of-mouth communication(3 cases)</a>
</div>
</div>
This gives me only the company name, but I want more details. Each link has the form https://hnavi.co.jp/spa/XXXX/. So first, extract the links: the target is the href attribute of the a tag inside the h3, under the div whose class is page-skill__content__company__head. Try it in the shell:
response.css('div .page-skill__content__company__head').css('h3 a::attr(href)')
result
[<Selector ... data='https://hnavi.co.jp/...'>, <Selector ... data='https://hnavi.co.jp/...'>, ...]
The data value of each Selector is the link, and it can be pulled out with the get() method.
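get() returns only the first match; getall() returns every match as a list of strings. A quick check in the shell with the selector above:

links = response.css('div .page-skill__content__company__head').css('h3 a::attr(href)')
links.get()      # first href only
links.getall()   # all hrefs on the page, as a list of strings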
iotSpider.py
import scrapy

class IotSpider(scrapy.Spider):
    name = 'hnav-iot'
    # Derive the search keyword ("iot") from the spider name
    k = name.split('-')[1]
    # Pages 1-4 of the search results, hard-coded for now
    start_urls = ['http://hnavi.co.jp/search/' + k,
                  'http://hnavi.co.jp/search/' + k + '/2',
                  'http://hnavi.co.jp/search/' + k + '/3',
                  'http://hnavi.co.jp/search/' + k + '/4']

    def parse(self, response):
        # Print the detail-page link for each company on the page
        for company in response.css('div .page-skill__content__company__head'):
            print(company.css('h3 a::attr(href)').get())
I hard-coded the page numbers rather than handling pagination (see the sketch at the end), and got the links as a first pass.
scrapy crawl hnav-iot --nolog
https://hnavi.co.jp/web/03779/
https://hnavi.co.jp/spa/02360/
https://hnavi.co.jp/spa/02426/
https://hnavi.co.jp/spa/02470/
https://hnavi.co.jp/spa/02445/
https://hnavi.co.jp/web/03648/
https://hnavi.co.jp/spa/02435/
https://hnavi.co.jp/web/03701/
https://hnavi.co.jp/spa/02290/
https://hnavi.co.jp/spa/02232/
https://hnavi.co.jp/spa/02357/
https://hnavi.co.jp/spa/02190/
https://hnavi.co.jp/ecommerce/01191/
https://hnavi.co.jp/spa/02440/
https://hnavi.co.jp/spa/02447/
https://hnavi.co.jp/ecommerce/01216/
https://hnavi.co.jp/web/03759/
https://hnavi.co.jp/spa/02442/
https://hnavi.co.jp/spa/02458/
https://hnavi.co.jp/spa/02351/
https://hnavi.co.jp/spa/02427/
https://hnavi.co.jp/web/03491/
https://hnavi.co.jp/spa/02341/
https://hnavi.co.jp/web/03498/
https://hnavi.co.jp/ecommerce/01204/
https://hnavi.co.jp/spa/02349/
https://hnavi.co.jp/spa/02446/
https://hnavi.co.jp/spa/02418/
https://hnavi.co.jp/spa/02448/
https://hnavi.co.jp/spa/02331/
https://hnavi.co.jp/spa/02452/
https://hnavi.co.jp/spa/02365/
https://hnavi.co.jp/spa/02413/
https://hnavi.co.jp/web/03529/
https://hnavi.co.jp/spa/02388/
https://hnavi.co.jp/spa/02309/
https://hnavi.co.jp/web/03752/
https://hnavi.co.jp/spa/02353/
Next, fetch each of the links above and extract the company details.
iotDetail.py
import scrapy

class IotDetailSpider(scrapy.Spider):
    name = 'iotDetail'
    start_urls = [
        'https://hnavi.co.jp/spa/02445/',  # Copy the collected links here
        'https://hnavi.co.jp/spa/02351/',
        'https://hnavi.co.jp/web/03701/',
        'https://hnavi.co.jp/web/03498/',
        'https://hnavi.co.jp/spa/02341/',
        'https://hnavi.co.jp/web/03648/',
        'https://hnavi.co.jp/spa/02349/',
        'https://hnavi.co.jp/spa/02418/',
        'https://hnavi.co.jp/spa/02309/',
        'https://hnavi.co.jp/spa/02360/',
        'https://hnavi.co.jp/spa/02458/',
        'https://hnavi.co.jp/ecommerce/01204/',
        'https://hnavi.co.jp/spa/02446/',
        'https://hnavi.co.jp/spa/02435/',
        'https://hnavi.co.jp/web/03491/',
        'https://hnavi.co.jp/web/03752/',
        'https://hnavi.co.jp/spa/02353/',
        'https://hnavi.co.jp/web/03759/',
        'https://hnavi.co.jp/spa/02331/',
        'https://hnavi.co.jp/spa/02448/',
        'https://hnavi.co.jp/spa/02365/',
        'https://hnavi.co.jp/spa/02452/',
        'https://hnavi.co.jp/web/03529/',
        'https://hnavi.co.jp/spa/02413/',
        'https://hnavi.co.jp/spa/02290/',
        'https://hnavi.co.jp/spa/02232/',
        'https://hnavi.co.jp/spa/02357/',
        'https://hnavi.co.jp/spa/02190/',
        'https://hnavi.co.jp/ecommerce/01191/',
        'https://hnavi.co.jp/web/03542/',
        'https://hnavi.co.jp/spa/02447/',
        'https://hnavi.co.jp/spa/02440/',
        'https://hnavi.co.jp/spa/02388/',
        'https://hnavi.co.jp/spa/02427/',
        'https://hnavi.co.jp/spa/02426/',
        'https://hnavi.co.jp/spa/02326/',
        'https://hnavi.co.jp/spa/02442/',
    ]

    def parse(self, response):
        # Company name comes from the page heading
        cp_name = response.css('h2.company-info__title::text').get(default='N/A')
        # Default every field so the print below never concatenates None
        cp_date = 'N/A'
        cp_capital = 'N/A'
        cp_ceo = 'N/A'
        cp_add = 'N/A'
        cp_hp = 'N/A'
        cp_person = 'N/A'
        cp_div = 'N/A'
        # Walk the rows of the table that follows the "About us" heading;
        # each row is a th (label) / td (value) pair
        for dtl in response.xpath('//h2[contains(text(), "About us")]/following-sibling::table').css('tr'):
            key = dtl.css('th::text').get()
            if key == 'Capital':
                cp_capital = dtl.css('td::text').get(default='N/A')
            elif key == 'Established':
                cp_date = dtl.css('td::text').get(default='N/A')
            elif key == 'number of employees':
                cp_person = dtl.css('td::text').get(default='N/A')
            elif key == 'CEO':
                cp_ceo = dtl.css('td::text').get(default='N/A')
            elif key == 'location':
                cp_add = dtl.css('td::text').get(default='N/A')
            elif key == 'home page':
                cp_hp = dtl.css('td a::attr(href)').get(default='N/A')
            elif key == 'Branch information':
                cp_div = dtl.css('td::text').get(default='N/A')
        # TODO: yield {cp_name: {...}} as an item instead of printing
        print('\t'.join([cp_name, cp_date, cp_capital, cp_ceo,
                         cp_person, cp_add, cp_div, cp_hp]))
Run
scrapy crawl iotDetail --nolog
result
Each company comes out as one tab-separated line: name, established, capital, CEO, employees, address, branches, homepage. To dump the result to a file (tab-separated, despite the .csv name):
scrapy crawl iotDetail --nolog > iotd.csv
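Alternatively, if parse() yields a dict instead of printing, Scrapy's built-in feed export writes a proper CSV without shell redirection (a sketch; the field names are my own choice):

        # at the end of parse(), instead of print():
        yield {
            'name': cp_name,
            'established': cp_date,
            'capital': cp_capital,
            'ceo': cp_ceo,
            'employees': cp_person,
            'address': cp_add,
            'branches': cp_div,
            'homepage': cp_hp,
        }

scrapy crawl iotDetail -o iotd.csv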
Next I plan to upgrade this: combine the two scripts into one, get rid of the hard-coded URL list, and follow the paging automatically; a rough sketch is below. I'll leave this post here as a memo.
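A minimal sketch of that combined spider, reusing the same selectors as above. The next-page selector (a.next) is a guess and would need checking against the site's actual markup:

import scrapy

class IotAllInOneSpider(scrapy.Spider):
    name = 'hnav-iot-all'
    start_urls = ['https://hnavi.co.jp/search/iot/']

    def parse(self, response):
        # Follow each company's detail link on the results page
        for href in response.css('div .page-skill__content__company__head h3 a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_detail)
        # Follow the next results page if there is one (hypothetical selector)
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        # Same field extraction as iotDetail.py; abbreviated to the name here
        yield {'name': response.css('h2.company-info__title::text').get()}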