Google gathers information for its search engine with [Googlebot](https://ja.wikipedia.org/wiki/%E3%82%B0%E3%83%BC%E3%82%B0%E3%83%AB%E3%83%9C%E3%83%83%E3%83%88). Starting from a given website, it automatically follows that site's links and collects information.
You can do something similar with Python's Scrapy framework. Let's use Scrapy to collect information from a site.
Install Scrapy with pip.

```
$ pip install scrapy
```
Scrapy is managed on a per-project basis. After generating a project, you edit the files that are automatically created inside it.
First, create a project.
```
$ scrapy startproject tutorial
```
A folder structure like this will be created:
```
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
            ...
```
Define what you want to extract. These are essentially the field definitions for your database.
items.py

```python
import scrapy


class WebItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    date = scrapy.Field()
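```

Nothing more is needed here, but it helps to know that a `scrapy.Item` behaves like a dict; that is how the spider will fill it in later. A minimal sketch (the values are placeholders):

```python
from tutorial.items import WebItem

item = WebItem()
item['title'] = 'some page title'         # fields are set with dict-style access
item['link'] = 'http://www.example.com/'
print(dict(item))                         # an Item converts to a plain dict, handy for storage
```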
The spider is the centerpiece: the file that crawls the web and extracts the data. In it you specify the start URLs, the crawl (link-following) rules, and the extraction rules.
Make a spider. The syntax is `$ scrapy genspider [options] <name> <domain>`.

commandline

```
$ scrapy genspider webspider exsample.com
Created spider 'webspider' using template 'basic' in module:
  tutorial.spiders.webspider
```
The generated file looks like this:

tutorial/spiders/webspider.py
```python
# -*- coding: utf-8 -*-
import scrapy


class WebspiderSpider(scrapy.Spider):
    name = "webspider"                  # name within the project; used to specify which spider to run
    allowed_domains = ["exsample.com"]  # domains the spider is allowed to crawl
    start_urls = (
        'http://www.exsample.com/',     # starting point; more than one can be listed
    )

    def parse(self, response):          # extraction logic goes here
        pass
```
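At this point the spider can already extract something on its own. A minimal sketch of a `parse()` implementation (the XPath is an assumption; `exsample.com` is the placeholder domain from the template above):

```python
# -*- coding: utf-8 -*-
import scrapy


class WebspiderSpider(scrapy.Spider):
    name = "webspider"
    allowed_domains = ["exsample.com"]
    start_urls = ('http://www.exsample.com/',)

    def parse(self, response):
        # grab the page title and yield it as a plain dict
        yield {
            'title': response.xpath("//title/text()").extract_first(),
            'link': response.url,
        }
```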
Change the generated spider to your liking. Here it is rewritten as a `CrawlSpider`:

tutorial/spiders/webspider.py
```python
# -*- coding: utf-8 -*-
import scrapy
from tutorial.items import WebItem
import re
import datetime
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class WebspiderSpider(CrawlSpider):  # the class name itself does not matter
    name = 'WebspiderSpider'  # this is what matters: the spider (crawler) is run by this name
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    xpath = {
        'title': "//title/text()",
    }

    list_allow = [r'(regular expression)']  # links matching these patterns are followed
    list_deny = [
        r'/exsample/hogehoge/hoge/',  # example of a link pattern that will not be followed; list comprehensions can also be used here
    ]
    list_allow_parse = [r'(regular expression)']  # link patterns whose pages have data extracted
    list_deny_parse = [                           # link patterns whose pages are not parsed
        r'(regular expression)',
        r'(regular expression)',
    ]

    rules = (
        # crawl rule: which links to follow
        Rule(LinkExtractor(
            allow=list_allow,
            deny=list_deny,
            ),
            follow=True  # follow the matched links
        ),
        # extraction rule: which pages to parse
        Rule(LinkExtractor(
            allow=list_allow_parse,
            deny=list_deny_parse,
            unique=True  # do not extract the same link more than once
            ),
            callback='parse_items'  # when a page matches, the function named here is called
        ),
    )

    # data-extraction function
    def parse_items(self, response):  # response holds the fetched page
        item = WebItem()  # the item class defined in items.py
        item['title'] = response.xpath(self.xpath['title']).extract()[0]
        item['link'] = response.url
        item['date'] = datetime.datetime.utcnow() + datetime.timedelta(hours=9)  # current time, shifted to Japan time (JST)
        yield item
```
See the comments for what each part does.
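To check the XPath expressions before running a full crawl, Scrapy's interactive shell is handy (the URL here is just a placeholder):

commandline

```
$ scrapy shell 'http://www.example.com/'
>>> response.xpath("//title/text()").extract()
```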
Next, the items yielded by the spider above are pushed into MongoDB by an item pipeline.
pipelines.py

```python
from pymongo import MongoClient  # client for connecting to MongoDB
import datetime


class TutorialPipeline(object):

    collection_name = 'scraped_items'  # collection to store items in (any name you like)

    def __init__(self, mongo_uri, mongo_db, mongolab_user, mongolab_pass):
        # initialize attributes with the arguments passed when the instance is created
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.mongolab_user = mongolab_user
        self.mongolab_pass = mongolab_pass

    @classmethod  # receives the class itself, so it can build the pipeline instance
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),  # read the variables defined in settings.py
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
            mongolab_user=crawler.settings.get('MONGOLAB_USER'),
            mongolab_pass=crawler.settings.get('MONGOLAB_PASS')
        )  # these become the arguments of __init__

    def open_spider(self, spider):  # called when the spider starts: open the database connection
        self.client = MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        self.db.authenticate(self.mongolab_user, self.mongolab_pass)

    def close_spider(self, spider):  # called when the spider finishes: close the database connection
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].update(
            {u'link': item['link']},
            {"$set": dict(item)},
            upsert=True
        )  # look up by link: insert if missing, update if present
        return item
```
That is a fair amount of code, but the gist is simple: open the database, insert the data, and close the connection when done.
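Note that a pipeline only runs if it is enabled in settings.py via `ITEM_PIPELINES`; a minimal sketch, assuming the tutorial project layout generated above:

```python
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,  # lower numbers run earlier in the pipeline chain
}
```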
Next, in settings.py, define the variables referenced in pipelines.py.
settings.py

```python
MONGO_URI = 'hogehoge.mongolab.com:(port number)'
MONGO_DATABASE = 'database_name'
MONGOLAB_USER = 'user_name'
MONGOLAB_PASS = 'password'
```
This example uses mongolab. settings.py is also where you configure the crawler's behavior.
settings.py

```python
REDIRECT_MAX_TIMES = 6
RETRY_ENABLED = False
DOWNLOAD_DELAY = 10
COOKIES_ENABLED = False
```
Here, the maximum number of redirects is set to 6, retries are disabled, the site is accessed only once every 10 seconds, and cookies are not stored.
If you do not set DOWNLOAD_DELAY, the crawler will hit the target site at full speed and put a heavy load on it, so don't skip it.
Let's run it.
commandline
```
$ scrapy crawl WebspiderSpider
```
The crawler will follow links one after another, and data will be extracted from the pages that match the extraction rules.
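To confirm that items actually reached MongoDB, a quick check with pymongo works; a minimal sketch (the URI, credentials, database, and collection names are the placeholders used in settings.py and the pipeline above):

```python
from pymongo import MongoClient

# use the same values as MONGO_URI / MONGO_DATABASE / MONGOLAB_USER / MONGOLAB_PASS
client = MongoClient('hogehoge.mongolab.com:(port number)')
db = client['database_name']
db.authenticate('user_name', 'password')

# 'scraped_items' is the collection name chosen in the pipeline above
for doc in db['scraped_items'].find().limit(5):
    print(doc['title'], doc['link'])
```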