[PYTHON] Collect data using scrapy and populate mongoDB

Google collects information for its search engine with a crawler called [Googlebot](https://ja.wikipedia.org/wiki/%E3%82%B0%E3%83%BC%E3%82%B0%E3%83%AB%E3%83%9C%E3%83%83%E3%83%88). Starting from a given website, it automatically follows that site's links and collects information.

You can do something similar with Python's Scrapy module. Let's use Scrapy to collect information from a site.

Preparation

Install Scrapy with pip: `$ pip install scrapy`
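To check that the install succeeded, `scrapy version` prints the installed version.

commandline


$ scrapy version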

How to use

Scrapy is managed on a per-project basis. After generating a project, edit the following automatically generated files:

  1. items.py: defines the data to extract
  2. spider (crawler) files under spiders/: crawl and data-extraction conditions
  3. pipelines.py: where the extracted data goes (this time, mongoDB)
  4. settings.py: crawl behavior (frequency, depth, and so on)

Creating a project

First, create a project with `$ scrapy startproject tutorial`. A folder structure like this will be created:

tutorial/


tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py
            ...

Definition of extracted data

Define what to extract. This corresponds to defining the fields in the database.

items.py


import scrapy

class WebItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    date = scrapy.Field()

Creating a spider

This is the centerpiece: the file that crawls the web and extracts data. It specifies the starting URL, the crawl conditions, and the data extraction conditions.

Spider generation

Create a spider. The syntax is `$ scrapy genspider [options] <name> <domain>`.

commandline


$ scrapy genspider webspider example.com
  Created spider 'webspider' using template 'basic' in module:
  tutorial.spiders.webspider

The generated file looks like this:

tutorial/spiders/webspider.py


# -*- coding: utf-8 -*-
import scrapy

class WebspiderSpider(scrapy.Spider):
    name = "webspider"   #The name in the project. Used to specify a spider when moving
    allowed_domains = ["exsample.com"] #Domain specification for patrol OK
    start_urls = (
        'http://www.exsample.com/', #This is the starting point. You can specify more than one in the list.
    )

    def parse(self, response):  #Here are the extraction conditions
        pass

That is the skeleton that gets generated. Edit it to suit your needs; here is an example:

tutorial/spiders/webspider.py


# -*- coding: utf-8 -*-
import scrapy
from tutorial.items import WebItem
import re
import datetime
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WebspiderSpider(CrawlSpider):  # the class name itself is not significant
    name = 'WebspiderSpider'  # this is what matters: the spider (crawler) is run under this name
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    xpath = {
        'title' : "//title/text()",
    }

    list_allow = [r'(Regular expressions)'] # links matching these patterns are followed (crawled)
    list_deny = [
                r'/example/hogehoge/hoge/', # example of links that will never be followed; list comprehensions can also be used to build these lists
            ]
    list_allow_parse = [r'(Regular expressions)']  # links to extract data from
    list_deny_parse = [                # links to exclude from data extraction
                r'(Regular expressions)',
                r'(Regular expressions)',
                ]

    rules = (
        # crawl rules
        Rule(LinkExtractor(
            allow=list_allow,
            deny=list_deny,
            ),
            follow=True # follow the links found on matched pages
        ),
        #Data extraction rules
        Rule(LinkExtractor(
            allow=list_allow_parse,
            deny=list_deny_parse,
            unique=True # do not extract data from the same URL twice
            ),
            callback='parse_items' # when a link matches, the extraction function named here is called
        ),
    )

    # data extraction function
    def parse_items(self, response): # response holds the fetched page
        item = WebItem()  # the item class defined in items.py
        item['title'] = response.xpath(self.xpath['title']).extract()[0]
        item['link'] = response.url
        item['date'] = datetime.datetime.utcnow() + datetime.timedelta(hours=9) # current time, stored as Japan time (UTC+9)

        yield item

See the comments for what each part does.
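The '(Regular expressions)' placeholders are up to you. As a purely hypothetical illustration (the URL patterns below are invented and not taken from any real site), the lists might look like this:

tutorial/spiders/webspider.py


    list_allow = [r'/category/\d+/']                      # follow category index pages
    list_deny = [r'/example/hogehoge/hoge/']              # never follow this path
    list_allow_parse = [r'/articles/\d+\.html$']          # extract data from article pages
    list_deny_parse = [r'/articles/draft/', r'\?print=']  # but skip drafts and print views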

Editing pipelines.py

Push the items yielded by the spider above into mongoDB.

pipelines.py


from pymongo import MongoClient  #Connection with mongoDB
import datetime

class TutorialPipeline(object):

    collection_name = 'scrapy_items'  # mongoDB collection to write to (any name; referenced in process_item below)

    def __init__(self, mongo_uri, mongo_db, mongolab_user, mongolab_pass):
        #Variable initialization with the arguments passed when creating the instance
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.mongolab_user = mongolab_user
        self.mongolab_pass = mongolab_pass

    @classmethod  # the class itself is passed in, so class attributes are accessible
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'), # read the variables defined in settings.py
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
            mongolab_user=crawler.settings.get('MONGOLAB_USER'),
            mongolab_pass=crawler.settings.get('MONGOLAB_PASS')
        ) # these become the arguments of __init__

    def open_spider(self, spider): #Executed when the spider starts. Database connection
        self.client = MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        self.db.authenticate(self.mongolab_user, self.mongolab_pass)

    def close_spider(self, spider): #Executed at the end of the spider. Close database connection
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].update(
            {u'link': item['link']},
            {"$set": dict(item)},
            upsert = True
        ) # look up by link: insert if it does not exist, update if it does (upsert)

        return item

It looks like a lot of code, but all it does is open the database connection, insert the data, and close the connection when finished.
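Incidentally, update() is deprecated in pymongo 3 and later; the same upsert can be written with update_one(). A minimal sketch of an equivalent process_item, assuming pymongo 3+:

pipelines.py


    def process_item(self, item, spider):
        # upsert via the non-deprecated API
        self.db[self.collection_name].update_one(
            {'link': item['link']},
            {'$set': dict(item)},
            upsert=True
        )
        return item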

Editing settings.py

First, define the variables referenced in pipelines.py.

settings.py


MONGO_URI = 'hogehoge.mongolab.com:(port number)'
MONGO_DATABASE = 'database_name'
MONGOLAB_USER = 'user_name'
MONGOLAB_PASS = 'password'
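
These values are read by from_crawler(). In addition, Scrapy only runs pipelines that are registered in ITEM_PIPELINES, so the pipeline class also needs an entry in settings.py (the number just sets the execution order):

settings.py


ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}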

The MONGO_* connection values are an example for mongolab. settings.py also specifies how the crawler behaves.

settings.py


REDIRECT_MAX_TIMES = 6
RETRY_ENABLED = False
DOWNLOAD_DELAY=10
COOKIES_ENABLED=False

These settings cap redirects at 6, disable retries, wait 10 seconds between requests, and do not save cookies. If you do not set DOWNLOAD_DELAY, Scrapy will request pages as fast as it can, which puts a heavy load on the target site, so be sure to set it.
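The crawl depth (the "hierarchy" mentioned earlier) can also be limited here. DEPTH_LIMIT and ROBOTSTXT_OBEY are standard Scrapy settings; the values below are just example choices, not from the original article.

settings.py


DEPTH_LIMIT = 3        # do not follow links more than 3 hops from the start URLs
ROBOTSTXT_OBEY = True  # respect the target site's robots.txt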

Run

Let's run it.

commandline


$ scrapy crawl WebspiderSpider

The spider follows links one after another, and data is extracted from the pages whose URLs match the conditions.
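Once the crawl finishes, you can check what was stored with a few lines of pymongo. A minimal sketch, assuming the same connection values as in settings.py and the scrapy_items collection used by the pipeline:


from pymongo import MongoClient

# Same connection values as in settings.py (fill in your own host, port, and credentials).
client = MongoClient('hogehoge.mongolab.com:(port number)')
db = client['database_name']
db.authenticate('user_name', 'password')

# Show a few of the stored items.
for doc in db['scrapy_items'].find().limit(5):
    print(doc['title'])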
