[PYTHON] Save Scrapy crawl results to Google Cloud Datastore

This post explains how to save items crawled with Scrapy to Google Cloud Datastore. I ran into a few pitfalls along the way, so I have summarized them here.

What I want to do

Save the items collected on Scrapy Cloud to Google Cloud Datastore.

[Troublesome point 1] Google Cloud Platform authentication

gcloud provides the auth command group for authentication: https://cloud.google.com/sdk/gcloud/reference/auth/ However, you cannot run this command on Scrapy Cloud.

So authenticate with a service account key JSON file instead. You can download the JSON file from the settings screen shown below.

Screenshot from 2017-03-14 00-50-21.png
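Before wiring the key into the crawler, it can help to confirm that the downloaded file really is a service account key. The following is a minimal sketch, not part of the original post: the helper name `use_service_account_key` is hypothetical, and it relies only on the `type` and `project_id` fields that downloaded service account keys contain.

```python
import json
import os


def use_service_account_key(key_path):
    """Validate a service account key file and export its path.

    Hypothetical helper: it sets GOOGLE_APPLICATION_CREDENTIALS, which the
    Google client libraries read to find their credentials.
    """
    key_path = os.path.abspath(key_path)
    with open(key_path) as f:
        info = json.load(f)
    # Downloaded service account keys carry "type": "service_account".
    if info.get("type") != "service_account":
        raise ValueError("%s is not a service account key" % key_path)
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = key_path
    # Return the project ID stored in the key, if any.
    return info.get("project_id")
```

Failing early here gives a clearer error than an authentication failure in the middle of a crawl.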

[Troublesome point 2] Specify the JSON path in an environment variable

Writing the pipeline like this also lets you run the crawler locally.

`pipeline.py`


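import os
import time
from threading import Lock

from google.cloud import datastore


class HogePipeline(object):
    def __init__(self):
        # Point the Google client library at the key file that sits next to this module.
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.join(
            os.path.dirname(__file__), "hogehogehoge.json")
        # The client reads GOOGLE_APPLICATION_CREDENTIALS when it is created.
        self.g_client = datastore.Client('hoge-project')

    def process_item(self, item, spider):
        # put: write the item to Datastore here
        return item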
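The put step is left out above. As a rough sketch of what it might look like (not the author's actual code), the pipeline below copies each item into a Datastore entity. The kind name `CrawledItem`, the class name, and the `item_to_properties` helper are hypothetical, and dropping `None` values is just one possible design choice. The `google.cloud` import is deferred to `__init__` so the module still loads where the library is not installed.

```python
def item_to_properties(item):
    """Convert a Scrapy item (or any mapping) to a plain dict, dropping None values."""
    return {k: v for k, v in dict(item).items() if v is not None}


class DatastoreWriterPipeline(object):
    kind = "CrawledItem"  # hypothetical kind name, not from the original post

    def __init__(self, project="hoge-project"):
        # Deferred import: only needed when the pipeline is instantiated.
        from google.cloud import datastore
        self._datastore = datastore
        self.client = datastore.Client(project)

    def process_item(self, item, spider):
        # An incomplete key (kind only) lets Datastore assign the numeric ID.
        key = self.client.key(self.kind)
        entity = self._datastore.Entity(key=key)
        entity.update(item_to_properties(item))
        self.client.put(entity)
        return item
```

This assumes GOOGLE_APPLICATION_CREDENTIALS has already been set, as in the pipeline above.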

[Troublesome point 3] Deploy with the JSON file included

`MANIFEST.in`


include path/to/hogehogehoge.json

`setup.py`



from setuptools import setup, find_packages

setup(
    name         = 'project',
    version      = '1.0',
    packages     = find_packages(),
    entry_points = {'scrapy': ['settings = hoge.settings']},
    install_requires = [],
    include_package_data = True
)

Deployment commands

$ python setup.py bdist_egg
$ shub deploy --egg dist/project-1.0-py2.7.egg
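Between the build and the deploy, it may be worth checking that the key file actually ended up in the egg. An egg built by bdist_egg is a zip archive, so a small sketch with the standard library is enough. The function name `egg_contains` is hypothetical.

```python
import zipfile


def egg_contains(egg_path, filename):
    """Return True if the egg (a zip archive) has a member named `filename` in any directory."""
    with zipfile.ZipFile(egg_path) as egg:
        # Zip member names always use forward slashes.
        return any(name.split("/")[-1] == filename for name in egg.namelist())
```

For example, `egg_contains("dist/project-1.0-py2.7.egg", "hogehogehoge.json")` should be true before running shub deploy. If it is not, check the path in MANIFEST.in and that the file lives inside a package found by find_packages().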
