[PYTHON] Get data from your website on a regular basis using ScraperWiki

With the "ScraperWiki" service, you can run web scraping on a regular basis without having to rent a server yourself.

Features of ScraperWiki

Screenshots

[Screenshot: script edit screen]

[Screenshot: DB]

Scraper script sample

#!/usr/bin/env python
import scraperwiki
import lxml.html

url = "http://target.website.hoge/index.html"  # Target site to scrape
html = scraperwiki.scrape(url)  # Fetch the HTML document
root = lxml.html.fromstring(html)  # Get the root element object

data = []
# Extract elements with a CSS selector and number them in document order
for i, el in enumerate(root.cssselect("#hoge_contents > li > span")):
    data.append({'id': i, 'text': el.text})  # Save the text of each extracted element

print(repr(data))  # Output the saved data to the console


# Saving data:
unique_keys = ['id']  # Specify the unique key
scraperwiki.sql.save(unique_keys, data)  # Save to the DB
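The unique key matters because the scraper runs repeatedly: each scheduled run should update the existing rows rather than pile up duplicates. The sketch below illustrates that idea using only Python's standard sqlite3 module. It is not ScraperWiki's implementation, and it assumes that saving a row whose unique-key value already exists overwrites the stored row. The helper name save_rows and the table name scraped_items are hypothetical, chosen only for this illustration.

```python
import sqlite3


def save_rows(conn, unique_keys, rows, table="scraped_items"):
    """Hypothetical helper: insert rows, replacing any row with the same unique key."""
    columns = sorted(rows[0])
    col_list = ", ".join('"{}"'.format(c) for c in columns)
    key_list = ", ".join('"{}"'.format(k) for k in unique_keys)
    placeholders = ", ".join("?" for _ in columns)
    # The UNIQUE constraint is what lets INSERT OR REPLACE detect an existing row.
    conn.execute(
        'CREATE TABLE IF NOT EXISTS "{}" ({}, UNIQUE ({}))'.format(
            table, col_list, key_list
        )
    )
    conn.executemany(
        'INSERT OR REPLACE INTO "{}" ({}) VALUES ({})'.format(
            table, col_list, placeholders
        ),
        [tuple(row[c] for c in columns) for row in rows],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")

# First scheduled run
save_rows(conn, ["id"], [{"id": 0, "text": "first"}, {"id": 1, "text": "second"}])

# A later run: the page text for id 0 has changed
save_rows(conn, ["id"], [{"id": 0, "text": "first (updated)"}, {"id": 1, "text": "second"}])

rows = conn.execute('SELECT id, text FROM scraped_items ORDER BY id').fetchall()
print(rows)
```

After both runs the table still holds two rows, with the row for id 0 carrying the newer text. Note that a key based on list position, as in the sample script, only stays meaningful while the order of elements on the target page is stable.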

An example in actual use: http://shimz.me/blog/d3-js/3353
