Python scraping notes

These are notes on tools that are useful when scraping the web with Python.

requests -- Get data from the web

The easiest way to access the web in Python is to use requests. You can install it with pip. For GET and POST, using requests.get and requests.post is generally sufficient.

Installation

$ pip install requests

See the documentation for details: http://requests-docs-ja.readthedocs.org/en/latest/
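
As a rough sketch of the GET usage described above, the following assumes a placeholder URL (https://example.com/search) and a hypothetical helper named fetch_html, neither of which comes from the original notes. The prepared request only shows the URL that requests would build from params; nothing is sent over the network unless fetch_html is actually called.

```python
import requests

# Placeholder URL, used only for illustration.
SEARCH_URL = "https://example.com/search"


def fetch_html(url, params=None, timeout=10):
    """Hypothetical helper: GET a page and return its body as text."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


# Build (but do not send) a request to see the final URL with its query string.
prepared = requests.Request("GET", SEARCH_URL, params={"q": "python"}).prepare()
print(prepared.url)
```

For POST, requests.post takes a data= or json= argument for the request body in much the same way.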

BeautifulSoup4 -- Parsing HTML

BeautifulSoup4 is a good way to parse HTML.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div><h1 id="test">TEST</h1></div>', 'html.parser')
>>> soup.select_one('div h1#test').text
'TEST'

The text inside a tag is available as tag.text, and an attribute can be accessed with tag['id'] (where id is the attribute name).

Frequently used methods of BeautifulSoup object

-- BeautifulSoup.find() -> Search for tags and return the first matching tag
-- BeautifulSoup.find_all() -> Search for tags and return a list of matching tags
-- BeautifulSoup.find_previous() -> Return the previous matching tag
-- BeautifulSoup.find_next() -> Return the next matching tag
-- BeautifulSoup.find_parent() -> Return the parent tag
-- BeautifulSoup.select() -> Search with a CSS selector and return a list of matching tags
-- BeautifulSoup.select_one() -> Search with a CSS selector and return the first matching tag
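
The methods listed above can be tried on a small document. This is only a sketch: the HTML snippet, the ids, and the variable names are made up for illustration, and it assumes the built-in 'html.parser' backend.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only to exercise the methods.
html = """
<div id="menu">
  <h1 id="title">TEST</h1>
  <a href="/first">First</a>
  <a href="/second">Second</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1")                  # first matching tag
links = soup.find_all("a")                 # list of all matching tags
hrefs = [a["href"] for a in links]         # attribute access by name
next_link = heading.find_next("a")         # next <a> after the heading
prev_link = links[1].find_previous("a")    # previous <a> before the second link
parent_id = heading.find_parent("div")["id"]
selected = [a.text for a in soup.select("div#menu a")]  # CSS selector search

print(heading.text, hrefs, parent_id, selected)
```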

See the documentation for details: http://kondou.com/BS4/

Data persistence

CSV format

CSV is a comma-separated text file format. You can read and write it with the standard csv module. See the csv module documentation for details: http://docs.python.jp/3.4/library/csv.html

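Writing

import csv
# In Python 3, open the file in text mode with newline='' so the csv module controls line endings.
with open('some.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(someiterable)  # someiterable: an iterable of rows, e.g. a list of lists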

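Reading

import csv
# In Python 3, open the file in text mode with newline='' rather than 'rb'.
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)  # each row is a list of strings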
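
The writing and reading steps can be combined into a round trip. The sketch below assumes Python 3, made-up sample rows, and a temporary directory so that it does not overwrite any real file. Note that the csv module returns every field as a string, so numbers have to be converted back by hand.

```python
import csv
import os
import tempfile

# Made-up sample data: a header row followed by one record.
rows = [["title", "price"], ["Example", 100]]

path = os.path.join(tempfile.mkdtemp(), "some.csv")

# newline='' lets the csv module handle line endings itself.
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

with open(path, newline="", encoding="utf-8") as f:
    loaded = list(csv.reader(f))

print(loaded)  # the integer 100 comes back as the string '100'
```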

JSON format

The JSON format is also commonly used. Use the standard json module.

>>> import json
>>> json.dumps([1, 2, 3, 4])
'[1, 2, 3, 4]'
>>> json.loads('[1, 2, 3, 4]')
[1, 2, 3, 4]
>>> json.dumps({'aho': 1, 'ajo': 2})
'{"aho": 1, "ajo": 2}'
>>> json.loads('{"aho": 1, "ajo": 2}')
{'aho': 1, 'ajo': 2}

-- json.dumps() -> Convert an object to a JSON string
-- json.loads() -> Convert a JSON string to an object
-- json.dump() -> Convert an object to a JSON string and write it to a file
-- json.load() -> Read the JSON string in a file and convert it to an object
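
The file-based functions json.dump and json.load work the same way as their string counterparts. A minimal sketch, assuming a temporary file path and sample data made up for illustration:

```python
import json
import os
import tempfile

# Sample data for illustration.
data = {"aho": 1, "ajo": 2}

path = os.path.join(tempfile.mkdtemp(), "some.json")

# json.dump writes the object to an open text file.
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# json.load reads it back into Python objects.
with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored == data)
```

Passing ensure_ascii=False keeps non-ASCII text such as Japanese readable in the file instead of writing it as \u escapes.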

See the documentation for details: http://docs.python.jp/3.4/library/json.html

Samples

I have prepared some scraping samples; please refer to them. However, they target public sites, so please do not hammer those sites with requests. Whatever you do, do not fire requests in a tight loop, not even by mistake.
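
One way to avoid flooding a site is to pause between requests. The following is a minimal sketch, assuming a hypothetical helper named fetch_politely and a one-second default delay (both are illustrative choices, not taken from the samples). The fetch argument stands in for whatever function actually downloads a page, such as a wrapper around requests.get.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Hypothetical helper: call fetch(url) for each URL, pausing between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Stand-in for a real download function, so the sketch runs without network access.
def fake_fetch(url):
    return "<html>" + url + "</html>"


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(len(pages))
```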

-- Extract tutorial information from PyConJP https://github.com/TakesxiSximada/happy-scraping/tree/master/pycon.jp
-- Extract new package information from PyPI https://github.com/TakesxiSximada/happy-scraping/tree/master/pypi.python.org
-- Break through Django's Admin site authentication https://github.com/TakesxiSximada/happy-scraping/tree/master/djangoadmin
-- User-Agent spoofing https://github.com/TakesxiSximada/happy-scraping/tree/master/fake-useragent
-- Extract data dynamically generated by JavaScript https://github.com/TakesxiSximada/happy-scraping/tree/master/dynamic-page

Sites that look interesting to collect data from

-- https://teratail.com/ It might be fun to harvest the entries on the top page.
-- http://isitchristmas.com/ Tells you whether today is Christmas (timely).
-- https://data.nasa.gov/developer NASA data is available, so it may be interesting to explore.

There are many other sites that look good ...
