Easy scraping with Python (JavaScript / Proxy / Cookie compatible version)

Good morning everyone, @_akisato here.

This is the article for day 6 of the Crawler / Web Scraping Advent Calendar (http://qiita.com/advent-calendar/2015/crawler).

Today, I would like to introduce scraping of web pages that cannot be viewed unless JavaScript and cookies are enabled.

The implementation is available on GitHub: https://github.com/akisato-/pyScraper.

First, scraping with no special tricks

The procedure is: (1) fetch the web page with requests, and (2) parse it with BeautifulSoup4. The Python standard HTML parser is not very good, so we use lxml here. For basic usage of BeautifulSoup4, see http://qiita.com/itkr/items/513318a9b5b92bd56185.

Installation of required packages

Use pip.

pip install requests
pip install lxml
pip install beautifulsoup4

Source

The source looks like the following. If you specify the URL of the page you want to scrape and an output file name, the title and description of the page are written out in JSON format. The function scraping is the main body.

scraping.py


import sys
import json
import requests
from bs4 import BeautifulSoup
import codecs

def scraping(url, output_name):
    # get a HTML response
    response = requests.get(url)
    html = response.text.encode(response.encoding)  # prevent encoding errors
    # parse the response
    soup = BeautifulSoup(html, "lxml")
    # extract
    ## title
    header = soup.find("head")
    title = header.find("title").text
    ## description
    description = header.find("meta", attrs={"name": "description"})
    description_content = description.attrs['content']  # attrs['content'] is already a string, so no .text here
    # output
    output = {"title": title, "description": description_content}
    # write the output as a json file
    with codecs.open(output_name, 'w', 'utf-8') as fout:
        json.dump(output, fout, indent=4, sort_keys=True, ensure_ascii=False)

if __name__ == '__main__':
    # arguments
    argvs = sys.argv
    ## check
    if len(argvs) != 3:
        print("Usage: python scraping.py [url] [output]")
        exit()
    url = argvs[1]
    output_name = argvs[2]

    scraping(url, output_name)
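The script can then be run from the command line as follows (the URL and output file name are just placeholders); it writes a JSON file containing the title and description:

python scraping.py http://example.com output.json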

JavaScript support

The number of web pages that cannot be viewed without JavaScript enabled is growing rapidly. If you access such a page with the previous source, all you will get back is a page saying "Please enable JavaScript".

To support such pages, we replace the web page acquisition previously done with requests by a combination of Selenium and PhantomJS. Selenium is a tool for automating browser operations, and PhantomJS is a Qt-based headless browser.[^browser]

[^browser]: PhantomJS is just one browser, so you can replace it with a commonly used web browser such as IE, Firefox, or Chrome. For details, see the official documentation: http://docs.seleniumhq.org/docs/03_webdriver.jsp#selenium-webdriver-s-drivers.
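For example, switching to Firefox should only require changing the driver construction (a sketch; it assumes Firefox and its Selenium driver are available on the machine):

from selenium import webdriver
driver = webdriver.Firefox()  # use Firefox instead of PhantomJS; everything after driver.get(url) stays the same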

Install PhantomJS

On Mac and Linux, it can be installed immediately with a package manager such as brew or yum.

Mac


brew install phantomjs

CentOS


yum install phantomjs

On Windows, download the binary from http://phantomjs.org/download.html, place it in a suitable location, and add that location to your PATH.

Install Selenium

This can be installed directly with pip.

pip install selenium

Source

Using Selenium and PhantomJS, the scraping source changes as follows. Nothing after the web page has been fetched needs to change: we configure the PhantomJS web driver through Selenium and obtain the HTML via that driver, and from there on the procedure is identical. If you want to record the driver's operation log, replace os.path.devnull with a file name.

scraping_js.py


import sys
import json
import os
from selenium import webdriver
from bs4 import BeautifulSoup
import codecs

def scraping(url, output_name):
    # Selenium settings
    driver = webdriver.PhantomJS(service_log_path=os.path.devnull)
    # get a HTML response
    driver.get(url)
    html = driver.page_source.encode('utf-8')  # more sophisticated methods may be available
    # parse the response
    soup = BeautifulSoup(html, "lxml")
    # extract
    ## title
    header = soup.find("head")
    title = header.find("title").text
    ## description
    description = header.find("meta", attrs={"name": "description"})
    description_content = description.attrs['content']  # attrs['content'] is already a string, so no .text here
    # output
    output = {"title": title, "description": description_content}
    # write the output as a json file
    with codecs.open(output_name, 'w', 'utf-8') as fout:
        json.dump(output, fout, indent=4, sort_keys=True, ensure_ascii=False)

if __name__ == '__main__':
    # arguments
    argvs = sys.argv
    ## check
    if len(argvs) != 3:
        print("Usage: python scraping.py [url] [output]")
        exit()
    url = argvs[1]
    output_name = argvs[2]

    scraping(url, output_name)
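One caveat: driver.page_source returns whatever has been rendered at that moment, so for pages that load content asynchronously it may be safer to wait for a known element before parsing. A minimal sketch using Selenium's explicit waits (the URL and the element id "main" are hypothetical):

import os
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS(service_log_path=os.path.devnull)
driver.get("http://example.com")  # hypothetical URL
# wait at most 10 seconds for an element with id="main" to appear
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "main")))
html = driver.page_source.encode('utf-8')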

Proxy support

Proxy settings can be passed as PhantomJS command-line arguments.

phantomjs_args = [ '--proxy=proxy.server.no.basho:0000' ]
driver = webdriver.PhantomJS(service_args=phantomjs_args, service_log_path=os.path.devnull)
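If the proxy requires authentication, PhantomJS also accepts --proxy-type and --proxy-auth options (a sketch; the host, port, and credentials below are placeholders):

phantomjs_args = [
    '--proxy=proxy.example.com:8080',  # placeholder host and port
    '--proxy-type=http',               # proxy protocol
    '--proxy-auth=username:password'   # placeholder credentials
]
driver = webdriver.PhantomJS(service_args=phantomjs_args, service_log_path=os.path.devnull)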

Cookie support

PhantomJS has cookies enabled by default. If you want to persist cookies to a file, you can specify it as a PhantomJS argument.

phantomjs_args = [ '--cookie-file={}'.format("cookie.txt") ]
driver = webdriver.PhantomJS(service_args=phantomjs_args, service_log_path=os.path.devnull)
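Cookies can also be read and set at run time through Selenium's standard API, which is handy for debugging (a sketch; the URL and the cookie name/value are hypothetical):

driver.get('http://example.com')                    # hypothetical URL
print(driver.get_cookies())                         # list the cookies visible to the current page
driver.add_cookie({'name': 'foo', 'value': 'bar'})  # set a hypothetical cookie for the current domain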

Final form of the source

Combining all of the above features, the source looks like this.

scraping_complete.py


import sys
import json
import os
from selenium import webdriver
from bs4 import BeautifulSoup
import codecs

def scraping(url, output_name):
    # Selenium settings
    phantomjs_args = [ '--proxy=proxy.server.no.basho:0000', '--cookie-file={}'.format("cookie.txt") ]
    driver = webdriver.PhantomJS(service_args=phantomjs_args, service_log_path=os.path.devnull)
    # get a HTML response
    driver.get(url)
    html = driver.page_source.encode('utf-8')  # more sophisticated methods may be available
    # parse the response
    soup = BeautifulSoup(html, "lxml")
    # extract
    ## title
    header = soup.find("head")
    title = header.find("title").text
    ## description
    description = header.find("meta", attrs={"name": "description"})
    description_content = description.attrs['content']
    # output
    output = {"title": title, "description": description_content}
    # write the output as a json file
    with codecs.open(output_name, 'w', 'utf-8') as fout:
        json.dump(output, fout, indent=4, sort_keys=True, ensure_ascii=False)

if __name__ == '__main__':
    # arguments
    argvs = sys.argv
    ## check
    if len(argvs) != 3:
        print("Usage: python scraping.py [url] [output]")
        exit()
    url = argvs[1]
    output_name = argvs[2]

    scraping(url, output_name)
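One pitfall to be aware of: find() returns None when a page has no meta description, so the line that reads attrs['content'] raises AttributeError on such pages. A defensive variant (a sketch, not part of the original script):

description = header.find("meta", attrs={"name": "description"})
# find() returns None when the tag is absent, so guard before reading attrs
description_content = description.attrs['content'] if description is not None else ""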
