Getting Started with Python Web Scraping Practice

This is a hands-on introduction to web scraping with Python.

Rather than covering the general theory, I'd like to take a style you can pick up by feel.

Things to do

Eventually, I would like to build a program that will "access the Nihon Keizai Shimbun every hour and record the Nikkei Stock Average at that time in csv."

Caution

This is important; read it carefully: [Okazaki Municipal Central Library Case (Librahack Case) - Wikipedia](https://ja.wikipedia.org/wiki/%E5%B2%A1%E5%B4%8E%E5%B8%82%E7%AB%8B%E4%B8%AD%E5%A4%AE%E5%9B%B3%E6%9B%B8%E9%A4%A8%E4%BA%8B%E4%BB%B6), along with a list of precautions for web scraping.
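As one concrete precaution, here is a minimal sketch of checking robots.txt and pausing between requests, using Python 2's standard robotparser module (the file name, user agent, and interval here are my own assumptions, not from the original article):

checkRobots.py


# coding: UTF-8
#Minimal politeness sketch (assumed values): check robots.txt before
#fetching, and always pause between requests so as not to burden the server
import robotparser
import time

url = "http://www.nikkei.com/"

rp = robotparser.RobotFileParser()
rp.set_url(url + "robots.txt")
rp.read()

#Only fetch if robots.txt allows it for a generic user agent
if rp.can_fetch("*", url):
    #...fetch the page here with urllib2...
    time.sleep(1)  #assumed interval; keep requests infrequent
else:
    print "Fetching is disallowed by robots.txt"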

What we'll use

Language: Python 2.7.12
Libraries: urllib2, BeautifulSoup, csv, datetime, time

urllib2 is used to access URLs. BeautifulSoup is an HTML/XML parser for handling the fetched page. csv is used for working with csv files. datetime is used to get the current time, and time is used to wait between runs.

Library installation

urllib2 comes installed with Python. Use the pip command to install BeautifulSoup:

shell.sh


$ pip install beautifulsoup4
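If you want to confirm that the install worked (a quick check, not from the original article), print the library version:

shell.sh


$ python -c "import bs4; print(bs4.__version__)"

If this prints a version number instead of an ImportError, you are ready to go.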

Let's get the page title of the Nihon Keizai Shimbun as a starting point!

First of all, access the Nihon Keizai Shimbun with Python and get the HTML.

After that, we turn it into a form that Beautiful Soup can handle, and then get the page title from it and output it.

Also, getting just the title string directly may be hard to picture, so this time we'll first get the title element and then extract the title string from it.

getNikkeiWebPageTitle.py


# coding: UTF-8
import urllib2
from bs4 import BeautifulSoup

#URL to access
url = "http://www.nikkei.com/"

#Fetching the URL returns its HTML → <html><head><title>Economic, stock, business and political news:Nikkei electronic version</title></head><body....
html = urllib2.urlopen(url)

#Handle html with Beautiful Soup
soup = BeautifulSoup(html, "html.parser")

#Get the title element →<title>Economic, stock, business and political news:Nikkei electronic version</title>
title_tag = soup.title

#Get the element string → Economic, Stock, Business, Political News:Nikkei electronic version
title = title_tag.string

#Output title element
print title_tag

#Output the title as a character string
print title

Doing this will return the following results:

shell.sh


$ python getNikkeiWebPageTitle.py
<title>Economic, stock, business and political news:Nikkei electronic version</title>
Economic, stock, business and political news:Nikkei electronic version

By the way

print.py


print soup.title.string

This gives the same result.

I think you now have a rough idea of how this works.

Practice!

The goal this time is to "access the Nihon Keizai Shimbun every hour and record the Nikkei Stock Average at that time in csv." Breaking the program down into steps:

  1. Access the Nikkei Stock Average page of the Nihon Keizai Shimbun and get the HTML
  2. Get the Nikkei Stock Average using Beautiful Soup
  3. Write the date, time and Nikkei Stock Average in one record in csv

The csv will have no header row (a sketch of what one record looks like follows below).
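As a preview of step 3, here is a minimal sketch of appending one headerless record with the csv module (the file name and values are made-up examples):

writeRecordSketch.py


# coding: UTF-8
#Minimal sketch of step 3: append one (time, price) record to a csv with no header row
import csv

f = open('sample.csv', 'a')
writer = csv.writer(f, lineterminator='\n')
#The values below are made-up examples
writer.writerow(["2016/12/01 15:00:00", "18,513.12"])
f.close()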

Let's do it.

Access the Nikkei Stock Average page

First of all, access the Nikkei Stock Average page.

The standard approach is to look up the URL yourself in the browser beforehand.

If you do, you will find it on the "Nikkei → Market → Stocks" page.

We'll reuse the previous program:

getNikkeiHeikin.py


# coding: UTF-8
import urllib2
from bs4 import BeautifulSoup

#URL to access
url = "http://www.nikkei.com/markets/kabu/"

#Fetching the URL returns its HTML → <html><head><title>Economic, stock, business and political news:Nikkei electronic version</title></head><body....
html = urllib2.urlopen(url)

#Handle html with Beautiful Soup
soup = BeautifulSoup(html, "html.parser")

#Output the title as a character string
print soup.title.string

This will output the title.

shell.sh


$ python getNikkeiHeikin.py
>>Stocks: Market: Nikkei electronic version

Nikkei Stock Average acquisition

Next, get the Nikkei Stock Average.

Let's open Nikkei > Market > Stocks in your browser.

The Nikkei Stock Average is listed slightly below the top of this page.

To get this, you need to find the location of this data in HTML.

Right-click the Nikkei Stock Average figure and choose "Inspect".

You should then see a screen like this:

(Screenshot: the browser developer tools with the Nikkei Stock Average element highlighted)

The figure is inside a span element with class="mkc-stock_prices".

Now you know the position.

Let's actually print with Beautiful Soup.

getNikkeiHeikin.py


# coding: UTF-8
import urllib2
from bs4 import BeautifulSoup

#URL to access
url = "http://www.nikkei.com/markets/kabu/"

#Html to access URL is returned →<html><head><title>Economic, stock, business and political news:Nikkei electronic version</title></head><body....
html = urllib2.urlopen(url)

#Handle html with Beautiful Soup
soup = BeautifulSoup(html, "html.parser")

#Extract all span elements → they are returned as a list → [<span class="m-wficon triDown"></span>, <span class="l-h...
span = soup.find_all("span")

#Declare it first so that printing later does not raise an error even if nothing is found
nikkei_heikin = ""
#In a for loop, look through all the span elements for the one whose class is "mkc-stock_prices"
for tag in span:
    #Elements with no class attribute make tag.get("class").pop(0) fail, so catch the error with try
    try:
        #Extract the class string from the tag. Multiple classes may be set,
        #so get() returns a list, and pop(0) takes the first element of that list
        # <span class="hoge foo">  →   ["hoge","foo"]  →   hoge
        string_ = tag.get("class").pop(0)

        #Check whether the extracted class string is "mkc-stock_prices"
        if string_ == "mkc-stock_prices":
            #The class matches, so pull out the string enclosed in the tag with .string
            nikkei_heikin = tag.string
            #The extraction is done, so exit the for loop
            break
    except:
        #Do nothing
        pass

#Outputs the extracted Nikkei Stock Average.
print nikkei_heikin

Result

shell.sh


$ python getNikkeiHeikin.py
>>18,513.12

The code is explained mostly in its comments.

To express the flow simply:

  1. Go to Nihon Keizai Shimbun > Market > Stocks and pick up the HTML
  2. Since the Nikkei Stock Average is enclosed in a span element, extract all the span elements in the HTML
  3. Check each span element for the class "mkc-stock_prices"
  4. When the class is found, get the value with .string and exit the for loop
  5. Print the obtained value

That's the whole flow.

This flow can be reused in most situations: its advantage is that it is not particularly difficult and applies to most pages. The caveat is that if the span changes to another element, or the class value changes, the program can no longer find the value.
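As an aside, Beautiful Soup can also search by class directly, which replaces the whole for loop with a single call; a minimal sketch against the same page and class (the file name is hypothetical):

findByClass.py


# coding: UTF-8
import urllib2
from bs4 import BeautifulSoup

url = "http://www.nikkei.com/markets/kabu/"
soup = BeautifulSoup(urllib2.urlopen(url), "html.parser")

#find() returns the first matching element, or None when nothing matches,
#so guard before reading .string
tag = soup.find("span", class_="mkc-stock_prices")
if tag is not None:
    print tag.string

The same caveat applies: if the site changes the element or the class, find() simply returns None.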

Repeat and csv output

Finally, output this result to csv and repeat the process every hour.

getNikkeiHeikin.py


# coding: UTF-8
import urllib2
from bs4 import BeautifulSoup
from datetime import datetime
import csv
import time

#Run forever
while True:
    #If it is not yet minute 59, wait 58 seconds and check again
    if datetime.now().minute != 59:
        #Wait roughly a minute (58 seconds, leaving a margin for timing error)
        time.sleep(58)
        continue
    
    #Open the csv in append mode → open it here, just before writing, because opening the csv takes time once the file grows large
    f = open('nikkei_heikin.csv', 'a')
    writer = csv.writer(f, lineterminator='\n')

    #It is minute 59; to measure at the correct time, poll every second and stay here until second 59
    while datetime.now().second != 59:
        #It is not yet second 59, so wait 1 second
        time.sleep(1)
    #Wait one more second so we land on second 00 and do not record twice within the same second
    time.sleep(1)

    #Create a record to describe in csv
    csv_list = []

    #Get the current time in year, month, day, hour, minute, second
    time_ = datetime.now().strftime("%Y/%m/%d %H:%M:%S")
    #Insert time in the first column
    csv_list.append(time_)

    #URL to access
    url = "http://www.nikkei.com/markets/kabu/"

    #Fetching the URL returns its HTML → <html><head><title>Economic, stock, business and political news:Nikkei electronic version</title></head><body....
    html = urllib2.urlopen(url)

    #Handle html with Beautiful Soup
    soup = BeautifulSoup(html, "html.parser")

    #Extract all span elements → they are returned as a list → [<span class="m-wficon triDown"></span>, <span class="l-h...
    span = soup.find_all("span")

    #Declare it first so that printing later does not raise an error even if nothing is found
    nikkei_heikin = ""
    #In a for loop, look through all the span elements for the one whose class is "mkc-stock_prices"
    for tag in span:
        #Elements with no class attribute make tag.get("class").pop(0) fail, so catch the error with try
        try:
            #Extract the class string from the tag. Multiple classes may be set,
            #so get() returns a list, and pop(0) takes the first element of that list
            # <span class="hoge foo">  →   ["hoge","foo"]  →   hoge
            string_ = tag.get("class").pop(0)

            #Check whether the extracted class string is "mkc-stock_prices"
            if string_ == "mkc-stock_prices":
                #The class matches, so pull out the string enclosed in the tag with .string
                nikkei_heikin = tag.string
                #The extraction is done, so exit the for loop
                break
        except:
            #Do nothing
            pass

    #Output the extracted Nikkei Stock Average together with the time
    print time_, nikkei_heikin
    #Record the Nikkei 225 in the second column
    csv_list.append(nikkei_heikin)
    #Add to csv
    writer.writerow(csv_list)
    #Close to prevent file corruption
    f.close()

Expressed as a flow:

  1. Wait until HH:00:00
  2. Open csv
  3. Create a record
  4. Get the Nikkei Stock Average
  5. Add to record
  6. Write record to csv

That's it.

If you leave this running, it will access the site once an hour, get the Nikkei 225, and record it.
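As a design note, instead of polling in 58-second and 1-second steps you could compute the wait until the next full hour in one shot; a minimal sketch (an alternative, not how the program above does it; the file name is hypothetical):

sleepUntilHour.py


# coding: UTF-8
#Sketch: compute the seconds remaining until the next full hour and sleep once
from datetime import datetime, timedelta
import time

now = datetime.now()
#Round up to the next HH:00:00
next_hour = (now + timedelta(hours=1)).replace(minute=0, second=0, microsecond=0)
time.sleep((next_hour - now).total_seconds())
#...fetch and record here, then loop...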

You can apply this pattern to all sorts of things.

For example, you could add items to your cart at high speed (so-called scripting) during a sale on a certain shopping site named after a long river in South America...

I don't really recommend that, though.

Then!

Also here:

Python Web scraping technique collection "There is no value that cannot be obtained" JavaScript support
[10,000 requests per second!?] Explosive web scraping starting with Go language [Golang]
[For beginners] Re: Genetic algorithm starting from zero [Artificial intelligence]
