Web scraping with Python ② (Actually scraping stock sites)

0. Introduction

This is the second in a series of articles summarizing what I have investigated with the aim of using Python for investing. This time I will actually try scraping.

1. Last review & confirmation of scraping target

I will try various things with "Stock Investment Memo" (kabuoji3.com), the only site whose robots.txt was simply `Allow: /` in the previous article.

https://kabuoji3.com/

python


#First, check with reppy (review of the previous article)
from reppy.robots import Robots

robots = Robots.fetch('https://kabuoji3.com/robots.txt')
print(robots.allowed('https://kabuoji3.com/', '*'))

Execution result


False

** This returns False, but that is because of how this site's robots.txt is written (at least as far as reppy is concerned). ** Normally there is a space after `Allow:`, but this site omits it, so the reppy library introduced in the previous article fails to parse the rule. When a check comes back NG like this, it is important to go and look at robots.txt yourself.

The site's current robots.txt

How its robots.txt should normally be written
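As an aside, Python's standard library parser, `urllib.robotparser`, happens to tolerate the missing space after `Allow:`, because it splits each line on the first colon. A minimal sketch (the `Allow:/` rule mirrors the site's robots.txt; the rest is illustrative):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt with no space after "Allow:" (as on kabuoji3.com).
# The stdlib parser splits on the first colon, so this still works.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow:/",  # note: no space after the colon
])
print(rp.can_fetch("*", "https://kabuoji3.com/"))  # → True
```

This is one way to cross-check a result when a third-party parser like reppy reports NG.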

** This time, as an example, we will fetch the part inside the red frame in the figure below. (URL: https://kabuoji3.com/stock/) **

2. Inspect the HTML of the URL and find the part you want to get

Let's set Python aside for a moment. The goal this time is to "get the latest stock price information for all listed stocks", but if you fetch the URL without thinking, you will also pick up unnecessary information that exists on the same page (links to help pages, title information, and so on). So you need to understand where in the HTML the "necessary part" is written.

Does that mean you need to learn HTML as yet another language? Knowing just the minimum is enough. Moreover, Google Chrome has a built-in feature called "Inspect", which covers most of what you need. If you right-click on the target page, you should see commands like the ones below.

** View page source: shows the raw HTML as is. ** ** Inspect: immediately shows which part of the HTML each element of the page corresponds to (this is the main tool in this article). **

Let's open Inspect on the target page "https://kabuoji3.com/stock/".

Inspecting the target page

The inspection pane appears on the right, showing the page's HTML. Try moving the mouse cursor over the `<header id="header" class="">` part: the upper part of the page is highlighted in blue, as in the figure above. This means the upper part of the page (the title and so on) is written in this element.

This time the part we want is the stock price table, so search for it in the HTML shown in the inspection pane.

Inspecting the target page

As you search, you can see that the `<table class="stock_table">` element looks right. Expanding it further, it contains `thead` and `tbody`, which are the header and the per-row stock price data of this table, respectively. Since table elements are written with the `table` tag in HTML, you can also just search for that tag directly.
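As a preview of what section 3 will do in earnest, here is how BeautifulSoup can locate that table. The HTML below is a hand-written stand-in for the structure found with Inspect, not the actual page source:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the structure found with Inspect
html = """
<table class="stock_table">
  <thead><tr><th>Code / name</th><th>market</th></tr></thead>
  <tbody><tr><td>1305 ...</td><td>TSE ETF</td></tr></tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() can match on tag name plus class attribute
table = soup.find("table", class_="stock_table")
print(table.thead.tr.th.text)  # → Code / name
```

Searching by `class_="stock_table"` is more robust than grabbing the first `table` on the page if the page ever gains another table.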

3. Get the latest stock price information of all listed stocks with the requests library

From here, we will fetch the page with Python based on the information examined in section 2. As mentioned briefly last time, you should have Beautiful Soup installed (`pip install beautifulsoup4`).

** You need to check your own user agent (UA) in advance, for example with a UA-checking site. ** The part shown as your "current browser" is your UA information, so rewrite the code below to match your environment.

python


import requests
from bs4 import BeautifulSoup
import pandas as pd

#Enter the URL to be scraped
url = 'https://kabuoji3.com/stock/'

#Paste your own user agent (the "current browser" value you looked up) *rewrite to match your environment
headers = {"User-Agent": "Mozilla/*** Chrome/*** Safari/***"}

#Get the page information (HTML) from the website with the requests library
response = requests.get(url, headers=headers)

print(response)

Execution result


<Response [200]>

A 3-digit HTTP status code is returned. If 200 (OK) comes back, the request succeeded. If 403 (Forbidden) or 404 (Not Found, i.e. a bad URL) comes back, it failed, so review your code.

Reference: HTTP status codes (Wikipedia)
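Rather than eyeballing the printed response, you can have requests raise an exception on 4xx/5xx codes with `raise_for_status()`. A sketch using a manually constructed `Response` so it runs without network access (the 403 is simulated):

```python
import requests

# Build a Response by hand to simulate a 403 without hitting the network
resp = requests.models.Response()
resp.status_code = 403

try:
    resp.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
except requests.HTTPError:
    print("request failed with status", resp.status_code)
```

In a real script, calling `response.raise_for_status()` right after `requests.get(...)` turns a silent failure into an immediate, visible error.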

4. Parse the content obtained with requests and extract the necessary parts

Next, parse the acquired HTML with BeautifulSoup, specify the necessary parts by tag, and extract them.

python


#Create a BeautifulSoup object from the acquired HTML
soup = BeautifulSoup(response.content, "html.parser")

"""
First, get the header part of the stock price list.
I know that the header part is tagged with "tr" in HTML in 2, so look for it.
There are several ways to do it, but the point is<thead>In the tag<tr>You can get all the headers by extracting all of them.
"""
#First, search for the head with the command using the find method, and find the tr in it._Extract all with all method
tag_thead_tr = soup.find('thead').find_all('tr')

print(tag_thead_tr)

Execution result


[<tr>
<th>Code / name</th>
<th>market</th>
<th>Open price</th>
<th>High price</th>
<th>Low price</th>
<th>closing price</th>
</tr>]

python


#Get the stock price rows the same way; we already know they are the tr tags grouped inside tbody
tag_tbody_tr = soup.find('tbody').find_all('tr')

#Since there are many, only the 0th is displayed
print(tag_tbody_tr[0])

Execution result


<tr data-href="https://kabuoji3.com/stock/1305/">
<td><a href="https://kabuoji3.com/stock/1305/">1305 Daiwa Exchange Traded Fund-Topics</a></td>
<td>TSE ETF</td>
<td>1883</td>
<td>1888</td>
<td>1878</td>
<td>1884</td>
</tr>

You can see that the data was acquired successfully.
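As a side note, each row also carries the URL of that stock's detail page. Using the row shown above (reproduced here as a standalone string), the link can be pulled out of the `a` tag's `href` attribute:

```python
from bs4 import BeautifulSoup

# The first tbody row from the output above, as a standalone snippet
row_html = '''<tr data-href="https://kabuoji3.com/stock/1305/">
<td><a href="https://kabuoji3.com/stock/1305/">1305 Daiwa Exchange Traded Fund-Topics</a></td>
<td>TSE ETF</td>
</tr>'''

row = BeautifulSoup(row_html, "html.parser").find("tr")
# Tag attributes are accessed like dictionary keys
print(row.find("a")["href"])  # → https://kabuoji3.com/stock/1305/
```

These per-stock URLs would be the starting point if you later want to crawl each stock's historical data.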

5. Display the acquired information in pandas

Let's assemble the data into a table with pandas, which makes it easy to handle in Python.

python


import pandas as pd

#Extract the acquired header row's th elements and convert them to text.
#The [0] is needed because find_all returns a list.
head = [h.text for h in tag_thead_tr[0].find_all('th')]

#Do the same for the stock price rows
data = []
for i in range(len(tag_tbody_tr)):
    #The data for each column is stored in td tags, so extract them
    data.append([d.text for d in tag_tbody_tr[i].find_all('td')])

#Build the DataFrame once, after all rows are collected
df = pd.DataFrame(data, columns=head)

#Show only the first two rows of the DataFrame
df.head(2)

Display result

Code / name market Open price High price Low price closing price
0 1305 Daiwa Exchange Traded Fund-Topics TSE ETF 1883 1888 1878 1884
1 1306 (NEXT FUNDS)TOPIX-linked exchange-traded fund TSE ETF 1861 1867 1856 1863

Now the data is in a form that is easy to handle in Python. From here you can process it however you like. You can also save it to CSV with pandas' to_csv method.

6. Finally

The article itself is long, but the code is short. In short, all you have to do is find which HTML tag holds the information you want. Of course, there are many cases this approach cannot handle (for example when JavaScript is involved), but you can pick up those techniques as needed. ** Next time, in the third article, I plan to combine the first and second articles and store the stock prices obtained by scraping in a database. **

7. (Bonus) Simple HTML supplement

Finally, a brief supplement on the HTML of https://kabuoji3.com/stock/. If you open the HTML in Inspect and fold the `body ...` part, it looks like the figure below.

Inspecting the target page

In other words, simplified, it looks like ⇓.

html


<!--First comes the html element. lang="ja" means the page is in Japanese-->
<html class=・ ・> 

<!--The head part, which is not displayed in the browser. It holds the character encoding and the text shown when the page appears in search results-->
<head>...</head>
<!--The body part, i.e. the page body itself, built up by grouping the whole page with div tags-->
<body class=・ ・ ・>...</body>

<!--End of the HTML-->
</html>

The rough structure of the body within the page is written below, with indentation showing the nesting. It is easiest to understand if you actually look at the page and check it with Inspect yourself. It looks like a mess, but the div id you need to check is not duplicated, so just focus on that.

html


<!--▼ Excerpt of the body part only-->
<body>
    <div id="wrapper">
        <!--▼ Header (heading) part-->
        <header id="header">...</header>
        <!--▼ Global navigation (appears when you press MENU) part-->
        <div id="gNav_wrap">...</div>
        <!--▼ Page main part-->
        <div id="contents_wrap">
            <!--▼ Main part-->
            <div id="container_in">
                <!--▼ Main part-->
                <div id="main">
                    <!--▼ From here, only the important parts; the rest is omitted-->
                    <div class="data_contents">
                        <!--▼ Stock price table (table)-->
                        <table class="stock_table">
                            <!--▼ Table header (columns)-->
                            <thead>
                                <!--▼ Column row-->
                                <tr>...</tr>
                            </thead>
                            <!--▼ Table contents data part-->
                            <tbody>
                                <!--▼ Stock price data for each row-->
                                <tr>...</tr>
                            </tbody>
                        </table>
                    </div>
                </div>
                <!--▼ Data Menu part-->
                <div id="side">...</div>
            </div>
            <!--▼ Navigation part with HOME, PAGE TOP links-->
            <div id="gNav_wrap">...</div>
            <!--▼ Footer (information collected at the bottom of the page) part-->
            <div id="gNav_wrap">...</div>
        </div>
    </div>
    <!--▼ Script part, used to load javascript or external scripts-->
    <script>...</script>
</body>
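Knowing this nesting, a CSS selector can target the table directly instead of walking down each div by hand. A sketch against a trimmed-down copy of the structure above:

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the body structure shown above
html = '''<body><div id="wrapper">
  <div id="contents_wrap"><div id="container_in"><div id="main">
    <div class="data_contents">
      <table class="stock_table">
        <thead><tr><th>Code / name</th></tr></thead>
        <tbody><tr><td>1305 ...</td></tr></tbody>
      </table>
    </div>
  </div></div></div>
</div></body>'''

soup = BeautifulSoup(html, "html.parser")
# select() takes a CSS selector: id with "#", class with "."
rows = soup.select("div#main table.stock_table tbody tr")
print(len(rows))  # → 1
```

The selector reads like a path through the layers, which makes scripts easier to maintain when the page's surrounding markup changes.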
