Last time, I implemented a function that fetches stock price information, but to be honest, I kept checking prices in a stock app without ever asking the BOT. It was only a matter of time before someone pointed out "the app is more convenient," and since I already use a portfolio management app daily, there is little real need to ask the BOT for a spot stock price. So I went back to square one, asked myself what kind of feature I actually wanted, and looked for something more practical.
"Yes, it would be convenient to see all the earnings announcement (scheduled) dates at a glance."
If you own many stocks, it is quite tedious to keep track of each one's schedule of semi-annual and quarterly earnings reports day to day. A feature that collects and displays all of this in one place would be convenient, and above all it feels like a natural job for a BOT.
So this time I will practice scraping, with the aim of sharpening the skill of extracting exactly the information I want from a website and processing it.
When you send a trigger message containing a stock code from the LINE app (①), the BOT accesses the website, looks up the latest earnings announcement (scheduled) date, and replies (②). It should handle both a single specified stock and bulk retrieval of a portfolio registered in advance.
We will use a site called Kabuyoho ("stock forecast").
When you open a stock's page, the latest earnings announcement (scheduled) date is displayed at the position shown in the figure below.
If the results have already been announced, that fact is displayed along with the date.
If the announcement is still upcoming, the date is shown as the scheduled announcement date.
The HTML source corresponding to this position is the "header_main" class, starting at line 222 of the page source.
You can see that everything we want is contained under this element.
So it seems this mission can be accomplished by mastering the following three steps:
1. Fetching the HTML source (requests)
2. Extracting the <div class="header_main"> tag (Beautiful Soup)
3. Cleaning up the extracted string (re)
Combining these three packages makes for very compact code.
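As a preview of how the three steps fit together, here is a minimal sketch run against an inline HTML snippet modeled on the page, so it works offline. Step 1 would normally be a requests.get against the real site, and the real markup is richer than this simplified stand-in:

```python
import re
from bs4 import BeautifulSoup

# Offline stand-in for the page body (simplified from the real markup)
html = ('<div class="header_main">'
        '<div class="stock_code left">4689</div>'
        '<div class="date left">\n\t2019/11/01\n</div>'
        '</div>')

# Step 1: (requests would fetch the real page; here the HTML is inline)
# Step 2: extract the header_main tag with Beautiful Soup
text = BeautifulSoup(html, "html.parser").find("div", class_="header_main").text
# Step 3: clean up metacharacters with re
text = re.sub(r'[\n\t]+', ',', text)
text = re.sub(r'(^,)|(,$)', '', text)
print(text)  # -> 4689,2019/11/01
```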
requests was already installed when the chatbot was built, so I will skip the details. Usage looks like this:
(Script execution example) How to use requests
(botenv2) [botenv2]$ python
Python 3.6.7 (default, Dec 5 2018, 15:02:16)
>>> import requests
#Get HTML source
>>> r = requests.get('https://kabuyoho.ifis.co.jp/index.php?action=tp1&sa=report_top&bcode=4689')
#Content confirmation
>>> print(r.headers)
{'Cache-Control': 'max-age=1', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html; charset=UTF-8', (abridgement)
>>> print(r.encoding)
UTF-8
>>> print(r.content)
(abridgement)
That is about all there is to it. The entire body is stored in r.content, so from here you can process it however you like, pulling out information using the target HTML tag as a key.
Beautiful Soup is a well-known package that already implements several parsers. Introducing it got me 90% of the way to this goal. Compared with the days when I did this kind of thing in C, it feels like a different world.
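As a side note on those parsers: the parser is chosen when the soup is constructed. "html.parser" ships with the Python standard library, while "lxml" and "html5lib" are optional third-party alternatives that must be installed separately. A tiny example:

```python
from bs4 import BeautifulSoup

# "html.parser" needs no extra install; pass "lxml" instead if it is installed
soup = BeautifulSoup('<div class="a"><p>hello</p></div>', 'html.parser')
print(soup.find('div', class_='a').p.text)  # -> hello
```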
Beautiful Soup installation
(botenv2) [botenv2]$ pip install BeautifulSoup4
(Script execution example, continued) How to use Beautiful Soup
>>> from bs4 import BeautifulSoup
#Parsing the Body part with parser
>>> soup = BeautifulSoup(r.content, "html.parser")
Try displaying the header_main class.
>>> print(soup.find("div", class_="header_main"))
Execution result
<div class="header_main">
<div class="stock_code left">4689</div>
<div class="stock_name left">Z Holdings</div>
<div class="block_update right">
<div class="title left">
Earnings announced
</div>
<div class="settle left">
2Q
</div>
<div class="date left">
2019/11/01
</div>
<div class="float_end"></div>
</div>
<div class="float_end"></div>
</div>
Amazing. This is so convenient I can hardly contain myself.
All that remains is to delete the unneeded strings. We don't need the HTML tags, so use the .text attribute.
(Script execution example, continued) Text extraction
>>> s = soup.find("div", class_="header_main").text
>>> print(s)
4689
Z Holdings
Earnings announced
2Q
2019/11/01
>>>
The tags are gone, but a lot of mysterious whitespace remains. I couldn't tell whether these were spaces or metacharacters, so I was stuck for a moment. In such cases, displaying the string as bytes reveals what is really there.
(Reference) Character code confirmation
>>> s.encode()
b'\n4689\n\xef\xbc\xba\xe3\x83\x9b\xe3\x83\xbc\xe3\x83\xab\xe3\x83\x87\xe3\x82\xa3\xe3\x83\xb3\xe3\x82\xb0\xe3\x82\xb9\n\n\n\t\t\t\t\t\t\t\t\t\xe6\xb1\xba\xe7\xae\x97\xe7\x99\xba\xe8\xa1\xa8\xe6\xb8\x88\n\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t2Q\n\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t2019/11/01\n\t\t\t\t\t\t\t\t\t\t\t\t\n\n\n\n'
The point is to remove the \n and \t. Let's mercilessly replace every run of them with a comma.
(Script execution example, continued) Metacharacter removal
>>> import re
>>> s = re.sub(r'[\n\t]+', ',', s)
>>> print(s)
,4689,Z Holdings,Earnings announced,2Q,2019/11/01,
Finally, remove the bothersome leading and trailing commas:
(Script execution example, continued)
>>> s = re.sub(r'(^,)|(,$)','', s)
>>> print(s)
4689,Z Holdings,Earnings announced,2Q,2019/11/01
Looking good. It can be converted to CSV or a dataframe as is.
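For instance, since the cleaned string is already comma-separated, a plain split() yields the fields. The column labels below are my own, picked for illustration; the site does not define them:

```python
s = '4689,Z Holdings,Earnings announced,2Q,2019/11/01'

# Hypothetical column labels (my own naming, for illustration only)
columns = ['code', 'name', 'status', 'quarter', 'date']
record = dict(zip(columns, s.split(',')))
print(record['date'])  # -> 2019/11/01
```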
By the way, when you query a stock code that does not exist, the following character remains even after the above processing.
For stock codes that do not exist
>>> s.encode()
b'\xc2\xa0'
This \xc2\xa0 is the UTF-8 encoding of the Unicode NO-BREAK SPACE, which corresponds to &nbsp; in HTML.
If this character is left in, it interferes with later processing, so it's best to remove it.
(It seems to be a common problem when scraping web pages.)
(Reference) [Python 3] What to do if you encounter \xa0 during scraping
Removing &nbsp;
s = re.sub(r'[\xc2\xa0]','', s)
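One subtlety worth noting: \xc2\xa0 is the UTF-8 byte pair, while in the decoded string the character is the single code point U+00A0, so the character class above removes it by matching '\xa0' (it would also remove a literal '\xc2' if one appeared). An alternative is unicodedata.normalize('NFKC', ...), which maps NO-BREAK SPACE to an ordinary space that strip() can then discard:

```python
import unicodedata

s = '4689\xa0'                        # decoded text ending in U+00A0 (NO-BREAK SPACE)
s = unicodedata.normalize('NFKC', s)  # NFKC maps NBSP to an ordinary space
s = s.strip()                         # which strip() then removes
print(s)  # -> 4689
```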
Here is the above processing wrapped into a function.
getSettledata.py
import requests
from bs4 import BeautifulSoup
import re
import logging

logger = logging.getLogger('getSettledata')

source = 'https://kabuyoho.ifis.co.jp/index.php?action=tp1&sa=report_top&bcode='

# Earnings date acquisition function (defaults to 4689 (ZHD) when no argument is given)
def get_settleInfo(code="4689"):
    # Crawling
    try:
        logger.debug('read web data code = ' + code)  # logging
        r = requests.get(source + code)
    except Exception:
        logger.debug('read web data ---> Exception Error')  # logging
        return None, 'Exception error: access failed'  # failure returns a (None, message) tuple
    # Scraping
    soup = BeautifulSoup(r.content, "html.parser")
    settleInfo = soup.find("div", class_="header_main").text
    settleInfo = re.sub(r'[\n\t]+', ',', settleInfo)   # collapse metacharacters into commas
    settleInfo = re.sub(r'(^,)|(,$)', '', settleInfo)  # strip leading/trailing commas
    settleInfo = re.sub(r'[\xc2\xa0]', '', settleInfo) # deal with the NO-BREAK SPACE (\xc2\xa0) problem
    logger.debug('settleInfo result = ' + settleInfo)  # logging
    if not settleInfo:
        settleInfo = 'There is no such stock ~'
    return settleInfo

if __name__ == '__main__':
    print(get_settleInfo())
For the main program, as usual, add a conditional branch that recognizes the trigger word.
If you register your own portfolio in SETTLEVIEW_LIST_CORD in advance, those stocks become the target of bulk retrieval.
chatbot.py (★ additions; existing functions are unchanged and omitted)
# -*- coding: utf-8 -*-
from django.views.decorators.csrf import csrf_exempt
from django.http import HttpResponse
from django.shortcuts import render
from datetime import datetime
from time import sleep
import requests
import json
import base64
import logging
import os
import random
import log.logconfig
from utils import tools
import re
from .getStockdata import get_chart
from .getSettledata import get_settleInfo

logger = logging.getLogger('commonLogging')

LINE_ENDPOINT = 'https://api.line.me/v2/bot/message/reply'
LINE_ACCESS_TOKEN = ''
###
### Omitted
###
SETTLEVIEW_KEY = ['Settlement', 'settle']         # ★ Addition
SETTLEVIEW_LIST_KEY = ['Financial results list']  # ★ Addition
SETTLEVIEW_LIST_CORD = ['4689', '3938', '4755', '1435', '3244', '3048']  # ★ Addition

@csrf_exempt
def line_handler(request):
    # exception
    if not request.method == 'POST':
        return HttpResponse(status=200)
    logger.debug('line_handler message incoming')  # logging
    out_log = tools.outputLog_line_request(request)  # logging
    request_json = json.loads(request.body.decode('utf-8'))
    for event in request_json['events']:
        reply_token = event['replyToken']
        message_type = event['message']['type']
        user_id = event['source']['userId']
        # whitelist
        if not user_id == LINE_ALLOW_USER:
            logger.warning('invalid userID:' + user_id)  # logging
            return HttpResponse(status=200)
        # action
        if message_type == 'text':
            if:
                ###
                ### Omitted
                ###
            elif any(s in event['message']['text'] for s in SETTLEVIEW_KEY):  # ★ Addition
                action_data(reply_token, 'settleview', event['message']['text'])  # ★ Addition
            else:
                ###
                ### Omitted
                ###
    return HttpResponse(status=200)

def action_res(reply_token, command):
    ###
    ### Omitted
    ###

def action_data(reply_token, command, value):
    # Stock chart
    ###
    ### Omitted
    ###
    ################################################### ★ Addition from here
    # Financial information
    elif command == 'settleview':
        logger.debug('get_settleInfo on')  # logging
        # Bulk acquisition of portfolio stocks
        if any(s in value for s in SETTLEVIEW_LIST_KEY):
            logger.debug('get_settleInfo LIST')  # logging
            results = []
            for cord in SETTLEVIEW_LIST_CORD:
                results.append(get_settleInfo(cord))
            logger.debug('get_settleInfo LIST ---> ' + '\n'.join(results))  # logging
            response_text(reply_token, '\n'.join(results))
        # Acquisition of an individual stock
        else:
            cord = re.search('[0-9]+$', value)
            logger.debug('get_settleInfo cord = ' + cord.group())  # logging
            result = get_settleInfo(cord.group())
            if result[0] is not None:
                response_text(reply_token, result)
            else:
                response_text(reply_token, result[1])
    ################################################### ★ Addition up to here

def response_image(reply_token, orgUrl, preUrl, text):
    ###
    ### Omitted
    ###

def response_text(reply_token, text):
    payload = {
        "replyToken": reply_token,
        "messages": [
            {
                "type": 'text',
                "text": text
            }
        ]
    }
    line_post(payload)

def line_post(payload):
    url = LINE_ENDPOINT
    header = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + LINE_ACCESS_TOKEN
    }
    requests.post(url, headers=header, data=json.dumps(payload))
    out_log = tools.outputLog_line_response(payload)  # logging
    logger.debug('line_handler message -->reply')  # logging

def ulocal_chatting(event):
    ###
    ### Omitted
    ###
This completes the implementation.
Launch the LINE bot
(botenv2) [line_bot]$ gunicorn --bind 127.0.0.1:8000 line_bot.wsgi:application
Send a message in the expected format from the LINE app and the result comes back.
To fetch everything at once, enter "Financial results list".
Measured serially, six stocks take about one second. I was impressed that it was faster than I had imagined, but just in case, I will use it in moderation so as not to hit the site too frequently. That's all for this time.
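(Postscript) If access frequency ever becomes a concern, one easy precaution is to sleep briefly between requests. A sketch, where fetch is a stand-in for get_settleInfo so it runs without network access:

```python
import time

def fetch_all(codes, fetch, delay=0.5):
    """Fetch each code serially, sleeping between requests to stay polite."""
    results = []
    for i, code in enumerate(codes):
        if i:                       # no need to sleep before the first request
            time.sleep(delay)
        results.append(fetch(code))
    return results

# Stand-in fetch so the sketch runs offline
out = fetch_all(['4689', '3938'], lambda c: c + ',ok', delay=0.1)
print(out)  # -> ['4689,ok', '3938,ok']
```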