Last time, I implemented a function that fetches stock price information, but to be honest, I kept checking prices in a stock app without ever asking the BOT. It was only a matter of time before someone pointed out "the app is more convenient," and since I already use a portfolio management app daily, there is little real need to ask the BOT for a spot stock price. So I went back to square one, asked myself what kind of feature I actually wanted, and looked for something more practical.
"Yes, it would be convenient to see all the earnings announcement (scheduled) dates at a glance."
If you own many stocks, it is quite tedious to keep track of each one's schedule of semi-annual and quarterly earnings reports day to day. A feature that collects and displays all of this in one place would be convenient, and above all it feels like a natural job for a BOT.
So this time I will practice scraping, with the aim of sharpening the skill of extracting exactly the information I want from a website and processing it.
When you send a trigger message containing a stock code from the LINE app (①), the BOT accesses the website, looks up the latest earnings announcement (scheduled) date, and replies (②). It should handle both a single specified stock and bulk retrieval of a portfolio registered in advance.
We will use a site called Kabuyoho ("stock forecast").
When you open a stock's page, the latest earnings announcement (scheduled) date is displayed at the position shown in the figure below.
If the results have already been announced, that fact is displayed along with the date.
If the announcement is still upcoming, the date is shown as the scheduled announcement date.
The HTML source corresponding to this position is the "header_main" class, starting at line 222 of the page source.
You can see that everything we want is contained under this element.
So it seems this mission can be accomplished by mastering the following three steps:
1. Fetching the HTML source (requests)
2. Extracting the <div class="header_main"> tag (Beautiful Soup)
3. Cleaning up the extracted string (re)
Combining these three packages makes for very compact code.
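As a preview of how the three steps fit together, here is a minimal sketch run against an inline HTML snippet modeled on the page, so it works offline. Step 1 would normally be a requests.get against the real site, and the real markup is richer than this simplified stand-in:

```python
import re
from bs4 import BeautifulSoup

# Offline stand-in for the page body (simplified from the real markup)
html = ('<div class="header_main">'
        '<div class="stock_code left">4689</div>'
        '<div class="date left">\n\t2019/11/01\n</div>'
        '</div>')

# Step 1: (requests would fetch the real page; here the HTML is inline)
# Step 2: extract the header_main tag with Beautiful Soup
text = BeautifulSoup(html, "html.parser").find("div", class_="header_main").text
# Step 3: clean up metacharacters with re
text = re.sub(r'[\n\t]+', ',', text)
text = re.sub(r'(^,)|(,$)', '', text)
print(text)  # -> 4689,2019/11/01
```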
requests was already installed when the chatbot was built, so I will skip the details. Usage looks like this:
(Script execution example) How to use requests
(botenv2) [botenv2]$ python
Python 3.6.7 (default, Dec 5 2018, 15:02:16)
>>> import requests
#Get HTML source
>>> r = requests.get('https://kabuyoho.ifis.co.jp/index.php?action=tp1&sa=report_top&bcode=4689')
#Content confirmation
>>> print(r.headers)
{'Cache-Control': 'max-age=1', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html; charset=UTF-8', (abridgement)
>>> print(r.encoding)
UTF-8
>>> print(r.content)
(abridgement)
That is about all there is to it. The entire body is stored in r.content, so from here you can process it however you like, pulling out information using the target HTML tag as a key.
Beautiful Soup is a well-known package that already implements several parsers. Introducing it got me 90% of the way to this goal. Compared with the days when I did this kind of thing in C, it feels like a different world.
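As a side note on those parsers: the parser is chosen when the soup is constructed. "html.parser" ships with the Python standard library, while "lxml" and "html5lib" are optional third-party alternatives that must be installed separately. A tiny example:

```python
from bs4 import BeautifulSoup

# "html.parser" needs no extra install; pass "lxml" instead if it is installed
soup = BeautifulSoup('<div class="a"><p>hello</p></div>', 'html.parser')
print(soup.find('div', class_='a').p.text)  # -> hello
```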
Beautiful Soup installation
(botenv2) [botenv2]$ pip install BeautifulSoup4
(Script execution example, continued) How to use Beautiful Soup
>>> from bs4 import BeautifulSoup
#Parsing the Body part with parser
>>> soup = BeautifulSoup(r.content, "html.parser")
Try displaying the header_main class.
>>> print(soup.find("div", class_="header_main"))
Execution result
<div class="header_main">
<div class="stock_code left">4689</div>
<div class="stock_name left">Z Holdings</div>
<div class="block_update right">
<div class="title left">
Earnings announced
</div>
<div class="settle left">
2Q
</div>
<div class="date left">
2019/11/01
</div>
<div class="float_end"></div>
</div>
<div class="float_end"></div>
</div>
Amazing. This is so convenient I can hardly contain myself.
All that remains is to delete the unneeded strings. We don't need the HTML tags, so use the .text attribute.
(Script execution example, continued) Text extraction
>>> s = soup.find("div", class_="header_main").text
>>> print(s)
4689
Z Holdings
Earnings announced
2Q
2019/11/01
>>>
The tags are gone, but a lot of mysterious whitespace remains. I couldn't tell whether these were spaces or metacharacters, so I was stuck for a moment. In such cases, displaying the string as bytes reveals what is really there.
(Reference) Character code confirmation
>>> s.encode()
b'\n4689\n\xef\xbc\xba\xe3\x83\x9b\xe3\x83\xbc\xe3\x83\xab\xe3\x83\x87\xe3\x82\xa3\xe3\x83\xb3\xe3\x82\xb0\xe3\x82\xb9\n\n\n\t\t\t\t\t\t\t\t\t\xe6\xb1\xba\xe7\xae\x97\xe7\x99\xba\xe8\xa1\xa8\xe6\xb8\x88\n\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t2Q\n\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t2019/11/01\n\t\t\t\t\t\t\t\t\t\t\t\t\n\n\n\n'
The point is to remove the \n and \t. Let's mercilessly replace every run of them with a comma.
(Script execution example, continued) Metacharacter removal
>>> import re
>>> s = re.sub(r'[\n\t]+', ',', s)
>>> print(s)
,4689,Z Holdings,Earnings announced,2Q,2019/11/01,
Finally, remove the bothersome leading and trailing commas:
(Script execution example, continued)
>>> s = re.sub(r'(^,)|(,$)','', s)
>>> print(s)
4689,Z Holdings,Earnings announced,2Q,2019/11/01
Looking good. It can be converted to CSV or a dataframe as is.
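For instance, since the cleaned string is already comma-separated, a plain split() yields the fields. The column labels below are my own, picked for illustration; the site does not define them:

```python
s = '4689,Z Holdings,Earnings announced,2Q,2019/11/01'

# Hypothetical column labels (my own naming, for illustration only)
columns = ['code', 'name', 'status', 'quarter', 'date']
record = dict(zip(columns, s.split(',')))
print(record['date'])  # -> 2019/11/01
```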
By the way, when you query a stock code that does not exist, the following character remains even after the above processing.
For stock codes that do not exist
>>> s.encode()
b'\xc2\xa0'
This \xc2\xa0 is the UTF-8 encoding of the Unicode NO-BREAK SPACE, which corresponds to &nbsp; in HTML.
If this character is left in, it interferes with later processing, so it's best to remove it.
(It seems to be a common problem when scraping web pages.)
(Reference) [Python 3] What to do if you encounter \xa0 during scraping
Removing &nbsp;
s = re.sub(r'[\xc2\xa0]','', s)
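One subtlety worth noting: \xc2\xa0 is the UTF-8 byte pair, while in the decoded string the character is the single code point U+00A0, so the character class above removes it by matching '\xa0' (it would also remove a literal '\xc2' if one appeared). An alternative is unicodedata.normalize('NFKC', ...), which maps NO-BREAK SPACE to an ordinary space that strip() can then discard:

```python
import unicodedata

s = '4689\xa0'                        # decoded text ending in U+00A0 (NO-BREAK SPACE)
s = unicodedata.normalize('NFKC', s)  # NFKC maps NBSP to an ordinary space
s = s.strip()                         # which strip() then removes
print(s)  # -> 4689
```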
Here is the above processing wrapped into a function.
getSettledata.py
import requests
from bs4 import BeautifulSoup
import re
import logging

logger = logging.getLogger('getSettledata')

source = 'https://kabuyoho.ifis.co.jp/index.php?action=tp1&sa=report_top&bcode='

# Earnings date acquisition function (defaults to 4689 (ZHD) when no argument is given)
def get_settleInfo(code="4689"):
    # Crawling
    try:
        logger.debug('read web data code = ' + code)  # logging
        r = requests.get(source + code)
    except Exception:
        logger.debug('read web data ---> Exception Error')  # logging
        return None, 'Exception error: access failed'  # failure returns a (None, message) tuple
    # Scraping
    soup = BeautifulSoup(r.content, "html.parser")
    settleInfo = soup.find("div", class_="header_main").text
    settleInfo = re.sub(r'[\n\t]+', ',', settleInfo)   # collapse metacharacters into commas
    settleInfo = re.sub(r'(^,)|(,$)', '', settleInfo)  # strip leading/trailing commas
    settleInfo = re.sub(r'[\xc2\xa0]', '', settleInfo) # deal with the NO-BREAK SPACE (\xc2\xa0) problem
    logger.debug('settleInfo result = ' + settleInfo)  # logging
    if not settleInfo:
        settleInfo = 'There is no such stock ~'
    return settleInfo

if __name__ == '__main__':
    print(get_settleInfo())
For the main program, as usual, add a conditional branch that recognizes the trigger word.
If you register your own portfolio in SETTLEVIEW_LIST_CORD in advance, those stocks become the target of bulk retrieval.
chatbot.py (★ additions; existing functions are unchanged and omitted)
# -*- coding: utf-8 -*-
from django.views.decorators.csrf import csrf_exempt
from django.http import HttpResponse
from django.shortcuts import render
from datetime import datetime
from time import sleep
import requests
import json
import base64
import logging
import os
import random
import log.logconfig
from utils import tools
import re
from .getStockdata import get_chart
from .getSettledata import get_settleInfo

logger = logging.getLogger('commonLogging')

LINE_ENDPOINT = 'https://api.line.me/v2/bot/message/reply'
LINE_ACCESS_TOKEN = ''
###
### Omitted
###
SETTLEVIEW_KEY = ['Settlement', 'settle']         # ★ Addition
SETTLEVIEW_LIST_KEY = ['Financial results list']  # ★ Addition
SETTLEVIEW_LIST_CORD = ['4689', '3938', '4755', '1435', '3244', '3048']  # ★ Addition

@csrf_exempt
def line_handler(request):
    # exception
    if not request.method == 'POST':
        return HttpResponse(status=200)
    logger.debug('line_handler message incoming')  # logging
    out_log = tools.outputLog_line_request(request)  # logging
    request_json = json.loads(request.body.decode('utf-8'))
    for event in request_json['events']:
        reply_token = event['replyToken']
        message_type = event['message']['type']
        user_id = event['source']['userId']
        # whitelist
        if not user_id == LINE_ALLOW_USER:
            logger.warning('invalid userID:' + user_id)  # logging
            return HttpResponse(status=200)
        # action
        if message_type == 'text':
            if:
                ###
                ### Omitted
                ###
            elif any(s in event['message']['text'] for s in SETTLEVIEW_KEY):  # ★ Addition
                action_data(reply_token, 'settleview', event['message']['text'])  # ★ Addition
            else:
                ###
                ### Omitted
                ###
    return HttpResponse(status=200)

def action_res(reply_token, command):
    ###
    ### Omitted
    ###

def action_data(reply_token, command, value):
    # Stock chart
    ###
    ### Omitted
    ###
    ################################################### ★ Addition from here
    # Financial information
    elif command == 'settleview':
        logger.debug('get_settleInfo on')  # logging
        # Bulk acquisition of portfolio stocks
        if any(s in value for s in SETTLEVIEW_LIST_KEY):
            logger.debug('get_settleInfo LIST')  # logging
            results = []
            for cord in SETTLEVIEW_LIST_CORD:
                results.append(get_settleInfo(cord))
            logger.debug('get_settleInfo LIST ---> ' + '\n'.join(results))  # logging
            response_text(reply_token, '\n'.join(results))
        # Acquisition of an individual stock
        else:
            cord = re.search('[0-9]+$', value)
            logger.debug('get_settleInfo cord = ' + cord.group())  # logging
            result = get_settleInfo(cord.group())
            if result[0] is not None:
                response_text(reply_token, result)
            else:
                response_text(reply_token, result[1])
    ################################################### ★ Addition up to here

def response_image(reply_token, orgUrl, preUrl, text):
    ###
    ### Omitted
    ###

def response_text(reply_token, text):
    payload = {
        "replyToken": reply_token,
        "messages": [
            {
                "type": 'text',
                "text": text
            }
        ]
    }
    line_post(payload)

def line_post(payload):
    url = LINE_ENDPOINT
    header = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + LINE_ACCESS_TOKEN
    }
    requests.post(url, headers=header, data=json.dumps(payload))
    out_log = tools.outputLog_line_response(payload)  # logging
    logger.debug('line_handler message -->reply')  # logging

def ulocal_chatting(event):
    ###
    ### Omitted
    ###
This completes the implementation.
Launch the LINE bot
(botenv2) [line_bot]$ gunicorn --bind 127.0.0.1:8000 line_bot.wsgi:application
Send a message in the expected format from the LINE app and the result comes back.
To fetch everything at once, enter "Financial results list".
Measured serially, six stocks take about one second. I was impressed that it was faster than I had imagined, but just in case, I will use it in moderation so as not to hit the site too frequently. That's all for this time.
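(Postscript) If access frequency ever becomes a concern, one easy precaution is to sleep briefly between requests. A sketch, where fetch is a stand-in for get_settleInfo so it runs without network access:

```python
import time

def fetch_all(codes, fetch, delay=0.5):
    """Fetch each code serially, sleeping between requests to stay polite."""
    results = []
    for i, code in enumerate(codes):
        if i:                       # no need to sleep before the first request
            time.sleep(delay)
        results.append(fetch(code))
    return results

# Stand-in fetch so the sketch runs offline
out = fetch_all(['4689', '3938'], lambda c: c + ',ok', delay=0.1)
print(out)  # -> ['4689,ok', '3938,ok']
```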