Use AWS Lambda to scrape the news and notify LINE of updates on a regular basis [Python]

Introduction

This is my first post. I'm a college student who has been teaching himself programming for a little over half a year, and I'd like to start writing up what I've learned, bit by bit. I expect the code has plenty of messy spots, so please feel free to point them out.

Goals / procedures

As the title suggests, the goal is to use AWS Lambda to scrape a news site (Yahoo News articles) on a regular basis (every 6 hours) and to notify my LINE account of updated articles through the LINE Messaging API. I used Cloud9 as the development environment because it works well together with Lambda.

The procedure is as follows.

(1) Get the titles and URLs of 20 articles from the news site with Python's BeautifulSoup module, and write the URLs to a CSV file stored in AWS S3. (The write itself happens at the end of the program.)

(2) Compare the URLs from the previous run (read from the CSV in S3) with the URLs obtained this time to find the updates.

(3) Send the titles and URLs of the updated articles through the LINE Messaging API.
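Step (2) can be sketched on its own as below. This is a minimal sketch, assuming both lists are ordered newest first and that the newest URL of the previous run marks where the already-seen articles begin. The function name `count_new_articles` and the sample URLs are hypothetical and do not appear in the script.

```python
def count_new_articles(current_urls, previous_urls):
    """Return how many entries at the head of current_urls are new.

    Assumes both lists are ordered newest first (an assumption of this sketch).
    """
    if not previous_urls:
        # Nothing was saved before, so every article counts as new
        return len(current_urls)
    newest_seen = previous_urls[0]
    for position, url in enumerate(current_urls):
        if url == newest_seen:
            # Everything before this position appeared after the last run
            return position
    # The previous newest article is no longer on the page: all are new
    return len(current_urls)


# Hypothetical example data
previous = ["https://example.com/b", "https://example.com/c"]
current = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
new_count = count_new_articles(current, previous)
print(current[:new_count])
```

Returning a count rather than a list keeps the titles and URLs aligned, since the same slice can be applied to both.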

I referred to the following articles: "Monitor web page updates with LINE BOT" for the LINE Messaging API, "Web scraping with BeautifulSoup" for scraping with BeautifulSoup, and "Python code to simply read CSV file of AWS S3" and [Write a Pandas dataframe to CSV on S3](https://www.jitsejan.com/write-dataframe-to-csv-on-s3.html) for reading and writing the CSV file in S3.

Code

import urllib.request
from bs4 import BeautifulSoup
import csv
import pandas as pd
import io
import boto3
import s3fs
import itertools
from linebot import LineBotApi
from linebot.models import TextSendMessage

def lambda_handler(event, context):
    
    url = 'https://follow.yahoo.co.jp/themes/051839da5a7baa353480'
    html = urllib.request.urlopen(url)
    # Parse the HTML
    soup = BeautifulSoup(html, "html.parser")
    
    
    def news_scraping(soup=soup):
        """
        Get the article titles and URLs.
        """
        title_list = []
        titles = soup.select('#wrapper > section.content > ul > li:nth-child(n) > a.detailBody__wrap > div.detailBody__cnt > p.detailBody__ttl')
    
        for title in titles:
            title_list.append(title.string)
    
        url_list = []   
        urls = soup.select('#wrapper > section.content > ul > li:nth-child(n) > a.detailBody__wrap')
        
        for url in urls:
            url_list.append(url.get('href'))
       
        return title_list,url_list
        
    def get_s3file(bucket_name, key):
        """
        Read the CSV file from S3.
        """
        s3 = boto3.resource('s3')
        s3obj = s3.Object(bucket_name, key).get()
    
        return io.TextIOWrapper(io.BytesIO(s3obj['Body'].read()))
        
    def write_df_to_s3(csv_list):
        """
        Write the DataFrame to S3 as CSV.
        """
        csv_buffer = io.StringIO()
        csv_list.to_csv(csv_buffer,index=False,encoding='utf-8-sig')
        s3_resource = boto3.resource('s3')
        s3_resource.Object('Bucket name','file name').put(Body=csv_buffer.getvalue())
    
    def send_line(content):
        # Fill in your channel access token as a string (masked here)
        access_token = '********'
        line_bot_api = LineBotApi(access_token)
        line_bot_api.broadcast(TextSendMessage(text=content))
    
    ex_csv = []
    # Load the URLs saved by the previous scraping run
    for rec in csv.reader(get_s3file('Bucket name', 'file name')):
        ex_csv.append(rec)

    # index=False drops the row index, but to_csv still writes the column header (0), so skip the first row
    ex_csv = ex_csv[1:]
    # csv.reader yields a list of rows (two-dimensional), so flatten it to one dimension
    ex_csv = list(itertools.chain.from_iterable(ex_csv))
    
    # Run the scraping
    title, url = news_scraping()
    csv_list = url

    # Extract the updates by comparing with ex_csv:
    # find where the newest article of the previous run (ex_csv[0]) sits in csv_list
    num = 'all'
    for i in range(len(csv_list)):
        # Use `in` (substring match) because the values did not match exactly
        if ex_csv and csv_list[i] in ex_csv[0]:
            num = i
            break

    if num == 'all':
        # Interleave titles and URLs, one per line
        send_list = [None]*2*len(title)
        send_list[::2] = title
        send_list[1::2] = url
        send_list = "\n".join(send_list)

    elif num == 0:
        send_list = 'No new news'

    else:
        # Interleave the titles and URLs of the new articles, one per line
        send_list = [None]*2*num
        send_list[::2] = title[:num]
        send_list[1::2] = url[:num]
        send_list = "\n".join(send_list)
    
    send_line(send_list)
    
    # Writing the plain list to S3 raised an error, so convert it to a DataFrame first
    csv_list = pd.DataFrame(csv_list)
    # Write csv_list to S3 and finish
    write_df_to_s3(csv_list)

That's it for the code. (In hindsight, it would have been better to define a function that builds send_list from num.) After that, deploy the function to Lambda and set up periodic execution with Amazon CloudWatch Events. I think a cron expression is the better way to run it at fixed times, for example `cron(0 0/6 * * ? *)` for every 6 hours. Note that you also need to grant the function access to S3 through IAM.
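The function mentioned in the parenthetical could look like the following. This is a minimal sketch and not the code that was deployed: the name `build_message` is hypothetical, and `num` is assumed to be either the string 'all' or the number of new articles, as in the script.

```python
def build_message(titles, urls, num):
    """Build the LINE message text from the scraped titles and URLs."""
    if num == 0:
        return 'No new news'
    if num == 'all':
        # No previously seen article was found, so send everything
        num = len(titles)
    lines = []
    for title, url in zip(titles[:num], urls[:num]):
        # One title line followed by its URL line
        lines.append(title)
        lines.append(url)
    return "\n".join(lines)


# Hypothetical example data
print(build_message(['Title A', 'Title B'], ['https://example.com/a', 'https://example.com/b'], 1))
```

With this in place, the three branches in the handler would reduce to a single call such as `send_line(build_message(title, url, num))`.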

Conclusion

Reading and writing the CSV in S3 didn't quite go as expected. In particular, there seems to be room for improvement in how the data is written, since it ends up as a two-dimensional array with an extra header row.
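One possible way around the header row and the two-dimensional result is to write one URL per row with the standard csv module and flatten the rows on read. The sketch below works on an in-memory string only. The helper names are hypothetical, the S3 upload and download are left out, and it is an illustration under these assumptions, not the method used in the script.

```python
import csv
import io


def urls_to_csv_text(urls):
    """Serialize a flat list of URLs as CSV text, one URL per row, no header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for url in urls:
        writer.writerow([url])
    return buffer.getvalue()


def csv_text_to_urls(text):
    """Parse CSV text written by urls_to_csv_text back into a flat list."""
    reader = csv.reader(io.StringIO(text))
    # Take the single column of each non-empty row
    return [row[0] for row in reader if row]


# Hypothetical example data
urls = ["https://example.com/a", "https://example.com/b"]
text = urls_to_csv_text(urls)
restored = csv_text_to_urls(text)
print(restored == urls)
```

The text returned by `urls_to_csv_text` could be passed as `Body` to the `put` call in the same way as `csv_buffer.getvalue()`.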

Actually, I had practiced scraping with Selenium, headless Chrome, and Lambda before (on top of using Lambda for the first time, I ran into a lot of errors related to the Chrome binary and had a hard time). Thanks to that, I was able to write the code in a relatively short time this time. That said, it's a lot more work than scraping locally. I've omitted how to use Lambda itself here; it's quite confusing, so please refer to other articles.

Recently I've been working with Django and the Twitter API, so if I learn something there, I'll post again.

Thank you very much.
