Scraping with Python, posting with a Twitter bot, and regular execution on Heroku

I scraped news with Python and built a Twitter bot on Heroku that posts it regularly.

・ Because I suffer from tinnitus, I made a bot that regularly tweets information about "tinnitus".
・ Mac
・ Python 3
・ Specifically, I created an application that does the following two things:
[1] Scrape the search results for "tinnitus" and "dizziness" from Yahoo News and tweet them regularly
[2] Regularly retweet and like tweets such as "improvement of tinnitus" and "cause of tinnitus"

* Environment setup and directory structure

Create a directory named miminari on your desktop and create scraping.py inside it. Then build and activate a virtual environment as follows.

python3 -m venv .
source bin/activate

Install the required modules.

pip install requests
pip install beautifulsoup4
pip install lxml

Directory structure

miminari
├scraping.py
├date_list.txt
├source_list.txt
├text_list.txt
├title_list.txt
├url_list.txt
├twitter.py
├Procfile
├requirements.txt
└runtime.txt

[1] Scraping "tinnitus" and "dizziness" from Yahoo News and tweeting regularly

(1) Create scraping.py

① Scraping the news title and URL

Search for "tinnitus" and "dizziness" from Yahoo News and copy the url. The site shows 10 news items. Find a likely location for the title and URL. If you look at the "verification" of Google Chrome, you can see that it is in class = t of the h2 tag. Based on this, I will write the code. スクリーンショット 2020-03-21 19.15.50.png

.py:scraping.py


from bs4 import BeautifulSoup
import lxml
import requests

URL = "https://news.yahoo.co.jp/search/?p=%E8%80%B3%E9%B3%B4%E3%82%8A+%E3%82%81%E3%81%BE%E3%81%84&oq=&ei=UTF-8&b=1"

res = requests.get(URL)
res.encoding = res.apparent_encoding
html_doc = res.text
soup = BeautifulSoup(html_doc,"lxml")

news_list = soup.find_all("h2", class_="t")
print(news_list)

Running this prints a list like the following.

[<h2 class="t"><a href="https://headlines.yahoo.co.jp/hl?a=20200320-00000019-nkgendai-hlth">Misono is also fighting against Meniere's disease No radical cure has been found, but how do you deal with it?</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/article?a=20200316-00010000-flash-ent">Shoko Aida, the past suffering from sudden hearing loss and Meniere's disease<em>Tinnitus</em>But…"</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/article?a=20200315-00000004-nikkeisty-hlth">Gluten upset, treatment is dangerous without guidance</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/hl?a=20200313-00000243-spnannex-ent">Shoko Aida confesses her illness to retire from the entertainment world for the first time. Thanks to the doctor's "mental care"</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/article?a=20200310-00000011-pseven-life">"Hearing loss" is a risk factor for dementia Depression risk 2.Data with 4 times</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/hl?a=20200309-00010009-nishispo-spo">Olympic representative Ono's classmate overcomes illness and goes to the big stage 81 kg class indiscriminate challenge</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/article?a=20200226-00010011-newsweek-int">Iran's mysterious shock wave that hit the U.S. military takes several years to unravel</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/article?a=20200223-00986372-jspa-life">Chronic condition and fertility make me sick ... Tears at the words my husband gave to Alafor's wife</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/hl?a=20200215-00001569-fujinjp-life">Insufficient thermal energy? Blood circulation stagnation? Know the type of "cold" and become a body that can withstand the cold</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/article?a=20200214-00010009-jisin-soci">Recommended by a doctor! Insomnia, menstrual cramps, headaches ... "Normal heat 36."5 degrees" prevents upset</a></h2>]

Supplementary explanation

・ encoding is the character encoding of the response returned by the server; the content is decoded according to it.
・ apparent_encoding is the encoding guessed from the content itself; assigning it to res.encoding keeps the text from coming out garbled.
・ lxml is one of the HTML parsers, which parse the HTML, recognize the tags, and turn the page into a data structure. Normally html.parser is enough; lxml is used here because it is faster. lxml has to be installed separately (the explicit import in the code is optional but harmless).
・ Chrome's Inspect tool shows that the title and URL are inside an h2 tag with class="t", hence find_all("h2", class_="t").
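
As a minimal illustration (not from the original code) of what encoding and apparent_encoding do, assuming any page whose declared encoding may differ from its content:

import requests

res = requests.get("https://news.yahoo.co.jp/")
print(res.encoding)                    # encoding reported in the server's headers
print(res.apparent_encoding)           # encoding guessed from the content itself
res.encoding = res.apparent_encoding   # use the guess so res.text is not garbled
print(res.text[:100])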

② Scrape the news title and URL (save to files)

.py:scraping.py


from bs4 import BeautifulSoup
import lxml
import requests

URL = "https://news.yahoo.co.jp/search/?p=%E8%80%B3%E9%B3%B4%E3%82%8A+%E3%82%81%E3%81%BE%E3%81%84&oq=&ei=UTF-8&b=1"
res = requests.get(URL)
html_doc = res.text
soup = BeautifulSoup(html_doc,"lxml")


#Get title and url------------------------------
_list = soup.find_all("h2",class_="t")
title_list = []
url_list = []
for i in _list:
    a_tag = i.find_all('a')
    for _tag in a_tag:
        # get_text() extracts the string enclosed in the tags (the title)
        href_text = _tag.get_text()
        # Append the extracted title to the list
        title_list.append(href_text)
        # get("href") extracts the URL from the href attribute
        url_text = _tag.get("href")
        # Append the extracted URL to the list
        url_list.append(url_text)

#Save in text format
with open('title_data'+'.txt','a',encoding='utf-8') as f:
    for i in title_list:
        f.write(i + '\n')
with open('url_data'+'.txt','a',encoding='utf-8') as f:
    for i in url_list:
        f.write(i + '\n')

Supplementary explanation

・ get_text() extracts the character string enclosed in the tags.
・ get("href") retrieves the value of the href attribute.
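
A tiny, self-contained sketch of these two calls, using a made-up HTML snippet rather than the Yahoo News page:

from bs4 import BeautifulSoup

html = '<h2 class="t"><a href="https://example.com/news1">Sample title</a></h2>'
soup = BeautifulSoup(html, "lxml")
a_tag = soup.find("a")
print(a_tag.get_text())    # -> Sample title
print(a_tag.get("href"))   # -> https://example.com/news1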

③ Scraping the news summary (text), date and time, and source

Scrape the news summary (body text), date and time, and source in the same way, and save each to its own file.

.py:scraping.py


from bs4 import BeautifulSoup
import lxml
import requests

URL = "https://news.yahoo.co.jp/search/?p=%E8%80%B3%E9%B3%B4%E3%82%8A+%E3%82%81%E3%81%BE%E3%81%84&oq=&ei=UTF-8&b=1"
res = requests.get(URL)
html_doc = res.text
soup = BeautifulSoup(html_doc,"lxml")


#Get title and url------------------------------
_list = soup.find_all("h2",class_="t")
title_list = []
url_list = []
for i in _list:
    a_tag = i.find_all('a')
    for _tag in a_tag:
        # get_text() extracts the string enclosed in the tags (the title)
        href_text = _tag.get_text()
        # Append the extracted title to the list
        title_list.append(href_text)
        # get("href") extracts the URL from the href attribute
        url_text = _tag.get("href")
        # Append the extracted URL to the list
        url_list.append(url_text)

with open('title_data'+'.txt','a',encoding='utf-8') as f:
    for i in title_list:
        f.write(i + '\n')
with open('url_data'+'.txt','a',encoding='utf-8') as f:
    for i in url_list:
        f.write(i + '\n')


#Get text-----------------------------------------
_list2 = soup.find_all("p",class_="a")
text_list = []
for i in _list2:
    text_text = i.get_text()
    text_list.append(text_text)
with open('text_list'+'.txt','a',encoding='utf-8')as f:
    for i in text_list:
        f.write(i + '\n')


#Get date and time---------------------------------------------------------------
_list3 = soup.find_all("span",class_="d")
date_list = []
for i in _list3:
    _date_text = i.get_text()
    _date_text = _date_text.replace('\xa0','')
    date_list.append(_date_text)
with open('date_list'+'.txt','a',encoding='utf-8') as f:
    for i in date_list:
        f.write(i + '\n')


#Get the source---------------------------------------------------------------
_list4 = soup.find_all("span",class_="ct1")
source_list = []
for i in _list4:
    _source_text = i.get_text()
    source_list.append(_source_text)
with open('source_list'+'.txt','a',encoding='utf-8') as f:
    for i in source_list:
        f.write(i + '\n')

Supplementary explanation

・ If the date and time are extracted as-is, the extra character &nbsp; comes along with them, so it is stripped with replace(). When scraped, &nbsp; shows up as \xa0, hence replace('\xa0','').
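
A small illustration of this cleanup (the date string below is invented):

raw_date = "3/21(Sat)\xa019:15"
print(raw_date)                       # the \xa0 shows up as a non-breaking space
print(raw_date.replace('\xa0', ''))   # -> 3/21(Sat)19:15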

④ Scraping from the next page onwards

As it stands, only the 10 news items on the first page are scraped, so loop over 4 pages (4 pages because the search results for "tinnitus" and "dizziness" only ran to 4 pages). Modify the code as follows.

.py:scraping.py


from bs4 import BeautifulSoup
import lxml
import requests

mm = 0
for i in range(4):
    URL = "https://news.yahoo.co.jp/search/?p=%E8%80%B3%E9%B3%B4%E3%82%8A+%E3%82%81%E3%81%BE%E3%81%84&oq=&ei=UTF-8&b={}".format(mm*10 + 1)
    res = requests.get(URL)
    html_doc = res.text
    soup = BeautifulSoup(html_doc,"lxml")

    #Get title and url------------------------------
    _list = soup.find_all("h2",class_="t")
    title_list = []
    url_list = []
    for i in _list:
        a_tag = i.find_all('a')
        for _tag in a_tag:
            # get_text() extracts the string enclosed in the tags (the title)
            href_text = _tag.get_text()
            # Append the extracted title to the list
            title_list.append(href_text)
            # get("href") extracts the URL from the href attribute
            url_text = _tag.get("href")
            # Append the extracted URL to the list
            url_list.append(url_text)

    with open('title_list'+'.txt','a',encoding='utf-8') as f:
        for i in title_list:
            f.write(i + '\n')
    with open('url_list'+'.txt','a',encoding='utf-8') as f:
        for i in url_list:
            f.write(i + '\n')


    #Get text-----------------------------------------
    _list2 = soup.find_all("p",class_="a")
    text_list = []
    for i in _list2:
        text_text = i.get_text()
        text_list.append(text_text)
    with open('text_list'+'.txt','a',encoding='utf-8')as f:
        for i in text_list:
            f.write(i + '\n')


    #Get date and time---------------------------------------------------------------
    _list3 = soup.find_all("span",class_="d")
    date_list = []
    for i in _list3:
        _date_text = i.get_text()
        _date_text = _date_text.replace('\xa0','')
        date_list.append(_date_text)
    with open('date_list'+'.txt','a',encoding='utf-8') as f:
        for i in date_list:
            f.write(i + '\n')


    #Get the source---------------------------------------------------------------
    _list4 = soup.find_all("span",class_="ct1")
    source_list = []
    for i in _list4:
        _source_text = i.get_text()
        source_list.append(_source_text)
    with open('source_list'+'.txt','a',encoding='utf-8') as f:
        for i in source_list:
            f.write(i + '\n')

    # Move on to the next page --------------------------------------------------
    mm += 1

Supplementary explanation

The following parts were added. Since the b= parameter at the end of the URL is 1, 11, 21, 31 for pages 1 through 4, it is generated with a for loop and format().

mm = 0
for i in range(4): 〜〜〜〜

〜〜〜〜 q=&ei=UTF-8&b={}".format(mm*10 + 1)

〜〜〜〜
mm += 1
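
A standalone sketch of the b= values this produces (only the tail of the URL is shown):

for mm in range(4):
    print("...&ei=UTF-8&b={}".format(mm * 10 + 1))
# -> b=1, b=11, b=21, b=31: the index of the first result on each page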

At this point, scraping has produced files for each news item's title (title_list), URL (url_list), summary (text_list), date and time (date_list), and source (source_list). For the Twitter posts that follow, only the date and time (date_list), source (source_list), and URL (url_list) are used.

(2) Post to Twitter

I will omit the detailed steps for creating a Twitter bot here. When creating the bot, I referred to the following for how to register for the Twitter API and how to tweet, search, like, and retweet with Tweepy:
・ Summary of steps from Twitter API registration (account application method) to approval
・ Posting to Twitter with Tweepy
・ Searching, liking, and retweeting on Twitter with Tweepy

Create twitter.py in the miminari directory and install Tweepy.

pip install tweepy

Create twitter.py as follows.

.py:twitter.py


import tweepy
from random import randint
import os

auth = tweepy.OAuthHandler(os.environ["CONSUMER_KEY"],os.environ["CONSUMER_SECRET"])
auth.set_access_token(os.environ["ACCESS_TOKEN"],os.environ["ACCESS_TOKEN_SECERET"])

api = tweepy.API(auth)

twitter_source =[]
twitter_url = []
twitter_date = []

with open('source_list.txt','r')as f:
    for i in f:
        twitter_source.append(i.rstrip('\n'))
with open('url_list.txt','r')as f:
    for i in f:
        twitter_url.append(i.rstrip('\n'))
with open('date_list.txt','r')as f:
    for i in f:
        twitter_date.append(i.rstrip('\n'))

#Randomly extract articles from the 0th to n-1st range of the list with the randint and len functions
i = randint(0,len(twitter_source)-1)
api.update_status("<News related to tinnitus>" + '\n' + twitter_date[i] + twitter_source[i] + twitter_url[i])

Supplementary explanation

・ CONSUMER_KEY and the other credentials are read from environment variables, in preparation for the deployment to Heroku.
・ The article to tweet is chosen at random.
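
For reference, the credentials can be set as Heroku environment variables with the standard CLI; the values below are placeholders, and the key names match the ones the code reads (including the ACCESS_TOKEN_SECERET spelling):

heroku config:set CONSUMER_KEY=xxxx CONSUMER_SECRET=xxxx ACCESS_TOKEN=xxxx ACCESS_TOKEN_SECERET=xxxx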

[2] Regularly retweet and like tweets such as "improvement of tinnitus" and "cause of tinnitus"

Add the following retweet code to twitter.py.

.py:twitter.py


import tweepy
from random import randint
import os


#auth = tweepy.OAuthHandler(config.CONSUMER_KEY,config.CONSUMER_SECRET)
#auth.set_access_token(config.ACCESS_TOKEN,config.ACCESS_TOKEN_SECERET)

auth = tweepy.OAuthHandler(os.environ["CONSUMER_KEY"],os.environ["CONSUMER_SECRET"])
auth.set_access_token(os.environ["ACCESS_TOKEN"],os.environ["ACCESS_TOKEN_SECERET"])

api = tweepy.API(auth)


#-Yahoo_news (tinnitus, dizziness) Tweet processing----------------------------------------------
twitter_source =[]
twitter_url = []
twitter_date = []

with open('source_list.txt','r')as f:
    for i in f:
        twitter_source.append(i.rstrip('\n'))
with open('url_list.txt','r')as f:
    for i in f:
        twitter_url.append(i.rstrip('\n'))
with open('date_list.txt','r')as f:
    for i in f:
        twitter_date.append(i.rstrip('\n'))

#Randomly extract articles from the 0th to n-1st range of the list with the randint and len functions
i = randint(0,len(twitter_source)-1)
api.update_status("<News related to tinnitus>" + '\n' + twitter_date[i] + twitter_source[i] + twitter_url[i])




#-(The following is added) Retweet processing----------------------------------------------------------------------
search_results_1 = api.search(q="Improvement of tinnitus", count=10)
search_results_2 = api.search(q="Tinnitus is terrible", count=10)
search_results_3 = api.search(q="Tinnitus", count=10)
search_results_4 = api.search(q="Tinnitus medicine", count=10)
search_results_5 = api.search(q="What is tinnitus?", count=10)
search_results_6 = api.search(q="Cause of tinnitus", count=10)
search_results_7 = api.search(q="Tinnitus Chinese medicine", count=10)
search_results_8 = api.search(q="Tinnitus acupoints", count=10)
search_results_9 = api.search(q="Tinnitus headache", count=10)
search_results_10 = api.search(q="#Tinnitus", count=10)
search_results_11 = api.search(q="Tinnitus", count=10)

the_list = [search_results_1,
            search_results_2,
            search_results_3,
            search_results_4,
            search_results_5,
            search_results_6,
            search_results_7,
            search_results_8,
            search_results_9,
            search_results_10,
            search_results_11
            ]

for results in the_list:
    for result in results:
        tweet_id = result.id
        # Retweeting or liking the same tweet twice raises an error,
        # so wrap the calls in try/except so the program does not stop partway.
        try:
            api.retweet(tweet_id)          # Retweet
            api.create_favorite(tweet_id)  # Like
        except Exception as e:
            print(e)

[3] Deploy to Heroku

(1) Deploy

Create the Procfile, runtime.txt, and requirements.txt needed for deployment. Check your own Python version first and write it in runtime.txt.

.txt:runtime.txt



python-3.8.0
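
To check which version to write there (assuming python3 is on your PATH):

python3 --version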

Write the following in the Procfile.

Procfile


web: python twitter.py

Generate requirements.txt by entering the following in the terminal.

pip freeze > requirements.txt

Next, deploy as follows: initialize git, link the Heroku app to the repository, add and commit with the message the-first, and finally push to Heroku.

git init
heroku git:remote -a testlinebot0319
git add .
git commit -m 'the-first'
git push heroku master

Before setting up regular execution, run the bot once from the terminal and check that it posts to Twitter; if it does, the deployment has worked so far.
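
One way to do that one-off run, assuming the standard Heroku CLI and the app configured above, would be:

heroku run python twitter.py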

(2) Regular execution

Run the following in the terminal to add the Heroku Scheduler add-on and open its settings page in the browser.

heroku addons:add scheduler:standard
heroku addons:open scheduler
(Screenshot: Heroku Scheduler settings screen)

Once the job is configured as in the screenshot, setup is complete (there, the job is set to run every 10 minutes).

In conclusion

If you like, please follow the bot on Twitter: @MiminariBot
