[Python] I made a tool to get new articles

I created a tool that gets the title and URL of newly posted blog articles by scraping with Python. GitHub -> https://github.com/cerven12/blog_post_getter

Why did you make it?

When a friend of mine started (restarted?) blogging, I decided to post blogs myself to reinforce my memory and improve my writing skills. I figured it would be more motivating to have two people competing and cooperating with each other than to do it alone, so I made this tool as part of that.

What kind of tool is it?

It gets the title and URL of newly posted articles. (I'd like to run it on a schedule and send notifications via the LINE API, etc.)

It compares the URLs of the latest post list against a txt file that records the URLs of posts already seen. It deliberately does not detect changes to titles or content, because it would be annoying to get a "New post!" notification just because a title was edited. The exception is the case where editing an article changes its URL (does that happen...?).

I built it for Qiita, so I don't know about other sites, but I think it can be used on any page whose HTML has the following shape:

<!-- The <a> tag has a class, and the title is the text content of the <a> tag -->
<a class='articles' href='#'>Title</a>

Pages it works on

Qiita user page: https://qiita.com/takuto_neko_like
Hatena Blog user page: http://atc.hateblo.jp/about

Conditions for use

  1. There is a page that displays a list of articles
  2. A common selector is set on the <a> tag of each article
  3. The title is written as the text content of the <a> tag
  4. An empty .txt file has been created in advance (a one-line sketch follows this list)
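
For condition 4, creating the empty record file is a one-liner. A minimal sketch, assuming the neko.txt file name used later in this post:

import pathlib

# Create an empty record file if it doesn't exist yet
pathlib.Path('neko.txt').touch()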

Whole code


import requests, bs4


def new_post_getter(url, selecter, txt):
    '''
    Get the article titles and URLs as bs4 elements.
    1st argument: URL of the page with the post list
    2nd argument: selector attached to each post's <a> tag (with the leading '.')
    3rd argument: path of the txt file used for recording
    '''
    res = requests.get(url)
    posts = bs4.BeautifulSoup(res.text, 'html.parser').select(selecter)

    now_posts_url = [] # URLs of the fetched article list; used to identify new posts by comparing with previous records
    now_posts_url_title_set = [] # "URL>>>title" strings for the fetched article list
    for post in posts:
        # Extract the URL
        index_first = int(str(post).find('href=')) + 6
        index_end = int(str(post).find('">'))
        url = (str(post)[index_first : index_end])
        # Extract the title
        index_first = int(str(post).find('">')) + 2
        index_end = int(str(post).find('</a'))
        title = (str(post)[index_first : index_end].replace('\u3000', ' ')) # replace full-width spaces

        now_posts_url.append(url)
        now_posts_url_title_set.append(f"{url}>>>{title}")

    old_post_text = open(txt)
    old_post = old_post_text.read().split(',') # from the text file to a list
    # differences: newly posted articles (URLs listed now but not yet in the record)
    differences = list(set(now_posts_url) - set(old_post))
    old_post_text.close()

    # Overwrite the record txt; all_posts = past posts + new posts
    all_posts = ",".join(old_post + differences)
    f = open(txt, mode='w')
    f.writelines(all_posts)
    f.close()

    new_post_info = []
    for new in now_posts_url_title_set:
        for incremental in differences:
            if incremental in new: # the "URL>>>title" string contains a diff URL
                new_post_info.append(new.split(">>>"))
    return new_post_info

How to Use

Specify the URL of the article list page, the selector attached to each article's <a> tag, and the path of the txt file that records the posting history as arguments.

★ Try it out


url = 'https://qiita.com/takuto_neko_like'
selecter = '.u-link-no-underline'
file = 'neko.txt'

my_posts = new_post_getter(url, selecter, file)
print(my_posts)

By running the above ...

Result



[['/takuto_neko_like/items/93b3751984e5e3fd3670', '[Fish] About the matter that the movement of fish was too slow ~ Trouble with git ~'], ['/takuto_neko_like/items/14e92797fa2b23a64adb', '[Python] What is inherited by multiple inheritance?']]

You get a nested list of URLs and titles: [[URL, title], [URL, title], [URL, title], .......]

By looping over the nested list with a for statement and formatting the strings ...


for url, title in my_posts:
    print(f'{title} : {url}')

Easy-to-read output ↓



[Fish] About the matter that the movement of fish was too slow ~ Trouble with git ~: /takuto_neko_like/items/93b3751984e5e3fd3670
[Python] What is inherited by multiple inheritance?: /takuto_neko_like/items/14e92797fa2b23a64adb

By the way

The contents of neko.txt look like this:

/takuto_neko_like/items/93b3751984e5e3fd3670,/takuto_neko_like/items/14e92797fa2b23a64adb,/takuto_neko_like/items/bb8d0957347636b5bf4f,/takuto_neko_like/items/62aeb4271614f6f0347f,/takuto_neko_like/items/c9c80ff453d0c4fad239,/takuto_neko_like/items/aed9dd5619d8457d4894,/takuto_neko_like/items/6cf9bade3d9515a724c0

It contains a comma-separated list of URLs. Let's try deleting the first and the last entries ...

/takuto_neko_like/items/14e92797fa2b23a64adb,/takuto_neko_like/items/bb8d0957347636b5bf4f,/takuto_neko_like/items/62aeb4271614f6f0347f,/takuto_neko_like/items/c9c80ff453d0c4fad239,/takuto_neko_like/items/aed9dd5619d8457d4894

When you run it again ...


my_posts = new_post_getter(url, selecter, file)
print(my_posts)

Result ↓

[['/takuto_neko_like/items/c5791f267e0964e09d03', 'Created a tool to get new articles to work hard with friends on blog posts'], ['/takuto_neko_like/items/93b3751984e5e3fd3670', '[Fish] About the matter that the movement of fish was too slow ~ Trouble with git ~'], ['/takuto_neko_like/items/6cf9bade3d9515a724c0', '【Python】@What are classmethods and decorators?']]

Exactly the deleted entries come back as new posts! ☺

How did you make it?

Below is a description of the code.

Code flow

  1. Get the <a> tags of all displayed articles from the article list page
  2. Extract the URL from each obtained <a> tag. Separately, extract the URL and title as a pair.
  3. Compare the URLs obtained in 2 with the existing post list (txt) and extract the difference.
  4. Combine the URLs of the new posts with the existing post record and overwrite the txt.
  5. From the URL-and-title pairs obtained in 2, extract only the entries that correspond to the difference, as a nested list for easy formatting.

Each part of the code

1. Get the <a> tags of all displayed articles from the article list page



import requests, bs4


def new_post_getter(url, selecter, txt):
    '''
    Get the article titles and URLs as bs4 elements.
    1st argument: URL of the page with the post list
    2nd argument: selector attached to each post's <a> tag (with the leading '.')
    3rd argument: path of the txt file used for recording
    '''
    res = requests.get(url)
    posts = bs4.BeautifulSoup(res.text, 'html.parser').select(selecter)

We use two third-party libraries here.

  1. requests: a library for working with web APIs. This time I use the GET method to obtain a response object. The response object contains various information, and .text extracts it as a string. The HTML text returned in the response is what Beautiful Soup consumes in 2.
  2. BeautifulSoup: parses the retrieved HTML text; you can then retrieve attributes with various methods, fetch multiple elements using selectors, and much more. This time I parse the textual response as HTML and use the .select method with a particular selector, which returns all elements matching that selector. A minimal sketch of these two calls in isolation follows.
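
A minimal sketch of just these two calls, using the URL and selector from the "★ Try it out" example above:

import requests, bs4

# GET the article list page; .text is the response body as a string
res = requests.get('https://qiita.com/takuto_neko_like')

# Parse the HTML, then fetch every element matching the selector
soup = bs4.BeautifulSoup(res.text, 'html.parser')
posts = soup.select('.u-link-no-underline')

print(len(posts)) # number of matched <a> tags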

The data actually acquired by the "★ Try it out" example above is the white-framed part in the screenshot below.

(Screenshot: スクリーンショット 2020-03-08 0.07.31.png)

2. Extract the URL from each obtained <a> tag. Separately, extract the URL and title as a pair.



    now_posts_url = [] # URLs of the fetched article list; used to identify new posts by comparing with previous records
    now_posts_url_title_set = [] # "URL>>>title" strings for the fetched article list
    for post in posts:
        # Extract the URL
        index_first = int(str(post).find('href=')) + 6
        index_end = int(str(post).find('">'))
        url = (str(post)[index_first : index_end])
        # Extract the title
        index_first = int(str(post).find('">')) + 2
        index_end = int(str(post).find('</a'))
        title = (str(post)[index_first : index_end].replace('\u3000', ' ')) # replace full-width spaces

        now_posts_url.append(url)
        now_posts_url_title_set.append(f"{url}>>>{title}")

Loop over the acquired <a> tag elements with a for statement. .find() returns the index at which a given substring starts, so the URL part and the title part can be cut out by slicing the stringified tag with those values.

(Screenshot: スクリーンショット 2020-03-08 0.17.53.png)
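
As an aside, the same values could be pulled out without string slicing by using BeautifulSoup's own accessors. A minimal sketch, not what the tool above uses:

# Each `post` is a bs4 Tag, so the attribute and text are directly accessible
for post in posts:
    url = post.get('href') # value of the href attribute
    title = post.get_text().replace('\u3000', ' ') # text content, full-width spaces replaced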

now_posts_url is the data used to compare against the post records so far and extract the difference (excluding articles that have disappeared from the list page due to pagination etc.). New arrivals are detected from the URL, which does not change even when an article is updated; but since the title and URL are to be output later, the URL + title pairs are saved now as well. So now_posts_url is used to compute the diff, and later only the entries containing a diff URL are extracted from now_posts_url_title_set.

3. Compare the URLs obtained in 2 with the existing post list (txt) and extract the difference


    old_post_text = open(txt)
    old_post = old_post_text.read().split(',') # from the text file to a list
    # differences: newly posted articles (URLs listed now but not yet in the record)
    differences = list(set(now_posts_url) - set(old_post))
    old_post_text.close()

I want to compare the freshly acquired post list against the txt file where the post records so far are saved, and extract the difference: a set difference. As a Venn diagram: A is the list of past posts, B is the latest post list, and the shaded area, the difference, is the genuinely new posts.

(Venn diagram: IMG_7686.jpg)

Set operations are easy to perform once the operands are set objects. Here, the comma-separated string recorded in the txt file (URL1,URL2,URL3) is converted to a list with split(','), and the difference is computed after converting both it and the latest post list obtained in 2 to sets.
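
A tiny worked example of the same operation, with made-up URLs:

old_post = ['/items/aaa', '/items/bbb'] # recorded so far
now_posts_url = ['/items/ccc', '/items/aaa', '/items/bbb'] # currently listed

differences = list(set(now_posts_url) - set(old_post))
print(differences) # ['/items/ccc'] -- only the new post remains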

4. Combine the URLs of the new posts with the existing post record and overwrite the txt


    # Overwrite the record txt; all_posts = past posts + new posts
    all_posts = ",".join(old_post + differences)
    f = open(txt, mode='w')
    f.writelines(all_posts)
    f.close()

The txt file also needs to be brought up to date so that it can be used next time. The difference (the new posts) is appended to the past posts, and the txt file is overwritten.
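
The same read-and-overwrite steps could also be written with with blocks, which close the file automatically. A small sketch under the same assumptions (txt is the record file path):

# Read the record, compute the diff, then overwrite; files close automatically
with open(txt) as f:
    old_post = f.read().split(',')
differences = list(set(now_posts_url) - set(old_post))
with open(txt, mode='w') as f:
    f.write(",".join(old_post + differences))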

5. From the URL-and-title pairs obtained in 2, extract only the entries that correspond to the difference, as a nested list for easy formatting

Format new post titles and URLs


    new_post_info = []
    for new in now_posts_url_title_set:
        for incremental in differences:
            if incremental in new: # the "URL>>>title" string contains a diff URL
                new_post_info.append(new.split(">>>"))
    return new_post_info

From the "URL>>>title" strings prepared in 2, only the entries containing a URL that matches the difference are kept.

(Diagram: IMG_7687 2.jpg)

Since these are plain strings, the in operator is enough to check whether one string is contained in another, as in the example below. This gives the URL and title of each new article.
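
For example, with made-up values:

print('/items/ccc' in '/items/ccc>>>New article title') # True
print('/items/aaa' in '/items/ccc>>>New article title') # False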

What's next

I want to be notified on LINE

~~I want to be able to send regular notifications to a chat with my friend. Later.~~

2020/03/09 postscript

I used LINE Notify.


def send_line_notify(posts, token):
    '''
    Takes the return value of new_post_getter as its first argument
    '''
    notice_url = "https://notify-api.line.me/api/notify"
    headers = {"Authorization" : "Bearer " + token}
    for url, title in posts:
        if 'http' not in url:
            url = 'https://qiita.com/' + url # Qiita hrefs are relative, so prepend the domain
        message = f'{title}:{url}'
        payload = {'message': message}
        r = requests.post(notice_url, headers=headers, params=payload)

Use it like this



token = '########'
neko_post = new_post_getter(neko_url, neko_selecter, neko_txt) # URL, selector and txt path, as before
send_line_notify(neko_post, token)

If you pass the return value of the new_post_getter function and a token as arguments, the message is sent to LINE Notify. I referred to here.

I want to run it regularly

~~I want to run it every minute using PythonAnywhere. Later.~~

2020/03/09: I copied the files to PythonAnywhere and created the following .sh

So that cron can use the virtual environment



source /home/<account>/blog_post_notice/venv/bin/activate
python3 /home/<account>/blog_post_notice/send.py

Then, when I tried to run the .sh before setting up cron ...

Error



requests.exceptions.ProxyError: HTTPSConnectionPool(host='qiita.com', port=443): Max retries exceeded with url: /takuto_neko_like (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 403 Forbidden')))

After investigating, it seems that on free accounts PythonAnywhere only allows access to external sites on a whitelist, to prevent abuse. So I gave up on PythonAnywhere ...

I then tried deploying to Heroku. However, files cannot be persisted on Heroku, so this approach of overwriting a txt file in the same directory from Python doesn't work. I also tried updating the file through the Google Drive and Dropbox APIs from Python: it seems I can get file names and metadata and add new files, but I couldn't figure out how to get at a file's contents.

Therefore, this time I will set up cron on my PC and run it regularly.

In crontab -e ...

For the time being, try running it every minute


* * * * * sh /Users/User name/dir1/post_notice/notice.sh
