Get RSS feed using Python + pandas → Post to Mattermost & Save to DB

What I did

I wanted to set up Mattermost and try something with it, so I wrote a program that posts RSS feeds. (Let's keep it a secret that I only realized later that there was already an official project for this.)

I thought the approach could be applied in various ways, so I process the fetched feed with pandas and also store it in a database.

Environment

I set up PostgreSQL from the Docker image.

docker pull postgres:9.5
docker run -p 5432:5432 --name postgres-server -v /var/lib/postgresql:/var/lib/postgresql:rw  postgres:9.5
firewall-cmd --permanent --add-port=5432/tcp
firewall-cmd --reload

This should launch a PostgreSQL container that accepts remote connections, which is good enough for now. Docker is convenient.

The rest of the environment setup is omitted. Python runs in a pyenv environment.

1. Get an RSS feed

I use a Python library called feedparser. I installed it with pip, referring to the following article.

http://qiita.com/shunsuke227ono/items/da52a290f78924c1f485

import feedparser

RSS_URL = "http://b.hatena.ne.jp/hotentry/it.rss"
print("Start get feed from %s" % (RSS_URL))
feed = feedparser.parse(RSS_URL)

Now you have the feed. (By the way, this fetches the hot entries in the technology category of Hatena.)

2. Extract the obtained feed to pandas.DataFrame

Load it into a pandas.DataFrame to make later processing easier.

import pandas as pd
entries = pd.DataFrame(feed.entries)

...and that's all. pandas is excellent.

In the case of Hatena's RSS feed, 12 columns were obtained.

At this point, you can freely manipulate the data with pandas functions.
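As a rough illustration of that kind of manipulation, here is a minimal sketch that uses a hand-made list of dictionaries in place of feed.entries. The sample titles, links, and dates are made up, and the column names (title, link, updated) are assumed to match what feedparser returns for this feed.

```python
import pandas as pd

# Hand-made stand-in for feed.entries (hypothetical sample data).
sample_entries = [
    {"title": "Intro to pandas", "link": "http://example.com/1",
     "updated": "2016-05-01T10:00:00+09:00"},
    {"title": "Docker tips", "link": "http://example.com/2",
     "updated": "2016-05-02T09:30:00+09:00"},
    {"title": "pandas and SQL", "link": "http://example.com/3",
     "updated": "2016-05-03T08:15:00+09:00"},
]
entries = pd.DataFrame(sample_entries)

# Keep only the columns of interest.
slim = entries[["title", "link", "updated"]]

# Keep the rows whose title mentions "pandas" (case-insensitive).
pandas_only = slim[slim["title"].str.contains("pandas", case=False)]

# Sort newest first. ISO 8601 strings with the same UTC offset
# sort chronologically when compared as plain strings.
newest_first = pandas_only.sort_values("updated", ascending=False)

print(newest_first["title"].tolist())
```

The same column selection and filtering should carry over to the real entries DataFrame, as long as the columns exist under those names.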

3. Check for new feeds

feedparser is very convenient, but it returns whatever is in the feed at the time of access, so the result overlaps with entries fetched earlier.

This is where loading the feed into a DataFrame pays off! The following example extracts and displays only the new entries by operating on the DataFrame.

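import re
import time

# Assumed definition (not shown in the original): strips HTML tags from the summary
tag_re = re.compile(r"<[^>]*?>")

# IDs of the entries that have already been displayed
already_print_feeds = pd.Series(dtype=object)

while True:
    time.sleep(300)  # poll every 5 minutes
    feed = feedparser.parse(RSS_URL)
    entries = pd.DataFrame(feed.entries)
    # Keep only the entries whose id has not been displayed yet
    new_entries = entries[~entries['id'].isin(already_print_feeds)]
    if not new_entries.empty:
        for key, row in new_entries.iterrows():
            feedinfo = "[**%s**](%s)\n\n>%s" % (row['title'], row['link'], tag_re.sub('', row['summary']))
            print(feedinfo)
    # Series.append was removed in pandas 2.0; pd.concat does the same job
    already_print_feeds = pd.concat([already_print_feeds, new_entries['id']])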
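The message formatting inside the loop can be pulled out into a small function and tried on its own. This is only a sketch: the tag_re pattern is my assumption of a simple "strip anything that looks like an HTML tag" regular expression, and format_entry is a hypothetical helper name, not something from the original program.

```python
import re

# Assumed pattern: matches anything that looks like an HTML tag.
tag_re = re.compile(r"<[^>]*>")


def format_entry(title, link, summary):
    """Build a Markdown message: a bold linked title, then the summary as a block quote."""
    plain_summary = tag_re.sub("", summary)
    return "[**%s**](%s)\n\n>%s" % (title, link, plain_summary)


message = format_entry(
    "Sample title",
    "http://example.com/entry",
    "<p>Summary with <b>markup</b></p>",
)
print(message)
```

A regular expression like this is enough for tidy feed summaries, but it is not a real HTML parser, so unusual markup may slip through.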

A little commentary

new_entries = entries[~entries['id'].isin(already_print_feeds)]

This line extracts only the new entries from the fetched RSS feed.

It assumes that `already_print_feeds` contains the `id` values of the entries fetched so far.

The expression below returns a Series that is True only for the rows of `entries` that are new. If you use that Series to index `entries`, you can extract just the new entries.

~entries['id'].isin(already_print_feeds)
# =>
0     False
1     True # => ★New!
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    True # => ★New!
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
Name: id, dtype: bool
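The same masking can be tried on a tiny hand-made DataFrame. This is a minimal sketch with made-up IDs and titles, only meant to show how the boolean Series selects rows.

```python
import pandas as pd

# Hypothetical feed with three entries.
entries = pd.DataFrame({
    "id": ["a", "b", "c"],
    "title": ["first", "second", "third"],
})

# IDs that were already displayed.
already_print_feeds = pd.Series(["a", "c"])

# True only where the id has NOT been seen before.
mask = ~entries["id"].isin(already_print_feeds)

# Indexing with the mask keeps just the rows marked True.
new_entries = entries[mask]

print(mask.tolist())
print(new_entries["title"].tolist())
```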

The IDs of the newly displayed entries are then added to `already_print_feeds`.

already_print_feeds = pd.concat([already_print_feeds, new_entries['id']])

Warning: with the above code, data accumulates in `already_print_feeds` indefinitely, so it will eventually run out of memory. Clear it once a day, or read the known IDs from the DB instead.
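One way to keep the set of known IDs from growing forever is to cap its size and drop the oldest IDs first. The following is only a sketch of that idea using the standard library; the SeenIds class and its max_size parameter are hypothetical names introduced here, not part of the original program.

```python
from collections import OrderedDict


class SeenIds:
    """Remember at most max_size IDs, discarding the oldest first."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._ids = OrderedDict()

    def add(self, entry_id):
        # Re-adding an ID moves it to the "newest" end.
        self._ids[entry_id] = True
        self._ids.move_to_end(entry_id)
        while len(self._ids) > self.max_size:
            self._ids.popitem(last=False)  # drop the oldest ID

    def ids(self):
        # Return the remembered IDs, oldest first.
        return list(self._ids)

    def __contains__(self, entry_id):
        return entry_id in self._ids

    def __len__(self):
        return len(self._ids)


seen = SeenIds(max_size=3)
for entry_id in ["a", "b", "c", "d"]:
    seen.add(entry_id)

print("a" in seen, "d" in seen, len(seen))
```

In the polling loop, something like entries['id'].isin(seen.ids()) could take the place of the Series. The cap needs to be comfortably larger than the number of entries in a single fetch, otherwise entries that are still in the feed would be forgotten and shown again.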

4. Save to DB (PostgreSQL)

Save the obtained RSS feed in PostgreSQL. The columns are narrowed down to id, link, title, summary, and updated.

First, create a table in the DB.

create table feed ( id text primary key , link text, title text, summary text, updated timestamp );

For now, I put a primary key constraint on id and made updated a timestamp column. (It seems that the updated value in Hatena's feed can be INSERTed into a timestamp column as is.)

from sqlalchemy import create_engine

DATABASE_CONN = "postgresql://xxxx:xxxx@xxxxx:xxx/xxxx"
DATABASE_TABLE = "feed"
# connect database
engine = create_engine(DATABASE_CONN)

# Store database
# .ix was removed in pandas 1.0; .loc selects the same columns by label
stored_entries = new_entries.loc[:, [
                "id", "link", "title", "summary", "updated"]]
stored_entries.to_sql(DATABASE_TABLE, engine, index=False, if_exists='append')

Use the DataFrame's to_sql method.

Passing index=False keeps the DataFrame's index from being stored as an extra column.

Passing if_exists='append' makes it add the rows to the table that already exists.
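To try these two options without a PostgreSQL server, here is a minimal sketch that swaps in an in-memory SQLite database, passing the sqlite3 connection directly to pandas. The sample rows are made up; the table definition follows the one shown earlier.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table feed (id text primary key, link text, title text, "
    "summary text, updated timestamp)"
)

# Hypothetical rows shaped like the narrowed-down feed columns.
new_entries = pd.DataFrame([
    {"id": "http://example.com/1", "link": "http://example.com/1",
     "title": "First", "summary": "one",
     "updated": "2016-05-01T10:00:00+09:00"},
    {"id": "http://example.com/2", "link": "http://example.com/2",
     "title": "Second", "summary": "two",
     "updated": "2016-05-02T09:30:00+09:00"},
])

# index=False: do not store the DataFrame index as a column.
# if_exists='append': add rows to the existing table instead of failing or replacing it.
new_entries.to_sql("feed", conn, index=False, if_exists="append")

stored = pd.read_sql("select * from feed", conn)
print(list(stored.columns))
print(len(stored))
```

Because id is a primary key, appending a row whose id is already in the table would be rejected by the database, which is one more reason to filter down to new entries before storing.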

5. Post to Mattermost

Posting is very easy with requests, a Python library for sending HTTP requests.

import requests
import json

mattermosturl = "Mattermost incoming webhook URL"
username = "Favorite name"
header = {'Content-Type': 'application/json'}
payload = {
        "text": feedinfo,
        "username": username,
        }

resp = requests.post(mattermosturl,
                     headers=header, data=json.dumps(payload))
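The post can also be wrapped in a function so that it is easy to call once per new entry. This is a sketch under my own assumptions: build_payload and post_to_mattermost are hypothetical helper names, and the timeout and the raise_for_status() check are additions that the original code does not have.

```python
import json

import requests


def build_payload(text, username):
    """Build the JSON body for the incoming webhook."""
    return {"text": text, "username": username}


def post_to_mattermost(webhook_url, text, username, timeout=10):
    """Send one message to the webhook and raise if the server reports an error."""
    resp = requests.post(
        webhook_url,
        headers={"Content-Type": "application/json"},
        data=json.dumps(build_payload(text, username)),
        timeout=timeout,  # avoid hanging forever if the server does not answer
    )
    resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return resp


# Only the payload is built here; no request is sent.
print(json.dumps(build_payload("hello", "Favorite name")))
```

Inside the polling loop, a call such as post_to_mattermost(mattermosturl, feedinfo, username) could replace or accompany print(feedinfo).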

In closing

Now that the feed is loaded into pandas, I'd like to try machine learning on it as well.
