[PYTHON] Experiment to collect tweets for a long time (Program preparation (1))

So far

- [x] The required specifications (?) have been decided for the time being.

Think first

Once I have decided what I want to do and how far to take it, my style is to **start by googling**. In short, "**80% of what I want to do has already been done by someone else**". Even on Qiita there are people doing similar things (or rather, more advanced things), and I have no choice but to use them as references, but unfortunately none of them can be used exactly as they are. As far as I could find on Stack Overflow, nobody had published a program that properly meets the required specifications. It would have been easier if there were one ... tch.

Even for something within the scope of ~~script kiddie~~ personal use, if it has to meet the requirements, I will have to put this part together myself.

Why Public Stream in the first place?

There are roughly two ways to get data from Twitter: using the **REST API** and using the **Streaming API**. With **REST** you send a request and get a result back each time, while with the **Streaming API** you send a request once and the results then keep being sent to you endlessly. (That is my understanding. Please check the literature for details.)

The reasons for choosing Public Stream in the first place are:

  1. It was quite possible that the expected volume of tweets would exceed the limit of what can be retrieved through REST.
  2. Judging from the Twitter API Pocket Reference that I bought and read, doing this with REST looked unexpectedly tedious (as far as I could tell from reading alone).

It is a very lazy reason: the Streaming API seems to be less trouble than REST if you just bring in a suitable library and read from it. All I should need to do is keep throwing whatever is sent into the DB.

There are also several kinds of Streaming API: **"everything that flows through Twitter (contract required)"**, **"the tweets on my own timeline"**, **"a random 1% of the total"**, and **"results narrowed down from the whole by keywords, locale, and so on"**. Here I use the last one, **"narrowed down by the specified search words"** (POST statuses/filter: https://dev.twitter.com/streaming/reference/post/statuses/filter). Public Stream seems to be a generic term for these, but I'm not sure about that. (I just looked it up, but is it right to understand that what you can get with filter is not limited to 1% of tweets, but is all of the matching ones?)

Disconnect and reconnect

Maybe it is because I am an old-timer, but the fact that the Streaming API depends on keeping an HTTP connection open makes me uneasy. Or rather, I am wary because **I expect it to be disconnected even when nothing is wrong**. The reference book mentioned earlier (the Pocket Reference) also says that you need to assume reconnection, because the connection will be dropped if something happens. ... But on the pages of well-known libraries and in the examples of people implementing this themselves, I could not find a single page describing reconnection after a disconnect, at least within the range I looked at. ... Does the API take care of that for you? If so, nothing could be easier ...
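To make the idea of "assume reconnection" concrete, here is a minimal sketch of a reconnect loop with exponential backoff. It is not tied to Tweepy: `connect` is a hypothetical callable that blocks while the stream is open and returns or raises when the connection ends, and `max_retries`, `base_delay` and `max_delay` are illustrative parameters, not values recommended by Twitter.

```python
import time


def run_with_reconnect(connect, max_retries=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call connect() again every time it ends, waiting longer each time.

    connect is a hypothetical callable that blocks while the stream is open.
    Returns the list of delays that were waited, so callers can inspect it.
    """
    delays = []
    for attempt in range(max_retries):
        try:
            connect()                      # blocks until the connection ends
        except Exception as exc:           # a real program would narrow this
            print('disconnected with error:', exc)
        else:
            print('disconnected without error')
        # Double the wait on every attempt, but never exceed max_delay.
        delay = min(base_delay * (2 ** attempt), max_delay)
        delays.append(delay)
        sleep(delay)
    return delays
```

A program that has to run for months would loop without an upper bound and reset the delay after a connection that stayed up for a while. The bounded loop here only keeps the sketch easy to try out.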

(Note for those in the know: I am writing in chronological order, recalling what went through my head while coding, so please bear with me for a while.)

Library selection

There is a lot to think about, but since I want to refer to the wonderful code of my predecessors, I decided to look for a Twitter library with plenty of Japanese-language material.

  • Of course, it needs to support the Streaming API
  • Ideally, googling should turn up answers quickly

Based on that, I decided to use Tweepy.

When I looked for a library on the database side from the same viewpoint, the criteria were:

  • Above all, the amount of code to write is small
  • Googling usually turns up the answer
  • It is official from MongoDB, or something close to that

That is why I decided to use PyMongo.

A predictable, safe choice? Well, when an amateur has to get something done, sticking to the standard options is the only way ...

Development environment.

Having decided what to use, the next step is to prepare a development environment and a test environment. I am a dyed-in-the-wool Windows person who started with VB and then went C → VC++ → C#. Naturally, the environment I can use for development is also Windows, so it goes without saying that ideally I can develop on Windows right up to just before release (= deployment). Or rather, without an IDE (Integrated Development Environment) I would die instantly. Even more so, if you ask me to do anything on Linux, I am simply at a loss.

Fortunately, since Python is a scripting language, it does not depend much on the environment, and nowadays installing libraries is automated, so there should be far less trouble than in the past.

That settles the infrastructure, but I still want an IDE, ideally a familiar one. ~~Underestimating the world,~~ I googled and found that there is something called Python Tools for Visual Studio. **What's more, it can call a Windows build of Python such as Anaconda, so you can easily debug on the spot**. With that there was no other choice, so I decided to develop and test with the following configuration:

  • Anaconda (Python 3.5.2 :: Anaconda 4.1.1 (64-bit))
  • MongoDB for Windows (MongoDB shell version: 3.2.10)
  • Visual Studio 2013 + Python Tools for Visual Studio (Ver. 2.2.2)
  • Tweepy + PyMongo (latest at the time of installation)

My concerns are:

  1. Will it run without problems in the production environment? Is there anything that depends on the environment?
  2. Can enough testing be done in the development environment? Will unforeseen problems come up?
  3. The versions of the infrastructure components may differ between the development environment and the execution environment.

That is about it, but ... well, a test on the actual machine is needed at the end in any case, and as long as I don't write anything strange no problems should occur, so I will leave it for now. (If this were a job, I would have to pin all of this down ... scary.)

Program requirements

The program I'm making this time carries a brutal rule: once it starts running, it must keep running for 3 months no matter what, without being stopped. Therefore, **I will implement only the essential functions and handle everything else by other means**.

Functions to be implemented with the highest priority

  • Receive Public Stream from Twitter
  • Storage of received data in MongoDB
  • Ability to reconnect in the event of an unexpected disconnect
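The second item above, storing received data in MongoDB, can be sketched roughly as follows. This is only an illustration under assumptions: `collection` is any object with an `insert_one` method (which is what a PyMongo collection provides), `status_json` is one tweet as a plain dict, and the `received_at` field is a name made up for this example.

```python
import datetime


def store_status(collection, status_json):
    """Insert one received tweet (a dict) into a PyMongo-style collection."""
    doc = dict(status_json)    # shallow copy so the caller's dict is untouched
    # 'received_at' is a hypothetical extra field recording the arrival time.
    doc['received_at'] = datetime.datetime.now(datetime.timezone.utc)
    collection.insert_one(doc)
    return doc


# With PyMongo installed this would be used along these lines
# (the database and collection names are assumptions):
#   import pymongo
#   collection = pymongo.MongoClient()['twitter']['tweets']
#   store_status(collection, {'id': 1, 'text': 'hello'})
```

Passing the collection in as an argument keeps the function free of connection details, which also makes it easy to try out with a stand-in object.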

Functions to be implemented as second priority

  • Recording events such as connection, disconnection, and reconnection when they occur
  • In addition to the above, notifying me of each event via Twitter Direct Message (like a worried parent, I want to be contacted)
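For the first of these two items, one simple option is the standard `logging` module. The sketch below is an assumption about how such a record could look, not a settled design: the logger name `tweet_collector`, the log format, and the event names are all made up for illustration. Notification by Direct Message is not covered.

```python
import logging


def make_event_logger(path, name='tweet_collector'):
    """Return a logger that appends timestamped lines to the file at path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:                # avoid adding the handler twice
        handler = logging.FileHandler(path, encoding='utf-8')
        handler.setFormatter(
            logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
        logger.addHandler(handler)
    return logger


def record_event(logger, event, detail=''):
    """Write one event such as 'connected' or 'disconnected' to the log."""
    logger.info('%s %s', event, detail)
```

Each call to `record_event` then leaves one timestamped line in the file, which is enough to reconstruct later when the connection went down and came back.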

Features to implement if possible

  • Notification of daily flow rate, remaining storage capacity, etc.
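The numbers for such a notification could be gathered with the standard library alone. The following is a rough sketch under assumptions: the class name `DailyCounter` and the report format are invented here, and "remaining storage capacity" is taken to mean the free space on the disk that holds `path`.

```python
import collections
import datetime
import shutil


class DailyCounter:
    """Count received tweets per calendar day."""

    def __init__(self):
        self.counts = collections.Counter()

    def add(self, when):
        """Register one tweet received at the datetime `when`."""
        self.counts[when.date()] += 1

    def report(self, day, path='.'):
        """Return a one-line summary for `day` plus free disk space at path."""
        free_gib = shutil.disk_usage(path).free / 2 ** 30
        return '{}: {} tweets, {:.1f} GiB free'.format(
            day.isoformat(), self.counts[day], free_gib)
```

How the summary is delivered (log file, Direct Message, and so on) is left open here.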

Functions to consider if there is spare capacity

  • Automatically ends when a specific date and time arrives
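The check itself is small. A minimal sketch, assuming the end time is given as a naive local `datetime` (the function name `should_stop` and the `now` parameter are made up for illustration):

```python
import datetime


def should_stop(deadline, now=None):
    """Return True once the current time has reached the deadline."""
    if now is None:
        now = datetime.datetime.now()
    return now >= deadline
```

The receiving callback could call this for every tweet and ask the stream to stop once it returns True. How a listener signals "stop" depends on the library and its version, so that part should be checked against the library's documentation.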

Functions not included in the required specifications

  • Rejecting specific blacklisted accounts (as a feature within the program)
  • Processing of tweet data assuming real-time analysis (as a function in the program)

Well, it looks like this. Let's start with the elements with the highest priority and gradually improve the degree of perfection.

Twitter reception program for the time being

With the vows "**I have O'Reilly's Python tutorial**" and "**if I don't understand something, I google it**", I started writing a trial program. Choose "Python Application" when creating a new project in Visual Studio and you are ready to write Python code in a familiar editor. This is convenient. After typing in the tutorial code, run it with [F5]. You can run it with the same steps as a C console application, so there is really no stress. It is a pity that I can't step through the code.

Well, if I can't talk to Twitter there is nothing to discuss, so let's start with that part. Tweepy needs to be installed first. I assumed I would have to type the "pip" command in the Anaconda Prompt, but this can also be done from within Visual Studio.

  1. Select [View] -> [Other Windows] -> [Python Environments] from the menu bar to display the window.
  2. Select pip from the center dropdown in the Python Environments window.
  3. Enter "tweepy" in the "Search PyPI and installed packages" text box.
  4. Click the entry "pip install tweepy" from PyPI. The installation then completes.

It's as easy as NuGet.
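If you want to confirm that the installation reached the interpreter you are actually running, a quick check with the standard library is possible. This is just a convenience sketch. The helper name `is_installed` is made up, and it only tells you whether a top-level package can be found, not which version it is.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found by this interpreter."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == '__main__':
    print('tweepy installed:', is_installed('tweepy'))
```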

Based on a few pages found by googling and the tutorial in the official Tweepy documentation, I wrote a program that gets tweets with the Streaming API.

tweetCheck.py


#!/usr/bin/env python
# -*- coding:utf-8 -*-

import tweepy
#Prepare your own credentials required to call the Twitter API.
CK = ''   # Consumer Key
CS = ''   # Consumer Secret
AT = ''   # Access Token
AS = ''   # Access Token Secret

class Listener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text.encode('shift_jis', 'ignore'))
        return True

    def on_error(self, status_code):
        print('Error occurred: ' + str(status_code))
        return True

#Main processing from here
auth = tweepy.OAuthHandler(CK, CS)
auth.set_access_token(AT, AS)     #Set the access token

listener = Listener()                       #Instance of Listener class
stream = tweepy.Stream(auth, listener)      #Create the stream; reception starts with the call below

#Select one and uncomment it.
#stream.filter(track=['#xxxxxx'])  #Filter by specified search word
stream.sample()                    #Pick up a random 1% of all tweets on Twitter
#stream.userstream()               #User's own TL

…… Eh, 31 lines (including blank lines and comments)? Can this really work?? I ran it, still doubtful. Execution result: I can't read the output (because it is UTF-8), but tweets are being received. I force-quit with Ctrl + C.

Notes

It looks like everything went smoothly, but I actually got stuck in two places.

  • Only one line is displayed, and then the program terminates abnormally right away.
    → The character encoding of the .py file created by Visual Studio was Shift-JIS.
    Select UTF-8 under [File] → [Advanced Save Options] in the menu and save.
  • After fixing the above, output is displayed for a while, but the program still terminates abnormally. The timing is random.
    → The print under "def on_status(self, status):" was originally "print(status.text)", but
    apparently it dies when it tries to display characters that the command prompt cannot display (http://lab.hde.co.jp/2008/08/pythonunicodeencodeerror.html).
    When I converted the encoding, output started to be displayed, although garbled.
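The second problem comes from characters that the console encoding cannot represent. One common way around it, shown here only as a sketch and not as what the program above does, is to round-trip the text through the console encoding with replacement, so that print receives a string rather than bytes. The helper name `console_safe` is made up for this example.

```python
import sys


def console_safe(text, encoding=None):
    """Replace characters the console encoding cannot represent with '?'."""
    if encoding is None:
        # Fall back to ASCII when stdout does not report an encoding.
        encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, errors='replace').decode(encoding)
```

With this, `print(console_safe(status.text))` would show readable text for everything the console can display and a `?` for the rest, instead of raising UnicodeEncodeError.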

The former is fine once it has been fixed. The latter is fine too, because nothing needs to be displayed once the program is running all the time. If a Python first-timer can get this far in a few days, I may unexpectedly make it in time for the end of October.

Next time, I will flesh out this source code. (To be continued)
