Collect Japanese tweets that do not include images, URLs or replies in Python

Background

I wanted a large amount of Japanese text for machine learning, so I decided to collect tweets via Twitter's Streaming API.

Rather than collecting the data myself, I would have happily used an existing dump if someone had published one, but a few minutes of searching Google turned up nothing suitable. It was the beginning of 2017, so I wrote the collector myself.

I'm using a library called Twython. (I used to use Tweepy, but Twython seems to be more popular these days.)

Tweets you want to exclude

- Tweets with images
- Tweets containing URLs
- Tweets with hashtags
- Retweets
- Tweets with reply mentions

I excluded these tweets because I judged them unsuitable as corpus material.
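Excluding URLs, media, and hashtags can be done in a single check, because the Streaming API's entities field holds empty lists when a tweet has none of them. A minimal sketch of that check (the payloads below are hypothetical, simplified tweet objects, not full API responses):

```python
# Hypothetical, simplified tweet payloads mimicking the Streaming API's
# "entities" field: each entry is an empty list when the tweet has none.
plain_tweet = {"entities": {"hashtags": [], "urls": [], "user_mentions": []}}
url_tweet = {"entities": {"hashtags": [],
                          "urls": [{"url": "https://t.co/x"}],
                          "user_mentions": []}}

def has_entities(tweet):
    # True if any entity list is non-empty, i.e. the tweet is excluded.
    return any(tweet["entities"].values())

assert not has_entities(plain_tweet)
assert has_entities(url_tweet)
```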

File format

- One tweet per line, delimited by LF
- Line breaks within a tweet are encoded as CR

This preserves each tweet's line-break information while keeping the file easy to process as "one line, one tweet".
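The CR/LF scheme round-trips cleanly. A small sketch of the write and read sides (note that records must be split on LF with split("\n"), not splitlines(), because splitlines() would also split on the embedded CRs):

```python
# Hypothetical tweets, one of which contains an embedded newline.
raw_tweets = ["first line\nsecond line", "single line"]

# Write side: embedded LF becomes CR; each tweet is terminated by LF.
corpus = "".join(t.replace("\n", "\r") + "\n" for t in raw_tweets)

# Read side: split records on LF only, then restore CR back to LF.
restored = [line.replace("\r", "\n") for line in corpus.split("\n") if line]
assert restored == raw_tweets
```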

How to use

Run it like this and the tweets are written to standard output. In this example, the program exits after collecting 10 valid tweets. (The count is set with the -n option.)

$ python tweetcorpus.py -n 10

Application

With a shell loop like this, collection continues almost indefinitely even if an error occurs along the way. Piping through tee /dev/tty lets you watch the progress, and piping into gzip keeps the output manageable when collecting a large number of tweets.

$ while true; do python -u tweetcorpus.py -n 500 | tee /dev/tty | gzip -cn >> tweet.gz ; sleep 1 ; done

(For why the gzip outputs can simply be appended, see "Gzip-compressed text files can be concatenated with cat". For Python's -u option, see "Option to disable the stdout/stderr buffer in Python".)

Personally, I prefer keeping the program itself simple and connecting it with pipes over using each language's own gzip module.
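Python's gzip module reads such an append-produced, multi-member file transparently, so the collected archive stays easy to consume later. A self-contained sketch simulating two appended gzip chunks:

```python
import gzip
import io

# Simulate two runs of `gzip -cn >> tweet.gz`: two independent gzip
# members concatenated into a single byte stream.
data = gzip.compress(b"tweet one\n") + gzip.compress(b"tweet two\n")

# GzipFile (via gzip.open) reads every member of a multi-member stream.
with gzip.open(io.BytesIO(data), "rt", encoding="utf-8") as f:
    text = f.read()
assert text == "tweet one\ntweet two\n"
```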

Environment variables

The OAuth credentials for the Twitter API are read from the environment variables APP_KEY, APP_SECRET, OAUTH_TOKEN, and OAUTH_TOKEN_SECRET.

Create an application on Twitter and prepare the following file.

.env


#!/bin/sh
export APP_KEY='XXXXXXXXXXXXX'
export APP_SECRET='XXXXXXXXXXXXXXXXXXXX'
export OAUTH_TOKEN='XXXXX-XXXXXXXXXX'
export OAUTH_TOKEN_SECRET='XXXXXXXXXX'

Then load it into the shell before running the script:

$ source ./.env
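If any of these variables is unset, os.environ[...] raises a bare KeyError. A small helper (hypothetical, not part of the script below) can fail with a clearer message instead:

```python
import os

REQUIRED = ("APP_KEY", "APP_SECRET", "OAUTH_TOKEN", "OAUTH_TOKEN_SECRET")

def load_credentials():
    """Return the four OAuth values, or exit naming the unset variables."""
    missing = [name for name in REQUIRED if name not in os.environ]
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    return tuple(os.environ[name] for name in REQUIRED)
```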

Preparation

If you have a Python environment, install Twython and you're good to go.

$ pip3 install twython==3.4.0

Source

tweetcorpus.py


import argparse
import html
import os
import sys

from twython import TwythonStreamer


class CorpusStreamer(TwythonStreamer):

    def __init__(self, *args,
                 max_corpus_tweets=100,
                 write_file=sys.stdout):
        super().__init__(*args)
        self.corpus_tweets = 0
        self.max_corpus_tweets = max_corpus_tweets
        self.write_file = write_file

    def exit_when_corpus_tweets_exceeded(self):
        if self.corpus_tweets >= self.max_corpus_tweets:
            self.disconnect()

    def write(self, text):
        corpus_text = text.replace('\n', '\r')
        self.write_file.write(corpus_text + '\n')
        self.corpus_tweets += 1

    def on_success(self, tweet):
        if 'text' not in tweet:
            # Exclude payloads that are not tweets (deletion notices, etc.)
            return
        if 'retweeted_status' in tweet:
            # Exclude retweets
            return
        if any(tweet['entities'].values()):
            # Exclude tweets carrying any entities (hashtags, urls,
            # user_mentions, symbols, media): they contain information
            # that cannot be handled by natural language processing alone.
            return
        text = html.unescape(tweet['text'])
        self.write(text)
        self.exit_when_corpus_tweets_exceeded()


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--number-of-corpus-tweets',
                        type=int, default=100)
    parser.add_argument('-o', '--outfile',
                        type=argparse.FileType('w', encoding='UTF-8'),
                        default=sys.stdout)
    parser.add_argument('-l', '--language', type=str, default='ja')

    app_key = os.environ['APP_KEY']
    app_secret = os.environ['APP_SECRET']
    oauth_token = os.environ['OAUTH_TOKEN']
    oauth_token_secret = os.environ['OAUTH_TOKEN_SECRET']

    args = parser.parse_args()
    stream = CorpusStreamer(app_key, app_secret,
                            oauth_token, oauth_token_secret,
                            max_corpus_tweets=args.number_of_corpus_tweets,
                            write_file=args.outfile)
    stream.statuses.sample(language=args.language)


if __name__ == '__main__':
    main()

Environment

I tested this with Python 3.6, the latest version at the time, but it should work on any Python 3 where Twython can be installed.
