[PYTHON] Maybe you can scrape using Twitter Scraper

Getting Twitter API access on your own has become a hassle, but you may still want to scrape Twitter. There is an open-source Python library called Twitter Scraper that may help.

Introduction

Scraping Twitter without Twitter's prior consent is prohibited. This article is therefore a hypothetical piece about how the library might be used. (Consider this my disclaimer.)

(iii) access or search or attempt to access or search the Services by any means (automated or otherwise) other than through our currently available, published interfaces that are provided by Twitter (and only pursuant to the applicable terms and conditions), unless you have been specifically allowed to do so in a separate agreement with Twitter (NOTE: crawling the Services is permissible if done in accordance with the provisions of the robots.txt file, however, scraping the Services without the prior consent of Twitter is expressly prohibited). https://twitter.com/ja/tos/previous/version_9

If you read this and decide to use the library anyway, you do so at your own risk.

What I want to do

The October 2019 issue of Comptiq ran a special feature on Aikatsu on Parade!. It included a segment called the "Aikatsu on Parade! Emergency Reader Questionnaire", a survey conducted on Twitter for a short period, from 18:00 on August 23, 2019 to 17:59 on August 25, 2019.

I wanted to see whether I could use the library to collect the questionnaire responses.

Implementation

from twitterscraper import query_tweets
import datetime as dt
import pandas as pd


# input
begin_date = dt.date(2019,8,23)
end_date = dt.date(2019,9,1)
pool_size = (end_date - begin_date).days

# Collect tweets matching the hashtag
tweets = query_tweets("#Comp Aikatsu Questionnaire", begindate=begin_date, enddate=end_date, poolsize=pool_size, lang="ja")

tuple_tweet = [(tweet.user_id, tweet.text.replace("\n", "\t"), tweet.timestamp) for tweet in tweets]

# The scraper returns duplicate records, so deduplicate via set
df = pd.DataFrame(set(tuple_tweet), columns=['user_id', 'tweet', 'post'])

# Sort by timestamp (assign the result back, since sort_values returns a new DataFrame)
df = df.sort_values('post').reset_index(drop=True)

Since all I need is the tweet content, unnecessary fields such as the user name and the retweet count are dropped.

Since the data needs shaping as a pre-processing step afterwards, line breaks are converted to tab delimiters to make that work easier.
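The newline-to-tab conversion keeps each tweet on a single line of output. A minimal sketch of the idea, using a made-up multi-line tweet:

```python
# A hypothetical multi-line tweet body
tweet_text = "answer 1\nanswer 2\nanswer 3"

# Replace line breaks with tabs so the tweet occupies one line
flattened = tweet_text.replace("\n", "\t")
print(flattened)  # answer 1	answer 2	answer 3
```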

Note that Twitter Scraper returns duplicate records if you simply run the query. (At first glance the `begindate` and `enddate` sub-periods do not seem to overlap, so I don't know the exact cause.) The duplicates are therefore removed with set.
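The set-based deduplication works because each record is a hashable tuple. Pandas' `drop_duplicates` is an equivalent alternative; a minimal sketch with hypothetical sample rows in the same (user_id, tweet, timestamp) shape:

```python
import pandas as pd

# Hypothetical sample records; the second row is an exact duplicate
rows = [
    (1, "answer a\tanswer b", "2019-08-23 18:05"),
    (1, "answer a\tanswer b", "2019-08-23 18:05"),
    (2, "another response", "2019-08-24 09:30"),
]

# Option 1: deduplicate via set before building the DataFrame (as in the article)
df1 = pd.DataFrame(set(rows), columns=["user_id", "tweet", "post"])

# Option 2: build the DataFrame first, then deduplicate with pandas
df2 = pd.DataFrame(rows, columns=["user_id", "tweet", "post"]).drop_duplicates()

print(len(df1), len(df2))  # 2 2
```

Option 2 has the advantage of preserving the original row order, whereas `set` does not.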

I was not sure about `poolsize` either, but from reading the internal processing, it controls how the `since`/`until` period is split into sub-queries, and its default value is 20. If you keep the default for a short period, several sub-queries end up covering the same day and produce duplicate data, so I set `poolsize` to the number of days between the start and end dates.
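To make the effect concrete, here is a sketch of the idea (not the library's exact code): the date range is divided into `poolsize` sub-ranges, so if `poolsize` exceeds the number of days, multiple sub-ranges cover the same day and the same tweets are fetched more than once.

```python
import datetime as dt

def split_date_range(begin, end, poolsize):
    """Sketch of how a date range might be divided into poolsize sub-queries."""
    total_days = (end - begin).days
    step = total_days / poolsize
    points = [begin + dt.timedelta(days=round(step * i)) for i in range(poolsize + 1)]
    return list(zip(points[:-1], points[1:]))

begin = dt.date(2019, 8, 23)
end = dt.date(2019, 9, 1)  # a 9-day range

# With poolsize == number of days, each sub-range covers one distinct day
for since, until in split_date_range(begin, end, (end - begin).days):
    print(since, until)

# With the default poolsize of 20, the 9 days would be split 20 ways,
# so several sub-ranges fall on the same day and yield duplicate tweets
```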

Finally

"Aikatsu on Parade! 』Is ** TV TOKYO system Every Saturday from 10:30 am BS TV Tokyo every Monday from 5 pm ** Now on air!

The analysis results are also posted on my Hatena blog, so if you are interested, please check them out as well.

The most popular characters in the "Aikatsu!" series, from a total of 816 votes

The top 30 most popular episodes of the "Aikatsu!" series, from a total of 1036 votes
