As you may know, Twitter has a function to download all your past tweets: go to Settings > Accounts > Twitter Data > Download Archive.
The downloaded file contains information about your past tweets, retweets, likes, direct messages, and more. (Apparently you can browse it by opening index.html, which is usually included in the download, but in my case index.html was not included. Why?)
Hands-on to visualize your tweets while understanding BERT
↑ I noticed this after reading this article.
↓ (Reference)
[Solved] I can't download all tweet history on twitter [Method]
If you want to do text mining or some other kind of analysis, you'll probably want to read tweet.json. In this article, we will process this json file into a csv that is easy to use for morphological analysis. The resulting csv has two columns, "Timestamp" and "Text Body".
Image of the CSV that is finally created
Environment
Python 3.6.5
macOS Mojave 10.14.4
pandas==0.23.0
When you open the downloaded json, it looks like this.
The part underlined in red,
window.YTD.tweet.part0 =
is unnecessary, so please delete it. Then change the extension to .txt and put the file in your working directory.
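If you would rather not edit the file by hand, the same prefix can be skipped in code. The following is only a minimal sketch: the helper name load_archive_json is hypothetical, the sample string is made up, and it assumes the real payload is a JSON array that starts at the first "[" in the file.

```python
import json


def load_archive_json(raw_text):
    """Parse archive text, ignoring everything before the first '['.

    Hypothetical helper: assumes the payload is a JSON array and that
    the assignment prefix itself contains no '[' character.
    """
    start = raw_text.index("[")
    return json.loads(raw_text[start:])


# Made-up sample in the same shape as the downloaded file.
sample = (
    'window.YTD.tweet.part0 = '
    '[{"created_at": "Wed Oct 10 20:19:24 +0000 2018", "full_text": "hello"}]'
)
tweets = load_archive_json(sample)
print(tweets[0]["full_text"])
```

With a real archive you would read the file contents into a string first and pass that string to the helper.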
read_dl_tweet.py
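import json
import pandas as pd
# Load the edited archive (prefix removed, saved as tweet.txt)
with open("tweet.txt", "r", encoding="utf-8") as tweets_file:
    tweet = json.load(tweets_file)
# Build the dataframe used below; assumes each entry is a flat tweet object
tweet_data_frame = pd.DataFrame(tweet)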
The script above loads the json and turns it into a pandas dataframe. There are many columns, so only the necessary ones are extracted.
read_dl_tweet.py
df = tweet_data_frame.loc[:,["created_at","full_text"]]
Line breaks and commas cause trouble when writing a csv, so remove them. This didn't work without regex=True, because without it replace only matches whole cell values rather than substrings inside the text.
read_dl_tweet.py
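# Clean only the tweet text so the created_at strings stay parseable
df["full_text"] = df["full_text"].replace(['\n', ',', ' ', '\r'], '', regex=True)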
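The effect of regex=True is easy to check on a tiny dataframe. This is a sketch with made-up sample text, not part of the original script; it only removes line breaks and commas to keep the result readable.

```python
import pandas as pd

# Made-up sample: one tweet with an embedded line break and a comma,
# and one cell that consists of nothing but a line break.
sample = pd.DataFrame({"full_text": ["line one\nline two, end", "\n"]})

# Without regex=True, replace() only matches cells whose whole value
# equals one of the listed strings.
plain = sample.replace(["\n", ","], "")

# With regex=True, each pattern is removed wherever it occurs in a string.
cleaned = sample.replace(["\n", ","], "", regex=True)

print(plain["full_text"].tolist())
print(cleaned["full_text"].tolist())
```

In the first result only the cell that is exactly a line break gets emptied; in the second, the line break and comma inside the longer text are removed as well.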
Also, the timestamp is in a format that cannot be used for sorting, so convert it to one that is also easier to read. The to_datetime function of pandas converted it in one shot.
read_dl_tweet.py
df_date = pd.to_datetime(df["created_at"])
df["date_form"] = df_date
df_sorted = df.sort_values("date_form")
df_text_date = df_sorted.loc[:,["date_form","full_text"]]
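To see why the conversion matters for sorting, here is a small sketch. The sample rows are made up, and both the created_at layout ("Wed Oct 10 20:19:24 +0000 2018") and the explicit format string are assumptions about what the archive contains; the script above lets pandas infer the format instead.

```python
import pandas as pd

# Made-up rows in the assumed created_at layout.
sample = pd.DataFrame({
    "created_at": [
        "Wed Oct 10 20:19:24 +0000 2018",
        "Mon Jan 01 09:00:00 +0000 2018",
        "Fri Mar 15 12:30:00 +0000 2019",
    ],
    "full_text": ["second", "first", "third"],
})

# Sorting the raw strings orders the rows alphabetically by weekday name.
by_string = sample.sort_values("created_at")["full_text"].tolist()

# Parsing the strings first gives a true chronological order.
sample["date_form"] = pd.to_datetime(
    sample["created_at"], format="%a %b %d %H:%M:%S %z %Y"
)
by_time = sample.sort_values("date_form")["full_text"].tolist()

print(by_string)
print(by_time)
```

Passing an explicit format is optional here; it mainly documents the layout the code expects.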
The rows are now sorted by the newly created timestamp column. Finally, write them out as a csv.
read_dl_tweet.py
df_text_date.to_csv("df_text_date.csv", header=False, index=False,sep=',',encoding='utf-16')
Please change the options used when writing the csv as appropriate (for example, making the delimiter a tab).
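As one example of changing the options, here is a sketch that writes tab-separated output. The sample row is made up, and the output goes to an in-memory buffer instead of a file so the result can be inspected directly.

```python
import io

import pandas as pd

# Made-up row; note that the text still contains a comma.
sample = pd.DataFrame({
    "date_form": ["2018-10-10 20:19:24"],
    "full_text": ["hello, world"],
})

# Same options as the script above, except for the tab delimiter.
buffer = io.StringIO()
sample.to_csv(buffer, header=False, index=False, sep="\t")
output = buffer.getvalue()

print(repr(output))
```

With a tab as the delimiter, a comma inside the text no longer collides with the column separator, so whether commas still need to be removed depends on the tool that reads the file.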
In the next article, I will graph the number of tweets for each period from the csv created here.
This code: https://github.com/KanikaniYou/plot_tweet_graph