[PYTHON] Fake News? Fake Tweet?

Fake News!

Since taking office, Mr. Trump has often dismissed and criticized news coverage with this phrase.

He is also very active on Twitter, and some of his posts are quite aggressive.

While he attacks the media in this way, it is fair to ask how credible his own tweets are.

So this time we will use machine learning to check whether Mr. Trump's tweets look True or Fake.

What to do this time

We will go to Kaggle and use the news dataset from the page Fake and real news dataset: Classifying the news to build a model that judges the authenticity of news published after Mr. Trump took office.

Next, we will extract post-inauguration tweets from the data in Trump Tweets: Tweets from @realdonaldtrump and chart how many of them were judged True and how many Fake.

[Important] Precautions regarding the content of the article

This article approaches machine learning from the angle of "Mr. Trump's tweets".

There is no political intent at all; the machine-learning output simply uses Mr. Trump's tweets as its subject, and I am well aware that the verification result reported here may itself come back labeled Fake. Please keep that in mind as you read.

Please also note that "True" or "Fake" here is only a computational result and does not guarantee what is actually true, and that this article was not written to praise or criticize any particular person. Thank you.

Step1: Pre-process news data for model creation

As a first step, we will preprocess the dataset from Fake and real news dataset: Classifying the news so that it can be used to build a model.

import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import datetime as dt

import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import LabelEncoder

Next, read the csv files from that page.

fake = pd.read_csv("Fake.csv")
true = pd.read_csv("True.csv")

On that page, the data is split into two files: Fake for the false news and True for the real news.

Let's look at the columns and dtypes.

fake.columns
Index(['title', 'text', 'subject', 'date'], dtype='object')

fake.dtypes
title      object
text       object
subject    object
date       object
dtype: object

true.columns
Index(['title', 'text', 'subject', 'date'], dtype='object')

true.dtypes
title      object
text       object
subject    object
date       object
dtype: object

You can see that neither file has a direct indication of True or Fake.

We will add an explicit label to each file and then combine them.

fake["Reality"] = 0
true["Reality"] = 1
#Create a new Reality. Set Fake to 0 and True to 1.

df = pd.concat([fake, true], axis=0)
df = df.reset_index(drop=True)

df.isnull().sum()
title      0
text       0
subject    0
date       0
Reality    0
dtype: int64

df.head()

#Neither of the original files has missing values. The full output is omitted since it would be too large, but it is important to check this in advance.
title text subject date Reality
0 Donald Trump Sends Out Embarrassing New Year’... Donald Trump just couldn t wish all Americans ... News December 31, 2017 0
1 Drunk Bragging Trump Staffer Started Russian ... House Intelligence Committee Chairman Devin Nu... News December 31, 2017 0
2 Sheriff David Clarke Becomes An Internet Joke... On Friday, it was revealed that former Milwauk... News December 30, 2017 0
3 Trump Is So Obsessed He Even Has Obama’s Name... On Christmas day, Donald Trump announced that ... News December 29, 2017 0
4 Pope Francis Just Called Out Donald Trump Dur... Pope Francis used his annual Christmas Day mes... News December 25, 2017 0
df.tail()
title text subject date Reality
44893 'Fully committed' NATO backs new U.S. approach... BRUSSELS (Reuters) - NATO allies on Tuesday we... worldnews August 22, 2017 1
44894 LexisNexis withdrew two products from Chinese ... LONDON (Reuters) - LexisNexis, a provider of l... worldnews August 22, 2017 1
44895 Minsk cultural hub becomes haven from authorities MINSK (Reuters) - In the shadow of disused Sov... worldnews August 22, 2017 1
44896 Vatican upbeat on possibility of Pope Francis ... MOSCOW (Reuters) - Vatican Secretary of State ... worldnews August 22, 2017 1
44897 Indonesia to buy $1.14 billion worth of Russia... JAKARTA (Reuters) - Indonesia will buy 11 Sukh... worldnews August 22, 2017 1

From the above, we can confirm that both Fake and True rows now sit in the same DataFrame.

There is also a subject column, so let's take a look. If any subject clearly has nothing to do with Mr. Trump, we will drop that news.

df["subject"].unique()
array(['News', 'politics', 'Government News', 'left-news', 'US_News',
       'Middle-east', 'politicsNews', 'worldnews'], dtype=object)
#Trump-related news could fall under any of these genres, so we will not filter by subject here.

Next, let's separate the dates.

df = pd.concat([df['title'], df['text'], df['subject'],df["date"]\
                .str.extract('(?P<Month>.*) (?P<Day>.*), (?P<Year>.*)',expand=True),df["date"],df["Reality"]], axis=1)
df.head()
title text subject Month Day Year date Reality
0 Donald Trump Sends Out Embarrassing New Year’... Donald Trump just couldn t wish all Americans ... News December 31 2017 December 31, 2017 0
1 Drunk Bragging Trump Staffer Started Russian ... House Intelligence Committee Chairman Devin Nu... News December 31 2017 December 31, 2017 0
2 Sheriff David Clarke Becomes An Internet Joke... On Friday, it was revealed that former Milwauk... News December 30 2017 December 30, 2017 0
3 Trump Is So Obsessed He Even Has Obama’s Name... On Christmas day, Donald Trump announced that ... News December 29 2017 December 29, 2017 0
4 Pope Francis Just Called Out Donald Trump Dur... Pope Francis used his annual Christmas Day mes... News December 25 2017 December 25, 2017 0

It seemed that the dates could be separated well, but there was a problem.

df.isnull().sum()
title       0
text        0
subject     0
Month      45
Day        45
Year       45
date        0
Reality     0
dtype: int64

Some dates did not parse correctly. Let's separate those rows from the ones that parsed well and check them.

nonull = df.dropna()
nonull.isnull().sum()
title      0
text       0
subject    0
Month      0
Day        0
Year       0
date       0
Reality    0
dtype: int64
null = df[df.isnull().any(axis=1)]
#Extract only the rows where the date parsing failed.
null.head()
title text subject Month Day Year date Reality
9050 Democrat Senator Warns Mueller Not To Release ... According to The Hill, Democrat Senator Bob Ca... politics NaN NaN NaN 19-Feb-18 0
9051 MSNBC ANCHOR Flabbergasted at What Texas Teach... If we protect every other government building ... politics NaN NaN NaN 19-Feb-18 0
9052 WATCH: SNOWFLAKES ASKED Communist Party Platfo... Ami Horowitz is fantastic! Check out this man ... politics NaN NaN NaN 19-Feb-18 0
9053 JUST IN: BADASS GENERAL JOHN KELLY Shoved Chin... Just one more reminder of why President Trump ... politics NaN NaN NaN 18-Feb-18 0
9054 DOJ’s JEFF SESSIONS Opens Investigation Into W... Thank goodnesss Jeff Sessions is moving on fin... politics NaN NaN NaN 18-Feb-18 0
null = null.drop(["Month", "Day", "Year"], axis=1)
#Recreate the date here
null = null.reset_index(drop=True)
null.head()
title text subject date Reality
0 Democrat Senator Warns Mueller Not To Release ... According to The Hill, Democrat Senator Bob Ca... politics 19-Feb-18 0
1 MSNBC ANCHOR Flabbergasted at What Texas Teach... If we protect every other government building ... politics 19-Feb-18 0
2 WATCH: SNOWFLAKES ASKED Communist Party Platfo... Ami Horowitz is fantastic! Check out this man ... politics 19-Feb-18 0
3 JUST IN: BADASS GENERAL JOHN KELLY Shoved Chin... Just one more reminder of why President Trump ... politics 18-Feb-18 0
4 DOJ’s JEFF SESSIONS Opens Investigation Into W... Thank goodnesss Jeff Sessions is moving on fin... politics 18-Feb-18 0

Let's also look at the end.

null.tail()
title text subject date Reality
40 https://fedup.wpengine.com/wp-content/uploads/... https://fedup.wpengine.com/wp-content/uploads/... Government News https://fedup.wpengine.com/wp-content/uploads/... 0
41 https://fedup.wpengine.com/wp-content/uploads/... https://fedup.wpengine.com/wp-content/uploads/... Government News https://fedup.wpengine.com/wp-content/uploads/... 0
42 Homepage [vc_row][vc_column width= 1/1 ][td_block_trend... left-news MSNBC HOST Rudely Assumes Steel Worker Would N... 0
43 https://fedup.wpengine.com/wp-content/uploads/... https://fedup.wpengine.com/wp-content/uploads/... left-news https://fedup.wpengine.com/wp-content/uploads/... 0
44 https://fedup.wpengine.com/wp-content/uploads/... https://fedup.wpengine.com/wp-content/uploads/... left-news https://fedup.wpengine.com/wp-content/uploads/... 0

Some rows even have URLs in the date field. The other columns in those rows are also URLs, so we will exclude such rows.

Most of the remaining dates are written in the form DD-Mon-YY (e.g. 19-Feb-18), so let's start by extracting that.

null_dates = null["date"].str.extract('(?P<Day>.*)-(?P<Month>.*)-(?P<Year>.*)',expand=True)
null_dates.dtypes
Day      object
Month    object
Year     object
dtype: object

null_dates
#Check the rows where the date still failed to parse. Rows that parsed correctly are omitted for brevity.
Day Month Year
... (rows 0 to 34 omitted)
35 https://100percentfedup.com/served-roy-moore-v... commander
36 https://100percentfedup.com/video-hillary-aske... some
37 https://100percentfedup.com/12-yr-old-black-co... from
38 NaN NaN
39 NaN NaN
40 NaN NaN
41 NaN NaN
42 NaN NaN
43 NaN NaN
44 NaN NaN

The problem rows are at the very end, so let's look at that part of the original date column.

null["date"].tail(10)
35    https://100percentfedup.com/served-roy-moore-v...
36    https://100percentfedup.com/video-hillary-aske...
37    https://100percentfedup.com/12-yr-old-black-co...
38    https://fedup.wpengine.com/wp-content/uploads/...
39    https://fedup.wpengine.com/wp-content/uploads/...
40    https://fedup.wpengine.com/wp-content/uploads/...
41    https://fedup.wpengine.com/wp-content/uploads/...
42    MSNBC HOST Rudely Assumes Steel Worker Would N...
43    https://fedup.wpengine.com/wp-content/uploads/...
44    https://fedup.wpengine.com/wp-content/uploads/...
Name: date, dtype: object

You can see that these date fields simply do not contain dates.

Delete the relevant part.

null = null[:-10]
null_dates = null_dates[:-10]

The rows with broken dates have now been removed, but two tasks remain: unifying the date format and deleting the news from before the inauguration.

Let's continue to work on it.

#Notation of the month
df["Month"].unique()
array(['December', 'November', 'October', 'September', 'August', 'July',
       'June', 'May', 'April', 'March', 'February', 'January', nan, 'Dec',
       'Nov', 'Oct', 'Sep', 'Aug', 'Jul', 'Jun', 'Apr', 'Mar', 'Feb',
       'Jan'], dtype=object)

null_dates["Month"].unique()
array(['Feb'], dtype=object)

null_dates["Month"] = "February"
#Only 'Feb' appears, so unify it with 'February'.
#Notation of year
null_dates["Year"].unique()
array(['18'], dtype=object)

null_dates["Year"] = "2018"
#Only '18' appears, so unify it to '2018'.
null_filled = pd.concat([null['title'], null['text'], null['subject'],\
                null_dates["Month"],null_dates["Day"], null_dates["Year"],null["date"],\
                null["Reality"]], axis=1)
#Recombine null with the parsed null_dates columns
table = pd.concat([nonull, null_filled], axis=0)
table = table.drop("date", axis=1)
#Now you can unify the date format
table = table.reset_index(drop=True)
#Here the indexes are in order
#subject
le = LabelEncoder()
encoded = le.fit_transform(table['subject'].values)
decoded = le.inverse_transform(encoded)
table['subject'] = encoded
#Encode subject as numbers (not directly used later)
#Unify the month notation
table["Month"].unique()
array(['December', 'November', 'October', 'September', 'August', 'July',
       'June', 'May', 'April', 'March', 'February', 'January', 'Dec',
       'Nov', 'Oct', 'Sep', 'Aug', 'Jul', 'Jun', 'Apr', 'Mar', 'Feb',
       'Jan'], dtype=object)

table["Month"] = table["Month"].map({'December':12, 'November':11, 'October':10, 'September':9, 'August':8, 'July':7,
       'June':6, 'May':5, 'April':4, 'March':3, 'February':2, 'January':1, 'Dec':12,
       'Nov':11, 'Oct':10, 'Sep':9, 'Aug':8, 'Jul':7, 'Jun':6, 'Apr':4, 'Mar':3, 'Feb':2,
       'Jan':1})
#Type conversion
table.dtypes
#Day and Year are still object dtype, so fix them
title      object
text       object
subject     int64
Month       int64
Day        object
Year       object
Reality     int64
dtype: object

table["Day"] = table["Day"].astype("int64")
table["Year"] = table["Year"].astype("int64")

table.dtypes
title      object
text       object
subject     int64
Month       int64
Day         int64
Year        int64
Reality     int64
dtype: object
table.isnull().sum()
title      0
text       0
subject    0
Month      0
Day        0
Year       0
Reality    0
dtype: int64
#This solves the problem of missing values

Up to this point, we have continued to perform type conversion and unification of notation.
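As an aside, the whole date-parsing section above could also be handled with pandas' own parser. A minimal sketch, assuming the two formats seen above ("December 31, 2017" and "19-Feb-18") are the only ones present and that unparseable rows should simply be dropped:

#Sketch: parse both date formats in one pass; rows matching neither become NaT
parsed = pd.to_datetime(df["date"], format="%B %d, %Y", errors="coerce")
parsed = parsed.fillna(pd.to_datetime(df["date"], format="%d-%b-%y", errors="coerce"))
#Rows still NaT (for example the URL "dates") could then be removed with df[parsed.notna()]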

Finally, we will cut the news from before January 20, 2017, the day Mr. Trump took office.

Therefore, datetime is used.

dates = table[["Year", "Month", "Day"]]

dates = pd.to_datetime(dates)
#pd.to_datetime builds a datetime from columns named Year, Month and Day
dates.head()
0   2017-12-31
1   2017-12-31
2   2017-12-30
3   2017-12-29
4   2017-12-25
dtype: datetime64[ns]
news = pd.concat([table, dates], axis=1)
news = news.rename(columns={0: 'dates'})
news = news[["text", "dates", "Reality"]]
news = news[news["dates"] > dt.datetime(2017,1,20)]
#Keep only news dated after the president took office
news = news.drop("dates", axis=1)
news.head()
#This completes.
text Reality
0 Donald Trump just couldn t wish all Americans ... 0
1 House Intelligence Committee Chairman Devin Nu... 0
2 On Friday, it was revealed that former Milwauk... 0
3 On Christmas day, Donald Trump announced that ... 0
4 Pope Francis used his annual Christmas Day mes... 0

This gives us a table of the news since Mr. Trump took office together with its truth label.

Finally, save it to a csv file and proceed to the next step.

news.to_csv("news_for_learning.csv", index=False)

Step2: Preprocess the data of Mr. Trump's tweet for judgment

Next, we will extract the tweets posted since the inauguration from Trump Tweets: Tweets from @realdonaldtrump.

Let's read the data and take a look at the contents.

import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import datetime as dt

import warnings
warnings.filterwarnings('ignore')

trump = pd.read_csv("trumptweets.csv")

trump.head()
id link content date retweets favorites mentions hashtags geo
0 1698308935 https://twitter.com/realDonaldTrump/status/169... Be sure to tune in and watch Donald Trump on L... 2009-05-04 20:54:25 500 868 NaN NaN
1 1701461182 https://twitter.com/realDonaldTrump/status/170... Donald Trump will be appearing on The View tom... 2009-05-05 03:00:10 33 273 NaN NaN
2 1737479987 https://twitter.com/realDonaldTrump/status/173... Donald Trump reads Top Ten Financial Tips on L... 2009-05-08 15:38:08 12 18 NaN NaN
3 1741160716 https://twitter.com/realDonaldTrump/status/174... New Blog Post: Celebrity Apprentice Finale and... 2009-05-08 22:40:15 11 24 NaN NaN
4 1773561338 https://twitter.com/realDonaldTrump/status/177... "My persona will never be that of a wallflower... 2009-05-12 16:07:28 1399 1965 NaN NaN
trump.dtypes
id             int64
link          object
content       object
date          object
retweets       int64
favorites      int64
mentions      object
hashtags      object
geo          float64
dtype: object

The dataset has many columns, but here we will use only date and content.

I will change the date to datetime.

trump['Date'] = pd.to_datetime(trump['date'])
trump = trump[trump["Date"] > dt.datetime(2017,1,20)]
#Keep only tweets dated after the president took office

This erased all tweets before he took office.
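As a quick sanity check, the earliest remaining tweet should now fall after the inauguration; a small sketch using the Date column created above:

print(trump["Date"].min())
#Expected to print a date on or after 2017-01-20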

Delete the columns other than content and add two new columns called Truth and Percentage.

Truth will store the True or Fake conclusion, and Percentage will store how confident the calculation was when the model is applied.

trump = trump[["content"]]
trump["Truth"] = "Unknown"
#Now only content remains, plus Truth. The judgment result will be stored as the strings True or Fake.
trump["Percentage"] = 0
#Also add a percentage
trump = trump.reset_index(drop=True)
#Here the indexes are in order
trump.head()
content Truth Percentage
0 Be sure to tune in and watch Donald Trump on L... Unknown 0
1 Donald Trump will be appearing on The View tom... Unknown 0
2 Donald Trump reads Top Ten Financial Tips on L... Unknown 0
3 New Blog Post: Celebrity Apprentice Finale and... Unknown 0
4 "My persona will never be that of a wallflower... Unknown 0

Now that you have narrowed down to just the information you need, save it in a new csv file.

trump.to_csv("trump_for_judging.csv", index=False)

We will work on model creation in the next step.

Step3: Create a model based on news data

Now that we have all the necessary csv files, let's create a model for judging news.

Here we use a naive Bayes classifier.

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy

import pickle

news = pd.read_csv("news_for_learning.csv")
#Csv file created in Step1

Next, we create one list for Fake and one for True (Fact), and store each article's words in dictionary format.

Fake =[]
Fact = []

for text, reality in zip(news['text'], news['Reality']):
    line = text.split(" ")
    dic = {}
    if reality == 0:
        for word in line:
            dic[word] = True
        ireru = (dic, 0)   #ireru is the (word dictionary, label) pair to append
        Fake.append(ireru)
    else:
        for word in line:
            dic[word] = True
        ireru = (dic, 1)
        Fact.append(ireru)

You have now stored the words in each list.

For example, outputting Fake[0] would print a huge number of words, so it is omitted here.
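For reference, each element of these lists is a pair of a word dictionary and a label. A hypothetical, heavily truncated illustration of the shape (the actual words will differ):

#Shape of one element (illustrative only):
#Fake[0] == ({'Donald': True, 'Trump': True, 'just': True, ...}, 0)
print(Fake[0][1])         #the label: 0 for Fake
print(len(Fake[0][0]))    #number of distinct words in the first Fake article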

Now, let's create training and test data and feed them to the classifier. Here, 90% of the data is used for training.

threshold = 0.9
num_fake = int(threshold * len(Fake))
num_fact = int(threshold * len(Fact))

features_train = Fake[:num_fake] + Fact[:num_fact]
features_test = Fake[num_fake:] + Fact[num_fact:]

classifier = NaiveBayesClassifier.train(features_train)
print('Accuracy of the classifier:', nltk_accuracy(classifier, features_test))
Accuracy of the classifier: 0.9585746102449889
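Accuracy is only a single summary number; for a fuller picture, a confusion matrix over the held-out test set could also be computed. A minimal sketch using the variables above:

from nltk.metrics import ConfusionMatrix

#Compare reference labels with predictions on the test set
ref = [str(label) for (_, label) in features_test]
pred = [str(classifier.classify(feats)) for (feats, _) in features_test]
print(ConfusionMatrix(ref, pred))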

Looking at the accuracy, we can see that it is quite high at over 95%. Next, let's see which words hold the key to the classification.

N = 15
print('Top ' + str(N) + ' most informative words:')
for i, item in enumerate(classifier.most_informative_features()[:N]):
    print(str(i+1) + '. ' + item[0]) 

Top 15 most informative words:
1. (Reuters)
2. -
3. -
4. screenshot
5. Image:
6. MOSCOW
7. BERLIN
8. screengrab
9. Images
10. Editing
11. WASHINGTON
12. Kurdistan
13. corrects
14. racists
15. Image

Looking at it like this, you can see that symbols, place names, and words such as "racists" carry the most information.
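For reference, NLTK can also print these words together with their likelihood ratios via a built-in method:

classifier.show_most_informative_features(15)
#Prints each word with the ratio between the two label likelihoods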

Finally, let's save the model.

filename = 'model_for_trump.sav'
pickle.dump(classifier, open(filename, 'wb'))

This completes saving.

If you execute the following code after the code described so far and copy and paste a news article, its authenticity will be judged.

As mentioned at the beginning, this is only a calculation result and we cannot guarantee that it reflects reality, so please use it at your own risk.

def extract_features(words):
    #Store the words as a dictionary, in the same format as the training lists
    return dict([(word, True) for word in words])

#Judge a news text typed or pasted on the spot
input_review = input()
print("Fact Check:")
#extract_features(input_review)
print("\nNews Text:",input_review)
features = extract_features(input_review.split())
print(features)
probabilities = classifier.prob_classify(features)
predicted_truth = probabilities.max()
if predicted_truth == 0:
    answer = "Fake News!!!"
else:
    answer = "True News!!!"
print("Predicted Answer:", answer)
print("Probability:", round(probabilities.prob(predicted_truth), 2))

Step4: Verify the authenticity of the tweet using the model

In the final step, we will finally verify Mr. Trump's tweet.

Let's start by loading files and libraries.

import pandas as pd
import pickle

trump = pd.read_csv("trump_for_judging.csv")
#Load Trump Tweets
classifier = pickle.load(open('model_for_trump.sav', 'rb'))
#Load the saved model

import matplotlib.pyplot as plt
import seaborn as sns
#Finally, check the truth ratio

Next, create a function that determines authenticity. This is a slight tweak to the last code in the previous step.

def FactCheck(event):   
    global answer
    global percentage
    line = event.split(" ")
    dic = {}
    for word in line:
        dic[word] = True
    probabilities = classifier.prob_classify(dic)    
    predicted_truth = probabilities.max()
    #Note: round() with no digits argument rounds to a whole number, so this stores 0 or 1
    percentage = round(probabilities.prob(predicted_truth))
    if predicted_truth == 0:
        answer = "Fake"
    else:
        answer = "True"
    return answer

This function returns True or Fake for each text and also stores, as a number, how confident that judgment was.

Next, we run the judgment over every tweet and write each result back into the DataFrame loaded earlier.

for i, v in trump.iterrows():
    FactCheck(v["content"])
    trump.at[trump.index[i], 'Truth'] = answer
    trump.at[trump.index[i], 'Percentage'] = percentage

trump.head()
content Truth Percentage
0 Be sure to tune in and watch Donald Trump on L... Fake 1
1 Donald Trump will be appearing on The View tom... Fake 1
2 Donald Trump reads Top Ten Financial Tips on L... Fake 1
3 New Blog Post: Celebrity Apprentice Finale and... Fake 1
4 "My persona will never be that of a wallflower... True 1

With this, we were able to output the judgment result and numerical value for the tweet.
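As a design note, the same work can be done without iterrows and the global variables in FactCheck. A sketch, assuming the classifier and the trump DataFrame above are in scope:

#Apply-based alternative to the loop above
def fact_check_row(text):
    words = {word: True for word in text.split(" ")}
    probabilities = classifier.prob_classify(words)
    label = probabilities.max()
    return pd.Series({"Truth": "Fake" if label == 0 else "True",
                      "Percentage": round(probabilities.prob(label), 2)})

trump[["Truth", "Percentage"]] = trump["content"].apply(fact_check_row)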

Let's graph the ratio of authenticity.

f,ax=plt.subplots(1,2,figsize=(18,8))
trump['Truth'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Truth')
ax[0].set_ylabel('')
sns.countplot('Truth',data=trump,ax=ax[1])
ax[1].set_title('Truth')
plt.show()
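The exact proportions can also be printed as numbers rather than read off the chart; a small sketch:

print(trump['Truth'].value_counts(normalize=True).round(3))
#Fraction of tweets judged True and Fake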

trump.describe()

(Figure: pie chart and count plot of the Truth column. The numerical describe() summary of the Percentage column is omitted.)

True is 13.6%. About 6 out of 7 tweets were Fake.

Also, the Percentage column came out as 1 for every tweet, i.e. the model reports 100% confidence in each judgment.
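For anyone who wants to dig into this, a first check would be the unrounded probabilities, since FactCheck above rounds to a whole number. A small sketch, assuming the classifier and the trump DataFrame are still in scope:

#Look at the unrounded probability for the first few tweets
for text in trump["content"].head(3):
    feats = {w: True for w in text.split(" ")}
    probabilities = classifier.prob_classify(feats)
    print(round(probabilities.prob(probabilities.max()), 4))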

At the end

This time, I created a model using a file that specifies the authenticity of the news in Kaggle and judged the authenticity of Mr. Trump's tweet.

As a result, about 6 out of 7 of his tweets were judged Fake.

I'm wondering why all the percentages are 1s, so I'm going to examine it.

As I mentioned at the beginning, this is just a calculation result, so there is no guarantee that this verification result will match the facts.

Also, I have no intention of defending or criticizing anyone of any particular political position, so I ask for your understanding once again.

It's been a long article, but thank you for reading to the end.
