[PYTHON] Fake News? Fake Tweet?

Fake News!

Since taking office, Mr. Trump has often dismissed and criticized news coverage with this phrase.

He is also very active on Twitter, and some of his posts are quite aggressive.

While he attacks the media in this way, it is fair to ask how credible his own tweets are.

So this time we will use machine learning to check whether Mr. Trump's tweets look True or Fake.

What to do this time

We will go to Kaggle and use the news dataset from the page Fake and real news dataset: Classifying the news to build a model that judges the authenticity of news published after Mr. Trump took office.

Next, we will extract post-inauguration tweets from the data in Trump Tweets: Tweets from @realdonaldtrump and chart how many of them were judged True and how many Fake.

[Important] Precautions regarding the content of the article

This article approaches machine learning from the angle of "Mr. Trump's tweets".

There is no political intent at all; the machine-learning output simply uses Mr. Trump's tweets as its subject, and I am well aware that the verification result reported here may itself come back labeled Fake. Please keep that in mind as you read.

Please also note that "True" or "Fake" here is only a computational result and does not guarantee what is actually true, and that this article was not written to praise or criticize any particular person. Thank you.

Step1: Pre-process news data for model creation

As a first step, we will preprocess the dataset from Fake and real news dataset: Classifying the news so that it can be used to build a model.

import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import datetime as dt

import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import LabelEncoder

Next, read the csv files from that page.

fake = pd.read_csv("Fake.csv")
true = pd.read_csv("True.csv")

On that page, the data is split into two files: Fake for the false news and True for the real news.

Let's look at the columns and dtypes.

fake.columns
Index(['title', 'text', 'subject', 'date'], dtype='object')

fake.dtypes
title      object
text       object
subject    object
date       object
dtype: object

true.columns
Index(['title', 'text', 'subject', 'date'], dtype='object')

true.dtypes
title      object
text       object
subject    object
date       object
dtype: object

You can see that neither file has a direct indication of True or Fake.

We will add an explicit label to each file and then combine them.

fake["Reality"] = 0
true["Reality"] = 1
#Create a new Reality. Set Fake to 0 and True to 1.

df = pd.concat([fake, true], axis=0)
df = df.reset_index(drop=True)

df.isnull().sum()
title      0
text       0
subject    0
date       0
Reality    0
dtype: int64

df.head()

#Neither of the original files has missing values. The full output is omitted since it would be too large, but it is important to check this in advance.
title text subject date Reality
0 Donald Trump Sends Out Embarrassing New Year’... Donald Trump just couldn t wish all Americans ... News December 31, 2017 0
1 Drunk Bragging Trump Staffer Started Russian ... House Intelligence Committee Chairman Devin Nu... News December 31, 2017 0
2 Sheriff David Clarke Becomes An Internet Joke... On Friday, it was revealed that former Milwauk... News December 30, 2017 0
3 Trump Is So Obsessed He Even Has Obama’s Name... On Christmas day, Donald Trump announced that ... News December 29, 2017 0
4 Pope Francis Just Called Out Donald Trump Dur... Pope Francis used his annual Christmas Day mes... News December 25, 2017 0
df.tail()
title text subject date Reality
44893 'Fully committed' NATO backs new U.S. approach... BRUSSELS (Reuters) - NATO allies on Tuesday we... worldnews August 22, 2017 1
44894 LexisNexis withdrew two products from Chinese ... LONDON (Reuters) - LexisNexis, a provider of l... worldnews August 22, 2017 1
44895 Minsk cultural hub becomes haven from authorities MINSK (Reuters) - In the shadow of disused Sov... worldnews August 22, 2017 1
44896 Vatican upbeat on possibility of Pope Francis ... MOSCOW (Reuters) - Vatican Secretary of State ... worldnews August 22, 2017 1
44897 Indonesia to buy $1.14 billion worth of Russia... JAKARTA (Reuters) - Indonesia will buy 11 Sukh... worldnews August 22, 2017 1

From the above, we can confirm that both Fake and True rows now sit in the same DataFrame.

There is also a subject column, so let's take a look. If any subject clearly has nothing to do with Mr. Trump, we will drop that news.

df["subject"].unique()
array(['News', 'politics', 'Government News', 'left-news', 'US_News',
       'Middle-east', 'politicsNews', 'worldnews'], dtype=object)
#Trump-related news could fall under any of these genres, so we will not filter by subject here.

Next, let's separate the dates.

df = pd.concat([df['title'], df['text'], df['subject'],df["date"]\
                .str.extract('(?P<Month>.*) (?P<Day>.*), (?P<Year>.*)',expand=True),df["date"],df["Reality"]], axis=1)
df.head()
title text subject Month Day Year date Reality
0 Donald Trump Sends Out Embarrassing New Year’... Donald Trump just couldn t wish all Americans ... News December 31 2017 December 31, 2017 0
1 Drunk Bragging Trump Staffer Started Russian ... House Intelligence Committee Chairman Devin Nu... News December 31 2017 December 31, 2017 0
2 Sheriff David Clarke Becomes An Internet Joke... On Friday, it was revealed that former Milwauk... News December 30 2017 December 30, 2017 0
3 Trump Is So Obsessed He Even Has Obama’s Name... On Christmas day, Donald Trump announced that ... News December 29 2017 December 29, 2017 0
4 Pope Francis Just Called Out Donald Trump Dur... Pope Francis used his annual Christmas Day mes... News December 25 2017 December 25, 2017 0

It seemed that the dates could be separated well, but there was a problem.

df.isnull().sum()
title       0
text        0
subject     0
Month      45
Day        45
Year       45
date        0
Reality     0
dtype: int64

Some dates did not parse correctly. Let's separate those rows from the ones that parsed well and check them.

nonull = df.dropna()
nonull.isnull().sum()
title      0
text       0
subject    0
Month      0
Day        0
Year       0
date       0
Reality    0
dtype: int64
null = df[df.isnull().any(axis=1)]
#Extract only the rows where the date parsing failed.
null.head()
title text subject Month Day Year date Reality
9050 Democrat Senator Warns Mueller Not To Release ... According to The Hill, Democrat Senator Bob Ca... politics NaN NaN NaN 19-Feb-18 0
9051 MSNBC ANCHOR Flabbergasted at What Texas Teach... If we protect every other government building ... politics NaN NaN NaN 19-Feb-18 0
9052 WATCH: SNOWFLAKES ASKED Communist Party Platfo... Ami Horowitz is fantastic! Check out this man ... politics NaN NaN NaN 19-Feb-18 0
9053 JUST IN: BADASS GENERAL JOHN KELLY Shoved Chin... Just one more reminder of why President Trump ... politics NaN NaN NaN 18-Feb-18 0
9054 DOJ’s JEFF SESSIONS Opens Investigation Into W... Thank goodnesss Jeff Sessions is moving on fin... politics NaN NaN NaN 18-Feb-18 0
null = null.drop(["Month", "Day", "Year"], axis=1)
#Recreate the date here
null = null.reset_index(drop=True)
null.head()
title text subject date Reality
0 Democrat Senator Warns Mueller Not To Release ... According to The Hill, Democrat Senator Bob Ca... politics 19-Feb-18 0
1 MSNBC ANCHOR Flabbergasted at What Texas Teach... If we protect every other government building ... politics 19-Feb-18 0
2 WATCH: SNOWFLAKES ASKED Communist Party Platfo... Ami Horowitz is fantastic! Check out this man ... politics 19-Feb-18 0
3 JUST IN: BADASS GENERAL JOHN KELLY Shoved Chin... Just one more reminder of why President Trump ... politics 18-Feb-18 0
4 DOJ’s JEFF SESSIONS Opens Investigation Into W... Thank goodnesss Jeff Sessions is moving on fin... politics 18-Feb-18 0

Let's also look at the end.

null.tail()
title text subject date Reality
40 https://fedup.wpengine.com/wp-content/uploads/... https://fedup.wpengine.com/wp-content/uploads/... Government News https://fedup.wpengine.com/wp-content/uploads/... 0
41 https://fedup.wpengine.com/wp-content/uploads/... https://fedup.wpengine.com/wp-content/uploads/... Government News https://fedup.wpengine.com/wp-content/uploads/... 0
42 Homepage [vc_row][vc_column width= 1/1 ][td_block_trend... left-news MSNBC HOST Rudely Assumes Steel Worker Would N... 0
43 https://fedup.wpengine.com/wp-content/uploads/... https://fedup.wpengine.com/wp-content/uploads/... left-news https://fedup.wpengine.com/wp-content/uploads/... 0
44 https://fedup.wpengine.com/wp-content/uploads/... https://fedup.wpengine.com/wp-content/uploads/... left-news https://fedup.wpengine.com/wp-content/uploads/... 0

Some rows even have URLs in the date field. The other columns in those rows are also URLs, so we will exclude such rows.

Most of the remaining dates are written in the form DD-Mon-YY (e.g. 19-Feb-18), so let's start by extracting that.

null_dates = null["date"].str.extract('(?P<Day>.*)-(?P<Month>.*)-(?P<Year>.*)',expand=True)
null_dates.dtypes
Day      object
Month    object
Year     object
dtype: object

null_dates
#Check the rows where the date still failed to parse. Rows that parsed correctly are omitted for brevity.
Day Month Year
... (rows 0 to 34 omitted)
35 https://100percentfedup.com/served-roy-moore-v... commander
36 https://100percentfedup.com/video-hillary-aske... some
37 https://100percentfedup.com/12-yr-old-black-co... from
38 NaN NaN
39 NaN NaN
40 NaN NaN
41 NaN NaN
42 NaN NaN
43 NaN NaN
44 NaN NaN

The problem rows are at the very end, so let's look at that part of the original date column.

null["date"].tail(10)
35    https://100percentfedup.com/served-roy-moore-v...
36    https://100percentfedup.com/video-hillary-aske...
37    https://100percentfedup.com/12-yr-old-black-co...
38    https://fedup.wpengine.com/wp-content/uploads/...
39    https://fedup.wpengine.com/wp-content/uploads/...
40    https://fedup.wpengine.com/wp-content/uploads/...
41    https://fedup.wpengine.com/wp-content/uploads/...
42    MSNBC HOST Rudely Assumes Steel Worker Would N...
43    https://fedup.wpengine.com/wp-content/uploads/...
44    https://fedup.wpengine.com/wp-content/uploads/...
Name: date, dtype: object

You can see that these date fields simply do not contain dates.

Delete the relevant part.

null = null[:-10]
null_dates = null_dates[:-10]

The rows with broken dates have now been removed, but two tasks remain: unifying the date format and deleting the news from before the inauguration.

Let's continue to work on it.

#Notation of the month
df["Month"].unique()
array(['December', 'November', 'October', 'September', 'August', 'July',
       'June', 'May', 'April', 'March', 'February', 'January', nan, 'Dec',
       'Nov', 'Oct', 'Sep', 'Aug', 'Jul', 'Jun', 'Apr', 'Mar', 'Feb',
       'Jan'], dtype=object)

null_dates["Month"].unique()
array(['Feb'], dtype=object)

null_dates["Month"] = "February"
#Only 'Feb' appears, so unify it with 'February'.
#Notation of year
null_dates["Year"].unique()
array(['18'], dtype=object)

null_dates["Year"] = "2018"
#Only '18' appears, so unify it to '2018'.
null_filled = pd.concat([null['title'], null['text'], null['subject'],\
                null_dates["Month"],null_dates["Day"], null_dates["Year"],null["date"],\
                null["Reality"]], axis=1)
#Recombine null with the parsed null_dates columns
table = pd.concat([nonull, null_filled], axis=0)
table = table.drop("date", axis=1)
#Now you can unify the date format
table = table.reset_index(drop=True)
#Here the indexes are in order
#subject
le = LabelEncoder()
encoded = le.fit_transform(table['subject'].values)
decoded = le.inverse_transform(encoded)
table['subject'] = encoded
#Encode subject as numbers (not directly used later)
#Unify the month notation
table["Month"].unique()
array(['December', 'November', 'October', 'September', 'August', 'July',
       'June', 'May', 'April', 'March', 'February', 'January', 'Dec',
       'Nov', 'Oct', 'Sep', 'Aug', 'Jul', 'Jun', 'Apr', 'Mar', 'Feb',
       'Jan'], dtype=object)

table["Month"] = table["Month"].map({'December':12, 'November':11, 'October':10, 'September':9, 'August':8, 'July':7,
       'June':6, 'May':5, 'April':4, 'March':3, 'February':2, 'January':1, 'Dec':12,
       'Nov':11, 'Oct':10, 'Sep':9, 'Aug':8, 'Jul':7, 'Jun':6, 'Apr':4, 'Mar':3, 'Feb':2,
       'Jan':1})
#Type conversion
table.dtypes
#Day and Year are still object dtype, so fix them
title      object
text       object
subject     int64
Month       int64
Day        object
Year       object
Reality     int64
dtype: object

table["Day"] = table["Day"].astype("int64")
table["Year"] = table["Year"].astype("int64")

table.dtypes
title      object
text       object
subject     int64
Month       int64
Day         int64
Year        int64
Reality     int64
dtype: object
table.isnull().sum()
title      0
text       0
subject    0
Month      0
Day        0
Year       0
Reality    0
dtype: int64
#This solves the problem of missing values

Up to this point, we have continued to perform type conversion and unification of notation.
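As an aside, the whole date-parsing section above could also be handled with pandas' own parser. A minimal sketch, assuming the two formats seen above ("December 31, 2017" and "19-Feb-18") are the only ones present and that unparseable rows should simply be dropped:

#Sketch: parse both date formats in one pass; rows matching neither become NaT
parsed = pd.to_datetime(df["date"], format="%B %d, %Y", errors="coerce")
parsed = parsed.fillna(pd.to_datetime(df["date"], format="%d-%b-%y", errors="coerce"))
#Rows still NaT (for example the URL "dates") could then be removed with df[parsed.notna()]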

Finally, we will cut the news from before January 20, 2017, the day Mr. Trump took office.

Therefore, datetime is used.

dates = table[["Year", "Month", "Day"]]

dates = pd.to_datetime(dates)
#pd.to_datetime builds a datetime from columns named Year, Month and Day
dates.head()
0   2017-12-31
1   2017-12-31
2   2017-12-30
3   2017-12-29
4   2017-12-25
dtype: datetime64[ns]
news = pd.concat([table, dates], axis=1)
news = news.rename(columns={0: 'dates'})
news = news[["text", "dates", "Reality"]]
news = news[news["dates"] > dt.datetime(2017,1,20)]
#Keep only news dated after the president took office
news = news.drop("dates", axis=1)
news.head()
#This completes.
text Reality
0 Donald Trump just couldn t wish all Americans ... 0
1 House Intelligence Committee Chairman Devin Nu... 0
2 On Friday, it was revealed that former Milwauk... 0
3 On Christmas day, Donald Trump announced that ... 0
4 Pope Francis used his annual Christmas Day mes... 0

This gives us a table of the news since Mr. Trump took office together with its truth label.

Finally, save it to a csv file and proceed to the next step.

news.to_csv("news_for_learning.csv", index=False)

Step2: Preprocess the data of Mr. Trump's tweet for judgment

Next, we will extract the tweets posted since the inauguration from Trump Tweets: Tweets from @realdonaldtrump.

Let's read the data and take a look at the contents.

import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import datetime as dt

import warnings
warnings.filterwarnings('ignore')

trump = pd.read_csv("trumptweets.csv")

trump.head()
id link content date retweets favorites mentions hashtags geo
0 1698308935 https://twitter.com/realDonaldTrump/status/169... Be sure to tune in and watch Donald Trump on L... 2009-05-04 20:54:25 500 868 NaN NaN
1 1701461182 https://twitter.com/realDonaldTrump/status/170... Donald Trump will be appearing on The View tom... 2009-05-05 03:00:10 33 273 NaN NaN
2 1737479987 https://twitter.com/realDonaldTrump/status/173... Donald Trump reads Top Ten Financial Tips on L... 2009-05-08 15:38:08 12 18 NaN NaN
3 1741160716 https://twitter.com/realDonaldTrump/status/174... New Blog Post: Celebrity Apprentice Finale and... 2009-05-08 22:40:15 11 24 NaN NaN
4 1773561338 https://twitter.com/realDonaldTrump/status/177... "My persona will never be that of a wallflower... 2009-05-12 16:07:28 1399 1965 NaN NaN
trump.dtypes
id             int64
link          object
content       object
date          object
retweets       int64
favorites      int64
mentions      object
hashtags      object
geo          float64
dtype: object

The dataset has many columns, but here we will use only date and content.

I will change the date to datetime.

trump['Date'] = pd.to_datetime(trump['date'])
trump = trump[trump["Date"] > dt.datetime(2017,1,20)]
#Keep only tweets dated after the president took office

This erased all tweets before he took office.
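As a quick sanity check, the earliest remaining tweet should now fall after the inauguration; a small sketch using the Date column created above:

print(trump["Date"].min())
#Expected to print a date on or after 2017-01-20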

Delete the columns other than content and add two new columns called Truth and Percentage.

Truth will store the True or Fake conclusion, and Percentage will store how confident the calculation was when the model is applied.

trump = trump[["content"]]
trump["Truth"] = "Unknown"
#Now only content remains, plus Truth. The judgment result will be stored as the strings True or Fake.
trump["Percentage"] = 0
#Also add a percentage
trump = trump.reset_index(drop=True)
#Here the indexes are in order
trump.head()
content Truth Percentage
0 Be sure to tune in and watch Donald Trump on L... Unknown 0
1 Donald Trump will be appearing on The View tom... Unknown 0
2 Donald Trump reads Top Ten Financial Tips on L... Unknown 0
3 New Blog Post: Celebrity Apprentice Finale and... Unknown 0
4 "My persona will never be that of a wallflower... Unknown 0

Now that you have narrowed down to just the information you need, save it in a new csv file.

trump.to_csv("trump_for_judging.csv", index=False)

We will work on model creation in the next step.

Step3: Create a model based on news data

Now that we have all the necessary csv files, let's create a model for judging news.

Here we use a naive Bayes classifier.

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy

import pickle

news = pd.read_csv("news_for_learning.csv")
#Csv file created in Step1

Next, we create one list for Fake and one for True (Fact), and store each article's words in dictionary format.

Fake =[]
Fact = []

for text, reality in zip(news['text'], news['Reality']):
    line = text.split(" ")
    dic = {}
    if reality == 0:
        for word in line:
            dic[word] = True
        ireru = (dic, 0)   #ireru is the (word dictionary, label) pair to append
        Fake.append(ireru)
    else:
        for word in line:
            dic[word] = True
        ireru = (dic, 1)
        Fact.append(ireru)

You have now stored the words in each list.

For example, outputting Fake[0] would print a huge number of words, so it is omitted here.
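For reference, each element of these lists is a pair of a word dictionary and a label. A hypothetical, heavily truncated illustration of the shape (the actual words will differ):

#Shape of one element (illustrative only):
#Fake[0] == ({'Donald': True, 'Trump': True, 'just': True, ...}, 0)
print(Fake[0][1])         #the label: 0 for Fake
print(len(Fake[0][0]))    #number of distinct words in the first Fake article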

Now, let's create training and test data and feed them to the classifier. Here, 90% of the data is used for training.

threshold = 0.9
num_fake = int(threshold * len(Fake))
num_fact = int(threshold * len(Fact))

features_train = Fake[:num_fake] + Fact[:num_fact]
features_test = Fake[num_fake:] + Fact[num_fact:]

classifier = NaiveBayesClassifier.train(features_train)
print('Accuracy of the classifier:', nltk_accuracy(classifier, features_test))
Accuracy of the classifier: 0.9585746102449889
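Accuracy is only a single summary number; for a fuller picture, a confusion matrix over the held-out test set could also be computed. A minimal sketch using the variables above:

from nltk.metrics import ConfusionMatrix

#Compare reference labels with predictions on the test set
ref = [str(label) for (_, label) in features_test]
pred = [str(classifier.classify(feats)) for (feats, _) in features_test]
print(ConfusionMatrix(ref, pred))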

Looking at the accuracy, we can see that it is quite high at over 95%. Next, let's see which words hold the key to the classification.

N = 15
print('Top ' + str(N) + ' most informative words:')
for i, item in enumerate(classifier.most_informative_features()[:N]):
    print(str(i+1) + '. ' + item[0]) 

Top 15 most informative words:
1. (Reuters)
2. -
3. -
4. screenshot
5. Image:
6. MOSCOW
7. BERLIN
8. screengrab
9. Images
10. Editing
11. WASHINGTON
12. Kurdistan
13. corrects
14. racists
15. Image

Looking at it like this, you can see that symbols, place names, and words such as "racists" carry the most information.
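For reference, NLTK can also print these words together with their likelihood ratios via a built-in method:

classifier.show_most_informative_features(15)
#Prints each word with the ratio between the two label likelihoods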

Finally, let's save the model.

filename = 'model_for_trump.sav'
pickle.dump(classifier, open(filename, 'wb'))

This completes saving.

If you execute the following code after the code described so far and copy and paste a news article, its authenticity will be judged.

As mentioned at the beginning, this is only a calculation result and we cannot guarantee that it reflects reality, so please use it at your own risk.

def extract_features(words):
    #Store the words as a dictionary, in the same format as the training lists
    return dict([(word, True) for word in words])

#Judge a news text typed or pasted on the spot
input_review = input()
print("Fact Check:")
#extract_features(input_review)
print("\nNews Text:",input_review)
features = extract_features(input_review.split())
print(features)
probabilities = classifier.prob_classify(features)
predicted_truth = probabilities.max()
if predicted_truth == 0:
    answer = "Fake News!!!"
else:
    answer = "True News!!!"
print("Predicted Answer:", answer)
print("Probability:", round(probabilities.prob(predicted_truth), 2))

Step4: Verify the authenticity of the tweet using the model

In the final step, we will finally verify Mr. Trump's tweet.

Let's start by loading files and libraries.

import pandas as pd
import pickle

trump = pd.read_csv("trump_for_judging.csv")
#Load Trump Tweets
classifier = pickle.load(open('model_for_trump.sav', 'rb'))
#Load the saved model

import matplotlib.pyplot as plt
import seaborn as sns
#Finally, check the truth ratio

Next, create a function that determines authenticity. This is a slight tweak to the last code in the previous step.

def FactCheck(event):   
    global answer
    global percentage
    line = event.split(" ")
    dic = {}
    for word in line:
        dic[word] = True
    probabilities = classifier.prob_classify(dic)    
    predicted_truth = probabilities.max()
    #Note: round() with no digits argument rounds to a whole number, so this stores 0 or 1
    percentage = round(probabilities.prob(predicted_truth))
    if predicted_truth == 0:
        answer = "Fake"
    else:
        answer = "True"
    return answer

This function returns True or Fake for each text and also stores, as a number, how confident that judgment was.

Next, we run the judgment over every tweet and write each result back into the DataFrame loaded earlier.

for i, v in trump.iterrows():
    FactCheck(v["content"])
    trump.at[trump.index[i], 'Truth'] = answer
    trump.at[trump.index[i], 'Percentage'] = percentage

trump.head()
content Truth Percentage
0 Be sure to tune in and watch Donald Trump on L... Fake 1
1 Donald Trump will be appearing on The View tom... Fake 1
2 Donald Trump reads Top Ten Financial Tips on L... Fake 1
3 New Blog Post: Celebrity Apprentice Finale and... Fake 1
4 "My persona will never be that of a wallflower... True 1

With this, we were able to output the judgment result and numerical value for the tweet.
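As a design note, the same work can be done without iterrows and the global variables in FactCheck. A sketch, assuming the classifier and the trump DataFrame above are in scope:

#Apply-based alternative to the loop above
def fact_check_row(text):
    words = {word: True for word in text.split(" ")}
    probabilities = classifier.prob_classify(words)
    label = probabilities.max()
    return pd.Series({"Truth": "Fake" if label == 0 else "True",
                      "Percentage": round(probabilities.prob(label), 2)})

trump[["Truth", "Percentage"]] = trump["content"].apply(fact_check_row)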

Let's graph the ratio of authenticity.

f,ax=plt.subplots(1,2,figsize=(18,8))
trump['Truth'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Truth')
ax[0].set_ylabel('')
sns.countplot('Truth',data=trump,ax=ax[1])
ax[1].set_title('Truth')
plt.show()
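The exact proportions can also be printed as numbers rather than read off the chart; a small sketch:

print(trump['Truth'].value_counts(normalize=True).round(3))
#Fraction of tweets judged True and Fake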

trump.describe()

(Figure: pie chart and count plot of the Truth column. The numerical describe() summary of the Percentage column is omitted.)

True is 13.6%. About 6 out of 7 tweets were Fake.

Also, the Percentage column came out as 1 for every tweet, i.e. the model reports 100% confidence in each judgment.
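For anyone who wants to dig into this, a first check would be the unrounded probabilities, since FactCheck above rounds to a whole number. A small sketch, assuming the classifier and the trump DataFrame are still in scope:

#Look at the unrounded probability for the first few tweets
for text in trump["content"].head(3):
    feats = {w: True for w in text.split(" ")}
    probabilities = classifier.prob_classify(feats)
    print(round(probabilities.prob(probabilities.max()), 4))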

At the end

This time, I created a model using a file that specifies the authenticity of the news in Kaggle and judged the authenticity of Mr. Trump's tweet.

As a result, about 6 out of 7 of his tweets were judged Fake.

I'm wondering why all the percentages are 1s, so I'm going to examine it.

As I mentioned at the beginning, this is just a calculation result, so there is no guarantee that this verification result will match the facts.

Also, I have no intention of defending or criticizing anyone of any particular political position, so I ask for your understanding once again.

It's been a long article, but thank you for reading to the end.
