Fake News!
Since taking office, Mr. Trump has often used this phrase to criticize various news reports.
He is also active on Twitter, where some of his posts are aggressive.
While he attacks the media in words and deeds like this, it is also fair to ask how credible his own tweets are.
So this time I will use machine learning to check whether Mr. Trump's tweets are True or Fake.
First, I use the news dataset from the Kaggle page "Fake and real news dataset: Classifying the news" to build a model that judges the authenticity of news published after Mr. Trump took office.
Next, I extract the tweets posted after the inauguration from the data in "Trump Tweets: Tweets from @realdonaldtrump" and chart how many of them were judged True and how many Fake.
This time, I approached machine learning from the angle of "Mr. Trump's tweets".
There is no political intention at all; the tweets are simply the starting point for a machine learning exercise, and I understand there is a good chance that the verification result I present will itself turn out to be Fake and come back at me like a boomerang. Please read on with that in mind.
Please note that True or Fake is only a computed result and does not guarantee what is actually the case, and that I did not write this article to promote or criticize any particular person. Thank you.
As a first step, I preprocess the dataset from "Fake and real news dataset: Classifying the news" so that it can be used for modeling.
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import datetime as dt
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder
Next, read the CSV files from the corresponding page.
fake = pd.read_csv("Fake.csv")
true = pd.read_csv("True.csv")
On that page, the fake news and the true news are split into two files, Fake and True.
Let's look at the columns and dtypes.
fake.columns
Index(['title', 'text', 'subject', 'date'], dtype='object')
fake.dtypes
title object
text object
subject object
date object
dtype: object
true.columns
Index(['title', 'text', 'subject', 'date'], dtype='object')
true.dtypes
title object
text object
subject object
date object
dtype: object
You can see that neither file has a column that directly indicates True or Fake.
I will add that label explicitly and then combine the two files.
fake["Reality"] = 0
true["Reality"] = 1
#Create a new Reality column: 0 for Fake, 1 for True.
df = pd.concat([fake, true], axis=0)
df = df.reset_index(drop=True)
df.isnull().sum()
title 0
text 0
subject 0
date 0
Reality 0
dtype: int64
df.head()
#Neither of the original files had missing values. The full output is omitted because it is too large, but it is important to check this in advance.
title | text | subject | date | Reality |
---|---|---|---|---|
0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |
4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 |
df.tail()
title | text | subject | date | Reality |
---|---|---|---|---|
44893 | 'Fully committed' NATO backs new U.S. approach... | BRUSSELS (Reuters) - NATO allies on Tuesday we... | worldnews | August 22, 2017 |
44894 | LexisNexis withdrew two products from Chinese ... | LONDON (Reuters) - LexisNexis, a provider of l... | worldnews | August 22, 2017 |
44895 | Minsk cultural hub becomes haven from authorities | MINSK (Reuters) - In the shadow of disused Sov... | worldnews | August 22, 2017 |
44896 | Vatican upbeat on possibility of Pope Francis ... | MOSCOW (Reuters) - Vatican Secretary of State ... | worldnews | August 22, 2017 |
44897 | Indonesia to buy $1.14 billion worth of Russia... | JAKARTA (Reuters) - Indonesia will buy 11 Sukh... | worldnews | August 22, 2017 |
From the above, it was confirmed that the combined data contains both True and Fake news.
There is a subject column, so let's take a look. If a subject has nothing to do with Mr. Trump, I will cut that news.
df["subject"].unique()
array(['News', 'politics', 'Government News', 'left-news', 'US_News',
'Middle-east', 'politicsNews', 'worldnews'], dtype=object)
#Mr. Trump is likely to come up in any of these genres, so I won't filter by subject here.
Next, let's separate the dates.
df = pd.concat([df['title'], df['text'], df['subject'],df["date"]\
.str.extract('(?P<Month>.*) (?P<Day>.*), (?P<Year>.*)',expand=True),df["date"],df["Reality"]], axis=1)
df.head()
title | text | subject | Month | Day | Year | date | Reality |
---|---|---|---|---|---|---|---|
0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December | 31 | 2017 | December 31, 2017 |
1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December | 31 | 2017 | December 31, 2017 |
2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December | 30 | 2017 | December 30, 2017 |
3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December | 29 | 2017 | December 29, 2017 |
4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December | 25 | 2017 | December 25, 2017 |
It looked like the dates were split cleanly, but there was a problem.
df.isnull().sum()
title 0
text 0
subject 0
Month 45
Day 45
Year 45
date 0
Reality 0
dtype: int64
Some dates were not split correctly. I will separate them from the rows that worked and check them.
nonull = df.dropna()
nonull.isnull().sum()
title 0
text 0
subject 0
Month 0
Day 0
Year 0
date 0
Reality 0
dtype: int64
null = df[df.isnull().any(axis=1)]
#Check only the rows that failed to split.
null.head()
title | text | subject | Month | Day | Year | date | Reality |
---|---|---|---|---|---|---|---|
9050 | Democrat Senator Warns Mueller Not To Release ... | According to The Hill, Democrat Senator Bob Ca... | politics | NaN | NaN | NaN | 19-Feb-18 |
9051 | MSNBC ANCHOR Flabbergasted at What Texas Teach... | If we protect every other government building ... | politics | NaN | NaN | NaN | 19-Feb-18 |
9052 | WATCH: SNOWFLAKES ASKED Communist Party Platfo... | Ami Horowitz is fantastic! Check out this man ... | politics | NaN | NaN | NaN | 19-Feb-18 |
9053 | JUST IN: BADASS GENERAL JOHN KELLY Shoved Chin... | Just one more reminder of why President Trump ... | politics | NaN | NaN | NaN | 18-Feb-18 |
9054 | DOJ’s JEFF SESSIONS Opens Investigation Into W... | Thank goodnesss Jeff Sessions is moving on fin... | politics | NaN | NaN | NaN | 18-Feb-18 |
null = null.drop(["Month", "Day", "Year"], axis=1)
#Drop the split columns so the date can be split again
null = null.reset_index(drop=True)
null.head()
title | text | subject | date | Reality |
---|---|---|---|---|
0 | Democrat Senator Warns Mueller Not To Release ... | According to The Hill, Democrat Senator Bob Ca... | politics | 19-Feb-18 |
1 | MSNBC ANCHOR Flabbergasted at What Texas Teach... | If we protect every other government building ... | politics | 19-Feb-18 |
2 | WATCH: SNOWFLAKES ASKED Communist Party Platfo... | Ami Horowitz is fantastic! Check out this man ... | politics | 19-Feb-18 |
3 | JUST IN: BADASS GENERAL JOHN KELLY Shoved Chin... | Just one more reminder of why President Trump ... | politics | 18-Feb-18 |
4 | DOJ’s JEFF SESSIONS Opens Investigation Into W... | Thank goodnesss Jeff Sessions is moving on fin... | politics | 18-Feb-18 |
Let's also look at the end.
null.tail()
title | text | subject | date | Reality |
---|---|---|---|---|
40 | https://fedup.wpengine.com/wp-content/uploads/... | https://fedup.wpengine.com/wp-content/uploads/... | Government News | https://fedup.wpengine.com/wp-content/uploads/... |
41 | https://fedup.wpengine.com/wp-content/uploads/... | https://fedup.wpengine.com/wp-content/uploads/... | Government News | https://fedup.wpengine.com/wp-content/uploads/... |
42 | Homepage | [vc_row][vc_column width= 1/1 ][td_block_trend... | left-news | MSNBC HOST Rudely Assumes Steel Worker Would N... |
43 | https://fedup.wpengine.com/wp-content/uploads/... | https://fedup.wpengine.com/wp-content/uploads/... | left-news | https://fedup.wpengine.com/wp-content/uploads/... |
44 | https://fedup.wpengine.com/wp-content/uploads/... | https://fedup.wpengine.com/wp-content/uploads/... | left-news | https://fedup.wpengine.com/wp-content/uploads/... |
Some rows even have a URL in the date column. Their other columns contain URLs as well, so let's exclude those rows.
Many of the dates are written as DD-Mon-YY (for example 19-Feb-18), so I will start from there.
null_dates = null["date"].str.extract('(?P<Day>.*)-(?P<Month>.*)-(?P<Year>.*)',expand=True)
null_dates.dtypes
Day object
Month object
Year object
dtype: object
null_dates
#Check which rows still fail after splitting the dates. The full output is too large, so the rows that split correctly are omitted.
Day | Month | Year |
---|---|---|
... | ... | ... |
35 | https://100percentfedup.com/served-roy-moore-v... | commander |
36 | https://100percentfedup.com/video-hillary-aske... | some |
37 | https://100percentfedup.com/12-yr-old-black-co... | from |
38 | NaN | NaN |
39 | NaN | NaN |
40 | NaN | NaN |
41 | NaN | NaN |
42 | NaN | NaN |
43 | NaN | NaN |
44 | NaN | NaN |
The problem rows are at the end, so let's look at them.
null["date"].tail(10)
35 https://100percentfedup.com/served-roy-moore-v...
36 https://100percentfedup.com/video-hillary-aske...
37 https://100percentfedup.com/12-yr-old-black-co...
38 https://fedup.wpengine.com/wp-content/uploads/...
39 https://fedup.wpengine.com/wp-content/uploads/...
40 https://fedup.wpengine.com/wp-content/uploads/...
41 https://fedup.wpengine.com/wp-content/uploads/...
42 MSNBC HOST Rudely Assumes Steel Worker Would N...
43 https://fedup.wpengine.com/wp-content/uploads/...
44 https://fedup.wpengine.com/wp-content/uploads/...
Name: date, dtype: object
You can see that these date values are not dates at all.
Delete these rows.
null = null[:-10]
null_dates = null_dates[:-10]
The malformed dates are now gone, but two tasks remain: unifying the date notation and removing the news from before he took office.
Let's continue to work on it.
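As an aside, the date notations seen so far could also be handled in one pass by trying each format in turn. The following is only a sketch using the standard library: the helper name `parse_news_date` and the list of formats are my assumptions based on the values shown above, not part of the original notebook.

```python
from datetime import datetime
from typing import Optional

# Formats assumed from the values seen in the date column above
# (month names are parsed according to the current locale; English is assumed).
DATE_FORMATS = ("%B %d, %Y", "%b %d, %Y", "%d-%b-%y")

def parse_news_date(raw: str) -> Optional[datetime]:
    """Return a datetime for a known notation, or None for junk such as URLs."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

samples = [
    "December 31, 2017",
    "Dec 31, 2017",
    "19-Feb-18",
    "https://fedup.wpengine.com/wp-content/uploads/...",
]
parsed = [parse_news_date(s) for s in samples]
```

Rows for which the helper returns None would be the ones to drop, which corresponds to the manual deletion of the last ten rows above.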
#Month notation
df["Month"].unique()
array(['December', 'November', 'October', 'September', 'August', 'July',
'June', 'May', 'April', 'March', 'February', 'January', nan, 'Dec',
'Nov', 'Oct', 'Sep', 'Aug', 'Jul', 'Jun', 'Apr', 'Mar', 'Feb',
'Jan'], dtype=object)
null_dates["Month"].unique()
array(['Feb'], dtype=object)
null_dates["Month"] = "February"
#The only value is Feb, so unify it to February.
#Year notation
null_dates["Year"].unique()
array(['18'], dtype=object)
null_dates["Year"] = "2018"
#The only value is 18, so unify it to 2018
null_filled = pd.concat([null['title'], null['text'], null['subject'],\
null_dates["Month"],null_dates["Day"], null_dates["Year"],null["date"],\
null["Reality"]], axis=1)
#Combine null and null_dates
table = pd.concat([nonull, null_filled], axis=0)
table = table.drop("date", axis=1)
#The date notation is now unified into Month, Day and Year
table = table.reset_index(drop=True)
#Reset the index so it runs in order
#subject
le = LabelEncoder()
encoded = le.fit_transform(table['subject'].values)
decoded = le.inverse_transform(encoded)
table['subject'] = encoded
#Convert subject to numbers (not directly used later)
#Unify the month notation
table["Month"].unique()
array(['December', 'November', 'October', 'September', 'August', 'July',
'June', 'May', 'April', 'March', 'February', 'January', 'Dec',
'Nov', 'Oct', 'Sep', 'Aug', 'Jul', 'Jun', 'Apr', 'Mar', 'Feb',
'Jan'], dtype=object)
table["Month"] = table["Month"].map({'December':12, 'November':11, 'October':10, 'September':9, 'August':8, 'July':7,
'June':6, 'May':5, 'April':4, 'March':3, 'February':2, 'January':1, 'Dec':12,
'Nov':11, 'Oct':10, 'Sep':9, 'Aug':8, 'Jul':7, 'Jun':6, 'Apr':4, 'Mar':3, 'Feb':2,
'Jan':1})
#Type conversion
table.dtypes
#Day and Year are object, so fix them
title object
text object
subject int64
Month int64
Day object
Year object
Reality int64
dtype: object
table["Day"] = table["Day"].astype("int64")
table["Year"] = table["Year"].astype("int64")
table.dtypes
title object
text object
subject int64
Month int64
Day int64
Year int64
Reality int64
dtype: object
table.isnull().sum()
title 0
text 0
subject 0
Month 0
Day 0
Year 0
Reality 0
dtype: int64
#The missing values are now resolved
Up to this point, we have continued to perform type conversion and unification of notation.
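The month dictionary typed by hand above can also be generated. This is a minimal sketch using the standard `calendar` module; the variable name `month_to_number` is mine, and it assumes an English locale so that the names match the ones in the data.

```python
import calendar

# Build {"January": 1, ..., "Jan": 1, ...} from the standard library
# instead of typing the mapping by hand. The names are locale-dependent;
# an English locale is assumed here.
month_to_number = {}
for number in range(1, 13):
    month_to_number[calendar.month_name[number]] = number
    month_to_number[calendar.month_abbr[number]] = number
```

Passing `month_to_number` to `table["Month"].map(...)` should give the same result as the hand-written dictionary, since "May" is both the full and the abbreviated name.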
Finally, the news from before January 20, 2017, the day Mr. Trump took office, will be cut.
The datetime type is used for this.
dates = table[["Year", "Month", "Day"]]
dates = pd.to_datetime(dates)
dates.head()
0 2017-12-31
1 2017-12-31
2 2017-12-30
3 2017-12-29
4 2017-12-25
dtype: datetime64[ns]
news = pd.concat([table, dates], axis=1)
news = news.rename(columns={0: 'dates'})
news = news[["text", "dates", "Reality"]]
news = news[news["dates"] >= dt.datetime(2017,1,20)]
#Cut everything before the president took office (the result must be assigned back to take effect)
news = news.drop("dates", axis=1)
news.head()
#This completes the preprocessing.
text | Reality |
---|---|
0 | Donald Trump just couldn t wish all Americans ... |
1 | House Intelligence Committee Chairman Devin Nu... |
2 | On Friday, it was revealed that former Milwauk... |
3 | On Christmas day, Donald Trump announced that ... |
4 | Pope Francis used his annual Christmas Day mes... |
This completes the set of news published since Mr. Trump took office, together with its True/Fake label.
Finally, save it to a csv file and proceed to the next step.
news.to_csv("news_for_learning.csv", index=False)
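One point about the date cut is worth checking on a small example: boolean indexing in pandas returns a new DataFrame rather than modifying the original, so a filter only takes effect when its result is assigned. The three-row frame below is hypothetical and only stands in for the news table.

```python
import datetime as dt
import pandas as pd

# Hypothetical three-row frame standing in for the news table.
demo = pd.DataFrame({
    "text": ["before", "inauguration day", "after"],
    "dates": pd.to_datetime(["2016-12-31", "2017-01-20", "2017-06-01"]),
})

inauguration = dt.datetime(2017, 1, 20)

# The filtered frame is computed here but thrown away: demo is unchanged.
demo[demo["dates"] >= inauguration]
rows_without_assignment = len(demo)

# Assigning the result keeps only the rows from the inauguration onward.
filtered = demo[demo["dates"] >= inauguration].reset_index(drop=True)
rows_with_assignment = len(filtered)
```

Comparing `rows_without_assignment` with `rows_with_assignment` shows whether the cut was really applied before the file is saved.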
Next, I extract the tweets posted since the inauguration from "Trump Tweets: Tweets from @realdonaldtrump".
Let's read the data and take a look at the contents.
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import datetime as dt
import warnings
warnings.filterwarnings('ignore')
trump = pd.read_csv("trumptweets.csv")
trump.head()
id | link | content | date | retweets | favorites | mentions | hashtags | geo |
---|---|---|---|---|---|---|---|---|
0 | 1698308935 | https://twitter.com/realDonaldTrump/status/169... | Be sure to tune in and watch Donald Trump on L... | 2009-05-04 20:54:25 | 500 | 868 | NaN | NaN |
1 | 1701461182 | https://twitter.com/realDonaldTrump/status/170... | Donald Trump will be appearing on The View tom... | 2009-05-05 03:00:10 | 33 | 273 | NaN | NaN |
2 | 1737479987 | https://twitter.com/realDonaldTrump/status/173... | Donald Trump reads Top Ten Financial Tips on L... | 2009-05-08 15:38:08 | 12 | 18 | NaN | NaN |
3 | 1741160716 | https://twitter.com/realDonaldTrump/status/174... | New Blog Post: Celebrity Apprentice Finale and... | 2009-05-08 22:40:15 | 11 | 24 | NaN | NaN |
4 | 1773561338 | https://twitter.com/realDonaldTrump/status/177... | "My persona will never be that of a wallflower... | 2009-05-12 16:07:28 | 1399 | 1965 | NaN | NaN |
trump.dtypes
id int64
link object
content object
date object
retweets int64
favorites int64
mentions object
hashtags object
geo float64
dtype: object
There are various columns, but here we will use only date and content.
I will convert date to datetime.
trump['Date'] = pd.to_datetime(trump['date'])
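trump = trump[trump["Date"] > dt.datetime(2017,1,20)]
#Assign the filtered result back; the expression alone does not change trump
This keeps only the tweets posted after he took office. Note that the sample outputs later in this article were produced without this assignment, which is why tweets from 2009 still appear in them.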
Drop everything except content and add new columns called Truth and Percentage.
When the model is applied, Truth stores the verdict (True or Fake) and Percentage stores the probability the model assigns to that verdict.
trump = trump[["content"]]
trump["Truth"] = "Unknown"
#Now the columns are only content and Truth. The verdict is stored as the string True or Fake.
trump["Percentage"] = 0
#Also add a Percentage column
trump = trump.reset_index(drop=True)
#Reset the index so it runs in order
trump.head()
content | Truth | Percentage |
---|---|---|
0 | Be sure to tune in and watch Donald Trump on L... | Unknown |
1 | Donald Trump will be appearing on The View tom... | Unknown |
2 | Donald Trump reads Top Ten Financial Tips on L... | Unknown |
3 | New Blog Post: Celebrity Apprentice Finale and... | Unknown |
4 | "My persona will never be that of a wallflower... | Unknown |
Now that you have narrowed down to just the information you need, save it in a new csv file.
trump.to_csv("trump_for_judging.csv", index=False)
We will work on model creation in the next step.
Now that we have all the necessary csv files, let's create a model for judging news.
Here we use a naive Bayes classifier.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy
import pickle
news = pd.read_csv("news_for_learning.csv")
#CSV file created in Step 1
This time, I create one list for Fake and one for True, and store the words of each article in dictionary format.
Fake =[]
Fact = []
for text, reality in zip(news['text'], news['Reality']):
    line = text.split(" ")
    dic = {}
    if reality == 0:
        for word in line:
            dic[word] = True
        ireru = (dic, 0)
        Fake.append(ireru)
    else:
        for word in line:
            dic[word] = True
        ireru = (dic, 1)
        Fact.append(ireru)
The words are now stored in each list.
Printing even a single element such as Fake[0] produces a huge number of words, so I will omit the output here.
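To show the shape of one element without printing a whole article, here is a small sketch on a made-up sentence. The helper name `bag_of_words` and the sample text are mine; the logic is the same word-to-True dictionary built in the loop above.

```python
def bag_of_words(text):
    """Map each space-separated token to True, as in the loop above."""
    return {word: True for word in text.split(" ")}

# Made-up sentence; the repeated word "says" collapses into one key.
features = bag_of_words("Trump says says trade deal is close")

# One list element pairs the dictionary with its label (1 = True, 0 = Fake).
labeled_example = (features, 1)
```

Because the dictionary only records whether a word is present, word counts and word order are not used by the model.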
Now, let's create the training data and test data and train the classifier. Here, 90% of the data is used for training.
threshold = 0.9
num_fake = int(threshold * len(Fake))
num_fact = int(threshold * len(Fact))
features_train = Fake[:num_fake] + Fact[:num_fact]
features_test = Fake[num_fake:] + Fact[num_fact:]
classifier = NaiveBayesClassifier.train(features_train)
print('Accuracy of the classifier:', nltk_accuracy(classifier, features_test))
Accuracy of the classifier: 0.9585746102449889
Looking at the accuracy, we can see that it is quite high at 95% or more. Let's see what words hold the key.
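The 90/10 split used above is done separately for each class, which keeps the Fake/True proportions the same in both sets. A minimal sketch of that idea on hypothetical stand-in lists (the function name `split_per_class` and the demo data are my own):

```python
def split_per_class(examples, train_ratio=0.9):
    """Cut one class's list into a leading train part and a trailing test part."""
    cut = int(train_ratio * len(examples))
    return examples[:cut], examples[cut:]

# Hypothetical stand-ins for the Fake and Fact lists (10 and 20 items).
fake_demo = [({"w%d" % i: True}, 0) for i in range(10)]
fact_demo = [({"w%d" % i: True}, 1) for i in range(20)]

fake_train, fake_test = split_per_class(fake_demo)
fact_train, fact_test = split_per_class(fact_demo)

train_set = fake_train + fact_train
test_set = fake_test + fact_test
```

Note that the lists are sliced without shuffling, so the test set is simply the last 10% of each list in file order.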
N = 15
print('Top ' + str(N) + ' most informative words:')
for i, item in enumerate(classifier.most_informative_features()[:N]):
    print(str(i+1) + '. ' + item[0])
Top 15 most informative words:
1. (Reuters)
2. -
3. -
4. screenshot
5. Image:
6. MOSCOW
7. BERLIN
8. screengrab
9. Images
10. Editing
11. WASHINGTON
12. Kurdistan
13. corrects
14. racists
15. Image
Looking at this list, you can see that symbols, place names, and words such as "racists" are among the most informative features.
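Entries such as "(Reuters)" and the city names match the dateline that opens the True articles (for example "BRUSSELS (Reuters) - ..." in the df.tail() output earlier), so the classifier may partly be learning the source rather than the content. One possible experiment, which was not performed in this article, is to strip that prefix before building the features. The regular expression, the helper name `strip_dateline` and the sample sentences below are my own assumptions.

```python
import re

# Assumed pattern: optional place name, "(Reuters)", then a dash.
DATELINE = re.compile(r"^\s*[^()]{0,60}\(Reuters\)\s*-\s*")

def strip_dateline(text):
    """Remove a leading 'CITY (Reuters) - ' prefix if one is present."""
    return DATELINE.sub("", text, count=1)

# Made-up sentences used only to exercise the helper.
cleaned = strip_dateline("MOSCOW (Reuters) - Example sentence.")
untouched = strip_dateline("Example sentence without a dateline.")
```

Retraining on text cleaned this way would show how much of the accuracy depends on the dateline alone.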
Finally, let's save the model.
filename = 'model_for_trump.sav'
pickle.dump(classifier, open(filename, 'wb'))
The model is now saved.
If you run the following code after the Step 2 code above and paste in a news text, its authenticity will be judged.
As mentioned at the beginning, this is only a computed result and I cannot guarantee that it is actually correct, so please use it at your own risk.
def extract_features(words):
    #Store the words in a dictionary in the same format as the lists built earlier
    return dict([(word, True) for word in words])
#Read the text typed in on the spot
input_review = input()
print("Fact Check:")
#extract_features(input_review)
print("\nNews Text:",input_review)
features = extract_features(input_review.split())
print(features)
probabilities = classifier.prob_classify(features)
predicted_truth = probabilities.max()
if predicted_truth == 0:
    answer = "Fake News!!!"
else:
    answer = "True News!!!"
print("Predicted Answer:", answer)
print("Probability:", round(probabilities.prob(predicted_truth), 2))
In the final step, we will finally verify Mr. Trump's tweet.
Let's start by loading files and libraries.
import pandas as pd
import pickle
trump = pd.read_csv("trump_for_judging.csv")
#Load Trump Tweets
classifier = pickle.load(open('model_for_trump.sav', 'rb'))
#Load the saved model
import matplotlib.pyplot as plt
import seaborn as sns
#Used at the end to plot the True/Fake ratio
Next, create a function that determines authenticity. This is a slight tweak to the last code in the previous step.
def FactCheck(event):
    global answer
    global percentage
    line = event.split(" ")
    dic = {}
    for word in line:
        dic[word] = True
    probabilities = classifier.prob_classify(dic)
    predicted_truth = probabilities.max()
    percentage = round(probabilities.prob(predicted_truth))
    if predicted_truth == 0:
        answer = "Fake"
    else:
        answer = "True"
    return answer
Here, each text is judged True or Fake, and a number showing how probable the model considers that verdict can also be output.
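A variant that returns both values instead of writing them to global variables might look like the sketch below. It is not the function used for the results in this article: the name `fact_check`, the toy training data and the two-decimal rounding (borrowed from the interactive snippet in the previous step) are my assumptions, and the toy classifier exists only to make the example runnable.

```python
from nltk.classify import NaiveBayesClassifier

def fact_check(classifier, text):
    """Return (verdict, probability) for one text without using globals."""
    features = {word: True for word in text.split(" ")}
    probabilities = classifier.prob_classify(features)
    label = probabilities.max()
    verdict = "Fake" if label == 0 else "True"
    return verdict, round(probabilities.prob(label), 2)

# Toy training data, invented only to make the sketch runnable.
toy_train = [
    ({"(Reuters)": True, "said": True}, 1),
    ({"(Reuters)": True, "officials": True}, 1),
    ({"screenshot": True, "Image:": True}, 0),
    ({"screenshot": True, "racists": True}, 0),
]
toy_classifier = NaiveBayesClassifier.train(toy_train)

verdict, probability = fact_check(toy_classifier, "(Reuters) said")
```

With the real model, the same call would be `fact_check(classifier, v["content"])` inside the loop over the tweets.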
Next, I run the judgment on every tweet and write each result into the DataFrame loaded from the CSV file earlier.
for i, v in trump.iterrows():
    FactCheck(v["content"])
    trump.at[trump.index[i], 'Truth'] = answer
    trump.at[trump.index[i], 'Percentage'] = percentage
trump.head()
content | Truth | Percentage |
---|---|---|
0 | Be sure to tune in and watch Donald Trump on L... | Fake |
1 | Donald Trump will be appearing on The View tom... | Fake |
2 | Donald Trump reads Top Ten Financial Tips on L... | Fake |
3 | New Blog Post: Celebrity Apprentice Finale and... | Fake |
4 | "My persona will never be that of a wallflower... | True |
With this, we were able to output the judgment result and numerical value for the tweet.
Let's graph the ratio of authenticity.
f,ax=plt.subplots(1,2,figsize=(18,8))
trump['Truth'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Truth')
ax[0].set_ylabel('')
sns.countplot(x='Truth',data=trump,ax=ax[1])
ax[1].set_title('Truth')
plt.show()
trump.describe()
Percentage |
---|
count |
mean |
std |
min |
25% |
50% |
75% |
max |
True accounts for 13.6%, so roughly 6 out of 7 tweets were judged Fake.
Also, the Percentage value is 1 for every tweet, which would mean the model is 100% sure of every verdict.
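The "6 out of 7" figure is an approximation: 6/7 is about 85.7%, close to the 86.4% of tweets that were not judged True. A small sketch of computing such shares from a list of verdicts follows; the function name `verdict_shares` and the sample list are hypothetical, and in pandas the equivalent would be `trump['Truth'].value_counts(normalize=True)`.

```python
from collections import Counter

def verdict_shares(verdicts):
    """Return each verdict's share of the total as a fraction between 0 and 1."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    return {verdict: count / total for verdict, count in counts.items()}

# Hypothetical verdicts, chosen so that exactly 1 in 7 is True.
shares = verdict_shares(["Fake"] * 6 + ["True"])
```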
This time, I built a model from a Kaggle dataset in which the authenticity of each news item is labeled, and used it to judge the authenticity of Mr. Trump's tweets.
As a result, about 6 out of 7 of his tweets were judged likely to be Fake.
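I'm wondering why all the percentages are 1. The likely cause is that FactCheck calls round() without a digits argument, so the probability is rounded to the nearest integer.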
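A quick check of how Python's built-in round() behaves is relevant here, because FactCheck stores `round(probabilities.prob(predicted_truth))`. The probability value 0.87 below is made up for illustration.

```python
probability = 0.87  # hypothetical probability of the winning class

as_integer = round(probability)        # no digits argument: nearest integer
two_decimals = round(probability, 2)   # keeps two decimal places
```

Since the winning class of a two-class model has a probability of at least 0.5, rounding to an integer almost always gives 1, whereas `round(..., 2)` keeps the value itself.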
As I mentioned at the beginning, this is only a computed result, so there is no guarantee that it matches the facts.
Also, I have no intention of defending or criticizing anyone of any particular political position, and I hope you will understand that once again.
It's been a long article, but thank you for reading to the end.