In my last article, I tried clustering Aozora Bunko books with Doc2Vec. It worked to some extent, but honestly the results were underwhelming. So this time, instead of Doc2Vec, I will use a library called fastText to categorize Yahoo News articles.
fastText is an open-source natural language processing library developed by Facebook. It is feature-rich, achieves good accuracy, and is very fast at both training and prediction. Its main functions are text classification by supervised learning and word-vector generation by unsupervised learning.
This time, I will use the supervised classification function to predict article categories.
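Before getting into the data, here is a minimal sketch of those two entry points in the fastText Python API. This is my own illustration, assuming placeholder file names ('train.txt' and 'corpus.txt'); it is not part of the analysis below.

import fasttext

# Supervised learning: train a classifier from a file of __label__-annotated lines
clf = fasttext.train_supervised(input='train.txt')  # 'train.txt' is a placeholder
print(clf.predict('text to classify'))

# Unsupervised learning: learn word vectors from raw text
wv = fasttext.train_unsupervised('corpus.txt', model='skipgram')  # placeholder path
print(wv.get_word_vector('word'))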
For more information, see the official fastText reference! The GitHub page describes the Python API in detail!
- Docker environment → here
import pandas as pd
import numpy as np
import re
import MeCab
import fasttext
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.decomposition import PCA
import japanize_matplotlib
Load the data obtained with the code introduced in Scraping Yahoo News. Since this was collected just after the US presidential election, there is a lot of international news.
df = pd.read_csv('./YahooNews.csv')
df
| | title | category | text |
|---|---|---|---|
| 0 | Loss? Azato cute Yoshioka Riho | Entertainment | Actress Yoshioka Riho 27's second photo book collection, her first in two years, by Asami Kiyokawa, Shueisha ... |
| 1 | Devil's "sacred places" moisten tourist destinations | Economy | The movie version of Kimetsu no Yaiba, which has recorded an exceptional box-office hit despite the corona crisis, is not limited to the movie ... |
| 2 | Dentsu G: corona halves operating income | Economy | The Dentsu Group announced on the 10th its consolidated financial results for the fiscal year ending December 2020; revenue equivalent to sales ... |
| 3 | Hong Kong Democrats all resign | International | Beijing (Kyodo), 12th: According to Xinhua News Agency, China's Standing Committee of the National People's Congress on the 11th, regarding the qualifications of members of the 70-seat Hong Kong Legislative Council ... |
| 4 | Expert panel: infections rising nationwide | Domestic | The advisory board of experts to the Ministry of Health, Labour and Welfare on measures against the new coronavirus met on the 11th ... |
| ... | ... | ... | ... |
| 512 | 4 people stabbed around the White House | International | According to US NBC TV and others, supporters of both President Trump and former Vice President Biden gathered over the US presidential election ... |
| 513 | One-sided victory declaration; the US reaction | International | FNN Prime Online, Washington: 10 a.m. on the 4th, before the US presidential election was decided ... |
| 514 | NY stocks continue to rise, briefly up over $600 | International | New York (Jiji Press): Votes are being counted in the US presidential election; with the close race between candidates Trump and Biden continuing, on the morning of the 4th ... |
| 515 | Sega Sammy withdraws from arcades | Economy | SEGA SAMMY HOLDINGS announced on the 4th that shares in SEGA Entertainment Tokyo, a consolidated subsidiary that operates amusement facilities ... |
| 516 | China to allow weapons use for maritime security | International | Beijing (Kyodo): China's National People's Congress on the 4th, stipulating the authority of the China Coast Guard, which is responsible for maritime security ... |

517 rows × 3 columns
# Specify NEologd as the MeCab dictionary.
# mecab is for morphological analysis, wakati is for word segmentation.
mecab = MeCab.Tagger('-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/')
wakati = MeCab.Tagger("-Owakati -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/")
# Define a function that performs morphological analysis.
# Given a file it writes a file; given a string it returns a string. Switch with the `file` argument.
# If you only need word segmentation, pass mecab=wakati as an argument.
def MecabMorphologicalAnalysis(path='./text.txt', output_file='wakati.txt', mecab=mecab, file=False):
    mecab_text = ''
    if file:
        with open(path) as f:
            for line in f:
                mecab_text += mecab.parse(line)
        with open(output_file, 'w') as f:
            print(mecab_text, file=f)
    else:
        for path in path.split('\n'):
            mecab_text += mecab.parse(path)
        return mecab_text
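As a quick check of the string mode, passing raw text with mecab=wakati should return the text segmented into space-separated words (the sample sentence below is arbitrary):

# Quick check: word-segment an arbitrary sample string
sample = '今日はいい天気なので散歩に行きました。'
print(MecabMorphologicalAnalysis(sample, mecab=wakati))
# e.g. 今日 は いい 天気 な ので 散歩 に 行き まし た 。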
# Returns the cosine similarity of v1 and v2.
def cos_sim(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
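As a sanity check with toy vectors (my own illustration): identical directions should give 1.0 and orthogonal directions 0.0.

# Sanity check with toy vectors (illustrative only)
print(cos_sim(np.array([1, 0]), np.array([1, 0])))  # -> 1.0
print(cos_sim(np.array([1, 0]), np.array([0, 1])))  # -> 0.0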
Next, transform the data into the shape fastText expects. fastText lets you perform supervised learning easily by formatting the data as shown below. For details, see the official tutorial.
__label__sauce __label__cheese how much does potato starch affect a cheese sauce recipe ?
__label__food-safety __label__acidity dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove how do i cover up the white spots on my cast iron stove ?
__label__restaurant michelin three star restaurant; but if the chef is not there
Perform the following steps to shape the data into the form above.

① Prepend __label__ to each news category → store in a list
② Word-segment each text with the MecabMorphologicalAnalysis function defined above → store in a list
③ Split into train data and valid data using train_test_split
④ For train and valid respectively, join category and text and save to a file
# ①
cat_lst = ['__label__' + cat for cat in df.category]
print("cat_lst[:5]:", cat_lst[:5]) #Check the contents
print("len(cat_lst):", len(cat_lst)) #Check the number of labels
cat_lst[:5]: ['__label__Entertainment', '__label__Economy', '__label__Economy', '__label__International', '__label__Domestic']
len(cat_lst): 517
# ②
text_lst = [MecabMorphologicalAnalysis(text, mecab=wakati) for text in df.text]
print("text_lst[0][:50]:", text_lst[0][:50]) #Check the first line
print("text_lst[1][:50]:", text_lst[1][:50]) #Check the second line
print("len(text_lst):", len(text_lst)) #Check the number of articles
text_lst[0][:50]: Actress Yoshioka Riho 27's second photo book, her first in two years, Riho Collection by Asami Kiyok
text_lst[1][:50]: The movie version of Kimetsu no Yaiba, which has recorded an exceptional box-office hit despite the corona
len(text_lst): 517
# ③
text_train, text_valid, cat_train, cat_valid = train_test_split(
text_lst, cat_lst, test_size=0.2, random_state=0, stratify=cat_lst
)
# ④
with open('./news.train', mode='w') as f:
    for i in range(len(text_train)):
        f.write(cat_train[i] + ' ' + text_train[i])

with open('./news.valid', mode='w') as f:
    for i in range(len(text_valid)):
        f.write(cat_valid[i] + ' ' + text_valid[i])
fastText makes supervised learning easy with train_supervised. You can add n-gram features by passing the wordNgrams argument, or set loss to hs to use hierarchical softmax for faster training. It really is feature-rich!
Training is convenient, but a distinctive feature of fastText is that you can evaluate accuracy immediately with model.test. As shown below, the accuracy is quite good even on the valid data.
model = fasttext.train_supervised(input='./news.train', lr=0.5, epoch=500,
                                  wordNgrams=3, loss='ova', dim=300, bucket=200000)
print("TrainData:", model.test('news.train'))
print("Valid", model.test('news.valid'))
TrainData: (413, 1.0, 1.0)
Valid (104, 0.75, 0.75)
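model.test returns the number of examples, precision@1, and recall@1, so the 104 validation articles are classified with about 75% precision. For a per-category breakdown, the Python API also provides test_label; a minimal sketch:

# Per-label precision/recall on the validation data (sketch)
for label, scores in model.test_label('news.valid').items():
    print(label, scores)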
Let's check the model's accuracy using the valid data, which was not used for training.

① Store the contents of the valid data in l_strip
② Store label, text, and size in lists: label is the news category, text is the body, and size is the probability of the model's prediction; the necessary parts are extracted with regular expressions
③ Take out articles one by one and predict their categories; predict returns the top k labels in descending order of probability, followed by an array of the corresponding probabilities. All three examples below are classified correctly.
# ①
with open("news.valid") as f:
l_strip = [s.strip() for s in f.readlines()] # strip()Newline character removal by using
# ②
labels = []
texts = []
sizes = []
for t in l_strip:
    labels.append(re.findall('__label__(.*?) ', t)[0])
    texts.append(re.findall(' (.*)', t)[0])
    sizes.append(model.predict(re.findall(' (.*)', t))[1][0][0])
# ③-1
print("<{}>".format(labels[0]))
print(texts[0])
print(model.predict(texts[0], k=3))
# ③-2
print("<{}>".format(labels[1]))
print(texts[1])
print(model.predict(texts[1], k=3))
# ③-3
print("<{}>".format(labels[2]))
print(texts[2])
print(model.predict(texts[2], k=3))
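Since training took 500 epochs, it is worth saving the trained model for reuse. A minimal sketch using fastText's save_model/load_model (the file name is a placeholder of my choosing):

# Save the trained model and reload it later (file name is a placeholder)
model.save_model('news_model.bin')
model = fasttext.load_model('news_model.bin')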
We could end the analysis here, but let's use fastText's get_sentence_vector function to obtain a vector for each article and analyze further.

① Obtain a vector for each article and store it in a list
② Convert the vectors, labels, and sizes to numpy arrays (labels and sizes were obtained earlier)
③ Standardize the vectors using StandardScaler
④ Reduce the dimensionality by principal component analysis (PCA)
⑤ Compute the similarity between articles with the cos_sim function defined above; articles in the same category should show the highest similarity
⑥ Plot the vectors in two dimensions; the size of each point varies with the value in sizes (sizes holds the prediction probabilities)
# ①
vectors = []
for t in texts:
    vectors.append(model.get_sentence_vector(t))
# ②
vectors = np.array(vectors)
labels = np.array(labels)
sizes = np.array(sizes)
# ③
ss = preprocessing.StandardScaler()
vectors_std = ss.fit_transform(vectors)
# ④
pca = PCA()
pca.fit(vectors_std)
feature = pca.transform(vectors_std)
feature = feature[:, :2]
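Since only the first two principal components are kept for plotting, it is worth checking how much of the variance they capture; a quick check using sklearn's explained_variance_ratio_:

# Fraction of variance captured by the first two components
print(pca.explained_variance_ratio_[:2].sum())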
# ⑤-1
print("<{}><{}>".format(labels[0], labels[1]))
cos_sim(vectors[0], vectors[1])
# ⑤-2
print("<{}><{}>".format(labels[1], labels[2]))
cos_sim(vectors[1], vectors[2])
# ⑤-3
print("<{}><{}>".format(labels[0], labels[2]))
cos_sim(vectors[0], vectors[2])
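These are only spot checks; to back up the claim that same-category articles are most similar, one could average the cosine similarity within and across categories. A small sketch of my own, not part of the original analysis:

# Mean cosine similarity within vs. across categories (illustrative)
import itertools
same, diff = [], []
for i, j in itertools.combinations(range(len(vectors)), 2):
    (same if labels[i] == labels[j] else diff).append(cos_sim(vectors[i], vectors[j]))
print('within-category mean:', np.mean(same))
print('between-category mean:', np.mean(diff))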
# ⑥
x0, y0, z0 = feature[labels=='Entertainment', 0], feature[labels=='Entertainment', 1], sizes[labels=='Entertainment']*1000
x1, y1, z1 = feature[labels=='Sports', 0], feature[labels=='Sports', 1], sizes[labels=='Sports']*1000
x2, y2, z2 = feature[labels=='life', 0], feature[labels=='life', 1], sizes[labels=='life']*1000
x3, y3, z3 = feature[labels=='Domestic', 0], feature[labels=='Domestic', 1], sizes[labels=='Domestic']*1000
x4, y4, z4 = feature[labels=='international', 0], feature[labels=='international', 1], sizes[labels=='international']*1000
x5, y5, z5 = feature[labels=='area', 0], feature[labels=='area', 1], sizes[labels=='area']*1000
x6, y6, z6 = feature[labels=='Economy', 0], feature[labels=='Economy', 1], sizes[labels=='Economy']*1000
plt.figure(figsize=(14, 10))
plt.rcParams["font.size"]=20
plt.scatter(x0, y0, label="Entertainment", s=z0)
plt.scatter(x1, y1, label="Sports", s=z1)
plt.scatter(x2, y2, label="life", s=z2)
plt.scatter(x3, y3, label="Domestic", s=z3)
plt.scatter(x4, y4, label="international", s=z4)
plt.scatter(x5, y5, label="area", s=z5)
plt.scatter(x6, y6, label="Economy", s=z6)
plt.title("Yahoo news")
plt.xlabel('1st dimension')
plt.ylabel('2nd dimension')
plt.legend(title="category")
plt.show()
- "Entertainment" and "Sports" are very close. They overlap considerably in the first dimension but are clearly separated in the second. This makes sense.
- "International" and "Domestic" are well separated, with "Economy" in between. This is also convincing.
- "Sports" and "Domestic" are close; perhaps this is because Yahoo News carries more domestic sports articles than overseas ones?
- "Region" is plotted near "Domestic", and its points are drawn small because the prediction probabilities are low. Indeed, reading a "Region" article, it may be hard to tell whether it should be "Domestic" or "Region".
Overall, I think the clustering worked well.

This time the plot separated neatly by category, probably because I trained with supervised learning on the category labels. Still, I think it is valuable to be able to build a model that cleanly clusters valid data that was not used for training. It would also be interesting if unsupervised learning produced similar results, so next time I would like to try clustering with unsupervised learning! → Continued: Unsupervised learning
References:
- Yahoo! News
- Clustering books from Aozora Bunko with Doc2Vec
- fastText
- fastText GitHub (fastText/python)
- Build a MeCab (NEologd dictionary) environment with Docker (Ubuntu)
- Scraping Yahoo News
- fastText tutorial (Text classification)
- [Python/NumPy] How to find cosine similarity
- Understanding Principal Component Analysis in Python
- matplotlib: Scatter plots with a legend