Last time, I performed supervised learning and clustering on Yahoo News article data using fastText's train_supervised method. This time, let's perform unsupervised learning with fastText's train_unsupervised method and see whether the articles can be clustered as cleanly as last time.
① Load the libraries. ② Create a file called utility.py that stores the functions written so far, and load the ones needed this time (a rough sketch of these helpers appears after the table below). ③ Use the YN function to fetch Yahoo News articles; it retrieves roughly 500 articles in about 10 minutes. This function was introduced here.
# ①
import pandas as pd, numpy as np
from sklearn.model_selection import train_test_split
import fasttext
from sklearn import preprocessing
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import japanize_matplotlib
import re
# ②
import utility as util
wakati = util.wakati
M2A = util.MecabMorphologicalAnalysis
YN = util.YahooNews
cos_sim = util.cos_sim
# ③
df = YN(1000)
# df = pd.read_csv('./YahooNews.csv') #If the data has already been acquired, read it and start analysis!
df
[====================] 502 Articles
| | title | category | text |
|---|---|---|---|
| 0 | Pana to move to a holding company in '22 | Economy | Bloomberg Panasonic announced on the 13th that it will shift to a holding company structure in April 2022 ... |
| 1 | Over 1,700 people infected nationwide, the highest number ever | Domestic | Nippon News Network NNN The number of new coronavirus infections on the 13th ... |
| 2 | Mako's marriage respects Their Majesties' wishes | Domestic | The Imperial Household Agency, regarding Mako, the eldest daughter of Prince and Princess Akishino, and Kei Komuro, 29, a classmate from her days at International Christian University ICU ... |
| 3 | Mr. Trump's advice splits | International | CNN With President Trump's defeat in the U.S. presidential election all but certain, those he trusts most as he devises his next move ... |
| 4 | Minister of Land, Infrastructure, Transport and Tourism: "I want to extend GoTo" | Domestic | Copyright C Japan News Network All rights reser... |
| ... | ... | ... | ... |
| 497 | Mr. Biden to give a public TV speech | International | AFP-Jiji US Democratic presidential candidate Joe Biden, from the night of the 6th to the morning of the 7th Japan time ... |
| 498 | Clash and riot alerts across the United States | International | All Nippon News Network ANN With the winner of the presidential election still undecided in the United States ... |
| 499 | Japan and China agree to resume business travel | Domestic | Travel by business people, restricted by the Japanese and Chinese governments to prevent the new coronavirus, will resume in the middle of this month ... |
| 500 | Presidential election: Georgia to recount | International | AFP-Jiji update In the US presidential election, Democrat Joe Biden and Republican Donald Tru... |
| 501 | IOC President Bach to visit Japan on the 15th | Sports | Thomas Bach, President of the International Olympic Committee IOC, regarding the Tokyo Olympics and Paralympics postponed to next summer ... |
502 rows × 3 columns
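For reference, utility.py itself is not shown in this article. As a rough idea only, the tokenizer helpers might look something like the sketch below (an assumed implementation, not the author's actual code; the YahooNews scraper and cos_sim are omitted here):

```python
# Assumed sketch of the utility.py tokenizer helpers (the real code may differ,
# e.g. it may filter tokens by part of speech).
import MeCab

# MeCab tokenizer in wakati mode: outputs surface forms separated by spaces.
wakati = MeCab.Tagger('-Owakati')

def MecabMorphologicalAnalysis(text, mecab=wakati):
    """Return the text as one space-separated line of tokens (ends with a newline)."""
    return mecab.parse(text)
```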
I didn't mention it last time, but this time I also make use of subwords, one of fastText's major features. Subwords split a word into smaller pieces (character n-grams) so that the model can capture the relatedness of words that share common parts, such as "Go" and "Going". Enabling subwords improved the accuracy, so I use them here; on the other hand, they reportedly have an adverse effect on katakana words. Subwords are enabled by passing minn and maxn as arguments to train_supervised and train_unsupervised; note that the default values differ between the two functions. See GitHub for details.
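As a quick illustration (the word is hypothetical, and this assumes a model already trained with subwords enabled, e.g. minn=3, maxn=5 as below), the Python bindings let you inspect the character n-grams associated with a word:

```python
# List the character n-grams fastText stores for a word (plus the word itself);
# '<' and '>' mark word boundaries.
subwords, indices = model.get_subwords("Going")
print(subwords)  # e.g. ['Going', '<Go', '<Goi', ..., 'ing>', 'ng>']
```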
First, for comparison, let's run the whole pipeline from last time in one go.
# Store category labels and tokenized article bodies in lists
cat_lst = ['__label__' + cat for cat in df.category]
text_lst = [M2A(text, mecab=wakati) for text in df.text]
# Split into train and valid sets
text_train, text_valid, cat_train, cat_valid = train_test_split(
    text_lst, cat_lst, test_size=0.2, random_state=0, stratify=cat_lst
)
# Write the train and valid files
with open('./s_train', mode='w') as f:
    for i in range(len(text_train)):
        f.write(cat_train[i] + ' ' + text_train[i])
with open('./s_valid', mode='w') as f:
    for i in range(len(text_valid)):
        f.write(cat_valid[i] + ' ' + text_valid[i])
# Train the model
model = fasttext.train_supervised(input='./s_train', lr=0.5, epoch=500, minn=3, maxn=5,
                                  wordNgrams=3, loss='ova', dim=300, bucket=200000)
# Check the model's accuracy: test() returns (number of samples, precision@1, recall@1)
# print("TrainData:", model.test('./s_train'))
print("ValidData:", model.test('./s_valid'))
# Prepare a 2D plot from the valid data
with open("./s_valid") as f:
    l_strip = [s.strip() for s in f.readlines()]  # remove trailing newlines with strip()
labels = []
texts = []
sizes = []
for t in l_strip:
    labels.append(re.findall('__label__(.*?) ', t)[0])
    texts.append(re.findall(' (.*)', t)[0])
    sizes.append(model.predict(re.findall(' (.*)', t))[1][0][0])
# Generate vectors from the valid article bodies
vectors = []
for t in texts:
    vectors.append(model.get_sentence_vector(t))
# Convert to numpy arrays
vectors = np.array(vectors)
labels = np.array(labels)
sizes = np.array(sizes)
#Standardization
ss = preprocessing.StandardScaler()
vectors_std = ss.fit_transform(vectors)
#Dimensionality reduction
pca = PCA()
pca.fit(vectors_std)
feature = pca.transform(vectors_std)
feature = feature[:, :2]
#plot
x0, y0, z0 = feature[labels=='Entertainment', 0], feature[labels=='Entertainment', 1], sizes[labels=='Entertainment']*1000
x1, y1, z1 = feature[labels=='Sports', 0], feature[labels=='Sports', 1], sizes[labels=='Sports']*1000
x2, y2, z2 = feature[labels=='life', 0], feature[labels=='life', 1], sizes[labels=='life']*1000
x3, y3, z3 = feature[labels=='Domestic', 0], feature[labels=='Domestic', 1], sizes[labels=='Domestic']*1000
x4, y4, z4 = feature[labels=='international', 0], feature[labels=='international', 1], sizes[labels=='international']*1000
x5, y5, z5 = feature[labels=='area', 0], feature[labels=='area', 1], sizes[labels=='area']*1000
x6, y6, z6 = feature[labels=='Economy', 0], feature[labels=='Economy', 1], sizes[labels=='Economy']*1000
plt.figure(figsize=(14, 10))
plt.rcParams["font.size"]=20
plt.scatter(x0, y0, label="Entertainment", s=z0)
plt.scatter(x1, y1, label="Sports", s=z1)
plt.scatter(x2, y2, label="life", s=z2)
plt.scatter(x3, y3, label="Domestic", s=z3)
plt.scatter(x4, y4, label="international", s=z4)
plt.scatter(x5, y5, label="area", s=z5)
plt.scatter(x6, y6, label="Economy", s=z6)
plt.title("Yahoo news")
plt.xlabel('1st dimension')
plt.ylabel('2nd dimension')
plt.legend(title="category")
plt.show()
ValidData: (101, 0.801980198019802, 0.801980198019802)
Unsupervised learning does not require labels, so all we have to do is split the data for training. ① Store the categories in cat_lst and the article bodies, tokenized with the M2A function, in text_lst. ② Split into train and valid data. ③ Write the texts to files.
# ①
cat_lst = [cat for cat in df.category]
text_lst = [M2A(text, mecab=wakati) for text in df.text]
# ②
text_train, text_valid, cat_train, cat_valid = train_test_split(
    text_lst, cat_lst, test_size=0.2, random_state=0, stratify=cat_lst
)
# ③
with open('./u_train', mode='w') as f:
    for i in range(len(text_train)):
        f.write(text_train[i])
with open('./u_valid', mode='w') as f:
    for i in range(len(text_valid)):
        f.write(text_valid[i])
Unsupervised learning uses train_unsupervised.
model = fasttext.train_unsupervised('./u_train', epoch=500, lr=0.01, minn=3, maxn=5, dim=300)
Use the trained model to compute the similarity between article contents.
① Generate sentence vectors from the trained model with the get_sentence_vector method.
② Display the article categories and texts.
③ Compute the cosine similarity with the cos_sim function loaded above (a minimal sketch of this helper follows the code below).
# ①
vectors = []
for t in text_train:
    vectors.append(model.get_sentence_vector(t.strip()))
# ②
print("<{}>".format(cat_train[0]))
print(text_train[0][:200], end="\n\n")
print("<{}>".format(cat_train[1]))
print(text_train[1][:200], end="\n\n")
print("<{}>".format(cat_train[2]))
print(text_train[2][:200], end="\n\n")
# ③
print("<{}><{}>".format(cat_train[0], cat_train[1]), cos_sim(vectors[0], vectors[1]))
print("<{}><{}>".format(cat_train[1], cat_train[2]), cos_sim(vectors[1], vectors[2]))
print("<{}><{}>".format(cat_train[0], cat_train[2]), cos_sim(vectors[0], vectors[2]))
Finally, let's plot in two dimensions.
① Convert the vectors and labels to numpy arrays.
② Standardize the vectors.
③ Reduce the vector dimensionality with PCA.
④ Plot with matplotlib. Unlike train_supervised, we cannot get a prediction probability, so all points are drawn at the same size.
# ①
vectors = np.array(vectors)
labels = np.array(cat_train)
# ②
ss = preprocessing.StandardScaler()
vectors_std = ss.fit_transform(vectors)
# ③
pca = PCA()
pca.fit(vectors_std)
feature = pca.transform(vectors_std)
feature = feature[:, :2]
# ④
x0, y0 = feature[labels=='Entertainment', 0], feature[labels=='Entertainment', 1]
x1, y1 = feature[labels=='Sports', 0], feature[labels=='Sports', 1]
x2, y2 = feature[labels=='life', 0], feature[labels=='life', 1]
x3, y3 = feature[labels=='Domestic', 0], feature[labels=='Domestic', 1]
x4, y4 = feature[labels=='international', 0], feature[labels=='international', 1]
x5, y5 = feature[labels=='area', 0], feature[labels=='area', 1]
x6, y6 = feature[labels=='Economy', 0], feature[labels=='Economy', 1]
plt.figure(figsize=(14, 10))
plt.rcParams["font.size"]=20
plt.scatter(x0, y0, label="Entertainment", s=300)
plt.scatter(x1, y1, label="Sports", s=300)
plt.scatter(x2, y2, label="life", s=300)
plt.scatter(x3, y3, label="Domestic", s=300)
plt.scatter(x4, y4, label="international", s=300)
plt.scatter(x5, y5, label="area", s=300)
plt.scatter(x6, y6, label="Economy", s=300)
plt.title("Yahoo news")
plt.xlabel('1st dimension')
plt.ylabel('2nd dimension')
plt.legend(title="category")
plt.show()
Next, let's cluster the valid data. The steps are the same as for the train data, so I omit the explanation.
vectors = []
for t in text_valid:
    vectors.append(model.get_sentence_vector(t.strip()))
print("<{}>".format(cat_valid[0]))
print(text_valid[0][:200], end="\n\n")
print("<{}>".format(cat_valid[1]))
print(text_valid[1][:200], end="\n\n")
print("<{}>".format(cat_valid[2]))
print(text_valid[2][:200], end="\n\n")
print("<{}><{}>".format(cat_valid[0], cat_valid[1]), cos_sim(vectors[0], vectors[1]))
print("<{}><{}>".format(cat_valid[1], cat_valid[2]), cos_sim(vectors[1], vectors[2]))
print("<{}><{}>".format(cat_valid[0], cat_valid[2]), cos_sim(vectors[0], vectors[2]))
vectors = np.array(vectors)
labels = np.array(cat_valid)
ss = preprocessing.StandardScaler()
vectors_std = ss.fit_transform(vectors)
pca = PCA()
pca.fit(vectors_std)
feature = pca.transform(vectors_std)
feature = feature[:, :2]
x0, y0 = feature[labels=='Entertainment', 0], feature[labels=='Entertainment', 1]
x1, y1 = feature[labels=='Sports', 0], feature[labels=='Sports', 1]
x2, y2 = feature[labels=='life', 0], feature[labels=='life', 1]
x3, y3 = feature[labels=='Domestic', 0], feature[labels=='Domestic', 1]
x4, y4 = feature[labels=='international', 0], feature[labels=='international', 1]
x5, y5 = feature[labels=='area', 0], feature[labels=='area', 1]
x6, y6 = feature[labels=='Economy', 0], feature[labels=='Economy', 1]
plt.figure(figsize=(14, 10))
plt.rcParams["font.size"]=20
plt.scatter(x0, y0, label="Entertainment", s=300)
plt.scatter(x1, y1, label="Sports", s=300)
plt.scatter(x2, y2, label="life", s=300)
plt.scatter(x3, y3, label="Domestic", s=300)
plt.scatter(x4, y4, label="international", s=300)
plt.scatter(x5, y5, label="area", s=300)
plt.scatter(x6, y6, label="Economy", s=300)
plt.title("Yahoo news")
plt.xlabel('1st dimension')
plt.ylabel('2nd dimension')
plt.legend(title="category")
plt.show()
Supervised model (for comparison):
- By adopting subwords, the accuracy improved over the previous article: 0.75 → 0.80.
- As last time, "Entertainment" and "Sports" overlap a lot along the first dimension, but they are clearly separated along the second. It makes sense that these two categories sit close to each other.
- "International" and "Domestic" are well separated, with "Economy" between them. As last time, this is also convincing.
- In the previous article "Area" was not distinct, but this time it shows a strong characteristic along the second dimension.
- Overall, the classification looks cleaner than last time.
Unsupervised model, train data:
- There is a lot of overlap between the categories.
- "International" and "Domestic" are clearly separated.
- "Economy" usually sits between "International" and "Domestic", but this time it is spread over a wide range.
- "Entertainment" and "Sports" overlap heavily, though along the first dimension "Entertainment" sits slightly to the right and "Sports" to the left.
- "Area" has almost no overlap with "International", which makes sense. There is considerable overlap between "Domestic" and "Economy".
Unsupervised model, valid data:
- There is little overlap, and the categories are separated neatly overall.
- "Domestic" sits in the center, with "International" on the left and "Economy" toward the center. This is quite different from before.
- "Sports" and "Entertainment" are again quite close to each other, but "Sports" is spread over a wide range.
- "Area" is split into two clusters. The same tendency appears with supervised learning, so there may be some underlying criterion.
This time I analyzed the data with unsupervised learning, and I was surprised at how well the clustering worked. Since no labels are given, the model must be picking up "international"-ness or "sports"-ness from the text alone. I am very curious about which features the model actually acquires and reflects in the vectors. I want to dig deeper into fastText!
References:
- Yahoo! News
- fastText
- Explanation of how to install and use fastText published by Facebook
- Adverse effects of fastText subword
- fastText tutorial (Word representations)
- GitHub (fastText/python)
- fastText is amazing! Clustering "Yahoo! News"