[Python] fastText is pretty amazing! Clustering "Yahoo! News" with Unsupervised Learning

Last time, I performed supervised learning and clustering on Yahoo! News article data using fastText's train_supervised method. This time, let's perform unsupervised learning with the train_unsupervised method and see whether the articles can be clustered as cleanly as last time.

Development environment

Implementation start

① Load the libraries. ② Create a file called utility.py to store the functions created so far, and load the ones needed this time. ③ Use the YN function to fetch Yahoo! News articles; it retrieves about 500 articles in roughly 10 minutes. This function is introduced here.
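Before the implementation, here is a minimal, hypothetical sketch of what utility.py might contain. It is only meant to show the shape of the helpers used below (the real wakati/M2A/cos_sim were built in earlier articles, and the YahooNews scraper is omitted):

#utility.py (illustrative sketch, not the original file)
import numpy as np
import MeCab

#Tokenizer that returns space-separated tokens
wakati = MeCab.Tagger('-Owakati')

def MecabMorphologicalAnalysis(text, mecab=wakati):
    #Split Japanese text into space-separated words
    return mecab.parse(text)

def cos_sim(a, b):
    #Cosine similarity between two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))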

# ①
import pandas as pd, numpy as np
from sklearn.model_selection import train_test_split
import fasttext
from sklearn import preprocessing
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import japanize_matplotlib
import re
# ②
import utility as util 
wakati = util.wakati
M2A = util.MecabMorphologicalAnalysis
YN = util.YahooNews
cos_sim = util.cos_sim
# ③
df = YN(1000)
# df = pd.read_csv('./YahooNews.csv') #If the data has already been acquired, read it and start analysis!
df

[====================] 502 Articles

    | title | category | text
0   | Pana to move to a holding company in '22 | Economy | Bloomberg Panasonic announced on the 13th that it will shift to a holding company structure from April 2022 ...
1   | Over 1,700 infected nationwide, the highest number ever | Domestic | Nippon News Network NNN The number of new coronavirus infections on the 13th ...
2   | Mako's marriage respects Their Majesties | Domestic | The Imperial Household Agency: Princess Mako, the eldest daughter of Prince and Princess Akishino, and Kei Komuro, 29, a classmate from her International Christian University ICU days ...
3   | Mr. Trump's aides split | International | CNN With President Trump's defeat in the U.S. presidential election all but certain, his most trusted aides ... as he devises his next move ...
4   | Minister of Land, Infrastructure, Transport and Tourism: "I want to extend GoTo" | Domestic | CopyrightC Japan News Network All rights reser...
... | ... | ... | ...
497 | Mr. Biden to give a public TV speech | International | AFP Current Affairs US Democratic presidential candidate Joe Biden, on the night of the 6th into the morning of the 7th Japan time ...
498 | Collision and riot alerts across the United States | International | All Nippon News Network ANN With the winner of the presidential election still uncertain in the United States ...
499 | Japan and China agree to resume business travel | Domestic | Travel by business people, restricted by the Japanese and Chinese governments to contain the new coronavirus, will resume in the middle of this month ...
500 | Presidential election: Georgia to recount | International | AFP Current Affairs US presidential election: Democrat Joe Biden ... Republican Donald Tru...
501 | IOC President Bach to visit Japan on the 15th | Sports | Thomas Bach of the International Olympic Committee IOC, regarding the Tokyo Olympics and Paralympics postponed to next summer ...

502 rows × 3 columns

About subwords

I didn't mention it last time, but this time I will use subwords, a major feature of fastText. A subword is a piece of a word: fastText splits each word into smaller character n-grams ("subwords") to capture relationships between words. For example, it can learn that words with common parts, such as "Go" and "Going", are related. Applying subwords improved the accuracy here, so I use them. On the other hand, they reportedly have adverse effects on katakana words. Subwords are controlled by passing maxn and minn as arguments to train_supervised and train_unsupervised; note that the default values differ between the two. For more information, see GitHub.
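To make the idea concrete, here is a small illustrative sketch (not fastText's internal implementation) of how character n-grams are extracted from a word, following fastText's convention of wrapping each word in '<' and '>':

def char_ngrams(word, minn=3, maxn=5):
    #Wrap the word in boundary markers, then collect all n-grams of length minn..maxn
    w = '<' + word + '>'
    return [w[i:i+n] for n in range(minn, maxn + 1) for i in range(len(w) - n + 1)]

print(char_ngrams('Going', minn=3, maxn=3))
# ['<Go', 'Goi', 'oin', 'ing', 'ng>']

"Go" yields ['<Go', 'Go>'], so "Go" and "Going" share the subword '<Go', which is how their relatedness can be captured.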

Supervised learning (train_supervised)

First, for comparison, let's run the whole supervised pipeline from last time in one go.

#Store category and body in list respectively
cat_lst = ['__label__' + cat for cat in df.category]
text_lst = [M2A(text, mecab=wakati) for text in df.text]

#Divided into train and valid
text_train, text_valid, cat_train, cat_valid = train_test_split(
    text_lst, cat_lst, test_size=0.2, random_state=0, stratify=cat_lst
)

#Create train and valid files
with open('./s_train', mode='w') as f:
    for i in range(len(text_train)):
        f.write(cat_train[i] + ' '+ text_train[i])
        
with open('./s_valid', mode='w') as f:
    for i in range(len(text_valid)):
        f.write(cat_valid[i] + ' ' + text_valid[i])

#Model learning
model = fasttext.train_supervised(input='./s_train', lr=0.5, epoch=500, minn=3, maxn=5,
                                  wordNgrams=3, loss='ova', dim=300, bucket=200000)

#Check the accuracy of the model
# print("TrainData:", model.test('./s_news_train'))
print("ValidData:", model.test('./s_valid'))

#Preparing a 2D plot with valid data
with open("./s_valid") as f:
    l_strip = [s.strip() for s in f.readlines()]  #Remove trailing newlines with strip()
    
labels = []
texts = []
sizes = []
for t in l_strip:
    labels.append(re.findall('__label__(.*?) ', t)[0])
    texts.append(re.findall(' (.*)', t)[0])
    sizes.append(model.predict(re.findall(' (.*)', t))[1][0][0])

#Vector generation from valid article body
vectors = []
for t in texts:
    vectors.append(model.get_sentence_vector(t))

#Convert to numpy
vectors = np.array(vectors)
labels = np.array(labels)
sizes = np.array(sizes)

#Standardization
ss = preprocessing.StandardScaler()
vectors_std = ss.fit_transform(vectors)

#Dimensionality reduction
pca = PCA()
pca.fit(vectors_std)
feature = pca.transform(vectors_std)
feature = feature[:, :2]

#plot
x0, y0, z0 = feature[labels=='Entertainment', 0], feature[labels=='Entertainment', 1], sizes[labels=='Entertainment']*1000
x1, y1, z1 = feature[labels=='Sports', 0], feature[labels=='Sports', 1], sizes[labels=='Sports']*1000
x2, y2, z2 = feature[labels=='life', 0], feature[labels=='life', 1], sizes[labels=='life']*1000
x3, y3, z3 = feature[labels=='Domestic', 0], feature[labels=='Domestic', 1], sizes[labels=='Domestic']*1000
x4, y4, z4 = feature[labels=='international', 0], feature[labels=='international', 1], sizes[labels=='international']*1000
x5, y5, z5 = feature[labels=='area', 0], feature[labels=='area', 1], sizes[labels=='area']*1000
x6, y6, z6 = feature[labels=='Economy', 0], feature[labels=='Economy', 1], sizes[labels=='Economy']*1000

plt.figure(figsize=(14, 10))
plt.rcParams["font.size"]=20
plt.scatter(x0, y0, label="Entertainment", s=z0)
plt.scatter(x1, y1, label="Sports", s=z1)
plt.scatter(x2, y2, label="life", s=z2)
plt.scatter(x3, y3, label="Domestic", s=z3)
plt.scatter(x4, y4, label="international", s=z4)
plt.scatter(x5, y5, label="area", s=z5)
plt.scatter(x6, y6, label="Economy", s=z6)
plt.title("Yahoo news")
plt.xlabel('1st dimension')
plt.ylabel('2nd dimension')
plt.legend(title="category")
plt.show()
ValidData: (101, 0.801980198019802, 0.801980198019802)
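A note on reading this output: model.test returns a tuple of (number of examples, precision@1, recall@1), so the 101 validation articles are classified with roughly 80% precision and recall. For example:

#model.test returns (N, precision@1, recall@1)
n, p1, r1 = model.test('./s_valid')
print("N: {}, P@1: {:.3f}, R@1: {:.3f}".format(n, p1, r1))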

[Figure: 2D PCA plot of the supervised model's sentence vectors (output_5_1.png)]

Unsupervised learning (train_unsupervised)

Data preparation

Unsupervised learning does not require labels, so all we have to do is prepare the text. ① Store the categories in cat_lst and the sentences, tokenized by the M2A function, in text_lst. ② Split into train data and valid data. ③ Save the texts to files.

# ①
cat_lst = [cat for cat in df.category]
text_lst = [M2A(text, mecab=wakati) for text in df.text]

# ②
text_train, text_valid, cat_train, cat_valid = train_test_split(
    text_lst, cat_lst, test_size=0.2, random_state=0, stratify=cat_lst
)

# ③
with open('./u_train', mode='w') as f:
    for i in range(len(text_train)):
        f.write(text_train[i])
        
with open('./u_valid', mode='w') as f:
    for i in range(len(text_valid)):
        f.write(text_valid[i])

Model learning

Unsupervised learning uses train_unsupervised (which trains a skipgram model by default).

model = fasttext.train_unsupervised('./u_train', epoch=500, lr=0.01, minn=3, maxn=5, dim=300)
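As an optional sanity check on what the model learned, you can peek at the vocabulary and query nearest neighbors. The token below is just an arbitrary example; any word that appears in the training corpus works:

#Peek at the vocabulary and at words similar to an example token
print(model.get_words()[:10])
print(model.get_nearest_neighbors('コロナ'))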

train data analysis

Comparison of sentence vector similarity

Use the trained model to compare the similarity of article contents. ① Generate sentence vectors from the trained model with the get_sentence_vector method. ② Display the category and text of three articles. ③ Compute their cosine similarities with the cos_sim function loaded above.

# ①
vectors = []
for t in text_train:
    vectors.append(model.get_sentence_vector(t.strip()))

# ②
print("<{}>".format(cat_train[0]))
print(text_train[0][:200], end="\n\n")
print("<{}>".format(cat_train[1]))
print(text_train[1][:200], end="\n\n")
print("<{}>".format(cat_train[2]))
print(text_train[2][:200], end="\n\n")

# ③
print("<{}><{}>".format(cat_train[0], cat_train[1]), cos_sim(vectors[0], vectors[1]))
print("<{}><{}>".format(cat_train[1], cat_train[2]), cos_sim(vectors[1], vectors[2]))
print("<{}><{}>".format(cat_train[0], cat_train[2]), cos_sim(vectors[0], vectors[2]))
<Domestic>
Shigeru Omi, chairman of the government's coronavirus countermeasures subcommittee, held an urgent press conference on the 9th, saying there is no doubt that infections are increasing nationwide, that the situation is likely to lead to a gradual but rapid expansion, and presenting urgent recommendations to the government ...

<Domestic>
Anri Kawai, the House of Councillors member accused of vote-buying in last July's House of Councillors election in violation of the Public Offices Election Act ... Anri Kawai, in a suit, repeatedly claimed that the money was election congratulations and sympathy payments, and argued that it was not illegal ...

<International>
AFP Current Affairs US President Donald Trump, who does not admit defeat in the presidential election, revealed in a Twitter post on the 9th that he has dismissed Secretary of Defense Mark Esper ... Mark Esper, further shaken by Mr. Trump's post ...

<Domestic><Domestic> 0.91201633
<Domestic><International> 0.9294117
<Domestic><International> 0.9201762
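Three hand-picked articles only tell us so much. As a hypothetical extension (not part of the original analysis), the average cosine similarity for every category pair could be computed like this:

#Hypothetical extension: mean cosine similarity per category pair
from itertools import combinations
from collections import defaultdict

pair_sims = defaultdict(list)
for (c1, v1), (c2, v2) in combinations(zip(cat_train, vectors), 2):
    pair_sims[tuple(sorted((c1, c2)))].append(cos_sim(v1, v2))

for pair, sims in sorted(pair_sims.items()):
    print(pair, sum(sims) / len(sims))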

2D plot

Finally, let's plot in two dimensions. ① Convert the vectors and labels to numpy arrays. ② Standardize the vectors. ③ Reduce the dimensionality with PCA. ④ Plot with matplotlib. Unlike train_supervised, we cannot get a prediction probability, so all points are the same size.

# ①
vectors = np.array(vectors)
labels = np.array(cat_train)


# ②
ss = preprocessing.StandardScaler()
vectors_std = ss.fit_transform(vectors)


# ③
pca = PCA()
pca.fit(vectors_std)
feature = pca.transform(vectors_std)
feature = feature[:, :2]


# ④
x0, y0 = feature[labels=='Entertainment', 0], feature[labels=='Entertainment', 1]
x1, y1 = feature[labels=='Sports', 0], feature[labels=='Sports', 1]
x2, y2 = feature[labels=='life', 0], feature[labels=='life', 1]
x3, y3 = feature[labels=='Domestic', 0], feature[labels=='Domestic', 1]
x4, y4 = feature[labels=='international', 0], feature[labels=='international', 1]
x5, y5 = feature[labels=='area', 0], feature[labels=='area', 1]
x6, y6 = feature[labels=='Economy', 0], feature[labels=='Economy', 1]

plt.figure(figsize=(14, 10))
plt.rcParams["font.size"]=20
plt.scatter(x0, y0, label="Entertainment", s=300)
plt.scatter(x1, y1, label="Sports", s=300)
plt.scatter(x2, y2, label="life", s=300)
plt.scatter(x3, y3, label="Domestic", s=300)
plt.scatter(x4, y4, label="international", s=300)
plt.scatter(x5, y5, label="area", s=300)
plt.scatter(x6, y6, label="Economy", s=300)
plt.title("Yahoo news")
plt.xlabel('1st dimension')
plt.ylabel('2nd dimension')
plt.legend(title="category")
plt.show()

[Figure: 2D PCA plot of the unsupervised model's sentence vectors, train data (output_13_0.png)]
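Since only the first two principal components are kept, it is worth checking (optionally) how much of the variance they actually explain:

#Proportion of variance explained by the first two principal components
print(pca.explained_variance_ratio_[:2])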

valid data analysis

Next, let's cluster the valid data. The procedure is the same as the train data analysis, so the explanation is omitted.

vectors = []
for t in text_valid:
    vectors.append(model.get_sentence_vector(t.strip()))

print("<{}>".format(cat_valid[0]))
print(text_valid[0][:200], end="\n\n")
print("<{}>".format(cat_valid[1]))
print(text_valid[1][:200], end="\n\n")
print("<{}>".format(cat_valid[2]))
print(text_valid[2][:200], end="\n\n")
print("<{}><{}>".format(cat_valid[0], cat_valid[1]), cos_sim(vectors[0], vectors[1]))
print("<{}><{}>".format(cat_valid[1], cat_valid[2]), cos_sim(vectors[1], vectors[2]))
print("<{}><{}>".format(cat_valid[0], cat_valid[2]), cos_sim(vectors[0], vectors[2]))
<Economy>
KNT-CT Holdings HD, which owns Kinki Nippon Tourist, announced on the 11th that it will cut about one-third of its roughly 7,000 group employees by March 2025, mainly through voluntary retirement, and close two-thirds of its 138 stores nationwide that handle personal travel by March 2022, due to the sharp decline in travel demand caused by the spread of the new coronavirus ...

<Domestic>
On the 9th, it was found that the number of people infected with the new coronavirus in Hokkaido was expected to exceed 200, the highest number ever. With more than 100 cases for the fifth consecutive day, the count is finally expected to reach the 200 level, and the infection is expected to spread further. A new cluster also appears to have been confirmed. In Hokkaido, 119 infected people were confirmed on the 5th ...

<Economy>
Hitachi announced on the 28th that it will encourage employees to take paid leave through January 8 next year, in response to the government's request to stagger holidays during the year-end and New Year period. Normally the holidays run from December 30 to January 3; about 150,000 people, including employees of group companies, will take leave, avoiding in-house year-end and New Year events and unnecessary, non-urgent meetings ...

<Economy><Domestic> 0.9284181
<Domestic><Economy> 0.90896636
<Economy><Economy> 0.9533808

vectors = np.array(vectors)
labels = np.array(cat_valid)

ss = preprocessing.StandardScaler()
vectors_std = ss.fit_transform(vectors)

pca = PCA()
pca.fit(vectors_std)
feature = pca.transform(vectors_std)
feature = feature[:, :2]

x0, y0 = feature[labels=='Entertainment', 0], feature[labels=='Entertainment', 1]
x1, y1 = feature[labels=='Sports', 0], feature[labels=='Sports', 1]
x2, y2 = feature[labels=='life', 0], feature[labels=='life', 1]
x3, y3 = feature[labels=='Domestic', 0], feature[labels=='Domestic', 1]
x4, y4 = feature[labels=='international', 0], feature[labels=='international', 1]
x5, y5 = feature[labels=='area', 0], feature[labels=='area', 1]
x6, y6 = feature[labels=='Economy', 0], feature[labels=='Economy', 1]

plt.figure(figsize=(14, 10))
plt.rcParams["font.size"]=20
plt.scatter(x0, y0, label="Entertainment", s=300)
plt.scatter(x1, y1, label="Sports", s=300)
plt.scatter(x2, y2, label="life", s=300)
plt.scatter(x3, y3, label="Domestic", s=300)
plt.scatter(x4, y4, label="international", s=300)
plt.scatter(x5, y5, label="area", s=300)
plt.scatter(x6, y6, label="Economy", s=300)
plt.title("Yahoo news")
plt.xlabel('1st dimension')
plt.ylabel('2nd dimension')
plt.legend(title="category")
plt.show()

[Figure: 2D PCA plot of the unsupervised model's sentence vectors, valid data (output_16_0.png)]

Consideration

Supervised learning

- By adopting subwords, the accuracy improved over the previous article: 0.75 → 0.80.
- As last time, "Entertainment" and "Sports" overlap considerably along the first dimension but are clearly separated along the second. It makes sense to me that these two categories are close.
- "International" and "Domestic" are well separated, with "Economy" sitting between them. As last time, this is also convincing.
- In the previous article, "area" was indistinct, but this time it shows strong characteristics along the second dimension.
- Overall, the categories look more clearly separated than last time.

Unsupervised learning

train data

- Each category overlaps the others considerably.
- "International" and "Domestic" are clearly separated.
- "Economy" usually sits between "International" and "Domestic", but this time it is spread over a wide range.
- "Entertainment" and "Sports" overlap a lot, but along the first dimension "Entertainment" sits slightly to the right and "Sports" slightly to the left.
- "area" has almost no overlap with "International", which is convincing.
- There is considerable overlap between "Domestic" and "Economy".

valid data

- There is little overlap overall; the data is classified fairly neatly.
- "Domestic" sits in the center, with "International" on the left and "Economy" also near the center. This is quite different from before.
- "Sports" and "Entertainment" are quite close this time, but "Sports" is spread over a wide range.
- "area" is split into two groups. The same tendency appears with supervised learning, so there may be some underlying criterion behind it.

This time I ran the analysis with unsupervised learning, and I was surprised by how well the classification worked. With no labels at all, the model extracts something like "international" or "sports" purely from the text. I am very curious about which features the model actually captures and reflects in the vectors. I want to dig deeper into fastText!
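One concrete way to start digging is to inspect which subwords (character n-grams) the trained model actually stores for a given token. The token below is only an illustrative example:

#List the subwords fastText uses for a token, together with their bucket indices
subwords, ids = model.get_subwords('コロナウイルス')
print(subwords)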

References

- Yahoo! News
- fastText
- Explanation of how to install and use fastText published by Facebook
- Adverse effects of fastText subword
- fastText tutorial (Word representations)
- GitHub (fastText/python)
- fastText is amazing! Clustering "Yahoo! News"