[PYTHON] fastText is amazing! Clustering "Yahoo! News"

In my previous article I tried to cluster Aozora Bunko books with Doc2Vec. It seemed to work to some extent, but honestly the results were underwhelming. So this time, instead of Doc2Vec, I will use a library called fastText to cluster Yahoo! News articles.

What is fastText

fastText is an open-source natural language processing library developed by Facebook. It is feature-rich, achieves good prediction accuracy, and is fast at prediction as well. Its main functions are text classification by supervised learning and word-vector generation by unsupervised learning.

This time, I will try to predict the article category using the classification function by supervised learning.

For more information, see the official fastText reference. The GitHub repository (fastText/python) describes the Python functions in detail.

Development environment

-- Docker (see "Build mecab (NEologd dictionary) environment with Docker (ubuntu)" in References)
-- JupyterLab

Implementation start

import pandas as pd, numpy as np
import re
import MeCab
import fasttext
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.decomposition import PCA
import japanize_matplotlib

News data acquisition

Load the data obtained using the code introduced in Scraping Yahoo News. It's just after the US presidential election, so there's a lot of international news.

df = pd.read_csv('./YahooNews.csv')
df

	title	category	text
0	Loss? Azato cute Yoshioka Riho	Entertainment	Actress Yoshioka Riho 27's second photo book collection for the first time in two years by Asami Kiyokawa Shueisha ...
1	Devil's "sacred place" Moisturizing tourist destination	Economy	The movie version of Kimetsu no Yaiba, which has recorded an exceptional blockbuster despite the corona wreck, is not limited to the movie ...
2	Dentsu G Corona halves operating income	Economy	The Dentsu Group announced on the 10th that the consolidated financial statements for the fiscal year ending December 2020 have 9 revenues equivalent to sales ...
3	Hong Kong Democrats, all resigned	International	12 According to Beijing Joint Xinhua News Agency, China's Standing Committee of the National People's Congress will qualify as a member of the Hong Kong legislative council with a fixed number of 70 on the 11th ...
4	Expert organization Increased infection nationwide	Domestic	The Ministry of Health, Labor and Welfare's expert organization advisory board meeting to advise on measures against the new coronavirus will be held on the 11th ...
...	...	...	...
512	4 people stabbed around the White House	International	According to US NBC TV and others, supporters of both President Trump and former Vice President Biden gathered in the US presidential election ...
513	One-sided victory declaration The US reaction is	International	FNN Prime Online Washington on the 4th, 10 am before the election day of the US presidential election ...
514	NY stocks continue to grow, temporarily over $ 600	International	New York current affairs are being counted. In the US presidential election, the close battle between both candidates Trump Biden continues, and on the morning of the 4th ...
515	Sega Sammy arcade withdrawal	Economy	SEGA SAMMY HOLDINGS shares in SEGA Entertainment Tokyo, a consolidated subsidiary that operates entertainment facilities on the 4th ...
516	China to use weapons for maritime security	International	Beijing Joint China National People's Congress National People's Congress on the 4th stipulates the authority of the China Coast Guard, which is responsible for maritime security ...

517 rows × 3 columns

Dictionary & function definition

# Use NEologd as the MeCab dictionary.
# mecab is for morphological analysis, wakati is for word segmentation.
mecab = MeCab.Tagger('-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/')
wakati = MeCab.Tagger("-Owakati -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/")


# Define a function that performs morphological analysis.
# With file=True it reads the file at path and writes the result to output_file;
# with file=False it treats path as the text itself and returns the analyzed string.
# To only split text into words, pass mecab=wakati.
def MecabMorphologicalAnalysis(path='./text.txt', output_file='wakati.txt', mecab=mecab, file=False):
    mecab_text = ''
    if file:
        with open(path) as f:
            for line in f:
                mecab_text += mecab.parse(line)
        with open(output_file, 'w') as f:
            print(mecab_text, file=f)
    else:
        # Here path holds the text itself; analyze it line by line.
        for line in path.split('\n'):
            mecab_text += mecab.parse(line)
        return mecab_text


#Outputs the cosine similarity between v1 and v2.
def cos_sim(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
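
As a quick check of what cos_sim returns, here is a minimal, dependency-free sketch of the same formula. The name cos_sim_plain and the sample vectors are made up for illustration; the article itself keeps using the NumPy version above.

```python
import math

def cos_sim_plain(v1, v2):
    """Cosine similarity of two equal-length sequences (illustrative helper)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

same_direction = cos_sim_plain([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cos_sim_plain([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cos_sim_plain([1.0, 2.0], [-1.0, -2.0])       # opposite direction
print(same_direction, orthogonal, opposite)
```

Values close to 1 mean the vectors point the same way, values near 0 mean they are unrelated, and values close to -1 mean they point in opposite directions. The length of the vectors does not matter, only their direction.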

Data preprocessing

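Next, transform the data into the shape fastText expects. fastText makes supervised learning easy once the data is formatted as shown below: each line holds one or more labels prefixed with __label__, followed by the text. For details, see the official tutorial (fastText tutorial (Text classification) in References).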

__label__sauce __label__cheese how much does potato starch affect a cheese sauce recipe ? 
__label__food-safety __label__acidity dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove how do i cover up the white spots on my cast iron stove ? 
__label__restaurant michelin three star restaurant; but if the chef is not there

Perform the following steps to produce that format.
① Prefix each news category with __label__ and store the results in a list.
② Split each article into words with the MecabMorphologicalAnalysis function defined above and store the results in a list.
③ Divide into train data and valid data using train_test_split.
④ For train and valid respectively, combine each category with its text and save to a file.

# ①
cat_lst = ['__label__' + cat for cat in df.category]
print("cat_lst[:5]:", cat_lst[:5]) #Check the contents
print("len(cat_lst):", len(cat_lst)) #Check the number of labels

cat_lst[:5]: ['__label__Entertainment', '__label__Economy', '__label__Economy', '__label__International', '__label__Domestic']
len(cat_lst): 517

# ②
text_lst = [MecabMorphologicalAnalysis(text, mecab=wakati) for text in df.text]
print("text_lst[0][:50]:", text_lst[0][:50]) #Check the first line
print("text_lst[1][:50]:", text_lst[1][:50]) #Check the second line
print("len(text_lst):", len(text_lst)) #Check the number of articles

text_lst[0][:50]: Actress Yoshioka Riho 27's second photo book for the first time in two years Riho Collection by Asami Kiyok
text_lst[1][:50]: The movie version of Kimetsu no Yaiba, which has recorded an exceptional blockbuster despite the corona wreck.
len(text_lst): 517

# ③
text_train, text_valid, cat_train, cat_valid = train_test_split(
    text_lst, cat_lst, test_size=0.2, random_state=0, stratify=cat_lst
)
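
Because stratify=cat_lst is passed, each category should appear in the train and valid sets in roughly the same proportion as in the full data. Here is a minimal sketch for checking that, assuming only lists of label strings. label_shares and the toy labels are hypothetical names and made-up data, not part of the original code.

```python
from collections import Counter

def label_shares(labels):
    """Return {label: fraction of all items} for a list of label strings."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

# Toy stand-ins for cat_train / cat_valid (made-up data for illustration).
toy_train = ['__label__Economy'] * 8 + ['__label__Sports'] * 4
toy_valid = ['__label__Economy'] * 2 + ['__label__Sports'] * 1

train_shares = label_shares(toy_train)
valid_shares = label_shares(toy_valid)
for label in sorted(train_shares):
    print(label, round(train_shares[label], 3), round(valid_shares.get(label, 0.0), 3))
```

With the real data, pass cat_train and cat_valid instead of the toy lists; the two columns should be close for every category.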


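# ④
# Each record must be "<label> <text>" on a single line, so collapse any
# whitespace (including stray newlines) in the text and end with one newline.
with open('./news.train', mode='w') as f:
    for cat, text in zip(cat_train, text_train):
        f.write(cat + ' ' + ' '.join(text.split()) + '\n')

with open('./news.valid', mode='w') as f:
    for cat, text in zip(cat_valid, text_valid):
        f.write(cat + ' ' + ' '.join(text.split()) + '\n')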
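
Before training, it can be worth confirming that every line of the written files has the shape used in this article: a __label__ token first, then the text. The following is a rough sketch, assuming the lines have already been read into a list. check_fasttext_lines is a hypothetical helper, not a fastText function.

```python
def check_fasttext_lines(lines, prefix='__label__'):
    """Return the 1-based numbers of lines that lack a label or lack text."""
    bad = []
    for number, line in enumerate(lines, start=1):
        tokens = line.split()
        labels = [t for t in tokens if t.startswith(prefix)]
        words = [t for t in tokens if not t.startswith(prefix)]
        # In this article's files a usable line has at least one label,
        # at least one word, and a label as its first token.
        if not labels or not words or not tokens[0].startswith(prefix):
            bad.append(number)
    return bad

sample = [
    '__label__Economy Dentsu Group announced results\n',
    'text without any label\n',          # missing label
    '__label__Sports\n',                 # label but no text
]
bad_lines = check_fasttext_lines(sample)
print(bad_lines)
```

For the real files, read news.train or news.valid with readlines() and pass the result to the helper; an empty list means every line is well formed.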

Model learning and evaluation

With fastText, supervised learning takes a single call to train_supervised. You can add word n-gram features through the wordNgrams argument, or set loss to hs to use hierarchical softmax for faster processing. In short, it offers a lot of options. (The model below uses loss='ova', one-vs-all.)
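
To make the wordNgrams argument concrete, here is a small sketch of which word n-grams exist in a tokenized sentence when n goes up to 3. It only illustrates the idea and is not fastText's internal implementation, which, as far as I understand, hashes the n-grams into the number of buckets given by bucket. word_ngrams is a hypothetical helper name.

```python
def word_ngrams(tokens, max_n=3):
    """List all contiguous word n-grams of length 1..max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

tokens = 'NY stocks continue to grow'.split()
grams = word_ngrams(tokens, max_n=3)
print(len(grams))
print(grams[:5])   # the unigrams
print(grams[5:9])  # the bigrams
```

A five-word sentence yields 5 unigrams, 4 bigrams, and 3 trigrams, so longer n-grams let the model pick up short phrases rather than isolated words.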

Training is convenient, but I think another nice feature of fastText is that you can evaluate the model immediately with model.test, which returns the number of samples, precision at 1, and recall at 1. As shown below, the model scores 0.75 on the valid data, which I think is quite good.

model = fasttext.train_supervised(input='./news.train', lr=0.5, epoch=500,
                                  wordNgrams=3, loss='ova', dim=300, bucket=200000)

print("TrainData:", model.test('news.train'))
print("Valid", model.test('news.valid'))

TrainData: (413, 1.0, 1.0)
Valid (104, 0.75, 0.75)
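
The second and third values printed by model.test are precision at 1 and recall at 1. Since every article here has exactly one label, the two coincide and both equal the share of articles whose top prediction is correct: 0.75 on 104 valid articles corresponds to 78 correct top predictions. The sketch below recomputes the two metrics from made-up labels and predictions. precision_recall_at_k is a hypothetical helper, not part of fastText.

```python
def precision_recall_at_k(true_labels, predicted_labels, k=1):
    """true_labels: list of sets; predicted_labels: list of ranked lists."""
    hits = 0          # predicted labels that are correct
    n_predicted = 0   # total number of predictions made
    n_true = 0        # total number of gold labels
    for gold, ranked in zip(true_labels, predicted_labels):
        top_k = ranked[:k]
        hits += sum(1 for label in top_k if label in gold)
        n_predicted += len(top_k)
        n_true += len(gold)
    return hits / n_predicted, hits / n_true

# Four toy articles with one gold label each; three top predictions are right.
gold = [{'Economy'}, {'Sports'}, {'Domestic'}, {'International'}]
pred = [['Economy', 'Domestic'], ['Sports', 'Life'],
        ['International', 'Domestic'], ['International', 'Economy']]

p1, r1 = precision_recall_at_k(gold, pred, k=1)
p2, r2 = precision_recall_at_k(gold, pred, k=2)
print(p1, r1)
print(p2, r2)
```

With k=2 the two metrics diverge: more predictions are made per article, so precision drops while recall can only stay the same or rise.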

Accuracy confirmation using valid data

Let's check the model's predictions on the valid data, which was not used for training.
① Read the contents of the valid data into l_strip.
② Store label, text, and size in lists. label is the news category, text is the body, and size is the probability of the model's top prediction. The necessary parts are extracted with regular expressions.
③ Take the articles one at a time and predict the category. predict returns the top k labels in descending order of probability, where k is the argument of predict, followed by an array of the corresponding probabilities. All three examples are classified correctly, which is encouraging.

# ①
with open("news.valid") as f:
    l_strip = [s.strip() for s in f.readlines()]  # strip() removes the trailing newline
    

# ②    
labels = []
texts = []
sizes = []
for t in l_strip:
    labels.append(re.findall('__label__(.*?) ', t)[0])
    texts.append(re.findall(' (.*)', t)[0])
    sizes.append(model.predict(re.findall(' (.*)', t))[1][0][0])
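
The regular expressions above assume exactly one label at the start of each line. As an alternative sketch that also copes with several labels per line (as in the tutorial's cooking example), the line can be split on whitespace instead. split_labels_and_text is a hypothetical helper, shown only for illustration.

```python
def split_labels_and_text(line, prefix='__label__'):
    """Split one fastText-format line into (list of labels, text)."""
    labels, words = [], []
    for token in line.split():
        if token.startswith(prefix):
            labels.append(token[len(prefix):])   # drop the __label__ prefix
        else:
            words.append(token)
    return labels, ' '.join(words)

line = '__label__sauce __label__cheese how much does potato starch affect a cheese sauce recipe ?'
found_labels, found_text = split_labels_and_text(line)
print(found_labels)
print(found_text)
```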

# ③-1
print("<{}>".format(labels[0]))
print(texts[0])
print(model.predict(texts[0], k=3))

<Economy>
The Dentsu Group announced on the 10th that the consolidated financial results for the fiscal year ending 19th 2020 are based on the international accounting standards. Revenue, which is equivalent to sales, decreased by 94 to 676.3 billion yen. Operating income, which indicates the profit of the main business, was halved to 18.5 billion yen. Due to the impact of the disease, demand for advertisements for TV and the Internet has fallen in Japan and overseas. Operating income has been halved from the end of the year to March next year, including the early retirement program and the establishment of a new company that outsources operations to retirees. Due to the recording of structural reform costs of 25.1 billion yen, the cost of a series of MA merger acquisitions was lower than expected due to the Corona virus. As a result, net income increased 22 times to 10.2 billion yen. US Media Storm decided to make it a subsidiary in February. The value of many companies, including companies, has declined, and the estimated cost of acquiring additional shares has decreased by approximately 30 billion yen.
(('__label__Economy', '__label__Domestic', '__label__Life'), array([9.88678277e-01, 1.48057193e-01, 3.89984576e-04]))

# ③-2
print("<{}>".format(labels[1]))
print(texts[1])
print(model.predict(texts[1], k=3))

<Economy>
Nintendo announced on the 5th that the cumulative worldwide sales of home video game Nintendo Switch reached 68.3 million units at the end of September, exceeding the 61.91 million units of the family computer Nintendo Entertainment System The accompanying consumption of nesting has been a tailwind, and the switch, which was achieved in about three and a half years since its launch in March 2017, can be played as a game even if it is deferred or carried, and is supported by a wide range of generations. No Mori was a hit and boosted switch sales. In September 2008 alone, 12.53 million units were sold.
(('__label__Economy', '__label__Sports', '__label__Domestic'), array([0.00338661, 0.00206074, 0.00081409]))

# ③-3
print("<{}>".format(labels[2]))
print(texts[2])
print(model.predict(texts[2], k=3))

<Sports>
JERA Se League Giants 6-2 Yakult 7th Tokyo Dome The retirement ceremony of giant Hisashi Iwakuma 39, who will retire from active duty only this season, was held on the 7th after the Yakult 23rd round Tokyo Dome game, which is the highest number of this season. In front of human fans, Iwakuma closed the curtain on 21 years of professional baseball life today. He said that he was blessed with wonderful teammates for 21 years and was able to live the best baseball life and was full of gratitude. And the Giants for the last two years. I wasn't able to return to the 1st army, but I'm happy that I was able to wear the Giants uniform at the end of my career, and I'm happy to have this day. I want to continue to be able to make someone happy through baseball. After that, I gave a speech saying thank you very much for 21 years. After that, I received a bouquet from Shima, who played with my teammate Aoki Rakuten in the Mariners era at Yakult, and also received a bouquet from Kanno. He rushed to Iwakuma himself and took a commemorative photo with Nine. Finally, he received a bouquet from the three children and took a commemorative photo again to say goodbye to the fans.
(('__label__Sports', '__label__Entertainment', '__label__Life'), array([8.55861187e-01, 1.00888625e-01, 7.65405654e-04]))
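
Note that the probabilities printed for each article do not sum to 1: in ③-1 the first two values alone are about 0.989 and 0.148, and in ③-2 all three values are tiny. My understanding is that this follows from loss='ova' (one-vs-all), which scores each label independently with a sigmoid rather than normalizing across labels with a softmax. The sketch below contrasts the two on made-up scores; it illustrates the idea only and is not fastText's code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    shifted = [s - max(scores) for s in scores]   # subtract max for stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

scores = [4.5, -1.75, -7.8]   # made-up raw scores for three labels

independent = [sigmoid(s) for s in scores]   # one-vs-all style
normalized = softmax(scores)                 # softmax style

print([round(p, 4) for p in independent], round(sum(independent), 4))
print([round(p, 4) for p in normalized], round(sum(normalized), 4))
```

With independent scores, a low value for every label simply means the model is not confident in any category, which is why the point sizes in a plot based on these values can differ so much between articles.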

Analysis of vector representation

You could stop the analysis here, but let's use fastText's get_sentence_vector function to get a vector for each article and analyze further.
① Obtain a vector for each article and store it in a list.
② Convert the vectors, labels, and sizes to NumPy arrays. (labels and sizes were obtained above.)
③ Standardize the vectors using StandardScaler.
④ Reduce the dimensionality with principal component analysis (PCA).
⑤ Calculate the similarity between articles using the cos_sim function defined above. The pair of articles from the same category has the highest similarity.
⑥ Plot the vectors in two dimensions. The size of each point depends on the value in sizes, which stores the predicted probability.

# ①
vectors = []
for t in texts:
    vectors.append(model.get_sentence_vector(t))

    
# ②
vectors = np.array(vectors)
labels = np.array(labels)
sizes = np.array(sizes)


# ③
ss = preprocessing.StandardScaler()
vectors_std = ss.fit_transform(vectors)
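
StandardScaler rescales every column to mean 0 and standard deviation 1, using the population standard deviation. Here is a minimal pure-Python sketch of the same transformation on a made-up 3×2 matrix. standardize_columns is a hypothetical helper for illustration; the real code above should keep using StandardScaler.

```python
import statistics

def standardize_columns(rows):
    """Return rows with each column shifted to mean 0 and scaled to std 1.

    Assumes no column is constant (its standard deviation would be 0).
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # pstdev is the population standard deviation (divides by n, not n - 1).
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
scaled = standardize_columns(rows)
for row in scaled:
    print([round(v, 3) for v in row])
```

Standardizing first keeps dimensions with a large numeric range from dominating the principal components computed in the next step.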


# ④
pca = PCA()
pca.fit(vectors_std)
feature = pca.transform(vectors_std)
feature = feature[:, :2]

# ⑤-1
print("<{}><{}>".format(labels[0], labels[1]))
cos_sim(vectors[0], vectors[1])

0.9514279

# ⑤-2
print("<{}><{}>".format(labels[1], labels[2]))
cos_sim(vectors[1], vectors[2])

0.9299138

# ⑤-3
print("<{}><{}>".format(labels[0], labels[2]))
cos_sim(vectors[0], vectors[2])

0.79527444
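
Three pairs are only a spot check. As a sketch of how the comparison could be extended, the code below averages the cosine similarity over all article pairs, grouped by category pair. It runs on made-up two-dimensional vectors; mean_similarity_by_pair and the toy data are hypothetical and not part of the original analysis.

```python
import math
from collections import defaultdict
from itertools import combinations

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))

def mean_similarity_by_pair(vectors, labels):
    """Average cosine similarity for every unordered pair of categories."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, j in combinations(range(len(vectors)), 2):
        key = tuple(sorted((labels[i], labels[j])))
        sums[key] += cosine(vectors[i], vectors[j])
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
toy_labels = ['Economy', 'Economy', 'Sports', 'Sports']
result = mean_similarity_by_pair(toy_vectors, toy_labels)
for pair, value in sorted(result.items()):
    print(pair, round(value, 3))
```

With the real data, vectors and labels from step ② could be passed in directly; if the model separates categories well, same-category pairs should average higher than mixed pairs.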

# ⑥
# Category names must match the values in df.category exactly.
categories = ['Entertainment', 'Sports', 'Life', 'Domestic', 'International', 'Region', 'Economy']

plt.figure(figsize=(14, 10))
plt.rcParams["font.size"] = 20
for cat in categories:
    mask = labels == cat
    # Point size is proportional to the predicted probability stored in sizes.
    plt.scatter(feature[mask, 0], feature[mask, 1], label=cat, s=sizes[mask] * 1000)
plt.title("Yahoo news")
plt.xlabel('1st dimension')
plt.ylabel('2nd dimension')
plt.legend(title="category")
plt.show()

Consideration

-- "Entertainment" and "Sports" are very close to each other. They overlap a lot in the first dimension but are clearly separated in the second dimension. This makes sense.
-- "International" and "Domestic" are well separated, with "Economy" in between. This is also convincing.
-- "Sports" and "Domestic" are close. Is that because Yahoo! News carries more domestic sports articles than overseas ones?
-- "Region" is plotted near "Domestic", and its points are small, which means the predicted probability is low. It may indeed be hard to read a "Region" article and decide whether it is "Domestic" or "Region".

Overall, I think the clustering worked well.

This time I got a plot that was neatly separated by category, which may be because I did supervised learning using the category labels. I think it is valuable to be able to build a model that cleanly clusters valid data not used for training. Still, it would be interesting if unsupervised learning produced similar results, so next time I would like to try clustering with unsupervised learning. (Continued: Unsupervised learning)

References

-- Yahoo! News
-- Clustering books from Aozora Bunko with Doc2Vec
-- fastText
-- GitHub (fastText/python)
-- Build mecab (NEologd dictionary) environment with Docker (ubuntu)
-- Scraping Yahoo News
-- fastText tutorial (Text classification)
-- [Python NumPy] How to find cosine similarity
-- Understanding Principal Component Analysis in Python
-- matplotlib Scatter plots with a legend