[PYTHON] Classifying Qiita articles with specified tags by unsupervised learning

Introduction

This time, we will classify article data for specified tags using unsupervised learning (the k-means method).

Please refer to the following article for how to get the article data for a specified tag.

・ How to get article data using Qiita API https://qiita.com/wakudar/items/8c594c8cc7bda9b93b4e

Program flow

  1. Reading the article data and splitting it into words (word segmentation)
  2. Vectorization of article data by TF-IDF
  3. Clustering by k-means method
  4. Visualization of results
  5. Finding the majority category in each cluster and outputting the misclassified articles
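The snippets below omit their imports, so here is the setup they are assumed to rely on (MeCab via the standard Python binding, glob, and scikit-learn); n_cluster is assumed to be 3, matching the three tags used in this article.

import glob

import MeCab
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Number of clusters: one per tag ("Vagrant", "iOS", "numpy")
n_cluster = 3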

1. Reading the article data and splitting it into words

# Word segmentation (wakati-gaki) with MeCab
def wakatigaki(text):
    mecab = MeCab.Tagger()
    # Parse the text, drop the EOS marker, and split the output into one line per token
    mecab_result = mecab.parse(text).replace("EOS", "").split('\n')
    # Strip quoting characters and normalize MeCab's tab/comma-separated fields into a list per token
    mecab_result = [i.replace("#", "").replace("\"", "").replace("\'", "").replace("\t", "_").replace(",", "_").split("_") for i in mecab_result if i != ""]
    return mecab_result

# Load the article data for each tag and segment it into words
def load_article():
    category = ["Vagrant", "iOS", "numpy"]
    category_num = [0, 1, 2]
    docs = []
    labels = []
    labels_num = []

    for c_name, c_num in zip(category, category_num):
        files = glob.glob("./qiita/{c_name}/*.txt".format(c_name=c_name))

        for file in files:
            with open(file, "r", encoding="utf-8") as f:
                lines = f.read().splitlines()
                # Join the lines of the article body and remove full-width spaces
                body = "".join(lines).replace('\u3000', '')
                # Word-segment the body and join the surface forms with spaces
                text = " ".join([w[0] for w in wakatigaki(body)])

            docs.append(text)
            labels.append(c_name)
            labels_num.append(c_num)
    return docs, labels, category

# Load the article data
docs, labels, category = load_article()
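As a quick check of the word segmentation, wakatigaki can be called on a short string like this (an illustrative example I am adding; the exact tokens depend on the installed MeCab dictionary):

# Each element returned by wakatigaki is a list of MeCab fields; w[0] is the surface form
tokens = wakatigaki("Pythonで記事を分類します")
print([w[0] for w in tokens])
# e.g. ['Python', 'で', '記事', 'を', '分類', 'し', 'ます']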

The article data is saved in the form qiita/<tag name>/------.txt. This time, we estimate the categories of three tags whose articles were saved in advance: "Vagrant", "iOS", and "numpy".
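For reference, a quick way to confirm that the expected ./qiita/<tag name>/ directories are in place (a small sanity check, not part of the original program):

# Count the .txt files found for each tag directory
for tag in ["Vagrant", "iOS", "numpy"]:
    print(tag, len(glob.glob("./qiita/{tag}/*.txt".format(tag=tag))), "files")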

2. Vectorization of sentences by TF-IDF

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Convert the documents into TF-IDF vectors
vecs = vectorizer.fit_transform(docs)
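vecs is a sparse document-term matrix. Its shape and the vocabulary size can be checked like this (an optional sketch I am adding):

# Rows correspond to articles, columns to vocabulary terms
print(vecs.shape)
# Number of distinct terms learned by the vectorizer
print(len(vectorizer.vocabulary_))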

3. Clustering by the k-means method

# Run the k-means method
kmeans_model = KMeans(n_clusters=n_cluster, random_state=0).fit(vecs)
# Store the cluster label assigned to each article
predict_labels = kmeans_model.labels_
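To get a rough idea of what each cluster picked up, the highest-weighted terms of each centroid can be listed; this is an optional check I am adding, and on older scikit-learn versions get_feature_names_out() is get_feature_names().

# Show the ten highest-weighted terms in each cluster centroid
terms = vectorizer.get_feature_names_out()
order = kmeans_model.cluster_centers_.argsort()[:, ::-1]
for i in range(n_cluster):
    print("cluster", i, ":", " ".join(terms[idx] for idx in order[i, :10]))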

4. Visualization of results

# Aggregate the results: res[cluster number][true tag] = article count
res = {
    0: {},
    1: {},
    2: {}
}

# Count how many articles of each tag fall into each cluster
for pre_label, r_label in zip(predict_labels, labels):
    # Increment the count if the tag is already recorded for this cluster
    try:
        res[pre_label][r_label] += 1
    # Otherwise initialize the count
    except KeyError:
        res[pre_label][r_label] = 1

# Print the result
for i in range(n_cluster):
    print(res[i])
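As a side note, the same aggregation can be written a little more compactly with collections.Counter (just an alternative sketch, not the code used here):

from collections import Counter

# res_alt[cluster number] counts the true tags of the articles in that cluster
res_alt = {i: Counter() for i in range(n_cluster)}
for pre_label, r_label in zip(predict_labels, labels):
    res_alt[pre_label][r_label] += 1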

5. Finding the majority category in each cluster and outputting the misclassified articles

First, find the majority category for each cluster label:

# major_cat: the majority category name of each cluster
major_cat = []
# major_num: the index of that category name in the category list
major_num = []
for i in range(n_cluster):
    major_cat.append(max(res[i], key=res[i].get))
    major_num.append(category.index(major_cat[i]))
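Printing the two lists makes the cluster-to-category mapping easy to check; with the result shown in the Result section below, it should look roughly like this:

# Inspect the majority mapping of each cluster
print(major_cat)  # e.g. ['iOS', 'Vagrant', 'numpy'] for the result shown below
print(major_num)  # e.g. [1, 0, 2]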

Next, generate the majority-based category labels, adjusted_labels:

adjusted_labels = []
#Number of articles in each category
article_num = [900, 900, 900]
for i in range(n_cluster):
    adjusted_labels.extend([major_num[i]] * article_num[i])

Finally, compare the labels before and after clustering and output the misclassified articles:

# cnt: counter used in the output txt file name
cnt = 0
# If the adjusted label and the cluster label differ, write the article body to a file
for label1, label2 in zip(adjusted_labels, predict_labels):
    if label1 != label2:
        # Output file name path_w
        path_w = "./result/" + str(label1) + "-" + str(label2) + "/" + str(cnt) + ".txt"
        with open(path_w, mode='w') as f:
            f.write(docs[cnt])
    cnt += 1
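Note that the ./result/<label1>-<label2>/ directories must exist before the loop above runs; one way to create them in advance (a small addition of mine, not in the original code):

import os

# Create an output directory for every pair of differing labels
for a in range(n_cluster):
    for b in range(n_cluster):
        if a != b:
            os.makedirs("./result/{a}-{b}".format(a=a, b=b), exist_ok=True)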

Result

{'Vagrant': 108, 'iOS': 900, 'numpy': 333}
{'Vagrant': 792}
{'numpy': 567}

The correct answer rate for each tag was: iOS about 67%, Vagrant about 88%, and numpy about 63%.

Discussion / Summary

The accuracy was not very good this time... Since this runs with almost the same program as the one I used for the Livedoor News corpus, the source code contained in many Qiita articles may be affecting the result. To improve accuracy, I think I will need to look into other approaches, such as different learning methods, and I would like to try them when I have time!

Other tweaks

・ This time, I searched several tags to see how much their articles overlap, and selected the three tags with the least overlap (Vagrant, iOS, numpy).

・ I also tried classifying the Android and iOS tags at one point, but the results were disappointing. I suspect many articles carried both tags, since both relate to smartphone development.

References

・ Unsupervised sentence classification (sentence clustering) [python] https://appswingby.com/2019/08/15/python%E6%95%99%E5%B8%AB%E3%81%AA%E3%81%97%E6%96%87%E7%AB%A0%E5%88%86%E9%A1%9E%EF%BC%88%E6%96%87%E7%AB%A0%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E3%83%AA%E3%83%B3%E3%82%B0%EF%BC%89/

・ Qiita tag list https://qiita.com/tags

・ Articles with both the tag "iOS" and the tag "Vagrant" https://qiita.com/search?q=tag%3A+iOS+tag%3AVagrant
