[PYTHON] Classifying Qiita articles with specified tags by unsupervised learning

Introduction

This time, we will classify article data for specified tags using unsupervised learning (the k-means method).

Please refer to the following article for how to get the article data for a specified tag.

・ How to get article data using Qiita API https://qiita.com/wakudar/items/8c594c8cc7bda9b93b4e

Program flow

  1. Reading the article data and splitting it into words (word segmentation)
  2. Vectorization of article data by TF-IDF
  3. Clustering by k-means method
  4. Visualization of results
  5. Finding the majority category in each cluster and outputting the misclassified articles
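The snippets below omit their imports, so here is the setup they are assumed to rely on (MeCab via the standard Python binding, glob, and scikit-learn); n_cluster is assumed to be 3, matching the three tags used in this article.

import glob

import MeCab
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Number of clusters: one per tag ("Vagrant", "iOS", "numpy")
n_cluster = 3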

1. Reading the article data and splitting it into words

# Word segmentation (wakati-gaki) with MeCab
def wakatigaki(text):
    mecab = MeCab.Tagger()
    # Parse the text, drop the EOS marker, and split the output into one line per token
    mecab_result = mecab.parse(text).replace("EOS", "").split('\n')
    # Strip quoting characters and normalize MeCab's tab/comma-separated fields into a list per token
    mecab_result = [i.replace("#", "").replace("\"", "").replace("\'", "").replace("\t", "_").replace(",", "_").split("_") for i in mecab_result if i != ""]
    return mecab_result

# Load the article data for each tag and segment it into words
def load_article():
    category = ["Vagrant", "iOS", "numpy"]
    category_num = [0, 1, 2]
    docs = []
    labels = []
    labels_num = []

    for c_name, c_num in zip(category, category_num):
        files = glob.glob("./qiita/{c_name}/*.txt".format(c_name=c_name))

        for file in files:
            with open(file, "r", encoding="utf-8") as f:
                lines = f.read().splitlines()
                # Join the lines of the article body and remove full-width spaces
                body = "".join(lines).replace('\u3000', '')
                # Word-segment the body and join the surface forms with spaces
                text = " ".join([w[0] for w in wakatigaki(body)])

            docs.append(text)
            labels.append(c_name)
            labels_num.append(c_num)
    return docs, labels, category

# Load the article data
docs, labels, category = load_article()
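As a quick check of the word segmentation, wakatigaki can be called on a short string like this (an illustrative example I am adding; the exact tokens depend on the installed MeCab dictionary):

# Each element returned by wakatigaki is a list of MeCab fields; w[0] is the surface form
tokens = wakatigaki("Pythonで記事を分類します")
print([w[0] for w in tokens])
# e.g. ['Python', 'で', '記事', 'を', '分類', 'し', 'ます']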

The article data is saved in the form qiita/<tag name>/------.txt. This time, we estimate the categories of three tags whose articles were saved in advance: "Vagrant", "iOS", and "numpy".
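For reference, a quick way to confirm that the expected ./qiita/<tag name>/ directories are in place (a small sanity check, not part of the original program):

# Count the .txt files found for each tag directory
for tag in ["Vagrant", "iOS", "numpy"]:
    print(tag, len(glob.glob("./qiita/{tag}/*.txt".format(tag=tag))), "files")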

2. Vectorization of sentences by TF-IDF

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Convert the documents into TF-IDF vectors
vecs = vectorizer.fit_transform(docs)
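vecs is a sparse document-term matrix. Its shape and the vocabulary size can be checked like this (an optional sketch I am adding):

# Rows correspond to articles, columns to vocabulary terms
print(vecs.shape)
# Number of distinct terms learned by the vectorizer
print(len(vectorizer.vocabulary_))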

3. Clustering by the k-means method

# Run the k-means method
kmeans_model = KMeans(n_clusters=n_cluster, random_state=0).fit(vecs)
# Store the cluster label assigned to each article
predict_labels = kmeans_model.labels_
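To get a rough idea of what each cluster picked up, the highest-weighted terms of each centroid can be listed; this is an optional check I am adding, and on older scikit-learn versions get_feature_names_out() is get_feature_names().

# Show the ten highest-weighted terms in each cluster centroid
terms = vectorizer.get_feature_names_out()
order = kmeans_model.cluster_centers_.argsort()[:, ::-1]
for i in range(n_cluster):
    print("cluster", i, ":", " ".join(terms[idx] for idx in order[i, :10]))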

4. Visualization of results

# Aggregate the results: res[cluster number][true tag] = article count
res = {
    0: {},
    1: {},
    2: {}
}

# Count how many articles of each tag fall into each cluster
for pre_label, r_label in zip(predict_labels, labels):
    # Increment the count if the tag is already recorded for this cluster
    try:
        res[pre_label][r_label] += 1
    # Otherwise initialize the count
    except KeyError:
        res[pre_label][r_label] = 1

# Print the result
for i in range(n_cluster):
    print(res[i])
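As a side note, the same aggregation can be written a little more compactly with collections.Counter (just an alternative sketch, not the code used here):

from collections import Counter

# res_alt[cluster number] counts the true tags of the articles in that cluster
res_alt = {i: Counter() for i in range(n_cluster)}
for pre_label, r_label in zip(predict_labels, labels):
    res_alt[pre_label][r_label] += 1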

5. Finding the majority category in each cluster and outputting the misclassified articles

First, find the majority category for each cluster label:

# major_cat: the majority category name of each cluster
major_cat = []
# major_num: the index of that category name in the category list
major_num = []
for i in range(n_cluster):
    major_cat.append(max(res[i], key=res[i].get))
    major_num.append(category.index(major_cat[i]))
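Printing the two lists makes the cluster-to-category mapping easy to check; with the result shown in the Result section below, it should look roughly like this:

# Inspect the majority mapping of each cluster
print(major_cat)  # e.g. ['iOS', 'Vagrant', 'numpy'] for the result shown below
print(major_num)  # e.g. [1, 0, 2]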

Next, generate the majority-based category labels, adjusted_labels:

adjusted_labels = []
#Number of articles in each category
article_num = [900, 900, 900]
for i in range(n_cluster):
    adjusted_labels.extend([major_num[i]] * article_num[i])

Finally, compare the labels before and after clustering and output the misclassified articles:

# cnt: counter used in the output txt file name
cnt = 0
# If the adjusted label and the cluster label differ, write the article body to a file
for label1, label2 in zip(adjusted_labels, predict_labels):
    if label1 != label2:
        # Output file name path_w
        path_w = "./result/" + str(label1) + "-" + str(label2) + "/" + str(cnt) + ".txt"
        with open(path_w, mode='w') as f:
            f.write(docs[cnt])
    cnt += 1
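Note that the ./result/<label1>-<label2>/ directories must exist before the loop above runs; one way to create them in advance (a small addition of mine, not in the original code):

import os

# Create an output directory for every pair of differing labels
for a in range(n_cluster):
    for b in range(n_cluster):
        if a != b:
            os.makedirs("./result/{a}-{b}".format(a=a, b=b), exist_ok=True)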

Result

{'Vagrant': 108, 'iOS': 900, 'numpy': 333}
{'Vagrant': 792}
{'numpy': 567}

The correct answer rate for each tag was: iOS about 67%, Vagrant about 88%, and numpy about 63%.

Discussion / Summary

The accuracy was not very good this time... Since this runs with almost the same program as the one I used for the Livedoor News corpus, the source code contained in many Qiita articles may be affecting the result. To improve accuracy, I think I will need to look into other approaches, such as different learning methods, and I would like to try them when I have time!

Other tweaks

・ This time, I searched several tags to see how much their articles overlap, and selected the three tags with the least overlap (Vagrant, iOS, numpy).

・ I also tried classifying the Android and iOS tags at one point, but the results were disappointing. I suspect many articles carried both tags, since both relate to smartphone development.

References

・ Unsupervised sentence classification (sentence clustering) [python] https://appswingby.com/2019/08/15/python%E6%95%99%E5%B8%AB%E3%81%AA%E3%81%97%E6%96%87%E7%AB%A0%E5%88%86%E9%A1%9E%EF%BC%88%E6%96%87%E7%AB%A0%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E3%83%AA%E3%83%B3%E3%82%B0%EF%BC%89/

・ Qiita tag list https://qiita.com/tags

・ Articles with both the tag "iOS" and the tag "Vagrant" https://qiita.com/search?q=tag%3A+iOS+tag%3AVagrant
