Topic model by LDA with gensim ~ Thinking about users' tastes from Qiita tags ~

Introduction

In this article, I will use the LDA model from the Python library gensim to think about users' preferences based on the Qiita tags they follow.

The purpose is to see what happens when you actually apply a topic model with gensim to real data. I hope this gives you an opportunity to try the topic model yourself and start studying it in detail.

I will not really touch on how the LDA model is implemented internally; the focus is on what you can do with it. I also cover data acquisition (scraping, API, etc.).

- Data acquisition (scraping, API)
- Data shaping
- Applying the model

There are also articles that explain the theory in detail, so if you feel unsatisfied after reading this article, you should read them:

- Challenge to explain the topic model without using mathematical formulas: easy to understand because it is explained with figures.
- Topic model story (SlideShare): the LDA section is written with formulas. It is not long, but it is neatly organized.

In addition, here are two books on topic models that come up again and again when you look into the subject:

- [Topic Model (Machine Learning Professional Series)](https://www.amazon.co.jp/dp/4061529048)
- [Statistical Latent Semantic Analysis by Topic Models (Natural Language Processing Series)](https://www.amazon.co.jp/dp/4339027588)

I am still studying this myself as I write, so if you notice any mistakes or have other advice, I would appreciate it if you let me know in the comments.

What is a topic model?

Simply put, it is a model that estimates the topics (categories) of documents from their content and classifies them accordingly. A topic model classifies documents by estimating the probability with which each word appears in them: documents in which similar words appear are considered to belong to the same topic.

To explain what it can do: given a collection of documents, it can classify them into any number of categories. For example, prepare 500 news articles. If you ask this model to "divide them into 10 categories", it will judge the 500 articles by the words they contain and split them into 10 related groups. The point to note is that we do not know in advance whether a given group can be named the "sports" category or the "entertainment" category. Instead of a "sports" category, you might get "baseball" and "soccer" categories. Since this is unsupervised learning, there are no predefined labels.

What is interesting is that the model finds divisions you might never have thought of. Looking at the heavily weighted words for each topic may give you new insights. Also, if you change the number of categories, the word groups change, which is interesting in itself.

How to apply a topic model

Summarizing the procedure from the examples so far, you can classify documents by topic with the following steps (a minimal code sketch follows the list).

  1. Prepare documents
  2. Split the documents into words (morphological analysis)
  3. Adjust the words (stop word removal, stemming)
  4. Vectorize (bag of words)
  5. Convert to the required format and feed it to the LDA model
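
As a rough illustration, here is a minimal sketch of that pipeline with gensim. The toy documents and the whitespace tokenizer are stand-ins of my own; for Japanese text you would use a morphological analyzer such as MeCab in step 2.

from gensim import corpora, models

# Step 1: prepare documents (toy examples)
documents = [
    "the cat sat on the mat",
    "dogs and cats are friendly pets",
    "stock prices fell on the market today",
]
stop_words = {"the", "on", "and", "are"}

# Steps 2-3: split into words and remove stop words
texts = [[word for word in doc.lower().split() if word not in stop_words]
         for doc in documents]

# Step 4: bag of words; assign each word an id and count occurrences
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Step 5: train LDA with a chosen number of topics
lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)
print(lda.show_topics())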

The following article does something similar and will be helpful: Creating an application using the topic model.

The following article also does something similar, but at the end it uses supervised learning with a random forest instead of LDA. I think it is interesting to compare the differences: Classify news articles with scikit-learn and gensim.

What to do this time

Now, LDA seems to be applicable to all sorts of things (that is the impression I got). So I would like to try something a little different. From here, we finally do what the title promises: thinking about users' tastes from Qiita tags.

In the examples so far, documents were categorized based on the words they contain. Here I would like to apply LDA with the following substitutions:

- Document -> User
- Words -> Tags the user follows
- Document categorization -> Classification of user attributes
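
Concretely, each user becomes a "document" whose "words" are the followed tags. As a toy illustration (not from the actual data), a user following tags 1, 2, and 3 looks like this in bag-of-words form:

user_bow = [(1, 1), (2, 1), (3, 1)]  # (tag_id, count) pairs, one per followed tag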

Implementation

Procedure

  1. Data acquisition
    - User data
    - Followed-tag data associated with each user
  2. Data shaping
  3. Applying the model

Data acquisition

First, get the user data. This time we fetch the top 1,000 users by Contribution count, using the data from Qiita User Ranking. Since the users there are listed in Contribution order, we can scrape them as-is. There are 20 users per page, so we crawl 50 pages.

import requests
from bs4 import BeautifulSoup
import csv
import time

base_url = 'https://qiita-user-ranking.herokuapp.com/'
max_page = 50 #20 users per page

qiita_users = []


for i in range(max_page):
    target_url = base_url + "?page=" + str(i + 1)
    target_html = requests.get(target_url).text
    soup = BeautifulSoup(target_html, 'html.parser')
    users = soup.select('main > p > a') #Username location

    for k, user in enumerate(users):
        qiita_users.append([(i*20 + k + 1), user.get_text()]) #User id (rank) and username

    time.sleep(1)  # wait 1 second so as not to overload the server
    print('scraping page: ' + str(i + 1))

# Write the data out to CSV
f = open('qiita_users.csv', 'w')
writer = csv.writer(f, lineterminator='\n')
writer.writerow(['user_id', 'name'])
for user in qiita_users:
    print(user)
    writer.writerow(user)

f.close()

You will get the following CSV.

user_id,name
1,hirokidaichi
2,jnchito
3,suin
4,icoxfog417
5,shu223
...

Next, get the tag data. We use the API provided by Qiita to fetch the tags each user follows, available at https://qiita.com/api/v1/users/(user_name)/following_tags. However, Qiita's API accepts only 150 requests per hour, so the data for 1,000 users cannot be acquired in one run. The script below appends to the CSV on each run, so running it again an hour later picks up where it left off (seven runs reach 1,000 users). Even if you only have data for 150 users, everything after this works fine. (The number 1,000 was arbitrary in the first place; changing it may give interesting results. Ideally, I would like to query Qiita's data directly with SQL rather than going through the API.)

import csv, requests, os.path, time

# Use the user data obtained earlier
f = open('qiita_users.csv', 'r')
reader = csv.reader(f)
next(reader)

qiita_tags = []
qiita_user_tags = []

# Count how many users' tag data the CSV already holds (irrelevant on the first run)
if os.path.isfile('qiita_user_tags.csv'):
    user_tag_num = sum(1 for line in open('qiita_user_tags.csv'))
else:
    user_tag_num = 0

# If the tag CSV already exists, load the tags from it (irrelevant on the first run)
if os.path.isfile('qiita_tags.csv'):
    f_tag = open('qiita_tags.csv', 'r')
    reader_tag = csv.reader(f_tag)
    qiita_tags = [tag[0] for tag in reader_tag]

# Open the CSV files for output
f_tag = open('qiita_tags.csv', 'w')
writer_tag = csv.writer(f_tag, lineterminator='\n')
f_user_tag = open('qiita_user_tags.csv', 'a')
writer_user_tag = csv.writer(f_user_tag, lineterminator='\n')

#Hit the API for each user
for user in reader:
    if user_tag_num < int(user[0]):
        target_url = 'https://qiita.com/api/v1/users/' + user[1] + '/following_tags'
        print('scraping: ' + user[0])

        # Error handling (two cases: the request limit is exceeded, or the user does not exist)
        try:
            result = requests.get(target_url)
        except requests.exceptions.HTTPError as e:
            print(e)
            break
        target = result.json()

        # Give up when the request limit is exceeded
        if 'error' in target:
            print(target['error'])
            if target['error'] == 'Rate limit exceeded.':
                break
            continue

        # Store the data as: user_id, tag_1, tag_2, ...
        qiita_user_tag = [int(user[0])]
        for tag in target:
            if tag['name'] in qiita_tags:
                qiita_user_tag.append(qiita_tags.index(tag['name']) + 1)
            else:
                qiita_tags.append(tag['name'])
                tag_num = len(qiita_tags)
                qiita_user_tag.append(tag_num)
        qiita_user_tags.append(qiita_user_tag)
        time.sleep(1)  # wait 1 second so as not to overload the server

# Write the data out to CSV
for tag in qiita_tags:
    writer_tag.writerow([tag])
writer_user_tag.writerows(qiita_user_tags)

f_tag.close()
f_user_tag.close()
f.close()

Below is example output. There are two CSV files: a tag file and a user-tag relation file. A tag's line number serves as the tag's id.

qiita_tags.csv


GoogleAppsScript
ActionScript
JavaScript
CSS
docker
...

qiita_user_tags.csv


1,1,2,3,4
2,5,6,7,3,8,9,10,11,12,13,14,15,16,17
3,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37
4,38,39,40,41,42,43,44,45,46,47,48,3,49
5,50,51,52,53,54,55,56,57,58,59,60,41,61,62,63,64,65,66,67,68
...

Data shaping and applying the model

Now we shape the data and apply the model. Everything can be run as the single file below, although I actually worked in an IPython notebook. It could probably be made more reusable as a class, but since the point is to understand the flow, I left the workflow as it is.

import csv
import gensim
from pandas import DataFrame

# Build a dictionary mapping tag id (key) to tag name (value)
tag_name_dict = {}
with open('qiita_tags.csv', 'r') as f_tags:
    tag_reader = csv.reader(f_tags)

    for i, row in enumerate(tag_reader):
        tag_name_dict[(i+1)] = row[0]


# Which tags each user (key) is following
user_tags_dict = {}
with open('qiita_user_tags.csv', 'r') as f_user_tags:
    user_tags_reader = csv.reader(f_user_tags)

    for i, row in enumerate(user_tags_reader):
        user_tags_dict[int(row[0])] = row[1:]

# tags_list: list of followed tags (with duplicates when a tag is followed by several users)
tags_list = []
for k, v in user_tags_dict.items():
    tags_list.extend(v)

# Tags followed by only one user
once_tags = [tag for tag in tags_list if tags_list.count(tag) == 1]

# Remove tags followed by only one user from each user's tag list
user_tags_dict_multi = {k: [tag for tag in user_tags if tag not in once_tags] for k, user_tags in user_tags_dict.items()}

# Drop users with no tags left (note that tags followed by only one user were removed)
user_tags_dict_multi = {k: v for k, v in user_tags_dict_multi.items() if not len(v) == 0}

# Convert into the input format for gensim
corpus = [[(int(tag), 1) for tag in user_tags] for k, user_tags in user_tags_dict_multi.items()]

# Create and train the LDA model; the number of topics (user groups) can be set here
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=15)

# Get the 10 most frequent tags for each topic
topic_top10_tags = []
for topic in lda.show_topics(-1, formatted=False):
    topic_top10_tags.append([tag_name_dict[int(tag[0])] for tag in topic[1]])

# Display the 10 most frequent tags of each topic
topic_data = DataFrame(topic_top10_tags)
print(topic_data)
print("------------------")

# Display a user's topic distribution (preferences)
c = [(1, 1), (2, 1)]  # a user following tag 1 and tag 2
for (tpc, prob) in lda.get_document_topics(c):
    print(str(tpc) + ': '+str(prob))
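
As a side note (not in the original article), the trained model can be persisted with gensim's save/load so the CSV shaping does not have to be redone every time. The file name here is an arbitrary choice:

# Save the trained model to disk and load it back later
lda.save('qiita_tags.lda')
lda = gensim.models.ldamodel.LdaModel.load('qiita_tags.lda')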

Result

Below are the tag groups for each topic (displayed in an IPython notebook). Each row represents one topic, so the users were divided into 15 groups: the row-0 user group, the row-1 user group, and so on.

(Screenshot: table of the top 10 tags for each of the 15 topics)

So which topic does a user following tag 1 and tag 2 belong to? Topic 7, it seems. I picked this input arbitrarily, but if you feed in the tags you actually follow, you may be able to see which user group you belong to. It also looks like this could be used to recommend tags.

0: 0.0222222589596
1: 0.0222222222222
2: 0.0222222417019
3: 0.0222222755597
4: 0.022222240412
5: 0.0222222222222
6: 0.0222222374859
7: 0.688888678865
8: 0.02222225339
9: 0.0222222557189
10: 0.0222222222222
11: 0.0222222222222
12: 0.0222222222222
13: 0.0222222245736
14: 0.0222222222222
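
As a rough sketch of the tag-recommendation idea mentioned above (my own addition, reusing lda and tag_name_dict from the script): take the user's dominant topic, then suggest that topic's top tags that the user does not yet follow.

# Dominant topic for the user following tags 1 and 2
user_bow = [(1, 1), (2, 1)]
followed = {tag_id for tag_id, _ in user_bow}
best_topic, _ = max(lda.get_document_topics(user_bow), key=lambda t: t[1])

# Top tags of that topic, excluding tags already followed
recommendations = [tag_name_dict[int(tag_id)]
                   for tag_id, prob in lda.show_topic(best_topic, topn=10)
                   if int(tag_id) not in followed]
print(recommendations)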

Bonus: below is a 20-topic version. To me, 20 topics seem to produce more plausible groupings. (Screenshot: table of the top 10 tags for each of the 20 topics)

It seems that you can discover various things by increasing the amount of data and changing the number of topics.
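
One hedged way to compare different topic counts, which the article does not do, is gensim's per-word likelihood bound on the corpus (higher is better). It is only a rough guide, and eyeballing the topic word lists often matters more:

# Compare a few topic counts on the same corpus
for k in (10, 15, 20):
    model = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=k)
    print(k, model.log_perplexity(corpus))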

In closing

I hope this serves as a catalyst for anyone interested in the topic model. I will keep studying it myself.
