[Python] Classify Qiita posts with Tweet2Vec, without morphological analysis

Table of contents

- What I wanted to do
- Get Qiita posts
- Use Tweet2Vec
- Use a GPU instance
- Tag prediction results
- Finding similar posts
- Considerations and issues

What I wanted to do

- I want to classify short Japanese documents (tweets, etc.)
- I want to use a neural network
- I want to do it without morphological analysis

SNS posts are full of typos, omissions, slang, new words, pictograms, emoticons, foreign languages, technical terms, and notational variants, so approaches that depend on a morphological analyzer seem to be at a disadvantage. Recent NLP papers increasingly learn at the character level rather than the word level, so I decided to follow that trend. Japanese should benefit from this even more than English, since each character carries a relatively large amount of information.

I decided to classify Qiita posts using only their titles: they are not too long, they look hard to analyze morphologically, and each post's topic is reasonably cohesive. The post bodies are Markdown or HTML, which seems harder to handle, so I do not use them this time.

Get Qiita posts

Use the Qiita API.

Since the body text and other meta information may be useful later, I saved the entire JSON returned by the API into PostgreSQL, using Peewee from Python.


from peewee import *
from playhouse.postgres_ext import PostgresqlExtDatabase, BinaryJSONField

db = PostgresqlExtDatabase(
    'mytable',
    user='myuser',
    password='mypassword',
    register_hstore=False,
    autocommit=True,
    autorollback=True
)
db.connect()

# Model definition
class BaseModel(Model):
    """A base model that will use our Postgresql database"""
    class Meta:
        database = db

class QiitaItem(BaseModel):
    item_id = CharField(unique=True)
    created_at = DateTimeField(index=True)
    raw = BinaryJSONField()
    
    class Meta:
        db_table = 'qiita_items'

# Create the table
db.create_tables([QiitaItem])

Calling the Qiita API

Note that the rate limit is tight without a token, and even with a token you are limited to 1,000 requests per hour.

import json, requests
from peewee import DataError

url = 'https://qiita.com/api/v2/items'
headers = {'Authorization': 'Bearer myauthorizationtoken1234567890'}

# Example: fetch the latest 10,000 posts for now.
# For more posts, params also accepts the same query syntax as the web search,
# so you can page through older posts by making full use of that.
for i in range(100):
    resp = requests.get(url, params={'per_page': 100, 'page': 1+i}, headers=headers)
    for item in resp.json():
        # Null characters anywhere in the text make the JSONB save fail, so remove them.
        item['body'] = item['body'].replace('\x00', '')
        item['rendered_body'] = item['rendered_body'].replace('\x00', '')
        try:
            QiitaItem.create_or_get(item_id=item['id'], created_at=item['created_at'], raw=item)
        except DataError:
            print(False)  # a few posts still fail to save; just note and skip
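
If you do hit the rate limit, the response headers tell you when you can resume. A minimal sketch, assuming the Rate-Remaining and Rate-Reset response headers described in the Qiita API docs:

import time
import requests

def fetch_page(url, page, headers):
    # Fetch one page; if the hourly quota is exhausted, sleep until it resets.
    resp = requests.get(url, params={'per_page': 100, 'page': page}, headers=headers)
    # Qiita reports the remaining quota and the reset time (unix epoch) in headers.
    if int(resp.headers.get('Rate-Remaining', '1')) == 0:
        wait = int(resp.headers.get('Rate-Reset', '0')) - int(time.time())
        time.sleep(max(wait, 0) + 1)
    return resp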

Use Tweet2Vec

First, I want to convert Qiita titles, which vary in length, into vector representations with a fixed number of dimensions. Once the titles live in a vector space model, all sorts of machine learning methods can be applied.
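
As an illustration only (not part of this project's pipeline): once every title is a fixed-length vector, an off-the-shelf classifier can be trained on it directly. A sketch with stand-in random data, assuming scikit-learn is available:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: 100 "titles" embedded as 200-dimensional vectors, 3 tag classes.
vecs = np.random.randn(100, 200)
labels = np.random.randint(0, 3, size=100)

clf = LogisticRegression().fit(vecs, labels)
print(clf.predict(vecs[:3]))  # predicted tag classes for the first three titles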

I first tried gensim's Doc2Vec, but the accuracy was poor.

Searching through papers, I found Tweet2Vec, a method based on an RNN (GRU), with code available on GitHub. In the original paper, 2 million tweets were classified into 2,000 hashtags with higher accuracy than word-based methods.

Most Qiita posts carry multiple tags, and their titles are about as long as tweets, so I figured the method could be repurposed.

Tweet2Vec is implemented in Theano + Lasagne (Python 2.7).

Creating a data file

I used the 393 tags that each have more than 100 posts. Of the roughly 63,000 posts carrying these tags, 5% went to testing, 5% to CV, and the remaining 90% to training.

English and Chinese posts are mixed in, but I did not filter them out. The original paper used only English tweets and lowercased everything as preprocessing; I skip that as well.

The training data contains 2,281 distinct characters, so the input dimensionality is far smaller than it would be with a bag of words.


from collections import Counter
from random import shuffle
import io

# Count how often each tag occurs and keep the 393 most common ones.
cnt = Counter()
for item in QiitaItem.select():
    for tag in item.raw['tags']:
        cnt[tag['name']] += 1

common_tags = [name for name, c in cnt.most_common(393)]

# Collect the posts that carry at least two of the common tags.
samples_all = []

for item in QiitaItem.select():
    tags = [tag['name'] for tag in item.raw['tags']]
    intersection = list(set(common_tags) & set(tags))
    if len(intersection) > 1:
        samples_all.append((item, intersection))

# Shuffle, then split into 90% train / 5% test / 5% CV.
shuffle(samples_all)
n_all = len(samples_all)
n_test = n_cv = int(n_all / 20)
n_train = n_all - n_test - n_cv
samples_train = samples_all[:n_train]
samples_test = samples_all[n_train:(n_train+n_test)]
samples_cv = samples_all[(n_train+n_test):]

# Training file: one "tag<TAB>title" line per (post, tag) pair.
with io.open('misc/qiita_train.txt', 'w') as f:
    for item, tags in samples_train:
        for tag in tags:
            f.write(tag + '\t' + item.raw['title'] + '\n')

# Test and CV files: one line per post, with all its tags comma-separated.
with io.open('misc/qiita_test.txt', 'w') as f:
    for item, tags in samples_test:
        f.write(','.join(tags) + '\t' + item.raw['title'] + '\n')

with io.open('misc/qiita_cv.txt', 'w') as f:
    for item, tags in samples_cv:
        f.write(','.join(tags) + '\t' + item.raw['title'] + '\n')
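
As a quick check of the character-count claim above, a minimal sketch that counts the distinct characters in the generated training file:

import io
from collections import Counter

char_cnt = Counter()
with io.open('misc/qiita_train.txt', 'r') as f:
    for line in f:
        # Only the title part (after the tab) feeds the character vocabulary.
        char_cnt.update(line.rstrip('\n').split('\t', 1)[1])

print(len(char_cnt))  # number of distinct characters (2,281 for the data above)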

Run

Three shell scripts are provided; rewrite the file paths inside them and run.

To train the model:

./tweet2vec_trainer.sh

To check the accuracy:

./tweet2vec_tester.sh

To assign tags to new data (when the correct tags are unknown):

./tweet2vec_encoder.sh

A couple of caveats:

- If you train the model with too few tags, tweet2vec_encoder.sh fails with an error.
- Some Qiita titles contain tabs and newlines, so you need to strip them (see the sketch below) or modify the Tweet2Vec code.
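
A minimal sketch of that stripping (sanitize_title is a hypothetical helper, not part of Tweet2Vec), applied wherever the data files are written:

import re

def sanitize_title(title):
    # Tabs and newlines would break the one-sample-per-line, tag<TAB>title format.
    return re.sub(r'[\t\r\n]+', ' ', title).strip()

# e.g. when writing the training file:
# f.write(tag + '\t' + sanitize_title(item.raw['title']) + '\n')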

Use a GPU instance

Training on an 8,000-line training file as a trial took about 2.7 hours on my MacBook. The final training file is about 140,000 lines, which would take an estimated 48 hours locally, so I decided to spin up a GPU instance on AWS.

I installed Theano on an AWS GPU instance by following this article. I stumbled briefly because matplotlib would not install with pip, but sudo apt-get install python-matplotlib worked.

A dropped connection mid-run would be a problem, so I ran everything inside screen. My terminal did crash later, almost on cue, but the session survived thanks to this.

screen -S "qiita"
./tweet2vec_trainer.sh

Training finished in 12 epochs, taking about 3.5 hours. That is roughly the estimated 10x speedup, so it was worth the instance cost.

Tag prediction results

Tag prediction accuracy

Precision@1 was 72.10% on the test data, and 70.49% even on the CV data, which was never referenced during training.

./tweet2vec_tester.sh (qiita_test.txt)

Precision @ 1 = 0.721040939384
Recall @ 10 = 0.777895906062
Mean rank = 4.29197080292

./tweet2vec_tester.sh (qiita_cv.txt)

Precision @ 1 = 0.704855601396
Recall @ 10 = 0.779064847138
Mean rank = 4.05744208188
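
For reference, these metrics come from where the true tag lands in each predicted ranking. A simplified sketch assuming a single true tag per sample (the bundled tester handles multiple tags):

import numpy as np

# Toy example: ranks[i] is the 1-based position of the true tag
# in the model's ranking for the i-th test sample.
ranks = np.array([1, 3, 1, 12, 2])

print(np.mean(ranks == 1))   # Precision @ 1
print(np.mean(ranks <= 10))  # Recall @ 10
print(ranks.mean())          # Mean rank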

Since the hashtag prediction precision on tweets in the original paper was about 24%, this looks quite accurate, even granting that the subject matter is very different.

The following are examples of tag predictions on the CV data.

Examples predicted well

| # | Post title | Actual tags | Predicted tags (top 10) |
|---|------------|-------------|-------------------------|
| 1 | How to run a Java program from Maven | Java, Maven | Maven, Java, java8, Eclipse, gradle, Android, Tomcat, Groovy, JavaEE, JUnit |
| 2 | Asynchronous infinite loop in JavaScript | JavaScript, Node.js | JavaScript, HTML, jQuery, Node.js, CSS, HTML5, CoffeeScript, js, Ajax, es6 |
| 3 | What to do from buying a Mac to developing Rails | Mac, Rails | Ruby, Rails, Mac, Rails4, rbenv, RubyOnRails, MacOSX, Zsh, homebrew, Gem |
| 4 | [Unity] UI knowledge gained from developing smartphone games at a game company | .NET, Unity3D, C#, Unity | Unity, C#, Unity3D, .NET, Android, iOS, LINQ, VisualStudio, android development, Java |
| 5 | R language - High performance computing | R, statistics | R, statistics, Math, Numerical calculation, Data analysis, Ruby, statistics, Natural language processing, Python, NLP, R language |
| 6 | Initial setup of a Sakura VPS server and building a LAMP environment | PHP, MacOSX, Sakura VPS | Sakura VPS, vps, Apache, CentOS, PHP, MySQL, Linux, CentOS6.x, WordPress, postfix |
| 7 | Make it frame out with animation (Storyboard) | Xcode, Objective-C | iOS, Swift, Storyboard, Xcode, Objective-C, Xcode6, Android, UI, iOS8, iPhone |
| 8 | Welcome to Apple's rejection hell, discovered by making a dating app | Xcode, iOS | iOS, Swift, Xcode, Objective-C, Android, iPhone, CocoaPods, JavaScript, Mac, AdventCalendar |
| 9 | Show opaque elements inside translucent elements | HTML, CSS | CSS, HTML, HTML5, CSS3, JavaScript, jQuery, bootstrap, Android, js, Java |
| 10 | Making a Chinese word segmentation app with golang and docker (title in Chinese) | golang, docker | Go, golang, docker, Ruby, Slack, vagrant, Rails, GitHub, OSX, Erlang |

- Basically, when a tag name appears verbatim in the title, it is correctly ranked among the top candidates.
- As in examples 1 and 2, Java and JavaScript are never confused: related concepts are separated without being fooled by surface-level character similarity. That sounds obvious, but when I applied Doc2Vec character by character, it could not do this.
- Tags can be predicted from "related terms": in example 6 it presumably derives "Linux", "Apache", "MySQL", and "PHP" from "LAMP", and in example 9 it presumably recognizes a CSS-related post from the word "element".
- Example 10 is in Chinese, but it is tagged without any confusion.

Examples predicted poorly

| # | Post title | Actual tags | Predicted tags (top 10) |
|---|------------|-------------|-------------------------|
| 1 | A complete amateur took part in the ISUCON 5 finals | golang, MySQL, Go, nginx | iOS, Unity, C#, Objective-C, JavaScript, Swift, Ruby, .NET, IoT, JSON |
| 2 | Identify edible mushrooms | MachineLearning, Python, matplotlib | Linux, iOS, ShellScript, Ruby, Python, Objective-C, CentOS, PHP, Bash, Swift |
| 3 | Switch Japanese input to the right Alt key in gnome3 (AX keyboard style) | CentOS, Linux | JavaScript, Java, OSX, homebrew, Mac, api, HTML, Node.js, MacOSX, Artificial intelligence |
| 4 | Static files are cached (when it is not the browser cache) | VirtualBox, Apache | JavaScript, IoT, CSS, Chrome, firefox, MacOSX, Windows, HTML5, jQuery, Arduino |
| 5 | Get the username (the part after the @) of whoever tweeted with a given hashtag | Twitter, TwitterAPI, Gem, Ruby | Ruby, Rails, AWS, JavaScript, Python, Go, Java, jq, golang, PHP |
| 6 | What to do when "Form in view: could not find implicit value for parameter messages: play.api.i18n.Messages" appears | Scala, PlayFramework | Mac, MacOSX, OSX, Xcode, Android, Linux, Ruby, Ubuntu, Java, Windows |

- Naturally, when the title gives too few hints or is too vague, prediction goes poorly.
- "gnome" in example 3 and "hashtag" / "tweeted" in example 5 are hint words, but there was probably not enough training data for the model to learn what they relate to.

Finding similar posts

Since Tweet2Vec yields a vector representation for each post title, let's output similar posts based on cosine distance.

Modify encode_char.py slightly so that it writes out the necessary files.

encode_char.py



  print("Encoding...")
  out_data = []
  out_pred = []
  out_emb = []
  numbatches = len(Xt)/N_BATCH + 1
  for i in range(numbatches):
      xr = Xt[N_BATCH*i:N_BATCH*(i+1)]
      x, x_m = batch.prepare_data(xr, chardict, n_chars=n_char)
      p = predict(x,x_m)
      e = encode(x,x_m)
      ranks = np.argsort(p)[:,::-1]

      # Keep the raw text, the top-5 predicted tags, and the embedding for each input line.
      for idx, item in enumerate(xr):
          out_data.append(item)
          out_pred.append(' '.join([inverse_labeldict[r] for r in ranks[idx,:5]]))
          out_emb.append(e[idx,:])

  # Save
  print("Saving...")
  with open('%s/data.pkl'%save_path,'w') as f:
      pkl.dump(out_data,f)
  with io.open('%s/predicted_tags.txt'%save_path,'w') as f:
      for item in out_pred:
          f.write(item + '\n')
  with open('%s/embeddings.npy'%save_path,'w') as f:
      np.save(f,np.asarray(out_emb))

I encode roughly 130,000 Qiita post titles, including posts that were not used for training or testing.

with io.open('../misc/qiita_all_titles.txt', 'w') as f:
    for item in QiitaItem.select():
        f.write(item.raw['title'] + '\n')

Rewrite the script so the results go to a directory called qiita_result_all, then run it.

./tweet2vec_encoder.sh

It took about 3 hours on my local machine. I should have used the GPU instance for this, too.

The Python code that displays similar posts looks like this.

import cPickle as pkl
import io
import random
import numpy as np
from scipy import spatial

# Load the titles and embeddings written out by the modified encode_char.py.
with io.open('qiita_result_all/data.pkl', 'rb') as f:
    titles = pkl.load(f)

with io.open('qiita_result_all/embeddings.npy', 'rb') as f:
    embeddings = np.load(f)

n = len(titles)

def most_similar(idx):
    # Rank every other post by cosine distance to the query title.
    sims = [(i, spatial.distance.cosine(embeddings[idx], embeddings[i])) for i in range(n) if i != idx]
    sorted_sims = sorted(sims, key=lambda sim: sim[1])
    print titles[idx]
    for sim in sorted_sims[:5]:
        print "%.3f" % (1 - sim[1]), titles[sim[0]]

Execution result

The number in the left column is the cosine similarity (1 is the maximum, -1 the minimum); the right column is the title.

>>> most_similar(random.randint(0,n-1))
A mysterious phenomenon when using multiple Google Maps together with jQuery wrap()
0.678 Combine multiple Google calendars
0.619 [GM] I want to know the latitude and longitude of addresses searched on Google Maps.
0.601 Quickly display Google Maps (API v3 version)
0.596 Speed measurement when using jQuery Sizzle and filter API together on a smartphone browser
0.593 How to display Google Maps on your site from your address

>>> most_similar(random.randint(0,n-1))
I just want to add scipy, but it's a messy note. Ubuntu,Python3.
0.718 Installing SciPy and matplotlib (Python)
0.666 Creating a 3D scatter plot with SciPy + matplotlib (Python)
0.631 scipy pip install stumbles on Python 2.7.8
0.622 [Note] future statement ~ Python ~
0.610 Try using scipy

>>> most_similar(random.randint(0,n-1))
Align multi-line UILabel text to the upper left
0.782 Aligning UILabel text in IB
0.699 Output NGUI UILabel one character at a time
0.624 Center UILabel text in Swift
0.624 Insert line breaks into UILabel text from IB
0.624 Synchronize scrolling of two UITableViews

>>> most_similar(random.randint(0,n-1))
[Unity]What to do before managing your project with git or svn
0.810 Setting items when managing Git with Unity
0.800 [Unity] Setting which files to ignore (.gitignore) when managing with git
0.758 [Unity]A simple sample of exchanging messages over a network
0.751 [Unity basic operation] Convenient V keys and snaps when creating maps and dungeons
0.746 [Unity]How to load a text file with a script and display its contents

>>> most_similar(random.randint(0,n-1))
Drawing a circle using a rotation matrix in Java
0.911 java: line drawing using a rotation matrix
0.898 Mutual conversion of integers and byte arrays in Java
0.897 Display Fibonacci sequence in Java
0.896 Drawing a circle using a Java 2D rotation matrix
0.873 Confirmed Java string concatenation efficiency

>>> most_similar(random.randint(0,n-1))
List of songs played at WWDC14
0.830 Participation in WWDC15
0.809 WWDC 2014 notes
0.797 I tried summarizing WWDC 2015
0.772 Summary of questions I asked at WWDC 2015
0.744 The WWDC 2016 rejection email turned out to be an acceptance email

- Even though upper- and lowercase letters were learned as distinct characters, the model seems to recognize "scipy" and "SciPy", "Java" and "java", "jQuery" and "jquery", and "Git" and "git" as the same concepts.
- Likewise, "Google Map", "Google Maps", and "GoogleMap" appear to be treated as the same concept.
- Yet "Java" and "JavaScript", or "Go" and "Google", are not confused.
- "WWDC", a topic not among the 393 training tags, still surfaces closely related posts, probably helped by the character-level similarity of the titles.

Considerations and issues

- I am extremely grateful that this level of accuracy can be reached just by training: no morphological analysis, no user dictionary, and almost no preprocessing.
- Can it be applied to documents that are even harder to process, such as Qiita post bodies?
- This time I simply used Tweet2Vec as-is, so I want to understand its internals properly and build a better model.
- For an actual product I would also want to tune the hyperparameters, and to fold in the various meta information I ignored this time.

Recommended Posts

Classify Qiita posts without morphological analysis with Tweet2Vec
[Python] Morphological analysis with MeCab
Japanese morphological analysis with Python
[PowerShell] Morphological analysis with SudachiPy
Text mining with Python ① Morphological analysis
I played with Mecab (morphological analysis)!
Python: Simplified morphological analysis with regular expressions
Text mining with Python ① Morphological analysis (re: Linux version)
Make a morphological analysis bot loosely with LINE + Flask
Collecting information from Twitter with Python (morphological analysis with MeCab)
Classify articles with tags specified by Qiita by unsupervised learning