Hello, fellow copy-and-paste data scientists.
About three years ago I gave a silly lightning talk called "Ramen and Natural Language Processing". It has aged embarrassingly, so I decided to redo it in Python.
We use a technique called statistical latent semantic analysis (here, LDA topic modeling). Roughly speaking, it tells you what topics a document contains and what it is about.
Since each document can be assigned a ratio over topics, as in the image below, we can compute, for example, that documents A and B are close.
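(The toy example below is my own illustration, not from the original talk.) If each document's topic ratios are treated as a vector, "closeness" can be measured with cosine similarity. With made-up numbers:

```python
import math

# Hypothetical topic-ratio vectors over 3 topics (numbers are made up)
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

print(cosine(doc_a, doc_b))  # close to 1: A and B are similar
print(cosine(doc_a, doc_c))  # much smaller: A and C are not
```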
This post doesn't use it exactly as pictured above, but the following application examples and books may be helpful.
The preamble has run long. First, here is the general flow.
#Reading the collected document data
from io_modules import load_data #Self-made DB read library
rows = load_data(LOAD_QUERY, RAMEN_DB)
#Extract stems with the stems function from the referenced article
from utils import stems #Implemented almost verbatim from the referenced article
docs = [stems(row) for row in rows]
"""
docs = [
['Large serving', 'Impressions', 'Direction', 'Best', 'ramen', ...
['ramen', 'queue', 'Cold', 'Hot', 'joy', ...
...
]
"""
From here, we will actually perform LDA using gensim. First, create a dictionary and corpus for gensim.
import gensim
A note on naming: this "dictionary" is not the user dictionary MeCab uses for word segmentation; it is gensim's mapping between the words that appear in the documents and integer word IDs.
dictionary = gensim.corpora.Dictionary(docs)
dictionary.save_as_text('./data/text.dict') #Save
# gensim.corpora.Dictionary.load_from_text('./data/text.dict') #File can be loaded from next time
"""
Word ID Word appearance count
1543 Clam 731
62 Easy 54934
952 Warm 691
672 hot 1282
308 Thank you 4137
・
・
"""
Next, the collected reviews are converted into a corpus, which will be used to train the model.
corpus = [dictionary.doc2bow(doc) for doc in docs]
gensim.corpora.MmCorpus.serialize('./data/text.mm', corpus) #Save
# corpus = gensim.corpora.MmCorpus('./data/text.mm') #File loading is possible from the next time
"""\
doc_id word_id frequency of occurrence
6 150 3 # word_id=150:Bean sprouts
6 163 9 # word_id=163:soy sauce
6 164 1
6 165 1
・
・
"""
There is some debate about whether this step is necessary, but this time we apply TF-IDF weighting to the corpus before running LDA.
tfidf = gensim.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
#Save the computed result with pickle
import pickle
with open('./data/corpus_tfidf.dump', mode='wb') as f:
pickle.dump(corpus_tfidf, f)
#You can load it from the next time
# with open('./data/corpus_tfidf.dump', mode='rb') as f:
# corpus_tfidf = pickle.load(f)
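For reference, the TF-IDF weighting itself can be sketched in a few lines of pure Python. This approximates gensim's defaults, roughly tf * log2(N/df) with L2 normalization per document; gensim's actual TfidfModel has more options, so treat this as a sketch:

```python
import math

def tfidf_corpus(corpus, num_docs):
    """Sketch of gensim-style TF-IDF over a bag-of-words corpus:
    weight = tf * log2(N / df), zero weights dropped, each doc L2-normalized."""
    df = {}
    for bow in corpus:
        for word_id, _ in bow:
            df[word_id] = df.get(word_id, 0) + 1
    out = []
    for bow in corpus:
        weighted = [(w, tf * math.log2(num_docs / df[w])) for w, tf in bow]
        weighted = [(w, v) for w, v in weighted if v != 0]
        norm = math.sqrt(sum(v * v for _, v in weighted)) or 1.0
        out.append([(w, v / norm) for w, v in weighted])
    return out

# Word 1 appears in every document, so its weight drops to zero
corpus = [[(0, 2), (1, 1)], [(1, 3), (2, 1)]]
print(tfidf_corpus(corpus, 2))  # [[(0, 1.0)], [(2, 1.0)]]
```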
Now that we are ready, we will actually do LDA with gensim. This time, I classified it into 50 topics.
#Depending on the number of documents, this may take several hours
# 2018-12-03 postscript: using LdaMulticore with more workers may be much faster
lda = gensim.models.LdaModel(corpus=corpus_tfidf, id2word=dictionary,
num_topics=50, minimum_probability=0.001,
passes=20, update_every=0, chunksize=10000)
lda.save('./data/lda.model') #Save
# lda = gensim.models.LdaModel.load('./data/lda.model') #Can be loaded from next time
Now let's display the contents of the learned model.
A few topics expressing general impressions are mixed in (#0, #36, #42, etc.), but on the whole the topics capture styles of ramen (#2: miso, #49: iekei (family line), etc.), so it seems a reasonable topic model has been trained.
for i in range(50):
print('tpc_{0}: {1}'.format(i, lda.print_topic(i)[0:80]+'...'))
==============
tpc_0: 0.019*Impressed+ 0.014*impact+ 0.013*Long-sought+ 0.012*Difficulty+ 0.012*delicious+ 0.011*ramen+ 0.010*Deep emotion+...
tpc_1: 0.035*Grilled pork+ 0.022*chilled Chinese noodles+ 0.018*hot+ 0.010*Addictive+ 0.009*Stubborn+ 0.008*delicious+ 0.008*Ma...
tpc_2: 0.050*miso+ 0.029*Miso+ 0.017*ginger+ 0.013*butter+ 0.012*Bean sprouts+ 0.011*lard+ 0.009*corn+...
tpc_3: 0.013*Flavor+ 0.010*garlic+ 0.010*Rich+ 0.009*roasted pork fillet+ 0.008*oil+ 0.008*Rich+ 0.008*...
tpc_4: 0.010*Soy sauce+ 0.009*use+ 0.009*kelp+ 0.008*Material+ 0.007*soup+ 0.007*seafood+ 0.007*roasted pork fillet...
tpc_5: 0.015*Come+ 0.014*Clams+ 0.012*Thin+ 0.010*ramen+ 0.010*popularity+ 0.010*It feels good+ 0.010*...
tpc_6: 0.047*Shrimp+ 0.046*shrimp+ 0.014*sesame+ 0.014*shrimp+ 0.012*Addictive+ 0.008*delicious+ 0.008*Sukiyaki...
tpc_7: 0.016*Unpalatable+ 0.015*Expectations+ 0.013*bad+ 0.012*Sorry+ 0.012*delicious+ 0.011*usually+ 0.011*ramen...
tpc_8: 0.070*Soba+ 0.015*Soboro+ 0.013*Attach+ 0.012*Mentaiko+ 0.012*chicken+ 0.010*Rich+ 0.010*delicious+...
tpc_9: 0.041*Citron+ 0.024*Japanese style+ 0.017*Stew+ 0.010*Trefoil+ 0.010*life+ 0.009*delicious+ 0.009*seafood+...
tpc_10: 0.040*Vegetables+ 0.027*garlic+ 0.018*Extra+ 0.013*Garlic+ 0.010*Bean sprouts+ 0.010*Less+ 0.009*Ca...
tpc_11: 0.026*Handmade+ 0.023*Offal+ 0.016*Ginger+ 0.010*spicy+ 0.010*ramen+ 0.009*delicious+ 0.008*Feeling...
tpc_12: 0.031*Buckwheat+ 0.030*Soba+ 0.029*Chinese+ 0.016*Plain hot water+ 0.011*Shamo chicken+ 0.008*delicious+ 0.007*ramen+...
tpc_13: 0.057*black+ 0.023*black+ 0.020*Black+ 0.018*Soy sauce+ 0.011*stamina+ 0.010*oyster+ 0.009*Appearance...
tpc_14: 0.060*Tanmen+ 0.048*shrimp+ 0.019*Vegetables+ 0.014*Chinese cabbage+ 0.011*Fish ball+ 0.009*Gyoza+ 0.007*delicious...
tpc_15: 0.073*Spicy+ 0.015*Spicy+ 0.012*miso+ 0.011*Peppers+ 0.011*Sansho+ 0.010*Spicy+ 0.010*spicy miso+ 0...
tpc_16: 0.031*Aoba+ 0.029*Mesh+ 0.019*double+ 0.012*seafood+ 0.010*trend+ 0.009*instant+ 0.009*Rame...
tpc_17: 0.041*Replacement ball+ 0.017*Replacement ball+ 0.014*Tonkotsu+ 0.014*Mustard+ 0.010*Extra fine+ 0.010*ramen+ 0.009*Red...
tpc_18: 0.032*Nostalgic+ 0.023*Easy+ 0.016*meaning+ 0.012*ramen+ 0.011*friendly+ 0.010*Feeling+ 0.010*Ah...
tpc_19: 0.027*Lemon+ 0.016*Normal+ 0.011*guts+ 0.009*Regrettable+ 0.009*steak+ 0.008*Rich+ 0.008*Delicious...
tpc_20: 0.088*Niboshi+ 0.009*Soba+ 0.008*fragrance+ 0.008*ramen+ 0.008*soup+ 0.007*roasted pork fillet+ 0.007*Soy sauce...
tpc_21: 0.023*sushi+ 0.015*Recommended+ 0.012*favorite+ 0.010*ramen+ 0.009*delicious+ 0.008*Growing up+ 0.008*...
tpc_22: 0.025*Fried+ 0.021*Fashionable+ 0.017*Fashionable+ 0.016*Cafe+ 0.014*Fashionable+ 0.014*atmosphere+ 0.011*...
tpc_23: 0.024*value+ 0.022*White miso+ 0.018*miso+ 0.014*red miso+ 0.010*ultimate+ 0.010*delicious+ 0.009*burnt+...
tpc_24: 0.095*Fried rice+ 0.040*set+ 0.017*mini+ 0.013*Gyoza+ 0.012*ramen+ 0.011*delicious+ 0.009*...
tpc_25: 0.024*Oden+ 0.015*Nostalgic+ 0.013*Grilled meat+ 0.011*flat+ 0.010*Dark mouth+ 0.010*ramen+ 0.009...
tpc_26: 0.010*Off+ 0.009*ramen+ 0.009*delicious+ 0.008*serious+ 0.008*Delicious+ 0.008*Noisy+ 0.008...
tpc_27: 0.073*Mochi+ 0.032*Kimchi+ 0.012*Spicy miso+ 0.010*Delicious+ 0.010*delicious+ 0.008*roasted pork fillet+ 0.00...
tpc_28: 0.021*Sudachi+ 0.019*Shichimi+ 0.018*Men+ 0.015*onion+ 0.011*Onion+ 0.010*Disappointing+ 0.010*Attach...
tpc_29: 0.079*Gyoza+ 0.026*beer+ 0.011*delicious+ 0.010*ramen+ 0.009*draft beer+ 0.009*Soy sauce+ 0.008...
tpc_30: 0.021*Tightening+ 0.018*Asexual+ 0.018*germ+ 0.015*Sake lees+ 0.010*Cooked in water+ 0.009*crab+ 0.009*Rich+ 0....
tpc_31: 0.051*Champon+ 0.024*student+ 0.015*Tantan+ 0.011*seafood+ 0.009*shock+ 0.009*Genuine+ 0.009*Delicious...
tpc_32: 0.025*odor+ 0.023*odor+ 0.016*smell+ 0.010*secret+ 0.010*Delicious+ 0.010*ramen+ 0.010*Soup...
tpc_33: 0.010*Soy sauce+ 0.009*roasted pork fillet+ 0.008*seafood+ 0.008*taste+ 0.007*soup+ 0.007*Menma+ 0.007*good...
tpc_34: 0.074*curry+ 0.040*Fried rice+ 0.015*Ganso+ 0.011*spices+ 0.010*set+ 0.008*delicious+ 0.008*La...
tpc_35: 0.068*Tomato+ 0.031*cheese+ 0.015*Italian+ 0.014*pasta+ 0.011*hormone+ 0.011*risotto+ 0.00...
tpc_36: 0.038*Colleague+ 0.014*strongest+ 0.010*hard+ 0.010*ramen+ 0.010*Dantotsu+ 0.009*delicious+ 0.009*topic...
tpc_37: 0.059*Tonkotsu+ 0.026*soy sauce+ 0.025*children+ 0.015*Delicious+ 0.012*Muddy+ 0.012*ramen+ 0.01...
tpc_38: 0.027*rice+ 0.025*rice ball+ 0.022*rice+ 0.016*Rice porridge+ 0.014*rice+ 0.012*pickles+ 0.011*set...
tpc_39: 0.026*Yuzu+ 0.019*pale+ 0.009*Aging+ 0.009*Grilled pork+ 0.008*Soy sauce+ 0.008*roasted pork fillet+ 0.007*soup...
tpc_40: 0.042*Udon+ 0.012*Skipjack+ 0.009*Yeah+ 0.009*tempura+ 0.009*ramen+ 0.008*delicious+ 0.008*Feeling...
tpc_41: 0.023*Salty+ 0.020*Who+ 0.012*junk+ 0.012*Attach+ 0.009*French+ 0.008*chef+ 0.008*Ra...
tpc_42: 0.029*friend+ 0.028*Delicious+ 0.015*queue+ 0.015*delicious+ 0.013*ramen+ 0.013*Easy+ 0.012*...
tpc_43: 0.012*Menma+ 0.011*roasted pork fillet+ 0.010*Soy sauce+ 0.009*Leek+ 0.009*good+ 0.008*seafood+ 0.008*soup...
tpc_44: 0.040*Attach+ 0.014*Rich+ 0.013*Slimy+ 0.013*Split+ 0.013*seafood+ 0.013*Fishmeal+ 0.011*Prime+ 0....
tpc_45: 0.019*Bad taste+ 0.017*Rock glue+ 0.017*Crowndaisy+ 0.012*No. 1 in Japan+ 0.010*delicious+ 0.009*ramen+ 0.008*line...
tpc_46: 0.074*Wonton+ 0.045*men+ 0.015*roasted pork fillet+ 0.009*delicious+ 0.008*Wonton+ 0.008*Soy sauce+ 0.007*...
tpc_47: 0.027*Usually+ 0.019*series+ 0.017*Soba noodles+ 0.012*Pickled+ 0.010*old+ 0.010*Delicious+ 0.010*Hard...
tpc_48: 0.018*half+ 0.014*salad+ 0.014*dessert+ 0.014*cuisine+ 0.013*Izakaya+ 0.012*tofu+ 0.010*set...
tpc_49: 0.068*Family line+ 0.019*spinach+ 0.013*Seaweed+ 0.010*Soy sauce+ 0.010*roasted pork fillet+ 0.010*Dark+ 0.010*La...
Now that the model is trained, we can compute a store's topic breakdown (its topic vector) by passing the store's reviews through it.
First, let's feed in a single tabelog pickup review of the Shinpuku Saikan main store in Kyoto, a shop I have loved since junior high school, and check the performance. Before that, to make the results easier to follow, let me show what kind of ramen Shinpuku Saikan serves.
"Shinpuku Saikan Kyoto"
It's black ... It's a black ramen.
However, unlike its appearance, it is a surprisingly light yet rich, delicious soy-sauce ramen.
Now, what kind of result will the classifier built this time return?
#Quote source: http://tabelog.com/kyoto/A2601/A260101/26000791/dtlrvwlst/763925/
#By the way, I personally prefer not to put raw eggs in it lol
>> str="When I was on a business trip to Kyoto, I stopped by the main store of Shinpuku Saikan, which I had longed for.(Omission)After all, the soup is adjusted, the noodles are boiled, the ingredients are served, the char siu is delicious, and the main store is even more delicious! I felt that. Anyway, this is cheap at 650 yen. "Chinese soba" at Shinfukusaikan is characterized by a black soup like the broth of char siu and a generous amount of Kujo green onions. Of course, "Yakimeshi" is irresistible.(Omission)On my way home, I suddenly saw a customer's order (I saw the review here later and found out that it was "extra large Shinfuku soba"), and I like raw eggs. Then, I was shocked, "Oh, could I have done that?" It seems to go well with the soup at Shinpuku Saikan, and I will definitely ask for it next time."
>> vec = dictionary.doc2bow(utils.stems(str))
#Classification result display
>> print(lda[vec])
[(0, 0.28870310712135505), (8, 0.25689765230576195), (13, 0.3333132412551591), (31, 0.081085999317724824)]
#Topic contents, in descending order of influence
>> lda.print_topic(13) #It captures the characteristics of the black ramen at Shinfukusaikan
0.057*black+ 0.023*black+ 0.020*Black+ 0.018*Soy sauce+ 0.011*stamina+ 0.010*oyster+ 0.009*Appearance+ 0.008*Dark+ 0.008*ramen+ 0.008*delicious
>> lda.print_topic(0)
0.019*Impressed+ 0.014*impact+ 0.013*Long-sought+ 0.012*Difficulty+ 0.012*delicious+ 0.011*ramen+ 0.010*Deep emotion+ 0.010*queue+ 0.010*Delicious+ 0.008*delicious
>> lda.print_topic(8)
0.070*Soba+ 0.015*Soboro+ 0.013*Attach+ 0.012*Mentaiko+ 0.012*chicken+ 0.010*Rich+ 0.010*delicious+ 0.008*roasted pork fillet+ 0.008*Easy+ 0.007*rice
>> lda.print_topic(31)
0.051*Champon+ 0.024*student+ 0.015*Tantan+ 0.011*seafood+ 0.009*shock+ 0.009*Genuine+ 0.009*delicious+ 0.008*ramen+ 0.008*Vegetables+ 0.008*Special
The black-ramen topic (#13) came out on top, as expected.
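Incidentally, lda[vec] returns (topic_id, probability) pairs ordered by topic id, not by weight; a small helper (my own, hypothetical name) ranks them by influence:

```python
def top_topics(topic_probs, n=3):
    """Sort (topic_id, probability) pairs by probability, descending."""
    return sorted(topic_probs, key=lambda tp: tp[1], reverse=True)[:n]

# The single-review distribution shown above, abbreviated to four entries
dist = [(0, 0.2887), (8, 0.2569), (13, 0.3333), (31, 0.0811)]
print(top_topics(dist))  # [(13, 0.3333), (0, 0.2887), (8, 0.2569)]
```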
Now that we know it works, let's concatenate several hundred sentences from the Web that mention the Shinpuku Saikan main store and feed them to the classifier.
>> str = "Sentences collected on the WEB"
>> sinpuku_vec = dictionary.doc2bow(utils.stems(str))
>> print(lda[sinpuku_vec])
[(0, 0.003061940579476011), (5, 0.001795672854987279), (7, 0.016165280743592875), (11, 0.0016683462844631061), (13, 0.387457274481951), (16, 0.048457912903426922), (18, 0.025816920842756448), (19, 0.0014647251485231138), (20, 0.0018013651819984121), (21, 0.001155430885775867), (24, 0.11249915373166983), (25, 0.0030405756373518885), (26, 0.0031413889216075561), (27, 0.0030955757983300515), (29, 0.0021349369911582098), (32, 0.006158571006380364), (34, 0.061260735988294568), (36, 0.0023903609848973475), (37, 0.020874795314517719), (41, 0.0018301667593946488), (42, 0.27803177713836785), (45, 0.0055461332216832828), (46, 0.0016396961473594117), (47, 0.0056507918659765869)]
>> lda.print_topic(13) #value: 0.38
0.057*black+ 0.023*black+ 0.020*Black+ 0.018*Soy sauce+ 0.011*stamina+ 0.010*oyster+ 0.009*Appearance+ 0.008*Dark+ 0.008*ramen+ 0.008*delicious
>> lda.print_topic(42) #value: 0.27
0.029*friend+ 0.028*Delicious+ 0.015*queue+ 0.015*delicious+ 0.013*ramen+ 0.013*Easy+ 0.012*Wow+ 0.011*Feeling+ 0.011*Famous+ 0.011*Many
>> lda.print_topic(24) #value: 0.11
0.095*Fried rice+ 0.040*set+ 0.017*mini+ 0.013*Gyoza+ 0.012*ramen+ 0.011*delicious+ 0.009*Delicious+ 0.009*Single item+ 0.008*order+ 0.008*roasted pork fillet
The output is a little hard to read, but fried rice appears as the third-strongest topic, reflecting the opinions of many reviewers.
The author of the earlier review had apparently not tried it, but Shinpuku Saikan is in fact also famous for its black fried rice. From this (?) we can see that dividing ramen shops into topics via collective intelligence works.
The classifier can now compute a topic vector for any ramen shop. Finally, let's use it to find similar stores.
In principle you can find stores similar to any store; continuing from the previous section, I will look for stores similar to the Shinpuku Saikan main store.
I'll omit the detailed code, but broadly it works as follows: compute the LDA topics of ramen shops nationwide, then compute each shop's similarity to the LDA topic vector of the Shinpuku Saikan main store.
#various settings
MIN_SIMILARITY = 0.6 #Similarity threshold
RELATE_STORE_NUM = 20 #Number of similar stores extracted
#LDA topic calculation of ramen shops nationwide
from my_algorithms import calc_vecs
(names, prefs, vecs) = calc_vecs(reviews, lda, dictionary)
#Calculate the similarity between the Shinpuku Saikan main store and the LDA topics of ramen shops nationwide
#Note: cosine_similarity expects dense 2D input, so the sparse (topic_id, prob)
#pairs from the LDA model must first be expanded into dense vectors of length num_topics
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity([sinpuku_vec], vecs)[0]
#Show similar stores
import pandas as pd
df = pd.DataFrame({'name': names,
                   'pref': prefs,
                   'similarity': similarities})
relate_store_list = df[df.similarity > MIN_SIMILARITY] \
                        .sort_values(by="similarity", ascending=False) \
                        .head(RELATE_STORE_NUM)
print(relate_store_list)
==============
id similarity pref name
0 0.934 toyama Makoto
1 0.898 hokkaido Isono Kazuo
2 0.891 shiga Kinkuemon Mitsui Outlet Park Shiga Ryuo
3 0.891 kyoto Shinfukusaikan Higashi Tsuchikawa store
4 0.888 osaka Kingemon Dotombori store
5 0.886 chiba Charcoal ramen
6 0.874 osaka Kingemon Esaka Esaka store
7 0.873 toyama Iroha Shosui Main Store
8 0.864 osaka Kingemon Umeda Umeda store
9 0.861 mie Hayashiya
10 0.847 niigata Ramen Tsurikichi
11 0.846 osaka Kingemon Main Store
12 0.838 toyama Menhachi Gotabiya
13 0.837 aichi Kikuya Hotel
14 0.820 hyogo Nakanoya
15 0.814 kyoto Kinkuemon Kyoto Saiin store
16 0.807 aichi yokoji
17 0.804 kumamoto favorite ramen shop
18 0.792 kyoto Shinfukusaikan Kumiyama store
19 0.791 niigata Goingoin
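One subtlety in the sketch above: cosine_similarity needs dense, equal-length vectors, while the LDA model yields sparse (topic_id, probability) pairs. The conversion can look like this (helper name is mine, not from the post):

```python
def to_dense(topic_probs, num_topics):
    """Expand sparse (topic_id, probability) pairs into a dense list,
    filling unmentioned topics with 0.0."""
    vec = [0.0] * num_topics
    for topic_id, prob in topic_probs:
        vec[topic_id] = prob
    return vec

# Abbreviated sparse output for the Shinpuku Saikan main store
sparse = [(13, 0.39), (24, 0.11), (42, 0.28)]
dense = to_dense(sparse, 50)
print(len(dense), dense[13])  # 50 0.39
```

With every shop converted this way, `cosine_similarity([dense], vecs)[0]` returns one similarity score per shop, which is what the DataFrame above expects.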
Let's pick up some and check the result.
"Makoto Ya Toyama"
"Isono Kazuo Hokkaido"
"Kanekuemon Mitsui Outlet Park Shiga Ryuo Store"
It's black ... it's superbly black. I checked the others as well, and every one of them was a black ramen, except Nakanoya in Hyogo.
Shinpuku Saikan has several branches, and even though store names were removed entirely before the analysis, its branches still surfaced in the results, which I take as a good sign.
Speaking of black ramen, Toyama Black is famous, and several such shops do appear; but Shinpuku Saikan's taste is, I think, rather different from Toyama Black. It is hard to deny that these shops were matched on the blackness of their ramen rather than on taste. I'd like to treat this subtlety as a point for future improvement.
This time, using ramen as the theme, I introduced how to extract various kinds of knowledge from text data lying around on the net, (almost) without human intervention.
Compared to methods based on human-curated supervised data the results may look unreliable, but for something obtained this easily with existing Python libraries, I think they are not bad at all.
Of course, this applies not just to ramen: anything with a sufficient amount of text can be handled the same way. Please try it on documents of your own.