Hello, fellow copy-and-paste data scientists.
About three years ago I gave a silly lightning talk called "Ramen and Natural Language Processing". It has aged embarrassingly, so I decided to redo it in Python.
We use a technique called statistical latent semantic analysis (here, LDA topic modeling). Roughly speaking, it tells you what topics a document contains and what it is about.
Since each document can be assigned a ratio over topics, as in the image below, we can compute, for example, that documents A and B are close.
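(The toy example below is my own illustration, not from the original talk.) If each document's topic ratios are treated as a vector, "closeness" can be measured with cosine similarity. With made-up numbers:

```python
import math

# Hypothetical topic-ratio vectors over 3 topics (numbers are made up)
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

print(cosine(doc_a, doc_b))  # close to 1: A and B are similar
print(cosine(doc_a, doc_c))  # much smaller: A and C are not
```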
This post doesn't use it exactly as pictured above, but the following application examples and books may be helpful.
The preamble has run long. First, here is the general flow.
#Reading the collected document data
from io_modules import load_data #Self-made DB read library
rows = load_data(LOAD_QUERY, RAMEN_DB)
#Extract stems with the stems function from the referenced article
from utils import stems #Implemented almost verbatim from the referenced article
docs = [stems(row) for row in rows]
"""
docs = [
['Large serving', 'Impressions', 'Direction', 'Best', 'ramen', ...
['ramen', 'queue', 'Cold', 'Hot', 'joy', ...
...
]
"""
From here, we will actually perform LDA using gensim. First, create a dictionary and corpus for gensim.
import gensim
A note on naming: this "dictionary" is not the user dictionary MeCab uses for word segmentation; it is gensim's mapping between the words that appear in the documents and integer word IDs.
dictionary = gensim.corpora.Dictionary(docs)
dictionary.save_as_text('./data/text.dict') #Save
# gensim.corpora.Dictionary.load_from_text('./data/text.dict') #File can be loaded from next time
"""
Word ID Word appearance count
1543 Clam 731
62 Easy 54934
952 Warm 691
672 hot 1282
308 Thank you 4137
・
・
"""
Next, the collected reviews are converted into a corpus, which will be used to train the model.
corpus = [dictionary.doc2bow(doc) for doc in docs]
gensim.corpora.MmCorpus.serialize('./data/text.mm', corpus) #Save
# corpus = gensim.corpora.MmCorpus('./data/text.mm') #File loading is possible from the next time
"""\
doc_id word_id frequency of occurrence
6 150 3 # word_id=150:Bean sprouts
6 163 9 # word_id=163:soy sauce
6 164 1
6 165 1
・
・
"""
There is some debate about whether this step is necessary, but this time we apply TF-IDF weighting to the corpus before running LDA.
tfidf = gensim.models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
#Save the computed result with pickle
import pickle
with open('./data/corpus_tfidf.dump', mode='wb') as f:
pickle.dump(corpus_tfidf, f)
#You can load it from the next time
# with open('./data/corpus_tfidf.dump', mode='rb') as f:
# corpus_tfidf = pickle.load(f)
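For reference, the TF-IDF weighting itself can be sketched in a few lines of pure Python. This approximates gensim's defaults, roughly tf * log2(N/df) with L2 normalization per document; gensim's actual TfidfModel has more options, so treat this as a sketch:

```python
import math

def tfidf_corpus(corpus, num_docs):
    """Sketch of gensim-style TF-IDF over a bag-of-words corpus:
    weight = tf * log2(N / df), zero weights dropped, each doc L2-normalized."""
    df = {}
    for bow in corpus:
        for word_id, _ in bow:
            df[word_id] = df.get(word_id, 0) + 1
    out = []
    for bow in corpus:
        weighted = [(w, tf * math.log2(num_docs / df[w])) for w, tf in bow]
        weighted = [(w, v) for w, v in weighted if v != 0]
        norm = math.sqrt(sum(v * v for _, v in weighted)) or 1.0
        out.append([(w, v / norm) for w, v in weighted])
    return out

# Word 1 appears in every document, so its weight drops to zero
corpus = [[(0, 2), (1, 1)], [(1, 3), (2, 1)]]
print(tfidf_corpus(corpus, 2))  # [[(0, 1.0)], [(2, 1.0)]]
```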
Now that we are ready, we will actually do LDA with gensim. This time, I classified it into 50 topics.
#Depending on the number of documents, this may take several hours
# 2018-12-03 postscript: using LdaMulticore with more workers may be much faster
lda = gensim.models.LdaModel(corpus=corpus_tfidf, id2word=dictionary,
num_topics=50, minimum_probability=0.001,
passes=20, update_every=0, chunksize=10000)
lda.save('./data/lda.model') #Save
# lda = gensim.models.LdaModel.load('./data/lda.model') #Can be loaded from next time
Now let's display the contents of the learned model.
A few topics expressing general impressions are mixed in (#0, #36, #42, etc.), but on the whole the topics capture styles of ramen (#2: miso, #49: iekei (family line), etc.), so it seems a reasonable topic model has been trained.
for i in range(50):
print('tpc_{0}: {1}'.format(i, lda.print_topic(i)[0:80]+'...'))
==============
tpc_0: 0.019*Impressed+ 0.014*impact+ 0.013*Long-sought+ 0.012*Difficulty+ 0.012*delicious+ 0.011*ramen+ 0.010*Deep emotion+...
tpc_1: 0.035*Grilled pork+ 0.022*chilled Chinese noodles+ 0.018*hot+ 0.010*Addictive+ 0.009*Stubborn+ 0.008*delicious+ 0.008*Ma...
tpc_2: 0.050*miso+ 0.029*Miso+ 0.017*ginger+ 0.013*butter+ 0.012*Bean sprouts+ 0.011*lard+ 0.009*corn+...
tpc_3: 0.013*Flavor+ 0.010*garlic+ 0.010*Rich+ 0.009*roasted pork fillet+ 0.008*oil+ 0.008*Rich+ 0.008*...
tpc_4: 0.010*Soy sauce+ 0.009*use+ 0.009*kelp+ 0.008*Material+ 0.007*soup+ 0.007*seafood+ 0.007*roasted pork fillet...
tpc_5: 0.015*Come+ 0.014*Clams+ 0.012*Thin+ 0.010*ramen+ 0.010*popularity+ 0.010*It feels good+ 0.010*...
tpc_6: 0.047*Shrimp+ 0.046*shrimp+ 0.014*sesame+ 0.014*shrimp+ 0.012*Addictive+ 0.008*delicious+ 0.008*Sukiyaki...
tpc_7: 0.016*Unpalatable+ 0.015*Expectations+ 0.013*bad+ 0.012*Sorry+ 0.012*delicious+ 0.011*usually+ 0.011*ramen...
tpc_8: 0.070*Soba+ 0.015*Soboro+ 0.013*Attach+ 0.012*Mentaiko+ 0.012*chicken+ 0.010*Rich+ 0.010*delicious+...
tpc_9: 0.041*Citron+ 0.024*Japanese style+ 0.017*Stew+ 0.010*Trefoil+ 0.010*life+ 0.009*delicious+ 0.009*seafood+...
tpc_10: 0.040*Vegetables+ 0.027*garlic+ 0.018*Extra+ 0.013*Garlic+ 0.010*Bean sprouts+ 0.010*Less+ 0.009*Ca...
tpc_11: 0.026*Handmade+ 0.023*Offal+ 0.016*Ginger+ 0.010*spicy+ 0.010*ramen+ 0.009*delicious+ 0.008*Feeling...
tpc_12: 0.031*Buckwheat+ 0.030*Soba+ 0.029*Chinese+ 0.016*Plain hot water+ 0.011*Shamo chicken+ 0.008*delicious+ 0.007*ramen+...
tpc_13: 0.057*black+ 0.023*black+ 0.020*Black+ 0.018*Soy sauce+ 0.011*stamina+ 0.010*oyster+ 0.009*Appearance...
tpc_14: 0.060*Tanmen+ 0.048*shrimp+ 0.019*Vegetables+ 0.014*Chinese cabbage+ 0.011*Fish ball+ 0.009*Gyoza+ 0.007*delicious...
tpc_15: 0.073*Spicy+ 0.015*Spicy+ 0.012*miso+ 0.011*Peppers+ 0.011*Sansho+ 0.010*Spicy+ 0.010*spicy miso+ 0...
tpc_16: 0.031*Aoba+ 0.029*Mesh+ 0.019*double+ 0.012*seafood+ 0.010*trend+ 0.009*instant+ 0.009*Rame...
tpc_17: 0.041*Replacement ball+ 0.017*Replacement ball+ 0.014*Tonkotsu+ 0.014*Mustard+ 0.010*Extra fine+ 0.010*ramen+ 0.009*Red...
tpc_18: 0.032*Nostalgic+ 0.023*Easy+ 0.016*meaning+ 0.012*ramen+ 0.011*friendly+ 0.010*Feeling+ 0.010*Ah...
tpc_19: 0.027*Lemon+ 0.016*Normal+ 0.011*guts+ 0.009*Regrettable+ 0.009*steak+ 0.008*Rich+ 0.008*Delicious...
tpc_20: 0.088*Niboshi+ 0.009*Soba+ 0.008*fragrance+ 0.008*ramen+ 0.008*soup+ 0.007*roasted pork fillet+ 0.007*Soy sauce...
tpc_21: 0.023*sushi+ 0.015*Recommended+ 0.012*favorite+ 0.010*ramen+ 0.009*delicious+ 0.008*Growing up+ 0.008*...
tpc_22: 0.025*Fried+ 0.021*Fashionable+ 0.017*Fashionable+ 0.016*Cafe+ 0.014*Fashionable+ 0.014*atmosphere+ 0.011*...
tpc_23: 0.024*value+ 0.022*White miso+ 0.018*miso+ 0.014*red miso+ 0.010*ultimate+ 0.010*delicious+ 0.009*burnt+...
tpc_24: 0.095*Fried rice+ 0.040*set+ 0.017*mini+ 0.013*Gyoza+ 0.012*ramen+ 0.011*delicious+ 0.009*...
tpc_25: 0.024*Oden+ 0.015*Nostalgic+ 0.013*Grilled meat+ 0.011*flat+ 0.010*Dark mouth+ 0.010*ramen+ 0.009...
tpc_26: 0.010*Off+ 0.009*ramen+ 0.009*delicious+ 0.008*serious+ 0.008*Delicious+ 0.008*Noisy+ 0.008...
tpc_27: 0.073*Mochi+ 0.032*Kimchi+ 0.012*Spicy miso+ 0.010*Delicious+ 0.010*delicious+ 0.008*roasted pork fillet+ 0.00...
tpc_28: 0.021*Sudachi+ 0.019*Shichimi+ 0.018*Men+ 0.015*onion+ 0.011*Onion+ 0.010*Disappointing+ 0.010*Attach...
tpc_29: 0.079*Gyoza+ 0.026*beer+ 0.011*delicious+ 0.010*ramen+ 0.009*draft beer+ 0.009*Soy sauce+ 0.008...
tpc_30: 0.021*Tightening+ 0.018*Asexual+ 0.018*germ+ 0.015*Sake lees+ 0.010*Cooked in water+ 0.009*crab+ 0.009*Rich+ 0....
tpc_31: 0.051*Champon+ 0.024*student+ 0.015*Tantan+ 0.011*seafood+ 0.009*shock+ 0.009*Genuine+ 0.009*Delicious...
tpc_32: 0.025*odor+ 0.023*odor+ 0.016*smell+ 0.010*secret+ 0.010*Delicious+ 0.010*ramen+ 0.010*Soup...
tpc_33: 0.010*Soy sauce+ 0.009*roasted pork fillet+ 0.008*seafood+ 0.008*taste+ 0.007*soup+ 0.007*Menma+ 0.007*good...
tpc_34: 0.074*curry+ 0.040*Fried rice+ 0.015*Ganso+ 0.011*spices+ 0.010*set+ 0.008*delicious+ 0.008*La...
tpc_35: 0.068*Tomato+ 0.031*cheese+ 0.015*Italian+ 0.014*pasta+ 0.011*hormone+ 0.011*risotto+ 0.00...
tpc_36: 0.038*Colleague+ 0.014*strongest+ 0.010*hard+ 0.010*ramen+ 0.010*Dantotsu+ 0.009*delicious+ 0.009*topic...
tpc_37: 0.059*Tonkotsu+ 0.026*soy sauce+ 0.025*children+ 0.015*Delicious+ 0.012*Muddy+ 0.012*ramen+ 0.01...
tpc_38: 0.027*rice+ 0.025*rice ball+ 0.022*rice+ 0.016*Rice porridge+ 0.014*rice+ 0.012*pickles+ 0.011*set...
tpc_39: 0.026*Yuzu+ 0.019*pale+ 0.009*Aging+ 0.009*Grilled pork+ 0.008*Soy sauce+ 0.008*roasted pork fillet+ 0.007*soup...
tpc_40: 0.042*Udon+ 0.012*Skipjack+ 0.009*Yeah+ 0.009*tempura+ 0.009*ramen+ 0.008*delicious+ 0.008*Feeling...
tpc_41: 0.023*Salty+ 0.020*Who+ 0.012*junk+ 0.012*Attach+ 0.009*French+ 0.008*chef+ 0.008*Ra...
tpc_42: 0.029*friend+ 0.028*Delicious+ 0.015*queue+ 0.015*delicious+ 0.013*ramen+ 0.013*Easy+ 0.012*...
tpc_43: 0.012*Menma+ 0.011*roasted pork fillet+ 0.010*Soy sauce+ 0.009*Leek+ 0.009*good+ 0.008*seafood+ 0.008*soup...
tpc_44: 0.040*Attach+ 0.014*Rich+ 0.013*Slimy+ 0.013*Split+ 0.013*seafood+ 0.013*Fishmeal+ 0.011*Prime+ 0....
tpc_45: 0.019*Bad taste+ 0.017*Rock glue+ 0.017*Crowndaisy+ 0.012*No. 1 in Japan+ 0.010*delicious+ 0.009*ramen+ 0.008*line...
tpc_46: 0.074*Wonton+ 0.045*men+ 0.015*roasted pork fillet+ 0.009*delicious+ 0.008*Wonton+ 0.008*Soy sauce+ 0.007*...
tpc_47: 0.027*Usually+ 0.019*series+ 0.017*Soba noodles+ 0.012*Pickled+ 0.010*old+ 0.010*Delicious+ 0.010*Hard...
tpc_48: 0.018*half+ 0.014*salad+ 0.014*dessert+ 0.014*cuisine+ 0.013*Izakaya+ 0.012*tofu+ 0.010*set...
tpc_49: 0.068*Family line+ 0.019*spinach+ 0.013*Seaweed+ 0.010*Soy sauce+ 0.010*roasted pork fillet+ 0.010*Dark+ 0.010*La...
Now that the model is trained, we can compute a store's topic breakdown (its topic vector) by passing the store's reviews through it.
First, let's feed in a single tabelog pickup review of the Shinpuku Saikan main store in Kyoto, a shop I have loved since junior high school, and check the performance. Before that, to make the results easier to follow, let me show what kind of ramen Shinpuku Saikan serves.
"Shinpuku Saikan Kyoto"
It's black ... It's a black ramen.
However, unlike its appearance, it is a surprisingly light yet rich, delicious soy-sauce ramen.
Now, what kind of result will the classifier built this time return?
#Quote source: http://tabelog.com/kyoto/A2601/A260101/26000791/dtlrvwlst/763925/
#By the way, I personally prefer not to put raw eggs in it lol
>> str="When I was on a business trip to Kyoto, I stopped by the main store of Shinpuku Saikan, which I had longed for.(Omission)After all, the soup is adjusted, the noodles are boiled, the ingredients are served, the char siu is delicious, and the main store is even more delicious! I felt that. Anyway, this is cheap at 650 yen. "Chinese soba" at Shinfukusaikan is characterized by a black soup like the broth of char siu and a generous amount of Kujo green onions. Of course, "Yakimeshi" is irresistible.(Omission)On my way home, I suddenly saw a customer's order (I saw the review here later and found out that it was "extra large Shinfuku soba"), and I like raw eggs. Then, I was shocked, "Oh, could I have done that?" It seems to go well with the soup at Shinpuku Saikan, and I will definitely ask for it next time."
>> vec = dictionary.doc2bow(utils.stems(str))
#Classification result display
>> print(lda[vec])
[(0, 0.28870310712135505), (8, 0.25689765230576195), (13, 0.3333132412551591), (31, 0.081085999317724824)]
#Topic contents, in descending order of influence
>> lda.print_topic(13) #It captures the characteristics of the black ramen at Shinfukusaikan
0.057*black+ 0.023*black+ 0.020*Black+ 0.018*Soy sauce+ 0.011*stamina+ 0.010*oyster+ 0.009*Appearance+ 0.008*Dark+ 0.008*ramen+ 0.008*delicious
>> lda.print_topic(0)
0.019*Impressed+ 0.014*impact+ 0.013*Long-sought+ 0.012*Difficulty+ 0.012*delicious+ 0.011*ramen+ 0.010*Deep emotion+ 0.010*queue+ 0.010*Delicious+ 0.008*delicious
>> lda.print_topic(8)
0.070*Soba+ 0.015*Soboro+ 0.013*Attach+ 0.012*Mentaiko+ 0.012*chicken+ 0.010*Rich+ 0.010*delicious+ 0.008*roasted pork fillet+ 0.008*Easy+ 0.007*rice
>> lda.print_topic(31)
0.051*Champon+ 0.024*student+ 0.015*Tantan+ 0.011*seafood+ 0.009*shock+ 0.009*Genuine+ 0.009*delicious+ 0.008*ramen+ 0.008*Vegetables+ 0.008*Special
The black-ramen topic (#13) came out on top, as expected.
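Incidentally, lda[vec] returns (topic_id, probability) pairs ordered by topic id, not by weight; a small helper (my own, hypothetical name) ranks them by influence:

```python
def top_topics(topic_probs, n=3):
    """Sort (topic_id, probability) pairs by probability, descending."""
    return sorted(topic_probs, key=lambda tp: tp[1], reverse=True)[:n]

# The single-review distribution shown above, abbreviated to four entries
dist = [(0, 0.2887), (8, 0.2569), (13, 0.3333), (31, 0.0811)]
print(top_topics(dist))  # [(13, 0.3333), (0, 0.2887), (8, 0.2569)]
```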
Now that we know it works, let's concatenate several hundred sentences from the Web that mention the Shinpuku Saikan main store and feed them to the classifier.
>> str = "Sentences collected on the WEB"
>> sinpuku_vec = dictionary.doc2bow(utils.stems(str))
>> print(lda[sinpuku_vec])
[(0, 0.003061940579476011), (5, 0.001795672854987279), (7, 0.016165280743592875), (11, 0.0016683462844631061), (13, 0.387457274481951), (16, 0.048457912903426922), (18, 0.025816920842756448), (19, 0.0014647251485231138), (20, 0.0018013651819984121), (21, 0.001155430885775867), (24, 0.11249915373166983), (25, 0.0030405756373518885), (26, 0.0031413889216075561), (27, 0.0030955757983300515), (29, 0.0021349369911582098), (32, 0.006158571006380364), (34, 0.061260735988294568), (36, 0.0023903609848973475), (37, 0.020874795314517719), (41, 0.0018301667593946488), (42, 0.27803177713836785), (45, 0.0055461332216832828), (46, 0.0016396961473594117), (47, 0.0056507918659765869)]
>> lda.print_topic(13) #value: 0.38
0.057*black+ 0.023*black+ 0.020*Black+ 0.018*Soy sauce+ 0.011*stamina+ 0.010*oyster+ 0.009*Appearance+ 0.008*Dark+ 0.008*ramen+ 0.008*delicious
>> lda.print_topic(42) #value: 0.27
0.029*friend+ 0.028*Delicious+ 0.015*queue+ 0.015*delicious+ 0.013*ramen+ 0.013*Easy+ 0.012*Wow+ 0.011*Feeling+ 0.011*Famous+ 0.011*Many
>> lda.print_topic(24) #value: 0.11
0.095*Fried rice+ 0.040*set+ 0.017*mini+ 0.013*Gyoza+ 0.012*ramen+ 0.011*delicious+ 0.009*Delicious+ 0.009*Single item+ 0.008*order+ 0.008*roasted pork fillet
The output is a little hard to read, but fried rice appears as the third-strongest topic, reflecting the opinions of many reviewers.
The author of the earlier review had apparently not tried it, but Shinpuku Saikan is in fact also famous for its black fried rice. From this (?) we can see that dividing ramen shops into topics via collective intelligence works.
The classifier can now compute a topic vector for any ramen shop. Finally, let's use it to find similar stores.
In principle you can find stores similar to any store; continuing from the previous section, I will look for stores similar to the Shinpuku Saikan main store.
I'll omit the detailed code, but broadly it works as follows: compute the LDA topics of ramen shops nationwide, then compute each shop's similarity to the LDA topic vector of the Shinpuku Saikan main store.
#various settings
MIN_SIMILARITY = 0.6 #Similarity threshold
RELATE_STORE_NUM = 20 #Number of similar stores extracted
#LDA topic calculation of ramen shops nationwide
from my_algorithms import calc_vecs
(names, prefs, vecs) = calc_vecs(reviews, lda, dictionary)
#Calculate the similarity between the Shinpuku Saikan main store and the LDA topics of ramen shops nationwide
#Note: cosine_similarity expects dense 2D input, so the sparse (topic_id, prob)
#pairs from the LDA model must first be expanded into dense vectors of length num_topics
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity([sinpuku_vec], vecs)[0]
#Show similar stores
import pandas as pd
df = pd.DataFrame({'name': names,
                   'pref': prefs,
                   'similarity': similarities})
relate_store_list = df[df.similarity > MIN_SIMILARITY] \
                        .sort_values(by="similarity", ascending=False) \
                        .head(RELATE_STORE_NUM)
print(relate_store_list)
==============
id similarity pref name
0 0.934 toyama Makoto
1 0.898 hokkaido Isono Kazuo
2 0.891 shiga Kinkuemon Mitsui Outlet Park Shiga Ryuo
3 0.891 kyoto Shinfukusaikan Higashi Tsuchikawa store
4 0.888 osaka Kingemon Dotombori store
5 0.886 chiba Charcoal ramen
6 0.874 osaka Kingemon Esaka Esaka store
7 0.873 toyama Iroha Shosui Main Store
8 0.864 osaka Kingemon Umeda Umeda store
9 0.861 mie Hayashiya
10 0.847 niigata Ramen Tsurikichi
11 0.846 osaka Kingemon Main Store
12 0.838 toyama Menhachi Gotabiya
13 0.837 aichi Kikuya Hotel
14 0.820 hyogo Nakanoya
15 0.814 kyoto Kinkuemon Kyoto Saiin store
16 0.807 aichi yokoji
17 0.804 kumamoto favorite ramen shop
18 0.792 kyoto Shinfukusaikan Kumiyama store
19 0.791 niigata Goingoin
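One subtlety in the sketch above: cosine_similarity needs dense, equal-length vectors, while the LDA model yields sparse (topic_id, probability) pairs. The conversion can look like this (helper name is mine, not from the post):

```python
def to_dense(topic_probs, num_topics):
    """Expand sparse (topic_id, probability) pairs into a dense list,
    filling unmentioned topics with 0.0."""
    vec = [0.0] * num_topics
    for topic_id, prob in topic_probs:
        vec[topic_id] = prob
    return vec

# Abbreviated sparse output for the Shinpuku Saikan main store
sparse = [(13, 0.39), (24, 0.11), (42, 0.28)]
dense = to_dense(sparse, 50)
print(len(dense), dense[13])  # 50 0.39
```

With every shop converted this way, `cosine_similarity([dense], vecs)[0]` returns one similarity score per shop, which is what the DataFrame above expects.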
Let's pick up some and check the result.
"Makoto Ya Toyama"
"Isono Kazuo Hokkaido"
"Kanekuemon Mitsui Outlet Park Shiga Ryuo Store"
It's black ... it's superbly black. I checked the others as well, and every one of them was a black ramen, except Nakanoya in Hyogo.
Shinpuku Saikan has several branches, and even though store names were removed entirely before the analysis, its branches still surfaced in the results, which I take as a good sign.
Speaking of black ramen, Toyama Black is famous, and several such shops do appear; but Shinpuku Saikan's taste is, I think, rather different from Toyama Black. It is hard to deny that these shops were matched on the blackness of their ramen rather than on taste. I'd like to treat this subtlety as a point for future improvement.
This time, using ramen as the theme, I introduced how to extract various kinds of knowledge from text data lying around on the net, (almost) without human intervention.
Compared to methods based on human-curated supervised data the results may look unreliable, but for something obtained this easily with existing Python libraries, I think they are not bad at all.
Of course, this applies not just to ramen: anything with a sufficient amount of text can be handled the same way. Please try it on documents of your own.