[PYTHON] Similarity calculation between Precure episodes using live timelines and a topic model

This article is the 18th day entry of the Pretty Cure Advent Calendar 2015. Today is actually 12/19, but I believe everyone who loves Pretty Cure will forgive me with a heart wider than the sea!

This article and the experiment behind it are a work in progress (WIP).

Introduction

Suppose you want to calculate how similar one episode of an anime is to another, or to cluster similar episodes together. What kind of means is available? One option is to use the Twitter live timeline. Since a live timeline consists of viewers tweeting what is happening on screen and their impressions while watching, it can be regarded as a document that describes the broadcast in one-to-one correspondence with an episode. The similarity between these documents, that is, between live timelines, can then be used as an index of the similarity between episodes.

This time, I would like to calculate the similarity between the episodes of Go! Princess Pretty Cure following this approach. Specifically, let's calculate the similarity between the live timelines of each episode using LDA (Latent Dirichlet Allocation), one of the topic models.

Tools to use, etc.

Data

- Live timelines on Twitter for Go! Princess Precure from 2015/2/1 (Episode 1) to 2015/12/13 (Episode 44); 37 episodes in total, because timeline recording failed a few times
- Tweets posted with #precure between 8:20 and 9:30 are recorded
- Retweets are excluded (only original tweets are counted)
- The number of tweets per episode is 15,000 to 48,000 (excluding retweets)
- Morphological analysis with mecab-ipadic-neologd (a preprocessing sketch follows this list)
- Stop words such as "n", "this", etc. are excluded
- The two words "PreCure" and "Princess" are also excluded, since they can be considered universal words common to all episodes (I hesitated over this, but it produced better results)
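
As a rough sketch of this preprocessing, the code below tokenizes the tweets with MeCab and mecab-ipadic-neologd, drops stop words, and counts word frequencies. The file layout, the dictionary path, and the stop-word set shown here are assumptions for illustration, not the exact setup used.

```python
import glob
from collections import Counter

import MeCab

# The dictionary path for mecab-ipadic-neologd differs per environment.
tagger = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd")
tagger.parse("")  # workaround for a known parseToNode quirk in older bindings

# Illustrative stop-word set; "プリキュア" and "プリンセス" are the two excluded words.
STOP_WORDS = {"ん", "これ", "プリキュア", "プリンセス"}

def tokenize(tweet):
    """Return content-word tokens of one tweet, minus stop words."""
    tokens = []
    node = tagger.parseToNode(tweet)
    while node:
        pos = node.feature.split(",")[0]
        if pos in ("名詞", "動詞", "形容詞") and node.surface not in STOP_WORDS:
            tokens.append(node.surface)
        node = node.next
    return tokens

# Assumed layout: one text file per episode, one tweet per line.
docs = []               # token list per episode, reused later for LDA
word_counts = Counter()
for path in sorted(glob.glob("timelines/ep*.txt")):
    with open(path, encoding="utf-8") as f:
        tokens = [t for line in f for t in tokenize(line.strip())]
    docs.append(tokens)
    word_counts.update(tokens)

print(word_counts.most_common(20))
```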

The top 20 most frequent words in this data are as follows.

Kirara 31344
Good morning 25404
Person 21358
Good 19218
Mr. 19092
Dream 17395
Minami 17061
Haruharu 15956
Time 15824
Towa 14579
Kanata 14426
Yui 13961
Close 13844
Cute 13311
Shut 13049
Puff 11786
Become 11757
Makeover 10679
See 10342
Twinkle 9483

Kirara is strong. The two excluded words "PreCure" and "Princess" originally ranked above Kirara.

Experiment

About LDA

A proper explanation of LDA is beyond me at the moment, but Google has you covered.

To explain LDA just far enough to define the terms: LDA infers latent topics from a set of documents. In the case of news articles, for example, these topics intuitively correspond to categories such as "politics," "sports," and "art." LDA assumes that a document latently contains each topic in some proportion, and represents the document as a probability distribution over topics; that is, one document talks about politics with 80% probability, art with 15%, sports with 3%, and so on. Also, in LDA, each topic corresponds to a distribution of word occurrence probabilities. In other words, in the "politics" topic, "Obama" appears with 7% probability, "Trump" with 5%, "Abe" with 3%, and so on.
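
To make the two distributions concrete, here is a toy illustration of the same news-article example (the numbers are simply the ones quoted above):

```python
# A document is represented as a probability distribution over topics...
news_article = {"politics": 0.80, "art": 0.15, "sports": 0.03, "other": 0.02}

# ...and each topic is a probability distribution over words.
politics_topic = {"Obama": 0.07, "Trump": 0.05, "Abe": 0.03}
```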

Let me now restate this flow in terms of the Precure live timelines. The live timeline for one episode is one long document that describes that episode. Precure has episodes that spotlight a specific character, for example episodes that focus heavily on Haruka Haruno, the so-called "Haruka episodes." From here on, we will assume that these "Haruka episodes," "Kirara episodes," and so on are the latent topics present in the Precure live timelines. I wrote above that a topic is a distribution of word occurrence probabilities; in the case of Go! Princess Precure, the "Kirara" topic should assign high probability to words such as "twinkle," "doughnut," and "fashion."

First, we analyze all 37 Go! Princess Precure live timelines and work out what topics exist in them. Then, using the discovered topics, we calculate to what extent each episode is a "Kirara episode," a "Haruka episode," ..., or an "episode belonging to no one in particular."

That was long-winded, but the point is this: for calculating the similarity between live timelines, LDA is used as a dimensionality-reduction technique that compresses each timeline into a small, intuitive set of indices, namely "Haruka episode," "Minami episode," "Kirara episode," "Towa episode," and "episode belonging to no one."

The LDA implementation uses a Python library called gensim.

Extract topics

As mentioned earlier, given a set of documents, LDA infers the topics that exist in it. The number of topics has to be set by hand. For now, let's assume one topic for each of the four Go! Princess Pretty Cures plus one topic belonging to none of them, and set the number of topics to 5 (Haruka episodes, Minami episodes, Kirara episodes, Towa episodes, and episodes belonging to no one).
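
With gensim, this step might look roughly like the sketch below, assuming `docs` holds the per-episode token lists from the preprocessing sketch; the number of passes is an arbitrary choice (raised because the corpus is tiny, as discussed in the conclusion):

```python
from gensim import corpora, models

# Vocabulary and bag-of-words corpus built from the 37 per-episode token lists.
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(tokens) for tokens in docs]

# LDA with 5 topics: Haruka, Minami, Kirara, Towa, and "no one in particular".
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=5, passes=50)

for topic in lda.print_topics(num_topics=5, num_words=10):
    print(topic)
```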

The inferred topics are as follows. Each number*word pair gives the occurrence probability of that word within the topic.

topic #0 (0.200): 0.009*Have a nice day+ 0.007*Towa+ 0.007*Man+ 0.006*Mr+ 0.006*Yui+ 0.005*Good+ 0.005*Key+ 0.005*Transform+ 0.005*dream+ 0.005*Scarlet
topic #1 (0.200): 0.009*Yui+ 0.008*Close+ 0.007*Kanata+ 0.007*Haruharu+ 0.007*dream+ 0.006*Times+ 0.006*Good+ 0.005*South+ 0.005*Man+ 0.005*despair
topic #2 (0.200): 0.020*Kirara+ 0.010*Have a nice day+ 0.006*Man+ 0.006*Good+ 0.006*Haruharu+ 0.005*dream+ 0.005*twinkle+ 0.005*Close+ 0.005*Times+ 0.005*cute
topic #3 (0.200): 0.014*South+ 0.007*Haruharu+ 0.007*Have a nice day+ 0.006*Good+ 0.006*Kirara+ 0.005*Close+ 0.005*Man+ 0.005*Kanata+ 0.005*かわGood+ 0.004*Mr
topic #4 (0.200): 0.010*Mr+ 0.009*Kanata+ 0.008*Towa+ 0.008*shut+ 0.007*Man+ 0.006*Times+ 0.006*Have a nice day+ 0.005*Good+ 0.005*dream+ 0.005*South

The topics came out more plausibly than I expected. It is a little surprising that Yui-chan shows up so strongly, but the more I watched the live timelines, the more that felt right. "Haruka," "Minami," "Kirara," and "Towa" did not each come out as the largest component of a topic, but the result is nothing to be ashamed of.

Degree of similarity

Treat each document's topic distribution as its feature vector, that is, treat each document as a five-dimensional vector over the five topics above, and calculate the cosine similarity between documents.
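
A minimal sketch of this step, continuing from the `lda` model and `corpus` above:

```python
import numpy as np

def topic_vector(bow, num_topics=5):
    """Dense 5-dimensional topic distribution for one episode."""
    vec = np.zeros(num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

vectors = np.array([topic_vector(bow) for bow in corpus])

# Cosine similarity between every pair of episodes.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = unit @ unit.T
```

This gives the following similarities.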

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 17 19 21 22 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
1 0.315802 0.0 0.0 0.0 0.148118 0.0 0.0 0.0 0.997047 0.108117 0.0 0.0 0.0 0.0 0.0 0.0 0.998236 0.0 1.0 1.0 1.0 0.345272 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.116513 0.116513 0.116513 0.854927 0.151135 0.116513 0.472356 0.323976 0.149502 0.116513 0.141464 0.116513 0.116513 0.819519 0.814538 0.363768 0.814538 0.315962 0.315962 0.315962 0.109093 0.245021 0.0 0.814538 0.820534 0.814478 0.544964 0.472356 0.0 0.00903815 0.814489 0.0 0.143125 0.154445 0.816244
3 1.0 1.0 0.024152 0.0881911 1.0 0.0 0.0767926 0.989938 1.0 0.999529 1.0 1.0 0.0523216 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.671438 0.0 0.0 0.214036 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.99927 0.998638 0.0154795
4 1.0 0.024152 0.0881911 1.0 0.0 0.0767926 0.989938 1.0 0.999529 1.0 1.0 0.0523216 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.671438 0.0 0.0 0.214036 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.99927 0.998638 0.0154795
5 0.024152 0.0881911 1.0 0.0 0.0767926 0.989938 1.0 0.999529 1.0 1.0 0.0523216 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.671438 0.0 0.0 0.214036 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.99927 0.998638 0.0154795
6 0.00218089 0.0247291 0.0 0.147387 0.0402567 0.0247291 0.055079 0.0247291 0.0247291 0.988927 0.988987 0.204383 0.988987 0.145919 0.145919 0.145919 0.0503817 0.219115 0.0 0.988987 0.971281 0.988914 0.0905681 0.0 0.0 0.0109738 0.988928 0.0 0.0571255 0.0709432 0.989252
7 0.0877574 0.299502 0.00673912 0.1736 0.0877574 0.0877161 0.0877574 0.0877574 0.00459161 0.0 0.0 0.0 0.0 0.0 0.0 0.891625 0.735559 0.950051 0.0 0.0307165 0.0115478 0.298243 0.299502 0.950051 0.949993 0.0104241 0.950051 0.106319 0.109628 0.00135845
8 0.0 0.0767926 0.989938 1.0 0.999529 1.0 1.0 0.0523216 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.671438 0.0 0.0 0.214036 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.99927 0.998638 0.0154795
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.995798 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 0.182712 0.0756657 0.07563 0.0756657 0.0756657 0.00395895 0.0 0.995374 0.0 0.997133 0.997133 0.997133 0.344283 0.0508048 0.0 0.0 0.0161952 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0756105 0.0755626 0.00117127
11 0.989787 0.98932 0.989787 0.989787 0.0517872 0.0 0.112216 0.0 0.112415 0.112415 0.112415 0.121087 0.727015 0.087664 0.0 0.212951 0.00106555 0.0 0.0 0.087664 0.0876586 0.00096186 0.087664 0.990783 0.990468 0.0153215
12 0.999529 1.0 1.0 0.0523216 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.671438 0.0 0.0 0.214036 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.99927 0.998638 0.0154795
13 0.999397 0.999397 0.0869591 0.0347166 0.00206131 0.0347166 0.0 0.0 0.0 0.0 0.678142 0.0 0.0347166 0.247816 0.034714 0.00317923 0.0 0.0 0.000385217 0.0347145 0.0 0.999806 0.999659 0.0501827
14 1.0 0.0523216 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.671438 0.0 0.0 0.214036 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.99927 0.998638 0.0154795
17 0.0523216 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.671438 0.0 0.0 0.214036 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.99927 0.998638 0.0154795
19 0.998682 0.059297 0.998682 0.0 0.0 0.0 0.0 0.238963 0.0 0.998682 0.986444 0.998608 0.0914558 0.0 0.0 0.0110814 0.998621 0.0 0.0840274 0.0979639 0.999357
21 0.0593753 1.0 0.0 0.0 0.0 0.0 0.204766 0.0 1.0 0.976745 0.999926 0.0915765 0.0 0.0 0.011096 0.99994 0.0 0.0327753 0.0467628 0.99988
22 0.0611948 0.998126 0.998126 0.998126 0.344625 0.0125306 0.0 0.0611948 0.0597717 0.0611903 0.00560401 0.0 0.0 0.000679021 0.0611911 0.0 0.00200568 0.00286164 0.0611875
26 0.0 0.0 0.0 0.0 0.204766 0.0 1.0 0.976745 0.999926 0.0915765 0.0 0.0 0.011096 0.99994 0.0 0.0327753 0.0467628 0.99988
27 1.0 1.0 0.345272 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
28 1.0 0.345272 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
29 0.345272 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
30 0.668285 0.938327 0.0 0.0117859 0.0114053 0.0 0.0 0.938327 0.938269 0.0102954 0.938327 0.0183955 0.0217185 0.0
31 0.711055 0.208365 0.356187 0.216992 0.0190813 0.0 0.711055 0.713323 0.216154 0.711055 0.691833 0.696841 0.218735
32 0.0 0.0125606 0.012155 0.0 0.0 1.0 0.999938 0.0109721 1.0 0.0196045 0.023146 0.0
33 0.976745 0.999926 0.0915765 0.0 0.0 0.011096 0.99994 0.0 0.0327753 0.0467628 0.99988
34 0.976811 0.0894461 0.0 0.0120418 0.0228789 0.97681 0.0120418 0.246198 0.259767 0.979934
35 0.0915719 0.0 0.0101176 0.0212125 1.0 0.0101176 0.032972 0.0469946 0.999829
36 0.995697 0.0 0.0010282 0.0926584 0.0 0.00303709 0.00433322 0.0926529
37 0.0 0.0 0.0 0.0 0.0 0.0 0.0
38 0.999938 0.0109721 1.0 0.0196045 0.023146 0.0
39 0.0226113 0.999932 0.0199847 0.0236888 0.0116392
40 0.0120332 0.0330089 0.0470379 0.999808
41 0.0196045 0.023146 0.0
42 0.999905 0.0490662
43 0.0620762

The result is questionable. Isn't it too peaky, with values concentrating at 0.0 or 0.99?

Try to draw a network

Let's draw a network between the documents using the similarities above. Here, we create an edge between documents whose similarity is 0.99 or more and build a network from those edges. In addition, each node is labeled at my own discretion as a "Haruka episode," "Minami episode," "Kirara episode," "Towa episode," or "other episode," and colored pink, blue, yellow, red, and white respectively. If episodes centered on the same character end up close to each other in the network, I will be happy.

The Python library igraph is used to build and draw the network.
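
A minimal sketch of that construction with python-igraph, using the 0.99 threshold and the `similarity` matrix from above (the per-node labels and colors were assigned by hand, so here they are just a placeholder):

```python
import igraph

n = similarity.shape[0]
g = igraph.Graph(n)  # one vertex per episode

# Edge between every pair of episodes with cosine similarity >= 0.99.
edges = [(i, j) for i in range(n) for j in range(i + 1, n)
         if similarity[i, j] >= 0.99]
g.add_edges(edges)

# Hand-assigned colors (pink/blue/yellow/red/white) would go here.
g.vs["color"] = ["white"] * n  # placeholder

# Kamada-Kawai layout, drawn to graph_kk.png (plotting requires pycairo).
igraph.plot(g, "graph_kk.png", layout=g.layout("kk"))
```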

(Figure: graph_kk.png)

I am glad that the yellow and white nodes cluster tightly together. The rest, however, are not so clear-cut. "Kirara" appears very frequently, so the Kirara topic may simply come out more strongly than the others.

Conclusion

**Underwhelming results. The End.** The following are possible reasons why the results are not good.

- The number of documents is small. LDA is usually applied to document collections numbering in the thousands to tens of thousands, so 37 documents is overwhelmingly few. We increased the number of training passes this time to compensate, but another option would be to inflate the corpus by duplicating documents with small perturbations.
- The similarity calculation may also be off. Values of 0.0 or 0.99 feel too peaky.

The experiment itself did not produce good results, but I personally still find the approach of "inferring things about anime episodes from the anime live timeline" interesting (it is not new, though; similar research already exists for sports). While reviewing these results, I will keep working on this area.
