[Python] Natural language processing of Yu-Gi-Oh! card names - Yugioh DS 2. NLP edition

Introduction

This is the "Yu-Gi-Oh! DS (Data Science)" series that analyzes various Yu-Gi-Oh! Card data using Python. The article will be published four times in total, and finally we will implement a program that predicts offensive and defensive attributes from card names by natural language processing + machine learning. In addition, the author's knowledge of Yu-Gi-Oh has stopped at around E ・ HERO. I'm sorry that both cards and data science are amateurs, but please keep in touch.

| No. | Article title | Keywords |
|-----|---------------|----------|
| 0 | Get card information from the Yu-Gi-Oh! database - Yugioh DS 0. Scraping edition | beautifulsoup |
| 1 | Visualize Yu-Gi-Oh! card data in Python - Yugioh DS 1. EDA edition | pandas, seaborn |
| 2 | Process Yu-Gi-Oh! card names with natural language processing - Yugioh DS 2. NLP edition (this article!) | wordcloud, word2vec, doc2vec, t-SNE |
| 3 | Predict attack, defense, and attribute from Yu-Gi-Oh! card names - Yugioh DS 3. Machine learning edition | lightgbm, etc. |

Purpose of this article

This article digs deeper into the "card name", which was not a focus of 1. EDA edition. All sorts of monsters such as dragons, wizards, and HEROes appear in Yu-Gi-Oh!, and we will explore which words are often used in their names. Furthermore, we will look at what similarities appear when the names are grouped by attribute, race, and level. The technical themes of this article are morphological analysis with MeCab, frequent-word visualization with WordCloud, distributed representations of words with Word2Vec and Doc2Vec, and dimensionality reduction and word mapping with t-SNE, each explained step by step with implementation code.

Explanation of prerequisites (Usage environment, data, analysis policy)

usage environment

Python==3.7.4

data

The data used in this article was scraped from the Yu-Gi-Oh! OCG Card Database with handmade code. It is current as of June 2020. Different data frames are used depending on the graph to be displayed, but all of them share the following columns.

| No. | Column name | Meaning | Sample | Supplement |
|-----|-------------|---------|--------|------------|
| 1 | name | Card name | Ojama Yellow | |
| 2 | kana | Reading of the card name | Ojama Yellow | |
| 3 | rarity | Rarity | Normal | For convenience of acquisition, information such as "restricted" and "forbidden" is also included |
| 4 | attr | Attribute | Light attribute | For non-monsters, "magic" or "trap" is entered |
| 5 | effect | Effect | NaN | Holds the spell/trap card types such as "continuous" and "equip"; NaN for monsters |
| 6 | level | Level | 2 | "Rank 2" is entered for rank (Xyz) monsters |
| 7 | species | Race | Beast type | |
| 8 | attack | Attack power | 0 | |
| 9 | defence | Defense power | 1000 | |
| 10 | text | Card text | A member of the Ojama Trio who is said to get in the way by any means. Something is said to happen when all three are together... | |
| 11 | pack | Recording pack name | EXPERT EDITION Volume 2 | |
| 12 | kind | Type | - | For monster cards, information such as fusion or ritual is entered |

image.png

Analysis policy

All analysis is assumed to be run in an interactive environment such as Jupyter Lab.

Implementation

1. Package import

Import the required packages. MeCab, gensim, and wordcloud are not included in Anaconda by default, so pip install them if necessary (a sample installation cell is sketched below).
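For reference, a minimal installation cell for a Jupyter environment might look like the following (a sketch: mecab-python3 is the usual PyPI binding for MeCab, and the mecab-ipadic-neologd dictionary is installed separately from its GitHub repository):

python


# Hypothetical installation cell; adjust to your own environment
!pip install mecab-python3 gensim wordcloud
# The mecab-ipadic-neologd dictionary is built and installed separately:
# https://github.com/neologd/mecab-ipadic-neologd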

python


import matplotlib.pyplot as plt
import MeCab
import numpy as np
import pandas as pd
import re
import seaborn as sns
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from gensim.models import word2vec
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from PIL import Image
from wordcloud import WordCloud
%matplotlib inline
sns.set(font="IPAexGothic") #Supports Python in Japanese

2. Data import

The method for acquiring each dataset is described in the 0. Scraping edition (article not yet published as of June 2020).

python


#Not used this time
# all_data = pd.read_csv("./input/all_data.csv") #Data set for all cards (cards with the same name have duplicate recording packs)
# print("all_data: {}rows".format(all_data.shape[0]))

cardlist = pd.read_csv("./input/cardlist.csv") #All card dataset (no duplication)
print("cardlist: {}rows".format(cardlist.shape[0]))

#Not used this time
# monsters = pd.read_csv("./input/monsters.csv") #Monster card only
# print("monsters: {}rows".format(monsters.shape[0]))

monsters_norank = pd.read_csv("./input/monsters_norank.csv") #Remove rank monsters from monster cards
print("monsters_norank: {}rows".format(monsters_norank.shape[0]))
cardlist: 10410rows
monsters_norank: 6206rows

3. MeCab verification

The procedure for using MeCab is roughly the following two steps.

  1. Instantiate the morphological analyzer as a MeCab.Tagger.
  2. Execute the parseToNode() method, which performs the morphological analysis, and store the result in a node object.

After the above, the node object exposes two attributes.

- **surface**: the word itself, in the form it appears as a string in the sentence
- **feature**: a list of information about the word

python


# 1. Instantiate the morphological analyzer and store the processing result in an object with the parseToNode method
text = "青眼の白龍" # Blue-Eyes White Dragon
mecabTagger = MeCab.Tagger("-Ochasen -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/") # Dictionary: use mecab-ipadic-neologd
node = mecabTagger.parseToNode(text)

# 2. Create a data frame to store surface forms (surface) and features (feature)
surface_and_feature = pd.DataFrame()
surface = []
feature = []

# 3. Extract the surface form and features from the node object's attributes
while node:
    surface.append(node.surface)
    feature.append(node.feature)
    node = node.next
    
surface_and_feature['surface'] = surface
surface_and_feature['feature'] = feature

surface_and_feature

image.png

Since feature contains a list, we convert it into a data frame as well. When using the mecab-ipadic-neologd dictionary, feature stores the following as a comma-separated list: **part of speech (pos), part-of-speech subcategory 1 (pos1), part-of-speech subcategory 2 (pos2), part-of-speech subcategory 3 (pos3), conjugation type (ctype), conjugated form (cform), base form (base), reading (read), and pronunciation (pronounce)**. The BOS/EOS rows at the beginning and end of the data frame directly represent the start and end nodes.

python


text = "Blue-Eyes White Dragon"
mecabTagger = MeCab.Tagger("-Ochasen -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/")
node = mecabTagger.parseToNode(text)

#Feature(feature)Contents of the list(Part of speech,Part of speech細分類1,Part of speech細分類2,Part of speech細分類3,Inflected form,Utilization type,Prototype,reading,pronunciation)In a data frame
features = pd.DataFrame(columns=["pos","pos1","pos2","pos3","ctype","cform","base","read","pronounce"])
posses = pd.DataFrame
while node:
    tmp = pd.Series(node.feature.split(','), index=features.columns)
    features = features.append(tmp, ignore_index=True)
    node = node.next
    
features

image.png

4. Morphological analysis

Next, we run the loaded data through the MeCab morphological analyzer.

4-1. Implementation of a function that performs morphological analysis

Create a function get_word_list that decomposes a list of card names into words. Including particles such as "to" (と) and "mo" (も) would just add noise, so we keep only **nouns, verbs, and adjectives**.

python


def get_word_list(text_list):
    m = MeCab.Tagger("-Ochasen -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/")
    lines = []
    for text in text_list:
        keitaiso = []
        m.parse('') # commonly done to work around a surface-string bug in the bindings
        node = m.parseToNode(text)
        while node:
            # Store each morpheme in a dictionary
            tmp = {}
            tmp['surface'] = node.surface
            tmp['base'] = node.feature.split(',')[-3] # base form
            tmp['pos'] = node.feature.split(',')[0] # part of speech
            tmp['pos1'] = node.feature.split(',')[1] # part-of-speech subcategory 1
            
            # Skip BOS/EOS nodes, which represent the start and end of a sentence
            if 'BOS/EOS' not in tmp['pos']:
                keitaiso.append(tmp)
                
            node = node.next
        lines.append(keitaiso)
    
    # Store the surface form for nouns and the base form for verbs/adjectives in the list
    word_list = [] 
    for line in lines:
        for keitaiso in line:
            if keitaiso['pos'] == '名詞': # noun
                word_list.append(keitaiso['surface'])
            elif (keitaiso['pos'] == '動詞') | (keitaiso['pos'] == '形容詞'): # verb or adjective
                if not keitaiso['base'] == '*':
                    word_list.append(keitaiso['base'])
                else: 
                    word_list.append(keitaiso['surface'])
# Uncomment to also keep words other than nouns, verbs, and adjectives
#             else:
#                 word_list.append(keitaiso['surface'])

    return word_list

4-2. Creating a data frame

Create two data frames for use in subsequent visualization and modeling processes.

- **cardlist_word_count**: Created from cardlist, the de-duplicated dataset of all cards. Its columns are the word word used across all cards and its number of appearances word_count.
- **monsters_words**: Created from monsters_norank, the dataset of all monsters excluding rank monsters. Its columns are the word word and the features name, level, attr, rarity, species, and kind of the card in which the word appears. Note that each row is a word, not a card.

Incidentally, many Yu-Gi-Oh! card names are divided by the symbol "・" (the Japanese middle dot), but MeCab does not split on this symbol. Therefore, before running the function above, we first split the names on "・".

cardlist_word_count

python


#"・" Creates a pre-separated list namelist
namelist = []
for name in cardlist.name.to_list():
    for name_ in name.split("・"):
        namelist.append(name_)
    
#Function get_word_String list word by list_generate list
word_list = get_word_list(namelist)

# word_Data frame words that map words and their frequency of occurrence from list_Generation of df
word_freq = pd.Series(word_list).value_counts()
cardlist_word_count = pd.DataFrame({'word' : word_freq.index,
             'word_count' : word_freq.tolist()})

cardlist_word_count

image.png

monsters_words

python


monsters_words= pd.DataFrame(columns=["word","name","level","attr","rarity","species","kind"])
for i, name in enumerate(monsters_norank.name.to_list()):
    words = get_word_list(name.split("・"))
    names = [monsters_norank.loc[i, "name"] for j in words]
    levels = [monsters_norank.loc[i, "level"] for j in words]
    attrs = [monsters_norank.loc[i, "attr"] for j in words]
    rarities = [monsters_norank.loc[i, "rarity"] for j in words]
    species = [monsters_norank.loc[i, "species"] for j in words]
    kinds = [monsters_norank.loc[i, "kind"] for j in words]
    tmp = pd.DataFrame({"word" : words, "name" : names, "level" : levels, "attr" : attrs, "rarity" : rarities, "species" : species, "kind" : kinds})
    monsters_words = pd.concat([monsters_words, tmp])
    
monsters_words

image.png

5. Visualization

5-1. Word ranking used

From cardlist_word_count, take the 50 most frequently used words across all cards and rank them. "Dragon" (ドラゴン) is the overwhelming number one with 326 appearances. Together with the similar words "竜" (3rd place) and "龍" (98th place), dragon words appear 610 times in total.

nlp5-1.png

python


df4visual = cardlist_word_count.head(50)

f, ax = plt.subplots(figsize=(20, 10))
ax = sns.barplot(data=df4visual, x="word", y="word_count")
ax.set_ylabel("frequency")
ax.set_title("Word ranking used in all cards")

for i, patch in enumerate(ax.patches):
    ax.text(i, patch.get_height()/2, int(patch.get_height()), ha='center')

plt.xticks(rotation=90)
plt.savefig('./output/nlp5-1.png', bbox_inches='tight', pad_inches=0)
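As a quick sanity check on those figures, the counts for the three dragon words can be pulled straight out of cardlist_word_count (a minimal sketch, assuming the words are stored exactly as "ドラゴン", "竜", and "龍"):

python


# Sketch: look up the three dragon variants and total their counts
dragon_words = ["ドラゴン", "竜", "龍"]
dragon_counts = cardlist_word_count[cardlist_word_count["word"].isin(dragon_words)]
print(dragon_counts)
print("total:", dragon_counts["word_count"].sum())  # the article reports 610 in total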

That ranking made me curious about how "ドラゴン" and "竜" are used differently, so let us take a short detour to explore this. With level on the x-axis, we draw kernel density estimates for the words "ドラゴン" and "竜". Each curve is drawn so that its total area is 1, and regions where the curve is high can be interpreted as levels where many monsters are concentrated. The peak for "竜" lies further to the right of the graph than that for "ドラゴン", so we can see that "竜" tends to be used for relatively high-level, strong cards.

nlp5-1a.png

python


monsters_words_dragon = monsters_words.query("word == 'ドラゴン' | word == '竜'")
df4visual = monsters_words_dragon

f, ax = plt.subplots(figsize = (20, 5))
ax = sns.kdeplot(df4visual.query("word == 'ドラゴン'").level, label="ドラゴン")
ax = sns.kdeplot(df4visual.query("word == '竜'").level, label="竜")
ax.set_xlim([0, 12]);
ax.set_title("ドラゴン/竜 kernel density estimates")
ax.set_xlabel("level")
plt.savefig('./output/nlp5-1a.png', bbox_inches='tight', pad_inches=0)

I omit the source code and interpretation, but the count plot results by level and by attribute are also shown; a rough sketch of the plotting code follows the images.

nlp5-1b.png nlp5-1c.png
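For completeness, the omitted plots were presumably produced along these lines (a sketch, reusing monsters_words_dragon from above; the exact styling is an assumption):

python


# Sketch of the omitted count plots: dragon-word occurrences by level and by attribute
f, ax = plt.subplots(figsize=(20, 5))
sns.countplot(data=monsters_words_dragon, x="level", hue="word", ax=ax)
ax.set_title("Dragon word counts by level")

f, ax = plt.subplots(figsize=(20, 5))
sns.countplot(data=monsters_words_dragon, x="attr", hue="word", ax=ax)
ax.set_title("Dragon word counts by attribute")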

5-2. WordCloud

WordCloud is a library for word visualization. It extracts frequently appearing words and draws more frequent words in a larger size. wordcloud.generate_from_frequencies() takes a dictionary of words and their frequencies and generates a WordCloud object. Looking at the figure, "dragon" (ドラゴン) is plotted largest, consistent with 5-1.

nlp5-2a.png

python


def make_wordcloud(df, col_name_noun, col_name_quant):
    # Build a dictionary of words and their frequencies from the data frame
    word_freq_dict = {}
    for i, v in df.iterrows():
        word_freq_dict[v[col_name_noun]] = v[col_name_quant]
    fpath = "/System/Library/Fonts/ヒラギノ角ゴシック W3.ttc" # Hiragino Kaku Gothic (macOS); use any Japanese font available in your environment
    
    # Instantiate WordCloud
    wordcloud = WordCloud(background_color='white',
                          font_path=fpath,
                          min_font_size=10,
                          max_font_size=200,
                          width=2000,
                          height=500
                          )
    wordcloud.generate_from_frequencies(word_freq_dict)
    return wordcloud

f, ax = plt.subplots(figsize=(20, 5))
ax.imshow(make_wordcloud(cardlist_word_count, 'word', 'word_count'))
ax.axis("off")
ax.set_title("All cards WordCloud")
plt.savefig('./output/nlp5-2a.png', bbox_inches='tight', pad_inches=0)

The extraction results by level and by attribute are also shown. The images run a little long vertically, so if you are not interested, feel free to scroll past them.

** By level ** "ドラゴン" appears evenly across levels 1-12, but at level 9 "竜" is more common, and at level 11 no dragon words seem to appear at all. nlp5-2b.png

** By attribute ** Warrior-related words such as "warrior" and "saber" stand out in the earth attribute. It goes without saying that the dark attribute is full of words like "demon" and "dark". nlp5-2c.png

python


def make_wordclouds(df, colname):
    wordclouds = []
    df = df.sort_values(colname)
    for i in df[colname].unique():
        # word_freq = df.query("{} == {}".format(colname,i))["word"].value_counts() #Convert to pandas Series and value_counts()
        word_freq = df[df[colname] == i]["word"].value_counts()
        monsters_word_count = pd.DataFrame({'word' : word_freq.index, 'word_count' : word_freq.tolist()})
        wordclouds.append(make_wordcloud(monsters_word_count, 'word', 'word_count'))

    f, ax = plt.subplots(len(wordclouds), 1, figsize=(20, 5*int(len(wordclouds))))
    for i, wordcloud in enumerate(wordclouds):
        ax[i].imshow(wordcloud)
        ax[i].set_title("{}:".format(colname) + str(df[colname].unique()[i]))
        ax[i].axis("off");
        
make_wordclouds(monsters_words, "level")
plt.savefig('./output/nlp5-2b.png', bbox_inches='tight', pad_inches=0)

make_wordclouds(monsters_words, "attr")
plt.savefig('./output/nlp5-2c.png', bbox_inches='tight', pad_inches=0)

6. Modeling (distributed representations of words and sentences)

To proceed to word similarity and the subsequent machine-learning step, we vectorize the words so that machines can more easily handle their meaning. Converting a word into a vector of tens to hundreds of dimensions is called a **distributed representation**. Here we use Word2Vec for the distributed representations of words: given lists of words, it can easily convert each word into a vector of any chosen number of dimensions. For vectorizing whole sentences, we use Doc2Vec.
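To make the idea concrete, here is a tiny self-contained sketch (the toy corpus is invented for illustration) showing that each word simply becomes a dense vector of the chosen dimensionality:

python


# Toy example of a distributed representation with gensim's Word2Vec
from gensim.models import word2vec

toy_corpus = [["blue", "eyes", "white", "dragon"],
              ["red", "eyes", "black", "dragon"],
              ["dark", "magician", "girl"]]
toy_model = word2vec.Word2Vec(toy_corpus, size=10, min_count=1, window=2, iter=100)
print(toy_model.wv["dragon"].shape)  # (10,) -- a 10-dimensional vector per word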

Please refer to the following links for details on how Word2Vec and Doc2Vec work and how to use them.

- Understanding Word2Vec
- Summary of Doc2Vec

6-1. Word2Vec

As a preliminary step, we reshape the data frame monsters_words created in the previous chapter into monsters_wordlist. Rows go back to being one per monster, with a new column wordlist holding the list of words contained in each card name and a column length holding the number of words.

python


wordlist = monsters_words.groupby("name")["word"].apply(list).reset_index()
wordlist.columns = ["name", "wordlist"]
wordlist["length"] = wordlist["wordlist"].apply(len)

monsters_wordlist = pd.merge(wordlist, monsters_norank, how="left")
monsters_wordlist

image.png

Here is the code that actually performs the modeling. size is the number of dimensions, iter is the number of training iterations, and window is a parameter specifying how many surrounding words are used as context during training.

python


%time model_w2v = word2vec.Word2Vec(monsters_wordlist["wordlist"], size=30, iter=3000, window=3)
model_w2v

After training, let's do a quick verification. The wv.most_similar() method shows the top n words judged to be semantically similar to a given word. When I input "red", "**black**", which also represents a color, came first. Looks good! If the results do not seem right, vary the parameters above and repeat the verification; a small grid-search sketch follows the output below.

python


model_w2v.wv.most_similar(positive="Red", topn=20)
[('black', 0.58682781457901),
 ('Devil', 0.5581836700439453),
 ('Artif', 0.5535239577293396),
 ('phantom', 0.4850098788738251),
 ('To be', 0.460792601108551),
 ('of', 0.4455495774745941),
 ('Ancient', 0.43780404329299927),
 ('Water', 0.4303821623325348),
 ('Dragon', 0.4163920283317566),
 ('Holy', 0.4114375710487366),
 ('Genesis', 0.3962644040584564),
 ('Sin', 0.36455491185188293),
 ('white', 0.3636135756969452),
 ('Giant', 0.3622574210166931),
 ('Road', 0.3602677285671234),
 ('Guardian', 0.35134968161582947),
 ('power', 0.3466736972332001),
 ('Elf', 0.3355366587638855),
 ('gear', 0.3334060609340668),
 ('driver', 0.33207967877388)]
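As suggested above, one lightweight way to tune these parameters is a small grid search, re-training the model and eyeballing the neighbors each time (a sketch; the grids below are arbitrary examples, not the article's settings):

python


# Hypothetical mini grid search over Word2Vec hyperparameters
for size in [10, 30, 50]:
    for window in [2, 3, 5]:
        m = word2vec.Word2Vec(monsters_wordlist["wordlist"], size=size, iter=1000, window=window)
        print(size, window, m.wv.most_similar(positive="Red", topn=3))  # same query word as in the check above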

Next, let's visualize this result. Since Word2Vec here converts words into 30-dimensional vectors, we need to reduce the dimensionality (**dimensionality reduction**) in order to plot them. t-SNE is an unsupervised learning model for dimensionality reduction that compresses data into a chosen number of dimensions while preserving as much of the original information (variance) as possible. To plot on a scatter plot with x and y axes, we implement the process of dropping the 30 dimensions down to 2.

python


# Extract the 200 most frequent words
n=200
topwords = monsters_words["word"].value_counts().head(n)
w2v_vecs = np.zeros((topwords.shape[0],30))
for i, word in enumerate(topwords.index):
    w2v_vecs[i] = model_w2v.wv[word]

    
# Dimensionality reduction with t-SNE: drop from 30 dimensions to 2
tsne= TSNE(n_components=2, verbose=1, n_iter=500)
tsne_w2v_vecs = tsne.fit_transform(w2v_vecs)

w2v_x = tsne_w2v_vecs[:, 0]
w2v_y = tsne_w2v_vecs[:, 1]

Since each word now has a two-dimensional vector, we draw a scatter plot with the two components on the x and y axes. If the original information survived the dimensionality reduction, words that are closer together should have more similar meanings. At first glance the words look randomly placed, but patterns like the following can be seen:

- Near the center left: nouns for people, such as "person", "lady", and "man", are clustered.
- Near the bottom: nouns for deified or superior beings, such as "master", "king", and "god", are clustered.

nlp6-1.png

python


df4visual = pd.DataFrame({"word":topwords.index, "x":w2v_x, "y":w2v_y})
f, ax = plt.subplots(figsize=(20, 20))
ax = sns.regplot("x","y",data=df4visual,fit_reg=False, scatter_kws={"alpha": 0.2})
for i, text in enumerate(topwords.index):
    ax.text(df4visual.loc[i, 'x'], df4visual.loc[i, 'y'], text)
ax.axis("off")
ax.set_title("Visualization of similarity of 200 words that frequently appear in card titles")
plt.savefig('./output/nlp6-1.png', bbox_inches='tight', pad_inches=0)

6-2. Doc2Vec

While Word2Vec learns distributed representations of words, Doc2Vec can learn distributed representations of sentences by attaching the sentence each word belongs to as tag information during training. This makes it possible to measure semantic similarity between sentences, that is, between card names.

As a preliminary step, create TaggedDocument objects as input for the model. Each list of words is tagged with the card name those words make up.

python


document = [TaggedDocument(words = wordlist, tags = [monsters_wordlist.name[i]]) for i, wordlist in enumerate(monsters_wordlist.wordlist)]
document[0]
TaggedDocument(words=['A', 'BF', '五月雨', 'ソハヤ'], tags=['A BF-五月雨のソハヤ'])

The training procedure is almost the same as for word2vec. The training method dm is set to 0 (the DBOW algorithm), the number of dimensions vector_size to 30, and the number of iterations epochs to 200. Incidentally, one epoch means feeding every sample in the dataset to the model once.

python


%time model_d2v = Doc2Vec(documents = document, dm = 0, vector_size=30, epochs=200)

Let's run a test in the same way. The docvecs.most_similar() method takes a card name as input and returns the top similar card names. When I input "Dark Magician" (ブラック・マジシャン), "Dark Magician Girl" came back in 1st place. Since card names using the same words follow, the training seems to have gone pretty well!

python


model_d2v.docvecs.most_similar("ブラック・マジシャン") # Dark Magician
[('Dark magician girl', 0.9794564843177795),
 ('Toon Dark Magician', 0.9433020949363708),
 ('Toon Dark Magician Girl', 0.9370808601379395),
 ('Dragon Knight Dark Magician', 0.9367024898529053),
 ('Dragon Knight Dark Magician Girl', 0.93293297290802),
 ('Black Bull Drago', 0.9305672645568848),
 ('Magician of Black Illusion', 0.9274455904960632),
 ('Astrograph magician', 0.9263750314712524),
 ('Chronograph magician', 0.9257084727287292),
 ('Disc magician', 0.9256418347358704)]

Dimensionality reduction is performed in the same way as for word2vec, and 200 cards are randomly sampled for visualization. Similar names land in almost the same position, which makes the plot a little hard to read, but you can see that card names sharing a word are scattered close to each other.

nlp6-2.png

python


d2v_vecs = np.zeros((monsters_wordlist.name.shape[0],30))
for i, word in enumerate(monsters_wordlist.name):
    d2v_vecs[i] = model_d2v.docvecs[word]

tsne = TSNE(n_components=2, verbose=1, n_iter=500)
tsne_d2v_vecs = tsne.fit_transform(d2v_vecs)

d2v_x = tsne_d2v_vecs[:, 0]
d2v_y = tsne_d2v_vecs[:, 1]

monsters_vec = monsters_wordlist.copy()
monsters_vec["x"] = d2v_x
monsters_vec["y"] = d2v_y

df4visual = monsters_vec.sample(200, random_state=1).reset_index(drop=True)
f, ax = plt.subplots(figsize=(20, 20))
ax = sns.regplot("x","y",data=df4visual, fit_reg=False, scatter_kws={"alpha": 0.2})
for i, text in enumerate(df4visual.name):
    ax.text(df4visual.loc[i, 'x'], df4visual.loc[i, 'y'], text)
ax.axis("off")
ax.set_title("Visualization of similarity of 200 monsters")
plt.savefig('./output/nlp6-2a.png', bbox_inches='tight', pad_inches=0)

Since we have come this far, let's also plot all the cards, without card-name labels. The following scatter plot shows the semantic closeness (vectors) of all card names, colored by attribute. I think it turned out really well! The circular cluster at the bottom of the graph can be inferred to be a collection of cards that do not belong to any series. Cards are sparsely scattered around it, but you can see that cards in the same series form small clusters here and there.

nlp6-2b.png

python


df4visual = monsters_vec
g = sns.lmplot("x","y",data=df4visual, fit_reg=False, hue="attr", height=10)
g.ax.set_title("Distribution of closeness of meaning of all card names")

For example, there is a cluster of earth-attribute cards around the coordinates (x, y) = (-40, -20). Querying this region shows that it is mostly the "Ancient Machines" series. Looks good!

python


monsters_vec.query("-42 <= x <= -38 & -22 <= y <= -18")["name"]
2740 Perfect Machine King
3952 Ancient lizard warrior
3953 Ancient mechanical soldier
3954 Ancient mechanical synthetic beast
3955 Ancient mechanical synthetic dragon
3956 Ancient mechanic
3957 Ancient mechanical giant
3958 Ancient Mechanical Giants-Ultimate Pound
3959 Ancient mechanical giant dragon
3960 Ancient mechanical chaos giant
3961 Ancient mechanical thermonuclear dragon
3962 Ancient mechanical hounds
3963 Ancient mechanical beast
3964 Ancient mechanical battery
3965 Ancient Machine Ultimate Giants
3966 Ancient machine box
3967 Ancient mechanical body
3968 Ancient mechanical giant
3969 Ancient mechanical flying dragon
3970 Ancient mechanical knight
3971 Ancient mechanical genie
3972 Ancient gear
3973 Ancient gear machine
3974 Ancient Mage
4036 Earth Giants Gaia Plate
4279 Giants goggles
4491 Pendulum blade torture machine
4762 Mechanical Soldier
4764 Machine dog Maron
4765 Machine King
4766 Machine King-Prototype
4767 Mechanical Dragon Power Tool
4768 Mechanical Sergeant
4994 Lava Giant
5247 Sleeping Giant Zushin
5597 Super ancient monster

Finally, let's check the similarity by attribute, race, and level, not by card name. The vector obtained for each card is averaged for each attribute, race, and level, and plotted for each data cut.

** By attribute **

Only the dark attribute ended up mapped to a somewhat puzzling location.

nlp6-2c.png

python


df4visual = monsters_vec.groupby("attr").mean()[["x", "y"]].reset_index().query("attr != 'God attribute'").reset_index(drop=True) # God attributeは外れ値になるため省略する
f, ax = plt.subplots(figsize=(10, 10))
ax = sns.regplot("x","y",data=df4visual, fit_reg=False, scatter_kws={"alpha": 0.2})
for i, text in enumerate(df4visual.attr):
    ax.text(df4visual.loc[i, 'x'], df4visual.loc[i, 'y'], text)

ax.set_title("Visualization of similarity of card names by attribute")
plt.savefig('./output/nlp6-2c.png', bbox_inches='tight', pad_inches=0)

** By race **

If we force an interpretation, fish and reptiles end up close together.

nlp6-2d.png

python


df4visual = monsters_vec.groupby("species").mean()[["x", "y"]].reset_index().query("species != 'Creative deity' & species != 'Phantom Beast'").reset_index(drop=True) #God attribute races are outliers and are omitted
f, ax = plt.subplots(figsize=(15, 15))
ax = sns.regplot("x","y",data=df4visual, fit_reg=False, scatter_kws={"alpha": 0.2})
for i, text in enumerate(df4visual.species):
    ax.text(df4visual.loc[i, 'x'], df4visual.loc[i, 'y'], text)
ax.axis("on")
ax.set_title("Visualization of similarity of card names by race")
plt.savefig('./output/nlp6-2d.png', bbox_inches='tight', pad_inches=0)

** By level **

You can see that the low-level band (1-4) is clustered quite close together. In the high-level band, levels 10 and 11 are close, but level 12 is far away, so we can infer that level-12 cards have different naming characteristics.

nlp6-2e.png

python


df4visual = monsters_vec.groupby("level").mean()[["x", "y"]].reset_index().query("level != '0'").reset_index(drop=True) #Level 0 is an outlier and is omitted
f, ax = plt.subplots(figsize=(10, 10))
ax = sns.regplot("x","y",data=df4visual, fit_reg=False, scatter_kws={"alpha": 0.2})
for i, text in enumerate(df4visual.level):
    ax.text(df4visual.loc[i, 'x'], df4visual.loc[i, 'y'], text)

ax.set_title("Visualization of card name similarity by level")
plt.savefig('./output/nlp6-2e.png', bbox_inches='tight', pad_inches=0)

Summary

Thank you for reading this far. Digging deeper into Yu-Gi-Oh! card names, we went through a sequence of morphological analysis with MeCab, visualization with WordCloud, and learning distributed representations with Word2Vec and Doc2Vec. I am personally quite happy with the Doc2Vec scatter plot of all cards. The machine-learning part that comes next will use the features obtained here as they are, so I expect we can build a reasonably accurate prediction model.

Next time preview

Next up is finally machine learning. I have not implemented it yet and am still considering the theme, but I would like to build the following prediction models. Please look forward to it.

  1. Predict offensive power, defensive power, attributes, races, etc. from any card name with Doc2Vec & LightGBM
  2. Generate a card name with LSTM (this may be omitted due to time constraints ...)
