[PYTHON] Predicting offensive and defensive attributes from Yu-Gi-Oh! card names - Yu-Gi-Oh! Data Science 3. Machine Learning

Introduction

This is the "Yu-Gi-Oh! DS (Data Science)" series, which analyzes various Yu-Gi-Oh! card data using Python. The series consists of four articles in total, and in this final one we implement a program that predicts a card's stats and attributes from its name using natural language processing and machine learning. Note that the author's knowledge of Yu-Gi-Oh! stops at around the E・HERO era. I am an amateur at both the card game and data science, so please go easy on me.

No. | Article title | Keywords
0 | Get card information from the Yu-Gi-Oh! database - Yu-Gi-Oh! DS 0. Scraping | beautifulsoup
1 | Visualize Yu-Gi-Oh! card data in Python - Yu-Gi-Oh! DS 1. EDA | pandas, seaborn
2 | Process Yu-Gi-Oh! card names with natural language processing - Yu-Gi-Oh! DS 2. NLP | wordcloud, word2vec, doc2vec, t-SNE
3 | Predict offensive and defensive attributes from Yu-Gi-Oh! card names - Yu-Gi-Oh! DS 3. Machine learning (this article!) | lightgbm etc.

Purpose of this article

In this article, the vectors obtained by converting card names with Doc2Vec in the previous article are used as features, and the other elements of each card (attack, defense, attribute, race, level) are used as labels, to train machine learning models that predict those elements from the card name alone. Using the trained models, we then run a prediction task that assigns stats and attributes to made-up card names. I would like to see how strong an original monster I came up with is judged to be in the light of machine learning.

Explanation of prerequisites (Usage environment, data, analysis policy)

Usage environment

Python==3.7.4

data

This article uses data scraped from the Yu-Gi-Oh! OCG Card Database as the original data, current as of June 2020. The machine learning input data is assumed to be this original data after the processing described in the NLP article.

Analysis policy

As mentioned above, only the (vectorized) card name is used as the feature this time. There are five labels (attack, defense, attribute, race, and level), so we prepare five models accordingly. As for the model type, deep learning is not used because of the limitations of my machine; instead we use LightGBM, a gradient-boosted tree model that is generally among the most accurate tree-based methods and can be applied to both classification and regression problems. Before implementing, we set the problem type for each model (regression or classification) and form a hypothesis about how accurate each is likely to be.

No. | Predicted label | Problem setting | Hypothesis
1 | Attribute | Multi-class classification | For example, a card with "angel" in its name is likely to be a LIGHT attribute card, so there seem to be usable tendencies; accuracy should be moderately high.
2 | Race | Multi-class classification | Same as No. 1. Cards with "dragon" in the name tend to be Dragon types, so accuracy could be quite high for some races.
3 | Level | Multi-class classification | In card series, the same words appear across cards with scattered levels, so accuracy is likely poor. The imbalance in the label counts will probably also hurt.
4 | Attack | Regression | Accuracy is likely poor for the same reason as No. 3.
5 | Defense | Regression | Accuracy is likely poor for the same reason as No. 3.

Implementation

1. Package import

Import the required packages.

python


from sklearn.metrics import confusion_matrix, plot_confusion_matrix
from sklearn.metrics import mean_squared_error, r2_score
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn.model_selection import train_test_split

import gensim
import lightgbm as lgb
import matplotlib.pyplot as plt
%matplotlib inline
import MeCab
import numpy as np
import pandas as pd
import pickle
import seaborn as sns
sns.set(font="IPAexGothic")

2. Importing data and models

Import the data to be used and the trained Doc2Vec model that vectorizes the card names.

python


model_d2v = pickle.load(open('./input/model_d2v.pickle', 'rb'))
X = pickle.load(open('./input/X.pickle', 'rb'))
y = pickle.load(open('./input/y.pickle', 'rb'))

print("shape")
print("X: {}".format(X.shape))
print("y: {}".format(y.shape))
print("----------------")
print("data")
print("y: ")
print(y.head())
shape
X: (6206, 30)
y: (6206, 7)
----------------
data
y: 
  rarity       attr      level  species           attack  defence  kind
0 Rare         Darkness  5      Birds and Beasts  1500    2000     Synchro
1 Ultra Rare   Darkness  7      Birds and Beasts  2600    2000     Synchro
2 Ultra Rare   Darkness  12     Birds and Beasts  3000    2000     Synchro
3 Normal       Darkness  2      Birds and Beasts  800     100      Synchro
4 Rare         Darkness  5      Birds and Beasts  2100    1600     -

The imported X (features) and y (labels) are assumed to have been saved in the previous article as follows. For reference, a condensed version of the saving process is shown below.

python



# Create the data frame monsters_wordlist, which holds the list of words contained in each card name
# (omitted; see the previous article)

# Create TaggedDocuments for the Doc2Vec model
document = [TaggedDocument(words = wordlist, tags = [monsters_wordlist.name[i]]) for i, wordlist in enumerate(monsters_wordlist.wordlist)]

# Train the Doc2Vec model
model_d2v = Doc2Vec(documents = document, dm = 0, vector_size=30, epochs=200)

# Vectorize all card names with the Doc2Vec model
d2v_vecs = np.zeros((monsters_wordlist.name.shape[0],30))
for i, word in enumerate(monsters_wordlist.wordlist):
    d2v_vecs[i] = model_d2v.infer_vector(word)

# Store the vectorized card names as X and the label columns as y
# ("kind" is included here so that y matches the (6206, 7) shape printed above)
X = d2v_vecs
y = monsters_wordlist[["rarity", "attr", "level", "species", "attack", "defence", "kind"]]

#Save
with open("./input/model_d2v.pickle", "wb") as f:
    pickle.dump(model_d2v, f)

with open("./input/X.pickle", "wb") as f:
    pickle.dump(X, f)
    
with open("./input/y.pickle", "wb") as f:
    pickle.dump(y, f)

monsters_wordlist is the data frame shown below. Please refer to the previous article for how to generate it.

(image.png: preview of the monsters_wordlist data frame)
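For reference, here is an illustrative sketch of the columns monsters_wordlist is expected to contain (the row values below are made up; the real frame is built in the NLP article):

python

import pandas as pd

# Illustrative example only: one row per card, with the tokenized name in "wordlist"
monsters_wordlist = pd.DataFrame({
    "name": ["Example Dragon"],
    "wordlist": [["Example", "Dragon"]],
    "rarity": ["Rare"], "attr": ["Darkness"], "level": [5],
    "species": ["Dragon"], "attack": [2100], "defence": [1600], "kind": ["-"],
})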

3. Learning

Implement the training of a model for each of the five labels. Writing out hyperparameters for each model and branching between classification and regression individually would inflate the amount of code, so for simplicity we wrap those steps in a function. A few notes on LightGBM:

- LightGBM has two APIs, the original interface and the Scikit-Learn interface; the latter is adopted here, partly out of familiarity and partly because the behavior of the predict method in multi-class classification is more convenient.
- For the evaluation metric (eval_metric), multi-logloss is used for the multi-class classifications and RMSLE (root mean squared logarithmic error) for the regressions. RMSLE was chosen over RMSE because the distribution of attack and defense is more spread out than a normal distribution (see 3-2-1 of the EDA article; a short sketch of the metric follows this list).
- The other hyperparameters are ad hoc (carried over from another model I made in the past). They really ought to be tuned properly... A method for hyperparameter tuning with Optuna is described in the Appendix at the end of the article.
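For reference, a minimal numpy sketch of RMSLE (this is not code from the article): it computes RMSE on log(1 + x), so errors on large attack/defense values are penalized less steeply than with plain RMSE.

python

import numpy as np

# Minimal RMSLE sketch: root mean squared error on log(1 + x)
def rmsle(y_true, y_pred):
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(rmsle(np.array([2500, 1200]), np.array([2000, 1300])))  # roughly 0.17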

python


# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

# Wrap the creation and training of classification/regression models in one function
def fit_model(column, mode='Classifier'):
    if mode == 'Classifier':
        model_lgb = lgb.LGBMClassifier(num_leaves=5,
                                      learning_rate=0.05, n_estimators=720,
                                      max_bin = 55, bagging_fraction = 0.8,
                                      bagging_freq = 5, feature_fraction = 0.2319,
                                      feature_fraction_seed=9, bagging_seed=9,
                                      min_data_in_leaf =6, min_sum_hessian_in_leaf = 11, verbosity=-1)
        model_lgb.fit(X_train, y_train[column], eval_set=[(X_test, y_test[column])], eval_metric='multi_logloss', early_stopping_rounds=10)
        
    elif mode == 'Regressor':
        model_lgb = lgb.LGBMRegressor(num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11, verbosity=-1)
        model_lgb.fit(X_train, y_train[column], eval_set=[(X_test, y_test[column])], eval_metric='rmsle', early_stopping_rounds=10)

    
    return model_lgb

# Create a model for each label
model_attr = fit_model("attr", mode="Classifier")
model_level = fit_model("level", mode="Classifier")
model_species = fit_model("species", mode="Classifier")
model_attack = fit_model("attack", mode="Regressor")
model_defence = fit_model("defence", mode="Regressor")

4. Accuracy verification

We will verify the accuracy of the learning results using the test data.

4-1. Accuracy / coefficient of determination (R2 Score)

As evaluation metrics, we check the accuracy for the classification problems and the coefficient of determination (R2 Score) for the regression problems.

python


def get_accuracy(column, model):
    y_pred = model.predict(X_test)
    accuracy = sum(y_test[column] == y_pred) / len(y_test)
    return accuracy

def get_r2score(column, model):
    y_pred = model.predict(X_test)
    r2score = r2_score(y_test[column], y_pred)
    return r2score

accuracy_attr = get_accuracy("attr", model_attr)
print("accuracy_attr: {}".format(accuracy_attr))

accuracy_species = get_accuracy("species", model_species)
print("accuracy_species: {}".format(accuracy_species))

accuracy_level = get_accuracy("level", model_level)
print("accuracy_level: {}".format(accuracy_level))

r2score_attack = get_r2score("attack", model_attack)
print("r2score_attack: {}".format(r2score_attack))

r2score_defence = get_r2score("defence", model_defence)
print("r2score_defence: {}".format(r2score_defence))
accuracy_attr: 0.5515297906602254
accuracy_species: 0.4669887278582931
accuracy_level: 0.3413848631239936
r2score_attack: 0.0804399379391485
r2score_defence: 0.04577024081390113

First, look at the accuracy. For comparison, consider a model that assigns labels completely at random: the attribute (attr) has 7 classes, so random accuracy would be about 0.143; level has 13 classes (0 through 12), about 0.077; and there are 25 races (species), about 0.04. By that standard, the models above do appear to have learned something. On the other hand, the level data is heavily skewed: 1,925 of the 6,206 cards are level 4, so a model that always predicts level 4 would already reach an accuracy of about 0.31. That is very close to the accuracy above, so we need to dig deeper to check whether we have in fact just built a model that predicts level 4 for everything.
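As a quick cross-check, those baselines can be computed directly; a minimal sketch, assuming the y and y_test objects from earlier:

python

# Random baseline = 1 / number of classes; majority baseline = share of the most frequent class
for col in ["attr", "species", "level"]:
    n_classes = y[col].nunique()
    majority = y_test[col].value_counts(normalize=True).iloc[0]
    print("{}: random={:.3f}, majority={:.3f}".format(col, 1.0 / n_classes, majority))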

Next is the coefficient of determination (R2 Score): the closer to 1, the better the features explain the variance of the label through the fitted prediction function. Since the value is quite low for both attack and defense, we can say that the card name conveys almost nothing about attack and defense (they are essentially unrelated).
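For intuition, R2 Score is 1 - SS_res / SS_tot; a tiny sketch with made-up numbers, confirming the formula matches sklearn's r2_score:

python

import numpy as np
from sklearn.metrics import r2_score

# R2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
y_true = np.array([1500.0, 2000.0, 800.0])
y_pred = np.array([1600.0, 1800.0, 1000.0])
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot, r2_score(y_true, y_pred))  # both about 0.876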

4-2. Confusion Matrix

To see the accuracy of the classification problems in more detail, draw confusion matrices that map the predictions against the actual values.

python


def make_cm(column, model, normalize=None):
    labels = y[column].unique()
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test[column], y_pred, labels=labels, normalize=normalize)
    cm_labeled = pd.DataFrame(cm, columns=labels, index=labels)
    f, ax = plt.subplots(figsize = (20, 10))
    ax = sns.heatmap(cm_labeled, annot=True, cmap="YlGnBu", fmt=".2f")
    ax.set_ylabel("Actual")
    ax.set_xlabel("Predicted")
    ax.set_title("confusion_matrix: {}".format(column))

make_cm("attr", model_attr, "pred")
plt.savefig('./output/ml4-2-1.png', bbox_inches='tight', pad_inches=0)

make_cm("level", model_level, "pred")
plt.savefig('./output/ml4-2-2.png', bbox_inches='tight', pad_inches=0)

make_cm("species", model_species, "pred")
plt.savefig('./output/ml4-2-3.png', bbox_inches='tight', pad_inches=0)

(ml4-2-1.png: confusion matrix for attr, normalized by predicted label)

(ml4-2-2.png: confusion matrix for level, normalized by predicted label)

(ml4-2-3.png: confusion matrix for species, normalized by predicted label)

The values of each confusion matrix are normalized to sum to 1 along the predicted (column) direction. In other words, each cell shows the probability that something predicted as A was actually A, i.e. the precision. For example, in the first plot (attributes), about 58% of the cards predicted as Darkness attribute actually are Darkness attribute.

Looking at the attributes, the fire and water attributes are predicted relatively accurately, perhaps because card names tend to contain that information directly. Among the races, precision is also high for Reptiles and Dinosaurs. On the other hand, many cards predicted to be Psychic types were actually Machine types.

At first glance some levels seem predictable too, but to verify the earlier concern (that we may have simply built a model that always predicts level 4) we need to check the recall. Recall is the proportion of things actually labeled A that are correctly predicted as A. The image below is the level confusion matrix re-plotted to show recall instead, and it confirms that most cards, whatever their actual level, were predicted to be level 4.

(ml4-2-6.png: confusion matrix for level, normalized by actual label, i.e. showing recall)

So far we have touched on the easily confused metrics accuracy, precision, and recall; for details, see "[For beginners] Evaluation metrics for machine learning classification problems explained (accuracy, precision, recall, etc.)".
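As a cross-check on the discussion above, scikit-learn can also report precision and recall for every class at once; a minimal sketch for the level model, assuming model_level and the test split from earlier:

python

from sklearn.metrics import classification_report

# Per-class precision and recall for the level model
y_pred_level = model_level.predict(X_test)
print(classification_report(y_test["level"], y_pred_level, zero_division=0))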

5. Prediction

Implement the process of predicting stats and attributes for new card names that are not in the dataset. The input card name is preprocessed (vectorized) in the same way as during training, and the result is passed to each model's predict method. The preprocessing function get_vec() performs morphological analysis of the name and vectorizes it with the Doc2Vec model; it is essentially the same as the process that generated the feature matrix X in the NLP article.

python


# Function that preprocesses a card name (morphological analysis -> Doc2Vec vector)
def get_vec(rawtext):
    m = MeCab.Tagger("-Ochasen -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/")
    lines = []
    text_list = rawtext.split("・")
    
    for text in text_list:
        keitaiso = []
        m.parse('')
        node = m.parseToNode(text)
        while node:
            # Store each morpheme in a dict
            tmp = {}
            tmp['surface'] = node.surface
            tmp['base'] = node.feature.split(',')[-3] # base form
            tmp['pos'] = node.feature.split(',')[0] # part of speech
            tmp['pos1'] = node.feature.split(',')[1] # part-of-speech subcategory
            
            # Skip the BOS/EOS nodes that mark the beginning/end of a sentence
            if 'BOS/EOS' not in tmp['pos']:
                keitaiso.append(tmp)
                
            node = node.next
        lines.append(keitaiso)
    
    # Keep the surface form for nouns and the base form for verbs/adjectives
    # (MeCab/ipadic part-of-speech tags are in Japanese)
    word_list = [] 
    for line in lines:
        for keitaiso in line:
            if (keitaiso['pos'] == '名詞'):
                word_list.append(keitaiso['surface'])
            elif  (keitaiso['pos'] == '動詞') | (keitaiso['pos'] == '形容詞') :
                if not keitaiso['base'] == '*' :
                    word_list.append(keitaiso['base'])
                else: 
                    word_list.append(keitaiso['surface'])
# Uncomment to also keep parts of speech other than nouns/verbs/adjectives
#             else:
#                 word_list.append(keitaiso['surface'])

    model_d2v = pickle.load(open('./input/model_d2v.pickle', 'rb'))
    vec = model_d2v.infer_vector(word_list).reshape(1,-1)

    return vec

# Function that predicts all the other card info from a card name at once
def predict_cardinfo(name):
    vec = get_vec(name)
    print("Attribute: {}".format(model_attr.predict(vec)[0]))
    print("Level: {}".format(model_level.predict(vec)[0]))
    print("Race: {}".format(model_species.predict(vec)[0]))
    print("Attack power: {}".format(model_attack.predict(vec)[0]))
    print("Defensive power: {}".format(model_defence.predict(vec)[0]))

Let's actually try predicting various card names with the predict_cardinfo() function.

**Blue-Eyes White Dragon**

python


predict_cardinfo("Blue-Eyes White Dragon")
Attribute: Light attribute
Level: 4
Race: Dragon
Attack power: 1916.3930197124996
Defensive power: 1366.9371594605982

**Red-eyed White Dragon**

python


predict_cardinfo("Red-eyed white dragon")
Attribute: Earth attribute
Level: 4
Race: Warrior
Attack power: 1168.203405707642
Defensive power: 1007.5706886946783

**Red Magician**

python


predict_cardinfo("Red magician")
Attribute: Dark attribute
Level: 4
Race: Wizards
Attack power: 1884.3345074514568
Defensive power: 1733.53872077943

**Ultra Super Chaos Magician**

python


predict_cardinfo("Ultra Super Chaos Magician")
Attribute: Dark attribute
Level: 4
Race: Wizards
Attack power: 2129.5019817399634
Defensive power: 1623.7306977987516

For words like "dragon" and "magician", the predictions are almost as intended: Dragon types and Wizard types. Attack and defense tend to be pulled down toward the values of the many low-level monsters, but even so they seem to rise a little with strong-sounding words like "chaos". As for level, as the confusion matrix suggested, almost everything is predicted to be level 4. The function also works for entirely new names that have never appeared on a card. It could be fun to try names from other games, such as Pikachu.
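For example (output omitted here, since it depends entirely on the trained models):

python

# Any string works as input; the models will give some prediction for it
predict_cardinfo("Pikachu")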

Appendix. Hyperparameter tuning

Although the implementation was omitted this time, Optuna is convenient for hyperparameter tuning. As of July 2020, Optuna's LightGBM integration does not seem to support the scikit-learn interface, so a reference implementation using the original interface is shown below.

python


import optuna.integration.lightgbm as lgb_o

def get_best_params(column, mode, metric):
    y_obj = y[column]
    X_trainval, X_test, y_trainval, y_test = train_test_split(X, y_obj, test_size=0.2)
    X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, test_size=0.1)

    # Convert to LightGBM datasets
    train = lgb_o.Dataset(X_train, y_train)
    val = lgb_o.Dataset(X_valid, y_valid)

    # Hyperparameter search & model building
    if mode == "regression":
        params = {'objective': '{}'.format(mode),
                  'metric': '{}'.format(metric),
                  'random_seed':0} 
    elif mode == "multiclass":
        params = {'objective': '{}'.format(mode),
          'metric': '{}'.format(metric),
          'num_class': len(y_obj.unique()),
          'random_seed':0} 

    model_o = lgb_o.train(params,
                        train,
                        valid_sets=val,
                        early_stopping_rounds=10,
                        verbose_eval=False)

    # Predictions with the tuned model (not used further here)
    y_trainval_pred = model_o.predict(X_trainval, num_iteration=model_o.best_iteration)
    y_test_pred = model_o.predict(X_test, num_iteration=model_o.best_iteration)

    best_params = model_o.params
    return best_params

best_params_attack = get_best_params("attack", "regression", "rmse")

Summary

We implemented models that predict attribute, race, level, attack, and defense from Yu-Gi-Oh! card names. For attribute and race in particular, we confirmed prediction accuracy roughly in line with the hypotheses. As Todo items: the wrapping of the training code in functions is a little rough, so converting it into a Pipeline would be the better practice (a sketch of the idea follows), and for a competition such as Kaggle the accuracy evaluation should be done more carefully. Another direction that seems interesting would be to turn this into a web application with Django or the like and publish it.
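As a hypothetical sketch of that Pipeline idea (the transformer class and names below are illustrative, not code from this article), the Doc2Vec vectorization and a LightGBM model could be wrapped into one scikit-learn estimator:

python

import numpy as np
import lightgbm as lgb
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

# Hypothetical transformer that turns word lists into Doc2Vec vectors
class D2VVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_d2v):
        self.model_d2v = model_d2v

    def fit(self, X, y=None):
        return self

    def transform(self, wordlists):
        return np.array([self.model_d2v.infer_vector(w) for w in wordlists])

pipe = Pipeline([
    ("vec", D2VVectorizer(model_d2v)),
    ("clf", lgb.LGBMClassifier()),
])
# pipe.fit(monsters_wordlist.wordlist, y["attr"]) would then train end to end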

Next time preview

This article is the last of the data science implementations, but when I have time I may publish the article on 0. Scraping. Since scraping data is a sensitive subject, I am not going to provide the complete code, just implementation tips.
