Visualize Yu-Gi-Oh! Card data with Python-Yu-Gi-Oh! Data Science 1. EDA

Introduction

This is the "Yu-Gi-Oh! DS (Data Science)" series, which analyzes various Yu-Gi-Oh! card data using Python. The series consists of four articles in total, and in the final one we will implement a program that predicts offensive and defensive attributes from card names using natural language processing and machine learning. Note that the author's knowledge of Yu-Gi-Oh! stops at around the E・HERO era. I am an amateur at both the cards and data science, so please bear with me.

| No. | Article title | Keyword |
|---|---|---|
| 0 | Get card information from the Yu-Gi-Oh! Database - Yugioh DS 0. Scraping | beautifulsoup |
| 1 | Visualize Yu-Gi-Oh! Card data in Python - Yugioh DS 1. EDA edition | pandas, seaborn (this article!) |
| 2 | Process Yu-Gi-Oh card name in natural language - Yugioh DS 2. NLP edition | wordcloud, word2vec, doc2vec, t-SNE |
| 3 | Predict offensive and defensive attributes from the Yu-Gi-Oh card name - Yugioh DS 3. Machine learning | lightgbm etc. |

Purpose of this article

It has been more than 10 years since I last followed Yu-Gi-Oh! cards, and I am not sure what kinds of cards exist now. In this first installment, we will look at the data from every angle with **Exploratory Data Analysis (EDA)**. The technical theme of this article is visualization with seaborn: I will try to choose an appropriate visualization and seaborn function according to the nature of each piece of data.

Explanation of prerequisites (Usage environment, data, analysis policy)

usage environment

It should work as long as `Anaconda` is installed. Python==3.7.4 seaborn==0.10.0

data

The data used in this article was scraped with handmade code from the Yu-Gi-Oh! OCG Card Database and is current as of June 2020. Different data frames are used depending on the graph, but all of them hold the following columns.

| No. | Column name | Column name (Japanese) | Sample | Supplement |
|---|---|---|---|---|
| 1 | name | Card name | Ojama Yellow | |
| 2 | kana | Reading of the card name | Ojama Yellow | |
| 3 | rarity | Rarity | normal | For convenience of acquisition, information such as "restriction" and "prohibition" is also included. |
| 4 | attr | Attribute | Light attribute | For non-monster cards, contains "magic" or "trap" |
| 5 | effect | Effect | NaN | Contains the type of magic / trap card, such as "permanent" or "equipment". NaN for monsters |
| 6 | level | Level | 2 | For rank monsters, contains "Rank 2" |
| 7 | species | Race | Beast tribe | |
| 8 | attack | Offensive power | 0 | |
| 9 | defence | Defensive power | 1000 | |
| 10 | text | Card text | A member of the jama trio who is said to jam by all means. When something happens when all three of us are together... | |
| 11 | pack | Recording pack name | EXPERT EDITION Volume 2 | |
| 12 | kind | Type | - | For monster cards, contains information such as fusion and ritual |

Drawing policy

Every feature (column) can be classified as either categorical data or numerical data. Also, when drawing data as a graph, at most three features (columns) can be expressed at one time. Drawing a graph can therefore be described as picking features from the whole data set and choosing an appropriate representation based on the type (categorical / numerical) of each. In the visualization chapters of the implementation below, the graphs are divided into six sections according to the number of features selected at one time and the type of each feature.

| No. | Number of features | Combination | seaborn function used |
|---|---|---|---|
| 1 | 1 | Category data | sns.barplot, sns.countplot |
| 2 | 2 | Numerical data x numerical data | sns.jointplot |
| 3 | 2 | Category data x numerical data | sns.barplot, sns.boxplot |
| 4 | 2 | Category data x category data | sns.heatmap |
| 5 | 3 | Category data x category data x numerical data | sns.catplot |
| 6 | 3 | Category data x numerical data x numerical data | sns.lmplot |
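
As a rough illustration of this policy, the selection could be written as a small lookup. This is only a sketch: the helper names `kind_of` and `suggest_plot` and the toy values are hypothetical and not part of the original analysis, and treating every non-numeric pandas dtype as categorical is a simplifying assumption.

```python
import pandas as pd

# Toy frame; the values are invented for illustration, not taken from the real data set.
toy = pd.DataFrame({
    "attr": ["light", "dark", "dark"],
    "level": ["2", "4", "Rank 2"],
    "attack": [0, 1800, 2500],
    "defence": [1000, 1200, 2000],
})

# Lookup mirroring the six combinations in the table above.
# Keys are sorted tuples of feature types.
PLOT_BY_KINDS = {
    ("categorical",): "sns.barplot / sns.countplot",
    ("numerical", "numerical"): "sns.jointplot",
    ("categorical", "numerical"): "sns.barplot / sns.boxplot",
    ("categorical", "categorical"): "sns.heatmap",
    ("categorical", "categorical", "numerical"): "sns.catplot",
    ("categorical", "numerical", "numerical"): "sns.lmplot",
}


def kind_of(series):
    """Assumption: numeric dtype means numerical, anything else is categorical."""
    return "numerical" if pd.api.types.is_numeric_dtype(series) else "categorical"


def suggest_plot(df, columns):
    """Return the seaborn function(s) suggested for the selected columns."""
    kinds = tuple(sorted(kind_of(df[c]) for c in columns))
    return PLOT_BY_KINDS.get(kinds, "no suggestion")


print(suggest_plot(toy, ["attack", "defence"]))
print(suggest_plot(toy, ["attr", "level", "attack"]))
```

In practice the choice also depends on what you want to say with the graph, so a lookup like this is a starting point rather than a rule.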

Implementation

1. Package import

Import the required packages.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
sns.set(font="IPAexGothic") #Japanese support for seaborn

2. Data import

Import the four datasets to be used. How each dataset is obtained will be described in 0. Scraping (not yet published as of June 2020).

all_data = pd.read_csv("./input/all_data.csv") #Data set for all cards (cards with the same name have duplicate recording packs)
print("all_data: {}rows".format(all_data.shape[0]))

cardlist = pd.read_csv("./input/cardlist.csv") #All card dataset (no duplication)
print("cardlist: {}rows".format(cardlist.shape[0]))

monsters = pd.read_csv("./input/monsters.csv") #Monster card only
print("monsters: {}rows".format(monsters.shape[0]))

monsters_norank = pd.read_csv("./input/monsters_norank.csv") #Remove rank monsters from monster cards
print("monsters_norank: {}rows".format(monsters_norank.shape[0]))
all_data: 21796rows
cardlist: 10410rows
monsters: 6913rows
monsters_norank: 6206rows
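
For orientation, here is one way the three smaller data frames might relate to `all_data`. This is a sketch under assumptions: the actual files come from the scraping step, and the filters below are only inferred from the code comments and the column descriptions above ("magic" / "trap" in `attr`, "Rank ..." in `level`). The toy rows are invented.

```python
import pandas as pd

# Invented rows: "Card A" appears in two packs, as in all_data.
all_data = pd.DataFrame({
    "name": ["Card A", "Card A", "Card B", "Card C", "Card D"],
    "attr": ["magic", "magic", "dark", "light", "trap"],
    "level": [None, None, "4", "Rank 2", None],
    "pack": ["Pack 1", "Pack 2", "Pack 1", "Pack 3", "Pack 2"],
})

# One row per card name (assumed equivalent of cardlist).
cardlist = all_data.drop_duplicates(subset="name")

# Monster cards only: everything whose attribute is not "magic" or "trap".
monsters = cardlist[~cardlist["attr"].isin(["magic", "trap"])]

# Monsters whose level field does not hold a rank.
is_rank = monsters["level"].astype(str).str.startswith("Rank")
monsters_norank = monsters[~is_rank]

print(len(all_data), len(cardlist), len(monsters), len(monsters_norank))
```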

3. Visualization

3-1. Category data

Select and visualize only categorical data. Basically, the horizontal axis is the category and the vertical axis is the count. In seaborn, this can be expressed with sns.barplot or sns.countplot.
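
The difference between the two functions can be sketched as follows: `sns.countplot` counts the rows itself, while `sns.barplot` expects a value for the y-axis, so the counts are prepared beforehand with pandas. The toy data is invented, and the `Agg` backend line is only there so that the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented toy data.
toy = pd.DataFrame({"attr": ["dark", "dark", "light", "dark", "water", "light"]})

# Counts per category, sorted in descending order by value_counts().
counts = toy["attr"].value_counts()

fig, (ax_count, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# countplot: pass the raw rows, seaborn does the counting.
sns.countplot(data=toy, x="attr", order=counts.index, ax=ax_count)

# barplot: pass the pre-aggregated counts as the y values.
sns.barplot(x=counts.index, y=counts.values, ax=ax_bar)

# Bar heights as drawn by each function.
count_heights = [int(p.get_height()) for p in ax_count.patches]
bar_heights = [int(p.get_height()) for p in ax_bar.patches]
print(count_heights, bar_heights)
```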

3-1-1. Recording count ranking

For all cards in `all_data`, the number of times each card has been recorded in a pack is displayed up to 50th place. First place is "Cyclone", which has been recorded 45 times. It seems to be included often in starter kits (sets that let you start dueling as soon as you buy them).

eda3-1-1.png

eda3-1-1


# Recording frequency ranking: count the rows per card name in all_data and keep the top 50
df4visual = all_data.groupby("name").count().sort_values(by="kana", ascending=False).head(50)

f, ax = plt.subplots(figsize=(20, 10))
ax = sns.barplot(data=df4visual, x=df4visual.index, y="kana")
ax.set_ylabel("frequency")
ax.set_title("Recording frequency ranking")

for i, patch in enumerate(ax.patches):
    ax.text(i, patch.get_height()/2, int(patch.get_height()), ha='center')

plt.xticks(rotation=90);

3-1-2. Number of cards (by attribute)

By attribute, magic cards come out on top with 1946 cards. Among monster cards, the darkness attribute is in 1st place, with more than 3 times as many cards as the 6th-place flame attribute. I wondered why there are 6 God-attribute cards, but it turns out that Ra's winged dragon now has phoenix and sphere forms. I didn't know that.

eda3-1-2.png

eda3-1-2


df4visual = cardlist

f, ax = plt.subplots(figsize=(20, 10))
ax = sns.countplot(data=df4visual, x=df4visual.attr, order=df4visual['attr'].value_counts().index)

for i, patch in enumerate(ax.patches):
    ax.text(i, patch.get_height()/2, int(patch.get_height()), ha='center')

ax.set_ylabel("frequency")
ax.set_title("Number of cards (by attribute)");

plt.savefig('./output/eda3-1-2.png', bbox_inches='tight', pad_inches=0)

python


print(cardlist.query("attr == 'God attribute'")["name"])
3471 Horakuti, the Creator of Light
4437 Ra's Wing God Dragon-Phoenix
5998 Ra's Wing God Dragon-Spherical
6677 Obelisk God Warriors
8747 Osiris Sky Dragon
9136 Ra's Wing God Dragon
Name: name, dtype: object

3-1-3. Number of cards (by level)

Whether level should be treated as a numerical value or a category is open to interpretation, but here we treat it as an ordinal scale (the order has meaning, but the intervals do not). Rank monsters are excluded, and the number of cards per level is displayed. I had a vague intuition that odd-level cards are fewer than even-level cards, and indeed the counts always satisfy 1 < 2, 3 < 4, ....
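
One way to express an ordinal scale in pandas is an ordered categorical. The sketch below uses invented level values and assumes levels run from 1 to 12; it is not the code used for the figure.

```python
import pandas as pd

# Invented level values for five cards.
levels = pd.Series([4, 1, 12, 4, 7])

# Ordered categorical: the order 1 < 2 < ... < 12 is kept,
# but no arithmetic meaning is attached to the intervals.
ordered = pd.Categorical(levels, categories=list(range(1, 13)), ordered=True)

# sort=False keeps the category order and also lists levels with zero cards.
counts = pd.Series(ordered).value_counts(sort=False)
print(counts)
```

Keeping the empty levels in the result makes gaps visible when the counts are later passed to a bar chart.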

eda3-1-3.png

eda3-1-3


df4visual = monsters_norank
# The rest is omitted because it is almost the same as 3-1-2.

3-1-4. Number of cards (by rarity)

There are quite a few rarities I don't know (Millennium, etc.). Note that `all_data` is used as the denominator, because the same card can have a different rarity in each recording pack. Prohibited / restricted entries may include duplicates.

eda3-1-4.png

eda3-1-4


df4visual = all_data
# The rest is omitted because it is almost the same as 3-1-2.

3-1-5. Number of cards (by race)

My image was that wizards and dragons would be the most common, but surprisingly there are more warriors and machines. Is it because they have many series (such as the nostalgic "E・HERO")?

eda3-1-5.png

eda3-1-5


df4visual = monsters
# The rest is omitted because it is almost the same as 3-1-2.

3-1-6. Number of cards (by type)

I know types such as fusion and ritual, but there are many words I don't know.

eda3-1-6.png

eda3-1-6


df4visual = monsters
# The rest is omitted because it is almost the same as 3-1-2.

3-2. Numerical data x Numerical data

In many cases, the x-axis and y-axis have their respective numerical values and are represented by a scatter plot. With seaborn, you can draw scatter plots using sns.jointplot, sns.regplot, and sns.lmplot. The usage is slightly different, but please refer to the official document for details.

3-2-1. Distribution of offensive power x defensive power

For each monster card, a scatter plot is drawn with offensive power on the x-axis and defensive power on the y-axis. You can see that most cards are clustered in the range below 3000. Also, if we divide the plot along the line y = x, the lower-right side is darker, so it seems that for most cards offensive power > defensive power.
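
The impression from the y = x split can also be checked numerically by counting on which side of the line each card falls. The numbers below are invented toy values, so the resulting shares say nothing about the real data; with the real frame, `monsters` would take the place of `toy`.

```python
import pandas as pd

# Invented toy values.
toy = pd.DataFrame({
    "attack": [1800, 0, 2500, 1000, 3000],
    "defence": [1200, 1000, 2000, 1000, 2500],
})

# Share of cards below the line y = x (attack > defence),
# on the line (equal), and above it (attack < defence).
share_attack_higher = (toy["attack"] > toy["defence"]).mean()
share_equal = (toy["attack"] == toy["defence"]).mean()
share_defence_higher = (toy["attack"] < toy["defence"]).mean()

print(share_attack_higher, share_equal, share_defence_higher)
```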

eda3-2-1.png

eda3-2-1


df4visual = monsters

g = sns.jointplot(data=df4visual, x="attack", y="defence", height=10, alpha=0.3)
plt.subplots_adjust(top=0.9)
plt.suptitle('Distribution of offensive power x defensive power')
plt.savefig('./output/eda3-2-1.png', bbox_inches='tight', pad_inches=0)

3-3. Category data x numerical data

A common pattern is to put the categories on the x-axis and, on the y-axis, either the numerical data for each category or an aggregated result (total, average, maximum, ...).
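
The aggregated results mentioned here can be computed with a pandas `groupby` before plotting. A minimal sketch with invented values:

```python
import pandas as pd

# Invented toy values.
toy = pd.DataFrame({
    "attr": ["light", "dark", "dark", "light", "dark"],
    "attack": [2500, 1800, 3000, 1500, 0],
})

# Total, average and maximum attack power per attribute.
summary = toy.groupby("attr")["attack"].agg(["sum", "mean", "max"])
print(summary)
```

A table like `summary` can be passed to `sns.barplot` when a single aggregated value per category is wanted, while `sns.boxplot` works on the raw rows and shows the distribution instead.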

3-3-1. Offensive / defensive power ranking

The card name is regarded as a category, and the attack power and defense power are displayed in descending order as sns.barplot. It seems that there are still no cards with offensive or defensive power whose original value exceeds 5000.

eda3-3-1.png

eda3-3-1


df4visual = monsters
df4visual_atk = df4visual.sort_values("attack", ascending=False).head(50)
df4visual_def = df4visual.sort_values("defence", ascending=False).head(50)

f, ax = plt.subplots(2, 1, figsize = (20, 15))
f.subplots_adjust(hspace=2.0)  # leave room between the plots for the rotated card names
# Pass x and y as keyword arguments; newer seaborn versions no longer accept them positionally
ax[0] = sns.barplot(x="name", y="attack", data=df4visual_atk, ax=ax[0])
ax[0].tick_params(axis='x', labelrotation=90, labelsize = 9)
ax[0].set_xlabel("");
ax[1] = sns.barplot(x="name", y="defence", data=df4visual_def, ax=ax[1])
ax[1].tick_params(axis='x', labelrotation=90, labelsize = 9)

plt.suptitle('Offensive / defensive power ranking')
plt.savefig('./output/eda3-3-1.png', bbox_inches='tight', pad_inches=0)

3-3-2. Attack power / defense power distribution (by attribute)

The offensive and defensive power of each attribute is represented by a box plot. A box plot shows five summary statistics (minimum, 1st quartile, median, 3rd quartile, maximum), and each horizontal line of a box corresponds to one of them; by default seaborn draws points that lie far from the box individually as outliers. In terms of attack power, the median and third quartile of the light attribute are higher than those of the others, so the light attribute has relatively many monsters with high attack power. The same holds for defensive power, so the light attribute seems to excel when looking only at offense and defense.
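
The statistics behind a box can be reproduced with pandas. The sketch below uses invented values and also applies the 1.5 x IQR rule, which is the default whisker setting (`whis=1.5`) of `sns.boxplot`, to find the points that would be drawn as outliers.

```python
import pandas as pd

# Invented attack values.
attack = pd.Series([0, 1000, 1500, 2000, 5000])

# Five summary statistics: min, Q1, median, Q3, max.
q = attack.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
q1, median, q3 = q[0.25], q[0.5], q[0.75]

# Points farther than 1.5 * IQR from the box are treated as outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = attack[(attack < lower_fence) | (attack > upper_fence)]

print(q.tolist())
print(outliers.tolist())
```

With outliers present, the whisker ends are the most extreme points inside the fences rather than the true minimum and maximum.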

eda3-3-2.png

eda3-3-2


df4visual = monsters

f, ax = plt.subplots(2, 1, figsize = (20, 10))
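# Pass x and y as keyword arguments; newer seaborn versions no longer accept them positionally
ax[0] = sns.boxplot(x="attr", y="attack", data=df4visual, ax=ax[0])
ax[1] = sns.boxplot(x="attr", y="defence", data=df4visual, ax=ax[1])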

ax[0].set_xticks([])
ax[0].set_xlabel("")

ax[0].set_title("Attack power distribution (by attribute)")
ax[1].set_title("Defensive power distribution (by attribute)");

plt.savefig('./output/eda3-3-2.png', bbox_inches='tight', pad_inches=0)
monsters.groupby("attr").describe()[['attack', 'defence']]

image.png

3-3-3. Attack power / defense power distribution (by level)

The level and the median of each value appear to have a clear positive correlation. Also, at level 1 in particular, there are multiple outlier monsters whose offensive or defensive power is 2000 or more.
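
The relationship between level and the median can be checked without a plot by computing the median per level. This is a sketch on invented values, assuming the level column has already been converted to integers.

```python
import pandas as pd

# Invented toy values.
toy = pd.DataFrame({
    "level": [1, 1, 4, 4, 8, 8],
    "attack": [300, 500, 1500, 1700, 2800, 3000],
})

# Median attack power per level, sorted by level.
medians = toy.groupby("level")["attack"].median()

# True if the median never decreases as the level rises.
medians_rise_with_level = medians.is_monotonic_increasing

# Pearson correlation between level and attack on the raw rows.
corr = toy["level"].corr(toy["attack"])

print(medians.to_dict(), medians_rise_with_level, round(corr, 3))
```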

eda3-3-3.png

eda3-3-3


df4visual = monsters_norank
# The rest is omitted because it is almost the same as 3-3-2.

image.png

3-3-4. Attack power / defense power distribution (by race)

Interpretation is omitted.

eda3-3-4.png

eda3-3-4


df4visual = monsters
# The rest is omitted because it is almost the same as 3-3-2.

image.png

3-4. Category data x Category data

Take categories on both the x-axis and y-axis, and check the summary statistics (total amount, average, ..., etc.) of the data belonging to both categories. In this analysis, we will consider using sns.heatmap to represent the number of cards belonging to both categories as a heatmap.
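
The count table that feeds the heatmap can also be built with `pd.crosstab`, which is a shorthand for counting combinations of two categorical columns. This is an alternative sketch on invented rows, not the approach used for the figures.

```python
import pandas as pd

# Invented toy rows.
toy = pd.DataFrame({
    "species": ["demon", "demon", "warrior", "machine", "demon"],
    "attr": ["dark", "dark", "light", "dark", "light"],
})

# Rows: race, columns: attribute, cells: number of cards.
# Combinations that never occur are filled with 0 automatically.
table = pd.crosstab(toy["species"], toy["attr"])
print(table)

# The result holds integers, so it can be passed to
# sns.heatmap(table, annot=True, fmt="d") in the same way as a pivot table.
```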

3-4-1. Number of cards (attribute x race)

The combination with the most cards appears to be darkness x demons. Looking at race alone, warriors and machines are more numerous, but as a combination darkness x demons is the largest. Setting aside the rare attributes and races, it seems that there are still no cards for combinations such as flame x fish or flame x thunder.

eda3-4-1.png

eda3-4-1


df4visual = pd.pivot_table(monsters, index="species", columns="attr", aggfunc="count", values='name').fillna(0).astype("int")
f, ax = plt.subplots(figsize = (20, 10))
ax = sns.heatmap(data=df4visual, cmap="YlGnBu", annot=True, fmt="d")
ax.set_title("Number of cards (attribute x race)")

plt.savefig('./output/eda3-4-1.png', bbox_inches='tight', pad_inches=0)

3-4-2. Number of cards (attribute x type)

It seems that there are many cards with unusual summoning methods for light and dark attributes.

eda3-4-2.png

eda3-4-2


df4visual = pd.pivot_table(monsters.query("kind != '-'"), index="kind", columns="attr", aggfunc="count", values='name').fillna(0).astype("int")
# The rest is omitted because it is almost the same as 3-4-1.

3-5. Category data x Category data x Numerical data

To represent the distribution of a numerical value, we group it using two categorical variables. There are two ways to split into groups: (1) the x-axis and (2) color. sns.catplot is useful for showing the relationship between numerical data and two or more categorical variables. Numerical values can be segmented by categorical data in various ways, such as color coding, splitting along an axis, and splitting into separate panels.
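
The three ways of splitting can be combined in a single `sns.catplot` call through the `x`, `hue`, and `col` parameters. The sketch below uses invented toy data; the `Agg` backend line is only there so that it also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import pandas as pd
import seaborn as sns

# Invented toy data.
toy = pd.DataFrame({
    "attr": ["dark", "dark", "light", "light", "dark", "light"],
    "level": ["4", "8", "4", "8", "4", "8"],
    "kind": ["fusion", "fusion", "fusion", "ritual", "ritual", "ritual"],
    "attack": [1500, 2800, 1700, 3000, 1400, 2600],
})

# x splits along the axis, hue splits by color, col splits into panels.
g = sns.catplot(data=toy, x="attr", y="attack", hue="level", col="kind", height=3)

# One panel per value of "kind".
n_panels = g.axes.shape[1]
print(n_panels)
```

Each extra split adds information but also makes the figure harder to read, so `col` is best kept for categories with only a few values.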

3-5-1. Attack power / defense power distribution (by attribute / level)

The following data is mapped to the color, x-axis, and y-axis of the graph. The horizontal spread of the points within each category on the x-axis carries no information.

- Color: Level (category data)
- x-axis: Attribute (category data)
- y-axis: Offensive power or defensive power (numerical data)

Since the level colors form a clean gradient, you can see that level and offensive / defensive power are positively correlated even within each attribute.

eda3-5-1a.png eda3-5-1b.png

eda3-5-1


df4visual = monsters_norank
g1 = sns.catplot(x="attr", y="attack", data=df4visual, aspect=3, hue="level")
g1.ax.set_title("Attack power distribution (by attribute / level)")
plt.savefig('./output/eda3-5-1a.png', bbox_inches='tight', pad_inches=0)

g2 = sns.catplot(x="attr", y="defence", data=df4visual, aspect=3, hue="level")
g2.ax.set_title("Defensive power distribution (by attribute / level)")
plt.savefig('./output/eda3-5-1b.png', bbox_inches='tight', pad_inches=0)

The results of swapping the two categories between color and x-axis are shown below. You can read off things that the boxplot in `eda3-3-3` did not show, for example that the level 1 monsters with an attack power of 2000 or higher belong to the darkness and wind attributes, and that level 11 has few cards in the first place.

eda3-5-2c.png eda3-5-2d.png

3-5-2. Attack power / defense power distribution (by race / attribute)

The data used is as follows. A lot of information, such as which attributes each race is biased toward (eda3-4-1), the number of cards by race (eda3-1-5), and the distribution of offensive and defensive power by race (eda3-3-4), can be read from a single graph.

- Color: Attribute (category data)
- x-axis: Race (category data)
- y-axis: Offensive power or defensive power (numerical data)

eda3-5-2a.png eda3-5-2b.png

eda3-5-2


df4visual = monsters
g1 = sns.catplot(x="species", y="attack", data=df4visual, aspect=4, hue="attr")
g1.ax.set_title("Attack power distribution (by race / attribute)")
g1.ax.tick_params(axis='x', labelrotation=90)
plt.savefig('./output/eda3-5-2a.png', bbox_inches='tight', pad_inches=0)

g2 = sns.catplot(x="species", y="defence", data=df4visual, aspect=4, hue="attr")
g2.ax.set_title("Defensive power distribution (by race / attribute)")
g2.ax.tick_params(axis='x', labelrotation=90)
plt.savefig('./output/eda3-5-2b.png', bbox_inches='tight', pad_inches=0)

3-6. Category data x numerical data x numerical data

For numerical data x numerical data in 3-2, we used a scatter plot, which takes numerical values on both the x-axis and y-axis. Here, we add more information by coloring the points of the scatter plot by category.

3-6-1. Distribution of offensive and defensive power (by level)

The scatter plot of offensive and defensive power (eda3-2-1) is colored by level. You can see that the higher the level, the higher the offensive and defensive power tend to be.

- Color: Level (category data)
- x-axis: Attack power (numerical data)
- y-axis: Defensive power (numerical data)

eda3-6-1a.png

eda3-6-1a


df4visual = monsters_norank
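# Pass x and y as keyword arguments; newer seaborn versions no longer accept them positionally
g = sns.lmplot(x="attack", y="defence", data=df4visual, fit_reg=False, hue="level", height=10)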
g.ax.set_title("Offensive / defensive power distribution (by level)")
plt.savefig('./output/eda3-6-1a.png', bbox_inches='tight', pad_inches=0)

In addition, although level is treated as categorical data, it takes discrete numeric values, so it can also be interpreted as numerical data to draw numerical data x numerical data x numerical data. Here mplot3d is used to draw a 3D graph with three axes. However, 3D graphs are hard to read unless they are interactive and can be rotated.

eda3-6-1b.png

eda3-6-1b


from mpl_toolkits import mplot3d
df4visual = monsters_norank
f = plt.figure(figsize = (20, 10))
ax = plt.axes(projection = '3d')
ax.scatter3D(df4visual.attack, df4visual.defence, df4visual.level, c=df4visual.level)
plt.gca().invert_xaxis()

# Call the setters; assigning with "=" would overwrite the methods without setting any label
ax.set_xlabel('attack')
ax.set_ylabel('defence')
ax.set_zlabel('level')
plt.suptitle("Offensive power / defensive power / level distribution")

plt.savefig('./output/eda3-6-1b.png', bbox_inches='tight', pad_inches=0)

Summary

Thank you for reading this far. Using Yu-Gi-Oh! card data, I wrote about various ways of thinking when making graphs and how to use seaborn.

The more features you select at one time, the more information you can convey in a single figure, but the harder it becomes to focus the message of the graph. I want to keep in mind simple feature selection and visualizations that express only the one message I want to convey (thinking about clear messaging, perhaps that is something I should really keep in mind when writing, too...).

Also, although I did not mention it for each graph, I included various useful seaborn tips in the code. Things that are easy to do in Excel or Tableau, such as attaching a value label to each bar or changing an axis name, can be surprisingly troublesome. I hope you find them helpful.

See also this article → Don't give up on seaborn's fine look adjustments

Next time preview

Next time, we plan to analyze the card names, which did not get much attention this time, with natural language processing as the theme. I would like to perform morphological analysis with MeCab, visualize with WordCloud, and calculate similarity with Word2Vec / Doc2Vec.

image.png
