[PYTHON] Probability and statistics with Pokemon (test of no correlation): is there a correlation between the CP, weight, and height of Magikarp?

Overview

**Pokemon GO** won the 2016 buzzword award! Did you all catch and play with a lot of Pokemon?

In Pokemon GO, **every captured Pokemon has its own feature values, called individual values (they differ from one individual to the next).** I have always wondered how these individual values are distributed and how they relate to one another. **Well, I just wanted to find out (sweat).**

So in this article, **I used the individual-value data of Magikarp that I actually caught and ran a test of no correlation to check whether the CP, weight, and height parameters are correlated with one another (or whether we can say they are not).**

This article is meant to show, in a fun way, that **"you can do statistical analysis with familiar data,"** so I will avoid difficult terms and ideas as much as possible. Data science has become popular recently, and I expect some readers are curious about this kind of analysis, so I hope this gives you a reason to study statistics.

Before starting the commentary

Audience of this article

Experiment environment

Language used

This analysis could actually be done in Excel, but **since I am going to the trouble anyway, I wrote a script in Python.** The Python version is 3.5.0.

Any development environment should work; I mostly used Sublime Text 3, which I am used to, and a terminal.

Data to use

This time I used data on Magikarp ($n = 100$) that I caught around my home and around Kagurazaka, Tokyo, from summer to autumn 2016. The data was collected as follows.

  1. Catch a Pokemon
  2. Rename it with the approximate place where it was caught
  3. Take a screenshot of the individual-value confirmation screen
  4. Transfer it to the Professor
  5. Enter the data by hand from the screenshot

It was an analog method, so it was quite a chore (laughs). It is convenient to sync the screenshots to your computer with Google Photos or Dropbox and then type in the individual values from the images collected like this. (I wish deep learning could read the values automatically...)

screenshots.png

The entered data is saved in CSV format. If you want to use the data I collected, it is available at http://tmp.imaizu.me/pokestat/magikarp.csv. The column structure of the CSV data is as follows.

Only the **CP, Weight, and Height columns** are used in this analysis.

Prerequisites

Strictly speaking, statistical methods require various preconditions to hold, but this time I will ignore many of them and write in the spirit of "let's just try it," so please bear with me.

Analysis method

Now for the main subject, the analysis. First, let's load the CSV data and plot it as a scatter plot matrix. The data is read into a DataFrame with the Python library pandas.

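import matplotlib.pyplot as plt
import pandas as pd
# Current pandas exposes scatter_matrix under pandas.plotting
# (the old pandas.tools.plotting path no longer exists)
from pandas.plotting import scatter_matrix

# Load the Magikarp data and print summary statistics
data = pd.read_csv("magikarp.csv")
print(data.describe())

# scatter_matrix creates its own figure; save it to a file
scatter_matrix(data)
plt.savefig("image.png")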

The scatter plot for each variable looks like this.

image.png

For Magikarp, weight and height show a fairly clean linear relationship. From quite small individuals to giant ones, Magikarp seem to be distributed with a realism close to that of real fish.

CP, on the other hand, is harder to read... Looking at the histogram, individuals with a CP of 10 stand out as by far the most common, while the other CP ranges contain roughly similar numbers of individuals. In Pokemon GO the lowest CP is 10, and for a weak Pokemon like Magikarp, the high frequency of CP 10 individuals certainly matches what you feel when actually playing.
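The impression that the lowest CP dominates can be checked by counting rather than by eye. Below is a minimal sketch using only the standard library; the CP values are made up for illustration and are not the article's data, and `share_of_minimum` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical CP values for illustration only; these are NOT the article's data.
cp_values = [10, 10, 10, 10, 23, 47, 58, 10, 71, 95, 10, 34]

def share_of_minimum(values):
    """Return (minimum value, fraction of observations equal to it)."""
    counts = Counter(values)
    lowest = min(counts)
    return lowest, counts[lowest] / len(values)

lowest, share = share_of_minimum(cp_values)
print("lowest CP = {}, share = {:.0%}".format(lowest, share))
```

With the real DataFrame, `data["CP"].value_counts()` should give the same kind of tally directly.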

Next, compute the correlation coefficient ($\alpha$) for these variables. **This value indicates whether there is a linear relationship between two variables; the closer its absolute value is to 1, the stronger the linear relationship.** To compute it, use the corr function of the DataFrame, a handy function that calculates the correlation between every pair of variables in the DataFrame.

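# Restrict to the three columns under study so that any other
# (possibly non-numeric) columns in the CSV do not interfere
print(data[["CP", "Weight", "Height"]].corr())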
#>               CP    Weight    Height
#> CP      1.000000  0.010724  0.086286
#> Weight  0.010724  1.000000  0.865564
#> Height  0.086286  0.865564  1.000000
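To make the numbers above less of a black box, here is a minimal sketch of Pearson's correlation coefficient computed by hand with the standard library, following $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$. The function name `pearson_r` and the sample heights and weights are assumptions made up for illustration, not the article's data.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant sequence makes the denominator zero (r is undefined)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights (m) and weights (kg), made up for illustration.
heights = [0.70, 0.82, 0.90, 1.01, 1.10]
weights = [6.1, 8.3, 9.8, 11.9, 13.6]
print(pearson_r(heights, weights))
```

Because the deviations are divided by their own spread, $r$ does not depend on the units of either variable, which is why CP, kilograms, and meters can be compared on the same scale.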

Comparing with the plot above, the values are as expected. **The correlation coefficient between weight and height is 0.866, which is quite strong.** The correlation coefficients involving CP, on the other hand, are small at first glance, and it seems hard to claim that CP is "correlated" with anything.

So, finally, check whether these correlation coefficients are significant with a **test of no correlation.** This test sets up the hypothesis (null hypothesis) that "the population correlation coefficient is 0" and computes the significance probability: the probability of getting a sample correlation at least as far from 0 as the observed one purely by chance if that hypothesis were true. If this probability is extremely low, the correlation coefficient is judged to be genuinely meaningful (significant). The hypotheses this time are:

Null hypothesis $H_0: \alpha = 0$

Alternative hypothesis $H_1: \alpha \neq 0$

SciPy provides the function pearsonr for testing with Pearson's product-moment correlation coefficient (there are several other kinds of tests of no correlation), so run it on each pair of variables. Given two paired variables, it returns the correlation coefficient $r$ and the significance probability $p$.

from scipy.stats import pearsonr

# `data` is the DataFrame loaded from magikarp.csv earlier
# Uncomment one pair at a time to test it
r, p = pearsonr(data.Height, data.Weight)  # Height and Weight
# r, p = pearsonr(data.Height, data.CP)  # Height and CP
# r, p = pearsonr(data.Weight, data.CP)  # Weight and CP
print('Correlation coefficient r: {r}'.format(r=r))
print('Significance probability p: {p}'.format(p=p))
# True means H0 cannot be rejected at the 5% level
print('Significance probability p> 0.05: {result}'.format(result=(p > 0.05)))
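To see how the sample size enters the test, here is a small sketch of the t statistic that, under the usual normality assumption, underlies a test of no correlation: $t = r\sqrt{(n-2)/(1-r^2)}$ with $n-2$ degrees of freedom. It uses the rounded coefficients printed by corr earlier and the stated $n = 100$; the function name `t_statistic` and the hard-coded critical value (about 1.9845 for 98 degrees of freedom at the two-sided 5% level) are assumptions of this sketch, not part of the original script.

```python
import math

def t_statistic(r, n):
    """t statistic for testing 'population correlation = 0' from r and sample size n."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Approximate two-sided 5% critical value of the t distribution with 98 degrees of freedom
T_CRIT_98 = 1.9845

n = 100  # sample size stated in the article
pairs = [("Height-Weight", 0.865564),
         ("Height-CP", 0.086286),
         ("Weight-CP", 0.010724)]
for label, r in pairs:
    t = t_statistic(r, n)
    print("{}: t = {:.3f}, significant at 5%: {}".format(label, t, abs(t) > T_CRIT_98))
```

The same $r$ gives a larger $t$ as $n$ grows, so a weak correlation can still become significant with enough data. If SciPy is available, `2 * scipy.stats.t.sf(abs(t), n - 2)` should turn this statistic into a two-sided significance probability.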

The test results are as follows. If the significance probability $p$ is greater than $0.05$ (True in the output), $H_0$ ("$\alpha = 0$", that is, no correlation) cannot be rejected; otherwise (False), $H_0$ is rejected and the correlation is judged significant.

Weight and height


>Correlation coefficient r: 0.8655637883468845
>Significance probability p: 1.7019782502122307e-31
>Significance probability p> 0.05: False #Significant

Again, as expected, it proved to be significantly correlated.

Height and CP


>Correlation coefficient r: 0.0862864395740605
>Significance probability p: 0.39090582918188466
>Significance probability p> 0.05: True #Not significant

Weight and CP


>Correlation coefficient r: 0.01072432286085844
>Significance probability p: 0.915233564101408
>Significance probability p> 0.05: True

CP, on the other hand, also turned out as expected: neither of its correlations is significant. Whether it even makes sense to examine the correlation between CP and the other variables is a separate question, and this was only a simple example, but I hope it shows that a method like this lets you infer, to some extent, how a game generates its parameters.
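Another way to get a feel for what the significance probability means is a permutation test: shuffle one variable many times to destroy any pairing, and count how often the shuffled data shows a correlation at least as strong as the observed one. This is a different technique from what pearsonr does, shown only as an intuition aid; the data below is randomly generated for illustration (not the article's CSV), and the function names are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the correlation between xs and ys."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any pairing between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_permutations + 1)

# Hypothetical data generated here for illustration (not the article's CSV).
rng = random.Random(42)
heights = [0.6 + 0.02 * i for i in range(30)]
weights = [12.0 * h + rng.gauss(0, 0.3) for h in heights]  # depends on height
cp = [rng.randint(10, 150) for _ in heights]               # unrelated to height

print("Height-Weight p:", permutation_p_value(heights, weights))
print("Height-CP     p:", permutation_p_value(heights, cp))
```

A pair that is truly related almost never looks as strongly correlated after shuffling, so its p-value is tiny, while an unrelated pair is matched by shuffled data much more often.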

Summary

So, although it was very simple, I tried a correlation analysis on Pokemon data. Since the distribution of this data reflects how the game sets its parameters, it could be interesting to keep records for other Pokemon or other games and try estimating their parameters. The distribution of individual values may well be quite different for Pokemon other than Magikarp.

This time I did a test of no correlation, but I would like to try other similar analyses, so I hope to write a follow-up somewhere. I need to study more statistics before then...
