Currently I am studying Python and data analysis (including statistics), so on this page I will collect notes about statistics and graphing in Python. Judging from data is a powerful tool that can prevent mistakes that logical thinking alone cannot, so I have been studying it actively.
※Caution※ Since these notes were written while I was still studying, they may contain misunderstandings. I will correct them as soon as I notice, but if you spot an error, I would appreciate it if you could point it out.
・Shinichi Kurihara, "Introduction to Statistics"
・Wes McKinney, "Introduction to Data Analysis with Python"
・Anchibe, "Introduction to the Practical Process of Data Analysis"
・Bill Lubanovic, "Introduction to Python 3"
・Yoshinori Fujii et al., "Analysis of Data for the Japan Statistical Society Officially Certified Statistical Test Level 3"
・Toyoki Tanaka et al., "Basics of Statistics for the Japan Statistical Society Officially Certified Statistical Test Level 2, Revised Edition"
・Haebaru, "Basics of Psychometrics"
Two Statistical Test textbooks are included in the list; rather than the textbooks themselves, I recommend studying for the Statistical Test. Although it is a private qualification, I think the exam asks quite good questions. Incidentally, Level 3 of the Statistical Test can be passed with the official textbook alone. For Level 2, Kurihara's "Introduction to Statistics", recommended by Anchibe, is good; it is all you need.
When learning statistics, the variance is sometimes computed by dividing by the number of samples n and sometimes by n-1, and if you do not understand this point well it becomes confusing.
From the standpoint of descriptive statistics, that is, when summarizing the samples at hand and presenting their characteristics as statistics, we use the sample mean and sample variance and divide by the number of samples n.
From the standpoint of inferential statistics, we want the desirable property of unbiasedness: the expected value of the estimator should match the true value of the population parameter being estimated. The sample mean is already unbiased, so it is adopted as-is in inferential statistics. For the variance, however, the sample variance tends to come out smaller than the true variance because of the error introduced when drawing the sample, so to correct for this we divide by n-1.
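As a quick check, numpy's np.var exposes both conventions through its ddof argument: ddof=0 (the default) divides by n, and ddof=1 divides by n-1. A minimal sketch (the sample size and parameters are arbitrary):

import numpy as np

np.random.seed(0)
# Draw a small sample from N(0, 2^2); the true variance is 4.
sample = 2.0 * np.random.standard_normal(10)

# Descriptive standpoint: divide by n (numpy's default, ddof=0).
print('variance divided by n  :', np.var(sample))
# Inferential standpoint: divide by n-1 for an unbiased estimator.
print('variance divided by n-1:', np.var(sample, ddof=1))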
When there are outliers or the data is skewed, one tends to reach for the median (I think), but the mean is the point that minimizes the sum of squared "distances" from itself to each data point, while the median is the statistic that minimizes the sum of absolute (not squared) distances. They are values that optimize different criteria, so one cannot simply say which is right. When the mean and median differ significantly, it is recommended to show both values.
If they differ greatly because of the shape of the distribution, it may be better to present a histogram as well.
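As a small numerical check of the point above (the data is made up and includes one outlier), we can evaluate both criteria over a grid of candidate points and see where each is minimized:

import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # includes one large outlier

# Evaluate both criteria at many candidate points c.
candidates = np.linspace(0, 110, 2201)
sum_sq = [np.sum((data - c) ** 2) for c in candidates]
sum_abs = [np.sum(np.abs(data - c)) for c in candidates]

# The squared criterion is minimized at the mean, the absolute one at the median.
print('argmin of squared distances :', candidates[np.argmin(sum_sq)], '/ mean  :', np.mean(data))
print('argmin of absolute distances:', candidates[np.argmin(sum_abs)], '/ median:', np.median(data))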
The standardized variate is the value computed as z_i = (x_i - μ) / σ, where μ is the (population) mean and σ is the (population) standard deviation calculated from the data x_i.
The point is to make the mean zero and the standard deviation 1. Statistical Test Levels 3 and 2 (and real-life problems) may ask you to compare, say, Student A's math and science grades. However, the mean score and the variance differ from subject to subject, which makes direct comparison difficult, so we compute the standardized variate.
Once standardized, the grades of different subjects can be compared simply by the size of the value. By the way, the deviation value (hensachi) is obtained by multiplying this standardized variate by 10 and adding 50.
When the data is transformed as y = ax + b, the mean becomes aE[x] + b and the standard deviation is multiplied by |a|, so the deviation value is just the standardized variate rescaled to have mean 50 and standard deviation 10. Thinking of it this way makes the deviation value easy to remember. The Statistical Test also assumes you have it memorized.
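Here is a minimal sketch with made-up class scores (the numbers and the function name are my own): standardize each subject, then rescale to mean 50 and standard deviation 10.

import numpy as np

# Hypothetical class scores for two subjects (made-up numbers).
math = np.array([45, 60, 70, 80, 95], dtype=float)
science = np.array([70, 72, 75, 78, 80], dtype=float)

def deviation_value(scores):
    # Standardized variate (mean 0, std 1), rescaled to mean 50 and std 10.
    z = (scores - scores.mean()) / scores.std()
    return 10 * z + 50

# Student A scored 80 in math and 78 in science (index 3 in each array).
print('math    :', deviation_value(math)[3])
print('science :', deviation_value(science)[3])
# The raw math score is higher, but relative to each class the
# science score stands out more.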
It is difficult to compare the variances of distributions with different means. The coefficient of variation, computed as (standard deviation) / (mean), absorbs the difference in means and makes the spreads easier to compare. It appears even in Level 3 of the Statistical Test, so it needs to be remembered. The calculation is simple, so I think it can be used in real life too.
If you remember only the term "coefficient of variation", you may get confused about whether the mean goes in the denominator or the numerator. If you instead remember that the coefficient of variation is, first and foremost, a measure of the variation of the values, just like the variance, it is easy to recall that the standard deviation is the numerator and the mean is the denominator. You also need to be careful not to confuse it with the standardized variate, which transforms the mean to zero and the standard deviation to 1.
First, if the mean of x is E[x] and the variance is V[x], then for y = ax + b we have E[y] = aE[x] + b and V[y] = a^2 V[x].
Using this relationship, I created two datasets with different scales, computed the coefficient of variation for each, and checked how the values compare:
import numpy as np
import matplotlib.pyplot as plt

def main():
    sample_size = 1000
    a = 10
    b = 5
    # data follows N(1, 1); data2 = a*data + b follows N(a*1 + b, a^2).
    data = np.random.standard_normal(sample_size) + 1
    data2 = a * data + b
    print('data Mean:{} Var:{}'.format(np.mean(data), np.var(data)))
    print('data2 Mean:{} Var:{}'.format(np.mean(data2), np.var(data2)))
    # Coefficient of variation = standard deviation / mean.
    print('coefficient of var {} : {}'.format(np.std(data)/np.mean(data), np.std(data2)/np.mean(data2)))
    plt.subplot(2, 1, 1)
    plt.hist(data)
    plt.subplot(2, 1, 2)
    plt.hist(data2)
    plt.show()

if __name__ == "__main__":
    main()
For a random variable x with mean μ and variance σ^2, the transformation ax + b gives mean aμ + b and variance a^2 σ^2. Using this relationship: numpy's randn generates random numbers following N(0, 1), so when you want random numbers following a normal distribution with mean μ and variance σ^2, you compute σ * randn() + μ. This is also written on the numpy site: http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.randn.html
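A quick sketch of that recipe (μ and σ chosen arbitrarily):

import numpy as np

mu, sigma = 5.0, 2.0
# randn draws from N(0, 1); scaling by sigma and shifting by mu gives N(mu, sigma^2).
samples = sigma * np.random.randn(100000) + mu
print('mean:', samples.mean())  # close to 5
print('std :', samples.std())   # close to 2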
The skewness is the value that represents the asymmetry of a statistical distribution. Zero means symmetric; a positive value indicates that the distribution has a long tail on the right (positive) side, and a negative value the opposite. There is a function to calculate it in scipy, and of course in Excel as well. ※scipy http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.skew.html
In practice, it can be used to check for outliers. For example, a large positive value indicates that there is a large outlier in the positive direction. Outliers always have some impact when summarizing data (though the magnitude varies) and must be dealt with in some way, so skewness is a value you can use when deciding whether that is necessary. When working by hand you can spot outliers by drawing a histogram or scatter plot, but skewness is handy when you want a program to change its processing depending on outliers.
Kurtosis is a value that indicates how sharply peaked (and heavy-tailed) a distribution is compared to the normal distribution. In Excel you can calculate it with KURT.
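As a small check on synthetic data (note that scipy.stats.kurtosis reports excess kurtosis by default, so a normal distribution comes out near 0):

import numpy as np
from scipy.stats import skew, kurtosis

np.random.seed(0)
normal = np.random.standard_normal(100000)
# The exponential distribution has a long right tail: skewness 2, excess kurtosis 6.
expo = np.random.exponential(scale=1.0, size=100000)

print('normal      skew: {:.3f}  kurtosis: {:.3f}'.format(skew(normal), kurtosis(normal)))
print('exponential skew: {:.3f}  kurtosis: {:.3f}'.format(skew(expo), kurtosis(expo)))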
If you look at statistics-related source code, you may see SKEW and KURT; these refer to skewness and kurtosis. As an example, let's read the dirty_iris.csv data from Anchibe's book (search on Google and you will find the GitHub page), which includes outliers, and calculate the summary statistics and skewness as follows.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def main():
    iris = pd.read_csv('./dirty_iris.csv')
    print(iris.head(n=5))    # first few rows
    print(iris.describe())   # summary statistics per numeric column
    print(iris.skew())       # skewness per numeric column

if __name__ == "__main__":
    main()
Output:
sepallength sepalwidth petallength petalwidth class
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
sepallength sepalwidth petallength
count 150.000000 150.000000 150.000000
mean 6.464000 3.040667 3.738000
std 7.651181 0.578136 1.763221
min 4.300000 -1.000000 1.000000
25% 5.100000 2.800000 1.600000
50% 5.800000 3.000000 4.300000
75% 6.400000 3.300000 5.100000
max 99.000000 5.400000 6.900000
sepallength 12.030413
sepalwidth -1.681226
petallength -0.248723
dtype: float64
Incidentally, you can already sense that sepallength is suspicious just from describe() (the max of 99 and the huge std), and the value given by skew() also shows that sepallength stands out compared to the others.
Outliers should not simply be excluded; you need to use domain knowledge (knowledge of the field being analyzed, such as business or specialist knowledge) to consider why those values are in the data. This time, though, I decided to just replace them with NaN:
# Blank out (set to NaN) every row whose sepallength exceeds 10.
iris[np.abs(iris['sepallength']) > 10] = np.nan
iris.boxplot(by='class')
plt.show()
With that, a nice box plot came out.
The Poisson distribution is a statistical distribution used to analyze rare events. Its mean (and variance) equal λ, and the point is that λ = (number of trials n) × (probability p). For an event that occurs with probability p, you can obtain the probability by specifying the number of times x it actually occurs.
How to assign n, p, and x may differ depending on how you interpret the problem, but when using the Poisson distribution formula, different interpretations still lead to the same formula and the same probability, so you don't need to worry.
Since a different value is obtained for each number of occurrences x, finding, for example, the probability of k times or fewer requires repeating the calculation several times. But that is easy in a program, and the cumulative probability up to a given count can be computed with scipy, so it is not much of an issue.
scipy is helpful here: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.poisson.html
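For example, a minimal sketch (λ chosen arbitrarily) using poisson.pmf for the probability of exactly k occurrences and poisson.cdf for the probability of k or fewer:

from scipy.stats import poisson

lam = 2.0  # mean number of occurrences per period, lambda = n * p

for k in range(5):
    # pmf gives P(X = k); cdf sums the pmf up to k, giving P(X <= k).
    print('P(X = {}) = {:.4f}   P(X <= {}) = {:.4f}'.format(
        k, poisson.pmf(k, lam), k, poisson.cdf(k, lam)))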
On the scipy documentation page linked above, I was interested in how the graph with thin vertical bars is drawn, so I will quote and explain it.
>>> ax.plot(x, poisson.pmf(x, mu), 'bo', ms=8, label='poisson pmf')
>>> ax.vlines(x, 0, poisson.pmf(x, mu), colors='b', lw=5, alpha=0.5)
First, the values are drawn as blue circles with 'bo', and then vertical lines are drawn with ax.vlines. x is the position of each vertical line, 0 is its lower end, and poisson.pmf(x, mu) specifies its upper end. lw is an abbreviation for line width, and alpha specifies the transparency of the lines. Use plt.xlim() to restrict the range of the horizontal axis and make the graph easier to read.
By the way, pmf is an abbreviation for probability mass function. For continuous values we speak of the probability density function (PDF); for discrete values like the Poisson distribution it is the PMF.
Data visualization is important for exploratory data analysis, and in addition to the commonly used scatter plots and histograms, it is good to know the boxplot. It is required knowledge even for Level 3 of the Statistical Test, and it is well suited to comparing the distributions of multiple datasets.
A boxplot loses information such as the number of peaks that a histogram would show; the violin plot makes up for this, although its design is a little hard on the eyes.
The way outliers are expressed is a nice touch: values farther than 1.5 times the IQR ((third quartile) - (first quartile)) from the quartiles are commonly treated as outliers. The definition of an outlier should of course change depending on the data at hand, but I think the default works here.
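Here is a sketch of that rule with made-up numbers, mirroring the boxplot's default whisker rule:

import numpy as np

data = np.array([2.1, 2.5, 2.8, 3.0, 3.1, 3.3, 3.6, 9.5])  # made-up data
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1  # interquartile range

# Flag points farther than 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print('outliers:', data[(data < lower) | (data > upper)])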
If there is only one peak, you can roughly imagine the histogram from the shape of the boxplot, so the boxplot is useful for comparing several such datasets. The violin plot, despite its looks, is fine too, and you can draw it in Python, as sketched below.
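For instance, a minimal sketch contrasting the two on bimodal random data (all numbers made up):

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
# Two peaks: a boxplot hides them, a violin plot shows them.
bimodal = np.concatenate([np.random.normal(-2, 0.5, 500),
                          np.random.normal(2, 0.5, 500)])

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.boxplot(bimodal)
ax1.set_title('boxplot')
ax2.violinplot(bimodal)
ax2.set_title('violin plot')
plt.show()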