The section headings for this chapter are as follows:
--Descriptive statistics and simple regression analysis
--Descriptive statistics
--Simple regression analysis
--Comprehensive problem
--Statistical basis and visualization
--Lorenz curve and Gini coefficient
A few of these items look hard to follow and smell a little like trouble, but basically the calculations can be left to Python.
read_csv
To actually solve the problems, first read the data from a CSV file. To do this, use read_csv in Pandas. I believe there are several other ways to load data into a DataFrame, but when I looked at the read_csv parameters, something caught my attention.
?pd.read_csv
Signature:
pd.read_csv(
    filepath_or_buffer,
    sep=',',
    delimiter=None,
    header='infer',
    names=None
As the signature shows, you can specify both sep and delimiter. Do they mean the same thing? I wondered, so I tried both:
pd.read_csv("xxx.csv", sep=";")
pd.read_csv("xxx.csv", delimiter=";")
I compared the results, but there was no difference. There is a good explanation of this on StackOverflow.
What is the difference between sep and delimiter attributes in pandas.read_csv() method?
Put simply, if delimiter is not specified (= None), the value of sep is used for it. In other words, it doesn't matter which one you use.
However, for consistency with other functions (to_csv etc.), sep seems preferable. The source code even calls delimiter "the annoying corner case". Let's use sep.
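The comparison above can be reproduced without a file on disk. This is a minimal sketch, assuming a small made-up semicolon-separated text in place of "xxx.csv"; the column names and values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data standing in for "xxx.csv"
csv_text = "name;G1;G3\nalice;10;12\nbob;7;8\n"

# Parse the same text once with sep and once with delimiter
df_sep = pd.read_csv(io.StringIO(csv_text), sep=";")
df_delim = pd.read_csv(io.StringIO(csv_text), delimiter=";")

# True when both DataFrames have identical columns, dtypes, and values
same = df_sep.equals(df_delim)
print(same)
print(list(df_sep.columns))
```

Both calls should produce the same DataFrame, which matches the observation that the two parameters are interchangeable here.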
When there are multiple columns, you need to correctly understand the nature of the data in each one. Does it contain null values? Is it quantitative or qualitative? Quantitative data can be used as values in mathematical calculations. Qualitative data can be used as categories for grouping quantitative data, or as values for ordering data, such as ranks.
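These checks might look like the following in Pandas. This is only a sketch: the DataFrame is made up for illustration, and treating "numeric dtype" as quantitative is a simplifying assumption (a numeric code can still be qualitative).

```python
import numpy as np
import pandas as pd

# Hypothetical data: one qualitative column and two quantitative ones
df = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "age": [18, 17, np.nan, 16],
    "G1": [5, 10, 12, 8],
})

# Number of null values per column
null_counts = df.isnull().sum()

# Split columns by dtype: numeric ones are treated as quantitative here
quantitative = df.select_dtypes(include="number").columns.tolist()
qualitative = df.select_dtypes(exclude="number").columns.tolist()

# A qualitative column used as a category for grouping quantitative data
mean_g1_by_sex = df.groupby("sex")["G1"].mean()

print(null_counts)
print(quantitative, qualitative)
print(mean_g1_by_sex)
```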
Statistical analysis can be divided into descriptive statistics and inferential statistics.
--Descriptive statistics: The purpose is to organize the data in an easy-to-read form and to grasp its overall characteristics.
--Inferential statistics: Performing a more precise analysis using a model based on a probability distribution.
Roughly speaking, I suppose descriptive statistics says "the data looks like this", while building a predictive model belongs to inferential statistics. Since my goal is to learn AI / machine learning, I am overwhelmingly interested in the latter, but inferential statistics seems to be covered in the next chapter.
plt.boxplot(student_data_math.G1)
This draws a box plot. At first glance it looks like a candlestick on a stock price chart. The lower and upper edges of the box (the candle body) are the 25th and 75th percentiles. In English it is simply called a box plot.
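The values behind the box can also be computed directly. The sketch below uses a made-up series in place of student_data_math.G1, and the whisker limits assume the common 1.5 x IQR rule, which is matplotlib's default for boxplot.

```python
import pandas as pd

# Hypothetical grades standing in for student_data_math.G1
g1 = pd.Series([5, 7, 8, 10, 11, 12, 14, 15, 19])

# Edges of the box and the line inside it
q1 = g1.quantile(0.25)      # lower edge of the box (25th percentile)
median = g1.quantile(0.50)  # line inside the box
q3 = g1.quantile(0.75)      # upper edge of the box (75th percentile)

# Interquartile range, i.e. the height of the box
iqr = q3 - q1

# Assumed 1.5 x IQR rule: points outside these limits are drawn as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

print(q1, median, q3, iqr)
```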
This is not limited to box plots, but it is a good idea to remember, as far as possible, which English term a function name comes from. This time I learned that boxplot is the function that draws a box plot, but if you already know that the chart is called "Box Plot" in English, you can find the function in the function list yourself.
The coefficient of variation is the standard deviation divided by the average.
student_data_math.std(numeric_only=True) / student_data_math.mean(numeric_only=True)
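As a small worked example of this ratio, the sketch below uses a made-up DataFrame with two numeric columns (the column names and values are hypothetical). Note that Pandas' std() uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical numeric columns on very different scales
df = pd.DataFrame({
    "G1": [2, 4, 4, 4, 5, 5, 7, 9],
    "absences": [0, 0, 2, 4, 6, 10, 20, 38],
})

# Coefficient of variation per column: standard deviation / mean
cv = df.std() / df.mean()

print(cv)
```

Because the coefficient of variation has no unit, it lets you compare the spread of columns whose scales differ, which the raw standard deviations would not.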
The function that computes the variance is var. The function that computes the covariance, which is used when looking at how two or more variables vary together, is cov. Mathematically, the covariance is the average of the products of the two variables' deviations from their means.
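The definition can be checked against cov by hand. This is a sketch with made-up values; note that Pandas divides by n - 1 (the sample covariance) rather than n.

```python
import pandas as pd

# Hypothetical grades for two terms
df = pd.DataFrame({
    "G1": [1, 2, 3, 4, 5],
    "G3": [2, 4, 5, 4, 5],
})

# Variance of one column and the covariance matrix of both
var_g1 = df["G1"].var()
cov_g1_g3 = df.cov().loc["G1", "G3"]

# Manual covariance: products of deviations from the mean, divided by n - 1
dev_g1 = df["G1"] - df["G1"].mean()
dev_g3 = df["G3"] - df["G3"].mean()
manual_cov = (dev_g1 * dev_g3).sum() / (len(df) - 1)

print(var_g1, cov_g1_g3, manual_cov)
```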
Finally, the correlation coefficient is calculated to indicate how strongly the two variables are related. Here, the Pearson correlation coefficient is calculated with the pearsonr function.
sp.stats.pearsonr(student_data_math.G1, student_data_math.G3)
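To see how this connects to the covariance, the sketch below compares pearsonr with the covariance divided by the product of the two standard deviations. The arrays are made up in place of the G1 and G3 columns.

```python
import numpy as np
from scipy import stats

# Hypothetical grades standing in for G1 and G3
g1 = np.array([1, 2, 3, 4, 5], dtype=float)
g3 = np.array([2, 4, 5, 4, 5], dtype=float)

# pearsonr returns the correlation coefficient and a p-value
r, p_value = stats.pearsonr(g1, g3)

# Manual version: covariance / (std of g1 * std of g3), all with ddof=1
manual_r = np.cov(g1, g3)[0, 1] / (g1.std(ddof=1) * g3.std(ddof=1))

print(r, manual_r, p_value)
```

The coefficient lies between -1 and 1, so unlike the covariance it does not depend on the scale of the variables.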
Well, that's all for today!