[PYTHON] Data Scientist Training Course Chapter 3 Day 1 + 2

Scope

This chapter covers the following topics.

- Descriptive statistics and simple regression analysis
- Descriptive statistics
- Simple regression analysis
- Comprehensive problem
- Statistical basis and visualization
- Lorenz curve and Gini coefficient

A few of these items look hard to follow and a little ominous, but basically the calculation itself can be left to Python.

read_csv

To actually solve the problems, the data is read from a CSV file, using read_csv in Pandas. I recall there being several other ways to load a DataFrame, but something caught my eye when I looked at the read_csv parameters.

?pd.read_csv
Signature:
pd.read_csv(
    filepath_or_buffer,
    sep=',',
    delimiter=None,
    header='infer',
    names=None

As shown, you can specify both sep and delimiter. Wondering whether they mean the same thing, I compared the two:

pd.read_csv("xxx.csv", sep=";")
pd.read_csv("xxx.csv", delimiter=";")

The result was the same either way. There is a good explanation of this on StackOverflow.

What is the difference between sep and delimiter attributes in pandas.read_csv() method?

To put it plainly, delimiter is an alias for sep: if delimiter is not specified (= None), the value of sep is used. In other words, it doesn't matter which one you use. However, considering consistency with other functions (to_csv etc.), sep seems preferable. The source code also calls delimiter "the annoying corner case". Let's use sep.
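To check this without a file on disk, here is a minimal sketch that parses the same semicolon-separated text both ways. The sample text and its column names are made up for illustration; only sep and delimiter come from the read_csv signature.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data standing in for "xxx.csv"
csv_text = "school;sex;age\nGP;F;18\nGP;M;17\n"

df_sep = pd.read_csv(io.StringIO(csv_text), sep=";")
df_delimiter = pd.read_csv(io.StringIO(csv_text), delimiter=";")

# Both calls should split the text into the same three columns
print(df_sep.columns.tolist())
print(df_sep.equals(df_delimiter))
```

If the separator were left at its default of a comma, each row of this text would presumably end up in a single column, which is an easy way to notice that the wrong separator was used.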

Data quality

When there are multiple columns, you need to correctly recognize what kind of data each one holds. Does it contain null values? Is it quantitative or qualitative? Quantitative data can be used as values in arithmetic. Qualitative data can be used as a category for grouping quantitative data, or, when it has an order such as a rank, as a key for sorting.
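These checks can be sketched in a few lines of Pandas. The DataFrame below is a made-up sample, not the course data, and treating "numeric dtype" as "quantitative" is a simplifying assumption (a numeric code can still be a category).

```python
import numpy as np
import pandas as pd

# Hypothetical sample: "sex" is qualitative, "age" and "G1" are quantitative
df = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "age": [18, 17, 15, 16],
    "G1": [5.0, np.nan, 7.0, 15.0],
})

# Number of missing values per column
null_counts = df.isnull().sum()

# Split the columns by dtype as a first guess at quantitative vs qualitative
quantitative = df.select_dtypes(include="number").columns.tolist()
qualitative = df.select_dtypes(exclude="number").columns.tolist()

# A qualitative column used as a category for a quantitative one
mean_age_by_sex = df.groupby("sex")["age"].mean()

print(null_counts)
print(quantitative, qualitative)
print(mean_age_by_sex)
```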

Statistical analysis

Statistical analysis can be divided into descriptive statistics and inferential statistics.

- Descriptive statistics: the purpose is to organize the data in an easy-to-read manner and to grasp its overall characteristics.
- Inferential statistics: performing more precise analysis using a model based on a probability distribution.

Roughly speaking, descriptive statistics says "the data probably looks like this", and inferential statistics is where a predictive model comes in? Considering that my goal is to learn AI / machine learning, the latter is overwhelmingly the more relevant, but inferential statistics seems to be left for the next chapter.
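As a small illustration of the descriptive side, describe() in Pandas summarizes a column in one call. The values below are made up and merely stand in for a grade column.

```python
import pandas as pd

# Hypothetical grades standing in for one column of the course data
grades = pd.Series([5, 7, 15, 6, 12], name="G1")

# count, mean, std, min, quartiles and max in one table
summary = grades.describe()
print(summary)
```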

Box plot

plt.boxplot(student_data_math.G1)

This draws a box plot. At first glance it looks like a candlestick on a stock price chart. The lower and upper edges of the box are the 25th and 75th percentiles. In English it is called a box plot, exactly as the function name says.
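To see the numbers behind the box, here is a minimal sketch using matplotlib.cbook.boxplot_stats, which, as far as I know, is the helper that computes what plt.boxplot draws with default settings. The scores are made up and stand in for student_data_math.G1.

```python
import numpy as np
from matplotlib import cbook

# Hypothetical scores standing in for student_data_math.G1
scores = np.array([5, 7, 8, 10, 10, 11, 12, 13, 14, 15, 19])

# boxplot_stats returns one dict of statistics per dataset
stats = cbook.boxplot_stats(scores)[0]

# The same quartiles computed directly with NumPy for comparison
q1, q3 = np.percentile(scores, [25, 75])

print(stats["q1"], stats["med"], stats["q3"])  # box edges and median line
print(stats["whislo"], stats["whishi"])        # ends of the whiskers
print(q1, q3)
```

The whiskers are not simply the minimum and maximum: by default they stop at the most extreme points within 1.5 times the interquartile range of the box, and anything beyond that is drawn as an outlier.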

This is not limited to box plots, but it's a good idea to remember, as far as possible, which English term a function name comes from. This time I learned that boxplot is the function that draws a box plot, but if you already know that the chart is called a "box plot" in English, you can find the function in the function list yourself.

Formulas, formulas, formulas

The coefficient of variation is the standard deviation divided by the mean.

student_data_math.std(numeric_only=True) / student_data_math.mean(numeric_only=True)
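A minimal sketch of the same calculation on a made-up DataFrame (the column names mirror the ones in this post, the values are invented). The numeric_only=True argument is there because a text column such as "sex" has no standard deviation or mean.

```python
import pandas as pd

# Hypothetical data; only the column names follow the post
df = pd.DataFrame({
    "G1": [5, 7, 15, 6, 12],
    "absences": [0, 4, 30, 2, 10],
    "sex": ["F", "M", "F", "M", "F"],
})

# One coefficient of variation per numeric column
cv = df.std(numeric_only=True) / df.mean(numeric_only=True)
print(cv)
```

Because the standard deviation is divided by the mean, the result has no unit, so the spread of columns with very different scales can be compared directly.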

The function that gives the variance is var. The covariance, which is used when considering how two or more variables vary together, is given by cov. Mathematically, the covariance is the average of the products of the two variables' deviations from their respective means.
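The definitions can be checked by hand against var and cov. This is a sketch on made-up values; note that Pandas divides by n - 1 (the sample variance and covariance) rather than by n.

```python
import numpy as np
import pandas as pd

# Hypothetical grades standing in for student_data_math.G1 and G3
g1 = pd.Series([5, 7, 15, 6, 12])
g3 = pd.Series([6, 10, 15, 10, 11])

# Deviations from the mean
dev1 = g1 - g1.mean()
dev3 = g3 - g3.mean()
n = len(g1)

var_manual = (dev1 ** 2).sum() / (n - 1)    # sample variance of G1
cov_manual = (dev1 * dev3).sum() / (n - 1)  # sample covariance of G1 and G3

print(var_manual, g1.var())
print(cov_manual, g1.cov(g3))
```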

Finally, the correlation coefficient is calculated to indicate how strongly the two variables are related. Here, the Pearson correlation coefficient is computed with SciPy's pearsonr function.

sp.stats.pearsonr(student_data_math.G1, student_data_math.G3)
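The Pearson coefficient is the covariance divided by the product of the two standard deviations, so it can be rebuilt from the pieces above. A minimal sketch on made-up values (not the course data):

```python
import numpy as np
from scipy import stats

# Hypothetical grades standing in for student_data_math.G1 and G3
g1 = np.array([5, 7, 15, 6, 12], dtype=float)
g3 = np.array([6, 10, 15, 10, 11], dtype=float)

# Covariance divided by the product of the standard deviations
cov = np.cov(g1, g3)[0, 1]
r_manual = cov / (g1.std(ddof=1) * g3.std(ddof=1))

# pearsonr returns the coefficient together with a p-value
r, p_value = stats.pearsonr(g1, g3)

print(r_manual, r, p_value)
```

The coefficient always lies between -1 and 1; the second value returned by pearsonr is a p-value, not part of the coefficient itself.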

Well, that's all for today!