GCI Data Scientist Training Course

"GCI Data Scientist Training Course" is offered by the University of Tokyo (Matsuo Laboratory) "* Practical data The contents of the exercise part are published in Jupyter NoteBook format (CC-BY-NC-ND) in the Scientist Training Course and Deep Learning Course </ u> * ". Chapter 6 is "** Data processing using Pandas **", where you will learn how to use a powerful and high-speed data manipulation library for data analysis. Instead of the "Like" button for the valuable and wonderful teaching materials that you can learn in Japanese, I will post the answers you have solved. Please point out any mistakes.

6 Data processing using Pandas

6.1 Pandas

6.1.1 Hierarchical index

** <Practice 1> ** Let's extract only the Kyoto column for the following data.

hier_data_frame1 = DataFrame(np.arange(12).reshape((3,4))
                           ,index = [['c','d','d'],[1,2,1]]
                           ,columns = [['Kyoto','Nagoya','Hokkaido','Kyoto']
                                      ,['Yellow','Yellow','Red','Blue']]
                           )

hier_data_frame1.index.names =['key1','key2']
hier_data_frame1.columns.names =['city','color']
hier_data_frame1

hier_data_frame1["Kyoto"]

** <Practice 2> ** For the data of Exercise 1, put the city together and calculate the average value between the columns.

hier_data_frame1.mean(level = "city", axis = 1)

** <Practice 3> ** Let's calculate the total value of the rows for each key2 for the data of Exercise 1.

hier_data_frame1.sum(level = "key2")

6.1.2 Data merging

** <Practice 1> ** Let's do an inner join to the following two data tables.

pd.merge(attri_data_frame4, attri_data_frame5, how = "inner")

** <Practice 2> ** Let's outer join the table of attri_data_frame5 based on attri_data_frame4.

pd.merge(attri_data_frame4, attri_data_frame5, how = "outer")

** <Practice 3> ** Let's vertically join the following data to attri_data_frame4.

pd.concat([attri_data_frame4, attri_data_frame6])

6.1.3 Data manipulation and conversion

** <Practice 1> ** Use the data from "student-mat.csv" used in the previous chapter. Let's add a column that doubles the age.

student_data_math["2age"] = 2 * student_data_math["age"]
student_data_math[["age", "2age"]].head()

** <Practice 2> ** With the same data as above, let's divide the "absences" column into the following three bins and count the number of each. In addition, in the option setting of cut, the right side is closed section by default, but this time add right = False for 0 to be entered.

# pd.cut(student_data_math.absences, absences_bins, right=False)
pd.value_counts(pd.cut(student_data_math.absences, absences_bins, right=False))

> [5, 100)    151
> [1, 5)      129
> [0, 1)      115
> Name: absences, dtype: int64

** <Practice 3> ** With the same data as above, let's divide the "absences" column into three bins using qcut.

# pd.qcut(student_data_math.absences, 3)
pd.value_counts(pd.qcut(student_data_math.absences, 3))

> (-0.001, 2.0]    183
> (6.0, 75.0]      115
> (2.0, 6.0]        97
> Name: absences, dtype: int64

6.1.4 Data aggregation and group operations

** <Practice 1> ** Let's use "student-mat.csv" that we used earlier and use "student-mat.csv" to aggregate pandas. First, let's find the average score of G1 with school as the axis.

student_data_math = pd.read_csv("student-mat.csv",sep=";")
student_data_math.groupby("school")["G1"].mean()

> school
> GP    10.939828
> MS    10.673913
> Name: G1, dtype: float64

** <Practice 2> ** Next, let's find the average score of G1, G2, and G3, centering on school and gender.

student_data_math.groupby(["school", "sex"])[["G1", "G2", "G3"]].mean()

** <Practice 3> ** Next, let's calculate the maximum and minimum values of G1, G2, and G3 together, centering on school and gender.

student_data_math.groupby(["school", "sex"])[["G1", "G2", "G3"]].agg(["max", "min"])

6.2 Basics of handling missing data and outliers

6.2.1 How to handle missing data

** <Practice 1> ** For the following data, if there is even one column of NaN, delete it and display the result.

sample_data_frame2.dropna()

** <Practice 2> ** Fill NaN with 0 for the data prepared above.

sample_data_frame2.fillna(0)

** <Practice 3> ** For the data prepared above, fill in NaN with the average value of each column.

sample_data_frame2.fillna(sample_data_frame2.mean())

6.2.2 How to handle abnormal data

6.3 Comprehensive problem

** 6.3.1 Comprehensive Problem 1 ** Use the previously used "student-mat.csv" to answer the following questions.

(1) For the above data, calculate the average score of G1 by age x gender, and create a table with age on the vertical axis and gender on the horizontal axis.

(2) For the result table displayed in (1), let's display the result of deleting all rows (records) that are NA.

student_data_math = pd.read_csv("student-mat.csv",sep=";")

# (1)
print("(1)")
print(student_data_math.groupby(["age", "sex"])["G1"].mean().unstack())

# (2)
print("\n(2)")
print(student_data_math.groupby(["age", "sex"])["G1"].mean().unstack().dropna())

> (1)
> sex          F          M
> age                      
> 15   10.052632  12.250000
> 16   10.203704  11.740000
> 17   11.103448  10.600000
> 18   10.883721  10.538462
> 19   10.642857   9.700000
> 20   15.000000  13.000000
> 21         NaN  10.000000
> 22         NaN   6.000000
> 
> (2)
> sex          F          M
> age                      
> 15   10.052632  12.250000
> 16   10.203704  11.740000
> 17   11.103448  10.600000
> 18   10.883721  10.538462
> 19   10.642857   9.700000
> 20   15.000000  13.000000

[PYTHON] Solving Exercises in GCI Data Scientist Training Course Chapter6