Aidemy 2020/10/30
Hello, this is Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I enrolled in the AI-specialized school "Aidemy" to study. I want to share what I learn there, so I am summarizing it on Qiita. I am very happy that so many people read the previous summary article. Thank you! This is the second post of "Data Analysis Titanic". Thank you for reading along.
What to learn this time
・④ Pattern analysis and data analysis (continued from the previous post)
・③ Data shaping, creation, and cleansing (revisited) → __creation of new features__
・When Pclass = 1, the mean of Survived is 0.62. This exceeds 0.5, so there is a significant (positive) correlation, and Pclass should be used as a model feature.
・Similarly, when Sex = female, the mean of Survived is 0.74, so Sex is also used as a model feature.
・Since __SibSp and Parch__ showed no significant correlation, __create a new feature by combining the two__.
・__To check the hypothesis that "children have a high survival rate"__, __divide the Age data into ranges__ and plot the distribution as a __histogram__.
・Note that in this histogram the horizontal axis is Age, while the vertical axis is the number of records, that is, the number of passengers.
・A histogram can be created with __"df.hist()"__. The number of bins (how many classes the data is divided into) is specified with the __"bins="__ argument. To see the counts for Survived = 0 and Survived = 1 separately, __specify the grouping with "by="__.
・As seen in Chapter 1, __Age contains missing values__, so remove them first with dropna.
·code
・ Result (graph)
・The results show that __the "0-5 years old" group has a high survival rate__.
・Looking at the overall counts, __the "15-35 years old" group is the largest__.
・Next, the Age histogram from the previous section is further broken down by __Pclass__.
・The code draws a histogram of Age for each value of __Survived__ and splits it with __"by=train_df['Pclass']"__.
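As a rough sketch of the histogram step described above: the tiny DataFrame is made up (it is not the real Titanic data), and the "Agg" backend is only an assumption so that the example can run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no display is needed
import pandas as pd

# Made-up sample standing in for train_df (not the real Titanic data)
train_df = pd.DataFrame({
    "Age": [2, 4, 22, 30, None, 35, 54, 19, None, 28],
    "Survived": [1, 1, 0, 0, 1, 1, 0, 0, 0, 1],
})

# Age contains missing values, so drop those rows first
age_df = train_df.dropna(subset=["Age"])

# One Age histogram per Survived value; bins sets the number of classes
axes = age_df.hist(column="Age", by="Survived", bins=8)
```

Passing a different column to `by=` (for example Pclass) splits the histograms by that column instead.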
·code ・ Graph (dead) ・ Graph (survivor)
・From these graphs we can see that __"deaths are overwhelmingly concentrated in Pclass = 3", "Pclass = 1 has many survivors", and "many of the survivors in Pclass = 2 and 3 are aged 0 to 5"__. All of these match the hypotheses.
・For each value of __"Embarked"__ (__['C', 'Q', 'S']__), build a pivot table of the relationship between "Survived" and "Pclass / Sex", then plot it with __"plt.plot()"__.
・Create the pivot table in the same way as in Chapter 1. This time, __don't forget to filter on the value of Embarked in the conditional expression__. Also, since there are two grouping keys, Pclass and Sex, pass them as a list.
・Split the resulting pivot table into male and female by Sex, and sort each part by Pclass.
・Plot this with __"plt.plot()"__. The x-axis is __['1', '2', '3']__ and the y-axis is the __Survived value__ from the pivot table.
・Code (for Embarked = 'C'; the other two are created in the same way. The omitted part is the same as last time.)
・Graph (only for Embarked = 'C'; the other two look similar)
・From this graph, we can say that __"the survival rate of women is overwhelmingly high"__, which also matches the hypothesis. Although not shown here, it was also found that __"males with Embarked = 'Q' have a very low survival rate"__.
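The pivot-table part of this section could look roughly like the following. This is only a sketch on made-up rows, and the variable names (pivot_c, female_y, male_y) are my own; the drawing step is left as a comment.

```python
import pandas as pd

# Made-up sample standing in for train_df (not the real Titanic data)
train_df = pd.DataFrame({
    "Embarked": ["C", "C", "C", "C", "C", "C", "S", "Q"],
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "female", "male", "male", "male"],
    "Survived": [1, 0, 1, 0, 1, 0, 0, 0],
})

# Keep only Embarked == 'C', then average Survived per (Pclass, Sex)
pivot_c = (
    train_df[train_df["Embarked"] == "C"][["Pclass", "Sex", "Survived"]]
    .groupby(["Pclass", "Sex"], as_index=False)
    .mean()
)

# Split by sex and sort by Pclass so each series lines up with the x-axis
female = pivot_c[pivot_c["Sex"] == "female"].sort_values("Pclass")
male = pivot_c[pivot_c["Sex"] == "male"].sort_values("Pclass")

x = ["1", "2", "3"]
female_y = female["Survived"].tolist()
male_y = male["Survived"].tolist()
# These could then be drawn with plt.plot(x, female_y) and plt.plot(x, male_y)
```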
・This time, take the previous section's setup and __replace "Pclass" with "Survived" and "Survived" with "Fare"__, then plot it in exactly the same way.
・Once the pivot table of "Fare" against "Survived / Sex" is built, split it by whether Survived is 0 or 1. This time the result is drawn with __subplots__, using __"plt.subplot()"__. Each panel is a bar chart (__plt.bar()__) with __"Sex"__ on the horizontal axis and __"Fare"__ on the vertical axis; the left panel is __"Survived == 0"__ and the right panel is __"Survived == 1"__.
・From here on, we __convert, create, and complete features__.
・First, as decided in Chapter 1, __delete "Ticket" and "Cabin", which have many missing values and duplicates__.
・Because columns are being deleted, use __"drop(axis=1)"__.
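A minimal sketch of the column deletion, assuming a tiny made-up DataFrame in place of the real training data:

```python
import pandas as pd

# Made-up sample standing in for train_df
train_df = pd.DataFrame({
    "Survived": [0, 1],
    "Ticket": ["A/5 21171", "PC 17599"],
    "Cabin": [None, "C85"],
    "Fare": [7.25, 71.28],
})

# axis=1 means "drop columns" (axis=0 would drop rows)
train_df = train_df.drop(["Ticket", "Cabin"], axis=1)
```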
・Similarly, __"Name" and "PassengerId", which clearly do not correlate with Survived, will also be deleted__. However, __the titles (Mr, Mrs, Dr, etc.) contained in "Name" may correlate with Survived__, so check that first.
・Use a __regular expression__ to extract the title from Name. To pull out the part of a str column that matches a regular expression, use __"str.extract('regular expression')"__. The regular expression here is `'([A-Za-z]+)\.'`, because a title such as "Dr." consists of one or more uppercase or lowercase letters followed by ".". The second argument, __"expand=False"__, makes the result come back as a __Series__ (with expand=True it would be a DataFrame).
・Store the extracted values in a new column (feature) called 'Title'.
・__Cross-tabulate__ 'Title' against 'Sex'. In this example, that means counting how many rows of each Sex appear for each Title category, such as Dr or Mrs.
・Use the __"pd.crosstab()"__ function for cross tabulation. Pass the data for the result rows ('Title' here) as the first argument, and the data for the result columns ('Sex') as the second.
・ Code ![Screenshot 2020-10-23 15.28.45.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/3a3d54fa-669e-ec58-d0cd-c0463cabf909.png)
・ Result (only part) ![Screenshot 2020-10-23 15.29.14.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/b946ce52-b42c-6610-b702-733193b204d0.png)
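In text form, the extraction and cross tabulation might look like this. It is a sketch on a handful of illustrative names, not the full dataset.

```python
import pandas as pd

# A few illustrative rows standing in for train_df
train_df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley",
        "Heikkinen, Miss. Laina",
        "Brewe, Dr. Arthur Jackson",
        "Palsson, Master. Gosta Leonard",
    ],
    "Sex": ["male", "female", "female", "male", "male"],
})

# Capture the run of letters that comes right before a literal "."
# expand=False returns a Series (one capture group), not a DataFrame
train_df["Title"] = train_df["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)

# Rows = Title, columns = Sex, cells = number of passengers
title_by_sex = pd.crosstab(train_df["Title"], train_df["Sex"])
```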
・Among the titles that appear here, group the infrequent ones into a single category called __'Rare'__. Also replace 'Mlle' with 'Miss' and 'Mme' with 'Mrs', since they have the same meaning. These replacements can be done with __"replace()"__.
・Once this is done, create a pivot table and check the correlation.
・ Code ![Screenshot 2020-10-23 15.53.28.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/f5fbcf9e-4689-7002-c398-1e91950f9a23.png)
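A small sketch of the grouping and replacement. The sample rows are made up, and the list of titles treated as rare (rare_titles) is only an assumption for illustration.

```python
import pandas as pd

# Made-up sample; Title is assumed to have been extracted already
train_df = pd.DataFrame({
    "Title": ["Mr", "Mlle", "Mme", "Dr", "Miss", "Rev", "Mrs", "Master"],
    "Survived": [0, 1, 1, 0, 1, 0, 1, 1],
})

# Assumption: which titles count as infrequent in this tiny sample
rare_titles = ["Dr", "Rev"]
train_df["Title"] = train_df["Title"].replace(rare_titles, "Rare")

# Merge spellings that mean the same thing
train_df["Title"] = train_df["Title"].replace("Mlle", "Miss")
train_df["Title"] = train_df["Title"].replace("Mme", "Mrs")

# Pivot: mean of Survived for each Title
title_pivot = train_df[["Title", "Survived"]].groupby("Title", as_index=False).mean()
```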
・To treat each Title as a number, convert it with __{"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}__.
・Prepare the above mapping (a dictionary) and apply the __"map()"__ function to dataset['Title'].
・After this is done, __delete "Name" and "PassengerId" as originally planned__.
・ Code ![Screenshot 2020-10-23 16.06.21.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/5943862b-98ea-c3c3-5043-4354df7e6780.png)
・ Result ![Screenshot 2020-10-23 16.06.21.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/5943862b-98ea-c3c3-5043-4354df7e6780.png)
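Written out as text, the mapping step could be sketched as follows (made-up rows; note that any title missing from the dictionary would become NaN after map()).

```python
import pandas as pd

# Made-up sample; Title has already been cleaned up
train_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["A, Mr. X", "B, Mrs. Y", "C, Miss. Z", "D, Dr. W"],
    "Title": ["Mr", "Mrs", "Miss", "Rare"],
    "Survived": [0, 1, 1, 0],
})

# Correspondence table from the text
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
train_df["Title"] = train_df["Title"].map(title_mapping)

# Name and PassengerId are no longer needed
train_df = train_df.drop(["Name", "PassengerId"], axis=1)
```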
・Using the __map()__ function from the previous section, next convert categorical data to binary data.
・Here, convert 'Sex' to male: 0, female: 1.
・Passing a __data type__ to the __astype()__ method returns a new object with its values cast to that type (for a DataFrame, every column is converted). Since we want integers here, pass int as the argument.
·code
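A minimal sketch of this conversion, assuming a tiny made-up DataFrame:

```python
import pandas as pd

# Made-up sample standing in for train_df
train_df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# map() swaps each category for its number, astype(int) fixes the dtype
train_df["Sex"] = train_df["Sex"].map({"male": 0, "female": 1}).astype(int)
```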
・After deleting, converting, and creating data, the next step is __completion__. Completion means __estimating a value and assigning it__ where the data is Null or NaN.
・First, complete the continuous numerical feature __"Age"__. There are three ways to do this:
1: Generate __random numbers based on the mean__
2: __Refer to other, correlated features__
3: __Combine 1 and 2__; specifically, generate random numbers based on the mean and standard deviation
・This time, method __"2"__ is used. Two features correlate with "Age": __"Sex" and "Pclass"__. Using these two, we obtain an estimate of Age (the median age).
・First, prepare an empty array of __2 rows and 3 columns to hold the Age estimates__. When the array size is known in advance like this, __"np.zeros()"__ is convenient: pass the shape as the argument and it creates an array filled with 0 (__effectively empty__).
・Next, calculate the Age estimate (the median age). Extract the 'Age' values (excluding NaN) for every combination of 'Sex' and 'Pclass'. There are __"2 * 3"__ combinations, which is why the array shape is __(2, 3)__. Taking the __median__ of each group with __"median()"__ gives the median age.
・Store the median for each case in the (2, 3) array, and you are done.
・ Code ![Screenshot 2020-10-24 11.45.22.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/8635098f-1cc3-b47d-3fde-5d38456d4077.png)
·result
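The idea can be sketched as below on made-up rows. It assumes Sex has already been converted to 0/1, and the final loop that writes the estimates back into the missing cells is my own addition to show how the array would be used.

```python
import numpy as np
import pandas as pd

# Made-up sample; Sex is assumed to be already mapped to 0/1
train_df = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1, 0, 1],
    "Pclass": [1, 2, 3, 1, 2, 3, 1, 3],
    "Age":    [40.0, 30.0, 20.0, 35.0, 28.0, 18.0, np.nan, np.nan],
})

# 2 sexes x 3 classes -> a (2, 3) array filled with zeros
guess_ages = np.zeros((2, 3))

for i in range(2):        # Sex: 0, 1
    for j in range(3):    # Pclass: 1, 2, 3
        ages = train_df[(train_df["Sex"] == i) & (train_df["Pclass"] == j + 1)]["Age"].dropna()
        guess_ages[i, j] = ages.median()

# Assumed follow-up: fill each missing Age with its group's median
for i in range(2):
    for j in range(3):
        mask = train_df["Age"].isnull() & (train_df["Sex"] == i) & (train_df["Pclass"] == j + 1)
        train_df.loc[mask, "Age"] = guess_ages[i, j]
```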
AgeBand
・After completing Age, convert the continuous Age values to __discrete values__. As described under "Creation" in the Chapter 1 guidelines, __dividing the values into ranges__ (= converting to discrete data) makes prediction easier.
・Converting to discrete values is called __"binning"__ or __"bin division"__. Use __"pd.cut()"__ for this: pass the data as the first argument and __the number of bins__ as the second. Here Age is split into 5 bins and stored as discrete data in a new feature called __"AgeBand"__.
・To check the correlation between AgeBand and Survived, also create a __pivot table__.
・ Code ![Screenshot 2020-10-24 15.05.02.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/7e4a2eb9-9049-f7bf-089f-9bc99492ab70.png)
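A rough text version of the binning step, using made-up ages from 0 to 80 so that the five bins come out 16 years wide:

```python
import pandas as pd

# Made-up sample standing in for train_df
train_df = pd.DataFrame({
    "Age":      [0.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 1, 0, 0],
})

# Split the Age range into 5 bins of equal width
train_df["AgeBand"] = pd.cut(train_df["Age"], 5)

# Pivot: mean of Survived for each AgeBand
age_pivot = (
    train_df[["AgeBand", "Survived"]]
    .groupby("AgeBand", as_index=False, observed=False)
    .mean()
)
```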
・Next, convert the binned Age to __ordinal data__. Specifically, referring to AgeBand, convert Age to 0 if it is "0 ~ 16", 1 if "16 ~ 32", 2 if "32 ~ 48", and 3 if "48 ~ 64".
・To convert, select the rows in each Age range with __loc[conditional expression, column to convert]__ and assign the number above.
・Once this is done, delete AgeBand with __drop()__.
・ Code ![Screenshot 2020-10-24 15.06.17.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/91f81220-3eec-8b5e-a75e-54730478ecb5.png)
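The loc-based conversion might look like this in text form. The rows are made up, and assigning 4 to ages above 64 is my assumption for the fifth bin, which the list above does not spell out.

```python
import pandas as pd

# Made-up sample standing in for train_df
train_df = pd.DataFrame({"Age": [4.0, 16.0, 22.0, 40.0, 60.0, 70.0]})
train_df["AgeBand"] = pd.cut(train_df["Age"], 5)

# loc[condition, column] = value, working upward from the lowest range
train_df.loc[train_df["Age"] <= 16, "Age"] = 0
train_df.loc[(train_df["Age"] > 16) & (train_df["Age"] <= 32), "Age"] = 1
train_df.loc[(train_df["Age"] > 32) & (train_df["Age"] <= 48), "Age"] = 2
train_df.loc[(train_df["Age"] > 48) & (train_df["Age"] <= 64), "Age"] = 3
train_df.loc[train_df["Age"] > 64, "Age"] = 4  # assumption: remaining fifth bin

# AgeBand was only a reference, so remove it
train_df = train_df.drop(["AgeBand"], axis=1)
train_df["Age"] = train_df["Age"].astype(int)
```

Working from the lowest range upward matters here: the values already replaced (0, 1, 2, ...) are all below the next range's lower bound, so they are not converted a second time.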
FamilySize
・As described under "Creation" in the guidelines, combine the related features __"Parch" and "SibSp"__ into a new feature called __"FamilySize"__, which represents the __number of family members__.
・The method is simply to __take the two columns and add them together__. However, the passenger themself also counts as a family member, so don't forget the __"+1"__.
・After creating a new feature, __take its mean against Survived to check the correlation__.
・ Code ![Screenshot 2020-10-24 15.07.04.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/20905888-8aa5-75a6-c0de-58a3afe6aa48.png)
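A minimal sketch of FamilySize on made-up rows:

```python
import pandas as pd

# Made-up sample standing in for train_df
train_df = pd.DataFrame({
    "SibSp":    [1, 0, 3, 0],
    "Parch":    [0, 0, 2, 0],
    "Survived": [1, 0, 0, 1],
})

# Siblings/spouses + parents/children + the passenger themself
train_df["FamilySize"] = train_df["SibSp"] + train_df["Parch"] + 1

# Mean of Survived per FamilySize to check the correlation
family_pivot = train_df[["FamilySize", "Survived"]].groupby("FamilySize", as_index=False).mean()
```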
IsAlone
・Make FamilySize more abstract by classifying passengers as __"alone or with family"__. That is, FamilySize = 1 becomes "IsAlone = 1" and everything else becomes "IsAlone = 0".
・To build it, __first create the IsAlone feature with every value set to 0__, then set it to 1 only where FamilySize = 1 (using __loc__).
・Once this is done, delete __Parch, SibSp, and FamilySize__.
・ Code ![Screenshot 2020-10-24 15.07.46.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/0bac092b-b650-d823-22d6-f72c14bf151b.png)
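The same steps as a small sketch, again on made-up rows:

```python
import pandas as pd

# Made-up sample; FamilySize is assumed to exist already
train_df = pd.DataFrame({
    "SibSp":      [1, 0, 3],
    "Parch":      [0, 0, 2],
    "FamilySize": [2, 1, 6],
})

# Start with 0 everywhere, then flag passengers travelling alone
train_df["IsAlone"] = 0
train_df.loc[train_df["FamilySize"] == 1, "IsAlone"] = 1

# The source columns are no longer needed
train_df = train_df.drop(["Parch", "SibSp", "FamilySize"], axis=1)
```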
Age*Class
・Here we create an artificial feature called __"Age*Class"__, which weights "age" by multiplying it by the "cabin class". To build it, just take Age and Pclass as they are and multiply them.
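As a short sketch (made-up rows, with Age assumed to be already converted to the ordinal values 0 to 4):

```python
import pandas as pd

# Made-up sample; Age is assumed to be the ordinal value from the AgeBand step
train_df = pd.DataFrame({"Age": [0, 1, 2, 3], "Pclass": [3, 1, 2, 3]})

# Element-wise product of the two columns
train_df["Age*Class"] = train_df["Age"] * train_df["Pclass"]
```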
・As seen in Chapter 1, __Embarked in the training data__ has two missing values. First __drop__ the missing values to find the __mode__ (most frequent value), then fill the gaps with it. By the way, the mode is __"S"__.
・The mode can be obtained with __"mode()"__. Since we want only the value itself (the index is not needed), take the first element with [0].
・Store the mode in the variable freq_port and fill the missing values with __"fillna(freq_port)"__.
·code
・Like Sex, Embarked is a categorical value, so convert it to numbers with a mapping such as __{'S': 0, 'C': 1, 'Q': 2}__, using the __map()__ function followed by __astype(int)__.
·code
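Both Embarked steps (filling with the mode, then mapping to numbers) could be sketched like this on a made-up column:

```python
import pandas as pd

# Made-up sample with two missing Embarked values
train_df = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", None, "S"]})

# mode() returns a Series, so [0] picks out the value itself
freq_port = train_df["Embarked"].dropna().mode()[0]
train_df["Embarked"] = train_df["Embarked"].fillna(freq_port)

# Categories -> numbers
train_df["Embarked"] = train_df["Embarked"].map({"S": 0, "C": 1, "Q": 2}).astype(int)
```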
・__Fare in the test data__ has just one missing value. Fill it with the median.
・Then, as with Age, __convert the continuous Fare to discrete values__. The binned version is stored in a new feature called FareBand, created by dividing Fare into four bins.
・For Age the ranges were created with __"cut()"__, which __divides so that the bin widths are equal__. For Fare we use __"qcut()"__, which __divides so that each bin contains about the same number of elements__.
・ Code ![Screenshot 2020-10-24 17.00.23.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/2e191c9f-1baf-a732-b0c1-cebc72c7d1e8.png)
・Referring to FareBand, replace Fare with discrete values: __"~ 7.91" becomes 0, "7.91 ~ 14.454" becomes 1, "14.454 ~ 31" becomes 2, and "31 ~" becomes 3__.
・The conversion method is the same as for Age: __"loc[conditional expression, column to convert]"__.
·code
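Putting the Fare steps together as a sketch: the single DataFrame df and its values are made up, so the qcut edges here will differ from the real ones, while the thresholds 7.91 / 14.454 / 31 are the ones quoted in the text.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing Fare
df = pd.DataFrame({"Fare": [7.5, 80.0, np.nan, 8.0, 26.0, 5.0, 12.0, 31.0, 14.0]})

# Fill the missing value with the median of the known fares
df["Fare"] = df["Fare"].fillna(df["Fare"].dropna().median())

# qcut: 4 bins holding roughly the same number of rows each
df["FareBand"] = pd.qcut(df["Fare"], 4)

# Replace Fare with ordinal values, working upward from the lowest range
df.loc[df["Fare"] <= 7.91, "Fare"] = 0
df.loc[(df["Fare"] > 7.91) & (df["Fare"] <= 14.454), "Fare"] = 1
df.loc[(df["Fare"] > 14.454) & (df["Fare"] <= 31), "Fare"] = 2
df.loc[df["Fare"] > 31, "Fare"] = 3

df = df.drop(["FareBand"], axis=1)
df["Fare"] = df["Fare"].astype(int)
```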
・We examined each feature's correlation with __"Survived"__, the target variable. Features with a significant correlation were used as they are. For the others, new features can be created either by __extracting part of a variable__, as with Title from Name, or by __combining variables__, as with FamilySize from Parch and SibSp.
・Continuous values such as Age and Fare are converted to discrete values. __If a feature contains missing values, fill them with the median or the mode__.
・Sex and Embarked hold categories such as [male, female] and [S, C, Q], so replace them with numbers such as __[0, 1] and [0, 1, 2]__.
That's all for this time. Thank you for reading to the end.