I took on the Kaggle Titanic competition using VARISTA, an AutoML tool I recently learned about. The final score was 0.80861.
If you haven't registered with Kaggle yet, create an account first using the registration button at the upper right of the screen.
The competition is "Titanic: Machine Learning from Disaster". Go to the competition page and select the "Data" tab. When you reach the data screen, select Download All.
When the download is complete, you will have "titanic.zip", so unzip this file. After unzipping, you will see the following files.
Each file is used as follows.
File name | Use |
---|---|
train.csv | Training data |
test.csv | Test data |
gender_submission.csv | Sample submission file |
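If you would rather unzip from a script than by hand, a minimal sketch with Python's standard `zipfile` module could look like this. The helper name `extract_archive` and the destination folder are illustrative choices, not something Kaggle or VARISTA provides.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every file in zip_path into dest_dir and return the sorted file names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return sorted(archive.namelist())
```

Calling `extract_archive("titanic.zip", "titanic")` should return the names of the three files listed above, assuming the archive sits in the current folder.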
**Data variable description**
Column name | Description |
---|---|
PassengerId | Passenger ID |
Survived | Survival result (1: survived, 0: died) |
Pclass | Ticket class: 1 = upper, 2 = middle, 3 = lower |
Name | Name |
Sex | Sex |
Age | Age |
SibSp | Number of siblings and spouses aboard |
Parch | Number of parents and children aboard |
Ticket | Ticket number |
Fare | Fare paid |
Cabin | Cabin number |
Embarked | Port of embarkation, one of three: Cherbourg, Queenstown, or Southampton |
Create a VARISTA account. Go to http://www.varista.ai and register from the top page. By the way, if you register through my referral link below, I receive credit that can be used in the service, so I would be happy if you used it. If you would rather not, the link above works just as well. https://console.varista.ai/welcome/jamaica-draft-coach-cup-blend
There seems to be a paid plan, but for the time being I used the free one.
After logging in to VARISTA, create a workspace with any name. After creating the workspace, create a project in it. A name like "Titanic" is fine.
Follow the guide to upload the data.
The file to upload is the training data, "train.csv".
When the upload is complete, select the column you want to predict. In this competition we select "Survived" because we want to predict passenger survival.
Once the settings are complete, select START to move to the next screen.
Once you've selected your target, you're ready to go.
You could start training right away, but since we are here, let's take a look at the data first. Select the data menu and select the "train.csv" you uploaded earlier.
If you look at the missing values, you can see that the Age and Cabin columns have gaps. VARISTA, however, appears to fill in missing data automatically.
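The same missing-value check can be reproduced outside VARISTA. This is a minimal sketch using only Python's standard library; `load_rows` and `count_missing` are hypothetical helper names, and treating an empty cell as "missing" is an assumption about how the CSV encodes gaps.

```python
import csv


def load_rows(csv_path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))


def count_missing(rows):
    """Return {column: number of empty cells} for columns with at least one gap."""
    missing = {}
    for row in rows:
        for column, value in row.items():
            # Assumption: a missing value is stored as an empty (or blank) cell.
            if value is None or value.strip() == "":
                missing[column] = missing.get(column, 0) + 1
    return missing
```

Running `count_missing(load_rows("train.csv"))` should list Age and Cabin among the columns with gaps, matching what the VARISTA screen shows.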
Let's look at the distribution of the data. Selecting the "Visualize" tab shows the distribution of each feature column, which is convenient. Selecting the Correlation tab shows the correlation between the column you want to predict and each of the other columns.
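How VARISTA computes its correlation view is not stated here, so the following is only a rough stand-in: a plain Pearson coefficient between one numeric column and "Survived", written with the standard library. The function names are illustrative, and rows are assumed to be dicts of strings as produced by `csv.DictReader`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_with_target(rows, column, target="Survived"):
    """Correlate a numeric column with the target, skipping rows where either is empty."""
    pairs = [
        (float(r[column]), float(r[target]))
        for r in rows
        if (r.get(column) or "").strip() and (r.get(target) or "").strip()
    ]
    if not pairs:
        raise ValueError("no rows with both values present")
    xs, ys = zip(*pairs)
    return pearson(xs, ys)
```

A negative value for `correlation_with_target(rows, "Pclass")` would mean that larger class numbers (lower classes) go with lower survival.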
**Gender, age** Read 0 as died and 1 as survived. Gender matters a great deal: women were much more likely to survive. As for age, the survival rate is generally high under 7 years old, and the mortality rate seems to be high from 60 years old. There seems to be no big difference in between. Children appear to have been rescued first.
**Pclass** The higher the class, the higher the survival rate.
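The observations above can be checked by computing the share of survivors per group. This is a small sketch under the assumption that rows are dicts of strings (for example from `csv.DictReader`); the helper names and the age bands, which follow the cut-offs mentioned in the text, are illustrative.

```python
from collections import defaultdict


def survival_rate_by(rows, key):
    """Return {group: share of survivors}, where key(row) names the group.

    Rows for which key(row) returns None are skipped.
    """
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for row in rows:
        group = key(row)
        if group is None:
            continue
        totals[group] += 1
        survivors[group] += int(row["Survived"])  # Survived is "0" or "1"
    return {group: survivors[group] / totals[group] for group in totals}


def age_band(row):
    """Bucket Age into the bands discussed in the text: under 7, 7 to 59, 60 and over."""
    age = (row.get("Age") or "").strip()
    if not age:
        return None  # missing age: leave the row out of the comparison
    years = float(age)
    if years < 7:
        return "under 7"
    if years >= 60:
        return "60 and over"
    return "7 to 59"
```

For example, `survival_rate_by(rows, lambda r: r["Sex"])`, `survival_rate_by(rows, age_band)`, and `survival_rate_by(rows, lambda r: r["Pclass"])` give the rates by gender, age band, and class respectively.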
Now let's actually train a model. Select the AI model on the left and click "Create AI Model". Then make sure the column you want to predict is "Survived" and click the **Start Learning** button.
Training starts automatically with no configuration needed on our side, as is common with tools these days. It seems that feature engineering is performed and multiple algorithms are tried.
The model got a score of 70. Looking at the degree of influence, gender and Pclass appear to be related to survival.
Click **Predict with this model** on the screen above. Next, change the output format.
Specify the columns to exclude from the output.
Then change the format of the output columns to flags.
Finally, drag and drop test.csv from the files you downloaded earlier.
Download the completed file.
When you open the file, you can see that the rightmost column contains the survival prediction. Delete any columns that are not needed for the Kaggle submission. This time I removed them with Numbers on Mac; on Windows, Excel or a similar tool should work.
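The column clean-up can also be scripted instead of done in a spreadsheet. This sketch assumes, as described above, that the prediction sits in the rightmost column, that the file still contains a "PassengerId" column, and that the flag output is written as 0 or 1; `write_submission` is a hypothetical helper name. The output uses the "PassengerId,Survived" layout of gender_submission.csv.

```python
import csv


def write_submission(prediction_path, submission_path):
    """Keep PassengerId plus the rightmost column, renamed to Survived."""
    with open(prediction_path, newline="", encoding="utf-8-sig") as src:
        reader = csv.reader(src)
        header = next(reader)
        id_index = header.index("PassengerId")
        # Assumption: the prediction is the last column of every row.
        rows = [(row[id_index], row[-1]) for row in reader if row]
    with open(submission_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["PassengerId", "Survived"])
        for passenger_id, prediction in rows:
            # int(float(...)) also accepts values written like "1.0".
            writer.writerow([passenger_id, int(float(prediction))])
    return len(rows)
```

The function returns the number of rows written, which is a quick way to confirm that no passengers were dropped before uploading.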
Select "Submit Predictions" from the Kaggle competition screen and drag and drop the file you downloaded earlier.
Finally, press Make Submission to submit. After a short wait, the submission is scored and the score is displayed.
The score this time was 0.77511.
In the learning settings I changed the learning level, the percentage of validation data, the number of cross-validation divisions, and the random seed value, and the score improved, so here is what I did.
Click the settings button at the top right of the model learning start screen.
I set the values as shown. I haven't experimented much, so there may be better settings, but I'll try more later.
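For readers unfamiliar with two of these settings, here is a generic illustration of what a "number of cross-validation divisions" and a "random seed" usually mean. It is not VARISTA's implementation, which is not described here; `k_fold_indices` is a hypothetical helper written with the standard library only.

```python
import random


def k_fold_indices(n_rows, n_splits, seed):
    """Yield (train_indices, validation_indices) for each of n_splits folds."""
    if not 2 <= n_splits <= n_rows:
        raise ValueError("n_splits must be between 2 and n_rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # the same seed gives the same shuffle
    # Deal the shuffled indices into n_splits folds of near-equal size.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation
```

Each row is used for validation exactly once, and changing the seed changes which rows end up together, which is one reason scores can shift slightly between runs.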
Now let's train again and submit to Kaggle again.
The score went up to 0.80861. Training at level 3 takes about 30 minutes, so I would like to try various settings and write more later.