[PYTHON] How an amateur who didn't even know what the terminal was made a Kaggle submission in 3 weeks

Introduction

This article shares my Kaggle experience, hardships included. It explains how to register for a Kaggle competition and how to submit predictions. Most people probably don't need this explained (really), so please forgive me for taking up your time.

From taking on Kaggle to posting this article

Aiming for a career in data science, I started studying on December 7th of this month. Counting my previous job as well, **I had never dealt with programming in my life**.

I had only ever used Word, Excel, and PowerPoint to create documents. If I have anything going for me, it is that I type a little faster than average thanks to a past addiction to online games.

I have been using Qiita to keep a detailed study log. For the first two weeks I used books and Progate to get a firm grasp of the basics, but in the end **I felt that practical output was indispensable for really learning the material**, so I decided to enter a Kaggle competition. It was half forced, but through trial and error I managed to submit my work in the third week.

However, Kaggle is in English only, and since I had little understanding of machine learning, every step from competition registration to submission was a struggle, and I often didn't know what to do. (Those days weren't wasted, since I learned a lot while researching.)

Still, it is also true that if there had been material I could understand at my level, I could have submitted more smoothly.

With that in mind, I have tried to explain competition registration and the submission of predicted values as simply as possible, in the hope that it helps those who are about to take on a Kaggle competition and those at a level similar to mine (if there are any).

If you find anything difficult to understand, I would be very grateful for your guidance and suggestions.

Register for Kaggle Competitions

  1. Go to the official Kaggle site.
  2. Click "Competitions". Sorting by "Recently Created" lists the competitions starting with the most recently started. (image: qiita1-1.png)

  3. This time we will participate in the competition outlined in red (binary classification, natural language processing). (image: qiita1-2.png)
  4. After reading the "Rules", click "Join Competition" to participate. The rules vary by competition, but they include details such as the maximum team size and the submission deadline. (image: qiita1-3.png)

  5. Get the dataset for the competition via "Download all" under "Data". train.csv is for training, test.csv is for prediction, and sample_submission.csv shows the submission format. (image: qiita1-4.png)

  6. You are now in the competition. You can also build and submit using Kaggle's "Notebooks" feature, but I used Jupyter Notebook. It comes bundled with Anaconda, so installing Anaconda is recommended.
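
As a rough sketch of what comes right after the download, the three files can be loaded with pandas. The `load_competition_data` helper and its `data_dir` argument are names I made up for illustration; only the three file names come from the steps above, and the columns inside them differ from competition to competition.

```python
from pathlib import Path

import pandas as pd


def load_competition_data(data_dir):
    """Load the three downloaded files from data_dir (hypothetical helper)."""
    data_dir = Path(data_dir)
    # train.csv: used for training
    train = pd.read_csv(data_dir / "train.csv")
    # test.csv: the rows you have to predict
    test = pd.read_csv(data_dir / "test.csv")
    # sample_submission.csv: shows the expected submission format
    sample = pd.read_csv(data_dir / "sample_submission.csv")
    return train, test, sample
```

If the files sit in the current directory, `train, test, sample = load_competition_data(".")` loads them, and `train.head()` or `train.shape` is then a quick way to see what you downloaded.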

Analysis / prediction (machine learning)

The general flow is as follows.


1. Import the required modules
   pandas, numpy, scipy, etc.

2. Read the data
   Read the dataset with read_csv.

3. Preprocessing
   Missing values: check for them, then replace them with a representative value such as the mean, or delete them.
   Data split: divide the data from train.csv and test.csv into two parts, one for training and one for prediction (X and y).

4. Modeling
   Linear models, decision trees, neural networks, etc.

5. Training the model
   Train (fit) the model with the training data split off in step 3 (X_train and y_train).

6. Calculating the predicted values
   Using the model from step 5, predict on the prediction data (X_test) to calculate the predicted values.

7. Evaluation
   Compare the predicted values from step 6 against the answers in y_test (use sklearn's accuracy_score).

8. Tuning the model, parameters, etc. (if necessary)
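
The flow above can be sketched roughly as follows. This is not the code I used in the competition: the toy rows, the choice of `LogisticRegression`, and the split settings are all assumptions made so that the example runs on its own. In a real competition, step 2 would read train.csv instead.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 2: in a competition this would be pd.read_csv("train.csv").
# The rows here are made-up toy data so that the sketch runs on its own.
train = pd.DataFrame({
    "text": [
        "great movie", "loved this film", "wonderful acting",
        "great and wonderful", "loved it, great",
        "terrible movie", "hated this film", "awful acting",
        "terrible and awful", None,
    ],
    "target": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

# Step 3a: check for missing values, then fill them (here with an empty string).
# Deleting the rows instead would be train.dropna(subset=["text"]).
print(train.isna().sum())
train["text"] = train["text"].fillna("")

# Step 3b: split into a training part and a held-out part.
X_train, X_test, y_train, y_test = train_test_split(
    train["text"], train["target"],
    test_size=0.3, random_state=0, stratify=train["target"],
)

# Text has to be turned into numbers first.
# The vectorizer is fitted on the training part only, then reused as-is.
tfidf_vect = TfidfVectorizer()
X_train_vec = tfidf_vect.fit_transform(X_train)
X_test_vec = tfidf_vect.transform(X_test)

# Steps 4 and 5: choose a model and train (fit) it.
model = LogisticRegression()
model.fit(X_train_vec, y_train)

# Step 6: calculate the predicted values.
y_pred = model.predict(X_test_vec)

# Step 7: compare the predictions with the answers.
print(accuracy_score(y_test, y_pred))
```

With this little data the score itself means nothing; the point is only the order of the calls.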

As mentioned at the beginning, this article omits the machine learning part itself, both because of the rules of the competition I joined and because the analysis method differs from competition to competition.

There are many articles you can use as a reference for machine learning, including walkthroughs of the Titanic passenger survival prediction, so please search for them. That competition has already ended, so there are no prizes, but it is often recommended as a tutorial for machine learning beginners, and just working through it once while following a walkthrough article will teach you a lot.

Submit Predictions

Once you have built a model you are satisfied with through this machine learning workflow, create the submission file. The dataset of every competition should include a sample submission file, so shape yours to match it. There are various ways to do this, but **I predicted on test.csv and replaced the target column of sample.csv with the predicted results (the target column)**. Naturally the processing differs with the nature of the competition, but for natural language processing it looks like the following.

python


# Extract only the "text" column of "test.csv".
presub1 = test["text"]

# Vectorize (for natural language processing tasks).
# tfidf_vect must be the TfidfVectorizer already fitted on the training text;
# a freshly created TfidfVectorizer() raises an error on transform() because it has no vocabulary yet.
presub2 = tfidf_vect.transform(presub1)

# Predict with the trained model.
presub3 = model.predict(presub2)

# Replace the target column of "sample.csv" with the predicted results (the target column).
sample["target"] = presub3

# Confirm that the replacement worked. Here I simply use print.
# Before running this, every value in the target column was 0, so if the values have changed, the replacement worked.
print(sample)

# Write to a CSV file. index=False keeps the index (row numbers) out of the output.
# The first argument is the output file name. Any name is fine as long as you can recognize it.
sample.to_csv('submission.csv', index=False)

As a result, "submission.csv" is written to the current directory. Go to the page of the competition you joined and submit as follows.

  1. Click "Submit Predictions" at the top right of the screen. (image: qiita2-1.png)
  2. Drag and drop the CSV file you created onto ①, and once the upload finishes, click ② to submit. If nothing goes wrong, your entry will appear on the "Leaderboard". (image: qiita2-2.png)
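
If the upload does report a problem, one thing worth checking is whether the file really has the same shape as the sample. The following is a minimal sketch of such a check; `check_submission` is a hypothetical helper of my own, not something Kaggle provides, and the three checks are only the ones I would assume matter most.

```python
import pandas as pd


def check_submission(submission, sample):
    """Compare a submission DataFrame with the sample one (hypothetical helper).

    Returns a list of problems; an empty list means no problem was found.
    """
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("column names or their order differ from the sample")
    if len(submission) != len(sample):
        problems.append("row count differs from the sample")
    if submission.isna().any().any():
        problems.append("there are missing values")
    return problems
```

For the files in this article it could be called as `check_submission(pd.read_csv("submission.csv"), pd.read_csv("sample_submission.csv"))`.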

This completes the submission to Kaggle. After submitting, you keep iterating by trial and error for higher accuracy and compete with the other participants. There are many things to try, such as improving the preprocessing, changing models, combining several models, and tuning parameters. Some participants publish their kernels, and some pursue accuracy with amazing techniques.
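
As one example of parameter tuning, scikit-learn's `GridSearchCV` tries every combination in a grid and keeps the best one by cross-validation. The sketch below is an assumption about how it could be applied to a text task like this one, not what I actually did: the toy data, the `LogisticRegression` model, and the grid values are all made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy data so that the sketch runs on its own.
texts = [
    "great movie", "loved this film", "wonderful acting", "great and wonderful",
    "terrible movie", "hated this film", "awful acting", "terrible and awful",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Putting the vectorizer and the model in one Pipeline means the vectorizer
# is re-fitted on the training part of every cross-validation fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Parameter names are "<step name>__<parameter>".
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.predict(...)` uses the best combination found, so it can stand in for `model` when creating the submission file.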

Unfortunately, the results for this competition seem to have leaked the other day, and the Leaderboard is now filled with scores of 1.0 (a perfect score)... Things like that happen too.

In conclusion

Process the dataset a little, train an existing model, produce the predicted values, write them to a CSV file, and submit it. Looking back, it's not really a big deal, but for me, having started with no coding experience at all and not even knowing what the terminal was, the road to this point was one hardship after another.

My score isn't good at this point, and my code is surely not written in a smart way. However, I gained a great deal from submitting through trial and error. Of course, there is still much to learn, so I will keep working hard at it.

The content was rudimentary, but I hope this article is of some help to people in a similar position. Thank you for reading.
