[Python] [For beginners] A summary of my struggles with EDA on Kaggle

1. Purpose

I started learning programming in January 2019, then enrolled in an online programming school and began studying in earnest in July 2019.

I was working on Kaggle as practice, but **I stumbled so often on the data visualization step before building a machine learning model that I lost heart many times**.

But I doubt I'm the only one who struggles with this. **I suspect everyone stumbles in the same unexpected places**, so **the purpose of this article is to be at least a little useful to complete beginners who are struggling right now**. I have also tried to write down my thought process, not just the conclusions. Although I'm still a beginner, I will start working at a company specializing in AI next April, so I hope this also **helps motivate other beginners**.

※ Caution ※ This is not a tutorial that says, "Do this and the problem is solved." It is a record of my struggles, written from the perspective of a complete beginner, in the hope that others will recognize the same sticking points. If you know a better way to do any of this, I would appreciate it if you could let me know.

2. The Kaggle dataset used here

Kickstarter Projects. It feels like a super-introductory dataset, perhaps one you would try even before Titanic. The task is to classify whether a crowdfunding project succeeds or fails. https://www.kaggle.com/kemical/kickstarter-projects

3. The EDA struggles covered in this article

[1] There were outliers when visualizing the data

I believe visualizing the data is the first step, but **it took me a long time to notice that my plots looked wrong because of outliers**.

[2] The labels in the figure were unreadable because there were too many items

First of all, visualization! Full of enthusiasm, I drew bar charts the way I had seen them in books and online, but there were so many categories that I could not read the labels at all. **I wanted to plot only the top items, and this is a record of how long it took me to figure out how to do that.**

4. Struggle [1] There were outliers when visualizing the data

(1) Before getting into the main subject

Import what you need and read the data.

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import math
import pandas as pd
import seaborn as sns

df = pd.read_csv(r"C:~~\ks-projects-201801.csv")

(2) Try various visualizations

All right, first and foremost it's important to look at the data! This Kaggle dataset doesn't have many features, so I decided to look at them one by one.

◆ state (classification of success or failure results)

If you write this code and visualize it, it will look like this:

df["state"].value_counts().plot(kind = "bar")
[Figure: Capture 1.PNG, bar chart of state counts]

Hmm... so many projects fail. There are also outcomes other than failure and success, such as projects that ended partway through, but since most are either failed or successful, I decided to drop everything except those two for now.
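Keeping only the two main outcomes could be sketched as below. This is a minimal sketch on a tiny made-up frame, not the real data: `df_demo` and the exact state labels ("failed", "successful", "canceled", "live") are assumptions for illustration, so check `df["state"].unique()` on the real file first.

```python
import pandas as pd

# Tiny stand-in for the Kickstarter data; rows and state labels are assumed.
df_demo = pd.DataFrame({
    "state": ["failed", "successful", "canceled", "failed", "live", "successful"],
    "goal": [1000, 5000, 300, 20000, 1500, 800],
})

# Keep only the rows whose state is one of the two outcomes to classify.
keep_states = ["failed", "successful"]
df_two = df_demo[df_demo["state"].isin(keep_states)].copy()

print(df_two["state"].value_counts())
```

`isin` keeps the filter readable when more labels need to be added or removed later, and `.copy()` avoids pandas warnings if the filtered frame is modified afterwards.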

This made me realize how important it is to visualize the data in order to reason about it like this. After visualizing several other features, I finally arrived at the one that caused trouble.

◆ goal (target amount)

I figured that whether a project succeeds probably depends on its target amount, so I tried visualizing it with an ordinary histogram.

plt.hist(df["goal"])

Then... this histogram came out.

[Figure: Capture 2.PNG, histogram of goal]

Looking back, this happened because of outliers, but at the time I didn't notice at all (I knew what outliers were, but I couldn't make the connection). I just thought, "What is this? Everything is in the same bar... and what is the 1e8 at the right end of the x-axis? Some power of 10?"

I should have noticed by looking at the actual values in the goal feature, but I didn't. Looking back, I think **"visualizing" had become the goal in itself, rather than the data analysis it was supposed to serve**.
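Looking at the numbers before plotting might have looked like the sketch below. The values in `goal_demo` are invented to imitate the situation (mostly small goals plus one huge one); on the real data the same call would be `df["goal"].describe()`. As for the 1e8 next to the axis, that is matplotlib's offset notation: the tick labels are to be multiplied by 10^8.

```python
import pandas as pd

# Made-up stand-in for df["goal"]: mostly small goals plus one huge outlier.
goal_demo = pd.Series(
    [500, 1000, 2500, 5000, 10000, 20000, 100_000_000], name="goal"
)

# describe() reports count, mean, std, min, quartiles and max in one call.
summary = goal_demo.describe()
print(summary)

# A maximum far above the 75th percentile hints that a few extreme values
# are stretching the axis of the histogram.
ratio = summary["max"] / summary["75%"]
print(f"max is about {ratio:.0f} times the 75th percentile")
```

When the maximum is thousands of times larger than the upper quartile, a histogram with default bins puts almost every row into the first bar, which matches the plot above.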

In fact, I spent my spare time over a day or two looking into it and got nowhere. More to the point, I had no idea what to search for in the first place.

(3) Last resort... ask the teacher

This was bad. It was probably something super basic, but I had no idea... I'm sorry this is not a solution everyone can use, but I was taking a machine learning course at a company at the time, so I asked the teacher.

When I showed the teacher my code and the histogram, he immediately replied, "Isn't it because there are outliers?"

Outliers... right... seriously... That was all I could think. The teacher didn't explain any further, so that I would learn by working it out myself, and I headed home. I vaguely remembered that outliers are usually examined with a box plot, so I continued at home.

(4) Not resolved immediately ...

I had some background knowledge of box plots from studying for the Grade 3 statistics certification exam. Following the teacher's hint about outliers, I decided to draw a box plot for a start.

sns.boxplot(y = "goal",data=df)
[Figure: Capture 4.PNG, box plot of goal for all rows]

Somehow... it's different from what I expected. Box plots aren't supposed to look this strange, are they?

Looking back, so much of the data sits near 0.0 that the box is squashed flat, and the outliers are drawn as individual dots. But I didn't notice the thick line near 0.0 and thought the plot made no sense.
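Why the box collapses while the dots spread out can be checked numerically. The sketch below assumes the common box plot convention (the default in seaborn and matplotlib) that whiskers extend at most 1.5 times the interquartile range beyond the box; `goal_demo` is made-up data, not the real column.

```python
import pandas as pd

# Made-up stand-in for df["goal"].
goal_demo = pd.Series(
    [500, 1000, 2500, 5000, 10000, 20000, 100_000_000], name="goal"
)

# The box spans the first to the third quartile.
q1 = goal_demo.quantile(0.25)
q3 = goal_demo.quantile(0.75)
iqr = q3 - q1

# Points above this fence are drawn as individual dots (outliers).
upper_fence = q3 + 1.5 * iqr
outliers = goal_demo[goal_demo > upper_fence]

print(f"box: {q1:,.0f} to {q3:,.0f}, upper fence: {upper_fence:,.0f}")
print(f"{len(outliers)} point(s) beyond the fence, max = {outliers.max():,.0f}")
```

Because the y-axis has to reach the largest dot, a box that ends in the tens of thousands is only a sliver on an axis that runs to 100 million.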

So I thought, "Right, a violin plot shows where the data is concentrated. Let me try that," and ran the following.

sns.violinplot(y = "goal",data=df)
[Figure: Capture 5.PNG, violin plot of goal for all rows]

Only when I saw that the violin plot looked no different from the box plot did I finally realize that the figure looked strange because so much of the data sits near 0.0.

Most of the data is probably near 0.0, so I decided to cut off the large values for now. As a first try, I dropped everything with a goal of 1 million or more.

df_goal_train = df[(df["goal"]<1000000)]
sns.boxplot(y = "goal",data=df_goal_train)
[Figure: Capture 6.PNG, box plot of goal for rows with goal < 1,000,000]

That still doesn't look right, so next I kept only goals below 100,000.

df_goal_train = df[(df["goal"]<100000)]
sns.boxplot(y = "goal",data=df_goal_train)
[Figure: Capture 7.PNG, box plot of goal for rows with goal < 100,000]

That's pretty close to the box plot I had in mind!! Let's try the violin plot on this subset.

sns.violinplot(y = "goal",data=df_goal_train)
[Figure: Capture 8.PNG, violin plot of goal for rows with goal < 100,000]

That looks good!! Some data has been cut away at this point, but we can see that the median target amount is roughly 10,000 or less, and the violin plot shows that the bulk of the projects are at 5,000 or less. In other words, the data shows that relatively small crowdfunding projects are the most common.
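Instead of guessing thresholds such as 1 million and then 100,000, one option is to let quantiles suggest the cutoff. This is not what I did above, only a sketch of an alternative; the log-normal `goal_demo` is synthetic data chosen to be skewed like goal, and the 95% level is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic, heavily right-skewed stand-in for df["goal"].
rng = np.random.default_rng(0)
goal_demo = pd.Series(
    rng.lognormal(mean=8.5, sigma=1.5, size=10_000), name="goal"
)

# Print a few quantiles to see where most of the data lies.
for q in (0.50, 0.90, 0.95, 0.99):
    print(f"{q:.0%} quantile: {goal_demo.quantile(q):,.0f}")

# Use the 95% quantile as the plotting cutoff (a choice, not a rule).
cutoff = goal_demo.quantile(0.95)
goal_for_plot = goal_demo[goal_demo < cutoff]
print(f"kept {len(goal_for_plot)} of {len(goal_demo)} rows for plotting")
```

The filtered series is meant only for drawing the figure; the original data stays untouched.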

However, the very high-goal projects were excluded here only for the sake of plotting, and deleting them from the data itself would not be good: if a very high-goal project came along later, the model would not predict it well. This visualization shows that how the data is handled needs to be decided carefully.
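One way to look at every row without deleting the large ones is to plot on a logarithmic scale. This is a different technique from the filtering used above, shown only as a sketch: `goal_demo` is made-up data, and `log10` requires all values to be greater than 0, which should be checked on the real column first.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for df["goal"], including one huge value.
goal_demo = pd.Series(
    [1, 500, 1000, 2500, 5000, 10000, 20000, 100_000_000], name="goal"
)

# log10 compresses large values: 1,000 becomes 3 and 100,000,000 becomes 8.
log_goal = np.log10(goal_demo)

fig, ax = plt.subplots()
ax.hist(log_goal, bins=8)
ax.set_xlabel("log10(goal)")
ax.set_ylabel("number of projects")
```

On this scale the small and the huge goals fit in the same figure, so nothing has to be dropped just to make the plot readable.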

This concludes the goal visualization chapter.

5. Struggle [2] The labels in the figure were unreadable because there were too many items

(1) Before getting into the main subject

As with [1], we will use Kickstarter Projects. It's exactly the same as before, but let's import what we need.

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import math
import pandas as pd
import seaborn as sns

df = pd.read_csv(r"C:~~\ks-projects-201801.csv")

(2) Try various visualizations

Just like before, I faced this problem while looking at the features one by one.

First, let's look at the counts for "main_category"! With that in mind, I plotted it.

df["main_category"].value_counts().plot(kind = "bar", stacked = True)
[Figure: Capture 9.PNG, bar chart of main_category counts]

Surprisingly, the counts are not heavily skewed toward any single category. At this point I was vaguely thinking that I would later like to plot them split by success and failure.
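Splitting the counts by success and failure could be sketched with `pd.crosstab`, as below. The frame `df_demo` and its category names and state labels are made up for illustration; on the real data the two columns would be `df["main_category"]` and `df["state"]`.

```python
import pandas as pd

# Made-up stand-in for the two columns of interest.
df_demo = pd.DataFrame({
    "main_category": ["Music", "Music", "Games", "Games", "Games", "Art"],
    "state": ["successful", "failed", "failed",
              "failed", "successful", "successful"],
})

# Rows are categories, columns are states, cells are counts.
counts = pd.crosstab(df_demo["main_category"], df_demo["state"])
print(counts)
```

Calling `counts.plot(kind="bar", stacked=True)` on this table draws one bar per category with the states stacked on top of each other, which is where the `stacked` option actually has an effect.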

Next, I thought I would plot the similar "category" feature in the same way.

df["category"].value_counts().plot(kind ="bar")
[Figure: Capture 10.PNG, bar chart of category counts]

...I can't read it at all. I had actually gotten stuck on exactly the same thing before, and since I couldn't find a solution back then, I had left it alone.

I figured the same thing would keep happening, so I thought about what to do. What I wanted to know was which categories appear most often, so I decided to look only at the top items.

So I narrowed it down to the categories that appear more than 10,000 times, as follows.

df["category"].value_counts()[df["category"].value_counts() > 10000].plot(kind="bar")
plt.show()
[Figure: Capture 11.PNG, bar chart of categories with more than 10,000 projects]

Then only the top items were displayed neatly!!

I did some research, but at my level I think this is probably the best way to get a first look at the data.
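If a fixed number of bars is wanted rather than a count threshold, a variant is to take the first N rows of `value_counts()`, which sorts in descending order by default. The sketch below uses a made-up `category_demo` series, and `top_n = 3` is an arbitrary choice.

```python
import pandas as pd

# Made-up stand-in for df["category"].
category_demo = pd.Series(
    ["Product Design"] * 5 + ["Documentary"] * 4 + ["Music"] * 3
    + ["Tabletop Games"] * 2 + ["Shorts"],
    name="category",
)

# value_counts() is sorted from most to least frequent,
# so head(top_n) keeps the top_n most frequent categories.
top_n = 3
top_counts = category_demo.value_counts().head(top_n)
print(top_counts)
```

Calling `top_counts.plot(kind="bar")` then draws exactly `top_n` bars, and `value_counts()` is computed only once.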

It seems obvious when you think about it calmly, but **while studying programming I wasted time looking for some magic trick that would produce a beautiful plot in one shot**. Given what I actually wanted to do, I have now concluded that this approach is the way to go.

6. Conclusion

That's it. Written out like this, it takes only a moment to read, and I expect many people will say, "Well, of course." But this is where I got stuck.

**I hope this conveys that "nobody knows how to do things well from the start" and that "everyone gets there gradually by working through problems persistently, one at a time"**.

There are plenty of books and sites that say "do it this way," but I think there is not much content that describes the process of "I struggled here and solved it by thinking like this." I hope this article helps motivate beginners like me and reinforces what you know.
