[PYTHON] If you want to become a data scientist, start with Kaggle

Introduction

** "AI" **, ** "Big Data" **, ** "Data Scientist" **, how many people would like to work on these keywords? How many of them don't know how to study, have studied but have been frustrated, and haven't been able to put it into practice?

This article summarizes what I noticed while studying data science. It is just a record of what I personally felt; it does not prescribe how you should study.

What is Kaggle?

(Image: Kaggle logo, from the official site)

What is Kaggle? "A platform related to predictive modeling and analysis methods and its operating company, where companies and researchers post data and statisticians and data analysts around the world compete for the optimal model." [Wikipedia](https:: From //ja.wikipedia.org/wiki/Kaggle)

In short, companies and other organizations post problems they want analyzed, and data scientists around the world build prediction models and submit their prediction results. It is a data science contest that rewards the data scientist who built the best predictive model.

This article will be helpful for preparing to participate.

Why start with Kaggle?

The reason is that **Kaggle lets you experience the flow of data analysis**.

Kaggle provides two kinds of data: training data for analyzing and building a predictive model, and test data for generating the predicted answers. This data is the number one reason I recommend Kaggle.

In fact, the data you are given is not always clean. A good predictive model cannot be built without **data cleansing**, which cleans the data into a form that can be analyzed. And this data cleansing is said to take 70 to 80% of the time in a data analysis project.

In other words, most of a data analysis project goes into preparing data for statistical analysis and machine learning, and without the ability to read data you cannot build a good predictive model. Kaggle is a good learning experience precisely because it starts with looking at the data.
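As a minimal sketch of that first look at the data, assuming the Titanic competition's train.csv and test.csv have been downloaded into the working directory (the column names below are specific to that competition):

```python
import pandas as pd

# The two kinds of data Kaggle provides (Titanic competition assumed):
# training data for building the model, test data for generating predictions.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# First look at the data: sizes, column types, and missing values.
print(train.shape, test.shape)
print(train.info())
print(train.isnull().sum())

# A simple example of data cleansing: fill missing Age values with the
# median and missing Embarked values with the most frequent port.
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])
```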

Data analysis project

Now let's take a look at a data analysis project. For data analysis projects there is the concept of **CRISP-DM** (CRoss-Industry Standard Process for Data Mining), which defines phases common across industries.

(Image: CRISP-DM process diagram)

As the figure shows, data analysis starts with **Business Understanding**. After defining the problem to solve, the next step is **Data Understanding**: check whether there is enough data to solve the problem you have set.

If you have the data, move on to **Data Preparation**. Here you prepare to build a prediction model, including the data cleansing mentioned above. If you don't have the data, you will have to collect what you need or reframe the problem.

When the data is ready, you do **Modeling** and **Evaluation (improvement)**. If the prediction accuracy of the model is good, it is **deployed and delivered**. Note that a model you have created and then improved is **not always more accurate** than before.

If the model is not accurate enough, go back to the Business Understanding phase and start over from problem setting.
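As a rough sketch of the Modeling and Evaluation phases, here is a hypothetical example using scikit-learn and the Titanic data from the sketch above; the feature choice is my own, not prescribed by CRISP-DM:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Titanic training data, cleaned as in the earlier sketch.
train = pd.read_csv("train.csv")
train["Age"] = train["Age"].fillna(train["Age"].median())

# Hypothetical choice of explanatory variables (Data Preparation).
X = train[["Pclass", "Age", "Fare"]]
y = train["Survived"]

# Hold out part of the training data for the Evaluation phase.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Modeling phase: fit a simple model.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluation phase: if the accuracy is poor, go back to the earlier
# phases instead of only tweaking the model.
print(accuracy_score(y_valid, model.predict(X_valid)))
```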

In practice, as you will find from experience, an overwhelming amount of time is spent staring at the data.

Statistics and machine learning

Next, a brief word about model creation and evaluation (improvement). Personally, I think you need knowledge of both statistics and machine learning; to be precise, **you need both to build a good predictive model**.

Knowledge of statistics is useful for examining data. In my case, I had studied machine learning but never statistics, and I was at a loss over which variables to choose; honestly, I picked them by **intuition**. Recently I had the opportunity to study statistics, and I learned to start by looking at the correlation between the objective variable (the answer to be derived) and the explanatory variables (the elements used to derive it).
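For instance, a minimal sketch of that first statistical check, again assuming the Titanic data (Survived is the objective variable; the listed columns are its numeric explanatory variables):

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Correlation between the objective variable (Survived) and the numeric
# explanatory variables; strongly correlated columns are good first
# candidates for the model.
numeric = train[["Survived", "Pclass", "Age", "SibSp", "Parch", "Fare"]]
print(numeric.corr()["Survived"].sort_values())
```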

Knowledge of machine learning is effective for deriving answers. The more data you have, the more accurate a prediction model you can build, and there is now so much data that we call it big data (even if not all of it is usable). The amount of data generated may keep growing, and it is unlikely to shrink, so machine learning technology will keep developing.

It's just my personal opinion, but I think knowledge of statistics is required to examine the data, and knowledge of machine learning is required to derive the answer.

Kaggle is not enough

I think there is a lot you can learn by starting your studies with Kaggle, but Kaggle alone may not be enough. What is missing is **problem setting**: the same issue definition done in the Business Understanding phase above.

In Kaggle's case the questions are already set, since it is a contest, so you cannot practice setting them yourself. But if you want to be a data scientist, you cannot analyze anything unless you can set the problem, and if you cannot set the problem, you cannot evaluate the accuracy of a model.

You cannot practice problem setting on Kaggle, but you should understand the relationship between how a problem is set and the accuracy of the prediction model.

Model accuracy

Do you know how to evaluate the accuracy of a model? For example, is a model with 90% accuracy a good model?

I think you should understand that, depending on how the problem is set, **a lower limit on the model's accuracy is already fixed**.

As an example, suppose you want to build a model that predicts excellent people. First we need to define "excellent": say you give everyone a test and define the top 10% of scores as excellent.

The model's accuracy is the percentage of the total made up of people who are **excellent and predicted to be excellent** plus people who are **not excellent and predicted not to be excellent**; in the figure, it is the share covered by the light-blue areas.

(Image: prediction accuracy diagram, 予測精度.png)

Now, suppose you build a model that predicts that **no one is excellent**. The accuracy of this model is 90% (**excellent but predicted not excellent**: 10%; **not excellent and predicted not excellent**: 90%). Is this a good model? Probably no one would be convinced that it is.

In short, a good model is not one with "XX% accuracy"; it is one that achieves better accuracy than a model that predicts everything as 0 or everything as 1 (excellent or not, in the example above).

In other words, the lower limit of accuracy for a model that judges whether someone is excellent is 90% here, so the model you build is only good if its accuracy exceeds 90%.
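To make the lower limit concrete, here is a tiny sketch with made-up numbers (1,000 people, 10% of them excellent):

```python
from sklearn.metrics import accuracy_score

# 1,000 people: 10% are excellent (1), 90% are not (0).
y_true = [1] * 100 + [0] * 900

# A baseline "model" that simply predicts that no one is excellent.
y_baseline = [0] * 1000

# Its accuracy is already 90%, so a real model is only "good"
# if it beats this lower limit.
print(accuracy_score(y_true, y_baseline))  # 0.9
```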

In closing

I have argued that starting your studies with Kaggle is a good way to become a data scientist. The first problem to tackle is the Titanic problem (Kaggle's tutorial competition). There is a lot to learn from it: not just building and improving machine learning models, but also choosing which variables to use.

Once your model is complete, generate predictions and submit your answers. You can see where you rank and what your score is. Ranking high will give you confidence, and from there it is worth entering other contests and aiming for prize money. Working as a data scientist is not just a dream, either. (Though there are other things to study too, such as SQL...)
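For the Titanic competition, the submission is a CSV of PassengerId and predicted Survived values; a rough sketch, reusing the hypothetical features from the earlier examples:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Fit on the training data (same hypothetical features as before).
train = pd.read_csv("train.csv")
train["Age"] = train["Age"].fillna(train["Age"].median())
model = LogisticRegression(max_iter=1000)
model.fit(train[["Pclass", "Age", "Fare"]], train["Survived"])

# Predict on the test data; test.csv also needs cleansing.
test = pd.read_csv("test.csv")
test["Age"] = test["Age"].fillna(test["Age"].median())
test["Fare"] = test["Fare"].fillna(test["Fare"].median())

# Kaggle's Titanic submission format: PassengerId and Survived.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(test[["Pclass", "Age", "Fare"]]),
})
submission.to_csv("submission.csv", index=False)
```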

I hope this article is of some help to anyone who wants to become a data scientist. If you don't mind, please **like** it.

That was my poem.
