[PYTHON] You will be an engineer in 100 days ――Day 77 ――Programming ――About machine learning 2

Click here for the posts up to yesterday

You will become an engineer in 100 days --Day 76 --Programming --About machine learning

You will become an engineer in 100 days-Day 70-Programming-About scraping

You will become an engineer in 100 days --Day 66 --Programming --About natural language processing

You will become an engineer in 100 days --Day 63 --Programming --Probability 1

You will become an engineer in 100 days-Day 59-Programming-Algorithms

You will become an engineer in 100 days --- Day 53 --Git --About Git

You will become an engineer in 100 days --Day 42 --Cloud --About cloud services

You will become an engineer in 100 days --Day 36 --Database --About the database

You will be an engineer in 100 days-Day 24-Python-Basics of Python language 1

You will become an engineer in 100 days --Day 18 --Javascript --JavaScript basics 1

You will become an engineer in 100 days --Day 14 --CSS --CSS Basics 1

You will become an engineer in 100 days --Day 6 --HTML --HTML basics 1

This post continues the discussion of machine learning.

About the flow of machine learning

Last time I roughly covered what machine learning can and cannot do. Today I would like to cover what you specifically have to do.

First, let's talk about the flow of machine learning. I will explain how machine learning is done in business.

The flow of work when incorporating machine learning is as follows.

  0. Determine the purpose
  1. Data acquisition
  2. Data understanding / selection / processing
  3. Data mart (data set) creation
  4. Model creation
  5. Accuracy verification
  6. System implementation

Let's look at the specific content.

0. Determine the purpose

I think this is the most important step. Decide the purpose: what you are doing machine learning for, and what you want to achieve.

The only thing you can do with machine learning is prediction.

That prediction comes in three kinds:

Regression: predict numerical values
Classification: predict categories, such as male or female
Clustering: divide the data into groups of similar items

These three things are all you can do.
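To make the three kinds of prediction concrete, here is a minimal sketch using only the Python standard library. The data, the function names, and the very simple methods (a least-squares line, a nearest-centroid rule, and a tiny two-group k-means) are assumptions chosen for illustration, not the methods used in the original series.

```python
from statistics import mean

# --- Regression: predict a numerical value ---
# Hypothetical data: months since sign-up -> monthly spend.
months = [1, 2, 3, 4, 5]
spend = [10.0, 20.0, 30.0, 40.0, 50.0]

def fit_line(xs, ys):
    """Ordinary least squares for one explanatory variable."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

slope, intercept = fit_line(months, spend)

def predict_spend(month):
    return slope * month + intercept

# --- Classification: predict a category ---
# Nearest-centroid rule on one feature (visits per month), hypothetical labels.
visits_by_label = {"stays": [12, 15, 14], "leaves": [1, 2, 3]}
centroids = {label: mean(values) for label, values in visits_by_label.items()}

def predict_label(visits):
    return min(centroids, key=lambda label: abs(centroids[label] - visits))

# --- Clustering: group similar values without any labels ---
def two_means(values, iterations=10):
    """Tiny 1-D k-means with k=2, started from the smallest and largest value."""
    centers = [min(values), max(values)]
    groups = ([], [])
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return groups

print(predict_spend(6))
print(predict_label(10))
print(two_means([1, 2, 3, 20, 21, 22]))
```

Regression returns a number, classification returns one of the known labels, and clustering returns groups that were never labeled in advance.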

First of all, we have to decide what to predict as the purpose of machine learning.

Good examples are:
I want to reduce the user withdrawal rate, so I want to predict the user withdrawal rate.
I want to increase sales, so I want to predict whether a user will buy.

"To achieve 〇〇, predict the XX that leads to it." I think this is the right way to introduce machine learning.

Basically, I think the question is whether it is directly connected to sales and profit.

If you cannot tell whether it leads there, or it is difficult to judge, then it is not a good idea to hand the task to machine learning.

In the first place, machine learning requires a huge number of man-hours for the subsequent work, and that costs a lot of money.

If the development cost is estimated at 30 million yen but the project would generate almost no profit, doing it is pointless, and the wise decision is not to do it.
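The go/no-go reasoning above is simple arithmetic, and it can be written down as a rough check. In this sketch the 30 million yen figure comes from the text; the profit figures, the payback period, and the function name are hypothetical.

```python
def is_worth_building(development_cost, expected_yearly_profit, payback_years):
    """Return True if the profit expected within the payback period covers the cost."""
    return expected_yearly_profit * payback_years >= development_cost

# 30 million yen is the figure from the text; the other numbers are hypothetical.
print(is_worth_building(30_000_000, 1_000_000, 3))
print(is_worth_building(30_000_000, 20_000_000, 3))
```

A real estimate would also include operating costs and the uncertainty of the profit figure, but even this rough check is enough to rule out projects that can never pay for themselves.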

"I want to verify how accurate machine learning would be." This is OK.

If the purpose is loose, such as running a verification as a PoC to see whether the experiment works, then the verification result itself can serve as the outcome, so I think verification can be a legitimate purpose.

Most of the time, though, you just throw your money away.

1. Data acquisition


Once the purpose is decided, you need to prepare the data for it.

If the data has already been acquired and you want to use it, all that is left is handing the data over, so there is little to say.

However, if there is no data yet and you have to start acquiring it from now, things get difficult. You must design the data carefully, considering what kind of data will work, and start by creating a mechanism that acquires data without excess or deficiency.

What you have to do is check whether the client or your own company has suitable data and, if so, decide how to hand it over. If there is no data, you start from designing the data acquisition.

As for handing over the data, you can receive it on an HDD or SSD, or via cloud storage. I think it is mostly via cloud storage these days.

2. Data understanding / selection / processing


This is the stage that comes before what is called data preprocessing.

We perform basic aggregation to find out what kind of data there is, what its composition ratios are, and how much of it there is, and we analyze and visualize the data.

Then, we select the data that can be used.

With large data, it can take many days just to grasp the data. If you do not properly understand the characteristics of the data here, the work will go back and forth.
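The basic aggregation described in this step (how many rows, how many missing values, what composition ratio) can be sketched with the standard library alone. The records, column names, and the `summarize` helper below are hypothetical; in practice a data-frame library is usually used for this kind of work.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
rows = [
    {"plan": "free", "age": 25},
    {"plan": "paid", "age": None},
    {"plan": "free", "age": 41},
    {"plan": "free", "age": 33},
]

def summarize(rows, column):
    """Count rows, missing values, and the share of each distinct value in a column."""
    values = [row[column] for row in rows]
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "ratio": {value: count / len(present) for value, count in counts.items()},
    }

print(summarize(rows, "plan"))
print(summarize(rows, "age"))
```

Running a summary like this on every column is a quick way to spot columns that are mostly empty or dominated by a single value before deciding which data to keep.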

3. Data mart (data set) creation


From here, we create the data for machine learning. After narrowing down the usable data candidates to some extent, we make the data usable for machine learning.

Machine learning can only use tabular data, so the data must be gathered into a table.

Supervised machine learning requires explanatory variables to learn from and a correct label that represents the correct answer.

The thing you want to predict must be processed into a correct-label column.

Also, all the data used as explanatory variables must be converted to numerical values.

This work is the data preprocessing step.
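As a small illustration of this preprocessing, the sketch below turns raw records into a numeric table of explanatory variables plus a correct-label column. The records, the column names, and the choice of one-hot encoding for the categorical column are assumptions made for the example.

```python
# Hypothetical raw records: "status" holds what we want to predict.
raw = [
    {"plan": "free", "visits": 3, "status": "withdrawn"},
    {"plan": "paid", "visits": 12, "status": "active"},
    {"plan": "free", "visits": 7, "status": "active"},
]

def build_dataset(records):
    """Return (features, labels) where every value is numeric."""
    plans = sorted({r["plan"] for r in records})  # fixed category order
    features, labels = [], []
    for r in records:
        # One-hot encode the categorical column, then append the numeric column.
        one_hot = [1 if r["plan"] == p else 0 for p in plans]
        features.append(one_hot + [r["visits"]])
        # Correct label: 1 if the user withdrew, 0 otherwise.
        labels.append(1 if r["status"] == "withdrawn" else 0)
    return features, labels

X, y = build_dataset(raw)
print(X)
print(y)
```

Each row of `X` is one user and each column is one explanatory variable, which is the single tabular format that the model-creation step expects.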

Few machine learning projects start with clean data. Data that needs almost no preprocessing is rare, and unless the system was designed to collect data neatly at the acquisition stage, you will spend a lot of time processing the data.

In machine learning, we combine the data into a single table. If there are many types of data, you need to devise a way to put them together.

Usually, you end up with data that has thousands to tens of thousands of columns.

The number of rows varies considerably depending on the type of business and the data acquisition mechanism. You do not need to worry about it too much, but a small number of rows may affect the accuracy.

Whether preprocessing leaves you with 2 million usable rows or only 20 makes a difference in accuracy.

4. Model creation


Creating a model basically does not require many man-hours. Even if you make many models, not all of them will be used.

All you need is one model that is accurate and usable.

You have to try many techniques to create a model, but once you have done it a certain number of times you get a feel for which method works. And if you select methods mechanically, you can simply try about 10 methods at once and wait for the results, so it does not take much effort.

With a service like DataRobot, if you have the data at hand, it will automatically create various models from that data.

Creating a model is now very easy and not time-consuming, so it is a relatively small part of the man-hours in machine learning.
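Trying several methods mechanically amounts to a loop over candidate methods. The sketch below assumes three deliberately simple candidates (predict the mean, predict the median, fit a straight line) and hypothetical data; it is not how DataRobot or any specific service works internally.

```python
from statistics import mean, median

def fit_mean(xs, ys):
    m = mean(ys)
    return lambda x: m  # always predicts the training mean

def fit_median(xs, ys):
    m = median(ys)
    return lambda x: m  # always predicts the training median

def fit_line(xs, ys):
    # Least-squares line through the training points.
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)

# Candidate methods, tried one after another with the same data.
METHODS = {"mean": fit_mean, "median": fit_median, "line": fit_line}

def fit_all(xs, ys):
    """Train every candidate method and return the fitted models by name."""
    return {name: fit(xs, ys) for name, fit in METHODS.items()}

xs = [1, 2, 3, 4]
ys = [2.0, 4.0, 6.0, 8.0]
models = fit_all(xs, ys)
for name, model in models.items():
    print(name, model(5))
```

Adding another method only means adding one entry to `METHODS`, which is why this part of the work takes relatively little effort once the data is ready.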

5. Accuracy verification


This is done as a set with model creation: after making a model, we verify how accurate it is, and we keep building while verifying.

There are several methods of accuracy verification, but in general we look at how much error occurred.

A model with less error is considered better, so the models are ranked from the smallest error upward, and in the end one of the more accurate models is adopted.
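Ranking models by error can be sketched as follows. Mean absolute error is used here as one common error measure; the actual values, the model names, and their predictions are hypothetical, and in practice the predictions should come from data the models were not trained on.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the prediction error, ignoring its sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical held-out answers and the predictions of three models.
actual = [10.0, 20.0, 30.0]
predictions = {
    "model_a": [12.0, 18.0, 33.0],
    "model_b": [10.0, 21.0, 30.0],
    "model_c": [20.0, 20.0, 20.0],
}

errors = {name: mean_absolute_error(actual, preds) for name, preds in predictions.items()}
ranking = sorted(errors, key=errors.get)  # smallest error first

for name in ranking:
    print(name, round(errors[name], 3))
```

The first name in `ranking` is the candidate to adopt, provided its error is small enough for the purpose decided in step 0.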

Until a good machine learning model is made, the steps

  2. Data understanding / selection / processing
  3. Data mart (data set) creation
  4. Model creation
  5. Accuracy verification

are repeated, and if the accuracy is still not satisfactory, you may have to go all the way back to

  1. Data acquisition

and start over.

A good model can only come from good data. Garbage data produces nothing but garbage.

It is quite rare to find treasure mixed into garbage data.

6. System implementation

Once the data has been prepared and the accuracy turns out to be reasonable, we finally incorporate the model into the system.

Generally, for a web service, the model is incorporated on the back end and provided as part of the service's functionality.

The system is built together with the system requirements, while considering how it will be operated and how much it will cost.

With a machine learning service such as AWS SageMaker, some models can be made available as endpoints right away. Using such a service reduces the amount of implementation work.
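On the back end, incorporating a model usually comes down to a function that receives a request, calls the model, and returns the prediction. The sketch below shows only that shape: the JSON request format, the function names, and the stand-in model are all hypothetical, and this is not the AWS SageMaker API.

```python
import json

def handle_request(body, model):
    """Turn a JSON request body into a JSON response using a trained model."""
    features = json.loads(body)["features"]
    prediction = model(features)
    return json.dumps({"prediction": prediction})

# Stand-in for a trained model: any callable that maps features to a value.
def dummy_model(features):
    return sum(features)

print(handle_request('{"features": [1, 2, 3]}', dummy_model))
```

A web framework or a managed endpoint would wrap a function like this with routing, authentication, and scaling, which is the part that a managed service takes off your hands.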

Summary

Now that you have learned what machine learning can do, it is a good idea to learn what you have to do to carry it out.

The general flow is the same everywhere, so I think it is best to look at how various companies do it.

23 days until you become an engineer

Author information

Otsu py's HP: http://www.otupy.net/

Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw

Twitter: https://twitter.com/otupython
