Application development using Azure Machine Learning

Azure Machine Learning is a platform that lets you easily build machine learning models from a web browser (officially released in February 2015).

The models you build can be called via a Web API. Until now, incorporating machine learning usually meant using libraries in languages such as Python, which often forced development on a platform different from the language you were comfortable with. With this, as long as you can send web requests, you can develop applications that use machine learning algorithms.

This article walks through the procedure for application development using Azure Machine Learning. The figure below shows the final form of the model built this time; I will explain each step along this flow.

flow1.PNG

You can get this model, and the code of the application that uses it, from the GitHub repository below. However, as described later, the model's accuracy is disappointing, so it might be fun to try tuning it yourself.

pollen_prediction_by_azure

Please also refer to the Wiki of the repository for links to reference sites.

0. Data preparation and model design

First, decide what you want to do and what data to train on. Specifically, design the inputs and outputs of the final model, then go looking for data that fits.

Roughly speaking, what machine learning can do is either "predict a value" or "decide a classification". This is no different even for complex learners such as deep learning.

In many cases it is (empirically) more efficient to first decide what you want to predict or classify, and then check whether the necessary data is publicly available. That said, not that much data is public, so in practice you are often bound by what you can actually get...

This time, I will predict the amount of airborne pollen. The idea is to input parameters obtained from the weather forecast and predict the pollen count, which should be useful for hay fever countermeasures.

Pollen counts are published at the site below, so we will use them as training data.

Ministry of the Environment Pollen Observation System (Hanako)

As for the weather forecast, I thought there was nothing usable free of charge, but I found a good one, so I will use it. Besides temperature, pollen dispersal is also related to wind speed and wind direction, and this API can provide those as well.

OpenWeatherMap

1. Data acquisition

From here on, we work inside Azure Machine Learning (you can start using it from its site right away). A new model is created with the + button at the bottom left of the screen.

Start by loading the data.

reader.PNG

The most commonly used data source is probably "Web URL via HTTP". This time, I pushed the CSV file to the GitHub repository, specified its raw URL, and loaded the data from there.
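
As an aside, what the Reader does here can be reproduced locally with pandas; below is a minimal sketch, assuming a hypothetical raw URL (substitute the raw address of your own CSV):

import pandas as pd

# Hypothetical raw URL -- replace with the raw address of your own CSV on GitHub
RAW_CSV_URL = "https://raw.githubusercontent.com/<user>/pollen_prediction_by_azure/master/data.csv"

df = pd.read_csv(RAW_CSV_URL)  # pandas can read directly from a URL
print(df.head())               # first rows, to confirm the load worked
print(df.shape)                # (number of rows, number of columns)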

To fetch the data, or more precisely to run the process you have built, press the "RUN" button at the bottom. When the run finishes, the data should have been retrieved.

One of Azure Machine Learning's strengths is its easy-to-use data visualization. The entry point is easy to miss: after a run, right-click at the spot shown below and a menu item called "Visualize" should appear. Selecting it shows information about the acquired data; use this to check that the data was loaded correctly.

right_click2.PNG

You can also inspect detailed statistics about the data by connecting a module from Statistical Functions. This will come in handy later for spotting abnormal values.

statistics.PNG
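
For reference, pandas gives roughly the same overview; a sketch, continuing from the df loaded above:

print(df.describe())      # count, mean, std, min, quartiles, max for each column
print(df.isnull().sum())  # missing values per column -- useful for the next step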

If the data loaded correctly, save it with "Save as Dataset" from the same context menu as Visualize. As you build up the data flow, you will be running the process many times; fetching a large amount of data over HTTP on every run is inefficient and also puts load on the server you fetch from, so save it as a Dataset that can be reused at any time.

To summarize the points so far:

- Load the data with a Reader (for example, Web URL via HTTP) and execute with RUN.
- Check the result with Visualize (and Statistical Functions for the details).
- Save the verified data with Save as Dataset so it does not have to be re-fetched.

2. Data shaping

Since the data contains missing values and outliers, the next task is to remove or fill them. This is a very important step: accuracy can vary considerably depending on whether or not you do it.

The following is the Visualize view of the imported data connected to Descriptive Statistics from Statistical Functions.

data_preview.PNG

From here, we can see the following.

The missing and outlier rows number 82 and 4 respectively, which is very few against roughly 30,000 rows in total, so this time I simply delete them all. The pollen counts themselves looked like normal data, so at first I left them as they were... but since the accuracy turned out poor, I eventually capped values above 1,500 at 1,500.

According to Hanako, the data provider, anything above 1,000 grains of pollen is already in the dangerous range, so I cut at the slightly larger value of 1,500. The statistics show that the 3rd quartile of the pollen counts (the 75% point, below which 75% of the data falls) is 45, so most of the data lies in the small range and the cap should have little effect.
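
The quartile check is a one-liner in pandas (a sketch; "pollen" is the hypothetical name the column gets after the renaming in section 3):

print(df["pollen"].quantile(0.75))  # 3rd quartile -- reported as 45 above
print((df["pollen"] > 1500).sum())  # how many rows the 1,500 cap actually touches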

At this point it arguably makes more sense to treat this as a classification problem (pollen amount: high / medium / low) rather than regression, but since the purpose this time is to build an application with Azure Machine Learning, I shelved that and moved on.

Once you have decided on a policy for missing data and outliers, you can implement it.

Actually, I implemented it as follows.

data_manipulation.PNG

In Clean Missing Data and other modules, you often open the Column Selector to configure which columns an operation applies to. The way this is specified needs a little care.

column_selector1.PNG

Basically, you specify either "all columns except xx" or "only columns yy". In other words, as a rule, use Exclude when Begin With is All columns, and Include when it is No columns. Note that adding an Include on top of All columns is meaningless.

From the name, you might expect Filter to handle "delete rows where column xx is yy or more", but it does not. Filter is for capping values at upper or lower limits (for example, a Threshold Filter set to Greater Than 1,500 replaces anything above 1,500 with 1,500). A Filter is applied by combining it with Apply Filter.
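
In pandas terms, the Threshold Filter above amounts to a clip; a sketch, again using the hypothetical "pollen" column name:

df["pollen"] = df["pollen"].clip(upper=1500)  # values above 1,500 become 1,500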

Conversely, as far as I can tell there is nothing that can "delete rows where column xx is yy or more" (Clean Missing Data works only on missing values), so I handle that with a Python Script module.

The Python Script receives the incoming Dataset as an argument and returns the Dataset to pass downstream as its return value. The arguments arrive as data frames of the pandas library. Note that as of March 2015 only Python 2.7 can be used, so Python 3 users should take care.

The script used this time is as follows. It throws out rows whose values look abnormal.

# The script MUST contain a function named azureml_main
# which is the entry point for this module.
#
# The entry point function can contain up to two input arguments:
#   Param<dataframe1>: a pandas.DataFrame
#   Param<dataframe2>: a pandas.DataFrame
def azureml_main(dataframe1 = None, dataframe2 = None):

    # Execution logic goes here
    # print('Input pandas.DataFrame #1:\r\n\r\n{0}'.format(dataframe1))

    # If a zip file is connected to the third input port,
    # it is unzipped under ".\Script Bundle". This directory is added
    # to sys.path. Therefore, if your zip file contains a Python file
    # mymodule.py you can import it using:
    # import mymodule

    # Keep only rows whose values are physically plausible; anything
    # outside these ranges is treated as a recording error.
    erased = dataframe1[dataframe1.Col11 >= 0]  # this column cannot be negative
    erased = erased[erased.Col14 > -50]         # -50 and below are sentinel/error values
    erased = erased[erased.Col16 >= 0]          # this column cannot be negative either

    # Return value must be of a sequence of pandas.DataFrame
    return erased,

At this point, RUN again and check with Visualize that the data was processed as intended (for example, that the missing values and outliers are gone).

3. Data item definition

Concretely, the data items are defined as follows.

This time, I renamed the items so they are easy to handle downstream, set the data type to Float, marked the pollen count as the label (the value to predict), and marked the rest as features. This is done with the Metadata Editor under Data Transformation > Manipulation.

If you then want to narrow down the features that are used, select the columns with Project Columns, also under Data Transformation > Manipulation. Stuffing every feature into a model is not necessarily good: it is basically better to work with a small, well-chosen set that is genuinely informative for the target. So even if you plan to use all the features at first, it is a good idea to put a Project Columns step into the flow so the feature set can be adjusted later.
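
Outside Studio, the same item definition can be sketched with pandas. The Col-to-name mapping below is hypothetical; use whatever names you actually set in the Metadata Editor:

# Hypothetical renaming -- match it to your own Metadata Editor settings
df = df.rename(columns={"Col5": "temperature", "Col7": "wind_speed", "Col11": "pollen"})
df = df.astype(float)  # Metadata Editor: data type -> Float (assumes all columns are numeric)

feature_cols = ["temperature", "wind_speed"]  # Project Columns: the features to keep
label_col = "pollen"                          # the value to predict
X, y = df[feature_cols], df[label_col]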

4. Model building

Now we finally build the machine learning model itself. As you can tell from how long it took to get here, most of the work in machine learning is actually data shaping and processing.

The Machine Learning section is divided into four sections: Evaluate, Initialize Model, Score, and Train.

machines.PNG

The basic flow is Initialize Model > Train > Score > Evaluate: initialize the model, train it, predict values on the evaluation data, and evaluate the result.

For model construction we choose from Initialize Model; since we are predicting a value, this time I use Linear Regression (I actually tried a Neural Network first, but it was not very accurate, so I ended up going back to Linear Regression).

Incidentally, the provided Neural Network can perfectly well be used for deep learning. On Windows in particular, installing deep learning libraries is a pain, so unless you are working at large scale, or if you just want to try various patterns, I think it is quite sufficient.

Each model has initialization parameters, and you still need to understand what values to set. A link to the documentation is available at the bottom of the Properties pane, so set the parameters while consulting it.

5. Learning / evaluation

Now that we have both the model and the data, we can train and evaluate accuracy.

First, split the data for training and evaluation with Data Transformation > Sample and Split > Split (image below).

train.PNG

You might think the data has to be normalized before training, but many models let you specify the normalization method as part of the Initialize Model properties, so there is no need to normalize separately. One caveat when predicting a value, as here: if the labels are normalized for training, the predictions also come out on the normalized scale, so in principle a calculation to restore them is needed... it appears this is taken care of when normalization happens inside the model. Conversely, if you normalize via Data Transformation > Normalize Data and not inside the model, this restoration presumably will not happen.
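
To make the "restoration" point concrete, here is a scikit-learn sketch (not what Studio does internally): if you normalize the labels yourself, the predictions must be inverse-transformed back to the original scale.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical label values (pollen counts)
y_demo = np.array([[10.0], [45.0], [300.0], [1500.0]])

scaler = StandardScaler()
y_norm = scaler.fit_transform(y_demo)  # normalized labels used for training

# ...train on y_norm; the model's predictions come back on the normalized scale...
pred_norm = y_norm  # stand-in for model output

pred = scaler.inverse_transform(pred_norm)  # the "restoration" back to pollen counts
print(pred.ravel())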

In the Train Model used for training, exactly one label column (the value to predict) is set; all the other columns are passed to the model as features. The trained model flows out of Train Model into Score Model, which uses it to predict values from the evaluation data. The gap between predicted and actual values can then be seen in the final Evaluate Model. Of course, you can also do all of this in one go with Cross Validation.

After building the flow, run it with RUN as before to actually perform the training and evaluation.
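
For intuition, the same Split > Train > Score > Evaluate sequence can be sketched locally with scikit-learn (using the X and y prepared above; a sketch, not the Studio internals):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Split: hold out a quarter of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # Train Model
pred = model.predict(X_test)                      # Score Model

rmse = np.sqrt(mean_squared_error(y_test, pred))  # Evaluate Model: Root Mean Squared Error
print(rmse)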

It's an interesting result, but ...

result.PNG

The easiest number to interpret, the Root Mean Squared Error, is 146, and that is not good news: the standard deviation of the pollen counts after removing outliers is 137.4186, and the RMSE is larger than that.

That means the prediction errors are larger than the spread of the original data itself; in other words, the predictions from this model are essentially unreliable.
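
Why the standard deviation is the bar to clear: a model that always predicts the mean achieves an RMSE exactly equal to the (population) standard deviation of the labels, so a larger RMSE means losing to that trivial baseline. A quick numerical sanity check:

import numpy as np

y_check = np.random.gamma(2.0, 25.0, size=10000)               # stand-in for pollen counts
baseline_rmse = np.sqrt(np.mean((y_check - y_check.mean()) ** 2))
print(baseline_rmse, y_check.std())                            # identical: mean-predictor RMSE == std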

Visualize on the Score Model shows the tendency even more starkly. The figure below plots the predicted values (Scored Labels) against the actual pollen counts (pollen). If the accuracy were high, the two would correlate tightly, and we would see a clean line rising from lower left to upper right... but the reality is as shown.

result2.PNG

At first I thought this prediction would be easy (more pollen when it is warm, less when it rains, and so on...), but it was not. I tried tuning afterwards, but the accuracy did not improve as hoped, so I shelved that, settled for a passable model, and started preparing to expose this prediction model as a Web service.

6. Web service

There are two types of Web services: a training API and a prediction (scoring) API.

Since the flow we have been building is the training flow, it can be turned into an API just by pressing the "Publish Web Service" button at the bottom. In addition to the original data source, the flow needs an entry point for the Web API request and an exit for the response; both are added automatically when you press the button.

In fact, the flow can be switched between its normal form and its Web-API form, and the button at the bottom left toggles which form you are editing.

Normally (as until now) the switch is in the left state; flipping it shows the flow used when the Web API is accessed (right state).

switch_space.PNG

To create a predictive API, press the button below.

create_scoring_model.PNG

Pressing it generates a prediction flow with the trained model built in. It is basically a copy of the training flow, but some steps are unnecessary at prediction time, so build the prediction flow while deleting those.

This time I created the flow below. The item-definition part is carried over from the training flow. Normally, hand-entered data (via Data Input and Output > Enter Data) is fed through for testing; when accessed as a Web API, the flow receives the POSTed data and returns the predicted value.

web_api_model.PNG

After building this flow and confirming it works with RUN, publish it with "Publish Web Service", just like the training API.

You can check the access URL and API Key from the Web API page.

web_api.PNG

From there you can test the API, and sample code for C# / Python / R is even included. Using this, training and prediction can be performed via the Web API.
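
As a sketch of what such a call looks like from Python (modeled on the classic Azure ML request format; the URL, API key, and column names below are placeholders, so copy the real ones from your own Web API page):

import requests

API_URL = "https://<region>.services.azureml.net/workspaces/<ws>/services/<id>/execute?api-version=2.0"  # placeholder
API_KEY = "<your API key>"  # placeholder

# Column names must match the item definitions of the prediction flow
payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["temperature", "wind_speed", "wind_direction"],
            "Values": [["18.5", "3.2", "270"]],
        }
    },
    "GlobalParameters": {},
}

res = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": "Bearer " + API_KEY},
)
res.raise_for_status()
print(res.json())  # the response contains the Scored Labels (the predicted pollen count)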

Implementation

As decided at the outset, the application fetches the weather forecast from the API, feeds it in as input to predict the pollen count, and displays the results together.

pollen_prediction_by_azure/application.py
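
A trimmed sketch of that logic (the OpenWeatherMap endpoint and fields follow its public documentation; the exact code in application.py may differ):

import requests

OWM_KEY = "<OpenWeatherMap API key>"  # placeholder

# 1. Fetch the weather from OpenWeatherMap
weather = requests.get(
    "https://api.openweathermap.org/data/2.5/weather",
    params={"q": "Tokyo", "appid": OWM_KEY, "units": "metric"},
).json()

inputs = {
    "temperature": weather["main"]["temp"],
    "wind_speed": weather["wind"]["speed"],
    "wind_direction": weather["wind"].get("deg", 0),
}

# 2. POST these values to the prediction Web API (payload format as shown earlier)
# 3. Display the returned pollen prediction alongside the forecast
print(inputs)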

The format of the HTTP request is rather cumbersome, but in any case, the prediction model can now be used via the Web API.

With Azure Machine Learning you can build models easily, and training and prediction become available from all kinds of platforms via the Web API.

It can be used immediately with no installation hassle, so I think it is a good platform both for study and for developing a first machine learning application.
