[PYTHON] Key points of "Machine learning with Azure ML Studio"

This is a summary of the book "Machine Learning Starting in the Cloud, Revised Edition". Its strong point is that it is simple without being too difficult, and it is neatly organized. The key points are summarized below.

Numerical forecast by regression (Sample: New car sales price forecast)

Example: Sales forecasting in retail. The number of sales in the previous year, the day of the week, the weather, whether advertisements were run, and so on are analyzed to predict future sales.

Linear regression

[Formula] $y = w_1 x_1 + w_2 x_2 + \dots + w_m x_m + c$

・ y: Number of sales on the forecast date

・ x1 ~ xm: Variables such as horsepower, fuel type, fuel consumption, wheelbase, brand value, etc.

・ w1 ~ wm: Partial regression coefficients (weights, feature weights)

・ c: Constant term (bias)
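As a minimal sketch, a linear regression prediction is the weighted sum of the variables x1 ~ xm with the weights w1 ~ wm, plus the constant term c. The function name and all numbers below are made up for illustration; they are not taken from the book.

```python
def predict_linear(x, w, c):
    """Return y = w1*x1 + ... + wm*xm + c."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return sum(wi * xi for wi, xi in zip(w, x)) + c


# Hypothetical weights for two variables (horsepower, wheelbase); illustrative only.
weights = [100.0, 50.0]
bias = 1000.0
price = predict_linear([110.0, 95.0], weights, bias)
print(price)
```

In ML Studio the weights and the constant term are learned from the training data; here they are fixed by hand only to show the arithmetic.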

Accuracy evaluation

・ MAE (Mean Absolute Error): The closer it is to 0, the better. The average of the absolute difference between the predicted value and the correct answer value.

・ RMSE (Root Mean Squared Error): The closer it is to 0, the better. The square root of the average of the squared difference between the predicted value and the correct answer value: $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$

・ Coefficient of Determination: The closer it is to 1, the better. The square of the correlation coefficient between the predicted value and the correct answer value.
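The three metrics can be computed directly from their definitions. The sketch below uses only the standard library; the function names and the sample values are assumptions for illustration, and the coefficient of determination follows the definition given here (squared correlation between predicted and correct values).

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Square of the correlation coefficient between predictions and correct values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)


# Made-up correct values and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r_squared(y_true, y_pred))
```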

Data split

・ Holdout method: Randomly divide the data into one part for training and one part for evaluation.

・ Cross validation: Divide the training data into k pieces, then train and evaluate k times, using a different piece for evaluation each time.
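Both ways of splitting can be sketched in a few lines. This is an illustration under assumptions, not how ML Studio implements its split modules: the function names, the 30% evaluation ratio, and the fixed seed are all hypothetical choices.

```python
import random


def holdout_split(rows, eval_ratio=0.3, seed=0):
    """Randomly split rows into (training, evaluation) parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_ratio)
    return shuffled[n_eval:], shuffled[:n_eval]


def k_fold_splits(rows, k):
    """Yield (training, evaluation) pairs; each fold is used for evaluation exactly once."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, folds[i]


rows = list(range(10))
train, evaluation = holdout_split(rows)
splits = list(k_fold_splits(rows, k=5))
print(len(train), len(evaluation), len(splits))
```

With cross validation, the k evaluation scores are typically averaged, which makes the estimate less dependent on one lucky or unlucky split.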

Improved accuracy

・ Regularization: To prevent overfitting caused by too many variables or overly large weight parameters, add a penalty that grows with the weight parameters. Specifically, the sum of the squares of the weight parameters is added to the sum of the squares of the errors between the predicted values and the correct answer values, and the total is minimized.

[Formula] $E = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{m} w_j^2$ (y_i: correct answer value, ŷ_i: predicted value, λ: strength of the penalty)
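The regularized objective can be written out as a small function: the sum of squared errors plus a penalty on the squared weights. The name `lam` for the penalty strength and the sample numbers are assumptions made for this sketch.

```python
def ridge_loss(x_rows, y_true, w, c, lam):
    """Sum of squared errors plus lam times the sum of squared weights."""
    squared_error = 0.0
    for x, y in zip(x_rows, y_true):
        y_hat = sum(wi * xi for wi, xi in zip(w, x)) + c
        squared_error += (y - y_hat) ** 2
    penalty = lam * sum(wi ** 2 for wi in w)
    return squared_error + penalty


# Made-up data: two rows, two variables.
x_rows = [[1.0, 2.0], [2.0, 0.0]]
y_true = [5.0, 4.0]
w, c = [2.0, 1.0], 0.0
loss_plain = ridge_loss(x_rows, y_true, w, c, lam=0.0)
loss_reg = ridge_loss(x_rows, y_true, w, c, lam=0.1)
print(loss_plain, loss_reg)
```

Because large weights raise the penalty term, a model that fits the training data only by using extreme weights scores worse than one that fits it with moderate weights.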

Bayesian linear regression

The calculation formula is the same as for linear regression. However, in this model the weight parameters are not constants but probability distributions. Maximum likelihood estimation (MLE: Maximum Likelihood Estimation, a method of determining the weight parameters so that the error is minimized) alone becomes inaccurate when the amount of training data is small, so the estimate also takes into account how many times an event has been observed (prior distribution and posterior distribution).

Classification (Sample: Positive / Negative classification from breast cancer data)

Example: In credit screening at a bank, predict the ability to repay by analyzing related items such as occupation, annual income, deposit amount, and history of late payments.

Logistic regression

Predict the probability that a particular event will occur.

[Formula] $P = \frac{1}{1 + e^{-(w_1 x_1 + w_2 x_2 + \dots + w_m x_m + c)}}$

・ x1 ~ xm: Variables such as age, tumor size, tumor malignancy, menopausal status, etc.

・ w1 ~ wm: Partial regression coefficients (weights, feature weights)

・ c: Constant term (bias)

・ P: Probability

Positive or negative is decided by setting a threshold on the probability, for example 0.5.
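The probability and the threshold decision can be sketched as follows. The weights, the constant term, and the input values are made up for illustration; in practice they come from training.

```python
import math


def predict_probability(x, w, c):
    """P = 1 / (1 + exp(-(w1*x1 + ... + wm*xm + c)))."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + c
    return 1.0 / (1.0 + math.exp(-z))


def classify(x, w, c, threshold=0.5):
    """Return 'positive' when the probability reaches the threshold."""
    return "positive" if predict_probability(x, w, c) >= threshold else "negative"


# Hypothetical weights and constant term.
w, c = [0.8, -0.5], -1.0
p = predict_probability([2.0, 1.0], w, c)
label = classify([2.0, 1.0], w, c)
print(p, label)
```

Raising the threshold makes the classifier more reluctant to answer "positive", which lowers false positives at the cost of more missed positives.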

One-vs.-rest classifier

If there are classes A to E, prepare a discriminant (the formula above) for each class and assign the data to the class that shows the highest probability.

One-to-one classifier (one-vs.-one classifier)

If there are classes A to E, try all one-to-one combinations: A-B, A-C, A-D, and so on. The number of combinations is k × (k-1) ÷ 2, which is 10 for A ~ E. The data is assigned to the class that wins the majority vote over those 10 contests.
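The pair counting and the majority vote can be sketched like this. The function `toy_winner` is a hypothetical stand-in for the trained pairwise classifiers; a real one-vs.-one classifier would call a trained model for each pair.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(classes, pairwise_winner):
    """Run every pairwise contest and return the class with the most votes."""
    votes = Counter(pairwise_winner(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


classes = ["A", "B", "C", "D", "E"]
pairs = list(combinations(classes, 2))  # k * (k - 1) / 2 pairs


def toy_winner(a, b):
    # Hypothetical rule: "C" wins every contest it takes part in;
    # otherwise the alphabetically first class wins.
    return "C" if "C" in (a, b) else min(a, b)


predicted = one_vs_one_predict(classes, toy_winner)
print(len(pairs), predicted)
```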

Accuracy evaluation

・ Accuracy: The closer it is to 100%, the better. Note that if 90% of the data is positive, even a naive model that answers "positive" for everything reaches 90% accuracy.

・ True Positive Rate (TPR): The closer it is to 100%, the better. Of the data that is actually positive, how much was correctly predicted as positive.

・ False Positive Rate (FPR): The closer it is to 0%, the better. Of the data that is actually negative, how much was mistakenly predicted as positive.

・ AUC (Area Under the Curve): The closer it is to 1.0, the better. There is a trade-off between the false positive rate and the true positive rate. The ROC curve plots the true positive rate against the false positive rate, and AUC is the area under that curve.

・ Precision: The closer it is to 100%, the better. Of the data predicted as positive, how much is actually positive.

・ Recall: The closer it is to 100%, the better. Of the data that is actually positive, how much was predicted as positive.

・ F value (F1 score): The closer it is to 1.0, the better. There is also a trade-off between recall and precision; this index judges the two together. $F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
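Most of these metrics follow from four counts: true positives, false positives, true negatives, and false negatives. The sketch below assumes labels encoded as 1 (positive) and 0 (negative) and does not guard against division by zero; the labels themselves are made up.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, TPR (recall), FPR, precision and F1 from 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # same as the true positive rate
    return {
        "accuracy": (tp + tn) / len(y_true),
        "tpr": recall,
        "fpr": fp / (fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Made-up labels: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```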

Improved accuracy

Try methods other than logistic regression: Support Vector Machine (SVM), Decision Forest, Boosted Decision Tree, etc.

Clustering (Sample: Classification of irises)

Example: Travel agency customers are classified into groups such as nearby destinations, overseas, and hot springs, and sales promotion materials are distributed to each group accordingly.

k-means method

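Choose k data points as center points, where k is the desired number of clusters, and assign every other data point to the cluster of its nearest center, using Euclidean distance or cosine similarity. The center points are then recomputed and the assignment is repeated.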
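A bare-bones version of this procedure, using Euclidean distance, might look like the sketch below. It assumes the initial centers are supplied by the caller, runs a fixed number of iterations instead of testing for convergence, and uses made-up points; `math.dist` requires Python 3.8 or later.

```python
import math


def k_means(points, centers, iterations=10):
    """Plain k-means with Euclidean distance, starting from the given centers."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Two made-up groups of 2D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = k_means(points, centers=[(1.0, 1.0), (8.0, 8.0)])
print(centers)
```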

Euclidean distance

Simple: the straight-line distance between point a and point b on a graph. [Formula] when there are m variables: $d(a, b) = \sqrt{\sum_{i=1}^{m}(a_i - b_i)^2}$

Cosine similarity

The closeness of the directions of two vectors: +1 for the same direction, 0 for orthogonal directions, -1 for opposite directions. [Formula] $\cos(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$
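Both measures are short enough to write out by hand. The sketch below uses only the standard library; the function names are chosen for this example, and the cosine function assumes neither vector is all zeros.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between points a and b with m variables each."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai ** 2 for ai in a))
    norm_b = math.sqrt(sum(bi ** 2 for bi in b))
    return dot / (norm_a * norm_b)


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Euclidean distance is sensitive to the scale of each variable, while cosine similarity ignores vector length and looks only at direction.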

k-means ++ method

An improved version of the k-means method, and the mainstream choice. The initial center points of the clusters are chosen so that they are as far apart from each other as possible. In addition, small groups of data that lie far away from the rest are ignored.

Accuracy evaluation

Since it is unsupervised learning, there is no correct answer to score against; the analyst has no choice but to inspect the resulting clusters carefully.

Improved accuracy

・ Normalization: Scale a variable x that has a large value range so that its mean is 0 and its standard deviation is 1. The result is called the z-score.

[Formula] $z = \frac{x - \mu}{\sigma}$ (μ: mean of x, σ: standard deviation of x)
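The z-score scaling can be sketched with the standard library. This example assumes the population standard deviation (`statistics.pstdev`); whether ML Studio uses the population or the sample version is not stated here, and the input values are made up.

```python
import statistics


def z_scores(values):
    """Scale values so that their mean is 0 and their standard deviation is 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mean) / std for v in values]


scaled = z_scores([10.0, 20.0, 30.0, 40.0, 50.0])
print(scaled)
```

Without this step, a variable measured in large units dominates the Euclidean distance and therefore the clustering result.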

Anomaly detection (Sample: Abnormal payment detection from credit card usage data)

Example: Detect disaster precursors such as flash floods from upstream, midstream, and downstream water level sensors of rivers.

One-Class SVM

A density estimation algorithm. The normal range of the data is represented by a circle, and data that does not fit inside the circle is detected as abnormal. It learns by minimizing the value of the following formula.

[Formula] $R^2 + \frac{1}{\nu n}\sum_{i=1}^{n} \zeta_i$

・ R: Radius of the circle

・ n: Number of data points

・ ζ: The amount by which a data point lies outside the circle

・ ν: The weight of the penalty, set by the analyst. The smaller the value, the more training data is included in the normal range; as ν approaches 0, all the training data is contained in the circle.

Kernel trick

When the dense areas of the data are far apart, a single circle fits poorly, so the normal area is enclosed by a distorted curve instead.

Accuracy evaluation

・ Precision: The higher the precision, the fewer false detections occur. Of the data that the anomaly detector predicted as "abnormal", how much is actually "abnormal".

・ Recall: The higher the recall, the fewer detection omissions occur. Of the data that is actually "abnormal", how much was correctly predicted as "abnormal".

・ F value (F1 score): The closer it is to 1.0, the better. There is also a trade-off between recall and precision; this index judges the two together.

Improved accuracy

There is a trade-off between precision and recall, so decide whether you want to reduce missed detections of abnormal events or reduce false positives. Increasing ν (for example, to 0.5) narrows the normal range. Conversely, a small value such as 0.02 widens the normal range. In ML Studio, this is set with η.

Try changing the kernel function

Kernel functions used for the kernel trick:

・ RBF kernel (ML Studio default)

・ Polynomial kernel

・ Sigmoid kernel

Try changing the anomaly detection method

・ PCA-Based Anomaly Detection

・ Time Series Anomaly Detection (for time series data such as temperature transitions and stock price transitions)

Recommendation (Sample: Present recommended restaurants to users based on restaurant evaluation data)

Example: Amazon's "People who bought this product also bought the following products"

Collaborative filtering

Infer recommended products by using the ratings and preferences given by the user together with the scores given by other users. There are item-based and user-based approaches.

Item-based recommendations

Recommend products that are highly similar to the products to which the user gave a high score.

User-based recommendations

Select multiple users who are highly similar to the target user and recommend the products to which they gave high scores.

MatchBox

Collaborative filtering leaves first-time users and new products out in the cold (the cold-start problem). MatchBox is Microsoft's own algorithm that makes recommendations based on product and user attribute information, and it also uses the scores that gradually accumulate. [Formula] image.png κ is the number of attributes.

Therefore, three types of training data are prepared: score data, user attribute data, and product attribute data.

Accuracy evaluation

・ NDCG (Normalized Discounted Cumulative Gain): The closer it is to 1.0, the better. This is the default metric.

・ MAE (Mean Absolute Error): The closer it is to 0, the better. The average of the absolute difference between the predicted value and the correct answer value.

・ RMSE (Root Mean Squared Error): The closer it is to 0, the better. The square root of the average of the squared difference between the predicted value and the correct answer value.
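To give a feel for NDCG, the sketch below uses one common definition: each item's relevance is divided by log2(rank + 1), and the total is normalized by the best possible ordering. This variant is an assumption; the exact formula that ML Studio applies is not given here, and the relevance values are made up.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each relevance is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


best_order = ndcg([3.0, 2.0, 1.0])   # most relevant item ranked first
worst_order = ndcg([1.0, 2.0, 3.0])  # most relevant item ranked last
print(best_order, worst_order)
```

Unlike MAE and RMSE, NDCG rewards putting the most relevant items near the top of the list rather than predicting each score exactly.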

Improved accuracy

Adjust the length (κ) of the feature vector. In ML Studio, this is [Number of traits] in "Train Matchbox Recommender".

Practical realization

Switch to production mode

  1. Change [Recommended item selection] of Score Matchbox Recommender to [From All Items]. This puts it in production mode.
  2. Since the input port only accepts user IDs, add "Select Columns in Dataset" from Manipulation and set it to output only user IDs.
  3. Click RUN.

Publishing the web service

In production mode:

  1. Click [Predictive Web Service] from [SET UP WEB SERVICE].

  2. The [Web service input] and [Web service output] modules are added.

  3. Remove [Select Columns in Dataset] and connect [Web service input] to the input port of [Score Matchbox Recommender].

  4. Click Deploy web service. The screen changes and an API key is issued.

  5. Click REQUEST / RESPONSE to open another window; sample code is shown at the bottom of the screen.

  6. Replace abc123 in the line `api_key = "abc123"` with your own API key. Then change the values in the code below to, for example, the UserID "U1048".

# Before the change
data = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["userID"],
            "Values": [["value"], ["value"]],
        },
    },
    "GlobalParameters": {},
}

# After the change
data = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["userID"],
            "Values": [["U1048"]],
        },
    },
    "GlobalParameters": {},
}

Save it as PythonApplication.py.

  7. Run `python PythonApplication.py` from the command line and the result is returned.
$ python PythonApplication.py 
{"Results":{"output1":{"type":"table","value":{"ColumnNames":["User","Item 1","Item 2","Item 3","Item 4","Item 5"],"ColumnTypes":["String","String","String","String","String","String"],"Values":[["U1048","134986","135030","135052","135045","135025"]]}}}}

  8. As of November 2019, the sample code provided is written for Python 2. For Python 3, rewrite it as follows.
import json
import urllib.error
import urllib.request

# Request body: ask for recommendations for user "U1048"
data = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["userID"],
            "Values": [["U1048"]],
        },
    },
    "GlobalParameters": {},
}

body = str.encode(json.dumps(data))

url = 'https://japaneast.services.azureml.net/workspaces/0e3c5988af4b43d7ac14fa55244b9f9d/services/53da3266168a4c8a8814e3adac2a6821/execute?api-version=2.0&details=true'
api_key = '<API key>'  # Replace this with the API key for the web service
headers = {'Content-Type': 'application/json', 'Authorization': ('Bearer ' + api_key)}

# In Python 3, urllib2.Request and urllib2.urlopen live in urllib.request
req = urllib.request.Request(url, body, headers)

try:
    response = urllib.request.urlopen(req)
    result = response.read()
    print(result)
except urllib.error.HTTPError as error:
    print("The request failed with status code: " + str(error.code))

    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())

    print(json.loads(error.read()))

In short, change `urllib2` to `urllib.request` and write the exception clause as `except urllib.error.HTTPError as error:`.
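The JSON returned by the web service can be unpacked into a plain list of recommended item IDs. The sketch below parses the response text from the sample run shown earlier; the variable names are chosen for this example, and it assumes the response always has this table shape with a "User" column first.

```python
import json

# Response text copied from the sample run of PythonApplication.py.
raw = (
    '{"Results":{"output1":{"type":"table","value":{'
    '"ColumnNames":["User","Item 1","Item 2","Item 3","Item 4","Item 5"],'
    '"ColumnTypes":["String","String","String","String","String","String"],'
    '"Values":[["U1048","134986","135030","135052","135045","135025"]]}}}}'
)

table = json.loads(raw)["Results"]["output1"]["value"]

# Pair each column name with the value in the first (and only) row.
row = dict(zip(table["ColumnNames"], table["Values"][0]))
user = row.pop("User")
recommended = list(row.values())
print(user, recommended)
```

In the Python 3 script, `response.read()` returns bytes; `json.loads` accepts bytes directly, so `json.loads(result)` can replace `json.loads(raw)` there.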


That's all.
