This book: "Machine Learning Starting in the Cloud (Revised Edition)". Its strong point is that it is simple but not too shallow, and neatly organized. The following is my summary.
Example: sales forecasting in retail (analyze last year's sales count, the day of the week, the weather, whether advertisements ran, etc., and predict future sales).
$y = w_1 x_1 + w_2 x_2 + \cdots + w_m x_m + c$
y: the number of sales on the forecast date
x1 ~ xm: explanatory variables such as horsepower, fuel type, fuel consumption, wheelbase, brand value, etc.
w1 ~ wm: partial regression coefficients (weights; feature weights)
c: constant term (bias)
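As a minimal sketch of the same idea (using scikit-learn rather than ML Studio, and made-up feature values), fitting a linear regression and reading off the weights looks like this:

```python
# Minimal linear regression sketch with scikit-learn (the book itself uses ML Studio).
# The data below is hypothetical: columns are [last_year_sales, day_of_week, ad_ran].
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[120, 0, 1], [95, 1, 0], [130, 2, 1], [80, 3, 0], [140, 4, 1]])
y = np.array([125, 90, 135, 78, 150])  # sales on each forecast date

model = LinearRegression().fit(X, y)
print(model.coef_)                   # w1 ~ wm: partial regression coefficients (weights)
print(model.intercept_)              # c: constant term (bias)
print(model.predict([[110, 5, 1]]))  # y: predicted sales for a new day
```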
・MAE (Mean Absolute Error): the closer it is to 0, the better. The average of the absolute differences between predicted and correct values.
・RMSE (Root Mean Squared Error): the closer it is to 0, the better. $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$
・Coefficient of Determination: the square of the correlation coefficient between the predicted values and the correct values. The closer it is to 1, the better.
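As a sketch, these three metrics can be computed with scikit-learn (the values below are hypothetical):

```python
# Regression metrics sketch: MAE, RMSE, and the coefficient of determination.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([125, 90, 135, 78, 150])  # correct values (hypothetical)
y_pred = np.array([120, 95, 130, 80, 145])  # predicted values (hypothetical)

print(mean_absolute_error(y_true, y_pred))          # MAE: closer to 0 is better
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # RMSE: closer to 0 is better
print(r2_score(y_true, y_pred))                     # closer to 1 is better
```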
・Holdout method: randomly split the training data into a training set and an evaluation set.
・Cross validation: split the training data into k folds and evaluate k times, using a different fold for evaluation each time.
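A quick sketch of both approaches (scikit-learn, synthetic data):

```python
# Holdout split and k-fold cross validation, as a sketch on synthetic data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)

# Holdout: randomly split into a training set and an evaluation set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # evaluate on the held-out 30%

# Cross validation: split into k=5 folds and evaluate 5 times
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores.mean())
```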
・Regularization: to prevent overfitting caused by too many variables and too-large weight parameters, add a penalty value proportional to the weight parameters:

$E = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{m} w_j^2$

The sum of the squared weight parameters, scaled by a hyperparameter λ, is added to the sum of the squared errors between predicted and correct values.
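In scikit-learn terms this squared-weight penalty is ridge regression; a sketch (the alpha value is arbitrary here):

```python
# L2 regularization (ridge regression) sketch: alpha corresponds to the penalty weight λ.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=50, noise=10.0, random_state=0)

model = Ridge(alpha=1.0)  # larger alpha shrinks the weights more strongly
model.fit(X, y)
print(model.coef_[:5])    # the learned weights stay small
```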
The calculation formula is the same as linear regression, but this is a model in which the weight parameters are not constants but probability distributions. Maximum likelihood estimation alone (MLE: a method of determining the weight parameters so that the error is minimized) is inaccurate when there is little training data, so the number of times an event has actually been observed is also taken into account, via the prior and posterior distributions.
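scikit-learn's BayesianRidge is one concrete take on this idea; a sketch on synthetic data (return_std exposes the predictive uncertainty that a point-estimate regression lacks):

```python
# Bayesian linear regression sketch: weights as distributions, predictions with uncertainty.
from sklearn.datasets import make_regression
from sklearn.linear_model import BayesianRidge

X, y = make_regression(n_samples=30, n_features=3, noise=5.0, random_state=0)

model = BayesianRidge().fit(X, y)
mean, std = model.predict(X[:3], return_std=True)  # posterior predictive mean and std
print(mean)
print(std)  # uncertainty shrinks as more training data accumulates
```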
Example: in credit screening at a bank, predict ability to pay by analyzing correlated items such as occupation, annual income, deposit amount, and payment delinquency history.
Predict the probability that a particular event will occur.

$P = \dfrac{1}{1 + e^{-(w_1 x_1 + w_2 x_2 + \cdots + w_m x_m + c)}}$
x1 ~ xm: explanatory variables such as age, tumor size, tumor malignancy, menopausal status, etc.
w1 ~ wm: partial regression coefficients (weights; feature weights)
c: constant term (bias)
P: the probability; classify as positive or negative by setting a probability threshold, for example 0.5.
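A minimal sketch with scikit-learn (the tumor-style features and values are made up for illustration):

```python
# Logistic regression sketch: predict_proba gives the probability P.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [age, tumor_size, malignancy_grade]
X = np.array([[45, 1.2, 1], [62, 3.5, 3], [38, 0.8, 1], [70, 4.1, 3], [55, 2.0, 2]])
y = np.array([0, 1, 0, 1, 0])  # 0 = negative, 1 = positive

model = LogisticRegression(max_iter=1000).fit(X, y)

p = model.predict_proba([[58, 2.8, 2]])[0, 1]     # probability of the positive class
print(p, "positive" if p >= 0.5 else "negative")  # threshold at 0.5
```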
If there are classes A through E, prepare a discriminant (the formula above) for each class and assign each data point to the class showing the highest probability (one-vs-rest).
If there are classes A through E, try every one-to-one combination: A vs B, A vs C, A vs D, and so on (one-vs-one). The number of combinations is k × (k − 1) ÷ 2, which is 10 for A through E. Each data point is assigned to the class that wins the majority of the 10 votes.
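Both strategies exist as wrappers in scikit-learn; a sketch on synthetic 5-class data confirms the discriminant counts:

```python
# One-vs-rest vs. one-vs-one sketch: count the discriminants each strategy builds.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=5, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovr.estimators_))  # 5: one discriminant per class
print(len(ovo.estimators_))  # 10: 5 x (5 - 1) / 2 pairwise discriminants
```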
・Accuracy: the closer it is to 100%, the better. Beware, though: if 90% of the data is positive, a model that naively answers "positive" for everything still achieves 90% accuracy.
・True Positive Rate (TPR): the closer it is to 100%, the better. How well the model catches only the positive data.
・False Positive Rate (FPR): the closer it is to 0%, the better. How much of the negative data was mistakenly classified as positive.
・AUC (Area Under the Curve): the closer it is to 1.0, the better. There is a trade-off between the false positive rate and the true positive rate, so the two are plotted as the ROC curve, and the area under that curve is the AUC.
・Precision: the closer it is to 100%, the better. Of the data predicted to be positive, how much is actually positive.
・Recall: the closer it is to 100%, the better. Of the data that is actually positive, how much was predicted to be positive.
・F value (F1 score): the closer it is to 1.0, the better. There is also a trade-off between recall and precision; the F1 score is an index that judges them comprehensively (see the sketch below).
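All of these metrics are one-liners in scikit-learn; a sketch with hypothetical labels and scores:

```python
# Classification metrics sketch: accuracy, precision, recall, F1, and AUC.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                  # correct labels (hypothetical)
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                  # predicted labels
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # predicted probabilities

print(accuracy_score(y_true, y_pred))   # accuracy
print(precision_score(y_true, y_pred))  # precision
print(recall_score(y_true, y_pred))     # recall
print(f1_score(y_true, y_pred))         # F1 score
print(roc_auc_score(y_true, y_score))   # AUC of the ROC curve
```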
Also try methods other than logistic regression: Support Vector Machine (SVM), Decision Forest, Boosted Decision Tree, etc.
Example: a travel agency's customers are classified into groups such as nearby trips, overseas trips, and hot springs, and sales promotion materials are sent out according to each group.
Choose k center points for the desired number of clusters, and assign the remaining data points to clusters using Euclidean distance or cosine similarity.
Simple: the straight-line distance between point a and point b on the graph. With m variables:

$d(a, b) = \sqrt{\sum_{i=1}^{m}(a_i - b_i)^2}$
The closeness of two vectors' orientation: +1 for the same direction, 0 for perpendicular, −1 for opposite directions.

$\cos(a, b) = \dfrac{a \cdot b}{\|a\| \, \|b\|}$
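Both measures are a few lines of numpy; a sketch with two example points:

```python
# Euclidean distance and cosine similarity between two points with m variables.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))                        # straight-line distance
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # orientation closeness

print(euclidean)
print(cosine)  # 1.0 here, since b points in the same direction as a
```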
An improved k-means method, and now the mainstream. The initial center points are chosen to be as far apart from each other as possible. In addition, data points that are far away but few in number are ignored when choosing centers.
Since this is unsupervised learning, the analyst has no choice but to inspect the resulting clusters by eye.
・Normalization (standardization): rescale a variable x whose values are on a large scale so that the mean becomes 0 and the standard deviation becomes 1. The rescaled value is called the z-score: $z = \frac{x - \mu}{\sigma}$
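Putting standardization and k-means++ together, a sketch with scikit-learn on synthetic blob data:

```python
# z-score standardization followed by k-means++ (scikit-learn's default init).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Rescale each variable to mean 0 and standard deviation 1 (z-score)
X_scaled = StandardScaler().fit_transform(X)

# init='k-means++' spreads the initial center points far apart
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X_scaled)
print(labels[:10])  # cluster assignments; the analyst still has to inspect them
```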
Example: detect disaster precursors, such as flash floods, from the water level sensors upstream, midstream, and downstream of a river.
One-Class SVM: a density estimation algorithm. The normal range of the data is represented by a circle, and data that does not fit inside that circle is detected as abnormal. The model learns to minimize:

$R^2 + \dfrac{1}{\nu n} \sum_{i=1}^{n} \zeta_i$

R: radius of the circle
n: number of data points
ζ: how far outside the circle a data point lies
ν: the weight of the penalty, set by the analyst (the smaller the value, the more training data is included in the normal range; if ν is 0, all the training data is contained in the circle).
When the dense areas of the data lie far apart, the normal region is enclosed by a distorted curve instead of a circle.
・Precision: the larger it is, the fewer false detections. Of the data the model guessed to be "abnormal", how much is actually abnormal in the correct data.
・Recall: the larger it is, the fewer missed detections. Of the data that is actually "abnormal", how much did the model correctly judge as abnormal.
・F value (F1 score): the closer it is to 1.0, the better. There is also a trade-off between recall and precision; the F1 score is an index that judges them comprehensively.
There is a trade-off between precision and recall: decide whether you want to reduce missed detections of abnormal events or reduce false detections. Increasing ν (to 0.5, for example) narrows the normal range; conversely, setting it to something like 0.02 widens the normal range. In ML Studio, this is set with η.
Kernel functions used in the kernel trick: the RBF kernel (ML Studio's default), the polynomial kernel, and the sigmoid kernel.
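scikit-learn's OneClassSVM takes the same ν and kernel parameters; a sketch on made-up sensor-style data (note scikit-learn uses Schölkopf's hyperplane formulation rather than the circle description above, but ν plays the same penalty role):

```python
# One-Class SVM sketch: nu is the penalty weight ν, kernel='rbf' as in ML Studio's default.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(100, 2))  # hypothetical normal readings

clf = OneClassSVM(nu=0.05, kernel='rbf')  # larger nu narrows the normal range
clf.fit(X_train)

X_new = np.array([[0.1, -0.2], [5.0, 5.0]])
print(clf.predict(X_new))  # +1 = normal, -1 = abnormal
```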
・PCA-Based Anomaly Detection
・Time Series Anomaly Detection (for time-series data such as temperature transitions and stock price transitions)
Example: Amazon's "People who bought this product also bought the following products"
Recommended products are guessed from the ratings and preferences you have given and the scores given by other people. There are item-based and user-based approaches.
Item-based: recommend products that are highly similar to the products the user gave a high score.
User-based: select multiple users with high similarity to the target user and recommend the products they scored highly.
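A tiny item-based sketch with numpy (the user-item rating matrix below is hypothetical; 0 means unrated):

```python
# Item-based collaborative filtering sketch: recommend items similar to highly rated ones.
import numpy as np

# Rows = users, columns = items; values are scores, 0 = not rated (hypothetical)
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Similarity of every item to item 0, using the rating columns as vectors
sims = [cosine(R[:, 0], R[:, j]) for j in range(R.shape[1])]
print(np.round(sims, 2))  # item 1 comes out most similar to item 0
```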
Plain collaborative filtering leaves first-time users and new products out in the cold (the cold-start problem). Matchbox is Microsoft's own collaborative filtering algorithm that avoids this by also making recommendations from product and user attribute information, used together with the scores that gradually accumulate.

[Formula] (κ is the number of attributes)
For this reason, three types of training data are prepared: score (rating) data, user attribute data, and product attribute data.
・NDCG (Normalized Discounted Cumulative Gain): the closer it is to 1.0, the better. The default.
・MAE (Mean Absolute Error): the closer it is to 0, the better. The average of the absolute differences between predicted and correct values.
・RMSE (Root Mean Squared Error): the closer it is to 0, the better. $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$
Adjust the length (κ) of the feature vector. In ML Studio, this is [Number of traits] in "Train Matchbox Recommender".
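The book's formula is not reproduced here, but the core idea can be illustrated: each user and each product gets a κ-dimensional trait vector, and the predicted score is roughly their inner product. A purely illustrative numpy sketch (the vectors below are made up, not real Matchbox output):

```python
# Illustration only, NOT Microsoft's actual Matchbox implementation:
# a predicted score as the inner product of kappa-dimensional trait vectors.
import numpy as np

kappa = 4                                      # [Number of traits]
user_traits = np.array([0.8, -0.1, 0.3, 0.5])  # hypothetical learned user vector
item_traits = np.array([0.7, 0.0, 0.4, 0.6])   # hypothetical learned item vector

print(np.dot(user_traits, item_traits))        # higher = stronger recommendation
```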
Switching to production mode
In production mode:
Click [Predictive Web Service] under [SET UP WEB SERVICE].
The [Web service input] and [Web service output] modules are added.
Remove [Select Columns in Dataset] and connect [Web service input] to the input port of [Score Matchbox Recommender].
Click [Deploy web service]. The screen transitions to the web service page, and an API key is issued.
Click [REQUEST/RESPONSE] to open another window with sample code at the bottom of the screen.
Change abc123 in the code's `api_key = "abc123"` to your own API key. Then change the values in the code below; for example, set the userID to "U1048".
# Before the change
data = {
"Inputs": {
"input1":
{
"ColumnNames": ["userID"],
"Values": [ [ "value" ], [ "value" ], ]
}, },
# After the change
data = {
"Inputs": {
"input1":
{
"ColumnNames": ["userID"],
"Values": [ [ "U1048" ] ]
}, },
Save it as PythonApplication.py.
Run it from the command line and the result will be returned:

$ python PythonApplication.py
{"Results":{"output1":{"type":"table","value":{"ColumnNames":["User","Item 1","Item 2","Item 3","Item 4","Item 5"],"ColumnTypes":["String","String","String","String","String","String"],"Values":[["U1048","134986","135030","135052","135045","135025"]]}}}}
import urllib.request
import urllib.error  # HTTPError lives here in Python 3
import json

data = {
    "Inputs": {
        "input1":
        {
            "ColumnNames": ["userID"],
            "Values": [ [ "U1048" ] ]
        },
    },
    "GlobalParameters": {
    }
}

body = str.encode(json.dumps(data))

url = 'https://japaneast.services.azureml.net/workspaces/0e3c5988af4b43d7ac14fa55244b9f9d/services/53da3266168a4c8a8814e3adac2a6821/execute?api-version=2.0&details=true'
api_key = '<API key>' # Replace this with the API key for the web service
headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ api_key)}

req = urllib.request.Request(url, body, headers)

try:
    response = urllib.request.urlopen(req)
    result = response.read()
    print(result)
except urllib.error.HTTPError as error:
    print("The request failed with status code: " + str(error.code))
    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())
    print(json.loads(error.read()))
The ML Studio sample code is written for Python 2; to run it on Python 3, just change `urllib2` to `urllib.request` (and write the `except` clause with `as`), as done above.
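As a final sketch, here is how to pull the recommended item IDs out of the JSON response shown earlier (assuming that exact response shape):

```python
# Parse the web service response and list the recommended items for the user.
import json

raw = '{"Results":{"output1":{"type":"table","value":{"ColumnNames":["User","Item 1","Item 2","Item 3","Item 4","Item 5"],"ColumnTypes":["String","String","String","String","String","String"],"Values":[["U1048","134986","135030","135052","135045","135025"]]}}}}'

row = json.loads(raw)["Results"]["output1"]["value"]["Values"][0]
user, items = row[0], row[1:]
print(user, items)  # U1048 ['134986', '135030', '135052', '135045', '135025']
```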
That's all.