This book: "Machine Learning Starting in the Cloud (Revised Edition)". Its strong point is that it is simple but not too shallow, and neatly organized. The following is my summary.
Example: sales forecasting in retail (analyze last year's sales count, the day of the week, the weather, whether advertisements ran, etc., and predict future sales).
$y = w_1 x_1 + w_2 x_2 + \cdots + w_m x_m + c$
y: the number of sales on the forecast date
x1 ~ xm: explanatory variables such as horsepower, fuel type, fuel consumption, wheelbase, brand value, etc.
w1 ~ wm: partial regression coefficients (weights; feature weights)
c: constant term (bias)
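As a minimal sketch of the same idea (using scikit-learn rather than ML Studio, and made-up feature values), fitting a linear regression and reading off the weights looks like this:

```python
# Minimal linear regression sketch with scikit-learn (the book itself uses ML Studio).
# The data below is hypothetical: columns are [last_year_sales, day_of_week, ad_ran].
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[120, 0, 1], [95, 1, 0], [130, 2, 1], [80, 3, 0], [140, 4, 1]])
y = np.array([125, 90, 135, 78, 150])  # sales on each forecast date

model = LinearRegression().fit(X, y)
print(model.coef_)                   # w1 ~ wm: partial regression coefficients (weights)
print(model.intercept_)              # c: constant term (bias)
print(model.predict([[110, 5, 1]]))  # y: predicted sales for a new day
```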
・MAE (Mean Absolute Error): the closer it is to 0, the better. The average of the absolute differences between predicted and correct values.
・RMSE (Root Mean Squared Error): the closer it is to 0, the better. $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$
・Coefficient of Determination: the square of the correlation coefficient between the predicted values and the correct values. The closer it is to 1, the better.
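As a sketch, these three metrics can be computed with scikit-learn (the values below are hypothetical):

```python
# Regression metrics sketch: MAE, RMSE, and the coefficient of determination.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([125, 90, 135, 78, 150])  # correct values (hypothetical)
y_pred = np.array([120, 95, 130, 80, 145])  # predicted values (hypothetical)

print(mean_absolute_error(y_true, y_pred))          # MAE: closer to 0 is better
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # RMSE: closer to 0 is better
print(r2_score(y_true, y_pred))                     # closer to 1 is better
```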
・Holdout method: randomly split the training data into a training set and an evaluation set.
・Cross validation: split the training data into k folds and evaluate k times, using a different fold for evaluation each time.
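A quick sketch of both approaches (scikit-learn, synthetic data):

```python
# Holdout split and k-fold cross validation, as a sketch on synthetic data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)

# Holdout: randomly split into a training set and an evaluation set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # evaluate on the held-out 30%

# Cross validation: split into k=5 folds and evaluate 5 times
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores.mean())
```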
・Regularization: to prevent overfitting caused by too many variables and too-large weight parameters, add a penalty value proportional to the weight parameters:

$E = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{m} w_j^2$

The sum of the squared weight parameters, scaled by a hyperparameter λ, is added to the sum of the squared errors between predicted and correct values.
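In scikit-learn terms this squared-weight penalty is ridge regression; a sketch (the alpha value is arbitrary here):

```python
# L2 regularization (ridge regression) sketch: alpha corresponds to the penalty weight λ.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=50, noise=10.0, random_state=0)

model = Ridge(alpha=1.0)  # larger alpha shrinks the weights more strongly
model.fit(X, y)
print(model.coef_[:5])    # the learned weights stay small
```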
The calculation formula is the same as linear regression, but this is a model in which the weight parameters are not constants but probability distributions. Maximum likelihood estimation alone (MLE: a method of determining the weight parameters so that the error is minimized) is inaccurate when there is little training data, so the number of times an event has actually been observed is also taken into account, via the prior and posterior distributions.
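scikit-learn's BayesianRidge is one concrete take on this idea; a sketch on synthetic data (return_std exposes the predictive uncertainty that a point-estimate regression lacks):

```python
# Bayesian linear regression sketch: weights as distributions, predictions with uncertainty.
from sklearn.datasets import make_regression
from sklearn.linear_model import BayesianRidge

X, y = make_regression(n_samples=30, n_features=3, noise=5.0, random_state=0)

model = BayesianRidge().fit(X, y)
mean, std = model.predict(X[:3], return_std=True)  # posterior predictive mean and std
print(mean)
print(std)  # uncertainty shrinks as more training data accumulates
```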
Example: in credit screening at a bank, predict ability to pay by analyzing correlated items such as occupation, annual income, deposit amount, and payment delinquency history.
Predict the probability that a particular event will occur.

$P = \dfrac{1}{1 + e^{-(w_1 x_1 + w_2 x_2 + \cdots + w_m x_m + c)}}$
x1 ~ xm: explanatory variables such as age, tumor size, tumor malignancy, menopausal status, etc.
w1 ~ wm: partial regression coefficients (weights; feature weights)
c: constant term (bias)
P: the probability; classify as positive or negative by setting a probability threshold, for example 0.5.
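A minimal sketch with scikit-learn (the tumor-style features and values are made up for illustration):

```python
# Logistic regression sketch: predict_proba gives the probability P.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [age, tumor_size, malignancy_grade]
X = np.array([[45, 1.2, 1], [62, 3.5, 3], [38, 0.8, 1], [70, 4.1, 3], [55, 2.0, 2]])
y = np.array([0, 1, 0, 1, 0])  # 0 = negative, 1 = positive

model = LogisticRegression(max_iter=1000).fit(X, y)

p = model.predict_proba([[58, 2.8, 2]])[0, 1]     # probability of the positive class
print(p, "positive" if p >= 0.5 else "negative")  # threshold at 0.5
```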
If there are classes A through E, prepare a discriminant (the formula above) for each class and assign each data point to the class showing the highest probability (one-vs-rest).
If there are classes A through E, try every one-to-one combination: A vs B, A vs C, A vs D, and so on (one-vs-one). The number of combinations is k × (k − 1) ÷ 2, which is 10 for A through E. Each data point is assigned to the class that wins the majority of the 10 votes.
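Both strategies exist as wrappers in scikit-learn; a sketch on synthetic 5-class data confirms the discriminant counts:

```python
# One-vs-rest vs. one-vs-one sketch: count the discriminants each strategy builds.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=5, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovr.estimators_))  # 5: one discriminant per class
print(len(ovo.estimators_))  # 10: 5 x (5 - 1) / 2 pairwise discriminants
```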
・Accuracy: the closer it is to 100%, the better. Beware, though: if 90% of the data is positive, a model that naively answers "positive" for everything still achieves 90% accuracy.
・True Positive Rate (TPR): the closer it is to 100%, the better. How well the model catches only the positive data.
・False Positive Rate (FPR): the closer it is to 0%, the better. How much of the negative data was mistakenly classified as positive.
・AUC (Area Under the Curve): the closer it is to 1.0, the better. There is a trade-off between the false positive rate and the true positive rate, so the two are plotted as the ROC curve, and the area under that curve is the AUC.
・Precision: the closer it is to 100%, the better. Of the data predicted to be positive, how much is actually positive.
・Recall: the closer it is to 100%, the better. Of the data that is actually positive, how much was predicted to be positive.
・F value (F1 score): the closer it is to 1.0, the better. There is also a trade-off between recall and precision; the F1 score is an index that judges them comprehensively (see the sketch below).
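All of these metrics are one-liners in scikit-learn; a sketch with hypothetical labels and scores:

```python
# Classification metrics sketch: accuracy, precision, recall, F1, and AUC.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                  # correct labels (hypothetical)
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                  # predicted labels
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # predicted probabilities

print(accuracy_score(y_true, y_pred))   # accuracy
print(precision_score(y_true, y_pred))  # precision
print(recall_score(y_true, y_pred))     # recall
print(f1_score(y_true, y_pred))         # F1 score
print(roc_auc_score(y_true, y_score))   # AUC of the ROC curve
```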
Also try methods other than logistic regression: Support Vector Machine (SVM), Decision Forest, Boosted Decision Tree, etc.
Example: a travel agency's customers are classified into groups such as nearby trips, overseas trips, and hot springs, and sales promotion materials are sent out according to each group.
Choose k center points for the desired number of clusters, and assign the remaining data points to clusters using Euclidean distance or cosine similarity.
Simple: the straight-line distance between point a and point b on the graph. With m variables:

$d(a, b) = \sqrt{\sum_{i=1}^{m}(a_i - b_i)^2}$
The closeness of two vectors' orientation: +1 for the same direction, 0 for perpendicular, −1 for opposite directions.

$\cos(a, b) = \dfrac{a \cdot b}{\|a\| \, \|b\|}$
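Both measures are a few lines of numpy; a sketch with two example points:

```python
# Euclidean distance and cosine similarity between two points with m variables.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))                        # straight-line distance
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # orientation closeness

print(euclidean)
print(cosine)  # 1.0 here, since b points in the same direction as a
```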
An improved k-means method, and now the mainstream. The initial center points are chosen to be as far apart from each other as possible. In addition, data points that are far away but few in number are ignored when choosing centers.
Since this is unsupervised learning, the analyst has no choice but to inspect the resulting clusters by eye.
・Normalization (standardization): rescale a variable x whose values are on a large scale so that the mean becomes 0 and the standard deviation becomes 1. The rescaled value is called the z-score: $z = \frac{x - \mu}{\sigma}$
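Putting standardization and k-means++ together, a sketch with scikit-learn on synthetic blob data:

```python
# z-score standardization followed by k-means++ (scikit-learn's default init).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Rescale each variable to mean 0 and standard deviation 1 (z-score)
X_scaled = StandardScaler().fit_transform(X)

# init='k-means++' spreads the initial center points far apart
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X_scaled)
print(labels[:10])  # cluster assignments; the analyst still has to inspect them
```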
Example: detect disaster precursors, such as flash floods, from the water level sensors upstream, midstream, and downstream of a river.
One-Class SVM: a density estimation algorithm. The normal range of the data is represented by a circle, and data that does not fit inside that circle is detected as abnormal. The model learns to minimize:

$R^2 + \dfrac{1}{\nu n} \sum_{i=1}^{n} \zeta_i$

R: radius of the circle
n: number of data points
ζ: how far outside the circle a data point lies
ν: the weight of the penalty, set by the analyst (the smaller the value, the more training data is included in the normal range; if ν is 0, all the training data is contained in the circle).
When the dense areas of the data lie far apart, the normal region is enclosed by a distorted curve instead of a circle.
・Precision: the larger it is, the fewer false detections. Of the data the model guessed to be "abnormal", how much is actually abnormal in the correct data.
・Recall: the larger it is, the fewer missed detections. Of the data that is actually "abnormal", how much did the model correctly judge as abnormal.
・F value (F1 score): the closer it is to 1.0, the better. There is also a trade-off between recall and precision; the F1 score is an index that judges them comprehensively.
There is a trade-off between precision and recall: decide whether you want to reduce missed detections of abnormal events or reduce false detections. Increasing ν (to 0.5, for example) narrows the normal range; conversely, setting it to something like 0.02 widens the normal range. In ML Studio, this is set with η.
Kernel functions used in the kernel trick: the RBF kernel (ML Studio's default), the polynomial kernel, and the sigmoid kernel.
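scikit-learn's OneClassSVM takes the same ν and kernel parameters; a sketch on made-up sensor-style data (note scikit-learn uses Schölkopf's hyperplane formulation rather than the circle description above, but ν plays the same penalty role):

```python
# One-Class SVM sketch: nu is the penalty weight ν, kernel='rbf' as in ML Studio's default.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(100, 2))  # hypothetical normal readings

clf = OneClassSVM(nu=0.05, kernel='rbf')  # larger nu narrows the normal range
clf.fit(X_train)

X_new = np.array([[0.1, -0.2], [5.0, 5.0]])
print(clf.predict(X_new))  # +1 = normal, -1 = abnormal
```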
・PCA-Based Anomaly Detection
・Time Series Anomaly Detection (for time-series data such as temperature transitions and stock price transitions)
Example: Amazon's "People who bought this product also bought the following products"
Recommended products are guessed from the ratings and preferences you have given and the scores given by other people. There are item-based and user-based approaches.
Item-based: recommend products that are highly similar to the products the user gave a high score.
User-based: select multiple users with high similarity to the target user and recommend the products they scored highly.
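A tiny item-based sketch with numpy (the user-item rating matrix below is hypothetical; 0 means unrated):

```python
# Item-based collaborative filtering sketch: recommend items similar to highly rated ones.
import numpy as np

# Rows = users, columns = items; values are scores, 0 = not rated (hypothetical)
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Similarity of every item to item 0, using the rating columns as vectors
sims = [cosine(R[:, 0], R[:, j]) for j in range(R.shape[1])]
print(np.round(sims, 2))  # item 1 comes out most similar to item 0
```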
Plain collaborative filtering leaves first-time users and new products out in the cold (the cold-start problem). Matchbox is Microsoft's own collaborative filtering algorithm that avoids this by also making recommendations from product and user attribute information, used together with the scores that gradually accumulate.

[Formula] (κ is the number of attributes)
For this reason, three types of training data are prepared: score (rating) data, user attribute data, and product attribute data.
・NDCG (Normalized Discounted Cumulative Gain): the closer it is to 1.0, the better. The default.
・MAE (Mean Absolute Error): the closer it is to 0, the better. The average of the absolute differences between predicted and correct values.
・RMSE (Root Mean Squared Error): the closer it is to 0, the better. $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$
Adjust the length (κ) of the feature vector. In ML Studio, this is [Number of traits] in "Train Matchbox Recommender".
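The book's formula is not reproduced here, but the core idea can be illustrated: each user and each product gets a κ-dimensional trait vector, and the predicted score is roughly their inner product. A purely illustrative numpy sketch (the vectors below are made up, not real Matchbox output):

```python
# Illustration only, NOT Microsoft's actual Matchbox implementation:
# a predicted score as the inner product of kappa-dimensional trait vectors.
import numpy as np

kappa = 4                                      # [Number of traits]
user_traits = np.array([0.8, -0.1, 0.3, 0.5])  # hypothetical learned user vector
item_traits = np.array([0.7, 0.0, 0.4, 0.6])   # hypothetical learned item vector

print(np.dot(user_traits, item_traits))        # higher = stronger recommendation
```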
Switching to production mode
In production mode:
Click [Predictive Web Service] under [SET UP WEB SERVICE].
The [Web service input] and [Web service output] modules are added.
Remove [Select Columns in Dataset] and connect [Web service input] to the input port of [Score Matchbox Recommender].
Click [Deploy web service]. The screen transitions to the web service page, and an API key is issued.
Click [REQUEST/RESPONSE] to open another window with sample code at the bottom of the screen.
Change abc123 in the code's `api_key = "abc123"` to your own API key. Then change the values in the code below; for example, set the userID to "U1048".
# Before the change
data = {
"Inputs": {
"input1":
{
"ColumnNames": ["userID"],
"Values": [ [ "value" ], [ "value" ], ]
}, },
# After the change
data = {
"Inputs": {
"input1":
{
"ColumnNames": ["userID"],
"Values": [ [ "U1048" ] ]
}, },
Save it as PythonApplication.py.
Run it from the command line and the result will be returned:

$ python PythonApplication.py
{"Results":{"output1":{"type":"table","value":{"ColumnNames":["User","Item 1","Item 2","Item 3","Item 4","Item 5"],"ColumnTypes":["String","String","String","String","String","String"],"Values":[["U1048","134986","135030","135052","135045","135025"]]}}}}
import urllib.request
import urllib.error  # HTTPError lives here in Python 3
import json

data = {
    "Inputs": {
        "input1":
        {
            "ColumnNames": ["userID"],
            "Values": [ [ "U1048" ] ]
        },
    },
    "GlobalParameters": {
    }
}

body = str.encode(json.dumps(data))

url = 'https://japaneast.services.azureml.net/workspaces/0e3c5988af4b43d7ac14fa55244b9f9d/services/53da3266168a4c8a8814e3adac2a6821/execute?api-version=2.0&details=true'
api_key = '<API key>' # Replace this with the API key for the web service
headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ api_key)}

req = urllib.request.Request(url, body, headers)

try:
    response = urllib.request.urlopen(req)
    result = response.read()
    print(result)
except urllib.error.HTTPError as error:
    print("The request failed with status code: " + str(error.code))
    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())
    print(json.loads(error.read()))
The ML Studio sample code is written for Python 2; to run it on Python 3, just change `urllib2` to `urllib.request` (and write the `except` clause with `as`), as done above.
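As a final sketch, here is how to pull the recommended item IDs out of the JSON response shown earlier (assuming that exact response shape):

```python
# Parse the web service response and list the recommended items for the user.
import json

raw = '{"Results":{"output1":{"type":"table","value":{"ColumnNames":["User","Item 1","Item 2","Item 3","Item 4","Item 5"],"ColumnTypes":["String","String","String","String","String","String"],"Values":[["U1048","134986","135030","135052","135045","135025"]]}}}}'

row = json.loads(raw)["Results"]["output1"]["value"]["Values"][0]
user, items = row[0], row[1:]
print(user, items)  # U1048 ['134986', '135030', '135052', '135045', '135025']
```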
That's all.