[PYTHON] "First time Kikaigakushu" that even university students with no knowledge can do --AWS SageMaker

Writer's background

Amazon? Ah, that's the online shopping app, I know, I know. Until recently, that was the extent of my knowledge about Amazon. Naturally, I had never heard of AWS.

Embarrassing for an undergraduate, I know.

However, in my third year of university I was assigned to a laboratory and learned about AWS there. Around the same time I happened to be hooked on "Atsuhiko Nakata's YouTube University," and one of the videos left me with a vague awareness that a cloud service called AWS existed.

☟ The video through which I first became aware of AWS. The blockchain technology it covers was also very interesting. [Economy] The ultimate weapon of the 5G era, "blockchain" - Part 1 - A major invention that will change the future of humankind! -- Atsuhiko Nakata's YouTube University

That is how I got to know AWS. Even with no knowledge, it seemed like I could do all sorts of things with it, so I decided to actually try using it.

Amazon SageMaker

I wanted to do machine learning on AWS, so I decided to use a service called Amazon SageMaker.

SageMaker is a fully managed, end-to-end machine learning service announced and released at re:Invent 2017. It provides services for managing the machine learning model development process and takes over the complicated and tedious parts of that process. It not only lowers the barrier for engineers who want to start machine learning, but also enables data scientists, AI engineers, and machine learning experts to quickly build models for scalable training and rapid release (deployment).

In other words, Amazon SageMaker is a service that lets you perform machine learning easily. As a beginner, I am very grateful for that.

Overview of SageMaker

SageMaker consists of three modules: "authoring," "training," and "hosting."

**Authoring** The so-called dataset preprocessing stage. It is often said that 90% of machine learning is preprocessing the dataset, so this is an important step. With SageMaker, a Jupyter Notebook can easily be set up and used in the cloud on a CPU-based or GPU-based instance, depending on the use case.

**Training** You can train your model using the built-in algorithms provided by SageMaker, a deep learning framework, or your own training environment supplied as a Docker container. The generated model is saved to S3. This model can be hosted on SageMaker as is, or taken outside AWS and deployed to IoT devices.

**Hosting** An HTTPS endpoint is provided so that the built model can be used for real-time inference.
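As a rough illustration of what "using the model in real time" means, here is a minimal sketch (not part of the tutorial) of calling a deployed endpoint with boto3. The endpoint name and payload are hypothetical; the tutorial below uses the SageMaker Python SDK instead.

import boto3

# Minimal sketch: call a hosted model over its HTTPS endpoint.
# "my-xgboost-endpoint" and the CSV payload are hypothetical placeholders.
runtime = boto3.client('sagemaker-runtime')

response = runtime.invoke_endpoint(
    EndpointName='my-xgboost-endpoint',
    ContentType='text/csv',
    Body='56,1,0,0,1,0'
)
print(response['Body'].read().decode('utf-8'))  # the prediction returned by the model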

SageMaker tutorial

That said, reality is not so kind that you can do machine learning just because you vaguely want to!

Because doing machine learning on your own requires:

**① A dataset to use for machine learning**
**② Knowledge of the analysis method and related areas needed to build a model**
**③ Knowledge of Python and the libraries needed to preprocess the data beforehand**

Well, I was in trouble. I had lightheartedly resolved to try machine learning, but I had nothing in particular I wanted to analyze, and since I had skipped over programming until now, I had almost no knowledge of Python either.

**However**, SageMaker has tutorials: a dataset is provided, and the libraries to use and the steps for writing the code are explained in detail.

Simply put, by using SageMaker and following a tutorial, you can experience the whole machine learning workflow even with zero knowledge, and also get a practical feel for the analysis method.

Predicting prospective bank time-deposit customers using XGBoost

Now, let me outline the tutorial I actually worked through this time.

Tutorial used this time (published on GitHub) ☟ Targeting Direct Marketing with Amazon SageMaker XGBoost

This tutorial is written entirely in English, but someone has walked through it in Japanese, so I also referred to that page. ☟ Predict potential customers for bank time deposits using Amazon SageMaker [SageMaker + XGBoost Machine Learning Beginner Tutorial] - codExa

The dataset used this time comes from a Portuguese bank's direct marketing of time deposits over the phone. It contains data such as each customer's age, occupation, and educational background, and, as the outcome of the direct marketing, a label indicating whether the customer applied for a time deposit (label = 1) or did not (label = 0).

Apparently it is a very famous beginner dataset that any machine learning engineer has heard of at least once. (Of course, it was the first time I had heard of it.)

This dataset is analyzed using a learning method called XGBoost to predict whether direct marketing to a customer will succeed. The general flow of the tutorial is then to compare the predictions with the actual data and verify how accurate they were.

Overview of XGBoost

Here I will outline "XGBoost," the analysis method used to build the model.

XGBoost is a well-known and efficient open source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler, weaker models. XGBoost has been very successful in machine learning competitions such as Kaggle because it robustly handles a wide variety of data types, relationships, and distributions, and offers a large number of hyperparameters that can be switched and tuned to fit different needs.

In short, I interpreted this as "XGBoost is the strongest supervised learning method!" Although recently it seems to be losing that title to a method called LightGBM, lol.
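Just to get a feel for what XGBoost itself does before wrapping it in SageMaker, here is a minimal local sketch using the open source xgboost package on made-up synthetic data; the hyperparameters mirror the ones used later in the tutorial, but the data is purely illustrative.

import numpy as np
import xgboost as xgb

# Synthetic binary classification problem: 1,000 samples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # simple rule for the model to learn

dtrain = xgb.DMatrix(X[:800], label=y[:800])
dvalid = xgb.DMatrix(X[800:], label=y[800:])

params = {'objective': 'binary:logistic', 'max_depth': 5, 'eta': 0.2}
model = xgb.train(params, dtrain, num_boost_round=100,
                  evals=[(dvalid, 'validation')], verbose_eval=False)

pred = model.predict(dvalid)  # predicted probabilities between 0 and 1
print('validation accuracy:', ((pred > 0.5) == y[800:]).mean())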

What is supervised learning? Introducing other learning methods

I mentioned above that XGBoost is supervised learning, but I wasn't sure what "supervised learning" actually meant, so I did a little research.

There are three methods of machine learning: "supervised learning," "unsupervised learning," and "reinforcement learning."

**Supervised learning**

Supervised learning is a method in which a computer learns using a model built from data whose correct labels or values are already known. It is the simplest machine learning approach, and is characterized by easily producing results close to the predictions a human would make in classification and prediction tasks. Examples of use include market forecasting in trading and identifying clients who are likely to buy a product.
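For illustration only, here is a tiny supervised learning sketch with scikit-learn; the customer data and labels are made up, but it shows the pattern of learning from known answers and then predicting.

from sklearn.linear_model import LogisticRegression

# Toy labeled data (made up): features = [age, income], label = 1 if the customer bought
X_train = [[25, 300], [40, 650], [35, 500], [22, 250], [50, 800], [30, 320]]
y_train = [0, 1, 1, 0, 1, 0]

model = LogisticRegression()
model.fit(X_train, y_train)        # learn from the known "correct answers"

print(model.predict([[45, 700]]))  # predict the label for an unseen customer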

**Unsupervised learning**

Unsupervised learning is a method that finds groups with common characteristics and extracts information that characterizes the data from input data that has no correct labels. Clustering is a typical example: it automatically finds data points with similar characteristics and divides them into several groups.
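Again purely for illustration, a minimal clustering sketch with scikit-learn; no labels are given, and k-means groups the made-up points by similarity on its own.

import numpy as np
from sklearn.cluster import KMeans

# Two unlabeled "blobs" of points; no correct answers are provided
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(3, 0.5, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])  # cluster assignments found by the model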

**Reinforcement learning**

Reinforcement learning differs from supervised and unsupervised learning: it is a method of finding the best decisions by actually acting, suited to tasks where results take time to appear or that require many repetitions. It is used for autonomous driving, robot control, and games such as AlphaGo.

What are the three learning methods of "machine learning" (supervised learning, unsupervised learning, and reinforcement learning)? --sweeep magazine
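As a very rough, made-up sketch of the reinforcement learning idea (trial and error rather than labeled answers), here is an epsilon-greedy agent learning which of two actions pays off more. This is only a toy illustration, not how AlphaGo or self-driving systems are actually built.

import numpy as np

rng = np.random.default_rng(0)
true_reward_prob = [0.3, 0.7]   # unknown to the agent
estimates = [0.0, 0.0]          # agent's current estimate of each action's value
counts = [0, 0]
epsilon = 0.1                   # fraction of the time spent exploring

for _ in range(1000):
    if rng.random() < epsilon:
        action = int(rng.integers(2))        # explore a random action
    else:
        action = int(np.argmax(estimates))   # exploit the best-looking action
    reward = 1 if rng.random() < true_reward_prob[action] else 0
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]  # running average

print(estimates)  # ends up close to [0.3, 0.7] through trial and error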

Actual flow of SageMaker tutorial

The introduction has gotten long, but from here I will explain how to actually proceed with the analysis.

For the AWS account required to use SageMaker, I used an AWS Educate Starter account. Using AWS services normally incurs charges, but an AWS Educate Starter account comes with several tens of dollars of credits by default, so you can use AWS services virtually free of charge for a while. This is a very welcome perk for a poor college student like me. However, there are many restrictions and various services cannot be used, so be careful; details are described later.

Regarding the analysis flow described here, I will also cover in detail the places where I failed or got stuck, and explain how I resolved them. I hope this article can serve as troubleshooting for anyone else who tries SageMaker with an AWS Educate Starter account.

First, a brief explanation of the analysis procedure is as follows.

**⓪ Open the AWS Management Console screen**
**① Create an Amazon S3 bucket**
**② Create an Amazon SageMaker notebook instance**
**③ Create a Jupyter notebook**
**④ Download, explore, and transform the training data**
**⑤ Train the model**
**⑥ Host the model**
**⑦ Verify the model**

Now I will explain each step in detail. It is quite long, so feel free to skim through.

⓪ Open the AWS Management Console screen (this is surprisingly important)

First, log in to your AWS Educate account and click "AWS Account" in the upper left. Then start the AWS Console by following the on-screen instructions, and the "AWS Management Console" screen opens.

Search for "Search for services" and specify the service you want to use. This time, I will use "S3" and "Amazon SageMaker". キャプチャ.PNG . . . . . If a beginner goes well, this will happen, but as mentioned above, the services that can be used with the "AWS Educate Starter Account" are limited! !! !!

List of services available for your AWS Educate Starter account

https://s3.amazonaws.com/awseducate-starter-account-services/AWS_Educate_Starter_Accounts_and_AWS_Services.pdf

Looking at this list, it says that SageMaker instances cannot be used.

In other words, creating an AWS Educate Starter account alone does not let you use SageMaker. When I proceeded without knowing this, I got the following error at the model fitting step at the end of "⑤ Train the model."

ClientError: An error occurred (AccessDeniedException) when calling the CreateTrainingJob operation: User: arn:aws:sts::780079846795:assumed-role/AmazonSageMaker-ExecutionRole-20191119T213649/SageMaker is not authorized to perform: sagemaker:CreateTrainingJob on resource: arn:aws:sagemaker:us-east-1:780079846795:training-job/xgboost-2019-11-19-13-34-49-653 with an explicit deny

So how do you get rid of this?

Looking at the list of available services again, it seems that SageMaker can be used even with a Starter account via an AWS Educate "Classroom" of the "Machine Learning and AI" type.

Introduction page of AWS Educate Classrooms (all in English)

After roughly translating the content of that page, I understood it to say that an educator opens a Classroom, sets up their own virtual education space in AWS Educate, and then invites students, whose usage they can monitor. So I asked a professor in my laboratory to open a Classroom. (Incidentally, it went relatively quickly from application to the Classroom being opened.)

Once invited to the Classroom, click "My Classrooms" at the top left of the AWS Educate top screen. From there, press "Go to classroom" and follow the on-screen instructions to reach the AWS Management Console for the Classroom.

Finally I can start (crying)

① Create an Amazon S3 bucket

Move to the S3 service screen and press "Create bucket" to create a bucket. You can name the bucket freely; this time I named it "bank-xgboost." Note, however, that you cannot use a name that is already taken. For the region, specify US East (N. Virginia).

  • Remember the bucket name, because the bucket created here will be used later.

After that, leave everything else as is and create the bucket by following the on-screen instructions. Once it is created, you can confirm that a bucket with the name you chose now exists.

② Create an Amazon SageMaker notebook instance

Move to the Amazon SageMaker service screen and press "Notebook instances," then press "Create notebook instance" to start creating one. You can name the instance freely; this time I named it "bank-tutorial." The notebook instance type is "ml.t2.medium," left at the default. In "Create an IAM role" under the notebook instance settings, set the S3 buckets you specify to "Any S3 bucket" and create the role. Once the IAM role is created, click "Create notebook instance" at the bottom right to finish. Immediately after creation the status is "Pending," but after about 5 minutes it becomes "InService," and you can proceed to the next step.

③ Create a Jupyter notebook

When the status becomes "InService," press "Open Jupyter" to move to the Jupyter notebook top screen. From there, choose "conda_python3" under "New" at the upper right to create a notebook.

First, set the required configuration variables and import the libraries. Change the bucket variable below to the S3 bucket name you decided on earlier. Also, things will not work if SageMaker and S3 are in different regions, so make sure the regions match. The folder specified by prefix will be newly created in S3; no change is required, but you may change it if you like. Finally, import boto3 and declare the IAM role. boto3 is a library developed by AWS that lets Python work with various AWS services.

To execute a cell, press Shift + Enter. When execution completes, the cell is numbered, as in In [1]. Sometimes the cell label shows In [*]; that just means the cell is still running, so wait a while before proceeding to the next step.

In[1]


#Please set the bucket name of S3 to the following
#Set S3 prefix (no change required)
bucket = 'bank-xgboost'
prefix = 'sagemaker/xgboost-dm'
 
#IAM role declaration
import boto3
import re
from sagemaker import get_execution_role
 
role = get_execution_role()

Next, import the various libraries required to build this model. In addition to NumPy, Pandas, and Matplotlib, which are standard in machine learning, modules for displaying output in IPython (Jupyter Notebook) and SageMaker's Python SDK are also imported.

In[2]


import numpy as np                                #For matrix operations and numerical processing
import pandas as pd                               #For changing tabular data
import matplotlib.pyplot as plt                   #For visualization of figures, etc.
from IPython.display import Image                 #To display an image in a notebook
from IPython.display import display               #To display the output in notebook
from time import gmtime, strftime                 #For labeling SageMaker models, endpoints, etc.
import sys                                        #For writing output to notebook
import math                                       #For ceiling function
import json                                       #For analysis of hosting output
import os                                         #To manipulate the file path name
import sagemaker                                  #Many helper functions are provided by using Amazon SageMaker's Python SDK
from sagemaker.predictor import csv_serializer    #Convert the HTTP POST request string during inference

④ Download, explore, and transform the training data

The dataset is available on the University of California, Irvine website, so get it directly from there: download it from the URL with wget and extract it with unzip.

In[3]


#Download the dataset from the public URL of the University of California, Irvine
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
!unzip -o bank-additional.zip

Load the downloaded csv file dataset as a Pandas dataframe.

In[4]


# Store bank-additional-full.csv in data
data = pd.read_csv('./bank-additional/bank-additional-full.csv', sep=';')
 
#Changed Pandas maximum display column and row count settings
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 30)
 
#Show first 10 lines
data.head(10)

Looking at the first 10 rows of the dataset, you can see that it contains customer information. Here is an outline of some of the columns.

・ age – Customer's age
・ job – Job category
・ marital – Marital status
・ education – Educational background
・ default – Credit default status
・ housing – Whether there is a housing loan
・ loan – Whether there is a personal loan
…and so on.

Check for missing data. If there is any, it becomes troublesome because you have to separately decide how to handle it.

In[5]


#A table function that collects missing data in a data frame
def missing_values_table(df): 
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * df.isnull().sum()/len(df)
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        return mis_val_table_ren_columns 
 
#Check if data is missing
missing_values_table(data)

Looking at the output, you can see that there is no missing data. As expected from a bank dataset.

Next comes data cleansing (dataset preprocessing). Cleansing is an integral part of almost every machine learning workflow.

This time, four kinds of preprocessing are performed.

**① Flag customers who have never been contacted, using pdays** The pdays column (days since the customer was last contacted) contains a large number of "999" values, meaning most customers have never been contacted before. We create a flag that is "1" when pdays is 999 (i.e. never contacted) and "0" otherwise.

**② Flag customers who are not currently working, using job** The job column has 12 categories, including unknown, and it contains customers who are not currently employed, such as "student" or "unemployed." So we add a new flag "not_working" that separates whether or not the customer is currently working.

**③ Convert categorical data into dummy variables** Turning non-numeric data into numbers is called creating dummy variables. For example, in this data the prediction target is "y," whose values are "yes" and "no." When converted to dummy variables, the single column "y" is split into two columns, "y_yes" and "y_no," each filled with "0" or "1" according to the original value.

**④ Drop columns not used in the prediction model** Finally, use drop to remove columns not used in this training, such as external economic indicators (emp.var.rate), from the data frame.

In[7]


#Addition of new items to identify people who have not been contacted before
data['no_previous_contact'] = np.where(data['pdays'] == 999, 1, 0)
 
#Added "People who are not in employment" (students, etc.) flags from occupation
data['not_working'] = np.where(np.in1d(data['job'], ['student', 'retired', 'unemployed']), 1, 0)
 
#Make categorical data a dummy variable
model_data = pd.get_dummies(data)
 
#Deleted items not used in this model
model_data = model_data.drop(['duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'], axis=1)
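As a quick sanity check (not part of the tutorial), you can confirm in a new cell that the target column was split into y_no / y_yes and that the two new flags were added:

# Optional check: the dummy columns and the new flags should now exist
print([c for c in model_data.columns if c.startswith('y_')])   # expect ['y_no', 'y_yes']
print(model_data[['no_previous_contact', 'not_working']].head())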

After preprocessing is complete, the next task is to split the data into training, validation, and test sets.

As for why the data is split: put simply, the model is trained repeatedly to improve its accuracy, and you need data the model has not seen during training in order to evaluate it fairly. I will not go deeper into the details this time.

In[8]


#Randomly shuffle the preprocessed model_data and split it into 3 data frames (70% / 20% / 10%)
train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data)), int(0.9 * len(model_data))])   # Randomly sort the data then split out first 70%, second 20%, and last 10%

Next is the final step of data preprocessing. The data format expected by the Amazon SageMaker XGBoost container is libSVM. For the libSVM format, the features (explanatory variables) and the prediction target (objective variable) must be passed as separate arguments, so that conversion is performed here. Finally, the training dataset (in libSVM format) is sent to S3 via boto3.

In[9]


#Export libSVM files (dump_svmlight_file is provided by scikit-learn)
from sklearn.datasets import dump_svmlight_file

dump_svmlight_file(X=train_data.drop(['y_no', 'y_yes'], axis=1), y=train_data['y_yes'], f='train.libsvm')
dump_svmlight_file(X=validation_data.drop(['y_no', 'y_yes'], axis=1), y=validation_data['y_yes'], f='validation.libsvm')
dump_svmlight_file(X=test_data.drop(['y_no', 'y_yes'], axis=1), y=test_data['y_yes'], f='test.libsvm')
 
#Copy files to S3 using Boto3
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.libsvm')).upload_file('train.libsvm')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.libsvm')).upload_file('validation.libsvm')

You should now have a new folder called "sagemaker" in the S3 bucket you created first. If train.libsvm has been created under S3 > bucket name > sagemaker > xgboost-dm > train, you can proceed without any problem.
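If you prefer not to open the S3 console, the same check can be done from the notebook with boto3 (this is an optional extra, not part of the tutorial); it assumes the bucket and prefix variables set earlier.

# Optional check: list the uploaded objects under the prefix instead of using the S3 console
s3 = boto3.Session().resource('s3')
for obj in s3.Bucket(bucket).objects.filter(Prefix=prefix):
    print(obj.key, obj.size)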

⑤ Train the model

Now that the pre-processing is complete, it's time to build the model using XGBoost.

First, specify the location of Amazon SageMaker's XGBoost ECR container, and point the training data inputs (libSVM) at S3.

In[10]


#Specify ECR container for SageMaker XGBoost
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'}

#Link the training and validation data locations in S3
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='libsvm')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='libsvm')

Next, pass the necessary parameters and hyperparameters to SageMaker's Estimator and run the fit. The code below uses "ml.m4.xlarge" as the training instance. The process completes in about 10 minutes.

In[11]


#SageMaker session
sess = sagemaker.Session()
 
#Specify the required items in the sagemaker estimator
xgb = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
 
#Specifying hyperparameters
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)
 
#Model fitting and output destination specification (S3)
xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

If the process goes well, output like the following will appear.

2019-11-18 16:36:46 Uploading - Uploading generated training model
2019-11-18 16:36:46 Completed - Training job completed
Training seconds: 63
Billable seconds: 63

⑥ Host the model (this is also an important point)

After fitting, the next step is to host the model. The code below uses an "ml.t2.medium" instance. This also takes about 10 minutes to complete.

In[12]


# Deploy with an ml.t2.medium instance
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.t2.medium')

Regarding this model hosting, I first tried deploying on an "ml.c4.xlarge" instance, because that is what the code on the reference page used.

However, I got the following error.

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint operation: The account-level service limit 'ml.c4.xlarge for endpoint usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.

Put simply, it says that this account has hit its resource limit and that I should contact AWS Support to request an increase.

That's a Starter account for you. So few resources...

Various instance types are listed on the Amazon SageMaker pricing page, and I tried several, but for model deployment only the ml.t2.medium instance worked.

In any case, the instances I ended up using were:
・ Notebook (authoring): **"ml.t2.medium"**
・ Training: **"ml.m4.xlarge"**
・ Deployment: **"ml.t2.medium"**

With this combination I did not get stuck again through to the end, so if anyone tries the same thing with a Starter account in the future, please use it as a reference. (Presumably, cheaper instances that use fewer resources are less likely to hit the resource limit?)

Incidentally, when I emailed AWS Support to ask whether the limit could be raised, I was told that the limit cannot be raised on a free-tier Starter account in the first place...

⑦ Verify the model

First, specify how the data used for evaluation will be sent and received. The test data currently sits as a NumPy array on the SageMaker notebook instance. To send it to the prediction model as an HTTP POST request, serialize it with SageMaker's serializer and also specify the content_type.

In[13]


#Make settings for data transfer
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer

Next, split the test_data created earlier into mini-batches of 500 rows each, get predictions from the XGBoost endpoint, and collect the output into a NumPy array.

In[14]


#Predict with xgb_predictor in mini-batches of 500 rows
def predict(data, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])
 
    return np.fromstring(predictions[1:], sep=',')
 
#Drop the target columns from the test_data created earlier and output the predictions
predictions = predict(test_data.drop(['y_no', 'y_yes'], axis=1).as_matrix())
 
#Comparison table of prediction and correct answer data
pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

The predicted values are now stored in the NumPy array predictions. Finally, use Pandas to cross-tabulate the actual correct answers against the predictions.

The prediction results this time came out as in the crosstab above.

Since it is hard to read as is, I summarized it in a table. Reading it:

The model predicted that 77 people would answer Yes, but in fact only 3 of those 77 actually answered Yes. Furthermore, among the people predicted to answer No, 480 actually answered Yes.

So I cannot say the result is very accurate, but this time the point was just to get my hands on machine learning... There seems to be plenty of room to improve the prediction accuracy, for example by adjusting the features and hyperparameters.
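As one possible follow-up (not part of the tutorial), the same predictions can be summarized with standard metrics instead of reading the crosstab by eye; this reuses the predictions array and test_data from the cells above.

from sklearn.metrics import accuracy_score, precision_score, recall_score

# Summarize the crosstab above as standard classification metrics
y_true = test_data['y_yes']
y_pred = np.round(predictions)

print('accuracy :', accuracy_score(y_true, y_pred))
print('precision:', precision_score(y_true, y_pred))
print('recall   :', recall_score(y_true, y_pred))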

**One last important task remains.**

After finishing this tutorial, delete the endpoint created this time so as not to incur extra charges.

In[16]


#Delete the created endpoint
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)

In addition, you can check the status of models, endpoints, and notebooks on the SageMaker management screen, so I deleted the unnecessary ones as appropriate.

Finally

By working through this AWS SageMaker tutorial, I think I learned a little about how to use SageMaker and picked up some machine learning knowledge.

There are other tutorials besides this one, so if I have time I definitely want to try them.

List of other tutorials (English) ☞ Amazon SageMaker Examples

I feel like I have finally made it to the starting line of machine learning, so I hope to keep learning and to output to Qiita again like this.

It may have been clumsy writing, but thank you for reading.