[PYTHON] Try using AWS SageMaker Studio

Introduction

About a week ago, I wrote an article titled "N reasons to recommend Jupyter Lab instead of Jupyter". This is a completely separate topic, but a product called SageMaker Studio was announced at re:Invent, the AWS event for new product launches. It apparently lets you manage the existing SageMaker services on AWS from a single screen. So far I have only done machine learning in my local environment or on Google Colaboratory, but I'd like to give SageMaker Studio a try.

Try using

Let's try it. SageMaker Studio is currently in preview and is not yet available in the Tokyo region, which would be the closest one for me. Only Ohio (us-east-2) can be used.

It feels like ordinary Jupyter Lab. The shortcuts I normally use (moving between cells, switching modes, and so on) work as expected. Next I will try SageMaker Autopilot, one of the SageMaker services.

About SageMaker Autopilot

Officially, SageMaker Autopilot is described as follows:

Amazon SageMaker Autopilot automatically trains and adjusts the best machine learning models for classification or regression based on your data, while maintaining full control and visibility.

It looks like an AutoML service. Nicely, it also deploys the model automatically on top of the AutoML part. I don't have any data of my own that I want to analyze right now, so I'll follow the video below.

https://www.youtube.com/watch?v=qMEtqJPhqpA

The data is a dataset distributed by UCI: bank customer data labeled with whether or not each customer subscribed to a term deposit.

Data download

%%sh
wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
unzip -o bank-additional.zip

Data import and display

import pandas as pd

data = pd.read_csv('./bank-additional/bank-additional-full.csv', sep=';')
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 50)         # Keep the output on one page
data.head(10)

The label in this data is binary: yes or no. Let's count how many of each there are.

data["y"].value_counts()

Result:

no     36548
yes     4640
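
To put these counts on a relative scale, they can be converted into proportions. This is a minimal sketch that assumes only the two counts printed above; the small Series is a stand-in so the snippet runs without the downloaded CSV.

```python
import pandas as pd

# Stand-in for data["y"], rebuilt from the two counts printed above.
y = pd.Series(["no"] * 36548 + ["yes"] * 4640, name="y")

# normalize=True returns each class as a fraction of all rows.
proportions = y.value_counts(normalize=True)
print(proportions.round(3))
```

On the real DataFrame the equivalent call would be data["y"].value_counts(normalize=True).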

The classes are imbalanced: yes accounts for only about 11% of the rows. Analysis is not the goal this time, so I will skip it. Autopilot apparently takes care of this kind of troublesome adjustment. For now, split the data into train and test sets and save them.

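# Shuffle all rows, then use the first 95% for training and the rest for testing
shuffled = data.sample(frac=1, random_state=123)
split_at = int(0.95 * len(shuffled))
train_data, test_data = shuffled.iloc[:split_at], shuffled.iloc[split_at:]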

# Save to CSV files
train_data.to_csv('automl-train.csv', index=False, header=True, sep=',') # Need to keep column names
test_data.to_csv('automl-test.csv', index=False, header=True, sep=',')
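
Because the labels are imbalanced, a purely random split can leave the small test set with a slightly different yes/no ratio than the training set. The sketch below shows one alternative, a stratified split, which is not what this article uses. The helper name stratified_split and the toy DataFrame are made up for illustration, and it assumes pandas 1.1 or later (for groupby(...).sample) and a unique index.

```python
import pandas as pd

def stratified_split(df, label_col, train_frac=0.95, seed=123):
    """Split df so that every label keeps the same train/test ratio."""
    # Sample train_frac of the rows separately within each label group.
    train = df.groupby(label_col).sample(frac=train_frac, random_state=seed)
    # Every row that was not sampled goes to the test set (needs a unique index).
    test = df.drop(train.index)
    return train, test

# Toy stand-in for the bank data: 900 "no" rows and 100 "yes" rows.
toy = pd.DataFrame({"age": range(1000), "y": ["no"] * 900 + ["yes"] * 100})
train_df, test_df = stratified_split(toy, "y", train_frac=0.9)

print(train_df["y"].value_counts().to_dict())
print(test_df["y"].value_counts().to_dict())
```

Both parts keep the 9:1 ratio of the toy data, which is the point of stratifying.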

import sagemaker

prefix = 'sagemaker/DEMO-automl-dm/input'
sess   = sagemaker.Session()

# Upload to the session's default S3 bucket; the return value is the S3 URI of the uploaded file
uri = sess.upload_data(path="automl-train.csv", key_prefix=prefix)
print(uri)
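
For reference, the URI printed here should have the form s3://{bucket}/{key_prefix}/{file name}, and when no bucket is passed the SDK falls back to a default bucket named sagemaker-{region}-{account id}. The sketch below only rebuilds that string offline; the helper name expected_s3_uri and the account ID are hypothetical, and nothing is uploaded.

```python
import os

def expected_s3_uri(bucket, key_prefix, local_path):
    """Rebuild the URI that upload_data is expected to return for a single file."""
    filename = os.path.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{filename}"

# Hypothetical default bucket: the account ID 123456789012 is a placeholder.
bucket = "sagemaker-us-east-2-123456789012"
example_uri = expected_s3_uri(bucket, "sagemaker/DEMO-automl-dm/input", "automl-train.csv")
print(example_uri)
```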

Creating an Autopilot Experiment

Create an Autopilot experiment from SageMaker Studio.

Fill in the fields. For the last item, selecting No apparently creates a notebook in which you can try the models that SageMaker generates automatically.

Press Create Experiment to start building the model automatically.
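
The same kind of job can also be started from code instead of the Studio form. The following is a rough sketch of how the request for the SageMaker CreateAutoMLJob API might be assembled; it is not what I ran in this article. The helper name build_automl_request, the job name, the bucket, and the role ARN are all placeholders, and the actual API call is left commented out because it needs AWS credentials and costs money.

```python
def build_automl_request(job_name, train_s3_uri, output_s3_path, role_arn, target="y"):
    """Assemble keyword arguments for the SageMaker CreateAutoMLJob API."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [{
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": train_s3_uri}
            },
            # Name of the label column in the CSV header.
            "TargetAttributeName": target,
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_path},
        "RoleArn": role_arn,
    }

# All values below are placeholders, not real resources.
request = build_automl_request(
    job_name="automl-banking-demo",
    train_s3_uri="s3://example-bucket/sagemaker/DEMO-automl-dm/input",
    output_s3_path="s3://example-bucket/sagemaker/DEMO-automl-dm/output",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)
print(sorted(request))

# To actually start the job (requires AWS credentials and incurs cost):
# import boto3
# boto3.client("sagemaker", region_name="us-east-2").create_auto_ml_job(**request)
```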

The process is roughly divided into three stages.

Data analysis
Feature engineering
Model tuning
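
These three stages can also be watched from code by polling the job description. The sketch below assumes that the stages shown in Studio correspond to the AutoMLJobSecondaryStatus values AnalyzingData, FeatureEngineering, and ModelTuning returned by describe_auto_ml_job. The helper wait_for_job and the FakeClient class are made up for illustration; FakeClient replays canned responses so the snippet runs without an AWS account.

```python
import time

# Assumed mapping from API secondary status to the stage names listed above.
STAGES = {
    "AnalyzingData": "Data analysis",
    "FeatureEngineering": "Feature engineering",
    "ModelTuning": "Model tuning",
}

def wait_for_job(client, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll until the job leaves InProgress; return final status and stages seen."""
    seen = []
    while True:
        desc = client.describe_auto_ml_job(AutoMLJobName=job_name)
        stage = STAGES.get(desc["AutoMLJobSecondaryStatus"])
        if stage and stage not in seen:
            seen.append(stage)
            print(stage)
        if desc["AutoMLJobStatus"] != "InProgress":
            return desc["AutoMLJobStatus"], seen
        sleep(poll_seconds)

class FakeClient:
    """Offline stand-in for boto3.client("sagemaker"); replays canned responses."""
    def __init__(self, responses):
        self._responses = iter(responses)

    def describe_auto_ml_job(self, AutoMLJobName):
        return next(self._responses)

fake = FakeClient([
    {"AutoMLJobStatus": "InProgress", "AutoMLJobSecondaryStatus": "AnalyzingData"},
    {"AutoMLJobStatus": "InProgress", "AutoMLJobSecondaryStatus": "FeatureEngineering"},
    {"AutoMLJobStatus": "InProgress", "AutoMLJobSecondaryStatus": "ModelTuning"},
    {"AutoMLJobStatus": "Completed", "AutoMLJobSecondaryStatus": "Completed"},
])
status, stages = wait_for_job(fake, "demo-job", poll_seconds=0, sleep=lambda s: None)
```

With a real client you would pass boto3.client("sagemaker") and the name of your own job instead of the fake one.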

When I saw the result, I felt that from now on I should just leave this to the machine rather than do it myself. Technological progress is amazing.

The process has not finished yet, but I'm so sleepy that I'll stop here for now.

Concerns

As with all cloud services, not just AWS, the pricing structure is very hard to understand. I could not tell how much it would cost to build a model with SageMaker Studio this time... This is off topic, but when I used Dataproc on GCP I forgot to delete the cluster and lost 7,000 yen, and I still haven't gotten over it...

Summary

There weren't many Japanese articles by people who had actually tried it, so I gave it a go. My honest impression is that I had underestimated AutoML. That said, the main appeal is that it brings SageMaker's features together in one place, so I would recommend it to users who make full use of SageMaker. For those who just want a Jupyter-like environment, I'm less sure. That was my impression.