[PYTHON] Try using AWS SageMaker Studio

Introduction

About a week ago, I wrote an article called "N reasons to recommend JupyterLab over Jupyter". On a completely different note, a product called SageMaker Studio was announced at AWS's re:Invent event. It apparently lets you manage all of the existing SageMaker services on a single screen. So far I've only done machine learning locally or on Google Colaboratory, but I'd like to give SageMaker Studio a try.

Trying it out

Let's try it out. SageMaker Studio is currently in preview and is not available in the Tokyo region near me; only Ohio (us-east-2) can be used.

Screenshot from Gyazo

It feels like a normal JupyterLab. The shortcuts I usually rely on (cell movement, mode switching, and so on) work as expected. Next, I'll try SageMaker Autopilot, one of SageMaker's services.

About SageMaker Autopilot

Officially, SageMaker Autopilot is described as follows:

Amazon SageMaker Autopilot automatically trains and tunes the best machine learning models for classification or regression based on your data, while maintaining full control and visibility.

So it's an AutoML offering. Nicely, it can also deploy the model automatically on top of the AutoML part. I don't have data of my own to analyze right now, so I'll follow along with the video below.

https://www.youtube.com/watch?v=qMEtqJPhqpA

The data used is the Bank Marketing dataset distributed by UCI: bank customer data labeled with whether or not each customer signed up for a term deposit.

Data download

%%sh
wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
unzip -o bank-additional.zip

Data import and display

import pandas as pd

data = pd.read_csv('./bank-additional/bank-additional-full.csv', sep=';')
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 50)         # Keep the output on one page
data.head(10)

The target label of this data is binary, yes or no. Let's count each:

data["y"].value_counts()

Result:


no     36548
yes     4640
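
As a quick sanity check, the positive rate can be computed from the counts printed above (a small sketch using those numbers directly):

```python
# Class balance from the value_counts output above
no_count, yes_count = 36548, 4640
positive_rate = yes_count / (no_count + yes_count)
print(f"{positive_rate:.1%}")  # about 11.3% "yes"
```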

The classes are imbalanced (the positive class is small). Analysis isn't the goal this time, so I'll skip it; Autopilot apparently handles these tedious data adjustments for us. For now, split the data into train and test sets and save them.

import numpy as np

# Shuffle, then take 95% for training and the remaining 5% for testing
train_data, test_data = np.split(data.sample(frac=1, random_state=123),
                                 [int(0.95 * len(data))])

# Save to CSV files
train_data.to_csv('automl-train.csv', index=False, header=True, sep=',') # Need to keep column names
test_data.to_csv('automl-test.csv', index=False, header=True, sep=',')

import sagemaker

prefix = 'sagemaker/DEMO-automl-dm/input'
sess   = sagemaker.Session()

uri = sess.upload_data(path="automl-train.csv", key_prefix=prefix)
print(uri)

Creating an Autopilot Experiment

Create an Autopilot Experiment from SageMaker Studio.

Screenshot from Gyazo

Fill in the fields. For the last item, selecting No seems to generate a notebook where you can try out the models SageMaker generates automatically yourself.

Press Create Experiment to start building the model automatically.
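
For reference, the same experiment can presumably also be launched from code via boto3's low-level `create_auto_ml_job` API rather than the Studio UI. The bucket, role ARN, and job name below are placeholder assumptions, not what I actually used:

```python
# Request body for the SageMaker create_auto_ml_job API.
# Bucket, role ARN, and job name are placeholder assumptions.
request = {
    "AutoMLJobName": "automl-dm-demo",
    "InputDataConfig": [{
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://<your-bucket>/sagemaker/DEMO-automl-dm/input",
        }},
        "TargetAttributeName": "y",  # the yes/no label column
    }],
    "OutputDataConfig": {"S3OutputPath": "s3://<your-bucket>/sagemaker/DEMO-automl-dm/output"},
    "RoleArn": "arn:aws:iam::<account-id>:role/<SageMakerExecutionRole>",
}

# Actually launching the job needs AWS credentials (and incurs charges),
# so the call itself is left commented out:
# import boto3
# boto3.client("sagemaker", region_name="us-east-2").create_auto_ml_job(**request)
```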

The process is roughly divided into three stages: analyzing data, feature engineering, and model tuning.
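
While it runs, the job status can be polled from a notebook with boto3's `describe_auto_ml_job` (a sketch; the client and job name are assumptions):

```python
import time

def wait_for_autopilot(sm_client, job_name, poll_seconds=60):
    """Poll a SageMaker AutoML job until it leaves the InProgress state."""
    while True:
        desc = sm_client.describe_auto_ml_job(AutoMLJobName=job_name)
        status = desc["AutoMLJobStatus"]  # InProgress / Completed / Failed / Stopped
        print(status, desc.get("AutoMLJobSecondaryStatus", ""))
        if status != "InProgress":
            return desc
        time.sleep(poll_seconds)

# Usage (needs AWS credentials, so commented out here):
# import boto3
# sm = boto3.client("sagemaker", region_name="us-east-2")
# wait_for_autopilot(sm, "automl-dm-demo")
```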

Seeing the results, I felt I might as well leave this to the machine rather than doing it by hand anymore. Technological progress is amazing.

The job hasn't finished yet, but I'm so sleepy that I'll stop here for now.

Concerns

As with all cloud services, not just AWS, the pricing structure is very hard to understand. I wasn't sure how much building this model in SageMaker Studio would cost ... Off topic, but when I used Dataproc on GCP I once forgot to delete a cluster, and the bitterness of blowing away 7,000 yen still hasn't faded ...

Summary

There weren't many hands-on articles in Japanese, so I tried it myself. My honest impression is that I had underestimated AutoML. That said, its real strength is the integration with the rest of SageMaker's features, so it's recommended for users who make full use of SageMaker; for those who just want a Jupyter-like environment, I'm not so sure.
