About a week ago, I wrote an article titled "N reasons to recommend Jupyter Lab instead of Jupyter". This post is on a different topic: a product called SageMaker Studio was announced at re:Invent, the AWS event for new product launches. It apparently lets you manage the existing SageMaker services on AWS from a single screen. So far I've only done machine learning in my local environment or on Google Colaboratory, but I'd like to give SageMaker Studio a try.
Let's try it. SageMaker Studio is currently in preview and is not yet available in the nearby Tokyo region. Only Ohio (us-east-2) can be used.
It feels like ordinary Jupyter Lab. The shortcuts I normally use (moving between cells, switching modes, and so on) work as usual. Next I will try SageMaker Autopilot, one of the SageMaker services.
Officially, SageMaker Autopilot is described as follows:
Amazon SageMaker Autopilot automatically trains and adjusts the best machine learning models for classification or regression based on your data, while maintaining full control and visibility.
It looks like an AutoML service. Nicely, it also handles deployment automatically on top of the AutoML part. I don't have data of my own that I want to analyze right now, so I'll follow the video below.
https://www.youtube.com/watch?v=qMEtqJPhqpA
The data is a dataset distributed by UCI: bank customer data labeled with whether or not each customer applied for a time deposit.
%%sh
wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
unzip -o bank-additional.zip
import pandas as pd
data = pd.read_csv('./bank-additional/bank-additional-full.csv', sep=';')
pd.set_option('display.max_columns', 500) # Make sure we can see all of the columns
pd.set_option('display.max_rows', 50) # Keep the output on one page
data.head(10)
The label in this data is binary, yes or no. Let's count how many rows have each value.
data["y"].value_counts()
result
no 36548
yes 4640
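To put those counts in proportion, here is a minimal sketch in plain Python that takes the two numbers printed above and computes each class's share. The helper name `class_shares` is made up for illustration; with pandas, `data["y"].value_counts(normalize=True)` should give the same proportions.

```python
def class_shares(counts):
    """Return each label's share of the total, given a {label: count} mapping."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Counts copied from the value_counts() output above.
counts = {"no": 36548, "yes": 4640}
shares = class_shares(counts)

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The "yes" class comes out at roughly 11% of the rows.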
It's imbalanced (the yes class is a small minority). Analysis isn't the goal this time, so I will skip it. Autopilot is supposed to take care of this kind of troublesome data adjustment. For now, split the data into train and test sets and save them.
import numpy as np
train_data, test_data, _ = np.split(data.sample(frac=1, random_state=123),
[int(0.95 * len(data)), int(len(data))])
# Save to CSV files
train_data.to_csv('automl-train.csv', index=False, header=True, sep=',') # Need to keep column names
test_data.to_csv('automl-test.csv', index=False, header=True, sep=',')
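As a quick check on the split above, the sketch below reproduces only the index arithmetic of the `np.split` call (a cut at `int(0.95 * len(data))`) to see how many rows end up in each file. The function name `split_sizes` is hypothetical, and the row total is taken from the label counts shown earlier.

```python
def split_sizes(n_rows, train_frac=0.95):
    """Mirror the cut point used in the np.split call: int(train_frac * n_rows)."""
    cut = int(train_frac * n_rows)
    return cut, n_rows - cut


# Total rows = 36548 "no" + 4640 "yes" from the label counts.
n_train, n_test = split_sizes(36548 + 4640)
print(n_train, n_test)
```

Note that the shuffle is random rather than stratified, so the share of "yes" rows in the small test set is not guaranteed to match the full dataset exactly.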
import sagemaker
prefix = 'sagemaker/DEMO-automl-dm/input'
sess = sagemaker.Session()
uri = sess.upload_data(path="automl-train.csv", key_prefix=prefix)
print(uri)
Create an Autopilot Experiment from SageMaker Studio
Fill in the items. Selecting No for the last item seems to create a notebook in which you can try the models that SageMaker generates automatically.
Press Create Experiment to start building the model automatically.
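I used the Studio form, but the same kind of job can presumably be started from code through the SageMaker `CreateAutoMLJob` API (`create_auto_ml_job` in boto3). The following is only a sketch under assumptions: the bucket name, job name, and role ARN are placeholders, the helper names are made up, and mapping the form's last item to `GenerateCandidateDefinitionsOnly` is my guess rather than something I confirmed.

```python
def build_automl_request(job_name, train_s3_uri, output_s3_uri, role_arn,
                         target="y", candidates_only=False):
    """Assemble keyword arguments for the SageMaker CreateAutoMLJob API."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [{
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": train_s3_uri}
            },
            # Column that holds the label; "y" in the bank dataset.
            "TargetAttributeName": target,
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "RoleArn": role_arn,
        # Assumption: True would stop after generating the candidate notebooks.
        "GenerateCandidateDefinitionsOnly": candidates_only,
    }


def start_automl_job(request, region="us-east-2"):
    """Submit the request. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3
    client = boto3.client("sagemaker", region_name=region)
    return client.create_auto_ml_job(**request)


# Placeholder values only: the bucket, job name, and role ARN are hypothetical.
request = build_automl_request(
    job_name="automl-dm-demo",
    train_s3_uri="s3://example-bucket/sagemaker/DEMO-automl-dm/input/automl-train.csv",
    output_s3_uri="s3://example-bucket/sagemaker/DEMO-automl-dm/output",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)
print(request["InputDataConfig"][0]["TargetAttributeName"])
```

Building the request as a plain dict first makes it easy to inspect before anything billable is started; `start_automl_job(request)` would be the step that actually launches the job.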
The run is roughly divided into three stages.
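If you would rather watch the progress from a notebook than from the Studio screen, one option is to poll `describe_auto_ml_job` and print the status whenever it changes. This is a rough sketch, not something I ran against a real job: `FakeClient` and its canned responses are invented so the loop can run offline, and the stage names in them reflect my understanding of the secondary status values the API reports.

```python
import time


def wait_for_automl_job(client, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll DescribeAutoMLJob until the job is no longer running.

    Returns the final AutoMLJobStatus and the distinct secondary statuses seen, in order.
    """
    seen = []
    while True:
        desc = client.describe_auto_ml_job(AutoMLJobName=job_name)
        secondary = desc["AutoMLJobSecondaryStatus"]
        if not seen or seen[-1] != secondary:
            seen.append(secondary)
            print(desc["AutoMLJobStatus"], secondary)
        if desc["AutoMLJobStatus"] not in ("InProgress", "Stopping"):
            return desc["AutoMLJobStatus"], seen
        sleep(poll_seconds)


class FakeClient:
    """Stand-in for boto3.client("sagemaker") so the sketch runs without AWS."""

    def __init__(self, responses):
        self._responses = iter(responses)

    def describe_auto_ml_job(self, AutoMLJobName):
        return next(self._responses)


# Canned responses, made up for the demo.
fake = FakeClient([
    {"AutoMLJobStatus": "InProgress", "AutoMLJobSecondaryStatus": "AnalyzingData"},
    {"AutoMLJobStatus": "InProgress", "AutoMLJobSecondaryStatus": "FeatureEngineering"},
    {"AutoMLJobStatus": "InProgress", "AutoMLJobSecondaryStatus": "ModelTuning"},
    {"AutoMLJobStatus": "Completed", "AutoMLJobSecondaryStatus": "Completed"},
])
final_status, stages = wait_for_automl_job(fake, "automl-dm-demo", sleep=lambda s: None)
```

With a real client you would pass `boto3.client("sagemaker", region_name="us-east-2")` instead of `fake` and keep the default `sleep`.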
When I saw the results, I felt I should just leave this to the machine rather than keep doing it by hand. Technological progress is amazing.
The run hasn't finished yet, but I'm so sleepy that I'll stop here for now.
As with all cloud services, not just AWS, the pricing structure is very hard to understand. I wasn't sure how much it would cost to build a model with SageMaker Studio this time ... It was my own carelessness, but when I used DataProc on GCP I forgot to delete the cluster, and my resentment at losing 7,000 yen still hasn't gone away ...
There weren't many hands-on articles in Japanese, so I tried it myself. My honest impression is that I had underestimated AutoML. That said, the main strength seems to be the integration of SageMaker's features, so I can recommend it to people who make full use of SageMaker, but I'm less sure about those who just want a Jupyter-like environment. That was my impression.