xgboost (Python) on an EC2 Spot instance, in an environment prepared by AWS Lambda

I started playing with xgboost, but honestly, setting up a machine learning environment is a hassle.

Even if I write up a procedure document or a script, I later forget where I put the file. With EC2, Spot Instances let you save money and experiment on several machines in parallel while varying parameters, but if environment setup is done by hand from a procedure document, you quickly lose track of which terminal is running which parameters.

Given that, if you build "something that sets up the environment with one button or one command" on AWS Lambda, I feel it becomes much easier to work ad hoc and make progress on various fronts.

Below I describe what I tried: launching an EC2 Spot Instance via Lambda and getting a working xgboost environment in one shot.

Lambda function

Code

Paste the following over Lambda's hello-world template, and it takes care of the various setup steps.

Since it is long, I have also posted it as a gist: https://gist.github.com/pyr-revs/d768984ed68500bdbeb9

console.log('Loading function');

var ec2Region = 'ap-northeast-1';
var s3Region = ec2Region;
var snsRegion = ec2Region;

var s3Bucket = 'mybucket';
var shellScriptS3Key = 'sh/launch_xgboost.sh';
var shellScriptS3Path = 's3://' + s3Bucket + '/' + shellScriptS3Key;

var iamInstanceProfile = 'my_ec2_role';

var availabilityZone = ec2Region + 'a';
var spotPrice = '0.1';
var imageId = 'ami-9a2fb89a';
var instanceType = 'c3.2xlarge';
var securityGroup = 'launch-wizard-1';
var keyName = 'my_ssh_keypair';

var userData = (function () {/*#!/bin/bash
tmp=/root/sudoers_tmp
cat /etc/sudoers > $tmp
cat >> $tmp <<EOF
Defaults:ec2-user !requiretty
EOF
cat $tmp > /etc/sudoers
yum -y update
yum groupinstall -y "Development tools"
yum -y install gcc-c++ python27-devel atlas-sse3-devel lapack-devel
pip install numpy
pip install scipy
pip install pandas
aws s3 cp %s /home/ec2-user/launch_xgboost.sh
chown ec2-user /home/ec2-user/launch_xgboost.sh
chmod +x /home/ec2-user/launch_xgboost.sh
su - ec2-user /home/ec2-user/launch_xgboost.sh
*/}).toString().match(/[^]*\/\*([^]*)\*\/\}$/)[1];

var shellScriptContents = (function () {/*#!/bin/bash
git clone --recursive https://github.com/dmlc/xgboost.git
cd xgboost
./build.sh > build.log 2>&1
cd python-package
sudo -s python setup.py install > setup.log 2>&1
export AWS_DEFAULT_REGION=%s
aws sns publish --topic-arn arn:aws:sns:ap-northeast-1:xxxxxxxxxxxx:My-Sns-Topic --subject "Launch xgboost Done" --message "Launch xgboost Done!!"
*/}).toString().match(/[^]*\/\*([^]*)\*\/\}$/)[1];

exports.handler = function(event, context) {
    var util = require('util');
    var AWS = require('aws-sdk');
    
    // Write sh file for xgboost launch to S3
    AWS.config.region = s3Region;
    var shellScriptContentsFormatted = util.format(shellScriptContents, snsRegion);
    var s3 = new AWS.S3();
    var s3Params = {Bucket: s3Bucket, Key: shellScriptS3Key, Body: shellScriptContentsFormatted};
    var s3Options = {partSize: 10 * 1024 * 1024, queueSize: 1};
    //console.log(shellScriptContentsFormatted);
    
    s3.upload(s3Params, s3Options, function(err, data) {
        if (err) {
            console.log(err, err.stack);
            context.fail('[Fail]');
        }
        else {
            console.log(data);
            
            // Launch EC2 Spot Instance with UserData
            var userDataFormatted = util.format(userData, shellScriptS3Path);
            var userDataBase64 = new Buffer(userDataFormatted).toString('base64');
    
            var ec2LaunchParams = {
                SpotPrice: spotPrice, 
                LaunchSpecification : {
                    IamInstanceProfile: {
                      Name: iamInstanceProfile
                    },
                    ImageId: imageId,
                    InstanceType: instanceType,
                    KeyName: keyName,
                    Placement: {
                      AvailabilityZone: availabilityZone
                    },
                    SecurityGroups: [
                        securityGroup
                    ],
                    UserData: userDataBase64
                }
            };
            //console.log(params);
            
            AWS.config.region = ec2Region;
            var ec2 = new AWS.EC2();
            ec2.requestSpotInstances(ec2LaunchParams, function(err, data) {
                if (err) {
                    console.log(err, err.stack);
                    context.fail('[Fail]');
                }
                else {
                    console.log(data);
                    context.succeed('[Succeed]');
                }
            });
        }
    });
};
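The userData and shellScriptContents variables above use a well-known pre-ES6 trick: the multiline text lives inside a block comment in a throwaway function, and a regex pulls it back out of Function.prototype.toString(). A minimal, standalone illustration of the same pattern (the content here is just an example):

```javascript
// Pre-ES6 multiline string trick: keep the text in a block comment
// inside a dummy function, then extract it from the function's source.
// Node preserves comments in Function.prototype.toString(), so the
// regex can capture everything between /* and */.
var heredoc = (function () {/*line one
line two*/}).toString().match(/[^]*\/\*([^]*)\*\/\}$/)[1];

console.log(heredoc);
```

On current Node you would simply use an ES6 template literal instead, but Lambda's Node runtime at the time of writing did not support them, hence the hack.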

What it does

  1. Create a shell script that installs xgboost and save it to S3
  2. Request an EC2 Spot Instance, passing the initial setup steps in UserData
  3. When the instance boots, UserData runs yum and friends, then downloads the shell script created in step 1 from S3 and hands it over to ec2-user
  4. The shell script, running with ec2-user privileges, installs xgboost
  5. When everything is done, publish a notification to SNS
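In step 2, EC2 expects the UserData payload to be base64-encoded, which is what the new Buffer(...).toString('base64') line in the handler does (Buffer.from is the non-deprecated spelling on current Node). A quick sketch of the round trip, with a placeholder script body:

```javascript
// EC2 UserData must be base64-encoded. Buffer.from replaces the
// deprecated new Buffer() constructor used in the handler above.
var script = '#!/bin/bash\nyum -y update\n';
var encoded = Buffer.from(script).toString('base64');

// Decoding gives back the original script unchanged.
var decoded = Buffer.from(encoded, 'base64').toString('utf8');
console.log(decoded === script);
```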

Basically, this is an extension of the following article:

Create a spot instance with AWS Lambda and let UserData automatically execute environment construction and long processing http://qiita.com/pyr_revs/items/c7f33d1cdee3118590e3

Required preparation and settings

Lambda execution timeout

Lambda's timeout defaults to a short 3 seconds, so raise it to 60 seconds. The default 128 MB of memory is fine.

IAM Role for Lambda

For the IAM role that runs Lambda, you need one with permission to create EC2 instances and to access S3. In principle attaching the managed policies AmazonEC2FullAccess and AmazonS3FullAccess should be enough, but for some reason S3 writes failed, so I ended up attaching AdministratorAccess...

Create an IAM Role for your users (Getting started with AWS Lambda (1): Handle events from user applications) http://dev.classmethod.jp/cloud/aws/getting-started-with-aws-lambda-1st-user-application/#prepare-iam-role

IAM Role for EC2 Instance Profile

For the role attached to the EC2 instance profile, you need one with permission to access S3 and SNS. Attaching the managed policies AmazonS3FullAccess and AmazonSNSFullAccess will do. After creating it, rewrite the following line:

var iamInstanceProfile = 'my_ec2_role';

Grant permissions to applications running on Amazon EC2 instances using IAM roles http://docs.aws.amazon.com/ja_jp/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html

SNS Topic

Create an SNS topic of your choice; it is used to notify you when the xgboost installation finishes. Rewrite the following ARN near the end of the shell script section:

aws sns publish --topic-arn arn:aws:sns:ap-northeast-1:xxxxxxxxxxxx:My-Sns-Topic --subject "Launch xgboost Done" --message "Launch xgboost Done!!"

It's not essential, but if you add an email subscription to the topic, you can track progress from your smartphone even when you're away from your PC, so I recommend it.

Get started with Amazon Simple Notification Service https://docs.aws.amazon.com/ja_jp/sns/latest/dg/GettingStarted.html

Other settings

The rest is self-explanatory, so I'll skip the details; change the values as needed. The S3 bucket name and key pair, however, must be changed. The imageId is "Amazon Linux AMI 2015.09 / HVM (SSD) EBS-Backed 64-bit" in the Asia Pacific (Tokyo) region, so change it if you use a different region, and again whenever an updated AMI is released.

var s3Bucket = 'mybucket';

var availabilityZone = ec2Region + 'a';
var spotPrice = '0.1';
var imageId = 'ami-9a2fb89a';
var instanceType = 'c3.2xlarge';
var securityGroup = 'launch-wizard-1';
var keyName = 'my_ssh_keypair';
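These settings feed straight into the requestSpotInstances call. As a sanity check you can assemble the same launch-specification shape locally and inspect it before wiring it into Lambda (all values below are placeholders):

```javascript
// Assemble the launch-spec shape that the handler passes to
// ec2.requestSpotInstances, using placeholder values. Note that the
// availability zone is derived by appending a zone letter to the region.
var ec2Region = 'ap-northeast-1';
var params = {
  SpotPrice: '0.1',
  LaunchSpecification: {
    ImageId: 'ami-9a2fb89a',   // Amazon Linux 2015.09, Tokyo region
    InstanceType: 'c3.2xlarge',
    KeyName: 'my_ssh_keypair',
    Placement: { AvailabilityZone: ec2Region + 'a' },
    SecurityGroups: ['launch-wizard-1']
  }
};

console.log(params.LaunchSpecification.Placement.AvailabilityZone);
```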

Operation check

It's easiest to run it from the Test button in the Lambda console (screenshot: lambda.png); of course, you can also invoke it from the AWS CLI.

After confirming that the Spot Instance request has gone through, wait a while; around 15 to 20 minutes, since installing numpy, scipy, and pandas takes quite some time.

Once the EC2 instance is up and xgboost is installed, log in with ssh and run the binary_classification demo to check that it works.

cd /home/ec2-user/xgboost/demo/binary_classification
./runexp.sh

It returned almost immediately, as follows:

6513x126 matrix with 143286 entries is loaded from agaricus.txt.train
6513x126 matrix with 143286 entries is saved to agaricus.txt.train.buffer
1611x126 matrix with 35442 entries is loaded from agaricus.txt.test
1611x126 matrix with 35442 entries is saved to agaricus.txt.test.buffer
boosting round 0, 0 sec elapsed
tree prunning end, 1 roots, 12 extra nodes, 0 pruned nodes ,max_depth=3
[0]     test-error:0.016139     train-error:0.014433
boosting round 1, 0 sec elapsed
tree prunning end, 1 roots, 10 extra nodes, 0 pruned nodes ,max_depth=3
[1]     test-error:0.000000     train-error:0.001228

updating end, 0 sec in all
1611x126 matrix with 35442 entries is loaded from agaricus.txt.test.buffer
start prediction...
writing prediction to pred.txt
...[Abbreviation]...
booster[1]:
0:[odor=none] yes=2,no=1
        1:[bruises?=bruises] yes=4,no=3
                3:leaf=1.1457
                4:[gill-spacing=close] yes=8,no=7
                        7:leaf=-6.87558
                        8:leaf=-0.127376
        2:[spore-print-color=green] yes=6,no=5
                5:[gill-size=broad] yes=10,no=9
                        9:leaf=-0.0386054
                        10:leaf=-1.15275
                6:leaf=0.994744

To see it run a little longer, I also tried the kaggle-higgs demo. The data has to be downloaded separately from Kaggle and copied to the EC2 instance.

cd /home/ec2-user/xgboost/demo/kaggle-higgs
mkdir data
# Copy training.csv and test.csv
./run.sh

The result; this too finishes in under a minute.

[0]     train-auc:0.910911      [email protected]:3.699574
[1]     train-auc:0.915308      [email protected]:3.971228
[2]     train-auc:0.917743      [email protected]:4.067463
...[Abbreviation]...
[118]   train-auc:0.945648      [email protected]:5.937291
[119]   train-auc:0.945800      [email protected]:5.935622
finish training
finish loading from csv 

So, as far as I can tell, it's all working without problems.

Notes on things I got stuck on

UserData Section

Dependencies installed with yum

yum groupinstall -y "Development tools"
yum -y install gcc-c++ python27-devel atlas-sse3-devel lapack-devel

Development tools such as gcc, plus BLAS/LAPACK, are required, so I included them. I referred to the following:

Installing scikit-learn on Amazon EC2 http://dacamo76.com/blog/2012/12/07/installing-scikit-learn-on-amazon-ec2/

Dependencies installed with pip

pip install numpy
pip install scipy
pip install pandas

numpy and scipy are required. I hesitated over pandas because it takes a long time to install, but the xgboost/pandas integration has seen various improvements, so I think it's worth including.

Improved Python XGBoost + pandas integration http://sinhrks.hatenablog.com/entry/2015/10/03/080719

/etc/sudoers

tmp=/root/sudoers_tmp
cat /etc/sudoers > $tmp
cat >> $tmp <<EOF
Defaults:ec2-user !requiretty
EOF
cat $tmp > /etc/sudoers

On EC2 (or rather, on this AMI), users other than root require a tty for sudo by default, which means sudo cannot be used from a shell script. That becomes a problem later, so append Defaults:ec2-user !requiretty to the end of the file so that ec2-user can also sudo from a script.

I set this up loosely, following the article below, on the assumption that the instance will be terminated soon after it finishes.

Omit password entry required many times http://qiita.com/yuku_t/items/5f995bb0dfc894c9f2df

ShellScript Section

Install xgboost

git clone --recursive https://github.com/dmlc/xgboost.git
cd xgboost
./build.sh > build.log 2>&1
cd python-package
sudo -s python setup.py install > setup.log 2>&1

Install from source: first run xgboost/build.sh to build the binary, then install the Python package with xgboost/python-package/setup.py.

python setup.py install needs sudo, and if you haven't adjusted /etc/sudoers you will run into: sudo: sorry, you must have a tty to run sudo.

In conclusion

If you want the instance to terminate itself when finished, or to run this as a nightly batch, the following may be helpful:

Cron Lambda from AWS Data Pipeline (without launching EC2) http://qiita.com/pyr_revs/items/d2ec88a8fafeace7da4a

Create a spot instance with AWS Lambda and let UserData automatically execute environment construction and long processing http://qiita.com/pyr_revs/items/c7f33d1cdee3118590e3
