[PYTHON] How to create a serverless machine learning API with AWS Lambda

This is the article for day 14 of the Python Advent Calendar 2016.

Overview

At NewsDigest, we use machine learning to classify the categories of the news articles we deliver. Specifically, about 1,000 articles a day are classified into 10 categories such as "entertainment," "politics," and "sports."

Rather than tightly coupling this categorization to the delivery server, NewsDigest provides it in-house as a **general-purpose API** for classification.

To make this general-purpose API scalable, we went with a serverless (AWS Lambda) machine learning API, so this article is an introduction and tutorial for building such a serverless API.

A working API is available at https://3lxb3g0cx5.execute-api.us-east-1.amazonaws.com/prod/classify and the repository is at https://github.com/yamitzky/serverless-machine-learning.

Assumptions

The API implemented here is built on the following assumptions:

--A classification API based on supervised learning. That is, **there is a training stage and a classification (prediction) stage**
--**Not for big data**. Therefore, it is implemented with scikit-learn only, without Spark or the like
--As mentioned above, the API is serverless

**Explanations of machine learning itself, morphological analysis, how to use scikit-learn, and so on are omitted.**

In this tutorial, you will follow the steps below:

--First, write a minimal machine learning implementation, independent of any API
--Use bottle to turn it into a (non-serverless) API
--Deploy it to AWS Lambda to make it serverless

1. A minimal implementation of classification by machine learning

First, let's create a minimal implementation without worrying about turning it into an API. As a prerequisite, prepare a corpus in the following format (this one was generated from the Reuters corpus).

category\t Morphologically parsed text
money-fx\tu.k. money market given 120 mln stg late help london, march 17 - the bank of england said it provided the money market with late assistance of around 120 mln stg. this brings the bank's total help today to some 136 mln stg and compares with its forecast of a 400 mln stg shortage in the system.
grain\tu.s. export inspections, in thous bushels soybeans 20,349 wheat 14,070 corn 21,989 blah blah blah. 
earn\tsanford corp <sanf> 1st qtr feb 28 net bellwood, ill., march 23 - shr 28 cts vs 13 cts net 1,898,000 vs 892,000 sales 16.8 mln vs 15.3 mln
...
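
The article does not show how this corpus file was produced. As a rough sketch only (assuming NLTK's Reuters corpus; the repository's actual download_corpus.sh / gen_corpus.py may work differently), it could be generated like this:

# Hypothetical sketch: build corpus.txt from the NLTK Reuters corpus.
# The repository's actual download_corpus.sh / gen_corpus.py may differ.
import nltk
from nltk.corpus import reuters

nltk.download('reuters')  # fetch the corpus data once

with open('corpus.txt', 'w') as f:
    for fileid in reuters.fileids():
        categories = reuters.categories(fileid)
        if len(categories) != 1:
            continue  # keep it a single-label problem
        tokens = [word.lower() for word in reuters.words(fileid)]
        f.write('%s\t%s\n' % (categories[0], ' '.join(tokens)))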

A minimal implementation of categorization with Naive Bayes looks like this:

from gensim.corpora.dictionary import Dictionary
from gensim.matutils import corpus2csc
from sklearn.naive_bayes import MultinomialNB


def load_corpus(path):
    """Get corpus from file"""
    categories = []
    docs = []
    with open(path) as f:
        for line in f:
            category, line = line.split('\t')
            doc = line.strip().split(' ')
            categories.append(category)
            docs.append(doc)
    return categories, docs


def train_model(documents, categories):
    """Learn the model"""
    dictionary = Dictionary(documents)
    X = corpus2csc([dictionary.doc2bow(doc) for doc in documents]).T
    return MultinomialNB().fit(X, categories), dictionary


def predict(classifier, dictionary, document):
    """Estimate unknown sentence categories from the trained model"""
    X = corpus2csc([dictionary.doc2bow(document)], num_terms=len(dictionary)).T
    return classifier.predict(X)[0]


# Train the model
categories, documents = load_corpus('corpus.txt')
classifier, dictionary = train_model(documents, categories)

# Classify with the trained model
predict_sentence = 'a dollar of 115 yen or more at the market price of the trump market 4% growth after the latter half of next year'.split()  # NOQA
predict(classifier, dictionary, predict_sentence)  # money-fx

This minimal implementation

--reads the data from the corpus and trains the model
--classifies unknown sentences with the trained model

so it has the minimum functionality of supervised learning. Let's turn it into an API.

2. Implement a simple API using bottle

Before going serverless, let's first turn the categorization into an API using bottle, a simple web framework.

from bottle import route, run, request

def load_corpus(path):
    """Load the corpus from a file (same as in step 1)"""

def train_model(documents, categories):
    """Train the model (same as in step 1)"""

def predict(classifier, dictionary, document):
    """Classify with the trained model (same as in step 1)"""

@route('/classify')
def classify():
    categories, documents = load_corpus('corpus.txt')
    classifier, dictionary = train_model(documents, categories)
    sentence = request.params.sentence.split()
    return predict(classifier, dictionary, sentence)

run(host='localhost', port=8080)

If you hit it with curl at this point, you get the classification result.

curl "http://localhost:8080/classify?sentence=a%20dollar%20of%20115%20yen%20or%20more%20at%20the%20market%20price%20of%20the%20trump%20market%204%%20growth%20after%20the%20latter%20half%20of%20next%20year"
# money-fx

But of course, **there is a big problem with this implementation**: it is slow, because it trains and classifies at the same time every time the classification endpoint (/classify) is hit.

In machine learning, training generally takes a long time while classification finishes quickly. So let's split training out into its own endpoint and persist the trained model.

3. Create a training endpoint and persist the model

This time, two endpoints are provided: /train and /classify. The model is persisted with joblib, as described in scikit-learn's "3.4. Model persistence". The trick is joblib's compression: a model of roughly 200 MB fits into about 2 MB when compressed (file size matters because it is a constraint when deploying to Lambda).

from sklearn.externals import joblib
import os.path

from bottle import route, run, request

def load_corpus(path):
    """Load the corpus from a file (same as in step 1)"""

def train_model(documents, categories):
    """Train the model (same as in step 1)"""

def predict(classifier, dictionary, document):
    """Classify with the trained model (same as in step 1)"""

@route('/train')
def train():
    categories, documents = load_corpus('corpus.txt')
    classifier, dictionary = train_model(documents, categories)
    joblib.dump((classifier, dictionary), 'model.pkl', compress=9)
    return "trained"

@route('/classify')
def classify():
    if os.path.exists('model.pkl'):
        classifier, dictionary = joblib.load('model.pkl')
        sentence = request.params.sentence.split()
        return predict(classifier, dictionary, sentence)
    else:
        # If the file does not exist, the model has not been trained yet
        return "model not trained. call `/train` endpoint"

run(host='localhost', port=8080)

With this API, once the model has been trained, it is persisted as model.pkl. Initially the model has not been trained, so "model not trained" is returned.

curl "http://localhost:8080/?sentence=a%20dollar%20of%20115%20yen%20or%20more%20at%20the%20market%20price%20of%20the%20trump%20market%204%%20growth%20after%20the%20latter%20half%20of%20next%20year"
# model not trained

If you train first and then classify, you can see that the API now classifies correctly.

curl http://localhost:8080/train
# trained
curl "http://localhost:8080/classify?sentence=a%20dollar%20of%20115%20yen%20or%20more%20at%20the%20market%20price%20of%20the%20trump%20market%204%%20growth%20after%20the%20latter%20half%20of%20next%20year"
# money-fx

4. Make it serverless

Now for the main part: making the API we built with bottle serverless by deploying it to AWS Lambda.

To make the machine learning API serverless, the training phase and the classification phase are organized as follows.

--Training phase
  --Use Docker to download the dataset, build the corpus, train, and save the trained model file
  --Zip the code + trained model and deploy it to AWS Lambda
--Classification phase
  --With API Gateway + AWS Lambda, load the trained model and classify when a classification request arrives (see the handler sketch below)
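
The Lambda-side code itself is not shown in this article. As a rough sketch (the handler name, the "sentence" event key, and the module-level loading are assumptions, not the exact contents of the repository's main.py), it could look like this:

# Hypothetical sketch of a Lambda handler (main.py); the real repository code may differ.
from gensim.matutils import corpus2csc
from sklearn.externals import joblib

# Load the trained model once at import time so warm invocations can reuse it
classifier, dictionary = joblib.load('model.pkl')


def handler(event, context):
    """Classify the sentence passed in the request"""
    document = event['sentence'].split()
    X = corpus2csc([dictionary.doc2bow(document)], num_terms=len(dictionary)).T
    return {'category': str(classifier.predict(X)[0])}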

4-1. Training phase: Build the model using Docker

This is presented somewhat top-down, without much explanation, but prepare the following Dockerfile.

# Use anaconda (miniconda) as the base image, since it makes building machine-learning libraries easy
FROM continuumio/miniconda

RUN mkdir -p /usr/src/app

WORKDIR /usr/src/app

# Download the dataset
COPY download_corpus.sh /usr/src/app/
RUN sh download_corpus.sh

# Install machine-learning related libraries
COPY conda-requirements.txt /usr/src/app/

RUN conda create -y -n deploy --file conda-requirements.txt
# The libraries end up under /opt/conda/envs/deploy/lib/python2.7/site-packages

COPY . /usr/src/app/

# Train and output the model
RUN python gen_corpus.py \
      && /bin/bash -c "source activate deploy && python train.py"

# Prepare the artifact for deployment:
# pack the code, the trained model, the .so files needed at runtime, etc.
RUN mkdir -p build/lib \
      && cp main.py model.pkl build/ \
      && cp -r /opt/conda/envs/deploy/lib/python2.7/site-packages/* build/ \
      && cp /opt/conda/envs/deploy/lib/libopenblas* /opt/conda/envs/deploy/lib/libgfortran* build/lib/

Building this Dockerfile produces a Docker image packed with the code, the trained model, and the .so files needed to run it. In other words, building the image "builds the model".

To extract the build output from the Docker image and create the package to upload to Lambda, run commands like the following:

docker build -t serverless-ml .
# Extract the build output from the Docker image
id=$(docker create serverless-ml)
docker cp $id:/usr/src/app/build ./build
docker rm -v $id
# Reduce the size
rm build/**/*.pyc
rm -rf build/**/test
rm -rf build/**/tests
# Zip the artifact
cd build/ && zip -q -r -9 ../build.zip ./

You now have a zip file packed with code and models.
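
Since package size is a constraint for Lambda (as mentioned above), it is worth a quick sanity check that build.zip stays within the deployment package size limits. This is a simple check, not part of the original steps:

# Check the size of the deployment package before uploading
du -h build.zip
unzip -l build.zip | tail -n 1   # total uncompressed size and file count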

4-2. Deploy to Lambda

The details are omitted here, since this is standard AWS Lambda usage.

[Step 2.3: Create a Lambda function and test it manually](https://docs.aws.amazon.com/ja_jp/lambda/latest/dg/with-s3-example-upload-deployment-pkg.html#walkthrough-s3-events-adminuser-create-test-function-upload-zip-test-upload) may be helpful.
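
As a rough sketch of this step with the AWS CLI (the function name, role ARN, timeout, and memory size below are placeholder assumptions, not values from the original article):

# Create the Lambda function from the deployment package (values in <...> are placeholders)
aws lambda create-function \
  --function-name serverless-ml \
  --runtime python2.7 \
  --role arn:aws:iam::<account-id>:role/<lambda-execution-role> \
  --handler main.handler \
  --zip-file fileb://build.zip \
  --timeout 60 \
  --memory-size 512

# To redeploy after rebuilding, update only the code
aws lambda update-function-code --function-name serverless-ml --zip-file fileb://build.zip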

4-3. Create a classification API

Use Amazon API Gateway to create a serverless "API".

This is also omitted, since it is standard API Gateway usage.

The guide "Create an API to expose your Lambda function" (http://docs.aws.amazon.com/ja_jp/apigateway/latest/developerguide/getting-started.html) may be helpful.

5. Completed!

As a working example, the following API is available:

https://3lxb3g0cx5.execute-api.us-east-1.amazonaws.com/prod/classify

Let's actually hit the API with curl.

curl -X POST https://3lxb3g0cx5.execute-api.us-east-1.amazonaws.com/prod/classify -H "Content-type: application/json" -d '{"sentence": "a dollar of 115 yen or more at the market price of the trump market 4% growth after the latter half of next year"}'

You should get the classification result money-fx.

Is a serverless machine learning API actually usable?

The conclusion: **not really**.

The reason is that the API returns results too slowly. In the example above, it takes about 5 seconds. An API that takes 5 seconds to respond is, well, **not usable** (wry smile).
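
For reference, one simple way to measure this latency (not part of the original article):

# Measure the end-to-end response time of the classification API
curl -s -o /dev/null -w 'total: %{time_total}s\n' \
  -X POST https://3lxb3g0cx5.execute-api.us-east-1.amazonaws.com/prod/classify \
  -H "Content-type: application/json" \
  -d '{"sentence": "a dollar of 115 yen or more at the market price of the trump market 4% growth after the latter half of next year"}'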

The reason for the slow response is clear: loading the pkl file from disk into memory takes a long time. So if the model file is large, **a serverless machine learning API is simply too slow to respond**.

Conversely, if there is no model file, if the API only uses numpy, or if the model file is very lightweight, then I think it can be used without much trouble.
