Run GPU-required batch processing on AWS
Summary
- The service you should use is AWS Batch
- Watch out for libraries that install different builds depending on whether a GPU is available
- If a GPU-enabled image of the version you need exists on Docker Hub, it is easiest to use it as the base image
- Example: PyTorch
Background
- I want to build a system on AWS that meets the following requirements
- Runs only once a day or week, and incurs no charges the rest of the time
- Ideally, Spot Instances can be used to cut costs further
- GPU required
- Example: https://zenn.dev/miyatsuki/articles/0d66daf6615962
- Several AWS services can do this, and it was not obvious which one to choose.
- AWS Batch (adopted)
Why AWS Batch?
- Fargate
  - Rejected: Fargate does not support GPUs
  - If CPU-only is enough, Fargate is a good fit
- ECS
  - Rejected: the GPU instance would have to keep running even while idle
- SageMaker
  - Not evaluated (there are too many sub-services to judge quickly...)
- AWS Batch
  - Launches instances only while a job is running and terminates them when it finishes
- Spot instances can also be used
- GPU instance can also be used
How to use AWS Batch
1. Push the container image you want to run to ECR (Elastic Container Registry)
2. Create an AWS Batch compute environment
   - Select the managed type
   - Provisioning model -> Spot
   - Maximum % of On-Demand price -> set according to your budget. At 50%, I did not have to wait long for instances.
   - Allowed instance types -> choose GPU-capable instances. g4dn.xlarge is the cheapest, so it is a reasonable starting point.
   - EC2 configuration, image type -> Amazon Linux 2 (GPU)
3. Create an AWS Batch job queue
   - Select the compute environment created in step 2
4. Create an AWS Batch job definition
   - Platform -> EC2
   - Image -> the image pushed in step 1
   - Command -> write what you want the batch to execute, in the form of the container's CMD. This overwrites the CMD of the base image, so even if the image already defines a CMD, you must set it again here.
   - Memory -> start with 2 GB and increase it if that is not enough
   - Number of GPUs -> 1
5. Submit an AWS Batch job
   - Click the button to create a new job on the Jobs screen
   - Select the job definition and job queue created in steps 4 and 3
   - Press Submit at the end and the job will be submitted
6. Monitor the AWS Batch job
   - Watch the job's execution status on the dashboard screen
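The console steps above can also be scripted. The sketch below uses boto3 to submit a job to the queue and read back its status; the queue, job-definition, and job names are placeholders I made up, not values from this article, so substitute your own.

```python
# Sketch: submit an AWS Batch job and check its status with boto3.
# All resource names are placeholders -- use the job queue and job
# definition you created in the console.

def build_submit_kwargs(job_name, job_queue, job_definition, command):
    """Build the keyword arguments for batch.submit_job().

    The containerOverrides command replaces the image's CMD, just like
    the Command field of the job definition does.
    """
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {"command": command},
    }


def submit_and_describe(kwargs):
    # Imported here so the pure helper above works without boto3 installed.
    import boto3

    batch = boto3.client("batch")
    job = batch.submit_job(**kwargs)  # returns {"jobId": ..., "jobName": ...}
    desc = batch.describe_jobs(jobs=[job["jobId"]])
    return desc["jobs"][0]["status"]  # e.g. SUBMITTED / RUNNABLE / RUNNING


# Usage (requires AWS credentials and the resources from the steps above):
#   kwargs = build_submit_kwargs(
#       "nightly-gpu-batch", "gpu-spot-queue",
#       "gpu-batch-job:1", ["python", "main.py"])
#   print(submit_and_describe(kwargs))
```

Since Spot capacity can take a while to appear, a submitted job may sit in RUNNABLE until an instance is provisioned; that is normal, not an error.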
About the container image
Even if the container host is GPU-ready, the workload will silently fall back to the CPU if the software inside the image does not support the GPU. The points I personally watch out for are:
- Some libraries install a CPU-only build via pip install, depending on the environment of the machine running it
  - PyTorch, for example
- If you need the GPU build of such a library, it is easier to pull an image that already contains the GPU build from Docker Hub
- Example: https://hub.docker.com/layers/pytorch/pytorch/1.7.0-cuda11.0-cudnn8-runtime/images/sha256-9cffbe6c391a0dbfa2a305be24b9707f87595e832b444c2bde52f0ea183192f1?context=explore
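As a quick sanity check that the job is really using the GPU, something like the following can run at the start of the batch. It is a generic sketch, not code from this article: a CPU-only PyTorch build (the pip fallback described above) imports fine but reports CUDA as unavailable, which is exactly the silent-CPU failure mode to catch.

```python
def pick_device():
    """Return "cuda" if a GPU-enabled PyTorch build can see a GPU, else "cpu".

    A CPU-only PyTorch build imports without error but reports CUDA as
    unavailable, so without a check like this the job would silently run
    on the CPU.
    """
    try:
        import torch  # may be absent, or a CPU-only build
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


# Logging the chosen device at startup makes an accidental CPU fallback
# visible in the job's logs.
print("running on:", pick_device())
```

If this prints "cpu" on a g4dn instance, the image most likely contains the CPU-only build, and rebasing on a CUDA-enabled image like the pytorch/pytorch tag linked above is the easy fix.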