[PYTHON] Glue Studio [AWS]

Introduction

This article leverages Glue Studio, released on September 23, 2020, to create, run, and monitor GUI-based Glue jobs.

AWS Glue

AWS Glue provides a serverless environment that uses the power of Apache Spark to prepare and process datasets for analysis.

      AWS Glue Documentation
      Optimize memory management in AWS Glue
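For orientation before the GUI steps: a Glue job built in the visual editor is ultimately a PySpark script on top of a standard skeleton. A minimal sketch of that boilerplate (the JOB_NAME argument is supplied by Glue at run time):

```python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job skeleton: resolve arguments, build the GlueContext
# around a SparkContext, and wrap the work in job.init() / job.commit().
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ... extract / transform / load steps go here ...

job.commit()
```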

AWS Glue Studio

AWS Glue Studio is a new visual interface for AWS Glue that makes it easy for extract, transform, and load (ETL) developers to create, run, and monitor AWS Glue ETL jobs. You can now use a simple visual interface to compose jobs that move and transform your data and run them on AWS Glue. You can then use the AWS Glue Studio job execution dashboard to monitor your ETL runs and make sure your jobs are working as intended.

      What is AWS Glue Studio?

See Making ETL easier with AWS Glue Studio on the AWS Big Data Blog. The following steps create and run a job in Glue Studio.

  1. Start creating a Job

    1. Click either Jobs in the navigation panel or Create and manage jobs to start creating a job. [Screenshot]
    2. Choose the Blank graph and click the Create button. [Screenshot]

  2. Adding a data source

    3. Choose the (+) icon.

 On the Node properties tab,
    4. For Name, enter input.
    5. For Node type, choose S3 (Glue Data Catalog table with S3 as the data source). [Screenshot]

 On the Data source properties - S3 tab (create the Data Catalog table with a Crawler beforehand),
    6. For Database, choose pyspark_input.
    7. For Table, choose titanic_data_csv.
    8. For Partition predicate, leave blank. [Screenshot]

 On the Output schema tab,
    9. Check the schema. [Screenshot]

 The equivalent generated read call is sketched below.
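Under the hood, this data source node becomes a create_dynamic_frame.from_catalog call. A minimal sketch, assuming the glueContext from the skeleton above and the database and table names used in this walkthrough:

```python
# Read the crawled Data Catalog table as a DynamicFrame.
# "pyspark_input" / "titanic_data_csv" are the names created beforehand.
input_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="pyspark_input",
    table_name="titanic_data_csv",
    transformation_ctx="input",
)
```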

  3. Adding a transform

    10. Choose the input node.
    11. Choose the (+) icon.

 On the Node properties tab,
    12. For Name, enter transform.
    13. For Node type, choose Custom transform.
    14. For Node parents, choose input. [Screenshot]

 On the Transform tab,
    15. For Code block, write the PySpark code. [Screenshot]

 On the Output schema tab,
    16. Check the schema. [Screenshot]

 When you add a Custom transform, a follow-up node that receives the resulting DynamicFrameCollection is added automatically. A sketch of what the Code block can contain follows this list.
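In a Glue Studio Custom transform, the Code block defines a function that receives a DynamicFrameCollection and must return one. A minimal sketch; the age filter and the "age" column name are assumptions about the Titanic CSV for illustration, not part of the original article:

```python
from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # Pull the single incoming frame out of the collection and switch
    # to a Spark DataFrame for ordinary PySpark operations.
    df = dfc.select(list(dfc.keys())[0]).toDF()

    # Hypothetical transformation: keep passengers aged 20 or older
    # (assumes the Titanic CSV has an "age" column).
    filtered = df.filter(df["age"] >= 20)

    # Wrap the result back into a DynamicFrameCollection for the next node.
    dyf = DynamicFrame.fromDF(filtered, glueContext, "transform")
    return DynamicFrameCollection({"CustomTransform0": dyf}, glueContext)
```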

 On the Node properties tab,
    17. For Name, enter receive (misspelled as "recieve" in the screenshots).
    18. For Node type, choose SelectFromCollection.
    19. For Node parents, choose transform. [Screenshots]

 In the generated script, this node corresponds to the sketch below.
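The SelectFromCollection node picks one DynamicFrame back out of the collection returned by the custom transform. A sketch, assuming transform_dfc is that collection:

```python
from awsglue.transforms import SelectFromCollection

# Select the first (and only) DynamicFrame from the collection
# produced by the custom transform node.
receive_dyf = SelectFromCollection.apply(
    dfc=transform_dfc,
    key=list(transform_dfc.keys())[0],
    transformation_ctx="receive",
)
```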

  4. Adding a data target

    20. Choose the receive node.
    21. Choose the (+) icon.

 On the Node properties tab,
    22. For Name, enter output.
    23. For Node type, choose S3 (Output data directly in an S3 bucket). 
    24. For Node parents, choose receive. [Screenshot]

 On the Data target properties - S3 tab,
    25. For Format, choose CSV.
    26. For Compression Type, choose None.
    27. For S3 Target Location, enter the S3 location in the format s3://bucket/prefix/object/ with a trailing slash (/).
    28. For Partition, leave blank. [Screenshot]

 On the Output schema tab,
    29. Check the schema. [Screenshot]

 The equivalent write call is sketched below.
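This target node maps to a write_dynamic_frame.from_options call using the settings above; the bucket path here is a placeholder:

```python
# Write the selected frame to S3 as uncompressed CSV, matching the
# Format / Compression Type / S3 Target Location settings above.
glueContext.write_dynamic_frame.from_options(
    frame=receive_dyf,
    connection_type="s3",
    connection_options={"path": "s3://bucket/prefix/object/"},
    format="csv",
    transformation_ctx="output",
)
```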

  5. Script

On the Script tab, you can review the script that Glue Studio generated from the graph, combining the nodes above. [Screenshots]

  6. Configuring the job

    30. For IAM Role, choose a role with the AmazonS3FullAccess and AWSGlueConsoleFullAccess policies attached. [Screenshot]
    31. For Job Bookmark, choose Disable.
    32. For Number of retries, optionally enter 1. [Screenshot]

    33. Choose Save.
    34. When the job is saved, choose Run. [Screenshot]

 The same job settings can also be created and started through the API, as sketched below.
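For reference, the same configuration (Spark job, disabled bookmark, one retry) can be expressed with boto3. A sketch under assumed names; the job name, role ARN, and script location are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Create a Spark ("glueetl") job with the same settings as in the console.
glue.create_job(
    Name="glue-studio-job",  # placeholder job name
    Role="arn:aws:iam::123456789012:role/glue-studio-role",  # placeholder role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://bucket/scripts/glue-studio-job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={"--job-bookmark-option": "job-bookmark-disable"},
    MaxRetries=1,
)

# Start the job, as the Run button does.
glue.start_job_run(JobName="glue-studio-job")
```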

  7. Monitoring the job

    35. In the AWS Glue Studio navigation panel, choose Monitoring. [Screenshots]
    36. In the Glue console, check the Glue job. [Screenshot]

 The same run information can also be fetched programmatically, as sketched below.
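A boto3 sketch of what the Monitoring dashboard shows, using the placeholder job name from above:

```python
import boto3

glue = boto3.client("glue")

# List recent runs and their states for the job.
response = glue.get_job_runs(JobName="glue-studio-job", MaxResults=5)
for run in response["JobRuns"]:
    print(run["Id"], run["JobRunState"], run.get("ExecutionTime"))
```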

I was able to create, run and monitor the job.

That's all for the main subject, but here is a quick look at what else you can do with Glue. The following architecture is an example of a data processing pipeline that performs batch processing using Glue.

1. Putting data on S3 triggers a CloudWatch Events rule whose target starts a Step Functions state machine.
2. Step Functions invokes Lambda functions that start Glue's crawler and run the PySpark job against the data on S3.
3. The data converted by PySpark is output to S3.

The Lambda tasks in step 2 are sketched below.
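Those Lambda tasks would essentially be thin boto3 wrappers around the Glue API. A minimal sketch; the crawler and job names are hypothetical:

```python
import boto3

glue = boto3.client("glue")

def start_crawler_handler(event, context):
    # Step Functions task: catalog the newly arrived S3 data.
    glue.start_crawler(Name="pyspark-input-crawler")  # hypothetical crawler name
    return event

def start_job_handler(event, context):
    # Step Functions task: run the PySpark Glue job on the cataloged data.
    run = glue.start_job_run(JobName="glue-studio-job")  # hypothetical job name
    return {"JobRunId": run["JobRunId"]}
```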

Summary

I used Glue Studio to create and run a GUI-based Glue job.
