[PYTHON] Glue Studio [AWS]

Introduction

This article leverages Glue Studio, released on September 23, 2020, to create, run, and monitor GUI-based Glue jobs.

AWS Glue

AWS Glue provides a serverless environment that uses the power of Apache Spark to prepare and process datasets for analysis.

      AWS Glue Documentation
      Optimize memory management in AWS Glue
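For orientation before the GUI steps: a Glue job built in the visual editor is ultimately a PySpark script on top of a standard skeleton. A minimal sketch of that boilerplate (the JOB_NAME argument is supplied by Glue at run time):

```python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job skeleton: resolve arguments, build the GlueContext
# around a SparkContext, and wrap the work in job.init() / job.commit().
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ... extract / transform / load steps go here ...

job.commit()
```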

AWS Glue Studio

AWS Glue Studio is a new visual interface for AWS Glue that makes it easy for extract, transform, and load (ETL) developers to create, run, and monitor AWS Glue ETL jobs. You can now use a simple visual interface to compose jobs that move and transform your data and run them on AWS Glue. You can then use the AWS Glue Studio job execution dashboard to monitor your ETL runs and make sure your jobs are working as intended.

      What is AWS Glue Studio?

See Making ETL easier with AWS Glue Studio on the AWS Big Data Blog. The following steps create and run a job in Glue Studio.

  1. Start creating a Job

    1. Click either Jobs in the navigation panel or Create and manage jobs to start creating a job. [Screenshot]
    2. Choose the Blank graph and click the Create button. [Screenshot]

  2. Adding a data source

    3. Choose the (+) icon.

 On the Node properties tab,
    4. For Name, enter input.
    5. For Node type, choose S3 (Glue Data Catalog table with S3 as the data source). [Screenshot]

 On the Data source properties - S3 tab (create the Data Catalog table with a Crawler beforehand),
    6. For Database, choose pyspark_input.
    7. For Table, choose titanic_data_csv.
    8. For Partition predicate, leave blank. [Screenshot]

 On the Output schema tab,
    9. Check the schema. [Screenshot]

 The equivalent generated read call is sketched below.
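Under the hood, this data source node becomes a create_dynamic_frame.from_catalog call. A minimal sketch, assuming the glueContext from the skeleton above and the database and table names used in this walkthrough:

```python
# Read the crawled Data Catalog table as a DynamicFrame.
# "pyspark_input" / "titanic_data_csv" are the names created beforehand.
input_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="pyspark_input",
    table_name="titanic_data_csv",
    transformation_ctx="input",
)
```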

  3. Adding a transform

    10. Choose the input node.
    11. Choose the (+) icon.

 On the Node properties tab,
    12. For Name, enter transform.
    13. For Node type, choose Custom transform.
    14. For Node parents, choose input. [Screenshot]

 On the Transform tab,
    15. For Code block, write the PySpark code. [Screenshot]

 On the Output schema tab,
    16. Check the schema. [Screenshot]

 When you add a Custom transform, a follow-up node that receives the resulting DynamicFrameCollection is added automatically. A sketch of what the Code block can contain follows this list.
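In a Glue Studio Custom transform, the Code block defines a function that receives a DynamicFrameCollection and must return one. A minimal sketch; the age filter and the "age" column name are assumptions about the Titanic CSV for illustration, not part of the original article:

```python
from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # Pull the single incoming frame out of the collection and switch
    # to a Spark DataFrame for ordinary PySpark operations.
    df = dfc.select(list(dfc.keys())[0]).toDF()

    # Hypothetical transformation: keep passengers aged 20 or older
    # (assumes the Titanic CSV has an "age" column).
    filtered = df.filter(df["age"] >= 20)

    # Wrap the result back into a DynamicFrameCollection for the next node.
    dyf = DynamicFrame.fromDF(filtered, glueContext, "transform")
    return DynamicFrameCollection({"CustomTransform0": dyf}, glueContext)
```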

 On the Node properties tab,
    17. For Name, enter receive (misspelled as "recieve" in the screenshots).
    18. For Node type, choose SelectFromCollection.
    19. For Node parents, choose transform. [Screenshots]

 In the generated script, this node corresponds to the sketch below.
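The SelectFromCollection node picks one DynamicFrame back out of the collection returned by the custom transform. A sketch, assuming transform_dfc is that collection:

```python
from awsglue.transforms import SelectFromCollection

# Select the first (and only) DynamicFrame from the collection
# produced by the custom transform node.
receive_dyf = SelectFromCollection.apply(
    dfc=transform_dfc,
    key=list(transform_dfc.keys())[0],
    transformation_ctx="receive",
)
```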

  4. Adding a data target

    20. Choose the receive node.
    21. Choose the (+) icon.

 On the Node properties tab,
    22. For Name, enter output.
    23. For Node type, choose S3 (Output data directly in an S3 bucket). 
    24. For Node parents, choose receive. [Screenshot]

 On the Data target properties - S3 tab,
    25. For Format, choose CSV.
    26. For Compression Type, choose None.
    27. For S3 Target Location, enter the S3 location in the format s3://bucket/prefix/object/ with a trailing slash (/).
    28. For Partition, leave blank. [Screenshot]

 On the Output schema tab,
    29. Check the schema. [Screenshot]

 The equivalent write call is sketched below.
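This target node maps to a write_dynamic_frame.from_options call using the settings above; the bucket path here is a placeholder:

```python
# Write the selected frame to S3 as uncompressed CSV, matching the
# Format / Compression Type / S3 Target Location settings above.
glueContext.write_dynamic_frame.from_options(
    frame=receive_dyf,
    connection_type="s3",
    connection_options={"path": "s3://bucket/prefix/object/"},
    format="csv",
    transformation_ctx="output",
)
```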

  5. Script

On the Script tab, you can review the script that Glue Studio generated from the graph, combining the nodes above. [Screenshots]

  6. Configuring the job

    30. For IAM Role, choose a role with the AmazonS3FullAccess and AWSGlueConsoleFullAccess policies attached. [Screenshot]
    31. For Job Bookmark, choose Disable.
    32. For Number of retries, optionally enter 1. [Screenshot]

    33. Choose Save.
    34. When the job is saved, choose Run. [Screenshot]

 The same job settings can also be created and started through the API, as sketched below.
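For reference, the same configuration (Spark job, disabled bookmark, one retry) can be expressed with boto3. A sketch under assumed names; the job name, role ARN, and script location are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Create a Spark ("glueetl") job with the same settings as in the console.
glue.create_job(
    Name="glue-studio-job",  # placeholder job name
    Role="arn:aws:iam::123456789012:role/glue-studio-role",  # placeholder role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://bucket/scripts/glue-studio-job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={"--job-bookmark-option": "job-bookmark-disable"},
    MaxRetries=1,
)

# Start the job, as the Run button does.
glue.start_job_run(JobName="glue-studio-job")
```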

  7. Monitoring the job

    35. In the AWS Glue Studio navigation panel, choose Monitoring. [Screenshots]
    36. In the Glue console, check the Glue job. [Screenshot]

 The same run information can also be fetched programmatically, as sketched below.
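A boto3 sketch of what the Monitoring dashboard shows, using the placeholder job name from above:

```python
import boto3

glue = boto3.client("glue")

# List recent runs and their states for the job.
response = glue.get_job_runs(JobName="glue-studio-job", MaxResults=5)
for run in response["JobRuns"]:
    print(run["Id"], run["JobRunState"], run.get("ExecutionTime"))
```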

I was able to create, run and monitor the job.

That's all for the main subject, but here is a quick look at what else you can do with Glue. The following architecture is an example of a data processing pipeline that performs batch processing using Glue.

1. Putting data on S3 triggers a CloudWatch Events rule whose target starts a Step Functions state machine.
2. Step Functions invokes Lambda functions that start Glue's crawler and run the PySpark job against the data on S3.
3. The data converted by PySpark is output to S3.

The Lambda tasks in step 2 are sketched below.
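Those Lambda tasks would essentially be thin boto3 wrappers around the Glue API. A minimal sketch; the crawler and job names are hypothetical:

```python
import boto3

glue = boto3.client("glue")

def start_crawler_handler(event, context):
    # Step Functions task: catalog the newly arrived S3 data.
    glue.start_crawler(Name="pyspark-input-crawler")  # hypothetical crawler name
    return event

def start_job_handler(event, context):
    # Step Functions task: run the PySpark Glue job on the cataloged data.
    run = glue.start_job_run(JobName="glue-studio-job")  # hypothetical job name
    return {"JobRunId": run["JobRunId"]}
```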

Summary

I used Glue Studio to create and run a GUI-based Glue job.
