This article uses AWS Glue Studio, released on September 23, 2020, to create, run, and monitor a GUI-based Glue job.

AWS Glue
AWS Glue provides a serverless environment that uses the power of Apache Spark to prepare and process datasets for analysis.
See AWS Glue Documentation and Optimize memory management in AWS Glue.
AWS Glue Studio
AWS Glue Studio is a new visual interface for AWS Glue that makes it easy for extract, transform, and load (ETL) developers to create, run, and monitor AWS Glue ETL jobs. You can use a simple visual interface to build jobs that move and transform your data and run on AWS Glue, and then use the AWS Glue Studio Job Execution Dashboard to monitor your ETL execution and make sure your jobs are working as intended.
See Making ETL easier with AWS Glue Studio on the AWS Big Data Blog.

To create and run a job in Glue Studio:
1. Choose Jobs on the navigation panel, or choose Create and manage jobs, and start creating a job.
2. Choose Blank graph and click the Create button.
3. Choose the (+) icon.
On the Node properties tab:
4. For Name, enter input.
5. For Node type, choose S3 (Glue Data Catalog table with S3 as the data source).
On the Data source properties - S3 tab (create the Data Catalog table with a crawler beforehand):
6. For Database, choose pyspark_input.
7. For Table, choose titanic_data_csv.
8. For Partition predicate, leave it blank.
On the Output schema tab:
9. Check the schema.
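For reference, the script that Glue Studio generates for this kind of Data Catalog source is roughly equivalent to the following sketch (the database and table names come from the steps above; the transformation_ctx value is illustrative):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the crawled Titanic table from the Glue Data Catalog as a DynamicFrame
input_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="pyspark_input",
    table_name="titanic_data_csv",
    transformation_ctx="input",
)
```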
10. Choose the input node.
11. Choose the (+) icon.
On the Node properties tab:
12. For Name, enter transform.
13. For Node type, choose Custom transform.
14. For Node parents, choose input.
On the Transform tab:
15. For Code block, write PySpark (Python) code; a sketch follows this list.
On the Output schema tab:
16. Check the schema.
When you add a Custom transform, a next node that receives the resulting DynamicFrameCollection is added automatically.
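As an example, the Code block for the custom transform could look like the following minimal sketch. Glue Studio hands the function a GlueContext and a DynamicFrameCollection and expects a DynamicFrameCollection back; the Survived-based filter is just an illustrative transformation and assumes the Titanic CSV has a Survived column:

```python
from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # Take the first (and only) DynamicFrame from the incoming collection
    # and convert it to a Spark DataFrame for ordinary PySpark operations.
    df = dfc.select(list(dfc.keys())[0]).toDF()

    # Illustrative transform: keep only rows where Survived == 1
    # (assumes the Titanic CSV has a "Survived" column).
    filtered = df.filter(df["Survived"] == 1)

    # Wrap the result back into a DynamicFrame and return it as a collection,
    # which the next (SelectFromCollection) node receives.
    out = DynamicFrame.fromDF(filtered, glueContext, "out")
    return DynamicFrameCollection({"out": out}, glueContext)
```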
On the Node properties tab:
17. For Name, enter receive.
18. For Node type, choose SelectFromCollection.
19. For Node parents, choose transform.
20. Choose the receive node.
21. Choose the (+) icon.
On the Node properties tab:
22. For Name, enter output.
23. For Node type, choose S3 (Output data directly in an S3 bucket).
24. For Node parents, choose receive.
On the Data target properties - S3 tab:
25. For Format, choose CSV.
26. For Compression Type, choose None.
27. For S3 Target Location, enter the S3 location in the format s3://bucket/prefix/object/ with a trailing slash (/).
28. For Partition, leave it blank.
On the Output schema tab:
29. Check the schema.
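The generated code for this S3 target is roughly equivalent to the sketch below (receive_dyf is a placeholder for the DynamicFrame coming out of the receive node, and the bucket path is the placeholder format from step 27):

```python
# Write the selected DynamicFrame to the S3 target location as uncompressed CSV
glueContext.write_dynamic_frame.from_options(
    frame=receive_dyf,  # placeholder: the DynamicFrame from the receive node
    connection_type="s3",
    connection_options={"path": "s3://bucket/prefix/object/"},
    format="csv",
    transformation_ctx="output",
)
```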
30. For IAM Role, choose a role with the AmazonS3FullAccess and AWSGlueConsoleFullAccess policies attached.
31. For Job Bookmark, choose Disable.
32. For Number of retries, optionally enter 1.
33. Choose Save.
34. When the job is saved, choose Run.
35. In the AWS Glue Studio navigation panel, choose Monitoring.
36. In the Glue console, check the Glue job.
I was able to create, run, and monitor the job.
That's all for the main subject, but here's a quick look at what else you can do with Glue. The following architecture is an example of a data processing infrastructure that performs batch processing with Glue.
1. Putting data on S3 triggers a CloudWatch Events rule, whose target starts a Step Functions state machine.
2. Step Functions invokes a Lambda function that runs the Glue crawler and the PySpark Glue job against the data on S3 (see the sketch after this list).
3. The data converted by PySpark is output back to S3.

I used Glue Studio to create and run a GUI-based Glue job.
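As a rough sketch of step 2, the Lambda function started by Step Functions could kick off the crawler and the Glue job with boto3 as follows (the crawler and job names are hypothetical, and a real pipeline would wait for the crawler to finish before starting the job):

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Start the crawler that catalogs the newly arrived S3 data
    # ("titanic_data_crawler" is a placeholder name).
    glue.start_crawler(Name="titanic_data_crawler")

    # Start the PySpark ETL job created in Glue Studio
    # ("titanic_etl_job" is a placeholder name). A real pipeline would
    # poll get_crawler until the crawl finishes before making this call.
    response = glue.start_job_run(JobName="titanic_etl_job")
    return {"JobRunId": response["JobRunId"]}
```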