AWS EMR vs GLUE evaluation for ETL workflow @Data engineering

Rajat Kumar
Apr 26, 2020

To keep the software development life cycle easy, the following should be achievable when choosing an external framework/library or cloud solution:

  1. Develop code locally on a developer machine.
  2. Test code end to end with test data on the local machine.
  3. Compile and build the jar on the local machine and on Jenkins.
  4. Run unit tests and integration tests on the local machine and on Jenkins.
  5. Debug issues easily — job logs should be readable.
  6. Keep the cost of the system low.

Development Issues in Glue:

While using the Glue API, developing code on a local machine is not possible: compiling the code requires glue-assembly.jar, which is not mentioned in the official AWS documentation (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html).

Even after compile errors are resolved in the IDE, `sbt package` is unable to generate the jar, since it is a fat jar whose bundled Spark dependencies conflict with other libraries, and these conflicts are not easy to resolve.

Unit testing is not possible either: you need to set up the Glue-specific Spark environment (Glue's own Spark libraries) on the local machine as well as on the Jenkins machine that deploys to the stage/prod environments, which is really not easy to set up.

Here is a blog describing the same issues in more detail, which we also faced while developing Glue Scala scripts: https://techmagie.wordpress.com/2019/07/29/implementing-etl-job-using-aws-glue/

Issues faced in Glue while running a Spark (Scala) job:

  1. Cold-start problem on each Spark job run.
  2. "Resource unavailable" errors (even a single job can fail with this exception).
  3. The Spark UI is not available while a job is running (it only becomes available after the job finishes).
  4. Compiling code and running unit tests in a local environment is not possible, so developers always have to go to the AWS console. This takes more developer time and incurs cost to the system, since testing also has to be done on AWS (using a Glue endpoint or the Glue console).
  5. Debugging is very difficult: the user has to go to the AWS console and scroll through the logs of a long-running distributed job, which makes it really hard to find the original root cause of an issue.

An AWS Glue job of type Apache Spark requires a minimum of 2 DPUs. By default, AWS Glue allocates 10 DPUs to each Apache Spark job. You are billed $0.44 per DPU-Hour in increments of 1 second, rounded up to the nearest second, with a 10-minute minimum duration for each job of type Apache Spark.

1 DPU provides 4 vCPU and 16 GB of memory (https://aws.amazon.com/glue/pricing/). For comparison, an m4.xlarge instance running under EMR costs roughly $0.21 + $0.06 = $0.27 per hour.
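The billing rules above can be turned into a small worked example. This is a minimal sketch using the 2020 rates quoted in this article ($0.44 per DPU-hour, 10-DPU default, 10-minute minimum); actual Glue prices vary by region and over time:

```python
def glue_job_cost(duration_seconds, dpus=10, rate_per_dpu_hour=0.44,
                  minimum_seconds=600):
    """Estimate the billed cost in USD for one Glue Spark job run.

    Billing is per second with a 10-minute minimum, so short jobs are
    charged for the full minimum duration.
    """
    billed_seconds = max(duration_seconds, minimum_seconds)
    return dpus * (billed_seconds / 3600) * rate_per_dpu_hour

# A 5-minute job is billed for the full 10-minute minimum:
print(round(glue_job_cost(5 * 60), 4))   # 0.7333
# A 30-minute job with the default 10 DPUs:
print(round(glue_job_cost(30 * 60), 2))  # 2.2
```

Note how the 10-minute minimum dominates for short jobs: combined with the cold-start delay, frequent small jobs make Glue's per-run cost hard to ignore.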

Advantages of EMR over Glue:

  • Job runs and re-runs execute immediately (no cold start), which helps improve SLAs.
  • Less costly than Glue if used properly (start the EMR cluster for the workflow and terminate it after completion).
  • Spark jobs can be configured and optimised to run fast with minimal resources (Glue fixes the Spark driver memory at 5 GB, whereas an EMR Spark job can run with as little as 1 GB, depending on the job).
  • Debugging a Spark job to find the actual root cause is much easier in EMR than in Glue.


One DPU is roughly equivalent to an m4.xlarge EC2 instance (4 vCPU, 16 GB of memory).
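Given that equivalence, the hourly rates quoted earlier can be compared directly. A rough sketch using the 2020 prices mentioned in this article, assuming the $0.27 figure is the EC2 on-demand rate plus the EMR surcharge for an m4.xlarge; real prices vary by region:

```python
# 2020 rates quoted in the article (assumptions, not current prices).
GLUE_PER_DPU_HOUR = 0.44       # Glue: $0.44 per DPU-hour
EC2_M4_XLARGE_PER_HOUR = 0.21  # EC2 on-demand rate for m4.xlarge
EMR_SURCHARGE_PER_HOUR = 0.06  # EMR fee on top of the EC2 rate

emr_per_hour = EC2_M4_XLARGE_PER_HOUR + EMR_SURCHARGE_PER_HOUR
print(round(emr_per_hour, 2))                      # 0.27
print(round(GLUE_PER_DPU_HOUR - emr_per_hour, 2))  # 0.17
```

So per DPU-equivalent hour of compute, a properly terminated EMR cluster comes out noticeably cheaper than Glue, on top of the development and debugging advantages listed above.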

Originally published at https://medium.com on April 26, 2020.
