How to Debug EMR Step Failures
At Clumio, EMR has become one of our most important tools for processing data, and we use it widely to build and deliver ETL, analytics and ML applications. EMR is relatively easy to set up and comes with many built-in features such as Spark, Hive and Sqoop. However, EMR requires careful planning and design to maintain and debug. This blog provides tips on debugging EMR step failures, in particular memory failures. In a follow-up blog we will go into more detail on how to fine-tune EMR step settings.
For this blog, we assume readers are already familiar with the EMR architecture. If you are new to EMR, please go through the “Getting Started” tutorial first. Some key concepts we will refer to in this blog are:
Node: an EC2 instance that is a member of the EMR cluster. A node can be a master node, a core node or a task node. The master node manages the EMR cluster and typically runs master components of distributed applications. Core nodes perform monitoring and data processing tasks. Task nodes are optional and add power to data processing tasks.
Step: A unit of work with instructions to manipulate data for processing. A step is distributed into multiple sub-units of work executed across the EMR cluster. For more information please refer to https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview.html
Container: Represents a resource allocation (memory + CPU) on a single node (core node or task node) of the EMR cluster. A step is distributed into multiple sub-units of work and each sub-unit runs inside such a container. By default, EMR uses YARN to schedule containers: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview-arch.html
Application Master: When a step starts, EMR will launch a main container that is responsible for requesting, launching and monitoring additional containers for the step. This main container is called the application master. The entire set of containers launched to execute the step is referred to as an application.
The most common way to submit jobs to an EMR cluster is through steps. To see the steps that have run on a cluster, click the Steps tab for the cluster in the EMR console.
Amazon EMR step logs are written by the monitoring process running on the master node and contain important summaries for the step (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html). Each step produces four log files:
- controller — Information about the processing of the step. If your step fails while loading, you can find the stack trace in this log.
- syslog — Describes the execution of Hadoop jobs in the step.
- stderr — The standard error channel of Hadoop while it processes the step. If the step finished successfully without any error, stderr will contain the message Command exiting with ret '0'. If the step had any failures, stderr will contain the message Command exiting with ret '1'.
- stdout — The standard output channel of Hadoop while it processes the step.
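As a quick triage aid, that ret marker can be checked programmatically. Below is a minimal sketch (the function name and approach are ours, not part of any EMR tooling) that pulls the exit code out of a downloaded stderr log:

```python
import re

def step_exit_code(stderr_text):
    """Extract the step's exit code from the last
    "Command exiting with ret 'N'" marker in a stderr log.
    Returns None when no marker is present (e.g. step still running)."""
    matches = re.findall(r"Command exiting with ret '(\d+)'", stderr_text)
    return int(matches[-1]) if matches else None

# ret 0 means the step succeeded; any non-zero ret is a failure.
```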
Examples of Step Failures
If a step finished successfully without any error, stderr will contain the message Command exiting with ret '0'. If the step had any failures, stderr will contain the message Command exiting with ret '1'. In addition, the stderr log file usually contains the last error that the step ran into. For example, the following error message indicates a possible syntax error in Spark SQL:
org.apache.spark.sql.AnalysisException: cannot recognize input near ''timestamp'' ')' ')' in primitive type specification; line 1 pos 112
    at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:318)
    at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
    at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.atomExpression(HiveParser_IdentifiersParser.java:6716)
Some step errors such as syntax errors require fixing only the application logic. Other errors such as memory errors often require tuning of the EMR cluster. For the remainder of this blog, we will focus on how to debug memory errors. In the next blog we will go into more details on how to fine-tune EMR step memory settings.
Memory failures are the most common as usage of EMR scales, but they are also the most difficult to debug. Fortunately, the stderr log file will almost always surface the error properly. Some common memory errors are:
Container running out of physical memory
Container[pid=31642,containerID=container_1641007063720_0171_01_000009] is running beyond physical memory limits. Current usage: 16.7 GB of 16 GB physical memory used; 22.9 GB of 80 GB virtual memory used. Killing container.
Hadoop/Spark application running out of heap memory.
14/01/15 21:42:28 INFO cluster.ClusterTaskSetManager: Starting task 1.0:32 as TID 35 on executor 7: Salve4.Hadoop (NODE_LOCAL) Uncaught error from thread [spark-akka.actor.default-dispatcher-3] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled forActorSystem[spark] java.lang.OutOfMemoryError: Java heap space
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Container [pid=19146,containerID=container_1657863401829_0212_01_000009]
Command exiting with ret '1'
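When scanning many stderr logs, it helps to grep for these signatures automatically. A small sketch (the pattern names and exact regexes are our assumptions; tune them to the messages your Hadoop/Spark versions emit):

```python
import re

# Signatures of the common memory errors shown above.
MEMORY_ERROR_PATTERNS = {
    "physical_memory_limit": re.compile(
        r"Container\s*\[pid=\d+,\s*containerID=(?:container_\S+?)\]"
        r" is running beyond physical memory limits"),
    "java_heap_space": re.compile(
        r"java\.lang\.OutOfMemoryError: Java heap space"),
    "exit_code_143": re.compile(
        r"(?:Exit code is|exited with a non-zero exit code) 143"),
}

def find_memory_errors(log_text):
    """Return the names of the memory-error signatures present in a log."""
    return sorted(name for name, pattern in MEMORY_ERROR_PATTERNS.items()
                  if pattern.search(log_text))
```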
Locating container logs
It is important to know what the step was executing at the time of the memory failure, and this is where EMR doesn’t always make it easy for us. The controller/syslog/stderr/stdout log files are summaries written by the EMR monitoring process running on the master node, but the actual memory-intensive workload is executed in multiple containers across the EMR cluster. The summary log files alone are often not sufficient, and we need to locate the container logs to understand why certain containers are struggling with memory usage.
If we expand the step ID in the EMR console, we can see an S3 URI for the log file location. The controller/syslog/stderr/stdout log files are all located in that directory.
Container logs are grouped together in a different subfolder: s3://my-emr-internal-tessting-log-bucket/emr-logs/j-3G123456789/containers/<application_id>. Here the <application_id> can be translated from the container ID that appears in the stderr output. If we see container_1657863401829_0212_01 in the stderr output, then the <application_id> would be application_1657863401829_0212.
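This translation is mechanical: a YARN container ID embeds the application's cluster timestamp and sequence number. A small helper (our own sketch, assuming the non-epoch container ID format shown above):

```python
def application_id_from_container_id(container_id):
    """Translate a YARN container ID, e.g.
    container_<clusterTimestamp>_<appSequence>_<attemptId>_<containerNum>,
    into the corresponding application ID, which reuses the
    first two numeric fields."""
    parts = container_id.split("_")
    if parts[0] != "container" or len(parts) < 3:
        raise ValueError("not a container ID: %r" % container_id)
    return "application_%s_%s" % (parts[1], parts[2])
```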
Application master logs
As described earlier, when a step starts, EMR launches a main container (the application master) that requests, launches and monitors the additional containers for the step, and the entire set of containers is called an application. Under s3://my-emr-internal-tessting-log-bucket/emr-logs/j-3G123456789/containers/<application_id> we can see a corresponding subfolder for every container that has been launched. The first subfolder (in this example container_1657863401829_0212_01_000002) corresponds to the container where the application master was running. We can see two files under the application master container subfolder:
- stderr — The standard error channel of the application master container.
- syslog — Describes the execution of all containers in the application.
syslog usually contains a high-level execution history of all containers launched for the application. If a container completed successfully, we will see log lines like the following, which can be spaced far apart:
2022-07-15 05:47:26,335 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1657863401829_0020_01_000002 to attempt_1657863401829_0020_m_000000_0
2022-07-15 05:47:41,365 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1657863401829_0020_m_000000_0 TaskAttempt Transitioned from SUCCESS_FINISHING_CONTAINER to SUCCEEDED
If a container completed with an error, then we will see log lines like the following next to each other:
2022-07-15 06:36:32,593 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1657863401829_0212_01_000009
2022-07-15 06:36:32,594 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1657863401829_0212_m_000000_1: Container [pid=19146,containerID=container_1657863401829_0212_01_000009] is running beyond physical memory limits. Current usage: 4.1 GB of 4 GB physical memory used; 7.8 GB of 20 GB virtual memory used. Killing container.
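Because the diagnostics report sits right next to the completion line, the application master syslog is easy to scan for failed containers. A rough sketch (the regex is our assumption, based on the Diagnostics report format shown above):

```python
import re

# Matches "Diagnostics report from <attempt>: Container [pid=...,containerID=...]"
DIAGNOSTICS = re.compile(
    r"Diagnostics report from (\S+): Container\s*\[pid=\d+,"
    r"\s*containerID=(container_\S+?)\]")

def failed_containers(am_syslog_text):
    """Return (task_attempt, container_id) pairs for every
    diagnostics report found in an application master syslog."""
    return DIAGNOSTICS.findall(am_syslog_text)
```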
In our example, container_1657863401829_0212_01_000009 has failed. So the next thing to look up is what this particular container was doing at the time of failure. We can navigate down to the subfolder that contains its logs:
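With the bucket, cluster ID and container ID in hand, the path to a failed container's logs can be assembled directly (a sketch assuming the log layout shown in this post; verify against your own bucket):

```python
def container_log_prefix(log_uri, cluster_id, container_id):
    """Build the S3 prefix under which EMR uploads a container's logs:
    <log_uri>/<cluster_id>/containers/<application_id>/<container_id>/"""
    parts = container_id.split("_")
    application_id = "application_%s_%s" % (parts[1], parts[2])
    return "%s/%s/containers/%s/%s/" % (
        log_uri.rstrip("/"), cluster_id, application_id, container_id)

# The resulting prefix can then be fetched, e.g. with
#   aws s3 cp --recursive <prefix> .
```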
The last few lines from syslog tell us the commands Hadoop was executing:
2022-07-15 06:29:14,377 INFO [main] org.apache.sqoop.mapreduce.db.DBRecordReader: Working on split: 1=1 AND 1=1
2022-07-15 06:29:14,490 INFO [main] org.apache.sqoop.mapreduce.db.DBRecordReader: Executing query: SELECT "C_PRIM_KEY_11784961","C_DATA_7CCA9B4E" FROM "PCC_ITXA"."v_pcc_itxa_spe_map_da_5504807e" WHERE ( 1=1 ) AND ( 1=1 )
2022-07-15 06:35:48,295 INFO [s3n-worker-2] com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream: uploadPart: partNum 1
The last few lines from stdout, which is the output channel of the Hadoop commands:
Jul 15, 2022 6:36:19 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 10,774B for [C_PRIM_KEY_11784961] BINARY: 289 values, 12,145B raw, 10,668B comp, 1 pages, encodings: [RLE, PLAIN, BIT_PACKED]
Jul 15, 2022 6:36:19 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 52,612,603B for [C_DATA_7CCA9B4E] BINARY: 289 values, 116,276,727B raw, 36,761,571B comp, 20 pages, encodings: [RLE, PLAIN, BIT_PACKED]
Jul 15, 2022 6:36:27 AM INFO: parquet.hadoop.InternalParquetRecordWriter: mem size 79,227,168 > 52,428,800: flushing 100 records to disk.
Jul 15, 2022 6:36:27 AM INFO: parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 79,398,528
This indicates that the application’s parquet writer was consuming a lot of memory. We likely need to increase the number of partitions or increase the container memory sizes.
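To reason about “increase the number of partitions”, one rough heuristic is to pick a per-task memory budget and size the partition count so that each task's share of the input stays under it. A back-of-the-envelope sketch (the budget heuristic is ours; serialization overhead and data skew can easily dominate in practice):

```python
import math

def partition_count(total_input_bytes, per_task_budget_bytes):
    """Smallest partition count that keeps each task's share of the
    input under the given memory budget (uniform-split assumption)."""
    return max(1, math.ceil(total_input_bytes / per_task_budget_bytes))

# e.g. 100 GB of input with a 2 GB per-task budget -> 50 partitions
```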
In this blog we have provided a step-by-step guide to debugging EMR step failures, including common memory failures. In a follow-up blog we will go into more detail on how to fine-tune EMR step settings.