
23 Aug 2022

How to Debug EMR Step Failures

Xia Hua, Director of Engineering

Unlocking the Potential of EMR: Mastering Debugging for Step Failures in Data Processing

At Clumio, EMR has become one of our most important tools for processing data, and we use it widely to build and deliver ETL, analytics, and ML applications. EMR is relatively easy to set up and comes with many built-in features such as Spark, Hive, and Sqoop. However, EMR requires careful planning and design to maintain and debug. This blog aims to provide tips on debugging EMR step failures, in particular memory failures. In a follow-up blog we will go into more detail on how to fine-tune EMR step settings.

EMR Architecture: Understanding Nodes, Steps, Containers, and Application Masters in Data Processing Tasks

For this blog, we assume readers are already familiar with the EMR architecture. If you are new to EMR, please go through the “Getting Started” tutorial. Some key concepts we will refer to in this blog are:

  • Node: an EC2 instance that is a member of the EMR cluster. A node can be a master node, a core node, or a task node. The master node manages the EMR cluster and typically runs the master components of distributed applications. Core nodes perform monitoring and data processing tasks. Task nodes are optional and add extra compute capacity for data processing tasks.
  • Step: A unit of work with instructions to manipulate data for processing. A step is distributed into multiple sub-units of work executed across the EMR cluster. For more information, please refer to https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview.html
  • Container: Represents a resource allocation (memory + CPU) on a single node (core node or task node) of the EMR cluster. A step is distributed into multiple sub-units of work, and each sub-unit runs inside such a container. By default, EMR uses YARN to schedule containers (see https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview-arch.html).
  • Application Master: When a step starts, EMR will launch a main container that is responsible for requesting, launching and monitoring additional containers for the step. This main container is called the application master. The entire set of containers launched to execute the step is referred to as an application.

EMR Steps

The most common way to submit jobs to an EMR cluster is through steps, where each step is a unit of work that contains instructions to manipulate data for processing. To see the steps that have run on a cluster, simply click on the Steps tab.
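The same information is available programmatically. Below is a minimal sketch using boto3 (assuming AWS credentials and region are configured; the cluster ID is a placeholder) that lists the steps on a cluster along with their states and any failure details:

import boto3

emr = boto3.client("emr")

# List the steps that have run on the cluster (cluster ID is a placeholder)
response = emr.list_steps(ClusterId="j-3G123456789")
for step in response["Steps"]:
    status = step["Status"]
    print(step["Id"], step["Name"], status["State"], status.get("FailureDetails", ""))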

Amazon EMR step logs are written by the monitoring process running on the master node and contain important summaries for the step (see https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html):

  • controller — Information about the processing of the step. If your step fails while loading, you can find the stack trace in this log.
  • syslog — Describes the execution of Hadoop jobs in the step.
  • stderr — The standard error channel of Hadoop while it processes the step. If the step has finished successfully without any error, stderr will contain the message Command exiting with ret '0'.  If the step has any failures, stderr will contain the message Command exiting with ret '1'.
  • stdout — The standard output channel of Hadoop while it processes the step.
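These summary files live under the step's S3 log prefix (shown later in this post). As a minimal sketch, assuming the bucket and step prefix from the example below and the gzip-compressed file names EMR normally uploads, the following downloads each summary log and prints its last few lines:

import boto3
import gzip

s3 = boto3.client("s3")

# Step log location as it appears in the console, e.g.
# s3://my-emr-internal-testing-log-bucket/emr-logs/j-3G123456789/steps/s-27O4XAVMAZ08M/
bucket = "my-emr-internal-testing-log-bucket"
prefix = "emr-logs/j-3G123456789/steps/s-27O4XAVMAZ08M/"

for name in ["controller", "syslog", "stderr", "stdout"]:
    try:
        obj = s3.get_object(Bucket=bucket, Key=prefix + name + ".gz")
    except s3.exceptions.NoSuchKey:
        continue  # not every step writes every summary file
    text = gzip.decompress(obj["Body"].read()).decode("utf-8", errors="replace")
    print("=====", name, "=====")
    print("\n".join(text.splitlines()[-5:]))  # the last few lines usually contain the error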

Examples of Step Failures

Errors in stderr

As noted above, stderr ends with the message Command exiting with ret '0' when the step finished successfully and Command exiting with ret '1' when the step failed. In addition, the stderr log file usually contains the last error that the step ran into. For example, the following error message indicates a possible syntax error in Spark SQL:

org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.atomExpression(HiveParser_IdentifiersParser.java:6716)
org.apache.spark.sql.AnalysisException: cannot recognize input near ''timestamp'' ')' ')' in primitive type specification; line 1 pos 112
    at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:318)
    at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41

Some step errors, such as syntax errors, only require fixing the application logic. Other errors, such as memory errors, often require tuning of the EMR cluster. For the remainder of this blog, we will focus on how to debug memory errors. In the next blog we will go into more detail on how to fine-tune EMR step memory settings.

Memory failures

Memory failures become the most common type of failure as usage of EMR scales, but they are also the most difficult to debug. Fortunately, the stderr log file will almost always surface the error properly. Some common memory errors are:

  • Container running out of physical memory:
    Container[pid=31642,containerID=container_1641007063720_0171_01_000009] is running beyond physical memory limits. Current usage: 16.7 GB of 16 GB physical memory used; 22.9 GB of 80 GB virtual memory used. Killing container.
  • Hadoop/Spark application running out of heap memory:
    14/01/15 21:42:28 INFO cluster.ClusterTaskSetManager: Starting task 1.0:32 as TID 35 on executor 7: Salve4.Hadoop (NODE_LOCAL) Uncaught error from thread [spark-akka.actor.default-dispatcher-3] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[spark] java.lang.OutOfMemoryError: Java heap space
  • Container killed with exit code 143:
    Container killed on request. Exit code is 143
    Container exited with a non-zero exit code 143
    Container [pid=19146,containerID=container_1657863401829_0212_01_
    Command exiting with ret '1'
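Since these signatures repeat across clusters, a small helper can flag them quickly. The sketch below (the helper and its names are ours, not part of EMR) scans a downloaded log file's text, such as the stderr pulled in the earlier snippet, for the common memory-error patterns listed above:

import re

# Signatures of the common memory failures listed above
MEMORY_ERROR_PATTERNS = {
    "container over physical memory limit": r"running beyond physical memory limits",
    "JVM out of heap memory": r"java\.lang\.OutOfMemoryError",
    "container killed (exit code 143)": r"exit code(\s+is)? 143",
}

def find_memory_errors(log_text):
    """Return the memory-error categories whose signatures appear in the log text."""
    return [
        label
        for label, pattern in MEMORY_ERROR_PATTERNS.items()
        if re.search(pattern, log_text, re.IGNORECASE)
    ]

# Example: find_memory_errors(stderr_text) -> ["container over physical memory limit", ...]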
Container Logs

Locating container logs

It is important to know what the step was executing at the time of the memory failure, and this is where EMR doesn't always make it easy. The controller/syslog/stderr/stdout log files are summaries written by the EMR monitoring process running on the master node, but the actual memory-intensive workload is executed in multiple containers across the EMR cluster. The summary log files alone are often not sufficient, and we need to locate the container logs to understand why certain containers are struggling with memory usage.

If we expand the step ID, we can see an S3 URI for the log file location. In this example, it is s3://my-emr-internal-testing-log-bucket/emr-logs/j-3G123456789/steps/s-27O4XAVMAZ08M. The controller/syslog/stderr/stdout log files are all located in this directory.

Container logs are grouped together in a different subfolder s3://my-emr-internal-testing-log-bucket/emr-logs/j-3G123456789/containers/<application_id>. Here the <application_id> can be derived from the container ID shown in the stderr output. If we see container_1657863401829_0212_01 in the stderr output, then the <application_id> would be application_1657863401829_0212:
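The translation is mechanical: strip the container_ prefix along with the attempt and container-number suffixes, and prepend application_. Below is a small sketch (bucket, cluster ID, and container ID are taken from this example) that derives the application ID and lists the per-container log folders:

import boto3

def application_id_from_container_id(container_id):
    # container_1657863401829_0212_01_000009 -> application_1657863401829_0212
    parts = container_id.split("_")
    return "application_" + "_".join(parts[1:3])

s3 = boto3.client("s3")
bucket = "my-emr-internal-testing-log-bucket"
app_id = application_id_from_container_id("container_1657863401829_0212_01_000009")
prefix = "emr-logs/j-3G123456789/containers/" + app_id + "/"

# Each container that ran for this application gets its own subfolder
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
    for folder in page.get("CommonPrefixes", []):
        print(folder["Prefix"])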

Application master logs

As described above, the application master is the main container that EMR launches when a step starts; it requests, launches, and monitors the additional containers for the step, and the full set of containers launched to execute the step is called an application. We can see a corresponding subfolder for every container that has been launched under s3://my-emr-internal-testing-log-bucket/emr-logs/j-3G123456789/containers/<application_id>. The first subfolder (in this example container_1657863401829_0212_01_000002) corresponds to the container where the application master was running. We can see two files under the application master container subfolder container_1657863401829_0212_01_000002:

  • stderr — The standard error channel of the application master container.
  • syslog — Describes the execution of all containers in the application.

syslog usually contains a high-level execution history of all containers launched for the application. If a container has completed successfully, we will see log lines like these, which can be spaced far apart:

  • 2022-07-15 05:47:26,335 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1657863401829_0020_01_000002 to attempt_1657863401829_0020_m_000000_0
  • 2022-07-15 05:47:41,365 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1657863401829_0020_m_000000_0 TaskAttempt Transitioned from SUCCESS_FINISHING_CONTAINER to SUCCEEDED

If a container has completed with an error, we will see these log lines next to each other:

  • 2022-07-15 06:36:32,593 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1657863401829_0212_01_000009
  • 2022-07-15 06:36:32,594 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1657863401829_0212_m_000000_1: Container [pid=19146,containerID=container_1657863401829_0212_01_000009] is running beyond physical memory limits. Current usage: 4.1 GB of 4 GB physical memory used; 7.8 GB of 20 GB virtual memory used. Killing container.
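Rather than scrolling through the whole file, we can pull the application master's syslog and print only the diagnostics lines. Here is a minimal sketch, assuming the bucket, application, and container IDs from this example and the gzip-compressed upload EMR normally performs:

import boto3
import gzip

s3 = boto3.client("s3")
bucket = "my-emr-internal-testing-log-bucket"
# syslog of the application master container from the example above
key = ("emr-logs/j-3G123456789/containers/application_1657863401829_0212/"
       "container_1657863401829_0212_01_000002/syslog.gz")

obj = s3.get_object(Bucket=bucket, Key=key)
syslog = gzip.decompress(obj["Body"].read()).decode("utf-8", errors="replace")

# Diagnostics reports point directly at the container that was killed and why
for line in syslog.splitlines():
    if "Diagnostics report" in line or "running beyond physical memory limits" in line:
        print(line)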

Failed container

In our example, container_1657863401829_0212_01_000009 has failed. So the next thing to look up is what this particular container was doing at the time of failure. We can navigate down to the subfolder that contains the logs:

The last few lines from syslog tell us the commands Hadoop was executing:

2022-07-15 06:29:14,377 INFO [main] org.apache.sqoop.mapreduce.db.DBRecordReader: Working on split: 1=1 AND 1=1
2022-07-15 06:29:14,490 INFO [main] org.apache.sqoop.mapreduce.db.DBRecordReader: Executing query: SELECT "C_PRIM_KEY_11784961","C_DATA_7CCA9B4E" FROM "PCC_ITXA"."v_pcc_itxa_spe_map_da_5504807e" WHERE ( 1=1 ) AND ( 1=1 )
2022-07-15 06:35:48,295 INFO [s3n-worker-2] com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream: uploadPart: partNum 1

The last few lines from stdout, which is the output channel of the Hadoop commands:

Jul 15, 2022 6:36:19 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 10,774B for [C_PRIM_KEY_11784961] BINARY: 289 values, 12,145B raw, 10,668B comp, 1 pages, encodings: [RLE, PLAIN, BIT_PACKED]
Jul 15, 2022 6:36:19 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 52,612,603B for [C_DATA_7CCA9B4E] BINARY: 289 values, 116,276,727B raw, 36,761,571B comp, 20 pages, encodings: [RLE, PLAIN, BIT_PACKED]
Jul 15, 2022 6:36:27 AM INFO: parquet.hadoop.InternalParquetRecordWriter: mem size 79,227,168 > 52,428,800: flushing 100 records to disk.
Jul 15, 2022 6:36:27 AM INFO: parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 79,398,528

This indicates that the application's Parquet writer was consuming a lot of memory. We likely need to increase the number of partitions (so each container handles less data) or increase the container memory settings.
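Fine-tuning is the subject of the follow-up blog, but as a rough illustration, overrides like these could be passed when resubmitting the step. The sketch below uses boto3 and command-runner.jar; the Sqoop arguments and the numbers are hypothetical placeholders, not our production settings:

import boto3

emr = boto3.client("emr")

# Hypothetical resubmission: more mappers means smaller splits per container,
# and a larger map-task container gives each split more headroom.
emr.add_job_flow_steps(
    JobFlowId="j-3G123456789",
    Steps=[{
        "Name": "sqoop-import-retuned",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "sqoop", "import",
                "-Dmapreduce.map.memory.mb=8192",    # YARN container size for map tasks
                "-Dmapreduce.map.java.opts=-Xmx6g",  # JVM heap inside the container
                "--num-mappers", "8",                # more partitions -> less data per container
                # ... remaining Sqoop connection/table arguments omitted ...
            ],
        },
    }],
)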

Conclusion

In this blog we have provided a step-by-step guide on how to debug EMR step failures, including common memory failures. In a follow-up blog we will go into more detail on how to fine-tune EMR step settings.