At Clumio, EMR has become one of our most important tools for processing data, and we use it widely to build and deliver ETL, analytics, and ML applications. EMR is relatively easy to set up and comes with many built-in applications such as Spark, Hive, and Sqoop. However, EMR requires careful planning and design to maintain and debug. This blog aims to provide tips on debugging EMR step failures, in particular memory failures. In a follow-up blog we will go into more detail on how to fine-tune EMR step settings.
For this blog, we assume the readers are already familiar with the EMR architecture. If you are new to EMR, please go through the “Getting Started” tutorial. Some key concepts we will refer to in this blog are:
The most common way to submit jobs to an EMR cluster is through steps, where each step is a unit of work that contains instructions to manipulate data for processing. To see the steps that have run on a cluster, simply click on the Steps tab.
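To make this concrete, below is a minimal sketch of submitting a step programmatically with boto3; the cluster ID, step name, and script location are hypothetical placeholders, and clicking through the console achieves the same result.

import boto3

# Hypothetical cluster ID and job script location; replace with your own.
CLUSTER_ID = "j-3G123456789"
SCRIPT_URI = "s3://my-bucket/jobs/example_etl.py"

emr = boto3.client("emr")
response = emr.add_job_flow_steps(
    JobFlowId=CLUSTER_ID,
    Steps=[
        {
            "Name": "example-etl-step",
            # Keep the cluster alive if the step fails so we can debug its logs.
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", SCRIPT_URI],
            },
        }
    ],
)
print("Submitted step:", response["StepIds"][0])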
Amazon EMR step logs are written by the monitoring process running on the master node and contain important summaries for the step (see the AWS documentation on viewing log files: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html).
If a step has finished successfully without any errors, stderr will contain the message Command exiting with ret '0'. If the step has any failures, stderr will contain the message Command exiting with ret '1'. In addition, the stderr log file usually contains the last error that the step ran into. For example, the following error message indicates a possible syntax error in Spark SQL:
org.apache.spark.sql.AnalysisException: cannot recognize input near ''timestamp'' ')' ')' in primitive type specification; line 1 pos 112
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:318)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.atomExpression(HiveParser_IdentifiersParser.java:6716)
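Whether a step succeeded or failed can also be checked programmatically rather than from the console. Below is a small sketch using boto3's describe_step; the cluster and step IDs are the example values used throughout this blog, and FailureDetails is only populated for failed steps.

import boto3

# Placeholder IDs; use the cluster and step you are debugging.
CLUSTER_ID = "j-3G123456789"
STEP_ID = "s-27O4XAVMAZ08M"

emr = boto3.client("emr")
status = emr.describe_step(ClusterId=CLUSTER_ID, StepId=STEP_ID)["Step"]["Status"]

print("State:", status["State"])  # e.g. PENDING, RUNNING, COMPLETED, FAILED
if status["State"] == "FAILED":
    # FailureDetails (when present) summarizes the error and where the logs live.
    details = status.get("FailureDetails", {})
    print("Reason: ", details.get("Reason"))
    print("Message:", details.get("Message"))
    print("Logs:   ", details.get("LogFile"))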
Some step errors, such as syntax errors, require fixing only the application logic. Other errors, such as memory errors, often require tuning the EMR cluster. For the remainder of this blog, we will focus on how to debug memory errors. In the next blog we will go into more detail on how to fine-tune EMR step memory settings.
Memory failures become the most common failures as EMR usage scales, but they are also the most difficult to debug. Fortunately, the stderr log file will almost always surface these errors. Some common memory errors are:
Container[pid=31642,containerID=container_1641007063720_0171_01_000009] is running beyond physical memory limits. Current usage: 16.7 GB of 16 GB physical memory used; 22.9 GB of 80 GB virtual memory used. Killing container.
14/01/15 21:42:28 INFO cluster.ClusterTaskSetManager: Starting task 1.0:32 as TID 35 on executor 7: Salve4.Hadoop (NODE_LOCAL)
Uncaught error from thread [spark-akka.actor.default-dispatcher-3] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[spark]
java.lang.OutOfMemoryError: Java heap space
…
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Container [pid=19146,containerID=container_1657863401829_0212_01_000009] is running beyond physical memory limits. …
Command exiting with ret '1'
It is important to know what the step was executing at the time of the memory failure, and this is where EMR doesn’t always make it easy for us. The controller/syslog/stderr/stdout log files are summaries written by the EMR monitoring process running on the master node, but the actual memory-intensive workload is executed in multiple containers across the EMR cluster. The summary log files alone are often not sufficient, and we need to locate the container logs to understand why certain containers are struggling with memory usage.
If we expand the step ID, we can see an S3 URI for the log file location. In this example, it is s3://my-emr-internal-testing-log-bucket/emr-logs/j-3G123456789/steps/s-27O4XAVMAZ08M. The controller/syslog/stderr/stdout log files are all located in this directory.
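As a convenience, these step logs can be pulled straight from that S3 location. The sketch below assumes the bucket and prefix shown above and that EMR has pushed the logs to S3 in gzipped form (e.g. stderr.gz), which can take a few minutes after the step finishes.

import gzip
import boto3

# The step log location shown above (bucket, prefix and IDs are examples).
BUCKET = "my-emr-internal-testing-log-bucket"
STEP_PREFIX = "emr-logs/j-3G123456789/steps/s-27O4XAVMAZ08M/"

s3 = boto3.client("s3")

# List controller/syslog/stderr/stdout for the step.
for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=STEP_PREFIX).get("Contents", []):
    print(obj["Key"])

# Then print the tail of stderr, where the final error usually shows up.
body = s3.get_object(Bucket=BUCKET, Key=STEP_PREFIX + "stderr.gz")["Body"].read()
stderr_text = gzip.decompress(body).decode("utf-8", errors="replace")
print("\n".join(stderr_text.splitlines()[-20:]))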
Container logs are grouped together in a different subfolder: s3://my-emr-internal-testing-log-bucket/emr-logs/j-3G123456789/containers/<application_id>. Here the <application_id> can be derived from the container ID shown in the stderr output. If we see container_1657863401829_0212_01 in the stderr output, then the <application_id> would be application_1657863401829_0212.
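This mapping is just string manipulation on the container ID, so it is easy to script. A small helper (hypothetical, not part of any AWS SDK) is sketched below.

def application_id_from_container_id(container_id: str) -> str:
    # A YARN container ID looks like:
    #   container_<cluster timestamp>_<app sequence>_<attempt>_<container number>
    # The application ID keeps only the first two numeric fields.
    parts = container_id.split("_")
    return "application_" + "_".join(parts[1:3])

print(application_id_from_container_id("container_1657863401829_0212_01_000009"))
# -> application_1657863401829_0212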
When a step starts, EMR launches a main container that is responsible for requesting, launching, and monitoring additional containers for the step. This main container is called the application master. The set of containers launched to execute the step is called an application. We can see a corresponding subfolder under s3://my-emr-internal-testing-log-bucket/emr-logs/j-3G123456789/containers/<application_id> for every container that has been launched. The first subfolder (in this example container_1657863401829_0212_01_000002) corresponds to the container where the application master was running. We can see two files under the application master container subfolder container_1657863401829_0212_01_000002.
The syslog file usually contains a high-level execution history of all containers launched for the application. If a container has completed successfully, we will see log lines like the following, which can be spaced far apart:
2022-07-15 05:47:26,335 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1657863401829_0020_01_000002 to attempt_1657863401829_0020_m_000000_0
2022-07-15 05:47:41,365 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1657863401829_0020_m_000000_0 TaskAttempt Transitioned from SUCCESS_FINISHING_CONTAINER to SUCCEEDED
If a container has completed with an error, we will see log lines like the following next to each other:
2022-07-15 06:36:32,593 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1657863401829_0212_01_000009
2022-07-15 06:36:32,594 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1657863401829_0212_m_000000_1: Container [pid=19146,containerID=container_1657863401829_0212_01_000009] is running beyond physical memory limits. Current usage: 4.1 GB of 4 GB physical memory used; 7.8 GB of 20 GB virtual memory used. Killing container.
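If the application master’s syslog is long, scanning it for these diagnostics by hand gets tedious. The sketch below downloads that syslog from S3 (assuming the gzipped container log layout described above; the bucket and IDs are the examples from this blog) and prints any lines reporting containers that ran beyond their physical memory limits.

import gzip
import boto3

# Application master syslog location, following the container log layout above.
BUCKET = "my-emr-internal-testing-log-bucket"
AM_SYSLOG_KEY = (
    "emr-logs/j-3G123456789/containers/application_1657863401829_0212/"
    "container_1657863401829_0212_01_000002/syslog.gz"
)

s3 = boto3.client("s3")
body = s3.get_object(Bucket=BUCKET, Key=AM_SYSLOG_KEY)["Body"].read()
syslog_text = gzip.decompress(body).decode("utf-8", errors="replace")

# Surface the diagnostics so we immediately know which containers were killed.
for line in syslog_text.splitlines():
    if "running beyond physical memory limits" in line:
        print(line)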
In our example, container_1657863401829_0212_01_000009 has failed, so the next thing to look up is what this particular container was doing at the time of failure. We can navigate down to the subfolder that contains its logs:
The last few lines from this container’s syslog tell us the commands Hadoop was executing:
2022-07-15 06:29:14,377 INFO [main] org.apache.sqoop.mapreduce.db.DBRecordReader: Working on split: 1=1 AND 1=1
2022-07-15 06:29:14,490 INFO [main] org.apache.sqoop.mapreduce.db.DBRecordReader: Executing query: SELECT "C_PRIM_KEY_11784961","C_DATA_7CCA9B4E" FROM "PCC_ITXA"."v_pcc_itxa_spe_map_da_5504807e" WHERE ( 1=1 ) AND ( 1=1 )
2022-07-15 06:35:48,295 INFO [s3n-worker-2] com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream: uploadPart: partNum 1
The last few lines from stdout, which is the output channel of the Hadoop commands, show the following:
Jul 15, 2022 6:36:19 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 10,774B for [C_PRIM_KEY_11784961] BINARY: 289 values, 12,145B raw, 10,668B comp, 1 pages, encodings: [RLE, PLAIN, BIT_PACKED]
Jul 15, 2022 6:36:19 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 52,612,603B for [C_DATA_7CCA9B4E] BINARY: 289 values, 116,276,727B raw, 36,761,571B comp, 20 pages, encodings: [RLE, PLAIN, BIT_PACKED]
Jul 15, 2022 6:36:27 AM INFO: parquet.hadoop.InternalParquetRecordWriter: mem size 79,227,168 > 52,428,800: flushing 100 records to disk.
Jul 15, 2022 6:36:27 AM INFO: parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 79,398,528
This indicates that the application’s parquet writer was consuming a lot of memory. We likely need to increase the number of partitions or increase the container memory sizes.
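As a rough illustration of the first remedy (the failing step here was a Sqoop import, but the same idea applies to Spark jobs), repartitioning before a write-heavy stage spreads the data across more, smaller tasks so each container holds less in memory at once. The paths and partition count below are hypothetical; container memory settings themselves are the subject of the follow-up blog.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-example").getOrCreate()

# Hypothetical input/output locations and partition count.
df = spark.read.parquet("s3://my-bucket/input/")

# More partitions means a smaller per-task (and per-container) memory footprint,
# at the cost of more tasks and more output files.
df.repartition(200).write.mode("overwrite").parquet("s3://my-bucket/output/")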
In this blog we have provided a step-by-step guide on how to debug EMR step failures, including common memory failures. In a follow-up blog we will go into more detail on how to fine-tune EMR step settings.