Azure HDInsight Issue With Hive/Python Map-Reduce - python

Running a very simple test example using Azure HDInsight and Hive/Python. Hive does not appear to be loading the Python script.
Hive contains a small test table with a field called 'dob' that I'm trying to transform with a Python script via map-reduce.
The Python script is blank and located at asv:///mapper_test.py. I made it blank because I wanted to first isolate the issue of Hive accessing the script.
Hive Code:
ADD FILE asv:///mapper_test.py;
SELECT
TRANSFORM (dob)
USING 'python asv:///mapper_test.py' AS (dob)
FROM test_table;
Error:
Hive history file=c:\apps\dist\hive-0.9.0\logs/hive_job_log_RD00155DD090CC$_201308202117_1738335083.txt
Logging initialized using configuration in file:/C:/apps/dist/hive-0.9.0/conf/hive-log4j.properties
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201308201542_0025, Tracking URL = http://jobtrackerhost:50030/jobdetails.jsp?jobid=job_201308201542_0025
Kill Command = c:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd job -Dmapred.job.tracker=jobtrackerhost:9010 -kill job_201308201542_0025
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-08-20 21:18:04,911 Stage-1 map = 0%, reduce = 0%
2013-08-20 21:19:05,175 Stage-1 map = 0%, reduce = 0%
2013-08-20 21:19:32,292 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201308201542_0025 with errors
Error during job, obtaining debugging information...
Examining task ID: task_201308201542_0025_m_000002 (and more) from job job_201308201542_0025
Exception in thread "Thread-24" java.lang.RuntimeException: Error while reading from task log url
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getStackTraces(TaskLogProcessor.java:242)
at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:227)
at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:92)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: http://workernode1:50060/tasklog?taskid=attempt_201308201542_0025_m_000000_7&start=-8193
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1616)
at java.net.URL.openStream(URL.java:1035)
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getStackTraces(TaskLogProcessor.java:193)
... 3 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

I had a similar experience while using Pig + Python on Azure HDInsight Version 2.0. One thing I found out was that Python was available only on the head node and not on all the nodes in the cluster. You can see a similar question here.
You can remote login to the head node of the cluster, find the IP of a Task Tracker node from there, and remote login to any of the task tracker nodes to check whether Python is installed on that node.
This issue is fixed in HDInsight Version 2.1 clusters, but Python is still not added to the PATH; you may need to do that yourself.
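If the script does turn out to be reachable, it is also worth checking how it is referenced: once a file has been added with ADD FILE, Hive copies it into each task's working directory, so the TRANSFORM clause is usually written with just the basename ('python mapper_test.py') rather than the full asv:/// URI. For the mapper itself, below is a minimal sketch of what Hive streaming expects; it simply echoes the single dob column it receives on stdin (the field handling is an assumption, since the original script was intentionally blank):
#!/usr/bin/env python
# Minimal Hive TRANSFORM mapper sketch: Hive streams rows to stdin as
# tab-delimited text and reads tab-delimited rows back from stdout.
import sys

for line in sys.stdin:
    dob = line.strip().split('\t')[0]  # only the dob column is streamed in
    print(dob)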

Related

Real Time Cluster Log Delivery in a Databricks Cluster

I have some Python code that I am running on a Databricks Job Cluster. My Python code will be generating a whole bunch of logs and I want to be able to monitor these logs in real time (or near real time), say through something like a dashboard.
So far, I have configured my cluster log delivery location, and my logs are delivered to the specified destination every 5 minutes.
This is explained here,
https://learn.microsoft.com/en-us/azure/databricks/clusters/configure
Here is an extract from the same article,
When you create a cluster, you can specify a location to deliver the logs for the Spark driver node, worker nodes, and events. Logs are delivered every five minutes to your chosen destination. When a cluster is terminated, Azure Databricks guarantees to deliver all logs generated up until the cluster was terminated.
Is there some way I can have these logs delivered somewhere in near real time, rather than every 5 minutes? It does not have to be through the same method either, I am open to other possibilities.
By default the delivery interval is 5 minutes and, unfortunately, it cannot be changed; the official documentation gives no way to configure it.
However, you can raise a feature request here.
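If near-real-time visibility matters more than the delivery mechanism, one option is to push log records from the job code itself instead of waiting for cluster log delivery. Below is a minimal sketch using Python's standard logging.handlers.HTTPHandler; the collector endpoint (logs.example.com/ingest) is purely hypothetical and could be any HTTP listener feeding your dashboard (an Azure Function, a small web app, etc.).
import logging
import logging.handlers

logger = logging.getLogger("job")
logger.setLevel(logging.INFO)

# Each record is POSTed to the collector as soon as it is emitted,
# independently of the 5-minute cluster log delivery cycle.
handler = logging.handlers.HTTPHandler(
    host="logs.example.com",
    url="/ingest",
    method="POST",
)
logger.addHandler(handler)

logger.info("job started")  # visible at the collector almost immediately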

Azure Batch JobPreparationTask fails with "UserError"

I am trying to mount a File Share (not Blob storage) during the JobPreparationTask. My node OS is Ubuntu 16.04.
To do this, I am doing the following:
import azure.batch as batch
import azure.batch.models as batchmodels

job_user = batchmodels.AutoUserSpecification(
    scope=batchmodels.AutoUserScope.pool,
    elevation_level=batchmodels.ElevationLevel.admin)
start_task = batch.models.JobPreparationTask(
    command_line=start_commands,
    user_identity=batchmodels.UserIdentity(auto_user=job_user))
end_task = batch.models.JobReleaseTask(
    command_line=end_commands,
    user_identity=batchmodels.UserIdentity(auto_user=job_user))
job = batch.models.JobAddParameter(
    job_id,
    batch.models.PoolInformation(pool_id=pool_id),
    job_preparation_task=start_task, job_release_task=end_task)
My start_commands and end_commands are fine, but there is something wrong with the user permissions...
I get no output in the stderr.txt or stdout.txt files.
I do not see any logs whatsoever (where are they?). All I am able to find is a message showing this:
Exit code: 1
Retry count: 0
Failure info:
  Category: UserError
  Code: FailureExitCode
  Message: The task exited with an exit code representing a failure
  Details:
    Message: The task exited with an exit code representing a failure
Very detailed error message!
Anyway, I have also tried changing AutoUserScope.pool to AutoUserScope.task, but there is no change.
Anyone have any ideas?
I had this issue, which was frustrating because I couldn't get any logs from my application.
What I ended up doing was RDP'ing into the node my job ran on, going to %AZ_BATCH_TASK_WORKING_DIR% (as described in Azure Batch Compute Environment Variables) and then checking the stdout.txt and stderr.txt for my job.
The error was that I had formulated my CloudTask's command line incorrectly, so it could not launch my application in the first place.
To RDP into your machine, in Azure Portal:
Batch Account
Pools (select your pool)
Nodes
Select the node that ran your job
Select "Connect" link at the top.

Getting the running jobs on an LSF cluster using python and PlatformLSF

I'm trying to write a simple task manager in python that will be used to run a large number of jobs in an LSF cluster. I'm stuck trying to determine (within a python script) the number of running jobs for a given user. On the command line this would come from the command bjobs.
IBM makes available a python wrapper to the LSF C API. Working with one of their examples and some documentation from a copy of the C API that I found online, I have been able to cobble together the following script.
from pythonlsf import lsf

# Initialise the LSF batch library for this application.
lsf.lsb_init("test")

# Build a one-element string array holding the user name to query.
userArr = lsf.new_stringArray(1)
lsf.stringArray_setitem(userArr, 0, 'my_username')

# lsb_userinfo() also needs a pointer to the number of users being queried.
intp_num_users = lsf.new_intp()
lsf.intp_assign(intp_num_users, 1)

user_info = lsf.lsb_userinfo(userArr, intp_num_users)
The variable user_info has attributes 'numPEND', 'numRESERVE', 'numRUN', and 'numStartJobs', but all of these are 0. They remain zero even when bjobs reports a running job.
Can anyone tell me what I might be doing wrong in the code snippet above? I've read through both the C and Python documentation several times but can't find an error.
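One pragmatic fallback while debugging the wrapper is to shell out to bjobs and count the running jobs directly. A rough sketch is below; it assumes bjobs is on the PATH, that your LSF version supports the -noheader flag, and the username is a placeholder:
import subprocess

def count_running_jobs(user="my_username"):
    # Count the RUN-state jobs that bjobs reports for the given user.
    out = subprocess.run(
        ["bjobs", "-u", user, "-r", "-noheader"],
        capture_output=True, text=True, check=False,
    ).stdout
    return sum(1 for line in out.splitlines() if line.strip())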

Python: Increasing timeout value in EMR using yelps MRJOB

I am using Yelp's mrjob for writing some of my MapReduce programs, and I am running them on EMR. My program has reducer code which takes a long time to execute. I am noticing that, because of the default timeout period in EMR, I am getting this error:
Task attempt_201301171501_0001_r_000000_0 failed to report status for 600 seconds.Killing!
I want a way to increase the timeout in EMR. I read the official mrjob documentation about this but was not able to understand the procedure. Can someone suggest a way to solve this issue?
I've dealt with a similar issue with EMR in the past; the property you are looking for is mapred.task.timeout, which corresponds to the number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string.
With MRJob, you could add the following option:
--jobconf mapred.task.timeout=1800000
EDIT: It appears that some EMR AMIs do not support setting parameters like the timeout with jobconf at run time. Instead, you must use a bootstrap-time configuration like this:
--bootstrap-action="s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.task.timeout=1800000"
I would still try the first one to start with and see if you can get it to work, otherwise try the bootstrap action.
To use any of these parameters, just create your job extending MRJob; this class has a jobconf method that reads your --jobconf parameters, so you can specify them as regular options on the command line:
python job.py --num-ec2-instances 42 --python-archive t.tar.gz -r emr --jobconf mapred.task.timeout=1800000 /path/to/input.txt
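If you would rather not pass the flag on every invocation, mrjob also lets you bake job configuration into the job class via the JOBCONF attribute. A minimal sketch (the mapper and reducer bodies are placeholders):
from mrjob.job import MRJob

class MyJob(MRJob):
    # Applied to every run of this job, equivalent to --jobconf on the command line.
    JOBCONF = {'mapred.task.timeout': '1800000'}  # 30 minutes, in milliseconds

    def mapper(self, _, line):
        yield line, 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MyJob.run()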

Amazon Elastic MapReduce - SIGTERM

I have an EMR streaming job (Python) which normally works fine (e.g. 10 machines processing 200 inputs). However, when I run it against large data sets (12 machines processing a total of 6000 inputs, at about 20 seconds per input), after 2.5 hours of crunching I get the following error:
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 143
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:372)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:586)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
If I am reading this correctly, the subprocess failed with code 143 because someone sent a SIGTERM signal to the streaming job.
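(That 143 follows the usual shell convention of 128 plus the signal number, and SIGTERM is signal 15; a quick sanity check in Python:)
import signal

print(128 + signal.SIGTERM)  # prints 143, i.e. the process was killed by SIGTERM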
Is my understanding correct? If so: When would the EMR infrastructure send a SIGTERM?
I figured out what was happening, so here's some information if anyone else experiences similar problems.
The key to me was to look at the "jobtracker" logs. These live in your task's logs/ folder on S3, under:
<logs folder>/daemons/<id of node running jobtracker>/hadoop-hadoop-jobtracker-XXX.log.
There were multiple lines of the following kind:
2012-08-21 08:07:13,830 INFO org.apache.hadoop.mapred.TaskInProgress
(IPC Server handler 29 on 9001): Error from attempt_201208210612_0001_m_000015_0:
Task attempt_201208210612_0001_m_000015_0 failed to report status
for 601 seconds. Killing!
So my code was timing out and being killed: it was going beyond the 10-minute task timeout. Going 10 minutes without any I/O was certainly not expected (I would typically do an I/O every 20 seconds).
I then discovered this article:
http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code
"In one of our science projects, we have a few Hadoop Streaming jobs that run over ruby and rely on libxml to parse documents. This creates a perfect storm of badness – the web is full of really bad html and libxml occasionally goes into infinite loops or outright segfaults. On some documents, it always segfaults."
It nailed it. I must be experiencing one of these "libxml going into infinite loop" situations (I am using libxml heavily -- only with Python, not Ruby).
The final step for me was to trigger skip mode (instructions here: Setting hadoop parameters with boto?).
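Independently of skip mode, a watchdog around the parsing call can keep a single pathological document from stalling the whole task (it helps with the infinite-loop case, though not with outright segfaults). A rough sketch using the stdlib signal module; parse_document() is a hypothetical stand-in for the libxml-based parsing:
import signal

class ParseTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise ParseTimeout()

def parse_with_timeout(doc, seconds=60):
    # Parse doc, but give up after `seconds` so one bad document cannot
    # hang the streaming task past Hadoop's task timeout.
    signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        return parse_document(doc)  # hypothetical stand-in for the libxml parsing
    except ParseTimeout:
        return None                 # treat the record as bad and move on
    finally:
        signal.alarm(0)             # always clear the pending alarm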
I ran into this same output from Amazon EMR ("subprocess failed with code 143"). My streaming job was using PHP curl to send data to a server whose security group did not include the MapReduce job servers, so the reducer was timing out and being killed. Ideally I'd like to add my jobs to the same security group, but I opted to simply add a URL security token parameter in front of my API.
