Amazon Elastic MapReduce - SIGTERM - python

I have an EMR streaming job (Python) which normally works fine (e.g. 10 machines processing 200 inputs). However, when I run it against large data sets (12 machines processing a total of 6000 inputs, at about 20 seconds per input), after 2.5 hours of crunching I get the following error:
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 143
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:372)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:586)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
If I am reading this correctly, the subprocess failed with code 143 because something sent a SIGTERM signal to the streaming job (143 = 128 + 15, and 15 is SIGTERM's signal number).
Is my understanding correct? If so: When would the EMR infrastructure send a SIGTERM?
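One way to confirm that the streaming script really is receiving a SIGTERM is to install a signal handler that logs to stderr before exiting. This is a generic Python sketch, not anything EMR-specific; the log message is just an illustration:

```python
import signal
import sys

def log_sigterm(signum, frame):
    # Streaming tasks inherit stderr, so this line ends up in the task logs.
    sys.stderr.write("streaming script received SIGTERM, exiting\n")
    sys.exit(143)  # mirror the exit code Hadoop reports

# Install the handler at the top of the mapper/reducer script.
signal.signal(signal.SIGTERM, log_sigterm)
```

If the "failed with code 143" message is accompanied by this stderr line in the task attempt logs, the process was indeed terminated externally rather than crashing on its own.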

I figured out what was happening, so here's some information if anyone else experiences similar problems.
The key for me was to look at the "jobtracker" logs. These live in your job's logs/ folder on S3, under:
<logs folder>/daemons/<id of node running jobtracker>/hadoop-hadoop-jobtracker-XXX.log.
There were multiple lines of the following kind:
2012-08-21 08:07:13,830 INFO org.apache.hadoop.mapred.TaskInProgress
(IPC Server handler 29 on 9001): Error from attempt_201208210612_0001_m_000015_0:
Task attempt_201208210612_0001_m_000015_0 failed to report status
for 601 seconds. Killing!
So my code was timing out and being killed: it was exceeding the 10-minute task timeout. For 10 minutes I wasn't doing any I/O, which was certainly not expected (I would typically do an I/O every 20 seconds).
I then discovered this article:
http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code
"In one of our science projects, we have a few Hadoop Streaming jobs that run over ruby and rely on libxml to parse documents. This creates a perfect storm of badness – the web is full of really bad html and libxml occasionally goes into infinite loops or outright segfaults. On some documents, it always segfaults."
That nailed it. I must have been hitting one of these "libxml going into an infinite loop" situations (I am using libxml heavily -- only with Python, not Ruby).
The final step for me was to trigger skip mode (instructions here: Setting hadoop parameters with boto?).
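Besides skip mode, another common workaround for the 600-second timeout is to have the streaming script emit periodic status reports: Hadoop streaming treats stderr lines of the form `reporter:status:<message>` as progress updates, which resets the task-timeout clock. A sketch of the idea; the per-record `process_record` function is a stand-in for your own work:

```python
import sys
import time

def report_status(message):
    # Hadoop streaming interprets this stderr line as a progress update,
    # which resets the task-timeout clock (mapred.task.timeout).
    sys.stderr.write("reporter:status:%s\n" % message)
    sys.stderr.flush()

def run_mapper(process_record, min_interval=60):
    # process_record is the caller's per-record function (hypothetical here).
    last_report = time.time()
    for line in sys.stdin:
        print(process_record(line.rstrip("\n")))
        if time.time() - last_report > min_interval:
            report_status("still processing")
            last_report = time.time()
```

Note that this only helps if the per-record work actually returns control to the loop; it won't save you from a genuine infinite loop inside libxml, which is what skip mode is for.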

I ran into this output from Amazon EMR ("subprocess failed with code 143"). My streaming job was using PHP curl to send data to a server whose security group did not include the MapReduce job servers. As a result the reducer was timing out and being killed. Ideally I'd have liked to add my jobs to the same security group, but I opted to simply add a URL security-token parameter in front of my API.

Related

How to instrument a python process which crashes after ~5 days without log entries

I am running a multi-process (and multi-threaded) python script on debian linux. One of the processes repeatedly crashes after 5 or 6 days. It is always the same, unique workload on the process that crashes. There are no entries in syslog about the crash - the process simply disappears silently. It also behaves completely normally and produces normal results, then suddenly stops.
How can I instrument the rogue process? Increasing the log level would produce large amounts of logs, so that's not my preferred option.
I used good old log analysis to determine what happens when the process fails:

1. increased the log level of the rogue process to INFO after 4 days
2. monitored the application until the rogue process failed
3. pin-pointed the time of the failure in syslog
4. analysed syslog around that time

I found the following two entries at that time; the first is the last entry made by the rogue process (just before it fails), and the second points to the underlying error.
Aug 10 08:30:13 rpi6 python[16293]: 2021-08-10T08:30:13.045 WARNING w1m::pid 16325, tid 16415, taking reading from sensors with map {'000005ccbe8a': ['t-top'], '000005cc8eba': ['t-mid'], '00000676e5c3': ['t
Aug 10 08:30:14 rpi6 python[16293]: Too many open files (bundled/zeromq/src/ipc_listener.cpp:327)
In this case the problem is in the pyzmq bindings or the zeromq library, so I'll open a ticket with them.
Hope this helps someone in the future.
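For this particular failure mode ("Too many open files"), one cheap way to instrument the process without raising the log level is to periodically log the file-descriptor count and compare it against the soft limit. A Linux-specific sketch; the function names are my own:

```python
import os
import resource

def open_fd_count():
    # Linux-specific: each entry in /proc/self/fd is an open descriptor.
    return len(os.listdir("/proc/self/fd"))

def fd_headroom():
    # How many more descriptors the process can open before hitting
    # the soft RLIMIT_NOFILE limit.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft - open_fd_count()
```

Logging `open_fd_count()` once a minute would have shown a steadily climbing number long before the crash, pointing at the descriptor leak.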

pythonw.exe creates network activity when running any script

When I run any python script that doesn't even contain any code or imports that could access the internet in any way, I get 2 pythonw.exe processes pop up in my resource monitor under network activity. One of them is always sending more than receiving, while the other has the same activity but with the amount of sending vs receiving reversed. The amount of overall activity is dependent on the file size, regardless of how many lines are commented out. Even a blank .py document will create network activity of about 200 kb/s. The activity drops from its peak, which is as high as 15,000 kb/s for a file with 10,000 lines, to around zero after around 20 seconds, and then the processes quit on their own. The actual script has finished running long before the network processes stop.
Because the activity is dependent on file size, I'm suspicious that every time I run a python script, the whole thing is being transmitted to a server somewhere else in the world.
Is this something that could be built into python, a virus that's infecting my computer, or just something that python is supposed to do, and it's innocent activity?
If anyone doesn't have an answer but could check to see if this activity affects their own installation of python, that would be great. Thanks!
EDIT:
Peter Wood, to start the process just run any python script from the editor; it runs on its own, at least for me. I'm on 2.7.8.
Robert B, I think you may be right, but why would the communication continue after the script has finished running?

Error with mpi4py after a while of spawning, how to debug

This is similar to a few questions on the internet, but this code seems to work for a while before returning an error instead of failing instantly, which suggests to me it is maybe not just a host-file error?
I am running code that spawns multiple MPI processes, each of which then runs a loop in which it sends some data with bcast and scatter, then gathers data back from those processes. This runs the algorithm and saves the data. It then disconnects from the spawned comm, and creates another set of spawns on the next loop iteration. This works for a few minutes, then after around 300 files it will spit this out:
[T7810:10898] [[50329,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../orte/mca/plm/base/plm_base_launch_support.c at line 758
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error.
More information may be available above.
I am testing this on a local machine (a single node), while the end deployment will have multiple nodes that each spawn their own MPI processes within that node. I am trying to figure out whether this is an issue with testing the multiple spawns on my local machine that will work fine on the HPC, or a more serious error.
How can I debug this? Is there a way to be printing out what MPI is trying to do during, or monitor MPI, such as a verbose mode?
Since mpi4py is so close to MPI (logically, if not in terms of lines of code), one way to debug this is to write the C version of your program and see if the problem persists. When you report this bug to OpenMPI, they are going to want a small C test case anyway.

Python socket program with shell script?

I have two machines connected by a switch. I have a popular server application which we can call "SXC_SERVER" on machine A and I interrogate the "SXC_SERVER" with the corresponding application from machine B, which I'll call "SXC_CLIENT". What I am trying to do is two-fold:
firstly, gain the traffic flow of SXC_SERVER and SXC_CLIENT interaction through tcpdump. The interaction between the two is a simple GET and RESPONSE, but I require the traffic traces.
secondly, I am wanting to log the Resident Set Size (RSS) usage of the SXC_SERVER process during each interaction/iteration
Moreover, I don't need just one traffic trace of the communication and one memory-usage log of the SXC_SERVER process -- otherwise I wouldn't be writing this, because I could go away and do that in ten minutes. In fact I am aiming to do very many runs! But for simplicity let's say here that I want to do 10.
Since this will be very labor intensive as it will require me to be at both machines stopping and starting all of the SCX_CLIENT-to-SXC_SERVER interrogation, the tcpdump traffic capture, and the RSS memory usage of SXC_SERVER logging I want to write an automation script.
But! I am not a programmer, or software guy...(darn)
However, that said, I can imagine a separate client/server program that oversees this automation, which we can call AUTO_SERVER and AUTO_CLIENT. My thought is that machine B would run AUTO_CLIENT and machine A would run AUTO_SERVER. The aim of both is to facilitate the automation, i.e. the stopping and starting of the tcpdump, and the memory logging on machine A of the SXC_SERVER process before machine B queries SXC_SERVER with SXC_CLIENT (if you follow me!).
Effectively after one run of the SXC_SERVER-to-SXC_CLIENT GET/RESPONSE interaction I'll end up with:
one traffic capture *.pcap file called n1.pcap
and one memory log dump (of the RSS associated to the process) called n1.csv.
I am not a programmer or software guy but I can see a rough method (to the best of my ability) to achieve this, as follows:
Machine A: AUTO_SERVER
BEGIN:
msgReceived = open socket(listen on port *n*)
DO
1. wait for machine B to tell me when to start watch (as in the program) to log RSS memory usage of the SXC_SERVER process, using the hardcoded command:
watch -n 0.1 'ps -p $(pgrep -d"," -x snmpd) -o rss= | awk '\''{ i += $1 } END { print i }'\'' >> ~/Desktop/mem_logs/mem_i.csv
UNTIL (msgReceived == "FINISH")
quit
END.
Machine B: AUTO_CLIENT
BEGIN:
open socket(new)
for i in 10, do
1. locally start tcpdump with a hardcoded tcpdump command, with a relevant filter to only capture the SXC_SERVER-to-SXC_CLIENT traffic, and set the output flag to capture all traffic to a PCAP file called n*i*.pcap, where *i* is the integer of the current for loop, saving the file in the folder "~/Desktop/test_captures/".
2. Send the GET request to SXC_SERVER
3. wait for the RESPONSE reply from SXC_SERVER
4. after receiving the reply, tell machine A to stop the watch command
i++
5. send the string "FINISH" to machine A.
END.
As you can see, I assume this would be achieved by the use of a separate, small client/server-like program (which here I've called AUTO_SERVER and AUTO_CLIENT) on both machines. The really rough pseudo-code design should be self-explanatory.
I have found a small client/server socket program located here: http://www.velvetcache.org/2010/06/14/python-unix-sockets which I would think may be suitable if I edit it, but I am not sure how exactly I can feasibly achieve this. Which is where you may be able to provide some assistance.
Can Python do this automation?
Can it be done with a single bash script?
Do you think I am on the right path with this?
Or have you any helpful suggestions?
Regards.
You can use Python for this kind of thing, but I would strongly recommend using SSH for the bulk of the work (rather than coding the connection stuff yourself), and then using either a bash script or Python script to launch the tcpdump etc. processes.
Your question, however, is a bit too open-ended for stackoverflow - it sounds like you are asking someone to write this program for you, rather than for help with a specific problem.
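For the memory-logging half specifically, a small Python function can replace the watch/ps pipeline: sample the RSS of a process and append it to a CSV. This is a Linux-specific sketch that reads /proc directly instead of shelling out to ps; the function names and file paths are my own, not from the question:

```python
import csv
import time

def rss_kb(pid):
    # Read VmRSS from /proc on Linux; the value is reported in kB.
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0  # process has no resident pages (e.g. kernel thread)

def log_rss(pid, path, interval=0.1, duration=1.0):
    # Append (timestamp, rss_kb) rows to a CSV until duration elapses.
    end = time.time() + duration
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        while time.time() < end:
            writer.writerow([time.time(), rss_kb(pid)])
            time.sleep(interval)
```

The tcpdump start/stop and the "FINISH" coordination are exactly the sort of thing SSH handles well: run `ssh machineA 'python log_rss_script.py'` from machine B before each SXC_CLIENT iteration, rather than writing your own socket protocol.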

Recommendation on how to write a good python wrapper LSF

I am creating a python wrapper script and was wondering what'd be a good way to create it.
I want to run code serially. For example:
Step 1.
Run same program (in parallel - the parallelization is easy because I work with an LSF system so I just submit three different jobs).
I run the program in parallel, and each run takes one fin.txt and outputs one fout.txt, i.e., the three runs take the three input files f1in.txt, f2in.txt, f3in.txt and produce three output files f1out.txt, f2out.txt, f3out.txt.
(in the LSF system) When each run of the program is completed successfully, it produces a log file output, f1log.out, f2log.out, f3log.out.
The output log files are of this form, i.e., f1log.out would look something like this if the run succeeds:
------------------------------------------------------------
# LSBATCH: User input
------------------------------------------------------------
Successfully completed.
Resource usage summary:
CPU time : 86.20 sec.
Max Memory : 103 MB
Max Swap : 881 MB
Max Processes : 4
Max Threads : 5
The output (if any) is above this job summary.
Thus, I'd like my wrapper to check (every 5 min or so) for each run (1,2,3) if the log file has been created, and if it has been created I'd like the wrapper to check if it was successfully completed (aka, if the string Successfully completed appears in the log file).
Also, if one of the runs finishes and produces a log file that was not successfully completed, I'd like my wrapper to end and report that run k (k = 1, 2, 3) was not completed.
After that,
Step 2. If all three runs completed successfully, I would run another program that takes those three files as input; otherwise I'd print an error.
Basically in my question I am looking for two things:
Does it sound like a good way to write a wrapper?
How, in Python, can I check for the existence of a file and search it for a pattern at a regular interval, in a good way?
Note. I am aware that LSF has job dependencies but I find this way to be more clear and easy to work with, though may not be optimal.
I'm a user of an LSF system, and my major gripes are exit handling and cleanup. I think a neat idea would be to send a batch job array that has, for instance: an initialization task, a legwork task, and a cleanup task. LSF could complete all three and send a return code to the waiting head node. A lot of the time LSF works great to send one job or command, but it isn't really set up to handle systematic processing.
Other than that I wish you luck :)
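For the concrete "check a file exists and search it for a pattern every N minutes" part of the question, a simple polling loop is enough. A sketch; the log-file naming and the "Successfully completed" marker follow the question, and everything else (function names, the exception choice) is an assumption:

```python
import os
import time

SUCCESS = "Successfully completed."

def check_log(path):
    # None  -> log not written yet (job still running)
    # True  -> log exists and reports success
    # False -> log exists but the success marker is absent
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return SUCCESS in f.read()

def wait_for_runs(paths, poll_seconds=300):
    # Poll every poll_seconds until every log reports success,
    # raising as soon as any run finishes without the marker.
    pending = set(paths)
    while pending:
        for path in list(pending):
            status = check_log(path)
            if status is None:
                continue  # job still running
            if not status:
                raise RuntimeError("run %s was not completed" % path)
            pending.remove(path)
        if pending:
            time.sleep(poll_seconds)
```

One caveat: this assumes LSF writes the log file in one go at job completion; if a partial log can appear while the job is still running, you'd want to also check that the job has left the queue (e.g. via `bjobs`) before treating a missing marker as failure.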
