Jupyter notebook on EMR not printing output while code is running (PySpark / Python)

I am running a very, very simple script in a Jupyter pyspark notebook, but it is not printing results as it runs, it just spits out the output when it's done. Here is the code:
import time
import sys

for i in range(10):
    print(i)
    time.sleep(1)
This waits 10 seconds and then prints:
0
1
2
3
4
5
6
7
8
9
I would like to print results as they happen. I have tried to flush them using
for i in range(10):
    print(i)
    sys.stdout.flush()
and print(i, flush=True), to no avail. Any suggestions?

It is a buffering issue. You could also run the script with the python -u flag or set the PYTHONUNBUFFERED environment variable. Python uses line buffering if it is run interactively (in a terminal) and block buffering (e.g., ~4 KB buffers) if the output is redirected.
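For a plain script run from a shell (outside the notebook), here is a minimal sketch of those options; the file name is illustrative, and flush=True is shown alongside for comparison:
# buffering_demo.py -- illustrative only; run e.g. as:
#   python -u buffering_demo.py > out.log
#   PYTHONUNBUFFERED=1 python buffering_demo.py > out.log
import time

for i in range(10):
    print(i, flush=True)   # per-call flush; redundant when -u / PYTHONUNBUFFERED is set
    time.sleep(1)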

Depending on what you are doing, if you are running pyspark code and are hoping to see results before the job is complete, this may not work.
You may be running into an issue with how Spark/PySpark runs your code. Spark is designed to divide your task into parts efficiently and distribute those parts to the nodes of your EMR cluster.
This means that the actual work does not happen on the machine where your notebook is running. The main node, where your notebook runs, sends tasks out to all the worker nodes, then collects the results as they come back, and only displays them once the job is complete. This can be troublesome for someone used to debugging plain Python, but it is a big part of what makes PySpark so fast when working with large amounts of data.
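As an illustration of that split, here is a minimal sketch, assuming the usual spark session object an EMR PySpark notebook provides; the other names are made up:
# Illustrative sketch: print() inside a transformation runs on the executors,
# so its output ends up in the executor logs, not in the notebook.
rdd = spark.sparkContext.parallelize(range(10))

def work(x):
    print(x)          # executed on a worker; invisible in the notebook
    return x * x

squared = rdd.map(work)

# To handle results on the driver, pull them back explicitly, e.g. partition
# by partition with toLocalIterator() instead of a single collect().
for value in squared.toLocalIterator():
    print(value)      # executed on the driver process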

Related

TaskManager: Process is Terminated At Exactly 2 Hours

I'm running a python script remotely from a task machine and it creates a process that is supposed to run for 3 hours. However, it seems to terminate prematurely at exactly 2 hours. I don't believe it is a problem with the code, because after the while loop ends I write to a log file, and the log file doesn't show that it exits that while loop successfully. Is there a specific setting on the machine that I need to look into that's interrupting my python process?
Is this perhaps a Scheduled Task? If so, have you checked the task's properties?
On my Windows 7 machine under the "Settings" tab is a checkbox for "Stop the task if it runs longer than:" with a box where you can specify the duration.
One of the suggested durations on my machine is "2 hours."

Stop a python script without losing data

We have been running a script on a partner's computer for 18 hours. We underestimated how long it would take, and now need to turn in the results. Is it possible to stop the script from running but still have access to all the lists we are building?
We need to add additional code to the one we are currently running that will use the lists being populated right now. Is there a way to stop the process, but still use (what has been generated of) the lists in the next portion of code?
My partner was using python interactively.
Update:
We were able to successfully print the results and copy and paste after interrupting the program with control-C.
Well, OP doesn't seem to need an answer anymore, but I'll answer anyway for anyone else coming across this.
While it is true that stopping the program will delete all data from memory, you can still save it: you can attach a debug session and save whatever you need before you kill the process.
Both PyCharm and PyDev support attaching their debugger to a running python application.
See here for an explanation how it works in PyCharm.
Once you've attached the debugger, you can set a breakpoint in your code and the program will stop when it hits that line the next time. Then you can inspect all variables and run some code via the 'Evaluate' feature. This code may save whatever variable you need.
I've tested this with PyCharm 2018.1.1 Community Edition and Python 3.6.4.
In order to do so I ran this code which I saved as test.py
import collections
import time
data = collections.deque(maxlen=100)
i = 0
while True:
    data.append(i % 1000)
    i += 1
    time.sleep(0.001)
via the command python3 test.py from an external Windows PowerShell instance.
Then I opened that file in PyCharm and attached the debugger. I set a breakpoint at the line i += 1 and it halted right there. Then I evaluated the following code fragment:
import json
with open('data.json', 'w') as ofile:
    json.dump(list(data), ofile)
And found all entries from data in the json file data.json.
Follow-up:
This even works in an interactive session! I ran the very same code in a jupyter notebook cell and then attached the debugger to the kernel. Still having test.py open, I set the breakpoint again on the same line as before and the kernel halted. Then I could see all variables from the interactive notebook session.
I don't think so. Stopping the program should also release all of the memory it was using.
edit: See Swenzel's comment for one way of doing it.

running multiple tesseract instances in parallel using multiprocessing not returning any results

I'm writing a python script where I use the multiprocessing library to launch multiple tesseract instances in parallel.
When I use multiple calls to tesseract in sequence in a loop, it works. However, when I try to parallelize the code, everything looks fine but I'm not getting any results (I waited for 10 minutes).
In my code I try to OCR multiple PDF pages after splitting them from the original multi-page PDF.
Here's my code :
import subprocess
from multiprocessing import Pool

def processPage(i):
    nameJPG = "converted-" + str(i) + ".jpg"
    nameHocr = "converted-" + str(i)
    subprocess.check_call(["tesseract", nameJPG, nameHocr, "-l", "eng", "hocr"])
    print "tesseract did the job for the", str(i + 1), "page"

pool1 = Pool(4)
pool1.map(processPage, range(len(pdf.pages)))
As far as I know of pytesseract, it will not allow multiple processes: if you have a quad core and you are running 4 processes simultaneously, tesseract will be choked and you will see high CPU usage and other problems. If you need this for a company and you don't want to go with the Google Vision API, you have to set up multiple servers and do socket programming to request text from the different servers, so that the number of parallel processes stays below what each server can run at the same time (for a quad core it should be 2 or 3).
Otherwise you can hit the Google Vision API; they have a lot of servers and their output is quite good too.
Disabling multithreading in tesseract will also help. It can be done by setting OMP_THREAD_LIMIT=1 in the environment, but you still must not run multiple tesseract processes on the same server.
See https://github.com/tesseract-ocr/tesseract/issues/898#issuecomment-315202167
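If you try that, here is a minimal sketch of passing the variable to each tesseract call from Python (the function name here is illustrative, not from the question):
import os
import subprocess

def run_tesseract(name_jpg, name_hocr):
    # Copy the current environment and limit tesseract's own OpenMP threads,
    # so they do not compete with the processes in the pool.
    env = dict(os.environ, OMP_THREAD_LIMIT="1")
    subprocess.check_call(
        ["tesseract", name_jpg, name_hocr, "-l", "eng", "hocr"],
        env=env,
    )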
Your code is launching a Pool and exiting before it finishes its job. You need to call close and join.
pool1=Pool(4)
pool1.map(processPage, range(len(pdf.pages)))
pool1.close()
pool1.join()
Alternatively, you can wait for its results.
pool1=Pool(4)
print pool1.map(processPage, range(len(pdf.pages)))
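A further variation (a sketch only, and it assumes processPage is changed to return i): imap_unordered hands back each result as soon as its worker finishes, which makes progress visible while the pool is still running.
from multiprocessing import Pool

if __name__ == "__main__":
    pool1 = Pool(4)
    # Each finished page index arrives here as soon as its task completes.
    for done_page in pool1.imap_unordered(processPage, range(len(pdf.pages))):
        print "finished page", done_page
    pool1.close()
    pool1.join()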

Running mpi4py with nohup. mpiexec noticed that process rank 6 with PID 70581 on node ip-x-xx-xx-xxxx exited on signal 1 (Hangup)

I was trying to run a python script which was modified to perform operations on multiple cores. I tried to run the same operation using nohup and it was working fine, but when I tried to run it on a larger sample I got the following error: mpiexec noticed that process rank 6 with PID 70581 on node ip-xx-x-x-xxx exited on signal 1 (Hangup).
Any idea why it is happening and how to rectify it? Also, it would be great if somebody could suggest a better way to run a python script on multiple cores.
For example I have a script abc.py
a = [1, 2, 3, 4]
result = []

def foo(x, y):
    # do some maths
    score_pair = ...  # some more maths depending on x and y
    return score_pair

if __name__ == '__main__':
    for x in a:
        for y in a:
            score_pair = foo(x, y)
            result.append([x, y, score_pair])
    # save result in pickle
I modified the for loop using mpi4py such that the script could utilise the multiple cores available on the computer. I was running the python script using
nohup mpiexec -n 16 python abc.py > nohupfile.out &
The code works well as long as the size of a is small (i.e. the execution time is short). If I increase the size of a, the process hangs up, giving the above-mentioned error.
Any idea how to rectify this, or is there some other way to parallelize the script?
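For context, here is a minimal sketch of the kind of mpi4py split described in the question; it is illustrative only, not the asker's actual code (the placeholder foo and the file result.pkl are made up), launched the same way with mpiexec -n 16 python abc.py:
# abc.py (illustrative): spread the (x, y) pairs over the MPI ranks and
# gather the partial results on rank 0, which pickles them.
from mpi4py import MPI
import pickle

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

a = [1, 2, 3, 4]
pairs = [(x, y) for x in a for y in a]

def foo(x, y):
    return x + y  # placeholder for the real maths

# Each rank handles every size-th pair.
local = [[x, y, foo(x, y)] for x, y in pairs[rank::size]]

gathered = comm.gather(local, root=0)
if rank == 0:
    result = [row for part in gathered for row in part]
    with open("result.pkl", "wb") as f:
        pickle.dump(result, f)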

Error with mpi4py after awhile of spawning, How to Debug

This is similar to a few questions on the internet, but this code seems to work for a while instead of returning an error instantly, which suggests to me that it is maybe not just a host-file error?
I am running code that spawns multiple MPI processes, which then each enter a loop within which they send some data with bcast and scatter, then gather data back from those processes. This runs the algorithm and saves data. The code then disconnects from the spawned comm and creates another set of spawns on the next loop iteration. This works for a few minutes, then after around 300 files it will spit this out:
[T7810:10898] [[50329,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../orte/mca/plm/base/plm_base_launch_support.c at line 758
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error.
More information may be available above.
I am testing this on a local machine (a single node); the end deployment will have multiple nodes that each spawn their own MPI processes within that node. I am trying to figure out whether this is an issue with testing the multiple spawns on my local machine (and it will work fine on the HPC) or a more serious error.
How can I debug this? Is there a way to print out what MPI is trying to do as it runs, or to monitor MPI, such as a verbose mode?
Since mpi4py is so close to MPI (logically, if not in terms of lines of code), one way to debug this is to write the C version of your program and see if the problem persists. When you report this bug to Open MPI, they are going to want a small C test case anyway.
