Compiled Python Multiprocessing Locks CPU

I'm running a scraper program (using the requests library) that uses a simplistic threading scheme. Each worker goes to the internet, scrapes some data, and returns a dictionary. The parallel code (using the multiprocessing library's Pool) that I'm using looks like the following:
def get_stats():
    symbols = create_input_list('.\\combined_in.csv')
    pool = Pool(4)
    results = pool.map(return_info, symbols)
    print results
    curr_date_time = datetime.now().strftime('%m-%d-%y_[%H_%M_%S]')
    out_uri = '.\\scraped_info_out_' + curr_date_time + '.csv'
    create_output_file(out_uri, results)
This works GREAT as a script running in PowerShell, but not so well when compiled to an exe. I used py2exe initially, which created the exe just fine, but when run it opens a blank terminal, locks up the whole computer, spawns about 10 processes that I can see in Task Manager, and eventually the machine has to be manually rebooted. The py2exe setup script is similarly simple and looks like:
from distutils.core import setup
import py2exe
setup(console=['scraper.py'])
Thinking that py2exe might just not play nice with the multiprocessing library, I also tried PyInstaller, with the same result. Additionally, I do have the guard on the main function call, as follows:
if __name__ == '__main__':
    get_stats()
Is there a simple trick that I'm missing for when compiling with the multiprocessing library? I'm trying to figure out why it would work fine as a script but break so hard as an exe.

I can't answer for py2exe, but with PyInstaller this is a known problem with a well-documented workaround (one that has always worked for me):
https://stackoverflow.com/a/27694505/541038
That answer provides a good overview of the problem and the solution.
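In short, the usual fix is to call multiprocessing.freeze_support() as the very first statement under the __main__ guard, so the worker processes spawned by the frozen exe don't re-run the whole program; a minimal sketch against the code above (see the linked answer for the finer points):
from multiprocessing import Pool, freeze_support

def get_stats():
    # ... same scraping code as in the question ...
    pass

if __name__ == '__main__':
    freeze_support()  # no-op when run as a plain script, but required in a frozen Windows exe
    get_stats()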

Related

Import Python library in terminal

I need to run a Python script in a terminal, several times. This script requires me to import some libraries. So every time I call the script in the terminal, the libraries are loaded again, which results in a loss of time. Is there any way I can import the libraries once and for all at the beginning?
(If I try the "naive" way, first calling a script just to import the libraries and then running my code, it doesn't work.)
EDIT: I need to run the script in a terminal because it is actually made to serve another program developed in Java. The Java code calls the Python script in the terminal, reads its result and processes it, then calls it again.
One solution is to leave the Python script running permanently and use a pipe to communicate between processes, like the code below (taken from this answer):
import os, time

pipe_path = "/tmp/mypipe"
if not os.path.exists(pipe_path):
    os.mkfifo(pipe_path)
# Open the fifo. We need to open in non-blocking mode or it will stall until
# someone opens it for writing
pipe_fd = os.open(pipe_path, os.O_RDONLY | os.O_NONBLOCK)
with os.fdopen(pipe_fd) as pipe:
    while True:
        message = pipe.read()
        if message:
            print("Received: '%s'" % message)
        print("Doing other stuff")
        time.sleep(0.5)
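For completeness, a hypothetical writer side looks like this; anything that opens /tmp/mypipe for writing will do (a shell echo, or the other process):
# Hypothetical writer: sends one message through the named pipe.
with open("/tmp/mypipe", "w") as pipe:
    pipe.write("hello from the other process\n")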
The libraries will be unloaded once the script finishes, so the best way to handle this is to write the script so it can iterate however many times you need, rather than running the whole script multiple times. I would likely use input() (or raw_input() if you're running Python 2) to read in how many times you want to iterate, or use a library like click to create a command line argument for it.
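As a rough sketch of that structure (the script and function names here are made up), the expensive imports are paid for once and each request is handled inside the loop:
# worker_loop.py -- hypothetical example of iterating instead of re-running the script
import sys
# ... put the expensive imports here; they load only once ...

def handle(request):
    # placeholder for the real per-call work
    return request.strip().upper()

if __name__ == '__main__':
    for line in sys.stdin:  # the Java side writes one request per line
        if not line.strip():
            continue
        print(handle(line))
        sys.stdout.flush()  # flush so the Java reader sees each result immediately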

python multiprocessing pool.map hangs

I cannot make even the simplest examples of parallel processing with the multiprocessing package run in Python 2.7 (using Spyder as a UI on Windows), and I need help figuring out the issue. I have run conda update, so all of the packages should be up to date and compatible.
Even the first example in the multiprocessing package documentation (given below) won't work: it generates 4 new processes, but the console just hangs. I have tried everything I can find over the last 3 days, but none of the code that runs without hanging will allocate more than 25% of my computing power to this task (I have a 4-core computer).
I have given up, for now, on running the procedure I have designed and need parallel processing for; I am only trying to get a proof of concept so I can build from there. Can someone explain and point me in the right direction? Thanks
Example 1 from https://docs.python.org/2/library/multiprocessing.html
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool()
    print(p.map(f, [1, 2, 3]))
Example 2 (modified from the original) from http://chriskiehl.com/article/parallelism-in-one-line/
from multiprocessing import Pool

def fn(i):
    return [i, i*i, i*i*i]

test = range(10)

if __name__ == '__main__':
    pool = Pool()
    results = map(fn, test)
    pool.close()
    pool.join()
I apologize if there is indeed an answer to this as it seems as though I should be able to manage such a modest task but I am not a programmer and the resources I have found have been less than helpful given my very limited level of knowledge. Please let me know what further information is needed.
Thank you.
After installing Spyder on my virtual machine, it seems to be a Spyder-specific bug. Example 1 works in IDLE, when executed via the command line, and when executed from within Spyder (first saved and then run), but not when executed line by line in Spyder.
I would suggest simply creating a new file in Spyder, adding the lines of code, saving it, and then running it.
For related reports see:
https://groups.google.com/forum/#!topic/spyderlib/LP5d8QZTXd0
QtConsole in Spyder cannot use multiprocessing.Manager
Multiprocessing working in Python but not in iPython
https://github.com/spyder-ide/spyder/issues/1900

Python multiprocessing not working when used from C++ [duplicate]

This question already has an answer here:
Embedded python: multiprocessing not working
(1 answer)
Closed 8 years ago.
I'm trying to embed a Python 3.3 (x64) script that uses 'multiprocessing' into C++ code under Windows 7 x64.
Simple script like:
from multiprocessing import Process

def spawnWork(fileName, index):
    print("spawnWork: Entry... ")
    process = Process(target=execute, args=(fileName, index,))
    process.start()
    print("spawnWork: ... Exit.")

def execute(fileName, index):
    print("execute: Entry... ")
    # Do some long processing
    print("execute: ... Exit.")
works fine from Python, but when embedded it gets stuck at .start() and locks up.
I'm using all the relevant API calls to ensure safe GIL handling for Python. Everything works well when not dealing with the 'multiprocessing' package, but it locks up when attempting to start another 'Process'.
Is it possible to combine embedded Python/C++ with 'multiprocessing'?
Thanks
I wouldn't expect this to work, as the way multiprocessing works on Windows (where there's no fork) is to use CreateProcess to launch another copy of the same executable. And since that executable is your embedding C++ app, not the Python interpreter, you will probably have to cooperate very closely with it to make that work. You can see the relevant code in CPython's multiprocessing package (popen_spawn_win32.py in current versions).
Another potential problem is that, on Windows, multiprocessing relies on a C extension module that fakes POSIX semaphores on top of Windows kernel semaphores; I haven't read through the code, but that could easily be doing something funky to the GIL/thread state, and/or relying on something under the covers to share the semaphores with child Python executables.
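If you do need multiprocessing from an embedded interpreter, one documented hook worth trying is multiprocessing.set_executable(), which tells the module which binary to launch for child processes instead of re-running your C++ host. Whether that is enough depends on how the embedding sets up sys.path and the main module, so treat this as a sketch rather than a guaranteed fix (the interpreter path is an assumption):
# Sketch: point multiprocessing at a real python.exe instead of the embedding app.
# The path below is an assumption; adjust to wherever the interpreter actually lives.
import os
import sys
from multiprocessing import Process, set_executable

set_executable(os.path.join(sys.exec_prefix, 'python.exe'))

def execute(fileName, index):
    print("execute: Entry... ")
    # Do some long processing
    print("execute: ... Exit.")

def spawnWork(fileName, index):
    process = Process(target=execute, args=(fileName, index))
    process.start()
    return process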

Python multiprocessing/threading blocking main thread

I’m trying to write a program in Python. What I want is a script that immediately returns a friendly message to the user but spawns a long subprocess in the background that takes several different files and writes them to a granddaddy file. I’ve done several tutorials on threading and processing, but no matter what I try, the program waits until the subprocess is done before it displays the aforementioned friendly message to the user. Here’s what I’ve tried:
Threading example:
#!/usr/local/bin/python
import cgi, cgitb
import time
import threading

class TestThread(threading.Thread):
    def __init__(self):
        super(TestThread, self).__init__()

    def run(self):
        time.sleep(5)
        fileHand = open('../Documents/writable/output.txt', 'w')
        fileHand.write('Big String Goes Here.')
        fileHand.close()

print 'Starting Program'
thread1 = TestThread()
#thread1.daemon = True
thread1.start()
I’ve read these SO posts on multithreading
How to use threading in Python?
running multiple threads in python, simultaneously - is it possible?
How do threads work in Python, and what are common Python-threading specific pitfalls?
The last of these says that running threads concurrently in Python is actually not possible. Fair enough. Most of those posts also mention the multiprocessing module, so I’ve read up on that, and it seems fairly straightforward. Here are some of the resources I’ve found:
How to run two functions simultaneously
Python Multiprocessing Documentation Example
https://docs.python.org/2/library/multiprocessing.html
So here’s the same example translated to multiprocessing:
#!/usr/local/bin/python
import time
from multiprocessing import Process, Pipe

def f():
    time.sleep(5)
    fileHand = open('../Documents/writable/output.txt', 'w')
    fileHand.write('Big String Goes Here.')
    fileHand.close()

if __name__ == '__main__':
    print 'Starting Program'
    p = Process(target=f)
    p.start()
What I want is for these programs to immediately print ‘Starting Program’ (in the web-browser) and then a few seconds later a text file shows up in a directory to which I’ve given write privileges. However, what actually happens is that they’re both unresponsive for 5 seconds and then they print ‘Starting Program’ and create the text file at the same time. I know that my goal is possible because I’ve done it in PHP, using this trick:
//PHP
exec("php child_script.php > /dev/null &");
And I figured it would be possible in Python. Please let me know if I’m missing something obvious or if I’m thinking about this in the completely wrong way. Thanks for your time!
(System information: Python 2.7.6, Mac OSX Mavericks. Python installed with homebrew. My Python scripts are running as CGI executables in Apache 2.2.26)
Ok, I think I found the answer. Part of it was my own misunderstanding. A Python script can't simply return a message to a client-side (AJAX) program while still executing a big process; the very act of responding to the client means that the program has finished, threads and all. The solution, then, is to use the Python version of this PHP trick:
//PHP
exec("php child_script.php > /dev/null &");
And in Python:
#Python
subprocess.call(" python worker.py > /dev/null &", shell=True)
It starts an entirely new process outside the current one, and it will continue after the current one has ended. I'm going to stick with Python because at least we're using a civilized API function to start the worker script instead of the exec function, which always made me uncomfortable.
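For what it's worth, the same idea works without going through a shell; a rough sketch using subprocess.Popen (worker.py being the same worker script as above):
# Launch the worker detached from the request handling, without a shell.
import os
import subprocess

with open(os.devnull, 'wb') as devnull:
    subprocess.Popen(['python', 'worker.py'], stdout=devnull, stderr=devnull)
print 'Starting Program'  # responds immediately; the worker keeps running on its own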

Multiprocessing launching too many instances of Python VM

I am writing some multiprocessing code (Python 2.6.4, WinXP) that spawns processes to run background tasks. In playing around with some trivial examples, I am running into an issue where my code just continuously spawns new processes, even though I only tell it to spawn a fixed number.
The program itself runs fine, but if I look in Windows TaskManager, I keep seeing new 'python.exe' processes appear. They just keep spawning more and more as the program runs (eventually starving my machine).
For example, I would expect the code below to launch 2 python.exe processes: the first being the program itself, and the second being the child process it spawns. Any idea what I am doing wrong?
import time
import multiprocessing

class Agent(multiprocessing.Process):
    def __init__(self, i):
        multiprocessing.Process.__init__(self)
        self.i = i

    def run(self):
        while True:
            print 'hello from %i' % self.i
            time.sleep(1)

agent = Agent(1)
agent.start()
It looks like you didn't carefully follow the guidelines in the documentation, specifically this section where it talks about "Safe importing of main module".
You need to protect your launch code with an if __name__ == '__main__': block or you'll get what you're getting, I believe.
I believe it comes down to the multiprocessing module not being able to use os.fork() as it does on Linux, where an already-running process is basically cloned in memory. On Windows (which has no such fork()) it must launch a new Python interpreter, tell it to import your main module, and then execute the start/run method once that's done. If you have code at "module level", unprotected by the name check, then during that import it starts the whole sequence over again, ad infinitum.
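For reference, a minimal sketch of the example above with the guard in place:
import time
import multiprocessing

class Agent(multiprocessing.Process):
    def __init__(self, i):
        multiprocessing.Process.__init__(self)
        self.i = i

    def run(self):
        while True:
            print 'hello from %i' % self.i
            time.sleep(1)

if __name__ == '__main__':
    # Only the original process executes this block; the child's re-import
    # of the module skips it, so no extra interpreters get spawned.
    agent = Agent(1)
    agent.start()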
When I run this on Linux with Python 2.6, I see a maximum of 4 python2.6 processes, and I can't guarantee that they're all from this script. They're definitely not filling up the machine.
Do you need a newer Python version? Is it a Linux/Windows difference?
I don't see anything wrong with that. Works fine on Ubuntu 9.10 (Python 2.6.4).
Are you sure you don't have cron or something starting multiple copies of your script? Or that the spawned script is not calling anything that would start a new instance, for example as a side effect of import if your code runs directly on import?
