My program has two parts: the core and the downloader. The core handles all the app logic while the downloader just downloads URLs. Right now I am trying to use the Python multiprocessing module to run the core as one process and the downloader as another.
The first problem I noticed is that if I spawn the downloader process from the core process, so that the downloader is the child and the core is the parent, the core (parent) process is blocked until the child is finished. I do not want this behavior. I would like to have a core process and a downloader process that can both execute their code and communicate with each other.
Example:
...
def main():
    jobQueue = Queue()
    jobQueue.put("http://google.com")
    d = Downloader(jobQueue)
    p = Process(target=d.start())
    p.start()

if __name__ == '__main__':
    freeze_support()
    main()
where Downloader's start() just takes the URL out of the queue and downloads it.
In order to have both processes unblocked, would I need to create two processes from the parent process and then share something between them?
On the p = Process(target=d.start()) line, you're calling d.start() right there and passing its return value as the target, instead of passing the method itself. Remove the parens, like below:
...
def main():
    jobQueue = Queue()
    jobQueue.put("http://google.com")
    d = Downloader(jobQueue)
    p = Process(target=d.start)
    p.start()

if __name__ == '__main__':
    freeze_support()
    main()
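For completeness, the question never shows the Downloader class itself; here is a minimal sketch of what it might look like (hypothetical, assuming Python 3's urllib.request and a start() that just pulls one URL off the queue and downloads it):
import urllib.request

class Downloader(object):
    def __init__(self, jobQueue):
        self.jobQueue = jobQueue

    def start(self):
        # Take one URL off the shared queue and download it.
        url = self.jobQueue.get()
        data = urllib.request.urlopen(url).read()
        print('downloaded %d bytes from %s' % (len(data), url))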
I have code like this:
Clearly 'finished' has been printed out, but join still blocks. Why does this happen?
from multiprocessing import Process

class MyProcess(Process):
    def run(self):
        # do something
        print 'finished'

processes = []
for i in range(3):
    p = MyProcess()
    p.start()
    processes.append(p)

for p in processes:
    p.join()
You should add the line if __name__ == '__main__': for things to work properly.
Explanation:
Your main script is re-imported by the multiprocessing machinery in each child process, so its top-level lines get executed again there, on top of your own run of the script.
Here is the runtime error you get if you don't include if __name__ == '__main__':
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Your working code in Python 3.6 is:
from multiprocessing import Process

class MyProcess(Process):
    def run(self):
        # do something
        print('finished')

processes = []

if __name__ == '__main__':
    for i in range(3):
        p = MyProcess()
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
    print('we are done here .......')
output:
finished
finished
finished
we are done here .......
join() would not block if the task is finished. Also, your original program is invalid:
for i in 3:         # wrong: an integer is not iterable
for i in range(3):  # should be like this
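To see that join() returns promptly once run() has completed, here is a small sketch (not from the original answers) that starts one child, lets it finish on its own, and only then joins it:
from multiprocessing import Process
import time

def quick_task():
    print('finished')

if __name__ == '__main__':
    p = Process(target=quick_task)
    p.start()
    time.sleep(1)        # give the child time to finish on its own
    print(p.is_alive())  # False (the child has already exited)
    p.join()             # returns immediately, there is nothing left to wait for
    print(p.exitcode)    # 0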
I am new to Python and I am currently working on a multiple-webpage scraper. While I was playing around with Python I found out about threading, and this really speeds up the code. The problem is that the script scrapes lots of sites, and I'd like to do this in 'batches' from an array while using threading.
When I've got an array of 1000 items I'd like to grab 10 items. When the script is done with those 10 items, grab 10 new items until there is nothing left.
I hope someone can help me. Thanks in advance!
import subprocess
import threading
from multiprocessing import Pool

def scrape(url):
    return subprocess.call("casperjs test.js --url=" + url, shell=True)

if __name__ == '__main__':
    pool = Pool()
    sites = ["http://site1.com", "http://site2.com", "http://site3.com", "http://site4.com"]
    results = pool.imap(scrape, sites)
    for result in results:
        print(result)
In the future I will use an SQLite database to store all the URLs (this will replace the array). When I run the script I want to be able to stop the process and continue it whenever I want. This is not my question, but it is the context of my problem.
Question: ... array of 1000 items I'd like to grab 10 items
for p in range(0, 1000, 10):
    process10(sites[p:p+10])
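process10 is not defined in that answer; here is a minimal sketch of one way it could be implemented, reusing the scrape() function from the question and creating a fresh Pool per batch (the site list below is a made-up placeholder):
import subprocess
from multiprocessing import Pool

def scrape(url):
    # same scrape() as in the question
    return subprocess.call("casperjs test.js --url=" + url, shell=True)

def process10(batch):
    # Scrape one batch of up to 10 URLs in parallel and print the results
    # before returning, so the next batch can start.
    with Pool(len(batch)) as pool:
        for result in pool.imap(scrape, batch):
            print(result)

if __name__ == '__main__':
    sites = ["http://site%d.com" % i for i in range(1, 1001)]  # placeholder list
    for p in range(0, len(sites), 10):
        process10(sites[p:p + 10])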
How about using Process and Queue from multiprocessing? Writing a worker function and calling it from a loop makes it run in batches. With Process, jobs can be started and stopped when needed, which gives you more control over them.
import subprocess
from multiprocessing import Process, Queue

def worker(url_queue, result_queue):
    # keep pulling URLs until the 'DONE' sentinel is seen
    for url in iter(url_queue.get, 'DONE'):
        scrape_result = scrape(url)
        result_queue.put(scrape_result)

def scrape(url):
    return subprocess.call("casperjs test.js --url=" + url, shell=True)

if __name__ == '__main__':
    sites = ['http://site1.com', "http://site2.com", "http://site3.com", "http://site4.com", "http://site5.com",
             "http://site6.com", "http://site7.com", "http://site8.com", "http://site9.com", "http://site10.com",
             "http://site11.com", "http://site12.com", "http://site13.com", "http://site14.com", "http://site15.com",
             "http://site16.com", "http://site17.com", "http://site18.com", "http://site19.com", "http://site20.com"]
    url_queue = Queue()
    result_queue = Queue()
    processes = []
    for url in sites:
        url_queue.put(url)
    for i in range(10):
        p = Process(target=worker, args=(url_queue, result_queue))
        p.start()
        processes.append(p)
        url_queue.put('DONE')  # one sentinel per worker, so each of them can exit
    for p in processes:
        p.join()
    result_queue.put('DONE')
    for response in iter(result_queue.get, 'DONE'):
        print(response)
Note that Queue is a FIFO queue that supports putting and getting elements.
I'm struggling to get my head around multiprocessing and passing a global True/False variable into my function.
After get_data() finishes I want the analysis() function to start and process the data, while fetch() continues running. How can I make this work? TIA
import multiprocessing

ready = False

def fetch():
    global ready
    get_data()
    ready = True
    return

def analysis():
    analyse_data()

if __name__ == '__main__':
    p1 = multiprocessing.Process(target=fetch)
    p2 = multiprocessing.Process(target=analysis)
    p1.start()
    if ready:
        p2.start()
You should run the two processes and use a shared queue to exchange information between them, such as signaling the completion of an action in one of the processes.
Also, you need to have a join() statement to properly wait for completion of the processes you spawn.
from multiprocessing import Process, Queue
import time

def get_data(q):
    # Do something to get data
    time.sleep(2)
    # Put a message in the queue to signal that get_data has finished
    q.put('message from get_data to analyse_data')

def analyse_data(q):
    # Waiting for get_data to finish...
    msg = q.get()
    print(msg)  # Will print 'message from get_data to analyse_data'
    # get_data has finished

if __name__ == '__main__':
    # Create a queue for exchanging messages between the processes
    q = Queue()
    # Create the processes and pass the shared queue to them
    processes = [Process(target=get_data, args=(q,)), Process(target=analyse_data, args=(q,))]
    # Start the processes
    for p in processes:
        p.start()
    # Wait until all processes complete
    for p in processes:
        p.join()
Your example won't work for a few reasons:
Processes cannot share memory with each other (you can't change the global in one process and see the change in the other).
Even if you could change the global value, you are checking it too soon and it most likely won't have changed in time.
Read https://docs.python.org/3/library/ipc.html for more possibilities for inter-process communication.
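As one such alternative (not part of the original answer), here is a minimal sketch using multiprocessing.Event, a boolean flag that really is shared between processes, unlike the module-level global; get_data() and analyse_data() are stood in by placeholders:
from multiprocessing import Process, Event
import time

def fetch(ready):
    time.sleep(2)   # stand-in for get_data()
    ready.set()     # signal that the data is available

def analysis(ready):
    ready.wait()    # block until fetch() sets the flag
    print('data is ready, analysing...')  # stand-in for analyse_data()

if __name__ == '__main__':
    ready = Event()
    p1 = Process(target=fetch, args=(ready,))
    p2 = Process(target=analysis, args=(ready,))
    p1.start()
    p2.start()      # starts right away, but waits on the event inside
    p1.join()
    p2.join()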
Using Windows 7 + Python 2.6, I am trying to run a simulation model in parallel. I can launch multiple instances of the executable by double-clicking on them in my file browser. However, asynchronous calls with Popen result in each successive instance interrupting the previous one. For what it's worth, the executable returns text to the console, but I don't need to collect the results interactively.
Here's where I am so far:
import os
import multiprocessing, subprocess

def run(c):
    exe = os.path.join("<location>", "folder", str(c), "program.exe")
    run = os.path.join("<location>", "folder", str(c), "run.dat")
    subprocess.Popen([exe, run], creationflags=subprocess.CREATE_NEW_CONSOLE)

def main():
    pool = multiprocessing.Pool(3)
    for c in range(10):
        pool.apply_async(run, (str(c),))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
After scouring SO for a solution, I've learned that using multiprocessing may be redundant, but I need some way to limit the number of cores working.
Enabled by @J.F. Sebastian's comment regarding the cwd argument, here is what works:
import os
import multiprocessing, subprocess

def run(c):
    exe = os.path.join("<location>", "folder", str(c), "program.exe")
    run = os.path.join("<location>", "folder", str(c), "run.dat")
    subprocess.check_call([exe, run], cwd=os.path.join("<location>", "folder"),
                          creationflags=subprocess.CREATE_NEW_CONSOLE)

def main():
    pool = multiprocessing.Pool(3)
    for c in range(10):
        pool.apply_async(run, (str(c),))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
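Since each pool worker only blocks while an external program runs, the same cap of three simultaneous instances could also be enforced with threads instead of processes. Here is a sketch (not from the original post) using multiprocessing.pool.ThreadPool, which shares Pool's API:
import os
import subprocess
from multiprocessing.pool import ThreadPool

def run(c):
    exe = os.path.join("<location>", "folder", str(c), "program.exe")
    dat = os.path.join("<location>", "folder", str(c), "run.dat")
    # check_call blocks this worker until program.exe exits, so the pool
    # size is what limits how many instances run at once.
    subprocess.check_call([exe, dat], cwd=os.path.join("<location>", "folder"),
                          creationflags=subprocess.CREATE_NEW_CONSOLE)

if __name__ == '__main__':
    pool = ThreadPool(3)   # at most three simultaneous instances
    for c in range(10):
        pool.apply_async(run, (str(c),))
    pool.close()
    pool.join()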
I am new to multiprocessing.
I have run the example code for two 'highly recommended' multiprocessing examples given in response to other Stack Overflow multiprocessing questions. Here is one of them (which I dare not run again!):
test2.py (run from PyDev)
import multiprocessing

class MyFancyClass(object):
    def __init__(self, name):
        self.name = name

    def do_something(self):
        proc_name = multiprocessing.current_process().name
        print(proc_name, self.name)

def worker(q):
    obj = q.get()
    obj.do_something()

queue = multiprocessing.Queue()
p = multiprocessing.Process(target=worker, args=(queue,))
p.start()
queue.put(MyFancyClass('Fancy Dan'))

# Wait for the worker to finish
queue.close()
queue.join_thread()
p.join()
When I run this my computer slows down immediately, and it gets incrementally slower. After some time I managed to get into the Task Manager, only to see MANY, MANY python.exe entries under the Processes tab. After trying to end some of those processes, my mouse stopped moving. It was the second time I was forced to reboot.
I am too scared to attempt a third example...
Running an Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz (8 CPUs) on Win7 64-bit.
If anyone knows what the issue is and can provide a VERY SIMPLE example of multiprocessing (send a string to a worker process, alter it and send it back for printing), I would be very grateful.
From the docs:
Make sure that the main module can be safely imported by a new Python
interpreter without causing unintended side effects (such as starting a
new process).
Thus, on Windows, you must wrap your code inside a
if __name__=='__main__':
block.
For example, this sends a string to the worker process, the string is reversed, and the result is printed by the main process:
import multiprocessing as mp

def worker(inq, outq):
    obj = inq.get()
    obj = obj[::-1]  # reverse the string
    outq.put(obj)

if __name__ == '__main__':
    inq = mp.Queue()
    outq = mp.Queue()
    p = mp.Process(target=worker, args=(inq, outq))
    p.start()
    inq.put('Fancy Dan')
    # Wait for the worker to finish
    p.join()
    result = outq.get()
    print(result)
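Run as a script, the output should be:
naD ycnaF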
Because of the way multiprocessing works on Windows (child processes import the __main__ module), the __main__ module cannot actually run anything when imported; any code that should execute when run directly must be protected by the if __name__ == '__main__' idiom. Your corrected code:
import multiprocessing

class MyFancyClass(object):
    def __init__(self, name):
        self.name = name

    def do_something(self):
        proc_name = multiprocessing.current_process().name
        print(proc_name, self.name)

def worker(q):
    obj = q.get()
    obj.do_something()

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(queue,))
    p.start()
    queue.put(MyFancyClass('Fancy Dan'))
    # Wait for the worker to finish
    queue.close()
    queue.join_thread()
    p.join()
Might I suggest this link? It uses threads instead of multiprocessing, but many of the principles are the same.