I am attempting to run an infinite while loop that will call a function that makes an API call. Each API call can take between 9-12 seconds and I want x processes constantly running.
I've attempted some of the existing pool examples but they seem to take a function along with a list of arguments. My function generates all inputs needed.
from multiprocessing import Process
from random import randint
from time import sleep
def loop_a():
while 1:
wait = randint(5,9)
print("now waiting" + str(wait) + " seconds")
sleep(wait)
if __name__ == '__main__':
#Initialize two separate while loops that can call/wait for responses independently
Process(target=loop_a).start()
Process(target=loop_a).start()
This sample code I found from another question solves my problem mostly, but I am wondering if there is an elegant way to define how many processes to run at once. Id like to be able to enter the number of processes as a parameter rather than defining a new line for each process. Is there a nice way to accomplish this?
This snippet seems to fix the issue I was facing.
if __name__ == '__main__':
[Process(target=loop_a).start() for x in range(2)]
You should use Pool
from multiprocessing import Pool
pool = Pool(processes=2) # define number of processes
results = [pool.apply(loop_a) for x in range(2)]
Related
I like to run a bunch of processes concurrently but never want to reuse an already existing process. So, basically once a process is finished I like to create a new one. But at all times the number of processes should not exceed N.
I don't think I can use multiprocessing.Pool for this since it reuses processes.
How can I achieve this?
One solution would be to run N processes and wait until all processed are done. Then repeat the same thing until all tasks are done. This solution is not very good since each process can have very different runtimes.
Here is a naive solution that appears to work fine:
from multiprocessing import Process, Queue
import random
import os
from time import sleep
def f(q):
print(f"{os.getpid()} Starting")
sleep(random.choice(range(1, 10)))
q.put("Done")
def create_proc(q):
p = Process(target=f, args=(q,))
p.start()
if __name__ == "__main__":
q = Queue()
N = 5
for n in range(N):
create_proc(q)
while True:
q.get()
create_proc(q)
Pool can reuse a process a limited number of times, including one time only when you pass maxtasksperchild=1. You might also try initializer to see if you can run the picky once per process parts of your library there instead of in your pool jobs.
So I have two webscrapers that collect data from two different sources. I am running them both simultaneously to collect a specific piece of data (e.g. covid numbers).
When one of the functions finds data I want to use that data without waiting for the other one to finish.
So far I tried the multiprocessing - pool module and to return the results with get() but by definition I have to wait for both get() to finish before I can continue with my code. My goal is to have the code as simple and as short as possible.
My webscraper functions can be run with arguments and return a result if found. It is also possible to modify them.
The code I have so far which waits for both get() to finish.
from multiprocessing import Pool
from scraper1 import main_1
from scraper2 import main_2
from twitter import post_tweet
if __name__ == '__main__':
with Pool(processes=2) as pool:
r1 = pool.apply_async(main_1, ('www.website1.com','June'))
r2 = pool.apply_async(main_2, ())
data = r1.get()
data2 = r2.get()
post_tweet("New data is {}".format(data))
post_tweet("New data is {}".format(data2))
From here I have seen that threading might be a better option since webscraping involves a lot of waiting and only little parsing but I am not sure how I would implement this.
I think the solution is fairly easy but I have been searching and trying different things all day without much success so I think I will just ask here. (I only started programming 2 months ago)
As always there are many ways to accomplish this task.
you have already mentioned using a Queue:
from multiprocessing import Process, Queue
from scraper1 import main_1
from scraper2 import main_2
def simple_worker(target, args, ret_q):
ret_q.put(target(*args)) # mp.Queue has it's own mutex so we don't need to worry about concurrent read/write
if __name__ == "__main__":
q = Queue()
p1 = Process(target=simple_worker, args=(main_1, ('www.website1.com','June'), q))
p2 = Process(target=simple_worker, args=(main_2, ('www.website2.com','July'), q))
p1.start()
p2.start()
first_result = q.get()
do_stuff(first_result)
#don't forget to get() the second result before you quit. It's not a good idea to
#leave things in a Queue and just assume it will be properly cleaned up at exit.
second_result = q.get()
p1.join()
p2.join()
You could also still use a Pool by using imap_unordered and just taking the first result:
from multiprocessing import Pool
from scraper1 import main_1
from scraper2 import main_2
def simple_worker2(args):
target, arglist = args #unpack args
return target(*arglist)
if __name__ == "__main__":
tasks = ((main_1, ('www.website1.com','June')),
(main_2, ('www.website2.com','July')))
with Pool() as p: #Pool context manager handles worker cleanup (your target function may however be interrupted at any point if the pool exits before a task is complete
for result in p.imap_unordered(simple_worker2, tasks, chunksize=1):
do_stuff(result)
break #don't bother with further results
I've seen people use queues in such cases: create one and pass it to both parsers so that they put their results in queue instead of returning them. Then do a blocking pop on the queue to retrieve the first available result.
I have seen that threading might be a better option
Almost true but not quite. I'd say that asyncio and async-based libraries is much better than both threading and multiprocessing when we're talking about code with a lot of blocking I/O. If it's applicable in your case, I'd recommend rewriting both your parsers in async.
I need to run a bunch of parallel processes, but cannot use the standard multiprocessing package since its serialization with pickle does not work for more complex objects. Therefore I'm currently using pathos.multiprocessing which uses dill for the serialization and it works flawlessly in that regard.
However, I would like to have an exit condition, so that all procceses get terminated once the result of a process meets a certain condition (I'm computing objective values for an optimization problem and I want all procceses to stop once the result from a process is worse than any of the previous results).
For the standard multiprocessing package I found this solution (taken from https://stackoverflow.com/a/21491438/15799363). Can I do something similar with pathos.multiprocessing? I couldn't figure out how to pass a callback function to processes with pathos.
from random import random
from multiprocessing import Pool
from time import sleep
def add_something(i):
# Sleep to simulate the long calculation
sleep(random() * 30)
return i + 1
def run_my_process():
# Create a process pool
pool = Pool(100)
# Callback function that checks results and kills the pool
def check_result(result):
print(result)
if result == 90:
pool.terminate()
# Start up all of the processes
for i in range(100):
pool.apply_async(add_something, args=[i], callback=check_result)
pool.close()
pool.join()
if __name__ == '__main__':
run_my_process()
I am in the following setting: I have a method that takes an objective function f as input. As a subrouting of that method i want to evaluate f on a small set of points. Since f has high complexity i considered doing that in parallel.
All online examples hang up even for trivial functions like squaring on sets with 5 points. They are using the multiprocessing library - and i don't know what i am doing wrong. I am not sure how to encapsulate that __name__ == "__main__" statement in my method. (since it is part of a module - i guess instead of "__main__" i should use the module name?)
Code i have been using looks like
from multiprocessing.pool import Pool
from multiprocessing import cpu_count
x = [1,2,3,4,5]
num_cores = cpu_count()
def f(x):
return x**2
if __name__ == "__main__":
pool = Pool(num_cores)
y = list(pool.map(f, x))
pool.join()
print(y)
When executing this code in my spyder it takes a bloody long time to finish.
So my main questions are: What am i doing wrong in this code? How can i encapsulate the __name__-statement, when this code is part of a bigger method?
Is it even worth it parallelizing this? (one function evaluation can take multiple minutes and in serial this adds up to a total runtime of hours...)
According to documentation :
close()
Prevents any more tasks from being submitted to the pool. Once all the tasks have been completed the worker processes will exit.
terminate()
Stops the worker processes immediately without completing outstanding work. When the pool object is garbage collected
terminate() will be called immediately.
join()
Wait for the worker processes to exit. One must call close() or terminate() before using join().
So you should add :
from multiprocessing.pool import Pool
from multiprocessing import cpu_count
x = [1,2,3,4,5]
def f(x):
return x**2
if __name__ == "__main__":
pool = Pool()
y = list(pool.map(f, x))
pool.close()
pool.join()
print(y)
You can call Pool without any argument and it will use cpu_count by default
If processes is None then the number returned by cpu_count() is used
About the if name == "main", read more informations here.
So you need to think a bit about which code you want executed only in the main program. The most obvious example is that you want code that creates child processes to run only in the main program - so that should be protected by name == 'main'
You might want to look into the chunksize argument of the map function that you are using.
On a large enough input list, a lot of your time is spent simply communicating the arguments to and from the separate parallel processes.
One symptom of this problem is that when you use something like htop all cores are firing but at < 100%.
I am writing a Python script (in Python 2.7) wherein I need to generate around 500,000 uniform random numbers within a range. I need to do this 4 times, perform some calculations on them and write out the 4 files.
At the moment I am doing: (this is just part of my for loop, not the entire code)
random_RA = []
for i in xrange(500000):
random_RA.append(np.random.uniform(6.061,6.505)) # FINAL RANDOM RA
random_dec = []
for i in xrange(500000):
random_dec.append(np.random.uniform(min(data_dec_1),max(data_dec_1))) # FINAL RANDOM 'dec'
to generate the random numbers within the range. I am running Ubuntu 14.04 and when I run the program I also open my system manager to see how the 8 CPU's I have are working. I seem to notice that when the program is running, only 1 of the 8 CPU's seem to work at 100% efficiency. So the entire program takes me around 45 minutes to complete.
I noticed that it is possible to use all the CPU's to my advantage using the module Multiprocessing
I would like to know if this is enough in my example:
random_RA = []
for i in xrange(500000):
multiprocessing.Process()
random_RA.append(np.random.uniform(6.061,6.505)) # FINAL RANDOM RA
i.e adding just the line multiprocessing.Process(), would that be enough?
If you use multiprocessing, you should avoid shared state (like your random_RA list) as much as possible.
Instead, try to use a Pool and its map method:
from multiprocessing import Pool, cpu_count
def generate_random_ra(x):
return np.random.uniform(6.061, 6.505)
def generate_random_dec(x):
return np.random.uniform(min(data_dec_1), max(data_dec_1))
pool = Pool(cpu_count())
random_RA = pool.map(generate_random_ra, xrange(500000))
random_dec = pool.map(generate_random_dec, xrange(500000))
To get you started:
import multiprocessing
import random
def worker(i):
random.uniform(1,100000)
print i,'done'
if __name__ == "__main__":
for i in range(4):
t = multiprocessing.Process(target = worker, args=(i,))
t.start()
print 'All the processes have been started.'
You must gate the t = multiprocess.Process(...) with __name__ == "__main__" as each worker calls this program (module) again to find out what it has to do. If the gating didn't happen it would spawn more processes ...
Just for completeness, generating 500000 random numbers is not going to take you 45 minutes so i assume there are some intensive calculations going on here: you may want to look at them closely.