I am sure many similar questions have been asked before, but after reading many of them I am still not very sure what I should do. So, I have a Python script to control some external instruments (a camera and a power meter). I have written classes for both instruments by calling the C functions in the .dll files using ctypes. Right now it looks something like this:
for i in range(10):
    power_reading = newport.get_reading(N=100, interval=1)  # take a power meter reading
    img = camera.capture(N=10)
    value = image_processing(img)  # analyze the img (ndarray) to get some values
    results.append([power_reading, value])  # add both results to a list
I want to start executing the first two lines at the same time. Both newport.get_reading and camera.capture take about 100 ms to 1 s to run (they will run for the same time if I choose the correct arguments). I don't need them to start at EXACTLY the same time, but ideally the delay should be smaller than about 10-20% of the total run time (so less than 0.2 s delay when each takes about 1 s to run). From what I have read, I can use the multiprocessing module, so I tried something like this based on this post:
def p_get_reading(newport, N, interval, return_dict):
    reading = newport.get_reading(N, interval, return_dict)
    return_dict['power_meter'] = reading

def p_capture(camera, N, return_dict):
    img = camera.capture(N)
    return_dict['image'] = img

for i in range(10):
    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    p = multiprocessing.Process(target=p_capture, args=(camera, 10, return_dict))
    p.start()
    p2 = multiprocessing.Process(target=p_get_reading, args=(newport, 100, 1, return_dict))
    p2.start()
    p.join()
    p2.join()
    print(return_dict)
I have a few problems/questions:
I need to get the return values from both function calls. Using my current method, return_dict only shows the entry from p_capture but not the power meter reading; why is that? I also read that I can use a Queue; what is the best method for my current purpose?
How can I know whether both functions indeed start running at the same time? I am thinking of using the time module to record the start and end times of both functions, maybe with a wrapper function to do the time logging, but will the use of multiprocessing pose any restrictions?
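To be concrete, something like this minimal sketch is what I have in mind (the label and the wrapped function are placeholders, not part of my instrument classes); it would be called inside each worker function, so the timestamps come from the child processes themselves:

import time

def timed(label, func, *args, **kwargs):
    # Call func(*args, **kwargs) and log wall-clock start/end times under the given label.
    start = time.time()
    result = func(*args, **kwargs)
    end = time.time()
    print(f"{label}: start={start:.3f}, end={end:.3f}, duration={end - start:.3f}s")
    return result

For example, inside p_capture I could write img = timed('capture', camera.capture, 10) and compare the logged start times of the two workers.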
I usually run my code in an IDE (Spyder), and from what I have read, I need to run it from a command prompt to see the output of the child processes (I have some print statements inside the functions for debugging purposes). Can I still use the IDE and have both functions run at the same time?
Using a Lock may help with synchronisation:
import multiprocessing

def p_get_reading(newport, N, interval, lock, return_dict):
    lock.acquire()
    lock.release()
    reading = newport.get_reading(N, interval)
    return_dict['power_meter'] = reading

def p_capture(camera, N, lock, return_dict):
    lock.acquire()
    lock.release()
    img = camera.capture(N)
    return_dict['image'] = img

if __name__ == "__main__":
    for i in range(10):
        manager = multiprocessing.Manager()
        return_dict = manager.dict()
        lock = multiprocessing.Lock()
        lock.acquire()
        p = multiprocessing.Process(target=p_capture, args=(camera, 10, lock, return_dict))
        p.start()
        p2 = multiprocessing.Process(target=p_get_reading, args=(newport, 100, 1, lock, return_dict))
        p2.start()
        lock.release()
        p.join()
        p2.join()
        print(return_dict)
The two Process objects can now be created and start()ed in any order as the main routine has already acquired the lock. Once released, the two processes will fight between themselves to acquire and release the lock, and be ready almost at the same time.
Also, note the use of if __name__ == "__main__", as this helps when multiprocessing spawns new processes.
I must say this seems like an abuse of a Lock
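If the Lock trick feels too hacky, a multiprocessing.Event expresses the same start signal more directly. Here is a minimal sketch of that variant (camera and newport are assumed to be the instrument objects from the question, as above):

import multiprocessing

def p_capture(camera, N, start_event, return_dict):
    start_event.wait()              # block until the main process gives the go signal
    return_dict['image'] = camera.capture(N)

def p_get_reading(newport, N, interval, start_event, return_dict):
    start_event.wait()
    return_dict['power_meter'] = newport.get_reading(N, interval)

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    start_event = multiprocessing.Event()
    p = multiprocessing.Process(target=p_capture, args=(camera, 10, start_event, return_dict))
    p2 = multiprocessing.Process(target=p_get_reading, args=(newport, 100, 1, start_event, return_dict))
    p.start()
    p2.start()
    start_event.set()               # release both workers at (almost) the same time
    p.join()
    p2.join()
    print(return_dict)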
The answer to your first question is simply no if you do it the normal way, but yes if you want it to be. No, because the target function cannot communicate back to the spawning thread using return. One way to do it is to use a queue and wrapper functions, as follows:
from threading import Thread
from queue import Queue

def p_get_reading(newport, N, interval, return_dict):
    reading = newport.get_reading(N, interval)
    return_dict.update({'power_meter': reading})
    return return_dict

def p_capture(camera, N, return_dict):
    img = camera.capture(N)
    return_dict.update({'image': img})
    return return_dict

def wrapper1(func, arg1, arg2, arg3, queue):
    queue.put(func(arg1, arg2, arg3))

def wrapper2(func, arg1, arg2, arg3, arg4, queue):
    queue.put(func(arg1, arg2, arg3, arg4))

q = Queue()
Thread(target=wrapper1, args=(p_capture, camera, 10, {}, q)).start()
Thread(target=wrapper2, args=(p_get_reading, newport, 100, 1, {}, q)).start()
Now q holds the updated and returned dicts from p_capture() and p_get_reading().
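For completeness, one way the two results could then be collected in the main thread might look like this (each get() blocks until one of the wrappers has finished, in whichever order they complete):

results = {}
results.update(q.get())   # dict from whichever thread finished first
results.update(q.get())   # dict from the other thread
print(results['power_meter'], results['image'])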
Related
So I have two webscrapers that collect data from two different sources. I am running them both simultaneously to collect a specific piece of data (e.g. covid numbers).
When one of the functions finds data I want to use that data without waiting for the other one to finish.
So far I have tried the multiprocessing Pool module, returning the results with get(), but by definition I have to wait for both get() calls to finish before I can continue with my code. My goal is to have the code as simple and as short as possible.
My webscraper functions can be run with arguments and return a result if found. It is also possible to modify them.
The code I have so far, which waits for both get() calls to finish:
from multiprocessing import Pool
from scraper1 import main_1
from scraper2 import main_2
from twitter import post_tweet

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        r1 = pool.apply_async(main_1, ('www.website1.com', 'June'))
        r2 = pool.apply_async(main_2, ())
        data = r1.get()
        data2 = r2.get()
        post_tweet("New data is {}".format(data))
        post_tweet("New data is {}".format(data2))
From here I have seen that threading might be a better option, since webscraping involves a lot of waiting and only a little parsing, but I am not sure how I would implement this.
I think the solution is fairly easy but I have been searching and trying different things all day without much success so I think I will just ask here. (I only started programming 2 months ago)
As always there are many ways to accomplish this task.
You have already mentioned using a Queue:
from multiprocessing import Process, Queue
from scraper1 import main_1
from scraper2 import main_2

def simple_worker(target, args, ret_q):
    ret_q.put(target(*args))  # mp.Queue has its own mutex, so we don't need to worry about concurrent read/write

if __name__ == "__main__":
    q = Queue()
    p1 = Process(target=simple_worker, args=(main_1, ('www.website1.com', 'June'), q))
    p2 = Process(target=simple_worker, args=(main_2, ('www.website2.com', 'July'), q))
    p1.start()
    p2.start()

    first_result = q.get()
    do_stuff(first_result)

    # don't forget to get() the second result before you quit. It's not a good idea to
    # leave things in a Queue and just assume they will be properly cleaned up at exit.
    second_result = q.get()
    p1.join()
    p2.join()
You could also still use a Pool by using imap_unordered and just taking the first result:
from multiprocessing import Pool
from scraper1 import main_1
from scraper2 import main_2

def simple_worker2(args):
    target, arglist = args  # unpack args
    return target(*arglist)

if __name__ == "__main__":
    tasks = ((main_1, ('www.website1.com', 'June')),
             (main_2, ('www.website2.com', 'July')))

    # The Pool context manager handles worker cleanup (your target function may, however,
    # be interrupted at any point if the pool exits before a task is complete).
    with Pool() as p:
        for result in p.imap_unordered(simple_worker2, tasks, chunksize=1):
            do_stuff(result)
            break  # don't bother with further results
I've seen people use queues in such cases: create one and pass it to both parsers so that they put their results in the queue instead of returning them. Then do a blocking pop on the queue to retrieve the first available result.
I have seen that threading might be a better option
Almost true, but not quite. I'd say that asyncio and async-based libraries are much better than both threading and multiprocessing when we're talking about code with a lot of blocking I/O. If it's applicable in your case, I'd recommend rewriting both your parsers in async.
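If rewriting in async is an option, a rough sketch could look like the following (scrape_1 and scrape_2 are assumed async rewrites of main_1 and main_2, e.g. using aiohttp internally; post_tweet is the function imported in the question):

import asyncio
from twitter import post_tweet   # as in the question

async def main():
    tasks = [asyncio.create_task(scrape_1('www.website1.com', 'June')),
             asyncio.create_task(scrape_2('www.website2.com', 'July'))]
    # return as soon as the first scraper finds something
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    post_tweet("New data is {}".format(done.pop().result()))
    for task in pending:                     # still collect the slower scraper's result
        post_tweet("New data is {}".format(await task))

asyncio.run(main())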
I am a beginner when it comes to Python threading and multiprocessing, so please bear with me.
I want to make a system that consists of three Python scripts. The first one creates some data and sends it to the second script continuously. The second script takes the data and saves it to a file until the file exceeds a defined size limit. When that happens, the third script sends the data to an external device and gets rid of this "cache". I need all of this to happen concurrently. The pseudocode below sums up what I am trying to do:
def main_1():
    data = [1,2,3]
    send_to_second_script(data)

def main_2():
    rec_data = receive_from_first_script()
    save_to_file(rec_data)
    if file > limit:
        signal_third_script()

def main_3():
    if signal is true:
        send_data_to_external_device()
        remove_data_from_disk()
I understand that I can use queues to make this happen but I am not sure how.
Also, so far, to do this I tried a different approach where I created one Python script and used threading to spawn threads for each part of the process. Is this correct, or is using queues better?
Firstly, for Python you need to be really aware of what multithreading/multiprocessing actually gives you. IMO you should be considering multiprocessing instead of multithreading. Threading in Python does not run in parallel because of the GIL, and there are many explanations out there on which one to use. The easiest way to choose is to see whether your program is I/O-bound or CPU-bound. Anyway, on to the Queue, which is a simple way to work with multiple processes in Python.
Using your pseudocode as an example, here is how you would use a Queue:
import multiprocessing

def main_1(output_queue):
    test = 0
    while test <= 10:  # simple limit so it does not run forever
        data = [1,2,3]
        print("Process 1: Sending data")
        output_queue.put(data)  # puts data in the queue, FIFO
        test += 1
    output_queue.put("EXIT")  # triggers the exit clause

def main_2(input_queue, output_queue):
    file = 0  # dummy pseudo variables
    limit = 1
    while True:
        rec_data = input_queue.get()  # get the latest data from the queue; blocks if empty
        if rec_data == "EXIT":  # the exit clause is a way to cleanly shut down your processes
            output_queue.put("EXIT")
            print("Process 2: exiting")
            break
        print("Process 2: saving to file:", rec_data, "count = ", file)
        file += 1
        #save_to_file(rec_data)
        if file > limit:
            file = 0
            output_queue.put(True)

def main_3(input_queue):
    while True:
        signal = input_queue.get()
        if signal is True:
            print("Process 3: Data sent and removed")
            #send_data_to_external_device()
            #remove_data_from_disk()
        elif signal == "EXIT":
            print("Process 3: Exiting")
            break

if __name__ == '__main__':
    q1 = multiprocessing.Queue()  # initializing the queues and the processes
    q2 = multiprocessing.Queue()
    p1 = multiprocessing.Process(target=main_1, args=(q1,))
    p2 = multiprocessing.Process(target=main_2, args=(q1, q2,))
    p3 = multiprocessing.Process(target=main_3, args=(q2,))
    p = [p1, p2, p3]
    for i in p:  # start all processes
        i.start()
    for i in p:  # ensure all processes are finished
        i.join()
The prints may be a little off because I did not bother to lock stdout. But using a queue ensures that stuff moves from one process to another.
EDIT: Do be aware that you should also have a look at multiprocessing locks to ensure that your file is 'thread-safe' when performing the move/delete. The pseudocode above only demonstrates how to use the queues.
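As a rough illustration only (append_to_cache and flush_cache are hypothetical helpers, not the main_2/main_3 above), the file operations could be guarded with a shared lock like this:

import multiprocessing
import os
import shutil

def append_to_cache(file_lock, path, data):
    # Process 2: only touch the cache file while holding the lock
    with file_lock:
        with open(path, "a") as f:
            f.write(str(data) + "\n")

def flush_cache(file_lock, path, destination):
    # Process 3: move (or delete) the cache file under the same lock
    with file_lock:
        if os.path.exists(path):
            shutil.move(path, destination)

if __name__ == "__main__":
    file_lock = multiprocessing.Lock()
    # pass file_lock to both processes alongside the queues, e.g.
    # multiprocessing.Process(target=main_2, args=(q1, q2, file_lock))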
Currently, I am trying to spawn a process in a Python program which again creates threads that continuously update variables in the process address space. So far I came up with this code which runs, but the update of the variable seems not to be propagated to the process level. I would have expected that defining a variable in the process address space and using global in the thread (which shares the address space of the process) would allow the thread to manipulate the variable and propagate the changes to the process.
Below is a minimal example of the problem:
import multiprocessing
import threading
import time
import random

def process1():
    lst = {}
    url = "url"
    thrd = threading.Thread(target=urlCaller, args=(url,))
    print("process alive")
    thrd.start()
    while True:
        # the process does some CPU intense calculation
        print(lst)
        time.sleep(2)

def urlCaller(url):
    global lst
    while True:
        # the thread continuously pulls data from an API
        # this is I/O heavy and therefore done by a thread
        lst = {random.randint(1,9), random.randint(20,30)}
        print(lst)
        time.sleep(2)

prcss = multiprocessing.Process(target=process1)
prcss.start()
The process always prints an empty list while the thread prints, as expected, a list with two integers. I would expect that the process prints a list with two integers as well.
(Note: I am using Spyder as my IDE, and somehow something is only printed to the console if I run this code on Linux/Ubuntu; nothing is printed to the console if I run the exact same code in Spyder on Windows.)
I am aware that the use of global variables is not always a good solution but I think it serves the purpose well in this case.
You might wonder why I want to create a thread within a process. Basically, I need to run the same complex calculation on different data sets that constantly change. Hence, I need multiple processes (one for each data set) to optimize the utilization of my CPU, and I use threads within the processes to make the I/O as efficient as possible. The data loses its value very fast, therefore I cannot just store it in a database or file, which would of course simplify the communication between the data producer (thread) and data consumer (process).
You are defining a local variable lst inside the function process1, so what urlCaller does is irrelevant; it cannot change the local variable of a different function. urlCaller is defining a global variable, but process1 can never see it because it's shadowed by the local variable you defined.
You need to remove lst = {} from that function and find another way to return a value, or declare the variable global there too:
def process1():
    global lst
    lst = {}
    url = "url"
    thrd = threading.Thread(target=urlCaller, args=(url,))
    print("process alive")
    thrd.start()
    while True:
        # the process does some CPU intense calculation
        print(lst)
        time.sleep(2)
I'd use something like concurrent.futures instead of the threading module directly.
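A minimal sketch of that idea, assuming a placeholder fetch_once function for the API call: the executor runs the I/O in a worker thread and the main loop picks up each finished result.

import concurrent.futures
import random
import time

def fetch_once(url):
    # placeholder for one I/O-bound API call
    time.sleep(1)
    return {random.randint(1, 9), random.randint(20, 30)}

def process1():
    url = "url"
    lst = set()
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_once, url)
        while True:
            if future.done():
                lst = future.result()                    # pick up the latest data
                future = pool.submit(fetch_once, url)    # start the next poll
            # the process does some CPU intense calculation
            print(lst)
            time.sleep(2)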
Thanks to the previous answer, I figured out that it's best to implement a process class and define "thread functions" within this class. Now the threads can access and manipulate a shared variable without the need to use thread.join() and terminate a thread.
Below is a minimal example in which 2 concurrent threads provide data for a parent process.
import multiprocessing
import threading
import time
import random

class process1(multiprocessing.Process):
    lst = {}
    url = "url"

    def __init__(self, url):
        super(process1, self).__init__()
        self.url = url

    def urlCallerInt(self, url):
        while True:
            self.lst = {random.randint(1,9), random.randint(20,30)}
            time.sleep(2)

    def urlCallerABC(self, url):
        while True:
            self.lst = {"Ab", "cD"}
            time.sleep(5)

    def run(self):
        t1 = threading.Thread(target=self.urlCallerInt, args=(self.url,))
        t2 = threading.Thread(target=self.urlCallerABC, args=(self.url,))
        t1.start()
        t2.start()
        while True:
            print(self.lst)
            time.sleep(1)

if __name__ == "__main__":
    p1 = process1("url")
    p1.start()
I'm trying to run a function with multiprocessing. This is the code:
import multiprocessing as mu

output = []

def f(x):
    output.append(x*x)

jobs = []
np = mu.cpu_count()
for n in range(np*500):
    p = mu.Process(target=f, args=(n,))
    jobs.append(p)

running = []
for i in range(np):
    p = jobs.pop()
    running.append(p)
    p.start()

while jobs != []:
    for r in running:
        if r.exitcode == 0:
            try:
                running.remove(r)
                p = jobs.pop()
                p.start()
                running.append(p)
            except IndexError:
                break

print("Done:")
print(output)
The output is [], while it should be [1,4,9,...]. Does anyone see where I'm making a mistake?
You are using multiprocessing, not threading. So your output list is not shared between the processes.
There are several possible solutions:
Retain most of your program but use a multiprocessing.Queue instead of a list. Let the workers put their results in the queue, and read it from the main program. It will copy data from process to process, so for big chunks of data this will have significant overhead.
You could use shared memory in the form of multiprocessing.Array. This might be the best solution if the processed data is large (a minimal sketch follows this list of options).
Use a Pool. This takes care of all the process management for you. Just like with a queue, it copies data from process to process. It is probably the easiest to use. IMO this is the best option if the data sent to/from each worker is small.
Use threading so that the output list is shared between threads. Threading in CPython has the restriction that only one thread at a time can be executing Python bytecode, so you might not get as much performance benefit as you'd expect. And unlike the multiprocessing solutions it will not take advantage of multiple cores.
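For example, a minimal sketch of the shared-memory option (scaled down from the question's loop; 'i' is the ctypes type code for a signed int):

import multiprocessing as mu

def f(n, arr):
    arr[n] = n * n              # each worker writes only its own slot

if __name__ == '__main__':
    N = 16
    output = mu.Array('i', N)   # shared array of C ints, zero-initialized
    jobs = [mu.Process(target=f, args=(n, output)) for n in range(N)]
    for p in jobs:
        p.start()
    for p in jobs:
        p.join()
    print(output[:])            # [0, 1, 4, 9, ...]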
Edit:
Thanks to @Roland Smith for pointing this out.
The main problem is the function f(x): when the child processes call it, they are unable to find the output variable (since it is not shared).
Edit:
Just as @cdarke said, in multiprocessing you have to carefully control the shared objects that the child processes can access (maybe with a lock), and it's pretty complicated and hard to debug.
Personally I suggest using the Pool.map method for this.
For instance, assuming that you run this code directly rather than importing it as a module, your code would be:
import multiprocessing as mu

def f(x):
    return x*x

if __name__ == '__main__':
    np = mu.cpu_count()
    args = [n for n in range(np*500)]

    pool = mu.Pool(processes=np)
    result = pool.map(f, args)
    pool.close()
    pool.join()

    print(result)
But there is something you must know: if you just run this file rather than importing it as a module, the if __name__ == '__main__': guard is important, since Python will load this file as a module in the other processes. If you don't place the function f outside if __name__ == '__main__':, the child processes will not be able to find your function f.
Edit: thanks to @Roland Smith for pointing out that we could use a tuple.
If you have more than one argument for the function f, then you might need a tuple to do so, for instance:
def f(args):
    x, y = args
    return x*y

args = [(n,1) for n in range(np*500)]
result = pool.map(f, args)
Or check here for a more detailed discussion.
I have got multiple parallel threads writing into one list in Python. My code is:
global_list = []

class MyThread(threading.Thread):
    ...
    def run(self):
        results = self.calculate_results()
        global_list.extend(results)

def total_results():
    for param in params:
        t = MyThread(param)
        t.start()
    while threading.active_count() > 1:
        pass
    return global_list
I don't like this approach as it has:
An overall global variable -> what would be the way to have a local variable for the `total_results` function?
The way I check when the list can be returned seems somewhat clumsy; what would be the standard way?
Is your computation CPU-intensive? If so you should look at the multiprocessing module which is included with Python and offers a fairly easy to use Pool class into which you can feed compute tasks and later get all the results. If you need a lot of CPU time this will be faster anyway, because Python doesn't do threading all that well: only a single interpreter thread can run at a time in one process. Multiprocessing sidesteps that (and offers the Pool abstraction which makes your job easier). Oh, and if you really want to stick with threads, multiprocessing has a ThreadPool too.
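A rough sketch of that suggestion (calculate_results here is just a stand-in for whatever each worker actually computes from its param):

from multiprocessing import Pool
# from multiprocessing.pool import ThreadPool as Pool   # drop-in swap to use threads instead

def calculate_results(param):
    return [param * param]      # placeholder for the real computation

def total_results(params):
    with Pool() as pool:
        all_results = []
        for partial in pool.map(calculate_results, params):
            all_results.extend(partial)
    return all_results

if __name__ == "__main__":
    print(total_results([1, 2, 3, 4]))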
1 - Use a class variable, shared between all Worker instances, to append your results:
from threading import Thread

class Worker(Thread):
    results = []
    ...
    def run(self):
        results = self.calculate_results()
        Worker.results.extend(results)  # extending a list is thread safe
2 - Use join() to wait until all the threads are done and let them have some computational time:
def total_results(params):
    # create all workers
    workers = [Worker(p) for p in params]
    # start all workers
    [w.start() for w in workers]
    # wait for all of them to finish
    [w.join() for w in workers]
    # get the result
    return Worker.results