python multiprocessing - how to act on interim results

I'm using pandas to calculate statistics etc on a lot of data but it ends up running for hours, and I get new data frequently. I've tried to optimize already but I'd like to make it faster, so I'm trying to make it use multiple processes. The problem I'm having is that I need to perform some interim work with the results as they're getting done, and the examples I've seen for multiprocessing.Process and Pool all wait for everything to finish before working with the results.
This is the heavily trimmed code I'm using now. The piece I want to put into separate processes is generateAnalytics().
for counter, symbol in enumerate(queuelist): # queuelist
    if needQueueLoad: # set by another thread that's monitoring for new data (in the form of a new file that arrives a couple times a day)
        log.info('Shutting down analyticsRunner thread')
        break
    dfDay = generateAnalytics(symbol) # slow running function (15s+)
    astore[analyticsTable(symbol)] = dfDay # astore is a pandas store (HDF5). analyticsTable() returns the name of the appropriate table, which gets overwritten
    dfLatest.loc[symbol] = dfDay.iloc[-1] # update with the latest results (dfLatest is the latest results for each symbol, which is loaded as a global at startup and periodically saved back to the store in another thread)
    log.info('Processed {}/{} securities in queue.'.format(counter+1, len(queuelist)))
    # do some stuff to update progress GUI
I can't figure out how to get the last lines to work with the results while it's ongoing and would appreciate suggestions.
I'm considering running it all in a Pool and having the processes add the results to a Queue (instead of returning them), and then have a while loop sit in the main process pulling off the queue as the results come in - would that be a reasonable way to do it? Something like:
mpqueue = multiprocessing.Queue()
pool = multiprocessing.Pool()
pool.map(generateAnalytics, [queuelist, mpqueue])
while not needQueueLoad: # set by another thread that's monitoring for new data (in the form of a new file that arrives a couple times a day)
    while not mpqueue.empty():
        dfDay = mpqueue.get()
        astore[analyticsTable(symbol)] = dfDay # astore is a pandas store (HDF5). analyticsTable() returns the name of the appropriate table, which gets overwritten
        dfLatest.loc[symbol] = dfDay.iloc[-1] # update with the latest results (dfLatest is the latest results for each symbol, which is loaded as a global at startup and periodically saved back to the store in another thread)
        log.info('Processed {}/{} securities in queue.'.format(counter+1, len(queuelist)))
        # do some stuff to update GUI that shows progress
    sleep(0.1)
    # do some bookkeeping to see if queue has finished
pool.join()

Using a Queue looks like a reasonable way to do it, with two remarks.
Since it looks from the code that you're using a GUI, checking for results is probably better done in a timeout function or idle function rather than in a while-loop. Using a while-loop to check for results would block the GUI's event loop.
If the worker processes need to return a lot of data to the main process via the Queue, this will add significant overhead. You might want to consider using shared memory or even an intermediate file.
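A minimal sketch of that queue-based layout, reusing the question's generateAnalytics and queuelist (stubbed here so the snippet runs on its own) and a hypothetical pollResults helper meant to be driven by the GUI's timer. A Manager queue is used because a plain multiprocessing.Queue cannot be passed to pool workers as an argument:

import multiprocessing
import time

def generateAnalytics(symbol):
    # stand-in for the slow (15s+) analytics function from the question
    time.sleep(1)
    return 'analytics for {}'.format(symbol)

def analyticsWorker(symbol, resultQueue):
    # runs in a worker process and pushes its result as soon as it is ready
    resultQueue.put((symbol, generateAnalytics(symbol)))

def pollResults(resultQueue):
    # intended to run from a GUI timer/idle callback rather than a blocking while-loop
    handled = 0
    while not resultQueue.empty():
        symbol, dfDay = resultQueue.get()
        print('finished', symbol)  # placeholder for the astore / dfLatest bookkeeping
        handled += 1
    return handled

if __name__ == '__main__':
    queuelist = ['AAA', 'BBB', 'CCC']
    manager = multiprocessing.Manager()
    resultQueue = manager.Queue()
    pool = multiprocessing.Pool()
    for symbol in queuelist:
        pool.apply_async(analyticsWorker, (symbol, resultQueue))
    pool.close()
    done = 0
    while done < len(queuelist):  # in the real app, schedule pollResults on the GUI timer instead
        time.sleep(0.1)
        done += pollResults(resultQueue)
    pool.join()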

Related

Merge using threads not working in python

I have to merge two lists, and every time I fill the lists in order to merge them. This is what I did:
def repeated_fill_buffer(self):
    """
    repeat the operation until reaching the end of file
    """
    # clear buffers from last data
    self.block = [[] for file in self.files]
    filling_buffer_thread = threading.Thread(self.fill_buffer())
    filling_buffer_thread.start()
    # create inverted index thread
    create_inverted_index_thread = threading.Thread(self.create_inverted_index())
    create_inverted_index_thread.start()
    # check if buffers are not empty to merge and start the thread
    if any(self.block):
        self.block = [[] for file in self.files]
    filling_buffer_thread.join()
    create_inverted_index_thread.join()
but what is happening is that filling_buffer_thread and create_inverted_index_thread are only called once and don't run again. When I debugged the code I saw that filling_buffer_thread had stopped.
I don't know if I have explained my question well, but what I want is to be able to call the same threads multiple times and run them.
If an operation is CPU-bound, using threads is of no use because of the Python GIL, which prevents multiple bytecode instructions from being executed at a time. Use the multiprocessing module instead, since every process has its own GIL.
Number crunching and any other operation that depends on the CPU for its completion is CPU-bound. Threads are useful for I/O-bound operations (like database calls or network calls).
To summarize your error: your filling_buffer_thread got blocked due to create_inverted_index_thread.
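As a hedged illustration of the multiprocessing suggestion above (the crunch function and chunk list below are placeholders, not the asker's code):

import multiprocessing

def crunch(chunk):
    # placeholder CPU-bound work; each call runs in its own process with its own GIL
    return sum(x * x for x in chunk)

if __name__ == '__main__':
    chunks = [range(i * 100000, (i + 1) * 100000) for i in range(8)]
    with multiprocessing.Pool() as pool:
        print(sum(pool.map(crunch, chunks)))  # chunks are processed in parallel across processes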

Python: Make sure concurrent Thread is finished

I'm trying to write a Python program that works with threads. I'm using concurrent.futures to handle the threads. In the program, one thread should create a PDF file. Because this task takes very long, I want a thread to handle the creation. But after some time I want to work with the PDF file again. Therefore, I have to be sure that the previous thread has finished.
My question is: how can I check whether my concurrent.futures thread is finished, or how can I wait for it to complete?
Code
if __name__ == '__main__':
    with concurrent.futures.ThreadPoolExecutor() as executor:
        document = executor.submit(createPDF, pdfValues)
        # working on something different ....
    # Here I want to work with the pdf file again, via a new thread
    # Therefore I want to make sure that the thread above is finished.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        result = executor.submit(workWithPDF, values)
Your document variable is of type concurrent.futures.Future, which has a done() method that returns True if it has completed, and a result() method that returns the result and blocks until it is ready (if it isn't yet).
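A minimal sketch of that, reusing the asker's names (createPDF, workWithPDF and pdfValues are stubbed here just so the snippet runs):

import concurrent.futures
import time

def createPDF(values):
    # stub for the slow PDF-creation task
    time.sleep(2)
    return 'document built from {}'.format(values)

def workWithPDF(document):
    # stub for the follow-up work on the finished PDF
    print('working with', document)

if __name__ == '__main__':
    pdfValues = {'title': 'report'}
    with concurrent.futures.ThreadPoolExecutor() as executor:
        future = executor.submit(createPDF, pdfValues)
        # ... do unrelated work here ...
        print(future.done())        # True only once createPDF has finished
        document = future.result()  # blocks until the PDF task is done, then returns its value
        executor.submit(workWithPDF, document)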

Share variables across scripts

I have 2 separate scripts working with the same variables.
To be more precise, one script edits the variables and the other one uses them (it would be nice if it could edit them too, but that's not absolutely necessary).
This is what I am currently doing:
When code 1 edits a variable it dumps it into a json file.
Code 2 repeatedly opens the json file to get the variables.
This method is really not elegant and the while loop is really slow.
How can I share variables across scripts?
My first script gets data from a midi controller and sends web requests.
My second script is for LED strips (those run thanks to the same midi controller). Both scripts run in a "while true" loop.
I can't simply put them in the same script since every web request would slow the LEDs down. I am currently just sharing the variables via a JSON file.
If enough people ask for it I will post the whole code, but I have been told not to.
Considering the information you provided, meaning...
Both script run in a "while true" loop.
I can't simply put them in the same script since every webrequest would slow the LEDs down.
To me, you have 2 choices:
Use a client/server model. You have 2 machines. One acts as the server, and the second as the client. The server runs a script with an infinite loop that consistently updates the data, plus an API that just reads and exposes the current state of your file/database to the client. The client would be on another machine and, as I understand it, would simply request the current data and process it.
Make a single multiprocessing script. Each script would run as a separate process and manage its own memory. Since you also want to share variables between your two programs, you could pass a shared object as an argument to both of them. See this resource to help you.
Note that there are more solutions to this. For instance, you're using a JSON file that you are constantly opening and closing (that is probably what takes the most time in your program). You could use a real database that is opened only once and queried many times, while still being updated.
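For instance, a hedged sketch of that database idea using SQLite (the table and column names here are made up for illustration); the writer script updates a row and the reader script queries it, with no repeated JSON parsing:

import sqlite3

# writer script (the midi / web-request side)
con = sqlite3.connect('shared.db')
con.execute('CREATE TABLE IF NOT EXISTS vars (name TEXT PRIMARY KEY, value TEXT)')
con.execute("INSERT OR REPLACE INTO vars VALUES ('brightness', '42')")
con.commit()

# the reader script (the LED side) would open its own connection to the same file and do:
row = con.execute("SELECT value FROM vars WHERE name = 'brightness'").fetchone()
print(row[0])  # -> '42'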
a Manager from multiprocessing lets you do this sort of thing pretty easily
first I simplify your "midi controller and web-requests" code down to something that just sleeps for random amounts of time and updates a variable in a managed dictionary:
from time import sleep
from random import random

def slow_fn(d):
    i = 0
    while True:
        sleep(random() ** 2)
        i += 1
        d['value'] = i
next we simplify the "LED strip" control down to something that just prints to the screen:
from time import perf_counter

def fast_fn(d):
    last = perf_counter()
    while True:
        sleep(0.05)
        value = d.get('value')
        now = perf_counter()
        print(f'fast {value} {(now - last) * 1000:.2f}ms')
        last = now
you can then run these functions in separate processes:
import multiprocessing as mp

with mp.Manager() as manager:
    d = manager.dict()
    procs = []
    for fn in [slow_fn, fast_fn]:
        p = mp.Process(target=fn, args=[d])
        procs.append(p)
        p.start()
    for p in procs:
        p.join()
the "fast" output happens regularly with no obvious visual pauses

Passing updated args to multiple threads periodically in python

I have three base stations that have to work in parallel, and every 10 seconds they receive a list containing information about their cluster. I want to run this code for about 10 minutes. So, every 10 seconds my three threads have to call the target method with new arguments, and this process should last for 10 minutes. I don't know how to do this, but I came up with the idea below, which doesn't seem to be a good one! I'd appreciate any help.
I have a list named base_centroid_assign, and I want to pass each item of it to a distinct thread. The list content is updated frequently (say every 10 seconds), so I want to reuse my previous threads and give the updated items to them.
In the code below, the list contains three items, each of which has multiple items in it (it's nested). I want the three threads to stop after executing the quite simple target function, and then to call the threads again with the updated items; however, when I run the code below, I end up with 30 threads! (The run_time variable is 10 and the list's length is 3.)
How can I implement the idea mentioned above?
run_time = 10

def cluster_status_broadcasting(info_base_cent_avr):
    print(threading.current_thread().name)
    info_base_cent_avr.sort(key=lambda item: item[2], reverse=True)

start = time.time()
while run_time > 0:
    for item in base_centroid_assign:
        t = threading.Thread(target=cluster_status_broadcasting, args=(item,))
        t.daemon = True
        t.start()
    print('Entire job took:', time.time() - start)
    run_time -= 1
Welcome to Stackoverflow.
Problems with thread synchronisation can be so tricky to handle that Python already has some very useful libraries specifically for such tasks. The primary such library is queue.Queue in Python 3. The idea is to have a queue for each "worker" thread. The main thread collects new data and puts it onto a queue, and the subsidiary threads get the data from that queue.
When you call a Queue's get method its normal action is to block the thread until something is available, but presumably you want the threads to continue working on the current inputs until new ones are available, in which case it would make more sense to poll the queue and continue with the current data if there is nothing from the main thread.
I outline such an approach in my answer to this question, though in that case the worker threads are actually sending return values back on another queue.
The structure of your worker threads' run method would then need to be something like the following pseudo-code:
def run(self):
    request_data = self.inq.get()  # Wait for first item
    while True:
        process_with(request_data)
        try:
            request_data = self.inq.get(block=False)
        except queue.Empty:
            continue
You might like to add logic to terminate the thread cleanly when a sentinel value such as None is received.
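As a hedged illustration of that structure (a self-contained sketch; base_station_work is a placeholder for the asker's real target function), each worker polls its own queue, keeps reusing the latest data, and exits when a None sentinel arrives:

import queue
import threading
import time

def base_station_work(data):
    # placeholder for the real per-station processing
    time.sleep(0.5)
    print(threading.current_thread().name, 'working with', data)

def worker(inq):
    request_data = inq.get()          # wait for the first item
    while request_data is not None:   # None is the shutdown sentinel
        base_station_work(request_data)
        try:
            request_data = inq.get(block=False)  # pick up an update if one arrived
        except queue.Empty:
            pass                      # no update yet, keep using the current data

queues = [queue.Queue() for _ in range(3)]
threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
for t in threads:
    t.start()
for round_no in range(3):             # main thread pushes fresh data periodically
    for i, q in enumerate(queues):
        q.put({'station': i, 'round': round_no})
    time.sleep(2)
for q in queues:
    q.put(None)                       # ask every worker to finish
for t in threads:
    t.join()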

Show a progress bar for my multithreaded process

I have a simple Flask web app that makes many HTTP requests to an external service when a user pushes a button. On the client side I have an AngularJS app.
The server side of the code looks like this (using multiprocessing.dummy):
worker = MyWorkerClass()
pool = Pool(processes=10)
result_objs = [pool.apply_async(worker.do_work, (q,))
               for q in queries]

pool.close()  # Close pool
pool.join()   # Wait for all task to finish
errors = not all(obj.successful() for obj in result_objs)
# extract result only from successful task
items = [obj.get() for obj in result_objs if obj.successful()]
As you can see, I'm using apply_async because I want to later inspect each task and extract the result from it only if the task didn't raise any exception.
I understood that in order to show a progress bar on the client side, I need to publish somewhere the number of completed tasks, so I made a simple view like this:
@app.route('/api/v1.0/progress', methods=['GET'])
def view_progress():
    return jsonify(dict(progress=session['progress']))
That will show the content of a session variable. Now, during the process, I need to update that variable with the number of completed tasks (the total number of tasks to complete is fixed and known).
Any ideas about how to do that? Am I working in the right direction?
I've seen similar questions on SO, like this one, but I'm not able to adapt the answer to my case.
Thank you.
For interprocess communication you can use a multiprocessing.Queue, and your workers can put_nowait tuples with progress information on it while doing their work. Your main process can update whatever your view_progress is reading until all results are ready.
A bit like in this example usage of a Queue, with a few adjustments:
In the writers (workers) I'd use put_nowait instead of put because working is more important than waiting to report that you are working (but perhaps you judge otherwise and decide that informing the user is part of the task and should never be skipped).
The example just puts strings on the queue; I'd use collections.namedtuple for more structured messages. On tasks with many steps, this lets you raise the resolution of your progress report, and report more to the user.
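A minimal sketch of that idea (the Progress fields, step counts, and do_work body are placeholders, not part of the original answer):

import collections
import multiprocessing
import queue

Progress = collections.namedtuple('Progress', ['task_id', 'step', 'total_steps'])

def do_work(task_id, progress_queue):
    # worker: report structured progress without ever blocking on the queue
    total_steps = 3
    for step in range(1, total_steps + 1):
        # ... do one chunk of the real work here ...
        try:
            progress_queue.put_nowait(Progress(task_id, step, total_steps))
        except queue.Full:
            pass  # working is more important than reporting

if __name__ == '__main__':
    progress_queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=do_work, args=(i, progress_queue))
             for i in range(4)]
    for p in procs:
        p.start()
    received = 0
    while received < len(procs) * 3:      # 3 steps per task in this sketch
        msg = progress_queue.get()        # main process drains progress messages
        print('task {}: step {}/{}'.format(msg.task_id, msg.step, msg.total_steps))
        received += 1
    for p in procs:
        p.join()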
In general the approach you are taking is okay, I do it in a similar way.
To calculate the progress you can use an auxiliary function that counts the completed tasks:
def get_progress(result_objs):
    done = 0
    errors = 0
    for r in result_objs:
        if r.ready():
            done += 1
            if not r.successful():
                errors += 1
    return (done, errors)
Note that as a bonus this function returns how many of the "done" tasks ended in errors.
The big problem is for the /api/v1.0/progress route to find the array of AsyncResult objects.
Unfortunately AsyncResult objects cannot be serialized to a session, so that option is out. If your application supports a single set of async tasks at a time then you can just store this array as a global variable. If you need to support multiple clients, each with a different set of async tasks, then you will need figure out a strategy to keep client session data in the server.
I implemented the single client solution as a quick test. My view functions are as follows:
results = None

@app.route('/')
def index():
    global results
    results = [pool.apply_async(do_work) for n in range(20)]
    return render_template('index.html')

@app.route('/api/v1.0/progress')
def progress():
    global results
    total = len(results)
    done, errored = get_progress(results)
    return jsonify({'total': total, 'done': done, 'errored': errored})
I hope this helps!
I think you should be able to update the number of completed tasks using multiprocessing.Value and multiprocessing.Lock.
In your main code, use:
processes = multiprocessing.Value('i', 10)
lock = multiprocessing.Lock()
And then, when you call worker.dowork, pass a lock object and the value to it:
worker.dowork(lock, processes)
In your worker.dowork code, decrease "processes" by one when the code is finished:
lock.acquire()
processes.value -= 1
lock.release()
Now, "processes.value" should be accessible from your main code and equal to the number of remaining processes. Make sure you acquire the lock before accessing processes.value, and release the lock afterwards.
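A small self-contained sketch of that counter pattern (dowork here is a stand-in for the real worker method, and the task body is omitted):

import multiprocessing

def dowork(lock, remaining):
    # ... the real task would run here ...
    with lock:                 # equivalent to lock.acquire() / lock.release()
        remaining.value -= 1

if __name__ == '__main__':
    remaining = multiprocessing.Value('i', 10)
    lock = multiprocessing.Lock()
    procs = [multiprocessing.Process(target=dowork, args=(lock, remaining))
             for _ in range(10)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print('tasks left:', remaining.value)  # 0 once every worker has finished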
