Multiprocessing process intermediate output - python

I have a function that loads data and loops through the days in it, e.g.:
def calculate_profit(account):
    account_data = load(account)  # very expensive operation
    for day in account_data.days:
        print(account_data.get(day).profit)
Because loading the data is expensive, it makes sense to use joblib/multiprocessing to do something like this:
arr = [account1, account2, account3, ...]
joblib.Parallel(n_jobs=-1)(delayed(calculate_profit)(account) for account in arr)
However, I have another expensive function that I would like to apply to the intermediate results of the calculate_profit function. For example, assume it is an expensive operation to sum up all of the profits and process them/post them to a website/etc. I also need the previous day's profits to calculate the profit change in this function.
def expensive_sum(prev_day_profits, *account_profits):
    total_profit_today = sum(account_profits)
    profit_difference = total_profit_today - prev_day_profits
    # some other expensive operation
    # more expensive operations
So I would like to:
Run the multiprocessing processes in parallel (to spread out the cost of loading all of the expensive account data)
Once each multiprocessing process hits a predefined point (e.g. it has finished one iteration of the loop), return those intermediate values to another function (expensive_sum) to process - assume that each individual multiprocessing process cannot continue until expensive_sum returns
However, keep the multiprocessing processes alive so that I don't have to reinitialize them (reducing that overhead)
Is there any way to do this?

You can use a multiprocessing.Manager queue:

from multiprocessing import Manager

manager = Manager()
queue = manager.Queue()

Once each multiprocessing process hits the predefined point, do

queue.put(item)

Meanwhile the other expensive function does

item = queue.get()  # blocking call

The expensive function waits on get() and goes ahead when it receives a value, processes it, and then waits on get() again.
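A minimal sketch of that pattern, using plain multiprocessing.Process workers instead of joblib; profit_worker, run and n_days are illustrative names, load and expensive_sum are the question's functions, and expensive_sum is assumed here to return today's total so it can be fed back in as prev_day_profits:

from multiprocessing import Manager, Process

def profit_worker(account, work_queue, resume_queue):
    account_data = load(account)                # the question's expensive load
    for day in account_data.days:
        profit = account_data.get(day).profit
        work_queue.put((account, day, profit))  # hand the intermediate value to the coordinator
        resume_queue.get()                      # block here until expensive_sum has run

def run(accounts, n_days):                      # n_days assumed known up front
    manager = Manager()
    work_queue = manager.Queue()
    resume_queues = {acc: manager.Queue() for acc in accounts}
    workers = [Process(target=profit_worker, args=(acc, work_queue, resume_queues[acc]))
               for acc in accounts]
    for w in workers:
        w.start()                               # each process is started exactly once

    prev_day_profits = 0
    for _ in range(n_days):
        day_profits = [work_queue.get() for _ in accounts]   # one intermediate value per worker
        prev_day_profits = expensive_sum(prev_day_profits,
                                         *(p for _, _, p in day_profits))
        for q in resume_queues.values():
            q.put(None)                         # release every worker for the next day

    for w in workers:
        w.join()

Each worker blocks on its own resume queue, so no process moves on to the next day until expensive_sum has finished, and the processes themselves stay alive for the whole run.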

Related

Merge using threads not working in python

I have to merge two lists, and each time I fill the lists in order to merge them. I did it like this:
def repeated_fill_buffer(self):
    """
    repeat the operation until reaching the end of file
    """
    # clear buffers from last data
    self.block = [[] for file in self.files]
    filling_buffer_thread = threading.Thread(self.fill_buffer())
    filling_buffer_thread.start()
    # create inverted index thread
    create_inverted_index_thread = threading.Thread(self.create_inverted_index())
    create_inverted_index_thread.start()
    # check if buffers are not empty to merge and start the thread
    if any(self.block):
        self.block = [[] for file in self.files]
    filling_buffer_thread.join()
    create_inverted_index_thread.join()
But what is happening is that filling_buffer_thread and create_inverted_index_thread are only called once and never run again; when I debugged the code I saw that
filling_buffer_thread had stopped.
I don't know if I've explained my question well, but what I want is to be able to call the same threads multiple times and run them.
If an operation is CPU-bound, using threads is of no use because of Python's GIL, which prevents more than one bytecode instruction from being executed at a time. Use the multiprocessing module instead, since every process has its own GIL.
All number crunching, and any other operation that depends on the CPU for its completion, is CPU-bound. Threads are useful for I/O-bound operations (like database calls or network calls).
To summarize your error: your filling_buffer_thread got blocked by create_inverted_index_thread.
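As a hedged illustration of that advice, here is a toy version of the index-building step moved into worker processes with multiprocessing.Pool; build_inverted_index and the sample blocks below are stand-ins, not the question's actual methods, which mutate shared self state and would need queues or a Manager to share it across processes:

import multiprocessing

def build_inverted_index(block):
    # CPU-bound work: each worker process has its own GIL,
    # so the blocks are indexed truly in parallel
    index = {}
    for doc_id, words in enumerate(block):
        for word in words:
            index.setdefault(word, []).append(doc_id)
    return index

if __name__ == "__main__":
    blocks = [[["a", "b"], ["b", "c"]],
              [["c", "d"], ["d", "e"]]]
    with multiprocessing.Pool() as pool:
        indexes = pool.map(build_inverted_index, blocks)  # one index per block
    print(indexes)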

Why are multiple multiprocessing imap's blocking?

I want to do multiple transformations on some data. I figured I can use multiple Pool.imap's because each of the transformations is just a simple map. And Pool.imap is lazy, so it only does computation when needed.
But strangely, it looks like multiple consecutive Pool.imap's are blocking. And not lazy. Look at the following code as an example.
import time
from multiprocessing import Pool

def slow(n):
    time.sleep(0.01)
    return n*n

for i in [10, 100, 1000]:
    with Pool() as p:
        numbers = range(i)
        iter1 = p.imap(slow, numbers)
        iter2 = p.imap(slow, iter1)
        start = time.perf_counter()
        next(iter2)
        print(i, time.perf_counter() - start)
# Prints
# 10 0.0327413540071575
# 100 0.27094774100987706
# 1000 2.6275791430089157
As you can see the time to the first element is increasing. I have 4 cores on my machine, so it roughly takes 2.5 seconds to process 1000 items with a 0.01 second delay. Hence, I think two consecutive Pool.imap's are blocking. And that the first Pool.imap finishes the entire workload before the second one starts. That is not lazy.
I did some additional research. It does not matter whether I use a process pool or a thread pool. It happens with both Pool.imap and Pool.imap_unordered. The blocking takes longer when I add a third Pool.imap. A single Pool.imap is not blocking. This bug report seems related but different.
TL;DR: imap is not a real generator, meaning it does not generate items on demand (lazy computation, similar to a coroutine), and pools initiate "jobs" in serial.
Longer answer: every type of submission to a Pool, be it imap, apply, apply_async, etc., gets written to a queue of "jobs". This queue is read by a thread in the main process (pool._handle_tasks) in order to allow jobs to continue to be initiated while the main process goes off and does other things. This thread contains a very simple double for loop (with a lot of error handling) that basically iterates over each job, then over each task within each job. The inner loop blocks until a worker is available to take each task, meaning tasks (and jobs) are always started serially, in the exact order they were submitted. This does not mean they will finish in perfect serial order, which is why map and imap collect results and re-order them back to their original order (handled by the pool._handle_results thread) before passing them back to the main thread.
Rough pseudocode of what's going on:
# task_queue buffers task inputs first in - first out

pool.imap(foo, ("bar", "baz", "bat"), chunksize=1)
# put an iterator on the task queue which will yield "chunks"
# (a chunk is given to a single worker process to compute)

pool.imap(fun, ("one", "two", "three"), chunksize=1)
# put a second iterator on the task queue

# inside the pool._task_handler thread within the main process
for task in task_queue:  # [imap_1, imap_2]
    # this is actually a while loop in reality that tries to get new tasks
    # until the pool is close()'d
    for chunk in task:
        _worker_input_queue.put(chunk)  # give the chunk to the next available worker
        # This blocks until a worker actually takes the chunk, meaning the loop won't
        # continue until all chunks are taken by workers.

def worker_function(_worker_input_queue, _worker_output_queue):
    while True:
        task = _worker_input_queue.get()  # get the next chunk of tasks
        # if task == StopSignal: break
        result = task.func(task.args)
        _worker_output_queue.put(result)  # results are collected, and re-ordered
                                          # by another thread in the main process
                                          # as they are completed.
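One workaround that follows from this: if the goal is a lazy, per-item pipeline, compose the transformations into a single function and submit one imap, so only one job iterator sits on the task queue. slow is the question's function; slow_twice is illustrative:

import time
from multiprocessing import Pool

def slow(n):
    time.sleep(0.01)
    return n * n

def slow_twice(n):
    # both transformation stages run inside one worker call,
    # so only a single imap job is ever queued
    return slow(slow(n))

if __name__ == "__main__":
    with Pool() as p:
        it = p.imap(slow_twice, range(1000))
        start = time.perf_counter()
        next(it)
        print(time.perf_counter() - start)

Since a single imap is not blocking (as noted in the question), the time to the first result stays small regardless of the input size.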

Dask: How to efficiently distribute a genetic search algorithm?

I've implemented a genetic search algorithm and tried to parallelise it, but I'm getting terrible performance (worse than single-threaded). I suspect this is due to communication overhead.
I have provided pseudo-code below, but in essence the genetic algorithm creates a large pool of "Chromosome" objects, then runs many iterations of:
Score each individual chromosome based on how it performs in a 'world.' The world remains static across iterations.
Randomly select a new population based on the scores calculated in the previous step
Go to step 1 for n iterations
The scoring algorithm (step 1) is the major bottleneck, hence it seemed natural to distribute out the processing of this code.
I have run into a couple of issues I hoped I could get help with:
How can I link the calculated score with the object that was passed to the scoring function by map(), i.e. link each Future holding a score back to a Chromosome? I've done this in a very clunky way by having the calculate_scores() method return the whole object, but in reality all I need is to send a float back, if there is a better way to maintain the link.
The parallel processing of the scoring function is working okay, though takes a long time for map() to iterate through all the objects. However, the subsequent calls to draw_chromosome_from_pool() run very slowly compared to the single-threaded version to the point that I've not yet seen it complete. I have no idea what is causing this as the method always completes quickly in the single-threaded version. Is there some IPC going on to pull the chromosomes back to the local process, even after all the futures have completed? Is the local process de-prioritised in some way?
I am worried that the overall iterative nature of building/rebuilding the pool each cycle is going to cause an enormous amount of data transmission to the workers. The question at the root of this concern: what does Dask actually send back and forth to the worker pool, and when? i.e. when does Environment() get distributed out vs. Chromosome(), and how/when do results come back? I've read the docs but either haven't found the right detail, or am too stupid to understand it.
Ideally, I think (but I'm open to correction) what I want is a distributed architecture where each worker holds the Environment() data locally on a 'permanent' basis, then Chromosome() instance data is distributed for scoring with little duplicated back/forth of unchanged Chromosome() data between iterations.
Very long post, so if you have taken the time to read this, thank you already!
class Chromosome(object):  # Small size: several hundred bytes per instance
    def get_score(self):
        # Returns a float
    def set_score(self, i):
        # Stores a float

class Environment(object):  # Large size: 20-50Mb per instance, but only one instance
    def calculate_scores(self, chromosome):
        # Slow calculation using attributes from chromosome and instance data
        chromosome.set_score(x)
        return chromosome

class Evolver(object):
    def draw_chromosome_from_pool(self, max_score):
        while True:
            individual = np.random.choice(self.chromosome_pool)
            selection_chance = np.random.uniform()
            if selection_chance < individual.get_score() / max_score:
                return individual

    def run_evolution(self):
        self.dask_client = Client()
        self.chromosome_pool = list()
        for i in range(10000):
            self.chromosome_pool.append(Chromosome())

        world_data = LoadWorldData()  # Returns a pandas DataFrame
        self.world = Environment(world_data)

        iterations = 1000
        for i in range(iterations):
            futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
            for future in as_completed(futures):
                c = future.result()
                highest_score = max(highest_score, c.get_score())
            new_pool = set()
            while len(new_pool) < self.pool_size:
                mother = self.draw_chromosome_from_pool(highest_score)
                # do stuff to build a new pool
Yes, each time you call the line
futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
you are serialising self.world, which is large. You could do this just once before the loop with
future_world = client.scatter(self.world, broadcast=True)
and then in the loop
futures = self.dask_client.map(lambda ch: Environment.calculate_scores(future_world, ch), self.chromosome_pool)
will use the copies already on the workers (or a simple function that does the same). The point is that future_world is just a pointer to stuff already distributed, but dask takes care of this for you.
On the issue of which chromosome is which: using as_completed breaks the order in which you submitted them to map, but that isn't necessary for your code. You could use wait to process the results once all the work is done, or simply iterate over the future.result() calls (each of which will wait for its task to finish), and then you will retain the same ordering as chromosome_pool.
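Putting both suggestions together, a rough sketch; score_with_world is an illustrative helper, and world, chromosome_pool and iterations are the names from the question's pseudocode:

from dask.distributed import Client

def score_with_world(world, chromosome):
    # module-level helper, so only the small chromosome is shipped per task;
    # `world` resolves to the copy that scatter() already placed on each worker
    return world.calculate_scores(chromosome)

client = Client()
future_world = client.scatter(world, broadcast=True)  # send the 20-50Mb Environment once

for i in range(iterations):
    futures = client.map(score_with_world,
                         [future_world] * len(chromosome_pool),
                         chromosome_pool)
    scored = [f.result() for f in futures]             # same order as chromosome_pool
    highest_score = max(c.get_score() for c in scored)
    # ...build the new pool from `scored` as before...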

Python multi function multithreading with threading.Thread? (variable number of threads)

I'm trying to start a variable number of threads to compute the results of functions for one of my automated trading modules. I have about 14 functions, all of which are computationally expensive. I've been calculating each function sequentially, but it takes around 3 minutes to complete, and since my platform is high-frequency, I need to cut that computation time down to 1 minute or less.
I've read up on multiprocessing and multithreading, but I can't find a solution that fits my need.
What I'm trying to do is define "n" number of threads to use, then divide my list of functions into "n" groups, then compute each group of functions in a separate thread. Essentially:
functionList = [func1,func2,func3,func4]
outputList = [func1out,func2out,func3out,func4out]
argsList = [func1args,func2args,func3args,func4args]
# number of threads
n = 3
functionSplit = np.array_split(np.array(functionList),n)
outputSplit = np.array_split(np.array(outputList),n)
argSplit = np.array_split(np.array(argsList),n)
Now I'd like to start "n" separate threads, each processing the functions according to the split lists. Then I'd like to name the output of each function according to the outputList and create a master dict of the outputs from each function. I will then loop through the output dict and create a dataframe with column ID numbers according to the information in each column (I already have this part worked out, I just need the multithreading).
Is there any way to do something like this? I've been looking into creating a subclass of the threading.Thread class and passing the functions, output names, and arguments into the run() method, but I don't know how to name and output the results of the functions from each thread! Nor do I know how to call functions in a list according to their corresponding arguments!
The reason that I'm doing this is to discover the optimum thread number balance between computational efficiency and time. Like I said, this will be integrated into a high frequency trading platform I'm developing where time is my major constraint!
Any ideas?
You can use the multiprocessing library like below:
import multiprocessing

def callfns(fnList, argList, outList, d):
    for i in range(len(fnList)):
        # call each function with its own arguments and store the result
        # in the shared dict under its output name
        d[outList[i]] = fnList[i](*argList[i])

...

manager = multiprocessing.Manager()
d = manager.dict()
processes = []
for i in range(len(functionSplit)):
    process = multiprocessing.Process(target=callfns,
                                      args=(functionSplit[i], argSplit[i], outputSplit[i], d))
    processes.append(process)

for j in processes:
    j.start()

for j in processes:
    j.join()

# use d here
You can use a server process to share the dictionary between these processes. To interact with the server process you need a Manager; then you can create a dictionary in the server process with manager.dict(). Once all processes have joined back to the main process, you can use the dictionary d.
I hope this helps you solve your problem.
You should use multiprocessing instead of threading for CPU-bound tasks.
Manually creating and managing processes can be difficult and requires more effort. Check out concurrent.futures and try ProcessPoolExecutor for maintaining a pool of processes; you can submit tasks to it and retrieve results.
The Pool.map method from the multiprocessing module can also take a function and an iterable and process them in chunks in parallel: the iterable is broken into separate chunks, the chunks are passed to the function in separate processes, and the results are then put back together.
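A minimal sketch of that approach with concurrent.futures, building the master dict keyed by the names in outputList; the toy f1/f2 functions and their argument tuples are illustrative, not the question's real trading functions:

from concurrent.futures import ProcessPoolExecutor

def f1(x):
    return x + 1

def f2(x, y):
    return x * y

functionList = [f1, f2]
argsList = [(10,), (3, 4)]
outputList = ["f1out", "f2out"]

if __name__ == "__main__":
    results = {}
    with ProcessPoolExecutor(max_workers=3) as pool:
        futures = {name: pool.submit(fn, *args)
                   for fn, args, name in zip(functionList, argsList, outputList)}
        for name, fut in futures.items():
            results[name] = fut.result()  # master dict keyed by output name
    print(results)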

Incorrect output when using multiprocessing

I am doing gradient descent (100 iterations to be precise). Each data point can be analyzed in parallel; there are 50 data points. Since I have 4 cores, I create a pool of 4 workers using multiprocessing.Pool. The core of the program looks like the following:
# Read the sgf files (total 50)
(intermediateBoards, finalizedBoards) = read_sgf_files()
# Create a pool of processes to analyze game boards in parallel with as
# many processes as number of cores
pool = Pool(processes=cpu_count())
# Initialize the parameter object
param = Param()
# maxItr = 100 iterations of gradient descent
for itr in range(maxItr):
    args = []
    # Prepare argument vector for each file
    for i in range(len(intermediateBoards)):
        args.append((intermediateBoards[i], finalizedBoards[i], param))
    # 4 processes analyze 50 data points in parallel in each iteration of
    # gradient descent
    result = pool.map_async(train_go_crf_mcmc, args)
Now, I haven't included the definition of the function train_go_crf_mcmc, but the very first line in the function is a print statement. So when I execute this function, the print statement should get executed 100*50 times. But that does not happen. What's more, I get a different number of console outputs on each run.
What's wrong?
Your problem is that you are using map_async instead of map. This means that once all of the work is farmed out to the pool, it will continue on with the loop, even if all of the work has not been finished. It is not clear to me what will happen to the work still running when the next loop starts, but if these are supposed to be iterations, I can't imagine it is (a) good or (b) well defined.
If you use map, it will block the loop until all of the worker functions have finished before moving on to the next step. I guess you could do this with sleep, but that would just make things more complicated for no gain. map will wait for exactly the minimum amount of time it needs to let everything finish.
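A sketch of the fix, dropped into the question's loop (names taken from the code above); either variant blocks until every data point in the current iteration is done before gradient descent moves on:

for itr in range(maxItr):
    args = [(intermediateBoards[i], finalizedBoards[i], param)
            for i in range(len(intermediateBoards))]
    # Option 1: map blocks until all 50 tasks have finished
    results = pool.map(train_go_crf_mcmc, args)
    # Option 2: keep map_async, but collect the results before the next iteration
    # async_result = pool.map_async(train_go_crf_mcmc, args)
    # results = async_result.get()  # blocks, and re-raises any worker exception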
