I am doing gradient descent (100 iterations, to be precise). Each data point can be analyzed in parallel, and there are 50 data points. Since I have 4 cores, I create a pool of 4 workers using multiprocessing.Pool. The core of the program looks like the following:
# Read the sgf files (total 50)
(intermediateBoards, finalizedBoards) = read_sgf_files()
# Create a pool of processes to analyze game boards in parallel with as
# many processes as number of cores
pool = Pool(processes=cpu_count())
# Initialize the parameter object
param = Param()
# maxItr = 100 iterations of gradient descent
for itr in range(maxItr):
    args = []
    # Prepare argument vector for each file
    for i in range(len(intermediateBoards)):
        args.append((intermediateBoards[i], finalizedBoards[i], param))
    # 4 processes analyze 50 data points in parallel in each iteration of
    # gradient descent
    result = pool.map_async(train_go_crf_mcmc, args)
Now, I haven't included the definition of the function train_go_crf_mcmc, but its very first line is a print statement. So when I run this, the print statement should execute 100*50 = 5000 times. But that does not happen, and what's more, I get a different number of console outputs on different runs.
What's wrong?
Your problem is that you are using map_async instead of map. map_async returns immediately, so once the work is farmed out to the pool, the loop continues even though none of that work has finished. It is not clear to me what happens to the tasks still running when the next loop iteration starts, but if these are supposed to be successive iterations, I can't imagine it is a) good or b) well defined. It also means the main process can race through all 100 iterations and exit before the workers have printed everything, which would explain the missing and varying output.
If you use map, it will block until all of the worker functions have finished before moving on to the next iteration. I suppose you could fake this with sleep, but that would just make things more complicated for no gain. map waits for exactly the minimum amount of time it needs to let everything finish.
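Concretely, here is a minimal sketch of the corrected loop using map; it assumes read_sgf_files, Param, train_go_crf_mcmc and maxItr are defined as in the question:

from multiprocessing import Pool, cpu_count

if __name__ == '__main__':
    (intermediateBoards, finalizedBoards) = read_sgf_files()
    pool = Pool(processes=cpu_count())
    param = Param()
    for itr in range(maxItr):
        args = [(intermediateBoards[i], finalizedBoards[i], param)
                for i in range(len(intermediateBoards))]
        # map blocks until all 50 results are back, so every print runs and
        # each gradient-descent iteration sees the previous iteration's results
        results = pool.map(train_go_crf_mcmc, args)
        # ... update param from results before the next iteration ...
    pool.close()
    pool.join()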
I am breaking a very large text file up into smaller chunks, and performing further processing on the chunks. For this example, let text_chunks be a list of lists, each list containing a section of text. The elements of text_chunks range in length from ~50 to ~15000. The class ProcessedText exists elsewhere in the code and does a large amount of subsequent processing and data classification based on the text fed to it. The different text chunks are processed into ProcessedText instances in parallel using code like the following:
def do_things_to_text(a, b):
    # pull out necessary things for ProcessedText initialization and return an instance
    print('Processing {0}'.format(a))
    return ProcessedText(a, b)

import multiprocessing as mp

# prepare inputs for starmap, pairing with list index so order can be reimposed later
pool_inputs = list(enumerate(text_chunks))

# parallel processing
pool = mp.Pool(processes=8)
results = pool.starmap_async(do_things_to_text, pool_inputs)
output = results.get()
The code executes successfully, but it seems that some of the worker processes created as part of the Pool randomly sit idle while the code runs. I track the memory usage, CPU usage, and status in top while the code executes.
At the beginning, all 8 worker processes are engaged (status "R" in top and nonzero CPU usage), but after ~20 entries from text_chunks are completed, their activity starts to vary wildly. At times as few as 1 worker process is running, and the others sit in status "S" with zero CPU usage. I can also see from my printed output statements that do_things_to_text() is being called less frequently. So far I haven't been able to identify why the processes start to idle. There are plenty of entries left to process, so having them sit idle wastes time.
My questions are:
Why are these worker processes sitting idle?
Is there a better way to implement multiprocessing that will prevent this?
EDITED to ADD:
I have further characterized the problem. It is clear from the indexes I print out in do_things_to_text() that multiprocessing is dividing the total number of jobs into chunks of ten consecutive indexes and handing one chunk to each process. So my console output shows Jobs 0, 10, 20, 30, 40, 50, 60, 70 being started at the same time (8 processes). Some jobs complete faster than others, so you might see Job 22 completed before you see Job 1 completed.
Until this first batch of chunks is completed, all processes are active and nothing is idle. However, when that batch is complete and Job 80 starts, only one process is active and the other 7 are idle. I have not confirmed it, but I believe it stays like this until the 80-series is complete.
Here are some recommendations for better memory utilization:
I don't know how text_chunks is created, but ultimately you end up with 8GB worth of strings in pool_inputs. Ideally, you would have a generator function, for example make_text_chunks, that yields the individual "text chunks" that formerly comprised the text_chunks iterable (if text_chunks is already such a generator expression, then you are all set). The idea is not to create all 8GB worth of data at once but only as the data is needed. With this strategy you can no longer use the Pool method starmap_async; we will be using Pool.imap instead. This method, unlike starmap_async, submits jobs iteratively in chunksize-sized batches, and you can process the results as they become available (although that doesn't seem to be an issue here).
def make_text_chunks():
    # logic goes here to generate the next chunk
    yield text_chunk

def do_things_to_text(t):
    # t is now a tuple:
    a, b = t
    # pull out necessary things for ProcessedText initialization and return an instance
    print('Processing {0}'.format(a))
    return ProcessedText(a, b)

import multiprocessing as mp

# do not turn into a list!
pool_inputs = enumerate(make_text_chunks())

def compute_chunksize(n_jobs, poolsize):
    """
    function to compute chunksize as is done by the Pool module
    """
    if n_jobs == 0:
        return 0
    chunksize, remainder = divmod(n_jobs, poolsize * 4)
    if remainder:
        chunksize += 1
    return chunksize

# parallel processing
# number of jobs, approximately;
# don't know exactly without turning pool_inputs into a list, which would be self-defeating
N_JOBS = 300
POOLSIZE = 8
CHUNKSIZE = compute_chunksize(N_JOBS, POOLSIZE)
with mp.Pool(processes=POOLSIZE) as pool:
    output = [result for result in pool.imap(do_things_to_text, pool_inputs, CHUNKSIZE)]
I am trying to implement an online recursive parallel algorithm, which is highly parallelizable. My problem is that my Python implementation does not work as I want. I have two 2D matrices in which I want to recursively update every column each time a new observation arrives at time-step t.
My parallel code looks like this:
def apply_async(t):
    worker = mp.Pool(processes=4)
    for i in range(4):
        X[:, i, np.newaxis], b[:, i, np.newaxis] = worker.apply_async(
            OULtraining, args=(train[t, i], X[:, i, np.newaxis], b[:, i, np.newaxis])).get()
    worker.close()
    worker.join()

for t in range(p, T):
    count = 0
    for l in range(p):
        for k in range(4):
            gn[count] = train[t-l-1, k]
            count += 1
    G = G*v + gn @ gn.T
    Gt = (1/(t-p+1))*G
    if __name__ == '__main__':
        apply_async(t)
The two matrices are X and b. I want each process to write directly into the master process's memory, since each process recursively updates only one specific column of the matrices.
Why is this implementation slower than the sequential one?
Is there any way to reuse the processes at every time-step, rather than killing them and creating them again? Could this be the reason it is slower?
The reason is that your program is, in practice, sequential. Here is an example code snippet that is, from a parallelism standpoint, identical to yours:
from multiprocessing import Pool
from time import sleep

def gwork(qq):
    print(qq)
    sleep(1)
    return 42

p = Pool(processes=4)
for q in range(1, 10):
    p.apply_async(gwork, args=(q,)).get()
p.close()
p.join()
Run this and you will notice the numbers 1-9 appearing exactly one second apart. Why is this? The reason is your .get(). Every call to apply_async blocks in get() until a result is available: it submits one task, waits a second emulating the processing delay, then returns the result, after which another task is submitted to your pool. This means there is no parallel execution going on at all.
Try replacing the pool management part with this:
results = []
for q in range(1, 10):
    res = p.apply_async(gwork, args=(q,))
    results.append(res)
p.close()
p.join()
for r in results:
    print(r.get())
You can now see parallelism at work, as four of your tasks are now processed simultaneously. Your loop does not block in get, as get is moved out of the loop and results are received only when they are ready.
NB: If the arguments to your workers or the return values from them are large data structures, you will lose some performance. In practice Python implements these transfers as queues, and transmitting a lot of data via a queue is slow in relative terms compared to getting an in-memory copy of a data structure when a subprocess is forked.
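If the goal, as in the question, is for each worker to update its column of X (or b) directly in the parent's memory rather than shipping arrays through those queues, a shared-memory buffer is one option. Below is a minimal sketch, not the original code: the shape, init_worker and update_column are illustrative assumptions.

import numpy as np
from multiprocessing import Pool
from multiprocessing.sharedctypes import RawArray

N_ROWS, N_COLS = 100, 4   # illustrative shape

def init_worker(buf):
    # Runs once per worker process: wrap the shared buffer in a NumPy view
    global X
    X = np.frombuffer(buf, dtype=np.float64).reshape(N_ROWS, N_COLS)

def update_column(args):
    # Hypothetical per-column update; the write lands directly in shared memory
    col, value = args
    X[:, col] += value

if __name__ == '__main__':
    # Lock-free shared buffer, handed to the workers once at pool creation
    shared_buf = RawArray('d', N_ROWS * N_COLS)
    X = np.frombuffer(shared_buf, dtype=np.float64).reshape(N_ROWS, N_COLS)
    with Pool(processes=4, initializer=init_worker, initargs=(shared_buf,)) as pool:
        for t in range(10):   # time-steps
            pool.map(update_column, [(col, float(t)) for col in range(N_COLS)])
    print(X[:, 0])   # every entry is 45.0: the parent sees the workers' updates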
I've implemented a genetic search algorithm and tried to parallelise it, but getting terrible performance (worse than single threaded). I suspect this is due to communication overhead.
I have provided pseudo-code below, but in essence the genetic algorithm creates a large pool of "Chromosome" objects, then runs many iterations of:
Score each individual chromosome based on how it performs in a 'world.' The world remains static across iterations.
Randomly select a new population based on the scores calculated in the previous step
Go to step 1 for n iterations
The scoring algorithm (step 1) is the major bottleneck, hence it seemed natural to distribute out the processing of this code.
I have run into a couple of issues I hoped I could get help with:
How can I link the calculated score with the object that was passed to the scoring function by map(), i.e. link each Future holding a score back to a Chromosome? I've done this in a very clunky way by having the calculate_scores() method return the whole object, but in reality all I need to send back is a float, if there is a better way to maintain the link.
The parallel processing of the scoring function is working okay, though takes a long time for map() to iterate through all the objects. However, the subsequent calls to draw_chromosome_from_pool() run very slowly compared to the single-threaded version to the point that I've not yet seen it complete. I have no idea what is causing this as the method always completes quickly in the single-threaded version. Is there some IPC going on to pull the chromosomes back to the local process, even after all the futures have completed? Is the local process de-prioritised in some way?
I am worried that the overall iterative nature of building/rebuilding the pool each cycle is going to cause an enormous amount of data transmission to the workers. The question at the root of this concern: what does Dask actually send back and forth to the worker pool, and when? i.e. when does Environment() get distributed out vs. Chromosome(), and how/when do results come back? I've read the docs but either haven't found the right detail, or am too stupid to understand it.
Ideally, I think (but I am open to correction) what I want is a distributed architecture where each worker holds the Environment() data locally on a 'permanent' basis, while Chromosome() instance data is distributed for scoring with little duplicated back-and-forth of unchanged Chromosome() data between iterations.
Very long post, so if you have taken the time to read this, thank you already!
class Chromosome(object):    # Small size: several hundred bytes per instance
    def get_score():
        # Returns a float
    def set_score(i):
        # Stores a float

class Environment(object):   # Large size: 20-50Mb per instance, but only one instance
    def calculate_scores(chromosome):
        # Slow calculation using attributes from chromosome and instance data
        chromosome.set_score(x)
        return chromosome

class Evolver(object):
    def draw_chromosome_from_pool(self, max_score):
        while True:
            individual = np.random.choice(self.chromosome_pool)
            selection_chance = np.random.uniform()
            if selection_chance < individual.get_score() / max_score:
                return individual

    def run_evolution():
        self.dask_client = Client()
        self.chromosome_pool = list()
        for i in range(10000):
            self.chromosome_pool.append(Chromosome())
        world_data = LoadWorldData()  # Returns a pandas DataFrame
        self.world = Environment(world_data)
        iterations = 1000
        for i in range(iterations):
            futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
            for future in as_completed(futures):
                c = future.result()
                highest_score = max(highest_score, c.get_score())
            new_pool = set()
            while len(new_pool) < self.pool_size:
                mother = self.draw_chromosome_from_pool(highest_score)
                # do stuff to build a new pool
Yes, each time you call the line
futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
you are serialising self.world, which is large. You could do this just once before the loop with
future_world = client.scatter(self.world, broadcast=True)
and then in the loop
futures = self.dask_client.map(lambda ch: Environment.calculate_scores(future_world, ch), self.chromosome_pool)
will use the copies already on the workers (or a simple function that does the same). The point is that future_world is just a pointer to stuff already distributed, but dask takes care of this for you.
On the issue of which chromosome is which: using as_completed breaks the order in which you submitted them to map, but that is not actually necessary for your code. You could use wait to process the results once all the work is done, or simply iterate over the futures and call .result() on each (which blocks until that task is done); then you retain the ordering of chromosome_pool.
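Putting both points together, a minimal sketch; world, chromosome_pool and iterations stand in for the question's self.world, self.chromosome_pool and iterations, and score_with_world is an illustrative wrapper, not part of the Dask API:

from dask.distributed import Client

def score_with_world(world, chromosome):
    # Plain function, so the large Environment is not re-serialised on every call
    return world.calculate_scores(chromosome)

client = Client()
world_future = client.scatter(world, broadcast=True)   # ship Environment once

for i in range(iterations):
    futures = client.map(score_with_world,
                         [world_future] * len(chromosome_pool),
                         chromosome_pool)
    scored = client.gather(futures)   # results come back in chromosome_pool order
    highest_score = max(c.get_score() for c in scored)
    # ... build the next chromosome_pool from `scored` ...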
I have a function that loads data and loops through days, e.g.:
def calculate_profit(account):
    account_data = load(account)  # very expensive operation
    for day in account_data.days:
        print(account_data.get(day).profit)
Because the loading of the data is expensive it makes sense to use joblib/multiprocessing to do something like this:
arr = [account1, account2, account3, ...]
joblib.Parallel(n_jobs=-1)(delayed(calculate_profit)(arr))
However, I have another expensive function that I would like to apply to the intermediate results of the calculate_profit function. For example, assume that it is an expensive operation to sum up all of the profit and process it/post it to a website/etc. I also need the previous day's profits to calculate the profit change in this function.
def expensive_sum(prev_day_profits, *account_profits):
    total_profit_today = sum(account_profits)
    profit_difference = total_profit_today - prev_day_profits
    # some other expensive operation
    # more expensive operations
So I would like to
Run the multiprocessing processes in parallel (to lessen the load of loading in all of the expensive account data)
Once each multiprocessing process hits a predefined point (e.g. finished one iteration of the loop), return those intermediate values to another function (expensive_sum) to process - assume that each individual multiprocessing process cannot continue until expensive_sum returns
HOWEVER, I want to keep the multiprocessing processes alive so that I don't have to reinitialize them (reducing that overhead)
Is there any way to do this?
from multiprocessing import Manager

manager = Manager()
queue = manager.Queue()

Once each multiprocessing process hits its predefined point, it does

queue.put(item)

Meanwhile the other expensive function does

item = queue.get()   # blocking call

The expensive function waits on get, goes ahead when it receives a value, processes it, and then waits on get again.
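For completeness, here is a runnable sketch of that pattern; the two-account list, three-day loop, fake profit numbers and the per-worker release queues are illustrative assumptions, not from the question:

from multiprocessing import Process, Manager

def calculate_profit(account, work_queue, release_queue):
    # Stand-in for the question's expensive per-account loop
    for day in range(3):
        work_queue.put((account, day, 100.0 * day))  # hand over an intermediate value
        release_queue.get()                          # block until expensive_sum is done

def expensive_sum(work_queue, n_accounts, release_queues):
    prev_day_profits = 0.0
    for day in range(3):
        # collect one intermediate value per account for this day
        profits = [work_queue.get()[2] for _ in range(n_accounts)]
        total_profit_today = sum(profits)
        print('day', day, 'profit change', total_profit_today - prev_day_profits)
        prev_day_profits = total_profit_today
        for q in release_queues:   # let every worker continue to the next day
            q.put('go')

if __name__ == '__main__':
    manager = Manager()
    accounts = ['account1', 'account2']
    work_queue = manager.Queue()
    release_queues = [manager.Queue() for _ in accounts]
    workers = [Process(target=calculate_profit, args=(a, work_queue, rq))
               for a, rq in zip(accounts, release_queues)]
    for w in workers:
        w.start()
    expensive_sum(work_queue, len(accounts), release_queues)
    for w in workers:
        w.join()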
I have an arcpy process that requires doing a union on a bunch of layers, running some calculations, and writing an HTML report. Given the number of reports I need to generate (~2,100), I need this process to be as quick as possible (my target is 2 seconds per report). I've tried a number of ways to do this, including multiprocessing, but I've run into a problem: the multiprocess part takes essentially the same amount of total time no matter how many cores I use.
For instance, for the same number of reports:
2 cores took ~30 seconds per round (so 40 reports takes 40/2 * 30 seconds)
4 cores took ~60 seconds (40/4 * 60)
10 cores took ~160 seconds (40/10 * 160)
and so on. It works out to the same total time because churning through twice as many at a time takes twice as long to do.
Does this mean my problem is I/O bound rather than CPU bound? (And if so, what do I do about it?) I would have thought it was the latter, given that the large bottleneck in my timing is the union (it takes up about 50% of the processing time). Unions are often expensive in ArcGIS, so I assumed breaking the work up and running 2-10 unions at once would be 2-10 times faster. Or am I implementing multiprocessing incorrectly?
## Worker function just included to give some context
def worker(sub_code):
    layer = 'in_memory/lyr_{}'.format(sub_code)
    arcpy.Select_analysis(subbasinFC, layer, where_clause="SUB_CD = '{}'".format(sub_code))
    arcpy.env.extent = layer
    union_name = 'in_memory/union_' + sub_code
    arcpy.Union_analysis([fields],
                         union_name,
                         "NO_FID", "1 FEET")
    #.......Some calculations using cursors

    # Templating using Jinjah
    context = {}
    context['DATE'] = now.strftime("%B %d, %Y")
    context['SUB_CD'] = sub_code
    context['SUB_ACRES'] = sum([r[0] for r in arcpy.da.SearchCursor(union, ["ACRES"], where_clause="SUB_CD = '{}'".format(sub_code))])
    # Etc

    # Then write the report out using custom function
    write_html('template.html', 'output_folder', context)

if __name__ == '__main__':
    subList = sorted({r[0] for r in arcpy.da.SearchCursor(subbasinFC, ["SUB_CD"])})
    NUM_CORES = 7
    chunk_list = [subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES-1)]
    for chunk in chunk_list:
        jobs = []
        for subbasin in chunk:
            p = multiprocessing.Process(target=worker, args=(subbasin,))
            jobs.append(p)
            p.start()
        for process in jobs:
            process.join()
There isn't much to go on here, and I have no experience with ArcGIS. So I can just note two higher-level things. First, "the usual" way to approach this would be to replace all the code below your NUM_CORES = 7 with:
pool = multiprocessing.Pool(NUM_CORES)
pool.map(worker, subList)
pool.close()
pool.join()
map() takes care of keeping all the worker processes as busy as possible. As is, you fire up 7 processes, then wait for all of them to finish. All the processes that complete before the slowest vanish, and their cores sit idle waiting for the next outer loop iteration. A Pool keeps the 7 processes alive for the duration of the job, and feeds each a new piece of work to do as soon as it finishes its last piece of work.
Second, this part ends with a logical error:
chunk_list = [subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES-1)]
You want NUM_CORES there rather than NUM_CORES-1. As-is, the first time around you extract
subList[0:7]
then
subList[6:13]
then
subList[12:19]
and so on. subList[6] and subList[12] (etc) are extracted twice each. The sublists overlap.
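A quick way to see the overlap with a toy list (the numbers here are illustrative):

subList = list(range(20))
NUM_CORES = 7

# Buggy stride: consecutive chunks share an element
print([subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES-1)])
# [[0, 1, 2, 3, 4, 5, 6], [6, 7, 8, 9, 10, 11, 12], [12, 13, 14, 15, 16, 17, 18], [18, 19]]

# Correct stride: the chunks are disjoint
print([subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES)])
# [[0, 1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12, 13], [14, 15, 16, 17, 18, 19]]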
You don't show us quite enough to be sure what you are doing. For example, what is your env.workspace? And what is the value of subbasinFC? It seems like you're doing an analysis at the beginning of each process to filter down the data into layer. But is subbasinFC coming from disk, or from memory? If it's from disk, I'd suggest you read everything into memory before any of the processes try their filtering. That should speed things along, if you have the memory to support it. Otherwise, yeah, you're I/O bound on the input data.
Forgive my arcpy cluelessness, but why are you inserting a where clause in your sum of context['SUB_ACRES']? Didn't you already filter on sub_code at the start? (We don't know what the union is, so maybe you're unioning with something unfiltered...)
I'm not sure you are using the Process pool correctly to track your jobs. This:
for subbasin in chunk:
    p = multiprocessing.Process(target=worker, args=(subbasin,))
    jobs.append(p)
    p.start()
for process in jobs:
    process.join()
Should instead be:
for subbasin in chunk:
    p = multiprocessing.Process(target=worker, args=(subbasin,))
    p.start()
    p.join()
Is there a specific reason you are going against the spec of the multiprocessing library? You are not waiting until each process terminates before spinning another one up, which is just going to create a whole bunch of processes that are not handled by the parent calling process.