I have two functions. Each function runs a for loop.
def f1(df1, df2):
    final_items = []
    for ind, row in df1.iterrows():
        id = row['Id']
        some_num = row['some_num']
        timestamp = row['Timestamp']
        res = f2(df=df2, id=id, some_num=some_num, timestamp=timestamp)
        final_items.append(res)
    return final_items

def f2(df, id, some_num, timestamp):
    for ind, row in df.iterrows():
        filename = row['some_filename']
        dfx = reader(key=filename)  # User defined; object reader
        # Assign variables
        st_ID = dfx["Id"]
        st_some_num = dfx["some_num"]
        st_time_first = dfx['some_first_time_variable']
        st_time_last = dfx['some_last_time_variable']
        if id == st_ID and some_num == st_some_num:
            if st_time_first <= timestamp and st_time_last >= timestamp:
                return filename
            else:
                return None
        else:
            continue
The first function calls the second function as shown. The first loop occurs 2000 times, i.e., there are 2000 rows in the first dataframe.
The second function (the one that is called from f1()) runs 10 Million times.
My objective is to speed up f2() using parallel processing. I have tried using python packages like Multiprocessing and Ray but I am new to the world of parallel processing and am running into a lot of roadblocks due to lack of experience.
Can someone help me speed up the function so that it takes considerably less time to execute for 10 million rows?
FACTS : the initial formulation has each of the 2E3 rows in f1() ask f2() to scan the 1E7 rows of a "shared" df2, and, for every scanned row, to call an unspecified reader()-process that fetches some other data, which then decides about further processing or a return
Q : "My objective is to speed up f2() using parallel processing"
Q : "Can someone help me speed up the function so that it takes considerably less time to execute for 10 million rows?"
Surprise No.1 : This is NOT a use-case of parallel-processing
The problem, as formulated above, calls file-I/O operations many times, and these are never true-[PARALLEL] down at the physical storage level, are they? Never. All the smart file-I/O (pre-)caching and sliding-window file-I/O tricks cease to help at even moderate levels of just-[CONCURRENT] workloads, and they often wreak havoc a single step beyond that workload ceiling, because memory resources and I/O-bus width x speed are physically limited and the latency of the weakest element in the chain keeps increasing under still growing traffic loads.
The workflow controlling iterators are pure-[SERIAL] "Work Dispatchers" that sequentially step through their domain of values, one after another, and order just another file to get ( again iteratively ) processed.
Surprise No.2 : Vectorisation will NOT help
While vectorised operations are smart for many vector/matrix/tensor processing schemes ( love using numpy + numba ), the Conditio Sine Qua Non is that the problem has to be:
"compact" - so that it is easily expressed by vectorising syntax-tricks, which this original is not: it steps [SERIAL]-ly, row after row after row, to find a first and only the first "device_ID match" in a "remote"-file-content, and next returns None if not ( <exprA> and <exprB> ) else filename
"uniform", i.e. not sequential "until" something first happens - vectorisation is great to "cover" the whole N-dimensional space with smart internal code that processes (best) orthogonal sub-structures uniformly "across" the whole space. Here, on the contrary, a vectorised run is hard to re-sequentialise "back", so as to stop (poison) it from producing any further results right after the first occurrence was matched ( ref. the "find a first and only the first occurrence ( and die / return )" above )
"memory-adequately-sized", i.e. whenever add-on logic asks the vectorisation engine to process N-dim "data" using some sort of where(...)-clause, the interim product of that where(...)-condition consumes an additional [SPACE]-footprint ( best in RAM, worse in swap-file-I/O ). This additional memory footprint may soon devastate any and all benefits of a vectorised re-formulation ( not to speak of the cases where such immense additional memory-allocation needs end in a swap-file-I/O suffocation of the whole process flow ). A where(...)-clause over 10E6 rows is expensive, the more so once the global strategy is to execute it 1 < nCPUs < 2E3 many times ( as noted above, vectorisation goes uniformly "across" the whole range of data, with no sequentially beneficial shortcut to stop after a first and only the first match )
THE BEST NEXT STEP : dependency-graph -> latencies -> bottleneck
The problem, as formulated above, is a just-[CONCURRENT] processing, where the actual blocking or availability of the "shared" resources limits the overall processing duration. Having no more than a given set of resources to use, there are no magic chances to speed up the concurrent usage patterns for faster processing. Thus the "amounts" of free resources to harness and their respective response-"latencies" are what matters here: sure, those measured under high levels of concurrent workloads, not the idealistic, unloaded response times.
If you have no profiling data, measure/benchmark at least the main characteristic durations:
a) the net f2()-per-row process latency [ min, Avg, MAX, StDev] in [us]
b) the reader()-related setup/retrieve latency [ min, Avg, MAX, StDev] in [us]
Then test whether the reader()'s performance is or is not a bottleneck, i.e. a ceiling for any process-flow operated at an increased concurrency.
If it is, you get the maximum workload it can handle, and based on this, concurrent processing may push the speed up to this reader()-determined performance ceiling.
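A minimal sketch of collecting the [ min, Avg, MAX, StDev ] figures in [us] asked for above, using only the standard library. The helper name measure_latency_us and the stand-in fake_reader() are hypothetical; the real reader() and the real f2()-per-row body would be passed in instead.

```python
import statistics
import time

def measure_latency_us(func, *args, repeats=200):
    """Call func(*args) `repeats` times and summarise the latency in microseconds."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter_ns()
        func(*args)
        samples.append((time.perf_counter_ns() - t0) / 1_000)
    return {
        "min": min(samples),
        "avg": statistics.mean(samples),
        "max": max(samples),
        "stdev": statistics.stdev(samples),
    }

def fake_reader(key):
    # Hypothetical stand-in for the user-defined reader(); replace with the real call.
    return {"Id": 1, "some_num": 2, "key": key}

stats = measure_latency_us(fake_reader, "some_filename")
print({name: round(value, 2) for name, value in stats.items()})
```

Running the same measurement once unloaded and once while several other readers are active gives the under-load latencies the text asks for, not just the idealistic ones.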
All the rest is elementary.
Epilogue
Such latency-data engineered, (un)avoidable bottleneck-aware right-sized concurrent processing setup for a maximum Latency Masking is about the maximum one can expect here to help.
Given a chance to re-engineer and re-factor the global strategy, there might be much faster processing times, but that may come from other than a pure-[SERIAL] tandem of sequential iterators instructing the sequence of about ~ 20.000.000.000 calls to an unknown reader()-code.
Yet, that goes way beyond the scope of this Stack Overflow MCVE problem definition.
Hope this might have sparked some fresh views on how to get the results faster. Smart ideas may bring processing times from a few days down to a few minutes (!). Having gone this way a few times: you would not believe how fulfilling this hard work can be for both you and your customer(s), once you hit such a solution by designing it right-sized for their business domain.
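One possible shape of the re-engineered global strategy mentioned in the epilogue, as a sketch only: call the reader once per file, keep the few fields f2() compares in a dictionary, and answer each of the 2E3 requests with a lookup instead of a fresh scan. The names build_index and lookup and the tuple layout of file_metadata are assumptions, not part of the original code; the sketch mirrors f2() in that the first (Id, some_num) match alone decides the outcome.

```python
def build_index(file_metadata):
    """file_metadata: iterable of (filename, file_id, some_num, t_first, t_last),
    in the same order as the rows of df2 (one reader() call per file, done once)."""
    index = {}
    for filename, file_id, some_num, t_first, t_last in file_metadata:
        # setdefault keeps only the FIRST file per (Id, some_num), like the early return in f2()
        index.setdefault((file_id, some_num), (t_first, t_last, filename))
    return index

def lookup(index, row_id, some_num, timestamp):
    hit = index.get((row_id, some_num))
    if hit is None:
        return None
    t_first, t_last, filename = hit
    return filename if t_first <= timestamp <= t_last else None

# Hypothetical metadata, as it might be collected from the reader() results
metadata = [
    ("a.csv", 1, 10, 0, 5),
    ("b.csv", 1, 10, 6, 9),
    ("c.csv", 2, 20, 0, 9),
]
index = build_index(metadata)
print(lookup(index, 1, 10, 3), lookup(index, 1, 10, 7), lookup(index, 2, 20, 4))
```

Whether this is admissible depends on the files not changing between requests, which the question does not state.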
Related
I want to shuffle values in a 3D numpy-array, but only when they are > 0.
When I run my function with a single core, it is much faster than with even 2 cores. It is way beyond the overhead of creating new python processes. What am I missing?
The following code outputs:
random shuffling of markers started
time in serial execution: 1.0288s
time executing in parallel with num_cores=1: 0.9056s
time executing in parallel with num_cores=2: 273.5253s
import numpy as np
import time
from random import shuffle
from joblib import Parallel, delayed
import multiprocessing

def randomizeVoxels(V,markerLUT):
    V_rand=V.copy()
    # the xyz naming here does not match outer convention, which will depend on permutation
    for ix in range(V.shape[0]):
        for iy in range(V.shape[1]):
            if V[ix,iy]>0:
                V_rand[ix,iy]=markerLUT[V[ix,iy]]
    return V_rand

V_ori=np.arange(1000000,-1000000,-1).reshape(100,100,200)
V_rand=V_ori.copy()
listMarkers=np.unique(V_ori)
listMarkers=[val for val in listMarkers if val>0]
print("random shuffling of markers started\n")
reassignedMarkers=listMarkers.copy()
#random shuffling of original markers
shuffle(reassignedMarkers)
markerLUT={}
for i,iMark in enumerate(listMarkers):
    markerLUT[iMark]=reassignedMarkers[i]

tic=time.perf_counter()
for ix in range(len(V_ori)):
    for iy in range(len(V_ori[0])):
        for iz in range(len(V_ori[0][0])):
            if V_ori[ix,iy,iz]>0:
                V_rand[ix,iy,iz]=markerLUT[V_ori[ix,iy,iz]]
toc=time.perf_counter()
print("time in serial execution: \t\t\t{: >4.4f} s".format(toc-tic))
#######################################################################
num_cores = 1
V_rand=V_ori.copy()
tic=time.perf_counter()
results= Parallel(n_jobs=num_cores)\
    (delayed(randomizeVoxels)\
        (V_ori[imSlice,:,:],
         markerLUT
        )for imSlice in range(V_ori.shape[0]))
for i,resTuple in enumerate(results):
    V_rand[i,:,:]=resTuple
toc=time.perf_counter()
print("time executing in parallel with num_cores={}:\t{: >4.4f} s".format(num_cores,toc-tic))

num_cores = 2
V_rand=V_ori.copy()
MASK = "time executing in parallel with num_cores={}:\t {: >4.4f}s"
tic=time.perf_counter() #----------------------------- [PERF-me]
results= Parallel(n_jobs=num_cores)\
    (delayed(randomizeVoxels)\
        (V_ori[imSlice,:,:],
         markerLUT
        )for imSlice in range(V_ori.shape[0]))
for i,resTuple in enumerate(results):
    V_rand[i,:,:]=resTuple
toc=time.perf_counter() #----------------------------- [PERF-me]
print( MASK.format(num_cores,toc-tic) )
Q : "What am I missing?"
Most probably the memory-I/O bottlenecks.
While the numpy part of the processing seems pretty shallow here (shuffle does not compute a bit, it only moves data between a pair of locations, doesn't it?), this will, for most of the time, not leave "time enough" of useful work to mask the memory-I/O latencies by re-ordered CPU-core instructions (ref. latency costs for straight + cross-QPI memory-I/O ops at the lowest levels of the contemporary super-scalar CISC architectures with highly speculative branch prediction (not useful for memory-I/O bound, non-branching, tightly crafted sections) and multi-core and many-core NUMA designs).
This is most probably why even the first spin-off concurrent process already hurts, no matter whether it is forced to camp on the same CPU-core (a shared CPU-core time split between an interleaving pair of two-step dancing processes, again memory-I/O bound, with even worse chances for latency masking on shared memory-I/O channels) or on any other CPU-core (adding cross-QPI latency costs whenever non-local memory-I/O has to be performed, again worsening the chances for memory-I/O latency masking).
CPU-core hopping, enforced by the colliding effects of the CPU-clock Boost policy (later starting to violate the Thermal Management, thus hopping the process to camp on a next, colder CPU-core) will invalidate all CPU-core cache benefits, by not having the pre-cached data on the next, colder, core available, thus having to re-fetch all (once pre-cached already into the fastest L1data cache) data again (perhaps, for array-objects with larger memory footprints, having even a need to cross-QPI fetch), so harnessing more cores does not have a trivial effect on the resulting efficiency.
The numpy high-performance & smart processing is not the one to be blamed here ;o) - the very opposite - it clearly unmasks the CPU "starvation" state, known for ages to be The Very Performance Ceiling for all our modern CPUs. This is why we see so many many-core CPUs that try to circumvent this bottleneck by having more and more cores - see the commented silicon-level analysis referenced above.
Last but not least, the code as-is contains an immense count of opportunities to improve its performance, numpy-smart vectorisation being the first one to name, avoiding the range()-loops. There are more tips to follow, all of which will finally bang their head into the very same trouble - the CPU-starvation ceiling.
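As an illustration of the numpy-smart vectorised route named above, here is a sketch that replaces the triple range()-loop and the dict lookup with a boolean mask and np.searchsorted. It is one way to do it under the question's setup, not the asker's or the answerer's code; the variable names markers, shuffled and mask are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
V_ori = np.arange(1000000, -1000000, -1).reshape(100, 100, 200)

mask = V_ori > 0                          # voxels that carry a marker
markers = np.unique(V_ori[mask])          # sorted list of the positive markers
shuffled = rng.permutation(markers)       # random reassignment, same length

V_rand = V_ori.copy()
# searchsorted gives, for every positive voxel, the position of its marker
# in the sorted `markers`; indexing `shuffled` with it applies the lookup table
positions = np.searchsorted(markers, V_ori[mask])
V_rand[mask] = shuffled[positions]

print(V_rand.shape, int(mask.sum()))
```

The whole reassignment then runs inside numpy in a single process, so there is nothing left to serialise and ship to workers.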
The reason using multiple processes can be slower than a single process is that multiprocessing requires arguments to be serialized before they are sent to the worker processes. This introduces additional overhead that scales with the amount of data that must be serialized.
The Python multiprocessing library uses pickle for this, while joblib's default loky backend relies on cloudpickle. The documentation can be found here: https://joblib.readthedocs.io/en/latest/parallel.html#serialization-processes
For more, here is another StackOverflow answer that gives more context about why this serialization is needed.
https://joblib.readthedocs.io/en/latest/parallel.html#serialization-processes
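To get a feeling for the size of that overhead, the following standard-library sketch pickles a lookup table of a size comparable to the question's markerLUT and scales it by the number of slices. The dictionary contents and the name marker_lut are stand-ins, and how often joblib really re-serialises the argument depends on its batching, so the total is an upper-bound style estimate rather than a measurement of the original program.

```python
import pickle
import time

marker_lut = {k: k + 1 for k in range(200_000)}   # hypothetical stand-in for markerLUT
n_tasks = 100                                      # one task per image slice

tic = time.perf_counter()
payload = pickle.dumps(marker_lut)
toc = time.perf_counter()

print(f"one copy  : {len(payload) / 1e6:.1f} MB, pickled in {toc - tic:.3f} s")
print(f"{n_tasks} tasks : up to {n_tasks * len(payload) / 1e6:.0f} MB to serialise and ship")
```

In the question the keys and values are numpy scalars, which pickle less compactly than plain ints, so the real cost is likely higher than this estimate.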
I have a work load that consist of a very slow query that returns a HUGE amount of data that has to be parsed and calculated, all that on a loop. Basically, it looks like this:
for x in lastTenYears:
    myData = DownloadData(x)                 # takes about ~40-50 [sec]
    parsedData.append(ParseData(myData))     # takes another +30-60 [sec]
As I believe you have noticed, if I could run the data parsing on a thread, I could download the next batch of data while the parsing happens.
How can I achieve this parallelism of operations?
Ideally speaking, I would like to have 1 thread always downloading, and N threads doing the parsing. The download part is actually a query against a database, so it's not good to run a bunch of them in parallel...
Details:
The parsing of the data is heavily CPU bound, and consists of raw math calculations and nothing else.
Using Python 3.7.4
1) Use a thread-safe queue: queue.Queue (a FIFO queue). At the top level define
    import queue
    my_queue = queue.Queue()
    parsedData = []
2) On the first thread, kick off the data loading
    my_queue.put(DownloadData(x))
On the second thread
    if not my_queue.empty():
        myData = my_queue.get()
        parsedData.append(ParseData(myData))
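A runnable sketch of the two-thread arrangement described above, with trivial stand-ins for DownloadData() and ParseData() (both hypothetical here). It uses a blocking get() plus a None sentinel instead of polling empty(), so the parsing thread neither spins nor exits early while the queue is momentarily empty.

```python
import queue
import threading

def download_data(x):
    # Hypothetical stand-in for the slow DB query
    return list(range(x))

def parse_data(data):
    # Hypothetical stand-in for the CPU-heavy parsing
    return sum(data)

def producer(items, work_queue):
    for x in items:
        work_queue.put(download_data(x))
    work_queue.put(None)              # sentinel: nothing more to download

def consumer(work_queue, results):
    while True:
        data = work_queue.get()       # blocks until the next batch is ready
        if data is None:
            break
        results.append(parse_data(data))

work_queue = queue.Queue(maxsize=2)   # bounded, so downloads cannot pile up in RAM
parsed = []
downloader = threading.Thread(target=producer, args=([3, 4, 5], work_queue))
parser = threading.Thread(target=consumer, args=(work_queue, parsed))
downloader.start()
parser.start()
downloader.join()
parser.join()
print(parsed)
```

With a single parsing thread the results keep the download order. Because the parsing is CPU bound, the thread version mainly overlaps the parsing with the waiting on the database; it does not make the parsing itself any faster.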
If your program is CPU bound, you will have a hard time doing anything else in other threads due to the GIL (global interpreter lock).
Here is a link to an article which might help you to understand the topic: https://opensource.com/article/17/4/grok-gil
Downloading the data in a sub-process is most likely the best approach.
It's hard to say if and how much this will actually help (as I have nothing to test...), but you might try a multiprocessing.Pool. It handles all the dirty work for you and you can customize number of processes, chunk size etc.
from multiprocessing import Pool

def worker(x):
    myData = DownloadData(x)
    return ParseData(myData)

if __name__ == "__main__":
    processes = None  # defaults to os.cpu_count()
    chunksize = 1
    with Pool(processes) as pool:
        parsedData = pool.map(worker, lastTenYears, chunksize)
Here for the example I use the map method, but according to your needs you might want to use imap or map_async.
Q : How can I achieve this parallelism of operations?
Step number one is to realise that the use-case requested above is not a [PARALLEL] code-execution, but an un-ordered batch, limited by a resources-use policy, of a strict sequence of pairs of :
First-a-remote-[DB-Query] ( returning (cit.) a HUGE amount of data )
Next-a-local-[CPU-process] ( of the (cit.) HUGE amount of data just returned here )
The latency of the first could be masked ( if it were permitted, but it is not permitted - due to a will not to overload the DB-host ), the latency of the second could not ( all it can overlap with is a next I/O-bound DB-Query, and only if that does not violate the rule of keeping the DB-machine under but a mild workload ).
As I believe you have noticed, if I could run the data parsing on a thread, I could download the next batch of data while the parsing happens.
It is high time to make things clear and sound :
Facts :
A )
The CPU-bound tasks will never run faster in whatever number N of threads in the python GIL-lock controlled ecosystem ( since ever and forever, as Guido van Rossum has expressed ), as the GIL-lock enforces a re-[SERIAL]-isation: the more threads "work", the more threads actually wait to acquire the GIL-lock, before they "get" it for but a 1 / ( N + 1 )-th fraction of the resulting duration of N * ( 30 - 60 ) [sec], which the GIL-lock policing makes pure-[SERIAL] again.
B )
It makes no sense to off-load the I/O-bound task into a full process-based, concurrent execution, as that means a full copy of the python process ( on Windows also duplicating the whole python interpreter state with all its data during the sub-process instantiation ), while there are smarter techniques for I/O-bound processing ( where the GIL-lock does not hurt so much ).
C )
The whole concept of N-parsing : 1-querying is principally wrong - the maximum achievable goal is to mask the latency of the I/O-process ( where that makes sense ), yet here each and every query takes those said ~ 40-50 [sec], so no second pack-of-data to parse will ever be present before those ~ 40-50 [sec] have run the next time, so no second worker will ever get anything to parse before T0 + ~ 80~100 [sec]. A dream of having N-(unbound)-workers working ( yet actually having 'em just wait for data ) is possible, but awfully anti-productive ( the worse for N-(GIL-MUTEX-ed)-"waiting"-agents ).
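The argument in C ) can be checked with a little arithmetic. The sketch below is a simplified model, assuming constant download and parse times and one query at a time; the function name makespan and its parameters are made up for the illustration. It computes when the last parse finishes for a given number of parsing workers.

```python
def makespan(n_batches, download_s, parse_s, parsers=1):
    """Finish time of the last parse when downloads run strictly one after
    another and each finished download goes to the earliest-free parser."""
    free_at = [0.0] * parsers                   # time at which each parser is idle again
    finish = 0.0
    for i in range(1, n_batches + 1):
        ready = i * download_s                  # i-th download completes here
        k = min(range(parsers), key=free_at.__getitem__)
        start = max(ready, free_at[k])          # wait for the data or for the parser
        free_at[k] = start + parse_s
        finish = max(finish, free_at[k])
    return finish

serial = 10 * (45 + 45)
print(serial, makespan(10, 45, 45, parsers=1), makespan(10, 45, 45, parsers=2))
print(makespan(10, 40, 60, parsers=1), makespan(10, 40, 60, parsers=2))
```

When parsing is no slower than downloading, one parser already hides all of the parsing behind the downloads and a second one adds nothing. Only when parsing is the slower stage does a second parser help, and even then the single serial downloader sets the floor.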
I am running a backtest for a trading strategy, defined as a class. I am trying to select the best combination of parameters to input in the model, so I am running multiple backtesting on a given period, trying out different combinations. The idea is to be able to select the first generation of a population to feed into a genetic algorithm. Seems like the perfect job for multiprocessing!
So I tried a bunch of things to see what works faster. I opened 10 Spyder consoles (yes, I tried it) and ran a single combination of parameters for each console (all running at the same time).
The sample code used for each single Spyder console:
class MyStrategy:
    # my strategy that runs on a single day
    def __init__(self, day, parameters):
        ...

backtesting=[]
for day in days:
    backtesting_day=MyStrategy(day,single_parameter_combi)
    backtesting.append(backtesting_day)
I then tried the multiprocessing way, using pool.
The sample code used in multiprocessing:
import multiprocessing

class MyStrategy:
    # my strategy that runs on a single day
    def __init__(self, day, parameters):
        ...

def single_run_backtesting(single_parameter_combi):
    backtesting=[]
    for day in days:
        backtesting_day=MyStrategy(day,single_parameter_combi)
        backtesting.append(backtesting_day)
    return backtesting

def backtest_many(list_of_parameter_combinations):
    p=multiprocessing.Pool()
    result=p.map(single_run_backtesting,list_of_parameter_combinations)
    p.close()
    p.join()
    return result

if __name__ == '__main__':
    parameter_combis=[...] # a list of parameter combinations, 10 different ones in this case
    result = backtest_many(parameter_combis)
I have also tried the following: opening 5 Spyder consoles and running 2 instances of the class in a for loop, as below, and a single Spyder console with 10 instances of the class.
class MyStrategy:
    # my strategy that runs on a single day
    def __init__(self, day, parameters):
        ...

parameter_combis=[...] # a list of parameter combinations
backtest_dict={k: [] for k in range(len(parameter_combis))} # make a dictionary of empty lists
for day in days:
    for j,single_parameter_combi in enumerate(parameter_combis):
        backtesting_day=MyStrategy(day,single_parameter_combi)
        backtest_dict[j].append(backtesting_day)
To my great surprise, it takes around 25 minutes with multiprocessing to go through a single day, about the same time with a single Spyder console with 10 instances of a class in the for loop, and magically it takes only 15 minutes when I run 10 Spyder consoles at the same time. How do I process this information? It doesn't really make sense to me. I am running a 12-CPU machine on Windows 10.
Consider that I am planning to run things on AWS with a 96-core machine, with something like 100 combinations of parameters that cross in a genetic algorithm which should run something like 20-30 generations (a full backtesting is 2 business months = 44 days).
My question is: what am I missing??? Most importantly, is this just a difference in scale?
I know that for example if you define a simple squaring function and run it serially for 100 times, multiprocessing is actually slower than a for loop. You start seeing the advantage around 10000 times, see for example this: https://github.com/vprusso/youtube_tutorials/blob/master/multiprocessing_and_threading/multiprocessing/multiprocessing_pool.py
Will I see a difference in performance when I go up to 100 combinations with multiprocessing, and is there any way of knowing in advance if this is the case? Am I writing the code properly? Other ideas? Do you think it would speed up significantly if I were to use multiprocessing one step "above", in a single parameter combination over many days?
To expand upon my comment "Try p.imap_unordered().":
p.map() ensures that you get the results in the same order they're in the parameter list. To achieve this, some of the workers necessarily remain idle for some time.
For your use case – essentially a grid search of parameter combinations – you really don't need to have them in the same order, you just want to end up with the best option. (Additionally, quoth the documentation, "it may cause high memory usage for very long iterables. Consider using imap() or imap_unordered() with explicit chunksize option for better efficiency.")
p.imap_unordered(), by contrast, doesn't really care – it just queues things up and workers work on them as they free up.
It's also worth experimenting with the chunksize parameter – quoting the imap() documentation, "For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1." (since you spend less time queueing and synchronizing things).
Finally, for your particular use case, you might want to consider having the master process generate an infinite amount of parameter combinations using a generator function, and breaking off the loop once you find a good enough solution or enough time passes.
A simple-ish function to do this and a contrived problem (finding two random numbers 0..1 to maximize their sum) follows. Just remember to return the original parameter set from the worker function too, otherwise you won't have access to it! :)
import random
import multiprocessing
import time

def find_best(*, param_iterable, worker_func, metric_func, max_time, chunksize=10):
    best_result = None
    best_metric = None
    start_time = time.time()
    n_results = 0
    with multiprocessing.Pool() as p:
        for result in p.imap_unordered(worker_func, param_iterable, chunksize=chunksize):
            n_results += 1
            elapsed_time = time.time() - start_time
            metric = metric_func(result)
            if best_metric is None or metric > best_metric:
                print(f'{elapsed_time}: Found new best solution, metric {metric}')
                best_metric = metric
                best_result = result
            if elapsed_time >= max_time:
                print(f'{elapsed_time}: Max time reached.')
                break
    final_time = time.time() - start_time
    print(f'Searched {n_results} results in {final_time} s.')
    return best_result

# ------------

def generate_parameter():
    return {'a': random.random(), 'b': random.random()}

def generate_parameters():
    while True:
        yield generate_parameter()

def my_worker(parameters):
    return {
        'parameters': parameters,  # remember to return this too!
        'value': parameters['a'] + parameters['b'],  # our maximizable metric
    }

def my_metric(result):
    return result['value']

def main():
    result = find_best(
        param_iterable=generate_parameters(),
        worker_func=my_worker,
        metric_func=my_metric,
        max_time=5,
    )
    print(f'Best result: {result}')

if __name__ == '__main__':
    main()
An example run:
~/Desktop $ python3 so59357979.py
0.022627830505371094: Found new best solution, metric 0.5126700311039976
0.022940874099731445: Found new best solution, metric 0.9464256914062249
0.022969961166381836: Found new best solution, metric 1.2946600313637404
0.02298712730407715: Found new best solution, metric 1.6255217652861256
0.023016929626464844: Found new best solution, metric 1.7041449687571075
0.02303481101989746: Found new best solution, metric 1.8898109980050104
0.030200958251953125: Found new best solution, metric 1.9031436071918972
0.030324935913085938: Found new best solution, metric 1.9321951916206537
0.03880715370178223: Found new best solution, metric 1.9410837287942249
0.03970479965209961: Found new best solution, metric 1.9649277383314245
0.07829880714416504: Found new best solution, metric 1.9926667738329622
0.6105098724365234: Found new best solution, metric 1.997217792614364
5.000051021575928: Max time reached.
Searched 621931 results in 5.07216 s.
Best result: {'parameters': {'a': 0.997483, 'b': 0.999734}, 'value': 1.997217}
(By the way, this is nearly 6 times slower when chunksize=1.)
I've implemented a genetic search algorithm and tried to parallelise it, but I am getting terrible performance (worse than single-threaded). I suspect this is due to communication overhead.
I have provided pseudo-code below, but in essence the genetic algorithm creates a large pool of "Chromosome" objects, then runs many iterations of:
1) Score each individual chromosome based on how it performs in a 'world.' The world remains static across iterations.
2) Randomly select a new population based on the scores calculated in the previous step
3) Go to step 1 for n iterations
The scoring algorithm (step 1) is the major bottleneck, hence it seemed natural to distribute out the processing of this code.
I have run into a couple of issues I hoped I could get help with:
How can I link the calculated score with the object that was passed to the scoring function by map(), i.e. link each Future holding a score back to a Chromosome? I've done this in a very clunky way by having the calculate_scores() method return the object, but in reality all I need is to send a float back if there is a better way to maintain the link.
The parallel processing of the scoring function is working okay, though takes a long time for map() to iterate through all the objects. However, the subsequent calls to draw_chromosome_from_pool() run very slowly compared to the single-threaded version to the point that I've not yet seen it complete. I have no idea what is causing this as the method always completes quickly in the single-threaded version. Is there some IPC going on to pull the chromosomes back to the local process, even after all the futures have completed? Is the local process de-prioritised in some way?
I am worried that the overall iterative nature of building/rebuilding the pool each cycle is going to cause an enormous amount of data transmission to the workers. The question at the root of this concern: what and when does Dask actually send data back and forth to the worker pool. i.e. when does Environment() get distributed out vs. Chromosome(), and how/when do results come back? I've read the docs but either haven't found the right detail, or am too stupid to understand.
Ideally, I think (but am open to correction) what I want is a distributed architecture where each worker holds the Environment() data locally on a 'permanent' basis, and the Chromosome() instance data is then distributed for scoring with little duplicated back-and-forth of unchanged Chromosome() data between iterations.
Very long post, so if you have taken the time to read this, thank you already!
import numpy as np
from dask.distributed import Client, as_completed

class Chromosome(object):    # Small size: several hundred bytes per instance
    def get_score(self):
        ...                  # Returns a float
    def set_score(self, i):
        ...                  # Stores a float

class Environment(object):   # Large size: 20-50Mb per instance, but only one instance
    def calculate_scores(self, chromosome):
        # Slow calculation using attributes from chromosome and instance data
        x = ...              # the computed score
        chromosome.set_score(x)
        return chromosome

class Evolver(object):
    def draw_chromosome_from_pool(self, max_score):
        while True:
            individual = np.random.choice(self.chromosome_pool)
            selection_chance = np.random.uniform()
            if selection_chance < individual.get_score() / max_score:
                return individual

    def run_evolution(self):
        self.dask_client = Client()
        self.chromosome_pool = list()
        for i in range(10000):
            self.chromosome_pool.append( Chromosome() )
        world_data = LoadWorldData()    # Returns a pandas Dataframe
        self.world = Environment(world_data)
        iterations = 1000
        for i in range(iterations):
            futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
            highest_score = 0.0
            for future in as_completed(futures):
                c = future.result()
                highest_score = max(highest_score, c.get_score())
            new_pool = set()
            while len(new_pool)<self.pool_size:
                mother = self.draw_chromosome_from_pool(highest_score)
                # do stuff to build a new pool
Yes, each time you call the line
futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
you are serialising self.world, which is large. You could do this just once before the loop with
future_world = self.dask_client.scatter(self.world, broadcast=True)
and then in the loop
futures = self.dask_client.map(lambda ch: Environment.calculate_scores(future_world, ch), self.chromosome_pool)
will use the copies already on the workers (or a simple function that does the same). The point is that future_world is just a pointer to stuff already distributed, but dask takes care of this for you.
On the issue of which chromosome is which: using as_completed breaks the order that you submitted them to map, but this is not necessary for your code. You could have used wait to process when all the work was done, or simply iterate over the future.result()s (which will wait for each task to be done), and then you will retain the ordering in the chromosome_pool.
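The ordering point can be illustrated without a cluster. The sketch below swaps in the standard library's ThreadPoolExecutor for Dask, purely to show the pattern: the worker function returns only a float, results are consumed in submission order, and zip() restores the link to the local objects. The Chromosome fields and calculate_score() are hypothetical. With Dask, gathering the futures list in its original order (for example via Client.gather) plays the same role. Note that process-based workers typically operate on copies, so a score set on a worker does not appear on the local object, which is why the floats are written back locally here.

```python
from concurrent.futures import ThreadPoolExecutor

class Chromosome:
    def __init__(self, genes):
        self.genes = genes
        self.score = None

def calculate_score(genes):
    # Hypothetical scoring function: receives plain data, returns only a float
    return float(sum(genes))

pool = [Chromosome([1, 2]), Chromosome([3, 4]), Chromosome([0, 0])]

with ThreadPoolExecutor(max_workers=2) as executor:
    # Executor.map yields results in submission order, whichever task finishes first
    scores = list(executor.map(calculate_score, [c.genes for c in pool]))

for chromosome, score in zip(pool, scores):
    chromosome.score = score          # write the result back on the local object

highest_score = max(scores)
print([c.score for c in pool], highest_score)
```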
I defined two correct ways of calculating averages in python.
def avg_regular(values):
    total = 0
    for value in values:
        total += value
    return total/len(values)

def avg_concurrent(values):
    mean = 0
    num_of_values = len(values)
    for value in values:
        #calculate a small portion of the average for each num and add to the total
        mean += value/num_of_values
    return mean
The first function is the regular way of calculating averages, but I wrote the second one because each run of the loop doesn't depend on previous runs. So theoretically the average can be computed in parallel.
However, the "parallel" one (without running in parallel) takes about 30% more time than the regular one.
Are my assumptions correct and worth the speed loss?
If yes, how can I make the second function run in parallel?
If not, where did I go wrong?
The code you implemented is basically the difference between (a1+a2+ ... + an) / n and (a1/n + a2/n + ... + an/n). The result is the same, but in the second version there are more operations (namely (n-1) more divisions) which slows the calculation down. You claimed that in the second version each loop run is independent of the others. In the first loop we need the following information to finish one loop run: total before the run and the current value. In the second version we need the following information to finish one loop run: mean before the run, the current value and num_of_values. As you see in the second version we even depend on more values!
But how could we divide the work between cores (which is the goal of multiprocessing)? We could just give one core the first half of the values and the second core the second half, i.e. ((a1 + a2 + ... + a(n//2)) + (a(n//2 + 1) + ... + an)) / n. Yes, the work of dividing by n is not split between the cores, but it's a single instruction so we don't really care. Also we need to add the left total and the right total, which we can't split, but again it's only a single operation.
So the code we want to run:
def my_sum(values):
    total = 0
    for value in values:
        total += value
    return total
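The splitting argument can be checked serially before involving any processes. This sketch cuts the list into chunks, sums each chunk with my_sum(), and combines the partial sums with one addition pass and one division; the helper name chunked_mean is made up for the illustration, and the per-chunk step is the only part that would be handed to separate workers.

```python
def my_sum(values):
    total = 0
    for value in values:
        total += value
    return total

def chunked_mean(values, n_chunks):
    size = -(-len(values) // n_chunks)                    # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    partial_sums = [my_sum(chunk) for chunk in chunks]    # independent per chunk
    return sum(partial_sums) / len(values)                # combine once, divide once

values = list(range(1, 101))
print(chunked_mean(values, 2), my_sum(values) / len(values))
```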
There's still a problem with python - normally one could use threads to do the computations, because each thread will use one core. But in that case one has to take care that your program does not run into race conditions, and the python interpreter itself also needs to take care of that. CPython decided it's not worth it and basically only runs in one thread at a time. A basic solution is to use multiple processes via multiprocessing.
from multiprocessing import Pool

if __name__ == '__main__':
    half = len(long_list) // 2
    with Pool(5) as p:
        results = p.map(my_sum, [long_list[:half], long_list[half:]])
    print(sum(results) / len(long_list))  # add subresults and divide by n
But of course multiple processes do not come for free. You need to fork, copy stuff, etc., so you will not gain the speedup of 2 one might expect. Also, the biggest slowdown is actually python itself: it's not really optimized for fast numerical computations. There are various ways around that, but using numpy is probably the simplest. Just use:
import numpy
print(numpy.mean(long_list))
This is probably much faster than the pure python version. I don't think numpy uses multiprocessing internally, so one could gain a boost by combining multiple processes with a fast implementation (numpy or something else written in C), but normally numpy is fast enough.