I defined two correct ways of calculating averages in Python.
def avg_regular(values):
    total = 0
    for value in values:
        total += value
    return total/len(values)

def avg_concurrent(values):
    mean = 0
    num_of_values = len(values)
    for value in values:
        # calculate a small portion of the average for each num and add to the total
        mean += value/num_of_values
    return mean
The first function is the regular way of calculating averages, but I wrote the second one because each run of the loop doesn't depend on previous runs. So theoretically the average can be computed in parallel.
However, the "parallel" one (without running in parallel) takes about 30% more time than the regular one.
Are my assumptions correct, and is the idea worth the speed loss?
If yes, how can I make the second function run in parallel?
If not, where did I go wrong?
The code you implemented is basically the difference between (a1+a2+ ... + an) / n and (a1/n + a2/n + ... + an/n). The result is the same, but in the second version there are more operations (namely (n-1) more divisions) which slows the calculation down. You claimed that in the second version each loop run is independent of the others. In the first loop we need the following information to finish one loop run: total before the run and the current value. In the second version we need the following information to finish one loop run: mean before the run, the current value and num_of_values. As you see in the second version we even depend on more values!
But how could we divide the work between cores (which is the goal of multiprocessing)? We could just give one core the first half of the values and another core the second half, i.e. ((a1 + a2 + ... + a(n//2)) + (a(n//2+1) + ... + a(n))) / n. Yes, the division by n is not split between the cores, but it's a single instruction, so we don't really care. We also need to add the left total and the right total, which we can't split, but again that's only a single operation.
So the code we want to run:
def my_sum(values):
    total = 0
    for value in values:
        total += value
    return total
There's still a problem with Python. Normally one could use threads to do the computations, because each thread will use one core. But in that case the program has to take care not to run into race conditions, and the Python interpreter itself also needs to take care of that. CPython decided it's not worth it and basically only runs Python code in one thread at a time. A basic solution is to use multiple processes via multiprocessing.
from multiprocessing import Pool

if __name__ == '__main__':
    long_list = list(range(1_000_000))  # example input (an assumption); use your own data
    with Pool(5) as p:
        results = p.map(my_sum, [long_list[0:len(long_list)//2], long_list[len(long_list)//2:]])
    print(sum(results) / len(long_list))  # add subresults and divide by n
But of course multiple processes do not come for free. You need to fork, copy data, etc., so you will not gain the speedup of 2 that one might expect. Also, the biggest slowdown is actually Python itself; it's not really optimized for fast numerical computations. There are various ways around that, but using numpy is probably the simplest. Just use:
import numpy
print(numpy.mean(long_list))
This is probably much faster than the pure Python version. I don't think numpy uses multiprocessing internally, so one could gain a further boost by combining multiple processes with a fast implementation (numpy or something else written in C), but normally numpy is fast enough.
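To make the split-and-combine idea from this answer concrete for any number of chunks, here is a minimal sketch. The name chunked_mean and its map_func parameter are illustrative assumptions, not part of the original answer. By default it runs serially with the built-in map, which is enough to check the arithmetic.

```python
def chunked_mean(values, n_chunks=2, map_func=map):
    """Mean of `values`, computed from per-chunk partial sums.

    `map_func` is a hypothetical hook: the built-in `map` runs serially,
    while something like `multiprocessing.Pool.map` would spread the
    partial sums over worker processes.
    """
    n = len(values)
    if n == 0:
        raise ValueError("cannot average an empty sequence")
    size = -(-n // n_chunks)  # ceiling division, so no element is dropped
    chunks = [values[i:i + size] for i in range(0, n, size)]
    partial_sums = list(map_func(sum, chunks))  # the only part worth parallelizing
    return sum(partial_sums) / n                # one addition pass and one division


print(chunked_mean([1, 2, 3, 4, 5, 6], n_chunks=3))  # 3.5
```

Passing p.map from a `with Pool() as p:` block as map_func should distribute the partial sums, but whether that beats the serial version depends on how expensive it is to ship the chunks to the workers.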
Related
I am running a backtest for a trading strategy, defined as a class. I am trying to select the best combination of parameters to input in the model, so I am running multiple backtesting on a given period, trying out different combinations. The idea is to be able to select the first generation of a population to feed into a genetic algorithm. Seems like the perfect job for multiprocessing!
So I tried a bunch of things to see what works faster. I opened 10 Spyder consoles (yes, I tried it) and ran a single combination of parameters for each console (all running at the same time).
The sample code used for each single Spyder console:
class MyStrategy(day,parameters):
    # my strategy that runs on a single day

backtesting=[]
for day in days:
    backtesting_day=MyStrategy(day,single_parameter_combi)
    backtesting.append(backtesting_day)
I then tried the multiprocessing way, using pool.
The sample code used in multiprocessing:
import multiprocessing

class MyStrategy(day,parameters):
    # my strategy that runs on a single day

def single_run_backtesting(single_parameter_combi):
    backtesting=[]
    for day in days:
        backtesting_day=MyStrategy(day,single_parameter_combi)
        backtesting.append(backtesting_day)
    return backtesting

def backtest_many(list_of_parameter_combinations):
    p=multiprocessing.Pool()
    result=p.map(single_run_backtesting,list_of_parameter_combinations)
    p.close()
    p.join()
    return result

if __name__ == '__main__':
    parameter_combis=[...] # a list of parameter combinations, 10 different ones in this case
    result = backtest_many(parameter_combis)
I have also tried the following: opening 5 Spyder consoles and running 2 instances of the class in a for loop, as below, and a single Spyder console with 10 instances of the class.
class MyStrategy(day,parameters):
    # my strategy that runs on a single day

parameter_combis=[...] # a list of parameter combinations
backtest_dict={k: [] for k in range(len(parameter_combis))} # make a dictionary of empty lists
for day in days:
    for j,single_parameter_combi in enumerate(parameter_combis):
        backtesting_day=MyStrategy(day,single_parameter_combi)
        backtest_dict[j].append(backtesting_day)
To my great surprise, it takes around 25 minutes with multiprocessing to go through a single day, about the same time as a single Spyder console with 10 instances of the class in the for loop, and magically it takes only 15 minutes when I run 10 Spyder consoles at the same time. How do I process this information? It doesn't really make sense to me. I am running a 12-CPU machine on Windows 10.
Consider that I am planning to run things on AWS with a 96-core machine, with something like 100 combinations of parameters that cross in a genetic algorithm which should run something like 20-30 generations (a full backtesting is 2 business months = 44 days).
My question is: what am I missing??? Most importantly, is this just a difference in scale?
I know that for example if you define a simple squaring function and run it serially for 100 times, multiprocessing is actually slower than a for loop. You start seeing the advantage around 10000 times, see for example this: https://github.com/vprusso/youtube_tutorials/blob/master/multiprocessing_and_threading/multiprocessing/multiprocessing_pool.py
Will I see a difference in performance when I go up to 100 combinations with multiprocessing, and is there any way of knowing in advance whether this is the case? Am I writing the code properly? Other ideas? Do you think it would speed up significantly if I were to use multiprocessing one step "above", in a single parameter combination over many days?
To expand upon my comment "Try p.imap_unordered().":
p.map() ensures that you get the results in the same order they're in the parameter list. To achieve this, some of the workers necessarily remain idle for some time.
For your use case – essentially a grid search of parameter combinations – you really don't need to have them in the same order, you just want to end up with the best option. (Additionally, quoth the documentation, "it may cause high memory usage for very long iterables. Consider using imap() or imap_unordered() with explicit chunksize option for better efficiency.")
p.imap_unordered(), by contrast, doesn't really care – it just queues things up and workers work on them as they free up.
It's also worth experimenting with the chunksize parameter – quoting the imap() documentation, "For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1." (since you spend less time queueing and synchronizing things).
Finally, for your particular use case, you might want to consider having the master process generate an infinite amount of parameter combinations using a generator function, and breaking off the loop once you find a good enough solution or enough time passes.
A simple-ish function to do this and a contrived problem (finding two random numbers 0..1 to maximize their sum) follows. Just remember to return the original parameter set from the worker function too, otherwise you won't have access to it! :)
import random
import multiprocessing
import time


def find_best(*, param_iterable, worker_func, metric_func, max_time, chunksize=10):
    best_result = None
    best_metric = None
    start_time = time.time()
    n_results = 0
    with multiprocessing.Pool() as p:
        for result in p.imap_unordered(worker_func, param_iterable, chunksize=chunksize):
            n_results += 1
            elapsed_time = time.time() - start_time
            metric = metric_func(result)
            if best_metric is None or metric > best_metric:
                print(f'{elapsed_time}: Found new best solution, metric {metric}')
                best_metric = metric
                best_result = result
            if elapsed_time >= max_time:
                print(f'{elapsed_time}: Max time reached.')
                break
    final_time = time.time() - start_time
    print(f'Searched {n_results} results in {final_time} s.')
    return best_result

# ------------

def generate_parameter():
    return {'a': random.random(), 'b': random.random()}


def generate_parameters():
    while True:
        yield generate_parameter()


def my_worker(parameters):
    return {
        'parameters': parameters,  # remember to return this too!
        'value': parameters['a'] + parameters['b'],  # our maximizable metric
    }


def my_metric(result):
    return result['value']


def main():
    result = find_best(
        param_iterable=generate_parameters(),
        worker_func=my_worker,
        metric_func=my_metric,
        max_time=5,
    )
    print(f'Best result: {result}')


if __name__ == '__main__':
    main()
An example run:
~/Desktop $ python3 so59357979.py
0.022627830505371094: Found new best solution, metric 0.5126700311039976
0.022940874099731445: Found new best solution, metric 0.9464256914062249
0.022969961166381836: Found new best solution, metric 1.2946600313637404
0.02298712730407715: Found new best solution, metric 1.6255217652861256
0.023016929626464844: Found new best solution, metric 1.7041449687571075
0.02303481101989746: Found new best solution, metric 1.8898109980050104
0.030200958251953125: Found new best solution, metric 1.9031436071918972
0.030324935913085938: Found new best solution, metric 1.9321951916206537
0.03880715370178223: Found new best solution, metric 1.9410837287942249
0.03970479965209961: Found new best solution, metric 1.9649277383314245
0.07829880714416504: Found new best solution, metric 1.9926667738329622
0.6105098724365234: Found new best solution, metric 1.997217792614364
5.000051021575928: Max time reached.
Searched 621931 results in 5.07216 s.
Best result: {'parameters': {'a': 0.997483, 'b': 0.999734}, 'value': 1.997217}
(By the way, this is nearly 6 times slower when chunksize=1.)
I'm having a bit of trouble with Python's multiprocessing.Pool. I have two numpy arrays, a and b, with
a.shape=(10000,3)
and
b.shape=(1000000000,3)
Then I have a function which does some computation like
def role(array, point):
    sub = array-point
    return (1/(np.sqrt(np.min(np.sum(sub*sub, axis=-1)))+0.001)**2)
Next, I need to compute
[role(a, point) for point in b]
To speed it up, I try to use
cpu_num = 4
m = multiprocessing.Pool(cpu_num)
cost_list = m.starmap(role, [(a, point) for point in b])
m.close()
The whole process takes around 70 s, but if I set cpu_num = 1, the processing time decreases to 60 s. My laptop has 6 cores, for reference.
Here I have two questions:
Is there something I did wrong with multiprocessing.Pool? Why does the processing time increase if I set cpu_num = 4?
For a task like this (each loop iteration is a very tiny process), should I use multiprocessing to speed it up? I feel like filling the Pool takes longer each time than running the function role...
Any suggestions are really welcome.
Multiprocessing comes with some overhead (to create new processes), which is why it's not a very good choice when you have lots of tiny tasks, where the overhead of process creation might outweigh the benefit of parallelizing.
Have you considered vectorizing your problem?
In particular, if you broadcast the variable b you get there:
sub = a - b[:, np.newaxis] # broadcast b
1./(np.sqrt(np.min(np.sum(sub**2, axis=2), axis=-1))+0.001)**2
I believe you could then still reduce the complexity of the last expression a bit, as you're creating the square of a square root, which seems redundant (note that I'm assuming the 0.001 constant value is merely there to avoid some non-sensible operation like division by zero).
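With b as large as in the question, the full broadcast would need an intermediate array of shape (len(b), len(a), 3), which is unlikely to fit in memory. A hedged sketch of the same vectorized idea applied to b in batches follows. The function name role_batched and the batch_size default are assumptions made for illustration; the formula itself is the one from the question.

```python
import numpy as np


def role_batched(a, b, batch_size=256):
    """Vectorized equivalent of [role(a, point) for point in b], done in batches.

    `batch_size` (an assumed tuning knob) bounds the temporary array at
    batch_size * len(a) * 3 floats.
    """
    out = np.empty(len(b))
    for start in range(0, len(b), batch_size):
        chunk = b[start:start + batch_size]                      # shape (m, 3)
        diff = a[np.newaxis, :, :] - chunk[:, np.newaxis, :]     # shape (m, len(a), 3)
        min_sq = np.min(np.sum(diff * diff, axis=-1), axis=-1)   # shape (m,)
        out[start:start + batch_size] = 1.0 / (np.sqrt(min_sq) + 0.001) ** 2
    return out
```

A larger batch_size means fewer Python-level iterations but a bigger temporary array, so the best value probably has to be found by measurement.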
If the tasks are too tiny, then the multiprocessing overhead will be your bottleneck and you will gain nothing.
If the amount of data per task that you have to pass to a worker, or that the worker has to return, is large, then you will also not gain much (or even gain nothing).
If you have 10,000 tiny tasks, then I recommend creating a list of meta tasks.
Each meta task would consist of executing for example 20 tiny tasks.
meta_tasks = []
for idx in range(0, len(tiny_tasks), 20):
    meta_tasks.append(tiny_tasks[idx:idx+20])
Then pass the meta tasks to your worker pool.
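A minimal sketch of what the worker side of this could look like. The helper names (make_meta_tasks, run_meta_task) and the stand-in tiny task square are hypothetical. The built-in map is used here so the flow is easy to follow; a pool's map method would take its place in a real run.

```python
from itertools import chain


def square(x):
    """Stand-in for one tiny task (an assumption for this example)."""
    return x * x


def make_meta_tasks(tiny_tasks, batch_size=20):
    """Group the tiny tasks into lists of at most `batch_size` items."""
    return [tiny_tasks[i:i + batch_size] for i in range(0, len(tiny_tasks), batch_size)]


def run_meta_task(meta_task):
    """Worker function: handles a whole batch, so the dispatch overhead is paid once per batch."""
    return [square(x) for x in meta_task]


meta_tasks = make_meta_tasks(list(range(100)), batch_size=20)
# Serial stand-in for pool.map(run_meta_task, meta_tasks)
batched_results = map(run_meta_task, meta_tasks)
results = list(chain.from_iterable(batched_results))  # flatten back to one result per tiny task
```

This is roughly what the chunksize argument of Pool.map does internally, so setting chunksize may be a simpler alternative to batching by hand.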
I have two functions. Each function runs a for loop.
def f1(df1, df2):
    final_items = []
    for ind, row in df1.iterrows():
        id = row['Id']
        some_num = row['some_num']
        timestamp = row['Timestamp']
        res = f2(df=df2, id=id, some_num=some_num, timestamp=timestamp)
        final_items.append(res)
    return final_items

def f2(df, id, some_num, timestamp):
    for ind, row in df.iterrows():
        filename = row['some_filename']
        dfx = reader(key=filename) # User defined; object reader
        # Assign variables
        st_ID = dfx["Id"]
        st_some_num = dfx["some_num"]
        st_time_first = dfx['some_first_time_variable']
        st_time_last = dfx['some_last_time_variable']
        if id == st_ID and some_num == st_some_num:  # compare against the id argument
            if st_time_first <= timestamp and st_time_last >= timestamp:
                return filename
            else:
                return None
        else:
            continue
The first function calls the second function as shown. The first loop runs 2000 times, i.e., there are 2000 rows in the first dataframe.
The second function (the one that is called from f1()) runs 10 million times.
My objective is to speed up f2() using parallel processing. I have tried using Python packages like multiprocessing and Ray, but I am new to the world of parallel processing and am running into a lot of roadblocks due to lack of experience.
Can someone help me speed up the function so that it takes considerably less time to execute for 10 million rows?
FACTS : initial formulation asks 2E3 rows in f1() to request f2() to scan 1E7 rows in "shared" df2,so as to get called an unspecified reader()-process to receive some other data to decide about further processing or return
My objective is to speed up f2() using parallel processing
Can someone help me speed up the function so that it takes considerably less time to execute for 10 million rows?
Surprise No.1 : This is NOT a use-case of parallel-processing
The problem, as-is formulated above, calls many times file-I/O operations, that are never true-[PARALLEL] down there on the physical storage level, are they? Never. Any and all smart file-I/O-(pre)-caching and sliding-window file-I/O tricks cease to help on even moderate levels of a just-[CONCURRENT] workloads and often wreak havoc if going a single step beyond that principal workload ceiling due to physically limited scope of memory resources and I/O-bus width x speed and the weakest chain element's latency increasing under still growing traffic-loads.
The workflow controlling iterators are pure-[SERIAL] "Work Dispatchers" that sequentially step through their domain of values, one after another, and order just another file to get ( again iteratively ) processed.
Surprise No.2 : Vectorisation will NOT help
While vectorised operations are smart for many vector/matrix/tensor processing schemes ( love using numpy + numba ), the Condicio Sine Qua Non is, that the problem has to be:
"compact" - so that it gets easily expressed by vectorising syntax-tricks, which this original [SERIAL]-row-after-row-after-row to find a first and only first "device_ID match" in a "remote"-file-content, next return None if not ( <exprA> and <exprB> ) else filename
"uniform", i.e. non-sequential "until" something first happens - the vectorisation is great to "cover" the whole N-dimensional space with smart-internal code for (best) orthogonal-sub-structures processing uniformly "across" the whole space. On the contrary here, the vectorisation is hard to re-sequentialise "back" to stop (poison) it from any further smart-producing results right after the first occurrence was matched... (ref.1 above "find first and only first occurrence ( and die / return ) )
"memory-adequately-sized", i.e. given any add-on logic is added to the vectorised task, whenever a code asks vectorisation engine to process N-dim "data" using some sort of where(...)-clause, the interim product of such where(...)-condition is consuming additional [SPACE]-footprint ( best in RAM, worse in SWAP-file-I/O ) and this additional memory-footpring may soon devastate any and all benefits from the idea of vectorised processing re-formulation ( not speaking about the cases that due to such immense additional memory-allocation needs result but in a swap-file-I/O suffocation of the whole process flow ) where(...)-clause over a 10E6 rows is expensive, the more once the global strategy is to execute that 1 < nCPUs < 2E3 many times ( as noted above, vectorisation goes uniformly "across" the whole range of data, no sequentially beneficial shortcuts to stop after a first and only the first match... )
THE BEST NEXT STEP : dependency-graph -> latencies -> bottleneck
The problem as-is formulated above is a just-[CONCURRENT] processing, where the actual blocking or availability of "shared" resources' usage limits the overall processing duration. Having no more than a given set of resources to use, there are no magic chances to speed-up the concurrent usage patterns for faster processing. Thus the "amounts" of free-resources to harness and their respective response-"latencies" are what matter: surely those under high levels of concurrent workloads, not the idealistic, unloaded response times.
If you have no profiling data, measure/benchmark at least the main characteristic durations:
a) the net f2()-per-row process latency [ min, Avg, MAX, StDev] in [us]
b) the reader()-related setup/retrieve latency [ min, Avg, MAX, StDev] in [us]
test, whether the reader()'s performance represents or not a bottleneck - a ceiling for the any-increased-concurrency operated process-flow
If it does, you get the maximum workload it can handle and, based on this, the concurrent processing may speed things up only as far as this reader()-determined performance ceiling.
All the rest is elementary.
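The measurement step suggested above could be sketched as follows. The helper name measure_latency_us is an assumption for illustration, and func stands for whatever is being timed, such as the question's reader() or a single f2() row step, neither of which is shown here.

```python
import statistics
import time


def measure_latency_us(func, calls):
    """Time func(*args) for every args tuple in `calls`.

    Returns min / avg / max / stdev in microseconds, matching the
    [ min, Avg, MAX, StDev ] figures asked for above.
    """
    samples = []
    for args in calls:
        t0 = time.perf_counter_ns()
        func(*args)
        samples.append((time.perf_counter_ns() - t0) / 1_000)  # ns -> us
    return {
        "n": len(samples),
        "min": min(samples),
        "avg": statistics.mean(samples),
        "max": max(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


# Example with a trivial stand-in workload instead of the real reader():
stats = measure_latency_us(lambda k: sum(range(k)), [(1000,)] * 50)
print(stats)
```

To see how the resource behaves under load, the same measurement would have to be repeated while several callers hit it at once, since the unloaded numbers alone say little about the ceiling.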
Epilogue
Such latency-data engineered, (un)avoidable bottleneck-aware right-sized concurrent processing setup for a maximum Latency Masking is about the maximum one can expect here to help.
Given a chance to re-engineer and re-factor the global strategy, there might be much faster processing times, but that may come from other than a pure-[SERIAL] tandem of sequential iterators instructing the sequence of about ~ 20.000.000.000 calls to an unknown reader()-code.
Yet, that goes ways beyond the scope of this Stack Overflow MinCunVE-problem definition.
Hope this might have sparked some fresh views on how to make the results faster. Smart ideas may lead to processing times from a few days down to a few minutes (!). Having gone this way a few times, no one will believe how fulfilling this hard work may get both you and your customer(s), if you hit such a solution by designing the right-sized solution for their business domain.
I'm using the following algorithm to do some calculations on an array of Decimals:
import itertools
import operator
from decimal import Decimal
from functools import reduce
fkn = Decimal('0')
for bits in itertools.combinations(decimals_array, elements_count):
    kxn = reduce(operator.mul, bits, Decimal('1'))
    fkn += kxn
I'm using Python 3.4 x64.
Decimals have precision>300 (it's a must).
len(decimals_array) is most of the time over 40.
elements_count is most of the time len(decimals_array)/2.
The calculations take a very long time.
I wanted to spread them over multiple processes, so at first I thought about building an array of all combinations and sending parts of that array to several processes, but while building such an array I quickly get a MemoryError exception.
Now I'm looking for a nicer way to make this code multiprocess.
What is a good way to run this algorithm on multiple cores?
Or maybe there is a better (faster) way to make such calculations?
Thank you in advance for some ideas.
In order to really parallelize this you need to get around combinations() being sequential so that each process can generate its own combinations. The rest of the problem is already parallelizable.
40 choose 20 is about 138 billion combinations so pre-generating that or generating it in each process is going to hurt. With a 20-element list taking around 224 bytes (says sys.getsizeof()) that's 30 something terabytes if you generate the whole thing in one go. No wonder you ran out of memory. You also can't really split a generator across processes; or rather, if you do, each process will get its own copy of the generator.
Solution 1 is to have a process whose sole job is to generate combinations and push them into a queue, possibly in batches to save on IPC overhead, and have the other processes consume combinations from that queue.
Solution 2 is to write a non-sequential version of combinations that returns the Nth combination without computing the rest. This is definitely possible because it's possible with permutations, and combinations are an internally sorted subset of permutations. Then each process in a Pool can generate its own based on a start and step of N - process one counts combination 0, 3, 6..., process two counts combination 1, 4, 7... and so on, for example. This would probably be even slower unless you use C/Cython though.
Solution 3 (or possibly solution 0?) is to go over to the math stackexchange and ask if there's a mathematical rather than computational solution to this problem.
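On the mathematical route: the quantity computed in the question, the sum over all k-element combinations of the product of their elements, is the k-th elementary symmetric polynomial of the values. It can be built up one value at a time with about n*k multiplications instead of enumerating every combination. The sketch below is one way to do that; the function name is an assumption, and because the operations happen in a different order, the last digits may differ slightly from the brute-force loop at a fixed Decimal precision.

```python
from decimal import Decimal


def sum_of_k_products(values, k):
    """Sum of the products of all k-element combinations of `values`.

    e[j] holds the sum of products over all j-element subsets of the
    values processed so far (the j-th elementary symmetric polynomial).
    """
    e = [Decimal(1)] + [Decimal(0)] * k
    for x in values:
        # Walk j downwards so that x contributes to each subset at most once.
        for j in range(k, 0, -1):
            e[j] += e[j - 1] * x
    return e[k]


vals = [Decimal(1), Decimal(2), Decimal(3), Decimal(4)]
print(sum_of_k_products(vals, 2))  # 35 = 1*2 + 1*3 + 1*4 + 2*3 + 2*4 + 3*4
```

For 40 values and k = 20 this is 800 multiply-add steps, so a single core would likely be enough. The required precision would still be set through decimal.getcontext().prec as before.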
Here is one solution although it isn't very neat.
The idea is to use multiple processes where one process is responsible for one interval. However, since itertools.combinations is sequential, each process has to loop over unnecessary combinations until it reaches the right interval. When the right interval is handled, the process stops. The code is from this book.
import itertools
from tqdm import tqdm
from math import factorial
from multiprocessing import Process

def total_combo(n, r):
    return factorial(n) // factorial(r) // factorial(n-r)

def cal_combo(var,noCombo,start,end):
    data = itertools.combinations(range(var),noCombo)
    for i in enumerate(tqdm(data)):
        if i[0] >= start:
            if i[0] < start+10: print(i)
        if i[0] > end: break

if __name__=='__main__':
    noCombo=3
    var=1000
    print(total_combo(var,noCombo),'combinations for',noCombo,'of',var,'variants')
    noProc=6
    interval=total_combo(var,noCombo)/noProc
    if interval%1==0:
        print(interval)
    procs=[]
    for pid in range(noProc):
        proc = Process(target=cal_combo, args=(var,noCombo, interval*pid, interval*(pid+1)))
        procs.append(proc)
        proc.start()
    for proc in procs:
        proc.join()
I have a Python script, but it takes more than 20 hours to run to completion.
Since my code is pretty big, I will post a simplified one.
The first part of the code:
flag = 1
mydic = {}
for i in mylist:
    mydic[flag] = myfunction(i)
    flag += 1
mylist has more than 700 entries, and each call to myfunction runs for around 20 seconds.
So I was wondering whether I can use parallel programming to split the iteration into two groups and run them simultaneously. Is that possible, and would it take half the time?
The second part of the code:
mymatrix = []
for n1 in range(1, flag):  # the keys of mydic run from 1 to flag - 1
    mat = []
    for n2 in range(1, flag):
        if n1 >= n2:
            mat.append(0)
        else:
            res = myfunction2(mydic[n1], mydic[n2])  # dict lookup, not a call
            mat.append(res)
    mymatrix.append(mat)
So, if mylist has 700 entries, I want to create a 700x700 upper triangular matrix. But myfunction2() needs around 30 seconds each time. I don't know if I can use parallel programming here too.
I cannot simplify myfunction() and myfunction2(), since they call an external API and return the results.
Do you have any suggestions for how I can change this to make it faster?
Based on your comments, I think it's very likely that the 30 seconds are mostly spent in external API calls. I would add some timing code to test which portions of your code are actually responsible for the slowness.
If it is from the external API calls, there are some easy fixes. The external API calls block, so you'll get a speedup if you can move to a parallel model ( though 30s of blocking sounds huge to me ).
I think it would be easiest to create a quick "task list" by having the output of 2 loops be a matrix of arguments to pass into a function. Then I'd pipe them into Celery to run the tasks. That should give you a decent speedup with a minimal amount of work.
You would probably save a lot more time with the threading or multiprocessing modules to run tasks (or sections) , or even write it all in Twisted python - but that usually takes longer than a simple celery function.
The one caveat with the Celery approach is that you'll be dispatching a lot of work - so you'll have to have some functionality to poll for results. That could be a while loop that just sleeps(10) and repeats itself until celery has a result for every task. If you do it in Twisted, you can access/track results on finish. I've never had to do something like this with multiprocessing, so don't know how that would fit in.
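For comparison, here is a hedged sketch of the threading route mentioned above, using the standard library's concurrent.futures. The two stand-in functions only imitate the question's myfunction and myfunction2 (the real ones call an external API), the name build_results and the max_workers default are assumptions, and results are indexed from 0 rather than through the question's 1-based dictionary.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def myfunction(item):
    """Hypothetical stand-in for the first blocking API call."""
    return item * 2


def myfunction2(x, y):
    """Hypothetical stand-in for the second blocking API call."""
    return x + y


def build_results(mylist, max_workers=8):
    """Run both stages with a thread pool; returns (results, upper-triangular matrix)."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Stage 1: one call per list entry; executor.map keeps the input order.
        results = list(executor.map(myfunction, mylist))
        n = len(results)
        # Stage 2: one call per index pair (i, j) with i < j.
        pairs = list(combinations(range(n), 2))
        values = executor.map(lambda p: myfunction2(results[p[0]], results[p[1]]), pairs)
        matrix = [[0] * n for _ in range(n)]
        for (i, j), value in zip(pairs, values):
            matrix[i][j] = value
    return results, matrix


print(build_results([1, 2, 3]))
```

Threads only help here because the time is assumed to be spent waiting on the network. The external API may also rate-limit concurrent requests, so max_workers would need to be chosen with that in mind.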
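How about using a generator for the second part instead of one of the for loops:
def fn():
    for n1 in range(1, flag):  # the keys of mydic run from 1 to flag - 1
        yield n1

generate = fn()
mymatrix = []
while True:
    try:
        a = next(generate)
    except StopIteration:
        break  # generator exhausted
    mat = []  # start a fresh row for every a
    for n2 in range(1, flag):
        if a >= n2:
            mat.append(0)
        else:
            mat.append(myfunction2(mydic[a], mydic[n2]))
    mymatrix.append(mat)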