Following some online research (1, 2, numpy, scipy, scikit, math), I have found several ways of calculating the Euclidean distance in Python:
# 1
numpy.linalg.norm(a-b)
# 2
distance.euclidean(vector1, vector2)
# 3
sklearn.metrics.pairwise.euclidean_distances
# 4
sqrt((xa-xb)**2 + (ya-yb)**2 + (za-zb)**2)
# 5
dist = [(a - b)**2 for a, b in zip(vector1, vector2)]
dist = math.sqrt(sum(dist))
# 6
math.hypot(x, y)
I was wondering if someone could provide an insight on which of the above (or any other that I have not found) is considered the best in terms of efficiency and precision. If someone is aware of any resource(s) which discusses the subject that would also be great.
The context I am interested in is calculating the Euclidean distance between pairs of number tuples, e.g. the distance between (52, 106, 35, 12) and (33, 153, 75, 10).
Conclusion first:
From the timeit results of the efficiency test, we can conclude the following about efficiency:
Method5 (zip, math.sqrt) > Method1 (numpy.linalg.norm) > Method2 (scipy.spatial.distance) > Method3 (sklearn.metrics.pairwise.euclidean_distances)
I didn't test your Method4, as it is not suitable for general cases and is essentially equivalent to Method5.
Among the rest, quite surprisingly, Method5 is the fastest. Method1, which uses NumPy and is heavily optimized in C, is the second fastest, as expected.
For scipy.spatial.distance, if you go to the function definition, you will see that it actually uses numpy.linalg.norm, except that it first performs validation on the two input vectors. That's why it is slightly slower than numpy.linalg.norm.
Finally for sklearn, according to the documentation:
This formulation has two advantages over other ways of computing distances. First, it is computationally efficient when dealing with sparse data. Second, if one argument varies but the other remains unchanged, then dot(x, x) and/or dot(y, y) can be pre-computed.
However, this is not the most precise way of doing this computation, and the distance matrix returned by this function may not be exactly symmetric as required
Since your question uses a fixed set of data, the advantage of this implementation is not reflected here. And because of the trade-off between performance and precision, it also gives the worst precision of all the methods.
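To illustrate where that precision loss comes from, here is a rough sketch of the dot-product expansion the documentation refers to (for illustration only; this is not sklearn's actual code):
import numpy as np

x = np.array([52, 106, 35, 12], dtype=float)
y = np.array([33, 153, 75, 10], dtype=float)

# ||x - y||^2 = x.x - 2*x.y + y.y; cancellation in this expansion is what
# costs precision (it can even go slightly negative before the sqrt)
d_expanded = np.sqrt(np.dot(x, x) - 2 * np.dot(x, y) + np.dot(y, y))
d_direct = np.sqrt(np.sum((x - y) ** 2))
print(d_expanded, d_direct)  # may differ in the last few digits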
Regarding precision: Method5 = Method1 = Method2 > Method3
Efficiency Test Script:
import numpy as np
from scipy.spatial import distance
from sklearn.metrics.pairwise import euclidean_distances
import math
# 1
def eudis1(v1, v2):
    return np.linalg.norm(v1 - v2)

# 2
def eudis2(v1, v2):
    return distance.euclidean(v1, v2)

# 3
def eudis3(v1, v2):
    # note: recent scikit-learn versions require 2-D inputs here,
    # e.g. euclidean_distances(v1.reshape(1, -1), v2.reshape(1, -1))
    return euclidean_distances(v1, v2)

# 5
def eudis5(v1, v2):
    dist = [(a - b)**2 for a, b in zip(v1, v2)]
    dist = math.sqrt(sum(dist))
    return dist
dis1 = (52, 106, 35, 12)
dis2 = (33, 153, 75, 10)
v1, v2 = np.array(dis1), np.array(dis2)
import timeit
def wrapper(func, *args, **kwargs):
    def wrapped():
        return func(*args, **kwargs)
    return wrapped
wrappered1 = wrapper(eudis1, v1, v2)
wrappered2 = wrapper(eudis2, v1, v2)
wrappered3 = wrapper(eudis3, v1, v2)
wrappered5 = wrapper(eudis5, v1, v2)
t1 = timeit.repeat(wrappered1, repeat=3, number=100000)
t2 = timeit.repeat(wrappered2, repeat=3, number=100000)
t3 = timeit.repeat(wrappered3, repeat=3, number=100000)
t5 = timeit.repeat(wrappered5, repeat=3, number=100000)
print('\n')
print('t1: ', sum(t1)/len(t1))
print('t2: ', sum(t2)/len(t2))
print('t3: ', sum(t3)/len(t3))
print('t5: ', sum(t5)/len(t5))
Efficiency Test Output:
t1: 0.654838958307
t2: 1.53977598714
t3: 6.7898791732
t5: 0.422228400305
Precision Test Script & Result:
In [8]: eudis1(v1,v2)
Out[8]: 64.60650122085238
In [9]: eudis2(v1,v2)
Out[9]: 64.60650122085238
In [10]: eudis3(v1,v2)
Out[10]: array([[ 64.60650122]])
In [11]: eudis5(v1,v2)
Out[11]: 64.60650122085238
This is not exactly answering the question, but it is probably worth mentioning that if you aren't interested in the actual Euclidean distance, but just want to compare Euclidean distances against each other, the square root is a monotone function: x**(1/2) < y**(1/2) if and only if x < y (for non-negative x and y).
So if you don't need the explicit distance, but for instance just want to know which vector in a list, called vectorlist, is closest to vector1, you can avoid the expensive (in terms of both precision and time) square root and make do with something like
min(vectorlist, key=lambda compare: sum([(a - b)**2 for a, b in zip(vector1, compare)]))
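For example, a small self-contained sketch (the vectors are made up):
vector1 = (52, 106, 35, 12)
vectorlist = [(33, 153, 75, 10), (50, 100, 30, 10), (0, 0, 0, 0)]

# pick the vector whose *squared* distance to vector1 is smallest;
# the square root is skipped because it does not change the ordering
closest = min(vectorlist,
              key=lambda compare: sum((a - b) ** 2 for a, b in zip(vector1, compare)))
print(closest)  # (50, 100, 30, 10)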
Here is an example of how to use just NumPy.
import numpy as np
a = np.array([3, 0])
b = np.array([0, 4])
c = np.sqrt(np.sum(((a - b) ** 2)))
# c == 5.0
Improving on the benchmark in the accepted answer, I've found that, assuming you already have the input in NumPy array format, Method5 can be better written as:
import numpy as np
from numba import jit
@jit(nopython=True)
def euclidian_distance(y1, y2):
    return np.sqrt(np.sum((y1-y2)**2))  # based on the Pythagorean theorem
Speed test:
euclidian_distance(y1, y2)
# 2.03 µs ± 138 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
np.linalg.norm(y1-y2)
# 17.6 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Fun fact: you can also add jit to the numpy-based function:
@jit(nopython=True)
def jit_linalg(y1, y2):
    return np.linalg.norm(y1-y2)
jit_linalg(y[i],y[j])
# 2.91 µs ± 261 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
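One caveat: with a lazily compiled @jit function, the first call includes compilation time, so warm it up before timing. A minimal sketch:
import numpy as np
from numba import jit

@jit(nopython=True)
def euclidian_distance(y1, y2):
    return np.sqrt(np.sum((y1 - y2) ** 2))

y1 = np.array([52, 106, 35, 12], dtype=np.float64)
y2 = np.array([33, 153, 75, 10], dtype=np.float64)

euclidian_distance(y1, y2)  # first call triggers compilation; exclude it from benchmarks
# %timeit euclidian_distance(y1, y2)  # then time the compiled version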
As a general rule of thumb, stick to the scipy and numpy implementations where possible, as they're vectorized and much faster than native Python code. (The main reasons: they are implemented in C, and vectorization avoids the type-checking overhead that Python-level looping incurs.)
(Aside: My answer doesn't cover precision here, but I think the same principle applies for precision as for efficiency.)
As a bit of a bonus, I'll chip in with a bit of information on how you can profile your code, to measure efficiency. If you're using the IPython interpreter, the secret is to use the %prun line magic.
In [1]: import numpy
In [2]: from scipy.spatial import distance
In [3]: c1 = numpy.array((52, 106, 35, 12))
In [4]: c2 = numpy.array((33, 153, 75, 10))
In [5]: %prun distance.euclidean(c1, c2)
35 function calls in 0.000 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 linalg.py:1976(norm)
1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.dot}
6 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.array}
4 0.000 0.000 0.000 0.000 numeric.py:406(asarray)
1 0.000 0.000 0.000 0.000 distance.py:232(euclidean)
2 0.000 0.000 0.000 0.000 distance.py:152(_validate_vector)
2 0.000 0.000 0.000 0.000 shape_base.py:9(atleast_1d)
1 0.000 0.000 0.000 0.000 misc.py:11(norm)
1 0.000 0.000 0.000 0.000 function_base.py:605(asarray_chkfinite)
2 0.000 0.000 0.000 0.000 numeric.py:476(asanyarray)
1 0.000 0.000 0.000 0.000 {method 'ravel' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 linalg.py:111(isComplexType)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
4 0.000 0.000 0.000 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {method 'squeeze' of 'numpy.ndarray' objects}
In [6]: %prun numpy.linalg.norm(c1 - c2)
10 function calls in 0.000 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 linalg.py:1976(norm)
1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.dot}
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 numeric.py:406(asarray)
1 0.000 0.000 0.000 0.000 {method 'ravel' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 linalg.py:111(isComplexType)
1 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.array}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
What %prun does is tell you how long a function call takes to run, including a bit of trace to figure out where the bottleneck might be. In this case, both the scipy.spatial.distance.euclidean and numpy.linalg.norm implementations are pretty fast. Assuming you defined a function dist(vect1, vect2), you can profile using the same IPython magic call. As another added bonus, %prun also works inside the Jupyter notebook, and you can do %%prun to profile an entire cell of code, rather than just one function, simply by making %%prun the first line of that cell.
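If you only want raw timings rather than a per-call breakdown, the %timeit line magic is a handy complement to %prun (outputs omitted here; it prints the mean ± std. dev. of the runs):
In [7]: %timeit distance.euclidean(c1, c2)
In [8]: %timeit numpy.linalg.norm(c1 - c2)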
I don't know how the precision and speed compare to the other libraries you mentioned, but you can do it for 2D vectors using the built-in math.hypot() function:
from math import hypot
def pairwise(iterable):
    "s -> (s0, s1), (s1, s2), (s2, s3), ..."
    a, b = iter(iterable), iter(iterable)
    next(b, None)
    return zip(a, b)
a = (52, 106, 35, 12)
b = (33, 153, 75, 10)
dist = [hypot(p2[0]-p1[0], p2[1]-p1[1]) for p1, p2 in pairwise(tuple(zip(a, b)))]
print(dist) # -> [131.59027319676787, 105.47511554864494, 68.94925670375281]
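A note for newer interpreters: since Python 3.8, math.hypot accepts any number of coordinates and math.dist computes the Euclidean distance between two equal-length points directly, so no pairing trick is needed for the n-dimensional case. A minimal sketch:
from math import dist, hypot

a = (52, 106, 35, 12)
b = (33, 153, 75, 10)

print(dist(a, b))                             # 64.60650122085238
print(hypot(*(x - y for x, y in zip(a, b))))  # same result via n-ary hypot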
NumPy version: 1.14.5
Purpose of the 'foo' function:
Finding the Euclidean distance between arrays of shape (1, 512), which represent facial features.
Issue:
The foo function takes ~223.32 ms, but after that some background operations related to NumPy take 170 seconds, for some reason.
Question:
Is keeping arrays in dictionaries and iterating over them a very dangerous use of NumPy arrays?
Request for Advice:
When I keep the arrays stacked and separate from the dict, the Euclidean distance calculation takes half the time (~120 ms instead of ~250 ms), but overall performance doesn't change much for some reason. Allocating new arrays and stacking them may have cancelled out the benefit of operating on a bigger array.
I am open to any advice.
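For illustration, here is a minimal sketch of the stacked layout I mean (array and variable names are made up, not from my actual code):
import numpy as np

# hypothetical: all 12000 face embeddings stacked into one (N, 512) array
stacked_embeddings = np.random.rand(12000, 512)
face = stacked_embeddings[0]

# one vectorized call replaces the Python-level loop over the dict
dists = np.linalg.norm(stacked_embeddings - face, axis=1)
nearest_idx = int(np.argmin(dists))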
Code:
import numpy as np
import time
import uuid
import random
from funcy import print_durations
@print_durations
def foo(merged_faces_rec, face):
    t = time.time()
    for uid, feature_list in merged_faces_rec.items():
        dist = np.linalg.norm(np.subtract(feature_list[0], face))
    print("foo inside : ", time.time()-t)
rand_age = lambda : random.choice(["0-18", "18-35", "35-55", "55+"])
rand_gender = lambda : random.choice(["Erkek", "Kadin"])
rand_emo = lambda : random.choice(["happy", "sad", "neutral", "scared"])
date_list = []
emb = lambda : np.random.rand(1, 512)
def generate_faces_rec(d, n=12000):
    for _ in range(n):
        d[uuid.uuid4().hex] = [emb(), rand_gender(), rand_age(), rand_emo(), date_list]
faces_rec1 = dict()
generate_faces_rec(faces_rec1)
faces_rec2 = dict()
generate_faces_rec(faces_rec2)
faces_rec3 = dict()
generate_faces_rec(faces_rec3)
faces_rec4 = dict()
generate_faces_rec(faces_rec4)
faces_rec5 = dict()
generate_faces_rec(faces_rec5)
merged_faces_rec = dict()
st = time.time()
merged_faces_rec.update(faces_rec1)
merged_faces_rec.update(faces_rec2)
merged_faces_rec.update(faces_rec3)
merged_faces_rec.update(faces_rec4)
merged_faces_rec.update(faces_rec5)
t2 = time.time()
print("updates: ", t2-st)
face = list(merged_faces_rec.values())[0][0]
t3 = time.time()
print("face: ", t3-t2)
t4 = time.time()
foo(merged_faces_rec, face)
t5 = time.time()
print("foo: ", t5-t4)
Result:
Computations between t4 and t5 took 168 seconds.
updates: 0.00468754768371582
face: 0.0011434555053710938
foo inside : 0.2232837677001953
223.32 ms in foo({'d02d46999aa145be8116..., [[0.96475353 0.8055263...)
foo: 168.42408967018127
cProfile
python3 -m cProfile -s tottime test.py
cProfile Result:
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
30720512 44.991 0.000 85.425 0.000 arrayprint.py:888(__call__)
36791296 42.447 0.000 42.447 0.000 {built-in method numpy.core.multiarray.dragon4_positional}
30840514/60001 36.154 0.000 149.749 0.002 arrayprint.py:659(recurser)
24649728 25.967 0.000 25.967 0.000 {built-in method numpy.core.multiarray.dragon4_scientific}
30720512 20.183 0.000 26.420 0.000 arrayprint.py:636(_extendLine)
10 12.281 1.228 12.281 1.228 {method 'sub' of '_sre.SRE_Pattern' objects}
60001 11.434 0.000 79.370 0.001 arrayprint.py:804(fillFormat)
228330011/228329975 10.270 0.000 10.270 0.000 {built-in method builtins.len}
204081 4.815 0.000 16.469 0.000 {built-in method builtins.max}
18431577 4.624 0.000 21.742 0.000 arrayprint.py:854(<genexpr>)
18431577 4.453 0.000 28.627 0.000 arrayprint.py:859(<genexpr>)
30720531 3.987 0.000 3.987 0.000 {method 'split' of 'str' objects}
12348936 3.012 0.000 13.873 0.000 arrayprint.py:829(<genexpr>)
12348936 3.007 0.000 17.955 0.000 arrayprint.py:832(<genexpr>)
18431577 2.179 0.000 2.941 0.000 arrayprint.py:863(<genexpr>)
18431577 2.124 0.000 2.870 0.000 arrayprint.py:864(<genexpr>)
12348936 1.625 0.000 3.180 0.000 arrayprint.py:833(<genexpr>)
12348936 1.468 0.000 1.992 0.000 arrayprint.py:834(<genexpr>)
12348936 1.433 0.000 1.922 0.000 arrayprint.py:844(<genexpr>)
12348936 1.432 0.000 1.929 0.000 arrayprint.py:837(<genexpr>)
12324864 1.074 0.000 1.074 0.000 {method 'partition' of 'str' objects}
6845518 0.761 0.000 0.761 0.000 {method 'rstrip' of 'str' objects}
60001 0.747 0.000 80.175 0.001 arrayprint.py:777(__init__)
2 0.637 0.319 245.563 122.782 debug.py:237(smart_repr)
120002 0.573 0.000 0.573 0.000 {method 'reduce' of 'numpy.ufunc' objects}
60001 0.421 0.000 231.153 0.004 arrayprint.py:436(_array2string)
60000 0.370 0.000 0.370 0.000 {method 'rand' of 'mtrand.RandomState' objects}
60000 0.303 0.000 232.641 0.004 arrayprint.py:1334(array_repr)
60001 0.274 0.000 232.208 0.004 arrayprint.py:465(array2string)
60001 0.261 0.000 80.780 0.001 arrayprint.py:367(_get_format_function)
120008 0.255 0.000 0.611 0.000 numeric.py:2460(seterr)
Update to Clarify the Question
This is the part that has the bug. Something behind the scenes causes the program to take too long. Is it something to do with the garbage collector, or just a weird NumPy bug? I don't have a clue.
t6 = time.time()
foo1(big_array, face) # 223.32ms
t7 = time.time()
print("foo1 : ", t7-t6) # foo1 : 170 seconds
I am trying to parallelize an embarrassingly parallel for loop (previously asked here) and settled on this implementation that fit my parameters:
with Manager() as proxy_manager:
    shared_inputs = proxy_manager.list([datasets, train_size_common, feat_sel_size, train_perc,
                                        total_test_samples, num_classes, num_features, label_set,
                                        method_names, pos_class_index, out_results_dir, exhaustive_search])

    partial_func_holdout = partial(holdout_trial_compare_datasets, *shared_inputs)

    with Pool(processes=num_procs) as pool:
        cv_results = pool.map(partial_func_holdout, range(num_repetitions))
The reason I need to use a proxy object (shared between processes) is that the first element of the shared proxy list, datasets, is a list of large objects (each about 200-300 MB). This datasets list usually has 5-25 elements. I typically need to run this program on an HPC cluster.
Here is the question: when I run this program with 32 processes and 50 GB of memory (num_repetitions=200, with datasets being a list of 10 objects, each 250 MB), I do not see a speedup even by a factor of 16 (with 32 parallel processes). I do not understand why - any clues? Any obvious mistakes or bad choices? Where can I improve this implementation? Any alternatives?
I am sure this has been discussed before, and the reasons can be varied and very specific to implementation - hence I request you to provide me your 2 cents. Thanks.
Update: I did some profiling with cProfile to get a better idea - here is some info, sorted by cumulative time.
In [19]: p.sort_stats('cumulative').print_stats(50)
Mon Oct 16 16:43:59 2017 profiling_log.txt
555404 function calls (543552 primitive calls) in 662.201 seconds
Ordered by: cumulative time
List reduced from 4510 to 50 due to restriction <50>
ncalls tottime percall cumtime percall filename:lineno(function)
897/1 0.044 0.000 662.202 662.202 {built-in method builtins.exec}
1 0.000 0.000 662.202 662.202 test_rhst.py:2(<module>)
1 0.001 0.001 661.341 661.341 test_rhst.py:70(test_chance_classifier_binary)
1 0.000 0.000 661.336 661.336 /Users/Reddy/dev/neuropredict/neuropredict/rhst.py:677(run)
4 0.000 0.000 661.233 165.308 /Users/Reddy/anaconda/envs/py36/lib/python3.6/threading.py:533(wait)
4 0.000 0.000 661.233 165.308 /Users/Reddy/anaconda/envs/py36/lib/python3.6/threading.py:263(wait)
23 661.233 28.749 661.233 28.749 {method 'acquire' of '_thread.lock' objects}
1 0.000 0.000 661.233 661.233 /Users/Reddy/anaconda/envs/py36/lib/python3.6/multiprocessing/pool.py:261(map)
1 0.000 0.000 661.233 661.233 /Users/Reddy/anaconda/envs/py36/lib/python3.6/multiprocessing/pool.py:637(get)
1 0.000 0.000 661.233 661.233 /Users/Reddy/anaconda/envs/py36/lib/python3.6/multiprocessing/pool.py:634(wait)
866/8 0.004 0.000 0.868 0.108 <frozen importlib._bootstrap>:958(_find_and_load)
866/8 0.003 0.000 0.867 0.108 <frozen importlib._bootstrap>:931(_find_and_load_unlocked)
720/8 0.003 0.000 0.865 0.108 <frozen importlib._bootstrap>:641(_load_unlocked)
596/8 0.002 0.000 0.865 0.108 <frozen importlib._bootstrap_external>:672(exec_module)
1017/8 0.001 0.000 0.863 0.108 <frozen importlib._bootstrap>:197(_call_with_frames_removed)
522/51 0.001 0.000 0.765 0.015 {built-in method builtins.__import__}
The profiling info now sorted by time
In [20]: p.sort_stats('time').print_stats(20)
Mon Oct 16 16:43:59 2017 profiling_log.txt
555404 function calls (543552 primitive calls) in 662.201 seconds
Ordered by: internal time
List reduced from 4510 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
23 661.233 28.749 661.233 28.749 {method 'acquire' of '_thread.lock' objects}
115/80 0.177 0.002 0.211 0.003 {built-in method _imp.create_dynamic}
595 0.072 0.000 0.072 0.000 {built-in method marshal.loads}
1 0.045 0.045 0.045 0.045 {method 'acquire' of '_multiprocessing.SemLock' objects}
897/1 0.044 0.000 662.202 662.202 {built-in method builtins.exec}
3 0.042 0.014 0.042 0.014 {method 'read' of '_io.BufferedReader' objects}
2037/1974 0.037 0.000 0.082 0.000 {built-in method builtins.__build_class__}
286 0.022 0.000 0.061 0.000 /Users/Reddy/anaconda/envs/py36/lib/python3.6/site-packages/scipy/misc/doccer.py:12(docformat)
2886 0.021 0.000 0.021 0.000 {built-in method posix.stat}
79 0.016 0.000 0.016 0.000 {built-in method posix.read}
597 0.013 0.000 0.021 0.000 <frozen importlib._bootstrap_external>:830(get_data)
276 0.011 0.000 0.013 0.000 /Users/Reddy/anaconda/envs/py36/lib/python3.6/sre_compile.py:250(_optimize_charset)
108 0.011 0.000 0.038 0.000 /Users/Reddy/anaconda/envs/py36/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:626(_construct_argparser)
1225 0.011 0.000 0.050 0.000 <frozen importlib._bootstrap_external>:1233(find_spec)
7179 0.009 0.000 0.009 0.000 {method 'splitlines' of 'str' objects}
33 0.008 0.000 0.008 0.000 {built-in method posix.waitpid}
283 0.008 0.000 0.015 0.000 /Users/Reddy/anaconda/envs/py36/lib/python3.6/site-packages/scipy/misc/doccer.py:128(indentcount_lines)
3 0.008 0.003 0.008 0.003 {method 'poll' of 'select.poll' objects}
7178 0.008 0.000 0.008 0.000 {method 'expandtabs' of 'str' objects}
597 0.007 0.000 0.007 0.000 {method 'read' of '_io.FileIO' objects}
More profiling info sorted by percall info:
Update 2
The elements in the large list datasets I mentioned earlier are not usually that big - they are typically 10-25 MB each. But depending on the floating-point precision used and the number of samples and features, this can easily grow to 500 MB-1 GB per element. Hence I'd prefer a solution that can scale.
Update 3:
The code inside holdout_trial_compare_datasets uses the GridSearchCV method of scikit-learn, which internally uses the joblib library if we set n_jobs > 1 (or whenever we set it at all). This might lead to some bad interactions between multiprocessing and joblib. So I am trying another config where I do not set n_jobs at all (which should default to no parallelism within scikit-learn). Will keep you posted.
Based on the discussion in the comments, I did a mini experiment comparing three versions of the implementation:
v1: basically the same as your approach. In fact, since partial(f1, *shared_inputs) unpacks proxy_manager.list immediately, Manager.List is not involved here; the data is passed to the workers through the internal queue of Pool.
v2: makes actual use of Manager.List; the work function receives a ListProxy object and fetches the shared data via an internal connection to a server process.
v3: the child processes share the data from the parent, taking advantage of the fork(2) system call.
import multiprocessing as mp          # imports assumed by this snippet
from functools import partial
from time import time

def f1(*args):
    for e in args[0]: pow(e, 2)

def f2(*args):
    for e in args[0][0]: pow(e, 2)

def f3(n):
    for i in datasets: pow(i, 2)

def v1(np):
    with mp.Manager() as proxy_manager:
        shared_inputs = proxy_manager.list([datasets, ])
        pf = partial(f1, *shared_inputs)
        with mp.Pool(processes=np) as pool:
            r = pool.map(pf, range(16))

def v2(np):
    with mp.Manager() as proxy_manager:
        shared_inputs = proxy_manager.list([datasets, ])
        pf = partial(f2, shared_inputs)
        with mp.Pool(processes=np) as pool:
            r = pool.map(pf, range(16))

def v3(np):
    with mp.Pool(processes=np) as pool:
        r = pool.map(f3, range(16))

datasets = [2.0 for _ in range(10 * 1000 * 1000)]

for f in (v1, v2, v3):
    print(f.__code__.co_name)
    for np in (2, 4, 8, 16):
        s = time()
        f(np)
        print("%s %.2fs" % (np, time()-s))
Results taken on a 16-core E5-2682 VPC; it is obvious that v3 scales better:
{method 'acquire' of '_thread.lock' objects}
Looking at your profiler output, I would say that the shared-object lock/unlock overhead overwhelms the speed gains of multiprocessing.
Refactor so that the work is farmed out to workers that do not need to talk to one another as much.
Specifically, if possible, derive one answer per data pile and then act on the accumulated results.
This is why Queues can seem so much faster: they involve a type of work that does not require an object that has to be 'managed' and so locked/unlocked.
Only 'manage' things that absolutely need to be shared between processes. Your managed list contains some very complicated looking objects...
A faster paradigm is:
allwork = manager.list([a, b,c])
theresult = manager.list()
and then
while allwork:
    unitofwork = allwork.pop()
    theresult.append(myfunction(unitofwork))
If you do not need a complex shared object, then only use a list of the most simple objects imaginable.
Then tell the workers to acquire the complex data that they can process in their own little world.
Try:
allwork = manager.list([datasetid1, datasetid2, ...])
theresult = manager.list()

while allwork:
    unitofworkid = allwork.pop()
    theresult.append(myfunction(unitofworkid))

def myfunction(unitofworkid):
    thework = acquiredataset(unitofworkid)
    result = holdout_trial_compare_datasets(thework, ...)
    return result
I hope that this makes sense. It should not take too much time to refactor in this direction. And you should see the {method 'acquire' of '_thread.lock' objects} number drop like a rock when you profile.
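A minimal, self-contained sketch of that pattern (function and dataset names are placeholders, not from the original project):
from multiprocessing import Pool

def acquiredataset(dataset_id):
    # placeholder: each worker loads its own (heavy) data, e.g. from disk,
    # so nothing large ever travels through a Manager proxy
    return list(range(dataset_id * 1000))

def myfunction(dataset_id):
    thework = acquiredataset(dataset_id)
    return sum(thework)  # stand-in for holdout_trial_compare_datasets(thework, ...)

if __name__ == "__main__":
    dataset_ids = [1, 2, 3, 4]          # cheap identifiers instead of 250 MB objects
    with Pool(processes=4) as pool:
        results = pool.map(myfunction, dataset_ids)
    print(results)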
I have created the code below. It takes a series of values
and generates 10 numbers between x and r with an average value of 8000.
In order to meet the specification to cover the range as well as possible, I also calculated the standard deviation, which is a good measure of spread. So whenever a sample set meets the criterion of a mean of 8000, I compare it to previous matches and keep the samples that have the highest std dev (mean always = 8000).
import random as rd        # imports assumed by this snippet
import statistics as st

def node_timing(average_block_response_computational_time, min_block_response_computational_time, max_block_response_computational_time):
    sample_count = 10
    num_of_trials = 1
    # print average_block_response_computational_time
    # print min_block_response_computational_time
    # print max_block_response_computational_time
    target_sum = sample_count * average_block_response_computational_time
    samples_list = []
    curr_stdev_max = 0
    for trials in range(num_of_trials):
        samples = [0] * sample_count
        while sum(samples) != target_sum:
            samples = [rd.randint(min_block_response_computational_time, max_block_response_computational_time) for trial in range(sample_count)]
            # print ("Mean: ", st.mean(samples), "Std Dev: ", st.stdev(samples), )
            # print (samples, "\n")
        if st.stdev(samples) > curr_stdev_max:
            curr_stdev_max = st.stdev(samples)
            samples_best = samples[:]
    return samples_best[0]
I take the first value in the list and use it as a timing value. However, this code is REALLY slow; I need to call this piece of code several thousand times during the simulation, so I need to improve its efficiency somehow.
Anyone got any suggestions on how to do that?
To see where we'd get the best speed improvements, I started by profiling your code.
import cProfile
pr = cProfile.Profile()
pr.enable()
for i in range(100):
    print(node_timing(8000, 7000, 9000))
pr.disable()
pr.print_stats(sort='time')
The top of the results show where your code is spending most of its time:
23561178 function calls (23561176 primitive calls) in 10.612 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
4502300 3.694 0.000 7.258 0.000 random.py:172(randrange)
4502300 2.579 0.000 3.563 0.000 random.py:222(_randbelow)
4502300 1.533 0.000 8.791 0.000 random.py:216(randint)
450230 1.175 0.000 9.966 0.000 counter.py:19(<listcomp>)
4608421 0.690 0.000 0.690 0.000 {method 'getrandbits' of '_random.Random' objects}
100 0.453 0.005 10.596 0.106 counter.py:5(node_timing)
4502300 0.294 0.000 0.294 0.000 {method 'bit_length' of 'int' objects}
450930 0.141 0.000 0.150 0.000 {built-in method builtins.sum}
100 0.016 0.000 0.016 0.000 {built-in method builtins.print}
600 0.007 0.000 0.025 0.000 statistics.py:105(_sum)
2200 0.005 0.000 0.006 0.000 fractions.py:84(__new__)
...
From this output, we can see that we're spending ~7.5 seconds (out of 10.6 seconds) generating random numbers. Therefore, the only way to make this noticeably faster is to generate fewer random numbers or generate them faster. You're not using a cryptographic random number generator so I don't have a way to make generating numbers faster. However, we can fudge the algorithm a bit and drastically reduce the number of values we need to generate.
Instead of only accepting samples with a mean of exactly 8000, what if we accepted samples with a mean of 8000 +- 0.1% (then we're taking samples with a mean of 7992 to 8008)? By being a tiny bit inexact, we can drastically speed up the algorithm. I replaced the while condition with:
while abs(sum(samples) - target_sum) > epsilon
Where epsilon = target_sum * 0.001. Then I ran the script again and got much better profiler numbers.
232439 function calls (232437 primitive calls) in 0.163 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
100 0.032 0.000 0.032 0.000 {built-in method builtins.print}
31550 0.026 0.000 0.053 0.000 random.py:172(randrange)
31550 0.019 0.000 0.027 0.000 random.py:222(_randbelow)
31550 0.011 0.000 0.064 0.000 random.py:216(randint)
4696 0.010 0.000 0.013 0.000 fractions.py:84(__new__)
3155 0.008 0.000 0.073 0.000 counter.py:19(<listcomp>)
600 0.008 0.000 0.039 0.000 statistics.py:105(_sum)
100 0.006 0.000 0.131 0.001 counter.py:4(node_timing)
32293 0.005 0.000 0.005 0.000 {method 'getrandbits' of '_random.Random' objects}
1848 0.004 0.000 0.009 0.000 fractions.py:401(_add)
Allowing the mean to be up to 0.1% off of the target dropped the number of calls to randint by 100x. Naturally, the code also runs 100x faster (and now spends most of its time printing to console).
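For reference, a minimal sketch of node_timing with the relaxed condition (parameter names shortened for brevity; rd and st are random and statistics, as in the question):
import random as rd
import statistics as st

def node_timing(avg_t, min_t, max_t, sample_count=10, num_of_trials=1):
    target_sum = sample_count * avg_t
    epsilon = target_sum * 0.001          # accept a sum within 0.1% of the target
    curr_stdev_max = 0
    samples_best = None
    for _ in range(num_of_trials):
        samples = [0] * sample_count
        # relaxed condition: the sum only has to be close to the target, not exact
        while abs(sum(samples) - target_sum) > epsilon:
            samples = [rd.randint(min_t, max_t) for _ in range(sample_count)]
        if st.stdev(samples) > curr_stdev_max:
            curr_stdev_max = st.stdev(samples)
            samples_best = samples[:]
    return samples_best[0]

print(node_timing(8000, 7000, 9000))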
Here is the rate limiting function in my code
def timepropagate(wv1, ham11,
                  ham12, ham22, scalararray, nt):
    wv2 = np.zeros((nx, ny), 'c16')
    fw1 = np.zeros((nx, ny), 'c16')
    fw2 = np.zeros((nx, ny), 'c16')
    for t in range(0, nt, 1):
        wv1, wv2 = scalararray*wv1, scalararray*wv2
        fw1, fw2 = (np.fft.fft2(wv1), np.fft.fft2(wv2))
        fw1 = ham11*fw1+ham12*fw2
        fw2 = ham12*fw1+ham22*fw2
        wv1, wv2 = (np.fft.ifft2(fw1), np.fft.ifft2(fw2))
        wv1, wv2 = scalararray*wv1, scalararray*wv2
    del(fw1)
    del(fw2)
    return np.array([wv1, wv2])
What I need to do is find a reasonably fast implementation that would allow me to go at twice the speed, preferably faster.
The more general question I'm interested in is how I can speed up this piece using as few trips back to Python as possible. I assume that even if I speed up specific segments of the code, say the scalar array multiplications, I would still come back to Python at the Fourier transforms, which would take time. Are there any ways I can use, say, numba or Cython, without making this "coming back" to Python in the middle of the loops?
On a personal note, I'd prefer something fast on a single thread considering that I'd be using my other threads already.
Edit: here are the profiling results, for 4096x4096 arrays and 10 time steps; I need to scale this up to nt = 8000.
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.099 0.099 432.556 432.556 <string>:1(<module>)
40 0.031 0.001 28.792 0.720 fftpack.py:100(fft)
40 45.867 1.147 68.055 1.701 fftpack.py:195(ifft)
80 0.236 0.003 47.647 0.596 fftpack.py:46(_raw_fft)
40 0.102 0.003 1.260 0.032 fftpack.py:598(_cook_nd_args)
40 1.615 0.040 99.774 2.494 fftpack.py:617(_raw_fftnd)
20 0.225 0.011 29.739 1.487 fftpack.py:819(fft2)
20 2.252 0.113 72.512 3.626 fftpack.py:908(ifft2)
80 0.000 0.000 0.000 0.000 fftpack.py:93(_unitary)
40 0.631 0.016 0.820 0.021 fromnumeric.py:43(_wrapit)
80 0.009 0.000 0.009 0.000 fromnumeric.py:457(swapaxes)
40 0.338 0.008 1.158 0.029 fromnumeric.py:56(take)
200 0.064 0.000 0.219 0.001 numeric.py:414(asarray)
1 329.728 329.728 432.458 432.458 profiling.py:86(timepropagate)
1 0.036 0.036 432.592 432.592 {built-in method builtins.exec}
40 0.001 0.000 0.001 0.000 {built-in method builtins.getattr}
120 0.000 0.000 0.000 0.000 {built-in method builtins.len}
241 3.930 0.016 3.930 0.016 {built-in method numpy.core.multiarray.array}
3 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.zeros}
40 18.861 0.472 18.861 0.472 {built-in method numpy.fft.fftpack_lite.cfftb}
40 28.539 0.713 28.539 0.713 {built-in method numpy.fft.fftpack_lite.cfftf}
1 0.000 0.000 0.000 0.000 {built-in method numpy.fft.fftpack_lite.cffti}
80 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
40 0.006 0.000 0.006 0.000 {method 'astype' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
80 0.000 0.000 0.000 0.000 {method 'pop' of 'list' objects}
40 0.000 0.000 0.000 0.000 {method 'reverse' of 'list' objects}
80 0.000 0.000 0.000 0.000 {method 'setdefault' of 'dict' objects}
80 0.001 0.000 0.001 0.000 {method 'swapaxes' of 'numpy.ndarray' objects}
40 0.022 0.001 0.022 0.001 {method 'take' of 'numpy.ndarray' objects}
I think I've done it wrong the first time, using time.time() to calculate time differences for small arrays and extrapolating the conclusions for larger ones.
If most of the time is spent in the hamiltonian multiplication, you may want to apply numba to that part. Most of the benefit comes from removing all the temporary arrays that would be needed when evaluating the expressions within NumPy.
Bear in mind also that arrays of shape (4096, 4096) with dtype c16 are too big to fit comfortably in the processor caches; a single matrix takes 256 MiB. So the performance is unlikely to be limited by the arithmetic itself, but rather by memory bandwidth. You should therefore implement those operations so that you perform only one pass over the input operands. This is really trivial to implement in numba. Note: you only need to implement the hamiltonian expressions in numba.
I also want to point out that the "preallocations" using np.zeros seem to signal that your code is not following your intent, as:
fw1 = ham11*fw1+ham12*fw2
fw2 = ham12*fw1+ham22*fw2
will actually create new arrays for fw1, fw2. If your intent was to reuse the buffer, you may want to use "fw1[:,:] = ...". Otherwise the np.zeros do nothing but waste time and memory.
You may want to consider joining (wv1, wv2) into a single (2, 4096, 4096) c16 array, and the same for (fw1, fw2). That way the code will be simpler, as you can rely on broadcasting to handle the scalararray product. fft2 and ifft2 will actually do the right thing (AFAIK).
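To illustrate, a rough numba sketch of the hamiltonian update written as a single pass over the operands (a sketch only; note that it deliberately uses the pre-update fw1 for both outputs, which may or may not match your intent):
import numpy as np
from numba import njit

@njit
def hamiltonian_step(ham11, ham12, ham22, fw1, fw2, out1, out2):
    # single pass over the operands: no temporary arrays, bandwidth-friendly
    nx, ny = fw1.shape
    for i in range(nx):
        for j in range(ny):
            a = fw1[i, j]
            b = fw2[i, j]
            out1[i, j] = ham11[i, j] * a + ham12[i, j] * b
            out2[i, j] = ham12[i, j] * a + ham22[i, j] * b
The output buffers can be allocated once outside the time loop and reused on every iteration; passing fw1 and fw2 themselves as out1 and out2 is also safe here, because each element is read before it is written.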
Sorting a list of tuples (dictionary key/value pairs where the key is a random string) is faster when I do not explicitly specify that the key should be used (edit: added operator.itemgetter(0) from a comment by @Chepner, and the itemgetter version is now faster!):
import timeit
setup = """
import random
import string
import operator
random.seed('slartibartfast')
d = {}
for i in range(1000):
    d[''.join(random.choice(string.ascii_uppercase) for _ in range(16))] = 0
"""
print min(timeit.Timer('for k,v in sorted(d.iteritems()): pass',
                       setup=setup).repeat(7, 1000))
print min(timeit.Timer('for k,v in sorted(d.iteritems(),key=lambda x: x[0]): pass',
                       setup=setup).repeat(7, 1000))
print min(timeit.Timer('for k,v in sorted(d.iteritems(),key=operator.itemgetter(0)): pass',
                       setup=setup).repeat(7, 1000))
Gives:
0.575334150664
0.579534521128
0.523808984422 (the itemgetter version!)
If however I create a custom object, passing key=lambda x: x[0] explicitly to sorted makes it faster:
setup ="""
import random
import string
random.seed('slartibartfast')
d={}
class A(object):
def __init__(self):
self.s = ''.join(random.choice(string.ascii_uppercase) for _ in
range(16))
def __hash__(self): return hash(self.s)
def __eq__(self, other):
return self.s == other.s
def __ne__(self, other): return self.s != other.s
# def __cmp__(self, other): return cmp(self.s ,other.s)
for i in range(1000):
d[A()] = 0
"""
print min(timeit.Timer('for k,v in sorted(d.iteritems()): pass',
setup=setup).repeat(3, 1000))
print min(timeit.Timer('for k,v in sorted(d.iteritems(),key=lambda x: x[0]): pass',
setup=setup).repeat(3, 1000))
print min(timeit.Timer('for k,v in sorted(d.iteritems(),key=operator.itemgetter(0)): pass',
setup=setup).repeat(3, 1000))
Gives:
4.65625458083
1.87191002252
1.78853626684
Is this expected? It seems like the second element of the tuple is used in the second case, but shouldn't the keys compare unequal?
Note: uncommenting the comparison method gives worse results, but the key versions still take about half the time:
8.11941771831
5.29207000173
5.25420037046
As expected, the built-in (address) comparison is faster.
EDIT: here are the profiling results from my original code that triggered the question - without the key method:
12739 function calls in 0.007 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.007 0.007 <string>:1(<module>)
1 0.000 0.000 0.007 0.007 __init__.py:6527(_refreshOrder)
1 0.002 0.002 0.006 0.006 {sorted}
4050 0.003 0.000 0.004 0.000 bolt.py:1040(__cmp__) # here is the custom object
4050 0.001 0.000 0.001 0.000 {cmp}
4050 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 {method 'sort' of 'list' objects}
291 0.000 0.000 0.000 0.000 __init__.py:6537(<lambda>)
291 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 bolt.py:1240(iteritems)
1 0.000 0.000 0.000 0.000 {method 'iteritems' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
and here are the results when I specify the key:
7027 function calls in 0.004 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.004 0.004 <string>:1(<module>)
1 0.000 0.000 0.004 0.004 __init__.py:6527(_refreshOrder)
1 0.001 0.001 0.003 0.003 {sorted}
2049 0.001 0.000 0.002 0.000 bolt.py:1040(__cmp__)
2049 0.000 0.000 0.000 0.000 {cmp}
2049 0.000 0.000 0.000 0.000 {isinstance}
1 0.000 0.000 0.000 0.000 {method 'sort' of 'list' objects}
291 0.000 0.000 0.000 0.000 __init__.py:6538(<lambda>)
291 0.000 0.000 0.000 0.000 __init__.py:6533(<lambda>)
291 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 bolt.py:1240(iteritems)
1 0.000 0.000 0.000 0.000 {method 'iteritems' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Apparently it is the __cmp__ and not the __eq__ method that is called (edit: that's because this class defines __cmp__ but not __eq__; see here for the order of resolution of equality and comparison).
In the code here the __eq__ method is indeed called (8605 times), as seen by adding debug prints (see the comments).
So the difference is as stated in the answer by @chepner. The last thing I am not quite clear on is why those tuple equality calls are needed (IOW, why __eq__ is called rather than __cmp__ directly).
FINAL EDIT: I asked about this last point here: Why in comparing python tuples of objects is __eq__ and then __cmp__ called? - it turns out to be an optimization: the tuple comparison calls __eq__ on the tuple elements and only calls __cmp__ for elements that are not equal. So this is now perfectly clear. I thought it called __cmp__ directly, so initially it seemed to me that specifying the key was simply unneeded, and after Chepner's answer I was still not seeing where the equality calls come in.
Gist: https://gist.github.com/Utumno/f3d25e0fe4bd0f43ceb9178a60181a53
There are two issues at play.
Comparing two values of builtin types (such as int) happens in C. Comparing two values of a class with an __eq__ method happens in Python; repeatedly calling __eq__ imposes a significant performance penalty.
The function passed as key is called once per element, rather than once per comparison. This means that lambda x: x[0] is called once per tuple to build the list of A instances that are then compared. Without key, every one of the O(n lg n) tuple comparisons requires a call to A.__eq__ to compare the first elements of the two tuples.
The first point explains why your first set of results is under a second while the second set takes several seconds. The second point explains why using key is faster regardless of the values being compared.
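To see both effects concretely, here is a small Python 3 sketch (not taken from the question's code) that counts the __eq__ calls:
import random, string

class A:
    eq_calls = 0
    def __init__(self, s):
        self.s = s
    def __eq__(self, other):
        A.eq_calls += 1
        return self.s == other.s
    def __lt__(self, other):
        return self.s < other.s
    def __hash__(self):
        return hash(self.s)

random.seed('slartibartfast')
items = [(A(''.join(random.choice(string.ascii_uppercase) for _ in range(16))), 0)
         for _ in range(1000)]

A.eq_calls = 0
sorted(items)                      # tuple comparison calls __eq__ to find the first differing element
print('without key:', A.eq_calls)  # on the order of n lg n calls

A.eq_calls = 0
sorted(items, key=lambda x: x[0])  # A instances compared directly via __lt__
print('with key   :', A.eq_calls)  # 0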