Fastest computation of distances in rectangular array - python

I'm looking for the fastest way to compute the distances from some origin in an image to every other point. Right now, what I have is something like this:
origin = [some_val,some_other_val]
y,x = np.mgrid[:image.shape[0],:image.shape[1]].astype(float)
r = np.hypot(y-origin[0],x-origin[1])
Is there a faster way? I saw this answer, but I'm not sure how to apply it.

Let's bring some broadcasting into play -
m,n= image.shape
r = np.sqrt((np.arange(m)[:,None]-origin[0])**2 + (np.arange(n)-origin[1])**2)
Runtime tests and verification of results
Define functions -
In [115]: def broadcasting_based(origin,image_shape):
     ...:     m,n= image_shape
     ...:     return np.sqrt((np.arange(m)[:,None]-origin[0])**2 + (np.arange(n)-origin[1])**2)
     ...:
     ...:
     ...: def original_approach(origin,image_shape):
     ...:     y,x = np.mgrid[:image_shape[0],:image_shape[1]].astype(float)
     ...:     return np.hypot(y-origin[0],x-origin[1])
     ...:
Case # 1:
In [116]: origin = np.array([100,200])
In [117]: np.allclose(broadcasting_based(origin,[500,500]),original_approach(origin,[500,500]))
Out[117]: True
In [118]: %timeit broadcasting_based(origin,[500,500])
100 loops, best of 3: 3.28 ms per loop
In [119]: %timeit original_approach(origin,[500,500])
10 loops, best of 3: 21.2 ms per loop
Case # 2:
In [123]: origin = np.array([1000,2000])
In [124]: np.allclose(broadcasting_based(origin,[5000,5000]),original_approach(origin,[5000,5000]))
Out[124]: True
In [125]: %timeit broadcasting_based(origin,[5000,5000])
1 loops, best of 3: 460 ms per loop
In [126]: %timeit original_approach(origin,[5000,5000])
1 loops, best of 3: 2.96 s per loop

Apart from the other answers, you should first decide whether you actually need the distance, or whether your problem can be solved with just the squared distance. For example, if you want to know the nearest point, this works perfectly well with squared distances, because squaring preserves the ordering of non-negative values.
This saves you an expensive square root calculation for each point.
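As a minimal sketch of that suggestion, assuming the goal is only to locate the farthest pixel from the origin (the image shape and origin values are made up for illustration), the comparison can run on squared distances and the square root can be taken once at the end:

```python
import numpy as np

# Hypothetical image shape and origin, chosen only for illustration.
m, n = 500, 500
origin = (100, 200)

# Squared distances via broadcasting: an (m, 1) column plus an (n,) row gives (m, n).
dy = np.arange(m)[:, None] - origin[0]
dx = np.arange(n) - origin[1]
r2 = dy**2 + dx**2  # integer array, no square root taken

# Squaring preserves ordering, so argmax on r2 finds the farthest pixel.
far_idx = np.unravel_index(np.argmax(r2), r2.shape)
far_dist = np.sqrt(r2[far_idx])  # one square root instead of m*n
```

The same idea applies to thresholding: comparing `r2 <= radius**2` avoids the square root entirely.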

Related

Python dict: the size affects timing?

Let's say you have one key in dictionary A and 1 billion keys in dictionary B.
Algorithmically, a lookup is O(1).
However, does the actual time (program execution time) of a lookup differ based on the size of the dict?
onekey_stime = time.time()
print one_key_dict.get('firstkey')
onekey_dur = time.time() - onekey_stime
manykeys_stime = time.time()
print manykeys_dict.get('randomkey')
manykeys_dur = time.time() - manykeys_stime
Would I see any time difference between onekey_dur and manykeys_dur?
Pretty much identical in a test with a small and large dict:
In [31]: random_key = lambda: ''.join(np.random.choice(list(string.ascii_letters), 20))
In [32]: few_keys = {random_key(): np.random.random() for _ in xrange(100)}
In [33]: many_keys = {random_key(): np.random.random() for _ in xrange(1000000)}
In [34]: few_lookups = np.random.choice(few_keys.keys(), 50)
In [35]: many_lookups = np.random.choice(many_keys.keys(), 50)
In [36]: %timeit [few_keys[k] for k in few_lookups]
100000 loops, best of 3: 6.25 µs per loop
In [37]: %timeit [many_keys[k] for k in many_lookups]
100000 loops, best of 3: 7.01 µs per loop
EDIT: For you, @ShadowRanger -- missed lookups are pretty close too:
In [38]: %timeit [few_keys.get(k) for k in many_lookups]
100000 loops, best of 3: 7.99 µs per loop
In [39]: %timeit [many_keys.get(k) for k in few_lookups]
100000 loops, best of 3: 8.78 µs per loop
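The session above is Python 2 (xrange, list-returning keys()). A rough Python 3 equivalent using only the standard library might look like the sketch below. The dict sizes, the repeat count, and the helper random_key are choices made for this illustration, and the absolute timings will differ by machine:

```python
import random
import string
import timeit

def random_key(length=20):
    """Return a random ASCII-letter string (hypothetical helper for this sketch)."""
    return ''.join(random.choices(string.ascii_letters, k=length))

few_keys = {random_key(): random.random() for _ in range(100)}
many_keys = {random_key(): random.random() for _ in range(100_000)}

# Pick 50 existing keys from each dict to look up.
few_lookups = random.sample(list(few_keys), 50)
many_lookups = random.sample(list(many_keys), 50)

# Time 50 successful lookups in each dict, repeated many times.
t_few = timeit.timeit(lambda: [few_keys[k] for k in few_lookups], number=10_000)
t_many = timeit.timeit(lambda: [many_keys[k] for k in many_lookups], number=10_000)
print(f"small dict: {t_few:.4f} s, large dict: {t_many:.4f} s")
```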

Optimizing Fourier transformed signal length

I recently stumbled on an interesting problem when computing the Fourier transform of a signal with np.fft.fft. The reproduced problem is:
%timeit np.fft.fft(np.random.rand(59601))
1 loops, best of 3: 1.34 s per loop
I found that this takes unexpectedly long. For instance, let's look at some other FFTs with a slightly longer or shorter signal:
%timeit np.fft.fft(np.random.rand(59600))
100 loops, best of 3: 6.18 ms per loop
%timeit np.fft.fft(np.random.rand(59602))
10 loops, best of 3: 61.3 ms per loop
%timeit np.fft.fft(np.random.rand(59603))
10 loops, best of 3: 113 ms per loop
%timeit np.fft.fft(np.random.rand(59604))
1 loops, best of 3: 182 ms per loop
%timeit np.fft.fft(np.random.rand(59605))
100 loops, best of 3: 6.53 ms per loop
%timeit np.fft.fft(np.random.rand(59606))
1 loops, best of 3: 2.17 s per loop
%timeit np.fft.fft(np.random.rand(59607))
100 loops, best of 3: 8.14 ms per loop
We can observe that the times are now in milliseconds, except for np.random.rand(59606), which takes 2.17 s.
Note, the numpy documentation states:
FFT (Fast Fourier Transform) refers to a way the discrete Fourier Transform (DFT) can be calculated efficiently, by using symmetries in the calculated terms. The symmetry is highest when n is a power of 2, and the transform is therefore most efficient for these sizes.
However, these vectors do not have a power-of-2 length. Could someone explain how to avoid or predict the cases where computation times are considerably higher?
As some comments have pointed out, the prime factor decomposition lets you predict the time needed to calculate the FFT. The following graphs show your results. Note the logarithmic scale!
This image is generated with the following code:
import numpy as np
import matplotlib.pyplot as plt
def prime_factors(n):
    """Returns all the prime factors of a positive integer"""
    #from http://stackoverflow.com/questions/23287/largest-prime-factor-of-a-number/412942#412942
    factors = []
    d = 2
    while n > 1:
        while n % d == 0:
            factors.append(d)
            n //= d  # integer division keeps n an int on Python 3
        d = d + 1
    return factors
times = []
decomp = []
for i in range(59600, 59613):
    print(i)
    # %timeit is an IPython magic, so this loop has to run in IPython/Jupyter
    t = %timeit -o np.fft.fft(np.random.rand(i))
    times.append(t.best)
    decomp.append(max(prime_factors(i)))
plt.loglog(decomp, times, 'o')
plt.ylabel("best time")
plt.xlabel("largest prime in prime factor decomposition")
plt.title("FFT timings")
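To avoid the slow cases rather than just predict them, one option is to zero-pad the signal to a nearby length whose prime factors are all small. The sketch below is only an illustration under that assumption: largest_prime_factor and next_smooth_length are hypothetical helpers written for this example, and the limit of 5 on the prime factors is an arbitrary choice. Note that zero-padding changes the frequency grid of the result, so it is not a drop-in replacement in every application.

```python
import numpy as np

def largest_prime_factor(n):
    """Return the largest prime factor of an integer n >= 2 by trial division."""
    largest = 1
    d = 2
    while d * d <= n:
        while n % d == 0:
            largest = d
            n //= d
        d += 1
    # Whatever remains above 1 is a prime larger than any factor found so far.
    return n if n > 1 else largest

def next_smooth_length(n, max_prime=5):
    """Smallest integer >= n whose prime factors are all <= max_prime."""
    m = max(n, 2)
    while largest_prime_factor(m) > max_prime:
        m += 1
    return m

slow_len = 59601
fast_len = next_smooth_length(slow_len)

x = np.random.rand(slow_len)
# The n argument of np.fft.fft zero-pads x to fast_len before transforming.
X = np.fft.fft(x, n=fast_len)
```

SciPy also ships a ready-made helper for choosing such lengths, scipy.fft.next_fast_len, which may be preferable if SciPy is available.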

Python: faster operation for indexing

I have the following snippet that extracts the indices of all unique (hashable) values in sequence-like data with canonical indices and stores them in a dictionary as lists:
from collections import defaultdict
idx_lists = defaultdict(list)
for idx, ele in enumerate(data):
    idx_lists[ele].append(idx)
This looks to me like a quite common use case. It happens that 90% of the execution time of my code is spent in these few lines. This part is run over 10000 times during execution, and len(data) is around 50000 to 100000 each time. The number of unique elements ranges from roughly 50 to 150.
Is there a faster way, perhaps vectorized/C-extended (e.g. numpy or pandas methods), that achieves the same thing?
Many many thanks.
Not as impressive as I hoped for originally (there's still a fair bit of pure Python in the groupby code path), but you might be able to cut the time down by a factor of 2-4, depending on how much you care about the exact final types involved:
import numpy as np, pandas as pd
from collections import defaultdict
def by_dd(data):
    idx_lists = defaultdict(list)
    for idx, ele in enumerate(data):
        idx_lists[ele].append(idx)
    return idx_lists
def by_pand1(data):
    return {k: v.tolist() for k,v in data.groupby(data.values).indices.items()}
def by_pand2(data):
    return data.groupby(data.values).indices
data = pd.Series(np.random.randint(0, 100, size=10**5))
gives me
>>> %timeit by_dd(data)
10 loops, best of 3: 42.9 ms per loop
>>> %timeit by_pand1(data)
100 loops, best of 3: 18.2 ms per loop
>>> %timeit by_pand2(data)
100 loops, best of 3: 11.5 ms per loop
Though it's not the perfect solution (it's O(NlogN) instead of O(N)), a much faster, vectorized way to do it is:
def data_to_idxlists(data):
    sorting_ixs = np.argsort(data)
    uniques, unique_indices = np.unique(data[sorting_ixs], return_index = True)
    return {u: sorting_ixs[start:stop] for u, start, stop in zip(uniques, unique_indices, list(unique_indices[1:])+[None])}
Another solution that is O(N*U), (where U is the number of unique groups):
def data_to_idxlists(data):
    u, ixs = np.unique(data, return_inverse=True)
    return {u: np.nonzero(ixs==i) for i, u in enumerate(u)}
I found this question pretty interesting. While I wasn't able to get a large improvement over the other proposed methods, I did find a pure numpy method that was slightly faster than all of them.
import numpy as np
import pandas as pd
from collections import defaultdict
data = np.random.randint(0, 10**2, size=10**5)
series = pd.Series(data)
def get_values_and_indicies(input_data):
    input_data = np.asarray(input_data)
    sorted_indices = input_data.argsort() # Get the sorted indices
    # Get the sorted data so we can see where the values change
    sorted_data = input_data[sorted_indices]
    # Find the locations where the values change and include the first and last values
    run_endpoints = np.concatenate(([0], np.where(sorted_data[1:] != sorted_data[:-1])[0] + 1, [len(input_data)]))
    # Get the unique values themselves
    unique_vals = sorted_data[run_endpoints[:-1]]
    # Number of unique values (this name was used but never defined in the original snippet)
    num_values = len(unique_vals)
    # Return the unique values along with the indices associated with that value
    return {unique_vals[i]: sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist() for i in range(num_values)}
def by_dd(input_data):
    idx_lists = defaultdict(list)
    for idx, ele in enumerate(input_data):
        idx_lists[ele].append(idx)
    return idx_lists
def by_pand1(input_data):
    idx_lists = defaultdict(list)
    return {k: v.tolist() for k,v in series.groupby(input_data).indices.items()}
def by_pand2(input_data):
    return series.groupby(input_data).indices
def data_to_idxlists(input_data):
    u, ixs = np.unique(input_data, return_inverse=True)
    return {u: np.nonzero(ixs==i) for i, u in enumerate(u)}
def data_to_idxlists_unique(input_data):
    sorting_ixs = np.argsort(input_data)
    uniques, unique_indices = np.unique(input_data[sorting_ixs], return_index = True)
    return {u: sorting_ixs[start:stop] for u, start, stop in zip(uniques, unique_indices, list(unique_indices[1:])+[None])}
The resulting timings were (from fastest to slowest):
>>> %timeit get_values_and_indicies(data)
100 loops, best of 3: 4.25 ms per loop
>>> %timeit by_pand2(series)
100 loops, best of 3: 5.22 ms per loop
>>> %timeit data_to_idxlists_unique(data)
100 loops, best of 3: 6.23 ms per loop
>>> %timeit by_pand1(series)
100 loops, best of 3: 10.2 ms per loop
>>> %timeit data_to_idxlists(data)
100 loops, best of 3: 15.5 ms per loop
>>> %timeit by_dd(data)
10 loops, best of 3: 21.4 ms per loop
It should be noted that, unlike by_pand2, it returns a dict of lists as given in the example. If you would prefer to return a defaultdict, you can simply change the last line to return defaultdict(list, ((unique_vals[i], sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist()) for i in range(num_values))), which increased the overall timing in my tests to 4.4 ms.
Lastly, I should note that these timings are data sensitive. When I used only 10 different values I got:
get_values_and_indicies: 4.34 ms per loop
data_to_idxlists_unique: 4.42 ms per loop
by_pand2: 4.83 ms per loop
data_to_idxlists: 6.09 ms per loop
by_pand1: 9.39 ms per loop
by_dd: 22.4 ms per loop
while if I used 10,000 different values I got:
get_values_and_indicies: 7.00 ms per loop
data_to_idxlists_unique: 14.8 ms per loop
by_dd: 29.8 ms per loop
by_pand2: 47.7 ms per loop
by_pand1: 67.3 ms per loop
data_to_idxlists: 869 ms per loop
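One detail the sort-based variants leave open: np.argsort does not use a stable sort by default, so the indices inside each group are not guaranteed to come out in ascending order the way the defaultdict loop produces them. The sketch below, which assumes that ordering matters, requests a stable sort and splits the result with np.split. The function name index_lists_stable is made up for this example and it has not been benchmarked against the timings above.

```python
import numpy as np

def index_lists_stable(data):
    """Map each unique value to an ascending array of its positions (sketch)."""
    data = np.asarray(data)
    if data.size == 0:
        return {}
    # A stable sort keeps equal elements in their original order,
    # so each group's indices come out ascending.
    order = np.argsort(data, kind="stable")
    sorted_data = data[order]
    # Positions in the sorted array where a new value starts.
    boundaries = np.flatnonzero(sorted_data[1:] != sorted_data[:-1]) + 1
    groups = np.split(order, boundaries)
    uniques = sorted_data[np.concatenate(([0], boundaries))]
    return {u: g for u, g in zip(uniques.tolist(), groups)}

example = index_lists_stable([3, 1, 3, 2, 1, 3])
```

The values are NumPy index arrays; call .tolist() on each one if plain lists are required.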

Optimizing access on numpy arrays for numba

I recently stumbled upon numba and thought about replacing some homemade C extensions with more elegant autojitted Python code. Unfortunately, I wasn't happy with a first quick benchmark. It seems that numba is not doing much better than ordinary Python here, though I would have expected nearly C-like performance:
from numba import jit, autojit, int_, uint, double
import numpy as np
import imp
import logging
logging.getLogger('numba.codegen.debug').setLevel(logging.INFO)
def sum_accum(accmap, a):
    res = np.zeros(np.max(accmap) + 1, dtype=a.dtype)
    for i in xrange(len(accmap)):
        res[accmap[i]] += a[i]
    return res
autonumba_sum_accum = autojit(sum_accum)
numba_sum_accum = jit(double[:](int_[:], double[:]),
locals=dict(i=uint))(sum_accum)
accmap = np.repeat(np.arange(1000), 2)
np.random.shuffle(accmap)
accmap = np.repeat(accmap, 10)
a = np.random.randn(accmap.size)
ref = sum_accum(accmap, a)
assert np.all(ref == numba_sum_accum(accmap, a))
assert np.all(ref == autonumba_sum_accum(accmap, a))
%timeit sum_accum(accmap, a)
%timeit autonumba_sum_accum(accmap, a)
%timeit numba_sum_accum(accmap, a)
accumarray = imp.load_source('accumarray', '/path/to/accumarray.py')
assert np.all(ref == accumarray.accum(accmap, a))
%timeit accumarray.accum(accmap, a)
This gives on my machine:
10 loops, best of 3: 52 ms per loop
10 loops, best of 3: 42.2 ms per loop
10 loops, best of 3: 43.5 ms per loop
1000 loops, best of 3: 321 us per loop
I'm running the latest numba version from PyPI, 0.11.0. Any suggestions on how to fix the code so that it runs reasonably fast with numba?
I figured it out myself. numba wasn't able to determine the type of the result of np.max(accmap), even if the type of accmap was set to int. This somehow slowed down everything, but the fix is easy:
@autojit(locals=dict(reslen=uint))
def sum_accum(accmap, a):
    reslen = np.max(accmap) + 1
    res = np.zeros(reslen, dtype=a.dtype)
    for i in range(len(accmap)):
        res[accmap[i]] += a[i]
    return res
The result is quite impressive, about 2/3 of the run time of the C version:
10000 loops, best of 3: 192 us per loop
Update 2022:
The work on this issue led to the python package numpy_groupies, which is available here:
https://github.com/ml31415/numpy-groupies
@autojit
def numbaMax(arr):
    MAX = arr[0]
    for i in arr:
        if i > MAX:
            MAX = i
    return MAX
@autojit
def autonumba_sum_accum2(accmap, a):
    res = np.zeros(numbaMax(accmap) + 1)
    for i in xrange(len(accmap)):
        res[accmap[i]] += a[i]
    return res
10 loops, best of 3: 26.5 ms per loop <- original
100 loops, best of 3: 15.1 ms per loop <- with numba but the slow numpy max
10000 loops, best of 3: 47.9 µs per loop <- with numbamax
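Separately from numba, and only as a cross-check for this particular sum-accumulation: np.bincount with its weights argument computes the same per-bin sums in NumPy alone. This is a different technique from the jitted loop, sketched here on the assumption that accmap holds non-negative integers. The data setup mirrors the question, the reference loop uses range so it runs on Python 3, and res_bincount is a name introduced for this example.

```python
import numpy as np

# Same test data layout as in the question.
accmap = np.repeat(np.arange(1000), 2)
np.random.shuffle(accmap)
accmap = np.repeat(accmap, 10)
a = np.random.randn(accmap.size)

def sum_accum(accmap, a):
    """Reference implementation: the plain Python loop from the question."""
    res = np.zeros(np.max(accmap) + 1, dtype=a.dtype)
    for i in range(len(accmap)):
        res[accmap[i]] += a[i]
    return res

# bincount with weights adds a[i] into bin accmap[i];
# the output has max(accmap) + 1 entries and dtype float64.
res_bincount = np.bincount(accmap, weights=a)
```

Unlike the loop, np.bincount always returns float64, so it only matches exactly when a is a double array as it is here.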

(num)pythonic way to make 3d meshes for line plotting

I want to create a line between two points in 3d space:
origin = np.array((0,0,0),'d')
final = np.array((1,2,3),'d')
delta = final-origin
npts = 25
points = np.array([origin + i*delta for i in np.linspace(0,1,npts)])
But this is silly: I build a big python list and then pass it into numpy, when I'm sure there's a way to do this with numpy alone. How do the numpy wizards do something like this?
You can do away with all Python loops for this one with a little broadcasting:
origin + delta*np.linspace(0, 1, npts)[:, np.newaxis]
Perhaps use np.column_stack:
In [71]: %timeit np.column_stack([np.linspace(o,f,npts) for o,f in zip(origin,final)])
10000 loops, best of 3: 45 us per loop
In [77]: %timeit np.array([origin + i*delta for i in np.linspace(0,1,npts)])
10000 loops, best of 3: 138 us per loop
Note: Jaime's answer is faster:
In [92]: %timeit origin + (final-origin)*np.linspace(0, 1, npts)[:, np.newaxis]
10000 loops, best of 3: 21.1 us per loop
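On newer NumPy versions there is an even shorter spelling, since np.linspace accepts array-valued endpoints. The sketch below assumes NumPy 1.16 or later and reuses the question's variable names. The name expected is introduced here only to cross-check against the broadcasting answer, and no timing claim is made.

```python
import numpy as np

origin = np.array((0, 0, 0), 'd')
final = np.array((1, 2, 3), 'd')
npts = 25

# Assumes NumPy >= 1.16, where start and stop may be arrays;
# samples are stacked along axis 0 by default, giving shape (npts, 3).
points = np.linspace(origin, final, npts)

# Cross-check against the broadcasting version.
expected = origin + (final - origin) * np.linspace(0, 1, npts)[:, np.newaxis]
```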
