Nested for loop optimization in Python

I want to optimize these two nested for loops into a single loop; is there any way to do this, as the length of the array is very large?
A = [1, 4, 2, 6, 9, 10, 80]  # length of list is very large
B = []
for x in A:
    for y in A:
        if x != y:
            B.append(abs(x - y))
print(B)

Not any better performance-wise, but more Pythonic:
B = [abs(x-y) for x in A for y in A if x!=y]
Unless you absolutely need the duplicates (abs(x-y) == abs(y-x)), you can halve your list (and thus the computation):
B = [abs(A[i]-A[j]) for i in range(len(A)) for j in range(i+1, len(A))]
Finally, you can use the power of NumPy to get close to C speed:
import numpy as np
A = np.array(A)
A.shape = -1,1 # make it a column vector
diff = np.abs(A - A.T) # diff is the matrix of abs differences
# grab upper triangle of order 1 (i.e. less the diagonal)
B = diff[np.triu_indices(len(A), k=1)]
But this will always be O(n^2) no matter what...
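Equivalently, the halved computation can be written with itertools.combinations from the standard library (a small sketch; it is still O(n^2) work):

from itertools import combinations
# each unordered pair (x, y), with x appearing before y in A, is visited exactly once
B = [abs(x - y) for x, y in combinations(A, 2)]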


Find the missing starting number of shuffled cumsum series

a and b are two arrays of floats of length n each. a can have both negative and positive entries.
b is the cumulative sum of a, offset by a constant k: b[0] != a[0]; in fact, b[0] = a[0] + k.
Both a and b are shuffled such that the relative order between them is maintained, i.e., if, say, a[0] becomes a[6] then b[0] will become b[6], and so on.
Can someone suggest an algorithm to find k for randomly shuffled a and b whose relative order is maintained?
My naive attempt is below (it takes forever for n >= 10):
import numpy as np
import itertools
def get_starting_point(a, b):
    for msk in itertools.permutations(range(len(a))):  # NOTE: Takes forever for n >= 10.
        new_a = a[list(msk)]
        new_b = b[list(msk)]
        k = new_b[0] - new_a[0]
        new_a = np.cumsum(new_a) + k
        if np.nansum(np.abs(new_b - new_a)) < 0.001:
            return k
    return None
Generate samples of a, b and expected k to try your solution:
def get_a_b_k(n=14):
    a = np.round(np.random.uniform(low=-10, high=10, size=(n,)), 2)
    b = np.cumsum(a)
    prob = np.random.uniform(0, 1)
    if prob < 0.4:
        k = np.round(np.random.uniform(-10, 10), 2)
    # NOTE: this elif can be removed as it's just a sub-case of the else block.
    elif prob < 0.6:  # k same as the last b.
        k = b[n-1]
        a[n-2] -= k
    else:  # k same as one of b's
        idx = np.random.choice(n, size=1)
        k = b[idx]
        a[idx] -= k
    b = np.cumsum(a)
    msk = np.random.choice(n, size=n, replace=False)  # Randomly generated mask of size n.
    return a[msk], b[msk] + k, k
We have:
b = np.cumsum(a) + k
We can compute b - a to get the previous elements of the sum: b[i] - a[i] equals the b value of the preceding element, except at the starting position, where it equals k. Thus the only element of b - a that does not belong to b indicates the position of the start.
As we are working with floating point numbers, we need a function to match floating point values. I used isin_tolerance that is defined here.
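For completeness, here is a minimal sketch of what such a tolerance-based membership test could look like (the linked isin_tolerance may be implemented differently; this version assumes 1-D inputs and uses a sorted search):

import numpy as np

def isin_tolerance(values, targets, tol):
    # For each entry of `values`, check whether some entry of `targets`
    # lies within `tol` of it, by testing the nearest sorted neighbours.
    targets = np.sort(targets)
    idx = np.searchsorted(targets, values)
    left = np.clip(idx - 1, 0, len(targets) - 1)
    right = np.clip(idx, 0, len(targets) - 1)
    return (np.abs(values - targets[left]) <= tol) | (np.abs(values - targets[right]) <= tol)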
def solve(a, b):
    m = isin_tolerance(b - a, b, 1e-8)
    return (b[~m] - a[~m])[0]
np.random.seed(0)
for i in range(1_000_000):
    a, b, k = get_a_b_k()
    assert np.isclose(k, solve(a, b))
This takes a few minutes to run on 1M attempts but did not fail. On 10k tests with n=200 this runs in ~2s.
NB. This could fail if, coincidentally, k is equal to one of the values in b, but this is fairly unlikely and did not happen in my random tests.
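For illustration, a hand-picked (hypothetical) example of that unlucky case: with a = [2.0, -2.0, 5.0] and k = 7.0 we get b = [9.0, 7.0, 12.0], so k coincides with b[1]; every element of b - a = [7.0, 9.0, 7.0] is then matched in b and solve has no unmatched entry left to return:

a = np.array([2.0, -2.0, 5.0])          # hypothetical values chosen so that a prefix sum is 0
k = 7.0
b = np.cumsum(a) + k                    # [ 9.  7. 12.]  -> k equals b[1]
print(isin_tolerance(b - a, b, 1e-8))   # [ True  True  True]  -> solve() finds nothing to return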

Is there a better way to search a sorted list if the other list is sorted too?

In the NumPy library, one can pass a list into the numpy.searchsorted function, which searches a different list one element at a time and returns an array of the same size containing the indices needed to preserve order. However, this seems to waste performance if both lists are sorted. For example:
m=[1,3,5,7,9]
n=[2,4,6,8,10]
numpy.searchsorted(m,n)
would return [1, 2, 3, 4, 5], which is the correct answer, but it looks like this has complexity O(n ln(m)), whereas if one were to simply loop through m while keeping a pointer into n, the complexity would be more like O(n + m). Is there some kind of function in NumPy that does this?
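For concreteness, a rough sketch of such a two-pointer merge (plain Python, not an existing NumPy function):

def merge_search(m, n):
    # For each element of the sorted list n, find its insertion index in the
    # sorted list m by advancing a single pointer; O(len(m) + len(n)) overall.
    out = []
    i = 0
    for x in n:
        while i < len(m) and m[i] < x:
            i += 1
        out.append(i)
    return out

print(merge_search([1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))  # [1, 2, 3, 4, 5]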
AFAIK, it is not possible to do that in linear time with NumPy alone without making additional assumptions on the inputs (e.g. the integers are small and bounded). An alternative solution is to use Numba to do the merge manually:
import numba as nb
import numpy as np

# Note: Numba requires a function signature with well defined array types
@nb.njit('int64[:](int64[::1], int64[::1])')
def search_both_sorted(a, b):
    i, j = 0, 0
    result = np.empty(b.size, np.int64)
    while i < a.size and j < b.size:
        if a[i] < b[j]:
            i += 1
        else:
            result[j] = i
            j += 1
    for k in range(j, b.size):
        result[k] = i
    return result

a, b = np.cumsum(np.random.randint(0, 100, (2, 1000000)).astype(np.int64), axis=1)
result = search_both_sorted(a, b)
A faster implementation consists of using a branchless approach so as to remove the overhead of branch mis-prediction (especially on random/unpredictable inputs) when a and b are about the same size. Additionally, the O(n log m) algorithm can be faster when b is small, so using np.searchsorted in that case is very efficient, as pointed out by @MichaelSzczesny. Note that the Numba implementation of np.searchsorted can be a bit slower than NumPy's, so it is better to pick the NumPy implementation in that case. Here is the optimized version:
@nb.njit('int64[:](int64[::1], int64[::1])')
def search_both_sorted_opt_numba(a, b):
    sa, sb = a.size, b.size
    # Choose the best algorithm
    if sb < sa * 0.15:
        # Use a version with branches because `a[i] < b[j]`
        # should be most of the time true.
        i, j = 0, 0
        result = np.empty(b.size, np.int64)
        while i < a.size and j < b.size:
            if a[i] < b[j]:
                i += 1
            else:
                result[j] = i
                j += 1
        for k in range(j, b.size):
            result[k] = i
    else:
        # Use a branchless approach to avoid mis-predictions
        i, j = 0, 0
        result = np.empty(b.size, np.int64)
        while i < a.size and j < b.size:
            tmp = a[i] < b[j]
            result[j] = i
            i += tmp
            j += ~tmp
        for k in range(j, b.size):
            result[k] = i
    return result

def search_both_sorted_opt(a, b):
    sa, sb = a.size, b.size
    # Choose the best algorithm
    if 2 * sb * np.log2(sa) < sa + sb:
        return np.searchsorted(a, b)
    else:
        return search_both_sorted_opt_numba(a, b)
searchsorted: 19.1 ms
snp_search: 11.8 ms
search_both_sorted: 6.5 ms
search_both_sorted_branchless: 4.3 ms
The optimized branchless Numba implementation is about 4.4 times faster than searchsorted, which is pretty good considering that the code of searchsorted is already highly optimized. It can be even faster when a and b are huge, thanks to cache locality.
You could use sortednp; unfortunately it does not offer much flexibility. In the code snippet below I used its merge with index tracking, but it produces three arrays, so roughly four times more memory than necessary is used; still, it is faster than searchsorted.
import numpy as np
import sortednp as snp
a = np.cumsum(np.random.rand(1000000))
b = np.cumsum(np.random.rand(1000000))
def snp_search(a, b):
    m, (ib, ia) = snp.merge(b, a, indices=True)
    return ib - np.arange(len(ib))
assert(np.all(snp_search(a,b) == np.searchsorted(a,b)))
np.searchsorted(a, b); #58 ms
snp_search(a,b); # 22ms
np.searchsorted takes this into account already as can be seen from the source code:
/*
* Updating only one of the indices based on the previous key
* gives the search a big boost when keys are sorted, but slightly
* slows down things for purely random ones.
*/
if (cmp(last_key_val, key_val)) {
    max_idx = arr_len;
}
else {
    min_idx = 0;
    max_idx = (max_idx < arr_len) ? (max_idx + 1) : arr_len;
}
Here min_idx, max_idx are used to perform binary search on the array. If last_key_val < key_val then only max_idx is reset to the array length, but min_idx remains at its current value, i.e. binary search starts at the same lower boundary as for the previous key.
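A rough Python sketch of that idea (not NumPy's actual implementation): reuse the previous lower bound for the next key, so that sorted keys make the total work closer to a linear merge:

import bisect

def searchsorted_sorted_keys(arr, keys):
    # If the current key is >= the previous one, keep the previous result as
    # the lower bound of the binary search; otherwise restart from the left.
    result = []
    lo = 0
    prev = None
    for key in keys:
        if prev is not None and key < prev:
            lo = 0
        idx = bisect.bisect_left(arr, key, lo, len(arr))
        result.append(idx)
        lo = idx
        prev = key
    return result

print(searchsorted_sorted_keys([1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))  # [1, 2, 3, 4, 5]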

Numpy rotation matrix multiplication

I want to calculate and multiply a sequence of rotation matrices using NumPy. I've written this code to do the job:
def npmat(angle_list):
    aa = np.full((nn, n, n), np.eye(n))
    c = 0
    for j in range(1, n):
        for i in range(j):
            th = angle_list[c]
            aa[c, i, i] = aa[c, j, j] = np.cos(th)
            aa[c, i, j] = np.sin(th)
            aa[c, j, i] = -np.sin(th)
            c += 1
    return np.linalg.multi_dot(aa)

n, nn = 3, 3
# nn = n*(n-1)//2
angle_list = np.array([1.06426904, 0.27106789, 0.56149785])

npmat(angle_list) returns:
array([[ 0.46742875,  0.6710055 ,  0.57555363],
       [-0.84250501,  0.53532228,  0.06012796],
       [-0.26776049, -0.51301235,  0.81555052]])
But I have to apply this function over 10K times, and it is very slow; it feels like I am not using NumPy to its full potential. Is there a more efficient way to do this in NumPy?
EDIT: Since it seems like you are looking for the product of these matrices, you can apply the matrices without constructing them. It also makes sense to compute the cosines and sines of all angles in one vectorized step first.
n = 3
nn = n*(n-1)//2
theta_list = np.array([1.06426904, 0.27106789, 0.56149785])
sin_list = np.sin(theta_list)
cos_list = np.cos(theta_list)

A = np.eye(n)
c = 0
for i in range(1, n):
    for j in range(i):
        ri = np.copy(A[i])
        rj = np.copy(A[j])
        A[i] = cos_list[c]*ri + sin_list[c]*rj
        A[j] = -sin_list[c]*ri + cos_list[c]*rj
        c += 1
print(A.T)  # transpose at the end because it's faster to update A[i] than A[:, i]
If you want to compute each of the matrices explicitly, here is a vectorized version of some of your original code.
n=4
nn= n*(n-1)//2
theta_list = np.random.rand(nn)*2*np.pi
sin_list = np.sin(theta_list)
cos_list = np.cos(theta_list)
aa = np.full((nn, n, n),np.eye(n))
ii,jj = np.tril_indices(n,k=-1)
cc = np.arange(nn)
aa[cc,ii,ii] = cos_list[cc]
aa[cc,jj,jj] = cos_list[cc]
aa[cc,ii,jj] = -sin_list[cc]
aa[cc,jj,ii] = sin_list[cc]
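To collapse this stack of per-angle matrices into a single product, as the original npmat does with multi_dot, one can then, for example:

A = np.linalg.multi_dot(aa)  # chain-multiply all nn rotation matrices into one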
A solution with more levels of vectorisation:
def npmats(angle):
    a, b = angle.shape
    aa = np.full((a, b, n, n), np.eye(n))
    cosangle = np.cos(angle)
    sinangle = np.sin(angle)
    c = 0
    for j in range(1, n):
        for i in range(j):
            aa[:, c, i, i] = aa[:, c, j, j] = cosangle[:, c]
            aa[:, c, i, j] = sinangle[:, c]
            aa[:, c, j, i] = -sinangle[:, c]
            c += 1
    bb = np.empty((a, n, n))
    for i in range(a):
        bb[i] = np.linalg.multi_dot(aa[i])
    return bb
It seems reasonably fast:
In [9]: angle= np.random.rand(10000,nn)
In [10]: %time res = npmats(angle)
Wall time: 205 ms

Sparse matrix multiplication when results' sparsity is known (in python|scipy|cython)

Suppose we want to compute C=A*B for given sparse matrices A,B but are interested in a very small subset of entries of C, represented by a list of index pairs:
rows=[i1, i2, i3 ... ]
cols=[j1, j2, j3 ... ]
Both A and B are quite large (say 50Kx50K), but very sparse (<1% of entries is non-zero).
How can we compute this subset of the multiplication?
Here's a naive implementation that is really slow:
def naive(A, B, rows, cols):
    N = len(rows)
    vals = []
    for n in xrange(N):
        v = A.getrow(rows[n]) * B.getcol(cols[n])
        vals.append(v[0, 0])
    R = sps.coo_matrix((np.array(vals), (np.array(rows), np.array(cols))), shape=(A.shape[0], B.shape[1]), dtype=np.float64)
    return R
Even for small matrices this is quite bad:
import scipy.sparse as sps
import numpy as np
D = 1000
A = np.random.randn(D, D)
A[np.abs(A) > 0.1] = 0
A = sps.csr_matrix(A)
B = np.random.randn(D, D)
B[np.abs(B) > 0.1] = 0
B = sps.csr_matrix(B)
X = np.random.randn(D, D)
X[np.abs(X) > 0.1] = 0
X[X != 0] = 1
X = sps.csr_matrix(X)
rows, cols = X.nonzero()
naive(A, B, rows, cols)
On my machine, naive() finishes after 1 minute, and most of the effort is spent on structuring the rows/cols (in getrow(), getcol()).
Of course, converting this (very small) example to dense matrices, the computation takes about 100ms:
A0 = np.array(A.todense())
B0 = np.array(B.todense())
X0 = np.array(X.todense())
A0.dot(B0) * X0
Any thoughts on how to efficiently compute such matrix multiplication?
Note: This question is almost identical to the following question:
Subset of a matrix multiplication, fast, and sparse
However, there, A and B are full matrices, and one of the dimensions is very low (say, 10); the proposed solutions seem to benefit from both of these properties.
The format of your sparse matrices is important here. You always need a row from A and a column from B. So, store A as CSR and B as CSC to get rid of the getrow/getcol overhead. Unfortunately, this is only a small part of the story.
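As a small sketch of what that buys you: with A in CSR and B in CSC format you can read the indptr/indices/data arrays directly and skip getrow/getcol entirely (this assumes the index arrays are duplicate-free, which is the usual canonical form):

def entry(A_csr, B_csc, i, j):
    # dot product of row i of A (CSR) with column j of B (CSC),
    # using the raw index/data arrays instead of getrow/getcol
    a_lo, a_hi = A_csr.indptr[i], A_csr.indptr[i + 1]
    b_lo, b_hi = B_csc.indptr[j], B_csc.indptr[j + 1]
    b_vals = dict(zip(B_csc.indices[b_lo:b_hi], B_csc.data[b_lo:b_hi]))
    return sum(A_csr.data[p] * b_vals[A_csr.indices[p]]
               for p in range(a_lo, a_hi)
               if A_csr.indices[p] in b_vals)

# usage with the example above: entry(A, B.tocsc(), rows[0], cols[0])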
The best solution depends a lot on the structure of your sparse matrices (a lot of sparse columns/rows, etc.), but you might try one based on dictionaries and sets. For matrix A, the following are kept for each row:
a set with all non-zero column indices on that row
a dictionary with the non-zero indices as keys and the corresponding non-zero values as values
For matrix B similar dicts and sets are kept for each column.
To calculate element (M, N) in the multiplication result, row M of A is multiplied with column N of B. The multiplication:
find the set intersection of the non-zero sets
calculate the sum of multiplications of the non-zero elements (i.e. the intersection above)
In most cases this should be very fast, as in a sparse matrix the set intersection is usually very small.
Some code:
class rowarray():
    def __init__(self, arr):
        self.rows = []
        for row in arr:
            nonzeros = np.nonzero(row)[0]
            nzvalues = { i: row[i] for i in nonzeros }
            self.rows.append((set(nonzeros), nzvalues))

    def __getitem__(self, key):
        return self.rows[key]

    def __len__(self):
        return len(self.rows)

class colarray(rowarray):
    def __init__(self, arr):
        rowarray.__init__(self, arr.T)

def maybe_less_naive(A, B, rows, cols):
    N = len(rows)
    vals = []
    for n in xrange(N):
        nz1, v1 = A[rows[n]]
        nz2, v2 = B[cols[n]]
        # list of common non-zeros
        nz = nz1.intersection(nz2)
        # sum of non-zeros
        vals.append(sum([v1[i]*v2[i] for i in nz]))
    R = sps.coo_matrix((np.array(vals), (np.array(rows), np.array(cols))), shape=(len(A), len(B)), dtype=np.float64)
    return R
D = 1000
Ap = np.random.randn(D, D)
Ap[np.abs(Ap) > 0.1] = 0
A = rowarray(Ap)
Bp = np.random.randn(D, D)
Bp[np.abs(Bp) > 0.1] = 0
B = colarray(Bp)
X = np.random.randn(D, D)
X[np.abs(X) > 0.1] = 0
X[X != 0] = 1
X = sps.csr_matrix(X)
rows, cols = X.nonzero()
maybe_less_naive(A, B, rows, cols)
This is a bit more efficient: the multiplication takes approximately 2 seconds for the test (80,000 elements). The results seem to be essentially the same.
A few comments on the performance.
There are two operations performed for each output element:
set intersection
multiplication
The complexity of set intersection should be O(min(m,n)) where m and n are the numbers of non-zeros in each operand. This is invariant of the size of the matrix, only the average number of non-zeros per row/column is important.
The number of multiplications (and dict lookups) depends on the number of non-zeros found in the intersection above.
If both matrices have randomly distributed non-zeros with probability (density) p, and the row/column length is n, then:
set intersection: O(np)
dictionary lookup, multiplication: O(np^2)
This shows that with really sparse matrices finding the intersections is the critical point. This can also be verified by profiling; most of the time is spent calculating the intersections.
When this is reflected in the real world, we seem to spend around 20 µs per row/column of 80 non-zeros. This is not blindingly fast, and the code can certainly be made faster. Cython may be one solution, but this may be one of the problems where Python is not the best possible tool. A simple linear matching (merge-sort-type algorithm) over sorted integers should be at least an order of magnitude faster when written in C.
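For reference, a sketch of that merge-style matching on sorted index arrays (written in pure Python here; the same loop is what one would write in C or Cython):

def sorted_intersection_dot(idx1, val1, idx2, val2):
    # idx1/idx2: sorted non-zero indices of the two operands,
    # val1/val2: the corresponding non-zero values.
    # Walk both index lists once and accumulate products at common indices.
    i = j = 0
    total = 0.0
    while i < len(idx1) and j < len(idx2):
        if idx1[i] < idx2[j]:
            i += 1
        elif idx1[i] > idx2[j]:
            j += 1
        else:
            total += val1[i] * val2[j]
            i += 1
            j += 1
    return total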
One important thing to note is that the algorithm can be run in parallel for several output elements at a time. There is no need to settle for a single thread, as the calculations are independent as long as each thread handles its own output points.

Shuffle in one dimension of a matrix (efficiently)?

I was trying to write a function that takes a matrix of 2D points and a probability p, and swaps each point's coordinates with probability p.
So I asked a question where I was trying to use a binary sequence as the array of powers of a specific matrix, swap_matrix = [[0,1],[1,0]], to randomly swap (with a specific proportion) the coordinates of a given set of 2D points. However, I realised that the power function only accepts integer values and not arrays. And shuffle, as far as I can tell, works on the whole matrix; you cannot specify a specific dimension.
Having either of these two functions is OK.
For example:
swap(a=[[1,2],[2,3],[3,4],[3,5],[5,6]],b=[0,0,0,1,1])
should return [[1,2],[2,3],[3,4],[5,3],[6,5]]
The idea that just popped up, and which I am now adding in this edit, is:
def swap(mat, K, N):
    # where K/N is the proportion and K and N are natural numbers
    # mat is an N*2 matrix; I plan to randomly swap the coordinates
    # of each row or keep them as they are
    a = [[[0,1],[1,0]]]
    b = [[[1,0],[0,1]]]
    a = np.repeat(a, K, axis=0)
    b = np.repeat(b, N-K, axis=0)
    out = np.append(a, b, axis=0)
    np.random.shuffle(out)
    return np.multiply(mat, out.T)
Here I get an error because I cannot flatten only once to make the matrices multipliable!
Again, I am looking for an efficient method (vectorized, in the MATLAB sense).
P.S. In my special case the matrix has shape (N, 2) with the second column all ones, if that helps.
Maybe this is good enough for your purposes. In a quick test it appears to be about 13x faster than the blunt for-loop approach (@Naji, posting your "inefficient" code was helpful for making a comparison).
Edited my code following Jaime's comment
def swap(a, b):
    a = np.copy(a)
    b = np.asarray(b, dtype=np.bool)
    a[b] = a[b, ::-1]  # equivalent to: a[b] = np.fliplr(a[b])
    return a

# the following is faster, but modifies the original array
def swap_inplace(a, b):
    b = np.asarray(b, dtype=np.bool)
    a[b] = a[b, ::-1]

print swap(a=[[1,2],[2,3],[3,4],[3,5],[5,6]], b=[0,0,0,1,1])
Outputs:
[[1 2]
 [2 3]
 [3 4]
 [5 3]
 [6 5]]
Edit to include more detailed timings
I wanted to know if I could speed this up further with Cython, so I investigated the efficiency some more :-) The results are worth mentioning I think (since efficiency is part of the actual question), but I do apologize in advance for the amount of additional code.
First the results.. The "cython" function is clearly the fastest of all, another 10x faster than the proposed Numpy solution above. The "blunt loop approach" I mentioned is given by the function named "loop", but as it turns out there are much faster methods conceivable. My pure Python solution is only 3x slower than the vectorized Numpy code above! Another thing to note is that "swap_inplace" was most of the time only marginally faster than "swap". Also the timings vary a bit with different random matrices a and b... So now you know :-)
function     | millisec | normalized
-------------+----------+-----------
loop         | 184      | 10.
double_loop  | 84       | 4.7
pure_python  | 51       | 2.8
swap         | 18       | 1
swap_inplace | 17       | 0.95
cython       | 1.9      | 0.11
And the rest of the code I used (it seems I took this way too seriously :P):
def loop(a, b):
    a_c = np.copy(a)
    for i in xrange(a.shape[0]):
        if b[i]:
            a_c[i, :] = a[i, ::-1]
    return a_c

def double_loop(a, b):
    a_c = np.copy(a)
    n, m = a_c.shape
    for i in xrange(n):
        if b[i]:
            for j in xrange(m):
                a_c[i, j] = a[i, m-j-1]
    return a_c

from copy import copy

def pure_python(a, b):
    a_c = [copy(row) for row in a]  # copy each row; a shallow copy of `a` would share the inner lists
    n, m = len(a), len(a[0])
    for i in xrange(n):
        if b[i]:
            for j in xrange(m):
                a_c[i][j] = a[i][m-j-1]
    return a_c

import pyximport; pyximport.install()
import testcy

def cython(a, b):
    return testcy.swap(a, np.asarray(b, dtype=np.uint8))

def rand_bin_array(K, N):
    arr = np.zeros(N, dtype=np.bool)
    arr[:K] = 1
    np.random.shuffle(arr)
    return arr

N = 100000
a = np.random.randint(0, N, (N, 2))
b = rand_bin_array(0.33*N, N)

# before timing the pure python solution I first did:
a = a.tolist()
b = b.tolist()
######### In the file testcy.pyx #########
#cython: boundscheck=False
#cython: wraparound=False
import numpy as np
cimport numpy as np
def swap(np.ndarray[np.int_t, ndim=2] a, np.ndarray[np.uint8_t, ndim=1] b):
    cdef np.ndarray[np.int_t, ndim=2] a_c
    cdef int n, m, i, j
    a_c = a.copy()
    n = a_c.shape[0]
    m = a_c.shape[1]
    for i in range(n):
        if b[i]:
            for j in range(m):
                a_c[i, j] = a[i, m-j-1]
    return a_c
