How to speed up nested for loop in python

How to speed up nested for loop in python - python

I want to calculate the cM between two different windows along a chromosome.
My code has three nested loops.
For sample, I use random number stand for the recombination map.
import random
windnr = 54800
w, h = windnr, windnr
recmatrix = [[0 for x in range(w)] for y in range(h)]
#Generate 54800 random numbers between 10 and 30
rec_map = random.sample(range(0, 30), 54800)
for i in range(windnr):
for j in range(windnr):
recmatrix[i][j] = 0.25 * rec_map[i] #mean distance within own window
if i > j:
recmatrix[i][j] = recmatrix[i][j] + 0.5 * rec_map[j] #+ mean rdistance final window
for k in range(i-1,j,-1):
recmatrix[i][j] = recmatrix[i][j] + rec_map[k] #add all windows between i and j
if i < j:
recmatrix[i][j] = recmatrix[i][j] + 0.5 * rec_map[j] #+ mean distance final window
for k in range(i+1,j):
recmatrix[i][j] = recmatrix[i][j] + rec_map[k] #add all windows between i and j
#j += 1
if i % 10 == 0:
print("window {}".format(i))
#i += 1
The calculation costs a lot of time. I have to calculate almost 7 days for my data.
Can I speed up the nested for loop within 10 hours?
How can I increase the performance?
Although the 2D array has 3 billion items (~96 GB when being floats), I would rule out hard disk swapping issues, since the server which does the computation has 200 GB of RAM.

Using Numpy will make your application much faster. It's written in C/C++, so it does not suffer from slow loops in Python.
I'm doing my tests on an old Intel Xeon X5550 with 2 sockets, 8 cores and 96 GB of triple channel RAM. I don't have much experience with Numpy, so bear with me, if below code is not optimal.
Array initialization
Already the initialization is much faster:
recmatrix = [[0 for x in range(w)] for y in range(h)]
needs 24 GB of RAM (integers) and takes 3:28 minutes on my PC. Whereas
recmatrix = np.zeros((windnr, windnr), dtype=np.int)
is finished after 50 ms. But since you need floats anyway, start with floats from the beginning:
recmatrix = np.zeros((windnr, windnr), dtype=np.float)
Random samples
The code
#Generate 54800 random numbers between 10 and 30
rec_map = random.sample(range(0, 30), 54800)
did not work for me, so I replaced it and increased k for more stable measurements
rec_map = random.choices(range(0, 30), k=5480000)
which runs in 2.5 seconds. The numpy replacement
rec_map = np.random.choice(np.arange(0, 30), size=5480000)
is done in 0.1 seconds.
The loop
The loop will need most work, since you'll avoid Python loops in Numpy whenever possible.
For example, if you have an array and want to multiply all elements by 2, you would not write a loop but simply multiply the whole array:
import numpy as np
single = np.random.choice(np.arange(0, 10), size=100)
doubled = single * 2
print(single, "\r\n", doubled)
I don't fully understand what the code does, but let's apply that strategy on the first part of the loop. The original is
for i in range(windnr):
for j in range(windnr):
recmatrix[i][j] = 0.25 * rec_map[i] #mean distance within own window
and it takes 18.5 seconds with a reduced windnr = 5480. The numpy equivalent should be
column = 0.25 * rec_map_np
recmatrix = np.repeat(column, windnr)
and is done within 0.25 seconds. Also note: since we're assigning the variable here, we don't need the zero initialization at all.
For the if i>j: and if i<j: parts, I see that the first line is identical
recmatrix[i][j] = recmatrix[i][j] + 0.5 * rec_map[j]
That means, this calculation is applied to all elements except the ones on the diagonal. You can use a mask for that:
mask = np.ones((windnr, windnr), dtype=bool)
np.fill_diagonal(mask, False)
rec_map_2d = np.repeat(0.5 * rec_map_np, windnr-1)
recmatrix[mask] += rec_map_2d
This took only 1:20 minutes for all 54800 elements, but reached my RAM limit at 93 GB.

Usually in python looping always take much time. So if possible then in your case then use map this will save a lot of time for you. Where you are using a iter(list) so it will be good for this script.
example:
def func():
your code
nu = (1, 2, 3, 4)
output = map(func, nu)
print(output)

Related

Speeding up vector distance calculation using Numba

Below are some of the functions I wrote for distance (square) calculation in 3-D toroidal geometry for a collection of particles in that 3-D space:
import itertools
import time
import numpy as np
import scipy
import numba
from numba import njit
#njit(cache=True)
def get_dr2(i=np.array([]),j=np.array([]),cellsize=np.array([])):
k=np.zeros(3,dtype=np.float64)
dr2=0.0
for idx in numba.prange(cellsize.shape[0]):
k[idx] = (j[idx]-i[idx])-cellsize[idx]*np.rint((j[idx]-i[idx])/cellsize[idx])
dr2+=k[idx]**2
return dr2
#numba.guvectorize(["void(float64[:],float64[:],float64[:],float64[:])"],
"(m),(m),(m)->()",nopython=True,cache=True)
def get_dr2_vec(i,j,cellsize,dr2):
dr2[:]=0.0
k=np.zeros(3,dtype=np.float64)
for idx in numba.prange(cellsize.shape[0]):
k[idx] = (j[idx]-i[idx])-cellsize[idx]*np.rint((j[idx]-i[idx])/cellsize[idx])
dr2[0]+=k[idx]**2
#njit(cache=True)
def pair_vec_gen(pIList=np.array([[]]),pJList=np.array([[]])):
assert pIList.shape[1] == pJList.shape[1]
vecI=np.zeros((pIList.shape[0]*pJList.shape[0],pIList.shape[1]))
vecJ=np.zeros_like(vecI)
for i in numba.prange(pIList.shape[0]):
for j in numba.prange(pJList.shape[0]):
for k in numba.prange(pIList.shape[1]):
vecI[j+pJList.shape[0]*i][k]=pIList[i][k]
vecJ[j+pJList.shape[0]*i][k]=pJList[j][k]
return vecI,vecJ
#njit(cache=True)
def pair_vec_dist(pIList=np.array([[]]),pJList=np.array([[]]),cellsize=np.array([])):
assert pIList.shape[1] == pJList.shape[1]
vecI=np.zeros((pIList.shape[0]*pJList.shape[0],pIList.shape[1]))
vecJ=np.zeros_like(vecI)
r2List=np.zeros(vecI.shape[0])
for i in numba.prange(pIList.shape[0]):
for j in numba.prange(pJList.shape[0]):
for k in numba.prange(pIList.shape[1]):
vecI[j+pJList.shape[0]*i][k]=pIList[i][k]
vecJ[j+pJList.shape[0]*i][k]=pJList[j][k]
r2List=get_dr2_vec2(vecI,vecJ,cellsize)
return r2List
#njit(cache=True)
def get_dr2_vec2(i=np.array([[]]),j=np.array([[]]),cellsize=np.array([])):
dr2=np.zeros(i.shape[0],dtype=np.float64)
k=np.zeros(i.shape[1],dtype=np.float64)
for m in numba.prange(i.shape[0]):
for n in numba.prange(i.shape[1]):
k[n] = (j[m,n]-i[m,n])-cellsize[n]*np.rint((j[m,n]-i[m,n])/cellsize[n])
dr2[m]+=k[n]**2
return dr2
def pair_dist_calculator_cdist(pIList=np.array([[]]),pJList=np.array([[]]),cellsize=np.array([])):
assert pIList.shape[1] == pJList.shape[1]
r2List = (scipy.spatial.distance.cdist(pIList, pJList, metric=get_dr2_wrapper(cellsize=cellsize))).flatten()
return np.array(r2List).flatten()
def get_dr2_wrapper(cellsize=np.array([])):
return lambda u, v: get_dr2(u,v,cellsize)
frames=50
timedata=np.zeros((5,frames),dtype=np.float64)
N, dim = 100, 3 # 100 particles in 3D
cellsize=np.array([26.4,19.4,102.4])
for i in range(frames):
print("\rIter {}".format(i),end='')
vec = np.random.random((N, dim))
rList1=[];rList2=[];rList3=[];rList4=[];rList5=[]
#method 1
#print("method 1")
start = time.perf_counter()
for (pI, pJ) in itertools.product(vec, vec):
rList1.append(get_dr2(pI,pJ,cellsize))
end =time.perf_counter()
timedata[0,i]=(end-start)
#method 2
#print("method 2")
pIvec=[];pJvec=[];rList2=[]
start = time.perf_counter()
for (pI, pJ) in itertools.product(vec, vec):
pIvec.append(pI)
pJvec.append(pJ)
rList2=get_dr2_vec(np.array(pIvec),np.array(pJvec),cellsize)
end =time.perf_counter()
timedata[1,i]=(end-start)
#method 3
#print("method 3")
start = time.perf_counter()
rList3=get_dr2_vec(*pair_vec_gen(vec,vec),cellsize)
end =time.perf_counter()
timedata[2,i]=(end-start)
#method 4
#print("method 4")
start = time.perf_counter()
rList4=pair_vec_dist(vec,vec,cellsize)
end =time.perf_counter()
timedata[3,i]=(end-start)
#method 5
#print("method 5")
#start = time.perf_counter()
#rList5=pair_dist_calculator_cdist(np.array(pIvec),np.array(pJvec),cellsize)
#end =time.perf_counter()
#timedata[4,i]=(end-start)
assert (rList1 == rList2).all()
assert (rList2 == rList3).all()
assert (rList3 == rList4).all()
#assert rList4 == rList5
print("\n")
for i in range(4):
print("Method {} Average time {:.3g}s \u00B1 {:.3g}s".format(i+1,np.mean(timedata[i,1:]),np.std(timedata[i,1:])))
exit()
The essential idea is that at a particular time you have a snapshot of the particles or frame which contains the position of the particles. Now we can calculate all the distances between the particles we can use the following approaches:
Calculate distance between points iteratively in pure python; passing the combination of the position of the two particles one by one via Numba.
Create an iteration list (in pure python) beforehand and pass the whole list to a Numba #guvectorize function
Do (2) but all steps in Numba
Integrate all step in (3) to a simple Numba function
(optional) parse the positions to scipy.spatial.distance.cdist with the distance function as the distance metric.
For 50 frames containing 100 particles we have the respective times (frames, N = 50, 100):
Method 1 Average time 0.017s ± 0.00555s
Method 2 Average time 0.0181s ± 0.00573s
Method 3 Average time 0.00182s ± 0.000944s
Method 4 Average time 0.000485s ± 0.000348s
For 50 frames containing 1000 particles we have the respective times (frames, N = 50, 1000):
Method 1 Average time 2.11s ± 0.977s
Method 2 Average time 2.42s ± 0.859s
Method 3 Average time 0.349s ± 0.12s
Method 4 Average time 0.0694s ± 0.022s
and for 1000 frames containing 100 particles we have the respective times (frames, N = 1000, 100):
Method 1 Average time 0.0244s ± 0.0166s
Method 2 Average time 0.0288s ± 0.0254s
Method 3 Average time 0.00258s ± 0.00231s
Method 4 Average time 0.000636s ± 0.00086s
(All the time shown above are after removing the contribution from the first iteration)
Method 5 simply fails due to memory requirements and is much slower in comparison to any other method
Given the above dataset, I tend to prefer Method 4 though I am a bit concerned about the average time increase when I increase frames from 50 to 1000. Is there any further optimizations I can do in these implementations or if someone has ideas for much faster and memory conscious implementations? Any suggestions are welcome.
Update
Based on Jerome's answer the modified function is now:
#njit(cache=True,parallel=True)
def pair_vec_dist(pIList=np.array([[]]),pJList=np.array([[]]),cellsize=np.array([])):
assert pIList.shape[1] == pJList.shape[1]
assert cellsize.size == 3
dr2=np.zeros(pIList.shape[0]*pJList.shape[0],dtype=np.float64)
inv_cellsize = 1.0 / cellsize
for i in numba.prange(pIList.shape[0]):
for j in range(pJList.shape[0]):
offset = j + pJList.shape[0] * i
xdist = pJList[j,0]-pIList[i,0]
ydist = pJList[j,1]-pIList[i,1]
zdist = pJList[j,2]-pIList[i,2]
xk = xdist-cellsize[0]*np.rint(xdist*inv_cellsize[0])
yk = ydist-cellsize[1]*np.rint(ydist*inv_cellsize[1])
zk = zdist-cellsize[2]*np.rint(zdist*inv_cellsize[2])
dr2[offset] = xk**2+yk**2+zk**2
return dr2
As Jerome pointed out that a very simple optimization could be running the loops through just the "lower half of the symmetric matrix" the distance calculation creates, though in a realistic situation I might have vector lists as pI and pJ where pI is a subset of pJ, which complicates this situation. Either I have to create two separate functions and control them via a wrapper function or somehow manage that in one single function. If there are any suggestions on how to do so that would be really helpful.
Update 2
I should clarify the problem furthermore. In this code I am trying to calculate distance between all points in a frame/snapshot, which is used further for pair distance distribution analysis. But in some cases we might want to focus on a subset of coordinates in a frame and calculate the distribution from their perspective. In such a case we select this subset smallVec from a pool of all coordinates vec (such that smallVec +restOfVec = vec) and calculate pair_vec_dist(smallVec,vec) instead of pair_vec_dist(vec,vec). For this calculation one can use list(pair_vec_dist(smallVec,smallVec)).append(pair_vec_dist(smallVec,restOfVec).
Based on the discussion with Jerome, I modified my function as:
#njit(cache=True,parallel=True)
def pair_vec_dist_cmb(pIList=np.array([[]]),pJList=np.array([[]]),cellsize=np.array([]),is_sq=True,is_nonsq=True):
assert pIList.shape[1] == pJList.shape[1]
assert cellsize.size == 3
dr2_1=0; dr2_2=0
dr2_1=int(0.5*pIList.shape[0]*(pIList.shape[0]+1))
if is_nonsq:
dr2_2=int(pIList.shape[0]*pJList.shape[0])
dr2 = np.zeros((dr2_1+dr2_2),dtype=np.float64)
inv_cellsize = 1.0 / cellsize
for j in numba.prange(0,pIList.shape[0],1):
if is_sq:
for i in range(j,pIList.shape[0],1):
index_1 = int(0.5*i*(i+1)+j)
xdist = pIList[j,0]-pIList[i,0]
ydist = pIList[j,1]-pIList[i,1]
zdist = pIList[j,2]-pIList[i,2]
xk = xdist-cellsize[0]*np.rint(xdist*inv_cellsize[0])
yk = ydist-cellsize[1]*np.rint(ydist*inv_cellsize[1])
zk = zdist-cellsize[2]*np.rint(zdist*inv_cellsize[2])
dr2[index_1] = xk**2+yk**2+zk**2
if is_nonsq:
for j in range(pJList.shape[0]):
index_2 = dr2_1+ j + pJList.shape[0] * i
xdist = pJList[j,0]-pIList[i,0]
ydist = pJList[j,1]-pIList[i,1]
zdist = pJList[j,2]-pIList[i,2]
xk = xdist-cellsize[0]*np.rint(xdist*inv_cellsize[0])
yk = ydist-cellsize[1]*np.rint(ydist*inv_cellsize[1])
zk = zdist-cellsize[2]*np.rint(zdist*inv_cellsize[2])
dr2[index_2] = xk**2+yk**2+zk**2
return dr2
Where pI (size: (N,3)) is the subset of pJ (size (M,3). In this code we subdivide the calculation into two sections: pair distance between pI-pI, which is symmetric and hence we can calculate only the lower triangular matrix i.e. N(N-1)/2 unique values. The other section is pI-pJ distances where we have to go through M(M-N) unique values. To further optimize the function, I have two additional changes:
Combining the outer loop for both sections. In order to do so I am now iterating over the upper triangular matrix which translates to N(N+1)/2 values. One can also implement an if check to see if coordinates are identical, though I am not sure how much time it would save.
To avoid appending the results from the two section together, I am predefining and partitioning the returned array by length.
A further assumption I have made is that time needed for partitioning vec into smallVec and restOfVec is negligent with respect to the pair distance calculation. Obviously, if wrong, one might need to rethink another optimization pathway.
The resultant function is 1.5 times faster than the previous function. I am looking to further optimize the function, but I am very new to loop tilling and other advanced optimizations, so if you have any suggestions, please let me know.
Update 3
So I figured that I should focus on making the function more optimized in terms of serial calculations as I might simply use Dask or multiprocessing to implement to work on multiple sections of an input collection of frames. So the reference function now is:
#njit(cache=True,parallel=False, fastmath=True, boundscheck=False, nogil=True)
def pair_vec_dist_test(pIList,pJList,cellsize):
_I=pIList.shape[0]
_J=pJList.shape[0]
dr2 = np.empty(int(_I*_J),dtype=np.float32)
inv_cellsize = 1.0 / cellsize
for i in numba.prange(pIList.shape[0]):
for j in range(pJList.shape[0]):
index = j + pJList.shape[0] * i
xdist = pJList[j,0]-pIList[i,0]
ydist = pJList[j,1]-pIList[i,1]
zdist = pJList[j,2]-pIList[i,2]
xk = xdist-cellsize[0]*np.rint(xdist*inv_cellsize[0])
yk = ydist-cellsize[1]*np.rint(ydist*inv_cellsize[1])
zk = zdist-cellsize[2]*np.rint(zdist*inv_cellsize[2])
dr2[index] = xk**2+yk**2+zk**2
return dr2
Going back to the main problem while ignoring the symmetry aspect, I tried to further optimize the distance function as:
#njit(cache=True,parallel=False, fastmath=True, boundscheck=False, nogil=True)
def pair_vec_dist_test_v2(pIList,pJList,cellsize):
_I=pIList.shape[0]
_J=pJList.shape[0]
dr2 = np.empty(int(_I*_J),dtype=np.float32)
inv_cellsize = 1.0 / cellsize
tile=32
for ii in range(0,_I,tile):
for jj in range(0,_J,tile):
for i in range(ii,min(_I,ii+tile)):
for j in range(jj,min(_J,jj+tile)):
index = j + _J * i
xdist = pJList[j,0]-pIList[i,0]
ydist = pJList[j,1]-pIList[i,1]
zdist = pJList[j,2]-pIList[i,2]
xk = xdist-cellsize[0]*np.rint(xdist*inv_cellsize[0])
yk = ydist-cellsize[1]*np.rint(ydist*inv_cellsize[1])
zk = zdist-cellsize[2]*np.rint(zdist*inv_cellsize[2])
dr2[index] = xk**2+yk**2+zk**2
return dr2
which is essentially tiling up the two vector arrays. However I couldn't get any speedup as the exec time for both functions are roughly the same. I also thought about working with the transpose of the vector arrays, but I couldn't figure out how to align them in a loop when the vector lengths are not a multiple of tile length. Does anyone has any further suggestions or ideas on how to procced?
Edit: Another failed trial
#njit(cache=True,parallel=False, fastmath=True, boundscheck=False, nogil=True)
def pair_vec_dist_test_v3(pIList,pJList,cellsize):
inv_cellsize = 1.0 / cellsize
tile=32
_I=pIList.shape[0]
_J=pJList.shape[0]
vecI=np.empty((_I+2*tile,3),dtype=np.float64) # for rolling effect
vecJ=np.empty((_J+2*tile,3),dtype=np.float64) # for rolling effect
vecI_mask=np.ones((_I+2*tile),dtype=np.uint8)
vecJ_mask=np.ones((_J+2*tile),dtype=np.uint8)
vecI[:_I]=pIList
vecJ[:_J]=pJList
vecI[_I:]=0.
vecJ[_J:]=0.
vecI_mask[_I:]=0
vecI_mask[_J:]=0
#print(vecI,vecJ)
ILim=_I+(tile-_I%tile)
JLim=_J+(tile-_J%tile)
dr2 = np.empty((ILim*JLim),dtype=np.float64)
vecI=vecI.T
vecJ=vecJ.T
for ii in range(ILim):
for jj in range(0,JLim,tile):
index = jj + JLim*ii
#print(ii,jj,index)
mask = np.multiply(vecJ_mask[jj:jj+tile],vecI_mask[ii:ii+tile])
xdist = vecJ[0,jj:jj+tile]-vecI[0,ii:ii+tile]
ydist = vecJ[1,jj:jj+tile]-vecI[1,ii:ii+tile]
zdist = vecJ[2,jj:jj+tile]-vecI[2,ii:ii+tile]
xk = xdist-cellsize[0]*np.rint(xdist*inv_cellsize[0])
yk = ydist-cellsize[1]*np.rint(ydist*inv_cellsize[1])
zk = zdist-cellsize[2]*np.rint(zdist*inv_cellsize[2])
arr = xk**2+yk**2+zk**2
dr2[index:index+tile] = np.multiply(arr,mask)
return dr2

First things first: there are races conditions in your current code. This basically means the produced results can be corrupted (and it also impact performance). In practice, this causes an undefined behaviour. For example, k[n] is read by multiple thread in get_dr2_vec2. One need to be very careful when using prange. In this case, the race condition can be removed by just not using a temporary array which is not really useful and not using prange in the inner loop due to dr2[m] being updated (updating it from multiple threads also cause a race condition).
Moreover, prange is often not practically useful when parallel=True is not set in the Numba decorator. Indeed, the current functions are not parallel since this flag is missing.
Finally, you can merge the function pair_vec_dist and get_dr2_vec2 and the internal loops so to avoid creating and filling large temporary arrays. Indeed, the RAM throughput is pretty small nowadays compared to the computing power of modern processor. This gap is getting bigger since the last two decades. This effect is called the "memory wall" and it is not expected to disappear any time soon. Codes less memory-bound generally tends to be faster and scale better.
Here is the resulting code:
#njit(cache=True, parallel=True)
def pair_vec_dist(pIList=np.array([[]]),pJList=np.array([[]]),cellsize=np.array([])):
assert pIList.shape[1] == pJList.shape[1]
dr2=np.zeros(pIList.shape[0]*pJList.shape[0],dtype=np.float64)
inv_cellsize = 1.0 / cellsize
for i in numba.prange(pIList.shape[0]):
for j in range(pJList.shape[0]):
offset = j + pJList.shape[0] * i
for k in range(pIList.shape[1]):
tmp = pJList[j,k]-pIList[i,k]
k = tmp-cellsize[k]*np.rint(tmp*inv_cellsize[k])
dr2[offset] += k**2
return dr2
It is 11 times faster with frames=50 and N=1000 on my 6-core machine (i5-9600KF).
The code can be optimized further. For example, dr2 is a flatten symmetric square matrix, so only the upper-right part needs to be computed and the bottom-left part can just be copied. Note that to do that efficiently in parallel, the work needs to be balanced between the thread (otherwise, the slowest will not be faster and will be the bottleneck). One can also generate an optimized version of the function only supporting cellsize.size == 3. Moreover, one can use register tiling so to make the code more cache-friendly. Finally, one can transpose the input so the layout is more SIMD-friendly (this certainly require the loop to be manually unrolled and the register tiling optimization to be done before).

Storing list from nested loop into a matrix in python

I'm trying to compute the estimation errors for a monte-carlo integration of the integral below for a number of different sample sizes.
import numpy as np
N = 20
sample_size = np.zeros(N, dtype=int)
truetheta = 0.4
# for loop creates sample size 2, 4, 8, 16 ....1024...
for n in range(N):
sample_size[n] = 2**(n+1)
# for loop which computes estimation errors for different sample sizes
naive_error = []
for i in (sample_size):
x = np.random.uniform(0, 1, i)
y = 2*(2*x -1)**4
naive_error.append(abs(np.sum((y)/i)-truetheta))
Now, this code yields a list of 20 estimation errors, one for each sample size. However, I also want to to produce a matrix with M values for each sample size. This seems like a very simple operation but I am new to python and I'm struggling hard with the syntax. I was thinking of putting my first loop into another loop from 1 to M. But I'm not sure how to store an entire list from a loop into a Matrix or how to properly setup the nested loop. Suggestions for solutions to my problem would be much appreciated.

This code produces the matrix:
N = 20
sample_size = np.zeros(N, dtype=int)
truetheta = 0.4
# for loop creates sample size 2, 4, 8, 16 ....1024...
for n in range(N):
sample_size[n] = 2**(n+1)
M = 10
CRUDE_ERROR = np.zeros((M, N)) ##matrix to store errors in
for j in range(0, M):
naive_error = [] ##list to store errors in
for i in (sample_size):
x = np.random.uniform(0, 1, i)
y = 2*(2*x -1)**4
naive_error.append(abs(np.sum((y)/i)-truetheta)**2)
CRUDE_ERROR[ j] = naive_error

How to minimize this distance faster with Numpy? (find shifting-index for which two signals are close to each other)

Given an array x of length 1000, and y of length 500k, we can compute the index k for which x is the closest to "y-shifted by k indices":
mindistance = np.inf # infinity
for k in range(len(y)-1000):
t = np.sum(np.power(x-y[k:k+1000],2))
if t < mindistance:
mindistance = t
index = k
print index
# x is close to y[index:index+N]
According to my tests, this seems to be numerically costly. Is there a clever numpy way to compute it faster?
Note: It seems that if I replace the length of x from 1000 to 100, it doesn't change much the time taken for the computation. The slowness seems to come mostly from the for k in range(...) loop. How to speed it up?

This can be done with np.correlate which computes not the coefficient of correlation (as one might guess), but simply the sum of products like x[n]*y[m] (here m is n plus some shift). Since
(x[n] - y[m])**2 = x[n]**2 - 2*x[n]*y[m] + y[m]**2
we can get the sum of squares of differences from this, by adding the sums of squares of x and of a part of y. (Actually, the sum of x[n]**2 will not depend on the shift, since we'll always get just np.sum(x**2), but I'll include it all the same.) The sum of a part of y**2 can also be found in this way, by replacing x with an all-ones array of the same size, and y with y**2.
Here is an example.
import numpy as np
x = np.array([3.1, 1.2, 4.2])
y = np.array([8, 5, 3, -2, 3, 1, 4, 5, 7])
diff_sq = np.sum(x**2) - 2*np.correlate(y, x) + np.correlate(y**2, np.ones_like(x))
print(diff_sq)
This prints [39.89 45.29 11.69 39.49 0.09 12.89 23.09] which are indeed the required distances from x to various parts of y. Pick the smallest with argmin.

A little benchmark in addition to user6655984's wonderful answer:
import numpy as np
import time
x = np.random.rand(1000) # random array of size 1k
y = np.random.rand(100*1000) # random array of size 100k
print "Naive method"
start = time.time()
mindistance = np.inf
for k in range(len(y)-1000):
t = np.sum(np.power(x-y[k:k+1000],2))
if t < mindistance:
mindistance = t
index = k
print index, mindistance
print "%.2f seconds\n" % (time.time() - start)
print "Correlation method"
start = time.time()
diff_sq = np.sum(x**2) - 2*np.correlate(y, x) + np.correlate(y**2, np.ones_like(x))
i = np.argmin(diff_sq)
print i, diff_sq[i]
print "%.2f seconds\n" % (time.time() - start)
We get a x 145 speed improvement factor :)
Naive method
60911 143.6153965841267
8.75 seconds
Correlation method
60911 143.6153965841267
0.06 seconds

The minimum of the SSD distance ("sum of squared difference") is the maximum of the correlation.
Correlations are known to be computed efficiently (in time N Log N instead of NM), by the famous FFT.
With N=1000 and M=500000 you can expect a speedup.

Mean over a sub-interval Python

I am using Python but since I am noob I can't figure out how to compute the average of a vector each, let's say, 100 elements in a larger for-loop.
My trial so far, which is not what I want is
import numpy as np
r = np.zeros(10000) # declare my vector
for i in range(0,2000): # start the loop
r[i] = i**2 # some function to compute and save
if (i%100 == 0): # each time I save 100 elements I want the mean
av_r = np.mean(r)
print(av_r)
My code do not work as I want because I would like to make the average of 100 elements only then pass to the other 100, compute the mean and go on.
I try to reduce the dimension of the vector and clean it into the if:
import numpy as np
r = np.zeros(100) # declare my vector
for i in range(0,2000): # start the loop
r[i] = i**2 # some function to compute and save
if (i%100 == 0): # each time I save 100 elements I want the mean
av_r = np.mean(r)
print(av_r)
r = np.zeros(100)
naively, I thought you may save 100 elements, compute the average clean the vector and continue the calculation saving the other elements from 100+1 to 200+1 but it give me errors. In particular:
IndexError: index 100 is out of bounds for axis 0 with size 100
Many thanks for your help.

Is this what you're looking for? This code will iterate from 0 to 2000 in intervals of 100, mapping some function (x -> x**2) over each interval, calculating the mean and printing the result.
import numpy as np
r = np.zeros(10000)
for i in range(0, 2000, 100):
interval = [x ** 2 for x in r[i:i + 100]]
av_r = np.mean(interval)
print(av_r)
The output from this is just a series of 20 0.0.

the error you probably have encountered is an arrays out of bounds (IndexError: index 100 is out of bounds for axis 0 with size 100), because your index ranges from 0 to 1999 and you're doing
r[i] = i**2 # some function to compute and save
on a 100-sized array.
Fix:
r[i%100] = i**2 # some function to compute and save

whether to use python to the huge computing?

Sorry for my bad English.I am currently work with python and i have a problem with too slow to filling of matrix size of 10000x10000.I program a "relaxation method" where i need to test the different size of matrix.What do you think about that?...P.S I wait 3-5 min to fill one matrix 10000x10000.
def CalcMatrix(self):
for i in range(0, self._n + 1): #(0,10000)
for j in range(0, self._n + 1): #(0,10000)
one = sin(pi * self._x[i]) # x is the vector of size 10000
two = sin(pi * self._y[j]) # y too
self._f[i][j] = 2 * pi * pi * one * two #fill

Native Python loop speed is very slow. People working with arrays and matrices in Python usually use numpy. There are other tools like cython and numba which can dramatically improve the speed in certain circumstances, but the basic idea of numpy is to vectorize the operations and push the hard work down to fast libraries implemented in C and fortran.
The following code takes only a few seconds on my not-very-fast notebook:
import numpy as np
from numpy import pi
x = np.linspace(0,1,10**4)
y = np.linspace(2,5,10**4)
ans = 2*pi**2 * np.outer(np.sin(pi*x), np.sin(pi*y))
(PS: If your _n == 10000, then won't your matrix be 10001x10001, not 10000x10000?)

Some improvements are possible. Consider moving some of the computations out of the loop
def CalcMatrix(self):
for i in range(0, self._n + 1): #(0,10000)
one = sin(pi * self._x[i]) # x is the vector of size 10000
for j in range(0, self._n + 1): #(0,10000)
two = sin(pi * self._y[j]) # y too
self._f[i][j] = 2 * pi * pi * one * two #fill
This value 2 * pi * pi can be precomputed and stored in a variable so it doesn't have to be recomputed each time in the loop.
If that is still not enough, consider using a native language like C or Fortran.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to speed up nested for loop in python - python

Usually in python looping always take much time. So if possible then in your case then use map this will save a lot of time for you. Where you are using a iter(list) so it will be good for this script. example: def func(): your code nu = (1, 2, 3, 4) output = map(func, nu) print(output)

Related

Speeding up vector distance calculation using Numba

Storing list from nested loop into a matrix in python

How to minimize this distance faster with Numpy? (find shifting-index for which two signals are close to each other)

Mean over a sub-interval Python

whether to use python to the huge computing?

Categories

Resources