Python efficient summation in large 2D array

My task is fairly simple: I have a large 2D matrix containing only zeros and ones. For each position in this matrix, I want to sum all pixels in a window around this position. The problem is that the matrix has shape (166667, 17668) and window sizes range from (333, 333) to (5333, 5333). So far I have only tried it on a subset of the data. This is the code I arrived at:
import numpy as np

out_arr = np.zeros(in_arr.shape)
in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
for y in range(out_arr.shape[0]):
    for x in range(out_arr.shape[1]):
        out_arr[y, x] = np.sum(in_arr[y:y+windowsize, x:x+windowsize])
Obviously, this takes a long time. But in my case it was faster than a rolling window approach using numpy.stride_tricks.as_strided, as described here. I tried compiling it with Cython, to no effect.
What would be your suggestions to speed this up, apart from parallelizing?
I have an Nvidia Titan X at hand. Is there a way to benefit from that (e.g. using cupy)?

For windowed summation, convolution is actually overkill since a simple O(n) solution exists:
import numpy as np
from scipy.signal import convolve

def winsum(in_arr, windowsize):
    in_arr = np.pad(in_arr, windowsize//2+1, mode='reflect')[:-1, :-1]
    in_arr[0] = 0
    in_arr[:, 0] = 0
    ps = in_arr.cumsum(0).cumsum(1)
    return ps[windowsize:, windowsize:] + ps[:-windowsize, :-windowsize] \
        - ps[windowsize:, :-windowsize] - ps[:-windowsize, windowsize:]
This is already fast, but you can save even more because ps, calculated once for the largest window size, can be reused for all smaller window sizes.
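For example, that reuse could look roughly like this (a sketch that assumes all window sizes are odd, as in the question, and smaller than both array dimensions):

import numpy as np

def winsum_multi(in_arr, window_sizes):
    # build the summed-area table once for the largest window and slice it
    # with a per-window offset; for each odd w this reproduces winsum(in_arr, w)
    n, m = in_arr.shape
    wmax = max(window_sizes)
    padded = np.pad(in_arr, wmax//2+1, mode='reflect')[:-1, :-1]
    padded[0] = 0
    padded[:, 0] = 0
    ps = padded.cumsum(0).cumsum(1)
    out = {}
    for w in window_sizes:
        lo = wmax//2 - w//2   # 0 when w == wmax
        hi = lo + w
        out[w] = ps[hi:hi+n, hi:hi+m] + ps[lo:lo+n, lo:lo+m] \
            - ps[hi:hi+n, lo:lo+m] - ps[lo:lo+n, hi:hi+m]
    return out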
However, there is one potential drawback: the very large numbers that can arise from summing everything like that. A numerically more sound version eliminates this problem by taking the differences first. Downside: the extra saving from sharing ps is no longer available.
def winsum_safe(in_arr, windowsize):
    in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
    in_arr[windowsize:] -= in_arr[:-windowsize]
    in_arr[:, windowsize:] -= in_arr[:, :-windowsize]
    return in_arr.cumsum(0)[windowsize-1:].cumsum(1)[:, windowsize-1:]
For reference, here is the closest competitor, which is FFT-based convolution. You need an up-to-date version of scipy for this to work efficiently. On older versions, use fftconvolve instead of convolve.
def winsumc(in_arr, windowsize):
    in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
    kernel = np.ones((windowsize, windowsize), in_arr.dtype)
    return convolve(in_arr, kernel, 'valid')
The next one simulates scipy's old (and excruciatingly slow) behavior.
def winsum_nofft(in_arr, windowsize):
    in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
    kernel = np.ones((windowsize, windowsize), in_arr.dtype)
    return convolve(in_arr, kernel, 'valid', method='direct')
Testing and benchmarking:
from timeit import timeit
from time import perf_counter

data = np.random.random((1000, 1000))

assert np.allclose(winsum(data, 333), winsumc(data, 333))
assert np.allclose(winsum(data, 333), winsum_safe(data, 333))

kwds = dict(globals=globals(), number=10)

print('data 1000x1000, window 333x333')
print('cumsum:      ', timeit('winsum(data, 333)', **kwds)*100, 'ms')
print('cumsum safe: ', timeit('winsum_safe(data, 333)', **kwds)*100, 'ms')
print('fftconv:     ', timeit('winsumc(data, 333)', **kwds)*100, 'ms')

t = perf_counter()
res = winsum_nofft(data, 99)  # 333 just takes too long
t = perf_counter() - t
assert np.allclose(winsum(data, 99), res)
print('data 1000x1000, window 99x99')
print('conv:        ', t*1000, 'ms')
Sample output:
data 1000x1000, window 333x333
cumsum: 70.33260859316215 ms
cumsum safe: 59.98647050000727 ms
fftconv: 298.60571819590405 ms
data 1000x1000, window 99x99
conv: 135224.8261970235 ms
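As for the GPU question: CuPy mirrors the NumPy calls used above (pad, cumsum, slicing), so the cumsum approach could be ported almost verbatim. A rough, untested sketch, assuming a CuPy version whose cupy.pad supports mode='reflect' (otherwise pad on the CPU before transferring):

import cupy as cp

def winsum_gpu(in_arr, windowsize):
    # same summed-area-table idea as winsum, run on the GPU
    arr = cp.pad(cp.asarray(in_arr), windowsize//2+1, mode='reflect')[:-1, :-1]
    arr[0] = 0
    arr[:, 0] = 0
    ps = arr.cumsum(0).cumsum(1)
    out = ps[windowsize:, windowsize:] + ps[:-windowsize, :-windowsize] \
        - ps[windowsize:, :-windowsize] - ps[:-windowsize, windowsize:]
    return cp.asnumpy(out)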

@Divakar pointed out in the comments that you can use conv2d, and he is right. Here is an example:
import numpy as np
from scipy import signal

data = np.random.rand(5, 5)    # your original data that you want to sum over windows
kernel = np.ones((2, 2))       # square kernel of your window dimensions, filled with ones
output = signal.convolve2d(data, kernel, mode='same')  # the convolution, i.e. the windowed sum

Related

Efficiently using 1-D pyfftw on small slices of a 3-D numpy array

I have a 3D data cube of values of size on the order of 10,000x512x512. I want to parse a window of vectors (say 6) along dim[0] repeatedly and generate the Fourier transforms efficiently. I think I'm doing an array copy into the pyfftw package and it's giving me massive overhead. I'm going over the documentation now since I think there is an option I need to set, but I could use some extra help on the syntax.
This code was originally written by someone else using numpy.fft.rfft and accelerated with numba, but that implementation wasn't working on my workstation, so I rewrote everything and opted for pyfftw instead.
import numpy as np
import pyfftw as ftw
from tkinter import simpledialog
from math import ceil
import multiprocessing

ftw.config.NUM_THREADS = multiprocessing.cpu_count()
ftw.interfaces.cache.enable()

def runme():
    # normally I would load a file, but for Stack Overflow I'm just going to
    # generate a 3D data cube, so I'll delete references to the binary
    # saving/loading functions
    dataChunk = np.random.random((1000, 512, 512))
    numFrames = dataChunk.shape[0]
    # select the window size
    windowSize = int(simpledialog.askstring('Window Size',
        'How many frames to demodulate a single time point?'))
    numChannels = windowSize//2 + 1
    # create fftw arrays
    ftwIn = ftw.empty_aligned(windowSize, dtype='complex128')
    ftwOut = ftw.empty_aligned(windowSize, dtype='complex128')
    fftObject = ftw.FFTW(ftwIn, ftwOut)
    # perform DFT on the data chunk
    demodFrames = dataChunk.shape[0]//windowSize
    channelChunks = np.zeros([numChannels, demodFrames,
                              dataChunk.shape[1], dataChunk.shape[2]])
    channelChunks = getDFT(dataChunk, channelChunks,
                           ftwIn, ftwOut, fftObject, windowSize, numChannels)
    return channelChunks

def getDFT(data, channelOut, ftwIn, ftwOut, fftObject,
           windowSize, numChannels):
    frameLen = data.shape[0]
    demodFrames = frameLen//windowSize
    for yy in range(data.shape[1]):
        for xx in range(data.shape[2]):
            index = 0
            for i in range(0, frameLen-windowSize+1, windowSize):
                ftwIn[:] = data[i:i+windowSize, yy, xx]
                fftObject()
                channelOut[:, index, yy, xx] = 2*np.abs(ftwOut[:numChannels])/windowSize
                index += 1
    return channelOut

if __name__ == '__main__':
    runme()
What happens is I get a 4D array; the variable channelChunks. I am saving out each channel to a binary (not included in the code above, but the saving part works fine).
This process is for a demodulation project we have; the 4D data cube channelChunks is then parsed into numChannels 3D data cubes (movies), and from those we are able to separate a movie by color given our experimental setup. I was hoping I could circumvent writing a C++ function that calls the FFT on the matrix, by using pyfftw instead.
Effectively, I am taking windowSize=6 elements along axis 0 of dataChunk at a given index along axes 1 and 2 and performing a 1D FFT. I need to do this throughout the entire 3D volume of dataChunk to generate the demodulated movies. Thanks.
FFTW advanced plans can be built automatically by pyfftw.
The code could be modified in the following ways:
Real-to-complex transforms can be used instead of complex-to-complex transforms.
With pyfftw, this typically reads:
ftwIn = ftw.empty_aligned(windowSize, dtype='float64')
ftwOut = ftw.empty_aligned(windowSize//2+1, dtype='complex128')
fftObject = ftw.FFTW(ftwIn,ftwOut)
Add a few flags to the FFTW planner. For instance, FFTW_MEASURE will time different algorithms and pick the best. FFTW_DESTROY_INPUT signals that the input array can be modified: some implementation tricks can be used.
fftObject = ftw.FFTW(ftwIn,ftwOut, flags=('FFTW_MEASURE','FFTW_DESTROY_INPUT',))
Limit the number of divisions. A division costs more than a multiplication.
scale = 1.0/windowSize
for ...
    for ...
        2*np.abs(ftwOut[:, :, :])*scale   # instead of /windowSize
Avoid the multiple for loops by making use of an FFTW advanced plan through pyfftw:
nbwindow = numFrames//windowSize
# create fftw arrays
ftwIn = ftw.empty_aligned((nbwindow, windowSize, dataChunk.shape[2]), dtype='float64')
ftwOut = ftw.empty_aligned((nbwindow, windowSize//2+1, dataChunk.shape[2]), dtype='complex128')
fftObject = ftw.FFTW(ftwIn, ftwOut, axes=(1,), flags=('FFTW_MEASURE', 'FFTW_DESTROY_INPUT',))
...
for yy in range(data.shape[1]):
    ftwIn[:] = np.reshape(data[0:nbwindow*windowSize, yy, :],
                          (nbwindow, windowSize, data.shape[2]), order='C')
    fftObject()
    channelOut[:, :, yy, :] = np.transpose(2*np.abs(ftwOut[:, :, :])*scale, (1, 0, 2))
Here is the modified code. I also decreased the number of frames to 100, set the seed of the random generator to check that the outcome is not modified, and commented out the tkinter prompt. The window size should be a power of two, or a product of the small primes 2, 3, 5 and 7, so that the Cooley-Tukey algorithm can be applied efficiently; avoid large prime sizes.
import numpy as np
import pyfftw as ftw
#from tkinter import simpledialog
from math import ceil
import multiprocessing
import time

ftw.config.NUM_THREADS = multiprocessing.cpu_count()
ftw.interfaces.cache.enable()
ftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'

def runme():
    # normally I would load a file, but here we just generate a 3D data cube
    np.random.seed(seed=42)
    dataChunk = np.random.random((100, 512, 512))
    numFrames = dataChunk.shape[0]
    # select the window size
    #windowSize = int(simpledialog.askstring('Window Size',
    #    'How many frames to demodulate a single time point?'))
    windowSize = 32
    numChannels = windowSize//2 + 1
    nbwindow = numFrames//windowSize
    # create fftw arrays
    ftwIn = ftw.empty_aligned((nbwindow, windowSize, dataChunk.shape[2]), dtype='float64')
    ftwOut = ftw.empty_aligned((nbwindow, windowSize//2+1, dataChunk.shape[2]), dtype='complex128')
    #ftwIn = ftw.empty_aligned(windowSize, dtype='complex128')
    #ftwOut = ftw.empty_aligned(windowSize, dtype='complex128')
    fftObject = ftw.FFTW(ftwIn, ftwOut, axes=(1,), flags=('FFTW_MEASURE', 'FFTW_DESTROY_INPUT',))
    # perform DFT on the data chunk
    demodFrames = dataChunk.shape[0]//windowSize
    channelChunks = np.zeros([numChannels, demodFrames,
                              dataChunk.shape[1], dataChunk.shape[2]])
    channelChunks = getDFT(dataChunk, channelChunks,
                           ftwIn, ftwOut, fftObject, windowSize, numChannels)
    return channelChunks

def getDFT(data, channelOut, ftwIn, ftwOut, fftObject,
           windowSize, numChannels):
    frameLen = data.shape[0]
    demodFrames = frameLen//windowSize
    printed = 0
    nbwindow = data.shape[0]//windowSize
    scale = 1.0/windowSize
    for yy in range(data.shape[1]):
        #for xx in range(data.shape[2]):
        index = 0
        ftwIn[:] = np.reshape(data[0:nbwindow*windowSize, yy, :],
                              (nbwindow, windowSize, data.shape[2]), order='C')
        fftObject()
        channelOut[:, :, yy, :] = np.transpose(2*np.abs(ftwOut[:, :, :])*scale, (1, 0, 2))
        #for i in range(nbwindow):
        #    channelOut[:, i, yy, xx] = 2*np.abs(ftwOut[i, :])*scale
        if printed == 0:
            for j in range(channelOut.shape[0]):
                print(j, channelOut[j, 0, yy, 0])
            printed = 1
    return channelOut

if __name__ == '__main__':
    seconds = time.time()
    runme()
    print("time: ", time.time() - seconds)
Let us know how much it speeds up your computations! I went from 24s to less than 2s on my computer...

Optimizations to be had for numpy/scipy splines over large TIFF stack ndarrays?

So I'm trying to remove all background from all frames of a TIFF stack. Basically, I want to fit a spline for every row for every frame.
I know there are also ways to correct for local background, e.g. reducing overhead with naive background "rings" around located samples to handle multiple frames quickly, as well as some sort of background fitting (the implementations I've heard of are quite slow).
My version is this:
import numpy as np
import time
from scipy.interpolate import UnivariateSpline as Spline

def timeit(method):
    times = []
    def timed(*args, **kw):
        ts = time.time()
        result = method(*args, **kw)
        te = time.time()
        times.append((te - ts) * 1000)
        print('%r %2.2f ms' % (method.__name__, (te - ts) * 1000))
        return result
    return timed

# Generate something that resembles a video
img = np.random.randint(low=0, high=2**16, size=(10, 500, 400))
img = img/(2**16)  # convert to (0, 1)

@timeit
def spline_background_subtract(arr, deg, s):
    n_frames, rows, columns = arr.shape
    ix = np.arange(0, columns)  # points to evaluate the spline over
    frames = []
    for i in range(n_frames):
        frame = arr[i, :, :]
        # fit every row with a spline to estimate the background
        ls = [Spline(ix, frame[r, :], k=deg, s=s)(ix) for r in range(rows)]
        frames.append(np.row_stack(ls))  # stack all rows
    return frames

frames = spline_background_subtract(arr=img, deg=2, s=1e4)
# new_video = np.reshape(np.dstack(frames), newshape=(img.shape))
This takes about 50 ms per frame on my computer, but if I have 1000 frames and 100 movies, this quickly adds up if corrections should be done in real-time.
I've tried to trim it as much as possible. Is there anything more to gain, besides rewriting everything in a high-performance language?
EDIT
Some testing:
scipy.interpolate.RectBivariateSpline is about twice as slow...
scipy.ndimage.filters.gaussian_filter is about twice as fast, it seems! It does a very good job if the features to be isolated are relatively small (in my case they are), as they'll be smoothed out at smaller standard deviations (= faster computation).
Without benchmarking it myself, I would guess that this line is perhaps the culprit:
ls = [Spline(ix, frame[i, :], k = deg, s = s)(ix) for i in range(rows)]
If you could vectorize that operation you would get a speedup. You could also try something like a 2D B-spline in scipy, which could be faster.
Cython or Numba are also worth looking into.
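Following up on the Gaussian observation from the edit: the whole stack can be background-subtracted in a single vectorized call with scipy.ndimage.gaussian_filter1d, with no Python loop over frames or rows. A minimal sketch (the sigma value is an arbitrary placeholder, and it assumes a 1D Gaussian blur along each row is an acceptable background model):

import numpy as np
from scipy.ndimage import gaussian_filter1d

def gaussian_background_subtract(arr, sigma=25):
    # blur along the column axis of the whole (frames, rows, columns) stack;
    # gaussian_filter1d handles the leading axes in one call
    background = gaussian_filter1d(arr, sigma=sigma, axis=-1)
    return arr - background

img = np.random.randint(0, 2**16, size=(10, 500, 400)) / 2**16
cleaned = gaussian_background_subtract(img, sigma=25)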

Scipy convolve2d with subsampling like Theano's conv2d?

I wish to perform 2D convolution on images of size 600 X 400 using a 10 X 10 filter. The filter is not separable. scipy.signal.convolve2d works well for me currently but, I am expecting a lot bigger images soon.
To counter that, I have two ideas
resizing images
subsampling (or striding)?
Focusing on the subsampling part, theano has a function which does convolution the same way as scipy's convolve2d; see theano conv2d.
It also has a subsampling option. But installing theano on Windows has been painful for me. How do I get subsampling to work with scipy.signal.convolve2d? Are there any other alternatives (which don't require me to install some heavyweight library)?
You could implement the subsampling by hand; I'll only sketch the 1d case for simplicity. Say you want to sample s = d * f on a regular subgrid with spacing k. Then your nth sample is s[nk] = sum_{i=0..9} f[i] d[nk-i] (for a 10-tap filter f). The thing to observe here is that the indices of f and d always sum to a multiple of k. This suggests splitting it up into sub-sums, s[nk] = sum_{j=0..k-1} sum_i f[j+ik] d[(n-i)k-j]. So what you need to do is: subsample d and f on grids with spacing k at all offsets 0, ..., k-1, convolve all pairs of subsampled d and f whose offsets sum to 0 or k, and add the results.
Here's some code for 1d. It roughly implements the above, only the grids are placed slightly differently to make index management easier. The second function does it the stupid way, i.e. computes the full convolution and then decimates. It is for testing the first function against.
import numpy as np
from scipy import signal

def ss_conv(d1, d2, decimate):
    n = (len(d1) + len(d2) - 1) // decimate
    out = np.zeros((n,))
    for i in range(decimate):
        d1d = d1[i::decimate]
        d2d = d2[decimate-i-1::decimate]
        cv = signal.convolve(d1d, d2d, 'full')
        out[:len(cv)] += cv
    return out

def conv_ss(d1, d2, decimate):
    return signal.convolve(d1, d2, 'full')[decimate-1::decimate]
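A quick sanity check of the two against each other (arbitrary sizes: a 600-sample signal, a 10-tap filter as in the question, decimation 4):

rng = np.random.default_rng(0)
d = rng.random(600)
f = rng.random(10)
assert np.allclose(ss_conv(d, f, 4), conv_ss(d, f, 4))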
Edit: 2d version:
import numpy as np
from scipy import signal

def ss_conv_2d(d1, d2, decy, decx):
    ny = (d1.shape[0] + d2.shape[0] - 1) // decy
    nx = (d1.shape[1] + d2.shape[1] - 1) // decx
    out = np.zeros((ny, nx))
    for i in range(decy):
        for j in range(decx):
            d1d = d1[i::decy, j::decx]
            d2d = d2[decy-i-1::decy, decx-j-1::decx]
            cv = signal.convolve2d(d1d, d2d, 'full')
            out[:cv.shape[0], :cv.shape[1]] += cv
    return out

def conv_ss_2d(d1, d2, decy, decx):
    return signal.convolve2d(d1, d2, 'full')[decy-1::decy, decx-1::decx]
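Usage with the sizes from the question (a 600x400 image, a 10x10 filter, keeping every 4th output sample in each direction), checked against the decimate-after-full-convolution reference:

img = np.random.random((600, 400))
filt = np.random.random((10, 10))
sub = ss_conv_2d(img, filt, 4, 4)
assert np.allclose(sub, conv_ss_2d(img, filt, 4, 4))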

Sum of difference of squares between each combination of rows of 17,000 by 300 matrix

Ok, so I have a matrix with 17000 rows (examples) and 300 columns (features). I want to compute basically the euclidian distance between each possible combination of rows, so the sum of the squared differences for each possible pair of rows.
Obviously it's a lot, and iPython, while not completely crashing my laptop, says "(busy)" for a while; then I can't run anything anymore and it certainly seems to have given up, even though I can move my mouse and everything.
Is there any way to make this work? Here's the function I wrote. I used numpy everywhere I could.
What I'm doing is storing the differences in a difference matrix for each possible combination. I'm aware that the lower triangular part of the matrix equals the upper triangular part, but that would only save half the computation time (better than nothing, but not a game changer, I think).
EDIT: I just tried using scipy.spatial.distance.pdist, but it's been running for a good minute now with no end in sight; is there a better way? I should also mention that I have NaN values in there... but that's not a problem for numpy, apparently.
features = np.array(dataframe)
distances = np.zeros((17000, 17000))

def sum_diff():
    for i in range(17000):
        for j in range(17000):
            diff = np.array(features[i] - features[j])
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares
You could always divide your computation time by 2, noticing that d(i, i) = 0 and d(i, j) = d(j, i).
But have you had a look at sklearn.metrics.pairwise.pairwise_distances() (in v0.18, see the doc here)?
You would use it as:
from sklearn.metrics import pairwise
import numpy as np
a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]])
pairwise.pairwise_distances(a)
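One caveat: pairwise_distances returns Euclidean distances, while the question asks for sums of squared differences, so the result would need to be squared. A small sketch:

from sklearn.metrics import pairwise
import numpy as np

a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]])
# squaring the Euclidean distances gives the sums of squared differences
sq_dists = pairwise.pairwise_distances(a) ** 2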
The big thing with numpy is to avoid using loops and to let it do its magic with the vectorised operations, so there are a few basic improvements that will save you some computation time:
import numpy as np
import timeit

# I reduced the problem size to 1000*300 to keep the timing in a reasonable range
n = 1000
features = np.random.rand(n, 300)
distances = np.zeros((n, n))

def sum_diff():
    for i in range(n):
        for j in range(n):
            diff = np.array(features[i] - features[j])
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares

# Here I removed the unnecessary copy induced by calling np.array
# -> some improvement
def sum_diff_v0():
    for i in range(n):
        for j in range(n):
            diff = features[i] - features[j]
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares

# Collapsing of the statements -> no improvement
def sum_diff_v1():
    for i in range(n):
        for j in range(n):
            distances[i][j] = np.sum(np.square(features[i] - features[j]))

# Using broadcasting and vectorized operations -> big improvement
def sum_diff_v2():
    for i in range(n):
        distances[i] = np.sum(np.square(features[i] - features), axis=1)

# Computing only half the distances -> 1/2 computation time
def sum_diff_v3():
    for i in range(n):
        distances[i][i+1:] = np.sum(np.square(features[i] - features[i+1:]), axis=1)
    distances[:] = distances + distances.T
print("original :",timeit.timeit(sum_diff, number=10))
print("v0 :",timeit.timeit(sum_diff_v0, number=10))
print("v1 :",timeit.timeit(sum_diff_v1, number=10))
print("v2 :",timeit.timeit(sum_diff_v2, number=10))
print("v3 :",timeit.timeit(sum_diff_v3, number=10))
Edit: For completeness, I also timed Camilleri's solution, which is much faster:
from sklearn.metrics import pairwise

def Camilleri_solution():
    distances = pairwise.pairwise_distances(features)
Timing results (in seconds, function run 10 times with 1000*300 input):
original : 138.36921879299916
v0 : 111.39915344800102
v1 : 117.7582511530054
v2 : 23.702392491002684
v3 : 9.712442981006461
Camilleri's : 0.6131987979897531
So as you can see, we can easily gain an order of magnitude by using the proper numpy syntax. Note that with only about 1/17th of the rows, the fastest looped version (v3) runs in about one second per call, so I would expect the whole thing to take tens of minutes, since the script scales as N^2.
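If you want to drop the remaining Python loop entirely (a sketch, not part of the benchmark above), the usual trick is to expand ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b so that everything becomes a single matrix product:

import numpy as np

def sum_diff_vectorized(features):
    # squared Euclidean distances for all pairs via one matrix product
    sq_norms = np.einsum('ij,ij->i', features, features)
    d = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    # rounding can leave tiny negative values (e.g. on the diagonal); clip them
    return np.clip(d, 0.0, None)

For the full 17000x300 input this builds the same 17000x17000 float64 matrix as the original code (a bit over 2 GB).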

scipy parallel cdist with multiprocessing

I have a big matrix with millions of rows and hundreds of columns.
The first n rows (about 100K) are reference rows, and for the others I would like to find the k (about 10) closest neighbours among the reference vectors with scipy cdist.
I created a multiprocessing.sharedctypes.Array from the matrix, and use asarray and slicing to split up the matrix and compute distances with cdist.
My current code looks like this:
import numpy
from multiprocessing import Pool, sharedctypes
from scipy.spatial.distance import cdist

shared_m = None

def generate_sample():
    m = numpy.random.random((20000, 640))
    shape = m.shape
    global shared_m
    shared_m = sharedctypes.Array('d', m.flatten(), lock=False)
    return shape

def dist(A, B, metric):
    return cdist(A, B, metric=metric)

def get_closest(args):
    shape, lenA, start_B, end_B, metric, numres = args
    m = numpy.ctypeslib.as_array(shared_m)
    m.shape = shape
    A = numpy.asarray(m[:lenA, :], dtype=numpy.double)
    B = numpy.asarray(m[start_B:end_B, :], dtype=numpy.double)
    distances = dist(B, A, metric)
    # rest of code to find the closest neighbours

def p_get_closest(shape, lenA, lenB, metric="cosine", sample_size=1000, numres=10):
    p = Pool(4)
    args = ((shape, lenA, i, i + sample_size, metric, numres)
            for i in xrange(lenB / sample_size))
    res = p.map_async(get_closest, args)
    return res.get()

def main():
    shape = generate_sample()
    p_get_closest(shape, 5000, shape[0] - 5000, "cosine", 3000, 10)

if __name__ == "__main__":
    main()
My problem right now is that the parallel calls of cdist somehow block each other. (Maybe "block" is the wrong word; the problem is that there are no parallel cdist computations.)
I tried to trace the problem with printouts in scipy/spatial/distance.py and scipy/spatial/src/distance.c to understand where the run blocks. It looks like there is no copying of data; the dtype arguments took care of that.
When putting printf into distance.c:cdist_cosine(), it shows that all the processes reach the point where the actual computation starts (before the for loops), but the computations don't run in parallel.
I tried a lot of things, like using multiprocessing.sharedctypes.RawArray instead of Array, or using lock=True when creating the Array.
I have no idea what I did wrong or how to investigate the problem further.
