I am trying to perform a 2D by 1D matrix multiply. Specifically:
import numpy as np
s = np.ones(268)
one_d = np.ones(9422700)
s_newaxis = s[:, np.newaxis]
goal = s_newaxis * one_d
While the dimensions above match my problem ((268, 1) and (9422700,)), the actual values in my arrays are a mix of very large and very small numbers. The all-ones toy example runs fine, but with my actual data goal = s_newaxis * one_d exhausts my RAM.
I recognize that, at the end of the day, this amounts to a matrix with ~2.5 billion values and so a heavy memory footprint is to be expected. However, any improvement in terms of efficiency would be welcome.
For completeness, I've included a rough attempt. It is not very elegant, but it is just enough of an improvement that it won't crash my computer (admittedly a low bar).
import gc

def arange(start, stop, step):
    # Like np.arange, but always includes the endpoint (`stop`).
    arr = np.arange(start=start, stop=stop, step=step)
    if arr.size == 0 or arr[-1] < stop:
        return np.append(arr, [stop])
    return arr

left, all_arrays = 0, []
for right in arange(10, stop=s_newaxis.shape[0], step=10):
    chunk = s_newaxis[left:right, :] * one_d
    all_arrays.append(chunk)
    left = right
    gc.collect()  # unclear if this makes any difference... I suspect not.
goal = np.vstack(all_arrays)
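One thing worth sketching: the chunked loop above keeps every chunk and vstacks them at the end, so the total footprint is unchanged. Two ways to actually shrink peak RAM are a smaller dtype and a disk-backed memmap. The sketch below uses hypothetical toy sizes and a hypothetical goal.npy path; the real shapes would be (268, 1) and (9422700,).

```python
import numpy as np

# Hypothetical toy sizes; the real shapes would be (268, 1) and (9422700,).
s = np.ones(4)
one_d = np.ones(6)

# Option 1: halve the footprint by computing in float32 instead of float64.
goal32 = s[:, np.newaxis].astype(np.float32) * one_d.astype(np.float32)

# Option 2: stream row-chunks into a disk-backed .npy memmap, so only one
# chunk is resident in RAM at a time.
out = np.lib.format.open_memmap("goal.npy", mode="w+",
                                dtype=np.float64, shape=(s.size, one_d.size))
step = 2  # chunk height; tune to the RAM you can spare
for left in range(0, s.size, step):
    right = min(left + step, s.size)
    out[left:right] = s[left:right, np.newaxis] * one_d
out.flush()
```

open_memmap writes the result to disk in .npy format, so it can later be reopened with np.load(..., mmap_mode='r') without ever holding the full matrix in memory.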
I have an index manipulation problem that I can solve, but it's slow. I'm looking for a way to speed this up.
I have a large (m, 2) float array (think 2D point coordinates) and a large array idx of indices into points. (A typical operation is to pick indices out: points[idx].) From idx, I sometimes need to delete a few entries. I can do that with a boolean mask, but the operation is slow, presumably because the entire array is rewritten in memory. An alternative would be numpy's masked arrays: masking itself is fast, of course, but unfortunately masked arrays don't work as indices: the mask is simply ignored.
MWE:
import numpy
# setup
points = numpy.random.rand(10, 2)
n = 5 # can be very large irl
idx = numpy.random.randint(0, 10, n)
# typical operation with idx:
# points[idx]
# a few entries are deleted
mask = numpy.ones(n, dtype=bool)
mask[2] = False # only a few are masked
idx = idx[mask] # takes a while
# alternative: use ma?
idx = numpy.random.randint(0, 10, n)
idx = numpy.ma.array(idx)
idx[2] = numpy.ma.masked
# Doesn't work, masking is ignored:
points[idx]
Any hints on how to speed this up?
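One direction worth sketching (an assumption, not from the post): if the order of entries in idx does not matter, a deleted entry can be overwritten with the last live entry and the logical length shrunk, which is O(1) per deletion instead of rewriting the whole array. delete_unordered is a hypothetical helper name.

```python
import numpy as np

points = np.random.rand(10, 2)
idx = np.random.randint(0, 10, 100)
n = idx.size  # logical length; entries past it are considered dead

def delete_unordered(idx, n, pos):
    # Overwrite the deleted slot with the last live entry and shrink the
    # logical length: O(1) per deletion, but the order of idx changes.
    idx[pos] = idx[n - 1]
    return n - 1

n = delete_unordered(idx, n, 2)
live = idx[:n]        # a view, no copy
pts = points[live]    # the typical operation still works
```

The trade-off is that idx is no longer in its original order, so this only helps if downstream code treats idx as a bag of indices rather than a sequence.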
I have a fairly large NumPy array that I need to perform an operation on, but when I do, my ~2GB array requires ~30GB of RAM. I've read that NumPy can be fairly clumsy with memory usage, but this seems excessive.
Does anyone know of an alternative way to apply these operations to limit the RAM load? Perhaps row-by-row/in place etc.?
Code below (ignore the meaningless calculation, in my code the coefficients vary):
import xarray as xr
import numpy as np
def optimise(data):
data_scaled_offset = (((data - 1000) * (1 / 1)) + 1).round(0)
return data_scaled_offset.astype(np.uint16)
# This could also be float32 but I'm using uint16 here to reduce memory load for demo purposes
ds = np.random.randint(0, 12000, size=(40000,30000), dtype=np.uint16)
ds = optimise(ds) # Results in ~30GB RAM usage
By default, operations like multiplication and addition allocate a new array for the result. Instead, you can call the function forms, numpy.multiply, numpy.add, and so on, and pass the out parameter so the result is written into an existing array. That will significantly reduce the memory usage. Please see the demo below and translate your code to use those functions instead:
arr = np.random.rand(100)
arr2 = np.random.rand(100)
arr3 = np.subtract(arr, 100, out=arr)
arr4 = arr+100
arr5 = np.add(arr, arr2, out=arr2)
arr6 = arr+arr2
print(arr is arr3) # True
print(arr is arr4) # False
print(arr2 is arr5) # True
print(arr2 is arr6) # False
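Applied to the optimise function from the question, a sketch might look like this (assumptions: one float32 working copy is acceptable, the (1 / 1) factor is a no-op and is dropped, and the demo values start at 1000 so the subtraction never goes negative):

```python
import numpy as np

def optimise_out(data):
    # One float32 working copy is the only large allocation; every later
    # step reuses it through the out= parameter.
    buf = data.astype(np.float32)
    np.subtract(buf, 1000, out=buf)
    np.add(buf, 1, out=buf)
    np.round(buf, 0, out=buf)
    return buf.astype(np.uint16)

# Small demo; values start at 1000 so (data - 1000) never goes negative.
ds = np.random.randint(1000, 12000, size=(100, 100), dtype=np.uint16)
res = optimise_out(ds)
```

Peak usage is then roughly one float32 copy plus the uint16 result, rather than a chain of float64 temporaries.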
You could use e.g. Numba or Cython to reduce memory usage.
Of course a simple Python loop would also be possible, but very slow.
With an allocated output array
import numpy as np
import numba as nb

@nb.njit()
def optimise(data):
    data_scaled_offset = np.empty_like(data)
    # Inversely apply scale and offset for this product
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            data_scaled_offset[i, j] = np.round_((((data[i, j] - 1000) * (1 / 1)) + 1), 0)
    return data_scaled_offset
In-Place
@nb.njit()
def optimise_in_place(data):
    # Inversely apply scale and offset for this product
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            data[i, j] = np.round_((((data[i, j] - 1000) * (1 / 1)) + 1), 0)
    return data
I need to speed up an interpolation over a large (NxMxT) matrix MTR, where:
N is about 8000;
M is about 10000;
T represents the number of times at which each NxM matrix is calculated (in my case it's 23).
I have to compute the interpolation element-wise, on all the T different times, and return the interpolated values over a different array of times (T_interp, in my case with length 47), so, as output, I want an NxMxT_interp matrix.
The code snippet below defines the function I built for the interpolation, using scipy.interpolate.Rbf (y is the array MTR[i,j,:], x is the times array with length T, x_interp is the new array of times with length T_interp):
#==============================================================================
# Interpolate without nans
#==============================================================================
def interp(x, y, x_interp, **kwargs):
    import numpy as np
    from scipy.interpolate import Rbf
    mask = np.isnan(y)
    y_mask = np.ma.array(y, mask=mask)
    x_new = [x[i] for i in np.where(~mask)[0]]
    if len(y_mask.compressed()) == 0:
        return [np.nan for i, n in enumerate(x_interp)]
    elif len(y_mask.compressed()) == 1:
        return [y_mask.compressed() for i, n in enumerate(x_interp)]
    interp = Rbf(x_new, y_mask.compressed(), **kwargs)
    y_interp = interp(x_interp)
    return y_interp
I tried to achieve my goal either by looping over the NxM elements of the MTR matrix:
new_MTR = np.empty((N, M, T_interp))
for i in range(N):
    for j in range(M):
        new_MTR[i, j, :] = interp(times, MTR[i, j, :], New_times, function='linear')
or by using the np.apply_along_axis function:
new_MTR = np.apply_along_axis(lambda x: interp(times,x,New_times,function = 'linear'),2,MTR)
In both cases I estimated the time it takes to perform the whole operation; it appears to be slightly better for np.apply_along_axis, but it would still take about 15 hours!
Is there a way to reduce this time? Maybe by vectorizing the entire operation? I don't know much about vectorizing and how it can be done in a situation like mine so any help would be much appreciated. Thank you!
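For the purely linear case, one vectorized option worth sketching is scipy.interpolate.interp1d with its axis parameter, which interpolates all NxM series in a single call. This assumes NaN-free series; any pixel containing NaNs would still need the per-pixel masking done in interp(). Toy sizes stand in for the real N and M.

```python
import numpy as np
from scipy.interpolate import interp1d

# Toy sizes standing in for N=8000, M=10000, T=23, T_interp=47.
N, M, T, T_interp = 4, 5, 23, 47
times = np.linspace(0.0, 1.0, T)
New_times = np.linspace(0.0, 1.0, T_interp)
MTR = np.random.rand(N, M, T)

# One call interpolates every (i, j) series along axis 2 at once,
# instead of one Rbf fit per pixel.
f = interp1d(times, MTR, axis=2, kind='linear')
new_MTR = f(New_times)  # shape (N, M, T_interp)
```

A possible middle ground is to use this fast path for the NaN-free pixels and fall back to interp() only where np.isnan(MTR).any(axis=2) is True.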
I have two matrices A and B, each with a size of NxM, where N is the number of samples and M is the size of histogram bins. Thus, each row represents a histogram for that particular sample.
What I would like to do is compute the chi-square distance between every pair of samples across the two matrices. Each row of A is compared to all rows of B, resulting in a final NxN matrix C, where C[i,j] is the chi-square distance between histograms A[i] and B[j].
Here is my python code that does the job:
def chi_square(histA, histB):
    eps = 1.e-10
    d = np.sum((histA - histB)**2 / (histA + histB + eps))
    return 0.5 * d

def matrix_cost(A, B):
    a, _ = A.shape
    b, _ = B.shape
    C = np.zeros((a, b))
    for i in range(a):
        for j in range(b):
            C[i, j] = chi_square(A[i], B[j])
    return C
Currently, for a 100x70 matrix, this entire process takes 0.1 seconds.
Is there any way to improve this performance?
I would appreciate any thoughts or recommendations.
Thank you.
Sure! I'm assuming you're using numpy?
If you have the RAM available, you could broadcast the arrays and use numpy's efficient vectorized operations on them.
Here's how:
Abroad = A[:,np.newaxis,:] # prepared for broadcasting
C = np.sum((Abroad - B)**2/(Abroad + B), axis=-1)/2.
Timing considerations on my platform show a factor of 10 speed gain compared to your algorithm.
A slower option (but still faster than your original algorithm) that uses less RAM than the previous option is simply to broadcast the rows of A into 2D arrays:
def new_way(A, B):
    C = np.empty((A.shape[0], B.shape[0]))
    for rowind, row in enumerate(A):
        C[rowind, :] = np.sum((row - B)**2 / (row + B), axis=-1) / 2.
    return C
This has the advantage that it can be run for arrays with shape (N,M) much larger than (100,70).
You could also look to Theano to push the expensive for-loops to the C-level if you don't have the memory available. I get a factor 2 speed gain compared to the first option (not taking into account the initial compile time) for both the (100,70) arrays as well as (1000,70):
import theano
import theano.tensor as T
X = T.matrix("X")
Y = T.matrix("Y")
results, updates = theano.scan(lambda x_i: ((x_i - Y)**2/(x_i+Y)).sum(axis=1)/2., sequences=X)
chi_square_norm = theano.function(inputs=[X, Y], outputs=[results])
chi_square_norm(A,B) # same result
For those who can read Latex, this is what I am trying to compute:
$$k_{xyi} = \sum_{j}\left ( \left ( x_{i}-x_{j} \right )^{2}+\left ( y_{i}-y_{j} \right )^{2} \right )$$
where x and y are rows of a matrix A.
For computer language only folk this would translate as:
k(x,y,i) = sum_j( (xi - xj)^2 + (yi - yj)^2 )
where x and y are rows of a matrix A.
So k is a 3d matrix.
Can this be done with API calls only? (no for loops)
Here is testing startup:
import numpy as np
A = np.random.rand(4,4)
k = np.empty((4,4,4))
for ix in range(4):
    for iy in range(4):
        x = A[ix, :]
        y = A[iy, :]
        sx = np.power(x - x[:, np.newaxis], 2)
        sy = np.power(y - y[:, np.newaxis], 2)
        k[ix, iy] = (sx + sy).sum(axis=1).T
And now for the master coders, please replace the two for loops with numpy API calls.
Update:
Forgot to mention that I need a method that saves up RAM space, my A matrices are usually 20-30 thousand squared. So it would be great if your answer does not create huge temporary multidimensional arrays.
I would change your latex to look something more like the following, which is much less confusing imo:
$$k_{abi} = \sum_{j}\left ( \left ( A_{aj}-A_{ai} \right )^{2}+\left ( A_{bj}-A_{bi} \right )^{2} \right )$$
From this I assume the last line in your expression should really be:
k[ix,iy] = (sx + sy).sum(axis=-1)
If so, you can compute the above expression as follows:
Axij = (A[:, None, :] - A[..., None])**2
k = np.sum(Axij[:, None, :, :] + Axij, axis=-1)
The above first expands out a memory intensive 4D array. You can skip this if you are worried about memory by introducing a new for loop:
k = np.empty((4,4,4))
Axij = (A[:, None, :] - A[..., None])**2
for xi in range(A.shape[0]):
    k[xi] = np.sum(Axij[xi, None, :, :] + Axij, axis=-1)
This will be slower, but not by as much as you would think since you still do a lot of the operations in numpy. You could probably skip the 3D Axij intermediate, but again you are going to take a performance penalty doing so.
If your matrices really are 20k on a side, your 3D output will be 64 TB. You are not going to do this in numpy, or even in memory, unless you have a large-scale distributed-memory system.
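One observation worth adding (not in the answer above): because the sum over j splits into a term depending only on row a and a term depending only on row b, the whole tensor collapses to an outer sum of a single (n, n) table S, with S[a, i] = sum_j (A[a, i] - A[a, j])**2. Slabs k[a] can then be produced one at a time in O(n^2) memory. k_slab is a hypothetical helper name.

```python
import numpy as np

A = np.random.rand(6, 6)
n = A.shape[0]

# S[a, i] = sum_j (A[a, i] - A[a, j])**2, expanded algebraically so that
# no (n, n, n) intermediate is ever built.
row_sum = A.sum(axis=1)
row_sq = (A ** 2).sum(axis=1)
S = n * A**2 - 2 * A * row_sum[:, None] + row_sq[:, None]  # (n, n)

# Full k, only sensible for small n: k[a, b, i] = S[a, i] + S[b, i]
k = S[:, None, :] + S[None, :, :]

def k_slab(a):
    # One (n, n) slab of k at a time, for n too large to hold k whole.
    return S[a][None, :] + S
```

Even for n = 20k, S is only ~3 GB in float64, so generating slabs on demand stays feasible where the full 64 TB tensor is not.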