Reduce Memory Usage when Running Numpy Array Operations - python

I have a fairly large NumPy array that I need to perform an operation on, but when I do so my ~2GB array requires ~30GB of RAM to complete the operation. I've read that NumPy can be fairly clumsy with memory usage, but this seems excessive.
Does anyone know of an alternative way to apply these operations to limit the RAM load? Perhaps row-by-row, in place, etc.?
Code below (ignore the meaningless calculation, in my code the coefficients vary):
import xarray as xr
import numpy as np

def optimise(data):
    data_scaled_offset = (((data - 1000) * (1 / 1)) + 1).round(0)
    return data_scaled_offset.astype(np.uint16)

# This could also be float32 but I'm using uint16 here to reduce memory load for demo purposes
ds = np.random.randint(0, 12000, size=(40000, 30000), dtype=np.uint16)
ds = optimise(ds)  # Results in ~30GB RAM usage

By default, operations like multiplication, addition and many others allocate a new array for their result. You can instead call numpy.multiply, numpy.add, etc. with the out parameter so an existing array is used to store the result, which significantly reduces memory usage. Please see the demo below and translate your code to use those functions instead:
arr = np.random.rand(100)
arr2 = np.random.rand(100)
arr3 = np.subtract(arr, 100, out=arr)
arr4 = arr+100
arr5 = np.add(arr, arr2, out=arr2)
arr6 = arr+arr2
print(arr is arr3) # True
print(arr is arr4) # False
print(arr2 is arr5) # True
print(arr2 is arr6) # False
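Applied to the question's optimise function, here is a minimal sketch along those lines (optimise_low_mem and the offset/scale parameters are illustrative names, not from the original code). Most of the blow-up in the question likely comes from the float factor (1 / 1) promoting the uint16 array into a chain of float64 temporaries, so the sketch keeps a single float32 working buffer and reuses it with out=:

import numpy as np

def optimise_low_mem(data, offset=1000.0, scale=1.0):
    # One float32 working buffer (4 bytes/element) reused for every step,
    # instead of a new float64 temporary (8 bytes/element) per operation.
    buf = data.astype(np.float32)
    np.subtract(buf, offset, out=buf)
    np.multiply(buf, 1.0 / scale, out=buf)
    np.add(buf, 1.0, out=buf)
    np.round(buf, 0, out=buf)
    return buf.astype(np.uint16)

For 1.2e9 elements the peak footprint is then roughly the 2.4 GB uint16 input, one 4.8 GB float32 buffer and the 2.4 GB uint16 result, rather than several 9.6 GB float64 copies.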

You could use, e.g., Numba or Cython to reduce memory usage.
Of course a simple Python loop would also be possible, but very slow.
With allocated output array
import numpy as np
import numba as nb

@nb.njit()
def optimise(data):
    data_scaled_offset = np.empty_like(data)
    # Inversely apply scale and offset for this product
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            data_scaled_offset[i, j] = np.round_((((data[i, j] - 1000) * (1 / 1)) + 1), 0)
    return data_scaled_offset
In-Place
@nb.njit()
def optimise_in_place(data):
    # Inversely apply scale and offset for this product
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            data[i, j] = np.round_((((data[i, j] - 1000) * (1 / 1)) + 1), 0)
    return data

Related

mix data type inputs for numba njit

I have a large array to operate on, for example a matrix transpose. Numba is much faster:
# test_transpose.py
import numpy as np
import numba as nb
import time

@nb.njit('float64[:,:](float64[:,:])', parallel=True)
def transpose(x):
    r, c = x.shape
    x2 = np.zeros((c, r))
    for i in nb.prange(c):
        for j in range(r):
            x2[i, j] = x[j][i]
    return x2

if __name__ == "__main__":
    x = np.random.randn(int(3e6), 50)
    t = time.time()
    x = x.transpose().copy()
    print(f"numpy transpose: {round(time.time() - t, 4)} secs")

    x = np.random.randn(int(3e6), 50)
    t = time.time()
    x = transpose(x)
    print(f"numba paralleled transpose: {round(time.time() - t, 4)} secs")
Run in a Windows command prompt:
D:\data\test>python test_transpose.py
numpy transpose: 2.0961 secs
numba paralleled transpose: 0.8584 secs
However, I want to pass in another large matrix of integers, defining x as
x = np.random.randint(int(3e6), size=(int(3e6), 50), dtype=np.int64)
Exception is raised as
Traceback (most recent call last):
File "test_transpose.py", line 39, in <module>
x = transpose(x)
File "C:\Program Files\Python38\lib\site-packages\numba\core\dispatcher.py", line 703, in _explain_matching_error
raise TypeError(msg)
TypeError: No matching definition for argument type(s) array(int64, 2d, C)
It does not accept the integer input matrix. If I drop the explicit type signature so that the integer matrix is accepted:
@nb.njit(parallel=True)  # 'float64[:,:](float64[:,:])'
def transpose(x):
    r, c = x.shape
    x2 = np.zeros((c, r))
    for i in nb.prange(c):
        for j in range(r):
            x2[i, j] = x[j][i]
    return x2
It is slower:
D:\Data\test>python test_transpose.py
numba paralleled transpose: 1.6653 secs
Using @nb.njit('int64[:,:](int64[:,:])', parallel=True) for the integer data matrix is faster, as expected.
So, how can I still allow mixed data type inputs but keep the speed, instead of creating a separate function for each type?
The problem is that the Numba function is defined only for float64 types and not int64. Specifying the types is required because Numba compiles the Python code to native code with well-defined types. You can add multiple signatures to a Numba function:
@nb.njit(['float64[:,:](float64[:,:])', 'int64[:,:](int64[:,:])'], parallel=True)
def transpose(x):
    r, c = x.shape
    # Specifying the dtype is very important here.
    # This is a good habit to take to avoid numerical issues and
    # slower performance in Numpy too.
    x2 = np.zeros((c, r), dtype=x.dtype)
    for i in nb.prange(c):
        for j in range(r):
            x2[i, j] = x[j][i]
    return x2
"It is slower"
This is because of lazy compilation: the first execution includes the compilation time. This is not the case when the signature is specified, because eager compilation is used instead.
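One way to see the effect is to trigger compilation outside the timed region. A minimal sketch, assuming the signature-less transpose above is in scope:

import numpy as np
import time

transpose(np.zeros((2, 2)))                   # first call compiles the float64 version
transpose(np.zeros((2, 2), dtype=np.int64))   # and this one the int64 version

x = np.random.randint(int(3e6), size=(int(3e6), 50), dtype=np.int64)
t = time.time()
x = transpose(x)
print(f"numba transpose, excluding compile time: {round(time.time() - t, 4)} secs")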
"numba is much faster"
Well, not that much here, considering many cores are used. In fact, the naive transposition is very inefficient on big matrices (it wastes about 90% of the memory throughput in this case on large arrays). There are faster algorithms; for more information, please read this post (it only considers in-place 2D square transposition, which is much simpler, but the idea is the same). A cache-blocked version is sketched below. Also note that the wider the type, the bigger the array, and the bigger the array, the slower the transposition.
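As a rough illustration of those faster algorithms, here is a hedged sketch of a cache-blocked (tiled) transpose in Numba; the tile size of 64 is an untuned assumption:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def blocked_transpose(x, tile=64):
    r, c = x.shape
    out = np.empty((c, r), dtype=x.dtype)
    # Walk the matrix in tile x tile blocks so reads and writes stay cache-friendly.
    for ib in nb.prange((c + tile - 1) // tile):
        i0 = ib * tile
        for j0 in range(0, r, tile):
            for i in range(i0, min(i0 + tile, c)):
                for j in range(j0, min(j0 + tile, r)):
                    out[i, j] = x[j, i]
    return out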

Einsum slower than explicit Numpy implementation for n-mode tensor-matrix product

I'm trying to implement the n-mode tensor-matrix product (as defined by Kolda and Bader: https://www.sandia.gov/~tgkolda/pubs/pubfiles/SAND2007-6702.pdf) efficiently in Python using Numpy. The operation effectively gets down to (for matrix U, tensor X and axis/mode k):
1. Extract all vectors along axis k from X by collapsing all other axes.
2. Multiply these vectors on the left by U using standard matrix multiplication.
3. Insert the vectors again into the output tensor using the same shape, apart from X.shape[k], which is now equal to U.shape[0] (initially, X.shape[k] must be equal to U.shape[1], as required by the matrix multiplication).
I've been using an explicit implementation for a while which performs all these steps separately:
1. Transpose the tensor to bring axis k to the front (in my full code I added an exception in case k == X.ndim - 1, in which case it's faster to leave it there and transpose all future operations, at least in my application, but that's not relevant here).
2. Reshape the tensor to collapse all other axes.
3. Calculate the matrix multiplication.
4. Reshape the tensor to reconstruct all other axes.
5. Transpose the tensor back into the original order.
I would think this implementation creates a lot of unnecessary (big) arrays, so once I discovered np.einsum I thought this would speed things up considerably. However using the code below I got worse results:
import numpy as np
from time import time

def mode_k_product(U, X, mode):
    transposition_order = list(range(X.ndim))
    transposition_order[mode] = 0
    transposition_order[0] = mode
    Y = np.transpose(X, transposition_order)
    transposed_ranks = list(Y.shape)
    Y = np.reshape(Y, (Y.shape[0], -1))
    Y = U @ Y
    transposed_ranks[0] = Y.shape[0]
    Y = np.reshape(Y, transposed_ranks)
    Y = np.transpose(Y, transposition_order)
    return Y

def einsum_product(U, X, mode):
    axes1 = list(range(X.ndim))
    axes1[mode] = X.ndim + 1
    axes2 = list(range(X.ndim))
    axes2[mode] = X.ndim
    return np.einsum(U, [X.ndim, X.ndim + 1], X, axes1, axes2, optimize=True)

def test_correctness():
    A = np.random.rand(3, 4, 5)
    for i in range(3):
        B = np.random.rand(6, A.shape[i])
        X = mode_k_product(B, A, i)
        Y = einsum_product(B, A, i)
        print(np.allclose(X, Y))

def test_time(method, amount):
    U = np.random.rand(256, 512)
    X = np.random.rand(512, 512, 256)
    start = time()
    for i in range(amount):
        method(U, X, 1)
    return (time() - start) / amount

def test_times():
    print("Explicit:", test_time(mode_k_product, 10))
    print("Einsum:", test_time(einsum_product, 10))

test_correctness()
test_times()
Timings for me:
Explicit: 3.9450525522232054
Einsum: 15.873924326896667
Is this normal or am I doing something wrong? I know there are circumstances where storing intermediate results can decrease complexity (e.g. chained matrix multiplication), however in this case I can't think of any calculations that are being repeated. Is matrix multiplication so optimized that it removes the benefits of not transposing (which technically has a lower complexity)?
I'm more familiar with the subscripts style of using einsum, so I worked out these equivalences:
In [194]: np.allclose(np.einsum('ij,jkl->ikl',B0,A), einsum_product(B0,A,0))
Out[194]: True
In [195]: np.allclose(np.einsum('ij,kjl->kil',B1,A), einsum_product(B1,A,1))
Out[195]: True
In [196]: np.allclose(np.einsum('ij,klj->kli',B2,A), einsum_product(B2,A,2))
Out[196]: True
With a mode parameter, your approach in einsum_product may be best. But the equivalences help me visualize the calculation better, and may help others.
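For an arbitrary mode, the subscripts can also be built programmatically; a small sketch (mode_k_einsum is an illustrative name) that generalizes the equivalences above:

import numpy as np

def mode_k_einsum(U, X, mode):
    # Builds e.g. 'zb,abc->azc' for a 3-D X and mode=1: contract U with the
    # mode-th axis of X and put the new axis in its place.
    in_sub = 'abcdefghijklmnopqrstuvwxy'[:X.ndim]
    out_sub = in_sub.replace(in_sub[mode], 'z')
    return np.einsum(f"z{in_sub[mode]},{in_sub}->{out_sub}", U, X, optimize=True)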
Timings should basically be the same. There's an extra setup time in einsum_product that should disappear in larger dimensions.
After updating Numpy, Einsum is only slightly slower than the explicit method, with or without multi-threading (see comments to my question).

Efficient way to perform a 2D x 1D Matrix Multiply

I am trying to perform a 2D by 1D matrix multiply. Specifically:
import numpy as np
s = np.ones(268)
one_d = np.ones(9422700)
s_newaxis = s[:, np.newaxis]
goal = s_newaxis * one_d
While the dimensions above are the same as in my problem ((268, 1) and (9422700,)), the actual values in my arrays are a mix of very large and very small numbers. In this demo goal = s_newaxis * one_d runs fine because only 1s exist, but with my actual data I run out of RAM.
I recognize that, at the end of the day, this amounts to a matrix with ~2.5 billion values and so a heavy memory footprint is to be expected. However, any improvement in terms of efficiency would be welcome.
For completeness, I've included a rough attempt. It is not very elegant, but it is just enough of an improvement that it won't crash my computer (admittedly a low bar).
import gc

def arange(start, stop, step):
    # `arange` which includes the endpoint (`stop`).
    arr = np.arange(start=start, stop=stop, step=step)
    if arr[-1] < stop:
        return np.append(arr, [stop])
    else:
        return arr

left, all_arrays = 0, list()
for right in arange(0, stop=s_newaxis.shape[0], step=10):
    chunk = s_newaxis[left:right, :] * one_d
    all_arrays.append(chunk)
    left = right
    gc.collect()  # unclear if this makes any difference...I suspect not.

goal = np.vstack(all_arrays)
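If the result only has to be written out rather than held in RAM, a hedged alternative is to stream the chunks straight into a disk-backed memmap (s and one_d as defined in the question; the file name goal.dat and the chunk size are illustrative):

import numpy as np

out = np.memmap('goal.dat', dtype=np.float64, mode='w+',
                shape=(s.shape[0], one_d.shape[0]))
chunk = 16                                    # rows per pass; tune to taste
for left in range(0, s.shape[0], chunk):
    right = min(left + chunk, s.shape[0])
    # Broadcasted multiply written directly into the memmapped rows.
    np.multiply(s[left:right, None], one_d, out=out[left:right])
out.flush()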

numpy row pair sum of squared row wise differences without for loops (only api calls)

For those who can read Latex, this is what I am trying to compute:
$$k_{xyi} = \sum_{j}\left ( \left ( x_{i}-x_{j} \right )^{2}+\left ( y_{i}-y_{j} \right )^{2} \right )$$
where x and y are rows of a matrix A.
For computer language only folk this would translate as:
k(x,y,i) = sum_j( (xi - xj)^2 + (yi - yj)^2 )
where x and y are rows of a matrix A.
So k is a 3d matrix.
Can this be done with API calls only? (no for loops)
Here is testing startup:
import numpy as np
A = np.random.rand(4,4)
k = np.empty((4,4,4))
for ix in range(4):
for iy in range(4):
x = A[ix,]
y = A[iy,]
sx = np.power(x - x[:,np.newaxis],2)
sy = np.power(y - y[:,np.newaxis],2)
k[ix,iy] = (sx + sy).sum(axis=1).T
And now for the master coders, please replace the two for loops with numpy API calls.
Update:
Forgot to mention that I need a method that saves RAM; my A matrices are usually 20-30 thousand elements on a side. So it would be great if your answer does not create huge temporary multidimensional arrays.
I would change your LaTeX to something more like the following, which I find much less confusing:
$$k_{xyi} = \sum_{j}\left( \left( A_{xj}-A_{xi} \right)^{2}+\left( A_{yj}-A_{yi} \right)^{2} \right)$$
From this I assume the last line in your expression should really be:
k[ix,iy] = (sx + sy).sum(axis=-1)
If so, you can compute the above expression as follows:
Axij = (A[:, None, :] - A[..., None])**2
k = np.sum(Axij[:, None, :, :] + Axij, axis=-1)
The above first expands out a memory-intensive 4D array. You can skip this, if you are worried about memory, by introducing a new for loop:
k = np.empty((4, 4, 4))
Axij = (A[:, None, :] - A[..., None])**2
for xi in range(A.shape[0]):
    k[xi] = np.sum(Axij[xi, None, :, :] + Axij, axis=-1)
This will be slower, but not by as much as you would think, since you still do a lot of the operations in NumPy. You could probably skip the 3D Axij intermediate as well, but again you are going to take a performance penalty for doing so; a rough sketch follows.
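A hedged sketch of what skipping the 3D Axij intermediate could look like: the pairwise squared differences are recomputed per row inside the loops, so memory stays at a few (n, n) temporaries at the cost of redundant work:

import numpy as np

A = np.random.rand(4, 4)
n = A.shape[0]
k = np.empty((n, n, n))
for x in range(n):
    dx = (A[x][None, :] - A[x][:, None]) ** 2      # (n, n) squared diffs for row x
    for y in range(n):
        dy = (A[y][None, :] - A[y][:, None]) ** 2  # recomputed for every x
        k[x, y] = (dx + dy).sum(axis=-1)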
If your matrices are really 20k on an edge, your 3D output will be 64 TB (20000³ elements × 8 bytes ≈ 6.4 × 10¹³ bytes). You are not going to do this in NumPy, or even in memory, unless you have a large-scale distributed-memory system.

Fastest way to convert ubyte [0, 255] array to float array [-0.5, +0.5] with NumPy

The question is in the title and it is pretty straightforward.
I have a file f from which I am reading a ubyte array:
arr = numpy.fromfile(f, '>u1', size * rows * cols).reshape((size, rows, cols))
max_value = 0xFF # max value of ubyte
Currently I'm renormalizing the data in 3 passes, as follows:
arr = arr.astype(float)
arr -= max_value / 2.0
arr /= max_value
Since the array is somewhat large, this takes a noticeable fraction of a second.
It would be great if I could do this in 1 or 2 passes through the data, as I think that would be faster.
Is there some way for me to perform a "composite" vector operation to decrease the number of passes?
Or, is there some other way for me to speed this up?
I did:
ar = ar - 255/2.
ar *= 1./255
Seems faster :)
I timed it; it's roughly twice as fast on my system. It seems ar = ar - 255/2. does the subtraction and type conversion on the fly. Also, it seems division by a scalar is not optimized: it's faster to do the division once and then a bunch of multiplications on the array, though the additional floating-point operation may increase round-off error.
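A quick, hedged way to check the scalar-division claim on your own machine (exact numbers will vary):

import numpy as np
import timeit

a = np.random.rand(10_000_000)
print("divide:  ", timeit.timeit(lambda: a / 255.0, number=20))
print("multiply:", timeit.timeit(lambda: a * (1.0 / 255.0), number=20))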
As noted in the comments, numexpr might be a truly fast yet simple way to achieve this. On my system it's another factor of two quicker, but mostly because numexpr uses multiple cores rather than because it does only a single pass over the array. Code:
import numexpr
ar = numexpr.evaluate('(ar - 255.0/2.0) / 255.0')
This lookup table might be a bit faster than the repeated calculation:
table = numpy.linspace(-0.5, 0.5, 256)
images = numpy.memmap(f, '>u1', 'r', shape=(size, rows, cols))
arr = table[images]
On my system, it shaves 10 to 15 percent off the time compared to yours.
I found a better solution myself (around 25% faster):
arr = numpy.memmap(f, '>u1', 'r', shape=(size, rows, cols))
arr = arr / float(max_value)
arr -= 0.5
I'm curious if it can be improved.
I get around a 50% speed-up for large arrays using cython.parallel.prange with the code below (written for a one-dimensional array, but easily extensible); I guess the speed-up depends on the number of CPU cores:
pilot.pyx file:
cimport cython
from cython.parallel import prange
import numpy as np
cimport numpy as np
from numpy cimport float64_t, uint8_t, ndarray

@cython.boundscheck(False)
@cython.wraparound(False)
def norm(np.ndarray[uint8_t, ndim=1] img):
    cdef:
        Py_ssize_t i, n = len(img)
        np.ndarray[float64_t, ndim=1] arr = np.empty(n, dtype='float64')
        float64_t * left = <float64_t *> arr.data
        uint8_t * right = <uint8_t *> img.data
    for i in prange(n, nogil=True):
        left[i] = (right[i] - 127.5) / 255.0
    return arr
setup.py file to build a C extension module out of the above code:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_module = Extension(
    'pilot',
    ['pilot.pyx'],
    extra_compile_args=['-fopenmp'],
    extra_link_args=['-fopenmp'],
)

setup(
    name = 'pilot',
    cmdclass = {'build_ext': build_ext},
    ext_modules = [ext_module],
)
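Presumably the module is then built and used roughly like this (depending on the environment, setup.py may also need include_dirs=[numpy.get_include()] for the cimport numpy line to compile):

# Build the extension in place first:
#   python setup.py build_ext --inplace
import numpy as np
import pilot

img = np.random.randint(0, 256, size=10_000_000).astype(np.uint8)
arr = pilot.norm(img)  # float64 values in [-0.5, 0.5]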
