Optimising memory usage in numpy - python

The following program loads two images with PyGame, converts them to Numpy arrays, and then performs some other Numpy operations (such as FFT) to emit a final result (of a few numbers). The inputs can be large, but at any moment only one or two large objects should be live.
A test image is about 10M pixels, which translates to 10MB once it's greyscaled. It gets converted to a Numpy array of dtype uint8, which after some processing (applying Hamming windows), is an array of dtype float64. Two images are loaded into arrays this way; later FFT steps result in an array of dtype complex128. Prior to adding the excessive gc.collect calls, the program memory size tended to increase with each step. Additionally, it seems most Numpy operations will give a result in the highest precision available.
Running the test (sans the gc.collect calls) on my 1GB Linux machine results in prolonged thrashing, which I have not waited out. I don't yet have detailed memory-use stats -- I tried some Python modules and the time command to no avail; now I'm looking into valgrind. Watching ps (and dealing with machine unresponsiveness in the later stages of the test) suggests a maximum memory usage of about 800 MB.
A 10 million cell array of complex128 should occupy 160 MB. Having (ideally) at most two of these live at one time, plus the not-insubstantial Python and Numpy libraries and other paraphernalia, probably means allowing for 500 MB.
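As a quick sanity check, those per-array figures follow directly from the dtype item sizes:

import numpy

# Rough arithmetic for a ~10-megapixel image at each stage.
n = 10 * 1000 * 1000
print(n * numpy.dtype(numpy.uint8).itemsize / 1e6)       # ~10 MB greyscaled input
print(n * numpy.dtype(numpy.float64).itemsize / 1e6)     # ~80 MB after windowing
print(n * numpy.dtype(numpy.complex128).itemsize / 1e6)  # ~160 MB after the FFT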
I can think of two angles from which to attack the problem:
Discarding intermediate arrays as soon as possible. That's what the gc.collect calls are for -- they seem to have improved the situation, as it now completes with only a few minutes of thrashing ;-). I think one can expect that memory-intensive programming in a language like Python will require some manual intervention.
Using less-precise Numpy arrays at each step. Unfortunately the operations that return arrays, like fft2, do not appear to allow the type to be specified.
So my main question is: is there a way of specifying output precision in Numpy array operations?
More generally, are there other common memory-conserving techniques when using Numpy?
Additionally, does Numpy have a more idiomatic way of freeing array memory? (I imagine this would leave the array object live in Python, but in an unusable state.) Explicit deletion followed by immediate GC feels hacky.
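For illustration, here is the kind of downcasting I have in mind (a sketch only, not a real answer to the precision question; the full test program follows below). numpy.fft computes in double precision regardless of the input dtype, so one complex128 temporary still appears briefly before the downcast:

import numpy

# Sketch: keep working arrays in single precision and downcast the FFT
# output as soon as it is produced.
a = numpy.random.rand(2000, 2000).astype(numpy.float32)   # stand-in for one input image
hw = numpy.hamming(a.shape[1]).astype(numpy.float32)
a *= hw                                                   # in-place, no extra full-size array
spec = numpy.fft.fft2(a).astype(numpy.complex64)          # halves the spectrum's footprint
print(spec.dtype)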
import sys
import numpy
import pygame
import gc
def get_image_data(filename):
    im = pygame.image.load(filename)
    im2 = im.convert(8)                      # convert to 8 bits per pixel (the greyscale step described above)
    a = pygame.surfarray.array2d(im2)
    # Apply a Hamming window along each axis.
    hw1 = numpy.hamming(a.shape[0])
    hw2 = numpy.hamming(a.shape[1])
    a = a.transpose()
    a = a*hw1
    a = a.transpose()
    a = a*hw2
    return a

def check():
    gc.collect()
    print 'check'

def main(args):
    pygame.init()
    pygame.sndarray.use_arraytype('numpy')
    filename1 = args[1]
    filename2 = args[2]
    im1 = get_image_data(filename1)
    im2 = get_image_data(filename2)
    check()
    out1 = numpy.fft.fft2(im1)
    del im1
    check()
    out2 = numpy.fft.fft2(im2)
    del im2
    check()
    # Cross-correlation via the frequency domain.
    out3 = out1.conjugate() * out2
    del out1, out2
    check()
    correl = numpy.fft.ifft2(out3)
    del out3
    check()
    maxs = correl.argmax()
    maxpt = maxs % correl.shape[0], maxs / correl.shape[0]
    print correl[maxpt], maxpt, (correl.shape[0] - maxpt[0], correl.shape[1] - maxpt[1])

if __name__ == '__main__':
    args = sys.argv
    exit(main(args))

This post on SO says "Scipy 0.8 will have single precision support for almost all the fft code", and SciPy 0.8.0 beta 1 is just out.
(Haven't tried it myself; too cowardly.)
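If that pans out, the usage would presumably look something like this (untested sketch; on an older SciPy the result is just silently upcast to complex128):

import numpy
from scipy import fftpack

# Sketch: with single-precision FFT support (SciPy >= 0.8 per the quote),
# a float32 input should come back as a complex64 spectrum, halving the
# memory of each FFT result.
im = numpy.random.rand(2048, 2048).astype(numpy.float32)
spec = fftpack.fft2(im)
print(spec.dtype)   # expect complex64 on a single-precision-capable SciPy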

If I understand correctly, you are calculating a convolution (strictly, a cross-correlation) between two images. The SciPy package contains a dedicated module for that (scipy.ndimage), which might be more memory-efficient than the "manual" approach via Fourier transforms. It would be worth trying it instead of going through NumPy.
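A hedged sketch of what that could look like with scipy.ndimage.correlate, assuming the second image is small enough to act as a kernel (for two full-size images the FFT route is usually faster, but this avoids the complex128 intermediates entirely):

import numpy
from scipy import ndimage

# Sketch: direct spatial correlation; no complex arrays are created.
# Assumes 'template' is much smaller than 'image'.
image = numpy.random.rand(1024, 1024).astype(numpy.float32)
template = numpy.random.rand(32, 32).astype(numpy.float32)
corr = ndimage.correlate(image, template, mode='constant')
peak = numpy.unravel_index(corr.argmax(), corr.shape)
print(peak)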

Related

Is there an alternative to Numba for functions that use many features not supported by Numba?

I know Numba supports neither all Python features nor all NumPy features.
However, I really need to speed up the execution time of the following function, block_reduce, which comes from the scikit-image library (I haven't installed the whole package; I've just taken block_reduce and view_as_blocks from it).
Here is the original code (I've just removed the examples from the docstring).
block_reduce.py
import numpy as np
from numpy.lib.stride_tricks import as_strided
def block_reduce(image, block_size, func=np.sum, cval=0):
    """
    Taken from scikit-image to avoid installation (it's very big)

    Down-sample image by applying function to local blocks.

    Parameters
    ----------
    image : ndarray
        N-dimensional input image.
    block_size : array_like
        Array containing down-sampling integer factor along each axis.
    func : callable
        Function object which is used to calculate the return value for each
        local block. This function must implement an ``axis`` parameter such
        as ``numpy.sum`` or ``numpy.min``.
    cval : float
        Constant padding value if image is not perfectly divisible by the
        block size.

    Returns
    -------
    image : ndarray
        Down-sampled image with same number of dimensions as input image.
    """
    if len(block_size) != image.ndim:
        raise ValueError("`block_size` must have the same length "
                         "as `image.shape`.")

    pad_width = []
    for i in range(len(block_size)):
        if block_size[i] < 1:
            raise ValueError("Down-sampling factors must be >= 1. Use "
                             "`skimage.transform.resize` to up-sample an "
                             "image.")
        if image.shape[i] % block_size[i] != 0:
            after_width = block_size[i] - (image.shape[i] % block_size[i])
        else:
            after_width = 0
        pad_width.append((0, after_width))

    image = np.pad(image, pad_width=pad_width, mode='constant',
                   constant_values=cval)

    blocked = view_as_blocks(image, block_size)

    return func(blocked, axis=tuple(range(image.ndim, blocked.ndim)))


def view_as_blocks(arr_in, block_shape):
    """Block view of the input n-dimensional array (using re-striding).

    Blocks are non-overlapping views of the input array.

    Parameters
    ----------
    arr_in : ndarray
        N-d input array.
    block_shape : tuple
        The shape of the block. Each dimension must divide evenly into the
        corresponding dimensions of `arr_in`.

    Returns
    -------
    arr_out : ndarray
        Block view of the input array.
    """
    if not isinstance(block_shape, tuple):
        raise TypeError('block needs to be a tuple')

    block_shape = np.array(block_shape)
    if (block_shape <= 0).any():
        raise ValueError("'block_shape' elements must be strictly positive")

    if block_shape.size != arr_in.ndim:
        raise ValueError("'block_shape' must have the same length "
                         "as 'arr_in.shape'")

    arr_shape = np.array(arr_in.shape)
    if (arr_shape % block_shape).sum() != 0:
        raise ValueError("'block_shape' is not compatible with 'arr_in'")

    # -- restride the array to build the block view
    new_shape = tuple(arr_shape // block_shape) + tuple(block_shape)
    new_strides = tuple(arr_in.strides * block_shape) + arr_in.strides

    arr_out = as_strided(arr_in, shape=new_shape, strides=new_strides)
    return arr_out
test_block_reduce.py
import numpy as np
import time
from block_reduce import block_reduce
image = np.arange(3*3*1000).reshape(3, 3, 1000)
# DO NOT REPORT THIS... COMPILATION TIME IS INCLUDED IN THE EXECUTION TIME!
start = time.time()
block_reduce(image, block_size=(3, 3, 1), func=np.mean)
end = time.time()
print("Elapsed (with compilation) = %s" % (end - start))
# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time.time()
block_reduce(image, block_size=(3, 3, 1), func=np.mean)
end = time.time()
print("Elapsed (after compilation) = %s" % (end - start))
I went through many issues with this code.
For example, Numba does not support function-type parameters. Even if I try to work around this by using a string for the parameter (for example, func would be the string "sum" instead of np.sum), I fall into many more issues related to features unsupported by Numba (like np.pad, isinstance, the tuple function, etc.).
Going through each single issue turned out to be very painful. For example, I tried to copy all the code for np.pad from NumPy into block_reduce.py and add the numba.jit decorator to it, but I ran into additional problems.
If there is a smart way to use Numba despite all these unsupported features, I would be happy with it.
Otherwise, is there any alternative to Numba? I know there is PyPy, which I've never used. If PyPy is a solution to my problem, I should highlight that I only need this single script, block_reduce.py, to run with PyPy; the rest of the project should run with CPython.
I was also thinking of writing a C extension module, which I've never done. If it's worth trying, I will.
Have you tried running detailed profiling of your code? If you are dissatisfied with the performance of your program, it can be very helpful to use a tool such as cProfile or py-spy. This can identify bottlenecks in your program and show which parts specifically need to be sped up.
That being said, as @CJR said, if your program is spending the bulk of its compute time in NumPy, there is likely no reason to worry about speeding it up with a just-in-time compiler or similar modifications to your setup. As explained in more detail here, NumPy is fast because it implements compute-intensive tasks in compiled languages, so it saves you from worrying about that and abstracts it away.
Depending on what exactly you are planning to do, it is possible that your efficiency could be improved by parallelism, but this is not something I would worry about yet.
To end on a more general note: while optimizing code efficiency is of course very important, it is imperative to do so carefully and deliberately. As Donald Knuth famously said, "premature optimization is the root of all evil (or at least most of it) in programming". See this Stack Exchange thread for some more discussion.
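For instance, a minimal cProfile sketch for the block_reduce call from the test script above (the sort key and the number of rows printed are arbitrary choices):

import cProfile
import pstats
import numpy as np
from block_reduce import block_reduce

# Profile one block_reduce call and show where the time actually goes.
image = np.arange(3 * 3 * 1000).reshape(3, 3, 1000)
profiler = cProfile.Profile()
profiler.enable()
block_reduce(image, block_size=(3, 3, 1), func=np.mean)
profiler.disable()
pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)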

Anaconda package for cufft keeping arrays in gpu memory between fft / ifft calls

I am using the Anaconda suite with ipython 3.6.1 and their Accelerate package. It includes a cufft sub-package with two functions, fft and ifft. These, as far as I understand, take in a NumPy array and output to a NumPy array, both in system RAM; all GPU memory and the transfers between system and GPU memory are handled automatically, and GPU memory is released when the function ends. This all seems very nice and it works for me. However, I would like to run multiple fft/ifft calls on the same array and each time extract just one number from it. It would be nice to keep the array in GPU memory to minimise system <-> GPU transfers. Am I correct that this is not possible using this package? If so, is there another package that would do the same? I have noticed the reikna project, but it doesn't seem to be available in Anaconda.
The thing I am doing (and would like to do efficiently on the GPU) is, in short, shown here using numpy.fft:
import math as m
import numpy as np
import numpy.fft as dft
nr = 100
nh = 2**16
h = np.random.rand(nh)*1j
H = np.zeros(nh,dtype='complex64')
h[10] = 1
r = np.zeros(nr,dtype='complex64')
fftscale = m.sqrt(nh)
corr = 0.12j
for i in np.arange(nr):
    r[i] = h[10]
    H = dft.fft(h,nh)/fftscale
    h = dft.ifft(h*corr)*fftscale
r[nr-1] = h[10]
print(r)
Thanks in advance!
So I found ArrayFire, which seems rather easy to work with.
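For reference, CuPy (a different package from Accelerate or ArrayFire, so treat this as a sketch of the pattern rather than a drop-in answer) keeps arrays resident in GPU memory between fft/ifft calls; the loop body here is adapted slightly, with the ifft applied to H, which I assume is what the original loop intends:

import cupy as cp

# Sketch: h, H and r all live in GPU memory for the whole loop; only the
# final cp.asnumpy(r) copies results back to the host.
nr, nh = 100, 2 ** 16
fftscale = nh ** 0.5
corr = 0.12j
h = cp.random.rand(nh) * 1j
h[10] = 1
r = cp.zeros(nr, dtype='complex64')
for i in range(nr):
    r[i] = h[10]
    H = cp.fft.fft(h, nh) / fftscale
    h = cp.fft.ifft(H * corr) * fftscale
print(cp.asnumpy(r))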

Optimizing a multithreaded numpy array function

Given two large arrays of 3D points (I'll call the first "source" and the second "destination"), I needed a function that returns, for each element of "source", the index of its closest element in "destination", with this limitation: I can only use NumPy... so no SciPy, pandas, numexpr, Cython...
To do this I wrote a function based on the "brute force" answer to this question: iterate over the elements of source, find the closest element of destination, and return its index. Due to performance concerns, and again because I can only use NumPy, I tried multithreading to speed it up. Here are both the threaded and unthreaded functions and how they compare in speed on an 8-core machine.
import timeit
import numpy as np
from numpy.core.umath_tests import inner1d
from multiprocessing.pool import ThreadPool
def threaded(sources, destinations):
    # Define worker function
    def worker(point):
        dlt = (destinations-point) # delta between destinations and given point
        d = inner1d(dlt,dlt)       # get distances
        return np.argmin(d)        # return closest index
    # Multithread!
    p = ThreadPool()
    return p.map(worker, sources)

def unthreaded(sources, destinations):
    results = []
    #for p in sources:
    for i in range(len(sources)):
        dlt = (destinations-sources[i]) # difference between destinations and given point
        d = inner1d(dlt,dlt)            # get distances
        results.append(np.argmin(d))    # append closest index
    return results

# Setup the data
n_destinations = 10000 # 10k random destinations
n_sources = 10000      # 10k random sources
destinations = np.random.rand(n_destinations,3) * 100
sources = np.random.rand(n_sources,3) * 100

# Compare!
print 'threaded: %s'%timeit.Timer(lambda: threaded(sources,destinations)).repeat(1,1)[0]
print 'unthreaded: %s'%timeit.Timer(lambda: unthreaded(sources,destinations)).repeat(1,1)[0]
Results:
threaded: 0.894030461056
unthreaded: 1.97295164054
Multithreading seems beneficial, but I was hoping for more than a 2x speedup, given that the real-life datasets I deal with are much larger.
All recommendations to improve performance (within the limitations described above) will be greatly appreciated!
Ok, I've been reading the Maya documentation on Python and I came to these conclusions/guesses:
They're probably using CPython inside (several references to that documentation and not any other).
They're not fond of threads (lots of non-thread-safe methods).
Given the above, I'd say it's better to avoid threads. The GIL is a common problem, and there are several ways around it:
Build a C/C++ extension and use threads inside the C/C++ code. Personally, I'd only try to get SIP working, then move on.
Use multiprocessing (a sketch follows this answer). Even if your custom Python distribution doesn't include it, you can get to a working version since it's all pure Python code. multiprocessing is not affected by the GIL since it spawns separate processes.
The above should work out for you. If not, try another parallel tool (after some serious praying).
On a side note, if you're using outside modules, be mindful of matching Maya's version. This may be the reason you couldn't build SciPy. Of course, SciPy has a huge codebase, and the Windows platform is not the most resilient for building things.
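A minimal sketch of the multiprocessing suggestion above, assuming a fork-based platform so the workers inherit the destination array instead of having it pickled to them; inner1d is replaced by a plain sum of squares to stay NumPy-only:

import numpy as np
from multiprocessing import Pool

destinations = np.random.rand(10000, 3) * 100
sources = np.random.rand(10000, 3) * 100

def closest_indices(chunk):
    # Plain NumPy nearest-neighbour search for one chunk of source points.
    out = []
    for p in chunk:
        dlt = destinations - p                  # deltas to every destination
        out.append(np.argmin((dlt * dlt).sum(axis=1)))
    return out

if __name__ == '__main__':
    chunks = np.array_split(sources, 8)         # one chunk per process (8-core machine)
    pool = Pool(processes=8)
    results = np.concatenate(pool.map(closest_indices, chunks))
    pool.close()
    pool.join()
    print(results[:10])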

Multiprocessing IOError: bad message length

I get an IOError: bad message length when passing large arguments to the map function. How can I avoid this?
The error occurs when I set N=1500 or bigger.
The code is:
import numpy as np
import multiprocessing
def func(args):
    i=args[0]
    images=args[1]
    print i
    return 0

N=1500 #N=1000 works fine

images=[]
for i in np.arange(N):
    images.append(np.random.random_integers(1,100,size=(500,500)))

iter_args=[]
for i in range(0,1):
    iter_args.append([i,images])

pool=multiprocessing.Pool()
print pool
pool.map(func,iter_args)
In the docs of multiprocessing there is the function recv_bytes that raises an IOError. Could it be because of this? (https://python.readthedocs.org/en/v2.7.2/library/multiprocessing.html)
EDIT
If I use images as a numpy array instead of a list, I get a different error: SystemError: NULL result without error in PyObject_Call.
A bit different code:
import numpy as np
import multiprocessing
def func(args):
    i=args[0]
    images=args[1]
    print i
    return 0

N=1500 #N=1000 works fine

images=[]
for i in np.arange(N):
    images.append(np.random.random_integers(1,100,size=(500,500)))
images=np.array(images) #new

iter_args=[]
for i in range(0,1):
    iter_args.append([i,images])

pool=multiprocessing.Pool()
print pool
pool.map(func,iter_args)
EDIT 2: The actual function that I use is:

def func(args):
    i=args[0]
    images=args[1]
    image=np.mean(images,axis=0)
    np.savetxt("image%d.txt"%(i),image)
    return 0
Additionally, the iter_args do not contain the same set of images:
iter_args=[]
for i in range(0,1):
    rand_ind=np.random.random_integers(0,N-1,N)
    iter_args.append([i,images[rand_ind]])
You're creating a pool and sending all the images at once to func(). If you can get away with working on a single image at a time, try something like this, which runs to completion with N=10000 in 35s with Python 2.7.10 for me:
import numpy as np
import multiprocessing
def func(args):
    i = args[0]
    img = args[1]
    print "{}: {} {}".format(i, img.shape, img.sum())
    return 0

N=10000
images = ((i, np.random.random_integers(1,100,size=(500,500))) for i in xrange(N))
pool=multiprocessing.Pool(4)
pool.imap(func, images)
pool.close()
pool.join()
The key here is to use iterators so you don't have to hold all the data in memory at once. For instance I converted images from an array holding all the data to a generator expression to create the image only when needed. You could modify this to load your images from disk or whatever. I also used pool.imap instead of pool.map.
If you can, try to load the image data in the worker function. Right now you have to serialize all the data and ship it across to another process. If your image data is larger, this might be a bottleneck.
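A sketch of that idea, with hypothetical image_%d.npy files standing in for however the images are actually stored on disk:

import numpy as np
import multiprocessing

# Sketch: ship only an index and a filename to each worker and load the
# pixel data there, so no large arrays are pickled between processes.
# The 'image_%d.npy' paths are placeholders.
def func(args):
    i, path = args
    img = np.load(path)
    print("%d: %s %s" % (i, img.shape, img.sum()))
    return 0

if __name__ == '__main__':
    tasks = [(i, "image_%d.npy" % i) for i in range(100)]
    pool = multiprocessing.Pool(4)
    pool.map(func, tasks)
    pool.close()
    pool.join()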
[update now that we know func has to handle all images at once]
You could do an iterative mean on your images. Here's a solution without using multiprocessing. To use multiprocessing, you could divide your images into chunks, and farm those chunks out to the pool.
import numpy as np

N=10000
shape = (500,500)

def func(images):
    average = np.full(shape, 0)
    for i, img in images:
        average += img / N
    return average

images = ((i, np.full(shape,i)) for i in range(N))
print func(images)
Python is likely to load your data into RAM, and that memory needs to be available. Have you checked your computer's memory usage?
Also, as Patrick mentioned, you're loading 3 GB of data, so make sure you use the 64-bit version of Python, as you are hitting the 32-bit memory constraint. This could cause your process to crash: 32 vs 64 bits Python
Another improvement would be to use Python 3.4 instead of 2.7. The Python 3 implementation seems to be optimised for very large ranges; see Python3 vs Python2 list/generator range performance
When running your program it actually gives me a clear error:
OSError: [Errno 12] Cannot allocate memory
As mentioned by other users, the solution to your problem is simple: add memory (a lot of it), or change the way your program handles the images.
The reason it uses so much memory is that you allocate the memory for your images at module level. When multiprocessing forks your process, it also copies all the images (which isn't free, according to Shared-memory objects in python multiprocessing). This isn't necessary, because you are also passing the images as an argument to the function, which the multiprocessing module copies again using IPC and pickle; either way, this is likely to result in a lack of memory. Try one of the solutions proposed by the other users.
This is what solved the problem: declaring the images global.
import numpy as np
import multiprocessing

N=1500 #N=1000 works fine

images=[]
for i in np.arange(N):
    images.append(np.random.random_integers(1,100,size=(500,500)))

def func(args):
    i=args[0]
    # 'images' is read from the module-level (global) scope here; on fork
    # the workers inherit it, so nothing large is pickled to them.
    print i, len(images)
    return 0

iter_args=[]
for i in range(0,1):
    iter_args.append([i])

pool=multiprocessing.Pool()
print pool
pool.map(func,iter_args)
The reason why you get IOError: bad message length when passing around large objects is a hard-coded limit of 0x7fffffff bytes (around 2.1 GB) in older CPython versions (3.2 and earlier): https://github.com/python/cpython/blob/v2.7.5/Modules/_multiprocessing/multiprocessing.h#L182
This CPython changeset (in CPython 3.3 and later) removed the hard-coded limit: https://github.com/python/cpython/commit/87cf220972c9cb400ddcd577962883dcc5dca51a#diff-4711c9abeca41b149f648d4b3c15b6a7d2baa06aa066f46359e4498eb8e39f60L182
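If upgrading Python is not an option, a workaround in the spirit of the other answers is to keep each pickled message well under that limit by sending the images in batches; a rough sketch (the batch size is an arbitrary choice):

import numpy as np
import multiprocessing

# Sketch: compute the mean image by farming out batch sums, so no single
# pickled message approaches the ~2.1 GB limit of older CPython versions.
def partial_sum(batch):
    return np.sum(batch, axis=0)

if __name__ == '__main__':
    N = 1500
    images = [np.random.random_integers(1, 100, size=(500, 500))
              for _ in range(N)]
    batch_size = 100          # arbitrary; small enough to keep each message modest
    batches = [images[k:k + batch_size] for k in range(0, N, batch_size)]
    pool = multiprocessing.Pool()
    partials = pool.map(partial_sum, batches)
    pool.close()
    pool.join()
    mean_image = np.sum(partials, axis=0) / float(N)
    print(mean_image.shape)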

pyOpenCL getting different results compared to numpy

I'm trying to get started with pyOpenCL and GPGPU in general.
For the below dot product code I'm getting fairly different results between the GPU and CPU versions. What am I doing wrong?
The difference of ~0.5% seems too large for floating-point error to account for. The difference does seem to increase with array size (about 9e-8 relative difference at an array size of 10000). Maybe it's an issue with combining results across blocks...? Either way, color me disconcerted.
I don't know if it matters: I'm running this on a MacBook Air, Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz, with Intel HD Graphics 5000.
Thanks in advance.
import pyopencl as cl
import numpy
from pyopencl.reduction import ReductionKernel
import pyopencl.clrandom as cl_rand
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
dot = ReductionKernel(ctx,
                      dtype_out=numpy.float32,
                      neutral="0",
                      map_expr="x[i]*y[i]",
                      reduce_expr="a+b",
                      arguments="__global const float *x, __global const float *y")
x = cl_rand.rand(queue, 100000000, dtype = numpy.float32)
y = cl_rand.rand(queue, 100000000, dtype = numpy.float32)
x_dot_y = dot(x,y).get() # GPU: array(25001304.0, dtype=float32)
x_dot_y_cpu = numpy.dot(x.get(), y.get()) # CPU: 24875690.0
print abs(x_dot_y_cpu - x_dot_y)/x_dot_y # 0.0050496689740063489
The order in which values are reduced will likely be very different between these two methods. Across large data sets, the tiny errors in floating point rounding can soon add up. There could also be other details about the underlying implementations that affect the precision of the result.
I've run your example code on my own machine and get a similar sort of difference in the final result (~0.5%). As a data point, you can implement a very simple dot product in raw Python and see how much that differs from both the OpenCL result and from Numpy.
For example, you could add something simple like this to your example:
x_dot_y_naive = sum(a*b for a,b in zip(x.get(), y.get()))
Here's the results I get on my machine:
OPENCL: 25003466.000000
NUMPY: 24878146.000000 (0.5012%)
NAIVE: 25003465.601387 (0.0000%)
As you can see, the naive implementation is closer to the OpenCL version than Numpy is. One explanation could be that Numpy's dot function probably makes use of fused multiply-add (FMA) operations, which change how intermediate results are rounded. Without compiler options telling it otherwise, OpenCL should comply fully with the IEEE 754 standard rather than use the faster FMA operations.
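To see how much of the gap is just single-precision accumulation, it is also worth comparing both results against a float64 reference; a small self-contained sketch with synthetic data:

import numpy as np

# Compare a float32 dot product against a float64 reference. The float32
# result typically sits much further from the reference, consistent with
# the accumulation-order explanation above.
rng = np.random.RandomState(0)
x = rng.rand(10000000).astype(np.float32)
y = rng.rand(10000000).astype(np.float32)
ref = np.dot(x.astype(np.float64), y.astype(np.float64))
f32 = np.dot(x, y)
print(abs(f32 - ref) / ref)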
