Python image pixel access is very slow

I'm playing around with different transform codings such as the DWT, DCT and DFT, and I've been questioning whether the data structure I'm using is costing me execution time. Here is an example: say I want to analyze every individual pixel in an image, perhaps to perform some sort of encoding. All I'm doing below is loading each pixel into the same variable "a" to demonstrate a very generic access pattern. This snippet takes around 66 ms, which is quite slow for me. Is there a better approach to image processing when it comes to individual reading and writing of pixels?
import time
import cv2
import numpy as np

class Example():
    def __init__(self):
        self.load_image("lena_312.png")  # 312 by 312

    def load_image(self, directory):
        self.img = cv2.imread(directory)
        self.height, self.width, self.channel = self.img.shape
        self.img_org = np.matrix(self.img[:, :, 0])  # image is greyscale

    def test(self):
        for j in range(0, self.height):
            for i in range(0, self.width):
                a = self.img_org[j, i]

if __name__ == "__main__":
    EX = Example()
    start = time.time()
    EX.test()
    print time.time() - start

Per-pixel access from pure Python is expected to be very slow: Python integers and per-element indexing carry a huge interpreter overhead compared with compiled numeric loops that exploit CPU caches and pipelining.
Consider using Pillow, or numpy/scipy, to do the bitmap operations in bulk, or at least try Cython if you must code your custom algorithm in unrestricted Python.
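As an illustration of the bulk-operations approach, here is a minimal sketch (not tied to any particular codec; the image name is the one from the question) that replaces the per-pixel double loop with whole-array numpy expressions that run in C:

import cv2
import numpy as np

img = cv2.imread("lena_312.png")            # 312 by 312, as in the question
plane = img[:, :, 0].astype(np.float64)     # greyscale plane as a float array

# per-pixel work expressed as whole-array operations, executed in C
shifted = plane - 128.0                     # visits every pixel, no Python loop
scaled = shifted * 0.5                      # same again

# reductions also touch every pixel without Python-level iteration
print plane.sum(), scaled.mean()

On a 312x312 image these whole-array operations typically run orders of magnitude faster than the explicit loop in test().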

Related

How to parallelize calculations of celestial bodies motion?

I have a piece of code which calculates the positions of some satellites and planets using Skyfield. For clarity, I use a Pandas DataFrame as a container for the positions and the corresponding time moments. I want to make the calculation parallel, but I always get the same error: TypeError: can't pickle Satrec objects. I tested different parallelizers, such as Dask, pandarallel, swifter and Pool.map().
An example of the piece of code to be parallelized:
def get_sun_position(self, row):
    t = self.ts.utc(row["Date"])  # from skyfield
    pos = self.earth.at(t).observe(self.sun).apparent().position.m  # from skyfield, error is here
    return pos

def get_sat_position(self, row):
    t = self.ts.utc(row["Date"])  # from skyfield
    pos = self.sat.at(t).position.m  # from skyfield, error is here
    return pos

def get_positions(self):
    self.df["sat_pos"] = self.df.swifter.apply(self.get_sat_position, axis=1)  # all the parallelization goes here
    self.df["sun_pos"] = self.df.swifter.apply(self.get_sun_position, axis=1)  # and here

    # the same implementation but using dask
    # self.df["sat_pos"] = dd.from_pandas(self.df, npartitions=4*cpu_count())\
    #     .map_partitions(lambda df: df.apply(lambda row: self.get_sat_position(row), axis=1))\
    #     .compute(scheduler='processes')
    # self.df["sun_pos"] = dd.from_pandas(self.df, npartitions=4*cpu_count())\
    #     .map_partitions(lambda df: df.apply(lambda row: self.get_sun_position(row), axis=1))\
    #     .compute(scheduler='processes')
To avoid pickle with Dask I tried to set the serialization manually, like serializers=['dask', 'pickle'], but it didn't help.
As I understand it, Skyfield uses sgp4, which contains the Satrec class.
I wonder whether there is some way to parallelize this .apply(), or should I not try to use Skyfield functions in parallel processing at all?
Alas, all of the mechanisms you are using to make the computation parallel do so by creating another process and then sending copies of all of the objects involved in the computation over to that process. The Satrec object is written in C++, not Python, to make it faster, and C++ objects have no native way to "serialize" themselves into bytes for transmission to another process (Python objects have that ability built in).
Have you profiled your code to see which steps are the most expensive? My guess is that most of the cost is in the Sun computation, because to achieve its high precision Skyfield needs to compute the Earth's orientation to very high accuracy, to place the Sun in the sky precisely enough even for radio astronomers.
But if you yourself don't need that high an accuracy, you could switch to lower-precision sky coordinates for the Sun. Before using t in get_sun_position(), try doing this to it:
t._nutation_angles = iau2000b(t.tt)
That will use a lower-precision estimate of the Earth's nutation (print the values before and after this change to see how big the difference is, and compare that to how much inaccuracy your application can stand), but it should also run faster.
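For context, here is a minimal sketch of how that line slots into the get_sun_position() method from the question (the import path skyfield.nutationlib is where Skyfield exposes iau2000b; everything else is the question's own code):

from skyfield.nutationlib import iau2000b

def get_sun_position(self, row):
    t = self.ts.utc(row["Date"])
    # swap in the faster, lower-precision IAU 2000B nutation series
    t._nutation_angles = iau2000b(t.tt)
    pos = self.earth.at(t).observe(self.sun).apparent().position.m
    return pos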

Optimizing a multithreaded numpy array function

Given 2 large arrays of 3D points (I'll call the first "source" and the second "destination"), I needed a function that would return the index in "destination" of the closest match for each element of "source", with this limitation: I can only use numpy... so no scipy, pandas, numexpr, cython...
To do this I wrote a function based on the "brute force" answer to this question. I iterate over the elements of source, find the closest element in destination, and return its index. Due to performance concerns, and again because I can only use numpy, I tried multithreading to speed it up. Here are both the threaded and unthreaded functions, and how they compare in speed on an 8-core machine.
import timeit
import numpy as np
from numpy.core.umath_tests import inner1d
from multiprocessing.pool import ThreadPool

def threaded(sources, destinations):
    # Define worker function
    def worker(point):
        dlt = (destinations - point)  # delta between destinations and given point
        d = inner1d(dlt, dlt)         # get squared distances
        return np.argmin(d)           # return closest index
    # Multithread!
    p = ThreadPool()
    return p.map(worker, sources)

def unthreaded(sources, destinations):
    results = []
    for i in range(len(sources)):
        dlt = (destinations - sources[i])  # difference between destinations and given point
        d = inner1d(dlt, dlt)              # get squared distances
        results.append(np.argmin(d))       # append closest index
    return results

# Setup the data
n_destinations = 10000  # 10k random destinations
n_sources = 10000       # 10k random sources
destinations = np.random.rand(n_destinations, 3) * 100
sources = np.random.rand(n_sources, 3) * 100

# Compare!
print 'threaded: %s' % timeit.Timer(lambda: threaded(sources, destinations)).repeat(1, 1)[0]
print 'unthreaded: %s' % timeit.Timer(lambda: unthreaded(sources, destinations)).repeat(1, 1)[0]
Results:
threaded: 0.894030461056
unthreaded: 1.97295164054
Multithreading seems beneficial, but I was hoping for more than a 2x speedup, given that the real-life datasets I deal with are much larger.
All recommendations to improve performance (within the limitations described above) will be greatly appreciated!
Ok, I've been reading the Maya documentation on Python and I came to these conclusions/guesses:
They're probably using CPython inside (several references to that documentation and not any other).
They're not fond of threads (lots of non-thread-safe methods).
Given the above, I'd say it's better to avoid threads. Because of the GIL, this is a common problem, and there are several ways to work around it:
Try to build a C/C++ extension. Once that is done, use threads in C/C++. Personally, I'd first try to get SIP working, and move on from there.
Use multiprocessing. Even if your custom Python distribution doesn't include it, you can get to a working version since it's all pure-Python code. multiprocessing is not affected by the GIL since it spawns separate processes (see the sketch at the end of this answer).
One of the above should work out for you. If not, try another parallel tool (after some serious praying).
On a side note, if you're using outside modules, be mindful of matching Maya's Python version. This may be the reason you couldn't build scipy; of course, scipy has a huge codebase and the Windows platform is not the most resilient for building things.
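As a minimal sketch of the multiprocessing route, applied to the brute-force search from the question (the chunk and process counts are arbitrary; inner1d is replaced by a plain numpy expression so the workers need nothing beyond numpy, and the __main__ guard matters because multiprocessing must be able to re-import the module):

import numpy as np
from multiprocessing import Pool

def closest_indices(args):
    # brute-force nearest-neighbour search for one chunk of sources
    chunk, destinations = args
    results = []
    for point in chunk:
        dlt = destinations - point       # deltas to every destination
        d = (dlt * dlt).sum(axis=1)      # squared distances, pure numpy
        results.append(int(np.argmin(d)))
    return results

def multiprocessed(sources, destinations, processes=4):
    # split the sources into chunks and farm each chunk out to a worker process
    chunks = np.array_split(sources, processes * 4)
    pool = Pool(processes)
    parts = pool.map(closest_indices, [(c, destinations) for c in chunks])
    pool.close()
    pool.join()
    return [i for part in parts for i in part]

if __name__ == '__main__':
    destinations = np.random.rand(10000, 3) * 100
    sources = np.random.rand(10000, 3) * 100
    print len(multiprocessed(sources, destinations))

Unlike threads, each worker is a separate interpreter, so the GIL no longer serializes the numpy calls; the trade-off is that destinations gets pickled once per chunk.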

Multiprocessing IOError: bad message length

I get an IOError: bad message length when passing large arguments to the map function. How can I avoid this?
The error occurs when I set N=1500 or bigger.
The code is:
import numpy as np
import multiprocessing

def func(args):
    i = args[0]
    images = args[1]
    print i
    return 0

N = 1500  # N=1000 works fine
images = []
for i in np.arange(N):
    images.append(np.random.random_integers(1, 100, size=(500, 500)))

iter_args = []
for i in range(0, 1):
    iter_args.append([i, images])

pool = multiprocessing.Pool()
print pool
pool.map(func, iter_args)
In the multiprocessing docs there is a function recv_bytes that raises an IOError. Could it be because of this? (https://python.readthedocs.org/en/v2.7.2/library/multiprocessing.html)
EDIT
If I use images as a numpy array instead of a list, I get a different error: SystemError: NULL result without error in PyObject_Call.
The slightly different code:
import numpy as np
import multiprocessing

def func(args):
    i = args[0]
    images = args[1]
    print i
    return 0

N = 1500  # N=1000 works fine
images = []
for i in np.arange(N):
    images.append(np.random.random_integers(1, 100, size=(500, 500)))
images = np.array(images)  # new

iter_args = []
for i in range(0, 1):
    iter_args.append([i, images])

pool = multiprocessing.Pool()
print pool
pool.map(func, iter_args)
EDIT2 The actual function that I use is:
def func(args):
    i = args[0]
    images = args[1]
    image = np.mean(images, axis=0)
    np.savetxt("image%d.txt" % (i), image)
    return 0
Additionally, the iter_args do not contain the same set of images:
iter_args = []
for i in range(0, 1):
    rand_ind = np.random.random_integers(0, N-1, N)
    iter_args.append([i, images[rand_ind]])
You're creating a pool and sending all the images to func() at once. If you can get away with working on a single image at a time, try something like this, which runs to completion with N=10000 in 35 s under Python 2.7.10 for me:
import numpy as np
import multiprocessing

def func(args):
    i = args[0]
    img = args[1]
    print "{}: {} {}".format(i, img.shape, img.sum())
    return 0

N = 10000
images = ((i, np.random.random_integers(1, 100, size=(500, 500))) for i in xrange(N))

pool = multiprocessing.Pool(4)
pool.imap(func, images)
pool.close()
pool.join()
The key here is to use iterators so you don't have to hold all the data in memory at once. For instance, I converted images from an array holding all the data into a generator expression that creates each image only when needed; you could modify this to load your images from disk or whatever. I also used pool.imap instead of pool.map.
If you can, try to load the image data in the worker function instead. Right now you have to serialize all the data and ship it across to another process; if your image data is large, this might be a bottleneck.
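A minimal sketch of that idea (the per-image .npy files are hypothetical stand-ins for however your images are actually stored; only the small index travels between processes):

import numpy as np
import multiprocessing

def func(i):
    # load the image inside the worker, so only the index is pickled
    img = np.load("image_%d.npy" % i)   # hypothetical per-image file
    return float(img.mean())

if __name__ == '__main__':
    # stand-in for your real on-disk images
    for i in range(8):
        np.save("image_%d.npy" % i, np.random.random_integers(1, 100, size=(500, 500)))

    pool = multiprocessing.Pool(4)
    print pool.map(func, range(8))
    pool.close()
    pool.join()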
[update now that we know func has to handle all the images at once]
You could compute an iterative mean over your images. Here's a solution without multiprocessing; to use multiprocessing, you could divide your images into chunks and farm those chunks out to the pool (a sketch of that follows the code below).
import numpy as np

N = 10000
shape = (500, 500)

def func(images):
    # accumulate in float explicitly so img / N is never integer division
    average = np.full(shape, 0.0)
    for i, img in images:
        average += img / float(N)
    return average

images = ((i, np.full(shape, i)) for i in range(N))
print func(images)
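And a minimal sketch of the chunked multiprocessing variant (the chunk count is arbitrary; each worker returns a partial sum and the parent combines them):

import numpy as np
import multiprocessing

shape = (500, 500)

def partial_sum(chunk):
    # sum one chunk of images; only this chunk is shipped to the worker
    total = np.zeros(shape)
    for img in chunk:
        total += img
    return total

if __name__ == '__main__':
    N = 100
    images = [np.random.random_integers(1, 100, size=shape) for _ in range(N)]
    chunks = [images[i::4] for i in range(4)]   # 4 roughly equal chunks
    pool = multiprocessing.Pool(4)
    sums = pool.map(partial_sum, chunks)
    pool.close()
    pool.join()
    average = sum(sums) / float(N)
    print average.shape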
Python is likely to load your data into RAM, and you need that memory to be available. Have you checked your computer's memory usage?
Also, as Patrick mentioned, you're loading 3 GB of data, so make sure you use the 64-bit version of Python, as you are hitting the 32-bit memory constraint. This could cause your process to crash: 32 vs 64 bits Python
Another improvement would be to use Python 3.4 instead of 2.7. The Python 3 implementation seems to be optimized for very large ranges; see Python3 vs Python2 list/generator range performance
When running your program it actually gives me a clear error:
OSError: [Errno 12] Cannot allocate memory
As other users mentioned, the solution to your problem is simple: add memory (a lot), or change the way your program handles the images.
The reason it uses so much memory is that you allocate the images at module level, so when multiprocessing forks your process it also copies all of them (which isn't free, according to Shared-memory objects in python multiprocessing). This is unnecessary, because you are also passing the images as an argument to the function, which the multiprocessing module copies once more using IPC and pickle; either copy alone would still likely cause a lack of memory. Try one of the solutions proposed by the other users.
This is what solved the problem: declaring the images global.
import numpy as np
import multiprocessing

N = 1500  # N=1000 works fine
images = []
for i in np.arange(N):
    images.append(np.random.random_integers(1, 100, size=(500, 500)))

def func(args):
    i = args[0]
    # 'images' is read from the module-level (global) scope, so the forked
    # workers inherit it instead of receiving it through pickle
    print i, len(images)
    return 0

iter_args = []
for i in range(0, 1):
    iter_args.append([i])

pool = multiprocessing.Pool()
print pool
pool.map(func, iter_args)
The reason you get IOError: bad message length when passing around large objects is a hard-coded limit of 0x7fffffff bytes (around 2.1 GB) in older CPython versions (3.2 and earlier): https://github.com/python/cpython/blob/v2.7.5/Modules/_multiprocessing/multiprocessing.h#L182
This CPython changeset (which is in CPython 3.3 and later) removed the hard-coded limit: https://github.com/python/cpython/commit/87cf220972c9cb400ddcd577962883dcc5dca51a#diff-4711c9abeca41b149f648d4b3c15b6a7d2baa06aa066f46359e4498eb8e39f60L182

Multiple CPU usage when accessing data attached to traited classes

I have an application that uses a number of classes inheriting from HasTraits. Some of these classes manage access to data and others provide functions for analyzing that data. This works wonderfully for a GUI: I can check that the data and analysis code are doing what they should. However, I've noticed that when I use these classes for GUI-less computations, all the CPUs on the system end up getting used.
Here is a small example that shows the cpu usage:
from traits.api import HasTraits, List, Int, Enum, Instance
import numpy as np
import psutil
from itertools import combinations

"""
Small example of high CPU usage by traited classes
"""

class DataStorage(HasTraits):
    nsamples = Int(2000)
    samples = List

    def _samples_default(self):
        return np.random.randn(self.nsamples, 2000).tolist()

    def sample_samples(self, indices):
        """ return a 2D array of data at indices """
        return np.array([self.samples[i] for i in indices])

class DataAccessor(HasTraits):
    """ Class that grabs data and computes something """
    measure = Enum("correlation", "covariance")
    data_source = Instance(DataStorage, ())

    def compute_measure(self, indices):
        """ example of some computation """
        samples = self.data_source.sample_samples(indices)
        percentage = psutil.cpu_percent(interval=0, percpu=True)
        if self.measure == "correlation":
            result = np.corrcoef(samples)
        elif self.measure == "covariance":
            result = np.cov(samples)
        return percentage

# Run a simulation to see cpu usage
analyzer = DataAccessor()
usage = []
n_iterations = 0
max_iterations = 500
for combo in combinations(np.arange(2000), 500):
    # evaluate the measurement on a subset of the data
    usage.append(analyzer.compute_measure(combo))
    n_iterations += 1
    if n_iterations > max_iterations:
        break

print n_iterations
use_percents = np.array(usage).T
When I run this on an 8-CPU machine running CentOS, top reports the python process at roughly 600%.
>>> use_percents.mean(1)
shows
array([ 67.05548902, 67.06906188, 66.89041916, 67.28942116,
66.69421158, 67.61437126, 99.8007984 , 67.31996008])
Question:
My computation is embarrassingly parallel, so it would be great to have the other CPUs available to split up the job. Does anyone know what's happening here? A plain Python version of this uses 100% of a single CPU.
Is there a way to keep everything local to a single CPU without rewriting all my classes to drop Traits?
Traits is not causing the CPU usage. It's easy to rewrite this bit of code without Traits, and you will see that you get the same pattern of CPU usage (at least, I do).
Instead, what you are probably seeing is the CPU usage of the BLAS library that your build of numpy is linked against. numpy.corrcoef() calls numpy.cov(), and much of the computation of numpy.cov() is taken up by a numpy.dot() call, which does a matrix-matrix multiplication using BLAS. If it is an optimized BLAS library, it will usually use non-Python threads internally to split these computations among your CPUs. You will have to consult the documentation of your optimized BLAS library to find out how to change this.
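Which knob that is depends on your BLAS, but a common sketch looks like this (the environment variables cover the usual OpenMP/OpenBLAS/MKL builds, and they must be set before numpy is imported):

import os

# cap the worker threads of common BLAS back-ends; which variable takes
# effect depends on the library your numpy build is linked against
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP-based BLAS builds
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS-specific override
os.environ["MKL_NUM_THREADS"] = "1"        # Intel MKL

import numpy as np  # import only after the limits are in place

a = np.random.randn(2000, 2000)
c = np.corrcoef(a)  # the matrix multiply inside should now stay on one CPU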

Optimising memory usage in numpy

The following program loads two images with PyGame, converts them to Numpy arrays, and then performs some other Numpy operations (such as FFT) to emit a final result (of a few numbers). The inputs can be large, but at any moment only one or two large objects should be live.
A test image is about 10M pixels, which translates to 10 MB once it's greyscaled. It gets converted to a Numpy array of dtype uint8, which after some processing (applying Hamming windows) is an array of dtype float64. Two images are loaded into arrays this way; later FFT steps produce an array of dtype complex128. Prior to adding the excessive gc.collect calls, the program's memory size tended to increase with each step. Additionally, it seems most Numpy operations return results in the highest precision available.
Running the test (sans the gc.collect calls) on my 1 GB Linux machine results in prolonged thrashing, which I have not waited out. I don't yet have detailed memory-use stats -- I tried some Python modules and the time command to no avail; now I'm looking into valgrind. Watching ps (and dealing with machine unresponsiveness in the later stages of the test) suggests a maximum memory usage of about 800 MB.
A 10-million-cell array of complex128 should occupy 160 MB. Having (ideally) at most two of these live at one time, plus the not-insubstantial Python and Numpy libraries and other paraphernalia, probably means allowing for 500 MB.
I can think of two angles from which to attack the problem:
Discarding intermediate arrays as soon as possible. That's what the gc.collect calls are for -- they seem to have improved the situation, as it now completes with only a few minutes of thrashing ;-). I think one can expect that memory-intensive programming in a language like Python will require some manual intervention.
Using less-precise Numpy arrays at each step. Unfortunately the operations that return arrays, like fft2, do not appear to allow the type to be specified.
So my main question is: is there a way of specifying output precision in Numpy array operations?
More generally, are there other common memory-conserving techniques when using Numpy?
Additionally, does Numpy have a more idiomatic way of freeing array memory? (I imagine this would leave the array object live in Python, but in an unusable state.) Explicit deletion followed by immediate GC feels hacky.
import sys
import gc

import numpy
import pygame

def get_image_data(filename):
    im = pygame.image.load(filename)
    im2 = im.convert(8)
    a = pygame.surfarray.array2d(im2)
    hw1 = numpy.hamming(a.shape[0])
    hw2 = numpy.hamming(a.shape[1])
    a = a.transpose()
    a = a * hw1
    a = a.transpose()
    a = a * hw2
    return a

def check():
    gc.collect()
    print 'check'

def main(args):
    pygame.init()
    pygame.sndarray.use_arraytype('numpy')
    filename1 = args[1]
    filename2 = args[2]

    im1 = get_image_data(filename1)
    im2 = get_image_data(filename2)
    check()

    out1 = numpy.fft.fft2(im1)
    del im1
    check()

    out2 = numpy.fft.fft2(im2)
    del im2
    check()

    out3 = out1.conjugate() * out2
    del out1, out2
    check()

    correl = numpy.fft.ifft2(out3)
    del out3
    check()

    maxs = correl.argmax()
    maxpt = maxs % correl.shape[0], maxs / correl.shape[0]
    print correl[maxpt], maxpt, (correl.shape[0] - maxpt[0], correl.shape[1] - maxpt[1])

if __name__ == '__main__':
    args = sys.argv
    exit(main(args))
This answer on SO says "Scipy 0.8 will have single precision support for almost all the fft code", and SciPy 0.8.0 beta 1 is just out. (Haven't tried it myself, cowardly.)
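If that pans out, a minimal sketch of the idea (this assumes a SciPy build whose fftpack has the new single-precision support, so float32 input comes back as complex64 rather than being promoted to complex128):

import numpy
import scipy.fftpack

a = numpy.random.rand(1000, 1000).astype(numpy.float32)  # stand-in image data

out = scipy.fftpack.fft2(a)   # with single-precision support: complex64
print out.dtype, out.nbytes   # half the memory of a complex128 result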
If I understand correctly, you are calculating a convolution between two images. The SciPy package contains a dedicated module for that (ndimage), which might be more memory-efficient than the "manual" approach via Fourier transforms. It would be good to try it instead of going through Numpy.
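A minimal sketch of that suggestion (this assumes the second image is small enough to act as a kernel; scipy.ndimage works in the spatial domain, so it avoids the large complex intermediates, though it becomes slow for big kernels):

import numpy
from scipy import ndimage

image = numpy.random.rand(1000, 1000)
template = numpy.random.rand(9, 9)   # assumption: a small matching kernel

# direct (spatial-domain) correlation; no complex128 intermediates
result = ndimage.correlate(image, template, mode='constant')
print result.shape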
