Efficiently remove rows/columns of numpy image array - python

I am trying to remove rows or columns from an image represented by a Numpy array. My image is of type uint16 and 2560 x 2176. As an example, say I want to remove the first 16 columns to make it 2560 x 2160.
I'm a MATLAB-to-Numpy convert, and in MATLAB would use something like:
A = rand(2560, 2176);
A(:, 1:16) = [];
As I understand, this deletes the columns in place and saves a lot of time by not copying to a new array.
For Numpy, previous posts have used commands like numpy.delete. However, the documentation is clear that this returns a copy, and so I must reassign the copy to A. This seems like it would waste a lot of time copying.
import numpy as np
A = np.random.rand(2560, 2176)
A = np.delete(A, np.r_[:16], 1)
Is this truly as fast as an in-place deletion? I feel I must be missing a better method or not understanding how python handles array storage during deletion.
Relevant previous posts:
Removing rows in NumPy efficiently
Documentation for numpy.delete

Why not just do a slice? Here I'm removing the first 3000 columns instead of 16 to make the memory usage more clear:
import numpy as np
a = np.empty((5000, 5000))
a = a[:, 3000:]
This reduces the reported size of the array, as can be seen:
In [31]: a = np.zeros((5000, 5000), dtype='d')
In [32]: whos
Variable   Type       Data/Info
-------------------------------
a          ndarray    5000x5000: 25000000 elems, type `float64`, 200000000 bytes (190 Mb)
In [33]: a = a[:, 3000:]
In [34]: whos
Variable   Type       Data/Info
-------------------------------
a          ndarray    5000x2000: 10000000 elems, type `float64`, 80000000 bytes (76 Mb)
For this size of array a slice seems to be about 10,000x faster than your delete option:
%timeit a=np.empty((5000,5000), dtype='d'); a=np.delete(a, np.r_[:3000], 1)
1 loops, best of 3: 404 ms per loop
%timeit a=np.empty((5000,5000), dtype='d'); a=a[:, 3000:]
10000 loops, best of 3: 39.3 us per loop
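One caveat: the slice is only a view, so the original 200 MB buffer stays allocated for as long as anything references its base; if you actually need the memory back, an explicit copy() detaches it. A minimal sketch:
import numpy as np
a = np.zeros((5000, 5000), dtype='d')
view = a[:, 3000:]         # no data copied; still backed by a's 200 MB buffer
print(view.base is a)      # True
kept = a[:, 3000:].copy()  # standalone 80 MB array, independent of a
print(kept.base is None)   # True -> does not pin a's buffer
del a, view                # only now can the original 200 MB buffer be freed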

Related

Efficiently determining if large sorted numpy array has only unique values

I have a very large numpy array and I want to sort it and test if it is unique.
I'm aware of the function numpy.unique but it sorts the array another time to achieve it.
The reason I need the array sorted a priori is because the returned keys from the argsort function will be used to reorder another array.
I'm looking for a way to do both (argsort and unique test) without the need to sort the array again.
Example code:
import numpy as np
import numpy.random
# generating random arrays with 2 ^ 27 columns (it can grow even bigger!)
slices = np.random.random_integers(2 ** 32, size = 2 ** 27)
values = np.random.random_integers(2 ** 32, size = 2 ** 27)
# get an array of keys to sort slices AND values
# this operation takes a long time
sorted_slices = slices.argsort()
# sort both arrays
# it would be nice to make this operation in place
slices = slices[sorted_slices]
values = values[sorted_slices]
# test 'uniqueness'
# here, the np.unique function sorts the array again
if slices.shape[0] == np.unique(slices).shape[0]:
print('it is unique!')
else:
print('not unique!')
Both the arrays slices and values have 1 row and the same (huge) number of columns.
Thanks in advance.
You can check whether there are two or more equal values next to each other (non-unique values in a sorted array) by comparing their difference to 0
numpy.any(numpy.diff(slices) == 0)
Be aware though that numpy will create two intermediate arrays: one with the difference values, one with boolean values.
Here's an approach that uses slicing: instead of computing the actual differences, we can simply compare each element against the previous one, like so -
~((slices[1:] == slices[:-1]).any())
Runtime test -
In [54]: slices = np.sort(np.random.randint(0,100000000,(10000000)))
# @Nils Werner's soln
In [55]: %timeit ~np.any(np.diff(slices) == 0)
100 loops, best of 3: 18.5 ms per loop
# @Marco's suggestion in comments
In [56]: %timeit np.diff(slices).all()
10 loops, best of 3: 20.6 ms per loop
# Proposed soln in this post
In [57]: %timeit ~((slices[1:] == slices[:-1]).any())
100 loops, best of 3: 6.12 ms per loop
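Putting the pieces together for the original question, a minimal sketch (the helper name is just for illustration) that calls argsort once and reuses the sorted array for the adjacency-based uniqueness test:
import numpy as np
def argsort_and_is_unique(arr):
    # sort once; duplicates in the sorted copy must be adjacent
    order = arr.argsort()
    sorted_arr = arr[order]
    return order, not (sorted_arr[1:] == sorted_arr[:-1]).any()
slices = np.random.randint(0, 2**32, size=2**20, dtype=np.int64)  # smaller than 2**27, just for a quick test
values = np.random.randint(0, 2**32, size=2**20, dtype=np.int64)
order, unique = argsort_and_is_unique(slices)
slices, values = slices[order], values[order]
print('it is unique!' if unique else 'not unique!')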

Time access of elements in huge numpy array

I want to access and do some basic operations on the columns of a huge numpy array, non-sparse, full of floats, with shape 2 million x 800.
So basically:
import numpy as np
a = np.ones(2000000) # 2 million elements
%timeit np.log10(a)
10 loops, best of 3: 28.8 ms per loop
but if I create a bigger array with 800 columns, the access time to a column is ridiculously high:
a = np.ones((2000000, 800)) # takes about 12 GB of RAM
b = a[:, 0].copy() # 6 mins
%timeit np.log10(b)
10 loops, best of 3: 31.4 ms per loop
So my question is: how do you handle this situation? Are there better objects to store such a big matrix and do simple operations like log10, plots, etc.? Or do I need to trick it, like creating 800 np.arrays of shape (2000000,)? I've quickly read threads about PyTables and HDF, but I don't know whether those solutions are appropriate for my problem, which looks very simple at first sight...
Thanks a lot!
Edit: I am on Windows XP 64-bit, and the dtype of the array is float64.
I have tested this column access time with several shapes, and something strange came up:
a = np.ones((1000000, 800), dtype="float64")
%timeit np.log10(a[:, 0])
1 loops, best of 3: 150 ms per loop
and
a = np.ones((1100000, 800), dtype="float64") # just a few more rows
b = np.log10(a[:, 0]) # 6 min 20 sec
There is a threshold for the access time around 1,050,000 rows, so I suspect the problem is related to the maximum size of a memory allocation on a 64-bit machine?
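For reference, here is a small sketch (with a much smaller array that fits comfortably in RAM, so it does not reproduce the threshold effect itself) of how the memory layout determines what a column slice has to touch:
import numpy as np
n_rows, n_cols = 10000, 800
a_c = np.ones((n_rows, n_cols), order='C')   # row-major, numpy's default
a_f = np.ones((n_rows, n_cols), order='F')   # column-major
# In C order, consecutive elements of one column are n_cols * 8 bytes apart,
# so reading a single column strides across the whole buffer.
print(a_c[:, 0].strides)   # (6400,)
# In Fortran order the column is contiguous in memory.
print(a_f[:, 0].strides)   # (8,)
# Column-wise operations are therefore much friendlier on the Fortran-ordered array:
#   %timeit np.log10(a_c[:, 0])
#   %timeit np.log10(a_f[:, 0])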

Why is the pandas.DataFrame.cov() method orders of magnitude faster than numpy.dot(...) for the same data?

I was calculating covariance run times like this in IPython:
>>> from pandas import DataFrame
>>> import numpy as np
>>> # create data frame set
>>> df = get_data()
>>> df.shape
(4795, 1000)
>>> %timeit df.cov()
10 loops, best of 3: 99.5 ms per loop
>>> mat = np.matrix(df.values)
>>> %timeit np.dot(mat.transpose(), mat)
1 loops, best of 3: 1min per loop
So, I've figured out the why of the observed speed difference... but not the why of the why. I'll update when I find that.
This is the answer to: "Why is the DataFrame.cov method so much faster than converting to a numpy matrix and using the np.cov or np.dot method?"
The DataFrame's data type was int64. When it was converted to a numpy matrix using
mat = np.matrix(df.values)
the resulting mat object was also of type int64.
Under the hood, the DataFrame.cov method converts its matrix to float64 before calling numpy's covariance method.
When running timeit on numpy ndarrays or matrices of dtype int64, you see the same performance lag. On my machine, with a dataset of shape (16497, 5000), int64 operations do not complete and sometimes crash with memory errors; float64 completes in seconds.
So, the short answer is that the numpy.dot call above is slower than DataFrame.cov because the datatypes were different.
I'm going to investigate why this gap exists.
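A quick way to see the dtype effect in isolation, using synthetic data as a stand-in for get_data() (which isn't shown above):
import numpy as np
rng = np.random.RandomState(0)
mat_int = rng.randint(0, 100, size=(4795, 1000))   # integer dtype, like the original DataFrame
mat_flt = mat_int.astype(np.float64)               # same values, converted to float64
print(mat_int.dtype, mat_flt.dtype)
# Integer matrix products do not go through BLAS, while float64 dispatches to the
# optimized dgemm routine, so the float64 version is dramatically faster:
#   %timeit np.dot(mat_int.T, mat_int)
#   %timeit np.dot(mat_flt.T, mat_flt)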

Apply numpy index to matrix

I have spent the last hour trying to figure this out.
Suppose we have
import numpy as np
a = np.random.rand(5, 20) - 0.5
amin_index = np.argmin(np.abs(a), axis=1)
print(amin_index)
> [ 0 12 5 18 1] # or something similar
this does not work:
a[amin_index]
So, in essence, I need to find the minima along a certain axis for the array np.abs(a), but then extract the values from the array a at these positions. How can I apply an index to just one axis?
Probably very simple, but I can't get it figured out. Also, I can't use any loops since I have to do this for arrays with several million entries.
thanks 😊
One way is to pass in the array of row indexes (e.g. [0,1,2,3,4]) and the list of column indexes for the minimum in each corresponding row (your list amin_index).
This returns an array containing the value at [i, amin_index[i]] for each row i:
>>> a[np.arange(a.shape[0]), amin_index]
array([-0.0069325 , 0.04268358, -0.00128002, -0.01185333, -0.00389487])
This is advanced (fancy) indexing rather than basic indexing, so the returned array is actually a new array in memory rather than a view of a.
This is because argmin returns the column index of the minimum for each of the rows (with axis=1), therefore you need to access each row at those particular columns:
a[range(a.shape[0]), amin_index]
Why not simply do np.amin(np.abs(a), axis=1)? It's much simpler if you don't need the intermediate amin_index array from argmin (note that this gives you the absolute values, not the signed entries of a).
Numpy's reference page is an excellent resource, see "Indexing".
Edit: Timing is always useful:
In [3]: a=np.random.rand(4000, 4000)-.5
In [4]: %timeit np.amin(np.abs(a), axis=1)
10 loops, best of 3: 128 ms per loop
In [5]: %timeit a[np.arange(a.shape[0]), np.argmin(np.abs(a), axis=1)]
10 loops, best of 3: 135 ms per loop
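On newer NumPy versions (1.15+), np.take_along_axis expresses the same row-wise pick directly; a short sketch:
import numpy as np
a = np.random.rand(5, 20) - 0.5
amin_index = np.argmin(np.abs(a), axis=1)
picked = np.take_along_axis(a, amin_index[:, None], axis=1).squeeze(axis=1)
print(np.array_equal(picked, a[np.arange(a.shape[0]), amin_index]))   # True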

Is numpy.transpose reordering data in memory?

In order to speed up functions like np.std, np.sum, etc. along an axis of a huge n-dimensional numpy array, it is recommended to apply them along the last axis.
When I use np.transpose to rotate the axis I want to operate on into the last position, is it really reshuffling the data in memory, or just changing the way the axes are addressed?
When I tried to measure the time using %timeit, the transpose took microseconds (much less than the time required to copy the (112x1024x1024) array I was working with).
If it is not actually reordering the data in memory and only changing the addressing, will it still speed up np.sum or np.std when applied to the newly rotated last axis?
When I tried to measure it, it does seem to speed up, but I don't understand how.
Update
It doesn't really seem to speed up with transpose. The fastest axis is the last one when the array is C-ordered, and the first one when it is Fortran-ordered, so there is no point in transposing before applying np.sum or np.std.
For my specific code, I solved the issue by passing order='F' (Fortran order) during array creation, which made the first axis the fastest.
Thanks for all the answers.
Transpose just changes the strides; it doesn't touch the actual array. I think the reason why sum etc. along the final axis is recommended (I'd like to see the source for that, btw.) is that when an array is C-ordered, walking along the final axis preserves locality of reference. That won't be the case after you transpose, since the transposed array will be Fortran-ordered.
To elaborate on larsman's answer, here are some timings:
# normal C (row-major) order array
>>> %%timeit a = np.random.randn(500, 400)
>>> np.sum(a, axis=1)
1000 loops, best of 3: 272 us per loop
# transposing and summing along the first axis makes no real difference
# to performance
>>> %%timeit a = np.random.randn(500, 400)
>>> np.sum(a.T, axis=0)
1000 loops, best of 3: 269 us per loop
# however, converting to Fortran (column-major) order does improve speed...
>>> %%timeit a = np.asfortranarray(np.random.randn(500,400))
>>> np.sum(a, axis=1)
10000 loops, best of 3: 114 us per loop
# ... but only if you don't count the conversion in the timed operations
>>> %%timeit a = np.random.randn(500, 400)
>>> np.sum(np.asfortranarray(a), axis=1)
1000 loops, best of 3: 599 us per loop
In summary, it might make sense to convert your arrays to Fortran order if you're going to apply a lot of operations over the columns, but the conversion itself is costly and almost certainly not worth it for a single operation.
You can force a physical reorganization with np.ascontiguousarray. For example...
a = np.ascontiguousarray(a.transpose())
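A quick way to check the difference: the transpose shares memory with the original and only swaps the strides, while np.ascontiguousarray allocates a new C-ordered buffer. A minimal sketch:
import numpy as np
a = np.random.randn(500, 400)
t = a.T
print(a.strides, t.strides)      # (3200, 8) vs (8, 3200): same buffer, swapped strides
print(np.shares_memory(a, t))    # True  -> the transpose is just a view
c = np.ascontiguousarray(a.T)
print(np.shares_memory(a, c))    # False -> data physically copied into a new buffer
print(c.flags['C_CONTIGUOUS'])   # True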
