To speed up functions like np.std, np.sum, etc. along an axis of a huge n-dimensional NumPy array, it is often recommended to apply them along the last axis.
When I use np.transpose to rotate the axis I want to operate on to the last axis, is it really reshuffling the data in memory, or just changing the way the axes are addressed?
When I measured the time using %timeit, the transpose completed in microseconds, much less than the time required to copy the (112, 1024, 1024) array I was working with.
If it is not actually reordering the data in memory and only changing the addressing, will it still speed up np.sum or np.std when applied to the newly rotated last axis?
When I tried to measure it, it did seem to speed up, but I don't understand how.
Update
It doesn't really speed up with transpose. The fastest axis is the last one when the array is C-ordered, and the first one when it is Fortran-ordered. So there is no point in transposing before applying np.sum or np.std.
For my specific code, I solved the issue by passing order='F' during array creation, which made the first axis the fastest.
Thanks for all the answers.
Transpose just changes the strides, it doesn't touch the actual array. I think the reason why sum etc. along the final axis is recommended (I'd like to see the source for that, btw.) is that when an array is C-ordered, walking along the final axis preserves locality of reference. That won't be the case after you transpose, since the transposed array will be Fortran-ordered.
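A quick way to see this for yourself (a minimal sketch, using a small stand-in for the (112, 1024, 1024) array): transposing only reverses the strides and flips the contiguity flags, while the underlying buffer is shared.

```python
import numpy as np

# a small stand-in for the large array from the question
a = np.zeros((3, 4, 5), dtype=np.float32)
print(a.strides)                  # (80, 20, 4): C order, last axis tightest
print(a.flags['C_CONTIGUOUS'])    # True

t = a.transpose()                 # reverses the strides, copies nothing
print(t.strides)                  # (4, 20, 80)
print(t.flags['F_CONTIGUOUS'])    # True: the view is Fortran-ordered
print(np.shares_memory(a, t))     # True: same buffer, no data moved
```

This is why the transpose takes microseconds regardless of the array's size.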
To elaborate on larsman's answer, here are some timings:
# normal C (row-major) order array
>>> %%timeit a = np.random.randn(500, 400)
... np.sum(a, axis=1)
1000 loops, best of 3: 272 us per loop
# transposing and summing along the first axis makes no real difference
# to performance
>>> %%timeit a = np.random.randn(500, 400)
... np.sum(a.T, axis=0)
1000 loops, best of 3: 269 us per loop
# however, converting to Fortran (column-major) order does improve speed...
>>> %%timeit a = np.asfortranarray(np.random.randn(500, 400))
... np.sum(a, axis=1)
10000 loops, best of 3: 114 us per loop
# ... but only if you don't count the conversion in the timed operations
>>> %%timeit a = np.random.randn(500, 400)
... np.sum(np.asfortranarray(a), axis=1)
1000 loops, best of 3: 599 us per loop
In summary, it might make sense to convert your arrays to Fortran order if you're going to apply a lot of operations over the columns, but the conversion itself is costly and almost certainly not worth it for a single operation.
You can force a physical reordering with np.ascontiguousarray. For example...
a = np.ascontiguousarray(a.transpose())
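The array flags confirm the difference between the free transpose and the physical reorder (a small sketch):

```python
import numpy as np

a = np.random.randn(500, 400)
t = a.T                          # Fortran-ordered view: no copy yet
b = np.ascontiguousarray(t)      # now the data really is reordered

print(t.flags['C_CONTIGUOUS'])   # False: t is just a strided view
print(b.flags['C_CONTIGUOUS'])   # True: b lives in a fresh C-ordered buffer
print(np.shares_memory(t, b))    # False: the copy actually happened
```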
Related
I have a very large numpy array and I want to sort it and test if it is unique.
I'm aware of the function numpy.unique but it sorts the array another time to achieve it.
The reason I need the array sorted a priori is because the returned keys from the argsort function will be used to reorder another array.
I'm looking for a way to do both (argsort and unique test) without the need to sort the array again.
Example code:
import numpy as np
import numpy.random
# generating random arrays with 2 ^ 27 columns (it can grow even bigger!)
slices = np.random.random_integers(2 ** 32, size=2 ** 27)
values = np.random.random_integers(2 ** 32, size=2 ** 27)
# get an array of keys to sort slices AND values
# this operation takes a long time
sorted_slices = slices.argsort()
# sort both arrays
# it would be nice to make this operation in place
slices = slices[sorted_slices]
values = values[sorted_slices]
# test 'uniqueness'
# here, the np.unique function sorts the array again
if slices.shape[0] == np.unique(slices).shape[0]:
print('it is unique!')
else:
print('not unique!')
Both the arrays slices and values have 1 row and the same (huge) number of columns.
Thanks in advance.
You can check whether there are two or more equal values next to each other (non-unique values in a sorted array) by comparing their difference to 0:
numpy.any(numpy.diff(slices) == 0)
Be aware though that numpy will create two intermediate arrays: one with the difference values, one with boolean values.
Here's an approach using slicing: instead of actually computing the differences, we can just compare each element against the previous one, like so -
~((slices[1:] == slices[:-1]).any())
Runtime test -
In [54]: slices = np.sort(np.random.randint(0,100000000,(10000000)))
# @Nils Werner's soln
In [55]: %timeit ~np.any(np.diff(slices) == 0)
100 loops, best of 3: 18.5 ms per loop
# @Marco's suggestion in comments
In [56]: %timeit np.diff(slices).all()
10 loops, best of 3: 20.6 ms per loop
# Proposed soln in this post
In [57]: %timeit ~((slices[1:] == slices[:-1]).any())
100 loops, best of 3: 6.12 ms per loop
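The adjacent-comparison and diff-based checks agree on sorted input; a small sanity sketch:

```python
import numpy as np

def is_unique_sorted(s):
    # a sorted array is all-unique iff no element equals its neighbour
    return not (s[1:] == s[:-1]).any()

print(is_unique_sorted(np.array([1, 2, 3, 5])))   # True
print(is_unique_sorted(np.array([1, 2, 2, 5])))   # False

# agrees with the diff-based check on sorted data
s = np.sort(np.random.randint(0, 10, size=20))
print(is_unique_sorted(s) == (not np.any(np.diff(s) == 0)))   # True
```

Note that none of these re-sort the array, which is the whole point when the sort itself is the expensive step.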
I was calculating covariance run times like this in ipython
>>> from pandas import DataFrame
>>> import numpy as np
>>> # create data frame set
>>> df = get_data()
>>> df.shape
(4795, 1000)
>>> %timeit df.cov()
10 loops, best of 3: 99.5 ms per loop
>>> mat = np.matrix(df.values)
>>> %timeit np.dot(mat.transpose(), mat)
1 loops, best of 3: 1min per loop
So, I've figured out the why of the observed speed difference... but not the why of the why. I'll update when I find that.
This is the answer to: "Why is the DataFrame.cov method so much faster than converting to a numpy matrix and using the np.cov or np.dot method?"
The DataFrame's dtype was int64. When it was converted to a NumPy matrix using
mat = np.matrix(df.values)
the resulting mat object was also of type int64.
Under the hood, the DataFrame.cov method converts its matrix to float64 before calling numpy's covariance method.
When running %timeit on NumPy ndarrays or matrices of dtype int64, you see the same performance lag. On my machine, with a dataset of shape (16497, 5000), the int64 operations do not complete and sometimes crash with memory errors; float64 completes in seconds.
So the short answer is: the numpy.dot call above is slower than DataFrame.cov because the datatypes were different.
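A minimal sketch of the fix this implies: cast to float64 before the matrix product (the shape here is a made-up stand-in for the DataFrame's values):

```python
import numpy as np

# an int64 stand-in for df.values
mat = np.random.randint(0, 100, size=(1000, 50)).astype(np.int64)

# casting to float64 first takes the fast BLAS path
# (and avoids int64 overflow in the accumulated products)
matf = mat.astype(np.float64)
gram = np.dot(matf.T, matf)
print(gram.dtype, gram.shape)    # float64 (50, 50)
```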
I'm going to investigate why this gap exists.
I have spent the last hour trying to figure this out.
Suppose we have
import numpy as np
a = np.random.rand(5, 20) - 0.5
amin_index = np.argmin(np.abs(a), axis=1)
print(amin_index)
> [ 0 12 5 18 1] # or something similar
this does not work:
a[amin_index]
So, in essence, I need to find the minima along a certain axis for the array np.abs(a), but then extract the values from the array a at these positions. How can I apply an index to just one axis?
Probably very simple, but I can't get it figured out. Also, I can't use any loops since I have to do this for arrays with several million entries.
thanks 😊
One way is to pass in the array of row indexes (e.g. [0,1,2,3,4]) and the list of column indexes for the minimum in each corresponding row (your list amin_index).
This returns an array containing the value at [i, amin_index[i]] for each row i:
>>> a[np.arange(a.shape[0]), amin_index]
array([-0.0069325 , 0.04268358, -0.00128002, -0.01185333, -0.00389487])
Note that this is advanced (fancy) indexing rather than basic slicing, so the returned array is a new array in memory, not a view of a.
This is because argmin (with axis=1) returns the index of the minimum column for each row, so you need to index each row at those particular columns:
a[range(a.shape[0]), amin_index]
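If your NumPy is recent enough (1.15+), np.take_along_axis expresses the same pick-one-per-row operation without building the row-index array by hand (a sketch):

```python
import numpy as np

a = np.random.rand(5, 20) - 0.5
amin_index = np.argmin(np.abs(a), axis=1)

# indices must have the same ndim as a, hence the [:, None]
picked = np.take_along_axis(a, amin_index[:, None], axis=1)[:, 0]

# identical to the row/column pair indexing above
print(np.array_equal(picked, a[np.arange(a.shape[0]), amin_index]))   # True
```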
Why not simply do np.amin(np.abs(a), axis=1)? It's much simpler if you don't need the intermediate amin_index array from argmin. (Note, though, that this gives the minima of np.abs(a), not the signed values of a at those positions.)
Numpy's reference page is an excellent resource, see "Indexing".
Edit: timing is always useful:
In [3]: a=np.random.rand(4000, 4000)-.5
In [4]: %timeit np.amin(np.abs(a), axis=1)
10 loops, best of 3: 128 ms per loop
In [5]: %timeit a[np.arange(a.shape[0]), np.argmin(np.abs(a), axis=1)]
10 loops, best of 3: 135 ms per loop
In my code I am currently iterating and creating three lists:
data, row, col
There is a high repetition of (row, col) pairs, and in my final sparse matrix M I would like the value of M[row, col] to be the sum of all the corresponding elements in data. From reading the documentation, the coo_matrix format seems perfect and for small examples it works just fine.
The problem I am having is that when I scale up my problem size, it looks like the intermediate lists data, row, col are using up all of my (8gb) of memory and the swap space and my script gets automatically killed.
So my question is:
Is there an appropriate format or an efficient way of incrementally building my summed matrix so I don't have to store the full intermediate lists / numpy arrays?
My program loops over a grid, creating local_data, local_row, local_col lists at each point, the elements of which are then appended to data, row, col, so being able to update the sparse matrix with lists as per the sparse matrix constructors would be the ideal case.
There are two things that may be killing you: the duplicates or the overhead of a list over an array. In either case, probably the right thing to do is to grow your list only so big before dumping it into a coo_matrix and adding it to your total. I took a couple of timings:
rows = list(np.random.randint(100, size=(10000,)))
cols = list(np.random.randint(100, size=(10000,)))
values = list(np.random.rand(10000))
%timeit sps.coo_matrix((values, (rows, cols)))
100 loops, best of 3: 4.03 ms per loop
%timeit (sps.coo_matrix((values[:5000], (rows[:5000], cols[:5000]))) +
sps.coo_matrix((values[5000:], (rows[5000:], cols[5000:]))))
100 loops, best of 3: 5.24 ms per loop
%timeit sps.coo_matrix((values[:5000], (rows[:5000], cols[:5000])))
100 loops, best of 3: 2.16 ms per loop
So there is about a 25% overhead in splitting the lists in two, converting each to a coo_matrix and then adding them together. And it doesn't seem to be as bad if you do more splits:
%timeit (sps.coo_matrix((values[:2500], (rows[:2500], cols[:2500]))) +
sps.coo_matrix((values[2500:5000], (rows[2500:5000], cols[2500:5000]))) +
sps.coo_matrix((values[5000:7500], (rows[5000:7500], cols[5000:7500]))) +
sps.coo_matrix((values[7500:], (rows[7500:], cols[7500:]))))
100 loops, best of 3: 5.76 ms per loop
I am trying to remove rows or columns from an image represented by a Numpy array. My image is of type uint16 and 2560 x 2176. As an example, say I want to remove the first 16 columns to make it 2560 x 2160.
I'm a MATLAB-to-Numpy convert, and in MATLAB would use something like:
A = rand(2560, 2176);
A(:, 1:16) = [];
As I understand, this deletes the columns in place and saves a lot of time by not copying to a new array.
For Numpy, previous posts have used commands like numpy.delete. However, the documentation is clear that this returns a copy, and so I must reassign the copy to A. This seems like it would waste a lot of time copying.
import numpy as np
A = np.random.rand(2560, 2176)
A = np.delete(A, np.r_[:16], 1)
Is this truly as fast as an in-place deletion? I feel I must be missing a better method or not understanding how python handles array storage during deletion.
Relevant previous posts:
Removing rows in NumPy efficiently
Documentation for numpy.delete
Why not just do a slice? Here I'm removing the first 3000 columns instead of 16 to make the memory usage more clear:
import numpy as np
a = np.empty((5000, 5000))
a = a[:, 3000:]
This reduces the reported size of the array, as can be seen below. (Note that the slice is a view: whos reports the view's logical size, but the original 5000 x 5000 buffer stays allocated as long as any view still references it.)
In [31]: a = np.zeros((5000, 5000), dtype='d')
In [32]: whos
Variable Type Data/Info
-------------------------------
a ndarray 5000x5000: 25000000 elems, type `float64`, 200000000 bytes (190 Mb)
In [33]: a = a[:, 3000:]
In [34]: whos
Variable Type Data/Info
-------------------------------
a ndarray 5000x2000: 10000000 elems, type `float64`, 80000000 bytes (76 Mb)
For this size of array a slice seems to be about 10,000x faster than your delete option:
%timeit a=np.empty((5000,5000), dtype='d'); a=np.delete(a, np.r_[:3000], 1)
1 loops, best of 3: 404 ms per loop
%timeit a=np.empty((5000,5000), dtype='d'); a=a[:, 3000:]
10000 loops, best of 3: 39.3 us per loop
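The reason for the gap, in a sketch: the slice is a view onto the same buffer, while np.delete allocates and fills a new array (sizes shrunk here to keep it light).

```python
import numpy as np

a = np.zeros((500, 500))
b = a[:, 300:]                      # basic slicing: a view, no data moved
c = np.delete(a, np.r_[:300], 1)    # allocates and copies a new array

print(b.base is a)                  # True: b is a view onto a's buffer
print(np.shares_memory(a, c))       # False: c is an independent copy
```

So the slice is effectively free, which accounts for the ~10,000x difference in the timings above.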