Splitting multidimensional array in Numpy - python

I'm trying to split a multidimensional array (array)
import numpy as np
shape = (3, 4, 4, 2)
array = np.random.randint(0,10,shape)
into an array (new_array) where dimension 1 has been split in two and dimension 2 has been split in two. In the working split method below, the two block indices end up leading, so the result has shape (2, 2, 3, 2, 2, 2).
So far I got a working method which is:
div_x = 2
div_y = 2
new_dim_x = shape[1]//div_x
new_dim_y = shape[2]//div_y
new_array_split = np.array([
    np.split(each_sub, axis=2, indices_or_sections=div_y)
    for each_sub in np.split(array[:, :(new_dim_x*div_x), :(new_dim_y*div_y)],
                             axis=1, indices_or_sections=div_x)
])
I'm also looking into using reshape:
new_array_reshape = array[:, :(div_x*new_dim_x), :(div_y*new_dim_y), ...].reshape(shape[0], div_x, div_y, new_dim_x, new_dim_y, shape[-1]).transpose(1,2,0,3,4,5)
The reshape method is faster than the split method:
%timeit array[:, :(div_x*new_dim_x), :(div_y*new_dim_y), ...].reshape(shape[0], div_x, div_y, new_dim_x, new_dim_y, shape[-1]).transpose(1,2,0,3,4,5)
2.16 µs ± 44.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.array([np.split(each_sub, axis=2, indices_or_sections=div_y) for each_sub in np.split(array[:, :(new_dim_x*div_x), :(new_dim_y*div_y)], axis=1, indices_or_sections=div_x)])
58.3 µs ± 2.13 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
However, I cannot get the same results, because of the last dimension:
print('Reshape method')
print(new_array_reshape[1,0,0,...])
print('\nSplit method')
print(new_array_split[1,0,0,...])
Reshape method
[[[2 2]
  [4 3]]

 [[3 5]
  [5 9]]]

Split method
[[[2 2]
  [4 3]]

 [[5 3]
  [9 8]]]
The split method does exactly what I want (I checked number by number, and it performs exactly the split I want), but not at the speed I would like.
QUESTION
Is there a way to achieve the same results as the split method, using reshape or any other approach?
CONTEXT
The array is actually a data flow from image processing, where the first dimension of array is time, the second dimension is coordinate x (4), the third dimension is coordinate y (4), and the fourth dimension (2) holds the magnitude and phase of the flow.
I would like to split the images (coordinates x and y) into 2x2 subimages so I can analyse the flow more locally, perform averages, clustering, etc.
This splitting is going to be performed many times, which is why I'm looking for an optimal and efficient solution. I believe the way is probably reshape, but I'm open to any other option.

Reshape and permute axes -
array.reshape(3,2,2,2,2,2).transpose(1,3,0,2,4,5)
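As a quick sanity check (a sketch assuming the (3, 4, 4, 2) array from the question), this reshape/transpose matches the split-based reference exactly:

import numpy as np

shape = (3, 4, 4, 2)
array = np.random.randint(0, 10, shape)

# axis 1 (4) becomes (2, 2) and axis 2 (4) becomes (2, 2); the
# transpose moves the two block indices to the front: (2, 2, 3, 2, 2, 2)
new_array_reshape = array.reshape(3, 2, 2, 2, 2, 2).transpose(1, 3, 0, 2, 4, 5)

# the question's split-based reference
new_array_split = np.array([
    np.split(each_sub, indices_or_sections=2, axis=2)
    for each_sub in np.split(array, indices_or_sections=2, axis=1)
])

print((new_array_reshape == new_array_split).all())  # True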

For your use case I'm not sure reshape is the best option. If you want to be able to locally average and cluster, you might want a window function:
from skimage.util import view_as_windows
def window_over(arr, size=2, step=2, axes=(1, 2)):
    # the window spans every axis fully, except the requested
    # axes, which get the window size
    wshp = list(arr.shape)
    for a in axes:
        wshp[a] = size
    return view_as_windows(arr, wshp, step).squeeze()
window_over(array).shape
Out[]: (2, 2, 3, 2, 2, 2)
Your output axes can then be rearranged how you want using transpose. The benefit of this is that you can get the intermediate windows:
window_over(array, step=1).shape
Out[]: (3, 3, 3, 2, 2, 2)
That includes the 2x2 windows that overlap, so you get 3x3 results.
Since overlapping is possible, you also don't need your windows to be divisible by the dimension size:
window_over(array, size=3, step=1).shape
Out[]: (2, 2, 3, 3, 3, 2)
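For the local-averaging use case mentioned in the question (an assumed example, not part of the original answer), the trailing axes of the squeezed windows are (time, x, y, magnitude/phase), so each 2x2 subimage can be averaged by reducing over the spatial window axes:

# assuming `array` is the (3, 4, 4, 2) array from the question
windows = window_over(array)            # (2, 2, 3, 2, 2, 2)
# axes 3 and 4 are the 2x2 spatial window within each block
local_mean = windows.mean(axis=(3, 4))  # (2, 2, 3, 2)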

Related

copy from two multidimensional numpy array to another with different shape

I have two numpy arrays of the following shape:
print(a.shape) -> (100, 20, 3, 3)
print(b.shape) -> (100, 3)
Array a is empty, as I just need this predefined shape, I created it with:
a = numpy.empty(shape=(100, 20, 3, 3))
Now I would like to copy data from array b to array a so that the second and third dimensions of array a get filled with the same 3 values of the corresponding row of array b.
Let me try to make it a bit clearer:
Array b contains 100 rows, and each row holds three values.
Every row of array a (100, 20, 3, 3) should hold those same three values in its last dimension, with those three values repeated identically across the second and third dimensions for the same row.
How can I copy the data as described without using loops? I just cannot get it done, but there must be an easy solution for this.
We can make use of np.broadcast_to.
If you are okay with a view -
np.broadcast_to(b[:,None, None, :], (100, 20, 3, 3))
If you need an output with its own memory space, simply append with .copy().
If you want to save on memory and fill into already defined array, a :
a[:] = b[:,None,None,:]
Note that we can skip the trailing :s.
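A quick check of the view/copy distinction (a sketch, assuming b as in the timings below):

import numpy as np

b = np.random.rand(100, 3)
view = np.broadcast_to(b[:, None, None, :], (100, 20, 3, 3))
print(view.flags.writeable)       # False: broadcast views are read-only
print(np.shares_memory(view, b))  # True: no data was copied

owned = view.copy()               # independent, writable memory
owned[0, 0, 0, 0] = -1.0          # safe: does not touch b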
Timings :
In [20]: b = np.random.rand(100, 3)
In [21]: %timeit np.broadcast_to(b[:,None, None, :], (100, 20, 3, 3))
5.93 µs ± 64.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [22]: %timeit np.broadcast_to(b[:,None, None, :], (100, 20, 3, 3)).copy()
11.4 µs ± 56.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [23]: %timeit np.repeat(np.repeat(b[:,None,None,:], 20, 1), 3, 2)
39.3 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
You can use repeat along an axis. You also do not need to predefine a. I would also suggest NOT using broadcast_to, since it returns a read-only view whose memory is shared among elements:
a = np.repeat(b[:,None,None,:], 20, 1) #adds dimensions 1 and 2 and repeats 20 times along axis 1
a = np.repeat(a, 3, 2) #repeats 3 times along axis 2
Smaller example:
b = np.arange(2*3).reshape(2,3)
#[[0 1 2]
# [3 4 5]]
a = np.repeat(b[:,None,None,:], 2, 1)
a = np.repeat(a, 3, 2)
#shape(2,2,3,3)
[[[[0 1 2]
   [0 1 2]
   [0 1 2]]

  [[0 1 2]
   [0 1 2]
   [0 1 2]]]


 [[[3 4 5]
   [3 4 5]
   [3 4 5]]

  [[3 4 5]
   [3 4 5]
   [3 4 5]]]]
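For reference (a sketch, not part of the original answer), the in-place broadcasting assignment from the first answer produces exactly the same array as this repeat-based construction:

import numpy as np

b = np.arange(2 * 3).reshape(2, 3)
a_rep = np.repeat(np.repeat(b[:, None, None, :], 2, 1), 3, 2)

a_bc = np.empty((2, 2, 3, 3), dtype=b.dtype)
a_bc[:] = b[:, None, None, :]     # broadcasting fill

print((a_rep == a_bc).all())      # True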

Reducing an X,Y,Z list to an X,Y list and retrieving the min & max X of the XY PTS

I have large data sets of XYZ data, and I need to extract only the XY as PTS and find their minimum and maximum. The real data sets are floats... (I still struggle to understand list comprehensions, I guess...) This is part of a bigger problem with ASCII Grid Files that I'm trying to resolve by creating a larger XYZ data set (which is a poor approach, I know...).
# I have got this... xyz data...
PTSXYZ = [[1,1,3],[4,4,2],[6,4,1],[6,6,5]]
# I want to get this....xy data PTSXY = [[1,1],[4,4],[6,4],[6,6]]
# I have Tried.... this ??
PTSXY = [PTSXYZ[0][i],PTSXYZ[1][i] for i in PTSXYZ] # Addressing Wrong ?
#Then If I want the Minimum & Maximum X Values..
print min(PTSXY[:][0]),max(PTSXY[:][0])
Retrieving the XY slice
For retrieving from the XYZ matrix the XY matrix you can proceed with slicing:
import numpy as np
a = np.array([[1, 1, 3], [4, 4, 2], [6, 4, 1], [6, 6, 5]])
a[:, :2]
With its output being:
array([[1, 1],
       [4, 4],
       [6, 4],
       [6, 6]])
Retrieving the max and min from x
Then, for retrieving the min and max of the X you can use numpy.min and numpy.max:
_max, _min = np.max(a[:, 0]), np.min(a[:, 0])
Obtaining:
6, 1
How numpy efficiency differs from list comprehension
Let's run a test to see the difference between numpy slicing and a list comprehension over a numpy array:
import numpy as np
a = np.random.randint(1000, size=(1000, 3))
%timeit [[i[0], i[1]] for i in a]
>>> 483 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit a[:,:2]
>>> 391 ns ± 6.29 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
As expected, numpy is amazingly faster.
For your first question, you can use list comprehension to create a new list based on the first two elements of PTSXYZ:
PTSXY = [i[:2] for i in PTSXYZ]
print PTSXY
# [[1, 1], [4, 4], [6, 4], [6, 6]]
You might also want to consider a generator, which won't store the entire list in memory (note, though, that a generator can only be iterated once, so the list version is what you want for the two passes below):
PTSXY = (i[:2] for i in PTSXYZ)
For the second question, you can get the min and max by creating a list of X elements and taking their min and max:
print min(i[0] for i in PTSXY)
# 1
print max(i[0] for i in PTSXY)
# 6
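Since the data sets are large, it may pay to extract the X column once and reuse it for both extremes (a sketch, assuming PTSXYZ as above):

PTSXYZ = [[1, 1, 3], [4, 4, 2], [6, 4, 1], [6, 6, 5]]

xs = [p[0] for p in PTSXYZ]  # X column only, one pass
print(min(xs), max(xs))      # 1 6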
Once you get the XY list as explained above, you can get the minimum and maximum in a similar way:
PTSXY = [[i[0], i[1]] for i in PTSXYZ]
minimum = min([i[0] for i in PTSXY])
maximum = max([i[0] for i in PTSXY])

Altering arrays of different dimensions to be broadcasted together

I am looking for a more optimized way to convert a (n,n) or (n,n,1) matrix to a (n,n,3) matrix. I start out with an (n,n,3), but my dimensions get reduced after I perform a sum over the second axis to (n,n). Essentially, I want to keep the original size of the array and have the second axis just repeated 3 times. The reason I need this is that I will later be broadcasting it with another (n,n,3) array, but they need the same dimensions.
My current method works, but does not seem elegant.
a = np.random.random((n, n))
b = a.flatten().tolist()
a = np.array(zip(b, b, b))  # Python 2; use list(zip(b, b, b)) in Python 3
a.shape = n, n, 3
This setup gives the desired result, but is clunky and hard to follow. Is there perhaps a way to go directly from an (n,n) to an (n,n,3) by duplicating the second index? Or perhaps a way to not downsize the array to begin with?
None or np.newaxis is a common way of adding a dimension to an array. reshape with (3,3,1) works just as well:
In [64]: arr=np.arange(9).reshape(3,3)
In [65]: arr1 = arr[...,None]
In [66]: arr1.shape
Out[66]: (3, 3, 1)
repeat as function or method replicates this.
In [72]: arr2=arr1.repeat(3,axis=2)
In [73]: arr2.shape
Out[73]: (3, 3, 3)
In [74]: arr2[0,0,:]
Out[74]: array([0, 0, 0])
But you might not need to do this. With broadcasting a (3,3,1) works with a (3,3,3).
In [75]: (arr1+arr2).shape
Out[75]: (3, 3, 3)
In fact it will broadcast with a (3,) to produce (3,3,3).
In [77]: arr1+np.ones(3,int)
Out[77]:
array([[[1, 1, 1],
        [2, 2, 2],
        ...
       [[7, 7, 7],
        [8, 8, 8],
        [9, 9, 9]]])
So arr1+np.zeros(3,int) is another way of expanding that (3,3,1) to (3,3,3).
The broadcasting rules are:
(3,3,1) + (3,) => (3,3,1) + (1,1,3) => (3,3,3)
broadcasting adds dimensions at the start as needed.
When you sum on an axis, you can keep the original number of dimensions with a parameter:
In [78]: arr2.sum(axis=2).shape
Out[78]: (3, 3)
In [79]: arr2.sum(axis=2, keepdims=True).shape
Out[79]: (3, 3, 1)
This is handy if you want to subtract the mean from an array along any dimension:
arr2-arr2.mean(axis=2, keepdims=True)
You can first create a new axis (axis 2) on a and then use np.repeat along this new axis:
np.repeat(a[:,:,None], 3, axis = 2)
Or another approach, flatten the array, repeat elements and then reshape:
np.repeat(a.ravel(), 3).reshape(n,n,3)
The result comparison:
import numpy as np
n = 4
a=np.random.random((n,n))
b=a.flatten().tolist()
a1=np.array(zip(b,b,b))
a1.shape=n,n,3
# a1 is the result from the original method
(np.repeat(a[:,:,None], 3, axis = 2) == a1).all()
# True
(np.repeat(a.ravel(), 3).reshape(4,4,3) == a1).all()
# True
Timing: using the built-in numpy.repeat also shows a speed-up:
import numpy as np
n = 4
a = np.random.random((n, n))

def rep():
    b = a.flatten().tolist()
    a1 = np.array(zip(b, b, b))
    a1.shape = n, n, 3
%timeit rep()
# 100000 loops, best of 3: 7.11 µs per loop
%timeit np.repeat(a[:,:,None], 3, axis = 2)
# 1000000 loops, best of 3: 1.64 µs per loop
%timeit np.repeat(a.ravel(), 3).reshape(4,4,3)
# 1000000 loops, best of 3: 1.9 µs per loop
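For completeness, np.tile expresses the same expansion (a sketch, not from the original answers):

import numpy as np

n = 4
a = np.random.random((n, n))

# tile the trailing singleton axis 3 times: (n, n, 1) -> (n, n, 3)
a3 = np.tile(a[:, :, None], (1, 1, 3))
print(a3.shape)                                            # (4, 4, 3)
print((a3 == np.repeat(a[:, :, None], 3, axis=2)).all())   # True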

How can I select values along an axis of an nD array with an (n-1)D array of indices of that axis?

This is motivated by my answer here.
Given array A with shape (n0,n1), and array J with shape (n0), I'd like to create an array B with shape (n0) such that
B[i] = A[i,J[i]]
I'd also like to be able to generalize this to k-dimensional arrays, where A has shape (n0,n1,...,nk) and J has shape (n0,n1,...,n(k-1))
There are messy, flattening ways of doing this that make assumptions about index order:
import numpy as np
B = A.ravel()[ J+A.shape[-1]*np.arange(0,np.prod(J.shape)).reshape(J.shape) ]
The question is, is there a way to do this that doesn't rely on flattening arrays and dealing with indexes manually?
For the case of 2d A and 1d J, this indexing works:
A[np.arange(J.shape[0]), J]
Which can be applied to more dimensions by reshaping to 2d (and back):
A.reshape(-1, A.shape[-1])[np.arange(np.prod(A.shape[:-1])).reshape(J.shape), J]
For 3d A this works:
A[np.arange(J.shape[0])[:,None], np.arange(J.shape[1])[None,:], J]
where the 1st 2 arange indices broadcast to the same dimension as J.
With functions in lib.index_tricks, this can be expressed as:
A[np.ogrid[0:J.shape[0],0:J.shape[1]]+[J]]
A[np.ogrid[slice(J.shape[0]),slice(J.shape[1])]+[J]]
or for multiple dimensions:
A[np.ix_(*[np.arange(x) for x in J.shape])+(J,)]
A[np.ogrid[[slice(k) for k in J.shape]]+[J]]
For small A and J (eg 2*3*4), J.choose(np.rollaxis(A,-1)) is faster. All of the extra time is in preparing the index tuple. np.ix_ is faster than np.ogrid.
np.choose has a size limit. At its upper end it is slower than ix_:
In [610]: Abig=np.arange(31*31).reshape(31,31)
In [611]: Jbig=np.arange(31)
In [612]: Jbig.choose(np.rollaxis(Abig,-1))
Out[612]:
array([ 0, 32, 64, 96, 128, 160, ... 960])
In [613]: timeit Jbig.choose(np.rollaxis(Abig,-1))
10000 loops, best of 3: 73.1 µs per loop
In [614]: timeit Abig[np.ix_(*[np.arange(x) for x in Jbig.shape])+(Jbig,)]
10000 loops, best of 3: 22.7 µs per loop
In [635]: timeit Abig.ravel()[Jbig+Abig.shape[-1]*np.arange(0,np.prod(Jbig.shape)).reshape(Jbig.shape) ]
10000 loops, best of 3: 44.8 µs per loop
I did similar indexing tests at https://stackoverflow.com/a/28007256/901925, and found that flat indexing was faster for much larger arrays (e.g. n0=1000). That's where I learned about the 32-array limit for choose.
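On NumPy 1.15+, np.take_along_axis expresses this pattern directly, without building index grids by hand (a sketch assuming A and J shaped as in the question):

import numpy as np

A = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # (n0, n1, n2)
J = np.random.randint(0, 4, size=(2, 3))    # (n0, n1)

# indices must have the same ndim as A, so add a trailing length-1 axis
B = np.take_along_axis(A, J[..., None], axis=-1)[..., 0]
print(B.shape)  # (2, 3), with B[i, j] == A[i, j, J[i, j]]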
It doesn't solve your problem exactly, but choose() should nevertheless help:
>>> A = array(range(1, 28)).reshape(3, 3, 3)
>>> B = array([0, 0, 0, 1, 1, 1, 2, 2, 2]).reshape(3, 3)
>>> B.choose(A)
array([[ 1,  2,  3],
       [13, 14, 15],
       [25, 26, 27]])
It selects among the first dimension instead of the last.

Is it possible to use argsort in descending order?

Consider the following code:
avgDists = np.array([1, 8, 6, 9, 4])
ids = avgDists.argsort()[:n]
This gives me indices of the n smallest elements. Is it possible to use this same argsort in descending order to get the indices of n highest elements?
If you negate an array, the lowest elements become the highest elements and vice-versa. Therefore, the indices of the n highest elements are:
(-avgDists).argsort()[:n]
Another way to reason about this, as mentioned in the comments, is to observe that the big elements are coming last in the argsort. So, you can read from the tail of the argsort to find the n highest elements:
avgDists.argsort()[::-1][:n]
Both methods are O(n log n) in time complexity, because the argsort call is the dominant term here. But the second approach has a nice advantage: it replaces an O(n) negation of the array with an O(1) slice. If you're working with small arrays inside loops then you may get some performance gains from avoiding that negation, and if you're working with huge arrays then you can save on memory usage because the negation creates a copy of the entire array.
Note that these methods do not always give equivalent results: if a stable sort implementation is requested to argsort, e.g. by passing the keyword argument kind='mergesort', then the first strategy will preserve the sorting stability, but the second strategy will break stability (i.e. the positions of equal items will get reversed).
Example timings:
Using a small array of 100 floats and a length 30 tail, the view method was about 15% faster (a third variant, argsort()[-n:][::-1], which takes the last n entries and then reverses them, performs the same as the second):
>>> avgDists = np.random.rand(100)
>>> n = 30
>>> timeit (-avgDists).argsort()[:n]
1.93 µs ± 6.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit avgDists.argsort()[::-1][:n]
1.64 µs ± 3.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit avgDists.argsort()[-n:][::-1]
1.64 µs ± 3.66 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
For larger arrays, the argsort is dominant and there is no significant timing difference
>>> avgDists = np.random.rand(1000)
>>> n = 300
>>> timeit (-avgDists).argsort()[:n]
21.9 µs ± 51.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> timeit avgDists.argsort()[::-1][:n]
21.7 µs ± 33.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> timeit avgDists.argsort()[-n:][::-1]
21.9 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Note that whether you truncate before or after reversing makes no difference in efficiency: both of these operations merely stride a view of the array differently and do not actually copy data.
Just like with Python lists, [::-1] reverses the array returned by argsort(), and [:n] then takes the first n entries of the reversed order, i.e. the indices of the n largest elements:
>>> avgDists=np.array([1, 8, 6, 9, 4])
>>> n=3
>>> ids = avgDists.argsort()[::-1][:n]
>>> ids
array([3, 1, 2])
The advantage of this method is that ids is a view of the array returned by argsort:
>>> ids.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
(The 'OWNDATA' being False indicates this is a view, not a copy)
Another way to do this is something like:
(-avgDists).argsort()[:n]
The problem is that this works by creating the negative of each element in the array:
>>> (-avgDists)
array([-1, -8, -6, -9, -4])
And it creates a copy to do so:
>>> (-avgDists).flags['OWNDATA']
True
So if you time each, with this very small data set:
>>> import timeit
>>> timeit.timeit('(-avgDists).argsort()[:3]', setup="from __main__ import avgDists")
4.2879798610229045
>>> timeit.timeit('avgDists.argsort()[::-1][:3]', setup="from __main__ import avgDists")
2.8372560259886086
The view method is substantially faster (and uses 1/2 the memory...)
Instead of using np.argsort you could use np.argpartition, if you only need the indices of the lowest/highest n elements.
That doesn't require sorting the whole array, only partitioning it, but note that the order inside your partition is undefined, so while it gives the correct indices, they might not be correctly ordered:
>>> avgDists = [1, 8, 6, 9, 4]
>>> np.array(avgDists).argpartition(2)[:2] # indices of lowest 2 items
array([0, 4], dtype=int64)
>>> np.array(avgDists).argpartition(-2)[-2:] # indices of highest 2 items
array([1, 3], dtype=int64)
As @Kanmani hinted, an easier to interpret implementation may use numpy.flip, as in the following:
import numpy as np
avgDists = np.array([1, 8, 6, 9, 4])
ids = np.flip(np.argsort(avgDists))
print(ids)
Using the function form rather than the method form makes the order of operations easier to read.
You can use the flip commands numpy.flipud() or numpy.fliplr() to get the indexes in descending order after sorting with the argsort command. That's what I usually do.
You could create a copy of the array and then multiply each element by -1.
As an effect, the previously largest elements become the smallest.
The indices of the n smallest elements in the copy are the n greatest elements in the original.
With your example:
avgDists = np.array([1, 8, 6, 9, 4])
Obtain indexes of n maximal values:
ids = np.argpartition(avgDists, -n)[-n:]
Sort them in descending order:
ids = ids[np.argsort(avgDists[ids])[::-1]]
Obtain results (for n=4):
>>> avgDists[ids]
array([9, 8, 6, 4])
An elegant way could be as follows -
ids = np.flip(np.argsort(avgDists))
This will give you indices of elements sorted in descending order.
Now you can use regular slicing...
top_n = ids[:n]
consider order of equal elements
If you run a sorting routine and 2 elements are equal, the order is usually not changed. However, the flip/[::-1] approach changes the order of equal elements.
>>> arr = np.array([3, 5, 4, 7, 3])
>>>
>>> np.argsort(arr)[::-1]
array([3, 1, 2, 4, 0]) # equal elements reordered
>>> np.argsort(-arr)
array([3, 1, 2, 0, 4]) # equal elements not reordered (consistent with other sorts)
For compatibility reasons I would hence prefer the argsort-of-the-negative-array approach. This is especially relevant when arr represents some numeric key for more complex elements.
Example:
obj = ['street', 'house', 'bridge', 'station', 'rails']
arr = np.array([3, 5, 4, 7, 3]) # cost of obj in coins
Disclaimer: A more common approach is to solve the example above with sorted(list_of_tuples_obj_cost, key=lambda x: x[1])
Another way is to use just a '-' in the argument to argsort, as in df[np.argsort(-df[:, 0])], provided df is a NumPy array you want sorted by its first column (column index 0). Of course, the column has to be a numeric one.
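A small worked example of that idiom (a sketch with made-up data):

import numpy as np

df = np.array([[3, 10],
               [1, 20],
               [2, 30]])

# rows sorted in descending order of the first column
print(df[np.argsort(-df[:, 0])])
# [[ 3 10]
#  [ 2 30]
#  [ 1 20]]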
