This is essentially the 2D array equivalent of slicing a python list into smaller lists at indexes that store a particular value. I'm running a program that extracts a large amount of data out of a CSV file and copies it into a 2D NumPy array. The basic format of these arrays are something like this:
[[0 8 9 10]
[9 9 1 4]
[0 0 0 0]
[1 2 1 4]
[0 0 0 0]
[1 1 1 2]
[39 23 10 1]]
I want to separate my NumPy array along rows that contain all zero values to create a set of smaller 2D arrays. The successful result for the above starting array would be the arrays:
[[0 8 9 10]
[9 9 1 4]]
[[1 2 1 4]]
[[1 1 1 2]
[39 23 10 1]]
I've thought about simply iterating down the array and checking if the row has all zeros but the data I'm handling is substantially large. I have potentially millions of rows of data in the text file and I'm trying to find the most efficient approach as opposed to a loop that could waste computation time. What are your thoughts on what I should do? Is there a better way?
a is your array. You can use any to find all zero rows, remove them, and then use split to split by their indices:
#not_all_zero rows indices
idx = np.flatnonzero(a.any(1))
#all_zero rows indices
idx_zero = np.delete(np.arange(a.shape[0]),idx)
#select not_all_zero rows and split by all_zero row indices
output = np.split(a[idx],idx_zero-np.arange(idx_zero.size))
output:
[array([[ 0, 8, 9, 10],
[ 9, 9, 1, 4]]),
array([[1, 2, 1, 4]]),
array([[ 1, 1, 1, 2],
[39, 23, 10, 1]])]
You can use the np.all function to check for rows which are all zeros, and then index appropriately.
# assume `x` is your data
indices = np.all(x == 0, axis=1)
zeros = x[indices]
nonzeros = x[np.logical_not(indices)]
The all function accepts an axis argument (as do many NumPy functions), which indicates the axis along which to operate. 1 here means to do the reduction along rows, so you get back a boolean array of shape (x.shape[0],), which can be used to directly index x.
Note that this will be much faster than a for-loop over the rows, especially for large arrays.
Related
I have a 2D array
a = np.array([[0,1,2,3],[4,5,6,7]])
that is a 2x4 array. I need to shift the elements of each of the two arrays in axis 0 in but with different steps, say 1 for the first and 2 for the second, so that the output will be
np.array([[1,2,3,0],[6,7,4,5]])
With np.roll it doesn't seem possible to do it, at least looking at the documentation, I don't see any useful hint. There exists another function doing this?
This is an attempt at a generalized version of numpy.roll.
import numpy as np
a = np.array([[0,1,2,3],[4,5,6,7]])
def roll(a, shifts, axis):
assert a.shape[axis] == len(shifts)
return np.stack([
np.roll(np.take(a, i, axis), shifts[i]) for i in range(len(shifts))
], axis)
print(a)
print(roll(a, [-1, -2], 0))
print(roll(a, [1, 2, 1, 0], 1))
prints
[[0 1 2 3]
[4 5 6 7]]
[[1 2 3 0]
[6 7 4 5]]
[[4 1 6 3]
[0 5 2 7]]
Here, the parameter a is a numpy.array, shifts is an Iterable containing the shift amounts per element and axis is the axis along which to shift. Note that was only tested on two-dimensional arrays however.
I'm trying to index the last dimension of a 3D matrix with a matrix consisting of indices that I wish to keep.
I have a matrix of thrust values with shape:
(3, 3, 5)
I would like to filter the last index according to some criteria so that it is reduced from size 5 to size 1. I have already found the indices in the last dimension that fit my criteria:
[[0 0 1]
[0 0 1]
[1 4 4]]
What I want to achieve: for the first row and first column I want the 0th index of the last dimension. For the first row and third column I want the 1st index of the last dimension. In terms of indices to keep the final matrix will become a (3, 3) 2D matrix like this:
[[0,0,0], [0,1,0], [0,2,1];
[1,0,0], [1,1,0], [1,2,1];
[2,0,1], [2,1,4], [2,2,4]]
I'm pretty confident numpy can achieve this, but I'm unable to figure out exactly how. I'd rather not build a construction with nested for loops.
I have already tried:
minValidPower = totalPower[:, :, tuple(indexMatrix)]
But this results in a (3, 3, 3, 3) matrix, so I am not entirely sure how I'm supposed to approach this.
With a as input array and idx as the indexing one -
np.take_along_axis(a,idx[...,None],axis=-1)[...,0]
Alternatively, with open-grids -
I,J = np.ogrid[:idx.shape[0],:idx.shape[1]]
out = a[I,J,idx]
You can build corresponding index arrays for the first two dimensions. Those would basically be:
[0 1 2]
[0 1 2]
[0 1 2]
[0 0 0]
[1 1 1]
[2 2 2]
You can construct these using the meshgrid function. I stored them as m1, and m2 in the example:
vals = np.arange(3*3*5).reshape(3, 3, 5) # test sample
m1, m2 = np.meshgrid(range(3), range(3), indexing='ij')
m3 = np.array([[0, 0, 1], 0, 0, 1], [1, 4, 4]])
sel_vals = vals[m1, m2, m3]
The shape of the result matches the shape of the indexing arrays m1, m2 and m3.
I'm trying to index the last dimension of a 3D matrix with a matrix consisting of indices that I wish to keep.
I have a matrix of thrust values with shape:
(3, 3, 5)
I would like to filter the last index according to some criteria so that it is reduced from size 5 to size 1. I have already found the indices in the last dimension that fit my criteria:
[[0 0 1]
[0 0 1]
[1 4 4]]
What I want to achieve: for the first row and first column I want the 0th index of the last dimension. For the first row and third column I want the 1st index of the last dimension. In terms of indices to keep the final matrix will become a (3, 3) 2D matrix like this:
[[0,0,0], [0,1,0], [0,2,1];
[1,0,0], [1,1,0], [1,2,1];
[2,0,1], [2,1,4], [2,2,4]]
I'm pretty confident numpy can achieve this, but I'm unable to figure out exactly how. I'd rather not build a construction with nested for loops.
I have already tried:
minValidPower = totalPower[:, :, tuple(indexMatrix)]
But this results in a (3, 3, 3, 3) matrix, so I am not entirely sure how I'm supposed to approach this.
With a as input array and idx as the indexing one -
np.take_along_axis(a,idx[...,None],axis=-1)[...,0]
Alternatively, with open-grids -
I,J = np.ogrid[:idx.shape[0],:idx.shape[1]]
out = a[I,J,idx]
You can build corresponding index arrays for the first two dimensions. Those would basically be:
[0 1 2]
[0 1 2]
[0 1 2]
[0 0 0]
[1 1 1]
[2 2 2]
You can construct these using the meshgrid function. I stored them as m1, and m2 in the example:
vals = np.arange(3*3*5).reshape(3, 3, 5) # test sample
m1, m2 = np.meshgrid(range(3), range(3), indexing='ij')
m3 = np.array([[0, 0, 1], 0, 0, 1], [1, 4, 4]])
sel_vals = vals[m1, m2, m3]
The shape of the result matches the shape of the indexing arrays m1, m2 and m3.
I know that
a - a.min(axis=0)
will subtract the minimum of each column from every element in the column. I want to subtract the minimum in each row from every element in the row. I know that
a.min(axis=1)
specifies the minimum within a row, but how do I tell the subtraction to go by rows instead of columns? (How do I specify the axis of the subtraction?)
edit: For my question, a is a 2d array in NumPy.
Assuming a is a numpy array, you can use this:
new_a = a - np.min(a, axis=1)[:,None]
Try it out:
import numpy as np
a = np.arange(24).reshape((4,6))
print (a)
new_a = a - np.min(a, axis=1)[:,None]
print (new_a)
Result:
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]]
[[0 1 2 3 4 5]
[0 1 2 3 4 5]
[0 1 2 3 4 5]
[0 1 2 3 4 5]]
Note that np.min(a, axis=1) returns a 1d array of row-wise minimum values.
We than add an extra dimension to it using [:,None]. It then looks like this 2d array:
array([[ 0],
[ 6],
[12],
[18]])
When this 2d array participates in the subtraction, it gets broadcasted into a shape of (4,6), which looks like this:
array([[ 0, 0, 0, 0, 0, 0],
[ 6, 6, 6, 6, 6, 6],
[12, 12, 12, 12, 12, 12],
[18, 18, 18, 18, 18, 18]])
Now, element-wise subtraction happens between the two (4,6) arrays.
Specify keepdims=True to preserve a length-1 dimension in place of the dimension that min collapses, allowing broadcasting to work out naturally:
a - a.min(axis=1, keepdims=True)
This is especially convenient when axis is determined at runtime, but still probably clearer than manually reintroducing the squashed dimension even when the 1 value is fixed.
If you want to use only pandas you can just apply a lambda to every column using min(row)
new_df = pd.DataFrame()
for i, col in enumerate(df.columns):
new_df[col] = df.apply(lambda row: row[i] - min(row))
After I sort this numpy array and remove all duplicate (y) values and the corresponding (x) value for the duplicate (y) value, I use a for loop to draw rectangles at the remaining coordinates. yet I get the error : ValueError: too many values to unpack (expected 2), but its the same shape as the original just the duplicates have been removed.
from graphics import *
import numpy as np
def main():
win = GraphWin("A Window", 500, 500)
# starting array
startArray = np.array([[2, 1, 2, 3, 4, 7],
[5, 4, 8, 3, 7, 8]])
# the following reshapes the from all x's in one row and y's in second row
# to x,y rows pairing the x with corresponding y value.
# then it searches for duplicate (y) values and removes both the duplicate (y) and
# its corresponding (x) value by removing the row.
# then the unique [x,y]'s array is reshaped back to a [[x,....],[y,....]] array to be used to draw rectangles.
d = startArray.reshape((-1), order='F')
# reshape to [x,y] matching the proper x&y's together
e = d.reshape((-1, 2), order='C')
# searching for duplicate (y) values and removing that row so the corresponding (x) is removed too.
f = e[np.unique(e[:, 1], return_index=True)[1]]
# converting unique array back to original shape
almostdone = f.reshape((-1), order='C')
# final reshape to return to original starting shape but is only unique values
done = almostdone.reshape((2, -1), order='F')
# print all the shapes and elements
print("this is d reshape of original/start array:", d)
print("this is e reshape of d:\n", e)
print("this is f unique of e:\n", f)
print("this is almost done:\n", almostdone)
print("this is done:\n", done)
print("this is original array:\n",startArray)
# loop to draw a rectangle with each x,y value being pulled from the x and y rows
# says too many values to unpack?
for x,y in np.nditer(done,flags = ['external_loop'], order = 'F'):
print("this is x,y:", x,y)
print("this is y:", y)
rect = Rectangle(Point(x,y),Point(x+4,y+4))
rect.draw(win)
win.getMouse()
win.close()
main()
here is the output:
line 42, in main
for x,y in np.nditer(done,flags = ['external_loop'], order = 'F'):
ValueError: too many values to unpack (expected 2)
this is d reshape of original/start array: [2 5 1 4 2 8 3 3 4 7 7 8]
this is e reshape of d:
[[2 5]
[1 4]
[2 8]
[3 3]
[4 7]
[7 8]]
this is f unique of e:
[[3 3]
[1 4]
[2 5]
[4 7]
[2 8]]
this is almost done:
[3 3 1 4 2 5 4 7 2 8]
this is done:
[[3 1 2 4 2]
[3 4 5 7 8]]
this is original array:
[[2 1 2 3 4 7]
[5 4 8 3 7 8]]
why would the for loop work for the original array but not this sorted one?
or what loop could I use to just use (f) since it is sorted but shape(-1,2)?
I also tried a different loop:
for x,y in done[np.nditer(done,flags = ['external_loop'], order = 'F')]:
Which seems to fix the too many values error but I get:
IndexError: index 3 is out of bounds for axis 0 with size 2
and
FutureWarning: Using a non-tuple sequence for multidimensional indexing is
deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this
will be interpreted as an array index, `arr[np.array(seq)]`, which will
result either in an error or a different result.
for x,y in done[np.nditer(done,flags = ['external_loop'], order = 'F')]:
which I've looked up on stackexchange to fix but keep getting the error regardless of how I do the syntax.
any help would be great thanks!
I don't have the graphics package (it might be a windows specific thing?), but I do know that you're making this waaaaay too complicated. Here is a much simpler version that produces the same done array:
from graphics import *
import numpy as np
# starting array
startArray = np.array([[2, 1, 2, 3, 4, 7],
[5, 4, 8, 3, 7, 8]])
# searching for duplicate (y) values and removing that row so the corresponding (x) is removed too.
done = startArray.T[np.unique(startArray[1,:], return_index=True)[1]]
for x,y in done:
print("this is x,y:", x, y)
print("this is y:", y)
rect = Rectangle(Point(x,y),Point(x+4,y+4))
rect.draw(win)
To note, in the above version done.shape==(5, 2) instead of (2, 5), but you can always change that back after the for loop with done = done.T.
Here's some notes on your original code for future reference:
The order flag in reshape is completely superfluous to what your code is trying to do, and just makes it more confusing/potentially more buggy. You can do all of the reshapes you wanted to do without it.
The use case for nditer is to iterate over the individual elements of one (or more) array one at a time. It cannot in general be used to iterate over the rows or columns of a 2D array. If you try to use it this way, you're likely to get buggy results that are highly dependent on the layout of the array in memory (as you saw).
To iterate over rows or columns of a 2D array, just use simple iteration. If you just iterate over an array (eg for row in arr:), you get each row, one at a time. If you want the columns instead you can transpose the array first (like I did in the code above with .T).
Note about .T
.T takes the transpose of an array. For example, if you start with
arr = np.array([[0, 1, 2, 3],
[4, 5, 6, 7]])
then the transpose is:
arr.T==np.array([[0, 4],
[1, 5],
[2, 6],
[3, 7]])