I am taking the output from the AUTO numerical continuation package and need to filter out results that have negative values of the variables, as these are non-physical. So if I have, for example:
>>> a = np.array([[0,1,2,3,4],[-1,-0.5,0,0.5,1],[-3,-4,-5,0.1,0.2]])
I would like to be left with:
>>> b
array([[ 3. , 4. ],
       [ 0.5, 1. ],
       [ 0.1, 0.2]])
But when I try numpy.where I get:
>>> b = a[:,(np.where(a[1]>=0) and np.where(a[2]>=0))]
>>> b
array([[[ 3. , 4. ]],

       [[ 0.5, 1. ]],

       [[ 0.1, 0.2]]])
>>> b.shape
(3, 1, 2)
That is, it adds another unwanted axis to the array. What am I doing wrong?
Assuming all you want to do is to remove columns that have one or more negative values, you could do this:
a = np.array([[0,1,2,3,4],[-1,-0.5,0,0.5,1],[-3,-4,-5,0.1,0.2]])
b = a[:,a.min(axis=0)>=0]
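With the example a above, the per-column minima are [-3, -4, -5, 0.1, 0.2], so the mask keeps only the last two columns, matching the desired output:

>>> b
array([[ 3. , 4. ],
       [ 0.5, 1. ],
       [ 0.1, 0.2]])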
If what you want are fully non-negative columns, then @Yakym's answer is the way to go, as it is probably faster. However, if that was just an example and you only want to threshold on certain rows, you can do it by slightly modifying your example:
>>> a[:, (a[1] >= 0) & (a[2] >= 0)]
array([[ 3. , 4. ],
       [ 0.5, 1. ],
       [ 0.1, 0.2]])
Here, (a[1] >= 0) and (a[2] >= 0) create boolean masks that are merged by the & (elementwise logical AND) operator and used to index the array a.
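For reference, here is why the original attempt grew an extra axis (a sketch against the same example array): np.where(cond) returns a tuple of index arrays, and Python's and simply returns its second operand when the first is truthy, so a was being indexed with a tuple-wrapped index array rather than a boolean mask:

>>> np.where(a[1] >= 0)                           # a tuple of index arrays, not a mask
(array([2, 3, 4]),)
>>> np.where(a[1] >= 0) and np.where(a[2] >= 0)   # `and` just yields the second tuple
(array([3, 4]),)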
I have the following example array of x-y coordinate pairs:
A = np.array([[0.33703753, 3.],
              [0.90115394, 5.],
              [0.91172016, 5.],
              [0.93230994, 3.],
              [0.08084283, 3.],
              [0.71531777, 2.],
              [0.07880787, 3.],
              [0.03501083, 4.],
              [0.69253184, 4.],
              [0.62214452, 3.],
              [0.26953094, 1.],
              [0.4617873 , 3.],
              [0.6495549 , 0.],
              [0.84531478, 4.],
              [0.08493308, 5.]])
My goal is to reduce this to an array with six rows by taking the average of the x-values for each y-value, like so:
array([[0.6495549 , 0. ],
       [0.26953094, 1. ],
       [0.71531777, 2. ],
       [0.41882167, 3. ],
       [0.52428582, 4. ],
       [0.63260239, 5. ]])
Currently I am achieving this by converting to a pandas dataframe, performing the calculation, and converting back to a numpy array:
>>> df = pd.DataFrame({'x':A[:, 0], 'y':A[:, 1]})
>>> df.groupby('y').mean().reset_index()
     y         x
0  0.0  0.649555
1  1.0  0.269531
2  2.0  0.715318
3  3.0  0.418822
4  4.0  0.524286
5  5.0  0.632602
Is there a way to perform this calculation using numpy, without having to resort to the pandas library?
Here's a completely vectorized solution that only uses numpy methods and no python iteration:
sort_indices = np.argsort(A[:, 1])
unique_y, unique_indices, group_count = np.unique(A[sort_indices, 1], return_index=True, return_counts=True)
Once we have the indices and counts of all the unique elements, we can use the np.ufunc.reduceat method to collect the results of np.add for each group, and then divide by the counts to get the means. (Dividing the y column by the counts is harmless: y is constant within each group, so its group sum divided by the group count just reproduces y.)
group_sum = np.add.reduceat(A[sort_indices, :], unique_indices, axis=0)
group_mean = group_sum / group_count[:, None]
# array([[0.6495549 , 0. ],
#        [0.26953094, 1. ],
#        [0.71531777, 2. ],
#        [0.41882167, 3. ],
#        [0.52428582, 4. ],
#        [0.63260239, 5. ]])
Benchmarks:
Comparing this solution with the other answers here (code at tio.run) for two setups:
- A contains 10k rows, with A[:, 1] containing N groups, where N varies from 1 to 10k
- A contains N rows (N varies from 1 to 10k), with A[:, 1] containing min(N, 1000) groups
Observations:
The numpy-only solutions (Dani's and mine) win easily; they are significantly faster than the pandas approach, possibly because the time taken to create the dataframe is an overhead the numpy-only solutions don't pay.
The pandas solution is slower than the python+numpy solutions (Jaimu's and mine) for smaller arrays, since it's faster to just iterate in python and get it over with than to create a dataframe first, but those solutions become much slower than pandas as the array size or number of groups increases.
Note: The previous version of this answer iterated over the groups as returned by the accepted answer to "Is there any numpy group by function?" and calculated each mean individually:
First, sort the array on the column you want to group by:
A_s = A[A[:, 1].argsort(), :]
Then, run that snippet; np.split splits its first argument at the indices given by the second argument:
unique_elems, unique_indices = np.unique(A_s[:, 1], return_index=True)
# (array([0., 1., 2., 3., 4., 5.]), array([ 0, 1, 2, 3, 9, 12]))
split_indices = unique_indices[1:] # No need to split at the first index
groups = np.split(A_s, split_indices)
# [array([[0.6495549, 0. ]]),
#  array([[0.26953094, 1. ]]),
#  array([[0.71531777, 2. ]]),
#  array([[0.33703753, 3. ],
#         [0.93230994, 3. ],
#         [0.08084283, 3. ],
#         [0.07880787, 3. ],
#         [0.62214452, 3. ],
#         [0.4617873 , 3. ]]),
#  array([[0.03501083, 4. ],
#         [0.69253184, 4. ],
#         [0.84531478, 4. ]]),
#  array([[0.90115394, 5. ],
#         [0.91172016, 5. ],
#         [0.08493308, 5. ]])]
Now, groups is a list containing multiple np.arrays. Iterate over the list and take the mean of each array:
means = np.zeros((len(groups), groups[0].shape[1]))
for i, grp in enumerate(groups):
    means[i, :] = grp.mean(axis=0)
# array([[0.6495549 , 0. ],
#        [0.26953094, 1. ],
#        [0.71531777, 2. ],
#        [0.41882167, 3. ],
#        [0.52428582, 4. ],
#        [0.63260239, 5. ]])
Here is a workaround using numpy.
unique_ys, indices = np.unique(A[:, 1], return_inverse=True)
result = np.empty((unique_ys.shape[0], 2))
for i, y in enumerate(unique_ys):
    result[i, 0] = np.mean(A[indices == i, 0])
    result[i, 1] = y
print(result)
Alternative:
To make the code more pythonic, you can use a list comprehension to create the result array, instead of using a for loop.
unique_ys, indices = np.unique(A[:, 1], return_inverse=True)
result = np.array([[np.mean(A[indices == i, 0]), y] for i, y in enumerate(unique_ys)])
print(result)
Output:
[[0.6495549 0. ]
 [0.26953094 1. ]
 [0.71531777 2. ]
 [0.41882167 3. ]
 [0.52428582 4. ]
 [0.63260239 5. ]]
Use np.bincount + np.unique:
sums = np.bincount(A[:, 1].astype(np.int64), weights=A[:, 0])
values, counts = np.unique(A[:, 1], return_counts=True)
res = np.vstack((sums / counts, values)).T
print(res)
Output
[[0.6495549 0. ]
 [0.26953094 1. ]
 [0.71531777 2. ]
 [0.41882167 3. ]
 [0.52428582 4. ]
 [0.63260239 5. ]]
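One caveat, as a hedged side note: np.bincount only accepts non-negative integer labels, which the example's y values happen to be. If the y values were arbitrary floats, a sketch of a more general variant would first map them to group ids via return_inverse:

values, inverse, counts = np.unique(A[:, 1], return_inverse=True, return_counts=True)
sums = np.bincount(inverse, weights=A[:, 0])  # sum the x values within each group id
res = np.vstack((sums / counts, values)).T    # same [mean, y] layout as above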
If you know the y values beforehand, you could match the array against each of them.
For example, A[(A[:,1]==1),0] will give you all the x values where the y value is equal to 1.
So you could go through each value of y, sum A[:,1]==y to get the number of matches, sum the x values that match, divide to get the average, and place the result in a new array:
B = np.zeros([6, 2])
for i in range(6):
    nmatch = sum(A[:, 1] == i)
    nsum = sum(A[(A[:, 1] == i), 0])
    B[i, 0] = nsum / nmatch  # mean of the matching x values
    B[i, 1] = i              # the y value, matching the wanted layout
There must be a more pythonic way of doing this ....
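For instance, here is one vectorized candidate, sketched under the same assumption that the y values are exactly the integers 0 through 5 (the broadcasting approach is illustrative, not part of the original answer):

ys = np.arange(6)
match = A[:, 1] == ys[:, None]  # shape (6, n): row i flags the entries with y == i
B = np.column_stack([(match * A[:, 0]).sum(axis=1) / match.sum(axis=1), ys])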
Let's say I have 2 arrays:
a = np.array([2, 2, 0, 0, 2, 1, 0, 0, 0, 0, 3, 0, 1, 0, 0, 2])
b = np.array([0, 0.5, 0.25, 0.9])
What I would like to do is take each value in array a, look up the value in array b at that index, and multiply the two.
So the first value in array a is 2. The value in array b at index position 2 is 0.25, so I multiply that value (2) from array a by 0.25.
I know it can be done with iteration, but I'm trying to figure out how to do it with elementwise operations.
Here's the iteration way that I've done:
result = np.array([])
for idx in a:
    result = np.append(result, b[idx] * idx)
To get the result:
print(result)
[0.5 0.5 0. 0. 0.5 0.5 0. 0. 0. 0. 2.7 0. 0.5 0. 0. 0.5]
What's an elementwise equivalent?
Integer arrays can be used as indices in numpy. As a consequence, you can simply do something like this:
b[a] * a
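Applied to the arrays from the question, this reproduces the loop's output, since fancy indexing gives (b[a] * a)[i] == b[a[i]] * a[i]:

result = b[a] * a
print(result)
# [0.5 0.5 0.  0.  0.5 0.5 0.  0.  0.  0.  2.7 0.  0.5 0.  0.  0.5]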
EDIT:
Just for completeness, your iterative solution triggers a new memory allocation every time append is called (see the Returns section of the np.append documentation). Since you already know the shape of your output (i.e. a.shape), it's much better to allocate the output array in advance, e.g. result = np.empty(a.shape), and then fill it in the loop.
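A minimal sketch of that preallocation pattern, should you need an explicit loop at all (the single expression above remains the better option):

result = np.empty(a.shape)
for i, idx in enumerate(a):
    result[i] = b[idx] * idx  # write into the preallocated slot; no reallocation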
So there are a few ways to do this, but if you want purely element-wise operations you could do the following:
Before getting the result, each element of b is multiplied by its index; so create another vector n:
n = np.arange(len(b)) * b
# In the example, n now equals [0. , 0.5, 0.5, 2.7]
# then the result is just n indexed by a
result = n[a]
# result = [0.5, 0.5, 0. , 0. , 0.5, 0.5, 0. , 0. , 0. , 0. , 2.7, 0. , 0.5, 0. , 0. , 0.5]
I am trying to do some math with my matrix; I can write it down, but I am not sure how to code it. This involves getting a column of row marginal values, then making a new matrix that has all non-zero row values replaced with the marginals, and finally taking each column sum of the new values divided by the number of non-zero values in that column as the column marginals.
I can get to the row marginals, but I can't seem to think of a way to repopulate the matrix.
Example of what I want:
import numpy as np
matrix = np.matrix([[1,3,0],[0,1,2],[1,0,4]])
matrix([[1, 3, 0],
        [0, 1, 2],
        [1, 0, 4]])
marginals = ((matrix != 0).sum(1) / matrix.sum(1))
matrix([[0.5       ],
        [0.66666667],
        [0.4       ]])
What I want done next is a filling of the matrix based on the non-zero locations of the original:
matrix([[0.5, 0.5, 0],
        [0, 0.667, 0.667],
        [0.4, 0, 0.4]])
The final wanted result is each column sum of the new matrix divided by the number of non-zero occurrences in that column:
matrix([[(0.5+0.4)/2, (0.5+0.667)/2, (0.667+0.4)/2]])
To get the final matrix we can use matrix-multiplication for efficiency (with np.matrix, * is the matrix product, so mask.T * marginals sums each column's marginals over its non-zero rows) -
In [84]: mask = matrix!=0
In [100]: (mask.T*marginals).T/mask.sum(0)
Out[100]: matrix([[0.45 , 0.58333334, 0.53333334]])
Or simpler -
In [110]: (marginals.T*mask)/mask.sum(0)
Out[110]: matrix([[0.45 , 0.58333334, 0.53333334]])
If you need that intermediate filled output too, use np.multiply for broadcasted elementwise multiplication -
In [88]: np.multiply(mask,marginals)
Out[88]:
matrix([[0.5       , 0.5       , 0.        ],
        [0.        , 0.66666667, 0.66666667],
        [0.4       , 0.        , 0.4       ]])
I need to change all nans of a matrix to a different value. I can easily get the nan positions using argwhere, but then I am not sure how to access those positions programmatically. Here is my non-working code:
myMatrix = np.array([[3.2,2,float('NaN'),3],[3,1,2,float('NaN')],[3,3,3,3]])
nanPositions = np.argwhere(np.isnan(myMatrix))
maxVal = np.nanmax(abs(myMatrix))
for pos in nanPositions:
    myMatrix[pos] = maxVal
The problem is that myMatrix[pos] does not accept pos as an array.
The more-efficient way of generating your output has already been covered by sacul. However, you're incorrectly indexing your 2D matrix in the case where you want to use an array.
At least to me, it's a bit unintuitive, but you need to use:
myMatrix[[all_row_indices], [all_column_indices]]
The following will give you what you expect:
import numpy as np
myMatrix = np.array([[3.2,2,float('NaN'),3],[3,1,2,float('NaN')],[3,3,3,3]])
nanPositions = np.argwhere(np.isnan(myMatrix))
maxVal = np.nanmax(abs(myMatrix))
print(myMatrix[nanPositions[:, 0], nanPositions[:, 1]])
You can see more about advanced indexing in the documentation.
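To then actually replace the NaNs, which was the original goal, the same pair of index arrays works on the left-hand side of an assignment:

myMatrix[nanPositions[:, 0], nanPositions[:, 1]] = maxVal
# myMatrix is now:
# [[3.2 2.  3.2 3. ]
#  [3.  1.  2.  3.2]
#  [3.  3.  3.  3. ]]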
In [54]: arr = np.array([[3.2,2,float('NaN'),3],[3,1,2,float('NaN')],[3,3,3,3]])
...:
In [55]: arr
Out[55]:
array([[3.2, 2. , nan, 3. ],
       [3. , 1. , 2. , nan],
       [3. , 3. , 3. , 3. ]])
Location of the nan:
In [56]: np.where(np.isnan(arr))
Out[56]: (array([0, 1]), array([2, 3]))
In [57]: np.argwhere(np.isnan(arr))
Out[57]:
array([[0, 2],
[1, 3]])
where produces a tuple of arrays; argwhere produces the same values, but as a 2D array.
In [58]: arr[Out[56]]
Out[58]: array([nan, nan])
In [59]: arr[Out[56]] = [100,200]
In [60]: arr
Out[60]:
array([[ 3.2, 2. , 100. , 3. ],
       [ 3. , 1. , 2. , 200. ],
       [ 3. , 3. , 3. , 3. ]])
The argwhere result can be used to index individual items:
In [72]: for ij in Out[57]:
    ...:     print(arr[tuple(ij)])
100.0
200.0
The tuple() is needed here because an index like np.array([1, 3]) is interpreted as selecting two elements along the first dimension.
Another way to get that indexing tuple is to use unpacking:
In [74]: [arr[i,j] for i,j in Out[57]]
Out[74]: [100.0, 200.0]
So while argwhere looks useful, it is trickier to use than plain where.
You could, as noted in the other answers, use boolean indexing (I've already modified arr so the isnan test no longer works):
In [75]: arr[arr>10]
Out[75]: array([100., 200.])
More on indexing with a list or array, and indexing with a tuple:
In [77]: arr[[0,0]]  # two copies of row 0
Out[77]:
array([[ 3.2, 2. , 100. , 3. ],
       [ 3.2, 2. , 100. , 3. ]])
In [78]: arr[(0,0)]  # one element
Out[78]: 3.2
In [79]: arr[np.array([0,0])]  # same as list
Out[79]:
array([[ 3.2, 2. , 100. , 3. ],
       [ 3.2, 2. , 100. , 3. ]])
In [80]: arr[np.array([0,0]),:]  # making the trailing : explicit
Out[80]:
array([[ 3.2, 2. , 100. , 3. ],
       [ 3.2, 2. , 100. , 3. ]])
You can do this instead (IIUC):
myMatrix[np.isnan(myMatrix)] = np.nanmax(abs(myMatrix))
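As a quick check of this one-liner on the matrix from the question (both NaNs become 3.2, the largest absolute value present):

import numpy as np

myMatrix = np.array([[3.2, 2, np.nan, 3], [3, 1, 2, np.nan], [3, 3, 3, 3]])
myMatrix[np.isnan(myMatrix)] = np.nanmax(abs(myMatrix))
print(myMatrix)
# [[3.2 2.  3.2 3. ]
#  [3.  1.  2.  3.2]
#  [3.  3.  3.  3. ]]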
I'm setting a numpy array with a power-law equation. The problem is that part of my domain tries to do numpy.power(x, n) when x is negative and n is not an integer. In this part of the domain I want the value to be 0.0. Below is code that has the correct behavior, but is there a more Pythonic way to do this?
# note mesh.x is a numpy array of length nx
import numpy as npy

myValues = npy.zeros(nx)
para = [5.8780046, 0.714285714, 2.819250868]
for j in range(nx):
    if mesh.x[j] > para[1]:
        myValues[j] = para[0]*npy.power(mesh.x[j]-para[1], para[2])
    else:
        myValues[j] = 0.0
Is "numpythonic" a word? It should be a word. The following is really neither pythonic nor unpythonic, but it is much more efficient than using a for loop, and close(r) to the way Travis would probably do it:
import numpy
mesh_x = numpy.array([0.5,1.0,1.5])
myValues = numpy.zeros_like( mesh_x )
para = [5.8780046, 0.714285714, 2.819250868]
mask = mesh_x > para[1]
myValues[mask] = para[0] * numpy.power(mesh_x[mask] - para[1], para[2])
print(myValues)
For very large problems you would probably want to avoid creating temporary arrays:
mask = mesh.x > para[1]
myValues[mask] = mesh.x[mask]
myValues[mask] -= para[1]
myValues[mask] **= para[2]
myValues[mask] *= para[0]
Here's one approach with np.where to choose values between the power calculations and 0 -
import numpy as np
np.where(mesh.x>para[1],para[0]*np.power(mesh.x-para[1],para[2]),0)
Explanation:
np.where(mask, A, B) chooses elements from A or B depending on the mask elements. In our case, the mask is mesh.x > para[1], a vectorized comparison of all mesh.x elements in one go.
para[0]*np.power(mesh.x-para[1], para[2]) gives us the elements to be chosen wherever a mask element is True. Otherwise, we choose 0, which is the third argument to np.where.
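One caveat worth noting: np.where evaluates both candidate arrays in full before selecting, so np.power still runs on the negative bases and emits an invalid-value RuntimeWarning (the final values are nevertheless correct, because those positions are replaced by 0). A sketch of silencing it locally, using a stand-in mesh_x since mesh is defined only in the question's code:

import numpy as np

mesh_x = np.array([0.5, 1.0, 1.5])  # stand-in for mesh.x
para = [5.8780046, 0.714285714, 2.819250868]

with np.errstate(invalid='ignore'):  # suppress the warning from negative bases
    myValues = np.where(mesh_x > para[1],
                        para[0]*np.power(mesh_x - para[1], para[2]),
                        0)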
More of an explanation of the answers given by @jez and @Divakar, with simple examples, than an answer itself. They both rely on some form of boolean indexing.
>>> a
array([[-4.5, -3.5, -2.5],
       [-1.5, -0.5,  0.5],
       [ 1.5,  2.5,  3.5]])
>>> n = 2.2
>>> a ** n
array([[        nan,         nan,         nan],
       [        nan,         nan,  0.21763764],
       [ 2.44006149,  7.50702771, 15.73800567]])
np.where is made for this: it selects one of two values based on a boolean array.
>>> np.where(np.isnan(a**n), 0, a**n)
array([[ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.21763764],
       [ 2.44006149,  7.50702771, 15.73800567]])
>>> b = np.where(a < 0, 0, a)
>>> b
array([[ 0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0.5],
       [ 1.5,  2.5,  3.5]])
>>> b ** n
array([[ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.21763764],
       [ 2.44006149,  7.50702771, 15.73800567]])
Boolean indexing can also be used on both the left-hand side and the right-hand side of an assignment. This is similar to np.where:
>>> a[a >= 0] = a[a >= 0] ** n
>>> a
array([[ -4.5       ,  -3.5       ,  -2.5       ],
       [ -1.5       ,  -0.5       ,   0.21763764],
       [  2.44006149,   7.50702771,  15.73800567]])
>>> a[a < 0] = 0
>>> a
array([[ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.21763764],
       [ 2.44006149,  7.50702771, 15.73800567]])