I have a numpy array with four columns and many rows:
>>> dat
array([['4/5/2004', '17', 0.0, 0.0],
['4/5/2004', '7', 0.0, 0.0],
['4/5/2004', '19:48:20', 58.432488, -135.9202205],
['4/5/2004', '19:48:32', 58.432524300000004, 0.0],
['4/5/2004', '19:48:36', 58.4325365, -150.9202813]], dtype=object)
I would like to remove all rows where the value in column 3 or column 4 equals 0, so the result would be:
([['4/5/2004', '19:48:20', 58.432488, -135.9202205],
['4/5/2004', '19:48:36', 58.4325365, -150.9202813]])
I can do this one column at a time with:
a = dat[~(dat[:,2]==0), :]
which returns the rows where the value in column 3 does not equal 0. I could do this iteratively for multiple columns, but it would be convenient to do it all in one command.
I thought something like the following two examples would work (but they do not):
a = dat[~(dat[:,2]==0), :] or dat[~(dat[:,3]==0), :]
a = dat[~(dat[:,2&3]==0), :]
Hopefully there's some simple syntax I'm missing and can't find in the numpy help.
Assuming the data array is 2D, we could slice and look for the valid ones -
dat[~(dat[:,2:4]==0).any(1)]
Alternatively, we can use np.all on the !=0 ones -
dat[(dat[:,2:4]!=0).all(1)]
When the columns of interest are not contiguous, we need to select them by their column IDs and use the same technique. So, if the column IDs to be examined are stored in an array or list named colID, the two approaches become -
dat[~(dat[:,colID]==0).any(1)]
dat[(dat[:,colID]!=0).all(1)]
Thus, for the stated case of columns 3 and 4, we would have colID = [2, 3].
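For completeness, here is a quick check of both forms against the dat array from the question (a sketch; it assumes numpy is imported as np and dat is rebuilt exactly as shown above):
import numpy as np

# the array from the question
dat = np.array([['4/5/2004', '17', 0.0, 0.0],
                ['4/5/2004', '7', 0.0, 0.0],
                ['4/5/2004', '19:48:20', 58.432488, -135.9202205],
                ['4/5/2004', '19:48:32', 58.432524300000004, 0.0],
                ['4/5/2004', '19:48:36', 58.4325365, -150.9202813]], dtype=object)

colID = [2, 3]                            # columns 3 and 4, as 0-based indices
out1 = dat[~(dat[:, colID] == 0).any(1)]  # drop rows where any selected column is 0
out2 = dat[(dat[:, colID] != 0).all(1)]   # equivalent: keep rows where all are non-zero
print(np.array_equal(out1, out2))         # True; both keep only the two valid rows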
What about using &:
>>> dat[(dat[:,2] != 0) & (dat[:,3] != 0), :]
array([['4/5/2004', '19:48:20', 58.432488, -135.9202205],
['4/5/2004', '19:48:36', 58.4325365, -150.9202813]], dtype=object)
which yields the element-wise "and".
I've changed it to use != 0 together with &, which avoids the additional inversions with ~.
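If you would rather stay close to the original attempt with or, the same filter can be written with the element-wise | plus a single inversion (De Morgan's law); this is equivalent to the & version above:
a = dat[~((dat[:, 2] == 0) | (dat[:, 3] == 0)), :]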
You got the idea of using or conceptually right. The main difference is that you need the element-wise logical or (|) or logical and (&), just as you are already using the logical not (~).
This works because an operation like dat[:,3] == 0 creates an array of booleans the same size as a column of dat. When this boolean array is used as an index, numpy interprets it as a mask. Splitting off the mask array highlights this concept:
mask = (dat[:, 2] != 0) & (dat[:, 3] != 0)
dat = dat[mask, :]
Another way to compute the mask would be as follows:
mask = np.logical_and.reduce(dat[:, 2:] != 0, axis=1)
np.logical_and.reduce collapses the input array along the columns (axis=1) by applying np.logical_and (which, for boolean arrays, behaves like the & operator) across each row, so you get True wherever all the elements of the selected portion of a row are True.
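A tiny illustration of what the reduction does, on a hand-made boolean array rather than the question's data:
import numpy as np

flags = np.array([[True, True], [True, False]])
print(np.logical_and.reduce(flags, axis=1))  # [ True False]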
Related
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
Can someone explain the second line of code with reference to specific documentation? I know it's slicing, but I couldn't find any reference for the notation ":-1" anywhere. Please point me to the specific portion of the documentation.
Thank you
It is slicing, most probably on a numpy array, and in this case it is being done on data of shape (610, 14).
Per the docs:
Indexing on ndarrays
ndarrays can be indexed using the standard Python x[obj] syntax, where x is the array and obj the selection. There are different kinds of indexing available depending on obj: basic indexing, advanced indexing and field access.
1D array
Slicing a 1-dimensional array is much like slicing a list
import numpy as np
np.random.seed(0)
array_1d = np.random.random((5,))
print(len(array_1d.shape))
1
NOTE: The len of the array shape tells you the number of dimensions.
We can use standard python list slicing on the 1D array.
# get the last element
print(array_1d[-1])
0.4236547993389047
# get everything up to but excluding the last element
print(array_1d[:-1])
[0.5488135 0.71518937 0.60276338 0.54488318]
2D array
array_2d = np.random.random((5, 1))
print(len(array_2d.shape))
2
Think of a 2-dimensional array like a data frame. It has rows (the 0th axis) and columns (the 1st axis). numpy grants us the ability to slice these axes independently by separating them with a comma (,).
# the 0th row and all columns
print(array_2d[0, :])
[0.79172504]
# the 1st row and everything after + all columns
print(array_2d[1:, :])
[[0.52889492]
[0.56804456]
[0.92559664]
[0.07103606]]
# the 1st through second to last row + the last column
print(array_2d[1:-1, -1])
[0.52889492 0.56804456 0.92559664]
Your Example
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
Note that data must have at least 2 dimensions, i.e. len(data.shape) >= 2 (otherwise you'd get an IndexError).
This means data[:, :-1] is keeping all "rows" and slicing up to, but not including, the last "column". Likewise, data[:, -1] is keeping all "rows" and selecting only the last "column".
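A minimal sketch of the resulting shapes, assuming data has the (610, 14) shape mentioned above (the zeros are only placeholders):
import numpy as np

data = np.zeros((610, 14))       # stand-in for the real data
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)          # (610, 13) (610,)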
It's important to know that when you slice an ndarray using a colon (:), you will get an array with the same dimensions.
print(len(array_2d[1:, :-1].shape)) # 2
But if you "select" a specific index (i.e. don't use a colon), you may reduce the dimensions.
print(len(array_2d[1, :-1].shape)) # 1, because I selected a single index value on the 0th axis
print(len(array_2d[1, -1].shape)) # 0, because I selected a single index value on both the 0th and 1st axes
You can, however, select a list of indices on either axis (assuming they exist).
print(len(array_2d[[1], [-1]].shape)) # 1
print(len(array_2d[[1, 3], :].shape)) # 2
This slicing notation is explained here https://docs.python.org/3/tutorial/introduction.html#strings
-1 means the last element, -2 the second from last, etc. For example, if there are 8 elements in a list, -1 is equivalent to index 7 (not 8, because indexing starts from 0).
Keep in mind that "normal" Python slicing of nested lists is chained, e.g. [1:3][5:7], while numpy arrays also support a comma syntax ([8:10, 12:14]) that lets you slice multidimensional arrays. However, -1 always means the same thing. Here is the numpy documentation for slicing: https://numpy.org/doc/stable/user/basics.indexing.html
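As a quick illustration of negative indexing on a plain Python list:
lst = list(range(8))   # [0, 1, 2, 3, 4, 5, 6, 7]
print(lst[-1])         # 7, the last element
print(lst[:-1])        # [0, 1, 2, 3, 4, 5, 6], everything except the last element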
I have a Dataframe with several lines and columns and I have transformed it into a numpy array to speed-up the calculations.
The first five columns of the Dataframe looked like this:
par1 par2 par3 par4 par5
1.502366 2.425301 0.990374 1.404174 1.929536
1.330468 1.460574 0.917349 1.172675 0.766603
1.212440 1.457865 0.947623 1.235930 0.890041
1.222362 1.348485 0.963692 1.241781 0.892205
...
These columns are now stored in a numpy array a = df.values
I need to check whether at least two of the five columns satisfy a condition (i.e., their value is larger than a certain threshold). Initially I wrote a function that performed the operation directly on the dataframe. However, because I have a very large amount of data and need to repeat the calculations over and over, I switched to numpy to take advantage of the vectorization.
To check the condition I was thinking to use
df['Result'] = np.where(condition_on_parameters > 2, True, False)
However, I cannot figure out how to write condition_on_parameters such that it returns True or False when at least 2 out of the 5 parameters are larger than the threshold. I thought of using the sum() function on condition_on_parameters, but I am not sure how to write such a condition.
EDIT
It is important to specify that the thresholds are different for each parameter. For example thr1=1.2, thr2=2.0, thr3=1.5, thr4=2.2, thr5=3.0. So I need to check that par1 > thr1, par2 > thr2, ..., par5 > thr5.
Assuming condition_on_parameters returns an array the same size as a with entries of True or False, you can use np.sum(condition_on_parameters, axis=1) to sum over the true values (True has a numerical value of 1) of each row. This gives a 1D array whose entries are the number of columns in that row that meet the condition. This array can then be compared against 2 (or passed to np.where) to flag the rows you are looking for.
df['result'] = np.where(np.sum(condition_on_parameters, axis=1) >= 2, True, False)
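Since the edit says each parameter has its own threshold, one way to build condition_on_parameters is to compare the five columns against a threshold vector via broadcasting. A sketch using the thr1..thr5 values named in the question (column names assumed to be par1..par5 as shown above):
import numpy as np

a = df[['par1', 'par2', 'par3', 'par4', 'par5']].values
thresholds = np.array([1.2, 2.0, 1.5, 2.2, 3.0])             # thr1..thr5 from the question
condition_on_parameters = a > thresholds                     # broadcasts per column, shape (n_rows, 5)
df['result'] = np.sum(condition_on_parameters, axis=1) >= 2  # True where at least 2 of 5 exceed their threshold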
Can you exploit pandas functionalities? For example, you can efficiently check conditions on multiple rows/columns with .apply and then .sum(axis=1).
Here some sample code:
import pandas as pd
df = pd.DataFrame([[1.50, 2.42, 0.88], [0.98,1.3, 0.56]], columns=['par1', 'par2', 'par3'])
# custom_condition, e.g. value less or equal than threshold
def leq(x, t):
    return x <= t
condition = df.apply(lambda x: leq(x, 1)).sum(axis=1)
# filter
df.loc[condition >=2]
I think this should be roughly equivalent to numpy in terms of efficiency, as pandas is ultimately built on top of it, however I'm not entirely sure...
It seems you are looking for numpy.any
a = np.array(
    [[1.502366, 2.425301, 0.990374, 1.404174, 1.929536],
     [1.330468, 1.460574, 0.917349, 1.172675, 0.766603],
     [1.212440, 1.457865, 0.947623, 1.235930, 0.890041],
     [1.222362, 1.348485, 0.963692, 1.241781, 0.892205]])
df = pd.DataFrame(a, columns=[f'par{i}' for i in range(1, 6)])
df['Result'] = np.any(df > 1.46, axis=1) # append the result column
Gives the dataframe with the appended Result column (True for the first two rows, False for the last two).
Now I have one 2D Numpy array of float values, i.e. a, and its shape is (10^6, 3).
I want to know which rows have all their values greater than np.array([25.0, 25.0, 25.0]), and then output the rows that satisfy this condition.
My code appears as follows.
# Create an empty array
a_cut = np.empty(shape=(0, 3), dtype=float)
minimum = np.array([25.0, 25.0, 25.0])
for i in range(len(a)):
    if a[i,:].all() > minimum.all():
        a_cut = np.append(a_cut, a[i,:], axis=0)
However, the code is inefficient. After a few hours, the result has not come out.
So is there a way to improve the speed of this loop?
np.append re-allocates the entire array every time you call it. It is basically the same as np.concatenate: use it very sparingly. The goal is to perform the entire operation in bulk.
You can construct a mask:
mask = (a > minimum).all(axis=1)
Then select:
a_cut = a[mask, :]
You may get a slight improvement from using indices instead of a boolean mask:
a_cut = a[np.flatnonzero(mask), :]
Indexing with fewer indices than there are dimensions applies the indices to the leading dimensions, so you can do
a_cut = a[mask]
The one liner is therefore:
a_cut = a[(a > minimum).all(1)]
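As a quick sanity check of the bulk approach on a tiny stand-in array (not the real 10^6-row data):
import numpy as np

a = np.array([[30.0, 40.0, 50.0],
              [10.0, 40.0, 50.0],
              [26.0, 27.0, 28.0]])
minimum = np.array([25.0, 25.0, 25.0])
print(a[(a > minimum).all(1)])
# [[30. 40. 50.]
#  [26. 27. 28.]]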
I have two 2d numpy arrays of size (12550,200) and (12550,10). I need to find the set of column indexes of the first array that are matching the 2nd array columns.
Eg:
ar1 = [[1,2,3,4],[4,5,6,7],[1,3,4,5],[6,7,8,5]]
ar2 = [[1,3],[4,6],[1,4],[6,8]]
so the matching columns are 1,4,1,6 and 3,6,4,8
I need the index of these columns in ar1 as output i.e., [0,2]
Can anyone help me with Python code that is fast enough, as the original array dimensions are big?
Check this out:
ar1 = np.array([[1,2,3,4],[4,5,6,7],[1,3,4,5],[6,7,8,5]])
ar2 = np.array([[1,3],[4,6],[1,4],[6,8]])
np.where((ar1[:,None].T == ar2.T).all(axis=2))[0]
gives
array([0, 2], dtype=int64)
meaning column 0 of ar2 is found at column 0 of ar1, and column 1 of ar2 is found at column 2 of ar1.
The transpose is used because you care about columns rather than rows. The [:,None] is used for broadcasting (i.e. to test every column of ar1 against every column of ar2). The all() checks that entire columns match. And finally, the [0] element of the np.where result gives you the ar1 column indices where this happens.
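To make the broadcasting easier to follow, here are the intermediate shapes on the small example (a sketch; the same pattern scales up to the (12550, 200) and (12550, 10) arrays):
import numpy as np

ar1 = np.array([[1,2,3,4],[4,5,6,7],[1,3,4,5],[6,7,8,5]])
ar2 = np.array([[1,3],[4,6],[1,4],[6,8]])

lhs = ar1[:, None].T                 # shape (4, 1, 4): lhs[i, 0, k] == ar1[k, i]
rhs = ar2.T                          # shape (2, 4):    rhs[j, k]    == ar2[k, j]
eq = (lhs == rhs)                    # broadcasts to (4, 2, 4)
print(eq.all(axis=2))                # (4, 2) table: does column i of ar1 equal column j of ar2?
print(np.where(eq.all(axis=2))[0])   # [0 2]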
How do I null certain values in a numpy array based on a condition?
I don't understand why I end up with 0 instead of null or empty values where the condition is not met... b is a numpy array populated with 0 and 1 values, c is another fully populated numpy array. All arrays are 71x71x166
a = np.empty((71, 71, 166))
d = np.empty((71, 71, 166))
for indexes, value in np.ndenumerate(b):
    i, j, k = indexes
    a[i, j, k] = np.where(b[i, j, k] == 1, c[i, j, k], d[i, j, k])
I want to end up with an array which only has values where the condition is met and is empty everywhere else, but without changing its shape.
FULL ISSUE FOR CLARIFICATION as asked for:
I start with a float populated array with shape (71,71,166)
I make an int array based on a cutoff applied to the float array basically creating a number of bins, roughly marking out 10 areas within the array with 0 values in between
What I want to end up with is an array with shape (71,71,166) which has the average values in a particular array direction (assuming vertical direction, if you think of a 3D array as a 3D cube) of a certain "bin"...
so I was trying to loop through the "bins" b == 1, b == 2 etc, sampling the float where that condition is met but being null elsewhere so I can take the average, and then recombine into one array at the end of the loop....
Not sure if I'm making myself understood. I'm using np.where with explicit indexing because I keep getting errors when I try to do it without, although it feels very inefficient.
Consider this example:
import numpy as np
data = np.random.random((4,3))
mask = np.random.randint(0, 2, (4, 3))
data[mask == 0] = np.nan
The data will be set to nan wherever the mask is 0. You can use any kind of condition you want, of course, or do something different for different values in b.
To erase everything except a specific bin, try the following:
c[b != 1] = np.nan
So, to make a copy of everything in a specific bin:
a = np.copy(c)
a[b != 1] = np.nan
To get the average of everything in a bin:
np.mean(c[b==1])
So perhaps this might do what you want (where bins is a list of bin values):
a = np.empty(c.shape)
a[b == 0] = np.nan
for bin in bins:
    a[b == bin] = np.mean(c[b == bin])
np.empty sometimes fills the array with 0's; it's undefined what the contents of an empty() array are, so 0 is perfectly valid. For example, try this instead:
d = np.nan * np.empty((71, 71, 166))
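As an aside (not from the original answer), np.full gives an all-NaN array of the right shape directly:
d = np.full((71, 71, 166), np.nan)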
But consider using numpy's strength, and don't iterate over the array:
a = np.where(b, c, d)
(since b is 0 or 1, I've excluded the explicit comparison b == 1.)
You may even want to consider using a masked array instead:
a = np.ma.masked_where(b == 0, c)
which seems to make more sense with respect to your question: "how do I null certain values in a numpy array based on a condition" (replace null with mask and you're done).
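For the bin-averaging described in the question, a masked array also gives the per-bin mean directly. A small sketch under the same assumptions (b holds the bin labels, c the float data; toy values here):
import numpy as np

b = np.array([[1, 2], [0, 1]])          # toy bin labels
c = np.array([[1.0, 2.0], [3.0, 4.0]])  # toy float data

masked = np.ma.masked_where(b != 1, c)  # keep only bin 1, mask everything else
print(masked.mean())                    # 2.5, the mean of the values where b == 1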