flexible query based on two numpy arrays

I would like to create a somewhat dynamic query based on a numpy array. Ideally, the indices of one array (uu) should be returned, given a condition for each row of a second array (cond).
The sample below demonstrates what I have in mind, and it works using a loop. I am wondering if there is a more efficient method. Thanks for your help.
import numpy as np
# create an array that has four rows, each containing a copy of the same vector
n = 101
u = np.linspace(0, 100, n)
uu = np.ones((4, n)) * u
# ideally I would like the indices of uu
# (so I could access another array of the same shape as uu)
# that meet the following conditions:
# in the first row (index 0) all values <= 10. should be selected
# in the second row (index 1) all values <= 20. should be selected
cond = np.array([10, 20])
# this gives the correct indices, but as a series of 1D solutions
# this would work as a work-around
for i in range(cond.size):
    ix = np.where(uu[i, :] <= cond[i])
    print(ix)
# this seems like a work-around using True/False,
# but here I am not sure how to best convert it to indices
for i in range(cond.size):
    uu[i, :] = uu[i, :] <= cond[i]
print(uu)

NumPy allows you to compare arrays directly:
import numpy as np
# just making my own uu with random numbers
n = 101
uu = np.random.rand(n,4)
# then if you have an array or a list of values...
cond = [.2,.5,.7,.8]
# ... you can directly compare the array to it
comparison = uu <= cond
# comparison now has True/False values, and the shape of uu
# you can directly use comparison to get the desired values in uu
uu[comparison] # gives the values of uu where comparison is True
# but if you really need the indices you can do
np.where(comparison)
# returns 2 arrays containing the i-indices and j-indices where comparison is True
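Note that this example uses the transposed layout (n rows, 4 columns). For the original (4, n) layout with one threshold per row, broadcasting works just as well if cond is reshaped into a column vector. A minimal sketch (cond is padded to four values here purely for illustration):
import numpy as np
n = 101
uu = np.ones((4, n)) * np.linspace(0, 100, n)
cond = np.array([10, 20, 30, 40])  # one threshold per row
comparison = uu <= cond[:, None]   # shape (4, n): True where row i <= cond[i]
rows, cols = np.where(comparison)  # i-indices and j-indices of matching entries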

Related

How best to randomly select a number of non-zero elements from an array with many duplicate integers

I need to randomly select x non-zero integers from an unsorted 1D numpy array of y integer elements, which contains an unknown number of zeros as well as duplicate integers. The output should include duplicate integers if the random selection requires it. What is the best way to achieve this?
One option is to select the non-zero elements first, then use the generator's choice() method (with the replace parameter set to either True or False) to draw a given number of elements.
Something like this:
import numpy as np
rng = np.random.default_rng() # doing this is recommended by numpy
n = 4 # number of non-zero samples
arr = np.array([1,2,0,0,4,2,3,0,0,4,2,1])
non_zero_arr = arr[arr!=0]
rng.choice(non_zero_arr, n, replace=True)
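If the x draws must come from distinct positions in the array, the same idea works with replace=False; a sketch, noting that n must then not exceed the number of non-zero elements:
if n <= non_zero_arr.size:
    # each draw consumes one position, so duplicates appear only if the
    # same value occurs at several positions in arr
    sample = rng.choice(non_zero_arr, n, replace=False)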

Removing non-triple rows en masse in numpy array

I have an array A that has 1 million rows and 3 columns. The last column contains integers that identify the data in the other two columns. I would like to keep only the rows whose integer occurs exactly three times, and delete all rows whose integer occurs any other number of times (once, twice, or four times, for example). Below is a function remove_loose_ends that I wrote to handle this. However, this function is called many times and is the bottleneck of the entire program. Are there any enhancements that could remove the loop from this operation or otherwise decrease its runtime?
import numpy as np
import time

def remove_loose_ends(A):
    # get unique counts
    unique_id, unique_counter = np.unique(A[:, 2], return_counts=True)
    # initialize outgoing index mask
    good_index = np.array([[True] * (A.shape[0])])
    # loop through all IDs and flip their rows to False if they are not triplets
    for i in range(len(unique_id)):
        if unique_counter[i] != 3:
            good_index = good_index ^ (A[:, 2] == unique_id[i])
    # return incoming array with mask applied
    return A[np.squeeze(good_index), :]

# example array A
A = np.random.rand(1000000, 3)
# making last column "unique" integers (np.int was removed from NumPy; use the builtin int)
A[:, 2] = (A[:, 2] * 1e6).astype(int)
# timing function call
start = time.time()
B = remove_loose_ends(A)
print(time.time() - start)
So, the main problem is that you effectively loop over all the values twice, making this roughly an n² operation.
What you could do instead is create a boolean array directly from the output of the numpy.unique function and let it do the indexing for you.
For example, something like this:
import numpy as np
import time

def remove_loose_ends(A):
    # get unique counts plus the inverse mapping from each row to its unique ID
    _, unique_inverse, unique_counter = np.unique(A[:, 2], return_inverse=True, return_counts=True)
    # boolean array of which integers occurred 3 times
    idx = unique_counter == 3
    # boolean array of which rows have integers that occurred 3 times
    row_idx = idx[unique_inverse]
    # return incoming array with mask applied
    return A[row_idx, :]

# example array A
A = np.random.rand(1000000, 3)
# making last column "unique" integers
A[:, 2] = (A[:, 2] * 1e6).astype(int)
# timing function call
start = time.time()
B = remove_loose_ends(A)
print(time.time() - start)
I tried timing both versions. I stopped the function you posted after 15 minutes, whereas the one above takes around 0.15 s on my PC.
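Since the identifiers are non-negative integers, np.bincount gives yet another loop-free formulation; a sketch under that assumption (the function name is made up here):
import numpy as np

def remove_loose_ends_bincount(A):
    # tally occurrences per ID, then map each row back to its ID's count
    ids = A[:, 2].astype(np.int64)  # assumes non-negative integer IDs
    counts = np.bincount(ids)
    return A[counts[ids] == 3, :]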

Numpy: Finding correspondences in one array by uniques of another array, arbitrary length

I have a problem where I have two arrays, one with identifiers which can occur multiple times, let's just say
import numpy as np
ind = np.random.randint(0,10,(100,))
and another one of the same length which contains some info, in this case boolean, for each of the elements identified by ind. They are sorted correspondingly.
dat = np.random.randint(0,2,(100,)).astype(bool)
I'm looking for a (faster?) way to do the following: apply np.any() to the dat entries belonging to each element (as defined by ind), for all elements. The number of occurrences per element is, as in the example, random. What I'm doing now is
result = np.empty(np.unique(ind).size, dtype=bool)
for i, uni in enumerate(np.unique(ind)):
    result[i] = np.any(dat[ind==uni])
Which is sort of slow. Any ideas?
Approach #1
Index ind with dat to select the IDs that need checking, get the binned counts with np.bincount and see which bins have one or more occurrences -
result = np.bincount(ind[dat])>0
If ind has negative numbers, offset it with the min value -
ar = ind[dat]
result = np.bincount(ar-ar.min())>0
Approach #2
One more with np.unique -
unq = np.unique(ind[dat])
n = len(np.unique(ind))
result = np.zeros(n,dtype=bool)
result[unq] = 1
We can use pandas to get n:
import pandas as pd
n = pd.Series(ind).nunique()
Approach #3
One more with indexing -
ar = ind[dat]
result = np.zeros(ar.max()+1,dtype=bool)
result[ar] = 1
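As a quick sanity check (assuming non-negative IDs, as in the example), Approach #1 can be compared against the original loop; minlength pads the counts so every ID up to ind.max() gets a slot:
expected = np.array([np.any(dat[ind==uni]) for uni in np.unique(ind)])
got = np.bincount(ind[dat], minlength=ind.max()+1) > 0
assert np.array_equal(got[np.unique(ind)], expected)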

Pythonic way to compare sign of numpy array with Dataframe

I have a pandas DataFrame, df. The contents of the first row are as follows:
0    -1387.900
1 -1149.000
2 1526.300
3 1306.300
4 1134.300
5 -1077.200
6 -734.890
7 -340.870
8 -268.970
9 -176.070
10 -515.510
11 283.440
12 -55.148
13 -1701.800
14 -63.294
15 -270.720
16 2216.800
17 4251.200
18 1459.000
19 -613.680
Which is basically a Series. I have a (1x20) numpy array, as follows:
array([[ 1308.22000654, -920.02730748, 1285.54273707, -1119.67498439,
789.50281435, -331.14325768, 756.67399745, -101.9251545 ,
157.17779635, -333.17043669, -191.10517521, -127.80219696,
698.32168135, 154.30798847, -1055.54268665, -1795.96042107,
202.53471769, 25.58830318, 793.63902134, 220.94259961]])
Now, for each cell of this top row of df, I need to check whether its sign is the same as the sign of the corresponding cell of the numpy array above. If the signs differ, then flip the sign of every value in that column of df. For example, take the first cell: df has -1387.9 while the numpy array has 1308.2, so the first column of df should have its signs reversed. The same applies to the other columns.
I am doing it using a for loop, like:
for x in range(20):
    if np.sign(Y1[0][x]) != np.sign(df.iloc[0][x]):
        if np.sign(Y1[0][x]) == 0 and np.sign(df.iloc[0][x]) > 0:
            df[x] = df[x] * 1
        else:
            df[x] = df[x] * (-1)
I also need to make sure that if np.sign(Y1[0][x]) == 0, the sign it takes is not zero but +1. I can add that condition to the above code, but the point is: how do I make it more pythonic?
EDIT: I have added the code I wrote, which seems to work fine and flips the signs of the df columns based on the conditions mentioned above. Any idea how to do this in a pythonic way?
EDIT II: I have one more doubt. My numpy array is supposed to be one-dimensional, but as you see above it comes out two-dimensional, and I have to access each cell with two indices. Why is that? This is how I created the numpy array (the dot product of a 1x11025 row of a DataFrame with a 11025x20 matrix, giving a 1x20 array), but it comes out as an array of arrays, as you see above. Code to create the numpy array:
Y1=np.dot(X_smilie_norm[0:1],W)
X_smilie_norm is a 28x11025 pandas DataFrame. I take just its first row and compute the dot product with W, an 11025x20 matrix. It gives a two-dimensional array when all I want is a one-dimensional one, so that I could access the Y1 values with a single index.
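Regarding EDIT II: slicing with X_smilie_norm[0:1] keeps the row as a 2-D (1, 11025) block, so the dot product comes out as (1, 20). A sketch of two ways to get a flat vector instead, using the names from the question:
Y1 = np.dot(X_smilie_norm.values[0], W)  # pass the row as 1-D, get shape (20,)
# or flatten afterwards:
Y1 = np.dot(X_smilie_norm[0:1], W).ravel()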
Here is the code, but I don't know what result you want when the first row of df contains zero.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(-10, 10, (10, 12)))
sign = np.random.randint(-10, 10, 12)
df.loc[:, (df.iloc[0] >= 0) ^ (sign >= 0)] *= -1
You could use a mask and apply it to the dataframe
mask = (arr <= 0) != (df <= 0) # true if signs are different
df[mask] = -df[mask] # flip the signs on those members where mask is true
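To fold in the requirement that a sign of 0 should count as +1, a sketch that normalizes both sides before comparing (arr is the 1-D reference array and df the frame from the question):
import numpy as np
arr_sign = np.where(np.sign(arr) == 0, 1, np.sign(arr))
df_sign = np.where(np.sign(df.iloc[0]) == 0, 1, np.sign(df.iloc[0]))
df.loc[:, arr_sign != df_sign] *= -1  # flip only the disagreeing columns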

setting null values in a numpy array

How do I null certain values in a numpy array based on a condition?
I don't understand why I end up with 0 instead of null or empty values where the condition is not met. b is a numpy array populated with 0 and 1 values, and c is another fully populated numpy array. All arrays are 71x71x166:
a = np.empty((71, 71, 166))
d = np.empty((71, 71, 166))
for indexes, value in np.ndenumerate(b):
    i, j, k = indexes
    a[i, j, k] = np.where(b[i, j, k] == 1, c[i, j, k], d[i, j, k])
I want to end up with an array which only has values where the condition is met and is empty everywhere else, but without changing its shape.
FULL ISSUE FOR CLARIFICATION as asked for:
I start with a float-populated array with shape (71,71,166).
I make an int array by applying a cutoff to the float array, basically creating a number of bins that roughly mark out 10 areas within the array, with 0 values in between.
What I want to end up with is an array with shape (71,71,166) which has, along a particular array direction (say the vertical direction, if you think of the 3D array as a cube), the average values of a certain "bin"...
So I was trying to loop through the bins (b == 1, b == 2, etc.), sampling the float array where that condition is met but leaving it null elsewhere, so I could take the average and then recombine everything into one array at the end of the loop...
Not sure if I'm making myself understood. I'm using np.where with explicit indexing because I keep getting errors when I try to do it without, although it feels very inefficient.
Consider this example:
import numpy as np
data = np.random.random((4,3))
mask = np.random.randint(0, 2, (4, 3))  # 0s and 1s; random_integers was removed from NumPy
data[mask == 0] = np.nan
The data will be set to nan wherever the mask is 0. You can use any kind of condition you want, of course, or do something different for different values in b.
To erase everything except a specific bin, try the following:
c[b != 1] = np.nan
So, to make a copy of everything in a specific bin:
a = np.copy(c)
a[b != 1] = np.nan
To get the average of everything in a bin:
np.mean(c[b==1])
So perhaps this might do what you want (where bins is a list of bin values):
a = np.empty(c.shape)
a[b == 0] = np.nan
for bin in bins:
    a[b == bin] = np.mean(c[b == bin])
np.empty sometimes fills the array with 0s; the contents of an empty() array are undefined, so 0 is perfectly valid. For example, try this instead:
d = np.nan * np.empty((71, 71, 166))
But consider using numpy's strength, and don't iterate over the array:
a = np.where(b, c, d)
(since b is 0 or 1, I've excluded the explicit comparison b == 1.)
You may even want to consider using a masked array instead:
a = np.ma.masked_where(b == 0, c)
which seems to make more sense with respect to your question: "how do I null certain values in a numpy array based on a condition" (replace null with mask and you're done).
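A tiny demo of the masked-array route; the masked statistics simply skip the masked entries:
import numpy as np
c = np.arange(6.0).reshape(2, 3)
b = np.array([[1, 0, 1], [0, 1, 0]])
a = np.ma.masked_where(b == 0, c)  # keep values where b == 1
print(a.mean())                    # averages only the unmasked entries: 2.0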
