Composite a numpy array/matrix based on column values and variables? - python

I'm playing with NumPy and SciPy and I'm having trouble finding a feature in the documentation. I was thus wondering if anyone could help.
Suppose I have an array in NumPy with two columns and k rows. One column serves as a numerical indicator (e.g. 2 = male, 1 = female, 0 = unknown) while the second column is perhaps a list of values or scores.
Let's say that I want to find the standard deviation (could be mean or whatever, I just want to apply a function) of the values for all rows with indicator 0, and then for 1, and finally, 2.
Is there a predefined function to composite this for me?
In R, the equivalent can be found in the plyr package. Do NumPy and/or SciPy have an equivalent, or am I stuck creating a mask for this array and then somehow filtering through this mask and then applying my function?
As always, thanks for your help!

If I understand your description, you have a dataset something like this:
In [79]: x=np.random.randint(0,3,size=100)
In [80]: y=np.random.randint(0,100,size=100)
In [81]: d=np.vstack([x,y]).T
In [88]: print(d[:5,:])
[[ 0 43]
[ 1 60]
[ 2 60]
[ 1 4]
[ 0 30]]
In this situation numpy.unique can be used to generate an array of unique "key" values:
In [82]: idx=np.unique(d[:,0])
In [83]: print(idx)
[0 1 2]
and those values used to drive a generator expression like this:
In [113]: g=(d[np.where(d[:,0]==val),1].std() for val in idx)
The generator g will emit the standard deviation of all the entries in d which match each entry in the index. numpy.fromiter can then be used to collect the results:
In [114]: print(np.vstack([idx,np.fromiter(g,dtype=float)]).T)
[[ 0. 26.87376385]
[ 1. 29.41046084]
[ 2. 24.2477246 ]]
Note that the keys are converted to floating point in the last step during stacking; you might not want that depending on your data, but I did it just for illustrative purposes to have a "nice"-looking final result to post.
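For reference, a minimal sketch of the same grouping written as a dictionary comprehension, reusing d (the (k, 2) data array) and idx (the unique keys) from above; it returns a dict keyed by indicator value rather than a stacked array:
import numpy as np
# per-key standard deviation of column 1, collected into a dict
stats = {key: d[d[:, 0] == key, 1].std() for key in idx}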

You can use masked array operations for that.
http://docs.scipy.org/doc/numpy/reference/maskedarray.html#maskedarray
To create the mask, you can use the numpy.where function, like so:
male_mask = numpy.where(a[:,0]==2, False, True)
female_mask = numpy.where(a[:,0]==1, False, True)
Then, remember to use the special functions from numpy.ma:
http://docs.scipy.org/doc/numpy/reference/routines.ma.html
male_average = numpy.ma.average(numpy.ma.array(a[:,1], mask=male_mask))
EDIT: actually, this works just as well:
numpy.ma.average(numpy.ma.array(a[:,1], mask=a[:,0]!=value))
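A minimal self-contained sketch of this masked-array approach, looping over the indicator values (the data here is made up purely for illustration):
import numpy as np
a = np.array([[2, 10.0], [1, 12.0], [0, 7.0], [2, 14.0], [1, 11.0]])  # [indicator, score]
for value in np.unique(a[:, 0]):
    masked = np.ma.array(a[:, 1], mask=a[:, 0] != value)  # mask out the other groups
    print(value, np.ma.average(masked), masked.std())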

Related

How to get specific index of np.array of np.arrays fast

At the most basic level, I have the following dataframe:
import numpy as np
import pandas as pd
a = {'possibility' : np.array([1,2,3])}
b = {'possibility' : np.array([4,5,6])}
df = pd.DataFrame([a,b])
This gives me a dataframe of size 2x1, like so:
row 1: np.array([1,2,3])
row 2: np.array([4,5,6])
I have another vector of length 2. Like so:
[1,2]
These represent the index I want from each row.
So if I have [1,2] I want: from row 1: 2, and from row 2: 6.
Ideally, my output is [2,6] in a vector form, of length 2.
Is this possible? I can easily run through a for loop, but am looking for FAST approaches, ideally vectorized approaches, since it is already in pandas/numpy.
For the actual use case, I am looking to make this work in the 300k-400k row range, and I need to run it inside optimization problems (hence the need for speed).
You could transform to a multi-dimensional numpy array and take_along_axis:
v = np.array([1,2])
a = np.vstack(df['possibility'])
np.take_along_axis(a.T, v[None], axis=0)[0]
output: array([2, 6])
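An equivalent sketch using plain integer (fancy) indexing, with the same a and v as above; it avoids the transpose:
result = a[np.arange(len(v)), v]   # pick column v[i] from row i
# result -> array([2, 6])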

Checking multiple conditions with a numpy array

I have a Dataframe with several rows and columns and I have transformed it into a numpy array to speed up the calculations.
The first five columns of the Dataframe looked like this:
par1 par2 par3 par4 par5
1.502366 2.425301 0.990374 1.404174 1.929536
1.330468 1.460574 0.917349 1.172675 0.766603
1.212440 1.457865 0.947623 1.235930 0.890041
1.222362 1.348485 0.963692 1.241781 0.892205
...
These columns are now stored in a numpy array a = df.values
I need to check whether at least two of the five columns satisfy a condition (i.e., their value is larger than a certain threshold). Initially I wrote a function that performed the operation directly on the dataframe. However, because I have a very large amount of data and need to repeat the calculations over and over, I switched to numpy to take advantage of the vectorization.
To check the condition I was thinking to use
df['Result'] = np.where(condition_on_parameters > 2, True, False)
However, I cannot figure out how to write the condition_on_parameters such that it returns True or False when at least 2 out of the 5 parameters are larger than the threshold. I thought to use the sum() function on the condition_on_parameters but I am not sure how to write such a condition.
EDIT
It is important to specify that the thresholds are different for each parameter. For example thr1=1.2, thr2=2.0, thr3=1.5, thr4=2.2, thr5=3.0. So I need to check that par1 > thr1, par2 > thr2, ..., par5 > thr5.
Assuming condition_on_parameters returns an array the same size as a with entries of True or False, you can use np.sum(condition_on_parameters, axis=1) to sum over the true values (True has a numerical value of 1) of each row. This provides a 1D array whose entries give the number of columns that meet the condition in each row. This array can then be used with where to get the row numbers you are looking for.
df['result'] = np.sum(condition_on_parameters, axis=1) >= 2
# or, for the matching row indices: np.where(np.sum(condition_on_parameters, axis=1) >= 2)[0]
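For the per-column thresholds mentioned in the question's edit, one way to build condition_on_parameters is to broadcast against a threshold vector; a minimal sketch, assuming the columns are named par1..par5 as in the question:
import numpy as np
thresholds = np.array([1.2, 2.0, 1.5, 2.2, 3.0])          # thr1..thr5 from the question
vals = df[['par1', 'par2', 'par3', 'par4', 'par5']].values
condition_on_parameters = vals > thresholds               # broadcasts one threshold per column
df['Result'] = condition_on_parameters.sum(axis=1) >= 2   # at least two columns above their threshold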
Can you exploit pandas functionalities? For example, you can efficiently check conditions on multiple rows/columns with .apply and then .sum(axis=1).
Here some sample code:
import pandas as pd
df = pd.DataFrame([[1.50, 2.42, 0.88], [0.98,1.3, 0.56]], columns=['par1', 'par2', 'par3'])
# custom_condition, e.g. value less or equal than threshold
def leq(x, t):
    return x <= t
condition = df.apply(lambda x: leq(x, 1)).sum(axis=1)
# filter
df.loc[condition >=2]
I think this should be equivalent to numpy in terms of efficiency, as pandas is ultimately built on top of it; however, I'm not entirely sure...
It seems you are looking for numpy.any
import numpy as np
import pandas as pd
a = np.array(
    [[1.502366, 2.425301, 0.990374, 1.404174, 1.929536],
     [1.330468, 1.460574, 0.917349, 1.172675, 0.766603],
     [1.212440, 1.457865, 0.947623, 1.235930, 0.890041],
     [1.222362, 1.348485, 0.963692, 1.241781, 0.892205]])
df = pd.DataFrame(a, columns=[f'par{i}' for i in range(1, 6)])
df['Result'] = np.any(df > 1.46, axis=1) # append the result column
Gives the following dataframe

Python: Combining 2D arrays with 1 common column that has different values

I want to combine two arrays which represent a curve where the dependent variable is column 1; however, the column 0 values do not always match:
import numpy as np
arr1= np.array([(12,1003),(17,900),(20,810)])
arr2= np.array([(10,1020),(17,902),(19,870),(21,750)])
I want to combine these into one array where the column 0 is combined and both column 1s are stacked with gaps where there is no value for the corresponding column 0 value, something like this:
arr3=np.array([((10,None,1020),(12,1003,None),(17,900,902),(19,None,870),(20,810,None),(21,None,750))])
The reason for this is that I want to be able to get mean values of the second column for each array, but they are not at exactly the same column 0 values, so the idea of creating this array is to interpolate to replace all the None values, then create mean values from columns 1 and 2 and have an extra column to represent that.
I have used NumPy for everything else so far but have obviously got stuck with the np.column_stack function, as it needs lists of the same length and is also blind to stacking based on values from column 0. Lastly, I do not want to create a fit for the data, as the actual data is non-linear and possibly not consistent, so a fit will not work and interpolation seems like the most accurate method.
There may be an answer already but due to me not knowing how to describe it well I can't find it. Also I am relatively new to python so please don't make any assumptions about my knowledge other than it is very little.
Thank you.
Will this help?
import pandas
import numpy as np
arr1= np.array([(12,1003),(17,900),(20,810)])
arr2= np.array([(10,1020),(17,902),(19,870),(21,750)])
d1 = pandas.DataFrame(arr1)
d2 = pandas.DataFrame(arr2)
d1.columns = d2.columns = ['t','v']
d3 = d1.merge(d2, on='t', how='outer')
print(d3.values)
# use d3.to_numpy() (or d3.values) to convert to a numpy array
output
[[ 12. 1003. nan]
[ 17. 900. 902.]
[ 20. 810. nan]
[ 10. nan 1020.]
[ 19. nan 870.]
[ 21. nan 750.]]
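If the goal is then to interpolate the gaps and average the two value columns, a minimal follow-up sketch (the v_x/v_y column names come from the merge above; method='index' interpolates in terms of t rather than by row position):
d3 = d3.sort_values('t').set_index('t').interpolate(method='index')
d3['mean'] = d3[['v_x', 'v_y']].mean(axis=1)  # NaNs left at the edges are simply skipped by mean()
print(d3)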

numpy, fill sparse matrix with rows from other matrix

I have trouble figuring out what would be the most efficient way to do the following:
import numpy as np
M = 10
K = 10
ind = np.array([0,1,0,1,0,0,0,1,0,0])
full = np.random.rand(sum(ind),K)
output = np.zeros((M,K))
output[1,:] = full[0,:]
output[3,:] = full[1,:]
output[7,:] = full[2,:]
I want to build output, which is a sparse matrix, whose rows are given in a dense matrix (full) and the row indices are specified through a binary vector.
Ideally, I want to avoid a for-loop. Is that possible? If not, I'm looking for the most efficient way to for-loop this.
I need to perform this operation quite a few times. ind and full will keep changing, hence I've just provided some exemplar values for illustration.
I expect ind to be pretty sparse (at most 10% ones), and both M and K to be large numbers (10e2 - 10e3). Ultimately, I might need to perform this operation in pytorch, but some decent procedure for numpy, would already get me quite far.
Please also help me find a more appropriate title for the question, if you have one or more appropriate categories for this question.
Many thanks,
Max
output[ind.astype(bool)] = full
By converting the integer values in ind to boolean values, you can do boolean indexing to select the rows in output that you want to populate with values in full.
example with a 4x4 array:
M = 4
K = 4
ind = np.array([0,1,0,1])
full = np.random.rand(sum(ind),K)
output = np.zeros((M,K))
output[ind.astype(bool)] = full
print(output)
[[ 0. 0. 0. 0. ]
[ 0.32434109 0.11970721 0.57156261 0.35839647]
[ 0. 0. 0. 0. ]
[ 0.66038644 0.00725318 0.68902177 0.77145089]]
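Since the question mentions possibly needing this in PyTorch: the same boolean-mask row assignment carries over directly; a minimal sketch mirroring the 4x4 numpy example above:
import torch
ind = torch.tensor([0, 1, 0, 1])
full = torch.rand(int(ind.sum()), 4)
output = torch.zeros(4, 4)
output[ind.bool()] = full   # boolean-mask row assignment, same idea as in numpy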

setting null values in a numpy array

How do I null certain values in a numpy array based on a condition?
I don't understand why I end up with 0 instead of null or empty values where the condition is not met... b is a numpy array populated with 0 and 1 values, c is another fully populated numpy array. All arrays are 71x71x166
a = np.empty((71, 71, 166))
d = np.empty((71, 71, 166))
for indexes, value in np.ndenumerate(b):
    i, j, k = indexes
    a[i, j, k] = np.where(b[i, j, k] == 1, c[i, j, k], d[i, j, k])
I want to end up with an array which only has values where the condition is met and is empty everywhere else but with out changing its shape
FULL ISSUE FOR CLARIFICATION as asked for:
I start with a float populated array with shape (71,71,166)
I make an int array based on a cutoff applied to the float array basically creating a number of bins, roughly marking out 10 areas within the array with 0 values in between
What I want to end up with is an array with shape (71,71,166) which has the average values in a particular array direction (assuming vertical direction, if you think of a 3D array as a 3D cube) of a certain "bin"...
so I was trying to loop through the "bins" b == 1, b == 2 etc, sampling the float where that condition is met but being null elsewhere so I can take the average, and then recombine into one array at the end of the loop....
Not sure if I'm making myself understood. I'm using np.where with explicit indexing because I keep getting errors when I try to do it without, although it feels very inefficient.
Consider this example:
import numpy as np
data = np.random.random((4,3))
mask = np.random.randint(0, 2, (4,3))
data[mask==0] = np.nan
The data will be set to nan wherever the mask is 0. You can use any kind of condition you want, of course, or do something different for different values in b.
To erase everything except a specific bin, try the following:
c[b!=1] = np.nan
So, to make a copy of everything in a specific bin:
a = np.copy(c)
a[b!=1] = np.nan
To get the average of everything in a bin:
np.mean(c[b==1])
So perhaps this might do what you want (where bins is a list of bin values):
a = np.empty(c.shape)
a[b==0] = np.nan
for bin in bins:
    a[b==bin] = np.mean(c[b==bin])
np.empty sometimes fills the array with 0's; it's undefined what the contents of an empty() array are, so 0 is perfectly valid. For example, try this instead:
d = np.nan * np.empty((71, 71, 166))
But consider using numpy's strength, and don't iterate over the array:
a = np.where(b, c, d)
(since b is 0 or 1, I've excluded the explicit comparison b == 1.)
You may even want to consider using a masked array instead:
a = np.ma.masked_where(b, c)
which seems to make more sense with respect to your question: "how do I null certain values in a numpy array based on a condition" (replace null with mask and you're done).
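For the per-bin averaging described in the question, a minimal nan-based sketch (assuming b holds the bin labels and c the float data, as above):
vals = np.where(b == 1, c, np.nan)   # keep bin 1, null out everything else
bin1_mean = np.nanmean(vals)         # mean that ignores the NaNs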
