Performing grouped average and standard deviation with NumPy arrays - python

I have a set of data (X, Y). The values of my independent variable X are not unique, so many of them repeat. I want to output new arrays containing: X_unique, the unique values of X; Y_mean, the mean of all the Y values corresponding to each value in X_unique; and Y_std, the standard deviation of all the Y values corresponding to each value in X_unique.
x = data[:,0]
y = data[:,1]

You can use binned_statistic from scipy.stats, which applies a statistic function to the values that fall into each bin of a 1D array. To get the chunks, sort the data by the first column and find the positions where the value changes; np.unique with return_index=True gives those positions. Putting it all together, here's an implementation:
import numpy as np
from scipy.stats import binned_statistic as bstat
# Sort data corresponding to argsort of first column
sdata = data[data[:,0].argsort()]
# Unique first-column values and the index where each group starts in the sorted data
unq_x,breaks = np.unique(sdata[:,0],return_index=True)
breaks = np.append(breaks,data.shape[0])
# Use binned statistic to get grouped average and std deviation values
idx_range = np.arange(data.shape[0])
avg_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='mean', bins=breaks)
std_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='std', bins=breaks)
From the docs of binned_statistic, one can also use a custom statistic function :
function : a user-defined function which takes a 1D array of values,
and outputs a single numerical statistic. This function will be called
on the values in each bin. Empty bins will be represented by
function([]), or NaN if this returns an error.
Sample input, output -
In [121]: data
Out[121]:
array([[2, 5],
[2, 2],
[1, 5],
[3, 8],
[0, 8],
[6, 7],
[8, 1],
[2, 5],
[6, 8],
[1, 8]])
In [122]: np.column_stack((unq_x,avg_y,std_y))
Out[122]:
array([[ 0. , 8. , 0. ],
[ 1. , 6.5 , 1.5 ],
[ 2. , 4. , 1.41421356],
[ 3. , 8. , 0. ],
[ 6. , 7.5 , 0.5 ],
[ 8. , 1. , 0. ]])

x_unique = np.unique(x)
y_means = np.array([np.mean(y[x==u]) for u in x_unique])
y_stds = np.array([np.std(y[x==u]) for u in x_unique])
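
The boolean-mask loop above scans x once per unique value, which can get slow when there are many groups. A loop-free alternative, offered here as a minimal sketch and not as part of either answer, labels each row with its group using np.unique(..., return_inverse=True) and accumulates per-group sums with np.bincount. The sample data is copied from the example above and the variable names are illustrative. The standard deviation is the population one (ddof=0, like np.std), computed as sqrt(E[y^2] - E[y]^2), which can lose precision when the spread is tiny relative to the mean.

```python
import numpy as np

# Sample data from the example above: column 0 is x, column 1 is y
data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]], dtype=float)
x, y = data[:, 0], data[:, 1]

# inverse[i] is the position of x[i] in x_unique, i.e. a group label per row
x_unique, inverse = np.unique(x, return_inverse=True)

counts = np.bincount(inverse)                        # rows per group
y_mean = np.bincount(inverse, weights=y) / counts    # per-group mean
y_sq_mean = np.bincount(inverse, weights=y**2) / counts

# Population std (ddof=0); clip tiny negative values caused by rounding
y_std = np.sqrt(np.maximum(y_sq_mean - y_mean**2, 0.0))

print(np.column_stack((x_unique, y_mean, y_std)))
```

On this data the result should agree with the binned_statistic output shown earlier.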

Pandas is designed for this kind of task:
import numpy as np
import pandas
data=np.random.randint(1,5,20).reshape(10,2)
pandas.DataFrame(data).groupby(0).mean()
which gives
          1
0
1  2.666667
2  3.000000
3  2.000000
4  1.500000

Related

Replacing non zero values in a matrix with the marginals

I am trying to do some math with my matrix. I can write it down, but I am not sure how to code it. This involves getting a column of row marginal values, then making a new matrix in which every non-zero value is replaced with its row's marginal. After that, I would like to divide the sum of each column of the new matrix by the number of non-zero values in that column.
I can get the row marginals, but I can't think of a way to repopulate the matrix.
Example of what I want:
import numpy as np
matrix = np.matrix([[1,3,0],[0,1,2],[1,0,4]])
matrix([[1, 3, 0],
[0, 1, 2],
[1, 0, 4]])
marginals = ((matrix != 0).sum(1) / matrix.sum(1))
matrix([[0.5 ],
[0.66666667],
[0.4 ]])
What I want done next is a filling of the matrix based on the non zero locations of the first.
matrix([[0.5, 0.5, 0],
[0, 0.667, 0.667],
[0.4, 0, 0.4]])
Final wanted result is the new matrix column sum divided by the number of non zero occurrences in that column.
matrix([[(0.5+0.4)/2, (0.5+0.667)/2, (0.667+0.4)/2]])
To get the final matrix we can use matrix-multiplication for efficiency -
In [84]: mask = matrix!=0
In [100]: (mask.T*marginals).T/mask.sum(0)
Out[100]: matrix([[0.45 , 0.58333334, 0.53333334]])
Or simpler -
In [110]: (marginals.T*mask)/mask.sum(0)
Out[110]: matrix([[0.45 , 0.58333334, 0.53333334]])
If you need that intermediate filled output too, use np.multiply for broadcasted elementwise multiplication -
In [88]: np.multiply(mask,marginals)
Out[88]:
matrix([[0.5 , 0.5 , 0. ],
[0. , 0.66666667, 0.66666667],
[0.4 , 0. , 0.4 ]])
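
The same computation can also be written with plain ndarrays and broadcasting instead of np.matrix. This is only a sketch, assuming that no row sums to zero and that every column has at least one non-zero entry; the names a, filled and result are illustrative and do not come from the answer.

```python
import numpy as np

a = np.array([[1, 3, 0],
              [0, 1, 2],
              [1, 0, 4]], dtype=float)

mask = a != 0                                   # True where a is non-zero

# Row marginals: number of non-zero entries divided by the row sum
marginals = mask.sum(axis=1) / a.sum(axis=1)    # shape (3,)

# Broadcast each row's marginal across that row's non-zero positions
filled = mask * marginals[:, None]              # shape (3, 3)

# Column sums divided by the number of non-zero entries per column
result = filled.sum(axis=0) / mask.sum(axis=0)  # shape (3,)
print(result)
```

With 1-D and 2-D ndarrays the `*` operator is elementwise, so np.multiply is not needed here.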

New array from existing one: 2 columns being the line/column indexes of the existing array, the third being the values [duplicate]

Here is some code I'm struggling with.
My goal is to create an array (db) from an existing one (t). Each line of db will represent one value of t. db will have 3 columns: one for the line index in t, one for the column index in t, and one for the value in t.
In my case, t was a distance matrix, so its diagonal was 0 and it was symmetric; I replaced the lower-triangular values with 0. I don't need the 0 values in the new array, but I can delete them in a separate step.
import numpy as np
t = np.array([[0, 2.5],
[0, 0]])
My goal is to obtain a new array such as :
db = np.array([[0, 0, 0],
[0, 1, 2.5],
[1, 0, 0],
[1, 1, 0]])
Thanks for your time.
You can create a meshgrid of 2D coordinates for the rows and columns, then unroll these into 1D arrays. You can then concatenate these two arrays as well as the unrolled version of t into one final matrix:
import numpy as np
(Y, X) = np.meshgrid(np.arange(t.shape[1]), np.arange(t.shape[0]))
db = np.column_stack((X.ravel(), Y.ravel(), t.ravel()))
Example run
In [9]: import numpy as np
In [10]: t = np.array([[0, 2.5],
...: [0, 0]])
In [11]: (Y, X) = np.meshgrid(np.arange(t.shape[1]), np.arange(t.shape[0]))
In [12]: db = np.column_stack((X.ravel(), Y.ravel(), t.ravel()))
In [13]: db
Out[13]:
array([[ 0. , 0. , 0. ],
[ 0. , 1. , 2.5],
[ 1. , 0. , 0. ],
[ 1. , 1. , 0. ]])
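
Since the question mentions that the zero entries are not needed, here is a hedged variation that builds the same table with np.indices and then filters out the rows whose value is 0. The names db_full and db_nonzero are illustrative and do not appear in the original post.

```python
import numpy as np

t = np.array([[0, 2.5],
              [0, 0]])

# np.indices returns one array of row indices and one of column indices,
# each with the same shape as t
rows, cols = np.indices(t.shape)

# One line per element of t: (row index, column index, value)
db_full = np.column_stack((rows.ravel(), cols.ravel(), t.ravel()))

# Keep only the lines whose value (third column) is non-zero
db_nonzero = db_full[db_full[:, 2] != 0]
print(db_nonzero)
```

Because column_stack produces a single float array, the index columns come back as floats, as in the example run above.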

Keep the n highest values of each row of an numpy array and zero everything else [duplicate]

I have a numpy array of data where I need to keep only n highest values, and zero everything else.
My current solution:
import numpy as np
np.random.seed(30)
# keep only the n highest values
n = 3
# Simple 2x5 data field for this example; the real-life application will be extremely large
data = np.random.random((2,5))
#[[ 0.64414354 0.38074849 0.66304791 0.16365073 0.96260781]
# [ 0.34666184 0.99175099 0.2350579 0.58569427 0.4066901 ]]
# find indices of the n highest values per row
idx = np.argsort(data)[:,-n:]
#[[0 2 4]
# [4 3 1]]
# put those values back in a blank array
data_ = np.zeros(data.shape) # blank slate
for i in range(data.shape[0]):
    data_[i,idx[i]] = data[i,idx[i]]
# Each row contains only the 3 highest values of that row of the original data
#[[ 0.64414354 0. 0.66304791 0. 0.96260781]
# [ 0. 0.99175099 0. 0.58569427 0.4066901 ]]
In the code above, data_ has the n highest values and everything else is zeroed out. This works out nicely even if data.shape[1] is smaller than n. But the only issue is the for loop, which is slow because my actual use case is on very very large arrays.
Is it possible to get rid of the for loop?
You could apply np.argsort twice, in a vectorized fashion: the first call gives the sort order and the second gives the rank of each element within its row. Then use either np.where or simply multiplication to zero everything else:
In [116]: np.argsort(data)
Out[116]:
array([[3, 1, 0, 2, 4],
[2, 0, 4, 3, 1]])
In [117]: np.argsort(np.argsort(data)) # these are the ranks
Out[117]:
array([[2, 1, 3, 0, 4],
[1, 4, 0, 3, 2]])
In [118]: np.argsort(np.argsort(data)) >= data.shape[1] - 3
Out[118]:
array([[ True, False, True, False, True],
[False, True, False, True, True]], dtype=bool)
In [119]: data * (np.argsort(np.argsort(data)) >= data.shape[1] - 3)
Out[119]:
array([[ 0.64414354, 0. , 0.66304791, 0. , 0.96260781],
[ 0. , 0.99175099, 0. , 0.58569427, 0.4066901 ]])
In [120]: np.where(np.argsort(np.argsort(data)) >= data.shape[1]-3, data, 0)
Out[120]:
array([[ 0.64414354, 0. , 0.66304791, 0. , 0.96260781],
[ 0. , 0.99175099, 0. , 0.58569427, 0.4066901 ]])
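
Only membership in the top n matters here, not the full ordering, so np.argpartition can stand in for the double argsort. The following is a sketch, not a benchmark. It clamps n to the row length (the variable k is introduced here for that purpose) and uses np.take_along_axis and np.put_along_axis, which require NumPy 1.15 or later. The data values are the ones printed in the question.

```python
import numpy as np

data = np.array([[0.64414354, 0.38074849, 0.66304791, 0.16365073, 0.96260781],
                 [0.34666184, 0.99175099, 0.2350579, 0.58569427, 0.4066901]])
n = 3
k = min(n, data.shape[1])   # never ask for more values than a row holds

# Column indices of the k largest entries in each row (in no particular order)
idx = np.argpartition(data, -k, axis=1)[:, -k:]

# Copy just those entries into an all-zero array of the same shape
result = np.zeros_like(data)
np.put_along_axis(result, idx, np.take_along_axis(data, idx, axis=1), axis=1)
print(result)
```

If several entries in a row tie at the cut-off, which of them are kept is not specified by argpartition.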

Compare elements in a numpy array 3 rows at a time

I have a NumPy array like the one below:
[[3.4, 87]
[5.5, 11]
[22, 3]
[4, 9.8]
[41, 11.22]
[32, 7.6]]
and I want to:
compare the elements in column 2, 3 rows at a time
delete the row with the biggest value in column 2, 3 rows at a time
For example, in the first 3 rows, the values in column 2 are 87, 11 and 3, respectively, and I would like to keep 11 and 3.
The output array I expect is:
[[5.5, 11]
[22, 3]
[4, 9.8]
[32, 7.6]]
I am new to NumPy arrays; please give me advice on how to achieve this.
import numpy as np
x = np.array([[3.4, 87],
[5.5, 11],
[22, 3],
[4, 9.8],
[41, 11.22],
[32, 7.6]])
y = x.reshape(-1,3,2)
idx = y[..., 1].argmax(axis=1)
mask = np.arange(3)[None, :] != idx[:, None]
y = y[mask]
print(y)
# This might be helpful for the deleted part of your question
# y = y.reshape(-1,2,2)
# z = y[...,1]/y[...,1].sum(axis=1)
# result = np.dstack([y, z[...,None]])
yields
[[ 5.5 11. ]
[ 22. 3. ]
[ 4. 9.8]
[ 32. 7.6]]
"Grouping by three" with NumPy can be done by reshaping the array to create a new axis of length 3 -- provided the original number of rows is divisible by 3:
In [92]: y = x.reshape(-1,3,2); y
Out[92]:
array([[[ 3.4 , 87. ],
[ 5.5 , 11. ],
[ 22. , 3. ]],
[[ 4. , 9.8 ],
[ 41. , 11.22],
[ 32. , 7.6 ]]])
In [93]: y.shape
Out[93]: (2, 3, 2)
          |  |  |
          |  |  o--- 2 columns in each group
          |  o------ 3 rows in each group
          o--------- 2 groups
For each group, we can select the second column and find the row with the maximum value:
In [94]: idx = y[..., 1].argmax(axis=1); idx
Out[94]: array([0, 1])
array([0, 1]) indicates that in the first group, the 0th indexed row contains the maximum (i.e. 87), and in the second group, the 1st indexed row contains the maximum (i.e. 11.22).
Next, we can generate a 2D boolean selection mask which is True where the rows do not contain the maximum value:
In [95]: mask = np.arange(3)[None, :] != idx[:, None]; mask
Out[95]:
array([[False, True, True],
[ True, False, True]], dtype=bool)
In [96]: mask.shape
Out[96]: (2, 3)
mask has shape (2,3). y has shape (2,3,2). If mask is used to index y as in y[mask], then the mask is aligned with the first two axes of y, and all values where mask is True are returned:
In [98]: y[mask]
Out[98]:
array([[ 5.5, 11. ],
[ 22. , 3. ],
[ 4. , 9.8],
[ 32. , 7.6]])
In [99]: y[mask].shape
Out[99]: (4, 2)
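
As a variation on the boolean mask, the per-group argmax can be converted into row numbers of the original array and passed to np.delete. This is a sketch under the same assumption as above (the number of rows is divisible by 3); the names groups, rows_to_drop and result are illustrative.

```python
import numpy as np

x = np.array([[3.4, 87],
              [5.5, 11],
              [22, 3],
              [4, 9.8],
              [41, 11.22],
              [32, 7.6]])

group_size = 3
groups = x.reshape(-1, group_size, x.shape[1])

# Position (0, 1 or 2) of the largest column-2 value inside each group
idx = groups[..., 1].argmax(axis=1)

# Convert within-group positions to row numbers in the original array
rows_to_drop = idx + group_size * np.arange(len(idx))

result = np.delete(x, rows_to_drop, axis=0)
print(result)
```

np.delete returns a new array and leaves x unchanged.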
By the way, the same calculation could be done using Pandas like this:
import numpy as np
import pandas as pd
x = np.array([[3.4, 87],
[5.5, 11],
[22, 3],
[4, 9.8],
[41, 11.22],
[32, 7.6]])
df = pd.DataFrame(x)
idx = df.groupby(df.index // 3)[1].idxmax()
# drop the row with the maximum value in each group
df = df.drop(idx.values, axis=0)
which yields the DataFrame:
0 1
1 5.5 11.0
2 22.0 3.0
3 4.0 9.8
5 32.0 7.6
You might find Pandas syntax easier to use, but for the above calculation NumPy
is faster.

What is the Python numpy equivalent of the IDL # operator?

I am looking for the Python numpy equivalent of the IDL # operator.
Here is what the # operator does:
Computes array elements by multiplying the columns of the first array
by the rows of the second array. The second array must have the same
number of columns as the first array has rows. The resulting array has
the same number of columns as the first array and the same number of
rows as the second array.
Here are the numpy arrays I am dealing with:
A = [[ 0.9826128 0. 0.18566662]
[ 0. 1. 0. ]
[-0.18566662 0. 0.9826128 ]]
and
B = [[ 1. 0. 0. ]
[ 0.62692564 0.77418869 0.08715574]]
Also, numpy.dot(A,B) results in ValueError: matrices are not aligned.
Reading the notes on IDL's definition of matrix multiplication, it seems they use the opposite notation to everyone else:
IDL’s convention is to consider the first dimension to be the column
and the second dimension to be the row
So # can be achieved by the rather strange looking:
numpy.dot(A.T, B.T).T
from their example values:
import numpy as np
A = np.array([[0, 1, 2], [3, 4, 5]])
B = np.array([[0, 1], [2, 3], [4, 5]])
C = np.dot(A.T, B.T).T
print(C)
gives
[[ 3 4 5]
[ 9 14 19]
[15 24 33]]
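
Because the transpose of a product reverses the order of its factors, (A.T · B.T).T is the same as B · A, so the expression above can also be written as np.dot(B, A). A short check using the same example values (the names C_idl and C_swapped are illustrative):

```python
import numpy as np

A = np.array([[0, 1, 2], [3, 4, 5]])
B = np.array([[0, 1], [2, 3], [4, 5]])

C_idl = np.dot(A.T, B.T).T   # the expression given above
C_swapped = np.dot(B, A)     # same product with the operands swapped

print(np.array_equal(C_idl, C_swapped))
```

With the arrays from the question, A of shape (3, 3) and B of shape (2, 3), np.dot(B, A) is well defined and has shape (2, 3), whereas np.dot(A, B) raises the alignment error.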
If I'm correct you want matrix multiplication.
