Assume I have a matrix that is N items long by M columns wide (where M <= N). I want to know the average rank of each of the N items across the M columns.
arr = np.array([
[0,1],
[2,0],
[1,2]
])
I could loop through each of the N values and do something like the following, but I'm wondering whether there's a better approach to this:
for n in range(3):
    np.where(arr == n)[0].mean()
Edit
Sorry, it seems my choice of example has caused some confusion. To better illustrate, let me swap in letters since the values in the matrix are identifiers, not numbers to be calculated on.
arr = np.array([
['A','B'],
['C','A'],
['B','C']
])
I am not trying to do a simple row-wise average. I'm trying to say that:
A's average rank is 0.5, i.e. (0 + 1) / 2
B's average rank is 1.0, i.e. (0 + 2) / 2
C's average rank is 1.5, i.e. (1 + 2) / 2
Hopefully this clarifies my request.
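For concreteness, the naive loop carries over to the letter version unchanged, since np.where returns the row indices (i.e. the ranks) of each occurrence; a minimal sketch:
import numpy as np

arr = np.array([['A','B'],
                ['C','A'],
                ['B','C']])

# the row index of each occurrence is its rank in that column
for label in ['A', 'B', 'C']:
    print(label, np.where(arr == label)[0].mean())
# A 0.5
# B 1.0
# C 1.5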
It looks like you want to get the mean of your array along a certain axis. You can do this using the axis= argument of numpy.mean:
import numpy as np
arr = np.array([
[0,1],
[2,0],
[1,2]
])
np.mean(arr, axis=1)
# [ 0.5 1. 1.5]
If you want the row-wise mean:
>>> np.mean(arr, axis=1)
array([ 0.5, 1. , 1.5])
To get the ranks (as per the OP's description), first generate a 2D array of indices:
import numpy as np
M = 5
N = 7
narray = np.tile(np.arange(N), M).reshape(N, M)
print(narray)
Output:
[[0 1 2 3 4]
[5 6 0 1 2]
[3 4 5 6 0]
[1 2 3 4 5]
[6 0 1 2 3]
[4 5 6 0 1]
[2 3 4 5 6]]
Now take the row-wise mean to get the ranks:
mean_value = np.mean(narray, axis=1)
print(mean_value)
Output:
[ 2. 2.8 3.6 3. 2.4 3.2 4. ]
If each of the N items appears exactly once in each column (i.e. each column is a ranking), you can simply do:
#arr = np.array([['A','B'],['C','A'],['B','C']])
means = arr.argsort(0).mean(1)
#array([ 0.5, 1. , 1.5])
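To see why this works: argsort along axis 0 returns, for each column, the row positions of that column's entries in sorted label order, so row k of the result collects the rank of the k-th smallest label in every column:
print(arr.argsort(0))
# [[0 1]    <- positions (ranks) of 'A' in each column
#  [2 0]    <- positions of 'B'
#  [1 2]]   <- positions of 'C'
print(arr.argsort(0).mean(1))
# [0.5 1.  1.5]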
Here is my attempt to "improve" your original solution. My solution has the benefit of not needing to perform two (possibly very time-consuming) operations all over again for each value in the array: np.where(arr == n) must (1) find all values equal to n and then (2) find the indices of the elements for which that equality holds.
values, inverse, counts = np.unique(arr, return_inverse=True, return_counts=True)
# flat positions sorted by value, converted to row indices (i.e. ranks);
# ravel() keeps this working on numpy versions where inverse is returned 2D
rows = np.argsort(inverse.ravel()) // len(arr[0])
cumsum = np.cumsum(counts)
# cumsum - counts gives the start offset of each value's block of positions
# (cumsum - cumsum[0] would only be correct when all counts are equal)
avranks = np.add.reduceat(rows, cumsum - counts) / counts
Then, for your original data,
>>> print(avranks)
[0.5 1. 1.5]
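Walking through the intermediates on the letter example (arr flattens to A, B, C, A, B, C):
# values -> ['A' 'B' 'C'], inverse -> [0 1 2 0 1 2], counts -> [2 2 2]
# np.argsort(inverse) -> [0 3 1 4 2 5]   (flat positions grouped by label)
# rows = [0 3 1 4 2 5] // 2 -> [0 1 0 2 1 2]   (row index = rank per occurrence)
# segment starts [0 2 4]; reduceat sums -> [1 2 3]; divided by counts -> [0.5 1. 1.5]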
I have a strange bug in this naive code for Gram-Schmidt orthogonalization that I wrote with numpy (v1.21.2).
def gram_schmidt(basis):
    """
    Compute orthogonal basis from given basis.
    Parameters:
        basis <- an mxn array-like of vectors in m-space
    Returns:
        orthogonalized basis as array of shape (m,n)
    """
    m, n = basis.shape
    orthobasis = np.zeros_like(basis)
    quadrances = np.ones(n)
    for i, b in enumerate(basis.T):
        # initial iteration
        if i == 0:
            orthobasis[:, i] = b
            quadrances[i] = b.T @ b
            continue
        # subsequent iterations
        # get the orthogonal complement vector as the next basis vector in the set
        # 1: project onto subspace generated by basis accumulated so far
        P = orthobasis[:, :i]
        proj_b = P @ ((P.T @ b) / quadrances[:i])
        # 2: get the orthogonal vector to the projection
        e = b - proj_b
        # 3: add this orthogonal vector to the basis being constructed
        orthobasis[:, i] = e
        # 4: add the quadrance of the new vector
        quadrances[i] = e.T @ e
        print(f"**iteration {i}**\nb = {b}\nP = {P}\nproj_b = {proj_b}\ne = {e}\northobasis =\n{orthobasis}")
        print("-" * 100)
    return orthobasis
When I run this code with the following input:
basis = np.stack([np.array(v) for v in
                  [[1,1,1,1],
                   [0,1,1,1],
                   [0,0,1,1]]],
                 axis=1)
gram_schmidt(basis)
I get the output:
**iteration 1**
b = [0 1 1 1]
P = [[1]
[1]
[1]
[1]]
proj_b = [0.75 0.75 0.75 0.75]
e = [-0.75 0.25 0.25 0.25]
orthobasis =
[[1 0 0]
[1 0 0]
[1 0 0]
[1 0 0]]
----------------------------------------------------------------------------------------------------
**iteration 2**
b = [0 0 1 1]
P = [[1 0]
[1 0]
[1 0]
[1 0]]
proj_b = [0.5 0.5 0.5 0.5]
e = [-0.5 -0.5 0.5 0.5]
orthobasis =
[[1 0 0]
[1 0 0]
[1 0 0]
[1 0 0]]
So the unexpected behavior is that step 3 in the gram_schmidt function does not assign the computed array e to the i-th column of orthobasis. You can ignore the correctness of gram_schmidt, as the problem is related to array assignment: the first iteration assigns successfully to the first column, yet the subsequent iterations silently fail. What is even more odd is that if I change e in step 3 to a hard-coded array (of a shape conforming to the columns of orthobasis), it assigns successfully! In the output you can see that starting from the second iteration (iteration 1), there is no change to the columns of orthobasis; they are always 0!
I am so confused and have lost hours debugging this. I appreciate any assistance!
Each numpy array has a specified data type and can contain elements only of this type. In your example basis is an array of integers. Then you are initializing orthobasis using
orthobasis = np.zeros_like(basis)
This makes the data type of orthobasis the same as basis, i.e. it is an array of integers too. When you try to modify entries of this array by assigning to them float values, these floats are first converted to integers by rounding them toward 0. Thus the array [-0.5 -0.5 0.5 0.5] becomes an array of zeros.
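A minimal demonstration of that silent truncation (the array name z is chosen just for illustration), using the e values from iteration 1:
import numpy as np

z = np.zeros(4, dtype=int)          # same situation as np.zeros_like(basis)
z[:] = [-0.75, 0.25, 0.25, 0.25]    # floats are truncated toward 0 on assignment
print(z)                            # [0 0 0 0]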
To fix it, you can initialize orthobasis as an array of floats:
orthobasis = np.zeros_like(basis, dtype=float)
I am given two numpy arrays: one of dimensions i x m and the other of dimensions j x m. What I want to do is loop through FirstArray and compare each of its elements with each of the elements of SecondArray. When I say 'compare', I mean: I want to compute the Euclidean distance between the elements of FirstArray and SecondArray. Then I want to store the index of the element of SecondArray that is closest to the corresponding element of FirstArray, and I also want to store the index of the element of SecondArray that is second closest to the element of FirstArray.
In code this would look somewhat similar to this:
smallest = None
idx = 0
for i in range(0, FirstArrayRows):
    for j in range(0, SecondArrayRows):
        EuclideanDistance = np.sqrt(np.sum(np.square(FirstArray[i,:] - SecondArray[j,:])))
        if smallest is None or EuclideanDistance < smallest:
            smallest = EuclideanDistance
            idx_second = idx
            idx = j
    Closest[i] = idx
    SecondClosest[i] = idx_second
And I think this works. However, there are two cases when this code fails to give the correct index for the second closest element of SecondArray:
when the element of SecondArray that is closest to the element of FirstArray is at j = 0.
when the element of SecondArray that is closest to the element of FirstArray is at j = 1.
So I wonder: Is there a better way of implementing this?
I know there is. Maybe someone can help me see it?
You could use numpy's broadcasting to your advantage. Compute the Euclidean distance with all elements of the second array in a single operation. Then, you can find the two smallest distances using argpartition.
import numpy as np
i, j, m = 3, 4, 5
a = np.random.choice(10,(i,m))
b = np.random.choice(10,(j,m))
print('First array:\n',a)
print('Second array:\n',b)
closest, second_closest = np.zeros(i), np.zeros(i)
for i in range(a.shape[0]):
    dist = np.sqrt(((a[i,:] - b)**2).sum(axis=1))
    # kth=1 guarantees the smallest distance lands at position 0
    # and the second smallest at position 1
    closest[i], second_closest[i] = np.argpartition(dist, 1)[:2]
print('Closest:', closest)
print('Second Closest:', second_closest)
Output:
First array:
[[3 9 0 2 2]
[1 2 9 9 7]
[4 0 6 6 4]]
Second array:
[[9 9 2 2 3]
[9 9 0 2 3]
[1 1 6 7 7]
[5 7 0 4 4]]
Closest: [3. 2. 2.]
Second Closest: [1. 3. 3.]
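For reference, the per-row loop can itself be removed by broadcasting both arrays against each other; a minimal sketch, reusing the a and b from above:
# pairwise distances of shape (i, j): broadcast rows of a against rows of b
dist = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))
# kth=1 puts the two smallest distances, in order, at the front of each row
nearest_two = np.argpartition(dist, 1, axis=1)[:, :2]
closest, second_closest = nearest_two[:, 0], nearest_two[:, 1]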
I have a numpy.ndarray called grouping of size (S, N). Each row of grouping gives me the group labels of a sample of data. I run my algorithm S times and get new group labels in each iteration.
I want to determine how many times each sample of my data has the same group label as every other sample of my data across the S iterations in a fully vectorized way.
In a not-completely-vectorized way:
sim_matrix = np.zeros((N, N))
for s in range(S):
    sim_matrix += np.equal.outer(grouping[s, :], grouping[s, :])
One vectorized approach would be with broadcasting -
(grouping[:,None,:] == grouping[:,:,None]).sum(0)
For performance, we can use np.count_nonzero -
np.count_nonzero(grouping[:,None,:] == grouping[:,:,None],axis=0)
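A quick sanity check on a tiny, arbitrarily chosen array, comparing the loop with the broadcast version:
import numpy as np

grouping = np.array([[0, 0, 1],
                     [1, 1, 1]])    # S=2 iterations, N=3 samples
S, N = grouping.shape

sim_loop = np.zeros((N, N))
for s in range(S):
    sim_loop += np.equal.outer(grouping[s, :], grouping[s, :])

sim_bcast = (grouping[:, None, :] == grouping[:, :, None]).sum(0)
print(sim_bcast)
# [[2 2 1]
#  [2 2 1]
#  [1 1 2]]
print(np.array_equal(sim_loop, sim_bcast))   # True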
The sum of equal.outer is a cryptic way of calculating all-pairs similarity of columns:
sum_i sum_jk (A[i,j] == A[i,k]) is the same as
sum_jk sum_i (A[i,j] == A[i,k])
where sum_i loops over rows, sum_jk over all pairs of columns.
Comparing two vectors by counting the number of positions where they differ is called the Hamming distance. If we change == above to != and convert similarity to distance = nrows - similarity (most similar ⇔ distance 0), we get the problem: find the Hamming distance between all pairs of a bunch of vectors:
def allpairs_hamming(A, dtype=np.uint32):
    """ -> Hamming distances between all pairs of rows of A """
    nrow, ncol = A.shape
    allpair_dist = np.zeros([nrow, nrow], dtype=dtype)
    for j in range(nrow):
        for k in range(j + 1, nrow):
            allpair_dist[j, k] = allpair_dist[k, j] = (A[j] != A[k]).sum()  # row diff
    return allpair_dist
allpairs_hamming: 30.7 sec, 3 ns per cmp Nvec 2000 Veclen 5000 A 10m pairdist uint32 15m
Almost all the cpu time is in the row diff, not in the outer loop for j ... for k -- 3 ns per scalar compare, on a stock mac, isn't bad.
However memory caching is much faster if each row A[j] is in contiguous memory,
as for numpy C-order arrays.
Apart from that, whether you do "all pairs of rows" or "all pairs of columns"
doesn't matter, as long as you're clear.
(Is it possible to find "nearby" pairs in time and space < O(npairs), here O(20000^2) ? Afaik there are more methods than test cases.)
See also:
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html (bug: hamming is the .mean, not the .sum; see the sketch below)
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html
https://stats.stackexchange.com/search?q=[clustering]+pairwise
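For instance, a sketch with scipy (note that pdist's "hamming" is normalized, i.e. the fraction of differing positions, so scale by the number of columns to recover counts; the array sizes here reuse Nvec/Veclen from the benchmark above):
import numpy as np
from scipy.spatial.distance import pdist, squareform

A = np.random.randint(0, 3, (2000, 5000))
# pdist returns the condensed distance matrix; squareform expands it to (nrow, nrow)
allpair_dist = squareform(pdist(A, metric="hamming")) * A.shape[1]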
You want to compare identical rows. A way to do that is to view each entire row as a single raw block:
S, N = 12, 2
a = np.random.randint(0, 3, (S, N))   # 12 samples of two labels
# a:
#      0  1
# 0    2  2
# 1    2  0
# 2    1  2
# 3    0  0
# 4    0  1
# 5    1  1
# 6    0  1
# 7    0  1
# 8    0  1
# 9    0  0
# 10   2  2
# 11   0  0
samples = np.ascontiguousarray(a).view(np.dtype((np.void, a.strides[0])))
samples.shape is then (S, 1). You can now inventory your samples with np.unique, and use a pandas DataFrame for a pretty report:
import pandas as pd

_, inds, invs = np.unique(samples, return_index=True, return_inverse=True)
df = pd.DataFrame(invs)
result = df.reset_index().groupby(0)['index'].apply(list).to_frame()
result['sample'] = [list(x) for x in a[inds]]
which gives:
          index  sample
0
0    [3, 9, 11]  [0, 0]
1  [4, 6, 7, 8]  [0, 1]
2           [5]  [1, 1]
3           [2]  [1, 2]
4           [1]  [2, 0]
5       [0, 10]  [2, 2]
This can be O(S ln S) if there are few matches between samples, whereas yours is O(N²S).
I have an array :
a = np.array([1,2,3,4,5,6,7,8])
The array may be reshaped to a = np.array([[1,2,3,4],[5,6,7,8]]), whatever is more convenient.
Now, I have an array :
b = np.array([[11,22,33,44], [55,66,77,88]])
I want to replace each of these elements with the corresponding element from a.
The a array will always hold as many elements as b has.
So, array b will be:
[1,2,3,4], [5,6,7,8]
Note that I must keep each b subarray's dimension at (4,); I don't want to change it. So idx will take values from 0 to 3, and the replacement happens in groups of four values. I am struggling with reshape, split, mask, etc., and I can't figure out a way to do it.
import numpy as np
#a = np.array([[1,2,3,4],[5,6,7,8]])
a = np.array([1,2,3,4,5,6,7,8])
b = np.array([[11,22,33,44], [55,66,77,88]])
for arr in b:
    for idx, x in enumerate(arr):
        ...  # replace every arr[idx] with the corresponding value from a
For your current case, what you want is probably:
b, c = list(a.reshape(2, -1))
This isn't the cleanest, but it is a one-liner. Turn your 1D array into a 2D array with the first dimension as 2 using reshape(2, -1); then list splits it along the first dimension so you can directly assign the rows to b and c.
You can also do it with the specialty function numpy.split
b, c = np.split(a, 2)
EDIT: Based on the accepted solution, a vectorized way to do this is
b = a.reshape(b.shape)
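Note that this rebinds the name b to a new array rather than writing into the existing one; if b must be modified in place (so that other references to it see the new values), a slicing assignment does that:
b[:] = a.reshape(b.shape)   # writes the values into the existing b array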
The following worked for me:
i = 0
for arr in b:
    for idx, x in enumerate(arr):
        arr[idx] = a[i]
        print(arr[idx])
        i += 1
Output (arr[idx]): 1 2 3 4 5 6 7 8
If you type print(b) it'll output [[1 2 3 4] [5 6 7 8]]
b = a[:len(a)//2]
c = a[len(a)//2:]
Well, I'm quite new to Python but this worked for me:
for i in range(0, len(a)//2):
    b[i] = a[i]
for i in range(len(a)//2, len(a)):
    c[i-4] = a[i]
by printing the 3 arrays I have the following output:
[1 2 3 4 5 6 7 8]
[1 2 3 4]
[5 6 7 8]
But I would go for Daniel's solution (the split one): a one-liner using the numpy API.
I am trying to stack arrays horizontally using numpy hstack, but can't get it to work. Everything comes out in one flat array, instead of a 'matrix-looking' 2D array.
import numpy as np
y = np.array([0,2,-6,4,1])
y_bool = y > 0
y_bool = [1 if l == True else 0 for l in y_bool] #convert to decimals for classification
y_range = list(range(0, len(y)))
print(y)
print(y_bool)
print(y_range)
print(np.hstack((y, y_bool, y_range)))
Prints this:
[ 0 2 -6 4 1]
[0, 1, 0, 1, 1]
[0, 1, 2, 3, 4]
[ 0 2 -6 4 1 0 1 0 1 1 0 1 2 3 4]
How do I instead get the last line to look like this:
[ 0 0 0
  2 1 1
 -6 0 2
  4 1 3
  1 1 4]
If you want to create a 2D array, do:
print(np.transpose(np.array((y, y_bool, y_range))))
# [[ 0 0 0]
# [ 2 1 1]
# [-6 0 2]
# [ 4 1 3]
# [ 1 1 4]]
Well, close enough: the h is for horizontal/column-wise. If you check its help, you will see under See Also:
vstack : Stack arrays in sequence vertically (row wise).
dstack : Stack arrays in sequence depth wise (along third axis).
concatenate : Join a sequence of arrays together.
Edit: At first I thought vstack does it, but it would have to be np.vstack(...).T or np.dstack(...).squeeze(). Other than that, the "problem" is that the arrays are 1D and you want them to act like 2D, so you could do:
print(np.hstack([np.asarray(a)[:, np.newaxis] for a in (y, y_bool, y_range)]))
The np.asarray is there just in case one of the variables is a list. The np.newaxis makes them 2D, which makes it clearer what happens when concatenating.
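Putting it together on the data above (a sketch using astype in place of the list comprehension; the expected output is shown in comments):
import numpy as np

y = np.array([0, 2, -6, 4, 1])
y_bool = (y > 0).astype(int)
y_range = np.arange(len(y))

print(np.hstack([a[:, np.newaxis] for a in (y, y_bool, y_range)]))
# [[ 0  0  0]
#  [ 2  1  1]
#  [-6  0  2]
#  [ 4  1  3]
#  [ 1  1  4]]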