Summing sparse matrix rows by column groups

I have a scipy sparse matrix in coo format:
import numpy as np
from scipy.sparse import coo_matrix
data = np.asarray([[1, 0, 0], [.8, .2, 0], [0, 1, 0], [0.4, 0.3, 0.3]])
data
array([[1. , 0. , 0. ],
[0.8, 0.2, 0. ],
[0. , 1. , 0. ],
[0.4, 0.3, 0.3]])
sparse_matrix = coo_matrix(data)
For each column I have a cluster assignment, and I would like to sum the entries of each row grouped by the cluster assignment of their columns. During this operation I would like to stay in sparse format for memory reasons.
Example:
labels = ["a", "b", "b"]
Expected output:
1, 0
.8, .2
0, 1
.4, .6

It could be approached the same way as with dense arrays: for each group, select the desired columns and sum them, then collect the results.
In [2]: data = np.asarray([[1, 0, 0], [.8, .2, 0], [0, 1, 0], [0.4, 0.3, 0.3]])
In [3]: M = sparse.csc_matrix(data)
In [4]: M
Out[4]:
<4x3 sparse matrix of type '<class 'numpy.float64'>'
with 7 stored elements in Compressed Sparse Column format>
In [5]: M.toarray()
Out[5]:
array([[1. , 0. , 0. ],
[0.8, 0.2, 0. ],
[0. , 1. , 0. ],
[0.4, 0.3, 0.3]])
In [6]: M[:,[0]].sum(axis=1)
Out[6]:
matrix([[1. ],
[0.8],
[0. ],
[0.4]])
In [7]: M[:,[1,2]].sum(axis=1)
Out[7]:
matrix([[0. ],
[0.2],
[1. ],
[0.6]])
In [8]: res = np.concatenate((Out[6], Out[7]), axis=1)
In [9]: res
Out[9]:
matrix([[1. , 0. ],
[0.8, 0.2],
[0. , 1. ],
[0.4, 0.6]])
Note that the sum produces a dense np.matrix. sparse does this routinely, I think, because such summations are always denser than the source. A sum will be 0 only if all the elements are 0 (except for the rare case of a bunch of nonzeros canceling each other out).
Since the column indexing and sum are both implemented as matrix products, it might be possible to speed up the process a bit by constructing a matrix that does both actions at once. But that's an implementation detail.
Indexing of sparse matrices is pretty slow (compared to dense ones).
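The matrix-product idea mentioned above can be sketched as follows. This is only a minimal sketch, not part of the original answer: the indicator matrix G and the names group_ids and grouped are assumptions introduced for illustration. G has one row per column of M and one column per group, so M @ G adds up the columns of each group in a single product, and the result stays sparse.

```python
import numpy as np
from scipy import sparse

data = np.asarray([[1, 0, 0], [.8, .2, 0], [0, 1, 0], [0.4, 0.3, 0.3]])
M = sparse.coo_matrix(data).tocsr()
labels = ["a", "b", "b"]

# Map each label to an integer group id ("a" -> 0, "b" -> 1).
groups, group_ids = np.unique(labels, return_inverse=True)

# Hypothetical indicator matrix G of shape (n_columns, n_groups):
# G[j, g] = 1 when column j of M belongs to group g.
n_cols = M.shape[1]
G = sparse.csr_matrix(
    (np.ones(n_cols), (np.arange(n_cols), group_ids)),
    shape=(n_cols, len(groups)),
)

# One sparse product selects and sums the columns of every group.
grouped = M @ G
print(grouped.toarray())
```

The columns of grouped follow the sorted order of the unique labels, so groups tells you which output column belongs to which label.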

Related

How to add a 4X4 matrix values into a 6x6 matrix using numpy

Suppose I have multiple 4x4 matrices that I want to add into a final 6x6 zero matrix, with some of their values going to designated coordinates. How would I do this? I thought of adding slices to an np.zeros 6x6 matrix, but I believe this may be quite tedious.
Matrix 1 would go to the first position and matrix 2 to position 2. The two placed matrices would then be added to form the final position matrix.
import numpy as np
from math import sqrt
# Element 1
C_1= 3/5
S_1= 4/5
matrix_1 = np.matrix([[C_1**2, C_1*S_1,-C_1**2,-C_1*S_1],[C_1*S_1,S_1**2,-C_1*S_1,-S_1**2],
[-C_1**2,-C_1*S_1,C_1**2,C_1*S_1],[-C_1*S_1,-S_1**2,C_1*S_1,S_1**2]])
empty_mat1 = np.zeros((6,6))
empty_mat1[0:4 , 0:4] = empty_mat1[0:4 ,0:4] + matrix_1
#print(empty_mat1)
# Element 2
C_2 = 0
S_2 = 1
matrix_2 = 1.25*np.matrix([[C_2**2, C_2*S_2,-C_2**2,-C_2*S_2],[C_2*S_2,S_2**2,-C_2*S_2,-S_2**2],
[-C_2**2,-C_2*S_2,C_2**2,C_2*S_2],[-C_2*S_2,-S_2**2,C_2*S_2,S_2**2]])
empty_mat2 = np.zeros((6,6))
empty_mat2[0:2,0:2] = empty_mat2[0:2,0:2] + matrix_2[0:2,0:2]
empty_mat2[4:6,0:2] = empty_mat2[4:6,0:2] + matrix_2[2:4,0:2]
empty_mat2[0:2,4:6] = empty_mat2[0:2,4:6] + matrix_2[0:2,2:4]
empty_mat2[4:6,4:6] = empty_mat2[4:6,4:6] + matrix_2[2:4,2:4]
print(empty_mat1+empty_mat2)

Adding two arrays of different dimensions is a little bit tricky with NumPy.
However, with array slicing, you could do it with the following "rustic" method:
Let M1 and M2 be your two input arrays, M3 (from M1) and M4 (from M2) your temporary arrays, and M5 the final array:
# Initialisation
M1 = np.array([[ 0.36, 0.48, -0.36, -0.48], [ 0.48, 0.64, -0.48, -0.64], [ -0.36, -0.48, 0.36, 0.48], [-0.48, -0.64, 0.48, 0.64]])
M2 = np.array([[ 0, 0, 0, 0], [ 0, 1.25, 0, -1.25], [ 0, 0, 0, 0], [ 0, -1.25, 0, 1.25]])
M3, M4 = np.zeros((6, 6)), np.zeros((6, 6))
#M3 and M4 operations
M3[0:4, 0:4] = M1[0:4, 0:4] + M3[0:4, 0:4]
M4[0:2, 0:2] = M2[0:2, 0:2]
M4[0:2, 4:6] = M2[0:2, 2:4]
M4[4:6, 0:2] = M2[2:4, 0:2]
M4[4:6, 4:6] = M2[2:4, 2:4]
#Final operation
M5 = M3+M4
print(M5)
Output :
[[ 0.36 0.48 -0.36 -0.48 0. 0. ]
[ 0.48 1.89 -0.48 -0.64 0. -1.25]
[-0.36 -0.48 0.36 0.48 0. 0. ]
[-0.48 -0.64 0.48 0.64 0. 0. ]
[ 0. 0. 0. 0. 0. 0. ]
[ 0. -1.25 0. 0. 0. 1.25]]
Have a good day.

You will need to encode some way of where your 4x4 matrices end up in the final 6x6 matrix. Suppose you have N (=2 in your case) such 4x4 matrices. You can then define two new arrays (shape Nx4) that denote the row and col indices of the final 6x6 matrix that you want your 4x4 matrices to end up in. Finally, you use fancy indexing and broadcasting to build up a Nx6x6 array which you can sum over. Your example:
import numpy as np
N = 2
arr = np.array([[
[0.36, 0.48, -0.36, -0.48],
[0.48, 0.64, -0.48, -0.64],
[-0.36, -0.48, 0.36, 0.48],
[-0.48, -0.64, 0.48, 0.64],
], [
[0, 0, 0, 0],
[0, 1.25, 0, -1.25],
[0, 0, 0, 0],
[0, -1.25, 0, 1.25],
]])
rows = np.array([
[0, 1, 2, 3],
[0, 1, 4, 5]
])
cols = np.array([
[0, 1, 2, 3],
[0, 1, 4, 5]
])
i = np.arange(N)
out = np.zeros((N, 6, 6))
out[
i[:, None, None],
rows[:, :, None],
cols[:, None, :]
] = arr
out = out.sum(axis=0)
Gives as output:
array([[ 0.36, 0.48, -0.36, -0.48, 0. , 0. ],
[ 0.48, 1.89, -0.48, -0.64, 0. , -1.25],
[-0.36, -0.48, 0.36, 0.48, 0. , 0. ],
[-0.48, -0.64, 0.48, 0.64, 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , -1.25, 0. , 0. , 0. , 1.25]])
If you want even more control over where each row/col ends up, you can pull off some more trickery as follows:
rows = np.array([
[1, 2, 3, 4, 0, 0],
[1, 2, 0, 0, 3, 4]
])
cols = np.array([
[1, 2, 3, 4, 0, 0],
[1, 2, 0, 0, 3, 4]
])
i = np.arange(N)
out = np.pad(arr, ((0, 0), (1, 0), (1, 0)))[
i[:, None, None],
rows[:, :, None],
cols[:, None, :]
].sum(axis=0)
which has the same output. This would allow you to shuffle the rows/cols of arr by shuffling the values 1-4 in the rows, cols arrays. I would prefer option 1 though.

I probably should wait for you to correct your question, but I'll go ahead and give you some code - yes, in the most tedious form - based on your images
res = np.zeros((6,6))
# arr1, arr2 are (4,4) arrays
res[:4, :4] += arr1
idx = np.array([0,1,4,5])
res[idx[:,None], idx] += arr2
The first is contiguous block, so the 2 slices are enough.
The second is split up, so I'm using advanced indexing.
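For reference, here is a runnable version of the same idea. It is a sketch under assumptions: arr1 and arr2 are filled with the values from the question, and the placements list is a hypothetical helper structure that pairs each 4x4 block with the rows/columns it maps onto. np.ix_(idx, idx) builds the same index mesh as idx[:,None], idx.

```python
import numpy as np

# Values taken from the question's two element matrices.
arr1 = np.array([[0.36, 0.48, -0.36, -0.48],
                 [0.48, 0.64, -0.48, -0.64],
                 [-0.36, -0.48, 0.36, 0.48],
                 [-0.48, -0.64, 0.48, 0.64]])
arr2 = np.array([[0, 0, 0, 0],
                 [0, 1.25, 0, -1.25],
                 [0, 0, 0, 0],
                 [0, -1.25, 0, 1.25]])

# Hypothetical bookkeeping: each block with the rows/columns of the
# 6x6 result that it should be added to.
placements = [
    (arr1, np.array([0, 1, 2, 3])),
    (arr2, np.array([0, 1, 4, 5])),
]

res = np.zeros((6, 6))
for block, idx in placements:
    # Indices within one idx are unique, so += adds every entry exactly once.
    res[np.ix_(idx, idx)] += block
print(res)
```

With more blocks, only the placements list grows; the loop body stays the same.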

Combining two multi-dimensional numpy arrays when one of them encodes the index information, the other encodes the array content

Here is the toy version of the problem I am facing:
Given the following two numpy arrays:
img = np.array([[0,1,1,2], [0,2,1,1]])
number = np.array([[0,0.1,0.1,0.2], [0.1,0,0.2,0.2]])
both img and number are 2 by 4 NumPy arrays. You can think of it as 2 participants in a study and 4 trials per participant. img encodes which image is presented at each trial, so its element is always an integer (0, 1, or 2) representing an image ID (image #0, #1, or #2), and there are in total 3 candidate images. Each image may occur more than once for each participant as shown in the example.
number is also a 2 by 4 NumPy array which encodes some numeric quantity corresponding to each image. You can think of it as a number presented to the participant above the image.
Within each participant, the number and image are uniquely paired. For example, img[0,1]=img[0,2]=1 means the first participant sees the same image (image #1) in the second and the third trial. Then it must follow that number[0,1]=number[0,2]. However, for the second participant, the pairing may change. While image #1 is paired with 0.1 for the first participant, it is instead paired with 0.2 for the second. The end product I want is something like the following:
goal = np.array([[[0,0],[0.1, 0.1],[0.2, 0.2]], [[0.1,0.1],[0.2, 0.2],[0, 0]]])
The goal is a 2x3x2 NumPy array. 2 again means the 2 participants, 3 means the total amount of unique images used. In this example, 3 unique images are indexed by 0,1, and 2. The third dimension 2 is just repeating the same digit twice, which I do need. Can someone think of a way of doing this in a purely vectorized fashion?
Here is how I would do it using for loop (not exactly syntactically correct):
goal = np.empty((2,3,2))
img = extract_first_occurance_of_each_element(img)
number = extract_first_occurance_of_each_element(number)
for subj in range(subjects):
    for trial in range(3):
        img_idx = img[subj, trial]
        goal[subj, img_idx,:] = [number[subj, trial], number[subj, trial]]

Your iteration - cleaned up a bit:
In [7]: goal = np.zeros((2,3,2))
   ...: for subj in range(2):
   ...:     for trial in range(3):
   ...:         img_idx = img[subj, trial]
   ...:         goal[subj, img_idx,:] = [number[subj, trial], number[subj, trial]]
In [8]: goal
Out[8]:
array([[[0. , 0. ],
[0.1, 0.1],
[0. , 0. ]],
[[0.1, 0.1],
[0.2, 0.2],
[0. , 0. ]]])
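Not quite the same as the stated target (the loop only visits the first three trials, so image #2 for the first participant, shown in the fourth trial, is never filled in), but close enough: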
In [9]: np.array([[[0,0],[0.1, 0.1],[0.2, 0.2]], [[0.1,0.1],[0.2, 0.2],[0, 0]]])
Out[9]:
array([[[0. , 0. ],
[0.1, 0.1],
[0.2, 0.2]],
[[0.1, 0.1],
[0.2, 0.2],
[0. , 0. ]]])
With multidimensional indexing:
In [26]: subj=np.arange(2)[:,None]; trial=np.arange(3)
In [27]: img_idx = img[subj, trial]; img_idx
Out[27]:
array([[0, 1, 1],
[0, 2, 1]])
In [28]: goal = np.zeros((2,3,2))
In [29]: goal[subj,img_idx]=np.stack([number[subj,trial], number[subj, trial]], axis=2)
In [30]: goal
Out[30]:
array([[[0. , 0. ],
[0.1, 0.1],
[0. , 0. ]],
[[0.1, 0.1],
[0.2, 0.2],
[0. , 0. ]]])
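The indexing above only covers the first three trials. A possible variant that uses every trial, and reproduces the target stated in the question, is sketched below. It assumes the number of images (n_images = 3) is known in advance; goal_all is a name introduced here for illustration. Repeated image IDs within a participant are harmless because the question guarantees they carry the same number.

```python
import numpy as np

img = np.array([[0, 1, 1, 2], [0, 2, 1, 1]])
number = np.array([[0, 0.1, 0.1, 0.2], [0.1, 0, 0.2, 0.2]])

n_subj, n_trials = img.shape
n_images = 3  # assumption: image IDs are 0, 1 and 2

# subj has shape (2, 1) and broadcasts against img of shape (2, 4).
subj = np.arange(n_subj)[:, None]

goal_all = np.zeros((n_subj, n_images, 2))
# number[:, :, None] has shape (2, 4, 1); it broadcasts over the last
# axis of length 2, which repeats each value twice.
goal_all[subj, img] = number[:, :, None]
print(goal_all)
```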

Numpy: How to stack a single array into each row of a bigger array and turn it into a 2D array?

I have a numpy array named heartbeats with 100 rows. Each row has 5 elements.
I also have a single array named time_index with 5 elements.
I need to prepend the time index to each row of heartbeats.
heartbeats = np.array([
[-0.58, -0.57, -0.55, -0.39, -0.40],
[-0.31, -0.31, -0.32, -0.46, -0.46]
])
time_index = np.array([-2, -1, 0, 1, 2])
What I need:
array([-2, -0.58],
[-1, -0.57],
[0, -0.55],
[1, -0.39],
[2, -0.40],
[-2, -0.31],
[-1, -0.31],
[0, -0.32],
[1, -0.46],
[2, -0.46])
I only wrote two rows of heartbeats to illustrate.

Assuming you are using numpy, the exact output array you are looking for can be made by stacking a repeated version of time_index with the raveled version of heartbeats:
np.stack((np.tile(time_index, len(heartbeats)), heartbeats.ravel()), axis=-1)

Another approach, using broadcasting
In [13]: heartbeats = np.array([
...: [-0.58, -0.57, -0.55, -0.39, -0.40],
...: [-0.31, -0.31, -0.32, -0.46, -0.46]
...: ])
...: time_index = np.array([-2, -1, 0, 1, 2])
Make a target array:
In [14]: res = np.zeros(heartbeats.shape + (2,), heartbeats.dtype)
In [15]: res[:,:,1] = heartbeats # insert a (2,5) into a (2,5) slot
In [17]: res[:,:,0] = time_index[None] # insert a (5,) into a (2,5) slot
In [18]: res
Out[18]:
array([[[-2. , -0.58],
[-1. , -0.57],
[ 0. , -0.55],
[ 1. , -0.39],
[ 2. , -0.4 ]],
[[-2. , -0.31],
[-1. , -0.31],
[ 0. , -0.32],
[ 1. , -0.46],
[ 2. , -0.46]]])
and then reshape to 2d:
In [19]: res.reshape(-1,2)
Out[19]:
array([[-2. , -0.58],
[-1. , -0.57],
[ 0. , -0.55],
[ 1. , -0.39],
[ 2. , -0.4 ],
[-2. , -0.31],
[-1. , -0.31],
[ 0. , -0.32],
[ 1. , -0.46],
[ 2. , -0.46]])
[17] takes a (5,), expands it to (1,5), and then to (2,5) for the insert. Read up on broadcasting.
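The two expansion steps can be made visible with np.broadcast_to. This is a small illustrative sketch; the names expanded and broadcast are introduced here and are not part of the answer.

```python
import numpy as np

heartbeats = np.array([
    [-0.58, -0.57, -0.55, -0.39, -0.40],
    [-0.31, -0.31, -0.32, -0.46, -0.46],
])
time_index = np.array([-2, -1, 0, 1, 2])

# Step 1: add a leading axis, (5,) -> (1, 5).
expanded = time_index[None]
# Step 2: stretch the length-1 axis to match the target, (1, 5) -> (2, 5).
# broadcast_to returns a read-only view; no data is copied.
broadcast = np.broadcast_to(expanded, heartbeats.shape)
print(expanded.shape, broadcast.shape)

# Pair each time index with its heartbeat value along a new last axis,
# then flatten the first two axes into rows.
res = np.stack((broadcast, heartbeats), axis=-1).reshape(-1, 2)
print(res)
```

The assignment in [17] performs both steps implicitly, which is why time_index[None] can be written straight into the (2,5) slot.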

Alternatively, you can repeat time_index with np.concatenate, once per row of heartbeats:
concatenated = np.concatenate([time_index] * heartbeats.shape[0])
# [-2 -1 0 1 2 -2 -1 0 1 2]
# result = np.dstack((concatenated, heartbeats.reshape(-1))).squeeze()
result = np.array([concatenated, heartbeats.reshape(-1)]).T
Using np.concatenate may be faster than np.tile. This solution is faster than Mad Physicist's, but the fastest is broadcasting, as in hpaulj's answer.
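Timings depend on array size, NumPy version and hardware, so it is worth measuring on your own data. The sketch below is one way to do that with timeit; the function names and the random 100x5 test array are assumptions made for illustration, and no particular ranking is implied.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
heartbeats = rng.normal(size=(100, 5))  # stand-in for the real data
time_index = np.array([-2, -1, 0, 1, 2])

def with_tile():
    return np.stack((np.tile(time_index, len(heartbeats)), heartbeats.ravel()), axis=-1)

def with_concatenate():
    concatenated = np.concatenate([time_index] * heartbeats.shape[0])
    return np.array([concatenated, heartbeats.reshape(-1)]).T

def with_broadcasting():
    res = np.zeros(heartbeats.shape + (2,), heartbeats.dtype)
    res[:, :, 1] = heartbeats
    res[:, :, 0] = time_index[None]
    return res.reshape(-1, 2)

# Check that the three approaches agree before comparing their speed.
results_agree = (
    np.allclose(with_tile(), with_concatenate())
    and np.allclose(with_tile(), with_broadcasting())
)
print("results agree:", results_agree)

for fn in (with_tile, with_concatenate, with_broadcasting):
    seconds = timeit.timeit(fn, number=1000)
    print(f"{fn.__name__}: {seconds:.4f} s for 1000 calls")
```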

Matrix created from a function, and concatenated column vector of the matrix

We have a function f(x,y). We want to calculate the matrix Bij = f(xi,xj) = f(ih,jh) for 1 <= i,j <= n, with h = 1/(n+1).
If f(x,y)=x+y, then Bij = ih+jh and the matrix becomes (here, n=3, so h=0.25):
0.5, 0.75, 1
0.75, 1, 1.25
1, 1.25, 1.5
I would like to program a function that calculates the column vector b concatenating all the columns of Bij. With the previous example, b would be the 9x1 column vector with entries 0.5, 0.75, 1, 0.75, 1, 1.25, 1, 1.25, 1.5.
Here is what I have done so far; the function and n can be changed. Here f(x,y)=x+y:
import numpy as np
n=3
def f(i,j):
    h=1.0/(n+1)
    a=((i+1)*h)+((j+1)*h)
    return a
B = np.fromfunction(f,(n,n))
print(B)
But I don't know how to build the vector b. With
np.concatenate((B[:,0],B[:,1],B[:,2]))
I get a row vector, not a column vector. Could you help me? I'm a beginner in Python.

The ravel function along with a new axis should do the trick:
import numpy as np
x = np.array([[0.5, 0.75, 1],
[0.75, 1, 1.25],
[1, 1.25, 1.5]])
x.T.ravel()[:, np.newaxis]
# array([[ 0.5 ],
# [ 0.75],
# [ 1. ],
# [ 0.75],
# [ 1. ],
# [ 1.25],
# [ 1. ],
# [ 1.25],
# [ 1.5 ]])
Ravel stitches together all the rows, so we first transpose the matrix (with .T) to get the columns in order. The result is a 1-D array, which we turn into a column vector by adding a new axis.
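As a possible alternative, reshape with order="F" reads the matrix column by column, which avoids the explicit transpose. The sketch below also rebuilds B from f with 1-based indices; the use of np.indices and the names i, j and b are assumptions made for illustration.

```python
import numpy as np

n = 3
h = 1.0 / (n + 1)

def f(x, y):
    return x + y

# Grid of 1-based indices: i varies along rows, j along columns.
i, j = np.indices((n, n)) + 1
B = f(i * h, j * h)

# order="F" walks down each column in turn; the shape (-1, 1)
# makes the result a column vector with n*n rows.
b = B.reshape(-1, 1, order="F")
print(b.shape)
print(b)
```

To use another function, only the body of f needs to change, provided it works element-wise on NumPy arrays.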

import numpy as np
# create sample matrix `m`
m = np.matrix([[0.5, 0.75, 1], [0.75, 1, 1.25], [1, 1.25, 1.5]])
# convert matrix `m` to a 'flat' matrix
m_flat = m.flatten()
print(m_flat)
# `m_flat` is still a matrix, in case you need an array:
m_flat_arr = np.squeeze(np.asarray(m_flat))
print(m_flat_arr)
The snippet uses .flatten(), np.asarray() and np.squeeze() to convert the original matrix m, which is
matrix([[ 0.5 , 0.75, 1. ],
[ 0.75, 1. , 1.25],
[ 1. , 1.25, 1.5 ]])
into an array m_flat_arr of:
array([ 0.5 , 0.75, 1. , 0.75, 1. , 1.25, 1. , 1.25, 1.5 ])

Cosine Similarity

I was reading and came across this formula:
The formula is for cosine similarity. I thought this looked interesting and I created a numpy array that has user_id as row and item_id as column. For instance, let M be this matrix:
M = [[2,3,4,1,0],[0,0,0,0,5],[5,4,3,0,0],[1,1,1,1,1]]
Here the entry in row u and column i is the rating that user u has given to item i. I want to calculate this cosine similarity between the items (columns) of this matrix. This should yield a 5 x 5 matrix, I believe. I tried:
df = pd.DataFrame(M)
item_mean_subtracted = df.sub(df.mean(axis=0), axis=1)
similarity_matrix = item_mean_subtracted.fillna(0).corr(method="pearson").values
However, this does not seem right.

Here's a possible implementation of the adjusted cosine similarity:
import numpy as np
from scipy.spatial.distance import pdist, squareform
M = np.asarray([[2, 3, 4, 1, 0],
[0, 0, 0, 0, 5],
[5, 4, 3, 0, 0],
[1, 1, 1, 1, 1]])
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
Remarks:
I'm taking advantage of NumPy broadcasting to subtract the mean.
If M is a sparse matrix, you could first convert it to a dense array with something like this: M.toarray().
From the docs:
Y = pdist(X, 'cosine')
Computes the cosine distance between vectors u and v,
1 - (u·v) / (||u||_2 ||v||_2)
where ||*||_2 is the 2-norm of its argument *, and u·v is the dot product of u and v.
Array transposition is performed through the .T attribute.
Demo:
In [277]: M_u
Out[277]: array([ 2. , 1. , 2.4, 1. ])
In [278]: item_mean_subtracted
Out[278]:
array([[ 0. , 1. , 2. , -1. , -2. ],
[-1. , -1. , -1. , -1. , 4. ],
[ 2.6, 1.6, 0.6, -2.4, -2.4],
[ 0. , 0. , 0. , 0. , 0. ]])
In [279]: np.set_printoptions(precision=2)
In [280]: similarity_matrix
Out[280]:
array([[ 1. , 0.87, 0.4 , -0.68, -0.72],
[ 0.87, 1. , 0.8 , -0.65, -0.91],
[ 0.4 , 0.8 , 1. , -0.38, -0.8 ],
[-0.68, -0.65, -0.38, 1. , 0.27],
[-0.72, -0.91, -0.8 , 0.27, 1. ]])
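To see what pdist is doing here, the same quantity can be written out with plain NumPy. This is a sketch for illustration only: the names centered, norms and similarity_manual are introduced here, and it assumes no item column is all zeros after mean subtraction (otherwise the division would produce NaN).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

M = np.asarray([[2, 3, 4, 1, 0],
                [0, 0, 0, 0, 5],
                [5, 4, 3, 0, 0],
                [1, 1, 1, 1, 1]], dtype=float)

# Subtract each user's mean rating (row mean) from that user's ratings.
centered = M - M.mean(axis=1, keepdims=True)

# Cosine similarity between item columns: dot products divided by
# the product of the column 2-norms.
norms = np.linalg.norm(centered, axis=0)
similarity_manual = (centered.T @ centered) / np.outer(norms, norms)

# Same result via pdist, as in the answer above.
similarity_pdist = 1 - squareform(pdist(centered.T, "cosine"))
print(np.allclose(similarity_manual, similarity_pdist))
```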
