How to find linearly independent rows from a matrix - python

How can I identify the linearly independent rows of a matrix? For instance, the 4th row is independent.

First, your 3rd row is linearly dependent on your 1st and 2nd rows. Also, your 1st and 4th columns are linearly dependent.
Two methods you could use:
Eigenvalue
If one eigenvalue of the matrix is zero, the corresponding eigenvector indicates a linear dependence. The documentation of eig states that the returned eigenvalues are repeated according to their multiplicity and are not necessarily ordered. However, assuming the eigenvalues correspond to your row vectors, one method would be:
import numpy as np

matrix = np.array(
    [
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 1, 0],
        [1, 0, 0, 1]
    ])

lambdas, V = np.linalg.eig(matrix.T)
# The linearly dependent row vectors
print(matrix[lambdas == 0, :])
Cauchy-Schwarz inequality
To test linear dependence of vectors and figure out which ones, you could use the Cauchy-Schwarz inequality. Basically, if the inner product of two vectors is equal to the product of their norms, the vectors are linearly dependent. Here is an example for the columns:
import numpy as np

matrix = np.array(
    [
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 1, 0],
        [1, 0, 0, 1]
    ])

print(np.linalg.det(matrix))

for i in range(matrix.shape[0]):
    for j in range(matrix.shape[0]):
        if i != j:
            inner_product = np.inner(
                matrix[:, i],
                matrix[:, j]
            )
            norm_i = np.linalg.norm(matrix[:, i])
            norm_j = np.linalg.norm(matrix[:, j])

            print('I: ', matrix[:, i])
            print('J: ', matrix[:, j])
            print('Prod: ', inner_product)
            print('Norm i: ', norm_i)
            print('Norm j: ', norm_j)
            if np.abs(inner_product - norm_j * norm_i) < 1E-5:
                print('Dependent')
            else:
                print('Independent')
Testing the rows is a similar approach.
You could then extend this to test all combinations of vectors, but I imagine this solution scales badly with size.

With sympy you can find the linearly independent rows using sympy.Matrix.rref:
>>> import sympy
>>> import numpy as np
>>> mat = np.array([[0,1,0,0],[0,0,1,0],[0,1,1,0],[1,0,0,1]]) # your matrix
>>> _, inds = sympy.Matrix(mat).T.rref() # to check the rows you need to transpose!
>>> inds
[0, 1, 3]
This basically tells you that rows 0, 1 and 3 are linearly independent while row 2 isn't (it's a linear combination of rows 0 and 1).
You can then keep the independent rows with slicing:
>>> mat[inds]
array([[0, 1, 0, 0],
[0, 0, 1, 0],
[1, 0, 0, 1]])
This also works well for rectangular (not only square) matrices.
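For example, a small sketch with a made-up rectangular 4x3 matrix (row 1 is twice row 0, and row 3 equals row 0 minus row 2, so only rows 0 and 2 should remain; note that inds may be returned as a list or a tuple depending on the SymPy version):
>>> rect = np.array([[1, 2, 3], [2, 4, 6], [1, 1, 1], [0, 1, 2]])
>>> _, inds = sympy.Matrix(rect).T.rref()
>>> list(inds)
[0, 2]
>>> rect[list(inds)]
array([[1, 2, 3],
       [1, 1, 1]])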

I edited the code for the Cauchy-Schwarz inequality so that it scales better with dimension: the inputs are the matrix and its dimension, and the output is a new rectangular matrix which contains the linearly independent columns of the starting matrix along its rows. This works under the assumption that the first column is never null, but it can readily be generalized to handle that case too. Another thing I observed is that 1e-5 seems to be a "sloppy" threshold, since some particular pathological vectors were found to be linearly dependent in that case; 1e-4 doesn't give me the same problems. I hope this can be of some help: it was pretty difficult for me to find a really working routine to extract linearly independent vectors, so I'm willing to share mine. If you find some bugs, please report them!
from numpy import dot, zeros, absolute
from numpy.linalg import matrix_rank, norm

def find_li_vectors(dim, R):
    r = matrix_rank(R)
    index = zeros(r, dtype=int)  # this will save the positions of the li columns in the matrix
    counter = 0
    index[0] = 0  # without loss of generality we pick the first column as linearly independent
    j = 0         # therefore the first comparison index is simply 0

    for i in range(R.shape[1]):  # loop over the columns
        if i != j:  # if the two columns are not the same
            inner_product = dot(R[:, i], R[:, j])  # compute the scalar product
            norm_i = norm(R[:, i])  # compute the norms
            norm_j = norm(R[:, j])

            # the inner product and the product of the norms are equal only if the two vectors are parallel,
            # therefore we are looking for the ones which exhibit a difference bigger than a threshold
            if absolute(inner_product - norm_j * norm_i) > 1e-4:
                counter += 1        # counter is incremented
                index[counter] = i  # index is saved
                j = i               # j is refreshed
                # do not forget to refresh j: otherwise you would only compare against the first column!

    R_independent = zeros((r, dim))
    i = 0
    # now save everything in a new matrix: the kept columns become the rows of the result
    while i < r:
        R_independent[i, :] = R[:, index[i]]
        i += 1

    return R_independent
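A hypothetical usage sketch (here the second column is twice the first, so only the first and third columns survive as rows of the output):
import numpy as np

R = np.array([[1., 2., 0.],
              [1., 2., 1.],
              [0., 0., 1.]])
print(find_li_vectors(R.shape[0], R))
# [[1. 1. 0.]
#  [0. 1. 1.]]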

I know this was asked a while ago, but here is a very simple (although probably inefficient) solution. Given an array, the following finds a set of linearly independent vectors by progressively adding a vector and testing if the rank has increased:
from numpy.linalg import matrix_rank

def LI_vecs(dim, M):
    LI = [M[0]]
    for i in range(dim):
        tmp = []
        for r in LI:
            tmp.append(r)
        tmp.append(M[i])                # set tmp = LI + [M[i]]
        if matrix_rank(tmp) > len(LI):  # test if M[i] is linearly independent from all (row) vectors in LI
            LI.append(M[i])             # note that matrix_rank does not need to take in a square matrix
    return LI                           # return the set of linearly independent (row) vectors

# Example
mat = [[1,2,3,4], [4,5,6,7], [5,7,9,11], [2,4,6,8]]
LI_vecs(4, mat)

I interpret the problem as finding rows that are linearly independent of other rows.
That is equivalent to finding the rows that are linearly dependent on other rows.
Gaussian elimination, treating numbers smaller than a threshold as zero, can do that. It is faster than finding the eigenvalues of the matrix, testing all combinations of rows with the Cauchy-Schwarz inequality, or singular value decomposition.
See:
https://math.stackexchange.com/questions/1297437/using-gauss-elimination-to-check-for-linear-dependence
Problem with floating point numbers:
http://numpy-discussion.10968.n7.nabble.com/Reduced-row-echelon-form-td16486.html
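A minimal sketch of that idea (my own helper, not taken from the linked posts): reduce each row against the rows kept so far, treat anything below a tolerance as zero, and keep only the rows that still contribute something new.
import numpy as np

def independent_rows(A, tol=1e-10):
    """Indices of rows that are not (numerically) linear combinations of the rows before them."""
    A = np.asarray(A, dtype=float)
    basis = []      # reduced copies of the rows kept so far
    keep = []
    for idx, row in enumerate(A):
        v = row.copy()
        for b in basis:
            pivot_col = np.argmax(np.abs(b))
            v -= b * (v[pivot_col] / b[pivot_col])   # eliminate this basis direction
        if np.linalg.norm(v) > tol:                  # anything left means the row is independent
            basis.append(v)
            keep.append(idx)
    return keep

# e.g. independent_rows([[0,1,0,0],[0,0,1,0],[0,1,1,0],[1,0,0,1]]) -> [0, 1, 3]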

With regards to the following discussion:
Find dependent rows/columns of a matrix using Matlab?
from sympy import *
A = Matrix([[1,1,1],[2,2,2],[1,7,5]])
print(A.nullspace())
It is obvious that the first and second rows are multiples of each other.
If we execute the above code we get [-1/3, -2/3, 1]. The indices of the zero elements in the null space show independence. But why is the third element here not zero? If we multiply the matrix A by the null space vector, we get a zero column vector. So what's wrong?
The answer which we are looking for is the null space of the transpose of A.
B = A.T
print(B.nullspace())
Now we get [-2, 1, 0], which shows that the third row is independent (the vector says that -2 times the first row plus the second row is zero, so the second row is the dependent one).
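As a quick sanity check (a small NumPy sketch, not part of the original discussion), the vector from the null space of the transpose encodes exactly that dependence relation among the rows:
import numpy as np

A = np.array([[1, 1, 1], [2, 2, 2], [1, 7, 5]])
coeffs = np.array([-2, 1, 0])   # the vector returned by B.nullspace()
print(coeffs @ A)               # [0 0 0]: -2*(first row) + (second row) = 0, and the third row is not involved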
Two important notes here:
Consider whether we want to check the row dependencies or the column dependencies.
Notice that the null space of a matrix is not equal to the null space of the transpose of that matrix unless it is symmetric.

You can find the vectors spanning the column space of the matrix by using the columnspace() method of SymPy's Matrix object. These are automatically the linearly independent columns of the matrix.
import sympy as sp
import numpy as np

M = sp.Matrix([[0, 1, 0, 0],
               [0, 0, 1, 0],
               [1, 0, 0, 1]])

for i in M.columnspace():
    print(np.array(i))
    print()

# The output is the following:
# [[0]
#  [0]
#  [1]]
# [[1]
#  [0]
#  [0]]
# [[0]
#  [1]
#  [0]]
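And if you want the linearly independent rows instead, you can apply the same method to the transpose (a small sketch with the matrix from the question; the columns of the transpose are the original rows):
M = sp.Matrix([[0, 1, 0, 0],
               [0, 0, 1, 0],
               [0, 1, 1, 0],
               [1, 0, 0, 1]])

for r in M.T.columnspace():
    print(np.array(r).ravel())   # prints rows 0, 1 and 3 of the original matrix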


Scipy CSR matrix: subtract only from non-zero values

I have a large matrix that I want to perform calculations on.
To make things easier to understand, here are examples of what I want to do using smaller data:
I use a sparse CSR matrix like this (shape of the actual matrix is (9000, 900)):
x = sp.csr_matrix(np.array([[1,0,2],[1,1,0]]))
# > (0, 0) 1
# (0, 2) 2
# (1, 0) 1
# (1, 1) 1
I then have a vector of appropriate shape that I want to subtract from the matrix (shape of the actual vector is (9000,)):
y = np.array([1, 1])
res = x - np.array([y]).T
# > matrix([[ 0, -1, 1],
# [ 0, 0, -1]])
This also subtracts from values that are zero in the sparse matrix, but I want to subtract only from non-zero values. I tried using scipy's .nonzero(), like this:
x[x.nonzero()] - np.array([y]).T
which works on this small example, but when I try it on my actual data more than 32 GB of RAM are being used. Performing the calculation without .nonzero() works perfectly fine and barely takes a second.
What is an efficient way of performing the subtraction only on non-zero values?
EDIT:
I realized that my question is a bit unclear, and I have also found the solution, so here is a clarification of the question and then an answer:
I have a matrix with 9000 rows and a column-vector with 9000 rows. I wanted to subtract the value in a row of the column vector from all non-zero values of the corresponding matrix row. So for my example matrix sp.csr_matrix(np.array([[1,0,2],[1,1,0]])) and vector np.array([1, 1]), the result should be
[[0 0 1]
[0 0 0]]
I thought that my attempt at using .nonzero() calculated just that, but I was wrong. However, I found the correct way of doing it here. So this is the working code, which also does not cause any RAM issues:
x = sp.csr_matrix(np.array([[1,0,2],[1,1,0]]))
y = np.array([1,1])
nz = x.nonzero()
x[nz] -= y[nz[0]]
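As a side note (a sketch assuming the same imports as above), since x is a CSR matrix you can also avoid fancy indexing entirely and modify the .data array in place; np.diff(x.indptr) gives the number of stored values per row, so np.repeat expands y to one entry per stored value:
x = sp.csr_matrix(np.array([[1, 0, 2], [1, 1, 0]]))
y = np.array([1, 1])
x.data -= np.repeat(y, np.diff(x.indptr))   # subtract y[row] from every stored value of that row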

Find optimal unique neighbour pairs based on closest distance

General problem
First let's explain the problem more generally. I have a collection of points with x,y coordinates and want to find the optimal unique neighbour pairs such that the distance between the neighbours in all pairs is minimised, but points cannot be used in more than one pair.
Some simple examples
Note: points are not ordered and the x and y coordinates will both vary between 0 and 1000, but for simplicity, in the examples below x==y and items are ordered.
First, let's say I have the following matrix of points:
matrix1 = np.array([[1, 1],[2, 2],[5, 5],[6, 6]])
For this dataset, the output should be [0,0,1,1], as points 1 and 2 are closest to each other and points 3 and 4 are closest to each other, providing pairs 0 and 1.
Second, two points cannot have the same partner. If we have the matrix:
matrix2 = np.array([[1, 1],[2, 2],[4, 4],[6, 6]])
Here pt1 and pt3 are closest to pt2, but pt1 is relatively closer, so the output should again be [0,0,1,1].
Third, if we have the matrix :
matrix3 = np.array([[1, 1],[2, 2],[3, 3],[4, 4]])
Now pt1 and pt3 are again closest to pt2 but now they are at the same distance. Now the output should again be [0,0,1,1] as pt4 is closest to pt3.
Fourth, in the case of an uneven number of points, the furthest point should be made nan, e.g.
matrix4 = np.array([[1, 1],[2, 2],[4,4]])
should give output [0,0,nan]
Fifth, in the case there are three or more points with exactly the same distance, the pairing can be random, e.g.
matrix5 = np.array([[1, 1],[2, 2],[3, 3]])
both an output of [0,0,nan] and [nan,0,0] should be fine.
My effort
Using sklearn:
import numpy as np
from sklearn.neighbors import NearestNeighbors
data = matrix3
nbrs = NearestNeighbors(n_neighbors=len(data), algorithm="ball_tree")
nbrs = nbrs.fit(data)
distances,indices = nbrs.kneighbors(data)
This outputs the indices:
array([[0, 1, 2, 3],
       [1, 2, 0, 3],
       [2, 1, 3, 0],
       [3, 2, 1, 0]])
The second column provides the nearest points:
nearinds = indices[:,1]
Next, in case there are duplicates in the list, we need to find the nearest distance:
if len(set(nearinds)) != len(nearinds):
    dupvals = [i for i in set(nearinds) if list(nearinds).count(i) > 1]
    for dupval in dupvals:
        dupinds = [i for i,j in enumerate(nearinds) if j == dupval]
        dupdists = distances[dupinds,1]
Using these dupdists I would be able to find that one is closer to the pt than the other:
if len(set(dupdists))==len(dupdists):
    duppriority = np.argsort(dupdists)
Using the duppriority values we can provide the closer pt its right pairing. But to give the other point its pairing will then depend on its second nearest pairing and the distance of all other points to that same point.. Furthermore, if both points are the same distance to their closest point, I would also need to go one layer deeper:
if len(set(dupdists))!=len(dupdists):
    dupdists2 = [distances[i,2] for i,j in enumerate(inds) if j == dupval]
    if len(set(dupdists2))==len(dupdists2):
        duppriority2 = np.argsort(dupdists2)
etc..
I am kind of stuck here and also feel this approach is not very efficient, especially for more complicated cases than 4 points, where multiple points can be at a similar distance to one or multiple nearest, second-nearest, etc. points.
I also found that with scipy there is a similar one-line command that could be used to get the distances and indices:
from scipy.spatial import cKDTree
distances,indices = cKDTree(matrix3).query(matrix3, k=len(matrix3))
so I am wondering which one would be better to continue with.
More specific problem that I want to solve
I have a list of points and need to match them optimally to a list of points previous in time. Number of points is generally limited and ranges from 2 to 10 but is generally consistent over time (i.e. it won't jump much between values over time). Data tends to look like:
prevdat = {'loc': [(300, 200), (425, 400), (400, 300)], 'contid': [0, 1, 2]}
currlocs = [(435, 390), (405, 295), (290, 215), (440, 330)]
Points are generally closer to their own previous position than to other points. Thus I should be able to link the identities of the points over time. There are, however, a number of complications that need to be overcome:
sometimes the number of current and previous points is not equal
points often have the same closest neighbour but should not be able to be allocated the same identity
points sometimes have the same distance to their closest neighbour (but this is very unlikely for the 2nd, 3rd nearest neighbours, etc.)
Any advice to help solve my problem would be much appreciated. I hope my examples and effort above will help. Thanks!
This can be formulated as a mixed integer linear programming problem.
In python you can model and solve such problems using cvxpy.
import numpy as np
import cvxpy

def connect_point_cloud(points):
    '''
    Given a set of points, returns pairs of points whose
    summed distance is minimised
    '''
    N = points.shape[0]
    I, J = np.indices((N, N))
    d = np.sqrt(sum((points[I, i] - points[J, i])**2 for i in range(points.shape[1])))

    use = cvxpy.Variable((N, N), integer=True)
    # each entry use[i,j] indicates that point i is connected to point j
    # each pair may count 0 or 1 times
    constraints = [use >= 0, use <= 1]
    # point i must be used in at most one connection
    constraints += [cvxpy.sum(use[i, :]) + cvxpy.sum(use[:, i]) <= 1 for i in range(N)]
    # at least floor(N/2) connections must be present
    constraints += [cvxpy.sum(use) >= N // 2]
    # let the solver handle the problem (a mixed-integer capable solver is required)
    P = cvxpy.Problem(cvxpy.Minimize(cvxpy.sum(cvxpy.multiply(use, d))), constraints)
    dist = P.solve()
    return use.value
Here is a piece of code to visualize the result for a 2D problem:
import matplotlib.pyplot as plt

# create a random set with 50 points
p = np.random.rand(50, 2)
# find the pairs with minimum total distance
pairs = connect_point_cloud(p)

# plot all the points with circles
plt.plot(p[:, 0], p[:, 1], 'o')

# plot lines connecting the points
for i1, i2 in zip(*np.nonzero(pairs)):
    plt.plot([p[i1, 0], p[i2, 0]], [p[i1, 1], p[i2, 1]])

Fast way to do consecutive one-to-all calculations on Numpy arrays without a for-loop?

I'm working on an optimization problem, but to avoid getting into the details, I'm going to provide a simple example of a bug that's been giving me headaches for a few days.
Say I have a 2D numpy array with observed x-y coordinates:
import numpy as np
from scipy.spatial import distance
x = np.array([[1, 2], [2, 3], [4, 5], [5, 6]])
I also have a list of x-y coordinates to compare to these points (y):
y = np.array([[11, 13], [12, 14]])
I have a function that takes the sum of manhattan differences between a value of x and all of the values in y:
def find_sum(ref_row, comp_rows):
    modeled_counts = []
    y = ref_row * len(comp_rows)
    res = list(map(distance.cityblock, ref_row, comp_rows))
    modeled_counts.append(sum(res))
    return sum(modeled_counts)
Essentially, what I would like to do is find the sum of manhattan distances for every item in y with each item in x (so basically for each item in x, find the sum of the Manhattan distances between that (x,y) pair and every (x,y) pair in y).
I've tried this out with the following line of code:
z = list(map(find_sum, x, y))
However, z is of length 2 (like y), and not 4 like x. Is there a way to ensure that z is the result of consecutive one-to-all calculations? That is, I'd like to calculate the sum of all of the manhattan differences between x[0] and every set in y, and so on and so forth, so the length of z should be equal to the length of x.
Is there a simple way to do this without a for loop? My data is rather large (~ 4 million rows), so I'd really appreciate fast solutions. I'm fairly new to Python programming, so any explanations about why the solution works and is fast would be appreciated as well, but definitely isn't required!
Thanks!
This solution implements the distance in numpy, as I think it is a good example of broadcasting, which is a very useful thing to know if you need to use arrays and matrices.
By definition of the Manhattan distance, you need to evaluate the sum of the absolute values of the differences between the columns. However, the first column of x, x[:, 0], has shape (4,) and the first column of y, y[:, 0], has shape (2,), so they are not compatible for subtraction: broadcasting compares shapes starting with the trailing dimensions, and two dimensions are compatible when they are equal or one of them is 1. Sadly, neither condition holds for your columns.
However, you can add a new dimension of value 1 using np.newaxis, so
x[:, 0]
is array([1, 2, 4, 5]), but
x[:, 0, np.newaxis]
is
array([[1],
       [2],
       [4],
       [5]])
and its shape is (4, 1). Now, a matrix of shape (4, 1) minus an array of shape (2,) results in a matrix of shape (4, 2), by NumPy's broadcasting rules:
4 x 1
    2
= 4 x 2
You can obtain the differences for each column:
first_column_difference = x[:, 0, np.newaxis] - y[:, 0]
second_column_difference = x[:, 1, np.newaxis] - y[:, 1]
and evaluate the sum of their absolute values:
np.abs(first_column_difference) + np.abs(second_column_difference)
which results in a (4, 2) matrix. Now, you want to sum the values for each row, so that you have 4 values:
np.sum(np.abs(first_column_difference) + np.abs(second_column_difference), axis=1)
which results in array([73, 69, 61, 57]). The rule is simple: the parameter axis will eliminate that dimension from the result, therefore using axis=1 for a (4, 2) matrix generates 4 values -- if you use axis=0, it will generate 2 values.
So, this will solve your problem:
x = np.array([[1, 2], [2, 3], [4, 5], [5, 6]])
y = np.array([[11, 13], [12, 43]])
first_column_difference = x[:, 0, np.newaxis] - y[:, 0]
second_column_difference = x[:, 1, np.newaxis] - y[:, 1]
z = np.abs(first_column_difference) + np.abs(second_column_difference)
print(np.sum(z, axis=1))
You can also skip the intermediate steps for each column and evaluate everything at once (it is a little bit harder to understand, so I prefer the method described above to explain what is happening):
print(np.abs(x[:, np.newaxis] - y).sum(axis=(1, 2)))
It is a general case for an n-dimensional Manhattan distance: if x is (u, n) and y is (v, n), it generates u rows by broadcasting (u, 1, n) by (v, n) = (u, v, n), then applying sum to eliminate the second and third axis.
Here is how you can do it using NumPy broadcasting, with a simplified explanation.
Adjust Shape For Broadcasting
import numpy as np
start_points = np.array([[1,2], [2,3], [4,5], [5,6]])
dest_points = np.array([[11,13], [12, 14]])
## using np.newaxis as index add a new dimension at that position
## : give all the elements on that dimension
start_points = start_points[np.newaxis, :, :]
dest_points = dest_points[:, np.newaxis, :]
## Now let's check the shape of the point arrays
print('start_points.shape: ', start_points.shape) # (1, 4, 2)
print('dest_points.shape', dest_points.shape) # (2, 1, 2)
Let's try to understand:
the last element of the shape represents the x and y of a point, size 2
we can think of start_points as having 1 row and 4 columns of points
we can think of dest_points as having 2 rows and 1 column of points
We can think of start_points and dest_points as matrices or tables of points of size (1x4) and (2x1).
We clearly see that the sizes are not compatible. What will happen if we perform an arithmetic
operation between them? Here is where a smart part of NumPy, called broadcasting, comes in.
It will repeat the rows of start_points to match those of dest_points, making a (2x4) matrix of points.
It will repeat the columns of dest_points to match those of start_points, making a (2x4) matrix of points.
The result is the arithmetic operation between every pair of elements in start_points and dest_points.
Calculate the distance
diff_x_y = start_points - dest_points
print(diff_x_y.shape) # (2, 4, 2)
abs_diff_x_y = np.abs(start_points - dest_points)
man_distance = np.sum(abs_diff_x_y, axis=2)
print('man_distance:\n', man_distance)
sum_distance = np.sum(man_distance, axis=0)
print('sum_distance:\n', sum_distance)
Oneliner
start_points = np.array([[1,2], [2,3], [4,5], [5,6]])
dest_points = np.array([[11,13], [12, 14]])
np.sum(np.abs(start_points[np.newaxis, :, :] - dest_points[:, np.newaxis, :]), axis=(0,2))
Here is a more detailed explanation of broadcasting if you want to understand it further.
With so many rows you can make substantial savings by using a smart algorithm. Let us for simplicity assume there is just one dimension; once we have established the algorithm, getting back to the general case is a simple matter of summing over coordinates.
The naive algorithm is O(mn) where m,n are the sizes of sets X,Y. Our algorithm is O((m+n)log(m+n)) so it scales much better.
We first have to sort the union of X and Y by coordinate and then form the cumsum over Y. Next, we find for each x in X the number YbefX of y in Y to its left and use it to look up the corresponding cumsum item YbefXval. The summed distances to all y to the left of x are YbefX times the coordinate of x, minus YbefXval; the summed distances to all y to the right are the sum of all y coordinates minus YbefXval, minus (n - YbefX) times the coordinate of x.
Where does the saving come from? Sorting coordinates enables us to recycle the summations we have done before, instead of starting each time from scratch. This uses the fact that up to a sign we always sum the same y coordinates and going from left to right the signs flip one by one.
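For intuition, here is a minimal one-dimensional sketch of that idea (my own illustration, separate from the benchmark code below):
import numpy as np

def summed_distances_1d(x, y):
    """For each value in x, the sum of |x_i - y_j| over all y, in O((m+n) log(m+n))."""
    y = np.sort(y)
    n = len(y)
    cum = np.concatenate(([0.0], np.cumsum(y)))      # cum[k] = sum of the k smallest y
    befx = np.searchsorted(y, x)                      # number of y to the left of each x
    left = befx * x - cum[befx]                       # summed distances to the y on the left
    right = (cum[n] - cum[befx]) - (n - befx) * x     # summed distances to the y on the right
    return left + right

# quick check against the naive O(mn) computation
x = np.random.randn(5)
y = np.random.randn(7)
naive = np.abs(x[:, None] - y[None, :]).sum(axis=1)
print(np.allclose(summed_distances_1d(x, y), naive))  # True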
Code:
import numpy as np
from scipy.spatial.distance import cdist
from timeit import timeit
def pp(X, Y):
    (m, k), (n, k) = X.shape, Y.shape
    XY = np.concatenate([X.T, Y.T], 1)
    idx = XY.argsort(1)
    Xmsk = idx < m
    Ymsk = ~Xmsk
    Xidx = np.arange(k)[:, None], idx[Xmsk].reshape(k, m)
    Yidx = np.arange(k)[:, None], idx[Ymsk].reshape(k, n)
    YbefX = Ymsk.cumsum(1)[Xmsk].reshape(k, m)
    YbefXval = XY[Yidx].cumsum(1)[np.arange(k)[:, None], YbefX - 1]
    YbefXval[YbefX == 0] = 0
    XY[Xidx] = ((2 * YbefX - n) * XY[Xidx]) - 2 * YbefXval + Y.sum(0)[:, None]
    return XY[:, :m].sum(0)

def summed_cdist(X, Y):
    return cdist(X, Y, "minkowski", p=1).sum(1)
# demo
m,n,k = 1000,500,10
X,Y = np.random.randn(m,k),np.random.randn(n,k)
print("same result:",np.allclose(pp(X,Y),summed_cdist(X,Y)))
print("sort :",timeit(lambda:pp(X,Y),number=1000),"ms")
print("scipy cdist:",timeit(lambda:summed_cdist(X,Y),number=100)*10,"ms")
Sample run, comparing smart algo "sort" to naive algo implemented using cdist library function:
same result: True
sort : 1.4447695480193943 ms
scipy cdist: 36.41934019047767 ms

Numpy: smart matrix multiplication to sparse result matrix

In python with numpy, say I have two matrices:
S, a sparse x*x matrix
M, a dense x*y matrix
Now I want to do np.dot(M, M.T) which will return a dense x*x matrix S_.
However, I only care about the cells that are nonzero in S, which means that it would not make a difference for my application if I did
S_ = S*S_
Obviously, that would be a waste of operations, as I would like to leave out the irrelevant cells given in S altogether. Remember that in this matrix multiplication
S_[i,j] = np.sum(M[i,:] * M[j,:])
So I want to do this operation only for i,j such that S[i,j]=True.
Is this supported somehow by numpy implementations that run in C so that I do not need to implement it with python loops?
EDIT 1 [solved]: I still have this problem, actually M is now also sparse.
Now, given rows and cols of S, I implemented it like this:
data = np.array([ M[rows[i],:].dot(M[cols[i],:].T).data[0] for i in range(len(rows)) ])
S_ = sp.csr_matrix( (data, (rows, cols)) )
... but it is still slow. Any new ideas?
EDIT 2: jdehesa has given a great solution, but I would like to save more memory.
The solution was to do the following:
data = M[rows,:].multiply(M[cols,:]).sum(axis=1)
and then build a new sparse matrix from rows, cols and data.
However, when running the above line, scipy builds a (contiguous) numpy array with as many elements as nnz of the first submatrix plus nnz of the second submatrix, which can lead to MemoryError in my case.
In order to save more memory, I would like to multiply iteratively each row with its respective 'partner' column, then sum over and discard the result vector. Using simple python to implement this, basically I am back to the extremely slow version.
Is there a fast way of solving this problem?
Here is how you can do it with NumPy/SciPy, both for dense and sparse M matrices:
import numpy as np
import scipy.sparse as sp
# Coordinates where S is True
S = np.array([[0, 1],
              [3, 6],
              [3, 4],
              [9, 1],
              [4, 7]])
# Dense M matrix
# Random big matrix
M = np.random.random(size=(1000, 2000))
# Take relevant rows and compute values
values = np.sum(M[S[:, 0]] * M[S[:, 1]], axis=1)
# Make result matrix from values
result = np.zeros((len(M), len(M)), dtype=values.dtype)
result[S[:, 0], S[:, 1]] = values
# Sparse M matrix
# Construct sparse M as COO matrix or any other way
M = sp.coo_matrix(([10, 20, 30, 40, 50],   # Data
                   ([0, 1, 3, 4, 6],       # Rows
                    [4, 4, 5, 5, 8])),     # Columns
                  shape=(1000, 2000))
# Convert to CSR for fast row slicing
M_csr = M.tocsr()
# Take relevant rows and compute values
values = M_csr[S[:, 0]].multiply(M_csr[S[:, 1]]).sum(axis=1)
values = np.squeeze(np.asarray(values))
# Construct COO sparse matrix from values
result = sp.coo_matrix((values, (S[:, 0], S[:, 1])), shape=(M.shape[0], M.shape[0]))
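Regarding the memory concern from EDIT 2, one possibility (a sketch building on the code above, not part of the original answer) is to process the pairs in chunks, so that only a bounded number of intermediate values is materialised at a time:
# process the S pairs in chunks to bound peak memory (chunk size is an arbitrary choice)
chunk = 10000
values = np.empty(len(S))
for start in range(0, len(S), chunk):
    sl = slice(start, start + chunk)
    values[sl] = np.squeeze(np.asarray(
        M_csr[S[sl, 0]].multiply(M_csr[S[sl, 1]]).sum(axis=1)))
result = sp.coo_matrix((values, (S[:, 0], S[:, 1])), shape=(M.shape[0], M.shape[0]))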

A particular way of resizing a matrix

Having a nxn (6x6 in the example below) matrix filled only with 0 and 1:
old_matrix=[[0,0,0,1,1,0],
            [1,1,1,1,0,0],
            [0,0,1,0,0,0],
            [1,0,0,0,0,1],
            [0,1,1,1,1,0],
            [1,0,0,1,1,0]]
I want to resize it in a particular way: taking 2x2 sub-matrices and checking whether there are more ones or zeros in each. This means the new matrix will be 3x3. If there are more 1s than 0s in the sub-matrix, a 1 will be assigned in the new matrix. Otherwise (if there are fewer or equally many), its new value will be 0.
new_matrix=[[0,1,0],
            [0,0,0],
            [0,1,0]]
I've tried to achieve this by using lots of while loops. However, it doesn't seem to work. Here's what I got so far:
def convert_track(a):
    #converts original map to a 8x8 tile Track
    NEW_TRACK=[]
    w=0 #matrix width
    h=0 #matrix heigth
    t_w=0 #submatrix width
    t_h=0 #submatrix heigth
    BLACK=0 #number of ones in submatrix
    WHITE=0 #number of zeros in submatrix
    while h<=6:
        while w<=6:
            l=[]
            while t_h<=2 and h<=6:
                t_w=0
                while t_w<=2 and w<=6:
                    if a[h][w]==1:
                        BLACK+=1
                    else:
                        WHITE+=1
                    t_w+=1
                    w+=1
                h+=1
                t_h+=1
            t_w=0
            t_h+=1
            if BLACK<=WHITE:
                l.append(0)
            else:
                l.append(1)
            BLACK=0
            WHITE=0
            t_h=0
            NEW_TRACK.append(l)
    return NEW_TRACK
This raises the error "list index out of range" or returns the list
[[0]]
Is there an easier way to achieve this? What am I doing wrong?
If you are willing/able to use NumPy you can do something like this. If you're working with anything like the data you've shown it's well worth your time to learn as operations like these can be done very efficiently and with very little code.
import numpy as np
from scipy.signal import convolve2d
old_matrix=[[0,0,0,1,1,0],
            [1,1,1,1,0,0],
            [0,0,1,0,0,0],
            [1,0,0,0,0,1],
            [0,1,1,1,1,0],
            [1,0,0,1,1,0]]
a = np.array(old_matrix)
k = np.ones((2,2))
# compute sums at each submatrix
local_sums = convolve2d(a, k, mode='valid')
# restrict to sums corresponding to non-overlapping
# sub-matrices with local_sums[::2, ::2] and check if
# there are more 1 than 0 elements
result = local_sums[::2, ::2] > 2
# convert back to Python list if needed
new_matrix = result.astype(int).tolist()
Result:
>>> result.astype(int).tolist()
[[0, 1, 0], [0, 0, 0], [0, 1, 0]]
Here I've used convolve2d to compute the sums at each submatrix. From what I can tell you are only interested in non-overlapping sub-matrices, so the part local_sums[::2, ::2] chops out only the sums corresponding to those.
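If you'd rather not depend on SciPy, the same non-overlapping 2x2 block sums can be obtained with a plain NumPy reshape (a small sketch; it assumes the matrix dimensions are exact multiples of the block size):
a = np.array(old_matrix)
block_sums = a.reshape(3, 2, 3, 2).sum(axis=(1, 3))  # sum each non-overlapping 2x2 block
new_matrix = (block_sums > 2).astype(int).tolist()
print(new_matrix)  # [[0, 1, 0], [0, 0, 0], [0, 1, 0]]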
