General problem
First, let's explain the problem more generally. I have a collection of points with x,y coordinates and want to find the optimal unique neighbour pairs such that the distance between the neighbours across all pairs is minimised, with the constraint that a point cannot be used in more than one pair.
Some simple examples
Note: points are not ordered and the x and y coordinates will both vary between 0 and 1000, but for simplicity in the examples below x==y and the items are ordered.
First, let's say I have the following matrix of points:
matrix1 = np.array([[1, 1],[2, 2],[5, 5],[6, 6]])
For this dataset the output should be [0,0,1,1], as points 1 and 2 are closest to each other, as are points 3 and 4, giving pair labels 0 and 1.
Second, two points cannot have the same partner. If we have the matrix:
matrix2 = np.array([[1, 1],[2, 2],[4, 4],[6, 6]])
Here pt1 and pt3 are closest to pt2, but pt1 is relatively closer, so the output should again be [0,0,1,1].
Third, if we have the matrix:
matrix3 = np.array([[1, 1],[2, 2],[3, 3],[4, 4]])
Now pt1 and pt3 are again closest to pt2, but this time they are at the same distance. The output should again be [0,0,1,1], as pt4 is closest to pt3.
Fourth, in the case of an odd number of points, the furthest point should be made nan, e.g.
matrix4 = np.array([[1, 1],[2, 2],[4,4]])
should give output [0,0,nan]
Fifth, in the case there are three or more points with exactly the same distance, the pairing can be random, e.g.
matrix5 = np.array([[1, 1],[2, 2],[3, 3]])
both an output of [0,0,nan] and [nan,0,0] would be fine.
My effort
Using sklearn:
import numpy as np
from sklearn.neighbors import NearestNeighbors
data = matrix3
nbrs = NearestNeighbors(n_neighbors=len(data), algorithm="ball_tree")
nbrs = nbrs.fit(data)
distances,indices = nbrs.kneighbors(data)
This outputs the following indices:
array([[0, 1, 2, 3],
       [1, 2, 0, 3],
       [2, 1, 3, 0],
       [3, 2, 1, 0]])
The second column provides the nearest points:
nearinds = indices[:,1]
Next, in case there are duplicates in the list, we need to find the nearest distance:
if len(set(nearinds)) != len(nearinds):
    dupvals = [i for i in set(nearinds) if list(nearinds).count(i) > 1]
    for dupval in dupvals:
        dupinds = [i for i,j in enumerate(nearinds) if j == dupval]
        dupdists = distances[dupinds,1]
Using these dupdists I would be able to find that one is closer to the pt than the other:
if len(set(dupdists)) == len(dupdists):
    duppriority = np.argsort(dupdists)
Using the duppriority values we can give the closer pt its right pairing. But the pairing of the other point will then depend on its second-nearest pairing and the distances of all other points to that same point. Furthermore, if both points are at the same distance from their closest point, I would also need to go one layer deeper:
if len(set(dupdists)) != len(dupdists):
    dupdists2 = [distances[i,2] for i,j in enumerate(nearinds) if j == dupval]
    if len(set(dupdists2)) == len(dupdists2):
        duppriority2 = np.argsort(dupdists2)
etc..
I am kind of stuck here and also feel this approach is not very efficient, especially for more complicated cases than 4 points, where multiple points can be at a similar distance to one or more nearest, second-nearest, etc. points.
I also found that with scipy there is a similar one-line command that could be used to get the distances and indices:
from scipy.spatial import cKDTree
distances,indices = cKDTree(matrix3).query(matrix3, k=len(matrix3))
so I am wondering whether one would be better to continue with than the other.
More specific problem that I want to solve
I have a list of points and need to match them optimally to a list of points from the previous point in time. The number of points is generally limited, ranging from 2 to 10, and is fairly consistent over time (i.e. it won't jump much between values over time). Data tends to look like:
prevdat = {'loc': [(300, 200), (425, 400), (400, 300)], 'contid': [0, 1, 2]}
currlocs = [(435, 390), (405, 295), (290, 215), (440, 330)]
Points over time are generally closer to themselves than to others, so I should be able to link the identities of the points over time. There are, however, a number of complications that need to be overcome:
sometimes the numbers of current and previous points are not equal
points often have the same closest neighbour, but must not be allocated the same identity
points sometimes have the same distance to their closest neighbour (but this is very unlikely for the 2nd, 3rd nearest neighbours, etc.)
Any advice to help solve my problem would be much appreciated. I hope my examples and effort above will help. Thanks!
This can be formulated as a mixed integer linear programming problem.
In Python you can model and solve such problems using cvxpy.
import numpy as np
import cvxpy

def connect_point_cloud(points):
    '''
    Given a set of points, return pairs of points whose
    summed distance is minimised
    '''
    N = points.shape[0]
    I, J = np.indices((N, N))
    d = np.sqrt(sum((points[I, i] - points[J, i])**2 for i in range(points.shape[1])))

    use = cvxpy.Variable((N, N), integer=True)
    # each entry use[i,j] indicates that point i is connected to point j
    # each pair may count 0 or 1 times
    constraints = [use >= 0, use <= 1]
    # point i must be used in at most one connection
    constraints += [cvxpy.sum(use[i, :]) + cvxpy.sum(use[:, i]) <= 1 for i in range(N)]
    # at least floor(N/2) connections must be present
    constraints += [cvxpy.sum(use) >= N // 2]

    # let the solver handle the problem
    P = cvxpy.Problem(cvxpy.Minimize(cvxpy.sum(cvxpy.multiply(use, d))), constraints)
    dist = P.solve()
    return use.value
Here is a piece of code to visualize the result for a 2D problem:
import numpy as np
import matplotlib.pyplot as plt

# create a random set with 50 points
p = np.random.rand(50, 2)
# find the pairs with minimum distance
pairs = connect_point_cloud(p)
# plot all the points with circles
plt.plot(p[:, 0], p[:, 1], 'o')
# plot lines connecting the points
for i1, i2 in zip(*np.nonzero(pairs)):
    plt.plot([p[i1, 0], p[i2, 0]], [p[i1, 1], p[i2, 1]])
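To get back the pair-label format from the original question ([0,0,1,1] with nan for an unpaired point), the 0/1 matrix returned by connect_point_cloud can be post-processed. Here is a minimal sketch (the pairing_to_labels helper is just an illustration, not part of the answer above):
import numpy as np

def pairing_to_labels(use):
    # round away solver noise, then read off the connected (i, j) index pairs
    n = use.shape[0]
    labels = np.full(n, np.nan)
    for pair_id, (i, j) in enumerate(zip(*np.nonzero(np.round(use)))):
        labels[i] = labels[j] = pair_id
    return labels

# e.g. pairing_to_labels(connect_point_cloud(matrix1)); unpaired points stay nan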
Related
I'm working on an optimization problem, but to avoid getting into the details, I'm going to provide a simple example of a bug that's been giving me headaches for a few days.
Say I have a 2D numpy array with observed x-y coordinates:
import numpy as np
from scipy.spatial import distance
x = np.array([[1, 2], [2, 3], [4, 5], [5, 6]])
I also have a list of x-y coordinates to compare to these points (y):
y = np.array([[11, 13], [12, 14]])
I have a function that takes the sum of Manhattan distances between a row of x and all of the rows in y:
def find_sum(ref_row, comp_rows):
    modeled_counts = []
    y = ref_row * len(comp_rows)
    res = list(map(distance.cityblock, ref_row, comp_rows))
    modeled_counts.append(sum(res))
    return sum(modeled_counts)
Essentially, what I would like to do is find the sum of manhattan distances for every item in y with each item in x (so basically for each item in x, find the sum of the Manhattan distances between that (x,y) pair and every (x,y) pair in y).
I've tried this out with the following line of code:
z = list(map(find_sum, x, y))
However, z is of length 2 (like y), and not 4 like x. Is there a way to ensure that z is the result of consecutive one-to-all calculations? That is, I'd like to calculate the sum of all of the Manhattan distances between x[0] and every row in y, and so on, so the length of z should be equal to the length of x.
Is there a simple way to do this without a for loop? My data is rather large (~ 4 million rows), so I'd really appreciate fast solutions. I'm fairly new to Python programming, so any explanations about why the solution works and is fast would be appreciated as well, but definitely isn't required!
Thanks!
This solution implements the distance in numpy, as I think it is a good example of broadcasting, which is a very useful thing to know if you need to use arrays and matrices.
By definition of the Manhattan distance, you need to evaluate the sum of the absolute values of the differences between each column. However, the first column of x, x[:, 0], has shape (4,) and the first column of y, y[:, 0], has shape (2,), so they are not compatible for subtraction: the broadcasting rules say that shapes are compared starting with the trailing dimensions, and two dimensions are compatible when they are equal or one of them is 1. Sadly, neither is true for your columns.
However, you can add a new dimension of value 1 using np.newaxis, so
x[:, 0]
is array([1, 2, 4, 5]), but
x[:, 0, np.newaxis]
is
array([[1],
[2],
[4],
[5]])
and its shape is (4, 1). Now, a matrix of shape (4, 1) minus an array of shape (2,) results in a matrix of shape (4, 2), by numpy's broadcasting rules:
4 x 1
2
= 4 x 2
You can obtain the differences for each column:
first_column_difference = x[:, 0, np.newaxis] - y[:, 0]
second_column_difference = x[:, 1, np.newaxis] - y[:, 1]
and evaluate the sum of their absolute values:
np.abs(first_column_difference) + np.abs(second_column_difference)
which results in a (4, 2) matrix. Now, you want to sum the values for each row, so that you have 4 values:
np.sum(np.abs(first_column_difference) + np.abs(second_column_difference), axis=1)
which results in array([73, 69, 61, 57]). The rule is simple: the parameter axis will eliminate that dimension from the result, therefore using axis=1 for a (4, 2) matrix generates 4 values -- if you use axis=0, it will generate 2 values.
So, this will solve your problem:
x = np.array([[1, 2], [2, 3], [4, 5], [5, 6]])
y = np.array([[11, 13], [12, 43]])
first_column_difference = x[:, 0, np.newaxis] - y[:, 0]
second_column_difference = x[:, 1, np.newaxis] - y[:, 1]
z = np.abs(first_column_difference) + np.abs(second_column_difference)
print(np.sum(z, axis=1))
You can also skip the intermediate steps for each column and evaluate everything at once (it is a little bit harder to understand, so I prefer the method described above to explain what is happening):
print(np.abs(x[:, np.newaxis] - y).sum(axis=(1, 2)))
It is the general case for an n-dimensional Manhattan distance: if x is (u, n) and y is (v, n), it generates u rows by broadcasting (u, 1, n) with (v, n) to (u, v, n), then applying sum to eliminate the second and third axes.
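As a small cross-check (not part of the original answer), the one-liner can be wrapped in a function and compared against scipy's cdist, which computes the same sums:
import numpy as np
from scipy.spatial.distance import cdist

def summed_manhattan(x, y):
    # (u, 1, n) - (v, n) broadcasts to (u, v, n); then sum over axes 1 and 2
    return np.abs(x[:, np.newaxis] - y).sum(axis=(1, 2))

x = np.array([[1, 2], [2, 3], [4, 5], [5, 6]])
y = np.array([[11, 13], [12, 43]])
print(summed_manhattan(x, y))                        # [73 69 61 57]
print(cdist(x, y, metric="cityblock").sum(axis=1))   # same values via scipy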
Here is how you can do it using numpy broadcasting, with a simplified explanation.
Adjust Shape For Broadcasting
import numpy as np
start_points = np.array([[1,2], [2,3], [4,5], [5,6]])
dest_points = np.array([[11,13], [12, 14]])
## using np.newaxis as index add a new dimension at that position
## : give all the elements on that dimension
start_points = start_points[np.newaxis, :, :]
dest_points = dest_points[:, np.newaxis, :]
## Now let's check the shape of the point arrays
print('start_points.shape: ', start_points.shape) # (1, 4, 2)
print('dest_points.shape', dest_points.shape) # (2, 1, 2)
Let's try to understand:
the last element of the shape represents the x and y of a point, size 2
we can think of start_points as having 1 row and 4 columns of points
we can think of dest_points as having 2 rows and 1 column of points
We can think of start_points and dest_points as matrices or tables of points of size (1x4) and (2x1)
We clearly see that the sizes are not compatible. What will happen if we perform an arithmetic
operation between them? Here is where a smart part of numpy, called broadcasting, comes in.
It will repeat the rows of start_points to match those of dest_points, making a matrix of (2x4)
It will repeat the columns of dest_points to match those of start_points, making a matrix of (2x4)
The result is the arithmetic operation between every pair of elements in start_points and dest_points
Calculate the distance
diff_x_y = start_points - dest_points
print(diff_x_y.shape) # (2, 4, 2)
abs_diff_x_y = np.abs(start_points - dest_points)
man_distance = np.sum(abs_diff_x_y, axis=2)
print('man_distance:\n', man_distance)
sum_distance = np.sum(man_distance, axis=0)
print('sum_distance:\n', sum_distance)
Oneliner
start_points = np.array([[1,2], [2,3], [4,5], [5,6]])
dest_points = np.array([[11,13], [12, 14]])
np.sum(np.abs(start_points[np.newaxis, :, :] - dest_points[:, np.newaxis, :]), axis=(0,2))
Here is a more detailed explanation of broadcasting if you want to understand it further
With so many rows you can make substantial savings by using a smart algorithm. Let us for simplicity assume there is just one dimension; once we have established the algorithm, getting back to the general case is a simple matter of summing over coordinates.
The naive algorithm is O(mn) where m,n are the sizes of sets X,Y. Our algorithm is O((m+n)log(m+n)) so it scales much better.
We first have to sort the union of X and Y by coordinate and then form the cumsum over Y. Next, we find for each x in X the number YbefX of y in Y to its left and use it to look up the corresponding cumsum item YbefXval. The summed distance to all y to the left of x is YbefX times the coordinate of x minus YbefXval; the summed distance to all y to the right is the sum of all y coordinates minus YbefXval minus (n - YbefX) times the coordinate of x.
Where does the saving come from? Sorting coordinates enables us to recycle the summations we have done before, instead of starting each time from scratch. This uses the fact that up to a sign we always sum the same y coordinates and going from left to right the signs flip one by one.
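To make this concrete, here is a minimal one-dimensional sketch of the idea (an illustration only, not the vectorised code that follows):
import numpy as np

def summed_dist_1d(x, y):
    # sum of |x_i - y_j| over all j, for each x_i, via one sort and a cumsum
    y = np.sort(y)
    csum = np.concatenate([[0], np.cumsum(y)])       # csum[k] = sum of the k smallest y
    n = len(y)
    befx = np.searchsorted(y, x)                     # number of y to the left of each x
    left = befx * x - csum[befx]                     # total distance to the y on the left
    right = (csum[n] - csum[befx]) - (n - befx) * x  # total distance to the y on the right
    return left + right

x = np.array([0.0, 2.0, 5.0])
y = np.array([1.0, 3.0, 4.0])
print(summed_dist_1d(x, y))                # [8. 4. 7.]
print(np.abs(x[:, None] - y).sum(axis=1))  # brute-force check gives the same values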
Code:
import numpy as np
from scipy.spatial.distance import cdist
from timeit import timeit

def pp(X, Y):
    (m, k), (n, k) = X.shape, Y.shape
    XY = np.concatenate([X.T, Y.T], 1)
    idx = XY.argsort(1)
    Xmsk = idx < m
    Ymsk = ~Xmsk
    Xidx = np.arange(k)[:, None], idx[Xmsk].reshape(k, m)
    Yidx = np.arange(k)[:, None], idx[Ymsk].reshape(k, n)
    YbefX = Ymsk.cumsum(1)[Xmsk].reshape(k, m)
    YbefXval = XY[Yidx].cumsum(1)[np.arange(k)[:, None], YbefX - 1]
    YbefXval[YbefX == 0] = 0
    XY[Xidx] = ((2 * YbefX - n) * XY[Xidx]) - 2 * YbefXval + Y.sum(0)[:, None]
    return XY[:, :m].sum(0)

def summed_cdist(X, Y):
    return cdist(X, Y, "minkowski", p=1).sum(1)

# demo
m, n, k = 1000, 500, 10
X, Y = np.random.randn(m, k), np.random.randn(n, k)
print("same result:", np.allclose(pp(X, Y), summed_cdist(X, Y)))
print("sort :", timeit(lambda: pp(X, Y), number=1000), "ms")
print("scipy cdist:", timeit(lambda: summed_cdist(X, Y), number=100) * 10, "ms")
Sample run, comparing smart algo "sort" to naive algo implemented using cdist library function:
same result: True
sort : 1.4447695480193943 ms
scipy cdist: 36.41934019047767 ms
I'm trying to speed up a function that evaluates the fitness of a solution. The idea is to replace a for loop over an array; is it possible to do this with np.sum?
def calculate_fitness2(individual):
    fitness = 0
    for i in range(0, len(individual) - 1):
        fitness = fitness + sp.spatial.distance.euclidean(city_list[individual][i][1:], city_list[individual][i+1][1:])
    return fitness
individual is an array of ids, e.g. [1,5,3,4,6,7]; each of these is represented in the city list, which includes the coordinates of the cities, e.g. [[1,34,55],[2,44,78],...,[7,99,23]]. The main idea is to calculate the distance of a tour in a TSP problem.
Thank you all
Assuming you want the total distance traveled by going city to city according to the individual list, one way to write it is:
city_list = np.array([[0, 0], [1, 0], [1, 1], [0, 1]])
path_idx = [0, 1, 2, 3] # i.e. individual
polyline = city_list[path_idx] # the list of visited coordinates
# a (n x dim) array
distance = np.sum(np.sqrt(np.sum(np.diff(polyline, axis=0)**2, axis=1)))
distance # 3.0
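The question's city_list carries an id as its first column, so one way to adapt the same idea is the sketch below (assuming, as in the question's code, that the entries of individual can be used to index city_list directly):
import numpy as np

city_list = np.array([[1, 34, 55], [2, 44, 78], [7, 99, 23]])  # rows are [id, x, y]
individual = [0, 2, 1]                    # visiting order, as indices into city_list
polyline = city_list[individual][:, 1:]   # drop the id column, keep the coordinates
fitness = np.sum(np.sqrt(np.sum(np.diff(polyline, axis=0)**2, axis=1)))
print(fitness)                            # total Euclidean length of the path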
I want to solve the linear equations for n given points in n-dimensional space to get the equation of the hyperplane.
For example, in the two-dimensional case, Ax + By + C = 0.
How can I get one solution when a system of linear equations has infinitely many solutions?
I have tried scipy.linalg.solve(), but it requires the coefficient matrix A to be nonsingular.
I also tried sympy:
from sympy import Matrix, linsolve, symbols
x, y, z = symbols('x y z')
A = Matrix([[0, 0, 1], [1, 1, 1]])
b = Matrix([0, 0])
linsolve((A, b), [x, y, z])
It returned this:
{(-y, y, 0)}
I have to parse the result to determine which one is the free variable and then assign a number to it to get a solution.
Is there a more convenient way, since I only want to get one specific solution?
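One possible direction (a sketch, not an answer from the thread): for a homogeneous system A·v = 0, any basis vector of the null space is itself a specific nonzero solution, and scipy.linalg.null_space returns such a basis directly.
import numpy as np
from scipy.linalg import null_space

A = np.array([[0, 0, 1],
              [1, 1, 1]], dtype=float)
ns = null_space(A)       # columns form an orthonormal basis of the null space
solution = ns[:, 0]      # pick one basis vector as a concrete solution
print(solution)          # a multiple of (1, -1, 0), matching {(-y, y, 0)} above
print(A @ solution)      # ~[0, 0]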
How to identify the linearly independent rows of a matrix? For instance, with the matrix
np.array([[0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 1, 0], [1, 0, 0, 1]])
the 4th row is independent.
First, your 3rd row is linearly dependent on the 1st and 2nd rows. However, your 1st and 4th columns are linearly dependent.
Two methods you could use:
Eigenvalue
If one eigenvalue of the matrix is zero, its corresponding eigenvector is linearly dependent. The documentation of eig states that the returned eigenvalues are repeated according to their multiplicity and are not necessarily ordered. However, assuming the eigenvalues correspond to your row vectors, one method would be:
import numpy as np
matrix = np.array(
[
[0, 1 ,0 ,0],
[0, 0, 1, 0],
[0, 1, 1, 0],
[1, 0, 0, 1]
])
lambdas, V = np.linalg.eig(matrix.T)
# The linearly dependent row vectors
print matrix[lambdas == 0,:]
Cauchy-Schwarz inequality
To test linear dependence of vectors and figure out which ones, you could use the Cauchy-Schwarz inequality. Basically, if the inner product of the vectors is equal to the product of the norm of the vectors, the vectors are linearly dependent. Here is an example for the columns:
import numpy as np

matrix = np.array(
    [
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 1, 0],
        [1, 0, 0, 1]
    ])

print np.linalg.det(matrix)

for i in range(matrix.shape[0]):
    for j in range(matrix.shape[0]):
        if i != j:
            inner_product = np.inner(
                matrix[:,i],
                matrix[:,j]
            )
            norm_i = np.linalg.norm(matrix[:,i])
            norm_j = np.linalg.norm(matrix[:,j])

            print 'I: ', matrix[:,i]
            print 'J: ', matrix[:,j]
            print 'Prod: ', inner_product
            print 'Norm i: ', norm_i
            print 'Norm j: ', norm_j
            if np.abs(inner_product - norm_j * norm_i) < 1E-5:
                print 'Dependent'
            else:
                print 'Independent'
Testing the rows is a similar approach.
Then you could extend this to test all combinations of vectors, but I imagine this solution scales badly with size.
With sympy you can find the linearly independent rows using sympy.Matrix.rref:
>>> import sympy
>>> import numpy as np
>>> mat = np.array([[0,1,0,0],[0,0,1,0],[0,1,1,0],[1,0,0,1]]) # your matrix
>>> _, inds = sympy.Matrix(mat).T.rref() # to check the rows you need to transpose!
>>> inds
[0, 1, 3]
This basically tells you that rows 0, 1 and 3 are linearly independent, while row 2 isn't (it's a linear combination of rows 0 and 1).
Then you could extract the independent rows with slicing:
>>> mat[inds]
array([[0, 1, 0, 0],
[0, 0, 1, 0],
[1, 0, 0, 1]])
This also works well for rectangular (not only square) matrices.
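For instance, a quick check on a rectangular matrix (a sketch, not from the original answer):
import sympy
import numpy as np

mat = np.array([[1, 2, 3, 4],
                [2, 4, 6, 8],
                [0, 1, 0, 1]])            # row 1 is twice row 0
_, inds = sympy.Matrix(mat).T.rref()
print(inds)                               # (0, 2): rows 0 and 2 are independent
print(mat[list(inds)])                    # the independent rows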
I edited the code for the Cauchy-Schwarz inequality so that it scales better with dimension: the inputs are the matrix and its dimension, while the output is a new rectangular matrix which contains along its rows the linearly independent columns of the starting matrix. This works under the assumption that the first column is never null, but it can readily be generalized to handle that case too. Another thing I observed is that 1e-5 seems to be a "sloppy" threshold, since some particular pathological vectors were found to be linearly dependent in that case; 1e-4 doesn't give me the same problems. I hope this can be of some help: it was pretty difficult for me to find a really working routine to extract linearly independent vectors, so I'm willing to share mine. If you find some bugs, please report them!
from numpy import dot, zeros, absolute
from numpy.linalg import matrix_rank, norm

def find_li_vectors(dim, R):
    r = matrix_rank(R)
    index = zeros(r, dtype=int) #this will save the positions of the li columns in the matrix
    counter = 0
    index[0] = 0 #without loss of generality we pick the first column as linearly independent
    j = 0 #therefore the second index is simply 0

    for i in range(R.shape[0]): #loop over the columns
        if i != j: #if the two columns are not the same
            inner_product = dot(R[:,i], R[:,j]) #compute the scalar product
            norm_i = norm(R[:,i]) #compute norms
            norm_j = norm(R[:,j])

            #inner product and the product of the norms are equal only if the two vectors are parallel
            #therefore we are looking for the ones which exhibit a difference which is bigger than a threshold
            if absolute(inner_product - norm_j * norm_i) > 1e-4:
                counter += 1 #counter is incremented
                index[counter] = i #index is saved
                j = i #j is refreshed
                #do not forget to refresh j: otherwise you would compute only the vectors li with the first column!!

    R_independent = zeros((r, dim))
    i = 0
    #now save everything in a new matrix
    while i < r:
        R_independent[i,:] = R[index[i],:]
        i += 1

    return R_independent
I know this was asked a while ago, but here is a very simple (although probably inefficient) solution. Given an array, the following finds a set of linearly independent vectors by progressively adding a vector and testing if the rank has increased:
from numpy.linalg import matrix_rank

def LI_vecs(dim, M):
    LI = [M[0]]
    for i in range(dim):
        tmp = []
        for r in LI:
            tmp.append(r)
        tmp.append(M[i])                #set tmp=LI+[M[i]]
        if matrix_rank(tmp) > len(LI):  #test if M[i] is linearly independent from all (row) vectors in LI
            LI.append(M[i])             #note that matrix_rank does not need to take in a square matrix
    return LI                           #return the set of linearly independent (row) vectors

#Example
mat = [[1,2,3,4], [4,5,6,7], [5,7,9,11], [2,4,6,8]]
LI_vecs(4, mat)
I interpret the problem as finding rows that are linearly independent from the other rows; equivalently, identifying the rows that are linearly dependent on the other rows.
Gaussian elimination, treating numbers smaller than a threshold as zeros, can do that. It is faster than finding the eigenvalues of a matrix, testing all combinations of rows with the Cauchy-Schwarz inequality, or singular value decomposition.
See:
https://math.stackexchange.com/questions/1297437/using-gauss-elimination-to-check-for-linear-dependence
Problem with floating point numbers:
http://numpy-discussion.10968.n7.nabble.com/Reduced-row-echelon-form-td16486.html
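A minimal sketch of this Gaussian-elimination idea (not code from the original answer): row-reduce the transpose with a pivot threshold, so the pivot columns correspond to linearly independent rows of the original matrix.
import numpy as np

def independent_rows(A, tol=1e-10):
    # Gaussian elimination on A.T with partial pivoting; pivot columns of A.T
    # correspond to linearly independent rows of A
    R = np.array(A, dtype=float).T.copy()
    pivots = []
    row = 0
    for col in range(R.shape[1]):
        if row >= R.shape[0]:
            break
        p = row + np.argmax(np.abs(R[row:, col]))       # partial pivoting
        if np.abs(R[p, col]) < tol:
            continue                                    # this row of A depends on earlier ones
        R[[row, p]] = R[[p, row]]
        R[row] /= R[row, col]
        R[row + 1:] -= np.outer(R[row + 1:, col], R[row])
        pivots.append(col)
        row += 1
    return pivots

mat = np.array([[0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 1, 0], [1, 0, 0, 1]])
print(independent_rows(mat))   # [0, 1, 3]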
With regard to the following discussion:
Find dependent rows/columns of a matrix using Matlab?
from sympy import *
A = Matrix([[1,1,1],[2,2,2],[1,7,5]])
print(A.nullspace())
It is obvious that the first and second rows are multiples of each other.
If we execute the above code we get [-1/3, -2/3, 1]. The indices of the zero elements in the null space show independence. But why is the third element here not zero? If we multiply the A matrix with the null space, we get a zero column vector. So what's wrong?
The answer which we are looking for is the null space of the transpose of A.
B = A.T
print(B.nullspace())
Now we get [-2, 1, 0], which shows that the third row is independent.
Two important notes here:
Consider whether we want to check the row dependencies or the column dependencies.
Notice that the null space of a matrix is not equal to the null space of the transpose of that matrix unless it is symmetric.
You can find the vectors spanning the column space of the matrix by using the SymPy library's columnspace() method of the Matrix object. These are automatically the linearly independent columns of the matrix.
import sympy as sp
import numpy as np

M = sp.Matrix([[0, 1, 0, 0],
               [0, 0, 1, 0],
               [1, 0, 0, 1]])

for i in M.columnspace():
    print(np.array(i))
    print()

# The output is the following.
# [[0]
#  [0]
#  [1]]
# [[1]
#  [0]
#  [0]]
# [[0]
#  [1]
#  [0]]
I would like to count how many m by n matrices whose elements are 1 or -1 have the property that all their floor(m/2)+1 by n submatrices have full rank. My current method is naive and slow; it is given in the following Python/NumPy code. It simply iterates over all matrices and tests all the submatrices.
import numpy as np
import itertools
from scipy.misc import comb

m = 8
n = 4
rowstochoose = int(np.floor(m/2)+1)
maxnumber = comb(m, rowstochoose, exact=True)
matrix_g = (np.array(x).reshape(m,n) for x in itertools.product([-1,1], repeat=m*n))

nofound = 0
for A in matrix_g:
    count = 0
    for rows in itertools.combinations(range(m), int(rowstochoose)):
        if (np.linalg.matrix_rank(A[list(rows)]) == int(min(n, rowstochoose))):
            count += 1
        else:
            break
    if (count == maxnumber):
        nofound += 1
print nofound, 2**(m*n)
Is there a better/faster way to do this? I would like to do this calculation for n and m up to 20 but any significant improvements would be great.
Context. I am interested in getting some exact solutions for https://math.stackexchange.com/questions/640780/probability-that-every-vector-is-not-orthogonal-to-half-of-the-others .
As a data point to compare implementations: n,m = 4,4 should output 26880. n,m = 5,5 is too slow for me to run. For n = 2 and m = 2,3,4,5,6 the outputs should be 8, 0, 96, 0, 1280.
Current status Feb 2, 2014:
The answer of leewangzhong is fast but is not correct for m > n. leewangzhong is considering how to fix it.
The answer of Hooked does not run for m > n.
(Now a partial solution for n = m//2+1, and the requested code.)
Let k := m//2+1
This is somewhat equivalent to asking, "How many collections of m n-dimensional vectors of {-1,1} have no linearly dependent sets of size min(k,n)?"
For those matrices, we know or can assume:
The first entry of every vector is 1 (if not, multiply the whole by -1). This reduces the count by a factor of 2**m.
All vectors in the list are distinct (if not, any submatrix with two identical vectors has non-full rank). This eliminates a lot. There are choose(2**n, m) matrices of distinct vectors.
The list of vectors are sorted lexicographically (rank isn't affected by permutations). So we're really thinking about sets of vectors instead of lists. This reduces the count by a factor of m! (because we require distinctness).
With this, we have a solution for n=4, m=8. There are only eight different vectors with the property that the first entry is positive. There is only one combination (sorted list) of 8 distinct vectors from 8 distinct vectors.
array([[ 1, 1, 1, 1],
[ 1, 1, 1, -1],
[ 1, 1, -1, 1],
[ 1, 1, -1, -1],
[ 1, -1, 1, 1],
[ 1, -1, 1, -1],
[ 1, -1, -1, 1],
[ 1, -1, -1, -1]], dtype=int8)
100 size-4 combinations from this list have rank 3. So there are 0 matrices with the property.
For a more general solution:
Note that there are 2**(n-1) vectors with first coordinate -1, and choose(2**(n-1),m) matrices to inspect. For n=8 and m=8, there are 128 vectors, and 1.4297027e+12 matrices. It might help to answer, "For i=1,...,k, how many combinations have rank i?"
Alternatively, "What kind of matrices (with the above assumptions) have less than full rank?" And I think the answer is exactly, A sufficient condition is, "Two columns are multiples of each other". I have a feeling that this is true, and I tested this for all 4x4, 5x5, and 6x6 matrices.(Must've screwed up the tests) Since the first column was chosen to be homogeneous, and since all homogeneous vectors are multiples of each other, any submatrix of size k with a homogeneous column other than the first column will have rank less than k.
This is not a necessary condition, though. The following matrix is singular (first plus fourth is equal to third plus second).
array([[ 1, 1, 1, 1, 1],
[ 1, 1, 1, 1, -1],
[ 1, 1, -1, -1, 1],
[ 1, 1, -1, -1, -1],
[ 1, -1, 1, -1, 1]], dtype=int8)
Since there are only two possible values (-1 and 1), all mxn matrices where m>2, k := m//2+1, n = k and with first column -1 have a majority member in each column (i.e. at least k members are the same). So for n=k, the answer is 0.
For n<=8, here's code to generate the vectors.
from numpy import unpackbits, arange, uint8, int8

#all distinct n-length vectors from -1,1 with first entry -1
def nvectors(n):
    if n > 8:
        raise ValueError #is that the right error?
    return -1 + 2 * (
        #explode binary numbers to arrays of 8 zeroes and ones
        unpackbits(arange(2**(n-1), dtype=uint8)) #unpackbits only takes uint
        .reshape((-1,8)) #unpackbits flattens, so we need to shape it to 8 bits
        [:,-n:] #only take the last n bits
        .view(int8) #need signed
    )
Matrix generator:
#generate all length-m matrices that are combinations of distinct n-vectors
def matrix_g(n, m):
    return (array(mat) for mat in combinations(nvectors(n), m))
The following is a function to check that all submatrices of length maxrank have full rank. It stops if any have less than maxrank, instead of checking all combinations.
rankof = np.linalg.matrix_rank

#all submatrices of at least half size have maxrank
#(we only need to check the maxrank-sized matrices)
def halfrank(matrix, maxrank):
    return all(rankof(submatr) == maxrank for submatr in combinations(matrix, maxrank))
Generate all matrices that have all half-matrices with full rank
def nicematrices(m, n):
    maxrank = min(m//2+1, n)
    return (matr for matr in matrix_g(n, m) if halfrank(matr, maxrank))
Putting it all together:
import numpy as np
from numpy import unpackbits, arange, uint8, int8, array
from itertools import combinations

#all distinct n-length vectors from -1,1 with first entry -1
def nvectors(n):
    if n > 8:
        raise ValueError #is that the right error?
    if n == 0:
        return array([])
    return -1 + 2 * (
        #explode binary numbers to arrays of 8 zeroes and ones
        unpackbits(arange(2**(n-1), dtype=uint8)) #unpackbits only takes uint
        .reshape((-1,8)) #unpackbits flattens, so we need to shape it to 8 bits
        [:,-n:] #only take the last n bits
        .view(int8) #need signed
    )

#generate all length-m matrices that are combinations of distinct n-vectors
def matrix_g(n, m):
    return (array(mat) for mat in combinations(nvectors(n), m))

rankof = np.linalg.matrix_rank

#all submatrices of at least half size have maxrank
#(we only need to check the maxrank-sized matrices)
def halfrank(matrix, maxrank):
    return all(rankof(submatr) == maxrank for submatr in combinations(matrix, maxrank))

#generate all matrices that have all half-matrices with full rank
def nicematrices(m, n):
    maxrank = min(m//2+1, n)
    return (matr for matr in matrix_g(n, m) if halfrank(matr, maxrank))

#returns (number of nice matrices, number of all matrices)
def count_nicematrices(m, n):
    from math import factorial
    return (len(list(nicematrices(m, n))) * factorial(m) * 2**m, 2**(m*n))

for i in range(0, 6):
    print(i, count_nicematrices(i, i))
count_nicematrices(5,5) takes about 15 seconds for me, the vast majority of which is taken by the matrix_rank function.
Since no one's answered yet, here's an answer without code. The useful symmetries that I see are as follows.
Multiply a row by -1.
Multiply a column by -1.
Permute the rows.
Permute the columns.
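A quick numeric spot-check of this invariance (a sketch, not part of this answer), using the rank test from the question:
import numpy as np
from itertools import combinations

def has_property(A):
    # the property from the question: every (m//2+1) x n submatrix has full rank
    m, n = A.shape
    k = m // 2 + 1
    r = min(k, n)
    return all(np.linalg.matrix_rank(A[list(rows)]) == r
               for rows in combinations(range(m), k))

rng = np.random.default_rng(0)
A = rng.choice([-1, 1], size=(6, 4))
variants = [
    A * np.array([-1, 1, 1, 1, 1, 1])[:, None],  # flip the sign of one row
    A * np.array([1, -1, 1, 1]),                 # flip the sign of one column
    A[rng.permutation(6)],                       # permute the rows
    A[:, rng.permutation(4)],                    # permute the columns
]
print(all(has_property(B) == has_property(A) for B in variants))  # True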
I would attack this problem by exhaustively generating the non-isomorphs, filtering them, and summing the sizes of their orbits. nauty will be quite useful for the first and third steps. Assuming that most matrices have few symmetries (undoubtedly an excellent assumption for n large, but it's not obvious a priori how large), I would expect 8x8 to be doable, 9x9 to be borderline, and 10x10 to be out of reach.
Expanded pseudocode:
Generate one representative of each orbit of the (m - 1) by (n - 1) 0-1 matrices acted upon by the group generated by row and column permutations, together with the size of the orbit (= (m - 1)! (n - 1)! / the size of the automorphism group). Perhaps the author of the paper that Tim linked would be willing to share his code; otherwise, see below.
For each matrix, replace entries x by (-1)^x. Add one row and one column of 1s. Multiply the size of its orbit by 2^(m + n - 1). This takes care of the sign change symmetries.
Filter the matrices and sum the orbit sizes of the ones that remain. You might save a little computation time here by implementing Gram-Schmidt yourself so that when you try all combinations in lexicographic order there's an opportunity to reuse partial results for the shared prefixes.
Isomorph-free enumeration:
McKay's template can be used to generate the representatives for (m + 1) by n 0-1 matrices from the representatives for m by n 0-1 matrices, in a manner amenable to depth-first search. With each m by n 0-1 matrix, associate a bipartite graph with m black vertices, n white vertices, and the appropriate edge for each 1 entry. Do the following for each m by n representative.
For each length-n vector, construct the graph for the (m + 1) by n matrix consisting of the representative together with the new vector and run nauty to get a canonical labeling and the vertex orbits.
Filter out the possibilities where the vertex corresponding to the new vector is in a different orbit from the black vertex with the lowest number.
Filter out the possibilities with duplicate canonical labelings.
nauty also computes the orders of automorphism groups.
You will need to rethink this problem from a mathematical point of view. That said, even with brute force, there are some programming tricks you can use to speed up the process (as SO is a programming site). Little tricks like not recalculating int(min(n,rowstochoose)) and itertools.combinations(range(m), int(rowstochoose)) can save a few percent - but the real gain comes from memoization. Others have mentioned it, but I thought it might be useful to have a complete, working code example:
import numpy as np
from scipy.misc import comb
import itertools, hashlib

m, n = 4, 4

rowstochoose = int(np.floor(m/2)+1)
maxnumber = comb(m, rowstochoose, exact=True)

combo_itr = (x for x in itertools.product([-1,1], repeat=m*n))
matrix_itr = (np.array(x, dtype=np.int8).reshape((n,m)) for x in combo_itr)

sub_shapes = map(list, (itertools.combinations(range(m), int(rowstochoose))))
required_rank = int(min(n, rowstochoose))

memo = {}

no_found = 0
for A in matrix_itr:
    check = True
    for s in sub_shapes:
        view = A[s].view(np.int8)
        h = hashlib.sha1(view).hexdigest()
        if h not in memo:
            memo[h] = np.linalg.matrix_rank(view)
        if memo[h] != required_rank:
            check = False
            break
    if check: no_found += 1

print no_found, 2**(m*n)
This gives a speed gain of almost 10x for the 4x4 case - you'll see substantial improvements for larger matrices if you care to wait long enough. It's possible for the larger matrices, where the cost of the rank is proportionally more expensive, that you can order the matrices ahead of time on the hashing:
idx = np.lexsort(view.T)
h = hashlib.sha1(view[idx]).hexdigest()
For the 4x4 case this makes it slightly worse, but I expect that to reverse for the 5x5 case.
Algorithm 1 - memoizing small ones
I would memoize the already-checked smaller matrices.
You could simply write down all smaller matrices in binary format (0 for -1, 1 for 1). BTW, you can directly check the ranks of matrices of (0 and 1) instead of (-1 and 1) - it is the same. Let us call these encodings IMAGES. Using long types you can have matrices of up to 64 cells, so up to 8x8. It is fast. Using strings you can have them as large as you need.
Really, 8x8 is more than enough - in 8 GB of memory we can place 1G longs. That is about 2^30 entries, so you can remember all matrices of up to about 25-28 cells.
For every size you'll have a set of images:
for 2x2: 1001, 0110, 1000, 0100, 0010, 0001, 0111, 1011, 1101, 1110.
So, you'll have archive=array of NxN, each element of which will be an ordered list of binary images of good matrices.
- (for matrix size MxN, where M>=N, the appropriate place in archive will have coordinates M,N. If M
When you are checking a new large matrix, divide it into small ones.
For every small matrix T:
If the appropriate place in the archive for the size of T has no list, create it, fill it with the images of all full-rank matrices of that size, and sort the images. If you run out of memory, stop filling the archive.
If T can be in the archive, according to its size:
Make the image of T.
Look for image(T) in the list - if it is there, it is OK; if not, the large matrix should be thrown out.
If T is too big for the archive, check it directly, as you do now.
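A minimal sketch of the "image" idea in Python (assuming the same full-rank test as in the question; not code from this answer): encode each +-1 submatrix as an integer bit pattern and memoise the rank check on that key.
import numpy as np
from functools import lru_cache

def image(sub):
    # pack a +-1 matrix into (rows, cols, bits): map -1 -> 0, 1 -> 1, row-major
    bits = 0
    for v in sub.flat:
        bits = (bits << 1) | (1 if v == 1 else 0)
    return sub.shape[0], sub.shape[1], bits

@lru_cache(maxsize=None)
def image_is_full_rank(rows, cols, bits):
    # decode the image back to a +-1 matrix and test its rank (cached per image)
    flat = [1 if (bits >> i) & 1 else -1 for i in reversed(range(rows * cols))]
    sub = np.array(flat).reshape(rows, cols)
    return np.linalg.matrix_rank(sub) == min(rows, cols)

def full_rank(sub):
    return image_is_full_rank(*image(sub))
With this, repeated submatrices across different large matrices are rank-tested only once.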
Algorithm 2 - increasing sizes
The other possibility is to create larger matrices by adding pieces to the smaller ones already found.
You should decide up to what size the matrices will grow.
When you find a "correct" matrix of size MxN, try to add a row on top of it. New matrices should be checked only for the submatrices that include the new row. The same goes for a new column.
You should fix an exact schedule of which sizes are derived from which ones. Thus you can minimize the number of remembered matrices. I thought about this sequence:
Start from 2x2 matrices.
continue with 3x2
4x2, 3x3
5x2, 4x3
6x2, 5x3, 4x4
7x2, 6x3, 5x4
...
So you can remember only (M+N)/2-1 matrices for searching among sizes MxN.
If, each time we can create a new size from two old ones, we derive it from the more square one, we can also greatly reduce the space needed for remembering matrices: for "long" matrices such as 7x2 we only need to remember and check the last 1x2 line. For 6x3 matrices we should remember their 2x3 stub, and so on.
Also, you don't need to remember the largest matrices - you won't use them for further counting.
Again use "images" for remembering the matrix.