Cosine similarity for very large dataset - python

I am having trouble calculating cosine similarity over a large list of 100-dimensional vectors. When I use from sklearn.metrics.pairwise import cosine_similarity, I get a MemoryError on my 16 GB machine. Each array fits comfortably in memory on its own, but I get a MemoryError during the internal np.dot() call.
Here's my use-case and how I am currently tackling it.
Here's my 100-dimensional parent vector, which I need to compare with 500,000 other vectors of the same dimension (i.e. 100):
parent_vector = [1, 2, 3, 4 ..., 100]
Here are my child vectors (with some made-up random numbers for this example)
child_vector_1 = [2, 3, 4, ....., 101]
child_vector_2 = [3, 4, 5, ....., 102]
child_vector_3 = [4, 5, 6, ....., 103]
.......
.......
child_vector_500000 = [3, 4, 5, ....., 103]
My final goal is to get the top-N child vectors (with their names, such as child_vector_1, and their corresponding cosine scores) that have the highest cosine similarity with the parent vector.
My current approach (which I know is inefficient and memory consuming):
Step 1: Create a super-dataframe of the following shape:
parent_vector 1, 2, 3, ....., 100
child_vector_1 2, 3, 4, ....., 101
child_vector_2 3, 4, 5, ....., 102
child_vector_3 4, 5, 6, ....., 103
......................................
child_vector_500000 3, 4, 5, ....., 103
Step 2: Use
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df)
to get the pairwise cosine similarity between all vectors (shown in the dataframe above).
Step 3: Make a list of tuples storing the key (such as child_vector_1) and the value (the cosine similarity number) for all such combinations.
Step 4: Get the top-N by sorting the list -- so that I get the child vector name as well as its cosine similarity score with the parent vector.
PS: I know this is highly inefficient, but I couldn't think of a better way to compute the cosine similarity between each child vector and the parent vector faster and get the top-N values.
Any help would be highly appreciated.

Even though your (500000, 100) array (the parent and its children) fits into memory, any pairwise metric on it won't. The reason is that a pairwise metric, as the name suggests, computes the distance between every pair of rows. To store these distances you would need a (500000, 500000) array of floats, which at 8 bytes per float64 comes to roughly 2 TB of memory.
Thankfully there is an easy solution for your problem. If I understand you correctly, you only want the distance between each child and the parent, which results in a vector of length 500,000 that is easily stored in memory.
To do this, you simply need to provide a second argument to cosine_similarity containing only the parent_vector
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
df = pd.DataFrame(np.random.rand(500000,100))
# Here I assume that the parent vector is stored as the first row of the dataframe,
# but you could also store it separately. Despite the column name, these values are
# cosine similarities (higher = more similar).
df['distances'] = cosine_similarity(df, df.iloc[0:1]).ravel()
n = 10  # or however many you want
n_largest = df['distances'].nlargest(n + 1)  # the parent itself shows up as the most similar entry, hence n+1 to get n children
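If the rows carry names (for example as the dataframe index), the same nlargest result gives you both the names and the scores. A minimal sketch, assuming the index holds labels like 'parent_vector' and 'child_vector_1' (these labels are illustrative, not part of the code above):
top_children = n_largest.drop(labels=['parent_vector'], errors='ignore').head(n)
for name, score in top_children.items():
    print(name, score)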
Hope that solves your problem.

I couldn't even fit the entire corpus in memory, so a solution for me was to load it gradually and compute cosine similarity on smaller batches, always retaining the least/most similar n items (depending on your use case):
import numpy as np
from scipy.sparse import csr_matrix

def get_bottom_k(corpus: list, k: int):
    # make_similarity_matrix is your own routine that returns a sparse pairwise
    # similarity matrix for the batch
    pairwise_similarity = make_similarity_matrix(corpus)
    sums = csr_matrix.sum(pairwise_similarity, axis=1)  # similarity index for each item in corpus; bigger = more similar to the other texts
    sums = np.squeeze(np.asarray(sums))
    indexes = np.argpartition(sums, k, axis=0)[:k]      # bottom k in terms of similarity (use -k and [-k:] for the top k)
    return [corpus[i] for i in indexes]

data = []
iterations = 0
with open('/media/corpus.txt', 'r') as f:
    for line in f:
        data.append(line)
        if len(data) <= 1000:
            pass
        else:
            print('Getting bottom k, iteration {x}'.format(x=iterations))
            data = get_bottom_k(data, 500)
            iterations += 1
filtered = get_bottom_k(data, 500)  # final 500 most dissimilar texts in the corpus
This is by no means an optimal solution, but it's the easiest I found so far; maybe it will be of help to someone.

This solution is insanely fast:
child_vectors = np.array([child_vector_1, child_vector_2, ....., child_vector_500000])
parent_vector = np.array(parent_vector).reshape(1, -1)  # shape (1, 100)
input_norm = parent_vector / np.linalg.norm(parent_vector, axis=-1, keepdims=True)
embed_norm = child_vectors / np.linalg.norm(child_vectors, axis=-1, keepdims=True)
cosine_similarities = np.sort(np.round(np.dot(input_norm, embed_norm.T), 3)[0])[::-1]
pairwise_distances = 1 - cosine_similarities
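One caveat: np.sort discards which child each score belongs to. If you need the top-N indices (and hence the child names) rather than just the sorted scores, np.argsort keeps that mapping; a small sketch, where top_n is an illustrative value:
scores = np.dot(input_norm, embed_norm.T)[0]   # one similarity per child, in original order
top_n = 10                                     # illustrative value
top_idx = np.argsort(scores)[::-1][:top_n]     # indices of the most similar children
top_scores = scores[top_idx]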


How to handle missing data in KNN without imputing?

I'm working on an assignment where I need to do KNN Regression using the sklearn library--but, if I have missing data (assume it's missing-at-random) I am not supposed to impute it. Instead, I have to leave it as null and somehow in my code account for it to ignore comparisons where one value is null.
For example, if my observations are (1, 2, 3, 4, null, 6) and (1, null, 3, 4, 5, 6), then I would ignore both the second and the fifth entries.
Is this possible with the sklearn library?
ETA: I would just drop the null values, but I won't know what the data looks like that they'll be testing and it could end up dropping anywhere between 0% and 99% of the data.
This depends a little on what exactly you're trying to do.
Ignore all columns with nulls: I imagine this isn't what you're asking since that's more of a data pre-processing step and isn't really unique to sklearn. Even in pure python, just search for column indices containing nulls and construct a new data set with those indices filtered out.
Ignore null values in vector comparisons: This one is actually kind of fun. Essentially you're saying that the distance between [1, 2, 3, 4, None, 6] and [1, None, 3, 4, 5, 6] should be computed only over the coordinates where both values are present (here the 1st, 3rd, 4th, and 6th), dropping the second and fifth coordinates before applying the usual Euclidean formula. In this case you need some kind of custom metric, which sklearn supports. Unfortunately you can't input null values into the KNN fit() method, so even with a custom metric you can't quite get what you want. The solution is to pre-compute distances. E.g.:
from math import sqrt, isfinite
from sklearn.neighbors import KNeighborsRegressor

X_train = [
    [1, 2, 3, 4, None, 6],
    [1, None, 3, 4, 5, 6],
]
y_train = [3.14, 2.72]  # we're regressing something

def euclidean(p, q):
    # Could also use numpy routines
    return sqrt(sum((x-y)**2 for x, y in zip(p, q)))

def is_num(x):
    # The `is not None` check needs to happen first because of short-circuiting
    return x is not None and isfinite(x)

def restricted_points(p, q):
    # Returns copies of `p` and `q` without the coordinates where either vector
    # is None, inf, or nan
    return tuple(zip(*[(x, y) for x, y in zip(p, q) if all(map(is_num, (x, y)))]))

def dist(p, q):
    # Note that in this form you can use any metric you like on the
    # restricted vectors, not just the euclidean metric
    return euclidean(*restricted_points(p, q))

dists = [[dist(p, q) for p in X_train] for q in X_train]

knn = KNeighborsRegressor(
    n_neighbors=1,  # only needed in our test example since we have so few data points
    metric='precomputed'
)
knn.fit(dists, y_train)

X_test = [
    [1, 2, 3, None, None, 6],
]

# We tell sklearn which points in the knn graph to use by telling it how far
# our queries are from every input. This is super inefficient.
predictions = knn.predict([[dist(q, p) for p in X_train] for q in X_test])
There's still an open question of what to do if you have nulls in the outputs you're regressing to, but your problem statement doesn't make it sound like that's an issue for you.
This should work:
import pandas as pd
df = pd.read_csv("your_data.csv")
df.dropna(inplace = True)

Efficient Histogram of Differences for sparse Data

I want to compute a histogram of the differences between all the elements in one array A with all the elements in another array B.
So I want to have a histogram of the following data:
Delta1 = A1-B1
Delta2 = A1-B2
Delta3 = A1-B3
...
DeltaN = A2-B1
DeltaN+1 = A2-B2
DeltaN+2 = A2-B3
...
The point of this calculation is to show that these data have a correlation, even though not every data point has a "partner" in the other array and the correlation is rather noisy in practice.
The problem is that in practice these files are very large (several GB), and all entries of the vectors are 64-bit integers with very large differences.
It seems infeasible to me to convert these data to binary arrays in order to be able to use correlation functions and Fourier transforms to compute this.
Here is a small example to give a better taste of what I'm looking at.
This implementation with numpy's searchsorted in a for loop is rather slow.
import numpy as np
import matplotlib.pyplot as plt

timetagsA = [668656283,974986989,1294941174,1364697327,\
    1478796061,1525549542,1715828978,2080480431,2175456303,2921498771,3671218524,\
    4186901001,4444689281,5087334517,5467644990,5836391057,6249837363,6368090967,8344821453,\
    8933832044,9731229532]
timetagsB = [13455,1294941188,1715828990,2921498781,5087334530,5087334733,6368090978,9731229545,9731229800,9731249954]

max_delta_t = 500
nbins = 10000

histo = np.zeros((nbins,2), dtype = float)
histo[:,0] = np.arange(0,nbins)

for i in range(0,int(len(timetagsA))):
    delta_t = 0
    j = np.searchsorted(timetagsB,timetagsA[i])
    while (np.round(delta_t) < max_delta_t and j<len(timetagsB)):
        delta_t = timetagsB[j] - timetagsA[i]
        if(delta_t<max_delta_t):
            histo[int(delta_t),1]+=1
        j = j+1

plt.plot(histo[0:50,1])
plt.show()
It would be great if someone could help me to find a faster way to compute this. Thanks in advance!
EDIT
The original solution below assumes that your data is so huge that you cannot simply use np.subtract.outer and then slice out the values you want to keep. If the data does fit in memory, you can just do:
arr_diff = np.subtract.outer(arrB, arrA)
print (arr_diff[(0<arr_diff) & (arr_diff<max_delta_t)])
# array([ 14, 12, 10, 13, 216, 11, 13, 268], dtype=int64)
This works with your example data, but not with a very large data set.
ORIGINAL SOLUTION
Let's first suppose your max_delta_t is smaller than the smallest difference between two successive values in timetagsB, for an easy way of doing it (then we can try to generalize).
#create the array instead of list
arrA = np.array(timetagsA)
arrB = np.array(timetagsB)
max_delta_t = np.diff(arrB).min() - 1 #here it's 202 just for the explanation
You can use np.searchsorted in a vectorized way:
# create the array of search
arr_search = np.searchsorted(arrB, arrA) # the position of each element of arrA in arrB
print (arr_search)
# array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 6, 6, 6, 6, 7, 7, 7],dtype=int64)
You can calculate the difference between the element of arrB corresponding to each element of arrA by slicing arrB with arr_search
# calculate the difference
arr_diff = arrB[arr_search] - arrA
print (arr_diff[arr_diff<max_delta_t]) # find the ones smaller than max_delta_t
# array([14, 12, 10, 13, 11, 13], dtype=int64)
So what you are looking for is then calculated by np.bincount
arr_bins = np.bincount(arr_diff[arr_diff<max_delta_t])
#to make it look like histo but not especially necessary
histo = np.array([range(len(arr_bins)),arr_bins]).T
Now the problem is that there are some difference values between arrA and arrB that cannot be obtained with this method when max_delta_t is bigger than the gap between two successive values in arrB. Here is one way to handle it, maybe not the most efficient depending on the values of your data, that works for any value of max_delta_t:
#need an array with the number of elements in arrB for each element of arrA
# within a max_delta_t range
arr_diff_search = np.searchsorted(arrB, arrA + max_delta_t)- np.searchsorted(arrB, arrA)
#do a loop to calculate all the values you are interested in
list_arr = []
for i in range(arr_diff_search.max()+1):
    arr_diff = arrB[(arr_search+i)%len(arrB)][(arr_diff_search>=i)] - arrA[(arr_diff_search>=i)]
    list_arr.append(arr_diff[(0<arr_diff)&(arr_diff<max_delta_t)])
Now you can np.concatenate the list_arr and use np.bincount such as:
arr_bins = np.bincount(np.concatenate(list_arr))
histo = np.array([range(len(arr_bins)),arr_bins]).T
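For reference, the windowed-searchsorted idea above can be wrapped into one self-contained function. This is just a sketch in the same spirit as the loop above, avoiding the modulo wrap by masking on the per-window counts; the function name delta_histogram is mine:
import numpy as np

def delta_histogram(timetagsA, timetagsB, max_delta_t):
    # Histogram of (b - a) over all pairs with 0 <= b - a < max_delta_t,
    # assuming timetagsB is sorted (as in the example above).
    arrA = np.asarray(timetagsA)
    arrB = np.asarray(timetagsB)
    left = np.searchsorted(arrB, arrA)                  # first B >= each A
    right = np.searchsorted(arrB, arrA + max_delta_t)   # first B >= A + max_delta_t
    counts = right - left                               # how many B fall in each window
    deltas = []
    for i in range(counts.max()):
        mask = counts > i                               # windows that still have an i-th element
        deltas.append(arrB[left[mask] + i] - arrA[mask])
    deltas = np.concatenate(deltas) if deltas else np.array([], dtype=arrA.dtype)
    return np.bincount(deltas, minlength=max_delta_t)

# hist = delta_histogram(timetagsA, timetagsB, 500); hist[d] counts the pairs with difference d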

How to cluster very big sparse data set using low memory in Python?

I have data which forms a sparse matrix of shape 1000 x 1e9. I want to cluster the 1000 examples into 10 clusters using K-means.
The matrix is very sparse: fewer than 1 in 1e6 of the entries are non-zero.
My laptop has 16 GB of RAM. I tried a scipy sparse matrix. Unfortunately, the matrix makes the clustering process need much more memory than I have. Could anyone suggest a way to do this?
My system crashed when running the following test snippet
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
row = np.array([0, 0, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 7, 8, 8, 8])
col = np.array([0, 2, 2, 0, 1, 2] * 3)
data = np.array([1, 2, 3, 4, 5, 6] * 3)
X = csr_matrix((data, (row, col)), shape=(9, int(1e9)))
resC = KMeans(n_clusters=3).fit(X)
resC.labels_
Any helpful suggestion is appreciated.
KMeans centers will not be sparse anymore, so this would need careful optimization for the sparse case (that may be costly for the usual case, so it probably isn't optimized this way).
You can try ELKI (not Python but Java), which often is much faster and also has sparse data types. Using single-precision floats will also help.
But in the end, the results will be questionable: k-means is statistically rooted in least squares. It assumes your data comes from k signals plus some Gaussian error. Because your data is sparse, it obviously does not have this kind of Gaussian shape. When the majority of values are 0, it cannot be Gaussian.
With just 1000 data points, I'd rather use HAC (hierarchical agglomerative clustering).
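Since there are only 1000 samples, hierarchical clustering can work from a precomputed 1000x1000 distance matrix and never needs dense centroids. A minimal sketch using scikit-learn's AgglomerativeClustering as the HAC implementation, assuming the data is already in a CSR matrix X; the random stand-in data, cosine distance, and complete linkage are illustrative choices, not part of the answer above:
from scipy.sparse import random as sparse_random
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.cluster import AgglomerativeClustering

X = sparse_random(1000, 10000, density=1e-3, format='csr', random_state=0)  # stand-in for your sparse data
D = pairwise_distances(X, metric='cosine')  # dense, but only 1000 x 1000 (~8 MB)

# note: this parameter is called 'affinity' instead of 'metric' in older scikit-learn versions
hac = AgglomerativeClustering(n_clusters=10, metric='precomputed', linkage='complete')
labels = hac.fit_predict(D)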
Whatever you do (for your data, given your memory constraints): k-means is not ready for that!
This includes:
Online KMeans / MiniBatchKMeans, as proposed in another answer: it only helps with handling many samples (and suffers from the same effect mentioned below)!
Various KMeans implementations in different languages: it's an algorithmic problem, not something bound to one implementation.
Ignoring potential theoretical reasons (high dimensionality and non-convex heuristic optimization), I'm just mentioning the practical problem here:
your centroids might become non-sparse! (mentioned in a sidenote by SO's clustering expert; that link also mentions alternatives!)
This means the sparse data structures used will become very non-sparse and eventually blow up your memory!
(I changed sklearn's code to observe what the link above already mentions.)
The relevant sklearn code: center_shift_total = squared_norm(centers_old - centers)
Even if you remove / turn off all the memory-heavy components like:
init=some_sparse_ndarray (instead of k-means++)
n_init=1 instead of 10
precompute_distances=False instead of True (unclear if it helps)
n_jobs=1 instead of -1
the problem above will remain!
Although KMeans accepts sparse matrices as input, the centroids used within the algorithm have a dense representation, and your feature space is so big that even 10 centroids will not fit into 16 GB of RAM (10 centroids x 1e9 features x 8 bytes per float64 is already about 80 GB).
I have 2 ideas:
Can you fit the clustering into RAM if you discard all empty columns? If you have 1000 samples and only about 1 in 1e6 values are occupied, then probably fewer than 1 in 1000 columns will contain any non-zero entries.
Several clustering algorithms in scikit-learn will accept a matrix of distances between samples instead of the full data, e.g. sklearn.cluster.SpectralClustering. You could precompute the pairwise distances in a 1000x1000 matrix and pass that to your clustering algorithm instead. (I can't make a specific recommendation of a clustering method, or a suitable function to calculate the distances, as it will depend on your application.) A sketch of both ideas follows below.
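A minimal sketch of both ideas, assuming the data is already in a CSR matrix X; the random stand-in data, the euclidean distances, and the distance-to-affinity conversion are illustrative assumptions, not recommendations from the answer above:
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.cluster import SpectralClustering

X = sparse_random(1000, 10**7, density=1e-6, format='csr', random_state=0)  # stand-in for your data

# Idea 1: drop all-empty columns; the result has at most nnz(X) columns.
nonzero_cols = np.unique(X.nonzero()[1])
X_reduced = X[:, nonzero_cols]

# Idea 2: precompute a small (1000 x 1000) distance matrix and cluster on that.
D = pairwise_distances(X_reduced, metric='euclidean')
sc = SpectralClustering(n_clusters=10, affinity='precomputed')
labels = sc.fit_predict(np.exp(-D / D.mean()))  # SpectralClustering expects an affinity (similarity), not a distance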
Consider using a dict, since it will only store the values which were assigned. I guess a nice way to do this is by creating a SparseMatrix object like this:
class SparseMatrix(dict):
    def __init__(self, mapping=[]):
        dict.__init__(self, {i: mapping[i] for i in range(len(mapping))})

    #overriding this method makes never-accessed indexes return 0.0
    def __getitem__(self, i):
        try:
            return dict.__getitem__(self, i)
        except KeyError:
            return 0.0

>>> my_matrix = SparseMatrix([1,2,3])
>>> my_matrix[0]
1
>>> my_matrix[5]
0.0
Edit:
For the multi-dimensional case you may need to override the two item-management methods as follows:
    def __setitem__(self, ij, value):
        i, j = ij
        dict.__setitem__(self, i*self.n + j, value)

    def __getitem__(self, ij):
        try:
            i, j = ij
            return dict.__getitem__(self, i*self.n + j)
        except KeyError:
            return 0.0

>>> my_matrix[0,0] = 10
>>> my_matrix[1,2]
0.0
>>> my_matrix[0,0]
10
This also assumes you defined self.n as the length of the matrix rows.

Printing principal features in clusters (python)

I have an m x n matrix, with m features and n samples. The matrix is called term_individual. The clustering is done using scikit-learn:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(term_individual.T)
centroids = kmeans.cluster_centers_.squeeze()
labels = kmeans.labels_
Each sample is a vector filled with positive integer numbers. If the i-th component of a sample is n, it means that the i-th feature is present n times in that sample.
I would like to know the most representative features of each cluster. For instance, suppose that the i-th feature is present many times in the first and second sample, causing these samples to be in the same cluster along with many others in which the i-th feature is present as well. I would like to print that feature (or the index associated with it, i.e. print i).
I appreciate your help.
It appears that the question you're asking is which features are most important to each cluster. Essentially you can just look at the mean z-scored value of each feature within each cluster:
from sklearn.preprocessing import scale
import numpy as np

def cluster_feature_importance(X, Y):
    # X: samples x features (i.e. term_individual.T), Y: cluster label per sample
    N, M = X.shape
    X = scale(X)
    Y = np.asarray(Y)
    out = {}
    for c in set(Y):
        # mean z-scored value of each feature over the samples in cluster c
        out[c] = dict(zip(range(M), np.mean(X[Y == c, :], axis=0)))
    return out
Here X is your matrix term_individual.T (samples as rows, matching what you passed to fit) and Y is a list saying which cluster each sample belongs to, e.g. [0, 0, 1, 1, 0, 3, 2, 2, 3, 0, ...], where Y is n long (kmeans.labels_ gives you exactly this).
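Alternatively, since KMeans already exposes its centroids, you can read the most representative features straight off kmeans.cluster_centers_. A minimal sketch (top_k and the print format are my own choices):
import numpy as np

top_k = 5  # how many representative features to print per cluster
for c, center in enumerate(kmeans.cluster_centers_):
    top_features = np.argsort(center)[::-1][:top_k]  # feature indices with the largest centroid values
    print(c, top_features)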

Count how many matrices have full rank for all submatrices

I would like to count how many m by n matrices whose elements are 1 or -1 have the property that all of their floor(m/2)+1 by n submatrices have full rank. My current method is naive and slow and is given in the following Python/NumPy code. It simply iterates over all matrices and tests all the submatrices.
import numpy as np
import itertools
from scipy.misc import comb

m = 8
n = 4

rowstochoose = int(np.floor(m/2)+1)
maxnumber = comb(m, rowstochoose, exact = True)

matrix_g = (np.array(x).reshape(m,n) for x in itertools.product([-1,1], repeat = m*n))

nofound = 0
for A in matrix_g:
    count = 0
    for rows in itertools.combinations(range(m), int(rowstochoose)):
        if (np.linalg.matrix_rank(A[list(rows)]) == int(min(n,rowstochoose))):
            count+=1
        else:
            break
    if (count == maxnumber):
        nofound+=1

print nofound, 2**(m*n)
Is there a better/faster way to do this? I would like to do this calculation for n and m up to 20 but any significant improvements would be great.
Context. I am interested in getting some exact solutions for https://math.stackexchange.com/questions/640780/probability-that-every-vector-is-not-orthogonal-to-half-of-the-others .
As data points to compare implementations: n,m = 4,4 should output 26880. n,m = 5,5 is too slow for me to run. For n = 2 and m = 2,3,4,5,6 the outputs should be 8, 0, 96, 0, 1280.
Current status Feb 2, 2014:
The answer of leewangzhong is fast but is not correct for m > n. leewangzhong is considering how to fix it.
The answer of Hooked does not run for m > n.
(Now a partial solution for n = m//2+1, and the requested code.)
Let k := m//2+1
This is somewhat equivalent to asking, "How many collections of m n-dimensional vectors of {-1,1} have no linearly dependent sets of size min(k,n)?"
For those matrices, we know or can assume:
The first entry of every vector is 1 (if not, multiply the whole by -1). This reduces the count by a factor of 2**m.
All vectors in the list are distinct (if not, any submatrix with two identical vectors has non-full rank). This eliminates a lot. There are choose(2**m,n) matrices of distinct vectors.
The list of vectors are sorted lexicographically (rank isn't affected by permutations). So we're really thinking about sets of vectors instead of lists. This reduces the count by a factor of m! (because we require distinctness).
With this, we have a solution for n=4, m=8. There are only eight different vectors with the property that the first entry is positive. There is only one combination (sorted list) of 8 distinct vectors from 8 distinct vectors.
array([[ 1, 1, 1, 1],
[ 1, 1, 1, -1],
[ 1, 1, -1, 1],
[ 1, 1, -1, -1],
[ 1, -1, 1, 1],
[ 1, -1, 1, -1],
[ 1, -1, -1, 1],
[ 1, -1, -1, -1]], dtype=int8)
Some of the size-4 combinations from this list have rank 3 (for example the four vectors whose second entry is 1, since their first two columns are identical). So there are 0 matrices with the property.
For a more general solution:
Note that there are 2**(n-1) vectors with first coordinate -1, and choose(2**(n-1),m) matrices to inspect. For n=8 and m=8, there are 128 vectors, and 1.4297027e+12 matrices. It might help to answer, "For i=1,...,k, how many combinations have rank i?"
Alternatively, "What kind of matrices (with the above assumptions) have less than full rank?" And I think the answer is exactly, A sufficient condition is, "Two columns are multiples of each other". I have a feeling that this is true, and I tested this for all 4x4, 5x5, and 6x6 matrices.(Must've screwed up the tests) Since the first column was chosen to be homogeneous, and since all homogeneous vectors are multiples of each other, any submatrix of size k with a homogeneous column other than the first column will have rank less than k.
This is not a necessary condition, though. The following matrix is singular (first plus fourth is equal to third plus second).
array([[ 1, 1, 1, 1, 1],
[ 1, 1, 1, 1, -1],
[ 1, 1, -1, -1, 1],
[ 1, 1, -1, -1, -1],
[ 1, -1, 1, -1, 1]], dtype=int8)
Since there are only two possible values (-1 and 1), all mxn matrices where m>2, k := m//2+1, n = k and with first column -1 have a majority member in each column (i.e. at least k members are the same). So for n=k, the answer is 0.
For n<=8, here's code to generate the vectors.
from numpy import unpackbits, arange, uint8, int8

#all distinct n-length vectors from -1,1 with first entry -1
def nvectors(n):
    if n > 8:
        raise ValueError #is that the right error?
    return -1 + 2 * (
        #explode binary numbers to arrays of 8 zeroes and ones
        unpackbits(arange(2**(n-1), dtype=uint8)) #unpackbits only takes uint
        .reshape((-1,8))  #unpackbits flattens, so we need to reshape it to 8 bits
        [:,-n:]           #only take the last n bits
        .view(int8)       #need signed
    )
Matrix generator:
#generate all length-m matrices that are combinations of distinct n-vectors
def matrix_g(n,m):
    return (array(mat) for mat in combinations(nvectors(n),m))
The following is a function to check that all submatrices of size maxrank have full rank. It stops as soon as one has rank less than maxrank, instead of checking all combinations.
rankof = np.linalg.matrix_rank
#all submatrices of at least half size have maxrank
#(we only need to check the maxrank-sized matrices)
def halfrank(matrix,maxrank):
    return all(rankof(submatr) == maxrank for submatr in combinations(matrix,maxrank))
Generate all matrices that have all half-matrices with full rank:
def nicematrices(m,n):
    maxrank = min(m//2+1,n)
    return (matr for matr in matrix_g(n,m) if halfrank(matr,maxrank))
Putting it all together:
import numpy as np
from numpy import unpackbits, arange, uint8, int8, array
from itertools import combinations

#all distinct n-length vectors from -1,1 with first entry -1
def nvectors(n):
    if n > 8:
        raise ValueError #is that the right error?
    if n == 0:
        return array([])
    return -1 + 2 * (
        #explode binary numbers to arrays of 8 zeroes and ones
        unpackbits(arange(2**(n-1), dtype=uint8)) #unpackbits only takes uint
        .reshape((-1,8))  #unpackbits flattens, so we need to reshape it to 8 bits
        [:,-n:]           #only take the last n bits
        .view(int8)       #need signed
    )

#generate all length-m matrices that are combinations of distinct n-vectors
def matrix_g(n,m):
    return (array(mat) for mat in combinations(nvectors(n),m))

rankof = np.linalg.matrix_rank

#all submatrices of at least half size have maxrank
#(we only need to check the maxrank-sized matrices)
def halfrank(matrix,maxrank):
    return all(rankof(submatr) == maxrank for submatr in combinations(matrix,maxrank))

#generate all matrices that have all half-matrices with full rank
def nicematrices(m,n):
    maxrank = min(m//2+1,n)
    return (matr for matr in matrix_g(n,m) if halfrank(matr,maxrank))

#returns (number of nice matrices, number of all matrices)
def count_nicematrices(m,n):
    from math import factorial
    return (len(list(nicematrices(m,n)))*factorial(m)*2**m, 2**(m*n))

for i in range(0,6):
    print (i, count_nicematrices(i,i))
count_nicematrices(5,5) takes about 15 seconds for me, the vast majority of which is taken by the matrix_rank function.
Since no one's answered yet, here's an answer without code. The useful symmetries that I see are as follows.
Multiply a row by -1.
Multiply a column by -1.
Permute the rows.
Permute the columns.
I would attack this problem by exhaustively generating the non-isomorphs, filtering them, and summing the sizes of their orbits. nauty will be quite useful for the first and third steps. Assuming that most matrices have few symmetries (undoubtedly an excellent assumption for n large, but it's not obvious a priori how large), I would expect 8x8 to be doable, 9x9 to be borderline, and 10x10 to be out of reach.
Expanded pseudocode:
Generate one representative of each orbit of the (m - 1) by (n - 1) 0-1 matrices acted upon by the group generated by row and column permutations, together with the size of the orbit (= (m - 1)! (n - 1)! / the size of the automorphism group). Perhaps the author of the paper that Tim linked would be willing to share his code; otherwise, see below.
For each matrix, replace entries x by (-1)^x. Add one row and one column of 1s. Multiply the size of its orbit by 2^(m + n - 1). This takes care of the sign change symmetries. (A small sketch of this step follows the outline.)
Filter the matrices and sum the orbit sizes of the ones that remain. You might save a little computation time here by implementing Gram--Schmidt yourself so that when you try all combinations in lexicographic order there's an opportunity to reuse partial results for the shared prefixes.
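A minimal sketch of step 2 above, assuming the representatives come in as 0/1 numpy arrays of shape (m-1) x (n-1) together with their orbit sizes (the function name and arguments are mine):
import numpy as np

def expand_representative(rep01, orbit_size):
    # rep01: (m-1) x (n-1) array of 0s and 1s; map x -> (-1)**x,
    # border it with a row and a column of 1s, and scale the orbit size
    # by 2^(m + n - 1) to account for the sign-change symmetries.
    mm1, nm1 = rep01.shape
    signed = 1 - 2 * rep01                        # 0 -> 1, 1 -> -1
    full = np.ones((mm1 + 1, nm1 + 1), dtype=int)
    full[1:, 1:] = signed
    m, n = mm1 + 1, nm1 + 1
    return full, orbit_size * 2 ** (m + n - 1)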
Isomorph-free enumeration:
McKay's template can be used to generate the representatives for (m + 1) by n 0-1 matrices from the representatives for m by n 0-1 matrices, in a manner amenable to depth-first search. With each m by n 0-1 matrix, associate a bipartite graph with m black vertices, n white vertices, and the appropriate edge for each 1 entry. Do the following for each m by n representative.
For each length-n vector, construct the graph for the (m + 1) by n matrix consisting of the representative together with the new vector and run nauty to get a canonical labeling and the vertex orbits.
Filter out the possibilities where the vertex corresponding to the new vector is in a different orbit from the black vertex with the lowest number.
Filter out the possibilities with duplicate canonical labelings.
nauty also computes the orders of automorphism groups.
You will need to rethink this problem from a mathematical point of view. That said even with brute force, there are some programming tricks you can use to speed up the process (as SO is a programming site). Little tricks like not recalculating int(min(n,rowstochoose)) and itertools.combinations(range(m), int(rowstochoose)) can save a few percent - but the real gain comes from memoization. Others have mentioned it, but I thought it might be useful to have a complete, working, code example:
import numpy as np
from scipy.misc import comb
import itertools, hashlib

m,n = 4,4

rowstochoose = int(np.floor(m/2)+1)
maxnumber = comb(m, rowstochoose, exact = True)

combo_itr = (x for x in itertools.product([-1,1], repeat = m*n))
matrix_itr = (np.array(x,dtype=np.int8).reshape((n,m)) for x in combo_itr)
sub_shapes = map(list,(itertools.combinations(range(m), int(rowstochoose))))
required_rank = int(min(n,rowstochoose))

memo = {}
no_found = 0
for A in matrix_itr:
    check = True
    for s in sub_shapes:
        view = A[s].view(np.int8)
        h = hashlib.sha1(view).hexdigest()
        if h not in memo:
            memo[h] = np.linalg.matrix_rank(view)
        if memo[h] != required_rank:
            check = False
            break
    if check: no_found += 1

print no_found, 2**(m*n)
This gives a speed gain of almost 10x for the 4x4 case - you'll see substantial improvements for larger matrices if you care to wait long enough. For larger matrices, where the rank computation is proportionally more expensive, it may pay to sort the rows ahead of time before hashing:
idx = np.lexsort(view.T)
h = hashlib.sha1(view[idx]).hexdigest()
For the 4x4 case this makes it slightly worse, but I expect that to reverse for the 5x5 case.
Algorithm 1 - memoizing the small ones
I would memoize the already-checked smaller matrices.
You could simply write down all smaller matrices in binary format (0 for -1, 1 for 1). BTW, you can directly check the ranks of (0 and 1) matrices instead of (-1 and 1) matrices - it is the same. Let us call these encodings IMAGES. Using long types you can have matrices of up to 64 cells, so up to 8x8. It is fast. Using strings, you can have them as large as you need.
Really, 8x8 is more than enough - in 8 GB of memory we can place 1G longs; that is about 2^30 entries, so you can remember all matrices of up to about 25-28 cells.
For every size you'll have a set of images:
for 2x2: 1001, 0110, 1000, 0100, 0010, 0001, 0111, 1011, 1101, 1110.
So you'll have an archive = an NxN array, each element of which is an ordered list of binary images of good matrices.
- (for matrix size MxN, where M>=N, the appropriate place in archive will have coordinates M,N. If M
When you are checking a new large matrix, divide it into small ones.
For every small matrix T:
If the appropriate place in the archive for the size of T has no list yet, create it, fill it with the images of all full-rank matrices of that size, and sort the images. If you run out of memory, stop filling the archive.
If T's size can be in the archive:
Make the image of T.
Look for the image of T in the list - if it is there, it is OK; if not, the large matrix should be thrown away.
If T is too big for the archive, check its rank directly, as you do now. (A sketch of the image/archive idea follows below.)
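A minimal sketch of the image idea in Python, using a plain dict keyed by (shape, image) as the archive; the integer encoding and the function names are my own illustration of the scheme described above:
import numpy as np

def image(submatrix):
    # Encode a small +/-1 matrix as an integer image: 0 for -1, 1 for 1, row by row.
    bits = (submatrix.flatten() + 1) // 2
    code = 0
    for b in bits:
        code = (code << 1) | int(b)
    return code

archive = {}  # (shape, image) -> True if that submatrix has full rank

def has_full_rank(submatrix):
    key = (submatrix.shape, image(submatrix))
    if key not in archive:
        archive[key] = np.linalg.matrix_rank(submatrix) == min(submatrix.shape)
    return archive[key]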
Algorithm 2 - increasing sizes
The other possibility is to create larger matrices by adding pieces to the smaller ones already found.
You should decide up to what size the matrices will grow.
When you find a "correct" matrix of size MxN, try to add a row on top of it. The new matrix should be checked only for submatrices that include the new row. The same goes for adding a new column. (A sketch of this incremental check is given at the end of this answer.)
You should fix an exact scheme for which sizes are derived from which. That way you can minimize the number of remembered matrices. I thought of this sequence:
Start from 2x2 matrices.
continue with 3x2
4x2, 3x3
5x2, 4x3
6x2, 5x3, 4x4
7x2, 6x3, 5x4
...
So you can remember only (M+N)/2-1 sizes of matrices for searching among sizes MxN.
If, whenever a new size can be created from two old ones, we derive it from the more square one, we can also greatly save space on remembered matrices: for "long" matrices such as 7x2 we only need to remember and check the last 1x2 line; for 6x3 matrices we should remember their 2x3 stub, and so on.
Also, you don't need to remember the largest matrices - you won't use them for further counting.
Again, use "images" for remembering the matrices.
