Find two pairs of pairs that sum to the same value - python

I have random 2d arrays which I make using
import numpy as np
from itertools import combinations
n = 50
A = np.random.randint(2, size=(n,n))
I would like to determine if the matrix has two pairs of pairs of rows which sum to the same row vector. I am looking for a fast method to do this. My current method just tries all possibilities.
for pair in combinations(combinations(range(n), 2), 2):
    if np.array_equal(A[pair[0][0]] + A[pair[0][1]], A[pair[1][0]] + A[pair[1][1]]):
        print("Pair found", pair)
A method that worked for n = 100 would be really great.

Here is a pure numpy solution; no extensive timings, but I have to push n up to 500 before I can see my cursor blink once before it completes. It is memory intensive though, and will fail due to memory requirements for much larger n. Either way, I get the intuition that the odds of finding such a vector decrease geometrically for larger n anyway.
import numpy as np

n = 100
A = np.random.randint(2, size=(n, n)).astype(np.int8)

def base3(a):
    """
    pack the last axis of an array in base 3
    40 base-3 digits fit in one uint64
    """
    S = np.array_split(a, a.shape[-1]//40 + 1, axis=-1)
    R = np.zeros(shape=a.shape[:-1] + (len(S),), dtype=np.uint64)
    for i in range(len(S)):
        s = S[i]
        r = R[..., i]
        for j in range(s.shape[-1]):
            r *= 3
            r += s[..., j].astype(np.uint64)
    return R

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), np.int64)
    np.add.at(count, inverse, 1)
    return unique, count

def voidview(arr):
    """view the last axis of an array as a void object; can be used as a faster form of lexsort"""
    return np.ascontiguousarray(arr).view(
        np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1]))).reshape(arr.shape[:-1])

def has_pairs_of_pairs(A):
    # optional; convert rows to base 3
    A = base3(A)
    # precompute sums over a lower triangular set of all combinations
    rowsums = sum(A[I] for I in np.tril_indices(n, -1))
    # count the number of times each row occurs by sorting
    # note that this is not quite O(n log n), since the cost of handling each row is also a function of n
    unique, count = unique_count(voidview(rowsums))
    # report if any pairs of pairs exist;
    # computing their indices is left as an exercise for the reader
    return np.any(count > 1)

from time import perf_counter
t = perf_counter()
for i in range(100):
    print(has_pairs_of_pairs(A))
print(perf_counter() - t)
Edit: included base-3 packing; now n=2000 is feasible, taking about 2 GB of memory and a few seconds of processing.
Edit: added some timings; n=100 takes only 5 ms per call on my i7m.
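For reference, here is a hedged sketch (my addition; find_pairs_of_pairs is a name I made up) of how the colliding pair indices could be recovered, reusing base3 and voidview from above:
def find_pairs_of_pairs(A):
    """return groups of row pairs (i, j) whose pairwise sums are identical"""
    A3 = base3(A)
    I, J = np.tril_indices(n, -1)
    rowsums = A3[I] + A3[J]                     # packed sums of all row pairs
    unique, inverse = np.unique(voidview(rowsums), return_inverse=True)
    groups = []
    for u in range(len(unique)):
        members = np.flatnonzero(inverse == u)
        if len(members) > 1:                    # at least two pairs share this sum vector
            groups.append(list(zip(I[members], J[members])))
    return groups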

Based on the code in your question, and on the assumption that you're actually looking for pairs of pairs of rows that sum to the same row vector, you could do something like this:
from itertools import combinations

def findMatchSets(A):
    B = A.transpose()
    pairs = tuple(combinations(range(len(A[0])), 2))
    matchSets = [[i for i in pairs if B[0][i[0]] + B[0][i[1]] == z] for z in range(3)]
    for c in range(1, len(A[0])):
        matchSets = [[i for i in block if B[c][i[0]] + B[c][i[1]] == z] for z in range(3) for block in matchSets]
        matchSets = [block for block in matchSets if len(block) > 1]
        if not matchSets:
            return []
    return matchSets
This basically stratifies the matrix into equivalence sets that sum to the same value after one column has been taken into account, then two columns, then three, and so on, until it either reaches the last column or there is no equivalence set left with more than one member (i.e. there is no such pair of pairs). This will work fine for 100x100 arrays, largely because the chances of two pairs of rows summing to the same row vector are infinitesimally small when n is large (n*(n-1)/2 pairs compared to 3^n possible vector sums).
UPDATE
Updated code to allow searching for pairs of n-size subsets of all rows, as requested. Default is n=2 as per the original question:
def findMatchSets(A, n=2):
    B = A.transpose()
    pairs = tuple(combinations(range(len(A[0])), n))
    matchSets = [[i for i in pairs if sum(B[0][i[j]] for j in range(n)) == z] for z in range(n + 1)]
    for c in range(1, len(A[0])):
        matchSets = [[i for i in block if sum(B[c][i[j]] for j in range(n)) == z] for z in range(n + 1) for block in matchSets]
        matchSets = [block for block in matchSets if len(block) > 1]
        if not matchSets:
            return []
    return matchSets
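A quick usage sketch (my own illustration, not part of the original answer), checking a small random matrix for matching pair sums and for matching triple sums:
import numpy as np

A = np.random.randint(2, size=(10, 10))
print(findMatchSets(A))        # groups of row pairs with identical sum vectors, if any
print(findMatchSets(A, n=3))   # same, but for triples of rows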

Here is a 'lazy' approach that scales up to n=10000, using 'only' 4 GB of memory and completing in about 10 s per call. Worst-case complexity is O(n^3), but for random data the expected performance is O(n^2). At first sight it seems you'd need O(n^3) operations: each row combination needs to be produced and inspected at least once. But we need not look at the entire row. Rather, we can apply an early-exit strategy to the comparison of row pairs once it is clear they are of no use to us; and for random data, we typically draw this conclusion long before we have considered all columns in a row.
import numpy as np
from functools import reduce

n = 10
# also works for non-square A
A = np.random.randint(2, size=(n*2, n)).astype(np.int8)
# force the inclusion of some hits, to keep our algorithm on its toes
##A[0] = A[1]

def base_pack_lazy(a, base, dtype=np.uint64):
    """
    pack the last axis of an array as minimal base representation
    lazily yields packed columns of the original matrix
    """
    a = np.ascontiguousarray(np.rollaxis(a, -1))
    init = np.zeros(a.shape[1:], dtype)
    packing = int(np.dtype(dtype).itemsize * 8 // np.log2(base))  # max digits per word without overflow
    for columns in np.array_split(a, (len(a)-1)//packing + 1):
        yield reduce(
            lambda acc, inc: acc*base + inc.astype(dtype),
            columns,
            init)

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), np.int64)
    np.add.at(count, inverse, 1)  # note: this scatter operation requires numpy 1.8; use a sparse matrix otherwise!
    return unique, count, inverse

def has_identical_row_sums_lazy(A, combinations_index):
    """
    compute the existence of combinations of rows summing to the same vector,
    given an nxm matrix A and an index matrix specifying all combinations

    naively, we need to compute the sum of each row combination at least once, giving n^3 computations
    however, this isn't strictly required; we can lazily consider the columns, giving an early exit opportunity
    all nicely vectorized of course
    """
    multiplicity, combinations = combinations_index.shape
    # list of indices into combinations_index, denoting possibly interacting combinations
    active_combinations = np.arange(combinations, dtype=np.uint32)
    for packed_column in base_pack_lazy(A, base=multiplicity+1):  # loop over packed columns
        # compute rowsums only for a fixed number of columns at a time;
        # this is O(n^2) rather than O(n^3), and after considering the first column,
        # we can typically already exclude almost all row pairs
        partial_rowsums = sum(packed_column[I[active_combinations]] for I in combinations_index)
        # find duplicates in this column
        unique, count, inverse = unique_count(partial_rowsums)
        # prune those pairs which we can exclude as having different sums, based on columns inspected thus far
        active_combinations = active_combinations[count[inverse] > 1]
        # early exit; no pairs
        if len(active_combinations) == 0:
            return False
    return True

def has_identical_triple_row_sums(A):
    n = len(A)
    idx = np.array([(i, j, k)
                    for i in range(n)
                    for j in range(n)
                    for k in range(n)
                    if i < j and j < k], dtype=np.uint16)
    idx = np.ascontiguousarray(idx.T)
    return has_identical_row_sums_lazy(A, idx)

def has_identical_double_row_sums(A):
    n = len(A)
    idx = np.array(np.tril_indices(n, -1), dtype=np.int32)
    return has_identical_row_sums_lazy(A, idx)

from time import perf_counter
t = perf_counter()
for i in range(10):
    print(has_identical_double_row_sums(A))
    print(has_identical_triple_row_sums(A))
print(perf_counter() - t)
Extended to include the calculation over sums of triplets of rows, as you asked above. For n=100, this still takes only about 0.2s
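The same lazy machinery extends naturally to larger subsets; for instance, here is a hedged sketch (my addition, not part of the original answer) of a quadruple version, built with itertools.combinations instead of the nested loops used above for triples:
from itertools import combinations

def has_identical_quadruple_row_sums(A):
    # build a (4, n_combinations) index matrix, analogous to the triple case above
    n = len(A)
    idx = np.array(list(combinations(range(n), 4)), dtype=np.uint16)
    idx = np.ascontiguousarray(idx.T)
    return has_identical_row_sums_lazy(A, idx)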
Edit: some cleanup; edit2: some more cleanup

Your current code does not test for pairs of rows that sum to the same value.
Assuming that's actually what you want, it's best to stick to pure numpy. This generates the indices of all rows that have equal sum.
import numpy as np

n = 100
A = np.random.randint(2, size=(n, n))
rowsum = A.sum(axis=1)
unique, inverse = np.unique(rowsum, return_inverse=True)
count = np.zeros_like(unique)
np.add.at(count, inverse, 1)
for p in unique[count > 1]:
    print(p, np.nonzero(rowsum == p)[0])

If all you need to do is determine whether such a pair exists you can do:
exists_unique = np.unique(A.sum(axis=1)).size != A.shape[0]

Related

Generating multiple pairs of unequal random integers in python/numpy

In the context of a Monte Carlo simulation I am generating pairs of random indices, using
ij = np.random.randint(0, N, (n,2))
where n can be quite large (e.g. 10**6). I then loop over these pairs.
Issue:
I would like the numbers in each pair to be different.
The solutions that I found (e.g., using random.sample or np.random.choice) suggest generating the numbers pair by pair. In my case that means calling the random number generator repeatedly in a loop, which slows down the code.
This is a simple way to do it:
import numpy as np
N = 10
n = 10000
np.random.seed(0)
i = np.random.choice(N, n)
j = np.random.choice(N - 1, n)
j[j >= i] += 1
print(np.any(i == j))
# False
ij = np.stack([i, j], axis=1)
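The shift trick works because j is drawn uniformly from the N-1 values 0..N-2, and bumping every j >= i by one maps those draws onto exactly the N-1 values that differ from i. A quick hedged check of uniformity (my own addition):
from collections import Counter

# count how often each ordered (i, j) pair occurs; all N*(N-1) ordered pairs should show up
pair_counts = Counter(map(tuple, ij))
print(len(pair_counts))                                       # expect N * (N - 1) = 90 distinct pairs
print(min(pair_counts.values()), max(pair_counts.values()))   # roughly n / 90 occurrences each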
One approach could be to iteratively redraw those pairs whose two entries are equal:
m = np.full(ij.shape[0], True)
while m.any():
    ij[m] = np.random.randint(0, N, (m.sum(), 2))
    m = ij[:, 0] == ij[:, 1]

Are there some functions in Python for generating matrices with special conditions?

I'm writing a dataset generator in Python and I ran into the following problem: I need a set of zero-one matrices with no empty columns/rows. Also, the ratio between zeros and ones should be constant.
I've tried shuffling a zero-one list with a fixed ratio of zeros and ones and then reshaping it, but for matrices with hundreds of rows/cols it takes too long. I also took into account that some inputs are impossible (like a 3*10 matrix with 9 one-elements) and that some inputs admit essentially only one kind of solution (like a 10*10 matrix with 10 one-elements).
If I understand the task, something like this might work:
import numpy as np
from collections import defaultdict, deque

def gen_mat(n, m, k):
    """
    n: rows,
    m: cols,
    k: ones,
    """
    assert k % n == 0 and k % m == 0
    mat = np.zeros((n, m), dtype=int)
    ns = np.repeat(np.arange(n), k // n)
    ms = np.repeat(np.arange(m), k // m)
    # uniform shuffle
    np.random.shuffle(ms)
    ms_deque = deque(ms)
    assigned = defaultdict(set)
    for n_i in ns:
        while True:
            m_i = ms_deque.popleft()
            if m_i in assigned[n_i]:
                ms_deque.append(m_i)
                continue
            mat[n_i, m_i] = 1
            assigned[n_i].add(m_i)
            break
    return mat
We first observe that an n x m matrix can be populated with k ones such that every row and every column gets an equal share only if k is divisible by both n and m.
Assuming this condition holds, each row index will appear k/n times and each column index will appear k/m times. We shuffle the column indices to ensure that the assignment is random, and store the shuffled column indices in a deque for efficiency.
For each row, we keep the set of columns such that mat[row, column] = 1 (initially empty).
We can now loop over each row k/n times, picking from the deque the next column with mat[row, column] = 0 and setting mat[row, column] to 1.
Without loss of generality, assume that n <= m. This algorithm terminates successfully unless we reach a situation where all remaining columns in the deque already satisfy mat[row, column] = 1. This can only happen in the last row, which would mean that we have already assigned k/m + 1 ones to some column, which is impossible.
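A quick usage sketch (my addition; the 5x10 sizes and 10 ones are just an example satisfying the divisibility condition):
mat = gen_mat(5, 10, 10)   # 5 rows, 10 cols, 10 ones: 2 per row, 1 per column
print(mat.sum(axis=1))     # every row has exactly 2 ones
print(mat.sum(axis=0))     # every column has exactly 1 one, so no empty rows or columns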

Having an N element output from rejection sampling for N elements

I am applying rejection sampling for N elements, given a probability density function pdf. When applying this method for N elements, the returned array will likely have fewer than N elements, because applying the rejection step without looping drops the values for which the condition is False.
To reconcile this, I can loop over the values that do not meet the condition until they are all True. However, I am not sure how to keep looping until the number of elements in my array matches the N values I am evaluating, given the functions I have defined.
import numpy as np

N = 1000                  # number of elements
x = np.linspace(0, 200, N)
pdf = pdf(x)              # some pdf

# Rejection Method #1
# -------------------
fx = np.random.random_sample(size=N) * x.max()  # uniform random samples scaled out
u = np.random.random_sample(size=N)             # uniform random sample
condition = np.where(u <= pdf/pdf.max())[0]     # indices where the first rejection criterion holds
x_arr = fx[condition]

# Here, len(x_arr) < N, so I want to fix it until len(x_arr) == N
while len(x_arr) < N:
    ...
    if len(x_arr) == N:
        break
After this, I am having trouble coming up with a way to iterate until len(x_arr) == N.
Here is one way using boolean and advanced indexing. It keeps a list of indices at which values were rejected and redraws these values until the list is empty.
Example sampling and accept/reject functions:
import numpy as np
from scipy import stats

def sample(N):
    return np.random.uniform(-3, 3, (N,))

def accept(v):
    return np.random.rand(v.size) < stats.norm().pdf(v)
Main loop:
def draw(N, f_sample, f_accept):
    out = f_sample(N)
    mask = f_accept(out)
    reject, = np.where(~mask)
    while reject.size > 0:
        fill = f_sample(reject.size)
        mask = f_accept(fill)
        out[reject[mask]] = fill[mask]
        reject = reject[~mask]
    return out
Sanity check:
>>> counts, bins = np.histogram(draw(100000, sample, accept))
>>> counts / stats.norm().pdf((bins[:-1] + bins[1:]) / 2)
array([65075.50020815, 65317.17811578, 60973.84255365, 59440.53739031,
58969.62310004, 59267.33983256, 60565.1928325 , 61108.60840388,
64303.2863583 , 68293.86441234])
Looks roughly flat, so ok.

NumPy: Sparse outer product of n vectors (hyperbolic cross)

I'm trying to compute a certain subset of the full outer product of n vectors. The computation of the full outer product is described in this question.
Formally: Let v1,v2,...,vk be vectors of some length n, and K be a positive constant. I want a list containing all the products v1[i1]v2[i2]...vk[ik] for which i1*i2*...*ik <= K (indices start at one). Note: For example, if K = n ** k, the list would contain every combination.
My current approach is to create a hierarchical list of the indices fulfilling the condition above and then calculating the products recursively, which has the advantage of reusing some factors.
This implementation is a lot slower than the computation of the full outer product using NumPy (for the same n and k). I want to achieve better performance than the computation of the full product. I'm interested in larger values for k, and small K (this problem comes from function approximation with sparse bases, i.e. hyperbolic cross).
Does anyone know a more performant way to get this list? Maybe by using more NumPy or another algorithm? I will try a C implementation next.
Here is my current implementation:
import numpy as np
from functools import reduce

def get_cross_indices(n, k, K):
    """
    Assume k > 0.
    Returns a hierarchical list containing elements of type
    (i1, list) with
    - i1 being an index (zero based!)
    - list being again a list (possibly empty) with all indices i2, such
      that (i1+1) * (i2+1) * ... * (ik+1) <= K (going down the hierarchy)
    """
    if k == 1:
        num = min(n, K)
        return (num, [(x, []) for x in range(num)])
    else:
        indices = []
        nums = 0
        for i in range(min(n, K)):
            (num, tail) = get_cross_indices(n, k - 1, K // (i + 1))
            indices.append((i, tail))
            nums += num
        return (nums, indices)

def calc_cross_outer_product(vectors, result, factor, indices, pos):
    """
    Fills the result list recursively with all products
    vectors[0][i1] * ... * vectors[k-1][ik]
    such that i1,...,ik is a feasible index sequence
    from `indices` (they are in there hierarchically,
    also see `get_cross_indices`).
    """
    for (x, tail) in indices:
        if not tail:
            result[pos] = factor * vectors[0][x]
            pos += 1
        else:
            pos = calc_cross_outer_product(vectors[1:], result,
                                           factor * vectors[0][x], tail, pos)
    return pos

k = 3  # number of vectors
n = 4  # vector length
K = 3
# using random values here just for demonstration purposes
vectors = np.random.rand(k, n)

# get all indices which meet the condition
(count, indices) = get_cross_indices(n, k, K)
result = np.ones(count)
calc_cross_outer_product(vectors, result, 1, indices, 0)

## Equivalent version ##
alt_result = np.ones(count)
# create full outer products
outer_product = reduce(np.multiply, np.ix_(*vectors))
pos = 0
for inds in np.ndindex((n,)*k):
    # current index set is feasible?
    if np.prod(np.array(inds) + 1) <= K:
        # compute vectors[0][inds[0]] * ... * vectors[k-1][inds[k-1]]
        values = [vectors[m][inds[m]] for m in range(k)]
        alt_result[pos] = np.prod(values)
        pos += 1
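As a quick consistency check (my addition): both enumerations visit the same feasible index set, so the sorted products should coincide:
print(count, "feasible index combinations")
assert np.allclose(np.sort(result), np.sort(alt_result))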
To get a visual idea of the indices I'm interested in: for k=3 and K=n, the feasible index set forms a hyperbolic cross (the figure from the original post is not reproduced here).

Subsample a matrix python

I have a text file that lists pairs, for example
10,1
2,7
3,1
10,1
That has then been turned into a symmetric matrix, so the (1,10) entry is the number of times the pair (1,10) showed up on the list. I would now like to subsample this matrix. By subsample I mean - I would like to make a matrix that would have been the result of only using a random 30% of the lines in the original text file. So in this example, had I erased 70% of the text file, the (1,10) pair might only show up once instead of twice, and so the (1,10) entry in the matrix would be 1 instead of 2.
This can be done easily if I actually have the original text file, by just using random.sample to pick out 30% of the lines in the files. But if I only have the matrix, how can I randomly decimate 70% of the data?
I guess the best way depends on where your data is large:
Do you have a huge matrix with mostly small counts in it? Or
do you have a moderately sized matrix with huge numbers of counts in it?
Here's a solution suited to the second case, though it will also work OK in the first case.
Basically, the fact that the counts happen to be in a 2D matrix is not so important: this is essentially the problem of sampling from a population that has been binned. So what we can do is extract the bins directly, and forget about the matrix for a bit:
import numpy as np
import random

# Input counts matrix
mat = np.array([
    [5, 5, 2],
    [1, 1, 3],
    [6, 0, 4]
], dtype=np.int64)

# Build a list of (row, col) pairs, and a list of counts
keys, counts = zip(*[
    ((i, j), mat[i, j])
    for i in range(mat.shape[0])
    for j in range(mat.shape[1])
    if mat[i, j] > 0
])
And then sample from those bins, using a cumulative array of counts:
# Make the cumulative counts array
counts = np.array(counts, dtype=np.int64)
sum_counts = np.cumsum(counts)

# Decide how many counts to include in the sample
frac_select = 0.30
count_select = int(sum_counts[-1] * frac_select)

# Choose unique counts
ind_select = sorted(random.sample(range(sum_counts[-1]), count_select))

# A vector to hold the new counts
out_counts = np.zeros(counts.shape, dtype=np.int64)

# Perform basically the merge step of merge-sort, finding where
# the counts land in the cumulative array
i = 0
j = 0
while i < len(sum_counts) and j < len(ind_select):
    if ind_select[j] < sum_counts[i]:
        j += 1
        out_counts[i] += 1
    else:
        i += 1

# Rebuild the matrix using the `keys` list from before
out_mat = np.zeros(mat.shape, dtype=np.int64)
for i in range(len(out_counts)):
    out_mat[keys[i]] = out_counts[i]
Now you will have the sampled matrix in out_mat.
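As an aside (my addition, not part of the original answer): with NumPy 1.18+ you can let the Generator API do the binned sampling without replacement directly via multivariate_hypergeometric, which replaces the manual merge step; a hedged sketch:
rng = np.random.default_rng()
out_counts = rng.multivariate_hypergeometric(counts, count_select)  # draw without replacement from the bins
out_mat = np.zeros(mat.shape, dtype=np.int64)
for key, c in zip(keys, out_counts):
    out_mat[key] = c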
Unfortunately, examples two and three do not observe the correct distribution with respect to the number of appearances of each pair in the original file.
Instead of removing tuples from the original data you could randomly remove counts from your matrix.
So you have to generate random indices and decrease the corresponding counts. Be sure to avoid decreasing a zero count; generate a new index instead. Do this until you have decreased the overall number of counted tuples to 30% of the original.
Basically this could look like this:
amount_to_decrease = 0.7 * overall_amount
decreased = 0
while decreased < amount_to_decrease:
x = random.randint(0, n)
y = random.randint(0, n)
if matrix[x][y] > 0:
matrix[x][y]-=1
decreased+=1
if x != y:
matrix[y][x]-=1
This should work well if your matrix is well populated.
If it's not, you might want to recreate a list of tuples from the matrix and then choose a random subset of it. After this, recreate your matrix from the remaining tuples:
tuples = []
for y in range(n):
    for x in range(y + 1):
        for _ in range(matrix[x][y]):
            tuples.append((x, y))
remaining = random.sample(tuples, int(overall_amount * 0.3))  # keep 30% of the original lines
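For completeness, a hedged sketch (my addition) of rebuilding the counts matrix from remaining, mirroring the symmetric convention used above:
new_matrix = [[0] * n for _ in range(n)]
for x, y in remaining:
    new_matrix[x][y] += 1
    if x != y:
        new_matrix[y][x] += 1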
Or you can do a combination where you do a first pass to find all indices that are not zero and then sample these to decrease the counts:
valid_indices = []
for y in range(n):
    for x in range(y + 1):
        if matrix[x][y] > 0:
            valid_indices.append((x, y))

amount_to_decrease = 0.7 * overall_amount
decreased = 0
while decreased < amount_to_decrease:
    x, y = random.choice(valid_indices)
    matrix[x][y] -= 1
    decreased += 1
    if x != y:
        matrix[y][x] -= 1
    if matrix[x][y] == 0:
        valid_indices.remove((x, y))
There is another approach that uses the correct probabilities but might not give you an exact reduction. The idea is to set a probability for keeping each line/count; this would be 0.3 if you are aiming for a reduction to 30%. Then you go over the matrix and check, for every count, whether it should be kept or not.
keep_chance = 0.3
for y in range(n):
    for x in range(y + 1):
        for _ in range(matrix[x][y]):
            if random.random() > keep_chance:
                matrix[x][y] -= 1
                if x != y:
                    matrix[y][x] -= 1
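Since each of the matrix[x][y] counts is kept independently with probability keep_chance, the inner loop above is equivalent to drawing one binomial sample per cell. A hedged vectorized sketch (my addition), assuming matrix is a symmetric NumPy integer array:
import numpy as np

keep_chance = 0.3
upper = np.triu(matrix)                        # one copy of each symmetric pair (incl. diagonal)
kept = np.random.binomial(upper, keep_chance)  # keep each individual count with probability keep_chance
new_matrix = kept + np.triu(kept, 1).T         # mirror the strict upper triangle back down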
Assuming that the couples 1,10 and 10,1 are different, so that mat[1][10] is not necessarily the same as mat[10][1] (if not, see the variant further down):
First compute the sum of all the values in the matrix.
Let this sum be S. This counts the number of rows in the file.
Let x and y be the dimensions of the matrix.
Now loop for n from 0 to [70% of S]:
pick a random integer between 1 and x; let this be j
pick a random integer between 1 and y; let this be k
if mat[j][k] > 0, decrease mat[j][k] and do n++
Since you increase a single value in the matrix for each row in your file, decreasing randomly a positive value in the matrix is the same as decimating the rows in the file.
If 10,1 is the same as 1,10 you don't need half of the matrix, so you can change the algorithm like this:
Loop for n from 0 to [70% of S]:
pick a random integer between 1 and x; let this be j
pick a random integer between 1 and j; let this be k
if mat[j][k] > 0, decrease mat[j][k] and do n++
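A hedged Python sketch of the symmetric variant described above (my addition; it assumes mat is a square nested list where only the lower triangle holds the counts, one count per line of the original file, and uses 0-based indices):
import random

S = sum(sum(row) for row in mat)   # total count = number of lines in the file
x = len(mat)
removed = 0
while removed < int(0.7 * S):
    j = random.randrange(x)        # row
    k = random.randrange(j + 1)    # column, k <= j, so only the lower triangle is touched
    if mat[j][k] > 0:
        mat[j][k] -= 1
        removed += 1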
