Generating multiple pairs of unequal random integers in Python/NumPy

In the context of a Monte Carlo simulation I am generating pairs of random indices, using
ij = np.random.randint(0, N, (n,2))
where n can be quite large (e.g. 10**6). I then loop over these pairs.
Issue:
I would like the numbers in each pair to be different.
The solutions that I found (e.g., using random.sample or np.random.choice) suggest generating the numbers pair by pair. In my case that means calling the random number generator repeatedly in a loop, which slows down the code.

This is a simple way to do it: draw i from the full range 0..N-1, draw j from a range that is one value shorter, and then shift every j that is >= i up by one, so j ends up uniform over the N-1 values different from i.
import numpy as np
N = 10
n = 10000
np.random.seed(0)
i = np.random.choice(N, n)
j = np.random.choice(N - 1, n)
j[j >= i] += 1
print(np.any(i == j))
# False
ij = np.stack([i, j], axis=1)
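For completeness, here is the same shift trick written against NumPy's newer Generator API (a sketch of my own, assuming NumPy 1.17+ and reusing N and n from the snippet above):
rng = np.random.default_rng(0)
i = rng.integers(0, N, n)          # i uniform on 0..N-1
j = rng.integers(0, N - 1, n)      # j uniform on 0..N-2
j[j >= i] += 1                     # shift j past i, so j != i and j stays uniform over the other N-1 values
ij = np.stack([i, j], axis=1)
assert not np.any(ij[:, 0] == ij[:, 1])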

One approach could be to iteratively redraw only those rows whose two entries are equal (rejection sampling): each redrawn row collides again with probability 1/N, so the loop finishes after only a few passes.
m = np.full(ij.shape[0], True)
while m.any():
    ij[m] = np.random.randint(0, N, (m.sum(), 2))
    m = ij[:, 0] == ij[:, 1]

Related

Is there a way to avoid going through all the possible combinations generated by itertools.combinations() in python?

I am trying to generate 300 randomized sets/lines of 16 numbers each, from 1 to 64 using Python.
I'm using the itertools package to generate combinations, and this is my code:
import itertools
import random
def generate_combinations():
    combinations = list(itertools.combinations(range(1, 64), 16))
    random.shuffle(combinations)
    combinations = combinations[:300]
    return print(*combinations, sep="\n")
In my code, the combinations list is generated using the itertools.combinations() function, those combinations are then shuffled, and lastly I limit the list to 300 entries.
The issue is the time it takes to get the ~488,526,937,079,580 combinations in the first step. Is there any way I can achieve this more efficiently?
Any approach that actually generates all those combos will run out of memory (there are nearly 489 trillion of them), but there is a fast way to generate the nth combo using an itertools recipe (nth_combination).
import math
def nth_combination(iterable, r, index):
    "Equivalent to list(combinations(iterable, r))[index]"
    pool = tuple(iterable)
    n = len(pool)
    c = math.comb(n, r)
    if index < 0:
        index += c
    if index < 0 or index >= c:
        raise IndexError
    result = []
    while r:
        c, n, r = c*r//n, n-1, r-1
        while index >= c:
            index -= c
            c, n = c*(n-r)//n, n-1
        result.append(pool[-1-n])
    return tuple(result)
Now you can just generate random indices and get the result:
import random
n = 64
r = 16
iterable = range(1, n + 1)  # the numbers 1..64 (range(1, n) would stop at 63)
n_combos = math.comb(len(iterable), r)
indices = random.sample(range(n_combos), k=300)  # random.choices if dupes are ok
combos = [nth_combination(iterable, r, i) for i in indices]
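As a quick sanity check (my own addition, not part of the original answer), nth_combination can be compared against itertools.combinations on a small input:
from itertools import combinations
small = list(combinations(range(1, 6), 3))
assert all(nth_combination(range(1, 6), 3, i) == small[i] for i in range(len(small)))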

Creating an identity matrix of size 16x16 using 2 for loops with ranges (1,5)

I am trying to create a 16x16 identity matrix in python by using nested for loops.
import numpy as np
total = []
for i in range(1, 5):
    for j in range(1, 5):
        row = 16 * [0]
        total.append(row)
mat = np.matrix(total)
How do I modify this to get an identity matrix? The ranges cannot be changed.
Here is a simple way to do it. Note that there are many ways; an alternative is shown below.
import numpy as np
total = []
for i in range(1, 17):
    row = []
    for j in range(1, 17):
        if i == j:
            row.append(1)
        else:
            row.append(0)
    total.append(row)
mat = np.array(total)
Alternative way:
matrix = np.asmatrix(np.eye(16), dtype=int)
Actually it is much easier to create your matrix as:
mat = np.asmatrix(np.eye(16), dtype=int)
But if you insist on using 2 nested loops with ranges (1, 5), you can do it as follows:
total = []
row = [1] + [0] * 15
for _ in range(1, 5):
    for _ in range(1, 5):
        total.append(row)
        row = np.roll(row, 1).tolist()
mat = np.matrix(total)
Note that each loop actually runs 4 times, so the inner body executes 16 times in total.
The loop variables i and j from your code are not needed (they are never used), so I replaced them with _.
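For what it's worth, here is one more way to satisfy the range(1, 5) constraint (my own sketch, not from the answers above): map each (i, j) pair produced by the two loops to a single row index 0..15 and set that diagonal entry directly.
import numpy as np
mat = np.zeros((16, 16), dtype=int)
for i in range(1, 5):
    for j in range(1, 5):
        idx = 4 * (i - 1) + (j - 1)   # runs over 0..15 exactly once
        mat[idx, idx] = 1
assert np.array_equal(mat, np.eye(16, dtype=int))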

Are there some functions in Python for generating matrices with special conditions?

I'm writing dataset generator on Python and I got following problem: I need a set of zero-one matrices with no empty columns/rows. Also the ratio between zeros and ones should be constant.
I've tried shuffling a zero-one list with a fixed ratio of zeros and ones and then reshaping it, but for matrices with hundreds of rows/cols it takes too long. I also took into account that some inputs are impossible (e.g. a 3*10 matrix with 9 one-elements) and that some inputs have essentially only one kind of solution (e.g. a 10*10 matrix with 10 one-elements).
If I understand the task, something like this might work:
import numpy as np
from collections import defaultdict, deque
def gen_mat(n, m, k):
    """
    n: rows,
    m: cols,
    k: ones,
    """
    assert k % n == 0 and k % m == 0
    mat = np.zeros((n, m), dtype=int)
    ns = np.repeat(np.arange(n), k // n)
    ms = np.repeat(np.arange(m), k // m)
    # uniform shuffle
    np.random.shuffle(ms)
    ms_deque = deque(ms)
    assigned = defaultdict(set)
    for n_i in ns:
        while True:
            m_i = ms_deque.popleft()
            if m_i in assigned[n_i]:
                ms_deque.append(m_i)
                continue
            mat[n_i, m_i] = 1
            assigned[n_i].add(m_i)
            break
    return mat
We first observe that an n x m matrix can be populated with k ones, with an equal share per row and per column, only if k is divisible by both n and m.
Assuming this condition holds, each row index will appear k/n times and each column index will appear k/m times. We shuffle the column indices to make the assignment random, and store them in a deque for efficient popping and re-appending.
For each row, we store the set of columns with mat[row, column] = 1 (initially empty).
We can now visit each row k/n times, each time taking from the deque the next column with mat[row, column] = 0 and setting that entry to 1.
Without loss of generality, assume that n <= m. The algorithm terminates successfully unless we reach a state where every column remaining in the deque already satisfies mat[row, column] = 1. This can only happen in the last row, and it would mean that some column had already been assigned k/m + 1 ones, which is impossible.
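A quick usage check of the sketch above (my own example, assuming gen_mat as defined): a 5x5 matrix with 5 ones, i.e. exactly one 1 per row and per column.
mat = gen_mat(5, 5, 5)
print(mat)
print(mat.sum(axis=0))   # every column holds 5/5 = 1 one
print(mat.sum(axis=1))   # every row holds 5/5 = 1 one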

NumPy: Sparse outer product of n vectors (hyperbolic cross)

I'm trying to compute a certain subset of the full outer product of n vectors. The computation of the full outer product is described in this question.
Formally: Let v1,v2,...,vk be vectors of some length n, and K be a positive constant. I want a list containing all the products v1[i1]v2[i2]...vk[ik] for which i1*i2*...*ik <= K (indices start at one). Note: For example, if K = n ** k, the list would contain every combination.
My current approach is to create a hierarchical list of the indices fulfilling the condition above and then calculating the products recursively, which has the advantage of reusing some factors.
This implementation is a lot slower than computing the full outer product with NumPy (for the same n and k). I want to achieve better performance than the computation of the full product. I'm interested in larger values of k and small K (this problem comes from function approximation with sparse bases, i.e. the hyperbolic cross).
Does anyone know a more performant way to get this list? Maybe by using more NumPy or another algorithm? I will try a C implementation next.
Here is my current implementation:
import numpy as np
def get_cross_indices(n, k, K):
    """
    Assume k > 0.
    Returns a hierarchical list containing elements of type
    (i1, list) with
    - i1 being an index (zero based!)
    - list being again a list (possibly empty) with all indices i2, such
      that (i1+1) * (i2+1) * ... * (ik+1) <= K (going down the hierarchy)
    """
    if k == 1:
        num = min(n, K)
        return (num, [(x, []) for x in range(num)])
    else:
        indices = []
        nums = 0
        for i in xrange(min(n, K)):
            (num, tail) = get_cross_indices(n, k - 1, K // (i + 1))
            indices.append((i, tail))
            nums += num
        return (nums, indices)
def calc_cross_outer_product(vectors, result, factor, indices, pos):
    """
    Fills the result list recursively with all products
    vectors[0][i1] * ... * vectors[k-1][ik]
    such that i1,...,ik is a feasible index sequence
    from `indices` (they are stored there hierarchically,
    see also `get_cross_indices`).
    """
    for (x, sublist) in indices:
        if not sublist:
            result[pos] = factor * vectors[0][x]
            pos += 1
        else:
            pos = calc_cross_outer_product(vectors[1:], result,
                                           factor * vectors[0][x], sublist, pos)
    return pos
k = 3 # number of vectors
n = 4 # vector length
K = 3
# using random values here just for demonstration purposes
vectors = np.random.rand(k, n)
# get all indices which meet the condition
(count, indices) = get_cross_indices(n, k, K)
result = np.ones(count)
calc_cross_outer_product(vectors, result, 1, indices, 0)
## Equivalent version ##
alt_result = np.ones(count)
# create full outer products
outer_product = reduce(np.multiply, np.ix_(*vectors))
pos = 0
for inds in np.ndindex((n,)*k):
    # current index set is feasible?
    if np.product(np.array(inds) + 1) <= K:
        # compute [ vectors[0][inds[0]], ..., vectors[k-1][inds[k-1]] ]
        values = map(lambda x: vectors[x[0]][x[1]],
                     np.dstack((np.arange(k), inds))[0])
        alt_result[pos] = np.product(values)
        pos += 1
To get a visual idea of the indices I'm interested in, here is a picture for k=3, K=n (image taken from an external website; not included here).
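To make the feasibility condition concrete, here is a tiny worked example (my own illustration, not part of the original post) listing the one-based index pairs that are kept for k = 2, n = 4, K = 4:
import numpy as np
i1, i2 = np.indices((4, 4)) + 1    # one-based index grids
mask = (i1 * i2) <= 4
print(np.argwhere(mask) + 1)       # -> (1,1) (1,2) (1,3) (1,4) (2,1) (2,2) (3,1) (4,1)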

Find two pairs of pairs that sum to the same value

I have random 2d arrays which I make using
import numpy as np
from itertools import combinations
n = 50
A = np.random.randint(2, size=(n,n))
I would like to determine if the matrix has two pairs of pairs of rows which sum to the same row vector. I am looking for a fast method to do this. My current method just tries all possibilities.
for pair in combinations(combinations(range(n), 2), 2):
    if np.array_equal(A[pair[0][0]] + A[pair[0][1]], A[pair[1][0]] + A[pair[1][1]]):
        print "Pair found", pair
A method that worked for n = 100 would be really great.
Here is a pure NumPy solution. I have no extensive timings, but I have to push n up to about 500 before the call takes long enough for me to see my cursor blink before it completes. It is memory intensive though, and will fail due to memory requirements for much larger n. Either way, my intuition is that the odds of finding such a collision decrease geometrically for larger n anyway.
import numpy as np
n = 100
A = np.random.randint(2, size=(n,n)).astype(np.int8)
def base3(a):
    """
    pack the last axis of an array in base 3
    40 base 3 numbers per uint64
    """
    S = np.array_split(a, a.shape[-1]//40+1, axis=-1)
    R = np.zeros(shape=a.shape[:-1]+(len(S),), dtype=np.uint64)
    for i in xrange(len(S)):
        s = S[i]
        r = R[..., i]
        for j in xrange(s.shape[-1]):
            r *= 3
            r += s[..., j]
    return R

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), np.int)
    np.add.at(count, inverse, 1)
    return unique, count

def voidview(arr):
    """view the last axis of an array as a void object. can be used as a faster form of lexsort"""
    return np.ascontiguousarray(arr).view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1]))).reshape(arr.shape[:-1])

def has_pairs_of_pairs(A):
    # optional; convert rows to base 3
    A = base3(A)
    # precompute sums over a lower triangular set of all combinations
    rowsums = sum(A[I] for I in np.tril_indices(n, -1))
    # count the number of times each row occurs by sorting
    # note that this is not quite O(n log n), since the cost of handling each row is also a function of n
    unique, count = unique_count(voidview(rowsums))
    # return True if any pairs of pairs exist;
    # computing their indices is left as an exercise for the reader
    return np.any(count > 1)

from time import clock
t = clock()
for i in xrange(100):
    print has_pairs_of_pairs(A)
print clock()-t
Edit: included base-3 packing; now n=2000 is feasible, taking about 2 GB of memory and a few seconds of processing.
Edit: added some timings; n=100 takes only 5 ms per call on my i7m.
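As a small illustration of the base-3 packing step (my own example, not from the answer): the entries of a summed pair of 0/1 rows lie in {0, 1, 2}, so each row of sums can be folded into base-3 integers without collisions, and duplicate sum vectors become duplicate integers that np.unique can count cheaply.
import numpy as np
sums = np.array([[0, 2, 1, 1],
                 [0, 2, 1, 1],
                 [2, 0, 1, 1]], dtype=np.uint64)
packed = np.zeros(len(sums), dtype=np.uint64)
for col in sums.T:                 # fold columns in, most significant digit first
    packed = packed * 3 + col
print(packed)                      # [22 22 58]: the first two rows collide, the third does not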
Based on the code in your question, and on the assumption that you're actually looking for pairs of pairs of rows that sum to the same row vector, you could do something like this:
def findMatchSets(A):
    B = A.transpose()
    pairs = tuple(combinations(range(len(A[0])), 2))
    matchSets = [[i for i in pairs if B[0][i[0]] + B[0][i[1]] == z] for z in range(3)]
    for c in range(1, len(A[0])):
        matchSets = [[i for i in block if B[c][i[0]] + B[c][i[1]] == z] for z in range(3) for block in matchSets]
        matchSets = [block for block in matchSets if len(block) > 1]
        if not matchSets:
            return []
    return matchSets
This basically stratifies the pairs into equivalence sets that sum to the same value after one column has been taken into account, then two columns, then three, and so on, until it either reaches the last column or no equivalence set with more than one member is left (i.e. there is no such pair of pairs). This will work fine for 100x100 arrays, largely because the chance of two pairs of rows summing to the same row vector is vanishingly small when n is large (roughly n*(n-1)/2 pair sums versus 3^n possible sum vectors).
UPDATE
Updated code to allow searching for pairs of n-size subsets of all rows, as requested. Default is n=2 as per the original question:
def findMatchSets(A, n=2):
    B = A.transpose()
    pairs = tuple(combinations(range(len(A[0])), n))
    matchSets = [[i for i in pairs if sum([B[0][i[j]] for j in range(n)]) == z] for z in range(n + 1)]
    for c in range(1, len(A[0])):
        matchSets = [[i for i in block if sum([B[c][i[j]] for j in range(n)]) == z] for z in range(n + 1) for block in matchSets]
        matchSets = [block for block in matchSets if len(block) > 1]
        if not matchSets:
            return []
    return matchSets
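A small usage sketch (my own, assuming the findMatchSets defined above together with the imports from the question); two identical rows are planted so that at least one pair of pairs is guaranteed to exist:
import numpy as np
from itertools import combinations

np.random.seed(1)
A = np.random.randint(2, size=(8, 8))
A[5] = A[3]                   # rows 3 and 5 now match, so pairs {x, 3} and {x, 5} have equal sums
blocks = findMatchSets(A)     # each surviving block holds >= 2 row-index tuples with equal sum vectors
for block in blocks:
    print(block)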
Here is a 'lazy' approach that scales up to n=10000, using 'only' 4 GB of memory and completing in about 10 s per call. Worst-case complexity is O(n^3), but for random data the expected performance is O(n^2). At first sight it seems you'd need O(n^3) ops, since each row combination needs to be produced and inspected at least once. But we need not look at the entire row: we can exit early from the comparison of row pairs once it is clear they are of no use to us, and for random data we can typically draw that conclusion long before we have considered all columns in a row.
import numpy as np
n = 10
#also works for non-square A
A = np.random.randint(2, size=(n*2,n)).astype(np.int8)
#force the inclusion of some hits, to keep our algorithm on its toes
##A[0] = A[1]
def base_pack_lazy(a, base, dtype=np.uint64):
    """
    pack the last axis of an array as minimal base representation
    lazily yields packed columns of the original matrix
    """
    a = np.ascontiguousarray(np.rollaxis(a, -1))
    init = np.zeros(a.shape[1:], dtype)
    packing = int(np.dtype(dtype).itemsize * 8 / (float(base) / 2))
    for columns in np.array_split(a, (len(a)-1)//packing+1):
        yield reduce(
            lambda acc, inc: acc*base+inc,
            columns,
            init)

def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), np.int)
    np.add.at(count, inverse, 1)  # note: this scatter operation requires numpy 1.8; use a sparse matrix otherwise!
    return unique, count, inverse

def has_identical_row_sums_lazy(A, combinations_index):
    """
    compute the existence of combinations of rows summing to the same vector,
    given an nxm matrix A and an index matrix specifying all combinations
    naively, we need to compute the sum of each row combination at least once, giving n^3 computations
    however, this isn't strictly required; we can lazily consider the columns, giving an early exit opportunity
    all nicely vectorized of course
    """
    multiplicity, combinations = combinations_index.shape
    # list of indices into combinations_index, denoting possibly interacting combinations
    active_combinations = np.arange(combinations, dtype=np.uint32)
    for packed_column in base_pack_lazy(A, base=multiplicity+1):  # loop over packed cols
        # compute rowsums only for a fixed number of columns at a time;
        # this is O(n^2) rather than O(n^3), and after considering the first column,
        # we can typically already exclude almost all row pairs
        partial_rowsums = sum(packed_column[I[active_combinations]] for I in combinations_index)
        # find duplicates in this column
        unique, count, inverse = unique_count(partial_rowsums)
        # prune those pairs which we can exclude as having different sums, based on columns inspected thus far
        active_combinations = active_combinations[count[inverse] > 1]
        # early exit; no pairs
        if len(active_combinations) == 0:
            return False
    return True

def has_identical_triple_row_sums(A):
    n = len(A)
    idx = np.array([(i, j, k)
                    for i in xrange(n)
                    for j in xrange(n)
                    for k in xrange(n)
                    if i < j and j < k], dtype=np.uint16)
    idx = np.ascontiguousarray(idx.T)
    return has_identical_row_sums_lazy(A, idx)

def has_identical_double_row_sums(A):
    n = len(A)
    idx = np.array(np.tril_indices(n, -1), dtype=np.int32)
    return has_identical_row_sums_lazy(A, idx)

from time import clock
t = clock()
for i in xrange(10):
    print has_identical_double_row_sums(A)
    print has_identical_triple_row_sums(A)
print clock()-t
Extended to include the calculation over sums of triplets of rows, as you asked above. For n=100, this still takes only about 0.2s
Edit: some cleanup; edit2: some more cleanup
Your current code does not test for pairs of rows that sum to the same value.
Assuming that's actually what you want, it's best to stick to pure numpy. The following generates the indices of all rows that have an equal sum.
import numpy as np
n = 100
A = np.random.randint(2, size=(n,n))
rowsum = A.sum(axis=1)
unique, inverse = np.unique(rowsum, return_inverse = True)
count = np.zeros_like(unique)
np.add.at(count, inverse, 1)
for p in unique[count>1]:
    print p, np.nonzero(rowsum==p)[0]
If all you need to do is determine whether such a pair exists you can do:
exists_unique = np.unique(A.sum(axis=1)).size != A.shape[0]
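For instance (my own tiny example), for a matrix whose first two rows both sum to 2 the check returns True:
import numpy as np
A = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]])
print(np.unique(A.sum(axis=1)).size != A.shape[0])   # True: two rows share the row sum 2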
