Subsample a matrix in Python

I have a text file that lists pairs, for example:
10,1
2,7
3,1
10,1
That has then been turned into a symmetric matrix, so the (1,10) entry is the number of times the pair (1,10) showed up on the list. I would now like to subsample this matrix. By subsample I mean - I would like to make a matrix that would have been the result of only using a random 30% of the lines in the original text file. So in this example, had I erased 70% of the text file, the (1,10) pair might only show up once instead of twice, and so the (1,10) entry in the matrix would be 1 instead of 2.
This can be done easily if I actually have the original text file, by just using random.sample to pick out 30% of the lines in the file. But if I only have the matrix, how can I randomly decimate 70% of the data?

I guess the best way depends on where your data is large:
Do you have a huge matrix with mostly small counts in it? Or
do you have a moderately sized matrix with huge numbers of counts in it?
Here's a solution suited to the second case, though it will also work
OK in the first.
The fact that the counts happen to be in a 2D matrix is not so
important: this is essentially the problem of sampling from a population that has
been binned. So we can extract the bins directly and forget about the
matrix for a bit:
import numpy as np
import random
# Input counts matrix
mat = np.array([
    [5, 5, 2],
    [1, 1, 3],
    [6, 0, 4]
], dtype=np.int64)
# Build a list of (row,col) pairs, and a list of counts
keys, counts = zip(*[
    ((i, j), mat[i, j])
    for i in range(mat.shape[0])
    for j in range(mat.shape[1])
    if mat[i, j] > 0
])
And then sample from those bins, using a cumulative array of counts:
# Make the cumulative counts array
counts = np.array(counts, dtype=np.int64)
sum_counts = np.cumsum(counts)
# Decide how many counts to include in the sample
frac_select = 0.30
count_select = int(sum_counts[-1] * frac_select)
# Choose unique counts
ind_select = sorted(random.sample(range(sum_counts[-1]), count_select))
# A vector to hold the new counts
out_counts = np.zeros(counts.shape, dtype=np.int64)
# Perform basically the merge step of merge-sort, finding where
# the counts land in the cumulative array
i = 0
j = 0
while i < len(sum_counts) and j < len(ind_select):
    if ind_select[j] < sum_counts[i]:
        j += 1
        out_counts[i] += 1
    else:
        i += 1
# Rebuild the matrix using the `keys` list from before
out_mat = np.zeros(mat.shape, dtype=np.int64)
for i in range(len(out_counts)):
    out_mat[keys[i]] = out_counts[i]
Now you will have the sampled matrix in out_mat.
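As a quick sanity check (not part of the original answer), the subsample should contain exactly count_select events and should never exceed the original counts:
assert out_mat.sum() == count_select
assert np.all(out_mat <= mat)
print(mat)
print(out_mat)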

Unfortunately, examples two and three do not follow the correct distribution with respect to the number of appearances of each line in the original file.
Instead of removing tuples from the original data you could randomly remove counts from your matrix.
So you have to generate random indices and decrease the corresponding count. Be sure to avoid decreasing a zero count, and instead generate a new index. Do this until you have decreased the overall number of counted tuples to 30% of the original.
Basically this could look like this:
amount_to_decrease = 0.7 * overall_amount
decreased = 0
while decreased < amount_to_decrease:
    x = random.randint(0, n - 1)
    y = random.randint(0, n - 1)
    if matrix[x][y] > 0:
        matrix[x][y] -= 1
        decreased += 1
        if x != y:
            matrix[y][x] -= 1
This should work well if your matrix is well populated.
If it's not, you might want to recreate a list of tuples from the matrix and then choose a random subset of those. After this, recreate your matrix from the remaining tuples:
tuples = []
for y in range(n):
    for x in range(y + 1):
        for _ in range(matrix[x][y]):
            tuples.append((x, y))
remaining = random.sample(tuples, int(overall_amount * 0.3))  # keep 30% of the entries
Or you can do a combination where you do a first pass to find all indices that are not zero and then sample these to decrease the counts:
valid_indices = []
for y in range(n):
    for x in range(y + 1):
        if matrix[x][y] > 0:
            valid_indices.append((x, y))
amount_to_decrease = 0.7 * overall_amount
decreased = 0
while decreased < amount_to_decrease:
    x, y = random.choice(valid_indices)
    matrix[x][y] -= 1
    decreased += 1
    if x != y:
        matrix[y][x] -= 1
    if matrix[x][y] == 0:
        valid_indices.remove((x, y))
There is another approach that uses the right probabilities but might not give you an exact reduction. The idea is to set a probability for keeping a line/count. This could be 0.3 if you are aiming for a reduction to 30%. Then you go over the matrix and decide for every count whether it should be kept or not.
keep_chance = 0.3
for y in range(n):
    for x in range(y + 1):
        for _ in range(matrix[x][y]):
            if random.random() > keep_chance:
                matrix[x][y] -= 1
                if x != y:
                    matrix[y][x] -= 1
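An equivalent, vectorized way to apply the same keep-with-probability idea, assuming matrix is a symmetric numpy integer array (this uses numpy's binomial sampling and is not from the original answer):
import numpy as np
keep_chance = 0.3
tri = np.tril(matrix)                        # lower triangle (including diagonal) holds each pair once
kept = np.random.binomial(tri, keep_chance)  # keep each counted event independently with probability keep_chance
matrix = kept + np.tril(kept, -1).T          # mirror the strictly lower part back to restore symmetry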

Assuming that the pairs 1,10 and 10,1 are different, so that mat[1][10] is not necessarily the same as mat[10][1] (if not, read below the line):
First compute the sum of all the values in the matrix.
Let this sum be S. This counts the number of rows in the file.
Let x and y be the dimensions of the matrix.
Now loop for n from 0 to [70% of S]:
pick a random integer between 1 and x. let this be j
pick a random integer between 1 and y. let this be k
if mat[j][k] > 0, decrease mat[j][k] and do n++
Since you increase a single value in the matrix for each row in your file, randomly decreasing a positive value in the matrix is the same as decimating the rows in the file.
If 10,1 is the same as 1,10 you don't need half of the matrix, so you can change the algorithm like this:
Loop for n from 0 to [70% of S]:
pick a random integer between 1 and x. Let this be j
pick a random integer between 1 and j. Let this be k
if mat[j][k] > 0, decrease mat[j][k] and do n++
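A rough 0-based sketch of the symmetric variant described above, assuming mat is a square list of lists (or numpy array) in which only the lower triangle carries the counts:
import random
def decimate(mat, frac_remove=0.7):
    """Randomly remove frac_remove of the counted events from the lower triangle of mat."""
    size = len(mat)
    S = sum(mat[j][k] for j in range(size) for k in range(j + 1))  # total number of lines represented
    removed = 0
    while removed < int(frac_remove * S):
        j = random.randrange(size)
        k = random.randrange(j + 1)  # k <= j, so only the lower triangle is touched
        if mat[j][k] > 0:
            mat[j][k] -= 1
            removed += 1
    return mat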


Sparse matrix multiplication with dictionaries in python

A sparse matrix is a matrix most of whose entries are zero. To save memory and storage, it is convenient to represent such matrices using a dictionary in the following configuration: for each non-zero cell in the matrix, the dictionary stores a tuple key
representing the coordinates of the cell, with the value being the value of that cell in the matrix (a number of type int or float). As is usual in mathematics,
the indices of the matrix start from one.
• A cell coordinate (i, j) consists of natural numbers, where the coordinate represents the cell in the i-th row
and the j-th column.
• The order in the dictionary is not important.
Implement the function sparse_mult(n, mat2, mat1), which receives 2 dictionaries, mat1 and mat2, representing square sparse matrices
of size n×n, and returns a dictionary representing the matrix product mat2×mat1.
Pay attention:
There is no need to check the correctness of the matrices.
It can be assumed that n is a natural number ≥ 1.
The returned dictionary must represent a sparse matrix as defined above.
for i in range(1, n + 1):
    temp = 0
    for j in range(1, n + 1):
        if (mat1.get((i, j), 0) != 0)|(mat2.get((j, i), 0) != 0):
            temp += mat1.get((i, j), 0) * mat2.get((j, i), 0)
    if temp != 0:
        resultrow[(i, i)] = temp
That's my code; I know I got it wrong, but I just don't have a clue.
It is inefficient to iterate over all indices in the 2-dimensional index set when multiplying two sparse matrices. Instead, you can iterate over all pairs of keys where one key is drawn from each sparse matrix. Given such a pair (i,j) and (k,l), it contributes a product of two numbers if and only if j == k; in that case the product goes towards entry (i,l) of the overall result. A final dictionary comprehension can get rid of any zero entries. This last step might be inadequate if the numbers are floats and some entries are non-zero only due to round-off error; in that case a threshold approach, which removes entries close to zero rather than merely equal to zero, would be more appropriate.
def sparse_multiply(a, b):
    c = {}
    for i, j in a.keys():
        for k, l in b.keys():
            if j == k:
                p = a[(i, j)] * b[(k, l)]
                if (i, l) in c:
                    c[(i, l)] += p
                else:
                    c[(i, l)] = p
    return {k: v for k, v in c.items() if v != 0}
Note that n plays no role here. The complexity is mk where m is the number of non-zero entries in the first matrix and k the number of such entries in the second. For matrices which are very sparse this will be substantially faster than the n^3 of using straight-forward matrix multiplication. There will be some threshold where mk will actually be larger than n^3, but at that stage the matrices are no longer sparse.
So I eventually got it, if anyone cares:
def sparse_mult(n, mat2, mat1):
    # initialize the result dictionary
    result = {}
    # iterate over the rows and columns of the result matrix
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            # initialize the sum to 0
            sum = 0
            # iterate over the columns of the first matrix and the rows of the second matrix
            for k in range(1, n + 1):
                # check if the cell (i, k) in the first matrix and the cell (k, j) in the second matrix are non-zero
                if (i, k) in mat1 and (k, j) in mat2:
                    sum += mat1[(i, k)] * mat2[(k, j)]
            # add the result to the dictionary if it is non-zero
            if sum != 0:
                result[(i, j)] = sum
    # return the result dictionary
    return result
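For reference, a quick hypothetical check that both versions agree on a small 3x3 example (the matrices below are made up, using 1-based (row, col) keys as described in the question):
mat1 = {(1, 1): 2, (1, 3): 1, (3, 2): 4}
mat2 = {(1, 2): 3, (2, 1): 5, (3, 3): 7}
print(sparse_mult(3, mat2, mat1))   # {(1, 2): 6, (1, 3): 7, (3, 1): 20}
print(sparse_multiply(mat1, mat2))  # same entries from the key-pair version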

Writing constraints for grid squares in python z3

So I'm working with Z3 in python and I have to write constraints/conditions for a "marvellous square", which is just a grid of numbers. The conditions for a marvellous square are:
It is filled with all the integers from 1 to n**2
Every row in the square sums to the same number t
Every column in the square also sums to that same number t
Both diagonals sum to that same number t
Using a list of constraints I've been able to do the first one:
aGrid = [ [ Int("x_%s_%s" % (i+1, j+1)) for j in range(n) ] for i in range(n) ]
conditionOne = [ And(1 <= aGrid[i][j], aGrid[i][j] <= n**2) for i in range(n) for j in range(n) ]
So in line 1 I create the instance for an n-by-n grid.
In line 2, I create the first condition where each of the entries is from 1 to n squared
The issue I have now is getting the sum of each column and row and equating them to the same value in the same constraint, as well as the diagonal constraints. I have a feeling they will all be done in the same constraint, but the list comprehension is confusing.
Here's one way to do it:
from z3 import *
# Grid size
n = 4
# Total and grid itself
t = Int('t')
grid = [[Int("x_%s_%s" % (i+1, j+1)) for j in range(n)] for i in range(n)]
s = Solver()
# Range constraint
allElts = [elt for row in grid for elt in row]
for x in allElts: s.add(And(x >= 1, x <= n*n))
# Distinctness constraint
s.add(Distinct(*allElts))
# Each row
for row in grid: s.add(t == sum(row))
# Each column
for i in range(n): s.add(t == sum([grid[j][i] for j in range(n)]))
# Each diagonal
s.add(t == sum([grid[i][i] for i in range(len(grid))]))
s.add(t == sum([grid[n-i-1][i] for i in range(len(grid))]))
# Solve:
if s.check() == sat:
    m = s.model()
    print(f't = {m[t]}')
    for row in grid:
        for elt in row:
            print(f'{str(m[elt]):>3s}', end="")
        print("")
else:
    print("No solution")
When I run this, I get:
t = 34
7 4 14 9
11 16 2 5
6 13 3 12
10 1 15 8
Which satisfies the constraints.
Note that as n gets larger, the time z3 spends solving will increase quite a bit. Here are two ideas to make it go much faster:
Note that t depends on n. That is, if you know n, you can compute t from it. (It'll be n * (n*n + 1) / 2, you can justify to yourself why that's true.) So, don't make t symbolic, instead directly compute it and use its value.
Computing over Int values is expensive. Instead, you should use bit-vector values of minimum size. For instance, if n = 6, then t = 111; and you only need 7-bits to represent this value. So, instead of using Int('x'), use BitVec('x', 7). (It's important that you pick a large enough bit-vector size!)
If you make the above two modifications, you'll see that it performs much better than using Int values alone; a sketch of both changes follows.
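A rough sketch of what those two changes might look like (the bit-width here is an assumption and must be large enough for t and the intermediate sums at the chosen n):
from z3 import *
n = 4
t_val = n * (n*n + 1) // 2  # t computed directly instead of being left symbolic
bits = 7                    # assumed wide enough for t_val and the row/column sums at this n
grid = [[BitVec("x_%s_%s" % (i+1, j+1), bits) for j in range(n)] for i in range(n)]
s = Solver()
allElts = [elt for row in grid for elt in row]
for x in allElts: s.add(UGE(x, 1), ULE(x, n*n))  # unsigned range constraint
s.add(Distinct(*allElts))
for row in grid: s.add(Sum(row) == t_val)                               # rows
for i in range(n): s.add(Sum([grid[j][i] for j in range(n)]) == t_val)  # columns
s.add(Sum([grid[i][i] for i in range(n)]) == t_val)                     # main diagonal
s.add(Sum([grid[n-i-1][i] for i in range(n)]) == t_val)                 # anti-diagonal
print(s.check())  # sat; s.model() then holds a marvellous square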

Are there some functions in Python for generating matrices with special conditions?

I'm writing a dataset generator in Python and I ran into the following problem: I need a set of zero-one matrices with no empty columns/rows. Also, the ratio between zeros and ones should be constant.
I've tried shuffling a zero-one list with a fixed ratio of zeros and ones and then reshaping it, but for matrices with hundreds of rows/columns it takes too long. I'm also aware that some inputs, like a 3*10 matrix with 9 one-elements, can't be achieved, and that some inputs, like a 10*10 matrix with 10 one-elements, have essentially only one solution.
If I understand the task, something like this might work:
import numpy as np
from collections import defaultdict, deque
def gen_mat(n, m, k):
    """
    n: rows,
    m: cols,
    k: ones,
    """
    assert k % n == 0 and k % m == 0
    mat = np.zeros((n, m), dtype=int)
    ns = np.repeat(np.arange(n), k // n)
    ms = np.repeat(np.arange(m), k // m)
    # uniform shuffle
    np.random.shuffle(ms)
    ms_deque = deque(ms)
    assigned = defaultdict(set)
    for n_i in ns:
        while True:
            m_i = ms_deque.popleft()
            if m_i in assigned[n_i]:
                ms_deque.append(m_i)
                continue
            mat[n_i, m_i] = 1
            assigned[n_i].add(m_i)
            break
    return mat
We first observe that an n x m matrix can be populated with k ones, with every row having the same count and every column having the same count, only if k is divisible by both n and m.
Assuming this condition holds, each row index will appear k/n times and each column index will appear k/m times. We shuffle the column indices to ensure that the assignment is random, and store the shuffled column indices in a deque for efficiency.
For each row, we store a set of columns s.t. mat[row, column] = 1 (initially empty).
We can now loop over each row k/n times, picking the next column s.t. mat[row, column] = 0 from the deque and set mat[row, column] to 1.
Without loss of generality, assume that n <= m. This algorithm terminates successfully unless we encounter a situation where all remaining columns in the deque satisfy mat[row, column] = 1. This can only happen in the last row, meaning that we would have already assigned k/m + 1 ones to some column, which is impossible.
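A quick, hypothetical usage check (the dimensions are arbitrary, chosen so that k is divisible by both n and m):
mat = gen_mat(4, 6, 12)
print(mat)
print(mat.sum(axis=1))  # every row sums to k / n = 3
print(mat.sum(axis=0))  # every column sums to k / m = 2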

Having an N element output from rejection sampling for N elements

I am applying rejection sampling for N elements given a probability density function pdf. When applying this method for N elements, you will likely get back an array with fewer than N elements, because the values for which the condition is False are simply dropped rather than redrawn.
To reconcile this, I can try to loop over the values that do not meet the condition until they do. However, I am not sure how to keep looping until the number of elements in my array equals N, given the functions I have defined.
import numpy as np
N = 1000 # number of elements
x = np.linspace(0, 200, N)
pdf = pdf(x) # some pdf
# Rejection Method #1
# -------------------
fx = np.random.random_sample(size=N) * x.max() # uniform random samples scaled out
u = np.random.random_sample(size=N) # uniform random sample
condition = np.where(u <= pdf/pdf.max())[0] # indices where the first rejection criterion passes
x_arr = fx[condition]
# Here, len(x_arr) < N, so I want to fix it until len(x_arr) == N
while len(x_arr) < N:
    ...
    if len(x_arr) == N:
        break
After this, I am having trouble forming a method of iterating until len(x_arr) = N.
Here is one way using boolean and advanced indexing. It keeps a list of indices at which values were rejected and redraws these values until the list is empty.
Example sampling and accept/reject functions:
import numpy as np
from scipy import stats
def sample(N):
    return np.random.uniform(-3, 3, (N,))
def accept(v):
    return np.random.rand(v.size) < stats.norm().pdf(v)
Main loop:
def draw(N, f_sample, f_accept):
    out = f_sample(N)
    mask = f_accept(out)
    reject, = np.where(~mask)
    while reject.size > 0:
        fill = f_sample(reject.size)
        mask = f_accept(fill)
        out[reject[mask]] = fill[mask]
        reject = reject[~mask]
    return out
Sanity check:
>>> counts, bins = np.histogram(draw(100000, sample, accept))
>>> counts / stats.norm().pdf((bins[:-1] + bins[1:]) / 2)
array([65075.50020815, 65317.17811578, 60973.84255365, 59440.53739031,
58969.62310004, 59267.33983256, 60565.1928325 , 61108.60840388,
64303.2863583 , 68293.86441234])
Looks roughly flat, so ok.
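And, to answer the original question directly, the array returned by draw always has exactly N elements, since rejected slots are redrawn in place rather than dropped:
samples = draw(1000, sample, accept)
print(samples.shape)  # (1000,)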

Find two pairs of pairs that sum to the same value

I have random 2d arrays which I make using
import numpy as np
from itertools import combinations
n = 50
A = np.random.randint(2, size=(n,n))
I would like to determine if the matrix has two pairs of pairs of rows which sum to the same row vector. I am looking for a fast method to do this. My current method just tries all possibilities.
for pair in combinations(combinations(range(n), 2), 2):
    if np.array_equal(A[pair[0][0]] + A[pair[0][1]], A[pair[1][0]] + A[pair[1][1]]):
        print("Pair found", pair)
A method that worked for n = 100 would be really great.
Here is a pure numpy solution; no extensive timings, but I have to push n up to 500 before I can see my cursor blink once before it completes. It is memory intensive though, and will fail due to memory requirements for much larger n. Either way, I get the intuition that the odds of finding such a vector decrease geometrically for larger n anyway.
import numpy as np
n = 100
A = np.random.randint(2, size=(n,n)).astype(np.uint8)  # unsigned, so the base-3 packing below stays within uint64
def base3(a):
    """
    pack the last axis of an array in base 3
    40 base 3 numbers per uint64
    """
    S = np.array_split(a, a.shape[-1]//40+1, axis=-1)
    R = np.zeros(shape=a.shape[:-1]+(len(S),), dtype=np.uint64)
    for i in range(len(S)):
        s = S[i]
        r = R[...,i]
        for j in range(s.shape[-1]):
            r *= 3
            r += s[...,j]
    return R
def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), int)
    np.add.at(count, inverse, 1)
    return unique, count
def voidview(arr):
    """view the last axis of an array as a void object. can be used as a faster form of lexsort"""
    return np.ascontiguousarray(arr).view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1]))).reshape(arr.shape[:-1])
def has_pairs_of_pairs(A):
    # optional; convert rows to base 3
    A = base3(A)
    # precompute sums over a lower triangular set of all combinations
    rowsums = sum(A[I] for I in np.tril_indices(n,-1))
    # count the number of times each row occurs by sorting
    # note that this is not quite O(n log n), since the cost of handling each row is also a function of n
    unique, count = unique_count(voidview(rowsums))
    # report whether any pairs of pairs exist;
    # computing their indices is left as an exercise for the reader
    return np.any(count>1)
from time import perf_counter
t = perf_counter()
for i in range(100):
    print(has_pairs_of_pairs(A))
print(perf_counter()-t)
Edit: included base-3 packing; now n=2000 is feasible, taking about 2gb of mem, and a few seconds of processing
Edit: added some timings; n=100 takes only 5ms per call on my i7m.
Based on the code in your question, and on the assumption that you're actually looking for pairs of pairs of rows that sum to the same row vector, you could do something like this:
def findMatchSets(A):
    B = A.transpose()
    pairs = tuple(combinations(range(len(A[0])), 2))
    matchSets = [[i for i in pairs if B[0][i[0]] + B[0][i[1]] == z] for z in range(3)]
    for c in range(1, len(A[0])):
        matchSets = [[i for i in block if B[c][i[0]] + B[c][i[1]] == z] for z in range(3) for block in matchSets]
        matchSets = [block for block in matchSets if len(block) > 1]
        if not matchSets:
            return []
    return matchSets
This basically stratifies the matrix into equivalence sets that sum to the same value after one column has been taken into account, then two columns, then three, and so on, until it either reaches the last column or there is no equivalence set left with more than one member (i.e. there is no such pair of pairs). This will work fine for 100x100 arrays, largely because the chances of two pairs of rows summing to the same row vector are vanishingly small when n is large (n*(n-1)/2 combinations compared to 3^n possible vector sums).
UPDATE
Updated code to allow searching for pairs of n-size subsets of all rows, as requested. Default is n=2 as per the original question:
def findMatchSets(A, n=2):
    B = A.transpose()
    pairs = tuple(combinations(range(len(A[0])), n))
    matchSets = [[i for i in pairs if sum([B[0][i[j]] for j in range(n)]) == z] for z in range(n + 1)]
    for c in range(1, len(A[0])):
        matchSets = [[i for i in block if sum([B[c][i[j]] for j in range(n)]) == z] for z in range(n + 1) for block in matchSets]
        matchSets = [block for block in matchSets if len(block) > 1]
        if not matchSets:
            return []
    return matchSets
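Hypothetical usage on a matrix like the one in the question (an empty list means no matching group was found):
import numpy as np
from itertools import combinations  # findMatchSets above relies on this import
A = np.random.randint(2, size=(100, 100))
matches = findMatchSets(A)  # groups of row-index pairs whose rows sum to the same row vector
print(matches if matches else "no pair of pairs found")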
Here is a 'lazy' approach that scales up to n=10000, using 'only' 4gb of memory, and completing in about 10s per call. Worst case complexity is O(n^3), but for random data the expected performance is O(n^2). At first sight, it seems like you'd need O(n^3) ops; each row combination needs to be produced and inspected at least once. But we need not look at the entire row: we can perform an early exit on the comparison of row pairs once it is clear they are of no use to us, and for random data we can typically draw this conclusion long before we have considered all columns in a row.
import numpy as np
from functools import reduce
n = 10
# also works for non-square A
A = np.random.randint(2, size=(n*2,n)).astype(np.uint8)  # unsigned, so the base packing below stays within uint64
# force the inclusion of some hits, to keep our algorithm on its toes
##A[0] = A[1]
def base_pack_lazy(a, base, dtype=np.uint64):
    """
    pack the last axis of an array as minimal base representation
    lazily yields packed columns of the original matrix
    """
    a = np.ascontiguousarray(np.rollaxis(a, -1))
    init = np.zeros(a.shape[1:], dtype)
    packing = int(np.dtype(dtype).itemsize * 8 / (float(base) / 2))
    for columns in np.array_split(a, (len(a)-1)//packing+1):
        yield reduce(
            lambda acc, inc: acc*base+inc,
            columns,
            init)
def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), int)
    np.add.at(count, inverse, 1)  # note: this scatter operation requires numpy 1.8; use a sparse matrix otherwise!
    return unique, count, inverse
def has_identical_row_sums_lazy(A, combinations_index):
    """
    compute the existence of combinations of rows summing to the same vector,
    given an nxm matrix A and an index matrix specifying all combinations
    naively, we need to compute the sum of each row combination at least once, giving n^3 computations
    however, this isn't strictly required; we can lazily consider the columns, giving an early exit opportunity
    all nicely vectorized of course
    """
    multiplicity, combinations = combinations_index.shape
    # list of indices into combinations_index, denoting possibly interacting combinations
    active_combinations = np.arange(combinations, dtype=np.uint32)
    for packed_column in base_pack_lazy(A, base=multiplicity+1):  # loop over packed cols
        # compute rowsums only for a fixed number of columns at a time.
        # this is O(n^2) rather than O(n^3), and after considering the first column,
        # we can typically already exclude almost all row pairs
        partial_rowsums = sum(packed_column[I[active_combinations]] for I in combinations_index)
        # find duplicates in this column
        unique, count, inverse = unique_count(partial_rowsums)
        # prune those pairs which we can exclude as having different sums, based on columns inspected thus far
        active_combinations = active_combinations[count[inverse] > 1]
        # early exit; no pairs
        if len(active_combinations) == 0:
            return False
    return True
def has_identical_triple_row_sums(A):
    n = len(A)
    idx = np.array([(i,j,k)
                    for i in range(n)
                    for j in range(n)
                    for k in range(n)
                    if i<j and j<k], dtype=np.uint16)
    idx = np.ascontiguousarray(idx.T)
    return has_identical_row_sums_lazy(A, idx)
def has_identical_double_row_sums(A):
    n = len(A)
    idx = np.array(np.tril_indices(n,-1), dtype=np.int32)
    return has_identical_row_sums_lazy(A, idx)
from time import perf_counter
t = perf_counter()
for i in range(10):
    print(has_identical_double_row_sums(A))
    print(has_identical_triple_row_sums(A))
print(perf_counter()-t)
Extended to include the calculation over sums of triplets of rows, as you asked above. For n=100, this still takes only about 0.2s
Edit: some cleanup; edit2: some more cleanup
Your current code does not test for pairs of rows that sum to the same value.
Assuming that's actually what you want, it's best to stick to pure numpy. This generates the indices of all rows that have equal sum.
import numpy as np
n = 100
A = np.random.randint(2, size=(n,n))
rowsum = A.sum(axis=1)
unique, inverse = np.unique(rowsum, return_inverse=True)
count = np.zeros_like(unique)
np.add.at(count, inverse, 1)
for p in unique[count>1]:
    print(p, np.nonzero(rowsum==p)[0])
If all you need to do is determine whether such a pair exists you can do:
exists_unique = np.unique(A.sum(axis=1)).size != A.shape[0]
