Sparse matrix multiplication with dictionaries in Python

A sparse matrix is a matrix in which most entries are zero. To save memory and storage, it is convenient to represent such a matrix with a dictionary in the following form: for each non-zero cell of the matrix, the dictionary stores a tuple key that holds the coordinates of the cell, and the associated value is the value of that cell (a number of type int or float).
As usual in mathematics, the indices of the matrix start from one.
• The cell coordinate (i, j) contains natural numbers, so that the coordinate represents the cell in the i-th row and the j-th column.
• The order in the dictionary is not important.
Implement the function sparse_mult(mat1, mat2, n), which receives two dictionaries, mat1 and mat2, representing square sparse matrices of size n×n, and returns a dictionary representing the matrix product mat1×mat2.
Pay attention:
There is no need to check the validity of the matrices.
It can be assumed that n is a natural number > 1.
The returned dictionary must represent a sparse matrix as defined above.
for i in range(1, n + 1):
    temp = 0
    for j in range(1, n + 1):
        if (mat1.get((i, j), 0) != 0)|(mat2.get((j, i), 0) != 0):
            temp += mat1.get((i, j), 0) * mat2.get((j, i), 0)
    if temp !=0:
        resultrow[(i, i)]=temp
That's my code. I know I got it wrong, but I just don't have a clue how to fix it.
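For reference, the dictionary representation described in the question can be sketched as a pair of conversion helpers. The names dense_to_sparse and sparse_to_dense are made up for this illustration and are not part of the assignment; keys are (row, column) tuples with 1-based indices, and zero cells are simply not stored.

```python
def dense_to_sparse(rows):
    """Convert a list-of-lists matrix to a {(i, j): value} dict (1-based keys)."""
    return {
        (i, j): value
        for i, row in enumerate(rows, start=1)
        for j, value in enumerate(row, start=1)
        if value != 0  # zero cells are not stored
    }


def sparse_to_dense(mat, n):
    """Expand a sparse dict back into an n x n list of lists."""
    return [[mat.get((i, j), 0) for j in range(1, n + 1)]
            for i in range(1, n + 1)]


example = dense_to_sparse([[0, 2], [3, 0]])
```

Here example holds one entry per non-zero cell, and sparse_to_dense(example, 2) rebuilds the original nested list.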

It is inefficient to iterate over all indices in the 2-dimensional index set when multiplying two sparse matrices. Instead, you can iterate over all pairs of keys where one key is drawn from each sparse matrix. Given such a pair (i,j) and (k,l), it contributes a product of 2 numbers if and only if j == k. In this case the corresponding product goes towards entry (i,l) in the overall product. A final dictionary comprehension can get rid of any zero entries. This last step might be inadequate if the numbers are floats and some entries are non-zero only due to round-off error. In that case, use a threshold approach which removes entries that are close to zero and not merely equal to zero.
def sparse_multiply(a,b):
    c = {}
    for i,j in a.keys():
        for k,l in b.keys():
            if j == k:
                p = a[(i,j)]*b[(k,l)]
                if (i,l) in c:
                    c[(i,l)] += p
                else:
                    c[(i,l)] = p
    return {k:v for k,v in c.items() if v != 0}
Note that n plays no role here. The complexity is O(mk), where m is the number of non-zero entries in the first matrix and k is the number of such entries in the second. For matrices which are very sparse this will be substantially faster than the O(n^3) of straightforward matrix multiplication. There will be some threshold where mk is actually larger than n^3, but at that stage the matrices are no longer sparse.
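The threshold idea mentioned in this answer could look roughly like the sketch below. The function name sparse_multiply_tol and the default tolerance tol=1e-12 are assumptions chosen for illustration; a suitable tolerance depends on the scale of the data. As an extra step that is not in the answer, the entries of b are grouped by row first, so each entry of a only visits the entries of b it can combine with.

```python
def sparse_multiply_tol(a, b, tol=1e-12):
    """Multiply two dict-based sparse matrices, dropping near-zero results.

    tol is an illustrative absolute tolerance, not a recommended value.
    """
    # Group the entries of b by their row index k.
    b_rows = {}
    for (k, l), v in b.items():
        b_rows.setdefault(k, []).append((l, v))

    c = {}
    for (i, j), u in a.items():
        # Only entries of b in row j can pair with a[(i, j)].
        for l, v in b_rows.get(j, ()):
            c[(i, l)] = c.get((i, l), 0) + u * v

    # Keep only entries whose magnitude exceeds the tolerance.
    return {key: val for key, val in c.items() if abs(val) > tol}
```

With integer inputs this behaves like an exact zero test as long as tol is below 1. With floats, a sum such as 0.1 + 0.2 - 0.3 leaves a residue of about 5.6e-17, which the tolerance filters out while the test v != 0 would keep it.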

So I eventually got it, in case anyone cares:
def sparse_mult(mat1, mat2, n):
    # initialize the result dictionary
    result = {}
    # iterate over the rows and columns of the result matrix
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            # initialize the running sum to 0
            total = 0
            # iterate over the columns of the first matrix and the rows of the second matrix
            for k in range(1, n + 1):
                # check that cell (i, k) of the first matrix and cell (k, j) of the second matrix are both stored (non-zero)
                if (i, k) in mat1 and (k, j) in mat2:
                    total += mat1[(i, k)] * mat2[(k, j)]
            # add the result to the dictionary if it is non-zero
            if total != 0:
                result[(i, j)] = total
    # return the result dictionary
    return result

Related

Python matrix multiplication - result matrix size

I am trying to create the product matrix for matrix multiplication in Python, but I am not sure what size the matrix will be as the user can give any input for the matrix multiplication.
I've approached the situation using this for a previous task on matrices when the actual matrix size is provided
product_matrix = [[col for col in range(4)] for row in range(4)]
But I'm not sure how to tackle it in this case.
For x = range(len(B[0])), did you mean x = len(B[0])?
When multiplying A * B, the resulting matrix should have the number of rows of A (num_row) and the number of columns of B (num_col).
x = len(B[0]) counts how many elements there are in each row of B, that is, x is num_col of B. len(A) counts how many rows there are in A, that is, len(A) is num_row of A. Therefore, your result is initialized as "each row has x elements, and there are len(A) rows", with all entries set to 0.
And the second line you provided should be inside a for-loop, which mirrors exactly how you calculate each entry of the resulting matrix by hand.
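The sizing rule described above can be sketched as follows. The helper names zero_result and matmul are hypothetical and only serve this example; the matrices are plain nested lists, as in the question.

```python
def zero_result(A, B):
    """Return a zero matrix with len(A) rows and len(B[0]) columns."""
    if len(A[0]) != len(B):
        raise ValueError("columns of A must equal rows of B")
    return [[0] * len(B[0]) for _ in range(len(A))]


def matmul(A, B):
    """Fill the pre-sized result entry by entry, as done by hand."""
    result = zero_result(A, B)
    for i in range(len(A)):          # rows of A
        for j in range(len(B[0])):   # columns of B
            for k in range(len(B)):  # shared dimension
                result[i][j] += A[i][k] * B[k][j]
    return result


A = [[1, 2, 3], [4, 5, 6]]      # 2 x 3
B = [[1, 0], [0, 1], [1, 1]]    # 3 x 2
C = matmul(A, B)                # 2 x 2
```

Because the size is derived from len(A) and len(B[0]), nothing has to be known in advance about the matrices the user enters.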
def inMatrix(m, n):  # read an m x n matrix from the user
    a = []  # blank matrix
    for i in range(m):  # row
        b = []  # blank row
        for j in range(n):  # col
            value = int(input("Enter Matrix Elements in pocket [" + str(i) + "][" + str(j) + "]"))
            b.append(value)  # add element to the row
        a.append(b)  # add the row to the matrix
    return a  # return matrix
def priMatrix(a):  # print a matrix
    for i in range(len(a)):  # row
        for j in range(len(a[0])):  # col
            print(a[i][j], end=" ")  # print number followed by a space
        print()  # new line for the next row
def multply(a, b):  # multiply two matrices passed as parameters
    mul = []  # blank result matrix
    for i in range(len(a)):  # rows of a
        l = []  # blank row
        for j in range(len(b[0])):  # columns of b (not of a)
            total = 0  # running sum for entry (i, j)
            for k in range(len(b)):  # shared dimension
                total = total + a[i][k] * b[k][j]
            l.append(total)  # add the computed entry
        mul.append(l)  # add the row
    return mul  # return the product matrix
m = int(input("row"))  # first matrix rows
n = int(input("col"))  # first matrix cols
a = inMatrix(m, n)  # first matrix input
j = int(input("row"))  # second matrix rows
k = int(input("col"))  # second matrix cols
b = inMatrix(j, k)  # second matrix input
priMatrix(a)  # print first matrix
priMatrix(b)  # print second matrix
c = multply(a, b)  # multiply
priMatrix(c)  # print the product

Are there some functions in Python for generating matrices with special conditions?

I'm writing a dataset generator in Python and I have the following problem: I need a set of zero-one matrices with no empty columns/rows. Also, the ratio between zeros and ones should be constant.
I've tried shuffling a zero-one list with a fixed ratio of zeros and ones and then reshaping it, but for matrices with hundreds of rows/cols it takes too long. I also took into account that some inputs cannot be achieved, like a 3*10 matrix with 9 one-elements, and that some inputs admit only very constrained solutions, like a 10*10 matrix with 10 one-elements.
If I understand the task, something like this might work:
import numpy as np
from collections import defaultdict, deque
def gen_mat(n, m, k):
    """
    n: rows,
    m: cols,
    k: ones,
    """
    assert k % n == 0 and k % m == 0
    mat = np.zeros((n, m), dtype=int)
    ns = np.repeat(np.arange(n), k // n)
    ms = np.repeat(np.arange(m), k // m)
    # uniform shuffle
    np.random.shuffle(ms)
    ms_deque = deque(ms)
    assigned = defaultdict(set)
    for n_i in ns:
        while True:
            m_i = ms_deque.popleft()
            if m_i in assigned[n_i]:
                ms_deque.append(m_i)
                continue
            mat[n_i, m_i] = 1
            assigned[n_i].add(m_i)
            break
    return mat
We first observe that an n x m matrix can be populated with k ones with equal ratios in every row and every column only if k is divisible by both n and m.
Assuming this condition holds, each row index will appear k/n times and each column index will appear k/m times. We shuffle the column indices to ensure that the assignment is random, and store the random column indices in a deque for efficiency.
For each row, we store a set of columns such that mat[row, column] = 1 (initially empty).
We can now loop over each row k/n times, picking the next column such that mat[row, column] = 0 from the deque and setting mat[row, column] to 1.
Without loss of generality, assume that n <= m. This algorithm terminates successfully unless we encounter a situation where all remaining columns in the deque satisfy mat[row, column] = 1. This can only happen in the last row, meaning that we have already assigned k/m + 1 ones to some column, which is impossible.
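The property this answer aims for (every row holds k/n ones and every column holds k/m ones) can be checked with a small helper. This is only a sketch: has_uniform_sums is a made-up name, and it is written against plain nested lists so it has no dependencies, although iterating over a 2-D NumPy array row by row should work the same way.

```python
def has_uniform_sums(mat, k):
    """Check that a 0/1 matrix has k ones spread evenly over rows and columns."""
    rows = [list(r) for r in mat]
    n, m = len(rows), len(rows[0])
    # The even split is only possible when both divisions are exact.
    if k % n or k % m:
        return False
    rows_ok = all(sum(r) == k // n for r in rows)
    cols_ok = all(sum(r[c] for r in rows) == k // m for c in range(m))
    return rows_ok and cols_ok
```

A check like this is a cheap way to test generated matrices, since every row and column sum being positive also rules out empty rows and columns.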

NumPy: Sparse outer product of n vectors (hyperbolic cross)

I'm trying to compute a certain subset of the full outer product of n vectors. The computation of the full outer product is described in this question.
Formally: Let v1,v2,...,vk be vectors of some length n, and K be a positive constant. I want a list containing all the products v1[i1]v2[i2]...vk[ik] for which i1*i2*...*ik <= K (indices start at one). Note: For example, if K = n ** k, the list would contain every combination.
My current approach is to create a hierarchical list of the indices fulfilling the condition above and then calculating the products recursively, which has the advantage of reusing some factors.
This implementation is a lot slower than the computation of the full outer product using NumPy (for same n and k). I want to achieve a better performance than the computation of the full product. I'm interested in larger values for k, and small K (this problem comes from function approximation with sparse bases, i.e. hyperbolic cross).
Does anyone know a more performant way to get this list? Maybe by using more NumPy or another algorithm? I will try a C implementation next.
Here is my current implementation:
import numpy as np
from functools import reduce
def get_cross_indices(n, k, K):
    """
    Assume k > 0.
    Returns a hierarchical list containing elements of type
    (i1, list) with
    - i1 being an index (zero based!)
    - list being again a list (possibly empty) with all indices i2, such
      that (i1+1) * (i2+1) * ... * (ik+1) <= K (going down the hierarchy)
    """
    if k == 1:
        num = min(n, K)
        return (num, [(x, []) for x in range(num)])
    else:
        indices = []
        nums = 0
        for i in range(min(n, K)):
            (num, tail) = get_cross_indices(n,
                k - 1, K // (i + 1))
            indices.append((i, tail))
            nums += num
        return (nums, indices)
def calc_cross_outer_product(vectors, result, factor, indices, pos):
    """
    Fills the result list recursively with all products
    vectors[0][i1] * ... * vectors[k-1][ik]
    such that i1,...,ik is a feasible index sequence
    from `indices` (they are in there hierarchically,
    also see `get_cross_indices`).
    """
    for (x, tail) in indices:
        if not tail:
            result[pos] = factor * vectors[0][x]
            pos += 1
        else:
            pos = calc_cross_outer_product(vectors[1:], result,
                factor * vectors[0][x], tail, pos)
    return pos
k = 3 # number of vectors
n = 4 # vector length
K = 3
# using random values here just for demonstration purposes
vectors = np.random.rand(k, n)
# get all indices which meet the condition
(count, indices) = get_cross_indices(n, k, K)
result = np.ones(count)
calc_cross_outer_product(vectors, result, 1, indices, 0)
## Equivalent version ##
alt_result = np.ones(count)
# create full outer products
outer_product = reduce(np.multiply, np.ix_(*vectors))
pos = 0
for inds in np.ndindex((n,)*k):
    # current index set is feasible?
    if np.prod(np.array(inds) + 1) <= K:
        # compute [ vectors[0][inds[0]],...,vectors[k-1][inds[k-1]] ]
        values = [vectors[d][i] for d, i in enumerate(inds)]
        alt_result[pos] = np.prod(values)
        pos += 1
To get a visual idea of the indices I'm interested in, here is a picture for k=3, K=n:
(taken from this website)
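The index condition from the question can be pinned down with a brute-force reference that is easy to compare against faster implementations on small inputs. This is only a sketch: cross_products_bruteforce is a made-up name, it uses plain Python sequences and math.prod (Python 3.8+), and it visits all n ** k index tuples, so it is not a candidate for the fast version being asked for.

```python
from itertools import product
from math import prod


def cross_products_bruteforce(vectors, K):
    """Return all v1[i1]*...*vk[ik] with (i1+1)*...*(ik+1) <= K (0-based i)."""
    n = len(vectors[0])
    out = []
    # Index tuples are visited in lexicographic order.
    for idx in product(range(n), repeat=len(vectors)):
        if prod(i + 1 for i in idx) <= K:
            out.append(prod(v[i] for v, i in zip(vectors, idx)))
    return out


small = cross_products_bruteforce([[1, 2, 3], [10, 20, 30]], K=2)
```

For two vectors of length 3 and K = 2, the feasible one-based index pairs are (1,1), (1,2) and (2,1), so small holds three products. With K = n ** k every index tuple passes the test and the full outer product is returned in flattened form.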

Find two pairs of pairs that sum to the same value

I have random 2d arrays which I make using
import numpy as np
from itertools import combinations
n = 50
A = np.random.randint(2, size=(n,n))
I would like to determine if the matrix has two pairs of pairs of rows which sum to the same row vector. I am looking for a fast method to do this. My current method just tries all possibilities.
for pair in combinations(combinations(range(n), 2), 2):
    if (np.array_equal(A[pair[0][0]] + A[pair[0][1]], A[pair[1][0]] + A[pair[1][1]] )):
        print("Pair found", pair)
A method that worked for n = 100 would be really great.
Here is a pure numpy solution; no extensive timings, but I have to push n up to 500 before I can see my cursor blink once before it completes. It is memory intensive though, and will fail due to memory requirements for much larger n. Either way, my intuition is that the odds of finding such a vector decrease geometrically for larger n anyway.
import numpy as np
n = 100
A = np.random.randint(2, size=(n,n)).astype(np.int8)
def base3(a):
    """
    pack the last axis of an array in base 3
    40 base 3 numbers per uint64
    """
    S = np.array_split(a, a.shape[-1]//40+1, axis=-1)
    R = np.zeros(shape=a.shape[:-1]+(len(S),), dtype = np.uint64)
    for i in range(len(S)):
        s = S[i]
        r = R[...,i]
        for j in range(s.shape[-1]):
            r *= 3
            r += s[...,j].astype(np.uint64)
    return R
def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), int)
    np.add.at(count, inverse, 1)
    return unique, count
def voidview(arr):
    """view the last axis of an array as a void object. can be used as a faster form of lexsort"""
    return np.ascontiguousarray(arr).view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1]))).reshape(arr.shape[:-1])
def has_pairs_of_pairs(A):
    #optional; convert rows to base 3
    A = base3(A)
    #precompute sums over a lower triangular set of all combinations
    rowsums = sum(A[I] for I in np.tril_indices(n,-1))
    #count the number of times each row occurs by sorting
    #note that this is not quite O(n log n), since the cost of handling each row is also a function of n
    unique, count = unique_count(voidview(rowsums))
    #print if any pairs of pairs exist;
    #computing their indices is left as an exercise for the reader
    return np.any(count>1)
from time import perf_counter
t = perf_counter()
for i in range(100):
    print(has_pairs_of_pairs(A))
print(perf_counter()-t)
Edit: included base-3 packing; now n=2000 is feasible, taking about 2gb of mem, and a few seconds of processing
Edit: added some timings; n=100 takes only 5ms per call on my i7m.
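The counting idea used here (sum every pair of rows, then look for a sum that occurs more than once) can also be sketched without NumPy, using a dictionary keyed by the summed row. The function name find_pair_of_pairs is made up for this illustration; unlike the vectorized version it also reports the indices, at the cost of being much slower for large n.

```python
from itertools import combinations


def find_pair_of_pairs(rows):
    """Return two distinct index pairs whose row sums are equal, or None."""
    seen = {}  # summed row (as a tuple) -> first index pair that produced it
    for i, j in combinations(range(len(rows)), 2):
        key = tuple(x + y for x, y in zip(rows[i], rows[j]))
        if key in seen:
            return seen[key], (i, j)
        seen[key] = (i, j)
    return None


hit = find_pair_of_pairs([[1, 0], [0, 1], [1, 1], [0, 0]])
```

In this small example rows 0 and 1 sum to (1, 1), and so do rows 2 and 3, so hit is ((0, 1), (2, 3)). The work is one vector addition and one dictionary lookup per pair of rows, and the function stops at the first repeated sum.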
Based on the code in your question, and on the assumption that you're actually looking for pairs of pairs of rows that sum to equal the same row vector, you could do something like this:
def findMatchSets(A):
    B = A.transpose()
    pairs = tuple(combinations(range(len(A[0])), 2))
    matchSets = [[i for i in pairs if B[0][i[0]] + B[0][i[1]] == z] for z in range(3)]
    for c in range(1, len(A[0])):
        matchSets = [[i for i in block if B[c][i[0]] + B[c][i[1]] == z] for z in range(3) for block in matchSets]
        matchSets = [block for block in matchSets if len(block) > 1]
        if not matchSets:
            return []
    return matchSets
This basically stratifies the matrix into equivalence sets that sum to the same value after one column has been taken into account, then two columns, then three, and so on, until it either reaches the last column or there is no equivalence set left with more than one member (i.e. there is no such pair of pairs). This will work fine for 100x100 arrays, largely because the chances of two pairs of rows summing to the same row vector are infinitesimally small when n is large (n*(n-1)/2 pairs of rows compared to 3^n possible vector sums).
UPDATE
Updated code to allow searching for pairs of n-size subsets of all rows, as requested. Default is n=2 as per the original question:
def findMatchSets(A, n=2):
    B = A.transpose()
    pairs = tuple(combinations(range(len(A[0])), n))
    matchSets = [[i for i in pairs if sum([B[0][i[j]] for j in range(n)]) == z] for z in range(n + 1)]
    for c in range(1, len(A[0])):
        matchSets = [[i for i in block if sum([B[c][i[j]] for j in range(n)]) == z] for z in range(n + 1) for block in matchSets]
        matchSets = [block for block in matchSets if len(block) > 1]
        if not matchSets:
            return []
    return matchSets
Here is a 'lazy' approach that scales up to n=10000, using 'only' 4gb of memory, and completing in 10s per call or so. Worst case complexity is O(n^3), but for random data, expected performance is O(n^2). At first sight, it seems like you'd need O(n^3) ops; each row combination needs to be produced and inspected at least once. But we need not look at the entire row. Rather, we can use an early exit strategy when comparing row pairs, once it is clear they are of no use to us; and for random data, we can typically draw this conclusion long before we have considered all columns in a row.
import numpy as np
from functools import reduce
n = 10
#also works for non-square A
A = np.random.randint(2, size=(n*2,n)).astype(np.int8)
#force the inclusion of some hits, to keep our algorithm on its toes
##A[0] = A[1]
def base_pack_lazy(a, base, dtype=np.uint64):
    """
    pack the last axis of an array as minimal base representation
    lazily yields packed columns of the original matrix
    """
    a = np.ascontiguousarray( np.rollaxis(a, -1))
    init = np.zeros(a.shape[1:], dtype)
    packing = int(np.dtype(dtype).itemsize * 8 / (float(base) / 2))
    for columns in np.array_split(a, (len(a)-1)//packing+1):
        yield reduce(
            lambda acc,inc: acc*base+inc,
            columns,
            init)
def unique_count(a):
    """returns counts of unique elements"""
    unique, inverse = np.unique(a, return_inverse=True)
    count = np.zeros(len(unique), int)
    np.add.at(count, inverse, 1) #note; this scatter operation requires numpy 1.8; use a sparse matrix otherwise!
    return unique, count, inverse
def has_identical_row_sums_lazy(A, combinations_index):
    """
    compute the existence of combinations of rows summing to the same vector,
    given an nxm matrix A and an index matrix specifying all combinations
    naively, we need to compute the sum of each row combination at least once, giving n^3 computations
    however, this isn't strictly required; we can lazily consider the columns, giving an early exit opportunity
    all nicely vectorized of course
    """
    multiplicity, combinations = combinations_index.shape
    #list of indices into combinations_index, denoting possibly interacting combinations
    active_combinations = np.arange(combinations, dtype=np.uint32)
    for packed_column in base_pack_lazy(A, base=multiplicity+1): #loop over packed cols
        #compute rowsums only for a fixed number of columns at a time.
        #this is O(n^2) rather than O(n^3), and after considering the first column,
        #we can typically already exclude almost all rowpairs
        partial_rowsums = sum(packed_column[I[active_combinations]] for I in combinations_index)
        #find duplicates in this column
        unique, count, inverse = unique_count(partial_rowsums)
        #prune those pairs which we can exclude as having different sums, based on columns inspected thus far
        active_combinations = active_combinations[count[inverse] > 1]
        #early exit; no pairs
        if len(active_combinations)==0:
            return False
    return True
def has_identical_triple_row_sums(A):
    n = len(A)
    idx = np.array( [(i,j,k)
        for i in range(n)
        for j in range(n)
        for k in range(n)
        if i<j and j<k], dtype=np.uint16)
    idx = np.ascontiguousarray( idx.T)
    return has_identical_row_sums_lazy(A, idx)
def has_identical_double_row_sums(A):
    n = len(A)
    idx = np.array(np.tril_indices(n,-1), dtype=np.int32)
    return has_identical_row_sums_lazy(A, idx)
from time import perf_counter
t = perf_counter()
for i in range(10):
    print(has_identical_double_row_sums(A))
    print(has_identical_triple_row_sums(A))
print(perf_counter()-t)
Extended to include the calculation over sums of triplets of rows, as you asked above. For n=100, this still takes only about 0.2s
Edit: some cleanup; edit2: some more cleanup
Your current code does not test for pairs of rows that sum to the same value.
Assuming that's actually what you want, it's best to stick to pure numpy. This generates the indices of all rows that have an equal sum.
import numpy as np
n = 100
A = np.random.randint(2, size=(n,n))
rowsum = A.sum(axis=1)
unique, inverse = np.unique(rowsum, return_inverse = True)
count = np.zeros_like(unique)
np.add.at(count, inverse, 1)
for p in unique[count>1]:
    print(p, np.nonzero(rowsum==p)[0])
If all you need to do is determine whether such a pair exists you can do:
exists_unique = np.unique(A.sum(axis=1)).size != A.shape[0]

Subsample a matrix python

I have a text file that lists pairs, for example
10,1
2,7
3,1
10,1
That has then been turned into a symmetric matrix, so the (1,10) entry is the number of times the pair (1,10) showed up on the list. I would now like to subsample this matrix. By subsample I mean - I would like to make a matrix that would have been the result of only using a random 30% of the lines in the original text file. So in this example, had I erased 70% of the text file, the (1,10) pair might only show up once instead of twice, and so the (1,10) entry in the matrix would be 1 instead of 2.
This can be done easily if I actually have the original text file, by just using random.sample to pick out 30% of the lines in the files. But if I only have the matrix, how can I randomly decimate 70% of the data?
I guess the best way depends on where your data is large:
Do you have a huge matrix, with mostly small counts in it? or
Do you have a moderately sized matrix with huge numbers of counts in it?
Here's a solution that will be suited to the second case, though it will also work OK in the first case.
Basically, the fact that the counts happen to be in a 2D matrix is not so important: this is basically the problem of sampling from a population that has been binned. So what we can do is extract the bins directly, and forget about the matrix for a bit:
import numpy as np
import random
# Input counts matrix
mat = np.array([
    [5, 5, 2],
    [1, 1, 3],
    [6, 0, 4]
], dtype=np.int64)
# Build a list of (row,col) pairs, and a list of counts
keys, counts = zip(*[
    ((i,j), mat[i,j])
    for i in range(mat.shape[0])
    for j in range(mat.shape[1])
    if mat[i,j] > 0
])
And then sample from those bins, using a cumulative array of counts:
# Make the cumulative counts array
counts = np.array(counts, dtype=np.int64)
sum_counts = np.cumsum(counts)
# Decide how many counts to include in the sample
frac_select = 0.30
count_select = int(sum_counts[-1] * frac_select)
# Choose unique counts
ind_select = sorted(random.sample(range(int(sum_counts[-1])), count_select))
# A vector to hold the new counts
out_counts = np.zeros(counts.shape, dtype=np.int64)
# Perform basically the merge step of merge-sort, finding where
# the counts land in the cumulative array
i = 0
j = 0
while i<len(sum_counts) and j<len(ind_select):
    if ind_select[j] < sum_counts[i]:
        j += 1
        out_counts[i] += 1
    else:
        i += 1
# Rebuild the matrix using the `keys` list from before
out_mat = np.zeros(mat.shape, dtype=np.int64)
for i in range(len(out_counts)):
    out_mat[keys[i]] = out_counts[i]
Now you will have the sampled matrix in out_mat.
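The same binned-sampling idea can be sketched with the standard library only, replacing the manual merge step with a binary search into the cumulative counts. The function name subsample_counts and the dictionary input format are assumptions made for this illustration, and rng is any object with a sample method, such as the random module or a random.Random instance.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def subsample_counts(counts, frac, rng=random):
    """Keep a random fraction of the counted items.

    counts maps a key, e.g. (row, col), to a non-negative integer count.
    """
    keys = [key for key, c in counts.items() if c > 0]
    cumulative = list(accumulate(counts[key] for key in keys))
    total = cumulative[-1] if cumulative else 0
    # Draw distinct positions in the virtual list of all counted items.
    chosen = rng.sample(range(total), int(total * frac))
    out = {}
    for pos in chosen:
        # Bin i covers positions cumulative[i-1] .. cumulative[i] - 1.
        key = keys[bisect_right(cumulative, pos)]
        out[key] = out.get(key, 0) + 1
    return out


bins = {(0, 0): 5, (0, 1): 5, (0, 2): 2, (1, 0): 1, (1, 1): 1,
        (1, 2): 3, (2, 0): 6, (2, 2): 4}
sampled = subsample_counts(bins, 0.30, rng=random.Random(0))
```

Each counted item is equally likely to be kept, which matches deleting random lines from the original file. The result is a sparse dictionary, so keys that end up with no samples are simply absent.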
Unfortunately, examples two and three do not follow the correct distribution with respect to the number of appearances of lines in the original file.
Instead of removing tuples from the original data, you could randomly remove counts from your matrix.
So you have to generate random indices and decrease the corresponding count. Be sure to avoid decreasing a zero count; generate a new index instead. Do this until you have decreased the overall number of counted tuples to 30%.
Basically this could look like this:
amount_to_decrease = 0.7 * overall_amount
decreased = 0
while decreased < amount_to_decrease:
    # randrange(n) yields 0..n-1; randint(0, n) could return the invalid index n
    x = random.randrange(n)
    y = random.randrange(n)
    if matrix[x][y] > 0:
        matrix[x][y]-=1
        decreased+=1
        if x != y:
            matrix[y][x]-=1
This should work well if your matrix is well populated.
If it's not you might want to recreate a list of tuples from the matrix and then choose a random subset from that. After this recreate your matrix from the remaining tuples:
tuples = []
for y in range(n):
    for x in range(y+1):
        for _ in range(matrix[x][y]):
            tuples.append((x,y))
# keep 30% of the tuples
remaining = random.sample(tuples, int(overall_amount*0.3))
Or you can do a combination where you do a first pass to find all indices that are not zero and then sample these to decrease the counts:
valid_indices = []
for y in range(n):
    for x in range(y+1):
        # only indices with a non-zero count can be decreased
        if matrix[x][y] > 0:
            valid_indices.append((x,y))
amount_to_decrease = 0.7 * overall_amount
decreased = 0
while decreased < amount_to_decrease:
    x,y = random.choice(valid_indices)
    matrix[x][y]-=1
    decreased+=1
    if x != y:
        matrix[y][x]-=1
    if matrix[x][y] == 0:
        valid_indices.remove((x,y))
There is another approach that uses the right probabilities but might not give you an exact reduction. The idea is to set a probability for keeping a line/count. This could be 0.3 if you are aiming for a reduction to 30%. Then you can go over the matrix and decide for every single count whether it should be kept or not.
keep_chance = 0.3
for y in range(n):
    for x in range(y+1):
        for _ in range(matrix[x][y]):
            if random.random() > keep_chance:
                matrix[x][y] -= 1
                if x != y:
                    matrix[y][x]-=1
Assuming that the pairs 1,10 and 10,1 are different, so that mat[1][10] is not necessarily the same as mat[10][1] (if not, read below the line):
First compute the sum of all the values in the matrix.
Let this sum be S. It equals the number of lines in the file.
Let x and y be the dimensions of the matrix.
Now loop for n from 0 to [70% of S]:
    pick a random integer between 1 and x. Let this be j
    pick a random integer between 1 and y. Let this be k
    if mat[j][k] > 0, decrease mat[j][k] and do n++
Since you increase a single value in the matrix for each line in your file, randomly decreasing a positive value in the matrix is the same as decimating the lines in the file.
If 10,1 is the same as 1,10 you only need half of the matrix, so you can change the algorithm like this:
Loop for n from 0 to [70% of S]:
    pick a random integer between 1 and x. Let this be j
    pick a random integer between 1 and j. Let this be k
    if mat[j][k] > 0, decrease mat[j][k] and do n++
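The symmetric variant of this pseudo-code could be written as the runnable sketch below. The name decimate_symmetric and its parameters are hypothetical; the sketch uses 0-based indices instead of the 1-based ones in the description, counts S over the lower triangle only, and mirrors each decrement so the matrix stays symmetric.

```python
import random


def decimate_symmetric(mat, frac_remove=0.7, rng=random):
    """Remove a fraction of the counts from a symmetric count matrix in place.

    mat is a square list of lists; only cells with k <= j are sampled.
    """
    size = len(mat)
    total = sum(mat[j][k] for j in range(size) for k in range(j + 1))
    to_remove = int(total * frac_remove)
    removed = 0
    while removed < to_remove:
        j = rng.randrange(size)   # row
        k = rng.randrange(j + 1)  # column within the lower triangle
        if mat[j][k] > 0:
            mat[j][k] -= 1
            if j != k:
                mat[k][j] -= 1    # keep the mirror cell in sync
            removed += 1
    return mat


m = decimate_symmetric([[2, 1, 0], [1, 0, 3], [0, 3, 4]],
                       rng=random.Random(1))
```

Note that this procedure picks a cell at random rather than a counted line at random, so a cell with a small count is hit as often as a cell with a large one. Deleting random lines from the file would instead remove counts in proportion to their size.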
