Create a sparse matrix from a generator - python

I would like to create a big sparse matrix whose source data can't be fully loaded into memory. Think of a very big file on disk that we can't read in all at once.
I have thought about it, but I couldn't find a way to create a sparse matrix from a generator.
from scipy.sparse import coo_matrix
import random

matrix1 = coo_matrix(xrange(10))  # works: a 1x10 sparse matrix with 9 stored (nonzero) elements
data = ((0, 1, random.randint(0, 5)) for i in xrange(10))  # generator example
matrix2 = coo_matrix(data)  # does not work
Any idea?
Edit: I found this; I haven't tried it yet, but it looks helpful.

Here's an example of using a generator to populate a sparse matrix. I use the generator to fill a structured array, and create the sparse matrix from its fields.
import numpy as np
from scipy import sparse
N, M = 3,4
def foo(N,M):
    # just a simple dense matrix of random data
    cnt = 0
    for i in xrange(N):
        for j in xrange(M):
            yield cnt, (i, j, np.random.random())
            cnt += 1
dt = np.dtype([('i',int), ('j',int), ('data',float)])
X = np.empty((N*M,), dtype=dt)
for cnt, tup in foo(N,M):
    X[cnt] = tup
print X.shape
print X['i']
print X['j']
print X['data']
S = sparse.coo_matrix((X['data'], (X['i'], X['j'])), shape=(N,M))
print S.shape
print S.A
producing something like:
(12,)
[0 0 0 0 1 1 1 1 2 2 2 2]
[0 1 2 3 0 1 2 3 0 1 2 3]
[ 0.99268494  0.89277993  0.32847213  0.56583702  0.63482291  0.52278063
  0.62564791  0.15356269  0.1554067   0.16644956  0.41444479  0.75105334]
(3, 4)
[[ 0.99268494  0.89277993  0.32847213  0.56583702]
 [ 0.63482291  0.52278063  0.62564791  0.15356269]
 [ 0.1554067   0.16644956  0.41444479  0.75105334]]
All of the nonzero data points will exist in memory in two forms: the fields of X, and the row, col, data arrays of the sparse matrix.
A structured array like X could also be loaded from the columns of a csv file.
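For instance, a minimal sketch of that idea, assuming a hypothetical coords.csv whose lines are "i,j,value" (the filename and column layout are only assumptions, not part of the original answer):

import numpy as np
from scipy import sparse

# same field layout as the structured array X above
dt = np.dtype([('i', int), ('j', int), ('data', float)])
X = np.genfromtxt('coords.csv', delimiter=',', dtype=dt)  # hypothetical file
S = sparse.coo_matrix((X['data'], (X['i'], X['j'])))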
A couple of the sparse matrix formats let you set data elements, e.g.
S = sparse.lil_matrix((N,M))
for cnt, tup in foo(N,M):
    i,j,value = tup
    S[i,j] = value
print S.A
scipy.sparse itself tells me, via its sparse-efficiency warning, that lil is the least expensive format for this type of assignment.

Related

Numpy Matrix values to complex values in the form of 0+Value*i

I have a matrix in the form of a = numpy.matrix('1 2 3; 4 5 6', dtype=complex) of course giving the output:
a = [1 2 3; 4 5 6]
I'm trying to get this matrix to the form of:
a = [0+1*i 0+2*i 0+3*i; 0+4*i 0+5*i 0+6*i]
I can step through this in a loop like:
for i in range(2):
    for j in range(3):
        a[i,j] = complex(0, a[i,j])
The only problem is my matrix is quite a bit larger than a 2x3. I'd like to use a.astype(complex) but this gives me an output of:
a = [1+0i 2+0i 3+0i; 4+0i 5+0i 6+0i]
How can I use astype to get my matrix to my desired output with the matrix values as the imaginary components?
I am, of course, open to other suggestions to speed this up instead of looping through a million values in an array.
Thank you!
~Ave
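One vectorized possibility (a sketch, not an answer from the original thread): multiplying by 1j promotes the matrix to complex and moves every value into the imaginary part.

import numpy as np

a = np.matrix('1 2 3; 4 5 6')
b = 1j * a    # complex dtype; the original values become the imaginary parts
# b is now [[0+1j 0+2j 0+3j]
#           [0+4j 0+5j 0+6j]]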

Why is an operation on a list slower than an operation on a numpy array

I am doing a project on encrypting data using the RSA algorithm. I have taken a .wav file as input and read it using wavfile. I can apply the encryption key (3, 25777), but when I apply the decryption key (16971, 25777) it gives the wrong output, like this:
The output I'm getting:
[[     0 -25777]
 [     0 -25777]
 [     0 -25777]
 ...
 [-25777 -25777]
 [-15837 -15837]
 [ -8621      1]]
The output I want:
[[ 0 -1]
 [ 2 -1]
 [ 2 -3]
 ...
 [-9 -5]
 [-2 -2]
 [-4  1]]
This was happening only with the decryption part of the array, so I decided to convert the 2D array to a list of lists. After that it gives the desired output, but it takes a lot of time to apply the key to all the elements of the list (16 minutes, versus 2 seconds for the array). I don't understand why this happens, and whether there is any other solution to this problem.
Here is the encryption and decryption part of the program:
#encryption
for i in range(0, tup[0]):      #tup[0] is the no of rows
    for j in range(0, tup[1]):  #tup[1] is the no of cols
        x = data[i][j]
        x = ((pow(x,3)) % 25777)    #applying the keys
        data[i][j] = x              #storing back the updated value
#decryption
data = data.tolist()    #2d array to list of lists
for i1 in range(len(data)):
    for j1 in range(len(data[i1])):
        x1 = data[i1][j1]
        x1 = (pow(x1, 16971) % 25777)   #applying the keys
        data[i1][j1] = x1
Looking forward to suggestions. Thank you.
The occurrence of something like pow(x1, 16971) should give you pause. For almost any integer x1 this yields a result that a 64-bit int cannot hold, which is why numpy gives the wrong result: numpy uses 64-bit or 32-bit integers on the most common platforms. It is also why plain Python is slow: while it can handle arbitrarily large integers, doing so is costly.
A way around this is to apply the modulus between the multiplications; that way the numbers remain small and can readily be handled by 64-bit arithmetic.
Here is a simple implementation:
def powmod(b, e, m):
    b2 = b
    res = 1
    while e:
        if e & 1:
            res = (res * b2) % m
        b2 = (b2*b2) % m
        e >>= 1
    return res
For example:
>>> powmod(2000, 16971, 25777)
10087
>>> (2000**16971)%25777
10087
>>> timeit(lambda: powmod(2000, 16971, 25777), number=100)
0.00031936285085976124
>>> timeit(lambda: (2000**16971)%25777, number=100)
0.255017823074013
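As a side note (not part of the answer above), Python's built-in pow already accepts a modulus as a third argument and performs the same fast modular exponentiation:
>>> pow(2000, 16971, 25777)
10087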

Fastest way to iterate a numpy array and update each element

This might seem weird to you, but I happen to have this particular goal to achieve; the code goes as follows.
# A is a numpy array, dtype=int32,
# and each element is actually an ID(int), the ID range might be wide,
# but the actually existing values are quite fewer than the dense range,
A = array([[379621, 552965, 192509],
           [509849, 252786, 710979],
           [379621, 718598, 591201],
           [509849,  35700, 951719]])
# and I need to map these sparse ID to dense ones,
# my idea is to have a dict, mapping actual_sparse_ID -> dense_ID
M = {}
# so I iterate this numpy array, and check if this sparse ID has a dense one or not
for i in np.nditer(A, op_flags=['readwrite']):
    if i not in M:
        M[i] = len(M)   # sparse ID got a dense one
    i[...] = M[i]       # replace sparse one with the dense ID
My goal could be achieved with np.unique(A, return_inverse=True), and the return_inverse result is what I want.
However, the numpy array I have is too huge to fully load into memory, so I cannot run np.unique over the whole data, and this is why I came up with this dict-mapping idea...
Is this the right way to go? Any possible improvement?
I will make an attempt to provide an alternative way of doing this by using numpy.unique() on sub-arrays. This solution is not fully tested. I also did not do any side-by-side performance evaluation since your solution is not fully working for me.
Let's say we have an array c that we split into two smaller arrays. Let's create some test data, for example:
>>> a = np.array([[1,1,2,3,4],[1,2,6,6,2],[8,0,1,1,4]])
>>> b = np.array([[11,2,-1,12,6],[12,2,6,11,2],[7,0,3,1,3]])
>>> c = np.vstack([a, b])
>>> print(c)
[[ 1  1  2  3  4]
 [ 1  2  6  6  2]
 [ 8  0  1  1  4]
 [11  2 -1 12  6]
 [12  2  6 11  2]
 [ 7  0  3  1  3]]
Here we assume that c is the large array and a and b are sub-arrays. Of course, one could build c first and then extract sub-arrays.
Next step is to run numpy.unique() on the two sub-arrays:
>>> ua, ia = np.unique(a, return_inverse=True)
>>> ub, ib = np.unique(b, return_inverse=True)
>>> uc, ic = np.unique(c, return_inverse=True) # this is for future reference
Now, here is an algorithm for combining the results from subarrays:
def merge_unique(ua, ia, ub, ib):
    # make copies *if* changing inputs is undesirable:
    ua = ua.copy()
    ia = ia.copy()
    ub = ub.copy()
    ib = ib.copy()
    # find differences between unique values in the two arrays:
    diffab = np.setdiff1d(ua, ub, assume_unique=True)
    diffba = np.setdiff1d(ub, ua, assume_unique=True)
    # find indices in ua, ub where to insert "other" unique values:
    ssa = np.searchsorted(ua, diffba)
    ssb = np.searchsorted(ub, diffab)
    # throw away values that are too large:
    ssa = ssa[np.where(ssa < len(ua))]
    ssb = ssb[np.where(ssb < len(ub))]
    # increment indices past previously computed "insert" positions:
    for v in ssa[::-1]:
        ia[ia >= v] += 1
    for v in ssb[::-1]:
        ib[ib >= v] += 1
    # combine results:
    uc = np.union1d(ua, ub) # or use ssa, ssb, diffba, diffab to update ua, ub
    ic = np.concatenate([ia, ib])
    return uc, ic
Now, let's run this function on the results of numpy.unique() from sub-arrays and then compare merged indices and unique values with the reference results uc and ic:
>>> uc2, ic2 = merge_unique(ua, ia, ub, ib)
>>> np.all(uc2 == uc)
True
>>> np.all(ic2 == ic)
True
Splitting into more than two sub-arrays can be handled with little additional work - simply keep accumulating "unique" values and indices, like this:
uacc, iacc = np.unique(subarr1, return_inverse=True)
ui, ii = np.unique(subarr2, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
ui, ii = np.unique(subarr3, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
ui, ii = np.unique(subarr4, return_inverse=True)
uacc, iacc = merge_unique(uacc, iacc, ui, ii)
................................ (etc.)
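That accumulation pattern could also be wrapped in a small helper that reuses merge_unique from above; a sketch (the function name and the chunks iterable are my own, not from the answer):

import numpy as np

def unique_over_chunks(chunks):
    # chunks: an iterable yielding the sub-arrays one at a time
    it = iter(chunks)
    uacc, iacc = np.unique(next(it), return_inverse=True)
    for chunk in it:
        ui, ii = np.unique(chunk, return_inverse=True)
        uacc, iacc = merge_unique(uacc, iacc, ui, ii)
    return uacc, iacc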

Transforming a 3 Column Matrix into an N x N Matrix in Numpy

I have a 2D numpy array with 3 columns. Columns 1 and 2 are a list of connections between IDs. Column 3 is the strength of that connection. I would like to transform this 3-column matrix into a weighted adjacency matrix (an N x N matrix where each cell represents the strength of the connection between two IDs).
I have already done this in my code below. matrix is the 3 column 2D array and t1 is the weighted adjacency matrix. My problem is this code is very slow because I am using nested for loops. I am familiar with the pandas function melt which does this, but I am not able to use pandas. Is there a faster implementation not using pandas?
import numpy as np
a = np.arange(2000)
np.random.shuffle(a)
b = np.arange(2000)
np.random.shuffle(b)
c = np.random.rand(2000,1)
matrix = np.column_stack((a,b,c))
#get unique value list of nm
flds = list(np.unique(matrix[:,0]))
flds.extend(list(np.unique(matrix[:,1])))
flds = np.asarray(flds)
flds = np.unique(flds)
#make lookup dict
lookup = dict(zip(np.arange(0,len(flds)), flds))
lookup_rev = dict(zip(flds, np.arange(0,len(flds))))
#make empty n by n matrix with unique lists
t1 = np.zeros([len(flds) , len(flds)])
#map values into the n by n matrix and make the rest 0
'''this takes a long time to run'''
#iterate through rows
for i in np.arange(0,len(lookup)):
    #iterate through columns
    for k in np.arange(0,len(lookup)):
        val = matrix[(matrix[:,0] == lookup[i]) & (matrix[:,1] == lookup[k])][:,2]
        if val:
            t1[i,k] = sum(val)
Assuming that I understood the question correctly and that val is a scalar, you could use a vectorized approach that involves initializing with zeros and then indexing, like so -
out = np.zeros((len(flds),len(flds)))
out[matrix[:,0].astype(int),matrix[:,1].astype(int)] = matrix[:,2]
Please note that by my observation it looks like you can avoid using lookup.
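One caveat worth adding (my note, not the answer's): if the same (row, col) pair can occur more than once, plain fancy-index assignment keeps only one of the duplicate values, whereas the original loop sums them with sum(val); np.add.at accumulates duplicates instead:

out = np.zeros((len(flds), len(flds)))
np.add.at(out, (matrix[:,0].astype(int), matrix[:,1].astype(int)), matrix[:,2])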
You need to iterate your matrix only once:
import numpy as np
size = 2000
a = np.arange(size)
np.random.shuffle(a)
b = np.arange(size)
np.random.shuffle(b)
c = np.random.rand(size,1)
matrix = np.column_stack((a,b,c))
#get unique value list of nm
fields = np.unique(matrix[:,:2])
n = len(fields)
#make reverse lookup dict
lookup = dict(zip(fields, range(n)))
#make empty n by n matrix
t1 = np.zeros([n, n])
for src, dest, val in matrix:
    i = lookup[src]
    j = lookup[dest]
    t1[i, j] += val
The main acceleration you can get is by not iterating through each element of the NxN matrix but instead iterating through your connection list, which is much smaller.
I tried to simplify your code a bit. It uses the list.index method, which can be slow, but it should still be faster than what you had.
import numpy as np
a = np.arange(2000)
np.random.shuffle(a)
b = np.arange(2000)
np.random.shuffle(b)
c = np.random.rand(2000,1)
matrix = np.column_stack((a,b,c))
lookup = np.unique(matrix[:,:2]).tolist() # You can call unique only once
t1 = np.zeros((len(lookup),len(lookup)))
for i,j,val in matrix:
    t1[lookup.index(i),lookup.index(j)] = val # Fill the matrix

Speeding up summing certain columns in a matrix

Question in short
Given a large sparse csr_matrix A and a numpy array B, what is the fastest way to construct a numpy matrix C, such that C[i,j] = sum(A[k,j]) for all k where B[k] == i?
Details of question
I found a solution to do this, but I am not really content with how long it takes. I will first explain the problem, then my solution, then show my code, and then show my timings.
Problem
I am working on a clustering algorithm in Python, and I'd like to speed it up. I have a sparse csr_matrix pam, which records per person, per article, how many items of that article they bought. Furthermore, I have a numpy array clustering, which denotes the cluster each person belongs to. Example:
        pam                pam.T               clustering
      article             person
p   [[1 0 0 0]      a
e    [0 2 0 0]      r   [[1 0 1 0 0 0]        [0 0 0 0 1 1]
r    [1 1 0 0]      t    [0 2 1 0 0 0]
s    [0 0 1 0]      i    [0 0 0 1 0 1]
o    [0 0 0 1]      c    [0 0 0 0 1 2]]
n    [0 0 1 2]]     l
                    e
What I'd like to calculate is acm: the number of items that all the people in one cluster together bought. This amounts to, for every column i of acm, adding up those columns p of pam.T for which clustering[p] == i.
      acm
    cluster
a
r  [[2 0]
t   [3 0]
i   [1 1]
c   [0 3]]
l
e
Solution
First, I create another sparse matrix pcm, in which I indicate per element [i,j] if person i is in cluster j. Result (when cast to dense matrix):
      pcm
    cluster
p  [[False  True]
e   [False  True]
r   [ True False]
s   [False  True]
o   [False  True]
n   [ True False]]
Next, I matrix multiply pam.T with pcm to get the matrix that I want.
Code
I wrote the following program to test the duration of this method in practice.
import numpy as np
from scipy.sparse.csr import csr_matrix
from timeit import timeit
def _clustering2pcm(clustering):
    '''
    Converts a clustering (np array) into a person-cluster matrix (pcm)
    '''
    N_persons = clustering.size
    m_person = np.arange(N_persons)
    clusters = np.unique(clustering)
    N_clusters = clusters.size
    m_data = [True] * N_persons
    pcm = csr_matrix( (m_data, (m_person, clustering)), shape = (N_persons, N_clusters))
    return pcm

def pam_clustering2acm():
    '''
    Convert a person-article matrix and a given clustering into an
    article-cluster matrix
    '''
    global clustering
    global pam
    pcm = _clustering2pcm(clustering)
    acm = csr_matrix.transpose(pam).dot(pcm).todense()
    return acm

if __name__ == '__main__':
    global clustering
    global pam
    N_persons = 200000
    N_articles = 400
    N_shoppings = 400000
    N_clusters = 20
    m_person = np.random.choice(np.arange(N_persons), size = N_shoppings, replace = True)
    m_article = np.random.choice(np.arange(N_articles), size = N_shoppings, replace = True)
    m_data = np.random.choice([1, 2], p = [0.99, 0.01], size = N_shoppings, replace = True)
    pam = csr_matrix( (m_data, (m_person, m_article)), shape = (N_persons, N_articles))
    clustering = np.random.choice(np.arange(N_clusters), size = N_persons, replace = True)
    print timeit(pam_clustering2acm, number = 100)
Timing
It turns out that for these 100 runs, I need 5.1 seconds. 3.6 seconds of these are spent on creating pcm. I have the feeling there could be a faster way to calculate this matrix without creating a temporary sparse matrix, but I don't see one without looping. Is there a faster way of construction?
EDIT
Following Martino's answer, I have tried to implement the loop-over-clusters and slicing algorithm, but it is even slower. It now takes 12.5 seconds to calculate acm 100 times, of which 4.1 seconds remain if I remove the line acm[:,i] = pam[p,:].sum(axis = 0).
def pam_clustering2acm_loopoverclusters():
    global clustering
    global pam
    N_articles = pam.shape[1]
    clusters = np.unique(clustering)
    N_clusters = clusters.size
    acm = np.zeros([N_articles, N_clusters])
    for i in clusters:
        p = np.where(clustering == i)[0]
        acm[:,i] = pam[p,:].sum(axis = 0)
    return acm
This is about 50x faster than your _clustering2pcm function:
def pcm(clustering):
    n = clustering.size
    data = np.ones((n,), dtype=bool)
    indptr = np.arange(n+1)
    return csr_matrix((data, clustering, indptr))
I haven't looked at the source code, but when you pass the CSR constructor the (data, (rows, cols)) structure, it is almost certainly using that to create a COO matrix, then converting it to CSR. Because your matrix is so simple, it is very easy to put the actual CSR matrix description arrays together as above, and skip all of that.
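For intuition, here is a toy example of what pcm builds (my illustration, not from the answer): with clustering = [1, 0, 1], every CSR row holds exactly one True, in the column of that person's cluster.
>>> pcm(np.array([1, 0, 1])).A
array([[False,  True],
       [ True, False],
       [False,  True]])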
This almost cuts your execution time down by three:
In [38]: %timeit pam_clustering2acm()
10 loops, best of 3: 36.9 ms per loop
In [40]: %timeit pam.T.dot(pcm(clustering)).A
100 loops, best of 3: 12.8 ms per loop
In [42]: np.all(pam.T.dot(pcm(clustering)).A == pam_clustering2acm())
Out[42]: True
I refer you to the scipy.sparse docs (http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix), where they say that row slicing is efficient (as opposed to column slicing), so it is probably better to stick to the non-transposed matrix. If you browse down, there is a sum method for which the axis can be specified. It is probably better to use the methods that come with your object, as they are likely to use compiled code. This is at the cost of looping through clusters (of which I am assuming there are not too many).
