Issue converting Matlab sparse() code to numpy/scipy with csc_matrix() - python

I'm a bit of a newbie to both Matlab and Python, so many apologies if this question is a bit dumb...
I'm trying to convert some Matlab code over to Python using numpy and scipy and things were going fine until I reached the sparse matrix that someone wrote. The Matlab code goes like:
unwarpMatrix = sparse(phaseOrigin, ceil([1:nRead*nSlice*nPhaseDmap]/expan), 1, numPoints, numPoints)/expan;
Here's my python code (with my thought process) leading up to my attempt at conversion. For a given dataset I was testing with (in both Matlab and Python):
nread = 64
nslice = 28
nphasedmap = 3200
expan = 100
numpoints = 57344
Thus, the phaseorigin, s, and j arrays each have length 5,734,400 (and I've confirmed that the functions creating my phaseorigin array output exactly the same result as Matlab does)
#Matlab sparse takes: S = sparse(i,j,s,m,n)
#Generates an m by n sparse matrix such that: S(i(k),j(k)) = s(k)
#scipy csc matrix takes: csc_matrix((data, ij), shape=(M, N))
#Matlab code is: unwarpMatrix = sparse(phaseOrigin, ceil([1:nRead*nSlice*nPhaseDmap]/expan), 1, numPoints, numPoints)/expan;
size = nread*nslice*nphasedmap
#i would be phaseOrigin variable
j = np.ceil(np.arange(1,size+1, dtype=np.double)/expan)
#Matlab apparently treats '1' as a scalar so I should be tiling 1 to the same size as j and phaseorigin
s = np.tile(1,size)
unwarpmatrix = csc_matrix((s,(phaseorigin, j)), shape=(numpoints,numpoints))/expan
so when I try to run my python code I get:
ValueError: column index exceedes matrix dimensions
This doesn't occur when I run the Matlab code even though the array sizes are larger than the defined matrix size...
What am I doing wrong? I've obviously screwed something up... Thanks very much in advance for any help!

The problem is that Python indices start from 0, whereas Matlab indices start from 1. So for an array of size 57344, in Python the first element would be arr[0] and the last element would be arr[57343].
Your variable j has values from 1 to 57344. You probably see the problem. Creating your j like this would solve it:
j = np.floor(np.arange(0,size, dtype=np.double)/expan)
Still, better to check this before using...
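Putting that together, here is a minimal sketch of the fully 0-based construction (assuming your phaseorigin array currently holds Matlab-style 1-based row indices):

import numpy as np
from scipy.sparse import csc_matrix

size = nread * nslice * nphasedmap
# 0-based column indices: floor(k / expan) for k = 0 .. size-1
j = np.floor(np.arange(size, dtype=np.double) / expan).astype(int)
# shift the 1-based Matlab row indices down by one
i = phaseorigin - 1
s = np.ones(size)
unwarpmatrix = csc_matrix((s, (i, j)), shape=(numpoints, numpoints)) / expan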

Related

avoid for loop in python normal(size={})

My goal is to create an array where each element is a normal(size=n) draw, for each element n of it.
I am trying to optimize:
it = 2 ** arange(6, 25)
M = zeros(len(it))
for x in range(len(it)):
    M[x] = normal(size=it[x])
These are my attempts so far, which do not work:
N = zeros(len(it))
it = 2 ** arange(6, 25)
N = (normal(size=it))
Further I tried:
N = (normal(size=it[:]))
Given my data, I believe that such manual work, or a for loop, is really inefficient, so I am trying to come up with vectorized operations.
I receive:
File "mtrand.pyx", line 1335, in numpy.random.mtrand.RandomState.normal
File "common.pyx", line 557, in numpy.random.common.cont
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
You've not been very precise about where these functions are coming from, but I'm guessing that by normal(size=it[:]) you mean:
import numpy as np
it = 2 ** np.arange(6, 25)
np.random.normal(size=it)
which would be telling numpy to create a 19-dimensional array (i.e. len(it)) that contains 6 × 10^85 elements (i.e. np.prod(it.astype(float)), computed as a float because the number overflows an int64). numpy is saying that it can't do that, which seems like a reasonable thing to do.
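You can check what size=it would actually be asking for by just computing the numbers, without allocating anything:

import numpy as np

it = 2 ** np.arange(6, 25)
print(it.size)                    # 19 -> the requested array would be 19-dimensional
print(np.prod(it.astype(float)))  # ~6.2e+85 total elements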
Numpy doesn't like the "ragged arrays" you're trying to create, neither do most matrix/numeric libraries, hence support is limited!
I'm unsure why you consider the loop to be "really inefficient". You're creating ~33 million floats over 19 iterations of a simple Python loop. The vast majority of the time will be spent in highly optimised Numpy library code, and a tiny (basically unmeasurable) amount of time will be spent evaluating your Python bytecode.
If you really want a one-liner then you can do:
X = [np.random.normal(size=2**i) for i in range(6, 25)]
which makes the split between Numpy and Python worlds more obvious.
Note that on my laptop, the Python code executes in ~5µs while the Numpy code runs for ~800ms. So you're trying to optimise the 0.0006% part!
Note that it's not always a win to use Numpy's vectorization, it only helps with larger arrays, for example the above loop is "faster" than:
X = [np.random.normal(size=i) for i in 2**np.arange(6, 25)]
4.8 vs 5.1 µs for the Python code, because of the time spent marshalling objects into/out of the Numpy world. Again, none of this matters; just use whichever solution makes your code easier to understand. A few microseconds is nothing compared to seconds.

Python sparse matrix in Cplex?

I am working on a large quadratic programming problem. I would like to feed in the Q matrix defining the objective function into IBM's Cplex using Python API. The Q matrix is built using scipy lil matrix because it is sparse. Ideally, I would like to pass the matrix onto Cplex. Does Cplex accept scipy lil matrix?
I can convert Q to the list-of-lists format which Cplex accepts; let's call it qMat. But the size of qMat becomes too large and the machine runs out of memory (even with 120 GB).
Below is my work in progress code. In the actual problem n is around half a million, and m is around 5 million. In the actual problem Q is given and not randomly assigned as in the problem below.
from __future__ import division
import numpy as np
import cplex
import sys
import random
from scipy import sparse

n = 10
m = 5

def create():
    Q = sparse.lil_matrix((n, n))
    nums = random.sample(range(0, n), m)
    for i in nums:
        for j in nums:
            a = random.uniform(0, 1)
            Q[i, j] = a
            Q[j, i] = a
    return Q

def convert(Q):
    qMat = [[[], []] for _ in range(n)]
    for k in xrange(n-1):
        qMat[k][0] = Q.rows[k]
        qMat[k][1] = Q.data[k]
    return qMat

Q = create()
qMat = convert(Q)
my_prob = cplex.Cplex()
my_prob.objective.set_quadratic(qMat)
If n = 500000 and m = 5000000, then that is 2.5e12 non-zeroes. For each of these you'd need roughly one double for the non-zero value and one CPXDIM for the index. That is 8+4=12 bytes per non-zero. This would give:
>>> print(2.5e12 * 12 / 1024. / 1024. / 1024.)
27939.6772385
Roughly 28 TB of memory (the printed figure is in GB)! It's not clear exactly how many non-zeroes you plan on having, but using this calculation you can easily find out whether what you're asking is even possible.
As mentioned in the comments, the CPLEX Python API does not accept scipy lil matrices. You could try docplex, which is numpy friendly, or you could even try generating an LP file directly.
Using something like the following is probably your best bet in terms of reducing the conversion overhead:
my_prob.objective.set_quadratic(list(zip(Q.rows, Q.data)))
or
my_prob.objective.set_quadratic([[row, data] for row, data in zip(Q.rows, Q.data)])
At any rate, you should play with these to see what gives the best performance (in terms of speed and memory).
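For what it's worth, here is a rough sketch of the zip-based route on the toy problem above. The variable names are made up, and I'm assuming the variables have to be added to the model before the quadratic objective can refer to them:

Q = create()
my_prob = cplex.Cplex()
# columns must exist before the quadratic objective is set
my_prob.variables.add(names=["x{0}".format(i) for i in range(n)])
# lil_matrix already stores per-row index lists (.rows) and value lists (.data),
# which is essentially the [indices, values] pairs that set_quadratic expects
my_prob.objective.set_quadratic(list(zip(Q.rows, Q.data)))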

Numpy - Creating a function whose inputs are two matrices and whose output is a binary matrix in Python

I am trying to write a function that takes two m x n matrices as input and gives a binary matrix as output, where an element (m, n) becomes 0 if it is less than zero and 1 otherwise. I want these binary values to replace the original values in an array or matrix format. Here is my code, which has produced errors so far:
def rdMatrix(mat1, mat2):
    mat3 = np.dot(mat1, mat2)
    arr = [[]]
    k = mat3(m,n)
    for k in mat3:
        if k < 0:
            arr.append[0]
        else:
            arr.append[1]
I am having difficulty telling the function to map a new value to each element of the matrix and then store it in an array. I'm also having trouble defining what a specific element m, n is inside the for loop. I am new to programming, so please forgive me for any obvious mistakes or errors that would easily fix this function. Also, please let me know if anything needs clarification.
Any help is greatly appreciated, thank you!
This is NumPy, so you can obtain binary matrices using comparison operators.
For example, your code can be implemented very simply as
mat3 = np.dot(mat1, mat2)
return mat3 >= 0
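If you want actual 0/1 integers rather than booleans, here is a small self-contained sketch (the function name is just illustrative):

import numpy as np

def rd_matrix(mat1, mat2):
    # multiply, then compare elementwise; astype(int) turns the boolean
    # result into the 0/1 matrix described in the question
    mat3 = np.dot(mat1, mat2)
    return (mat3 >= 0).astype(int)

a = np.array([[1.0, -2.0], [3.0, 4.0]])
b = np.array([[-1.0, 0.5], [2.0, -1.0]])
print(rd_matrix(a, b))  # [[0 1]
                        #  [1 0]]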

Numpy Slicing slow?

Hi, I am doing scientific computing using numpy + numba.
I've realized that in-place numpy array addition is very slow compared to Matlab.
here is the matlab code:
tic;
% A,B are 2-d matrices, ind may not be distinct
for ii=1:N
    A(ind(ii),:) = A(ind(ii),:) + B(ii,:);
end
toc;
and here is the numpy code:
s = time.time()
# A,B are numpy.ndarray, ind may not be distinct
for k in xrange(N):
    A[ind[k],:] += B[k,:]
print time.time() - s
The result shows that the numpy code is 10x slower than Matlab, which confuses me a lot.
Moreover, when I pull the addition out of the for loop and just compare a single matrix addition with numpy.add, numpy and Matlab seem comparable in speed.
One factor I know of is that Matlab uses a JIT (for versions >= 2012a) to speed up for loops, but when I tried numba on the Python code it did not speed things up even a bit. I think this is because numba never touches the numpy.add call, hence the performance does not change at all.
I am guessing that Matlab does some sick caching for this case, hence it beats numpy dramatically.
Any suggestion on how to speed up numpy ?
Try
A[ind] += B[:N]
i.e. without any loop.
If ind could have duplicate elements, you can use np.add.at:
np.add.at(A, ind, B[:N])
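To see why np.add.at matters when ind has duplicates, here is a tiny made-up example:

import numpy as np

A = np.zeros((3, 2))
B = np.ones((4, 2))
ind = np.array([0, 0, 1, 2])   # index 0 appears twice

A1 = A.copy()
A1[ind] += B            # buffered: row 0 only picks up one of the two updates
np.add.at(A, ind, B)    # unbuffered: row 0 accumulates both updates
print(A1[0], A[0])      # [1. 1.] [2. 2.]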
Here's a version that uses dot matrix multiplication. It constructs a matrix of 1s and 0s from ind.
def bar(A, B, ind):
    K, M = B.shape
    N, M = A.shape
    I = np.zeros((N, K))
    I[ind, np.arange(K)] = 1
    return A + np.dot(I, B)
For a problem with sizes like K,M,N = 30,14,15 this is about 3x faster. But for larger ones like K,M,N = 300,100,150 it's a bit slower.
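A quick sanity check with small, made-up sizes shows that bar (defined above) matches the np.add.at result:

import numpy as np

K, M, N = 5, 3, 4
A = np.arange(N * M, dtype=float).reshape(N, M)
B = np.ones((K, M))
ind = np.array([0, 0, 1, 3, 3])

expected = A.copy()
np.add.at(expected, ind, B)
print(np.allclose(bar(A, B, ind), expected))  # True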

Matlab to Python sparse matrix conversion , overcoming the zero index problem

I have an N x N sparse matrix in Matlab, that has cell values indexed by (r,c) pairs such that r and c are unique id's.
The problem is, that after converting this matrix into Python, all of the indices values are decremented by 1.
For example:
Before: (210058, 10326) = 1
After:  (210057, 10325) = 1
Currently, I am doing the following to counter this:
from scipy import io, sparse

mat_contents = io.loadmat(filename)
G = mat_contents['G']
I, J = G.nonzero()
I += 1
J += 1
V = G.data
G = sparse.csr_matrix((V, (I, J)))
I have also tried using different options in scipy.io.loadmat {matlab_compatible, mat_dtype}, but neither worked.
I am looking for a solution that will give me the same indices as the Matlab matrix. Solutions that do not require reconstructing the matrix would be ideal, but I am also curious how others have gotten around this problem.
Thank you all for the good advice.
I decided to stick with Python. I do most of my data transfers between Matlab and Python using text files now.
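For reference, one possible shape of that text-file workflow (assuming Matlab exports (row, col, value) triplets, e.g. with [r, c, v] = find(G); dlmwrite('G.txt', [r c v], 'delimiter', '\t')):

import numpy as np
from scipy import sparse

r, c, v = np.loadtxt('G.txt', unpack=True)
# build the matrix with the original 1-based Matlab ids intact
# (row 0 and column 0 are simply left empty)
G = sparse.csr_matrix((v, (r.astype(int), c.astype(int))))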
