Memory-efficient way to make a large zeros matrix in Python

I am currently trying to make a really large matrix and am unsure how to do so in a memory-efficient way.
I was trying to use numpy, which worked fine for my smaller case (2750086 x 300).
However, I got a larger one, 2750086 x 1000, which is just too big for me to run.
I thought about making it out of ints, but I will add float values to it, so I am unsure how that could affect it.
I tried to find something about making a sparse zero-filled array, but couldn't find any great topics/questions/suggestions here or elsewhere.
Anyone got any good advice? I am currently using Python, so I am looking for a pythonic solution, but I am willing to try other languages.
Thanks.
edit:
Thanks for the advice. I've tried scipy.sparse.csr_matrix, which managed to create the matrix but greatly increased the time needed to go through it.
Here is roughly what I am doing:
matrix = scipy.sparse.csr_matrix((df.shape[0], 300))
# matrix = np.zeros((df.shape[0], 300))
for i, q in enumerate(df['column'].values):
    matrix[i, :] = function(q)
where function is essentially a vectorized operation on that row.
Now, if I run the loop on the np.zeros array, it finishes quite easily, in about 10 minutes.
But if I try to do the same with the scipy sparse matrix, it takes about 50 hours, which is not reasonable.
Any advice?
Edit 2:
scipy.sparse.lil_matrix did the trick.
The loop takes about 20 minutes and uses far less memory than np.zeros.
Thanks.
Edit 3:
Still too memory-expensive, so I decided not to store the data in a matrix at all. I now process one row at a time: compute the relevant value/metric from it, store that value in the original df, and move on to the next row.
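For reference, a minimal sketch of that row-at-a-time approach (the DataFrame, the column name, and function here are toy stand-ins for the real pipeline):

import numpy as np
import pandas as pd

# Toy stand-ins: in the real code, df and function come from the pipeline
df = pd.DataFrame({'column': ['a', 'bb', 'ccc']})
function = lambda q: np.full(300, len(q), dtype=float)

results = []
for q in df['column'].values:
    row = function(q)           # the row vector exists only inside the loop
    results.append(row.mean())  # reduce it to a single metric, then discard
df['metric'] = results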

Try scipy.sparse.csr_matrix:
import numpy as np
from scipy.sparse import csr_matrix

a = csr_matrix((2750086, 1000), dtype=np.int8)
Then a is
<2750086x1000 sparse matrix of type '<class 'numpy.int8'>'
with 0 stored elements in Compressed Sparse Row format>
For example, if you do:
import numpy as np
from scipy.sparse import csr_matrix

a = csr_matrix((5, 4), dtype=np.int8).todense()
print(a)
You get:
[[0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]]
Another option is to use scipy.sparse.lil_matrix:
a = scipy.sparse.lil_matrix((2750086, 1000), dtype=np.int8)
This seems to be more efficient for setting elements (like a[1,1]=2).
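For instance, a rough sketch of the row-wise fill pattern from the question using lil_matrix (the shape and the dense rows are made up for illustration):

import numpy as np
from scipy.sparse import lil_matrix

a = lil_matrix((1000, 300), dtype=np.float64)
for i in range(10):
    # LIL stores each row as a Python list, so row assignment is cheap;
    # the same assignment on a CSR matrix rebuilds its internal arrays
    a[i, :] = np.random.rand(300)
a = a.tocsr()  # convert once at the end for fast arithmetic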

Related

Fast way to construct a matrix in Python

I have been browsing through the questions and could find some help, but I prefer having confirmation by asking directly. So here is my problem.
I have a (numpy) array u of dimension N, from which I want to build a square N x N matrix k. Each matrix element is defined as k(i,j) = exp(-|u_i - u_j|^2).
My first naive way to do it was like this, which is, I believe, Fortran-like:
for i in range(N):
    for j in range(N):
        k[i][j] = np.exp(np.sum(-(u[i] - u[j])**2))
However, this is extremely slow. For N=1000, for example, it is taking around 15 seconds.
My other way to proceed is the following (inspired by other questions/answers):
i, j = np.ogrid[:N,:N]
k = np.exp(np.sum(-(u[i]-u[j])**2,axis=2))
This is way faster, as for N=1000, the result is almost instantaneous.
So I have two questions.
1) Why is the first method so slow, and why is the second one so fast?
2) Is there a faster way to do it? For N=10000, it already starts to take quite some time, so I really don't know if this was the "right" way to do it.
Thank you in advance!
P.S: the matrix is symmetric, so there must also be a way to make the process faster by calculating only the upper half of the matrix, but my question was more related to the way to manipulate arrays, etc.
First, a small remark: there is no need for np.sum if u is one-dimensional (e.g. if it could be rewritten as u = np.arange(N)), which seems to be the case since you wrote that it is of dimension N.
1) First question:
Element access through [] in Python is slow, so it is best to avoid it when you can. You also call np.exp and np.sum many times, whereas they can operate on whole vectors and matrices at once. Your second proposal is better because it computes k all at once instead of element by element.
2) Second question:
Yes, there is. Consider using only numpy functions and no indexing (around 3 times faster):
k = np.exp(-np.power(np.subtract.outer(u,u),2))
(NB: you can keep **2 instead of np.power, which is a bit faster but has lower precision.)
Edit (taking into account that u is an array of tuples):
With tuple data, it's a bit more complicated:
ma = np.subtract.outer(u[:,0],u[:,0])**2
mb = np.subtract.outer(u[:,1],u[:,1])**2
k = np.exp(-np.add(ma, mb))
You have to call np.subtract.outer twice, since calling it once on the full array would return a 4-dimensional array (and compute lots of useless data), whereas u[i] - u[j] returns a 3-dimensional array.
I used np.add instead of np.sum since it keeps the array dimensions.
NB: I checked with
N = 10000
u = np.random.random_sample((N,2))
It returns the same as your proposals, but 1.7 times faster.
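Putting it together, a self-contained sketch (with a smaller N so the cross-check runs quickly):

import numpy as np

N = 1000
u = np.random.random_sample((N, 2))

# Squared outer differences per coordinate, summed, then exponentiated
ma = np.subtract.outer(u[:, 0], u[:, 0]) ** 2
mb = np.subtract.outer(u[:, 1], u[:, 1]) ** 2
k = np.exp(-(ma + mb))

# Cross-check against the ogrid version from the question
i, j = np.ogrid[:N, :N]
k_ref = np.exp(np.sum(-(u[i] - u[j]) ** 2, axis=2))
assert np.allclose(k, k_ref)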

Scipy sparse matrices element wise multiplication

I am trying to do an element-wise multiplication of two large sparse matrices. Both are around 400K x 500K in size, with around 100M elements.
However, they might not have non-zero elements in the same positions, and they might not have the same number of non-zero elements. In either case, I'm okay with the product of a non-zero value in one matrix and a zero value in the other multiplying to zero.
I keep running out of memory (8GB) with every approach, which doesn't make much sense; I shouldn't be. This is what I've tried.
A and B are sparse matrices (I've tried the COO and CSC formats).
# I have loaded sparse matrices A and B, and have a file opened in write mode
row, col = A.nonzero()
index = zip(row, col)
del row, col
for i, j in index:
    # Approach 1
    A[i, j] *= B[i, j]
    # Approach 2
    someopenfile.write(' '.join([str(i), str(j), str(A[i, j] * B[i, j]), '\n']))
    # Approach 3
    if B[i, j] != 0:
        A[i, j] = A[i, j] * B[i, j]  # or, I wrote it to a file instead,
                                     # like in approach 2
If I comment out the for loop, I see that I use almost 3.5GB of memory. But the moment I use the loop, whether I'm writing the products to a file or back into a matrix, memory usage shoots up to the full amount, forcing me to stop the execution or hanging the system. How can I do this operation without consuming so much memory?
I suspect that your sparse matrices are becoming non-sparse when you perform the operation. Have you tried just:
A.multiply(B)
I suspect it will be better optimised than anything you can easily write yourself.
If A is not already the correct type of sparse matrix, you might need:
A = A.tocsr()
# May also need
# B = B.tocsr()
A = A.multiply(B)
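For example, a small sketch showing that the element-wise product stays sparse and handles mismatched non-zero positions:

import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.array([[1, 0], [0, 2]]))
B = csr_matrix(np.array([[3, 4], [0, 0]]))

# Positions that are non-zero in only one matrix come out as zero
C = A.multiply(B)
print(C.toarray())  # [[3 0]
                    #  [0 0]]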

Python space+time efficient Data Structure to store 2D Bit Arrays

I want to create a 2D binary (bit) array in Python in a space- and time-efficient way, as my 2D bit array would be around 1 million (rows) x 50000 (columns of 0s or 1s), and I would also be performing bitwise operations over these huge elements. My array would look something like:
0 1 0 1
1 1 1 0
1 0 0 0
...
In C++ the most space-efficient way for me would be to create a kind of array of integers where each element represents 32 bits, and then use the shift operators coupled with bitwise operators to carry out operations.
Now, I know that there is a bitarray module in Python, but I am unable to create a 2D structure using a list of bit-arrays. How can I do this?
Another way I know of in C++ would be to create a map, something like map<id, vector<int>>, and then manipulate the vectors as mentioned above. Should I use the dictionary equivalent in Python?
Even if you just suggest a way to use bitarray for this task, it would also be great to know whether I can have multiple threads operate on slices of the bitarray so that I can make it multithreaded. Thanks for the help!
EDIT:
I could even create my own data structure for this if need be; I just wanted to check before reinventing the wheel.
As per my comment, you may be able to use sets:
0 1 0 1
1 1 1 0
1 0 0 0
can be represented as
set([(1,0), (3,0), (0,1), (1,1), (2, 1), (0,2)])
or
{(1,0), (3,0), (0,1), (1,1), (2, 1), (0,2)}
AND is equivalent to the intersection of two sets; OR is the union of the two sets.
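A minimal sketch of that representation (each pair is the (x, y) position of a 1 bit; b is a second, made-up grid):

# The example grid above, stored as the coordinates of its 1s
a = {(1, 0), (3, 0), (0, 1), (1, 1), (2, 1), (0, 2)}
b = {(1, 0), (0, 1), (2, 2)}

both = a & b    # AND: positions set in both grids
either = a | b  # OR: positions set in either grid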
How about the following:
In [11]: from bitarray import bitarray
In [12]: arr = [bitarray(50) for i in range(10)]
This creates a 10x50 bit array, which you can access as follows:
In [15]: arr[0][1] = True
In [16]: arr[0][1]
Out[16]: True
Bear in mind that a 1Mx50K array would require about 6GB of memory (and a 64-bit build of Python on a 64-bit OS).
whether I can have multiple threads operate on slices of the bitarray so that I can make it multithreaded
That shouldn't be a problem, with the usual caveats. Bear in mind that due to the GIL you are unlikely to achieve performance improvements through multithreading.
Can you use numpy?
>>> import numpy
>>> A = numpy.zeros((50000, 1000000), dtype=bool)
EDIT: This doesn't seem to be the most space-efficient option: it uses 50GB (1 byte per bool). Does anyone know if numpy has a way to use packed bools?
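One possibility along those lines is np.packbits, which packs eight bools into each byte for an 8x saving; a small sketch:

import numpy as np

row = np.zeros(50000, dtype=bool)
row[7] = True
packed = np.packbits(row)         # 6250 bytes instead of 50000
restored = np.unpackbits(packed)  # back to an array of 0/1 values

# Bitwise AND/OR work directly on the packed bytes
other = np.packbits(np.ones(50000, dtype=bool))
both = packed & other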

Issue converting Matlab sparse() code to numpy/scipy with csc_matrix()

I'm a bit of a newbie to both Matlab and Python, so many apologies if this question is a bit dumb...
I'm trying to convert some Matlab code over to Python using numpy and scipy, and things were going fine until I reached the sparse matrix that someone wrote. The Matlab code goes like:
unwarpMatrix = sparse(phaseOrigin, ceil([1:nRead*nSlice*nPhaseDmap]/expan), 1, numPoints, numPoints)/expan;
Here's my python code (with my thought process) leading up to my attempt at conversion. For a given dataset I was testing with (in both Matlab and Python):
nread = 64
nslice = 28
nphasedmap = 3200
expan = 100
numpoints = 57344
Thus, the length of the phaseorigin, s, and j arrays is 5734400 (and I've confirmed that the functions that create my phaseorigin array output exactly the same result that Matlab does).
#Matlab sparse takes: S = sparse(i,j,s,m,n)
#Generates an m by n sparse matrix such that: S(i(k),j(k)) = s(k)
#scipy csc matrix takes: csc_matrix((data, ij), shape=(M, N))
#Matlab code is: unwarpMatrix = sparse(phaseOrigin, ceil([1:nRead*nSlice*nPhaseDmap]/expan), 1, numPoints, numPoints)/expan;
size = nread*nslice*nphasedmap
#i would be phaseOrigin variable
j = np.ceil(np.arange(1,size+1, dtype=np.double)/expan)
#Matlab apparently treats '1' as a scalar so I should be tiling 1 to the same size as j and phaseorigin
s = np.tile(1,size)
unwarpmatrix = csc_matrix((s,(phaseorigin, j)), shape=(numpoints,numpoints))/expan
so when I try to run my python code I get:
ValueError: column index exceedes matrix dimensions
This doesn't occur when I run the Matlab code even though the array sizes are larger than the defined matrix size...
What am I doing wrong? I've obviously screwed something up... Thanks very much in advance for any help!
The problem is: Python indices start from 0, whereas Matlab indices start from 1. So for an array of size 57344, in Python the first element would be arr[0] and the last element arr[57343].
Your variable j has values from 1 to 57344. You probably see the problem now. Creating your j like this would solve the problem:
j = np.floor(np.arange(0,size, dtype=np.double)/expan)
Still, better to check this before using...
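For context, a sketch of the full zero-based construction (phaseorigin is a placeholder here; in the real code it comes from the earlier processing and must be zero-based as well):

import numpy as np
from scipy.sparse import csc_matrix

nread, nslice, nphasedmap = 64, 28, 3200
expan, numpoints = 100, 57344
size = nread * nslice * nphasedmap

phaseorigin = np.zeros(size, dtype=np.int64)  # placeholder row indices

j = np.floor(np.arange(size, dtype=np.double) / expan).astype(np.int64)
s = np.ones(size)
unwarpmatrix = csc_matrix((s, (phaseorigin, j)), shape=(numpoints, numpoints)) / expan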

Matlab to Python sparse matrix conversion , overcoming the zero index problem

I have an N x N sparse matrix in Matlab whose cell values are indexed by (r,c) pairs such that r and c are unique IDs.
The problem is that after converting this matrix to Python, all of the index values are decremented by 1.
For example:
Before                After
(210058,10326) = 1    (210057,10325) = 1
Currently, I am doing the following to counter this:
from scipy import io, sparse

mat_contents = io.loadmat(filename)
G = mat_contents['G']
I, J = G.nonzero()
I += 1
J += 1
V = G.data
G = sparse.csr_matrix((V, (I, J)))
I have also tried different options of scipy.io.loadmat (matlab_compatible, mat_dtype), but neither worked.
I am looking for a solution that will give me the same indices as the Matlab matrix. Solutions that do not require reconstructing the matrix would be ideal, but I am also curious how others have gotten around this problem.
Thank you all for the good advice.
I decided to stick with Python. I do most of my data transfers between Matlab and Python using text files now.
