Save a csr_matrix and a numpy array in one file - python

I need to save a large sparse csr_matrix and a numpy array so that I can read them back later. Let X be the sparse csr_matrix and Y be the numpy array.
Currently I take the following slightly insane route.
from scipy.sparse import csr_matrix
import numpy as np

def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])

save_sparse_csr("file1", X)
np.save("file2", Y)
Then when I want to read them back in:
X = load_sparse_csr("file1.npz")
Y = np.load("file2.npy")
Two questions:
Is there a better way to save a csr_matrix than this?
Can I save both X and Y to the same file somehow? It seems crazy to have to make two files for this.

So you are saving the 3 array attributes of the csr along with its shape. And that is sufficient to recreate the array, right?
What's wrong with that? Even if you find a function that saves the csr for you, I bet it is doing the same thing - saving those same arrays.
The normal way in Python to save a class is to pickle it. But the class has to create the appropriate pickle method. numpy does that (essentially its save function). But as far as I know scipy.sparse has not provided that.
Since scipy.sparse has its roots in the MATLAB sparse code (and C/Fortran code developed for linear algebra problems), it can load/save using the loadmat/savemat functions. I'd have to double check, but I think they work with csc, the default MATLAB sparse format.
There are one or two other scipy.io functions that handle sparse matrices, but I haven't worked with those. They are formats for sharing sparse arrays among different packages working on the same problems (for example PDEs or finite elements). More than likely those formats will use a coo-compatible layout (data, rows, cols), either as 3 arrays, a csv with 3 columns, or a 2d array.
Mentioning the coo format raises another possibility: make a structured array with data, row, and col fields, and use np.save or even np.savetxt on it. I don't think it's any faster or cleaner than saving the csr attributes directly, but it does put all the data in one array (though the shape might still need a separate entry).
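For what it's worth, a minimal sketch of that coo-based idea (the helper names and field layout are mine, not from any library; the shape still has to travel separately):
import numpy as np
from scipy.sparse import coo_matrix

def to_record(sparse_x):
    # Pack the coo triplets into one structured array.
    coo = sparse_x.tocoo()
    rec = np.zeros(coo.nnz, dtype=[('data', coo.data.dtype),
                                   ('row', coo.row.dtype),
                                   ('col', coo.col.dtype)])
    rec['data'], rec['row'], rec['col'] = coo.data, coo.row, coo.col
    return rec

def from_record(rec, shape):
    # Rebuild the matrix from the triplets and convert back to csr.
    return coo_matrix((rec['data'], (rec['row'], rec['col'])), shape=shape).tocsr()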
You might also be able to pickle the dok format, since it is a dict subclass.
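On the second question: since np.savez accepts arbitrary keyword arrays, you can put the csr pieces and Y into one .npz archive. A rough sketch along the lines of your own functions (the names save_both/load_both are just illustrative):
import numpy as np
from scipy.sparse import csr_matrix

def save_both(filename, x_csr, y):
    # One .npz archive holding the csr components and the dense array together.
    np.savez(filename, data=x_csr.data, indices=x_csr.indices,
             indptr=x_csr.indptr, shape=x_csr.shape, y=y)

def load_both(filename):
    loader = np.load(filename)
    x = csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                   shape=loader['shape'])
    return x, loader['y']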

Related

How to avoid memory error when using np.kron to generate a big matrix

I am trying to build a matrix consisting of Kronecker products:
def kron_sparse_2(a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p):
    kron = sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(a,b),c),d),e),f),g),h),i),j),k),l),m),n),o),p)
    return kron

res = 0
for i in sd:
    res = res + kron_sparse_2(i, i, I, I, I, I, I, I, I, I, I, I, I, I, I, I)
The i's in sd are 2x2 matrices.
Is there anything I can do further to calculate this without the memory problem?
The error I get is: MemoryError: Unable to allocate 16.0 GiB for an array with shape (536870912, 2, 2) and data type float64
If I understand correctly, you are trying to form the Hamiltonian for some spin problem, and you should be able to go up to 20 spins with ease (if that is indeed the case, also try using np.roll and reduce to rewrite your methods more efficiently). You might try converting all of your matrices (even the 2x2 ones) to a sparse format (say csr or csc) and using scipy's kron function with format set to the sparse format you used to construct your matrices. As far as I remember, kron(format=None) uses an explicit (dense) representation of the matrices, which causes the memory problems; try format='csc' for instance, as in the sketch below.
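A rough sketch of that idea, with stand-ins for the question's variables (I as the 2x2 identity and sd as a small list of 2x2 matrices; functools.reduce just folds the pairwise kron over the 16 factors):
from functools import reduce
import numpy as np
from scipy import sparse

# Stand-ins for the question's variables: I is the 2x2 identity, sd a list of 2x2 matrices.
I = sparse.identity(2, format='csc')
sd = [sparse.csc_matrix(np.array([[0.0, 1.0], [1.0, 0.0]])),
      sparse.csc_matrix(np.array([[1.0, 0.0], [0.0, -1.0]]))]

def kron_chain(mats):
    # Fold sparse.kron over the factors, keeping every intermediate in csc format
    # so no dense intermediate is ever materialised.
    return reduce(lambda a, b: sparse.kron(a, b, format='csc'), mats)

res = sparse.csc_matrix((2**16, 2**16))
for m in sd:
    m = sparse.csc_matrix(m)
    res = res + kron_chain([m, m] + [I] * 14)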

When saving a numpy array of float arrays to a .npy file using numpy.save/numpy.load, is there any reason why the order of the arrays would change?

I currently have data where each row has a text passage and a numpy float array.
As far as I know, it's not efficient to save these two datatypes in one data format (correct me if I am wrong). So I am going to save them separately, with another column of ints that will be used to map the two datasets together when I want to join them again.
I am having trouble figuring out how to append a column of ints next to the float arrays (if anyone has a solution to that I would love to hear it) and then save the numpy array.
But then I realized I can just save the float arrays as is with numpy.save without the extra int column if I can get a confirmation that numpy.save and numpy.load will never change the order of the arrays.
That way I can just append the loaded numpy float arrays to the pandas dataframe as is.
Logically, I don't see any reason why the order of the rows would change, but perhaps there's some optimization compression that I am unaware of.
Would numpy.save or numpy.load ever change the order of a numpy array of float arrays?
The order will not be changed by numpy save/load. You are saving the numpy object as is, and an array is an ordered object.
Note: if you want to save multiple data arrays to the same file, you can use np.savez.
>>> np.savez('out.npz', f=array_of_floats, s=array_of_strings)
You can retrieve back each with the following:
>>> data = np.load('out.npz')
>>> array_of_floats = data['f']
>>> array_of_strings = data['s']

TypeError from hstack on sparse matrices

I have two csr sparse matrices. One contains the output of a sklearn.feature_extraction.text.TfidfVectorizer and the other was converted from a numpy array. I am trying to do a scipy.sparse.hstack on the two to increase my feature matrix, but I always get the error:
TypeError: 'coo_matrix' object is not subscriptable
Below is the code:
vectorizer = TfidfVectorizer(analyzer="char", lowercase=True, ngram_range=(1, 2), strip_accents="unicode")
ngram_features = vectorizer.fit_transform(df["strings"].values.astype(str))
list_other_features = ["entropy", "string_length"]
other_features = csr_matrix(df[list_other_features].values)
joined_features = scipy.sparse.hstack((ngram_features, other_features))
Both feature matrices are scipy.sparse.csr_matrix objects and I have also tried not converting other_features, leaving it as a numpy.array, but it results in the same error.
Python package versions:
numpy == 1.13.3
pandas == 0.22.0
scipy == 1.1.0
I cannot understand why it is complaining about a coo_matrix object in this case, especially when both matrices have been converted to csr_matrix. Looking at the scipy code, I understand it will not do any conversion if the input matrices are csr_matrix objects.
In its source code, scipy.sparse.hstack calls bmat, which converts the matrices into coo_matrix unless one of the fast-path cases applies.
Diagnosis
Looking at the scipy code I understand it will not do any conversion
if the input matrices are csr_matrix objects.
In bmat's source code, there are more conditions than just both matrices being csr_matrix before they avoid being turned into coo_matrix objects. Looking at the source code, one of the following two conditions needs to be met
# check for fast path cases
if (N == 1 and format in (None, 'csr') and all(isinstance(b, csr_matrix)
                                               for b in blocks.flat)):
    ...
elif (M == 1 and format in (None, 'csc')
      and all(isinstance(b, csc_matrix) for b in blocks.flat)):
    ...
before line 573, A = coo_matrix(blocks[i,j]), is reached.
Suggestion
To resolve the issue, I would suggest checking whether you meet the fast-path case for either csr_matrix or csc_matrix (the two conditions listed above). Please see the whole source code of bmat to gain a better understanding. If you do not meet the conditions, the matrices will be converted to coo_matrix.
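As a quick diagnostic (just a sketch, reusing the variable names from the question), you can check whether the csr fast path would apply before calling hstack:
from scipy.sparse import csr_matrix

blocks = (ngram_features, other_features)
for b in blocks:
    print(type(b), b.shape, b.dtype)

# hstack builds a single block row, so the fast path needs every block to be a
# csr_matrix (with format left as None or set to 'csr'); anything else is routed
# through coo_matrix(...), which is where the TypeError surfaces.
print(all(isinstance(b, csr_matrix) for b in blocks))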
It's a little unclear whether this error occurs in the hstack or after when you use the result.
If it's in the hstack you need to provide a traceback so we can see what's going on.
hstack, using bmat, normally collects the coo attributes of all inputs and combines them to make a new coo matrix. So regardless of the inputs (except in the special cases), the result will be coo. But hstack also accepts a format parameter.
Or you can add a .tocsr(). There's no extra cost if the matrix is already csr.
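Putting the two suggestions together, a minimal sketch (again reusing the question's variable names):
import scipy.sparse

# Ask hstack for csr output directly; tocsr() on an already-csr matrix is
# essentially free, so being explicit costs nothing.
joined_features = scipy.sparse.hstack(
    (ngram_features, other_features), format='csr')
# or: joined_features = scipy.sparse.hstack((ngram_features, other_features)).tocsr()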

Python - read 2d array from binary data

I am trying to read a 2d array of floats from a binary file with Python. The files have been written big-endian by a Fortran program (it is the intermediate file of the Weather Research and Forecasting model). I already know the dimension sizes of the array to read (nx & ny), but as a Fortran and IDL programmer I am completely lost as to how to manage it in Python. (Later on I want to visualize the array.)
Shall I use struct.unpack or numpy.fromfile or the array module?
Do I have to read first a vector and afterwards reshape it? (have seen this option only for the numpy-way)
How do I define a 2d array with numpy and how do I define the dtype to read with big-endian byte ordering?
Is there an issue with array ordering (column or row wise) to take into account?
Short answers per sub-question:
I don't think the array module has a way to specify endianness. Between the struct module and Numpy I think Numpy is easier to use, especially for Fortran-like ordered arrays.
All data is inherently 1-dimensional as far as the hardware (disk, RAM, etc.) is concerned, so yes, reshaping to get a 2D representation is always necessary. With numpy.fromfile the reshape must happen explicitly afterwards, but numpy.memmap provides a way to reshape more implicitly.
The easiest way to specify endianness with Numpy is to use a short type string, actually very similar to the approach needed for the struct module. In Numpy, >f and >f4 specify big-endian single precision and >d and >f8 big-endian double precision floating point.
Your binary file could walk the array along the rows (C-like) or along the columns (Fortran-like). Whichever of the two, this has to be taken into account to represent the data properly. Numpy makes this easy with the order keyword argument for reshape and memmap (among others).
All in all, the code could be for example:
import numpy as np

filename = 'somethingsomething'
with open(filename, 'rb') as f:
    nx, ny = ...  # parse; advance file-pointer to data segment
    data = np.fromfile(f, dtype='>f8', count=nx*ny)
    array = np.reshape(data, [nx, ny], order='F')
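If you prefer the more implicit memmap route mentioned above, something along these lines might work (header_bytes is a placeholder for however many bytes precede the data segment in your file):
import numpy as np

# Map the big-endian doubles lazily and view them as a Fortran-ordered 2D array.
array = np.memmap(filename, dtype='>f8', mode='r',
                  offset=header_bytes, shape=(nx, ny), order='F')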

Python: Add function to an array in a FOR loop

Maybe this is a simple issue, but I could not find any information about it so far.
For an optimization in numpy I need an array of functions. The number of functions I need depends on the current object being optimized.
I have already figured out how to create these functions dynamically, but now I would like to store them in an array like this:
myArray = zeros(x)
for i in range(x):
    myArray[i] = createFunction(i)
If I run this I get a type mismatch:
float() argument must be a string or a number, not 'function'
Creating the array directly works well:
myArray = array([createFunction(0)...])
But because I don't know the number of functions I need, this is exactly what I want to prevent.
Ah, I get it. You really do mean an array of functions.
The type mismatch error arises because the call to zeros creates an array of floats by default. So your original would work if instead you did myArray = numpy.empty(x, dtype=numpy.object) (note that empty makes more sense than zeros here). The slightly more pythonic version is to use a list comprehension
myArray = numpy.array([createFunction(i) for i in range(x)]).
But you might not need to create a numpy array at all, depending on what you want to do with it:
myArray = [createFunction(i) for i in range(x)]
If you want to avoid the list, it might be better to use numpy.fromfunction along with numpy.vectorize:
myArray = numpy.fromfunction(numpy.vectorize(createFunction),
                             shape=(x,), dtype=numpy.object)
where (x,) is a tuple giving the shape of the array. The call to vectorize is needed because fromfunction assumes that the function can work on an array of inputs and return an array of scalars, and vectorize converts a function to do exactly that. The dtype=object is needed since otherwise numpy tries to create an array of floats.
Maybe you can use
myArray = array([createFunction(i) for i in range(x)])
If you need an array of functions, is it possible to not use NumPy? NumPy arrays have C-style types and it defaults to float. If you can, just use a standard Python list. But if you absolutely must use NumPy, try defining the array like so:
import numpy as np
a = np.empty([x], dtype=np.dtype(np.object_))
Or however you need it to be with that dtype.
Numpy arrays are homogeneous. That is, all elements of a numpy array are of the same type -- python is duck-typed, numpy isn't. This is part of what makes matrix operations on numpy arrays and matrices so fast. However, because of this a data type must be known when the array is first created, and numpy is generally very good at inferring the data type.
The problem comes when creating an empty or zeroed array. Since there are no elements to examine, numpy must guess the data type, and it defaults to numpy.float64 if it isn't given a data type at array creation time. This is a decent choice, as numpy is typically used in scientific or engineering areas where floating point numbers are required. This is also why numpy is complaining -- because it can't store your functions as 64-bit floating point numbers.
The quick solution is to let numpy know the data type you want. eg.
myArray = numpy.zeros(x, dtype=numpy.object)
Note that the data type cannot be any class, but must be an instance of numpy.dtype (for advanced use you can create additional dtypes at runtime that numpy can then manipulate). For functions, numpy will store them as numpy.object (which means any generic python object). I do not think you will get any performance benefit from using numpy to store arrays of functions. Perhaps you would be better off creating generator functions and chaining them, converting to a numpy array once you know the result will be a number.
funcs = [createFunction(i) for i in xrange(x)]

def getItemFromEachFunction(i):
    return funcs[i]()

# vectorize + integer indices, since fromfunction passes a whole index array at once
arr = numpy.fromfunction(numpy.vectorize(getItemFromEachFunction), (x,), dtype=int)
