How to save and load a dictionary of scipy sparse csr matrices? - python

I have a dict of scipy.sparse.csr_matrix objects as values, with integer keys. How can I save this in a separate file?
If I had a regular ndarray for each entry, then I could serialize it with json, but when I try this with a sparse matrix:
with open('filename.txt', 'w') as f:
    f.write(json.dumps(the_matrix))
I get a TypeError:
TypeError: <75x75 sparse matrix of type '<type 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Row format> is not JSON serializable
How can I save my dictionary with keys that are integers and values that are sparse csr matrices?

I faced this same issue trying to save a dictionary whose values are csr_matrix objects, and dumped it to disk using pickle. The file handle should be opened in "wb" mode:
import pickle

with open("csr_dict.pkl", "wb") as f:
    pickle.dump(csr_dict_obj, f)
Load the dict back using:
with open("csr_dict.pkl", "rb") as f:
    csr_dict = pickle.load(f)

Newer scipy versions have a scipy.sparse.save_npz function (and a corresponding load_npz). It saves the attributes of a sparse matrix to a numpy savez zip archive. In the case of a csr it saves the data, indices and indptr arrays, plus the shape.
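For the dict in the question, a minimal sketch built on save_npz is to write one file per integer key (the file-name scheme here is my own assumption, not part of scipy):
import scipy.sparse as sp

def save_csr_dict(csr_dict, prefix):
    # one .npz file per dict entry, the integer key encoded in the file name
    for key, mat in csr_dict.items():
        sp.save_npz('%s_%d.npz' % (prefix, key), mat)

def load_csr_dict(keys, prefix):
    # rebuild the dict given the same keys and prefix
    return {key: sp.load_npz('%s_%d.npz' % (prefix, key)) for key in keys}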
scipy.io.savemat can save a sparse matrix in a MATLAB compatible format (csc). There are one or two other scipy.io formats that can handle sparse matrices, but I haven't worked with them.
While a sparse matrix contains numpy arrays, it isn't an array subclass, so the numpy functions can't be used on it directly.
The pickle method for a numpy array is essentially its np.save, and an array that contains objects falls back to pickle (if possible). So a pickle of a dictionary of arrays should work.
The sparse dok format is a subclass of dict, so it might be picklable. It might even work with json, but I haven't tried it.
By the way, a plain numpy array can't be jsoned either:
In [427]: json.dumps(np.arange(5))
TypeError: array([0, 1, 2, 3, 4]) is not JSON serializable
In [428]: json.dumps(np.arange(5).tolist())
Out[428]: '[0, 1, 2, 3, 4]'
dok doesn't work either. The keys are tuples of indices,
In [433]: json.dumps(M.todok())
TypeError: keys must be a string
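If json output is really wanted, one workaround (my own sketch, not something tested above) is to stringify the tuple keys and record the shape alongside the values:
import json
from scipy.sparse import dok_matrix

def dok_to_json(M):
    # stringify the (row, col) tuple keys; .item() turns numpy scalars into
    # plain Python numbers so json can serialize them
    d = {'%d,%d' % (i, j): v.item() for (i, j), v in M.todok().items()}
    return json.dumps({'shape': M.shape, 'values': d})

def dok_from_json(s):
    obj = json.loads(s)
    M = dok_matrix(tuple(obj['shape']))   # dtype defaults to float here
    for key, v in obj['values'].items():
        i, j = map(int, key.split(','))
        M[i, j] = v
    return M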
MatrixMarket is a text format that handles sparse:
In [444]: io.mmwrite('test.mm', M)
In [446]: cat test.mm.mtx
%%MatrixMarket matrix coordinate integer general
%
1 5 4
1 2 1
1 3 2
1 4 3
1 5 4
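Reading it back is one call (my addition; mmread returns a coo_matrix, so convert if csr is needed):
from scipy import io
M2 = io.mmread('test.mm.mtx').tocsr()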

import numpy as np
from scipy.sparse import csr_matrix, issparse
import re

def save_sparse_csr(filename, **kwargs):
    # split each sparse matrix into its component arrays for np.savez
    arg_dict = dict()
    for key, value in kwargs.items():
        if issparse(value):
            value = value.tocsr()
            arg_dict[key + '_data'] = value.data
            arg_dict[key + '_indices'] = value.indices
            arg_dict[key + '_indptr'] = value.indptr
            arg_dict[key + '_shape'] = value.shape
        else:
            arg_dict[key] = value
    np.savez(filename, **arg_dict)

def load_sparse_csr(filename):
    loader = np.load(filename)
    new_d = dict()
    finished_sparse_list = []
    sparse_postfix = ['_data', '_indices', '_indptr', '_shape']
    for key, value in loader.items():
        is_sparse = False
        for postfix in sparse_postfix:
            if key.endswith(postfix):
                is_sparse = True
                key_original = re.match('(.*)' + postfix, key).group(1)
                if key_original not in finished_sparse_list:
                    value_original = csr_matrix(
                        (loader[key_original + '_data'],
                         loader[key_original + '_indices'],
                         loader[key_original + '_indptr']),
                        shape=loader[key_original + '_shape'])
                    # note: the matrices come back in lil format here
                    new_d[key_original] = value_original.tolil()
                    finished_sparse_list.append(key_original)
                break
        if not is_sparse:
            new_d[key] = value
    return new_d
You can write a wrapper as shown above.
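A hypothetical round trip with the integer-keyed dict from the question, reusing the imports above (the m-prefix is my own workaround, since savez needs string keyword names; note the loader hands back lil matrices):
csr_dict = {0: csr_matrix(np.eye(3)), 1: csr_matrix(np.ones((2, 2)))}
save_sparse_csr('matrices.npz', **{'m%d' % k: v for k, v in csr_dict.items()})
loaded = load_sparse_csr('matrices.npz')   # {'m0': 3x3 lil, 'm1': 2x2 lil}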

Related

give a key value to an np array

I have an array that is composed of multiple np arrays. I want to give every array a key and convert it to an HDF5 file:
arr = np.concatenate((Hsp_data, Hsp_rdiff, PosC44_WKS, PosX_WKS, PosY_WKS, PosZ_WKS,
RMS_Acc_HSp, RMS_Acc_Rev, RMS_Schall, Rev_M, Rev_rdiff, X_rdiff, Z_I, Z_rdiff, time), axis=1)
d1 = np.random.random(size=(7501, 15))
hf = h5py.File('data.hdf5', 'w')
hf.create_dataset('arr', data=d1)
hf.close()
hf = h5py.File('data.hdf5', 'r+')
print(hf.key)
This is what I have done so far, and I get this error: AttributeError: 'File' object has no attribute 'key'.
I want the final answer to be like this when printing the keys
<KeysViewHDF5 ['Hsp_M', 'Hsp_rdiff', 'PosC44_WKS', 'PosX_WKS', 'PosY_WKS', 'PosZ_WKS', 'RMS_Acc_HSp', 'RMS_Acc_Rev', 'RMS_Schall', 'Rev_M', 'Rev_rdiff', 'X_rdiff', 'Z_I', 'Z_rdiff']>
any ideas?
You/we need a clearer idea of how the original .mat is laid out. In h5py, the file is viewed as a nested set of groups, which are dict-like. Hence the use of keys(). At the ends of that nesting are datasets, which can be loaded (or saved from) as numpy arrays. The datasets/arrays don't have keys; it's the file and groups that have those.
Creating your file:
In [69]: import h5py
In [70]: d1 = np.random.random(size=(7501, 15))
...: hf = h5py.File('data.hdf5', 'w')
...: hf.create_dataset('arr', data=d1)
...: hf.close()
Reading it:
In [71]: hf = h5py.File('data.hdf5', 'r+')
In [72]: hf.keys()
Out[72]: <KeysViewHDF5 ['arr']>
In [73]: hf['arr']
Out[73]: <HDF5 dataset "arr": shape (7501, 15), type "<f8">
In [75]: arr = hf['arr'][:]
In [76]: arr.shape
Out[76]: (7501, 15)
'arr' is the name of the dataset that we created at the start. In this case there's no group; just the one dataset. [75] loads the dataset to an array which I called arr, but that name could be anything (like the original d1).
Arrays and datasets may have a compound dtype, which has named fields. I don't know if MATLAB uses those or not.
Without knowledge of the group and dataset layout in the original .mat, it's hard to help you. And when looking at datasets, pay particular attention to shape and dtype.
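A sketch of what the asker may be after, one named dataset per array (the names come from the question; the arrays themselves are stand-ins):
import h5py
import numpy as np

names = ['Hsp_M', 'Hsp_rdiff', 'PosC44_WKS']        # ... and so on
arrays = [np.random.random(7501) for _ in names]    # placeholder data

with h5py.File('data.hdf5', 'w') as hf:
    for name, arr in zip(names, arrays):
        hf.create_dataset(name, data=arr)           # one dataset per key

with h5py.File('data.hdf5', 'r') as hf:
    print(hf.keys())  # <KeysViewHDF5 ['Hsp_M', 'Hsp_rdiff', 'PosC44_WKS']>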

numpy savetxt: how to save an integer and a float numpy array into the same row of the file

I have a set of integers and a set of numpy arrays. I would like to use np.savetxt to store each integer and its corresponding array in the same row, with rows separated by \n.
In the txt file, each row should look like the following:
12345678 0.282101 -0.343122 -0.19537 2.001613 1.034215 0.774909 0.369273 0.219483 1.526713 -1.637871
The float numbers should be separated by spaces.
I tried to use the following code to solve this:
np.savetxt("a.txt", np.column_stack([ids, a]), newline="\n", delimiter=' ', fmt='%d %.06f')
But somehow I cannot figure out the correct formatting for integers and floats.
Any suggestions?
Please specify what a "set of integers" and "set of numpy arrays" are: from your example it looks as though ids is a list or 1d numpy array, and a is a 2d numpy array, but this is not clear from your question.
If you're trying to combine a list of integers with a 2d array, you should probably avoid np.savetxt and convert to a string first:
import numpy as np

ids = [1, 2, 3, 4, 5]
a = np.random.rand(5, 5)

with open("filename.txt", "w") as f:
    for each_id, row in zip(ids, a):
        line = "%d " % each_id + " ".join(format(x, "0.8f") for x in row) + "\n"
        f.write(line)
Gives the output in filename.txt:
1 0.38325380 0.80964789 0.83787527 0.83794886 0.93933360
2 0.44639702 0.91376799 0.34716179 0.60456704 0.27420285
3 0.59384528 0.12295988 0.28452126 0.23849965 0.08395266
4 0.05507753 0.26166780 0.83171085 0.17840250 0.66409724
5 0.11363045 0.40060894 0.90749637 0.17903019 0.15035594
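For completeness, a hedged savetxt-only variant, reusing ids and a from the snippet above: pass a per-column fmt list, '%d' for the id column and '%.6f' for the float columns. column_stack upcasts the ids to float, but '%d' truncates them back on output:
np.savetxt("a.txt", np.column_stack([ids, a]),
           fmt=['%d'] + ['%.6f'] * a.shape[1])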

Convert a list of numpy arrays to a 5D numpy array

I have a database of 7000 objects (list_of_objects); each one contains a numpy array of size 10x5x50x50x3. I would like to create a 5d numpy array of size 70000x5x50x50x3 (7000*10 along the first axis). I tried to do so using two for-loops. My sample code:
fnl_lst = []
for object in list_of_objects:
    my_array = read_array(object)  # size 10x5x50x50x3
    for ind in my_array:
        fnl_lst.append(ind)
fnl_lst = np.asarray(fnl_lst)  # print(fnl_lst.shape) -> (70000,)
That code ends up producing a nested numpy array which contains 70000 arrays, each of size 5x50x50x3. However, I would like instead to build a 5d array of size 70000x5x50x50x3. How can I do that?
fnl_lst = np.stack([ind for obj in list_of_objects for ind in read_array(obj)])
or, just append to the existing code:
fnl_lst = np.stack(fnl_lst)
UPD: by hpaulj's comment, if my_array is indeed 10x5x50x50x3, this might be enough:
fnl_lst = np.stack([read_array(obj) for obj in list_of_objects])
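If the target really is a 5-d array of shape (70000, 5, 50, 50, 3), the 6-d result of that stack can be collapsed along the first two axes (my addition, assuming the question's read_array returns shape (10, 5, 50, 50, 3)):
stacked = np.stack([read_array(obj) for obj in list_of_objects])  # (7000, 10, 5, 50, 50, 3)
fnl_arr = stacked.reshape(-1, *stacked.shape[2:])                 # (70000, 5, 50, 50, 3)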

Pandas/numpy array filling

I've a Pandas dataframe that I read from csv; it contains X and Y coordinates and a value that I need to put in a matrix and save to a text file. So, I created a numpy array with max(X) by max(Y) extension.
I have this file:
fid,x,y,agblongo_tch_alive
2368458,1,1,45.0126083457747
2368459,1,2,44.8996854102889
2368460,2,2,45.8565022933761
2358154,3,1,22.6352522929758
2358155,3,3,23.1935887499899
And I need this one:
45.01 44.89 -9999.00
-9999.00 45.85 -9999.00
22.63 -9999.00 23.19
To do that, I'm using a loop like this:
for row in data.iterrows():
    p[int(row[1][2]), int(row[1][1])] = row[1][3]
and then I save it to disk using np.array2string. It works.
As the original csv has 68M lines, it's taking a lot of time to process, so I wonder if there's another, more pythonic and fast way to do that.
Assuming the columns of your df are 'x', 'y', 'value', you can use advanced indexing
>>> x, y, value = data['x'].values, data['y'].values, data['value'].values
>>> result = np.zeros((y.max()+1, x.max()+1), value.dtype)
>>> result[y, x] = value
This will, however, not work properly if coordinates are not unique.
In that case it is safer (but slower) to use add.at:
>>> result = np.zeros((y.max()+1, x.max()+1), value.dtype)
>>> np.add.at(result, (y, x), value)
Alternatively, you can create a sparse matrix since your data happen to be in sparse coo format. Using the '.A' property you can then convert that to a normal (dense) array as needed:
>>> from scipy import sparse
>>> spM = sparse.coo_matrix((value, (y, x)), (y.max()+1, x.max()+1))
>>> (spM.A == result).all()
True
Update: if the fillvalue is not zero the above must be modified.
Method 1: replace second line with (remember this should only be used if coordinates are unique):
>>> result = np.full((y.max()+1, x.max()+1), fillvalue, value.dtype)
Method 2: does not work
Method 3: after creating spM do
>>> spM.sum_duplicates()
>>> assert spM.has_canonical_format
>>> spM.data -= fillvalue
>>> result2 = spM.A + fillvalue
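Putting Method 3 together on the question's sample data (a sketch of my own; I shift the 1-based coordinates to 0-based and index rows by x to match the desired output):
import numpy as np
from scipy import sparse

fillvalue = -9999.0
x = np.array([1, 1, 2, 3, 3]) - 1    # 1-based -> 0-based
y = np.array([1, 2, 2, 1, 3]) - 1
value = np.array([45.0126, 44.8997, 45.8565, 22.6353, 23.1936])

spM = sparse.coo_matrix((value, (x, y)), (x.max() + 1, y.max() + 1))
spM.sum_duplicates()
spM.data -= fillvalue
result = spM.toarray() + fillvalue   # rows: x, cols: y, gaps: -9999.0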

Save & Retrieve Numpy Array From String

I would like to convert a multi-dimensional Numpy array into a string and, later, convert that string back into an equivalent Numpy array.
I do not want to save the Numpy array to a file (e.g. via the savetxt and loadtxt interface).
Is this possible?
You could use np.tostring and np.fromstring:
In [138]: x = np.arange(12).reshape(3,4)
In [139]: x.tostring()
Out[139]: '\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n\x00\x00\x00\x0b\x00\x00\x00'
In [140]: np.fromstring(x.tostring(), dtype=x.dtype).reshape(x.shape)
Out[140]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
Note that the string returned by tostring does not save the dtype nor the shape of the original array. You have to re-supply those yourself.
Another option is to use np.save, np.savez or np.savez_compressed to write to an io.BytesIO object (instead of a file):
import numpy as np
import io
x = np.arange(12).reshape(3,4)
output = io.BytesIO()
np.savez(output, x=x)
The string is given by
content = output.getvalue()
And given the string, you can load it back into an array using np.load:
data = np.load(io.BytesIO(content))
x = data['x']
This method stores the dtype and shape as well.
For large arrays, np.savez_compressed will give you the smallest string.
Similarly, you could use np.savetxt and np.loadtxt:
import numpy as np
import io
x = np.arange(12).reshape(3,4)
output = io.BytesIO()
np.savetxt(output, x)
content = output.getvalue()
# '0.000000000000000000e+00 1.000000000000000000e+00 2.000000000000000000e+00 3.000000000000000000e+00\n4.000000000000000000e+00 5.000000000000000000e+00 6.000000000000000000e+00 7.000000000000000000e+00\n8.000000000000000000e+00 9.000000000000000000e+00 1.000000000000000000e+01 1.100000000000000000e+01\n'
x = np.loadtxt(io.BytesIO(content))
print(x)
Summary:
tostring gives you the underlying data as a string, with no dtype or shape
save is like tostring except it also saves dtype and shape (.npy format)
savez saves the array in npz format (uncompressed)
savez_compressed saves the array in compressed npz format
savetxt formats the array in a humanly readable format
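For completeness, the same io.BytesIO trick works with np.save / np.load, which keeps dtype and shape so nothing has to be re-supplied (a small sketch of my own):
import io
import numpy as np

x = np.arange(12).reshape(3, 4)
buf = io.BytesIO()
np.save(buf, x)                     # .npy format: data + dtype + shape
content = buf.getvalue()            # the byte string
x2 = np.load(io.BytesIO(content))
assert (x2 == x).all()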
If you want to save the dtype as well, you can also use Python's pickle module:
import pickle
import numpy as np

a = np.ones(4)
string = pickle.dumps(a)  # note: in Python 3 dumps returns bytes, not str
pickle.loads(string)
np.tostring and np.fromstring do NOT work anymore; they were replaced by tobytes and frombuffer, which deal in bytes rather than strings. To round-trip via a real string you can use ast.literal_eval, but ast.literal_eval() cannot handle very complex, deeply nested lists of lists when reading back.
Therefore, it is better to flatten the list of lists into a dict and dump that as a string; when loading the saved dump, ast.literal_eval() handles a dict-as-string in a better way. Convert the string back to a dict, then the dict back to an array:
import ast
import numpy as np

k = np.array([[[0.09898942, 0.22804536], [0.06109612, 0.19022354],
               [0.93369348, 0.53521671], [0.64630094, 0.28553219]],
              [[0.94503154, 0.82639528], [0.07503319, 0.80149062],
               [0.1234832, 0.44657691], [0.7781163, 0.63538195]]])
d = dict(enumerate(k.flatten(), 1))
d = str(d)  # dump as string (pickle and other packages produce bytes instead)
m = ast.literal_eval(d)  # convert the dict-as-str back to a dict
m = np.fromiter(m.values(), dtype=float)  # convert m to a flat ndarray
# note: the shape is lost; use m.reshape(k.shape) to restore it
# caveat (my addition): newer numpy versions repr scalars as np.float64(...),
# which literal_eval cannot parse; build d from k.flatten().tolist() if so
I use JSON to do that:
1. Encode to JSON
The first step is to encode it to JSON:
import json
import numpy as np
np_array = np.array(
[[[0.2123842 , 0.45560746, 0.23575005, 0.40605248],
[0.98393952, 0.03679023, 0.6192098 , 0.00547201],
[0.13259942, 0.69461942, 0.8781533 , 0.83025555]],
[[0.8398132 , 0.98341709, 0.25296835, 0.84055815],
[0.27619265, 0.55544911, 0.56615598, 0.058845 ],
[0.76205113, 0.18001961, 0.68206229, 0.47252472]]])
json_array = json.dumps(np_array.tolist())
print("converted to: " + str(type(json_array)))
print("looks like:")
print(json_array)
Which results in this:
converted to: <class 'str'>
looks like:
[[[0.2123842, 0.45560746, 0.23575005, 0.40605248], [0.98393952, 0.03679023, 0.6192098, 0.00547201], [0.13259942, 0.69461942, 0.8781533, 0.83025555]], [[0.8398132, 0.98341709, 0.25296835, 0.84055815], [0.27619265, 0.55544911, 0.56615598, 0.058845], [0.76205113, 0.18001961, 0.68206229, 0.47252472]]]
2. Decode back to Numpy
To convert it back to a numpy array you can use:
list_from_json = json.loads(json_array)
np.array(list_from_json)
print("converted to: " + str(type(list_from_json)))
print("converted to: " + str(type(np.array(list_from_json))))
print(np.array(list_from_json))
Which give you:
converted to: <class 'list'>
converted to: <class 'numpy.ndarray'>
[[[0.2123842 0.45560746 0.23575005 0.40605248]
[0.98393952 0.03679023 0.6192098 0.00547201]
[0.13259942 0.69461942 0.8781533 0.83025555]]
[[0.8398132 0.98341709 0.25296835 0.84055815]
[0.27619265 0.55544911 0.56615598 0.058845 ]
[0.76205113 0.18001961 0.68206229 0.47252472]]]
I like this method because the string is easy to read and, although in this case you didn't need to store it in a file, that can be done with this format as well.
