error while saving large matrix using scipy.io.savemat - python

I want to save a large matrix of 20 GB in MATLAB (.mat) format using the scipy.io.savemat function. While saving, it fails with the following error:
scipy.io.savemat matrix too large to save with Matlab 5 format
My code is:
scipy.io.savemat('output.mat',mdict={'data':data})
I hope experts can suggest a way to overcome this problem. Thanks in advance.

Yes, I agree with @hpaulj. I think there is no way around saving it in chunks smaller than 20 GB... Is that at all possible for you? You could orient the chunking on the layout of your data structure, e.g. if it is a 3-dimensional matrix, save slices along the 3rd axis, as in the sketch below.
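A minimal sketch of that chunking idea, assuming data is a 3-D NumPy array; the file naming scheme and the 'data_slice' key are made up for illustration:

import numpy as np
import scipy.io

data = np.random.rand(100, 100, 50)  # stand-in for the real 20 GB array

# Save one 2-D slice per .mat file so each file stays well under the v5 size limit.
for k in range(data.shape[2]):
    scipy.io.savemat('data_slice_%03d.mat' % k,
                     mdict={'data_slice': data[:, :, k]})

On the MATLAB side the slices can then be loaded in a loop and concatenated along the 3rd dimension.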

Related

How do I store a multidimensional array?

I am trying to write code in Python that will display the trajectory of a projectile on a 2D graph. The initial velocity and launch angle will vary. Instead of calculating it every time, I was wondering if there is any way to create a data file that will store all the coordinate values for each of those combinations of speed and launch angle. That would be a 4-dimensional database. Is this even possible?
This sounds like a pretty ideal case for using CSV as your file format. It's not so much a "4-dimensional" database as a "4-column" one:
initial_velocity, launch_angle, end_x, end_y
which you can write out and read in easily, using either the standard library's csv module or pandas' read_csv(). A sketch with the csv module follows.
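A minimal sketch under assumed inputs: the parameter grid, the flat-ground range formula for end_x, and the file name trajectories.csv are illustrative choices, not part of the question.

import csv
import math

g = 9.81  # gravitational acceleration, m/s^2

with open('trajectories.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['initial_velocity', 'launch_angle', 'end_x', 'end_y'])
    for v0 in range(10, 51, 10):           # initial speeds in m/s (assumed grid)
        for angle in range(15, 76, 15):    # launch angles in degrees (assumed grid)
            theta = math.radians(angle)
            end_x = v0 ** 2 * math.sin(2 * theta) / g  # horizontal range on flat ground
            end_y = 0.0                                # projectile lands at launch height
            writer.writerow([v0, angle, end_x, end_y])

Reading it back is then a one-liner with pandas.read_csv('trajectories.csv').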
I think you should look at the HDF5 format, which was designed to handle big data (it is used at NASA) and has proven itself in very large-scale applications.
From the website:
HDF5 lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
I would add that NumPy itself is designed to work with multidimensional arrays very efficiently. A minimal sketch with h5py is below. Good luck!
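A minimal sketch with h5py (one common Python interface to HDF5); the file and dataset names, array size, and gzip compression are assumptions:

import h5py
import numpy as np

data = np.random.rand(256, 256, 256)  # stand-in for the real data

# Write the array once, compressed, into a single HDF5 dataset.
with h5py.File('trajectories.h5', 'w') as f:
    f.create_dataset('data', data=data, compression='gzip')

# Later: slice straight from disk without loading the whole dataset into memory.
with h5py.File('trajectories.h5', 'r') as f:
    block = f['data'][0:10, :, :]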

Python: How to write a 2-D array to a binary file with precision (uint8, uint16)?

I'm hoping someone can explain to me how to write binary output to a file.
I know that in MATLAB it is pretty simple, as it has the fwrite function, which has the following format: fwrite(fileID, A, precision).
However, I'm not sure how to translate this to Python. I currently have the data stored in a 2-D NumPy array and would like to write it to a file the same way MATLAB does.
I have attempted it with the following:
# let Y be the numpy matrix
with open(filepath, 'wb') as FileToWrite:
    np.array(Y.shape).tofile(FileToWrite)
    Y.T.tofile(FileToWrite)
However, this didn't lead to the desired result. The file size is way too large and the data is incorrectly formatted.
Ideally I should be able to specify the data format to be uint8 or uint16 as well.
Any help would be massively appreciated.
So, I figured it out. The solution is as follows.
# let Y be the numpy matrix
with open(filepath, 'wb') as FileToWrite:
    np.asarray(Y, dtype=np.uint8).tofile(FileToWrite)
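A hedged follow-up sketch showing the same idea with uint16 and how to read the file back with numpy.fromfile; the file name is a placeholder, and the dtype must match on both sides, just like the precision argument of MATLAB's fwrite/fread:

import numpy as np

Y = np.arange(12, dtype=np.float64).reshape(3, 4)  # example matrix

# Write the values as 16-bit unsigned integers.
with open('out_uint16.bin', 'wb') as f:
    np.asarray(Y, dtype=np.uint16).tofile(f)

# Read it back with the same dtype and restore the original shape.
Z = np.fromfile('out_uint16.bin', dtype=np.uint16).reshape(Y.shape)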

Effective way to convert 3D-numpy array to ASCII format

I need to convert 18 3-D NumPy arrays (each weighing ~2 GB) to ASCII format files.
Here is the info about one of the ndarrays:
I've searched the web for ideas on this conversion; one of them was to save the ndarray to txt with numpy.savetxt, but it seems that only works for 2-D arrays. For 3-D arrays it was recommended to first slice them into 2-D arrays and then write those to the ASCII file. But since I am new to Python, I am not sure this is the optimal way to deal with the huge amount of data I have, because in my case it would be 256 2-D arrays for each file (and I have 18 files in total).
I would really appreciate your ideas and help!
This can be done with the awrite function in Py4CatS or with the numpy.savetxt function in Python, as in the sketch below.
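A minimal sketch of the numpy.savetxt route, writing one 2-D slice per text block; the array size, file name, and '%.6e' format are assumptions:

import numpy as np

arr = np.random.rand(256, 100, 100)  # stand-in for one of the 3-D arrays

with open('array_ascii.txt', 'w') as f:
    for k, plane in enumerate(arr):          # iterate over the first axis
        f.write('# slice %d\n' % k)          # small header so slices can be told apart
        np.savetxt(f, plane, fmt='%.6e')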

index million row square matrix for fast access

I have some very large matrices (let's say on the order of a million rows) that I cannot keep in memory, and I would need to access subsamples of these matrices in decent time (less than a minute...).
I started looking at hdf5 and blaze in combination with numpy and pandas:
http://web.datapark.io/yves/blaze.html
http://blaze.pydata.org
But I found it a bit complicated, and I am not sure if it is the best solution.
Are there other solutions?
thanks
EDIT
Here are some more specifications about the kind of data I am dealing with.
The matrices are usually sparse (< 10% to < 25% of cells are non-zero)
The matrices are symmetric
And what I would need to do is:
Access for reading only
Extract rectangular sub-matrices (mostly along the diagonal, but also outside)
Did you try PyTables? It can be very useful for very large matrices. Take a look at this SO post.
Your question is lacking a bit of context, but HDF5 compressed block storage is probably as efficient as a sparse storage format for the relatively dense matrices you describe. In memory, you can always cast your views to sparse matrices if it pays off. That seems like an effective and simple solution, and as far as I know there are no sparse matrix formats which can easily be read partially from disk. A sketch of this follows.
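A minimal sketch of that HDF5 route with h5py; the matrix size, file and dataset names, and chunk shape are assumptions. Only the requested block is read from disk:

import h5py
import numpy as np

n = 5000                       # small stand-in for a million-row matrix
m = np.random.rand(n, n)

# Store the matrix in chunked, compressed blocks.
with h5py.File('matrix.h5', 'w') as f:
    f.create_dataset('m', data=m, chunks=(1000, 1000), compression='gzip')

# Read only a rectangular sub-matrix near the diagonal.
with h5py.File('matrix.h5', 'r') as f:
    sub = f['m'][2000:3000, 2000:3000]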

What data format for large files in R?

I produce a very large data file with Python, mostly consisting of 0 (false) and only a few 1 (true). It has about 700,000 columns and 15,000 rows, and thus a size of 10.5 GB. The first row is the header.
This file then needs to be read and visualized in R.
I'm looking for the right data format to export my file from Python.
As stated here:
HDF5 is row based. You get MUCH efficiency by having tables that are not too wide but are fairly long.
As I have a very wide table, I assume HDF5 is inappropriate in my case?
So what data format suits best for this purpose?
Would it also make sense to compress (zip) it?
Example of my file:
id,col1,col2,col3,col4,col5,...
1,0,0,0,1,0,...
2,1,0,0,0,1,...
3,0,1,0,0,1,...
4,...
Zipping won't help you, as you'd have to unzip it to process it. If you could post the code that generates the file, that might help a lot.
Also, what do you want to accomplish in R? Might it be faster to visualize it in Python, avoiding the read/write of 10.5 GB?
Perhaps rethinking how you store the data (e.g. store the coordinates of the 1's if there are very few) might be a better angle here.
For instance, instead of storing a 700K by 15K table of all zeroes except for a 1 in line 600492 column 10786, I might just store the tuple (600492, 10786) and achieve the same visualization in R. A sketch of this idea follows.
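A minimal sketch of that coordinate idea using scipy.sparse; the array is a small stand-in and the output file name is a placeholder:

import numpy as np
from scipy import sparse

dense = np.zeros((15000, 700), dtype=np.uint8)  # small stand-in for the 15K x 700K table
dense[600, 123] = 1
dense[42, 9] = 1

coo = sparse.coo_matrix(dense)
# Save only the (row, col) pairs of the ones as a two-column CSV that R can read.
np.savetxt('ones_coordinates.csv', np.column_stack((coo.row, coo.col)),
           fmt='%d', delimiter=',', header='row,col', comments='')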
SciPy has scipy.io.mmwrite, which writes files that can be read by R's readMM command (from the Matrix package). SciPy also supports several different sparse matrix representations. For example:
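A minimal sketch of the Matrix Market route; the matrix here is randomly generated and the file name is a placeholder:

from scipy import sparse
from scipy.io import mmwrite

m = sparse.random(15000, 700, density=0.001, format='csr')  # stand-in sparse matrix
mmwrite('matrix.mtx', m)

# In R:
#   library(Matrix)
#   m <- readMM("matrix.mtx")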
