How to save large Python numpy datasets?

I'm attempting to create an autonomous RC car and my Python program is supposed to query the live stream on a given interval and add it to a training dataset. The data I want to collect is the array of the current image from OpenCV and the current speed and angle of the car. I would then like it to be loaded into Keras for processing.
I found out that numpy.save() just saves one array to a file. What is the best/most efficient way of saving data for my needs?

As with anything regarding performance or efficiency, test it yourself. The problem with recommendations for the "best" of anything is that they might change from year to year.
First, determine whether this is even an issue you should be tackling. If you're not experiencing performance or storage problems, then don't bother optimizing until it becomes a problem. Whatever you do, don't waste your time on premature optimization.
Next, assuming it actually is an issue, try out each saving method to see which one yields the smallest files in the shortest amount of time. Maybe compression is the answer, but it might slow things down. Maybe pickling objects would be faster. Who knows until you've tried?
Finally, weigh the trade-offs and decide which method you can compromise on; you'll almost never have one silver-bullet solution. While you're at it, determine whether just throwing more CPU, RAM, or disk space at the problem would solve it. Cloud computing gives you a lot of headroom in those areas.

The simplest way is np.savez_compressed(). This saves any number of arrays using the same format as np.save(), but encapsulated in a standard Zip file.
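For the RC-car use case above, a minimal sketch might look like this (the array shapes and file names are just placeholders):

    import numpy as np

    # Hypothetical training sample: one camera frame plus the car's speed and angle.
    frame = np.random.randint(0, 255, size=(120, 160, 3), dtype=np.uint8)
    labels = np.array([0.35, -12.0])  # [speed, steering angle]

    # Save any number of named arrays into a single compressed .npz file.
    np.savez_compressed("sample_0001.npz", frame=frame, labels=labels)

    # Loading gives a dict-like object keyed by the names used above.
    with np.load("sample_0001.npz") as data:
        print(data["frame"].shape, data["labels"])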
If you need to be able to add more arrays to an existing file, you can do that easily, because after all the NumPy ".npz" format is just a Zip file. So open or create a Zip file using zipfile, and then write arrays into it using np.save(). The APIs aren't perfectly matched for this, so you can first construct an in-memory BytesIO "file", write into it with np.save(), then add the buffer's contents to the archive with zipfile's writestr().
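A sketch of that appending trick, assuming Python 3 (hence io.BytesIO) and an existing training_data.npz (file and entry names are hypothetical):

    import io
    import zipfile
    import numpy as np

    new_frame = np.random.rand(120, 160, 3)

    # Serialize the array into an in-memory buffer with np.save() ...
    buf = io.BytesIO()
    np.save(buf, new_frame)

    # ... then append that buffer to the existing Zip/.npz archive.
    with zipfile.ZipFile("training_data.npz", mode="a") as zf:
        zf.writestr("frame_0042.npy", buf.getvalue())

    # np.load() sees the new entry like any other member of the archive.
    with np.load("training_data.npz") as data:
        print(data.files)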

Related

Lazy image manipulation in Python to process huge files

I am working on a project in Python that involves manipulating pretty huge images (up to 1.5 GB).
Since filtering operations are generally local, I do not need to load the entire image to, for instance, detect small objects in it. Therefore, I guess one possible solution would be to load a small patch of the image, process it, move on to the next patch, and so on...
However, I believe I am not the first one to be facing this problem, and some sort of lazy library has probably been developed already. Does anybody have a pointer?
Thanks in advance
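One way to sketch that patch-by-patch idea, assuming the image has been dumped to a raw binary file of known shape and dtype (all names and sizes below are hypothetical), is numpy's memory mapping:

    import numpy as np

    height, width = 40000, 40000
    img = np.memmap("huge_image.raw", dtype=np.uint8, mode="r",
                    shape=(height, width))

    patch = 1024
    for y in range(0, height, patch):
        for x in range(0, width, patch):
            tile = img[y:y + patch, x:x + patch]
            # Only this tile is paged in from disk; process it here.
            print(y, x, tile.mean())

Libraries like dask.array build on the same chunked idea if you want the laziness managed for you.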

Load large data without iterators or chunks

I want to train on a large data file (7 GB): 800 rows, 5 million columns. So I want to load this data and put it in a form I can use (a 2D list or array).
The problem is that when I load the data and try to store it, it uses all my memory (12 GB) and loading just stops at around row 500.
I've heard a lot about how to handle this kind of data, e.g. using chunks and iterators, but I would like to load it entirely into memory so I can do cross-validation.
I tried using pandas to help me, but the problem is the same.
Are there issues with loading and storing the entire 7 GB of data the way I want to? Or is there any other idea that could help me?
You can try getting a swap or page file. Depending on your operating system, you can use virtual memory to allow your system to address more objects in a single process than will fit in physical memory. Depending on how large the working set is, performance may not suffer that much, or it may be completely dreadful.
That said, it is almost certain that getting more memory or using some partitioning strategy (similar to what you are calling chunking) is a better solution for your problem.
On Windows, take a look here for information on how to adjust the page-file size. For Red Hat Linux, try this link for information on adding swap.
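If you do go the partitioning route, one sketch (the file name, delimiter and dtype are assumptions about your data) is to stream the file in row chunks into a disk-backed memmap, which leans on the same virtual-memory machinery described above:

    import numpy as np
    import pandas as pd

    rows, cols = 800, 5_000_000
    out = np.memmap("data_float32.mm", dtype=np.float32, mode="w+",
                    shape=(rows, cols))

    # Stream the 7 GB text file in row chunks and write each block into the
    # memmap; the OS pages the data in and out instead of holding it in RAM.
    start = 0
    for chunk in pd.read_csv("big_data.csv", header=None, chunksize=50,
                             dtype=np.float32):
        block = chunk.to_numpy()
        out[start:start + block.shape[0], :] = block
        start += block.shape[0]

    out.flush()
    print(out.shape, out.dtype)  # slices like a normal ndarray afterwards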

Is it meaningful to use multiprocessing to read multiple files with Python?

I intend to use Python's multiprocessing capabilities to read a set of small files. However, this seems awkward to me in some sense, because if the disk is rotational then the bottleneck is the seek/rotation time, and even though I use multiple processes, the total read time should be similar to a single-process read. Am I wrong? What are your comments?
In addition, do you think using multiprocessing might cause intertwined reading of the files, so that the contents of these files end up skewed in some way?
Your reasoning is sound, but the only way to find out for sure is by benchmarking (that said, it is unlikely that reading many small files in parallel will increase performance over reading them sequentially).
I am not entirely sure what you mean by "intertwined reading", but -- unless there are bugs in your code or the files are being changed while you're reading them -- you will get exactly the same contents irrespective of how you read them.
You are indeed right, the bottleneck will be disk-IO.
However, the only way to really know, is to measure both approaches.
If you have influence on the files, you could go for one larger file as opposed to many smaller files.
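A rough way to measure both approaches (the glob pattern and worker count are placeholders):

    import glob
    import time
    from multiprocessing import Pool

    FILES = glob.glob("data/*.txt")

    def read_file(path):
        # Stand-in for whatever per-file work you actually do.
        with open(path, "rb") as f:
            return len(f.read())

    if __name__ == "__main__":
        t0 = time.time()
        sequential = [read_file(p) for p in FILES]
        print("sequential:", time.time() - t0)

        t0 = time.time()
        with Pool(processes=4) as pool:
            parallel = pool.map(read_file, FILES)
        print("4 processes:", time.time() - t0)

On a rotational disk the parallel version can even be slower because of extra seeking; on an SSD, or when per-file processing dominates the I/O, it may win.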

Trying to delete values in an hdf5 file with PyTables, but file size is not shrinking [duplicate]

I have an HDF5 file with a one-dimensional (N x 1) dataset of compound elements - actually it's a time series. The data is first collected offline into the HDF5 file, and then analyzed. During analysis most of the data turns out to be uninteresting, and only some parts of it are interesting. Since the datasets can be quite big, I would like to get rid of the uninteresting elements while keeping the interesting ones. For instance, keep elements 0-100, 200-300 and 350-400 of a 500-element dataset, and dump the rest. But how?
Does anybody have experience of how to accomplish this with HDF5? Apparently it could be done in several ways, at least:
(Obvious solution), create a new fresh file and write the necessary data there, element by element. Then delete the old file.
Or, into the old file, create a new fresh dataset, write the necessary data there, unlink the old dataset using H5Gunlink(), and get rid of the unclaimed free space by running the file through h5repack.
Or, move the interesting elements within the existing dataset towards the start (e.g. move elements 200-300 to positions 101-201 and elements 350-400 to positions 202-252). Then call H5Dset_extent() to reduce the size of the dataset. Then maybe run through h5repack to release the free space.
Since the files can be quite big even when the uninteresting elements have been removed, I'd rather not rewrite them (it would take a long time), but it seems to be required to actually release the free space. Any hints from HDF5 experts?
HDF5 (at least the version I am used to, 1.6.9) does not allow deletion. Actually, it does, but it does not free the used space, with the result that you still have a huge file. As you said, you can use h5repack, but it's a waste of time and resources.
Something that you can do is keep a parallel dataset containing a boolean flag for each element, telling you which values are "alive" and which ones have been removed. This does not make the file smaller, but at least it gives you a fast way to perform deletions.
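A minimal sketch of that mask idea, using h5py here rather than PyTables and hypothetical dataset names, keeping elements 0-100, 200-300 and 350-400 as in the question:

    import numpy as np
    import h5py

    with h5py.File("series.h5", "a") as f:
        n = f["timeseries"].shape[0]  # e.g. 500

        # Create the companion boolean dataset once, everything alive at first.
        if "alive" not in f:
            f.create_dataset("alive", data=np.ones(n, dtype=bool))

        # "Delete" the uninteresting ranges by flipping their flags.
        alive = f["alive"]
        alive[101:200] = False
        alive[301:350] = False
        alive[401:] = False

        # Later, read back only the surviving elements.
        kept = f["timeseries"][f["alive"][...]]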
An alternative is to define a hyperslab selection on your array, copy the relevant data, then delete the old array; or always access the data through the selection and redefine it as you need (I've never done this, though, so I'm not sure it's possible, but it should be).
Finally, you can use the hdf5 mounting strategy to have your datasets in an "attached" hdf5 file you mount on your root hdf5. When you want to delete the stuff, copy the interesting data in another mounted file, unmount the old file and remove it, then remount the new file in the proper place. This solution can be messy (as you have multiple files around) but it allows you to free space and to operate only on subparts of your data tree, instead of using the repack.
Copying the data or using h5repack as you have described are the two usual ways of 'shrinking' the data in an HDF5 file, unfortunately.
The problem, as you may have guessed, is that an HDF5 file has a complicated internal structure (the file format is here, for anyone who is curious), so deleting and shrinking things just leaves holes in an identical-sized file. Recent versions of the HDF5 library can track the freed space and re-use it, but your use case doesn't seem to be able to take advantage of that.
As the other answer has mentioned, you might be able to use external links or the virtual dataset feature to construct HDF5 files that were more amenable to the sort of manipulation you would be doing, but I suspect that you'll still be copying a lot of data and this would definitely add additional complexity and file management overhead.
H5Gunlink() has been deprecated, by the way. H5Ldelete() is the preferred replacement.

Efficient ways to write a large NumPy array to a file

I've currently got a project running on PiCloud that involves multiple iterations of an ODE solver. Each iteration produces a NumPy array of about 30 rows and 1500 columns, with each iteration's results appended to the bottom of the array from the previous iterations.
Normally, I'd just let these fairly big arrays be returned by the function, hold them in memory and deal with them all at once. Except PiCloud has a fairly restrictive cap on the size of the data that can be returned by a function, to keep transmission costs down. Which is fine, except that means I'd have to launch thousands of jobs, each running one iteration, with considerable overhead.
It appears the best solution to this is to write the output to a file, and then collect the file using another function they have that doesn't have a transfer limit.
Is my best bet to do this just dumping it into a CSV file? Should I add to the CSV file each iteration, or hold it all in an array until the end and then just write once? Is there something terribly clever I'm missing?
Unless there is a reason for the intermediate files to be human-readable, do not use CSV, as this will inevitably involve a loss of precision.
The most efficient option is probably tofile (doc), which is intended for quick dumps of data to disk when you know all of the attributes of the data ahead of time.
For platform-independent, but numpy-specific, saves, you can use save (doc).
The NumPy/SciPy ecosystem also supports various scientific data formats, such as HDF5 (via h5py or PyTables), if you need portability.
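A sketch of the per-iteration dump with np.save (file names and the iteration count are placeholders):

    import numpy as np

    # Write each iteration's ~30 x 1500 block to its own .npy file ...
    for i in range(5):
        block = np.random.rand(30, 1500)  # stand-in for one ODE solver run
        np.save("result_%04d.npy" % i, block)

    # ... and stack them back together when you collect the files later.
    full = np.vstack([np.load("result_%04d.npy" % i) for i in range(5)])
    print(full.shape)  # (150, 1500)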
I would recommend looking at the pickle module. The pickle module allows you to serialize Python objects as byte streams, which lets you write them to a file or send them over a network, and then reinstantiate the objects later.
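For example (the file name is arbitrary):

    import pickle
    import numpy as np

    results = {"iteration": 3, "values": np.random.rand(30, 1500)}

    # Serialize to disk ...
    with open("results.pkl", "wb") as f:
        pickle.dump(results, f, protocol=pickle.HIGHEST_PROTOCOL)

    # ... and reinstantiate later.
    with open("results.pkl", "rb") as f:
        restored = pickle.load(f)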
Try Joblib - Fast compressed persistence
One of the key components of joblib is its ability to persist arbitrary Python objects and read them back very quickly. It is particularly efficient for containers that do their heavy lifting with numpy arrays. The trick to achieving great speed is to save the numpy arrays in separate files and load them back via memory-mapping.
Edit:
Newer (2016) blog entry on data persistence in Joblib
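A small sketch of both modes (sizes and file names are arbitrary); note that memory-mapped loading only applies to uncompressed dumps:

    import numpy as np
    import joblib

    big = np.random.rand(30000, 1500)

    # Compressed persistence: smaller files, a bit slower to write and read.
    joblib.dump(big, "big_array_compressed.joblib", compress=3)

    # Uncompressed dumps can be loaded back memory-mapped, so the array is
    # paged in from disk on access instead of being copied into RAM up front.
    joblib.dump(big, "big_array_raw.joblib")
    reloaded = joblib.load("big_array_raw.joblib", mmap_mode="r")
    print(reloaded.shape)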
