Efficient ways to write a large NumPy array to a file

Efficient ways to write a large NumPy array to a file - python

I've currently got a project running on PiCloud that involves multiple iterations of an ODE Solver. Each iteration produces a NumPy array of about 30 rows and 1500 columns, with each iterations being appended to the bottom of the array of the previous results.
Normally, I'd just let these fairly big arrays be returned by the function, hold them in memory and deal with them all at one. Except PiCloud has a fairly restrictive cap on the size of the data that can be out and out returned by a function, to keep down on transmission costs. Which is fine, except that means I'd have to launch thousands of jobs, each running on iteration, with considerable overhead.
It appears the best solution to this is to write the output to a file, and then collect the file using another function they have that doesn't have a transfer limit.
Is my best bet to do this just dumping it into a CSV file? Should I add to the CSV file each iteration, or hold it all in an array until the end and then just write once? Is there something terribly clever I'm missing?

Unless there is a reason for the intermediate files to be human-readable, do not use CSV, as this will inevitably involve a loss of precision.
The most efficient is probably tofile (doc) which is intended for quick dumps of file to disk when you know all of the attributes of the data ahead of time.
For platform-independent, but numpy-specific, saves, you can use save (doc).
Numpy and scipy also have support for various scientific data formats like HDF5 if you need portability.

I would recommend looking at the pickle module. The pickle module allows you to serialize python objects as streams of bytes (e.g., strings). This allows you to write them to a file or send them over a network, and then reinstantiate the objects later.

Try Joblib - Fast compressed persistence
One of the key components of joblib is it’s ability to persist arbitrary Python objects, and read them back very quickly. It is particularly efficient for containers that do their heavy lifting with numpy arrays. The trick to achieving great speed has been to save in separate files the numpy arrays, and load them via memmapping.
Edit:
Newer (2016) blog entry on data persistence in Joblib

Related

Iterate and compute over multiple dask arrays

I have multiple dask arrays and would like to save them to a GIF or some movie format using imageio one frame at a time, but I think the problem is generic enough that the solution could help other people. I'm wondering if there is a way to compute the arrays in order and while computing one array and writing it to disk, start computing the next one on the remaining workers. If possible, it would be nice if the scheduler/graph could share tasks between the dask arrays if any.
The code would look something like this in my eyes:
import dask.array as da
writer = Writer(...)
for dask_arr in da.compute([dask_arr1, dask_arr2, dask_arr3]):
writer.write_frame(dask_arr)
It looks like this is probably hackable by users with the distributed scheduler, but I'd like to use the threaded scheduler if possible. I'm also not sure if this is super useful in my exact real world case given memory usage or possibly having to write entire frames at a time instead of chunks. I also don't doubt that this could be handled in a custom array-like object with da.store...some how.

If you're able to write a function that takes in a slice of the array and then writes it appropriately you might be able to use a function like da.map_blocks.
This would become much more complex if you're trying to write into a single file where random access is harder to guarantee.
Perhaps you could use map_blocks to save each slice as a single image and then use some post-processing tool to stitch those images together.

How to save large Python numpy datasets?

I'm attempting to create an autonomous RC car and my Python program is supposed to query the live stream on a given interval and add it to a training dataset. The data I want to collect is the array of the current image from OpenCV and the current speed and angle of the car. I would then like it to be loaded into Keras for processing.
I found out that numpy.save() just saves one array to a file. What is the best/most efficient way of saving data for my needs?

As with anything regarding performance or efficiency, test it yourself. The problem with recommendations for the "best" of anything is that they might change from year to year.
First, you should determine if this is even an issue you should be tackling. If you're not experiencing performance issues or storage issues, then don't bother optimizing until it becomes a problem. What ever you do, don't waste your time on premature optimizations.
Next, assuming it actually is an issue, try out every method for saving to see which one yields the smallest results in the shortest amount of time. Maybe compression is the answer, but that might slow things down? Maybe pickling objects would be faster? Who knows until you've tried.
Finally, weigh the trade-offs and decide which method you can compromise on; You'll almost never have one silver bullet solution. While your at it, determine if just adding more CPU, RAM or disk space at the problem would solve it. Cloud computing affords you a lot of headroom in those areas.

The most simple way is np.savez_compressed(). This saves any number of arrays using the same format as np.save() but encapsulated in a standard Zip file.
If you need to be able to add more arrays to an existing file, you can do that easily, because after all the NumPy ".npz" format is just a Zip file. So open or create a Zip file using zipfile, and then write arrays into it using np.save(). The APIs aren't perfectly matched for this, so you can first construct a StringIO "file", write into it with np.save(), then use writestr() in zipfile.

Caching CSV-read data with pandas for multiple runs

I'm trying to apply machine learning (Python with scikit-learn) to a large data stored in a CSV file which is about 2.2 gigabytes.
As this is a partially empirical process I need to run the script numerous times which results in the pandas.read_csv() function being called over and over again and it takes a lot of time.
Obviously, this is very time consuming so I guess there is must be a way to make the process of reading the data faster - like storing it in a different format or caching it in some way.
Code example in the solution would be great!

I would store already parsed DFs in one of the following formats:
HDF5 (fast, supports conditional reading / querying, supports various compression methods, supported by different tools/languages)
Feather (extremely fast - makes sense to use on SSD drives)
Pickle (fast)
All of them are very fast
PS it's important to know what kind of data (what dtypes) you are going to store, because it might affect the speed dramatically

how to rapidaly load data into memory with python?

I have a large csv file (5 GB) and I can read it with pandas.read_csv(). This operation takes a lot of time 10-20 minutes.
How can I speed it up?
Would it be useful to transform the data in a sqllite format? In case what should I do?
EDIT: More information:
The data contains 1852 columns and 350000 rows. Most of the columns are float65 and contain numbers. Some other contains string or dates (that I suppose are considered as string)
I am using a laptop with 16 GB of RAM and SSD hard drive. The data should fit fine in memory (but I know that python tends to increase the data size)
EDIT 2 :
During the loading I receive this message
/usr/local/lib/python3.4/dist-packages/pandas/io/parsers.py:1164: DtypeWarning: Columns (1841,1842,1844) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
EDIT: SOLUTION
Read one time the csv file and save it as
data.to_hdf('data.h5', 'table')
This format is incredibly efficient

This actually depends on which part of reading it is taking 10 minutes.
If it's actually reading from disk, then obviously any more compact form of the data will be better.
If it's processing the CSV format (you can tell this because your CPU is at near 100% on one core while reading; it'll be very low for the other two), then you want a form that's already preprocessed.
If it's swapping memory, e.g., because you only have 2GB of physical RAM, then nothing is going to help except splitting the data.
It's important to know which one you have. For example, stream-compressing the data (e.g., with gzip) will make the first problem a lot better, but the second one even worse.
It sounds like you probably have the second problem, which is good to know. (However, there are things you can do that will probably be better no matter what the problem.)
Your idea of storing it in a sqlite database is nice because it can at least potentially solve all three at once; you only read the data in from disk as-needed, and it's stored in a reasonably compact and easy-to-process form. But it's not the best possible solution for the first two, just a "pretty good" one.
In particular, if you actually do need to do array-wide work across all 350000 rows, and can't translate that work into SQL queries, you're not going to get much benefit out of sqlite. Ultimately, you're going to be doing a giant SELECT to pull in all the data and then process it all into one big frame.
Writing out the shape and structure information, then writing the underlying arrays in NumPy binary form. Then, for reading, you have to reverse that. NumPy's binary form just stores the raw data as compactly as possible, and it's a format that can be written blindingly quickly (it's basically just dumping the raw in-memory storage to disk). That will improve both the first and second problems.
Similarly, storing the data in HDF5 (either using Pandas IO or an external library like PyTables or h5py) will improve both the first and second problems. HDF5 is designed to be a reasonably compact and simple format for storing the same kind of data you usually store in Pandas. (And it includes optional compression as a built-in feature, so if you know which of the two you have, you can tune it.) It won't solve the second problem quite as well as the last option, but probably well enough, and it's much simpler (once you get past setting up your HDF5 libraries).
Finally, pickling the data may sometimes be faster. pickle is Python's native serialization format, and it's hookable by third-party modules—and NumPy and Pandas have both hooked it to do a reasonably good job of pickling their data.
(Although this doesn't apply to the question, it may help someone searching later: If you're using Python 2.x, make sure to explicitly use pickle format 2; IIRC, NumPy is very bad at the default pickle format 0. In Python 3.0+, this isn't relevant, because the default format is at least 3.)

Python has two built-in libraries called pickle and cPickle that can store any Python data structure.
cPickle is identical to pickle except that cPickle has trouble with Unicode stuff and is 1000x faster.
Both are really convenient for saving stuff that's going to be re-loaded into Python in some form, since you don't have to worry about some kind of error popping up in your file I/O.
Having worked with a number of XML files, I've found some performance gains from loading pickles instead of raw XML. I'm not entirely sure how the performance compares with CSVs, but it's worth a shot, especially if you don't have to worry about Unicode stuff and can use cPickle. It's also simple, so if it's not a good enough boost, you can move on to other methods with minimal time lost.
A simple example of usage:
>>> import pickle
>>> stuff = ["Here's", "a", "list", "of", "tokens"]
>>> fstream = open("test.pkl", "wb")
>>> pickle.dump(stuff,fstream)
>>> fstream.close()
>>>
>>> fstream2 = open("test.pkl", "rb")
>>> old_stuff = pickle.load(fstream2)
>>> fstream2.close()
>>> old_stuff
["Here's", 'a', 'list', 'of', 'tokens']
>>>
Notice the "b" in the file stream openers. This is important--it preserves cross-platform compatibility of the pickles. I've failed to do this before and had it come back to haunt me.
For your stuff, I recommend writing a first script that parses the CSV and saves it as a pickle; when you do your analysis, the script associated with that loads the pickle like in the second block of code up there.
I've tried this with XML; I'm curious how much of a boost you will get with CSVs.

If the problem is in the processing overhead, then you can divide the file into smaller files and handle them in different CPU cores or threads. Also for some algorithms the python time will increase non-linearly and the dividing method will help in these cases.

Manipulating Large Amounts of Image Data in Python

I have a large number of images of different categories, e.g. "Cat", "Dog", "Bird". The images have some hierarchical structure, like a dict. So for example the key is the animal name and the value is a list of animal images, e.g. animalPictures[animal][index].
I want to manipulate each image (e.g. compute histogram) and then save the manipulated data in an identical corresponding structure, e.g. animalPictures['bird'][0] has its histogram stored in animalHistograms['bird'][0].
The only issue is I do not have enough memory to load all the images, perform all the manipulations, and create an additional structure of the transformed images.
Is it possible to load an image from disk, manipulate the image, and stream the data to a dict on disk? This way I could work on a per-image basis and not worry about loading everything into memory at once.
Thanks!

I would suggest using theshelvemodule which provides dictionary-like persistent objects backed by a file object. For the most part you can use them like ordinary dictionaries, so converting any existing code that uses them is relatively easy. There's couple of additional methods for controlling the writing back of any cached entries and closing the object.
Basically it will let you transparently store your Python dictionary data in an underlying file-based database. All entries accessed are also cached in memory. However you can prevent the cache from consuming too much memory by manually calling the sync() method which writes any modified entries back to disk and empties the cache.

It is possible, if you us NumPy and especially numpy.memmap to store the image data. That way the image data looks as if it were in memory but is on the disk using the mmap mechanism. The nice thing is that the numpy.memmap arrays are not more difficult to handle than ordinary arrays.
There is some performance overhead as all memmap arrays are saved to disk. The arrays could be described as "disk-backed arrays", i.e. the data is also kept in RAM as long as possible. This means that if you access some data array very often, it is most likely in memory, and there is no disk read overhead.
So, keep your metadata in a dict in memory, but memmap your bigger arrays.
This is probably the easiest way. However, make sure you have a 64-bit Python in use, as the 32-bit one runs out of addresses ad 2 GiB.
Of course, there are a lot of ways to compress the image data. If your data may be compressed, then you might consider using compression to save memory.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.