I have a Python program that processes fairly large NumPy arrays (in the hundreds of megabytes), which are stored on disk in pickle files (one ~100MB array per file). When I want to run a query on the data I load the entire array, via pickle, and then perform the query (so that from the perspective of the Python program the entire array is in memory, even if the OS is swapping it out). I did this mainly because I believed that being able to use vectorized operations on NumPy arrays would be substantially faster than using for loops through each item.
I'm running this on a web server which has memory limits that I quickly run up against. I have many different kinds of queries that I run on the data so writing "chunking" code which loads portions of the data from separate pickle files, processes them, and then proceeds to the next chunk would likely add a lot of complexity. It'd definitely be preferable to make this "chunking" transparent to any function that processes these large arrays.
It seems like the ideal solution would be something like a generator which periodically loaded a block of the data from the disk and then passed the array values out one by one. This would substantially reduce the amount of memory required by the program without requiring any extra work on the part of the individual query functions. Is it possible to do something like this?
PyTables is a package for managing hierarchical datasets. It is designed to solve this problem for you.
NumPy's memory-mapped data structure (memmap) might be a good choice here.
You access your NumPy arrays from a binary file on disk, without loading the entire file into memory at once.
(Note, i believe, but i am not certain, that Numpys memmap object is not the same as Pythons--in particular, NumPys is array-like, Python's is file-like.)
The method signature is:
A = NP.memmap(filename, dtype, mode, shape, order='C')
All of the arguments are straightforward (i.e., they have the same meaning as used elsewhere in NumPy) except for 'order', which refers to order of the ndarray memory layout. I believe the default is 'C', and the (only) other option is 'F', for Fortran--as elsewhere, these two options represent row-major and column-major order, respectively.
The two methods are:
flush (which writes to disk any changes you make to the array); and
close (which writes the data to the memmap array, or more precisely to an array-like memory-map to the data stored on disk)
example use:
import numpy as NP
from tempfile import mkdtemp
import os.path as PH
my_data = NP.random.randint(10, 100, 10000).reshape(1000, 10)
my_data = NP.array(my_data, dtype="float")
fname = PH.join(mkdtemp(), 'tempfile.dat')
mm_obj = NP.memmap(fname, dtype="float32", mode="w+", shape=1000, 10)
# now write the data to the memmap array:
mm_obj[:] = data[:]
# reload the memmap:
mm_obj = NP.memmap(fname, dtype="float32", mode="r", shape=(1000, 10))
# verify that it's there!:
print(mm_obj[:20,:])
It seems like the ideal solution would
be something like a generator which
periodically loaded a block of the
data from the disk and then passed the
array values out one by one. This
would substantially reduce the amount
of memory required by the program
without requiring any extra work on
the part of the individual query
functions. Is it possible to do
something like this?
Yes, but not by keeping the arrays on disk in a single pickle -- the pickle protocol just isn't designed for "incremental deserialization".
You can write multiple pickles to the same open file, one after the other (use dump, not dumps), and then the "lazy evaluator for iteration" just needs to use pickle.load each time.
Example code (Python 3.1 -- in 2.any you'll want cPickle instead of pickle and a -1 for protocol, etc, of course;-):
>>> import pickle
>>> lol = [range(i) for i in range(5)]
>>> fp = open('/tmp/bah.dat', 'wb')
>>> for subl in lol: pickle.dump(subl, fp)
...
>>> fp.close()
>>> fp = open('/tmp/bah.dat', 'rb')
>>> def lazy(fp):
... while True:
... try: yield pickle.load(fp)
... except EOFError: break
...
>>> list(lazy(fp))
[range(0, 0), range(0, 1), range(0, 2), range(0, 3), range(0, 4)]
>>> fp.close()
Related
I have 1000s of CSV files that I would like to append and create one big numpy array. The problem is that the numpy array would be much bigger than my RAM. Is there a way of writing a bit at a time to disk without having the entire array in RAM?
Also is there a way of reading only a specific part of the array from disk at a time?
When working with numpy and large arrays, there are several approaches depending on what you need to do with that data.
The simplest answer is to use less data. If your data has lots of repeating elements, it is often possible to use a sparse array from scipy because the two libraries are heavily integrated.
Another answer (IMO: the correct solution to your problem) is to use a memory mapped array. This will let numpy only load the necessary parts to ram when needed, and leave the rest on disk. The files containing the data can be simple binary files created using any number of methods, but the built-in python module that would handle this is struct. Appending more data would be as simple as opening the file in append mode, and writing more bytes of data. Make sure that any references to the memory mapped array are re-created any time more data is written to the file so the information is fresh.
Finally is something like compression. Numpy can compress arrays with savez_compressed which can then be opened with numpy.load. Importantly, compressed numpy files cannot be memory-mapped, and must be loaded into memory entirely. Loading one column at a time may be able to get you under the threshold, but this could similarly be applied to other methods to reduce memory usage. Numpy's built in compression techniques will only save disk space not memory. There may exist other libraries that perform some sorts of streaming compression, but that is beyond the scope of my answer.
Here is an example of putting binary data into a file then opening it as a memory-mapped array:
import numpy as np
#open a file for data of a single column
with open('column_data.dat', 'wb') as f:
#for 1024 "csv files"
for _ in range(1024):
csv_data = np.random.rand(1024).astype(np.float) #represents one column of data
f.write(csv_data.tobytes())
#open the array as a memory-mapped file
column_mmap = np.memmap('column_data.dat', dtype=np.float)
#read some data
print(np.mean(column_mmap[0:1024]))
#write some data
column_mmap[0:512] = .5
#deletion closes the memory-mapped file and flush changes to disk.
# del isn't specifically needed as python will garbage collect objects no
# longer accessable. If for example you intend to read the entire array,
# you will need to periodically make sure the array gets deleted and re-created
# or the entire thing will end up in memory again. This could be done with a
# function that loads and operates on part of the array, then when the function
# returns and the memory-mapped array local to the function goes out of scope,
# it will be garbage collected. Calling such a function would not cause a
# build-up of memory usage.
del column_mmap
#write some more data to the array (not while the mmap is open)
with open('column_data.dat', 'ab') as f:
#for 1024 "csv files"
for _ in range(1024):
csv_data = np.random.rand(1024).astype(np.float) #represents one column of data
f.write(csv_data.tobytes())
I am trying to port this bit of matlab code to python
matlab
function write_file(im,name)
fp = fopen(name,'wb');
M = size(im);
fwrite(fp,[M(1) M(2) M(3)],'int');
fwrite(fp,im(:),'float');
fclose(fp);
where im is a 3D matrix. As far as I understand, the function first writes a binary file with a header row containing the matrix size. The header is made of 3 integers. Then, the im is written as a single column of floats. In matlab this takes few seconds for a file of 150MB.
python
import struct
import numpy as np
def write_image(im, file_name):
with open(file_name, 'wb') as f:
l = im.shape[0]*im.shape[1]*im.shape[2]
header = np.array([im.shape[0], im.shape[1], im.shape[2]])
header_bin = struct.pack("I"*3, *header)
f.write(header_bin)
im_bin = struct.pack("f"*l,*np.reshape(im, (l,1), order='F'))
f.write(im_bin)
f.close()
where im is a numpy array. This code works well as I compared with the binary returned by matlab and they are the same. However, for the 150MB file, it takes several seconds and tends to drain all the memory (in the image linked I stopped the execution to avoid it, but you can see how it builds up!).
This does not make sense to me as I am running the function on a 15GB of RAM PC. How come a 150MB file processing requires so much memory?
I'd happy to use a different method, as far as it is possible to have two formats for the header and the data column.
There is no need to use struct to save your array. numpy.ndarray has a convenience method for saving itself in binary mode: ndarray.tofile. The following should be much more efficient than creating a gigantic string with the same number of elements as your array:
def write_image(im, file_name):
with open(file_name, 'wb') as f:
np.array(im.shape).tofile(f)
im.T.tofile(f)
tofile always saves in row-major C order, while MATLAB uses column-major Fortran order. The simplest way to get around this is to save the transpose of the array. In general, ndarray.T should create a view (wrapper object pointing to the same underlying data) instead of a copy, so your memory usage should not increase noticeably from this operation.
I am regularly dealing with large amounts of data (order of several GB), which are stored in memory in NumPy arrays. Often, I will be dealing with nested lists/tuples of such NumPy arrays. How should I store these to disk? I want to preserve the list/tuple structure of my data, the data has to be compressed to conserve disk space, and saving/loading needs to be fast.
(The particular use case I'm facing right now is a 4000-element long list of 2-tuples x where x[0].shape = (201,) and x[1].shape = (201,1000).)
I have tried several options, but all have downsides:
pickle storage into a gzip archive. This works well, and results in acceptable disk space usage, but is extremely slow and consumes a lot of memory while saving.
numpy.savez_compressed. Is much faster than pickle, but unfortunately only allows either a sequence of numpy arrays (not nested tuples/lists as I have) or a dictionary-style way of specifying the arguments.
Storing into HDF5 through h5py. This seems too cumbersome for my relatively simple needs. More importantly, I looked a lot into this, and also there does not seem to be a straightforward way to store heterogeneous (nested) lists.
hickle seems to do exactly what I want, however unfortunately it's incompatible with Python 3 at the moment (which is what I'm using).
I was thinking of writing a wrapper around numpy.savez_compressed, which would determine the nested structure of the data, store this structure in some variable nest_structure, flatten the full graph, and store both nest_structure and all the flattened data using numpy.savez_compressed. Then, the corresponding wrapper around numpy.load would understand the nest_structure variable, and re-create the graph and return it. However, I was hoping there is something like this already out there.
You may like the shelve package. It effectively wraps heterogeneous pickled objects in a convenient file. shelve is oriented more toward a "persistent storage" than classic save-to-file model.
The main benefit of using shelve is that you can conveniently save most kinds of structured data. The main disadvantage of using shelve is that it is Python-specific. Unlike HDF-5 or saved Matlab files or even simple CSV files, it isn't so easy to use other tools with your data.
Example of saving (Out of habit, I created objects and copy them to df, but you don't need to do this. You could just save directly to items in df):
import shelve
import numpy as np
a = np.arange(0, 1000, 12)
b = "This is a string"
class C(object):
alpha = 1.0
beta = [3, 4]
c = C()
class C(object):
alpha = 1.0
beta = [3, 4]
c = C()
df = shelve.open('test.shelve', 'c')
df['a'] = a
df['b'] = b
df['c'] = c
df.sync()
exit()
Following the above example, recovering data:
import shelve
import numpy as np
class C():
alpha = 1.0
beta = [3, 4]
df = shelve.open('test.shelve')
print(df['a'])
print(df['b'])
print(df['c'].alpha)
I have several huge arrays, and I am using np.save and np.load to save each array or dictionary in a single file and then I reload them, in order not to compute them another time as follows.
save(join(dir, "ListTitles.npy"), self.ListTitles)
self.ListTitles = load(join(dir,"ListTitles.npy"))
The problem is that when I try to use them afterwards, I have errors like (field name not found) or (len() of unsized object).
For example:
len(self.ListTitles) or when accessing a field of a dictionary return an error.
I don't know how to resolve this. Because when I simply use this code, it works perfectly:
M = array([[1,2,0], [3,4,0], [3,0,1]])
vector = zeros(3529)
save("M.npy", M)
save("vector.npy", vector)
vector = load("vector.npy")
B = load("M.npy")
print len(B)
print len(vector)
numpy's save and load functions are for numpy arrays, not for general Python data like dicts. Use the pickle module to save to file, and reload from file, most kinds of Python data structures (there are alternatives like dill which are however not in the standard library -- I'd recommend sticking with standard pickle unless it gives you specific problems).
I have a data.frame in R. It contains a lot of data : gene expression levels from many (125) arrays. I'd like the data in Python, due mostly to my incompetence in R and the fact that this was supposed to be a 30 minute job.
I would like the following code to work. To understand this code, know that the variable path contains the full path to my data set which, when loaded, gives me a variable called immgen. Know that immgen is an object (a Bioconductor ExpressionSet object) and that exprs(immgen) returns a data frame with 125 columns (experiments) and tens of thousands of rows (named genes). (Just in case it's not clear, this is Python code, using robjects.r to call R code)
import numpy as np
import rpy2.robjects as robjects
# ... some code to build path
robjects.r("load('%s')"%path) # loads immgen
e = robjects.r['data.frame']("exprs(immgen)")
expression_data = np.array(e)
This code runs, but expression_data is simply array([[1]]).
I'm pretty sure that e doesn't represent the data frame generated by exprs() due to things like:
In [40]: e._get_ncol()
Out[40]: 1
In [41]: e._get_nrow()
Out[41]: 1
But then again who knows? Even if e did represent my data.frame, that it doesn't convert straight to an array would be fair enough - a data frame has more in it than an array (rownames and colnames) and so maybe life shouldn't be this easy. However I still can't work out how to perform the conversion. The documentation is a bit too terse for me, though my limited understanding of the headings in the docs implies that this should be possible.
Anyone any thoughts?
This is the most straightforward and reliable way i've found to to transfer a data frame from R to Python.
To begin with, I think exchanging the data through the R bindings is an unnecessary complication. R provides a simple method to export data, likewise, NumPy has decent methods for data import. The file format is the only common interface required here.
data(iris)
iris$Species = unclass(iris$Species)
write.table(iris, file="/path/to/my/file/np_iris.txt", row.names=F, sep=",")
# now start a python session
import numpy as NP
fpath = "/path/to/my/file/np_iris.txt"
A = NP.loadtxt(fpath, comments="#", delimiter=",", skiprows=1)
# print(type(A))
# returns: <type 'numpy.ndarray'>
print(A.shape)
# returns: (150, 5)
print(A[1:5,])
# returns:
[[ 4.9 3. 1.4 0.2 1. ]
[ 4.7 3.2 1.3 0.2 1. ]
[ 4.6 3.1 1.5 0.2 1. ]
[ 5. 3.6 1.4 0.2 1. ]]
According to the Documentation (and my own experience for what it's worth) loadtxt is the preferred method for conventional data import.
You can also pass in to loadtxt a tuple of data types (the argument is dtypes), one item in the tuple for each column. Notice 'skiprows=1' to step over the column headers (for loadtxt rows are indexed from 1, columns from 0).
Finally, i converted the dataframe factor to integer (which is actually the underlying data type for factor) prior to exporting--'unclass' is probably the easiest way to do this.
If you have big data (ie, don't want to load the entire data file into memory but still need to access it) NumPy's memory-mapped data structure ('memmap') is a good choice:
from tempfile import mkdtemp
import os.path as path
filename = path.join(mkdtemp(), 'tempfile.dat')
# now create a memory-mapped file with shape and data type
# based on original R data frame:
A = NP.memmap(fpath, dtype="float32", mode="w+", shape=(150, 5))
# methods are ' flush' (writes to disk any changes you make to the array), and 'close'
# to write data to the memmap array (acdtually an array-like memory-map to
# the data stored on disk)
A[:] = somedata[:]
Why going through a data.frame when 'exprs(immgen)' returns a /matrix/ and your end goal is to have your data in a matrix ?
Passing the matrix to numpy is straightforward (and can even be made without making a copy):
http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy
This should beat in both simplicity and efficiency the suggestion of going through text representation of numerical data in flat files as a way to exchange data.
You seem to be working with bioconductor classes, and might be interested in the following:
http://pypi.python.org/pypi/rpy2-bioconductor-extensions/