My situation is like this:
I have ~70 million integer values, spread across several files, for ~10 categories of data (exact numbers not known).
I read those files and build a Python object from the data. This means reading each file line by line and appending to the object, so I end up with an array of 70 million subarrays with 10 values in each.
I do some statistical processing on that data. This involves appending several values (say, percentile rank) to each 'row' of data.
I store this object in a database.
Now I have never worked with data of this scale. My first instinct was to use Numpy for more efficient arrays w.r.t memory. But then I've heard that in Numpy arrays, 'append' is discouraged as it's not as efficient.
So what would you suggest I go with? Any general tips for working with data of this size? I can bring the data down to 20% of its size with random sampling if it's required.
EDIT: Edited for clarity about size and type of data.
If I understand your description correctly, your dataset will contain ~700 million integers. Even if you use 64-bit ints that would still only come to about 6GB. Depending on how much RAM you have and what you want to do in terms of statistical processing, your dataset sounds like it would be quite manageable as a normal numpy array living in core memory.
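To make the size estimate concrete, here is a minimal sketch of the arithmetic, followed by one way to avoid append altogether: allocate the array once and fill it block by block. The row count, block size, and the generated blocks are stand-ins for the real files, and the sketch assumes the row count is known (or has been counted in a first pass) before allocation.

```python
import numpy as np


def estimated_gb(n_rows, n_cols, dtype):
    """In-memory size of an (n_rows, n_cols) array, in GB (10**9 bytes)."""
    return n_rows * n_cols * np.dtype(dtype).itemsize / 1e9


size_int64 = estimated_gb(70_000_000, 10, np.int64)  # 8 bytes per value
size_int32 = estimated_gb(70_000_000, 10, np.int32)  # 4 bytes per value

# Preallocate once, then fill in blocks instead of appending row by row.
# Small stand-in sizes so the sketch runs quickly.
n_rows, n_cols, block_rows = 1_000, 10, 250
data = np.empty((n_rows, n_cols), dtype=np.int32)

row = 0
for start in range(0, n_rows, block_rows):
    # Hypothetical block; in practice this would be the parsed content of one file.
    block = np.arange(start * n_cols, (start + block_rows) * n_cols,
                      dtype=np.int32).reshape(block_rows, n_cols)
    data[row:row + len(block)] = block
    row += len(block)
```

If the values fit in 32 bits, choosing int32 halves the footprint. Derived columns such as a percentile rank can be given their own preallocated array rather than being appended to each row.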
If the dataset is too large to fit in memory, a simple solution might be to use a memory-mapped array (numpy.memmap). In most respects, an np.memmap array behaves like a normal numpy array, but instead of storing the whole dataset in system memory, it will be dynamically read from/written to a file on disk as required.
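As a rough illustration of the memory-mapped approach, the sketch below creates a small np.memmap, writes to it, and reopens it read-only. The file name, shape, and dtype are assumptions chosen for the example; a raw memmap file stores no metadata, so the same dtype and shape have to be supplied again when reopening.

```python
import os
import tempfile

import numpy as np

# Hypothetical backing file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "data.dat")
shape = (1_000, 10)

# mode="w+" creates (or overwrites) the file and maps it for reading and writing.
mm = np.memmap(path, dtype=np.int64, mode="w+", shape=shape)
mm[:] = np.arange(shape[0] * shape[1]).reshape(shape)
mm.flush()  # write pending changes to disk
del mm      # drop the mapping

# Reopen read-only; dtype and shape must match what was written.
ro = np.memmap(path, dtype=np.int64, mode="r", shape=shape)
col_means = ro.mean(axis=0)  # ordinary NumPy operations work on the mapping
```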
Another option would be to store your data in an HDF5 file, for example using PyTables or H5py. HDF5 allows the data to be compressed on disk, and PyTables includes some very fast methods to perform mathematical operations on large disk-based arrays.
Related
I've a project that is utilizing HDF5. There are file structures as well as HDF5 data structures for each dataset.
Think of a large video. Each frame is divided up equally and written to multiple files as well as multiple HDF5 data chunks. A single 'video' may have 20+ files (representing temporal and slices), and then more files to represent additional slices. The datasets aren't very large (under 30 GB) but are still cumbersome.
My initial attempt to associate (stitch) the pieces back together was to put together an array of pointers to the individual frames, and then stack them for the temporal aspect of the video. This would be (fairly) small, since I would be pointing to the locations on disk where everything was. This would also limit the amount of data I'd have to hold in memory, which is always a bonus for when I scale to the 'larger' datasets.
However, the way to accomplish this in Python eludes me, especially when considering I want to tie in the metadata for each frame (pixels, their locations, etc).
Is there a method I should be following to better reference the data and 'stitch' it back together? My current method was to create numpy arrays of the raw data. This has the detriment of reading all of the data in and storing it in memory (and disk).
So I have this large hdf5 file that contains multiple large datasets. I am accessing it with h5py and want to read parts of each of those datasets into a common ndarray. Unfortunately, slicing across datasets is not supported, so I was wondering what the most efficient way is to assemble the ndarray from those datasets, given the circumstances.
Currently, I am using something along the following lines:
import concurrent.futures
with concurrent.futures.ThreadPoolExecutor() as executor:
    # each call fills its slice of the preallocated ndarray buffer
    executor.map(assemble_array, range(NrDatasets))
where the function assemble_array reads in the data into a predefined ndarray buffer of appropriate size, but this is not fast enough :/
Can anyone help?
I'm working on an app that processes a lot of data.
.... and keeps running my computer out of memory. :(
Python has a huge amount of memory overhead on variables (as per sys.getsizeof()). A basic tuple with one integer in it takes up 56 bytes, for example. An empty list, 64 bytes. Serious overhead.
Numpy arrays are great for reducing overhead. But they're not designed to grow efficiently (see Fastest way to grow a numpy numeric array). Array (https://docs.python.org/3/library/array.html) seems promising, but it's 1d. My data is 2d, with an arbitrary number of rows and a column width of 3 floats (ideally float32) for one array, and a column width of two ints (ideally uint32) for the other. Obviously, using ~80 bytes of python structure to store 12 or 8 bytes of data per row is going to total my memory consumption.
Is the only realistic way to keep memory usage down in Python to "fake" 2d, i.e. by addressing the array as arr[row*WIDTH+column] and counting rows as len(arr)//WIDTH?
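For what it's worth, the flat-buffer idea can be combined with NumPy so that the manual index arithmetic is only needed while the data is growing. The sketch below is one possible approach, not the only one: it appends fixed-width rows to a standard-library array of C floats and then views the same memory as a 2-D float32 array with np.frombuffer. The helper append_row and the sample values are made up for illustration.

```python
from array import array

import numpy as np

WIDTH = 3
flat = array("f")  # typecode "f": C float, normally 4 bytes per value


def append_row(buf, row):
    """Append one fixed-width row to the flat buffer (hypothetical helper)."""
    if len(row) != WIDTH:
        raise ValueError("row has the wrong width")
    buf.extend(row)


for i in range(5):
    append_row(flat, (i, i + 0.5, i + 0.25))

n_rows = len(flat) // WIDTH

# View the buffer as a 2-D array without copying the data.
table = np.frombuffer(flat, dtype=np.float32).reshape(n_rows, WIDTH)
```

While the NumPy view exists, the underlying array cannot be resized, so the view is best created once all rows have been appended. The same pattern with typecode "I" and np.uint32 would cover the two-column integer data, assuming "I" is 4 bytes on the platform (flat.itemsize reports the actual size).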
Based on your comments, I'd suggest that you split your task into two parts:
1) In part 1, parse the JSON files using regexes and generate two CSV files in simple format: no headers, no spaces, just numbers. This should be quick and performant, with no memory issues: read text in, write text out. Don't try to keep anything in memory that you don't absolutely have to.
2) In part 2, use pandas' read_csv() function to slurp in the CSV files directly. (Yes, pandas! You've probably already got it, and it's hella fast.)
For my research I am working with large numpy arrays consisting of complex data.
import numpy as np
arr = np.empty((15000, 25400), dtype='complex128')
np.save('array.npy', arr)
When stored they are about 3 GB each. Loading these arrays is a time consuming process, which made me wonder if there are ways to speed this process up
One of the things I was thinking of was splitting the array into its real and imaginary parts:
arr_real = arr.real
arr_im = arr.imag
and saving each part separately. However, this didn't seem to improve processing speed significantly. There is some documentation about working with large arrays, but I haven't found much information on working with complex data. Are there smart(er) ways to work with large complex arrays?
If you only need parts of the array in memory, you can load it using memory mapping:
arr = np.load('array.npy', mmap_mode='r')
From the docs:
A memory-mapped array is kept on disk. However, it can be accessed and
sliced like any ndarray. Memory mapping is especially useful for
accessing small fragments of large files without reading the entire
file into memory.
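A small end-to-end sketch of this, using a tiny complex128 array as a stand-in for the 3 GB one (the temporary path and the array contents are assumptions for the example):

```python
import os
import tempfile

import numpy as np

# Hypothetical location for the saved array.
path = os.path.join(tempfile.mkdtemp(), "array.npy")

# Small complex128 stand-in for the real data.
arr = (np.arange(12, dtype=np.float64) + 1j * np.arange(12)).reshape(3, 4)
np.save(path, arr)

# mmap_mode="r" maps the file instead of reading it all into memory.
mapped = np.load(path, mmap_mode="r")

# Slicing touches only the needed part of the file; np.array() copies
# that slice into ordinary memory.
block = np.array(mapped[1:3, :2])
```

Because the .npy header records dtype and shape, nothing extra has to be passed when loading. The complex dtype needs no special handling here.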
In a 64-bit Linux environment, I have very big float64 arrays (a single one will be 500 GB to 1 TB). I would like to access these arrays in numpy in a uniform way: a[x:y]. So I do not want to access the array as segments, file by file. Are there any tools with which I can create a memmap over many different files? Can hdf5 or pytables store a single CArray in many small files? Maybe something similar to fileinput? Or can I do something with the file system to simulate a single file?
In MATLAB I've been using H5P.set_external to do this. Then I can create a raw dataset and access it as a big raw file. But I do not know if I can create a numpy.ndarray over these datasets in Python. Or can I spread a single dataset over many small hdf5 files?
Unfortunately, H5P.set_chunk does not work with H5P.set_external, because set_external only works with contiguous storage, not chunked storage.
some related topics:
Chain datasets from multiple HDF5 files/datasets
I would use hdf5. In h5py, you can specify a chunk size which makes retrieving small pieces of the array efficient:
http://docs.h5py.org/en/latest/high/dataset.html?#chunked-storage
You can use dask. dask arrays allow you to create an object that behaves like a single big numpy array but represents the data stored in many small HDF5 files. dask will take care of figuring out how any operations you carry out relate to the underlying on-disk data for you.