How do I store a multidimensional array? - python

I am trying to write a code in python that will display the trajectory of projectile on a 2D graph. The initial velocity and launch angle will be varying. Instead of calculating it every time, I was wondering if there is any way to create a data file which will store all the values of the coordinates for each of those different combinations of speed and launch angle. That is a 4 dimensional database. Is this even possible?

This sounds like a pretty ideal case for using CSV as your file format. It's not a "4 dimension" so much as a "4 column" database.
initial_velocity, launch_angle, end_x, end_y
which you can write out and read in easily - using either the standard library's csv module, or pandas' read_csv()

i think that you should look at the HDF5 format, which has been specialized to work with big data in NASA and bulletproof in very large scale applications :
From the webside
HDF5 lets you store huge amounts of numerical data, and easily
manipulate that data from NumPy. For example, you can slice into
multi-terabyte datasets stored on disk, as if they were real NumPy
arrays. Thousands of datasets can be stored in a single file,
categorized and tagged however you want.
In addition from me is the point that NumPy has been developed to work with multidimensional array very efficiently. Good luck !

Related

Pythonic way to access HDF5 data structures similar to pointers

I've a project that is utilizing HDF5. There are file structures as well as HDF5 data structures for each dataset.
Think of a large video. Each frame is divided up equally and written to multiple files as well as multiple HDF5 data chunks. A single 'video' may have 20+ files (representing temporal and slices), and then more files to represent additional slices. The datasets aren't very large- under 30gb- but are still cumbersome.
My initial dive to associate (stitch) the pieces back together was to put together an array of pointers to the individual frames, and then stack them for the temporal aspect of the video. This would be (fairly) small since I would be pointing to the locations on disk where everything was. This would also limit the amount of data I'd have to hold into memory- always a bonus- for when I scale to the 'larger' datasets.
However the way to accomplish this in Python eludes me- especially when considering I want to tie in the metadata for each frame (pixels, their locations, etc).
Is there a method I should be following to better reference the data and 'stitch' it back together? My current method was to create numpy arrays of the raw data. This has the detriment of reading all of the data in and storing it in memory (and disk).

how can I save super large array into many small files?

In linux 64bit environment, I have very big float64 array (single one will be 500GB to 1TB). I would like to access these arrays in numpy with uniform way: a[x:y]. So I do not want to access the array as segments file by file. Is there any tools that I can create memmap over many different files? Can hdf5 or pytables store a single CArray into many small files? Maybe something similar to the fileInput? Or Can I do something with the file system to simulate a single file?
In matlab I've been using H5P.set_external to do this. Then I can create a raw dataset and access it as a big raw file. But I do not know if I can create numpy.ndarray over these dataset in python. Or can I spread a single dataset over many small hdf5 files?
and unfortunately the H5P.set_chunk does not work with H5P.set_external, because set_external only work with continuous data type not chunked data type.
some related topics:
Chain datasets from multiple HDF5 files/datasets
I would use hdf5. In h5py, you can specify a chunk size which makes retrieving small pieces of the array efficient:
http://docs.h5py.org/en/latest/high/dataset.html?#chunked-storage
You can use dask. dask arrays allow you to create an object that behaves like a single big numpy array but represents the data stored in many small HDF5 files. dask will take care of figuring out how any operations you carry out relate to the underlying on-disk data for you.

What data format for large files in R?

I produce a very large data file with Python, mostly consisting of 0 (false) and only a few 1 (true). It has about 700.000 columns and 15.000 rows and thus a size of 10.5GB. The first row is the header.
This file then needs to be read and visualized in R.
I'm looking for the right data format to export my file from Python.
As stated here:
HDF5 is row based. You get MUCH efficiency by having tables that are
not too wide but are fairly long.
As I have a very wide table, I assume, HDF5 is inappropriate in my case?
So what data format suits best for this purpose?
Would it also make sense to compress (zip) it?
Example of my file:
id,col1,col2,col3,col4,col5,...
1,0,0,0,1,0,...
2,1,0,0,0,1,...
3,0,1,0,0,1,...
4,...
Zipping won't help you, as you'll have to unzip it to process it. If you could post your code that generates the file, that might help a lot.
Also, what do yo want to accomplish in R? Might it be faster to visualize it in Python, avoiding the read/write of 10.5GB?
Perhaps rethinking your approach to how you're storing the data (eg: store the coordinates of the 1's if there are very few) might be a better angle here.
For instance, instead of storing a 700K by 15K table of all zeroes except for a 1 in line 600492 column 10786, I might just store the tuple (600492, 10786) and achieve the same visualization in R.
SciPy has scipy.io.mmwrite which makes files that can be read by R's readMM command. SciPy also supports several different sparse matrix representations.

Working with very large arrays - Numpy

My situation is like this:
I have around ~70 million integer values distributed in various files for ~10 categories of data (exact number not known)
I read those several files, and create some python object with that data. This would obviously include reading each file line by line and appending to the python object. So I'll have an array with 70 mil subarrays, with 10 values in each.
I do some statistical processing on that data . This would involve appending several values (say, percentile rank) to each 'row' of data.
I store this object it in a Database
Now I have never worked with data of this scale. My first instinct was to use Numpy for more efficient arrays w.r.t memory. But then I've heard that in Numpy arrays, 'append' is discouraged as it's not as efficient.
So what would you suggest I go with? Any general tips for working with data of this size? I can bring the data down to 20% of its size with random sampling if it's required.
EDIT: Edited for clarity about size and type of data.
If I understand your description correctly, your dataset will contain ~700 million integers. Even if you use 64-bit ints that would still only come to about 6GB. Depending on how much RAM you have and what you want to do in terms of statistical processing, your dataset sounds like it would be quite manageable as a normal numpy array living in core memory.
If the dataset is too large to fit in memory, a simple solution might be to use a memory-mapped array (numpy.memmap). In most respects, an np.memmap array behaves like a normal numpy array, but instead of storing the whole dataset in system memory, it will be dynamically read from/written to a file on disk as required.
Another option would be to store your data in an HDF5 file, for example using PyTables or H5py. HDF5 allows the data to be compressed on disk, and PyTables includes some very fast methods to perform mathematical operations on large disk-based arrays.

Sensible storage of 1 billion+ values in a python list type structure

I'm writing a program that creates vario-function plots for a fixed region of a digital elevation model that has been converted to an array. I calculate the variance (difference in elevation) and lag (distance) between point pairs within the window constraints. Every array position is compared with every other array position. For each pair, the lag and variance values are appended to separate lists. Once all pairs have been compared, these lists are then used for data binning, averaging and eventually plotting.
The program runs fine for smaller window sizes (say 60x60 px). For windows up to about 120x120 px or so, which would give 2 lists of 207,360,000 entries, I am able to slowly get the program running. Greater than this, and I run into "MemoryError" reports - e.g. for a 240x240 px region, I would have 3,317,760,000 entries
At the beginning of the program, I create an empty list:
variance = []
lag = []
Then within a for loop where I calculate my lags and variances, I append the values to the different lists:
variance.append(var_val)
lag.append(lag_val)
I've had a look over the stackoverflow pages and have seen a similar issue discussed here. This solution would potentially improve temporal program performance however the solution offered only goes up to 100 million entries and therefore doesn't help me out with the larger regions (as with the 240x240px example). I've also considered using numpy arrays to store the values but I don't think this will stave of the memory issues.
Any suggestions for ways to use some kind of list of the proportions I have defined for the larger window sizes would be much appreciated.
I'm new to python so please forgive any ignorance.
The main bulk of the code can be seen here
Use the array module of Python. It offers some list-like types that are more memory efficient (but cannot be used to store random objects, unlike regular lists). For example, you can have arrays containing regular floats ("doubles" in C terms), or even single-precision floats (four bytes each instead of eight, at the cost of a reduced precision). An array of 3 billion such single-floats would fit into 12 GB of memory.
You could look into PyTables, a library wrapping the HDF5 C library that can be used with numpy and pandas.
Essentially PyTables will store your data on disk and transparently load it into memory as needed.
Alternatively if you want to stick to pure python, you could use a sqlite3 database to store and manipulate your data - the docs say the size limit for a sqlite database is 140TB, which should be enough for your data.
try using heapq, import heapq. It uses the heap for storage rather than the stack allowing you to access the computer full memory.

Categories