I've developed a model which uses several large, 3-dimensional datasets on the order of (1e7, 10, 1e5), and makes millions of read (and thousands of write) calls on slices of those datasets. So far the best tool I've found for making these calls is numpy.memmap, which allows minimal data to be held in RAM and permits clean indexing and very fast recall of data directly from the hard drive.
The downside of numpy.memmap seems to be that performance is pretty uneven - the time to read a slice of the array may vary by two orders of magnitude between calls. Furthermore, I'm using Dask to parallelize many of the model functions in the script.
How is the performance of Dask DataFrames for making millions of calls to a large dataset? Would swapping out memmaps for DataFrames substantially increase processing times?
You would need to use Dask Array, not Dask DataFrame. The performance is generally the same as NumPy, because NumPy does the actual computation.
Optimizations can speed up the calculation depending on the use case.
The overhead of the scheduler decreases performance. This only matters if you split the data into many partitions, and it can usually be neglected.
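For illustration, wrapping an existing memmap in a Dask Array is a one-liner (the file name, dtype, shape and chunk sizes below are placeholders for your actual dataset):

import numpy as np
import dask.array as da

# open the existing on-disk array without loading it; shape/dtype are assumptions
mm = np.memmap("model_data.dat", dtype=np.float32, mode="r", shape=(10_000_000, 10, 100_000))
x = da.from_array(mm, chunks=(100_000, 10, 10_000))     # Dask Array backed by the memmap

block = x[123_000:124_000, :, 5_000:5_100].compute()    # slicing still goes through NumPy

The chunk shape is the main tuning knob here: each slice you request should touch as few chunks as possible.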
I'm trying to think of a reason (other than you only have a small dataset) that you wouldn't use Pyspark Dataframes.
Can everything that can be done with Pandas Dataframes be reproduced with Pyspark Dataframes?
Are there some Pandas-exclusive functions or some functions that are incredibly difficult to reproduce with Pyspark?
Spark is a distributed processing framework. In addition to supporting the DataFrame functionality, it needs to run a JVM, a scheduler, and cross-process/machine communication; it spins up databases, etc. So while the literal answer to your question is no, not everything is implemented in exactly the same way, the wider answer is that any distributed processing library naturally involves significant overhead. Lots of work goes into reducing this overhead, but it will never be trivial.
Dask (another distributed processing library with a DataFrame implementation) has a great section on best practices. In it, the first recommendation is not to use dask unless you have to:
Parallelism brings extra complexity and overhead. Sometimes it’s necessary for larger problems, but often it’s not. Before adding a parallel computing system like Dask to your workload you may want to first try some alternatives:
Use better algorithms or data structures: NumPy, Pandas, Scikit-Learn may have faster functions for what you’re trying to do. It may be worth consulting with an expert or reading through their docs again to find a better pre-built algorithm.
Better file formats: Efficient binary formats that support random access can often help you manage larger-than-memory datasets efficiently and simply. See the Store Data Efficiently section below.
Compiled code: Compiling your Python code with Numba or Cython might make parallelism unnecessary. Or you might use the multi-core parallelism available within those libraries.
Sampling: Even if you have a lot of data, there might not be much advantage from using all of it. By sampling intelligently you might be able to derive the same insight from a much more manageable subset.
Profile: If you’re trying to speed up slow code it’s important that you first understand why it is slow. Modest time investments in profiling your code can help you to identify what is slowing you down. This information can help you make better decisions about if parallelism is likely to help, or if other approaches are likely to be more effective.
There's a very good reason for this. In-memory, single-threaded applications are always going to be much faster for small datasets.
Very simplistically, if you imagine the single-threaded runtime for your workflow is T, the wall time of a distributed workflow will be T_parallelizable / n_cores + T_not_parallelizable + overhead. For pyspark, this overhead is very significant. It's worth it a lot of the time. But it's not nothing.
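To put made-up numbers on that (none of these constants come from a real benchmark):

# back-of-the-envelope only; every number here is invented
t_single = 100.0          # single-threaded runtime T, in seconds
parallel_fraction = 0.9   # share of T that actually parallelizes
n_cores = 8
overhead = 15.0           # JVM startup, scheduling, serialization, shuffles, ...

wall_time = (t_single * parallel_fraction) / n_cores \
    + t_single * (1 - parallel_fraction) + overhead
print(wall_time)          # 36.25 s: a win when T is large, a loss for small jobs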
Calculating an FFT on a machine with 16GB of memory causes the memory to be exhausted.
from scipy import signal  # data, samp_rate and data_size are defined earlier in the script

print(data_size)
freqs, times, spec_arr = signal.spectrogram(data, fs=samp_rate, nfft=1024, return_onesided=False, axis=0, scaling='spectrum', mode='magnitude')
Output as below:
537089518
Killed
How can I calculate the FFT of data this large with an existing Python package?
A more general solution is to do that yourself. 1D FFTs can be split into smaller ones thanks to the well-known Cooley–Tukey FFT algorithm and multidimensional decomposition. For more information about this strategy, please read The Design and Implementation of FFTW3. You can perform the operation on virtually mapped (memory-mapped) data to make this easier. Some libraries/packages like FFTW enable you to relatively easily perform fast in-place FFTs. You may need to write your own Python package or to use Cython so as not to allocate additional memory that is not memory mapped.
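As a rough illustration of that decomposition, here is a minimal in-memory sketch of the Cooley–Tukey "four-step" split (not an out-of-core implementation; in a real setting the two batched FFT passes are where you would work on memory-mapped chunks):

import numpy as np

def four_step_fft(x, n1, n2):
    # Split a length n1*n2 FFT into n2 FFTs of length n1,
    # a twiddle-factor multiply, and n1 FFTs of length n2.
    assert x.size == n1 * n2
    a = x.reshape(n1, n2)
    b = np.fft.fft(a, axis=0)                             # inner FFTs (length n1)
    k1 = np.arange(n1)[:, None]
    m2 = np.arange(n2)[None, :]
    b = b * np.exp(-2j * np.pi * k1 * m2 / (n1 * n2))     # twiddle factors
    d = np.fft.fft(b, axis=1)                             # outer FFTs (length n2)
    return d.T.ravel()                                    # equals np.fft.fft(x)

For genuinely out-of-core work, each FFT pass would be applied to slices of a memory-mapped or HDF5-backed array (or through FFTW's own interfaces) rather than to an array held entirely in RAM.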
One alternative solution is to save your data in HDF5 (for example using h5py), apply out_of_core_fft on it, and then read the file back. But be aware that this package is a bit old and appears not to be maintained anymore.
My question is simple, and I could not find a resource that answers it. There are somewhat similar questions (on using asarray, on numbers in general, and the most succinct one here).
How can I "calculate" the overhead of loading a numpy array into RAM (if there is any overhead)? Or, how do I determine the least amount of RAM needed to hold all arrays in memory (without time-consuming trial and error)?
In short, I have several numpy arrays of shape (x, 1323000, 1), with x being as high as 6000. This leads to a disk usage of 30GB for the largest file.
All files together need 50GB. Is it therefore enough if I use slightly more than 50GB as RAM (using Kubernetes)? I want to use the RAM as efficiently as possible, so just using 100GBs is not an option.
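For what it's worth, the in-memory footprint of a plain numpy array is just the product of its shape times the itemsize (plus a tiny fixed header), so it can be estimated without loading anything; the dtype below is an assumption chosen to match the ~30GB figure:

import numpy as np

shape = (6000, 1323000, 1)              # largest array from the question
dtype = np.dtype("float32")             # assumed; 4 bytes/element matches ~30GB on disk
nbytes = int(np.prod(shape, dtype=np.int64)) * dtype.itemsize
print(nbytes / 1e9)                     # ~31.8 GB of RAM just for this one array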
My situation is like this:
I have around ~70 million integer values distributed in various files for ~10 categories of data (exact number not known)
I read those several files, and create some python object with that data. This would obviously include reading each file line by line and appending to the python object. So I'll have an array with 70 mil subarrays, with 10 values in each.
I do some statistical processing on that data. This would involve appending several values (say, percentile rank) to each 'row' of data.
I store this object in a database.
Now I have never worked with data of this scale. My first instinct was to use NumPy for more memory-efficient arrays. But then I've heard that appending to NumPy arrays is discouraged, as it's not efficient.
So what would you suggest I go with? Any general tips for working with data of this size? I can bring the data down to 20% of its size with random sampling if it's required.
EDIT: Edited for clarity about size and type of data.
If I understand your description correctly, your dataset will contain ~700 million integers. Even if you use 64-bit ints that would still only come to about 6GB. Depending on how much RAM you have and what you want to do in terms of statistical processing, your dataset sounds like it would be quite manageable as a normal numpy array living in core memory.
If the dataset is too large to fit in memory, a simple solution might be to use a memory-mapped array (numpy.memmap). In most respects, an np.memmap array behaves like a normal numpy array, but instead of storing the whole dataset in system memory, it will be dynamically read from/written to a file on disk as required.
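If it does end up not fitting in RAM, a minimal sketch of preallocating a memory-mapped array and filling it in place (rather than appending) might look like the following; the output file name, the dtype and the filenames variable are placeholders, and the text-parsing step is just an assumption about your input format:

import numpy as np

n_rows, n_cols = 70_000_000, 10         # from the question: ~70M rows of 10 integers
arr = np.memmap("data.mmap", dtype=np.int64, mode="w+", shape=(n_rows, n_cols))

row = 0
for fname in filenames:                          # placeholder list of your input files
    chunk = np.loadtxt(fname, dtype=np.int64)    # assumes whitespace-separated integer rows
    arr[row:row + len(chunk)] = chunk            # write in place instead of appending
    row += len(chunk)
arr.flush()                                      # push everything to disk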
Another option would be to store your data in an HDF5 file, for example using PyTables or H5py. HDF5 allows the data to be compressed on disk, and PyTables includes some very fast methods to perform mathematical operations on large disk-based arrays.
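For the HDF5 route, a minimal h5py sketch (the dataset name, chunk shape and compression settings are just examples) could look like this:

import numpy as np
import h5py

with h5py.File("data.h5", "w") as f:
    # chunked, gzip-compressed dataset; chunking is what makes compression and partial I/O possible
    dset = f.create_dataset("values", shape=(70_000_000, 10), dtype="i8",
                            chunks=(100_000, 10), compression="gzip")
    dset[:100_000] = np.random.randint(0, 1000, size=(100_000, 10))   # write one block

with h5py.File("data.h5", "r") as f:
    block = f["values"][1_000_000:1_010_000]    # read an arbitrary slice without loading the rest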
There seem to be many choices for Python to interface with SQLite (sqlite3, atpy) and HDF5 (h5py, pyTables). I wonder if anyone has experience using these together with numpy arrays or data tables (structured/record arrays), and which of them integrates most seamlessly with "scientific" modules (numpy, scipy) for each data format (SQLite and HDF5).
Most of it depends on your use case.
I have a lot more experience dealing with the various HDF5-based methods than traditional relational databases, so I can't comment too much on SQLite libraries for python...
At least as far as h5py vs pyTables, they both offer very seamless access via numpy arrays, but they're oriented towards very different use cases.
If you have n-dimensional data that you want to quickly access an arbitrary index-based slice of, then it's much simpler to use h5py. If you have data that's more table-like, and you want to query it, then pyTables is a much better option.
h5py is a relatively "vanilla" wrapper around the HDF5 libraries compared to pyTables. This is a very good thing if you're going to be regularly accessing your HDF file from another language (pyTables adds some extra metadata). h5py can do a lot, but for some use cases (e.g. what pyTables does) you're going to need to spend more time tweaking things.
pyTables has some really nice features. However, if your data doesn't look much like a table, then it's probably not the best option.
To give a more concrete example, I work a lot with fairly large (tens of GB) 3- and 4-dimensional arrays of data. They're homogeneous arrays of floats, ints, uint8s, etc. I usually want to access a small subset of the entire dataset. h5py makes this very simple, and does a fairly good job of auto-guessing a reasonable chunk size. Grabbing an arbitrary chunk or slice from disk is much, much faster than for a simple memmapped file. (Emphasis on arbitrary... Obviously, if you want to grab an entire "X" slice, then a C-ordered memmapped array is impossible to beat, as all the data in an "X" slice are adjacent on disk.)
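A tiny sketch of that access pattern with h5py (sizes shrunk so it is cheap to run; chunks=True asks h5py to guess a chunk shape, as mentioned above):

import h5py

with h5py.File("volume.h5", "w") as f:
    # h5py picks a chunk shape automatically with chunks=True
    f.create_dataset("vol", shape=(512, 512, 512), dtype="f4", chunks=True)

with h5py.File("volume.h5", "r") as f:
    sub = f["vol"][100:140, 200:240, 300:340]   # arbitrary 3D sub-block, read directly from disk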
As a counter example, my wife collects data from a wide array of sensors that sample at minute to second intervals over several years. She needs to store and run arbitrary queries (and relatively simple calculations) on her data. pyTables makes this use case very easy and fast, and still has some advantages over traditional relational databases. (Particularly in terms of disk usage and the speed at which a large (index-based) chunk of data can be read into memory.)
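And a minimal pyTables sketch of that kind of query (the table layout and the condition are invented for illustration):

import numpy as np
import tables as tb

rows = np.zeros(1_000, dtype=[("sensor_id", "i4"), ("timestamp", "f8"), ("value", "f4")])

with tb.open_file("sensors.h5", mode="w") as f:
    table = f.create_table(f.root, "readings", description=rows.dtype)
    table.append(rows)                 # bulk-append a structured array
    table.flush()
    # out-of-core query: only the matching rows are brought into memory
    hot = table.read_where("(sensor_id == 3) & (value > 0.5)")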