Say I have a Torch tensor of integers in a small range 0,...,R (e.g., R=31).
I want to store to disk in compressed form in a way that is close to the entropy of the vector.
The compression techniques I know (e.g., Huffman and arithmetic coding) all seem to be serial in nature.
Is there a fast Torch entropy compression implementation?
I'm happy to use an off the shelf implementation, but I can also try to implement myself if someone knows a suitable algorithm.
torch.save will store it with pickle protocol.
If you want to save space, quantizing these vectors before saving should help.
Also, you can try the zlib module:
https://github.com/jonathantompson/torchzlib
One alternative is to convert it to a numpy array and then use some of the compression methods available there.
Refer to:
Compress numpy arrays efficiently
For what you described, you can simply pack five-bit integers into a bit stream. It's easy to compress and decompress with the shift, OR, and AND bitwise operators (<<, >>, |, &). That would be as good as you can do if your integers are uniformly distributed in 0..31 and there are no repeated patterns.
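For example, a minimal numpy sketch of that packing (the pack5/unpack5 helper names are just illustrative, and it assumes a 1-D array of values that really are in 0..31):

import numpy as np

def pack5(values):
    # lay out the 5 low bits of each value, then pack the bit stream into bytes
    bits = np.unpackbits(values.astype(np.uint8)[:, None], axis=1,
                         count=5, bitorder='little')
    return np.packbits(bits.ravel(), bitorder='little')

def unpack5(packed, n):
    # undo the packing: take the first 5*n bits and rebuild one byte per value
    bits = np.unpackbits(packed, bitorder='little')[:5 * n].reshape(n, 5)
    return np.packbits(bits, axis=1, bitorder='little')[:, 0]

This gives exactly 5 bits per value (plus at most 7 padding bits at the end), which matches the entropy of a uniform distribution over 0..31.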
If, on the other hand, the distribution of your integers is significantly skewed or there are repeated patterns, then you should use an existing lossless compressor, such as zlib, zstd, or lzma2 (xz). For any of those, feed them one integer per byte.
To parallelize the computation, you can break up your 2^25 integers into many small subsets, each of which can be compressed independently. You could go down to a few tens of kilobytes each, likely with little overhead or loss of compression. You will need to experiment with your data.
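A rough sketch of that chunked approach (the chunk size is arbitrary; each block becomes an independent zlib stream, so blocks can be handed to worker processes):

import zlib
import numpy as np
from multiprocessing import Pool

CHUNK = 64 * 1024                      # bytes per independently compressed block

def compress_block(block):
    return zlib.compress(block, 9)

def compress_parallel(values):
    raw = values.astype(np.uint8).tobytes()            # one integer per byte
    blocks = [raw[i:i + CHUNK] for i in range(0, len(raw), CHUNK)]
    with Pool() as pool:                               # blocks are independent
        return pool.map(compress_block, blocks)

To decompress, apply zlib.decompress to each block and concatenate the results. Wrap the call in an if __name__ == '__main__': guard on platforms that start worker processes with spawn.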
Calculate FFT with 16 GB memory; the process runs out of memory.
print(data_size)
freqs, times, spec_arr = signal.spectrogram(data, fs=samp_rate, nfft=1024, return_onesided=False, axis=0, scaling='spectrum', mode='magnitude')
Output as below:
537089518
Killed
How can I calculate the FFT of large data with an existing Python package?
A more general solution is to do it yourself. 1D FFTs can be split into smaller ones thanks to the well-known Cooley–Tukey FFT algorithm and multidimensional decomposition. For more information about this strategy, please read The Design and Implementation of FFTW3. You can do the operation in virtually mapped memory to make this easier. Some libraries/packages like FFTW enable you to perform fast in-place FFTs relatively easily. You may need to write your own Python package or to use Cython so as not to allocate additional memory that is not memory mapped.
One alternative solution is to save your data in HDF5 (for example using h5py), then use out_of_core_fft, and then read the file again. But be aware that this package is a bit old and appears to no longer be maintained.
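As a rough sketch of the out-of-core idea with h5py and scipy (the file and dataset names, sampling rate, and chunk size are assumptions, and segments that straddle a chunk boundary are simply dropped here):

import h5py
import numpy as np
from scipy import signal

samp_rate = 1_000_000            # assumed sampling rate
nperseg = 1024
chunk = 8 * 1024 * 1024          # samples per chunk; tune to the available RAM

with h5py.File('data.h5', 'r') as fin, h5py.File('spec.h5', 'w') as fout:
    dset = fin['samples']        # 1-D dataset written earlier
    out = None
    for start in range(0, dset.shape[0], chunk):
        block = dset[start:start + chunk]              # only this slice is in RAM
        freqs, times, spec = signal.spectrogram(
            block, fs=samp_rate, nperseg=nperseg, mode='magnitude')
        if out is None:
            # resizable output dataset, extended chunk by chunk
            out = fout.create_dataset('spectrogram', data=spec,
                                      maxshape=(spec.shape[0], None),
                                      compression='gzip')
        else:
            out.resize(out.shape[1] + spec.shape[1], axis=1)
            out[:, -spec.shape[1]:] = spec             # append along the time axis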
I'm looking for the best 64-bit (or at least 32-bit) hash function for NumPy that has the following properties:
It is vectorized for numpy, meaning that it should have functions for hashing all elements of any N-D numpy array.
It can be applied to any hashable numpy dtype. For this it is enough for the hash to be able to process a raw block of bytes.
It is very fast, like xxhash. In particular it should be fast for a lot of small inputs, like a huge array of 32- or 64-bit numbers or short np.str_ values, but it should also handle other dtypes.
It should be collision-resistant. I may use just some of the bits, so any subset of the hash's bits should be collision-resistant too.
It may (or may not) be non-cryptographic, meaning that it is all right if it can sometimes be inverted, like xxhash.
It should produce a 64-bit integer or larger output; 32-bit is still OK, although less preferable. It would be good to be able to choose hash sizes of 32, 64, or 128 bits.
It should convert the numpy array to bytes internally so that hashing is fast, or at least there may already be a numpy function that converts a whole N-D array of any popular dtype to a sequence of bytes; it would be good if someone could tell me about that.
I would use xxhash, mentioned by the link above, if it were vectorized for numpy arrays. But right now it is single-object only: its binding functions accept just one block of bytes per call and produce one integer output. And xxhash uses just a few CPU cycles per call on small (4, 8 bytes) inputs, so a pure-Python loop over a large array calling xxhash for every number would be very inefficient.
I need it for different things. One is probabilistic existence filters (or sets), i.e. I need to design a structure (set) that answers, with a given probability (for a given number N of elements), whether a requested element is probably in the set or not. For that I want to use the lower bits of the hash to spread inputs across K buckets, and each bucket additionally stores some (tweakable) number of higher bits to increase the probability of good answers. Another application is a Bloom filter. I need this set to be very fast for adding and querying, to be as compact as possible in memory, and to handle a very large number of elements.
If there is no existing good solution, then maybe I can improve the xxhash library and create a pull request to the author's repository.
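Not an existing library, but to make the requirement concrete, here is a rough sketch of what "vectorized hashing" could look like with plain numpy uint64 arithmetic: a SplitMix64-style finalizer applied to an 8-byte view of the array. This is weaker than xxhash and only handles 8-byte elements; it is just an illustration of the vectorization idea, not a proposed solution:

import numpy as np

def splitmix64(a):
    # view the (contiguous, 8-byte-element) array as uint64 and mix each element;
    # all arithmetic wraps modulo 2**64 on uint64 arrays
    x = np.ascontiguousarray(a).view(np.uint64).copy()
    x += np.uint64(0x9E3779B97F4A7C15)
    x ^= x >> np.uint64(30)
    x *= np.uint64(0xBF58476D1CE4E5B9)
    x ^= x >> np.uint64(27)
    x *= np.uint64(0x94D049BB133111EB)
    return x ^ (x >> np.uint64(31))

hashes = splitmix64(np.arange(10, dtype=np.int64))   # one 64-bit hash per element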
I tried various methods to do data compression when saving numpy arrays to disk.
These 1D arrays contain sampled data at a certain sampling rate (this can be sound recorded with a microphone, or any other measurement with any sensor): the data is essentially continuous (in a mathematical sense; of course after sampling it is now discrete data).
I tried with HDF5 (h5py) :
f.create_dataset("myarray1", data=myarray, compression="gzip", compression_opts=9)
but this is quite slow, and the compression ratio is not the best we can expect.
I also tried with
numpy.savez_compressed()
but once again it may not be the best compression algorithm for such data (described before).
What would you choose for a better compression ratio on a numpy array, with such data?
(I thought about things like lossless FLAC (initially designed for audio), but is there an easy way to apply such an algorithm to numpy data?)
What I do now:
import gzip
import numpy
f = gzip.GzipFile("my_array.npy.gz", "w")
numpy.save(file=f, arr=my_array)
f.close()
Noise is incompressible. Thus, any part of the data that is noise will go into the compressed data 1:1 regardless of the compression algorithm, unless you discard it somehow (lossy compression). If you have 24 bits per sample with an effective number of bits (ENOB) of 16, the remaining 24 - 16 = 8 bits of noise will limit your maximum lossless compression ratio to 3:1, even if your (noiseless) data is perfectly compressible. Non-uniform noise is compressible to the extent to which it is non-uniform; you probably want to look at the effective entropy of the noise to determine how compressible it is.
Compressing data is based on modelling it (partly to remove redundancy, but also partly so you can separate signal from noise and discard the noise). For example, if you know your data is bandwidth-limited to 10 MHz and you're sampling at 200 MHz, you can do an FFT, zero out the high frequencies, and store the coefficients for the low frequencies only (in this example: 10:1 compression). There is a whole field called "compressive sensing" which is related to this.
A practical suggestion, suitable for many kinds of reasonably continuous data: denoise -> bandwidth limit -> delta compress -> gzip (or xz, etc). Denoise could be the same as bandwidth limit, or a nonlinear filter like a running median. Bandwidth limit can be implemented with FIR/IIR. Delta compress is just y[n] = x[n] - x[n-1].
EDIT: An illustration:
import numpy
import numpy.random
import os.path
import subprocess

# create 1M data points of a 24-bit sine wave with 8 bits of gaussian noise (ENOB = 16)
N = 1000000
data = (numpy.sin(2 * numpy.pi * numpy.linspace(0, N, N) / 100) * (1 << 23) +
        numpy.random.randn(N) * (1 << 7)).astype(numpy.int32)

numpy.save('data.npy', data)
print(os.path.getsize('data.npy'))
# 4000080 uncompressed size

subprocess.call('xz -9 data.npy', shell=True)
print(os.path.getsize('data.npy.xz'))
# 1484192 compressed size
# 11.87 bits per sample, ~8 bits of that is noise

# drop the 8 noise bits (lossy quantization)
data_quantized = data // (1 << 8)
numpy.save('data_quantized.npy', data_quantized)
subprocess.call('xz -9 data_quantized.npy', shell=True)
print(os.path.getsize('data_quantized.npy.xz'))
# 318380
# still have 16 bits of signal, but it only takes 2.55 bits per sample to store it
Saving to HDF5 with compression can be very quick and efficient: it all depends on the compression algorithm, and on whether you want it to be quick while saving, while reading it back, or both. And, naturally, on the data itself, as was explained above.
GZIP tends to be somewhere in between, but with a low compression ratio. BZIP2 is slow on both sides, although with a better ratio. BLOSC is one of the algorithms I have found to give quite good compression while being quick on both ends. The downside of BLOSC is that it is not supported by all implementations of HDF5, so your program may not be portable.
You always need to make, at least some, tests to select the best configuration for your needs.
What constitutes the best compression (if any) highly depends on the nature of the data. Many kinds of measurement data are virtually incompressible if lossless compression is required.
The pytables docs contain a lot of useful guidelines on data compression. They also detail speed tradeoffs and so on; higher compression levels are usually a waste of time, as it turns out.
http://pytables.github.io/usersguide/optimization.html
Note that this is probably as good as it will get. For integer measurements, a combination of a shuffle filter with a simple zip-type compression usually works reasonably well. This filter very efficiently exploits the common situation where the most significant byte is usually 0, and is only included to guard against overflow.
You might want to try blz. It can compress binary data very efficiently.
import blz
# this stores the array in memory
blz.barray(myarray)
# this stores the array on disk
blz.barray(myarray, rootdir='arrays')
It stores arrays either on disk or compressed in memory. Compression is based on blosc.
See the scipy video for a bit of context.
First, for general data sets, the shuffle=True argument to create_dataset improves compression dramatically with roughly continuous datasets. It very cleverly rearranges the bits to be compressed so that (for continuous data) the bits change slowly, which means they can be compressed better. In my experience it slows the compression down very slightly, but it can substantially improve the compression ratios. It is not lossy, so you really do get the same data out as you put in.
If you don't care about the accuracy so much, you can also use the scaleoffset argument to limit the number of bits stored. Be careful, though, because this is not what it might sound like. In particular, it is an absolute precision, rather than a relative precision. For example, if you pass scaleoffset=8 but your data points are smaller than 1e-8, you'll just get zeros. Of course, if you've scaled the data to max out around 1, and don't think you can hear differences smaller than a part in a million, you can pass scaleoffset=6 and get great compression without much work.
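As a rough h5py sketch of how those options combine (the dataset names and compression level are arbitrary):

import h5py
import numpy as np

data = np.random.randn(1_000_000).astype(np.float32)   # stand-in for your samples

with h5py.File('samples.h5', 'w') as f:
    # lossless: shuffle + gzip usually compresses smooth data much better
    f.create_dataset('lossless', data=data, compression='gzip',
                     compression_opts=4, shuffle=True)
    # lossy: keep about 6 digits of absolute precision, then compress
    f.create_dataset('truncated', data=data, compression='gzip', scaleoffset=6)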
But for audio specifically, I expect that you are right in wanting to use FLAC, because its developers have put in huge amounts of thought, balancing compression with preservation of distinguishable details. You can convert to WAV with scipy, and thence to FLAC.
There are times when you have to perform many intermediate operations on one, or more, large Numpy arrays. This can quickly result in MemoryErrors. In my research so far, I have found that pickling (pickle, cPickle, PyTables, etc.) and gc.collect() are ways to mitigate this. I was wondering if there are any other techniques experienced programmers use when dealing with large quantities of data (other than removing redundancies in your strategy/code, of course).
Also, if there's one thing I'm sure of, it's that nothing is free. With some of these techniques, what are the trade-offs (i.e., speed, robustness, etc.)?
I feel your pain... You sometimes end up storing several times the size of your array in values you will later discard. When processing one item in your array at a time, this is irrelevant, but can kill you when vectorizing.
I'll use an example from work for illustration purposes. I recently coded the algorithm described here using numpy. It is a color map algorithm, which takes an RGB image, and converts it into a CMYK image. The process, which is repeated for every pixel, is as follows:
Use the most significant 4 bits of every RGB value, as indices into a three-dimensional look up table. This determines the CMYK values for the 8 vertices of a cube within the LUT.
Use the least significant 4 bits of every RGB value to interpolate within that cube, based on the vertex values from the previous step. The most efficient way of doing this requires computing 16 arrays of uint8s the size of the image being processed. For a 24-bit RGB image, that is equivalent to needing 6x the storage of the image to process it.
A couple of things you can do to handle this:
1. Divide and conquer
Maybe you cannot process a 1,000x1,000 array in a single pass. But if you can do it with a Python for loop iterating over 10 arrays of 100x1,000, it is still going to beat, by a very wide margin, a Python loop over 1,000,000 items! It's going to be slower than a single vectorized pass, yes, but not by nearly as much.
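A rough sketch of that pattern (the block size and the process function are placeholders for your real computation):

import numpy as np

def process(block):
    # stand-in for the real vectorized per-block computation
    return np.sqrt(block.astype(np.float64)) * 2.0

image = np.random.randint(0, 256, size=(1000, 1000), dtype=np.uint8)
out = np.empty(image.shape, dtype=np.float64)

step = 100                                   # rows per pass; tune to your memory budget
for r in range(0, image.shape[0], step):
    # temporaries are only ever allocated for one slab of rows at a time
    out[r:r + step] = process(image[r:r + step])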
2. Cache expensive computations
This relates directly to my interpolation example above, and is harder to come across, although it is worth keeping an eye open for. Because I am interpolating on a three-dimensional cube with 4 bits in each dimension, there are only 16x16x16 possible outcomes, which can be stored in 16 arrays of 16x16x16 bytes. So I can precompute them, store them using 64KB of memory, and look up the values one by one for the whole image, rather than redoing the same operations for every pixel at huge memory cost. This already pays off for images as small as 64x64 pixels, and basically allows processing images with 6x the number of pixels without having to subdivide the array.
3. Use your dtypes wisely
If your intermediate values can fit in a single uint8, don't use an array of int32s! This can turn into a nightmare of mysterious errors due to silent overflows, but if you are careful, it can provide a big saving of resources.
First, most important trick: allocate a few big arrays, and use and recycle portions of them, instead of bringing into life and discarding/garbage collecting lots of temporary arrays. It sounds a little old-fashioned, but with careful programming the speed-up can be impressive. (You have better control of alignment and data locality, so numeric code can be made more efficient.) A sketch of the first two tricks follows the third point below.
Second: use numpy.memmap and hope that OS caching of accesses to the disk is efficient enough.
Third: as pointed out by @Jaime, work on block sub-matrices if the whole matrix is too big.
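A rough sketch of the first two tricks (the file name and sizes are arbitrary):

import numpy as np

N = 1_000_000
a = np.random.rand(N)
b = np.random.rand(N)

# Trick 1: preallocate once and reuse the buffer via out=, so no temporaries
scratch = np.empty(N)
np.multiply(a, b, out=scratch)      # scratch = a * b, no new allocation
np.add(scratch, a, out=scratch)     # scratch = a * b + a, same buffer again

# Trick 2: keep a huge array on disk and let the OS page it in on demand
big = np.memmap('big.dat', dtype=np.float64, mode='w+', shape=(10_000, 1_000))
big[:100] = 1.0                     # only the touched pages actually hit RAM
big.flush()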
EDIT:
Avoid unnecessary list comprehensions, as pointed out in this answer on SE.
The dask.array library provides a numpy interface that uses blocked algorithms to handle larger-than-memory arrays with multiple cores.
You could also look into Spartan, Distarray, and Biggus.
If it is possible for you, use numexpr. For numeric calculations like a**2 + b**2 + 2*a*b (for a and b being arrays) it
will compile machine code that executes fast and with minimal memory overhead, taking care of memory locality (and thus cache optimization) if the same array occurs several times in your expression,
uses all cores of your dual- or quad-core CPU, and
is an extension to numpy, not an alternative.
For medium and large arrays, it is faster than numpy alone.
Take a look at the web page linked above; there are examples that will help you understand whether numexpr is for you.
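A minimal sketch of that expression with numexpr (the array size is arbitrary):

import numpy as np
import numexpr as ne

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)

# evaluated blockwise on all cores, without the large temporaries that plain
# numpy would allocate for a**2, b**2 and 2*a*b
result = ne.evaluate('a**2 + b**2 + 2*a*b')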
On top of everything said in the other answers: if we'd like to store all the intermediate results of a computation (we don't always need to keep every intermediate result in memory), we can also use accumulate from numpy, alongside its various types of aggregations:
Aggregates
For binary ufuncs, there are some interesting aggregates that can be computed directly from the object. For example, if we'd like to reduce an array with a particular operation, we can use the reduce method of any ufunc. A reduce repeatedly applies a given operation to the elements of an array until only a single result remains.
For example, calling reduce on the add ufunc returns the sum of all elements in the array:
x = np.arange(1, 6)
np.add.reduce(x) # Outputs 15
Similarly, calling reduce on the multiply ufunc results in the product of all array elements:
np.multiply.reduce(x) # Outputs 120
Accumulate
If we'd like to store all the intermediate results of the computation, we can instead use accumulate:
np.add.accumulate(x) # Outputs array([ 1, 3, 6, 10, 15], dtype=int32)
np.multiply.accumulate(x) # Outputs array([ 1, 2, 6, 24, 120], dtype=int32)
Wisely using these numpy operations while performing many intermediate operations on one, or more, large Numpy arrays can give you great results without the use of any additional libraries.
BACKGROUND
The issue I'm working with is as follows:
Within the context of an experiment I am designing for my research, I produce a large number of large (length 4M) arrays which are somewhat sparse, and thereby could be stored as scipy.sparse.lil_matrix instances, or simply as scipy.array instances (the space gain/loss isn't the issue here).
Each of these arrays must be paired with a string (namely a word) for the data to make sense, as they are semantic vectors representing the meaning of that string. I need to preserve this pairing.
The vectors for each word in a list are built one-by-one, and stored to disk before moving on to the next word.
They must be stored to disk in a manner which could be then retrieved with dictionary-like syntax. For example if all the words are stored in a DB-like file, I need to be able to open this file and do things like vector = wordDB[word].
CURRENT APPROACH
What I'm currently doing:
Using shelve to open a shelf named wordDB
Each time the vector (currently using lil_matrix from scipy.sparse) for a word is built, storing the vector in the shelf: wordDB[word] = vector
When I need to use the vectors during the evaluation, I'll do the reverse: open the shelf, and then recall vectors by doing vector = wordDB[word] for each word, as they are needed, so that not all the vectors need be held in RAM (which would be impossible).
The above 'solution' fits my needs in terms of solving the problem as specified. The issue is simply that when I wish to use this method to build and store vectors for a large number of words, I run out of disk space.
This is, as far as I can tell, because shelve pickles the data being stored, which is not an efficient way of storing large arrays, thus rendering this storage problem intractable with shelve for the number of words I need to deal with.
PROBLEM
The question is thus: is there a way of serializing my set of arrays which will:
Save the arrays themselves in compressed binary format akin to the .npy files generated by scipy.save?
Meet my requirement that the data be readable from disk as a dictionary, maintaining the association between words and arrays?
As JoshAdel already suggested, I would go for HDF5; the simplest way is to use h5py:
http://h5py.alfven.org/
You can attach several attributes to an array with a dictionary-like syntax:
dset.attrs["Name"] = "My Dataset"
where dset is your dataset, which can be sliced exactly like a numpy array, but in the background it does not load the whole array into memory.
I would suggest using scipy.save and keeping a dictionary mapping each word to the name of its file.
Have you tried just using cPickle to pickle the dictionary directly using:
import cPickle
DD = dict()
f = open('testfile.pkl','wb')
cPickle.dump(DD,f,-1)
f.close()
Alternatively, I would just save the vectors in a large multidimensional array using hdf5 or netcdf if necessary, since this allows you to open a large array without bringing it all into memory at once and then get slices as needed. You can then associate the words as an additional group in the netcdf4/hdf5 file and use the common indices to quickly associate the appropriate slice from each group, or just name the group after the word and have the data be the vector. You'd have to play around with which is more efficient.
http://netcdf4-python.googlecode.com/svn/trunk/docs/netCDF4-module.html
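A rough h5py sketch of the "name the dataset after the word" variant (the file name, word, and vector are made up; words containing '/' would need escaping, since '/' separates groups in HDF5):

import h5py
import numpy as np

# build phase: one compressed dataset per word
with h5py.File('wordDB.h5', 'a') as f:
    vec = np.random.rand(4_000_000).astype(np.float32)   # stand-in semantic vector
    f.create_dataset('apple', data=vec, compression='gzip', shuffle=True)

# evaluation phase: dictionary-like lookup, loading only the requested vector
with h5py.File('wordDB.h5', 'r') as f:
    vector = f['apple'][:]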
Pytables also might be a useful storage layer on top of HDF5:
http://www.pytables.org
Avoid using shelve; it's bug-ridden and has cross-platform issues.
The memory issue, however, has nothing to do with shelve. Numpy arrays provide efficient implementation of the pickle protocol and there is little memory overhead to cPickle.dumps(protocol=-1), compared to binary .npy (only the extra headers in pickle, basically).
So if binary/pickle isn't enough, you'll have to go for compression. Have a look at pytables or h5py (difference between the two).
If specifying the binary protocol in pickle is enough, you can consider something more lightweight than hdf5: check out sqlitedict for a replacement of shelve. It has no additional dependencies.
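A minimal sqlitedict sketch (the file name is arbitrary; values are pickled transparently, so numpy arrays round-trip as-is):

from sqlitedict import SqliteDict
import numpy as np

# build phase: behaves like a persistent dict backed by a single sqlite file
with SqliteDict('wordDB.sqlite', autocommit=True) as wordDB:
    wordDB['apple'] = np.random.rand(4_000_000).astype(np.float32)

# evaluation phase: only the requested entry is unpickled into memory
with SqliteDict('wordDB.sqlite') as wordDB:
    vector = wordDB['apple']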