Statistics / distributions over very large data sets - python

Looking at the discussions about simple statistics from a file of data, I wonder which of these techniques would scale best over very large data sets (~millions of entries, Gbytes of data).
Are numpy solutions that read the entire data set into memory appropriate here? See:
Binning frequency distribution in Python

You are not telling us which kind of data you have and what you want to calculate!
If you have something that is or is easily converted into positive integers of moderate size (e.g., 0..1e8), you may use bincount. Here is an example of how to make a distribution (histogram) of the byte values of all bytes in a very large file (works up to whatever your file system can manage):
import numpy as np
# number of bytes to read at a time
CHUNKSIZE = 100000000
# open the file
f = open("myfile.dat", "rb")
# cumulative distribution array: one counter per possible byte value
cum = np.zeros(256, dtype=np.int64)
# read through the file chunk by chunk
while True:
    chunkdata = np.frombuffer(f.read(CHUNKSIZE), dtype=np.uint8)
    # minlength=256 keeps the counts aligned with cum even if some
    # byte values are absent from a chunk
    cum += np.bincount(chunkdata, minlength=256)
    if len(chunkdata) < CHUNKSIZE:
        break
f.close()
This is very fast, the speed is really limited by the disk access. (I got approximately 1 GB/s with a file in the OS cache.)
Of course, you may want to calculate some other statistics (standard deviation, etc.), but even then you can usually derive those statistics from the distribution (histogram). However, if you do not need the distribution, there may be even faster methods: calculating the average, for instance, only requires summing all values and dividing by the count.
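For instance, the mean and standard deviation of the byte values can be recovered directly from the histogram built above (a sketch; cum is the counts array from the previous snippet):
import numpy as np

# cum[v] = number of bytes with value v (from the snippet above)
values = np.arange(256)
total = cum.sum()
mean = (values * cum).sum() / float(total)              # weighted mean
var = (((values - mean) ** 2) * cum).sum() / float(total)
std = np.sqrt(var)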
If you have a text file, then the major challenge is parsing the file chunk by chunk. The standard tools (numpy.loadtxt and the csv module) are not necessarily very efficient with very large files.
If you have floating point numbers or very large integers, the method above does not work directly, but in some cases you may just use some bits of the FP numbers, or round to the closest integer, etc. In any case the question really boils down to what kind of data you actually have and which statistics you want to calculate; there is no Swiss Army knife that solves all statistics problems with huge files.
Reading the data into memory is a very good option if you have enough memory. In certain cases you can do it without having enough memory (use numpy.memmap). If you have a text file with 1 GB of floating point numbers, the end result may fit into less than 1 GB, and most computers can handle that very well. Just make sure you are using a 64-bit Python.
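As for parsing a text file chunk by chunk, here is a sketch that computes a running mean without ever holding the whole file in memory (it assumes one floating-point value per line and uses pandas, which is just one option):
import pandas as pd

total = 0.0
count = 0
# read_csv with chunksize yields DataFrames of at most that many rows,
# so only one chunk is held in memory at a time
for chunk in pd.read_csv("myfile.txt", header=None, chunksize=1000000):
    values = chunk.values.ravel()
    total += values.sum()
    count += values.size
mean = total / count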

Related

Pandas .to_csv taking long to save relatively large dataframe?

My df is ~4 GB in memory, of float16 dtype columns. I am trying to save to a CSV file using pd.to_csv but it is taking excessively long for a not-too-large data frame.
Any help is appreciated.
float16 is a pretty dense data type - each floating point number is stored in 16 bits, or 2 bytes.
Assuming the entire data frame is float16, that would mean your 4 GB data frame holds roughly 2 billion numbers.
By contrast, an ASCII character is 1 byte, and a floating point number of unspecified precision often requires many characters.
A quick estimate using pseudo-random numbers suggests that the average number of characters used to represent float16 values in text is between 5.5 and 6 characters each.
>>> import numpy as np
>>> np.mean([len(str(x)) for x in np.array(np.random.random(100), dtype=np.float16)])
5.68
>>> np.mean([len(str(x)) for x in np.array(np.random.random(100), dtype=np.float16)])
5.84
>>> np.mean([len(str(x)) for x in np.array(np.random.random(100), dtype=np.float16)])
5.57
So on average a float16 dataframe will require more than 3x as much disk space as CSV as it occupies in memory (each value also needs a comma or newline delimiter, adding one character on top of the digits).
For a 4 GB dataframe, you could easily be looking at a 12 GB CSV file without compression. Exactly how long such a file takes to write depends on many factors, including disk speed and compression options (compression reduces the amount of data written, but different algorithms have widely varying compression times). Your process could also be competing for resources with something else happening on the machine, leading to a further slowdown.
"Excessively long" is subjective and could be anywhere from a few minutes (which seems reasonable for a 12 GB file) to days depending on your definition. There aren't enough details in your question to determine what, if anything, the problem actually is.

When numbers are greater than Python's sys.maxint, do they require a lot more memory?

I am iterating over 80 million lines in a 2.5 GB file to create a list of offsets for the location of the start of each line. The memory slowly increases as expected until I hit around line 40 million, and then rapidly increases by 1.5 GB in 3-5 seconds before the process exits due to lack of memory.
After some investigation, I discovered that the blow-up occurs around the time when the current offset (curr_offset) reaches about 2 billion, which happens to be around my sys.maxint (2^31-1).
My questions are:
Do numbers greater than sys.maxint require substantially more memory to store? If so, why? If not, why would I be seeing this behavior?
What factors (e.g. which Python, which operating system) determine sys.maxint?
On my 2010 MacBook Pro using 64-bit Python, sys.maxint is 2^63-1.
On my Windows 7 laptop using 64-bit IronPython, sys.maxint is the smaller 2^31-1. Same with 32-bit Python. For various reasons, I can't get 64-bit Python on my Windows machine right now.
Is there a better way to create this list of offsets?
The code in question:
f = open('some_file', 'rb')
curr_offset = 0
offsets = []
for line in f:
    offsets.append(curr_offset)
    curr_offset += len(line)
f.close()
Integers larger than sys.maxint require more memory because they're stored as arbitrary-precision longs. If your sys.maxint is only 2^31-1, you're using a 32-bit build: download, install, and use a 64-bit build, and you'll avoid the problem. Your code looks fine!
Here is a solution that works even with 32-bit Python versions: store the lengths of the lines (they are small), convert into a NumPy array of 64-bit integers, and only then calculate offsets:
import numpy
with open('some_file', 'rb') as input_file:
    lengths = [len(line) for line in input_file]
offsets = numpy.array(lengths, dtype=numpy.uint64).cumsum()
where cumsum() calculates the cumulative sum of line lengths. 80 M lines will give a manageable 8*80 = 640 MB array of offsets.
The building of the lengths list can even be bypassed by building the array of lengths with numpy.fromiter():
import numpy
with open('some_file', 'rb') as input_file:
    offsets = numpy.fromiter((len(line) for line in input_file), dtype=numpy.uint64).cumsum()
I suspect it would be hard to find a much faster method: using a single numeric type (64-bit integers) makes NumPy arrays more compact and faster to process than Python lists.
An offset in a 2.5 GB file should never need to be more than eight bytes. Indeed, a signed 64-bit integer is maximally 9223372036854775807, much greater than 2.5G.
If you have 80 million lines, you should require no more than 640 MB to store an array of 80M offsets.
I would investigate to see if something is buggy with your code or with Python, perhaps using a different container (perhaps an explicit numpy array of 64-bit integers), using a preinitialized list, or even a different language altogether to store and process your offsets, such as an off_t in C, with appropriate compile flags.
(If you're curious and want to look at demo code, I wrote a program in C called sample that stores 64-bit offsets to newlines in an input file, to be able to do reservoir sampling on a scale larger than GNU sort.)
Appending to a list will reallocate the list's buffer once it outgrows the current capacity. I don't know exactly how Python does it, but a common strategy is to allocate 1.5x to 2x the previous buffer size. Because the buffer grows geometrically, it's normal to see the memory requirement jump sharply near the end. It might also be that the list is simply too large overall; a quick test would be to replace curr_offset += len(line) with curr_offset += 1 and see if you get the same behavior.
If you really can't use a 64-bit version of Python, you can hold your calculated offsets in a NumPy array of numpy.uint64 numbers (maximum value of 2**64-1). This is a little inconvenient, because you have to dynamically extend the array when it reaches capacity, but this would solve your problem and would run on any version of Python.
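A minimal sketch of that grow-on-demand uint64 array (the initial capacity and doubling factor are arbitrary choices):
import numpy as np

capacity = 1024
offsets = np.empty(capacity, dtype=np.uint64)
n = 0
curr_offset = 0
with open('some_file', 'rb') as f:
    for line in f:
        if n == capacity:
            # double the capacity whenever the array is full
            capacity *= 2
            offsets = np.resize(offsets, capacity)
        offsets[n] = curr_offset
        n += 1
        curr_offset += len(line)
offsets = offsets[:n]   # trim the unused tail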
PS: A more convenient NumPy-based solution, that does not require dynamically updating the size of the NumPy array of offsets, is given in my other answer.
Yes. Above that threshold, Python represents integers as arbitrary-precision longs (bignums), and those take more space.
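You can see this directly on a Python 2 interpreter (the exact byte counts vary by build; the point is that longs grow with magnitude):
import sys

print(sys.getsizeof(sys.maxint))         # fixed-size int object
print(sys.getsizeof(sys.maxint + 1))     # promoted to an arbitrary-precision long
print(sys.getsizeof(sys.maxint ** 4))    # more digits, more bytes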

Working with very large arrays - Numpy

My situation is like this:
I have ~70 million integer values distributed across various files for ~10 categories of data (the exact number is not known).
I read those files and create a Python object with that data. This obviously involves reading each file line by line and appending to the object, so I'll have an array of 70 million subarrays with 10 values each.
I do some statistical processing on that data. This involves appending several values (say, a percentile rank) to each 'row' of data.
I store this object in a database.
Now I have never worked with data of this scale. My first instinct was to use Numpy for more efficient arrays w.r.t memory. But then I've heard that in Numpy arrays, 'append' is discouraged as it's not as efficient.
So what would you suggest I go with? Any general tips for working with data of this size? I can bring the data down to 20% of its size with random sampling if it's required.
EDIT: Edited for clarity about size and type of data.
If I understand your description correctly, your dataset will contain ~700 million integers. Even if you use 64-bit ints that would still only come to about 6GB. Depending on how much RAM you have and what you want to do in terms of statistical processing, your dataset sounds like it would be quite manageable as a normal numpy array living in core memory.
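If it does fit, the usual pattern is to preallocate the full array once and fill it row by row while reading, rather than appending (a sketch; the file name and line format here are hypothetical):
import numpy as np

n_rows, n_cols = 70000000, 10
table = np.empty((n_rows, n_cols), dtype=np.int64)

row = 0
with open('values.txt') as f:
    for line in f:
        table[row] = [int(tok) for tok in line.split()]
        row += 1
table = table[:row]   # trim if the files held fewer rows than expected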
If the dataset is too large to fit in memory, a simple solution might be to use a memory-mapped array (numpy.memmap). In most respects, an np.memmap array behaves like a normal numpy array, but instead of storing the whole dataset in system memory, it will be dynamically read from/written to a file on disk as required.
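A sketch of that approach (the file name and shape are hypothetical; note this creates a multi-gigabyte file on disk):
import numpy as np

# create a disk-backed array, fill it in slices, then reopen it read-only later
mm = np.memmap('data.bin', dtype=np.int64, mode='w+', shape=(70000000, 10))
mm[:1000] = 42            # writes go through to the file
mm.flush()

mm = np.memmap('data.bin', dtype=np.int64, mode='r', shape=(70000000, 10))
print(mm[:1000].mean())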
Another option would be to store your data in an HDF5 file, for example using PyTables or H5py. HDF5 allows the data to be compressed on disk, and PyTables includes some very fast methods to perform mathematical operations on large disk-based arrays.

Compress numpy arrays efficiently

I tried various methods of data compression when saving numpy arrays to disk.
These 1D arrays contain data sampled at a certain sampling rate (it can be sound recorded with a microphone, or any other measurement from any sensor): the data is essentially continuous (in a mathematical sense; of course, after sampling it is now discrete).
I tried with HDF5 (h5py) :
f.create_dataset("myarray1", data=myarray, compression="gzip", compression_opts=9)
but this is quite slow, and the compression ratio is not the best we can expect.
I also tried with
numpy.savez_compressed()
but once again it may not be the best compression algorithm for such data (described before).
What would you choose for a better compression ratio on a numpy array with such data?
(I thought about things like lossless FLAC, which was initially designed for audio, but is there an easy way to apply such an algorithm to numpy data?)
What I do now:
import gzip
import numpy
f = gzip.GzipFile("my_array.npy.gz", "w")
numpy.save(file=f, arr=my_array)
f.close()
Noise is incompressible. Thus, any part of your data that is noise will go into the compressed output 1:1 regardless of the compression algorithm, unless you discard it somehow (lossy compression). If you have 24 bits per sample with an effective number of bits (ENOB) of 16, the remaining 24 - 16 = 8 bits of noise limit your maximum lossless compression ratio to 3:1, even if your (noiseless) data is perfectly compressible. Non-uniform noise is compressible to the extent that it is non-uniform; you probably want to look at the effective entropy of the noise to determine how compressible it is.
Compressing data is based on modelling it (partly to remove redundancy, but also partly so you can separate from noise and discard the noise). For example, if you know your data is bandwidth limited to 10MHz and you're sampling at 200MHz, you can do an FFT, zero out the high frequencies, and store the coefficients for the low frequencies only (in this example: 10:1 compression). There is a whole field called "compressive sensing" which is related to this.
A practical suggestion, suitable for many kinds of reasonably continuous data: denoise -> bandwidth limit -> delta compress -> gzip (or xz, etc). Denoise could be the same as bandwidth limit, or a nonlinear filter like a running median. Bandwidth limit can be implemented with FIR/IIR. Delta compress is just y[n] = x[n] - x[n-1].
EDIT An illustration:
from pylab import *
import numpy
import numpy.random
import os.path
import subprocess
# create 1M data points of a 24-bit sine wave with 8 bits of gaussian noise (ENOB=16)
N = 1000000
data = (sin(2 * pi * linspace(0, N, N) / 100) * (1 << 23) +
        numpy.random.randn(N) * (1 << 7)).astype(numpy.int32)
numpy.save('data.npy', data)
print(os.path.getsize('data.npy'))
# 4000080 uncompressed size
subprocess.call('xz -9 data.npy', shell=True)
print(os.path.getsize('data.npy.xz'))
# 1484192 compressed size
# 11.87 bits per sample, ~8 bits of that is noise
data_quantized = data // (1 << 8)   # integer division drops the 8 noise bits
numpy.save('data_quantized.npy', data_quantized)
subprocess.call('xz -9 data_quantized.npy', shell=True)
print(os.path.getsize('data_quantized.npy.xz'))
# 318380
# still have 16 bits of signal, but only takes 2.55 bits per sample to store it
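The delta-compress step from the pipeline above is essentially a single np.diff call, and it is exactly reversible for integer data (a sketch with a made-up, slowly varying signal):
import numpy as np

x = (np.cumsum(np.random.randn(1000000)) * 100).astype(np.int32)  # hypothetical slowly varying signal
first = x[0]
delta = np.diff(x)               # y[n] = x[n] - x[n-1]: small values compress well

# reconstruction: first sample plus the cumulative sum of the differences
x_back = np.concatenate(([first], first + np.cumsum(delta)))
assert np.array_equal(x, x_back)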
Saving to HDF5 with compression can be very quick and efficient: it all depends on the compression algorithm, on whether you need it to be quick while saving, while reading it back, or both, and, naturally, on the data itself, as explained above.
GZIP tends to be somewhere in between, but with a low compression ratio. BZIP2 is slow on both sides, although with a better ratio. BLOSC is one of the algorithms I have found to get quite good compression and be quick on both ends. The downside of BLOSC is that it is not available in all implementations of HDF5, so your program may not be portable.
You always need to run at least some tests to select the best configuration for your needs.
What constitutes the best compression (if any) highly depends on the nature of the data. Many kinds of measurement data are virtually completely incompressible, if loss-free compression is indeed required.
The PyTables docs contain a lot of useful guidelines on data compression. They also detail the speed tradeoffs and so on; higher compression levels usually turn out to be a waste of time.
http://pytables.github.io/usersguide/optimization.html
Note that this is probably as good as it gets. For integer measurements, a combination of the shuffle filter with a simple zip-type compression usually works reasonably well. This filter very efficiently exploits the common situation where the most significant byte is usually 0 and is only there to guard against overflow.
You might want to try blz. It can compress binary data very efficiently.
import blz
# this stores the array in memory
blz.barray(myarray)
# this stores the array on disk
blz.barray(myarray, rootdir='arrays')
It stores arrays either on disk or compressed in memory. Compression is based on blosc.
See the scipy video for a bit of context.
First, for general data sets, the shuffle=True argument to create_dataset improves compression dramatically for roughly continuous datasets. It very cleverly rearranges the bytes to be compressed so that (for continuous data) they change slowly, which means they can be compressed better. In my experience it slows compression down only a little, but can substantially improve the compression ratio. It is not lossy, so you really do get the same data out as you put in.
If you don't care about the accuracy so much, you can also use the scaleoffset argument to limit the number of bits stored. Be careful, though, because this is not what it might sound like: in particular, it is an absolute precision, rather than a relative one. For example, if you pass scaleoffset=8 but your data points are smaller than 1e-8, you'll just get zeros. Of course, if you've scaled the data to max out around 1 and don't think you can hear differences smaller than a part in a million, you can pass scaleoffset=6 and get great compression without much work.
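A sketch of both options (the file and dataset names are hypothetical; scaleoffset=6 here means roughly six decimal digits of absolute precision for float data, which is lossy):
import numpy as np
import h5py

myarray = np.sin(np.linspace(0, 1000, 1000000)).astype(np.float32)
with h5py.File('compressed.h5', 'w') as f:
    # lossless: byte-shuffle filter plus gzip
    f.create_dataset('lossless', data=myarray, compression='gzip', shuffle=True)
    # lossy: keep ~6 decimal digits, then shuffle and gzip
    f.create_dataset('lossy', data=myarray, compression='gzip',
                     shuffle=True, scaleoffset=6)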
But for audio specifically, I expect you are right to want FLAC, because its developers have put a huge amount of thought into balancing compression with the preservation of distinguishable details. You can convert to WAV with scipy, and thence to FLAC.

Compressing measurements data files

From measurements I get text files that basically contain a table of float numbers with dimensions 1000x1000. Those take up about 15 MB of space each, which, considering that I get about 1000 result files in a series, is unacceptable to save. So I am trying to compress them as much as possible without loss of data. My idea is to group the numbers into ~1000 steps over the range I expect and save those. That would provide sufficient resolution. However, I still have 1,000,000 points to consider, so my resulting file is still about 4 MB. I probably won't be able to compress that much further?
The bigger problem is the calculation time this takes. Right now I'd guesstimate 10-12 secs per file, so about 3 hrs for the 1000 files. WAAAAY too much. This is the algorithm I thought up; do you have any suggestions? There are probably far more efficient algorithms to do this, but I am not much of a programmer...
import numpy
data = numpy.genfromtxt('sample.txt', autostrip=True, case_sensitive=True)
out = numpy.empty((1000, 1000), numpy.int16)
i = 0
min = -0.5
max = 0.5
step = (max - min) / 1000
while i <= 999:
    j = 0
    while j <= 999:
        k = (data[i, j] // step)
        out[i, j] = k
        if data[i, j] > max:
            out[i, j] = 500
        if data[i, j] < min:
            out[i, j] = -500
        j = j + 1
    i = i + 1
numpy.savetxt('converted.txt', out, fmt="%i")
Thanks in advance for any hints you can provide!
Jakob
I see you store the numpy arrays as text files. There is a faster and more space-efficient way: just dump it.
If your floats can be stored as 32-bit floats, then use this:
data = numpy.genfromtxt('sample.txt',autostrip=True, case_sensitive=True)
data.astype(numpy.float32).dump(open('converted.numpy', 'wb'))
then you can read it with
data = numpy.load(open('converted.numpy', 'rb'))
The files will be 1000x1000x4 Bytes, about 4MB.
The latest versions of numpy support 16-bit floats. Maybe your floats will fit in their more limited range.
numpy.savez_compressed will let you save lots of arrays into a single compressed, binary file.
However, you aren't going to be able to compress it that much: if you have 15 GB of data, you're not magically going to fit it into 200 MB with compression algorithms. You have to throw out some of your data, and only you can decide how much you need to keep.
Use the zipfile, bz2 or gzip module to save to a zip, bz2 or gz file from python. Any compression scheme you write yourself in a reasonable amount of time will almost certainly be slower and have worse compression ratio than these generic but optimized and compiled solutions. Also consider taking eumiro's advice.
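A sketch combining those suggestions: do the binning with vectorized NumPy calls instead of the explicit Python loops, then write the int16 result as compressed binary through gzip (file names mirror the question; the bin range is the same -0.5..0.5 in 1000 steps):
import gzip
import numpy as np

data = np.genfromtxt('sample.txt', autostrip=True, case_sensitive=True)

lo, hi, steps = -0.5, 0.5, 1000
step = (hi - lo) / steps
# floor-divide into bins, clamp out-of-range values to the edge bins
out = np.clip(np.floor_divide(data, step), -500, 500).astype(np.int16)

with gzip.open('converted.npy.gz', 'wb') as f:
    np.save(f, out)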
