What is the most compact way of storing numpy data? - python

I have a large data set.
The best I could achieve is to use numpy arrays, dump them to a binary file, and then compress it:
import zlib
import numpy as np

my_array = np.array([1.0, 2.0, 3.0, 4.0])
compressed = zlib.compress(my_array.tobytes())  # compress the raw float64 bytes
With my real data, however, the binary file comes out at about 22 MB, so I am hoping to make it even smaller. I found that by default a 64-bit machine uses float64-backed Python floats, which take up 24 bytes in memory: 8 bytes for the double-precision value, 8 bytes for a pointer, and 8 bytes used by the garbage collector. If I change it to float32 I gain a lot in memory but lose precision; I am not sure if I want that. But what about the 8 bytes for the garbage collector, are they automatically stripped away when the data is saved?
Observations: I have already tried pickle, hickle and msgpack, but 22 MB is the best size I managed to reach.

An array of 46800 x 4 x 18 8-byte floats takes up 26,956,800 bytes. That's 25.7 MiB or 27.0 MB. A compressed size of 22 MB is an 18% reduction (or 14% if you really meant MiB), which is pretty good by most standards, especially for random-looking binary data. You are unlikely to improve on that much. Using a smaller datatype like float32, or perhaps trying to represent your data as rationals, may be useful.
Since you mention that you want to store metadata, you can record one byte for the number of dimensions (numpy allows at most 32 dimensions) and N integers for the size of each dimension (either 32 or 64 bit). Say you use 64-bit integers: that makes for 1 + 3 * 8 = 25 bytes of metadata in your particular case, or about 9 x 10^-5 % of the total array size.
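To make that layout concrete, here is a minimal sketch of such a header followed by a zlib-compressed payload. The save_compact/load_compact names and the float64 assumption are illustrative, not from the question:

import struct
import zlib
import numpy as np

def save_compact(arr, path):
    # 1 byte for ndim, 8 bytes per dimension, then the compressed raw data
    with open(path, "wb") as f:
        f.write(struct.pack("<B", arr.ndim))
        f.write(struct.pack(f"<{arr.ndim}q", *arr.shape))
        f.write(zlib.compress(arr.astype(np.float64).tobytes()))

def load_compact(path):
    with open(path, "rb") as f:
        ndim = struct.unpack("<B", f.read(1))[0]
        shape = struct.unpack(f"<{ndim}q", f.read(8 * ndim))
        data = np.frombuffer(zlib.decompress(f.read()), dtype=np.float64)
    return data.reshape(shape)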

Related

Pandas .to_csv taking long to save relatively large dataframe?

My df is ~4 GB in memory, made of float16 dtype columns. I am trying to save it to a CSV file using df.to_csv, but it is taking excessively long for a not-too-large data frame.
Any help is appreciated.
float16 is a pretty dense data type - each floating point number is stored in 16 bits, or 2 bytes.
Assuming the entire data frame is float16, that would mean your data frame has roughly 2 billion numbers in it.
By contrast, an ASCII character is 1 byte, and a floating point number of unspecified precision often requires many characters.
A quick estimate using pseudo-random numbers suggests that the average number of characters used to represent float16 values in text is between 5.5 and 6 characters each.
>>> import numpy as np
>>> np.mean([len(str(x)) for x in np.array(np.random.random(100), dtype=np.float16)])
5.68
>>> np.mean([len(str(x)) for x in np.array(np.random.random(100), dtype=np.float16)])
5.84
>>> np.mean([len(str(x)) for x in np.array(np.random.random(100), dtype=np.float16)])
5.57
So on average, a dataframe of float16 will require more than 3x more disk space as a CSV than the space it occupies in memory (remember that each number also needs a field or line delimiter, adding one character to each recorded value).
For a 4 GB dataframe, you could easily be looking at a 12 GB CSV file without compression. Exactly how long such a file takes to write depends on many factors, including disk speed and compression options (compressing the file reduces the amount of data written, but different compression algorithms have widely varying compression times). Your process could also be competing for resources with something else happening on the machine, leading to a further slowdown.
"Excessively long" is subjective and could be anywhere from a few minutes (which seems reasonable for a 12 GB file) to days depending on your definition. There aren't enough details in your question to determine what, if anything, the problem actually is.

How can I fix MemoryError on np.arange(0.01*1e10, 100*1e10, 0.5)?

I get a MemoryError when I run np.arange() with a large number like 1e10.
How can I fix the MemoryError on np.arange(0.01*1e10, 100*1e10, 0.5)?
arange returns a numpy array.
If we go from 0.01e10 to 100e10 with steps of 0.5, there are approximately 200e10 items in the array. As these numbers are double precision (64 bits, or 8 bytes each), you would need about 16 terabytes of RAM.
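A quick back-of-the-envelope check of that figure, assuming float64 elements:

import numpy as np

start, stop, step = 0.01 * 1e10, 100 * 1e10, 0.5
n_items = int((stop - start) / step)                    # roughly 2e12 elements
bytes_needed = n_items * np.dtype(np.float64).itemsize  # 8 bytes each
print(f"{n_items:.2e} elements -> {bytes_needed / 1e12:.0f} TB of RAM")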
The best idea would be to change your algorithm. For instance, if you are using it in a for loop, e.g.:
for t in np.arange(0.01*1e10, 100*1e10, 0.5):
    do_simulationstep(t)
Changing to range in Python 3 (or xrange in Python 2) means the values are produced lazily rather than the whole array being allocated up front.
for t in range(0.01*1e10, 100*1e10, 0.5):
    do_simulationstep(t)
As noted in the comments, however, this will not work: range only works with integers, so we have to scale the range to integers and then rescale the results:
for t in (x * 0.5 for x in range(int(1e8 / 0.5), int(1e12 / 0.5))):
    do_simulationstep(t)
However, should you really need such a large amount of memory, Amazon rents out servers that might support it:
EC2 In-Memory Processing Update: Instances with 4 to 16 TB of Memory + Scale-Out SAP HANA to 34 TB
You are trying to create an array of roughly 2e12 elements. If every element were a single byte, you would need approximately 2 TB of free memory to allocate it. You probably don't have that much RAM available, which is why you get the memory error.
Note: the array you are trying to allocate contains 8-byte floats, so it is even bigger. Do you really need so many elements?
Hope it helps,

Large numpy array handler, numpy data processing, memmap function mapping

Large numpy array (over 4 GB) with npy file and memmap function
I am using the numpy package for array calculations, and I read https://docs.scipy.org/doc/numpy/neps/npy-format.html
In "Format Specification: Version 2.0" it says that, for .npy files, the "version 2.0 format extends the header size to 4 GiB".
My questions are:
What is the header size? Does that mean I can only save a numpy array of at most 4 GB into an npy file? How large can a single array be?
I also read https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.memmap.html
where it stated that "Memory-mapped files cannot be larger than 2GB on 32-bit systems"
Does that mean numpy.memmap's limitation is based on the memory of the system? Is there any way to avoid that limitation?
Further, I read that we can choose the dtype of the array, where the best resolution is "complex128". Is there any way to "use" and "save" elements with more accuracy on a 64-bit computer (more accurate than complex128 or float64)?
The previous header size field was 16 bits wide, allowing headers smaller than 64KiB. Because the header describes the structure of the data, and doesn't contain the data itself, this is not a huge concern for most people. Quoting the notes, "This can be exceeded by structured arrays with a large number of columns." So to answer the first question, header size was under 64KiB but the data came after, so this wasn't the array size limit. The format didn't specify a data size limit.
Memory map capacity is dependent on operating system as well as machine architecture. Nowadays we've largely moved to flat but typically virtual address maps, so the program itself, stack, heap, and mapped files all compete for the same space, in total 4GiB for 32 bit pointers. Operating systems frequently partition this in quite large chunks, so some systems might only allow 2GiB total for user space, others 3GiB; and often you can map more memory than you can allocate otherwise. The memmap limitation is more closely tied to the operating system in use than the physical memory.
Non-flat address spaces, such as using distinct segments on OS/2, could allow larger usage. The cost is that a pointer is no longer a single word. PAE, for instance, supplies a way for the operating system to use more memory but still leaves processes with their own 32 bit limits. Typically it's easier nowadays to use a 64 bit system, allowing memory spaces up to 16 exabytes. Because data sizes have grown a lot, we also handle it in larger pieces, such as 4MiB or 16MiB allocations rather than the classic 4KiB pages or 512B sectors. Physical memory typically has more practical limits.
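For reference, a minimal np.memmap usage sketch on a 64-bit system (the file name and shape here are only placeholders):

import numpy as np

# create (or overwrite) an on-disk array that is paged in lazily rather than loaded into RAM
mm = np.memmap("big_array.dat", dtype=np.float64, mode="w+", shape=(100_000, 1_000))
mm[0, :] = 1.0   # writes go through the OS page cache and end up in the file
mm.flush()       # force pending changes out to disk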
Yes, there are elements with more precision than 64 bit floating point; in particular, 64 bit integers. This effectively uses a larger mantissa by sacrificing all of the exponent. Complex128 is two 64 bit floats, and doesn't have higher precision but a second dimension. There are types that can grow arbitrarily precise, such as Python's long integers (long in python 2, int in python 3) and fractions, but numpy generally doesn't delve into those because they also have matching storage and computation costs. A basic property of the arrays is that they can be addressed using index calculations since the element size is consistent.
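To illustrate those last points (a quick sketch, not tied to the original question):

import numpy as np
from fractions import Fraction

# a 64-bit integer keeps every integer up to 2**63 - 1 exactly,
# while float64 only represents integers exactly up to 2**53
print(np.int64(2**53 + 1))              # 9007199254740993, kept exactly
print(np.float64(2**53 + 1) == 2**53)   # True: the +1 is lost in double precision

# arbitrary-precision rationals in an object array: exact, but slow and not packed
a = np.array([Fraction(1, 3), Fraction(2, 3)], dtype=object)
print(a.sum())                          # exactly Fraction(1, 1)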

Memory needed for an item in a list

I have been learning Python for a year. In the lecture slides, I see that the compiler usually allocates 4 bytes to store items in a list.
Why can't the compiler allocate memory dynamically? For example, if I have a small value, why doesn't the compiler just assign 1 or 2 bytes to store it? Wouldn't that be more memory efficient?
What is meant by a 32-bit integer, where one bit is a sign bit and 31 bits represent the number? One byte is 8 bits and such an integer is 32 bits, so why would it be represented with 8 bytes, wasting 4 of them? By the way, on some computers you see 64-bit integers: they use 8 bytes, and the only effect is how large a number you can represent. More bits means a larger representable number; it comes down to your processor and its registers.
It is not true that elements in a list or array take 4 bytes.
In languages with strong static types (that is, in which you have to declare the type of a variable before using it), an array takes exactly the size of the array times the size of each variable of that type. In C, char types usually use one byte, and int types use a number of bytes that depends on the system's architecture.
In languages with dynamic types, like Python, all variables are implemented as objects containing pointers (to a location in memory where data is actually stored) and labels (indicating the type of the variable, which will be used in runtime to determine the variable's behavior when performing operations).
The 4-byte issue probably refers to the fact that many architectures use 4 bytes for integers, but this is coincidental and it is a stretch to consider it a rule.
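A quick way to see the difference on CPython (the exact byte counts vary by version and platform):

import sys
import numpy as np

print(sys.getsizeof(7))                        # a full Python int object: typically 28 bytes on 64-bit CPython
print(np.array([7], dtype=np.int32).itemsize)  # 4 bytes per element in a packed numpy array
print(sys.getsizeof([7] * 1000))               # the list itself only stores 8-byte pointers to the objects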

Python float precision float

I need to implement a dynamic programming algorithm to solve the Traveling Salesman Problem in time that beats brute-force search for computing distances between points. For this I need to index subproblems by size, and the value of each subproblem will be a float (the length of the tour). However, holding the array in memory will take about 6 GB of RAM if I use Python floats (which actually have double precision), so to try and halve that amount (I only have 4 GB of RAM) I need to use single-precision floats. However, I do not know how I can get single-precision floats in Python (I am using Python 3). Could someone please tell me where I can find them? (I was not able to find much on this on the internet.) Thanks.
EDIT: I notice that numpy also has a float16 type, which would allow for even more memory savings. The distances between points are around 10000, there are 25 unique points, and my answer needs to be accurate to the nearest integer. Will float16 provide enough accuracy, or do I need to use float32?
As a first step, you should use a NumPy array to store your data instead of a Python list.
As you correctly observe, a Python float uses double precision internally, and the double-precision value underlying a Python float can be represented in 8 bytes. But on a 64-bit machine, with the CPython reference implementation of Python, a Python float object takes a full 24 bytes of memory: 8 bytes for the underlying double-precision value, 8 bytes for a pointer to the object type, and 8 bytes for a reference count (used for garbage collection). There's no equivalent of Java's "primitive" types or .NET's "value" types in Python - everything is boxed. That makes the language semantics simpler, but means that objects tend to be fatter.
Now if we're creating a Python list of float objects, there's the added overhead of the list itself: one 8-byte object pointer per Python float (still assuming a 64-bit machine here). So in general, a list of n Python float objects is going to cost you over 32n bytes of memory. On a 32-bit machine, things are a little better, but not much: our float objects are going to take 16 bytes each, and with the list pointers we'll be using 20n bytes of memory for a list of floats of length n. (Caveat: this analysis doesn't quite work in the case that your list refers to the same Python float object from multiple list indices, but that's not a particularly common case.)
In contrast, a NumPy array of n double-precision floats (using NumPy's float64 dtype) stores its data in "packed" format in a single data block of 8n bytes, so allowing for the array metadata the total memory requirement will be a little over 8n bytes.
Conclusion: just by switching from a Python list to a NumPy array you'll reduce your memory needs by about a factor of 4. If that's still not enough, then it might make sense to consider reducing precision from double to single precision (NumPy's float32 dtype), if that's consistent with your accuracy needs. NumPy's float16 datatype takes only 2 bytes per float, but records only about 3 decimal digits of precision; I suspect that it's going to be close to useless for the application you describe.
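A rough sketch of the comparison, plus a check of float16 at the magnitudes mentioned in the edit (the exact byte counts are CPython implementation details):

import sys
import numpy as np

n = 1_000_000
py_floats = [float(i) for i in range(n)]
arr64 = np.arange(n, dtype=np.float64)
arr32 = np.arange(n, dtype=np.float32)

# the list stores n 8-byte pointers; each float object adds roughly another 24 bytes
print(sys.getsizeof(py_floats) + n * 24)   # around 32 MB in total
print(arr64.nbytes, arr32.nbytes)          # 8000000 and 4000000 bytes of packed data

# float16 spacing near 10000 is 8, so tour lengths around 1e4 cannot be
# resolved to the nearest integer in half precision
print(np.spacing(np.float16(10000)))       # 8.0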
You could try the c_float type from the ctypes standard library. Alternatively, if you are capable of installing additional packages you might try the numpy package. It includes the float32 type.
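For completeness, a tiny sketch of both options (the values and names are illustrative):

import ctypes
import numpy as np

x = ctypes.c_float(3.14159265358979)   # single-precision value boxed in a ctypes object
y = np.float32(3.14159265358979)       # numpy single-precision scalar
print(x.value, y)                      # both hold the value rounded to single precision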
