Pandas .to_csv taking long to save relatively large dataframe? - python

My df is ~4 GB in memory, of float16 dtype columns. I am trying to save it to a CSV file using df.to_csv, but it is taking excessively long for a not-too-large data frame.
Any help is appreciated.

float16 is a pretty dense data type - each floating point number is stored in 16 bits, or 2 bytes.
Assuming the entire data frame is float16, that would mean your data frame has roughly 2 billion numbers in it.
By contrast, an ASCII character is 1 byte, and a floating point number of unspecified precision often requires many characters.
A quick estimate using pseudo-random numbers suggests that the average number of characters used to represent float16 values as text is between 5.5 and 6 characters each.
>>> import numpy as np
>>> np.mean([len(str(x)) for x in np.array(np.random.random(100), dtype=np.float16)])
5.68
>>> np.mean([len(str(x)) for x in np.array(np.random.random(100), dtype=np.float16)])
5.84
>>> np.mean([len(str(x)) for x in np.array(np.random.random(100), dtype=np.float16)])
5.57
So on average, a dataframe of float16 will require more than 3x as much disk space written as a CSV as it occupies in memory (remember that each number also needs a column or line delimiter, adding one more character per recorded value).
For a 4 GB dataframe, you could easily be looking at a 12 GB CSV file without compression. Exactly how long such a file takes to write depends on many factors, including disk speed and compression options: compressing the output reduces the amount of data written, but different compression algorithms have widely varying compression times. Your process could also be competing for resources with something else happening on the machine, leading to a further slowdown.
"Excessively long" is subjective and could be anywhere from a few minutes (which seems reasonable for a 12 GB file) to days depending on your definition. There aren't enough details in your question to determine what, if anything, the problem actually is.

Related

What is the most compact way of storing numpy data?

I have large data set.
The best I could achieve is to use numpy arrays, turn them into raw bytes, and then compress those bytes:
import zlib
import numpy as np

my_array = np.array([1.0, 2.0, 3.0, 4.0])
compressed = zlib.compress(my_array.tobytes())
With my real data, however, the binary file becomes 22 MB in size, so I am hoping to make it even smaller. I found that by default 64-bit machines use float64, which takes up 24 bytes in memory: 8 bytes for the pointer to the value, 8 bytes for the double-precision value, and 8 bytes for the garbage collector. If I change it to float32 I gain a lot in memory but lose precision; I am not sure I want that, but what about the 8 bytes for the garbage collector, are they automatically stripped away?
Observations: I have already tried pickle, hickle and msgpack, but 22 MB is the best size I managed to reach.
An array with 46800 x 4 x 18 8-byte floats takes up 26956800 bytes. That's 25.7MiB or 27.0MB. A compressed size of 22MB is an 18% (or 14% if you really meant MiB) compression, which is pretty good by most standards, especially for random binary data. You are unlikely to improve on that much. Using a smaller datatype like float32, or perhaps trying to represent your data as rationals may be useful.
Since you mention that you want to store metadata, you can record a byte for the number of dimensions (numpy allows at most 32 dimensions), and N integers for the size in each dimension (either 32 or 64 bit). Let's say you use 64-bit integers. That makes for 25 bytes of metadata in your particular case (1 + 3 × 8), or roughly 10⁻⁴ % of the total array size.
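A minimal sketch of that layout, storing one byte for the number of dimensions and one 64-bit integer per dimension in front of the zlib-compressed element bytes; the pack/unpack helper names are mine, not part of any library:
import struct
import zlib
import numpy as np

def pack(arr):
    # 1 byte: number of dimensions; then one little-endian 64-bit int per dimension
    header = struct.pack("<B", arr.ndim) + struct.pack("<%dq" % arr.ndim, *arr.shape)
    # followed by the compressed raw element bytes
    return header + zlib.compress(arr.tobytes())

def unpack(blob, dtype=np.float64):
    ndim = struct.unpack_from("<B", blob, 0)[0]
    shape = struct.unpack_from("<%dq" % ndim, blob, 1)
    data = zlib.decompress(blob[1 + 8 * ndim:])
    return np.frombuffer(data, dtype=dtype).reshape(shape)

a = np.random.random((46800, 4, 18))
assert np.array_equal(unpack(pack(a)), a)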

Large Numpy array handler, numpy data processing, memmap function mapping

Large numpy array (over 4 GB) with .npy file and memmap function
I was using the numpy package for array calculations, and I read https://docs.scipy.org/doc/numpy/neps/npy-format.html
In "Format Specification: Version 2.0" it said that, for .npy file, "version 2.0 format extends the header size to 4 GiB".
My questions were:
What was the header size? Did that mean I could only save a numpy.array of at most 4 GB into an npy file? How large could a single array go?
I also read https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.memmap.html
where it stated that "Memory-mapped files cannot be larger than 2GB on 32-bit systems"
Did it mean numpy.memmap's limitation was based on the memory of the system? Was there any way to avoid such a limitation?
Further, I read that we could choose the dtype of the array, where the best resolution was "complex128". Was there any way to "use" and "save" elements with more accuracy on a 64-bit computer (more accurate than complex128 or float64)?
The previous header size field was 16 bits wide, allowing headers smaller than 64KiB. Because the header describes the structure of the data, and doesn't contain the data itself, this is not a huge concern for most people. Quoting the notes, "This can be exceeded by structured arrays with a large number of columns." So to answer the first question, header size was under 64KiB but the data came after, so this wasn't the array size limit. The format didn't specify a data size limit.
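As an aside, you can see just how small that header is with numpy's own format helpers; this is only a sketch, and the file name is arbitrary:
import numpy as np

np.save("a.npy", np.zeros((3, 4)))
with open("a.npy", "rb") as f:
    # the magic string encodes the format version, e.g. (1, 0)
    major, minor = np.lib.format.read_magic(f)
    # the header holds only shape, memory order and dtype -- not the data
    shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
print(major, minor, shape, fortran_order, dtype)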
Memory map capacity is dependent on operating system as well as machine architecture. Nowadays we've largely moved to flat but typically virtual address maps, so the program itself, stack, heap, and mapped files all compete for the same space, in total 4GiB for 32 bit pointers. Operating systems frequently partition this in quite large chunks, so some systems might only allow 2GiB total for user space, others 3GiB; and often you can map more memory than you can allocate otherwise. The memmap limitation is more closely tied to the operating system in use than the physical memory.
Non-flat address spaces, such as using distinct segments on OS/2, could allow larger usage. The cost is that a pointer is no longer a single word. PAE, for instance, supplies a way for the operating system to use more memory but still leaves processes with their own 32 bit limits. Typically it's easier nowadays to use a 64 bit system, allowing memory spaces up to 16 exabytes. Because data sizes have grown a lot, we also handle it in larger pieces, such as 4MiB or 16MiB allocations rather than the classic 4KiB pages or 512B sectors. Physical memory typically has more practical limits.
Yes, there are element types with more precision than 64-bit floating point; in particular, 64-bit integers, which effectively trade the entire exponent for a larger mantissa. complex128 is just two 64-bit floats: it doesn't have higher precision, only a second component. There are types that can grow arbitrarily precise, such as Python's long integers (long in Python 2, int in Python 3) and fractions, but numpy generally doesn't delve into those because they come with matching storage and computation costs. A basic property of numpy arrays is that elements can be addressed by index arithmetic, which requires a consistent element size.
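For the memmap part: on a 64-bit build, a file-backed array far larger than physical RAM is unremarkable. A small sketch, with the file name and sizes picked arbitrarily:
import numpy as np

# ~16 GB of float64 backed by a file; only the pages actually touched
# are materialized in memory
big = np.memmap("big.dat", dtype=np.float64, mode="w+", shape=(2_000_000_000,))
big[:10] = np.arange(10)
big.flush()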

Memory-efficient 2d growable array in python?

I'm working on an app that processes a lot of data.
.... and keeps running my computer out of memory. :(
Python has a huge amount of memory overhead on variables (as per sys.getsizeof()). A basic tuple with one integer in it takes up 56 bytes, for example. An empty list, 64 bytes. Serious overhead.
Numpy arrays are great for reducing overhead. But they're not designed to grow efficiently (see Fastest way to grow a numpy numeric array). Array (https://docs.python.org/3/library/array.html) seems promising, but it's 1d. My data is 2d, with an arbitrary number of rows and a column width of 3 floats (ideally float32) for one array, and a column width of two ints (ideally uint32) for the other. Obviously, using ~80 bytes of python structure to store 12 or 8 bytes of data per row is going to total my memory consumption.
Is the only realistic way to keep memory usage down in Python to "fake" 2d, aka by addressing the array as arr[row*WIDTH+column] and counting rows as len(arr)/WIDTH?
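The flat-indexing idea from the question looks roughly like this with the array module; this is just a sketch, and the names and the WIDTH of 3 are placeholders:
from array import array

WIDTH = 3                      # three float32 values per logical row
data = array("f")              # flat, packed storage, ~4 bytes per element

def append_row(x, y, z):
    data.extend((x, y, z))     # amortized O(1) growth, no per-row Python objects

def get(row, col):
    return data[row * WIDTH + col]

def num_rows():
    return len(data) // WIDTH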
Based on your comments, I'd suggest that you split your task into two parts:
1) In part 1, parse the JSON files using regexes and generate two CSV files in simple format: no headers, no spaces, just numbers. This should be quick and performant, with no memory issues: read text in, write text out. Don't try to keep anything in memory that you don't absolutely have to.
2) In part 2, use pandas read_csv() function to slurp in the CSV files directly. (Yes, pandas! You've probably already got it, and it's hella fast.)
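A hedged sketch of that two-part pipeline; the regex, the field layout and the file names are invented for illustration, since the original JSON format isn't shown:
import re
import pandas as pd

# part 1: stream the JSON, pull the three numbers per record out with a regex,
# and append them to a bare CSV -- nothing accumulates in memory
pattern = re.compile(r'"point":\s*\[([^\]]+)\]')
with open("records.json") as src, open("points.csv", "w") as dst:
    for line in src:
        m = pattern.search(line)
        if m:
            dst.write(m.group(1).replace(" ", "") + "\n")

# part 2: let pandas slurp the compact CSV straight into float32 columns
points = pd.read_csv("points.csv", header=None, names=["x", "y", "z"],
                     dtype="float32")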

How to deal with lots of data?

I need to deal with lots of data (such as floats) in my program, which costs me much memory. Also, I create some data structures to organize my data, which cost memory too.
Here is the example:
Heap at the end of the function:
Partition of a set of 6954910 objects. Total size = 534417168 bytes.
 Index    Count   %      Size   %  Cumulative   %  Kind (class / dict of class)
     0  3446006  50 248112432  46   248112432  46  array.array
     1  1722999  25 124055928  23   372168360  70  vertex.Vertex
     2   574705   8  82894088  16   455062448  85  list
 .......
Any solution?
Python supports array objects (via the array module) that internally maintain simple data in packed binary form.
For example
import array
a = array.array('f', (0. for x in range(100000)))
will create an array object containing 100,000 floats, and its size will be approximately just 400 KB (4 bytes per element).
Of course you can store only values of a single specified type in an array object, not arbitrary Python values as you would with regular list objects.
The numpy module extends over this concept and provides you many ways to quickly manipulate multidimensional data structures of this kind (including viewing part of arrays as arrays sharing the same memory, reshaping arrays, performing math and search operations and much more).
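For instance, reshaping gives a 2-D view of the same packed buffer without copying; a tiny sketch:
import numpy as np

a = np.zeros(100000, dtype=np.float32)   # ~400 KB, same packing as array.array('f')
m = a.reshape(1000, 100)                 # 2-D view onto the same memory, no copy
m[0, :] = 1.0                            # writes show through in the original buffer
print(a[:5])                             # [1. 1. 1. 1. 1.]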
If you need to deal with billions of rows of data per day, by far the simplest way to do that is to create a simple indexer script that splits the billions of rows into small files based on some key (e.g. the first two digits of the IP address in a log file row).
If you need to deal with things like number theory, or log files, or something else where you have a lot of ints or floats:
1) Learn to use Numpy arrays well
2) Start using Numba's just-in-time compiling
3) Learn Cython (you can do much more than with Numba)
At least moderate-level Linux skills are a huge plus when dealing with large sets of data. Some things take seconds to do directly from the command line, while it might not be at all obvious how to do the same thing in Python.
At the very least, use %timeit to test a range of scales leading up to your desired scale (e.g. 2.5 billion rows per day). This is an easy way to identify possible performance drops, and to reduce the size of arrays or other factors accordingly.
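For example, the same measurement %timeit gives in IPython can be done in a plain script with the timeit module, running an operation of interest at a few increasing sizes; the summing here is just a stand-in for your real workload:
import timeit
import numpy as np

# time a representative operation at increasing scales to spot where it
# stops scaling linearly
for n in (10**5, 10**6, 10**7):
    a = np.random.random(n)
    seconds = timeit.timeit(a.sum, number=10) / 10
    print("%12d rows: %10.6f s" % (n, seconds))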
Learn more about profiling / performance hacking as soon as you're doing something with data.
To make the point about the 'indexer' clear, here is a very simple example indexer I've created and used for doing a lot of computation on files with billions of rows of data, on a $60-per-month server:
https://github.com/mikkokotila/indexer
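A stripped-down version of such an indexer might look like this; the log file name and the choice of key are assumptions for the sketch:
# split a huge log into small bucket files keyed by the first two characters
# of each row's IP address, so later passes only touch one small file at a time
handles = {}
with open("access.log") as log:
    for row in log:
        key = row[:2]                            # e.g. "17" from "172.16.0.9 ..."
        if key not in handles:
            handles[key] = open("bucket_%s.log" % key, "w")
        handles[key].write(row)
for h in handles.values():
    h.close()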

Statistics / distributions over very large data sets

Looking at the discussions about simple statistics from a file of data, I wonder which of these techniques would scale best over very large data sets (~millions of entries, Gbytes of data).
Are numpy solutions that read entire data set into memory appropriate here? See:
Binning frequency distribution in Python
You are not telling us which kind of data you have and what you want to calculate!
If you have something that is, or is easily converted into, non-negative integers of moderate size (e.g., 0..1e8), you may use bincount. Here is an example of how to make a distribution (histogram) of the byte values of all bytes in a very large file (it works up to whatever your file system can manage):
import numpy as np

# number of bytes to read at a time
CHUNKSIZE = 100000000

# cumulative distribution array: one bin per possible byte value
cum = np.zeros(256, dtype=np.int64)

# read through the file chunk by chunk
with open("myfile.dat", "rb") as f:
    while True:
        chunkdata = np.frombuffer(f.read(CHUNKSIZE), dtype='uint8')
        if len(chunkdata) == 0:
            break
        cum += np.bincount(chunkdata, minlength=256)
This is very fast; the speed is really limited by disk access. (I got approximately 1 GB/s with a file in the OS cache.)
Of course, you may want to calculate some other statistics (standard deviation, etc.), but even then you can usually use the distribution (histogram) to calculate those statistics. However, if you do not need the distribution, there may be even faster methods: calculating the average, for example, only requires keeping a running sum and a count.
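For instance, the mean and standard deviation of the byte values can be recovered from the 256-bin histogram built above; a small sketch:
import numpy as np

def stats_from_histogram(cum):
    # cum[i] is the number of times byte value i occurred
    values = np.arange(len(cum))
    n = cum.sum()
    mean = (cum * values).sum() / n
    std = np.sqrt((cum * (values - mean) ** 2).sum() / n)
    return mean, std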
If you have a text file, then the major challenge is parsing the file chunk by chunk. The standard methods (numpy.loadtxt and the csv module) are not necessarily very efficient for very large files.
If you have floating point numbers or very large integers, the method above does not work directly, but in some cases you may just use some bits of the FP numbers, or round values to the closest integer, etc. In any case, the question really boils down to what kind of data you have and what statistics you want to calculate. There is no Swiss Army knife that would solve all statistics problems with huge files.
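One hedged possibility along those lines for floating point data is to round or bin the values into integers first; the scale and clipping range here are arbitrary:
import numpy as np

x = np.random.normal(50, 10, size=10**6)          # stand-in float data
# round to the nearest integer and clip into a fixed range of bins
binned = np.clip(np.rint(x), 0, 255).astype(np.intp)
hist = np.bincount(binned, minlength=256)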
Reading the data into memory is a very good option if you have enough memory. In certain cases you can do it without having enough memory (use numpy.memmap). If you have a text file with 1 GB of floating point numbers, the end result may fit into less than 1 GB, and most computers can handle that very well. Just make sure you are using a 64-bit Python.
