Exporting a csv of pandas dataframe with LARGE np.arrays - python

I'm building a deep learning model for speech emotion recognition in google colab environment.
The process of extracting the data and features from the audio files takes 20+ minutes of runtime.
Therefore, I have built a pandas DataFrame containing all of the data, which I want to export to a CSV file so I won't need to wait that long for the extraction every time.
Because the audio files have a sample rate of about 44,100 frames per second (Hz) on average, I get a huge array of values per file, so that df.sample shows, for example:
[image: df.sample output for variable 'x']
Each 'x' array has about 170K values, but df.sample only shows this truncated representation.
Unfortunately, df.to_csv writes that truncated representation, and NOT the full arrays.
Is there a way to export the full DataFrame as CSV? (Should be miles and miles of data for each row...)

The problem is that a DataFrame is not expected to contain np.arrays as cell values. Even though NumPy is the underlying framework for Pandas, arrays stored inside cells are a special case for pandas. In any case, a DataFrame is intended to be a data processing tool, not a general-purpose container, so I think you are using the wrong tool here.
If you still want to go that way, it is enough to change the np.arrays into lists:
df['x'] = df['x'].apply(list)
But at load time, you will have to declare a converter to turn the string representations of the lists back into plain lists (this requires import ast):
df = pd.read_csv('data.csv', converters={'x': ast.literal_eval, ...})
But again, a CSV file is not intended to have fields containing large lists, and performance may not be what you expect.
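For completeness, here is a minimal round-trip sketch of that approach (hypothetical file name data.csv; tolist() is used instead of list so the values are written as plain Python floats that ast.literal_eval can parse back):

import ast
import numpy as np
import pandas as pd

# Toy frame whose 'x' cells hold arrays, as in the question
df = pd.DataFrame({"x": [np.random.rand(5), np.random.rand(5)], "label": ["happy", "sad"]})

# Save: write the full values, not the truncated array repr
df["x"] = df["x"].apply(lambda a: a.tolist())
df.to_csv("data.csv", index=False)

# Load: parse the list strings back, then restore the arrays
df2 = pd.read_csv("data.csv", converters={"x": ast.literal_eval})
df2["x"] = df2["x"].apply(np.array)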

Related

efficient partitioning column-wise when converting from dask dataframe to xarray (dask array)

A common task in my daily data wrangling is converting tab-delimited text files to xarray datasets, continuing analysis on the dataset, and saving to zarr or netCDF format.
I have developed a data pipeline that reads the data into a dask dataframe, converts it to a dask array, and assigns it to an xarray dataset.
The data is two-dimensional along dimensions "time" and "subid".
ddf = dd.read_table(some_text_file, ...)   # read into a dask dataframe
data = ddf.to_dask_array(lengths=True)     # convert to a dask array, chunked along rows
attrs = {"about": "some attribute"}
ds = xr.Dataset({"parameter": (["time", "subid"], data, attrs)})
Now, this is working fine as intended. However, recently we have been doing many computationally heavy operations along the time dimension, so we often want to rechunk the data along that dimension, and we usually follow this up with:
ds_chunked = ds.chunk({"time": -1, "subid": "auto"})
and then save to disk or do some analysis and then save to disk.
This is, however, causing quite a bottleneck, as the rechunking adds significant time to the pipeline generating the datasets. I am aware that there is no work-around for the cost of rechunking and that one should avoid it when looking for performance improvements. So my question really is: can anyone think of a smart way to avoid it, or to improve its speed? I've looked into reading the data into partitions by column, but I haven't found any info on dask dataframes reading data in partitions along columns. If I could "read the data column-wise" (i.e. along the time dimension), I wouldn't have to rechunk.
To generate some data for this example you can replace the read_table with something like ddf = dask.datasets.timeseries().
I tried reading the data into a dask dataframe, but I have to rechunk every time, so I am looking for a way to read the data column-wise (along one dimension) from the start.
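A minimal, self-contained sketch of the pipeline and the costly rechunk step, using dask.datasets.timeseries() as a stand-in for the real tab-delimited files (its numeric columns play the role of "subid" here):

import dask.datasets
import xarray as xr

# Stand-in data; the real pipeline uses dd.read_table(some_text_file, ...)
ddf = dask.datasets.timeseries()[["x", "y"]]   # keep numeric columns only

# dataframe -> dask array -> xarray dataset, chunked along "time" by row partitions
data = ddf.to_dask_array(lengths=True)
ds = xr.Dataset({"parameter": (["time", "subid"], data, {"about": "some attribute"})})

# The bottleneck: rechunk so that "time" becomes a single chunk
ds_chunked = ds.chunk({"time": -1, "subid": "auto"})
print(ds_chunked["parameter"].chunks)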

Most efficient way to insert large amount of data (230M entries) in pyspark table

What is the most efficient way to insert large amounts of data that is generated in a python script?
I am retrieving .grib files about weather parameters from several sources. These grib files consist of grid-based data (1201x2400x80), which results in a large amount of data.
I have written a script where each value is combined with the corresponding longitude and latitude, resulting in a data structure as follows:
+--------------------+-------+-------+--------+--------+
| value|lat_min|lat_max| lon_min| lon_max|
+--------------------+-------+-------+--------+--------+
| 0.0011200|-90.075|-89.925|-180.075|-179.925|
| 0.0016125|-90.075|-89.925|-179.925|-179.775|
+--------------------+-------+-------+--------+--------+
I have tried looping through each of the 80 time steps and creating a pyspark dataframe, as well as reshaping the whole array into (230592000,), but both methods either take ages to complete or fry the cluster's memory.
I have just discovered Resilient Distributed Datasets (RDDs) and am able to use the map function to create the full 230M entries in RDD format, but converting this to a DataFrame or writing it to a file is very slow again.
Is there a way to multithread/distribute/optimize this in a way that is both efficient and doesn't need large amounts of memory?
Thanks in advance!
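For reference, a bare-bones sketch of the RDD-to-DataFrame path described above (toy values and an existing Spark cluster assumed; the real pipeline would build the RDD from the grib grids instead of a small range):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType

spark = SparkSession.builder.getOrCreate()

# Explicit schema so Spark does not have to infer types over 230M rows
schema = StructType([
    StructField("value", DoubleType(), False),
    StructField("lat_min", DoubleType(), False),
    StructField("lat_max", DoubleType(), False),
    StructField("lon_min", DoubleType(), False),
    StructField("lon_max", DoubleType(), False),
])

# Map each cell index to a (value, lat_min, lat_max, lon_min, lon_max) tuple
rows = spark.sparkContext.parallelize(range(4)).map(
    lambda i: (0.0011 * i, -90.075, -89.925, -180.075 + 0.15 * i, -179.925 + 0.15 * i)
)
df = spark.createDataFrame(rows, schema=schema)
df.show()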

What did the HDF5 format do to the csv file?

I had a csv file of 33GB but after converting to HDF5 format the file size drastically reduced to around 1.4GB. I used vaex library to read my dataset and then converted this vaex dataframe to pandas dataframe. This conversion of vaex dataframe to pandas dataframe did not put too much load on my RAM.
I wanted to ask what this process (CSV --> HDF5 --> pandas dataframe) did so that the pandas dataframe now does not take up too much memory, compared to when I was reading the pandas dataframe directly from the CSV file (CSV --> pandas dataframe)?
HDF5 compresses the data, which explains the lower amount of disk space used.
In terms of RAM, I wouldn't expect any difference at all, with maybe even slightly more memory consumed by HDF5 due to computations related to format processing.
I highly doubt it has anything to do with compression. In fact, I would assume the file should be larger in HDF5 format, especially in the presence of numeric features.
How did you convert from CSV to HDF5? Are the number of columns and rows the same?
Assuming you converted it somehow with vaex, please check that you are not looking at a single "chunk" of the data. Vaex will do things in steps and then concatenate the result into a single file.
Also, if some columns are of an unsupported type, they might not be exported.
Doing some sanity checks will uncover more hints.
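As a concrete example, a small sanity-check sketch along those lines (assuming the conversion was done with vaex; file names are hypothetical):

import pandas as pd
import vaex

# Column count of the original CSV, read from the header only
csv_columns = pd.read_csv("data.csv", nrows=0).columns

# The HDF5 file is memory-mapped, so opening it does not load it into RAM
hdf = vaex.open("data.hdf5")

print(len(csv_columns), "columns in the CSV header")
print(len(hdf.column_names), "columns and", len(hdf), "rows in the HDF5 file")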

Problem exporting Pandas DataFrame with object containing documents as bytes

I am trying to export a DataFrame that contains documents as byte objects; however, I cannot find a suitable file format that does not involve the relatively small (memory usage: 254.3+ KB) DataFrame expanding into something in the range of hundreds of MB, or even 1 GB+.
So far I have tried to export the DataFrame as CSV and HDF5.
The column causing this huge expansion contains either .pdf, .doc, .txt or .msg files in byte format:
b'%PDF-1.7\r%\xe2\xe3\xcf\xd3\r\n256...
which were initially stored on a SQL server as varbinary(max) and loaded with pandas' default settings.
I have simply tried using pandas to export the DataFrame using:
df.to_csv('.csv') and
data_stored = pd.HDFStore('documents.h5')
data_stored['document'] = df
I wanted to keep the output data compact, as I would simply like to be able to load the data again at another time. The problem, however, is that the exports result in either a huge CSV or .h5 file. I guess there is some file format that preserves the structure and size of a pd.DataFrame?
I ended up exporting using df.to_pickle. I also discovered that the size of the DataFrame was indeed much larger than I initially thought, since the pandas method .info did not include the enormous amount of overhead memory from the object column. To view the entire memory usage instead, I used df.memory_usage(deep=True).sum(), and indeed the DataFrame took up around 1.1 GB.
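A short sketch of that solution (toy bytes content and a hypothetical file name):

import pandas as pd

# Toy frame with a bytes column standing in for the documents
df = pd.DataFrame({"document": [b"%PDF-1.7\r%..." * 1000, b"plain text" * 1000]})

# .info() understates object columns; deep accounting shows the real footprint
print(df.memory_usage(deep=True).sum(), "bytes")

# Pickle keeps the bytes objects as-is, avoiding the CSV text blow-up
df.to_pickle("documents.pkl")
df_back = pd.read_pickle("documents.pkl")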

Dimensions of data stored in HDF5

I have several .h5 files which contain Pandas DataFrames created with the .to_hdf method. My question is quite simple: is it possible to retrieve the dimensions of the DataFrame stored in the .h5 file without loading all the data into RAM?
Motivation: the DataFrames stored in those HDF5 files are quite big (up to several GB), and loading all the data just to get the shape is really time-consuming.
You are probably going to want to use PyTables directly.
The API reference is here, but basically:
from tables import open_file

# open the file read-only; this does not load the table data into RAM
h5file = open_file("yourfile.h5", mode="r")
print(h5file.root.<yourdataframe>.table.shape)
print(len(h5file.root.<yourdataframe>.table.cols) - 1)  # first col is an index
Also, just for clarity, HDF5 does not read all the data when a dataset is opened. That would be a peculiarity of Pandas.
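If the DataFrame was written with .to_hdf(..., format="table"), a pandas-only alternative sketch (hypothetical file name and key) would be:

import pandas as pd

# get_storer reads only the metadata, not the data itself
with pd.HDFStore("yourfile.h5", mode="r") as store:
    storer = store.get_storer("df")   # "df" is the key used in .to_hdf
    print(storer.nrows)               # row count without loading the table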
