dask / pandas categorical transformation differences - python

I am managing larger-than-memory CSV files of mostly categorical data. Initially I would build a large CSV file, read it via pandas read_csv, convert the columns to categorical and save to HDF5. Once in categorical form, the data fits nicely in memory.
The files have grown, so I moved to Dask, keeping the same process.
However, for empty fields pandas uses np.nan, and NaN does not appear in the cat.categories listing.
With Dask, empty values are also filled with NaN, but NaN is included as a separate category, and when I save to HDF I get a future-compatibility warning.
Is this a bug, or am I missing a step? The behaviour seems to differ between pandas and Dask.
Thanks
JC

This was fixed in dask version 0.11.1.
See https://github.com/dask/dask/pull/1578
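For reference, a minimal sketch of the workflow described above; the file pattern and the column name 'label' are placeholders, not from the original post:

import dask.dataframe as dd

# Read the larger-than-memory CSVs as a dask dataframe.
ddf = dd.read_csv('data-*.csv')

# Scan the data and convert object columns to pandas categoricals.
# With dask >= 0.11.1, missing values stay as NaN and are no longer
# added to cat.categories, matching the pandas behaviour.
ddf = ddf.categorize(columns=['label'])

# Save to HDF5 once the data fits comfortably in memory as categoricals.
ddf.to_hdf('data.h5', '/table')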

Related

What did the HDF5 format do to the csv file?

I had a 33 GB CSV file; after converting it to HDF5 the file size dropped drastically to around 1.4 GB. I used the vaex library to read the dataset and then converted the vaex DataFrame to a pandas DataFrame. This conversion did not put much load on my RAM.
What does this process (CSV --> HDF5 --> pandas DataFrame) do so that the resulting pandas DataFrame takes up much less memory than when I read the pandas DataFrame directly from the CSV file (CSV --> pandas DataFrame)?
HDF5 can compress the data, which explains the smaller amount of disk space used.
In terms of RAM I wouldn't expect any difference at all; HDF5 might even consume slightly more memory because of the extra processing the format requires.
I highly doubt this has anything to do with compression. In fact, I would expect the file to be larger in HDF5 format, especially in the presence of numeric features.
How did you convert from CSV to HDF5? Are the numbers of columns and rows the same?
Assuming you converted it somehow with vaex, please check that you are not looking at a single "chunk" of the data. Vaex converts in steps and then concatenates the results into a single file.
Also, if some columns are of an unsupported type, they might not be exported.
Doing some sanity checks will uncover more hints.
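A rough sketch of such sanity checks, assuming the file names below (placeholders) and that the HDF5 file opens in vaex:

import pandas as pd
import vaex

# Open the converted file with vaex (memory-mapped, so this is cheap).
df_h5 = vaex.open('data.hdf5')
n_rows_h5 = len(df_h5)
n_cols_h5 = len(df_h5.get_column_names())

# Count the CSV rows and columns without loading the whole file into memory.
n_rows_csv = sum(len(chunk) for chunk in pd.read_csv('data.csv', chunksize=1_000_000))
n_cols_csv = len(pd.read_csv('data.csv', nrows=0).columns)

print('rows match:', n_rows_csv == n_rows_h5)
print('columns match:', n_cols_csv == n_cols_h5)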

Vaex Displaying Data

I have a 10.11 GB CSV file that I converted to HDF5 using dask. It is a mixture of str, int and float values. When I try to read it with vaex, I just get numbers, as shown in the screenshot. Can someone please help me out?
Screenshot:
I am not sure how dask (or dask.dataframe) stores data in HDF5 format. Pandas, for instance, stores the data in a row-based format, while vaex expects column-based HDF5 files.
From your screenshot I see that your HDF5 file also preserves the index column - vaex does not have such a column and expects just the data.
To ensure the HDF5 files work with vaex, it is best to use vaex itself to do the CSV->HDF5 conversion. Otherwise perhaps something like arrow will work, since it is a standard (while HDF5 can be more flexible and thus harder to support across all the possible ways of storing data).
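As an illustration of that suggestion, a minimal sketch of the vaex-side conversion (the file name is a placeholder):

import vaex

# convert=True processes the CSV in chunks, writes a column-based HDF5 file
# next to it, and returns the resulting memory-mapped dataframe.
df = vaex.from_csv('big_file.csv', convert=True, chunk_size=5_000_000)
print(df)  # the real column names and values should now be visible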

Problem exporting Pandas DataFrame with object containing documents as bytes

I am trying to export a DataFrame that contains documents as byte objects; however, I cannot find a suitable file format that does not cause the relatively small (memory usage: 254.3+ KB) DataFrame to expand into something in the range of hundreds of MB, or even 1 GB+.
So far I have tried to export the DataFrame as CSV and HDF5.
The column causing this huge expansion contains either .pdf, .doc, .txt or .msg files in byteformat:
b'%PDF-1.7\r%\xe2\xe3\xcf\xd3\r\n256...
which was initially stored on a SQL-server as varbinary(max) and loaded by pandas default settings.
I have simply tried using pandas to export the DataFrame using:
df.to_csv('.csv')
and
data_stored = pd.HDFStore('documents.h5')
data_stored['document'] = df
I wanted to keep the output compact, as I would simply like to be able to load the data again at another time. The problem, however, is that the exports result in either a huge CSV or .h5 file. Is there some file format that preserves the structure and size of a pd.DataFrame?
I ended up exporting with df.to_pickle. I also discovered that the DataFrame was indeed much larger than I initially thought, since the pandas .info() method does not include the large amount of object overhead. To see the full memory usage I used df.memory_usage(deep=True).sum(), and the DataFrame indeed took up around 1.1 GB.
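A small self-contained sketch of that approach; the tiny stand-in DataFrame and the file name are placeholders for the real data:

import pandas as pd

# Stand-in for the real DataFrame: one document stored as raw bytes.
df = pd.DataFrame({'name': ['a.pdf'],
                   'document': [b'%PDF-1.7\r%\xe2\xe3\xcf\xd3']})

# .info() underestimates object columns; deep=True counts the actual payload.
print(df.memory_usage(deep=True).sum(), 'bytes')

# Pickle round-trips the byte objects without the text-encoding blow-up
# seen with CSV/HDF5, and the data can be loaded again later.
df.to_pickle('documents.pkl')
df_again = pd.read_pickle('documents.pkl')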

How to aggregate multiple hdf5 files into one image

I have several (up to a few hundred) hdf5 files which contain the results of a parallel simulation: each compute node creates an independent hdf5 file to avoid any synchronization problems.
Is there any way to create an 'image' of all the data in the hdf5 files, such that this 'image' looks like it holds all the data but in reality serves the data from the other files?
Here is what I'm looking for:
"data-node0.h5"
spike/PopulationA -> pandas data frame columns=[0,3,6,9]
"data-node1.h5"
spike/PopulationA -> pandas data frame columns=[1,4,7,10]
"data-node2.h5"
spike/PopulationA -> pandas data frame columns=[2,5,8,11]
spike/PopulationB -> pandas data frame columns=[0,1,2,3]
"data.h5" = aggregate("data-node0.h5","data-node1.h5","data-node2.h5")
"data.h5"
spike/PopulationA -> pandas data frame columns=[0,1,2,3,4,5,6,7,8,9,10,11]
spike/PopulationB -> pandas data frame columns=[0,1,2,3]
Note that data.h5 doesn't contain any data itself; it uses the data from the data-nodeX.h5 files.
Update: The data in the hdf5 files are pandas data frames with time series. Each column in a data frame is a 1D numpy array recorded from an object in the model, and the column identifier is the unique ID of that object. The table index is the model time in ms.
In version 1.10+, HDF5 added a virtual dataset feature that allows you to map data from multiple datasets into a top-level 'virtual' dataset, which stores no data itself.
The documentation is here:
https://support.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesVirtualDatasetDocs.html
The complication, of course, is that it looks like you are using PyTables via Pandas and not raw HDF5. PyTables is HDF5, but adds a layer of structure and semantics on top of HDF5's groups and datasets. In order to create a virtual dataset based on PyTables, you are going to have to dig around in the sub-structure of the PyTables HDF5 objects to set up the mapping. Also, any virtual dataset you create will be a regular HDF5 dataset and not a PyTables table. This is certainly doable given a basic knowledge of HDF5, though possibly more work than you hoped.
h5py (a much lower-level and more direct Python wrapper for HDF5) has support for the virtual dataset feature, btw, so you can still do everything in Python, just not via PyTables.
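A rough h5py sketch of the idea, assuming each node file holds a plain 2-D float dataset at 'spike/PopulationA' (not a PyTables table, as in the actual files) with the interleaved column layout shown above; the shapes are made up:

import h5py
import numpy as np

files = ['data-node0.h5', 'data-node1.h5', 'data-node2.h5']
n_rows, cols_per_file = 1000, 4          # assumed shapes
n_files = len(files)

# Describe the shape of the combined, virtual dataset.
layout = h5py.VirtualLayout(shape=(n_rows, cols_per_file * n_files), dtype='f8')

# Map node i's columns into positions i, i + 3, i + 6, ... of the layout.
for i, fname in enumerate(files):
    source = h5py.VirtualSource(fname, 'spike/PopulationA',
                                shape=(n_rows, cols_per_file))
    layout[:, i::n_files] = source

# data.h5 stores only the mapping; reads are served from the node files.
with h5py.File('data.h5', 'w', libver='latest') as f:
    f.create_virtual_dataset('spike/PopulationA', layout, fillvalue=np.nan)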

Process Pandas DataFrames which don't fit in memory

I'm manipulating a huge DataFrame stored in an HDFStore. The table is too big to be loaded completely into memory, so I have to extract the data chunk by chunk, which is fine for a lot of tasks.
Here is my problem: I would like to apply PCA to the table, which requires the whole DataFrame to be loaded, but I don't have enough memory to do that.
The PCA function takes a numpy array or a pandas DataFrame as input. Is there another way to apply PCA that works directly on an object stored on disk?
Thank you a lot in advance,
ClydeX
Seems like a perfect fit for the new IncrementalPCA in the 0.16 dev branch of scikit-learn.
Update: link to the latest stable version
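A minimal sketch of that approach, assuming the frame was saved to the HDFStore in table format under the key 'df' so it can be read back in chunks (names and sizes are placeholders):

import pandas as pd
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=10)

# First pass: fit the components chunk by chunk.
for chunk in pd.read_hdf('data.h5', 'df', chunksize=50_000):
    ipca.partial_fit(chunk.values)

# Second pass: project each chunk with the fitted components.
parts = [ipca.transform(chunk.values)
         for chunk in pd.read_hdf('data.h5', 'df', chunksize=50_000)]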
