Dimensions of data stored in HDF5 - python

I have several .h5 files which contain Pandas DataFrames created with the .to_hdf method. My question is quite simple: is it possible to retrieve the dimensions of a DataFrame stored in an .h5 file without loading all of the data into RAM?
Motivation: the DataFrames stored in those HDF5 files are quite big (up to several GB), and loading all the data just to get the shape of the data is really time-consuming.

You are probably going to want to use PyTables directly.
The API reference is here, but basically:
import tables
h5file = tables.open_file("yourfile.h5", mode="r")
print(h5file.root.<yourdataframe>.table.shape)
print(len(h5file.root.<yourdataframe>.table.cols) - 1)  # first col is an index
Also, just for clarity, HDF5 does not read all the data when a dataset is opened. That would be a peculiarity of Pandas.
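If the frame was written by pandas with format='table', another option is to ask the HDFStore itself for the row count without reading the data. A minimal sketch, assuming PyTables is installed and the frame was stored under the key 'df' (the file path here is made up):

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "example.h5")
df = pd.DataFrame({"a": range(1000), "b": range(1000)})
# format='table' stores the frame as a queryable PyTables table
df.to_hdf(path, key="df", format="table")

with pd.HDFStore(path, mode="r") as store:
    # nrows comes from the table's metadata, so no data is loaded
    nrows = store.get_storer("df").nrows
```

Note that get_storer only reports row counts for table-format stores; frames saved with the default fixed format do not expose this metadata.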

Related

Unpickling error: pickle data was truncated - better way to save a large dataframe

I have quite a large dataframe that I need to save. The size is approx. 300 MB when I save it using pickle.
I read about some other ways of saving large dataframes. I tried bz2.BZ2File and the file is now only 50 MB. However, when I try to load the data I get the following error:
UnpicklingError: pickle data was truncated
Is there a better way for saving a large dataframe?
Saving the dataframe as a CSV file can help. A dataframe carries more information than just the raw values (index, dtypes, object overhead), so a pickle of such a dataframe can take up far more space than a plain-text CSV of the same data.
Notice that the to_csv method even supports compression. E.g., to save as a zip:
df.to_csv('filename.zip', compression='infer')
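A full round trip with a zip-compressed CSV might look like this (a minimal sketch; the filename and data are made up):

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "frame.zip")
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# 'infer' picks the codec from the extension, so '.zip' yields a zip archive
df.to_csv(path, index=False, compression="infer")

# read_csv likewise infers zip compression from the extension
restored = pd.read_csv(path)
```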

Exporting a csv of pandas dataframe with LARGE np.arrays

I'm building a deep learning model for speech emotion recognition in google colab environment.
The process of the data and features extraction from the audio files is taking about 20+ mins of runtime.
Therefore, I have made a pandas DataFrame containing all of the data which I want to export to a CSV file so I wouldn't need to wait that long for the data to be extracted every time.
Because audio files have 44,100 frames per second on average (sample rate (Hz)), I get a huge array of values, so that
df.sample shows, e.g., only an abbreviated representation of each array (screenshot: df.sample output for variable 'x'). Each 'x' array has about 170K values, but df.sample only shows this minimized representation.
Unfortunately, df.to_csv writes exactly that truncated representation, and NOT the full arrays.
Is there a way to export the full DataFrame as CSV? (Should be miles and miles of data for each row...)
The problem is that a dataframe is not expected to contain np.arrays. Even though NumPy is the underlying framework for Pandas, np.arrays get special treatment. In any case, a dataframe is intended to be a data-processing tool, not a general-purpose container, so I think you are using the wrong tool here.
If you still want to go that way, it is enough to change the np.arrays into lists:
df['x'] = df['x'].apply(list)
But at load time, you will have to declare a converter to change the string representations of lists into plain lists:
df = pd.read_csv('data.csv', converters={'x': ast.literal_eval, ...})
But again, a csv file is not intended to have fields containing large lists, and performance may not be what you expect.
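Putting the two steps together, a hedged round-trip sketch (using .tolist() rather than list() so the elements come out as plain Python numbers that ast.literal_eval can parse back; the column name and data are made up):

```python
import ast
import os
import tempfile

import numpy as np
import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "data.csv")
df = pd.DataFrame({"x": [np.arange(5.0), np.arange(3.0)]})

# .tolist() converts each array to a plain Python list before writing,
# so the full values are serialized instead of the truncated repr
df["x"] = df["x"].apply(lambda a: a.tolist())
df.to_csv(path, index=False)

# the converter turns the string "[0.0, 1.0, ...]" back into a real list
restored = pd.read_csv(path, converters={"x": ast.literal_eval})
```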

Problem exporting Pandas DataFrame with object containing documents as bytes

I am trying to export a DataFrame that contains documents as byte objects; however, I cannot find a suitable file format that does not involve the relatively small (memory usage: 254.3+ KB) DataFrame expanding into something in the range of hundreds of MB, even 1 GB+.
So far I have tried to export the DataFrame as CSV and HDF5.
The column causing this huge expansion contains either .pdf, .doc, .txt or .msg files in byte format:
b'%PDF-1.7\r%\xe2\xe3\xcf\xd3\r\n256...
which was initially stored on a SQL-server as varbinary(max) and loaded by pandas default settings.
I have simply tried using pandas to export the DataFrame using:
df.to_csv('.csv') and
data_stored = pd.HDFStore('documents.h5')
data_stored['document'] = df
I wanted to keep the output data compact, as I would simply like to be able to load the data again at another time. The problem, however, is that the exports result in either a huge CSV or .h5 file. I guess there is some file-format that keeps the format and size of a pd.DataFrame?
I ended up exporting using df.to_pickle. I also discovered that the size of the dataframe was indeed much larger than I initially thought, since the pandas .info method did not include the enormous amount of overhead memory. Instead, to view the entire memory footprint, I used df.memory_usage(deep=True).sum(), and indeed the dataframe took up around 1.1 GB.
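The gap between the shallow and deep measurements can be reproduced on toy data. A small sketch with made-up byte payloads standing in for the documents:

```python
import pandas as pd

# one hundred rows, each holding a ~10 KB bytes object
df = pd.DataFrame({"doc": [b"%PDF-" + b"\x00" * 10_000 for _ in range(100)]})

shallow = df.memory_usage().sum()          # counts only the object pointers
deep = df.memory_usage(deep=True).sum()    # also counts the bytes payloads
```

The shallow figure is what .info reports by default; the deep figure reflects what the file on disk will actually have to hold.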

Large data with pivot table using Pandas

I’m currently using Postgres database to store survey answers.
My problem I’m facing is that I need to generate pivot table from Postgres database.
When the dataset is small, it’s easy to just read whole data set and use Pandas to produce the pivot table.
However, my current database has around 500k rows and grows by around 1,000 rows per day, so reading the whole dataset is no longer efficient.
My question is that do I need to use HDFS to store data on disk and supply it to Pandas to do pivoting?
My customers need to view pivot table output nearly real time. Do we have any way to solve it?
My theory is that I'll create the pivot table output of the 500k rows and store the output somewhere; then, when new data gets saved into the database, I'll only need to merge the new data with the existing pivot table. I'm not quite sure if Pandas supports this approach, or whether it needs the full dataset to do the pivoting.
Have you tried using pickle? I'm a data scientist and use it all the time with data sets of 1M+ rows and several hundred columns.
In your particular case I would recommend the following.
import pickle

# 'wb' stands for write bytes
with open('path/file.pickle', 'wb') as save_data:
    pickle.dump(pd_data, save_data)
In the above code what you're doing is saving your data in a compact format that can quickly be loaded using:
# 'rb' stands for read bytes
with open('path/file.pickle', 'rb') as pickle_data:
    pd_data = pickle.load(pickle_data)
At which point you can append your data (pd_data) with the new 1,000 rows and save it again using pickle. If your data will continue to grow and you expect memory to become a problem I suggest identifying a way to append or concatenate the data rather than merge or join since the latter two can also result in memory issues.
You will find that this cuts out significant load time when reading something off your disk (I use Dropbox and it's still lightning fast). What I usually do to reduce it even further is segment my data sets into groups of rows and columns, and then write methods that load the pickled data as need be (super useful for graphing).
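The incremental idea from the question can be sketched with concat plus a re-aggregation over the combined rows, so the full history never has to be re-read (the column names and values here are made up):

```python
import pandas as pd

# hypothetical stored aggregate and a new day's batch of survey answers
existing = pd.DataFrame({"question": ["q1", "q2"],
                         "answer":   ["yes", "no"],
                         "count":    [10, 5]})
new = pd.DataFrame({"question": ["q1"], "answer": ["yes"], "count": [3]})

# concatenate and re-sum: only the small aggregate plus the new rows
# are in memory, never the full 500k-row history
combined = (pd.concat([existing, new])
              .groupby(["question", "answer"], as_index=False)["count"]
              .sum())
```

This only works for aggregations that compose (sums, counts, min/max); medians and similar statistics would still need the raw rows.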

Creating a large pd.dataframe - how?

I want to create a large pd.DataFrame out of seven 4 GB .txt files, which I want to work with and then save to .csv
What I did:
I created a for loop, opening and concatenating the files one by one on axis=0, continuing my index (a timestamp) as I went.
However, I am running into memory problems, even though I am working on a server with 100 GB of RAM. I read somewhere that pandas can take up 5-10x the on-disk data size.
What are my alternatives?
One is creating an empty csv, then opening it, appending a new chunk from each .txt file, and saving.
Other ideas?
Creating an hdf5 file with the h5py library will allow you to create one big dataset and access it without loading all the data into memory.
This answer provides an example of how to create and incrementally increase the hdf5 dataset: incremental writes to hdf5 with h5py
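A minimal sketch of that pattern, assuming h5py is installed (the random chunks stand in for the parsed .txt files, and the path is made up):

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "big.h5")

with h5py.File(path, "w") as f:
    # start empty, but resizable along axis 0 (maxshape=None means unbounded)
    dset = f.create_dataset("data", shape=(0, 3), maxshape=(None, 3), dtype="f8")
    for _ in range(7):                      # one iteration per source file
        chunk = np.random.rand(100, 3)      # stand-in for one parsed .txt file
        dset.resize(dset.shape[0] + chunk.shape[0], axis=0)
        dset[-chunk.shape[0]:] = chunk      # only this chunk is ever in memory

with h5py.File(path, "r") as f:
    shape = f["data"].shape
```

Only one chunk is held in RAM at a time; the dataset on disk grows with each resize.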
