unpickling error: pickle data was truncated - better way to save a large dataframe - python

I have quite a large dataframe that I need to save. Its size is approximately 300 MB when I save it using pickle.
I read about some other ways of saving large dataframes. I am using bz2.BZ2File and I can see the file is now only 50 MB. However, when I try to load the data I get the following error:
UnpicklingError: pickle data was truncated
Is there a better way to save a large dataframe?
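For reference, a minimal sketch of pickling through bz2.BZ2File using context managers (the file name is illustrative); a compressed file that was never fully flushed and closed is one common cause of this truncation error:
import bz2
import pickle

import pandas as pd

df = pd.DataFrame({"a": range(1_000_000), "b": "some text"})

# Writing inside a context manager guarantees the compressed stream is
# flushed and closed; an unclosed file can leave a truncated pickle on disk.
with bz2.BZ2File("df.pkl.bz2", "wb") as f:
    pickle.dump(df, f, protocol=pickle.HIGHEST_PROTOCOL)

# Read it back the same way.
with bz2.BZ2File("df.pkl.bz2", "rb") as f:
    df_loaded = pickle.load(f)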

Saving the dataframe as a csv file can help. A dataframe contains more information than the data alone (index, dtypes, and so on), so when pickling, all of that is serialized as well, which takes up space that a csv would not.
Notice that the method to_csv even supports compression. E.g. to save as a zip:
df.to_csv('filename.zip', compression='infer')
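A minimal round-trip sketch (the file name is illustrative); compression='infer' picks the codec from the extension, and read_csv infers it the same way when loading:
import pandas as pd

df = pd.DataFrame({"a": range(1_000_000), "b": "text"})

# The .zip extension makes compression='infer' write a zipped csv.
df.to_csv("filename.zip", index=False, compression="infer")

# read_csv infers the compression from the extension as well.
df_loaded = pd.read_csv("filename.zip")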

Related

What did the HDF5 format do to the csv file?

I had a csv file of 33 GB, but after converting it to HDF5 format the file size drastically reduced to around 1.4 GB. I used the vaex library to read my dataset and then converted the vaex dataframe to a pandas dataframe. This conversion of the vaex dataframe to a pandas dataframe did not put too much load on my RAM.
I wanted to ask what this process (csv --> HDF5 --> pandas dataframe) did so that the pandas dataframe now does not take up too much memory, compared to when I read the pandas dataframe directly from the csv file (csv --> pandas dataframe)?
HDF5 compresses the data, which explains the lower amount of disk space used.
In terms of RAM, I wouldn't expect any difference at all, with maybe even slightly more memory consumed by HDF5 due to computations related to format processing.
I highly doubt it has anything to do with compression; in fact, I would assume the file should be larger in HDF5 format, especially in the presence of numeric features.
How did you convert from csv to HDF5? Are the number of columns and rows the same?
Assuming you converted it somehow with vaex, please check that you are not looking at a single "chunk" of the data. Vaex will do things in steps and then concatenate the result into a single file.
Also, if some columns are of an unsupported type, they might not be exported.
Doing some sanity checks will uncover more hints.
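A minimal sanity-check sketch, assuming the conversion was done with vaex.from_csv(convert=True) (the file name is illustrative); comparing row and column counts against the original csv is a quick way to spot missing chunks or dropped columns:
import vaex

# convert=True writes an HDF5 file next to the csv and memory-maps it afterwards.
vdf = vaex.from_csv("big.csv", convert=True)

# Sanity checks: the row count and column names should match the original csv.
print(len(vdf))
print(vdf.get_column_names())

# Converting to pandas materializes the data in RAM.
pdf = vdf.to_pandas_df()
print(pdf.shape)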

Problem exporting Pandas DataFrame with object containing documents as bytes

I am trying to export a DataFrame that contains documents as byte objects; however, I cannot find a suitable file format that does not involve the relatively small (memory usage: 254.3+ KB) DataFrame expanding into something in the range of hundreds of MB, or even 1 GB+.
So far I have tried to export the DataFrame as CSV and HDF5.
The column causing this huge expansion contains either .pdf, .doc, .txt or .msg files in byte format:
b'%PDF-1.7\r%\xe2\xe3\xcf\xd3\r\n256...
which were initially stored on a SQL server as varbinary(max) and loaded with pandas default settings.
I have simply tried using pandas to export the DataFrame using:
df.to_csv('.csv') and
data_stored = pd.HDFStore('documents.h5')
data_stored['document'] = df
I wanted to keep the output data compact, as I would simply like to be able to load the data again at another time. The problem, however, is that the exports result in either a huge CSV or .h5 file. I assume there is some file format that preserves the structure and size of a pd.DataFrame?
I ended up exporting using df.to_pickle. I also discovered that the size of the dataframe was indeed much larger than I initially thought, since the pandas method .info did not include the enormous amount of overhead memory. Instead, to view the entire memory footprint, I used df.memory_usage(deep=True).sum(), and indeed the dataframe took up around 1.1 GB.
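A minimal sketch of that check and the export (file and column names are illustrative); memory_usage(deep=True) accounts for the Python bytes objects held in an object-dtype column, which .info() does not by default:
import pandas as pd

df = pd.DataFrame({"name": ["a.pdf"], "document": [b"%PDF-1.7 ..." * 10_000]})

# Deep memory usage counts the bytes objects themselves, not just the pointers.
print(df.memory_usage(deep=True).sum())

# Pickle stores the frame, including the raw bytes column, without text-encoding it.
df.to_pickle("documents.pkl")
df_loaded = pd.read_pickle("documents.pkl")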

Large python dictionary. Storing, loading, and writing to it

I have a large python dictionary of values (around 50 GB), and I've stored it as a JSON file. I am having efficiency issues when it comes to opening the file and writing to the file. I know you can use ijson to read the file efficiently, but how can I write to it efficiently?
Should I even be using a Python dictionary to store my data? Is there a limit to how large a python dictionary can be? (the dictionary will get larger).
The data basically stores the path length between nodes in a large graph. I can't store the data as a graph because searching for a connection between two nodes takes too long.
Any help would be much appreciated. Thank you!
Although it will truly depend on what operations you want to perform on your network dataset, you might want to consider storing this as a pandas DataFrame and then writing it to disk using Parquet or Arrow.
That data could then be loaded into networkx or even into Spark (GraphX) for any network-related operations.
Parquet is compressed and columnar and makes reading and writing files much faster, especially for large datasets.
From the Pandas docs:
Apache Parquet provides a partitioned binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance.
Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas dtypes, including extension dtypes such as datetime with tz.
Read further here: Pandas Parquet
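A minimal sketch of that approach, assuming the dictionary maps node pairs to path lengths (column and file names are illustrative); to_parquet requires pyarrow (or fastparquet) to be installed:
import pandas as pd

# Illustrative stand-in for the 50 GB dictionary of path lengths.
path_lengths = {("a", "b"): 3, ("a", "c"): 5}

# One row per node pair keeps the data columnar and easy to filter later.
df = pd.DataFrame(
    [(src, dst, length) for (src, dst), length in path_lengths.items()],
    columns=["source", "target", "path_length"],
)

df.to_parquet("path_lengths.parquet", index=False)

# With a columnar format, reading back only the columns you need is cheap.
df_loaded = pd.read_parquet("path_lengths.parquet", columns=["source", "path_length"])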
Try using pandas: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')
Convert a JSON string to pandas object
It is a very lightweight and useful library for working with large data.
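A minimal sketch of reading a large JSON file in chunks, assuming the data can be stored as line-delimited JSON with one record per line (the file name is illustrative):
import pandas as pd

# lines=True together with chunksize returns an iterator of DataFrames
# instead of loading the whole file at once.
reader = pd.read_json("paths.jsonl", lines=True, chunksize=100_000)

for chunk in reader:
    # Process each chunk independently, e.g. filter it or append it to another store.
    print(chunk.shape)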

Is it possible to append to an existing Feather format file?

I am working on a very huge dataset with 20 million+ records. I am trying to save all that data in Feather format for faster access, and also to append to it as I proceed with my analysis.
Is there a way to append a pandas dataframe to an existing Feather file?
Feather files are intended to be written in one go, so appending to them is not a supported use case.
Instead, for such a large dataset, I would recommend writing the data into individual Apache Parquet files using pyarrow.parquet.write_table or pandas.DataFrame.to_parquet, and reading the data back into pandas using pyarrow.parquet.ParquetDataset or pandas.read_parquet. These functions can treat a collection of Parquet files as a single dataset that is read at once into a single DataFrame.
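A minimal sketch of that pattern (directory and file names are illustrative): each new batch is written as its own Parquet file in one directory, and the whole directory is later read back as a single DataFrame.
from pathlib import Path

import pandas as pd

Path("dataset").mkdir(exist_ok=True)

# "Append" by writing each new batch to its own file inside the same directory.
chunk = pd.DataFrame({"id": range(3), "value": [0.1, 0.2, 0.3]})
chunk.to_parquet("dataset/part-000.parquet", index=False)

another_chunk = pd.DataFrame({"id": range(3, 6), "value": [0.4, 0.5, 0.6]})
another_chunk.to_parquet("dataset/part-001.parquet", index=False)

# The pyarrow engine treats the directory as one dataset and concatenates the parts.
df = pd.read_parquet("dataset/")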

What is the difference between save a pandas dataframe to pickle and to csv?

I am learning python pandas.
I see a tutorial which shows two ways to save a pandas dataframe.
df.to_csv('sub.csv') to save, and pd.read_csv('sub.csv') to open
df.to_pickle('sub.pkl') to save, and pd.read_pickle('sub.pkl') to open
The tutorial says to_pickle saves the dataframe to disk. I am confused about this, because when I use to_csv, I do see a csv file appear in the folder, which I assume means it is also saved to disk, right?
In general, why would we want to save a dataframe using to_pickle rather than saving it to csv, txt, or another format?
csv
✅human readable
✅cross platform
⛔slower
⛔more disk space
⛔doesn't preserve types in some cases
pickle
✅fast saving/loading
✅less disk space
⛔non human readable
⛔python only
Also take a look at parquet format (to_parquet, read_parquet)
✅fast saving/loading
✅less disk space than pickle
✅supported by many platforms
⛔non human readable
Pickle is a serialized way of storing a pandas dataframe. Basically, you are writing down the exact representation of the dataframe to disk. This means the types of the columns and the indices are preserved. If you simply save the file as csv, you are just storing it as a comma-separated list, and depending on your data set, some information will be lost when you load it back up.
You can read more about the pickle library in Python here.
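A minimal sketch illustrating the difference (column names are illustrative): a csv round trip loses dtype information such as datetimes, while a pickle round trip preserves it.
import pandas as pd

df = pd.DataFrame({"when": pd.to_datetime(["2020-01-01", "2020-01-02"]), "value": [1, 2]})

df.to_csv("sub.csv", index=False)
df.to_pickle("sub.pkl")

# csv round trip: the 'when' column comes back as plain strings (object dtype)
# unless parse_dates is passed explicitly.
print(pd.read_csv("sub.csv").dtypes)

# pickle round trip: dtypes and the index come back exactly as saved.
print(pd.read_pickle("sub.pkl").dtypes)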
