I have a 10.11 GB CSV file and I have converted it to HDF5 using dask. It is a mixture of str, int and float values. When I try to read it with vaex I just get numbers, as shown in the screenshot. Can someone please help me out?
Screenshot:
I am not sure how dask (or dask.dataframe) stores data in HDF5 format. Pandas, for instance, stores the data in a row-based format, while vaex expects column-based HDF5 files.
From your screenshot I see that your HDF5 file also preserves the index column - vaex does not have such a column, and expects just the data.
To ensure the HDF5 files work with vaex, it is best to use vaex itself to do the CSV->HDF5 conversion. Otherwise perhaps something like Arrow will work, since it is a standard (while HDF5 can be more flexible and thus harder to support all possible ways of storing data).
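For example, a minimal sketch of doing the conversion with vaex itself (the file names here are placeholders):

import vaex

# Reads the CSV in chunks and writes a vaex-friendly, column-based HDF5 file
# next to it (e.g. big_file.csv.hdf5), then returns the opened dataframe.
df = vaex.from_csv('big_file.csv', convert=True, chunk_size=5_000_000)

# Later sessions can open the HDF5 file directly (memory-mapped, not loaded into RAM).
df = vaex.open('big_file.csv.hdf5')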
Related
I had a CSV file of 33GB, but after converting it to HDF5 format the file size drastically reduced to around 1.4GB. I used the vaex library to read my dataset and then converted the vaex dataframe to a pandas dataframe. This conversion of the vaex dataframe to a pandas dataframe did not put too much load on my RAM.
I wanted to ask what this process (CSV --> HDF5 --> pandas dataframe) did so that the resulting pandas dataframe did not take up much memory, compared to when I was reading the pandas dataframe directly from the CSV file (CSV --> pandas dataframe)?
HDF5 compresses the data, which explains the lower amount of disk space used.
In terms of RAM, I wouldn't expect any difference at all - maybe even slightly more memory consumed with HDF5, due to computations related to processing the format.
I highly doubt it has anything to do with compression. In fact, I would assume the file should be larger in HDF5 format, especially in the presence of numeric features.
How did you convert from CSV to HDF5? Are the numbers of columns and rows the same?
Assuming you converted it somehow with vaex, please check that you are not looking at a single "chunk" of data. Vaex will do things in steps and then concatenate the result into a single file.
Also, if some columns are of an unsupported type they might not be exported.
Doing some sanity checks will uncover more hints; see the sketch below.
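A minimal sketch of such a sanity check (file names are placeholders; the CSV row count is taken without loading the whole file into memory):

import vaex

df_hdf = vaex.open('data.hdf5')
print(df_hdf.column_names)  # are all expected columns present?
print(len(df_hdf))          # row count in the HDF5 file

# Count the CSV rows line by line and compare against the HDF5 row count.
with open('data.csv') as f:
    n_rows = sum(1 for _ in f) - 1  # minus the header line
print(n_rows)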
When reading a CSV file using Polars in Python, we can use the dtypes parameter to specify the schema to use (for some columns). I wonder whether we can do the same when reading or writing a Parquet file? I tried to specify the dtypes parameter but it doesn't work.
I have some Parquet files generated from PySpark and want to load those Parquet files into Rust. The Rust side requires unsigned integers, while Spark/PySpark does not have unsigned integers and outputs signed integers into Parquet files. To make things simpler, I'd like to convert the column types of the Parquet files before loading them into Rust. I know there are several different ways to achieve this (both in pandas and polars), but I wonder whether there's an easy and efficient way to do this using polars.
The code that I used to cast column types using polars in Python is as below.
import polars as pl
...
# cast the column to an unsigned integer type
df = df.with_columns(pl.col("id0").cast(pl.UInt64))
Parquet files have a schema. We respect the schema of:
the parquet file upon reading
the DataFrame upon writing
If you want to change the schema you read/write, you need to cast columns in the DataFrame.
That's what we would do internally if we accepted a schema, so the efficiency is the same.
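So, as a rough sketch (reusing the column name from the question; file names are placeholders), the cast happens between read and write:

import polars as pl

# read with the schema stored in the parquet file
df = pl.read_parquet("data.parquet")

# cast the signed column to unsigned before writing it back out
df = df.with_columns(pl.col("id0").cast(pl.UInt64))
df.write_parquet("data_unsigned.parquet")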
I am trying to export a DataFrame that contains documents as byte objects; however, I cannot find a suitable file format in which the relatively small DataFrame (memory usage: 254.3+ KB) does not expand into something in the range of hundreds of MB - even 1 GB+.
So far I have tried to export the DataFrame as CSV and HDF5.
The column causing this huge expansion contains either .pdf, .doc, .txt or .msg files in byte format:
b'%PDF-1.7\r%\xe2\xe3\xcf\xd3\r\n256...
which were initially stored on a SQL Server as varbinary(max) and loaded by pandas with default settings.
I have simply tried exporting the DataFrame with pandas using:

df.to_csv('.csv')

and

data_stored = pd.HDFStore('documents.h5')
data_stored['document'] = df
I wanted to keep the output data compact, as I would simply like to be able to load the data again at another time. The problem, however, is that the exports result in either a huge CSV or .h5 file. I assume there is some file format that preserves the structure and size of a pd.DataFrame?
I ended up exporting using df.to_pickle. I also discovered that the size of the dataframe was indeed much larger than I initially thought, since the pandas .info() method did not include the enormous amount of object overhead memory. Instead, to view the entire memory usage, I used df.memory_usage(deep=True).sum(), and indeed the dataframe took up around 1.1 GB.
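For reference, a minimal sketch of both steps (the file name is a placeholder):

import pandas as pd

# deep=True also counts the payload of object columns (the raw document bytes)
print(df.memory_usage(deep=True).sum())

df.to_pickle('documents.pkl')          # write the dataframe as-is, bytes included
df = pd.read_pickle('documents.pkl')   # load it back later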
I am learning python pandas.
I see a tutorial which shows two ways to save a pandas dataframe:
df.to_csv('sub.csv'), and pd.read_csv('sub.csv') to open it
df.to_pickle('sub.pkl'), and pd.read_pickle('sub.pkl') to open it
The tutorial says to_pickle is used to save the dataframe to disk. I am confused about this, because when I use to_csv I also see a CSV file appear in the folder, which I assume is also saved to disk, right?
In general, why would we want to save a dataframe using to_pickle rather than saving it to CSV, txt, or some other format?
csv
✅ human readable
✅ cross platform
⛔ slower
⛔ more disk space
⛔ doesn't preserve types in some cases
pickle
✅ fast saving/loading
✅ less disk space
⛔ not human readable
⛔ Python only
Also take a look at the parquet format (to_parquet, read_parquet):
✅ fast saving/loading
✅ less disk space than pickle
✅ supported by many platforms
⛔ not human readable
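As a quick illustration of the type-preservation point (a minimal sketch; the example frame and file names are made up), you can round-trip the same dataframe through all three formats and compare the dtypes that come back:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03'])})

df.to_csv('sub.csv', index=False)
df.to_pickle('sub.pkl')
df.to_parquet('sub.parquet')  # requires pyarrow or fastparquet

print(pd.read_csv('sub.csv').dtypes)          # 'b' comes back as plain object/str
print(pd.read_pickle('sub.pkl').dtypes)       # dtypes preserved
print(pd.read_parquet('sub.parquet').dtypes)  # dtypes preserved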
Pickle is a serialized way of storing a pandas dataframe. Basically, you are writing down the exact representation of the dataframe to disk. This means the types of the columns and the indices are preserved. If you simply save a file as CSV, you are just storing it as a comma-separated list of text; depending on your data set, some information will be lost when you load it back up.
You can read more about the pickle library in the Python documentation.
I am managing larger-than-memory CSV files of mostly categorical data. Initially I used to create a large CSV file, then read it via pandas read_csv, convert to categorical and save into HDF5. Once in categorical format, it fits nicely in memory.
The files are growing and I have moved to Dask, with the same process (sketched below).
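A rough sketch of that pipeline with dask (the column and file names here are made-up placeholders):

import dask.dataframe as dd

# read the CSV lazily, in partitions that fit in memory
ddf = dd.read_csv('big_file.csv')

# convert the mostly-repetitive string columns to pandas categoricals
ddf = ddf.categorize(columns=['country', 'product'])  # hypothetical column names

# write the result to HDF5
ddf.to_hdf('data.hdf5', '/data')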
However, in empty fields, Pandas seems to use np.nan and the category is not included in the cat.categories listing.
With Dask, empty values are also filled with NaN, but NaN is included as a separate category, and when saving into HDF I get a future-compatibility warning.
Is this a bug or am I missing any steps? The behaviour seems to differ between pandas and dask.
Thanks
JC
This is solved in dask version 0.11.1.
See https://github.com/dask/dask/pull/1578