What did the HDF5 format do to the csv file? - python

I had a 33 GB CSV file, but after converting it to HDF5 format the file size drastically reduced to around 1.4 GB. I used the vaex library to read my dataset and then converted the vaex dataframe to a pandas dataframe. This conversion from vaex dataframe to pandas dataframe did not put much load on my RAM.
I wanted to ask: what does this process (CSV-->HDF5-->pandas dataframe) do so that the resulting pandas dataframe takes up much less memory than when I read the pandas dataframe directly from the CSV file (csv-->pandas dataframe)?

HDF5 compresses the data, which explains the smaller amount of disk space used.
In terms of RAM, I wouldn't expect any difference at all, with maybe even slightly more memory consumed by HDF5 due to computations related to format processing.

I highly doubt it has anything to do with compression. In fact, I would expect the file to be larger in HDF5 format, especially in the presence of numeric features.
How did you convert from CSV to HDF5? Are the numbers of columns and rows the same?
Assuming you converted it somehow with vaex, please check that you are not looking at a single "chunk" of the data. Vaex will do things in steps and then concatenate the result into a single file.
Also, if some columns are of an unsupported type, they might not be exported.
Doing some sanity checks will uncover more hints.
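One such sanity check, sketched below with only the standard library (the helper name `csv_shape` is illustrative, and it assumes the CSV has a header row): stream through the original CSV to count its rows and columns without loading it into memory, then compare against the shape vaex reports for the converted file (e.g. `vaex.open('data.hdf5').shape`).

```python
import csv
import io

def csv_shape(csv_file):
    """Stream a CSV and return (n_rows, n_cols) without loading it
    into memory. Assumes the first line is a header row."""
    reader = csv.reader(csv_file)
    header = next(reader)
    n_rows = sum(1 for _ in reader)  # count remaining data rows
    return n_rows, len(header)

# Small in-memory example standing in for the real 33 GB file:
buf = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")
print(csv_shape(buf))  # (2, 3)
```

If the streamed shape and the vaex-reported shape disagree, the conversion dropped rows or columns somewhere.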

Related

Unpickling error: pickle data was truncated - better way to save large dataframe

I have quite a large dataframe that I need to save. The size is approx 300 MB when I save it using pickle.
I read about some other ways of saving large dataframes. I am using bz2.BZ2File and I can see the file is now only 50 MB. However, when I try to load the data I get the following error:
UnpicklingError: pickle data was truncated
Is there a better way for saving a large dataframe?
Saving the dataframe as a CSV file can help. A dataframe contains more information than solely the data, so when pickling, such a dataframe is serialized together with that extra overhead, which takes up a lot of space that a CSV would not.
Notice that the method to_csv even supports compression, e.g. to save as a zip:
df.to_csv('filename.zip', compression='infer')
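On the original error itself: "pickle data was truncated" typically means the compressed stream was never flushed and finalized after writing. A minimal standard-library sketch (the filename and data below are placeholders for the real dataframe) using context managers, which guarantee the bz2 stream is closed properly:

```python
import bz2
import os
import pickle
import tempfile

data = {"rows": list(range(1000))}  # stand-in for a large dataframe

path = os.path.join(tempfile.mkdtemp(), "data.pkl.bz2")

# Writing inside a context manager flushes and finalizes the
# compressed stream; a missing close() is a common cause of
# "UnpicklingError: pickle data was truncated".
with bz2.BZ2File(path, "wb") as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

with bz2.BZ2File(path, "rb") as f:
    restored = pickle.load(f)

print(restored == data)  # True
```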

Converting a large csv to xml file

I have a large CSV file (30 GB) with 7 columns. Would there be another format to save the file so that the size is much smaller, given that the first few columns have the same values for many rows?
I was thinking about an XML file type. How do I convert this large CSV file to an XML file?
The solution I found involves the pandas package. But since the data is large, using pandas would not work on my 8 GB RAM laptop.
Pandas is an in-memory package, so the data must be smaller than the amount of RAM. Can you split the original 30 GB file into a collection of smaller files, and process in pandas one at a time? E.g., one file for each fund_ticker.
Dask supports out-of-memory processing for NumPy and pandas, but that is another layer of complexity. https://dask.org
Here is info from pandas docs on scaling to large data sets: https://pandas.pydata.org/docs/user_guide/scale.html
Finally, is a database an option for this use case?

Vaex Displaying Data

I have a 10.11 GB CSV file which I have converted to HDF5 using dask. It is a mixture of str, int and float values. When I try to read it with vaex I just get numbers, as shown in the screenshot. Can someone please help me out?
Screenshot:
I am not sure how dask (or dask.dataframe) stores data in HDF5 format. Pandas, for instance, stores the data in a row-based format. On the other hand, vaex expects column-based HDF5 files.
From your screenshot I see that your hdf5 file also preserves the index column - vaex does not have such a column, and expects just the data.
To ensure the HDF5 files work with vaex, it is best to use vaex itself to do the CSV->HDF5 conversion. Otherwise, perhaps something like Arrow will work, since it is a standard (while HDF5 can be more flexible and thus harder to support across all possible ways of storing data).

Large python dictionary. Storing, loading, and writing to it

I have a large python dictionary of values (around 50 GB), and I've stored it as a JSON file. I am having efficiency issues when it comes to opening the file and writing to the file. I know you can use ijson to read the file efficiently, but how can I write to it efficiently?
Should I even be using a Python dictionary to store my data? Is there a limit to how large a python dictionary can be? (the dictionary will get larger).
The data basically stores the path length between nodes in a large graph. I can't store the data as a graph because searching for a connection between two nodes takes too long.
Any help would be much appreciated. Thank you!
Although it will truly depend on what operations you want to perform on your network dataset, you might want to consider storing this as a pandas DataFrame and then writing it to disk using Parquet or Arrow.
That data could then be loaded to networkx or even to Spark (GraphX) for any network related operations.
Parquet is compressed and columnar and makes reading and writing to files much faster especially for large datasets.
From the Pandas Doc:
Apache Parquet provides a partitioned binary columnar serialization
for data frames. It is designed to make reading and writing data
frames efficient, and to make sharing data across data analysis
languages easy. Parquet can use a variety of compression techniques to
shrink the file size as much as possible while still maintaining good
read performance.
Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas dtypes, including extension dtypes such as datetime with tz.
Read further here: Pandas Parquet
Try to use it with pandas: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')
Convert a JSON string to a pandas object.
It is a very lightweight and useful way to work with large data.
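If the data must stay in JSON, one standard-library option for efficient appends is JSON Lines: one object per line, so writing new entries never rewrites the whole file, and reading can stream line by line (pandas.read_json accepts this format with lines=True). The filename and record layout below are illustrative:

```python
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "paths.jsonl")

# Stand-in for path lengths between node pairs in the graph.
pairs = {("a", "b"): 3, ("a", "c"): 5}

# Append mode: new records are added without touching existing ones.
with open(path, "a") as f:
    for (u, v), dist in pairs.items():
        f.write(json.dumps({"u": u, "v": v, "dist": dist}) + "\n")

# Streaming read: one line (one record) in memory at a time.
with open(path) as f:
    loaded = {(r["u"], r["v"]): r["dist"] for r in map(json.loads, f)}

print(loaded == pairs)  # True
```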

Reading file with huge number of columns in python

I have a huge CSV file with around 4 million columns and around 300 rows. The file size is about 4.3 GB. I want to read this file and run some machine learning algorithm on the data.
I tried reading the file via pandas read_csv in python, but it is taking a long time to read even a single row (I suspect due to the large number of columns). I checked a few other options like numpy fromfile, but nothing seems to work.
Can someone please suggest a way to load a file with this many columns in python?
Pandas/numpy should be able to handle that volume of data no problem. I hope you have at least 8GB of RAM on that machine. To import a CSV file with Numpy, try something like
data = np.loadtxt('test.csv', dtype=np.uint8, delimiter=',')
If there is missing data, np.genfromtxt might work instead. If none of these meet your needs and you have enough RAM to hold a duplicate of the data temporarily, you could first build a Python list of lists, one per row, using readline and str.split. Then pass that to pandas or numpy, assuming that's how you intend to operate on the data. You could then save it to disk in a format for easier ingestion later. HDF5 was already mentioned and is a good option. You can also save a numpy array to disk with numpy.savez or, my favorite, the speedy bloscpack.(un)pack_ndarray_file.
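The list-of-lists approach above can be sketched with only the standard library; `read_wide_rows` is an illustrative name, and it assumes every field is numeric with no header row:

```python
import csv
import os
import tempfile

def read_wide_rows(path):
    """Read a CSV with few rows but very many columns into a list
    of lists of floats, one inner list per row. The result can then
    be handed to numpy.array or pandas.DataFrame in one step."""
    with open(path, newline="") as f:
        return [[float(x) for x in row] for row in csv.reader(f)]

# Tiny demo: 2 rows x 5 columns, standing in for 300 x 4 million.
path = os.path.join(tempfile.mkdtemp(), "wide.csv")
with open(path, "w", newline="") as f:
    f.write("1,2,3,4,5\n6,7,8,9,10\n")
rows = read_wide_rows(path)
print(len(rows), len(rows[0]))  # 2 5
```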
CSV is very inefficient for storing large datasets. You should convert your CSV file into a better-suited format. Try HDF5 (h5py.org or pytables.org); it is very fast and allows you to read parts of the dataset without fully loading it into memory.
According to this answer, pandas (which you already tried) is the fastest library available to read a CSV in Python, or at least was in 2014.
