I am working on a very large dataset with 20 million+ records. I am trying to save all that data in the Feather format for faster access, and also to append to it as I proceed with my analysis.
Is there a way to append a pandas DataFrame to an existing Feather file?
Feather files are intended to be written in one go, so appending to them is not a supported use case.
For such a large dataset I would instead recommend writing the data to individual Apache Parquet files using pyarrow.parquet.write_table or pandas.DataFrame.to_parquet, and reading it back into pandas using pyarrow.parquet.ParquetDataset or pandas.read_parquet. These functions can treat a collection of Parquet files as a single dataset that is read at once into a single DataFrame.
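A minimal sketch of that pattern, where each chunk of the analysis is "appended" by writing it as its own Parquet file into one directory, and the whole directory is later read back as a single DataFrame (the paths and the chunks generator are illustrative stand-ins):

import os
import pandas as pd

os.makedirs("dataset", exist_ok=True)

# stand-in for whatever iterator of DataFrames your analysis produces
chunks = (pd.DataFrame({"id": range(i * 3, i * 3 + 3), "value": [i] * 3}) for i in range(3))

# "append" by writing one Parquet file per chunk as you go
for i, chunk in enumerate(chunks):
    chunk.to_parquet(f"dataset/part-{i:05d}.parquet", engine="pyarrow")

# later: the whole directory is read at once into a single DataFrame
df = pd.read_parquet("dataset", engine="pyarrow")
print(len(df))  # 9 rows collected from the three parts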
Related
I have a large dataset of around 6 GB that I have processed and cleaned using PySpark, and I now want to save it so I can use it elsewhere for machine learning.
I am trying to find the fastest way of saving the dataset.
I followed this link, but it is taking very long to save the csv or the parquet:
How to export a table dataframe in PySpark to csv?
Can someone please provide some information on how I can do this?
I have quite a large dataframe that I need to save. The size is approximately 300 MB when I save it using pickle.
I read about some other ways of saving large dataframes. I am using bz2.BZ2File and I can see the file is now only 50 MB. However, when I try to load the data I get the following error:
UnpicklingError: pickle data was truncated
Is there a better way to save a large dataframe?
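For reference, a pickle/bz2 round-trip of the kind described only loads cleanly if the compressed file is fully flushed and closed before it is read back; a minimal sketch (the dataframe and filename are illustrative), with pandas' built-in equivalent noted in the comments:

import bz2
import pickle
import pandas as pd

df = pd.DataFrame({"a": range(1000)})  # stand-in for the real dataframe

# write: the `with` block guarantees the compressed stream is flushed and closed
with bz2.BZ2File("df.pkl.bz2", "wb") as f:
    pickle.dump(df, f, protocol=pickle.HIGHEST_PROTOCOL)

# read it back the same way
with bz2.BZ2File("df.pkl.bz2", "rb") as f:
    df_loaded = pickle.load(f)

# pandas can also handle the compression itself:
# df.to_pickle("df.pkl.bz2", compression="bz2") / pd.read_pickle("df.pkl.bz2", compression="bz2")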
Saving the dataframe as a csv file can help. A dataframe contains more information than the data alone (index, dtypes and so on), so when it is pickled, all of that is serialized as well, which takes up space that a plain csv would not.
Note that the to_csv method even supports compression, e.g. to save as a zip:
df.to_csv('filename.zip', compression='infer')
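Reading it back works the same way, with the compression again inferred from the extension (filename taken from the example above):

import pandas as pd

# the zip written by to_csv contains a single csv file, which read_csv unpacks
df = pd.read_csv('filename.zip', compression='infer')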
I am trying to export a DataFrame that contains documents as byte objects; however, I cannot find a suitable file format that does not involve the relatively small DataFrame (memory usage: 254.3+ KB) expanding into something in the range of hundreds of MB, or even 1 GB+.
So far I have tried to export the DataFrame as CSV and HDF5.
The column causing this huge expansion contains either .pdf, .doc, .txt or .msg files in byte format:
b'%PDF-1.7\r%\xe2\xe3\xcf\xd3\r\n256...
which were initially stored on a SQL Server as varbinary(max) and loaded with pandas' default settings.
I have simply tried using pandas to export the DataFrame using:
df.to_csv('.csv') and
data_stored = pd.HDFStore('documents.h5')
data_stored['document'] = df
I wanted to keep the output data compact, as I would simply like to be able to load the data again at another time. The problem, however, is that the exports result in either a huge CSV or .h5 file. I guess there is some file format that preserves the format and size of a pd.DataFrame?
I ended up exporting using df.to_pickle. I also discovered that the dataframe was indeed much larger than I initially thought, since the pandas .info method does not include the enormous amount of overhead memory. To view the full memory usage instead, I used df.memory_usage(deep=True).sum(), and the dataframe indeed took up around 1.1 GB.
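A small sketch of that workflow (the byte content and filenames are made up):

import pandas as pd

# stand-in for the document column that was loaded from varbinary(max)
df = pd.DataFrame({"document": [b"%PDF-1.7\r%\xe2\xe3\xcf\xd3" * 10_000 for _ in range(10)]})

# .info() only reports the pointer overhead of the object column;
# deep=True also counts the bytes objects themselves
print(df.memory_usage(deep=True).sum())

# pickle stores the byte objects as-is instead of expanding them into text
df.to_pickle("documents.pkl")
df_loaded = pd.read_pickle("documents.pkl")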
I have a large python dictionary of values (around 50 GB), and I've stored it as a JSON file. I am having efficiency issues when it comes to opening the file and writing to the file. I know you can use ijson to read the file efficiently, but how can I write to it efficiently?
Should I even be using a Python dictionary to store my data? Is there a limit to how large a python dictionary can be? (the dictionary will get larger).
The data basically stores the path length between nodes in a large graph. I can't store the data as a graph because searching for a connection between two nodes takes too long.
Any help would be much appreciated. Thank you!
Although it will truly depend on what operations you want to perform on your network dataset, you might want to consider storing this as a pandas DataFrame and then writing it to disk using Parquet or Arrow; a short sketch follows the quoted docs below.
That data could then be loaded into networkx or even into Spark (GraphX) for any network-related operations.
Parquet is compressed and columnar and makes reading and writing files much faster, especially for large datasets.
From the Pandas Doc:
Apache Parquet provides a partitioned binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance.
Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas dtypes, including extension dtypes such as datetime with tz.
Read further here: Pandas Parquet
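A rough sketch of that idea, assuming the dictionary maps (source, target) node pairs to path lengths (all column and file names are illustrative):

import pandas as pd

# stand-in for the real 50 GB dictionary: {(source, target): path_length}
path_lengths = {("a", "b"): 3, ("a", "c"): 5, ("b", "c"): 2}

df = pd.DataFrame(
    [(s, t, d) for (s, t), d in path_lengths.items()],
    columns=["source", "target", "length"],
)

# compressed, columnar on-disk format; much faster to re-read than one huge JSON file
df.to_parquet("path_lengths.parquet", engine="pyarrow")
df_loaded = pd.read_parquet("path_lengths.parquet")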
Try using pandas for this: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')
Convert a JSON string to pandas object
It is a lightweight and useful library for working with large data.
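For a file of that size, the lines and chunksize parameters in the signature above are the relevant ones: if the data is stored as line-delimited JSON (one record per line), it can be processed piece by piece instead of being loaded all at once. A minimal sketch (the filename and per-chunk handling are illustrative):

import pandas as pd

# iterate over the file 100,000 records at a time instead of loading 50 GB at once
reader = pd.read_json("path_lengths.jsonl", lines=True, chunksize=100_000)
for chunk in reader:
    ...  # process each chunk here, e.g. append it to a Parquet dataset as above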
I am learning python pandas.
I saw a tutorial that shows two ways to save a pandas dataframe:
df.to_csv('sub.csv') and to open pd.read_csv('sub.csv')
df.to_pickle('sub.pkl') and to open pd.read_pickle('sub.pkl')
The tutorial says to_pickle is to save the dataframe to disk. I am confused about this, because when I use to_csv I also see a csv file appear in the folder, which I assume is also saved to disk, right?
In general, why would we want to save a dataframe using to_pickle rather than saving it to csv, txt, or some other format?
csv
✅ human readable
✅ cross platform
⛔ slower
⛔ more disk space
⛔ doesn't preserve types in some cases
pickle
✅ fast saving/loading
✅ less disk space
⛔ not human readable
⛔ Python only
Also take a look at the Parquet format (to_parquet, read_parquet):
✅ fast saving/loading
✅ less disk space than pickle
✅ supported by many platforms
⛔ not human readable
Pickle is a serialized way of storing a Pandas dataframe. Basically, you are writing down the exact representation of the dataframe to disk. This means the column types and the indices are preserved. If you simply save a file as csv, you are just storing it as comma-separated text, and depending on your data set, some information will be lost when you load it back up.
You can read more about the pickle library in Python here.
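A small illustration of that difference (filenames taken from the question, the example column is made up):

import pandas as pd

df = pd.DataFrame({"when": pd.to_datetime(["2020-01-01", "2020-01-02"]), "x": [1, 2]})

df.to_csv("sub.csv", index=False)
df.to_pickle("sub.pkl")

print(pd.read_csv("sub.csv").dtypes)     # "when" comes back as plain object/string
print(pd.read_pickle("sub.pkl").dtypes)  # "when" is still datetime64[ns]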