I do a lot of work with data that has DatetimeIndex and MultiIndex indexes. Saving to and reading from .csv is tedious: every time I save I have to reset_index and name the column "date", and every time I read I have to convert that column back to datetime and set it as the index. What format will help me avoid this? I'd prefer something open source; I think SAS and Stata can do this, for instance, but they are proprietary.
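For reference, the round trip being described looks roughly like this (file and column names are illustrative); note that even with CSV, read_csv's index_col and parse_dates arguments collapse the read side into a single call:

```python
import os
import tempfile

import pandas as pd

# Illustrative frame with a DatetimeIndex (column/index names are made up)
df = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=pd.date_range("2021-01-01", periods=3, name="date"),
)

path = os.path.join(tempfile.mkdtemp(), "data.csv")
df.to_csv(path)  # to_csv writes the named index as a regular column

# Read side: parse_dates + index_col restore the DatetimeIndex in one call
restored = pd.read_csv(path, index_col="date", parse_dates=["date"])

assert restored.index.equals(df.index)
```

This avoids the manual to_datetime/set_index dance, but the datetime values are still parsed from text on every read, which is what a binary format sidesteps.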
feather was made for this:
https://github.com/wesm/feather
Feather provides binary columnar serialization for data frames. It is
designed to make reading and writing data frames efficient, and to
make sharing data across data analysis languages easy. This initial
version comes with bindings for python (written by Wes McKinney) and R
(written by Hadley Wickham).
Feather uses the Apache Arrow columnar memory specification to
represent binary data on disk. This makes read and write operations
very fast. This is particularly important for encoding null/NA values
and variable-length types like UTF8 strings.
Feather is a part of the broader Apache Arrow project. Feather defines
its own simplified schemas and metadata for on-disk representation.
Feather currently supports the following column types:
- A wide range of numeric types (int8, int16, int32, int64, uint8, uint16, uint32, uint64, float, double)
- Logical/boolean values
- Dates, times, and timestamps
- Factors/categorical variables that have a fixed set of possible values
- UTF-8 encoded strings
- Arbitrary binary data
I have a Raspberry Pi 3B+ with 1 GB of RAM, on which I am running a Telegram bot.
This bot uses a database I store in CSV format, which contains about 100k rows and four columns:
The first two are used for searching.
The third is a result.
These use about 20-30 MB of RAM, which is acceptable.
The last column is the real problem: it shoots RAM usage up to 180 MB, which is unmanageable on the RPi. This column is also used for searching, but I only need it sometimes.
I started by loading the df with read_csv at the start of the script and then letting the script poll, but as the db grows I've realized this is too much for the RPi.
What do you think is the best way to handle this? Thanks!
Setting the dtype according to the data can reduce memory usage a lot.
With read_csv you can directly set the dtype for each column:
dtype : Type name or dict of column -> type, optional
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’}
Use str or object together with suitable
na_values settings to preserve and not interpret dtype. If converters
are specified, they will be applied INSTEAD of dtype conversion.
Example:
df = pd.read_csv(my_csv, dtype={0:'int8',1:'float32',2:'category'}, header=None)
See the next section on some dtype examples (with an existing df).
To change that on an existing dataframe you can use astype on the respective column.
Use df.info() to check the df memory usage before and after the change.
Some examples:
# Columns with integer values
# check first that the min & max of that column fit in the target int range so you don't lose info
df['column_with_integers'] = df['column_with_integers'].astype('int8')
# Columns with float values
# mind potential calculation precision; don't go too low
df['column_with_floats'] = df['column_with_floats'].astype('float32')
# Columns with categorical values (strings)
# e.g. when the rows repeatedly contain the same strings,
# like 'TeamRed', 'TeamBlue', 'TeamYellow' spread over 10k rows
df['Team_Name'] = df['Team_Name'].astype('category')
# Change boolean-like string columns to actual booleans
df['Yes_or_No'] = df['Yes_or_No'].map({'yes':True, 'no':False})
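To see the effect, compare memory_usage(deep=True) before and after one of these conversions; here is a small sketch using the made-up team names from the comment above:

```python
import pandas as pd

# 10k repeats of three team names, as in the example above
teams = pd.Series(["TeamRed", "TeamBlue", "TeamYellow"] * 10_000)

# object dtype stores one full Python string per row
as_object = teams.memory_usage(deep=True)

# category stores small integer codes plus the three labels once
as_category = teams.astype("category").memory_usage(deep=True)

print(as_object, as_category)
assert as_category < as_object
```

On data like this the categorical version is typically more than an order of magnitude smaller, which is exactly the kind of saving that matters on a 1 GB Raspberry Pi.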
Kudos to Medallion Data Science for his YouTube video Efficient Pandas Dataframes in Python - Make your code run fast with these tricks!, where I learned these tips.
Kudos to Robert Haas for the additional link in the comments to Pandas Scaling to large datasets - Use efficient datatypes
Not sure if this is a good idea in this case, but it might be worth trying.
The Dask package was designed to allow Pandas-like data analysis on dataframes that are too big to fit in memory (RAM), among other things. It does this by loading only chunks of the complete dataframe into memory at a time.
However, I'm not sure it was designed for machines like the Raspberry Pi (I'm not even sure there is a distribution for it).
The good thing is that Dask slots almost seamlessly into a Pandas script, so it might not be too much effort to try it out.
Here is a simple example I made before:
dask-examples.ipynb
If you try it let me know if it works, I'm also interested.
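If installing Dask on the Pi turns out to be a problem, the same idea, processing the CSV in chunks instead of loading it whole, is also available in plain pandas via read_csv's chunksize argument (the data and search condition here are made up):

```python
import io

import pandas as pd

# Stand-in for the bot's CSV; in reality this would be a file path
csv_data = io.StringIO("key,value\n" + "\n".join(f"k{i},{i}" for i in range(1000)))

matches = []
# With chunksize, read_csv returns an iterator of DataFrames
# instead of one big frame, so peak memory stays bounded
for chunk in pd.read_csv(csv_data, chunksize=100):
    # keep only rows matching the (hypothetical) search condition
    matches.append(chunk[chunk["value"] > 995])

result = pd.concat(matches)
print(len(result))  # 4 rows: 996..999
```

Memory usage is then proportional to the chunk size plus the matches, not to the full file.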
I had a CSV file of 33 GB, but after converting it to HDF5 format the file size dropped drastically to around 1.4 GB. I used the vaex library to read my dataset and then converted the vaex dataframe to a pandas dataframe. This conversion did not put much load on my RAM.
I wanted to ask: what did this process (CSV -> HDF5 -> pandas dataframe) do so that the pandas dataframe no longer takes up much memory, compared to when I was reading the pandas dataframe directly from the CSV file (CSV -> pandas dataframe)?
HDF5 compresses the data, which explains the lower amount of disk space used.
In terms of RAM, I wouldn't expect any difference at all; if anything, slightly more memory may be consumed by HDF5 due to computations related to format processing.
I highly doubt it has anything to do with compression; in fact, I would assume the file should be larger in HDF5 format, especially in the presence of numeric features.
How did you convert from CSV to HDF5? Are the number of columns and rows the same?
Assuming you converted it somehow with vaex, please check that you are not looking at a single "chunk" of data. Vaex does things in steps and then concatenates the results into a single file.
Also, if some columns are of an unsupported type, they might not be exported.
Doing some sanity checks will uncover more hints.
I have a 10.11 GB CSV file and I have converted it to HDF5 using dask. It is a mixture of str, int and float values. When I try to read it with vaex, I just get numbers, as shown in the screenshot. Can someone please help me out?
Screenshot:
I am not sure how dask (or dask.dataframe) stores data in HDF5 format. Pandas, for instance, stores the data in a row-based format, whereas vaex expects column-based HDF5 files.
From your screenshot I see that your HDF5 file also preserves the index column; vaex does not have such a column and expects just the data.
To ensure the HDF5 files work with vaex, it is best to use vaex itself to do the CSV -> HDF5 conversion. Otherwise something like Arrow may work, since it is a standard (while HDF5 can be more flexible and thus harder to support in all its possible variants of storing data).
I have a large python dictionary of values (around 50 GB), and I've stored it as a JSON file. I am having efficiency issues when it comes to opening the file and writing to the file. I know you can use ijson to read the file efficiently, but how can I write to it efficiently?
Should I even be using a Python dictionary to store my data? Is there a limit to how large a Python dictionary can be? (The dictionary will keep growing.)
The data basically stores the path length between nodes in a large graph. I can't store the data as a graph because searching for a connection between two nodes takes too long.
Any help would be much appreciated. Thank you!
Although it will ultimately depend on what operations you want to perform on your network dataset, you might want to consider storing it as a pandas DataFrame and then writing it to disk using Parquet or Arrow.
That data could then be loaded to networkx or even to Spark (GraphX) for any network related operations.
Parquet is compressed and columnar and makes reading and writing to files much faster especially for large datasets.
From the Pandas Doc:
Apache Parquet provides a partitioned binary columnar serialization
for data frames. It is designed to make reading and writing data
frames efficient, and to make sharing data across data analysis
languages easy. Parquet can use a variety of compression techniques to
shrink the file size as much as possible while still maintaining good
read performance.
Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas dtypes, including extension dtypes such as datetime with tz.
Read further here: Pandas Parquet
Try using it with pandas: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')
Convert a JSON string to pandas object
It is a very lightweight and useful library for working with large data.
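For a file as large as 50 GB, reading it in one go will still exhaust memory; if the data can be stored as line-delimited JSON, read_json can iterate over it in chunks (the record layout here is made up):

```python
import io

import pandas as pd

# Stand-in for a large JSON-lines file: one record per line
lines = "\n".join(
    '{"node_a": %d, "node_b": %d, "dist": 1}' % (i, i + 1) for i in range(10)
)
buf = io.StringIO(lines)

total = 0
# lines=True + chunksize returns a reader that yields DataFrames,
# so only one chunk is in memory at a time
for chunk in pd.read_json(buf, lines=True, chunksize=4):
    total += len(chunk)

print(total)  # 10
```

The same chunked pattern works on the write side by appending one chunk at a time, which avoids ever materializing the whole structure in memory.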
I am new to Python. What I have come to know is that both are used for serialization and deserialization. So I just want to know: what are the basic differences between them?
YAML is a language-neutral format that can represent primitive types (int, string, etc.) well, and is highly portable between languages. Kind of analogous to JSON, XML or a plain-text file; just with some useful formatting conventions mixed in -- in fact, YAML is a superset of JSON.
Pickle format is specific to Python and can represent a wide variety of data structures and objects, e.g. Python lists, sets and dictionaries; instances of Python classes; and combinations of these like lists of objects; objects containing dicts containing lists; etc.
So basically:
- YAML represents simple data types & structures in a language-portable manner
- pickle can represent complex structures, but in a non-language-portable manner
There's more to it than that, but you asked for the "basic" difference.
pickle is a Python-specific serialization format that converts a Python object into a byte stream and back:
“Pickling” is the process whereby a Python object hierarchy is
converted into a byte stream, and “unpickling” is the inverse
operation, whereby a byte stream is converted back into an object
hierarchy.
The main point is that it is python specific.
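A minimal round trip, using a made-up nested structure that plain JSON could not represent directly (a set value and a tuple key):

```python
import pickle

# Python-specific structure: a dict holding a set, keyed partly by a tuple
data = {"nodes": {"a", "b", "c"}, ("a", "b"): 1}

blob = pickle.dumps(data)      # object hierarchy -> byte stream
restored = pickle.loads(blob)  # byte stream -> object hierarchy

assert restored == data
print(type(blob))  # <class 'bytes'>
```

Only unpickle data you trust: pickle.loads can execute arbitrary code from a crafted byte stream.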
On the other hand, YAML is language-agnostic and human-readable serialization format.
FYI, if you are choosing between these formats, think about:
serialization/deserialization speed (see the cPickle module in Python 2; in Python 3, pickle uses the C implementation automatically)
do you need to store serialized files in a human-readable form?
what are you going to serialize? If it's a python-specific complex data structure, for example, then you should go with pickle.
See also:
Python serialization - Why pickle?
Lightweight pickle for basic types in python?
If it is not important for you that a person can read the files, and you just need to save a file and read it back, use pickle. It is much faster, and the binaries are smaller.
YAML files are more readable, as mentioned above, but also slower and larger.
I have tested this for my application: I measured the time to dump an object to a file and load it back, as well as the file size.
| Serialization/deserialization method | Average time, s | Size of file, kB |
|---|---|---|
| PyYAML | 1.73 | 1149.358 |
| pickle | 0.004 | 690.658 |
As you can see, the YAML output is 1.67 times heavier and 432.5 times slower.
P.S. This is for my data; in your case it may be different, but it's enough for a comparison.