Convert large hdf5 dataset written via pandas/pytables to vaex - python

I have a very large dataset I write to hdf5 in chunks via append like so:
with pd.HDFStore(self.train_store_path) as train_store:
for filepath in tqdm(filepaths):
with open(filepath, 'rb') as file:
frame = pickle.load(file)
if frame.empty:
os.remove(filepath)
continue
try:
train_store.append(
key='dataset', value=frame,
min_itemsize=itemsize_dict)
os.remove(filepath)
except KeyError as e:
print(e)
except ValueError as e:
print(frame)
print(e)
except Exception as e:
print(e)
The data is far too large to load into one DataFrame, so I would like to try out vaex for further processing. There's a few things I don't really understand though.
Since vaex uses a different representation in hdf5 than pandas/pytables (VOTable), I'm wondering how to go about converting between those two formats. I tried loading the data in chunks into pandas, converting it to a vaex DataFrame and then storing it, but there seems to be no way to append data to an existing vaex hdf5 file, at least none that I could find.
Is there really no way to create a large hdf5 dataset from within vaex? Is the only option to convert an existing dataset to vaex' representation (constructing the file via a python script or TOPCAT)?
Related to my previous question, if I work with a large dataset in vaex out-of-core, is it possible to then persist the results of any transformations i apply in vaex into the hdf5 file?

The problem with this storage format is that it is not column-based, which does not play well with datasets with large number of rows, since if you only work with 1 column, for instance, the OS will probably also read large portions of the other columns, as well as the CPU cache gets polluted with it. It would be better to store them to a column based format such as vaex' hdf5 format, or arrow.
Converting to a vaex dataframe can done using:
import vaex
vaex_df = vaex.from_pandas(pandas_df, copy_index=False)
You can do this for each dataframe, and store them on disk as hdf5 or arrow:
vaex_df.export('batch_1.hdf5') # or 'batch_1.arrow'
If you do this for many files, you can lazily (i.e. no memory copies will be made) concatenate them, or use the vaex.open function:
df1 = vaex.open('batch_1.hdf5')
df2 = vaex.open('batch_2.hdf5')
df = vaex.concat([df1, df2]) # will be seen as 1 dataframe without mem copy
df_altnerative = vaex.open('batch*.hdf5') # same effect, but only needs 1 line
Regarding your question about the transformations:
If you do transformations to a dataframe, you can write out the computed values, or get the 'state', which includes the transformations:
import vaex
df = vaex.example()
df['difference'] = df.x - df.y
# df.export('materialized.hdf5', column_names=['difference']) # do this if IO is fast, and memory abundant
# state = df.state_get() # get state in memory
df.state_write('mystate.json') # or write as json
import vaex
df = vaex.example()
# df.join(vaex.open('materialized.hdf5')) # join on rows number (super fast, 0 memory use!)
# df.state_set(state) # or apply the state from memory
df.state_load('mystate.json') # or from disk
df

Related

How do I create a Dask DataFrame partition by partition and write portions to disk while the DataFrame is still incomplete?

I have python code for data analysis that iterates through hundreds of datasets, does some computation and produces a result as a pandas DataFrame, and then concatenates all the results together. I am currently working with a set of data where these results are too large to fit into memory, so I'm trying to switch from pandas to Dask.
The problem is that I have looked through the Dask documentation and done some Googling and I can't really figure out how to create a Dask DataFrame iteratively like how I described above in a way that will take advantage of Dask's ability to only keep portions of the DataFrame in memory. Everything I see assumes that you either have all the data already stored in some format on disk, or that you have all the data in memory and now want to save it to disk.
What's the best way to approach this? My current code using pandas looks something like this:
def process_data(data) -> pd.DataFrame:
# Do stuff
return df
dfs = []
for data in datasets:
result = process_data(data)
dfs.append(result)
final_result = pd.concat(dfs)
final_result.to_csv("result.csv")
Expanding from #MichelDelgado comment, the correct approach should somethign like this:
import dask.dataframe as dd
from dask.delayed import delayed
def process_data(data) -> pd.DataFrame:
# Do stuff
return df
delayed_dfs = []
for data in datasets:
result = delayed(process_data)(data)
delayed_dfs.append(result)
ddf = dd.from_delayed(delayed_dfs)
ddf.to_csv('export-*.csv')
Note that this would created multiple CSV files, one per input partition.
You can find documentation here: https://docs.dask.org/en/stable/delayed-collections.html.
Also, be careful to actually read the data into the process function. So the data argument in the code above should only be an identifier, like a file path or equivalent.

How to calculate Pandas Dataframe size on disk before writing as parquet?

Using python 3.9 with Pandas 1.4.3 and PyArrow 8.0.0.
I have a couple parquet files (all with the same schema) which I would like to merge up to a certain threshold (not fixed size, but not higher than the threshold).
I have a directory, lets call it input that contains parquet files.
Now, if I use os.path.getsize(path) I get the size on disk, but merging 2 files and taking the sum of that size (i.e os.path.getsize(path1) + os.path.getsize(path2)) naturally won't yield good result due to the metadata and other things.
I've tried the following to see if I can have some sort of indication about the file size before writing it into parquet.
print(df.info())
print(df.memory_usage().sum())
print(df.memory_usage(deep=True).sum())
print(sys.getsizeof(df))
print(df.values.nbytes + df.index.nbytes + df.columns.nbytes)
Im aware that the size is heavily depended on compression, engine, schema, etc, so for that I would like to simply have a factor.
Simply put, if I want a threshold of 1mb per file, ill have a 4mb actual threshold since I assume that the compression will compress the data by 75% (4mb -> 1mb)
So in total i'll have something like
compressed_threshold_in_mb = 1
compression_factor = 4
and the condition to keep appending data into a merged dataframe would be by checking the multiplication of the two, i.e:
if total_accumulated_size > compressed_threshold_in_mb * compression_factor:
assuming total_accumulated_size is the accumulator of how much the dataframe will weigh on disk
You can save the data frame to parquet in memory to have an exact idea of how much data it is going to use:
import io
import pandas as pd
def get_parquet_size(df: pd.DataFrame) -> int:
with io.BytesIO() as buffer:
df.to_parquet(buffer)
return buffer.tell()

Split large dataframes (pandas) into chunks (but after grouping)

I have a large tabular data, which needs to be merged and splitted by group. The easy method is to use pandas, but the only problem is memory.
I have this code to merge dataframes:
import pandas as pd;
from functools import reduce;
large_df = pd.read_table('large_file.csv', sep=',')
This, basically load the whole data in memory th
# Then I could group the pandas dataframe by some column value (say "block" )
df_by_block = large_df.groupby("block")
# and then write the data by blocks as
for block_id, block_val in df_by_block:
pd.Dataframe.to_csv(df_by_block, "df_" + str(block_id), sep="\t", index=False)
The only problem with above code is memory allocation, which freezes my desktop. I tried to transfer this code to dask but dask doesn't have a neat groupby implementation.
Note: I could have just sorted the file, then read the data line by line and split as the "block" value changes. But, the only problem is that "large_df.txt" is created in the pipeline upstream by merging several dataframes.
Any suggestions?
Thanks,
Update:
I tried the following approach but, it still seems to be memory heavy:
# find unique values in the column of interest (which is to be "grouped by")
large_df_contig = large_df['contig']
contig_list = list(large_df_contig.unique().compute())
# groupby the dataframe
large_df_grouped = large_df.set_index('contig')
# now, split dataframes
for items in contig_list:
my_df = large_df_grouped.loc[items].compute().reset_index()
pd.DataFrame.to_csv(my_df, 'dask_output/my_df_' + str(items), sep='\t', index=False)
Everything is fine, but the code
my_df = large_df_grouped.loc[items].compute().reset_index()
seems to be pulling everything into the memory again.
Any way to improve this code??
but dask doesn't have a neat groupb
Actually, dask does have groupby + user defined functions with OOM reshuffling.
You can use
large_df.groupby(something).apply(write_to_disk)
where write_to_disk is some short function writing the block to the disk. By default, dask uses disk shuffling in these cases (as opposed to network shuffling). Note that this operation might be slow, and it can still fail if the size of a single group exceeds your memory.

Python Pandas memory - subsetting and releasing main data frame

I have Pandas data frame big data frame loaded into memory.
Trying to utilize memory more efficient way.
For this purposes, i won't use this data frame after i will subset from this data frame only rows i am interested are:
DF = pd.read_csv("Test.csv")
DF = DF[DF['A'] == 'Y']
Already tried this solution but not sure if it most effective.
Is solution above is most efficient for memory usage?
Please advice.
you can try the following trick (if you can read the whole CSV file into memory):
DF = pd.read_csv("Test.csv").query("A == 'Y'")
Alternatively, you can read your data in chunks, using read_csv()
But i would strongly recommend you to save your data in HDF5 Table format (you may also want to compress it) - then you could read your data conditionally, using where parameter in read_hdf() function.
For example:
df = pd.read_hdf('/path/to/my_storage.h5', 'my_data', where="A == 'Y'")
Here you can find some examples and a comparison of usage for different storage options

Large, persistent DataFrame in pandas

I am exploring switching to python and pandas as a long-time SAS user.
However, when running some tests today, I was surprised that python ran out of memory when trying to pandas.read_csv() a 128mb csv file. It had about 200,000 rows and 200 columns of mostly numeric data.
With SAS, I can import a csv file into a SAS dataset and it can be as large as my hard drive.
Is there something analogous in pandas?
I regularly work with large files and do not have access to a distributed computing network.
Wes is of course right! I'm just chiming in to provide a little more complete example code. I had the same issue with a 129 Mb file, which was solved by:
import pandas as pd
tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000) # gives TextFileReader, which is iterable with chunks of 1000 rows.
df = pd.concat(tp, ignore_index=True) # df is DataFrame. If errors, do `list(tp)` instead of `tp`
In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague but it's been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row-by-row into a pre-allocated NumPy array or memory-mapped file--np.mmap), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000) then concatenate then with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
This is an older thread, but I just wanted to dump my workaround solution here. I initially tried the chunksize parameter (even with quite small values like 10000), but it didn't help much; had still technical issues with the memory size (my CSV was ~ 7.5 Gb).
Right now, I just read chunks of the CSV files in a for-loop approach and add them e.g., to an SQLite database step by step:
import pandas as pd
import sqlite3
from pandas.io import sql
import subprocess
# In and output file paths
in_csv = '../data/my_large.csv'
out_sqlite = '../data/my.sqlite'
table_name = 'my_table' # name for the SQLite database table
chunksize = 100000 # number of lines to process at each iteration
# columns that should be read from the CSV file
columns = ['molecule_id','charge','db','drugsnow','hba','hbd','loc','nrb','smiles']
# Get number of lines in the CSV file
nlines = subprocess.check_output('wc -l %s' % in_csv, shell=True)
nlines = int(nlines.split()[0])
# connect to database
cnx = sqlite3.connect(out_sqlite)
# Iteratively read CSV and dump lines into the SQLite table
for i in range(0, nlines, chunksize):
df = pd.read_csv(in_csv,
header=None, # no header, define column header manually later
nrows=chunksize, # number of rows to read at each iteration
skiprows=i) # skip rows that were already read
# columns to read
df.columns = columns
sql.to_sql(df,
name=table_name,
con=cnx,
index=False, # don't use CSV file index
index_label='molecule_id', # use a unique column from DataFrame as index
if_exists='append')
cnx.close()
Below is my working flow.
import sqlalchemy as sa
import pandas as pd
import psycopg2
count = 0
con = sa.create_engine('postgresql://postgres:pwd#localhost:00001/r')
#con = sa.create_engine('sqlite:///XXXXX.db') SQLite
chunks = pd.read_csv('..file', chunksize=10000, encoding="ISO-8859-1",
sep=',', error_bad_lines=False, index_col=False, dtype='unicode')
Base on your file size, you'd better optimized the chunksize.
for chunk in chunks:
chunk.to_sql(name='Table', if_exists='append', con=con)
count += 1
print(count)
After have all data in Database, You can query out those you need from database.
If you want to load huge csv files, dask might be a good option. It mimics the pandas api, so it feels quite similar to pandas
link to dask on github
You can use Pytable rather than pandas df.
It is designed for large data sets and the file format is in hdf5.
So the processing time is relatively fast.

Categories