I'm trying to create a Keras Tokenizer out of a single column from hundreds of large CSV files. Dask seems like a good tool for this. My current approach eventually causes memory issues:
import dask.dataframe as dd
from tensorflow.keras.preprocessing.text import Tokenizer

df = dd.read_csv('data/*.csv', usecols=['MyCol'])
# Process column and get underlying Numpy array.
# This greatly reduces memory consumption, but eventually materializes
# the entire dataset into memory
my_ids = df.MyCol.apply(process_my_col).compute().values
tokenizer = Tokenizer()
tokenizer.fit_on_texts(my_ids)
How can I do this by parts? Something along the lines of:
df = pd.read_csv('a-single-file.csv', chunksize=1000)
for chunk in df:
    # Process a chunk at a time
A Dask DataFrame is technically a set of pandas DataFrames, called partitions. When you pull out the underlying NumPy array you destroy that partitioning structure and end up with one big array. I recommend using the map_partitions function of Dask DataFrames to apply regular pandas functions to each partition separately.
I also recommend map_partitions when it suits your problem. However, if you really just want sequential access and an API similar to read_csv(chunksize=...), then you might be looking for the partitions attribute:
for part in df.partitions:
    process(model, part.compute())
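For the tokenizer use case above, that loop might look roughly like the following sketch (assuming process_my_col returns the strings that fit_on_texts expects; Tokenizer accumulates its vocabulary across repeated fit_on_texts calls, so it can be fed one partition at a time):

import dask.dataframe as dd
from tensorflow.keras.preprocessing.text import Tokenizer

df = dd.read_csv('data/*.csv', usecols=['MyCol'])
tokenizer = Tokenizer()

for part in df.partitions:
    # materialize only one partition at a time instead of the whole column
    texts = part.MyCol.apply(process_my_col).compute().values
    tokenizer.fit_on_texts(texts)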
A common task in my daily data wrangling is converting tab-delimited text files to xarray datasets, continuing the analysis on the dataset, and saving it to zarr or netCDF format.
I have developed a data pipeline that reads the data into a dask dataframe, converts it to a dask array, and assigns it to an xarray dataset.
The data is two-dimensional along dimensions "time" and "subid".
ddf = dd.read_table(some_text_file, ...)
data = ddf.to_dask_array(lengths=True)
attrs = {"about": "some attribute"}
ds = xr.Dataset({"parameter": (["time", "subid"], data, attrs)})
Now, this is working fine as intended. However, recently we have been doing many computational heavy operations along the time dimension such that we often want to rechunk the data along that dimension, so we usually follow this up by:
ds_chunked = ds.chunk({"time": -1, "subid": "auto"})
and then either save to disk directly or do some analysis first and then save to disk.
This is however causing quite a bottleneck, as the rechunking adds significant time to the pipeline generating the datasets. I am aware that there is no work-around for speeding up rechunking and that one should avoid it when looking for performance improvements. So my question is really: can anyone think of a smart way to avoid it or possibly improve its speed? I've looked into reading the data into partitions by column, but I haven't found any info on dask dataframes reading data in partitions along columns. If I could "read the data column-wise" (i.e. along the time dimension), I wouldn't have to rechunk.
To generate some data for this example you can replace the read_table with something like ddf = dask.datasets.timeseries().
I tried reading the data into a dask dataframe, but I have to rechunk every time, so I am looking for a way to read the data column-wise (along one dimension) from the start.
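For reference, a minimal reproducible version of the pipeline, using dask.datasets.timeseries as a stand-in for the real read_table call as suggested above (keeping only the numeric columns x and y so that to_dask_array yields a single dtype; those column names are just what the synthetic dataset happens to provide):

import dask
import xarray as xr

ddf = dask.datasets.timeseries()[["x", "y"]]   # stand-in for dd.read_table(...)
data = ddf.to_dask_array(lengths=True)         # chunked along "time" only
attrs = {"about": "some attribute"}
ds = xr.Dataset({"parameter": (["time", "subid"], data, attrs)})
# the expensive step: every chunk along "time" has to be combined
ds_chunked = ds.chunk({"time": -1, "subid": "auto"})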
I have the following problem:
The downside:
I have a large amount of data that does not fit into a pandas df in memory.
The upside:
The data is mostly independent of each other. The only restriction is that all elements with the same id have to be calculated together in one chunk.
What I tried:
Dask looked like a perfect fit for this problem.
I have a dask kubernetes cluster with multiple workers and no problem loading the data from the SQL database into a dask df.
The calculation itself is not that easy to implement in dask because some functions are missing or problematic in dask (e.g. MultiIndex, pivot). However, because the data is mostly independent, I tried to do the calculations chunkwise in pandas. When I call .set_index("id") on the dask df, all equal ids end up in the same partition. So I wanted to iterate over the partitions, do the calculations on a (temporary) pandas df, and store the result right away.
The code basically looks like this:
from dask import dataframe as dd
from distributed import Client
client = Client("kubernetes address")

# Loading:
futures = []
for i in range(x):
    df_chunk = load_from_sql
    future = client.scatter(df_chunk)
    futures.append(future)

dask_df = dd.from_delayed(futures, meta=df_chunk)
dask_df = dask_df.set_index("id")
dask_df.map_partitions(lambda part: calculation(part)).compute()
where
def calculation(part):
    part = ...  # do something in pandas
    part.to_csv / part.to_sql  # store the data somewhere
    del part / client.cancel(part)  # release memory of the temporary pandas df
With small amounts of data this runs smoothly, but with a lot of data the workers' memory fills up and the process stops with a cancelled error.
What am I doing wrong, or are there alternatives in dask for working memory-efficiently with chunkwise data from a dask df?
Loading the data directly in chunks from the database is currently not an option.
I am completely stuck, hence I am looking for kind advice.
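For what it's worth, one pattern that matches the approach described above is to let the per-partition function write its own output and return only a tiny summary frame, so the single compute() at the end never collects the full data on the client. A sketch, in which the meta argument, the file naming, and the rows_written column are purely illustrative:

import pandas as pd

def calculation(part: pd.DataFrame) -> pd.DataFrame:
    # do the pandas-only work on this partition
    result = part  # placeholder for the real computation
    # store the result straight from the worker
    result.to_csv(f"result_{result.index.min()}_{result.index.max()}.csv")
    # return only a small marker so the final compute() stays cheap
    return pd.DataFrame({"rows_written": [len(result)]})

summary = dask_df.map_partitions(calculation, meta={"rows_written": "int64"}).compute()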
My aim is to read many hdf5 files in parallel, extract the multi-dim arrays inside, and store each array in one row, or more precisely one cell, of a dask dataframe. I don't opt for a pandas df because I believe it would be too big.
It is not possible to read hdf5 files created with h5py using dask's read_hdf().
What could I do to import thousands of hdf5 files with dask in parallel and get access to the multi-dim arrays inside?
I would like to create a dask dataframe, where each 2d-array (extracted from the n-dim arrays inside the hdfs) is stored in one cell of a dask dataframe.
Consequently, the number of rows corresponds to the total number of arrays found in all files, here 9. I store the arrays in one column.
In the future I would like to append more columns with other data to this dask dataframe, and I would like to operate on the arrays with another Python library and store the results in other columns of the dask dataframe. The dataframe should contain all the information I extract and manipulate. I would also like to add data from other hdf5 files. Like a mini database. Is this reasonable?
I can work in parallel because the arrays are independent of each other.
How would you realise this, please? xarray was suggested to me as well, but I don't know what the best way is.
Earlier I tried to collect all arrays in a multi-dimensional dask array, but the conversion to a dataframe is only possible for ndim=2.
Thank you for your advice. Have a good day.
import numpy as np
import h5py
import dask.dataframe as dd
import dask.array as da
import dask
print('This is dask version', dask.__version__)
ra=np.ones([10,3199,4000])
print(ra.shape)
file_list=[]
for i in range(0, 4):
    #print(i)
    fstr = 'data_{0}.h5'.format(str(i))
    #print(fstr)
    hf = h5py.File('./' + fstr, 'w')
    hf.create_dataset('dataset_{0}'.format(str(i)), data=ra)
    hf.close()
    file_list.append(fstr)

!ls
print(file_list)

for i, fn in enumerate(file_list):
    dd.read_hdf(fn, key='dataset_{0}'.format(str(i)))  # breaks here
You can pre-process the data into dataframes using dask.distributed and then convert the futures to a single dask.dataframe using dask.dataframe.from_delayed.
from dask.distributed import Client
import dask.dataframe as dd
import xarray as xr
client = Client()
def preprocess_hdf_file_to_dataframe(filepath):
    # process your data into a dataframe however you want, e.g.
    with xr.open_dataset(filepath) as ds:
        return ds.to_dataframe()
files = ['file1.hdf5', 'file2.hdf5']
futures = client.map(preprocess_hdf_file_to_dataframe, files)
df = dd.from_delayed(futures)
That said, this seems like a perfect use case for xarray, which can read HDF5 files and work with dask natively, e.g.
ds = xr.open_mfdataset(files)
This dataset is similar to a dask.dataframe, in that it contains references to dask.arrays which are read from the file. But xarray is built to handle N-dimensional arrays natively and can work much more naturally with the HDF5 format.
There are certainly areas where dataframes make more sense than a Dataset or DataArray, though, and converting between them can be tricky with larger-than-memory data, so the first approach is always an option if you want a dataframe.
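If you do want the array-in-one-cell layout from the question, the preprocessing function could be a hypothetical variant along the following lines, reusing client, dd, and file_list from the snippets above (the helper name and the "file"/"array" column names are made up, and each 2D slice ends up in an object-dtype cell):

import h5py
import pandas as pd

def hdf_to_array_cells(filepath):
    with h5py.File(filepath, 'r') as hf:
        key = list(hf.keys())[0]  # the single dataset created in the example files
        arr = hf[key][()]         # loads the whole (10, 3199, 4000) array
    # one row per 2D slice, stored as an object column
    return pd.DataFrame({'file': filepath, 'array': list(arr)})

futures = client.map(hdf_to_array_cells, file_list)
df = dd.from_delayed(futures)

Whether object-dtype cells scale well is a separate question; for heavy array math, the xarray route above is usually the more natural fit.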
I am trying to use groupby and apply a custom function on a huge dataset, which is giving me memory errors, and the workers are getting killed because of the shuffling. How can I avoid the shuffle and do this efficiently?
I am reading around fifty parquet files (700 MB each), and the data in those files is isolated, i.e. no group exists in more than one file. If I try running my code on one file, it works fine but fails when I try to run it on the complete dataset.
Dask documentation talks about problems with groupby when you apply a custom function on groups, but they do not offer a solution for such data:
http://docs.dask.org/en/latest/dataframe-groupby.html#difficult-cases
How can I process my dataset in a reasonable timeframe (it takes around 6 minutes for a groupby-apply on a single file) and hopefully avoid the shuffle? I do not need my results to be sorted, nor do I need groupby to sort my complete dataset across the different files.
I have tried using persist, but the data does not fit into RAM (32 GB). Dask does not support a multi-column index, but I tried adding an index on one column to help the groupby, to no avail. Below is what the structure of the code looks like:
import pandas as pd
from dask.dataframe import read_parquet
df = read_parquet('s3://s3_directory_path')
results = df.groupby(['A', 'B']).apply(custom_function).compute()
# custom_function sorts the data within a group (the groups are small, less than 50 entries) on a field and computes some values based on heuristics (it computes 4 values, but I am showing 1 in the example below; the other 3 calculations are similar)
def custom_function(group):
    results = {}
    sorted_group = group.sort_values(['C']).reset_index(drop=True)
    sorted_group['delta'] = sorted_group['D'].diff()
    sorted_group.delta = sorted_group.delta.shift(-1)
    results['res1'] = (sorted_group[sorted_group.delta < -100]['D'].sum() - sorted_group.iloc[0]['D'])
    # similarly 3 more results are generated
    results_df = pd.DataFrame(results, index=[0])
    return results_df
One possibility is that I process one file at a time and do it multiple times, but in that case dask seems useless (no parallel processing) and it will take hours to achieve the desired results. Is there any way to do this efficiently using dask, or any other library? How do people deal with such data?
If you want to avoid shuffling and can promise that groups are well isolated, then you could just call a pandas groupby apply across every partition with map_partitions
df.map_partitions(lambda part: part.groupby(...).apply(...))
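Concretely, assuming each file maps onto its own partition (read_parquet typically gives one partition per file or row group) and reusing custom_function from the question, that could look like the sketch below; the meta frame is an assumption and should list all four result columns:

import pandas as pd
from dask.dataframe import read_parquet

df = read_parquet('s3://s3_directory_path')

# meta describes the per-partition output; only res1 is shown in the question
meta = pd.DataFrame({'res1': pd.Series(dtype='float64')})

results = df.map_partitions(
    lambda part: part.groupby(['A', 'B']).apply(custom_function),
    meta=meta,
).compute()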
Update:
The pandas df was created like this:
df = pd.read_sql(query, engine)
encoded = pd.get_dummies(df, columns=['account'])
Creating a dask df from this df looks like this:
df = dd.from_pandas(encoded, 50)
Performing the operation with dask results in no visible progress being made (checking with dask diagnostics):
result = df.groupby('journal_entry').max().reset_index().compute()
Original:
I have a large pandas df with 2.7M rows and 4,000 columns. All but four of the columns are of dtype uint8. The uint8 columns only hold values of 1 or 0. I am attempting to perform this operation on the df:
result = df.groupby('id').max().reset_index()
Predictably, this operation immediately returns a memory error. My initial thought is to chunk the df both horizontally and vertically. However, this creates a messy situation, since the .max() needs to be performed across all the uint8 columns, not just a pair of columns. In addition, it is still extremely slow to chunk the df like this. I have 32 GB of RAM on my machine.
What strategy could mitigate the memory issue?
If you have any categorical columns in your data (rather than categories stored as object columns or strings), make sure you use the observed=True option in your groupby call. This makes sure it only creates rows where an entry is present, e.g. only one row per customer_id, order_id combination, rather than creating n_custs * n_orders rows!
I just did a groupby-sum on a 26M row dataset, never going above 7GB of RAM. Before adding the observed=True option, it was going up to 62GB and then running out.
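In code, using the hypothetical customer_id/order_id example from above, the option is passed like this:

# without observed=True, pandas builds a row for every possible category combination
result = df.groupby(['customer_id', 'order_id'], observed=True).sum().reset_index()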
You could use dask.dataframe for this task:
import dask.dataframe as dd
df = dd.from_pandas(df, npartitions=50)  # split the frame into partitions
result = df.groupby('id').max().reset_index().compute()
All you need to do is convert your pandas.DataFrame into a dask.dataframe. Dask is a Python out-of-core parallelization framework that offers various parallelized container types, one of which is the dataframe. It lets you perform most common pandas.DataFrame operations in parallel and/or distributed with data that is too large to fit in memory. The core of dask is a set of schedulers and an API for building computation graphs, hence we have to call .compute() at the end in order for any computation to actually take place. The library is easy to install because it is written in pure Python for the most part.
As an idea, I would suggest splitting the data column-wise into, say, four subsets, keeping the id in each subset, performing the operation on each, and then re-merging the results.
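A rough sketch of that idea in plain pandas, assuming every non-id column takes part in the max (the four-way split is arbitrary):

import numpy as np
import pandas as pd

value_cols = [c for c in df.columns if c != 'id']
pieces = []
for cols in np.array_split(value_cols, 4):        # four column-wise subsets
    pieces.append(df[['id', *cols]].groupby('id').max())
result = pd.concat(pieces, axis=1).reset_index()  # re-merge on the id index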