I'm trying to use dask to calculate summaries of data stored in a dataset split across about 1,000 parquet files, each between 1 MB and 10 MB. When I convert a series into an array and compute the max on that array, it works great. However, when I try to concatenate two arrays, I quickly run out of memory:
import dask.dataframe as dd
import dask.array as da
data = dd.read_parquet('../data/*.parquet')
field_1 = data['field_1'].to_dask_array()
field_2 = data['field_2'].to_dask_array()
both_fields = da.concatenate((field_1, field_2))
result = da.max(both_fields).compute()
I've tried using Client, but this is very slow and gives a lot of warnings about garbage collection taking too much CPU time. However, it does seem to run without running out of memory.
I'm new to dask, so perhaps I've misunderstood something?
If there is no specific reason for using arrays, then this might work better:
both_fields = dd.concat([data['field_1'], data['field_2']], axis=0)
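A minimal sketch of how this fits with the rest of the question's code (assuming the same data dataframe as above), computing the max without going through dask.array:
import dask.dataframe as dd

data = dd.read_parquet('../data/*.parquet')
# concatenate the two columns end-to-end as one long dask Series
both_fields = dd.concat([data['field_1'], data['field_2']], axis=0)
# reduce lazily, then pull back only the scalar result
result = both_fields.max().compute()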
I'm trying a small POC to group by and aggregate data from a large CSV in both pandas and Dask, and I'm observing high memory usage and/or slower processing times than I would expect... does anyone have any tips for a python/pandas/dask noob to improve this?
Background
I have a request to build a file ingestion tool that would:
Be able to take in files of a few GB, where each row contains a user id and some other info
do some transformations
reduce the data to { user -> [collection of info]}
send batches of this data to our web services
Based on my research, since the files are only a few GB, I found that Spark etc. would be overkill and that Pandas/Dask may be a good fit, hence the POC.
Problem
Processing a 1GB csv takes ~1 min for both pandas and Dask, and consumes 1.5GB ram for pandas and 9GB ram for dask (!!!)
Processing a 2GB csv takes ~3 mins and 2.8GB ram for pandas, Dask crashes!
What am I doing wrong here?
for pandas, since I'm processing the CSV in small chunks, I did not expect the RAM usage to be so high
for Dask, everything I read online suggested that Dask processes the CSV in blocks of the size indicated by blocksize, so I expected RAM usage to be roughly the blocksize times the number of blocks held in memory at once; I wouldn't expect the total to be 9GB when the block size is only 6.4MB (see the rough arithmetic below). I don't know why on earth its RAM usage skyrockets to 9GB for a 1GB csv input
(Note: if I don't set the block size dask crashes even on input 1GB)
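To spell out that back-of-the-envelope expectation (the worker and thread counts here are made up purely for illustration, not measured):
# rough arithmetic behind the expectation above (hypothetical worker counts)
blocksize = 6.4e6            # bytes per block, as set in the Dask code below
blocks_in_flight = 8 * 4     # e.g. 8 workers x 4 threads, one block each
expected_bytes = blocksize * blocks_in_flight
print(expected_bytes / 1e9)  # ~0.2 GB, nowhere near the observed 9 GB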
My code
I can't share the CSV, but it has 1 integer column followed by 8 text columns. Both user_id and order_id columns referenced below are text columns.
1GB csv has 14000001 lines
2GB csv has 28000001 lines
5GB csv has 70000001 lines
I generated these CSVs with random data; the user_id column is randomly picked from 10 pre-generated values, so I'd expect the final output to be 10 user ids, each with a collection of who knows how many order ids.
Pandas
#!/usr/bin/env python3
from pandas import DataFrame, read_csv
import pandas as pd
import sys
test_csv_location = '1gb.csv'
chunk_size = 100000
pieces = list()
for chunk in pd.read_csv(test_csv_location, chunksize=chunk_size, delimiter='|', iterator=True):
    df = chunk.groupby('user_id')['order_id'].agg(size=len, list=lambda x: list(x))
    pieces.append(df)
final = pd.concat(pieces).groupby('user_id')['list'].agg(size=len, list=sum)
final.to_csv('pandastest.csv', index=False)
Dask
#!/usr/bin/env python3
from dask.distributed import Client
import dask.dataframe as ddf
import sys
test_csv_location = '1gb.csv'
df = ddf.read_csv(test_csv_location, blocksize=6400000, delimiter='|')
# For each user, reduce to a list of order ids
grouped = df.groupby('user_id')
collection = grouped['order_id'].apply(list, meta=('order_id', 'f8'))
collection.to_csv('./dasktest.csv', single_file=True)
The groupby operation is expensive because dask will try to shuffle data across workers to check who has which user_id values. If user_id has a lot of unique values (sounds like it), there are a lot of cross-checks to be done across workers/partitions.
There are at least two ways out of it:
set user_id as the index. This will be expensive during the indexing stage, but subsequent operations will be faster because dask no longer has to check each partition for which user_id values it contains.
df = df.set_index('user_id')
collection = df.groupby('user_id')['order_id'].apply(list, meta=('order_id', 'f8'))
collection.to_csv('./dasktest.csv', single_file=True)
if your files have a structure that you know about, e.g. as an extreme example, if the user_id values are somewhat sorted, so that the csv file first contains only user_id values that start with 1 (or A, or whatever other symbols are used), then with 2, etc., then you could use that information to form partitions in 'blocks' (loose term) so that the groupby is only needed within those 'blocks'; see the sketch below.
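A loose, hypothetical sketch of that second idea (it only works if all rows for a given user_id really do land in the same partition): confine the groupby to each partition with map_partitions, so no cross-partition shuffle is needed.
import dask.dataframe as ddf

df = ddf.read_csv('1gb.csv', blocksize=6400000, delimiter='|')

def collect_orders(part):
    # plain pandas groupby inside a single partition
    return part.groupby('user_id')['order_id'].apply(list).reset_index()

collection = df.map_partitions(
    collect_orders,
    meta={'user_id': 'object', 'order_id': 'object'},
)
collection.to_csv('./dasktest.csv', single_file=True)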
References:
https://examples.dask.org/applications/forecasting-with-prophet.html?highlight=prophet
https://facebook.github.io/prophet/
A few things to note:
I've got a total of 48gb of ram
Here are the versions of the libraries I'm using:
Python 3.7.7
dask==2.18.0
fbprophet==0.6
pandas==1.0.3
The only reason I'm importing pandas is for this one line: pd.options.mode.chained_assignment = None
This helps avoid dask erroring when I'm using dask.distributed
So, I have a 21gb csv file that I am reading using dask and jupyter notebook...
I've tried to read it from my mysql database table, however, the kernel eventually crashes
I've tried multiple combinations of workers, threads, available memory, and available storage_memory on my local cluster, and even tried not using distributed at all. I have also tried chunking with pandas (not via the pandas line mentioned above); however, even with chunking, the kernel still crashes...
I can now load the csv with dask and apply a few transformations, such as setting the index and adding the column names that fbprophet requires... but I am still not able to compute the dataframe with df.compute(), which I think is why I am receiving the error from fbprophet. After I have added the columns y and ds with the appropriate dtypes, I receive the error Truth of Delayed objects is not supported. I think this is because fbprophet expects the dataframe not to be lazy, which is why I'm trying to run compute beforehand. I have also bumped up the RAM on the client to allow it to use the full 48 GB, as I suspected that it may be trying to load the data twice; however, this still failed, so most likely this wasn't the case / isn't causing the problem.
Alongside this, fbprophet is also mentioned in the dask documentation for applying machine learning to dataframes; however, I really don't understand why this isn't working... I've also tried modin with ray and with dask, with basically the same result.
Another question... regarding memory usage
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 32.35 GB -- Worker memory limit: 25.00 GB
I am getting this warning when assigning the client, reading the csv file, and applying operations/transformations to the dataframe; however, the allotted size is larger than the csv file itself, so this confuses me...
What I have done to try and solve this myself:
- Googling of course, did not find anything :-/
- Asking a discord help channel, on multiple occasions
- Asking an IRC help channel, on multiple occasions
Anyways, would really appreciate any help on this problem!!!
Thank you in advance :)
MCVE
from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd
from fbprophet import Prophet
pd.options.mode.chained_assignment = None
client = Client(n_workers=2, threads_per_worker=4, processes=False, memory_limit='4GB')
csv_file = 'provide_your_own_csv_file_here.csv'
df = dd.read_csv(csv_file, parse_dates=['Time (UTC)'])
df = df.set_index('Time (UTC)')
df['y'] = df[['a','b']].mean(axis=1)
m = Prophet(daily_seasonality=True)
m.fit(df)
# ERROR: Truth of Delayed objects is not supported
Unfortunately Prophet doesn't support Dask dataframes today.
The example that you refer to shows using Dask to accelerate Prophet's fitting on Pandas dataframes. Dask Dataframe is only one way that people use Dask.
As already suggested, one approach is to use dask.delayed with a pandas DataFrame, and skip dask.dataframe.
You could use a simplified version of the load-clean-analyze pipeline shown for custom computations using Dask.
Here is one possible approach based on this type of custom pipeline, using a small dataset (to create an MCVE); every step in the pipeline will be delayed.
Imports
import numpy as np
import pandas as pd
from dask import delayed
from dask.distributed import Client
from fbprophet import Prophet
Generate some data in a .csv, with column names Time (UTC), a and b
def generate_csv(nrows, fname):
    df = pd.DataFrame(np.random.rand(nrows, 2), columns=["a", "b"])
    df["Time (UTC)"] = pd.date_range(start="1850-01-01", periods=nrows)
    df.to_csv(fname, index=False)
First write the load function from the pipeline, to load the .csv with Pandas, and delay its execution using the dask.delayed decorator
might be good to use read_csv with nrows to see how the pipeline performs on a subset of the data, rather than loading it all
this will return a dask.delayed object and not a pandas.DataFrame
@delayed
def load_data(fname, nrows=None):
    return pd.read_csv(fname, nrows=nrows)
Now create the process function, to process data using pandas, again delayed since its input is a dask.delayed object and not a pandas.DataFrame
@delayed
def process_data(df):
    df = df.rename(columns={"Time (UTC)": "ds"})
    df["y"] = df[["a", "b"]].mean(axis=1)
    return df
Last function - this one will train fbprophet on the data (loaded from the .csv and processed, but delayed) to make a forecast. This analyze function is also delayed, since one of its inputs is a dask.delayed object
@delayed
def analyze(df, horizon):
    m = Prophet(daily_seasonality=True)
    m.fit(df)
    future = m.make_future_dataframe(periods=horizon)
    forecast = m.predict(future)
    return forecast
Run the pipeline (if running from a Python script, requires __name__ == "__main__")
the output of the pipeline (a forecast by fbprophet) is stored in a variable result, which is delayed
when this output is evaluated with result.compute(), it will generate a pandas.DataFrame (corresponding to the output of a forecast by fbprophet)
if __name__ == "__main__":
    horizon = 8
    num_rows_data = 40
    num_rows_to_load = 35
    csv_fname = "my_file.csv"
    generate_csv(num_rows_data, csv_fname)
    client = Client()  # modify this as required
    df = load_data(csv_fname, nrows=num_rows_to_load)
    df = process_data(df)
    result = analyze(df, horizon)
    forecast = result.compute()
    client.close()
    assert len(forecast) == num_rows_to_load + horizon
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].head())
Output
ds yhat yhat_lower yhat_upper
0 1850-01-01 0.330649 0.095788 0.573378
1 1850-01-02 0.493025 0.266692 0.724632
2 1850-01-03 0.573344 0.348953 0.822692
3 1850-01-04 0.491388 0.246458 0.712400
4 1850-01-05 0.307939 0.066030 0.548981
I have large tabular data that needs to be merged and split by group. The easy method is to use pandas, but the only problem is memory.
I have this code to merge dataframes:
import pandas as pd
from functools import reduce
large_df = pd.read_table('large_file.csv', sep=',')
This basically loads the whole data into memory.
# Then I could group the pandas dataframe by some column value (say "block" )
df_by_block = large_df.groupby("block")
# and then write the data by blocks as
for block_id, block_val in df_by_block:
    block_val.to_csv("df_" + str(block_id), sep="\t", index=False)
The only problem with the above code is memory allocation, which freezes my desktop. I tried to port this code to dask, but dask doesn't have a neat groupby implementation.
Note: I could have just sorted the file, then read the data line by line and split as the "block" value changes. But, the only problem is that "large_df.txt" is created in the pipeline upstream by merging several dataframes.
Any suggestions?
Thanks,
Update:
I tried the following approach but, it still seems to be memory heavy:
# find unique values in the column of interest (which is to be "grouped by")
large_df_contig = large_df['contig']
contig_list = list(large_df_contig.unique().compute())
# set the group column as the index
large_df_grouped = large_df.set_index('contig')
# now, split dataframes
for items in contig_list:
    my_df = large_df_grouped.loc[items].compute().reset_index()
    my_df.to_csv('dask_output/my_df_' + str(items), sep='\t', index=False)
Everything is fine, but the code
my_df = large_df_grouped.loc[items].compute().reset_index()
seems to be pulling everything into the memory again.
Any way to improve this code??
but dask doesn't have a neat groupby implementation
Actually, dask does have groupby + user-defined functions, with out-of-memory (disk-based) reshuffling.
You can use
large_df.groupby(something).apply(write_to_disk)
where write_to_disk is some short function writing the block to the disk. By default, dask uses disk shuffling in these cases (as opposed to network shuffling). Note that this operation might be slow, and it can still fail if the size of a single group exceeds your memory.
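A rough sketch of that pattern (the column name contig and the output folder are taken from the question's update; write_to_disk is a hypothetical helper):
import dask.dataframe as dd

large_df = dd.read_csv('large_file.csv', sep=',')

def write_to_disk(group):
    # group arrives as a plain pandas DataFrame holding one contig value
    name = group['contig'].iloc[0]
    group.to_csv('dask_output/my_df_' + str(name), sep='\t', index=False)
    return 0  # return something tiny so dask doesn't keep the data around

large_df.groupby('contig').apply(write_to_disk, meta=('written', 'i8')).compute()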
The following operation works, but takes nearly 2h:
from dask import dataframe as ddf
ddf.read_csv('data.csv').to_parquet('data.pq')
Is there a way to parallelize this?
The file data.csv is ~2G uncompressed with 16 million rows by 22 columns.
I'm not sure if it is a problem with the data or not. I made a toy example on my machine, and the same command takes ~9 seconds:
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask.distributed import Client
import dask
client = Client()
# if you wish to connect to the dashboard
client
# fake df size ~2.1 GB
# takes ~180 seconds
N = int(5e6)
df = pd.DataFrame({i: np.random.rand(N) for i in range(22)})
df.to_csv("data.csv", index=False)
# the following takes ~9 seconds on my machine
dd.read_csv("data.csv").to_parquet("data_pq")
read_csv has a blocksize parameter (docs) that you can use to control the size of the resulting partitions and hence their number. From what I understand, this results in the partitions being read in parallel: each worker reads a block of that size at the relevant offset.
You can set blocksize so that it yields the required number of partitions to take advantage of the cores you have. For example:
import os
import numpy as np
import dask.dataframe as dd

cores = 8
size = os.path.getsize('data.csv')
ddf = dd.read_csv("data.csv", blocksize=int(np.rint(size / cores)))
print(ddf.npartitions)
Will output:
8
Better yet, you can try to modify the blocksize so that the resulting parquet has partitions of the recommended size (I have seen different numbers recommended in different places :-|). For example:
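For instance, a rough sketch targeting partitions of roughly 100 MB (one of the commonly cited figures), using a human-readable blocksize string:
import dask.dataframe as dd

# ~100 MB blocks of the CSV become ~100 MB partitions in the parquet output
ddf = dd.read_csv('data.csv', blocksize='100MB')
ddf.to_parquet('data_pq')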
Writing out multiple Parquet files in parallel is also easy with the repartition method:
df = dd.read_csv('data.csv')
df = df.repartition(npartitions=20)
df.to_parquet('./data_pq', write_index=False, compression='snappy')
Dask likes working with partitions that are around 100 MB, so 20 partitions should be good for a 2GB dataset.
You could also speed this up, by splitting the CSV before reading it, so Dask can read the CSV files in parallel. You could use these tactics to break up the 2GB CSV into 20 different CSVs and then write them out without repartitioning:
df = dd.read_csv('folder_with_small_csvs/*.csv')
df.to_parquet('./data_pq', write_index=False, compression='snappy')
I need to read large (15+ GB) NetCDF files into a program. Each file holds a 3D variable (e.g. Time as the record dimension, with the data being latitudes by longitudes).
I'm processing the data in a 3-level nested loop (checking whether each block of the NetCDF passes a certain criterion). For example:
from netCDF4 import Dataset
import numpy as np
File = Dataset('Somebigfile.nc', 'r')
Data = File.variables['Wind'][:]
Getdimensions = np.shape(Data)
Time = Getdimensions[0]
Latdim = Getdimensions[1]
Longdim = Getdimensions[2]
for t in range(0, Time):
    for i in range(0, Latdim):
        for j in range(0, Longdim):
            if Data[t, i, j] > Somethreshold:
                # Do something
                pass
Is there any way I can read in the NetCDF file one time record at a time, reducing the memory usage hugely? Any help is greatly appreciated.
I know of NCO operators but would prefer not to use these methods to break up files before using the script.
It sounds like you've already settled on a solution, but I'll throw out a much more elegant and vectorized (likely faster) solution that uses xarray and dask. Your nested for loop is going to be very inefficient. Combining xarray and dask, you can work on the data in your file incrementally in a semi-vectorized manner.
Since your Do something step isn't all that specific, you'll have to extrapolate from my example.
import xarray as xr
# xarray will open your file but doesn't load in any data until you ask for it
# dask handles the chunking and memory management for you
# chunk size can be optimized for your specific dataset.
ds = xr.open_dataset('Somebigfile.nc', chunks={'time': 100})
# mask out values below the threshold
da_thresh = ds['Wind'].where(ds['Wind'] > Somethreshold)
# Now just operate on the values greater than your threshold
do_something(da_thresh)
Xarray/Dask docs: http://xarray.pydata.org/en/stable/dask.html
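As a concrete (hypothetical) stand-in for the do_something step, the dimension names 'lat' and 'lon' below are assumptions about the file; the reduction stays lazy until .compute() is called:
# count how many grid cells exceed the threshold at each time step
exceedances = (ds['Wind'] > Somethreshold).sum(dim=['lat', 'lon'])
result = exceedances.compute()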