I am trying to start using dask for handling large data sets in some ML projects. Loading singular CSV files into a dask dataframe works fine. When i try to use multiple CSV files, any "compute" like operation causes the program to hang indefinitely.
This runs fine
import dask.dataframe as dd
import pandas as pd
import dask
from dask.distributed import Client
client = Client(processes=False)
df = dd.read_csv('sftp://somestuff//4120109.csv')
shape = dask.delayed(print)(df.shape)
shape.compute()
Output: (3600, 3723)
The following code hangs indefinitely
import dask.dataframe as dd
import pandas as pd
import dask
from dask.distributed import Client
client = Client(processes=False)
df = dd.read_csv('sftp://somestuff//412010*.csv')
shape = dask.delayed(print)(df.shape)
shape.compute()
It should load the 10 files that match and give a shape of (36000, 3273)
I know it hangs specifically on the shape.compute() line after putting in some choice print lines. Any help would be greatly appreciated!!!
You should not mix dask.delayed and dask.dataframe. Probably you just wanted to call dask.compute(df.shape)
https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections
Related
References:
https://examples.dask.org/applications/forecasting-with-prophet.html?highlight=prophet
https://facebook.github.io/prophet/
A few things to note:
I've got a total of 48gb of ram
Here are my versions of the libraries im using
Python 3.7.7
dask==2.18.0
fbprophet==0.6
pandas==1.0.3
The reason im import pandas is for this line only pd.options.mode.chained_assignment = None
This helps with dask erroring when im using dask.distributed
So, I have a 21gb csv file that I am reading using dask and jupyter notebook...
I've tried to read it from my mysql database table, however, the kernel eventually crashes
I've tried multiple combinations of using my local network of workers, threads, and available memory, available storage_memory, and even tried not using distributed at all. I have also tried chunking with pandas (not with the line mentioned above related to pandas), however, even with chunking, the kernel still crashes...
I can now load the csv with dask, and apply a few transformations, such as setting the index, adding the column (names) that fbprophet requires... but I am still not able to compute the dataframe with df.compute(), as this is why I think I am receiving the error I am with fbprophet. After I have added the columns y, and ds, with the appropriate dtypes, I receive the error Truth of Delayed objects is not supported, and I think this is because fbprophet expects the dataframe to not be lazy, which is why im trying to run compute beforehand. I have also bumped up the ram on the client to allow it to use the full 48gb, as I suspected that it may be trying to load the data twice, however, this still failed, so most likely this wasn't the case / isn't causing the problem.
Alongside this, fbpropphet is also mentioned in the documentation of dask for applying machine learning to dataframes, however, I really don't understand why this isn't working... I've also tried modin with ray, and with dask, with basically the same result.
Another question... regarding memory usage
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 32.35 GB -- Worker memory limit: 25.00 GB
I am getting this error when assigning the client, reading the csv file, and applying operations/transformations to the dataframe, however the allotted size is larger than the csv file itself, so this confuses me...
What I have done to try and solve this myself:
- Googling of course, did not find anything :-/
- Asking a discord help channel, on multiple occasions
- Asking IIRC help channel, on multiple occasions
Anyways, would really appreciate any help on this problem!!!
Thank you in advance :)
MCVE
from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd
from fbprophet import Prophet
pd.options.mode.chained_assignment = None
client = Client(n_workers=2, threads_per_worker=4, processes=False, memory_limit='4GB')
csv_file = 'provide_your_own_csv_file_here.csv'
df = dd.read_csv(csv_file, parse_dates=['Time (UTC)'])
df = df.set_index('Time (UTC)')
df['y'] = df[['a','b']].mean(axis=1)
m = Prophet(daily_seasonality=True)
m.fit(df)
# ERROR: Truth of Delayed objects is not supported
Unfortunately Prophet doesn't support Dask dataframes today.
The example that you refer to shows using Dask to accelerate Prophet's fitting on Pandas dataframes. Dask Dataframe is only one way that people use Dask.
As already suggested, one approach is to use dask.delayed with a pandas DataFrame, and skip dask.dataframe.
You could use a simplified version of the load-clean-analyze pipeline shown for custom computations using Dask.
Here is one possible approach based on this type of custom pipeline, using a small dataset (to create a MCVE) - every step in the pipeline will be delayed
Imports
import numpy as np
import pandas as pd
from dask import delayed
from dask.distributed import Client
from fbprophet import Prophet
Generate some data in a .csv, with column names Time (UTC), a and b
def generate_csv(nrows, fname):
df = pd.DataFrame(np.random.rand(nrows, 2), columns=["a", "b"])
df["Time (UTC)"] = pd.date_range(start="1850-01-01", periods=nrows)
df.to_csv(fname, index=False)
First write the load function from the pipeline, to load the .csv with Pandas, and delay its execution using the dask.delayed decorator
might be good to use read_csv with nrows to see how the pipeline performs on a subset of the data, rather than loading it all
this will return a dask.delayed object and not a pandas.DataFrame
#delayed
def load_data(fname, nrows=None):
return pd.read_csv(fname, nrows=nrows)
Now create the process function, to process data using pandas, again delayed since its input is a dask.delayed object and not a pandas.DataFrame
#delayed
def process_data(df):
df = df.rename(columns={"Time (UTC)": "ds"})
df["y"] = df[["a", "b"]].mean(axis=1)
return df
Last function - this one will train fbprophet on the data (loaded from the .csv and processed, but delayed) to make a forecast. This analyze function is also delayed, since one of its inputs is a dask.delayed object
#delayed
def analyze(df, horizon):
m = Prophet(daily_seasonality=True)
m.fit(df)
future = m.make_future_dataframe(periods=horizon)
forecast = m.predict(future)
return forecast
Run the pipeline (if running from a Python script, requires __name__ == "__main__")
the output of the pipeline (a forecast by fbprophet) is stored in a variable result, which is delayed
when this output is computed, this will generate a pandas.DataFrame (corresponding to the output of a forecast by fbprophet), so it can be evaluated using result.compute()
if __name__ == "__main__":
horizon = 8
num_rows_data = 40
num_rows_to_load = 35
csv_fname = "my_file.csv"
generate_csv(num_rows_data, csv_fname)
client = Client() # modify this as required
df = load_data(csv_fname, nrows=num_rows_to_load)
df = process_data(df)
result = analyze(df, horizon)
forecast = result.compute()
client.close()
assert len(forecast) == num_rows_to_load + horizon
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].head())
Output
ds yhat yhat_lower yhat_upper
0 1850-01-01 0.330649 0.095788 0.573378
1 1850-01-02 0.493025 0.266692 0.724632
2 1850-01-03 0.573344 0.348953 0.822692
3 1850-01-04 0.491388 0.246458 0.712400
4 1850-01-05 0.307939 0.066030 0.548981
I'm trying to read a bunch of large csv files (multiple files) from google storage. I use the Dask distribution library for parallel computation, but the problem I'm facing here is, though I mention the blocksize (100mb), I'm not sure how to read partition by partition and save it to my postgres database so that I don't want overload my memory.
from dask.distributed import Client
from dask.diagnostics import ProgressBar
client = Client(processes=False)
import dask.dataframe as dd
def read_csv_gcs():
with ProgressBar():
df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
pd = df.compute(scheduler='threads')
return pd
def write_df_to_db(df):
try:
from sqlalchemy import create_engine
engine = create_engine('postgresql://usr:pass#localhost:5432/sampledb')
df.to_sql('sampletable', engine, if_exists='replace',index=False)
except Exception as e:
print(e)
pass
pd = read_csv_gcs()
write_df_to_db(pd)
The above code is my basic implementation, but as said I would like to read it in chunk and update the db. Something like
df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
for chunk in df:
write_it_to_db(chunk)
Is it possible to do it in Dask? or should I go for pandas's chunksize and iterate, then save it to DB (But I miss parallel computation here)?
Can someone shed some light?
This line
df.compute(scheduler='threads')
says: load the data in chunks in worker threads, and concatenate them all into a single in-memory dataframe, df. This is not what you wanted. You wanted to insert the chunks as they come and then drop them from memory.
You probably wanted to use map_partitions
df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
df.map_partitions(write_it_to_db).compute()
or use df.to_delayed().
Note that, depending on your SQL driver, you might not be able to get parallelism this way, and if not, the pandas iter-chunk method would have worked just as well.
I have the below code. It uses dask distributed to read 100 json files:(Workers: 5
Cores: 5
Memory: 50.00 GB)
from dask.distributed import Client
import dask.dataframe as dd
client = Client('xxxxxxxx:8786')
df = dd.read_json('gs://xxxxxx/2018-04-18/data-*.json')
df = client.persist(df)
When I run the code, I only see one worker takes up the read_json() task, and then I got memory error and got WorkerKilled error.
Should I manually read each file and concat them? or is dask supposed to do it under-the-hood?
You may want to use dask.bag instead of dask.dataframe
import json
import dask.bag as db
mybag = db.read_text('gs://xxxxxx/2018-04-18/data-*.json').map(json.loads)
After that you can convert the bag into a dask dataframe with
mybag.to_dataframe()
This may require some additional uses of dask.map to get the structure right.
If your data is hadoop style json (aka one object per line), the bag trick will still work but you may need to operate on individual lines.
I have a csv file on client . To make it's data accessible to worker nodes on different machines ,I am using client.scatter with reference to
Loading local file from client onto dask distributed cluster .
This is my code .
import pandas as pd
import time
import dask as dask
import dask.distributed as distributed
import dask.dataframe as dd
import dask.delayed as delayed
from dask.distributed import Client,progress
client = Client('scheduler:port')
df = pd.read_csv('file.csv')
[f]= client.scatter([df]) # send dataframe to one worker
ddf = dd.from_delayed([f], meta=df).repartition(npartitions=100).persist()
ddf = ddf.groupby(['col1'])[['col2']].sum()
future=client.compute(ddf)
print future
progress(future)
result = client.gather(future)
print result
This code is running fine for a csv file with 1 million records but for a csv file with 60 million records ,it is taking too long .Am I making any mistake ? Any help would be appreciated.
I'm converting a large textfile to a hdf storage in hopes of a faster data access. The conversion works allright, however reading from the csv file is not done in parallel. It is really slow (takes about 30min for a 1GB textfile on an SSD, so my guess is that it is not IO-bound).
Is there a way to have it read in multiple threads in parralel?
Sice it might be important, I'm currently forced to run under Windows -- just in case that makes any difference.
from dask import dataframe as ddf
df = ddf.read_csv("data/Measurements*.csv",
sep=';',
parse_dates=["DATETIME"],
blocksize=1000000,
)
df.categorize([ 'Type',
'Condition',
])
df.to_hdf("data/data.hdf", "Measurements", 'w')
Yes, dask.dataframe can read in parallel. However you're running into two problems:
Pandas.read_csv only partially releases the GIL
By default dask.dataframe parallelizes with threads because most of Pandas can run in parallel in multiple threads (releases the GIL). Pandas.read_csv is an exception, especially if your resulting dataframes use object dtypes for text
dask.dataframe.to_hdf(filename) forces sequential computation
Writing to a single HDF file will force sequential computation (it's very hard to write to a single file in parallel.)
Edit: New solution
Today I would avoid HDF and use Parquet instead. I would probably use the multiprocessing or dask.distributed schedulers to avoid GIL issues on a single machine. The combination of these two should give you full linear scaling.
from dask.distributed import Client
client = Client()
df = dask.dataframe.read_csv(...)
df.to_parquet(...)
Solution
Because your dataset likely fits in memory, use dask.dataframe.read_csv to load in parallel with multiple processes, then switch immediately to Pandas.
import dask.dataframe as ddf
import dask.multiprocessing
df = ddf.read_csv("data/Measurements*.csv", # read in parallel
sep=';',
parse_dates=["DATETIME"],
blocksize=1000000,
)
df = df.compute(get=dask.multiprocessing.get) # convert to pandas
df['Type'] = df['Type'].astype('category')
df['Condition'] = df['Condition'].astype('category')
df.to_hdf('data/data.hdf', 'Measurements', format='table', mode='w')
Piggybacking off of #MRocklin's answer, in newer versions of dask, you can use df.compute(scheduler='processes') or df.compute(scheduler='threads') to convert to pandas using multiprocessing or multithreading:
from dask import dataframe as ddf
df = ddf.read_csv("data/Measurements*.csv",
sep=';',
parse_dates=["DATETIME"],
blocksize=1000000,
)
df = df.compute(scheduler='processes') # convert to pandas
df['Type'] = df['Type'].astype('category')
df['Condition'] = df['Condition'].astype('category')
df.to_hdf('data/data.hdf', 'Measurements', format='table', mode='w')