I have three sources and a Dask DataFrame for each of them. I need to apply a function that computes an operation combining data from all three sources. The operation needs internal state to be calculated (I can't change that).
The three sources are in parquet format and I read the data using read_parquet Dask Dataframe function:
import dask
import dask.dataframe as dd

@dask.delayed
def load_data(data_path):
    ddf = dd.read_parquet(data_path, engine="pyarrow")
    return ddf

results = []
sources_path = ["/source1", "/source2", "/source3"]
for source_path in sources_path:
    data = load_data(source_path)
    results.append(data)
I create another delayed function that executes the operation:
@dask.delayed
def process(sources):
    operation(sources[0][<list of columns>], sources[1][<list of columns>], sources[2][<list of columns>])
The operation function comes from a custom library. It cannot actually be parallelized because it has internal state.
According to the Dask documentation, using delayed functions on Dask DataFrames like this is not a best practice.
Is there a way to apply a custom function to multiple Dask DataFrames without using a delayed function?
I have python code for data analysis that iterates through hundreds of datasets, does some computation and produces a result as a pandas DataFrame, and then concatenates all the results together. I am currently working with a set of data where these results are too large to fit into memory, so I'm trying to switch from pandas to Dask.
The problem is that I have looked through the Dask documentation and done some Googling and I can't really figure out how to create a Dask DataFrame iteratively like how I described above in a way that will take advantage of Dask's ability to only keep portions of the DataFrame in memory. Everything I see assumes that you either have all the data already stored in some format on disk, or that you have all the data in memory and now want to save it to disk.
What's the best way to approach this? My current code using pandas looks something like this:
def process_data(data) -> pd.DataFrame:
    # Do stuff
    return df

dfs = []
for data in datasets:
    result = process_data(data)
    dfs.append(result)
final_result = pd.concat(dfs)
final_result.to_csv("result.csv")
Expanding on @MichelDelgado's comment, the correct approach should be something like this:
import dask.dataframe as dd
import pandas as pd
from dask.delayed import delayed

def process_data(data) -> pd.DataFrame:
    # Do stuff
    return df

delayed_dfs = []
for data in datasets:
    # Wrap the call so it runs lazily, once per dataset
    result = delayed(process_data)(data)
    delayed_dfs.append(result)
ddf = dd.from_delayed(delayed_dfs)
ddf.to_csv('export-*.csv')
Note that this would create multiple CSV files, one per partition (here, one per input dataset).
You can find documentation here: https://docs.dask.org/en/stable/delayed-collections.html.
Also, be careful to actually read the data inside the process function. The data argument in the code above should therefore only be an identifier, such as a file path or equivalent, so that the data itself is loaded lazily within each task.
References:
https://examples.dask.org/applications/forecasting-with-prophet.html?highlight=prophet
https://facebook.github.io/prophet/
A few things to note:
I've got a total of 48 GB of RAM.
Here are the versions of the libraries I'm using:
Python 3.7.7
dask==2.18.0
fbprophet==0.6
pandas==1.0.3
The only reason I import pandas is for this line: pd.options.mode.chained_assignment = None
This helps with Dask erroring when I'm using dask.distributed.
I have a 21 GB CSV file that I am reading using Dask in a Jupyter notebook.
I've also tried reading it from my MySQL database table, but the kernel eventually crashes.
I've tried multiple combinations of workers, threads, available memory and available storage memory on my local network, and even tried not using distributed at all. I have also tried chunking with pandas (not with the line mentioned above related to pandas), but even with chunking the kernel still crashes.
I can now load the CSV with Dask and apply a few transformations, such as setting the index and adding the columns that fbprophet requires, but I am still not able to compute the dataframe with df.compute(). I think this is why I am receiving the error from fbprophet: after adding the columns y and ds with the appropriate dtypes, I get the error Truth of Delayed objects is not supported. I believe this is because fbprophet expects the dataframe not to be lazy, which is why I'm trying to run compute beforehand. I also increased the RAM available to the client so it can use the full 48 GB, as I suspected it might be trying to load the data twice, but this still failed, so that most likely isn't the cause.
In addition, fbprophet is mentioned in the Dask documentation on applying machine learning to dataframes, so I really don't understand why this isn't working. I've also tried Modin with Ray and with Dask, with basically the same result.
Another question... regarding memory usage
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 32.35 GB -- Worker memory limit: 25.00 GB
I am getting this error when assigning the client, reading the csv file, and applying operations/transformations to the dataframe, however the allotted size is larger than the csv file itself, so this confuses me...
What I have done to try and solve this myself:
- Googling of course, did not find anything :-/
- Asking a discord help channel, on multiple occasions
- Asking an IRC help channel, on multiple occasions
Anyways, would really appreciate any help on this problem!!!
Thank you in advance :)
MCVE
from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd
from fbprophet import Prophet
pd.options.mode.chained_assignment = None
client = Client(n_workers=2, threads_per_worker=4, processes=False, memory_limit='4GB')
csv_file = 'provide_your_own_csv_file_here.csv'
df = dd.read_csv(csv_file, parse_dates=['Time (UTC)'])
df = df.set_index('Time (UTC)')
df['y'] = df[['a','b']].mean(axis=1)
m = Prophet(daily_seasonality=True)
m.fit(df)
# ERROR: Truth of Delayed objects is not supported
Unfortunately Prophet doesn't support Dask dataframes today.
The example that you refer to shows using Dask to accelerate Prophet's fitting on Pandas dataframes. Dask Dataframe is only one way that people use Dask.
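For context, the error in the question is what Dask raises whenever a lazy Delayed object is used where a plain True/False is needed. The following is a minimal sketch independent of Prophet; where exactly Prophet performs such a test is not shown in the question, so the if statement below is only a stand-in for it.

```python
from dask import delayed

# A lazy value: nothing has been computed yet.
lazy_total = delayed(sum)([1, 2, 3])

message = ""
try:
    # Python calls bool() on the Delayed object here, which Dask refuses to do.
    if lazy_total:
        pass
except TypeError as exc:
    message = str(exc)

print(message)

# Computing first yields an ordinary Python value that can be tested freely.
total = lazy_total.compute()
print(total > 0)
```

This is why code that expects concrete pandas objects generally needs to receive computed data, or be wrapped in a delayed function itself.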
As already suggested, one approach is to use dask.delayed with a pandas DataFrame, and skip dask.dataframe.
You could use a simplified version of the load-clean-analyze pipeline shown for custom computations using Dask.
Here is one possible approach based on this type of custom pipeline, using a small dataset (to create an MCVE). Every step in the pipeline will be delayed.
Imports
import numpy as np
import pandas as pd
from dask import delayed
from dask.distributed import Client
from fbprophet import Prophet
Generate some data in a .csv, with column names Time (UTC), a and b
def generate_csv(nrows, fname):
    df = pd.DataFrame(np.random.rand(nrows, 2), columns=["a", "b"])
    df["Time (UTC)"] = pd.date_range(start="1850-01-01", periods=nrows)
    df.to_csv(fname, index=False)
First write the load function from the pipeline, to load the .csv with pandas, and delay its execution using the dask.delayed decorator
- it might be good to use read_csv with nrows to see how the pipeline performs on a subset of the data, rather than loading it all
- this will return a dask.delayed object and not a pandas.DataFrame
@delayed
def load_data(fname, nrows=None):
    return pd.read_csv(fname, nrows=nrows)
Now create the process function, to process data using pandas, again delayed since its input is a dask.delayed object and not a pandas.DataFrame
@delayed
def process_data(df):
    # Prophet expects the time column to be named "ds" and the target "y"
    df = df.rename(columns={"Time (UTC)": "ds"})
    df["y"] = df[["a", "b"]].mean(axis=1)
    return df
Last function: this one will train fbprophet on the data (loaded from the .csv and processed, but delayed) to make a forecast. This analyze function is also delayed, since one of its inputs is a dask.delayed object
@delayed
def analyze(df, horizon):
    m = Prophet(daily_seasonality=True)
    m.fit(df)
    future = m.make_future_dataframe(periods=horizon)
    forecast = m.predict(future)
    return forecast
Run the pipeline (if running from a Python script, this requires the __name__ == "__main__" guard)
- the output of the pipeline (a forecast by fbprophet) is stored in a variable result, which is delayed
- when this output is computed, it will generate a pandas.DataFrame (corresponding to the output of a forecast by fbprophet), so it can be evaluated using result.compute()
if __name__ == "__main__":
    horizon = 8
    num_rows_data = 40
    num_rows_to_load = 35
    csv_fname = "my_file.csv"

    generate_csv(num_rows_data, csv_fname)

    client = Client()  # modify this as required

    # Each call below only builds the task graph; nothing runs yet
    df = load_data(csv_fname, nrows=num_rows_to_load)
    df = process_data(df)
    result = analyze(df, horizon)

    # Trigger the whole load-process-analyze pipeline
    forecast = result.compute()

    client.close()

    assert len(forecast) == num_rows_to_load + horizon
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].head())
Output
          ds      yhat  yhat_lower  yhat_upper
0 1850-01-01  0.330649    0.095788    0.573378
1 1850-01-02  0.493025    0.266692    0.724632
2 1850-01-03  0.573344    0.348953    0.822692
3 1850-01-04  0.491388    0.246458    0.712400
4 1850-01-05  0.307939    0.066030    0.548981
I am trying to download files from a Google Storage bucket and parse them. There are millions of such files, which need to be downloaded, parsed and then processed further (natural language processing etc.).
I am trying the code below using Dask's parallel processing. It works, but it calls extract_skill twice instead of once for each row in the pandas DataFrame. Please help me understand why the extract_skill method is being called twice.
import pandas as pd
import numpy as np
import dask
import dask.dataframe as dd
# downloading file and extract skill sets and store in skill_sets column
chunk_size = 20
df_list = np.array_split(temp_df, temp_df.shape[0]/chunk_size)
temp_df["skill_sets"] = ""
result_df = pd.DataFrame(data={}, columns=temp_df.columns)
for df_ in df_list:
    df_["skill_sets"] = dd.from_pandas(df_, npartitions=4, sort=False, name='x').apply(extract_skill, axis=1, meta='object').compute()
    result_df = pd.concat([result_df, df_], axis=0)
extract_skill()
def extract_skill(row):
    # download file, parse and do some nlp stuff
    file_name = row['file_path']
    ......
    ......
    return skill_sets
Thanks in advance.
The DataFrame.apply method runs your function on a small sample of data in order to determine the datatypes and columns of the output. See the docstring of this function and look for the keyword "meta" for more information.
I read data from a CSV file using rpy2's read_csv() which creates a DataFrame. Now I want to directly manipulate an entire column. What I tried so far:
from rpy2.robjects.packages import importr
utils = importr('utils')
df = utils.read_csv(logn, header=args.head, skip=args.skip)
df.rx2('a').ro / 10
I expected this to write the result back to the DataFrame, but apparently it doesn't: df is not affected by the operation. So another idea was
df.rx2('a') = df.rx2('a').ro / 10
but that produces an error that function calls are not assignable - which is not obvious to me since the LHS should return a Vector(?)
So what did I miss?
In Python, function calls are indeed not assignable, so the R idiom needs a small adaptation.
Try:
df[df.names.index('a')] = df.rx2('a').ro / 10