Load images into a Dask Dataframe - python

I have a Dask dataframe that contains image paths in a column (called img_paths). In the next step I want to load the images from those paths into another column (called img_loaded) and then apply some pre-processing functions.
However, during the loading (image-reading) step I keep getting different results: sometimes a delayed wrapping of the imread function, sometimes the correctly loaded image (I can see the arrays), and the rest of the time a FileNotFoundError.
In addition to the examples below, I have also used the map_partitions function, but I ended up with similar output, just without the arrays. In the end, I want to use map_partitions rather than apply.
Following is my code and a description of the problems:
import pandas as pd
import dask
import dask.dataframe as dd
from skimage.io import imread
imgs = ['https://cdn.sstatic.net/Sites/stackoverflow/company/img/logos/so/so-logo.png?v=9c558ec15d8a'] * 42
# create a pandas dataframe using image paths
df = pd.DataFrame({"img_paths": imgs})
# convert it into dask dataframe
ddf = dd.from_pandas(df, npartitions=2)
# convert imread function as delayed
delayed_imread = dask.delayed(imread, pure=True)
First try: using a lambda function to apply the delayed imread to each cell
ddf["img_loaded"] = ddf.img_paths.apply(lambda x: delayed_imread(x))
ddf.compute()
Here, what I get after calling compute() is the wrapped delayed imread function, and I do not understand why. Following is the output:
Second try: without using a lambda function
ddf["img_loaded"] = ddf.img_paths.apply(delayed_imread)
ddf.compute()
This worked! At least I can see the loaded images as arrays. But I really do not get why. Why is this different from the first attempt (i.e., using a lambda function)? Following is the output:
Third try: with/without using a lambda function and without the delayed imread.
ddf["load"] = ddf.img_paths.apply(imread)  # or, lambda x: imread(x)
ddf.compute()
Here, just as an experiment, I did not use the delayed imread function; rather, I simply used skimage.io.imread. I tried it both with and without a lambda function, and each time I got a FileNotFoundError. I don't get this. Why can't it find the image path (although the paths are correct) when using the non-delayed imread function?
In addition to Ronald's answer, here is how to use the map_partitions function:
import numpy as np
ddf["img_loaded"] = ddf.map_partitions(lambda df: df.img_paths.apply(lambda x: imread(x)), meta=("img_loaded", np.uint8)).compute()
ddf.compute()
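For reference, here is a hedged, untested sketch of the same idea that keeps the assignment lazy (no .compute() inside the assignment) by having each partition build its own img_loaded column; the meta dict is an assumption about the resulting column dtypes:
def load_partition(pdf):
    pdf = pdf.copy()
    pdf["img_loaded"] = pdf["img_paths"].apply(imread)  # plain pandas apply inside each partition
    return pdf

ddf = ddf.map_partitions(load_partition,
                         meta={"img_paths": "object", "img_loaded": "object"})
result = ddf.compute()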

The solution
import pandas as pd
import dask
import dask.dataframe as dd
import numpy as np
from skimage.io import imread
imgs = ['https://cdn.sstatic.net/Sites/stackoverflow/company/img/logos/so/so-logo.png?v=9c558ec15d8a'] * 4
# create a pandas dataframe using image paths
df = pd.DataFrame({"img_paths": imgs})
# convert it into dask dataframe
ddf = dd.from_pandas(df, npartitions=2)
# convert imread function as delayed
delayed_imread = dask.delayed(imread, pure=True)
# give dask information about the function output type
ddf['img_paths'].apply(imread, meta=('img_loaded', np.uint8)).compute()
# OR turn it into dask.delayed, which infers output type `object`
ddf['img_paths'].apply(delayed_imread).compute()
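If the goal is to keep the loaded images in a new column (as in the original question) rather than computing the Series directly, a hedged sketch (using 'object' as the meta dtype, since each cell holds a whole array) would be:
ddf = ddf.assign(img_loaded=ddf['img_paths'].apply(imread, meta=('img_loaded', 'object')))
ddf.compute()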
The explanation
If you apply the print function without computing, you can see the reason for the FileNotFoundError raised by ddf.img_paths.apply(imread).compute():
ddf['img_paths'].apply(print)
Output:
> foo
> foo
When you add the apply call to the graph, Dask runs the function on the dummy string foo to infer the type of the output => imread was therefore trying to open a file named foo.
To get a better understanding I encourage you to try:
ddf.apply(print, axis=1)
And try to predict what gets printed.
Delayed cells after .compute()
The reason is that apply expects a function reference, which it then calls. By creating a lambda that calls the delayed function, you are basically double-delaying your function.
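To illustrate (a minimal sketch, with a hypothetical file name): calling a delayed function never reads anything by itself, it only returns a Delayed object, which is what ends up stored in each cell until it is explicitly computed:
import dask
from skimage.io import imread

delayed_imread = dask.delayed(imread, pure=True)
d = delayed_imread("some_image.png")  # hypothetical path; returns a Delayed object, nothing is read yet
print(type(d))                        # <class 'dask.delayed.Delayed'>
arr = d.compute()                     # the actual imread call happens only here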

Related

Use Dask Dataframe On delayed function

I have three sources and a Dask DataFrame for each of them. I need to apply a function that performs an operation combining data from the three sources. The operation requires a state to be calculated (I can't change that).
The three sources are in Parquet format and I read the data using the read_parquet Dask DataFrame function:
@dask.delayed
def load_data(data_path):
    ddf = dd.read_parquet(data_path, engine="pyarrow")
    return ddf
results = []
sources_path = ["/source1", "/source2", "/source3"]
for source_path in sources_path:
    data = load_data(source_path)
    results.append(data)
I create another delayed function that executes the operation:
@dask.delayed
def process(sources):
    return operation(sources[0][<list of columns>], sources[1][<list of columns>], sources[2][<list of columns>])
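For completeness, a hedged sketch of how these delayed pieces would be wired together and triggered (assuming operation returns the combined result):
final = process(results)   # a single Delayed object combining the three delayed sources
output = final.compute()   # load_data and process run within one task graph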
The operation function comes from a custom library. It cannot actually be parallelized because it has an internal state.
Reading the Dask documentation, this is not a best practice.
Is there a way to apply a custom function to multiple Dask DataFrames without using delayed functions?

How do I create a Dask DataFrame partition by partition and write portions to disk while the DataFrame is still incomplete?

I have Python code for data analysis that iterates through hundreds of datasets, does some computation, produces a result as a pandas DataFrame, and then concatenates all the results together. I am currently working with a set of data where these results are too large to fit into memory, so I'm trying to switch from pandas to Dask.
The problem is that I have looked through the Dask documentation and done some Googling, and I can't really figure out how to create a Dask DataFrame iteratively, as described above, in a way that takes advantage of Dask's ability to keep only portions of the DataFrame in memory. Everything I see assumes that you either already have all the data stored in some format on disk, or that you have all the data in memory and now want to save it to disk.
What's the best way to approach this? My current code using pandas looks something like this:
def process_data(data) -> pd.DataFrame:
    # Do stuff
    return df

dfs = []
for data in datasets:
    result = process_data(data)
    dfs.append(result)

final_result = pd.concat(dfs)
final_result.to_csv("result.csv")
Expanding on @MichelDelgado's comment, the correct approach should be something like this:
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

def process_data(data) -> pd.DataFrame:
    # Do stuff
    return df

delayed_dfs = []
for data in datasets:
    result = delayed(process_data)(data)
    delayed_dfs.append(result)

ddf = dd.from_delayed(delayed_dfs)
ddf.to_csv('export-*.csv')
Note that this would create multiple CSV files, one per partition.
You can find documentation here: https://docs.dask.org/en/stable/delayed-collections.html.
Also, be careful to actually read the data inside the process function. So the data argument in the code above should only be a lightweight identifier, like a file path or equivalent.
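A hedged sketch of that pattern (the file names and glob pattern are assumptions), where only lightweight paths are passed around and the actual read happens inside the delayed function:
import glob
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

@delayed
def process_data(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)   # the heavy I/O happens lazily, inside the task
    # ... do stuff ...
    return df

paths = sorted(glob.glob("datasets/*.csv"))        # hypothetical input files
# optionally pass meta= to from_delayed to avoid computing one partition for schema inference
ddf = dd.from_delayed([process_data(p) for p in paths])
ddf.to_csv("export-*.csv")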

Can't fit dataframe with fbprophet using dask to read the csv into a dataframe

References:
https://examples.dask.org/applications/forecasting-with-prophet.html?highlight=prophet
https://facebook.github.io/prophet/
A few things to note:
I've got a total of 48 GB of RAM.
Here are the versions of the libraries I'm using:
Python 3.7.7
dask==2.18.0
fbprophet==0.6
pandas==1.0.3
The only reason I import pandas is for this single line: pd.options.mode.chained_assignment = None
This helps with Dask erroring when I'm using dask.distributed.
So, I have a 21 GB CSV file that I am reading with Dask in a Jupyter notebook...
I've tried to read it from my MySQL database table; however, the kernel eventually crashes.
I've tried multiple combinations of my local network of workers, threads, available memory, available storage_memory, and even tried not using distributed at all. I have also tried chunking with pandas (not with the line mentioned above related to pandas); however, even with chunking, the kernel still crashes...
I can now load the CSV with Dask and apply a few transformations, such as setting the index and adding the column names that fbprophet requires... but I am still not able to compute the dataframe with df.compute(), and I think this is why I am receiving the fbprophet error. After I have added the columns y and ds with the appropriate dtypes, I receive the error Truth of Delayed objects is not supported, and I think this is because fbprophet expects the dataframe not to be lazy, which is why I'm trying to run compute beforehand. I have also bumped up the RAM on the client to let it use the full 48 GB, as I suspected it might be trying to load the data twice; however, this still failed, so most likely that wasn't the case / isn't causing the problem.
Alongside this, fbprophet is also mentioned in the Dask documentation for applying machine learning to dataframes, yet I really don't understand why this isn't working... I've also tried modin with ray and with dask, with basically the same result.
Another question... regarding memory usage
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 32.35 GB -- Worker memory limit: 25.00 GB
I am getting this warning when assigning the client, reading the CSV file, and applying operations/transformations to the dataframe; however, the allotted size is larger than the CSV file itself, so this confuses me...
What I have done to try and solve this myself:
- Googling, of course; did not find anything :-/
- Asking a Discord help channel, on multiple occasions
- Asking an IRC help channel, on multiple occasions
Anyways, I would really appreciate any help with this problem!!!
Thank you in advance :)
MCVE
from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd
from fbprophet import Prophet
pd.options.mode.chained_assignment = None
client = Client(n_workers=2, threads_per_worker=4, processes=False, memory_limit='4GB')
csv_file = 'provide_your_own_csv_file_here.csv'
df = dd.read_csv(csv_file, parse_dates=['Time (UTC)'])
df = df.set_index('Time (UTC)')
df['y'] = df[['a','b']].mean(axis=1)
m = Prophet(daily_seasonality=True)
m.fit(df)
# ERROR: Truth of Delayed objects is not supported
Unfortunately Prophet doesn't support Dask dataframes today.
The example that you refer to shows using Dask to accelerate Prophet's fitting on Pandas dataframes. Dask Dataframe is only one way that people use Dask.
As already suggested, one approach is to use dask.delayed with a pandas DataFrame, and skip dask.dataframe.
You could use a simplified version of the load-clean-analyze pipeline shown for custom computations using Dask.
Here is one possible approach based on this type of custom pipeline, using a small dataset (to create an MCVE). Every step in the pipeline will be delayed.
Imports
import numpy as np
import pandas as pd
from dask import delayed
from dask.distributed import Client
from fbprophet import Prophet
Generate some data in a .csv, with column names Time (UTC), a and b
def generate_csv(nrows, fname):
    df = pd.DataFrame(np.random.rand(nrows, 2), columns=["a", "b"])
    df["Time (UTC)"] = pd.date_range(start="1850-01-01", periods=nrows)
    df.to_csv(fname, index=False)
First, write the load function from the pipeline to load the .csv with Pandas, and delay its execution using the dask.delayed decorator:
- it might be good to use read_csv with nrows to see how the pipeline performs on a subset of the data, rather than loading it all
- this will return a dask.delayed object and not a pandas.DataFrame
@delayed
def load_data(fname, nrows=None):
    return pd.read_csv(fname, nrows=nrows)
Now create the process function, to process data using pandas, again delayed since its input is a dask.delayed object and not a pandas.DataFrame
@delayed
def process_data(df):
    df = df.rename(columns={"Time (UTC)": "ds"})
    df["y"] = df[["a", "b"]].mean(axis=1)
    return df
Last function - this one will train fbprophet on the data (loaded from the .csv and processed, but delayed) to make a forecast. This analyze function is also delayed, since one of its inputs is a dask.delayed object
@delayed
def analyze(df, horizon):
    m = Prophet(daily_seasonality=True)
    m.fit(df)
    future = m.make_future_dataframe(periods=horizon)
    forecast = m.predict(future)
    return forecast
Run the pipeline (if running from a Python script, this requires __name__ == "__main__"):
- the output of the pipeline (a forecast by fbprophet) is stored in the variable result, which is delayed
- when this output is computed, it will generate a pandas.DataFrame (corresponding to the output of a forecast by fbprophet), so it can be evaluated using result.compute()
if __name__ == "__main__":
    horizon = 8
    num_rows_data = 40
    num_rows_to_load = 35
    csv_fname = "my_file.csv"
    generate_csv(num_rows_data, csv_fname)
    client = Client()  # modify this as required
    df = load_data(csv_fname, nrows=num_rows_to_load)
    df = process_data(df)
    result = analyze(df, horizon)
    forecast = result.compute()
    client.close()
    assert len(forecast) == num_rows_to_load + horizon
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].head())
Output
           ds      yhat  yhat_lower  yhat_upper
0  1850-01-01  0.330649    0.095788    0.573378
1  1850-01-02  0.493025    0.266692    0.724632
2  1850-01-03  0.573344    0.348953    0.822692
3  1850-01-04  0.491388    0.246458    0.712400
4  1850-01-05  0.307939    0.066030    0.548981

Method called twice instead of single call in Dask's multiprocessing

I am trying to download files from a Google Storage bucket and parse them. There are millions of such files that need to be downloaded, parsed, and have some operations (natural language processing etc.) run on them.
I am trying the code below using Dask's parallel processing, and it works, but it calls extract_skill twice instead of once for each row of the pandas DataFrame. Please help me understand why the extract_skill method is being called twice.
import pandas as pd
import numpy as np
import dask
import dask.dataframe as dd

# download files, extract skill sets and store them in the skill_sets column
chunk_size = 20
df_list = np.array_split(temp_df, temp_df.shape[0] / chunk_size)
temp_df["skill_sets"] = ""
result_df = pd.DataFrame(data={}, columns=temp_df.columns)

for df_ in df_list:
    df_["skill_sets"] = dd.from_pandas(df_, npartitions=4, sort=False, name='x').apply(extract_skill, axis=1, meta='object').compute()
    result_df = pd.concat([result_df, df_], axis=0)
extract_skill()
def extract_skill(row):
    # download file, parse and do some NLP stuff
    file_name = row['file_path']
    ......
    ......
    return skill_sets
Thanks in advance.
The DataFrame.apply method runs your function on a small sample of data in order to determine the datatypes and columns of the output. See the docstring of this function and look for the keyword "meta" for more information.
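As a hedged illustration (the column names here are assumptions), supplying meta explicitly tells Dask the output type up front, so it does not need to run the function on a dummy sample at graph-construction time:
import pandas as pd
import dask.dataframe as dd

def extract_skill(row):
    # hypothetical stand-in for the real download/parse/NLP work
    return [row["file_path"]]

pdf = pd.DataFrame({"file_path": ["a.txt", "b.txt"]})
ddf = dd.from_pandas(pdf, npartitions=2)

# with meta given, no inference call on sample data is needed
skills = ddf.apply(extract_skill, axis=1, meta=("skill_sets", "object")).compute()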

How to manipulate columns of a data frame

I read data from a CSV file using rpy2's read_csv(), which creates a DataFrame. Now I want to directly manipulate an entire column. What I have tried so far:
from rpy2.robjects.packages import importr
utils = importr('utils')
df = utils.read_csv(logn, header=args.head, skip=args.skip)
df.rx2('a').ro / 10
which I expected to write back to the DataFrame, but it apparently doesn't: df is not affected by this operation. So, another idea was
df.rx2('a') = df.rx2('a').ro / 10
but that produces an error saying that function calls are not assignable - which is not obvious to me, since the LHS should return a Vector(?)
So what did I miss?
In Python, function calls are indeed not assignable, which makes it necessary to adapt the R code a little.
Try:
df[df.names.index('a')] = df.rx2('a').ro / 10
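For example, a hedged sketch (with a hypothetical CSV containing a numeric column a) of the assign-back pattern above:
from rpy2.robjects.packages import importr

utils = importr('utils')
df = utils.read_csv('data.csv')     # hypothetical file with a column 'a'

idx = df.names.index('a')           # position of column 'a'
df[idx] = df.rx2('a').ro / 10       # divide the whole column by 10 and write it back
print(df.rx2('a'))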
