Python and Dask - reading and concatenating multiple files

I have some parquet files, all coming from the same domain but with some differences in structure. I need to concatenate all of them. Below are some examples of these files:
file 1:
A,B
True,False
False,False
file 2:
A,C
True,False
False,True
True,True
What I am looking to do is to read and concatenate these files in the fastest way possible, obtaining the following result:
A,B,C
True,False,NaN
False,False,NaN
True,NaN,False
False,NaN,True
True,NaN,True
To do that I am using the following code, adapted from these questions (Reading multiple files with Dask, Dask dataframes: reading multiple files & storing filename in column):
import glob
import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client

def read_parquet(path):
    return pd.read_parquet(path)

if __name__ == '__main__':
    files = glob.glob('test/*/file.parquet')
    print('Start dask client...')
    client = Client()
    results = [dd.from_delayed(dask.delayed(read_parquet)(f)) for f in files]
    results = dd.concat(results).compute()
    client.close()
This code works, and it is already the fastest version I could come up with (I tried sequential pandas and multiprocessing.Pool). My idea was that Dask could ideally start part of the concatenation while still reading some of the files; however, from the task graph I see some sequential reading of the metadata of each parquet file, see the screenshot below:
The first part of the task graph is a mixture of read_parquet followed by read_metadata. The first part always shows only 1 task executed (in the task processing tab). The second part is a combination of from_delayed and concat and it is using all of my workers.
Any suggestion on how to speed up the file reading and reduce the execution time of the first part of the graph?

The problem with your code is that you use the Pandas version of read_parquet.
Instead use:
- the Dask version of read_parquet,
- the map and gather methods offered by Client,
- the Dask version of concat.
Something like:
def read_parquet(path):
    return dd.read_parquet(path)

def myRead():
    L = client.map(read_parquet, glob.glob('file_*.parquet'))
    lst = client.gather(L)
    return dd.concat(lst)

result = myRead().compute()
Before that I created a client, once only.
The reason is that during my earlier experiments I got an error message when I attempted to create it again (in a function), even though the first instance had been closed beforehand.
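Putting the pieces together, a minimal sketch of a full script along these lines might look as follows (the glob pattern is taken from the question; passing the client into myRead is just one way to wire it up):

import glob
import dask.dataframe as dd
from dask.distributed import Client

def read_parquet(path):
    # each call lazily describes one parquet file as a dask DataFrame
    return dd.read_parquet(path)

def myRead(client, paths):
    L = client.map(read_parquet, paths)   # submit one read task per file
    lst = client.gather(L)                # collect the lazy DataFrames
    return dd.concat(lst)                 # concatenate; missing columns become NaN

if __name__ == '__main__':
    client = Client()                     # create the client once only
    files = glob.glob('test/*/file.parquet')
    result = myRead(client, files).compute()
    client.close()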

Related

How do I create a Dask DataFrame partition by partition and write portions to disk while the DataFrame is still incomplete?

I have python code for data analysis that iterates through hundreds of datasets, does some computation and produces a result as a pandas DataFrame, and then concatenates all the results together. I am currently working with a set of data where these results are too large to fit into memory, so I'm trying to switch from pandas to Dask.
The problem is that I have looked through the Dask documentation and done some Googling and I can't really figure out how to create a Dask DataFrame iteratively like how I described above in a way that will take advantage of Dask's ability to only keep portions of the DataFrame in memory. Everything I see assumes that you either have all the data already stored in some format on disk, or that you have all the data in memory and now want to save it to disk.
What's the best way to approach this? My current code using pandas looks something like this:
import pandas as pd

def process_data(data) -> pd.DataFrame:
    # Do stuff
    return df

dfs = []
for data in datasets:
    result = process_data(data)
    dfs.append(result)

final_result = pd.concat(dfs)
final_result.to_csv("result.csv")
Expanding on @MichelDelgado's comment, the correct approach should be something like this:
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

def process_data(data) -> pd.DataFrame:
    # Do stuff
    return df

delayed_dfs = []
for data in datasets:
    result = delayed(process_data)(data)
    delayed_dfs.append(result)

ddf = dd.from_delayed(delayed_dfs)
ddf.to_csv('export-*.csv')
Note that this will create multiple CSV files, one per input partition.
You can find documentation here: https://docs.dask.org/en/stable/delayed-collections.html.
Also, be careful to actually read the data inside the process function; the data argument in the code above should only be an identifier, like a file path or equivalent, as in the sketch below.
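For example, a minimal sketch of that pattern, assuming each dataset is identified by a file path and stored as parquet (the paths and the pd.read_parquet call are illustrative assumptions, not part of the question):

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

def process_data(path) -> pd.DataFrame:
    # the expensive read happens inside the task, on a worker
    df = pd.read_parquet(path)   # hypothetical input format
    # Do stuff
    return df

# hypothetical list of file paths standing in for `datasets`
paths = ["data/part-0.parquet", "data/part-1.parquet"]
delayed_dfs = [delayed(process_data)(p) for p in paths]
ddf = dd.from_delayed(delayed_dfs)
ddf.to_csv("export-*.csv")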

Saving to the same parquet file in parallel using dask leading to ArrowInvalid

I am running a simulation where I compute some quantities for several time steps. For each time step I want to save a parquet file where each line corresponds to a simulation run. It looks like this:
def simulation():
    nsim = 3
    timesteps = [1, 2]
    data = {}  # initialization not shown here
    for i in range(nsim):
        compute_stuff()
        for j in timesteps:
            data[str(j)] = compute_some_other_stuff()
    return data
Once I have my dict data containing the results of my simulation (as numpy arrays), I transform it into dask.DataFrame objects and save them to file using the .to_parquet() method as follows:
def save(data):
    for i in data.keys():
        data[i] = pd.DataFrame(data[i], bins=...)
        df = from_pandas(data[i], npartitions=2)
        df.to_parquet(datafolder + i + "/", engine="pyarrow", append=True, ignore_divisions=True)
When I use this code only once it works perfectly; the struggle arises when I try to run it in parallel. Using dask I do:
client = Client(n_workers=10, processes=True)

def f(n):
    data = simulation()
    save(data)

to_compute = [delayed(f)(n) for n in range(20)]
compute(to_compute)
The behaviour of this last portion of code is quite random. At some point this happens:
distributed.worker - WARNING - Compute Failed
Function: f
args: (4)
kwargs: {}
Exception: "ArrowInvalid('Parquet file size is 0 bytes')"
....
distributed.worker - WARNING - Compute Failed
Function: f
args: (12)
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
I think these errors are due to the fact that 2 processes try to write to the same parquet file at the same time, and this is not well handled (as it can be with a txt file). I have already tried switching to pySpark / Koalas without much success. Is there a better way to save the results along a simulation (in case of a crash / wall time limit on a cluster)?
You are making a classic dask mistake of invoking the dask API from within functions that are themselves delayed. The error indicates that things are happening in parallel (which is what dask does!) which are not expected to change during processing. Specifically, a file is clearly being edited by one task while another one is reading it (not sure which).
What you probably want to do, is use concat on the dataframe pieces and then a single call to to_parquet.
Note that it seems all of your data is actually held in the client, and you are using from_pandas. This seems like a bad idea, since you are missing out on one of dask's biggest features: only loading data when needed. You should, instead, load your data inside delayed functions or dask dataframe API calls, as in the sketch below.
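A minimal sketch of that idea, with placeholder simulation logic (the function and column names here are illustrative, not the original code):

import pandas as pd
import dask.dataframe as dd
from dask import delayed
from dask.distributed import Client

@delayed
def run_one_simulation(n) -> pd.DataFrame:
    # run the simulation and build its result inside the task
    return pd.DataFrame({"run": [n], "value": [n * 0.5]})  # placeholder result

if __name__ == "__main__":
    client = Client(n_workers=10, processes=True)
    parts = [run_one_simulation(n) for n in range(20)]
    ddf = dd.from_delayed(parts)
    # a single to_parquet call: one file per partition, no concurrent appends
    ddf.to_parquet("datafolder/", engine="pyarrow")
    client.close()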

Can't fit dataframe with fbprophet using dask to read the csv into a dataframe

References:
https://examples.dask.org/applications/forecasting-with-prophet.html?highlight=prophet
https://facebook.github.io/prophet/
A few things to note:
I've got a total of 48 GB of RAM.
Here are the versions of the libraries I'm using:
Python 3.7.7
dask==2.18.0
fbprophet==0.6
pandas==1.0.3
The only reason I import pandas is for this one line, pd.options.mode.chained_assignment = None, which helps with dask erroring when I'm using dask.distributed.
So, I have a 21gb csv file that I am reading using dask and jupyter notebook...
I've tried to read it from my mysql database table; however, the kernel eventually crashes.
I've tried multiple combinations of using my local network of workers, threads, available memory and available storage_memory, and even tried not using distributed at all. I have also tried chunking with pandas (not with the line mentioned above related to pandas); however, even with chunking, the kernel still crashes...
I can now load the csv with dask and apply a few transformations, such as setting the index and adding the columns (names) that fbprophet requires... but I am still not able to compute the dataframe with df.compute(), and I think this is why I am receiving the error I am with fbprophet. After I have added the columns y and ds, with the appropriate dtypes, I receive the error Truth of Delayed objects is not supported, and I think this is because fbprophet expects the dataframe not to be lazy, which is why I'm trying to run compute beforehand. I have also bumped up the RAM on the client to allow it to use the full 48 GB, as I suspected it may be trying to load the data twice; however, this still failed, so most likely this wasn't the case / isn't causing the problem.
Alongside this, fbprophet is also mentioned in the documentation of dask for applying machine learning to dataframes; however, I really don't understand why this isn't working... I've also tried modin with ray, and with dask, with basically the same result.
Another question... regarding memory usage
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 32.35 GB -- Worker memory limit: 25.00 GB
I am getting this error when assigning the client, reading the csv file, and applying operations/transformations to the dataframe; however, the allotted size is larger than the csv file itself, so this confuses me...
What I have done to try and solve this myself:
- Googling, of course; did not find anything :-/
- Asking a discord help channel, on multiple occasions
- Asking an IRC help channel, on multiple occasions
Anyways, would really appreciate any help on this problem!!!
Thank you in advance :)
MCVE
from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd
from fbprophet import Prophet
pd.options.mode.chained_assignment = None
client = Client(n_workers=2, threads_per_worker=4, processes=False, memory_limit='4GB')
csv_file = 'provide_your_own_csv_file_here.csv'
df = dd.read_csv(csv_file, parse_dates=['Time (UTC)'])
df = df.set_index('Time (UTC)')
df['y'] = df[['a','b']].mean(axis=1)
m = Prophet(daily_seasonality=True)
m.fit(df)
# ERROR: Truth of Delayed objects is not supported
Unfortunately Prophet doesn't support Dask dataframes today.
The example that you refer to shows using Dask to accelerate Prophet's fitting on Pandas dataframes. Dask Dataframe is only one way that people use Dask.
As already suggested, one approach is to use dask.delayed with a pandas DataFrame, and skip dask.dataframe.
You could use a simplified version of the load-clean-analyze pipeline shown for custom computations using Dask.
Here is one possible approach based on this type of custom pipeline, using a small dataset (to create an MCVE) - every step in the pipeline will be delayed.
Imports
import numpy as np
import pandas as pd
from dask import delayed
from dask.distributed import Client
from fbprophet import Prophet
Generate some data in a .csv, with column names Time (UTC), a and b
def generate_csv(nrows, fname):
    df = pd.DataFrame(np.random.rand(nrows, 2), columns=["a", "b"])
    df["Time (UTC)"] = pd.date_range(start="1850-01-01", periods=nrows)
    df.to_csv(fname, index=False)
First write the load function from the pipeline, to load the .csv with Pandas, and delay its execution using the dask.delayed decorator:
- it might be good to use read_csv with nrows to see how the pipeline performs on a subset of the data, rather than loading it all
- this will return a dask.delayed object and not a pandas.DataFrame
@delayed
def load_data(fname, nrows=None):
    return pd.read_csv(fname, nrows=nrows)
Now create the process function, to process data using pandas, again delayed since its input is a dask.delayed object and not a pandas.DataFrame
@delayed
def process_data(df):
    df = df.rename(columns={"Time (UTC)": "ds"})
    df["y"] = df[["a", "b"]].mean(axis=1)
    return df
Last function - this one will train fbprophet on the data (loaded from the .csv and processed, but delayed) to make a forecast. This analyze function is also delayed, since one of its inputs is a dask.delayed object
@delayed
def analyze(df, horizon):
    m = Prophet(daily_seasonality=True)
    m.fit(df)
    future = m.make_future_dataframe(periods=horizon)
    forecast = m.predict(future)
    return forecast
Run the pipeline (if running from a Python script, this requires the __name__ == "__main__" guard):
- the output of the pipeline (a forecast by fbprophet) is stored in a variable result, which is delayed
- when this output is computed, it will generate a pandas.DataFrame (corresponding to the output of a forecast by fbprophet), so it can be evaluated using result.compute()
if __name__ == "__main__":
    horizon = 8
    num_rows_data = 40
    num_rows_to_load = 35
    csv_fname = "my_file.csv"
    generate_csv(num_rows_data, csv_fname)
    client = Client()  # modify this as required
    df = load_data(csv_fname, nrows=num_rows_to_load)
    df = process_data(df)
    result = analyze(df, horizon)
    forecast = result.compute()
    client.close()
    assert len(forecast) == num_rows_to_load + horizon
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].head())
Output
ds yhat yhat_lower yhat_upper
0 1850-01-01 0.330649 0.095788 0.573378
1 1850-01-02 0.493025 0.266692 0.724632
2 1850-01-03 0.573344 0.348953 0.822692
3 1850-01-04 0.491388 0.246458 0.712400
4 1850-01-05 0.307939 0.066030 0.548981

How to read chunks of multiple large CSV files from google cloud storage using Dask without overloading the memory all at once

I'm trying to read a bunch of large csv files (multiple files) from google storage. I use the Dask distributed library for parallel computation, but the problem I'm facing here is that, although I specify the blocksize (100 MB), I'm not sure how to read partition by partition and save it to my postgres database so that I don't overload my memory.
from dask.distributed import Client
from dask.diagnostics import ProgressBar
client = Client(processes=False)
import dask.dataframe as dd
def read_csv_gcs():
    with ProgressBar():
        df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
        pd = df.compute(scheduler='threads')
        return pd

def write_df_to_db(df):
    try:
        from sqlalchemy import create_engine
        engine = create_engine('postgresql://usr:pass@localhost:5432/sampledb')
        df.to_sql('sampletable', engine, if_exists='replace', index=False)
    except Exception as e:
        print(e)

pd = read_csv_gcs()
write_df_to_db(pd)
The above code is my basic implementation, but as said I would like to read it in chunk and update the db. Something like
df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
for chunk in df:
    write_it_to_db(chunk)
Is it possible to do this in Dask, or should I go for pandas' chunksize and iterate, then save it to the DB (but then I miss out on parallel computation)?
Can someone shed some light?
This line
df.compute(scheduler='threads')
says: load the data in chunks in worker threads, and concatenate them all into a single in-memory dataframe, df. This is not what you wanted. You wanted to insert the chunks as they come and then drop them from memory.
You probably wanted to use map_partitions
df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
df.map_partitions(write_it_to_db).compute()
or use df.to_delayed().
Note that, depending on your SQL driver, you might not be able to get parallelism this way, and if not, the pandas iter-chunk method would have worked just as well.
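For instance, write_it_to_db could be a per-partition function along these lines (a sketch only; the connection string and table name are taken from the question, and if_exists='append' is an assumption so each partition adds rows instead of replacing the table):

import dask.dataframe as dd
from sqlalchemy import create_engine

def write_it_to_db(partition):
    # each partition arrives here as an ordinary pandas DataFrame on a worker
    engine = create_engine('postgresql://usr:pass@localhost:5432/sampledb')
    partition.to_sql('sampletable', engine, if_exists='append', index=False)
    return partition

df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
# meta=df reuses the input schema so dask does not probe the function with a dummy frame
df.map_partitions(write_it_to_db, meta=df).compute()

Creating the engine inside the function keeps the task picklable when the computation runs on distributed workers.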

Spark coalesce vs collect, which one is faster?

I am using pyspark to process 50Gb data using AWS EMR with ~15 m4.large cores.
Each row of the data contains some information at a specific time on a day. I am using the following for loop to extract and aggregate information for every hour. Finally I union the data, as I want my result saved in one csv file.
# daily_df is an empty pyspark DataFrame
for hour in range(24):
    hourly_df = df.filter(hourFilter("Time")).groupby("Animal").agg(mean("weights"), sum("is_male"))
    daily_df = daily_df.union(hourly_df)
To my knowledge, I have to perform the following to force the pyspark.sql.DataFrame object to save to 1 csv file (approx 1 MB) instead of 100+ files:
daily_df.coalesce(1).write.csv("some_local.csv")
It took about 70 min to finish this process, and I am wondering if I can make it faster by using the collect() method, like this:
daily_df_pandas = daily_df.collect()
daily_df_pandas.to_csv("some_local.csv")
Both coalesce(1) and collect are pretty bad in general but with expected output size around 1MB it doesn't really matter. It simply shouldn't be a bottleneck here.
One simple improvement is to drop loop -> filter -> union and perform a single aggregation:
df.groupby(hour("Time"), col("Animal")).agg(mean("weights"), sum("is_male"))
If that's not enough then most likely the issue here is configuration (a good place to start could be adjusting spark.sql.shuffle.partitions, if you don't do that already).
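A hedged sketch of that combined suggestion in PySpark (spark and df are assumed to exist as in the question; the shuffle-partition value of 16 is only an illustrative guess):

from pyspark.sql.functions import hour, col, mean, sum as sum_

# fewer shuffle partitions than the default 200, since the result is ~1 MB
spark.conf.set("spark.sql.shuffle.partitions", "16")

daily_df = (
    df.groupBy(hour("Time").alias("hour"), col("Animal"))
      .agg(mean("weights"), sum_("is_male"))
)
daily_df.coalesce(1).write.csv("some_local.csv", header=True)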
To save as a single file, these are the options:
Option 1:
coalesce(1) (minimum data shuffled over the network) or repartition(1) or collect may work for small data-sets, but for large data-sets it may not perform as expected, since all the data will be moved to one partition on one node.
Option 1 would be fine if a single executor has more RAM available than the driver.
Option 2:
Another option would be FileUtil.copyMerge() - to merge the outputs into a single file, as in the code snippet below.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._

def merge(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}
Option 3:
After getting the part files you can use the hdfs getmerge command like this:
hadoop fs -getmerge /tmp/demo.csv /localmachine/tmp/demo.csv
Now you have to decide, based on your requirements, which one is safer/faster.
Also, have a look at: Dataframe save after join is creating numerous part files.
