I want to use Dask for operations of the form
df.groupby(some_columns).apply(some_function)
where some_function() may compute some summary statistics, perform timeseries forecasting, or even just save the group to a single file in AWS S3.
Dask documentation states (and several other StackOverflow answers cite) that groupby-apply is not appropriate for aggregations:
Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
It is not clear whether Aggregation supports operations on multiple columns. However, this DataFrames tutorial seems to do exactly what I'm suggesting, with roughly some_function = lambda x: LinearRegression().fit(...). The example seems to work as intended, and I've similarly had no problems so far with e.g. some_function = lambda x: x.to_csv(...).
Under what conditions can I expect that some_function will be passed all rows of the group? If this is never guaranteed, is there a way to break the LinearRegression example? And most importantly, what is the best way to handle these use cases?
It appears that the current version of the documentation and the source code are not in sync. Specifically, in the source code for dask.groupby, there is this message:
Dask groupby supports reductions, i.e., mean, sum and alike, and apply. The former do not shuffle the data and are efficiently implemented as tree reductions. The latter is implemented by shuffling the underlying partitions such that all items of a group can be found in the same partition.
This is not consistent with the warning in the docs about partition-group pairs. The snippet below (and its task graph visualization, produced by the call shown after it) also shows that the data is shuffled so that each partition contains all members of the same group:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({'group': [0,0,1,1,2,0,1,2], 'npeople': [1,2,3,4,5,6,7,8]})
ddf = dd.from_pandas(df, npartitions=2)
def myfunc(df):
    return df['npeople'].sum()
results_pandas = df.groupby('group').apply(myfunc).sort_index()
results_dask = ddf.groupby('group').apply(myfunc).compute().sort_index()
print(sum(results_dask != results_pandas))
# will print 0, so there are no differences
# between dask and pandas groupby
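The shuffle can be seen in the task graph. A minimal sketch of how to render it (assuming graphviz is installed; the meta argument here is illustrative):
ddf.groupby('group').apply(myfunc, meta=('npeople', 'int64')).visualize(filename='groupby_apply_graph.svg')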
This is speculative, but one way to work around a scenario where rows of a single group end up split across partitions could be to explicitly re-partition the data so that each group is associated with a unique partition.
One way to achieve that is to create an index that is the same as the group-identifier column. This is in general not a cheap operation, but it can be helped by pre-processing the data so that the group identifier is already sorted. A sketch of this idea is shown below.
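The sketch reuses ddf and myfunc from the snippet above; it is a hedged illustration, not a tested recipe, and the meta argument is again illustrative:
ddf_by_group = ddf.set_index('group')  # not cheap: triggers a shuffle/sort so each group lands in one partition
result = ddf_by_group.groupby('group').apply(myfunc, meta=('npeople', 'int64')).compute()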
Related
Let's say I have this pyspark Dataframe:
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('BE',), ('France',), ('Latvia',)])
And let's say I want to collect various statistics about this data. For example, I might want to know how many rows use a 2-character country code and how many use longer country names:
from pyspark.sql import functions as F

count_short = data.where(F.length(F.col('Country')) == 2).count()
count_long = data.where(F.length(F.col('Country')) > 2).count()
This works, but when I want to collect many different counts based on different conditions, it becomes very slow even for tiny datasets. In Azure Synapse Studio, where I am working, every count takes 1-2 seconds to compute.
I need to do 100+ counts, and it takes multiple minutes to compute for a dataset of 10 rows. And before somebody asks, the conditions for those counts are more complex than in my example. I cannot group by length or do other tricks like that.
I am looking for a general way to do multiple counts on arbitrary conditions, fast.
I am guessing that the reason for the slow performance is that for every count call, my pyspark notebook starts some Spark processes that have significant overhead. So I assume that if there was some way to collect these counts in a single query, my performance problems would be solved.
One possible solution I thought of is to build a temporary column that indicates which of my conditions have been matched, and then call countDistinct on it. But then I would have individual counts for all combinations of condition matches. I also noticed that depending on the situation, the performance is a bit better when I do data = data.localCheckpoint() before computing my statistics, but the general problem still persists.
Is there a better way?
Function "count" can be replaced by "sum" with condition (Scala):
data.select(
  sum(
    when(length(col("Country")) === 2, 1).otherwise(0)
  ).alias("two_characters"),
  sum(
    when(length(col("Country")) > 2, 1).otherwise(0)
  ).alias("more_than_two_characters")
)
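For reference, a hedged PySpark translation of the same idea (an untested sketch; the alias names are illustrative):
from pyspark.sql import functions as F

result = data.select(
    F.sum(F.when(F.length(F.col('Country')) == 2, 1).otherwise(0)).alias('two_characters'),
    F.sum(F.when(F.length(F.col('Country')) > 2, 1).otherwise(0)).alias('more_than_two_characters')
).collect()[0]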
While one way is to combine multiple queries into one, the other way is to cache the dataframe that is being queried again and again.
By caching the dataframe, we avoid re-evaluating it each time count() is invoked.
data.cache()
A few things to keep in mind: if you apply multiple actions to your dataframe, the plan contains a lot of transformations, and the data is read from an external source, then you should definitely cache the dataframe before applying any action to it.
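A minimal sketch of that workflow, reusing the data and F names from the question:
data.cache()
count_short = data.where(F.length(F.col('Country')) == 2).count()  # first action materializes the cache
count_long = data.where(F.length(F.col('Country')) > 2).count()    # later actions reuse the cached data
data.unpersist()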
The answer provided by @pasha701 works, but you will have to keep adding columns for every country-code length you want to analyse.
You can use the code below to get the counts for all the different country-code lengths in one single dataframe.
# import statements
from pyspark.sql.functions import *

# sample dataframe
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('ACE',), ('BE',), ('France',), ('Latvia',)])

# add a column that gives the length of each country code
data1 = data.withColumn("CountryLength", length(col('Country')))

# column names for the final output
outputcolumns = ["CountryLength", "RecordsCount"]

# select the CountryLength column, convert it to an RDD, and map-reduce to count the occurrences of each length
countrieslength = data1.select("CountryLength").rdd.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).toDF(outputcolumns).select("CountryLength.CountryLength", "RecordsCount")

# now you can call display or show on the dataframe to see the output
display(countrieslength)
The output is a two-column dataframe listing each distinct country-code length together with its record count.
If you want to apply multiple filter conditions to this dataframe, you can cache it and then compute counts for the different combinations of records based on the country-code length.
I have a dataFrame of approx. 4 million rows and 35 columns as input.
All I do to this dataFrame is the following steps:
For a list of given columns, I calculate a sum over a given list of group features and join it as a new column to my input dataFrame.
I drop each new sum column right after joining it to the dataFrame.
Therefore we end up with the same dataFrame as we started from (in theory).
However, I noticed that if my list of given columns gets too big (from more than 6 columns), the output dataFrame becomes impossible to manipulate. Even a simple display takes 10 minutes.
Here is an example of my code (df is my input dataFrame):
from pyspark.sql.functions import sum

for c in list_columns:
    df = df.join(df.groupby(list_group_features).agg(sum(c).alias('sum_' + c)), list_group_features)
    df = df.drop('sum_' + c)
This happens due to the inner workings of Spark and its lazy evaluation.
When you call groupby, join, or agg, Spark attaches these calls to the plan of the df object. So even though it is not executing anything on the data, you are building up a large execution plan that is stored internally in the Spark DataFrame object.
Only when you call an action (show, count, write, etc.) does Spark optimize the plan and execute it. If the plan is too large, the optimization step can take a while. Also remember that plan optimization happens on the driver, not on the executors, so if your driver is busy or overloaded, it delays the plan optimization step as well.
It is useful to remember that joins are expensive operations in Spark, both for optimization and execution. If you can, you should always avoid joins when operating on a single DataFrame and utilise the window functionality instead. Joins should only be used if you are joining different dataframes from different sources (different tables).
A way to optimize your code would be:
import pyspark
import pyspark.sql.functions as f
w = pyspark.sql.Window.partitionBy(list_group_features)
agg_sum_exprs = [f.sum(f.col(c)).over(w).alias("sum_" + c) for c in list_columns]
res_df = df.select(df.columns + agg_sum_exprs)
This should be scalable and fast for large list_group_features and list_columns lists.
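A hypothetical usage on a toy frame (the names are illustrative, not from the question), showing that every row keeps its original columns plus the per-group sum:
toy = spark.createDataFrame([(1, 'g1', 10), (2, 'g1', 20), (3, 'g2', 5)], ['id', 'grp', 'x'])
w_toy = pyspark.sql.Window.partitionBy('grp')
toy.select(toy.columns + [f.sum(f.col('x')).over(w_toy).alias('sum_x')]).show()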
I have a very large (150M rows - 30GB RAM) dataframe. I do a groupby (around 40 groups) and apply a function on each group. Takes about 30 minutes to perform everything. Here was the code I used:
df = df.groupby(by='date').apply(func=my_func)
Since the operations are not interdependent, I figured I'd use Dask to parallelize the processing of each group separately.
So I use this code:
import pandas as pd
from dask import dataframe as dd

df_dask = dd.from_pandas(df, npartitions=40)
template = pd.DataFrame(columns=['A','B','C','D','E'])
df_dask = df_dask.groupby(by='date').apply(func=my_func, meta=template)
df_dask = df_dask.compute()
However, when I run this, I get different results depending on the value of npartitions I give it. If I give a value of 1, it gives me the same (and correct) results, but then it takes the same amount of time as with pandas. If I give it a higher number, it performs faster, but returns way fewer rows. I don't understand the relationship between npartitions and the groupby.
Also, if I try with a slightly larger DataFrame (40GB), Dask runs out of memory, even though I have 64GB on my machine, while pandas is fine.
Any ideas?
Dask's DataFrameGroupBy.apply applies the user-provided function to each partition: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.groupby.DataFrameGroupBy.apply.
If you need a custom reduction, use a custom Aggregation: https://docs.dask.org/en/latest/dataframe-groupby.html#aggregate
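For example, a hedged sketch of a custom Aggregation (the column names are illustrative, echoing the question's 'date' grouping):
import pandas as pd
import dask.dataframe as dd

# sum as a tree reduction: chunk runs once per partition-group, agg combines the chunk results
custom_sum = dd.Aggregation(
    name='custom_sum',
    chunk=lambda s: s.sum(),
    agg=lambda chunk_sums: chunk_sums.sum(),
)

pdf = pd.DataFrame({'date': ['d1', 'd1', 'd2'], 'x': [1, 2, 3]})
ddf = dd.from_pandas(pdf, npartitions=2)
result = ddf.groupby('date').agg({'x': custom_sum}).compute()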
I've noticed that there are several uses of pd.DataFrame.groupby followed by an apply that implicitly assume that groupby is stable - that is, if a and b are instances of the same group, and a appeared before b pre-grouping, then a will appear before b following the grouping as well.
I think several answers rely on this implicitly, but, to be concrete, here is one using groupby+cumsum.
Is there anything actually promising this behavior? The documentation only states:
Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
Also, since pandas has indices, the functionality could theoretically be achieved without this guarantee (albeit in a more cumbersome way).
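For concreteness, here is a hedged toy example (not the linked answer itself) of the kind of groupby+cumsum usage that implicitly relies on this order preservation:
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a'], 'val': [1, 2, 3, 4, 5]})
# within each group the rows are visited in their original order,
# so the running sums are 1, 4, 9 for 'a' and 2, 6 for 'b'
print(df.groupby('key')['val'].cumsum())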
Although the docs don't state this, internally it uses a stable sort when generating the groups.
See:
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L291
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L4356
As I mentioned in the comments, this is important if you consider transform, which returns a Series with its index aligned to the original df. If the sort didn't preserve the order, alignment would require additional work, since the Series would need to be sorted prior to assigning. In fact, this is mentioned in the comments:
_algos.groupsort_indexer implements counting sort and it is at least
O(ngroups), where
ngroups = prod(shape)
shape = map(len, keys)
That is, linear in the number of combinations (cartesian product) of unique
values of groupby keys. This can be huge when doing multi-key groupby.
np.argsort(kind='mergesort') is O(count x log(count)) where count is the
length of the data-frame;
Both algorithms are stable sort and that is necessary for correctness of
groupby operations.
e.g. consider:
df.groupby(key)[col].transform('first')
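A hedged toy illustration (not from the post): transform('first') hands each row the first value observed in its group, and because the group order matches the original row order, the result lines up directly with the original index.
import pandas as pd

df = pd.DataFrame({'key': ['b', 'a', 'b', 'a'], 'col': [10, 20, 30, 40]})
# rows 0 and 2 (group 'b') get 10; rows 1 and 3 (group 'a') get 20
print(df.groupby('key')['col'].transform('first'))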
Yes; the description of the sort parameter of DataFrame.groupby now promises that groupby (with or without key sorting) "preserves the order of rows within each group":
sort : bool, default True
Sort group keys. Get better performance by
turning this off. Note this does not influence the order of
observations within each group. Groupby preserves the order of rows
within each group.
I am attempting to use Dask to handle a large file (50 GB). Typically, I would load it in memory and use Pandas. I want to group by two columns, "A" and "B", and whenever column "C" starts with a value, I want to forward-fill that value within that particular group.
In pandas, I would do the following:
df['C'] = df.groupby(['A','B'])['C'].fillna(method = 'ffill')
What would be the equivalent in Dask?
Also, I am a little bit lost as to how to structure problems in Dask as opposed to in Pandas.
Thank you,
My progress so far:
First set index:
df1 = df.set_index(['A','B'])
Then groupby:
df1.groupby(['A','B']).apply(lambda x: x.fillna(method='ffill')).compute()
It appears dask does not currently implement the fillna method for GroupBy objects. I tried PRing it some time ago and gave up quite quickly.
Also, dask doesn't support the method parameter (as it isn't always trivial to implement with delayed algorithms).
A workaround for this could be using fillna before grouping, like so:
df['C'] = df.fillna(0).groupby(['A','B'])['C']
Although this wasn't tested.
You can find my (failed) attempt here: https://github.com/nirizr/dask/tree/groupy_fillna
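Another hedged workaround (a sketch, not tested at the 50 GB scale): shuffle the dataframe so that all rows of each (A, B) group land in the same partition, then run the ordinary pandas groupby-ffill inside each partition with map_partitions:
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame({'A': [1, 1, 2, 2],
                  'B': ['x', 'x', 'y', 'y'],
                  'C': [10.0, None, 20.0, None]}),
    npartitions=2)

shuffled = ddf.shuffle(on=['A', 'B'])  # co-locate each (A, B) group in one partition
filled = shuffled.map_partitions(
    lambda pdf: pdf.assign(C=pdf.groupby(['A', 'B'])['C'].ffill()))
print(filled.compute())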