I've noticed that there are several uses of pd.DataFrame.groupby followed by an apply that implicitly assume groupby is stable - that is, if a and b are instances of the same group, and a appeared before b pre-grouping, then a will also appear before b after the grouping.
I think several answers clearly rely on this implicitly, but, to be concrete, here is one using groupby+cumsum.
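To make the pattern concrete, here is a minimal toy sketch of the kind of code I mean (my own made-up data, not the linked answer); the per-group running total is only meaningful if row order within each group is preserved:

import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'val': [1, 2, 3, 4]})
# A per-group running total; the result is index-aligned with df.
df['running_total'] = df.groupby('key')['val'].cumsum()
print(df)
#   key  val  running_total
# 0   a    1              1
# 1   b    2              2
# 2   a    3              4
# 3   b    4              6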
Is there anything actually promising this behavior? The documentation only states:
Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
Also, since pandas has indices, the functionality could theoretically be achieved without this guarantee as well (albeit in a more cumbersome way).
Although the docs don't state this, internally pandas uses a stable sort when generating the groups.
See:
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L291
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L4356
As I mentioned in the comments, this is important if you consider transform, which returns a Series with its index aligned to the original df. If the sorting didn't preserve the order, alignment would have to do additional work, since it would need to sort the Series prior to assigning. In fact, this is mentioned in the comments:
_algos.groupsort_indexer implements counting sort and it is at least
O(ngroups), where
ngroups = prod(shape)
shape = map(len, keys)
That is, linear in the number of combinations (cartesian product) of unique
values of groupby keys. This can be huge when doing multi-key groupby.
np.argsort(kind='mergesort') is O(count x log(count)) where count is the
length of the data-frame;
Both algorithms are stable sort and that is necessary for correctness of
groupby operations.
e.g. consider:
df.groupby(key)[col].transform('first')
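A toy illustration of that point (my own made-up data, not from the pandas sources): transform returns a Series aligned to the original frame, and because the groups are interleaved, each row picks up its own group's value directly only because the within-group order is preserved:

import pandas as pd

df = pd.DataFrame({'key': ['b', 'a', 'b', 'a'], 'col': [10, 20, 30, 40]})
# Each row receives the first value of its own group, aligned by index.
df['first_in_group'] = df.groupby('key')['col'].transform('first')
print(df)
#   key  col  first_in_group
# 0   b   10              10
# 1   a   20              20
# 2   b   30              10
# 3   a   40              20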
Yes; the description of the sort parameter of DataFrame.groupby now promises that groupby (with or without key sorting) "preserves the order of rows within each group":
sort : bool, default True
Sort group keys. Get better performance by
turning this off. Note this does not influence the order of
observations within each group. Groupby preserves the order of rows
within each group.
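A quick check of this promise on toy data (my own example): whether or not the keys are sorted, values within each group come out in their original order:

import pandas as pd

df = pd.DataFrame({'key': ['b', 'a', 'b', 'a'], 'val': [1, 2, 3, 4]})
for sort in (True, False):
    for name, group in df.groupby('key', sort=sort):
        print(sort, name, group['val'].tolist())
# sort=True yields the keys as a, b; sort=False yields b, a.
# In both cases the within-group order is preserved: a -> [2, 4], b -> [1, 3]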
Related
I want to use Dask for operations of the form
df.groupby(some_columns).apply(some_function)
where some_function() may compute some summary statistics, perform timeseries forecasting, or even just save the group to a single file in AWS S3.
Dask documentation states (and several other StackOverflow answers cite) that groupby-apply is not appropriate for aggregations:
Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
It is not clear whether Aggregation supports operations on multiple columns. However, this DataFrames tutorial seems to do exactly what I'm suggesting, with roughly some_function = lambda x: LinearRegression().fit(...). The example seems to work as intended, and I've similarly had no problems so far with e.g. some_function = lambda x: x.to_csv(...).
Under what conditions can I expect that some_function will be passed all rows of the group? If this is never guaranteed, is there a way to break the LinearRegression example? And most importantly, what is the best way to handle these use cases?
It appears that the current version of documentation and the source code are not in sync. Specifically, in the source code for dask.groupby, there is this message:
Dask groupby supports reductions, i.e., mean, sum and alike, and apply. The former do not shuffle the data and are efficiently implemented as tree reductions. The latter is implemented by shuffling the underlying partitions such that all items of a group can be found in the same partition.
This is not consistent with the warning in the docs about partition-group. The snippet below and task graph visualization also show that there is shuffling of data to ensure that partitions contain all members of the same group:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({'group': [0,0,1,1,2,0,1,2], 'npeople': [1,2,3,4,5,6,7,8]})
ddf = dd.from_pandas(df, npartitions=2)
def myfunc(df):
    return df['npeople'].sum()
results_pandas = df.groupby('group').apply(myfunc).sort_index()
results_dask = ddf.groupby('group').apply(myfunc).compute().sort_index()
print(sum(results_dask != results_pandas))
# will print 0, so there are no differences
# between dask and pandas groupby
This is speculative, but maybe one way to work around the scenario where partition-group handling leaves rows of a single group split across partitions is to explicitly re-partition the data so that each group is associated with a unique partition.
One way to achieve that is to create an index that is the same as the group-identifier column, as in the sketch below. This is in general not a cheap operation, but it can be helped by pre-processing the data so that the group identifier is already sorted.
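A speculative sketch of what I mean, reusing ddf and myfunc from the snippet above (the meta argument is my own guess at the output type):

# Setting the group column as the index forces a shuffle, so afterwards
# each group should live in a single partition.
ddf_by_group = ddf.set_index('group')  # expensive in general; cheaper if pre-sorted
results = (
    ddf_by_group
    .groupby('group')
    .apply(myfunc, meta=('npeople', 'int64'))
    .compute()
    .sort_index()
)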
Let's say I have this pyspark Dataframe:
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('BE',), ('France',), ('Latvia',)])
And let's say I want to collect various statistics about this data. For example, I might want to know how many rows use a 2-character country code and how many use longer country names:
from pyspark.sql import functions as F

count_short = data.where(F.length(F.col('Country')) == 2).count()
count_long = data.where(F.length(F.col('Country')) > 2).count()
This works, but when I want to collect many different counts based on different conditions, it becomes very slow even for tiny datasets. In Azure Synapse Studio, where I am working, every count takes 1-2 seconds to compute.
I need to do 100+ counts, and it takes multiple minutes to compute for a dataset of 10 rows. And before somebody asks, the conditions for those counts are more complex than in my example. I cannot group by length or do other tricks like that.
I am looking for a general way to do multiple counts on arbitrary conditions, fast.
I am guessing that the reason for the slow performance is that for every count call, my pyspark notebook starts some Spark processes that have significant overhead. So I assume that if there was some way to collect these counts in a single query, my performance problems would be solved.
One possible solution I thought of is to build a temporary column that indicates which of my conditions have been matched, and then call countDistinct on it. But then I would have individual counts for all combinations of condition matches. I also noticed that depending on the situation, the performance is a bit better when I do data = data.localCheckpoint() before computing my statistics, but the general problem still persists.
Is there a better way?
Function "count" can be replaced by "sum" with condition (Scala):
data.select(
  sum(
    when(length(col("Country")) === 2, 1).otherwise(0)
  ).alias("two_characters"),
  sum(
    when(length(col("Country")) > 2, 1).otherwise(0)
  ).alias("more_than_two_characters")
)
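For reference, a PySpark rendering of the same idea (a sketch, assuming data is the DataFrame from the question):

from pyspark.sql import functions as F

counts = data.select(
    F.sum(F.when(F.length(F.col('Country')) == 2, 1).otherwise(0)).alias('two_characters'),
    F.sum(F.when(F.length(F.col('Country')) > 2, 1).otherwise(0)).alias('more_than_two_characters'),
)
counts.show()  # a single Spark job returns both counts in one row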
While one way is to combine multiple queries into one, the other way is to cache the dataframe that is being queried again and again.
By caching the dataframe, we avoid re-evaluating it each time count() is invoked.
data.cache()
A few things to keep in mind: if you are applying multiple actions to your dataframe, it involves a lot of transformations, and you are reading the data from an external source, then you should definitely cache that dataframe before you apply even a single action to it.
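A sketch of the intended usage with the question's DataFrame:

from pyspark.sql import functions as F

data.cache()
data.count()  # the first action materializes the cache

# Subsequent counts read from the cached data instead of recomputing it.
count_short = data.where(F.length(F.col('Country')) == 2).count()
count_long = data.where(F.length(F.col('Country')) > 2).count()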
The answer provided by @pasha701 works, but you will have to keep adding columns for each country-code length you want to analyse.
You can use the code below to get the count for each country-code length, all in one single dataframe.
# import statements
from pyspark.sql.functions import *

# sample DataFrame
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('ACE',), ('BE',), ('France',), ('Latvia',)])

# add a column that gives the length of each country code
data1 = data.withColumn("CountryLength", length(col('Country')))

# column names for the final output
outputcolumns = ["CountryLength", "RecordsCount"]

# select the CountryLength column, convert it to an RDD, and count the
# occurrences of each length with a map/reduceByKey
countrieslength = (
    data1.select("CountryLength")
    .rdd.map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
    .toDF(outputcolumns)
    .select("CountryLength.CountryLength", "RecordsCount")
)

# display or show the dataframe to see the output
display(countrieslength)
The output is one row per CountryLength with the corresponding RecordsCount.
If you want to apply multiple filter conditions on this dataframe, you can cache it and then get the counts for the different combinations of records based on the country-code length.
df.Last_3mth_Avg.isnull().groupby([df['ShopID'],df['ProductID']]).sum().astype(int).reset_index(name='count')
The code above helps me see the number of null values by ShopID and ProductID. My question: df.Last_3mth_Avg.isnull() produces a Series, so how can groupby([df['ShopID'],df['ProductID']]) be used on it afterwards?
I use the solution from:
Pandas count null values in a groupby function
You should filter your df first:
df[df.Last_3mth_Avg.isnull()].groupby(['ShopID','ProductID']).agg('count')
There are two ways to use groupby:
The common way is to call it on the dataframe, where you just mention the column names in the by= parameter.
The second way is to apply it on a Series and pass equal-sized Series in the by= parameter. This is rarely used and helps when you want to do conversions on a specific column and use groupby in the same line.
So, the above code line should work
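For illustration, a minimal sketch of that second form with made-up data shaped like the question's:

import pandas as pd

df = pd.DataFrame({
    'ShopID': [1, 1, 2, 2],
    'ProductID': ['A', 'B', 'A', 'B'],
    'Last_3mth_Avg': [10.0, None, None, 5.0],
})

# The boolean Series and the two grouping Series share the same index,
# so pandas aligns them row by row.
null_counts = (
    df.Last_3mth_Avg.isnull()
    .groupby([df['ShopID'], df['ProductID']])
    .sum()
    .astype(int)
    .reset_index(name='count')
)
print(null_counts)
#    ShopID ProductID  count
# 0       1         A      0
# 1       1         B      1
# 2       2         A      1
# 3       2         B      0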
I have a dataframe with over 100 columns, and I would like to check all pairs of columns to see which ones are unique identifiers.
You can use drop_duplicates(subset), specifying the columns you would regard as possible identifiers in the subset argument.
Since you have so many columns, it will probably be easiest to take all columns and subtract the ones you would disregard (such as value columns).
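A sketch of that approach (my own helper, with df as your DataFrame and ignore as the columns you want to disregard): a pair of columns is a unique identifier exactly when dropping duplicates on that pair keeps every row.

from itertools import combinations

def unique_identifier_pairs(df, ignore=()):
    candidates = [c for c in df.columns if c not in ignore]
    pairs = []
    for a, b in combinations(candidates, 2):
        # The pair uniquely identifies rows if no duplicates are dropped.
        if len(df.drop_duplicates(subset=[a, b])) == len(df):
            pairs.append((a, b))
    return pairs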
You can also use Counter from the collections module (from collections import Counter); see the docs.
When working with groupby on a pandas DataFrame instance, I have never not used either as_index=False or reset_index(). I cannot actually think of any reason why I wouldn't do so. Because my behavior is not the pandas default (indeed, because the groupby index exists at all), I suspect that there is some functionality of pandas that I am not taking advantage of.
Can anyone describe cases where it would be advantageous to not reset the index?
When you perform a groupby/agg operation, it is natural to think of the result as a mapping from the groupby keys to the aggregated scalar values. If we were using plain Python, a dict would be the natural data structure to hold such a mapping from keys to values. Since we are using Pandas, a Series is the natural data structure. Its index would hold the keys, and the Series values would be the aggregated scalars. If there is more than one aggregated value for each key, then the natural data structure to use would be a DataFrame.
The advantage of holding the keys in an index rather than a column is that looking up values based on index labels is an O(1) operation, whereas looking up values based on a value in a column is an O(n) operation.
Since the result of a groupby/agg operation fits naturally into a Series or DataFrame with groupby keys as the index, and since indexes have this special fast lookup property, it is better to return the result in this form by default.
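A small sketch of that lookup difference on toy data:

import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'val': [1, 2, 3, 4]})

# Default: the group keys live in the index, so lookups are label-based.
agg_indexed = df.groupby('key')['val'].sum()
print(agg_indexed.loc['a'])  # 4

# With as_index=False the keys are an ordinary column, so a lookup
# needs a boolean scan of that column.
agg_flat = df.groupby('key', as_index=False)['val'].sum()
print(agg_flat.loc[agg_flat['key'] == 'a', 'val'].iloc[0])  # 4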