Dask DataFrame Groupby: Most frequent value of column in aggregate - python

A custom dask GroupBy Aggregation is very handy, but I am having trouble defining one that returns the most frequent value in a column.
What I have:
So from the example here, we can define custom aggregate functions like this:
custom_sum = dd.Aggregation('custom_sum', lambda s: s.sum(), lambda s0: s0.sum())
my_aggregate = {
    'A': custom_sum,
    'B': custom_most_often_value, ### <<< This is the goal.
    'C': ['max','min','mean'],
    'D': ['max','min','mean']
}
col_name = 'Z'
ddf_agg = ddf.groupby(col_name).agg(my_aggregate).compute()
While this works for custom_sum (as on the example page), the adaptation to the most frequent value could look like this (from the example here):
custom_most_often_value = dd.Aggregation('custom_most_often_value', lambda x:x.value_counts().index[0], lambda x0:x0.value_counts().index[0])
but it yields
ValueError: Metadata inference failed in `_agg_finalize`.
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
Then I tried to find the meta keyword in the dd.Aggregation implementation to define it, but could not find it. And the fact that it is not needed in the custom_sum example makes me think that the error is somewhere else.
So my question is: how do I get the most frequently occurring value of a column in a df.groupby(..).agg(..)? Thanks!

A quick clarification rather than an answer: the meta parameter is used in the .agg() method, to specify the column data types you expect, best expressed as a zero-length pandas dataframe. Dask will supply dummy data to your function otherwise, to try to guess those types, but this doesn't always work.

The issue you're running into is that, unlike in the custom_sum example you're looking at, the separate stages of this aggregation can't be the same function applied recursively.
I've modified the code from this answer, keeping the comments from user8570642 because they are very helpful. Note that this method also works for a list of groupby keys:
https://stackoverflow.com/a/46082075/3968619
import dask.dataframe as dd
import pandas as pd

def chunk(s):
    # for the comments, assume only a single grouping column, the
    # implementation can handle multiple group columns.
    #
    # s is a grouped series. value_counts creates a multi-series like
    # (group, value): count
    return s.value_counts()

def agg(s):
    # s is a grouped multi-index series. In .apply the full sub-df will be passed,
    # multi-index and all. Group on the value level and sum the counts. The
    # result of the lambda function is a series. Therefore, the result of the
    # apply is a multi-index series like (group, value): count
    return s.apply(lambda s: s.groupby(level=-1).sum())

    # faster version using pandas internals (unreachable after the return
    # above, so kept here only as a commented-out alternative):
    # s = s._selected_obj
    # return s.groupby(level=list(range(s.index.nlevels))).sum()

def finalize(s):
    # s is a multi-index series of the form (group, value): count. First
    # manually group on the group part of the index. The lambda will receive a
    # sub-series with multi index. Next, drop the group part from the index.
    # Finally, determine the index with the maximum value, i.e., the mode.
    level = list(range(s.index.nlevels - 1))
    return (
        s.groupby(level=level)
        .apply(lambda s: s.reset_index(level=level, drop=True).idxmax())
    )

# bound to the name `mode`, which is the name the groupby call uses
mode = dd.Aggregation('mode', chunk, agg, finalize)
chunk counts the values for the groupby object in each partition. agg takes the results from chunk, regroups them by the original groupby keys, and sums the value counts, so that we have the total value counts for every group. finalize takes the multi-index series provided by agg and returns the most frequently occurring value of B for each group in Z.
Here's a test case:
df = dd.from_pandas(
    pd.DataFrame({"A": [1,1,1,1,2,2,3]*10, "B": [5,5,5,5,1,1,1]*10,
                  'Z': ['mike','amy','amy','amy','chris','chris','sandra']*10}),
    npartitions=10)
res = df.groupby(['Z']).agg({'B': mode}).compute()
print(res)
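To see why the mode needs three distinct stages, here is a minimal pandas-only sketch that imitates the chunk / agg / finalize steps by hand on the same test data. It does not use dask at all: the manual split into ten equal pieces and the names partitions, chunk_counts, total_counts and mode_per_group are assumptions made purely for illustration, not what dask does internally.

```python
import pandas as pd

pdf = pd.DataFrame({
    "B": [5, 5, 5, 5, 1, 1, 1] * 10,
    "Z": ["mike", "amy", "amy", "amy", "chris", "chris", "sandra"] * 10,
})

# Hypothetical partitioning: cut the 70 rows into 10 pieces of 7 rows each.
n_parts = 10
size = len(pdf) // n_parts
partitions = [pdf.iloc[i * size:(i + 1) * size] for i in range(n_parts)]

# "chunk": per-partition counts, indexed by (group, value).
chunk_counts = [p.groupby("Z")["B"].value_counts() for p in partitions]

# "agg": concatenate the partial counts and add them up per (group, value).
total_counts = (
    pd.concat(chunk_counts)
    .groupby(level=[0, 1])
    .sum()
    .rename_axis(["Z", "B"])
    .rename("count")
)

# "finalize": within each group keep the value with the largest total count.
totals = total_counts.reset_index()
best_rows = totals.loc[totals.groupby("Z")["count"].idxmax()]
mode_per_group = best_rows.set_index("Z")["B"]

print(mode_per_group)
```

The counts have to be carried through the intermediate stage because the most frequent value of each partition alone is not enough information to recover the most frequent value overall.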

Related

Select all rows in Python pandas

I have a function that prints the sum of a column of a pandas DataFrame after filtering on some rows (to be defined), and the fraction this quantity makes up of the same sum without any filter:
def my_function(df, filter_to_apply, col):
    my_sum = np.sum(df[filter_to_apply][col])
    print(my_sum)
    print(my_sum/np.sum(df[col]))
Now I am wondering if there is any way to have a filter_to_apply that actually doesn't do any filter (i.e. keeps all rows), to keep using my function (that is actually a bit more complex and convenient) even when I don't want any filter.
So, some filter_f1 that would do: df[filter_f1] = df and could be used with other filters: filter_f1 & filter_f2.
One possible answer is: df.index.isin(df.index) but I am wondering if there is anything easier to understand (e.g. I tried to use just True but it didn't work).
A Python slice object, i.e. slice(None), acts as an object that selects all indexes in an indexable object (slice(-1), by contrast, would drop the last element). So df[slice(None)] selects all rows in the DataFrame. You can store that in a variable as an initial value which you can further refine in your logic:
filter_to_apply = slice(None) # initialize to select all rows
... # logic that may set `filter_to_apply` to something more restrictive
my_function(df, filter_to_apply, col)
This is a way to select all rows:
df.iloc[range(0, len(df))]
and so is this:
df[:]
But I haven't figured out a way to pass : as an argument.
There's an indexer called loc in pandas that filters rows. You could do something like this:
df2 = df.loc[<Filter here>]
#Filter can be something like df['price']>500 or df['name'] == 'Brian'
#basically something that for each row returns a boolean
total = df2['ColumnToSum'].sum()
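One option that also combines with other filters through & is an all-True boolean mask. The sketch below is only an illustration: the sample DataFrame, the column names price and qty, and the variable keep_all are made up, and my_function is adapted from the question to return its two numbers instead of printing them.

```python
import numpy as np
import pandas as pd

def my_function(df, filter_to_apply, col):
    # Same logic as in the question, but returning the values.
    my_sum = np.sum(df[filter_to_apply][col])
    return my_sum, my_sum / np.sum(df[col])

# Hypothetical sample data.
df = pd.DataFrame({"price": [100, 600, 900], "qty": [1, 2, 3]})

# A mask that keeps every row; it is an ordinary boolean Series,
# so it can be combined with other boolean filters using &.
keep_all = pd.Series(True, index=df.index)
expensive = df["price"] > 500

total_all, share_all = my_function(df, keep_all, "qty")
total_exp, share_exp = my_function(df, keep_all & expensive, "qty")

# slice(None) is the object form of ":" and also selects every row,
# but unlike the mask it cannot be combined with & .
all_rows = df[slice(None)]
```

Because keep_all is a plain boolean Series, keep_all & other_filter gives back other_filter unchanged, so it works as a neutral starting value.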

Drop Pandas DataFrame lines according to a GroupBy property

I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping in [my_df1,my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max=16
Bad=[]
for Sample in SampleList:
    for n in Sample[1]['Group']:
        df=Sample[0].loc[Sample[0]['Group']==n] #This is inelegant, but trying to work
                                                #with Sample[1] in the for doesn't work
        if (df['Value'].max()>my_max):
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it neither adds the Bad_Row column to my df nor modifies my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it will subset your initial data and use boolean logic to see if any of the 'Value' entries in your input data meet your specified criterion.
def my_func(row,grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value']>16).any():
        return 'Bad Row'
    else:
        return 'Good Row'

my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x,grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
Based on dubbbdan's idea, here is code that works:
my_max=16

def my_func(row,grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value']>my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x,grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row']!=0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis = 1, inplace = True)
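The same cleaning can probably be done without a row-wise apply, by computing the maximum Value per group once and keeping only the groups at or below the threshold. This is a sketch using my_df1 and my_df1_Group from the question; the names group_max, good_groups and cleaned are introduced here for illustration only.

```python
import pandas as pd

my_df1 = pd.DataFrame(
    [[1, 12], [1, 15], [1, 3], [1, 6], [2, 8], [2, 1], [2, 17]],
    columns=["Group", "Value"],
)
my_df1_Group = pd.DataFrame([[1, 57], [2, 63]], columns=["Group", "Group_Value"])

my_max = 16

# Largest Value inside each group of the element-level data.
group_max = my_df1.groupby("Group")["Value"].max()

# Labels of the groups in which no element exceeds the threshold.
good_groups = group_max.index[group_max <= my_max]

# Keep only those groups in the aggregated frame (returns a new DataFrame).
cleaned = my_df1_Group[my_df1_Group["Group"].isin(good_groups)]
print(cleaned)
```

Unlike the in-place drop above, this returns a new DataFrame, so inside a loop the result has to be stored somewhere, for example in a dict keyed by sample name.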

Why do I get a series inside an apply/assign function in pandas. Want to use each value to look up a dict

I have a dict of countries and population:
population_dict = {"Germany": 1111, .... }
In my df (sort_countries) I have a column called 'country' and I want to add another column called 'population' from the dictionary above (matching 'country' with 'population'):
population_df = sort_countries.assign(
    population=lambda x: population_dict[x["country"]], axis = 1)
population_df.head()
which gives the error: TypeError: 'Series' objects are mutable, thus they cannot be hashed.
Why is x["country"] a Series when I would imagine it should return just the name of the country.
This bit of pandas always confuses me. In my lambdas I would expect x to be a row and I just select the country from that row. Instead len(x["country"]) gives me 192 (the number of my countries, the whole Series).
How else can I match them using lambdas and not a separate function?
Note that assign passes the whole DataFrame to the lambda, so x["country"] is the entire column as a Series (hence the length of 192), and a Series cannot be used to index the dictionary. x["country"].item() only returns a bare value when the Series holds exactly one element.
However, a better approach, tailor-made for this kind of thing, is Series.map:
population_df["population"] = population_df["country"].map(population_dict)
map looks up each value taken from population_df["country"] as a key in population_dict and returns the corresponding value.
Also:
population_df["population"] = population_df.apply(lambda x: population_dict[x["country"]], axis=1)
works.
Or:
population_df["population"] = population_df[["country"]].applymap(lambda x: population_dict[x])
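If you want to stay with assign and a lambda, as asked, map can be used inside it, since the lambda receives the whole DataFrame. A small sketch with made-up data: only the Germany entry comes from the question, while the France figure and the Atlantis row are hypothetical and there just to show what happens with a missing key.

```python
import pandas as pd

# "Germany": 1111 is from the question; the France value is a placeholder.
population_dict = {"Germany": 1111, "France": 2222}

# Hypothetical frame; "Atlantis" is deliberately absent from the dictionary.
sort_countries = pd.DataFrame({"country": ["France", "Germany", "Atlantis"]})

# The lambda gets the whole DataFrame, so map the column as a whole.
population_df = sort_countries.assign(
    population=lambda d: d["country"].map(population_dict)
)
print(population_df)
```

Countries missing from the dictionary end up as NaN rather than raising a KeyError, which is one practical difference from indexing the dictionary directly.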

How to extract value out of an array of Ordereddicts?

If I have a CSV file where one column holds OrderedDicts, how do I create a new column by extracting a single element from each OrderedDict using Python (3+) / pandas (0.18)?
Here's an example. My column, attributes, has billingPostalCodes hidden in ordereddicts. All I care about is creating a column with the billingPostalCodes.
Here's what my data looks like now:
import pandas as pd
from datetime import datetime
import csv
from collections import OrderedDict
df = pd.read_csv('sf_account_sites.csv')
print(df)
yields:
id attributes
1 OrderedDict([(u'attributes', OrderedDict([(u'type', u'Account'), (u'url', u'/services/data/v29.0/sobjects/Account/001d000001tKZmWAAW')])), (u'BillingPostalCode', u'85020')])
2 OrderedDict([(u'attributes', OrderedDict([(u'type', u'Account'), (u'url', u'/services/data/v29.0/sobjects/Account/001d000001tKZmWAAW')])), (u'BillingPostalCode', u'55555')])
...
I know on an individual level if I do this:
dict = OrderedDict([(u'attributes', OrderedDict([(u'type', u'Account'), (u'url', u'/services/data/v29.0/sobjects/Account/001d000001tKZmWAAW')])), (u'BillingPostalCode', u'85020')])
print(dict['BillingPostalCode'])
I'll get 85020 back as a result.
What do I have to do to get it to look like this?
id zip_codes
1 85020
2 55555
...
Do I have to use an apply function? A for loop? I've tried a lot of different things but I can't get anything to work on the dataframe.
Thanks in advance, and let me know if I need to be more specific.
This took me a while to work out, but the problem is resolved by doing the following:
df.apply(lambda row: row["attributes"]["BillingPostalCode"], axis = 1)
The trick here is to note that axis = 1 forces pandas to iterate through every row, rather than each column (which is the default setting, as seen in the docs).
DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
    Applies function along input axis of DataFrame.
    Objects passed to functions are Series objects having index either the DataFrame’s index (axis=0) or the columns (axis=1). Return type depends on whether passed function aggregates, or the reduce argument if the DataFrame is empty.
    Parameters:
        func : function
            Function to apply to each column/row
        axis : {0 or ‘index’, 1 or ‘columns’}, default 0
            0 or ‘index’: apply function to each column
            1 or ‘columns’: apply function to each row
From there, it is a simple matter to first extract the relevant column - in this case attributes - and then from there extract only the BillingPostalCode.
You'll need to format the resulting DataFrame to have the correct column names.
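As a rough end-to-end illustration, the sketch below builds a small DataFrame whose attributes column already contains OrderedDict objects and extracts the postal code into a zip_codes column. That is an assumption: a column read straight from a CSV file holds plain strings, which would have to be parsed into dictionaries first. The shortened nested records are made up for the example.

```python
from collections import OrderedDict

import pandas as pd

# Assumption: the column holds real OrderedDict objects, not their string form.
df = pd.DataFrame({
    "id": [1, 2],
    "attributes": [
        OrderedDict([
            ("attributes", OrderedDict([("type", "Account")])),
            ("BillingPostalCode", "85020"),
        ]),
        OrderedDict([
            ("attributes", OrderedDict([("type", "Account")])),
            ("BillingPostalCode", "55555"),
        ]),
    ],
})

# Apply on the single column: each element d is one OrderedDict.
df["zip_codes"] = df["attributes"].apply(lambda d: d["BillingPostalCode"])

result = df[["id", "zip_codes"]]
print(result)
```

Applying the function to the one column avoids building a full row object for every record, and assigning to df["zip_codes"] takes care of the column name in the same step.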

Conditional iteration of key,value in DataFrameGroupBy

I have a pandas (v 0.12) dataframe data in python (2.7). I groupby() with respect to the A and B columns in data to form the groups object, which is of type <class 'pandas.core.groupby.DataFrameGroupBy'>.
I want to loop through and apply a function to the dataframes within groups that have more than one row in them. My code is below, here each dataframe is the value in the key,value pair:
import pandas as pd
groups = data.groupby(['A','B'])
len(groups)
>> 196320 # too large - will be slow to iterate through all
for key, value in groups:
    if len(value)>1:
        print(value)
Since I am only interested in applying the function to values where len(value)>1, is it possible to save time by embedding this condition as a filter, so that I loop through only the key-value pairs that satisfy it? I can do something like the code below to ascertain the size of each value, but I am not sure how to marry this aggregation with the original groups object.
size_values = data.groupby(['A','B']).agg({'C' : [np.size]})
I am hoping the question is clear, please let me know if any clarification is needed.
You could assign length of the group back to column and filter by its value:
data['count'] = data.groupby(['A','B'],as_index=False)['A'].transform(np.size)
After that you could:
data[data['count'] > 1].groupby(['A','B']).apply(your_function)
Or just skip assignment if it is a one time operation:
data[data.groupby(['A','B'],as_index=False)['A'].transform(np.size) > 1].groupby(['A','B']).apply(your_function)
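For reference, here is a small self-contained sketch of the same idea on made-up data, using the string alias 'size' in transform and then looping only over the groups that survive the filter. The sample frame and the names group_sizes, multi_row and seen_keys are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: (1, 'x') and (3, 'z') have two rows, (2, 'y') has one.
data = pd.DataFrame({
    "A": [1, 1, 2, 3, 3],
    "B": ["x", "x", "y", "z", "z"],
    "C": [10, 20, 30, 40, 50],
})

# Size of the group each row belongs to, aligned with the original rows.
group_sizes = data.groupby(["A", "B"])["C"].transform("size")

# Drop the single-row groups before grouping again.
multi_row = data[group_sizes > 1]

seen_keys = []
for key, value in multi_row.groupby(["A", "B"]):
    # Only groups with more than one row reach this point.
    seen_keys.append(key)
    print(key, len(value))
```

The filtering step is a single vectorized pass, so the Python-level loop only runs over the groups that actually need processing.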
