pandas - Groupby two functions - python

I've been trying to get a cumsum on a pandas groupby object. I need the cumsum to be shifted by one, which is achieved by shift(). However, doing both of these functions on a single groupby object gives some unwanted results:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [2, 3, 5, 2, 3, 5]})
df.groupby('A').cumsum().shift()
which gives:
B
0 NaN
1 2.0
2 5.0
3 10.0
4 2.0
5 5.0
I.e. the last value of the cumsum() on group 1 is shifted into the first value of group 2. What I want is for these groups to stay separated, and to get:
B
0 NaN
1 2.0
2 5.0
3 NaN
4 2.0
5 5.0
But I'm not sure how to get both functions to work on the groupby object combined. Can't find this question anywhere else. Have been playing around with agg but can't seem to work that out. Any help would be appreciated.

Use a lambda function with GroupBy.apply. It is also necessary to specify the column to process after groupby:
df['B'] = df.groupby('A')['B'].apply(lambda x: x.cumsum().shift())
print (df)
A B
0 1 NaN
1 1 2.0
2 1 5.0
3 2 NaN
4 2 2.0
5 2 5.0

The result of your first operation df.groupby('A').cumsum() is a regular dataframe. It is equivalent to df.groupby('A')[['B']].cumsum(), but Pandas conveniently allows you to omit the [['B']] indexing part.
Any subsequent operation on this dataframe therefore will not by default be performed groupwise, unless you use GroupBy again:
res = df.groupby('A').cumsum().groupby(df['A']).shift()
But, as you can see, this repeats the grouping operation and will be inefficient. You can instead define a single function which combines cumsum and shift in the correct order, then apply this function on a single GroupBy object. Defining this single function is known as function composition, and it's not native to Python. Here are a few alternatives:
Define a new named function
This is an explicit and recommended solution:
def cum_shift(x):
    return x.cumsum().shift()
res1 = df.groupby('A')[['B']].apply(cum_shift)
Define an anonymous lambda function
A one-line version of the above:
res2 = df.groupby('A')[['B']].apply(lambda x: x.cumsum().shift())
Use a library which composes
This is a pure functional solution; for example, via the 3rd party library toolz:
from toolz import compose
from operator import methodcaller
cumsum_shift_comp = compose(methodcaller('shift'), methodcaller('cumsum'))
res3 = df.groupby('A')[['B']].apply(cumsum_shift_comp)
All the above give the equivalent result:
assert res.equals(res1) and res1.equals(res2) and res2.equals(res3)
print(res1)
B
0 NaN
1 2.0
2 5.0
3 NaN
4 2.0
5 5.0
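As a further variation on the answers above, here is a minimal sketch that uses GroupBy.transform instead of apply. This is a swapped-in technique, not one proposed in the answers; it assumes you want the result aligned to the original row index, which transform provides. The name shifted is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [2, 3, 5, 2, 3, 5]})

# transform calls the function once per group and returns a Series
# aligned to the original index, so the groups never leak into each other
shifted = df.groupby('A')['B'].transform(lambda s: s.cumsum().shift())
print(shifted)
```

Because the result keeps the original index, it can be assigned straight back with df['B'] = shifted.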

Filling NaN values with rolling mean of the previous non-NaN values

I have recently come across a case where I would like to replace NaN values with the rolling mean of the previous non-NaN values in such a way that each newly generated rolling mean is then considered a non-NaN and is used for the next NaN. This is the sample data set:
df = pd.DataFrame({'col1': [1, 3, 4, 5, 6, np.nan, np.nan, np.nan]})
df
col1
0 1.0
1 3.0
2 4.0
3 5.0
4 6.0
5 NaN # (6.0 + 5.0) / 2
6 NaN # (5.5 + 6.0) / 2
7 NaN # ...
I have also found a solution for this which I am struggling to understand:
from functools import reduce
reduce(lambda x, _: x.fillna(x.rolling(2, min_periods=2).mean().shift()), range(df['col1'].isna().sum()), df)
My problem with this solution is that reduce takes three arguments here: first the lambda function, then the iterable. I don't understand the final df passed to reduce, and I struggle to understand how the call as a whole populates the NaN values.
I would appreciate any explanation of how it works. Also, is there a pandas- or NumPy-based solution, since reduce does not seem efficient here?
for i in df.index:
    if np.isnan(df.loc[i, "col1"]):
        # use .loc to avoid chained assignment
        df.loc[i, "col1"] = (df.loc[i - 1, "col1"] + df.loc[i - 2, "col1"]) / 2
This for loop can be a starting point; it will fail if either of the first two values of the dataframe is NaN.
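To address the question about the last argument: functools.reduce(function, iterable, initial) starts from initial (here df) and feeds the running result back into the function once per item of the iterable. The sketch below spells that out as an explicit loop. The helper name fill_once and the variable names are illustrative, not part of the original solution.

```python
from functools import reduce

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 3, 4, 5, 6, np.nan, np.nan, np.nan]})

def fill_once(frame):
    # Mean of the two previous rows, shifted down one row so that it
    # lines up with the row it is meant to fill
    return frame.fillna(frame.rolling(2, min_periods=2).mean().shift())

n_missing = int(df['col1'].isna().sum())

# reduce: df is the initial value; the items of range(...) are ignored
# and only control how many times the step is applied
via_reduce = reduce(lambda acc, _: fill_once(acc), range(n_missing), df)

# The same computation written as a plain loop
via_loop = df
for _ in range(n_missing):
    via_loop = fill_once(via_loop)

print(via_reduce)
```

Each pass can only fill the first remaining NaN, because the rolling mean of a window containing NaN is itself NaN; that is why the step is repeated once per missing value.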

How to implement arbitrary condition in pandas style function? [duplicate]

I would like to perform arithmetic on one or more dataframes columns using pd.eval. Specifically, I would like to port the following code that evaluates a formula:
x = 5
df2['D'] = df1['A'] + (df1['B'] * x)
...to code using pd.eval. The reason for using pd.eval is that I would like to automate many workflows, so creating them dynamically will be useful to me.
My two input DataFrames are:
import pandas as pd
import numpy as np
np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df1
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
3 8 8 1 6
4 7 7 8 1
df2
A B C D
0 5 9 8 9
1 4 3 0 3
2 5 0 2 3
3 8 1 3 3
4 3 7 0 1
I am trying to better understand pd.eval's engine and parser arguments to determine how best to solve my problem. I have gone through the documentation, but the difference was not made clear to me.
What arguments should be used to ensure my code is working at the maximum performance?
Is there a way to assign the result of the expression back to df2?
Also, to make things more complicated, how do I pass x as an argument inside the string expression?
You can use 1) pd.eval(), 2) df.query(), or 3) df.eval(). Their various features and functionality are discussed below.
Examples will involve these dataframes (unless otherwise specified).
np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df3 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df4 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
1) pandas.eval
This is the "Missing Manual" that pandas doc should contain.
Note: of the three functions being discussed, pd.eval is the most important. df.eval and df.query call pd.eval under the hood. Behaviour and usage are more or less consistent across the three functions, with some minor semantic variations which will be highlighted later. This section introduces functionality that is common to all three functions, including (but not limited to) allowed syntax, precedence rules, and keyword arguments.
pd.eval can evaluate arithmetic expressions which can consist of variables and/or literals. These expressions must be passed as strings. So, to answer the question as stated, you can do
x = 5
pd.eval("df1.A + (df1.B * x)")
Some things to note here:
The entire expression is a string
df1, df2, and x refer to variables in the global namespace; these are picked up by eval when parsing the expression
Specific columns are accessed using the attribute accessor index. You can also use "df1['A'] + (df1['B'] * x)" to the same effect.
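A minimal sketch of the points above, using a small DataFrame whose values match columns A and B of df1 but are written out explicitly so the snippet does not depend on the random seed. The names via_eval and plain are illustrative.

```python
import pandas as pd

# Same values as columns A and B of df1 in the examples
df1 = pd.DataFrame({'A': [5, 7, 2, 8, 7],
                    'B': [0, 9, 4, 8, 7]})
x = 5

# The names df1 and x inside the string are looked up in the calling namespace
via_eval = pd.eval("df1.A + (df1.B * x)")

# The equivalent plain pandas expression
plain = df1['A'] + (df1['B'] * x)

print(via_eval.tolist())
print(plain.tolist())
```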
I will be addressing the specific issue of reassignment in the section explaining the target=... attribute below. But for now, here are more simple examples of valid operations with pd.eval:
pd.eval("df1.A + df2.A") # Valid, returns a pd.Series object
pd.eval("abs(df1) ** .5") # Valid, returns a pd.DataFrame object
...and so on. Conditional expressions are also supported in the same way. The statements below are all valid expressions and will be evaluated by the engine.
pd.eval("df1 > df2")
pd.eval("df1 > 5")
pd.eval("df1 < df2 and df3 < df4")
pd.eval("df1 in [1, 2, 3]")
pd.eval("1 < 2 < 3")
A list detailing all the supported features and syntax can be found in the documentation. In summary,
Arithmetic operations except for the left shift (<<) and right shift (>>) operators, e.g., df + 2 * pi / s ** 4 % 42 - the_golden_ratio
Comparison operations, including chained comparisons, e.g., 2 < df < df2
Boolean operations, e.g., df < df2 and df3 < df4 or not df_bool
list and tuple literals, e.g., [1, 2] or (1, 2)
Attribute access, e.g., df.a
Subscript expressions, e.g., df[0]
Simple variable evaluation, e.g., pd.eval('df') (this is not very useful)
Math functions: sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2.
This section of the documentation also specifies syntax that is not supported, including set/dict literals, if-else statements, loops, comprehensions, and generator expressions.
From the list, it is obvious you can also pass expressions involving the index, such as
pd.eval('df1.A * (df1.index > 1)')
1a) Parser Selection: The parser=... argument
pd.eval supports two different parser options when parsing the expression string to generate the syntax tree: pandas and python. The main difference between the two is highlighted by slightly differing precedence rules.
Using the default parser pandas, the overloaded bitwise operators & and | which implement vectorized AND and OR operations with pandas objects will have the same operator precedence as and and or. So,
pd.eval("(df1 > df2) & (df3 < df4)")
Will be the same as
pd.eval("df1 > df2 & df3 < df4")
# pd.eval("df1 > df2 & df3 < df4", parser='pandas')
And also the same as
pd.eval("df1 > df2 and df3 < df4")
With the default pandas parser, the parentheses are optional. In conventional Python, they would be required to override the higher precedence of the bitwise operators:
(df1 > df2) & (df3 < df4)
Without that, we end up with
df1 > df2 & df3 < df4
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Use parser='python' if you want to maintain consistency with python's actual operator precedence rules while evaluating the string.
pd.eval("(df1 > df2) & (df3 < df4)", parser='python')
The other difference between the two parsers is the semantics of the == and != operators with list and tuple nodes, which have semantics similar to in and not in respectively when using the 'pandas' parser. For example,
pd.eval("df1 == [1, 2, 3]")
Is valid, and will run with the same semantics as
pd.eval("df1 in [1, 2, 3]")
On the other hand, pd.eval("df1 == [1, 2, 3]", parser='python') will raise a NotImplementedError.
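The precedence difference described in this section can be checked with a small sketch. The four DataFrames below are made-up values chosen only for illustration; the comparison against the plain Python expression is the point.

```python
import pandas as pd

# Small illustrative frames (values are arbitrary)
df1 = pd.DataFrame({'A': [5, 7, 2], 'B': [0, 9, 4]})
df2 = pd.DataFrame({'A': [5, 4, 5], 'B': [9, 3, 0]})
df3 = pd.DataFrame({'A': [1, 8, 3], 'B': [2, 2, 2]})
df4 = pd.DataFrame({'A': [4, 4, 4], 'B': [1, 5, 3]})

# Plain Python: parentheses are required around each comparison
expected = (df1 > df2) & (df3 < df4)

# Default parser='pandas': & binds like `and`, so no parentheses are needed
no_parens = pd.eval("df1 > df2 & df3 < df4")

# parser='python': keep the parentheses, as in ordinary Python
with_parens = pd.eval("(df1 > df2) & (df3 < df4)", parser='python')

print(no_parens.equals(expected), with_parens.equals(expected))
```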
1b) Backend Selection: The engine=... argument
There are two options - numexpr (the default) and python. The numexpr option uses the numexpr backend which is optimized for performance.
With Python backend, your expression is evaluated similar to just passing the expression to Python's eval function. You have the flexibility of doing more inside expressions, such as string operations, for instance.
df = pd.DataFrame({'A': ['abc', 'def', 'abacus']})
pd.eval('df.A.str.contains("ab")', engine='python')
0 True
1 False
2 True
Name: A, dtype: bool
Unfortunately, this method offers no performance benefits over the numexpr engine, and there are very few security measures to ensure that dangerous expressions are not evaluated, so use at your own risk! It is generally not recommended to change this option to 'python' unless you know what you're doing.
1c) local_dict and global_dict arguments
Sometimes, it is useful to supply values for variables used inside expressions, but not currently defined in your namespace. To do so, pass a dictionary to local_dict.
For example:
pd.eval("df1 > thresh")
UndefinedVariableError: name 'thresh' is not defined
This fails because thresh is not defined. However, this works:
pd.eval("df1 > thresh", local_dict={'thresh': 10})
This is useful when you have variables to supply from a dictionary. Alternatively, with the Python engine, you could simply do this:
mydict = {'thresh': 5}
# Dictionary values with *string* keys cannot be accessed without
# using the 'python' engine.
pd.eval('df1 > mydict["thresh"]', engine='python')
But this may be much slower than using the 'numexpr' engine and passing a dictionary to local_dict or global_dict. Hopefully, this makes a convincing argument for the use of these parameters.
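A minimal sketch of local_dict in use. As an assumption made for robustness, the DataFrame itself is also passed through the dictionary, so the expression does not depend on what happens to be defined in the calling scope. The name thresh comes from the example above; the data values are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [5, 7, 2, 8, 7],
                    'B': [0, 9, 4, 8, 7]})

# Every name used in the expression is supplied explicitly
names = {'df1': df1, 'thresh': 5}
result = pd.eval("df1 > thresh", local_dict=names)

print(result)
```

Supplying all names this way is convenient when expressions are built dynamically, because the set of variables an expression may see is then spelled out in one place.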
1d) The target (+ inplace) argument, and Assignment Expressions
This is not often a requirement because there are usually simpler ways of doing this, but you can assign the result of pd.eval to an object that implements __getitem__ such as dicts, and (you guessed it) DataFrames.
Consider the example in the question
x = 5
df2['D'] = df1['A'] + (df1['B'] * x)
To assign a column "D" to df2, we do
pd.eval('D = df1.A + (df1.B * x)', target=df2)
A B C D
0 5 9 8 5
1 4 3 0 52
2 5 0 2 22
3 8 1 3 48
4 3 7 0 42
This is not an in-place modification of df2 (but it can be... read on). Consider another example:
pd.eval('df1.A + df2.A')
0 10
1 11
2 7
3 16
4 10
dtype: int32
If you wanted to (for example) assign this back to a DataFrame, you could use the target argument as follows:
df = pd.DataFrame(columns=list('FBGH'), index=df1.index)
df
F B G H
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
df = pd.eval('B = df1.A + df2.A', target=df)
# Similar to
# df = df.assign(B=pd.eval('df1.A + df2.A'))
df
F B G H
0 NaN 10 NaN NaN
1 NaN 11 NaN NaN
2 NaN 7 NaN NaN
3 NaN 16 NaN NaN
4 NaN 10 NaN NaN
If you wanted to perform an in-place mutation on df, set inplace=True.
pd.eval('B = df1.A + df2.A', target=df, inplace=True)
# Similar to
# df['B'] = pd.eval('df1.A + df2.A')
df
F B G H
0 NaN 10 NaN NaN
1 NaN 11 NaN NaN
2 NaN 7 NaN NaN
3 NaN 16 NaN NaN
4 NaN 10 NaN NaN
If inplace is set without a target, a ValueError is raised.
While the target argument is fun to play around with, you will seldom need to use it.
If you wanted to do this with df.eval, you would use an expression involving an assignment:
df = df.eval("B = @df1.A + @df2.A")
# df.eval("B = @df1.A + @df2.A", inplace=True)
df
F B G H
0 NaN 10 NaN NaN
1 NaN 11 NaN NaN
2 NaN 7 NaN NaN
3 NaN 16 NaN NaN
4 NaN 10 NaN NaN
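To make the difference between the copying and the in-place behaviour of target concrete, here is a small sketch. The frames repeat the relevant columns of df1 and df2 with their values written out; df2_inplace is a hypothetical name introduced only to keep the two cases separate.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [5, 7, 2, 8, 7],
                    'B': [0, 9, 4, 8, 7]})
df2 = pd.DataFrame({'A': [5, 4, 5, 8, 3],
                    'D': [9, 3, 3, 3, 1]})
x = 5

# Without inplace=True, pd.eval returns a modified copy of the target
out = pd.eval("D = df1.A + (df1.B * x)", target=df2)
print(out['D'].tolist())   # new values
print(df2['D'].tolist())   # original df2 is untouched

# With inplace=True, the target itself is modified
df2_inplace = df2.copy()
pd.eval("D = df1.A + (df1.B * x)", target=df2_inplace, inplace=True)
print(df2_inplace['D'].tolist())
```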
Note
One of pd.eval's unintended uses is parsing literal strings in a manner very similar to ast.literal_eval:
pd.eval("[1, 2, 3]")
array([1, 2, 3], dtype=object)
It can also parse nested lists with the 'python' engine:
pd.eval("[[1, 2, 3], [4, 5], [10]]", engine='python')
[[1, 2, 3], [4, 5], [10]]
And lists of strings:
pd.eval(["[1, 2, 3]", "[4, 5]", "[10]"], engine='python')
[[1, 2, 3], [4, 5], [10]]
The problem, however, is for lists with length larger than 100:
pd.eval(["[1]"] * 100, engine='python') # Works
pd.eval(["[1]"] * 101, engine='python')
AttributeError: 'PandasExprVisitor' object has no attribute 'visit_Ellipsis'
More information on this error, its causes, fixes, and workarounds can be found here.
2) DataFrame.eval:
As mentioned above, df.eval calls pd.eval under the hood, with a bit of juxtaposition of arguments. The v0.23 source code shows this:
def eval(self, expr, inplace=False, **kwargs):
    from pandas.core.computation.eval import eval as _eval
    inplace = validate_bool_kwarg(inplace, 'inplace')
    resolvers = kwargs.pop('resolvers', None)
    kwargs['level'] = kwargs.pop('level', 0) + 1
    if resolvers is None:
        index_resolvers = self._get_index_resolvers()
        resolvers = dict(self.iteritems()), index_resolvers
    if 'target' not in kwargs:
        kwargs['target'] = self
    kwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)
    return _eval(expr, inplace=inplace, **kwargs)
eval creates arguments, does a little validation, and passes the arguments on to pd.eval.
For more, you can read on: When to use DataFrame.eval() versus pandas.eval() or Python eval()
2a) Usage Differences
2a1) Expressions with DataFrames vs. Series Expressions
For dynamic queries associated with entire DataFrames, you should prefer pd.eval. For example, there is no simple way to specify the equivalent of pd.eval("df1 + df2") when you call df1.eval or df2.eval.
2a2) Specifying Column Names
Another major difference is how columns are accessed. For example, to add two columns "A" and "B" in df1, you would call pd.eval with the following expression:
pd.eval("df1.A + df1.B")
With df.eval, you need only supply the column names:
df1.eval("A + B")
Since, within the context of df1, it is clear that "A" and "B" refer to column names.
You can also refer to the index using index (unless the index is named, in which case you would use the name).
df1.eval("A + index")
Or, more generally, for any DataFrame with an index having 1 or more levels, you can refer to the kth level of the index in an expression using the variable "ilevel_k" which stands for "index at level k". IOW, the expression above can be written as df1.eval("A + ilevel_0").
These rules also apply to df.query.
2a3) Accessing Variables in Local/Global Namespace
Variables supplied inside expressions must be preceded by the "@" symbol, to avoid confusion with column names.
A = 5
df1.eval("A > @A")
The same goes for query.
It goes without saying that your column names must follow the rules for valid identifier naming in Python to be accessible inside eval. See here for a list of rules on naming identifiers.
2a4) Multiline Queries and Assignment
A little known fact is that eval supports multiline expressions that deal with assignment (whereas query doesn't). For example, to create two new columns "E" and "F" in df1 based on some arithmetic operations on some columns, and a third column "G" based on the previously created "E" and "F", we can do
df1.eval("""
E = A + B
F = @df2.A + @df2.B
G = E >= F
""")
A B C D E F G
0 5 0 3 3 5 14 False
1 7 9 3 5 16 7 True
2 2 4 7 6 6 5 True
3 8 8 1 6 16 9 True
4 7 7 8 1 14 10 True
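The multiline example can be reproduced with the values of df1 and df2 written out explicitly, as in this sketch. It also shows that, without inplace=True, df1 itself is left unchanged. Only columns A and B are included, since the expression uses nothing else.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [5, 7, 2, 8, 7],
                    'B': [0, 9, 4, 8, 7]})
df2 = pd.DataFrame({'A': [5, 4, 5, 8, 3],
                    'B': [9, 3, 0, 1, 7]})

# Bare names refer to columns of df1; @-prefixed names refer to
# variables in the calling namespace
out = df1.eval("""
E = A + B
F = @df2.A + @df2.B
G = E >= F
""")

print(out)
print(list(df1.columns))  # df1 still has only A and B
```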
3) eval vs query
It helps to think of df.query as a function that uses pd.eval as a subroutine.
Typically, query (as the name suggests) is used to evaluate conditional expressions (i.e., expressions that result in True/False values) and return the rows corresponding to the True result. The result of the expression is then passed to loc (in most cases) to return the rows that satisfy the expression. According to the documentation,
The result of the evaluation of this expression is first passed to
DataFrame.loc and if that fails because of a multidimensional key
(e.g., a DataFrame) then the result will be passed to
DataFrame.__getitem__().
This method uses the top-level pandas.eval() function to evaluate the
passed query.
In terms of similarity, query and df.eval are both alike in how they access column names and variables.
The key difference between the two, as mentioned above, is how they handle the expression result. This becomes obvious when you actually run an expression through these two functions. For example, consider
df1.A
0 5
1 7
2 2
3 8
4 7
Name: A, dtype: int32
df1.B
0 0
1 9
2 4
3 8
4 7
Name: B, dtype: int32
To get all rows where "A" >= "B" in df1, we would use eval like this:
m = df1.eval("A >= B")
m
0 True
1 False
2 False
3 True
4 True
dtype: bool
m represents the intermediate result generated by evaluating the expression "A >= B". We then use the mask to filter df1:
df1[m]
# df1.loc[m]
A B C D
0 5 0 3 3
3 8 8 1 6
4 7 7 8 1
However, with query, the intermediate result "m" is directly passed to loc, so with query, you would simply need to do
df1.query("A >= B")
A B C D
0 5 0 3 3
3 8 8 1 6
4 7 7 8 1
Performance wise, it is exactly the same.
df1_big = pd.concat([df1] * 100000, ignore_index=True)
%timeit df1_big[df1_big.eval("A >= B")]
%timeit df1_big.query("A >= B")
14.7 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
14.7 ms ± 24.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But the latter is more concise, and expresses the same operation in a single step.
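The equivalence of the two approaches can be verified directly. This sketch uses the values of df1 written out explicitly; the variable names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [5, 7, 2, 8, 7],
                    'B': [0, 9, 4, 8, 7],
                    'C': [3, 3, 7, 1, 8],
                    'D': [3, 5, 6, 6, 1]})

# Two steps: build the boolean mask, then index with it
mask = df1.eval("A >= B")
filtered_via_mask = df1[mask]

# One step: query evaluates the expression and selects the rows
filtered_via_query = df1.query("A >= B")

print(filtered_via_query)
```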
Note that you can also do weird stuff with query like this (to, say, return all rows indexed by df1.index)
df1.query("index")
# Same as df1.loc[df1.index] # Pointless,... I know
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
3 8 8 1 6
4 7 7 8 1
But don't.
Bottom line: Please use query when querying or filtering rows based on a conditional expression.
There are great tutorials already, but before adopting eval/query for their simpler syntax, bear in mind that they have severe performance issues if your dataset has fewer than 15,000 rows.
In that case, simply use df.loc[mask1, mask2].
Refer to: Expression Evaluation via eval()

Sum of a groupby dataframe not equal to the sum of a dataframe [duplicate]

I have a DataFrame with many missing values in columns which I wish to groupby:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.nan, '6']})
In [4]: df.groupby('b').groups
Out[4]: {'4': [0], '6': [2]}
Notice that pandas has dropped the rows with NaN target values. (I want to include these rows!)
Since I need many such operations (many cols have missing values), and use more complicated functions than just medians (typically random forests), I want to avoid writing too complicated pieces of code.
Any suggestions? Should I write a function for this or is there a simple solution?
pandas >= 1.1
From pandas 1.1 you have better control over this behavior: NA values are now allowed in the grouper using dropna=False:
pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'
# Example from the docs
df
a b c
0 1 2.0 3
1 1 NaN 4
2 2 1.0 3
3 1 2.0 2
# without NA (the default)
df.groupby('b').sum()
a c
b
1.0 2 3
2.0 2 5
# with NA
df.groupby('b', dropna=False).sum()
a c
b
1.0 2 3
2.0 2 5
NaN 1 4
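Applied to data shaped like the question's, the option looks as follows. This is a sketch that assumes pandas 1.1 or later; column a is made numeric here (unlike in the question) so that sum adds numbers rather than concatenating strings.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['4', np.nan, '6']})

# Default behaviour: the row whose key is NaN is dropped
default_sum = df.groupby('b')['a'].sum()

# dropna=False keeps NaN as its own group
with_nan_sum = df.groupby('b', dropna=False)['a'].sum()

print(default_sum)
print(with_nan_sum)
```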
This is mentioned in the Missing Data section of the docs:
NA groups in GroupBy are automatically excluded. This behavior is consistent with R
One workaround is to use a placeholder before doing the groupby (e.g. -1):
In [11]: df.fillna(-1)
Out[11]:
a b
0 1 4
1 2 -1
2 3 6
In [12]: df.fillna(-1).groupby('b').sum()
Out[12]:
a
b
-1 2
4 1
6 3
That said, this feels like a pretty awful hack... perhaps there should be an option to include NaN in groupby (see this github issue - which uses the same placeholder hack).
However, as described in another answer, "from pandas 1.1 you have better control over this behavior, NA values are now allowed in the grouper using dropna=False"
Ancient topic, but if someone still stumbles over this: another workaround is to convert the column to string via .astype(str) before grouping. That will preserve the NaNs.
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.nan, '6']})
df['b'] = df['b'].astype(str)
df.groupby(['b']).sum()
a
b
4 1
6 3
nan 2
I am not able to add a comment to M. Kiewisch since I do not have enough reputation points (only have 41 but need more than 50 to comment).
Anyway, just want to point out that M. Kiewisch solution does not work as is and may need more tweaking. Consider for example
>>> df = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [4, np.nan, 6, 4]})
>>> df
a b
0 1 4.0
1 2 NaN
2 3 6.0
3 5 4.0
>>> df.groupby(['b']).sum()
a
b
4.0 6
6.0 3
>>> df.astype(str).groupby(['b']).sum()
a
b
4.0 15
6.0 3
nan 2
which shows that for group b=4.0, the corresponding value is 15 instead of 6. Here it is just concatenating 1 and 5 as strings instead of adding them as numbers.
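One possible tweak, sketched here as an assumption rather than as the original answer's method, is to convert only the grouping column to string and leave the value column numeric. The name result is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [4, np.nan, 6, 4]})

# Only the key column becomes a string; 'a' stays numeric, so sum adds
result = df.assign(b=df['b'].astype(str)).groupby('b')['a'].sum()

print(result)
```

Note that the NaN key is now the literal string 'nan', which could collide with a real value of 'nan' in the data.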
All answers provided thus far result in potentially dangerous behavior as it is quite possible you select a dummy value that is actually part of the dataset. This is increasingly likely as you create groups with many attributes. Simply put, the approach doesn't always generalize well.
A less hacky solution is to use DataFrame.drop_duplicates() to create a unique index of value combinations, each with its own ID, and then group on that ID. It is more verbose but does get the job done:
def safe_groupby(df, group_cols, agg_dict):
    # set name of group col to unique value
    group_id = 'group_id'
    while group_id in df.columns:
        group_id += 'x'
    # get final order of columns
    agg_col_order = (group_cols + list(agg_dict.keys()))
    # create unique index of grouped values
    group_idx = df[group_cols].drop_duplicates()
    group_idx[group_id] = np.arange(group_idx.shape[0])
    # merge unique index on dataframe
    df = df.merge(group_idx, on=group_cols)
    # group dataframe on group id and aggregate values
    df_agg = df.groupby(group_id, as_index=True)\
               .agg(agg_dict)
    # merge grouped value index to results of aggregation
    df_agg = group_idx.set_index(group_id).join(df_agg)
    # rename index
    df_agg.index.name = None
    # return reordered columns
    return df_agg[agg_col_order]
Note that you can now simply do the following:
from collections import OrderedDict
data_block = [np.tile([None, 'A'], 3),
              np.repeat(['B', 'C'], 3),
              [1] * (2 * 3)]
col_names = ['col_a', 'col_b', 'value']
test_df = pd.DataFrame(data_block, index=col_names).T
grouped_df = safe_groupby(test_df, ['col_a', 'col_b'],
                          OrderedDict([('value', 'sum')]))
This will return the successful result without having to worry about overwriting real data that is mistaken as a dummy value.
One small point to Andy Hayden's solution – it doesn't work (anymore?) because np.nan == np.nan yields False, so the replace function doesn't actually do anything.
What worked for me was this:
df['b'] = df['b'].apply(lambda x: x if not np.isnan(x) else -1)
(At least that's the behavior for Pandas 0.19.2. Sorry to add it as a different answer, I do not have enough reputation to comment.)
I answered this already, but for some reason the answer was converted to a comment. Nevertheless, this is the most efficient solution:
Not being able to include (and propagate) NaNs in groups is quite aggravating. Citing R is not convincing, as this behavior is not consistent with a lot of other things. Anyway, the dummy hack is also pretty bad. However, the size (includes NaNs) and the count (ignores NaNs) of a group will differ if there are NaNs.
dfgrouped = df.groupby(['b']).a.agg(['sum','size','count'])
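# keep the sum only where size == count; other groups become NaN (avoids chained assignment)
dfgrouped['sum'] = dfgrouped['sum'].where(dfgrouped['size'] == dfgrouped['count'])
When these differ, you can set the result of the aggregation function for that group back to a missing value.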
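A self-contained sketch of the size-versus-count idea. The data are made up for illustration: the NaN sits in the aggregated column a, and stats is an illustrative name. Series.where is used here to blank out the affected groups.

```python
import numpy as np
import pandas as pd

# Group 'x' contains a NaN in the aggregated column, group 'y' does not
df = pd.DataFrame({'a': [1, np.nan, 3, 5],
                   'b': ['x', 'x', 'y', 'y']})

stats = df.groupby('b')['a'].agg(['sum', 'size', 'count'])

# size counts all rows, count skips NaN; a mismatch means the group had NaN
stats['sum'] = stats['sum'].where(stats['size'] == stats['count'])

print(stats)
```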

Why doesn't first and last in a groupby give me first and last

I'm posting this because the topic just got brought up in another question/answer and the behavior isn't very well documented.
Consider the dataframe df
df = pd.DataFrame(dict(
A=list('xxxyyy'),
B=[np.nan, 1, 2, 3, 4, np.nan]
))
A B
0 x NaN
1 x 1.0
2 x 2.0
3 y 3.0
4 y 4.0
5 y NaN
I wanted to get the first and last rows of each group defined by column 'A'.
I tried
df.groupby('A').B.agg(['first', 'last'])
first last
A
x 1.0 2.0
y 3.0 4.0
However, this doesn't give me the np.nan values that I expected.
How do I get the actual first and last values in each group?
As noted here by @unutbu:
The groupby.first and groupby.last methods return the first and last non-null values respectively.
To get the actual first and last values, do:
def h(x):
    return x.values[0]
def t(x):
    return x.values[-1]
df.groupby('A').B.agg([h, t])
h t
A
x NaN 2.0
y 3.0 NaN
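The same idea can be written with named aggregation, which lets you choose the output column names. This is a sketch, with actual_first and actual_last as hypothetical helper names; it relies on positional access with iloc, so it returns whatever sits in the first and last row of each group, NaN included.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    A=list('xxxyyy'),
    B=[np.nan, 1, 2, 3, 4, np.nan]
))

def actual_first(s):
    # first row of the group by position, NaN or not
    return s.iloc[0]

def actual_last(s):
    # last row of the group by position, NaN or not
    return s.iloc[-1]

result = df.groupby('A')['B'].agg(first=actual_first, last=actual_last)
print(result)
```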
One option is to use the .nth method:
>>> gb = df.groupby('A')
>>> gb.nth(0)
B
A
x NaN
y 3.0
>>> gb.nth(-1)
B
A
x 2.0
y NaN
>>>
However, I haven't found a way to aggregate them neatly. Of course, one can always use a pd.DataFrame constructor:
>>> pd.DataFrame({'first':gb.B.nth(0), 'last':gb.B.nth(-1)})
first last
A
x NaN 2.0
y 3.0 NaN
Note: I explicitly used the gb.B attribute, or else you have to use .squeeze
