Probably a duplicate, but I have now spent too much time googling this without any luck. Assume I have a data frame:
import pandas as pd
data = {"letters": ["a", "a", "a", "b", "b", "b"],
"boolean": [True, True, True, True, True, False],
"numbers": [1, 2, 3, 1, 2, 3]}
df = pd.DataFrame(data)
df
I want to 1) group by letters, 2) take the mean of numbers if all values in boolean have the same value. In R I would write:
library(dplyr)
df %>%
  group_by(letters) %>%
  mutate(
    condition = n_distinct(boolean) == 1,
    numbers = ifelse(condition, mean(numbers), numbers)
  ) %>%
  select(-condition)
This would result in the following output:
# A tibble: 6 x 3
# Groups: letters [2]
letters boolean numbers
<chr> <lgl> <dbl>
1 a TRUE 2
2 a TRUE 2
3 a TRUE 2
4 b TRUE 1
5 b TRUE 2
6 b FALSE 3
How would you do it using Python pandas?
We can use lazy groupby and transform:
g = df.groupby('letters')
df.loc[g['boolean'].transform('all'), 'numbers'] = g['numbers'].transform('mean')
Output:
letters boolean numbers
0 a True 2
1 a True 2
2 a True 2
3 b True 1
4 b True 2
5 b False 3
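For reference, the two transform calls return Series aligned with df's original index, which is what lets the .loc assignment line up row by row. With this data:
g['boolean'].transform('all')   # [True, True, True, False, False, False]
g['numbers'].transform('mean')  # [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
so only the three "a" rows are overwritten with their group mean.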
Another way would be to use np.where: where a group has only one unique boolean value, take the group mean; where it doesn't, keep the original numbers. Note that np.where combines the float means with the integer column, so numbers becomes a float column. Code below:
import numpy as np

df['numbers'] = np.where(df.groupby('letters')['boolean'].transform('nunique') == 1,
                         df.groupby('letters')['numbers'].transform('mean'), df['numbers'])
letters boolean numbers
0 a True 2.0
1 a True 2.0
2 a True 2.0
3 b True 1.0
4 b True 2.0
5 b False 3.0
Alternatively, mask out the rows where the condition does not apply before computing the mean:
m = df.groupby('letters')['boolean'].transform('nunique') == 1
df.loc[m, 'numbers'] = df[m].groupby('letters')['numbers'].transform('mean')
Since you are comparing directly to R, I would prefer to use siuba rather than pandas:
from siuba import mutate, if_else, _, select, group_by, ungroup
df1 = df >> \
      group_by(_.letters) >> \
      mutate(condition = _.boolean.unique().size == 1,
             numbers = if_else(_.condition, _.numbers.mean(), _.numbers)
             ) >> \
      ungroup() >> select(-_.condition)
print(df1)
letters boolean numbers
0 a True 2.0
1 a True 2.0
2 a True 2.0
3 b True 1.0
4 b True 2.0
5 b False 3.0
Note that >> is the pipe; I added \ to continue on the next line. Also note that to refer to a column you use _.variable.
EDIT
It seems your R code has an issue. In R, you should rather use condition = all(boolean) instead of the code you have; that translates to condition = boolean.all() in Python.
datar is another solution for you:
>>> import pandas as pd
>>> data = {"letters": ["a", "a", "a", "b", "b", "b"],
... "boolean": [True, True, True, True, True, False],
... "numbers": [1, 2, 3, 1, 2, 3]}
>>> df = pd.DataFrame(data)
>>>
>>> from datar.all import f, group_by, mutate, n_distinct, if_else, mean, select
>>> df >> group_by(f.letters) \
... >> mutate(
... condition=n_distinct(f.boolean) == 1,
... numbers = if_else(f.condition, mean(f.numbers), f.numbers)
... ) \
... >> select(~f.condition)
letters boolean numbers
<object> <bool> <float64>
0 a True 2.0
1 a True 2.0
2 a True 2.0
3 b True 1.0
4 b True 2.0
5 b False 3.0
[Groups: letters (n=2)]
Related
Expected result table:
bool count
0 FALSE
1 FALSE
2 TRUE 0
3 FALSE 1
4 FALSE 2
5 FALSE 3
6 TRUE 0
7 FALSE 1
8 TRUE 0
9 TRUE 0
How can I calculate the value of the 'count' column?
Here you go:
import numpy as np
import pandas as pd

# create bool dataframe
df = pd.DataFrame(dict(bool_=[0, 0, 1, 0, 0, 1, 1, 0, 0, 0]), dtype=bool)
df.index = list("abcdefghij")
# create a new Series of unique integers to associate a group with the rows
# between True values
ix = pd.Series(range(df.shape[0])).where(df.bool_.values, np.nan).ffill().values
# if the first rows are False, they will be NaNs and shouldn't be
# counted so only perform groupby and cumcount() for what is notna
notna = pd.notna(ix)
df["count"] = df[notna].groupby(ix[notna]).cumcount()
>>> df
bool_ count
a False NaN
b False NaN
c True 0.0
d False 1.0
e False 2.0
f True 0.0
g True 0.0
h False 1.0
i False 2.0
j False 3.0
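For reference, with this data the intermediate ix grouper produced above looks like this:
ix
# array([nan, nan,  2.,  2.,  2.,  5.,  6.,  6.,  6.,  6.])
Rows before the first True stay NaN and are therefore excluded from the groupby, which is why their count is NaN.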
Use a GroupBy.cumcount and mask with where:
g = df['bool'].cumsum()
df['count'] = df['bool'].groupby(g).cumcount().where(g.gt(0))
Alternative:
g = df['bool'].cumsum()
df['count'] = (df['bool'].groupby(g).cumcount()
.where(df['bool'].cummax())
)
Output:
bool count
0 False NaN
1 False NaN
2 True 0.0
3 False 1.0
4 False 2.0
5 True 0.0
6 True 0.0
7 False 1.0
8 False 2.0
9 False 3.0
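For reference, with the input shown above the grouper is the running count of True values:
g = df['bool'].cumsum()
# g: [0, 0, 1, 1, 1, 2, 3, 3, 3, 3]  -- each True starts a new block
cumcount() restarts at 0 inside each block, and g.gt(0) (or the cummax of bool) masks the rows before the first True, which is why they come out as NaN.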
You can try grouping by the cumsum of the bool column, then transform with a custom function that checks whether the first element in each group is True:
df['m'] = df['bool'].cumsum()
df['out'] = (df.groupby(df['m'])['bool']
               .transform(lambda col: range(len(col)) if col.iloc[0] else [pd.NA]*len(col)))
print(df)
bool count m out
0 False NaN 0 <NA>
1 False NaN 0 <NA>
2 True 0.0 1 0
3 False 1.0 1 1
4 False 2.0 1 2
5 False 3.0 1 3
6 True 0.0 2 0
7 False 1.0 2 1
8 True 0.0 3 0
9 True 0.0 4 0
I think your question is not clear.
We need a little more context and objectives to work with here.
Let's assume that you have a dataframe of Boolean values [True, False], and you wish to count how many are True and how many are False.
import pandas as pd
import random

## Randomly generating Boolean values to populate a dataframe
choices = [True, False]
df = pd.DataFrame(index=range(10), columns=['boolean'])
df['boolean'] = df['boolean'].apply(lambda x: random.choice(choices))
Randomly generated data
boolean
0 False
1 False
2 False
3 True
4 False
5 False
6 False
7 True
8 False
9 False
## Reporting the count of True and False values
results = df.groupby('boolean').size()
print(results)
Results
boolean
False 8
True 2
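If only those totals are needed, value_counts is a shorter equivalent (same idea, different entry point):
print(df['boolean'].value_counts())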
If you want to obtain the count without relying on pandas, you can try this.
import numpy as np

result = []
count = np.nan  # stays NaN until the first True is seen
for i in df['bool']:
    if i:
        count = 0
    else:
        count += 1  # NaN + 1 is still NaN before the first True
    result.append(count)
result
Out[4]: [nan, nan, 0, 1, 2, 3, 0, 1, 0, 0]
df['count'] = result
If you mean the sum of all the elements in count then you can do it this way:
Count_Total = df['count'].sum()
I have a data frame and I'd like to calculate cumulative sum based on 2 conditions:
the 1st is a boolean that is already in the table,
and the 2nd is a fixed threshold on the cumulative sum itself.
I've succeeded with the 1st or the 2nd separately, but I find it hard to combine both.
For the first one I used groupby:
df['group'] = np.cumsum((df['IsSuccess'] != df['IsSuccess'].shift(1)))
df['SumSale'] = df[['Sale', 'group']].groupby('group').cumsum()
For the 2nd I used frompyfunc:
sumlm = np.frompyfunc(lambda a,b: b if (a+b>5) else a+b, 2, 1)
df['SumSale'] = sumlm.accumulate(df['Sale'], dtype=object)
My df is below, and SumSaleExpected is the result I'm looking for:
df2 = pd.DataFrame({'Sale': [10, 2, 2, 1, 3, 2, 1, 3, 5, 5],
'IsSuccess': [False, True, False, False, True, False, True, False, False, False],
'SumSaleExpected': [10, 12, 2, 3, 6, 2, 3, 6, 11, 16]})
So, to summarize, I'd like to restart the cumulative sum once that sum is over 5 and the row's IsSuccess is True. I'd like to avoid a for loop if possible as well.
Thank you for help!
I hope I've understood your question right. This example will subtract the necessary value ("reset") when the cumulative sum of Sale is greater than 5 and IsSuccess == True:
df["SumSale"] = df["Sale"].cumsum()
# "reset" when SumSale>5 and IsSuccess==True
m = df["SumSale"].gt(5) & df["IsSuccess"].eq(True)
df.loc[m, "to_remove"] = df["SumSale"]
df["to_remove"] = df["to_remove"].ffill().shift().fillna(0)
df["SumSale"] -= df["to_remove"]
df = df.drop(columns="to_remove")
print(df)
Prints:
Sale IsSuccess SumSale
0 1 False 1.0
1 2 True 3.0
2 3 False 6.0
3 2 False 8.0
4 4 True 12.0
5 3 False 3.0
6 5 True 8.0
7 5 False 5.0
EDIT: a stateful generator fed row by row via apply; the running total resets when the previous row was a success and the total so far already exceeds 5:
def fn():
sale, success = yield
cum = sale
while True:
sale, success = yield cum
if success and cum > 5:
cum = sale
else:
cum += sale
s = fn()
next(s)
df["ss"] = df["IsSuccess"].shift()
df["SumSale"] = df.apply(lambda x: s.send((x["Sale"], x["ss"])), axis=1)
df = df.drop(columns="ss")
print(df)
Prints:
Sale IsSuccess SumSaleExpected SumSale
0 10 False 10 10
1 2 True 12 12
2 2 False 2 2
3 1 False 3 3
4 3 True 6 6
5 2 False 2 2
6 1 True 3 3
7 3 False 6 6
8 5 False 11 11
9 5 False 16 16
You can modify your group approach to account for both conditions by taking the cumsum() of the two conditions:
cond1 = df.Sale.cumsum().gt(5).shift().bfill()
cond2 = df.IsSuccess.shift().bfill()
df['group'] = (cond1 & cond2).cumsum()
Now that group accounts for both conditions, you can directly cumsum() within these pseudogroups:
df['SumSale'] = df.groupby('group').Sale.cumsum()
# Sale IsSuccess group SumSale
# 0 1 False 0 1
# 1 2 True 0 3
# 2 3 False 0 6
# 3 2 False 0 8
# 4 4 True 0 12
# 5 3 False 1 3
How can I apply a pandas groupby to columns that are numerical and boolean? I want to sum over the numerical columns and I want the aggregation of the boolean values to be any, that is True if there are any Trues and False if there are only False.
Performing a sum aggregation will give the desired result as long as you cast the boolean columns back to boolean types. Example
df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
'bool': [True, False, True, True, False, False],
'c': [10, 10, 15, 15, 20, 20]})
id bool c
0 1 True 10
1 1 False 10
2 2 True 15
3 2 True 15
4 3 False 20
5 3 False 20
df.groupby('id').sum()
bool c
id
1 1.0 20
2 2.0 30
3 0.0 40
As you can see, when applying the sum, True is cast to 1 and False to zero, so a nonzero sum effectively acts as the desired any operation. Casting the aggregated column back to boolean:
out = df.groupby('id').sum()
out['bool'] = out['bool'].astype(bool)
     bool   c
id
1    True  20
2    True  30
3   False  40
You can also choose which function each column is aggregated with:
df.groupby("id").agg({
"bool":lambda arr: any(arr),
"c":sum,
})
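Equivalently, pandas accepts string aliases for the aggregation functions, which keeps the bool column boolean without any extra cast (a minor variation on the same idea):
df.groupby("id").agg({"bool": "any", "c": "sum"})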
I'm using python and pandas to work on some data.
My data looks like the following:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [True, False, True, True, False, True]})
print(df)
A B C
0 foo 1 True
1 bar 2 False
2 foo 3 True
3 bar 4 True
4 foo 5 False
5 bar 6 True
What I would like to do:
Group by "A"
Select the values of B by group where C == True
Calculate the mean of this selection
Create a new column "D" to store these values
So the result would be:
A B C D
0 foo 1 True 2
1 bar 2 False 5
2 foo 3 True 2
3 bar 4 True 5
4 foo 5 False 2
5 bar 6 True 5
I have tried some mixes of groupby, filter and transform, but I cannot manage to make it work.
I imagine the solution is close to one of the following:
df.groupby(["A"])[df.loc[df["C"] == True, "B"]].transform("mean")
or
df.groupby(["A"]).filter(lambda x: x["D"] == True)["B"].transform("mean")
But none of these syntax work.
Thanks for helping me and people in general,
Use Series.map with the means of the filtered rows; the == True comparison should be omitted:
df['D'] = df['A'].map(df.loc[df.C, 'B'].groupby(df["A"]).mean())
print(df)
A B C D
0 foo 1 True 2
1 bar 2 False 5
2 foo 3 True 2
3 bar 4 True 5
4 foo 5 False 2
5 bar 6 True 5
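An alternative sketch, staying inside a single grouped transform: blank out B where C is False with where, then take the grouped mean (NaNs are ignored), assuming the same df as above:
# rows where C is False become NaN and are ignored by the group mean
df['D'] = df['B'].where(df['C']).groupby(df['A']).transform('mean')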
Problem
I have a dataframe that looks like this:
Key Var ID_1 Var_1 ID_2 Var_2 ID_3 Var_3
1 True 1.0 True NaN NaN 5.0 True
2 True NaN NaN 4.0 False 7.0 True
3 False 2.0 False 5.0 True NaN NaN
Each row has exactly 2 non-null sets of data (ID/Var), and the remaining third is guaranteed to be null. What I want to do is "condense" the dataframe by removing the missing elements.
Desired Output
Key Var First_ID First_Var Second_ID Second_Var
1 True 1 True 5 True
2 True 4 False 7 True
3 False 2 False 5 True
The ordering is not important, so long as the Id/Var pairs are maintained.
Current Solution
Below is a working solution that I have:
import pandas as pd
import numpy as np
data = pd.DataFrame({'Key': [1, 2, 3], 'Var': [True, True, False],
                     'ID_1': [1, np.NaN, 2], 'Var_1': [True, np.NaN, False],
                     'ID_2': [np.NaN, 4, 5], 'Var_2': [np.NaN, False, True],
                     'ID_3': [5, 7, np.NaN], 'Var_3': [True, True, np.NaN]})
sorted_columns = ['Key', 'Var', 'ID_1', 'Var_1', 'ID_2', 'Var_2', 'ID_3', 'Var_3']
data = data[sorted_columns]
output = np.empty(shape=[data.shape[0], 6], dtype=str)
for i, *row in data.itertuples():
output[i] = [element for element in row if np.isfinite(element)]
print(output)
[['1' 'T' '1' 'T' '5' 'T']
['2' 'T' '4' 'F' '7' 'T']
['3' 'F' '2' 'F' '5' 'T']]
This is acceptable, but not ideal. I can live with not having the column names, but my big issue is having to cast the data inside the array into a string in order to avoid my booleans being converted to numeric.
Are there other solutions that do a better job at preserving the data? Bonus points if the result is a pandas dataframe.
There is one simple solution: push the NaNs to the right in each row and then drop the all-NaN columns along axis 1.
ndf = data.apply(lambda x: sorted(x, key=pd.isnull), axis=1).dropna(axis=1)
Output:
Key Var ID_1 Var_1 ID_2 Var_2
0 1 True 1 True 5 True
1 2 True 4 False 7 True
2 3 False 2 False 5 True
Hope it helps.
A NumPy solution from Divakar, for roughly 10x the speed:
def mask_app(a):
    out = np.full(a.shape, np.nan, dtype=a.dtype)
    mask = ~np.isnan(a.astype(float))
    out[np.sort(mask, 1)[:, ::-1]] = a[mask]
    return out

ndf = pd.DataFrame(mask_app(data.values), columns=data.columns).dropna(axis=1)
Key Var ID_1 Var_1 ID_2 Var_2
0 1 True 1 True 5 True
1 2 True 4 False 7 True
2 3 False 2 False 5 True
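If the exact column names from the desired output matter, they can simply be assigned afterwards (names taken from the question):
ndf.columns = ['Key', 'Var', 'First_ID', 'First_Var', 'Second_ID', 'Second_Var']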