I have a data frame and I'd like to calculate cumulative sum based on 2 conditions:
the 1st is a boolean that is already in the table,
and the 2nd is a fixed threshold against which the cumulative sum is checked.
I've succeeded with either the 1st or the 2nd on its own, but I find it hard to combine both.
For the first one I used groupby:
df['group'] = np.cumsum((df['IsSuccess'] != df['IsSuccess'].shift(1)))
df['SumSale'] = df[['Sale', 'group']].groupby('group').cumsum()
For the 2nd I used frompyfunc:
sumlm = np.frompyfunc(lambda a,b: b if (a+b>5) else a+b, 2, 1)
df['SumSale'] = sumlm.accumulate(df['Sale'], dtype=object)
My df is below, and SumSaleExpected is the result I'm looking for.
df2 = pd.DataFrame({'Sale': [10, 2, 2, 1, 3, 2, 1, 3, 5, 5],
                    'IsSuccess': [False, True, False, False, True, False, True, False, False, False],
                    'SumSaleExpected': [10, 12, 2, 3, 6, 2, 3, 6, 11, 16]})
So to summarize, I'd like to restart the cumulative sum once the running sum goes over 5 and the row's IsSuccess is True. I'd also like to avoid a for loop if possible.
Thank you for your help!
I hope I've understood your question right. This example will subtract the necessary value ("reset") when the cumulative sum of Sale is greater than 5 and IsSuccess==True:
df["SumSale"] = df["Sale"].cumsum()
# "reset" when SumSale>5 and IsSuccess==True
m = df["SumSale"].gt(5) & df["IsSuccess"].eq(True)
df.loc[m, "to_remove"] = df["SumSale"]
df["to_remove"] = df["to_remove"].ffill().shift().fillna(0)
df["SumSale"] -= df["to_remove"]
df = df.drop(columns="to_remove")
print(df)
Prints:
Sale IsSuccess SumSale
0 1 False 1.0
1 2 True 3.0
2 3 False 6.0
3 2 False 8.0
4 4 True 12.0
5 3 False 3.0
6 5 True 8.0
7 5 False 5.0
EDIT:
def fn():
    sale, success = yield
    cum = sale
    while True:
        sale, success = yield cum
        if success and cum > 5:
            cum = sale
        else:
            cum += sale
s = fn()
next(s)
df["ss"] = df["IsSuccess"].shift()
df["SumSale"] = df.apply(lambda x: s.send((x["Sale"], x["ss"])), axis=1)
df = df.drop(columns="ss")
print(df)
Prints:
Sale IsSuccess SumSaleExpected SumSale
0 10 False 10 10
1 2 True 12 12
2 2 False 2 2
3 1 False 3 3
4 3 True 6 6
5 2 False 2 2
6 1 True 3 3
7 3 False 6 6
8 5 False 11 11
9 5 False 16 16
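If you prefer not to go through apply, the same state machine can also be written with itertools.accumulate. This is only a sketch of the same idea, assuming the frame above with the Sale and IsSuccess columns:
from itertools import accumulate

# carry (running_sum, is_success_of_that_row) through the rows;
# the previous row's flag and running sum decide whether to reset
states = accumulate(
    zip(df["Sale"], df["IsSuccess"]),
    lambda prev, cur: (cur[0] if prev[1] and prev[0] > 5 else prev[0] + cur[0], cur[1]),
)
df["SumSale"] = [total for total, _ in states]
This reproduces the SumSale column shown above without needing the helper ss column.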
You can modify your group approach to account for both conditions by taking the cumsum() of the two conditions:
cond1 = df.Sale.cumsum().gt(5).shift().bfill()
cond2 = df.IsSuccess.shift().bfill()
df['group'] = (cond1 & cond2).cumsum()
Now that group accounts for both conditions, you can directly cumsum() within these pseudogroups:
df['SumSale'] = df.groupby('group').Sale.cumsum()
# Sale IsSuccess group SumSale
# 0 1 False 0 1
# 1 2 True 0 3
# 2 3 False 0 6
# 3 2 False 0 8
# 4 4 True 0 12
# 5 3 False 1 3
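For reference, since the commented output above stops at row 5, running the whole snippet on the small example frame used in the first answer (assumed here) should give roughly:
import pandas as pd

df = pd.DataFrame({'Sale': [1, 2, 3, 2, 4, 3, 5, 5],
                   'IsSuccess': [False, True, False, False, True, False, True, False]})
cond1 = df.Sale.cumsum().gt(5).shift().bfill()
cond2 = df.IsSuccess.shift().bfill()
df['group'] = (cond1 & cond2).cumsum()
df['SumSale'] = df.groupby('group').Sale.cumsum()
#    Sale  IsSuccess  group  SumSale
# 0     1      False      0        1
# 1     2       True      0        3
# 2     3      False      0        6
# 3     2      False      0        8
# 4     4       True      0       12
# 5     3      False      1        3
# 6     5       True      1        8
# 7     5      False      2        5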
I have this dataframe:
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'Condition': [False, False, True, False, False, False, False, False, False, False, True, False]})
ID Condition
0 1 False
1 1 False
2 1 True
3 1 False
4 1 False
5 1 False
6 1 False
7 1 False
8 1 False
9 1 False
10 1 True
11 1 False
I want to add a new column Sequence with a sequence of numbers. The condition is that when the first True appears in the Condition column, the following rows must contain the sequence 1, 2, 3, 1, 2, 3... until another True appears, at which point the sequence restarts. Furthermore, ideally, until the first True appears, the values in the new column should be 0. The final result would be:
ID Condition Sequence
0 1 False 0
1 1 False 0
2 1 True 1
3 1 False 2
4 1 False 3
5 1 False 1
6 1 False 2
7 1 False 3
8 1 False 1
9 1 False 2
10 1 True 1
11 1 False 2
I have tried to do it with cumsum and cumcount but I can't find the exact code.
Any suggestion?
Let us do cumsum to identify blocks of rows, then group the dataframe by these blocks and use cumcount to create a sequential counter; with some simple maths we can then get the output:
b = df['Condition'].cumsum()
df['Seq'] = df.groupby(b).cumcount().mod(3).add(1).mask(b < 1, 0)
Explained
Identify blocks/groups of rows using cumsum
b = df['Condition'].cumsum()
print(b)
0 0
1 0
2 1 # -- group 1 start --
3 1
4 1
5 1
6 1
7 1
8 1
9 1 # -- group 1 ended --
10 2
11 2
Name: Condition, dtype: int32
Group the dataframe by the blocks and use cumcount to create a sequential counter per block
c = df.groupby(b).cumcount()
print(c)
0 0
1 1
2 0
3 1
4 2
5 3
6 4
7 5
8 6
9 7
10 0
11 1
dtype: int64
Modulo (%) the sequential counter by 3 and add 1 to create a sequence that repeats every three rows
c = c.mod(3).add(1)
print(c)
0 1
1 2
2 1
3 2
4 3
5 1
6 2
7 3
8 1
9 2
10 1
11 2
dtype: int64
Mask the values in the sequence with 0 where the group identifier (b) is < 1, i.e. before the first True
c = c.mask(b < 1, 0)
print(c)
0 0
1 0
2 1
3 2
4 3
5 1
6 2
7 3
8 1
9 2
10 1
11 2
Result
ID Condition Seq
0 1 False 0
1 1 False 0
2 1 True 1
3 1 False 2
4 1 False 3
5 1 False 1
6 1 False 2
7 1 False 3
8 1 False 1
9 1 False 2
10 1 True 1
11 1 False 2
This was the simplest way I could think of doing it
import pandas as pd
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'Condition': [False, False, True, False, False, False, False, False, False, False, True, False]})
conditions = df.Condition.tolist()
sequence = []
buf = 1
seenTrue = False
for condition in conditions:
    # If a True has been seen in the list, this bool is set to True
    if condition or seenTrue:
        seenTrue = True
        # Checking buffer and resetting back to 1
        if buf % 4 == 0 or condition:
            buf = 1
        sequence.append(buf)
        buf += 1
    # While a True has not been seen yet, 0s are appended.
    if not seenTrue:
        sequence.append(0)
df["Sequence"] = sequence
Effectively looping through and then adding the new column in. The buffer is reset whenever it reaches 4 or when a new True is seen, giving you the looping 1,2,3 effect.
The solution I've come up with is simply looping through the Condition column, adding 0's to the list until you have seen the first True. When you find a True, you set first_true to True and seq_count to 1. After the first True, you keep increasing seq_count until it's larger than 3 or you see a new True; in both cases, you reset seq_count to 1. This gives you the column you were looking for.
import pandas as pd
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'Condition': [False, False, True, False, False, False, False, False, False, False, True, False]})
l = []
seq_count = 0
first_true = False
for index, row in df.iterrows():
    con = row["Condition"]
    if con:
        seq_count = 1
        first_true = True
    elif first_true:
        seq_count += 1
        if seq_count > 3:
            seq_count = 1
    l.append(seq_count)
df["Sequence"] = l
Output:
ID Condition Sequence
0 1 False 0
1 1 False 0
2 1 True 1
3 1 False 2
4 1 False 3
5 1 False 1
6 1 False 2
7 1 False 3
8 1 False 1
9 1 False 2
10 1 True 1
11 1 False 2
If I slice a dataframe with something like
>>> df = pd.DataFrame(data=[[x] for x in [1,2,3,5,1,3,2,1,1,4,5,6]], columns=['A'])
>>> df.loc[df['A'] == 1]
# or
>>> df[df['A'] == 1]
A
0 1
4 1
7 1
8 1
how could I pad my selections by a buffer of 1 and get each of the indices 0, 1, 3, 4, 5, 6, 7, 8, 9? I want to select all rows for which the value in column 'A' is 1, but also any row immediately before or after such a row.
Edit: I'm hoping to figure out a solution that works for arbitrary pad sizes, rather than just for a pad size of 1.
Edit 2: here's another example illustrating what I'm going for:
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,3,2,1,1,4,5,6,0,0,3,1,2,4,5]], columns=['A'])
and we're looking for pad == 2. In this case I'd be trying to fetch rows 0, 1, 2, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16.
You can use shift with bitwise or (|):
c = df['A'] == 1
df[c|c.shift()|c.shift(-1)]
A
0 1
1 2
3 5
4 1
5 3
6 2
7 1
8 1
9 4
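If you want to keep the shift idea but with an arbitrary pad size, one possible generalisation (my own sketch, not part of the original answer; pad is a made-up parameter) is to OR together all shifts from -pad to pad:
from functools import reduce

pad = 2
c = df['A'] == 1
mask = reduce(lambda acc, k: acc | c.shift(k, fill_value=False), range(-pad, pad + 1), c)
df[mask]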
For arbitrary pad sizes, you may try where, interpolate, and notna to create the mask:
n = 2
c = df['A'].where(df['A'] == 1)
m = c.interpolate(limit=n, limit_direction='both').notna()
df[m]
Out[61]:
A
0 1
1 2
2 3
4 3
5 2
6 1
7 1
8 4
9 5
12 0
13 3
14 1
15 2
16 4
Here is an approach that allows for multiple pad levels. Use ffill and bfill on the boolean mask (df['A'] == 1), after converting the False values to np.nan:
import numpy as np
pad = 2
df[(df['A'] == 1).replace(False, np.nan).ffill(limit=pad).bfill(limit=pad).replace(np.nan,False).astype(bool)]
Here it is in action:
def padsearch(df, column, value, pad):
    return df[(df[column] == value).replace(False, np.nan).ffill(limit=pad).bfill(limit=pad).replace(np.nan, False).astype(bool)]
# your first example
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,1,3,2,1,1,4,5,6]], columns=['A'])
print(padsearch(df=df, column='A', value=1, pad=1))
# your other example
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,3,2,1,1,4,5,6,0,0,3,1,2,4,5]], columns=['A'])
print(padsearch(df=df, column='A', value=1, pad=2))
Result:
A
0 1
1 2
3 5
4 1
5 3
6 2
7 1
8 1
9 4
A
0 1
1 2
2 3
4 3
5 2
6 1
7 1
8 4
9 5
12 0
13 3
14 1
15 2
16 4
Granted the command is far less nice, and it's a little clunky to be converting the False to and from null. But it's still using all pandas builtins, so it is fairly quick.
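If the replace round-trip bothers you, a slightly tidier spelling of the same ffill/bfill idea is possible with where and notna. This is just a sketch with the same assumed inputs (padsearch2 is a made-up name):
import pandas as pd

def padsearch2(df, column, value, pad):
    # keep True where the value matches, NaN elsewhere, then pad the mask with ffill/bfill
    mask = df[column].eq(value)
    padded = mask.where(mask).ffill(limit=pad).bfill(limit=pad).notna()
    return df[padded]

df = pd.DataFrame(data=[[x] for x in [1, 2, 3, 5, 1, 3, 2, 1, 1, 4, 5, 6]], columns=['A'])
print(padsearch2(df=df, column='A', value=1, pad=1))  # same rows as the first result above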
I found another solution, but it's not nearly as slick as some of the ones already posted.
# setup
df = ...
pad = 2
# determine set of indicies
indices = set(
    [
        x for x in filter(
            lambda x: x >= 0,
            [
                x + y
                for x in df[df['A'] == 1].index
                for y in range(-pad, pad + 1)
            ]
        )
    ]
)
# fetch rows
df.iloc[[*indices]]
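One small caveat with this approach (my own note, assuming the default RangeIndex): the filter only drops negative positions, so a match near the end of the frame can produce positions past the last row, and a set does not guarantee ordering. A slightly safer final step is:
# clamp to valid positions and keep the original row order
valid = sorted(i for i in indices if 0 <= i < len(df))
df.iloc[valid]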
How can I apply a pandas groupby to columns that are numerical and boolean? I want to sum over the numerical columns and I want the aggregation of the boolean values to be any, that is True if there are any Trues and False if there are only False.
Performing a sum aggregation will give the desired result, as long as you cast the boolean column back to a boolean type afterwards. Example:
df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
                   'bool': [True, False, True, True, False, False],
                   'c': [10, 10, 15, 15, 20, 20]})
id bool c
0 1 True 10
1 1 False 10
2 2 True 15
3 2 True 15
4 3 False 20
5 3 False 20
df.groupby('id').sum()
bool c
id
1 1.0 20
2 2.0 30
3 0.0 40
As you can see, when applying the sum, True is cast as 1 and False as 0, so a nonzero sum means the group contained at least one True. This effectively acts as the desired any operation. Casting the aggregated column back to boolean:
out = df.groupby('id').sum()
out['bool'] = out['bool'].astype(bool)
out
     bool   c
id
1    True  20
2    True  30
3   False  40
You can choose the functions you aggregate each column with, as follows:
df.groupby("id").agg({
"bool":lambda arr: any(arr),
"c":sum,
})
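If you also want explicit output column names, named aggregation (available in pandas 0.25+) is an equivalent way to spell the same thing; the names any_true and c_sum below are made up:
out = df.groupby("id").agg(any_true=("bool", "any"), c_sum=("c", "sum"))
#     any_true  c_sum
# id
# 1       True     20
# 2       True     30
# 3      False     40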
I have a data frame with categories and values. I need to find the value in each category closest to a value. I think I'm close but I can't really get the right output when applying the results of argsort to the original dataframe.
For example, if the input were defined as in the code below, the output should have only (a, 1, True), (b, 2, True), (c, 2, True), and all other isClosest values should be False.
If multiple values are equally close, then the first value listed should be the one marked.
Here is the code I have which works but I can't get it to reapply to the dataframe correctly. I would love some pointers.
df = pd.DataFrame()
df['category'] = ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a']
df['values'] = [1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
df['isClosest'] = False
uniqueCategories = df['category'].unique()
for c in uniqueCategories:
    filteredCategories = df[df['category'] == c]
    sortargs = (filteredCategories['values'] - 2.0).abs().argsort()
    # how do I use sortargs to set isClosest=True in df for the value closest to 2.0 in each category?
You can create a column of absolute differences:
df['dif'] = (df['values'] - 2).abs()
df
Out:
category values dif
0 a 1 1
1 b 2 0
2 b 3 1
3 b 4 2
4 c 5 3
5 a 4 2
6 b 3 1
7 c 2 0
8 c 1 1
9 a 0 2
And then use groupby.transform to check whether the minimum value of each group is equal to the difference you calculated:
df['is_closest'] = df.groupby('category')['dif'].transform('min') == df['dif']
df
Out:
category values dif is_closest
0 a 1 1 True
1 b 2 0 True
2 b 3 1 False
3 b 4 2 False
4 c 5 3 False
5 a 4 2 False
6 b 3 1 False
7 c 2 0 True
8 c 1 1 False
9 a 0 2 False
df.groupby('category')['dif'].idxmin() would also give you the indices of the closest values for each category. You can use that for mapping too.
For selection:
df.loc[df.groupby('category')['dif'].idxmin()]
Out:
category values dif
0 a 1 1
1 b 2 0
7 c 2 0
For assignment:
df['is_closest'] = False
df.loc[df.groupby('category')['dif'].idxmin(), 'is_closest'] = True
df
Out:
category values dif is_closest
0 a 1 1 True
1 b 2 0 True
2 b 3 1 False
3 b 4 2 False
4 c 5 3 False
5 a 4 2 False
6 b 3 1 False
7 c 2 0 True
8 c 1 1 False
9 a 0 2 False
The difference between these approaches is that if you check equality against the difference, you would get True for all rows in case of ties. However, with idxmin it will return True for the first occurrence (only one for each group).
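To make the tie behaviour concrete, here is a small made-up example where both rows of a category are equally close to 2.0:
import pandas as pd

tie = pd.DataFrame({'category': ['a', 'a'], 'values': [1, 3]})
tie['dif'] = (tie['values'] - 2).abs()  # both rows are at distance 1

# transform flags every tied row
print(tie.groupby('category')['dif'].transform('min') == tie['dif'])   # True, True

# idxmin flags only the first occurrence
print(tie.index.isin(tie.groupby('category')['dif'].idxmin()))         # True, False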
Solution with DataFrameGroupBy.idxmin - get the indexes of the minimal values per group, then assign a boolean mask built with Index.isin to the column isClosest:
idx = (df['values'] - 2).abs().groupby([df['category']]).idxmin()
print (idx)
category
a 0
b 1
c 7
Name: values, dtype: int64
df['isClosest'] = df.index.isin(idx)
print (df)
category values isClosest
0 a 1 True
1 b 2 True
2 b 3 False
3 b 4 False
4 c 5 False
5 a 4 False
6 b 3 False
7 c 2 True
8 c 1 False
9 a 0 False
Given a Pandas dataframe:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'C': [11, 12, 13, 14, 15]})
A B C
0 1 0.1 11
1 2 0.2 12
2 3 0.3 13
3 4 0.4 14
4 5 0.5 15
For all of the columns where the range of values is between 0 and 1, I'd like to multiply all values in those columns by a constant (say, 100). I don't know a priori which columns have values between 0 and 1 and there are 100+ columns.
A B C
0 1 10 11
1 2 20 12
2 3 30 13
3 4 40 14
4 5 50 15
I've tried using .min() and .max() and compared them to the desired range to return True/False values for each column.
(df.min() >= 0) & (df.max() <= 1)
A False
B True
C False
but it isn't obvious how to then select the True columns and multiply those values by 100.
Update
I came up with this solution instead: build a boolean mask of the qualifying columns, select those columns, and multiply them:
mask = (df.min() >= 0) & (df.max() <= 1)
col_names = df.columns[mask]
df[col_names] = df[col_names] * 100
Something like this?
to_multiply = [col for col in df if 1 >= min(df[col]) >= 0 and 1 >= max(df[col]) >= 0]
df[to_multiply] = df[to_multiply] * 100
We can construct a boolean mask that tests whether the values in the df are greater than (gt) 0 and less than (lt) 1, then call np.all with axis=0 to reduce it to a per-column mask, use that to select the columns, and multiply all values in those columns by 100:
In [58]:
df[df.columns[np.all(df.gt(0) & df.lt(1),axis=0)]] *= 100
df
Out[58]:
A B C
0 1 10 11
1 2 20 12
2 3 30 13
3 4 40 14
4 5 50 15
Breaking the above down:
In [61]:
df.gt(0) & df.lt(1)
Out[61]:
A B C
0 False True False
1 False True False
2 False True False
3 False True False
4 False True False
In [62]:
np.all(df.gt(0) & df.lt(1),axis=0)
Out[62]:
array([False, True, False], dtype=bool)
In [63]:
df.columns[np.all(df.gt(0) & df.lt(1),axis=0)]
Out[63]:
Index(['B'], dtype='object')
You can update your DataFrame based on your selection criteria:
df.update(df.loc[:, (df.ge(0).all() & df.le(1).all())].mul(100))
>>> df
A B C
0 1 10 11
1 2 20 12
2 3 30 13
3 4 40 14
4 5 50 15
Any column which is greater than or equal to zero and less than or equal to one is multiplied by 100.
Other comparison operators:
.ge (greater than or equal to)
.gt (greater than)
.le (less than or equal to)
.lt (less than)
.eq (equals)
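For example, the column-wise test used in the update above, evaluated on the original frame (before the multiplication), would look like this:
>>> df.ge(0).all() & df.le(1).all()
A    False
B     True
C    False
dtype: bool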
Use .all() to check whether all values in a column are within range and, if true, multiply them:
In [1877]: paste
for col in df.columns:
    if (0 < df[col]).all() and (df[col] < 1).all():
        df[col] = df[col] * 100
## -- End pasted text --
In [1878]: df
Out[1878]:
A B C
0 1 10 11
1 2 20 12
2 3 30 13
3 4 40 14
4 5 50 15