I have this dataframe:
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'Condition': [False, False, True, False, False, False, False, False, False, False, True, False]})
ID Condition
0 1 False
1 1 False
2 1 True
3 1 False
4 1 False
5 1 False
6 1 False
7 1 False
8 1 False
9 1 False
10 1 True
11 1 False
I want to add a new column, Sequence, with a sequence of numbers. When the first True appears in the Condition column, that row and the following rows must contain the sequence 1, 2, 3, 1, 2, 3... until another True appears, at which point the sequence restarts. Furthermore, ideally, until the first True appears, the values in the new column should be 0. The final result would be:
ID Condition Sequence
0 1 False 0
1 1 False 0
2 1 True 1
3 1 False 2
4 1 False 3
5 1 False 1
6 1 False 2
7 1 False 3
8 1 False 1
9 1 False 2
10 1 True 1
11 1 False 2
I have tried to do it with cumsum and cumcount but I can't find the exact code.
Any suggestion?
Let us do a cumsum to identify blocks of rows, then group the dataframe by those blocks and use cumcount to create a sequential counter; with some simple maths we then get the output:
b = df['Condition'].cumsum()
df['Seq'] = df.groupby(b).cumcount().mod(3).add(1).mask(b < 1, 0)
Explained
Identify blocks/groups of rows using cumsum
b = df['Condition'].cumsum()
print(b)
0 0
1 0
2 1 # -- group 1 start --
3 1
4 1
5 1
6 1
7 1
8 1
9 1 # -- group 1 ended --
10 2
11 2
Name: Condition, dtype: int32
Group the dataframe by the blocks and use cumcount to create a sequential counter per block
c = df.groupby(b).cumcount()
print(c)
0 0
1 1
2 0
3 1
4 2
5 3
6 4
7 5
8 6
9 7
10 0
11 1
dtype: int64
Take the counter modulo 3 and add 1 to create a 1, 2, 3 sequence that repeats every three rows within each block
c = c.mod(3).add(1)
print(c)
0 1
1 2
2 1
3 2
4 3
5 1
6 2
7 3
8 1
9 2
10 1
11 2
dtype: int64
Mask the sequence with 0 where the block id (b) is still 0, i.e. before the first True appears
c = c.mask(b < 1, 0)
print(c)
0 0
1 0
2 1
3 2
4 3
5 1
6 2
7 3
8 1
9 2
10 1
11 2
Result
ID Condition Seq
0 1 False 0
1 1 False 0
2 1 True 1
3 1 False 2
4 1 False 3
5 1 False 1
6 1 False 2
7 1 False 3
8 1 False 1
9 1 False 2
10 1 True 1
11 1 False 2
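Note: if you ever need a different cycle length, only the modulus changes. A quick sketch (k is a hypothetical parameter name, not part of the original):

k = 3  # cycle length; the answer above repeats 1, 2, 3
b = df['Condition'].cumsum()
df['Seq'] = df.groupby(b).cumcount().mod(k).add(1).mask(b < 1, 0)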
This was the simplest way I could think of doing it:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'Condition': [False, False, True, False, False, False, False, False, False, False, True, False]})

conditions = df.Condition.tolist()
sequence = []
buf = 1
seenTrue = False
for condition in conditions:
    # Once a True has been seen in the list, this flag stays True
    if condition or seenTrue:
        seenTrue = True
        # Reset the buffer back to 1 when it passes 3 or a new True appears
        if buf % 4 == 0 or condition:
            buf = 1
        sequence.append(buf)
        buf += 1
    # While no True has been seen yet, append 0s
    if not seenTrue:
        sequence.append(0)
df["Sequence"] = sequence
Effectively looping through and then adding the new column in. The buffer is reset whenever it reaches 4 or when a new True is seen, giving you the looping 1,2,3 effect.
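For what it's worth, the same idea can be written with itertools.cycle, restarting a fresh 1-2-3 cycle at every True. A sketch (Sequence2 is a hypothetical column name, just to compare against the loop above):

from itertools import cycle

seq, cyc = [], None
for cond in df.Condition:
    if cond:
        cyc = cycle([1, 2, 3])  # restart the cycle on every True
    seq.append(next(cyc) if cyc is not None else 0)  # 0 until the first True
df["Sequence2"] = seq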
The solution I've come up with simply loops through the Condition column, adding 0s to the list until you have seen the first True. When you find a True, you set first_true to True and seq_count to 1. After the first True, you keep increasing seq_count until it is larger than 3 or you see a new True; in both cases, you reset seq_count to 1. This gives you the column you were looking for.
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'Condition': [False, False, True, False, False, False, False, False, False, False, True, False]})

l = []
seq_count = 0
first_true = False
for index, row in df.iterrows():
    con = row["Condition"]
    if con:
        seq_count = 1
        first_true = True
    elif first_true:
        seq_count += 1
        if seq_count > 3:
            seq_count = 1
    l.append(seq_count)
df["Sequence"] = l
Output:
ID Condition Sequence
0 1 False 0
1 1 False 0
2 1 True 1
3 1 False 2
4 1 False 3
5 1 False 1
6 1 False 2
7 1 False 3
8 1 False 1
9 1 False 2
10 1 True 1
11 1 False 2
Related
I have a data frame and I'd like to calculate a cumulative sum based on 2 conditions:
the 1st is a boolean already in the table,
and the 2nd is a fixed threshold that checks the cumulative sum itself.
I've succeeded with the 1st or the 2nd alone, but I find it hard to combine both.
For the first one I used groupby
df['group'] = np.cumsum((df['IsSuccess'] != df['IsSuccess'].shift(1)))
df['SumSale'] = df[['Sale', 'group']].groupby('group').cumsum()
For the 2nd frompyfunc
sumlm = np.frompyfunc(lambda a,b: b if (a+b>5) else a+b, 2, 1)
df['SumSale'] = sumlm.accumulate(df['Sale'], dtype=object)
My df is below; the SumSaleExpected column is the result I'm looking for.
df2 = pd.DataFrame({'Sale': [10, 2, 2, 1, 3, 2, 1, 3, 5, 5],
'IsSuccess': [False, True, False, False, True, False, True, False, False, False],
'SumSaleExpected': [10, 12, 2, 3, 6, 2, 3, 6, 11, 16]})
So to summarize: I'd like the cumulative sum to restart once it goes over 5 and the row's IsSuccess is True. I'd like to avoid a for loop if possible as well.
Thank you for help!
I hope I've understood your question right. This example will subtract the necessary value ("reset") when the cumulative sum of Sale is greater than 5 and IsSuccess==True:
df["SumSale"] = df["Sale"].cumsum()
# "reset" when SumSale>5 and IsSuccess==True
m = df["SumSale"].gt(5) & df["IsSuccess"].eq(True)
df.loc[m, "to_remove"] = df["SumSale"]
df["to_remove"] = df["to_remove"].ffill().shift().fillna(0)
df["SumSale"] -= df["to_remove"]
df = df.drop(columns="to_remove")
print(df)
Prints:
Sale IsSuccess SumSale
0 1 False 1.0
1 2 True 3.0
2 3 False 6.0
3 2 False 8.0
4 4 True 12.0
5 3 False 3.0
6 5 True 8.0
7 5 False 5.0
EDIT:
def fn():
    sale, success = yield
    cum = sale
    while True:
        sale, success = yield cum
        if success and cum > 5:
            cum = sale
        else:
            cum += sale

s = fn()
next(s)

df["ss"] = df["IsSuccess"].shift()
df["SumSale"] = df.apply(lambda x: s.send((x["Sale"], x["ss"])), axis=1)
df = df.drop(columns="ss")
print(df)
Prints:
Sale IsSuccess SumSaleExpected SumSale
0 10 False 10 10
1 2 True 12 12
2 2 False 2 2
3 1 False 3 3
4 3 True 6 6
5 2 False 2 2
6 1 True 3 3
7 3 False 6 6
8 5 False 11 11
9 5 False 16 16
You can modify your group approach to account for both conditions by taking the cumsum() of the two conditions:
cond1 = df.Sale.cumsum().gt(5).shift().bfill()
cond2 = df.IsSuccess.shift().bfill()
df['group'] = (cond1 & cond2).cumsum()
Now that group accounts for both conditions, you can directly cumsum() within these pseudogroups:
df['SumSale'] = df.groupby('group').Sale.cumsum()
# Sale IsSuccess group SumSale
# 0 1 False 0 1
# 1 2 True 0 3
# 2 3 False 0 6
# 3 2 False 0 8
# 4 4 True 0 12
# 5 3 False 1 3
I am attempting to create a matrix of 1s if every 2nd column value is greater than the previous column value and 0s if less. When I use np.where it just flattens the result; I want to keep the first column, the last column, and the shape.
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
newdf = pd.DataFrame()
for x in df.columns[1::2]:
    if bool(df.iloc[:, df.columns.get_loc(x)] <=
            df.iloc[:, df.columns.get_loc(x) - 1]):
        newdf.append(1)
    else:
        newdf.append(0)
This question was a little vague, but I will answer a question that I think gets at the heart of what you are asking:
Say you start with a matrix:
df1 = pd.DataFrame(np.random.randn(8, 4),columns=['A', 'B', 'C', 'D'])
Which creates:
A B C D
0 2.464130 0.796172 -1.406528 0.332499
1 -0.370764 -0.185119 -0.514149 0.158218
2 -2.164707 0.888354 0.214550 1.334445
3 2.019189 0.910855 0.582508 -0.861778
4 1.574337 -1.063037 0.771726 -0.196721
5 1.091648 0.407703 0.406509 -1.052855
6 -1.587963 -1.730850 0.168353 -0.899848
7 0.225723 0.042629 2.152307 -1.086585
Now you can use DataFrame.shift() to shift the entire matrix and then check the shifted values against the originals element-wise in one step. For example:
df1.shift(-1)
Creates:
A B C D
0 -0.370764 -0.185119 -0.514149 0.158218
1 -2.164707 0.888354 0.214550 1.334445
2 2.019189 0.910855 0.582508 -0.861778
3 1.574337 -1.063037 0.771726 -0.196721
4 1.091648 0.407703 0.406509 -1.052855
5 -1.587963 -1.730850 0.168353 -0.899848
6 0.225723 0.042629 2.152307 -1.086585
7 NaN NaN NaN NaN
And now you can build a new matrix that checks the shifted values against the originals, like so:
df2 = df1.shift(-1) > df1
which returns:
A B C D
0 False False True False
1 False True True True
2 True True True False
3 False False True True
4 False True False False
5 False False False True
6 True True True False
7 False False False False
To complete the answer, we convert the True/False values to 1/0 like so:
df2 = df2.applymap(lambda x: 1 if x == True else 0)
Which returns:
A B C D
0 0 0 1 0
1 0 1 1 1
2 1 1 1 0
3 0 0 1 1
4 0 1 0 0
5 0 0 0 1
6 1 1 1 0
7 0 0 0 0
In one line:
df2 = (df1.shift(-1)>df1).replace({True:1,False:0})
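Since the comparison already yields booleans, a plain cast should work just as well (a minor variation, not from the answer above):

df2 = (df1.shift(-1) > df1).astype(int)  # True/False -> 1/0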
I have a pandas series of Boolean values, and I would like to label contiguous groups of True values. How is it possible to do this? Is it possible to do this in a vectorised manner? Any help would be hugely appreciated!
Data:
A
0 False
1 True
2 True
3 True
4 False
5 False
6 True
7 False
8 False
9 True
10 True
Desired:
A Label
0 False 0
1 True 1
2 True 1
3 True 1
4 False 0
5 False 0
6 True 2
7 False 0
8 False 0
9 True 3
10 True 3
Here's an unlikely but simple and working solution:
import scipy.ndimage.measurements as mnts
labeled, clusters = mnts.label(df.A.values)
# labeled is what you want, clusters is the number of clusters
df['Label'] = labeled  # assign as a column (attribute assignment like df.Labels = ... would not create one)
Tested as:
a = np.array([False, False, True, True, True, False, True, False, False,
              True, False, True, True, True, True, True, True, True,
              False, True], dtype=bool)
labeled, clusters = mnts.label(a)
>>> labeled
array([0, 0, 1, 1, 1, 0, 2, 0, 0, 3, 0, 4, 4, 4, 4, 4, 4, 4, 0, 5], dtype=int32)
>>> clusters
5
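Note that in recent SciPy versions the measurements namespace is deprecated; the same function is available at the package top level. A sketch, assuming the df from the question:

from scipy.ndimage import label

labeled, clusters = label(df['A'].to_numpy())
df['Label'] = labeled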
With cumsum
a = df.A.values
z = np.zeros(a.shape, int)
z[a] = pd.factorize((~a).cumsum()[a])[0] + 1
df.assign(Label=z)
A Label
0 False 0
1 True 1
2 True 1
3 True 1
4 False 0
5 False 0
6 True 2
7 False 0
8 False 0
9 True 3
10 True 3
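If the one-liner is hard to parse, here is roughly what each piece does (a sketch over the same df; the intermediate names are mine):

import numpy as np
import pandas as pd

a = df.A.values
runs = (~a).cumsum()                # increments on False, so it stays constant
                                    # within each run of consecutive Trues
true_runs = runs[a]                 # the run ids at the True positions only
codes = pd.factorize(true_runs)[0]  # relabel them 0, 1, 2, ... in order
z = np.zeros(a.shape, int)
z[a] = codes + 1                    # 1-based labels on True rows, 0 elsewhere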
You can use cumsum and groupby + ngroup to mark groups.
v = (~df.A).cumsum().where(df.A).bfill()
df['Label'] = (
v.groupby(v).ngroup().add(1).where(df.A).fillna(0, downcast='infer'))
df
A Label
0 False 0
1 True 1
2 True 1
3 True 1
4 False 0
5 False 0
6 True 2
7 False 0
8 False 0
9 True 3
10 True 3
I have a Series df:
index
0 1
1 1
2 1
3 1
4 1
5 -1
6 -1
7 -1
8 1
9 1
10 1
11 -1
dtype: int64
Another boolean Series, b, acts as an indicator of points:
index
0 False
1 False
2 True
3 False
4 False
5 False
6 True
7 False
8 False
9 True
10 False
11 False
I can set the df values using b, df[b]=0:
index
0 1
1 1
2 0
3 1
4 1
5 -1
6 0
7 -1
8 1
9 0
10 1
11 -1
Now I want to fill the values in the ranges 2:5, 6:7, 9:11 with the value -1; the result I want is a new df:
index
0 1
1 1
2 -1
3 -1
4 -1
5 -1
6 -1
7 -1
8 1
9 -1
10 -1
11 -1
That is, wherever b is True (index 2, 6, 9), I want to fill df with -1 from that index up to the index of the nearest following -1 value (index 5, 7, 11).
The fill value is -1 and the fill ranges are [2:5, 6:7, 9:11].
I've thought of methods like where, replace, pad etc., but cannot work it out. Maybe finding the start index array [2, 6, 9] and the nearest -1 array [5, 7, 11], then pairing them into [2:5, 6:7, 9:11], is one way.
Is there a more useful approach?
numpy.where() looks like it can do what you need:
Code:
import numpy as np

starts = np.where(df == 0)
ends = np.where(df == -1)
for start, end in zip(starts[0], ends[0]):
    df[start:end] = -1
Test Data:
import pandas as pd
df = pd.DataFrame([1, 1, 1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
b = pd.DataFrame([False, False, True, False, False, False, True,
False, False, True, False, False,])
df[b] = 0
print(df)
Results:
0
0 1
1 1
2 -1
3 -1
4 -1
5 -1
6 -1
7 -1
8 1
9 -1
10 -1
11 -1
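A vectorized alternative that avoids the Python loop (a sketch; the marker logic is my own, and df here is the single-column frame after df[b] = 0): mark each 0 as a region start and each -1 as a region end, forward-fill the marks, then set -1 wherever the last mark was a start.

import numpy as np

s = df[0]
marks = pd.Series(np.select([s.eq(0), s.eq(-1)], [1.0, 0.0], np.nan),
                  index=s.index)
df[0] = s.mask(marks.ffill().eq(1), -1)  # rows inside an open region -> -1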
I have a data frame with categories and values. I need to find the value in each category closest to a given value (2.0 in the example below). I think I'm close but I can't really get the right output when applying the results of argsort to the original dataframe.
For example, with the input defined in the code below, the output should mark only (a, 1), (b, 2), (c, 2) as True, and all other isClosest values should be False.
If multiple values are equally close, the first value listed should be the one marked.
Here is the code I have, which works but I can't get it to reapply to the dataframe correctly. I would love some pointers.
df = pd.DataFrame()
df['category'] = ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a']
df['values'] = [1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
df['isClosest'] = False

uniqueCategories = df['category'].unique()
for c in uniqueCategories:
    filteredCategories = df[df['category'] == c]
    sortargs = (filteredCategories['values'] - 2.0).abs().argsort()
    # how to use sortargs to set isClosest=True in df for the value in each
    # category that is closest to 2.0?
You can create a column of absolute differences:
df['dif'] = (df['values'] - 2).abs()
df
Out:
category values dif
0 a 1 1
1 b 2 0
2 b 3 1
3 b 4 2
4 c 5 3
5 a 4 2
6 b 3 1
7 c 2 0
8 c 1 1
9 a 0 2
And then use groupby.transform to check whether the minimum value of each group is equal to the difference you calculated:
df['is_closest'] = df.groupby('category')['dif'].transform('min') == df['dif']
df
Out:
category values dif is_closest
0 a 1 1 True
1 b 2 0 True
2 b 3 1 False
3 b 4 2 False
4 c 5 3 False
5 a 4 2 False
6 b 3 1 False
7 c 2 0 True
8 c 1 1 False
9 a 0 2 False
df.groupby('category')['dif'].idxmin() would also give you the indices of the closest values for each category. You can use that for mapping too.
For selection:
df.loc[df.groupby('category')['dif'].idxmin()]
Out:
category values dif
0 a 1 1
1 b 2 0
7 c 2 0
For assignment:
df['is_closest'] = False
df.loc[df.groupby('category')['dif'].idxmin(), 'is_closest'] = True
df
Out:
category values dif is_closest
0 a 1 1 True
1 b 2 0 True
2 b 3 1 False
3 b 4 2 False
4 c 5 3 False
5 a 4 2 False
6 b 3 1 False
7 c 2 0 True
8 c 1 1 False
9 a 0 2 False
The difference between these approaches is that if you check equality against the difference, you would get True for all rows in case of ties. However, with idxmin it will return True for the first occurrence (only one for each group).
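A tiny demonstration of that tie behavior (hypothetical two-row data with a tie in category 'b'):

tie = pd.DataFrame({'category': ['b', 'b'], 'values': [1, 3]})
tie['dif'] = (tie['values'] - 2).abs()   # both rows have dif == 1
print(tie.groupby('category')['dif'].transform('min') == tie['dif'])
# 0    True
# 1    True   <- the equality check marks every tied row
print(tie.groupby('category')['dif'].idxmin())
# category
# b    0      <- idxmin keeps only the first occurrence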
Solution with DataFrameGroupBy.idxmin: get the indexes of the minimal values per group, then assign a boolean mask with Index.isin to the column isClosest:
idx = (df['values'] - 2).abs().groupby([df['category']]).idxmin()
print (idx)
category
a 0
b 1
c 7
Name: values, dtype: int64
df['isClosest'] = df.index.isin(idx)
print (df)
category values isClosest
0 a 1 True
1 b 2 True
2 b 3 False
3 b 4 False
4 c 5 False
5 a 4 False
6 b 3 False
7 c 2 True
8 c 1 False
9 a 0 False