Pandas: Delete rows of each group by condition

Good afternoon. I have a dataframe like the one below, where the groups are given by the column "NumeroPosesion".
Event NumeroPosesion
0 completedPass 1
1 completedPass 1
2 takeon 1
3 failedPass 1
4 takeon 1
5 dribbleYES 1
6 shot 1
7 takeon 2
8 dribbleNO 2
9 completedPass 2
10 completedPass 2
11 shot 2
12 completedPass 2
13 completedPass 2
14 completedPass 2
The idea is the following:
Find the first row with "Event" = "shot" in each group and delete all the rows below it in that group.
Iterate upwards from the last remaining row of the group (the one with "Event" = "shot") until "Event" is something other than "takeon", "completedPass" or "dribbleYES".
When such a row is found, delete it and all the rows above it in the group.
Dataframe expected:
Event NumeroPosesion
0 takeon 1
1 dribbleYES 1
2 shot 1
3 completedPass 2
4 completedPass 2
5 shot 2

Use boolean indexing with the help of groupby.cummax/cummin, applied to the reversed rows so that each group is scanned from the bottom up:
# remove rows after "shot" for each group
m1 = df.loc[::-1, 'Event'].eq('shot').groupby(df['NumeroPosesion']).cummax()
# remove rows before the first non "takeon"/"completedPass"/"dribbleYES"
m2 = (df.loc[m1, 'Event'].isin(['shot', 'takeon', 'completedPass', 'dribbleYES'])[::-1]
.groupby(df['NumeroPosesion']).cummin()
)
# slice
out = df[m1&m2]
Output:
Event NumeroPosesion
4 takeon 1
5 dribbleYES 1
6 shot 1
9 completedPass 2
10 completedPass 2
11 shot 2
Intermediates:
Event NumeroPosesion m1 m2
0 completedPass 1 True False
1 completedPass 1 True False
2 takeon 1 True False
3 failedPass 1 True False
4 takeon 1 True True
5 dribbleYES 1 True True
6 shot 1 True True
7 takeon 2 True False
8 dribbleNO 2 True False
9 completedPass 2 True True
10 completedPass 2 True True
11 shot 2 True True
12 completedPass 2 False NaN
13 completedPass 2 False NaN
14 completedPass 2 False NaN
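A self-contained sketch of the same cummax/cummin idea, for readers who want to run it end to end. The sample frame is rebuilt from the question; the names `allowed`, `rev`, `key` and `kept` are illustrative and not part of the original answer. The masks are cast to integers before the cumulative operations and the frame is reversed as a whole. Both choices are assumptions made here to keep the index alignment explicit; they are not required by the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Event": ["completedPass", "completedPass", "takeon", "failedPass",
              "takeon", "dribbleYES", "shot",
              "takeon", "dribbleNO", "completedPass", "completedPass",
              "shot", "completedPass", "completedPass", "completedPass"],
    "NumeroPosesion": [1] * 7 + [2] * 8,
})

allowed = ["shot", "takeon", "completedPass", "dribbleYES"]

# Reverse the rows so cumulative operations walk each group bottom-up.
rev = df.iloc[::-1]
key = rev["NumeroPosesion"]

# m1: True from the last "shot" of the group upwards, False for rows below it.
m1 = rev["Event"].eq("shot").astype(int).groupby(key).cummax().astype(bool)

# m2: among the rows kept by m1, stay True only until the first event
# (seen bottom-up) that is not in the allowed list.
kept = rev[m1]
m2 = (kept["Event"].isin(allowed).astype(int)
      .groupby(kept["NumeroPosesion"]).cummin().astype(bool))

# Select the surviving rows in their original order.
out = df.loc[m2[m2].index.sort_values()]
print(out)
```

Calling `out.reset_index(drop=True)` afterwards gives the 0-based index shown in the expected dataframe of the question.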

Related

Condition Shift in Pandas

I am trying to increment a counter whenever a sequence changes sign.
The counter should increase by 1 each time the values change from the negative to the positive range.
Here is the code:
import pandas as pd
data = {'a': [-1,-1,-2,-3,4,5,6,-2,-2,-3,6,3,6,7,-1,-5,-7,1,34,5]}
df = pd.DataFrame(data)
df['p'] = df.a > 0
df['g'] = (df['p'] != df['p'].shift()).cumsum()
This is the output:
0 -1 False 1
1 -1 False 1
2 -2 False 1
3 -3 False 1
4 4 True 2
5 5 True 2
6 6 True 2
7 -2 False 3
8 -2 False 3
9 -3 False 3
10 6 True 4
11 3 True 4
12 6 True 4
13 7 True 4
14 -1 False 5
I need an output that looks like this:
0 -1 False 1
1 -1 False 1
2 -2 False 1
3 -3 False 1
4 4 True 2
5 5 True 2
6 6 True 2
7 -2 False 2
8 -2 False 2
9 -3 False 2
10 6 True 3
11 3 True 3
12 6 True 3
13 7 True 3
14 -1 False 3
Anybody got an idea?
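You can combine the two masks with & (bitwise AND), so that the counter only increases where the value changes and the new value is positive: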
df['p'] = df.a > 0
df['g'] = ((df['p'] != df['p'].shift()) & df['p']).cumsum() + 1
Another idea is to filter by the mask in column p, forward-fill the missing values, replace the leading NaN with 0 for the first group, and add 1:
df['p'] = df.a > 0
df['g'] = ((df['p'] != df['p'].shift()))[df['p']].cumsum()
df['g'] = df['g'].ffill().fillna(0).astype(int) + 1
Solution with differences, without the helper p column (astype is used instead of Series.view, which is deprecated in recent pandas versions):
df['g'] = df.a.gt(0).astype('i1').diff().gt(0).cumsum().add(1)
print (df)
a p g
0 -1 False 1
1 -1 False 1
2 -2 False 1
3 -3 False 1
4 4 True 2
5 5 True 2
6 6 True 2
7 -2 False 2
8 -2 False 2
9 -3 False 2
10 6 True 3
11 3 True 3
12 6 True 3
13 7 True 3
14 -1 False 3
15 -5 False 3
16 -7 False 3
17 1 True 4
18 34 True 4
19 5 True 4
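As a quick check, the mask-based and the diff-based variants can be run side by side on the question's data. This is only a sketch: the names `p`, `starts`, `g_mask` and `g_diff` are illustrative, and `shift(fill_value=False)` is used here instead of the comparison with the shifted column to keep everything boolean.

```python
import pandas as pd

data = {'a': [-1, -1, -2, -3, 4, 5, 6, -2, -2, -3, 6, 3, 6, 7, -1, -5, -7, 1, 34, 5]}
df = pd.DataFrame(data)

p = df['a'] > 0

# A positive run starts where the current value is positive
# and the previous one is not.
starts = p & ~p.shift(fill_value=False)
g_mask = starts.cumsum() + 1

# Same idea via differences: False -> True shows up as a diff of +1.
g_diff = p.astype('int8').diff().gt(0).cumsum() + 1

print(g_mask.tolist())
print(g_diff.tolist())
```

On this data both variants agree. They can differ when the very first value is positive: the mask variant then counts that row as the start of a run (so the counter begins at 2), while the diff variant has no previous row to compare against and begins at 1.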
Use np.sign together with diff, gt, and cumsum (this needs import numpy as np):
s = np.sign(df.a)
df['g'] = s.diff().gt(0).cumsum() + 1
Out[255]:
a p g
0 -1 False 1
1 -1 False 1
2 -2 False 1
3 -3 False 1
4 4 True 2
5 5 True 2
6 6 True 2
7 -2 False 2
8 -2 False 2
9 -3 False 2
10 6 True 3
11 3 True 3
12 6 True 3
13 7 True 3
14 -1 False 3
15 -5 False 3
16 -7 False 3
17 1 True 4
18 34 True 4
19 5 True 4
You can try this:
df["g"] = (df.p.astype(int).diff() > 0).astype(int).cumsum() + 1
output:
a p g
0 -1 False 1
1 -1 False 1
2 -2 False 1
3 -3 False 1
4 4 True 2
5 5 True 2
6 6 True 2
7 -2 False 2
8 -2 False 2
9 -3 False 2
10 6 True 3
11 3 True 3
12 6 True 3
13 7 True 3
14 -1 False 3
15 -5 False 3
16 -7 False 3
17 1 True 4
18 34 True 4
19 5 True 4

how to reindex python dataframe based on column grouping

I am a newbie in Python, please assist. I have a huge dataframe consisting of thousands of rows. An example of the df is shown below.
STATE VOLUME
INDEX
1 on 10
2 on 15
3 on 10
4 off 20
5 off 30
6 on 15
7 on 20
8 off 10
9 off 30
10 off 10
11 on 20
12 off 25
I want to index this data based on the 'STATE' column such that the first batch of 'on' and 'off' rows gets index 1, the next batch of 'on' and 'off' rows gets index 2, and so on. That way I can select a whole batch of data by selecting the rows with index 1.
ID VOLUME
INDEX
1 on 10
1 on 15
1 on 10
1 off 20
1 off 30
2 on 15
2 on 20
2 off 10
2 off 30
2 off 10
3 on 20
3 off 25
You can combine pd.Series.eq with pd.Series.shift and then take the cumulative sum with pd.Series.cumsum:
df.index = (df['STATE'].eq('off').shift() & df['STATE'].eq('on')).cumsum() + 1
df.index.name = 'INDEX'
STATE VOLUME
INDEX
1 on 10
1 on 15
1 on 10
1 off 20
1 off 30
2 on 15
2 on 20
2 off 10
2 off 30
2 off 10
3 on 20
3 off 25
Details
The idea is to find where an off is followed by an on.
# (df['STATE'].eq('off').shift() & df['STATE'].eq('on')).cumsum() + 1
eq(off).shift eq(on) eq(off).shift & eq(on)
INDEX
1 NaN True False
2 False True False
3 False True False
4 False False False
5 True False False
6 True True True
7 False True False
8 False False False
9 True False False
10 True False False
11 True True True
12 False False False
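The "off followed by on" idea can be sketched as a runnable example. The frame below is rebuilt from the question; `is_on`, `prev_off` and `batch` are illustrative names. Unlike the one-liner above, this sketch shifts the STATE column first and compares afterwards, so the first row becomes a plain False rather than NaN. That is a small variation assumed here for clarity, not the answer's exact expression.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "STATE": ["on", "on", "on", "off", "off", "on", "on",
                  "off", "off", "off", "on", "off"],
        "VOLUME": [10, 15, 10, 20, 30, 15, 20, 10, 30, 10, 20, 25],
    },
    index=pd.RangeIndex(1, 13, name="INDEX"),
)

is_on = df["STATE"].eq("on")
# Previous row was 'off'; the first row has no predecessor and yields False.
prev_off = df["STATE"].shift().eq("off")

# A new batch starts on every 'on' row that directly follows an 'off' row.
batch = (is_on & prev_off).cumsum() + 1
df.index = pd.Index(batch.to_numpy(), name="INDEX")

print(df.loc[2])  # all rows of the second on/off batch
```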
You could try this with pd.Series.shift and pd.Series.cumsum:
df.index=((df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')).shift(fill_value=0).cumsum()+1
The same idea with np.where (this needs import numpy as np):
temp=pd.Series(np.where((df.STATE.shift(-1) != df.STATE)&(df.STATE.eq('off')),1,0))
df.index=temp.shift(1,fill_value=0).cumsum().astype(int).add(1)
Output:
df
STATE VOLUME
1 on 10
1 on 15
1 on 10
1 off 20
1 off 30
2 on 15
2 on 20
2 off 10
2 off 30
2 off 10
3 on 20
3 off 25
Explanation:
With (df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off'), you get a mask that is True on the last row of each run of 'off':
(df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 True
11 False
12 True
Then you shift the mask down by one row, so that each True lands on the first row of the next batch, and take cumsum(), where True counts as 1 and False as 0:
((df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')).shift(fill_value=0)
1 0
2 False
3 False
4 False
5 False
6 True
7 False
8 False
9 False
10 False
11 True
12 False
((df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')).shift(fill_value=0).cumsum()
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 1
11 2
12 2
Finally, you add 1 so that the index starts at 1, which gives the desired result.
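The three steps of this explanation (mark the end of each 'off' run, shift down by one row, cumulative sum plus 1) can be sketched on a bare Series. The names `state`, `ends_off_run` and `batch` are illustrative, and `fill_value=False` is used instead of `fill_value=0` so the shifted mask stays boolean; that substitution is an assumption of this sketch.

```python
import pandas as pd

state = pd.Series(
    ["on", "on", "on", "off", "off", "on", "on",
     "off", "off", "off", "on", "off"],
    index=range(1, 13),
)

# Step 1: True on the last row of every run of 'off'
# (the next row differs, or there is no next row).
ends_off_run = state.eq("off") & state.shift(-1).ne(state)

# Step 2: move each True down one row, onto the first row of the next batch.
starts_next = ends_off_run.shift(fill_value=False)

# Step 3: running count of batch starts, offset so numbering begins at 1.
batch = starts_next.cumsum() + 1
print(batch.tolist())
```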

Pandas cumulative sum based on True/False condition

I'm using Python and need a cumulative sum of Value that runs up to and including each row where the boolean column is True, and restarts on the following row. How can I do this?
Bool Value Expected_cumsum
0 False 1 1
1 False 2 3
2 False 4 7
3 True 1 8
4 False 3 3 << reset from here
5 False 5 8
6 True 2 10
....
Thanks, all!
You can try this
a = df.Bool.eq(True).cumsum().shift().fillna(0)
df['Expected_cumsum']= df.groupby(a)['Value'].cumsum()
df
Output
Bool Value Expected_cumsum
0 False 1 1
1 False 2 3
2 False 4 7
3 True 1 8
4 False 3 3
5 False 5 8
6 True 2 10
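A runnable sketch of this grouping trick on the question's data. The name `block` is illustrative, and `shift(fill_value=0)` is used here in place of `shift().fillna(0)` so that the group labels stay integers.

```python
import pandas as pd

df = pd.DataFrame({
    "Bool": [False, False, False, True, False, False, True],
    "Value": [1, 2, 4, 1, 3, 5, 2],
})

# Count the True rows seen so far, then shift by one row so that a True row
# still belongs to the block it closes and the next row opens a new block.
block = df["Bool"].cumsum().shift(fill_value=0)

# Cumulative sum of Value restarted within each block.
df["Expected_cumsum"] = df.groupby(block)["Value"].cumsum()
print(df)
```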

Using cumsum to find unique chapters

I have a dataframe like this:
df = pd.DataFrame()
text secFlag
0 book 1
1 headings 1
2 chapter 1
3 one 1
4 page 0
5 one 0
6 text 0
7 chapter 1
8 two 1
9 page 0
10 two 0
11 text 0
12 page 0
13 three 0
10 text 0
11 chapter 1
12 three 1
13 something 0
I want to find the cumulative sum so that I can mark all the pages belonging to a specific chapter by a running index number.
**Desired output**
text secFlag chapter
0 book 1 1
1 headings 1 1
2 chapter 1 2
3 one 1 2
4 page 0 2
5 one 0 2
6 text 0 2
7 chapter 1 3
8 two 1 3
9 page 0 3
10 two 0 3
11 text 0 3
12 page 0 3
13 three 0 3
10 text 0 3
11 chapter 1 4
12 three 1 4
13 something 0 4
This is what I tried:
df['chapter'] = ((df['secFlag'].shift(-1) == 1)).cumsum()
But, this is not giving me the desired output, as it is incrementing as soon as a value is 1 in the section flag. Note that multiple words are part of the text, and the chapter heading will usually have multiple words.
Can you please suggest a simple way to get this done?
thanks
If you need to flag the first 1 of each consecutive run in secFlag, a solution is:
df['chapter'] = ((df['secFlag'] == 1) & (df['secFlag'] != df['secFlag'].shift())).cumsum()
print (df)
text secFlag chapter
0 book 1 1
1 headings 1 1
2 chapter 1 1
3 one 1 1
4 page 0 1
5 one 0 1
6 text 0 1
7 chapter 1 2
8 two 1 2
9 page 0 2
10 two 0 2
11 text 0 2
12 page 0 2
13 three 0 2
10 text 0 2
11 chapter 1 3
12 three 1 3
13 something 0 3
Details:
a = (df['secFlag'] == 1)
b = (df['secFlag'] != df['secFlag'].shift())
c = a & b
d = c.cumsum()
print (pd.concat([df,a,b,c,d],
axis=1,
keys=('orig','==1','!=shifted','chained by &','cumsum')))
orig ==1 !=shifted chained by & cumsum
text secFlag secFlag secFlag secFlag secFlag
0 book 1 True True True 1
1 headings 1 True False False 1
2 chapter 1 True False False 1
3 one 1 True False False 1
4 page 0 False True False 1
5 one 0 False False False 1
6 text 0 False False False 1
7 chapter 1 True True True 2
8 two 1 True False False 2
9 page 0 False True False 2
10 two 0 False False False 2
11 text 0 False False False 2
12 page 0 False False False 2
13 three 0 False False False 2
10 text 0 False False False 2
11 chapter 1 True True True 3
12 three 1 True False False 3
13 something 0 False True False 3
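The same run-start logic as a self-contained sketch. The frame is rebuilt from the question with a default 0-based index (the duplicated index labels of the printed sample are not reproduced), and `flag` and `run_start` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["book", "headings", "chapter", "one", "page", "one", "text",
             "chapter", "two", "page", "two", "text", "page", "three",
             "text", "chapter", "three", "something"],
    "secFlag": [1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0],
})

flag = df["secFlag"].eq(1)

# A run of 1s starts where the flag is set and the previous row's flag is not.
run_start = flag & ~flag.shift(fill_value=False)

# Each run start opens a new chapter number.
df["chapter"] = run_start.cumsum()
print(df)
```

Note that this numbers the leading "book headings chapter one" run as a single block 1, as in the answer's output, which is not identical to the desired output in the question.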

Select a specific value in a group using python pandas

I have a dataset with below data.
id status div
1 True 0
2 False 2
2 True 1
3 False 4
3 False 5
1 False 5
4 True 3
4 True 10
5 False 3
5 False 3
5 True 2
I want my output as
id status div
1 True 0
2 True 1
3 False 4
4 True 3
5 True 2
If True is present in a group, I want the row with True; if the group contains only False, I want a False row.
I have tried using pandas groupby but was unable to express the condition.
Use DataFrameGroupBy.any together with map, using a helper Series that holds the div of the first True row per group if one exists (a False row otherwise):
s = (df.sort_values(['status','id'], ascending=False)
.drop_duplicates('id')
.set_index('id')['div'])
print (s)
id
5 2
4 3
2 1
1 0
3 4
Name: div, dtype: int64
df1 = df.groupby('id')['status'].any().reset_index()
df1['div'] = df1['id'].map(s)
print (df1)
id status div
0 1 True 0
1 2 True 1
2 3 False 4
3 4 True 3
4 5 True 2
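An alternative sketch that reaches the same table in one pass, using groupby idxmax instead of the sort/drop_duplicates/map combination above. This is a swapped-in technique, not the answer's method: idxmax returns the label of the first maximum per group, so it picks the first True row when one exists and the first row of the group otherwise. The name `first_true` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 3, 3, 1, 4, 4, 5, 5, 5],
    "status": [True, False, True, False, False, False, True, True,
               False, False, True],
    "div": [0, 2, 1, 4, 5, 5, 3, 10, 3, 3, 2],
})

# Cast to int so True (1) is the maximum; idxmax gives the row label of the
# first maximum in each id group.
first_true = df["status"].astype(int).groupby(df["id"]).idxmax()

# Pick those rows; groups come back ordered by id.
out = df.loc[first_true.tolist()].reset_index(drop=True)
print(out)
```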
