Using cumsum to find unique chapters - python

I have a dataframe like this:
df = pd.DataFrame()
text secFlag
0 book 1
1 headings 1
2 chapter 1
3 one 1
4 page 0
5 one 0
6 text 0
7 chapter 1
8 two 1
9 page 0
10 two 0
11 text 0
12 page 0
13 three 0
10 text 0
11 chapter 1
12 three 1
13 something 0
I want to find the cumulative sum so that I can mark all the pages belonging to a specific chapter by a running index number.
**Desired output**
text secFlag chapter
0 book 1 1
1 headings 1 1
2 chapter 1 2
3 one 1 2
4 page 0 2
5 one 0 2
6 text 0 2
7 chapter 1 3
8 two 1 3
9 page 0 3
10 two 0 3
11 text 0 3
12 page 0 3
13 three 0 3
10 text 0 3
11 chapter 1 4
12 three 1 4
13 something 0 4
This is what I tried:
df['chapter'] = ((df['secFlag'].shift(-1) == 1)).cumsum()
But this does not give the desired output: it increments every time the section flag is 1. Note that the text consists of multiple words, and a chapter heading will usually span multiple words.
Can you please suggest a simple way to get this done?
thanks

If you need to increment at the first 1 of each run of consecutive 1s in secFlag, the solution is:
df['chapter'] = ((df['secFlag'] == 1) & (df['secFlag'] != df['secFlag'].shift())).cumsum()
print (df)
text secFlag chapter
0 book 1 1
1 headings 1 1
2 chapter 1 1
3 one 1 1
4 page 0 1
5 one 0 1
6 text 0 1
7 chapter 1 2
8 two 1 2
9 page 0 2
10 two 0 2
11 text 0 2
12 page 0 2
13 three 0 2
10 text 0 2
11 chapter 1 3
12 three 1 3
13 something 0 3
Details:
a = (df['secFlag'] == 1)
b = (df['secFlag'] != df['secFlag'].shift())
c = a & b
d = c.cumsum()
print (pd.concat([df,a,b,c,d],
axis=1,
keys=('orig','==1','!=shifted','chained by &','cumsum')))
orig ==1 !=shifted chained by & cumsum
text secFlag secFlag secFlag secFlag secFlag
0 book 1 True True True 1
1 headings 1 True False False 1
2 chapter 1 True False False 1
3 one 1 True False False 1
4 page 0 False True False 1
5 one 0 False False False 1
6 text 0 False False False 1
7 chapter 1 True True True 2
8 two 1 True False False 2
9 page 0 False True False 2
10 two 0 False False False 2
11 text 0 False False False 2
12 page 0 False False False 2
13 three 0 False False False 2
10 text 0 False False False 2
11 chapter 1 True True True 3
12 three 1 True False False 3
13 something 0 False True False 3
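Note that the output above numbers the front matter and chapter one together, because rows 0 to 3 form a single run of 1s, whereas the desired output in the question starts chapter 2 at row 2. A minimal sketch that reproduces the desired output, under the assumption (not stated in the question) that every chapter heading begins with the literal word "chapter":

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["book", "headings", "chapter", "one", "page", "one", "text",
             "chapter", "two", "page", "two", "text", "page", "three",
             "text", "chapter", "three", "something"],
    "secFlag": [1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0],
})

# Assumption: a flagged row whose text is exactly "chapter" opens a new chapter.
starts = df["secFlag"].eq(1) & df["text"].eq("chapter")

# cumsum counts the chapter starts seen so far; +1 makes the front matter chapter 1.
df["chapter"] = starts.cumsum() + 1
print(df)
```

If headings can start with other words, this marker test would have to be replaced by whatever rule identifies a heading start in the real data.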

Related

Pandas How to flag consecutive values ignoring the first occurrence

I have the following code:
data={'id':[1,2,3,4,5,6,7,8,9,10,11],
'value':[1,0,1,0,1,1,1,0,0,1,0]}
df=pd.DataFrame.from_dict(data)
df
Out[8]:
id value
0 1 1
1 2 0
2 3 1
3 4 0
4 5 1
5 6 1
6 7 1
7 8 0
8 9 0
9 10 1
10 11 0
I want to create a flag column that marks consecutive values with 1, starting from the second occurrence and ignoring the first.
With my current solution:
df['flag'] = df.value.groupby([df.value,df.flag.diff().ne(0).cumsum()]).transform('size').ge(3).astype(int)
Out[8]:
id value flag
0 1 1 0
1 2 0 0
2 3 1 0
3 4 0 0
4 5 1 1
5 6 1 1
6 7 1 1
7 8 0 1
8 9 0 1
9 10 1 0
10 11 0 0
While I need a solution like this, where the first occurrence is flagged as 0 and the flag is 1 starting from the second:
Out[8]:
id value flag
0 1 1 0
1 2 0 0
2 3 1 0
3 4 0 0
4 5 1 0
5 6 1 1
6 7 1 1
7 8 0 0
8 9 0 1
9 10 1 0
10 11 0 0
Create groups of consecutive values by comparing the column with its shifted self (Series.shift) for inequality and taking Series.cumsum, number the rows inside each group with GroupBy.cumcount, test whether the counter is greater than 0 with Series.gt, and finally map True/False to 1/0 by casting to integers with Series.astype:
df['flag'] = (df.groupby(df['value'].ne(df['value'].shift()).cumsum())
.cumcount()
.gt(0)
.astype(int))
print (df)
id value flag
0 1 1 0
1 2 0 0
2 3 1 0
3 4 0 0
4 5 1 0
5 6 1 1
6 7 1 1
7 8 0 0
8 9 0 1
9 10 1 0
10 11 0 0
How it works:
print (df.assign(g = df['value'].ne(df['value'].shift()).cumsum(),
counter = df.groupby(df['value'].ne(df['value'].shift()).cumsum()).cumcount(),
mask = df.groupby(df['value'].ne(df['value'].shift()).cumsum()).cumcount().gt(0)))
id value g counter mask
0 1 1 1 0 False
1 2 0 2 0 False
2 3 1 3 0 False
3 4 0 4 0 False
4 5 1 5 0 False
5 6 1 5 1 True
6 7 1 5 2 True
7 8 0 6 0 False
8 9 0 6 1 True
9 10 1 7 0 False
10 11 0 8 0 False
Use groupby.cumcount and a custom grouper:
# group by identical successive values
grp = df['value'].ne(df['value'].shift()).cumsum()
# flag all but the first one (>0)
# convert the booleans True/False to integers 1/0
df['flag'] = df.groupby(grp).cumcount().gt(0).astype(int)
Generic code to skip first N:
N = 1
grp = df['value'].ne(df['value'].shift()).cumsum()
df['flag'] = df.groupby(grp).cumcount().ge(N).astype(int)
Output:
id value flag
0 1 1 0
1 2 0 0
2 3 1 0
3 4 0 0
4 5 1 0
5 6 1 1
6 7 1 1
7 8 0 0
8 9 0 1
9 10 1 0
10 11 0 0
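Both answers rely on the same run-labelling idea. As a self-contained check, here is a minimal sketch that wraps it in a helper; the function name flag_after_first_n is made up for illustration and is not part of pandas:

```python
import pandas as pd


def flag_after_first_n(values: pd.Series, n: int = 1) -> pd.Series:
    """Return 1 for every element past the first n of each run of equal values."""
    # A new run starts wherever a value differs from its predecessor.
    run_id = values.ne(values.shift()).cumsum()
    # cumcount numbers the rows of each run 0, 1, 2, ...
    position = values.groupby(run_id).cumcount()
    # Rows at position n or later are flagged; booleans become 1/0.
    return position.ge(n).astype(int)


df = pd.DataFrame({"id": range(1, 12),
                   "value": [1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0]})
df["flag"] = flag_after_first_n(df["value"])
print(df)
```

With n=2 only the third element of the run of three 1s (id 7) would be flagged.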

Pandas: Delete rows of each group by condition

Good afternoon, I have a dataframe like this where I have different groups that are reflected in the column "NumeroPosesion".
Event NumeroPosesion
0 completedPass 1
1 completedPass 1
2 takeon 1
3 failedPass 1
4 takeon 1
5 dribbleYES 1
6 shot 1
7 takeon 2
8 dribbleNO 2
9 completedPass 2
10 completedPass 2
11 shot 2
12 completedPass 2
13 completedPass 2
14 completedPass 2
The idea is the following:
When the first "Event" = "shot" appears, delete all the rows below that group.
Iterate from the last row of the group (it will be the one with "Event" = "shot") and go up until "Event" is different from "takeon", "completedPass" or "dribbleYES".
When it is different, delete all rows above the different one in the group.
Dataframe expected:
Event NumeroPosesion
0 takeon 1
1 dribbleYES 1
2 shot 1
3 completedPass 2
4 completedPass 2
5 shot 2
Use boolean indexing with help of groupby.cummax/cummin:
# remove rows after "shot" for each group
m1 = df.loc[::-1, 'Event'].eq('shot').groupby(df['NumeroPosesion']).cummax()
# remove rows before the first non "takeon"/"completedPass"/"dribbleYES"
m2 = (df.loc[m1, 'Event'].isin(['shot', 'takeon', 'completedPass', 'dribbleYES'])[::-1]
.groupby(df['NumeroPosesion']).cummin()
)
# slice
out = df[m1&m2]
Output:
Event NumeroPosesion
4 takeon 1
5 dribbleYES 1
6 shot 1
9 completedPass 2
10 completedPass 2
11 shot 2
Intermediates:
Event NumeroPosesion m1 m2
0 completedPass 1 True False
1 completedPass 1 True False
2 takeon 1 True False
3 failedPass 1 True False
4 takeon 1 True True
5 dribbleYES 1 True True
6 shot 1 True True
7 takeon 2 True False
8 dribbleNO 2 True False
9 completedPass 2 True True
10 completedPass 2 True True
11 shot 2 True True
12 completedPass 2 False NaN
13 completedPass 2 False NaN
14 completedPass 2 False NaN
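For comparison, the three steps from the question can also be written as an explicit per-group function. This is only a sketch of an alternative, not the answer's method: the names ALLOWED and trim_possession are hypothetical, and it assumes a group without any "shot" should be dropped entirely (which is also what the m1 mask does).

```python
import pandas as pd

df = pd.DataFrame({
    "Event": ["completedPass", "completedPass", "takeon", "failedPass",
              "takeon", "dribbleYES", "shot", "takeon", "dribbleNO",
              "completedPass", "completedPass", "shot", "completedPass",
              "completedPass", "completedPass"],
    "NumeroPosesion": [1] * 7 + [2] * 8,
})

# Events that may precede the shot without breaking the sequence.
ALLOWED = {"takeon", "completedPass", "dribbleYES"}


def trim_possession(group: pd.DataFrame) -> pd.DataFrame:
    events = group["Event"].tolist()
    if "shot" not in events:
        return group.iloc[0:0]       # assumption: drop groups without a shot
    end = events.index("shot")       # position of the first shot
    start = end
    # Walk upwards while the previous event is still an allowed one.
    while start > 0 and events[start - 1] in ALLOWED:
        start -= 1
    return group.iloc[start:end + 1]


out = pd.concat(trim_possession(g) for _, g in df.groupby("NumeroPosesion"))
print(out)
```

The loop is slower than the vectorised masks on large frames, but it follows the wording of the question step by step, which can make it easier to adapt.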

how to reindex python dataframe based on column grouping

I am a newbie in Python. Please assist. I have a huge dataframe consisting of thousands of rows. An example of the df is shown below.
STATE VOLUME
INDEX
1 on 10
2 on 15
3 on 10
4 off 20
5 off 30
6 on 15
7 on 20
8 off 10
9 off 30
10 off 10
11 on 20
12 off 25
I want to index this data based on the 'STATE' column such that the first batch of 'on' and 'off' rows registers as index 1, the next batch of 'on' and 'off' rows registers as index 2, and so on. I want to be able to select a whole batch by selecting the rows with, for example, index 1.
ID VOLUME
INDEX
1 on 10
1 on 15
1 on 10
1 off 20
1 off 30
2 on 15
2 on 20
2 off 10
2 off 30
2 off 10
3 on 20
3 off 25
You can do this with pd.Series.eq and pd.Series.shift, then take the cumulative sum with pd.Series.cumsum:
df.index = (df['STATE'].eq('off').shift() & df['STATE'].eq('on')).cumsum() + 1
df.index.name = 'INDEX'
STATE VOLUME
INDEX
1 on 10
1 on 15
1 on 10
1 off 20
1 off 30
2 on 15
2 on 20
2 off 10
2 off 30
2 off 10
3 on 20
3 off 25
Details
The idea is to find where an off is followed by an on.
# (df['STATE'].eq('off').shift() & df['STATE'].eq('on')).cumsum() + 1
eq(off).shift eq(on) eq(off).shift & eq(on)
INDEX
1 NaN True False
2 False True False
3 False True False
4 False False False
5 True False False
6 True True True
7 False True False
8 False False False
9 True False False
10 True False False
11 True True True
12 False False False
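A self-contained sketch of the same idea on the sample data. One deliberate difference from the line above: it passes fill_value=False to shift so that the first row holds False instead of NaN and the mask stays boolean. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "STATE": ["on", "on", "on", "off", "off", "on", "on",
              "off", "off", "off", "on", "off"],
    "VOLUME": [10, 15, 10, 20, 30, 15, 20, 10, 30, 10, 20, 25],
})

is_on = df["STATE"].eq("on")
# True where the previous row was 'off'; the first row has no predecessor.
prev_off = df["STATE"].eq("off").shift(fill_value=False)

# A new batch starts on every 'on' row that directly follows an 'off' row.
batch = (prev_off & is_on).cumsum() + 1
df.index = pd.Index(batch.to_numpy(), name="INDEX")

# Selecting a batch by its label returns all of its rows.
print(df.loc[2])
```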
You could try this with pd.Series.shift and pd.Series.cumsum:
df.index=((df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')).shift(fill_value=0).cumsum()+1
Same as this with np.where:
temp=pd.Series(np.where((df.STATE.shift(-1) != df.STATE)&(df.STATE.eq('off')),1,0))
df.index=temp.shift(1,fill_value=0).cumsum().astype(int).add(1)
Output:
df
STATE VOLUME
1 on 10
1 on 15
1 on 10
1 off 20
1 off 30
2 on 15
2 on 20
2 off 10
2 off 30
2 off 10
3 on 20
3 off 25
Explanation:
With (df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off'), you get a mask that is True on the last row of each run of 'off':
(df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 True
11 False
12 True
Then you shift the mask down by one row, so that the last 'off' row stays in its batch and the next row opens a new one, and take cumsum(), which counts True as 1 and False as 0:
((df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')).shift(fill_value=0)
1 0
2 False
3 False
4 False
5 False
6 True
7 False
8 False
9 False
10 False
11 True
12 False
((df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')).shift(fill_value=0).cumsum()
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 1
11 2
12 2
Finally, you add 1 to the result so that the index starts at 1, which gives the desired output.
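The walkthrough can be replayed on a bare Series as a quick sanity check. This is a minimal sketch; it uses fill_value=False rather than 0 so the shifted mask stays boolean, and the variable names are illustrative.

```python
import pandas as pd

state = pd.Series(["on", "on", "on", "off", "off", "on", "on",
                   "off", "off", "off", "on", "off"],
                  index=range(1, 13))

# Step 1: True on the last row of each run of 'off'.
last_off = state.ne(state.shift(-1)) & state.eq("off")

# Step 2: shift down one row so the row after each run opens the next batch.
opens_batch = last_off.shift(fill_value=False)

# Step 3: the running count of batch openings, plus 1, is the batch label.
labels = opens_batch.cumsum() + 1
print(labels.tolist())
```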

Select a specific value in a group using python pandas

I have a dataset with below data.
id status div
1 True 0
2 False 2
2 True 1
3 False 4
3 False 5
1 False 5
4 True 3
4 True 10
5 False 3
5 False 3
5 True 2
I want my output as
id status div
1 True 0
2 True 1
3 False 4
4 True 3
5 True 2
If True is present in the group, I want the status to be True; if only False is present, I want it to be False.
I have tried using Pandas group by but unable to select the condition.
Use DataFrameGroupBy.any for the status, and map the div values from a helper Series that holds the first True row of each group if one exists (otherwise the first False row):
s = (df.sort_values(['status','id'], ascending=False)
.drop_duplicates('id')
.set_index('id')['div'])
print (s)
id
5 2
4 3
2 1
1 0
3 4
Name: div, dtype: int64
df1 = df.groupby('id')['status'].any().reset_index()
df1['div'] = df1['id'].map(s)
print (df1)
id status div
0 1 True 0
1 2 True 1
2 3 False 4
3 4 True 3
4 5 True 2
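The same result can also be reached in a single pass by stably reordering the rows so that True rows come first and then keeping the first row per id. This is a sketch of an alternative, not the answer's approach; it assumes that "the first True row, else the first False row, in original order" is the intended rule.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 3, 3, 1, 4, 4, 5, 5, 5],
    "status": [True, False, True, False, False, False, True, True,
               False, False, True],
    "div": [0, 2, 1, 4, 5, 5, 3, 10, 3, 3, 2],
})

# Stable argsort on the negated flag puts True rows first and keeps the
# original row order within the True rows and within the False rows.
order = np.argsort(~df["status"].to_numpy(), kind="stable")

out = (df.iloc[order]
         .drop_duplicates("id")      # first row per id: a True row if one exists
         .sort_values("id")
         .reset_index(drop=True))
print(out)
```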

creating a column which keeps a running count of consecutive values

I am trying to create a column (“consec”) which will keep a running count of consecutive values in another (“binary”) without using loop. This is what the desired outcome would look like:
. binary consec
1 0 0
2 1 1
3 1 2
4 1 3
5 1 4
5 0 0
6 1 1
7 1 2
8 0 0
However, this...
df['consec'][df['binary']==1] = df['consec'].shift(1) + df['binary']
results in this...
. binary consec
0 1 NaN
1 1 1
2 1 1
3 0 0
4 1 1
5 0 0
6 1 1
7 1 1
8 1 1
9 0 0
I see other posts which use grouping or sorting, but unfortunately, I don't see how that could work for me.
You can use the compare-cumsum-groupby pattern (which I really need to get around to writing up for the documentation), with a final cumcount:
>>> df = pd.DataFrame({"binary": [0,1,1,1,0,0,1,1,0]})
>>> df["consec"] = df["binary"].groupby((df["binary"] == 0).cumsum()).cumcount()
>>> df
binary consec
0 0 0
1 1 1
2 1 2
3 1 3
4 0 0
5 0 0
6 1 1
7 1 2
8 0 0
This works because first we get the positions where we want to reset the counter:
>>> (df["binary"] == 0)
0 True
1 False
2 False
3 False
4 True
5 True
6 False
7 False
8 True
Name: binary, dtype: bool
The cumulative sum of these gives us a different id for each group:
>>> (df["binary"] == 0).cumsum()
0 1
1 1
2 1
3 1
4 2
5 3
6 3
7 3
8 4
Name: binary, dtype: int64
And then we can pass this to groupby and use cumcount to get an increasing index in each group.
For those who ended up here looking for an answer to the "misunderstood" version:
To reset count for each change in the binary column, so that consec does "keep a running count of consecutive values", the following seems to work:
df["consec2"] = df["binary"].groupby((df["binary"] != df["binary"].shift()).cumsum()).cumcount()
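A runnable sketch of this variant on the sample data from the answer above. Note that cumcount is zero-based, so the first element of every run gets 0. The consec_runs column is an extra illustration (not from the original posts) showing how the run position can be turned into a 1-based count that stays 0 on the 0 rows.

```python
import pandas as pd

df = pd.DataFrame({"binary": [0, 1, 1, 1, 0, 0, 1, 1, 0]})

# A new run starts wherever the value differs from the previous row.
run_id = df["binary"].ne(df["binary"].shift()).cumsum()

# Zero-based position of each row inside its run.
df["consec2"] = df["binary"].groupby(run_id).cumcount()

# Illustration only: 1-based count on the 1s, zero on the 0s.
df["consec_runs"] = (df["consec2"] + 1) * df["binary"]
print(df)
```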
