I am trying to increment a counter whenever a sequence changes sign.
The counter should increase by 1 each time the values change from the negative to the positive range.
Here is the code:
import pandas as pd
data = {'a': [-1,-1,-2,-3,4,5,6,-2,-2,-3,6,3,6,7,-1,-5,-7,1,34,5]}
df = pd.DataFrame(data)
df['p'] = df.a > 0
df['g'] = (df['p'] != df['p'].shift()).cumsum()
This is the output:
0 -1 False 1
1 -1 False 1
2 -2 False 1
3 -3 False 1
4 4 True 2
5 5 True 2
6 6 True 2
7 -2 False 3
8 -2 False 3
9 -3 False 3
10 6 True 4
11 3 True 4
12 6 True 4
13 7 True 4
14 -1 False 5
I need an output that looks like this:
0 -1 False 1
1 -1 False 1
2 -2 False 1
3 -3 False 1
4 4 True 2
5 5 True 2
6 6 True 2
7 -2 False 2
8 -2 False 2
9 -3 False 2
10 6 True 3
11 3 True 3
12 6 True 3
13 7 True 3
14 -1 False 3
Anybody got an idea?
You can combine the change mask with p using & (bitwise AND), so that only changes into the positive range are counted:
df['p'] = df.a > 0
df['g'] = ((df['p'] != df['p'].shift()) & df['p']).cumsum() + 1
Another idea is to keep only the rows where p is True, take the cumulative sum, forward-fill the missing values, replace the leading NaN with 0, and add 1:
df['p'] = df.a > 0
df['g'] = ((df['p'] != df['p'].shift()))[df['p']].cumsum()
df['g'] = df['g'].ffill().fillna(0).astype(int) + 1
A solution using differences, without the helper p column (astype is used because Series.view is deprecated in recent pandas versions):
df['g'] = df.a.gt(0).astype('i1').diff().gt(0).cumsum().add(1)
print (df)
a p g
0 -1 False 1
1 -1 False 1
2 -2 False 1
3 -3 False 1
4 4 True 2
5 5 True 2
6 6 True 2
7 -2 False 2
8 -2 False 2
9 -3 False 2
10 6 True 3
11 3 True 3
12 6 True 3
13 7 True 3
14 -1 False 3
15 -5 False 3
16 -7 False 3
17 1 True 4
18 34 True 4
19 5 True 4
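For reference, the first variant can be written as a self-contained sketch. The names p, rising, and g are only illustrative, and shift(fill_value=False) is used here in place of the != comparison from the answer so that the shifted mask stays boolean:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [-1, -1, -2, -3, 4, 5, 6, -2, -2, -3, 6, 3, 6, 7, -1, -5, -7, 1, 34, 5]}
)

# True where the value is positive
p = df["a"] > 0
# True only on the first positive row of each positive run
rising = p & ~p.shift(fill_value=False)
# Count the rising edges and start numbering at 1
g = rising.cumsum() + 1
df["g"] = g
```

Falling edges (positive to negative) are ignored because the mask is ANDed with p, so the counter only moves when a positive run begins.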
Use np.sign together with diff, gt, and cumsum:
import numpy as np
s = np.sign(df.a)
df['g'] = s.diff().gt(0).cumsum() + 1
Output:
a p g
0 -1 False 1
1 -1 False 1
2 -2 False 1
3 -3 False 1
4 4 True 2
5 5 True 2
6 6 True 2
7 -2 False 2
8 -2 False 2
9 -3 False 2
10 6 True 3
11 3 True 3
12 6 True 3
13 7 True 3
14 -1 False 3
15 -5 False 3
16 -7 False 3
17 1 True 4
18 34 True 4
19 5 True 4
You can try this:
df["g"] = (df.p.astype(int).diff() > 0).astype(int).cumsum() + 1
output:
a p g
0 -1 False 1
1 -1 False 1
2 -2 False 1
3 -3 False 1
4 4 True 2
5 5 True 2
6 6 True 2
7 -2 False 2
8 -2 False 2
9 -3 False 2
10 6 True 3
11 3 True 3
12 6 True 3
13 7 True 3
14 -1 False 3
15 -5 False 3
16 -7 False 3
17 1 True 4
18 34 True 4
19 5 True 4
I would like to identify what I call "periods" in data stored in a pandas DataFrame.
Let's say I have these values:
values
1 0
2 8
3 1
4 0
5 5
6 6
7 4
8 7
9 0
10 2
11 9
12 1
13 0
I would like to identify sequences of strictly positive numbers with a length of at least 3. Any number that is not strictly positive ends the ongoing sequence.
This would give:
values period
1 0 None
2 8 None
3 1 None
4 0 None
5 5 1
6 6 1
7 4 1
8 7 1
9 0 None
10 2 2
11 9 2
12 1 2
13 0 None
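Using boolean arithmetic:
N = 3
# True on rows that end a sequence (non-positive values)
m1 = df['values'].le(0)
# each group is one non-positive row plus the positives after it, so count > N means at least N positives
m2 = df.groupby(m1.cumsum())['values'].transform('count').gt(N)
df['period'] = (m1&m2).cumsum().where((~m1)&m2)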
output:
values period
1 0 NaN
2 8 NaN
3 1 NaN
4 0 NaN
5 5 1.0
6 6 1.0
7 4 1.0
8 7 1.0
9 0 NaN
10 2 2.0
11 9 2.0
12 1 2.0
13 0 NaN
intermediates:
values m1 m2 CS(m1) m1&m2 CS(m1&m2) (~m1)&m2 period
1 0 True False 1 False 0 False NaN
2 8 False False 1 False 0 False NaN
3 1 False False 1 False 0 False NaN
4 0 True True 2 True 1 False NaN
5 5 False True 2 False 1 True 1.0
6 6 False True 2 False 1 True 1.0
7 4 False True 2 False 1 True 1.0
8 7 False True 2 False 1 True 1.0
9 0 True True 3 True 2 False NaN
10 2 False True 3 False 2 True 2.0
11 9 False True 3 False 2 True 2.0
12 1 False True 3 False 2 True 2.0
13 0 True False 4 False 2 False NaN
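The m2 mask above counts each group together with the non-positive row that opens it, so a run of positives at the very start of the data (with no such row in front of it) would be measured differently. A possible variant that measures the positive runs directly is sketched below; label_runs and its local names are hypothetical helpers, not part of the answer:

```python
import pandas as pd


def label_runs(s: pd.Series, n: int = 3) -> pd.Series:
    """Number the runs of strictly positive values that have at least n rows."""
    pos = s.gt(0)
    # A new run id starts whenever the positive/non-positive state changes
    run_id = pos.ne(pos.shift()).cumsum()
    # Length of the run each row belongs to
    run_len = pos.groupby(run_id).transform("count")
    # Rows that belong to a long enough positive run
    keep = pos & run_len.ge(n)
    # First row of each kept run
    start = keep & ~keep.shift(fill_value=False)
    # Running count of kept runs, blanked outside of them
    return start.cumsum().where(keep)


s = pd.Series([0, 8, 1, 0, 5, 6, 4, 7, 0, 2, 9, 1, 0], index=range(1, 14))
period = label_runs(s)
```

On the sample data this gives the same labels as the answer; the result is a float column because the rows outside a period are NaN.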
You can try:
sign = np.sign(df['values'])
m = sign.ne(sign.shift()).cumsum()  # group consecutive rows with the same sign
df['period'] = (df[sign.eq(1)]  # exclude non-positive numbers
                .groupby(m)
                ['values'].filter(lambda col: len(col) >= 3)
                .groupby(m)
                .ngroup() + 1
               )
print(df)
values period
1 0 NaN
2 8 NaN
3 1 NaN
4 0 NaN
5 5 1.0
6 6 1.0
7 4 1.0
8 7 1.0
9 0 NaN
10 2 2.0
11 9 2.0
12 1 2.0
13 0 NaN
A simple solution:
count = 0
n_groups = 0
seq_idx = [None]*len(df)
for i in range(len(df)):
    if df.iloc[i]['values'] > 0:
        count += 1
    else:
        if count >= 3:
            n_groups += 1
            seq_idx[i-count: i] = [n_groups]*count
        count = 0
df['period'] = seq_idx
Output:
values period
0 0 NaN
1 8 NaN
2 1 NaN
3 0 NaN
4 5 1.0
5 6 1.0
6 4 1.0
7 7 1.0
8 0 NaN
9 2 2.0
10 9 2.0
11 1 2.0
12 0 NaN
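Note that the loop only closes a run when it meets a non-positive value, so a run that reaches the last row would stay unlabeled. A small sketch of one way to cover that case, iterating one step past the end; label_periods and min_len are illustrative names, not from the answer:

```python
import pandas as pd


def label_periods(values, min_len=3):
    """Label runs of strictly positive values of length >= min_len with 1, 2, ..."""
    labels = [None] * len(values)
    count = 0
    n_groups = 0
    # Go one step past the end so that a trailing run is closed as well
    for i in range(len(values) + 1):
        if i < len(values) and values[i] > 0:
            count += 1
            continue
        if count >= min_len:
            n_groups += 1
            labels[i - count:i] = [n_groups] * count
        count = 0
    return labels


# Same data as in the question, but without the final 0
df = pd.DataFrame({"values": [0, 8, 1, 0, 5, 6, 4, 7, 0, 2, 9, 1]})
labels = label_periods(df["values"].tolist())
df["period"] = labels
```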
One simple approach using find_peaks to find the plateaus (positive consecutive integers) of at least size 3:
import numpy as np
import pandas as pd
from scipy.signal import find_peaks
df = pd.DataFrame.from_dict({'values': {0: 0, 1: 8, 2: 1, 3: 0, 4: 5, 5: 6, 6: 4, 7: 7, 8: 0, 9: 2, 10: 9, 11: 1, 12: 0}})
_, plateaus = find_peaks((df["values"] > 0).to_numpy(), plateau_size=3)
indices = np.arange(len(df["values"]))[:, None]
indices = (indices >= plateaus["left_edges"]) & (indices <= plateaus["right_edges"])
res = (indices * (np.arange(indices.shape[1]) + 1)).sum(axis=1)
df["periods"] = res
print(df)
Output
values periods
0 0 0
1 8 0
2 1 0
3 0 0
4 5 1
5 6 1
6 4 1
7 7 1
8 0 0
9 2 2
10 9 2
11 1 2
12 0 0
def function1(dd: pd.DataFrame):
    dd.loc[:, 'period'] = None
    if len(dd) >= 4:
        dd.iloc[1:, 2] = dd.iloc[1:, 1]
    return dd

df1.assign(col1=df1.le(0).cumsum().sub(1)).groupby('col1').apply(function1)
out:
values col1 period
0 0 0 None
1 8 0 None
2 1 0 None
3 0 1 None
4 5 1 1
5 6 1 1
6 4 1 1
7 7 1 1
8 0 2 None
9 2 2 2
10 9 2 2
11 1 2 2
12 0 3 None
I am trying to set the value of a column, "C" based on the idxmax() of a groupby ("B"). To make it a bit more complicated though, in the event of a NaN or 0, I would like it to return the min value excluding the NaN or 0 if such a value exists. Here is an example dataframe:
Index  A    B      C
0      1    5  False
1      1   10  False
2      2    9  False
3      2  NaN  False
4      3    3  False
5      3    5  False
6      4  NaN  False
7      4  NaN  False
8      5    0  False
9      5    5  False
I am trying to set column "C" to True for the idxmax() of column B, split by a groupby on column "A":
Index  A    B      C
0      1    5   True
1      1   10  False
2      2    9   True
3      2  NaN  False
4      3    3   True
5      3    5  False
6      4  NaN   True
7      4  NaN  False
8      5    0  False
9      5    5   True
Thanks!
Let's use groupby with transform like this:
df['C_new'] = df.groupby('A')['B'].transform('idxmax') == df.index
Output:
Index A B C C_new
0 0 1 5.0 False False
1 1 1 10.0 False True
2 2 2 9.0 False True
3 3 2 NaN False False
4 4 3 3.0 False False
5 5 3 5.0 False True
or, idxmin...
df['C_new'] = df.groupby('A')['B'].transform('idxmin') == df.index
Output:
Index A B C C_new
0 0 1 5.0 False True
1 1 1 10.0 False False
2 2 2 9.0 False True
3 3 2 NaN False False
4 4 3 3.0 False True
5 5 3 5.0 False False
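A plain idxmin would pick the 0 in group 5, and what it does for a group that is entirely NaN may depend on the pandas version. One possible sketch for the rule described in the question is shown below, assuming that "minimum excluding NaN and 0, otherwise the first row of the group" is what is wanted; key, order, and first_idx are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "B": [5, 10, 9, np.nan, 3, 5, np.nan, np.nan, 0, 5],
})

# Treat 0 like a missing value (NaN > 0 is False, so NaN stays NaN)
key = df["B"].where(df["B"] > 0)
# Row labels ordered by ascending key, with all missing keys at the end
order = key.sort_values(kind="stable", na_position="last").index
# First row of every group in that order: the smallest valid B,
# or the first row of the group when no valid B exists
first_idx = df.loc[order].groupby("A").head(1).index
df["C"] = df.index.isin(first_idx)
```

Sorting once and taking head(1) per group avoids calling idxmin on groups that contain no valid value.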
I am a newbie in Python, please assist. I have a huge DataFrame consisting of thousands of rows. An example of the df is shown below.
STATE VOLUME
INDEX
1 on 10
2 on 15
3 on 10
4 off 20
5 off 30
6 on 15
7 on 20
8 off 10
9 off 30
10 off 10
11 on 20
12 off 25
I want to index this data based on the 'STATE' column, so that the first batch of 'on' and 'off' rows gets index 1, the next batch of 'on' and 'off' rows gets index 2, and so on. I want to be able to select a whole group of data by selecting the rows with index 1.
ID VOLUME
INDEX
1 on 10
1 on 15
1 on 10
1 off 20
1 off 30
2 on 15
2 on 20
2 off 10
2 off 30
2 off 10
3 on 20
3 off 25
You can do this with pd.Series.eq and pd.Series.shift, then take the cumulative sum with pd.Series.cumsum:
df.index = (df['STATE'].eq('off').shift() & df['STATE'].eq('on')).cumsum() + 1
df.index.name = 'INDEX'
STATE VOLUME
INDEX
1 on 10
1 on 15
1 on 10
1 off 20
1 off 30
2 on 15
2 on 20
2 off 10
2 off 30
2 off 10
3 on 20
3 off 25
Details
The idea is to find where an off is followed by an on.
# (df['STATE'].eq('off').shift() & df['STATE'].eq('on')).cumsum() + 1
eq(off).shift eq(on) eq(off).shift & eq(on)
INDEX
1 NaN True False
2 False True False
3 False True False
4 False False False
5 True False False
6 True True True
7 False True False
8 False False False
9 True False False
10 True False False
11 True True True
12 False False False
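The same idea as a self-contained sketch. Here shift is called with fill_value=False, which is a small assumption-driven change from the answer: it keeps the shifted mask boolean instead of leaving a NaN in the first row. The names prev_off, starts, and batch are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "STATE": ["on", "on", "on", "off", "off", "on", "on",
              "off", "off", "off", "on", "off"],
    "VOLUME": [10, 15, 10, 20, 30, 15, 20, 10, 30, 10, 20, 25],
})

# True where the previous row was 'off' (the first row has no previous row)
prev_off = df["STATE"].eq("off").shift(fill_value=False)
# A new batch starts where an 'on' directly follows an 'off'
starts = prev_off & df["STATE"].eq("on")
batch = starts.cumsum() + 1

df.index = pd.Index(batch, name="INDEX")
# All rows of the first batch
first_batch = df.loc[1]
```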
You could try this with pd.Series.shift and pd.Series.cumsum:
df.index=((df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')).shift(fill_value=0).cumsum()+1
Same as this with np.where:
temp=pd.Series(np.where((df.STATE.shift(-1) != df.STATE)&(df.STATE.eq('off')),1,0))
df.index=temp.shift(1,fill_value=0).cumsum().astype(int).add(1)
Output:
df
STATE VOLUME
1 on 10
1 on 15
1 on 10
1 off 20
1 off 30
2 on 15
2 on 20
2 off 10
2 off 30
2 off 10
3 on 20
3 off 25
Explanation:
With (df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off'), you get a mask that marks the last row of each run of 'off':
(df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 True
11 False
12 True
Then you shift the mask down by one row, so that the marked 'off' row still belongs to its own batch, and take cumsum(), where True counts as 1 and False as 0:
((df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')).shift(fill_value=0)
1 0
2 False
3 False
4 False
5 False
6 True
7 False
8 False
9 False
10 False
11 True
12 False
((df.STATE.shift(-1) != df.STATE)&df.STATE.eq('off')).shift(fill_value=0).cumsum()
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 1
11 2
12 2
Finally, add 1 to the result to get the desired index.
Imagine we have a dataframe:
num line
0 1 56
1 1 90
2 2 66
3 3 4
4 3 55
5 3 104
6 1 23
7 5 22
8 3 144
I want to remove the rows where specifically a 3 is repeated in the num column, and keep the first. So the two rows with repeating 1's in the num column should still be in the resulting DataFrame together with all the other columns.
What I have so far, which removes every double value, not only the 3's:
data.groupby((data['num'] != data['num'].shift()).cumsum().values).first()
Expected result or correct code:
num line
0 1 56
1 1 90
2 2 66
3 3 4
4 1 23
5 5 22
6 3 144
Use:
df = data[data['num'].ne(3) | data['num'].ne(data['num'].shift())]
print (df)
num line
0 1 56
1 1 90
2 2 66
3 3 4
6 1 23
7 5 22
8 3 144
Detail:
Compare for not equal to 3:
print (data['num'].ne(3))
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 True
8 False
Name: num, dtype: bool
Compare with the shifted values to mark the first row of each consecutive run:
print (data['num'].ne(data['num'].shift()))
0 True
1 False
2 True
3 True
4 False
5 False
6 True
7 True
8 True
Name: num, dtype: bool
Chain both masks with | (bitwise OR):
print (data['num'].ne(3) | data['num'].ne(data['num'].shift()))
0 True
1 True
2 True
3 True
4 False
5 False
6 True
7 True
8 True
Name: num, dtype: bool
You could use the conditions below to perform boolean indexing on the dataframe:
# True where num is 3
c1 = df['num'].eq(3)
# True where num is repeated
c2 = df['num'].eq(df['num'].shift(1))
# boolean indexing on df
df[(c1 & ~c2) | ~(c1)]
num line
0 1 56
1 1 90
2 2 66
3 3 4
6 1 23
7 5 22
8 3 144
Details
df.assign(is_3=c1, is_repeated=c2, filtered=(c1 & ~c2) | ~(c1))
num line is_3 is_repeated filtered
0 1 56 False False True
1 1 90 False True True
2 2 66 False False True
3 3 4 True False True
4 3 55 True True False
5 3 104 True True False
6 1 23 False False True
7 5 22 False False True
8 3 144 True False True
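If this is needed for more than one value, the masks can be wrapped in a small helper. This is only a sketch; drop_consecutive_repeats is a hypothetical name, and the index is reset at the end to match the numbering in the question's expected result:

```python
import pandas as pd


def drop_consecutive_repeats(frame: pd.DataFrame, column: str, value) -> pd.DataFrame:
    """Drop rows where `value` repeats the row directly above it in `column`."""
    s = frame[column]
    # True where the row holds `value` and the previous row holds the same thing
    repeated = s.eq(value) & s.eq(s.shift())
    return frame[~repeated].reset_index(drop=True)


data = pd.DataFrame({
    "num": [1, 1, 2, 3, 3, 3, 1, 5, 3],
    "line": [56, 90, 66, 4, 55, 104, 23, 22, 144],
})
result = drop_consecutive_repeats(data, "num", 3)
```

The repeated 1s are kept because the mask is only True for rows equal to the given value.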
For this dataframe
df
basketID productCode
0 1 23
1 1 24
2 1 25
3 2 23
4 3 23
5 4 25
6 5 24
7 5 25
Gives as expected
(df['productCode']) == 23
0 True
1 False
2 False
3 True
4 True
5 False
6 False
7 False
But if I want both 23 and 1
(df['productCode']) == 23 & (df['basketID'] == 1)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
Everything is false.
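Why was the first row not matched?
You need parentheses around the comparison with 23, because & has higher precedence than ==; without them, 23 & (df['basketID'] == 1) is evaluated first: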
(df['productCode'] == 23) & (df['basketID'] == 1)
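A short sketch that contrasts the two spellings on the data from the question; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "basketID": [1, 1, 1, 2, 3, 4, 5, 5],
    "productCode": [23, 24, 25, 23, 23, 25, 24, 25],
})

# Parsed as: df["productCode"] == (23 & (df["basketID"] == 1))
unparenthesized = df["productCode"] == 23 & (df["basketID"] == 1)

# Each comparison wrapped in its own parentheses
correct = (df["productCode"] == 23) & (df["basketID"] == 1)

# Rows that satisfy both conditions
matches = df[correct]
```

Wrapping every comparison in parentheses before combining masks with & or | avoids this class of mistake.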