I would like to identify what I call "periods" of data stored in a pandas dataframe.
Let's say I have these values:
values
1 0
2 8
3 1
4 0
5 5
6 6
7 4
8 7
9 0
10 2
11 9
12 1
13 0
I would like to identify sequences of strictly positive numbers with length greater than or equal to 3. Any non-strictly-positive number ends an ongoing sequence.
This would give :
values period
1 0 None
2 8 None
3 1 None
4 0 None
5 5 1
6 6 1
7 4 1
8 7 1
9 0 None
10 2 2
11 9 2
12 1 2
13 0 None
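For reference, a minimal construction of this frame (the 1-based index is an assumption read off the display above):
import pandas as pd
df = pd.DataFrame({'values': [0, 8, 1, 0, 5, 6, 4, 7, 0, 2, 9, 1, 0]},
                  index=range(1, 14))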
Using boolean arithmetic:
N = 3
m1 = df['values'].le(0)  # True on non-positive values
m2 = df.groupby(m1.cumsum())['values'].transform('count').gt(N)  # groups with more than N rows
df['period'] = (m1 & m2).cumsum().where((~m1) & m2)
Output:
values period
1 0 NaN
2 8 NaN
3 1 NaN
4 0 NaN
5 5 1.0
6 6 1.0
7 4 1.0
8 7 1.0
9 0 NaN
10 2 2.0
11 9 2.0
12 1 2.0
13 0 NaN
intermediates:
values m1 m2 CS(m1) m1&m2 CS(m1&m2) (~m1)&m2 period
1 0 True False 1 False 0 False NaN
2 8 False False 1 False 0 False NaN
3 1 False False 1 False 0 False NaN
4 0 True True 2 True 1 False NaN
5 5 False True 2 False 1 True 1.0
6 6 False True 2 False 1 True 1.0
7 4 False True 2 False 1 True 1.0
8 7 False True 2 False 1 True 1.0
9 0 True True 3 True 2 False NaN
10 2 False True 3 False 2 True 2.0
11 9 False True 3 False 2 True 2.0
12 1 False True 3 False 2 True 2.0
13 0 True False 4 False 2 False NaN
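To see why the threshold is gt(N) rather than ge(N): each cumsum group also contains the non-positive row that opened it, so a run of k positive values produces a group of size k + 1. A quick check against the frame built above:
m1 = df['values'].le(0)
print(df.groupby(m1.cumsum())['values'].count())
# 1    3    <- 0, 8, 1
# 2    5    <- 0, 5, 6, 4, 7
# 3    4    <- 0, 2, 9, 1
# 4    1    <- 0
One caveat: if the frame started with a positive run, that first group would have no leading non-positive row, so a run of exactly 3 positives at the very top would be missed.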
You can try:
import numpy as np
sign = np.sign(df['values'])
m = sign.ne(sign.shift()).cumsum()  # label runs of consecutive same-sign values
df['period'] = (df[sign.eq(1)]              # exclude non-positive numbers
                .groupby(m)
                ['values'].filter(lambda col: len(col) >= 3)
                .groupby(m)
                .ngroup() + 1               # ngroup is 0-based, hence the +1
               )
print(df)
values period
1 0 NaN
2 8 NaN
3 1 NaN
4 0 NaN
5 5 1.0
6 6 1.0
7 4 1.0
8 7 1.0
9 0 NaN
10 2 2.0
11 9 2.0
12 1 2.0
13 0 NaN
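Note the double groupby: the first keeps only the positive runs of length >= 3 via filter, and the second renumbers the surviving runs with ngroup so that the period labels come out consecutive.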
A simple solution:
count = 0
n_groups = 0
seq_idx = [None] * len(df)
for i in range(len(df)):
    if df.iloc[i]['values'] > 0:
        count += 1
    else:
        if count >= 3:
            n_groups += 1
            seq_idx[i - count:i] = [n_groups] * count
        count = 0
if count >= 3:  # flush a qualifying run that reaches the end of the frame
    n_groups += 1
    seq_idx[len(df) - count:] = [n_groups] * count
df['period'] = seq_idx
Output:
values period
0 0 NaN
1 8 NaN
2 1 NaN
3 0 NaN
4 5 1.0
5 6 1.0
6 4 1.0
7 7 1.0
8 0 NaN
9 2 2.0
10 9 2.0
11 1 2.0
12 0 NaN
One simple approach using find_peaks to find the plateaus (runs of consecutive positive values) of at least size 3:
import numpy as np
import pandas as pd
from scipy.signal import find_peaks
df = pd.DataFrame.from_dict({'values': {0: 0, 1: 8, 2: 1, 3: 0, 4: 5, 5: 6, 6: 4, 7: 7, 8: 0, 9: 2, 10: 9, 11: 1, 12: 0}})
_, plateaus = find_peaks((df["values"] > 0).to_numpy(), plateau_size=3)
indices = np.arange(len(df["values"]))[:, None]
indices = (indices >= plateaus["left_edges"]) & (indices <= plateaus["right_edges"])
res = (indices * (np.arange(indices.shape[1]) + 1)).sum(axis=1)
df["periods"] = res
print(df)
Output:
values periods
0 0 0
1 8 0
2 1 0
3 0 0
4 5 1
5 6 1
6 4 1
7 7 1
8 0 0
9 2 2
10 9 2
11 1 2
12 0 0
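The broadcast in the middle builds an (n_rows x n_plateaus) boolean matrix; multiplying by the 1-based plateau numbers and summing per row collapses it into the period labels. The same idea with an explicit loop, for clarity (a sketch reusing plateaus from above):
res = np.zeros(len(df), dtype=int)
for k, (left, right) in enumerate(zip(plateaus["left_edges"],
                                      plateaus["right_edges"]), start=1):
    res[left:right + 1] = k
df["periods"] = res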
def function1(dd: pd.DataFrame):
    dd.loc[:, 'period'] = None
    if len(dd) >= 4:  # the leading non-positive row plus at least 3 positives
        dd.iloc[1:, 2] = dd.iloc[1:, 1]  # copy the group number into 'period'
    return dd
df1.assign(col1=df1.le(0).cumsum().sub(1)).groupby('col1').apply(function1)
Output:
values col1 period
0 0 0 None
1 8 0 None
2 1 0 None
3 0 1 None
4 5 1 1
5 6 1 1
6 4 1 1
7 7 1 1
8 0 2 None
9 2 2 2
10 9 2 2
11 1 2 2
12 0 3 None
Related
Let's say that I have a dataframe as follows:
df = pd.DataFrame({'A':[1,1,1,1,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,0,1,1]})
Then, I convert it into a boolean form:
df.eq(1)
Out[213]:
A
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 True
8 True
9 False
10 True
11 True
12 True
13 True
14 True
15 False
16 False
17 False
18 False
19 False
20 True
21 True
What I want is to count consecutive sets of True values in the column. In this example, the output would be:
df
Out[215]:
A count
0 1 5.0
1 1 2.0
2 1 5.0
3 1 2.0
4 1 NaN
5 0 NaN
6 0 NaN
7 1 NaN
8 1 NaN
9 0 NaN
10 1 NaN
11 1 NaN
12 1 NaN
13 1 NaN
14 1 NaN
15 0 NaN
16 0 NaN
17 0 NaN
18 0 NaN
19 0 NaN
20 1 NaN
21 1 NaN
I have made some progress using tools such as groupby and cumsum, but honestly I cannot figure out how to solve it. Thanks in advance.
You can use df['A'].diff().ne(0).cumsum() to generate a grouper that will group each consecutive group of zeros/ones:
# A side-by-side comparison:
>>> pd.concat([df['A'], df['A'].diff().ne(0).cumsum()], axis=1)
A A
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
5 0 2
6 0 2
7 1 3
8 1 3
9 0 4
10 1 5
11 1 5
12 1 5
13 1 5
14 1 5
15 0 6
16 0 6
17 0 6
18 0 6
19 0 6
20 1 7
21 1 7
Thus, group by that grouper, calculate sums, replace zero with NaN + dropna, and reset the index:
import numpy as np
df['count'] = (df.groupby(df['A'].diff().ne(0).cumsum()).sum()
                 .replace(0, np.nan).dropna().reset_index(drop=True))
Output:
>>> df
A count
0 1 5.0
1 1 2.0
2 1 5.0
3 1 2.0
4 1 NaN
5 0 NaN
6 0 NaN
7 1 NaN
8 1 NaN
9 0 NaN
10 1 NaN
11 1 NaN
12 1 NaN
13 1 NaN
14 1 NaN
15 0 NaN
16 0 NaN
17 0 NaN
18 0 NaN
19 0 NaN
20 1 NaN
21 1 NaN
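A sketch of an alignment-free variant of the same idea (helper names are mine; it assumes the df built above):
groups = df['A'].diff().ne(0).cumsum()
sizes = df.groupby(groups)['A'].transform('sum')      # run length for runs of ones, 0 for zeros
counts = sizes[df['A'].eq(1) & ~groups.duplicated()]  # one count per run of ones
df['count'] = counts.reset_index(drop=True).reindex(df.index)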
I propose an alternative way that makes use of the string split function.
Let's transform the Series df.A into a string and then split it where the zeros are.
df = pd.DataFrame({'A':[1,1,1,1,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,0,1,1]})
ll = ''.join(df.A.astype('str').tolist()).split('0')
The list ll looks like
print(ll)
['11111', '', '11', '11111', '', '', '', '', '11']
Now we count the length of every string and put the results in a list:
[len(item) for item in ll if len(item)>0]
This is doable if the Series is not too long.
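To complete the idea, the counts can be packed back at the top of a new column, as in the other answer (a sketch reusing ll from above):
counts = [len(item) for item in ll if len(item) > 0]  # [5, 2, 5, 2]
df['count'] = pd.Series(counts, dtype='float')        # aligns on index 0..3, NaN elsewhere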
I have reduced my data set to the last few steps. My pandas dataframe looks like this:
FAC
0 1
1 2
2 1
3 3
4 2
5 1
6 2
7 1
8 1
9 3
10 2
11 1
12 2
13 3
14 1
I also have a list pattern that I want to match:
match_list = [1, 2, 1, 1, 3]
What I am looking for is to slide a 5-item window through the dataframe column and spot the rows that match the list pattern. The final result is something that looks like this. I will be thankful for any help.
FAC Error
0 1 some val
1 2 some val
2 1 some val
3 3 some val
4 2 some val
5 1 some val
6 2 some val
7 1 0
8 1 some val
9 3 some val
10 2 some val
11 1 some val
12 2 some val
13 3 some val
14 1 some val
This can be done with rolling:
import numpy as np
match_list = np.array([1, 2, 1, 1, 3])
def match(x):
    # True only for full-length windows that equal the pattern exactly
    return len(x) == len(match_list) and (x == match_list).all()
df['error'] = np.where(df.FAC.rolling(5, center=True).apply(match) == 1, 0, 'some value')
Output:
FAC error
0 1 some value
1 2 some value
2 1 some value
3 3 some value
4 2 some value
5 1 some value
6 2 some value
7 1 0
8 1 some value
9 3 some value
10 2 some value
11 1 some value
12 2 some value
13 3 some value
14 1 some value
Update: to count the match, you can simply take the mean instead of all inside the function:
def count_match(x):
    return (len(x) == len(match_list)) * (x == match_list).mean()
df['error'] = df.FAC.rolling(5, center=True).apply(count_match)
Output:
FAC error
0 1 NaN
1 2 NaN
2 1 0.6
3 3 0.0
4 2 0.4
5 1 0.4
6 2 0.2
7 1 1.0
8 1 0.2
9 3 0.2
10 2 0.4
11 1 0.6
12 2 0.0
13 3 NaN
14 1 NaN
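For longer frames, a vectorized alternative is numpy's sliding_window_view (a sketch; it assumes numpy >= 1.20 and the match_list array from above):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
arr = df['FAC'].to_numpy()
windows = sliding_window_view(arr, len(match_list))    # shape (len(arr) - 4, 5)
hits = (windows == match_list).all(axis=1)             # True where the pattern starts
centers = np.flatnonzero(hits) + len(match_list) // 2  # window centers, as rolling(center=True) marks them
print(centers)  # [7], matching the rolling result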
I have a DF like the one below:
id_var1 id_var2 num_var1 num_var2
1 1 1 1
1 2 1 0
1 3 2 0
1 4 2 3
1 5 3 3
1 6 3 3
1 7 3 0
1 8 4 0
2 1 1 0
2 2 2 1
2 3 5 0
2 4 2 0
2 5 1 2
2 6 1 2
2 7 2 0
I want a DF with the following appearance:
id_var1 id_var2 num_var1 num_var2 row_sum
1 1 1 1 2
1 2 1 0 NaN
1 3 2 0 NaN
1 4 2 3 11
1 5 3 3 NaN
1 6 3 3 NaN
1 7 3 0 NaN
1 8 4 0 NaN
2 1 1 0 NaN
2 2 2 1 7
2 3 5 0 NaN
2 4 2 0 NaN
2 5 1 2 4
2 6 1 2 NaN
2 7 2 0 NaN
At each first num_var2 which is not 0, I want sum(num_var1) over that row plus as many rows down as num_var2 states.
Example 1: Row 4 has num_var2 = 3 --> sum(num_var1) for row 4 + 3 rows down = 11, for id_var1 = 1 and id_var2 = 4.
Example 2: Row 12 has num_var2 = 2 --> sum(num_var1) for row 12 + 2 rows down = 4, for id_var1 = 2 and id_var2 = 5.
Can someone please help me with this one? Can it be done without a slow row iteration?
Code for DF below:
df = pd.DataFrame({ 'id_var1' : [1] * 8 + [2] * 7
,'id_var2' : [i for i in range(1,9)] + [i for i in range(1,8)]
,'num_var1' : [1,1,2,2,3,3,3,4] + [1,2,5,2,1,1,2]
,'num_var2' : [1, 0,0,3,3,3,0,0] + [0,1,0,0,2,2,0]
})
Let me know if this works for you.
First create a list of the values from the num_var1 column.
Then take the sum of a sublist of num_var1, from the current index through the required number of items (taken from column num_var2).
The sublst() function is called only where the previous record's num_var2 does not match the current record's num_var2.
import pandas as pd
df = pd.DataFrame({'id_var1': [1] * 8 + [2] * 7,
                   'id_var2': [i for i in range(1, 9)] + [i for i in range(1, 8)],
                   'num_var1': [1, 1, 2, 2, 3, 3, 3, 4] + [1, 2, 5, 2, 1, 1, 2],
                   'num_var2': [1, 0, 0, 3, 3, 3, 0, 0] + [0, 1, 0, 0, 2, 2, 0]})
num_var1 = df['num_var1'].tolist()  # values used for the calculation
df['index1'] = df.index
def sublst(row):
    if row['num_var2'] > 0:
        x = num_var1[row['index1']:row['index1'] + row['num_var2'] + 1]
        return sum(x)
df['sum'] = df[df.num_var2 != df.num_var2.shift()].apply(sublst, axis=1)
print(df)
Output:
id_var1 id_var2 num_var1 num_var2 index1 sum
0 1 1 1 1 0 2.0
1 1 2 1 0 1 NaN
2 1 3 2 0 2 NaN
3 1 4 2 3 3 11.0
4 1 5 3 3 4 NaN
5 1 6 3 3 5 NaN
6 1 7 3 0 6 NaN
7 1 8 4 0 7 NaN
8 2 1 1 0 8 NaN
9 2 2 2 1 9 7.0
10 2 3 5 0 10 NaN
11 2 4 2 0 11 NaN
12 2 5 1 2 12 4.0
13 2 6 1 2 13 NaN
14 2 7 2 0 14 NaN
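A loop-free variant is also possible, since each window sum is a difference of two cumulative sums (a sketch; it assumes the default 0-based RangeIndex, and the names beyond the question's are mine):
import numpy as np
c = np.concatenate([[0], df['num_var1'].cumsum().to_numpy()])
start = df.index.to_numpy()
stop = np.minimum(start + df['num_var2'].to_numpy() + 1, len(df))
first = df['num_var2'].ne(df['num_var2'].shift()) & df['num_var2'].gt(0)
df['row_sum'] = np.where(first, c[stop] - c[start], np.nan)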
Having the following dataframe:
df = pd.DataFrame(np.ones(10).reshape(10, 1), columns=['A'])
df.loc[2, 'A'] = 0
df.loc[6, 'A'] = 0
A
0 1
1 1
2 0
3 1
4 1
5 1
6 0
7 1
8 1
9 1
I am trying to add a new column B containing the number of consecutive 1-occurrences in column A since the previous 0, written only on the last row of each run. Expected output should be like this:
A B
0 1 0
1 1 2
2 0 0
3 1 0
4 1 0
5 1 3
6 0 0
7 1 0
8 1 0
9 1 3
Any efficient vectorized way to do this?
You can use:
a = df.A.groupby((df.A != df.A.shift()).cumsum()).cumcount() + 1
print (a)
0 1
1 2
2 1
3 1
4 2
5 3
6 1
7 1
8 2
9 3
dtype: int64
b = ((~df.A.astype(bool)).shift(-1).fillna(df.A.iat[-1].astype(bool)))
print (b)
0 False
1 True
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 True
Name: A, dtype: bool
df['B'] = ( a * b )
print (df)
A B
0 1.0 0
1 1.0 2
2 0.0 0
3 1.0 0
4 1.0 0
5 1.0 3
6 0.0 0
7 1.0 0
8 1.0 0
9 1.0 3
Explanation:
#difference with shifted A
df['C'] = df.A != df.A.shift()
#cumulative sum
df['D'] = (df.A != df.A.shift()).cumsum()
#cumulative count each group
df['a'] = df.A.groupby((df.A != df.A.shift()).cumsum()).cumcount() + 1
#invert and convert to boolean
df['F'] = ~df.A.astype(bool)
#shift
df['G'] = (~df.A.astype(bool)).shift(-1)
#fill last nan
df['b'] = (~df.A.astype(bool)).shift(-1).fillna(df.A.iat[-1].astype(bool))
print (df)
A B C D a F G b
0 1.0 0 True 1 1 False False False
1 1.0 2 False 1 2 False True True
2 0.0 0 True 2 1 True False False
3 1.0 0 True 3 1 False False False
4 1.0 0 False 3 2 False False False
5 1.0 3 False 3 3 False True True
6 0.0 0 True 4 1 True False False
7 1.0 0 True 5 1 False False False
8 1.0 0 False 5 2 False False False
9 1.0 3 False 5 3 False NaN True
The last NaN is problematic, so I check the last value of column A with df.A.iat[-1] and convert it to boolean. If it is 0 the output is False (and finally 0); if it is 1 the output is True and the last value of a is used.
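If a compact version helps, the same idea fits in a few lines (a sketch; run ids via the usual compare-shift-cumsum trick, and runs.ne(runs.shift(-1)) marks the last row of each run, including the final one):
import numpy as np
runs = df.A.ne(df.A.shift()).cumsum()
df['B'] = np.where(df.A.astype(bool) & runs.ne(runs.shift(-1)),
                   df.groupby(runs).cumcount() + 1, 0)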
Let's suppose this dataframe, which I want to filter by iterating from the last index backwards until I find two consecutive 'a' == 0. Once that happens, the rest of the dataframe (including both zeros) shall be filtered out:
a
1 6.5
2 0
3 0
4 4.0
5 0
6 3.2
Desired result:
a
4 4.0
5 0
6 3.2
My initial idea was using apply for filtering, and inside the apply function using shift(1) == 0 & shift(2) == 0, but that way I could only filter each row individually, not return False for the remaining rows after the double zero is found, unless I use a global variable or something nasty like that.
Any smart way of doing this?
You could do that with sort_index with ascending=False, cumsum and dropna:
In [89]: df[(df.sort_index(ascending=False) == 0).cumsum() < 2].dropna()
Out[89]:
a
4 4.0
5 0.0
6 3.2
Step by step:
In [99]: df.sort_index(ascending=False)
Out[99]:
a
6 3.2
5 0.0
4 4.0
3 0.0
2 0.0
1 6.5
In [100]: df.sort_index(ascending=False) == 0
Out[100]:
a
6 False
5 True
4 False
3 True
2 True
1 False
In [101]: (df.sort_index(ascending=False) == 0).cumsum()
Out[101]:
a
6 0
5 1
4 1
3 2
2 3
1 3
In [103]: (df.sort_index(ascending=False) == 0).cumsum() < 2
Out[103]:
a
6 True
5 True
4 True
3 False
2 False
1 False
In [104]: df[(df.sort_index(ascending=False) == 0).cumsum() < 2]
Out[104]:
a
1 NaN
2 NaN
3 NaN
4 4.0
5 0.0
6 3.2
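Note that the dropna in the one-liner is needed because the mask is a DataFrame, so df[...] NaN-masks the excluded rows instead of dropping them, as Out[104] shows.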
EDIT
IIUC you could use something like the following, using a rolling sum and first_valid_index, if your index starts from 1 (the old pd.rolling_sum top-level function has since been removed from pandas; the .rolling(...).sum() method form is its replacement):
df_sorted = df.sort_index(ascending=False)
df[df_sorted[(df_sorted == 0).rolling(window=2).sum() == 2].first_valid_index() + 1:]
With @jezrael's example:
In [208]: df
Out[208]:
a
1 6.5
2 0.0
3 0.0
4 7.0
5 0.0
6 0.0
7 0.0
8 4.0
9 0.0
10 0.0
11 3.2
12 5.0
df_sorted = df.sort_index(ascending=False)
In [210]: df[df_sorted[(df_sorted == 0).rolling(window=2).sum() == 2].first_valid_index() + 1:]
Out[210]:
a
11 3.2
12 5.0
You can use groupby with cumcount and cumsum, then invert df and use cumsum again:
print(df)
a
1 6.5
2 0.0
3 0.0
4 7.0
5 0.0
6 0.0
7 0.0
8 4.0
9 0.0
10 0.0
11 3.2
12 5.0
print(df[df.groupby((df['a'].diff(1) != 0).astype('int').cumsum()).cumcount()[::-1].cumsum()[::-1] == 0])
a
11 3.2
12 5.0
Explanation:
print (df['a'].diff(1) != 0)
1      True
2      True
3     False
4      True
5      True
6     False
7     False
8      True
9      True
10    False
11     True
12     True
Name: a, dtype: bool
print((df['a'].diff(1) != 0).astype('int'))
1     1
2     1
3     0
4     1
5     1
6     0
7     0
8     1
9     1
10    0
11    1
12    1
Name: a, dtype: int32
print((df['a'].diff(1) != 0).astype('int').cumsum())
1     1
2     2
3     2
4     3
5     4
6     4
7     4
8     5
9     6
10    6
11    7
12    8
Name: a, dtype: int32
print(df.groupby((df['a'].diff(1) != 0).astype('int').cumsum()).cumcount())
1     0
2     0
3     1
4     0
5     0
6     1
7     2
8     0
9     0
10    1
11    0
12    0
dtype: int64
print(df.groupby((df['a'].diff(1) != 0).astype('int').cumsum()).cumcount()[::-1].cumsum()[::-1])
1     5
2     5
3     5
4     4
5     4
6     4
7     3
8     1
9     1
10    1
11    0
12    0
dtype: int64
print(df.groupby((df['a'].diff(1) != 0).astype('int').cumsum()).cumcount()[::-1].cumsum()[::-1] == 0)
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11     True
12     True
dtype: bool
Numpy's ediff1d function is useful here:
import numpy
a = df['a'].to_numpy()
inverted = a[::-1]
# first position, in the reversed array, of two consecutive zeros
index = ((numpy.ediff1d(inverted) == 0) & (inverted[1:] == 0)).argmax()
a[len(a) - index:]  # [4.0, 0.0, 3.2] for the frame in the question