Pandas: complex filtering with apply - python

Let's suppose this dataframe, which I want to filter in such a way that I iterate from the last index backwards until I find two consecutive rows with 'a' = 0. Once that happens, the rest of the dataframe (including both zeros) shall be filtered out:
a
1 6.5
2 0
3 0
4 4.0
5 0
6 3.2
Desired result:
a
4 4.0
5 0
6 3.2
My initial idea was using apply for filtering, and inside the apply function using shift(1) == 0 & shift(2) == 0, but based on that I could only filter each row individually; I could not return False for all the remaining rows once the double zero is found, unless I use a global variable or something nasty like that.
Any smart way of doing this?

You could do that with sort_index with ascending=False, cumsum and dropna:
In [89]: df[(df.sort_index(ascending=False) == 0).cumsum() < 2].dropna()
Out[89]:
a
4 4.0
5 0.0
6 3.2
Step by step:
In [99]: df.sort_index(ascending=False)
Out[99]:
a
6 3.2
5 0.0
4 4.0
3 0.0
2 0.0
1 6.5
In [100]: df.sort_index(ascending=False) == 0
Out[100]:
a
6 False
5 True
4 False
3 True
2 True
1 False
In [101]: (df.sort_index(ascending=False) == 0).cumsum()
Out[101]:
a
6 0
5 1
4 1
3 2
2 3
1 3
In [103]: (df.sort_index(ascending=False) == 0).cumsum() < 2
Out[103]:
a
6 True
5 True
4 True
3 False
2 False
1 False
In [104]: df[(df.sort_index(ascending=False) == 0).cumsum() < 2]
Out[104]:
a
1 NaN
2 NaN
3 NaN
4 4.0
5 0.0
6 3.2
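A variant of the same idea (a sketch, not from the original answer) builds the boolean mask on the column itself, so no rows are turned into NaN and dropna is not needed:
mask = (df['a'].sort_index(ascending=False) == 0).cumsum().sort_index() < 2
df[mask]   # keeps rows 4, 5 and 6, same as above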
EDIT
IIUC you could use something like the following, using pd.rolling_sum and first_valid_index, if your index starts from 1:
df_sorted = df.sort_index(ascending=False)
df[df_sorted[(pd.rolling_sum((df_sorted==0), window=2) == 2)].first_valid_index()+1:]
With @jezrael's example:
In [208]: df
Out[208]:
a
1 6.5
2 0.0
3 0.0
4 7.0
5 0.0
6 0.0
7 0.0
8 4.0
9 0.0
10 0.0
11 3.2
12 5.0
df_sorted = df.sort_index(ascending=False)
In [210]: df[df_sorted[(pd.rolling_sum((df_sorted==0), window=2) == 2)].first_valid_index()+1:]
Out[210]:
a
11 3.2
12 5.0
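Note that pd.rolling_sum was removed in later pandas versions; here is a sketch of the same idea with the .rolling() method (still assuming a consecutive integer index starting from 1 and that at least one pair of consecutive zeros exists):
df_sorted = df.sort_index(ascending=False)
hit = df_sorted[(df_sorted == 0).astype(int).rolling(window=2).sum() == 2].first_valid_index()
# hit is the smaller label of the last pair of consecutive zeros, so everything
# from hit + 2 onwards is kept
df.loc[hit + 2:]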

You can use groupby with cumcount and cumsum, then reverse the result and apply cumsum again:
print df
a
1 6.5
2 0.0
3 0.0
4 7.0
5 0.0
6 0.0
7 0.0
8 4.0
9 0.0
10 0.0
11 3.2
12 5.0
print df[df.groupby((df['a'].diff(1)!=0).astype('int').cumsum()).cumcount()[::-1].cumsum()[::-1]== 0]
a
11 3.2
12 5.0
Explanation:
print (df['a'].diff(1) != 0)
1 True
2 True
3 False
4 True
5 True
6 False
7 False
8 True
9 True
10 False
11 True
12 True
Name: a, dtype: bool
print (df['a'].diff(1) != 0).astype('int')
1 1
2 1
3 0
4 1
5 1
6 0
7 0
8 1
9 1
10 0
11 1
12 1
Name: a, dtype: int32
print (df['a'].diff(1) != 0).astype('int') .cumsum()
1 1
2 2
3 2
4 3
5 4
6 4
7 4
8 5
9 6
10 6
11 7
12 8
Name: a, dtype: int32
print df.groupby( (df['a'].diff(1) != 0).astype('int').cumsum() ).cumcount()
1 0
2 0
3 1
4 0
5 0
6 1
7 2
8 0
9 0
10 1
11 0
12 0
dtype: int64
print df.groupby( (df['a'].diff(1) != 0).astype('int').cumsum() ).cumcount()[::-1].cumsum()[::-1]
1 5
2 5
3 5
4 4
5 4
6 4
7 3
8 1
9 1
10 1
11 0
12 0
dtype: int64
print df.groupby( (df['a'].diff(1) != 0).astype('int').cumsum() ).cumcount()[::-1].cumsum()[::-1] == 0
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 True
12 True
dtype: bool
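For reference, the same pipeline in Python 3 syntax with named intermediate steps (a sketch, assuming df is the 12-row example above with the single column 'a'):
run_id = (df['a'].diff(1) != 0).astype('int').cumsum()   # label runs of equal consecutive values
pos_in_run = df.groupby(run_id).cumcount()               # 0 for the first row of each run
trailing = pos_in_run[::-1].cumsum()[::-1]               # 0 only after the last consecutive duplicate
print(df[trailing == 0])                                  # rows 11 and 12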

NumPy's ediff1d function is useful here. A corrected sketch (assuming the column is 'a' and that at least one pair of consecutive zeros exists; the extra == 0 check stops other pairs of equal neighbours from matching):
import numpy as np
a = df['a'].to_numpy()
inverted = a[::-1]
k = ((np.ediff1d(inverted) == 0) & (inverted[:-1] == 0)).argmax()  # first pair of zeros, counting from the end
df.iloc[len(a) - k:]   # rows after the pair of zeros

Related

Isolate sequence of positive numbers in a pandas dataframe

I would like to identify what I call "periods" of data stored in a pandas dataframe.
Let's say I have these values:
values
1 0
2 8
3 1
4 0
5 5
6 6
7 4
8 7
9 0
10 2
11 9
12 1
13 0
I would like to identify sequences of strictly positive numbers with a length greater than or equal to 3. Each non-strictly-positive number ends an ongoing sequence.
This would give:
values period
1 0 None
2 8 None
3 1 None
4 0 None
5 5 1
6 6 1
7 4 1
8 7 1
9 0 None
10 2 2
11 9 2
12 1 2
13 0 None
Using boolean arithmetic:
N = 3
m1 = df['values'].le(0)
m2 = df.groupby(m1.cumsum())['values'].transform('count').gt(N)
df['period'] = (m1&m2).cumsum().where((~m1)&m2)
output:
values period
1 0 NaN
2 8 NaN
3 1 NaN
4 0 NaN
5 5 1.0
6 6 1.0
7 4 1.0
8 7 1.0
9 0 NaN
10 2 2.0
11 9 2.0
12 1 2.0
13 0 NaN
intermediates:
values m1 m2 CS(m1) m1&m2 CS(m1&m2) (~m1)&m2 period
1 0 True False 1 False 0 False NaN
2 8 False False 1 False 0 False NaN
3 1 False False 1 False 0 False NaN
4 0 True True 2 True 1 False NaN
5 5 False True 2 False 1 True 1.0
6 6 False True 2 False 1 True 1.0
7 4 False True 2 False 1 True 1.0
8 7 False True 2 False 1 True 1.0
9 0 True True 3 True 2 False NaN
10 2 False True 3 False 2 True 2.0
11 9 False True 3 False 2 True 2.0
12 1 False True 3 False 2 True 2.0
13 0 True False 4 False 2 False NaN
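The same boolean logic can be wrapped in a small helper for reuse (a sketch; the name label_periods and the min_len parameter are mine, and like the original it assumes each positive run is preceded by a non-positive row, as in the example):
import pandas as pd

def label_periods(s: pd.Series, min_len: int = 3) -> pd.Series:
    stop = s.le(0)                                                   # rows that break a run
    big = s.groupby(stop.cumsum()).transform('count').gt(min_len)    # groups long enough
    return (stop & big).cumsum().where(~stop & big)

df['period'] = label_periods(df['values'])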
You can try
sign = np.sign(df['values'])
m = sign.ne(sign.shift()).cumsum() # continuous same value group
df['period'] = (df[sign.eq(1)]                  # Exclude non-positive numbers
                .groupby(m)
                ['values'].filter(lambda col: len(col) >= 3)
                .groupby(m)
                .ngroup() + 1
                )
print(df)
values period
1 0 NaN
2 8 NaN
3 1 NaN
4 0 NaN
5 5 1.0
6 6 1.0
7 4 1.0
8 7 1.0
9 0 NaN
10 2 2.0
11 9 2.0
12 1 2.0
13 0 NaN
A simple solution:
count = 0
n_groups = 0
seq_idx = [None]*len(df)
for i in range(len(df)):
    if df.iloc[i]['values'] > 0:
        count += 1
    else:
        if count >= 3:
            n_groups += 1
            seq_idx[i-count: i] = [n_groups]*count
        count = 0
df['period'] = seq_idx
Output:
values period
0 0 NaN
1 8 NaN
2 1 NaN
3 0 NaN
4 5 1.0
5 6 1.0
6 4 1.0
7 7 1.0
8 0 NaN
9 2 2.0
10 9 2.0
11 1 2.0
12 0 NaN
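One detail worth noting (my observation, not part of the answer above): if the data ends on a qualifying positive run, the loop never records it, so a final flush after the loop is needed:
# after the for-loop, before assigning df['period']
if count >= 3:
    n_groups += 1
    seq_idx[len(df) - count:] = [n_groups] * count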
One simple approach using find_peaks to find the plateaus (runs of consecutive positive values) of at least size 3:
import numpy as np
import pandas as pd
from scipy.signal import find_peaks
df = pd.DataFrame.from_dict({'values': {0: 0, 1: 8, 2: 1, 3: 0, 4: 5, 5: 6, 6: 4, 7: 7, 8: 0, 9: 2, 10: 9, 11: 1, 12: 0}})
_, plateaus = find_peaks((df["values"] > 0).to_numpy(), plateau_size=3)
indices = np.arange(len(df["values"]))[:, None]
indices = (indices >= plateaus["left_edges"]) & (indices <= plateaus["right_edges"])
res = (indices * (np.arange(indices.shape[1]) + 1)).sum(axis=1)
df["periods"] = res
print(df)
Output
values periods
0 0 0
1 8 0
2 1 0
3 0 0
4 5 1
5 6 1
6 4 1
7 7 1
8 0 0
9 2 2
10 9 2
11 1 2
12 0 0
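To match the question's expected output, where rows outside a period are None/NaN rather than 0, the zeros can be masked afterwards (my addition):
df["periods"] = df["periods"].replace(0, np.nan)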
def function1(dd:pd.DataFrame):
    dd.loc[:, 'period'] = None
    if len(dd) >= 4:
        dd.iloc[1:, 2] = dd.iloc[1:, 1]
    return dd
df1.assign(col1=df1.le(0).cumsum().sub(1)).groupby('col1').apply(function1)
out:
values col1 period
0 0 0 None
1 8 0 None
2 1 0 None
3 0 1 None
4 5 1 1
5 6 1 1
6 4 1 1
7 7 1 1
8 0 2 None
9 2 2 2
10 9 2 2
11 1 2 2
12 0 3 None

Row-wise replace operation in pandas dataframe

In the given data frame, I am trying to perform a row-wise replace operation where 1 should be replaced by the value in Values.
Input:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1,1,1,2,3,3,4,5,6,7],
                   'A': [0,1,0,1,0,0,1,0,np.nan,0],
                   'B': [0,0,0,0,1,1,0,0,0,0],
                   'C': [1,0,1,0,0,0,0,0,1,1],
                   'Values': [10, 2, 3,4,9,3,4,5,2,3]})
Expected Output:
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
Note: The data is very large.
Use df.where
df[['A','B','C']]=df[['A','B','C']].where(df[['A','B','C']].ne(1),df['Values'], axis=0)
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
Or
df[['A','B','C']]=df[['A','B','C']].mask(df[['A','B','C']].eq(1),df['Values'], axis=0)
My data is really large and it is very slow.
If we exploit the nature of your dataset (the A, B, C columns only hold 1s, 0s or NaNs), you simply have to multiply df['Values'] with each column independently. This should be super fast as it is vectorized.
df['A'] = df['A']*df['Values']
df['B'] = df['B']*df['Values']
df['C'] = df['C']*df['Values']
print(df)
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
If you want to explicitly check the condition where values of A, B, C are 1 (maybe because those columns could have values other than just 0s, 1s and NaNs), then you can use this:
df[['A','B','C']] = (df[['A','B','C']] == 1)*df[['Values']].values
This will replace the columns A, B, C in the original data, but it also replaces NaNs with 0.
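Another vectorized option (a sketch, not from the answers above) is numpy.where, which also copes with columns whose values are not restricted to 0/1/NaN:
import numpy as np

cols = ['A', 'B', 'C']
# where a cell equals 1, take the row's Values (broadcast across the three columns)
df[cols] = np.where(df[cols].eq(1), df[['Values']].values, df[cols])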

Is there a function in pandas like cumsum() but for the mean? I need to apply it based on a condition

I need to extract the cumulative mean only while my column A is different from zero. Each time it is zero, the cumulative mean should restart. Thanks so much in advance, I am not so good at using Python.
Input:
ColumnA
0 5
1 6
2 7
3 0
4 0
5 1
6 2
7 3
8 0
9 5
10 10
11 15
Expected Output:
ColumnA CumulativeMean
0 5 5.0
1 6 5.5
2 7 6.0
3 0 0.0
4 0 0.0
5 1 1.0
6 2 1.5
7 3 2.0
8 0 0.0
9 5 5.0
10 10 7.5
11 15 10.0
You can try cumsum to make the groups and then expanding + mean to compute the cumulative mean:
groups=df.ColumnA.eq(0).cumsum()
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean()).fillna(0)
Details:
Make groups where the column is equal to 0 with eq and cumsum; eq gives you a mask of True and False values, and cumsum treats them as 1 and 0:
groups=df.ColumnA.eq(0).cumsum()
groups
0 0
1 0
2 0
3 1
4 2
5 2
6 2
7 2
8 3
9 3
10 3
11 3
Name: ColumnA, dtype: int32
Then group by those groups and use apply to compute the cumulative mean over the elements different from 0:
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean())
ColumnA
0 5.0
1 5.5
2 6.0
3 NaN
4 NaN
5 1.0
6 1.5
7 2.0
8 NaN
9 5.0
10 7.5
11 10.0
And finally use fillna to fill the NaN values with 0:
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean()).fillna(0)
ColumnA
0 5.0
1 5.5
2 6.0
3 0.0
4 0.0
5 1.0
6 1.5
7 2.0
8 0.0
9 5.0
10 7.5
11 10.0
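To attach the result as the CumulativeMean column shown in the expected output, the same logic can be written on the column directly (a sketch; group_keys=False and mask are my choices, so the result keeps the original index and plain assignment works):
groups = df.ColumnA.eq(0).cumsum()
df['CumulativeMean'] = (df.groupby(groups, group_keys=False)['ColumnA']
                          .apply(lambda x: x.mask(x.eq(0)).expanding().mean())
                          .fillna(0))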
You can use boolean indexing to compare rows that are ==0 and !=0 against the previous rows with .shift(). Then, just take the .cumsum() to separate the data into groups, according to where the zeros are within ColumnA.
df['CumulativeMean'] = (df.groupby((((df.shift()['ColumnA'] != 0) & (df['ColumnA'] == 0)) |
                                    (df.shift()['ColumnA'] == 0) & (df['ColumnA'] != 0))
                                   .cumsum())['ColumnA'].apply(lambda x: x.expanding().mean()))
Out[6]:
ColumnA CumulativeMean
0 5 5.0
1 6 5.5
2 7 6.0
3 0 0.0
4 0 0.0
5 1 1.0
6 2 1.5
7 3 2.0
8 0 0.0
9 5 5.0
10 10 7.5
11 15 10.0
I've broken the logic of the boolean indexing within the .groupby statement down into multiple columns that build up to the final result in the column abcd_cumsum. From there, ['ColumnA'].apply(lambda x: x.expanding().mean()) takes the mean of the group up to any given row in that group. For example, the second row (index 1) takes the grouped mean of the first and second rows, but excludes the third row.
df['a'] = (df.shift()['ColumnA'] != 0)
df['b'] = (df['ColumnA'] == 0)
df['ab'] = (df['a'] & df['b'])
df['c'] = (df.shift()['ColumnA'] == 0)
df['d'] = (df['ColumnA'] != 0)
df['cd'] = (df['c'] & df['d'])
df['abcd'] = (df['ab'] | df['cd'])
df['abcd_cumsum'] = (df['ab'] | df['cd']).cumsum()
df['CumulativeMean'] = (df.groupby(df['abcd_cumsum'])['ColumnA'].apply(lambda x: x.expanding().mean()))
Out[7]:
ColumnA a b ab c d cd abcd abcd_cumsum \
0 5 True False False False True False False 0
1 6 True False False False True False False 0
2 7 True False False False True False False 0
3 0 True True True False False False True 1
4 0 False True False True False False False 1
5 1 False False False True True True True 2
6 2 True False False False True False False 2
7 3 True False False False True False False 2
8 0 True True True False False False True 3
9 5 False False False True True True True 4
10 10 True False False False True False False 4
11 15 True False False False True False False 4
CumulativeMean
0 5.0
1 5.5
2 6.0
3 0.0
4 0.0
5 1.0
6 1.5
7 2.0
8 0.0
9 5.0
10 7.5
11 10.0

How to groupby elements in pandas based on consecutive row values

I have a dataframe as below:
distance_along_path
0 0
1 2.2
2 4.5
3 7.0
4 0
5 3.0
6 5.0
7 0
8 2.0
9 5.0
10 7.0
I want to be able to group these by the distance_along_path values: every time a 0 is seen a new group is created, and until the next 0 all these rows fall under one group, as indicated below.
distance_along_path group
0 0 A
1 2.2 A
2 4.5 A
3 7.0 A
4 0 B
5 3.0 B
6 5.0 B
7 0 C
8 2.0 C
9 5.0 C
10 7.0 C
Thank you
You can try eq followed by cumsum:
df["group"] = df.distance_along_path.eq(0).cumsum()
Explanation:
Use eq to find values equal to 0
Use cumsum to apply a cumulative count of the True values
Code + Illustration
# Step 1
print(df.distance_along_path.eq(0))
# 0 True
# 1 False
# 2 False
# 3 False
# 4 True
# 5 False
# 6 False
# 7 True
# 8 False
# 9 False
# 10 False
# Name: distance_along_path, dtype: bool
# Step 2
print(df.assign(group=df.distance_along_path.eq(0).cumsum()))
# distance_along_path group
# 0 0.0 1
# 1 2.2 1
# 2 4.5 1
# 3 7.0 1
# 4 0.0 2
# 5 3.0 2
# 6 5.0 2
# 7 0.0 3
# 8 2.0 3
# 9 5.0 3
# 10 7.0 3
Note: as you can see, the group column holds numbers and not letters, but that doesn't matter if it's used in a groupby.
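If letter labels are really wanted, the group numbers can be mapped afterwards (a sketch assuming fewer than 27 groups):
import string
df["group"] = df["group"].map(lambda g: string.ascii_uppercase[g - 1])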

Compute distance between rows in pandas DataFrame

I have a pandas DataFrame filled with zeros except for some 1.0 values. For each row, I want to compute the distance to the next occurrence of 1.0. Any idea how to do it?
input dataframe:
index col1
0 0.0
1 0.0
2 0.0
3 0.0
4 1.0
5 0.0
6 0.0
7 1.0
8 0.0
Expected output dataframe:
index col1
0 4.0
1 3.0
2 2.0
3 1.0
4 0.0
5 2.0
6 1.0
7 0.0
8 0.0
Use:
df['new'] = df.groupby(df['col1'].eq(1).iloc[::-1].cumsum()).cumcount(ascending=False)
print (df)
col1 new
0 0.0 4
1 0.0 3
2 0.0 2
3 0.0 1
4 1.0 0
5 0.0 2
6 0.0 1
7 1.0 0
8 0.0 0
Explanation:
First compare by 1 with Series.eq:
print (df['col1'].eq(1))
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 True
8 False
Name: col1, dtype: bool
Then swap ordering by Series.iloc:
print (df['col1'].eq(1).iloc[::-1])
8 False
7 True
6 False
5 False
4 True
3 False
2 False
1 False
0 False
Name: col1, dtype: bool
Create groups by Series.cumsum:
print (df['col1'].eq(1).iloc[::-1].cumsum())
8 0
7 1
6 1
5 1
4 2
3 2
2 2
1 2
0 2
Name: col1, dtype: int32
Pass groups to GroupBy.cumcount with ascending=False for count from back:
print (df.groupby(df['col1'].eq(1).iloc[::-1].cumsum()).cumcount(ascending=False))
0 4
1 3
2 2
3 1
4 0
5 2
6 1
7 0
8 0
dtype: int64
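One caveat (my observation, not from the answer): rows after the last 1 simply count down to the end of the frame, since there is no next occurrence there. If those rows should be empty instead, they can be masked afterwards:
has_next = df['col1'].eq(1).iloc[::-1].cummax().iloc[::-1]
df['new'] = df['new'].where(has_next)   # NaN where no 1.0 follows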
