Compute distance between rows in pandas DataFrame - python

I have a pandas DataFrame filled with zeros except for some 1.0 values. For each row, I want to compute the distance to the next occurrence of 1.0. Any idea how to do it?
input dataframe:
index col1
0 0.0
1 0.0
2 0.0
3 0.0
4 1.0
5 0.0
6 0.0
7 1.0
8 0.0
Expected output dataframe:
index col1
0 4.0
1 3.0
2 2.0
3 1.0
4 0.0
5 2.0
6 1.0
7 0.0
8 0.0

Use:
df['new'] = df.groupby(df['col1'].eq(1).iloc[::-1].cumsum()).cumcount(ascending=False)
print (df)
col1 new
0 0.0 4
1 0.0 3
2 0.0 2
3 0.0 1
4 1.0 0
5 0.0 2
6 0.0 1
7 1.0 0
8 0.0 0
Explanation:
First compare with 1 using Series.eq:
print (df['col1'].eq(1))
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 True
8 False
Name: col1, dtype: bool
Then reverse the ordering with Series.iloc:
print (df['col1'].eq(1).iloc[::-1])
8 False
7 True
6 False
5 False
4 True
3 False
2 False
1 False
0 False
Name: col1, dtype: bool
Create groups with Series.cumsum:
print (df['col1'].eq(1).iloc[::-1].cumsum())
8 0
7 1
6 1
5 1
4 2
3 2
2 2
1 2
0 2
Name: col1, dtype: int32
Pass the groups to GroupBy.cumcount with ascending=False to count from the back:
print (df.groupby(df['col1'].eq(1).iloc[::-1].cumsum()).cumcount(ascending=False))
0 4
1 3
2 2
3 1
4 0
5 2
6 1
7 0
8 0
dtype: int64
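Put together, the whole chain can be reproduced end to end on the sample data:

```python
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({'col1': [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]})

# Counting the 1.0s from the bottom up splits the rows into groups that each
# end at a 1.0; cumcount(ascending=False) then numbers each group from the back
df['new'] = df.groupby(df['col1'].eq(1).iloc[::-1].cumsum()).cumcount(ascending=False)
print(df['new'].tolist())  # [4, 3, 2, 1, 0, 2, 1, 0, 0]
```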

Related

Row-wise replace operation in pandas dataframe

In the given data frame, I am trying to perform a row-wise replace operation where 1 should be replaced by the value in Values.
Input:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID': [1,1,1,2,3,3,4,5,6,7],
                   'A': [0,1,0,1,0,0,1,0,np.nan,0],
                   'B': [0,0,0,0,1,1,0,0,0,0],
                   'C': [1,0,1,0,0,0,0,0,1,1],
                   'Values': [10, 2, 3, 4, 9, 3, 4, 5, 2, 3]})
Expected Output:
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
Note: the data is very large.
Use df.where
df[['A','B','C']]=df[['A','B','C']].where(df[['A','B','C']].ne(1),df['Values'], axis=0)
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
Or
df[['A','B','C']]=df[['A','B','C']].mask(df[['A','B','C']].eq(1),df['Values'], axis=0)
My data is really large and it is very slow.
If we exploit the nature of your dataset (the A, B, C columns hold only 1s, 0s or NaNs), you can simply multiply df['Values'] by each column independently. This should be very fast, as it is vectorized.
df['A'] = df['A']*df['Values']
df['B'] = df['B']*df['Values']
df['C'] = df['C']*df['Values']
print(df)
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
If you want to explicitly check the condition that the values of A, B, C are 1 (perhaps because those columns could hold values other than NaNs or 0s), you can use this -
df[['A','B','C']] = (df[['A','B','C']] == 1)*df[['Values']].values
This replaces the columns A, B, C in the original data, but it also replaces NaNs with 0.
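As a side note, the three per-column multiplications can also be collapsed into a single row-wise .mul call; this is just a sketch of the same multiplication idea, not a change in logic:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 3, 3, 4, 5, 6, 7],
                   'A': [0, 1, 0, 1, 0, 0, 1, 0, np.nan, 0],
                   'B': [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
                   'C': [1, 0, 1, 0, 0, 0, 0, 0, 1, 1],
                   'Values': [10, 2, 3, 4, 9, 3, 4, 5, 2, 3]})

# Multiply all three indicator columns by Values at once, broadcasting per row
df[['A', 'B', 'C']] = df[['A', 'B', 'C']].mul(df['Values'], axis=0)
```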

Is there a function in pandas like cumsum() but for the mean? I need to apply it based on a condition

I need to extract the cumulative mean only while my column A is different from zero. Each time it is zero, the cumulative mean should restart. Thanks so much in advance; I am not so good with Python.
Input:
ColumnA
0 5
1 6
2 7
3 0
4 0
5 1
6 2
7 3
8 0
9 5
10 10
11 15
Expected Output:
ColumnA CumulativeMean
0 5 5.0
1 6 5.5
2 7 6.0
3 0 0.0
4 0 0.0
5 1 1.0
6 2 1.5
7 3 2.0
8 0 0.0
9 5 5.0
10 10 7.5
11 15 10.0
You can try with cumsum to make the groups and then expanding + mean to compute the cumulative mean:
groups=df.ColumnA.eq(0).cumsum()
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean()).fillna(0)
Details:
Make groups where the column is equal to 0 with eq and cumsum; eq gives you a mask of True and False values, and cumsum treats these as 1 or 0:
groups=df.ColumnA.eq(0).cumsum()
groups
0 0
1 0
2 0
3 1
4 2
5 2
6 2
7 2
8 3
9 3
10 3
11 3
Name: ColumnA, dtype: int32
Then group by that groups and use apply to do the cumulative mean over elements different to 0:
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean())
ColumnA
0 5.0
1 5.5
2 6.0
3 NaN
4 NaN
5 1.0
6 1.5
7 2.0
8 NaN
9 5.0
10 7.5
11 10.0
And finally use fillna to fill with 0 the nan values:
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean()).fillna(0)
ColumnA
0 5.0
1 5.5
2 6.0
3 0.0
4 0.0
5 1.0
6 1.5
7 2.0
8 0.0
9 5.0
10 7.5
11 10.0
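If the apply turns out to be slow on a large frame, the same result can be computed from plain cumulative sums; a sketch of that idea (the ratio of the running sum to the running count of non-zeros within each group):

```python
import pandas as pd

df = pd.DataFrame({'ColumnA': [5, 6, 7, 0, 0, 1, 2, 3, 0, 5, 10, 15]})

grp = df['ColumnA'].eq(0).cumsum()                 # new group at every zero
csum = df.groupby(grp)['ColumnA'].cumsum()         # running sum within the group
cnt = df['ColumnA'].ne(0).groupby(grp).cumsum()    # running count of non-zeros
# The ratio is the cumulative mean; zero rows are forced back to 0
df['CumulativeMean'] = (csum / cnt).where(df['ColumnA'].ne(0), 0)
```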
You can use boolean indexing to compare rows that are == 0 and != 0 against the previous rows with .shift(). Then, just take the .cumsum() to separate into groups, according to where the zeros are within ColumnA.
df['CumulativeMean'] = (df.groupby((((df.shift()['ColumnA'] != 0) & (df['ColumnA'] == 0)) |
(df.shift()['ColumnA'] == 0) & (df['ColumnA'] != 0))
.cumsum())['ColumnA'].apply(lambda x: x.expanding().mean()))
Out[6]:
ColumnA CumulativeMean
0 5 5.0
1 6 5.5
2 7 6.0
3 0 0.0
4 0 0.0
5 1 1.0
6 2 1.5
7 3 2.0
8 0 0.0
9 5 5.0
10 10 7.5
11 15 10.0
I've broken the logic of the boolean indexing inside the .groupby statement down into multiple columns that build up to the final result in the column abcd_cumsum. From there, ['ColumnA'].apply(lambda x: x.expanding().mean()) takes the mean of the group up to any given row in that group. For example, the second row (index 1) takes the grouped mean of the first and second rows, but excludes the third row.
df['a'] = (df.shift()['ColumnA'] != 0)
df['b'] = (df['ColumnA'] == 0)
df['ab'] = (df['a'] & df['b'])
df['c'] = (df.shift()['ColumnA'] == 0)
df['d'] = (df['ColumnA'] != 0)
df['cd'] = (df['c'] & df['d'])
df['abcd'] = (df['ab'] | df['cd'])
df['abcd_cumsum'] = (df['ab'] | df['cd']).cumsum()
df['CumulativeMean'] = (df.groupby(df['abcd_cumsum'])['ColumnA'].apply(lambda x: x.expanding().mean()))
Out[7]:
ColumnA a b ab c d cd abcd abcd_cumsum \
0 5 True False False False True False False 0
1 6 True False False False True False False 0
2 7 True False False False True False False 0
3 0 True True True False False False True 1
4 0 False True False True False False False 1
5 1 False False False True True True True 2
6 2 True False False False True False False 2
7 3 True False False False True False False 2
8 0 True True True False False False True 3
9 5 False False False True True True True 4
10 10 True False False False True False False 4
11 15 True False False False True False False 4
CumulativeMean
0 5.0
1 5.5
2 6.0
3 0.0
4 0.0
5 1.0
6 1.5
7 2.0
8 0.0
9 5.0
10 7.5
11 10.0

How do groupby elements in pandas based on consecutive row values

I have a dataframe as below :
distance_along_path
0 0
1 2.2
2 4.5
3 7.0
4 0
5 3.0
6 5.0
7 0
8 2.0
9 5.0
10 7.0
I want to be able to group these by the distance_along_path values: every time a 0 is seen a new group is created, and all rows until the next 0 fall under one group, as indicated below
distance_along_path group
0 0 A
1 2.2 A
2 4.5 A
3 7.0 A
4 0 B
5 3.0 B
6 5.0 B
7 0 C
8 2.0 C
9 5.0 C
10 7.0 C
Thank you
You can try eq followed by cumsum:
df["group"] = df.distance_along_path.eq(0).cumsum()
Explanation:
Use eq to find the values equal to 0
Use cumsum to apply a cumulative count on the True values
Code + Illustration
# Step 1
print(df.distance_along_path.eq(0))
# 0 True
# 1 False
# 2 False
# 3 False
# 4 True
# 5 False
# 6 False
# 7 True
# 8 False
# 9 False
# 10 False
# Name: distance_along_path, dtype: bool
# Step 2
print(df.assign(group=df.distance_along_path.eq(0).cumsum()))
# distance_along_path group
# 0 0.0 1
# 1 2.2 1
# 2 4.5 1
# 3 7.0 1
# 4 0.0 2
# 5 3.0 2
# 6 5.0 2
# 7 0.0 3
# 8 2.0 3
# 9 5.0 3
# 10 7.0 3
Note: as you can see, the group column holds a number rather than a letter, but that doesn't matter if it is only used in a groupby.
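If actual letters are wanted in the group column, the group numbers can be mapped through string.ascii_uppercase; a sketch, assuming there are at most 26 groups:

```python
import string

import pandas as pd

df = pd.DataFrame({'distance_along_path': [0, 2.2, 4.5, 7.0, 0, 3.0, 5.0,
                                           0, 2.0, 5.0, 7.0]})

# Group numbers start at 1, so shift down by one to index into 'ABC...'
group_num = df['distance_along_path'].eq(0).cumsum()
df['group'] = group_num.map(lambda n: string.ascii_uppercase[n - 1])
print(df['group'].tolist())  # ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C']
```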

Fast way to get the number of NaNs in a column counted from the last valid value in a DataFrame

Say I have a DataFrame like
A B
0 0.1880 0.345
1 0.2510 0.585
2 NaN NaN
3 NaN NaN
4 NaN 1.150
5 0.2300 1.210
6 0.1670 1.290
7 0.0835 1.400
8 0.0418 NaN
9 0.0209 NaN
10 NaN NaN
11 NaN NaN
12 NaN NaN
I want a new DataFrame of the same shape where each entry represents the number of NaNs counted up to its position, starting from the last valid value, as follows
A B
0 0 0
1 0 0
2 1 1
3 2 2
4 3 0
5 0 0
6 0 0
7 0 0
8 0 1
9 0 2
10 1 3
11 2 4
12 3 5
I wonder if this can be done efficiently by utilizing some of the pandas/NumPy functions?
You can use:
a = df.isnull()
b = a.cumsum()
df1 = b.sub(b.mask(a).ffill().fillna(0).astype(int))
print (df1)
A B
0 0 0
1 0 0
2 1 1
3 2 2
4 3 0
5 0 0
6 0 0
7 0 0
8 0 1
9 0 2
10 1 3
11 2 4
12 3 5
For better understanding:
#add NaN where True in a
a2 = b.mask(a)
#forward filling NaN
a3 = b.mask(a).ffill()
#replace NaN to 0, cast to int
a4 = b.mask(a).ffill().fillna(0).astype(int)
#subtract a4 from b
a5 = b.sub(b.mask(a).ffill().fillna(0).astype(int))
df1 = pd.concat([a,b,a2, a3, a4, a5], axis=1,
keys=['a','b','where','ffill nan','subtract','output'])
print (df1)
a b where ffill nan subtract output
A B A B A B A B A B A B
0 False False 0 0 0.0 0.0 0.0 0.0 0 0 0 0
1 False False 0 0 0.0 0.0 0.0 0.0 0 0 0 0
2 True True 1 1 NaN NaN 0.0 0.0 0 0 1 1
3 True True 2 2 NaN NaN 0.0 0.0 0 0 2 2
4 True False 3 2 NaN 2.0 0.0 2.0 0 2 3 0
5 False False 3 2 3.0 2.0 3.0 2.0 3 2 0 0
6 False False 3 2 3.0 2.0 3.0 2.0 3 2 0 0
7 False False 3 2 3.0 2.0 3.0 2.0 3 2 0 0
8 False True 3 3 3.0 NaN 3.0 2.0 3 2 0 1
9 False True 3 4 3.0 NaN 3.0 2.0 3 2 0 2
10 True True 4 5 NaN NaN 3.0 2.0 3 2 1 3
11 True True 5 6 NaN NaN 3.0 2.0 3 2 2 4
12 True True 6 7 NaN NaN 3.0 2.0 3 2 3 5
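For completeness, the whole chain can be reproduced on the sample frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0.1880, 0.2510, np.nan, np.nan, np.nan, 0.2300, 0.1670,
                         0.0835, 0.0418, 0.0209, np.nan, np.nan, np.nan],
                   'B': [0.345, 0.585, np.nan, np.nan, 1.150, 1.210, 1.290,
                         1.400, np.nan, np.nan, np.nan, np.nan, np.nan]})

a = df.isnull()     # True where NaN
b = a.cumsum()      # running total of NaNs per column
# Freeze the running total at each valid value; subtracting it restarts the count
df1 = b.sub(b.mask(a).ffill().fillna(0).astype(int))
```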

Pandas: complex filtering with apply

Let's suppose this dataframe, which I want to filter in such a way that I iterate from the last index backwards until I find two consecutive 'a' == 0. Once that happens, the rest of the dataframe (including both zeros) shall be filtered out:
a
1 6.5
2 0
3 0
4 4.0
5 0
6 3.2
Desired result:
a
4 4.0
5 0
6 3.2
My initial idea was using apply for the filtering, and inside the apply function using shift(1) == 0 & shift(2) == 0. But based on that I could only filter each row individually, not return False for the remaining rows after the double zero is found, unless I used a global variable or something nasty like that.
Any smart way of doing this?
You could do that with sort_index with ascending=False, cumsum and dropna:
In [89]: df[(df.sort_index(ascending=False) == 0).cumsum() < 2].dropna()
Out[89]:
a
4 4.0
5 0.0
6 3.2
Step by step:
In [99]: df.sort_index(ascending=False)
Out[99]:
a
6 3.2
5 0.0
4 4.0
3 0.0
2 0.0
1 6.5
In [100]: df.sort_index(ascending=False) == 0
Out[100]:
a
6 False
5 True
4 False
3 True
2 True
1 False
In [101]: (df.sort_index(ascending=False) == 0).cumsum()
Out[101]:
a
6 0
5 1
4 1
3 2
2 3
1 3
In [103]: (df.sort_index(ascending=False) == 0).cumsum() < 2
Out[103]:
a
6 True
5 True
4 True
3 False
2 False
1 False
In [104]: df[(df.sort_index(ascending=False) == 0).cumsum() < 2]
Out[104]:
a
1 NaN
2 NaN
3 NaN
4 4.0
5 0.0
6 3.2
EDIT
IIUC you could use something like the following, using pd.rolling_sum and first_valid_index, if your index starts from 1:
df_sorted = df.sort_index(ascending=False)
df[df_sorted[(pd.rolling_sum((df_sorted==0), window=2) == 2)].first_valid_index()+1:]
With the #jezrael example:
In [208]: df
Out[208]:
a
1 6.5
2 0.0
3 0.0
4 7.0
5 0.0
6 0.0
7 0.0
8 4.0
9 0.0
10 0.0
11 3.2
12 5.0
df_sorted = df.sort_index(ascending=False)
In [210]: df[df_sorted[(pd.rolling_sum((df_sorted==0), window=2) == 2)].first_valid_index()+1:]
Out[210]:
a
11 3.2
12 5.0
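Note that pd.rolling_sum was removed from pandas long ago; in current pandas the same idea would be written with the .rolling accessor. A sketch, assuming a consecutive integer index starting at 1 as in the example:

```python
import pandas as pd

df = pd.DataFrame({'a': [6.5, 0.0, 0.0, 7.0, 0.0, 0.0, 0.0, 4.0,
                         0.0, 0.0, 3.2, 5.0]}, index=range(1, 13))

df_sorted = df.sort_index(ascending=False)
# In reversed order, a rolling sum of 2 over the zero mask marks a pair of
# consecutive zeros; first_valid_index finds the pair closest to the end
hit = (df_sorted == 0).astype(int).rolling(window=2).sum().eq(2)
pair_start = df_sorted[hit].first_valid_index()  # earlier zero of that pair
out = df.loc[pair_start + 2:]                    # keep everything after the pair
print(out['a'].tolist())  # [3.2, 5.0]
```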
You can use groupby with cumcount and cumsum, then reverse the result and use cumsum again:
print (df)
a
1 6.5
2 0.0
3 0.0
4 7.0
5 0.0
6 0.0
7 0.0
8 4.0
9 0.0
10 0.0
11 3.2
12 5.0
print (df[df.groupby((df['a'].diff(1)!=0).astype('int').cumsum()).cumcount()[::-1].cumsum()[::-1] == 0])
a
11 3.2
12 5.0
Explanation:
print (df['a'].diff(1) != 0)
1 True
2 True
3 False
4 True
5 True
6 False
7 False
8 True
9 True
10 False
11 True
12 True
Name: a, dtype: bool
print ((df['a'].diff(1) != 0).astype('int'))
1 1
2 1
3 0
4 1
5 1
6 0
7 0
8 1
9 1
10 0
11 1
12 1
Name: a, dtype: int32
print ((df['a'].diff(1) != 0).astype('int').cumsum())
1 1
2 2
3 2
4 3
5 4
6 4
7 4
8 5
9 6
10 6
11 7
12 8
Name: a, dtype: int32
print (df.groupby((df['a'].diff(1) != 0).astype('int').cumsum()).cumcount())
1 0
2 0
3 1
4 0
5 0
6 1
7 2
8 0
9 0
10 1
11 0
12 0
dtype: int64
print (df.groupby((df['a'].diff(1) != 0).astype('int').cumsum()).cumcount()[::-1].cumsum()[::-1])
1 5
2 5
3 5
4 4
5 4
6 4
7 3
8 1
9 1
10 1
11 0
12 0
dtype: int64
print (df.groupby((df['a'].diff(1) != 0).astype('int').cumsum()).cumcount()[::-1].cumsum()[::-1] == 0)
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 True
12 True
dtype: bool
NumPy's ediff1d function is useful here. Note that a zero diff only means two consecutive equal values, so the zero check must be explicit, and the position found in the reversed array has to be mapped back to the original order:
a = df['a'].to_numpy()
inverted = a[::-1]
# position of the first pair of consecutive zeros, scanning from the end
pair = (numpy.ediff1d(inverted) == 0) & (inverted[:-1] == 0)
df.iloc[len(a) - pair.argmax():] if pair.any() else df
