How to group by elements in pandas based on consecutive row values - python

I have a dataframe as below :
distance_along_path
0 0
1 2.2
2 4.5
3 7.0
4 0
5 3.0
6 5.0
7 0
8 2.0
9 5.0
10 7.0
I want to be able to group these by the distance_along_path values: every time a 0 is seen, a new group is created, and all rows up to the next 0 belong to that group, as indicated below:
distance_along_path group
0 0 A
1 2.2 A
2 4.5 A
3 7.0 A
4 0 B
5 3.0 B
6 5.0 B
7 0 C
8 2.0 C
9 5.0 C
10 7.0 C
Thank you

You can try eq followed by cumsum:
df["group"] = df.distance_along_path.eq(0).cumsum()
Explanation:
Use eq to find the values equal to 0
Use cumsum to take a cumulative sum over the True values, so the counter increases by one at each 0
Code + Illustration
# Step 1
print(df.distance_along_path.eq(0))
# 0 True
# 1 False
# 2 False
# 3 False
# 4 True
# 5 False
# 6 False
# 7 True
# 8 False
# 9 False
# 10 False
# Name: distance_along_path, dtype: bool
# Step 2
print(df.assign(group=df.distance_along_path.eq(0).cumsum()))
# distance_along_path group
# 0 0.0 1
# 1 2.2 1
# 2 4.5 1
# 3 7.0 1
# 4 0.0 2
# 5 3.0 2
# 6 5.0 2
# 7 0.0 3
# 8 2.0 3
# 9 5.0 3
# 10 7.0 3
Note: as you can see, the group column holds numbers rather than letters, but that doesn't matter if it's used in a groupby.
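If the letter labels from the question are wanted, the numeric group can be mapped to letters afterwards. A minimal sketch, assuming the first row is a 0 (so numbering starts at 1) and that there are at most 26 groups; the name group_number is only illustrative:

```python
import string

import pandas as pd

df = pd.DataFrame(
    {"distance_along_path": [0, 2.2, 4.5, 7.0, 0, 3.0, 5.0, 0, 2.0, 5.0, 7.0]}
)

# 1-based group counter: increases by one at every 0
group_number = df["distance_along_path"].eq(0).cumsum()

# Map 1 -> "A", 2 -> "B", ... (assumption: no more than 26 groups)
df["group"] = group_number.map(lambda n: string.ascii_uppercase[n - 1])

print(df)
```

With more than 26 groups the lookup would raise an IndexError, so the numeric column is the safer key for a groupby.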

Related

Is there a function in pandas like cumsum() but for the mean? I need to apply it based on a condition

I need to extract the cumulative mean only while my column A is different from zero. Each time it is zero, the cumulative mean should restart. Thanks so much in advance; I am not very experienced with Python.
Input:
ColumnA
0 5
1 6
2 7
3 0
4 0
5 1
6 2
7 3
8 0
9 5
10 10
11 15
Expected Output:
ColumnA CumulativeMean
0 5 5.0
1 6 5.5
2 7 6.0
3 0 0.0
4 0 0.0
5 1 1.0
6 2 1.5
7 3 2.0
8 0 0.0
9 5 5.0
10 10 7.5
11 15 10.0
You can try cumsum to make the groups and then expanding + mean to compute the cumulative mean:
groups=df.ColumnA.eq(0).cumsum()
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean()).fillna(0)
Details:
Make groups when column is equal to 0 with eq and cumsum, since eq gives you a mask with True and False values, and with cumsum these values are taken as 1 or 0:
groups=df.ColumnA.eq(0).cumsum()
groups
0 0
1 0
2 0
3 1
4 2
5 2
6 2
7 2
8 3
9 3
10 3
11 3
Name: ColumnA, dtype: int32
Then group by those groups and use apply to compute the cumulative mean over the elements different from 0:
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean())
ColumnA
0 5.0
1 5.5
2 6.0
3 NaN
4 NaN
5 1.0
6 1.5
7 2.0
8 NaN
9 5.0
10 7.5
11 10.0
And finally use fillna to fill the NaN values with 0:
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean()).fillna(0)
ColumnA
0 5.0
1 5.5
2 6.0
3 0.0
4 0.0
5 1.0
6 1.5
7 2.0
8 0.0
9 5.0
10 7.5
11 10.0
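The same restart-at-zero mean can also be sketched without apply, by dividing a per-group running sum by a per-group running count of non-zero rows. This is only an illustrative variant of the idea above, not the answer's own code; the names running_sum and running_count are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({"ColumnA": [5, 6, 7, 0, 0, 1, 2, 3, 0, 5, 10, 15]})

# A new group starts at every 0
groups = df["ColumnA"].eq(0).cumsum()

# Running sum of the values and running count of non-zero rows, per group
running_sum = df["ColumnA"].groupby(groups).cumsum()
running_count = df["ColumnA"].ne(0).astype(int).groupby(groups).cumsum()

# On a zero row both are 0, so 0/0 gives NaN, which is then replaced by 0
df["CumulativeMean"] = (running_sum / running_count).fillna(0.0)

print(df)
```

Because every group contains at most one 0 (its first row), the division is only undefined on that row.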
You can use boolean indexing to compare rows that are ==0 and !=0 against the previous rows with .shift(). Then, just take the .cumsum() to separate the rows into groups, according to where the zeros are within ColumnA.
df['CumulativeMean'] = (df.groupby((((df.shift()['ColumnA'] != 0) & (df['ColumnA'] == 0)) |
(df.shift()['ColumnA'] == 0) & (df['ColumnA'] != 0))
.cumsum())['ColumnA'].apply(lambda x: x.expanding().mean()))
Out[6]:
ColumnA CumulativeMean
0 5 5.0
1 6 5.5
2 7 6.0
3 0 0.0
4 0 0.0
5 1 1.0
6 2 1.5
7 3 2.0
8 0 0.0
9 5 5.0
10 10 7.5
11 15 10.0
I have broken the logic of the boolean indexing within the .groupby statement down into multiple columns that build up to the final column abcd_cumsum. From there, ['ColumnA'].apply(lambda x: x.expanding().mean()) takes the mean of the group up to any given row in that group. For example, the second row (index 1) takes the grouped mean of the first and second rows, but excludes the third row.
df['a'] = (df.shift()['ColumnA'] != 0)
df['b'] = (df['ColumnA'] == 0)
df['ab'] = (df['a'] & df['b'])
df['c'] = (df.shift()['ColumnA'] == 0)
df['d'] = (df['ColumnA'] != 0)
df['cd'] = (df['c'] & df['d'])
df['abcd'] = (df['ab'] | df['cd'])
df['abcd_cumsum'] = (df['ab'] | df['cd']).cumsum()
df['CumulativeMean'] = (df.groupby(df['abcd_cumsum'])['ColumnA'].apply(lambda x: x.expanding().mean()))
Out[7]:
ColumnA a b ab c d cd abcd abcd_cumsum \
0 5 True False False False True False False 0
1 6 True False False False True False False 0
2 7 True False False False True False False 0
3 0 True True True False False False True 1
4 0 False True False True False False False 1
5 1 False False False True True True True 2
6 2 True False False False True False False 2
7 3 True False False False True False False 2
8 0 True True True False False False True 3
9 5 False False False True True True True 4
10 10 True False False False True False False 4
11 15 True False False False True False False 4
CumulativeMean
0 5.0
1 5.5
2 6.0
3 0.0
4 0.0
5 1.0
6 1.5
7 2.0
8 0.0
9 5.0
10 7.5
11 10.0
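The boundary test built from columns a, b, c and d amounts to "the zero/non-zero state changed compared with the previous row". A compact sketch of that idea, assuming the same ColumnA data; is_zero and changed are illustrative names, and transform is used here in place of apply so the result keeps the original index:

```python
import pandas as pd

df = pd.DataFrame({"ColumnA": [5, 6, 7, 0, 0, 1, 2, 3, 0, 5, 10, 15]})

is_zero = df["ColumnA"].eq(0)

# True wherever the zero/non-zero state differs from the previous row
changed = is_zero.astype(int).diff().fillna(0).ne(0)

# Consecutive rows with the same state share a group number
groups = changed.cumsum()

df["CumulativeMean"] = df.groupby(groups)["ColumnA"].transform(
    lambda s: s.expanding().mean()
)

print(df)
```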

How to group by data using one column perform some operation on another column and assign new groups pandas

I have a dataframe as below :
distance_along_path ID
0 0 1
1 2.2 1
2 4.5 1
3 7.0 1
4 0 2
5 0 3
6 3.0 2
7 5.0 3
8 0 4
9 2.0 4
10 5.0 4
11 0 5
12 3.0 5
11 7.0 4
12
I want to be able to group these by ID first and then by the distance_along_path values: every time a 0 is seen in distance_along_path for an ID, a new group is created, and all rows of that ID up to its next 0 belong to the same group, as indicated below:
distance_along_path ID group
0 0 1 1
1 2.2 1 1
2 4.5 1 1
3 7.0 1 1
4 0 1 2
5 0 2 3
6 3.0 1 2
7 5.0 2 3
8 0 2 4
9 2.0 2 4
10 5.0 2 4
11 0 1 5
12 3.0 1 5
13 7.0 1 5
14 0 1 6
15 0 2 7
16 3.0 1 6
17 5.0 2 7
18 1.0 2 7
Thank you
try the following:
grp_id = df.groupby(['ID']).id.count().reset_index()
grp_distance = grp_id.groupby(['distance_along_path'].grp_id['distance_along_path']==0
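One way to reproduce the expected table is sketched below. It is an assumption-laden example rather than a confirmed solution: the input is taken from the first two columns of the expected output (the input table in the question does not match it), every ID is assumed to start with a 0 row, and the names is_zero and zero_number are illustrative. Each 0 gets a globally increasing number, and the other rows inherit the number of the most recent 0 with the same ID.

```python
import pandas as pd

df = pd.DataFrame({
    "distance_along_path": [0, 2.2, 4.5, 7.0, 0, 0, 3.0, 5.0, 0, 2.0,
                            5.0, 0, 3.0, 7.0, 0, 0, 3.0, 5.0, 1.0],
    "ID": [1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2, 2],
})

is_zero = df["distance_along_path"].eq(0)

# Number the zeros 1, 2, 3, ... in row order, across all IDs
zero_number = is_zero.cumsum()

# Keep the number only on the zero rows, then forward-fill within each ID
df["group"] = (
    zero_number.where(is_zero)
    .groupby(df["ID"])
    .ffill()
    .astype(int)  # assumption: each ID starts with a 0, so no NaN remains
)

print(df)
```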

Pandas: Drop duplicates based on row value

I have a dataframe and I want to drop duplicates based on different conditions....
A B
0 1 1.0
1 1 1.0
2 2 2.0
3 2 2.0
4 3 3.0
5 4 4.0
6 5 5.0
7 - 5.1
8 - 5.1
9 - 5.3
I want to drop all the duplicates from column A except rows with "-". After this, I want to drop duplicates among the rows with "-" in column A based on their column B value. Given the input dataframe, this should return the following:
A B
0 1 1.0
2 2 2.0
4 3 3.0
5 4 4.0
6 5 5.0
7 - 5.1
9 - 5.3
I have the following code, but it's not very efficient for very large amounts of data. How can I improve this?
def generate(df):
    # rows whose A value is "-"
    str_col = df[df["A"] == "-"]
    df.drop(df[df["A"] == "-"].index, inplace=True)
    df = df.drop_duplicates(subset="A")
    str_col = str_col.drop_duplicates(subset="B")
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    bigdata = pd.concat([df, str_col], ignore_index=True)
    return bigdata.sort_values("B")
duplicated and eq:
df[~df.duplicated('A') # keep those not duplicates in A
| (df['A'].eq('-') # or those '-' in A
& ~df['B'].duplicated())] # which are not duplicates in B
Output:
A B
0 1 1.0
2 2 2.0
4 3 3.0
5 4 4.0
6 5 5.0
7 - 5.1
9 - 5.3
df.drop_duplicates(subset=['A', 'B'])
Given a full set of data:
A B C
0 1 1.0 0
1 1 1.0 1
2 2 2.0 2
3 2 2.0 3
4 3 3.0 4
5 4 4.0 5
6 5 5.0 6
7 - 5.1 7
8 - 5.1 8
9 - 5.3 9
Result:
A B C
0 1 1.0 0
2 2 2.0 2
4 3 3.0 4
5 4 4.0 5
6 5 5.0 6
7 - 5.1 7
9 - 5.3 9
groupby + head
df.groupby(['A','B']).head(1)
Out[7]:
A B
0 1 1.0
2 2 2.0
4 3 3.0
5 4 4.0
6 5 5.0
7 - 5.1
9 - 5.3
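The approaches above can be compared on the sample data with a small self-contained sketch. It assumes column A is stored as strings (so that "-" and the numbers share one dtype); by_mask and by_pair are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["1", "1", "2", "2", "3", "4", "5", "-", "-", "-"],
    "B": [1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 5.1, 5.1, 5.3],
})

# Boolean-mask version: first occurrence in A, or a "-" row whose B is new
mask = ~df.duplicated("A") | (df["A"].eq("-") & ~df["B"].duplicated())
by_mask = df[mask]

# Pair version: drop rows that repeat an (A, B) combination
by_pair = df.drop_duplicates(subset=["A", "B"])

print(by_mask)
print(by_mask.equals(by_pair))
```

On this data both give the same rows. They are not equivalent in general: if a non-"-" value of A appeared with two different B values, the pair version would keep both rows while the mask version would keep only the first.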

Compute distance between rows in pandas DataFrame

I have a pandas DataFrame filled with zeros except for some 1.0 values. For each row, I want to compute the distance to the next occurrence of 1.0. Any idea how to do it?
input dataframe:
index col1
0 0.0
1 0.0
2 0.0
3 0.0
4 1.0
5 0.0
6 0.0
7 1.0
8 0.0
Expected output dataframe:
index col1
0 4.0
1 3.0
2 2.0
3 1.0
4 0.0
5 2.0
6 1.0
7 0.0
8 0.0
Use:
df['new'] = df.groupby(df['col1'].eq(1).iloc[::-1].cumsum()).cumcount(ascending=False)
print (df)
col1 new
0 0.0 4
1 0.0 3
2 0.0 2
3 0.0 1
4 1.0 0
5 0.0 2
6 0.0 1
7 1.0 0
8 0.0 0
Explanation:
First compare by 1 with Series.eq:
print (df['col1'].eq(1))
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 True
8 False
Name: col1, dtype: bool
Then swap ordering by Series.iloc:
print (df['col1'].eq(1).iloc[::-1])
8 False
7 True
6 False
5 False
4 True
3 False
2 False
1 False
0 False
Name: col1, dtype: bool
Create groups by Series.cumsum:
print (df['col1'].eq(1).iloc[::-1].cumsum())
8 0
7 1
6 1
5 1
4 2
3 2
2 2
1 2
0 2
Name: col1, dtype: int32
Pass groups to GroupBy.cumcount with ascending=False for count from back:
print (df.groupby(df['col1'].eq(1).iloc[::-1].cumsum()).cumcount(ascending=False))
0 4
1 3
2 2
3 1
4 0
5 2
6 1
7 0
8 0
dtype: int64
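Put together, the steps above can be run as one self-contained sketch on the question's data (the intermediate names is_one and groups are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"col1": [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]})

is_one = df["col1"].eq(1)

# Cumulative sum computed from the bottom up: rows share a label with the
# next 1.0 below them (the 1.0 row itself closes the group)
groups = is_one.iloc[::-1].cumsum()

# groupby aligns the labels on the index; count down to the end of each group
df["new"] = df.groupby(groups).cumcount(ascending=False)

print(df)
```

Rows after the last 1.0 form their own group and are counted down to 0 as well, which is why the final row gets 0 here.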

Pandas: complex filtering with apply

Let's suppose I have this dataframe, which I want to filter as follows: I iterate from the last index backwards until I find two consecutive 'a' = 0. Once that happens, the rest of the dataframe (including both zeros) shall be filtered out:
a
1 6.5
2 0
3 0
4 4.0
5 0
6 3.2
Desired result:
a
4 4.0
5 0
6 3.2
My initial idea was to use apply for filtering, with shift(1) == 0 & shift(2) == 0 inside the apply function. That way I could filter each row individually, but I could not return False for the remaining rows after the double zero is found, unless I use a global variable or something nasty like that.
Any smart way of doing this?
You could do that with sort_index with ascending=False, cumsum and dropna:
In [89]: df[(df.sort_index(ascending=False) == 0).cumsum() < 2].dropna()
Out[89]:
a
4 4.0
5 0.0
6 3.2
Step by step:
In [99]: df.sort_index(ascending=False)
Out[99]:
a
6 3.2
5 0.0
4 4.0
3 0.0
2 0.0
1 6.5
In [100]: df.sort_index(ascending=False) == 0
Out[100]:
a
6 False
5 True
4 False
3 True
2 True
1 False
In [101]: (df.sort_index(ascending=False) == 0).cumsum()
Out[101]:
a
6 0
5 1
4 1
3 2
2 3
1 3
In [103]: (df.sort_index(ascending=False) == 0).cumsum() < 2
Out[103]:
a
6 True
5 True
4 True
3 False
2 False
1 False
In [104]: df[(df.sort_index(ascending=False) == 0).cumsum() < 2]
Out[104]:
a
1 NaN
2 NaN
3 NaN
4 4.0
5 0.0
6 3.2
EDIT
IIUC you could use something like that using pd.rolling_sum and first_valid_index if your index started from 1:
df_sorted = df.sort_index(ascending=False)
df[df_sorted[(pd.rolling_sum((df_sorted==0), window=2) == 2)].first_valid_index()+1:]
With the @jezrael example:
In [208]: df
Out[208]:
a
1 6.5
2 0.0
3 0.0
4 7.0
5 0.0
6 0.0
7 0.0
8 4.0
9 0.0
10 0.0
11 3.2
12 5.0
df_sorted = df.sort_index(ascending=False)
In [210]: df[df_sorted[(pd.rolling_sum((df_sorted==0), window=2) == 2)].first_valid_index()+1:]
Out[210]:
a
11 3.2
12 5.0
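pd.rolling_sum has been removed from recent pandas releases. The same idea can be sketched with the .rolling(...) method instead; this is an adaptation under that assumption rather than the original answer's code, and pair_sum and hits are illustrative names. It selects by label, so it does not depend on the index starting from 1:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [6.5, 0, 0, 7, 0, 0, 0, 4, 0, 0, 3.2, 5]},
    index=range(1, 13),
)

is_zero = df["a"].eq(0)

# Sum over each row and the row before it: 2 means two consecutive zeros
pair_sum = is_zero.astype(int).rolling(window=2).sum()

# Labels at which a pair of consecutive zeros ends
hits = pair_sum[pair_sum.eq(2)].index

# Keep only the rows after the last such pair (everything if there is none)
result = df.loc[df.index > hits[-1]] if len(hits) else df

print(result)
```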
You can use groupby with cumcount and cumsum, then invert df and use cumsum again:
print(df)
a
1 6.5
2 0.0
3 0.0
4 7.0
5 0.0
6 0.0
7 0.0
8 4.0
9 0.0
10 0.0
11 3.2
12 5.0
print(df[df.groupby((df['a'].diff(1) != 0).astype('int').cumsum()).cumcount()[::-1].cumsum()[::-1] == 0])
a
11 3.2
12 5.0
Explanation:
print (df['a'].diff(1) != 0)
1 True
2 True
3 False
4 True
5 True
6 False
7 False
8 True
9 True
10 False
11 True
12 True
Name: a, dtype: bool
print((df['a'].diff(1) != 0).astype('int'))
1 1
2 1
3 0
4 1
5 1
6 0
7 0
8 1
9 1
10 0
11 1
12 1
Name: a, dtype: int32
print((df['a'].diff(1) != 0).astype('int').cumsum())
1 1
2 2
3 2
4 3
5 4
6 4
7 4
8 5
9 6
10 6
11 7
12 8
Name: a, dtype: int32
print(df.groupby((df['a'].diff(1) != 0).astype('int').cumsum()).cumcount())
1 0
2 0
3 1
4 0
5 0
6 1
7 2
8 0
9 0
10 1
11 0
12 0
dtype: int64
print(df.groupby((df['a'].diff(1) != 0).astype('int').cumsum()).cumcount()[::-1].cumsum()[::-1])
1 5
2 5
3 5
4 4
5 4
6 4
7 3
8 1
9 1
10 1
11 0
12 0
dtype: int64
print(df.groupby((df['a'].diff(1) != 0).astype('int').cumsum()).cumcount()[::-1].cumsum()[::-1] == 0)
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 True
12 True
dtype: bool
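NumPy's ediff1d function is useful here:
import numpy
a = df['a'].to_numpy()
inverted = a[::-1]
# position, in the reversed array, of the first pair of equal neighbours
index = (numpy.ediff1d(inverted) == 0).argmax()
a[len(a) - index:]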
