How to delete groups based on conditions/values of another column - python

In my dataframe I want to delete those groups in column B in which all values in column C are smaller than 3.
So only those groups should remain that contain at least one value in column C of 3 or more.
    B  C
0  11  1
1  22  2
2  11  2
3  22  4
4  22  1
5  33  2
6  33  1
7  22  4
So in my example only group 22 should stay.
Probably something like this pseudo code:
df_clean = df.groupby('B')['C']< 3.0
How do I code an algorithm that can do this?

Maybe by creating df_count, counting the number of elements with a C-value greater than 2:
df_count = df.groupby(['B'])['C'].apply(lambda x: (x>2).sum()).reset_index(name='count')
B count
0 11 0
1 22 2
2 33 0
and then filtering out those with a count of 0:
df = df[df['B'].isin(df_count[df_count['count'] > 0]['B'].unique())].sort_index()
B C
1 22 2
3 22 4
4 22 1
7 22 4
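For reference, the intermediate count frame can be skipped entirely; a minimal sketch of the same filter using transform, assuming the B/C column names above:
# keep rows whose group's maximum C value is at least 3
df_clean = df[df.groupby('B')['C'].transform('max') >= 3]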

As I understand it, a group must have all of its values less than 3 to be selected,
so I'd start by finding the groups that satisfy this condition by comparing each group's maximum value in C with the target, 3:
>>> groups = [group for group in df['B'].unique() if max(df[df.B == group].C.values) < 3]
>>> groups
[11, 33]
Then you can slice your dataframe and get a new one with only the desired groups:
>>> df[df.B.isin(groups)]
    B  C
0  11  1
2  11  2
5  33  2
6  33  1
(Note that this keeps the groups whose values are all below 3; to keep the groups with at least one value of 3 or more, as the expected output in the question suggests, invert the condition to >= 3.)

You can use groupby and any with a for loop to get your desired output.
for i, j in df.groupby('B'):
    if (j['C'] >= 3).any():
        result = j

    B  C
1  22  2
3  22  4
4  22  1
7  22  4
Or the other way round, if you are looking for the groups where all values are less than 3:
result = []
for i, j in df.groupby('B'):
    if (j['C'] < 3).all():
        result.append(j)

[    B  C
 0  11  1
 2  11  2,
     B  C
 5  33  2
 6  33  1]
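If a single dataframe is wanted rather than a list of per-group frames, the pieces can be concatenated afterwards (a small sketch):
import pandas as pd
result = pd.concat(result)  # stack the matching groups back into one dataframe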

Related

Get longest streak of consecutive weeks by group in pandas

Currently I'm working with weekly data for different subjects, but it may have long streaks without data, so what I want to do is keep only the longest streak of consecutive weeks for every id. My data looks like this:
id week
1 8
1 15
1 60
1 61
1 62
2 10
2 11
2 12
2 13
2 25
2 26
My expected output would be:
id week
1 60
1 61
1 62
2 10
2 11
2 12
2 13
I got a bit close, trying to mark with a 1 when week==week.shift()+1. The problem is this approach doesn't mark the first occurrence in a streak, and also I can't filter the longest one:
df.loc[ (df['id'] == df['id'].shift())&(df['week'] == df['week'].shift()+1),'streak']=1
This, according to my example, would bring this:
id week streak
1 8 nan
1 15 nan
1 60 nan
1 61 1
1 62 1
2 10 nan
2 11 1
2 12 1
2 13 1
2 25 nan
2 26 1
Any ideas on how to achieve what I want?
Try this:
df['consec'] = df.groupby(['id',df['week'].diff(-1).ne(-1).shift().bfill().cumsum()]).transform('count')
df[df.groupby('id')['consec'].transform('max') == df.consec]
Output:
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
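For readers who find the chained expression dense, here is a step-by-step sketch of the same idea (my own breakdown, assuming the id/week columns above):
# start a new streak key whenever the week does not follow the previous row's week
streak = (df['week'].diff() != 1).cumsum()
# size of the streak each row belongs to (grouped per id as well)
df['consec'] = df.groupby(['id', streak])['week'].transform('count')
# keep only the rows belonging to each id's longest streak
longest = df[df.groupby('id')['consec'].transform('max') == df['consec']]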
Not as concise as @ScottBoston's answer, but I like this approach:
def max_streak(s):
    a = s.values  # Let's deal with an array
    # I need to know where the differences are not `1`.
    # Also, because I plan to use `diff` again, I'll wrap
    # the boolean array with `True` to make things cleaner
    b = np.concatenate([[True], np.diff(a) != 1, [True]])
    # Tell the locations of the breaks in streak
    c = np.flatnonzero(b)
    # `diff` again tells me the length of the streaks
    d = np.diff(c)
    # `argmax` will tell me the location of the largest streak
    e = d.argmax()
    return c[e], d[e]

def make_thing(df):
    start, length = max_streak(df.week)
    return df.iloc[start:start + length].assign(consec=length)

pd.concat([
    make_thing(g) for _, g in df.groupby('id')
])
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4

Looking for a sequential pattern with condition

I have a df as
Id Event SeqNo
1 A 1
1 B 2
1 C 3
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
1 D 9
1 B 10
1 ABD 11
1 D 12
1 B 13
1 CDE 14
1 A 15
I am looking for the pattern "ABD followed by CDE without event B in between them".
For example, the output for this df would be:
Id Event SeqNo
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
This pattern can occur multiple times for a single Id, and I want to find the list of all those Ids and their respective counts (if possible).
Here's a vectorized one with some scaling trickery and leveraging convolution to find the required pattern -
# Get the col in context and scale it to the three strings to form an ID array
a = df['Event']
id_ar = (a=='ABD') + 2*(a=='B') + 3*(a=='CDE')
# Mask of those specific strings and hence extract the corresponding masked df
mask = id_ar>0
df1 = df[mask]
# Get pattern col with 1s at places with the pattern found, 0s elsewhere
df1['Pattern'] = (np.convolve(id_ar[mask],[9,1],'same')==28).astype(int)
# Groupby Id col and sum the pattern col for final output
out = df1.groupby(['Id'])['Pattern'].sum()
That convolution part might be a bit tricky. The idea is to use id_ar, which has the values 1, 2 and 3 corresponding to the strings 'ABD', 'B' and 'CDE'. We are looking for a 1 followed by a 3, so convolving with the kernel [9,1] yields 1*1 + 3*9 = 28 as the convolution sum for a window that has 'ABD' and then 'CDE'. Hence, we look for a convolution sum of 28 to find a match. For the case of 'ABD' followed by 'B' and then 'CDE', the convolution sum would be different and is hence filtered out.
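A tiny check of the kernel logic on a plain array (my own illustration, using the same encoding ABD=1, B=2, CDE=3):
import numpy as np

seq = np.array([1, 3, 1, 2, 3])          # ABD,CDE  then  ABD,B,CDE
conv = np.convolve(seq, [9, 1], 'same')  # 9*current + 1*previous
print(conv == 28)                        # True only at the CDE that directly follows ABD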
Sample run -
1) Input dataframe :
In [377]: df
Out[377]:
Id Event SeqNo
0 1 A 1
1 1 B 2
2 1 C 3
3 1 ABD 4
4 1 B 5
5 1 C 6
6 1 A 7
7 1 CDE 8
8 1 D 9
9 1 B 10
10 1 ABD 11
11 1 D 12
12 1 B 13
13 2 A 1
14 2 B 2
15 2 C 3
16 2 ABD 4
17 2 A 5
18 2 C 6
19 2 A 7
20 2 CDE 8
21 2 D 9
22 2 B 10
23 2 ABD 11
24 2 D 12
25 2 B 13
26 2 CDE 14
27 2 A 15
2) Intermediate filtered output (look at the Pattern column for the presence of the required pattern):
In [380]: df1
Out[380]:
Id Event SeqNo Pattern
1 1 B 2 0
3 1 ABD 4 0
4 1 B 5 0
7 1 CDE 8 0
9 1 B 10 0
10 1 ABD 11 0
12 1 B 13 0
14 2 B 2 0
16 2 ABD 4 0
20 2 CDE 8 1
22 2 B 10 0
23 2 ABD 11 0
25 2 B 13 0
26 2 CDE 14 0
3) Final output:
In [381]: out
Out[381]:
Id
1 0
2 1
Name: Pattern, dtype: int64
I used a solution based on the assumption that anything other than ABD, CDE and B is irrelevant to our solution, so I get rid of the other events first with a filtering operation.
Then, what I want to know is whether there is an ABD followed by a CDE without a B in between. I shift the Event column by one step (note this doesn't have to be one step in units of SeqNo).
Then I check, for every row of the new df, whether Event == ABD and Event_1_Step == CDE, meaning there wasn't a B in between, although there may have been other events like A or C, or nothing at all. This gives me a boolean for every such sequence; if I sum them up, I get the count.
Finally, I have to make sure these are all done at the Id level, so I use .groupby.
IMPORTANT: This solution assumes that your df is sorted by Id first and then by SeqNo. If not, please sort it first.
import pandas as pd
df = pd.read_csv("path/to/file.csv")
df2 = df[df["Event"].isin(["ABD", "CDE", "B"])]
df2.loc[:,"Event_1_Step"] = df2["Event"].shift(-1)
df2.loc[:,"SeqNo_1_Step"] = df2["SeqNo"].shift(-1)
for id, id_df in df2.groupby("Id"):
    print(id)  # Set a counter object here per Id to track count per id
    id_df = id_df[id_df.apply(lambda x: x["Event"] == "ABD" and x["Event_1_Step"] == "CDE", axis=1)]
    for row_id, row in id_df.iterrows():
        print(df[(df["Id"] == id) & df["SeqNo"].between(row["SeqNo"], row["SeqNo_1_Step"])])
You could use this:
s = (pd.Series(
         np.select([df['Event'] == 'ABD', df['Event'] == 'B', df['Id'] != df['Id'].shift()],
                   [True, False, False], default=np.nan))
     .ffill()
     .fillna(False)
     .astype(bool))

corr = (df['Event'] == "CDE") & s
corr.groupby(df['Id']).max()
This uses np.select to create a series that is True where Event == 'ABD' and False where Event == 'B' or where a new Id starts. Forward filling with ffill then tells you, for every row, whether ABD or B came last. You can then check whether the flag is True at the rows where Event == 'CDE', and use GroupBy to check whether it is True for any row per Id.
Which, for the following input:
Id Event SeqNo
1 A 1
1 B 2
1 C 3
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
1 D 9
1 B 10
1 ABD 11
1 D 12
1 B 13
1 CDE 14
1 A 15
2 B 16
3 ABD 17
3 B 18
3 CDE 19
4 ABD 20
4 CDE 21
5 CDE 22
Outputs:
Id
1 True
2 False
3 False
4 True
5 False
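Since the question also asks for a count per Id, a small variation on the last line (a sketch, counting every qualifying CDE rather than just flagging the Id):
corr.groupby(df['Id']).sum()  # number of CDEs preceded by ABD with no B in between, per Id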

Loop to create a new row for a given range of numbers

I am sure this is an easy fix, but I haven't been able to find the exact solution to my problem. My data set has a column called 'LANE' which contains unique values. I want to add rows for each 'LANE' based on a range of numbers (which would be 0 to 12). As a result each 'LANE' would have 13 rows with a new column 'NUMBER' ranging from 0 up to and including 12.
Example:
Input
LANE
a
b
Output
LANE NUMBER
a 0
a 1
a 2
a 3
a 4
a 5
a 6
a 7
a 8
a 9
a 10
a 11
a 12
b 0
b 1
b 2
b 3
b 4
b 5
b 6
b 7
b 8
b 9
b 10
b 11
b 12
I am currently trying different forms of:
num = 0
while num <= 12:
    for x in df['LANE']:
        df['NUMBER'] = num
    num += 1
The problem with this loop is, I still have one record for each lane and the 'NUMBER' column only has the value 12.
Comprehension
For loops are the natural and naive way to produce Cartesian products. Comprehensions allow us to embed this more succinctly.
pd.DataFrame(
    [[l, n] for l in df.LANE for n in range(12)],
    columns=['LANE', 'NUMBER']
)
LANE NUMBER
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 a 6
7 a 7
8 a 8
9 a 9
10 a 10
11 a 11
12 b 0
13 b 1
14 b 2
15 b 3
16 b 4
17 b 5
18 b 6
19 b 7
20 b 8
21 b 9
22 b 10
23 b 11
itertools.product
This logic is almost identical to the Comprehension solution but it uses itertools built in product function. product is an iterator that pops out each combination one at a time. I force the result by unpacking with the splat * like so [*product(a, b)]. Ultimately, it is a list of lists that gets passed to the pd.DataFrame constructor in the same way as the Comprehension solution above.
from itertools import product
pd.DataFrame([*product(df.LANE, range(12))], columns=['LANE', 'NUMBER'])
groupby/cumcount and repeat
I don't like this answer but it provides some perspective on the simplicity of the other answers.
I use repeat to replicate each index value 12 times. I use this repeated index in loc, which returns a dataframe sliced by the passed index. I then use groupby's cumcount to count each position within the group and add that as a new column.
df.loc[df.index.repeat(12)].assign(NUMBER=lambda d: d.groupby('LANE').cumcount())
LANE NUMBER
0 a 0
0 a 1
0 a 2
0 a 3
0 a 4
0 a 5
0 a 6
0 a 7
0 a 8
0 a 9
0 a 10
0 a 11
1 b 0
1 b 1
1 b 2
1 b 3
1 b 4
1 b 5
1 b 6
1 b 7
1 b 8
1 b 9
1 b 10
1 b 11
Another couple of approaches using pandas:
# First approach, one-liner
df = pd.DataFrame({'Lane': ['a'] * 12 + ['b'] * 12,
                   'Number': list(range(12)) * 2})

# Second approach
df = pd.DataFrame({'Lane': ['a'] * 12 + ['b'] * 12})
df['Number'] = df.groupby('Lane').cumcount()
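For completeness, a sketch of a cross join applied to the question's original df (with its LANE column) that also covers the full 0 to 12 range the question asks for (13 values, hence range(13); how='cross' needs pandas 1.2+):
numbers = pd.DataFrame({'NUMBER': range(13)})
out = df[['LANE']].merge(numbers, how='cross')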

Group rows by overlapping ranges

I have a dataframe where the left column is the left-most location of an object and the right column is the right-most location. I need to group the objects if they overlap, or if they overlap objects that overlap (recursively).
So, for example, if this is my dataframe:
left right
0 0 4
1 5 8
2 10 13
3 3 7
4 12 19
5 18 23
6 31 35
So lines 0 and 3 overlap, thus they should be in the same group, and line 1 also overlaps line 3, so it joins that group.
So, for this example the output should be something like that:
left right group
0 0 4 0
1 5 8 0
2 10 13 1
3 3 7 0
4 12 19 1
5 18 23 1
6 31 35 2
I thought of various approaches, but couldn't figure it out (without an ugly for loop).
Any help will be appreciated!
I found the accepted solution (update: now deleted) to be misleading because it fails to generalize to similar cases. e.g. for the following example:
df = pd.DataFrame({'left': [0, 5, 10, 3, 12, 13, 18, 31],
                   'right': [4, 8, 13, 7, 19, 16, 23, 35]})
df
The suggested aggregate function mis-assigns the groups for this input (the 18-23 interval should be in group 1, along with 12-19).
One solution is using the following approach (based on a method for combining intervals posted by #CentAu):
# Union intervals by @CentAu
from sympy import Interval, Union

def union(data):
    """ Union of a list of intervals e.g. [(1,2),(3,4)] """
    intervals = [Interval(begin, end) for (begin, end) in data]
    u = Union(*intervals)
    return [u] if isinstance(u, Interval) else list(u.args)

# Create a list of intervals
df['left_right'] = df[['left', 'right']].apply(list, axis=1)
intervals = union(df.left_right)

# Add a group column
df['group'] = df['left'].apply(lambda x: [g for g, l in enumerate(intervals)
                                          if l.contains(x)][0])
...which assigns group 0 to the intervals covering 0-8, group 1 to those covering 10-23 (including 18-23), and group 2 to 31-35.
You can try this: use rolling max and rolling min to find the intersection of the ranges:
df=df.sort_values(['left','right'])
df['Group']=((df.right.rolling(window=2,min_periods=1).min()-df.left.rolling(window=2,min_periods=1).max())<0).cumsum()
df.sort_index()
Out[331]:
left right Group
0 0 4 0
1 5 8 0
2 10 13 1
3 3 7 0
4 12 19 1
5 18 23 1
6 31 35 2
For example, take (1,3) and (2,4). To find the intersection: min(3,4) - max(1,2) = 1; since 1 is greater than 0, the two intervals intersect.
You can sort the samples and use the cumulative functions cummax and cumsum. Let's take the extended example above:
   left  right
0     0      4
1     5      8
2    10     13
3     3      7
4    12     19
5    13     16
6    18     23
7    31     35
First you need to sort values so that longer ranges come first:
df = df.sort_values(['left', 'right'], ascending=[True, False])
Result:
left right
0 0 4
3 3 7
1 5 8
2 10 13
4 12 19
5 13 16
6 18 23
7 31 35
Then you can find the overlapping groups by comparing each 'left' with the running maximum of the previous 'right' values:
df['group'] = (df['right'].cummax().shift() <= df['left']).cumsum()
df.sort_index(inplace=True)
Result:
left right group
0 0 4 0
1 5 8 0
2 10 13 1
3 3 7 0
4 12 19 1
5 13 16 1
6 18 23 1
7 31 35 2
In one line:
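A sketch of how the two steps can be chained into a single assignment (restoring the original row order at the end):
df['group'] = (df.sort_values(['left', 'right'], ascending=[True, False])
                 .pipe(lambda d: (d['right'].cummax().shift() <= d['left']).cumsum())
                 .sort_index())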

Select particular rows from inside groups in pandas dataframe

Suppose I have a dataframe that looks like this:
group level
0 1 10
1 1 10
2 1 11
3 2 5
4 2 5
5 3 9
6 3 9
7 3 9
8 3 8
The desired output is this:
group level
0 1 10
5 3 9
Namely, this is the logic: look inside each group, if there is more than 1 distinct value present in the level column, return the first row in that group. For example, no row from group 2 is selected, because the only value present in the level column is 5.
In addition, how does the situation change if I want the last, instead of the first row of such groups?
What I have attempted was combining groupby statements with building sets from the entries in the level column, but I failed to produce anything even nearly sensible.
This can be done with groupby and using apply to run a simple function on each group:
def get_first_val(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group['level'].loc[group['level'].first_valid_index()]
    else:
        return None

df.groupby('group').apply(get_first_val).dropna()
Out[8]:
group
1 10
3 9
dtype: float64
There's also a last_valid_index() method, so you wouldn't have to
make any huge changes to get the last row instead.
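A sketch of that last-row variant, keeping the same structure and just swapping the index lookup:
def get_last_val(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group['level'].loc[group['level'].last_valid_index()]
    return None

df.groupby('group').apply(get_last_val).dropna()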
If you have other columns that you want to keep, you just need a slight tweak:
import numpy as np
df['col1'] = np.random.randint(10, 20, 9)
df['col2'] = np.random.randint(20, 30, 9)
df
Out[17]:
group level col1 col2
0 1 10 19 21
1 1 10 18 24
2 1 11 14 23
3 2 5 14 26
4 2 5 10 22
5 3 9 13 27
6 3 9 16 20
7 3 9 18 26
8 3 8 11 2
def get_first_val_keep_cols(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group.loc[group['level'].first_valid_index(), :]
    else:
        return None

df.groupby('group').apply(get_first_val_keep_cols).dropna()
Out[20]:
group level col1 col2
group
1 1 10 19 21
3 3 9 13 27
This would be simpler:
In [121]:
print(df.groupby('group')
        .agg(lambda x: x.values[0] if (x.values != x.values[0]).any() else np.nan)
        .dropna())
level
group
1 10
3 9
For each group, if any of the values are not the same as the first value, aggregate that group into its first value; otherwise, aggregate it to nan.
Finally, dropna().
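A more direct sketch that keeps whole rows without apply, using nunique and head (assuming the same group/level columns; use tail(1) for the last row instead):
out = (df[df.groupby('group')['level'].transform('nunique') > 1]
         .groupby('group')
         .head(1))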
