I'm working with weekly data for different subjects. The data can have long gaps, so I want to keep only the longest streak of consecutive weeks for each id. My data looks like this:
id week
1 8
1 15
1 60
1 61
1 62
2 10
2 11
2 12
2 13
2 25
2 26
My expected output would be:
id week
1 60
1 61
1 62
2 10
2 11
2 12
2 13
I got close by marking a row with 1 when week == week.shift() + 1. The problem is that this approach doesn't mark the first row of a streak, and it gives me no way to filter for the longest one:
df.loc[ (df['id'] == df['id'].shift())&(df['week'] == df['week'].shift()+1),'streak']=1
For my example, this produces:
id week streak
1 8 nan
1 15 nan
1 60 nan
1 61 1
1 62 1
2 10 nan
2 11 1
2 12 1
2 13 1
2 25 nan
2 26 1
Any ideas on how to achieve what I want?
Try this:
df['consec'] = df.groupby(['id',df['week'].diff(-1).ne(-1).shift().bfill().cumsum()])['week'].transform('count')
df[df.groupby('id')['consec'].transform('max') == df.consec]
Output:
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
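For readers who find the one-liner hard to follow, here is a minimal step-by-step sketch of the same idea. It is a variant, not the answer's exact code: it labels streaks with a per-id diff() instead of diff(-1), and the names new_streak, streak_id and longest are just illustrative. It assumes the frame is sorted by id and week.

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "week": [8, 15, 60, 61, 62, 10, 11, 12, 13, 25, 26],
})

# A new streak starts wherever week is not "previous week + 1" within the same id.
# The first row of each id has a NaN diff, so it also starts a streak.
new_streak = df.groupby("id")["week"].diff().ne(1)

# Running count of streak starts gives every streak its own label.
streak_id = new_streak.cumsum()

# Length of the streak each row belongs to.
df["consec"] = df.groupby(streak_id)["week"].transform("size")

# Keep rows whose streak length equals the longest streak of their id.
longest = df[df["consec"].eq(df.groupby("id")["consec"].transform("max"))]
print(longest)
```

If an id has two streaks tied for the longest, this keeps both of them.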
Not as concise as @ScottBoston's answer, but I like this approach:
import numpy as np
import pandas as pd

def max_streak(s):
    a = s.values  # Let's deal with an array
    # I need to know where the differences are not `1`.
    # Also, because I plan to use `diff` again, I'll wrap
    # the boolean array with `True` to make things cleaner
    b = np.concatenate([[True], np.diff(a) != 1, [True]])
    # Tell the locations of the breaks in streak
    c = np.flatnonzero(b)
    # `diff` again tells me the length of the streaks
    d = np.diff(c)
    # `argmax` will tell me the location of the largest streak
    e = d.argmax()
    return c[e], d[e]

def make_thing(df):
    start, length = max_streak(df.week)
    return df.iloc[start:start + length].assign(consec=length)

pd.concat([
    make_thing(g) for _, g in df.groupby('id')
])
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
Hope I can explain the question properly.
In basic terms, imagining the df as below:
print(df)
year id
1 16100
1 150
1 150
2 66
2 370
2 370
2 530
3 41
3 43
3 61
I need df.seq to count from 1 to n within each run of identical year values, restarting at 1 when the year changes.
df.seq2 should stay at n, instead of advancing to n+1, when the id in the row above is identical.
So, as an Excel-like formula, it would be something like
df.seq2 = IF(A2=A1,IF(B2=B1,F1,F1+1),1)
which would give the desired seq and seq2 below:
year id seq seq2
1 16100 1 1
1 150 2 2
1 150 3 2
2 66 1 1
2 370 2 2
2 370 3 2
2 530 4 3
3 41 1 1
3 43 2 2
3 61 3 3
I tested a couple of things like this (assuming df.seq has already been generated):
comb_df['match'] = comb_df.year.eq(comb_df.year.shift())
comb_df['match2'] = comb_df.id.eq(comb_df.id.shift())
comb_df["seq2"] = np.where((comb_df["match"].shift(+1) == True) & (comb_df["match2"].shift(+1) == True), comb_df["seq"] - 1, comb_df["seq2"])
The problem is that this doesn't work when there are several duplicates in a row.
Perhaps this can't be solved in a purely vectorized NumPy way and I'd have to iterate over the rows?
There are 2-3 million rows, so performance will be an issue if the solution is slow.
I need to generate both df.seq and df.seq2.
Any ideas would be extremely helpful!
We can use groupby with cumcount and factorize:
df['seq'] = df.groupby('year').cumcount()+1
df['seq2'] = df.groupby('year')['id'].transform(lambda x : x.factorize()[0]+1)
df
Out[852]:
year id seq seq2
0 1 16100 1 1
1 1 150 2 2
2 1 150 3 2
3 2 66 1 1
4 2 370 2 2
5 2 370 3 2
6 2 530 4 3
7 3 41 1 1
8 3 43 2 2
9 3 61 3 3
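One thing to be aware of: factorize gives the same number to every occurrence of an id within a year, even when the occurrences are not adjacent, whereas the Excel formula in the question only compares a row with the row directly above it. If only adjacent repeats should share a number, a sketch like the following might be closer to the formula. The helper name add_seq_columns is made up for illustration.

```python
import pandas as pd

def add_seq_columns(df):
    """Return a copy with seq (row counter per year) and
    seq2 (counter that only advances when id differs from the row above)."""
    out = df.copy()
    out["seq"] = out.groupby("year").cumcount() + 1
    # True where id differs from the previous row of the same year
    # (the first row of each year compares against NaN, so it is True).
    changed = out["id"].ne(out.groupby("year")["id"].shift())
    out["seq2"] = changed.astype(int).groupby(out["year"]).cumsum()
    return out

df = pd.DataFrame({
    "year": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    "id":   [16100, 150, 150, 66, 370, 370, 530, 41, 43, 61],
})
res = add_seq_columns(df)
print(res)

# Non-adjacent repeat: adjacency-based numbering keeps counting,
# while factorize would reuse the first number.
demo = add_seq_columns(pd.DataFrame({"year": [1, 1, 1], "id": [150, 66, 150]}))
print(demo["seq2"].tolist())
```

On the question's sample both approaches give the same result; they only differ when an id reappears later in the same year.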
I have a df as
Id Event SeqNo
1 A 1
1 B 2
1 C 3
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
1 D 9
1 B 10
1 ABD 11
1 D 12
1 B 13
1 CDE 14
1 A 15
I am looking for the pattern "ABD followed by CDE without event B in between".
For example, the output for this df would be:
Id Event SeqNo
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
This pattern can occur multiple times for a single Id, and I want to find the list of all such Ids and their respective counts (if possible).
Here's a vectorized one with some scaling trickery and leveraging convolution to find the required pattern -
# Get the col in context and scale it to the three strings to form an ID array
a = df['Event']
id_ar = (a=='ABD') + 2*(a=='B') + 3*(a=='CDE')
# Mask of those specific strings and hence extract the corresponding masked df
mask = id_ar>0
df1 = df[mask].copy()
# Get pattern col with 1s at places with the pattern found, 0s elsewhere
df1['Pattern'] = (np.convolve(id_ar[mask],[9,1],'same')==28).astype(int)
# Groupby Id col and sum the pattern col for final output
out = df1.groupby(['Id'])['Pattern'].sum()
That convolution part might be a bit tricky. The idea is that id_ar holds the values 1, 2 and 3 for the strings 'ABD', 'B' and 'CDE'. We are looking for a 1 followed by a 3, so convolving with the kernel [9,1] gives 1*1 + 3*9 = 28 for the window that has 'ABD' and then 'CDE'. Hence, we look for a convolution sum of 28 to find a match. For 'ABD' followed by 'B' and then 'CDE', the sums are different (1*1 + 2*9 = 19 and 2*1 + 3*9 = 29), so that case is filtered out.
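To make the arithmetic concrete, here is a small standalone sketch on a toy event list (the list itself is made up). It uses the default 'full' convolution and slices it to the input length rather than relying on how 'same' centers an even-length kernel, so each output element is 9 times the current code plus the previous code.

```python
import numpy as np

events = np.array(["ABD", "A", "CDE", "ABD", "B", "CDE"])

# Encode the three relevant strings as 1, 2, 3; everything else becomes 0.
codes = (events == "ABD") * 1 + (events == "B") * 2 + (events == "CDE") * 3

# Drop the irrelevant events so that only ABD / B / CDE remain next to each other.
kept = codes[codes > 0]

# Full convolution: conv[n] = 9 * kept[n] + 1 * kept[n - 1]
conv = np.convolve(kept, [9, 1])[:len(kept)]

# 28 = 9 * 3 + 1 * 1, i.e. a CDE whose previous relevant event is ABD.
hits = conv == 28
print(kept, conv, hits.sum())
```

The first ABD ... CDE pair is counted, while the second ABD is followed by B before the CDE and gives 29 instead of 28.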
Sample run -
1) Input dataframe :
In [377]: df
Out[377]:
Id Event SeqNo
0 1 A 1
1 1 B 2
2 1 C 3
3 1 ABD 4
4 1 B 5
5 1 C 6
6 1 A 7
7 1 CDE 8
8 1 D 9
9 1 B 10
10 1 ABD 11
11 1 D 12
12 1 B 13
13 2 A 1
14 2 B 2
15 2 C 3
16 2 ABD 4
17 2 A 5
18 2 C 6
19 2 A 7
20 2 CDE 8
21 2 D 9
22 2 B 10
23 2 ABD 11
24 2 D 12
25 2 B 13
26 2 CDE 14
27 2 A 15
2) Intermediate filtered o/p (look at column Pattern for the presence of the reqd. pattern) :
In [380]: df1
Out[380]:
Id Event SeqNo Pattern
1 1 B 2 0
3 1 ABD 4 0
4 1 B 5 0
7 1 CDE 8 0
9 1 B 10 0
10 1 ABD 11 0
12 1 B 13 0
14 2 B 2 0
16 2 ABD 4 0
20 2 CDE 8 1
22 2 B 10 0
23 2 ABD 11 0
25 2 B 13 0
26 2 CDE 14 0
3) Final o/p :
In [381]: out
Out[381]:
Id
1 0
2 1
Name: Pattern, dtype: int64
My solution assumes that anything other than ABD, CDE and B is irrelevant to our problem, so I first get rid of those rows with a filtering operation.
Then, what I want to know is whether there is an ABD followed by a CDE without a B in between. I shift the Event column by one step in time (note this doesn't have to be one step in units of SeqNo).
Then I check, for every row of the new df, whether Event == ABD and Event_1_Step == CDE, meaning that there wasn't a B in between, but possibly other stuff like A or C, or even nothing. This gives me a boolean for every row where such a sequence starts. If I sum them up, I get the count.
Finally, I have to make sure all of this is done at Id level, so I use .groupby.
IMPORTANT: This solution assumes that your df is sorted by Id first and then by SeqNo. If not, please sort it first.
import pandas as pd

df = pd.read_csv("path/to/file.csv")
# Keep only the events that matter for the pattern
df2 = df[df["Event"].isin(["ABD", "CDE", "B"])].copy()
# Next relevant event and its SeqNo, shifted within each Id so Ids don't leak into each other
df2["Event_1_Step"] = df2.groupby("Id")["Event"].shift(-1)
df2["SeqNo_1_Step"] = df2.groupby("Id")["SeqNo"].shift(-1)
for id, id_df in df2.groupby("Id"):
    print(id)  # Set a counter object here per Id to track count per id
    id_df = id_df[id_df.apply(lambda x: x["Event"] == "ABD" and x["Event_1_Step"] == "CDE", axis=1)]
    for row_id, row in id_df.iterrows():
        print(df[(df["Id"] == id) & df["SeqNo"].between(row["SeqNo"], row["SeqNo_1_Step"])])
You could use this:
s = (pd.Series(
np.select([df['Event'] == 'ABD', df['Event'] =='B', df['Id'] != df['Id'].shift()],
[True, False, False], default=np.nan))
.ffill()
.fillna(False)
.astype(bool))
corr = (df['Event'] == "CDE") & s
corr.groupby(df['Id']).max()
This uses np.select to create a column that is True where Event == 'ABD' and False for 'B' or at the start of a new Id. Forward filling with ffill then tells you, for every row, whether ABD or B came last. You can then check whether that flag is True where the event is CDE, and use groupby to check whether this holds for any row per Id.
For this input:
Id Event SeqNo
1 A 1
1 B 2
1 C 3
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
1 D 9
1 B 10
1 ABD 11
1 D 12
1 B 13
1 CDE 14
1 A 15
2 B 16
3 ABD 17
3 B 18
3 CDE 19
4 ABD 20
4 CDE 21
5 CDE 22
Outputs:
Id
1 True
2 False
3 False
4 True
5 False
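The question also asked for a count per Id. Below is a rough sketch of how the same "last relevant event" idea could be extended to count matches. It is my own variation, not the code above: it also resets the state after a CDE, so that ABD, CDE, CDE counts once, and it forward-fills within each Id. The sample data and the names state, prev_state, hits and counts are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Id":    [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3],
    "Event": ["ABD", "A", "CDE", "CDE", "ABD", "CDE",
              "ABD", "B", "CDE",
              "CDE", "ABD"],
})

# State after each row: 1.0 once ABD is seen, 0.0 after B or CDE, NaN otherwise.
state = pd.Series(
    np.select([df["Event"].eq("ABD"), df["Event"].isin(["B", "CDE"])],
              [1.0, 0.0], default=np.nan),
    index=df.index,
)
# Carry the state forward inside each Id; rows before any relevant event count as 0.
state = state.groupby(df["Id"]).ffill().fillna(0.0)

# State just before each row (0 at the start of every Id).
prev_state = state.groupby(df["Id"]).shift(fill_value=0.0)

# A match is a CDE whose previous relevant event was ABD.
hits = df["Event"].eq("CDE") & prev_state.eq(1.0)
counts = hits.groupby(df["Id"]).sum()
print(counts)
```

Here Id 1 gets a count of 2, while Ids 2 and 3 get 0.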
I have a pandas data frame that looks something like this:
v1 v2 v3 result
0 12 31 31 0
1 34 52 4 1
2 32 4 5 1
3 7 89 2 0
4 5 17 8 1
5 11 25 23 1
6 2 32 34 1
7 0 1 3 0
As you may note, in the very last column it has a pattern of zeroes and ones.
Is it possible to split this data frame into two sub-data frames?
My desired output will be:
df1:
v1 v2 v3 result
0 34 52 4 1
1 32 4 5 1
df2:
0 5 17 8 1
1 11 25 23 1
2 2 32 34 1
df.groupby() will definitely not work, as it will just create two big dataframes; one with ones, the second one with zeroes. I am not interested in keeping data marked as zeroes.
Thanks in advance!
PS.
In reality this dataframe is much bigger, so I am trying to create df1, df2, ... dfn
You can create dictionary of DataFrames:
mask = df['result'].eq(1)
a = pd.factorize(df['result'].eq(0).cumsum()[mask])[0]
dfs = dict(tuple(df[mask].groupby(a)))
print (dfs[0])
v1 v2 v3 result
1 34 52 4 1
2 32 4 5 1
print (dfs[1])
v1 v2 v3 result
4 5 17 8 1
5 11 25 23 1
6 2 32 34 1
Details:
Create boolean mask for filtering by eq (==):
mask = df['result'].eq(1)
print (mask)
0 False
1 True
2 True
3 False
4 True
5 True
6 True
7 False
Name: result, dtype: bool
Create counter Series by comparing by 0 and Series.cumsum:
print (df['result'].eq(0).cumsum())
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 3
Name: result, dtype: int32
Filtering by boolean indexing only 1 rows:
print (df['result'].eq(0).cumsum()[mask])
1 1
2 1
4 2
5 2
6 2
Name: result, dtype: int32
Add factorize so the groups start from 0:
a = pd.factorize(df['result'].eq(0).cumsum()[mask])[0]
print (a)
[0 0 1 1 1]
Create dictionary from groupby object, but also filter rows by boolean mask:
dfs = dict(tuple(df[mask].groupby(a)))
print (dfs)
{0: v1 v2 v3 result
1 34 52 4 1
2 32 4 5 1, 1: v1 v2 v3 result
4 5 17 8 1
5 11 25 23 1
6 2 32 34 1}
# Flag the rows that will be the beginning of a new dataframe
df['_start_new_gp'] = (df.result == 1) & (df.result.shift() == 0)
# Get rid of the rows with result == 0 (here creating a copy - not necessary)
df2 = df[df.result == 1].copy()
# Use a cumulative sum on the '_start_new_gp' column to create a "group number"
df2['_group_number'] = df2['_start_new_gp'].cumsum()
# Group by "group number"
grouped = df2.groupby('_group_number')
# Get list of dataframes
dataframes = [group for _, group in grouped]
Using numpy.split:
s = df.loc[df.result.eq(1)]
idx = np.where(np.diff(s.index)!=1)[0] + 1
for d in np.split(s, idx):
    print(d, end='\n\n')
v1 v2 v3 result
1 34 52 4 1
2 32 4 5 1
v1 v2 v3 result
4 5 17 8 1
5 11 25 23 1
6 2 32 34 1
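As far as I know, passing a DataFrame to np.split triggers a deprecation warning in newer pandas versions, so here is a hedged sketch of the same split done with plain positional slicing. It assumes the original frame has a default integer index, so that a jump in the index marks a new run; the names breaks, bounds and parts are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "v1":     [12, 34, 32, 7, 5, 11, 2, 0],
    "result": [0, 1, 1, 0, 1, 1, 1, 0],
})

# Keep only the rows marked with 1.
s = df.loc[df["result"].eq(1)]

# Positions inside s where the original index is not consecutive.
breaks = np.flatnonzero(np.diff(s.index.to_numpy()) != 1) + 1

# Turn the break positions into (start, stop) pairs and slice.
bounds = [0, *breaks.tolist(), len(s)]
parts = [s.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]

for part in parts:
    print(part, end="\n\n")
```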
I have a dataframe where the left column is the leftmost location of an object and the right column is its rightmost location. I need to group the objects if they overlap, or if they overlap objects that overlap (recursively).
So, for example, if this is my dataframe:
left right
0 0 4
1 5 8
2 10 13
3 3 7
4 12 19
5 18 23
6 31 35
Rows 0 and 3 overlap, so they should be in the same group; row 1 also overlaps row 3, so it joins that group.
So, for this example, the output should be something like this:
left right group
0 0 4 0
1 5 8 0
2 10 13 1
3 3 7 0
4 12 19 1
5 18 23 1
6 31 35 2
I thought of various directions, but didn't figure it out (without an ugly for).
Any help will be appreciated!
I found the accepted solution (update: now deleted) to be misleading because it fails to generalize to similar cases. e.g. for the following example:
df = pd.DataFrame({'left': [0,5,10,3,12,13,18,31],
'right':[4,8,13,7,19,16,23,35]})
df
The suggested aggregate function outputs the following dataframe (note that the 18-23 should be in group 1, along with 12-19).
One solution is to use the following approach (based on a method for combining intervals posted by @CentAu):
# Union intervals by @CentAu
from sympy import Interval, Union

def union(data):
    """ Union of a list of intervals e.g. [(1,2),(3,4)] """
    intervals = [Interval(begin, end) for (begin, end) in data]
    u = Union(*intervals)
    return [u] if isinstance(u, Interval) \
        else list(u.args)

# Create a list of intervals
df['left_right'] = df[['left', 'right']].apply(list, axis=1)
intervals = union(df.left_right)

# Add a group column
df['group'] = df['left'].apply(lambda x: [g for g,l in enumerate(intervals) if
                                          l.contains(x)][0])
...which outputs:
You can try this: use a rolling max and a rolling min to find the intersection of adjacent ranges:
df=df.sort_values(['left','right'])
df['Group']=((df.right.rolling(window=2,min_periods=1).min()-df.left.rolling(window=2,min_periods=1).max())<0).cumsum()
df.sort_index()
Out[331]:
left right Group
0 0 4 0
1 5 8 0
2 10 13 1
3 3 7 0
4 12 19 1
5 18 23 1
6 31 35 2
For example, take (1,3) and (2,4).
To find the intersection:
min(3,4)-max(1,2)=1; since 1 is greater than 0, the two intervals intersect.
You can sort samples and utilize cumulative functions cummax and cumsum. Let's take your example:
left right
0 0 4
3 3 7
1 5 8
2 10 13
4 12 19
5 13 16
6 18 23
7 31 35
First you need to sort values so that longer ranges come first:
df = df.sort_values(['left', 'right'], ascending=[True, False])
Result:
left right
0 0 4
3 3 7
1 5 8
2 10 13
4 12 19
5 13 16
6 18 23
7 31 35
Then you can find overlapping groups through comparing 'left' with previous 'right' values:
df['group'] = (df['right'].cummax().shift() <= df['left']).cumsum()
df.sort_index(inplace=True)
Result:
left right group
0 0 4 0
1 5 8 0
2 10 13 1
3 3 7 0
4 12 19 1
5 13 16 1
6 18 23 1
7 31 35 2
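For anyone who wants to run this end to end, here is a self-contained sketch of the sort + cummax + cumsum steps on the eight-interval example. The comparison uses <=, so two intervals that merely touch (one's right equals the other's left) land in different groups; switching to < would merge them. Which one you want is an assumption about your data.

```python
import pandas as pd

df = pd.DataFrame({
    "left":  [0, 5, 10, 3, 12, 13, 18, 31],
    "right": [4, 8, 13, 7, 19, 16, 23, 35],
})

# Sort by start; for equal starts, put the longer interval first.
ordered = df.sort_values(["left", "right"], ascending=[True, False])

# Furthest right edge seen among all earlier intervals.
reach = ordered["right"].cummax().shift()

# A new group starts when the current interval begins at or after that edge.
# The first row compares against NaN, which is False, so it gets group 0.
df["group"] = (reach <= ordered["left"]).cumsum()

print(df)
```

The assignment back to df aligns on the index, so the original row order is kept without an explicit sort_index.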
I have a very large data frame (hundreds of millions of rows). There are two group ID's, group_id_1 and group_id_2. The data frame looks like this:
group_id_1 group_id_2 value1 time
1 2 45 1
1 2 49 2
1 4 95 1
1 4 55 2
2 2 44 1
2 4 88 1
2 4 90 2
For each group_id_1 x group_id_2 combo, I need to duplicate the row with the latest time, and increment the time by one. In other words, my table should look like:
group_id_1 group_id_2 value1 time
1 2 45 1
1 2 49 2
1 2 49 3
1 4 95 1
1 4 55 2
1 4 55 3
2 2 44 1
2 2 44 2
2 4 88 1
2 4 90 2
2 4 90 3
Right now, I am doing:
for name, group in df.groupby(['group_id_1', 'group_id_2']):
    last, = group.sort_values(by='time').tail(1)['time'].values
    temp = group[group['time']==last]
    temp.loc[:, 'time'] = last + 1
    group = group.append(temp)
This is insanely inefficient. If I put the above code into a function, and use the .apply() method with the groupby object, it also takes an enormous amount of time.
How do I speed this process up?
You can use groupby with the last aggregation, increment time with add, and concat the result to the original:
df1 = df.sort_values(by='time').groupby(['group_id_1', 'group_id_2']).last().reset_index()
df1.time = df1.time.add(1)
print (df1)
group_id_1 group_id_2 value1 time
0 1 2 49 3
1 1 4 55 3
2 2 2 44 2
3 2 4 90 3
df = pd.concat([df,df1])
df = df.sort_values(['group_id_1','group_id_2']).reset_index(drop=True)
print (df)
group_id_1 group_id_2 value1 time
0 1 2 45 1
1 1 2 49 2
2 1 2 49 3
3 1 4 95 1
4 1 4 55 2
5 1 4 55 3
6 2 2 44 1
7 2 2 44 2
8 2 4 88 1
9 2 4 90 2
10 2 4 90 3
First, sort the dataframe by time (this should be more efficient than sorting each group by time):
df = df.sort_values('time')
Second, get the last row in each group (without sorting the groups to improve performance):
last = df.groupby(['group_id_1', 'group_id_2'], sort=False, as_index=False).last()
Third, increment the time:
last['time'] = last['time'] + 1
Fourth, concatenate:
df = pd.concat([df, last])
Fifth, sort back to the original order:
df = df.sort_values(['group_id_1', 'group_id_2'])
Explanation: concatenating and then sorting will be much faster than inserting rows one by one.
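Putting the steps together on the question's sample, a minimal runnable sketch could look like this. The names keys, last and out are illustrative, and as_index=False is used so that the group ids stay as ordinary columns and survive the concat.

```python
import pandas as pd

df = pd.DataFrame({
    "group_id_1": [1, 1, 1, 1, 2, 2, 2],
    "group_id_2": [2, 2, 4, 4, 2, 4, 4],
    "value1":     [45, 49, 95, 55, 44, 88, 90],
    "time":       [1, 2, 1, 2, 1, 1, 2],
})
keys = ["group_id_1", "group_id_2"]

# One global sort by time, then take the last row of every group.
last = df.sort_values("time").groupby(keys, as_index=False).last()

# The duplicated row gets the next time step.
last["time"] += 1

# Append all new rows at once and restore a per-group, time-ordered layout.
out = (pd.concat([df, last], ignore_index=True)
         .sort_values(keys + ["time"], ignore_index=True))
print(out)
```

Sorting on time as well as the group ids makes the final order explicit instead of depending on the order of the concatenated pieces.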