Looking for a sequential pattern with condition - python

I have a df as
Id Event SeqNo
1 A 1
1 B 2
1 C 3
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
1 D 9
1 B 10
1 ABD 11
1 D 12
1 B 13
1 CDE 14
1 A 15
I am looking for the pattern "ABD followed by CDE without having event B in between them".
For example, the output for this df will be:
Id Event SeqNo
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
This pattern can occur multiple times for a single Id, and I want to find the list of all those Ids and their respective counts (if possible).

Here's a vectorized one with some scaling trickery and leveraging convolution to find the required pattern -
import numpy as np
# Get the col in context and scale it to the three strings to form an ID array
a = df['Event']
id_ar = (a=='ABD') + 2*(a=='B') + 3*(a=='CDE')
# Mask of those specific strings and hence extract the corresponding masked df
# (copy so that adding a column does not raise SettingWithCopyWarning)
mask = id_ar>0
df1 = df[mask].copy()
# Get pattern col with 1s at places with the pattern found, 0s elsewhere
df1['Pattern'] = (np.convolve(id_ar[mask],[9,1],'same')==28).astype(int)
# Groupby Id col and sum the pattern col for final output
out = df1.groupby(['Id'])['Pattern'].sum()
That convolution part might be a bit tricky. The idea is to use id_ar, which has values of 1, 2 and 3 corresponding to the strings 'ABD', 'B' and 'CDE'. We are looking for a 1 immediately followed by a 3 in the filtered array, so convolving with the kernel [9,1] gives 1*1 + 3*9 = 28 for the window that has 'ABD' and then 'CDE', and no other pair of values adds up to 28. Hence, we look for a conv. sum of 28 for the match, and the match is flagged on the 'CDE' row. For the case of 'ABD' followed by 'B' and then 'CDE', the conv. sums are 1*1 + 2*9 = 19 and 2*1 + 3*9 = 29, so that case is filtered out.
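To make the arithmetic concrete, here is a minimal sketch on a hand-made id array (the values below are illustrative, not taken from the sample run). It only shows which windows produce the sum 28.

```python
import numpy as np

# Illustrative filtered sequence: B, ABD, CDE, B, ABD, B, CDE
ids = np.array([2, 1, 3, 2, 1, 2, 3])

# With kernel [9, 1] and mode 'same', each output is 9*current + 1*previous
conv = np.convolve(ids, [9, 1], 'same')

# Only "ABD then CDE" (1 followed by 3) gives 1*1 + 3*9 = 28
hits = conv == 28
print(conv)
print(hits)
```

Only the third position is flagged: the ABD at position 4 is followed by a B, which yields 19 and then 29 instead of 28.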
Sample run -
1) Input dataframe :
In [377]: df
Out[377]:
Id Event SeqNo
0 1 A 1
1 1 B 2
2 1 C 3
3 1 ABD 4
4 1 B 5
5 1 C 6
6 1 A 7
7 1 CDE 8
8 1 D 9
9 1 B 10
10 1 ABD 11
11 1 D 12
12 1 B 13
13 2 A 1
14 2 B 2
15 2 C 3
16 2 ABD 4
17 2 A 5
18 2 C 6
19 2 A 7
20 2 CDE 8
21 2 D 9
22 2 B 10
23 2 ABD 11
24 2 D 12
25 2 B 13
26 2 CDE 14
27 2 A 15
2) Intermediate filtered o/p (look at column Pattern for the presence of the reqd. pattern) :
In [380]: df1
Out[380]:
Id Event SeqNo Pattern
1 1 B 2 0
3 1 ABD 4 0
4 1 B 5 0
7 1 CDE 8 0
9 1 B 10 0
10 1 ABD 11 0
12 1 B 13 0
14 2 B 2 0
16 2 ABD 4 0
20 2 CDE 8 1
22 2 B 10 0
23 2 ABD 11 0
25 2 B 13 0
26 2 CDE 14 0
3) Final o/p :
In [381]: out
Out[381]:
Id
1 0
2 1
Name: Pattern, dtype: int64

My solution is based on the assumption that anything other than ABD, CDE and B is irrelevant, so I get rid of those events first with a filtering operation.
Then, what I want to know is whether there is an ABD followed by a CDE without a B in between. I shift the Event column by one step (note this doesn't have to be one step in units of SeqNo).
Then I check for every row of the new df whether Event == ABD and Event_1_Step == CDE, meaning that there wasn't a B in between, but possibly other stuff like A or C or even nothing. This gets me a list of booleans, one for every time I have a sequence like that. If I sum them up, I get the count.
Finally, I have to make sure this is all done at Id level, so I use .groupby.
IMPORTANT: This solution assumes that your df is sorted by Id first and then by SeqNo. If not, please sort it first.
import pandas as pd
df = pd.read_csv("path/to/file.csv")
# Keep only the relevant events; copy so new columns can be added safely
df2 = df[df["Event"].isin(["ABD", "CDE", "B"])].copy()
# Shift within each Id so the "next" event never comes from another Id
df2["Event_1_Step"] = df2.groupby("Id")["Event"].shift(-1)
df2["SeqNo_1_Step"] = df2.groupby("Id")["SeqNo"].shift(-1)
for id_, id_df in df2.groupby("Id"):
    print(id_)  # Set a counter object here per Id to track count per id
    # Rows where an ABD is directly followed by a CDE in the filtered frame
    id_df = id_df[(id_df["Event"] == "ABD") & (id_df["Event_1_Step"] == "CDE")]
    for row_id, row in id_df.iterrows():
        print(df[(df["Id"] == id_) & df["SeqNo"].between(row["SeqNo"], row["SeqNo_1_Step"])])

You could use this:
s = (pd.Series(
        np.select([df['Event'] == 'ABD', df['Event'] == 'B', df['Id'] != df['Id'].shift()],
                  [True, False, False], default=np.nan))
     .ffill()
     .fillna(False)
     .astype(bool))
corr = (df['Event'] == "CDE") & s
corr.groupby(df['Id']).max()
np.select creates a column which is True where Event == 'ABD', False where Event == 'B' or at the first row of a new Id (unless that row is itself ABD), and NaN everywhere else. Forward filling with ffill then tells you, for every row, whether ABD or B was seen last. Then you can check whether that flag is True where the event is CDE. Finally, GroupBy with max checks whether it is True for any row per Id.
Which for
Id Event SeqNo
1 A 1
1 B 2
1 C 3
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
1 D 9
1 B 10
1 ABD 11
1 D 12
1 B 13
1 CDE 14
1 A 15
2 B 16
3 ABD 17
3 B 18
3 CDE 19
4 ABD 20
4 CDE 21
5 CDE 22
Outputs:
Id
1 True
2 False
3 False
4 True
5 False
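The question also asks for a count per Id, which a yes/no answer does not give. As a cross-check, a minimal plain-Python sketch of the rule could look like this. The helper name count_patterns is hypothetical, and it assumes that a CDE "uses up" the preceding ABD, so ABD, CDE, CDE counts once; adjust that line if your definition differs.

```python
import pandas as pd

def count_patterns(events):
    """Count ABD ... CDE runs that have no B in between (hypothetical helper)."""
    armed = False  # True once an ABD has been seen with no B since
    count = 0
    for ev in events:
        if ev == 'ABD':
            armed = True
        elif ev == 'B':
            armed = False
        elif ev == 'CDE' and armed:
            count += 1
            armed = False  # assumption: one CDE closes one ABD
    return count

df = pd.DataFrame({
    'Id': [1] * 15 + [3] * 3 + [4] * 2,
    'Event': ['A', 'B', 'C', 'ABD', 'A', 'C', 'A', 'CDE', 'D', 'B',
              'ABD', 'D', 'B', 'CDE', 'A',
              'ABD', 'B', 'CDE',
              'ABD', 'CDE'],
    'SeqNo': list(range(1, 16)) + [17, 18, 19] + [20, 21],
})

# Sort first so events are visited in sequence order within each Id
counts = (df.sort_values(['Id', 'SeqNo'])
            .groupby('Id')['Event']
            .apply(count_patterns))
print(counts)
```

This loops in Python, so it is slower than the vectorized answers, but it is easy to read and handy for validating them on small samples.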


Newly created column in a data frame needs to be updated with values based on a condition from another column

The DF has four columns. Column 'Id' is unique, and the rows are grouped by column 'idhogar'.
Column 'parentesco1' has status 0 or 1. Column 'Target' has values that can differ between rows with the same 'idhogar' value.
INDEX Id parentesco1 idhogar Target
0 ID_fe8c32eba 0 4616164 2
1 ID_ca701e058 1 4616164 2
2 ID_5ad4372cd 0 4983866 3
3 ID_1e320689c 1 4983866 3
4 ID_700e30a8d 0 5905417 2
5 ID_bc99ecfb8 0 5905417 2
6 ID_308a05a16 1 5905417 2
7 ID_00186dde5 1 7.56E+06 4
8 ID_34570a74c 1 20713493 4
9 ID_b13870a19 1 27651991 3
10 ID_74e989389 1 45038655 4
11 ID_726ba7d34 0 60027579 4
12 ID_b75d7c648 0 60027579 4
13 ID_37e7b3aaa 1 60027579 4
14 ID_396da5a70 0 104578907 2
15 ID_4381374bb 1 104578907 2
16 ID_272a9b4d5 0 119024319 4
17 ID_1225f3779 0 119024319 4
18 ID_fc5dfaa2e 0 119024319 4
19 ID_7390a3f99 1 119024319 4
A new column 'Rev_target' has been created. For all the rows in the same 'idhogar' group, it needs to hold the 'Target' value of the row that has 'parentesco1' equal to 1.
I tried the following, but it was not successful.
for idhogar in df['idhogar'].unique():
    if len(df[df['idhogar'] == idhogar]['Target'].unique()) != 1:
        rev_target_val = df[(df['idhogar'] == idhogar) & (df['parentesco1'] == 1)]['Target']
        df['Rev_target'] = rev_target_val
# NOT WORKING AS REQUIRED ---- gives output as NaN in all rows of newly created column
I tried the below, but it throws an error:
for idhogar in df['idhogar'].unique():
    rev_target_val = df[(df['idhogar'] == idhogar) & (df['parentesco1'] == 1)]['Target']
    df['Rev_target'] = np.where(len(df[df['idhogar'] == idhogar]['Target'].unique()) != 1,
                                rev_target_val, df['Target'])
ValueError: operands could not be broadcast together with shapes () (0,) (9557,)
I tried the below, but it does not work as intended: it gives the same value, 2, in all the rows of the new 'Rev_target' column.
for idhogar in df['idhogar'].unique():
    rev_target_val = df[(df['idhogar'] == idhogar) & (df['parentesco1'] == 1)]['Target']
    df['Rev_target'] = df.apply(lambda x: rev_target_val if (len(df[df['idhogar'] == idhogar]
                                ['Target'].unique()) != 1) else df['Target'], axis=1)
Would appreciate a solution from you and thanks in advance.
I would sort the dataframe on parentesco1 in descending order to make sure that the parentesco1 1 row is the first row. Then a transform could easily access that row:
df['Rev_target'] = (df.sort_values('parentesco1', ascending=False)
                      .groupby('idhogar')
                      .transform(lambda x: x.iloc[0])['Target'])
It gives:
INDEX Id parentesco1 idhogar Target Rev_target
0 0 ID_fe8c32eba 0 4616164.0 2 2
1 1 ID_ca701e058 1 4616164.0 2 2
2 2 ID_5ad4372cd 0 4983866.0 3 3
3 3 ID_1e320689c 1 4983866.0 3 3
4 4 ID_700e30a8d 0 5905417.0 2 2
5 5 ID_bc99ecfb8 0 5905417.0 2 2
6 6 ID_308a05a16 1 5905417.0 2 2
7 7 ID_00186dde5 1 7560000.0 4 4
8 8 ID_34570a74c 1 20713493.0 4 4
9 9 ID_b13870a19 1 27651991.0 3 3
10 10 ID_74e989389 1 45038655.0 4 4
11 11 ID_726ba7d34 0 60027579.0 4 4
12 12 ID_b75d7c648 0 60027579.0 4 4
13 13 ID_37e7b3aaa 1 60027579.0 4 4
14 14 ID_396da5a70 0 104578907.0 2 2
15 15 ID_4381374bb 1 104578907.0 2 2
16 16 ID_272a9b4d5 0 119024319.0 4 4
17 17 ID_1225f3779 0 119024319.0 4 4
18 18 ID_fc5dfaa2e 0 119024319.0 4 4
19 19 ID_7390a3f99 1 119024319.0 4 4
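An alternative that avoids sorting is to build a lookup from 'idhogar' to the Target of the 'parentesco1' == 1 row and map it back. This is only a sketch on a small made-up frame (not the data above), and it assumes that a group without any 'parentesco1' == 1 row should keep its own Target; the name head_target is hypothetical.

```python
import pandas as pd

# Toy data for illustration: group 10 has conflicting Targets,
# group 30 has no row with parentesco1 == 1
df = pd.DataFrame({
    'Id': ['a', 'b', 'c', 'd', 'e'],
    'parentesco1': [0, 1, 0, 1, 0],
    'idhogar': [10, 10, 20, 20, 30],
    'Target': [2, 3, 4, 4, 1],
})

# One Target per idhogar, taken from the row where parentesco1 == 1
head_target = (df.loc[df['parentesco1'].eq(1)]
                 .drop_duplicates('idhogar')
                 .set_index('idhogar')['Target'])

# Broadcast to every row of the group; fall back to the row's own Target
df['Rev_target'] = (df['idhogar'].map(head_target)
                                 .fillna(df['Target'])
                                 .astype(int))
print(df)
```

The fallback in fillna is a design choice: drop it if groups without such a row should show NaN instead.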

Map values to separate col using conditional statement: python

I'm hoping to use a conditional statement to create a new column but I'm unsure on the best way to proceed.
Using below, I essentially have various Items that contain a specific Direction for a given point in time. I want to use the ID to provide the correct Direction. So match the values in Items and ID to determine the correct Direction.
I've manually inserted this below as Main Direction but will need to automate this. I then want to pass a conditional statement to X. Specifically, if == 'Up' then add 10, if == 'Down' then subtract 10.
import pandas as pd
df = pd.DataFrame({
'Time' : [1,1,1,1,2,2,2,2,3,3,3,3],
'ID' : ['A','A','A','A','B','B','B','B','A','A','A','A'],
'Items' : ['A','B','A','A','B','A','A','B','A','A','B','B'],
'Direction' : ['Up','Down','Up','Up','Down','Up','Up','Down','Up','Up','Down','Down'],
'Main Direction' : ['Up','Up','Up','Up','Down','Down','Down','Down','Up','Up','Up','Up'],
'X' : [1,2,3,4,6,7,8,9,3,4,5,6],
})
df['Dist'] = [df['X'] + 10 if x == 'Up' else df['X'] -10 for x in df['Main Direction']]
intended output:
Time ID Items Direction Main Direction X Dist
0 1 A A Up Up 1 11
1 1 A B Down Up 2 12
2 1 A A Up Up 3 13
3 1 A A Up Up 4 14
4 2 B B Down Down 6 -4
5 2 B A Up Down 7 -3
6 2 B A Up Down 8 -2
7 2 B B Down Down 9 -1
8 3 A A Up Up 3 13
9 3 A A Up Up 4 14
10 3 A B Down Up 5 15
11 3 A B Down Up 6 16
Try with np.where
df.X=np.where(df['Main Direction'].eq('Up'),df.X+10,df.X-10)
df
Time ID Items Direction Main Direction X
0 1 A A Up Up 11
1 1 A B Down Up 12
2 1 A A Up Up 13
3 1 A A Up Up 14
4 2 B B Down Down -4
5 2 B A Up Down -3
6 2 B A Up Down -2
7 2 B B Down Down -1
8 3 A A Up Up 13
9 3 A A Up Up 14
10 3 A B Down Up 15
11 3 A B Down Up 16
Here is how to do the Main direction column
df['testDirec'] = np.where(df['ID'] == df['Items'],df['Direction'],None)
df['testDirec'] = df['testDirec'].ffill()
Gives the same as Main Direction
Time ID Items Direction Main Direction X testDirec
0 1 A A Up Up 1 Up
1 1 A B Down Up 2 Up
2 1 A A Up Up 3 Up
3 1 A A Up Up 4 Up
4 2 B B Down Down 6 Down
5 2 B A Up Down 7 Down
6 2 B A Up Down 8 Down
7 2 B B Down Down 9 Down
8 3 A A Up Up 3 Up
9 3 A A Up Up 4 Up
10 3 A B Down Up 5 Up
11 3 A B Down Up 6 Up
You can use any custom function with apply. If you're operating row-wise (e.g. need more than one column) you'll call this with axis=1, and it will pass a row. Otherwise you can just do df['my_col'].apply(lambda x: do_something_to_x(x)) and it will pass values of the col. For your example:
def calc_dist(row):
    if row['Main Direction'] == 'Up':
        return row['X'] + 10
    else:
        return row['X'] - 10

df['Dist'] = df.apply(calc_dist, axis=1)
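Putting the two parts of the question together, a minimal sketch could derive Main Direction per Time and then compute Dist in one go. It assumes that each Time has at least one row where Items equals ID, and that all such rows within a Time agree on the Direction; the intermediate name match is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Time': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'ID': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A'],
    'Items': ['A', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'B'],
    'Direction': ['Up', 'Down', 'Up', 'Up', 'Down', 'Up', 'Up', 'Down',
                  'Up', 'Up', 'Down', 'Down'],
    'X': [1, 2, 3, 4, 6, 7, 8, 9, 3, 4, 5, 6],
})

# Keep Direction only on rows where the item is the one named by ID
match = df['Direction'].where(df['ID'] == df['Items'])

# 'first' skips missing values, so every row of a Time gets that Direction
df['Main Direction'] = match.groupby(df['Time']).transform('first')

# Add 10 when the main direction is Up, subtract 10 otherwise
df['Dist'] = np.where(df['Main Direction'].eq('Up'), df['X'] + 10, df['X'] - 10)
print(df)
```

Grouping by Time, rather than forward filling over the whole frame, keeps one Time from leaking its direction into the next when its first row is not a match.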

Get longest streak of consecutive weeks by group in pandas

Currently I'm working with weekly data for different subjects, but it might have some long streaks without data, so, what I want to do, is to just keep the longest streak of consecutive weeks for every id. My data looks like this:
id week
1 8
1 15
1 60
1 61
1 62
2 10
2 11
2 12
2 13
2 25
2 26
My expected output would be:
id week
1 60
1 61
1 62
2 10
2 11
2 12
2 13
I got a bit close, trying to mark with a 1 when week==week.shift()+1. The problem is this approach doesn't mark the first occurrence in a streak, and also I can't filter the longest one:
df.loc[ (df['id'] == df['id'].shift())&(df['week'] == df['week'].shift()+1),'streak']=1
This, according to my example, would bring this:
id week streak
1 8 nan
1 15 nan
1 60 nan
1 61 1
1 62 1
2 10 nan
2 11 1
2 12 1
2 13 1
2 25 nan
2 26 1
Any ideas on how to achieve what I want?
Try this:
df['consec'] = df.groupby(['id',df['week'].diff(-1).ne(-1).shift().bfill().cumsum()])['week'].transform('count')
df[df.groupby('id')['consec'].transform('max') == df.consec]
Output:
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
Not as concise as @ScottBoston but I like this approach
def max_streak(s):
    a = s.values  # Let's deal with an array
    # I need to know where the differences are not `1`.
    # Also, because I plan to use `diff` again, I'll wrap
    # the boolean array with `True` to make things cleaner
    b = np.concatenate([[True], np.diff(a) != 1, [True]])
    # Tell the locations of the breaks in streak
    c = np.flatnonzero(b)
    # `diff` again tells me the length of the streaks
    d = np.diff(c)
    # `argmax` will tell me the location of the largest streak
    e = d.argmax()
    return c[e], d[e]

def make_thing(df):
    start, length = max_streak(df.week)
    return df.iloc[start:start + length].assign(consec=length)

pd.concat([
    make_thing(g) for _, g in df.groupby('id')
])
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
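For comparison, here is a step-by-step sketch of the same idea that fixes the issue from the question (the first week of a streak not being marked). It assumes the frame is sorted by id and week; the names new_streak, streak_id and length are hypothetical. If two streaks tie for the longest within an id, both are kept.

```python
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'week': [8, 15, 60, 61, 62, 10, 11, 12, 13, 25, 26],
})

# A new streak starts when the week is not previous + 1 or the id changes
new_streak = df['week'].diff().ne(1) | df['id'].ne(df['id'].shift())

# Running total of streak starts gives each streak its own label
streak_id = new_streak.cumsum()

# Length of the streak each row belongs to (first row included)
length = df.groupby(streak_id)['week'].transform('count')

# Keep rows whose streak length equals the longest one for their id
longest = df[length == length.groupby(df['id']).transform('max')]
print(longest)
```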

Logical AND of multiple columns in pandas

I have a dataframe(edata) as given below
Domestic Catsize Type Count
1 0 1 1
1 1 1 8
1 0 2 11
0 1 3 14
1 1 4 21
0 1 4 31
From this dataframe I want to calculate the sum of all counts where the logical AND of both variables (Domestic and Catsize) results in Zero (0) such that
1 0 0
0 1 0
0 0 0
The code I use to perform the process is
g = edata.groupby('Type')
q3 = g.apply(lambda x: x[((x['Domestic'] == 0) & (x['Catsize'] == 0) |
                          (x['Domestic'] == 0) & (x['Catsize'] == 1) |
                          (x['Domestic'] == 1) & (x['Catsize'] == 0)
                          )]
             ['Count'].sum()
             )
q3
Type
1 1
2 11
3 14
4 31
This code works fine; however, if the number of variables in the dataframe increases, then the number of conditions grows rapidly. So, is there a smart way to write a condition stating that if ANDing the two (or more) variables results in zero, then the sum() function is performed?
You can filter first using pd.DataFrame.all negated:
cols = ['Domestic', 'Catsize']
res = df[~df[cols].all(axis=1)].groupby('Type')['Count'].sum()
print(res)
# Type
# 1 1
# 2 11
# 3 14
# 4 31
# Name: Count, dtype: int64
Use np.logical_and.reduce to generalise.
columns = ['Domestic', 'Catsize']
df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
Type
1 1
2 11
3 14
4 31
Name: Count, dtype: int64
Before adding it back, use map to broadcast:
u = df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
df['NewCol'] = df.Type.map(u)
df
Domestic Catsize Type Count NewCol
0 1 0 1 1 1
1 1 1 1 8 1
2 1 0 2 11 11
3 0 1 3 14 14
4 1 1 4 21 31
5 0 1 4 31 31
How about
columns = ['Domestic', 'Catsize']
df.loc[~df[columns].prod(axis=1).astype(bool), 'Count']
and then do with it whatever you want.
For logical AND the product does the trick nicely.
For logical OR you can use sum(axis=1) with proper negation in advance.
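As a quick sanity check, the sketch below rebuilds the question's edata frame and confirms that the all-based mask and the product-based mask select the same rows. It assumes the flag columns only contain 0 and 1; the variable names are hypothetical.

```python
import pandas as pd

edata = pd.DataFrame({
    'Domestic': [1, 1, 1, 0, 1, 0],
    'Catsize':  [0, 1, 0, 1, 1, 1],
    'Type':     [1, 1, 2, 3, 4, 4],
    'Count':    [1, 8, 11, 14, 21, 31],
})
cols = ['Domestic', 'Catsize']  # add more flag columns here as needed

# The AND of the flags is zero when not all of them are truthy
mask_all = ~edata[cols].astype(bool).all(axis=1)

# Same condition via the product: any 0 makes the product 0
mask_prod = edata[cols].prod(axis=1).eq(0)

res = edata.loc[mask_all].groupby('Type')['Count'].sum()
print(res)
```

Extending the condition to more variables only means adding column names to cols.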

Loop to create a new row for a given range of numbers

I am sure this is an easy fix, but I haven't been able to find the exact solution to my problem. My data set has a column called 'LANE' which contains unique values. I want to add rows for each 'LANE' based on a range of numbers (which would be 0 to 12). As a result each 'LANE' would have 13 rows with a new column 'NUMBER' ranging from 0 up to and including 12.
Example:
Input
LANE
a
b
Output
LANE NUMBER
a 0
a 1
a 2
a 3
a 4
a 5
a 6
a 7
a 8
a 9
a 10
a 11
a 12
b 0
b 1
b 2
b 3
b 4
b 5
b 6
b 7
b 8
b 9
b 10
b 11
b 12
I am currently trying different forms of:
num = 0
while num <= 12:
    for x in df['LANE']:
        df['NUMBER'] = num
    num += 1
The problem with this loop is, I still have one record for each lane and the 'NUMBER' column only has the value 12.
Comprehension
For loops are the natural and naive way to produce Cartesian products. Comprehensions allow us to embed this more succinctly.
pd.DataFrame(
    [[l, n] for l in df.LANE for n in range(13)],  # range(13) yields 0 through 12
    columns=['LANE', 'NUMBER']
)
LANE NUMBER
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 a 6
7 a 7
8 a 8
9 a 9
10 a 10
11 a 11
12 a 12
13 b 0
14 b 1
15 b 2
16 b 3
17 b 4
18 b 5
19 b 6
20 b 7
21 b 8
22 b 9
23 b 10
24 b 11
25 b 12
itertools.product
This logic is almost identical to the Comprehension solution but it uses itertools' built-in product function. product is an iterator that pops out each combination one at a time. I force the result by unpacking with the splat * like so [*product(a, b)]. Ultimately, it is a list of tuples that gets passed to the pd.DataFrame constructor in the same way as the Comprehension solution above. range(13) is used so that NUMBER runs from 0 up to and including 12.
from itertools import product
pd.DataFrame([*product(df.LANE, range(13))], columns=['LANE', 'NUMBER'])
groupby/cumcount and repeat
I don't like this answer but it provides some perspective on the simplicity of the other answers.
I use repeat to replicate each index value 13 times (once for each number from 0 to 12). I use this repeated index in a loc, which returns a dataframe sliced with the passed index. I then use groupby's cumcount to count each position within the group and add that as a new column.
df.loc[df.index.repeat(13)].assign(NUMBER=lambda d: d.groupby('LANE').cumcount())
LANE NUMBER
0 a 0
0 a 1
0 a 2
0 a 3
0 a 4
0 a 5
0 a 6
0 a 7
0 a 8
0 a 9
0 a 10
0 a 11
0 a 12
1 b 0
1 b 1
1 b 2
1 b 3
1 b 4
1 b 5
1 b 6
1 b 7
1 b 8
1 b 9
1 b 10
1 b 11
1 b 12
Another approach using pandas as below:
# First approach, one liner code (13 rows per lane: 0 through 12)
df = pd.DataFrame({'Lane': ['a'] * 13 + ['b'] * 13,
                   'Number': list(range(13)) * 2})
# Second approach
df = pd.DataFrame({'Lane': ['a'] * 13 + ['b'] * 13})
df['Number'] = df.groupby('Lane').cumcount()
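Another way to express the same Cartesian product is a cross join, sketched below. This assumes pandas 1.2 or later, where merge accepts how='cross'; the frame names lanes and numbers are hypothetical.

```python
import pandas as pd

lanes = pd.DataFrame({'LANE': ['a', 'b']})

# range(13) yields 0 through 12, i.e. 13 rows per lane
numbers = pd.DataFrame({'NUMBER': range(13)})

# Every lane paired with every number
out = lanes.merge(numbers, how='cross')
print(out)
```

Any other columns in the lanes frame are carried along on every repeated row, which the list-based constructions do not do on their own.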
