How to split pandas dataframe using periodic values column - python

I have a pandas data frame that looks something like this:
v1 v2 v3 result
0 12 31 31 0
1 34 52 4 1
2 32 4 5 1
3 7 89 2 0
4 5 17 8 1
5 11 25 23 1
6 2 32 34 1
7 0 1 3 0
As you may note, in the very last column it has a pattern of zeroes and ones.
Is it possible to split this data frame into two sub-data frames?
My desired output will be:
df1:
v1 v2 v3 result
0 34 52 4 1
1 32 4 5 1
df2:
0 5 17 8 1
1 11 25 23 1
2 2 32 34 1
df.groupby() will definitely not work, as it will just create two big dataframes; one with ones, the second one with zeroes. I am not interested in keeping data marked as zeroes.
Thanks in advance!
PS.
In reality this dataframe is much bigger, so I am trying to create df1, df2, ... dfn

You can create dictionary of DataFrames:
mask = df['result'].eq(1)
a = pd.factorize(df['result'].eq(0).cumsum()[mask])[0]
dfs = dict(tuple(df[mask].groupby(a)))
print (dfs[0])
v1 v2 v3 result
1 34 52 4 1
2 32 4 5 1
print (dfs[1])
v1 v2 v3 result
4 5 17 8 1
5 11 25 23 1
6 2 32 34 1
Details:
Create boolean mask for filtering by eq (==):
mask = df['result'].eq(1)
print (mask)
0 False
1 True
2 True
3 False
4 True
5 True
6 True
7 False
Name: result, dtype: bool
Create counter Series by comparing by 0 and Series.cumsum:
print (df['result'].eq(0).cumsum())
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 3
Name: result, dtype: int32
Filter by boolean indexing to keep only the rows where result is 1:
print (df['result'].eq(0).cumsum()[mask])
1 1
2 1
4 2
5 2
6 2
Name: result, dtype: int32
Add factorize so the group numbers start at 0:
a = pd.factorize(df['result'].eq(0).cumsum()[mask])[0]
print (a)
[0 0 1 1 1]
Create dictionary from groupby object, but also filter rows by boolean mask:
dfs = dict(tuple(df[mask].groupby(a)))
print (dfs)
{0: v1 v2 v3 result
1 34 52 4 1
2 32 4 5 1, 1: v1 v2 v3 result
4 5 17 8 1
5 11 25 23 1
6 2 32 34 1}

# Flag the rows that will be the beginning of a new dataframe
df['_start_new_gp'] = (df.result == 1) & (df.result.shift() == 0)
# Get rid of the rows where result == 0 (the copy is optional, but it avoids a SettingWithCopyWarning on the next assignment)
df2 = df[df.result == 1].copy()
# Use a cumulative sum on the '_start_new_gp' column to create a "group number"
df2['_group_number'] = df2['_start_new_gp'].cumsum()
# Group by "group number"
grouped = df2.groupby('_group_number')
# Get list of dataframes
dataframes = [group for _, group in grouped]
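The flag-and-cumsum steps above can be sketched end to end on the question's sample data. This is only a minimal sketch: the names is_one, starts and run_id are made up for illustration, and it keeps the helper values in separate Series instead of adding columns to df.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "v1": [12, 34, 32, 7, 5, 11, 2, 0],
    "v2": [31, 52, 4, 89, 17, 25, 32, 1],
    "v3": [31, 4, 5, 2, 8, 23, 34, 3],
    "result": [0, 1, 1, 0, 1, 1, 1, 0],
})

is_one = df["result"].eq(1)
# A run starts where a 1 is not preceded by another 1
# (fill_value=False also treats a 1 in the very first row as a start)
starts = is_one & ~is_one.shift(fill_value=False)
# Number the runs, then keep the numbers only for the rows marked 1
run_id = starts.cumsum()[is_one]
# One DataFrame per run, each re-indexed from 0 as in the desired output
dataframes = [g.reset_index(drop=True) for _, g in df[is_one].groupby(run_id)]

for part in dataframes:
    print(part, end="\n\n")
```

Dropping the reset_index call keeps the original row labels, which is what the other answers show.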

Using numpy.split:
s = df.loc[df.result.eq(1)]
idx = np.where(np.diff(s.index)!=1)[0] + 1
for d in np.split(s, idx):
    print(d, end='\n\n')
v1 v2 v3 result
1 34 52 4 1
2 32 4 5 1
v1 v2 v3 result
4 5 17 8 1
5 11 25 23 1
6 2 32 34 1
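A variant of the same idea, in case the index is not a plain 0..n-1 range: split the row positions instead of the DataFrame itself, so np.split only ever sees a NumPy array. This is a sketch under that assumption; pos, chunks and parts are illustrative names.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "v1": [12, 34, 32, 7, 5, 11, 2, 0],
    "v2": [31, 52, 4, 89, 17, 25, 32, 1],
    "v3": [31, 4, 5, 2, 8, 23, 34, 3],
    "result": [0, 1, 1, 0, 1, 1, 1, 0],
})

# Positions (not index labels) of the rows where result == 1
pos = np.flatnonzero(df["result"].to_numpy() == 1)
# Break the position array wherever two kept rows are not adjacent
chunks = np.split(pos, np.flatnonzero(np.diff(pos) != 1) + 1)
# Select each block of rows by position; skip the empty chunk produced when there are no 1s
parts = [df.iloc[c] for c in chunks if len(c)]

for part in parts:
    print(part, end="\n\n")
```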

Related

Newly created column in a data frame need to be updated with values based on condition ,from another column

The DataFrame has four columns. Column 'Id' is unique, and the rows are grouped by column 'idhogar'.
Column 'parentesco1' holds a status of 0 or 1. Column 'Target' can hold different values for rows that share the same 'idhogar'.
INDEX Id parentesco1 idhogar Target
0 ID_fe8c32eba 0 4616164 2
1 ID_ca701e058 1 4616164 2
2 ID_5ad4372cd 0 4983866 3
3 ID_1e320689c 1 4983866 3
4 ID_700e30a8d 0 5905417 2
5 ID_bc99ecfb8 0 5905417 2
6 ID_308a05a16 1 5905417 2
7 ID_00186dde5 1 7.56E+06 4
8 ID_34570a74c 1 20713493 4
9 ID_b13870a19 1 27651991 3
10 ID_74e989389 1 45038655 4
11 ID_726ba7d34 0 60027579 4
12 ID_b75d7c648 0 60027579 4
13 ID_37e7b3aaa 1 60027579 4
14 ID_396da5a70 0 104578907 2
15 ID_4381374bb 1 104578907 2
16 ID_272a9b4d5 0 119024319 4
17 ID_1225f3779 0 119024319 4
18 ID_fc5dfaa2e 0 119024319 4
19 ID_7390a3f99 1 119024319 4
I created a new column 'Rev_target'. For every row in the same 'idhogar' group, it should hold the 'Target' value of the row in that group where 'parentesco1' is 1.
I tried the following, without success.
for idhogar in df['idhogar'].unique():
    if len(df[df['idhogar'] == idhogar]['Target'].unique())!= 1:
        rev_target_val=df[(df['idhogar']== idhogar) & (df['parentesco1']==1)]['Target']
        df['Rev_target']=rev_target_val
# NOT WORKING AS REQUIRED ---- gives output as NaN in all rows of newly created column
I then tried the following, which raises an error:
for idhogar in df['idhogar'].unique():
    rev_target_val=df[(df['idhogar']== idhogar) & (df['parentesco1']==1)]['Target']
    df['Rev_target']=np.where(len(df[df['idhogar'] == idhogar]['Target'].unique())!=
                              1,rev_target_val,df['Target'])
ValueError: operands could not be broadcast together with shapes () (0,) (9557,)
I also tried the following, which does not work as intended: it puts the same value, 2, in every row of the new 'Rev_target' column.
for idhogar in df['idhogar'].unique():
    rev_target_val=df[(df['idhogar']== idhogar) & (df['parentesco1']==1)]['Target']
    df['Rev_target']=df.apply(lambda x: rev_target_val if (len(df[df['idhogar'] == idhogar]
                              ['Target'].unique())!= 1) else df['Target'],axis=1)
I would appreciate a solution. Thanks in advance.
I would sort the dataframe on parentesco1 in descending order to make sure that the parentesco1 1 row is the first row. Then a transform could easily access that row:
df['Rev_target'] = df.sort_values('parentesco1', ascending=False).groupby(
'idhogar').transform(lambda x: x.iloc[0])['Target']
It gives:
INDEX Id parentesco1 idhogar Target Rev_target
0 0 ID_fe8c32eba 0 4616164.0 2 2
1 1 ID_ca701e058 1 4616164.0 2 2
2 2 ID_5ad4372cd 0 4983866.0 3 3
3 3 ID_1e320689c 1 4983866.0 3 3
4 4 ID_700e30a8d 0 5905417.0 2 2
5 5 ID_bc99ecfb8 0 5905417.0 2 2
6 6 ID_308a05a16 1 5905417.0 2 2
7 7 ID_00186dde5 1 7560000.0 4 4
8 8 ID_34570a74c 1 20713493.0 4 4
9 9 ID_b13870a19 1 27651991.0 3 3
10 10 ID_74e989389 1 45038655.0 4 4
11 11 ID_726ba7d34 0 60027579.0 4 4
12 12 ID_b75d7c648 0 60027579.0 4 4
13 13 ID_37e7b3aaa 1 60027579.0 4 4
14 14 ID_396da5a70 0 104578907.0 2 2
15 15 ID_4381374bb 1 104578907.0 2 2
16 16 ID_272a9b4d5 0 119024319.0 4 4
17 17 ID_1225f3779 0 119024319.0 4 4
18 18 ID_fc5dfaa2e 0 119024319.0 4 4
19 19 ID_7390a3f99 1 119024319.0 4 4
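An alternative that skips the sort is to build a lookup of one Target per idhogar from the parentesco1 == 1 rows and map it back. This is a rough sketch on made-up data (the Ids, idhogar values and Targets below are invented so that Target actually differs inside a group); head_target is an illustrative name. It assumes at most one parentesco1 == 1 row should count per idhogar, so extra ones are dropped.

```python
import pandas as pd

# Made-up data: Target differs within each idhogar group
df = pd.DataFrame({
    "Id": ["a", "b", "c", "d", "e"],
    "parentesco1": [0, 1, 0, 0, 1],
    "idhogar": [10, 10, 20, 20, 20],
    "Target": [3, 2, 1, 1, 4],
})

# One Target per idhogar, taken from the row where parentesco1 == 1
head_target = (
    df.loc[df["parentesco1"].eq(1)]
      .drop_duplicates("idhogar")   # guard against several such rows in one group
      .set_index("idhogar")["Target"]
)
# Broadcast that value to every row of the same idhogar
df["Rev_target"] = df["idhogar"].map(head_target)
print(df)
```

A group with no parentesco1 == 1 row gets NaN here, which also turns the column into floats; whether to fall back to the original Target in that case is a separate decision.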

return first column number that fulfills a condition in pandas

I have a dataset with several columns of cumulative sums. For every row, I want to return the first column number that satisfies a condition.
Toy example:
df = pd.DataFrame(np.array(range(20)).reshape(4,5).T).cumsum(axis=1)
>>> df
0 1 2 3
0 0 5 15 30
1 1 7 18 34
2 2 9 21 38
3 3 11 24 42
4 4 13 27 46
For instance, I want to return the first column whose value is greater than 20.
Desired output:
3
3
2
2
2
Many thanks as always!
Try with idxmax
df.gt(20).idxmax(1)
Out[66]:
0 3
1 3
2 2
3 2
4 2
dtype: object
Not as short as @YOBEN_S's answer, but chaining index.get_loc and first_valid_index also works:
df[df>20].apply(lambda x: x.index.get_loc(x.first_valid_index()), axis=1)
0 3
1 3
2 2
3 2
4 2
dtype: int64
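One thing to watch with the idxmax approach: for a row where no column meets the condition, idxmax still returns the first column, because every value in that row is False. A small sketch of one way to mark such rows as missing, using a threshold of 40 instead of 20 so that some rows have no match (mask and first_col are illustrative names):

```python
import numpy as np
import pandas as pd

# Same toy frame as in the question
df = pd.DataFrame(np.arange(20).reshape(4, 5).T).cumsum(axis=1)

mask = df.gt(40)
# idxmax alone reports column 0 for rows that are all False
naive = mask.idxmax(axis=1)
# Keep the answer only where at least one column satisfied the condition
first_col = naive.where(mask.any(axis=1))
print(naive.tolist())
print(first_col.tolist())
```

The rows without a match come out as NaN, so the result is a float Series.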

Get longest streak of consecutive weeks by group in pandas

Currently I'm working with weekly data for different subjects, but it might have some long streaks without data, so, what I want to do, is to just keep the longest streak of consecutive weeks for every id. My data looks like this:
id week
1 8
1 15
1 60
1 61
1 62
2 10
2 11
2 12
2 13
2 25
2 26
My expected output would be:
id week
1 60
1 61
1 62
2 10
2 11
2 12
2 13
I got a bit close, trying to mark with a 1 when week==week.shift()+1. The problem is this approach doesn't mark the first occurrence in a streak, and also I can't filter the longest one:
df.loc[ (df['id'] == df['id'].shift())&(df['week'] == df['week'].shift()+1),'streak']=1
This, according to my example, would bring this:
id week streak
1 8 nan
1 15 nan
1 60 nan
1 61 1
1 62 1
2 10 nan
2 11 1
2 12 1
2 13 1
2 25 nan
2 26 1
Any ideas on how to achieve what I want?
Try this:
df['consec'] = df.groupby(['id',df['week'].diff(-1).ne(-1).shift().bfill().cumsum()]).transform('count')
df[df.groupby('id')['consec'].transform('max') == df.consec]
Output:
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
Not as concise as @ScottBoston's answer, but I like this approach:
def max_streak(s):
    a = s.values  # Let's deal with an array
    # I need to know where the differences are not `1`.
    # Also, because I plan to use `diff` again, I'll wrap
    # the boolean array with `True` to make things cleaner
    b = np.concatenate([[True], np.diff(a) != 1, [True]])
    # Tell the locations of the breaks in streak
    c = np.flatnonzero(b)
    # `diff` again tells me the length of the streaks
    d = np.diff(c)
    # `argmax` will tell me the location of the largest streak
    e = d.argmax()
    return c[e], d[e]

def make_thing(df):
    start, length = max_streak(df.week)
    return df.iloc[start:start + length].assign(consec=length)

pd.concat([
    make_thing(g) for _, g in df.groupby('id')
])
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
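For comparison, here is a step-by-step sketch of the same idea with the intermediate Series spelled out. It assumes the frame is already sorted by id and week; new_streak, streak_id and longest are illustrative names. If an id has two streaks tied for the longest, this keeps both.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id":   [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "week": [8, 15, 60, 61, 62, 10, 11, 12, 13, 25, 26],
})

# A new streak starts when the id changes or the week is not the previous week + 1
new_streak = df["id"].ne(df["id"].shift()) | df["week"].diff().ne(1)
# Running count of streak starts gives every streak its own label
streak_id = new_streak.cumsum()
# Length of the streak each row belongs to
df["consec"] = df.groupby(streak_id)["week"].transform("count")
# Keep the rows whose streak is as long as the longest one for that id
longest = df[df["consec"].eq(df.groupby("id")["consec"].transform("max"))]
print(longest)
```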

Looking for a sequential pattern with condition

I have a df as
Id Event SeqNo
1 A 1
1 B 2
1 C 3
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
1 D 9
1 B 10
1 ABD 11
1 D 12
1 B 13
1 CDE 14
1 A 15
I am looking for a pattern "ABD followed by CDE without having event B in between them "
For example, The output of this df will be :
Id Event SeqNo
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
This pattern can be followed multiple times for a single ID and I want find the list of all those IDs and their respective count (if possible).
Here's a vectorized one with some scaling trickery and leveraging convolution to find the required pattern -
# Get the col in context and scale it to the three strings to form an ID array
a = df['Event']
id_ar = (a=='ABD') + 2*(a=='B') + 3*(a=='CDE')
# Mask of those specific strings and hence extract the corresponding masked df
mask = id_ar>0
df1 = df[mask]
# Get pattern col with 1s at places with the pattern found, 0s elsewhere
df1['Pattern'] = (np.convolve(id_ar[mask],[9,1],'same')==28).astype(int)
# Groupby Id col and sum the pattern col for final output
out = df1.groupby(['Id'])['Pattern'].sum()
That convolution part might be a bit tricky. The idea is to use id_ar, which holds the values 1, 2 and 3 for the strings 'ABD', 'B' and 'CDE'. We are looking for a 1 followed by a 3, so convolving with the kernel [9,1] gives 1*1 + 3*9 = 28 as the convolution sum for a window that has 'ABD' and then 'CDE'. Hence, we look for a convolution sum of 28 to find a match. For 'ABD' followed by 'B' and then 'CDE', the convolution sum is different, so that case is filtered out.
Sample run -
1) Input dataframe :
In [377]: df
Out[377]:
Id Event SeqNo
0 1 A 1
1 1 B 2
2 1 C 3
3 1 ABD 4
4 1 B 5
5 1 C 6
6 1 A 7
7 1 CDE 8
8 1 D 9
9 1 B 10
10 1 ABD 11
11 1 D 12
12 1 B 13
13 2 A 1
14 2 B 2
15 2 C 3
16 2 ABD 4
17 2 A 5
18 2 C 6
19 2 A 7
20 2 CDE 8
21 2 D 9
22 2 B 10
23 2 ABD 11
24 2 D 12
25 2 B 13
26 2 CDE 14
27 2 A 15
2) Intermediate filtered o/p (look at column Pattern for the presence of the reqd. pattern) :
In [380]: df1
Out[380]:
Id Event SeqNo Pattern
1 1 B 2 0
3 1 ABD 4 0
4 1 B 5 0
7 1 CDE 8 0
9 1 B 10 0
10 1 ABD 11 0
12 1 B 13 0
14 2 B 2 0
16 2 ABD 4 0
20 2 CDE 8 1
22 2 B 10 0
23 2 ABD 11 0
25 2 B 13 0
26 2 CDE 14 0
3) Final o/p :
In [381]: out
Out[381]:
Id
1 0
2 1
Name: Pattern, dtype: int64
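The convolution arithmetic described in this answer can be checked on two tiny arrays. This is only an illustration of the 1*1 + 3*9 = 28 reasoning, using the answer's codes (1 = 'ABD', 2 = 'B', 3 = 'CDE'); the arrays direct and blocked are made up.

```python
import numpy as np

# Codes from the answer: 1 = 'ABD', 2 = 'B', 3 = 'CDE'
direct = np.array([1, 3])       # ABD immediately followed by CDE
blocked = np.array([1, 2, 3])   # ABD, then B, then CDE

# With kernel [9, 1] and mode 'same', each output is 9*current + 1*previous
conv_direct = np.convolve(direct, [9, 1], "same")
conv_blocked = np.convolve(blocked, [9, 1], "same")

print(conv_direct)    # the CDE position holds 9*3 + 1*1 = 28
print(conv_blocked)   # no entry equals 28
```

Since 9*x + y = 28 with x and y in {1, 2, 3} only holds for x = 3, y = 1, a value of 28 can only come from 'CDE' directly preceded by 'ABD' in the filtered array.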
My solution is based on the assumption that anything other than ABD, CDE and B is irrelevant, so I get rid of those events first with a filtering operation.
Then, what I want to know is whether there is an ABD followed by a CDE without a B in between. I shift the Event column by one step in time (note this doesn't have to be one step in units of SeqNo).
Then I check, for every row of the new df, whether Event == ABD and Event_1_Step == CDE, meaning that there wasn't a B in between, but possibly other stuff like A or C, or even nothing. This gives me a boolean for every place where such a sequence occurs. If I sum them up, I get the count.
Finally, I have to make sure this is all done at Id level, so I use .groupby.
IMPORTANT: This solution assumes that your df is sorted by Id first and then by SeqNo. If it is not, please sort it first.
import pandas as pd
df = pd.read_csv("path/to/file.csv")
# Keep only the relevant events; copy so the new columns go onto an independent frame
df2 = df[df["Event"].isin(["ABD", "CDE", "B"])].copy()
# Next relevant event and its SeqNo, looked up within each Id so values never leak across Ids
df2["Event_1_Step"] = df2.groupby("Id")["Event"].shift(-1)
df2["SeqNo_1_Step"] = df2.groupby("Id")["SeqNo"].shift(-1)
for id, id_df in df2.groupby("Id"):
    print(id) # Set a counter object here per Id to track count per id
    id_df = id_df[(id_df["Event"] == "ABD") & (id_df["Event_1_Step"] == "CDE")]
    for row_id, row in id_df.iterrows():
        print(df[(df["Id"] == id) & df["SeqNo"].between(row["SeqNo"], row["SeqNo_1_Step"])])
You could use this:
s = (pd.Series(
np.select([df['Event'] == 'ABD', df['Event'] =='B', df['Id'] != df['Id'].shift()],
[True, False, False], default=np.nan))
.ffill()
.fillna(False)
.astype(bool))
corr = (df['Event'] == "CDE") & s
corr.groupby(df['Id']).max()
This uses np.select to create a column that is True where Event == 'ABD' and False where Event == 'B' or at the start of a new Id. Forward filling with ffill then tells you, for every row, whether ABD or B came last. You can then check whether that value is True on the rows where Event is CDE. Finally, GroupBy with max tells you whether it is True for any row per Id.
Which for
Id Event SeqNo
1 A 1
1 B 2
1 C 3
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
1 D 9
1 B 10
1 ABD 11
1 D 12
1 B 13
1 CDE 14
1 A 15
2 B 16
3 ABD 17
3 B 18
3 CDE 19
4 ABD 20
4 CDE 21
5 CDE 22
Outputs:
Id
1 True
2 False
3 False
4 True
5 False
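Since the question also asks for a count per Id, here is a rough sketch of a counting variant of the forward-fill idea. It is not the code above: it uses plain Series assignment instead of np.select, fills within each Id, and also resets the state on CDE so each ABD is counted at most once. The names state, carried, before, hits and counts are illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as above (Id and Event columns only)
df = pd.DataFrame({
    "Id": [1] * 15 + [2, 3, 3, 3, 4, 4, 5],
    "Event": ["A", "B", "C", "ABD", "A", "C", "A", "CDE", "D", "B", "ABD",
              "D", "B", "CDE", "A", "B", "ABD", "B", "CDE", "ABD", "CDE", "CDE"],
})

# 1.0 = an ABD is pending, 0.0 = nothing pending (after B or CDE), NaN = irrelevant event
state = pd.Series(np.nan, index=df.index)
state[df["Event"].eq("ABD")] = 1.0
state[df["Event"].isin(["B", "CDE"])] = 0.0

# Carry the state forward inside each Id, then look at the state left by the previous row
carried = state.groupby(df["Id"]).ffill()
before = carried.groupby(df["Id"]).shift().fillna(0.0)

# A hit is a CDE reached while an ABD was still pending
hits = df["Event"].eq("CDE") & before.eq(1.0)
counts = hits.groupby(df["Id"]).sum()
print(counts)
```

Replacing sum with any gives the same True/False answer per Id as the max call above.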

applying several functions in transform in pandas

After a groupby, when using agg, if a dict of columns:functions is passed, the functions will be applied in the corresponding columns. Nevertheless this syntax doesn't work with transform. Is there another way to apply several functions in transform?
Let's give an example:
import pandas as pd
df_test = pd.DataFrame([[1,2,3],[1,20,30],[2,30,50],[1,2,33],[2,4,50]],columns = ['a','b','c'])
Out[1]:
a b c
0 1 2 3
1 1 20 30
2 2 30 50
3 1 2 33
4 2 4 50
def my_fct1(series):
return series.mean()
def my_fct2(series):
return series.std()
df_test.groupby('a').agg({'b':my_fct1,'c':my_fct2})
Out[2]:
c b
a
1 16.522712 8
2 0.000000 17
The previous example shows how to apply different function to different columns in agg, but if we want to transform the columns without aggregating them, agg can't be used anymore. Therefore:
df_test.groupby('a').transform({'b':np.cumsum,'c':np.cumprod})
Out[3]:
TypeError: unhashable type: 'dict'
How can we perform such an action with the following expected output:
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
You can still use a dict, with a bit of a hack:
df_test.groupby('a').transform(lambda x: {'b': x.cumsum(), 'c': x.cumprod()}[x.name])
Out[427]:
b c
0 2 3
1 22 90
2 30 50
3 24 2970
4 34 2500
If you need to keep column a, you can do:
df_test.set_index('a')\
.groupby('a')\
.transform(lambda x: {'b': x.cumsum(), 'c': x.cumprod()}[x.name])\
.reset_index()
Out[429]:
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
Another way is to use an if else to check column names:
df_test.set_index('a')\
.groupby('a')\
.transform(lambda x: x.cumsum() if x.name=='b' else x.cumprod())\
.reset_index()
As far as I know, transform currently (pandas 0.20.2) does not accept a dict mapping column names to functions the way agg does.
If the functions return a Series of the same length:
df1 = df_test.set_index('a').groupby('a').agg({'b':np.cumsum,'c':np.cumprod}).reset_index()
print (df1)
a c b
0 1 3 2
1 1 90 22
2 2 50 30
3 1 2970 24
4 2 2500 34
But if the aggregation returns a different length, you need a join:
df2 = df_test[['a']].join(df_test.groupby('a').agg({'b':my_fct1,'c':my_fct2}), on='a')
print (df2)
a c b
0 1 16.522712 8
1 1 16.522712 8
2 2 0.000000 17
3 1 16.522712 8
4 2 0.000000 17
With the updates to Pandas, you can use the assign method, along with transform to either append new columns, or replace existing columns with new values :
grouper = df_test.groupby("a")
df_test.assign(b=grouper["b"].transform("cumsum"),
c=grouper["c"].transform("cumprod"))
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
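If you want to keep the dict-of-functions style from the question, the assign approach can be driven by a dict. This is just a sketch: funcs is a made-up mapping from column name to the name of a transform, and each column is transformed on its own SeriesGroupBy.

```python
import pandas as pd

df_test = pd.DataFrame(
    [[1, 2, 3], [1, 20, 30], [2, 30, 50], [1, 2, 33], [2, 4, 50]],
    columns=["a", "b", "c"],
)

# Hypothetical mapping: column name -> transform to apply within each group of 'a'
funcs = {"b": "cumsum", "c": "cumprod"}

grouper = df_test.groupby("a")
# Build one transformed column per dict entry and overwrite it in a copy of df_test
result = df_test.assign(
    **{col: grouper[col].transform(func) for col, func in funcs.items()}
)
print(result)
```

Unlike the lambda-with-dict hack, this only computes the function that each column actually needs.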
