Finding streaks in a pandas dataframe - Python

I have a pandas dataframe as follows:
time  winner  loser  stat
   1       A      B     0
   2       C      B     0
   3       D      B     1
   4       E      B     0
   5       F      A     0
   6       G      A     0
   7       H      A     0
   8       I      A     1
Each row is a match result: the first column is the time of the match, the second and third columns contain the winner and loser, and the fourth column is one stat from the match.
I want to detect streaks of zeros in this stat per loser.
The expected result should look like this:
time  winner  loser  stat  streak
   1       A      B     0       1
   2       C      B     0       2
   3       D      B     1       0
   4       E      B     0       1
   5       F      A     0       1
   6       G      A     0       2
   7       H      A     0       3
   8       I      A     1       0
In pseudocode the algorithm should work like this:
.groupby the loser column,
then iterate over the rows of each loser group,
and in each row look at the stat column: if it contains 0, increment the streak value from the previous row by 1; if it is not 0, start a new streak, that is, put 0 into the streak column.
So the .groupby part is clear. But then I would need some sort of .apply where I can look at the previous row? This is where I am stuck.
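For reference, a literal translation of that pseudocode, as a minimal sketch (slow on large frames, but it makes the carried-forward state explicit; the data below just reproduces the example):

import pandas as pd

df = pd.DataFrame({
    'time':   [1, 2, 3, 4, 5, 6, 7, 8],
    'winner': ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
    'loser':  ['B', 'B', 'B', 'B', 'A', 'A', 'A', 'A'],
    'stat':   [0, 0, 1, 0, 0, 0, 0, 1],
})

df['streak'] = 0
for _, group in df.groupby('loser', sort=False):
    run = 0  # streak carried over from the previous row of this loser
    for idx, stat in group['stat'].items():
        run = run + 1 if stat == 0 else 0
        df.loc[idx, 'streak'] = run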

You can apply a custom function f built from groupby, cumsum, cumcount and astype:

def f(x):
    x['streak'] = (x.groupby((x['stat'] != 0).cumsum()).cumcount() +
                   ((x['stat'] != 0).cumsum() == 0).astype(int))
    return x

df = df.groupby('loser', sort=False).apply(f)
print(df)
   time winner loser  stat  streak
0     1      A     B     0       1
1     2      C     B     0       2
2     3      D     B     1       0
3     4      E     B     0       1
4     5      F     A     0       1
5     6      G     A     0       2
6     7      H     A     0       3
7     8      I     A     1       0
For better understanding:
def f(x):
    x['c'] = (x['stat'] != 0).cumsum()
    x['a'] = (x['c'] == 0).astype(int)
    x['b'] = x.groupby('c').cumcount()
    x['streak'] = x['b'] + x['a']
    return x

df = df.groupby('loser', sort=False).apply(f)
print(df)
   time winner loser  stat  c  a  b  streak
0     1      A     B     0  0  1  0       1
1     2      C     B     0  0  1  1       2
2     3      D     B     1  1  0  0       0
3     4      E     B     0  1  0  1       1
4     5      F     A     0  0  1  0       1
5     6      G     A     0  0  1  1       2
6     7      H     A     0  0  1  2       3
7     8      I     A     1  1  0  0       0
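The same logic also works without apply; a sketch of a vectorized equivalent (assuming the example frame above):

# block id per loser: increments whenever stat != 0, so each run of zeros
# (plus the nonzero row that closes it) forms one block
block = (df['stat'] != 0).groupby(df['loser']).cumsum()
# position within each (loser, block) pair; add 1 only inside the leading
# block that no nonzero row has opened yet
df['streak'] = df.groupby([df['loser'], block]).cumcount() + (block == 0).astype(int)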

Not as elegant as jezrael's answer, but for me easier to understand...
First, define a function that works with a single loser:
import numpy as np

def f(df):
    df['streak2'] = (df['stat'] == 0).cumsum()
    df['cumsum'] = np.nan
    df.loc[df['stat'] == 1, 'cumsum'] = df['streak2']
    df['cumsum'] = df['cumsum'].fillna(method='ffill')
    df['cumsum'] = df['cumsum'].fillna(0)
    df['streak'] = df['streak2'] - df['cumsum']
    df.drop(['streak2', 'cumsum'], axis=1, inplace=True)
    return df
The streak is essentially a cumsum, but we need to reset it each time stat is 1. We therefore subtract the value of the cumsum where stat is 1, carried forward until the next 1.
Then groupby and apply by loser:
df.groupby('loser').apply(f)
The result is as expected.
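For intuition, a worked trace of the intermediate columns for loser A (stats 0, 0, 0, 1):

# streak2 = (stat == 0).cumsum():                     1, 2, 3, 3
# cumsum  = streak2 where stat == 1, ffilled, else 0: 0, 0, 0, 3
# streak  = streak2 - cumsum:                         1, 2, 3, 0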

You could use iterrows to access the previous row:

df['streak'] = 0
for i, row in df.iterrows():
    if i != 0:
        if row['stat'] == 0:
            # same loser as the previous row: continue the streak
            if row['loser'] == df.loc[i - 1, 'loser']:
                df.loc[i, 'streak'] = df.loc[i - 1, 'streak'] + 1
            else:
                df.loc[i, 'streak'] = 1
    else:
        if row['stat'] == 0:
            df.loc[i, 'streak'] = 1
Which gives:
In [210]: df
Out[210]:
   time winner loser  stat  streak
0     1      A     B     0       1
1     2      C     B     0       2
2     3      D     B     1       0
3     4      E     B     0       1
4     5      F     A     0       1
5     6      G     A     0       2
6     7      H     A     0       3
7     8      I     A     1       0
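As an aside, the previous-row lookup in that loop can also be written with shift(); a sketch that reproduces just the loser comparison (the running streak itself still needs a cumulative trick like the first answer's, because each value depends on the previous one):

# True where the row above belongs to the same loser (first row is False)
same_loser_as_prev = df['loser'].eq(df['loser'].shift())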

Related

Select rows where two or more columns are bigger than 0 in pandas

I am working with a dataframe in pandas. My dataframe has 55 columns and 70,000 rows.
How can I select the rows where two or more values are bigger than 0?
It now looks like this:

   A  B  C  D  E
a  0  2  0  8  0
b  3  0  0  0  0
c  6  2  5  0  0

And I would like to make this:

   A  B  C  D  E  F
a  0  2  0  8  0  true
b  3  0  0  0  0  false
c  6  2  5  0  0  true
I have tried converting it to just 0s and 1s and summing that, like so:
df[df > 0] = 1
df[(df > 0).sum(axis=1) >= 2]
But then I lose all the other info in the dataframe and I still want to be able to see the original values.
Try assigning to a column like this:
>>> df['F'] = df.gt(0).sum(axis=1).ge(2)
>>> df
   A  B  C  D  E      F
a  0  2  0  8  0   True
b  3  0  0  0  0  False
c  6  2  5  0  0   True
Or try with astype(bool):
>>> df['F'] = df.astype(bool).sum(axis=1).ge(2)
>>> df
   A  B  C  D  E      F
a  0  2  0  8  0   True
b  3  0  0  0  0  False
c  6  2  5  0  0   True
You are close; just assign the mask to a new column:
df['F'] = (df > 0).sum(axis=1) >= 2
Or, with NumPy:

import numpy as np

df['F'] = np.count_nonzero(df, axis=1) >= 2
print(df)
   A  B  C  D  E      F
a  0  2  0  8  0   True
b  3  0  0  0  0  False
c  6  2  5  0  0   True
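One caveat worth checking against your own data: astype(bool) and count_nonzero also count negative values, while gt(0) does not. A tiny hypothetical frame shows the difference:

tmp = pd.DataFrame({'A': [-1], 'B': [0], 'C': [2], 'D': [0], 'E': [0]})
print(tmp.gt(0).sum(axis=1).ge(2))         # False: only C is > 0
print(tmp.astype(bool).sum(axis=1).ge(2))  # True: -1 and 2 are both nonzero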

How can I make a dataframe where the count is higher than a specific value?

I have a df that looks like this:

   a  b  c  d
0  1  0  0  1
1  1  1  0  1
2  0  1  1  1
3  1  0  0  1

I am trying to get a df keeping only the columns whose count is higher than 2, but I can't find the solution for this. It should look like this:

   a  d
0  1  1
1  1  1
2  0  1
3  1  1
If there are only 1 and 0 values, use DataFrame.loc with boolean indexing; the first : means match all rows:
df = df.loc[:, df.sum() > 2]
print(df)
   a  d
0  1  1
1  1  1
2  0  1
3  1  1
Detail:
print(df.sum())
a    3
b    2
c    1
d    4
dtype: int64

print(df.sum() > 2)
a     True
b    False
c    False
d     True
dtype: bool
If other values are possible and you need to count only the 1s:
df = df.loc[:, df.eq(1).sum() > 2]
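A quick check with a small hypothetical frame containing values other than 0 and 1; eq(1) counts exact ones, whereas sum() would be skewed by the larger values:

df2 = pd.DataFrame({'a': [1, 1, 5], 'b': [1, 0, 0], 'c': [1, 1, 1]})
print(df2.loc[:, df2.eq(1).sum() > 2])  # keeps only column 'c' (three exact 1s)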

Change 1st row of a dataframe based on a condition in pandas

I have 2 columns, and based on their values I want to update a third column for only 1 row.
I have:
df = pd.DataFrame({'A': [1, 1, 2, 3, 4, 4],
                   'B': [2, 2, 4, 3, 2, 1],
                   'C': [0] * 6})
print(df)

   A  B  C
0  1  2  0
1  1  2  0
2  2  4  0
3  3  3  0
4  4  2  0
5  4  1  0
If A = 1 and B = 2, then only the 1st matching row should get C = 1, like this:
print(df)

   A  B  C
0  1  2  1
1  1  2  0
2  2  4  0
3  3  3  0
4  4  2  0
5  4  1  0
Right now I have used
df.loc[(df['A']==1) & (df['B']==2)].iloc[[0]].loc['C'] = 1
but it doesn't change the dataframe.
Solution if the mask always matches at least one row:
Create a boolean mask and set the value at the first True index, found with idxmax:

mask = (df['A'] == 1) & (df['B'] == 2)
df.loc[mask.idxmax(), 'C'] = 1

But if no value matches, idxmax returns the index of the first False value instead, so add an if-else guard:

import numpy as np

mask = (df['A'] == 1) & (df['B'] == 2)
# if nothing matches, fall back to an all-False indexer so the assignment is a no-op
idx = mask.idxmax() if mask.any() else np.repeat(False, len(df))
df.loc[idx, 'C'] = 1
print(df)

   A  B  C
0  1  2  1
1  1  2  0
2  2  4  0
3  3  3  0
4  4  2  0
5  4  1  0
For example, when nothing matches, the dataframe is left unchanged:

mask = (df['A'] == 10) & (df['B'] == 20)
idx = mask.idxmax() if mask.any() else np.repeat(False, len(df))
df.loc[idx, 'C'] = 1
print(df)

   A  B  C
0  1  2  0
1  1  2  0
2  2  4  0
3  3  3  0
4  4  2  0
5  4  1  0
Using pd.Series.cumsum to ensure only the first matching row is modified:
mask = df['A'].eq(1) & df['B'].eq(2)
df.loc[mask & mask.cumsum().eq(1), 'C'] = 1
print(df)

   A  B  C
0  1  2  1
1  1  2  0
2  2  4  0
3  3  3  0
4  4  2  0
5  4  1  0
If performance is a concern, see Efficiently return the index of the first value satisfying condition in array.
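Another sketch of the same task: take the index of the first matching row directly, with an explicit check for the no-match case:

hits = df.index[(df['A'] == 1) & (df['B'] == 2)]
if len(hits) > 0:
    df.loc[hits[0], 'C'] = 1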

Concat() alternate groups in Python 3

My goal here is to concat() alternating groups from two dataframes.
Desired result:

group  ordercode  quantity
0      A          1
       B          1
       C          1
       D          1
0      A          1
       B          3
1      A          1
       B          2
       C          1
1      A          1
       B          1
       C          2
My dataframes:

import pandas as pd

df1 = pd.DataFrame([[0, "A", 1], [0, "B", 1], [0, "C", 1], [0, "D", 1],
                    [1, "A", 1], [1, "B", 2], [1, "C", 1]],
                   columns=["group", "ordercode", "quantity"])
df2 = pd.DataFrame([[0, "A", 1], [0, "B", 3],
                    [1, "A", 1], [1, "B", 1], [1, "C", 2]],
                   columns=["group", "ordercode", "quantity"])
print(df1)
print(df2)
I have used

dfff = pd.concat([df1, df2]).sort_index(kind="merge")

but I got the result below:

   group ordercode  quantity
0      0         A         1
0      0         A         1
1      0         B         1
1      0         B         3
2      0         C         1
3      0         D         1
4      1         A         1
4      1         A         1
5      1         B         2
5      1         B         1
6      1         C         1
6      1         C         2

You can see that the concatenation interleaves individual rows, not whole groups. It should print group 0 of df1, then group 0 of df2, then group 1 of df1, then group 1 of df2, and so on.
Note: I created these DataFrames using groupby():

df = pd.DataFrame(np.concatenate(df.apply(lambda x: [x[0]] * x[1], 1).to_numpy()),
                  columns=['ordercode'])
df['quantity'] = 1
df['group'] = sorted(list(range(0, len(df) // 3, 1)) * 4)[0:len(df)]
df = df.groupby(['group', 'ordercode']).sum()

Question: where did I go wrong? The sorting is done by the index, not by group. I tried .set_index("group") but it didn't work either.
Use cumcount to build a helper column, then sort by it with sort_values:

df1['g'] = df1.groupby('ordercode').cumcount()
df2['g'] = df2.groupby('ordercode').cumcount()
dfff = pd.concat([df1, df2]).sort_values(['group', 'g']).reset_index(drop=True)
print(dfff)
    group ordercode  quantity  g
0       0         A         1  0
1       0         B         1  0
2       0         C         1  0
3       0         D         1  0
4       0         A         1  0
5       0         B         3  0
6       1         C         2  0
7       1         A         1  1
8       1         B         2  1
9       1         C         1  1
10      1         A         1  1
11      1         B         1  1
and finally remove the helper column:
dfff = dfff.drop('g', axis=1)
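A hedged alternative sketch, starting again from the original df1 and df2: tag each frame with a source key in concat, then sort stably by (group, source), which keeps whole groups together (including row (1, C, 2) of df2):

dfff = (pd.concat([df1, df2], keys=[0, 1], names=['src', None])
        .reset_index(level='src')
        .sort_values(['group', 'src'], kind='mergesort')  # mergesort is stable
        .drop(columns='src')
        .reset_index(drop=True))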

Condition based on first element of a group in Python

I have a dataframe df with columns [ShowOnAir, AfterPremier, ID, EverOnAir].
My condition is that, for the first element of each groupby(df.ID) group:
if df.ShowOnAir == 0 or df.AfterPremier == 0, then EverOnAir = 0,
else EverOnAir = 1.
I am not sure how to compare the first element of each group with the elements of the original dataframe df.
I would really appreciate some help with this. Thank you.
You can get a row number within each group by using cumsum on a helper column of ones, then do your logic on the resulting dataframe:

df = pd.DataFrame([[1], [1], [2], [2], [2]])
df['n'] = 1
df.groupby(0).cumsum()

   n
0  1
1  2
2  1
3  2
4  3
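The same row numbers come out of cumcount directly, without the helper column (a minimal sketch):

df['n'] = df.groupby(0).cumcount() + 1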
You can first create a new column EverOnAir filled with 1. Then group by ID and apply a custom function f that reads the first element of each column with iat and fills in 0 where the condition holds:
print(df)
   ShowOnAir  AfterPremier ID
0          0             0  a
1          0             1  a
2          1             1  a
3          1             1  b
4          1             0  b
5          0             0  b
6          0             1  c
7          1             0  c
8          0             0  c
import numpy as np

def f(x):
    # only the first row of the group is inspected and updated
    x['EverOnAir'].iat[0] = np.where((x['ShowOnAir'].iat[0] == 0) |
                                     (x['AfterPremier'].iat[0] == 0), 0, 1)
    return x

df['EverOnAir'] = 1
print(df.groupby('ID').apply(f))
   ShowOnAir  AfterPremier ID  EverOnAir
0          0             0  a          0
1          0             1  a          1
2          1             1  a          1
3          1             1  b          1
4          1             0  b          1
5          0             0  b          1
6          0             1  c          0
7          1             0  c          1
8          0             0  c          1
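A hedged vectorized sketch of the same idea, without apply: duplicated() marks every occurrence of an ID except the first, so its negation selects the first row of each group:

first = ~df['ID'].duplicated()
df['EverOnAir'] = 1
df.loc[first & ((df['ShowOnAir'] == 0) | (df['AfterPremier'] == 0)), 'EverOnAir'] = 0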
