Given df:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 5, 2, 8, 2], [2, 4, 4, 20, 2], [3, 3, 1, 20, 2],
                   [4, 2, 2, 1, 3], [5, 1, 4, -5, -4], [1, 5, 2, 2, -20],
                   [2, 4, 4, 3, -8], [3, 3, 1, -1, -1], [4, 2, 2, 0, 12],
                   [5, 1, 4, 20, -2]],
                  columns=['A', 'B', 'C', 'D', 'E'],
                  index=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Based on this answer, I created a function to calculate streaks (up, down).
def streaks(df, column):
    # Create sign column
    df['sign'] = 0
    df.loc[df[column] > 0, 'sign'] = 1
    df.loc[df[column] < 0, 'sign'] = 0

    # Downstreak
    df['d_streak2'] = (df['sign'] == 0).cumsum()
    df['cumsum'] = np.nan
    df.loc[df['sign'] == 1, 'cumsum'] = df['d_streak2']
    df['cumsum'] = df['cumsum'].ffill()
    df['cumsum'] = df['cumsum'].fillna(0)
    df['d_streak'] = df['d_streak2'] - df['cumsum']
    df.drop(['d_streak2', 'cumsum'], axis=1, inplace=True)

    # Upstreak
    df['u_streak2'] = (df['sign'] == 1).cumsum()
    df['cumsum'] = np.nan
    df.loc[df['sign'] == 0, 'cumsum'] = df['u_streak2']
    df['cumsum'] = df['cumsum'].ffill()
    df['cumsum'] = df['cumsum'].fillna(0)
    df['u_streak'] = df['u_streak2'] - df['cumsum']
    df.drop(['u_streak2', 'cumsum'], axis=1, inplace=True)

    del df['sign']
    return df
The function works well, but it is very long. I'm sure there's a much better way to write it. I tried the other answer there, but it didn't work well.
This is the desired output
streaks(df, 'E')
A B C D E d_streak u_streak
1 1 5 2 8 2 0.0 1.0
2 2 4 4 20 2 0.0 2.0
3 3 3 1 20 2 0.0 3.0
4 4 2 2 1 3 0.0 4.0
5 5 1 4 -5 -4 1.0 0.0
6 1 5 2 2 -20 2.0 0.0
7 2 4 4 3 -8 3.0 0.0
8 3 3 1 -1 -1 4.0 0.0
9 4 2 2 0 12 0.0 1.0
10 5 1 4 20 -2 1.0 0.0
You could simplify the function as shown:
def streaks(df, col):
    sign = np.sign(df[col])
    s = sign.groupby((sign != sign.shift()).cumsum()).cumsum()
    return df.assign(u_streak=s.where(s > 0, 0.0),
                     d_streak=s.where(s < 0, 0.0).abs())
Using it:
streaks(df, 'E')
First, compute the sign of each value in the column under consideration using np.sign. This assigns +1 to positive numbers and -1 to negative ones.
Next, identify runs of adjacent equal signs by comparing each cell with the previous one via sign != sign.shift(), and take the cumulative sum of that boolean Series, which serves as the grouping key.
Group by these run ids and take the cumulative sum of the signs within each run.
Finally, assign the positive cumsum values to u_streak and the negative ones (as absolute values) to d_streak.
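To see the mechanics, here is a trace of the intermediate Series for column E of the sample frame (a sketch, run after the imports above; the inline values are what each step produces for this data):

sign = np.sign(df['E'])
# sign:    1  1  1  1 -1 -1 -1 -1  1 -1
run_id = (sign != sign.shift()).cumsum()
# run_id:  1  1  1  1  2  2  2  2  3  4   (one id per run of equal signs)
s = sign.groupby(run_id).cumsum()
# s:       1  2  3  4 -1 -2 -3 -4  1 -1   (signed position within each run)
# s.where(s > 0, 0.0) is then u_streak; s.where(s < 0, 0.0).abs() is d_streak.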
Related
I have a table like this:
import pandas as pd

df = pd.DataFrame({
    "day": [1, 2, 3, 4, 5, 6],
    "tmin": [-2, -3, -1, -4, -4, -2]
})
I want to create a column like this:
df['days_under_0_until_now'] = [1, 2, 3, 4, 5, 6]
df['days_under_-2_until_now'] = [1, 2, 0, 1, 2, 3]
df['days_under_-3_until_now'] = [0, 1, 0, 1, 2, 0]
So days_under_X_until_now means how many consecutive days, up to and including the current one, tmin has been less than or equal to X.
I'd like to avoid doing this with loops since the data is huge. Is there an alternative way to do it?
To improve performance, avoid groupby: compare the column's values against the list of thresholds, then use this solution for counting consecutive Trues:
import numpy as np

vals = [0, -2, -3]
arr = df['tmin'].to_numpy()[:, None] <= np.array(vals)[None, :]
cols = [f'days_under_{v}_until_now' for v in vals]
df1 = pd.DataFrame(arr, columns=cols, index=df.index)

b = df1.cumsum()
df = df.join(b.sub(b.mask(df1).ffill().fillna(0)).astype(int))
print(df)
   day  tmin  days_under_0_until_now  days_under_-2_until_now  days_under_-3_until_now
0    1    -2                       1                        1                        0
1    2    -3                       2                        2                        1
2    3    -1                       3                        0                        0
3    4    -4                       4                        1                        1
4    5    -4                       5                        2                        2
5    6    -2                       6                        3                        0
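The last line is the consecutive-True counting idiom: a running total of Trues, minus the total that had accumulated at the most recent False. A minimal sketch of it on a single boolean Series (here the tmin <= -2 column from the example):

import pandas as pd

b = pd.Series([True, True, False, True, True, True])  # tmin <= -2
c = b.cumsum()                       # 1 2 2 3 4 5: running count of Trues
reset = c.mask(b).ffill().fillna(0)  # 0 0 2 2 2 2: count frozen at each False
print((c - reset).astype(int).tolist())  # [1, 2, 0, 1, 2, 3]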
My data looks like this:
df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5],
                   'group': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
                             'B', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'attempts': [0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0],
                   'successes': [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1],
                   'score': [None, 5, 5, 4, 5, 4, 5, None, 1, 5,
                             0, 1, None, None, 1, None]})
## df output
ID group attempts successes score
0 1 A 0 1 None
1 1 A 1 0 5
2 1 A 1 0 5
3 1 A 1 0 4
4 2 A 1 0 5
5 2 A 1 0 4
6 3 A 1 0 5
7 3 A 0 1 None
8 3 A 1 0 1
9 4 B 1 0 5
10 4 B 1 0 0
11 4 B 1 0 1
12 4 B 0 1 None
13 5 B 0 1 None
14 5 B 1 0 1
15 5 B 0 1 None
I'm trying to group by two columns (group, score) and count the number of unique IDs, after first identifying which (group, ID) pairs have at least 1 success across all score values. In other words, I only want to count an ID in the aggregation if it has at least one associated success. I also want to count each ID at most once per (group, score) pair, regardless of how many attempt rows it contributes (i.e., if an ID has 5 success rows, it still counts as 1).
The successes and attempts columns are binary (only 1 or 0). For example, for ID = 1, group = A, there is at least 1 success, so when counting the number of unique IDs per (group, score), I include that ID.
I'd like the final output to look something like this, so that I can calculate the ratio of unique successes to unique attempts for each (group, score) combination.
group score successes_count attempts_counts ratio
A 5 2 3 0.67
4 1 2 0.50
1 1 1 1.0
0 0 0 inf
B 5 1 1 1.0
4 0 0 inf
1 2 2 1.0
0 1 1 1.0
So far I've been able to run a pivot table computing sums per (group, ID) to identify the IDs that have at least 1 success. However, I'm not sure of the best way to use this to reach my desired final state.
p = pd.pivot_table(data=df_new,
                   values=['ID'],
                   index=['group', 'ID'],
                   columns=['successes', 'attempts'],
                   aggfunc={'ID': 'count'})
# p output
            ID     
successes    0    1
attempts     1    0
group ID           
A     1    3.0  1.0
      2    2.0  NaN
      3    2.0  1.0
B     4    3.0  1.0
      5    1.0  2.0
Let's try something like:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5],
                   'group': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
                             'B', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'attempts': [0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0],
                   'successes': [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1],
                   'score': [None, 5, 5, 4, 5, 4, 5, None, 1, 5,
                             0, 1, None, None, 1, None]})
# Groups With At least 1 Success
m = df.groupby('group')['successes'].transform('max').astype(bool)
# Filter Out
df = df[m].copy()
# Replace 0 successes with NaNs
df['successes'] = df['successes'].replace(0, np.nan)
# FFill BFill each group so that any success will fill the group
df['successes'] = df.groupby(['ID', 'group'])['successes'] \
                    .transform(lambda s: s.ffill().bfill())
# Pivot then stack to make sure each group has all score values
# Sort and reset index
# Rename Columns
# fix types
p = df.drop_duplicates() \
      .pivot_table(index='group',
                   columns='score',
                   values=['attempts', 'successes'],
                   aggfunc='sum',
                   fill_value=0) \
      .stack() \
      .sort_values(['group', 'score'], ascending=[True, False]) \
      .reset_index() \
      .rename(columns={'attempts': 'attempts_counts',
                       'successes': 'successes_count'}) \
      .convert_dtypes()
# Calculate Ratio
p['ratio'] = p['successes_count'] / p['attempts_counts']
print(p)
Output:
group score attempts_counts successes_count ratio
0 A 5 3 2 0.666667
1 A 4 2 1 0.5
2 A 1 1 1 1.0
3 A 0 0 0 NaN
4 B 5 1 1 1.0
5 B 4 0 0 NaN
6 B 1 2 2 1.0
7 B 0 1 1 1.0
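An alternative sketch that follows the wording of the question more literally: flag the (group, ID) pairs that have any success with a hypothetical helper column has_success, count unique IDs per (group, score) with nunique, and reindex so every (group, score) combination appears. It assumes a fresh copy of the df built at the top of this answer (the code above mutates df in place), and 0/0 comes out as NaN here rather than the inf shown in the question's table:

import pandas as pd

# flag each (group, ID) pair that has at least one success (hypothetical helper column)
d = df.assign(has_success=df.groupby(['group', 'ID'])['successes'].transform('max'))
d = d.dropna(subset=['score'])

# unique IDs per (group, score), and unique *successful* IDs per (group, score)
attempts = d.groupby(['group', 'score'])['ID'].nunique().rename('attempts_counts')
successes = (d[d['has_success'] == 1]
             .groupby(['group', 'score'])['ID'].nunique()
             .rename('successes_count'))

# make sure every (group, score) combination appears, even with zero counts
idx = pd.MultiIndex.from_product(
    [sorted(d['group'].unique()), sorted(d['score'].unique(), reverse=True)],
    names=['group', 'score'])
out = pd.concat([successes, attempts], axis=1).reindex(idx, fill_value=0)
out['ratio'] = out['successes_count'] / out['attempts_counts']
print(out)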
I need to shift a grouped data frame by a dynamic number. I can do it with apply, but the performance is not very good.
Any way to do that without apply?
Here is a sample of what I would like to do:
df = pd.DataFrame({
    'GROUP': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'VALUE': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2],
    'SHIFT': [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]
})
df['SUM'] = df.groupby('GROUP').VALUE.cumsum()
# THIS DOESN'T WORK:
df['VALUE'] = df.groupby('GROUP').SUM.shift(df.SHIFT)
I do it with apply in the following way:
df = pd.DataFrame({
    'GROUP': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'VALUE': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2],
    'SHIFT': [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]
})

def func(group):
    s = group.SHIFT.iloc[0]
    group['SUM'] = group.SUM.shift(s)
    return group

df['SUM'] = df.groupby('GROUP').VALUE.cumsum()
df = df.groupby('GROUP').apply(func)
Here is a pure numpy version that works if the data frame is sorted by group (like your example):
# these rows are not null after shifting
notnull = np.where(df.groupby('GROUP').cumcount() >= df['SHIFT'])[0]
# source rows for rows above
source = notnull - df['SHIFT'].values[notnull]
shifted = np.empty(df.shape[0])
shifted[:] = np.nan
shifted[notnull] = df.groupby('GROUP')['VALUE'].cumsum().values[source]
df['SUM'] = shifted
It first gets the indices of the rows that are to be updated; subtracting the shifts from those indices yields the source rows.
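On the sample frame, this produces the following index arrays (a trace continuing from the block above; positions are 0-based):

print(notnull)  # [ 2  3  4  5  9 10 11] -- rows that receive a shifted value
print(source)   # [0 1 2 3 6 7 8]        -- rows the cumulative sums come from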
A solution that avoids apply could be the following, if the groups are contiguous:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'GROUP': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'VALUE': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2],
    'SHIFT': [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]
})
# compute values required for the slices
_, start = np.unique(df.GROUP.values, return_index=True)
gp = df.groupby('GROUP')
shifts = gp.SHIFT.first()
sizes = gp.size().values
end = (sizes - shifts.values) + start
# compute slices
source = [i for s, f in zip(start, end) for i in range(s, f)]
target = [i for j, s, f in zip(start, shifts, sizes) for i in range(j + s, j + f)]
# compute cumulative sum and arrays of nan
s = gp.VALUE.cumsum().values
r = np.empty_like(s, dtype=np.float32)
r[:] = np.nan
# place the shifted sums onto the array of nan
np.put(r, target, s[source])
# set the sum column
df['SUM'] = r
print(df)
Output
GROUP SHIFT VALUE SUM
0 A 2 1 NaN
1 A 2 2 NaN
2 A 2 3 1.0
3 A 2 4 3.0
4 A 2 5 6.0
5 A 2 6 10.0
6 B 3 7 NaN
7 B 3 8 NaN
8 B 3 9 NaN
9 B 3 0 7.0
10 B 3 1 15.0
11 B 3 2 24.0
With the exception of building the slices (source and target), all computations are done at the pandas/numpy level, which should be fast. The idea is to manually simulate what would be done in the apply function.
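As a quick check, the result can be compared against the apply-based version from the question (a sketch, run on the same frame after the code above):

# recompute the shifted cumulative sums with groupby/apply and compare
expected = df.groupby('GROUP', group_keys=False).apply(
    lambda g: g['VALUE'].cumsum().shift(g['SHIFT'].iloc[0]))
assert np.allclose(expected.to_numpy(), df['SUM'].to_numpy(), equal_nan=True)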
I have a data frame, say df1, with a MULTILEVEL INDEX:
     A  B   C   D
0 0  0  1   2   3
     4  5   6   7
1 2  8  9  10  11
  3  2  3   4   5
and I have another data frame df2, also with a MULTILEVEL INDEX, that has 2 columns (B and C) in common with df1:
     X  B  C   Y
0 0  0  0  7   3
  1  4  5  6   7
1 2  8  2  3  11
  3  2  3  4   5
I need to remove the rows from df1 where the values of columns B and C are the same as in df2, so I should be getting something like this:
     A  B   C   D
0 0  0  1   2   3
1 2  8  9  10  11
I have tried getting the index of the common elements and removing them via a list, but the indexes are multi-level and everything got messed up.
You can do this in a one-liner using pandas.DataFrame.iloc, numpy.where and numpy.logical_or, like this (I find it to be the simplest way). Note that the comparison is row for row, so it assumes df1 and df2 have the same length and identically-labeled indexes:
df1 = df1.iloc[np.where(np.logical_or(df1['B'] != df2['B'], df1['C'] != df2['C']))]
of course don't forget to:
import numpy as np
output:
     A  B   C   D
0 0  0  1   2   3
1 2  8  9  10  11
Hope this was helpful. If there are any questions or remarks please feel free to comment.
You could make MultiIndexes out of the B and C columns, and then call the index's isin method:
idx1 = pd.MultiIndex.from_arrays([df1['B'],df1['C']])
idx2 = pd.MultiIndex.from_arrays([df2['B'],df2['C']])
mask = idx1.isin(idx2)
result = df1.loc[~mask]
For example,
import pandas as pd
df1 = pd.DataFrame({'A': [0, 4, 8, 2], 'B': [1, 5, 9, 3], 'C': [2, 6, 10, 4], 'D': [3, 7, 11, 5], 'P': [0, 0, 1, 1], 'Q': [0, 0, 2, 3]})
df1 = df1.set_index(list('PQ'))
df1.index.names = [None,None]
df2 = pd.DataFrame({'B': [0, 5, 2, 3], 'C': [7, 6, 3, 4], 'P': [0, 0, 1, 1], 'Q': [0, 1, 2, 3], 'X': [0, 4, 8, 2], 'Y': [3, 7, 11, 5]})
df2 = df2.set_index(list('PQ'))
df2.index.names = [None,None]
idx1 = pd.MultiIndex.from_arrays([df1['B'],df1['C']])
idx2 = pd.MultiIndex.from_arrays([df2['B'],df2['C']])
mask = idx1.isin(idx2)
result = df1.loc[~mask]
print(result)
yields
     A  B   C   D
0 0  0  1   2   3
1 2  8  9  10  11
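The same anti-join can also be written with merge and its indicator, if that reads more naturally; a sketch under the same setup, assuming the unnamed index levels round-trip through reset_index as level_0/level_1:

key = ['B', 'C']
m = df1.reset_index().merge(df2[key].drop_duplicates(), on=key,
                            how='left', indicator=True)
result = (m[m['_merge'] == 'left_only']
          .drop(columns='_merge')
          .set_index(['level_0', 'level_1'])
          .rename_axis([None, None]))
print(result)  # same two rows as above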
I have a df that is 'divided' into chunks, like this:
A = pd.DataFrame([[1, 5, 2, 0], [2, 4, 4, 0], [3, 3, 1, 1],
                  [4, 2, 2, 0], [5, 1, 4, 0], [2, 4, 4, 1]],
                 columns=['A', 'B', 'C', 'D'], index=[1, 2, 3, 4, 5, 6])
In this example the chunk size is 3, and we have 2 chunks (each chunk's last row is marked by a 1 in column 'D'). I need to perform a rolling calculation inside each chunk that involves 2 columns. Specifically, I need to create a column 'E' equal to column 'B' minus the rolling min of column 'C', as a function:
def retracement(x):
    # pd.rolling_min was removed in later pandas versions; the modern spelling
    # (min_periods=1 matches the desired output below):
    return x['B'] - x['C'].rolling(window=3, min_periods=1).min()
I need to apply the formula above for each chunk. So following this recipe I tried:
chunk_size = 3
A['E'] = A.groupby(np.arange(len(A))//chunk_size).apply(lambda x: retracement(x))
ValueError: Wrong number of items passed 3, placement implies 1
The output would look like:
A B C D E
1 1 5 2 0 3
2 2 4 4 0 2
3 3 3 1 1 2
4 4 2 2 0 0
5 5 1 4 0 -1
6 2 4 4 1 2
Thanks
Update:
Following @EdChum's recommendation didn't work; I got
TypeError: <lambda>() got an unexpected keyword argument 'axis'
something like this:
def chunkify(chunk_size):
    df['chunk'] = (df.index.values - 1) / chunk_size
    df['E'] = df.groupby('chunk').apply(lambda x: x.B - pd.expanding_min(x.C)).values.flatten()
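For what it's worth, here is a sketch that reproduces the desired E column without apply: since each chunk has no more rows than the window, the rolling min within a chunk reduces to a cumulative min, and groupby(...).cummin() stays aligned with the original index, so no flattening is needed (assumes contiguous chunks and a chunk size equal to the window):

import numpy as np
import pandas as pd

A = pd.DataFrame([[1, 5, 2, 0], [2, 4, 4, 0], [3, 3, 1, 1],
                  [4, 2, 2, 0], [5, 1, 4, 0], [2, 4, 4, 1]],
                 columns=['A', 'B', 'C', 'D'], index=[1, 2, 3, 4, 5, 6])

chunk_size = 3
chunk = np.arange(len(A)) // chunk_size   # 0 0 0 1 1 1
A['E'] = A['B'] - A.groupby(chunk)['C'].cummin()
print(A)  # E: 3 2 2 0 -1 2, matching the desired output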