Calculating cumsum in python across two dimensions

I have a dataset that looks like the following:
Subject Session Trial Choice
1 1 1 A
1 1 2 B
1 1 3 B
1 2 1 B
1 2 2 B
2 1 1 A
And I want to generate two additional columns: one that returns a value based on "Choice", and one that tracks the cumulative sum of those values for each session per subject. I would like the output to look like this:
Subject Session Trial Choice Score Cum Score
1 1 1 A 1 1
1 1 2 B -1 0
1 1 3 B -1 -1
1 2 1 B -1 -1
1 2 2 B -1 -2
2 1 1 A 1 1
I have tried the following, based on answers to similar questions:
def change_score(c):
    if c['Chosen'] == A:
        return 1.0
    elif c['Chosen'] == B:
        return -1.0
    else:
        return ''
df1['change_score'] = df1.apply(change_score, axis=1)
df1['Session'] = df1['Subject'].apply(lambda x: x[:7])
df1['cumulative_score'] = df1.groupby(['Session'])['change_score'].cumsum()
This results in the following error: TypeError: 'int' object is not subscriptable
I'm (obviously) very new to python and would appreciate any help.

The TypeError comes from df1['Subject'].apply(lambda x: x[:7]): Subject holds integers, and an integer can't be sliced, so you don't need that line at all. Do this in two steps. The first is to create your Score column. Use np.where:
df['Score'] = np.where(df.Choice == 'A', 1, -1)
df
Subject Session Trial Choice Score
0 1 1 1 A 1
1 1 1 2 B -1
2 1 1 3 B -1
3 1 2 1 B -1
4 1 2 2 B -1
5 2 1 1 A 1
Alternatively, for more than two options, use a nested where:
df['Score'] = np.where(df.Choice == 'A', 1,
              np.where(df.Choice == 'B', -1, np.nan))
Note that you shouldn't mix string and numeric types in a single column (don't use '' as the fallback) if you want performance.
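To see the cost, a quick dtype check (a minimal sketch, not part of the original answer):
import numpy as np
import pandas as pd

mixed = pd.Series([1, -1, ''])      # the '' forces the whole column to object dtype
clean = pd.Series([1, -1, np.nan])  # NaN keeps the column numeric (float64)
print(mixed.dtype, clean.dtype)     # object float64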
Alternatively, use np.select (note that its default for unmatched conditions is 0; pass default=np.nan to mirror the nested where above):
df['Score'] = np.select([df.Choice == 'A', df.Choice == 'B'], [1, -1])
Now, generate the CumScore column with a groupby. Group on both Subject and Session, so the running total restarts for each subject's session:
df['CumScore'] = df.groupby(['Subject', 'Session']).Score.cumsum()
df
Subject Session Trial Choice Score CumScore
0 1 1 1 A 1 1
1 1 1 2 B -1 0
2 1 1 3 B -1 -1
3 1 2 1 B -1 -1
4 1 2 2 B -1 -2
5 2 1 1 A 1 1
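Putting both steps together as one runnable script (a sketch assuming the sample data from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Subject': [1, 1, 1, 1, 1, 2],
                   'Session': [1, 1, 1, 2, 2, 1],
                   'Trial':   [1, 2, 3, 1, 2, 1],
                   'Choice':  ['A', 'B', 'B', 'B', 'B', 'A']})
df['Score'] = np.where(df['Choice'] == 'A', 1, -1)
# grouping on both keys restarts the running total for each subject's session
df['CumScore'] = df.groupby(['Subject', 'Session'])['Score'].cumsum()
print(df)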

Related

Count number of consecutive rows that are greater than current row value but less than the value from other column

Say I have the following sample dataframe (there are about 25k rows in the real dataframe)
df = pd.DataFrame({'A' : [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
df
A B
0 0 9
1 3 8
2 2 3
3 9 5
4 1 5
5 0 5
6 4 5
7 7 8
8 3 0
9 2 4
For column A, I need to know how many of the next and previous rows have a value greater than the current row's A value but less than their corresponding value in column B.
So my expected output is :
A B next count previous count
0 9 2 0
3 8 0 0
2 3 0 1
9 5 0 0
1 5 0 0
0 5 2 1
4 5 1 0
7 8 0 0
3 0 0 2
2 4 0 0
Explanation:
First row: the next count is 2, since 3 and 2 are greater than 0 and less than their corresponding B values 8 and 3.
Second row: the next count is 0, since the next value 2 is not greater than 3.
Third row: the next count is 0, since 9 is greater than 2 but not less than its corresponding B value.
The previous counts are calculated the same way, walking backwards.
Note: I know how to solve this problem by looping with a list comprehension or the pandas apply method, but I was looking for a more idiomatic, vectorized pandas approach; that said, I wouldn't mind a clear and concise apply approach.
My Solution
Here is my apply solution, which I think is inefficient. People have also said there may be no fully vectorized solution for this question, so a more efficient apply approach would be accepted as well. This is what I have tried.
This function gets the number of previous/next rows that satisfy the condition.
def get_prev_next_count(row):
    next_nrow = df.loc[row['index']+1:, ['A', 'B']]
    prev_nrow = df.loc[:row['index']-1, ['A', 'B']][::-1]
    if next_nrow.size == 0:
        return 0, ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin()
    if prev_nrow.size == 0:
        return ((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), 0
    return (((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(),
            ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin())
Generating the output:
df[['next count', 'previous count']] = df.reset_index().apply(get_prev_next_count, axis=1, result_type="expand")
Output:
This gives us the expected output:
df
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
I made some optimizations:
You don't need reset_index(); inside apply you can access the row's index via .name.
Passing only df[['A']] instead of the whole frame may also help.
prev_nrow.empty is the same as (prev_nrow.size == 0).
The counting logic moved into first_false, which walks the rows and stops at the first failure instead of building full boolean masks; this speeds things up significantly.
def first_false(val1, val2, A):
    # count leading rows where A < neighbour's A < neighbour's B, stopping at the first failure
    i = 0
    for x, y in zip(val1, val2):
        if A < x < y:
            i += 1
        else:
            break
    return i

def get_prev_next_count(row):
    A = row['A']
    next_nrow = df.loc[row.name+1:, ['A', 'B']]
    prev_nrow = df2.loc[row.name-1:, ['A', 'B']]
    if next_nrow.empty:
        return 0, first_false(prev_nrow.A, prev_nrow.B, A)
    if prev_nrow.empty:
        return first_false(next_nrow.A, next_nrow.B, A), 0
    return (first_false(next_nrow.A, next_nrow.B, A),
            first_false(prev_nrow.A, prev_nrow.B, A))

df2 = df[::-1].copy()  # shave a tiny bit of time by only reversing it once
df[['next count', 'previous count']] = df[['A']].apply(get_prev_next_count, axis=1, result_type='expand')
print(df)
Output:
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
Timing
Expanding the data:
df = pd.concat([df]*(10000//4), ignore_index=True)
# df.shape == (25000, 2)
Original Method:
Gave up at 15 minutes.
New Method:
1m 20sec
Throw pandarallel at it:
from pandarallel import pandarallel
pandarallel.initialize()
df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')
26sec
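If the apply route is still too slow at 25k rows, a jit-compiled loop over plain numpy arrays is another option (a hedged sketch, assuming numba is installed; counts is a helper name introduced here, implementing the same stop-at-first-failure logic as first_false):
import numpy as np
from numba import njit

@njit
def counts(A, B):
    n = len(A)
    nxt = np.zeros(n, dtype=np.int64)
    prv = np.zeros(n, dtype=np.int64)
    for i in range(n):
        c = 0
        for j in range(i + 1, n):       # walk forward until the condition fails
            if A[j] > A[i] and A[j] < B[j]:
                c += 1
            else:
                break
        nxt[i] = c
        c = 0
        for j in range(i - 1, -1, -1):  # walk backward the same way
            if A[j] > A[i] and A[j] < B[j]:
                c += 1
            else:
                break
        prv[i] = c
    return nxt, prv

nxt, prv = counts(df['A'].to_numpy(), df['B'].to_numpy())
df['next count'], df['previous count'] = nxt, prv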

Get maximum occurrence of one specific value per row with pandas

I have the following dataframe:
1 2 3 4 5 6 7 8 9
0 0 0 1 0 0 0 0 0 1
1 0 0 0 0 1 1 0 1 0
2 1 1 0 1 1 0 0 1 1
...
I want to get, for each row, the longest sequence of the value 0 in the row.
So the expected result for this dataframe will be an array that looks like this:
[5,4,2,...]
since on the first row the maximum sequence of the value 0 has length 5, etc.
I have seen this post and tried, to start, to get this for the first row (though I would like to do it at once for the whole dataframe), but I got errors:
s=df_day.iloc[0]
(~s).cumsum()[s].value_counts().max()
TypeError: ufunc 'invert' not supported for the input types, and the
inputs could not be safely coerced to any supported types according to
the casting rule ''safe''
When I inserted the values manually, like this:
s=pd.Series([0,0,1,0,0,0,0,0,1])
(~s).cumsum()[s].value_counts().max()
>>>7
I got 7, which is the total number of 0s in the row, not the max sequence.
However, I don't understand why it raises the error in the first place, and, more importantly, I would like to run it in the end on the whole dataframe, row by row.
My end goal: get the maximum uninterrupted occurrence of the value 0 per row.
Vectorized solution that counts consecutive 0s per row; for the per-row maximum, take the max of DataFrame c:
# more explanation: https://stackoverflow.com/a/52718619/2901002
m = df.eq(0)                                              # mask of zeros
b = m.cumsum(axis=1)                                      # running count of zeros per row
c = b.sub(b.mask(m).ffill(axis=1).fillna(0)).astype(int)  # length of the current zero streak
print(c)
1 2 3 4 5 6 7 8 9
0 1 2 0 1 2 3 4 5 0
1 1 2 3 4 0 0 1 0 1
2 0 0 1 0 0 1 2 0 0
df['max_consecutive_0'] = c.max(axis=1)
print(df)
1 2 3 4 5 6 7 8 9 max_consecutive_0
0 0 0 1 0 0 0 0 0 1 5
1 0 0 0 0 1 1 0 1 0 4
2 1 1 0 1 1 0 0 1 1 2
Use:
df = df.T.apply(lambda x: (x != x.shift()).astype(int).cumsum().where(x.eq(0)).dropna().value_counts().max())
Output:
0 5
1 4
2 2
The following code should do the job. The function longest_streak counts the lengths of the consecutive runs of zeros in a row and returns the longest one (0 if the row has no zeros), and you can apply it row-wise on your df.
from itertools import groupby

def longest_streak(l):
    # lengths of all consecutive runs of 0
    runs = [sum(1 for _ in c) for n, c in groupby(l) if n == 0]
    return max(runs) if runs else 0  # guard: a row without zeros has streak 0

df.apply(longest_streak, axis=1)
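Another vectorized take (a sketch introduced here, not from the original answers): pad each row with sentinel 1s, then the longest zero run is the largest gap between consecutive nonzero positions, minus one.
import numpy as np

a = df.to_numpy()
padded = np.pad(a, ((0, 0), (1, 1)), constant_values=1)  # sentinel 1s on both edges
longest = [np.diff(np.flatnonzero(row)).max() - 1 for row in padded]
print(longest)  # [5, 4, 2] for the sample frame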

Subsetting the data frame and applying cumulative operation on multiple columns

I have a dataset that looks like the one below.
df=pd.DataFrame({'unit': ['ABC', 'DEF', 'GEH','IJK','DEF','XRF','BRQ'], 'A': [1,1,1,0,0,0,1], 'B': [1,1,1,1,1,1,0],'C': [1,1,1,0,0,0,1],'row_num': [7,6,5,4,3,2,1]})
I am trying to implement the following logic:
Step 1: Consider the subset with row_num <= 4.
Step 2: Columns A, B, C of that subset hold 12 values in total (0s and 1s).
Step 3: Count the number of 1s within columns A, B, C. In the example there are five 1s and seven 0s, which works out to roughly 40% (5/12) 1s.
Step 4: Since the count of 1s is greater than 40%, create a column flag with value 1; had the count of 1s been less than 10%, it would be 0.
Hopefully I got it this time:
subdf = df.iloc[3:, 1:4]  # positional take of the rows with row_num <= 4, columns A, B, C
df['flag'] = 1 if subdf.values.sum() / subdf.size >= 0.1 else 0
output:
unit A B C row_num flag
0 ABC 1 1 1 7 1
1 DEF 1 1 1 6 1
2 GEH 1 1 1 5 1
3 IJK 0 1 0 4 1
4 DEF 0 1 0 3 1
5 XRF 0 1 0 2 1
6 BRQ 1 0 1 1 1
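For reference, iloc[3:, 1:4] selects by position; a label-based variant that reads straight off the stated rule (a sketch, with sub and share as names introduced here, using the same >= 0.1 threshold as above):
sub = df.loc[df['row_num'] <= 4, ['A', 'B', 'C']]  # step 1: the row_num <= 4 subset
share = sub.to_numpy().sum() / sub.size            # step 3: fraction of 1s, 5/12 ≈ 0.417
df['flag'] = 1 if share >= 0.1 else 0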

conditionally fill all subsequent values of dataframe column

I want to "forward fill" the values of a new column in a DataFrame according to the first instance of a condition being satisfied. Here is a basic example:
import pandas as pd
import numpy as np
x1 = [1,2,4,-3,4,1]
df1 = pd.DataFrame({'x':x1})
I'd like to add a new column to df1, 'condition', where the value will be 1 upon the occurrence of a negative number and 0 otherwise, but I'd like the remaining values of the column to be set to 1 once the negative number is found.
So the desired output is as follows:
condition x
0 0 1
1 0 2
2 0 4
3 1 -3
4 1 4
5 1 1
No one's used cummax so far. On a boolean series it stays True from the first True onward, which is exactly the latch behaviour wanted here:
In [165]: df1["condition"] = (df1["x"] < 0).cummax().astype(int)
In [166]: df1
Out[166]:
x condition
0 1 0
1 2 0
2 4 0
3 -3 1
4 4 1
5 1 1
Using np.cumsum:
df1['condition'] = np.where(np.cumsum(np.where(df1['x'] < 0, 1, 0)) == 0, 0, 1)
Output:
x condition
0 1 0
1 2 0
2 4 0
3 -3 1
4 4 1
5 1 1
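The same latch can be written more tersely, since any positive running count means a negative has already occurred (an equivalent sketch):
df1['condition'] = (df1['x'] < 0).cumsum().gt(0).astype(int)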
You can use a Boolean series here; idxmax finds the first position where x is negative (note this assumes a negative value exists: if there is none, idxmax returns the first index and the whole column would wrongly become 1):
df1['condition'] = (df1.index >= (df1['x'] < 0).idxmax()).astype(int)
print(df1)
x condition
0 1 0
1 2 0
2 4 0
3 -3 1
4 4 1
5 1 1

python function on pandas df using multiple columns and reset variable

What's the best way to do the following in python/pandas, please?
I want to count the occurrences where trend data 2 is out of line with trend data 1, and reset the counter each time trend data 1 changes.
I'm struggling with the right way to do it on the dataframe, creating a new column df['D'] in this example.
df['A'] = trend data 1
df['B'] = boolean indicator if trend data 1 changes
df['C'] = trend data 2
df['D'] = desired result
df['A'] df['B'] df['C'] df['D']
1 0 1 0
1 0 1 0
-1 1 -1 0
-1 0 -1 0
-1 0 1 1
-1 0 -1 1
-1 0 -1 1
-1 0 1 2
-1 0 1 2
-1 0 -1 2
1 1 1 0
1 0 1 0
1 0 -1 1
1 0 1 1
1 0 -1 2
1 0 1 2
1 0 1 2
In excel I would simply use:
=IF(B2=1,0,IF(AND((C2<>C1),(C2<>A2)),D1+1,D1))
However, I've always struggled with not being able to reference prior cells in pandas. I can't use np.where(). I'm sure it's just a matter of applying a function in the right way, but I can't seem to make it work while referencing other columns and resetting the variable. I've looked at other answers but can't find anything that works in this situation.
Something like the following (note: first create df['E'] = df['C'].shift(1)):
def corrections(x):
    if df['B'] == 1:
        x = 0
    elif (df['C'] != df['E']) and (df['C'] != df['A']):
        x = x + 1
    else:
        x
Apologies, as I feel I'm missing something rather simple with this question, but I just keep going round in circles!
def make_D(df):
    counter = 0
    array = []
    for index in df.index:
        if df.loc[index, 'B'] == 1:
            # trend data 1 changed: reset the counter
            counter = 0
        elif index > 0 and df.loc[index, 'C'] != df.loc[index - 1, 'C'] and df.loc[index, 'C'] != df.loc[index, 'A']:
            # trend data 2 changed and disagrees with trend data 1: count it
            counter += 1
        array.append(counter)
    df['D'] = array
    return df

new_df = make_D(df)
hope it helps!
#Set a list to store values for column D
d = []
#calculate D using the given conditions
df.apply(lambda x: d.append(0) if ((x.name==0)|(x.B==1)) else d.append(d[-1]+1) if (x.C!=df.iloc[x.name-1].C) & (x.C!=x.A) else d.append(d[-1]), axis=1)
#set columns D using values from the list d.
df['D'] = d
Out[594]:
A B C D
0 1 0 1 0
1 1 0 1 0
2 -1 1 -1 0
3 -1 0 -1 0
4 -1 0 1 1
5 -1 0 -1 1
6 -1 0 -1 1
7 -1 0 1 2
8 -1 0 1 2
9 -1 0 -1 2
10 1 1 1 0
11 1 0 1 0
12 1 0 -1 1
13 1 0 1 1
14 1 0 -1 2
15 1 0 1 2
16 1 0 1 2
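For completeness, the counter-with-reset can also be vectorized: B.cumsum() labels the stretches between resets, and a per-group running sum of the increment condition gives D (a hedged sketch; the names g and cond are introduced here, verified only against the sample table above):
g = df['B'].cumsum()  # a new group starts wherever B flags a change in trend data 1
cond = df['C'].ne(df['C'].shift()) & df['C'].ne(df['A']) & df['B'].eq(0)
df['D'] = cond.astype(int).groupby(g).cumsum()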
