Custom Cumulative Sum in Pandas - python

Noob question, apologies.
I am trying to do a cumulative sum on a table I have imported, but I wish for it to behave slightly differently at a mid point in the column before continuing. Is there a way to get cumsum() to calculate up to a row and then restart from a further point?
df['Cumulative Sum'] = df['Value'].cumsum()
| | Value | Cumulative Sum | Expected Cumulative Sum |
|----|-------|---------------|------------------------|
| 0 | 329.6 | 329.6 | 329.6 |
| 1 | 34.0 | 363.6 | 363.6 |
| 2 | 10 | 373.6 | 373.6 |
| 3 | 8 | 381.6 | 381.6 |
| 4 | 3 | 384.6 | 384.6 |
| 5 | -2 | 382.6 | 382.6 |
| 6 | -4 | 378.6 | 378.6 |
| 7 | -34 | 344.6 | 344.6 |
| 8 | -1 | 343.6 | 343.6 |
| 9 | 343.6 | 687.2 | 343.6 |
| 10 | 0 | 687.2 | 343.6 |
| 11 | -33 | 654.2 | 310.6 |
| 12 | -3 | 651.2 | 307.6 |
| 13 | 0 | 651.2 | 307.6 |
| 14 | 1 | 652.2 | 308.6 |
| 15 | 4 | 656.2 | 312.6 |
| 16 | 0 | 656.2 | 312.6 |
| 17 | 21 | 677.2 | 333.6 |
| 18 | 333.6 | 1010.8 | 333.6 |

You can get started with something like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0,100,size=(20,2)),columns=['A','B'])
def Offset_CumSum(Column, Percentage_Offset=0.5):
    # Cumulative sum over the latter part of the column only.
    return np.cumsum(Column[int(len(Column)*Percentage_Offset):])
Cumsum_DF = df.apply(lambda x: Offset_CumSum(x), axis=0)
print(df)
print(Cumsum_DF)
This produces the following output.
A B
0 29 11
1 9 51
2 99 31
3 30 44
4 76 13
5 32 48
6 85 83
7 9 98
8 49 34
9 25 0
10 39 22
11 25 96
12 69 7
13 28 6
14 4 92
15 90 32
16 68 72
17 63 25
18 85 47
19 61 31
A B
10 39 22
11 64 118
12 133 125
13 161 131
14 165 223
15 255 255
16 323 327
17 386 352
18 471 399
19 532 430
=====================================================================
Adding question-dataset-specific code after seeing the edit.
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0,100,size=(20,2)),columns=['A','B'])
def Offset_CumSum(Column, Percentage_Offset=0.5):
    # Cumsum of the first part, then a fresh cumsum of the remainder.
    split = int(len(Column) * Percentage_Offset)
    return np.cumsum(Column[:split]).tolist() + np.cumsum(Column[split:]).tolist()
Cumsum_DF = df.apply(lambda x: Offset_CumSum(x), axis=0)
print(df)
print(Cumsum_DF)
This should work.

Alternatively, you can flag the segments in the question's data and take a grouped cumulative sum (row 18 is left unflagged, so it forms its own group):
df['Group Flag'] = ""
df.loc[0:8, 'Group Flag'] = 0
df.loc[9:17, 'Group Flag'] = 1
df['Cumulative Sum'] = df.groupby('Group Flag')['Value'].cumsum()
df = df.drop('Group Flag', axis=1)
df[['Value', 'Cumulative Sum']]
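If the reset points vary, the same idea generalizes; the helper below is only a sketch, with segmented_cumsum and its resets argument being hypothetical names, assuming the row positions where the sum should restart are known up front and the dataframe has the default integer index.
import numpy as np
import pandas as pd

def segmented_cumsum(values, resets):
    # Hypothetical helper: count how many reset points lie at or before each
    # row, so rows between consecutive resets share a group id, then take a
    # grouped cumulative sum. `resets` must be sorted row positions.
    group_ids = np.searchsorted(np.asarray(resets), values.index, side='right')
    return values.groupby(group_ids).cumsum()

# For the question's data the sums restart at rows 9 and 18:
# df['Cumulative Sum'] = segmented_cumsum(df['Value'], resets=[9, 18])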

Related

groupby/eq compute mean of specific column

I'm trying to figure out how to use groupby/eq to compute the mean of a specific column. I have a df as seen below (original df).
I would like to group by 'Group' and 'players' where class equals 1 and get the mean of the 'score'.
example:
Group = a, players =2
(16+13+19)/3 = 16
+-------+---------+-------+-------+------------+
| Group | players | class | score | score_mean |
+-------+---------+-------+-------+------------+
| a | 2 | 2 | 14 | |
| a | 2 | 1 | 16 | 16 |
| a | 2 | 1 | 13 | 16 |
| a | 2 | 2 | 13 | |
| a | 2 | 1 | 19 | 16 |
| a | 2 | 2 | 17 | |
| a | 2 | 2 | 14 | |
+-------+---------+-------+-------+------------+
I've tried:
df['score_mean'] = df['class'].eq(1).groupby(['Group', 'players'])['score'].transform('mean')
but I kept getting a KeyError (df['class'].eq(1) is just a boolean Series, so it has no 'Group' or 'players' columns to group by).
original df:
+----+-------+---------+-------+-------+
| | Group | players | class | score |
+----+-------+---------+-------+-------+
| 0 | a | 1 | 1 | 10 |
| 1 | c | 2 | 1 | 20 |
| 2 | a | 1 | 3 | 29 |
| 3 | c | 1 | 3 | 22 |
| 4 | a | 2 | 2 | 14 |
| 5 | b | 1 | 2 | 16 |
| 6 | a | 2 | 1 | 16 |
| 7 | b | 2 | 3 | 17 |
| 8 | c | 1 | 2 | 22 |
| 9 | b | 1 | 2 | 23 |
| 10 | c | 2 | 2 | 22 |
| 11 | d | 1 | 1 | 13 |
| 12 | a | 2 | 1 | 13 |
| 13 | d | 1 | 3 | 23 |
| 14 | a | 2 | 2 | 13 |
| 15 | d | 2 | 1 | 34 |
| 16 | b | 1 | 3 | 32 |
| 17 | c | 2 | 2 | 29 |
| 18 | b | 2 | 2 | 28 |
| 19 | a | 2 | 1 | 19 |
| 20 | a | 1 | 1 | 19 |
| 21 | c | 1 | 1 | 27 |
| 22 | b | 1 | 3 | 47 |
| 23 | a | 2 | 2 | 17 |
| 24 | c | 1 | 1 | 14 |
| 25 | c | 2 | 2 | 25 |
| 26 | a | 1 | 3 | 67 |
| 27 | b | 2 | 3 | 21 |
| 28 | a | 1 | 3 | 27 |
| 29 | c | 1 | 1 | 16 |
| 30 | a | 2 | 2 | 14 |
| 31 | b | 1 | 2 | 25 |
+----+-------+---------+-------+-------+
data = {'Group':['a','c','a','c','a','b','a','b','c','b','c','d','a','d','a','d',
'b','c','b','a','a','c','b','a','c','c','a','b','a','c','a','b'],
'players':[1,2,1,1,2,1,2,2,1,1,2,1,2,1,2,2,1,2,2,2,1,1,1,2,1,2,1,2,1,1,2,1],
'class':[1,1,3,3,2,2,1,3,2,2,2,1,1,3,2,1,3,2,2,1,1,1,3,2,1,2,3,3,3,1,2,2],
'score':[10,20,29,22,14,16,16,17,22,23,22,13,13,23,13,34,32,29,28,19,19,27,47,17,14,25,67,21,27,16,14,25]}
df = pd.DataFrame(data)
Kindly advise.
Many thanks & regards.
Try via the set_index(), groupby(), assign() and reset_index() methods:
df = (df.set_index(['Group', 'players'])
        .assign(score_mean=df[df['class'].eq(1)].groupby(['Group', 'players'])['score'].mean())
        .reset_index())
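This works because the grouped mean is a Series indexed by (Group, players), so assign() aligns it against the MultiIndex created by set_index(). A quick sanity check against the OP's example, assuming the data above:
print(df[df['class'].eq(1)].groupby(['Group', 'players'])['score'].mean().loc[('a', 2)])  # 16.0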
Update:
If you want the first df as your output, then use:
grouped = df.groupby(['Group', 'players', 'class']).transform('mean')
grouped = (grouped.assign(players=df['players'], Group=df['Group'], Class=df['class'])
                  .where(df['Group'] == 'a')
                  .dropna())
grouped['score'] = grouped.apply(lambda x: float('NaN') if x['players'] == 1 else x['score'], axis=1)
grouped = grouped.dropna(subset=['score'])
Now if you print grouped, you will get your desired output.
If I got you right, you need values returned only where class=1. I'm not sure what that will serve, but the code is below: use groupby transform and chain where.
df['score_mean'] = df.groupby(['Group', 'players'])['score'].transform('mean').where(df['class'] == 1).fillna('')
   Group  players  class  score          score_mean
0      a        1      1     10                30.4
1      c        2      1     20                24.0
2      a        1      3     29
3      c        1      3     22
4      a        2      2     14
5      b        1      2     16
6      a        2      1     16  15.142857142857142
7      b        2      3     17
8      c        1      2     22
9      b        1      2     23
10     c        2      2     22
11     d        1      1     13                18.0
12     a        2      1     13  15.142857142857142
13     d        1      3     23
14     a        2      2     13
15     d        2      1     34                34.0
16     b        1      3     32
17     c        2      2     29
18     b        2      2     28
19     a        2      1     19  15.142857142857142
20     a        1      1     19                30.4
21     c        1      1     27                20.2
22     b        1      3     47
23     a        2      2     17
24     c        1      1     14                20.2
25     c        2      2     25
26     a        1      3     67
27     b        2      3     21
28     a        1      3     27
29     c        1      1     16                20.2
30     a        2      2     14
31     b        1      2     25
You could first filter by class and then create score_mean by doing a groupby and transform.
(
df[df['class']==1]
.assign(score_mean = lambda x: x.groupby(['Group', 'players']).score.transform('mean'))
)
Group players class score score_mean
0 a 1 1 10 14.5
1 c 2 1 20 20.0
6 a 2 1 16 16.0
11 d 1 1 13 13.0
12 a 2 1 13 16.0
15 d 2 1 34 34.0
19 a 2 1 19 16.0
20 a 1 1 19 14.5
21 c 1 1 27 19.0
24 c 1 1 14 19.0
29 c 1 1 16 19.0
If you want to keep other classes and set the mean to '', you can do:
(
df[df['class']==1]
.groupby(['Group', 'players']).score.transform('mean')
.pipe(lambda x: df.assign(score_mean = x))
.fillna('')
)

Removing duplicates with three conditions, Pandas

I have the following dataframe:
    reference | topcredit | currentbalance | creditlimit
1   1         | 50        | 20             | 70
2   1         | 30        | 28             | 50
3   1         | 50        | 20             | 70
4   1         | 81        | 32             | 100
5   2         | 70        | 0              | 56
6   2         | 50        | 20             | 70
7   2         | 100       | 0              | 150
8   3         | 85        | 85             | 95
9   3         | 85        | 85             | 95
And so on...
I want to drop duplicates based on 'reference', but only those rows that also have the same topcredit, currentbalance and creditlimit.
In reference 1, lines 1 and 3 have the same numbers in all three columns, but line 6 of reference 2 has those same numbers too; I would like to keep one of the duplicates in reference 1 and also keep line 6 of reference 2. In reference 3 both lines have the same information as well.
The expected output is:
reference | topcredit | currentbalance | creditlimit
1 | 50 | 20 | 70
1 | 30 | 28 | 50
1 | 81 | 32 | 100
2 | 70 | 0 | 56
2 | 50 | 20 | 70
2 | 100 | 0 | 150
3 | 85 | 85 | 95
I would appreciate the help; I've been searching for how to do this for a while.
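A minimal sketch of one way to do this, assuming the first occurrence of each duplicated combination is the one to keep, is drop_duplicates with an explicit subset:
# Keep one row per (reference, topcredit, currentbalance, creditlimit) combination.
df = df.drop_duplicates(
    subset=['reference', 'topcredit', 'currentbalance', 'creditlimit'],
    keep='first',
).reset_index(drop=True)
Because reference is part of the subset, rows that repeat the three value columns under a different reference are kept.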

Pandas: sum multiple columns based on similar consecutive numbers in another column

Given the following table
+----+--------+--------+--------------+
| Nr | Price | Volume | Transactions |
+----+--------+--------+--------------+
| 1 | 194.6 | 100 | 1 |
| 2 | 195 | 10 | 1 |
| 3 | 194.92 | 100 | 1 |
| 4 | 194.92 | 52 | 1 |
| 5 | 194.9 | 99 | 1 |
| 6 | 194.86 | 74 | 1 |
| 7 | 194.85 | 900 | 1 |
| 8 | 194.85 | 25 | 1 |
| 9 | 194.85 | 224 | 1 |
| 10 | 194.6 | 101 | 1 |
| 11 | 194.85 | 19 | 1 |
| 12 | 194.6 | 10 | 1 |
| 13 | 194.6 | 25 | 1 |
| 14 | 194.53 | 12 | 1 |
| 15 | 194.85 | 14 | 1 |
| 16 | 194.6 | 11 | 1 |
| 17 | 194.85 | 93 | 1 |
| 18 | 195 | 90 | 1 |
| 19 | 195 | 100 | 1 |
| 20 | 195 | 50 | 1 |
| 21 | 195 | 50 | 1 |
| 22 | 195 | 25 | 1 |
| 23 | 195 | 5 | 1 |
| 24 | 195 | 500 | 1 |
| 25 | 195 | 100 | 1 |
| 26 | 195.09 | 100 | 1 |
| 27 | 195 | 120 | 1 |
| 28 | 195 | 60 | 1 |
| 29 | 195 | 40 | 1 |
| 30 | 195 | 10 | 1 |
| 31 | 194.6 | 1 | 1 |
| 32 | 194.99 | 1 | 1 |
| 33 | 194.81 | 20 | 1 |
| 34 | 194.81 | 50 | 1 |
| 35 | 194.97 | 17 | 1 |
| 36 | 194.99 | 25 | 1 |
| 37 | 195 | 75 | 1 |
+----+--------+--------+--------------+
For faster testing, here is the same table as a pandas dataframe:
pd_data_before = pd.DataFrame([[1,194.6,100,1],[2,195,10,1],[3,194.92,100,1],[4,194.92,52,1],[5,194.9,99,1],[6,194.86,74,1],[7,194.85,900,1],[8,194.85,25,1],[9,194.85,224,1],[10,194.6,101,1],[11,194.85,19,1],[12,194.6,10,1],[13,194.6,25,1],[14,194.53,12,1],[15,194.85,14,1],[16,194.6,11,1],[17,194.85,93,1],[18,195,90,1],[19,195,100,1],[20,195,50,1],[21,195,50,1],[22,195,25,1],[23,195,5,1],[24,195,500,1],[25,195,100,1],[26,195.09,100,1],[27,195,120,1],[28,195,60,1],[29,195,40,1],[30,195,10,1],[31,194.6,1,1],[32,194.99,1,1],[33,194.81,20,1],[34,194.81,50,1],[35,194.97,17,1],[36,194.99,25,1],[37,195,75,1]],columns=['Nr','Price','Volume','Transactions'])
The question is how do we sum up the volume and transactions based on similar consecutive prices? The end result would be something like this:
+----+--------+--------+--------------+
| Nr | Price | Volume | Transactions |
+----+--------+--------+--------------+
| 1 | 194.6 | 100 | 1 |
| 2 | 195 | 10 | 1 |
| 4 | 194.92 | 152 | 2 |
| 5 | 194.9 | 99 | 1 |
| 6 | 194.86 | 74 | 1 |
| 9 | 194.85 | 1149 | 3 |
| 10 | 194.6 | 101 | 1 |
| 11 | 194.85 | 19 | 1 |
| 13 | 194.6 | 35 | 2 |
| 14 | 194.53 | 12 | 1 |
| 15 | 194.85 | 14 | 1 |
| 16 | 194.6 | 11 | 1 |
| 17 | 194.85 | 93 | 1 |
| 25 | 195 | 920 | 8 |
| 26 | 195.09 | 100 | 1 |
| 30 | 195 | 230 | 4 |
| 31 | 194.6 | 1 | 1 |
| 32 | 194.99 | 1 | 1 |
| 34 | 194.81 | 70 | 2 |
| 35 | 194.97 | 17 | 1 |
| 36 | 194.99 | 25 | 1 |
| 37 | 195 | 75 | 1 |
+----+--------+--------+--------------+
You can also find the result ready-made in a pandas dataframe below:
pd_data_after = pd.DataFrame([[1,194.6,100,1],[2,195,10,1],[4,194.92,152,2],[5,194.9,99,1],[6,194.86,74,1],[9,194.85,1149,3],[10,194.6,101,1],[11,194.85,19,1],[13,194.6,35,2],[14,194.53,12,1],[15,194.85,14,1],[16,194.6,11,1],[17,194.85,93,1],[25,195,920,8],[26,195.09,100,1],[30,195,230,4],[31,194.6,1,1],[32,194.99,1,1],[34,194.81,70,2],[35,194.97,17,1],[36,194.99,25,1],[37,195,75,1]],columns=['Nr','Price','Volume','Transactions'])
I managed to achieve this with a for loop, but it is very slow when iterating over each row. My data set is huge, around 50 million rows.
Is there any way to achieve this without looping?
A common trick to groupby consecutive values is the following:
df.col.ne(df.col.shift()).cumsum()
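To see what that expression produces, here is a toy illustration (the series is made up, not from the question):
import pandas as pd

s = pd.Series([1, 1, 2, 2, 1])
# shift() misaligns the series by one, ne() marks the start of each run,
# and cumsum() turns those marks into consecutive group ids.
print(s.ne(s.shift()).cumsum().tolist())  # [1, 1, 2, 2, 3]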
We can use that here, then use agg to keep the first values of the columns we aren't summing, and to sum the values we do want to sum.
(df.groupby(df.Price.ne(df.Price.shift()).cumsum())
.agg({'Nr': 'last', 'Price': 'first', 'Volume':'sum', 'Transactions': 'sum'})
).reset_index(drop=True)
Nr Price Volume Transactions
0 1 194.60 100 1
1 2 195.00 10 1
2 4 194.92 152 2
3 5 194.90 99 1
4 6 194.86 74 1
5 9 194.85 1149 3
6 10 194.60 101 1
7 11 194.85 19 1
8 13 194.60 35 2
9 14 194.53 12 1
10 15 194.85 14 1
11 16 194.60 11 1
12 17 194.85 93 1
13 25 195.00 920 8
14 26 195.09 100 1
15 30 195.00 230 4
16 31 194.60 1 1
17 32 194.99 1 1
18 34 194.81 70 2
19 35 194.97 17 1
20 36 194.99 25 1
21 37 195.00 75 1

Pandas: How to create a multi-indexed pivot

I have a set of experiments defined by two variables: scenario and height. For each experiment, I take 3 measurements: result 1, 2 and 3.
The dataframe that collects all the results looks like this:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['Scenario']= np.repeat(['Scenario a','Scenario b','Scenario c'],3)
df['height'] = np.tile([0,1,2],3)
df['Result 1'] = np.arange(1,10)
df['Result 2'] = np.arange(20,29)
df['Result 3'] = np.arange(30,39)
If I run the following:
mypiv = df.pivot('Scenario','height').transpose()
writer = pd.ExcelWriter('test_df_pivot.xlsx')
mypiv.to_excel(writer,'test df pivot')
writer.save()
I obtain a dataframe where columns are the scenarios, and the rows have a multi-index defined by result and height:
+----------+--------+------------+------------+------------+
| | height | Scenario a | Scenario b | Scenario c |
+----------+--------+------------+------------+------------+
| Result 1 | 0 | 1 | 4 | 7 |
| | 1 | 2 | 5 | 8 |
| | 2 | 3 | 6 | 9 |
| Result 2 | 0 | 20 | 23 | 26 |
| | 1 | 21 | 24 | 27 |
| | 2 | 22 | 25 | 28 |
| Result 3 | 0 | 30 | 33 | 36 |
| | 1 | 31 | 34 | 37 |
| | 2 | 32 | 35 | 38 |
+----------+--------+------------+------------+------------+
How can I create a pivot where the indices are swapped, i.e. height first, then result?
I couldn't find a way to create it directly. I managed to get what I want by swapping the levels and then re-sorting the results:
mypiv2 = mypiv.swaplevel(0,1 , axis=0).sortlevel(level=0,axis=0,sort_remaining=True)
but I was wondering if there is a more direct way.
You can first set_index and then stack with unstack:
print (df.set_index(['height','Scenario']).stack().unstack(level=1))
Scenario Scenario a Scenario b Scenario c
height
0 Result 1 1 4 7
Result 2 20 23 26
Result 3 30 33 36
1 Result 1 2 5 8
Result 2 21 24 27
Result 3 31 34 37
2 Result 1 3 6 9
Result 2 22 25 28
Result 3 32 35 38
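As a side note, sortlevel has since been removed from pandas; in current versions the OP's swap-and-sort approach would be spelled with sort_index, e.g.:
mypiv2 = mypiv.swaplevel(0, 1, axis=0).sort_index(level=0)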

Python: Running maximum by another column?

I have a dataframe like this, which tracks the value of certain items (ids) over time:
mytime=np.tile( np.arange(0,10) , 2 )
myids=np.repeat( [123,456], [10,10] )
myvalues=np.random.randint(20,31,10*2)
df=pd.DataFrame()
df['myids']=myids
df['mytime']=mytime
df['myvalues']=myvalues
+-------+--------+----------+
| myids | mytime | myvalues |
+-------+--------+----------+
|   123 |      0 |       29 |
|   123 |      1 |       23 |
|   123 |      2 |       26 |
|   123 |      3 |       24 |
|   123 |      4 |       25 |
|   123 |      5 |       29 |
|   123 |      6 |       28 |
|   123 |      7 |       21 |
|   123 |      8 |       20 |
|   123 |      9 |       26 |
|   456 |      0 |       26 |
|   456 |      1 |       24 |
|   456 |      2 |       20 |
|   456 |      3 |       26 |
|   456 |      4 |       29 |
|   456 |      5 |       29 |
|   456 |      6 |       24 |
|   456 |      7 |       21 |
|   456 |      8 |       27 |
|   456 |      9 |       29 |
+-------+--------+----------+
I'd need to calculate the running maximum for each id.
np.maximum.accumulate()
would calculate the running maximum regardless of id, whereas I need a similar calculation, which however resets every time the id changes. I can think of a simple script to do it in numba (I have very large arrays and non-vectorised non-numba code would be slow), but is there an easier way to do it?
With just two ids I can run:
df['running max']= np.hstack(( np.maximum.accumulate(df[ df['myids']==123 ]['myvalues']) , np.maximum.accumulate(df[ df['myids']==456 ]['myvalues']) ) )
but this is not feasible with lots and lots of values.
Thanks!
Here you go. The assumption is that mytime is sorted.
mytime=np.tile( np.arange(0,10) , 2 )
myids=np.repeat( [123,456], [10,10] )
myvalues=np.random.randint(20,31,10*2)
df=pd.DataFrame()
df['myids']=myids
df['mytime']=mytime
df['myvalues']=myvalues
groups = df.groupby('myids')
df['run_max_group'] = groups['myvalues'].transform(np.maximum.accumulate)
Output...
myids mytime myvalues run_max_group
0 123 0 27 27
1 123 1 21 27
2 123 2 24 27
3 123 3 25 27
4 123 4 22 27
5 123 5 20 27
6 123 6 20 27
7 123 7 30 30
8 123 8 24 30
9 123 9 22 30
10 456 0 29 29
11 456 1 23 29
12 456 2 30 30
13 456 3 28 30
14 456 4 26 30
15 456 5 25 30
16 456 6 28 30
17 456 7 27 30
18 456 8 20 30
19 456 9 24 30
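For what it's worth, pandas' built-in grouped cummax should give the same column without passing a NumPy ufunc:
df['run_max_group'] = groups['myvalues'].cummax()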
It seems that it is indeed not too difficult:
byid = df.groupby('myids')
rmax = byid['myvalues'].cummax()
for k, indices in byid.indices.items():
    print('myid = %s' % k)
    print('running max = %s' % rmax[indices])
I have (almost) no previous pandas experience, but using ipython as an exploratory instrument I was able to find a solution. I recommend the use of ipython to explore large and complex libraries.
P.S. Re my previous comment: no need for axis=.
