conditional cumulative sum based on column comparison in pandas dataframe - python

I'm relatively new to pandas, and I'm sure there is an easy solution, but I could not figure it out on my own. I have a dataframe of transactions that looks like this:
OrderId Size Price Side TimeSecO TimeUSecO TimeSecOT TimeUSecOT AmountBuy AmountSell
10 100 41.44000000 BUY 1403200077 47720 1403200100 640070
11 100 41.43000000 BUY 1403200077 47979 1403200112 43383
12 100 41.45000000 SELL 1403200077 48311 1403200090 61100
14 100 41.45000000 BUY 1403200092 253793 1403200092 374767
17 100 41.44000000 SELL 1403200103 24382 1403200125 929563
20 100 41.43000000 SELL 1403200116 208057 1403200116 226762
31 100 41.46000000 SELL 1403200214 874124 1403200259 751002
37 100 41.46000000 BUY 1403200278 494827 1403200300 729545
42 100 41.45000000 BUY 1403200335 601039 1403200361 925384
42 100 41.45000000 BUY 1403200335 601039 1403200361 925415
45 500 15.54000000 SELL 1403200365 997248 1403200741 26216
49 100 41.45000000 SELL 1403200375 419253 1403200402 959968
53 100 42.61000000 SELL 1403200377 403525 1403200377 403680
54 100 42.61000000 BUY 1403200377 501636 1403200377 501770
I want to calculate running cumulative sums for each OrderId and put them into 2 new columns split by the Side column, CumAmountBuy and CumAmountSell: for each row, sum the Size of every order whose TimeSecOT is less than that row's TimeSecO (i.e. TimeSecO > TimeSecOT).
For example, for the above dataframe the correct cumulative sums for OrderId 10, OrderId 11 and OrderId 12 would be CumAmountBuy = 0 and CumAmountSell = 0, because there are no records in the dataframe where 1403200077 > TimeSecOT.
For OrderId 14, CumAmountBuy = 0 and CumAmountSell = 100, as OrderId 12 has already completed at this point, it was a Side=SELL, and it fulfils the requirement TimeSecO > TimeSecOT (1403200092 > 1403200090).

I can think of a dirty trick, but I don't think it will stay efficient once the dataframe gets huge.
In [42]: df['flag'] = df.TimeSecO.map(lambda sec: (sec > df.TimeSecOT).values)
In [43]: df['CumAmountBuy'] = df.flag.map(lambda f: np.dot(f,df['Size']*(df['Side']=='BUY')))
In [44]: df['CumAmountSell'] = df.flag.map(lambda f: np.dot(f,df['Size']*(df['Side']=='SELL')))
In [45]: df[['CumAmountBuy','CumAmountSell']]
Out[45]:
CumAmountBuy CumAmountSell
OrderId
10 0 0
11 0 0
12 0 0
14 0 100
17 200 100
20 300 100
31 300 300
37 300 400
42 400 400
42 400 400
45 600 400
49 600 400
53 600 400
54 600 400
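A vectorized sketch of the same computation (my addition, not from the question: it still builds an n-by-n boolean matrix, so it only helps while that fits in memory, but it replaces the per-row Python lambdas with one broadcast comparison and two matrix-vector products):
import numpy as np
# Compare every row's TimeSecO against every row's TimeSecOT in one shot.
earlier = df['TimeSecO'].to_numpy()[:, None] > df['TimeSecOT'].to_numpy()[None, :]
df['CumAmountBuy'] = earlier.astype('int64') @ (df['Size'] * (df['Side'] == 'BUY')).to_numpy()
df['CumAmountSell'] = earlier.astype('int64') @ (df['Size'] * (df['Side'] == 'SELL')).to_numpy()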

Related

Sum of the column values if the rows meet the conditions

I am trying to calculate the sum of sales for stores in the same neighborhood based on their geographic coordinates. I have sample data:
data={'ID':['1','2','3','4'],'SALE':[100,120,110,95],'X':[23,22,21,24],'Y':[44,45,41,46],'X_MIN':[22,21,20,23],'Y_MIN':[43,44,40,45],'X_MAX':[24,23,22,25],'Y_MAX':[45,46,42,47]}
ID  SALE  X   Y   X_MIN  Y_MIN  X_MAX  Y_MAX
1   100   23  44  22     43     24     45
2   120   22  45  21     44     23     46
3   110   21  41  20     40     22     42
4   95    24  46  23     45     25     47
X and Y are the coordinates of the store; the MIN and MAX columns describe the area it covers. For each row, I want to sum the sales of all stores whose coordinates fall within that store's boundaries. I expect results like the table below: SUM for ID 1 equals 220 because the coordinates (X, Y) of both ID 1 and ID 2 lie within ID 1's MIN/MAX limits, while for ID 4 only that one store lies within its boundaries, so the sum of sales is 95.
final={'ID':['1','2','3','4'],'SUM':[220,220,110,95]}
ID  SUM
1   220
2   220
3   110
4   95
What I've tried:
data['SUM'] = data.apply(lambda x: data['SALE'].sum(data[(data['X'] >= x['X_MIN'])&(data['X'] <= x['X_MAX'])&(data['Y'] >= x['Y_MIN'])&(data['Y'] <= x['Y_MAX'])]),axis=1)
Unfortunately the code does not work and I am getting the following error:
TypeError: unhashable type: 'DataFrame'
I am asking for help in solving this problem.
If you put the summation at the end, your solution works:
data['SUM'] = data.apply(
    lambda x: data['SALE'][(data['X'] >= x['X_MIN']) & (data['X'] <= x['X_MAX']) &
                           (data['Y'] >= x['Y_MIN']) & (data['Y'] <= x['Y_MAX'])].sum(),
    axis=1)
Output of data['SUM']:
0    220
1    220
2    110
3     95
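For larger frames, a broadcast-based sketch of the same logic (my assumptions: the data dict has first been turned into a DataFrame, and an n-by-n comparison matrix is acceptable):
import numpy as np
import pandas as pd

data = pd.DataFrame({'ID': ['1', '2', '3', '4'], 'SALE': [100, 120, 110, 95],
                     'X': [23, 22, 21, 24], 'Y': [44, 45, 41, 46],
                     'X_MIN': [22, 21, 20, 23], 'Y_MIN': [43, 44, 40, 45],
                     'X_MAX': [24, 23, 22, 25], 'Y_MAX': [45, 46, 42, 47]})

x, y = data['X'].to_numpy(), data['Y'].to_numpy()
# inside[i, j] is True when store j's coordinates fall in store i's box.
inside = ((x >= data['X_MIN'].to_numpy()[:, None]) & (x <= data['X_MAX'].to_numpy()[:, None]) &
          (y >= data['Y_MIN'].to_numpy()[:, None]) & (y <= data['Y_MAX'].to_numpy()[:, None]))
data['SUM'] = inside.astype('int64') @ data['SALE'].to_numpy()  # -> [220, 220, 110, 95]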

How do forward roll on a specific subset of data and while modifying the original dataset?

I'm trying to calculate a cumulative sum over a specific subset of my dataset, and I want the changes to be reflected in the original dataset. The table below illustrates how I want to calculate Offset.
# OFFSET
from tqdm import tqdm

min_block = data['exit_block'].min()
max_block = data['exit_block'].max()
data['Offset'] = 0  # initialise the new column

for i in tqdm(range(min_block, min_block + 10)):
    # total size of rows still open when block i starts
    offset = data.loc[(data['exit_block'] >= i) & (data['entry_block'] < i), 'size'].sum()
    # running sum of sizes within block i, shifted by that offset
    data.loc[data['entry_block'] == i, 'Offset'] = data.loc[data['entry_block'] == i, 'size'].cumsum() + offset
    print(len(data.loc[(data['exit_block'] >= i) & (data['entry_block'] < i), 'size']))
    print(offset)
    print(data.loc[data['entry_block'] == i, 'size'].cumsum().head())
    print(data.loc[data['entry_block'] == i, 'size'].head())
    break  # debugging: stop after the first block
In the code above I'm deriving a subset B from the original dataset and trying to perform the cumulative sum operation on the original dataset using values derived from subset B.
Index  Entry_block  Exit_block  Size  Offset
1      10           20          10    10
2      11           20          150   160
3      18           20          100   260
4      19           21          40    300
5      20           21          120   120
6      20           21          180   300
7      20           21          210   510
8      20           21          90    600
9      20           21          450   1050
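A sketch of the same loop run over every entry block instead of breaking after the first one (assuming the lowercase column names used in the code above, not the capitalised ones in the table):
import pandas as pd

def add_offsets(data: pd.DataFrame) -> pd.DataFrame:
    data = data.copy()
    data['Offset'] = 0
    for i, block in data.groupby('entry_block'):
        # total size of rows still open when block i starts
        open_size = data.loc[(data['exit_block'] >= i) & (data['entry_block'] < i), 'size'].sum()
        # running sum within the block, shifted by the open size
        data.loc[block.index, 'Offset'] = block['size'].cumsum() + open_size
    return data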

Pandas: Using Append Adds New Column and Makes Another All NaN

I just started learning pandas a week ago or so and I've been struggling with a pandas dataframe for a bit now. My data looks like this:
State NY CA Other Total
Year
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
I made this table from a dataset that included 30 or so values for the variable I'm representing as State here; any value that wasn't NY or CA I summed into an 'Other' category. The years were derived from a normalized list of dates (originally a mix of mm/dd/yyyy and yyyy-mm-dd), in case that is contributing to my issue:
dct = {'Date': pd.to_datetime(my_df.Date).dt.year}  # dct rather than dict, to avoid shadowing the built-in
and later:
my_df = my_df.rename_axis('Year')
I'm trying now to append a row at the bottom that shows the totals in each category:
final_df = my_df.append({'Year': 'Total',
                         'NY': my_df.NY.sum(),
                         'CA': my_df.CA.sum(),
                         'Other': my_df.Other.sum(),
                         'Total': my_df.Total.sum()},
                        ignore_index=True)
This does technically work, but it makes my table look like this:
NY CA Other Total State
0 450 50 25 525 NaN
1 300 75 5 380 NaN
2 500 100 100 700 NaN
3 250 50 100 400 NaN
4 a b c d Total
('a' and so forth are the actual totals of the columns.) It adds a column at the beginning and puts my 'Year' column at the end. In fact, it removes the 'Date' label as well, and turns all the years in the last column into NaNs.
Is there any way I can get this formatted properly? Thank you for your time.
I believe you need to create a Series of the sums and rename it:
final_df = my_df.append(my_df.sum().rename('Total'))
print (final_df)
NY CA Other Total
State
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
Total 1500 275 230 2005
Another solution is use loc for setting with enlargement:
my_df.loc['Total'] = my_df.sum()
print (my_df)
NY CA Other Total
State
2003 450 50 25 525
2004 300 75 5 380
2005 500 100 100 700
2006 250 50 100 400
Total 1500 275 230 2005
Another idea from a previous answer: if the original table was built with crosstab, add the parameters margins=True and margins_name='Total' and the totals row comes for free:
df1 = df.assign(**dct)
out = pd.crosstab(df1['Firing'], df1['State'], margins=True, margins_name='Total')
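Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the first solution translates to pd.concat:
import pandas as pd

# Same idea as above, with concat instead of the removed append.
totals = my_df.sum().rename('Total')
final_df = pd.concat([my_df, totals.to_frame().T])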

Removing rows below first line that meets threshold in pandas dataframe

I have a df that looks like:
import pandas as pd
import numpy as np
d = {'Hours':np.arange(12, 97, 12),
'Average':np.random.random(8),
'Count':[500, 250, 125, 75, 60, 25, 5, 15]}
df = pd.DataFrame(d)
This df has a decreasing number of cases in each row. After the count drops below a certain threshold, I'd like to drop off the remainder of the rows; for example, here, everything from the point where Count < 10 is first reached.
Starting:
Average Count Hours
0 0.560671 500 12
1 0.743811 250 24
2 0.953704 125 36
3 0.313850 75 48
4 0.640588 60 60
5 0.591149 25 72
6 0.302894 5 84
7 0.418912 15 96
Finished (everything after row 6 removed):
Average Count Hours
0 0.560671 500 12
1 0.743811 250 24
2 0.953704 125 36
3 0.313850 75 48
4 0.640588 60 60
5 0.591149 25 72
We can take the index of the first row that meets the condition from the boolean mask and slice the df using iloc:
In [58]:
df.iloc[:df[df.Count < 10].index[0]]
Out[58]:
Average Count Hours
0 0.183016 500 12
1 0.046221 250 24
2 0.687945 125 36
3 0.387634 75 48
4 0.167491 60 60
5 0.660325 25 72
Just to break down what is happening here
In [54]:
# use a boolean mask to index into the df
df[df.Count < 10]
Out[54]:
Average Count Hours
6 0.244839 5 84
In [56]:
# we want the index and can subscript the first element using [0]
df[df.Count < 10].index
Out[56]:
Int64Index([6], dtype='int64')
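The iloc slice above assumes the default RangeIndex, where labels equal positions. A position-independent sketch (my variant, not from the answer) uses a cumulative boolean mask instead:
# True from the first sub-threshold row onward; ~ keeps everything before it.
hit = (df['Count'] < 10).cummax()
trimmed = df[~hit]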

filter pandas dataframe based on another column

this might be a basic question, but I have not been able to find a solution. I have two dataframes with identical rows and columns, called Volumes and Prices, which look like this
Volumes
Index ProductA ProductB ProductC ProductD Limit
0 100 300 400 78 100
1 110 370 20 30 100
2 90 320 200 121 100
3 150 320 410 99 100
....
Prices
Index ProductA ProductB ProductC ProductD Limit
0 50 110 30 90 0
1 51 110 29 99 0
2 49 120 25 88 0
3 51 110 22 96 0
....
I want to assign 0 to each "cell" of the Prices dataframe whose corresponding Volume is less than the value in the Limit column
so, the ideal output would be
Prices
Index ProductA ProductB ProductC ProductD Limit
0 50 110 30 0 0
1 51 110 0 0 0
2 0 120 25 88 0
3 51 110 22 0 0
....
I tried
import pandas as pd
import numpy as np
d_price = {'ProductA' : [50, 51, 49, 51], 'ProductB' : [110,110,120,110],
'ProductC' : [30,29,25,22],'ProductD' : [90,99,88,96], 'Limit': [0]*4}
d_volume = {'ProductA' : [100,110,90,150], 'ProductB' : [300,370,320,320],
'ProductC' : [400,20,200,410],'ProductD' : [78,30,121,99], 'Limit': [100]*4}
Prices = pd.DataFrame(d_price)
Volumes = pd.DataFrame(d_volume)
Prices[Volumes > Volumes.Limit]=0
but I do not obtain any changes to the Prices dataframe... obviously I'm having a hard time understanding boolean slicing, any help would be great
The problem is in
Prices[Volumes > Volumes.Limit]=0
Comparing a DataFrame against a Series aligns the Series index with the DataFrame columns, so Volumes > Volumes.Limit evaluates to all False and nothing changes. Compare row by row instead, for example with apply (note the condition should be <, since you want to zero out prices whose volume is below the limit):
Prices[Volumes.apply(lambda x: x < x.Limit, axis=1)] = 0
you can use a boolean mask to solve this problem; I am not an expert either, but this solution does what you want (using .loc, since .ix was removed from pandas, and comparing along axis=0 so each row is checked against its own Limit):
test = Volumes.loc[:, 'ProductA':'ProductD'].ge(Volumes['Limit'], axis=0)
final = Prices[test].fillna(0)
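For completeness, a one-step sketch with DataFrame.mask that avoids the intermediate NaNs (cols is just a hypothetical name for the product columns):
# Zero out each price whose volume falls below that row's limit.
cols = ['ProductA', 'ProductB', 'ProductC', 'ProductD']
Prices[cols] = Prices[cols].mask(Volumes[cols].lt(Volumes['Limit'], axis=0), 0)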
