Conditional groupby and shift in pandas - python

I have a dataframe that looks like this, which stores the order-by-order information of an order book. Type N = new order, D = delete order and E = execution. The same order_id might be reused.
So basically the problem is that delete and execution rows do not have a proper price; it should be inferred from the last new order with the same order_id. Could someone suggest a method to achieve this? Thank you
Input
  type  order_id  price
0    N        10     99
1    E        10      0
2    E        10      0
3    D        10      0
4    N        11     98
5    N        10     97
6    D        10      0
Output
  type  order_id   price
0    N        10      99
1    E        10  **99**
2    E        10  **99**
3    D        10  **99**
4    N        11      98
5    N        10      97
6    D        10  **97**

It seems like you need replace + ffill; here I am assuming your df is already in the correct order.
df.replace(0,np.nan).ffill()
Out[758]:
  type  order_id  price
0    N        10   99.0
1    E        10   99.0
2    E        10   99.0
3    D        10   99.0
4    N        11   98.0
5    N        10   97.0
6    D        10   97.0
Or, adding a groupby so the forward fill never crosses a new order:
df.assign(price=df.price.replace(0,np.nan)).groupby(df.type.eq('N').cumsum()).price.ffill().values
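As a self-contained check, here is a minimal runnable sketch of the grouped variant, with the sample frame reconstructed from the question's tables (the exact constructor is an assumption):
import numpy as np
import pandas as pd

df = pd.DataFrame({'type': ['N', 'E', 'E', 'D', 'N', 'N', 'D'],
                   'order_id': [10, 10, 10, 10, 11, 10, 10],
                   'price': [99, 0, 0, 0, 98, 97, 0]})

# each 'N' starts a new segment, so the fill never crosses a new order
segment = df['type'].eq('N').cumsum()   # [1, 1, 1, 1, 2, 3, 3]
df['price'] = df['price'].replace(0, np.nan).groupby(segment).ffill()
print(df['price'].tolist())             # [99.0, 99.0, 99.0, 99.0, 98.0, 97.0, 97.0]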

I think you need:
df['price'] = df['price'].mask(df['type'].isin(['E','D']))
#df['price'] = df['price'].where(df['type'] == 'N')
df['price'] = df.groupby(df['order_id'].ne(df['order_id'].shift()).cumsum())['price'].ffill()
print (df)
  type  order_id  price
0    N        10   99.0
1    E        10   99.0
2    E        10   99.0
3    D        10   99.0
4    N        11   98.0
5    N        10   97.0
6    D        10   97.0
Explanation:
First, replace the price values with NaNs using mask, or use the inverted condition with where.
Then group by a helper Series built from consecutive runs of order_id and forward-fill the NaNs with DataFrameGroupBy.ffill.
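A sketch of the intermediate steps, assuming the same sample frame as above; the helper Series gives one id per consecutive run of order_id:
import pandas as pd

df = pd.DataFrame({'type': ['N', 'E', 'E', 'D', 'N', 'N', 'D'],
                   'order_id': [10, 10, 10, 10, 11, 10, 10],
                   'price': [99, 0, 0, 0, 98, 97, 0]})

# hide prices on E/D rows so they can be filled from the preceding N
df['price'] = df['price'].mask(df['type'].isin(['E', 'D']))

# consecutive runs of the same order_id: [1, 1, 1, 1, 2, 3, 3]
helper = df['order_id'].ne(df['order_id'].shift()).cumsum()
df['price'] = df.groupby(helper)['price'].ffill()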

Last row of some column in dataframe not included

I am trying to find the average of a value over a run of index 0, just before it changes to another index.
An example of the dataframe:
    column_a  value_b  sum_c  count_d_  avg_e
0          0       10     10         1
1          0       20     30         2
2          0       30     60         3     20
3          1       10     10         1
4          1       20     30         2
5          1       30     60         3     20
6          0       10     10         1
7          0       20     30         2     15
8          1       10     10         1
9          1       20     30         2
10         1       30     60         3     20
11         0       10     10         1
12         0       20
However, the sum and count for the last row are unavailable, so the avg cannot be calculated for it.
Part of the code:
# sum and avg for each section
for i, row in df.iloc[0:-1].iterrows():
    if df['column_a'][i] == 0:
        sum = sum + df['value_b'][i]
        df['sum_c'][i] = sum
        count = count + 1
        df['count_d'][i] = count
    else:
        sum = 0
        count = 0
        df['sum_c'][i] = sum
        df['count_d'][i] = count

totcount = 0
for m, row in df.iloc[0:-1].iterrows():
    if df.loc[m, 'column_a'] == 0:
        if df.loc[m+1, 'sum_c'] == 0:
            totcount = df.loc[m, 'count_d']
            avg_e = df.loc[m, 'sum_c'] / totcount
            df.loc[m, 'avg_e'] = avg_e
I have tried using df.iloc[0:].iterrows() instead, but it produces an error.
You can rewrite your full code with groupby.cumsum, groupby.cumcount, groupby.transform('mean'), and masking with where.
# compute a mask with True for the last value of each successive group
m = df['column_a'].ne(df['column_a'].shift(-1))
# make a grouper (cumsum over the reversed mask gives one id per run)
group = m[::-1].cumsum()
# group value_b by the grouper
g = df.groupby(group)['value_b']
# compute the cumsum
df['sum_c'] = g.cumsum()
# compute the cumcount
df['count_d_'] = g.cumcount().add(1)
# compute the mean and assign to the last row per group
df['avg_e'] = g.transform('mean').where(m)
Output:
column_a value_b sum_c count_d_ avg_e
0 0 10 10 1 NaN
1 0 20 30 2 NaN
2 0 30 60 3 20.0
3 1 10 10 1 NaN
4 1 20 30 2 NaN
5 1 30 60 3 20.0
6 0 10 10 1 NaN
7 0 20 30 2 15.0
8 1 10 10 1 NaN
9 1 20 30 2 NaN
10 1 30 60 3 20.0
11 0 10 10 1 NaN
12 0 20 30 2 15.0
It is the expected behavior of df.iloc[0:-1] to return all the rows except the last one. When using slicing, remember that the last index you provide is not included in the returned range. Since -1 is the index of the last row, [0:-1] excludes the last row.
The solution given by @mozway is more elegant anyway, but if for any reason you still want to use iterrows(), you can use df.iloc[0:].
The error you got when you did so is likely due to df.loc[m+1, 'sum_c']: at the last row, m+1 is out of bounds and raises a KeyError.
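A quick illustration of the slicing behavior with a throwaway Series:
import pandas as pd

s = pd.Series([10, 20, 30, 40])
print(s.iloc[0:-1].tolist())   # [10, 20, 30] - the last row is excluded
print(s.iloc[0:].tolist())     # [10, 20, 30, 40] - all rows are kept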

Python Dataframe compute season wise performance compared to a reference (100%)

I have a dataframe with seasons as a column. Each season has two item varieties. I want to compute, for each season, the performance of all items relative to item A, where item A's performance is marked as 100%.
code:
xdf = pd.DataFrame({'Season':[1,1,2,2,3,3],'item':['A','B','A','B','A','B'],'value':[25,30,50,75,40,60]})
xdf =
Season item value
0 1 A 25
1 1 B 30
2 2 A 50
3 2 B 75
4 3 A 40
5 3 B 60
Expected answer:
xdf =
Season item value %value
0 1 A 25 100 # This value(25) is taken as reference i.e. 100%
1 1 B 30 120 # Now value of `B` in Season 1 is 30. So, performance % = 30/25 *100
2 2 A 50 100 # This value(50) is taken as reference i.e. 100%
3 2 B 75 150 # Now value of `B` in Season 2 is 75. So, performance % = 75/50 *100
4 3 A 40 100 # This value(40) is taken as reference i.e. 100%
5 3 B 60 150 # Now value of `B` in Season 3 is 60. So, performance % = 60/40 *100
Let us create a MultiIndex on Season and item in order to simplify the calculation:
s = xdf.set_index(['Season', 'item'])['value']
xdf['%value'] = s.div(s.xs('A', level=1)).mul(100).tolist()
Season item value %value
0 1 A 25 100.0
1 1 B 30 120.0
2 2 A 50 100.0
3 2 B 75 150.0
4 3 A 40 100.0
5 3 B 60 150.0
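For comparison, a hedged alternative to the MultiIndex division: broadcast each season's 'A' value with groupby.transform (this sketch assumes exactly one 'A' row per Season):
# reference value per season: the 'A' row's value, broadcast to every row
ref = (xdf['value'].where(xdf['item'].eq('A'))       # keep only the 'A' values
       .groupby(xdf['Season']).transform('first'))   # 'first' skips the NaNs
xdf['%value'] = xdf['value'].div(ref).mul(100)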
If your rows are correctly ordered ('A' before 'B'), you can use pct_change. If needed, sort by item first.
xdf['%value'] = xdf.groupby('Season')['value'].pct_change().fillna(0) * 100 + 100
print(xdf)
# Output
Season item value %value
0 1 A 25 100.0
1 1 B 30 120.0
2 2 A 50 100.0
3 2 B 75 150.0
4 3 A 40 100.0
5 3 B 60 150.0
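If the order within a season is not guaranteed, a sort along these lines should restore it before applying pct_change ('A' sorts before 'B' alphabetically):
xdf = xdf.sort_values(['Season', 'item'], ignore_index=True)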
Hopefully this solves your problem.
df = xdf.copy()
for i in xdf['Season'].unique():
    _df = xdf[xdf['Season'] == i]
    idx = _df.iloc[1:].index
    val = _df.iloc[0].value
    for j in idx:
        cv = xdf.iloc[j].value
        xdf.at[j, 'value'] = (100 / val) * cv
xdf.loc[xdf.groupby('Season')['value'].head(1).index, 'value'] = 100
df['%value'] = xdf['value']
df

Calculate time delta from last occurrence of some boolean column (from last 1) per category?

Given a simple dataframe:
df = pd.DataFrame({'user': ['x','x','x','x','x','y','y','y','y'],
                   'Flag': [0,1,0,0,1,0,1,0,0],
                   'time': [10,34,40,43,44,12,20,46,51]})
I want to calculate the timedelta from the last flag == 1 for each user.
I did the diffs:
df.sort_values(['user', 'time']).groupby('user')['time'].diff().fillna(pd.Timedelta(10000000)).dt.total_seconds()/60
But it doesn't seem to solve my issue; I need the time delta between the 1's, and if there wasn't any, then fill with some number N.
Please advise
For example:
user Flag time diff
0 x 0 10 NaN
1 x 1 34 NaN
2 x 0 40 6.0
3 x 0 43 9.0
4 x 1 44 10.0
5 y 0 12 NaN
6 y 1 20 NaN
7 y 0 46 26.0
8 y 0 51 31.0
I am not sure I understood correctly, but if you want to compute the time delta only between the 1's per user group, you can apply your computation to the dataframe sliced to 1's only, using groupby:
df['delta'] = (df[df['Flag'].eq(1)]     # select 1's only
               .groupby('user')         # group by user
               ['time'].diff()          # compute the diff
               .dt.total_seconds()/60   # convert to minutes
               )
Output (note: in this example the time column is datetime-like, hence the .dt accessor):
user Flag time delta
0 x 0 0 days 10:30:00 NaN
1 x 1 0 days 11:34:00 NaN
2 x 0 0 days 11:43:00 NaN
3 y 0 0 days 13:43:00 NaN
4 y 1 0 days 14:40:00 NaN
5 y 0 0 days 15:32:00 NaN
6 y 1 0 days 18:30:00 230.0
7 w 0 0 days 19:30:00 NaN
8 w 0 0 days 20:11:00 NaN
Edit: here is a working solution for the updated question.
IIUC the update, you want the difference to the last 1 per user, and, for rows where the flag is 1, the difference to the last valid value per user, if any.
In summary, it creates a subgroup for each range starting with a 1, then uses these groups to calculate the diffs. Finally, it masks the 1s with the diff to their previous value (if one exists).
(df.assign(mask=df['Flag'].eq(1),
           group=lambda d: d.groupby('user')['mask'].cumsum(),
           # diff from the last 1
           diff=lambda d: (d.groupby(['user', 'group'])['time']
                            .apply(lambda g: g - (g.iloc[0] if g.name[1] > 0
                                                  else float('nan')))),
           )
 # mask the 1s with the diff to their previous value (if any)
 .assign(diff=lambda d: d['diff'].mask(d['mask'],
                                       (d[d['mask'].groupby(d['user']).cumsum().eq(0) | d['mask']]
                                        .groupby('user')['time'].diff())
                                       )
         )
 .drop(['mask', 'group'], axis=1)  # clean up the temporary columns
)
Output:
user Flag time diff
0 x 0 10 NaN
1 x 1 34 24.0
2 x 0 40 6.0
3 x 0 43 9.0
4 x 1 44 10.0
5 y 0 12 NaN
6 y 1 20 8.0
7 y 0 46 26.0
8 y 0 51 31.0
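For the exact table the question asked for (diff to the most recent earlier 1 per user, NaN before the first 1), a shorter sketch with where + shift + ffill also works, assuming rows are already time-ordered per user:
import pandas as pd

df = pd.DataFrame({'user': ['x','x','x','x','x','y','y','y','y'],
                   'Flag': [0,1,0,0,1,0,1,0,0],
                   'time': [10,34,40,43,44,12,20,46,51]})

# time of the most recent 1 strictly before each row, per user
prev_one = (df['time'].where(df['Flag'].eq(1))        # keep times of 1-rows only
            .groupby(df['user'])
            .transform(lambda s: s.shift().ffill()))  # shift so a 1 measures against the previous 1

df['diff'] = df['time'] - prev_one   # NaN where no previous 1 exists
# df['diff'] = df['diff'].fillna(N)  # optionally fill those with some number N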

How to subtract rows in a df based on a value in another column

I am trying to calculate the difference in certain rows based on the values from other columns.
Using the example data frame below, I want to calculate the difference in Time based on the values in the Code column. Specifically, I want to loop through and determine the time difference between B and A, i.e. Time at B minus Time at A.
I can do this manually using the iloc function, but I was hoping to find a more efficient way, especially if I have to repeat this process numerous times.
import pandas as pd
import numpy as np

k = 5
N = 15
d = {'Time': np.random.randint(k, k + 100, size=N),
     'Code': ['A','x','B','x','A','x','B','x','A','x','B','x','A','x','B']}
df = pd.DataFrame(data=d)
Output:
Code Time
0 A 89
1 x 39
2 B 24
3 x 62
4 A 83
5 x 57
6 B 69
7 x 10
8 A 87
9 x 62
10 B 86
11 x 11
12 A 54
13 x 44
14 B 71
Expected Output:
diff
1 -65
2 -14
3 -1
4 17
First filter with boolean indexing, then subtract with sub, using reset_index for a default index so that Series a and b align; last, if you want a one-column DataFrame, add to_frame:
a = df.loc[df['Code'] == 'A', 'Time'].reset_index(drop=True)
b = df.loc[df['Code'] == 'B', 'Time'].reset_index(drop=True)
Similar alternative solution:
a = df.loc[df['Code'] == 'A'].reset_index()['Time']
b = df.loc[df['Code'] == 'B'].reset_index()['Time']
c = b.sub(a).to_frame('diff')
print (c)
diff
0 -65
1 -14
2 -1
3 17
Last, for a new index starting from 1, add rename:
c = b.sub(a).to_frame('diff').rename(lambda x: x + 1)
print (c)
diff
1 -65
2 -14
3 -1
4 17
Another approach, if you need to compute more differences, is to reshape with unstack:
df = df.set_index(['Code', df.groupby('Code').cumcount() + 1])['Time'].unstack()
print (df)
1 2 3 4 5 6 7
Code
A 89.0 83.0 87.0 54.0 NaN NaN NaN
B 24.0 69.0 86.0 71.0 NaN NaN NaN
x 39.0 62.0 57.0 10.0 62.0 11.0 44.0
# last, remove the NaN rows
c = df.loc['B'].sub(df.loc['A']).dropna()
print (c)
1 -65.0
2 -14.0
3 -1.0
4 17.0
dtype: float64
# subtract with NaN values - fill_value=0 treats missing entries as 0, so no NaNs are returned
d = df.loc['x'].sub(df.loc['A'], fill_value=0)
print (d)
1 -50.0
2 -21.0
3 -30.0
4 -44.0
5 62.0
6 11.0
7 44.0
dtype: float64
Assuming your Code column just repeats the pattern 'A', 'x', 'B', 'x', you can use
>>> (df.Time[df.Code == 'B'].reset_index() - df.Time[df.Code == 'A'].reset_index())[['Time']]
Time
0 -65
1 -14
2 -1
3 17
But note that the original assumption, that 'A' and 'B' values alternate, seems fragile.
If you want the indexes to run from 1 to 4, as in your question, you can assign the previous expression to diff, and then use
diff.index += 1
>>> diff
Time
1 -65
2 -14
3 -1
4 17
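To avoid relying on strict alternation altogether, one hedged option is to pair the k-th 'A' with the k-th 'B' explicitly via cumcount and pivot (an illustrative sketch, not from the original answers):
# number the A and B occurrences, then pair them column-wise
pairs = (df[df['Code'].isin(['A', 'B'])]
         .assign(n=lambda d: d.groupby('Code').cumcount() + 1)
         .pivot(index='n', columns='Code', values='Time'))
diff = pairs['B'].sub(pairs['A']).to_frame('diff')   # indexed 1..4 like the expected output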

Pandas: Inventory recalculation given a final value

I'm coding a Python script to recalculate the inventory of a specific SKU from today back over the past 365 days, given the actual stock. For that I'm using a pandas DataFrame, as shown below:
Index DATE SUM_IN SUM_OUT
0 5/12/18 500 0
1 5/13/18 0 -403
2 5/14/18 0 -58
3 5/15/18 0 -39
4 5/16/18 100 0
5 5/17/18 0 -98
6 5/18/18 276 0
7 5/19/18 0 -139
8 5/20/18 0 -59
9 5/21/18 0 -70
The dataframe presents the summed quantities IN and OUT of the warehouse, grouped by date. My intention is to add a column named "STOCK" that presents the stock level of the SKU on each day. What I have is the actual stock level (index 9), so I need to recalculate the levels day by day back through the date series (from index 9 until index 0).
In Excel it's easy: I can put the actual level in the last row and just extend the calculation until I reach the row of index 0, as presented (Column E is the formula, Column G is the desired output).
Can someone help me achieve this result?
I already have the stock level of the last day (i.e. 5/21/2018 is equal to 10). What I need is to place the number 10 at index 9 and calculate the stock levels of the previous days, from index 8 down to index 0.
The desired output should be:
Index DATE TRANSACTION_IN TRANSACTION_OUT SUM_IN SUM_OUT STOCK
0 5/12/18 1 0 500 0 500
1 5/13/18 0 90 0 -403 97
2 5/14/18 0 11 0 -58 39
3 5/15/18 0 11 0 -39 0
4 5/16/18 1 0 100 0 100
5 5/17/18 0 17 0 -98 2
6 5/18/18 1 0 276 0 278
7 5/19/18 0 12 0 -139 139
8 5/20/18 0 4 0 -59 80
9 5/21/18 0 7 0 -70 10
(Updated)
last_stock = 10  # the known final stock; try other values as needed
a = (df.SUM_IN + df.SUM_OUT).cumsum()
df["STOCK"] = a - (a.iloc[-1] - last_stock)
Use cumsum to create the key for groupby, then use cumsum again:
df['SUM_IN'].replace(0,np.nan).ffill()+df.groupby(df['SUM_IN'].gt(0).cumsum()).SUM_OUT.cumsum()
Out[292]:
0 500.0
1 97.0
2 39.0
3 0.0
4 100.0
5 2.0
6 276.0
7 137.0
8 78.0
9 8.0
dtype: float64
Update
key = df['SUM_IN'].gt(0).cumsum()
naive = df['SUM_IN'].replace(0, np.nan).ffill() + df.groupby(key).SUM_OUT.cumsum()
s = naive - df.STOCK
naive - s.groupby(key).bfill().fillna(0)
Out[318]:
0 500.0
1 97.0
2 39.0
3 0.0
4 100.0
5 2.0
6 278.0
7 139.0
8 80.0
9 10.0
dtype: float64
