if negative then with weighted average - python

I have a DataFrame:
import pandas as pd
import numpy as np

a = {'Price': [10, 15, 20, 25, 30], 'Total': [10000, 12000, 15000, 14000, 10000],
     'Previous Quarter': [0, 10000, 12000, 15000, 14000]}
a = pd.DataFrame(a)
print (a)
With this raw data, I have added a number of additional columns, including a weighted average price (WAP):
a['Change'] = a['Total'] - a['Previous Quarter']
a['Amount'] = a['Price']*a['Change']
a['Cum Sum Amount'] = np.cumsum(a['Amount'])
a['WAP'] = a['Cum Sum Amount'] / a['Total']
This is fine; however, as Total starts to decrease it drags the weighted average price down with it.
My question is: if Total decreases, how would I get WAP to keep the value from the row above? For instance, in row 3 Total is 14000, which is lower than in row 2. This brings WAP down from 12.6 to 11.78, but I would like it to stay at 12.6 instead of dropping to 11.78.
I have tried setting a['WAP'] = 0 in a loop wherever a['Total'] decreases, but this impacts the whole column.
Ultimately I am looking for a WAP column which reads:
10, 10.83, 12.6, 12.6, 12.6

You could use cummax:
a['WAP'] = (a['Cum Sum Amount'] / a['Total']).cummax()
print (a['WAP'])
0 10.000000
1 10.833333
2 12.666667
3 12.666667
4 12.666667
Name: WAP, dtype: float64

As a total Python beginner, here are two options I could think of.
Either
a['WAP'] = np.maximum.accumulate(a['Cum Sum Amount'] / a['Total'])
Or, after you've already created WAP, you could modify only the affected subset using the diff method (thanks to @ayhan for the loc, which modifies a in place):
a.loc[a['WAP'].diff() < 0, 'WAP'] = max(a['WAP'])
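
If the rule you want is literally "when Total decreases, repeat the row above" rather than "take the running maximum", a small sketch of that idea (reusing the Cum Sum Amount column built in the question) is to blank out the rows where Total drops and forward-fill:

wap = a['Cum Sum Amount'] / a['Total']
# keep WAP only where Total did not decrease, then carry the last kept value forward
a['WAP'] = wap.where(a['Total'].diff().fillna(0) >= 0).ffill()

For this data it gives the same result as cummax; the two only differ if Total later rises again while the ratio is still below its previous peak.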

Related

Python - add column and calculate value based on condition

I have a dataset that looks as follows:
import numpy as np
import pandas as pd

data = {'Year': [2012, 2013, 2012, 2013, 2014, 2013],
        'Quarter': [2, 2, 2, 2, 3, 1],
        'ID': ['CH7744', 'US4652', 'CA47441', 'CH1147', 'DE7487', 'US5174'],
        'MC': [3348.22, 8542.55, 11851.2, 15718.1, 29914.7, 8731.78],
        'PB': [2.74, 0.95, 1.57, 2.13, 0.54, 5.32]}
df = pd.DataFrame(data)
Now what I aim to do is to add a new column "SMB" and calculate it as follows:
Subset the data based on year and quarter, e.g. get all rows where year = 2012 and quarter = 2
Sort the subset by column MC and split it by size into small and big at the 0.5 quantile
If the value in MC is lower than the 0.5 quantile, put "Small" in the newly created column "SMB"; if it is higher than the 0.5 quantile, put "Big"
Repeat the process for all rows where quarter = 2
For all other rows add np.nan
So the output should look like this:
data = {'Year': [2012, 2013, 2012, 2013, 2014, 2013],
        'Quarter': [2, 2, 2, 2, 3, 1],
        'ID': ['CH7744', 'US4652', 'CA47441', 'CH1147', 'DE7487', 'US5174'],
        'MC': [3348.22, 8542.55, 11851.2, 15718.1, 29914.7, 8731.78],
        'PB': [2.74, 0.95, 1.57, 2.13, 0.54, 5.32],
        'SMB': ['Small', 'Small', 'Big', 'Big', np.nan, np.nan]}
df = pd.DataFrame(data)
I tried to create a loop, but I was unable to properly merge the result back into the previous DataFrame, and I need the rows from the other quarters for further calculations. Using the code below I sort of achieved what I wanted, but I had to merge the data back into the original dataset.
I'm sure there is a much nicer way to achieve this.
# Quantile 0.5 for MC sorting (small & big)
smbQuantile = 0.5
Years = df['Year'].unique()
dataframes_list = []

# Calculate Small and Big and merge back into the DataFrame
for i in Years:
    df_temp = df.loc[(df['Year'] == i) & (df['Quarter'] == 2)].copy()
    df_temp['SMB'] = ''
    # Assign factor size based on market cap
    df_temp.loc[df_temp['MC'] <= df_temp['MC'].quantile(smbQuantile), 'SMB'] = 'Small'
    df_temp.loc[df_temp['MC'] >= df_temp['MC'].quantile(smbQuantile), 'SMB'] = 'Big'
    dataframes_list.append(df_temp)

df = pd.concat(dataframes_list)
You can use groupby.rank and groupby.transform('size') combined with numpy.select:
g = df.groupby(['Year', 'Quarter'])['MC']

df['SMB'] = np.select([g.rank(pct=True).le(0.5),
                       g.transform('size').ge(2)],
                      ['Small', 'Big'], np.nan)
output:
   Year  Quarter       ID        MC    PB    SMB
0  2012        2   CH7744   3348.22  2.74  Small
1  2013        2   US4652   8542.55  0.95  Small
2  2012        2  CA47441  11851.20  1.57    Big
3  2013        2   CH1147  15718.10  2.13    Big
4  2014        3   DE7487  29914.70  0.54    nan
5  2013        1   US5174   8731.78  5.32    nan
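
If the np.select call feels opaque, a more verbose sketch of the same idea (column names taken from the question; rows outside quarter 2 are simply left empty) is to restrict to quarter 2, compute the per-year median of MC with groupby.transform, and write the labels back with .loc:

q2 = df['Quarter'].eq(2)
# per-year median MC among the quarter-2 rows only
med = df.loc[q2].groupby('Year')['MC'].transform('median')
df['SMB'] = None                       # rows outside quarter 2 stay empty
df.loc[q2, 'SMB'] = np.where(df.loc[q2, 'MC'] <= med, 'Small', 'Big')

This yields Small/Small/Big/Big for the quarter-2 rows and leaves the 2014 Q3 and 2013 Q1 rows empty, matching the expected output (up to NaN vs None).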

Creating a new column based on if-elif-else condition with np.where in python

I have a DataFrame df called "merged":

   Quantity Total Price  Rate  Rate1
0               2000000    15   14.5
I want to create a new column based on the following criteria:
1) if 0 < Quantity Total Price <= 50000000: the result column equals Rate
2) if 50000000 < Quantity Total Price <= 500000000: the result column is merged['Rate']*0.9 + merged['Rate1']*0.1
3) if 500000000 < Quantity Total Price <= 2000000000: the result column is merged['Rate']*0.8 + merged['Rate1']*0.2
4) if 2000000000 < Quantity Total Price <= 4000000000: the result column is merged['Rate']*0.5 + merged['Rate1']*0.5
5) if 4000000000 < Quantity Total Price <= 6000000000: the result column is merged['Rate']*0.25 + merged['Rate1']*0.75
6) if Quantity Total Price > 6000000000: the result column stays equal to Rate1
My expected output (one example per condition):

Quantity Total Price   Result
            20000000   15
           100000000   14.95
           700000000   14.9
          3000000000   14.75
          5000000000   14.625
          7000000000   14.5
For typical if/else cases I use np.where, but here I get an error like ValueError: Length of values (5) does not match length of index (1).
My code:
merged['Rate1'] = np.where(
    [merged['Quantity Total First'] <= 500000000,
     (merged["Quantity Total First"] >= 50000000) & (merged["Quantity Total First"] <= 500000000),
     (merged["Quantity Total First"] >= 500000000) & (merged["Quantity Total First"] <= 2000000000),
     (merged["Quantity Total First"] >= 2000000000) & (merged["Quantity Total First"] <= 4000000000),
     (merged["Quantity Total First"] >= 4000000000) & (merged["Quantity Total First"] <= 6000000000),
     ],
    [merged['Rate'],
     merged['Rate']*0.9 + merged['Rate1']*0.1,
     merged['Rate']*0.8 + merged['Rate1']*0.2,
     merged['Rate']*0.5 + merged['Rate1']*0.5,
     merged['Rate']*0.25 + merged['Rate1']*0.75
     ],
    data_state2['Rate1']
)
Can you please help me? Feel free to rewrite the code from scratch. Thanks.
You can use pandas.cut to map your values.
NB. for clarity, I divided all values by 1 million.
# list of bins used to categorize the values
bins = [0, 50, 500, 2000, 4000, 6000, float('inf')]
# matching factors to map ]0-50] -> 1 ; ]50-500] -> 0.9, etc.
factors = [1, 0.9, 0.8, 0.5, 0.25, 0]
# get the factors and convert to float
f = pd.cut(df['Quantity Total Price'], bins=bins, labels=factors).astype(float)
# use the factors in numerical operation
df['Result'] = df['Rate']*f+df['Rate1']*(1-f)
output:
   Quantity Total Price  Rate  Rate1  Result
0                    20    15   14.5  15.000
1                   100    15   14.5  14.950
2                   700    15   14.5  14.900
3                  3000    15   14.5  14.750
4                  5000    15   14.5  14.625
5                  7000    15   14.5  14.500
used input:
df = pd.DataFrame({'Quantity Total Price': [20, 100, 700, 3000, 5000, 7000],
                   'Rate': [15]*6, 'Rate1': [14.5]*6})
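
For completeness, since the question started from np.where: a similar result can be reached with numpy.select, which takes a list of conditions and a list of choices and picks the first condition that matches. A minimal sketch on the same scaled-down input as above (with the original data the thresholds would be 50000000, 500000000, and so on):

import numpy as np

q = df['Quantity Total Price']
# one weight per tier: Result = Rate*w + Rate1*(1 - w); the last tier (> 6000) keeps Rate1
conds = [q.le(50), q.le(500), q.le(2000), q.le(4000), q.le(6000), q.gt(6000)]
weights = [1, 0.9, 0.8, 0.5, 0.25, 0]
df['Result'] = np.select(conds, [df['Rate']*w + df['Rate1']*(1 - w) for w in weights])

Because np.select evaluates the conditions in order, the overlapping le() tests are fine: a value of 100 is caught by le(500) before it can reach the later tiers.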

Pandas data frame: applying a function over multiple columns

product_code  order  eachprice
TN45             10        500
BY11             20        360
AJ21              5        800
and I need to create a new column based on order and each price: if order >= 10 there is a 5% discount, and if order >= 50 there is a 10% discount on the price. How can I apply a function to achieve this:
product_code  order  each_price  discounted_price
TN45             10         500              4500
BY11             20         360              6480
AJ21              5         800              4000
I tried to apply a function, e.g.
df['discount'] = df.apply(function, axis=1)
but an error prompts:
"A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead"
Can anyone help? Thanks
You could use nested numpy.where calls to achieve this. I've added an extra intermediate column to the results for the percentage discount, then used this column to calculate the final discounted price:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'product_code': ['TN45', 'BY11', 'AJ21'],
    'order': [10, 20, 5],
    'each_price': [500, 360, 800]
})

df['discount'] = np.where(
    df['order'] >= 50,
    0.1,
    np.where(
        df['order'] >= 10,
        0.05,
        0
    )
)

df['discounted_price'] = df['order'] * df['each_price'] * (1 - df['discount'])
Note that my results are slightly different from those in your expected output, but I believe they are correct based on the description of the discount conditions you gave:
  product_code  order  each_price  discount  discounted_price
0         TN45     10         500      0.05            4750.0
1         BY11     20         360      0.05            6840.0
2         AJ21      5         800      0.00            4000.0
As you mention, you were trying to use an apply function. I did the same and it works; I am not sure which part of the function was wrong in your case.
import pandas as pd
df = pd.DataFrame({
    'product_code': ['TN45', 'BY11', 'AJ21'],
    'order': [10, 20, 5],
    'each_price': [500, 360, 800]
})
# This is the apply function; check the larger threshold first so orders of 50+ get the 10% discount
def make_discount(row):
    total = row['order'] * row['each_price']
    if row['order'] >= 50:
        total = total - (total * 0.1)
    elif row['order'] >= 10:
        total = total - (total * 0.05)
    return total
df["discount_price"] = df.apply(make_discount, axis=1)
df
Output:
  product_code  order  each_price  discount_price
0         TN45     10         500          4750.0
1         BY11     20         360          6840.0
2         AJ21      5         800          4000.0
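
A third option, sketched here just to show the vectorised equivalent of the if/elif above (same column names, with the >= 50 tier checked before the >= 10 tier), is numpy.select:

import numpy as np

df['discount'] = np.select([df['order'] >= 50, df['order'] >= 10], [0.10, 0.05], default=0.0)
df['discounted_price'] = df['order'] * df['each_price'] * (1 - df['discount'])

np.select takes the first condition that matches, so an order of 60 gets the 10% discount rather than the 5% one.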

Groupby Sum returns the wrong sum value as it has been multiplied in Pandas

Here's a sample code:
import pandas as pd
data = {'Date': ['10/10/21', '10/10/21', '13/10/21', '11/10/21', '11/10/21', '11/10/21', '11/10/21',
                 '11/10/21', '13/10/21', '13/10/21', '13/10/21', '10/10/21', '10/10/21'],
        'ID': [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        'TotalTimeSpentInMinutes': [19, 6, 14, 17, 51, 53, 66, 19, 14, 28, 44, 22, 41],
        'Vehicle': ['V3', 'V1', 'V3', 'V1', 'V1', 'V1', 'V1', 'V1', 'V1', 'V1', 'V1', 'V1', 'V1']
        }
df = pd.DataFrame(data)

prices = {
    'V1': 9.99,
    'V2': 9.99,
    'V3': 14.00,
}
default_price = 9.99
df = df.sort_values('ID')
df['OrdersPD'] = df.groupby(['ID', 'Date', 'Vehicle'])['ID'].transform('count')
df['MinutesPD'] = df.groupby(['ID', 'Date', 'Vehicle'])['TotalTimeSpentInMinutes'].transform(sum)
df['HoursPD'] = df['MinutesPD'] / 60
df['Pay excl extra'] = df.apply(lambda x: prices.get(x['Vehicle'], default_price) * x['HoursPD'], axis=1).round(2)
extra = 1.20
df['Extra Pay'] = df.apply(lambda x: extra*x['OrdersPD'], axis=1)
df['Total_pay'] = df['Pay excl extra'] + df['Extra Pay'].round(2)
df['Total Pay PD'] = df.groupby(['ID'])['Total_pay'].transform(sum)
#Returns wrong sum
df['Total Courier Hours'] = df.groupby(['ID'])['HoursPD'].transform(sum)
#Returns wrong sum
df['ABS Final Pay'] = df.groupby(['ID'])['Total Pay PD'].transform(sum)
#Returns wrong sum
df.drop_duplicates((['ID','Date','Vehicle']), inplace=True)
print(df)
I'm trying to find the total sum per ID for 2 things: Hours and Pay.
Here's my code to find the total for hours and pay
Hours:
df['Total Courier Hours'] = df.groupby(['ID'])['HoursPD'].transform(sum)
#I've also tried with just .sum() but it returns an empty column
Pay:
df['ABS Final Pay'] = df.groupby(['ID'])['Total Pay PD'].transform(sum)
Output example for ID 1 (note the ABS Final Pay column):

Date      ID  Vehicle  OrdersPD  HoursPD  PayExclExtra  ExtraPay
10/10/21   1  V1              1   0.1000          1.00      1.20
10/10/21   1  V3              1   0.3166          4.43      1.20
13/10/21   1  V3              1   0.2333          3.27      1.20

Total_pay  Total Pay PD  Total Courier Hours  ABS Final Pay
     2.20         12.30                 0.65          36.90
     5.63         12.30                 0.65          36.90
     4.47         12.30                 0.65          36.90
The 2 columns Total Courier Hours and ABS Final Pay are wrong because right now the code calculates the total by doing this:
ABS Final Pay = Total Pay PD * OrdersPD per count of ID
Example: for 10/10/21 - it does 12.30 * 2 = 24.60
for 13/10/21 - it does 12.30 * 1 = 12.30
ABS Final Pay returns 36.90 when it should be 12.30 (7.83 + 4.47 from the 2 days)
Total Pay PD for ID 1 is also wrong as it should show the sum of pay per date, example of expected output:
Date      ID  Vehicle  OrdersPD  Total PD
10/10/21   1  V1              1      7.83
10/10/21   1  V3              1      7.83
13/10/21   1  V1              1      4.47
Total Courier Hours seems fine for ID 1, where the data is split into 3 rows with 1 order per row, but when a row covers more than 1 order the value is wrong because it gets multiplied.
Example for ID 2 - Total Courier Hours
It calculates it like this:
Total Courier Hours = HoursPD * OrdersPD per count of ID
Example: 11/10/21 - ID 2 had 5 orders, 2.85 * 5 = 14.25
13/10/21 - 3 orders, 2.01 * 3 = 6.03
10/10/21 - 2 orders, 1.05 * 2 = 2.1
Total Courier Hours returns 22.38 when it should be 5.91 (2.85 + 2.01 + 1.05 from the 3 days)
Sorry for the long post, I hope this makes sense and thanks in advance.
The drop_duplicates line may have been the issue. Once I removed the code:
df.drop_duplicates((['ID','Date','Vehicle']), inplace=True)
I was able to calculate the totals accurately row by row, instead of having to correct the column calculations within the code.
To keep things separate, I wrote the groupby results out to a different Excel sheet.
Example:
per_courier = (
    df.groupby(['ID'])['Total Pay']
      .agg(sum)
)
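
The underlying problem is that the per-day values (HoursPD, Total_pay, etc.) are repeated on every order row, so summing them per ID over the raw rows multiplies each day by its order count. A minimal sketch of the alternative (column names as created in the code above): collapse to one row per (ID, Date, Vehicle) first, and only then sum per ID:

per_day = df.drop_duplicates(['ID', 'Date', 'Vehicle'])
id_totals = per_day.groupby('ID').agg(
    total_courier_hours=('HoursPD', 'sum'),
    abs_final_pay=('Total_pay', 'sum'),
)

For ID 1 this gives 0.65 hours and 12.30 pay, the values the question expected.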

faster way to calculate the weighted mean based on rolling offset

I have an example df like df = pd.DataFrame({'price': [100, 101, 99, 95, 97, 88], 'qty': [12, 5, 1, 3, 1, 3]}). I want to calculate the qty-weighted average price (sum of price * qty divided by sum of qty) over the trailing 5 units of qty, and the desired output is 100, 101, 100.6, 97, 96.2, 91.2.
Unfortunately I don't currently have a good way to calculate this. I have a slow approach that gets close: I calculate the cumulative sum of qty and then use df.qty_cumsum[(df.qty_cumsum <= x.qty_cumsum - 5)].argmax(), which returns the position where the cumulative qty falls 5 below the current row, and I can then use this to compute the weighted average in a second step.
Thanks
One option is to repeat price qty times, take a row-based rolling mean, then group by the original index and take the last value per row:
np.repeat(df['price'], df['qty']).rolling(5).mean().groupby(level=0).last()
Output:
0 100.0
1 101.0
2 100.6
3 97.0
4 96.2
5 91.2
Name: price, dtype: float64
P.S. And if you have large qty values, it would also probably make sense to make it more efficient by clipping qty to 5 (since there is no difference if it's 5 or 12, for example):
np.repeat(df['price'], np.clip(df['qty'], 0, 5)
          ).rolling(5).mean().groupby(level=0).last()
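
In case it helps to see why this works: np.repeat expands each row into one entry per unit of qty while keeping the original index, rolling(5).mean() then averages the last 5 units, and groupby(level=0).last() keeps the value at each row's final unit. A self-contained check of the clipped version against the desired output from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [100, 101, 99, 95, 97, 88], 'qty': [12, 5, 1, 3, 1, 3]})
units = np.repeat(df['price'], np.clip(df['qty'], 0, 5))   # one entry per unit, capped at the window size
df['wavg5'] = units.rolling(5).mean().groupby(level=0).last()
print(df['wavg5'].tolist())   # [100.0, 101.0, 100.6, 97.0, 96.2, 91.2]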
