pandas DataFrame: applying a function to multiple columns - Python

product_code  order  each_price
TN45             10         500
BY11             20         360
AJ21              5         800
I need to create a new column based on order and each_price: if order >= 10, apply a 5% discount; if order >= 50, apply a 10% discount to the price. How can I apply a function to achieve this? The expected output is:
product_code  order  each_price  discounted_price
TN45             10         500              4500
BY11             20         360              6480
AJ21              5         800              4000
I tried to apply a function, e.g.
df['discount'] = df.apply(function, axis=1)
but an error appears:
"A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead"
Can anyone help? Thanks.

You could use nested numpy.where calls to achieve this. I've added an extra intermediate column to the results for the percentage discount, then used this column to calculate the final discounted price:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'product_code': ['TN45', 'BY11', 'AJ21'],
    'order': [10, 20, 5],
    'each_price': [500, 360, 800]
})
df['discount'] = np.where(
    df['order'] >= 50,
    0.1,
    np.where(
        df['order'] >= 10,
        0.05,
        0
    )
)
df['discounted_price'] = df['order'] * df['each_price'] * (1 - df['discount'])
Note that my results are slightly different from those in your expected output, but I believe they are correct based on the description of the discount conditions you gave:
  product_code  order  each_price  discount  discounted_price
0         TN45     10         500      0.05            4750.0
1         BY11     20         360      0.05            6840.0
2         AJ21      5         800      0.00            4000.0
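If you prefer to avoid the nesting, numpy.select takes a list of conditions and a list of choices and picks the first condition that matches. This is a sketch equivalent to the nested np.where above, assuming the same df:
# Conditions are checked in order, so test the larger threshold first
conditions = [df['order'] >= 50, df['order'] >= 10]
discounts = [0.10, 0.05]
df['discount'] = np.select(conditions, discounts, default=0)
df['discounted_price'] = df['order'] * df['each_price'] * (1 - df['discount'])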

As you mention, you were trying to use an apply function. I did the same and it works; I am not sure which part of the function was wrong in your case.
import pandas as pd

df = pd.DataFrame({
    'product_code': ['TN45', 'BY11', 'AJ21'],
    'order': [10, 20, 5],
    'each_price': [500, 360, 800]
})

# This is the apply function
def make_discount(row):
    total = row["order"] * row['each_price']
    # Check the larger threshold first; otherwise the 10% branch can never be reached
    if row["order"] >= 50:
        total = total - (total * 0.1)
    elif row["order"] >= 10:
        total = total - (total * 0.05)
    return total
df["discount_price"] = df.apply(make_discount, axis=1)
df
Output:
  product_code  order  each_price  discount_price
0         TN45     10         500          4750.0
1         BY11     20         360          6840.0
2         AJ21      5         800          4000.0
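On the warning quoted in the question: it usually appears when df was itself created by slicing another DataFrame rather than built from scratch. A minimal sketch of the common fix, assuming a hypothetical source frame big_df (the question does not show how df was created):
# Hypothetical: df came from filtering some larger frame called big_df
df = big_df[big_df['order'] > 0].copy()  # .copy() makes df an independent DataFrame
df['discount_price'] = df.apply(make_discount, axis=1)  # no SettingWithCopyWarning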

Related

Groupby Sum returns the wrong sum value as it has been multiplied in Pandas

Here's some sample code:
import pandas as pd

data = {'Date': ['10/10/21', '10/10/21', '13/10/21', '11/10/21', '11/10/21', '11/10/21', '11/10/21', '11/10/21', '13/10/21', '13/10/21', '13/10/21', '10/10/21', '10/10/21'],
        'ID': [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        'TotalTimeSpentInMinutes': [19, 6, 14, 17, 51, 53, 66, 19, 14, 28, 44, 22, 41],
        'Vehicle': ['V3', 'V1', 'V3', 'V1', 'V1', 'V1', 'V1', 'V1', 'V1', 'V1', 'V1', 'V1', 'V1']
        }
df = pd.DataFrame(data)

prices = {
    'V1': 9.99,
    'V2': 9.99,
    'V3': 14.00,
}
default_price = 9.99

df = df.sort_values('ID')
df['OrdersPD'] = df.groupby(['ID', 'Date', 'Vehicle'])['ID'].transform('count')
df['MinutesPD'] = df.groupby(['ID', 'Date', 'Vehicle'])['TotalTimeSpentInMinutes'].transform(sum)
df['HoursPD'] = df['MinutesPD'] / 60
df['Pay excl extra'] = df.apply(lambda x: prices.get(x['Vehicle'], default_price) * x['HoursPD'], axis=1).round(2)
extra = 1.20
df['Extra Pay'] = df.apply(lambda x: extra * x['OrdersPD'], axis=1)
df['Total_pay'] = df['Pay excl extra'] + df['Extra Pay'].round(2)
df['Total Pay PD'] = df.groupby(['ID'])['Total_pay'].transform(sum)
# Returns wrong sum
df['Total Courier Hours'] = df.groupby(['ID'])['HoursPD'].transform(sum)
# Returns wrong sum
df['ABS Final Pay'] = df.groupby(['ID'])['Total Pay PD'].transform(sum)
# Returns wrong sum
df.drop_duplicates((['ID', 'Date', 'Vehicle']), inplace=True)
print(df)
I'm trying to find the total sum per ID for 2 things: Hours and Pay.
Here's my code to find the total for hours and pay
Hours:
df['Total Courier Hours'] = df.groupby(['ID'])['HoursPD'].transform(sum)
#I've also tried with just .sum() but it returns an empty column
Pay:
df['ABS Final Pay'] = df.groupby(['ID'])['Total Pay PD'].transform(sum)
Example output for ID 1 (ABS Final Pay is the column in question):
Date      ID  Vehicle  OrdersPD  HoursPD  PayExclExtra  ExtraPay  Total_pay  Total Pay PD  Total Courier Hours  ABS Final Pay
10/10/21  1   V1       1         0.1      1             1.20      2.20       12.30         0.65                 36.90
10/10/21  1   V3       1         0.3166   4.43          1.20      5.63       12.30         0.65                 36.90
13/10/21  1   V3       1         0.2333   3.27          1.20      4.47       12.30         0.65                 36.90
The 2 columns Total Courier Hours and ABS Final Pay are wrong because right now the code calculates the total by doing this:
ABS Final Pay = Total Pay PD * OrdersPD per count of ID
Example: for 10/10/21 - it does 12.30 * 2 = 24.60
for 13/10/21 - it does 12.30 * 1 = 12.30
ABS Final Pay returns 36.90 when it should be 12.30 (7.83 + 4.47 from the 2 days)
Total Pay PD for ID 1 is also wrong, as it should show the sum of pay per date. Example of expected output:
Date      ID  Vehicle  OrdersPD  Total PD
10/10/21  1   V1       1         7.83
10/10/21  1   V3       1         7.83
13/10/21  1   V1       1         4.47
Total Courier Hours seems to be fine for ID 1, where the data is split into 3 rows with 1 order per row, but when a day has more than 1 order it calculates it wrong because it multiplies by the order count.
Example for ID 2 - Total Courier Hours
It calculates it doing this sum:
Total Courier Hours = HoursPD * OrdersPD per count of ID
Example: 11/10/21 - ID 2 had 5 orders, 2.85 * 5 = 14.25
13/10/21 - 3 orders, 2.01 * 3 = 6.03
10/10/21 - 2 orders, 1.05 * 2 = 2.1
Total Courier Hours returns 22.38 when it should be 5.91 (2.85 + 2.01 + 1.05 from the 3 days)
Sorry for the long post, I hope this makes sense and thanks in advance.
The drop_duplicates line may have been the issue. Once I removed this line:
df.drop_duplicates((['ID','Date','Vehicle']), inplace=True)
I was able to calculate the totals more accurately line by line, instead of having to do calculations on the columns within the code.
To separate it neatly, I printed the grouped columns to a separate Excel sheet.
Example:
per_courier = (
    df.groupby(['ID'])['Total_pay']
      .agg('sum')
)
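For reference, one way to avoid the inflated sums entirely is to aggregate down to one row per ID/Date/Vehicle first and only then sum per ID. A sketch using the df, prices and default_price from the question (the result column names here are just illustrative):
# Collapse the per-order rows to one row per courier/day/vehicle
daily = (
    df.groupby(['ID', 'Date', 'Vehicle'], as_index=False)
      .agg(OrdersPD=('ID', 'count'),
           MinutesPD=('TotalTimeSpentInMinutes', 'sum'))
)
daily['HoursPD'] = daily['MinutesPD'] / 60
daily['Pay excl extra'] = (daily['Vehicle'].map(prices).fillna(default_price) * daily['HoursPD']).round(2)
daily['Extra Pay'] = 1.20 * daily['OrdersPD']
daily['Total_pay'] = daily['Pay excl extra'] + daily['Extra Pay']
# Per-ID totals are now plain sums over unique day rows, so nothing is double counted
totals = daily.groupby('ID', as_index=False).agg(
    TotalCourierHours=('HoursPD', 'sum'),
    ABSFinalPay=('Total_pay', 'sum'),
)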

Concatenate arrays into a single table using pandas

I have a .csv file. From this file, I group the data by year so that it gives me the maximum, minimum, and average values:
import pandas as pd

DF = pd.read_csv("PJME_hourly.csv")
for i in range(2002, 2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW": ['max', 'min', 'mean']})
    print(i, pd.concat([dateframe], axis=0, sort=False))
Its output is as follows:
2002 PJME_MW
max 55934.000000
min 19247.000000
mean 31565.617106
2003 PJME_MW
max 53737.000000
min 19414.000000
mean 31698.758621
2004 PJME_MW
max 51962.000000
min 19543.000000
mean 32270.434867
I would like to know how I can join it all into a single column (PJME_MW), where each group of operations (max, min, mean) is identified by the year it corresponds to.
If you convert the dates to_datetime(), you can group them using the dt.year accessor:
df = pd.read_csv('PJME_hourly.csv')
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])
Toy example:
df = pd.DataFrame({'Datetime': ['2019-01-01','2019-02-01','2020-01-01','2020-02-01','2021-01-01'], 'PJME_MV': [3,5,30,50,100]})
# Datetime PJME_MV
# 0 2019-01-01 3
# 1 2019-02-01 5
# 2 2020-01-01 30
# 3 2020-02-01 50
# 4 2021-01-01 100
df.Datetime = pd.to_datetime(df.Datetime)
df.groupby(df.Datetime.dt.year).agg(['min', 'max', 'mean'])
# PJME_MV
# min max mean
# Datetime
# 2019 3 5 4
# 2020 30 50 40
# 2021 100 100 100
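If you specifically want a single column where each value is identified by its year and statistic (as the question asks), stacking the aggregated frame is one option; a sketch continuing the toy example above:
yearly = df.groupby(df.Datetime.dt.year)['PJME_MV'].agg(['min', 'max', 'mean'])
print(yearly.stack())
# Datetime
# 2019  min       3.0
#       max       5.0
#       mean      4.0
# 2020  min      30.0
# ...   (and so on for the remaining years)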
The code could be optimized, but as it works now, change this part of your code:
for i in range(2002, 2019):
    neblina = DF[DF.Datetime.str.contains(str(i))]
    dateframe = neblina.agg({"PJME_MW": ['max', 'min', 'mean']})
    print(i, pd.concat([dateframe], axis=0, sort=False))
Use this instead:
aggs = ['max', 'min', 'mean']
df_group = df.groupby('Datetime')['PJME_MW'].agg(aggs).reset_index()
out_columns = ['agg_year', 'PJME_MW']
out = []
for agg in aggs:
    # build a fresh frame on each pass; reusing a single frame would leave every
    # element of `out` pointing at the same (last) data
    aux = pd.DataFrame(columns=out_columns)
    aux['agg_year'] = agg + '_' + df_group['Datetime']
    aux['PJME_MW'] = df_group[agg]
    out.append(aux)
df_out = pd.concat(out)
Edit: Concatenation form has been changed
Final edit: I didn't understand the whole problem, sorry. You don't need the code after the groupby function.

making numerical categories of pandas data

I looked for a reference on how to make an extra column that is categorical, based on another column. I have already tried the pandas categorical documentation, and Stack Overflow does not seem to cover this, but I think it must, so maybe I am using the wrong search tags?
For example:
Size   Size_cat
10     0-50
50     0-50
150    50-500
450    50-500
5000   1000-9000
10000  >9000
Notice that the size category 500-1000 is missing (no number corresponds to it).
The problem is that I create a pandas crosstab later, like this:
summary_table = pd.crosstab(index=[res_sum["Type"], res_sum["Size"]], columns=[res_sum["Found"]], margins=True)
summary_table = summary_table.div(summary_table["All"] / 100, axis=0)
After some editing of this table I get this kind of result:
Found              Exact       Near         No
Type      Size
DEL       50          80         20          0
          100         60         40          0
          500         80         20          0
          1000        60         40          0
          5000        40         60          0
          10000       20         80          0
DEL_Total      56.666667  43.333333          0
DUP       50           0          0        100
          100          0          0        100
          500          0        100          0
          1000         0        100          0
          5000         0        100          0
          10000       20         80          0
DUP_Total       3.333333  63.333333  33.333333
The problem is that now (Size) just puts the raw sizes here, and therefore this table can vary in size. If 5000-DEL is missing in the data, that row will also disappear, and then DUP has 6 categories and DEL 5. Additionally, if I add more sizes, this table will become very large. So I wanted to make categories of the sizes, always retaining the same categories, even if some of them are empty.
I hope I am clear, because it is kinda hard to explain.
this is what I tried already:
highest_size = res['Size'].max()
categories = int(math.ceil(highest_size / 100.0) * 100.0)
categories = int(categories / 10)
labels = ["{0} - {1}".format(i, i + categories) for i in range(0, highest_size, categories)]
print(highest_size)
print(categories)
print(labels)
10000
1000
['0 - 1000', '1000 - 2000', '2000 - 3000', '3000 - 4000', '4000 - 5000', '5000 - 6000', '6000 - 7000', '7000 - 8000', '8000 - 9000', '9000 - 10000']
I get numeric categories, but of course now they depend on the highest number, and the categories change based on the data. Additionally, I still need to link them to the 'Size' column in pandas. This does not work:
df['group'] = pd.cut(df.value, range(0, highest_size), right=False, labels=labels)
If possible, I would like to make my own categories instead of using range to get equal steps, like in the first example above (otherwise it takes way too long to get to 10000 with steps of 100, and taking steps of 1000 will lose a lot of detail in the smaller regions).
See the mock-up below to help you get the logic. Basically, you bin the Score into custom groups by using cut (or even a lambda or map) and passing the value to the function GroupMapping. Let me know if it works.
import pandas as pd

df = pd.DataFrame({
    'Name': ['Harry', 'Sally', 'Mary', 'John', 'Francis', 'Devon', 'James', 'Holly', 'Molly', 'Nancy', 'Ben'],
    'Score': [1143, 2040, 2500, 3300, 3143, 2330, 2670, 2140, 2890, 3493, 1723]
})

def GroupMapping(dl):
    if int(dl) <= 1000:
        return '0-1000'
    elif 1000 < dl <= 2000:
        return '1000 - 2000'
    elif 2000 < dl <= 3000:
        return '2000 - 3000'
    elif 3000 < dl <= 4000:
        return '3000 - 4000'
    else:
        return 'None'

# df["Group"] = df['Score'].map(GroupMapping)
# df["Group"] = df['Score'].apply(lambda row: GroupMapping(row))
df['Group'] = pd.cut(df['Score'], [0, 1000, 2000, 3000, 4000], labels=['0-1000', '1000 - 2000', '2000 - 3000', '3000 - 4000'])
df
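To keep the exact, fixed categories from your first example (so empty bins such as 500-1000 never disappear), you can pass your own bin edges to pd.cut. A small sketch with made-up data standing in for res_sum:
import pandas as pd

bins = [0, 50, 500, 1000, 9000, float('inf')]
labels = ['0-50', '50-500', '500-1000', '1000-9000', '>9000']
res_sum = pd.DataFrame({'Size': [10, 50, 150, 450, 5000, 10000]})  # stand-in data
res_sum['Size_cat'] = pd.cut(res_sum['Size'], bins=bins, labels=labels)
# Because the result is categorical, unused bins stay defined as categories,
# so '500-1000' still shows up (with a count of 0) in summaries
print(res_sum['Size_cat'].value_counts(sort=False))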

if negative then with weighted average

I have a DataFrame:
import numpy as np
import pandas as pd

a = {'Price': [10, 15, 20, 25, 30], 'Total': [10000, 12000, 15000, 14000, 10000],
     'Previous Quarter': [0, 10000, 12000, 15000, 14000]}
a = pd.DataFrame(a)
print(a)
With this raw data, I have added a number of additional columns, including a weighted average price (WAP):
a['Change'] = a['Total'] - a['Previous Quarter']
a['Amount'] = a['Price']*a['Change']
a['Cum Sum Amount'] = np.cumsum(a['Amount'])
a['WAP'] = a['Cum Sum Amount'] / a['Total']
This is fine; however, as the total starts to decrease, it brings down the weighted average price.
My question is: if Total decreases, how would I get WAP to reflect the row above? For instance, in row 3, Total is 14,000, which is lower than in row 2. This brings WAP down from 12.6 to 11.78, but I would like it to say 12.6 instead of 11.78.
I have tried looping through a['Total'] < 0 and then setting a['WAP'] = 0, but this impacts the whole column.
Ultimately I am looking for a WAP column which reads:
10, 10.83, 12.6, 12.6, 12.6
You could use cummax:
a['WAP'] = (a['Cum Sum Amount'] / a['Total']).cummax()
print (a['WAP'])
0 10.000000
1 10.833333
2 12.666667
3 12.666667
4 12.666667
Name: WAP, dtype: float64
As a total Python beginner, here are two options I could think of
Either
a['WAP'] = np.maximum.accumulate(a['Cum Sum Amount'] / a['Total'])
Or, after you've already created WAP, you could modify only the subset using the diff method (thanks to @ayhan for the loc, which will modify a in place):
a.loc[a['WAP'].diff() < 0, 'WAP'] = max(a['WAP'])

How to set a minimum value when performing cumsum on a dataframe column (physical inventory cannot go below 0)

How to perform a cumulative sum with a minimum value in python/pandas?
In the table below:
the "change in inventory" column reflects the daily sales/new stock purchases.
data entry/human errors mean that applying cumsum shows a negative inventory level of -5 which is not physically possible.
as shown by the "inventory" column, the data entry errors continue to be a problem at the end (100 vs 95).
DataFrame:
            change in inventory  inventory  cumsum
2015-01-01                  100        100     100
2015-01-02                  -20         80      80
2015-01-03                  -30         50      50
2015-01-04                  -40         10      10
2015-01-05                  -15          0      -5
2015-01-06                  100        100      95
One way to achieve this would be to use loops; however, that would be messy, and there is probably a more efficient way to do this.
Here is the code to generate the dataframe:
import pandas as pd

df = pd.DataFrame.from_dict({
    'change in inventory': {'2015-01-01': 100,
                            '2015-01-02': -20,
                            '2015-01-03': -30,
                            '2015-01-04': -40,
                            '2015-01-05': -15,
                            '2015-01-06': 100},
    'inventory': {'2015-01-01': 100,
                  '2015-01-02': 80,
                  '2015-01-03': 50,
                  '2015-01-04': 10,
                  '2015-01-05': 0,
                  '2015-01-06': 100}
})
df['cumsum'] = df['change in inventory'].cumsum()
df
How to apply a cumulative sum with a minimum value in python/pandas to produce the values shown in the "inventory" column?
Depending on the data, it can be far more efficient to loop over blocks with the same sign, e.g. with large running sub-blocks that are all positive or all negative. You only have to be careful when going back to positive values after a run of negative values.
With minS as the minimum limiting value, summing over vector:
import numpy as np

# index of the last element of each negative run (the sign flips back up right after),
# plus the end of the vector
i_sign = np.append(np.where(np.diff(np.sign(vector)) > 0)[0], [len(vector)])
i0 = 1
csum = np.maximum(minS, vector[:1])
for i1 in i_sign:
    # clip the running total of each block at the minimum value
    tmp_csum = np.maximum(minS, csum[-1] + np.cumsum(vector[i0:i1 + 1]))
    csum = np.append(csum, tmp_csum)
    i0 = i1 + 1  # the next block starts after the current one
Final output in csum.
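A usage sketch, assuming the block above is fed the question's DataFrame with minS = 0:
vector = df['change in inventory'].to_numpy()
minS = 0
# ... run the block above ...
# csum should end up as [100, 80, 50, 10, 0, 100], matching the "inventory" column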
You can use looping, unfortunately:
lastvalue = 0
newcum = []
for row in df['change in inventory']:
    thisvalue = row + lastvalue
    if thisvalue < 0:
        thisvalue = 0
    newcum.append(thisvalue)
    lastvalue = thisvalue
print(pd.Series(newcum, index=df.index))
2015-01-01 100
2015-01-02 80
2015-01-03 50
2015-01-04 10
2015-01-05 0
2015-01-06 100
dtype: int64
A very ugly solution:
start = df.index[0]
df['cumsum'] = [max(df['change in inventory'].loc[start:end].sum(), 0)
for end in df.index]
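Another option (a sketch, not from the answers above): itertools.accumulate with a custom function keeps the running total clipped at the minimum without spelling out the loop. Note that accumulate passes the very first value through unclipped, which is fine here because the series starts with a positive stock purchase:
from itertools import accumulate

df['inventory_calc'] = list(
    accumulate(df['change in inventory'],
               lambda total, change: max(total + change, 0))
)
# gives 100, 80, 50, 10, 0, 100 for the example data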
