Cumulatively subtracting a pandas groupby column from a variable - python

Hi, I have a dataframe that lists items I own, along with their Selling Price.
I also have a variable that defines my current debt. Example:
import pandas as pd

current_debt = 16000
d = {
    'Person': ['John', 'John', 'John', 'John', 'John'],
    'Item': ['Car', 'Bike', 'Computer', 'Phone', 'TV'],
    'Price': [10500, 3300, 2100, 1100, 800],
}
df = pd.DataFrame(data=d)
df
I would like to "pay back" the current_debt starting with the most expensive item and continuing until the debt is paid, and to list the leftover money aligned to the last item sold. I'm hoping the function can include a groupby clause for Person, as sometimes there is more than one name in the list.
My expected output for the debt in the example above would be:
If anyone could help with a function to calculate this, that would be fantastic. I wasn't sure whether I needed to convert the dataframe to a list or whether it could be kept as a dataframe. Thanks very much!

Using a cumsum transformation and np.where to cover your logic for the final price column:
import numpy as np

df = df.sort_values(["Person", "Price"], ascending=False)
df['CumPrice'] = df.groupby("Person")['Price'].transform('cumsum')
df['Diff'] = df['CumPrice'] - current_debt
df['PriceLeft'] = np.where(
    df['Diff'] <= 0,
    0,
    np.where(
        df['Diff'] < df['Price'],
        df['Diff'],
        df['Price']
    )
)
Result:
Person Item Price CumPrice Diff PriceLeft
0 John Car 10500 10500 -5500 0
1 John Bike 3300 13800 -2200 0
2 John Computer 2100 15900 -100 0
3 John Phone 1100 17000 1000 1000
4 John TV 800 17800 1800 800
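To confirm the groupby handles more than one name, here is a quick sketch of the same code on a second person (the name 'Jane' and her items are made up for illustration; each person's cumulative total is compared against the same current_debt):

d2 = {
    'Person': ['John', 'John', 'Jane', 'Jane'],
    'Item': ['Car', 'Bike', 'Boat', 'TV'],
    'Price': [10500, 3300, 17000, 800],
}
df2 = pd.DataFrame(data=d2).sort_values(["Person", "Price"], ascending=False)
df2['CumPrice'] = df2.groupby("Person")['Price'].transform('cumsum')
df2['Diff'] = df2['CumPrice'] - current_debt
df2['PriceLeft'] = np.where(
    df2['Diff'] <= 0,
    0,
    np.where(df2['Diff'] < df2['Price'], df2['Diff'], df2['Price'])
)

Each person's cumulative sum restarts, so Jane's Boat alone leaves her 1000 over the debt, while John's items never clear it and all show 0.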

Related

How to efficiently compare every row to all other rows in a Pandas Dataframe based on different conditions?

I have a Python Pandas dataframe which consists of different columns:
import pandas as pd
import numpy as np

dict = {'Payee Name': ["John", "John", "John", "Sam", "Sam"],
        'Amount': [100, 30, 95, 30, 30],
        'Payment Method': ['Cheque', 'Electronic', 'Electronic', 'Cheque', 'Electronic'],
        'Payment Reference Number': [1, 2, 3, 4, 5],
        'Payment Date': ['1/1/2022', '1/2/2022', '1/3/2022', '1/4/2022', '1/5/2022']
        }
df = pd.DataFrame(dict)
df['Payment Date'] = pd.to_datetime(df['Payment Date'], format='%d/%m/%Y')
[Payee Name] - the name of the payee
[Amount] - the payment amount
[Payment Method] - either "Cheque" or "Electronic"
[Payment Reference Number] - the payment number
[Payment Date] - the date when the payment is made
Each row of the above represents one single payment entry.
The dataframe looks like this:
Payee Name Amount Payment Method Payment Reference Number Payment Date
0 John 100 Cheque 1 2022-01-01
1 John 30 Electronic 2 2022-02-01
2 John 95 Electronic 3 2022-03-01
3 Sam 30 Cheque 4 2022-04-01
4 Sam 30 Electronic 5 2022-05-01
I want to create a report that identifies any payments with the same/similar payment amounts (+/- 10%) that were paid to the same person under different payment methods. To do so, I compare every row with all other rows in the same dataframe, using the following conditions.
Conditions when comparing two rows:
Same payee
Different payment methods
The payment amount is the same or within a difference of 10%
If the above conditions are all true, then the [Check] column will have the below message.
"Yes - same amount" - if the payment amount is the same.
"Yes - within 10%" - if the difference in the payment amounts is 10% or less.
I have written the code below. It works, but performance is slow because of the iteration over the Pandas DataFrame: it took about 7 minutes to process 1,300 rows, and my real-life data file has about 200,000 rows. So, may I ask if there are other methods that can obtain the same result but run faster?
df['Check'] = 0
limit = 0.1  # to set the threshold for the payment difference

for i in df.index:
    for j in df.index:
        if df['Amount'].iloc[i] == df['Amount'].iloc[j] and df['Payee Name'].iloc[i] == df['Payee Name'].iloc[j] and df['Payment Method'].iloc[i] != df['Payment Method'].iloc[j] and i != j:
            df['Check'].iloc[i] = "Yes - same amount"
            break
        else:
            change = df['Amount'].iloc[j] / df['Amount'].iloc[i] - 1
            if change > -limit and change < limit and df['Payee Name'].iloc[i] == df['Payee Name'].iloc[j] and df['Payment Method'].iloc[i] != df['Payment Method'].iloc[j] and i != j:
                df['Check'].iloc[i] = "Yes - within 10%"
                break
After running the code, the result is as follows:
Payee Name Amount Payment Method Payment Reference Number Payment Date Check
0 John 100 Cheque 1 2022-01-01 Yes - within 10%
1 John 30 Electronic 2 2022-02-01 0
2 John 95 Electronic 3 2022-03-01 Yes - within 10%
3 Sam 30 Cheque 4 2022-04-01 Yes - same amount
4 Sam 30 Electronic 5 2022-05-01 Yes - same amount
I'd much appreciate any advice.
I suggest you try using groupby on 'Payee Name' to break your dataframe into smaller pieces, then run the inefficient code on each piece individually (see split-apply-combine for a discussion of this approach). With luck you will see sufficient improvement to call it a day and move on to your next project.
## Original code minus payee logic and unnecessary index check.
def find_double_dippers(df):
    s_judgement = pd.Series(0, df.index)
    limit = 0.1
    for i in df.index:
        for j in df.index:
            if df['Amount'].loc[i] == df['Amount'].loc[j] and df['Payment Method'].loc[i] != df['Payment Method'].loc[j]:
                s_judgement.loc[i] = "Yes - same amount"
                break
            else:
                change = df['Amount'].loc[j] / df['Amount'].loc[i] - 1
                if change > -limit and change < limit and df['Payment Method'].loc[i] != df['Payment Method'].loc[j]:
                    s_judgement.loc[i] = "Yes - within 10%"
                    break
    return s_judgement  # for combine portion of split-apply-combine

df['Check'] = df.groupby('Payee Name', group_keys=False).apply(find_double_dippers)
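If the groups themselves are large, a fully vectorized variant is also possible. This is my own sketch, not the answer above: it self-merges the frame on 'Payee Name' to build every payment pair at once, which trades the Python loops for memory (the pair table grows quadratically with payments per payee), and it gives the exact-match label priority over the within-10% label.

limit = 0.1  # 10% threshold, as in the question
pairs = df.merge(df, on='Payee Name', suffixes=('', '_other'))
# keep only pairs of distinct payments made via different methods
pairs = pairs[
    (pairs['Payment Reference Number'] != pairs['Payment Reference Number_other'])
    & (pairs['Payment Method'] != pairs['Payment Method_other'])
]
change = pairs['Amount_other'] / pairs['Amount'] - 1
within = pairs.loc[change.abs() < limit, 'Payment Reference Number']
same = pairs.loc[pairs['Amount'] == pairs['Amount_other'], 'Payment Reference Number']

df['Check'] = 0
df.loc[df['Payment Reference Number'].isin(within), 'Check'] = "Yes - within 10%"
# exact matches overwrite the within-10% label
df.loc[df['Payment Reference Number'].isin(same), 'Check'] = "Yes - same amount"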

How to sum multiple row data using pandas when the data is in Excel format

Hi everyone, how do I sum multiple rows of data using pandas when the data is in Excel format, summing only the edwin and maria data? Please help me out, thanks in advance.
excel data

    name  salary  incentive
0   john    2000        400
1  edwin    3000        600
2  maria    1000        200

expected output

    name  salary  incentive
0  Total    5000       1000
1   john    2000        400
2  edwin    3000        600
3  maria    1000        200
Judging by the Total line, you need the sums of 'john' and 'edwin', not edwin and maria. I used the isin function, which returns a boolean mask that is then used to select the desired rows (the ind variable). A one-row dataframe holding the Total line is filled with the sums, and pd.concat prepends it to the original rows. As for summing directly in Excel, I don't understand what you want.
import pandas as pd

df = pd.DataFrame({'name': ['john', 'edwin', 'maria'],
                   'salary': [2000, 3000, 1000],
                   'incentive': [400, 600, 200]})
ind = df['name'].isin(['john', 'edwin'])
df1 = pd.DataFrame({'name': ['Total'],
                    'salary': [df.loc[ind, 'salary'].sum()],
                    'incentive': [df.loc[ind, 'incentive'].sum()]})
df1 = pd.concat([df1, df])
df1 = df1.reset_index(drop=True)
print(df1)
Output
name salary incentive
0 Total 5000 1000
1 john 2000 400
2 edwin 3000 600
3 maria 1000 200
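Since the question says the data is in Excel format, a minimal hedged sketch for loading it first (the file name 'data.xlsx' here is hypothetical):

import pandas as pd

# reading .xlsx files requires an engine such as openpyxl to be installed
df = pd.read_excel('data.xlsx')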

Python - Get percentage based on column values

I want to evaluate 'percent of number of releases in a year' as a measure of a genre's popularity in the MovieLens dataset.
Sample data is shown below:
I can set the index to be the year as
df1 = df.set_index('year')
then I can find the total per row and divide the individual cells to get a sense of percentages:
df1= df.set_index('year')
df1['total'] = df1.iloc[:,1:4].sum(axis=1)
df2 = df1.drop('movie',axis=1)
df2 = df2.div(df2['total'], axis= 0) * 100
df2.head()
Now, what's the best way to get the % of number of releases in a year? I believe I should use groupby and then a heatmap?
You can use the groupby method:
import pandas as pd
import numpy as np

df = pd.DataFrame({'movie': ['Movie1', 'Movie2', 'Movie3'],
                   'action': [1, 0, 0],
                   'com': [np.nan, np.nan, 1],
                   'drama': [1, 1, np.nan],
                   'year': [1994, 1994, 1995]})
df.fillna(0, inplace=True)
# numeric_only=True skips the string 'movie' column when summing
print((df.groupby(['year']).sum(numeric_only=True) / len(df)) * 100)
Output:
action com drama
year
1994 33.333333 0.000000 66.666667
1995 0.000000 33.333333 0.000000
Also, you can use pandas built-in style for the colored representation of the dataframe (or just use seaborn):
df = df.groupby(['year']).sum(numeric_only=True) / len(df) * 100
df.style.background_gradient(cmap='viridis')
Output: the same percentage table, shaded with the viridis gradient.
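If you prefer seaborn, a minimal sketch (assuming df already holds the percentage table computed above, with year as the index and one column per genre):

import seaborn as sns
import matplotlib.pyplot as plt

# annotate each cell with its percentage value
sns.heatmap(df, annot=True, fmt='.1f', cmap='viridis')
plt.show()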

Group-specific calculation on Pandas DataFrame

I'm wondering what is the most elegant/pythonic way of subtracting the brand-specific mean price from the price in the following DataFrame.
Put differently, I want to create a second column equal to the original price minus 1200 for Apple products, and equal to the original price minus 700 for Lenovo products.
import pandas as pd
from io import StringIO
csv = StringIO('''product,brand,price
macbook,Apple,1000
macbook air,Apple,1200
macbook pro,Apple,1400
thinkbook,Lenovo,600
thinkpad,Lenovo,800
''')
df = pd.read_csv(csv)
Thanks in advance for your help!
You can subtract the group-wise mean from the price to create a new column called Price_Diff_Mean. Use .transform('mean') to create a series of the same length as the price column, and subtract those values from price:
df['Price_Diff_Mean'] = df['price'] - df.groupby('brand')['price'].transform('mean')
df
Out[6]:
product brand price Price_Diff_Mean
0 macbook Apple 1000 -200
1 macbook air Apple 1200 0
2 macbook pro Apple 1400 200
3 thinkbook Lenovo 600 -100
4 thinkpad Lenovo 800 100
Alternatively, you can add the column with .assign, which will give you the same result:
df = df.assign(Price_Diff_Mean = df['price'] - df.groupby('brand')['price'].transform('mean'))
This is a slightly more elegant way, in my opinion:
df['newcolumn'] = df.groupby('brand')['price'].transform(lambda x: x - x.mean())  # select 'price' so the lambda never touches the string 'product' column
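For reference, the group means being subtracted match the numbers in the question (1200 for Apple, 700 for Lenovo):

df.groupby('brand')['price'].mean()
# brand
# Apple     1200.0
# Lenovo     700.0
# Name: price, dtype: float64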

How to sort a MultiIndex Pandas Pivot Table based on a certain column

I am new to Python and am trying to play around with the Pandas Pivot Tables. I have searched and searched but none of the answers have been what I am looking for. Basically, I am trying to sort the below pandas pivot table
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "TIME": ["FQ1", "FQ2", "FQ2", "FQ2"],
    "NAME": ["Robert", "Miranda", "Robert", "Robert"],
    "TOTAL": [900, 42, 360, 2000],
    "TYPE": ["Air", "Ground", "Air", "Ground"],
    "GROUP": ["A", "A", "A", "A"]})

pt = pd.pivot_table(data=df,
                    values=["TOTAL"], aggfunc=np.sum,
                    index=["GROUP", "TYPE", "NAME"],
                    columns="TIME",
                    fill_value=0,
                    margins=True)
Basically I am hoping to sort the "Type" and the "Name" column based on the sum of each row.
The end goal in this case would be "Ground" type appearing first before "Air", and within the "Ground" type, I'm hoping to have Robert appear before Miranda, since his sum is higher.
Here is how it appears now:
TOTAL
TIME FQ1 FQ2 All
GROUP TYPE NAME
A Air Robert 900 360 1260
Ground Miranda 0 42 42
Robert 0 2000 2000
All 900 2402 3302
Thanks to anyone who is able to help!!
Try this. Because your column header is a MultiIndex, you need to use a tuple to access the columns:
pt.sort_values(['GROUP', 'TYPE', ('TOTAL', 'All')],
               ascending=[True, True, False])
Output:
TOTAL
TIME FQ1 FQ2 All
GROUP TYPE NAME
A Air Robert 900 360 1260
Ground Robert 0 2000 2000
Miranda 0 42 42
All 900 2402 3302
