Group-specific calculation on Pandas DataFrame - python

I'm wondering what is the most elegant/pythonic way of subtracting the brand-specific mean price from the price in the following DataFrame.
Put differently, I want to create a second column equal to the original price minus 1200 for Apple products, and equal to the original price minus 700 for Lenovo products.
import pandas as pd
from io import StringIO
csv = StringIO('''product,brand,price
macbook,Apple,1000
macbook air,Apple,1200
macbook pro,Apple,1400
thinkbook,Lenovo,600
thinkpad,Lenovo,800
''')
df = pd.read_csv(csv)
Thanks in advance for your help!

You can subtract the grouped mean from the price to create a new column called Price_Diff_Mean. Use .transform('mean') to create a series of the same length as the price column, then subtract it from price:
df['Price_Diff_Mean'] = df['price'] - df.groupby('brand')['price'].transform('mean')
df
Out[6]:
product brand price Price_Diff_Mean
0 macbook Apple 1000 -200
1 macbook air Apple 1200 0
2 macbook pro Apple 1400 200
3 thinkbook Lenovo 600 -100
4 thinkpad Lenovo 800 100
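For reference, the intermediate series that .transform('mean') produces (using the df defined above) has one value per row, with each brand's mean repeated, which is why the element-wise subtraction lines up:
df.groupby('brand')['price'].transform('mean')
0    1200.0
1    1200.0
2    1200.0
3     700.0
4     700.0
Name: price, dtype: float64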
Alternatively, you can add the column with .assign, which gives the same result:
df = df.assign(Price_Diff_Mean = df['price'] - df.groupby('brand')['price'].transform('mean'))

This is a slightly more elegant way, in my opinion:
df['newcolumn'] = df.groupby('brand')['price'].transform(lambda x: x - x.mean())
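One more equivalent option (a sketch, not from the answers above) is to map the per-brand means back onto the brand column, which avoids transform entirely:
df['newcolumn'] = df['price'] - df['brand'].map(df.groupby('brand')['price'].mean())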

Related

how to sum multiple row data using pandas when the data is in Excel format

Hi everyone, how can I sum multiple rows of data using pandas? The data is in Excel format, and I want to sum only edwin and maria's data. Please help me out, thanks in advance.
excel data
    name  salary  incentive
0   john    2000        400
1  edwin    3000        600
2  maria    1000        200

expected output
    name  salary  incentive
0  Total    5000       1000
1   john    2000        400
2  edwin    3000        600
3  maria    1000        200
Judging by the Total line, you need the sums for 'john' and 'edwin', not edwin and maria. I used the isin function, which returns a boolean mask that is then used to select the desired rows (the ind variable). A one-row dataframe holding the Total line is filled with the sums, and pd.concat is used to prepend it to the remaining rows. As for the part about summing in Excel itself, I don't understand what you want.
import pandas as pd
df = pd.DataFrame({'name':['john', 'edwin', 'maria'], 'salary':[2000, 3000, 1000], 'incentive':[400, 600, 200]})
ind = df['name'].isin(['john', 'edwin'])
df1 = pd.DataFrame({'name':['Total'], 'salary':[df.loc[ind, 'salary'].sum()], 'incentive':[df.loc[ind, 'incentive'].sum()]})
df1 = pd.concat([df1, df])
df1 = df1.reset_index().drop(columns='index')
print(df1)
Output
name salary incentive
0 Total 5000 1000
1 john 2000 400
2 edwin 3000 600
3 maria 1000 200
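If the intent really was edwin and maria (as the question text says, rather than what the expected output shows), the same mask-based approach works with a different isin list — a quick sketch:
ind = df['name'].isin(['edwin', 'maria'])
print(df.loc[ind, ['salary', 'incentive']].sum())
which gives salary 4000 and incentive 800.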

Cumulative subtracting a pandas group by column from a variable

Hi, I have a dataframe that lists items that I own, along with their Selling Price.
I also have a variable that defines my current debt. Example:
import pandas as pd
current_debt = 16000
d = {
    'Person': ['John', 'John', 'John', 'John', 'John'],
    'Item': ['Car', 'Bike', 'Computer', 'Phone', 'TV'],
    'Price': [10500, 3300, 2100, 1100, 800],
}
df = pd.DataFrame(data=d)
df
I would like to "pay back" the current_debt starting with the most expensive item and continuing until the debt is paid. I would like to list the leftover money aligned to the last item sold. I'm hoping the function can include a groupby clause for Person, as sometimes there is more than one name in the list.
My expected output for the debt in the example above would be:
If anyone could help with a function to calculate this, that would be fantastic. I wasn't sure whether I needed to convert the dataframe to a list or whether it could be kept as a dataframe. Thanks very much!
Using a cumsum transformation and np.where to cover your logic for the final price column:
import numpy as np
df = df.sort_values(["Person", "Price"], ascending=False)
df['CumPrice'] = df.groupby("Person")['Price'].transform('cumsum')
df['Diff'] = df['CumPrice'] - current_debt
df['PriceLeft'] = np.where(
    df['Diff'] <= 0,
    0,
    np.where(
        df['Diff'] < df['Price'],
        df['Diff'],
        df['Price']
    )
)
Result:
Person Item Price CumPrice Diff PriceLeft
0 John Car 10500 10500 -5500 0
1 John Bike 3300 13800 -2200 0
2 John Computer 2100 15900 -100 0
3 John Phone 1100 17000 1000 1000
4 John TV 800 17800 1800 800
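If you also want to isolate the row where the debt is actually cleared (the Phone row above, with 1000 left over), one possible follow-up filter on the columns computed above is:
paid_off = df[(df['Diff'] > 0) & (df['Diff'] < df['Price'])]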

Python - Get percentage based on column values

I want to evaluate 'percent of the number of releases in a year' as a measure of the popularity of a genre in the MovieLens dataset.
Sample data is shown below:
I can set the index to be the year as
df1 = df.set_index('year')
then, I can find the total per row and then divide the individual cells to get a sense of percentages as:
df1= df.set_index('year')
df1['total'] = df1.iloc[:,1:4].sum(axis=1)
df2 = df1.drop('movie',axis=1)
df2 = df2.div(df2['total'], axis= 0) * 100
df2.head()
Now, what's the best way to get the % of the number of releases in a year? I believe I should use groupby and then a heatmap?
You can use the groupby method:
import pandas as pd
import numpy as np
df = pd.DataFrame({'movie':['Movie1','Movie2','Movie3'], 'action':[1,0,0], 'com':[np.nan,np.nan,1], 'drama':[1,1,np.nan], 'year':[1994,1994,1995]})
df.fillna(0,inplace=True)
df.set_index('year')
print((df.groupby(['year']).sum()/len(df))*100)
Output:
action com drama
year
1994 33.333333 0.000000 66.666667
1995 0.000000 33.333333 0.000000
Also, you can use pandas built-in style for the colored representation of the dataframe (or just use seaborn):
df = df.groupby(['year']).sum()/len(df)*100
df.style.background_gradient(cmap='viridis')
Output: the percentage table above, rendered with a viridis background gradient (displays in a notebook).
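If you go the seaborn route mentioned above, a minimal heatmap sketch (assuming seaborn is installed) is:
import seaborn as sns
sns.heatmap(df, annot=True, fmt='.1f', cmap='viridis')
where df is the percentage frame computed just above.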

How to sort a MultiIndex Pandas Pivot Table based on a certain column

I am new to Python and am trying to play around with the Pandas Pivot Tables. I have searched and searched but none of the answers have been what I am looking for. Basically, I am trying to sort the below pandas pivot table
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "TIME": ["FQ1", "FQ2", "FQ2", "FQ2"],
    "NAME": ["Robert", "Miranda", "Robert", "Robert"],
    "TOTAL": [900, 42, 360, 2000],
    "TYPE": ["Air", "Ground", "Air", "Ground"],
    "GROUP": ["A", "A", "A", "A"]})
pt = pd.pivot_table(data=df,
                    values=["TOTAL"], aggfunc=np.sum,
                    index=["GROUP", "TYPE", "NAME"],
                    columns="TIME",
                    fill_value=0,
                    margins=True)
Basically I am hoping to sort the "Type" and the "Name" column based on the sum of each row.
The end goal in this case would be "Ground" type appearing first before "Air", and within the "Ground" type, I'm hoping to have Robert appear before Miranda, since his sum is higher.
Here is how it appears now:
TOTAL
TIME FQ1 FQ2 All
GROUP TYPE NAME
A Air Robert 900 360 1260
Ground Miranda 0 42 42
Robert 0 2000 2000
All 900 2402 3302
Thanks to anyone who is able to help!!
Try this. Because your column header is a MultiIndex, you need to use a tuple to access the column:
pt.sort_values(['GROUP', 'TYPE', ('TOTAL', 'All')],
               ascending=[True, True, False])
Output:
TOTAL
TIME FQ1 FQ2 All
GROUP TYPE NAME
A Air Robert 900 360 1260
Ground Robert 0 2000 2000
Miranda 0 42 42
All 900 2402 3302
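To see what that tuple refers to, you can select the margin column on its own; it holds the per-row totals that the sort key uses:
pt[('TOTAL', 'All')]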

Counting total monthly values from a CSV in Python

I am trying to record monthly sales totals over the course of 2.5 years in a CSV data set.
I started with a CSV file of transaction history for a SKU, sorted by date (MM/DD/YYYY), with varying statuses indicating whether the item was sold, archived (quoted, not sold), or open. I managed to figure out how to display only the "Sold" rows, but cannot figure out how to display a total amount sold per month.
Here's what I have thus far.
#Import Libraries
from pandas import DataFrame, read_csv
import pandas as pd
#Set Variables
fields = ['Date', 'Qty', 'Status']
file = r'kp4.csv'
df = pd.read_csv(file, usecols=fields)
# Filters Dataset to only display "Sold" items in Status column
data = (df[df['Status'] == "Sold"])
print (data)
Output:
Date Qty Status
4 2/21/2018 5 Sold
4 2/21/2018 5 Sold
11 2/16/2018 34 Sold
14 3/16/2018 1 Sold
My ideal output would look something like this:
Date Qty Status
4 02/2018 39 Sold
5 03/2018 1 Sold
I've tried groupby, manipulating the year format, and assigning indexes per other tutorials, and have gotten nothing but errors. If anyone can point me in the right direction, it would be greatly appreciated.
Thanks!
IIUC
df.Date=pd.to_datetime(df.Date)
df=df.drop_duplicates()
df.groupby(df.Date.dt.strftime('%m/%Y')).agg({'Qty':'sum','Status':'first'})
Out[157]:
Qty Status
Date
02/2018 39 Sold
03/2018 1 Sold
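One caveat with grouping on '%m/%Y' strings: they sort alphabetically, so months from different years can interleave (e.g. 01/2019 sorts before 02/2018). If you want a chronologically ordered index, a small variant (not from the original answer) is to group on a monthly period instead:
df.groupby(df.Date.dt.to_period('M')).agg({'Qty':'sum','Status':'first'})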
