I'd like to get some percentage rates based on a .groupby() in pandas. My goal is to take an indicator column Ind and get the rate of A (the numerator) divided by the total (A+B) in that year.
Example Data:
import pandas as pd
import numpy as np
df: pd.DataFrame = pd.DataFrame([['2011','A',1,2,3], ['2011','B',4,5,6],['2012','A',15,20,4],['2012','B',17,12,12]], columns=["Year","Ind","X", "Y", "Z"])
print(df)
Year Ind X Y Z
0 2011 A 1 2 3
1 2011 B 4 5 6
2 2012 A 15 20 4
3 2012 B 17 12 12
Example for year 2011: XRate would be summing up the A indicators for X (which would be 1) and dividing by the total (A+B), which would be 5, thus giving an XRate of 0.20.
I would like to do this for all columns X, Y, Z to get the rates. I've tried doing lambda applys but can't quite get the desired results.
Desired Results:
Year XRate YRate ZRate
0 2011 0.20 0.29 0.33
1 2012 0.47 0.63 0.25
You can group the dataframe on Year and aggregate using sum (numeric_only=True keeps the non-numeric Ind column out of the sums, which recent pandas versions otherwise choke on):
s1 = df.groupby('Year').sum(numeric_only=True)
s2 = df.query("Ind == 'A'").groupby('Year').sum(numeric_only=True)
s2.div(s1).round(2).add_suffix('Rate')
XRate YRate ZRate
Year
2011 0.20 0.29 0.33
2012 0.47 0.62 0.25
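The same computation can also be written as one self-contained chain. This is a sketch using the question's own data; note that round(0.625, 2) yields 0.62 under round-half-to-even, which is why the answer shows 0.62 where the desired output has 0.63:

```python
import pandas as pd

df = pd.DataFrame(
    [['2011', 'A', 1, 2, 3], ['2011', 'B', 4, 5, 6],
     ['2012', 'A', 15, 20, 4], ['2012', 'B', 17, 12, 12]],
    columns=['Year', 'Ind', 'X', 'Y', 'Z'],
)

# Sum the A rows per year, divide by the per-year totals of all rows
rates = (
    df[df['Ind'].eq('A')].groupby('Year')[['X', 'Y', 'Z']].sum()
    .div(df.groupby('Year')[['X', 'Y', 'Z']].sum())
    .round(2)
    .add_suffix('Rate')
    .reset_index()
)
print(rates)
```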
Related
I have a dataframe like this,
>>> data = {
...     'year': [2019, 2020, 2020, 2019, 2020, 2019],
...     'provider': ['X', 'X', 'Y', 'Z', 'Z', 'T'],
...     'price': [100, 122, 0, 150, 120, 80],
...     'count': [20, 15, 24, 16, 24, 10]
... }
>>> df = pd.DataFrame(data)
>>> df
year provider price count
0 2019 X 100 20
1 2020 X 122 15
2 2020 Y 0 24
3 2019 Z 150 16
4 2020 Z 120 24
5 2019 T 80 10
And this is expected output:
provider price_rate count_rate
0 X 0.22 -0.25
1 Z -0.20 0.50
I want to group prices on providers and find price, count differences between 2019 and 2020.
If there is no price or count record for 2020 or 2019, I don't want to see the related provider.
On the assumption that there are always only 1 or 2 rows per provider, we can first sort_values on year to make sure 2019 comes before 2020.
Then we groupby on provider, divide each row of price and count by the previous one, and subtract 1.
df = df.sort_values('year')
grp = (
df.groupby('provider')
.apply(lambda x: x[['price', 'count']].div(x[['price', 'count']].shift()).sub(1))
)
dfnew = df[['provider']].join(grp).dropna()
provider price count
1 X 0.22 -0.25
4 Z -0.20 0.50
Or, using only vectorized methods:
dfnew = df[df['provider'].duplicated(keep=False)].sort_values(['provider', 'year'])
dfnew[['price', 'count']] = (
dfnew[['price', 'count']].div(dfnew[['price', 'count']].shift()).sub(1)
)
dfnew = dfnew[dfnew['provider'].eq(dfnew['provider'].shift())].drop('year', axis=1)
provider price count
1 X 0.22 -0.25
4 Z -0.20 0.50
You can try pct_change (the suffix needs to be _rate, not _count, for the columns to match the expected output):
final = (df.set_index(['provider','year']).groupby(level=0)
         .pct_change().dropna().droplevel(1).add_suffix('_rate').reset_index())
provider price_rate count_rate
0 X 0.22 -0.25
1 Z -0.20 0.50
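Put together as a self-contained sketch (with an explicit sort so 2019 precedes 2020 within each provider, rather than relying on the input order):

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2019, 2020, 2020, 2019, 2020, 2019],
    'provider': ['X', 'X', 'Y', 'Z', 'Z', 'T'],
    'price': [100, 122, 0, 150, 120, 80],
    'count': [20, 15, 24, 16, 24, 10],
})

final = (
    df.sort_values(['provider', 'year'])  # 2019 before 2020 within each provider
      .set_index(['provider', 'year'])
      .groupby(level=0)
      .pct_change()                       # rate of change vs. the previous year
      .dropna()                           # drops providers with only one year
      .droplevel(1)
      .add_suffix('_rate')
      .reset_index()
)
print(final)
```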
I have the following code that generates four columns as I intended
df['revenue'] = pd.to_numeric(df['revenue']) #not exactly sure what this does
df['Date'] = pd.to_datetime(df['Date'], unit='s')
df['Year'] = df['Date'].dt.year
df['First Purchase Date'] = pd.to_datetime(df['First Purchase Date'], unit='s')
df['number_existing_customers'] = df.groupby(df['Year'])[['Existing Customer']].sum()
df['number_new_customers'] = df.groupby(df['Year'])[['New Customer']].sum()
df['Rate'] = df['number_new_customers']/df['number_existing_customers']
Table = df.groupby(df['Year'])[['New Customer', 'Existing Customer', 'Rate', 'revenue']].sum()
print(Table)
I want to be able to divide one column by another (new customers by existing) but I seem to be getting zeros when creating the new column (see output below).
>>> print(Table)
New Customer Existing Customer Rate revenue
Year
2014 7.00 2.00 0.00 11,869.47
2015 1.00 3.00 0.00 9,853.93
2016 5.00 3.00 0.00 4,058.53
2017 9.00 3.00 0.00 8,056.37
2018 12.00 7.00 0.00 22,031.23
2019 16.00 10.00 0.00 97,142.42
All you need to do is define the column and then use the corresponding operator, in this case / (mind the exact capitalization of your column names: your frame has 'New Customer', while the example below uses 'New customer'):
Table['Rate'] = Table['New customer']/Table['Existing customer']
In this example I'm copying your Table output and using the code I've posted:
import pandas as pd
import numpy as np
data = {'Year':[2014,2015,2016,2017,2018,2019],'New customer':[7,1,5,9,12,16],'Existing customer':[2,3,3,3,7,10],'revenue':[1000,1000,1000,1001,1100,1200]}
Table = pd.DataFrame(data).set_index('Year')
Table['Rate'] = Table['New customer']/Table['Existing customer']
print(Table)
Output:
New customer Existing customer revenue Rate
Year
2014 7 2 1000 3.500000
2015 1 3 1000 0.333333
2016 5 3 1000 1.666667
2017 9 3 1001 3.000000
2018 12 7 1100 1.714286
2019 16 10 1200 1.600000
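As a hedged aside on why the original Rate column came out as zeros: assigning a groupby aggregate (indexed by Year) back onto the raw frame (indexed 0, 1, 2, …) misaligns the indices, so the new columns are all NaN, and summing an all-NaN column yields 0.00. A minimal sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2014, 2014, 2015],
    'New Customer': [1, 1, 1],
    'Existing Customer': [1, 0, 1],
})

# Misaligned: the aggregate is indexed by Year, df by 0..2, so nothing lines up
df['number_new_customers'] = df.groupby('Year')['New Customer'].sum()
print(df['number_new_customers'].isna().all())  # True -- every value is NaN

# Aggregate first, then divide; both sides then share the Year index
table = df.groupby('Year')[['New Customer', 'Existing Customer']].sum()
table['Rate'] = table['New Customer'] / table['Existing Customer']
```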
I have a dataframe such as the one built below.
What's the best way to calculate a cumulative return to fill the NaN values? The logic of each cell is spelled out line by line.
The following shows the intended result:
import numpy as np
import pandas as pd
df = pd.DataFrame({"DATE":[2018,2019,2020,2021,2022,2023,2024],"RATIO":[0.03,0.04,0.05,0.06,0.07,0.08,0.09],"PROFIT":[10,20,np.nan,np.nan,np.nan,np.nan,np.nan]})
df.loc[df['DATE']==2020, ['PROFIT']] = 20000*(1+0.04)
df.loc[df['DATE']==2021, ['PROFIT']] = 20000*(1+0.04)*(1+0.050)
df.loc[df['DATE']==2022, ['PROFIT']] = 20000*(1+0.04)*(1+0.050)*(1+0.060)
df.loc[df['DATE']==2023, ['PROFIT']] = 20000*(1+0.04)*(1+0.050)*(1+0.060)*(1+0.070)
df.loc[df['DATE']==2024, ['PROFIT']] = 20000*(1+0.04)*(1+0.050)*(1+0.060)*(1+0.070)*(1+0.080)
df
You are looking for cumprod:
df['PROFIT']=df['PROFIT'].fillna(df.RATIO.shift().add(1).iloc[2:].cumprod()*20000)
df
Out[30]:
DATE RATIO PROFIT
0 2018 0.03 10.00000
1 2019 0.04 20.00000
2 2020 0.05 20800.00000
3 2021 0.06 21840.00000
4 2022 0.07 23150.40000
5 2023 0.08 24770.92800
6 2024 0.09 26752.60224
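The same fill can be written as a self-contained sketch, keeping the question's 20000 base and shifting the ratios by one year so each missing year compounds the previous year's ratio:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'DATE': [2018, 2019, 2020, 2021, 2022, 2023, 2024],
    'RATIO': [0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09],
    'PROFIT': [10, 20, np.nan, np.nan, np.nan, np.nan, np.nan],
})

mask = df['PROFIT'].isna()
# Compound (1 + previous year's ratio) cumulatively onto the 20000 base
df.loc[mask, 'PROFIT'] = 20000 * df['RATIO'].shift().add(1)[mask].cumprod()
print(df)
```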
I have around 8781 rows in my dataset. I have grouped the items by month and calculated the mean of a particular item for every month. Now I want to store each month's result by inserting a new row after that month's block.
Below is the code I have worked with for grouping the items and calculating the means.
Could anyone tell me how I can insert a new row after every month and store my groupby result in it?
a = pd.read_csv("data3.csv")
print (a)
df=pd.DataFrame(a,columns=['month','day','BedroomLights..kW.'])
print(df)
groupby_month=df['day'].groupby(df['month'])
print(groupby_month)
c=list(df['day'].groupby(df['month']))
print(c)
d=df['day'].groupby(df['month']).describe()
print (d)
#print(groupby_month.mean())
e=df['BedroomLights..kW.'].groupby(df['month']).mean()
print(e)
A sample of the csv file is:
Day Month Year lights Fan temperature windspeed
1 1 2016 0.003 0.12 39 8.95
2 1 2016 0.56 1.23 34 9.54
3 1 2016 1.43 0.32 32 10.32
4 1 2016 0.4 1.43 24 8.32
.................................................
1 12 2016 0.32 0.54 22 7.65
2 12 2016 1.32 0.43 21 6.54
The expected output I want adds a new row containing the mean of each month's items, like:
Month lights ......
1 0.32
1 0.43
...............
mean as a new row
...............
12 0.32
12 0.43
mean .........
The output of the code I have shown is as follows:
month
1 0.006081
2 0.005993
3 0.005536
4 0.005729
5 0.005823
6 0.005587
7 0.006214
8 0.005509
9 0.005935
10 0.005821
11 0.006226
12 0.006056
Name: BedroomLights..kW., dtype: float64
If the mean rows are given indices like 1mean, 2mean, etc., sort_index should place them where you want. Note that DataFrame.append was removed in pandas 2.0, so pd.concat is used here; this also assumes the frame's own index holds matching string month labels, since a mixed int/str index will not sort:
e.index = [str(n) + 'mean' for n in range(1, 13)]
df = pd.concat([df, e.to_frame()])
df = df.sort_index()
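String month indices sort awkwardly (e.g. '10' before '2'), so as an alternative sketch on a toy two-month frame, tagging rows with a helper key and sorting on (month, key) interleaves the mean rows without touching the index at all:

```python
import pandas as pd

df = pd.DataFrame({
    'month': [1, 1, 2, 2],
    'BedroomLights..kW.': [0.32, 0.43, 0.50, 0.70],
})

means = df.groupby('month', as_index=False)['BedroomLights..kW.'].mean()

# Helper key: 0 for data rows, 1 for mean rows, so each mean sorts last in its month
out = (
    pd.concat([df.assign(key=0), means.assign(key=1)])
      .sort_values(['month', 'key'], kind='stable')
      .drop(columns='key')
      .reset_index(drop=True)
)
print(out)
```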
I have the following dataframe and would like to get the rolling cumulative return over the last, let's say for this example, 2 periods, grouped by an identifier. For my actual case I need a longer window, but my problem is more with the groupby:
id return
2012 1 0.5
2012 2 0.2
2013 1 0.1
2013 2 0.3
The result should look like this:
id return cumreturn
2012 1 0.5 0.5
2012 2 0.2 0.2
2013 1 0.1 0.65
2013 2 0.3 0.56
It is important that the window is rolling. I have the following formula so far:
df["cumreturn"] = df.groupby("id")["return"].fillna(0).pd.rolling_apply(df,5,lambda x: np.prod(1+x)-1)
However, I get the following error: AttributeError: 'Series' object has no attribute 'pd'. I know how to get the rolling cumulative return; I just can't figure out how to combine it with groupby.
Let's try this:
df_out = (df.set_index('id', append=True)
.assign(cumreturn=df.groupby('id')['return'].rolling(2,min_periods=1)
.apply(lambda x: np.prod(1+x)-1)
.swaplevel(0,1)).reset_index(1))
Output:
id return cumreturn
2012 1 0.5 0.50
2012 2 0.2 0.20
2013 1 0.1 0.65
2013 2 0.3 0.56
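A shorter equivalent, sketched with groupby.transform and the same window of 2, which writes the result straight back into the original frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 1, 2], 'return': [0.5, 0.2, 0.1, 0.3]},
                  index=[2012, 2012, 2013, 2013])

# Within each id, compound (1 + return) over a rolling window, then subtract 1
df['cumreturn'] = df.groupby('id')['return'].transform(
    lambda s: s.add(1).rolling(2, min_periods=1).apply(np.prod, raw=True).sub(1)
)
print(df)
```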