I have a dataframe such as the following:
What's the best way to calculate a cumulative return to fill the NaN values? The logic for each cell is shown in the code.
Following is the intended result:
import numpy as np
import pandas as pd
df = pd.DataFrame({"DATE":[2018,2019,2020,2021,2022,2023,2024],"RATIO":[0.03,0.04,0.05,0.06,0.07,0.08,0.09],"PROFIT":[10,20,np.nan,np.nan,np.nan,np.nan,np.nan]})
df.loc[df['DATE']==2020, ['PROFIT']] = 20000*(1+0.04)
df.loc[df['DATE']==2021, ['PROFIT']] = 20000*(1+0.04)*(1+0.050)
df.loc[df['DATE']==2022, ['PROFIT']] = 20000*(1+0.04)*(1+0.050)*(1+0.060)
df.loc[df['DATE']==2023, ['PROFIT']] = 20000*(1+0.04)*(1+0.050)*(1+0.060)*(1+0.070)
df.loc[df['DATE']==2024, ['PROFIT']] = 20000*(1+0.04)*(1+0.050)*(1+0.060)*(1+0.070)*(1+0.080)
df
You are looking for cumprod:
df['PROFIT']=df['PROFIT'].fillna(df.RATIO.shift().add(1).iloc[2:].cumprod()*20000)
df
Out[30]:
DATE RATIO PROFIT
0 2018 0.03 10.00000
1 2019 0.04 20.00000
2 2020 0.05 20800.00000
3 2021 0.06 21840.00000
4 2022 0.07 23150.40000
5 2023 0.08 24770.92800
6 2024 0.09 26752.60224
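The same idea can be sketched without the positional `iloc[2:]` offset by accumulating only over the rows where PROFIT is missing. This is a minimal sketch, assuming the missing values are all at the end of the column and that the base value is 20000 as in the question's formulas; the names `base`, `growth`, `missing` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "DATE": [2018, 2019, 2020, 2021, 2022, 2023, 2024],
    "RATIO": [0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09],
    "PROFIT": [10, 20, np.nan, np.nan, np.nan, np.nan, np.nan],
})

base = 20000  # starting value taken from the question's formulas (assumption)

# The growth factor for each row is 1 + the previous year's RATIO.
growth = df["RATIO"].shift().add(1)

# Accumulate the factors only over the rows where PROFIT is missing.
missing = df["PROFIT"].isna()
filled = growth[missing].cumprod() * base

# fillna aligns on the index, so existing PROFIT values are left untouched.
df["PROFIT"] = df["PROFIT"].fillna(filled)
print(df)
```

Because `fillna` aligns on the index, the rows that already hold a PROFIT value keep it.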
I'd like to get some % rates based on a .groupby() in pandas. My goal is to take an indicator column Ind and get the Rate of A (numerator) divided by the total (A+B) in that year
Example Data:
import pandas as pd
import numpy as np
df: pd.DataFrame = pd.DataFrame([['2011','A',1,2,3], ['2011','B',4,5,6],['2012','A',15,20,4],['2012','B',17,12,12]], columns=["Year","Ind","X", "Y", "Z"])
print(df)
Year Ind X Y Z
0 2011 A 1 2 3
1 2011 B 4 5 6
2 2012 A 15 20 4
3 2012 B 17 12 12
Example for year 2011: XRate would be the sum of the A indicators for X (which would be 1) divided by the total (A+B), which would be 5, so I would get an XRate of 0.20.
I would like to do this for all columns X, Y, Z to get the rates. I've tried doing lambda applys but can't quite get the desired results.
Desired Results:
Year XRate YRate ZRate
0 2011 0.20 0.29 0.33
1 2012 0.47 0.63 0.25
You can group the dataframe on Year and aggregate using sum:
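# Select the numeric columns explicitly so the string column Ind is not summed
s1 = df.groupby('Year')[['X', 'Y', 'Z']].sum()
s2 = df.query("Ind == 'A'").groupby('Year')[['X', 'Y', 'Z']].sum()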
s2.div(s1).round(2).add_suffix('Rate')
XRate YRate ZRate
Year
2011 0.20 0.29 0.33
2012 0.47 0.62 0.25
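To get Year back as a regular column, as in the desired layout, the result can be passed through `reset_index()`. The following is a self-contained sketch of that variant; the names `value_cols`, `total`, `a_only` and `rates` are illustrative, and the numeric columns are selected explicitly because recent pandas versions may otherwise include the string column Ind in the sum.

```python
import pandas as pd

df = pd.DataFrame(
    [["2011", "A", 1, 2, 3], ["2011", "B", 4, 5, 6],
     ["2012", "A", 15, 20, 4], ["2012", "B", 17, 12, 12]],
    columns=["Year", "Ind", "X", "Y", "Z"],
)
value_cols = ["X", "Y", "Z"]

# Denominator: A + B per year.
total = df.groupby("Year")[value_cols].sum()

# Numerator: only the rows where Ind == 'A', per year.
a_only = df[df["Ind"] == "A"].groupby("Year")[value_cols].sum()

# Division aligns on the Year index and the column names.
rates = a_only.div(total).add_suffix("Rate").reset_index()
print(rates.round(2))
```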
I have a dataframe like this,
>>> data = {
'year':[2019, 2020, 2020, 2019, 2020, 2019],
'provider':['X', 'X', 'Y', 'Z', 'Z', 'T'],
'price':[100, 122, 0, 150, 120, 80],
'count':[20, 15, 24, 16, 24, 10]
}
>>> df = pd.DataFrame(data)
>>> df
year provider price count
0 2019 X 100 20
1 2020 X 122 15
2 2020 Y 0 24
3 2019 Z 150 16
4 2020 Z 120 24
5 2019 T 80 10
And this is expected output:
provider price_rate count_rate
0 X 0.22 -0.25
1 Z -0.20 0.50
I want to group by provider and find the relative change in price and count between 2019 and 2020.
If a provider has no price or count record for 2019 or 2020, I don't want to see that provider in the output.
Assuming there are always only 1 or 2 rows per provider, we can first sort_values on year to make sure 2019 comes before 2020.
Then we groupby on provider, divide each row of price and count by the previous row, and subtract 1.
df = df.sort_values('year')
grp = (
df.groupby('provider', group_keys=False)  # keep the original row index so the join below lines up
.apply(lambda x: x[['price', 'count']].div(x[['price', 'count']].shift()).sub(1))
)
dfnew = df[['provider']].join(grp).dropna()
provider price count
1 X 0.22 -0.25
4 Z -0.20 0.50
Or only vectorized methods:
dfnew = df[df['provider'].duplicated(keep=False)].sort_values(['provider', 'year'])
dfnew[['price', 'count']] = (
dfnew[['price', 'count']].div(dfnew[['price', 'count']].shift()).sub(1)
)
dfnew = dfnew[dfnew['provider'].eq(dfnew['provider'].shift())].drop('year', axis=1)
provider price count
1 X 0.22 -0.25
4 Z -0.20 0.50
You can try:
final = (df.set_index(['provider','year']).groupby(level=0)
.pct_change().dropna().droplevel(1).add_suffix('_rate').reset_index())
provider price_rate count_rate
0 X 0.22 -0.25
1 Z -0.20 0.50
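The `pct_change` approach depends on 2019 preceding 2020 within each provider. A self-contained sketch that sorts explicitly and rounds to match the expected output could look like this; it assumes, as in the question, at most one row per provider and year.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2019, 2020, 2020, 2019, 2020, 2019],
    "provider": ["X", "X", "Y", "Z", "Z", "T"],
    "price": [100, 122, 0, 150, 120, 80],
    "count": [20, 15, 24, 16, 24, 10],
})

final = (
    df.sort_values(["provider", "year"])     # make sure 2019 comes before 2020
      .set_index(["provider", "year"])
      .groupby(level="provider")
      .pct_change()                          # (current - previous) / previous
      .dropna()                              # drops providers with a single year
      .droplevel("year")
      .add_suffix("_rate")
      .round(2)
      .reset_index()
)
print(final)
```

Providers Y and T have only one year of data, so their only row is NaN after `pct_change` and is removed by `dropna`.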
I have the following code that generates four columns as I intended
df['revenue'] = pd.to_numeric(df['revenue']) #not exactly sure what this does
df['Date'] = pd.to_datetime(df['Date'], unit='s')
df['Year'] = df['Date'].dt.year
df['First Purchase Date'] = pd.to_datetime(df['First Purchase Date'], unit='s')
df['number_existing_customers'] = df.groupby(df['Year'])[['Existing Customer']].sum()
df['number_new_customers'] = df.groupby(df['Year'])[['New Customer']].sum()
df['Rate'] = df['number_new_customers']/df['number_existing_customers']
Table = df.groupby(df['Year'])[['New Customer', 'Existing Customer', 'Rate', 'revenue']].sum()
print(Table)
I want to be able to divide one column by another (new customers by existing) but I seem to be getting zeros when creating the new column (see output below).
>>> print(Table)
New Customer Existing Customer Rate revenue
Year
2014 7.00 2.00 0.00 11,869.47
2015 1.00 3.00 0.00 9,853.93
2016 5.00 3.00 0.00 4,058.53
2017 9.00 3.00 0.00 8,056.37
2018 12.00 7.00 0.00 22,031.23
2019 16.00 10.00 0.00 97,142.42
Compute the ratio on the aggregated Table rather than on the row-level df. All you need to do is define the column and then use the corresponding operator, in this case /:
Table['Rate'] = Table['New Customer']/Table['Existing Customer']
In this example I'm copying your Table output and using the code I've posted:
import pandas as pd
import numpy as np
data = {'Year':[2014,2015,2016,2017,2018,2019],'New Customer':[7,1,5,9,12,16],'Existing Customer':[2,3,3,3,7,10],'revenue':[1000,1000,1000,1001,1100,1200]}
Table = pd.DataFrame(data).set_index('Year')
Table['Rate'] = Table['New Customer']/Table['Existing Customer']
print(Table)
Output:
New Customer Existing Customer revenue Rate
Year
2014 7 2 1000 3.500000
2015 1 3 1000 0.333333
2016 5 3 1000 1.666667
2017 9 3 1001 3.000000
2018 12 7 1100 1.714286
2019 16 10 1200 1.600000
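The zeros in the question most likely come from index alignment: the groupby result is indexed by Year, while df is indexed by row number, so assigning it back to df yields NaN, and summing NaN gives 0. A minimal sketch of the "aggregate first, then divide" order, using hypothetical row-level data with 0/1 flag columns (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical row-level data: one row per order, with 0/1 customer flags.
df = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2015, 2015],
    "New Customer": [1, 1, 0, 0, 1],
    "Existing Customer": [0, 0, 1, 1, 0],
    "revenue": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Step 1: aggregate per year.
table = df.groupby("Year")[["New Customer", "Existing Customer", "revenue"]].sum()

# Step 2: compute the ratio on the aggregated columns.
table["Rate"] = table["New Customer"] / table["Existing Customer"]
print(table)
```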
I have around 8781 rows in my dataset. I have grouped the items by month and calculated the monthly mean of a particular item. Now I want to insert a new row after each month that stores that month's result.
Below is the code I have written to group the items and calculate the mean.
Can anyone tell me how to insert a new row after every month and store my groupby result in it?
a = pd.read_csv("data3.csv")
print (a)
df=pd.DataFrame(a,columns=['month','day','BedroomLights..kW.'])
print(df)
groupby_month=df['day'].groupby(df['month'])
print(groupby_month)
c=list(df['day'].groupby(df['month']))
print(c)
d=df['day'].groupby(df['month']).describe()
print (d)
#print(groupby_month.mean())
e=df['BedroomLights..kW.'].groupby(df['month']).mean()
print(e)
A sample of csv file is :
Day Month Year lights Fan temperature windspeed
1 1 2016 0.003 0.12 39 8.95
2 1 2016 0.56 1.23 34 9.54
3 1 2016 1.43 0.32 32 10.32
4 1 2016 0.4 1.43 24 8.32
.................................................
1 12 2016 0.32 0.54 22 7.65
2 12 2016 1.32 0.43 21 6.54
The expected output adds a new row after every month holding the mean of that month's items, like:
Month lights ......
1 0.32
1 0.43
...............
mean as a new row
...............
12 0.32
12 0.43
mean .........
The output of the code I have shown is as follows:
month
1 0.006081
2 0.005993
3 0.005536
4 0.005729
5 0.005823
6 0.005587
7 0.006214
8 0.005509
9 0.005935
10 0.005821
11 0.006226
12 0.006056
Name: BedroomLights..kW., dtype: float64
If your indices are named 1mean, 2mean, 3mean, etc., sort_index should place them where you want.
e.index = [str(n)+'mean' for n in range(1,13)]
df = df.append(e)
df = df.sort_index()
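An alternative that does not rely on index labels sorting into place is to build the result group by group and concatenate. This is a sketch under assumptions: the small frame stands in for the CSV, the column name `lights` and the "mean" marker in the `day` column are illustrative choices, and `pd.concat` is used because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: two months, a few days each.
df = pd.DataFrame({
    "month": [1, 1, 1, 2, 2],
    "day": [1, 2, 3, 1, 2],
    "lights": [0.3, 0.6, 0.9, 0.2, 0.4],
})

pieces = []
for month, group in df.groupby("month", sort=True):
    pieces.append(group)
    # One summary row per month; 'day' holds the label "mean" to mark it.
    mean_row = pd.DataFrame(
        {"month": [month], "day": ["mean"], "lights": [group["lights"].mean()]}
    )
    pieces.append(mean_row)

# Each month's rows are followed directly by that month's mean row.
result = pd.concat(pieces, ignore_index=True)
print(result)
```

Mixing numbers and the string "mean" turns the `day` column into object dtype, so this layout suits display or export more than further numeric work.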
I have a dataframe that looks like so
time usd hour day
0 2015-08-30 07:56:28 1.17 7 0
1 2015-08-30 08:56:28 1.27 8 0
2 2015-08-30 09:56:28 1.28 9 0
3 2015-08-30 10:56:28 1.29 10 0
4 2015-08-30 11:56:28 1.29 11 0
14591 2017-04-30 23:53:46 9.28 23 609
Given this, how would I go about building a 2D NumPy matrix with hour on one axis, day on the other, and usd as the value stored in the matrix?
Consider the dataframe df
df = pd.DataFrame(dict(
time=pd.date_range('2015-08-30', periods=14000, freq='H'),
usd=(np.random.randn(14000) / 100 + 1.0005).cumprod()
))
Then we can set the index with the date and hour of df.time column and unstack. We take the values of this result in order to access the numpy array.
a = df.set_index([df.time.dt.date, df.time.dt.hour]).usd.unstack().values
I would do a pivot_table and leave the data as a pandas DataFrame but the conversion to a numpy array is trivial if you don't want labels.
import pandas as pd
data = <data>
data.pivot_table(values = 'usd', index = 'hour', columns = 'day').values
Edit: Thank you @pyRSquared for the "Value"able tip. (changed np.array(data) to df...values)
You can use the pivot functionality of pandas. You will get NaN values for usd wherever there is no value for that day and hour.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'usd': [1.17, 1.27, 1.28, 1.29, 1.29, 9.28], 'hour': [7, 8, 9, 10, 11, 23], 'day': [0, 0, 0, 0, 0, 609]})
In [3]: df
Out[3]:
day hour usd
0 0 7 1.17
1 0 8 1.27
2 0 9 1.28
3 0 10 1.29
4 0 11 1.29
5 609 23 9.28
In [4]: df.pivot(index='hour', columns='day', values='usd')
Out[4]:
day 0 609
hour
7 1.17 NaN
8 1.27 NaN
9 1.28 NaN
10 1.29 NaN
11 1.29 NaN
23 NaN 9.28
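To go from the pivoted frame to the NumPy matrix the question asks for, one option is to reindex the hour axis so that every hour from 0 to 23 has a row, then call `to_numpy()`. A minimal sketch, assuming each (hour, day) pair occurs at most once, since `pivot` raises an error on duplicates; `wide` and `matrix` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "usd": [1.17, 1.27, 1.28, 1.29, 1.29, 9.28],
    "hour": [7, 8, 9, 10, 11, 23],
    "day": [0, 0, 0, 0, 0, 609],
})

wide = df.pivot(index="hour", columns="day", values="usd")

# Reindex so every hour 0..23 has a row, even where nothing was observed.
wide = wide.reindex(range(24))

# Rows are hours, columns are the distinct days; gaps are NaN.
matrix = wide.to_numpy()
print(matrix.shape)
```

After the reindex, row `h` of the matrix corresponds to hour `h`, while the columns follow the sorted distinct day values found in `wide.columns`.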