I have the following dataframe and would like to get the rolling cumulative return over the last, say, 2 periods, grouped by an identifier. My actual case needs a longer window, but my problem is mainly with the groupby:
id return
2012 1 0.5
2012 2 0.2
2013 1 0.1
2013 2 0.3
The result should look like this:
id return cumreturn
2012 1 0.5 0.5
2012 2 0.2 0.2
2013 1 0.1 0.65
2013 2 0.3 0.56
It is important that the period is rolling. I have the following formula so far:
df["cumreturn"] = df.groupby("id")["return"].fillna(0).pd.rolling_apply(df,5,lambda x: np.prod(1+x)-1)
However, I get the following error: AttributeError: 'Series' object has no attribute 'pd'. I know how to get the rolling cumulative return; I just can't figure out how to combine it with groupby.
Let's try this:
df_out = (df.set_index('id', append=True)
            .assign(cumreturn=df.groupby('id')['return'].rolling(2, min_periods=1)
                                .apply(lambda x: np.prod(1+x)-1)
                                .swaplevel(0,1)).reset_index(1))
Output:
id return cumreturn
2012 1 0.5 0.50
2012 2 0.2 0.20
2013 1 0.1 0.65
2013 2 0.3 0.56
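An alternative sketch, in case the index juggling above feels heavy: groupby(...).transform returns a result aligned with the original rows, so no set_index/swaplevel is needed. The sample frame is rebuilt from the question, and the helper name rolling_cumreturn and the use of raw=True are my own choices rather than part of the answer above.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (year as the index).
df = pd.DataFrame(
    {"id": [1, 2, 1, 2], "return": [0.5, 0.2, 0.1, 0.3]},
    index=[2012, 2012, 2013, 2013],
)

def rolling_cumreturn(s, window=2):
    # Compound the returns inside each rolling window: prod(1 + r) - 1.
    # raw=True hands the lambda a plain NumPy array instead of a Series.
    return s.rolling(window, min_periods=1).apply(
        lambda x: np.prod(1 + x) - 1, raw=True
    )

# transform applies the function per id and keeps the original row order.
df["cumreturn"] = df.groupby("id")["return"].transform(rolling_cumreturn)
print(df.round(2))
```

For a longer window, change the window default (or wrap the call in a lambda that passes a different value).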
I have a data set containing the ratings given by each user ID to each product ID. There are only 5000 products and 10,000 users, but the IDs are not consecutive numbers. I would like to transform my dataframe into a coo_sparse_matrix(data, (row,col), shape) whose rows and columns correspond to the actual number of users and products, not to the ID values. Is there any way to do that? Below is an illustration:
Data frame:
User ID  Product ID  Rating
1        14          0.1
1        15          0.2
2        14          0.3
2        16          0.3
5        19          0.4
and expected to have a matrix (in sparse coo form)
UserID \ ProductID  14   15   16   19
1                   0.1  0.2  0    0
2                   0.3  0    0.3  0
5                   0    0    0    0.4
I ask because a COO matrix built directly from the IDs would be much larger, with indices (1, 2, ..., 19) for product ID and (1, 2, 3, 4, 5) for user ID.
Please help me; this is for a thesis due in 3 days and I have only just discovered this error. I am coding in Python.
Thank you very much!
Hi, hope this helps, and good luck with your thesis:
import pandas as pd
from scipy.sparse import coo_matrix
dataframe=pd.DataFrame(data={'User ID':[1,1,2,2,5], 'Product ID':[14,15,14,16,19], 'Rating':[0.1,0.2,0.3,0.3,0.4]})
row=dataframe['User ID']
col=dataframe['Product ID']
data=dataframe['Rating']
coo=coo_matrix((data, (row, col))).toarray()
new_dataframe=pd.DataFrame(coo)
# Drop the all-zero columns, i.e. Product IDs that do not exist (optional; delete if not wanted)
new_dataframe=new_dataframe.loc[:, (new_dataframe!=0).any(axis=0)]
# Drop the all-zero rows, i.e. User IDs that do not exist (optional; delete if not wanted)
new_dataframe=new_dataframe.loc[(new_dataframe!=0).any(axis=1)]
print(new_dataframe)
Output:
14 15 16 19
1 0.1 0.2 0.0 0.0
2 0.3 0.0 0.3 0.0
5 0.0 0.0 0.0 0.4
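The approach above goes through a dense array, which may not scale to 10,000 users by 5000 products with large ID values. A minimal sketch of a way to stay sparse, assuming it is acceptable to map each ID to a consecutive position with pd.factorize; the variable names (user_codes, product_ids, and so on) are only illustrative:

```python
import pandas as pd
from scipy.sparse import coo_matrix

dataframe = pd.DataFrame({
    "User ID": [1, 1, 2, 2, 5],
    "Product ID": [14, 15, 14, 16, 19],
    "Rating": [0.1, 0.2, 0.3, 0.3, 0.4],
})

# factorize maps each distinct ID to a position 0..n-1;
# sort=True keeps the positions in ascending ID order.
user_codes, user_ids = pd.factorize(dataframe["User ID"], sort=True)
product_codes, product_ids = pd.factorize(dataframe["Product ID"], sort=True)

# The shape is (number of distinct users, number of distinct products).
sparse = coo_matrix(
    (dataframe["Rating"].to_numpy(), (user_codes, product_codes)),
    shape=(len(user_ids), len(product_ids)),
)

print(sparse.shape)          # (3, 4)
print(list(user_ids))        # row i corresponds to user_ids[i]
print(list(product_ids))     # column j corresponds to product_ids[j]
```

user_ids and product_ids act as lookup tables from matrix position back to the original ID, so nothing is lost by compacting the indices.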
I want to round the numbers shown in my table.
It currently looks like:
I want it to look like:
How can I do that? Using pandas or numpy, and as simply as possible. Thanks!
In pandas, we can use pandas.DataFrame.round. See the example below (from the pandas documentation).
Dataframe df:
dogs cats
0 0.21 0.32
1 0.01 0.67
2 0.66 0.03
3 0.21 0.18
Round df to one decimal place like below:
df.round(1)
Output:
dogs cats
0 0.2 0.3
1 0.0 0.7
2 0.7 0.0
3 0.2 0.2
We can also specify which columns to round and to how many decimals; see the documentation for pandas.DataFrame.round for more details.
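As a short sketch of the per-column form, DataFrame.round accepts a dict mapping column names to the number of decimals. The frame is the same dogs/cats example; the variable name rounded is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "dogs": [0.21, 0.01, 0.66, 0.21],
    "cats": [0.32, 0.67, 0.03, 0.18],
})

# Round "dogs" to 1 decimal place and "cats" to 0 decimal places.
rounded = df.round({"dogs": 1, "cats": 0})
print(rounded)
```

Columns that are not named in the dict are left unchanged.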
Or we can use Python's built-in round in a loop, as below:
>>> round(5.76543, 2)
5.77
I have a dataframe whose columns contain values for different countries. I would like a function that shifts the rows of each country column independently, leaving the dates untouched. I have a list of profile shifters, one per country, which should be used to shift the rows.
If the profile shifter for a country is -3, that country's column is shifted down 3 times, and the last 3 values become the first 3 values in the dataframe. If a profile shifter is +3, the third value of a row is shifted upwards while the first 2 values become the last values in that column.
After the rows have been shifted, instead of the default NaN values appearing in the empty cells, I want the preceding or succeeding values to take up the empty cells. The function should also return a dataframe.
Sample Dataset:
Datetime ARG AUS BRA
1/1/2050 0.00 0.1 2.1 3.1
1/1/2050 1.00 0.2 2.2 3.2
1/1/2050 2.00 0.3 2.3 3.3
1/1/2050 3.00 0.4 2.4 3.4
1/1/2050 4.00 0.5 2.5 3.5
1/1/2050 5.00 0.6 2.6 3.6
Country Profile Shifters:
Country ARG AUS BRA
UTC -3 -2 4
Desired Output:
Datetime ARG AUS BRA
1/1/2050 0.00 0.3 2.4 3.4
1/1/2050 1.00 0.4 2.5 3.5
1/1/2050 2.00 0.5 2.1 3.1
1/1/2050 3.00 0.1 2.2 3.2
1/1/2050 4.00 0.2 2.3 3.3
This is what I have been trying for days now but it's not working
cols = df1.columns
for i in cols:
    if i == 'ARG':
        x = df1.iat[0:3,0]
        df1['ARG'] = df1.ARG.shift(periods=-3)
        df1['ARG'].replace(to_replace=np.nan, x)
    elif i == 'AUS':
        df1['AUS'] = df1.AUS.shift(periods=2)
    elif i == 'BRA':
        df1['BRA'] = df1.BRA.shift(periods=1)
    else:
        pass
This works but is far from being 'good pandas'. I hope that someone will come along and give a nicer, cleaner 'more pandas' answer.
Imports used:
import pandas as pd
import datetime as datetime
Offset data setup:
offsets = pd.DataFrame({"Country" : ["ARG", "AUS", "BRA"], "UTC Offset" : [-3, -2, 4]})
Produces:
Country UTC Offset
0 ARG -3
1 AUS -2
2 BRA 4
Note that the timezone offset data I've used here is in a slightly different structure from the example data (country codes by rows, rather than columns). Also worth pointing out that Australia and Brazil have several time zones, so there is no one single UTC offset which applies to those whole countries (only one in Argentina though).
Sample data setup:
# Build all rows first, then create the frame in one go
# (DataFrame.append was removed in pandas 2.0).
sampleDf = pd.DataFrame([{'Datetime' : datetime.datetime(2050,1,1,i),
                          'ARG' : i / 10,
                          'AUS' : (i + 10) / 10,
                          'BRA' : (i + 20) / 10}
                         for i in range(6)])
Produces:
Datetime ARG AUS BRA
0 2050-01-01 00:00:00 0.0 1.0 2.0
1 2050-01-01 01:00:00 0.1 1.1 2.1
2 2050-01-01 02:00:00 0.2 1.2 2.2
3 2050-01-01 03:00:00 0.3 1.3 2.3
4 2050-01-01 04:00:00 0.4 1.4 2.4
5 2050-01-01 05:00:00 0.5 1.5 2.5
Code to shift cells:
for idx, offsetData in offsets.iterrows(): # See note 1
    countryCode = offsetData["Country"]
    utcOffset = offsetData["UTC Offset"]
    dfRowCount = sampleDf.shape[0]
    wrappedOffset = (dfRowCount + utcOffset) if utcOffset < 0 else \
                    (-dfRowCount + utcOffset) # See note 2
    countryData = sampleDf[countryCode]
    sampleDf[countryCode] = pd.concat([countryData.shift(utcOffset).dropna(),
                                       countryData.shift(wrappedOffset).dropna()]).sort_index() # See note 3
Produces:
Datetime ARG AUS BRA
0 2050-01-01 00:00:00 0.3 1.2 2.2
1 2050-01-01 01:00:00 0.4 1.3 2.3
2 2050-01-01 02:00:00 0.5 1.4 2.4
3 2050-01-01 03:00:00 0.0 1.5 2.5
4 2050-01-01 04:00:00 0.1 1.0 2.0
5 2050-01-01 05:00:00 0.2 1.1 2.1
Notes
1. Iterating over rows in pandas like this (to me) indicates 'you've run out of pandas skill, and are kind of going against the design of pandas'. What I have here works, but it won't benefit from any/many of the efficiencies of using pandas, and would not be appropriate for a large dataset. Using itertuples rather than iterrows is supposed to be quicker, but I think neither is great, so I went with what seemed most readable for this case.
2. This solution does two shifts, one of the data shifted by the timezone offset, then a second shift of everything else to fill in what would otherwise be NaN holes left by the first shift. This line calculates the size of that second shift.
3. Finally, the results of the two shifts are concatenated together (after dropping any NaN values from both of them) and assigned back to the original (unshifted) column. sort_index puts them back in order based on the index, rather than having the two shifted parts one-after-another.
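A possibly cleaner sketch of the same wrap-around shift uses numpy.roll, which rotates an array and wraps the values that fall off one end to the other. It assumes the same sign convention as Series.shift in the code above (a negative offset moves values towards the top). The function name roll_columns and the shifts dict are my own illustration, not part of the question.

```python
import datetime

import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Datetime": [datetime.datetime(2050, 1, 1, h) for h in range(6)],
    "ARG": [i / 10 for i in range(6)],
    "AUS": [(i + 10) / 10 for i in range(6)],
    "BRA": [(i + 20) / 10 for i in range(6)],
})
shifts = {"ARG": -3, "AUS": -2, "BRA": 4}

def roll_columns(df, shifts):
    """Return a copy of df with each listed column rotated by its offset."""
    out = df.copy()
    for col, k in shifts.items():
        # np.roll wraps around, so no NaN holes are created.
        out[col] = np.roll(df[col].to_numpy(), k)
    return out

rolled = roll_columns(sample, shifts)
print(rolled)
```

This still loops, but only once per country column rather than doing two shifts and a concat, and the Datetime column is never touched.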
I'd like to get some % rates based on a .groupby() in pandas. My goal is to take an indicator column Ind and, for each year, get the rate of A (numerator) divided by the total (A+B) in that year.
Example Data:
import pandas as pd
import numpy as np
df: pd.DataFrame = pd.DataFrame([['2011','A',1,2,3], ['2011','B',4,5,6],['2012','A',15,20,4],['2012','B',17,12,12]], columns=["Year","Ind","X", "Y", "Z"])
print(df)
Year Ind X Y Z
0 2011 A 1 2 3
1 2011 B 4 5 6
2 2012 A 15 20 4
3 2012 B 17 12 12
Example for year 2011: XRate would be the sum of the A indicators for X (which would be 1) divided by the total (A+B), which would be 5, so I would get an XRate of 0.20.
I would like to do this for all columns X, Y, Z to get the rates. I've tried doing lambda applys but can't quite get the desired results.
Desired Results:
Year XRate YRate ZRate
0 2011 0.20 0.29 0.33
1 2012 0.47 0.63 0.25
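You can group the dataframe on Year and aggregate using sum, once over all rows and once over the A rows only, then divide one by the other (numeric_only=True keeps the text column Ind out of the sums):
s1 = df.groupby('Year').sum(numeric_only=True)
s2 = df.query("Ind == 'A'").groupby('Year').sum(numeric_only=True)
s2.div(s1).round(2).add_suffix('Rate')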
XRate YRate ZRate
Year
2011 0.20 0.29 0.33
2012 0.47 0.62 0.25
I have a dataframe, something like
name perc score
a 0.2 40
b 0.4 89
c 0.3 90
I want to have a total row where 'perc' has a mean aggregation and 'score' has a sum aggregation. The output should be like
name perc score
a 0.2 40
b 0.4 89
c 0.3 90
total 0.3 219
I want it as a dataframe output as I need to build plots using this. For now, I tried doing
df.loc['total'] = df.sum()
but this provides the sum for the percentage column as well, whereas I want an average for the percentage. How to do this in pandas?
Try this (it assumes name is the index, as in the output below):
df.loc['total'] = [df['perc'].mean(), df['score'].sum()]
Output:
perc score
name
a 0.20 40.0
b 0.40 89.0
c 0.30 90.0
total 0.30 219.0
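If there are more columns, listing each aggregation by hand gets tedious. A small sketch of the same idea using DataFrame.agg with a dict of column-to-function, again assuming name is the index; the variable name totals is only illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {"perc": [0.2, 0.4, 0.3], "score": [40, 89, 90]},
    index=pd.Index(["a", "b", "c"], name="name"),
)

# One aggregation per column: mean for perc, sum for score.
totals = df.agg({"perc": "mean", "score": "sum"})

# Assigning a Series to a new row label aligns it on the column names.
df.loc["total"] = totals
print(df)
```

Compute totals before adding the row, as done here, so the total row is not included in its own aggregation.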