Python: create an index from returns

I have a dataframe of portfolio returns:
date        Portfolio %
30/11/2001          4.8
31/12/2001         -0.7
31/01/2002          1.3
28/02/2002         -1.4
29/03/2002          3.3
I need to create an index of returns, but to do this I need a starting figure of 1.0, and the formula references the previous row. The output should look like this:
date        Portfolio %  Index
                    NaN  1.000
30/11/2001          4.8  1.048
31/12/2001         -0.7  1.040
31/01/2002          1.3  1.054
28/02/2002         -1.4  1.039
29/03/2002          3.3  1.073
As an example, the formula for the second result is:
1.048 * (1 + (-0.7 / 100))
I've tried the following code, but it doesn't get the required result.
portfolio['Index'] = portfolio['Portfolio %'] / portfolio['Portfolio %'].iloc[0]
The issues I have:
I can't get the starting variable.
I can't get the formula to reference the previous row.
I believe it is the same issue as this post: Create and index from returns PANDAS. However, it was never answered fully.

Use Series.div and Series.add along with Series.cumprod:
df['Index'] = df['Portfolio %'].div(100).add(1).cumprod()
Result:
# print(df)
date Portfolio % Index
0 30/11/2001 4.8 1.048000
1 31/12/2001 -0.7 1.040664
2 31/01/2002 1.3 1.054193
3 28/02/2002 -1.4 1.039434
4 29/03/2002 3.3 1.073735
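If you also want the leading 1.0 base row from the question's expected output, one option (my addition, not part of the original answer) is to prepend it after computing the cumulative product:

import numpy as np
import pandas as pd

# Hypothetical extra step: prepend a base row so the index starts at 1.0,
# matching the question's expected layout.
base = pd.DataFrame([{'date': None, 'Portfolio %': np.nan, 'Index': 1.0}])
df = pd.concat([base, df], ignore_index=True)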

Related

Shifting Row values Upwards/Downwards and Replacing Empty Cells with Preceding or Succeeding Values in Pandas DataFrame

I have a dataframe with columns containing values for different countries, and I would like a function that shifts the rows in this dataframe independently of the dates. For each country I have a profile shifter that determines how its rows are shifted.
If the profile shifter for a country is -3, that country's column is shifted downwards 3 times, and the last 3 values wrap around to become the first 3 values in the column. If the profile shifter is +3, the column is shifted upwards so the third value moves to the top, while the first values wrap around to become the last values in that column.
After the rows have been shifted, instead of having the default NaN value appear in the empty cells, I want the preceding or succeeding values to fill them. The function should also return a dataframe.
Sample Dataset:
Datetime       ARG  AUS  BRA
1/1/2050 0.00  0.1  2.1  3.1
1/1/2050 1.00  0.2  2.2  3.2
1/1/2050 2.00  0.3  2.3  3.3
1/1/2050 3.00  0.4  2.4  3.4
1/1/2050 4.00  0.5  2.5  3.5
1/1/2050 5.00  0.6  2.6  3.6
Country Profile Shifters:
Country ARG AUS BRA
UTC -3 -2 4
Desired Output:
Datetime       ARG  AUS  BRA
1/1/2050 0.00  0.3  2.4  3.4
1/1/2050 1.00  0.4  2.5  3.5
1/1/2050 2.00  0.5  2.1  3.1
1/1/2050 3.00  0.1  2.2  3.2
1/1/2050 4.00  0.2  2.3  3.3
This is what I have been trying for days now, but it's not working:
cols = df1.columns
for i in cols:
    if i == 'ARG':
        x = df1.iat[0:3, 0]
        df1['ARG'] = df1.ARG.shift(periods=-3)
        df1['ARG'].replace(to_replace=np.nan, x)
    elif i == 'AUS':
        df1['AUS'] = df1.AUS.shift(periods=2)
    elif i == 'BRA':
        df1['BRA'] = df1.BRA.shift(periods=1)
    else:
        pass
This works but is far from being 'good pandas'. I hope that someone will come along and give a nicer, cleaner 'more pandas' answer.
Imports used:
import pandas as pd
import datetime
Offset data setup:
offsets = pd.DataFrame({"Country" : ["ARG", "AUS", "BRA"], "UTC Offset" : [-3, -2, 4]})
Produces:
Country UTC Offset
0 ARG -3
1 AUS -2
2 BRA 4
Note that the timezone offset data I've used here is in a slightly different structure from the example data (country codes by rows, rather than columns). Also worth pointing out that Australia and Brazil have several time zones, so there is no one single UTC offset which applies to those whole countries (only one in Argentina though).
Sample data setup:
# (DataFrame.append was removed in pandas 2.0; building the frame from a
# list of dicts is the equivalent idiom.)
sampleDf = pd.DataFrame([{'Datetime': datetime.datetime(2050, 1, 1, i),
                          'ARG': i / 10,
                          'AUS': (i + 10) / 10,
                          'BRA': (i + 20) / 10}
                         for i in range(6)])
Produces:
Datetime ARG AUS BRA
0 2050-01-01 00:00:00 0.0 1.0 2.0
1 2050-01-01 01:00:00 0.1 1.1 2.1
2 2050-01-01 02:00:00 0.2 1.2 2.2
3 2050-01-01 03:00:00 0.3 1.3 2.3
4 2050-01-01 04:00:00 0.4 1.4 2.4
5 2050-01-01 05:00:00 0.5 1.5 2.5
Code to shift cells:
for idx, offsetData in offsets.iterrows():  # See note 1
    countryCode = offsetData["Country"]
    utcOffset = offsetData["UTC Offset"]
    dfRowCount = sampleDf.shape[0]
    wrappedOffset = (dfRowCount + utcOffset) if utcOffset < 0 else \
                    (-dfRowCount + utcOffset)  # See note 2
    countryData = sampleDf[countryCode]
    sampleDf[countryCode] = pd.concat([countryData.shift(utcOffset).dropna(),
                                       countryData.shift(wrappedOffset).dropna()]).sort_index()  # See note 3
Produces:
Datetime ARG AUS BRA
0 2050-01-01 00:00:00 0.0 1.4 2.4
1 2050-01-01 01:00:00 0.1 1.5 2.5
2 2050-01-01 02:00:00 0.2 1.0 2.0
3 2050-01-01 03:00:00 0.3 1.1 2.1
4 2050-01-01 04:00:00 0.4 1.2 2.2
5 2050-01-01 05:00:00 0.5 1.3 2.3
Notes
1. Iterating over rows in pandas like this (to me) indicates 'you've run out of pandas skill, and are kind of going against the design of pandas'. What I have here works, but it won't benefit from many of the efficiencies of using pandas, and would not be appropriate for a large dataset. Using itertuples rather than iterrows is supposed to be quicker, but I think neither is great, so I went with what seemed most readable for this case.
2. This solution does two shifts: one shift of the data by the timezone offset, then a second shift of everything else to fill in what would otherwise be NaN holes left by the first shift. This line calculates the size of that second shift.
3. Finally, the results of the two shifts are concatenated together (after dropping any NaN values from both of them) and assigned back to the original (unshifted) column. sort_index puts them back in order based on the index, rather than having the two shifted parts one-after-another.
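As a footnote (my own suggestion, not from the original answer): numpy.roll performs exactly this shift-with-wraparound in one call, which avoids the two-shift bookkeeping:

import numpy as np

# np.roll shifts values and wraps them around the end of the array,
# reproducing the shift-then-fill trick above in a single call.
for _, offsetData in offsets.iterrows():
    country = offsetData["Country"]
    sampleDf[country] = np.roll(sampleDf[country].to_numpy(),
                                offsetData["UTC Offset"])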

How to create new columns by looping through columns in different dataframes?

I have two pd.dataframes:
df1:
Year Replaced Not_replaced
2015 1.5 0.1
2016 1.6 0.3
2017 2.1 0.1
2018 2.6 0.5
df2:
Year HI LO RF
2015 3.2 2.9 3.0
2016 3.0 2.8 2.9
2017 2.7 2.5 2.6
2018 2.6 2.2 2.3
I need to create a third df3 by using the following equation:
df3['column1'] = df1['Replaced'] - df1['Not_replaced'] + df2['HI']
df3['column2'] = df1['Replaced'] - df1['Not_replaced'] + df2['LO']
df3['column3'] = df1['Replaced'] - df1['Not_replaced'] + df2['RF']
I can merge the two dataframes and manually create 3 new columns one by one, but I can't figure out how to use the loop function to create the results.
You can create an empty dataframe & fill it with values while looping
(Note: col_names & df3.columns must be of the same length)
df3 = pd.DataFrame(columns=['column1', 'column2', 'column3'])
col_names = ["HI", "LO", "RF"]
for incol, df3column in zip(col_names, df3.columns):
    df3[df3column] = df1['Replaced'] - df1['Not_replaced'] + df2[incol]
print(df3)
output
column1 column2 column3
0 4.6 4.3 4.4
1 4.3 4.1 4.2
2 4.7 4.5 4.6
3 4.7 4.3 4.4
For the for loop, I would first merge df1 and df2 to create a new df, called df3 (see the sketch after this code). Then I would create a list of the names of the columns you want to iterate through:
col_names = ["HI", "LO", "RF"]
for col in col_names:
    df3[f"column_{col}"] = df3['Replaced'] - df3['Not_replaced'] + df3[col]

Efficient way to randomly select all rows from pandas dataframe corresponding to a column value

I have a pandas dataframe containing about 2 Million rows which looks like the following example
ID V1 V2 V3 V4 V5
12 0.2 0.3 0.5 0.03 0.9
12 0.5 0.4 0.6 0.7 1.8
01 3.8 2.9 1.1 1.6 1.5
17 0.9 1.2 1.8 2.6 9.0
02 0.2 0.3 0.5 0.03 0.9
12 0.5 0.4 0.6 0.7 1.8
07 3.8 2.9 1.1 1.6 1.5
19 0.9 1.2 1.8 2.6 9.0
19 0.5 0.4 0.6 0.7 1.8
06 3.8 2.9 1.1 1.6 1.5
17 0.9 1.2 1.8 2.6 9.0
18 0.9 1.2 1.8 2.6 9.0
I want to create three subsets of this data such that the sets of ID values are mutually exclusive, and each subset includes all rows from the main dataframe corresponding to its IDs.
As of now, I am collecting the unique IDs as a list and randomly shuffling it. Using this list, I'm selecting all rows from the dataframe whose ID belongs to a fraction of the list.
import numpy as np
import random
distinct = list(set(df.ID.values))
random.shuffle(distinct)
X1, X2 = distinct[:1000000], distinct[1000000:2000000]
df_X1 = df.loc[df['ID'].isin(list(X1))]
df_X2 = df.loc[df['ID'].isin(list(X2))]
This works as expected for smaller data; however, for larger data the run doesn't complete even after many hours. Is there a more efficient way to do this? I'd appreciate any responses.
I think the slowdown is coming from the nested isin list inside the loc slice. I tried a different approach using numpy and a boolean index that seems to double the speed.
First, to set up the dataframe. I wasn't sure how many unique items you had, so I selected 50. I was also unsure how many columns and rows, so I arbitrarily selected 10,000 of each.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 10000))
ID = np.random.randint(0, 50, 10000)
df['ID'] = ID
Then I try to use mostly numpy arrays and avoid the nested list using a boolean index.
# Create a numpy array from the ID columns
a_ID = np.array(df['ID'])
# use the numpy unique method to get a unique array
# a = np.unique(np.array(df['ID']))
a = np.unique(a_ID)
# shuffle the unique array
np.random.seed(100)
np.random.shuffle(a)
# cut the shuffled array in half
X1 = a[0:25]
# create a boolean mask
mask = np.isin(a_ID, X1)
# set the index to the mask
df.index = mask
df.loc[True]
When I ran your code on my sample df, times were 817 ms, the code above runs at 445 ms.
Not sure if this helps. Good question, thanks.
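As a side note (my addition, not part of the original answer): the boolean mask can also index the frame directly, which avoids overwriting the index:

# mask is the np.isin result from above; ~mask selects the complement.
df_X1 = df[mask]
df_X2 = df[~mask]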

reorder pandas dataframe columns vertically

I'm a bit new to pandas and have some diabetic data that I would like to reorder.
I'd like to copy the data from column 'wakeup' through '23:00:00' and put it vertically under each other, so I would get a new dataframe column:
5.6
8.1
9.9
6.3
4.1
13.3
NAN
3.9
3.3
6.8
.....etc
I'm assuming the data is in a dataframe already. You can index the columns you want and then use melt as suggested. Without any parameters, melt will 'stack' all your data into one column of a new dataframe. There's another column created to identify the original column names, but you can drop that if needed.
df.loc[:, 'wakeup':'23:00:00'].melt()
   variable  value
0    wakeup    5.6
1    wakeup    8.1
2    wakeup    9.9
3    wakeup    6.3
4    wakeup    4.1
5    wakeup   13.3
6    wakeup    NaN
7  09:30:00    3.9
8  09:30:00    3.3
9  09:30:00    6.8
...
You mention you want this as another column, but there's no way to sensibly add it into your existing dataframe; the shape likely won't match either.
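If only the values are needed, a small variation (my addition) drops the 'variable' column in one step:

# Keep just the stacked values as a Series.
stacked = df.loc[:, 'wakeup':'23:00:00'].melt()['value']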
Solved it myself finally; it took me quite some time.
Notice that here the original data was in df1 and the result is in dfAllMeasurements:
dfAllMeasurements = df1.loc[:, 'weekday':'23:00:00']
# Note: to use both columns as the index, set_index needs a list:
# set_index(['weekday', 'ID']). As written, only 'weekday' is used.
temp = dfAllMeasurements.set_index('weekday', 'ID').stack(dropna=False)  # dropna=False keeps the NaNs
dfAllMeasurements = temp.reset_index(drop=False, level=0).reset_index()

Rolling Linear Fit with Python DataFrame

I want to perform a moving window linear fit to the columns in my dataframe.
n =5
df = pd.DataFrame(index=pd.date_range('1/1/2000', periods=n))
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
B A
2000-01-01 1.9 3.2
2000-01-02 2.3 1.3
2000-01-03 4.4 5.6
2000-01-04 5.6 9.4
2000-01-05 7.3 10.4
For, say, column B, I want to perform a linear fit using the first two rows, then another linear fit using the second and third rows, and so on. And the same for column A. I am only interested in the slope of the fit, so at the end I want a new dataframe with the entries above replaced by the different rolling slopes.
After doing
df.reset_index()
I try something like
model = pd.ols(y=df['A'], x=df['index'], window_type='rolling',window=3)
But I get
KeyError: 'index'
EDIT:
I added a new column
df['i'] = range(0, len(df))
and I can now run
pd.ols(y=df['A'], x=df.i, window_type='rolling', window=3)
(it gives an error for window=2)
I am not understanding this well, because I was expecting a string of numbers but I get just one result:
-------------------------Summary of Regression Analysis---------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 3
Number of Degrees of Freedom: 2
R-squared: 0.8981
Adj R-squared: 0.7963
Rmse: 1.1431
F-stat (1, 1): 8.8163, p-value: 0.2068
Degrees of Freedom: model 1, resid 1
-----------------------Summary of Estimated Coefficients--------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 2.4000 0.8083 2.97 0.2068 0.8158 3.9842
intercept 1.2667 2.5131 0.50 0.7028 -3.6590 6.1923
---------------------------------End of Summary---------------------------------
EDIT 2:
Now I understand better what is going on. I can access the different values of the fits using
model.beta
I haven't tried it out, but I don't think you need to specify window_type='rolling': if you specify window, window_type will automatically be set to rolling.
Source.
I have problems doing this with the DatetimeIndex you created with pd.date_range, and find datetimes a confusing pain to work with in general due to the number of types out there and apparent incompatibility between APIs. Here's how I would do it if the date were an integer (e.g. days since 12/31/99, or years) or float in your example. It won't help your datetime problem, but hopefully it helps with the rolling linear fit part.
Generating your date with integers instead of datetimes:
df = pd.DataFrame()
df['date'] = range(1,6)
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
date B A
0 1 1.9 3.2
1 2 2.3 1.3
2 3 4.4 5.6
3 4 5.6 9.4
4 5 7.3 10.4
Since you want to group by 2 dates every time, then fit a linear model on each group, let's duplicate the records and number each group with the index:
df_dbl = pd.concat([df, df]).sort_index()
df_dbl = df_dbl.iloc[1:-1]  # removes the first and last row
date B A
0 1 1.9 3.2 # this record is removed
0 1 1.9 3.2
1 2 2.3 1.3
1 2 2.3 1.3
2 3 4.4 5.6
2 3 4.4 5.6
3 4 5.6 9.4
3 4 5.6 9.4
4 5 7.3 10.4
4 5 7.3 10.4 # this record is removed
c = df_dbl.index[1:len(df_dbl.index)].tolist()
c.append(max(df_dbl.index))
df_dbl.index = c
date B A
1 1 1.9 3.2
1 2 2.3 1.3
2 2 2.3 1.3
2 3 4.4 5.6
3 3 4.4 5.6
3 4 5.6 9.4
4 4 5.6 9.4
4 5 7.3 10.4
Now it's ready to group by index to run linear models on B vs. date, which I learned from Using Pandas groupby to calculate many slopes. I use scipy.stats.linregress since I got weird results with pd.ols and couldn't find good documentation to understand why (perhaps because it's geared toward datetime).
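The groupby code itself isn't shown above; here is a reconstruction (my sketch, using scipy.stats.linregress as named in the text) that reproduces the slopes below:

from scipy.stats import linregress

# Slope of B against date within each duplicated-row pair.
slopes = df_dbl.groupby(df_dbl.index).apply(
    lambda g: linregress(g['date'], g['B']).slope)
print(slopes)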
1 0.4
2 2.1
3 1.2
4 1.7
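For what it's worth, pd.ols has long since been removed from pandas; on current versions a rolling slope can be computed with rolling().apply plus numpy.polyfit. This is my own sketch, not code from the original answers:

import numpy as np

# Fit each window of 2 values against positions 0..n-1; polyfit's first
# coefficient is the slope. The first entry is NaN (incomplete window).
slopes = df['B'].rolling(2).apply(
    lambda y: np.polyfit(np.arange(len(y)), y, 1)[0], raw=True)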
