Scenario: I am trying to do multiple portfolio optimizations, with different constraints (weights, risk, risk aversion...) in a multi-period scenario.
What I already did: From the cvxpy examples I found how to optimize a portfolio under a quadratic (non-linear) objective, which results in a list of weights for the assets in the portfolio composition. My problem is that, although I have 15 years of monthly data, I don't know how to optimize for different periods (the code, in its current form, yields the best composition for the entire time span of my data).
Question 1: Is it possible to make the code optimize for different periods, such as 1, 3, 4, 6, 9 and 12 months (in that case, yielding different weights for each of those periods)? If so, how could one do that?
Question 2: Is it possible to restrict the number of assets in each portfolio composition, and what is the best way to achieve that? (The current code uses all of them, but I would like to test what happens when the number of assets is limited, in order to control the turnover level.)
Code:
from cvxpy import *
from cvxopt import *
import pandas as pd
import numpy as np
prices = pd.DataFrame()
logret = pd.DataFrame()
normret = pd.DataFrame()
returns = pd.DataFrame()
prices = pd.read_excel(open('//folder//Dgms89//calculation v3.xlsx', 'rb'), sheetname='Prices Final')
logret = pd.read_excel(open('//folder//Dgms89//calculation v3.xlsx', 'rb'), sheetname='Returns log')
normret = pd.read_excel(open('//folder//Dgms89//calculation v3.xlsx', 'rb'), sheetname='Returns normal')
returns = normret
def calculate_portfolio(returns, selected_solver):
    # Minimum-variance portfolio: minimize w' * Sigma * w with weights summing to 1
    cov_mat = returns.cov()
    Sigma = np.asarray(cov_mat.values)
    w = Variable(len(cov_mat))
    gamma = quad_form(w, Sigma)
    prob = Problem(Minimize(gamma), [sum_entries(w) == 1])
    prob.solve(solver=selected_solver)
    weights = []
    for weight in w.value:
        weights.append(float(weight[0]))
    return weights
The problem with multi-period optimization is that your model will be overfitted. On the other hand, you can backtest traditional portfolio optimization models assuming a rebalancing period. Riskfolio-Lib has an example using backtrader that compares the S&P 500 with different portfolios using quarterly rebalancing. You can check the example at this link: https://riskfolio-lib.readthedocs.io/en/latest/examples.html
The standard mean-variance portfolio model is a static model; there are no dynamics in it (the time series are only used to estimate the variance-covariance matrix and the expected return). Some related models can answer questions such as when and how to rebalance.
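That said, the simplest reading of Question 1 is to re-estimate the model on different trailing windows of the monthly returns and call the question's calculate_portfolio once per window. A minimal sketch under that assumption (the solver constant is just a placeholder for whatever you already pass):
# One set of weights per look-back window, reusing calculate_portfolio from the question.
# Note: a 1-month window contains a single monthly observation, so its covariance
# matrix is degenerate; use daily data or a longer minimum window for such short periods.
lookbacks = [3, 4, 6, 9, 12]  # months of trailing data to estimate from
weights_by_period = {}
for months in lookbacks:
    window = returns.iloc[-months:]  # last `months` monthly observations
    weights_by_period[months] = calculate_portfolio(window, selected_solver=ECOS)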
Restricting the number of assets in a portfolio leads to what is called the cardinality-constrained portfolio problem. This becomes basically an MIQP (Mixed-Integer Quadratic Programming problem).
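A minimal sketch of that MIQP formulation in modern cvxpy, purely as an illustration (it assumes a long-only portfolio and a mixed-integer-capable solver such as SCIP, GUROBI or CPLEX being installed):
import cvxpy as cp
import numpy as np

def min_variance_max_k_assets(Sigma, k, solver="SCIP"):
    # Long-only minimum-variance portfolio holding at most k assets (MIQP).
    n = Sigma.shape[0]
    w = cp.Variable(n)                # portfolio weights
    y = cp.Variable(n, boolean=True)  # 1 if the asset is held, 0 otherwise
    constraints = [
        cp.sum(w) == 1,  # fully invested
        w >= 0,          # long-only, so every weight is at most 1 ...
        w <= y,          # ... which makes y an on/off switch per asset
        cp.sum(y) <= k,  # cardinality constraint: at most k assets
    ]
    prob = cp.Problem(cp.Minimize(cp.quad_form(w, Sigma)), constraints)
    prob.solve(solver=solver)
    return w.value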
I am trying to figure out what type of machine learning model to build for the following case:
I want to be able to predict whether a customer will deposit in the next 30 days, using past data:
# Data
import pandas as pd
client_id = [1 , 1, 1, 2, 2, 3]
deposit_amount = [10, 20, 30, 15, 45, 55]
deposit_date = ["2022-01-05", "2022-01-06", "2022-01-07", "2022-01-05", "2022-01-06", "2022-01-06"]
dat = pd.DataFrame([client_id, deposit_amount, deposit_date]).T
dat.columns = ["client_id", "deposit_amount", "deposit_date"]
dat
# Use ML algorithm
---
# Output of the ML Algorithm
model_prediction = pd.DataFrame([[1, 2, 3],[0, 1, 0]]).T
model_prediction.columns = ["client_id", "target"]
As you can see, there are multiple rows per customer, one for each time they deposit. Which machine learning algorithm do you suggest? Since I am trying to predict something time-related, is there a model that can deal with time-based data and output a binary column?
You can use whatever you want as a classification algorithm (SVC, RandomForest, LogisticRegression, etc.). The most important thing is to prepare your data correctly.
For example, the date variable is not useful as such, but the time delta between two deposits is interesting for you:
dat['deposit_date'] = pd.to_datetime(dat['deposit_date'])
dat['deposit_date'] = dat.groupby('client_id')['deposit_date'].diff().dt.days.fillna(0)
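From there, one possible way (an assumption on my part, not part of the answer) to turn the per-deposit rows into one feature row per client, which is what a classifier expects:
# Aggregate the per-deposit rows into one feature row per client.
# (The DataFrame built above has object-dtype columns, hence the cast.)
dat['deposit_amount'] = dat['deposit_amount'].astype(float)
features = dat.groupby('client_id').agg(
    n_deposits=('deposit_amount', 'count'),
    mean_amount=('deposit_amount', 'mean'),
    mean_gap_days=('deposit_date', 'mean'),  # deposit_date now holds the day gaps
)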
Once you have prepared your data, it is important to normalize/standardize it so that variables with large values do not dominate the others.
You can use several algorithms and compare them. You can also try different hyperparameter settings for each model (watch out for overfitting).
The second most important thing is choosing the metric to evaluate your model. As you have a classification problem (more specifically, binary classification), you should use metrics like F1-score, precision and recall.
As you can see, the choice of the algorithm is not very important compared to your analysis of the problem :-)
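A minimal end-to-end sketch of those steps (scaling, fitting two candidate classifiers, comparing them on classification metrics); the feature matrix and the 30-day deposit target below are synthetic placeholders, not something derived from the answer:
# Standardize features, fit two classifiers, compare them on F1/precision/recall.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # e.g. mean gap, deposit count, mean amount per client
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)  # placeholder 30-day target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

for model in (LogisticRegression(), RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(type(model).__name__,
          "F1:", round(f1_score(y_test, pred), 3),
          "precision:", round(precision_score(y_test, pred), 3),
          "recall:", round(recall_score(y_test, pred), 3))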
I have gotten the assignment to analyze a dataset of 1,000+ houses, build a multiple regression model to predict prices, and then select the three houses that are the cheapest compared to their predicted price. Other than selecting exactly three houses, there is also the constraint of a budget of 7,000,000 in total for purchasing the three houses.
I have gotten as far as developing the regression model, calculating the predicted prices and the residuals, and adding them to the original dataset. I am, however, completely stumped as to how to write code that selects the three houses, given the budget constraint, while optimizing for the highest combined residual.
Here is my code so far:
### modules
import pandas as pd
import statsmodels.api as sm
### Data import
df = pd.DataFrame({"Zip Code" : [94127, 94110, 94112, 94114],
"Days listed" : [38, 40, 40, 40],
"Price" : [633000, 1100000, 440000, 1345000],
"Bedrooms" : [0, 3, 0, 0],
"Loft" : [1, 0, 1, 1],
"Square feet" : [1124, 2396, 625, 3384],
"Lotsize" : [2500, 1750, 2495, 2474],
"Year" : [1924, 1900, 1923, 1907]})
### Creating LM
y = df["Price"] # dependent variable
x = df[["Zip Code", "Days listed", "Bedrooms", "Loft", "Square feet", "Lotsize", "Year"]]
x = sm.add_constant(x) # adds a constant
lm = sm.OLS(y,x).fit() # fitting the model
# predict house prices
prices = sm.add_constant(x)
### Summary
#print(lm.summary())
### Adding predicted values and residual values to df
df["predicted"] = lm.predict(prices) # predicted values
df["residual"] = df["Price"] - df["predicted"] # residual values
If anyone has an idea, could you explain to me the steps and give a code example?
Thank you very much!
With the clarification that you are looking for the best combination, your problem is more complicated ;)
I have tried a brute-force approach, but at least my laptop takes forever with the full dataset. Find my thoughts below:
Obviously we have to evaluate combinations of many houses, so my first approach was to reduce the dataset as far as possible.
If Price + 2*min(Price) > budget, there is no combination of that house with two others that stays within the budget.
If the residual is negative, we will not consider the house during optimization.
In pandas this looks like this:
budget = 7000000
df = df[df['Price'] < (budget - 2 * df['Price'].min())].copy()
df = df[df['residual'] > 0].copy()
This reduces the objects from 1395 to 550.
Unfortunately, 550 IDs still give many combinations (27,578,100), as calculated with itertools:
import itertools
idx=[a for a in itertools.combinations(df.index,3)]
You can evaluate these combinations with:
result = {comb: df.loc[[*comb], 'residual'].sum() for comb in idx[:10000] if df.loc[[*comb], 'Price'].sum() < budget}
Note: I have limited the evaluation to the first 10,000 combinations because of the calculation time.
best = max(result, key=result.get)  # combination with the highest combined residual
print("Combination: {}\nTotal price: {}\nCombined residual: {}".format(best, df.loc[[*best], 'Price'].sum(), result[best]))
Maybe it is advisable to calculate combinations of just two objects first to further reduce the number of possible combinations. I also think you should have a look at the knapsack problem.
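If the brute force stays too slow, a hedged alternative (my illustration, not part of the answer) is to treat the selection as a small integer program, essentially the knapsack-style formulation mentioned above. This assumes SciPy >= 1.9 for scipy.optimize.milp and the residual column computed earlier:
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

residuals = df['residual'].to_numpy()
house_prices = df['Price'].to_numpy()
n = len(df)

constraints = [
    LinearConstraint(house_prices, -np.inf, budget),  # total price within budget
    LinearConstraint(np.ones(n), 3, 3),               # pick exactly three houses
]
# maximize the combined residual == minimize its negative
res = milp(c=-residuals, constraints=constraints,
           integrality=np.ones(n), bounds=Bounds(0, 1))
chosen = df.index[res.x > 0.5]
print(df.loc[chosen, ['Price', 'residual']])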
I think you are almost there. Given that df["residual"] holds the difference between the real and the predicted price, you have to select the subset that fits your limit, e.g.
df_budget = df[df['Price'] <= budget].copy()
Using pandas nlargest() you can then retrieve the three biggest residuals:
df_budget.nlargest(3, 'residual')
Note: Code was not tested due to missing sample data
I asked the same question here but have not received an answer for a week now, probably because it's more of a coding question than a statistics question.
I want to perform a mediation analysis with a fixed effects model as the base model in Python. I know that you can perform mediation analysis using statsmodels' Mediation module, but fixed effects models (as far as I know) are only possible with linearmodels. As I haven't performed a mediation analysis before, I'm not too sure how to combine those two modules.
I've tried the following (step 3 onward is the relevant part for my question):
import pandas as pd
from linearmodels import PanelOLS
import statsmodels.api as sm2
from statsmodels.stats.mediation import Mediation
##1. direct effect: X --> Y
DV_LF = df.Y
IV_X = sm2.add_constant(df[['X', 'Control']])
fe_mod_X = PanelOLS(DV_LF, IV_X, entity_effects=True )
fe_res_X = fe_mod_X.fit(cov_type='clustered', cluster_entity=True)
print(fe_res_X)
##2. X --> M
DV_A = df.M
IV_A = sm2.add_constant(df[['X', 'Control']])
fe_mod_A = PanelOLS(DV_A, IV_A, entity_effects=True )
fe_res_A = fe_mod_A.fit(cov_type='clustered', cluster_entity=True)
print(fe_res_A)
##3. M --> Y
IV_M = sm2.add_constant(df[['M', 'Control']])
fe_mod_M = PanelOLS(DV_LF, IV_M, entity_effects=True )
fe_res_M = fe_mod_M.fit(cov_type='clustered', cluster_entity=True)
print(fe_res_M)
##4. X, M --> Y
IV_T = sm2.add_constant(df[['X', 'M', 'Control']])
fe_mod_T = PanelOLS(DV_LF, IV_T, entity_effects=True )
fe_res_T = fe_mod_T.fit(cov_type='clustered', cluster_entity=True)
print(fe_res_T)
med = Mediation(fe_res_T, fe_res_A, 'X', 'M').fit()
med.summary()
but I'm running into an error:
AttributeError: 'PanelEffectsResults' object has no attribute 'exog'
Does my approach make sense? If so how do I have to change the code to make PanelOLS and Mediation work together?
EDIT:
Would it be valid to calculate the indirect effect by multiplying the coefficient of X on M (step 2) with the coefficient of M on Y from step 4, and the direct effect by subtracting the indirect effect from the coefficient of X on Y (step 1)? I'm a little confused by that research poster, which says that normal mediation analyses are not valid for within-subject designs such as fixed effects. If my approach is valid, how would I calculate the significance of the difference between the direct and the total effect? I read that bootstrapping would be a good choice, but I am unsure how to apply it in my case.
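For reference, the coefficient arithmetic described in this edit would look roughly like the sketch below. It only shows the mechanics, not whether the decomposition is statistically valid in a fixed-effects setting, and it assumes the fitted results fe_res_X, fe_res_A and fe_res_T from the code above:
# Product-of-coefficients mediation arithmetic (mechanics only).
a = fe_res_A.params['X']  # step 2: effect of X on M
b = fe_res_T.params['M']  # step 4: effect of M on Y, controlling for X
c = fe_res_X.params['X']  # step 1: total effect of X on Y
indirect = a * b          # indirect effect via M
direct = c - indirect     # direct effect as described in the edit
# A bootstrap for the uncertainty would resample entities with replacement,
# refit the step-2 and step-4 models on each resample, and collect a * b.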
I am working on a simple simulation for a stock of products. Specifically, I want to calculate the expected shortage of products for various service levels. For example, if I assume that the demand for a product follows a normal distribution with a mean of 100 and a standard deviation of 20, then for a service level of 90% it would be necessary to have 125.63 units of the product on stock, and I would still expect a shortage of 0.9469 units.
My current approach to the problem is the following:
# Import libraries
import pandas as pd
import numpy as np
from scipy.stats import norm
# Create an exemplary dataset
idx = pd.Index(range(0, 1000), name='productid')
df = pd.DataFrame({'loc': np.random.normal(100, 30, 1000),
'scale': np.random.normal(20, 5, 1000)}, index=idx)
# Calculate quantile
df['quantile'] = norm.ppf(0.9, df['loc'], df['scale'])
# Calculate expected shortage
df['shortage'] = df.apply(lambda row: norm(row['loc'], row['scale'])\
.expect(lambda x: x-row['quantile'], lb=row['quantile']), axis=1)
The code actually works quite well, but there is a performance problem: calculating the expected shortage takes around 15 seconds for 1,000 products. In the real dataset I have 10,000 products, and I need to repeat the operation around 100 times, as I want to run the simulation for various service levels.
So if anyone knows a more efficient alternative to the scipy.stats expect function, or knows how to boost performance by tweaking the existing code, I would be really happy.
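One hedged way to avoid the per-row expect() call entirely is the closed-form normal loss function E[(D - Q)+] = sigma * (phi(z) - z * (1 - Phi(z))) with z = (Q - mu) / sigma, which can be evaluated vectorized for all products at once; for the example above it gives 20 * (phi(1.2816) - 1.2816 * 0.1) ≈ 0.9469, matching the expect() result:
# Vectorized expected shortage via the standard normal loss function,
# replacing the per-row scipy expect() call.
z = (df['quantile'] - df['loc']) / df['scale']  # standardized reorder point
df['shortage_fast'] = df['scale'] * (norm.pdf(z) - z * (1 - norm.cdf(z)))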
I am following the basic examples found here to simulate a simple system's energy generation in 15-minute intervals.
I would like to know, however, how I can introduce losses into the system following the same basic example. That is, with the following code:
import pandas as pd
import matplotlib.pyplot as plt
import pvlib
from pvlib.pvsystem import PVSystem
from pvlib.location import Location
from pvlib.modelchain import basic_chain, ModelChain
#%%
naive_times = pd.DatetimeIndex(start='01-30-2017', end='08-02-2017', freq='15min')
coordinates = [(52, 4, 'Amsterdam', 10, 'Etc/GMT-1')]
sandia_modules = pvlib.pvsystem.retrieve_sam('SandiaMod')
sapm_inverters = pvlib.pvsystem.retrieve_sam('cecinverter')
module = sandia_modules['Hanwha_HSL60P6_PA_4_250T__2013_']
inverter = sapm_inverters['ABB__PVI_10_0_I_OUTD_x_US_208_y_208V__CEC_2011_']
temp_air = 20
wind_speed = 0
system = PVSystem(surface_tilt = 13, surface_azimuth = 270, module_parameters = module, modules_per_string = 20, strings_per_inverter = 2, inverter_parameters = inverter)
for latitude, longitude, name, altitude, timezone in coordinates:
    location = Location(latitude, longitude, name=name, altitude=altitude, tz=timezone)
    mc = ModelChain(system, location, orientation_strategy=None)
    mc.run_model(naive_times.tz_localize(timezone))
    ac = mc.ac
    energy = ac * 0.001 * 0.25  # kWh per 15-minute interval
    plt.figure()
    energy.plot()
I get the plot shown in the first image. What I would like to have is something similar to the second image, obtained from real measurements, and shown in detail in the third image. As you can see there, there are a lot of losses from shading, DC losses, etc.
My question now is how to proceed from my code sample to achieve a plot similar to the ones in images 2 and 3?
Thanks in advance!
Your question is about DC losses and shading, but the biggest difference between your current ModelChain and the real system is the weather, particularly the irradiance: in the measurements no two days in a row are identical, which is due to changing cloud cover rather than to static losses.
The ModelChain example on Read the Docs (https://pvlib-python.readthedocs.io/en/latest/modelchain.html) shows weather data being applied at step 4, and further on, in "Demystifying ModelChain Internals", it defines the weather input. Unfortunately, it does not work with POA (plane-of-array) irradiance, which is the most common type measured on-site. In principle ghi and dhi can be estimated from POA, but apparently there are no functions implemented for that.
weather : None or DataFrame, default None
If None, assumes air temperature is 20 C, wind speed is 0
m/s and irradiation calculated from clear sky data. Column
names must be 'wind_speed', 'temp_air', 'dni', 'ghi', 'dhi'.
Do not pass incomplete irradiation data. Use method
:py:meth:`~pvlib.modelchain.ModelChain.complete_irradiance`
instead.
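Based on that docstring, a minimal sketch of feeding your own (e.g. measured) irradiance into the run from the question could look like this. measured_ghi, measured_dni and measured_dhi are hypothetical series aligned with the simulation times, and run_model(times, weather=...) is the older ModelChain API used in the question:
# Sketch: replace the clear-sky assumption with user-supplied weather data.
times = naive_times.tz_localize(timezone)
weather = pd.DataFrame({
    'ghi': measured_ghi,   # global horizontal irradiance, W/m^2 (placeholder series)
    'dni': measured_dni,   # direct normal irradiance, W/m^2
    'dhi': measured_dhi,   # diffuse horizontal irradiance, W/m^2
    'temp_air': 20,        # constants used here only for illustration
    'wind_speed': 0,
}, index=times)

mc.run_model(times, weather=weather)
energy = mc.ac * 0.001 * 0.25  # kWh per 15-minute interval, as before
energy.plot()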
The Read the Docs page does provide some information about adding different kinds of DC losses, mostly through specific physical models (AOI or spectral losses). Unfortunately, shading is complex, as it depends on the system and its surroundings, and no one has created a shading-loss module.