More efficient alternative to scipy.stats expect function - python

I am working on a simple simulation for a stock of products. Specifically, I want to calculate the expected shortage of products for various service levels. For example, if I assume that the demand for a product follows a normal distribution with a mean of 100 and a standard deviation of 20, then for a service level of 90% it would be necessary to have 125.63 units of the product in stock, and I would still expect a shortage of 0.9469 units.
My current approach to the problem is the following:
# Import libraries
import pandas as pd
import numpy as np
from scipy.stats import norm
# Create an exemplary dataset
idx = pd.Index(range(0, 1000), name='productid')
df = pd.DataFrame({'loc': np.random.normal(100, 30, 1000),
                   'scale': np.random.normal(20, 5, 1000)}, index=idx)
# Calculate quantile
df['quantile'] = norm.ppf(0.9, df['loc'], df['scale'])
# Calculate expected shortage
df['shortage'] = df.apply(lambda row: norm(row['loc'], row['scale'])
                          .expect(lambda x: x - row['quantile'], lb=row['quantile']), axis=1)
The code actually works quite well, but there is a performance problem: calculating the expected shortage takes around 15 seconds for 1000 products. The real dataset contains 10000 products, and I need to repeat the operation around 100 times, since I want to run the simulation for various service levels.
So if anyone knows a more efficient alternative to the scipy.stats expect function, or knows how to boost performance by tweaking the existing code, I would be really happy.
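One possible direction (a sketch added here, not from the original thread): for a normal distribution the expected shortage above a stock level q has a closed form, E[(X - q)+] = sigma * (pdf(z) - z * sf(z)) with z = (q - mu) / sigma, where pdf and sf are the standard normal density and survival function. The whole column can therefore be computed with vectorized norm.pdf and norm.sf calls instead of one numerical integration per row:
# Vectorized closed-form expected shortage (normal loss function).
# Assumes the same df with 'loc', 'scale' and 'quantile' columns as above.
z = (df['quantile'] - df['loc']) / df['scale']
df['shortage_fast'] = df['scale'] * (norm.pdf(z) - z * norm.sf(z))
For the example in the question (mean 100, standard deviation 20, 90% service level) this should reproduce the 0.9469 figure, while avoiding the per-row .expect() calls entirely.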

Related

Expected mean of correlated data in python

I have successfully generated three correlated random variables with Cholesky. I use the same mean (10) and the same standard deviation (5) for all of them. However, when I calculate the mean of the correlated variables, I get some unpleasant results and I can't seem to find where exactly the problem is. Here is a working example:
import numpy as np
import pandas as pd
corr = np.array([[1, 0.7, 0.7], [0.7, 1, 0.7], [0.7, 0.7, 1]])
chol = np.linalg.cholesky(corr)
N=1000
rand_data = np.random.normal(10, 5, size=(3,N))
# generate uncorrelated data
uncorrelated_data = pd.DataFrame(rand_data, index=['A','B','C']).T/100
uncorrelated_data.corr() # shows barely any correlation as it should
uncorrelated_data.mean()*100 # shows each mean around 10
Output:
A 10.308595
B 9.931958
C 10.165347
Generating correlation among them
x = np.dot(chol, rand_data) # cholesky
correlated_data = pd.DataFrame(x, index=['A','B','C']).T/100
print(correlated_data.corr()) # shows there are correlations among the variables
correlated_data.mean()*100 # the means keep increasing across the variables
Output:
A 10.308595
B 14.308853
C 16.752117
The means of the uncorrelated variables were as expected, but the means of the correlated variables keep increasing from the first variable to the last. My expectation was that each mean would stay around the actual mean of 10. Could anyone help me figure out the problem or suggest an alternative solution?
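A likely explanation (an observation added here, not part of the original thread): multiplying data that is not centered at zero by the Cholesky factor also mixes the means, so the mean of each output row is roughly 10 times the sum of the corresponding row of chol (about 10, 14.1 and 16.5 here, which matches the output). A sketch of the usual fix is to demean the data before applying the Cholesky factor and add the mean back afterwards:
# Demean, correlate, then restore the mean so each variable keeps mean 10.
centered = rand_data - rand_data.mean(axis=1, keepdims=True)
x = np.dot(chol, centered) + 10
correlated_data = pd.DataFrame(x, index=['A', 'B', 'C']).T / 100
print(correlated_data.mean() * 100)  # each mean should again be around 10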

Python data science: How to select three houses in dataset with budget constraint, optimizing for highest residual between predicted and actual price

I have been given an assignment to analyze a dataset of 1,000+ houses, build a multiple regression model to predict prices, and then select the three houses that are cheapest relative to their predicted price. Other than selecting exactly three houses, there is also the constraint of a budget of 7,000,000 total for purchasing the three houses.
I have gotten as far as developing the regression model, calculating the predicted prices and the residuals, and adding them to the original dataset. I am, however, completely stumped as to how to write code that selects the three houses, given the budget constraint, while optimizing for the highest combined residual.
Here is my code so far:
### modules
import pandas as pd
import statsmodels.api as sm
### Data import
df = pd.DataFrame({"Zip Code": [94127, 94110, 94112, 94114],
                   "Days listed": [38, 40, 40, 40],
                   "Price": [633000, 1100000, 440000, 1345000],
                   "Bedrooms": [0, 3, 0, 0],
                   "Loft": [1, 0, 1, 1],
                   "Square feet": [1124, 2396, 625, 3384],
                   "Lotsize": [2500, 1750, 2495, 2474],
                   "Year": [1924, 1900, 1923, 1907]})
### Creating LM
y = df["Price"] # dependent variable
x = df[["Zip Code", "Days listed", "Bedrooms", "Loft", "Square feet", "Lotsize", "Year"]]
x = sm.add_constant(x) # adds a constant
lm = sm.OLS(y,x).fit() # fitting the model
# predict house prices (x already includes the constant)
prices = x
### Summary
#print(lm.summary())
### Adding predicted values and residual values to df
df["predicted"] = lm.predict(prices)  # predicted values
df["residual"] = df["Price"] - df["predicted"]  # residual values
If anyone has an idea, could you explain to me the steps and give a code example?
Thank you very much!
With the clarification that you are looking for the best combination, your problem is more complicated ;)
I have tried a brute-force approach, but at least my laptop takes forever with the full dataset. Find my thoughts below:
Obviously we have to evaluate combinations of many houses, so my first approach was to reduce the dataset as far as possible:
If Price + 2*min(Price) > budget, the house cannot appear in any combination of three that stays within the budget
If the residual is negative, we will not consider the house during optimization
In pandas this looks as follows:
budget=7000000
df = df[df['Price'] < (budget - 2*df['Price'].min())].copy()
df = df[df['residual'] > 0].copy()
This reduces the number of objects from 1395 to 550.
Unfortunately, 550 IDs still yield many combinations (27,578,100), as calculated with itertools:
import itertools
idx = list(itertools.combinations(df.index, 3))
You can evaluate these combinations by
result = {comb: df.loc[[*comb], 'residual'].sum() for comb in idx[:10000] if df.loc[[*comb], 'Price'].sum() < budget}
Note: I have limited the evaluation to the first 10000 combinations due to the calculation duration.
best = max(result, key=result.get)
print("Combination: {}\nPrice: {}\nCombined residual: {}".format(best, df.loc[[*best], 'Price'].sum(), result[best]))
Maybe it is advisable to calculate the combinations of just two objects first to further reduce the number of candidates. I think you should also have a look at the knapsack problem; an exact formulation is sketched below.
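For reference, a hedged sketch (added here, not part of the original answer) of the exact selection as a small integer program, using PuLP as one possible solver; it assumes the df, budget and residual column defined above:
import pulp
# One binary decision variable per house: 1 if the house is bought.
prob = pulp.LpProblem("house_selection", pulp.LpMaximize)
buy = pulp.LpVariable.dicts("buy", df.index, cat="Binary")
# Objective: maximize the combined residual of the selected houses.
prob += pulp.lpSum(df.loc[i, "residual"] * buy[i] for i in df.index)
# Constraints: exactly three houses, total price within the budget.
prob += pulp.lpSum(buy[i] for i in df.index) == 3
prob += pulp.lpSum(df.loc[i, "Price"] * buy[i] for i in df.index) <= budget
prob.solve()
chosen = [i for i in df.index if buy[i].value() == 1]
A MIP solver should handle the full 1395-house instance quickly, since this is a tiny knapsack-style problem.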
I think you are almost there. Given that df["residual"] holds the difference between predicted and actual price, you have to select the subset that fits your limit, e.g.
df_budget = df[df['Price'] <= budget].copy()
Using pandas nlargest() you can retrieve the three biggest residuals:
df_budget.nlargest(3, 'residual')
Note: Code was not tested due to missing sample data

Confidence Interval for Sample Mean in Python (Different from Manual)

I'm trying to create some material for an introductory statistics seminar. The code below computes a 95% confidence interval for estimating the mean, but the result is not the same as the one from the Python library implementation. Is there something wrong with my math / code? Thanks.
EDIT:
Data was sampled from here
import pandas as pd
import numpy as np
x = np.random.normal(60000,15000,200)
income = pd.DataFrame()
income['Data Scientist'] = x
# Manual Implementation
sample_mean = income['Data Scientist'].mean()
sample_std = income['Data Scientist'].std()
standard_error = sample_std / (np.sqrt(income.shape[0]))
print('Mean',sample_mean)
print('Std',sample_std)
print('Standard Error',standard_error)
print('(',sample_mean-2*standard_error,',',sample_mean+2*standard_error,')')
# Python Library
import scipy.stats as st
se = st.sem(income['Data Scientist'])
a = st.t.interval(0.95, len(income['Data Scientist'])-1, loc=sample_mean, scale=se)
print(a)
print('Standard Error from this code block',se)
You've got 2 errors.
First, you are using 2 as the multiplier for the CI. The more accurate value is 1.96; "2" is just a convenient approximation, and it makes your manually computed CI too wide.
Second, you are comparing a normal distribution to the t-distribution. This probably isn't causing more than decimal dust in the difference, because you have 199 degrees of freedom for the t-distribution, which is basically the normal.
Below is a check of the z-score for 1.96 and a computation of the CI as an apples-to-apples comparison using the normal distribution.
In [45]: st.norm.cdf(1.96)
Out[45]: 0.9750021048517795
In [46]: print('(',sample_mean-1.96*standard_error,',',sample_mean+1.96*standard_error,')')
( 57558.007862202685 , 61510.37559873406 )
In [47]: st.norm.interval(0.95, loc=sample_mean, scale=se)
Out[47]: (57558.044175045005, 61510.33928589174)
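To make the "decimal dust" point concrete, a small check (added here, not part of the original answer) comparing the two multipliers directly:
import scipy.stats as st
# Two-sided 95% multipliers: normal vs. t with 199 degrees of freedom
st.norm.ppf(0.975)    # ~1.960
st.t.ppf(0.975, 199)  # ~1.972, barely wider than the normal multiplier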

How to calculate the correlation coefficient of grouped quantities in Pandas?

I have a DataFrame in which each row represents a traffic accident. Two of the columns are Speed_limit and Number_of_casualties. I would like to compute the Pearson correlation coefficient between the speed limit and the ratio of the number of casualties to accidents for each speed limit.
My solution so far is to get the relevant quantities as arrays and use SciPy's pearsonr:
import pandas as pd
import scipy.stats
df = pd.DataFrame({'Speed_limit': [10, 10, 20, 20, 20, 30],
                   'Number_of_casualties': [1, 2, 3, 4, 1, 4]})
accidents_per_speed_limit = df['Speed_limit'].value_counts().sort_index()
number_of_casualties_per_speed_limit = df.groupby('Speed_limit').sum()['Number_of_casualties']
speed_limit = accidents_per_speed_limit.index
ratio = number_of_casualties_per_speed_limit.values / accidents_per_speed_limit.values
r, _ = scipy.stats.pearsonr(x=speed_limit, y=ratio)
print("The Pearson's correlation coefficient between the number of casualties per accidents and the speed limit is {r}.".format(r=r))
However, it would seem to me that it should be possible to do this more elegantly using the pandas.DataFrame.corr method. How could I refactor this code to make it more pandas-like?
Instead of count and sum, you can directly use the mean of the grouped data and then Series.corr (the default method is pearson), i.e.
m = df.groupby('Speed_limit').mean().reset_index()
m['Speed_limit'].corr(m['Number_of_casualties'])
Output:
0.99926008128973687
I found the following way using two auxiliary DataFrames:
df_aux = df.groupby('Speed_limit').agg(['count', 'sum'])
df_aux2 = pd.DataFrame({'ratio': df_aux['Number_of_casualties', 'sum'] / df_aux['Number_of_casualties', 'count'],
                        'speed_limit': df_aux.index})
print(df_aux2.corr()['ratio']['speed_limit'])
which corroborates the result obtained with scipy.stats.pearsonr. It's still not very elegant though, and I would appreciate suggestions for improvements.
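One more compact variant (a suggestion added here, equivalent to the mean-based answer above, since the group mean is exactly casualties divided by accidents):
# Group mean = casualties / accidents; corr() then gives Pearson's r.
m = df.groupby('Speed_limit')['Number_of_casualties'].mean().reset_index()
print(m.corr().iloc[0, 1])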

Multi-period portfolio optimization in python

Scenario: I am trying to do multiple portfolio optimizations, with different constraints (weights, risk, risk aversion...) in a multi-period scenario.
What I have already done: From the cvxpy examples I found out how to optimize a portfolio under a quadratic objective that results in a list of weights for the assets in the portfolio composition. My problem is that, although I have 15 years of monthly data, I don't know how to optimize for different periods (the code, in its current form, yields the best composition for the entire time span of my data).
Question 1: Is it possible to make the code optimize for different periods, such as 1, 3, 4, 6, 9, or 12 months (in that case yielding different weights for each of those periods)? If so, how could one do that?
Question 2: Is it possible to restrict the number of assets in each portfolio composition? What is the best way to achieve that? (The current code uses all of them, but I would like to test what happens when the number of assets is limited, to control the turnover level.)
Code:
import pandas as pd
import numpy as np
import cvxpy as cp
xlsx = '//folder//Dgms89//calculation v3.xlsx'
prices = pd.read_excel(xlsx, sheet_name='Prices Final')
logret = pd.read_excel(xlsx, sheet_name='Returns log')
normret = pd.read_excel(xlsx, sheet_name='Returns normal')
returns = normret
def calculate_portfolio(returns, selected_solver):
    # Minimum-variance portfolio: minimize w' Sigma w s.t. weights sum to 1.
    cov_mat = returns.cov()
    Sigma = np.asarray(cov_mat.values)
    w = cp.Variable(len(cov_mat))
    gamma = cp.quad_form(w, Sigma)
    prob = cp.Problem(cp.Minimize(gamma), [cp.sum(w) == 1])
    prob.solve(solver=selected_solver)
    return [float(weight) for weight in w.value]
The problem with multi-period optimization is that your model will be overfitted. On the other hand, you can backtest traditional portfolio optimization models assuming a rebalancing period. Riskfolio-Lib has an example using backtrader that compares the S&P 500 with different portfolios using quarterly rebalancing. You can check the example at this link: https://riskfolio-lib.readthedocs.io/en/latest/examples.html
The standard mean-variance portfolio model is a static model: there are no dynamics in it (the time series are only used to estimate the variance-covariance matrix and the expected returns). Some related models can answer questions like when and how to rebalance; a simple rolling re-optimization is sketched below.
Restricting the number of assets in a portfolio leads to what is called the cardinality-constrained portfolio problem. This becomes basically an MIQP (mixed-integer quadratic programming) problem; see the sketch after this paragraph.
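As an illustration (a sketch added here, assuming monthly returns and the calculate_portfolio function from the question), one can re-estimate the covariance on a trailing window and re-solve at each rebalancing date:
# Re-optimize every `horizon` months using only the trailing `window` months.
window, horizon = 36, 3  # assumed estimation window and rebalancing period
weights_by_period = {}
for start in range(window, len(returns), horizon):
    sample = returns.iloc[start - window:start]  # trailing data only
    weights_by_period[start] = calculate_portfolio(sample, cp.ECOS)
Each entry then holds the weights to use from that month until the next rebalancing date.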
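And a hedged sketch of the cardinality constraint in cvxpy: the problem becomes mixed-integer, so it needs a MIP-capable solver (ECOS_BB is used here for illustration), and the w <= y link assumes long-only weights:
import cvxpy as cp
import numpy as np
def calculate_portfolio_cardinality(returns, k):
    # Minimum-variance portfolio holding at most k assets.
    Sigma = np.asarray(returns.cov().values)
    n = len(Sigma)
    w = cp.Variable(n)
    y = cp.Variable(n, boolean=True)  # y[i] = 1 if asset i is held
    constraints = [cp.sum(w) == 1,
                   w >= 0,        # long-only, so w <= y is a valid on/off link
                   w <= y,
                   cp.sum(y) <= k]
    prob = cp.Problem(cp.Minimize(cp.quad_form(w, Sigma)), constraints)
    prob.solve(solver=cp.ECOS_BB)
    return w.value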
