I want to port the R function Hmisc::wtd.quantile() to Python.
Here is the example in R:
I used the following two functions as references, but their logic seems to differ from R's:
# First function
import numpy as np

def weighted_quantile(values, quantiles, sample_weight=None,
                      values_sorted=False, old_style=False):
    """ Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with many quantiles needed
    :param sample_weight: array-like of the same length as `array`
    :return: numpy.array with computed quantiles.
    """
    values = np.array(values)
    quantiles = np.array(quantiles)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.array(sample_weight)
    assert np.all(quantiles >= 0) and np.all(quantiles <= 1), 'quantiles should be in [0, 1]'
    if not values_sorted:
        sorter = np.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]
    # weighted_quantiles = np.cumsum(sample_weight)
    # weighted_quantiles /= np.sum(sample_weight)
    weighted_quantiles = np.cumsum(sample_weight)/np.sum(sample_weight)
    return np.interp(quantiles, weighted_quantiles, values)
weighted_quantile(values = [0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319],
quantiles = np.arange(0, 1 + 1 / 5, 1 / 5),
sample_weight = [1,1,1,1,1])
>> array([0.2136325, 0.2136325, 0.4079128, 0.4890342, 0.5083345, 0.6197319])
# Second function
def weighted_percentile(data, weights, perc):
    """
    perc : percentile in [0-1]!
    """
    data = np.array(data)
    weights = np.array(weights)
    ix = np.argsort(data)
    data = data[ix] # sort data
    weights = weights[ix] # sort weights
    cdf = (np.cumsum(weights) - 0.5 * weights) / np.sum(weights) # 'like' a CDF function
    return np.interp(perc, cdf, data)
weighted_percentile([0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319], [1,1,1,1,1], np.arange(0, 1 + 1 / 5, 1 / 5))
>> array([0.2136325 , 0.31077265, 0.4484735 , 0.49868435, 0.5640332 ,
0.6197319 ])
Both differ from the R result. Any idea why?
I am Python-illiterate, but from what I see and after some quick checks I can tell you the following.
Here you use uniform (sampling) weights, so you could also directly use the quantile() function. Not surprisingly, it gives the same results as wtd.quantile() with uniform weights:
x <- c(0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319)
n <- length(x)
x <- sort(x)
quantile(x, probs = seq(0,1,0.2))
# 0% 20% 40% 60% 80% 100%
# 0.2136325 0.3690567 0.4565856 0.4967543 0.5306140 0.6197319
By default, the R quantile() function (type 7) gets the quantiles by linear interpolation between order statistics: for a probability q, it uses the fractional index i = 1 + q(n-1) into the sorted observations.
In your case:
1 + seq(0,1,0.2)*(n-1)
# 1.0 1.8 2.6 3.4 4.2 5.0
Since you have 5 values/obs and you want quintiles, most of the indices are not integers. But you know, for example, that the first quintile (i = 1.8) lies between obs 1 and obs 2. More precisely, it is a linear combination of the two observations (the 'weights' are derived from the fractional part of the index):
0.2*x[1] + 0.8*x[2]
# 0.3690567
You can do the same for all the quintiles, on the basis of the indices:
q <-
c(min(x), ## 1.0: actually, the first obs
0.2*x[1] + 0.8*x[2], ## 1.8: quintile lies between obs 1 and 2
0.4*x[2] + 0.6*x[3], ## 2.6: quintile lies between obs 2 and 3
0.6*x[3] + 0.4*x[4], ## 3.4: quintile lies between obs 3 and 4
0.8*x[4] + 0.2*x[5], ## 4.2: quintile lies between obs 4 and 5
max(x) ## 5.0: actually, the last obs
)
q
# 0.2136325 0.3690567 0.4565856 0.4967543 0.5306140 0.6197319
You can see that you get exactly the output of quantile() and wtd.quantile().
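On the Python side, here is a minimal sketch of the same uniform-weight computation, assuming only NumPy. The default linear method of np.quantile follows the same interpolation rule as R's default quantile(), and np.interp gives the same numbers when the sorted observations are placed at the evenly spaced positions (i-1)/(n-1). The variable names are illustrative.

```python
import numpy as np

x = np.sort([0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319])
probs = np.linspace(0, 1, 6)  # 0, 0.2, ..., 1

# NumPy's default method interpolates linearly between order statistics
q_numpy = np.quantile(x, probs)

# Same rule written with np.interp: observation i (1-based) sits at (i-1)/(n-1)
positions = np.arange(len(x)) / (len(x) - 1)
q_interp = np.interp(probs, positions, x)

print(q_numpy)
print(q_interp)
```

Seen this way, the two Python functions in the question differ from R only in where they place the observations on the probability axis before calling np.interp.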
If instead of 0.2*x[1] + 0.8*x[2] we consider the following:
0.5*x[1] + 0.5*x[2]
# 0.3107726
We get the output of your second Python function. It appears that your second function considers uniform 'weights' (obviously I am not talking about the sampling weights here) when combining the two observations. The issue (at least for the second Python function) seems to come from this. I know these are just insights, but I hope they will help.
EDIT: note that the difference between the two is not necessarily an 'issue' with the Python code. There are different quantile estimators (and their weighted versions), and the Python functions could simply rely on a different estimator than Hmisc::wtd.quantile(). I think that the latter uses the weighted version of the Harrell-Davis quantile estimator. If you really want to implement this one, you should check the source code of Hmisc::wtd.quantile() and try to 'directly' translate it into Python.
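As an illustration of what a weighted estimator could look like, here is a rough sketch that treats the weights as frequency counts and interpolates linearly between weighted order statistics. The helper name weighted_quantile_type7 is hypothetical, the total weight is assumed to be at least 1, and this is not claimed to reproduce Hmisc::wtd.quantile() in every case; checking against its source remains the safer route.

```python
import numpy as np

def weighted_quantile_type7(x, weights, probs):
    """Hypothetical helper: linear interpolation between weighted order
    statistics, with the weights read as frequency counts."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(weights, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(x)
    x, w = x[order], w[order]
    cum_w = np.cumsum(w)
    n = cum_w[-1]                       # total weight plays the role of the sample size
    h = 1 + (n - 1) * probs             # fractional (1-based) order-statistic index
    low = np.maximum(np.floor(h), 1)
    high = np.minimum(low + 1, n)
    frac = h - np.floor(h)
    # k-th weighted order statistic: smallest x whose cumulative weight reaches k
    last = len(x) - 1
    x_low = x[np.minimum(np.searchsorted(cum_w, low, side="left"), last)]
    x_high = x[np.minimum(np.searchsorted(cum_w, high, side="left"), last)]
    return (1 - frac) * x_low + frac * x_high

values = [0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319]
probs = np.linspace(0, 1, 6)
uniform = weighted_quantile_type7(values, [1, 1, 1, 1, 1], probs)
counts = [2, 1, 1, 1, 1]                # first value observed twice
weighted = weighted_quantile_type7(values, counts, probs)
print(uniform)
print(weighted)
```

With integer weights, the result should agree with np.quantile applied to the data with each value repeated according to its weight, which gives a simple way to sanity-check the sketch.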
I am just getting started with the optimization functions in scipy.
I tried to build my code by adapting the solution from "Find optimal vector that minimizes function".
I have an array that contains series in columns. I need to multiply each of them by a weight so that the sum of last row of these columns multiplied by the weights gives a given number (constraint).
The sum of the series multiplied by the weights gives a new series where I extract the max-draw-down and I want to minimize this mdd.
I wrote my code as best as I can (2 months of Python and 3 hours of scipy) and can't solve the error message on the function used to solve the problem.
Here is my code and any help would be much appreciated:
import numpy as np
from scipy.optimize import fmin_slsqp
# based on: https://stackoverflow.com/questions/41145643/find-optimal-vector-that-minimizes-function
# the number of columns (and so of weights) can vary; it should be generic, regardless the number of columns
def mdd(serie): # finding the max-draw-down of a series (put aside not to create add'l problems)
    min = np.nanargmax(np.fmax.accumulate(serie) - serie)
    max = np.nanargmax((serie)[:min])
    return serie[np.nanargmax((serie)[:min])] - serie[min] # max-draw-down
# defining the input data
# mat is an array of 5 columns containing series of independent data
mat = np.array([[1, 0, 0, 1, 1],[2, 0, 5, 3, 4],[3, 2, 4, 3, 7],[4, 1, 3, 3.1, -6],[5, 0, 2, 5, -7],[6, -1, 4, 1, -8]]).astype('float32')
w = np.ndarray(shape=(5)).astype('float32') # 1D vector for the weights to be used for the columns multiplication
w0 = np.array([1/5, 1/5, 1/5, 1/5, 1/5]).astype('float32') # initial weights (all similar as a starting point)
fixed_value = 4.32 # as a result of constraint nb 1
# testing the operations that are going to be used in the minimization
series = np.sum(mat * w0, axis=1)
# objective:
# minimize the mdd of the series by modifying the weights (w)
def test(w, mat):
    series = np.sum(mat * w, axis=1)
    return mdd(series)
# constraints:
def cons1(last, w, fixed_value): # fixed_value = 4.32
    # the sum of the weights multiplied by the last value of each column must be equal to this fixed_value
    return np.sum(mat[-1, :] * w) - fixed_value
def cons2(w): # the sum of the weights must be equal to 1
    return np.sum(w) - 1
# solution:
# looking for the optimal set of weights (w) values that minimize the mdd with the two contraints and bounds being respected
# all w values must be between 0 and 1
result = fmin_slsqp(test, w0, f_eqcons=[cons1, cons2], bounds=[(0.0, 1.0)]*len(w), args=(mat, fixed_value, w0), full_output=True)
weights, fW, its, imode, smode = result
print(weights)
You weren't that far off the mark. The biggest problem lies in the mdd function: if there is no draw-down, the slice (serie)[:min] is empty as an intermediate result, and the argmax function cannot cope with an empty array.
def mdd(serie): # finding the max-draw-down of a series (put aside not to create add'l problems)
    i = np.argmax(np.maximum.accumulate(serie) - serie) # end of the period
    start = serie[:i]
    # check if there is dd at all: the slice is empty when no later value falls below a peak
    if start.size == 0:
        return 0
    j = np.argmax(start) # start of period
    return serie[j] - serie[i] # max-draw-down
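The max-draw-down can also be computed without locating the two indices, which sidesteps the empty-slice case entirely. A minimal sketch, assuming the series contains no NaN values; the name max_drawdown is illustrative.

```python
import numpy as np

def max_drawdown(series):
    """Largest drop from a running peak to a later value (0 if the series never falls)."""
    series = np.asarray(series, dtype=float)
    running_peak = np.maximum.accumulate(series)   # highest value seen so far
    return float(np.max(running_peak - series))    # deepest gap below that peak

print(max_drawdown([1, 3, 2, 5, 1, 4]))   # peak 5 followed by trough 1
print(max_drawdown([1, 2, 3]))            # never falls
```

The difference between the running peak and the series is never negative, so a series that only rises naturally yields 0.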
In addition, you must make sure that the parameter list is the same for all functions involved (cost function and constraints).
# objective:
# minimize the mdd of the series by modifying the weights (w)
def test(w, mat, fixed_value):
    series = mat @ w
    return mdd(series)
# constraints:
def cons1(w, mat, fixed_value): # fixed_value = 4.32
    # the sum of the weights multiplied by the last value of each column must be equal to this fixed_value
    return mat[-1, :] @ w - fixed_value
def cons2(w, mat, fixed_value): # the sum of the weights must be equal to 1
    return np.sum(w) - 1
# solution:
# looking for the optimal set of weights (w) values that minimize the mdd with the two contraints and bounds being respected
# all w values must be between 0 and 1
result = fmin_slsqp(test, w0, eqcons=[cons1, cons2], bounds=[(0.0, 1.0)]*len(w), args=(mat,fixed_value), full_output=True)
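To see the shared parameter list in isolation, here is a small toy problem (not the draw-down problem itself): fmin_slsqp calls the objective and every constraint with the candidate vector followed by the same args tuple. The functions and numbers below are made up for illustration.

```python
import numpy as np
from scipy.optimize import fmin_slsqp

# Every callback receives the candidate w followed by the same extra arguments
def objective(w, target, total):
    return np.sum((w - target) ** 2)

def sum_constraint(w, target, total):
    return np.sum(w) - total          # SLSQP drives this to zero

target = np.array([0.5, 0.3, 0.4])    # illustrative numbers
w_start = np.full(3, 1 / 3)

w_opt = fmin_slsqp(objective, w_start, eqcons=[sum_constraint],
                   bounds=[(0.0, 1.0)] * 3, args=(target, 1.0), iprint=0)
print(w_opt)
```

Even though objective never uses total and sum_constraint never uses target, both must accept them, because the optimizer passes the full args tuple to each function.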
One more remark: you can make the matrix-vector multiplications much leaner with the @ operator.
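A quick check of that equivalence, using a small made-up matrix and weight vector:

```python
import numpy as np

mat = np.array([[1.0, 0.0, 2.0],
                [3.0, 1.0, 0.5]])    # illustrative 2 x 3 matrix
w = np.array([0.2, 0.3, 0.5])

weighted_sum = np.sum(mat * w, axis=1)   # broadcast, multiply, then sum each row
matrix_product = mat @ w                 # the same values as a matrix-vector product

print(weighted_sum)
print(matrix_product)
```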
Given this simulated data:
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.statespace.structural import UnobservedComponents
np.random.seed(12345)
ar = np.r_[1, 0.9]
ma = np.array([1])
arma_process = ArmaProcess(ar, ma)
X = 100 + arma_process.generate_sample(nsample=100)
y = 1.2 * X + np.random.normal(size=100)
We build a UnobservedComponents model with the first 70 points to run inferences on the last 30 points like so:
model = UnobservedComponents(y[:70], level='llevel', exog=X[:70])
f_model = model.fit()
forecaster = f_model.get_forecast(
steps=30,
exog=X[70:].reshape(-1, 1)
)
conf_int = forecaster.conf_int()
If we observe the mean for the 95% confidence interval, we get the following:
conf_int.mean(axis=0)
array([118.19789195, 122.14101161])
But when trying to get the same values through model simulations, we don't quite get the same results. Here's the script we run for the simulated boundaries:
sim_model = UnobservedComponents(np.zeros(30), level='llevel', exog=X[70:])
res = []
predicted_state = f_model.predicted_state[..., -1]
predicted_state_cov = f_model.predicted_state_cov[..., -1]
for i in range(1000):
    init_state = np.random.multivariate_normal(
        predicted_state,
        predicted_state_cov
    )
    sim = sim_model.simulate(
        f_model.params,
        30,
        initial_state=init_state)
    res.append(sim.mean())
Printing the lower 2.5 and upper 97.5 percentile we get:
np.percentile(res, [2.5, 97.5])
array([119.06735028, 121.26810407])
As we use model simulations to distinguish signal from noise in data, this difference ended up being big enough to lead to contradictory conclusions. For instance, if we set:
y[70:] += 1
then according to the first technique we conclude that the new y carries no signal, as its mean is lower than 122.14. But the same is not true if we use the second technique: as the upper boundary is 121.27, we conclude that there is signal.
What we are trying to understand now is whether this is expected. Shouldn't the lower and upper 95% confidence interval of both techniques be equal?
Background:
I'd like to solve a wide array of optimization problems such as asset weights in a portfolio, and parameters in trading strategies where the variables are passed to functions containing a bunch of other variables as well.
Until now, I've been able to do these things easily in Excel using the Solver Add-In. But I think it would be much more efficient and even more widely applicable using Python. For the sake of clarity, I'm going to boil the question down to the essence of portfolio optimization.
My question (short version):
Here's a dataframe and a corresponding plot with asset returns.
Dataframe 1:
                A1      A2
2017-01-01  0.0075  0.0096
2017-01-02 -0.0075 -0.0033
.
.
2017-01-10  0.0027  0.0035
Plot 1 - Asset returns
Based on that, I would like to find the weights for the optimal portfolio with regards to risk / return (Sharpe ratio), represented by the green dot in the plot below (the red dot is the so-called minimum variance portfolio, and represents another optimization problem).
Plot 2 - Efficient frontier and optimal portfolios:
How can I do this with numpy or scipy?
The details:
The following code section contains the function returns() to build a dataframe with random returns for two assets, as well as a function pf_sharpe to calculate the Sharpe ratio of two given weights for a portfolio of the returns.
# imports
import pandas as pd
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt
np.random.seed(1234)
# Reproducible data sample
def returns(rows, names):
    ''' Function to create data sample with random returns
    Parameters
    ==========
    rows : number of rows in the dataframe
    names: list of names to represent assets
    Example
    =======
    >>> returns(rows = 2, names = ['A', 'B'])
                     A       B
    2017-01-01  0.0027  0.0075
    2017-01-02 -0.0050 -0.0024
    '''
    listVars= names
    rng = pd.date_range('1/1/2017', periods=rows, freq='D')
    df_temp = pd.DataFrame(np.random.randint(-100,100,size=(rows, len(listVars))), columns=listVars)
    df_temp = df_temp.set_index(rng)
    df_temp = df_temp / 10000
    return df_temp
# Sharpe ratio
def pf_sharpe(df, w1, w2):
    ''' Function to calculate risk / reward ratio
    based on a pandas dataframe with two return series
    Parameters
    ==========
    df : pandas dataframe
    w1 : portfolio weight for asset 1
    w2 : portfolio weight for asset 2
    '''
    weights = [w1,w2]
    # Calculate portfolio returns and volatility
    pf_returns = (np.sum(df.mean() * weights) * 252)
    pf_volatility = (np.sqrt(np.dot(np.asarray(weights).T, np.dot(df.cov() * 252, weights))))
    # Calculate sharpe ratio
    pf_sharpe = pf_returns / pf_volatility
    return pf_sharpe
# Make df with random returns and calculate
# sharpe ratio for a 80/20 split between assets
df_returns = returns(rows = 10, names = ['A1', 'A2'])
df_returns.plot(kind = 'bar')
sharpe = pf_sharpe(df = df_returns, w1 = 0.8, w2 = 0.2)
print(sharpe)
# Output:
# 5.09477512073
Now I'd like to find the portfolio weights that optimize the Sharpe ratio. I think you could express the optimization problem as follows:
maximize:
pf_sharpe()
by changing:
w1, w2
under the constraints:
0 < w1 < 1
0 < w2 < 1
w1 + w2 = 1
What I've tried so far:
I found a possible setup in the post Python Scipy Optimization.minimize using SLSQP showing maximized results. Below is what I have so far, and it addresses a central aspect of my question directly:
[...]where the variables are passed to functions containing a bunch of other variables as well.
As you can see, my initial challenge prevents me from even testing if my bounds and constraints will be accepted by the function optimize.minimize(). I haven't even bothered to take into consideration the fact that this is a maximization and not a minimization problem (hopefully amendable by changing the sign of the function).
Attempts:
# bounds
b = (0,1)
bnds = (b,b)
# constraints
def constraint1(w1,w2):
    return w1 - w2
cons = ({'type': 'eq', 'fun':constraint1})
# initial guess
x0 = [0.5, 0.5]
# Testing the initial guess
print(pf_sharpe(df = df_returns, weights = x0))
# Optimization attempts
attempt1 = optimize.minimize(pf_sharpe(), x0, method = 'SLSQP', bounds = bnds, constraints = cons)
attempt2 = optimize.minimize(pf_sharpe(df = df_returns, weights), x0, method = 'SLSQP', bounds = bnds, constraints = cons)
attempt3 = optimize.minimize(pf_sharpe(weights, df = df_returns), x0, method = 'SLSQP', bounds = bnds, constraints = cons)
Results:
Attempt1 is closest to the scipy setup here, but understandably fails because neither df nor weights have been specified.
Attempt2 fails with SyntaxError: positional argument follows keyword argument
Attempt3 fails with NameError: name 'weights' is not defined
I was under the impression that df could freely be specified, and that x0 in optimize.minimize would be considered the variables to be tested as 'representatives' for the weights in the function specified by pf_sharpe().
As you surely understand, my transition from Excel to Python in this regard has not been the easiest, and there is plenty I don't understand here. Anyway, I'm hoping some of you may offer some suggestions or clarifications!
Thank you!
Appendix 1 - Simulation approach:
This particular portfolio optimization problem can easily be solved by simulating a bunch of portfolio weights. And I did exactly that to produce the portfolio plot above. Here's the whole function if anyone is interested:
# Portfolio simulation
def portfolioSim(df, simRuns):
    ''' Function to take a df with asset returns,
    runs a number of simulated portfolio weights,
    plots return and risk for those weights,
    and finds minimum risk portfolio
    and max risk / return portfolio
    Parameters
    ==========
    df : pandas dataframe with returns
    simRuns : number of simulations
    '''
    prets = []
    pvols = []
    pwgts = []
    names = list(df)
    for p in range(simRuns):
        # Assign random weights
        weights = np.random.random(len(names))
        weights /= np.sum(weights)
        weights = np.asarray(weights)
        # Calculate risk and returns with random weights
        prets.append(np.sum(df.mean() * weights) * 252)
        pvols.append(np.sqrt(np.dot(weights.T, np.dot(df.cov() * 252, weights))))
        pwgts.append(weights)
    prets = np.array(prets)
    pvols = np.array(pvols)
    pwgts = np.array(pwgts)
    pshrp = prets / pvols
    # Store calculations in a df
    df1 = pd.DataFrame({'return':prets})
    df2 = pd.DataFrame({'risk':pvols})
    df3 = pd.DataFrame(pwgts)
    df3.columns = names
    df4 = pd.DataFrame({'sharpe':pshrp})
    df_temp = pd.concat([df1, df2, df3, df4], axis = 1)
    # Plot results
    plt.figure(figsize=(8, 4))
    plt.scatter(pvols, prets, c=prets / pvols, cmap = 'viridis', marker='o')
    # Min risk
    min_vol_port = df_temp.iloc[df_temp['risk'].idxmin()]
    plt.plot([min_vol_port['risk']], [min_vol_port['return']], marker='o', markersize=12, color="red")
    # Max sharpe
    max_sharpe_port = df_temp.iloc[df_temp['sharpe'].idxmax()]
    plt.plot([max_sharpe_port['risk']], [max_sharpe_port['return']], marker='o', markersize=12, color="green")
    # Return the max-Sharpe row (this is what the end notes print as sr_opt)
    return max_sharpe_port
# Test run
portfolioSim(df = df_returns, simRuns = 250)
Appendix 2 - Excel Solver approach:
Here is how I would approach the problem using Excel Solver. Instead of linking to a file, I've only attached a screenshot and included the most important formulas in a code section. I'm guessing not many of you are going to be interested in reproducing this anyway, but I've included it just to show that it can be done quite easily in Excel.
Grey ranges represent formulas. Ranges that can be changed and used as arguments in the optimization problem are highlighted in yellow. The green range is the objective function.
Here's an image of the worksheet and Solver setup:
Excel formulas:
C3 =AVERAGE(C7:C16)
C4 =AVERAGE(D7:D16)
H4 =COVARIANCE.P(C7:C16;D7:D16)
G5 =COVARIANCE.P(C7:C16;D7:D16)
G10 =G8+G9
G13 =MMULT(TRANSPOSE(G8:G9);C3:C4)
G14 =SQRT(MMULT(TRANSPOSE(G8:G9);MMULT(G4:H5;G8:G9)))
H13 =G12/G13
H14 =G13*252
G16 =G13/G14
H16 =H13/H14
End notes:
As you can see from the screenshot, Excel Solver suggests a 47% / 53% split between A1 and A2 to obtain an optimal Sharpe ratio of 5.6. Running the Python function sr_opt = portfolioSim(df = df_returns, simRuns = 25000) yields a Sharpe ratio of 5.3 with corresponding weights of about 47% and 53% for A1 and A2:
print(sr_opt)
#Output
#return 0.361439
#risk 0.067851
#A1 0.465550
#A2 0.534450
#sharpe 5.326933
The method applied in Excel is GRG Nonlinear. I understand that changing the SLSQP argument to a non-linear method would get me somewhere, and I've looked into nonlinear solvers in scipy as well, but with little success.
And maybe Scipy even isn't the best option here?
Here is a more detailed answer. The first part of your code remains the same:
import pandas as pd
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt
np.random.seed(1234)
# Reproducible data sample
def returns(rows, names):
    ''' Function to create data sample with random returns
    Parameters
    ==========
    rows : number of rows in the dataframe
    names: list of names to represent assets
    Example
    =======
    >>> returns(rows = 2, names = ['A', 'B'])
                     A       B
    2017-01-01  0.0027  0.0075
    2017-01-02 -0.0050 -0.0024
    '''
    listVars= names
    rng = pd.date_range('1/1/2017', periods=rows, freq='D')
    df_temp = pd.DataFrame(np.random.randint(-100,100,size=(rows, len(listVars))), columns=listVars)
    df_temp = df_temp.set_index(rng)
    df_temp = df_temp / 10000
    return df_temp
The function pf_sharpe is modified: its first input is now one of the weights, the parameter to be optimised. Instead of adding the constraint w1 + w2 = 1, we can define w2 as 1-w1 inside pf_sharpe, which is equivalent but simpler and faster. Also, minimize will attempt to minimize pf_sharpe, whereas you actually want to maximize it, so the output of pf_sharpe is now multiplied by -1.
# Sharpe ratio
def pf_sharpe(weight, df):
    ''' Function to calculate risk / reward ratio
    based on a pandas dataframe with two return series
    '''
    weights = [weight[0], 1-weight[0]]
    # Calculate portfolio returns and volatility
    pf_returns = (np.sum(df.mean() * weights) * 252)
    pf_volatility = (np.sqrt(np.dot(np.asarray(weights).T, np.dot(df.cov() * 252, weights))))
    # Calculate sharpe ratio
    pf_sharpe = pf_returns / pf_volatility
    return -pf_sharpe
# initial guess
x0 = [0.5]
df_returns = returns(rows = 10, names = ['A1', 'A2'])
# Optimization attempts
out = minimize(pf_sharpe, x0, method='SLSQP', bounds=[(0, 1)], args=(df_returns,))
optimal_weights = [out.x, 1-out.x]
print(optimal_weights)
print(-pf_sharpe(out.x, df_returns))
This returns an optimized Sharpe ratio of 6.16 (better than 5.3), with w1 practically 1 and w2 practically 0.
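With more than two assets, the substitution w2 = 1 - w1 no longer applies directly. One option is to hand the sum-to-one condition to SLSQP as an equality constraint. The following is a minimal sketch under stated assumptions: the returns are synthetic normal draws for three hypothetical assets, the risk-free rate is taken as zero, and neg_sharpe is an illustrative name rather than part of the code above.

```python
import numpy as np
from scipy.optimize import minimize

def neg_sharpe(weights, mean_returns, cov_matrix, periods=252):
    """Negative annualised Sharpe ratio (risk-free rate assumed to be zero)."""
    pf_return = periods * (weights @ mean_returns)
    pf_volatility = np.sqrt(periods * (weights @ cov_matrix @ weights))
    return -pf_return / pf_volatility

# Synthetic daily returns for three hypothetical assets
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.001, scale=0.01, size=(250, 3))
mean_returns = sample.mean(axis=0)
cov_matrix = np.cov(sample, rowvar=False)

n_assets = sample.shape[1]
x0 = np.full(n_assets, 1 / n_assets)               # start from equal weights
sum_to_one = {'type': 'eq', 'fun': lambda w: np.sum(w) - 1}

res = minimize(neg_sharpe, x0, args=(mean_returns, cov_matrix),
               method='SLSQP', bounds=[(0, 1)] * n_assets,
               constraints=[sum_to_one])
print(res.x, -res.fun)
```

The extra data reaches the objective through args, while the constraint only needs the weight vector, so it can stay a one-argument function here.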
I would like to fit my surface equation to some data. I already tried scipy.optimize.leastsq, but since I cannot specify bounds it gives me unusable results. I also tried scipy.optimize.least_squares, but it gives me an error:
ValueError: too many values to unpack
My equation is:
f(x,y,z)=(x-A+y-B)/2+sqrt(((x-A-y+B)/2)^2+C*z^2)
parameters A, B, C should be found so that the equation above would be as close as possible to zero when the following points are used for x,y,z:
[
[-0.071, -0.85, 0.401],
[-0.138, -1.111, 0.494],
[-0.317, -0.317, -0.317],
[-0.351, -2.048, 0.848]
]
The bounds would be A > 0, B > 0, C > 1
How should I obtain such a fit? What is the best tool in Python to do that? I searched for examples of how to fit 3D surfaces, but most function-fitting examples are about line or flat-surface fits.
I've edited this answer to provide a more general example of how this problem can be solved with scipy's general optimize.minimize method as well as scipy's optimize.least_squares method.
First, let's set up the problem:
import numpy as np
import scipy.optimize
# ===============================================
# SETUP: define common components of the problem
def our_function(coeff, data):
    """
    The function we care to optimize.
    Args:
        coeff (np.ndarray): are the parameters that we care to optimize.
        data (np.ndarray): the input data
    """
    A, B, C = coeff
    x, y, z = data.T
    return (x - A + y - B) / 2 + np.sqrt(((x - A - y + B) / 2) ** 2 + C * z ** 2)
# Define some training data
data = np.array([
[-0.071, -0.85, 0.401],
[-0.138, -1.111, 0.494],
[-0.317, -0.317, -0.317],
[-0.351, -2.048, 0.848]
])
# Define training target
# This is what we want the target function to be equal to
target = 0
# Make an initial guess as to the parameters
# either a constant or random guess is typically fine
num_coeff = 3
coeff_0 = np.ones(num_coeff)
# coeff_0 = np.random.rand(num_coeff)
This isn't strictly least squares, but how about something like this?
This solution is like throwing a sledge hammer at the problem. There probably is a way to use least squares to get a solution more efficiently using an SVD solver, but if you're just looking for an answer scipy.optimize.minimize will find you one.
# ===============================================
# FORMULATION #1: a general minimization problem
# Here the bounds and error are all specified within the general objective function
def general_objective(coeff, data, target):
    """
    General function that simply returns a value to be minimized.
    The coeff will be modified to minimize whatever the output of this function
    may be.
    """
    # Constraints to keep coeff above 0
    if np.any(coeff < 0):
        # If any constraint is violated return infinity
        return np.inf
    # The function we care about
    prediction = our_function(coeff, data)
    # (optional) L2 regularization to keep coeff small
    # (optional) reg_amount = 0.0
    # (optional) reg = reg_amount * np.sqrt((coeff ** 2).sum())
    losses = (prediction - target) ** 2
    # (optional) losses += reg
    # Return the summed squared error
    loss = losses.sum()
    return loss
general_result = scipy.optimize.minimize(general_objective, coeff_0,
method='Nelder-Mead',
args=(data, target))
# Test what the squared error of the returned result is
coeff = general_result.x
general_output = our_function(coeff, data)
print('====================')
print('general_result =\n%s' % (general_result,))
print('---------------------')
print('general_output = %r' % (general_output,))
print('====================')
The output looks like this:
====================
general_result =
final_simplex: (array([[ 2.45700466e-01, 7.93719271e-09, 1.71257109e+00],
[ 2.45692680e-01, 3.31991619e-08, 1.71255150e+00],
[ 2.45726858e-01, 6.52636219e-08, 1.71263360e+00],
[ 2.45713989e-01, 8.06971686e-08, 1.71260234e+00]]), array([ 0.00012404, 0.00012404, 0.00012404, 0.00012404]))
fun: 0.00012404137498459109
message: 'Optimization terminated successfully.'
nfev: 431
nit: 240
status: 0
success: True
x: array([ 2.45700466e-01, 7.93719271e-09, 1.71257109e+00])
---------------------
general_output = array([ 0.00527974, -0.00561568, -0.00719941, 0.00357748])
====================
I found in the documentation that all you need to do to adapt this to actual least squares is to specify the function that computes the residuals.
# ===============================================
# FORMULATION #2: a special least squares problem
# Here all that is needed is a function that computes the vector of residuals
# the optimization function takes care of the rest
def least_squares_residuals(coeff, data, target):
    """
    Function that returns the vector of residuals between the predicted values
    and the target value. Here we want each predicted value to be close to zero
    """
    A, B, C = coeff
    x, y, z = data.T
    prediction = our_function(coeff, data)
    vector_of_residuals = (prediction - target)
    return vector_of_residuals
# Here the bounds are specified in the optimization call
# plain float is used because the np.float alias was removed from recent NumPy releases
bound_gt = np.full(shape=num_coeff, fill_value=0, dtype=float)
bound_lt = np.full(shape=num_coeff, fill_value=np.inf, dtype=float)
bounds = (bound_gt, bound_lt)
lst_sqrs_result = scipy.optimize.least_squares(least_squares_residuals, coeff_0,
args=(data, target), bounds=bounds)
# Test what the squared error of the returned result is
coeff = lst_sqrs_result.x
lst_sqrs_output = our_function(coeff, data)
print('====================')
print('lst_sqrs_result =\n%s' % (lst_sqrs_result,))
print('---------------------')
print('lst_sqrs_output = %r' % (lst_sqrs_output,))
print('====================')
The output here is:
====================
lst_sqrs_result =
active_mask: array([ 0, -1, 0])
cost: 6.197329866927735e-05
fun: array([ 0.00518416, -0.00564099, -0.00710112, 0.00385024])
grad: array([ -4.61826888e-09, 3.70771396e-03, 1.26659198e-09])
jac: array([[-0.72611025, -0.27388975, 0.13653112],
[-0.74479565, -0.25520435, 0.1644325 ],
[-0.35777232, -0.64222767, 0.11601263],
[-0.77338046, -0.22661953, 0.27104366]])
message: '`gtol` termination condition is satisfied.'
nfev: 13
njev: 13
optimality: 4.6182688779976278e-09
status: 1
success: True
x: array([ 2.46392438e-01, 5.39025298e-17, 1.71555150e+00])
---------------------
lst_sqrs_output = array([ 0.00518416, -0.00564099, -0.00710112, 0.00385024])
====================
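Both formulations above keep every coefficient at or above zero, whereas the question asks for A > 0, B > 0 and C > 1. As a hedged sketch of how per-parameter bounds could be passed, the lower-bound vector below uses 1 for C; least_squares treats bounds as closed intervals, so this enforces C >= 1 rather than a strict inequality. The starting point is an arbitrary feasible guess, and surface simply restates the function from the question.

```python
import numpy as np
from scipy.optimize import least_squares

def surface(coeff, data):
    """Residuals of the surface equation; the target value is zero."""
    A, B, C = coeff
    x, y, z = data.T
    return (x - A + y - B) / 2 + np.sqrt(((x - A - y + B) / 2) ** 2 + C * z ** 2)

data = np.array([
    [-0.071, -0.85, 0.401],
    [-0.138, -1.111, 0.494],
    [-0.317, -0.317, -0.317],
    [-0.351, -2.048, 0.848],
])

lower = np.array([0.0, 0.0, 1.0])      # A >= 0, B >= 0, C >= 1
upper = np.full(3, np.inf)
start = np.array([1.0, 1.0, 2.0])      # arbitrary guess inside the bounds

fit = least_squares(surface, start, args=(data,), bounds=(lower, upper))
print(fit.x, fit.cost)
```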
So I need to calculate the joint probability distribution for N variables. I have code for two variables, but I am having trouble generalizing it to higher dimensions. I imagine there is some sort of pythonic vectorization that could be helpful, but, right now my code is very C like (and yes I know that is not the right way to write Python). My 2D code is below:
import numpy
import math
feature1 = numpy.array([1.1,2.2,3.0,1.2,5.4,3.4,2.2,6.8,4.5,5.6,1.9,2.8,3.7,4.4,7.3,8.3,8.1,7.0,8.0,6.8,6.2,4.9,5.7,6.3,3.7,2.4,4.5,8.5,9.5,9.9]);
feature2 = numpy.array([11.1,12.8,13.0,11.6,15.2,13.8,11.1,17.8,12.5,15.2,11.6,20.8,14.7,14.4,15.3,18.3,11.4,17.0,16.0,16.8,12.2,14.9,15.7,16.3,13.7,12.4,14.2,18.5,19.8,19.0]);
#===Concatenate All Features===#
numFrames = len(feature1);
allFeatures = numpy.zeros((2,numFrames));
allFeatures[0,:] = feature1;
allFeatures[1,:] = feature2;
#===Create the Array to hold all the Bins===#
numBins = int(0.25*numFrames);
allBins = numpy.zeros((allFeatures.shape[0],numBins+1));
#===Find the maximum and minimum of each feature===#
allRanges = numpy.zeros((allFeatures.shape[0],2));
for f in range(allFeatures.shape[0]):
    allRanges[f,0] = numpy.amin(allFeatures[f,:]);
    allRanges[f,1] = numpy.amax(allFeatures[f,:]);
#===Create the Array to hold all the individual feature probabilities===#
allIndividualProbs = numpy.zeros((allFeatures.shape[0],numBins));
#===Grab all the Individual Probs and the Bins===#
for f in range(allFeatures.shape[0]):
    freqhist, binedges = numpy.histogram(allFeatures[f,:],bins=numBins,range=[allRanges[f,0],allRanges[f,1]],density=False);
    allBins[f,:] = binedges;
    allIndividualProbs[f,:] = freqhist;
#===Create the joint probability array===#
jointProbs = numpy.zeros((numBins,numBins));
#===Compute the joint probability distribution===#
numElements = 0;
for b1 in range(numBins):
    for b2 in range(numBins):
        for f1 in range(numFrames):
            for f2 in range(numFrames):
                if ( ( (feature1[f1] >= allBins[0,b1]) and (feature1[f1] <= allBins[0,b1+1]) ) and ((feature2[f2] >= allBins[1,b2]) and (feature2[f2] <= allBins[1,b2+1])) ):
                    jointProbs[b1,b2] += 1;
                    numElements += 1;
jointProbs /= numElements;
#===But what if I add the following===#
feature3 = numpy.array([21.1,21.8,23.5,27.6,25.2,23.8,22.1,22.8,26.5,25.2,28.6,20.8,24.7,24.4,29.3,28.3,27.4,26.0,26.2,26.1,25.9,24.0,22.7,22.3,23.7,26.4,24.2,28.5,29.8,29.0]);
How can I generalize the large loop? For N variables (features) this loop would be enormous. Is there a Pythonic way to do this easily?
Check out the function numpy.histogramdd. This function can compute histograms in an arbitrary number of dimensions. If you set the parameter density=True (called normed=True in older NumPy versions), it returns a probability density, i.e. the bin counts divided by the total count and by the bin hypervolume. If you'd prefer something more like a probability mass function (where everything sums to 1), just normalize it yourself. All together, you'll have something like:
import numpy as np
numBins = 10 # number of bins in each dimension
data = np.random.randn(100000, 3) # generate 100000 3-d random data points
jointProbs, edges = np.histogramdd(data, bins=numBins)
jointProbs /= jointProbs.sum()
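Once the joint distribution is available as an N-dimensional array, the marginal distribution of any feature follows by summing over the other axes. A small sketch with synthetic data (three features, seven bins per dimension, both chosen only for illustration), cross-checked against a 1-D histogram on the same bin edges:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))     # synthetic: 1000 observations of 3 features

counts, edges = np.histogramdd(data, bins=7)
joint = counts / counts.sum()         # probability mass per 3-D bin

# Marginal of feature 0: sum the joint over the other two axes
marginal_0 = joint.sum(axis=(1, 2))

# Cross-check against a 1-D histogram that reuses the same bin edges
counts_0, _ = np.histogram(data[:, 0], bins=edges[0])
print(np.allclose(marginal_0, counts_0 / len(data)))
```

To use real features instead, stacking them as columns (for example with np.column_stack) gives the (observations, features) layout that histogramdd expects.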