I'm trying to follow this as an example but can't seem to adapt it to work with my dataset since I need truncated normals:
https://stackoverflow.com/questions/35990467/fit-two-gaussians-to-a-histogram-from-one-set-of-data-python#=
I have a dataset that is definitely a mixture of 2 truncated normals. The minimum value in the domain is 0 and the maximum is 1. I want to create an object that I can fit to optimize the parameters and get the likelihood of a sequence of numbers being drawn from that distribution. One option may be to just use a KDE model and use its pdf to get the likelihood. However, I want the exact means and standard deviations of the 2 distributions. I guess I could split the data in half and then model the 2 normals separately, but I also want to learn how to use optimize in SciPy. I'm just starting to experiment with this type of statistical analysis, so my apologies if this seems naive.
I'm not sure how to get a pdf this way that can integrate to 1 and have a domain constrained between 0 and 1.
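For reference, a minimal sketch of a mixture pdf built from scipy.stats.truncnorm, which integrates to 1 on [0, 1] by construction (the weight w and the parameter names are illustrative, not taken from the code below):
import numpy as np
from scipy import stats

def truncnorm_mixture_pdf(x, mu1, sigma1, mu2, sigma2, w):
    """Mixture of two normals truncated to [0, 1]; integrates to 1 for 0 <= w <= 1."""
    a1, b1 = (0 - mu1) / sigma1, (1 - mu1) / sigma1  # truncation bounds in standard units
    a2, b2 = (0 - mu2) / sigma2, (1 - mu2) / sigma2
    return (w * stats.truncnorm.pdf(x, a1, b1, loc=mu1, scale=sigma1)
            + (1 - w) * stats.truncnorm.pdf(x, a2, b2, loc=mu2, scale=sigma2))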
import requests
from ast import literal_eval
from scipy import optimize, stats, integrate
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Actual Data
u = np.asarray(literal_eval(requests.get("https://pastebin.com/raw/hP5VJ9vr").text))
# u.size ==> 6000
u.min(), u.max()
# (1.3628525454666037e-08, 0.99973136607553781)
# Distribution
with plt.style.context("seaborn-white"):
    fig, ax = plt.subplots()
    sns.kdeplot(u, color="black", ax=ax)
    ax.axvline(0, linestyle=":", color="red")
    ax.axvline(1, linestyle=":", color="red")
kde = stats.gaussian_kde(u)
# KDE Model
def truncated_gaussian_lower(x, mu, sigma, A):
    return np.clip(A*np.exp(-(x-mu)**2/2/sigma**2), a_min=0, a_max=None)

def truncated_gaussian_upper(x, mu, sigma, A):
    return np.clip(A*np.exp(-(x-mu)**2/2/sigma**2), a_min=None, a_max=1)

def mixture_model(x, mu1, sigma1, A1, mu2, sigma2, A2):
    return truncated_gaussian_lower(x, mu1, sigma1, A1) + truncated_gaussian_upper(x, mu2, sigma2, A2)
kde = stats.gaussian_kde(u)
# Estimates: mu sigma A
estimates= [0.1, 1, 3,
0.9, 1, 1]
params,cov= optimize.curve_fit(mixture_model,u,kde.pdf(u),estimates )
# ---------------------------------------------------------------------------
# RuntimeError Traceback (most recent call last)
# <ipython-input-265-b2efb2ca0e0a> in <module>()
# 32 estimates= [0.1, 1, 3,
# 33 0.9, 1, 1]
# ---> 34 params,cov= optimize.curve_fit(mixture_model,u,kde.pdf(u),estimates )
# /Users/mu/anaconda/lib/python3.6/site-packages/scipy/optimize/minpack.py in curve_fit(f, xdata, ydata, p0, sigma, absolute_sigma, check_finite, bounds, method, jac, **kwargs)
# 738 cost = np.sum(infodict['fvec'] ** 2)
# 739 if ier not in [1, 2, 3, 4]:
# --> 740 raise RuntimeError("Optimal parameters not found: " + errmsg)
# 741 else:
# 742 # Rename maxfev (leastsq) to max_nfev (least_squares), if specified.
# RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 1400.
In response to @Uvar's very helpful explanation below, I am trying to test the integral from 0 to 1 to see if it equals 1, but I'm getting 0.3. I think I'm missing a crucial step in logic:
# KDE Model
def truncated_gaussian(x, mu, sigma, A):
    return A*np.exp(-(x-mu)**2/2/sigma**2)

def mixture_model(x, mu1, sigma1, A1, mu2, sigma2, A2):
    if type(x) == np.ndarray:
        norm_probas = truncated_gaussian(x, mu1, sigma1, A1) + truncated_gaussian(x, mu2, sigma2, A2)
        mask_lower = x < 0
        mask_upper = x > 1
        mask_floor = (mask_lower.astype(int) + mask_upper.astype(int)) > 1
        norm_probas[mask_floor] = 0
        return norm_probas
    else:
        if (x < 0) or (x > 1):
            return 0
        return truncated_gaussian_lower(x, mu1, sigma1, A1) + truncated_gaussian_upper(x, mu2, sigma2, A2)
kde = stats.gaussian_kde(u, bw_method=2e-2)
# # Estimates: mu sigma A
estimates= [0.1, 1, 3,
0.9, 1, 1]
params,cov= optimize.curve_fit(mixture_model,u,kde.pdf(u)/integrate.quad(kde, 0 , 1)[0],estimates ,maxfev=5000)
# params
# array([ 9.89751700e-01, 1.92831695e-02, 7.84324114e+00,
# 3.73623345e-03, 1.07754038e-02, 3.79238972e+01])
# Test the integral from 0 - 1
x = np.linspace(0,1,1000)
with plt.style.context("seaborn-white"):
    fig, ax = plt.subplots()
    ax.plot(x, kde(x), color="black", label="Data")
    ax.plot(x, mixture_model(x, *params), color="red", label="Model")
    ax.legend()
# Integrating from 0 to 1
integrate.quad(lambda x: mixture_model(x, *params), 0,1)[0]
# 0.3026863969781809
It seems you are misspecifying the fitting procedure.
You are trying to fit the kde.pdf(u) while constraining half-bounds.
foo = kde.pdf(u)
min(foo)
Out[329]: 0.22903365654960098
max(foo)
Out[330]: 4.0119283429320332
As you can see, the probability density function of u is not constrained to [0,1].
As such, just deleting the clipping action will result in an exact fit.
def truncated_gaussian_lower(x, mu, sigma, A):
    return A*np.exp((-(x-mu)**2)/(2*sigma**2))

def truncated_gaussian_upper(x, mu, sigma, A):
    return A*np.exp((-(x-mu)**2)/(2*sigma**2))

def mixture_model(x, mu1, sigma1, A1, mu2, sigma2, A2):
    return truncated_gaussian_lower(x, mu1, sigma1, A1) + truncated_gaussian_upper(x, mu2, sigma2, A2)
estimates= [0.15, 1, 3,
0.95, 1, 1]
params,cov= optimize.curve_fit(f=mixture_model, xdata=u, ydata=kde.pdf(u), p0=estimates)
params
Out[327]:
array([ 0.00672248, 0.07462657, 4.01188383, 0.98006841, 0.07654998,
1.30569665])
y3 = mixture_model(u, params[0], params[1], params[2], params[3], params[4], params[5])
plt.plot(kde.pdf(u)+0.1) #add offset for visual inspection purpose
plt.plot(y3)
So, let's now say I change what I am plotting to:
plt.figure(); plt.plot(u,y3,'.')
Because, indeed:
np.allclose(y3, kde(u), atol=1e-2)
>>True
You can edit the mixture model a bit to be 0 outside of the domain [0, 1]:
def mixture_model(x, mu1, sigma1, A1, mu2, sigma2, A2):
    if (x < 0) or (x > 1):
        return 0
    return truncated_gaussian_lower(x, mu1, sigma1, A1) + truncated_gaussian_upper(x, mu2, sigma2, A2)
Doing so, however, will lose the option of instantly evaluating the function over an array of x. So, for the sake of argument, I will leave it out for now.
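If you do want to keep array evaluation, a vectorized variant is straightforward, e.g. with np.where (a sketch, reusing the Gaussian helpers defined above):
def mixture_model_vectorized(x, mu1, sigma1, A1, mu2, sigma2, A2):
    """Sum of the two Gaussian bumps, set to 0 outside the domain [0, 1]."""
    x = np.asarray(x, dtype=float)
    vals = (truncated_gaussian_lower(x, mu1, sigma1, A1)
            + truncated_gaussian_upper(x, mu2, sigma2, A2))
    return np.where((x < 0) | (x > 1), 0.0, vals)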
Anyway, we want the integral over the domain [0, 1] to be 1, and one way to do this (feel free to play around with the bandwidth estimator in stats.gaussian_kde as well) is to divide the probability density estimate by its integral over the domain. Take care that optimize.curve_fit only allows 1400 function evaluations (maxfev) by default in this setup, so the initial parameter estimates matter.
from scipy import integrate
sum_prob = integrate.quad(kde, 0 , 1)[0]
y = kde(u)/sum_prob
# Estimates: mu sigma A
estimates= [0.15, 1, 5,
0.95, 0.5, 3]
params,cov= optimize.curve_fit(f=mixture_model, xdata=u, ydata=y, p0=estimates)
>>array([ 6.72247814e-03, 7.46265651e-02, 7.23699661e+00,
9.80068414e-01, 7.65499825e-02, 2.35533297e+00])
y3 = mixture_model(np.arange(0,1,0.001), params[0], params[1], params[2],
params[3], params[4], params[5])
with plt.style.context("seaborn-white"):
    fig, ax = plt.subplots()
    sns.kdeplot(u, color="black", ax=ax)
    ax.axvline(0, linestyle=":", color="red")
    ax.axvline(1, linestyle=":", color="red")
    plt.plot(np.arange(0, 1, 0.001), y3)  # The red line is now your custom pdf with area-under-curve = 0.998 in the domain.
To check the area under the curve, I used this hacky solution of redefining mixture_model:
def mixture_model(x):
    mu1 = params[0]; sigma1 = params[1]; A1 = params[2]; mu2 = params[3]; sigma2 = params[4]; A2 = params[5]
    return truncated_gaussian_lower(x, mu1, sigma1, A1) + truncated_gaussian_upper(x, mu2, sigma2, A2)
from scipy import integrate
integrated_value, error = integrate.quad(mixture_model, 0, 1) #0 lower bound, 1 upper bound
>>(0.9978588016186962, 5.222293368393178e-14)
Or doing the integral a second way:
import sympy
x = sympy.symbols('x', real=True, nonnegative=True)
foo = sympy.integrate(params[2]*sympy.exp((-(x-params[0])**2)/(2*params[1]**2))+params[5]*sympy.exp((-(x-params[3])**2)/(2*params[4]**2)),(x,0,1), manual=True)
foo.doit()
>>0.562981541724715*sqrt(pi) #this evaluates to 0.9978588016186956
And actually doing it your way as described in your edited question:
def mixture_model(x, mu1, sigma1, A1, mu2, sigma2, A2):
    return truncated_gaussian_lower(x, mu1, sigma1, A1) + truncated_gaussian_upper(x, mu2, sigma2, A2)
integrate.quad(lambda x: mixture_model(x, *params), 0,1)[0]
>>0.9978588016186962
If I set my bandwidth to your level (2e-2), the evaluation indeed comes down to 0.92, which is a worse result than the 0.998 we had earlier, but that is still significantly different from the 0.3 you report, which is something I cannot recreate even while copying your code snippets. Do you perhaps accidentally redefine functions/variables somewhere?
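To see the bandwidth effect directly, a small check can be run (a sketch, reusing u, estimates and the unclipped 6-parameter mixture_model from above):
from scipy import stats, optimize, integrate

for bw in (None, 5e-2, 2e-2):  # None = scipy's default bandwidth
    kde_bw = stats.gaussian_kde(u, bw_method=bw)
    y_bw = kde_bw(u) / integrate.quad(kde_bw, 0, 1)[0]
    p_bw, _ = optimize.curve_fit(mixture_model, u, y_bw, p0=estimates, maxfev=5000)
    area = integrate.quad(lambda x: mixture_model(x, *p_bw), 0, 1)[0]
    print(bw, round(area, 3))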
Related
I want to achieve a result similar to what the Solver function in Excel produces. I've been reading about SciPy optimization and have been trying to build a function whose output I would like to maximize. The function depends on four different variables; see my code below:
import pandas as pd
import numpy as np
from scipy import optimize
cols = {
'Dividend2': [9390, 7448, 177],
'Probability': [341, 376, 452],
'EV': [0.53, 0.60, 0.55],
'Dividend': [185, 55, 755],
'EV2': [123, 139, 544],
}
df = pd.DataFrame(cols)
def myFunc(params):
    """myFunc metric."""
    (ev, bv, vc, dv) = params
    df['Number'] = np.where(df['Dividend2'] <= vc, 1, 0) \
        + np.where(df['EV2'] <= dv, 1, 0)
    df['Return'] = np.where(
        df['EV'] <= ev, 0, np.where(
            df['Probability'] >= bv, 0, df['Number'] * df['Dividend'] - (vc + dv)
        )
    )
    return -1 * (df['Return'].sum())
b1 = [(0.2,4), (300,600), (0,1000), (0,1000)]
start = [0.2, 600, 1000, 1000]
result = optimize.minimize(fun=myFunc, bounds=b1, x0=start)
print(result)
So I'd like to find the maximum value of the column Return in df when changing the variables ev, bv, vc and dv. I'd like them to stay within the intervals ev: 0.2-4, bv: 300-600, vc: 0-1000 and dv: 0-1000.
When running my code, it seems like the function stops at x0.
Solution
I will use the optuna library to give you a solution to the type of problem you are trying to solve. I have tried using scipy.optimize.minimize and it appears that the loss landscape is probably quite flat in most places, and hence the tolerances force the minimizing algorithm (L-BFGS-B) to stop prematurely.
Optuna Docs: https://optuna.readthedocs.io/en/stable/index.html
With optuna, it is rather straightforward. Optuna only requires an objective function and a study. The study sends various trials to the objective function, which in turn evaluates the metric of your choice.
I have defined another metric function, myFunc2, by mostly removing the np.where calls, as you can do away with them (this reduces the number of steps) and it makes the function slightly faster.
# install optuna with pip
pip install -Uqq optuna
Sometimes it is necessary to visualize the loss landscape itself; the answer in section B elaborates on visualization. But what if you want to use a smoother metric function? Section D sheds some light on this.
Order of code-execution should be:
Sections: C >> B >> B.1 >> B.2 >> B.3 >> A.1 >> A.2 >> D
A. Building Intuition
If you create a hiplot (also known as a plot with parallel-coordinates) with all the possible parameter values as mentioned in the search_space for Section B.2, and plot the lowest 50 outputs of myFunc2, it would look like this:
Plotting all such points from the search_space would look like this:
A.1. Loss Landscape Views for Various Parameter-Pairs
These figures show that the loss landscape is mostly flat for any two of the four parameters (ev, bv, vc, dv). This could be a reason why only GridSampler (which brute-forces the searching process) does better, compared to the other two samplers (TPESampler and RandomSampler). Please click on any of the images below to view them enlarged. This could also be the reason why scipy.optimize.minimize(method="L-BFGS-B") fails right off the bat.
01. dv-vc
02. dv-bv
03. dv-ev
04. bv-ev
05. vc-ev
06. vc-bv
# Create contour plots for parameter-pairs
study_name = "GridSampler"
study = studies.get(study_name)
views = [("dv", "vc"), ("dv", "bv"), ("dv", "ev"),
("bv", "ev"), ("vc", "ev"), ("vc", "bv")]
for i, (x, y) in enumerate(views):
    print(f"Figure: {i}/{len(views)}")
    study_contour_plot(study=study, params=(x, y))
A.2. Parameter Importance
study_name = "GridSampler"
study = studies.get(study_name)
fig = optuna.visualization.plot_param_importances(study)
fig.update_layout(title=f'Hyperparameter Importances: {study.study_name}',
autosize=False,
width=800, height=500,
margin=dict(l=65, r=50, b=65, t=90))
fig.show()
B. Code
Section B.2. finds the lowest metric -88.333 for:
{'ev': 0.2, 'bv': 500.0, 'vc': 222.2222, 'dv': 0.0}
import warnings
from functools import partial
from typing import Callable, Dict, Iterable, List, Optional, Tuple, Union
import pandas as pd
import numpy as np
import optuna
from tqdm.notebook import tqdm
warnings.filterwarnings("ignore", category=optuna.exceptions.ExperimentalWarning)
optuna.logging.set_verbosity(optuna.logging.WARNING)
PARAM_NAMES: List[str] = ["ev", "bv", "vc", "dv"]

def myFunc2(params):
    """myFunc metric v2 with fewer steps."""
    global df  # uses the globally defined DataFrame
    (ev, bv, vc, dv) = params
    df['Number'] = (df['Dividend2'] <= vc) * 1 + (df['EV2'] <= dv) * 1
    df['Return'] = (
        (df['EV'] > ev)
        * (df['Probability'] < bv)
        * (df['Number'] * df['Dividend'] - (vc + dv))
    )
    return -1 * (df['Return'].sum())

# Defined after myFunc2 so that the name exists at assignment time.
DEFAULT_METRIC_FUNC: Callable = myFunc2
def make_param_grid(
        bounds: List[Tuple[float, float]],
        param_names: Optional[List[str]] = None,
        num_points: int = 10,
        as_dict: bool = True,
) -> Union[pd.DataFrame, Dict[str, List[float]]]:
    """
    Create parameter search space.
    Example:
        grid = make_param_grid(bounds=b1, num_points=10, as_dict=True)
    """
    if param_names is None:
        param_names = PARAM_NAMES  # ["ev", "bv", "vc", "dv"]
    bounds = np.array(bounds)
    grid = np.linspace(start=bounds[:, 0],
                       stop=bounds[:, 1],
                       num=num_points,
                       endpoint=True,
                       axis=0)
    grid = pd.DataFrame(grid, columns=param_names)
    if as_dict:
        grid = grid.to_dict()
        for k, v in grid.items():
            grid.update({k: list(v.values())})
    return grid
def objective(trial,
              bounds: Optional[Iterable] = None,
              func: Optional[Callable] = None,
              param_names: Optional[List[str]] = None):
    """Objective function, necessary for optimizing with optuna."""
    if param_names is None:
        param_names = PARAM_NAMES
    if bounds is None:
        bounds = ((-10, 10) for _ in param_names)
    if not isinstance(bounds, dict):
        bounds = dict((p, (min(b), max(b)))
                      for p, b in zip(param_names, bounds))
    if func is None:
        func = DEFAULT_METRIC_FUNC
    params = dict(
        (p, trial.suggest_float(p, bounds.get(p)[0], bounds.get(p)[1]))
        for p in param_names
    )
    # x = trial.suggest_float('x', -10, 10)
    return func((params[p] for p in param_names))
def optimize(objective: Callable,
             sampler: Optional[optuna.samplers.BaseSampler] = None,
             func: Optional[Callable] = None,
             n_trials: int = 2,
             study_direction: str = "minimize",
             study_name: Optional[str] = None,
             formatstr: str = ".4f",
             verbose: bool = True):
    """Optimizing function using optuna: creates a study."""
    if func is None:
        func = DEFAULT_METRIC_FUNC
    study = optuna.create_study(
        direction=study_direction,
        sampler=sampler,
        study_name=study_name)
    study.optimize(
        objective,
        n_trials=n_trials,
        show_progress_bar=True,
        n_jobs=1,
    )
    if verbose:
        metric = eval_metric(study.best_params, func=myFunc2)
        msg = format_result(study.best_params, metric,
                            header=study.study_name,
                            format=formatstr)
        print(msg)
    return study
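# NOTE: `eval_metric` is used above but not shown in this snippet. A minimal
# stand-in, assuming it simply evaluates the metric function on the best
# parameters found by a study:
def eval_metric(best_params: Dict[str, float],
                func: Optional[Callable] = None) -> float:
    """Evaluate the metric function for a dict of best parameters."""
    if func is None:
        func = DEFAULT_METRIC_FUNC
    return func([best_params[p] for p in PARAM_NAMES])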
def format_dict(d: Dict[str, float], format: str = ".4f") -> Dict[str, float]:
    """
    Returns formatted output for a dictionary with
    string keys and float values.
    """
    return dict((k, float(f'{v:{format}}')) for k, v in d.items())

def format_result(d: Dict[str, float],
                  metric_value: float,
                  header: str = '',
                  format: str = ".4f"):
    """Returns formatted result."""
    msg = f"""Study Name: {header}\n{'='*30}
✅ study.best_params: \n\t{format_dict(d)}
✅ metric: {metric_value}
"""
    return msg
def study_contour_plot(study: optuna.Study,
                       params: Optional[List[str]] = None,
                       width: int = 560,
                       height: int = 500):
    """
    Create contour plots for a study, given a list or
    tuple of two parameter names.
    """
    if params is None:
        params = ["dv", "vc"]
    fig = optuna.visualization.plot_contour(study, params=params)
    fig.update_layout(
        title=f'Contour Plot: {study.study_name} ({params[0]}, {params[1]})',
        autosize=False,
        width=width,
        height=height,
        margin=dict(l=65, r=50, b=65, t=90))
    fig.show()
bounds = [(0.2, 4), (300, 600), (0, 1000), (0, 1000)]
param_names = PARAM_NAMES # ["ev", "bv", "vc", "dv",]
pobjective = partial(objective, bounds=bounds)
# Create an empty dict to contain
# various subsequent studies.
studies = dict()
Optuna comes with a few different types of Samplers. Samplers provide the strategy of how optuna is going to sample points from the parameter space and evaluate the objective function.
https://optuna.readthedocs.io/en/stable/reference/samplers.html
B.1 Use TPESampler
from optuna.samplers import TPESampler
sampler = TPESampler(seed=42)
study_name = "TPESampler"
studies[study_name] = optimize(
pobjective,
sampler=sampler,
n_trials=100,
study_name=study_name,
)
# Study Name: TPESampler
# ==============================
#
# ✅ study.best_params:
# {'ev': 1.6233, 'bv': 585.2143, 'vc': 731.9939, 'dv': 598.6585}
# ✅ metric: -0.0
B.2. Use GridSampler
GridSampler requires a parameter search grid. Here we are using the following search_space.
from optuna.samplers import GridSampler
# create search-space
search_space = make_param_grid(bounds=bounds, num_points=10, as_dict=True)
sampler = GridSampler(search_space)
study_name = "GridSampler"
studies[study_name] = optimize(
pobjective,
sampler=sampler,
n_trials=2000,
study_name=study_name,
)
# Study Name: GridSampler
# ==============================
#
# ✅ study.best_params:
# {'ev': 0.2, 'bv': 500.0, 'vc': 222.2222, 'dv': 0.0}
# ✅ metric: -88.33333333333337
B.3. Use RandomSampler
from optuna.samplers import RandomSampler
sampler = RandomSampler(seed=42)
study_name = "RandomSampler"
studies[study_name] = optimize(
pobjective,
sampler=sampler,
n_trials=300,
study_name=study_name,
)
# Study Name: RandomSampler
# ==============================
#
# ✅ study.best_params:
# {'ev': 1.6233, 'bv': 585.2143, 'vc': 731.9939, 'dv': 598.6585}
# ✅ metric: -0.0
C. Dummy Data
For the sake of reproducibility, I am keeping a record of the dummy data used here.
import pandas as pd
import numpy as np
from scipy import optimize
cols = {
'Dividend2': [9390, 7448, 177],
'Probability': [341, 376, 452],
'EV': [0.53, 0.60, 0.55],
'Dividend': [185, 55, 755],
'EV2': [123, 139, 544],
}
df = pd.DataFrame(cols)
def myFunc(params):
    """myFunc metric."""
    (ev, bv, vc, dv) = params
    df['Number'] = np.where(df['Dividend2'] <= vc, 1, 0) \
        + np.where(df['EV2'] <= dv, 1, 0)
    df['Return'] = np.where(
        df['EV'] <= ev, 0, np.where(
            df['Probability'] >= bv, 0, df['Number'] * df['Dividend'] - (vc + dv)
        )
    )
    return -1 * (df['Return'].sum())
b1 = [(0.2,4), (300,600), (0,1000), (0,1000)]
start = [0.2, 600, 1000, 1000]
result = optimize.minimize(fun=myFunc, bounds=b1, x0=start)
print(result)
C.1. An Observation
So, it seems at first glance that the code executed properly and did not throw any error. It says it had success in finding the minimized solution.
fun: -0.0
hess_inv: <4x4 LbfgsInvHessProduct with dtype=float64>
jac: array([0., 0., 3., 3.])
message: b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL' # 💡
nfev: 35
nit: 2
status: 0
success: True
x: array([2.e-01, 6.e+02, 0.e+00, 0.e+00]) # 🔥
A close observation reveals that the solution (see 🔥) is no different from the starting point [0.2, 600, 1000, 1000]. So it seems like nothing really happened and the algorithm just finished prematurely?!
Now look at the message above (see 💡). If we run a Google search on it, we find something like this:
Summary
b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
If the loss landscape does not have a smoothly changing topography, the gradient-descent algorithms will soon find that there isn't much change happening from one iteration to the next, and hence will terminate further seeking. Also, if the loss landscape is rather flat, the search can meet a similar fate and terminate early.
scipy-optimize-minimize does not perform the optimization - CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
D. Making the Loss Landscape Smoother
A binary evaluation of value = 1 if x > 5 else 0 is essentially a step function that assigns 1 to all values of x greater than 5 and 0 otherwise. But this introduces a kink, a discontinuity in smoothness, and this could potentially introduce problems when traversing the loss landscape.
What if we use a sigmoid function to introduce some smoothness?
# Define sigmoid function
def sigmoid(x):
    """Sigmoid function."""
    return 1 / (1 + np.exp(-x))
For the above example, we could modify it as shown in the sketch below.
You can additionally introduce another factor (gamma: γ) and try to tune it to make the landscape smoother. By controlling the gamma factor, you can make the function smoother and change how quickly it changes around x = 5.
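In code, that modification could look like this (a sketch; the values of x and gamma are just illustrative examples):
x = 6.0       # example value to compare against the threshold of 5
gamma = 0.3   # example smoothing factor
value_step = 1 if x > 5 else 0            # original hard step
value_smooth = sigmoid(x - 5)             # smooth approximation
value_gamma = sigmoid((x - 5) / gamma)    # smoothness controlled by gamma
print(value_step, value_smooth, value_gamma)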
The above figure is created with the following code-snippet.
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'svg' # 'svg', 'retina'
plt.style.use('seaborn-white')
def make_figure(figtitle: str = "Sigmoid Function"):
    """Make the demo figure for using sigmoid."""
    x = np.arange(-20, 20.01, 0.01)
    y1 = sigmoid(x)
    y2 = sigmoid(x - 5)
    y3 = sigmoid((x - 5)/3)
    y4 = sigmoid((x - 5)/0.3)
    fig, ax = plt.subplots(figsize=(10, 5))
    plt.sca(ax)
    plt.plot(x, y1, ls="-", label=r"$\sigma(x)$")
    plt.plot(x, y2, ls="--", label=r"$\sigma(x - 5)$")
    plt.plot(x, y3, ls="-.", label=r"$\sigma((x - 5) / 3)$")
    plt.plot(x, y4, ls=":", label=r"$\sigma((x - 5) / 0.3)$")
    plt.axvline(x=0, ls="-", lw=1.3, color="cyan", alpha=0.9)
    plt.axvline(x=5, ls="-", lw=1.3, color="magenta", alpha=0.9)
    plt.legend()
    plt.title(figtitle)
    plt.show()
make_figure()
D.1. Example of Metric Smoothing
The following is an example of how you could apply function smoothing.
from functools import partial
def sig(x, gamma: float = 1.):
    return sigmoid(x/gamma)

def myFunc3(params, gamma: float = 0.5):
    """myFunc metric v3 with smoother metric."""
    (ev, bv, vc, dv) = params
    _sig = partial(sig, gamma=gamma)
    df['Number'] = _sig(x=-(df['Dividend2'] - vc)) * 1 \
        + _sig(x=-(df['EV2'] - dv)) * 1
    df['Return'] = (
        _sig(x=df['EV'] - ev)
        * _sig(x=-(df['Probability'] - bv))
        * _sig(x=df['Number'] * df['Dividend'] - (vc + dv))
    )
    return -1 * (df['Return'].sum())
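A minimal sketch of how the smoothed metric could then be handed to scipy.optimize.minimize (df, b1 and start are assumed to be defined as in section C; the gamma value is just the default used above):
from scipy.optimize import minimize

res_smooth = minimize(
    fun=lambda p: myFunc3(p, gamma=0.5),
    x0=start,
    bounds=b1,
    method="L-BFGS-B",
)
print(res_smooth.x, res_smooth.fun)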
As already mentioned in my comment, the crucial problem is that np.where() is neither differentiable nor continuous. Consequently, your objective function violates the mathematical assumptions of most of the (derivative-based) algorithms under the hood of scipy.optimize.minimize.
So, basically, you've got three options:
Use a derivative-free algorithm and hope for the best.
Replace np.where() with a smooth approximation such that your objective is continuously differentiable.
Reformulate your problem as a MIP.
Since CypherX's answer pursues approach 1, I'd like to focus on approach 2. Here, the main idea is to approximate the np.where function. One possible approximation is
def smooth_if_then(x):
    eps = 1e-12
    return 0.5 + x/(2*np.sqrt(eps + x*x))
which is continuous and differentiable. Then, given an np.ndarray arr and a scalar value x, the expression np.where(arr <= x, 1, 0) is approximated by smooth_if_then(x - arr).
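A quick numerical check of that approximation (arr and x are arbitrary example values):
import numpy as np

arr = np.array([9390.0, 7448.0, 177.0])
x = 1000.0
print(np.where(arr <= x, 1, 0))           # [0 0 1]
print(smooth_if_then(x - arr).round(3))   # approximately [0. 0. 1.]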
Hence, the objective function becomes:
div = df['Dividend'].values
div2 = df['Dividend2'].values
ev2 = df['EV2'].values
ev = df['EV'].values
prob = df['Probability'].values
def objective(x, *params):
ev, bv, vc, dv = x
div_vals, div2_vals, ev2_vals, ev_vals, prob_vals = params
number = smooth_if_then(vc - div2_vals) + smooth_if_then(dv - ev2_vals)
part1 = smooth_if_then(bv - prob_vals) * (number * div_vals - (vc + dv))
part2 = smooth_if_then(-1*(ev - ev_vals)) * part1
return -1 * part2.sum()
and using the trust-constr algorithm (which is the most robust one inside scipy.optimize.minimize) yields:
res = optimize.minimize(lambda x: objective(x, div, div2, ev2, ev, prob), x0=start, bounds=b1, method="trust-constr")
barrier_parameter: 1.0240000000000006e-08
barrier_tolerance: 1.0240000000000006e-08
cg_niter: 5
cg_stop_cond: 0
constr: [array([8.54635975e-01, 5.99253512e+02, 9.95614973e+02, 9.95614973e+02])]
constr_nfev: [0]
constr_nhev: [0]
constr_njev: [0]
constr_penalty: 1.0
constr_violation: 0.0
execution_time: 0.2951819896697998
fun: 1.3046631387761482e-08
grad: array([0.00000000e+00, 0.00000000e+00, 8.92175218e-12, 8.92175218e-12])
jac: [<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>]
lagrangian_grad: array([-3.60651033e-09, 4.89643010e-09, 2.21847918e-09, 2.21847918e-09])
message: '`gtol` termination condition is satisfied.'
method: 'tr_interior_point'
nfev: 20
nhev: 0
nit: 14
niter: 14
njev: 4
optimality: 4.896430096425101e-09
status: 1
success: True
tr_radius: 478515625.0
v: [array([-3.60651033e-09, 4.89643010e-09, 2.20955743e-09, 2.20955743e-09])]
x: array([8.54635975e-01, 5.99253512e+02, 9.95614973e+02, 9.95614973e+02])
Last but not least: using smooth approximations is a common way to achieve differentiability. However, it's worth mentioning that these approximations are not convex. In practice, this means that your optimization problem is not convex and thus you have no guarantee that a found stationary point (local minimizer) is a global optimum. To this end, one either needs to use a global optimization algorithm or formulate the problem as a MIP. The latter is the recommended approach, both from a mathematical and a practical point of view.
I have a curve parameterized by time that intersects a shape (in this case just a rectangle). Following this elegant suggestion, I used shapely to determine where the objects intersect, however from there on, I struggle to find a good solution for when that happens. Currently, I am approximating the time awkwardly by finding the point of the curve that is closest (in space) to the intersection, and then using its time stamp.
But I believe there should be a better solution, e.g. by solving the polynomial equation, maybe using the roots method of a NumPy polynomial. I'm just not sure how to do this, because I guess you would need to somehow introduce tolerances, as it is likely that the curve will never assume exactly the same intersection coordinates as determined by shapely.
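For what it is worth, the polynomial-root idea can be pursued directly: for a given rectangle edge, e.g. y = 1300, shift the constant term of the quadratic y(t) and keep the real roots whose x(t) falls inside the rectangle's x-range. A minimal sketch, using the coefficients and rectangle from the code below:
import numpy as np

coeffs = np.array([
    [-2.65053088e-05, 2.76890591e-05],
    [-5.70681576e-02, -2.69415587e-01],
    [7.92564148e+02, 6.88557419e+02],
])

y_edge = 1300.0                       # lower y-edge of the rectangle
cy = coeffs[:, 1].copy()
cy[-1] -= y_edge                      # solve y(t) - 1300 = 0
t_roots = np.roots(cy)
t_roots = t_roots[np.isreal(t_roots)].real
x_at_roots = np.polyval(coeffs[:, 0], t_roots)
inside = (x_at_roots >= 700) & (x_at_roots <= 1000)
print(t_roots[inside], x_at_roots[inside])  # crossing time(s) and x position(s)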
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle, Ellipse
from matplotlib.collections import LineCollection
from shapely.geometry import LineString, Polygon
# the parameterized curve
coeffs = np.array([
[-2.65053088e-05, 2.76890591e-05],
[-5.70681576e-02, -2.69415587e-01],
[7.92564148e+02, 6.88557419e+02],
])
t_fit = np.linspace(-2400, 3600, 1000)
x_fit = np.polyval(coeffs[:, 0], t_fit)
y_fit = np.polyval(coeffs[:, 1], t_fit)
curve = LineString(np.column_stack((x_fit, y_fit)))
# the shape it intersects
area = {'x': [700, 1000], 'y': [1300, 1400]}
area_shape = Polygon([
(area['x'][0], area['y'][0]),
(area['x'][1], area['y'][0]),
(area['x'][1], area['y'][1]),
(area['x'][0], area['y'][1]),
])
# attempt at finding the time of intersection
intersection = curve.intersection(area_shape).coords[-1]
distances = np.hypot(x_fit-intersection[0], y_fit-intersection[1])
idx = np.where(distances == min(distances))
fit_intersection = x_fit[idx][0], y_fit[idx][0]
t_intersection = t_fit[idx]
print(t_intersection)
# code for visualization
fig, ax = plt.subplots(figsize=(5, 5))
ax.margins(0.4, 0.2)
ax.invert_yaxis()
area_artist = Rectangle(
(area['x'][0], area['y'][0]),
width=area['x'][1] - area['x'][0],
height=area['y'][1] - area['y'][0],
edgecolor='gray', facecolor='none'
)
ax.add_artist(area_artist)
points = np.array([x_fit, y_fit]).T.reshape(-1, 1, 2)
segments = np.concatenate([points[:-1], points[1:]], axis=1)
z = np.linspace(0, 1, points.shape[0])
norm = plt.Normalize(z.min(), z.max())
lc = LineCollection(
segments, cmap='autumn', norm=norm, alpha=1,
linewidths=2, picker=8, capstyle='round',
joinstyle='round'
)
lc.set_array(z)
ax.add_collection(lc)
ax.autoscale_view()
ax.relim()
trans = (ax.transData + ax.transAxes.inverted()).transform
intersection_point = Ellipse(
xy=trans(fit_intersection), width=0.02, height=0.02, fc='none',
ec='black', transform=ax.transAxes, zorder=3,
)
ax.add_artist(intersection_point)
plt.show()
And just for the visuals, here is what the problem looks like in a plot:
The best approach is to use interpolation functions to compute (x(t), y(t)), and a function to compute d(t), the distance to the intersection. Then we use scipy.optimize.minimize on d(t) to find the value of t at which d(t) is minimal. Interpolation will ensure good accuracy.
So, I added a few modifications to your code:
Definitions of the interpolation functions and the distance calculation.
A test for whether there is indeed an intersection; otherwise the rest doesn't make sense.
Computation of the intersection time by minimization.
The code (UPDATED):
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle, Ellipse
from matplotlib.collections import LineCollection
from shapely.geometry import LineString, Polygon
from scipy.optimize import minimize
# Interpolate (x,y) at time t:
def interp_xy(t, tp, fpx, fpy):
    # tp: time grid points; fpx, fpy: the corresponding x, y values
    x = np.interp(t, tp, fpx)
    y = np.interp(t, tp, fpy)
    return x, y

# Compute distance to intersection:
def dist_to_intersect(t, tp, fpx, fpy, intersection):
    x, y = interp_xy(t, tp, fpx, fpy)
    d = np.hypot(x - intersection[0], y - intersection[1])
    return d
# the parameterized curve
t_fit = np.linspace(-2400, 3600, 1000)
#t_fit = np.linspace(-4200, 0, 1000)
coeffs = np.array([[-2.65053088e-05, 2.76890591e-05],[-5.70681576e-02, -2.69415587e-01],[7.92564148e+02, 6.88557419e+02],])
#t_fit = np.linspace(-2400, 3600, 1000)
#coeffs = np.array([[4.90972365e-05, -2.03897149e-04],[2.19222264e-01, -1.63335372e+00],[9.33624672e+02, 1.07067102e+03], ])
#t_fit = np.linspace(-2400, 3600, 1000)
#coeffs = np.array([[-2.63100091e-05, -7.16542227e-05],[-5.60829940e-04, -3.19183803e-01],[7.01544289e+02, 1.24732452e+03], ])
#t_fit = np.linspace(-2400, 3600, 1000)
#coeffs = np.array([[-2.63574223e-05, -9.15525038e-05],[-8.91039302e-02, -4.13843734e-01],[6.35650643e+02, 9.40010900e+02], ])
x_fit = np.polyval(coeffs[:, 0], t_fit)
y_fit = np.polyval(coeffs[:, 1], t_fit)
curve = LineString(np.column_stack((x_fit, y_fit)))
# the shape it intersects
area = {'x': [700, 1000], 'y': [1300, 1400]}
area_shape = Polygon([
(area['x'][0], area['y'][0]),
(area['x'][1], area['y'][0]),
(area['x'][1], area['y'][1]),
(area['x'][0], area['y'][1]),
])
# attempt at finding the time of intersection
curve_intersection = curve.intersection(area_shape)
# We check if intersection is empty or not:
if not curve_intersection.is_empty:
    # We can get the coords because the intersection is not empty
    intersection = curve_intersection.coords[-1]
    distances = np.hypot(x_fit - intersection[0], y_fit - intersection[1])
    print("Looking for minimal distance to intersection: ")
    print('-------------------------------------------------------------------------')
    # Call to minimize:
    # We pass:
    # - the function to be minimized (dist_to_intersect)
    # - a starting value for t
    # - arguments, method and the convergence tolerance tol=1e-6
    # - option: here --> verbose
    dmin = np.min((x_fit - intersection[0])**2 + (y_fit - intersection[1])**2)
    index = np.where((x_fit - intersection[0])**2 + (y_fit - intersection[1])**2 == dmin)
    t0 = t_fit[index]
    res = minimize(dist_to_intersect, t0, args=(t_fit, x_fit, y_fit, intersection),
                   method='Nelder-Mead', tol=1e-6, options={'disp': True})
    print('-------------------------------------------------------------------------')
    print("Result of the optimization:")
    print(res)
    print('-------------------------------------------------------------------------')
    print("Intersection at time t = ", res.x[0])
    fit_intersection = interp_xy(res.x[0], t_fit, x_fit, y_fit)
    print("Intersection point : ", fit_intersection)
else:
    print("No intersection.")
# code for visualization
fig, ax = plt.subplots(figsize=(5, 5))
ax.margins(0.4, 0.2)
ax.invert_yaxis()
area_artist = Rectangle(
(area['x'][0], area['y'][0]),
width=area['x'][1] - area['x'][0],
height=area['y'][1] - area['y'][0],
edgecolor='gray', facecolor='none'
)
ax.add_artist(area_artist)
#plt.plot(x_fit,y_fit)
points = np.array([x_fit, y_fit]).T.reshape(-1, 1, 2)
segments = np.concatenate([points[:-1], points[1:]], axis=1)
z = np.linspace(0, 1, points.shape[0])
norm = plt.Normalize(z.min(), z.max())
lc = LineCollection(
segments, cmap='autumn', norm=norm, alpha=1,
linewidths=2, picker=8, capstyle='round',
joinstyle='round'
)
lc.set_array(z)
ax.add_collection(lc)
# Again, we check that the intersection exists because we don't want to draw
# a non-existing point (it would generate an error)
if not curve_intersection.is_empty:
    plt.plot(fit_intersection[0], fit_intersection[1], 'o')
plt.show()
OUTPUT:
Looking for minimal distance to intersection:
-------------------------------------------------------------------------
Optimization terminated successfully.
Current function value: 0.000000
Iterations: 31
Function evaluations: 62
-------------------------------------------------------------------------
Result of the optimization:
final_simplex: (array([[-1898.91943932],
[-1898.91944021]]), array([8.44804735e-09, 3.28684898e-07]))
fun: 8.448047349426054e-09
message: 'Optimization terminated successfully.'
nfev: 62
nit: 31
status: 0
success: True
x: array([-1898.91943932])
-------------------------------------------------------------------------
Intersection at time t = -1898.919439315796
Intersection point : (805.3563860471179, 1299.9999999916085)
Whereas your code gives a much less precise result:
t=-1901.5015015
intersection point: (805.2438793482748,1300.9671136070717)
Figure:
I am currently trying to evaluate some data of mine and tried replicating the fit function described here: https://www.graphpad.com/guides/prism/latest/curve-fitting/reg_classic_dr_variable.htm
At first I was having some trouble with numpy.float_power overflowing, but I think I fixed it (did I really?).
I am now using scipy.optimize.curve_fit to fit the described sigmoid to my data, but it never actually seems to fit, but instead produces constant functions and I have no idea why.
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
'''
Just a method that produces some simple test data
'''
def test_data_1():
    return np.array([[0.000610352, 0.002441406, 0.009765625, 0.0390625, 0.15625, 0.625, 2.5, 10],
                     [0.89, 0.81, 0.64, 0.48, 0.45, 0.50, 0.58, 0.70]])
'''
Just a simple method that produces some more test data
'''
def test_data_2():
    return np.array([[0.000610352, 0.002441406, 0.009765625, 0.0390625, 0.15625, 0.625, 2.5, 10],
                     [1, 0.83, 0.68, 0.52, 0.48, 0.59, 0.75, 0.62]])
'''
Dose response curve as described in: https://www.graphpad.com/guides/prism/latest/curve-fitting/reg_classic_dr_variable.htm
'''
def sigmoidal_dose_response_with_variable_slope(x_data, *params):
    # Extract relevant parameters. Flattening the array just in case?
    r_params = np.array(params).flatten()
    bottom = r_params[0]
    top = r_params[1]
    logec50 = r_params[2]
    slope = r_params[3]
    # Calculating the numerator
    numerator = top - bottom
    # Calculating the denominator
    denominator = 1 + np.float_power(10, (logec50 - x_data) * slope, dtype=np.longdouble)
    return np.array(bottom + (numerator / denominator), dtype=np.float64)
if __name__ == "__main__":
    x_data, y_data = test_data_1()
    # Guessing bottom and top as the lowest and highest y-values.
    bottom_guess = np.min(y_data)
    bottom_guess_idx = np.argmin(y_data)
    top_guess = np.max(y_data)
    top_guess_idx = np.argmax(y_data)
    # Guessing logec50 as the middle between those parameters
    logec50_guess = np.linalg.norm(x_data[top_guess_idx] - x_data[bottom_guess_idx]) / 2 \
        + np.min([x_data[top_guess_idx], x_data[bottom_guess_idx]])
    # Guessing a slope of 1
    slope_guess = 1
    p0 = [bottom_guess, top_guess, logec50_guess, slope_guess]
    # Fitting the curve to my data
    popt, pcov = curve_fit(sigmoidal_dose_response_with_variable_slope, x_data, y_data, p0)
    # Making the x-axis scale logarithmically
    fig, ax = plt.subplots()
    ax.set_xscale('log')
    # Plot my data
    plt.plot(x_data, y_data, 's')
    # Calculate function data. The borders are merely a guess
    x_val = np.linspace(0, 10, 100)
    y_val = sigmoidal_dose_response_with_variable_slope(x_val, popt)
    # Plot
    plt.plot(x_val, y_val)
    plt.show()
It should be easily testable.
Update:
Something like this is what I am looking for:
I am doing linear regression with two dimensional variables:
filtered[['p_tag_x', 'p_tag_y', 's_tag_x', 's_tag_y']].head()
p_tag_x p_tag_y s_tag_x s_tag_y
35 589.665646 1405.580171 517.5 1636.5
36 589.665646 1405.580171 679.5 1665.5
100 610.546851 2425.303250 569.5 2722.0
101 610.546851 2425.303250 728.0 2710.0
102 717.237730 1411.842428 820.0 1616.5
clt = linear_model.LinearRegression()
clt.fit(filtered[['p_tag_x', 'p_tag_y']], filtered[['s_tag_x', 's_tag_y']])
I am getting the following coefficients from the regression:
clt.coef_
array([[ 0.4529769 , -0.22406594],
[-0.00859452, -0.00816968]])
And the residues (X_0 and Y_0):
clt.residues_
array([ 1452.97816371, 69.12754694])
How should I understand the above coefficient matrix in terms of the regression line?
As I already explained in the comments, you get an extra dimension in your coef_ as well as intercept_ because you have 2 targets (y.shape == (n_samples, n_targets)). In this case sklearn will fit 2 independent regressors, one for each target.
You can then take those n regressors apart and handle each one on its own.
The formula of your regression line is still:
y(w, x) = intercept_ + coef_[0] * x[0] + coef_[1] * x[1] ...
Sadly your example is a bit harder to visualize because of the dimensionality.
Consider this a demo, with a lot of ugly hard-coding for this specific case (and bad example data!):
Code:
# Warning: ugly demo-like code using a lot of hard-coding!!!!!
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import linear_model
X = np.array([[589.665646, 1405.580171],
[589.665646, 1405.580171],
[610.546851, 2425.303250],
[610.546851, 2425.303250],
[717.237730, 1411.842428]])
y = np.array([[517.5, 1636.5],
[679.5, 1665.5],
[569.5, 2722.0],
[728.0, 2710.0],
[820.0, 1616.5]])
clt = linear_model.LinearRegression()
clt.fit(X, y)
print(clt.coef_)
print(clt.residues_)
def curve_0(x, y):  # target 0; single-point evaluation hardcoded for 2 features!
    return clt.intercept_[0] + x * clt.coef_[0, 0] + y * clt.coef_[0, 1]

def curve_1(x, y):  # target 1; single-point evaluation hardcoded for 2 features!
    return clt.intercept_[1] + x * clt.coef_[1, 0] + y * clt.coef_[1, 1]
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
xs = [np.amin(X[:, 0]), np.amax(X[:, 0])]
ys = [np.amin(X[:, 1]), np.amax(X[:, 1])]
# regressor 0
ax.scatter(X[:, 0], X[:, 1], y[:, 0], c='blue')
ax.plot([xs[0], xs[1]], [ys[0], ys[1]], [curve_0(xs[0], ys[0]), curve_0(xs[1], ys[1])], c='cyan')
# regressor 1
ax.scatter(X[:, 0], X[:, 1], y[:, 1], c='red')
ax.plot([xs[0], xs[1]], [ys[0], ys[1]], [curve_1(xs[0], ys[0]), curve_1(xs[1], ys[1])], c='magenta')
ax.set_xlabel('X[:, 0] feature 0')
ax.set_ylabel('X[:, 1] feature 1')
ax.set_zlabel('Y')
plt.show()
Output:
Remarks:
You don't have to calculate the formula by yourself: clt.predict() will do that!
The code lines involving ax.plot(...) use the assumption that our line is defined by just 2 points (it is linear)!
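As a quick check that the per-target formula and clt.predict() agree (a sketch, reusing clt and X from the demo above):
manual = clt.intercept_ + X @ clt.coef_.T   # row k of coef_ belongs to target k
print(np.allclose(manual, clt.predict(X)))  # True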
I hope that my question can be answered without runnable code, as it's too complex to create a small but running version. The following code is part of my project:
x0 = [0.5, 0.5]
solution = optimize.root(solveMe, x0, args=(Param, Result, False), method='broyden1')
if solution.status != 1:
    Result.__dict__
    two.plot_some_function(solveMe, np.arange(0.1, 1.5, 0.1), np.arange(0.1, 1.5, 0.1), Param, Result, False)
    raise Exception('did not converge')
solveMe is a function that returns a vector of two residuals, F(x0). Whenever root() does not converge, I create a grid between 0.1 and 1.5 for both variables and just plot the output of F(x0) for every x0 on that two-dimensional grid. I also check whether there are grid points at which both residuals are close to zero. The code follows:
# for debuggign
def plot_some_function(func, x, y, *args):
    from matplotlib import cm
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D
    X, Y = np.meshgrid(x, y)
    Z0 = np.empty(X.shape)
    Z1 = np.empty(X.shape)
    for idx in np.ndindex(X.shape):
        x, y = X[idx], Y[idx]
        Z0[idx], Z1[idx] = func([x, y], *args)
        if (abs(Z0[idx]) < 0.1) & (abs(Z1[idx]) < 0.1):
            print(idx)
    for Z in [Z0, Z1]:
        fig, ax = plt.subplots()
        p = ax.pcolor(X, Y, Z, cmap=cm.RdBu, vmin=abs(Z).min(), vmax=abs(Z).max())
        cb = fig.colorbar(p, ax=ax)
        plt.show()
I ran my main code and got an exception: the solution does not converge for a particular set of parameters. Following are the plots (which show that the function is actually quite smooth).
The idx output was
(1, 0)
(2, 1)
which corresponds to (0.2, 0.1) and (0.3, 0.2).
But why does root() not converge then? Its output follows
status: 2
success: False
fun: array([ 0.01725503, 0.25234002])
x: array([ 0.36981866, 0.4440247 ])
message: 'The maximum number of iterations allowed has been reached.'
nit: 300
Additional fine-tuning of the grid actually gives me coordinates for which:
solveMe([0.165, 0.258], Param, Result, False)
Out[6]: array([ 0.00012388, 0.00105457])
Which is much smaller than what the solver found.
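One thing that may be worth trying, given the grid result above, is to restart root() from that grid point (and/or with the default 'hybr' method); a minimal sketch, assuming solveMe, Param and Result are defined as in the code above:
x0_grid = [0.165, 0.258]  # grid point where both residuals were already ~1e-3
solution2 = optimize.root(solveMe, x0_grid, args=(Param, Result, False), method='hybr')
print(solution2.success, solution2.x, solution2.fun)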