Defining a numeric (custom) likelihood function in PyMC3 - python

After looking at several questions/answers (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11) and PyMC3's documentation, I've managed to create an MCVE of my MCMC setup (see below).
My fitted parameters are continuous and discrete, so the priors are defined using pm.Uniform and pm.DiscreteUniform (with a re-scaling applied to the latter). My likelihood function is particularly convoluted (it involves comparing the N-dimensional histograms of my observed data and some synthetic data generated using the free parameters), so I had to write it using theano's @as_op decorator.
The implementation shown here works on a toy model with random data, but the likelihood and parameters in my actual model are very similar.
My questions are:
1. Is this correct? Is there anything I should be doing differently?
2. The call to the likelihood function is just thrown in there, apparently doing nothing and connected to nothing. Is this the proper way to do this?
3. I'm using NUTS for the continuous parameters, but since my likelihood is numeric I don't think I should be able to do this. Since the code still runs, I'm not sure what's going on.
This is the first time I've used PyMC3 so any pointers will be really helpful.
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm
import theano.tensor as tt
from theano.compile.ops import as_op


def main():
    trace = bayesMCMC()
    print(pm.summary(trace))
    pm.traceplot(trace)
    plt.show()


def bayesMCMC():
    """
    Define and process the full model.
    """
    with pm.Model() as model:
        # Define uniform priors.
        A = pm.Uniform("A", lower=0., upper=5.)
        B = pm.Uniform("B", lower=10., upper=20.)
        C = pm.Uniform("C", lower=0., upper=1.)

        # Define discrete priors.
        minD, maxD, stepD = 0.005, 0.06, 0.005
        ND = int((maxD - minD) / stepD)
        D = pm.DiscreteUniform("D", 0., ND)
        minE, maxE, stepE = 9., 10., 0.05
        NE = int((maxE - minE) / stepE)
        E = pm.DiscreteUniform("E", 0., NE)

        # Is this correct??
        logp(A, B, C, D, E)

        step1 = pm.NUTS(vars=[A, B, C])
        print("NUTS")
        step2 = pm.Metropolis(vars=[D, E])
        print("Metropolis")

        trace = pm.sample(300, [step1, step2])  # , start)

    return trace


@as_op(
    itypes=[tt.dscalar, tt.dscalar, tt.dscalar, tt.lscalar, tt.lscalar],
    otypes=[tt.dscalar])
def logp(A, B, C, D, E):
    """
    Likelihood evaluation.
    """
    # Get observed data and some extra info to re-scale the discrete parameters
    obsData, minD, stepD, minE, stepE = observedData()

    # Scale discrete parameters
    D, E = D * stepD + minD, E * stepE + minE

    # Generate synthetic data using the prior values
    synthData = synthetic(A, B, C, D, E)

    # Generate N-dimensional histograms for both data sets.
    obsHist, edges = np.histogramdd(obsData)
    synHist, _ = np.histogramdd(synthData, bins=edges)

    # Flatten both histograms
    obsHist_f, synHist_f = obsHist.ravel(), synHist.ravel()
    # Remove all bins where N_bin=0.
    binNzero = obsHist_f != 0
    obsHist_f, synHist_f = obsHist_f[binNzero], synHist_f[binNzero]
    # Assign small value to the 0 elements in synHist_f to avoid issues with
    # the log()
    synHist_f[synHist_f == 0] = 0.001

    # Compare the histograms of the observed and synthetic data via a Poisson
    # likelihood ratio.
    lkl = -2. * np.sum(synHist_f - obsHist_f * np.log(synHist_f))

    return lkl


def observedData():
    """Some 'observed' data."""
    np.random.seed(12345)
    N = 1000
    obsData = np.random.uniform(0., 10., (N, 3))

    minD, stepD = 0.005, 0.005
    minE, stepE = 9., 0.05

    return obsData, minD, stepD, minE, stepE


def synthetic(A, B, C, D, E):
    """
    Dummy function to generate synthetic data. The actual function makes use
    of the A, B, C, D, E variables (obviously).
    """
    M = np.random.randint(100, 1000)
    synthData = np.random.uniform(0., 10., (M, 3))

    return synthData


if __name__ == "__main__":
    main()
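Not part of the original post, but for context: the usual way to wire up such a black-box log-likelihood (and what the bare logp(A, B, C, D, E) call above does not do) is to register it with pm.Potential (or pm.DensityDist), so that its value is added to the model's joint log-probability. A minimal sketch, reusing logp, ND and NE from the listing above, and assuming that no gradients are available through @as_op (which rules out NUTS):

with pm.Model() as model:
    # Same priors as in the listing above.
    A = pm.Uniform("A", lower=0., upper=5.)
    B = pm.Uniform("B", lower=10., upper=20.)
    C = pm.Uniform("C", lower=0., upper=1.)
    D = pm.DiscreteUniform("D", 0, ND)  # ND, NE as computed above
    E = pm.DiscreteUniform("E", 0, NE)

    # Register the black-box log-likelihood so it contributes to the joint
    # log-probability, instead of a bare call whose result is discarded.
    pm.Potential("like", logp(A, B, C, D, E))

    # @as_op provides no gradient, so NUTS cannot be used here; Metropolis
    # (or another gradient-free step method) handles both parameter groups.
    step1 = pm.Metropolis(vars=[A, B, C])
    step2 = pm.Metropolis(vars=[D, E])
    trace = pm.sample(300, step=[step1, step2])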

Related

Find the value of variables to maximize return of function in Python

I want to achieve a similar result to how the Solver function in Excel works. I've been reading about Scipy optimization and have been trying to build a function which outputs what I would like to find the maximal value of. The equation is based on four different variables; see my code below:
import pandas as pd
import numpy as np
from scipy import optimize

cols = {
    'Dividend2': [9390, 7448, 177],
    'Probability': [341, 376, 452],
    'EV': [0.53, 0.60, 0.55],
    'Dividend': [185, 55, 755],
    'EV2': [123, 139, 544],
}
df = pd.DataFrame(cols)

def myFunc(params):
    """myFunc metric."""
    (ev, bv, vc, dv) = params
    df['Number'] = np.where(df['Dividend2'] <= vc, 1, 0) \
        + np.where(df['EV2'] <= dv, 1, 0)
    df['Return'] = np.where(
        df['EV'] <= ev, 0, np.where(
            df['Probability'] >= bv, 0, df['Number'] * df['Dividend'] - (vc + dv)
        )
    )
    return -1 * (df['Return'].sum())

b1 = [(0.2, 4), (300, 600), (0, 1000), (0, 1000)]
start = [0.2, 600, 1000, 1000]

result = optimize.minimize(fun=myFunc, bounds=b1, x0=start)
print(result)
So I'd like to find the maximum value of the column Return in df when changing the variables ev, bv, vc and dv. I'd like them to be within the intervals ev: 0.2-4, bv: 300-600, vc: 0-1000 and dv: 0-1000.
When running my code, it seems like the function stops at x0.
Solution
I will use the optuna library to give you a solution to the type of problem you are trying to solve. I have tried using scipy.optimize.minimize, and it appears that the loss landscape is probably quite flat in most places, and hence the tolerances force the minimizing algorithm (L-BFGS-B) to stop prematurely.
Optuna Docs: https://optuna.readthedocs.io/en/stable/index.html
With optuna, it is rather straightforward. Optuna only requires an objective function and a study. The study sends various trials to the objective function, which in turn evaluates the metric of your choice.
I have defined another metric function, myFunc2, by mostly removing the np.where calls, as you can do away with them (this reduces the number of steps) and makes the function slightly faster.
# install optuna with pip
pip install -Uqq optuna
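As a quick orientation (this tiny snippet is mine, not part of the solution code below; the parameter name "x" and the toy quadratic are placeholders), the bare objective/study pattern optuna expects looks like this:

import optuna

def tiny_objective(trial):
    # optuna proposes a value for "x" on each trial ...
    x = trial.suggest_float("x", -10, 10)
    # ... and we return the metric to be minimized.
    return (x - 2) ** 2

study = optuna.create_study(direction="minimize")
study.optimize(tiny_objective, n_trials=20)
print(study.best_params, study.best_value)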
Although I have already pointed out the rather flat loss landscape, sometimes it is necessary to visualize the landscape itself. The answer in section B elaborates on visualization. But what if you want to use a smoother metric function? Section D sheds some light on this.
Order of code-execution should be:
Sections: C >> B >> B.1 >> B.2 >> B.3 >> A.1 >> A.2 >> D
A. Building Intuition
If you create a hiplot (also known as a plot with parallel-coordinates) with all the possible parameter values as mentioned in the search_space for Section B.2, and plot the lowest 50 outputs of myFunc2, it would look like this:
Plotting all such points from the search_space would look like this:
A.1. Loss Landscape Views for Various Parameter-Pairs
These figures show that the loss landscape is mostly flat for any two of the four parameters (ev, bv, vc, dv). This could be a reason why only GridSampler (which brute-forces the search) does better, compared to the other two samplers (TPESampler and RandomSampler). It could also be the reason why scipy.optimize.minimize(method="L-BFGS-B") fails right off the bat.
01. dv-vc
02. dv-bv
03. dv-ev
04. bv-ev
05. vc-ev
06. vc-bv
# Create contour plots for parameter-pairs
study_name = "GridSampler"
study = studies.get(study_name)

views = [("dv", "vc"), ("dv", "bv"), ("dv", "ev"),
         ("bv", "ev"), ("vc", "ev"), ("vc", "bv")]

for i, (x, y) in enumerate(views):
    print(f"Figure: {i}/{len(views)}")
    study_contour_plot(study=study, params=(x, y))
A.2. Parameter Importance
study_name = "GridSampler"
study = studies.get(study_name)
fig = optuna.visualization.plot_param_importances(study)
fig.update_layout(title=f'Hyperparameter Importances: {study.study_name}',
autosize=False,
width=800, height=500,
margin=dict(l=65, r=50, b=65, t=90))
fig.show()
B. Code
Section B.2. finds the lowest metric -88.333 for:
{'ev': 0.2, 'bv': 500.0, 'vc': 222.2222, 'dv': 0.0}
import warnings
from functools import partial
from typing import Callable, Dict, Iterable, List, Optional, Tuple, Union

import pandas as pd
import numpy as np
import optuna
from tqdm.notebook import tqdm

warnings.filterwarnings("ignore", category=optuna.exceptions.ExperimentalWarning)
optuna.logging.set_verbosity(optuna.logging.WARNING)

PARAM_NAMES: List[str] = ["ev", "bv", "vc", "dv"]

def myFunc2(params):
    """myFunc metric v2 with fewer steps."""
    global df  # defined as a global variable
    (ev, bv, vc, dv) = params
    df['Number'] = (df['Dividend2'] <= vc) * 1 + (df['EV2'] <= dv) * 1
    df['Return'] = (
        (df['EV'] > ev)
        * (df['Probability'] < bv)
        * (df['Number'] * df['Dividend'] - (vc + dv))
    )
    return -1 * (df['Return'].sum())

# Note: this assignment must come after the definition of myFunc2.
DEFAULT_METRIC_FUNC: Callable = myFunc2
def make_param_grid(
        bounds: List[Tuple[float, float]],
        param_names: Optional[List[str]] = None,
        num_points: int = 10,
        as_dict: bool = True,
) -> Union[pd.DataFrame, Dict[str, List[float]]]:
    """
    Create parameter search space.

    Example:
        grid = make_param_grid(bounds=b1, num_points=10, as_dict=True)
    """
    if param_names is None:
        param_names = PARAM_NAMES  # ["ev", "bv", "vc", "dv"]
    bounds = np.array(bounds)
    grid = np.linspace(start=bounds[:, 0],
                       stop=bounds[:, 1],
                       num=num_points,
                       endpoint=True,
                       axis=0)
    grid = pd.DataFrame(grid, columns=param_names)
    if as_dict:
        grid = grid.to_dict()
        for k, v in grid.items():
            grid.update({k: list(v.values())})
    return grid
def objective(trial,
              bounds: Optional[Iterable] = None,
              func: Optional[Callable] = None,
              param_names: Optional[List[str]] = None):
    """Objective function, necessary for optimizing with optuna."""
    if param_names is None:
        param_names = PARAM_NAMES
    if bounds is None:
        bounds = ((-10, 10) for _ in param_names)
    if not isinstance(bounds, dict):
        bounds = dict((p, (min(b), max(b)))
                      for p, b in zip(param_names, bounds))
    if func is None:
        func = DEFAULT_METRIC_FUNC
    params = dict(
        (p, trial.suggest_float(p, bounds.get(p)[0], bounds.get(p)[1]))
        for p in param_names
    )
    # x = trial.suggest_float('x', -10, 10)
    return func((params[p] for p in param_names))
def optimize(objective: Callable,
             sampler: Optional[optuna.samplers.BaseSampler] = None,
             func: Optional[Callable] = None,
             n_trials: int = 2,
             study_direction: str = "minimize",
             study_name: Optional[str] = None,
             formatstr: str = ".4f",
             verbose: bool = True):
    """Optimizing function using optuna: creates a study."""
    if func is None:
        func = DEFAULT_METRIC_FUNC
    study = optuna.create_study(
        direction=study_direction,
        sampler=sampler,
        study_name=study_name)
    study.optimize(
        objective,
        n_trials=n_trials,
        show_progress_bar=True,
        n_jobs=1,
    )
    if verbose:
        metric = eval_metric(study.best_params, func=myFunc2)
        msg = format_result(study.best_params, metric,
                            header=study.study_name,
                            format=formatstr)
        print(msg)
    return study
def format_dict(d: Dict[str, float], format: str = ".4f") -> Dict[str, float]:
    """
    Returns formatted output for a dictionary with
    string keys and float values.
    """
    return dict((k, float(f'{v:{format}}')) for k, v in d.items())

def format_result(d: Dict[str, float],
                  metric_value: float,
                  header: str = '',
                  format: str = ".4f"):
    """Returns formatted result."""
    msg = f"""Study Name: {header}\n{'='*30}
    ✅ study.best_params: \n\t{format_dict(d)}
    ✅ metric: {metric_value}
    """
    return msg
def study_contour_plot(study: optuna.Study,
                       params: Optional[List[str]] = None,
                       width: int = 560,
                       height: int = 500):
    """
    Create contour plots for a study, given a list or
    tuple of two parameter names.
    """
    if params is None:
        params = ["dv", "vc"]
    fig = optuna.visualization.plot_contour(study, params=params)
    fig.update_layout(
        title=f'Contour Plot: {study.study_name} ({params[0]}, {params[1]})',
        autosize=False,
        width=width,
        height=height,
        margin=dict(l=65, r=50, b=65, t=90))
    fig.show()
bounds = [(0.2, 4), (300, 600), (0, 1000), (0, 1000)]
param_names = PARAM_NAMES # ["ev", "bv", "vc", "dv",]
pobjective = partial(objective, bounds=bounds)
# Create an empty dict to contain
# various subsequent studies.
studies = dict()
Optuna comes with a few different types of Samplers. Samplers provide the strategy for how optuna is going to sample points from the parameter-space and evaluate the objective function.
https://optuna.readthedocs.io/en/stable/reference/samplers.html
B.1 Use TPESampler
from optuna.samplers import TPESampler

sampler = TPESampler(seed=42)
study_name = "TPESampler"
studies[study_name] = optimize(
    pobjective,
    sampler=sampler,
    n_trials=100,
    study_name=study_name,
)
# Study Name: TPESampler
# ==============================
#
# ✅ study.best_params:
# {'ev': 1.6233, 'bv': 585.2143, 'vc': 731.9939, 'dv': 598.6585}
# ✅ metric: -0.0
B.2. Use GridSampler
GridSampler requires a parameter search grid. Here we are using the following search_space.
from optuna.samplers import GridSampler

# create search-space
search_space = make_param_grid(bounds=bounds, num_points=10, as_dict=True)

sampler = GridSampler(search_space)
study_name = "GridSampler"
studies[study_name] = optimize(
    pobjective,
    sampler=sampler,
    n_trials=2000,
    study_name=study_name,
)
# Study Name: GridSampler
# ==============================
#
# ✅ study.best_params:
# {'ev': 0.2, 'bv': 500.0, 'vc': 222.2222, 'dv': 0.0}
# ✅ metric: -88.33333333333337
B.3. Use RandomSampler
from optuna.samplers import RandomSampler

sampler = RandomSampler(seed=42)
study_name = "RandomSampler"
studies[study_name] = optimize(
    pobjective,
    sampler=sampler,
    n_trials=300,
    study_name=study_name,
)
# Study Name: RandomSampler
# ==============================
#
# ✅ study.best_params:
# {'ev': 1.6233, 'bv': 585.2143, 'vc': 731.9939, 'dv': 598.6585}
# ✅ metric: -0.0
C. Dummy Data
For the sake of reproducibility, I am keeping a record of the dummy data used here.
import pandas as pd
import numpy as np
from scipy import optimize

cols = {
    'Dividend2': [9390, 7448, 177],
    'Probability': [341, 376, 452],
    'EV': [0.53, 0.60, 0.55],
    'Dividend': [185, 55, 755],
    'EV2': [123, 139, 544],
}
df = pd.DataFrame(cols)

def myFunc(params):
    """myFunc metric."""
    (ev, bv, vc, dv) = params
    df['Number'] = np.where(df['Dividend2'] <= vc, 1, 0) \
        + np.where(df['EV2'] <= dv, 1, 0)
    df['Return'] = np.where(
        df['EV'] <= ev, 0, np.where(
            df['Probability'] >= bv, 0, df['Number'] * df['Dividend'] - (vc + dv)
        )
    )
    return -1 * (df['Return'].sum())

b1 = [(0.2, 4), (300, 600), (0, 1000), (0, 1000)]
start = [0.2, 600, 1000, 1000]

result = optimize.minimize(fun=myFunc, bounds=b1, x0=start)
print(result)
C.1. An Observation
So, at first glance it seems that the code executed properly and did not throw any error, and it reports success in finding a minimized solution.
fun: -0.0
hess_inv: <4x4 LbfgsInvHessProduct with dtype=float64>
jac: array([0., 0., 3., 3.])
message: b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL' # 💡
nfev: 35
nit: 2
status: 0
success: True
x: array([2.e-01, 6.e+02, 0.e+00, 0.e+00]) # 🔥
A close observation reveals that the first two components of the solution (see 🔥) are no different from the starting point [0.2, 600, 1000, 1000], and only two iterations were performed. So it seems like not much really happened and the algorithm terminated prematurely.
Now look at the message above (see 💡). If we run a Google search on it, we find something like this:
Summary
b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
If the loss landscape does not have a smoothly changing topography, a gradient-descent algorithm will soon find that there isn't much change from one iteration to the next and will hence terminate the search. Likewise, if the loss landscape is rather flat, the search can meet a similar fate and terminate early.
scipy-optimize-minimize does not perform the optimization - CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
D. Making the Loss Landscape Smoother
A binary evaluation such as value = 1 if x > 5 else 0 is essentially a step function that assigns 1 to all values of x greater than 5 and 0 otherwise. But this introduces a kink, a discontinuity in smoothness, which can cause problems when traversing the loss landscape.
What if we use a sigmoid function to introduce some smoothness?
# Define sigmoid function
def sigmoid(x):
    """Sigmoid function."""
    return 1 / (1 + np.exp(-x))
For the above example, we could replace the hard cutoff with a shifted sigmoid, so that the step becomes smooth.
You can additionally introduce another factor (gamma: γ) and tune it to make the landscape smoother. By controlling the gamma factor you change how quickly the function transitions around x = 5; a small sketch follows.
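The modification referred to here was originally shown as an image; presumably it is along these lines (my reconstruction, reusing the sigmoid defined above, with gamma = 0.3 purely as an illustrative value):

x = np.arange(0., 10., 0.1)            # illustrative inputs
hard_step = np.where(x > 5, 1, 0)      # the original kinked step
smooth_step = sigmoid(x - 5)           # smooth step centred at x = 5
gamma = 0.3                            # smaller gamma -> sharper transition
smooth_step_sharp = sigmoid((x - 5) / gamma)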
The sigmoid comparison figure (σ(x), σ(x − 5), σ((x − 5)/3), σ((x − 5)/0.3)) referred to above is created with the following code-snippet.
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'svg' # 'svg', 'retina'
plt.style.use('seaborn-white')

def make_figure(figtitle: str = "Sigmoid Function"):
    """Make the demo figure for using sigmoid."""
    x = np.arange(-20, 20.01, 0.01)
    y1 = sigmoid(x)
    y2 = sigmoid(x - 5)
    y3 = sigmoid((x - 5) / 3)
    y4 = sigmoid((x - 5) / 0.3)
    fig, ax = plt.subplots(figsize=(10, 5))
    plt.sca(ax)
    plt.plot(x, y1, ls="-", label=r"$\sigma(x)$")
    plt.plot(x, y2, ls="--", label=r"$\sigma(x - 5)$")
    plt.plot(x, y3, ls="-.", label=r"$\sigma((x - 5) / 3)$")
    plt.plot(x, y4, ls=":", label=r"$\sigma((x - 5) / 0.3)$")
    plt.axvline(x=0, ls="-", lw=1.3, color="cyan", alpha=0.9)
    plt.axvline(x=5, ls="-", lw=1.3, color="magenta", alpha=0.9)
    plt.legend()
    plt.title(figtitle)
    plt.show()

make_figure()
D.1. Example of Metric Smoothing
The following is an example of how you could apply function smoothing.
from functools import partial

def sig(x, gamma: float = 1.):
    return sigmoid(x / gamma)

def myFunc3(params, gamma: float = 0.5):
    """myFunc metric v3 with smoother metric."""
    (ev, bv, vc, dv) = params
    _sig = partial(sig, gamma=gamma)
    df['Number'] = _sig(x=-(df['Dividend2'] - vc)) * 1 \
        + _sig(x=-(df['EV2'] - dv)) * 1
    df['Return'] = (
        _sig(x=df['EV'] - ev)
        * _sig(x=-(df['Probability'] - bv))
        * _sig(x=df['Number'] * df['Dividend'] - (vc + dv))
    )
    return -1 * (df['Return'].sum())
As already mentioned in my comment, the crucial problem is that np.where() is neither differentiable nor continuous. Consequently, your objective function violates the mathematical assumptions of most of the (derivative-based) algorithms under the hood of scipy.optimize.minimize.
So, basically, you've got three options:
Use a derivative-free algorithm and hope for the best (a short sketch follows this list).
Replace np.where() with a smooth approximation such that your objective is continuously differentiable.
Reformulate your problem as a MIP.
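For completeness, a minimal sketch of option 1 (my own illustration, not taken from either answer): scipy's differential_evolution is derivative-free and only needs the objective and bounds from the question.

from scipy.optimize import differential_evolution

# myFunc and b1 are the objective and bounds from the question.
result = differential_evolution(myFunc, bounds=b1, seed=42)
print(result.x, result.fun)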
Since @CypherX's answer pursues approach 1, I'd like to focus on 2. Here, the main idea is to approximate the np.where function. One possible approximation is
def smooth_if_then(x):
    eps = 1e-12
    return 0.5 + x/(2*np.sqrt(eps + x*x))
which is continuous and differentiable. Then, given a np.ndarray arr and a scalar value x, the expression np.where(arr <= x, 1, 0) is equivalent to smooth_if_then(x - arr).
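A quick numeric check of that equivalence (my own illustration; note the two expressions only differ exactly at the threshold, where the smooth version returns 0.5):

import numpy as np

arr = np.array([1., 4., 5., 6., 9.])
x = 5.
print(np.where(arr <= x, 1, 0))    # [1 1 1 0 0]
print(smooth_if_then(x - arr))     # approx. [1.  1.  0.5 0.  0. ]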
Hence, the objective function becomes:
div = df['Dividend'].values
div2 = df['Dividend2'].values
ev2 = df['EV2'].values
ev = df['EV'].values
prob = df['Probability'].values

def objective(x, *params):
    ev, bv, vc, dv = x
    div_vals, div2_vals, ev2_vals, ev_vals, prob_vals = params
    number = smooth_if_then(vc - div2_vals) + smooth_if_then(dv - ev2_vals)
    part1 = smooth_if_then(bv - prob_vals) * (number * div_vals - (vc + dv))
    part2 = smooth_if_then(-1*(ev - ev_vals)) * part1
    return -1 * part2.sum()
and using the trust-constr algorithm (which is the most robust one inside scipy.optimize.minimize), yields:
from scipy.optimize import minimize

res = minimize(lambda x: objective(x, div, div2, ev2, ev, prob), x0=start,
               bounds=b1, method="trust-constr")
barrier_parameter: 1.0240000000000006e-08
barrier_tolerance: 1.0240000000000006e-08
cg_niter: 5
cg_stop_cond: 0
constr: [array([8.54635975e-01, 5.99253512e+02, 9.95614973e+02, 9.95614973e+02])]
constr_nfev: [0]
constr_nhev: [0]
constr_njev: [0]
constr_penalty: 1.0
constr_violation: 0.0
execution_time: 0.2951819896697998
fun: 1.3046631387761482e-08
grad: array([0.00000000e+00, 0.00000000e+00, 8.92175218e-12, 8.92175218e-12])
jac: [<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>]
lagrangian_grad: array([-3.60651033e-09, 4.89643010e-09, 2.21847918e-09, 2.21847918e-09])
message: '`gtol` termination condition is satisfied.'
method: 'tr_interior_point'
nfev: 20
nhev: 0
nit: 14
niter: 14
njev: 4
optimality: 4.896430096425101e-09
status: 1
success: True
tr_radius: 478515625.0
v: [array([-3.60651033e-09, 4.89643010e-09, 2.20955743e-09, 2.20955743e-09])]
x: array([8.54635975e-01, 5.99253512e+02, 9.95614973e+02, 9.95614973e+02])
Last but not least: using smooth approximations is a common way to achieve differentiability. However, it's worth mentioning that these approximations are not convex. In practice, this means that your optimization problem is not convex, and thus you have no guarantee that a found stationary point (local minimizer) is a global optimum. To this end, one either needs to use a global optimization algorithm or formulate the problem as a MIP. The latter is the recommended approach, both from a mathematical and a practical point of view.

Why is this subclass of `scipy.stats.rv_continuous` not generating random variates in the correct range?

I have the following subclass of scipy.stats.rv_continuous:
from scipy.stats import rv_continuous
import math

class Distribution(rv_continuous):
    def _cdf(self, x, a, b, mu):
        return (
            math.erf(x/(math.sqrt(2)*a)) + \
            math.erf((x - mu)/(math.sqrt(2)*b)) \
        ) / 2 + math.erf(mu/(math.sqrt(2)*b)) / 2

distribution = Distribution(a=0, b=float('inf'))
As far as I can tell, everything is set up correctly (I've checked the math and it is also correct). However, for some reason, it only wants to generate values between 0 and mu, rather than between 0 and inf as explicitly specified. For example, here are 50 points generated with distribution.rvs(3, 1.6, 10) (along with the PDF):
and here's an example with distribution.rvs(0.6, 0.4, 4.85):
Why is my distribution 'capped' at mu? Have I set up my rv_continuous subclass incorrectly?
Your implementation of the CDF is not correct. Consider:
In [188]: distribution.cdf(25, 3, 16., 10)
Out[188]: 1.059763759070757
In [189]: distribution.cdf(40, 3, 16., 10)
Out[189]: 1.203618109186038
Those values are not correct. The CDF (which you implemented in the _cdf method) must not exceed 1.0.
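To make that concrete, here is a small check (mine, not part of the answer) of the x → ∞ limit of the implemented _cdf, using the same b and mu as in the calls above: both erf terms tend to 1, so the expression tends to 1 + erf(mu/(√2·b))/2.

import math

b, mu = 16., 10
limit = (1 + 1) / 2 + math.erf(mu / (math.sqrt(2) * b)) / 2
print(limit)  # ~1.23 > 1, so the function cannot be a valid CDF as written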

Use Python lmfit with a variable number of parameters in function

I am trying to deconvolve complex gas chromatogram signals into individual gaussian signals. Here is an example, where the dotted line represents the signal I am trying to deconvolve.
I was able to write the code to do this using scipy.optimize.curve_fit; however, once applied to real data the results were unreliable. I believe being able to set bounds to my parameters will improve my results, so I am attempting to use lmfit, which allows this. I am having a problem getting lmfit to work with a variable number of parameters. The signals I am working with may have an arbitrary number of underlying gaussian components, so the number of parameters I need will vary. I found some hints here, but still can't figure it out...
Creating a python lmfit Model with arbitrary number of parameters
Here is the code I am currently working with. The code will run, but the parameter estimates do not change when the model is fit. Does anyone know how I can get my model to work?
import numpy as np
from collections import OrderedDict
from scipy.stats import norm
from lmfit import Parameters, Model

def add_peaks(x_range, *pars):
    y = np.zeros(len(x_range))
    for i in np.arange(0, len(pars), 3):
        curve = norm.pdf(x_range, pars[i], pars[i+1]) * pars[i+2]
        y = y + curve
    return(y)

# generate some fake data
x_range = np.linspace(0, 100, 1000)
peaks = [50., 40., 60.]
a = norm.pdf(x_range, peaks[0], 5) * 2
b = norm.pdf(x_range, peaks[1], 1) * 0.1
c = norm.pdf(x_range, peaks[2], 1) * 0.1
fake = a + b + c

param_dict = OrderedDict()
for i in range(0, len(peaks)):
    param_dict['pk' + str(i)] = peaks[i]
    param_dict['wid' + str(i)] = 1.
    param_dict['mult' + str(i)] = 1.

# In case you'd like to see the plot of fake data
#y = add_peaks(x_range, *param_dict.values())
#plt.plot(x_range, y)
#plt.show()

# Initialize the model and fit
pmodel = Model(add_peaks)
params = pmodel.make_params()
for i in param_dict.keys():
    params.add(i, value=param_dict[i])

result = pmodel.fit(fake, params=params, x_range=x_range)
print(result.fit_report())
I think you would be better off using lmfit's ability to build a composite model.
That is, with a single peak defined with
from scipy.stats import norm

def peak(x, amp, center, sigma):
    return amp * norm.pdf(x, center, sigma)
(see also lmfit.models.GaussianModel), you can build a model with many peaks:
npeaks = 3
model = Model(peak, prefix='p1_')
for i in range(1, npeaks):
    model = model + Model(peak, prefix='p%d_' % (i+1))

params = model.make_params()
Now model will be a sum of 3 Gaussian functions, and the params created for that model will have names like p1_amp, p1_center, p2_amp, ..., to which you can add sensible initial values and/or bounds and/or constraints.
Given your example data, you could pass in initial values to make_params like
params = model.make_params(p1_amp=2.0, p1_center=50., p1_sigma=2,
                           p2_amp=0.2, p2_center=40., p2_sigma=2,
                           p3_amp=0.2, p3_center=60., p3_sigma=2)

result = model.fit(fake, params, x=x_range)
I was able to find a solution here:
https://lmfit.github.io/lmfit-py/builtin_models.html#example-3-fitting-multiple-peaks-and-using-prefixes
Building on the code above, the following accomplishes what I was trying to do...
import matplotlib.pyplot as plt
from lmfit.models import GaussianModel

gauss1 = GaussianModel(prefix='g1_')
gauss2 = GaussianModel(prefix='g2_')
gauss3 = GaussianModel(prefix='g3_')
gauss4 = GaussianModel(prefix='g4_')
gauss5 = GaussianModel(prefix='g5_')

gauss = [gauss1, gauss2, gauss3, gauss4, gauss5]
prefixes = ['g1_', 'g2_', 'g3_', 'g4_', 'g5_']

mod = np.sum(gauss[0:len(peaks)])
pars = mod.make_params()

for i, prefix in zip(range(0, len(peaks)), prefixes[0:len(peaks)]):
    pars[prefix + 'center'].set(peaks[i])

init = mod.eval(pars, x=x_range)
out = mod.fit(fake, pars, x=x_range)
print(out.fit_report(min_correl=0.5))

out.plot_fit()
plt.show()

Python: slice array uniformly with respect to dataset

I have a data set that has time t and a data d. Unfortunately, I changed the rate of exporting the data after some time (the rate was too high initially). I would like to sample the data so that I effectively remove the high-frequency exported data but maintain the low-frequency exported data near the end.
Consider the following code:
arr = np.loadtxt(file_name,skiprows=3)
Where t = arr[:,0], d = arr[:,1].
Here is a function to get a uniform slicing:
import math as m

def get_uniform_slices(arr, N_desired_points):
    s = arr.shape
    if s[0] > N_desired_points:
        n_skip = m.ceil(s[0] / N_desired_points)
    else:
        n_skip = 1
    return arr[0::n_skip, :]  # Sample output
However, the data then looks fine for the high-frequency exported data, but is too sparse for the low-frequency exported data.
Is there some way to slice such that indexes are uniformly spaced with respect to t?
Any help is greatly appreciated.
This is the function I used to find the indexes, based on the accepted answer:
def get_uniform_index(t, N_desired_points):
    t_uniform = np.linspace(np.amin(t), np.amax(t), N_desired_points)
    t_desired = [nearest(t_d, t) for t_d in t_uniform]
    i = np.in1d(t, t_desired)
    return i
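A usage sketch (mine): with the nearest helper from the accepted answer below in scope, the boolean mask returned by get_uniform_index can be used to down-sample the loaded array; N_desired_points=1000 is an arbitrary choice.

arr = np.loadtxt(file_name, skiprows=3)             # as in the question
t = arr[:, 0]
keep = get_uniform_index(t, N_desired_points=1000)
arr_uniform = arr[keep, :]                          # rows roughly uniform in t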
You have 2d data e.g.,
t = np.arange(0., 100., 0.5)
d = np.random.rand(len(t))
You want to keep only particular values of data at uniformly spaced times, e.g.
t_desired = np.arange(0., 100., 1.)
Let's pick out the desired data points at the desired times using the in1d function:
d_pruned = d[np.in1d(t, t_desired)]
Of course, you must pick the t_desired and they should match values in t. If that's a problem, you could pick approximately uniform times using e.g.,
def nearest(x, arr):
    index = (np.abs(arr - x)).argmin()
    return arr[index]

t_uniform = np.arange(0., 100., 1.)
t_desired = [nearest(t_d, t) for t_d in t_uniform]
Here is the complete code:
import numpy as np

t = np.arange(0., 100., 0.5)
d = np.random.rand(len(t))

def nearest(x, arr):
    index = (np.abs(arr - x)).argmin()
    return arr[index]

t_uniform = np.arange(0., 100., 1.)
t_desired = [nearest(t_d, t) for t_d in t_uniform]
d_pruned = d[np.in1d(t, t_desired)]

Obtaining Legendre polynomial form once Legendre coefficients are determined

I have obtained the coefficients for the Legendre polynomial that best fits my data. Now I need to determine the value of that polynomial at each time-step of my data, so that I can subtract the fit from my data. I have looked at the documentation for the Legendre module, and I'm not sure if I just don't understand my options or if there isn't a native tool in place for what I want. If my data-points were evenly spaced, linspace would be a good option, but that's not the case here. Does anyone have a suggestion for what to try?
For those who would like to demand a minimum working example of code, just use a random array, get the coefficients, and tell me from there how you would proceed. The values themselves don't matter. It's the technique that I'm asking about here. Thanks.
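For what it's worth, a minimal sketch (mine, assuming the coefficients come from numpy's Legendre module, which is also what the answers below use): a fitted Legendre series can be evaluated at any array of points, evenly spaced or not, with legval.

import numpy as np
from numpy.polynomial import legendre as L

t = np.array([0.0, 0.3, 1.1, 2.7, 2.9, 4.2])       # irregular time-steps
data = np.sin(t) + 0.1 * np.random.randn(t.size)   # toy measurements

coefs = L.legfit(t, data, deg=3)   # coefficients of the best-fit Legendre series
fit_at_t = L.legval(t, coefs)      # evaluate the fit at exactly those time-steps
residual = data - fit_at_t         # subtract the fit from the data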
To simplify Ahmed's example
In [1]: from numpy.polynomial import Polynomial, Legendre
In [2]: p = Polynomial([0.5, 0.3, 0.1])
In [3]: x = np.random.rand(10) * 10
In [4]: y = p(x)
In [5]: pfit = Legendre.fit(x, y, 2)
In [6]: plot(*pfit.linspace())
Out[6]: [<matplotlib.lines.Line2D at 0x7f815364f310>]
In [7]: plot(x, y, 'o')
Out[7]: [<matplotlib.lines.Line2D at 0x7f81535d8bd0>]
The Legendre functions are scaled and offset, as the data should be confined to the interval [-1, 1] to get any advantage over the usual power basis. If you want the coefficients for plain old Legendre functions
In [8]: pfit.convert()
Out[8]: Legendre([ 0.53333333, 0.3 , 0.06666667], [-1., 1.], [-1., 1.])
But that isn't recommended.
Once you have a function, you can just generate a numpy array for the timepoints:
>>> import numpy as np
>>> timepoints = [1,3,7,15,16,17,19]
>>> myarray = np.array(timepoints)
>>> def mypolynomial(bins, pfinal): #pfinal is just the estimate of the final array (i'll do quadratic)
...     a,b,c = pfinal # obviously, for a*x^2 + b*x + c
...     return (a*bins**2) + b*bins + c
>>> mypolynomial(myarray, (1,1,0))
array([ 2, 12, 56, 240, 272, 306, 380])
It automatically evaluates the polynomial for each timepoint in the numpy array.
Now all you have to do is rewrite mypolynomial to go from a simple quadratic example to a proper one for a Legendre polynomial. Treat the function as if it were evaluating a float to return the value, and when called on the numpy array it will automatically evaluate it for each value.
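One possible Legendre-aware rewrite (my sketch, not the answerer's code) simply lets numpy evaluate the Legendre series; note that in the Legendre basis the coefficients (1, 1, 0) mean 1·P0(x) + 1·P1(x) = 1 + x, so the output differs from the power-basis example above.

from numpy.polynomial.legendre import legval

def my_legendre(bins, coefs):
    """Evaluate a Legendre series with coefficients `coefs` at `bins`."""
    return legval(np.asarray(bins, dtype=float), coefs)

my_legendre(myarray, (1, 1, 0))   # -> array([ 2.,  4.,  8., 16., 17., 18., 20.])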
EDIT:
Let's say I wanted to generalize this to all standard polynomials:
>>> import numpy as np
>>> timepoints = [1,3,7,15,16,17,19]
>>> myarray = np.array(timepoints)
>>> def mypolynomial(bins, pfinal): #pfinal is just the estimate of the final array (i'll do quadratic)
...     hist = np.zeros((1, len(myarray))) # define blank return
...     for i in range(len(pfinal)):
...         # fixed a typo here, was pfinal[-i] which would give -0 rather than -1, since negative indexing starts at -1, not -0
...         const = pfinal[-i-1] # negative index to go from 0 exponent to highest exponent
...         hist += const*(bins**i)
...     return hist
>>> mypolynomial(myarray, (1,1,0))
array([ 2, 12, 56, 240, 272, 306, 380])
EDIT2: Typo fix
EDIT3:
@Ahmed is perfectly right when he states that Horner's rule is good for numerical stability. The implementation here would be as follows:
>>> def horner(coeffs, x):
...     acc = 0
...     for c in coeffs:
...         acc = acc * x + c
...     return acc
>>> horner((1,1,0), myarray)
array([ 2, 12, 56, 240, 272, 306, 380])
Slightly modified to keep the same argument order as before, from the code here:
http://rosettacode.org/wiki/Horner%27s_rule_for_polynomial_evaluation#Python
When you're using a nice library to fit polynomials, the library will in my experience usually have a function to evaluate them. So I think it is useful to know how you're generating these coefficients.
In the example below, I used two functions in numpy, legfit and legval which made it trivial to both fit and evaluate the Legendre polynomials without any need to invoke Horner's rule or do the bookkeeping yourself. (Though I do use Horner's rule to generate some example data.)
Here's a complete example where I generate some sparse data from a known polynomial, fit a Legendre polynomial to it, evaluate that polynomial on a dense grid, and plot. Note that the fitting and evaluating part takes three lines thanks to the numpy library doing all the heavy lifting.
It produces the following figure:
import numpy as np

### Setup code
def horner(coeffs, x):
    """Evaluate a polynomial at a point or array"""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

x = np.random.rand(10) * 10
true_coefs = [0.1, 0.3, 0.5]
y = horner(true_coefs, x)

### Fit and evaluate
legendre_coefs = np.polynomial.legendre.legfit(x, y, 2)
new_x = np.linspace(0, 10)
new_y = np.polynomial.legendre.legval(new_x, legendre_coefs)

### Plotting only
try:
    import pylab
    pylab.ion()  # turn on interactive plotting
    pylab.figure()
    pylab.plot(x, y, 'o', new_x, new_y, '-')
    pylab.xlabel('x')
    pylab.ylabel('y')
    pylab.title('Fitting Legendre polynomials and evaluating them')
    pylab.legend(['original sparse data', 'fit'])
except:
    print("Can't start plots.")
