Calculating model evidence/marginals in Python

My question pertains to Bayesian inference and how to numerically calculate the model evidence given some data, a prior and a likelihood.
Given conjugate priors, the Wikipedia article specifies the model evidence as
$$p(\mathbf{Y} \mid m, X) = \iint p(\mathbf{Y} \mid \beta, \sigma, m)\, p(\beta, \sigma \mid X, m)\, d\beta\, d\sigma$$
where sigma and beta are the parameters, m is the model, Y is the data and X is the prior.
Given the setup below, how do I calculate model evidence? I need something that returns one scalar number.
Below I have a minimal working example that generates some data (draws from a normal), assumes a prior (a normal) and a likelihood function (a Gaussian). Notice how both the PDF of the data and the prior integrate to (approximately) one, while the unnormalized likelihood does not (its integral exceeds one).
I am mainly confused as to how to "integrate out" the parameters from the model, and thus take model complexity into consideration. I can see how this can be done analytically if you can write down the log-likelihood function, but I can't really see how this results in one scalar number.
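To make the target concrete, here is what I believe the integral looks like for my setup (known $\sigma$, unknown mean $\mu$ with a normal prior $\mathcal{N}(\mu_0, \tau^2)$):
$$p(y_1,\dots,y_n) = \int \left[\prod_{i=1}^{n} \mathcal{N}(y_i \mid \mu, \sigma^2)\right] \mathcal{N}(\mu \mid \mu_0, \tau^2)\, d\mu$$
so the parameter $\mu$ is integrated out and only the observed data remain, leaving a single scalar.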
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy import integrate
import seaborn as sns
sns.set(style="white", palette="muted", color_codes=True)
%matplotlib inline
mu = 0
variance = 1
sigma = np.sqrt(variance)
data = np.random.normal(mu, sigma, 100)  # np.random.normal expects the standard deviation, not the variance
x = np.linspace(-5, 5, 100)
density = stats.gaussian_kde(data)  # scipy.stats.kde.gaussian_kde is a deprecated alias
data_pdf = density(x)
prior_pdf = stats.norm.pdf(x, mu, sigma)
likelihood = np.exp(-np.power(x - mu, 2.) / (2 * np.power(sigma, 2.)))  # unnormalized Gaussian (no 1/(sigma*sqrt(2*pi)) factor)
I1 = integrate.trapz(data_pdf, x)    # integral of the KDE of the data
I2 = integrate.trapz(prior_pdf, x)   # integral of the prior density
I3 = integrate.trapz(likelihood, x)  # integral of the unnormalized likelihood
fig1 = plt.figure(figsize=(7.5,5))
ax1 = fig1.add_subplot(3,1,1)
sns.despine(right=True)
ax1.plot(x,data_pdf,'k')
ax1.legend([r'$PDF(Data)$'],loc='upper left')
ax2 = fig1.add_subplot(3,1,2)
sns.despine(right=True)
ax2.plot(x,prior_pdf,'b')
ax2.legend([r'$Prior$'],loc='upper left')
ax3 = fig1.add_subplot(3,1,3)
sns.despine(right=True)
ax3.plot(x,likelihood,'r')
ax3.legend([r'$Likelihood$'],loc='upper left')
plt.tight_layout()
print(I1,I2,I3)
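For concreteness, here is the kind of calculation I imagine (a rough sketch, assuming sigma is known and the mean mu is the only parameter to integrate out, with the same normal prior on mu; mu_grid and the other helper names are mine):
# Sketch: numerically "integrating out" mu for a model with known sigma.
# evidence = integral over mu of [ prod_i N(data_i | mu, sigma^2) ] * prior(mu)
mu_grid = np.linspace(-5, 5, 2001)             # grid of candidate parameter values
prior_mu = stats.norm.pdf(mu_grid, mu, sigma)  # prior density over mu
# log-likelihood of the whole data set at each candidate mu (sum of pointwise log densities)
log_lik = np.array([stats.norm.logpdf(data, m, sigma).sum() for m in mu_grid])
evidence = integrate.trapz(np.exp(log_lik) * prior_mu, mu_grid)  # a single scalar
print(evidence)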

Related

How to fit more complex functions with sklearn?

I used sklearn in python to fit polynomial functions:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_reg_model = LinearRegression()
poly_features = poly.fit_transform(xvalues.reshape(-1, 1))
poly_reg_model.fit(poly_features, y_values)
final_predicted = poly_reg_model.predict(poly_features)
...
Instead of only using x^n parts, I want to include a (1-x^2)^(1/2) part in the fit function.
Is this possible with sklearn?
I tried to define a feature which includes more complex terms but I failed to achieve this.
No idea whether it is possible within scikit-learn - after all, a polynomial fit is constrained to specific polynomial formulations from the mathematical standpoint. If you want to fit a formula with some unknown parameters, you can use scipy.optimize.curve_fit. First, let us generate some dummy data with noise:
import numpy as np
from matplotlib import pyplot as plt
def f(x):
    return (1-x**2)**(1/2)
xvalues = np.linspace(-1, 1, 30)
yvalues = [f(x) + np.random.randint(-10, 10)/100 for x in xvalues]
Then, we set up our function to be optimized:
from scipy.optimize import curve_fit
def f_opt(x, a, b):
    return (a-x**2)**(1/b)
popt, pcov = curve_fit(f_opt, xvalues, yvalues)
You can of course modify this function to be more flexible. Finally, we plot the results:
plt.scatter(xvalues, yvalues, label='data')
plt.plot(xvalues, f_opt(xvalues, *popt), 'r-', label='fit')
plt.legend()
So now you can use f_opt(new_x, *popt) to predict new points (alternatively you can print the values and hard-code them). popt holds the fitted parameters of f_opt (everything except x) - for more details check the documentation I've linked!
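For example, a short usage sketch (new_x is just an illustrative name):
print(popt)                      # fitted values of a and b
new_x = np.linspace(-1, 1, 200)  # hypothetical new inputs
predicted = f_opt(new_x, *popt)  # predictions at the new inputs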

How to use Gaussian Mixture Model (GMM) for peak decomposition?

I have generated some data points as a linear mixture of three 1D bell-shaped Gaussian distributions with different parameters (mean and variance) using the following code:
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns
sns.set_style("darkgrid")
%matplotlib inline
from sklearn.mixture import GaussianMixture
x = np.linspace(start=-40,stop=40, num=1000)
y1 = stats.norm.pdf(x, loc=1,scale=1.5) # First Gaussian distribution
y2 = stats.norm.pdf(x, loc=5,scale=2.5) # Second Gaussian distribution
y3 = stats.norm.pdf(x, loc=-15,scale= 10) # Third Gaussian distribution
Y = y1+y2+y3
fig = plt.figure(figsize=(7, 5),dpi=300)
plt.plot(x,y1,lw=2,label='First component')
plt.plot(x,y2,lw=2,label='Second component')
plt.plot(x,y3,lw=2,label='Third component')
plt.plot(x,Y,lw=3,label= 'Linear Mixture')
plt.legend(loc='best',facecolor="white")
plt.show()
I tried to decompose these 3 peaks in a reverse process using sklearn.mixture.GaussianMixture, but it does not return the expected mean and variance of each Gaussian component.
model = GaussianMixture(n_components=3).fit(Y.reshape(-1,1))
print(model.means_)
print(model.covariances_)
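For reference, here is a sketch of the sample-based fit I think GaussianMixture actually expects (the component sample sizes are arbitrary choices of mine):
# Sketch: GaussianMixture is typically fit on samples, not on summed PDF values.
samples = np.concatenate([
    np.random.normal(loc=1, scale=1.5, size=1000),   # draws from the first component
    np.random.normal(loc=5, scale=2.5, size=1000),   # draws from the second component
    np.random.normal(loc=-15, scale=10, size=1000),  # draws from the third component
])
model = GaussianMixture(n_components=3).fit(samples.reshape(-1, 1))
print(model.means_)        # should be close to 1, 5 and -15 (in some order)
print(model.covariances_)  # should be close to 1.5**2, 2.5**2 and 10**2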

How to find series of highest peaks of a repeating pattern using find_peaks() in Python?

I'm trying to determine the highest peaks of the pattern blocks in the following waveform:
Basically, I need to detect the following peaks only (highlighted):
If I use scipy.find_peaks(), it's unable to detect the appropriate peaks:
indices = find_peaks(my_waveform, prominence = 1)[0]
It ends up detecting all of the following points, which is not what I am looking for:
I can't provide the input arguments of distance or height thresholds to scipy.find_peaks() since there are many of the desired peaks on either extremes which are lower in height than the non-desired peaks in the middle.
Note: I had de-trended the waveform to help with the above problem, as you can see in the snapshot, but it still doesn't give the right results.
So can anyone help with a correct way to tackle this?
Here's the code to fully reproduce the dataset I've shown ("autocorr" is the final waveform of interest):
import json
import sys, os
import numpy as np
import pandas as pd
import glob
import pickle
from statsmodels.tsa.stattools import adfuller, acf, pacf
from scipy.signal import find_peaks, square
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
#GENERATION OF A FUNCTION WITH DUAL SEASONALITY & NOISE
def white_noise(mu, sigma, num_pts):
    """ Function to generate Gaussian Normal Noise
    Args:
        sigma: std value
        num_pts: no of points
        mu: mean value
    Returns:
        generated Gaussian Normal Noise
    """
    noise = np.random.normal(mu, sigma, num_pts)
    return noise
def signal_line_plot(input_signal: pd.Series, title: str = "", y_label: str = "Signal"):
    """ Function to plot a time series signal
    Args:
        input_signal: time series signal that you want to plot
        title: title on plot
        y_label: label of the signal being plotted
    Returns:
        signal plot
    """
    plt.plot(input_signal)
    plt.title(title)
    plt.ylabel(y_label)
    plt.show()
t_week = np.linspace(1,480, 480)
t_weekend=np.linspace(1,192,192)
T=96 #Time Period
x_weekday = 10*square(2*np.pi*t_week/T, duty=0.7)+10 + white_noise(0, 1,480)
x_weekend = 2*square(2*np.pi*t_weekend/T, duty=0.7)+2 + white_noise(0,1,192)
x_daily_weekly = np.concatenate((x_weekday, x_weekend))
x_daily_weekly_long = np.tile(x_daily_weekly, 10)  # repeat the weekly pattern 10 times
signal_line_plot(x_daily_weekly_long)
signal_line_plot(x_daily_weekly_long[0:1000])
#x_daily_weekly_long is the final waveform on which I'm carrying out Autocorrelation
#PERFORMING AUTOCORRELATION:
import scipy.signal as signal
autocorr = signal.correlate(x_daily_weekly_long, x_daily_weekly_long, mode = "same")
lags = signal.correlation_lags(len(x_daily_weekly_long), len(x_daily_weekly_long), mode = "same")
#VISUALIZATION:
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)
As you have some kind of double (or even triple) seasonality in the signal, I would attempt a double smoothing: one to remove the overall trend, and one to remove the sharp noise.
A picture is probably better than a long explanation:
from scipy.signal import find_peaks
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
def smooth(s, win):
    # centered rolling mean, with the leading/trailing NaNs filled in
    return pd.Series(s).rolling(window=win, center=True).mean().ffill().bfill()
plt.plot(lags, autocorr, label='data')
WINDOW = 100 # needs to be determined empirically
# and so are the multipliers below
# double smoothing difference + clipping
ddiff = np.clip(smooth(autocorr, 2*WINDOW)-smooth(autocorr, 10*WINDOW), 0, np.inf)
plt.plot(lags, ddiff, label='smooth+clip')
peaks = find_peaks(ddiff, width=WINDOW)[0]
plt.plot(lags[peaks], autocorr[peaks], marker='o', ls='')
plt.plot(lags[peaks], ddiff[peaks], marker='o', ls='')
plt.legend()
output:
smoothing the original signal
As is often the case in data analysis, the earlier you perform a transformation, the better. You could also clean your original signal before running the autocorrelation. Here is a quick example (using the smooth function defined above):
from scipy.signal import find_peaks
x2 = smooth(x_daily_weekly_long, 100)
autocorr2 = signal.correlate(x2, x2, mode = "same")
plt.plot(lags, autocorr2)
idx = find_peaks(autocorr2)[0]
plt.plot(lags[idx], autocorr2[idx], marker='o', ls='')
cleaned signal:
For testing purposes I used a rough reconstruction of your signal.
import numpy as np
from scipy.signal import find_peaks, square
import matplotlib.pyplot as plt
x = np.linspace(3,103,10000)
sin = np.clip(np.sin(0.6*x)-0.5,0,10)
tri = np.concatenate([np.linspace(0,0.3,5000),np.linspace(0.3,0,5000)],axis =0)
sig = np.sin(6*x-1.2)
full = sin+tri+sig
peak run #1
peaks = find_peaks(full)[0]
plt.plot(full)
plt.scatter(peaks,full[peaks], color='red', s=5)
plt.show()
peak run #2 + index reextraction (this needs the actual values from your signal)
peaks2 = find_peaks(full[peaks])[0]
index = peaks[peaks2]
plt.plot(full)
plt.scatter(index,full[index], color='red', s=5)
plt.show()
If you know the period you can do this:
w = T  # the time period T = 96 from the question
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)
smoothed = signal.convolve(autocorr, np.ones(w)/w, mode="same")  # moving-average smoothing over one period
peaks = signal.find_peaks(smoothed)[0]
plt.scatter(lags[peaks], autocorr[peaks], color="r")
Result:
I don't know if it works in other cases.
EDIT:
Another approach is to find the maximum in a sliding window, but in this case too you must define a window size empirically.
from scipy.ndimage import maximum_filter
w = 900  # window size, determined empirically
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)
is_local_max = maximum_filter(autocorr, size=w) == autocorr  # points equal to the maximum of their window
plt.scatter(lags[is_local_max], autocorr[is_local_max], color="r")
Result:

Is there a simple method to smooth a curve without taking into account future values and without a time shift?

I have a Unix time series (x) with an associated signal value (y) which is generated every minute, dropping the first value and appending a new one. I am trying to smooth the resulting curve without losing time accuracy, with a specific emphasis on the final value of the smoothed curve, which will be written to a database. I would like to be able to adjust the smoothing to a considerable degree.
I have studied (as a mathematical layman, more or less) all options I could find and I could master. I came across Savitzky-Golay, which looked perfect until I realized it works well on past data but fails to produce a reliable final value if no future data is available for smoothing. I have tried many other methods which produced results but could not be adjusted like Savgol.
import pandas as pd
from bokeh.plotting import figure, show, output_file
from bokeh.layouts import column
from math import pi
from scipy.signal import savgol_filter
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from scipy.interpolate import splrep, splev
from scipy.ndimage import gaussian_filter1d
from scipy.signal import lfilter
from scipy.interpolate import UnivariateSpline
import matplotlib.pyplot as plt
df_sim = pd.read_csv("/home/20190905_Signal_Smooth_Test.csv")
#sklearn Polynomial*****************************************
poly = PolynomialFeatures(degree=4)
X = df_sim.iloc[:, 0:1].values
print(X)
y = df_sim.iloc[:, 1].values
print(y)
X_poly = poly.fit_transform(X)
poly.fit(X_poly, y)
lin2 = LinearRegression()
lin2.fit(X_poly, y)
# Visualising the Polynomial Regression results
plt.scatter(X, y, color='blue')
plt.plot(X, lin2.predict(poly.fit_transform(X)), color='red')
plt.title('Polynomial Regression')
plt.xlabel('Time')
plt.ylabel('Signal')
plt.show()
#scipy interpolate********************************************
bspl = splrep(df_sim['timestamp'], df_sim['signal'], s=5)
bspl_y = splev(df_sim['timestamp'], bspl)
df_sim['signal_spline'] = bspl_y
#scipy gaussian filter****************************************
smooth = gaussian_filter1d(df_sim['signal'], 3)
df_sim['signal_gauss'] = smooth
#scipy lfilter************************************************
n = 5 # the larger n is, the smoother curve will be
b = [1.0 / n] * n
a = 1
histo_filter = lfilter(b, a, df_sim['signal'])
df_sim['signal_lfilter'] = histo_filter
print(df_sim)
#scipy UnivariateSpline**************************************
s = UnivariateSpline(df_sim['timestamp'], df_sim['signal'], s=5)
xs = df_sim['timestamp']
ys = s(xs)
df_sim['signal_univariante'] = ys
#scipy savgol filter****************************************
sg = savgol_filter(df_sim['signal'], 11, 3)
df_sim['signal_savgol'] = sg
df_sim['date'] = pd.to_datetime(df_sim['timestamp'], unit='s')
#plotting it all********************************************
print(df_sim)
w = 60000
TOOLS = "pan,wheel_zoom,box_zoom,reset,save"
p = figure(x_axis_type="datetime", tools=TOOLS, plot_width=1000, plot_height=250,
title=f"Various Signals y vs Timestamp x")
p.xaxis.major_label_orientation = pi / 4
p.grid.grid_line_alpha = 0.9
p.line(x=df_sim['date'], y=df_sim['signal'], color='green')
p.line(x=df_sim['date'], y=df_sim['signal_spline'], color='blue')
p.line(x=df_sim['date'], y=df_sim['signal_gauss'], color='red')
p.line(x=df_sim['date'], y=df_sim['signal_lfilter'], color='magenta')
p.line(x=df_sim['date'], y=df_sim['signal_univariante'], color='yellow')
p1 = figure(x_axis_type="datetime", tools=TOOLS, plot_width=1000, plot_height=250,
title=f"Savgol vs Signal")
p1.xaxis.major_label_orientation = pi / 4
p1.grid.grid_line_alpha = 0.9
p1.line(x=df_sim['date'], y=df_sim['signal'], color='green')
p1.line(x=df_sim['date'], y=df_sim['signal_savgol'], color='blue')
output_file("signal.html", title="Signal Test")
show(column(p, p1)) # open a browser
I expect a result that is similar to Savitzky-Golay but with valid final smoothed values for the data series. None of the other methods present the same flexibility to adjust the degree of smoothing. Most other methods shift the curve to the right. I can provide the csv file for testing.
This really depends on why you are smoothing the data. Every smoothing method will have side effects, such as letting some 'noise' through more than others. Research 'phase response of filtering'.
A common technique to avoid the problem of missing data at the end of a symmetric filter is to just forecast your data a few points ahead and use that. For example, if you are using a 5-term moving average filter, you will be missing 2 data points when you go to calculate your end value.
To forecast these two points, you could use the auto_arima() function from the pmdarima module, or look at the fbprophet module (which I find quite good for this kind of situation).
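For example, a rough sketch of this forecast-then-filter idea using pmdarima's auto_arima (the number of forecast steps and the Savitzky-Golay window are arbitrary choices here):
# Sketch: forecast a few points past the end, smooth, then keep only the original range.
import numpy as np
import pmdarima as pm
from scipy.signal import savgol_filter
y = df_sim['signal'].values
n_ahead = 5                                  # how many points to forecast (arbitrary)
model = pm.auto_arima(y, suppress_warnings=True)
forecast = model.predict(n_periods=n_ahead)  # forecasted continuation of the series
extended = np.concatenate([y, forecast])
smoothed = savgol_filter(extended, 11, 3)[:len(y)]  # discard the forecasted tail after smoothing
df_sim['signal_savgol_causal'] = smoothed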

How to write initial conditions in scipy.integrate.ode (or another) function?

I'm trying to solve a differential equation using the Python scipy.integrate.odeint function and compare it to the Mathcad solution.
So my equation is u'' + 0.106u' + 0.006u = 0; the problem I'm stuck on is the conditions, which are u(0)=0 and u'(1)=1. I don't understand how to set u'(1)=1.
Python code:
from scipy.integrate import odeint
import matplotlib.pyplot as plt
import numpy as np
def eq(u, t):
    return [u[1], -0.106*u[1] - 0.006*u[0]]  # return u' and u''
time = np.linspace(0, 10)
u0 = [0, 1]  # initial conditions
Z = odeint(eq, u0, time)
plt.plot(time, Z)
plt.xticks(range(0,10))
plt.grid(True)
plt.xlabel('time')
plt.ylabel('u(t)')
plt.show()
Mathcad code:
u''(t) + 0.106*u'(t) +0.006*u(t) = 0
u(0) = 0
u'(1) = 1
u := Odesolve(t,10)
Mathcad diagram looks like this:
https://pp.userapi.com/c850032/v850032634/108079/He1JsQonhpk.jpg
which is the reference.
And my python output is:
https://pp.userapi.com/c850032/v850032634/10809c/KB_HDekc8Fk.jpg
which does look similar, but clearly the u(t) is incorrect.
This is a boundary value problem, you need to use solve_bvp
from scipy.integrate import solve_bvp, solve_ivp
import matplotlib.pyplot as plt
import numpy as np
def eq(t,u): return [u[1], -0.106*u[1]-0.006*u[0]] #return u' and u''
def bc(u0,u1): return [u0[0], u1[1]-1 ]
res = solve_bvp(eq, bc, [0,1], [[0,1],[1,1]], tol=1e-3)
print(res.message)
# plot the piecewise polynomial interpolation,
# using the last segment for extrapolation
time = np.linspace(0, 10, 501)
plt.plot(time, res.sol(time)[0], '-', lw=2, label="BVP extrapolated")
# solve the same ODE as an IVP on [0,10], starting from the BVP solution's values at t=0
ivp = solve_ivp(eq, time[[0,-1]], res.y[:,0], t_eval=time)
plt.plot(time, ivp.y[0], '-', lw=2, label="IVP from BVP IC")
# plot decorations
plt.xticks(range(0,11))
plt.grid(True)
plt.legend()
plt.xlabel('time')
plt.ylabel('u(t)')
plt.show()
Note that the continuation is by extrapolation from the given interval [0,1] to [0,10], and that the values at t=1 are only accurate to the tolerance of 1e-3. So one could get a better result over the large interval by using solve_ivp with the computed values at t=1 as initial values, as sketched below. The difference in this example is about 0.01.
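A minimal sketch of that idea, reusing eq and res from above (the variable names are mine):
# restart the IVP at t=1, using the BVP solution's state there as initial values
time2 = np.linspace(1, 10, 451)
ivp2 = solve_ivp(eq, (1, 10), res.sol(1), t_eval=time2)
plt.plot(time2, ivp2.y[0], '--', lw=2, label="IVP from t=1")
plt.legend()
plt.show()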
