I want to iteratively fit a curve to data in python with the following approach:
1. Fit a polynomial curve (or any non-linear approach).
2. Discard values > 2 standard deviations from the mean of the curve.
3. Repeat steps 1 and 2 until all values are within the confidence interval of the curve.
I can fit a polynomial curve as follows:
import numpy as np

vals = np.array([0.00441025, 0.0049001 , 0.01041189, 0.47368389, 0.34841961,
0.3487533 , 0.35067096, 0.31142986, 0.3268407 , 0.38099566,
0.3933048 , 0.3479948 , 0.02359819, 0.36329588, 0.42535543,
0.01308297, 0.53873956, 0.6511364 , 0.61865282, 0.64750302,
0.6630047 , 0.66744816, 0.71759617, 0.05965622, 0.71335208,
0.71992683, 0.61635697, 0.12985441, 0.73410642, 0.77318621,
0.75675988, 0.03003641, 0.77527201, 0.78673995, 0.05049178,
0.55139476, 0.02665514, 0.61664748, 0.81121749, 0.05521697,
0.63404375, 0.32649395, 0.36828268, 0.68981099, 0.02874863,
0.61574739])
x_values = np.linspace(0, 1, len(vals))
poly_degree = 3
coeffs = np.polyfit(x_values, vals, poly_degree)
poly_eqn = np.poly1d(coeffs)
y_hat = poly_eqn(x_values)
How do I do steps 2 and 3?
Since you want to eliminate points that are too far from an expected solution, you are probably looking for RANSAC (RANdom SAmple Consensus), which fits a curve (or any other function) to data within certain bounds, like your case with a 2*STD cutoff.
You can use the scikit-learn RANSAC estimator, which is well aligned with the included regressors such as LinearRegression. For your polynomial case you need to define your own regression class:
import numpy as np
from sklearn.metrics import mean_squared_error

class PolynomialRegression(object):
    def __init__(self, degree=3, coeffs=None):
        self.degree = degree
        self.coeffs = coeffs

    def fit(self, X, y):
        # np.polyfit works on 1d arrays, hence the ravel()
        self.coeffs = np.polyfit(X.ravel(), y, self.degree)

    def get_params(self, deep=False):
        return {'coeffs': self.coeffs}

    def set_params(self, coeffs=None, random_state=None):
        self.coeffs = coeffs

    def predict(self, X):
        poly_eqn = np.poly1d(self.coeffs)
        y_hat = poly_eqn(X.ravel())
        return y_hat

    def score(self, X, y):
        return mean_squared_error(y, self.predict(X))
and then you can use RANSAC:

from sklearn.linear_model import RANSACRegressor

# x_vals, y_vals are the x_values and vals arrays from the question
x_vals, y_vals = x_values, vals
ransac = RANSACRegressor(PolynomialRegression(degree=poly_degree),
                         residual_threshold=2 * np.std(y_vals),
                         random_state=0)
ransac.fit(np.expand_dims(x_vals, axis=1), y_vals)
inlier_mask = ransac.inlier_mask_
Note that the X variable is transformed to a 2d array because the sklearn RANSAC implementation requires it, and it is flattened back in our custom class because numpy's polyfit works with 1d arrays.
y_hat = ransac.predict(np.expand_dims(x_vals, axis=1))
plt.plot(x_vals, y_vals, 'bx', label='input samples')
plt.plot(x_vals[inlier_mask], y_vals[inlier_mask], 'go', label='inliers (2*STD)')
plt.plot(x_vals, y_hat, 'r-', label='estimated curve')
Moreover, playing with the polynomial order and the residual threshold, I got the following results with degree=4 and a 1*STD range:
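That variant is just a change of the two parameters in the call above; a minimal sketch, reusing the same variables:

ransac = RANSACRegressor(PolynomialRegression(degree=4),
                         residual_threshold=np.std(y_vals),
                         random_state=0)
ransac.fit(np.expand_dims(x_vals, axis=1), y_vals)
inlier_mask = ransac.inlier_mask_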
Another option is to use a higher-order regressor such as a Gaussian process:
from sklearn.gaussian_process import GaussianProcessRegressor
ransac = RANSACRegressor(GaussianProcessRegressor(),
residual_threshold=np.std(y_vals))
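The fit and predict calls follow the same pattern as before; a minimal sketch, reusing x_vals and y_vals from above:

ransac.fit(np.expand_dims(x_vals, axis=1), y_vals)
y_hat_gp = ransac.predict(np.expand_dims(x_vals, axis=1))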
As for generalizing to a DataFrame, you just need to set all columns except one as features and the remaining one as the output, like here:
import pandas as pd
df = pd.DataFrame(np.array([x_vals, y_vals]).T)
ransac.fit(df[df.columns[:-1]], df[df.columns[-1]])
y_hat = ransac.predict(df[df.columns[:-1]])
It doesn't look like you'll get anything worthwhile following that procedure; there are much better techniques for handling unexpected data. Googling for "outlier detection" would be a good start.
With that said, here's how to answer your question.
Start by pulling in libraries and getting some data:
import matplotlib.pyplot as plt
import numpy as np
Y = np.array([
0.00441025, 0.0049001 , 0.01041189, 0.47368389, 0.34841961,
0.3487533 , 0.35067096, 0.31142986, 0.3268407 , 0.38099566,
0.3933048 , 0.3479948 , 0.02359819, 0.36329588, 0.42535543,
0.01308297, 0.53873956, 0.6511364 , 0.61865282, 0.64750302,
0.6630047 , 0.66744816, 0.71759617, 0.05965622, 0.71335208,
0.71992683, 0.61635697, 0.12985441, 0.73410642, 0.77318621,
0.75675988, 0.03003641, 0.77527201, 0.78673995, 0.05049178,
0.55139476, 0.02665514, 0.61664748, 0.81121749, 0.05521697,
0.63404375, 0.32649395, 0.36828268, 0.68981099, 0.02874863,
0.61574739])
X = np.linspace(0, 1, len(Y))
Next, do an initial plot of the data:
plt.plot(X, Y, '.')
This lets you see what we're dealing with and whether a polynomial would ever be a good fit. The short answer is that this method isn't going to get very far with this sort of data.
At this point we should stop, but to answer the question I'll go on, mostly following your polynomial fitting code:
poly_degree = 5
sd_cutoff = 1 # 2 keeps everything
coeffs = np.polyfit(X, Y, poly_degree)
poly_eqn = np.poly1d(coeffs)
Y_hat = poly_eqn(X)
delta = Y - Y_hat
sd_p = np.std(delta)
ok = abs(delta) < sd_p * sd_cutoff
Hopefully this makes sense. I use a higher-degree polynomial and only cut off at 1 SD because otherwise nothing will be thrown away. The ok array contains True for those points that are within sd_cutoff standard deviations.
To check this, I'd then do another plot, something like:
plt.scatter(X, Y, color=np.where(ok, 'k', 'r'))
plt.fill_between(
X,
Y_hat - sd_p * sd_cutoff,
Y_hat + sd_p * sd_cutoff,
color='#00000020')
plt.plot(X, Y_hat)
which gives me:
So the black dots are the points to keep (i.e. X[ok] gives them back, and np.where(ok) gives you the indices).
You can play around with the parameters, but you probably want a distribution with fatter tails (e.g. a Student's t-distribution). As I said above, though, Googling for outlier detection would be my suggestion.
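For example, a median/MAD based cutoff is one standard robust technique along those lines, since the median and the MAD are far less influenced by the outliers than the mean and standard deviation used above. A minimal sketch, reusing Y and Y_hat from the code above:

delta = Y - Y_hat
mad = np.median(np.abs(delta - np.median(delta)))
robust_sd = 1.4826 * mad                    # MAD rescaled to be comparable to a standard deviation
ok_robust = np.abs(delta) < 2 * robust_sd   # keep points within 2 robust SDs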
Three functions are needed to solve this. First, a line-fitting function is necessary to fit a line to a set of points:
def fit_line(x_values, vals, poly_degree):
    coeffs = np.polyfit(x_values, vals, poly_degree)
    poly_eqn = np.poly1d(coeffs)
    y_hat = poly_eqn(x_values)
    return poly_eqn, y_hat
We need to know the standard deviation from the points to the line. This function computes that standard deviation:
def compute_sd(x_values, vals, y_hat):
    distances = []
    for x, y, y1 in zip(x_values, vals, y_hat):
        distances.append(abs(y - y1))
    return np.std(distances)
Finally, we need to compare the distance from a point to the line. The point needs to be thrown out if the distance from the point to the line is greater than two times the standard deviation.
def compare_distances(x_values, vals):
    # uses the poly_eqn and sd computed in the loop below
    new_vals, new_x_vals = [], []
    for x, y in zip(x_values, vals):
        y1 = np.polyval(poly_eqn, x)
        distance = abs(y - y1)
        if distance < 2*sd:
            plt.plot((x, x), (y, y1), c='g')
            new_vals.append(y)
            new_x_vals.append(x)
        else:
            plt.plot((x, x), (y, y1), c='r')
            plt.scatter(x, y, c='r')
    return new_vals, new_x_vals
As you can see in the following graphs, this method does not work well for fitting a line to data that has a lot of outliers. All the points end up getting eliminated for being too far from the fitted line.
while len(vals) > 0:
    poly_eqn, y_hat = fit_line(x_values, vals, poly_degree)
    plt.scatter(x_values, vals)
    plt.plot(x_values, y_hat)
    sd = compute_sd(x_values, vals, y_hat)
    new_vals, new_x_vals = compare_distances(x_values, vals)
    plt.show()
    vals, x_values = np.array(new_vals), np.array(new_x_vals)
Let's say I have randomly distributed data which looks like this:
I want to replace each data point y[x_i] with a fixed-width Gaussian and add them together. It should give me:
My code is very primitive and slow:
import numpy as np

def gaussian(x, mu, sig):
    return 1/(sig*np.sqrt(2*np.pi))*np.exp(-np.power(x - mu, 2.) / (
        2 * np.power(sig, 2.)))

def gaussian_smoothing(x, y, sig=0.5, n=1000):
    x_new = np.linspace(x.min()-10*sig, x.max()+10*sig, n)
    y_new = np.zeros(x_new.shape)
    for _x, _y in zip(x, y):
        y_new += _y*gaussian(x_new, _x, sig)
    return x_new, y_new
For large data sets it takes a long time to perform such smoothing.
I was looking at np.convolve. However, it seems that it is only applicable to evenly spaced data, where the x step of the data and of the Gaussians is the same. What would be the fastest way to perform such an operation?
You can try to estimate it as a Gaussian mixture with a smaller number of components (like the EM algorithm), using sklearn:
import matplotlib.pyplot as plt
from numpy.random import choice
from sklearn import mixture
import scipy.stats
import numpy

# generate some data
x = numpy.array([1.,1.1,1.6,2.,2.1,2.2,2.9,3.,8.,62.,62.2,63.,63.4,64.5,65.,67.,69.])
# generate weights for it
y = numpy.random.rand(x.shape[0])
# normalize the weights to 1
y /= y.sum()
# resample to 5000 samples with equal weights according to the original weights
x_rsmp = numpy.array([choice(x, p=y) for _ in range(5000)])
x_rsmp.sort()
x_rsmp = x_rsmp.reshape(-1,1)
# define the number of components - this must be user selected or estimated
n_comp = 2
# fit the mixture
gmm = mixture.GaussianMixture(n_components=n_comp, covariance_type='full')
gmm.fit(x_rsmp)
# plot it
fig = plt.figure()
ax = fig.add_subplot(111)
x_gauss = numpy.linspace(-10,100,1000)
for n_c in range(n_comp):
    # norm.pdf takes a standard deviation as its scale, hence the sqrt of the covariance
    norm_pdf = scipy.stats.norm.pdf(x_gauss, gmm.means_[n_c,0],
                                    numpy.sqrt(gmm.covariances_[n_c,0,0]))
    ax.plot(x_gauss, norm_pdf, label='gauss %d' % (n_c+1))
ax.stem(x, y, 'gray')
plt.legend()
It yields n_comp Gaussian components with means gmm.means_ and covariances gmm.covariances_.
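To get a single smoothed curve out of the fitted mixture, you can evaluate the mixture density on the plotting grid; a minimal sketch using score_samples (which returns the log density), reusing gmm, x_gauss and ax from above:

# full mixture density on the plotting grid
mixture_pdf = numpy.exp(gmm.score_samples(x_gauss.reshape(-1, 1)))
ax.plot(x_gauss, mixture_pdf, 'k--', label='full mixture')
plt.legend()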
I am aware that SGD has been asked before on SO but I wanted to have an opinion on my code as below:
import numpy as np
import matplotlib.pyplot as plt
# Generating data
m,n = 10000,4
x = np.random.normal(loc=0,scale=1,size=(m,4))
theta_0 = 2
theta = np.append([],[1,0.5,0.25,0.125]).reshape(n,1)
y = np.matmul(x,theta) + theta_0*np.ones(m).reshape((m,1)) + np.random.normal(loc=0,scale=0.25,size=(m,1))
# input features
x0 = np.ones([m,1])
X = np.append(x0,x,axis=1)
# defining the cost function
def compute_cost(X, y, theta_GD):
    return np.sum(np.power(y - np.matmul(np.transpose(theta_GD), X), 2))/2
# initializations
theta_GD = np.append([theta_0],[theta]).reshape(n+1,1)
alp = 1e-5
num_iterations = 10000
# Batch Sum
def batch(i, j, theta_GD):
    batch_sum = 0
    for k in range(i, i+9):
        batch_sum += float((y[k] - np.transpose(theta_GD).dot(X[k]))*X[k][j])
    return batch_sum
# Gradient Step
def gradient_step(theta_current, X, y, alp, i):
    for j in range(0, n):
        theta_current[j] -= alp*batch(i, j, theta_current)/10
    theta_updated = theta_current
    return theta_updated
# gradient descent
cost_vec = []
for i in range(num_iterations):
    cost_vec.append(compute_cost(X[i], y[i], theta_GD))
    theta_GD = gradient_step(theta_GD, X, y, alp, i)
plt.plot(cost_vec)
plt.xlabel('iterations')
plt.ylabel('cost')
I was trying a mini-batch GD with a batch size of 10. I am getting extremely oscillatory behavior for the MSE. Where's the issue? Thanks.
P.S. I was following Andrew Ng's https://www.coursera.org/learn/machine-learning/lecture/9zJUs/mini-batch-gradient-descent
This is a description of the underlying mathematical principle, not a code-based solution...
The cost function is highly nonlinear (np.power()) and recursive, and recursive nonlinear systems can oscillate (self-oscillation, https://en.wikipedia.org/wiki/Self-oscillation). In mathematics this is the subject of chaos theory / the theory of nonlinear dynamical systems (https://pdfs.semanticscholar.org/8e0d/ee3c433b1806bfa0d98286836096f8c2681d.pdf), cf. the logistic map (https://en.wikipedia.org/wiki/Logistic_map). The logistic map oscillates if the growth factor r exceeds a threshold; the growth factor is a measure of how much energy is in the system.
In your code the critical parts are the cost function, the cost vector (that is, the history of the system) and the time steps:
def compute_cost(X, y, theta_GD):
    return np.sum(np.power(y - np.matmul(np.transpose(theta_GD), X), 2))/2

cost_vec = []
for i in range(num_iterations):
    cost_vec.append(compute_cost(X[i], y[i], theta_GD))
    theta_GD = gradient_step(theta_GD, X, y, alp, i)

# Gradient Step
def gradient_step(theta_current, X, y, alp, i):
    for j in range(0, n):
        theta_current[j] -= alp*batch(i, j, theta_current)/10
    theta_updated = theta_current
    return theta_updated
If you compare this to an implementation of the logistic map, you see the similarities:
from pylab import show, scatter, xlim, ylim
from random import randint

iter = 1000      # Number of iterations per point
seed = 0.5       # Seed value for x in (0, 1)
spacing = .0001  # Spacing between points on domain (r-axis)
res = 8          # Largest n-cycle visible

# Initialize r and x lists
rlist = []
xlist = []

def logisticmap(x, r):   # <-- nonlinear function
    return x * r * (1 - x)

# Return nth iteration of logisticmap(x, r)
def iterate(n, x, r):
    for i in range(1, n):
        x = logisticmap(x, r)
    return x

# Generate list values -- iterate for each value of r
for r in [i * spacing for i in range(int(1/spacing), int(4/spacing))]:
    rlist.append(r)
    # <-- similar to cost_vec, the history of the system
    xlist.append(iterate(randint(iter - res//2, iter + res//2), seed, r))

scatter(rlist, xlist, s=.01)
xlim(0.9, 4.1)
ylim(-0.1, 1.1)
show()
source of code : https://www.reddit.com/r/learnpython/comments/zzh28/a_simple_python_implementation_of_the_logistic_map/
Based on this, you can try to modify your cost function by introducing a factor similar to the growth factor in the logistic map to reduce the intensity of oscillation of the system:
def gradient_step(theta_current, X, y, alp, i):
    for j in range(0, n):
        # <-- introduce a factor somewhere to keep the system under the oscillation threshold
        theta_current[j] -= alp*batch(i, j, theta_current)/10
    theta_updated = theta_current
    return theta_updated

or

def compute_cost(X, y, theta_GD):
    # <-- introduce a factor somewhere to keep the system under the oscillation threshold
    return np.sum(np.power(y - np.matmul(np.transpose(theta_GD), X), 2))/2
If this is not working, maybe follow the suggestions in https://www.reddit.com/r/MachineLearning/comments/3y9gkj/how_can_i_avoid_oscillations_in_gradient_descent/ (time steps, ...).
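One concrete way to introduce such a damping factor is to shrink the learning rate over the iterations; a minimal sketch, reusing batch and n from the question's code (the 1/(1 + k*i) schedule is just an illustrative choice):

def gradient_step_decayed(theta_current, X, y, alp, i, k=1e-3):
    # decaying step size: later updates are damped, which suppresses oscillation
    alp_i = alp / (1.0 + k * i)
    for j in range(0, n):
        theta_current[j] -= alp_i * batch(i, j, theta_current) / 10
    return theta_current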
I am trying to find the peak of my data set by fitting it to a Lorentzian (more specifically I have to find at what value of the B-field the peak occurs). However, what follows the peak is not symmetric and definitely not linear so I am having trouble getting a good fit. This is what I have tried:
import numpy
import pylab
from scipy.optimize import leastsq  # Levenberg-Marquardt Algorithm #

def lorentzian(x, p):
    numerator = (p[0]**2)
    denominator = (x - (p[1]))**2 + p[0]**2
    y = p[2]*(numerator/denominator) + p[3]*(x - p[0]) + p[4]
    return y

def residuals(p, y, x):
    err = y - lorentzian(x, p)
    return err

a = numpy.loadtxt('QHE.dat')
x = a[int(len(a)*8.2/10):, 0]
y = a[int(len(a)*8.2/10):, 1]

# initial values #
p = [0.4, 1.2, 1.5, 1, 1]  # [hwhm, peak center, intensity] plus two linear-background terms #
pbest = leastsq(residuals, p, args=(y, x), full_output=1)
best_parameters = pbest[0]

# fit to data #
fit = lorentzian(x, best_parameters)
peaks = []  # collect the fitted parameters
peaks.append(best_parameters)

pylab.figure()
pylab.plot(x, y, 'wo')
pylab.plot(x, fit, 'r-', lw=2)
pylab.xlabel('B field', fontsize=18)
pylab.ylabel('Resistance', fontsize=18)
pylab.show()
Does anyone have a suggestion how to handle this?
Edit:
Here is the data file I am trying to fit. The goal is to find the minimum.
Currently, I am trying to solve a problem from astrophysics which can be simplified as follows:
I want to fit a linear model (say y = a + b*x) to observed data, and I wish to use PyMC to characterize the posterior of a and b in a discrete grid parameter space, like in this figure:
I know PyMC has a DiscreteMetropolis class to find the posterior in discrete space, but that is restricted to integer space, not a custom discrete space. So I am thinking of defining a potential to force PyMC to search on the grid, but it's not working well... Can anyone help with this? Or has anyone solved a similar problem? Any thoughts will be greatly appreciated :)
Here is my draft code; the commented-out potential class is my idea for forcing PyMC to search on the grid:
import sys
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import pymc
#------------------------------------------------------------
# Generate the data
size = 200
slope_true = 12.3
y_intercept_true = 22.4
x = np.linspace(0, 1, size)
# y = a + b*x
y_true = y_intercept_true + slope_true * x
# add noise
y = y_true + np.random.normal(scale=.03, size=size)
# Define searching parameter space
# Note: this is discrete but not in the form of integer
slope_search_space = np.linspace(1,30,51)
y_intercept_search_space = np.linspace(1,30,51)
#------------------------------------------------------------
#Start initializing PyMC
@pymc.stochastic(dtype=int)
def slope(value=5, t_l=1, t_h=30):
    """The switchpoint for the rate of disaster occurrence."""
    def logp(value, t_l, t_h):
        if value > t_h or value < t_l:
            return -np.inf
        else:
            return -np.log(t_h - t_l + 1)

#@pymc.potential
#def slope_prior(val=slope, t_l=-30, t_h=30):
#    if val not in slope_search_space:
#        return -np.inf
#    return -np.log(t_h - t_l + 1)

#---

@pymc.stochastic(dtype=int)
def y_intercept(value=4, t_l=1, t_h=30):
    """The switchpoint for the rate of disaster occurrence."""
    def logp(value, t_l, t_h):
        if value > t_h or value < t_l:
            return -np.inf
        else:
            return -np.log(t_h - t_l + 1)

#@pymc.potential
#def y_intercept_prior(val=y_intercept, t_l=-30, t_h=30):
#    if val not in y_intercept_search_space:
#        return -np.inf
#    return -np.log(t_h - t_l + 1)

# Define observed data
@pymc.deterministic
def mu(x=x, slope=slope, y_intercept=y_intercept):
    # Linear age-price model
    return y_intercept + slope*x

# Sampling distribution of prices
p = pymc.Poisson('p', mu, value=y, observed=True)
model = dict(slope=slope, y_intercept=y_intercept, mu=mu, p=p)
#-----------------------------------------------------------
# perform the MCMC
M = pymc.MCMC(model)
trace = M.sample(iter=10000,burn=5000)
#Plot
pymc.Matplot.plot(M)
plt.figure()
pymc.Matplot.summary_plot([M.slope,M.y_intercept])
plt.show()
I managed to solve my problem a few days ago. To my surprise, some of my astronomy friends in a Facebook group are also interested in this question, so I think it might be useful to post my solution in case other people are having the same issue. Please note, this solution may not be the best way to tackle this problem; in fact, I believe there's a more elegant way. But for now, this is the best I can come up with. Hope this is helpful to some of you.
The way I solved the problem is very straightforward, and I summarize it as follows:
1> Define the slope and y_intercept stochastic variables in continuous form (PyMC will then use Metropolis to do the sampling).
2> Define a function find_nearest to map the continuous random variables slope and y_intercept to the grid, e.g. Grid_slope=np.array([1,2,3,4,…51]); if slope=4.678, then find_nearest(Grid_slope, slope) will return 5, as the slope value is closest to 5 in Grid_slope. Similarly for the y_intercept variable.
3> When computing the likelihood, this is where I do the trick: I apply the find_nearest function to the model inside the likelihood function, i.e. change model(slope, y_intercept) to model(find_nearest(Grid_slope, slope), find_nearest(Grid_y_intercept, y_intercept)), which computes the likelihood only on the grid parameter space.
4> The trace returned for slope and y_intercept by PyMC may not be strictly grid values; you can use the find_nearest function to map the trace onto the grid and then make any statistical inference from it (see the sketch after the code below). For my case, I just use the trace straight away to get statistics, and the result is nice :)
import sys
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import pymc
#------------------------------------------------------------
# Generate the data
size = 200
slope_true = 12.3
y_intercept_true = 22.4
x = np.linspace(0, 1, size)
# y = a + b*x
y_true = y_intercept_true + slope_true * x
# add noise
y = y_true + np.random.normal(scale=.03, size=size)
# Define searching parameter space
# Note: this is discrete but not in the form of integer
slope_search_space = np.linspace(1,30,51)
y_intercept_search_space = np.linspace(1,30,51)
#------------------------------------------------------------
#Start initializing PyMC
from pymc import Normal, Gamma, deterministic, MCMC, Matplot, Uniform
# Constant priors for parameters
slope = Uniform('slope', 1, 30)
y_intercept = Uniform('y_intp', 1, 30)
# Precision of normal distribution of y value
tau = Uniform('tau',0,10000 )
@deterministic
def mu(x=x, slope=slope, y_intercept=y_intercept):
    def find_nearest(array, value):
        """
        This function maps 'value' to the nearest point in 'array'
        """
        idx = (np.abs(array - value)).argmin()
        return array[idx]
    # Linear model, evaluated only on the grid parameter space
    iso = find_nearest(y_intercept_search_space, y_intercept) + find_nearest(slope_search_space, slope)*x
    return iso
# Sampling distribution of y
p = Normal('p', mu, tau, value=y, observed=True)
model = dict(slope=slope, y_intercept=y_intercept,tau=tau, mu=mu, p=p)
#-----------------------------------------------------------
# perform the MCMC
M = pymc.MCMC(model)
trace = M.sample(40000,20000)
#Plot
pymc.Matplot.plot(M)
M.slope.summary()
M.y_intercept.summary()
plt.figure()
pymc.Matplot.summary_plot([M.slope,M.y_intercept])
plt.show()
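As mentioned in step 4>, the sampled trace can be snapped back onto the grid before computing statistics. A minimal sketch, assuming M has been sampled as above (find_nearest is local to mu, so the snapping is redone here with a broadcast):

# snap each continuous slope sample to the nearest grid value
slope_trace = M.trace('slope')[:]
idx = np.abs(slope_search_space[None, :] - slope_trace[:, None]).argmin(axis=1)
slope_on_grid = slope_search_space[idx]
print(slope_on_grid.mean(), np.median(slope_on_grid))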