Generating 2nd degree polynomial out of some data - python

I have some data for which I want to generate a 2nd degree polynomial fit, like this example:
I have tried two different pieces of code, but the resulting line just goes through all the points.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = pd.read_csv('TESTEXskelet.csv', sep=",")
x = data.Gennemsnitlig_hastighed
y1 = data.Sum_VSP
np.polyfit(x,y1,2)
plt.grid()
plt.title("VSP sum/hastighed")
plt.ylabel('VSP - kW/ton')
plt.xlabel('Hastighed - km/t')
plt.scatter(x,y1,s=5) # Define the plot itself
plt.plot(x, y1)
But then it plots it through every point.
I have also tried with sklearn, and I can upload that if requested.

You correctly fitted a 2nd degree polynomial. You are just not using it in the plot you do after that.
plt.scatter(x,y1,s=5) does a scatter plot of your original data, and plt.plot(x, y1) plots a line through all your data.
To plot the polynomial, you need to store the polynomial fit in a variable. Then define a range of x values to plot over and compute the predicted y values from the fit:
p = np.polyfit(x,y1,2)
xn = np.linspace(np.min(x), np.max(x), 100)
yn = np.poly1d(p)(xn)
plt.scatter(x,y1,s=5)
plt.plot(xn, yn)

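polyfit returns the coefficients of your polynomial. Evaluate them on sorted x values (otherwise the line zig-zags between unsorted points), for example:
p = np.polyfit(x,y1,2)
xs = np.sort(x)
y2 = np.polyval(p, xs)
plt.plot(xs, y2)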
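For anyone who wants to try the approach without the CSV file, here is a minimal sketch on synthetic data. The noisy quadratic below is an assumption made purely for illustration and is not the asker's data. It fits with np.polyfit and evaluates the fit on an evenly spaced, sorted grid:

```python
import numpy as np

# Synthetic stand-in for the two CSV columns (hypothetical data)
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=200)   # unsorted, like real measurements
y = 0.02 * x**2 - 1.5 * x + 10 + rng.normal(scale=2.0, size=x.size)

# Least-squares fit of a 2nd degree polynomial; highest power comes first
coeffs = np.polyfit(x, y, 2)

# Evaluate the fitted polynomial on a sorted, evenly spaced grid
xn = np.linspace(x.min(), x.max(), 100)
yn = np.polyval(coeffs, xn)

print(coeffs)
```

To draw it, plt.scatter(x, y, s=5) followed by plt.plot(xn, yn) should give the scattered points with a single smooth curve through them.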

Related

How can I fit a gaussian curve in python?

I'm given an array and when I plot it I get a gaussian shape with some noise. I want to fit the gaussian. This is what I already have but when I plot this I do not get a fitted gaussian, instead I just get a straight line. I've tried this many different ways and I just can't figure it out.
random_sample=norm.rvs(h)
parameters = norm.fit(h)
fitted_pdf = norm.pdf(f, loc = parameters[0], scale = parameters[1])
normal_pdf = norm.pdf(f)
plt.plot(f,fitted_pdf,"green")
plt.plot(f, normal_pdf, "red")
plt.plot(f,h)
plt.show()
You can use fit from scipy.stats.norm as follows:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
data = np.random.normal(loc=5.0, scale=2.0, size=1000)
mean,std=norm.fit(data)
norm.fit tries to fit the parameters of a normal distribution based on the data. And indeed in the example above mean is approximately 5 and std is approximately 2.
In order to plot it, you can do:
plt.hist(data, bins=30, density=True)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
y = norm.pdf(x, mean, std)
plt.plot(x, y)
plt.show()
The blue boxes are the histogram of your data, and the green line is the Gaussian with the fitted parameters.
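As a quick sanity check on what norm.fit does: it is a maximum-likelihood fit, so for a normal distribution the result should match the sample mean and the standard deviation computed with ddof=0. A small sketch on synthetic data (the seed and sample size are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)

mean, std = norm.fit(data)

# The maximum-likelihood estimates coincide with the sample statistics
print(mean, data.mean())
print(std, data.std())   # np.std uses ddof=0 by default
```

Note that norm.fit expects the raw samples. If what you have is an already binned or curve-shaped array (heights as a function of x, as the question seems to describe), a least-squares fit of a Gaussian function is probably the better tool.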
There are many ways to fit a Gaussian function to a data set. I often use astropy when fitting data, which is why I wanted to add this as an additional answer.
I use some data set that should simulate a gaussian with some noise:
import numpy as np
import matplotlib.pyplot as plt
from astropy import modeling
m = modeling.models.Gaussian1D(amplitude=10, mean=30, stddev=5)
x = np.linspace(0, 100, 2000)
data = m(x)
data = data + np.sqrt(data) * np.random.random(x.size) - 0.5
data -= data.min()
plt.plot(x, data)
Then fitting it is actually quite simple, you specify a model that you want to fit to the data and a fitter:
fitter = modeling.fitting.LevMarLSQFitter()
model = modeling.models.Gaussian1D() # depending on the data you need to give some initial values
fitted_model = fitter(model, x, data)
And plotted:
plt.plot(x, data)
plt.plot(x, fitted_model(x))
However you can also use just Scipy but you have to define the function yourself:
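from scipy import optimize
def gaussian(x, amplitude, mean, stddev):
    return amplitude * np.exp(-0.5 * ((x - mean) / stddev)**2)
# curve_fit starts from 1 for every parameter by default, so pass a rough initial guess
p0 = [data.max(), x[np.argmax(data)], (x.max() - x.min()) / 10]
popt, _ = optimize.curve_fit(gaussian, x, data, p0=p0)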
This returns the optimal arguments for the fit and you can plot it like this:
plt.plot(x, data)
plt.plot(x, gaussian(x, *popt))
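Here is a self-contained sketch of the SciPy route, using the conventional Gaussian parametrization so that the third parameter really is the standard deviation. The synthetic data, the noise level, and the heuristics used for the initial guess p0 are all assumptions for illustration:

```python
import numpy as np
from scipy import optimize

def gaussian(x, amplitude, mean, stddev):
    return amplitude * np.exp(-0.5 * ((x - mean) / stddev) ** 2)

# Synthetic noisy Gaussian (hypothetical data)
rng = np.random.default_rng(0)
x = np.linspace(0, 100, 2000)
data = gaussian(x, 10.0, 30.0, 5.0) + rng.normal(scale=0.3, size=x.size)

# Rough initial guesses read off the data: peak height, peak position, a width
p0 = [data.max(), x[np.argmax(data)], (x.max() - x.min()) / 10]
popt, pcov = optimize.curve_fit(gaussian, x, data, p0=p0)

# One-sigma uncertainties of the fitted parameters
perr = np.sqrt(np.diag(pcov))
print(popt, perr)
```

Without p0, curve_fit starts every parameter at 1, which can leave the optimizer far from the peak, so a data-driven starting point is usually worth the extra line.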

Python: curve_fit from scipy.optimize has no option for restricting the range of x values

I can't find a way to tell curve_fit to only use x values within a specific range. I found the "bounds" parameter, but this only seems to apply to the parameters of my function.
When you have data to which you want to fit, for example, a linear curve (but only in a specific region of the data), you have to create a new list. This is especially awkward because pyplot.plot takes two separate lists for x and y values, while for manual filtering you need them as (x,y) pairs.
The easiest solution is indeed to create a new list, which is a filtered version of the original list. It is of course best to work with numpy arrays instead of python lists.
So assume you have two arrays x and y, and you only want to fit a function f to those values where x is larger than some number a.
You can filter and curve_fit them as
x2 = x[x>a]
y2 = y[x>a]
popt2, _ = scipy.optimize.curve_fit(f, x2, y2 )
A complete example:
import numpy as np; np.random.seed(0)
import matplotlib.pyplot as plt
import scipy.optimize
x = np.linspace(-1,3)
y = x**2 + np.random.normal(size=len(x))
f = lambda x, a,b : a* x +b
popt, _ = scipy.optimize.curve_fit(f, x,y, p0=(1,0))
x2 = x[x>0.7]
y2 = y[x>0.7]
popt2, _ = scipy.optimize.curve_fit(f, x2,y2, p0=(1,0))
plt.plot(x,y, marker="o", ls="", ms=4, label="all data")
plt.plot(x, f(x, *popt), color="moccasin", label="fit all data")
plt.plot(x2, f(x2, *popt2), label="fit filtered data")
plt.legend()
plt.show()
Finally just to mention it, you can also connect several conditions using logical operators, like x[(x>0.7) & (x<2.5)].
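A short sketch of that combined condition in use. The mask is built once and applied to both arrays, so x and y stay aligned; the data and the window limits are the same illustrative choices as above:

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b):
    return a * x + b

rng = np.random.default_rng(0)
x = np.linspace(-1, 3, 200)
y = x**2 + rng.normal(scale=0.05, size=x.size)

# Keep only the window 0.7 < x < 2.5; the parentheses are required
# because & binds more tightly than the comparisons
mask = (x > 0.7) & (x < 2.5)
popt, _ = curve_fit(line, x[mask], y[mask], p0=(1, 0))

print(mask.sum(), popt)
```

Use | for "or" and ~ for "not" in the same way. Python's and / or keywords do not work element-wise on arrays.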

How to correctly use scikit-learn's Gaussian Process for a 2D-inputs, 1D-output regression?

Prior to posting I did a lot of searches and found this question which might be exactly my problem. However, I tried what is proposed in the answer but unfortunately this did not fix it, and I couldn't add a comment to request further explanation, as I am a new member here.
Anyway, I want to use the Gaussian Processes with scikit-learn in Python on a simple but real case to start (using the examples provided in scikit-learn's documentation). I have a 2D input set (8 couples of 2 parameters) called X. I have 8 corresponding outputs, gathered in the 1D-array y.
# Inputs: 8 points
X = np.array([[p1, q1],[p2, q2],[p3, q3],[p4, q4],[p5, q5],[p6, q6],[p7, q7],[p8, q8]])
# Observations: 8 couples
y = np.array([r1,r2,r3,r4,r5,r6,r7,r8])
I defined an input test space x:
# Input space
x1 = np.linspace(x1min, x1max) #p
x2 = np.linspace(x2min, x2max) #q
x = (np.array([x1, x2])).T
Then I instantiate the GP model, fit it to my training data (X,y), and make the 1D prediction y_pred on my input space x:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
kernel = C(1.0, (1e-3, 1e3)) * RBF([5,5], (1e-2, 1e2))
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=15)
gp.fit(X, y)
y_pred, MSE = gp.predict(x, return_std=True)
And then I make a 3D plot:
fig = pl.figure()
ax = fig.add_subplot(111, projection='3d')
Xp, Yp = np.meshgrid(x1, x2)
Zp = np.reshape(y_pred,50)
surf = ax.plot_surface(Xp, Yp, Zp, rstride=1, cstride=1, cmap=cm.jet,
linewidth=0, antialiased=False)
pl.show()
This is what I obtain:
When I modify the kernel parameters I get something like this, similar to what the poster I mentioned above got:
These plots don't even match the observation from the original training points (the lower response is obtained for [65.1,37] and the highest for [92.3,54]).
I am fairly new to GPs in 2D (also started Python not long ago) so I think I'm missing something here... Any answer would be helpful and greatly appreciated, thanks!
You're using two features to predict a third. Rather than a 3D plot like plot_surface, it's usually clearer if you use a 2D plot that's able to show information about a third dimension, like hist2d or pcolormesh. Here's a complete example using data/code similar to that in the question:
from itertools import product
import numpy as np
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
X = np.array([[0,0],[2,0],[4,0],[6,0],[8,0],[10,0],[12,0],[14,0],[16,0],[0,2],
[2,2],[4,2],[6,2],[8,2],[10,2],[12,2],[14,2],[16,2]])
y = np.array([-54,-60,-62,-64,-66,-68,-70,-72,-74,-60,-62,-64,-66,
-68,-70,-72,-74,-76])
# Input space
x1 = np.linspace(X[:,0].min(), X[:,0].max()) #p
x2 = np.linspace(X[:,1].min(), X[:,1].max()) #q
x = (np.array([x1, x2])).T
kernel = C(1.0, (1e-3, 1e3)) * RBF([5,5], (1e-2, 1e2))
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=15)
gp.fit(X, y)
x1x2 = np.array(list(product(x1, x2)))
y_pred, MSE = gp.predict(x1x2, return_std=True)
X0p, X1p = x1x2[:,0].reshape(50,50), x1x2[:,1].reshape(50,50)
Zp = np.reshape(y_pred,(50,50))
# alternative way to generate equivalent X0p, X1p, Zp
# X0p, X1p = np.meshgrid(x1, x2)
# Zp = [gp.predict([(X0p[i, j], X1p[i, j]) for i in range(X0p.shape[0])]) for j in range(X0p.shape[1])]
# Zp = np.array(Zp).T
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
ax.pcolormesh(X0p, X1p, Zp)
plt.show()
Output:
Kinda plain looking, but so was my example data. In general, you shouldn't expect particularly interesting results with this few data points.
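The main difference from the code in the question is the set of points passed to predict. A NumPy-only sketch of the two layouts (the ranges mirror the example data above):

```python
import numpy as np

x1 = np.linspace(0, 16, 50)   # first feature
x2 = np.linspace(0, 2, 50)    # second feature

# Question's layout: pairs the i-th x1 with the i-th x2,
# giving only 50 points along a diagonal of the input space
paired = np.array([x1, x2]).T

# Grid layout: every combination of x1 and x2, giving 50 * 50 points
X0p, X1p = np.meshgrid(x1, x2)
grid = np.column_stack([X0p.ravel(), X1p.ravel()])

print(paired.shape)   # (50, 2)
print(grid.shape)     # (2500, 2)
```

Predictions made on grid can then be reshaped with y_pred.reshape(X0p.shape) so that they line up with X0p and X1p for pcolormesh or plot_surface.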
Also, if you do want the surface plot, you can just replace the pcolormesh line with what you originally had (more or less):
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_surface(X0p, X1p, Zp, rstride=1, cstride=1, cmap='jet', linewidth=0, antialiased=False)
Output:
I'm also fairly new using scikit-learn gaussian process. But after some effort, I managed to implement a 3-d gaussian process regression successfully. There are a lot of examples of 1-d regression but nothing on higher input dimensions.
Perhaps you could show the values that you are using.
I found that sometimes the format in which you send the inputs can produce some issues. Try formatting input X as:
X = np.array([param1, param2]).T
and format the output as:
gp.fit(X, y.reshape(-1,1))
Also, as I understood, the implementation assumes a mean function m=0. If the output you are trying to regress presents an average value which differs significantly from 0 you should normalize it (that will probably solve your problem). Standardizing the parameter space will help as well.
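A minimal sketch of both suggestions, assuming a reasonably recent scikit-learn: normalize_y=True lets the regressor handle the non-zero mean of the targets, and StandardScaler standardizes the inputs. The training data below is made up for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
from sklearn.preprocessing import StandardScaler

# Hypothetical 2D inputs and targets whose mean is far from zero
rng = np.random.default_rng(0)
X = rng.uniform([50, 30], [100, 60], size=(20, 2))
y = 100.0 + 10.0 * np.sin(X[:, 0] / 10.0) + 0.5 * X[:, 1]

# Standardize the inputs so both features have comparable scales
scaler = StandardScaler().fit(X)
Xs = scaler.transform(X)

kernel = C(1.0, (1e-3, 1e3)) * RBF([1.0, 1.0], (1e-2, 1e2))
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                              n_restarts_optimizer=5, random_state=0)
gp.fit(Xs, y)

y_pred, y_std = gp.predict(Xs, return_std=True)

# Far away from all training points the prediction falls back to the prior mean,
# which is the training mean of y when normalize_y=True (and 0 otherwise)
far = gp.predict(scaler.transform([[1e6, 1e6]]))
print(far[0], y.mean())
```

With normalize_y left at its default of False, the far-away prediction would drift toward 0 instead, which is one way a surface can end up looking unrelated to the observations.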

How can I change the parameters of gaussian_kde for a scatter plot colored by density in matplotlib

As explained by Joe Kington answering in this question : How can I make a scatter plot colored by density in matplotlib, I made a scatter plot colored by density. However, due to the complex distribution of my data, I would like to change the parameters used to calculate the density.
Here are the results with some fake data similar to mine:
I would want to calibrate the density calculations of gaussian_kde so that the left part of the plot looks like this :
I don't like the first plot because the groups of points influence the density of adjacent groups of points and that prevents me from analyzing the distribution within a group. In other words, even if each of the 8 groups have exactly the same distribution, that won't be visible on the graph.
I tried to modify the covariance_factor (like I once did for a 2d plot of density over x), but when gaussian_kde is used with multiple dimension arrays it returns a numpy.ndarray, not a "scipy.stats.kde.gaussian_kde" object. Plus, I don't even know if changing the covariance_factor will do it.
Here's my dummy code :
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
# Generate fake data
a = np.random.normal(size=1000)
b = np.random.normal(size=1000)
# Data for the first image
x = np.concatenate((a+10,a+10,a+20,a+20,a+30,a+30,a+40,a+40,a+80))
y = np.concatenate((b+10,b-10,b+10,b-10,b+10,b-10,b+10,b-10,b*4))
# Data for the second image
#x = np.concatenate((a+10,a+10,a+20,a+20,a+30,a+30,a+40,a+40))
#y = np.concatenate((b+10,b-10,b+10,b-10,b+10,b-10,b+10,b-10))
# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)
# My unsuccessful attempt to modify the covariance, which would work in 1D with "z = gaussian_kde(x)"
#z.covariance_factor = lambda : 0.01
#z._compute_covariance()
# Sort the points by density, so that the densest points are plotted last
idx = z.argsort()
x, y, z = x[idx], y[idx], z[idx]
fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=50, edgecolor='none')
plt.show()
The solution could use another density calculator, I don't mind.
The goal is to make a density plot like the ones shown above, where I can play with the density parameters.
I'm using python 3.4.3
Did you have a look at Seaborn? It's not exactly what you're asking for, but it already has functions for generating density plots:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Generate fake data
a = np.random.normal(size=1000)
b = np.random.normal(size=1000)
# Data for the first image
x = np.concatenate((a+10, a+10, a+20, a+20, a+30, a+30, a+40, a+40, a+80))
y = np.concatenate((b+10, b-10, b+10, b-10, b+10, b-10, b+10, b-10, b*4))
# Recent seaborn versions require keyword arguments and no longer accept stat_func
sns.jointplot(x=x, y=y, kind="hex")
sns.jointplot(x=x, y=y, kind="kde")
plt.show()
It gives:
and
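To address the bandwidth question directly: gaussian_kde(xy)(xy) returns an array because the KDE object is called immediately. Keeping the object in its own variable lets you control the bandwidth, either through the bw_method argument or later with set_bandwidth. A minimal sketch with a smaller version of the fake data (the factor 0.05 is an arbitrary value to experiment with):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Smaller version of the fake data from the question
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
x = np.concatenate((a + 10, a + 10, a + 20, a + 20, a + 80))
y = np.concatenate((b + 10, b - 10, b + 10, b - 10, b * 4))
xy = np.vstack([x, y])

kde_default = gaussian_kde(xy)                  # Scott's rule picks the factor
kde_narrow = gaussian_kde(xy, bw_method=0.05)   # a scalar is used directly as the factor

z_default = kde_default(xy)
z_narrow = kde_narrow(xy)

print(kde_default.factor, kde_narrow.factor)
```

Passing c=z_narrow to ax.scatter should make neighbouring groups influence each other much less, at the cost of a noisier density within each group.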

Integrate 2D kernel density estimate

I have a x,y distribution of points for which I obtain the KDE through scipy.stats.gaussian_kde. This is my code and how the output looks (the x,y data can be obtained from here):
import numpy as np
from scipy import stats
# Obtain data from file.
data = np.loadtxt('data.dat', unpack=True)
m1, m2 = data[0], data[1]
xmin, xmax = min(m1), max(m1)
ymin, ymax = min(m2), max(m2)
# Perform a kernel density estimate (KDE) on the data
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([x.ravel(), y.ravel()])
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
f = np.reshape(kernel(positions).T, x.shape)
# Define the number that will determine the integration limits
x1, y1 = 2.5, 1.5
# Perform integration?
# Plot the results:
import matplotlib.pyplot as plt
# Set limits
plt.xlim(xmin,xmax)
plt.ylim(ymin,ymax)
# KDE density plot
plt.imshow(np.rot90(f), cmap=plt.cm.gist_earth_r, extent=[xmin, xmax, ymin, ymax])
# Draw contour lines
cset = plt.contour(x,y,f)
plt.clabel(cset, inline=1, fontsize=10)
plt.colorbar()
# Plot point
plt.scatter(x1, y1, c='r', s=35)
plt.show()
The red point with coordinates (x1, y1) has (like every point in the 2D plot) an associated value given by f (the kernel or KDE) between 0 and 0.42. Let's say that f(x1, y1) = 0.08.
I need to integrate f with integration limits in x and y given by those regions where f evaluates to less than f(x1, y1), ie: f(x, y)<0.08.
From what I've seen, Python can integrate functions and one-dimensional arrays numerically, but I haven't seen anything that would let me perform a numerical integration on a 2D array (the f kernel). Furthermore, I'm not sure how I would even recognize the regions given by that particular condition (i.e. f(x, y) less than a given value).
Can this be done at all?
Here is a way to do it using monte carlo integration. It is a little slow, and there is randomness in the solution. The error is inversely proportional to the square root of the sample size, while the running time is directly proportional to the sample size (where sample size refers to the monte carlo sample (10000 in my example below), not the size of your data set). Here is some simple code using your kernel object.
#Compute the point below which to integrate
iso = kernel((x1,y1))
#Sample from your KDE distribution
sample = kernel.resample(size=10000)
#Filter the sample
insample = kernel(sample) < iso
#The integral you want is equivalent to the probability of drawing a point
#that gets through the filter
integral = insample.sum() / float(insample.shape[0])
print(integral)
I get approximately 0.2 as the answer for your data set.
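Since the question's data file is not included here, the following sketch runs the same Monte Carlo idea on synthetic data and cross-checks it against a plain grid sum over the cells where f is below the threshold. The data, the padding of the grid, and the grid resolution are assumptions for illustration:

```python
import numpy as np
from scipy import stats

np.random.seed(0)  # seed NumPy's global RNG for repeatability
data = np.random.multivariate_normal([2.0, 1.0], [[1.0, 0.3], [0.3, 0.5]], size=500).T
kernel = stats.gaussian_kde(data)

x1, y1 = 2.5, 1.5
iso = kernel((x1, y1))[0]   # density at the reference point

# Monte Carlo estimate: fraction of KDE samples that land where f < iso
sample = kernel.resample(size=10000)
mc_integral = np.mean(kernel(sample) < iso)

# Grid estimate: sum f over the cells where f < iso, times the cell area.
# The grid is padded so that the low-density tails are mostly included.
pad = 3.0
xs = np.linspace(data[0].min() - pad, data[0].max() + pad, 200)
ys = np.linspace(data[1].min() - pad, data[1].max() + pad, 200)
X, Y = np.meshgrid(xs, ys)
f = kernel(np.vstack([X.ravel(), Y.ravel()])).reshape(X.shape)
cell_area = (xs[1] - xs[0]) * (ys[1] - ys[0])
grid_integral = f[f < iso].sum() * cell_area

print(mc_integral, grid_integral)
```

The two numbers should agree to within the Monte Carlo noise. The grid version is deterministic, but it silently drops any probability mass outside the padded grid.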
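scipy's gaussian_kde now provides integrate_box, which integrates the KDE over a rectangular box (note that this is a rectangle, not the region where f is below f(x1, y1)):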
kernel.integrate_box([-np.inf,-np.inf], [2.5,1.5])
A direct way is to integrate numerically:
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
from scipy import integrate
import numpy as np
mean = [0, 0]
cov = [[5, 0], [0, 10]]
x, y = np.random.multivariate_normal(mean, cov, 5000).T
plt.plot(x, y, 'o')
plt.show()
# Stack into an (n_samples, 2) array; zip() returns an iterator in Python 3
sample = np.column_stack((x, y))
kde = KernelDensity().fit(sample)
def f_kde(x, y):
    # score_samples returns the log-density, so exponentiate it
    return np.exp(kde.score_samples([[x, y]]))[0]
# Upper integration limits, as in the question
x1, y1 = 2.5, 1.5
integrate.nquad(f_kde, [[-np.inf, x1], [-np.inf, y1]])
The problem is that this is very slow on a large scale. For example, if you want to evaluate the integral for many points along a line, say x from 0 to 100, it would take a long time to calculate.
Note: I used the KDE from sklearn, but I believe you can swap in another implementation as well.
Using the kernel as defined in the original question:
import numpy as np
from scipy import stats
from scipy import integrate
def integ_func(kde, x1, y1):
    def f_kde(x, y):
        # gaussian_kde returns a length-1 array; nquad expects a scalar
        return kde((x, y))[0]
    integ = integrate.nquad(f_kde, [[-np.inf, x1], [-np.inf, y1]])
    return integ
# Obtain data from file.
data = np.loadtxt('data.dat', unpack=True)
# Perform a kernel density estimate (KDE) on the data
kernel = stats.gaussian_kde(data)
# Define the number that will determine the integration limits
x1, y1 = 2.5, 1.5
print(integ_func(kernel, x1, y1))
