I am sorry for the probably stupid question, but I have been trying for hours now to estimate a density from a set of 2D data. Let's assume my data is given by the array samples = np.random.uniform(0,1,size=(50,2)). I just want to use scikit-learn's KernelDensity to estimate the density from the samples array (which is here of course a 2D uniform density), and I am trying the following:
import numpy as np
from sklearn.neighbors import KernelDensity
from matplotlib import pyplot as plt
samples = np.random.uniform(0,1,size=(50,2)) # random samples
x = y = np.linspace(0,1,100)
X,Y = np.meshgrid(x,y) # creating grid of data , to evaluate estimated density on
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(samples) # creating density from samples
kde.score_samples(X,Y) # I want to evaluate the estimated density on the X,Y grid
But the last step always yields the error: score_samples() takes 2 positional arguments but 3 were given.
So apparently .score_samples cannot take a grid as input, but there are no tutorials/docs for the 2D case, so I don't know how to fix this issue. It would be really great if someone could help.
Looking at the Kernel Density Estimate of Species Distributions example, you have to package the x,y data together (both the training data and the new sample grid).
Below is a function that simplifies the sklearn API.
import numpy as np
from sklearn.neighbors import KernelDensity

def kde2D(x, y, bandwidth, xbins=100j, ybins=100j, **kwargs):
    """Build 2D kernel density estimate (KDE)."""
    # create grid of sample locations (default: 100x100)
    xx, yy = np.mgrid[x.min():x.max():xbins,
                      y.min():y.max():ybins]
    xy_sample = np.vstack([yy.ravel(), xx.ravel()]).T
    xy_train = np.vstack([y, x]).T
    kde_skl = KernelDensity(bandwidth=bandwidth, **kwargs)
    kde_skl.fit(xy_train)
    # score_samples() returns the log-likelihood of the samples
    z = np.exp(kde_skl.score_samples(xy_sample))
    return xx, yy, np.reshape(z, xx.shape)
This gives you the xx, yy, zz needed for something like a scatter or pcolormesh plot. I've copied the example from the scipy page on the gaussian_kde function.
import numpy as np
import matplotlib.pyplot as plt
m1 = np.random.normal(size=1000)
m2 = np.random.normal(scale=0.5, size=1000)
x, y = m1 + m2, m1 - m2
xx, yy, zz = kde2D(x, y, 1.0)
plt.pcolormesh(xx, yy, zz)
plt.scatter(x, y, s=2, facecolor='white')
I am having some trouble understanding the proper way to marginalize variables out of probability distributions. As I understand it, the proper way is to sum over the variable being marginalized out, leaving only the variables to be kept. For a normal distribution, the result is also a normal distribution. I can show this part with equations and integrals, but when I try to check it in Python I get incorrect results: the peak of the resulting distribution is much higher.
Here is an example (the code is from Marginalize a surface plot and use kernel density estimation (kde) on it):
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy.stats import multivariate_normal, gaussian_kde
# Choose mean vector and variance-covariance matrix
mu = np.array([0, 0])
sigma = np.array([[2, 0], [0, 3]])
# Create surface plot data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
rv = multivariate_normal(mean=mu, cov=sigma)
Z = np.array([rv.pdf(pair) for pair in zip(X.ravel(), Y.ravel())])
Z = Z.reshape(X.shape)
# Plot it
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
pos = ax.plot_surface(X, Y, Z)
plt.show()
This makes a plot of a two-variable normal distribution. If I take the sum over one of the variables to get the marginal distribution:
Zmarg_y = Z.sum(axis=0)
plt.plot(x, Zmarg_y)
plt.show()
the result is not the same as if I simply drop the variable instead of marginalizing it out. I also tried this with a three-variable Gaussian distribution, marginalizing out one variable to get a two-variable distribution; the result was also on a higher scale. Is there a problem with normalization here? I am studying probability for the first time and am trying to understand every single detail, and I think I am misunderstanding something important here. Thank you.
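A quick check on the normalization (a minimal sketch building on the code above, not part of the original question): the sum only approximates the marginalization integral after multiplying by the grid spacing, which is exactly why the raw sum peaks too high.
# Z.sum(axis=0) is a Riemann sum of the integral over one variable only after
# multiplying by the grid spacing; without dy the curve is ~10x too tall here.
dx = x[1] - x[0]
dy = y[1] - y[0]
Zmarg = Z.sum(axis=0) * dy    # approximates the marginal density
print(Zmarg.sum() * dx)       # close to 1 (the grid clips a little tail mass)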
I'm wondering if there is a good way to fit a Gaussian (normal) curve to a histogram in the form of a numpy array: np.histogram(array, bins).
How can such a curve be plotted on the same graph, adjusted in height and width to match the histogram?
You can fit your histogram using a Gaussian (i.e. normal) distribution, for example using scipy's curve_fit. I have written a small example below. Note that depending on your data, you may need to find a way to make good guesses for the starting values for the fit (p0). Poor starting values may cause your fit to fail.
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
from scipy.stats import norm

def fit_func(x, a, mu, sigma, c):
    """Gaussian function used for the fit"""
    return a * norm.pdf(x, loc=mu, scale=sigma) + c

# make up some normally distributed data and do a histogram
y = 2 * np.random.normal(loc=1, scale=2, size=1000) + 2
no_bins = 20
hist, left = np.histogram(y, bins=no_bins)
centers = left[:-1] + (left[1] - left[0]) / 2  # bin centers, not right edges

# fit the histogram
p0 = [2, 0, 2, 2]  # starting values for the fit
p1, _ = curve_fit(fit_func, centers, hist, p0, maxfev=10000)

# plot the histogram and fit together
fig, ax = plt.subplots()
ax.hist(y, bins=no_bins)
x = np.linspace(left[0], left[-1], 1000)
y_fit = fit_func(x, *p1)
ax.plot(x, y_fit, 'r-')
plt.show()
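If the fixed starting values fail for your data, one option (a sketch of my own, not from the original answer) is to derive the guesses from the histogram itself:
# A Gaussian a*norm.pdf peaks at a/(sigma*sqrt(2*pi)), so invert that for the
# amplitude; use the densest bin for the mean and the sample std for sigma.
sigma0 = y.std()
p0 = [hist.max() * sigma0 * np.sqrt(2 * np.pi),  # amplitude
      centers[np.argmax(hist)],                  # mean
      sigma0,                                    # sigma
      0.0]                                       # vertical offset
p1, _ = curve_fit(fit_func, centers, hist, p0, maxfev=10000)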
I'm given an array, and when I plot it I get a Gaussian shape with some noise. I want to fit the Gaussian. This is what I already have, but when I plot it I do not get a fitted Gaussian; instead I just get a straight line. I've tried this many different ways and I just can't figure it out.
# h is the data array and f the corresponding x values (defined elsewhere in my code)
random_sample = norm.rvs(h)
parameters = norm.fit(h)
fitted_pdf = norm.pdf(f, loc=parameters[0], scale=parameters[1])
normal_pdf = norm.pdf(f)
plt.plot(f, fitted_pdf, "green")
plt.plot(f, normal_pdf, "red")
plt.plot(f, h)
plt.show()
You can use fit from scipy.stats.norm as follows:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
data = np.random.normal(loc=5.0, scale=2.0, size=1000)
mean, std = norm.fit(data)
norm.fit tries to fit the parameters of a normal distribution based on the data. And indeed in the example above mean is approximately 5 and std is approximately 2.
In order to plot it, you can do:
plt.hist(data, bins=30, density=True)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
y = norm.pdf(x, mean, std)
plt.plot(x, y)
plt.show()
The bars show the histogram of your data, and the line is the Gaussian with the fitted parameters.
There are many ways to fit a Gaussian function to a data set. I often use astropy when fitting data; that's why I wanted to add this as an additional answer.
I use a data set that should simulate a Gaussian with some noise:
import numpy as np
import matplotlib.pyplot as plt
from astropy import modeling

m = modeling.models.Gaussian1D(amplitude=10, mean=30, stddev=5)
x = np.linspace(0, 100, 2000)
data = m(x)
data = data + np.sqrt(data) * np.random.random(x.size) - 0.5
data -= data.min()
plt.plot(x, data)
Then fitting it is actually quite simple, you specify a model that you want to fit to the data and a fitter:
fitter = modeling.fitting.LevMarLSQFitter()
model = modeling.models.Gaussian1D() # depending on the data you need to give some initial values
fitted_model = fitter(model, x, data)
And plotted:
plt.plot(x, data)
plt.plot(x, fitted_model(x))
However, you can also use plain scipy, but then you have to define the function yourself:
from scipy import optimize

def gaussian(x, amplitude, mean, stddev):
    return amplitude * np.exp(-(x - mean)**2 / (2 * stddev**2))

# rough initial guesses keep curve_fit (which starts every parameter at 1)
# from stalling far away from the peak
popt, _ = optimize.curve_fit(gaussian, x, data, p0=[10, 30, 5])
This returns the optimal arguments for the fit and you can plot it like this:
plt.plot(x, data)
plt.plot(x, gaussian(x, *popt))
As explained by Joe Kington in How can I make a scatter plot colored by density in matplotlib, I made a scatter plot colored by density. However, due to the complex distribution of my data, I would like to change the parameters used to calculate the density.
Here are the results with some fake data similar to mine:
I would like to calibrate the density calculation of gaussian_kde so that the left part of the plot looks like this:
I don't like the first plot because the groups of points influence the density of adjacent groups, and that prevents me from analyzing the distribution within a group. In other words, even if each of the 8 groups had exactly the same distribution, that wouldn't be visible on the graph.
I tried to modify the covariance_factor (like I once did for a 2D plot of density over x), but when gaussian_kde is used with multi-dimensional arrays it returns a numpy.ndarray, not a scipy.stats.kde.gaussian_kde object. Plus, I don't even know whether changing the covariance_factor will do it.
Here's my dummy code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
# Generate fake data
a = np.random.normal(size=1000)
b = np.random.normal(size=1000)
# Data for the first image
x = np.concatenate((a+10,a+10,a+20,a+20,a+30,a+30,a+40,a+40,a+80))
y = np.concatenate((b+10,b-10,b+10,b-10,b+10,b-10,b+10,b-10,b*4))
# Data for the second image
#x = np.concatenate((a+10,a+10,a+20,a+20,a+30,a+30,a+40,a+40))
#y = np.concatenate((b+10,b-10,b+10,b-10,b+10,b-10,b+10,b-10))
# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)
# My unsuccessful try to modify the covariance, which would work in 1D with "z = gaussian_kde(x)"
#z.covariance_factor = lambda : 0.01
#z._compute_covariance()
# Sort the points by density, so that the densest points are plotted last
idx = z.argsort()
x, y, z = x[idx], y[idx], z[idx]
fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=50, edgecolor='')
plt.show()
The solution could use another density calculator; I don't mind.
The goal is to make a density plot like the ones shown above, where I can play with the density parameters.
I'm using Python 3.4.3.
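For reference, one way to get a handle on the bandwidth (a minimal sketch of my own, using scipy's bw_method argument rather than the covariance_factor route): gaussian_kde accepts bw_method at construction, so the kernel width can be set before the object is evaluated into an array:
# z = gaussian_kde(xy)(xy) evaluates immediately; keep the kde object instead.
# A small bw_method narrows the kernels so neighbouring groups stop bleeding
# into each other (0.05 is an arbitrary example value).
kde = gaussian_kde(xy, bw_method=0.05)
z = kde(xy)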
Did you have a look at Seaborn? It's not exactly what you're asking for, but it already has functions for generating density plots:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Generate fake data
a = np.random.normal(size=1000)
b = np.random.normal(size=1000)
# Data for the first image
x = np.concatenate((a+10, a+10, a+20, a+20, a+30, a+30, a+40, a+40, a+80))
y = np.concatenate((b+10, b-10, b+10, b-10, b+10, b-10, b+10, b-10, b*4))
sns.jointplot(x=x, y=y, kind="hex")
sns.jointplot(x=x, y=y, kind="kde")
plt.show()
It gives a hexbin jointplot and a bivariate KDE jointplot of the data.
I have an x,y distribution of points for which I obtain the KDE through scipy.stats.gaussian_kde. This is my code and how the output looks (the x,y data can be obtained from here):
import numpy as np
from scipy import stats
# Obtain data from file.
data = np.loadtxt('data.dat', unpack=True)
m1, m2 = data[0], data[1]
xmin, xmax = min(m1), max(m1)
ymin, ymax = min(m2), max(m2)
# Perform a kernel density estimate (KDE) on the data
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([x.ravel(), y.ravel()])
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
f = np.reshape(kernel(positions).T, x.shape)
# Define the number that will determine the integration limits
x1, y1 = 2.5, 1.5
# Perform integration?
# Plot the results:
import matplotlib.pyplot as plt
# Set limits
plt.xlim(xmin,xmax)
plt.ylim(ymin,ymax)
# KDE density plot
plt.imshow(np.rot90(f), cmap=plt.cm.gist_earth_r, extent=[xmin, xmax, ymin, ymax])
# Draw contour lines
cset = plt.contour(x,y,f)
plt.clabel(cset, inline=1, fontsize=10)
plt.colorbar()
# Plot point
plt.scatter(x1, y1, c='r', s=35)
plt.show()
The red point with coordinates (x1, y1) has (like every point in the 2D plot) an associated value given by f (the kernel or KDE) between 0 and 0.42. Let's say that f(x1, y1) = 0.08.
I need to integrate f with integration limits in x and y given by those regions where f evaluates to less than f(x1, y1), i.e. f(x, y) < 0.08.
From what I've seen, Python can numerically integrate functions and one-dimensional arrays, but I haven't seen anything that would let me perform a numerical integration on a 2D array (the f kernel). Furthermore, I'm not sure how I would even recognize the regions given by that particular condition (i.e. where f(x, y) is less than a given value).
Can this be done at all?
Here is a way to do it using Monte Carlo integration. It is a little slow, and there is randomness in the solution. The error is inversely proportional to the square root of the sample size, while the running time is directly proportional to the sample size (where sample size refers to the Monte Carlo sample, 10000 in my example below, not the size of your data set). Here is some simple code using your kernel object.
# Compute the density at the point below which to integrate
iso = kernel((x1, y1))
# Sample from your KDE distribution
sample = kernel.resample(size=10000)
# Filter the sample: keep draws whose density is below the iso level
insample = kernel(sample) < iso
# The integral you want is equivalent to the probability of drawing a point
# that gets through the filter
integral = insample.sum() / float(insample.shape[0])
print(integral)
I get approximately 0.2 as the answer for your data set.
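As a cross-check, here is a grid-based sketch of my own (assuming it is acceptable to clip the KDE's tails at the data range): since f is already evaluated on the 100x100 grid from the question, the integral over the region where f < iso can be approximated by a masked Riemann sum:
# x and y are the np.mgrid arrays from the question, f the evaluated KDE,
# and iso the threshold computed above.
dx = x[1, 0] - x[0, 0]          # grid spacing along x
dy = y[0, 1] - y[0, 0]          # grid spacing along y
mask = f < iso                  # cells where the density is below the threshold
print(f[mask].sum() * dx * dy)  # roughly matches the Monte Carlo estimate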
This is now available directly in scipy: gaussian_kde.integrate_box integrates the KDE over a rectangular region:
kernel.integrate_box([-np.inf,-np.inf], [2.5,1.5])
A direct way is to integrate the estimated density numerically:
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
from scipy import integrate
import numpy as np

mean = [0, 0]
cov = [[5, 0], [0, 10]]
x, y = np.random.multivariate_normal(mean, cov, 5000).T
plt.plot(x, y, 'o')
plt.show()

# build the (n_samples, 2) array sklearn expects (zip alone no longer works in Python 3)
sample = np.column_stack((x, y))
kde = KernelDensity().fit(sample)

def f_kde(x, y):
    # score_samples returns the log-density as a length-1 array; take the scalar
    return np.exp(kde.score_samples([[x, y]]))[0]

x1, y1 = 2.5, 1.5  # integration limits
integrate.nquad(f_kde, [[-np.inf, x1], [-np.inf, y1]])
The problem is that this is very slow if you do it at a large scale. For example, if you wanted to evaluate the integral along a line of x values from 0 to 100, it would take a long time to compute.
Note: I used KernelDensity from sklearn, but I believe you can swap in another KDE implementation as well.
Using the kernel as defined in the original question:
import numpy as np
from scipy import stats
from scipy import integrate

def integ_func(kde, x1, y1):
    def f_kde(x, y):
        return kde((x, y))
    integ = integrate.nquad(f_kde, [[-np.inf, x1], [-np.inf, y1]])
    return integ
# Obtain data from file.
data = np.loadtxt('data.dat', unpack=True)
# Perform a kernel density estimate (KDE) on the data
kernel = stats.gaussian_kde(data)
# Define the number that will determine the integration limits
x1, y1 = 2.5, 1.5
print(integ_func(kernel, x1, y1))