Integrate 2D kernel density estimate - python

I have a x,y distribution of points for which I obtain the KDE through scipy.stats.gaussian_kde. This is my code and how the output looks (the x,y data can be obtained from here):
import numpy as np
from scipy import stats
# Obtain data from file.
data = np.loadtxt('data.dat', unpack=True)
m1, m2 = data[0], data[1]
xmin, xmax = min(m1), max(m1)
ymin, ymax = min(m2), max(m2)
# Perform a kernel density estimate (KDE) on the data
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([x.ravel(), y.ravel()])
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
f = np.reshape(kernel(positions).T, x.shape)
# Define the number that will determine the integration limits
x1, y1 = 2.5, 1.5
# Perform integration?
# Plot the results:
import matplotlib.pyplot as plt
# Set limits
plt.xlim(xmin,xmax)
plt.ylim(ymin,ymax)
# KDE density plot
plt.imshow(np.rot90(f), cmap=plt.cm.gist_earth_r, extent=[xmin, xmax, ymin, ymax])
# Draw contour lines
cset = plt.contour(x,y,f)
plt.clabel(cset, inline=1, fontsize=10)
plt.colorbar()
# Plot point
plt.scatter(x1, y1, c='r', s=35)
plt.show()
The red point with coordinates (x1, y1) has (like every point in the 2D plot) an associated value given by f (the kernel or KDE) between 0 and 0.42. Let's say that f(x1, y1) = 0.08.
I need to integrate f with integration limits in x and y given by those regions where f evaluates to less than f(x1, y1), ie: f(x, y)<0.08.
For what I've seen python can perform integration of functions and one dimensional arrays through numerical integration, but I haven't seen anything that would let me perform a numerical integration on a 2D array (the f kernel) Furthermore, I'm not sure how I would even recognize the regions given by that particular condition (ie: f(x, y)less than a given value)
Can this be done at all?

Here is a way to do it using monte carlo integration. It is a little slow, and there is randomness in the solution. The error is inversely proportional to the square root of the sample size, while the running time is directly proportional to the sample size (where sample size refers to the monte carlo sample (10000 in my example below), not the size of your data set). Here is some simple code using your kernel object.
#Compute the point below which to integrate
iso = kernel((x1,y1))
#Sample from your KDE distribution
sample = kernel.resample(size=10000)
#Filter the sample
insample = kernel(sample) < iso
#The integral you want is equivalent to the probability of drawing a point
#that gets through the filter
integral = insample.sum() / float(insample.shape[0])
print integral
I get approximately 0.2 as the answer for your data set.

Currently, it is available
kernel.integrate_box([-np.inf,-np.inf], [2.5,1.5])

A direct way is to integrate
import matplotlib.pyplot as plt
import sklearn
from scipy import integrate
import numpy as np
mean = [0, 0]
cov = [[5, 0], [0, 10]]
x, y = np.random.multivariate_normal(mean, cov, 5000).T
plt.plot(x, y, 'o')
plt.show()
sample = np.array(zip(x, y))
kde = sklearn.neighbors.KernelDensity().fit(sample)
def f_kde(x,y):
return np.exp((kde.score_samples([[x,y]])))
point = x1, y1
integrate.nquad(f_kde, [[-np.inf, x1],[-np.inf, y1]])
The problem is that, this is very slow if you do it in a large scale. For example, if you want to plot the x,y line at x (0,100), it would take a long time to calculate.
Notice: I used kde from sklearn, but I believe you can also change it into other form as well.
Using the kernel as defined in the original question:
import numpy as np
from scipy import stats
from scipy import integrate
def integ_func(kde, x1, y1):
def f_kde(x, y):
return kde((x, y))
integ = integrate.nquad(f_kde, [[-np.inf, x1], [-np.inf, y1]])
return integ
# Obtain data from file.
data = np.loadtxt('data.dat', unpack=True)
# Perform a kernel density estimate (KDE) on the data
kernel = stats.gaussian_kde(data)
# Define the number that will determine the integration limits
x1, y1 = 2.5, 1.5
print integ_func(kernel, x1, y1)

Related

Generating 2nd degree polynomial out of some data

I have some data which I want to generate a 2nd degree polyfit like this as example:
I have tried two different codes but the polynomial just trying to go through all points.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = pd.read_csv('TESTEXskelet.csv', sep=",")
x = data.Gennemsnitlig_hastighed
y1 = data.Sum_VSP
np.polyfit(x,y1,2)
plt.grid()
plt.title("VSP sum/hastighed")
plt.ylabel('VSP - kW/ton')
plt.xlabel('Hastighed - km/t')
plt.scatter(x,y1,s=5) # Definere selve plottet
plt.plot(x, y1)
But then it plots it through every point.
I have also tried with sklearn, and I can upload that if requested.
You correctly fitted a 2nd degree polynomial. You are just not using it in the plot you do after that.
plt.scatter(x,y1,s=5) does a scatter plot of your original data, and plt.plot(x, y1) plots a line through all your data.
To plot the polynomial you need to catch the polynomial fit into a variable. Then define a range for the x-axis you want to plot over and predict y values based on the polynomial fit:
p = np.polyfit(x,y1,2)
xn = np.linspace(np.min(x), np.max(x), 100)
yn = np.poly1d(p)(xn)
plt.scatter(x,y1,s=5)
plt.plot(xn, yn)
polyfit returns the parameters to your polynomial, try
p = np.polyfit(x,y1,2)
y2 = np.polyval(p, x)
plt.plot(x, y2)

How to marginalize out variable from multivariable distribution in Python?

I am having some troubles understanding proper way to marginalize out variables from probability distributions. As I understand the proper way to do this is to sum over variables that is being marginalized out leaving only variables to be kept. For case of normal distribution, the result is also normal distribution. I can show this part with equations and doing integrals, but when I try to check in python I get incorrect results--the peak of resulting distribution is much higher.
Here is example (the code is from Marginalize a surface plot and use kernel density estimation (kde) on it)
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy.stats import multivariate_normal, gaussian_kde
# Choose mean vector and variance-covariance matrix
mu = np.array([0, 0])
sigma = np.array([[2, 0], [0, 3]])
# Create surface plot data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
rv = multivariate_normal(mean=mu, cov=sigma)
Z = np.array([rv.pdf(pair) for pair in zip(X.ravel(), Y.ravel())])
Z = Z.reshape(X.shape)
# Plot it
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
pos = ax.plot_surface(X, Y, Z)
plt.show()
This makes plot of two variable normal distribution. If I take sum of variable x to get marginal distribution
Zmarg_y = Z.sum(axis=0)
plt.plot(x, Zmarg_y)
plt.show()
result is not the same as if I simply drop the variable instead of marginalize out. I tried this also with a 3 variable gaussian distribution where I marginalized 1 variable to get a 2 variable distribution. The result was also on a higher scale. Is there a problem with normalization here? I am studying probability for a first time and am trying to understand every single detail and I think I am misunderstanding something important about this. Thank you.

How can I fit a gaussian curve in python?

I'm given an array and when I plot it I get a gaussian shape with some noise. I want to fit the gaussian. This is what I already have but when I plot this I do not get a fitted gaussian, instead I just get a straight line. I've tried this many different ways and I just can't figure it out.
random_sample=norm.rvs(h)
parameters = norm.fit(h)
fitted_pdf = norm.pdf(f, loc = parameters[0], scale = parameters[1])
normal_pdf = norm.pdf(f)
plt.plot(f,fitted_pdf,"green")
plt.plot(f, normal_pdf, "red")
plt.plot(f,h)
plt.show()
You can use fit from scipy.stats.norm as follows:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
data = np.random.normal(loc=5.0, scale=2.0, size=1000)
mean,std=norm.fit(data)
norm.fit tries to fit the parameters of a normal distribution based on the data. And indeed in the example above mean is approximately 5 and std is approximately 2.
In order to plot it, you can do:
plt.hist(data, bins=30, density=True)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
y = norm.pdf(x, mean, std)
plt.plot(x, y)
plt.show()
The blue boxes are the histogram of your data, and the green line is the Gaussian with the fitted parameters.
There are many ways to fit a gaussian function to a data set. I often use astropy when fitting data, that's why I wanted to add this as additional answer.
I use some data set that should simulate a gaussian with some noise:
import numpy as np
from astropy import modeling
m = modeling.models.Gaussian1D(amplitude=10, mean=30, stddev=5)
x = np.linspace(0, 100, 2000)
data = m(x)
data = data + np.sqrt(data) * np.random.random(x.size) - 0.5
data -= data.min()
plt.plot(x, data)
Then fitting it is actually quite simple, you specify a model that you want to fit to the data and a fitter:
fitter = modeling.fitting.LevMarLSQFitter()
model = modeling.models.Gaussian1D() # depending on the data you need to give some initial values
fitted_model = fitter(model, x, data)
And plotted:
plt.plot(x, data)
plt.plot(x, fitted_model(x))
However you can also use just Scipy but you have to define the function yourself:
from scipy import optimize
def gaussian(x, amplitude, mean, stddev):
return amplitude * np.exp(-((x - mean) / 4 / stddev)**2)
popt, _ = optimize.curve_fit(gaussian, x, data)
This returns the optimal arguments for the fit and you can plot it like this:
plt.plot(x, data)
plt.plot(x, gaussian(x, *popt))

How to random sample lognormal data in Python using the inverse CDF and specify target percentiles?

I'm trying to generate random samples from a lognormal distribution in Python, the application is for simulating network traffic. I'd like to generate samples such that:
The modal sample result is 320 (~10^2.5)
80% of the samples lie within the range 100 to 1000 (10^2 to 10^3)
My strategy is to use the inverse CDF (or Smirnov transform I believe):
Use the PDF for a normal distribution centred around 2.5 to calculate the PDF for 10^x where x ~ N(2.5,sigma).
Calculate the CDF for the above distribution.
Generate random uniform data along the interval 0 to 1.
Use the inverse CDF to transform the random uniform data into the required range.
The problem is, when I calculate the 10 and 90th percentile at the end, I have completely the wrong numbers.
Here is my code:
%matplotlib inline
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import norm
# find value of mu and sigma so that 80% of data lies within range 2 to 3
mu=2.505
sigma = 1/2.505
norm.ppf(0.1, loc=mu,scale=sigma),norm.ppf(0.9, loc=mu,scale=sigma)
# output: (1.9934025, 3.01659743)
# Generate normal distribution PDF
x = np.arange(16,128000, 16) # linearly spaced here, with extra range so that CDF is correctly scaled
x_log = np.log10(x)
mu=2.505
sigma = 1/2.505
y = norm.pdf(x_log,loc=mu,scale=sigma)
fig, ax = plt.subplots()
ax.plot(x_log, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
x2 = (10**x_log) # x2 should be linearly spaced, so that cumsum works (later)
fig, ax = plt.subplots()
ax.plot(x2, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
ax.set_xlim(0,2000)
# Calculate CDF
y_CDF = np.cumsum(y) / np.cumsum(y).max()
fig, ax = plt.subplots()
ax.plot(x2, y_CDF, 'r-', lw=2, alpha=0.6, label='norm pdf')
ax.set_xlim(0,8000)
# Generate random uniform data
input = np.random.uniform(size=10000)
# Use CDF as lookup table
traffic = x2[np.abs(np.subtract.outer(y_CDF, input)).argmin(0)]
# Discard highs and lows
traffic = traffic[(traffic >= 32) & (traffic <= 8000)]
# Check percentiles
np.percentile(traffic,10),np.percentile(traffic,90)
Which produces the output:
(223.99999999999997, 2480.0000000000009)
... and not the (100, 1000) that I would like to see. Any advice appreciated!
First, I'm not sure about Use the PDF for a normal distribution centred around 2.5. After all, log-normal is about base e logarithm (aka natural log), which means 320 = 102.5 = e5.77.
Second, I would approach problem in a different way. You need m and s to sample from Log-Normal.
If you look at wiki article above, you could see that it is two-parametric distribution. And you have exactly two conditions:
Mode = exp(m - s*s) = 320
80% samples in [100,1000] => CDF(1000,m,s) - CDF(100,m,s) = 0.8
where CDF is expressed via error function (which is pretty much common function found in any library)
So two non-linear equations for two parameters. Solve them, find m and s and put it into any standard log-normal sampling
Severin's approach is much leaner than my original attempt using the Smirnov transform. This is the code that worked for me (using fsolve to find s, although its quite trivial to do it manually):
# Find lognormal distribution, with mode at 320 and 80% of probability mass between 100 and 1000
# Use fsolve to find the roots of the non-linear equation
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import fsolve
from scipy.stats import lognorm
import math
target_modal_value = 320
# Define function to find roots of
def equation(s):
# From Wikipedia: Mode = exp(m - s*s) = 320
m = math.log(target_modal_value) + s**2
# Get probability mass from CDF at 100 and 1000, should equal to 0.8.
# Rearange equation so that =0, to find root (value of s)
return (lognorm.cdf(1000,s=s, scale=math.exp(m)) - lognorm.cdf(100,s=s, scale=math.exp(m)) -0.8)
# Solve non-linear equation to find s
s_initial_guess = 1
s = fsolve(equation, s_initial_guess)
# From s, find m
m = math.log(target_modal_value) + s**2
print('m='+str(m)+', s='+str(s)) #(m,s))
# Plot
x = np.arange(0,2000,1)
y = lognorm.pdf(x,s=s, scale=math.exp(m))
fig, ax = plt.subplots()
ax.plot(x, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
plt.plot((100,100), (0,1), 'k--')
plt.plot((320,320), (0,1), 'k-.')
plt.plot((1000,1000), (0,1), 'k--')
plt.ylim(0,0.0014)
plt.savefig('lognormal_100_320_1000.png')

how does 2d kernel density estimation in python (sklearn) work?

I am sorry for the probably stupid question but I am trying now for hours to estimate a density from a set of 2d data. Let's assume my data is given by the array: sample = np.random.uniform(0,1,size=(50,2)) . I just want to use scipys scikit learn package to estimate the density from the sample array (which is here of course a 2d uniform density) and I am trying the following:
import numpy as np
from sklearn.neighbors.kde import KernelDensity
from matplotlib import pyplot as plt
sp = 0.01
samples = np.random.uniform(0,1,size=(50,2)) # random samples
x = y = np.linspace(0,1,100)
X,Y = np.meshgrid(x,y) # creating grid of data , to evaluate estimated density on
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(samples) # creating density from samples
kde.score_samples(X,Y) # I want to evaluate the estimated density on the X,Y grid
But the last step always yields the error: score_samples() takes 2 positional arguments but 3 were given
So probably .score_samples cannot take a grid as input, but there no tutorials/docs for the 2d case so I don't know how to fix this issue. It would be really great if someone could help.
Looking at the Kernel Density Estimate of Species Distributions example, you have to package the x,y data together (both the training data and the new sample grid).
Below is a function that simplifies the sklearn API.
from sklearn.neighbors import KernelDensity
def kde2D(x, y, bandwidth, xbins=100j, ybins=100j, **kwargs):
"""Build 2D kernel density estimate (KDE)."""
# create grid of sample locations (default: 100x100)
xx, yy = np.mgrid[x.min():x.max():xbins,
y.min():y.max():ybins]
xy_sample = np.vstack([yy.ravel(), xx.ravel()]).T
xy_train = np.vstack([y, x]).T
kde_skl = KernelDensity(bandwidth=bandwidth, **kwargs)
kde_skl.fit(xy_train)
# score_samples() returns the log-likelihood of the samples
z = np.exp(kde_skl.score_samples(xy_sample))
return xx, yy, np.reshape(z, xx.shape)
This gives you the xx, yy, zz needed for something like a scatter or pcolormesh plot. I've copied the example from the scipy page on the gaussian_kde function.
import numpy as np
import matplotlib.pyplot as plt
m1 = np.random.normal(size=1000)
m2 = np.random.normal(scale=0.5, size=1000)
x, y = m1 + m2, m1 - m2
xx, yy, zz = kde2D(x, y, 1.0)
plt.pcolormesh(xx, yy, zz)
plt.scatter(x, y, s=2, facecolor='white')

Categories