How to marginalize out a variable from a multivariate distribution in Python?

I am having some trouble understanding the proper way to marginalize out variables from probability distributions. As I understand it, the proper way is to sum (or integrate) over the variables being marginalized out, leaving only the variables to be kept. For a normal distribution, the result is also a normal distribution. I can show this part with equations and by doing the integrals, but when I try to check it in Python I get incorrect results: the peak of the resulting distribution is much higher.
Here is an example (the code is from "Marginalize a surface plot and use kernel density estimation (kde) on it"):
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy.stats import multivariate_normal, gaussian_kde
# Choose mean vector and variance-covariance matrix
mu = np.array([0, 0])
sigma = np.array([[2, 0], [0, 3]])
# Create surface plot data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
rv = multivariate_normal(mean=mu, cov=sigma)
Z = np.array([rv.pdf(pair) for pair in zip(X.ravel(), Y.ravel())])
Z = Z.reshape(X.shape)
# Plot it
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
pos = ax.plot_surface(X, Y, Z)
This makes a plot of a two-variable normal distribution. If I sum over one of the variables to get the marginal distribution
Zmarg_y = Z.sum(axis=0)
plt.plot(x, Zmarg_y)
the result is not the same as if I simply drop the variable instead of marginalizing it out. I also tried this with a 3-variable Gaussian distribution, where I marginalized out 1 variable to get a 2-variable distribution. The result was also on a larger scale. Is there a problem with normalization here? I am studying probability for the first time and am trying to understand every detail, and I think I am misunderstanding something important about this. Thank you.
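One likely explanation for the scale mismatch: a plain sum over grid values is not an integral until it is multiplied by the grid spacing. The sketch below reuses the question's mean, covariance and grid, and compares a Riemann-sum marginal against the analytic 1D normal. The names `dy`, `marg_x`, `exact_x` and `max_rel_err` are introduced here for illustration only. Note that with `np.meshgrid`'s default indexing, `axis=0` runs over `y`, so summing along it leaves a function of `x`.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([0, 0])
sigma = np.array([[2, 0], [0, 3]])

x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)          # rows index y, columns index x
Z = multivariate_normal(mean=mu, cov=sigma).pdf(np.dstack((X, Y)))

dy = y[1] - y[0]                  # grid spacing along the summed axis
marg_x = Z.sum(axis=0) * dy       # Riemann sum over y: approximates p(x)

# Analytic marginal of x for comparison: N(mu[0], sigma[0, 0])
exact_x = norm.pdf(x, loc=mu[0], scale=np.sqrt(sigma[0, 0]))
max_rel_err = np.max(np.abs(marg_x - exact_x) / exact_x)
```

The small remaining discrepancy comes from truncating `y` to [-5, 5]; without the `dy` factor the curve is too high by a factor of `1 / dy`, which is roughly 10 on this grid.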


plotting with a logscale distribution and 0

I'm trying to plot a probability distribution (say probability of k events). It should be plotted as a logscale on the horizontal axis since the behavior at large values of k looks like k^{-alpha}. So it's a straight line for large k on a logscale plot.
But 0 happens.
I want to plot this in a way that is easy to interpret.
For an example, consider a probability defined so that p_0 = 0.5 and for k= 1, 2, 3, ... we set p_k = Ck^{-2} where if I've calculated correctly C=3/pi^2. This should sum to 1 and produce a nice straight line for k>0, but obviously, I can't stick in 0. Nevertheless it's important that the person looking at the image understand that 0 exists and has significant probability.
I'm using matplotlib (in python), but really I'm interested in how we could visualize this. The implementation can be sorted later.
To put 0 into the plot, you have to apply a symlog scale to the x axis and a log scale to the y axis. I am putting some code here in case you are not familiar with matplotlib, so you can start from the code below. For details, please check the documentation.
import numpy as np
import matplotlib.pyplot as plt
n = 100
x = np.arange(0, n)
y = 3/(np.pi*np.pi)/(x[1:])**2
y = np.concatenate([[0.5], y])
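fig, ax = plt.subplots(1, 1, figsize=(7.2, 7.2))
ax.plot(x, y, 'x')
# symlog is linear near 0, so k = 0 can be drawn; log on y keeps the power-law tail straight
ax.set_xscale('symlog')
ax.set_yscale('log')
ax.set_xlim(-1, n)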

How to random sample lognormal data in Python using the inverse CDF and specify target percentiles?

I'm trying to generate random samples from a lognormal distribution in Python; the application is simulating network traffic. I'd like to generate samples such that:
The modal sample result is 320 (~10^2.5)
80% of the samples lie within the range 100 to 1000 (10^2 to 10^3)
My strategy is to use the inverse CDF (or Smirnov transform I believe):
Use the PDF for a normal distribution centred around 2.5 to calculate the PDF for 10^x where x ~ N(2.5,sigma).
Calculate the CDF for the above distribution.
Generate random uniform data along the interval 0 to 1.
Use the inverse CDF to transform the random uniform data into the required range.
The problem is that when I calculate the 10th and 90th percentiles at the end, I get completely wrong numbers.
Here is my code:
%matplotlib inline
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import norm
# find value of mu and sigma so that 80% of data lies within range 2 to 3
sigma = 1/2.505
norm.ppf(0.1, loc=mu,scale=sigma),norm.ppf(0.9, loc=mu,scale=sigma)
# output: (1.9934025, 3.01659743)
# Generate normal distribution PDF
x = np.arange(16,128000, 16) # linearly spaced here, with extra range so that CDF is correctly scaled
x_log = np.log10(x)
sigma = 1/2.505
y = norm.pdf(x_log,loc=mu,scale=sigma)
fig, ax = plt.subplots()
ax.plot(x_log, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
x2 = (10**x_log) # x2 should be linearly spaced, so that cumsum works (later)
fig, ax = plt.subplots()
ax.plot(x2, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
# Calculate CDF
y_CDF = np.cumsum(y) / np.cumsum(y).max()
fig, ax = plt.subplots()
ax.plot(x2, y_CDF, 'r-', lw=2, alpha=0.6, label='norm pdf')
# Generate random uniform data
input = np.random.uniform(size=10000)
# Use CDF as lookup table
traffic = x2[np.abs(np.subtract.outer(y_CDF, input)).argmin(0)]
# Discard highs and lows
traffic = traffic[(traffic >= 32) & (traffic <= 8000)]
# Check percentiles
np.percentile(traffic, 10), np.percentile(traffic, 90)
Which produces the output:
(223.99999999999997, 2480.0000000000009)
... and not the (100, 1000) that I would like to see. Any advice appreciated!
First, I'm not sure about "Use the PDF for a normal distribution centred around 2.5". After all, the log-normal is defined via the base-e logarithm (the natural log), which means 320 ≈ 10^2.5 ≈ e^5.77.
Second, I would approach the problem in a different way. You need m and s to sample from a log-normal.
If you look at the wiki article above, you can see that it is a two-parameter distribution. And you have exactly two conditions:
Mode = exp(m - s*s) = 320
80% samples in [100,1000] => CDF(1000,m,s) - CDF(100,m,s) = 0.8
where the CDF is expressed via the error function (a common function found in almost any library).
So you have two non-linear equations for two parameters. Solve them to find m and s, and plug them into any standard log-normal sampler.
Severin's approach is much leaner than my original attempt using the Smirnov transform. This is the code that worked for me (using fsolve to find s, although it's quite trivial to do it manually):
# Find lognormal distribution, with mode at 320 and 80% of probability mass between 100 and 1000
# Use fsolve to find the roots of the non-linear equation
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import fsolve
from scipy.stats import lognorm
import math
target_modal_value = 320
# Define function to find roots of
def equation(s):
    # From Wikipedia: Mode = exp(m - s*s) = 320
    m = np.log(target_modal_value) + s**2
    # Get probability mass from CDF at 100 and 1000, should equal 0.8.
    # Rearrange equation so that it equals 0, to find the root (value of s)
    return (lognorm.cdf(1000, s=s, scale=np.exp(m))
            - lognorm.cdf(100, s=s, scale=np.exp(m)) - 0.8)
# Solve non-linear equation to find s
s_initial_guess = 1
s = fsolve(equation, s_initial_guess)[0]  # fsolve returns an array; take the scalar root
# From s, find m
m = math.log(target_modal_value) + s**2
print('m=' + str(m) + ', s=' + str(s))
# Plot
x = np.arange(0,2000,1)
y = lognorm.pdf(x,s=s, scale=math.exp(m))
fig, ax = plt.subplots()
ax.plot(x, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
plt.plot((100,100), (0,1), 'k--')
plt.plot((320,320), (0,1), 'k-.')
plt.plot((1000,1000), (0,1), 'k--')
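The code above finds the parameters and plots the density but stops short of drawing samples. A minimal sketch of that last step follows, assuming NumPy's `Generator.lognormal` as the "standard log-normal sampler" and using `brentq` in place of `fsolve` (a swapped-in root finder, not the one used above). The bracket `[0.1, 2.0]`, the helper name `mass_gap` and the sample size are illustrative choices, not values from the thread.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import lognorm

target_mode = 320.0

def mass_gap(s):
    """Probability mass in [100, 1000] minus the 0.8 target, for shape s."""
    m = np.log(target_mode) + s**2          # from mode = exp(m - s^2)
    dist = lognorm(s=s, scale=np.exp(m))
    return dist.cdf(1000) - dist.cdf(100) - 0.8

# Bracket chosen by inspection (assumption): the gap is positive at 0.1 and negative at 2.0
s = brentq(mass_gap, 0.1, 2.0)
m = np.log(target_mode) + s**2

rng = np.random.default_rng(0)
traffic = rng.lognormal(mean=m, sigma=s, size=100_000)

# Empirical check of the second condition
frac_in_range = np.mean((traffic >= 100) & (traffic <= 1000))
```

With the mode pinned at 320, the 20% of mass outside [100, 1000] need not split evenly between the two tails, so 100 and 1000 are not necessarily the 10th and 90th percentiles of the resulting samples.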

how does 2d kernel density estimation in python (sklearn) work?

I am sorry for the probably stupid question, but I have been trying for hours to estimate a density from a set of 2D data. Let's assume my data is given by the array: sample = np.random.uniform(0,1,size=(50,2)). I just want to use the scikit-learn package to estimate the density from the sample array (which here is of course a 2D uniform density), and I am trying the following:
import numpy as np
from sklearn.neighbors import KernelDensity
from matplotlib import pyplot as plt
sp = 0.01
samples = np.random.uniform(0,1,size=(50,2)) # random samples
x = y = np.linspace(0,1,100)
X,Y = np.meshgrid(x,y) # creating grid of data , to evaluate estimated density on
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(samples) # creating density from samples
kde.score_samples(X,Y) # I want to evaluate the estimated density on the X,Y grid
But the last step always yields the error: score_samples() takes 2 positional arguments but 3 were given
So probably .score_samples cannot take a grid as input, but there are no tutorials/docs for the 2D case, so I don't know how to fix this issue. It would be really great if someone could help.
Looking at the Kernel Density Estimate of Species Distributions example, you have to package the x,y data together (both the training data and the new sample grid).
Below is a function that simplifies the sklearn API.
import numpy as np
from sklearn.neighbors import KernelDensity
def kde2D(x, y, bandwidth, xbins=100j, ybins=100j, **kwargs):
    """Build 2D kernel density estimate (KDE)."""
    # create grid of sample locations (default: 100x100)
    xx, yy = np.mgrid[x.min():x.max():xbins,
                      y.min():y.max():ybins]
    xy_sample = np.vstack([yy.ravel(), xx.ravel()]).T
    xy_train = np.vstack([y, x]).T
    kde_skl = KernelDensity(bandwidth=bandwidth, **kwargs)
    kde_skl.fit(xy_train)
    # score_samples() returns the log-density at each sample point
    z = np.exp(kde_skl.score_samples(xy_sample))
    return xx, yy, np.reshape(z, xx.shape)
This gives you the xx, yy, zz needed for something like a scatter or pcolormesh plot. I've copied the example from the scipy page on the gaussian_kde function.
import numpy as np
import matplotlib.pyplot as plt
m1 = np.random.normal(size=1000)
m2 = np.random.normal(scale=0.5, size=1000)
x, y = m1 + m2, m1 - m2
xx, yy, zz = kde2D(x, y, 1.0)
plt.pcolormesh(xx, yy, zz)
plt.scatter(x, y, s=2, facecolor='white')
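Applied directly to the variables from the question, the same idea might look like the sketch below: `score_samples` takes a single array of shape `(n_points, n_features)`, so the two meshgrid arrays are flattened and stacked into columns first. The names `grid_points`, `log_density` and `Z` are introduced here for illustration, and the seeded generator is only there to make the run repeatable.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
samples = rng.uniform(0, 1, size=(50, 2))      # random samples, one row per point

x = y = np.linspace(0, 1, 100)
X, Y = np.meshgrid(x, y)                       # grid on which to evaluate the density

kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(samples)

# Flatten the grid into an (n_points, 2) array, matching the layout of `samples`
grid_points = np.column_stack([X.ravel(), Y.ravel()])
log_density = kde.score_samples(grid_points)   # log of the estimated density
Z = np.exp(log_density).reshape(X.shape)       # back to the grid shape for plotting
```

`Z` can then be passed to something like `plt.pcolormesh(X, Y, Z)`.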

How can I change the parameters of gaussian_kde for a scatter plot colored by density in matplotlib?

As explained by Joe Kington in his answer to the question "How can I make a scatter plot colored by density in matplotlib", I made a scatter plot colored by density. However, due to the complex distribution of my data, I would like to change the parameters used to calculate the density.
Here are the results with some fake data similar to mine:
I would like to calibrate the density calculations of gaussian_kde so that the left part of the plot looks like this:
I don't like the first plot because the groups of points influence the density of adjacent groups of points, and that prevents me from analyzing the distribution within a group. In other words, even if each of the 8 groups has exactly the same distribution, that won't be visible on the graph.
I tried to modify the covariance_factor (like I once did for a 2d plot of density over x), but when gaussian_kde is used with multiple dimension arrays it returns a numpy.ndarray, not a "scipy.stats.kde.gaussian_kde" object. Plus, I don't even know if changing the covariance_factor will do it.
Here's my dummy code :
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
# Generate fake data
a = np.random.normal(size=1000)
b = np.random.normal(size=1000)
# Data for the first image
x = np.concatenate((a+10,a+10,a+20,a+20,a+30,a+30,a+40,a+40,a+80))
y = np.concatenate((b+10,b-10,b+10,b-10,b+10,b-10,b+10,b-10,b*4))
# Data for the second image
#x = np.concatenate((a+10,a+10,a+20,a+20,a+30,a+30,a+40,a+40))
#y = np.concatenate((b+10,b-10,b+10,b-10,b+10,b-10,b+10,b-10))
# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)
# My unsuccessful attempt to modify the covariance, which would work in 1D with "z = gaussian_kde(x)"
#z.covariance_factor = lambda : 0.01
# Sort the points by density, so that the densest points are plotted last
idx = z.argsort()
x, y, z = x[idx], y[idx], z[idx]
fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=50, edgecolor='none')
The solution could use another density calculator, I don't mind.
The goal is to make a density plot like the ones shown above, where I can play with the density parameters.
I'm using python 3.4.3
Did you have a look at Seaborn? It's not exactly what you're asking for, but it already has functions for generating density plots:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import kendalltau
import seaborn as sns
# Generate fake data
a = np.random.normal(size=1000)
b = np.random.normal(size=1000)
# Data for the first image
x = np.concatenate((a+10, a+10, a+20, a+20, a+30, a+30, a+40, a+40, a+80))
y = np.concatenate((b+10, b-10, b+10, b-10, b+10, b-10, b+10, b-10, b*4))
# Recent seaborn versions take x and y as keywords and no longer accept stat_func
sns.jointplot(x=x, y=y, kind="hex")
sns.jointplot(x=x, y=y, kind="kde")
It gives:
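Staying with `gaussian_kde`, the bandwidth can be adjusted if the estimator object is kept instead of being called immediately: `gaussian_kde(xy)(xy)` builds the estimator and evaluates it in one expression, which is why only an array comes back. The sketch below uses the `bw_method` argument and the `set_bandwidth` method; the value `0.05` and the reduced four-cluster data set are arbitrary choices for illustration, not tuned for the data in the question.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
x = np.concatenate((a + 10, a + 10, a + 20, a + 20))
y = np.concatenate((b + 10, b - 10, b + 10, b - 10))
xy = np.vstack([x, y])

kde_default = gaussian_kde(xy)                 # default bandwidth (Scott's rule)
kde_narrow = gaussian_kde(xy, bw_method=0.05)  # 0.05 is an arbitrary illustrative factor

z_default = kde_default(xy)                    # density at each point, wide kernels
z_narrow = kde_narrow(xy)                      # density at each point, narrow kernels

# The factor can also be changed on an existing estimator
kde_default.set_bandwidth(bw_method=0.05)
z_reset = kde_default(xy)
```

A smaller factor makes each kernel narrower, so neighbouring groups influence each other less; `z_narrow` can be passed as `c=` to `ax.scatter` in place of `z`.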

Matplotlib: Coloring scatter plot by density relative to another data set

I'm new to Python and having some trouble with matplotlib. I currently have data contained in two numpy arrays, call them x and y, that I am plotting on a scatter plot with coordinates (x, y) for each point (i.e. I have points x[0], y[0] and x[1], y[1] and so on on my plot). I have been using the following code segment to color the points in my scatter plot based on the spatial density of nearby points (found this on another Stack Overflow post):
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
x = np.random.normal(size=1000)
y = x*3 + np.random.normal(size=1000)
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)
idx = z.argsort()
x, y, z = x[idx], y[idx], z[idx]
fig,ax = plt.subplots()
ax.scatter(x, y, c=z, s=50, edgecolor='none')
I've been using it without being sure exactly how it works (namely the point density calculation - if someone could explain how exactly that works, would also be much appreciated).
However, now I'd like to color code by the ratio of the spatial density of points in x,y to that of the spatial density of points in another set of numpy arrays, call them x2, y2. That is, I would like to make a plot such that I can identify how the density of points in x,y compares to the points in x2,y2 on the same scatter plot. Could someone please explain how I could go about doing this?
Thanks in advance for your help!
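On the side question of how the point density is calculated: `gaussian_kde` places a Gaussian bump on every data point and averages them, so points with many close neighbours get a high value. The sketch below reproduces that by hand and checks it against SciPy's result. It assumes the default unweighted estimator; `kernel`, `diffs` and `z_manual` are names made up for this illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde, multivariate_normal

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = x * 3 + rng.normal(size=300)
xy = np.vstack([x, y])                 # shape (2, n): one column per point

kde = gaussian_kde(xy)
z = kde(xy)                            # density estimate at each data point

# Manual version: average of Gaussian bumps centred on every data point.
# kde.covariance is the data covariance scaled by the squared bandwidth factor.
kernel = multivariate_normal(mean=np.zeros(2), cov=kde.covariance)
diffs = xy.T[:, None, :] - xy.T[None, :, :]   # (n, n, 2) pairwise differences
z_manual = kernel.pdf(diffs).mean(axis=1)     # one averaged value per point
```

The pairwise array grows as n squared, so this is only meant to show the mechanics on a small sample.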
I've been trying to do the same thing based on that same earlier post, and I think I just figured it out! The trick is to use matplotlib.colors.Normalize() to define a scale and then weight it according to some data set (xnorm,ynorm):
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mplc
import matplotlib.cm as cm
from scipy.stats import gaussian_kde
def kdeplot(x,y,xnorm,ynorm):
    xy = np.vstack([x,y])
    z = gaussian_kde(xy)(xy)
    wt = 1.0*len(x)/(len(xnorm)*1.0)
    norm = mplc.Normalize(vmin=0, vmax=8/wt)
    cmap = cm.gnuplot
    idx = z.argsort()
    x, y, z = x[idx], y[idx], z[idx]
    args = (x,y)
    kwargs = {'c':z,'s':10,'edgecolor':'none','cmap':cmap,'norm':norm}
    return args, kwargs
# (x1,y1) is some data set whose density map coloring you
# want to scale to (xnorm,ynorm)
args,kwargs = kdeplot(x1,y1,xnorm,ynorm)
I used trial and error to optimize my normalization for my particular data and choice of colormap. Here's what my data looks like scaled to itself; here's my data scaled to some comparison data (which is on the bottom of that image).
I'm not sure this method is entirely general, but it works in my case: I know that my data and the comparison data are in similar regions of parameter space, and they both have gaussian scatter, so I can use a naive linear scaling determined by the number of data points and it results in something that gives the right idea visually.
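A more literal reading of the question, as a hedged alternative to the colour-scale trick: fit one `gaussian_kde` per data set, evaluate both at the (x, y) points, and colour by the ratio. This is a sketch, not the method used above; the comparison set `x2, y2` is synthetic here, and `kde_own`, `kde_ref`, `ratio` and `log_ratio` are illustrative names.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = x * 3 + rng.normal(size=500)
x2 = rng.normal(loc=0.5, size=800)          # hypothetical comparison set
y2 = x2 * 3 + rng.normal(size=800)

xy = np.vstack([x, y])
kde_own = gaussian_kde(xy)
kde_ref = gaussian_kde(np.vstack([x2, y2]))

# Evaluate both estimates at the same (x, y) points. Each KDE integrates to 1,
# so the ratio compares shapes and does not depend on the two sample sizes.
ratio = kde_own(xy) / kde_ref(xy)
log_ratio = np.log10(ratio)                 # 0 means equal density; sign shows which set dominates
```

`log_ratio` can be passed as `c=` to `ax.scatter(x, y, ...)` with a diverging colormap. The ratio becomes unstable wherever the comparison set has almost no points, so this probably only makes sense when the two sets overlap, as they do in the answer above.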
