I have a number of samples of a variable, and I would like to use these samples to plot the probability distribution of the variable. I'm using kernel density estimation with a Gaussian kernel, via the sklearn library. Here is the sample code I have implemented:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
# -- data
init_range = 0.0793
X = np.random.uniform(low=-init_range, high=init_range, size=133280)[:, np.newaxis]
# -- kernel density estimation
kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(X)
X_plot = np.linspace(min(X).item(), max(X).item(), 1000)[:, np.newaxis]
log_dens = kde.score_samples(X_plot)
# -- plot density
plt.plot( X_plot[:, 0], np.exp(log_dens), lw=2, linestyle="-")
plt.ylim([0, 2.1])
plt.show()
Below is the resulting output:
As you can see, the value on the y axis is above one. Hence, the y axis is NOT showing the probability distribution. I further plotted the histogram for this data:
# -- plot hist
n_bins = 40
weights = np.ones_like(X) / float(len(X))
prob, bins, _ = plt.hist(X, n_bins, density=False, histtype='step', color='red', weights=weights)
plt.show()
and the result is below:
which makes sense as the bins sum up to one: 0.025*40=1
I'm having a hard time understanding why my kde plot is not a distribution. How can I fix this? Is there a normalization step that I'm missing?
First, if you extend the limits of your X_plot axis (e.g. X_plot = np.linspace(-1, 1, ...)), you'll see that your KDE estimates a rather tall Gaussian, and the area under the curve is still 1.
Density values over 1 are perfectly legal, since the assumed distribution is continuous: there are no probabilities for exact points, and you should not treat your y values as such; the estimated probability of an interval is the corresponding area under the curve.
Sample code to verify the estimated probability of hitting the 0-0.004 range (roughly the same width as your histogram bins):
import scipy.integrate as integrate
interval = np.linspace(0, 0.004, 1000)[:, np.newaxis]
log_dens = kde.score_samples(interval)
print(integrate.trapz(np.exp(log_dens), interval[:,0]))
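To double-check the overall normalization with the same kde object, you can integrate the density over a range that comfortably covers the whole bump (the limits below are arbitrary); the result should come out close to 1:
wide = np.linspace(-1, 1, 10000)[:, np.newaxis]
print(integrate.trapz(np.exp(kde.score_samples(wide)), wide[:, 0]))  # should print roughly 1.0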
Second, once you check the area under the curve you'll see that your current hyperparameters aren't yielding a very accurate estimate; reducing the bandwidth or choosing a different kernel might help.
You can also apply a grid search to find the best kernel and bandwidth, though this will take a good amount of time unless you reduce your sample size; also note that choosing too narrow a bandwidth may result in undersmoothing.
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(KernelDensity(), {'kernel':['gaussian', 'tophat'],'bandwidth': np.logspace(-2, 0, 10)}, cv=5, n_jobs=-1)
grid.fit(X)
print(f"best hyperparameters: {grid.best_params_}")
kde = grid.best_estimator_
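Once the search finishes, you can redraw the density with the selected estimator, reusing the X_plot grid defined earlier:
log_dens = kde.score_samples(X_plot)
plt.plot(X_plot[:, 0], np.exp(log_dens), lw=2, linestyle="-")
plt.show()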
I want to make a histogram from 30 CSV files and then fit a Gaussian function to it, to see how well it describes my data. After that, I need to find the mean and standard deviation of the peaks. The files are quite large, and I am not sure whether I am extracting the individual columns and binning their value ranges correctly.
I know this is a bit long with too many questions; please answer as many as you want. Thank you very much!
> these are the links to the data
Below is what I have done so far (actually not much, as I am a beginner in data visualization).
First, I import the packages; savgol_filter is used to smooth the histogram, which seems to give a better result.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.signal import savgol_filter
Then I define a unit conversion and set the slice limits.
def cm2inch(value):
    return value/2.54
width = 9
height = 6.75
sliceMin, sliceMax = 300, 1002
Next, I load all the data in a Jupyter notebook by iterating 30 times, setting up two lists, "times" and "voltages", to store the values.
times, voltages = [], []
for i in range(30):
    time, ch1 = np.loadtxt(f"{i+1}.txt", delimiter=',', skiprows=5, unpack=True)
    times.append(time)
    voltages.append(ch1)
t = (np.array(times[0]) * 1e5)[sliceMin:sliceMax]
voltages = (np.array(voltages))[:, sliceMin:sliceMax]
1. I think I need a histogram function to plot the graph. Although I have a plot, I am not sure whether this is the proper way to generate the histogram.
hist, bin_edges = np.histogram(voltages, bins=500, density=True)
hist = savgol_filter(hist, 51, 3)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2
That is as far as I have got. The amplitude of the 3rd peak is too low, which is not what I expected, but please correct me if my expectation is wrong.
This is my histogram plot
I have updated my plot with the following code
labels = "hist"
if showGraph:
plt.title("Datapoints Distribution over Voltage [mV]", )
plt.xlabel("Voltage [mV]")
plt.ylabel("Data Points")
plt.plot(hist, label=labels)
plt.show()
2. (edited) I am not sure why my label is not displayed; could you please correct me?
3. (edited) Besides, I want to fit a Gaussian function to the histogram. But there are three peaks, so how should I fit the function to them?
def gauss(x, *p):
    A, mu, sigma = p
    return A*np.exp(-(x-mu)**2/(2.*sigma**2))
4. (edited) I realised that I have not mentioned the mean value yet.
I suppose that if I can locate the maximum of a peak, then I can find the mean value of that specific peak. Do I need to fit the Gaussian first to find the peak, or can I find it directly? Should I just look for the local maxima, and if so, how do I proceed?
5. (edited) I know how to find the standard deviation of a single list; if I want to apply similar logic to my 30 datasets, how should I implement it?
sample = [1,2,3,4,5,5,5,5,10]
standard_deviation = np.std(sample, ddof=1)
print(standard_deviation)
Feedback on the suggestions:
I tried to implement the Gaussian fit; below are the packages I import.
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
Here is the Gaussian fit: I pass my 30 voltage datasets as the parameter of the Gaussian Mixture fit, which prints out lots of values for mu and the variance.
gmm = GaussianMixture(n_components=1)
gmm.fit(voltages)
print(gmm.means_, gmm.covariances_)
mu = gmm.means_[0][0]
variance = gmm.covariances_[0][0][0]
print(mu, variance)
I run the code line by line. There is an error on the second line:
fig, ax = plt.subplots(figsize=(6,6))
Xs = np.arange(min(voltages), max(voltages), 0.05)
The truth value of an array with more than one element is ambiguous.
Use a.any() or a.all()
From searching the web, I understand this error means the expression yields more than one boolean value (e.g. [T, T, F, F, T]), so its truth value is ambiguous.
I edited my code to:
Xs = np.arange(min(np.all(voltages)), max(np.all(voltages)), 0.05)
which gives me this:
'numpy.bool_' object is not iterable
I understand that np.all returns a single boolean, which is not iterable. At this stage, I do not know how to proceed with the Gaussian curve fit. Can anyone provide me with an alternative way to do it?
To plot a histogram, the most vanilla matplotlib function, hist, is my go-to. Basically, if I have a list of samples, then I can plot a histogram of them with 100 bins via:
import matplotlib.pyplot as plt
plt.hist(samples, bins=100)
plt.show()
If you'd like to fit normal distribution(s) to your data, the best model for that is a Gaussian Mixture Model, which you can find more info about on scikit-learn's GMM page. That said, this is the code I use to fit a single Gaussian distribution to a dataset. If I wanted to fit k normal distributions, I'd use n_components=k. I've also included the resulting plot:
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
data = np.random.uniform(-1,1, size=(800,1))
data += np.random.uniform(-1,1, size=(800,1))
gmm = GaussianMixture(n_components=1)
gmm.fit(data)
print(gmm.means_, gmm.covariances_)
mu = gmm.means_[0][0]
variance = gmm.covariances_[0][0][0]
print(mu, variance)
fig, ax = plt.subplots(figsize=(6,6))
Xs = np.arange(data.min(), data.max(), 0.05)
ys = 1.0/np.sqrt(2*np.pi*variance) * np.exp(-0.5/variance * (Xs - mu)**2)
ax.hist(data, bins=100, label='data')
px = ax.twinx()
px.plot(Xs, ys, c='r', linestyle='dotted', label='fit')
ax.legend()
px.legend(loc='upper left')
plt.show()
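Since your histogram shows three peaks, the same idea extends naturally; here is a minimal sketch with n_components=3, assuming your voltages array from the question is first pooled into a single column of samples (GaussianMixture expects one sample per row):
samples = voltages.reshape(-1, 1)               # pool all 30 traces into one column of samples
gmm3 = GaussianMixture(n_components=3)
gmm3.fit(samples)
peak_means = gmm3.means_.ravel()                # one mean per peak
peak_stds = np.sqrt(gmm3.covariances_.ravel())  # one standard deviation per peak
print(peak_means, peak_stds)
The printed means and standard deviations are exactly the per-peak quantities you ask about in questions 4 and 5.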
As for the standard deviation question, I'm not sure which axis you'd like to take the standard deviation over. np.std(data, axis=0) gives the standard deviation of each column (computed across rows), and np.std(data, axis=1) gives the standard deviation of each row.
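For example, on the voltages array built in your loading loop (shape assumed to be (30, number_of_samples)):
per_sample_std = np.std(voltages, axis=0)   # std across the 30 traces at each time index
per_trace_std = np.std(voltages, axis=1)    # one std value per trace
print(per_sample_std.shape, per_trace_std.shape)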
I have a data set which contains values from 0 to 1e-5. I guess the data can be described by a lognormal distribution, so I use scipy.stats.lognorm to fit my data and want to plot the original data and the fitted distribution on the same figure using matplotlib.
First, I plot the sample as a histogram:
Then I add the fitted distribution as a line plot. However, this changes the Y-axis to very large numbers:
So the original data (sample) cannot be seen on the figure!
I've checked all the variables and found that pdf_fitted is very large (> 1e7). I really don't understand why a simple scistats.lognorm.fit to a sample generated from the same scistats.lognorm distribution doesn't work. Here is the code demonstrating my problem:
from matplotlib import pyplot as plt
from scipy import stats as scistats
import numpy as np
# generate a sample for x between 0 and 1e-5
x = np.linspace(0, 1e-5, num=1000)
y = scistats.lognorm.pdf(x, 3, loc=0, scale=np.exp(10))
h = plt.hist(y, bins=40) # plot the sample by histogram
# plt.show()
# fit the sample by using Log Normal distribution
param = scistats.lognorm.fit(y)
print("Log-normal distribution parameters : ", param)
pdf_fitted = scistats.lognorm.pdf(
x, *param[:-2], loc=param[-2], scale=param[-1])
plt.plot(x, pdf_fitted, label="Fitted Lognormal distribution")
plt.ticklabel_format(style='sci', scilimits=(-3, 4), axis='x')
plt.legend()
plt.show()
The problem
The immediate problem that you're having is that your fit is really, really bad. You can see this if you set the x and y scale on the plot to log, like with plt.xscale('log') and plt.yscale('log'). This lets you see both your histogram and your fitted data on a single plot:
so it's off by many orders of magnitude in both directions.
The fix
Your whole approach to generating a sample from the probability distribution represented by stats.lognorm and then fitting it was wrong. Here's a correct way to do it, using the same lognorm parametrization as in your question (note that the shape parameter in the code below is 0.1 rather than 3):
from matplotlib import pyplot as plt
from scipy import stats as scistats
import numpy as np
plt.figure(figsize=(12,7))
realparam = [.1, 0, np.exp(10)]
# generate pdf data around the mean value
m = realparam[2]
x = np.linspace(m*.6, m*1.4, num=10000)
y = scistats.lognorm.pdf(x, *realparam)
# generate a matching random sample
sample = scistats.lognorm.rvs(*realparam, size=100000)
# plot the sample by histogram
h = plt.hist(sample, bins=100, density=True)
# fit the sample by using Log Normal distribution
param = scistats.lognorm.fit(sample)
print("Log-normal distribution parameters : ", param)
pdf_fitted = scistats.lognorm.pdf(x, *param)
plt.plot(x, pdf_fitted, lw=5, label="Fitted Lognormal distribution")
plt.legend()
plt.show()
Output:
Log-normal distribution parameters : (0.09916091013245995, -215.9562383088556, 22245.970148671593)
I'm trying to generate random samples from a lognormal distribution in Python, the application is for simulating network traffic. I'd like to generate samples such that:
The modal sample result is 320 (~10^2.5)
80% of the samples lie within the range 100 to 1000 (10^2 to 10^3)
My strategy is to use the inverse CDF (or Smirnov transform I believe):
Use the PDF for a normal distribution centred around 2.5 to calculate the PDF for 10^x where x ~ N(2.5,sigma).
Calculate the CDF for the above distribution.
Generate random uniform data along the interval 0 to 1.
Use the inverse CDF to transform the random uniform data into the required range.
The problem is, when I calculate the 10th and 90th percentiles at the end, I get completely the wrong numbers.
Here is my code:
%matplotlib inline
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import norm
# find value of mu and sigma so that 80% of data lies within range 2 to 3
mu=2.505
sigma = 1/2.505
norm.ppf(0.1, loc=mu,scale=sigma),norm.ppf(0.9, loc=mu,scale=sigma)
# output: (1.9934025, 3.01659743)
# Generate normal distribution PDF
x = np.arange(16,128000, 16) # linearly spaced here, with extra range so that CDF is correctly scaled
x_log = np.log10(x)
mu=2.505
sigma = 1/2.505
y = norm.pdf(x_log,loc=mu,scale=sigma)
fig, ax = plt.subplots()
ax.plot(x_log, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
x2 = (10**x_log) # x2 should be linearly spaced, so that cumsum works (later)
fig, ax = plt.subplots()
ax.plot(x2, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
ax.set_xlim(0,2000)
# Calculate CDF
y_CDF = np.cumsum(y) / np.cumsum(y).max()
fig, ax = plt.subplots()
ax.plot(x2, y_CDF, 'r-', lw=2, alpha=0.6, label='norm pdf')
ax.set_xlim(0,8000)
# Generate random uniform data
input = np.random.uniform(size=10000)
# Use CDF as lookup table
traffic = x2[np.abs(np.subtract.outer(y_CDF, input)).argmin(0)]
# Discard highs and lows
traffic = traffic[(traffic >= 32) & (traffic <= 8000)]
# Check percentiles
np.percentile(traffic,10),np.percentile(traffic,90)
Which produces the output:
(223.99999999999997, 2480.0000000000009)
... and not the (100, 1000) that I would like to see. Any advice appreciated!
First, I'm not sure about "Use the PDF for a normal distribution centred around 2.5". After all, the log-normal distribution is defined in terms of the base-e logarithm (a.k.a. the natural log), which means 320 = 10^2.5 = e^5.77.
Second, I would approach the problem in a different way. You need m and s to sample from the log-normal distribution.
If you look at the Wikipedia article on the log-normal distribution, you can see that it is a two-parameter distribution, and you have exactly two conditions:
Mode = exp(m - s*s) = 320
80% samples in [100,1000] => CDF(1000,m,s) - CDF(100,m,s) = 0.8
where the CDF is expressed via the error function (a common function found in pretty much any library).
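For reference, the log-normal CDF in terms of the error function is CDF(x; m, s) = (1/2) * (1 + erf((ln(x) - m) / (s * sqrt(2)))).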
So we have two non-linear equations for two parameters. Solve them, find m and s, and plug them into any standard log-normal sampler.
Severin's approach is much leaner than my original attempt using the Smirnov transform. This is the code that worked for me (using fsolve to find s, although it's quite trivial to do it manually):
# Find lognormal distribution, with mode at 320 and 80% of probability mass between 100 and 1000
# Use fsolve to find the roots of the non-linear equation
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import fsolve
from scipy.stats import lognorm
import math
target_modal_value = 320
# Define function to find roots of
def equation(s):
    # From Wikipedia: Mode = exp(m - s*s) = 320
    m = math.log(target_modal_value) + s**2
    # Get the probability mass from the CDF at 100 and 1000; it should equal 0.8.
    # Rearrange the equation so that it equals 0, to find the root (the value of s)
    return (lognorm.cdf(1000, s=s, scale=math.exp(m)) - lognorm.cdf(100, s=s, scale=math.exp(m)) - 0.8)
# Solve non-linear equation to find s
s_initial_guess = 1
s = fsolve(equation, s_initial_guess)
# From s, find m
m = math.log(target_modal_value) + s**2
print('m=' + str(m) + ', s=' + str(s))
# Plot
x = np.arange(0,2000,1)
y = lognorm.pdf(x,s=s, scale=math.exp(m))
fig, ax = plt.subplots()
ax.plot(x, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
plt.plot((100,100), (0,1), 'k--')
plt.plot((320,320), (0,1), 'k-.')
plt.plot((1000,1000), (0,1), 'k--')
plt.ylim(0,0.0014)
plt.savefig('lognormal_100_320_1000.png')
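To sanity-check the solution, you can draw samples with the fitted parameters and confirm that roughly 80% of them fall in [100, 1000] and that the mode sits near 320 (a quick sketch; note that fsolve returns s as a 1-element array):
s_val = float(s[0])
m_val = math.log(target_modal_value) + s_val**2
samples = lognorm.rvs(s=s_val, scale=math.exp(m_val), size=100000)
print(np.mean((samples >= 100) & (samples <= 1000)))   # should be close to 0.8
counts, edges = np.histogram(samples, bins=200, range=(0, 2000))
print(edges[np.argmax(counts)])                        # rough location of the mode, near 320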
When I use seaborn's confidence intervals in pointplot, I get deceptively small values compared with the standard deviation. Example:
import seaborn as sns
import matplotlib.pylab as plt
import pandas
import numpy as np
x = np.random.rand(100)
y = np.random.rand(100)
df = pandas.DataFrame({"x": x,
"y": y})
data = pandas.melt(df)
print "data: ", data
plt.figure()
plt.subplot(2, 1, 1)
sns.pointplot(x="variable", y="value", data=data)
plt.ylim([0, 0.9])
ax = plt.subplot(2, 1, 2)
m = [df["x"].mean(), df["y"].mean()]
e = [df["x"].std(), df["y"].std()]
plt.errorbar(range(1,3), m, yerr=e)
plt.ylim([0, 0.9])
plt.xlim([0, 4])
plt.xticks([1, 2])
ax.set_xticklabels(["x", "y"])
The standard deviations are significantly larger. What is the explanation for this? Can seaborn plot error bars that are closer to a simple metric like the standard deviation?
In the bottom plot, the standard deviations for x and y are shown, and they are much bigger than seaborn's confidence intervals for x and y (in the top plot).
Making my previous answer below more precise: since the standard deviation of a uniform random variable is 1/sqrt(12) ≈ 0.2887, the bars in your second plot cover roughly the interval [0.5-0.2887, 0.5+0.2887] = [0.2113, 0.7887].
On the other hand, by the central limit theorem, the 95%-confidence interval of the empirical mean of 100 uniform random variables will be roughly [0.5-1.96*0.2887/sqrt(100),0.5+1.96*0.2887/sqrt(100)]~=[0.443,0.557]. This corresponds to the confidence interval drawn by seaborn in your first plot.
To summarize, for computations of statistical confidence intervals, the sample size plays a critical role and cannot be neglected!
Previous shorter answer
Seaborn's confidence intervals take into account the number of samples used to estimate the mean. Given that you handed seaborn a decent number of 100 sample points, the 95% confidence interval for the empirical mean of those 100 points will indeed be pretty small.
In order to achieve a fair comparison, you should scale your standard deviations by 1/sqrt(100) (turning them into standard errors of the mean) and then compare the plots.
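A minimal sketch of that fair comparison, reusing the df from the question and approximating seaborn's 95% interval as 1.96 standard errors:
n = len(df)
m = [df["x"].mean(), df["y"].mean()]
ci = [1.96 * df["x"].std() / np.sqrt(n), 1.96 * df["y"].std() / np.sqrt(n)]
plt.figure()
plt.errorbar(range(1, 3), m, yerr=ci)
plt.ylim([0, 0.9])
plt.xlim([0, 4])
plt.show()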
I am new to machine learning with Python. I've managed to draw a straight decision boundary for logistic regression using matplotlib. However, I am having some difficulty plotting a curved boundary, to illustrate the case of overfitting, on a sample dataset.
I am trying to build a logistic regression model with regularization and to use that regularization to control overfitting on my data set.
I am aware of the sklearn library; however, I prefer writing the code myself.
The test data sample I am working on is given below:
x=np.matrix('2,300;4,600;7,300;5,500;5,400;6,400;3,400;4,500;1,200;3,400;7,700;3,550;2.5,650')
y=np.matrix('0;1;1;1;0;1;0;0;0;0;1;1;0')
The decision boundary I am expecting is given in the graph below:
Any help would be appreciated.
I could plot a straight decision boundary using the code below:
# plot of x 2D
plt.figure()
pos=np.where(y==1)
neg=np.where(y==0)
plt.plot(X[pos[0],0], X[pos[0],1], 'ro')
plt.plot(X[neg[0],0], X[neg[0],1], 'bo')
plt.xlim([min(X[:,0]),max(X[:,0])])
plt.ylim([min(X[:,1]),max(X[:,1])])
plt.show()
# plot of the decision boundary
plt.figure()
pos=np.where(y==1)
neg=np.where(y==0)
plt.plot(x[pos[0],1], x[pos[0],2], 'ro')
plt.plot(x[neg[0],1], x[neg[0],2], 'bo')
plt.xlim([x[:, 1].min()-2 , x[:, 1].max()+2])
plt.ylim([x[:, 2].min()-2 , x[:, 2].max()+2])
plot_x = [min(x[:,1])-2, max(x[:,1])+2] # take a larger range for the decision line
plot_y = (-1/theta_NM[2])*(theta_NM[1]*plot_x +theta_NM[0])
plt.plot(plot_x, plot_y)
And my decision boundary looks like this:
In an ideal scenario the above decision boundary is good, but I would like to plot a curved decision boundary that fits my training data very well (and would overfit the test data), something similar to what is shown in the first plot.
This can be done by gridding the feature space, setting each grid point to the value of the closest data point, and then running a contour plot on this grid.
There are numerous variations, such as setting each grid point to a distance-weighted average of the data values, or smoothing the final grid before contouring (a short sketch of that follows the example below), etc.
Here's an example for finding the initial contour:
import numpy as np
import matplotlib.pyplot as plt
# get the data as numpy arrays
xys = np.array(np.matrix('2,300;4,600;7,300;5,500;5,400;6,400;3,400;4,500;1,200;3,400;7,700;3,550;2.5,650'))
vals = np.array(np.matrix('0;1;1;1;0;1;0;0;0;0;1;1;0'))[:,0]
N = len(vals)
# some basic spatial stuff
xs = np.linspace(min(xys[:,0])-2, max(xys[:,0])+1, 10)
ys = np.linspace(min(xys[:,1])-100, max(xys[:,1])+100, 10)
xr = max(xys[:,0]) - min(xys[:,0]) # ranges so distances can weight x and y equally
yr = max(xys[:,1]) - min(xys[:,1])
X, Y = np.meshgrid(xs, ys) # meshgrid for contour and distance calcs
# set each gridpoint to the value of the closest data point:
Z = np.zeros((len(xs), len(ys), N))
for n in range(N):
    Z[:,:,n] = ((X-xys[n,0])/xr)**2 + ((Y-xys[n,1])/yr)**2 # stack arrays of distances to each point
z = np.argmin(Z, axis=2) # which data point is the closest to each grid point
v = vals[z] # set the grid value to the data point value
# do the contour plot (use only the level 0.5 since values are 0 and 1)
plt.contour(X, Y, v, cmap=plt.cm.gray, levels=[.5]) # contour the data point values
# now plot the data points
pos=np.where(vals==1)
neg=np.where(vals==0)
plt.plot(xys[pos,0], xys[pos,1], 'ro')
plt.plot(xys[neg,0], xys[neg,1], 'bo')
plt.show()
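As one of the variations mentioned above, here is a minimal sketch of smoothing the nearest-point grid before contouring, using scipy.ndimage.gaussian_filter (the sigma value is an arbitrary choice):
from scipy.ndimage import gaussian_filter
v_smooth = gaussian_filter(v.astype(float), sigma=1.5)  # smooth the 0/1 nearest-point grid
plt.contour(X, Y, v_smooth, cmap=plt.cm.gray, levels=[.5])
plt.plot(xys[pos,0], xys[pos,1], 'ro')
plt.plot(xys[neg,0], xys[neg,1], 'bo')
plt.show()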