sklearn: KDE not working for small values - python

I am struggling to use scikit-learn's implementation of KDE for small input ranges. The following code works, but when I increase the divisor variable to 100, the KDE breaks down:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.neighbors import KernelDensity
# make data:
np.random.seed(0)
divisor = 1
gaussian1 = (3 * np.random.randn(1700))/divisor
gaussian2 = (9 + 1.5 * np.random.randn(300)) / divisor
gaussian_mixture = np.hstack([gaussian1, gaussian2])
# illustrate proper KDE with seaborn:
sns.distplot(gaussian_mixture);
# now implement in sklearn:
x_grid = np.linspace(min(gaussian1), max(gaussian2), 200)
kde_skl = KernelDensity(bandwidth=0.5)
kde_skl.fit(gaussian_mixture[:, np.newaxis])
# score_samples() returns the log-likelihood of the samples
log_pdf = kde_skl.score_samples(x_grid[:, np.newaxis])
pdf = np.exp(log_pdf)
fig, ax = plt.subplots(1, 1, sharey=True, figsize=(7, 4))
ax.plot(x_grid, pdf, linewidth=3, alpha=0.5)
This works fine. However, after changing the 'divisor' variable to 100, scipy and seaborn can still handle the smaller data values, but sklearn's KDE, as I have implemented it, cannot:
What am I doing wrong and how can I rectify this? I need sklearn's implementation of KDE, so I cannot use another algorithm.

Kernel density estimation is called a nonparametric method, but it does have a parameter: the bandwidth.
Every application of KDE needs this parameter to be set!
When you do the seaborn-plot:
sns.distplot(gaussian_mixture);
you do not give any bandwidth, so seaborn uses a default heuristic (Scott or Silverman). These heuristics choose the bandwidth from the data, so it scales with the data.
Your sklearn code looks like:
kde_skl = KernelDensity(bandwidth=0.5)
Here the bandwidth is fixed at a constant value! With divisor = 100 the spread of the data shrinks by a factor of 100 while the bandwidth stays at 0.5, so the kernel is much wider than the data itself. This might be the reason for your trouble; it is at least something to look at. In general, one would combine sklearn's KDE with GridSearchCV as a cross-validation tool to select a good bandwidth. In many cases this is slower, but better than the heuristics above.
Sadly, you did not explain why you want to use sklearn's KDE. My personal ranking of the three popular candidates is statsmodels > sklearn > scipy.
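As a rough sketch of the GridSearchCV idea, the snippet below searches over bandwidths expressed relative to the spread of the data rather than hard-coding 0.5. The search range (1% to 100% of the standard deviation), the number of candidates and cv=5 are assumptions chosen for illustration, not recommended defaults.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Same mixture as in the question, scaled down by divisor = 100
rng = np.random.default_rng(0)
divisor = 100
gaussian1 = 3 * rng.standard_normal(1700) / divisor
gaussian2 = (9 + 1.5 * rng.standard_normal(300)) / divisor
gaussian_mixture = np.hstack([gaussian1, gaussian2])

# Candidate bandwidths scale with the data (assumed range: 1%..100% of the std)
spread = gaussian_mixture.std()
candidates = spread * np.logspace(-2, 0, 20)

# KernelDensity.score() is the total log-likelihood, which GridSearchCV maximizes
grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": candidates}, cv=5)
grid.fit(gaussian_mixture[:, np.newaxis])
best_bw = grid.best_params_["bandwidth"]

# Evaluate the density with the selected bandwidth
x_grid = np.linspace(gaussian_mixture.min(), gaussian_mixture.max(), 200)
pdf = np.exp(grid.best_estimator_.score_samples(x_grid[:, np.newaxis]))
print(best_bw)
```

The selected bandwidth can never exceed the largest candidate, so it stays well below the fixed value of 0.5 used in the question; pdf can then be plotted against x_grid as before.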

Related

How to make a histogram from 30 csv files, fit it with a Gaussian function, and find the standard deviation?

I want to make a histogram from 30 csv files and then fit a Gaussian function to see if my data is optimal. After that, I need to find the mean and standard deviation of those peaks. The data files are very large, and I do not know whether I am extracting the individual columns and organizing their value ranges into bins correctly.
I know this is a bit long and has many questions; please answer as much as you want. Thank you very much!
> this is the link to the data
Below is what I have done so far (not much, because I am a beginner in data visualization).
First, I import the packages. I use savgol_filter to smooth the bins, which seems to look better.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.signal import savgol_filter
Then I define a unit conversion and set the limits.
def cm2inch(value):
    return value/2.54
width = 9
height = 6.75
sliceMin, sliceMax = 300, 1002
Next, I load all the data in a Jupyter notebook by iterating 30 times, using two lists, "times" and "voltages", to store the values.
times, voltages = [], []
for i in range(30):
    time, ch1 = np.loadtxt(f"{i+1}.txt", delimiter=',', skiprows=5,unpack=True)
    times.append(time)
    voltages.append(ch1)
t = (np.array(times[0]) * 1e5)[sliceMin:sliceMax]
voltages = (np.array(voltages))[:, sliceMin:sliceMax]
1. I think I need a histogram function to plot the graph. I do have a plot, but I am not sure whether this is the proper way to generate the histogram.
hist, bin_edges = np.histogram(voltages, bins=500, density=True)
hist = savgol_filter(hist, 51, 3)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2
That is as far as I have got. The amplitude of the 3rd peak is too low, which is not what I expected. But please correct me if my expectation is wrong.
This is my histogram plot
I have updated my plot with the following code
labels = "hist"
if showGraph:
    plt.title("Datapoints Distribution over Voltage [mV]", )
    plt.xlabel("Voltage [mV]")
    plt.ylabel("Data Points")
    plt.plot(hist, label=labels)
    plt.show()
2. (edited) I am not sure why my label is not displayed. Could you please correct me?
3. (edited) Besides, I want to fit a Gaussian function to the histogram. But there are three peaks, so how should I fit the function to them?
def gauss(x, *p):
    A, mu, sigma = p
    return A*np.exp(-(x-mu)**2/(2.*sigma**2))
4. (edited) I realised that I have not mentioned the mean value yet.
I suppose that if I can locate the maximum of a peak, then I can find the mean value of that peak. Do I need to fit the Gaussian first to find the peak, or can I find it directly? Is it a matter of finding the local maximum? If yes, how should I proceed?
5. (edited) I know how to find the standard deviation of a single list. If I want to apply similar logic here, how do I implement the code?
sample = [1,2,3,4,5,5,5,5,10]
standard_deviation = np.std(sample, ddof=1)
print(standard_deviation)
Feedback to suggestions:
I try to implement the gaussian fit, below are the packages I import.
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
Here is the Gaussian fit. I pass my 30 voltage datasets to the Gaussian mixture fit, which prints out lots of values for mu and the variance.
gmm = GaussianMixture(n_components=1)
gmm.fit(voltages)
print(gmm.means_, gmm.covariances_)
mu = gmm.means_[0][0]
variance = gmm.covariances_[0][0][0]
print(mu, variance)
I run the code step by step. There is an error on the second line:
fig, ax = plt.subplots(figsize=(6,6))
Xs = np.arange(min(voltages), max(voltages), 0.05)
The truth value of an array with more than one element is ambiguous.
Use a.any() or a.all()
From what I found on the web, a.any() or a.all() is used to reduce the array to a single truth value; for example, with [T,T,F,F,T] you can have 4 possibilities.
I edited my code to:
Xs = np.arange(min(np.all(voltages)), max(np.all(voltages)), 0.05)
which gives me this:
'numpy.bool_' object is not iterable
I understand that it is not a boolean object. At this stage, I do not know how to proceed with the Gaussian curve fit. Can anyone provide an alternative way to do it?
To plot a histogram, the most vanilla matplotlib function, hist, is my go-to. Basically, if I have a list of samples, then I can plot a histogram of them with 100 bins via:
import matplotlib.pyplot as plt
plt.hist(samples, bins=100)
plt.show()
If you'd like to fit normal distribution(s) to your data, the best model for that is a Gaussian Mixture Model, which you can find more info about via scikit-learn's GMM page. That said, this is the code I use to fit a single Gaussian distribution to a dataset. If I wanted to fit k normal distributions, I'd need to use n_components=k. I've also included the resulting plot:
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
data = np.random.uniform(-1,1, size=(800,1))
data += np.random.uniform(-1,1, size=(800,1))
gmm = GaussianMixture(n_components=1)
gmm.fit(data)
print(gmm.means_, gmm.covariances_)
mu = gmm.means_[0][0]
variance = gmm.covariances_[0][0][0]
print(mu, variance)
fig, ax = plt.subplots(figsize=(6,6))
# use the array methods so the bounds are scalars, and centre the Gaussian on mu
Xs = np.arange(data.min(), data.max(), 0.05)
ys = 1.0/np.sqrt(2*np.pi*variance) * np.exp(-0.5/variance * (Xs - mu)**2)
ax.hist(data, bins=100, label='data')
px = ax.twinx()
px.plot(Xs, ys, c='r', linestyle='dotted', label='fit')
ax.legend()
px.legend(loc='upper left')
plt.show()
As for the standard deviation question, I'm not sure which axis you'd like to compute the standard deviations along. If you'd like the standard deviation of each column, you can use np.std(data, axis=0); use axis=1 for the row-by-row standard deviation.
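For the three-peak case in the question, one possible approach is to flatten the 2-D voltages array into a single column of samples and fit a mixture with n_components=3; the component means and variances then give a mean and standard deviation per peak. The sketch below uses synthetic data as a hypothetical stand-in for the 30 traces (the peak positions 0, 2 and 4 and the width 0.2 are made up). It also shows voltages.min() and voltages.max(), which avoid the "truth value of an array" error that the built-in min and max raise on a 2-D array.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in for the real data: 30 traces of 702 samples each,
# drawn from three voltage levels (values are invented for illustration)
rng = np.random.default_rng(0)
voltages = np.concatenate([
    rng.normal(0.0, 0.2, size=(30, 300)),
    rng.normal(2.0, 0.2, size=(30, 250)),
    rng.normal(4.0, 0.2, size=(30, 152)),
], axis=1)

# GaussianMixture expects shape (n_samples, n_features): one column of voltages
samples = voltages.reshape(-1, 1)
gmm = GaussianMixture(n_components=3, random_state=0).fit(samples)

# Sort the components by mean so the peaks come out left to right
order = np.argsort(gmm.means_.ravel())
means = gmm.means_.ravel()[order]
sigmas = np.sqrt(gmm.covariances_.ravel()[order])
weights = gmm.weights_[order]

# Array methods reduce over all elements, unlike the built-in min()/max()
Xs = np.arange(voltages.min(), voltages.max(), 0.05)

for m, s, w in zip(means, sigmas, weights):
    print(f"peak mean={m:.3f}  std={s:.3f}  weight={w:.3f}")
```

Whether three components is the right number for the real data is a judgment call; the histogram is the first thing to check.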

How to stack a kernel above each point with Seaborn kdeplot?

The objective is to stack a kernel above each point scattered along one dimension. Based on the original post, this can be achieved as below.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# define a half-kernel function. We normalize to have integral(half_kernel) = 1 if required
def half_kernel(x, center, width = 1, normalize = True):
    kernel = norm.pdf ( x, center, width )
    if normalize:
        kernel *= 2
    return kernel
# this are the points where we center our kernels -- random for testing
centers = np.array([5,5,5,1,2,1,1,8])
# Grid on which we look at the results
x = np.linspace(0,10,101)
# get the results here, each column is one of the kernels
discr_kernels = np.zeros((len(x),len(centers)))
for n in range(len(centers)):
    discr_kernels[:,n] = half_kernel(x, centers[n])
y = discr_kernels.sum(axis= 1)
plt.plot(x,discr_kernels,'--')
plt.plot(x,y, '.-', label = 'total')
plt.legend(loc = 'best')
plt.show()
and produce
Since seaborn offers greater flexibility in customization, I would like to replicate this using seaborn instead.
df =DataFrame ([5,5,5,1,2,1,1,8],columns=['seq_no'])
sns.kdeplot(data=df, x="seq_no")
plt.show()
However, the result is not the same.
I would really appreciate it if someone could share any insight into which seaborn settings need to be changed to get output similar to the first approach.
The seaborn kdeplot documentation says:
The bandwidth, or standard deviation of the smoothing kernel, is an
important parameter. Misspecification of the bandwidth can produce a
distorted representation of the data. Much like the choice of bin
width in a histogram, an over-smoothed curve can erase true features
of a distribution, while an under-smoothed curve can create false
features out of random variability. The rule-of-thumb that sets the
default bandwidth works best when the true distribution is smooth,
unimodal, and roughly bell-shaped. It is always a good idea to check
the default behavior by using bw_adjust to increase or decrease the
amount of smoothing.
This should do the trick.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame([5,5,5,1,2,1,1,8],columns=['seq_no'])
sns.kdeplot(data=df, x="seq_no", bw_adjust=0.6)
plt.xlim(0, 10)
plt.grid()
plt.show()
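To see how the stacked-kernel picture relates to a KDE, it may help to compare it with scipy.stats.gaussian_kde, which seaborn's documentation names as the destination of its bw_method argument. In scipy, a scalar bw_method is a factor that multiplies the sample standard deviation (ddof=1) to give the kernel width. The sketch below assumes plain Gaussian kernels of width 1 (without the doubling applied by half_kernel above) and checks that their sum equals n times the KDE when the factor is chosen accordingly.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

centers = np.array([5, 5, 5, 1, 2, 1, 1, 8], dtype=float)
x = np.linspace(0, 10, 101)
width = 1.0  # standard deviation of each stacked kernel

# One Gaussian kernel per point, summed over all points
stacked = norm.pdf(x[:, None], centers[None, :], width).sum(axis=1)

# gaussian_kde uses kernel std = factor * sample std (ddof=1),
# so this factor reproduces kernels of the requested width
factor = width / centers.std(ddof=1)
kde = gaussian_kde(centers, bw_method=factor)
density = kde(x)

# The KDE is the average of the kernels; the stacked curve is their sum
print(np.allclose(stacked, len(centers) * density))
```

The same factor could be tried as bw_method in sns.kdeplot; note that the seaborn curve is still a normalized density, so it differs from the stacked sum by the factor n.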

Best fitting line for a scatter plot

Is there any way to find the best-fitting line for a scatter plot if I don't know the relationship between the two axes (otherwise I could have used scipy.optimize)? My scatter plot looks something like this
I would like to have a line like this
and I need to get the points of the best-fitting line for my further calculations
for j in lat :
    l=94*j
    i=l-92
    for lines in itertools.islice(input_file, i, l):
        lines=lines.split()
        p.append(float(Decimal(lines[0])))
        vmr.append(float(Decimal(lines[3])))
plt.scatter(vmr, p)
You can use LOWESS (Locally Weighted Scatterplot Smoothing), a non-parametric regression method.
Statsmodels has an implementation here that you can use to fit your own smoother.
See this StackOverflow question on visualizing nonlinear relationships in scatter plots for an example using the Statsmodels implementation.
You could also use the implementation in the Seaborn visualization library's regplot() function with the keyword argument lowess=True. See the Seaborn documentation for details.
The following code is an example using Seaborn and the data from the StackOverflow question above:
import numpy as np
import seaborn as sns
sns.set_style("white")
x = np.arange(0,10,0.01)
ytrue = np.exp(-x/5.0) + 2*np.sin(x/3.0)
# add random errors with a normal distribution
y = ytrue + np.random.normal(size=len(x))
sns.regplot(x=x, y=y, lowess=True, color="black",
            line_kws={"color":"magenta", "linewidth":5})
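Since the question asks for the points of the fitted line and not only a plot, calling the Statsmodels smoother directly may be more convenient. The sketch below reuses the synthetic data from the example above; frac=0.3 (the fraction of points used for each local fit) is an assumed value that would need tuning for real data.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.arange(0, 10, 0.01)
ytrue = np.exp(-x / 5.0) + 2 * np.sin(x / 3.0)
# add random errors with a normal distribution
y = ytrue + rng.normal(size=len(x))

# lowess takes the response first, then the predictor; the result is an
# (n, 2) array of (x, fitted y) pairs sorted by x
fitted = lowess(y, x, frac=0.3)
x_fit, y_fit = fitted[:, 0], fitted[:, 1]
print(fitted[:5])
```

x_fit and y_fit are the coordinates of the smoothed line and can be used in further calculations or passed to plt.plot.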
This probably isn't a matplotlib question, but I think you can do this kind of thing with pandas, using a rolling median.
smoothedData = dataSeries.rolling(10, center = True).median()
Actually you can do a rolling median with anything, but pandas has a built in function. Numpy may too.
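A minimal sketch of the rolling-median idea, assuming the scatter data can be sorted by its x values first. The synthetic data and the window of 10 points are placeholders for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic scatter data (hypothetical), sorted by x so the window is local in x
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 500))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

dataSeries = pd.Series(y, index=x)
# Centered window of 10 points; positions without a full window are NaN
smoothedData = dataSeries.rolling(10, center=True).median()
line = smoothedData.dropna()

# line.index holds the x values and line.values the smoothed y values
print(line.head())
```

A larger window gives a smoother line at the cost of losing more points at both ends.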

Python/Scipy kde fit, scaling

I have a Series in Python and I'd like to fit a density to its histogram. Question: is there a slick way to use the values from np.histogram() to achieve this result? (see Update below)
My current problem is that the kde fit I perform has (seemingly) unwanted kinks, as depicted in the second plot below. I was hoping for a kde fit that is monotone decreasing based on a histogram, which is the first figure depicted. Below I've included my current code. Thanks in advance
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde as kde
df[var].hist()
plt.show() # shows the original histogram
density = kde(df[var])
xs = np.arange(0, df[var].max(), 0.1)
ys = density(xs)
plt.plot(xs, ys) # a pdf with kinks
Alternatively, is there a slick way to use
count, div = np.histogram(df[var])
and then scale the count array to apply kde() to it?
Update
Based on cel's comment below (should've been obvious, but I missed it!), I was implicitly under-binning in this case using the default params in pandas.DataFrame.hist(). In the updated plot I used
df[var].hist(bins=100)
I'll leave this post up in case others find it useful but won't mind if it gets taken down as 'too localized' etc.
If you increase the bandwidth using the bw_method parameter, then the kde will look smoother. This example comes from Justin Peel's answer; the code has been modified to take advantage of the bw_method:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
density1 = gaussian_kde(data)
bandwidth = 1.5
density2 = gaussian_kde(data, bw_method=bandwidth)
xs = np.linspace(0,8,200)
plt.plot(xs,density1(xs), label='bw_method=None')
plt.plot(xs,density2(xs), label='bw_method={}'.format(bandwidth))
plt.legend(loc='best')
plt.show()
yields
The problem was under-binning as mentioned by cel, see comments above. It was clarifying to set bins=100 in pd.DataFrame.hist(), which defaults to bins=10.
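On the side question of reusing the output of np.histogram(): one option, sketched below, is to fit gaussian_kde to the bin centres and pass the counts through its weights argument (available in recent SciPy versions). The exponential sample is a hypothetical stand-in for df[var], and the result is only an approximation of a KDE on the raw values because each point is moved to its bin centre.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical stand-in for df[var]: a monotonically decreasing distribution
rng = np.random.default_rng(0)
values = rng.exponential(scale=2.0, size=5000)

count, div = np.histogram(values, bins=100)
centres = (div[:-1] + div[1:]) / 2

# Keep only non-empty bins and weight each bin centre by its count
mask = count > 0
density = gaussian_kde(centres[mask], weights=count[mask])

xs = np.linspace(0, values.max(), 200)
ys = density(xs)
print(ys[:5])
```

Some probability mass leaks below zero because the Gaussian kernel does not know about the boundary, so the curve is not guaranteed to be monotone near x = 0.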
See also:
http://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width

Quantile-Quantile Plot using SciPy

How would you create a qq-plot using Python?
Assume that you have a large set of measurements and are using some plotting function that takes XY-values as input. The function should plot the quantiles of the measurements against the corresponding quantiles of some distribution (normal, uniform...).
The resulting plot then lets us evaluate whether our measurements follow the assumed distribution or not.
http://en.wikipedia.org/wiki/Quantile-quantile_plot
Both R and Matlab provide ready-made functions for this, but I am wondering what the cleanest method for implementing it in Python would be.
Update: As folks have pointed out this answer is not correct. A probplot is different from a quantile-quantile plot. Please see those comments and other answers before you make an error in interpreting or conveying your distributions' relationship.
I think that scipy.stats.probplot will do what you want. See the documentation for more detail.
import numpy as np
import pylab
import scipy.stats as stats
measurements = np.random.normal(loc = 20, scale = 5, size=100)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
Result
Using qqplot of statsmodels.api is another option:
Very basic example:
import numpy as np
import statsmodels.api as sm
import pylab
test = np.random.normal(0,1, 1000)
sm.qqplot(test, line='45')
pylab.show()
Result:
Documentation and more examples are here
If you need to do a QQ plot of one sample vs. another, statsmodels includes qqplot_2samples(). Like Ricky Robinson in a comment above, this is what I think of as a QQ plot vs a probability plot which is a sample against a theoretical distribution.
http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.gofplots.qqplot_2samples.html
I came up with this. Maybe you can improve it. Especially the method of generating the quantiles of the distribution seems cumbersome to me.
You could replace np.random.normal with any other distribution from np.random to compare data against other distributions.
#!/bin/python
import numpy as np
measurements = np.random.normal(loc = 20, scale = 5, size=100000)
def qq_plot(data, sample_size):
    qq = np.ones([sample_size, 2])
    np.random.shuffle(data)
    qq[:, 0] = np.sort(data[0:sample_size])
    qq[:, 1] = np.sort(np.random.normal(size = sample_size))
    return qq
print(qq_plot(measurements, 1000))
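One way to make the quantile generation less cumbersome is to compute the theoretical quantiles exactly with the distribution's inverse CDF (ppf in scipy.stats) instead of sorting random draws. The sketch below assumes the plotting positions (i - 0.5)/n, which is one common convention among several.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
measurements = rng.normal(loc=20, scale=5, size=1000)

n = measurements.size
# Plotting positions (assumed convention): midpoints of n equal probability bins
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = stats.norm.ppf(probs)   # quantiles of the standard normal
sample = np.sort(measurements)        # empirical quantiles

# Column 0: theoretical quantiles, column 1: sample quantiles
qq = np.column_stack([theoretical, sample])

# For normal data the points lie near a line with slope ~ std, intercept ~ mean
slope, intercept = np.polyfit(theoretical, sample, 1)
print(slope, intercept)
```

Replacing stats.norm with another scipy.stats distribution compares the data against that distribution instead.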
To add to the confusion around Q-Q plots and probability plots in the Python and R worlds, this is what the SciPy manual says:
"probplot generates a probability plot, which should not be confused
with a Q-Q or a P-P plot. Statsmodels has more extensive functionality
of this type, see statsmodels.api.ProbPlot."
If you try out scipy.stats.probplot, you'll see that indeed it compares a dataset to a theoretical distribution. Q-Q plots, OTOH, compare two datasets (samples).
R has functions qqnorm, qqplot and qqline. From the R help (Version 3.6.3):
qqnorm is a generic function the default method of which produces a
normal QQ plot of the values in y. qqline adds a line to a
“theoretical”, by default normal, quantile-quantile plot which passes
through the probs quantiles, by default the first and third quartiles.
qqplot produces a QQ plot of two datasets.
In short, R's qqnorm offers the same functionality that scipy.stats.probplot provides with the default setting dist="norm". But the fact that they called it qqnorm and that it's supposed to "produce a normal QQ plot" may easily confuse users.
Finally, a word of warning. These plots don't replace proper statistical testing and should be used for illustrative purposes only.
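For the two-sample case described above, a minimal sketch with plain NumPy is to evaluate both samples at the same set of probabilities and pair up the resulting quantiles. The two normal samples and the probability grid from 1% to 99% are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 500)   # first sample
b = rng.normal(0, 2, 800)   # second sample, twice the spread

# Evaluate both samples at the same probabilities; sizes may differ
probs = np.linspace(0.01, 0.99, 99)
qa = np.quantile(a, probs)
qb = np.quantile(b, probs)

# Plotting qb against qa gives the two-sample Q-Q plot; here the points
# fall near a line whose slope is roughly the ratio of the spreads
slope = np.polyfit(qa, qb, 1)[0]
print(slope)
```

Points on the identity line would indicate that both samples come from the same distribution.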
It exists now in the statsmodels package:
http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.gofplots.qqplot.html
You can use bokeh
from bokeh.plotting import figure, show
from scipy.stats import probplot
# pd_series is the series you want to plot
series1 = probplot(pd_series, dist="norm")
p1 = figure(title="Normal QQ-Plot", background_fill_color="#E8DDCB")
p1.scatter(series1[0][0],series1[0][1], fill_color="red")
show(p1)
import numpy as np
import pylab
import scipy.stats as stats
measurements = np.random.normal(loc = 20, scale = 5, size=100)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
Here probplot plots the measurements against the normal distribution, which is specified by dist="norm".
How big is your sample? Here is another option to test your data against any distribution using the OpenTURNS library. In the example below, I generate a sample x of 1,000,000 numbers from a Uniform distribution and test it against a Normal distribution.
You can replace x by your data if you reshape it as x= [[x1], [x2], .., [xn]]
import openturns as ot
x = ot.Uniform().getSample(1000000)
g = ot.VisualTest.DrawQQplot(x, ot.Normal())
g
In my Jupyter Notebook, I see:
If you are writing a script, you can display the plot explicitly:
from openturns.viewer import View
import matplotlib.pyplot as plt
View(g)
plt.show()
