Python/Scipy kde fit, scaling - python

I have a Series in Python and I'd like to fit a density to its histogram. Question: is there a slick way to use the values from np.histogram() to achieve this result? (see Update below)
My current problem is that the kde fit I perform has (seemingly) unwanted kinks, as depicted in the second plot below. I was hoping for a kde fit that is monotone decreasing based on a histogram, which is the first figure depicted. Below I've included my current code. Thanks in advance
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde as kde
df[var].hist()
plt.show() # shows the original histogram
density = kde(df[var])
xs = np.arange(0, df[var].max(), 0.1)
ys = density(xs)
plt.plot(xs, ys) # a pdf with kinks
Alternatively, is there a slick way to use
count, div = np.histogram(df[var])
and then scale the count array to apply kde() to it?
Update
Based on cel's comment below (should've been obvious, but I missed it!), I was implicitly under-binning in this case using the default params in pandas.DataFrame.hist(). In the updated plot I used
df[var].hist(bins=100)
I'll leave this post up in case others find it useful but won't mind if it gets taken down as 'too localized' etc.

If you increase the bandwidth using the bw_method parameter, then the kde will look smoother. This example comes from Justin Peel's answer; the code has been modified to take advantage of the bw_method:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
density1 = gaussian_kde(data)
bandwidth = 1.5
density2 = gaussian_kde(data, bw_method=bandwidth)
xs = np.linspace(0,8,200)
plt.plot(xs,density1(xs), label='bw_method=None')
plt.plot(xs,density2(xs), label='bw_method={}'.format(bandwidth))
plt.legend(loc='best')
plt.show()
yields

The problem was under-binning as mentioned by cel, see comments above. It was clarifying to set bins=100 in pd.DataFrame.histo() which defaults to bins=10.
See also:
http://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width

Related

How to make a histogram from 30 csv files to plot the historgram and then for it with gaussian function and the standard deviation?

I want to make a histogram from 30 csv files, and then fit a gaussian function to see if my data is optimal. After that, I need to find the mean and standard deviation of those peaks. The file data size are too large, I do not know if I extract individual column and organize their value range into number of bins correctly.
I know it is a bit long and too many questions, please answer as much as you want, thank you very much!
> this is the links of the data
Below so far I have done (actually not much, coz I am beginner to data visualization.)
Firstly, I import the packages, savgol_filter to make the bin transparent, it seems better.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.signal import savgol_filter
And then I convert the dimension and set limit.
def cm2inch(value):
return value/2.54
width = 9
height = 6.75
sliceMin, sliceMax = 300, 1002
Next I load all the data jupyter notebook by iteration 30 times, where I set up two arrays "times" and "voltages" to store the values.
times, voltages = [], []
for i in range(30):
time, ch1 = np.loadtxt(f"{i+1}.txt", delimiter=',', skiprows=5,unpack=True)
times.append(time)
voltages.append(ch1)
t = (np.array(times[0]) * 1e5)[sliceMin:sliceMax]
voltages = (np.array(voltages))[:, sliceMin:sliceMax]
1. I think I should need a hist function to plot the graph. Although I have the plot, but I am not sure if it is the proper way to generate the histogram.
hist, bin_edges = np.histogram(voltages, bins=500, density=True)
hist = savgol_filter(hist, 51, 3)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2
That is so far I have reached. the amplitude of the 3rd peak is too low, which is not what I expected. But please correct me if my expectation is wrong.
This is my histogram plot
I have updated my plot with the following code
labels = "hist"
if showGraph:
plt.title("Datapoints Distribution over Voltage [mV]", )
plt.xlabel("Voltage [mV]")
plt.ylabel("Data Points")
plt.plot(hist, label=labels)
plt.show()
2.(edited) I am not sure why my label cannot display, could you please correct me?
3.(edited) Besides, I want to make a fit curve by using gaussian function to the histogram. But there are three peaks, so how should I fit the function to them?
def gauss(x, *p):
A, mu, sigma = p
return A*np.exp(-(x-mu)**2/(2.*sigma**2))
4. (edited) I realised that I have not mentioned the mean value yet.
I suppose that if I can locate the maximum value of the peak, then I can find the mean value of the specific peak. Do I need to fit the Gaussian first to find the peak, or I can find the straight ahead? Is it to find the local maximum so I can find it? If yes, how can I proceed it?
5. (edited) I know how to find the standard deviation from a single list, if I want to do similar logic, how to implement the code?
sample = [1,2,3,4,5,5,5,5,10]
standard_deviation = np.std(sample, ddof=1)
print(standard_deviation)
Feedback to suggestions:
I try to implement the gaussian fit, below are the packages I import.
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
Here isthe gaussian function, I put my 30 datasets voltages as the parameter of the Gaussian Mixture fit, which print our lots of values regarding mu and variance.
gmm = GaussianMixture(n_components=1)
gmm.fit(voltages)
print(gmm.means_, gmm.covariances_)
mu = gmm.means_[0][0]
variance = gmm.covariances_[0][0][0]
print(mu, variance)
I process the code one by one. There is an error on the second line:
fig, ax = plt.subplots(figsize=(6,6))
Xs = np.arange(min(voltages), max(voltages), 0.05)
The truth value of an array with more than one element is ambiguous.
Use a.any() or a.all()
I search from the web that, to use this is to indicate there is only one value, like if there are[T,T,F,F,T], you can have 4 possibilities.
I edit my code to:
Xs = np.arange(min(np.all(voltages)), max(np.all(voltages)), 0.05)
which gives me this:
'numpy.bool_' object is not iterable
I understand it is not a boolean object. At this stage, I do not know how to proceed the gaussian curve fit. Can anyone provides me an alternate way to do it?
To plot a histogram, the most vanilla matplotlib function, hist, is my go-to. Basically, if I have a list of samples, then I can plot a histogram of them with 100 bins via:
import matplotlib.pyplot as plt
plt.hist(samples, bins=100)
plt.show()
If you'd like to fit normal distribution(s) to your data, the best model for that is a Gaussian Mixture Model, which you can find more info about via scikit-learn's GMM page. That said, this is the code I use to fit a singular gaussian distribution to a dataset. If I wanted to fit k normal distributions, I'd need to use n_components=k. I've also included the resulting plot:
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
data = np.random.uniform(-1,1, size=(800,1))
data += np.random.uniform(-1,1, size=(800,1))
gmm = GaussianMixture(n_components=1)
gmm.fit(data)
print(gmm.means_, gmm.covariances_)
mu = gmm.means_[0][0]
variance = gmm.covariances_[0][0][0]
print(mu, variance)
fig, ax = plt.subplots(figsize=(6,6))
Xs = np.arange(min(data), max(data), 0.05)
ys = 1.0/np.sqrt(2*np.pi*variance) * np.exp(-0.5/variance * (Xs + mu)**2)
ax.hist(data, bins=100, label='data')
px = ax.twinx()
px.plot(Xs, ys, c='r', linestyle='dotted', label='fit')
ax.legend()
px.legend(loc='upper left')
plt.show()
As for question 3, I'm not sure which axis you'd like to capture the standard deviations of. If you'd like to get the standard deviation of columns, you can use np.std(data, axis=1), and use axis=0 for row-by-row standard deviation.

sklearn: KDE not working for small values

I am struggling to implement the scikit-learn implementation of KDE for small input ranges. The following code works. Increasing the divisor variable to 100 and KDE struggles:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.neighbors import KernelDensity
# make data:
np.random.seed(0)
divisor = 1
gaussian1 = (3 * np.random.randn(1700))/divisor
gaussian2 = (9 + 1.5 * np.random.randn(300)) / divisor
gaussian_mixture = np.hstack([gaussian1, gaussian2])
# illustrate proper KDE with seaborn:
sns.distplot(gaussian_mixture);
# now implement in sklearn:
x_grid = np.linspace(min(gaussian1), max(gaussian2), 200)
kde_skl = KernelDensity(bandwidth=0.5)
kde_skl.fit(gaussian_mixture[:, np.newaxis])
# score_samples() returns the log-likelihood of the samples
log_pdf = kde_skl.score_samples(x_grid[:, np.newaxis])
pdf = np.exp(log_pdf)
fig, ax = plt.subplots(1, 1, sharey=True, figsize=(7, 4))
ax.plot(x_grid, pdf, linewidth=3, alpha=0.5)
Works fine. However, changing the 'divisor' variable to 100 and the scipy and seaborn can handle the smaller data values. Sklearn's KDE cannot with my implementation:
What am I doing wrong and how can I rectify this? I need sklearns implementation of KDE so cannot use another algorithm.
Kernel Density Estimation is called a nonparametric-method, but actually it has a parameter called bandwidth.
Every application of KDE needs this parameter set!
When you do the seaborn-plot:
sns.distplot(gaussian_mixture);
you are not giving any bandwidth and seaborn uses default heuristics (scott or silverman). These are using the data to choose some bandwidth in a dependent way.
The sklearn-code of you looks like:
kde_skl = KernelDensity(bandwidth=0.5)
There is a fixed/constant bandwidth! This might give you trouble and might be the reason here. But it's at least something to look at. In general one would combine sklearn's KDE with GridSearchCV as cross-validation tool to select a good bandwidth. In many cases this is slower, but better than those heuristics above.
Sadly you did not explain why you want to use sklearn's KDE. My personal rating of the 3 popular candidates is statsmodels > sklearn > scipy.

Python Pylab pcolor options for publication quality plots

I am trying to make DFT (discrete fourier transforms) plots using pcolor in python. I have previously been using Mathematica 8.0 to do this but I find that the colorbar in mathematica 8.0 has bad one-to-one correlation with the data I try to represent. For instance, here is the data that I am plotting:
[[0.,0.,0.10664,0.,0.,0.,0.0412719,0.,0.,0.],
[0.,0.351894,0.,0.17873,0.,0.,0.,0.,0.,0.],
[0.10663,0.,0.178183,0.,0.,0.,0.0405148,0.,0.,0.],
[0.,0.177586,0.,0.,0.,0.0500377,0.,0.,0.,0.],
[0.,0.,0.,0.,0.0588906,0.,0.,0.,0.,0.],
[0.,0.,0.,0.0493811,0.,0.,0.,0.,0.,0.],
[0.0397341,0.,0.0399249,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.]]
So, its a lot of zeros or small numbers in a DFT matrix or small quantity of high frequency energies.
When I plot this using mathematica, this is the result:
The color bar is off and I thought I'd like to plot this with python instead.
My python code (that I hijacked from here) is:
from numpy import corrcoef, sum, log, arange
from numpy.random import rand
#from pylab import pcolor, show, colorbar, xticks, yticks
from pylab import *
data = np.array([[0.,0.,0.10664,0.,0.,0.,0.0412719,0.,0.,0.],
[0.,0.351894,0.,0.17873,0.,0.,0.,0.,0.,0.],
[0.10663,0.,0.178183,0.,0.,0.,0.0405148,0.,0.,0.],
[0.,0.177586,0.,0.,0.,0.0500377,0.,0.,0.,0.],
[0.,0.,0.,0.,0.0588906,0.,0.,0.,0.,0.],
[0.,0.,0.,0.0493811,0.,0.,0.,0.,0.,0.],
[0.0397341,0.,0.0399249,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.]], np.float)
pcolor(data)
colorbar()
yticks(arange(0.5,10.5),range(0,10))
xticks(arange(0.5,10.5),range(0,10))
#show()
savefig('/home/mydir/foo.eps',figsize=(4,4),dpi=100)
And this python code plots as:
Now here is my question/list of questions:
I like how python plots this and would like to use this but...
How do I make all the "blue" which represents "0" go away like it does in my mathematica plot?
How do I rotate the plot to have the bright red spot in the top left corner?
The way I set the "dpi", is that correct?
Any useful references that I should use to strengthen my love for python?
I have looked through other questions on here and the user manual for numpy but found not much help.
I plan on publishing this data and it is rather important that I get all the bits and pieces right! :)
Edit:
Modified python code and resulting plot! What improvements would one suggest to this to make it publication worthy?
from numpy import corrcoef, sum, log, arange, save
from numpy.random import rand
from pylab import *
data = np.array([[0.,0.,0.10664,0.,0.,0.,0.0412719,0.,0.,0.],
[0.,0.351894,0.,0.17873,0.,0.,0.,0.,0.,0.],
[0.10663,0.,0.178183,0.,0.,0.,0.0405148,0.,0.,0.],
[0.,0.177586,0.,0.,0.,0.0500377,0.,0.,0.,0.],
[0.,0.,0.,0.,0.0588906,0.,0.,0.,0.,0.],
[0.,0.,0.,0.0493811,0.,0.,0.,0.,0.,0.],
[0.0397341,0.,0.0399249,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],
[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.]], np.float)
v1 = abs(data).max()
v2 = abs(data).min()
pcolor(data, cmap="binary")
colorbar()
#xlabel("X", fontsize=12, fontweight="bold")
#ylabel("Y", fontsize=12, fontweight="bold")
xticks(arange(0.5,10.5),range(0,10),fontsize=19)
yticks(arange(0.5,10.5),range(0,10),fontsize=19)
axis([0,7,0,7])
#show()
savefig('/home/mydir/Desktop/py_dft.eps',figsize=(4,4),dpi=600)
The following will get you closer to what you want:
import matplotlib.pyplot as plt
plt.pcolor(data, cmap=plt.cm.OrRd)
plt.yticks(np.arange(0.5,10.5),range(0,10))
plt.xticks(np.arange(0.5,10.5),range(0,10))
plt.colorbar()
plt.gca().invert_yaxis()
plt.gca().set_aspect('equal')
plt.show()
The list of available colormaps by default is here. You'll need one that starts out white.
If none of those suits your needs, you can try generating your own, start by looking at LinearSegmentedColormap.
Just for the record, in Mathematica 9.0:
GraphicsGrid#{{MatrixPlot[l,
ColorFunction -> (ColorData["TemperatureMap"][Rescale[#, {Min#l, Max#l}]] &),
ColorFunctionScaling -> False], BarLegend[{"TemperatureMap", {0, Max#l}}]}}

histogram matching in Python

I am trying to do histogram matching of simulated data to observed precipitation data. The below shows a simple simulated case. I got the CDF of both the simulated and observed data and got stuck theree. I hope a clue would help me to get across..Thanks you in advance
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
import scipy.stats as st
sim = st.gamma(1,loc=0,scale=0.8) # Simulated
obs = st.gamma(2,loc=0,scale=0.7) # Observed
x = np.linspace(0,4,1000)
simpdf = sim.pdf(x)
obspdf = obs.pdf(x)
plt.plot(x,simpdf,label='Simulated')
plt.plot(x,obspdf,'r--',label='Observed')
plt.title('PDF of Observed and Simulated Precipitation')
plt.legend(loc='best')
plt.show()
plt.figure(1)
simcdf = sim.cdf(x)
obscdf = obs.cdf(x)
plt.plot(x,simcdf,label='Simulated')
plt.plot(x,obscdf,'r--',label='Observed')
plt.title('CDF of Observed and Simulated Precipitation')
plt.legend(loc='best')
plt.show()
# Inverse CDF
invcdf = interp1d(obscdf,x)
transfer_func = invcdf(simcdf)
plt.figure(2)
plt.plot(transfer_func,x,'g-')
plt.show()
I tried to reproduce your code, and got the following error:
ValueError: A value in x_new is above the interpolation range.
If you look at the plot of your two CDFs it is pretty straight forward to figure out what is going on:
When you now define invcdf = interp1d(obscdf, x), notice that obscdf ranges from
>>> obscdf[0]
0.0
>>> obscdf[-1]
0.977852889924409
and so invcdf can only interpolate values between those limits: beyond them we would have to do extrapolation, which is not all that well defined. SciPy's default behavior is to raise an error when asked to extrapolate. Which is exactly what happens when you ask for invcdf(simcdf), because
>>> simcdf[-1]
0.99326205300091452
is beyond the interpolation range.
If you read the interp1d docs you will see that this behavior can be modified doing
invcdf = interp1d(obscdf, x, bounds_error=False)
and now everything works out fine, although you need to reverse the order of your plotting arguments to plt.plot(x, transfer_func,'g-') to get the same as in the figure you posted:

Quantile-Quantile Plot using SciPy

How would you create a qq-plot using Python?
Assuming that you have a large set of measurements and are using some plotting function that takes XY-values as input. The function should plot the quantiles of the measurements against the corresponding quantiles of some distribution (normal, uniform...).
The resulting plot lets us then evaluate in our measurement follows the assumed distribution or not.
http://en.wikipedia.org/wiki/Quantile-quantile_plot
Both R and Matlab provide ready made functions for this, but I am wondering what the cleanest method for implementing in in Python would be.
Update: As folks have pointed out this answer is not correct. A probplot is different from a quantile-quantile plot. Please see those comments and other answers before you make an error in interpreting or conveying your distributions' relationship.
I think that scipy.stats.probplot will do what you want. See the documentation for more detail.
import numpy as np
import pylab
import scipy.stats as stats
measurements = np.random.normal(loc = 20, scale = 5, size=100)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
Result
Using qqplot of statsmodels.api is another option:
Very basic example:
import numpy as np
import statsmodels.api as sm
import pylab
test = np.random.normal(0,1, 1000)
sm.qqplot(test, line='45')
pylab.show()
Result:
Documentation and more example are here
If you need to do a QQ plot of one sample vs. another, statsmodels includes qqplot_2samples(). Like Ricky Robinson in a comment above, this is what I think of as a QQ plot vs a probability plot which is a sample against a theoretical distribution.
http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.gofplots.qqplot_2samples.html
I came up with this. Maybe you can improve it. Especially the method of generating the quantiles of the distribution seems cumbersome to me.
You could replace np.random.normal with any other distribution from np.random to compare data against other distributions.
#!/bin/python
import numpy as np
measurements = np.random.normal(loc = 20, scale = 5, size=100000)
def qq_plot(data, sample_size):
qq = np.ones([sample_size, 2])
np.random.shuffle(data)
qq[:, 0] = np.sort(data[0:sample_size])
qq[:, 1] = np.sort(np.random.normal(size = sample_size))
return qq
print qq_plot(measurements, 1000)
To add to the confusion around Q-Q plots and probability plots in the Python and R worlds, this is what the SciPy manual says:
"probplot generates a probability plot, which should not be confused
with a Q-Q or a P-P plot. Statsmodels has more extensive functionality
of this type, see statsmodels.api.ProbPlot."
If you try out scipy.stats.probplot, you'll see that indeed it compares a dataset to a theoretical distribution. Q-Q plots, OTOH, compare two datasets (samples).
R has functions qqnorm, qqplot and qqline. From the R help (Version 3.6.3):
qqnorm is a generic function the default method of which produces a
normal QQ plot of the values in y. qqline adds a line to a
“theoretical”, by default normal, quantile-quantile plot which passes
through the probs quantiles, by default the first and third quartiles.
qqplot produces a QQ plot of two datasets.
In short, R's qqnorm offers the same functionality that scipy.stats.probplot provides with the default setting dist=norm. But the fact that they called it qqnorm and that it's supposed to "produce a normal QQ plot" may easily confuse users.
Finally, a word of warning. These plots don't replace proper statistical testing and should be used for illustrative purposes only.
It exists now in the statsmodels package:
http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.gofplots.qqplot.html
You can use bokeh
from bokeh.plotting import figure, show
from scipy.stats import probplot
# pd_series is the series you want to plot
series1 = probplot(pd_series, dist="norm")
p1 = figure(title="Normal QQ-Plot", background_fill_color="#E8DDCB")
p1.scatter(series1[0][0],series1[0][1], fill_color="red")
show(p1)
import numpy as np
import pylab
import scipy.stats as stats
measurements = np.random.normal(loc = 20, scale = 5, size=100)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
Here probplot draw the graph measurements vs normal distribution which speofied in dist="norm"
How big is your sample? Here is another option to test your data against any distribution using OpenTURNS library. In the example below, I generate a sample x of 1.000.000 numbers from a Uniform distribution and test it against a Normal distribution.
You can replace x by your data if you reshape it as x= [[x1], [x2], .., [xn]]
import openturns as ot
x = ot.Uniform().getSample(1000000)
g = ot.VisualTest.DrawQQplot(x, ot.Normal())
g
In my Jupyter Notebook, I see:
If you are writing a script, you can do it more properly
from openturns.viewer import View`
import matplotlib.pyplot as plt
View(g)
plt.show()

Categories