How to match a Gaussian normal to a histogram? - python

I'm wondering if there is a good way to match a Gaussian normal to a histogram in the form of a numpy array np.histogram(array, bins).
How can such a curve be plotted on the same graph, scaled in height and width to match the histogram?

You can fit your histogram using a Gaussian (i.e. normal) distribution, for example using scipy's curve_fit. I have written a small example below. Note that depending on your data, you may need to find a way to make good guesses for the starting values for the fit (p0). Poor starting values may cause your fit to fail.
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
from scipy.stats import norm
def fit_func(x, a, mu, sigma, c):
    """Gaussian (plus constant offset) used for the fit"""
    return a * norm.pdf(x, loc=mu, scale=sigma) + c

# make up some normally distributed data and do a histogram
y = 2 * np.random.normal(loc=1, scale=2, size=1000) + 2
no_bins = 20
hist, left = np.histogram(y, bins=no_bins)
centers = left[:-1] + (left[1] - left[0]) / 2  # bin midpoints, not right edges
#fit the histogram
p0 = [2,0,2,2] #starting values for the fit
p1,_ = curve_fit(fit_func,centers,hist,p0,maxfev=10000)
#plot the histogram and fit together
fig,ax = plt.subplots()
ax.hist(y,bins=no_bins)
x = np.linspace(left[0],left[-1],1000)
y_fit = fit_func(x, *p1)
ax.plot(x,y_fit,'r-')
plt.show()
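As a quick sanity check (a sketch, assuming the fit converged), you can compare the fitted parameters with the known ones: since y = 2*N(1, 2) + 2 has mean 4 and standard deviation 4, mu and sigma should both come out near 4 (sigma may be returned negative, as the model is symmetric in its sign):
a_fit, mu_fit, sigma_fit, c_fit = p1
print(f"mu = {mu_fit:.2f}, sigma = {abs(sigma_fit):.2f}")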

Related

How to make a histogram from 30 csv files, plot it, and then fit it with a Gaussian function to find the standard deviation?

I want to make a histogram from 30 csv files and then fit a Gaussian function to it, to check whether my data looks as expected. After that, I need to find the mean and standard deviation of the peaks. The data files are quite large, and I am not sure whether I am extracting the individual columns and organizing their value ranges into bins correctly.
I know this is long and contains many questions; please answer as much as you like, thank you very much!
> this is the link to the data
Below is what I have done so far (honestly not much, since I am a beginner at data visualization).
First I import the packages; savgol_filter is used to smooth the histogram, which seems to look better.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.signal import savgol_filter
Then I define a unit conversion and set the slice limits.
def cm2inch(value):
    return value / 2.54
width = 9
height = 6.75
sliceMin, sliceMax = 300, 1002
Next I load all the data in a Jupyter notebook, iterating 30 times and storing the values in two lists, "times" and "voltages".
times, voltages = [], []
for i in range(30):
    time, ch1 = np.loadtxt(f"{i+1}.txt", delimiter=',', skiprows=5, unpack=True)
    times.append(time)
    voltages.append(ch1)
t = (np.array(times[0]) * 1e5)[sliceMin:sliceMax]
voltages = (np.array(voltages))[:, sliceMin:sliceMax]
1. I think I need a hist function to plot the graph. I do have a plot, but I am not sure whether this is the proper way to generate the histogram.
hist, bin_edges = np.histogram(voltages, bins=500, density=True)
hist = savgol_filter(hist, 51, 3)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2
That is as far as I have got. The amplitude of the 3rd peak is too low, which is not what I expected, but please correct me if my expectation is wrong.
This is my histogram plot
I have updated my plot with the following code:
labels = "hist"
if showGraph:
    plt.title("Datapoints Distribution over Voltage [mV]")
    plt.xlabel("Voltage [mV]")
    plt.ylabel("Data Points")
    plt.plot(hist, label=labels)
    plt.show()
2. (edited) I am not sure why my label does not display; could you please correct me?
3. (edited) Besides, I want to fit a Gaussian curve to the histogram. But there are three peaks, so how should I fit the function to them?
def gauss(x, *p):
    A, mu, sigma = p
    return A * np.exp(-(x - mu)**2 / (2. * sigma**2))
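A sketch of one possible approach, assuming three reasonably separated peaks, is to fit a sum of three such components with curve_fit; the starting guesses in p0 below are placeholders (amplitude, centre, width for each peak) that you would read off your own histogram:
def three_gauss(x, *p):
    # sum of three Gaussian components, reusing gauss() from above
    return gauss(x, *p[0:3]) + gauss(x, *p[3:6]) + gauss(x, *p[6:9])

p0 = [1, -0.1, 0.05, 1, 0.0, 0.05, 1, 0.1, 0.05]  # placeholder guesses
popt, _ = curve_fit(three_gauss, bin_centres, hist, p0=p0)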
4. (edited) I realise that I have not mentioned the mean value yet.
I suppose that if I can locate the maximum of a peak, then I can find the mean value of that specific peak. Do I need to fit the Gaussian first to find the peak, or can I find it directly? Should I look for the local maxima, and if so, how do I proceed?
5. (edited) I know how to find the standard deviation of a single list; if I want to apply similar logic here, how should I implement it?
sample = [1,2,3,4,5,5,5,5,10]
standard_deviation = np.std(sample, ddof=1)
print(standard_deviation)
Feedback on the suggestions:
I tried to implement the Gaussian fit; below are the packages I import.
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
Here is the Gaussian fit: I pass my 30 voltage datasets as the parameter of the GaussianMixture fit, which prints out lots of values for mu and the variance.
gmm = GaussianMixture(n_components=1)
gmm.fit(voltages)
print(gmm.means_, gmm.covariances_)
mu = gmm.means_[0][0]
variance = gmm.covariances_[0][0][0]
print(mu, variance)
I run the code line by line; there is an error on the second line:
fig, ax = plt.subplots(figsize=(6,6))
Xs = np.arange(min(voltages), max(voltages), 0.05)
The truth value of an array with more than one element is ambiguous.
Use a.any() or a.all()
I found on the web that this error means NumPy cannot decide whether a whole array should count as True or False; for example, for [T, T, F, F, T] there is more than one reasonable interpretation.
I edit my code to:
Xs = np.arange(min(np.all(voltages)), max(np.all(voltages)), 0.05)
which gives me this:
'numpy.bool_' object is not iterable
I understand it is not a boolean object, but at this stage I do not know how to proceed with the Gaussian curve fit. Can anyone provide me with an alternative way to do it?
To plot a histogram, the most vanilla matplotlib function, hist, is my go-to. Basically, if I have a list of samples, then I can plot a histogram of them with 100 bins via:
import matplotlib.pyplot as plt
plt.hist(samples, bins=100)
plt.show()
If you'd like to fit normal distribution(s) to your data, the best model for that is a Gaussian Mixture Model, which you can find more info about on scikit-learn's GMM page. That said, this is the code I use to fit a single Gaussian distribution to a dataset. If I wanted to fit k normal distributions, I'd use n_components=k. I've also included the resulting plot:
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
data = np.random.uniform(-1,1, size=(800,1))
data += np.random.uniform(-1,1, size=(800,1))
gmm = GaussianMixture(n_components=1)
gmm.fit(data)
print(gmm.means_, gmm.covariances_)
mu = gmm.means_[0][0]
variance = gmm.covariances_[0][0][0]
print(mu, variance)
fig, ax = plt.subplots(figsize=(6,6))
Xs = np.arange(data.min(), data.max(), 0.05)
ys = 1.0/np.sqrt(2*np.pi*variance) * np.exp(-0.5/variance * (Xs - mu)**2)  # note (Xs - mu), not (Xs + mu)
ax.hist(data, bins=100, label='data')
px = ax.twinx()
px.plot(Xs, ys, c='r', linestyle='dotted', label='fit')
ax.legend()
px.legend(loc='upper left')
plt.show()
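A side note on the error from your edit: the builtin min() iterates over a 2-D array row by row and tries to compare whole rows, which raises the ambiguous-truth-value error as soon as the rows have more than one element. Using the array methods, e.g. voltages.min() and voltages.max() (as with data.min() and data.max() above), avoids this entirely.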
As for question 5, I'm not sure which axis you'd like the standard deviations over. To get the standard deviation of each column you can use np.std(data, axis=0), and axis=1 for the row-by-row standard deviation.
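For example, a minimal sketch with made-up numbers:
data = np.array([[1., 2., 3.],
                 [4., 5., 6.]])
print(np.std(data, axis=0))  # per-column std -> [1.5 1.5 1.5]
print(np.std(data, axis=1))  # per-row std -> [0.8165 0.8165]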

How can I fit a gaussian curve in python?

I'm given an array, and when I plot it I get a Gaussian shape with some noise. I want to fit the Gaussian. This is what I already have, but when I plot it I do not get a fitted Gaussian; instead I just get a straight line. I've tried this many different ways and I just can't figure it out.
random_sample=norm.rvs(h)
parameters = norm.fit(h)
fitted_pdf = norm.pdf(f, loc = parameters[0], scale = parameters[1])
normal_pdf = norm.pdf(f)
plt.plot(f,fitted_pdf,"green")
plt.plot(f, normal_pdf, "red")
plt.plot(f,h)
plt.show()
You can use fit from scipy.stats.norm as follows:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
data = np.random.normal(loc=5.0, scale=2.0, size=1000)
mean,std=norm.fit(data)
norm.fit tries to fit the parameters of a normal distribution based on the data. And indeed in the example above mean is approximately 5 and std is approximately 2.
In order to plot it, you can do:
plt.hist(data, bins=30, density=True)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
y = norm.pdf(x, mean, std)
plt.plot(x, y)
plt.show()
The blue bars are the histogram of your data, and the green line is the Gaussian with the fitted parameters.
There are many ways to fit a Gaussian function to a data set. I often use astropy when fitting data, which is why I wanted to add this as an additional answer.
I use a data set that should simulate a Gaussian with some noise:
import numpy as np
import matplotlib.pyplot as plt
from astropy import modeling
m = modeling.models.Gaussian1D(amplitude=10, mean=30, stddev=5)
x = np.linspace(0, 100, 2000)
data = m(x)
data = data + np.sqrt(data) * np.random.random(x.size) - 0.5
data -= data.min()
plt.plot(x, data)
Then fitting it is actually quite simple: you specify a model that you want to fit to the data and a fitter:
fitter = modeling.fitting.LevMarLSQFitter()
model = modeling.models.Gaussian1D() # depending on the data you need to give some initial values
fitted_model = fitter(model, x, data)
And plotted:
plt.plot(x, data)
plt.plot(x, fitted_model(x))
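The fitted parameter values are then available as attributes on the fitted model, e.g.:
print(fitted_model.amplitude.value, fitted_model.mean.value, fitted_model.stddev.value)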
However, you can also use just scipy, but then you have to define the function yourself:
from scipy import optimize
def gaussian(x, amplitude, mean, stddev):
    return amplitude * np.exp(-(x - mean)**2 / (2 * stddev**2))
popt, _ = optimize.curve_fit(gaussian, x, data)
This returns the optimal arguments for the fit and you can plot it like this:
plt.plot(x, data)
plt.plot(x, gaussian(x, *popt))
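If the fit fails to converge with the default starting values (all ones), you can pass explicit initial guesses via p0; the values below are simply read off the simulated data above:
popt, _ = optimize.curve_fit(gaussian, x, data, p0=[10, 30, 5])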

How can I change the parameters of gaussian_kde for a scatter plot colored by density in matplotlib

As explained by Joe Kington in his answer to this question: How can I make a scatter plot colored by density in matplotlib, I made a scatter plot colored by density. However, due to the complex distribution of my data, I would like to change the parameters used to calculate the density.
Here are the results with some fake data similar to mine:
I would like to calibrate the density calculation of gaussian_kde so that the left part of the plot looks like this:
I don't like the first plot because the groups of points influence the density of adjacent groups of points, and that prevents me from analyzing the distribution within a group. In other words, even if each of the 8 groups has exactly the same distribution, that won't be visible on the graph.
I tried to modify the covariance_factor (as I once did for a 2D plot of density over x), but because gaussian_kde(xy)(xy) evaluates the estimator immediately, what I get back is a numpy.ndarray, not a scipy.stats.kde.gaussian_kde object. Plus, I don't even know whether changing the covariance_factor will do it.
Here's my dummy code :
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
# Generate fake data
a = np.random.normal(size=1000)
b = np.random.normal(size=1000)
# Data for the first image
x = np.concatenate((a+10,a+10,a+20,a+20,a+30,a+30,a+40,a+40,a+80))
y = np.concatenate((b+10,b-10,b+10,b-10,b+10,b-10,b+10,b-10,b*4))
# Data for the second image
#x = np.concatenate((a+10,a+10,a+20,a+20,a+30,a+30,a+40,a+40))
#y = np.concatenate((b+10,b-10,b+10,b-10,b+10,b-10,b+10,b-10))
# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)
# My unsuccessful try to modify the covariance, which would work in 1D with "z = gaussian_kde(x)"
#z.covariance_factor = lambda : 0.01
#z._compute_covariance()
# Sort the points by density, so that the densest points are plotted last
idx = z.argsort()
x, y, z = x[idx], y[idx], z[idx]
fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=50, edgecolor='none')
plt.show()
The solution could use another density estimator, I don't mind.
The goal is to make a density plot like the ones shown above, where I can play with the density parameters.
I'm using Python 3.4.3.
Did you have a look at Seaborn? It's not exactly what you're asking for, but it already has functions for generating density plots:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import kendalltau
import seaborn as sns
# Generate fake data
a = np.random.normal(size=1000)
b = np.random.normal(size=1000)
# Data for the first image
x = np.concatenate((a+10, a+10, a+20, a+20, a+30, a+30, a+40, a+40, a+80))
y = np.concatenate((b+10, b-10, b+10, b-10, b+10, b-10, b+10, b-10, b*4))
sns.jointplot(x, y, kind="hex", stat_func=kendalltau)
sns.jointplot(x, y, kind="kde", stat_func=kendalltau)
plt.show()
It gives a hexbin joint plot and a smoothed KDE joint plot.
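If you do want to tune scipy's gaussian_kde itself, note that it accepts a bw_method argument, and that in your snippet gaussian_kde(xy)(xy) evaluates the estimator immediately, so you never keep the gaussian_kde object. A sketch using the xy stack from your dummy code (the bandwidth value 0.1 is just a placeholder to experiment with):
kde = gaussian_kde(xy, bw_method=0.1)  # smaller bandwidth -> more local density estimate
z = kde(xy)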

Matplotlib: Coloring scatter plot by density relative to another data set

I'm new to Python and having some trouble with matplotlib. I currently have data contained in two numpy arrays, call them x and y, that I am plotting on a scatter plot with coordinates for each point (x, y) (i.e. I have points (x[0], y[0]), (x[1], y[1]), and so on in my plot). I have been using the following code segment to color the points in my scatter plot based on the spatial density of nearby points (found in another stackoverflow post):
http://prntscr.com/abqowk
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
x = np.random.normal(size=1000)
y = x*3 + np.random.normal(size=1000)
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)
idx = z.argsort()
x, y, z = x[idx], y[idx], z[idx]  # plot the densest points last
fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=50, edgecolor='none')
plt.show()
Output:
I've been using it without being sure exactly how it works (namely the point density calculation; if someone could explain exactly how that works, it would also be much appreciated).
However, now I'd like to color code by the ratio of the spatial density of points in x,y to that of the spatial density of points in another set of numpy arrays, call them x2, y2. That is, I would like to make a plot such that I can identify how the density of points in x,y compares to the points in x2,y2 on the same scatter plot. Could someone please explain how I could go about doing this?
Thanks in advance for your help!
I've been trying to do the same thing based on that same earlier post, and I think I just figured it out! The trick is to use matplotlib.colors.Normalize() to define a scale and then weight it according to some data set (xnorm,ynorm):
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mplc
import matplotlib.cm as cm
from scipy.stats import gaussian_kde
def kdeplot(x, y, xnorm, ynorm):
    xy = np.vstack([x, y])
    z = gaussian_kde(xy)(xy)
    wt = 1.0 * len(x) / (1.0 * len(xnorm))
    norm = mplc.Normalize(vmin=0, vmax=8/wt)
    cmap = cm.gnuplot
    # sort so the densest points are plotted last
    idx = z.argsort()
    x, y, z = x[idx], y[idx], z[idx]
    args = (x, y)
    kwargs = {'c': z, 's': 10, 'edgecolor': 'none', 'cmap': cmap, 'norm': norm}
    return args, kwargs
# (x1,y1) is some data set whose density map coloring you
# want to scale to (xnorm,ynorm)
args,kwargs = kdeplot(x1,y1,xnorm,ynorm)
plt.scatter(*args,**kwargs)
I used trial and error to optimize my normalization for my particular data and choice of colormap. Here's what my data looks like scaled to itself; here's my data scaled to some comparison data (which is on the bottom of that image).
I'm not sure this method is entirely general, but it works in my case: I know that my data and the comparison data are in similar regions of parameter space, and they both have gaussian scatter, so I can use a naive linear scaling determined by the number of data points and it results in something that gives the right idea visually.
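For the ratio itself, a sketch of a more direct route (with x2 and y2 standing in for your comparison data set) is to build one KDE per data set, evaluate both at the same points, and colour by the quotient:
kde_data = gaussian_kde(np.vstack([x, y]))
kde_comp = gaussian_kde(np.vstack([x2, y2]))
pts = np.vstack([x, y])
ratio = kde_data(pts) / kde_comp(pts)  # > 1 where your data is denser than the comparison set
idx = ratio.argsort()
plt.scatter(x[idx], y[idx], c=ratio[idx], s=10)
plt.colorbar()
plt.show()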

Add more sample points to data

Given some data of shape 20x45, where each row is a separate data set, say 20 different sine curves with 45 data points each, how would I go about getting the same data, but with shape 20x100?
In other words, I have some data A of shape 20x45, and some data B of length 20x100, and I would like to have A be of shape 20x100 so I can compare them better.
This is for Python and Numpy/Scipy.
I assume it can be done with splines, so I am looking for a simple example, maybe just 2x10 to 2x20 or something, where each row is just a line, to demonstrate the solution.
Thanks!
Ubuntu beat me to it while I was typing this example, but his example just uses linear interpolation, which can be done more easily with numpy.interp... (The difference is only a keyword argument in scipy.interpolate.interp1d, however.)
I figured I'd include my example, as it shows using scipy.interpolate.interp1d with a cubic spline...
import numpy as np
import scipy as sp
import scipy.interpolate
import matplotlib.pyplot as plt
# Generate some random data
y = (np.random.random(10) - 0.5).cumsum()
x = np.arange(y.size)
# Interpolate the data using a cubic spline to "new_length" samples
new_length = 50
new_x = np.linspace(x.min(), x.max(), new_length)
new_y = sp.interpolate.interp1d(x, y, kind='cubic')(new_x)
# Plot the results
plt.figure()
plt.subplot(2,1,1)
plt.plot(x, y, 'bo-')
plt.title('Using 1D Cubic Spline Interpolation')
plt.subplot(2,1,2)
plt.plot(new_x, new_y, 'ro-')
plt.show()
One way would be to use scipy.interpolate.interp1d:
import scipy as sp
import scipy.interpolate
import numpy as np
x = np.linspace(0, 2*np.pi, 45)
y = np.zeros((2, 45))
y[0, :] = np.sin(x)
y[1, :] = np.sin(2*x)
f = sp.interpolate.interp1d(x, y)
y2 = f(np.linspace(0, 2*np.pi, 100))
If your data is fairly dense, it may not be necessary to use higher order interpolation.
If your application is not sensitive to precision or you just want a quick overview, you could just fill the unknown data points with averages from neighbouring known data points (in other words, do naive linear interpolation).
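For that naive linear case, np.interp is all you need; a minimal sketch, with A standing in for your 20x45 array:
x_old = np.linspace(0, 1, 45)
x_new = np.linspace(0, 1, 100)
A_new = np.array([np.interp(x_new, x_old, row) for row in A])  # shape (20, 100)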
