I have some data in a pandas DataFrame:
df['Difference'] = df.Congruent.values - df.Incongruent.values
mean = df.Difference.mean()
std = df.Difference.std(ddof=1)
median = df.Difference.median()
mode = df.Difference.mode()
and I want to plot a histogram together with a normal distribution in one plot. Is there a plotting function that takes the mean and sigma as arguments? I don't care whether it is matplotlib, seaborn or ggplot. Ideally I could also mark the mode and median of the data, all within the same plot.
You can use matplotlib/pylab with scipy.stats.norm.pdf and pass the mean and standard deviation as loc and scale:
import pylab
import numpy as np
from scipy.stats import norm
x = np.linspace(-10,10,1000)
y = norm.pdf(x, loc=2.5, scale=1.5) # for example
pylab.plot(x,y)
pylab.show()
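To put the histogram, the normal curve, and markers for the median and mode on one set of axes, something like the following sketch may work. It assumes the differences are available as a 1-D array; `diff` here is synthetic stand-in data (rounded to one decimal so that a mode is meaningful), and `density=True` rescales the histogram so it can share a y-axis with the PDF.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Stand-in for df.Difference.values (hypothetical data, not from the question)
rng = np.random.default_rng(0)
diff = np.round(rng.normal(loc=-8.0, scale=4.5, size=200), 1)

mean = diff.mean()
std = diff.std(ddof=1)
median = np.median(diff)
# Most frequent value; ties resolve to the smallest tied value
values, counts = np.unique(diff, return_counts=True)
mode = values[counts.argmax()]

fig, ax = plt.subplots()
# density=True normalises the bars to unit area, matching the PDF's scale
ax.hist(diff, bins=20, density=True, alpha=0.5, label="data")
x = np.linspace(diff.min(), diff.max(), 500)
ax.plot(x, norm.pdf(x, loc=mean, scale=std), "r-", label="normal PDF")
ax.axvline(median, color="g", linestyle="--", label=f"median = {median:.1f}")
ax.axvline(mode, color="k", linestyle=":", label=f"mode = {mode:.1f}")
ax.legend()
# plt.show()  # uncomment when running interactively
```

With pandas, `df.Difference.mode()` returns a Series that may hold several values, so a single entry such as `df.Difference.mode().iloc[0]` would be passed to `axvline`.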
I want to make a histogram from 30 CSV files and then fit a Gaussian function to see if my data is optimal. After that, I need to find the mean and standard deviation of those peaks. The data files are large, and I do not know whether I am extracting the individual columns and sorting their values into bins correctly.
I know this is a bit long and has too many questions; please answer as much as you want. Thank you very much!
> these are the links to the data
Below is what I have done so far (not much, as I am a beginner at data visualization).
First, I import the packages. I use savgol_filter to smooth the bins, which seems to look better.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.signal import savgol_filter
Then I define a unit conversion and set the slice limits.
def cm2inch(value):
    return value/2.54
width = 9
height = 6.75
sliceMin, sliceMax = 300, 1002
Next, I load all the data in a Jupyter notebook by iterating 30 times, storing the values in two lists, "times" and "voltages".
times, voltages = [], []
for i in range(30):
    time, ch1 = np.loadtxt(f"{i+1}.txt", delimiter=',', skiprows=5, unpack=True)
    times.append(time)
    voltages.append(ch1)
t = (np.array(times[0]) * 1e5)[sliceMin:sliceMax]
voltages = (np.array(voltages))[:, sliceMin:sliceMax]
1. I think I need a hist function to plot the graph. I do have a plot, but I am not sure whether this is the proper way to generate the histogram.
hist, bin_edges = np.histogram(voltages, bins=500, density=True)
hist = savgol_filter(hist, 51, 3)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2
That is as far as I have got. The amplitude of the third peak is too low, which is not what I expected. Please correct me if my expectation is wrong.
This is my histogram plot
I have updated my plot with the following code
labels = "hist"
if showGraph:
    plt.title("Datapoints Distribution over Voltage [mV]")
    plt.xlabel("Voltage [mV]")
    plt.ylabel("Data Points")
    plt.plot(hist, label=labels)
    plt.show()
2. (edited) I am not sure why my label is not displayed; could you please correct me?
3. (edited) I also want to fit a Gaussian function to the histogram. But there are three peaks, so how should I fit the function to them?
def gauss(x, *p):
    A, mu, sigma = p
    return A*np.exp(-(x-mu)**2/(2.*sigma**2))
4. (edited) I realised that I have not mentioned the mean value yet.
I suppose that if I can locate the maximum of a peak, I can find the mean value of that peak. Do I need to fit the Gaussian first to find the peak, or can I find it directly? Is it a matter of finding the local maximum? If so, how do I proceed?
5. (edited) I know how to find the standard deviation of a single list. How do I apply the same logic here?
sample = [1,2,3,4,5,5,5,5,10]
standard_deviation = np.std(sample, ddof=1)
print(standard_deviation)
Feedback to suggestions:
I tried to implement the Gaussian fit; below are the packages I import.
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
Here is the Gaussian fit. I pass my 30 voltage datasets to the Gaussian mixture fit, which prints out many values for mu and variance.
gmm = GaussianMixture(n_components=1)
gmm.fit(voltages)
print(gmm.means_, gmm.covariances_)
mu = gmm.means_[0][0]
variance = gmm.covariances_[0][0][0]
print(mu, variance)
I process the code one by one. There is an error on the second line:
fig, ax = plt.subplots(figsize=(6,6))
Xs = np.arange(min(voltages), max(voltages), 0.05)
The truth value of an array with more than one element is ambiguous.
Use a.any() or a.all()
From searching the web, I understand that a.any() or a.all() is used to reduce the array to a single truth value; for example, with [T,T,F,F,T] you can have 4 possibilities.
I edited my code to:
Xs = np.arange(min(np.all(voltages)), max(np.all(voltages)), 0.05)
which gives me this:
'numpy.bool_' object is not iterable
I understand it is not a boolean object. At this stage, I do not know how to proceed with the Gaussian curve fit. Can anyone suggest an alternative way to do it?
To plot a histogram, the most vanilla matplotlib function, hist, is my go-to. Basically, if I have a list of samples, then I can plot a histogram of them with 100 bins via:
import matplotlib.pyplot as plt
plt.hist(samples, bins=100)
plt.show()
If you'd like to fit normal distribution(s) to your data, a good model for that is a Gaussian Mixture Model, which you can find more info about via scikit-learn's GMM page. That said, this is the code I use to fit a single Gaussian distribution to a dataset. If I wanted to fit k normal distributions, I'd need to use n_components=k. I've also included the resulting plot:
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
data = np.random.uniform(-1,1, size=(800,1))
data += np.random.uniform(-1,1, size=(800,1))
gmm = GaussianMixture(n_components=1)
gmm.fit(data)
print(gmm.means_, gmm.covariances_)
mu = gmm.means_[0][0]
variance = gmm.covariances_[0][0][0]
print(mu, variance)
fig, ax = plt.subplots(figsize=(6,6))
Xs = np.arange(data.min(), data.max(), 0.05)
# Gaussian PDF with the fitted mean and variance (note the Xs - mu term)
ys = 1.0/np.sqrt(2*np.pi*variance) * np.exp(-0.5/variance * (Xs - mu)**2)
ax.hist(data, bins=100, label='data')
px = ax.twinx()
px.plot(Xs, ys, c='r', linestyle='dotted', label='fit')
ax.legend()
px.legend(loc='upper left')
plt.show()
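For data shaped like the question's `voltages` (30 traces stacked into a 2-D array), the same approach might look like the sketch below. The array here is synthetic, with three peaks placed at made-up positions. The points to note are that `GaussianMixture` treats each row as one sample, so the array is flattened to a single column first, and that the array methods `.min()` and `.max()` replace the built-in `min` and `max`, which raise the "truth value of an array" error on 2-D input.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for `voltages`: 30 traces x 600 points, three peaks
rng = np.random.default_rng(0)
true_means = np.array([-5.0, 0.0, 5.0])   # made-up peak positions
picks = rng.integers(0, 3, size=(30, 600))
voltages = rng.normal(loc=true_means[picks], scale=0.5)

# Each row passed to fit() is one sample, so flatten to shape (n_samples, 1)
samples = voltages.reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0)
gmm.fit(samples)

# Sort components by mean so the peaks are reported left to right
order = np.argsort(gmm.means_.ravel())
peak_means = gmm.means_.ravel()[order]
peak_stds = np.sqrt(gmm.covariances_.ravel()[order])
peak_weights = gmm.weights_[order]

for m, s, w in zip(peak_means, peak_stds, peak_weights):
    print(f"mean = {m:.2f}, std = {s:.2f}, weight = {w:.2f}")

# Grid for plotting the fitted curve: array methods work on any shape
xs = np.arange(voltages.min(), voltages.max(), 0.05)
```

Each component then supplies the mean and standard deviation of one peak directly, without locating local maxima in the histogram first.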
As for question 5 (the standard deviation), I'm not sure which axis you'd like to capture the standard deviations of. If you'd like one standard deviation per column, you can use np.std(data, axis=0); use axis=1 for one standard deviation per row.
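A quick check on a toy array can confirm which axis does what before applying it to the real data. The array below is a hypothetical stand-in (3 traces by 4 time points), and `ddof=1` matches the sample standard deviation used in the question.

```python
import numpy as np

# Hypothetical stand-in: 3 traces (rows) x 4 time points (columns)
data = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.0, 4.0, 6.0, 8.0],
                 [0.0, 0.0, 0.0, 0.0]])

# axis=0 collapses the rows: one value per column (spread across traces)
std_per_column = np.std(data, axis=0, ddof=1)

# axis=1 collapses the columns: one value per row (spread within a trace)
std_per_row = np.std(data, axis=1, ddof=1)

# No axis: a single value over every element
std_overall = np.std(data, ddof=1)

print(std_per_column.shape, std_per_row.shape)  # (4,) (3,)
```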
Problem statement - Variable X has a mean of 15 and a standard deviation of 2.
What is the minimum percentage of X values that lie between 8 and 17?
I know about 68-95-99.7 empirical rule. From Google I found that percentage of values within 1.5 standard deviations is 86.64%.
My code so far:
import scipy.stats
import numpy as np
X=np.random.normal(15,2)
As I understood,
13-17 is within 1 standard deviation having 68% values.
9-21 will be 3 standard deviations having 99.7% values.
7-23 is 4 standard deviations. So 8 is 3.5 standard deviations below the mean.
How to find the percentage of values from 8 to 17?
You basically want to know the area under the Probability Density Function (PDF) from x1=8 to x2=17.
The area under the PDF is its integral, and the integral of the PDF from minus infinity up to x is the Cumulative Distribution Function (CDF).
Thus, to find the area between two specific values of x, you integrate the PDF between these values, which is equivalent to computing CDF[x2] - CDF[x1].
So, in python, we could do
import numpy as np
import scipy.stats as sps
import matplotlib.pyplot as plt
mu = 15
sd = 2
# define the distribution
dist = sps.norm(loc=mu, scale=sd)
x = np.linspace(dist.ppf(.00001), dist.ppf(.99999))
# Probability Density Function
pdf = dist.pdf(x)
# Cumulative Distribution Function
cdf = dist.cdf(x)
and plot to take a look
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
axs[0].plot(x, pdf, color='k')
axs[0].fill_between(
    x[(x>=8)&(x<=17)],
    pdf[(x>=8)&(x<=17)],
    alpha=.25
)
axs[0].set(
    title='PDF'
)
axs[1].plot(x, cdf)
axs[1].axhline(dist.cdf(8), color='r', ls='--')
axs[1].axhline(dist.cdf(17), color='r', ls='--')
axs[1].set(
    title='CDF'
)
plt.show()
So, the value we want is that area, which we can calculate as
cdf_at_8 = dist.cdf(8)
cdf_at_17 = dist.cdf(17)
cdf_between_8_17 = cdf_at_17 - cdf_at_8
print(f"{cdf_between_8_17:.1%}")
which gives 84.1%.
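The same number can be cross-checked by hand with z-scores, which ties it back to the 68-95-99.7 reasoning in the question: 17 is 1 standard deviation above the mean and 8 is 3.5 below it. The sketch below uses only the standard library and, like the answer above, assumes X is normally distributed; `phi` is a helper name chosen for illustration.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, sd = 15, 2
z_low = (8 - mu) / sd     # -3.5
z_high = (17 - mu) / sd   #  1.0

p = phi(z_high) - phi(z_low)
print(f"{p:.1%}")  # 84.1%
```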
I'm wondering if there is a good way to match a Gaussian normal to a histogram in the form of a numpy array np.histogram(array, bins).
How can such a curve be plotted on the same graph and adjusted in height and width to the histogram?
You can fit your histogram using a Gaussian (i.e. normal) distribution, for example using scipy's curve_fit. I have written a small example below. Note that depending on your data, you may need to find a way to make good guesses for the starting values for the fit (p0). Poor starting values may cause your fit to fail.
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
from scipy.stats import norm
def fit_func(x,a,mu,sigma,c):
    """gaussian function used for the fit"""
    return a * norm.pdf(x,loc=mu,scale=sigma) + c
#make up some normally distributed data and do a histogram
y = 2 * np.random.normal(loc=1,scale=2,size=1000) + 2
no_bins = 20
hist,left = np.histogram(y,bins=no_bins)
centers = left[:-1] + 0.5 * (left[1] - left[0]) #bin centers: left edge plus half a bin width
#fit the histogram
p0 = [2,0,2,2] #starting values for the fit
p1,_ = curve_fit(fit_func,centers,hist,p0,maxfev=10000)
#plot the histogram and fit together
fig,ax = plt.subplots()
ax.hist(y,bins=no_bins)
x = np.linspace(left[0],left[-1],1000)
y_fit = fit_func(x, *p1)
ax.plot(x,y_fit,'r-')
plt.show()
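If the data are already assumed to be normal, a fit is not strictly needed: a PDF can be matched to a count histogram by multiplying it by the number of samples times the bin width. A minimal sketch, with made-up data and variable names chosen for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=4.0, scale=4.0, size=1000)  # made-up sample

counts, edges = np.histogram(data, bins=20)
bin_width = edges[1] - edges[0]
centers = edges[:-1] + 0.5 * bin_width

# Estimate the parameters directly from the sample
mu = data.mean()
sigma = data.std(ddof=1)

# Expected count per bin is approximately N * bin_width * pdf(center)
expected = data.size * bin_width * norm.pdf(centers, loc=mu, scale=sigma)

print(f"total counts: {counts.sum()}, sum of expected counts: {expected.sum():.1f}")
```

Plotting `expected` against `centers` on the same axes as `ax.hist(data, bins=20)` then gives a curve with the histogram's height and width. The `curve_fit` route remains the more flexible choice when a constant background such as `c` has to be fitted as well.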
So based on my understanding of normal distribution the mean is zero by default when the standard deviation is 1. I was given an assignment to write a python program to generate a PDF of a normally distributed function with the range from 10 to 45 with a standard deviation of 2. Will the mean still be zero? I tried this but my plot doesn't form a bell shape. I don't know what I am doing wrong.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
mu=0 # mean
sigma=2
x=np.arange(10,45,0.1)
y=stats.norm.pdf(x, 0, sigma)
plt.plot(x,y)
plt.show()
See my plot here: myplot
Since the values range from 10 to 45, the mean should lie within this range, around 27. Compute it with the mean function and then use your code as follows:
y=stats.norm.pdf(x, x.mean(), sigma )
This will give you a normal distribution curve:
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(10,45,0.1)
sigma = 2
print('Mean :', round(x.mean(), 2),'SD :', sigma)
plt.plot(x, norm.pdf(x,x.mean(),sigma), 'r1', lw=2, alpha=0.5, label='norm PDF')
plt.legend(loc='best')
plt.show()
Which prints:
Mean : 27.45 SD : 2
And shows the shape of the probability density function:
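A rough numerical comparison of the two choices of mean shows why the original attempt had no bell shape: with a mean of 0 and a standard deviation of 2, the range 10 to 45 starts five standard deviations above the mean, so only the far right tail is drawn. The sketch below uses plain NumPy with the normal PDF written out by hand; `normal_pdf` is a helper name introduced for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Normal probability density, written out explicitly."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.arange(10, 45, 0.1)
sigma = 2

y_zero_mean = normal_pdf(x, 0, sigma)         # mean far outside the range
y_centered = normal_pdf(x, x.mean(), sigma)   # mean in the middle of the range

print(f"largest value with mean 0:         {y_zero_mean.max():.2e}")
print(f"largest value with mean at centre: {y_centered.max():.3f}")
print(f"x at the peak:                     {x[y_centered.argmax()]:.2f}")
```

With the mean at 0, every plotted value is below one millionth, whereas the centred curve peaks near 0.2, which is the height expected for a standard deviation of 2.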
I have an array of random integers for which I have calculated the mean and std, the standard deviation. Next I have an array of random numbers within the normal distribution of this (mean, std).
I want to plot now a scatter plot of the normal distribution array using matplotlib. Can you please help?
Code:
random_array_a = np.random.randint(2,15,size=75) #random array from [2,15)
mean = np.mean(random_array_a)
std = np.std(random_array_a)
sample_norm_distrib = np.random.normal(mean,std,75)
A scatter plot needs x and y values, but what should they be?
I think what you may want is a histogram of the normal distribution:
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(sample_norm_distrib)
The closest thing you can do to visualise the distribution of your 1D output is a scatter plot where x and y are the same. This way you can see the data accumulate in the high-probability areas. For example:
import numpy as np
import matplotlib.pyplot as plt
mean = 0
std = 1
sample_norm_distrib = np.random.normal(mean,std,7500)
plt.figure()
plt.scatter(sample_norm_distrib,sample_norm_distrib)
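Another option, if a scatter plot is specifically wanted, is to put each sample on the x-axis and the normal density at that sample on the y-axis, so the points trace out the bell curve and cluster where the data are dense. A sketch along those lines, reusing the question's setup (the seed and the use of `scipy.stats.norm` are additions for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(42)
random_array_a = rng.integers(2, 15, size=75)  # random integers from [2, 15)
mean = random_array_a.mean()
std = random_array_a.std()
sample_norm_distrib = rng.normal(mean, std, 75)

# y-value of each point is the density of N(mean, std) at that sample
density = norm.pdf(sample_norm_distrib, loc=mean, scale=std)

fig, ax = plt.subplots()
ax.scatter(sample_norm_distrib, density, s=15)
ax.set_xlabel("sample value")
ax.set_ylabel("normal PDF")
# plt.show()  # uncomment when running interactively
```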