I have an array of random integers for which I have calculated the mean and the standard deviation (std). Next, I have an array of random numbers drawn from the normal distribution with this (mean, std).
I now want to make a scatter plot of the normally distributed array using matplotlib. Can you please help?
Code:
import numpy as np
random_array_a = np.random.randint(2,15,size=75) # random integers from [2,15)
mean = np.mean(random_array_a)
std = np.std(random_array_a)
sample_norm_distrib = np.random.normal(mean,std,75)
A scatter plot needs both x and y values, but what should they be?
I think what you may want is a histogram of the normal distribution:
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(sample_norm_distrib)
The closest thing you can do to visualise the distribution of a 1D output is a scatter plot where x and y are the same. This way you can see the data accumulate in the high-probability areas. For example:
import numpy as np
import matplotlib.pyplot as plt
mean = 0
std = 1
sample_norm_distrib = np.random.normal(mean,std,7500)
plt.figure()
plt.scatter(sample_norm_distrib,sample_norm_distrib)
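Another option, if you really want a scatter plot of a 1D sample, is to put the position of each draw on the x axis and the drawn value on the y axis. This is only a minimal sketch: the seeded generator `rng` and the dashed mean line are my own additions, not part of the question.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible (assumption)
random_array_a = rng.integers(2, 15, size=75)  # random integers from [2, 15)
mean = random_array_a.mean()
std = random_array_a.std()
sample_norm_distrib = rng.normal(mean, std, 75)

# x is simply the position of each draw in the array
index = np.arange(sample_norm_distrib.size)

fig, ax = plt.subplots()
ax.scatter(index, sample_norm_distrib)
ax.axhline(mean, color="r", linestyle="--", label="mean")  # reference line at the mean
ax.set_xlabel("sample index")
ax.set_ylabel("sampled value")
ax.legend()
# plt.show()  # uncomment when running interactively
```

The points should cluster around the dashed line and thin out further away from it, which is the same information the histogram gives, just less compactly.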
I have (x,y) coordinate pairs that I've plotted using sns.kdeplot. I want to randomly sample coordinates based on the 2D probability density function. How would I do that?
Here's some dummy data:
import numpy as np
import seaborn as sns
x_values = np.random.randint(low=0, high=10, size=100)
y_values = np.random.randint(low=0, high=10, size=100)
coordinate_pairs = list(zip(x_values,y_values))
sns.kdeplot(x=x_values, y=y_values)
I'm able to plot the probability density function, but how would I randomly sample (x,y) coordinate tuples from this distribution? Obviously the real data isn't completely random like the dummy data provided above.
Thanks so much and have a great day.
Seaborn doesn't return the object that contains the kernel density estimate.
However if you look in the code, you can see that they use scipy.stats.gaussian_kde for that. So you can do the same outside of plotting.
import numpy as np
from scipy.stats import gaussian_kde
# random 2d values
X = np.random.randn(1000, 2)
# fit kernel density estimate. needs to be transposed for the function
kde = gaussian_kde(X.T)
# now you can resample from it
# transpose to have same shape as X
Y = kde.resample(1000).T
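Applied to the dummy coordinate pairs from the question, that could look roughly like the sketch below. The seeded generator, the variable names `data`, `samples` and `sampled_pairs`, and the sample size of 5 are my own choices for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)  # seeded only for reproducibility (assumption)
x_values = rng.integers(low=0, high=10, size=100)
y_values = rng.integers(low=0, high=10, size=100)

# gaussian_kde expects an array of shape (n_dimensions, n_points)
data = np.vstack([x_values, y_values]).astype(float)
kde = gaussian_kde(data)

# resample returns shape (n_dimensions, n_samples); transpose to get rows of (x, y)
samples = kde.resample(5, seed=0).T
sampled_pairs = [tuple(row) for row in samples]
```

Note that the sampled coordinates are continuous values drawn from the smoothed density, so they will generally not fall on the integer grid of the input data.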
The following example of the curve function in R,
curve(dgamma(x, 3, .1), add=T, lwd=2, col="orange"),
plots the curve of the probability density function of the gamma distribution. The equivalent of R's dgamma in Python is scipy.stats.gamma.pdf (scipy.stats.dgamma is the double gamma distribution, which is something else).
How can I plot the same curve for the same distribution in Python? I would prefer this to fitting a kernel density estimate (KDE), which tends to be inaccurate.
I don't think there is an equivalent of curve in matplotlib, or in seaborn for that matter. You have to define a set of points yourself and plot the function over them on the same axes. In this case, since you are overlaying a histogram, you can take a number of evenly spaced points between the min and max of the data:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
x = stats.gamma.rvs(a=3,scale=1/0.1,size=1000)
plt.hist(x,density=True)
xl = np.linspace(x.min(),x.max(),1000)
plt.plot(xl,stats.gamma.pdf(xl,a=3,scale=1/0.1))
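If you miss the convenience of R's curve, you could wrap the linspace-and-plot pattern in a small helper. The `curve` function below is hypothetical (it is not part of matplotlib or seaborn), and its signature is just my guess at something usable; the default of 101 points mirrors R's default.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def curve(func, start, stop, n=101, ax=None, **plot_kwargs):
    """Hypothetical helper: plot func on n evenly spaced points in [start, stop]."""
    ax = ax if ax is not None else plt.gca()  # draw on the current axes, like add=T
    x = np.linspace(start, stop, n)
    y = func(x)
    ax.plot(x, y, **plot_kwargs)
    return x, y

# R's dgamma(x, 3, .1) uses shape=3 and rate=0.1; SciPy takes scale = 1/rate
x, y = curve(lambda t: stats.gamma.pdf(t, a=3, scale=1 / 0.1), 0, 100,
             lw=2, color="orange")
# plt.show()  # uncomment when running interactively
```

Because the helper draws on the current axes by default, calling it after plt.hist(..., density=True) overlays the curve on the histogram.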
I'm wondering if there is a good way to fit a Gaussian (normal) curve to a histogram that is available as a numpy array from np.histogram(array, bins).
How can such a curve be plotted on the same graph and adjusted in height and width to the histogram?
You can fit your histogram using a Gaussian (i.e. normal) distribution, for example using scipy's curve_fit. I have written a small example below. Note that depending on your data, you may need to find a way to make good guesses for the starting values for the fit (p0). Poor starting values may cause your fit to fail.
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
from scipy.stats import norm
def fit_func(x,a,mu,sigma,c):
    """gaussian function used for the fit"""
    return a * norm.pdf(x,loc=mu,scale=sigma) + c
#make up some normally distributed data and do a histogram
y = 2 * np.random.normal(loc=1,scale=2,size=1000) + 2
no_bins = 20
hist,left = np.histogram(y,bins=no_bins)
centers = left[:-1] + (left[1] - left[0]) / 2 #bin centers: left edge plus half a bin width
#fit the histogram
p0 = [2,0,2,2] #starting values for the fit
p1,_ = curve_fit(fit_func,centers,hist,p0,maxfev=10000)
#plot the histogram and fit together
fig,ax = plt.subplots()
ax.hist(y,bins=no_bins)
x = np.linspace(left[0],left[-1],1000)
y_fit = fit_func(x, *p1)
ax.plot(x,y_fit,'r-')
plt.show()
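One way to get reasonable starting values is to derive them from the data itself instead of hard-coding them. This is only a sketch under my own assumptions: the area of a count histogram (sample size times bin width) as the guess for a, the sample mean and standard deviation for mu and sigma, and zero for the offset c.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def fit_func(x, a, mu, sigma, c):
    """Scaled normal pdf plus a constant offset, as in the example above."""
    return a * norm.pdf(x, loc=mu, scale=sigma) + c

rng = np.random.default_rng(0)  # seeded only for reproducibility (assumption)
y = 2 * rng.normal(loc=1, scale=2, size=1000) + 2  # true mean 4, true std 4

hist, edges = np.histogram(y, bins=20)
centers = (edges[:-1] + edges[1:]) / 2  # midpoints of the bins
bin_width = edges[1] - edges[0]

# Data-driven starting values: [area of the count histogram, mean, std, offset]
p0 = [y.size * bin_width, y.mean(), y.std(), 0.0]
p1, _ = curve_fit(fit_func, centers, hist, p0=p0, maxfev=10000)
```

With starting values this close to the solution, the optimizer usually needs far fewer iterations, although that will still depend on how Gaussian your data really is.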
I am trying to plot a normalized histogram using the example from the numpy.random.normal documentation. For this purpose I generate a normally distributed random sample.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
mu_true = 0
sigma_true = 0.1
s = np.random.normal(mu_true, sigma_true, 2000)
Then I fit a normal distribution to the data and calculate the pdf.
mu, sigma = stats.norm.fit(s)
points = np.linspace(stats.norm.ppf(0.01,loc=mu,scale=sigma),
                     stats.norm.ppf(0.9999,loc=mu,scale=sigma),100)
pdf = stats.norm.pdf(points,loc=mu,scale=sigma)
Then I display the fitted pdf and the data histogram.
plt.hist(s, 30, density=True);
plt.plot(points, pdf, color='r')
plt.show()
I use density=True, but it is obvious that the pdf and the histogram are not normalized.
What can you suggest to plot a truly normalized histogram and pdf?
Seaborn distplot also doesn't solve the problem.
import seaborn as sns
ax = sns.distplot(s)
What makes you think it is not normalised? At a guess, it's probably because the heights of the columns extend to values greater than 1. However, this thinking is flawed: in a normalised histogram/pdf, it is the total area under it that should be one (not the sum of the heights). When you are dealing with steps in x that are much smaller than one (as you are), it is not surprising that the column heights are greater than one!
You can see this clearly in the documentation example you link: the x-values there are much greater (by an order of magnitude), so it follows that the y-values are correspondingly smaller. You will see the same effect if you change your distribution to cover a wider range of values. Try a sigma of 10 instead of 0.1 and see what happens!
import numpy as np
from numpy.random import seed, randn
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
# Try this, for 𝜇 = 0
seed(0)
points = np.linspace(-5,5,100)
pdf = norm.pdf(points,0,1)
plt.plot(points, pdf, color='r')
plt.hist(randn(50), density=True);
plt.show()
# or this, for 𝜇 = 10
seed(0)
points = np.linspace(5,15,100)
pdf = norm.pdf(points,10,1)
plt.plot(points, pdf, color='r')
plt.hist(10+randn(50), density=True);
plt.show()
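The area argument can also be checked numerically without any plotting. Here is a small sketch using np.histogram with the same sigma of 0.1 and 30 bins as in the question; the seeded generator and the variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only for reproducibility (assumption)
s = rng.normal(0, 0.1, 2000)

# density=True rescales the counts so that the bars integrate to one
heights, edges = np.histogram(s, bins=30, density=True)
widths = np.diff(edges)

area = np.sum(heights * widths)  # total area of all bars
print("tallest bar:", heights.max())  # well above 1 for sigma = 0.1
print("total area:", area)  # 1 up to floating point error
```

So the bars can be several units tall while the histogram is still correctly normalised, because each bar is only a small fraction of a unit wide.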
I have some data in a pandas DataFrame:
df['Difference'] = df.Congruent.values - df.Incongruent.values
mean = df.Difference.mean()
std = df.Difference.std(ddof=1)
median = df.Difference.median()
mode = df.Difference.mode()
and I want to plot a histogram together with a normal distribution in one plot. Is there a plotting function that takes the mean and sigma as arguments? I don't care whether it is matplotlib, seaborn or ggplot. Ideally I could also mark the mode and median of the data, all within one plot.
You can use matplotlib/pylab with scipy.stats.norm.pdf and pass the mean and standard deviation as loc and scale:
import pylab
import numpy as np
from scipy.stats import norm
x = np.linspace(-10,10,1000)
y = norm.pdf(x, loc=2.5, scale=1.5) # for example
pylab.plot(x,y)
pylab.show()
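To put everything in one plot, you could combine a density histogram, the normal pdf, and vertical lines for the median and mode. The sketch below uses made-up data as a stand-in for df['Difference'] (rounded so that a mode is meaningful); the variable `difference`, the bin count and the line styles are all my own assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(0)  # seeded only for reproducibility (assumption)
# Made-up stand-in for df['Difference']; rounded so that values repeat
difference = np.round(rng.normal(loc=-8, scale=5, size=200))

mean = difference.mean()
std = difference.std(ddof=1)
median = np.median(difference)
values, counts = np.unique(difference, return_counts=True)
mode = values[np.argmax(counts)]  # first of the most frequent values

fig, ax = plt.subplots()
ax.hist(difference, bins=15, density=True, alpha=0.5)  # density=True matches the pdf scale
x = np.linspace(difference.min(), difference.max(), 200)
ax.plot(x, norm.pdf(x, loc=mean, scale=std), label="normal pdf")
ax.axvline(median, color="g", linestyle="--", label="median")
ax.axvline(mode, color="m", linestyle=":", label="mode")
ax.legend()
# plt.show()  # uncomment when running interactively
```

With a real DataFrame you would pass df.Difference to ax.hist and reuse the mean, std, median and mode you already computed; keep in mind that df.Difference.mode() returns a Series, so you would pick one of its values for the line.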