When I use seaborn's confidence intervals in pointplot, I get deceptively small values compared with the standard deviation. Example:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas
import numpy as np

x = np.random.rand(100)
y = np.random.rand(100)
df = pandas.DataFrame({"x": x, "y": y})
data = pandas.melt(df)
print("data:", data)
plt.figure()
plt.subplot(2, 1, 1)
sns.pointplot(x="variable", y="value", data=data)
plt.ylim([0, 0.9])
ax = plt.subplot(2, 1, 2)
m = [df["x"].mean(), df["y"].mean()]
e = [df["x"].std(), df["y"].std()]
plt.errorbar(range(1,3), m, yerr=e)
plt.ylim([0, 0.9])
plt.xlim([0, 4])
plt.xticks([1, 2])
ax.set_xticklabels(["x", "y"])
The standard deviations are significantly larger. What is the explanation for this? Can seaborn plot error bars based on a simpler metric, like the standard deviation?
In the bottom plot, the standard deviations for x and y are shown, and they are much bigger than seaborn's confidence intervals for x and y in the top plot.
Making my previous answer below more precise: since the standard deviation of a Uniform(0, 1) random variable is 1/sqrt(12) ≈ 0.2887, the bars in your second plot cover an interval of size roughly [0.5 - 0.2887, 0.5 + 0.2887] = [0.2113, 0.7887].
On the other hand, by the central limit theorem, the 95% confidence interval for the empirical mean of 100 uniform random variables is roughly [0.5 - 1.96*0.2887/sqrt(100), 0.5 + 1.96*0.2887/sqrt(100)] ≈ [0.443, 0.557]. This corresponds to the confidence interval drawn by seaborn in your first plot.
To summarize: for statistical confidence intervals, the sample size plays a critical role and cannot be neglected!
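A quick numeric check of these two intervals (a small sketch, not part of the original answer):

import numpy as np
n = 100
sd = 1 / np.sqrt(12)                        # std of a Uniform(0, 1) variable
half_width = 1.96 * sd / np.sqrt(n)         # 95% CI half-width for the mean
print(0.5 - sd, 0.5 + sd)                   # ~[0.211, 0.789]: the errorbar plot
print(0.5 - half_width, 0.5 + half_width)   # ~[0.443, 0.557]: seaborn's CI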
Previous shorter answer
Seaborn's confidence intervals take into account the number of samples used to estimate the mean. Given that you handed seaborn a decent number of samples (100 points), the 95% confidence interval for the empirical mean will indeed be pretty small.
In order to achieve a fair comparison, you should scale your standard deviations by 1/sqrt(100) (turning them into standard errors of the mean) and then compare the plots.
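As for the second part of the question: seaborn can also draw standard-deviation bars directly. A sketch against the data from the question (errorbar="sd" requires seaborn 0.12+; older versions use ci="sd" instead):

# Newer seaborn (>= 0.12):
sns.pointplot(x="variable", y="value", data=data, errorbar="sd")
# Older seaborn:
# sns.pointplot(x="variable", y="value", data=data, ci="sd")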
Related
I have a number of samples of a variable, and I would like to use these samples to plot the variable's probability distribution. I'm using kernel density estimation with a Gaussian kernel, via the sklearn library. Here is the sample code I have implemented:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
# -- data
init_range = 0.0793
X = np.random.uniform(low=-init_range, high=init_range, size=133280)[:, np.newaxis]
# -- kernel density estimation
kde = KernelDensity(kernel="gaussian", bandwidth=0.2).fit(X)
X_plot = np.linspace(X.min(), X.max(), 1000)[:, np.newaxis]
log_dens = kde.score_samples(X_plot)
# -- plot density
plt.plot(X_plot[:, 0], np.exp(log_dens), lw=2, linestyle="-")
plt.ylim([0, 2.1])
plt.show()
Below is the resulting output:
As you can see, the values on the y-axis are above one; hence, the y-axis is NOT showing a probability distribution. I further plotted the histogram for this data:
# -- plot hist
n_bins = 40
weights = np.ones_like(X) / float(len(X))
prob, bins, _ = plt.hist(X, n_bins, density=False, histtype='step', color='red', weights=weights)
plt.show()
and the result is below:
which makes sense, as the bin heights sum up to one: 0.025 * 40 = 1.
I'm having a hard time understanding why my kde plot is not a distribution. How can I fix this? Is there a normalization step that I'm missing?
First, if you extend the limits of your X_plot axis (e.g. X_plot = np.linspace(-1, 1, ...)), you'll see that your KDE estimates a rather tall Gaussian, and the area under the curve is still 1.
Density values over 1 are perfectly legal, since the assumed distribution is continuous: individual points carry no real probability, so you should not treat your y-values as probabilities; the estimated probability for an interval is the corresponding area under the curve.
Sample code to verify the estimated probability of hitting the 0-0.004 range (roughly the width of your histogram bins):
import scipy.integrate as integrate
interval = np.linspace(0, 0.004, 1000)[:, np.newaxis]
log_dens = kde.score_samples(interval)
print(integrate.trapz(np.exp(log_dens), interval[:,0]))
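To verify the first point the same way, you can integrate the density well beyond the data's support and check that the total area is approximately one (a quick check, reusing kde from the question):

# The data live in roughly [-0.08, 0.08]; [-1, 1] covers the smoothed tails.
wide = np.linspace(-1, 1, 4000)[:, np.newaxis]
print(integrate.trapz(np.exp(kde.score_samples(wide)), wide[:, 0]))  # ~1.0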
Second, once you check the area under the curve, you'll see that your current hyperparameters aren't yielding a very accurate estimate; reducing the bandwidth or choosing a different kernel might help.
You can also apply a grid search to find the least inaccurate kernel and bandwidth, though this will take a good amount of time unless you reduce your sample size; also, choosing a narrow bandwidth may result in undersmoothing:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(KernelDensity(), {'kernel':['gaussian', 'tophat'],'bandwidth': np.logspace(-2, 0, 10)}, cv=5, n_jobs=-1)
grid.fit(X)
print(f"best hyperparameters: {grid.best_params_}")
kde = grid.best_estimator_
I want to make a histogram from 30 CSV files and then fit a Gaussian function to see how well a Gaussian describes my data. After that, I need to find the mean and standard deviation of the peaks. The files are large, and I am not sure whether I am extracting the individual columns and binning their value ranges correctly.
I know this is a bit long with too many questions; please answer as many as you want. Thank you very much!
> these are the links to the data
Below is what I have done so far (actually not much, as I am a beginner at data visualization).
First, I import the packages; savgol_filter is used to smooth the histogram, which seems to look better.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.signal import savgol_filter
Then I define a unit conversion and set the slice limits.
def cm2inch(value):
    return value / 2.54

width = 9
height = 6.75
sliceMin, sliceMax = 300, 1002
Next I load all the data in a Jupyter notebook, iterating 30 times and using two lists, times and voltages, to store the values.
times, voltages = [], []
for i in range(30):
    time, ch1 = np.loadtxt(f"{i+1}.txt", delimiter=",", skiprows=5, unpack=True)
    times.append(time)
    voltages.append(ch1)
t = (np.array(times[0]) * 1e5)[sliceMin:sliceMax]
voltages = np.array(voltages)[:, sliceMin:sliceMax]
1. I think I need a histogram function to plot the graph. Although I have a plot, I am not sure it is the proper way to generate the histogram.
hist, bin_edges = np.histogram(voltages, bins=500, density=True)
hist = savgol_filter(hist, 51, 3)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2
That is as far as I have got. The amplitude of the 3rd peak is too low, which is not what I expected, but please correct me if my expectation is wrong.
This is my histogram plot
I have updated my plot with the following code:
labels = "hist"
if showGraph:
plt.title("Datapoints Distribution over Voltage [mV]", )
plt.xlabel("Voltage [mV]")
plt.ylabel("Data Points")
plt.plot(hist, label=labels)
plt.show()
2. (edited) I am not sure why my label does not display; could you please correct me?
3. (edited) Besides, I want to fit a Gaussian curve to the histogram. But there are three peaks, so how should I fit the function to them?
def gauss(x, *p):
    A, mu, sigma = p
    return A * np.exp(-(x - mu)**2 / (2. * sigma**2))
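For reference, one common approach to multiple peaks (a sketch, not from the original post) is to fit a sum of three Gaussians, one (A, mu, sigma) triple per peak; the starting guesses p0 below are hypothetical and would need to roughly match the actual peaks:

def gauss3(x, *p):
    # Sum of three Gaussians; p holds nine parameters, three per peak.
    return gauss(x, *p[0:3]) + gauss(x, *p[3:6]) + gauss(x, *p[6:9])

p0 = [1, -0.2, 0.05,  1, 0.0, 0.05,  0.5, 0.2, 0.05]  # hypothetical guesses
popt, pcov = curve_fit(gauss3, bin_centres, hist, p0=p0)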
4. (edited) I realise that I have not mentioned the mean value yet.
I suppose that if I can locate the maximum of a peak, then I can find the mean of that specific peak. Do I need to fit the Gaussian first to find the peak, or can I find it directly? Should I look for the local maxima, and if so, how do I proceed?
5. (edited) I know how to find the standard deviation of a single list; if I want to apply similar logic here, how should I implement it?
sample = [1,2,3,4,5,5,5,5,10]
standard_deviation = np.std(sample, ddof=1)
print(standard_deviation)
Feedback to suggestions:
I tried to implement the Gaussian fit; below are the packages I import.
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
Here is the Gaussian fit: I pass my 30 voltage datasets as the parameter of the GaussianMixture fit, which prints out lots of values for mu and the variance.
gmm = GaussianMixture(n_components=1)
gmm.fit(voltages)
print(gmm.means_, gmm.covariances_)
mu = gmm.means_[0][0]
variance = gmm.covariances_[0][0][0]
print(mu, variance)
I ran the code line by line; there is an error on the second line:
fig, ax = plt.subplots(figsize=(6,6))
Xs = np.arange(min(voltages), max(voltages), 0.05)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
From searching the web, I understand this error means that an array like [T, T, F, F, T] has no single truth value, so it cannot be used where one value is expected.
I edited my code to:
Xs = np.arange(min(np.all(voltages)), max(np.all(voltages)), 0.05)
which gives me this:
'numpy.bool_' object is not iterable
I understand it is not a boolean object. At this stage, I do not know how to proceed with the Gaussian curve fit. Can anyone provide me with an alternative way to do it?
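For what it's worth, the error comes from calling Python's built-in min/max on a 2-D array: they iterate over the rows and try to compare whole rows, which has no single truth value. NumPy's own reductions avoid this (a minimal fix, assuming voltages is the (30, N) array built above):

# Reduce over all elements instead of comparing rows:
Xs = np.arange(voltages.min(), voltages.max(), 0.05)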
To plot a histogram, the most vanilla matplotlib function, hist, is my go-to. Basically, if I have a list of samples, then I can plot a histogram of them with 100 bins via:
import matplotlib.pyplot as plt
plt.hist(samples, bins=100)
plt.show()
If you'd like to fit normal distribution(s) to your data, the best model for that is a Gaussian mixture model; you can find more info on scikit-learn's GMM page. That said, this is the code I use to fit a single Gaussian distribution to a dataset; to fit k normal distributions, use n_components=k. I've also included the resulting plot:
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
data = np.random.uniform(-1,1, size=(800,1))
data += np.random.uniform(-1,1, size=(800,1))
gmm = GaussianMixture(n_components=1)
gmm.fit(data)
print(gmm.means_, gmm.covariances_)
mu = gmm.means_[0][0]
variance = gmm.covariances_[0][0][0]
print(mu, variance)
fig, ax = plt.subplots(figsize=(6,6))
Xs = np.arange(data.min(), data.max(), 0.05)
ys = 1.0/np.sqrt(2*np.pi*variance) * np.exp(-0.5/variance * (Xs - mu)**2)
ax.hist(data, bins=100, label='data')
px = ax.twinx()
px.plot(Xs, ys, c='r', linestyle='dotted', label='fit')
ax.legend()
px.legend(loc='upper left')
plt.show()
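Note that the twinx trick puts the fitted curve on a different y-scale from the raw counts. If you'd rather have both on one axis, a density histogram is an alternative (a sketch, reusing data, Xs and ys from above):

fig, ax = plt.subplots(figsize=(6, 6))
# density=True puts the histogram on the same scale as the pdf.
ax.hist(data[:, 0], bins=100, density=True, label='data')
ax.plot(Xs, ys, c='r', linestyle='dotted', label='fit')
ax.legend()
plt.show()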
As for question 5, I'm not sure along which axis you'd like the standard deviations. To get the standard deviation of each column, use np.std(data, axis=0); use axis=1 for row-by-row standard deviations.
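A quick illustration of the axis semantics (a minimal example, not from the original answer):

arr = np.arange(6).reshape(2, 3)  # 2 rows, 3 columns
print(np.std(arr, axis=0))        # 3 values: one std per column
print(np.std(arr, axis=1))        # 2 values: one std per row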
I am trying to plot a normalized histogram using the example from the numpy.random.normal documentation. For this purpose I generate a normally distributed random sample.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

mu_true = 0
sigma_true = 0.1
s = np.random.normal(mu_true, sigma_true, 2000)
Then I fit a normal distribution to the data and calculate the pdf.
mu, sigma = stats.norm.fit(s)
points = np.linspace(stats.norm.ppf(0.01, loc=mu, scale=sigma),
                     stats.norm.ppf(0.9999, loc=mu, scale=sigma), 100)
pdf = stats.norm.pdf(points, loc=mu, scale=sigma)
Display the fitted pdf and the data histogram:
plt.hist(s, 30, density=True);
plt.plot(points, pdf, color='r')
plt.show()
I use density=True, but it is obvious that the pdf and histogram are not normalized.
What can one suggest for plotting a truly normalized histogram and pdf?
Seaborn distplot also doesn't solve the problem.
import seaborn as sns
ax = sns.distplot(s)
What makes you think it is not normalised? At a guess, it's probably because the heights of some columns exceed 1. However, this reasoning is flawed: in a normalised histogram/pdf, the total area under the curve should sum to one, not the heights. When you are dealing with small steps in x (as you are), smaller than one, it is not surprising that the column heights are greater than one!
You can see this clearly in the scipy example you link: the x-values are much greater (by an order of magnitude), so it follows that their y-values are correspondingly smaller. You will see the same effect if you change your distribution to cover a wider range of values. Try a sigma of 10 instead of 0.1 and see what happens!
import numpy as np
from numpy.random import seed, randn
from scipy.stats import norm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
"Try this, for 𝜇 = 0"
seed(0)
points = np.linspace(-5,5,100)
pdf = norm.pdf(points,0,1)
plt.plot(points, pdf, color='r')
plt.hist(randn(50), density=True);
plt.show()
"or this, for 𝜇 = 10"
seed(0)
points = np.linspace(5,15,100)
pdf = norm.pdf(points,10,1)
plt.plot(points, pdf, color='r')
plt.hist(10+randn(50), density=True);
plt.show()
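If you want to convince yourself that density=True really does normalize on area, check that the bin areas (heights times widths) sum to one (a quick check, reusing s from the question):

counts, edges = np.histogram(s, bins=30, density=True)
print((counts * np.diff(edges)).sum())  # ~1.0: the areas sum to one, not the heights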
I have two sets of data, one containing around 11 million data points and the other around 5000. I would like to plot them both in one histogram, but because of the difference in size I need to normalise the frequency so I can plot them on the same figure. Below I have simulated what I have done with my data. I have used density=True.
from numpy.random import randn
import matplotlib.pyplot as plt
import random

datalist1 = []
for x in range(1, 50000):
    datalist1.append(random.uniform(1, 2))
datalist2 = randn(5000000)

fig = plt.figure(1)
plt.hist(datalist1, bins=20, color='b', alpha=0.3, label='theoretical',
         histtype='stepfilled', density=True)
plt.hist(datalist2, bins=20, alpha=0.5, color='g', label='experimental',
         histtype='stepfilled', density=True)
plt.xlabel("Value")
plt.ylabel("Normalised Frequency")
plt.legend()
plt.show()
Can you please tell me if this is a good way to get around this issue? I would like the tallest height of the two histograms to be 1 (or 100%).
The density=True setting normalizes the histogram to an area of 1. That gives the histogram an interpretation as an estimate of a probability density function.
In short, it actually makes sense not to normalize on the peak but on the area.
But if you really want to normalize by height you can modify the polygon data of the histogram:
h = plt.hist(datalist1, bins=20, color='b', alpha=0.3, label='theoretical',
             histtype='stepfilled', density=True)
p = h[2][0]
p.xy[:, 1] /= p.xy[:, 1].max()
h = plt.hist(datalist2, bins=20, alpha=0.5, color='g', label='experimental',
             histtype='stepfilled', density=True)
p = h[2][0]
p.xy[:, 1] /= p.xy[:, 1].max()
This solution feels a bit hackish, but at least it's quick and dirty :)
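If you prefer something less hackish, you can also compute the histograms yourself and scale the counts before plotting; a sketch using plt.stairs (matplotlib 3.4+):

import numpy as np
for values, color, label in [(datalist1, 'b', 'theoretical'),
                             (datalist2, 'g', 'experimental')]:
    counts, edges = np.histogram(values, bins=20)
    # Scale so the tallest bin is exactly 1 (peak-normalized, not area-normalized).
    plt.stairs(counts / counts.max(), edges, fill=True, alpha=0.4,
               color=color, label=label)
plt.legend()
plt.show()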
I'm plotting some data from various tests. Sometimes in a test I happen to have one outlier (say 0.1), while all other values are three orders of magnitude smaller.
With matplotlib, I plot against the range [0, max_data_value]
How can I just zoom into my data and not display outliers, which would mess up the x-axis in my plot?
Should I simply take the 95th percentile and use the range [0, 95th percentile] on the x-axis?
There's no single "best" test for an outlier. Ideally, you should incorporate a priori information (e.g. "this parameter shouldn't be over x because of blah...").
Most tests for outliers use the median absolute deviation, rather than the 95th percentile or some other variance-based measurement. Otherwise, the variance/stddev that is calculated will be heavily skewed by the outliers.
Here's a function that implements one of the more common outlier tests.
def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False
    otherwise.

    Parameters:
    -----------
    points : A numobservations by numdimensions array of observations
    thresh : The modified z-score to use as a threshold. Observations with
        a modified z-score (based on the median absolute deviation) greater
        than this value will be classified as outliers.

    Returns:
    --------
    mask : A numobservations-length boolean array.

    References:
    ----------
    Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
    Handle Outliers", The ASQC Basic References in Quality Control:
    Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
    """
    if len(points.shape) == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)
    modified_z_score = 0.6745 * diff / med_abs_deviation
    return modified_z_score > thresh
As an example of using it, you'd do something like the following:
import numpy as np
import matplotlib.pyplot as plt
# The function above... In my case it's in a local utilities module
from sci_utilities import is_outlier
# Generate some data
x = np.random.random(100)
# Append a few "bad" points
x = np.r_[x, -3, -10, 100]
# Keep only the "good" points
# "~" operates as a logical not operator on boolean numpy arrays
filtered = x[~is_outlier(x)]
# Plot the results
fig, (ax1, ax2) = plt.subplots(nrows=2)
ax1.hist(x)
ax1.set_title('Original')
ax2.hist(filtered)
ax2.set_title('Without Outliers')
plt.show()
If you aren't fussed about rejecting outliers, as mentioned by Joe, and your reasons for doing this are purely aesthetic, you could just set your plot's x-axis limits:
plt.xlim(min_x_data_value, max_x_data_value)
Where the values are your desired limits to display.
plt.ylim(min, max) works to set limits on the y-axis as well.
I think using pandas quantile is useful and much more flexible.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
pd_series = pd.Series(np.random.normal(size=300))
pd_series_adjusted = pd_series[pd_series.between(pd_series.quantile(.05), pd_series.quantile(.95))]
ax1.boxplot(pd_series)
ax1.set_title('Original')
ax2.boxplot(pd_series_adjusted)
ax2.set_title('Adjusted')
plt.show()
I usually pass the data through np.clip. If you have a reasonable estimate of the maximum and minimum values of your data, just use those. If you don't, the histogram of the clipped data will show you the size of the tails; if the outliers really are just outliers, the tails should be small.
What I run is something like this:
import numpy as np
import matplotlib.pyplot as plt
data = np.random.normal(3, size=100000)
plt.hist(np.clip(data, -15, 5), bins=333, density=True)
You can compare the results if you change the min and max in the clipping function until you find the right values for your data.
In this example, you can see immediately that a max value of 5 is not good, because you are removing a lot of meaningful information: the clipped points pile up in a visible spike at the right edge. The min value of -15 should be fine, since that tail is not even visible.
You could probably write some code that, based on this, finds good bounds minimizing the size of the tails according to some tolerance.
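A rough sketch of that idea (a hypothetical helper, not from the original answer): pick the clip bounds from quantiles, so that at most a given fraction of points lies beyond each side:

def clip_bounds(values, tol=1e-3):
    # Keep all but a `tol` fraction of points on each side.
    return np.quantile(values, [tol, 1 - tol])

lo, hi = clip_bounds(data)
plt.hist(np.clip(data, lo, hi), bins=333, density=True)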
In some cases (e.g. in histogram plots such as the one in Joe Kington's answer) rescaling the plot could show that the outliers exist but that they have been partially cropped out by the zoom scale. Removing the outliers would not have the same effect as just rescaling. Automatically finding appropriate axes limits seems generally more desirable and easier than detecting and removing outliers.
Here's an autoscale idea using percentiles and data-dependent margins to achieve a nice view.
# xdata = some x data points ...
# ydata = some y data points ...
# Finding limits for y-axis
ypbot = np.percentile(ydata, 1)
yptop = np.percentile(ydata, 99)
ypad = 0.2*(yptop - ypbot)
ymin = ypbot - ypad
ymax = yptop + ypad
Example usage:
fig = plt.figure(figsize=(6, 8))
ax1 = fig.add_subplot(211)
ax1.scatter(xdata, ydata, s=1, c='blue')
ax1.set_title('Original')
ax1.axhline(y=0, color='black')
ax2 = fig.add_subplot(212)
ax2.scatter(xdata, ydata, s=1, c='blue')
ax2.axhline(y=0, color='black')
ax2.set_title('Autoscaled')
ax2.set_ylim([ymin, ymax])
plt.show()