How to normalize a non-normal distribution? - python

I have the above distribution with a mean of -0.02, a standard deviation of 0.09, and a sample size of 13905.
I am just not sure why the distribution is left-skewed given the large sample size. In the bin [-2.0, -0.5] there are only about 10 samples (outliers), which explains the shape.
I am wondering whether it is possible to normalize the data so that it becomes a smoother, more 'normal' distribution. The purpose is to feed it into a model while reducing the standard error of the predictor.

You have two options here. You can either Box-Cox transform or Yeo-Johnson transform. The issue with the Box-Cox transform is that it applies only to positive numbers. To use it on data like yours, you'd have to exponentiate the data first, perform the Box-Cox transform, and then take the log to get back to the original scale. The Box-Cox transform is available in scipy.stats.
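A minimal sketch of the exponentiate-then-Box-Cox step (the toy array below is mine, just to show that zero and negative values become usable; the final log step from the paragraph above is omitted):
import numpy as np
from scipy.stats import boxcox

data = np.array([-0.357, -0.286, 0.0, 0.014, 0.071])  # toy values including zeros and negatives
positive = np.exp(data)                                # exponentiate so every value is strictly positive
transformed, lmbda = boxcox(positive)                  # Box-Cox transform; scipy also estimates lambda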
You can avoid those steps and simply use Yeo-Johnson transform. sklearn provides an API for that
from matplotlib import pyplot as plt
from scipy.stats import normaltest
import numpy as np
from sklearn.preprocessing import PowerTransformer
data=np.array([-0.35714286,-0.28571429,-0.00257143,-0.00271429,-0.00142857,0.,0.,0.,0.00142857,0.00285714,0.00714286,0.00714286,0.01,0.01428571,0.01428571,0.01428571,0.01428571,0.01428571,0.01428571,0.02142857,0.07142857])
pt = PowerTransformer(method='yeo-johnson')
data = data.reshape(-1, 1)
pt.fit(data)
transformed_data = pt.transform(data)
We have transformed our data but we need a way to measure and see if we have moved in the right direction. Since our goal was to move towards being a normal distribution, we will use a normality test.
k2, p = normaltest(data)
transformed_k2, transformed_p = normaltest(transformed_data)
The test returns two values k2 and p. The value of p is of our interest here.
if p is greater than some threshold (ex 0.001 or so), we can say reject the hypothesis that data comes from a normal distribution.
In the example above, you'll see that p is greater than 0.001 while transformed_p is less than this threshold indicating that we are moving in the right direction.

I agree with the top answer, except the last 2 paragraphs, because the interpretation of normaltest's output is flipped. These paragraphs should instead read:
"The test returns two values k2 and p. The value of p is of our interest here.
if p is less than some threshold (ex 0.001 or so), we can reject the null hypothesis that the data comes from a normal distribution.
In the example above, you'll see that p is less than 0.001 while transformed_p is greater than this threshold, indicating that we are moving in the right direction."
Source: normaltest documentation.

Related

scipy distributions provide zero for values smaller than loc (mean)

For the data yielding the below histogram, I used gamma.fit(data) function. It yields (0.2856629839547822, 0.001612367540316285, 1.3126526078419007) which must be the alpha, loc, and scale parameters of the distribution. However, the mean and standard deviations are m=0.04181341484525036 and s=0.02581912984507876 for the given dataset. The PDF is zero below the mean (m) value. I couldn't find any questions about this problem. What am I missing?
Histogram of the data
The PDF is most definitely not zero for all values below your calculated mean. In fact, over 40% of the PDF's area lies at x <= m:
from scipy import stats
g = stats.gamma(0.2856629839547822, 0.001612367540316285, 1.3126526078419007)
m=0.04181341484525036
print(g.cdf(m))
0.4078
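As a further illustrative check (my addition, not part of the original answer), the density itself is clearly positive below m:
print(g.pdf(0.02))  # a point below the sample mean; the density here is positive (in fact quite large, since alpha < 1)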

Seaborn KDEPlot - not enough variation in data?

I have a data frame containing ~900 rows; I'm trying to plot KDEplots for some of the columns. In some columns, a majority of the values are the same minimum value. When I include too many of the minimum values, the KDEPlot abruptly stops showing the minimums. For example, the following includes 600 values, of which 450 are the minimum, and the plot looks fine:
y = df.sort_values(by='col1', ascending=False)['col1'].values[:600]
sb.kdeplot(y)
But including 451 of the minimum values gives a very different output:
y = df.sort_values(by='col1', ascending=False)['col1'].values[:601]
sb.kdeplot(y)
Eventually I would like to plot bivariate KDEPlots of different columns against each other, but I'd like to understand this first.
The problem is the default algorithm that is chosen for the "bandwidth" of the kde. The default method is 'scott', which isn't very helpful when there are many equal values.
The bandwidth is the width of the gaussians that are positioned at every sample point and summed up. Lower bandwidths are closer to the data, higher bandwidths smooth everything out. The sweet spot is somewhere in the middle. In this case bw=0.3 could be a good option. In order to compare different KDEs, it is recommended to choose exactly the same bandwidth each time.
Here is some sample code to show the difference between bw='scott' and bw=0.3. The example data are 150 values from a standard normal distribution together with either 400, 450 or 500 fixed values.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns; sns.set()
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(10,5), gridspec_kw={'hspace':0.3})
for i, bw in enumerate(['scott', 0.3]):
    for j, num_same in enumerate([400, 450, 500]):
        y = np.concatenate([np.random.normal(0, 1, 150), np.repeat(-3, num_same)])
        sns.kdeplot(y, bw=bw, ax=axs[i, j])
        axs[i, j].set_title(f'bw:{bw}; fixed values:{num_same}')
plt.show()
The third plot gives a warning that the kde can not be drawn using Scott's suggested bandwidth.
PS: As mentioned by @mwascom in the comments, in this case statsmodels.nonparametric.kde is used (not scipy.stats.gaussian_kde). There the default is "scott": 1.059 * A * nobs ** (-1/5.), where A is min(std(X), IQR/1.34). IQR is the interquartile range, the difference between the 75th and 25th percentiles; once more than half of the values are identical, the IQR collapses to zero, and that min() explains the abrupt change in behavior.
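A rough numeric illustration of that collapse (hand-computing Scott's factor here, not calling the library's internal code):
import numpy as np

y = np.concatenate([np.random.normal(0, 1, 150), np.repeat(-3, 500)])
iqr = np.subtract(*np.percentile(y, [75, 25]))  # 0 here, since most values equal -3
A = min(np.std(y), iqr / 1.34)
bw = 1.059 * A * len(y) ** (-1 / 5.)
print(iqr, A, bw)  # the bandwidth degenerates to 0, so the kde cannot be drawn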
Edit: Since Seaborn 0.11, the statsmodel backend has been dropped, so kde's are only calculated via scipy.stats.gaussian_kde.
If the sample has repeated values, this implies that the underlying distribution is not continuous. In the data that you show to illustrate the issue, we can see a Dirac distribution on the left. Kernel smoothing can be applied to such data, but with care. Indeed, to approximate such data, we might use a kernel smoothing where the bandwidth associated with the Dirac is zero. However, in most KDE methods, there is only one single bandwidth for all kernel atoms. Moreover, the various rules used to compute the bandwidth are based on some estimation of the rugosity of the second derivative of the PDF of the distribution. This cannot be applied to a discontinuous distribution.
We can, however, try to separate the sample into two sub-samples:
the sub-sample(s) with replications,
the sub-sample with unique realizations.
(This idea has already been mentioned by johanc.)
Below is an attempt to perform this classification. The np.unique method is used to count the occurrences of the replicated realizations. The replicated values are associated with Diracs, and the weight in the mixture is estimated from the fraction of these replicated values in the sample. The remaining realizations, the unique ones, are then used to estimate the continuous distribution with KDE.
The following function will be useful in order to overcome a limitation with the current implementation of the draw method of Mixtures with OpenTURNS.
def DrawMixtureWithDiracs(distribution):
    """Draw a distribution which has Diracs.
    https://github.com/openturns/openturns/issues/1489"""
    graph = distribution.drawPDF()
    graph.setLegends(["Mixture"])
    for atom in distribution.getDistributionCollection():
        if atom.getName() == "Dirac":
            curve = atom.drawPDF()
            curve.setLegends(["Dirac"])
            graph.add(curve)
    return graph
The following script creates a use-case with a Mixture containing a Dirac and a Gaussian distribution.
import openturns as ot
import numpy as np
distribution = ot.Mixture([ot.Dirac(-3.0), ot.Normal()], [0.5, 0.5])
DrawMixtureWithDiracs(distribution)
This is the result.
Then we create a sample.
sample = distribution.getSample(100)
This is where your problem begins. We count the number of occurrences of each realization.
array = np.array(sample)
unique, index, count = np.unique(array, axis=0, return_index=True, return_counts=True)
For all realizations, replicated values are associated with Diracs and unique values are put in a separate list.
sampleSize = sample.getSize()
listOfDiracs = []
listOfWeights = []
uniqueValues = []
for i in range(len(unique)):
    if count[i] == 1:
        uniqueValues.append(unique[i][0])
    else:
        atom = ot.Dirac(unique[i])
        listOfDiracs.append(atom)
        w = count[i] / sampleSize
        print("New Dirac =", unique[i], " with weight =", w)
        listOfWeights.append(w)
The weight of the continuous atom is the complement of the sum of the weights of the Diracs. This way, the sum of all weights is equal to 1.
complementaryWeight = 1.0 - sum(listOfWeights)
weights = list(listOfWeights)
weights.append(complementaryWeight)
The easy part comes: the unique realizations can be used to fit a kernel smoothing. The KDE is then added to the list of atoms.
sampleUniques = ot.Sample(uniqueValues, 1)
factory = ot.KernelSmoothing()
kde = factory.build(sampleUniques)
atoms = list(listOfDiracs)
atoms.append(kde)
Et voilà: the Mixture is ready.
mixture_estimated = ot.Mixture(atoms, weights)
The following script compares the initial Mixture and the estimated one.
graph = DrawMixtureWithDiracs(distribution)
graph.setColors(["dodgerblue3", "dodgerblue3"])
curve = DrawMixtureWithDiracs(mixture_estimated)
curve.setColors(["darkorange1", "darkorange1"])
curve.setLegends(["Est. Mixture", "Est. Dirac"])
graph.add(curve)
graph
The figure seems satisfactory, since the continuous distribution is estimated from a sub-sample whose size is only 50, i.e. one half of the full sample.

How to tell if I am correctly fitting a normal distribution to data using Scipy?

I have a data set with about 10 continuous features and 1000 binary (categorical) features. After scaling and normalizing the data so that every feature has a mean of 0.0, I perform PCA on the data to get a reduced matrix z, keeping about 90% of the variance (keeping ~700 principal components).
If I now want to fit a normal distribution to the data, I have the code below:
import numpy as np
from scipy.stats import multivariate_normal
mean = np.mean(z, axis=0)
cov = np.cov(z, rowvar=0)
g = multivariate_normal(mean=mean, cov=cov)
# put a random sample of z through the pdf
# to check the probability of the sample occurring
print(g.pdf(z[np.random.randint(z.shape[0]),:]))
>>> 0.0
The problem is, no matter how many times I run print(g.pdf(z[np.random.randint(z.shape[0]),:])), I get an output of 0.0. I appreciate that some samples in z will lie further from the mean than average, giving an answer close to 0.0. But I would have thought that at least some samples in z would be closer to the mean and therefore give me a much larger answer when I put them through the pdf.
This may be something to do with my original data, or the reduced dataset z and how these are distributed. But I have performed multiple checks on both the original data set and z to ensure there are no nan values, ensuring the mean of each column of z is in fact 0.0 etc.
My results suggest that I have a Gaussian with extremely thin tails (a very narrow Gaussian), so that everything is far from the peak. I don't think this should be the case.
Am I using multivariate_normal correctly? Are there any other checks I could perform on the data or otherwise? I know I am making a big assumption that the data is normally distributed, but surely not all values in z should yield a pdf value of 0.0.
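As a side note (my own addition, not from the original post), one quick extra check is to look at the log-density instead, since in ~700 dimensions the pdf value of even a typical sample can easily fall below what a float can represent:
# same z and g as above; logpdf avoids the underflow that makes pdf print as 0.0
print(g.logpdf(z[np.random.randint(z.shape[0]), :]))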

How can I find the break frequencies/3dB points from a bandpass filter frequency sweep data in python?

The data that I have is stored in a 2D list where one column represents a frequency and the other column is its corresponding dB. I would like to programmatically identify the frequency of the 3 dB points on either end of the passband. I have a few ideas on how to do this, but they all have drawbacks.
Ideas:
1. Find the maximum point, then the average of points in the passband, then find points about 3 dB lower.
2. Use the sympy library to perform numerical differentiation and identify the critical points/inflection points.
3. Use a histogram/bin function to find the amplitude of the passband.
Drawbacks:
1. Sensitive to spikes, and I'm not quite sure how to do this.
2. I don't understand the math involved, and the data is noisy, which could lead to a lot of false positives.
3. Correlating the amplitude values with list index values could be tricky.
Can you think of better ideas and/or ways to implement what I have described?
Assuming that you've loaded multiple readings of the PSD from the signal analyzer, try averaging them before attempting to find the bandedges. If the signal isn't changing too dramatically, the averaging process might smooth away any peaks and valleys and noise within the passband, making it easier to find the edges. This is what many spectrum analyzers can do to make for a smoother PSD.
In case that wasn't clear, assume that each reading gives you 128 tuples of frequency and power, and that you capture 100 of these buffers of data. Now average the 100 samples from bin 0, then from bins 1, 2, ..., 127. Now try to locate the bandpass on this averaged data. It should be easier than on any single buffer. Note I used 100 as an example. If your data is very noisy, it may require more. If there isn't much noise, fewer.
Be careful when doing the averaging. Your data is in dB. To add the samples together in order to find an average, you must first convert the dB data back to linear power, do the adds, do the divide to find the average, and then convert the averaged power back into dB.
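A short sketch of that dB-aware averaging (the array name and the random placeholder data are mine, just to make it runnable):
import numpy as np

readings_db = np.random.normal(-60, 3, size=(100, 128))  # placeholder: 100 buffers of 128 bins, in dB
linear = 10 ** (readings_db / 10)                         # dB -> linear power
avg_linear = linear.mean(axis=0)                          # average each frequency bin across buffers
avg_db = 10 * np.log10(avg_linear)                        # back to dB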
Ok it seems this has to be solved by data analysis. I would propose these steps:
Preprocess your data if you suspect it to be too noisy. I'd suggest either a moving-average filter (sp.convolve(data, sp.ones(n)/n, "same")) or, better, a Savitzky-Golay filter (sp.signal.savgol_filter(data, n, polyorder=3)), because you will be interested in extrema of the data, which would be unnecessarily distorted by the moving-average filter. You might also want to get rid of artifacts like 60 Hz noise at this stage.
If the signal you are interested in lives in a narrow band, the spectrum will be a single pronounced peak. In that case you could just fit a curve to your data; a Gaussian would be appropriate here.
import scipy as sp
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
freq, pow = read_in_your_data_here()
freq, pow = sp.asarray(freq), sp.asarray(pow)
def gauss(x, a, mu, sig):
    return a * sp.exp(-(x - mu)**2 / (2. * sig**2))
(a, mu, sig), _ = curve_fit(gauss, freq, pow)
fitted_curve = gauss(freq, a, mu, sig)
plt.plot(freq, pow)
plt.plot(freq, fitted_curve)
plt.vlines(mu, min(pow) - 2, max(pow) + 2)
plt.show()
# power at the fitted center frequency and the corresponding -3 dB level
center_idx = sp.absolute(freq - mu).argmin()
pow_center = pow[center_idx]
pow_3db = pow_center - 3.
def interv_from_binvec(data):
    # first/last indices where the boolean vector switches, i.e. the band edges
    indicator = sp.convolve(data, [-1, 1], "same")
    return indicator.argmin(), indicator.argmax()
passband_idx = interv_from_binvec(pow > pow_3db)
passband = freq[passband_idx[0]], freq[passband_idx[1]]
This is more an example than a solution, and relies heavily on the assumption that you are searching for (and finding) a high-SNR signal in a narrow band. It could be extended to handle more than one signal by use of a mixture model.
You can use scipy's UnivariateSpline and leastsq methods:
Create a spline of y-(np.max(y)-3)
Find the roots of it.
Calculate the difference between the two roots.
from scipy.interpolate import UnivariateSpline
from scipy.optimize import leastsq
import numpy as np
x = df["Wavelength / nm"]
y = df["Power / dBm"]
#create spline
spline = UnivariateSpline(x, y-(np.max(y)-3), s=0)
# find the roots
r1, r2 = spline.roots()
# calculate the difference
threedB_bandwidth = abs(r2-r1)

How to perform a chi-squared goodness of fit test using scientific libraries in Python?

Let's assume I have some data I obtained empirically:
import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
It is exponentially distributed (with some noise) and I want to verify this using a chi-squared goodness of fit (GoF) test. What is the simplest way of doing this using the standard scientific libraries in Python (e.g. scipy or statsmodels) with the least amount of manual steps and assumptions?
I can fit a model with:
param = stats.expon.fit(x)
plt.hist(x, density=True, color='white', hatch='/')
grid = np.linspace(0, 100, 10000)
plt.plot(grid, stats.expon.pdf(grid, *param))
The Kolmogorov-Smirnov test, for example, is very elegant to calculate:
>>> stats.kstest(x, lambda x : stats.expon.cdf(x, *param))
(0.0061000000000000004, 0.85077099515985011)
However, I can't find a good way of calculating the chi-squared test.
There is a chi-squared GoF function in statsmodels, but it assumes a discrete distribution (and the exponential distribution is continuous).
The official scipy.stats tutorial only covers a case for a custom distribution and probabilities are built by fiddling with many expressions (npoints, npointsh, nbound, normbound), so it's not quite clear to me how to do it for other distributions. The chisquare examples assume the expected values and DoF are already obtained.
Also, I am not looking for a way to "manually" perform the test as was already discussed here, but would like to know how to apply one of the available library functions.
An approximate solution for equal probability bins:
Estimate the parameters of the distribution.
Use the inverse cdf, ppf if it's a scipy.stats.distribution, to get the bin edges for a regular probability grid, e.g. distribution.ppf(np.linspace(0, 1, n_bins + 1), *args).
Then use np.histogram to count the number of observations in each bin.
Finally, use the chisquare test on the frequencies.
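A minimal sketch of that recipe (the sample, the number of bins, and the variable names are my own choices for illustration):
import numpy as np
from scipy import stats

x = 10 * stats.expon.rvs(size=10000)                              # sample to test
params = stats.expon.fit(x)                                       # 1. estimate the parameters
n_bins = 20
edges = stats.expon.ppf(np.linspace(0, 1, n_bins + 1), *params)   # 2. equal-probability bin edges
observed, _ = np.histogram(x, bins=edges)                         # 3. observed counts per bin
expected = np.full(n_bins, len(x) / n_bins)                       # equal probability -> equal expected counts
# 4. chi-square test; ddof accounts for the estimated parameters
statistic, p_value = stats.chisquare(observed, expected, ddof=len(params))
print(statistic, p_value)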
An alternative would be to find the bin edges from the percentiles of the sorted data, and use the cdf to find the actual probabilities.
This is only approximate, since the theory for the chisquare test assumes that the parameters are estimated by maximum likelihood on the binned data. And I'm not sure whether the selection of binedges based on the data affects the asymptotic distribution.
I haven't looked into this in a long time.
If an approximate solution is not good enough, then I would recommend that you ask the question on stats.stackexchange.
Why do you need to "verify" that it's exponential? Are you sure you need a statistical test? I can pretty much guarantee that it isn't ultimately exponential, and the test would be significant if you had enough data, making the logic of using the test rather forced. It may help you to read this CV thread: Is normality testing 'essentially useless'?, or my answer here: Testing for heteroscedasticity with many observations.
It is typically better to use a qq-plot and/or pp-plot (depending on whether you are concerned about the fit in the tails or middle of the distribution, see my answer here: PP-plots vs. QQ-plots). Information on how to make qq-plots in Python SciPy can be found in this SO thread: Quantile-Quantile plot using SciPy
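For instance, a quick qq-plot of the sample against the fitted exponential could look like this (a sketch using scipy.stats.probplot; the linked thread covers this in more detail):
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = 10 * stats.expon.rvs(size=10000)
params = stats.expon.fit(x)
stats.probplot(x, dist=stats.expon, sparams=params, plot=plt)  # qq-plot against the fitted exponential
plt.show()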
I tried your problem with OpenTURNS.
The beginning is the same:
import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
If you suspect that your sample x is coming from an Exponential distribution, you can use ot.ExponentialFactory() to fit the parameters:
import openturns as ot
sample = ot.Sample([[p] for p in x])
distribution = ot.ExponentialFactory().build(sample)
As the factory needs an ot.Sample as input, I needed to format x and reshape it as 10,000 points of dimension 1.
Let's now assess this fit using the ChiSquared test:
result = ot.FittingTest.ChiSquared(sample, distribution, 0.01)
print('Exponential?', result.getBinaryQualityMeasure(), ', P-value=', result.getPValue())
>>> Exponential? True , P-value= 0.9275212544642293
Very good!
And of course, print(distribution) will give you the fitted parameters:
>>> Exponential(lambda = 0.0982391, gamma = 0.0274607)
