np.random.choice not producing expected histogram - python

I'm looking to generate normally distributed random numbers between 0 and 1, but as the mean moves closer to 1 or 0, the right or left side respectively becomes "squished".
After modifying the normal distribution and playing around with sliders in geogebra, I came up with the following:
Next I needed to create a method in python which would generate random samples that would be distributed according to this PDF.
Originally I thought the only way to do this was to try and derive a new equation for generating random numbers as seen in the Box-Muller proof (which I got by following along with this tutorial).
However, I thought there might be an easier way to do this by using the numpy library's np.random.choice() method.
After all, I should be able to integrate the PDF at a very small step size and get the various probabilities for said steps (approximately of course).
So with that I wrote the following script:
# Standard libs
import math
# Third party libs
import numpy as np
from alive_progress import alive_bar
from matplotlib import pyplot as plt
class RandomNumberGenerator:
    def __init__(self):
        pass

    def clamped_normal_distribution(self, mu: float,
                                    stddev: float, x: float):
        """ Computes a value from the clamped normal distribution """
        divideByZeroAvoider = 1e-5
        if x < 0 or x > 1:
            return 0
        elif x >= 0 and x <= mu:
            return math.exp(-0.5*( (x - mu) / (stddev) )**2
                            * (1/(x**2 + divideByZeroAvoider)))
        elif x <= 1 and x > mu:
            return math.exp(-0.5*( (x - mu) / (stddev) )**2
                            * (1/((1-x)**2 + divideByZeroAvoider)))
        else:
            print("This shouldn't happen!: {}".format(x))
            return 0
if __name__ == '__main__':
    rng = RandomNumberGenerator()

    mu = 0.7
    stddev = 1
    stepSize = 1e-3
    x = np.linspace(stepSize, 1, int(1/stepSize) - 1)

    # Determine the total area under the curve
    samples = []
    print("Generating samples...")
    with alive_bar(len(x.tolist())) as bar:
        for i in x:
            samples.append(rng.clamped_normal_distribution(
                mu, stddev, i))
            bar()
    area = np.trapz(samples, dx=stepSize)
    print("Area = {}".format(area))

    # Determine the probability of x falling in a specific interval
    probabilities = []
    print("Generating probabilities...")
    with alive_bar(len(x.tolist())) as bar:
        for i in x:
            lead = rng.clamped_normal_distribution(mu,
                                                   stddev, i)
            lag = rng.clamped_normal_distribution(mu,
                                                  stddev, i - stepSize)
            probability = np.trapz(
                np.array([lag, lead]),
                dx=stepSize)
            # Divide by the area because this isn't a standard normal
            probabilities.append(probability / area)
            bar()

    # Should be approximately 1
    print("Probability: {}".format(sum(probabilities)))
    plt.plot(x, probabilities)
    plt.show()

    y = []
    print("Performing distribution test...")
    testSize = int(10e3)
    with alive_bar(testSize) as bar:
        for _ in range(testSize):
            randSamp = np.random.choice(samples, p=probabilities)
            y.append(randSamp)
            bar()

    plt.hist(y, 300)
    plt.show()
The first plot of the probabilities against the linearly spaced samples looks promising, giving me the following graph:
However, if we use these samples as choices with given probabilities, we get the following histogram:
I have no idea why this isn't working correctly.
I've tried other (smaller) examples like the ones listed on the numpy website, and they produce histograms according to the given probabilities array.
I'd really appreciate some advice/intuition if at all possible :).

It looks like there is a problem with the first argument in the call np.random.choice(samples, p=probabilities). The first argument should be x, not samples.
ADDITION BY AUTHOR:
The reason for this is the samples are the values of the curve (i.e. the y-axis and NOT the x-axis).
Thus the values with the highest probabilities (i.e. the samples around the mean) all have a value of ~1, which is why we see such a massive spike around the value 1.
Changing this to x gives us the following graphs (for 10e3 samples):
Working as expected, very nice.
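For reference, the corrected sampling step, reusing x and probabilities from the script above:

y = [np.random.choice(x, p=probabilities) for _ in range(int(10e3))]
plt.hist(y, 300)
plt.show()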

Related

Implementation of the Metropolis-Hastings algorithm for solving Gaussian integrals

I am currently having issue with the implementation of the Metropolis-Hastings algorithm.
I am trying to use the algorithm to calculate integrals of the form int x^2 exp(-x^2) dx over the real line.
Using this algorithm, we can obtain a long chain of configurations (in this case, each configuration is just a single number) such that in the tail end of the chain the probability of having a particular configuration follows (or rather tends to) a Gaussian distribution.
My code seems to be failing to produce the said Gaussian distributions. There is a strange dependence on the transition probability (the probability of picking a new candidate configuration depending on the previous configuration in the chain). However, if this transition probability is symmetric, there should be no dependence on this function at all (it only affects the speed at which phase space [the space of potential configurations] is explored and how quickly the chain converges to the desired distribution)!
In my case I am using a normal distribution transition function (which satisfies the need to be symmetric), with width d.
For each d I use, I do indeed get a Gaussian distribution; however, the standard deviation, sigma, depends on my choice of d.
The resulting gaussian should have a sigma of roughly 0.701 but I find that the value I actually get depends on the parameter d, when it shouldn't.
I am not sure where the error in this code is, any help would be greatly appreciated!
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
'''
We want to get an exponential decay integral approx using importance sampling.
We will try to integrate x^2exp(-x^2) over the real line.
Metropolis-Hastings alg will generate configurations (in this case, single numbers) such that
the probability of a given configuration x^a ~ p(x^a) for p(x) propto exp(-x^2).
Once configs = {x^a} are generated, the approximation, Q_N, of the integral, I, will be given by
Q_N = 1/N sum_(configs) x^2
lim (N-> inf) Q_N -> I
'''
'''
Implementing the Metropolis-Hastings algorithm
'''
#Setting up the initial config for our chain, generating first 2 to generate numpy array
x_0 = np.random.uniform(-20,-10,2)
#Defining function that generates the next N steps in the chain, given a starting config x
#Works by iteratively taking the last element in the chain, generating a new candidate configuration from it and accepting/rejecting according to the algorithm
#Success and failures implemented to see roughly the success rate of each step
def next_steps(x,N):
    i = 0
    Success = 0
    Failures = 0
    Data = np.array(x)
    d = 1.5 #Spread of (normal) transition function
    while i < N:
        r = np.random.uniform(0,1)
        delta = np.random.normal(0,d)
        x_old = Data[-1]
        x_new = x_old + delta
        hasting_ratio = np.exp(-(x_new**2-x_old**2) )
        if hasting_ratio > r:
            i = i+1
            Data = np.append(Data,x_new)
            Success = Success +1
        else:
            Failures = Failures + 1
    print(Success)
    print(Failures)
    return Data
#Number of steps in the chain
N_iteration = 50000
#Generating the data
Data = next_steps(x_0,N_iteration)
#Plotting data to see convergence of chain to gaussian distribution
plt.plot(Data)
plt.show()
#Obtaining tail end data and obtaining the standard deviation of resulting gaussian distribution
Data = Data[-40000:]
(mu, sigma) = norm.fit(Data)
print(sigma)
#Plotting a histogram to visually see if gaussian
plt.hist(Data, bins = 300)
plt.show()
You need to save x even when it doesn't change. Otherwise the center values are under-counted, and more so as d increases, which increases the variance.
import numpy as np
from scipy.stats import norm
"""
We want to get an exponential decay integral approx using importance sampling.
We will try to integrate x^2exp(-x^2) over the real line.
Metropolis-Hastings alg will generate configurations (in this case, single numbers) such that
the probability of a given configuration x^a ~ p(x^a) for p(x) propto exp(-x^2).
Once configs = {x^a} are generated, the approximation, Q_N, of the integral, I, will be given by
Q_N = 1/N sum_(configs) x^2
lim (N-> inf) Q_N -> I
"""
"""
Implementing the Metropolis-Hastings algorithm
"""
# Setting up the initial config for our chain
x_0 = np.random.uniform(-20, -10)
# Defining function that generates the next N steps in the chain, given a starting config x
# Works by iteratively taking the last element in the chain, generating a new candidate configuration from it and accepting/rejecting according to the algorithm
# Success and failures implemented to see roughly the success rate of each step
def next_steps(x, N):
    Success = 0
    Failures = 0
    Data = np.empty((N,))
    d = 1.5  # Spread of (normal) transition function
    for i in range(N):
        r = np.random.uniform(0, 1)
        delta = np.random.normal(0, d)
        x_new = x + delta
        hasting_ratio = np.exp(-(x_new ** 2 - x ** 2))
        if hasting_ratio > r:
            x = x_new
            Success = Success + 1
        else:
            Failures = Failures + 1
        Data[i] = x
    print(Success)
    print(Failures)
    return Data
# Number of steps in the chain
N_iteration = 50000
# Generating the data
Data = next_steps(x_0, N_iteration)
# Obtaining tail end data and obtaining the standard deviation of resulting gaussian distribution
Data = Data[-40000:]
(mu, sigma) = norm.fit(Data)
print(sigma)
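As a quick sanity check: the target density p(x) proportional to exp(-x^2) has standard deviation 1/sqrt(2), roughly 0.707, so the fitted sigma can be compared directly against that value:

print(abs(sigma - 1 / np.sqrt(2)))  # should be small once the chain has converged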

Calculate correlation in xarray with missing data

I am trying to calculate a correlation between two datasets in xarray along the time dimension. My datasets are both lat x lon x time. One of my datasets has enough data missing that it isn't reasonable to interpolate and eliminate gaps; instead, I would like to just ignore missing values. I have some simple bits of code that are working somewhat, but none that fits my exact use case. For example:
def covariance(x, y, dims=None):
    return xr.dot(x - x.mean(dims), y - y.mean(dims), dims=dims) / x.count(dims)

def correlation(x, y, dims=None):
    return covariance(x, y, dims) / (x.std(dims) * y.std(dims))
works well if no data is missing, but of course it can't work with NaNs. While there is a good example written for xarray here, even with this code I am struggling to calculate the Pearson correlation, not the Spearman.
import numpy as np
import xarray as xr
import bottleneck
def covariance_gufunc(x, y):
    return ((x - x.mean(axis=-1, keepdims=True))
            * (y - y.mean(axis=-1, keepdims=True))).mean(axis=-1)

def pearson_correlation_gufunc(x, y):
    return covariance_gufunc(x, y) / (x.std(axis=-1) * y.std(axis=-1))

def spearman_correlation_gufunc(x, y):
    x_ranks = bottleneck.rankdata(x, axis=-1)
    y_ranks = bottleneck.rankdata(y, axis=-1)
    return pearson_correlation_gufunc(x_ranks, y_ranks)

def spearman_correlation(x, y, dim):
    return xr.apply_ufunc(
        spearman_correlation_gufunc, x, y,
        input_core_dims=[[dim], [dim]],
        dask='parallelized',
        output_dtypes=[float])
Finally there was a useful discussion on github of adding this as a feature to xarray but it has yet to be implemented. Is there an efficient way to do this on datasets with data gaps?
I've been following this Github discussion and the subsequent attempts to implement a .corr() method, seems like we're pretty close but it's still not there yet.
In the meantime, the basic code which most are attempting to merge is outlined pretty well in this other answer (How to apply linear regression to every pixel in a large multi-dimensional array containing NaNs?). It's a good solution which leverages vectorized operations in NumPy and with some small tweaking (see accepted answer in the link) can be made to account for NaNs along the time axis.
def lag_linregress_3D(x, y, lagx=0, lagy=0):
    """
    Input: Two xr.DataArrays of any dimensions with the first dim being time.
    Thus the input data could be a 1D time series, or for example, have three
    dimensions (time, lat, lon).
    Datasets can be provided in any order, but note that the regression slope
    and intercept will be calculated for y with respect to x.

    Output: Covariance, correlation, regression slope and intercept, p-value,
    and standard error on regression between the two datasets along their
    aligned time dimension.

    Lag values can be assigned to either of the data, with lagx shifting x, and
    lagy shifting y, with the specified lag amount.
    """
    # 1. Ensure that the data are properly aligned to each other.
    x, y = xr.align(x, y)

    # 2. Add lag information if any, and shift the data accordingly
    if lagx != 0:
        # If x lags y by 1, x must be shifted 1 step backwards.
        # But as the 'zero-th' value is nonexistent, xr assigns it as invalid
        # (nan). Hence it needs to be dropped.
        x = x.shift(time=-lagx).dropna(dim='time')
        # Next important step is to re-align the two datasets so that y adjusts
        # to the changed coordinates of x
        x, y = xr.align(x, y)

    if lagy != 0:
        y = y.shift(time=-lagy).dropna(dim='time')
        x, y = xr.align(x, y)

    # 3. Compute data length, mean and standard deviation along time axis:
    n = y.notnull().sum(dim='time')
    xmean = x.mean(axis=0)
    ymean = y.mean(axis=0)
    xstd = x.std(axis=0)
    ystd = y.std(axis=0)

    # 4. Compute covariance along time axis
    cov = np.sum((x - xmean) * (y - ymean), axis=0) / (n)

    # 5. Compute correlation along time axis
    cor = cov / (xstd * ystd)

    # 6. Compute regression slope and intercept:
    slope = cov / (xstd**2)
    intercept = ymean - xmean * slope

    # 7. Compute P-value and standard error
    # Compute t-statistics
    tstats = cor * np.sqrt(n - 2) / np.sqrt(1 - cor**2)
    stderr = slope / tstats

    from scipy.stats import t
    pval = t.sf(tstats, n - 2) * 2
    pval = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)

    return cov, cor, slope, intercept, pval, stderr
Hope this helps! Fingers crossed the merge comes soon for this.
The solution is in the github thread https://github.com/pydata/xarray/issues/1115
def covariance(x, y, dim=None):
    valid_values = x.notnull() & y.notnull()
    valid_count = valid_values.sum(dim)
    demeaned_x = (x - x.mean(dim)).fillna(0)
    demeaned_y = (y - y.mean(dim)).fillna(0)
    return xr.dot(demeaned_x, demeaned_y, dims=dim) / valid_count

def correlation(x, y, dim=None):
    # dim should default to the intersection of x.dims and y.dims
    return covariance(x, y, dim) / (x.std(dim) * y.std(dim))
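A minimal usage sketch (synthetic DataArrays, names are illustrative; it assumes the two functions above are defined as posted):

import numpy as np
import xarray as xr

a = xr.DataArray(np.random.rand(50, 4, 5), dims=("time", "lat", "lon"))
b = (a + 0.5 * np.random.rand(50, 4, 5)).where(a > 0.1)  # introduce some NaNs
r = correlation(a, b, dim="time")
print(r.dims)  # ('lat', 'lon')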

Chi-squared goodness of fit test in Python: way too low p-values, but the fitting function is correct

Despite having searched related questions for two days, I have not really found an answer to this problem yet...
In the following code, I generate n normally distributed random variables, which are then represented in a histogram:
import numpy as np
import matplotlib.pyplot as plt
n = 10000 # number of generated random variables
x = np.random.normal(0,1,n) # generate n random variables
# plot this in a non-normalized histogram:
plt.hist(x, bins='auto', density=False)
# get the arrays containing the bin counts and the bin edges:
histo, bin_edges = np.histogram(x, bins='auto', density=False)
number_of_bins = len(bin_edges)-1
After that, a curve fitting function and its parameters are found.
It is normally distributed with the parameters a1 and b1, and scaled with scaling_factor to account for the fact that the sample is not normalized.
It indeed fits the histogram quite well:
import scipy as sp
a1, b1 = sp.stats.norm.fit(x)
scaling_factor = n*(x.max()-x.min())/number_of_bins
plt.plot(x_achse,scaling_factor*sp.stats.norm.pdf(x_achse,a1,b1),'b')
Here's the plot of the histogram with the fitting function in red.
After that, I want to test how well this function fits the histogram using the chi-squared test.
This test uses the observed values and the expected values in those points. To calculate the expected values, I first calculate the location of the middle of each bin, this information is contained in the array x_middle. I then calculate the value of the fitting function at the middle point of each bin, which gives the expected_value array:
observed_values = histo
bin_width = bin_edges[1] - bin_edges[0]
# array containing the middle point of each bin:
x_middle = np.linspace(bin_edges[0] + 0.5*bin_width,
                       bin_edges[0] + (0.5 + number_of_bins)*bin_width,
                       num=number_of_bins)
expected_values = scaling_factor*sp.stats.norm.pdf(x_middle,a1,b1)
Plugging this into the chisquare function of SciPy, I get p-values on the order of 1e-5 to 1e-15, which tells me the fitting function does not describe the histogram:
print(sp.stats.chisquare(observed_values,expected_values,ddof=2))
But this is not true, the function fits the histogram very well!
Does anybody know where I made a mistake?
Thanks a lot!!
Charles
p.s.: I set the number of delta degrees of freedom to 2, because the 2 parameters a1 and b1 are estimated from the sample. I tried using other ddof, but the results were still as poor!
Your calculation of the end-point of the array x_middle is off by one; it should be:
x_middle = np.linspace(bin_edges[0] + 0.5*bin_width,
                       bin_edges[0] + (0.5 + number_of_bins - 1)*bin_width,
                       num=number_of_bins)
Note the extra - 1 in the second argument of linspace().
A more concise version is
x_middle = 0.5*(bin_edges[1:] + bin_edges[:-1])
A different (and possibly more accurate) approach to computing expected_values is to use the differences of the CDF, instead of approximating those differences using the PDF in the middle of each interval:
In [75]: from scipy import stats
In [76]: cdf = stats.norm.cdf(bin_edges, a1, b1)
In [77]: expected_values = n * np.diff(cdf)
With that calculation, I get the following result from the chi-squared test:
In [85]: stats.chisquare(observed_values, expected_values, ddof=2)
Out[85]: Power_divergenceResult(statistic=61.168393496775181, pvalue=0.36292223875686402)
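For completeness, a self-contained sketch of the CDF-based check (variable names follow the question; the rescaling line is only there because some SciPy versions insist that the observed and expected counts sum to the same total):

import numpy as np
from scipy import stats

n = 10000
x = np.random.normal(0, 1, n)
observed_values, bin_edges = np.histogram(x, bins='auto')
a1, b1 = stats.norm.fit(x)

expected_values = n * np.diff(stats.norm.cdf(bin_edges, a1, b1))
expected_values *= observed_values.sum() / expected_values.sum()

print(stats.chisquare(observed_values, expected_values, ddof=2))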

How to smooth a curve in the right way?

Let's assume we have a dataset which might be given approximately by
import numpy as np
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
Therefore we have a variation of 20% of the dataset. My first idea was to use the UnivariateSpline function of SciPy, but the problem is that this does not handle the small noise well. If you consider the frequencies, the background is much smaller than the signal, so a spline of only the cutoff might be an idea, but that would involve a back-and-forth Fourier transformation, which might result in bad behaviour.
Another way would be a moving average, but this would also need the right choice of the delay.
Any hints/ books or links how to tackle this problem?
I prefer a Savitzky-Golay filter. It uses least squares to regress a small window of your data onto a polynomial, then uses the polynomial to estimate the point in the center of the window. Finally the window is shifted forward by one data point and the process repeats. This continues until every point has been optimally adjusted relative to its neighbors. It works great even with noisy samples from non-periodic and non-linear sources.
Here is a thorough cookbook example. See my code below to get an idea of how easy it is to use. Note: I left out the code for defining the savitzky_golay() function because you can literally copy/paste it from the cookbook example I linked above.
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
yhat = savitzky_golay(y, 51, 3) # window size 51, polynomial order 3
plt.plot(x,y)
plt.plot(x,yhat, color='red')
plt.show()
UPDATE: It has come to my attention that the cookbook example I linked to has been taken down. Fortunately, the Savitzky-Golay filter has been incorporated into the SciPy library, as pointed out by @dodohjk (thanks @bicarlsen for the updated link).
To adapt the above code by using SciPy source, type:
from scipy.signal import savgol_filter
yhat = savgol_filter(y, 51, 3) # window size 51, polynomial order 3
EDIT: look at this answer. Using np.cumsum is much faster than np.convolve
A quick and dirty way to smooth data I use, based on a moving average box (by convolution):
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.8
def smooth(y, box_pts):
box = np.ones(box_pts)/box_pts
y_smooth = np.convolve(y, box, mode='same')
return y_smooth
plot(x, y,'o')
plot(x, smooth(y,3), 'r-', lw=2)
plot(x, smooth(y,19), 'g-', lw=2)
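The np.cumsum version mentioned in the edit above is not shown there, so here is a rough sketch of the idea (edge handling kept deliberately simple; the output is shorter than the input by box_pts - 1 points):

import numpy as np

def smooth_cumsum(y, box_pts):
    # moving average via a cumulative sum
    cumsum = np.cumsum(np.insert(y, 0, 0))
    return (cumsum[box_pts:] - cumsum[:-box_pts]) / box_pts

y_smooth = smooth_cumsum(y, 19)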
If you are interested in a "smooth" version of a signal that is periodic (like your example), then an FFT is the right way to go. Take the Fourier transform and subtract out the low-contributing frequencies:
import numpy as np
import scipy.fftpack
N = 100
x = np.linspace(0,2*np.pi,N)
y = np.sin(x) + np.random.random(N) * 0.2
w = scipy.fftpack.rfft(y)
f = scipy.fftpack.rfftfreq(N, x[1]-x[0])
spectrum = w**2
cutoff_idx = spectrum < (spectrum.max()/5)
w2 = w.copy()
w2[cutoff_idx] = 0
y2 = scipy.fftpack.irfft(w2)
Even if your signal is not completely periodic, this will do a great job of subtracting out white noise. There are many types of filters to use (high-pass, low-pass, etc.); the appropriate one depends on what you are looking for.
Fitting a moving average to your data would smooth out the noise; see this answer for how to do that.
If you'd like to use LOWESS to fit your data (it's similar to a moving average but more sophisticated), you can do that using the statsmodels library:
import numpy as np
import pylab as plt
import statsmodels.api as sm
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
lowess = sm.nonparametric.lowess(y, x, frac=0.1)
plt.plot(x, y, '+')
plt.plot(lowess[:, 0], lowess[:, 1])
plt.show()
Finally, if you know the functional form of your signal, you could fit a curve to your data, which would probably be the best thing to do.
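On that last point, a hedged sketch with scipy.optimize.curve_fit, assuming a sinusoidal model for the example data:

import numpy as np
from scipy.optimize import curve_fit

x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x) + np.random.random(100) * 0.2

def model(x, a, b, c, d):
    # assumed functional form: a*sin(b*x + c) + d
    return a * np.sin(b * x + c) + d

popt, pcov = curve_fit(model, x, y, p0=[1, 1, 0, 0])
y_fit = model(x, *popt)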
This question is already thoroughly answered, so I think a runtime analysis of the proposed methods would be of interest (it was for me, anyway). I will also look at the behavior of the methods at the center and the edges of the noisy dataset.
TL;DR
| runtime in s | runtime in s
method | python list | numpy array
--------------------|--------------|------------
kernel regression | 23.93405 | 22.75967
lowess | 0.61351 | 0.61524
naive average | 0.02485 | 0.02326
others* | 0.00150 | 0.00150
fft | 0.00021 | 0.00021
numpy convolve | 0.00017 | 0.00015
*savgol with different fit functions and some numpy methods
Kernel regression scales badly, lowess is a bit faster, but both produce smooth curves. Savgol is a middle ground on speed and can produce both jumpy and smooth outputs, depending on the order of the polynomial. FFT is extremely fast, but only works on periodic data.
Moving average methods with numpy are faster but obviously produce a graph with steps in it.
Setup
I generated 1000 data points in the shape of a sin curve:
size = 1000
x = np.linspace(0, 4 * np.pi, size)
y = np.sin(x) + np.random.random(size) * 0.2
data = {"x": x, "y": y}
I pass these into a function to measure the runtime and plot the resulting fit:
def test_func(f, label):  # f: function handle to one of the smoothing methods
    start = time()
    for i in range(5):
        arr = f(data["y"], 20)
    print(f"{label:26s} - time: {time() - start:8.5f} ")
    plt.plot(data["x"], arr, "-", label=label)
I tested many different smoothing functions. arr is the array of y values to be smoothed and span is the smoothing parameter. The lower span is, the closer the fit approaches the original data; the higher it is, the smoother the resulting curve becomes.
def smooth_data_convolve_my_average(arr, span):
    re = np.convolve(arr, np.ones(span * 2 + 1) / (span * 2 + 1), mode="same")

    # The "my_average" part: shrinks the averaging window on the side that
    # reaches beyond the data, keeps the other side the same size as given
    # by "span"
    re[0] = np.average(arr[:span])
    for i in range(1, span + 1):
        re[i] = np.average(arr[:i + span])
        re[-i] = np.average(arr[-i - span:])
    return re

def smooth_data_np_average(arr, span):  # my original, naive approach
    return [np.average(arr[val - span:val + span + 1]) for val in range(len(arr))]

def smooth_data_np_convolve(arr, span):
    return np.convolve(arr, np.ones(span * 2 + 1) / (span * 2 + 1), mode="same")

def smooth_data_np_cumsum_my_average(arr, span):
    cumsum_vec = np.cumsum(arr)
    moving_average = (cumsum_vec[2 * span:] - cumsum_vec[:-2 * span]) / (2 * span)

    # The "my_average" part again. Slightly different to before, because the
    # moving average from cumsum is shorter than the input and needs to be padded
    front, back = [np.average(arr[:span])], []
    for i in range(1, span):
        front.append(np.average(arr[:i + span]))
        back.insert(0, np.average(arr[-i - span:]))
    back.insert(0, np.average(arr[-2 * span:]))
    return np.concatenate((front, moving_average, back))

def smooth_data_lowess(arr, span):
    x = np.linspace(0, 1, len(arr))
    return sm.nonparametric.lowess(arr, x, frac=(5 * span / len(arr)), return_sorted=False)

def smooth_data_kernel_regression(arr, span):
    # "span" smoothing parameter is ignored. If you know how to
    # incorporate that with kernel regression, please comment below.
    kr = KernelReg(arr, np.linspace(0, 1, len(arr)), 'c')
    return kr.fit()[0]

def smooth_data_savgol_0(arr, span):
    return savgol_filter(arr, span * 2 + 1, 0)

def smooth_data_savgol_1(arr, span):
    return savgol_filter(arr, span * 2 + 1, 1)

def smooth_data_savgol_2(arr, span):
    return savgol_filter(arr, span * 2 + 1, 2)

def smooth_data_fft(arr, span):  # the scaling of "span" is open to suggestions
    w = fftpack.rfft(arr)
    spectrum = w ** 2
    cutoff_idx = spectrum < (spectrum.max() * (1 - np.exp(-span / 2000)))
    w[cutoff_idx] = 0
    return fftpack.irfft(w)
Results
Speed
Runtime over 1000 elements, tested on a python list as well as a numpy array to hold the values.
method | python list | numpy array
--------------------|-------------|------------
kernel regression | 23.93405 s | 22.75967 s
lowess | 0.61351 s | 0.61524 s
numpy average | 0.02485 s | 0.02326 s
savgol 2 | 0.00186 s | 0.00196 s
savgol 1 | 0.00157 s | 0.00161 s
savgol 0 | 0.00155 s | 0.00151 s
numpy convolve + me | 0.00121 s | 0.00115 s
numpy cumsum + me | 0.00114 s | 0.00105 s
fft | 0.00021 s | 0.00021 s
numpy convolve | 0.00017 s | 0.00015 s
Kernel regression in particular is very slow to compute over 1k elements, and lowess also fails when the dataset becomes much larger. numpy convolve and fft are especially fast. I did not investigate the runtime behavior (O(n)) with increasing or decreasing sample size.
Edge behavior
I'll separate this part into two, to keep the images understandable.
Numpy based methods + savgol 0:
These methods calculate an average of the data, so the graph is not smoothed. They all (with the exception of numpy.cumsum) result in the same graph when the window used to calculate the average does not touch the edge of the data. The discrepancy with numpy.cumsum is most likely due to an 'off by one' error in the window size.
There are different edge behaviours when the method has to work with less data:
savgol 0: continues with a constant to the edge of the data (savgol 1 and savgol 2 end with a line and parabola respectively)
numpy average: stops when the window reaches the left side of the data and fills those places in the array with NaN; same behaviour as the my_average method on the right side
numpy convolve: follows the data pretty accurately. I suspect the window size is reduced symmetrically when one side of the window reaches the edge of the data
my_average/me: my own method, implemented because I was not satisfied with the other ones. It simply shrinks the part of the window that reaches beyond the data down to the edge of the data, but keeps the other side of the window at the original size given by span
Complicated methods:
These methods all end with a nice fit to the data. savgol 1 ends with a line, savgol 2 with a parabola.
Curve behaviour
To showcase the behaviour of the different methods in the middle of the data.
The different savgol and average filters produce a rough line; lowess, fft and kernel regression produce a smooth fit. lowess appears to cut corners when the data changes.
Motivation
I have a Raspberry Pi logging data for fun, and the visualization proved to be a small challenge. All data points, except RAM usage and ethernet traffic, are only recorded in discrete steps and/or are inherently noisy. For example, the temperature sensor only outputs whole degrees, but differs by up to two degrees between consecutive measurements. No useful information can be gained from such a scatter plot. To visualize the data I therefore needed some method that is not too computationally expensive and produces a moving average. I also wanted nice behavior at the edges of the data, as this especially impacts the latest info when looking at live data. I settled on the numpy convolve method with my_average to improve the edge behavior.
Another option is to use KernelReg in statsmodels:
from statsmodels.nonparametric.kernel_regression import KernelReg
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
# The third parameter specifies the type of the variable x;
# 'c' stands for continuous
kr = KernelReg(y,x,'c')
plt.plot(x, y, '+')
y_pred, y_std = kr.fit(x)
plt.plot(x, y_pred)
plt.show()
A clear definition of smoothing of a 1D signal from SciPy Cookbook shows you how it works.
Shortcut:
import numpy
def smooth(x, window_len=11, window='hanning'):
    """smooth the data using a window with requested size.

    This method is based on the convolution of a scaled window with the signal.
    The signal is prepared by introducing reflected copies of the signal
    (with the window size) in both ends so that transient parts are minimized
    in the beginning and end part of the output signal.

    input:
        x: the input signal
        window_len: the dimension of the smoothing window; should be an odd integer
        window: the type of window from 'flat', 'hanning', 'hamming', 'bartlett', 'blackman'
            flat window will produce a moving average smoothing.

    output:
        the smoothed signal

    example:
        t=linspace(-2,2,0.1)
        x=sin(t)+randn(len(t))*0.1
        y=smooth(x)

    see also:
        numpy.hanning, numpy.hamming, numpy.bartlett, numpy.blackman, numpy.convolve
        scipy.signal.lfilter

    TODO: the window parameter could be the window itself if an array instead of a string
    NOTE: length(output) != length(input), to correct this: return y[(window_len/2-1):-(window_len/2)] instead of just y.
    """

    if x.ndim != 1:
        raise ValueError("smooth only accepts 1 dimension arrays.")
    if x.size < window_len:
        raise ValueError("Input vector needs to be bigger than window size.")
    if window_len < 3:
        return x
    if not window in ['flat', 'hanning', 'hamming', 'bartlett', 'blackman']:
        raise ValueError("Window must be one of 'flat', 'hanning', 'hamming', 'bartlett', 'blackman'")

    s = numpy.r_[x[window_len-1:0:-1], x, x[-2:-window_len-1:-1]]
    #print(len(s))
    if window == 'flat':  # moving average
        w = numpy.ones(window_len, 'd')
    else:
        w = eval('numpy.' + window + '(window_len)')

    y = numpy.convolve(w/w.sum(), s, mode='valid')
    return y
from numpy import *
from pylab import *
def smooth_demo():
    t = linspace(-4, 4, 100)
    x = sin(t)
    xn = x + randn(len(t))*0.1
    y = smooth(x)

    ws = 31

    subplot(211)
    plot(ones(ws))

    windows = ['flat', 'hanning', 'hamming', 'bartlett', 'blackman']
    for w in windows[1:]:
        eval('plot('+w+'(ws) )')
    axis([0, 30, 0, 1.1])
    legend(windows)
    title("The smoothing windows")

    subplot(212)
    plot(x)
    plot(xn)
    for w in windows:
        plot(smooth(xn, 10, w))

    l = ['original signal', 'signal with noise']
    l.extend(windows)
    legend(l)
    title("Smoothing a noisy signal")
    show()

if __name__=='__main__':
    smooth_demo()
For a project of mine, I needed to create intervals for time-series modeling, and to make the procedure more efficient I created tsmoothie: A python library for time-series smoothing and outlier detection in a vectorized way.
It provides different smoothing algorithms together with the possibility to compute intervals.
Here I use a ConvolutionSmoother, but you can also try the others.
import numpy as np
import matplotlib.pyplot as plt
from tsmoothie.smoother import *
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
# operate smoothing
smoother = ConvolutionSmoother(window_len=5, window_type='ones')
smoother.smooth(y)
# generate intervals
low, up = smoother.get_intervals('sigma_interval', n_sigma=2)
# plot the smoothed timeseries with intervals
plt.figure(figsize=(11,6))
plt.plot(smoother.smooth_data[0], linewidth=3, color='blue')
plt.plot(smoother.data[0], '.k')
plt.fill_between(range(len(smoother.data[0])), low[0], up[0], alpha=0.3)
Note also that tsmoothie can carry out the smoothing of multiple time series in a vectorized way.
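For example, a 2D array with one series per row can be passed directly (this matches the indexing already used above, where smoother.smooth_data[0] and smoother.data[0] refer to the first series):

import numpy as np
from tsmoothie.smoother import ConvolutionSmoother

data = np.vstack([np.sin(np.linspace(0, 2*np.pi, 100)) + np.random.random(100) * 0.2
                  for _ in range(3)])
smoother = ConvolutionSmoother(window_len=5, window_type='ones')
smoother.smooth(data)
print(smoother.smooth_data.shape)  # expected: (3, 100)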
There is a simple function in scipy.ndimage that also works well for me:
from scipy.ndimage import uniform_filter1d
y_smooth = uniform_filter1d(y,size=15)
Using a moving average, a quick way (that also works for non-bijective functions) is
import numpy as np
import pandas as pd

def smoothen(x, winsize=5):
    return np.array(pd.Series(x).rolling(winsize).mean())[winsize-1:]
This code is based on https://towardsdatascience.com/data-smoothing-for-data-science-visualization-the-goldilocks-trio-part-1-867765050615. More advanced solutions are also discussed there.
You could also use this:
def smooth(scalars, weight=0.8):  # weight between 0 and 1
    return [scalars[i] * weight + (1 - weight) * scalars[i + 1] for i in range(len(scalars) - 1)]
If you are plotting a time-series graph and you have used matplotlib to draw it, then you can use the rolling median to smooth the graph:
smotDeriv = timeseries.rolling(window=20, min_periods=5, center=True).median()
where timeseries is the data you pass in (a pandas Series); you can alter the window size for more smoothing.

Generating numbers with Gaussian function in a range using python

I want to use the Gaussian function in Python to generate some numbers in a specific range, given the mean and variance.
So let's say I have a range between 0 and 10,
and I want my mean to be 3 and variance to be 4:
mean = 3, variance = 4
How can I do that?
Use random.gauss. From the docs:
random.gauss(mu, sigma)
Gaussian distribution. mu is the mean, and sigma is the standard deviation. This is slightly
faster than the normalvariate() function defined below.
It seems to me that you can clamp the results of this, but that wouldn't make it a Gaussian distribution. I don't think you can satisfy all the constraints simultaneously. If you want to clamp it to the range [0, 10], you could get your numbers:
num = min(10, max(0, random.gauss(3, 4)))
But then the resulting distribution of numbers won't be truly Gaussian. In this case, it seems you can't have your cake and eat it, too.
There's probably a better way to do this, but this is the function I ended up creating to solve this problem:
import random
def trunc_gauss(mu, sigma, bottom, top):
    a = random.gauss(mu, sigma)
    while (bottom <= a <= top) == False:
        a = random.gauss(mu, sigma)
    return a
If we break it down line by line:
import random
This allows us to use functions from the random library, which includes a gaussian random number generator (random.gauss).
def trunc_gauss(mu, sigma, bottom, top):
The function arguments allow us to specify the mean (mu) and standard deviation (sigma) of the distribution, as well as the top and bottom of our desired range.
a = random.gauss(mu, sigma)
Inside the function, we generate an initial random number according to a gaussian distribution.
while (bottom <= a <= top) == False:
    a = random.gauss(mu, sigma)
Next, the while loop checks if the number is within our specified range, and generates a new random number as long as the current number is outside our range.
return a
As soon as the number is inside our range, the while loop stops running and the function returns the number.
This should give a better approximation of a Gaussian distribution, since we don't artificially inflate the counts at the top and bottom boundaries of our range by clamping outliers to them.
I'm quite new to Python, so there are most probably simpler ways, but this worked for me.
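For the numbers in the question (mean 3, variance 4, hence sigma = 2, over the range 0 to 10), a call would look like:

samples = [trunc_gauss(3, 2, 0, 10) for _ in range(1000)]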
I was working on some numerical analytical computation and I ran into this python tutorial site - http://www.python-course.eu/weighted_choice_and_sample.php
Now, this is what I offer as a solution, in case anyone is too busy to visit the site.
I don't know how many Gaussian values you need, so I'll go with 100 as n; you gave mu as 3 and variance as 4, which makes sigma = 2. Here's the code:
from random import gauss
n = 100
values = []
frequencies = {}
while len(values) < n:
    value = gauss(3, 2)
    if 0 < value < 10:
        frequencies[int(value)] = frequencies.get(int(value), 0) + 1
        values.append(value)
print(values)
I hope this helps. You can get the plot as well. It's all in the tutorials.
If you have a small range of integers, you can create a list with a gaussian distribution of the numbers within that range and then make a random choice from it.
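One way to realize that idea (a sketch using the question's mean of 3 and sigma of 2 over the integers 0 to 10; random.choices with weights is equivalent to choosing from an expanded list):

import random
from scipy.stats import norm

values = list(range(11))                       # the integers 0..10
weights = [norm.pdf(v, 3, 2) for v in values]  # Gaussian weight for each integer
samples = random.choices(values, weights=weights, k=1000)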
import numpy as np
from random import uniform
from scipy.special import erf,erfinv
import math
def trunc_gauss(mu, sigma, xmin=np.nan, xmax=np.nan):
    """Truncated Gaussian distribution.

    mu is the mean, and sigma is the standard deviation.
    """
    # erf/erfinv work on (x - mu) / (sigma * sqrt(2)); keeping the sqrt(2)
    # factors means sigma really is the standard deviation of the result.
    if np.isnan(xmin):
        zmin = -1
    else:
        zmin = erf((xmin - mu) / (sigma * math.sqrt(2)))
    if np.isnan(xmax):
        zmax = 1
    else:
        zmax = erf((xmax - mu) / (sigma * math.sqrt(2)))

    y = uniform(zmin, zmax)
    z = erfinv(y)

    # This will not come up often, but if y >= 0.9999999999999999
    # the truncation of erfinv gives inf (max finite z = 5.805018683193454)
    while math.isinf(z):
        z = erfinv(uniform(zmin, zmax))

    return mu + z * sigma * math.sqrt(2)
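A quick check of the function above: with no bounds given, the sample mean and standard deviation should come out near mu and sigma.

draws = [trunc_gauss(3, 2) for _ in range(10000)]
print(np.mean(draws), np.std(draws))  # roughly 3 and 2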
You can use minimalistic code for 150 variables:
import numpy as np
s = np.random.normal(3, 2, 150)  # mean = 3, std dev = 2 (variance = 4); np.random.normal takes the standard deviation
print(s)
The normal distribution is just another random (stochastic) distribution, so we can check the result visually:
import seaborn as sns
import matplotlib.pyplot as plt
AA1_plot = sns.distplot(s, kde=True, rug=False)
plt.show()
