How to make this matplotlib plot less noisy? - python

How can I plot the following noisy data with a smooth, continuous line without considering each individual value? I would like to only show the behavior in a nicer way, without caring about noisy and extreme values. This is the code I am using:
import numpy
import sys
import matplotlib.pyplot as plt
from scipy.interpolate import spline
dataset = numpy.genfromtxt(fname='data', delimiter=",")
dic = {}
for d in dataset:
dic[d[0]] = d[1]
plt.plot(range(len(dic)), dic.values(),linestyle='-', linewidth=2)
plt.savefig('plot.png')
plt.show()

In a previous answer, I was introduced to the Savitzky Golay filter, a particular type of low-pass filter, well adapted for data smoothing. How "smooth" you want your resulting curve to be is a matter of preference, and this can be adjusted by both the window-size and the order of the interpolating polynomial. Using the cookbook example for sg_filter:
import numpy as np
import sg_filter
import matplotlib.pyplot as plt
# Generate some sample data similar to your post
X = np.arange(1,1000,1)
Y = np.log(X**3) + 10*np.random.random(X.shape)
Y2 = sg_filter.savitzky_golay(Y, 101, 3)
plt.plot(X,Y,linestyle='-', linewidth=2,alpha=.5)
plt.plot(X,Y2,color='r')
plt.show()

There is more than one way to do it!
Here I show how to reduce noise using a variety of techniques:
Moving average
LOWESS regression
Low pass filter
Interpolation
Sticking with #Hooked example data for consistency:
import numpy as np
import matplotlib.pyplot as plt
X = np.arange(1, 1000, 1)
Y = np.log(X ** 3) + 10 * np.random.random(X.shape)
plt.plot(X, Y, alpha = .5)
plt.show()
Moving average
Sometimes all you need is a moving average.
For example, using pandas with a window size of 100:
import pandas as pd
df = pd.DataFrame(Y, X)
df_mva = df.rolling(100).mean() # moving average with a window size of 100
df_mva.plot(legend = False);
You will probably have to try several window sizes with your data. Note that the first 100 values of df_mva will be NaN but these can be removed with the dropna method.
Usage details for the pandas rolling function.
LOWESS regression
I've used LOWESS (Locally Weighted Scatterplot Smoothing) successfully to remove noise from repeated measures datasets. More information on local regression methods, including LOWESS and LOESS, here. It's a simple method with only one parameter to tune which in my experience gives good results.
Here is how to apply the LOWESS technique using the statsmodels implementation:
import statsmodels.api as sm
y_lowess = sm.nonparametric.lowess(Y, X, frac = 0.3) # 30 % lowess smoothing
plt.plot(y_lowess[:, 0], y_lowess[:, 1]) # some noise removed
plt.show()
It may be necessary to vary the frac parameter, which is the fraction of the data used when estimating each y value. Increase the frac value to increase the amount of smoothing. The frac value must be between 0 and 1.
Further details on statsmodels lowess usage.
Low pass filter
Scipy provides a set of low pass filters which may be appropriate.
After application of the lfiter:
from scipy.signal import lfilter
n = 50 # larger n gives smoother curves
b = [1.0 / n] * n # numerator coefficients
a = 1 # denominator coefficient
y_lf = lfilter(b, a, Y)
plt.plot(X, y_lf)
plt.show()
Check scipy lfilter documentation for implementation details regarding how numerator and denominator coefficients are used in the difference equations.
There are other filters in the scipy.signal package.
Interpolation
Finally, here is an example of radial basis function interpolation:
from scipy.interpolate import Rbf
rbf = Rbf(X, Y, function = 'multiquadric', smooth = 500)
y_rbf = rbf(X)
plt.plot(X, y_rbf)
plt.show()
Smoother approximation can be achieved by increasing the smooth parameter. Alternative function parameters to consider include 'cubic' and 'thin_plate'. When considering the function value, I usually try 'thin_plate' first followed by 'cubic'; however both 'thin_plate' and 'cubic' seemed to struggle with the noise in this dataset.
Check other Rbf options in the scipy docs. Scipy provides other univariate and multivariate interpolation techniques (see this tutorial).

Related

Integration of KDE with strange behavior of from scipy.integrate.quad and the setted bandwith

I was looking for a way to obtaining the mean value (Expected Value) from a drawn distribution that I used to fit a Kernel Density Estimation from scipy.stats.gaussian_kde. I remember from my statistics class that the Expected Value is just the Integral over the pdf(x) * x from -infinity to infinity:
I used the the scipy.integrate.quad function to do this task in my code, but I ran into this apperently strange behavior (that might have something to do with the bandwith parameter from the KDE).
Problem
import matplotlib.pyplot as plt
import numpy as np
import random
from scipy.stats import norm, gaussian_kde
from scipy.integrate import quad
from sklearn.neighbors import KernelDensity
np.random.seed(42)
# Generating sample data
test_array = np.concatenate([np.random.normal(loc=-10, scale=.8, size=100),\
np.random.normal(loc=4,scale=2.0,size=500)])
kde = gaussian_kde(test_array,bw_method=0.5)
X_range = np.arange(-16,20,0.1)
y_list = []
for X in X_range:
pdf = lambda x : kde.evaluate([[x]])
y_list.append(pdf(X))
y = np.array(y_list)
_ = plt.plot(X_range,y)
# Integrate over pdf * x to obtain the mean
mean_integration_low_bw = quad(lambda x: x * pdf(x), a=-np.inf, b=np.inf)[0]
# Calculate the cdf at point of the mean
zero_int_low = quad(lambda x: pdf(x), a=-np.inf, b=mean_integration_low_bw)[0]
print("The mean after integration: {}\n".format(round(mean_integration_low_bw,4)))
print("F({}): {}".format(round(mean_integration_low_bw,4),round(zero_int_low,4)))
plt.axvline(x=mean_integration_low_bw,color ="r")
plt.show()
If I execute this code I get a strange behavior of the result for the integrated mean and the cumulative distribution function at the point of the calculated mean:
First Question:
In my opinion it should always show: F(Mean) = 0.5 or am I wrong here? (Does this only apply to symetric distributions?)
Second Question:
The more stranger thing ist, that the value for the integrated mean does not change for the bandwith parameter. In my opinion the mean should change too if the shape of the underlying distribution differs. If i set the bandwith to 5 I got the following graph:
Why is the mean value still the same if the curve now has a different shape (due to the wider bandwith)?
I hope those question not only arise due to my flawed understanding of statistics ;)
Your initial data is generate here
# Generating sample data
test_array = np.concatenate([np.random.normal(loc=-10, scale=.8, size=100),\
np.random.normal(loc=4,scale=2.0,size=500)])
So you have 500 samples from a distribution with mean 4 and 100 samples from a distribution with mean -10, you can predict the expected average (500*4-10*100)/(500+100) = 1.66666.... that's pretty close to the result given by your code, and also very consistent with the result obtained from the with the first plot.

Curve fitting with cubic spline

I am trying to interpolate a cumulated distribution of e.g. i) number of people to ii) number of owned cars, showing that e.g. the top 20% of people own much more than 20% of all cars - off course 100% of people own 100% of cars. Also I know that there are e.g. 100mn people and 200mn cars.
Now coming to my code:
#import libraries (more than required here)
import pandas as pd
from scipy import interpolate
from scipy.interpolate import interp1d
from sympy import symbols, solve, Eq
import matplotlib.pyplot as plt
from matplotlib import pyplot as plt
%matplotlib inline
import plotly.express as px
from scipy import interpolate
curve=pd.read_excel('inputs.xlsx',sheet_name='inputdata')
Input data: Curveplot (cumulated people (x) on the left // cumulated cars (y) on the right)
#Input data in list form (I am not sure how to interpolate from a list for the moment)
cumulatedpeople = [0, 0.453086, 0.772334, 0.950475, 0.978981, 0.999876, 0.999990, 1]
cumulatedcars= [0, 0.016356, 0.126713, 0.410482, 0.554976, 0.950073, 0.984913, 1]
x, y = points[:,0], points[:,1]
interpolation = interp1d(x, y, kind = 'cubic')
number_of_people_mn= 100000000
oneperson = 1 / number_of_people_mn
dataset = pd.DataFrame(range(number_of_people_mn + 1))
dataset.columns = ["nr_of_one_person"]
dataset.drop(dataset.index[:1], inplace=True)
#calculating the position of every single person on the cumulated x-axis (between 0 and 1)
dataset["cumulatedpeople"] = dataset["nr_of_one_person"] / number_of_people_mn
#finding the "cumulatedcars" to the "cumulatedpeople" via interpolation (between 0 and 1)
dataset["cumulatedcars"] = interpolation(dataset["cumulatedpeople"])
plt.plot(dataset["cumulatedpeople"], dataset["cumulatedcars"])
plt.legend(['Cubic interpolation'], loc = 'best')
plt.xlabel('Cumulated people')
plt.ylabel('Cumulated cars')
plt.title("People-to-car cumulated curve")
plt.show()
However when looking at the actual plot, I get the following result which is false: Cubic interpolation
In fact, the curve should look almost like the one from a linear interpolation with the exact same input data - however this is not accurate enough for my purpose: Linear interpolation
Is there any relevant step I am missing out or what would be the best way to get an accurate interpolation from the inputs that almost looks like the one from a linear interpolation?
Short answer: your code is doing the right thing, but the data is unsuitable for cubic interpolation.
Let me explain. Here is your code that I simplified for clarity
from scipy.interpolate import interp1d
from matplotlib import pyplot as plt
cumulatedpeople = [0, 0.453086, 0.772334, 0.950475, 0.978981, 0.999876, 0.999990, 1]
cumulatedcars= [0, 0.016356, 0.126713, 0.410482, 0.554976, 0.950073, 0.984913, 1]
interpolation = interp1d(cumulatedpeople, cumulatedcars, kind = 'cubic')
number_of_people_mn= 100#000000
cumppl = np.arange(number_of_people_mn + 1)/number_of_people_mn
cumcars = interpolation(cumppl)
plt.plot(cumppl, cumcars)
plt.plot(cumulatedpeople, cumulatedcars,'o')
plt.show()
note the last couple of lines -- I am plotting, on the same graph, both the interpolated results and the input date. Here is the result
orange dots are the original data, blue line is cubic interpolation. The interpolator passes through all the points so technically is doing the right thing
Clearly it is not doing what you would want
The reason for such strange behavior is mostly at the right end where you have a few x-points that are very close together -- the interpolator produces massive wiggles trying to fit very closely spaced points.
If I remove two right-most points from the interpolator:
interpolation = interp1d(cumulatedpeople[:-2], cumulatedcars[:-2], kind = 'cubic')
it looks a bit more reasonable:
But still one would argue linear interpolation is better. The wiggles on the left end now because the gaps between initial x-poonts are too large
The moral here is that cubic interpolation should really be used only if gaps between x points are roughly the same
Your best bet here, I think, is to use something like curve_fit
a related discussion can be found here
specifically monotone interpolation as explained here yields good results on your data. Copying the relevant bits here, you would replace the interpolator with
from scipy.interpolate import pchip
interpolation = pchip(cumulatedpeople, cumulatedcars)
and get a decent-looking fit:

Fast way reduce noise of autocorrelation function in python?

I can compute the autocorrelation using numpy's built in functionality:
numpy.correlate(x,x,mode='same')
However the resulting correlation is naturally noisy. I can partition my data, and compute the correlation on each resulting window, then average them all together to compute cleaner autocorrelation, similar to what signal.welch does. Is there a handy function in either numpy or scipy that does this, possibly faster than I would get if I were to compute partition and loop through the data myself?
UPDATE
This is motivated by #kazemakase answer. I have tried to show what I mean with some code used to generate the figure below.
One can see that #kazemakase is correct with the fact that the AC function naturally averages out the noise. However the averaging of the AC has the advantage that it is much faster! np.correlate seems to scale as the slow O(n^2) rather than O(nlogn) that I would expect if the correlation was calculated using circular convolution via the FFT...
from statsmodels.tsa.arima_model import ARIMA
import statsmodels as sm
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(12345)
arparams = np.array([.75, -.25, 0.2, -0.15])
maparams = np.array([.65, .35])
ar = np.r_[1, -arparams] # add zero-lag and negate
ma = np.r_[1, maparams] # add zero-lag
x = sm.tsa.arima_process.arma_generate_sample(ar, ma, 10000)
def calc_rxx(x):
x = x-x.mean()
N = len(x)
Rxx = np.correlate(x,x,mode="same")[N/2::]/N
#Rxx = np.correlate(x,x,mode="same")[N/2::]/np.arange(N,N/2,-1)
return Rxx/x.var()
def avg_rxx(x,nperseg=1024):
rxx_windows = []
Nw = int(np.floor(len(x)/nperseg))
print Nw
first = True
for i in range(Nw-1):
xw = x[i*nperseg:nperseg*(i+1)]
y = calc_rxx(xw)
if i%1 == 0:
if first:
plt.semilogx(y,"k",alpha=0.2,label="Short AC")
first = False
else:
plt.semilogx(y,"k",alpha=0.2)
rxx_windows.append(y)
print np.shape(rxx_windows)
return np.mean(rxx_windows,axis=0)
plt.figure()
r_avg = avg_rxx(x,nperseg=300)
r = calc_rxx(x)
plt.semilogx(r_avg,label="Average AC")
plt.semilogx(r,label="Long AC")
plt.xlabel("Lag")
plt.ylabel("Auto-correlation")
plt.legend()
plt.xlim([0,150])
plt.show()
TL-DR: To decrease noise in the autocorrelation function increase the length of your signal x.
Partitioning the data and averaging like in spectral estimation is an interesting idea. I wish it would work...
The autocorrelation is defined as
Let's say we partition the data into two windows. Their autocorrelations become
Note how they are only different in the limits of the sumations. Basically, we split the summation of the autocorrelation into two parts. When we add these back together we are back to the original autocorrelation! So we did not gain anything.
The conclusion is, there is no such thing implemented in numpy/scipy because there is no point in doing so.
Remarks:
I hope it's easy to see that this extends to any number of partitions.
to keep it simple I left the normalization out. If you divide Rxx by n and the partial Rxx by n/2 you get Rxx / n == (Rxx1 * 2/n + Rxx2 * 2/n) / 2. I.e. The mean of the normalized partial autocorrelation is equal to the complete normalized autocorrelation.
to keep it even simpler I assumed the signal x could be indexed beyond the limits of 0 and n-1. In practice, if the signal is stored in an array this is often not possible. In this case there is a small difference between the full and the partialized autocorrelations that increases with the lag l. Unfortunately, this is merely a loss of precision and does not reduce noise.
Code heretic! I don't belive your evil math!
Of course we can try things out and see:
import matplotlib.pyplot as plt
import numpy as np
n = 2**16
n_segments = 8
x = np.random.randn(n) # data
rx = np.correlate(x, x, mode='same') / n # ACF
l1 = np.arange(-n//2, n//2) # Lags
segments = x.reshape(n_segments, -1)
m = segments.shape[1]
rs = []
for y in segments:
ry = np.correlate(y, y, mode='same') / m # partial ACF
rs.append(ry)
l2 = np.arange(-m//2, m//2) # lags of partial ACFs
plt.plot(l1, rx, label='full ACF')
plt.plot(l2, np.mean(rs, axis=0), label='partial ACF')
plt.xlim(-m, m)
plt.legend()
plt.show()
Although we used 8 segments to average the ACF, the noise level visually stays the same.
Okay, so that's why it does not work but what is the solution?
Here are the good news: Autocorrelation is already a noise reduction technique! Well, in some way at least: An application of the ACF is to find periodic signals hidden by noise.
Since noise (ideally) has zero mean, its influence diminishes the more elements we sum up. In other words, you can reduce noise in the autocorrelation by using longer signals. (I guess this is probably not true for every type of noise, but should hold for the usual Gaussian white noise and its relatives.)
Behold the noise getting lower with more data samples:
import matplotlib.pyplot as plt
import numpy as np
for n in [2**6, 2**8, 2**12]:
x = np.random.randn(n)
rx = np.correlate(x, x, mode='same') / n # ACF
l1 = np.arange(-n//2, n//2) # Lags
plt.plot(l1, rx, label='n={}'.format(n))
plt.legend()
plt.xlim(-20, 20)
plt.show()

Fitting a Gaussian to a set of x,y data

Firstly this is an assignment I've been set so I'm only after pointers, and I am restricted to using the following libraries, NumPy, SciPy and MatPlotLib.
We have been given a txt file which includes x and y data for a resonance experiment and have to fit both a gaussian and lorentzian fit. I'm working on the gaussian fit at the minute and have tried following the code laid out in a previous question as a basis for my own code. (Gaussian fit for Python)
from numpy import *
from matplotlib import *
import matplotlib.pyplot as plt
import pylab
from scipy.optimize import curve_fit
energy, intensity = numpy.loadtxt('resonance_data.txt', unpack=True)
n = size(energy)
mean = 30.7
sigma = 10
intensity0 = 45
def gaus(energy, intensity0, energy0, sigma):
return intensity0 * exp(-(energy - energy0)**2 / (sigma**2))
popt, pcov = curve_fit(gaus, energy, intensity, p0=[45, mean, sigma])
plt.plot(energy, intensity, 'o')
plt.xlabel('Energy/eV')
plt.ylabel('Intensity')
plt.title('Plot of Intensity against Energy')
plt.plot(energy, gaus(energy, *popt))
plt.show()
Which returns the following graph
If I keep the expressions for mean and sigma, as in the url posted the curve fit is a horizontal line, so I'm guessing the problem lies in the curve fit not converging or something.
Looks like your data skews heavily to the left, why Gaussian? Not Boltzmann, Log-Normal, or anything else?
Much of these are already implemented in scipy.stats. See scipy.stats.cauchy for lorentzian and scipy.stats.normal gaussian. An example:
import scipy.stats as ss
A=ss.norm.rvs(0, 5, size=(100)) #Generate a random variable of 100 elements, with expected mean=0, std=5
ss.norm.fit_loc_scale(A) #fit both the mean and std
(-0.13053732553697531, 5.163322485150271) #your number will vary.
And I think you don't need the intensity0 parameter, it is just going to be 1/sigma/srqt(2*pi), because the density function has to sum up to 1.

Spline representation with scipy.interpolate: Poor interpolation for low-amplitude, rapidly oscillating functions

I need to (numerically) calculate the first and second derivative of a function for which I've attempted to use both splrep and UnivariateSpline to create splines for the purpose of interpolation the function to take the derivatives.
However, it seems that there's an inherent problem in the spline representation itself for functions who's magnitude is order 10^-1 or lower and are (rapidly) oscillating.
As an example, consider the following code to create a spline representation of the sine function over the interval (0,6*pi) (so the function oscillates three times only):
import scipy
from scipy import interpolate
import numpy
from numpy import linspace
import math
from math import sin
k = linspace(0, 6.*pi, num=10000) #interval (0,6*pi) in 10'000 steps
y=[]
A = 1.e0 # Amplitude of sine function
for i in range(len(k)):
y.append(A*sin(k[i]))
tck =interpolate.UnivariateSpline(x, y, w=None, bbox=[None, None], k=5, s=2)
M=tck(k)
Below are the results for M for A = 1.e0 and A = 1.e-2
http://i.imgur.com/uEIxq.png Amplitude = 1
http://i.imgur.com/zFfK0.png Amplitude = 1/100
Clearly the interpolated function created by the splines is totally incorrect! The 2nd graph does not even oscillate the correct frequency.
Does anyone have any insight into this problem? Or know of another way to create splines within numpy/scipy?
Cheers,
Rory
I'm guessing that your problem is due to aliasing.
What is x in your example?
If the x values that you're interpolating at are less closely spaced than your original points, you'll inherently lose frequency information. This is completely independent from any type of interpolation. It's inherent in downsampling.
Nevermind the above bit about aliasing. It doesn't apply in this case (though I still have no idea what x is in your example...
I just realized that you're evaluating your points at the original input points when you're using a non-zero smoothing factor (s).
By definition, smoothing won't fit the data exactly. Try putting s=0 in instead.
As a quick example:
import matplotlib.pyplot as plt
import numpy as np
from scipy import interpolate
x = np.linspace(0, 6.*np.pi, num=100) #interval (0,6*pi) in 10'000 steps
A = 1.e-4 # Amplitude of sine function
y = A*np.sin(x)
fig, axes = plt.subplots(nrows=2)
for ax, s, title in zip(axes, [2, 0], ['With', 'Without']):
yinterp = interpolate.UnivariateSpline(x, y, s=s)(x)
ax.plot(x, yinterp, label='Interpolated')
ax.plot(x, y, 'bo',label='Original')
ax.legend()
ax.set_title(title + ' Smoothing')
plt.show()
The reason that you're only clearly seeing the effects of smoothing with a low amplitude is due to the way the smoothing factor is defined. See the documentation for scipy.interpolate.UnivariateSpline for more details.
Even with a higher amplitude, the interpolated data won't match the original data if you use smoothing.
For example, if we just change the amplitude (A) to 1.0 in the code example above, we'll still see the effects of smoothing...
The problem is in choosing suitable values for the s parameter. Its values depend on the scaling of the data.
Reading the documentation carefully, one can deduce that the parameter should be chosen around s = len(y) * np.var(y), i.e. # of data points * variance. Taking for example s = 0.05 * len(y) * np.var(y) gives a smoothing spline that does not depend on the scaling of the data or the number of data points.
EDIT: sensible values for s depend of course also on the noise level in the data. The docs seem to recommend choosing s in the range (m - sqrt(2*m)) * std**2 <= s <= (m + sqrt(2*m)) * std**2 where std is the standard deviation associated with the "noise" you want to smooth over.

Categories