I am using the SciPy implementation of the kernel density estimate (KDE) (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html), which is working fine so far. However, I would now like to obtain the gradient of the KDE at a particular point.
I have looked at the Python source for the library, but haven't been able to figure out how to easily implement this functionality. Is anybody aware of a method to do this?
If you look at the source you referenced, you'll see that the density estimate is built up from contributions from all points in the dataset. Assume for the moment that there is only one evaluation point points[:, i] (lines 219–222):
diff = self.dataset - points[:, i, newaxis]
tdiff = dot(self.inv_cov, diff)
energy = sum(diff * tdiff, axis=0) / 2.0
result[i] = sum(exp(-energy), axis=0)
In matrix notation (no LaTeX available?), this would be written, for a single dataset point D and evaluation point p, as
d = D - p
t = Cov^-1 d
e = 1/2 d^T t
r = exp(-e)
The gradient you're looking for is grad(r) = (dr/dx, dr/dy):
dr/dx = d(exp(-e))/dx
= -de/dx exp(-e)
= -d(1/2 d^T Cov^-1 d)/dx exp(-e)
= -(Cov^-1 d) exp(-e)
Likewise for dr/dy. Hence all you need to do is calculate the term Cov^-1 d and multiply it with the result you already obtained.
result = zeros((self.d,m), dtype=float)
[...]
diff = self.dataset - points[:, i, newaxis]
tdiff = dot(self.inv_cov, diff)
energy = sum(diff * tdiff, axis=0) / 2.0
grad = dot(self.inv_cov, diff)
result[:,i] = sum(grad * exp(-energy), axis=1)
Note that I needed to drop the -1 when calculating grad to obtain results consistent with evaluating the density estimate at p and p + delta in all four directions. That is actually expected: the derivation above differentiates with respect to d, and since d = D - p, the gradient with respect to the evaluation point p picks up another factor of -1 that cancels it.
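If you would rather not modify the scipy source, the same computation can be done from the outside using the documented gaussian_kde attributes dataset, n, covariance and inv_cov. Here is a minimal sketch (the helper kde_gradient is my own, not part of scipy), including a finite-difference check:

import numpy as np
from scipy.stats import gaussian_kde

def kde_gradient(kde, p):
    """Gradient of a gaussian_kde density at a single point p of shape (d,)."""
    p = np.atleast_1d(p)
    diff = kde.dataset - p[:, np.newaxis]      # columns are D_i - p
    tdiff = kde.inv_cov @ diff                 # Cov^-1 (D_i - p)
    energy = np.sum(diff * tdiff, axis=0) / 2.0
    norm = kde.n * np.sqrt(np.linalg.det(2 * np.pi * kde.covariance))
    # d/dp exp(-e_i) = +Cov^-1 (D_i - p) * exp(-e_i), because d = D - p
    return tdiff @ np.exp(-energy) / norm

# quick finite-difference check in 2D
rng = np.random.default_rng(0)
kde = gaussian_kde(rng.normal(size=(2, 200)))
p = np.array([0.1, -0.2])
eps = 1e-6
fd = [(kde(p + eps * e) - kde(p - eps * e)) / (2 * eps) for e in np.eye(2)]
print(kde_gradient(kde, p), np.ravel(fd))    # the two should agree closely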
I have been trying for a week, to no avail, to solve a system of coupled differential equations and reproduce the results shown in the attached image, but I keep getting strange results (also shown) and can't see what I am doing wrong. The set of coupled differential equations was originally solved using Newman's BAND; here is a link to that Python implementation: python solution using BAND. And another link to the original image of the problem, in case the attached one is not clear enough: here you find a clearer image of the problem. What I am now trying to do is solve the same problem by building a sparse array directly from the discretized equations, using a combination of sympy and numpy, and then solving with scipy's spsolve. My code is below; I need some help figuring out what I am doing wrong.
I have represented the variables as c1 = cA, c2 = cB, c3 = cC, c4 = cD in my code. Equation 2 has been linearized and phi10 and phi20 are the trial values of the variables cC and cD.
# import modules
import numpy as np
import sympy
from sympy.core.function import _mexpand
import scipy as sp
import scipy.sparse as ss
import scipy.sparse.linalg as ssl
import matplotlib.pyplot as plt
# define functions
def flatten(t):
    """
    function to flatten lists
    """
    return [item for sublist in t for item in sublist]

def get_coeffs(coeff_dict, func_vars):
    """
    function to extract coefficients from variables
    and form the sparse symbolic array
    """
    c = coeff_dict
    for i in list(c.keys()):
        b, _ = i.as_base_exp()
        if b == i:
            continue
        if b in c:
            c[i] = 0
        if any(k.has(b) for k in c):
            c[i] = 0
    return [coeff_dict[val] for val in func_vars]
# Constants for the problem
I = 0.1 # A/cm2
L = 1.0 # distance (x) in cm
m = 100 # grid spacing
h = L / (m-1)
a = 23300 # 1/cm
io = 2e-7 # A/cm2
n = 1
F = 96500 # C/mol
R = 8.314 # J/mol-K
T = 298 # K
sigma = 20 # S/cm
kappa = 0.06 # S/cm
alpha = 0.5
beta = -(1-alpha)*n*F/R/T
phi10 , phi20 = 5, 0.5 # these are just guesses
P = a*io*np.exp(beta*(phi10-phi20))
j = sympy.symbols('j',integer = True)
cA = sympy.IndexedBase('cA')
cB = sympy.IndexedBase('cB')
cC = sympy.IndexedBase('cC')
cD = sympy.IndexedBase('cD')
# write the boundary conditions at x = 0
bc=[cA[1], cB[1],
(4/3) * cC[2] - (1/3)*cC[3], # use a three point approximation for cC_prime
cD[1]
]
# form a list of expressions from the boundary conditions and equations
expr=flatten([bc,flatten([[
-cA[j-1] - cB[j-1] + cA[j+1] + cB[j+1],
cB[j-1] - 2*h*P*beta*cC[j] + 2*h*P*beta*cD[j] - cB[j+1],
-sigma*cC[j-1] + 2*h*cA[j] + sigma * cC[j+1],
-kappa * cD[j-1] + 2*h * cB[j] + kappa * cD[j+1]] for j in range(2, m)])])
vars = [cA[j], cB[j], cC[j], cD[j]]
# flatten the list of variables
unknowns = flatten([[cA[j], cB[j], cC[j], cD[j]] for j in range(1,m)])
var_len = len(unknowns)
# # # substitute in the boundary conditions at x = L while getting the coefficients
A = sympy.SparseMatrix([get_coeffs(_mexpand(i.subs({cA[m]:I}))\
.as_coefficients_dict(), unknowns) for i in expr])
# convert to a numpy array
mat_temp = np.array(A).astype(np.float64)
# you can view the sparse array with this
fig = plt.figure(figsize=(6,6))
ax = fig.add_axes([0,0, 1,1])
cmap = plt.cm.binary
plt.spy(mat_temp, cmap = cmap, alpha = 0.8)
def solve_sparse(b0, error):
    # create the b column vector
    b = np.copy(b0)
    b[0:4] = np.array([0.0, I, 0.0, 0.0])
    b[var_len-4] = I
    b[var_len-3] = 0
    b[var_len-2] = 0
    b[var_len-1] = 0
    print(b.shape)
    old = np.copy(b0)
    mat = np.copy(mat_temp)
    b_2 = np.copy(b)
    resid = 10
    lss = 0
    while lss < 100:
        mat_2 = np.copy(mat)
        for j in range(3, var_len - 3, 4):
            # update the forcing term of equation 2
            b_2[j+2] = 2*h*(1-beta*old[j+3]+beta*old[j+4])*a*io*np.exp(beta*(old[j+3]-old[j+4]))
            # update the sparse array at every iteration for variables cC and cD in equation 2
            mat_2[j+2, j+3] += 2*h*beta*a*io*np.exp(beta*(old[j+3]-old[j+4]))
            mat_2[j+2, j+4] += 2*h*beta*a*io*np.exp(beta*(old[j+3]-old[j+4]))
        # form the column sparse matrix
        A_s = ss.csc_matrix(mat_2)
        new = ssl.spsolve(A_s, b_2).flatten()
        resid = np.sum((new - old)**2)/var_len
        lss += 1
        old = np.copy(new)
    return new
val0 = np.array([[0.0, 0.0, 0.0, 0.0] for _ in range(m-1)]).flatten() # form an array of initial values
error = 1e-7
## Run the code
conc = solve_sparse(val0, error).reshape(m-1, len(vars))
conc.shape # gives (99, 4)
# Plot result for cA:
plt.plot(conc[:,0], marker = 'o', linestyle = '')
What happens seems pretty clear now, after having seen that the plotted solution indeed oscillates between the upper and lower values. You are using the central Euler method as the discretization; for u' = F(u) this reads as
u[j+1]-u[j-1] = 2*h*F(u[j])
This method is only weakly stable and allows the sub-sequences of odd and even indices to evolve rather independently. In terms of equations, this means the computed solution may approximate the system ue' = F(uo), uo' = F(ue), with independent functions ue, uo that follow the paths of the even and odd sub-sequences.
These even and odd parts are only tied together by the treatment of the boundary points, two or three points deep. So avoiding or reducing the oscillation requires very careful handling of the boundary conditions and also of the differential equations for the boundary points.
But one can avoid all this unpleasantness by using the trapezoidal method
u[j+1]-u[j] = 0.5*h*(F(u[j+1])+F(u[j]))
This also reduces the band-width of the system matrix.
To implement the implied Newton method correctly (linearizing via Taylor expansion and solving the linearized equation is what the Newton-Kantorovich method does), you need to replace F(u[j]) with F(u_old[j]) + F'(u_old[j])*(u[j] - u_old[j]). This gives a linear system of equations in u for each iteration step.
For the trapezoidal method this gives
(I-0.5*h*F'(u_old[j+1]))*u[j+1] - (I+0.5*h*F'(u_old[j]))*u[j]
= 0.5*h*(F(u_old[j+1])-F'(u_old[j+1])*u_old[j+1] + F(u_old[j])-F'(u_old[j])*u_old[j])
In general, the derivative values, and thus the system matrix, need not be updated in every iteration step; only the function values must be (otherwise the iteration does not move forward).
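To make this concrete, here is a minimal sketch of the trapezoidal/Newton-Kantorovich iteration for an illustrative scalar problem (u' = -u^2, u(0) = 1, chosen only for demonstration; it is not the question's system):

import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import spsolve

F  = lambda u: -u**2          # illustrative right-hand side
dF = lambda u: -2*u           # its derivative F'

m = 101
h = 1.0 / (m - 1)
u_old = np.ones(m)            # initial guess for the Newton iteration

for it in range(20):
    A = lil_matrix((m, m))
    b = np.zeros(m)
    A[0, 0] = 1.0
    b[0] = 1.0                # condition u(0) = 1
    for j in range(m - 1):
        # (1 - h/2 F'(u_old[j+1])) u[j+1] - (1 + h/2 F'(u_old[j])) u[j]
        #   = h/2 (F(u_old[j+1]) - F'(u_old[j+1]) u_old[j+1] + F(u_old[j]) - F'(u_old[j]) u_old[j])
        A[j+1, j+1] = 1 - 0.5*h*dF(u_old[j+1])
        A[j+1, j] = -(1 + 0.5*h*dF(u_old[j]))
        b[j+1] = 0.5*h*(F(u_old[j+1]) - dF(u_old[j+1])*u_old[j+1]
                        + F(u_old[j]) - dF(u_old[j])*u_old[j])
    u_new = spsolve(A.tocsc(), b)
    if np.max(np.abs(u_new - u_old)) < 1e-12:
        break
    u_old = u_new

x = np.linspace(0, 1, m)
print(np.max(np.abs(u_new - 1/(1 + x))))   # exact solution is 1/(1+x); only discretization error remains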
I have a spectrum that I want to subtract a baseline from. The spectrum data are:
1.484043000000000001e+00 1.121043091000000004e+03
1.472555999999999976e+00 1.140899658000000045e+03
1.461239999999999872e+00 1.135047851999999921e+03
1.450093000000000076e+00 1.153286499000000049e+03
1.439112000000000169e+00 1.158624877999999853e+03
1.428292000000000117e+00 1.249718872000000147e+03
1.417629999999999946e+00 1.491854857999999922e+03
1.407121999999999984e+00 2.524922362999999677e+03
1.396767000000000092e+00 4.102439940999999635e+03
1.386559000000000097e+00 4.013319579999999860e+03
1.376497999999999999e+00 3.128252441000000090e+03
1.366578000000000070e+00 2.633181152000000111e+03
1.356797999999999949e+00 2.340077147999999852e+03
1.347154999999999880e+00 2.099404540999999881e+03
1.337645999999999891e+00 2.012083983999999873e+03
1.328268000000000004e+00 2.052154540999999881e+03
1.319018999999999942e+00 2.061067871000000196e+03
1.309895999999999949e+00 2.205770507999999609e+03
1.300896999999999970e+00 2.199266602000000148e+03
1.292019000000000029e+00 2.317792235999999775e+03
1.283260000000000067e+00 2.357031494000000293e+03
1.274618000000000029e+00 2.434981689000000188e+03
1.266089999999999938e+00 2.540746337999999923e+03
1.257675000000000098e+00 2.605709472999999889e+03
1.249370000000000092e+00 2.667244141000000127e+03
1.241172999999999860e+00 2.800522704999999860e+03
I've taken only every 20th data point from the actual data file, but the general shape is preserved.
import matplotlib.pyplot as plt
share = the_above_array
plt.plot(share)
Original_spectrum
There is a clear tail around the high x values. Assume the tail is an artifact and needs to be removed. I've tried solutions using the ALS algorithm by P. Eilers, a rubberband approach, and the peakutils package, but these either subtract the tail while creating a rise around the low x values or fail to produce a suitable baseline.
ALS algorithm; in this example I am using lam=1E6 and p=0.001, which were the best parameters I was able to find manually:
# ALS approach
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als(y, lam, p, niter=10):
    L = len(y)
    D = sparse.csc_matrix(np.diff(np.eye(L), 2))
    w = np.ones(L)
    for i in range(niter):
        W = sparse.spdiags(w, 0, L, L)
        Z = W + lam * D.dot(D.transpose())
        z = spsolve(Z, w*y)
        w = p * (y > z) + (1-p) * (y < z)
    return z
baseline = baseline_als(share[:,1], 1E6, 0.001)
baseline_subtracted = share[:,1] - baseline
plt.plot(baseline_subtracted)
ALS_plot
Rubberband approach:
# rubberband approach
from scipy.spatial import ConvexHull

def rubberband(x, y):
    # Find the convex hull (note: this uses the global share array rather than the x, y arguments)
    v = ConvexHull(share).vertices
    # Rotate convex hull vertices until they start from the lowest one
    v = np.roll(v, v.argmax())
    # Leave only the ascending part
    v = v[:v.argmax()]
    # Create baseline using linear interpolation between vertices
    return np.interp(x, x[v], y[v])
baseline_rubber = rubberband(share[:,0], share[:,1])
intensity_rubber = share[:,1] - baseline_rubber
plt.plot(intensity_rubber)
Rubber_plot
peakutils package:
# peakutils approach
import peakutils
baseline_peakutils = peakutils.baseline(share[:,1])
intensity_peakutils = share[:,1] - baseline_peakutils
plt.plot(intensity_peakutils)
Peakutils_plot
Are there any suggestions, aside from masking the low x value data, for constructing a baseline and subtracting the tail without creating a rise in the low x values?
I found a set of similar ALS algorithms here. One of these algorithms, asymmetrically reweighted penalized least squares smoothing (arpls), gives a slightly better fit than als.
# arpls approach
from scipy.linalg import cholesky
def arpls(y, lam=1e4, ratio=0.05, itermax=100):
    r"""
    Baseline correction using asymmetrically
    reweighted penalized least squares smoothing
    Sung-June Baek, Aaron Park, Young-Jin Ahn and Jaebum Choo,
    Analyst, 2015, 140, 250 (2015)
    """
    N = len(y)
    D = sparse.eye(N, format='csc')
    D = D[1:] - D[:-1]  # numpy.diff( ,2) does not work with sparse matrix. This is a workaround.
    D = D[1:] - D[:-1]
    H = lam * D.T * D
    w = np.ones(N)
    for i in range(itermax):
        W = sparse.diags(w, 0, shape=(N, N))
        WH = sparse.csc_matrix(W + H)
        C = sparse.csc_matrix(cholesky(WH.todense()))
        z = spsolve(C, spsolve(C.T, w * y))
        d = y - z
        dn = d[d < 0]
        m = np.mean(dn)
        s = np.std(dn)
        wt = 1. / (1 + np.exp(2 * (d - (2 * s - m)) / s))
        if np.linalg.norm(w - wt) / np.linalg.norm(w) < ratio:
            break
        w = wt
    return z
baseline = baseline_als(share[:,1], 1E6, 0.001)
baseline_subtracted = share[:,1] - baseline
plt.plot(baseline_subtracted, 'r', label='als')
baseline_arpls = arpls(share[:,1], 1e5, 0.1)
intensity_arpls = share[:,1] - baseline_arpls
plt.plot(intensity_arpls, label='arpls')
plt.legend()
ARPLS plot
Fortunately, the improvement is even more pronounced when using the data from the entire spectrum:
Note that the parameters for the two algorithms were different. For now, I think the arpls algorithm is as close as I can get, at least for spectra that look like this. We'll see how robustly the algorithm fits spectra with different shapes. Of course, I am always open to suggestions or improvements!
Have a look at the RamPy library in Python, which provides various baseline-subtraction algorithms, including splines, ARPLS, ALS, polynomial functions, and many more. It also offers various other features, such as resampling, normalisation, and peak-fitting examples.
In your case, a simple spline function fitted before and after the peak should easily do the job. Have a look at this example Jupyter notebook.
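For instance, here is a minimal sketch of that spline idea using only scipy, reusing the share array from the question; the peak window (1.38, 1.42) and the smoothing factor are illustrative guesses, not values taken from the data:

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import UnivariateSpline

x, y = share[:, 0], share[:, 1]

# keep only points away from the peak; adjust the window to the actual spectrum
mask = (x < 1.38) | (x > 1.42)

# UnivariateSpline needs strictly increasing x, and x here is descending
order = np.argsort(x[mask])
spline = UnivariateSpline(x[mask][order], y[mask][order], s=1e5)  # smoothing chosen by eye

baseline_spline = spline(x)
plt.plot(x, y - baseline_spline, label='spline baseline subtracted')
plt.legend()
plt.show()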
I want to use a Hawkes process to model some data, but I could not find whether PyMC supports Hawkes processes. More specifically, I want an observed variable that follows a Hawkes process and to learn a posterior on its parameters.
If it is not built in, could I define it in PyMC in some way, e.g. with @deterministic?
It's been quite a long time since your question, but I've worked it out in PyMC today, so I thought I'd share the gist of my implementation for other people who might come across the same problem. We're going to infer the parameters λ and α of a Hawkes process. I'm not going to cover the temporal scale parameter β; I'll leave that as an exercise for the reader.
First, let's generate some data:
def hawkes_intensity(mu, alpha, points, t):
    p = np.array(points)
    p = p[p <= t]
    p = np.exp(p - t)
    return mu + alpha * np.sum(p)

def simulate_hawkes(mu, alpha, window):
    t = 0
    points = []
    lambdas = []
    while t < window:
        m = hawkes_intensity(mu, alpha, points, t)
        s = np.random.exponential(scale=1/m)
        ratio = hawkes_intensity(mu, alpha, points, t + s)
        t = t + s
        if t < window:
            points.append(t)
            lambdas.append(ratio)
        else:
            break
    points = np.sort(np.array(points, dtype=np.float32))
    lambdas = np.array(lambdas, dtype=np.float32)
    return points, lambdas
# parameters
window = 1000
mu = 8
alpha = 0.25
points, lambdas = simulate_hawkes(mu, alpha, window)
num_points = len(points)
We just generated some temporal points using functions that I adapted from here: https://nbviewer.jupyter.org/github/MatthewDaws/PointProcesses/blob/master/Temporal%20points%20processes.ipynb
Now, the trick is to create a matrix of size (num_points, num_points) that contains the temporal distances of the ith point from all the other points, so the (i, j) entry of the matrix is the temporal interval separating the ith point from the jth. This matrix will be used to compute the sum of exponentials of the Hawkes process, i.e. the self-exciting part. The way to create this matrix, as well as the sum of the exponentials, is a bit tricky; I'd recommend checking every line yourself so you can see what it does.
tile = np.tile(points, num_points).reshape(num_points, num_points)
tile = np.clip(points[:, None] - tile, 0, np.inf)
tile = np.tril(np.exp(-tile), k=-1)
Σ = np.sum(tile, axis=1)[:-1] # this is our self-exciting sum term
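As a sanity check on those three lines (my own addition, not part of the original answer), the same sum can be computed with an explicit loop and compared to the vectorised version:

# naive O(n^2) computation of the self-exciting sum, for comparison
Σ_loop = np.array([np.sum(np.exp(points[:i] - points[i])) for i in range(num_points)])[:-1]
print(np.max(np.abs(Σ - Σ_loop)))   # should be essentially zero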
We now have the points and a vector containing the sum of the excitation terms.
The durations between two consecutive events of a Hawkes process follow an exponential distribution with parameter λ = λ0 + ∑ excitation. This is what we are going to model, but first we have to compute the durations between consecutive points of our generated data.
interval = points[1:] - points[:-1]
We're now ready for inference:
with pm.Model() as model:
    λ = pm.Exponential("λ", 1)
    α = pm.Uniform("α", 0, 1)
    lam = pm.Deterministic("lam", λ + α * Σ)
    interarrival = pm.Exponential(
        "interarrival", lam, observed=interval)
    trace = pm.sample(2000, tune=4000)

pm.plot_posterior(trace, var_names=["λ", "α"])
plt.show()
print(np.mean(trace["λ"]))
print(np.mean(trace["α"]))
7.829
0.284
Note: the tile matrix can become quite large if you have many data points.
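If memory becomes an issue, the sum can also be built without the (num_points, num_points) matrix, using the standard recursion for an exponential kernel with β = 1 (my own addition, not from the original answer): Σ[i+1] = (Σ[i] + 1) * exp(-(t[i+1] - t[i])).

# O(n) alternative to the tile matrix
Σ_rec = np.zeros(num_points)
for i in range(1, num_points):
    Σ_rec[i] = (Σ_rec[i-1] + 1.0) * np.exp(-(points[i] - points[i-1]))
print(np.max(np.abs(Σ - Σ_rec[:-1])))   # matches the matrix-based Σ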
I have used the finite element method to approximate the Laplace equation, and thus have turned it into a matrix system AU = F, where A is the stiffness matrix, and solved for U (not massively important for my question).
I now have my approximation U; when I compute AU I should get back the vector F (or at least something similar), where F is:
AU gives the following plot for x = 0 to x = 1 (say, for 20 nodes):
I then need to interpolate U onto a longer vector and find AU (for a bigger A too, but I am not interpolating that). I interpolate U with the following:
U_inter = interp1d(x,U)
U_rich = U_inter(longer_x)
which seems to work okay until I multiply it with the longer A matrix:
It seems each spike is at a node of x (i.e. the nodes of the original U). Does anybody know what could be causing this? The following is my code to find A, U and F.
import numpy as np
import math
import scipy
from scipy.sparse import diags
import scipy.sparse.linalg
from scipy.interpolate import interp1d
import matplotlib
import matplotlib.pyplot as plt
def Poisson_Stiffness(x0):
    """Finds the Poisson equation stiffness matrix with any non uniform mesh x0"""
    x0 = np.array(x0)
    N = len(x0) - 1  # The amount of elements; x0, x1, ..., xN
    h = x0[1:] - x0[:-1]
    a = np.zeros(N+1)
    a[0] = 1  # BOUNDARY CONDITIONS
    a[1:-1] = 1/h[1:] + 1/h[:-1]
    a[-1] = 1/h[-1]
    a[N] = 1  # BOUNDARY CONDITIONS
    b = -1/h
    b[0] = 0  # BOUNDARY CONDITIONS
    c = -1/h
    c[N-1] = 0  # BOUNDARY CONDITIONS: DIRICHLET
    data = [a.tolist(), b.tolist(), c.tolist()]
    Positions = [0, 1, -1]
    Stiffness_Matrix = diags(data, Positions, (N+1, N+1))
    return Stiffness_Matrix

def NodalQuadrature(x0):
    """Finds the Nodal Quadrature Approximation of sin(pi x)"""
    x0 = np.array(x0)
    h = x0[1:] - x0[:-1]
    N = len(x0) - 1
    approx = np.zeros(len(x0))
    approx[0] = 0  # BOUNDARY CONDITIONS
    for i in range(1, N):
        approx[i] = math.sin(math.pi*x0[i])
        approx[i] = (approx[i]*h[i-1] + approx[i]*h[i])/2
    approx[N] = 0  # BOUNDARY CONDITIONS
    return approx

def Solver(x0):
    Stiff_Matrix = Poisson_Stiffness(x0)
    NodalApproximation = NodalQuadrature(x0)
    NodalApproximation[0] = 0
    U = scipy.sparse.linalg.spsolve(Stiff_Matrix, NodalApproximation)
    return U
x = np.linspace(0,1,10)
rich_x = np.linspace(0,1,50)
U = Solver(x)
A_rich = Poisson_Stiffness(rich_x)
U_inter = interp1d(x,U)
U_rich = U_inter(rich_x)
AUrich = A_rich.dot(U_rich)
plt.plot(rich_x,AUrich)
plt.show()
comment 1:
I added a Stiffness_Matrix = Stiffness_Matrix.tocsr() statement to avoid an efficiency warning. FE calculations are complex enough that I'll have to print out some intermediate values before I can identify what is going on.
comment 2:
plt.plot(rich_x, A_rich.dot(Solver(rich_x))) plots nicely. The noise you get is the result of the difference between the interpolated U_rich and the true solution: U_rich - Solver(rich_x).
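For example, a quick way to see this, reusing the functions and arrays defined in the question:

U_true_rich = Solver(rich_x)                       # solve directly on the fine mesh
plt.plot(rich_x, A_rich.dot(U_true_rich), label='A_rich @ Solver(rich_x): smooth')
plt.plot(rich_x, U_rich - U_true_rich, label='interpolation error U_rich - Solver(rich_x)')
plt.legend()
plt.show()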
comment 3:
I don't think there's a problem with your code. The problem is with the idea that you can test an interpolation this way. I'm rusty on FE theory, but I think you need to use the shape functions to interpolate, not a simple linear one.
comment 4:
Intuitively, with A_rich.dot(U_rich) you are asking what kind of forcing F would produce U_rich. Compared to Solver(rich_x), U_rich has flat spots, regions where its value is less than the true solution. What F would produce that? One that is spiky, with NodalQuadrature(x) at the x points but near-zero values in between. That's what your plot is showing.
A higher-order interpolation will eliminate the flat spots and produce a smoother back-calculated F. But you really need to revisit the FE theory.
You might find it instructive to look at
plt.plot(x,NodalQuadrature(x))
plt.plot(rich_x, NodalQuadrature(rich_x))
The second plot is much smoother, but only about 1/5 as high, because each nodal value scales with the local spacing h and rich_x has roughly five times as many nodes.
Better yet look at:
plt.plot(rich_x,AUrich,'-*') # the spikes
plt.plot(x,NodalQuadrature(x),'o') # original forcing
plt.plot(rich_x, NodalQuadrature(rich_x),'+') # new forcing
In the model the forcing isn't continuous; it is a value at each node. With more nodes (rich_x), the magnitude at each node is smaller.
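One way to compare the forcings on an equal footing (my own illustration, not part of the original answer) is to divide each nodal value by the local spacing h, which recovers the underlying load sin(pi x) on both meshes:

h_coarse = x[1] - x[0]
h_fine = rich_x[1] - rich_x[0]
plt.plot(x, NodalQuadrature(x) / h_coarse, 'o', label='coarse forcing / h')
plt.plot(rich_x, NodalQuadrature(rich_x) / h_fine, '+', label='fine forcing / h')
plt.plot(rich_x, np.sin(np.pi * rich_x), label='sin(pi x)')
plt.legend()
plt.show()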
For the given data, I want to set the outlier values (defined by the 95% confidence level, the 95% quantile function, or whatever is required) to nan. Following are my data and the code that I am using right now. I would be glad if someone could explain this to me further.
import numpy as np, matplotlib.pyplot as plt
data = np.random.rand(1000)+5.0
plt.plot(data)
plt.xlabel('observation number')
plt.ylabel('recorded value')
plt.show()
The problem with using percentiles is that the number of points identified as outliers is a function of your sample size.
There are a huge number of ways to test for outliers, and you should give some thought to how you classify them. Ideally, you should use a priori information (e.g. "anything above/below this value is unrealistic because...").
However, a common, not-too-unreasonable outlier test is to remove points based on their "median absolute deviation".
Here's an implementation for the N-dimensional case (from some code for a paper here: https://github.com/joferkington/oost_paper_code/blob/master/utilities.py):
def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False
    otherwise.

    Parameters:
    -----------
        points : An numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
    """
    if len(points.shape) == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh
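Applied to the setup in the question, the outliers can then be masked as nan like this (a small usage sketch; with data this uniform, few or no points may actually be flagged):

data = np.random.rand(1000) + 5.0
data[is_outlier(data)] = np.nan    # set flagged points to nan
plt.plot(data)
plt.show()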
This is very similar to one of my previous answers, but I wanted to illustrate the sample size effect in detail.
Let's compare a percentile-based outlier test (similar to @CTZhu's answer) with a median-absolute-deviation (MAD) test for a variety of different sample sizes:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def main():
    for num in [10, 50, 100, 1000]:
        # Generate some data
        x = np.random.normal(0, 0.5, num-3)

        # Add three outliers...
        x = np.r_[x, -3, -10, 12]
        plot(x)

    plt.show()

def mad_based_outlier(points, thresh=3.5):
    if len(points.shape) == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation
    return modified_z_score > thresh

def percentile_based_outlier(data, threshold=95):
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

def plot(x):
    fig, axes = plt.subplots(nrows=2)
    for ax, func in zip(axes, [percentile_based_outlier, mad_based_outlier]):
        sns.distplot(x, ax=ax, rug=True, hist=False)
        outliers = x[func(x)]
        ax.plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False)

    kwargs = dict(y=0.95, x=0.05, ha='left', va='top')
    axes[0].set_title('Percentile-based Outliers', **kwargs)
    axes[1].set_title('MAD-based Outliers', **kwargs)
    fig.suptitle('Comparing Outlier Tests with n={}'.format(len(x)), size=14)

main()
Notice that the MAD-based classifier works correctly regardless of sample-size, while the percentile based classifier classifies more points the larger the sample size is, regardless of whether or not they are actually outliers.
Detection of outliers in one-dimensional data depends on its distribution.
1 - Normal distribution:
Data values are distributed fairly evenly over the expected range.
In this case you can easily use any method based on the mean, like a confidence interval of 2 or 3 standard deviations (95% or 99.7%, respectively) for normally distributed data (central limit theorem and sampling distribution of the sample mean). This is a highly effective method.
This is explained in the Khan Academy Statistics and Probability sampling-distribution material.
Another option is a prediction interval, if you want an interval for individual data points rather than for the mean.
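For instance, a rough sketch of a two-sided 95% prediction interval for a single new observation (my own illustration; the data here are synthetic):

import numpy as np
from scipy import stats

d = np.random.normal(50, 5, size=200)
n, mean, s = len(d), d.mean(), d.std(ddof=1)
t = stats.t.ppf(0.975, n - 1)
half_width = t * s * np.sqrt(1 + 1/n)          # prediction interval for one new point
outliers = d[(d < mean - half_width) | (d > mean + half_width)]
print(mean - half_width, mean + half_width, len(outliers))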
Data values are randomly distributed over a range:
The mean may not be a fair representation of the data, because the average is easily influenced by outliers (very small or large values in the data set that are not typical).
The median is another way to measure the center of a numerical data set.
Median absolute deviation - a method which measures the distance of all points from the median in terms of the median distance.
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm has a good explanation, as used in Joe Kington's answer above.
2 - Symmetric distribution: Again, median absolute deviation is a good method, if the z-score calculation and threshold are changed accordingly.
Explanation:
http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers/
3 - Asymmetric distribution: Double MAD - Double Median Absolute Deviation
An explanation is in the link attached above.
Attaching my Python code for reference:
def is_outlier_doubleMAD(self, points):
    """
    FOR ASSYMMETRIC DISTRIBUTION
    Returns : filtered array excluding the outliers
    Parameters : the actual data Points array

    Calculates median to divide data into 2 halves.(skew conditions handled)
    Then those two halves are treated as separate data with calculation same as for symmetric distribution.(first answer)
    Only difference being , the thresholds are now the median distance of the right and left median with the actual data median
    """
    if len(points.shape) == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    medianIndex = (points.size/2)

    leftData = np.copy(points[0:medianIndex])
    rightData = np.copy(points[medianIndex:points.size])

    median1 = np.median(leftData, axis=0)
    diff1 = np.sum((leftData - median1)**2, axis=-1)
    diff1 = np.sqrt(diff1)

    median2 = np.median(rightData, axis=0)
    diff2 = np.sum((rightData - median2)**2, axis=-1)
    diff2 = np.sqrt(diff2)

    med_abs_deviation1 = max(np.median(diff1), 0.000001)
    med_abs_deviation2 = max(np.median(diff2), 0.000001)

    threshold1 = ((median-median1)/med_abs_deviation1)*3
    threshold2 = ((median2-median)/med_abs_deviation2)*3

    # if any threshold is 0 -> no outliers
    if threshold1 == 0:
        threshold1 = sys.maxint
    if threshold2 == 0:
        threshold2 = sys.maxint

    # multiplied by a factor so that only the outermost points are removed
    modified_z_score1 = 0.6745 * diff1 / med_abs_deviation1
    modified_z_score2 = 0.6745 * diff2 / med_abs_deviation2

    filtered1 = []
    i = 0
    for data in modified_z_score1:
        if data < threshold1:
            filtered1.append(leftData[i])
        i += 1

    i = 0
    filtered2 = []
    for data in modified_z_score2:
        if data < threshold2:
            filtered2.append(rightData[i])
        i += 1

    filtered = filtered1 + filtered2
    return filtered
I've adapted the code from http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers and it gives the same results as Joe Kington's, but uses L1 distance instead of L2 distance, and has support for asymmetric distributions. The original R code did not have Joe's 0.6745 multiplier, so I also added that in for consistency within this thread. Not 100% sure if it's necessary, but makes the comparison apples-to-apples.
def doubleMADsfromMedian(y, thresh=3.5):
    # warning: this function does not check for NAs
    # nor does it address issues when
    # more than 50% of your data have identical values
    m = np.median(y)
    abs_dev = np.abs(y - m)
    left_mad = np.median(abs_dev[y <= m])
    right_mad = np.median(abs_dev[y >= m])

    y_mad = left_mad * np.ones(len(y))
    y_mad[y > m] = right_mad
    modified_z_score = 0.6745 * abs_dev / y_mad
    modified_z_score[y == m] = 0
    return modified_z_score > thresh
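A small usage sketch on the same kind of test data used earlier in this thread (three planted outliers; my own addition):

x = np.r_[np.random.normal(0, 0.5, 97), -3, -10, 12]   # 97 inliers plus three planted outliers
print(np.where(doubleMADsfromMedian(x))[0])            # should include the last three indices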
Well, a simple solution can also be to remove anything outside 2 standard deviations (or 1.96):
import random

def outliers(tmp):
    """tmp is a list of numbers"""
    outs = []
    mean = sum(tmp)/(1.0*len(tmp))
    var = sum((tmp[i] - mean)**2 for i in range(0, len(tmp)))/(1.0*len(tmp))
    std = var**0.5
    outs = [tmp[i] for i in range(0, len(tmp)) if abs(tmp[i]-mean) > 1.96*std]
    return outs

lst = [random.randrange(-10, 55) for _ in range(40)]
print(lst)
print(outliers(lst))
Use np.percentile as @Martin suggested:

percentiles = np.percentile(data, [2.5, 97.5])

# select the values inside the central 95% (use >=, <= to include the boundaries)
inside = data[(percentiles[0] < data) & (percentiles[1] > data)]

# set the outliers to np.nan
data[(percentiles[0] > data) | (percentiles[1] < data)] = np.nan