Pythonic way of detecting outliers in one dimensional observation data - python

For the given data, I want to set the outlier values (defined by a 95% confidence level, a 95% quantile function, or whatever is required) to NaN. Below are my data and the code I am using right now. I would be glad if someone could explain further.
import numpy as np, matplotlib.pyplot as plt
data = np.random.rand(1000)+5.0
plt.plot(data)
plt.xlabel('observation number')
plt.ylabel('recorded value')
plt.show()

The problem with using percentiles is that the number of points identified as outliers is a function of your sample size.
There are a huge number of ways to test for outliers, and you should give some thought to how you classify them. Ideally, you should use a-priori information (e.g. "anything above/below this value is unrealistic because...").
However, a common, not-too-unreasonable outlier test is to remove points based on their "median absolute deviation".
Here's an implementation for the N-dimensional case (from some code for a paper here: https://github.com/joferkington/oost_paper_code/blob/master/utilities.py):
def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False
    otherwise.
    Parameters:
    -----------
        points : A numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.
    Returns:
    --------
        mask : A numobservations-length boolean array.
    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
    """
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)
    modified_z_score = 0.6745 * diff / med_abs_deviation
    return modified_z_score > thresh
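To tie this back to the question (setting outliers to NaN), here is a minimal one-dimensional sketch. The helper name mad_mask, the two planted outliers, and the guard for a zero MAD are assumptions made for illustration; they are not part of the code above.

```python
import numpy as np

def mad_mask(values, thresh=3.5):
    """Boolean mask of 1-D outliers using the modified z-score (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    abs_dev = np.abs(values - median)
    mad = np.median(abs_dev)
    if mad == 0:
        # Assumption: with a zero MAD the score is undefined, so flag nothing
        return np.zeros(values.shape, dtype=bool)
    return 0.6745 * abs_dev / mad > thresh

rng = np.random.default_rng(42)
data = rng.random(1000) + 5.0      # same shape of data as in the question
data[[10, 500]] = [25.0, -4.0]     # two planted outliers, for illustration only

cleaned = data.copy()
cleaned[mad_mask(cleaned)] = np.nan
print(int(np.isnan(cleaned).sum()), "values set to NaN")
```

Working on a copy keeps the original observations available for plotting or for trying a different threshold.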
This is very similar to one of my previous answers, but I wanted to illustrate the sample size effect in detail.
Let's compare a percentile-based outlier test (similar to @CTZhu's answer) with a median-absolute-deviation (MAD) test for a variety of different sample sizes:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def main():
    for num in [10, 50, 100, 1000]:
        # Generate some data
        x = np.random.normal(0, 0.5, num-3)
        # Add three outliers...
        x = np.r_[x, -3, -10, 12]
        plot(x)
    plt.show()
def mad_based_outlier(points, thresh=3.5):
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)
    modified_z_score = 0.6745 * diff / med_abs_deviation
    return modified_z_score > thresh
def percentile_based_outlier(data, threshold=95):
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)
def plot(x):
    fig, axes = plt.subplots(nrows=2)
    for ax, func in zip(axes, [percentile_based_outlier, mad_based_outlier]):
        sns.distplot(x, ax=ax, rug=True, hist=False)
        outliers = x[func(x)]
        ax.plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False)
    kwargs = dict(y=0.95, x=0.05, ha='left', va='top')
    axes[0].set_title('Percentile-based Outliers', **kwargs)
    axes[1].set_title('MAD-based Outliers', **kwargs)
    fig.suptitle('Comparing Outlier Tests with n={}'.format(len(x)), size=14)
main()
Notice that the MAD-based classifier works correctly regardless of sample-size, while the percentile based classifier classifies more points the larger the sample size is, regardless of whether or not they are actually outliers.
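The same effect can be checked numerically without plotting. This is a rough sketch under the same setup (normal data plus three planted outliers); the function names and the fixed seed are illustrative assumptions, and the sample size of 10 is left out to keep the counts easy to read.

```python
import numpy as np

def percentile_flags(x, keep=95):
    # Flag everything outside the central `keep` percent of the sample
    tail = (100 - keep) / 2.0
    lo, hi = np.percentile(x, [tail, 100 - tail])
    return (x < lo) | (x > hi)

def mad_flags(x, thresh=3.5):
    # Modified z-score based on the median absolute deviation (1-D case)
    med = np.median(x)
    abs_dev = np.abs(x - med)
    return 0.6745 * abs_dev / np.median(abs_dev) > thresh

rng = np.random.default_rng(0)
counts = {}
for n in (50, 100, 1000):
    x = np.r_[rng.normal(0, 0.5, n - 3), -3, -10, 12]
    counts[n] = (int(percentile_flags(x).sum()), int(mad_flags(x).sum()))
    print(n, "percentile:", counts[n][0], "MAD:", counts[n][1])
```

The percentile count grows roughly as 5% of the sample size, whereas the MAD count should stay close to the three planted outliers.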

Detection of outliers in one-dimensional data depends on its distribution.
1 - Normal distribution:
Data values are almost equally distributed over the expected range:
In this case you can easily use any of the methods that involve the mean, such as an interval of 2 or 3 standard deviations (95% or 99.7%, respectively) for normally distributed data (central limit theorem and sampling distribution of the sample mean). It is a highly effective method.
This is explained in the Khan Academy Statistics and Probability sampling distribution library.
Another option is a prediction interval, if you want an interval for the data points rather than for the mean.
Data values are randomly distributed over a range:
The mean may not be a fair representation of the data, because the average is easily influenced by outliers (very small or large values in the data set that are not typical).
The median is another way to measure the center of a numerical data set.
Median absolute deviation: a method that measures the distance of all points from the median in terms of the median distance.
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm has a good explanation, as mentioned in Joe Kington's answer above.
2 - Symmetric distribution: Again, median absolute deviation is a good method, provided the z-score calculation and threshold are changed accordingly.
Explanation :
http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers/
3 - Asymmetric Distribution : Double MAD - Double Median Absolute Deviation
Explanation in the above attached link
Attaching my python code for reference :
def is_outlier_doubleMAD(self,points):
"""
FOR ASYMMETRIC DISTRIBUTION
Returns : filtered array excluding the outliers
Parameters : the actual data Points array
Calculates median to divide data into 2 halves.(skew conditions handled)
Then those two halves are treated as separate data with calculation same as for symmetric distribution.(first answer)
Only difference being , the thresholds are now the median distance of the right and left median with the actual data median
"""
if len(points.shape) == 1:
points = points[:,None]
median = np.median(points, axis=0)
# Integer division so the index can be used for slicing (the split assumes sorted points)
medianIndex = points.size // 2
leftData = np.copy(points[0:medianIndex])
rightData = np.copy(points[medianIndex:points.size])
median1 = np.median(leftData, axis=0)
diff1 = np.sum((leftData - median1)**2, axis=-1)
diff1 = np.sqrt(diff1)
median2 = np.median(rightData, axis=0)
diff2 = np.sum((rightData - median2)**2, axis=-1)
diff2 = np.sqrt(diff2)
med_abs_deviation1 = max(np.median(diff1),0.000001)
med_abs_deviation2 = max(np.median(diff2),0.000001)
threshold1 = ((median-median1)/med_abs_deviation1)*3
threshold2 = ((median2-median)/med_abs_deviation2)*3
#if any threshold is 0 -> no outliers
if threshold1==0:
threshold1 = float('inf')
if threshold2==0:
threshold2 = float('inf')
#multiplied by a factor so that only the outermost points are removed
modified_z_score1 = 0.6745 * diff1 / med_abs_deviation1
modified_z_score2 = 0.6745 * diff2 / med_abs_deviation2
filtered1 = []
i = 0
for data in modified_z_score1:
if data < threshold1:
filtered1.append(leftData[i])
i += 1
i = 0
filtered2 = []
for data in modified_z_score2:
if data < threshold2:
filtered2.append(rightData[i])
i += 1
filtered = filtered1 + filtered2
return filtered

I've adapted the code from http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers and it gives the same results as Joe Kington's, but uses L1 distance instead of L2 distance, and has support for asymmetric distributions. The original R code did not have Joe's 0.6745 multiplier, so I also added that in for consistency within this thread. Not 100% sure if it's necessary, but makes the comparison apples-to-apples.
def doubleMADsfromMedian(y,thresh=3.5):
    # warning: this function does not check for NAs
    # nor does it address issues when
    # more than 50% of your data have identical values
    m = np.median(y)
    abs_dev = np.abs(y - m)
    left_mad = np.median(abs_dev[y <= m])
    right_mad = np.median(abs_dev[y >= m])
    y_mad = left_mad * np.ones(len(y))
    y_mad[y > m] = right_mad
    modified_z_score = 0.6745 * abs_dev / y_mad
    modified_z_score[y == m] = 0
    return modified_z_score > thresh
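As a worked example of the double-MAD idea, here is a self-contained sketch applied to a small right-skewed sample. The function name double_mad_mask and the sample values are assumptions chosen for illustration; the logic mirrors the function above.

```python
import numpy as np

def double_mad_mask(y, thresh=3.5):
    """Flag outliers using separate MADs left and right of the median."""
    y = np.asarray(y, dtype=float)
    m = np.median(y)
    abs_dev = np.abs(y - m)
    left_mad = np.median(abs_dev[y <= m])    # spread of the lower half
    right_mad = np.median(abs_dev[y >= m])   # spread of the upper half
    scale = np.where(y > m, right_mad, left_mad)
    z = 0.6745 * abs_dev / scale
    z[y == m] = 0.0
    return z > thresh

# Illustrative right-skewed sample: median 5, left MAD 0.5, right MAD 2
y = np.array([1, 4, 4, 4, 5, 5, 5, 5, 7, 7, 8, 10, 16, 30], dtype=float)
flagged = y[double_mad_mask(y)]
print(flagged)  # flags 1, 16 and 30
```

The value 10 is not flagged because its distance is measured against the wider right-hand spread, while 1 is flagged because the left-hand side is much tighter.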

Well, a simple solution can also be to remove anything that lies outside 2 standard deviations (or 1.96):
import random
def outliers(tmp):
    """tmp is a list of numbers"""
    outs = []
    mean = sum(tmp)/(1.0*len(tmp))
    var = sum((tmp[i] - mean)**2 for i in range(0, len(tmp)))/(1.0*len(tmp))
    std = var**0.5
    outs = [tmp[i] for i in range(0, len(tmp)) if abs(tmp[i]-mean) > 1.96*std]
    return outs
lst = [random.randrange(-10, 55) for _ in range(40)]
print(lst)
print(outliers(lst))
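The same rule can be written with NumPy in a couple of lines. This is only a sketch; the function name sigma_outliers and the toy input are assumptions. Keep in mind that the mean and the standard deviation are themselves pulled by the outliers, which is the weakness the MAD-based answers address.

```python
import numpy as np

def sigma_outliers(values, k=1.96):
    """Return the values lying more than k population standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    return values[np.abs(values - values.mean()) > k * values.std()]

sample = [1] * 20 + [100]   # illustrative data: one obvious outlier
found = sigma_outliers(sample)
print(found)
```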

Use np.percentile as @Martin suggested:
percentiles = np.percentile(data, [2.5, 97.5])
# or >=, <= for within 95%
data[(percentiles[0]<data) & (percentiles[1]>data)]
# set the outliers to np.nan
data[(percentiles[0]>data) | (percentiles[1]<data)] = np.nan

Related

Scale Sections of Data to between -1 and 1

Working with a 2D signal/time series dataset, after finding peaks and troughs, I would like to scale each section of the dataset appropriately.
For example, if I have the following visual dataset, with peaks and troughs labeled as such:
...what's a good "pythonic" way to label every other datapoint between each peak and trough to be a number > -1 and < 1, sort of like so:
I have provided a reproducible code below to experiment with.
NOTE: I'm running Windows 10, Python 3.10.5.
pip install findpeaks
from numpy import array, inf, nan, where
# pip install findpeaks
from findpeaks import findpeaks
from random import gauss, seed
from math import sqrt, exp
# ------------------------------------------------------------------------------------------------ #
# GENERATE RANDOM SIGNAL DATA #
# ------------------------------------------------------------------------------------------------ #
# https://towardsdatascience.com/create-a-stock-price-simulator-with-python-b08a184f197d
def create_GBM(s0, mu, sigma):
"""
Generates a price following a geometric brownian motion process based on the input of the arguments:
- s0: Asset inital price.
- mu: Interest rate expressed annual terms.
- sigma: Volatility expressed annual terms.
"""
st = s0
def generate_value():
nonlocal st
st *= exp((mu - 0.5 * sigma ** 2) * (1. / 365.) + sigma * sqrt(1./365.) * gauss(mu=0, sigma=1))
return st
return generate_value
gbm = create_GBM(100, 0.001, 1.0)
signal = [round(gbm(), 2) for _ in range(10000)]
print(signal)
# ------------------------------------------------------------------------------------------------ #
# FIND PEAKS AND TROUGHS DATAFRAME #
# ------------------------------------------------------------------------------------------------ #
print("Finding peaks/troughs....")
fp = findpeaks(method='peakdetect')
results = fp.fit(array(signal).flatten())
results_df = results['df']
results_df['label'] = where(results_df['valley'], -1,
where(results_df['peak'], 1, nan))
print(results_df)
# ------------------------------------------------------------------------------------------------ #
# FILL NAN's WITH THEIR APPROPRIATE VALUES, SCALED BETWEEN -1 and 1 #
# ------------------------------------------------------------------------------------------------ #
# ????????????????????????????
Given that the results_df gives the y values, along with some x indexes on where they are, I was hoping there'd be a one-liner for this.
Another thought I had would be to iterate through the results df, peak to trough, then trough to peak (repeat) and MinMaxScale everything between the start and end of each section, as we know what those values are. Something like:
UPDATE
I have a hacky solution here, HOWEVER IT'S NOT WORKING! So treat it as pseudo-code for now, but it looks like this so far. I feel there's an easier way...
# ------------------------------------------------------------------------------------------------ #
# FILL NAN's WITH THEIR APPROPRIATE VALUES, SCALED BETWEEN -1 and 1 #
# ------------------------------------------------------------------------------------------------ #
# Drop nan's from label column to make things easier for iteration
results_df = results_df.dropna()
print(results_df)
# Iterate through the results_df, starting at 1, not 0
for i in range(1, len(results_df)):
# Find the current values for this "section" of the signal dataset
if results_df['label'].iloc[i] > 0:
peak_value = results_df['y'].iloc[i]
peak_value_index = results_df['x'].iloc[i]
trough_value = results_df['y'].iloc[i-1]
trough_value_index = results_df['x'].iloc[i-1]
else:
peak_value = results_df['y'].iloc[i-1]
peak_value_index = results_df['x'].iloc[i-1]
trough_value = results_df['y'].iloc[i]
trough_value_index = results_df['x'].iloc[i]
# Find the current min value
current_min_value = min(peak_value, trough_value)
# Find the difference between the max and min values
current_difference = max(peak_value, trough_value) - min(peak_value, trough_value)
# Now iterate through that "section" of the signal list, and scale accordingly
for j in range(min(peak_value_index, trough_value_index), max(peak_value_index, trough_value_index)+1): # +1 to ensure last datapoint isn't missed
signal[j] = (signal[j] - current_min_value) / current_difference - 1
# Inspect the newly scaled signals at the peak/trough points to ensure they're correct
for i in range(0, len(results_df)):
print(signal[results_df['x'].iloc[i]])
My code can be found below. There are two remarks:
My implementation is a variation on your approach with two notable differences. First, I directly iterate through the segments and find these indices outside of the for-loop. Second, your transformation seems to be missing a factor of 2. That is, I take transformation = -1 + 2* (value-min)/(max-min) to ensure that the transformed value is +1 whenever value=max.
I also added some code to plot the original series and its transformation together. This allows us to visually check whether the transformation was successful. In general, the transformation seems to be working, but occasionally the peak detection algorithm misses a peak/trough. The transformation then receives the wrong input, and the result is no longer guaranteed to be in the [-1,1] interval.
#!/usr/bin/env python3
from numpy import argwhere, array, inf, isnan, nan, transpose, where, zeros
# pip install findpeaks
from findpeaks import findpeaks
from random import gauss, seed
from math import sqrt, exp
import matplotlib.pyplot as plt
# ------------------------------------------------------------------------------------------------ #
# GENERATE RANDOM SIGNAL DATA #
# ------------------------------------------------------------------------------------------------ #
# https://towardsdatascience.com/create-a-stock-price-simulator-with-python-b08a184f197d
def create_GBM(s0, mu, sigma):
"""
Generates a price following a geometric brownian motion process based on the input of the arguments:
- s0: Asset inital price.
- mu: Interest rate expressed annual terms.
- sigma: Volatility expressed annual terms.
"""
st = s0
def generate_value():
nonlocal st
st *= exp((mu - 0.5 * sigma ** 2) * (1. / 365.) + sigma * sqrt(1./365.) * gauss(mu=0, sigma=1))
return st
return generate_value
gbm = create_GBM(100, 0.001, 1.0)
signal = [round(gbm(), 2) for _ in range(10000)]
print(signal)
# ------------------------------------------------------------------------------------------------ #
# FIND PEAKS AND TROUGHS DATAFRAME #
# ------------------------------------------------------------------------------------------------ #
print("Finding peaks/troughs....")
fp = findpeaks(method='peakdetect')
results = fp.fit(array(signal).flatten())
results_df = results['df']
results_df['label'] = where(results_df['valley'], -1,
where(results_df['peak'], 1, nan))
print(results_df)
# ------------------------------------------------------------------------------------------------ #
# FILL NAN's WITH THEIR APPROPRIATE VALUES, SCALED BETWEEN -1 and 1 #
# ------------------------------------------------------------------------------------------------ #
# Convert some results to numpy arrays
label = results_df["label"].to_numpy()
y = transpose(results_df["y"].to_numpy())
# Indices to beginning and ends of segments
indices = argwhere(~isnan(label))
# Initialize output
signal = zeros( (len(results_df),1) )
# Compute signal for all segments
for segment in range(1,len(indices)):
# Indices of current segments
start_index = indices[segment-1][0]
end_index = indices[segment][0]
# Values at the start and end of the segment
yvalue_start = y[start_index]
yvalue_end = y[end_index]
# Determine trough and peak values
if yvalue_start<yvalue_end:
trough_value = yvalue_start
peak_value = yvalue_end
else:
trough_value = yvalue_end
peak_value = yvalue_start
current_difference = peak_value-trough_value
# Inform user
print("Segment {} from index {} to {} with trough={} and peak={}".format(segment, start_index, end_index, trough_value, peak_value))
signal[start_index:(end_index+1), 0] = -1.0 + (2/current_difference) * (y[start_index:(end_index+1)]-trough_value)
fig, axs = plt.subplots(2, 1)
axs[0].plot(y)
axs[0].set_title('Original series')
axs[1].plot(signal)
axs[1].set_title('Converted signal')
plt.show()
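To see the transformation in isolation, here is a small sketch on a hand-made series where the peak and trough positions are given directly instead of being detected. The function name scale_between_extrema and the toy data are assumptions; the sketch also assumes that consecutive extrema never have equal values, since that would divide by zero.

```python
import numpy as np

def scale_between_extrema(y, extrema_idx):
    """Rescale each segment between consecutive extrema to [-1, 1]."""
    y = np.asarray(y, dtype=float)
    out = np.full(y.shape, np.nan)
    for start, end in zip(extrema_idx[:-1], extrema_idx[1:]):
        seg = y[start:end + 1]
        lo, hi = min(seg[0], seg[-1]), max(seg[0], seg[-1])
        # -1 + 2 * (value - min) / (max - min): trough maps to -1, peak to +1
        out[start:end + 1] = -1.0 + 2.0 * (seg - lo) / (hi - lo)
    return out

y = [1.0, 2.0, 3.0, 2.5, 2.0, 4.0]   # trough, ..., peak, ..., trough, peak
scaled = scale_between_extrema(y, [0, 2, 4, 5])
print(scaled)  # [-1.  0.  1.  0. -1.  1.]
```

Shared boundary points are written twice, but both segments assign them the same value (+1 at a peak, -1 at a trough), so the overlap is harmless.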

Speeding up normal distribution probability mass allocation

We have N users with P avg. points per user, where each point is a single value between 0 and 1. We need to distribute the mass of each point using a normal distribution with a known standard deviation of 0.05, as the points have some uncertainty. Additionally, we need to wrap the mass around 0 and 1 such that e.g. a point at 0.95 will also allocate mass around 0. I've provided a working example below, which bins the normal distribution into D=50 bins. The example uses the Python typing module, but you can ignore that if you'd like.
from typing import List, Any
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
D = 50
BINS: List[float] = np.linspace(0, 1, D + 1).tolist()
def probability_mass(distribution: Any, x0: float, x1: float) -> float:
"""
Computes the area under the distribution, wrapping at 1.
The wrapping is done by adding the PDF at +- 1.
"""
assert x1 > x0
return (
(distribution.cdf(x1) - distribution.cdf(x0))
+ (distribution.cdf(x1 + 1) - distribution.cdf(x0 + 1))
+ (distribution.cdf(x1 - 1) - distribution.cdf(x0 - 1))
)
def point_density(x: float) -> List[float]:
distribution: Any = scipy.stats.norm(loc=x, scale=0.05)
density: List[float] = []
for i in range(D):
density.append(probability_mass(distribution, BINS[i], BINS[i + 1]))
return density
def user_density(points: List[float]) -> Any:
# Find the density of each point
density: Any = np.array([point_density(p) for p in points])
# Combine points and normalize
combined = density.sum(axis=0)
return combined / combined.sum()
if __name__ == "__main__":
# Example for one user
data: List[float] = [.05, .3, .5, .5]
density = user_density(data)
# Example for multiple users (N = 2)
print([user_density(x) for x in [[.3, .5], [.7, .7, .7, .9]]])
### NB: THE REMAINING CODE IS FOR ILLUSTRATION ONLY!
### NB: THE IMPORTANT THING IS TO COMPUTE THE DENSITY FAST!
middle: List[float] = []
for i in range(D):
middle.append((BINS[i] + BINS[i + 1]) / 2)
plt.bar(x=middle, height=density, width=1.0 / D + 0.001)
plt.xlim(0, 1)
plt.xlabel("x")
plt.ylabel("Density")
plt.show()
In this example N=1, D=50, P=4. However, we want to scale this approach to N=10000 and P=100 while being as fast as possible. It's unclear to me how we'd vectorize this approach. How do we best speed up this?
EDIT
The faster solution can have slightly different results. For instance, it could approximate the normal distribution instead of using the precise normal distribution.
EDIT2
We only care about computing density using the user_density() function. The plot is only to help explain the approach. We do not care about the plot itself :)
EDIT3
Note that P is the avg. points per user. Some users may have more and some may have less. If it helps, you can assume that we can throw away points such that all users have a max of 2 * P points. It's fine to ignore this part while benchmarking as long as the solution can handle a flexible # of points per user.
You could get below 50 ms for the largest case (N=10000, AVG[P]=100, D=50) by using FFT and creating the data in a numpy-friendly format. Otherwise it will be closer to 300 ms.
The idea is to convolve a single normal distribution centered at 0 with a series of Dirac deltas.
See image below:
Using circular convolution solves two issues.
naturally deals with wrapping at the edges
can be efficiently computed with FFT and Convolution Theorem
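Both points can be checked on a tiny example. In this sketch the 8-bin kernel and the comb are made-up values; it compares the FFT route with a direct circular convolution built from np.roll.

```python
import numpy as np

D = 8
# Illustrative wrapped bump centered at bin 0 (mass leaks into bins 1 and 7)
kernel = np.array([0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25])
# Comb with unit spikes at bins 0 and 5
comb = np.zeros(D)
comb[[0, 5]] = 1.0

# Convolution theorem: multiply the spectra, then transform back
via_fft = np.fft.irfft(np.fft.rfft(comb) * np.fft.rfft(kernel), n=D)

# Direct circular convolution: one shifted copy of the kernel per spike
direct = sum(comb[k] * np.roll(kernel, k) for k in range(D))

print(np.round(via_fft, 3))
```

The spike at bin 0 contributes mass to bin 7, which is the wrap-around behaviour the question asks for.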
First one must create a distribution to be copied. The function mk_bell() creates a histogram of a normal distribution with stddev 0.05 centered at 0.
The distribution wraps around 1. One could use an arbitrary distribution here. The spectrum of the distribution is computed and used for fast convolution.
Next, a comb-like function is created. The peaks are placed at indices corresponding to the peaks in the user density. E.g.
peak_location = [0.1, 0.3, 0.7]
D = 10
maps to
peak_index = (D * peak_location).astype(int) = [1, 3, 7]
dist = [0, 1, 0, 1, 0, 0, 0, 1, 0, 0] # ones at [1, 3, 7]
You can quickly create a composition of Dirac deltas by computing the bin index of each peak location and counting them with the np.bincount() function.
To speed things up even more, one can compute the comb functions for all users in parallel.
The array dist is a 2D array of shape NxD. It can be linearized to a 1D array of shape (N*D). After this change, the element at position [user_id, peak_index] is accessible at index user_id*D + peak_index.
With a numpy-friendly input format (described below) this operation is easily vectorized.
The convolution theorem says that the spectrum of the convolution of two signals is equal to the product of the spectra of the two signals.
The spectrum is computed with numpy.fft.rfft, which is a variant of the Fast Fourier Transform dedicated to real-only signals (no imaginary part).
Numpy can compute the FFT of each row of a larger matrix with one command.
Next, the spectrum of the convolution is computed by simple multiplication and use of broadcasting.
Finally, the spectrum is transformed back to the "time" domain by the inverse Fourier transform implemented in numpy.fft.irfft.
To use the full speed of numpy one should avoid variable size data structure and keep to fixed size arrays. I propose to represent input data as three arrays.
uids the identifier for user, integer 0..N-1
peaks, the location of the peak
mass, the mass of the peak, currently it is 1/number-of-peaks-for-user
This representation of data allows quick vectorized processing.
Eg:
user_data = [[0.1, 0.3], [0.5]]
maps to:
uids = [0, 0, 1] # 2 points for user_data[0], one from user_data[1]
peaks = [0.1, 0.3, 0.5] # serialized user_data
mass = [0.5, 0.5, 1] # scaling factors for each peak, 0.5 means 2 peaks for user 0
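A short sketch of how these three arrays could be turned into the per-user comb matrix with a single np.bincount call. It uses the integer form user_id*D + peak_index of the flat index, which is an illustrative variant rather than the exact expression used in the full code.

```python
import numpy as np

D = 10
N = 2
uids = np.array([0, 0, 1])
peaks = np.array([0.1, 0.3, 0.5])
mass = np.array([0.5, 0.5, 1.0])

# Bin index of every peak, then the position in the flattened N*D array
peak_index = np.floor(D * peaks).astype(int)
flat_index = uids * D + peak_index

# Sum the masses per flat index and fold back into one row per user
dist = np.bincount(flat_index, weights=mass, minlength=N * D).reshape(N, D)
print(dist)
```

Each row of dist sums to 1 because the masses of a user were chosen as 1/number-of-peaks.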
The code:
import numpy as np
import matplotlib.pyplot as plt
import time
def mk_bell(D, SIGMA):
# computes normal distribution wrapped and centered at zero
x = np.linspace(0, 1, D, endpoint=False)
x = (x + 0.5) % 1 - 0.5
bell = np.exp(-0.5*np.square(x / SIGMA))
return bell / bell.sum()
def user_densities_by_fft(uids, peaks, mass, D, N=None):
bell = mk_bell(D, 0.05).astype('f4')
sbell = np.fft.rfft(bell)
if N is None:
N = uids.max() + 1
# ensure that peaks are in the [0..1) interval
peaks = peaks - np.floor(peaks)
# convert peak location from 0-1 to the indices
pidx = (D * (peaks + uids)).astype('i4')
dist = np.bincount(pidx, mass, N * D).reshape(N, D)
# process all users at once with Convolution Theorem
sdist = np.fft.rfft(dist)
sdist *= sbell
res = np.fft.irfft(sdist)
return res
def generate_data(N, Pmean):
# generator for large data
data = []
for n in range(N):
# select P uniformly from 1..2*Pmean
P = np.random.randint(2 * Pmean) + 1
# select peak locations
chunk = np.random.uniform(size=P)
data.append(chunk.tolist())
return data
def make_data_numpy_friendly(data):
uids = []
chunks = []
mass = []
for uid, peaks in enumerate(data):
uids.append(np.full(len(peaks), uid))
mass.append(np.full(len(peaks), 1 / len(peaks)))
chunks.append(peaks)
return np.hstack(uids), np.hstack(chunks), np.hstack(mass)
D = 50
# demo for simple multi-distribution
data, N = [[0, .5], [.7, .7, .7, .9], [0.05, 0.3, 0.5, 0.5]], None
uids, peaks, mass = make_data_numpy_friendly(data)
dist = user_densities_by_fft(uids, peaks, mass, D, N)
plt.plot(dist.T)
plt.show()
# the actual measurement
N = 10000
P = 100
data = generate_data(N, P)
tic = time.time()
uids, peaks, mass = make_data_numpy_friendly(data)
toc = time.time()
print(f"make_data_numpy_friendly: {toc - tic}")
tic = time.time()
dist = user_densities_by_fft(uids, peaks, mass, D, N)
toc = time.time()
print(f"user_densities_by_fft: {toc - tic}")
The results on my 4-core Haswell machine are:
make_data_numpy_friendly: 0.2733159065246582
user_densities_by_fft: 0.04064297676086426
It took 40 ms to process the data. Notice that converting the data to the numpy-friendly format takes more than 6 times longer than the actual computation of the distributions.
Python is really slow when it comes to looping.
Therefore I strongly recommend generating the input data directly in a numpy-friendly way in the first place.
There are some issues to be fixed:
precision, which can be improved by using a larger D and downsampling
accuracy of the peak location, which could be further improved by widening the spikes
performance, since scipy.fft offers more variants of FFT implementations that may be faster
This would be my vectorized approach:
data = np.array([0.05, 0.3, 0.5, 0.5])
np.random.seed(31415)
# random noise
randoms = np.random.normal(0,1,(len(data), int(1e5))) * 0.05
# samples with noise
samples = data[:,None] + randoms
# wrap [0,1]
samples = (samples % 1).ravel()
# histogram
hist, bins, patches = plt.hist(samples, bins=BINS, density=True)
Output:
I was able to reduce the time from about 4 seconds per sample of 100 datapoints to about 1 ms per sample.
It looks to me like you're spending quite a lot of time simulating a very large number of normal distributions. Since you're dealing with a very large sample size anyway, you may as well just use standard normal distribution values, because it'll all just average out anyway.
I recreated your approach (BaseMethod class), then created an optimized class (OptimizedMethod class), and evaluated them using a timeit decorator. The primary difference in my approach is the following line:
# Generate a standardized set of values to add to each sample to simulate normal distribution
self.norm_vals = np.array([norm.ppf(x / norm_val_n) * 0.05 for x in range(1, norm_val_n, 1)])
This creates a generic set of datapoints based on an inverse normal cumulative distribution function that we can add to each datapoint to simulate a normal distribution around that point. Then we just reshape the data into user samples and run np.histogram on the samples.
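To make the idea concrete, here is a small sketch of such a fixed offset grid applied to a single point near the upper edge. It swaps in statistics.NormalDist from the standard library for scipy's norm.ppf, and the point 0.97 and the 10 bins are arbitrary illustrative choices.

```python
from statistics import NormalDist
import numpy as np

n = 50        # number of quantile steps (plays the role of norm_val_n)
sigma = 0.05

# Evenly spaced quantiles of N(0, sigma): a deterministic stand-in for random draws
offsets = np.array([NormalDist().inv_cdf(k / n) for k in range(1, n)]) * sigma

# Spread one point over the grid and wrap into [0, 1)
samples = (0.97 + offsets) % 1.0
hist, _ = np.histogram(samples, bins=np.linspace(0, 1, 11))
print(hist)
```

Part of the mass lands in the first bin because offsets that push the point past 1 wrap around to the start of the interval.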
import numpy as np
import scipy.stats
from scipy.stats import norm
import time
# timeit decorator for evaluating performance
def timeit(method):
def timed(*args, **kw):
ts = time.time()
result = method(*args, **kw)
te = time.time()
print('%r %2.2f ms' % (method.__name__, (te - ts) * 1000 ))
return result
return timed
# Define Variables
N = 10000
D = 50
P = 100
# Generate sample data
np.random.seed(0)
data = np.random.rand(N, P)
# Run OP's method for comparison
class BaseMethod:
def __init__(self, d=50):
self.d = d
self.bins = np.linspace(0, 1, d + 1).tolist()
def probability_mass(self, distribution, x0, x1):
"""
Computes the area under the distribution, wrapping at 1.
The wrapping is done by adding the PDF at +- 1.
"""
assert x1 > x0
return (
(distribution.cdf(x1) - distribution.cdf(x0))
+ (distribution.cdf(x1 + 1) - distribution.cdf(x0 + 1))
+ (distribution.cdf(x1 - 1) - distribution.cdf(x0 - 1))
)
def point_density(self, x):
distribution = scipy.stats.norm(loc=x, scale=0.05)
density = []
for i in range(self.d):
density.append(self.probability_mass(distribution, self.bins[i], self.bins[i + 1]))
return density
@timeit
def base_user_density(self, data):
n = data.shape[0]
density = np.empty((n, self.d))
for i in range(data.shape[0]):
# Find the density of each point
row_density = np.array([self.point_density(p) for p in data[i]])
# Combine points and normalize
combined = row_density.sum(axis=0)
density[i, :] = combined / combined.sum()
return density
base = BaseMethod(d=D)
# Only running base method on first 2 rows of data because it's slow
density = base.base_user_density(data[:2])
print(density[:2, :5])
class OptimizedMethod:
def __init__(self, d=50, norm_val_n=50):
self.d = d
self.norm_val_n = norm_val_n
self.bins = np.linspace(0, 1, d + 1).tolist()
# Generate a standardized set of values to add to each sample to simulate normal distribution
self.norm_vals = np.array([norm.ppf(x / norm_val_n) * 0.05 for x in range(1, norm_val_n, 1)])
@timeit
def optimized_user_density(self, data):
samples = np.empty((data.shape[0], data.shape[1], self.norm_val_n - 1))
# transform datapoints to normal distributions around datapoint
for i in range(self.norm_vals.shape[0]):
samples[:, :, i] = data + self.norm_vals[i]
samples = samples.reshape(samples.shape[0], -1)
#wrap around [0, 1]
samples = samples % 1
#loop over samples for density
density = np.empty((data.shape[0], self.d))
for i in range(samples.shape[0]):
hist, bins = np.histogram(samples[i], bins=self.bins)
density[i, :] = hist / hist.sum()
return density
om = OptimizedMethod()
#Run optimized method on first 2 rows for apples to apples comparison
density = om.optimized_user_density(data[:2])
#Run optimized method on full data
density = om.optimized_user_density(data)
print(density[:2, :5])
Running on my system, the original method took about 8.4 seconds to run on 2 rows of data, while the optimized method took 1 millisecond to run on 2 rows of data and completed 10,000 rows in 4.7 seconds. I printed the first five values of the first 2 samples for each method.
'base_user_density' 8415.03 ms
[[0.02176227 0.02278653 0.02422535 0.02597123 0.02745976]
[0.0175103 0.01638513 0.01524853 0.01432158 0.01391156]]
'optimized_user_density' 1.09 ms
'optimized_user_density' 4755.49 ms
[[0.02142857 0.02244898 0.02530612 0.02612245 0.0277551 ]
[0.01673469 0.01653061 0.01510204 0.01428571 0.01326531]]

Autocorrelation to estimate periodicity with numpy

I have a large set of time series (> 500), I'd like to select only the ones that are periodic. I did a bit of literature research and I found out that I should look for autocorrelation. Using numpy I calculate the autocorrelation as:
def autocorr(x):
    norm = x - np.mean(x)
    result = np.correlate(norm, norm, mode='full')
    acorr = result[result.size//2:]
    acorr /= ( x.var() * np.arange(x.size, 0, -1) )
    return acorr
This returns a set of coefficients (r?) that, when plotted, should tell me if the time series is periodic or not.
I generated two toy examples:
#random signal
s1 = np.random.randint(5, size=80)
#periodic signal
s2 = np.array([5,2,3,1] * 20)
When I generate the autocorrelation plots I obtain:
The second autocorrelation vector clearly indicates some periodicity:
Autocorr1 = [1, 0.28, -0.06, 0.19, -0.22, -0.13, 0.07 ..]
Autocorr2 = [1, -0.50, -0.49, 1, -0.50, -0.49, 1 ..]
My question is, how can I automatically determine, from the autocorrelation vector, if a time series is periodic? Is there a way to summarise the values into a single coefficient, e.g. 1 for perfect periodicity and 0 for no periodicity at all? I tried to calculate the mean but it is not meaningful. Should I look at the number of 1s?
I would use mode='same' instead of mode='full' because with mode='full' we get covariances for extreme shifts, where just 1 array element overlaps self, the rest being zeros. Those are not going to be interesting. With mode='same' at least half of the shifted array overlaps the original one.
Also, to have the true correlation coefficient (r) you need to divide by the size of the overlap, not by the size of the original x. (in my code these are np.arange(n-1, n//2, -1)). Then each of the outputs will be between -1 and 1.
A glance at Durbin–Watson statistic, which is similar to 2(1-r), suggests that people consider its values below 1 to be a significant indication of autocorrelation, which corresponds to r > 0.5. So this is what I use below. For a statistically sound treatment of the significance of autocorrelation refer to statistics literature; a starting point would be to have a model for your time series.
def autocorr(x):
    n = x.size
    norm = (x - np.mean(x))
    result = np.correlate(norm, norm, mode='same')
    acorr = result[n//2 + 1:] / (x.var() * np.arange(n-1, n//2, -1))
    lag = np.abs(acorr).argmax() + 1
    r = acorr[lag-1]
    if np.abs(r) > 0.5:
        print('Appears to be autocorrelated with r = {}, lag = {}'.format(r, lag))
    else:
        print('Appears to be not autocorrelated')
    return r, lag
Output for your two toy examples:
Appears to be not autocorrelated
Appears to be autocorrelated with r = 1.0, lag = 4
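To apply this to a whole collection of series, a non-printing variant is handier. The following is only a sketch of that idea: the names max_autocorr and is_periodic and the dictionary of series are made up for illustration, and the 0.5 threshold is the same rule of thumb as above, not a statistically derived cut-off.

```python
import numpy as np

def max_autocorr(x):
    """Return (r, lag) for the lag >= 1 with the largest |autocorrelation|."""
    x = np.asarray(x, dtype=float)
    n = x.size
    norm = x - x.mean()
    result = np.correlate(norm, norm, mode='same')
    # divide each lag by the variance times the size of the overlap
    acorr = result[n // 2 + 1:] / (x.var() * np.arange(n - 1, n // 2, -1))
    lag = np.abs(acorr).argmax() + 1
    return acorr[lag - 1], lag

def is_periodic(x, threshold=0.5):
    """Rule-of-thumb test: strongest autocorrelation exceeds the threshold."""
    r, _ = max_autocorr(x)
    return abs(r) > threshold

# hypothetical collection of series, keyed by name
rng = np.random.default_rng(0)
series = {
    "random": rng.integers(0, 5, size=80),
    "periodic": np.array([5, 2, 3, 1] * 20),
}
selected = [name for name, s in series.items() if is_periodic(s)]
r, lag = max_autocorr(series["periodic"])
print(selected, r, lag)
```

Note that a short random series can occasionally exceed the threshold by chance, so the cut-off is worth tuning on data you know.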

How do you compute the confidence interval for Pearson's r in Python?

In Python, I know how to calculate r and associated p-value using scipy.stats.pearsonr, but I'm unable to find a way to calculate the confidence interval of r. How is this done? Thanks for any help :)
According to [1], calculating a confidence interval directly from Pearson's r is complicated because r is not normally distributed. The following steps are needed:
1. Convert r to z' (the Fisher transformation).
2. Calculate the z' confidence interval. The sampling distribution of z' is approximately normal and has a standard error of 1/sqrt(n-3).
3. Convert the confidence interval back to r.
Here is some sample code:
import math
from scipy import stats

def r_to_z(r):
    return math.log((1 + r) / (1 - r)) / 2.0

def z_to_r(z):
    e = math.exp(2 * z)
    return (e - 1) / (e + 1)

def r_confidence_interval(r, alpha, n):
    # alpha is the significance level, e.g. 0.05 for a 95% interval
    z = r_to_z(r)
    se = 1.0 / math.sqrt(n - 3)
    z_crit = stats.norm.ppf(1 - alpha/2)  # 2-tailed z critical value
    lo = z - z_crit * se
    hi = z + z_crit * se
    # Return a sequence
    return (z_to_r(lo), z_to_r(hi))
Reference:
http://onlinestatbook.com/2/estimation/correlation_ci.html
Using rpy2 and the psychometric library (you will need R installed, and you need to run install.packages("psychometric") within R first):
from rpy2.robjects.packages import importr
psychometric = importr('psychometric')
psychometric.CIr(r=.9, n=100, level=.95)
Here 0.9 is your correlation, n is the sample size and 0.95 is the confidence level.
Here's a solution that uses bootstrapping to compute the confidence interval, rather than the Fisher transformation (which assumes bivariate normality, etc.), borrowing from this answer:
import numpy as np

def pearsonr_ci(x, y, ci=95, n_boots=10000):
    x = np.asarray(x)
    y = np.asarray(y)
    # (n_boots, n_observations) paired arrays
    rand_ixs = np.random.randint(0, x.shape[0], size=(n_boots, x.shape[0]))
    x_boots = x[rand_ixs]
    y_boots = y[rand_ixs]
    # differences from mean
    x_mdiffs = x_boots - x_boots.mean(axis=1)[:, None]
    y_mdiffs = y_boots - y_boots.mean(axis=1)[:, None]
    # sums of squares
    x_ss = np.einsum('ij, ij -> i', x_mdiffs, x_mdiffs)
    y_ss = np.einsum('ij, ij -> i', y_mdiffs, y_mdiffs)
    # pearson correlations
    r_boots = np.einsum('ij, ij -> i', x_mdiffs, y_mdiffs) / np.sqrt(x_ss * y_ss)
    # upper and lower bounds for confidence interval
    ci_low = np.percentile(r_boots, (100 - ci) / 2)
    ci_high = np.percentile(r_boots, (ci + 100) / 2)
    return ci_low, ci_high
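For a quick sanity check of the bootstrap idea, here is a minimal, self-contained sketch on synthetic data. It deliberately uses a plain loop with np.corrcoef rather than the vectorised einsum version above, and the function name bootstrap_r_ci, the seed and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def bootstrap_r_ci(x, y, ci=95, n_boots=2000, seed=0):
    """Percentile bootstrap interval for Pearson's r (loop-based sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    r_boots = np.empty(n_boots)
    for b in range(n_boots):
        # resample (x, y) pairs with replacement, keeping pairs together
        idx = rng.integers(0, x.size, size=x.size)
        r_boots[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    return tuple(np.percentile(r_boots, [(100 - ci) / 2, (100 + ci) / 2]))

# synthetic data with a true correlation of 0.8
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.8 * x + 0.6 * rng.normal(size=200)

r_hat = np.corrcoef(x, y)[0, 1]
ci_low, ci_high = bootstrap_r_ci(x, y)
print(r_hat, ci_low, ci_high)
```

The sample correlation should fall inside the interval; if it does not, the resampling is probably not keeping the x and y pairs together.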
The answer given by bennylp is mostly correct; however, its critical value is only right if alpha is the significance level (e.g. 0.05). If you pass the confidence level (e.g. 0.95) as alpha, which is what I did, the critical value comes out wrong.
In that case the third function should instead be:
def r_confidence_interval(r, alpha, n):
    # here alpha is the confidence level, e.g. 0.95
    z = r_to_z(r)
    se = 1.0 / math.sqrt(n - 3)
    z_crit = stats.norm.ppf((1 + alpha)/2)  # 2-tailed z critical value
    lo = z - z_crit * se
    hi = z + z_crit * se
    # Return a sequence
    return (z_to_r(lo), z_to_r(hi))
Here's another post for reference: Scipy - two tail ppf function for a z value?
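The two versions of r_confidence_interval above differ only in how alpha is read: as a significance level (0.05) in the first and as a confidence level (0.95) in the second. A small check of that, as a sketch: it uses the standard library's statistics.NormalDist().inv_cdf in place of scipy.stats.norm.ppf so that it runs without SciPy, and math.atanh / math.tanh in place of r_to_z / z_to_r (they are the same transformation). The name fisher_ci is made up for this example.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, confidence=0.95):
    """Confidence interval for Pearson's r via the Fisher transformation."""
    z = math.atanh(r)                      # same as 0.5 * log((1 + r) / (1 - r))
    se = 1.0 / math.sqrt(n - 3)
    z_crit = NormalDist().inv_cdf((1 + confidence) / 2)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

alpha = 0.05               # significance level
confidence = 1 - alpha     # confidence level

# both conventions lead to the same two-tailed critical value
crit_from_alpha = NormalDist().inv_cdf(1 - alpha / 2)
crit_from_conf = NormalDist().inv_cdf((1 + confidence) / 2)

lo, hi = fisher_ci(0.9, 100)
print(crit_from_alpha, crit_from_conf, lo, hi)
```

So whichever version you use, make sure the value you pass matches the convention the function expects.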
I know bootstrapping has been suggested above; I'm proposing another variation of it below, which may suit some other setups better.
#1
Resample your data with replacement (paired X and Y values; you can also carry along other variables, such as weights), evaluate the original model's fit on each resample, and record the R2. Then extract your confidence interval from the distribution of all recorded R2 values.
#2 Additionally, you can fit the model on each resample and use that fit to predict on the original, non-resampled X (you could also supply a continuous range to extend your predictions instead of using the original X)
to get confidence intervals on your Y hats.
So in sample code:
import numpy as np
from scipy.optimize import curve_fit
import pandas as pd
from sklearn.metrics import r2_score
x = np.array([your numbers here])
y = np.array([your numbers here])
### define list for R2 values
r2s = []
### define dataframe to append your bootstrapped fits for Y hat ranges
ci_df = pd.DataFrame({'x': x})
### define how many samples you want
how_many_straps = 5000
### define your fit function/s
def func_exponential(x,a,b):
    return np.exp(b) * np.exp(a * x)

### fit original, using log because fitting exponential
polyfit_original = np.polyfit(x
                              ,np.log(y)
                              ,1
                              ,# w= could supply weight for observations here)
                              )

for i in range(how_many_straps+1):
    ### zip into tuples attaching X to Y, can combine more variables as well
    zipped_for_boot = pd.Series(tuple(zip(x,y)))
    ### sample zipped X & Y pairs above with replacement
    zipped_resampled = zipped_for_boot.sample(frac=1,
                                              replace=True)
    ### create your sampled X & Y
    boot_x = []
    boot_y = []
    for sample in zipped_resampled:
        boot_x.append(sample[0])
        boot_y.append(sample[1])
    ### predict sampled using original fit
    y_hat_boot_via_original_fit = func_exponential(np.asarray(boot_x),
                                                   polyfit_original[0],
                                                   polyfit_original[1])
    ### calculate r2 and append
    r2s.append(r2_score(boot_y, y_hat_boot_via_original_fit))
    ### fit sampled
    polyfit_boot = np.polyfit(boot_x
                              ,np.log(boot_y)
                              ,1
                              ,# w= could supply weight for observations here)
                              )
    ### predict original via sampled fit or on a range of min(x) to Z
    y_hat_original_via_sampled_fit = func_exponential(x,
                                                      polyfit_boot[0],
                                                      polyfit_boot[1])
    ### insert y hat into dataframe for calculating y hat confidence intervals
    ci_df["trial_" + str(i)] = y_hat_original_via_sampled_fit

### R2 conf interval
low = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[0],3)
up = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[1],3)
print(F"r2 confidence interval = {low} - {up}")
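Since the code above needs your own x and y filled in, here is a compact, runnable miniature of approach #1 on synthetic data. It is a sketch only: the data, the seed and the helper r2 (a hand-written stand-in for sklearn's r2_score, so the example needs nothing beyond NumPy) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic exponential data with multiplicative noise (made up for the example)
x = np.linspace(0.0, 4.0, 60)
y = 2.0 * np.exp(0.7 * x) * np.exp(rng.normal(scale=0.1, size=x.size))

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# fit the original data once, in log space because the model is exponential
slope, intercept = np.polyfit(x, np.log(y), 1)

r2s = []
for _ in range(2000):
    # resample (x, y) pairs with replacement
    idx = rng.integers(0, x.size, size=x.size)
    # score the original fit on the resampled pairs
    y_hat = np.exp(intercept + slope * x[idx])
    r2s.append(r2(y[idx], y_hat))

low, up = np.percentile(r2s, [2.5, 97.5])
print(f"r2 confidence interval = {low:.3f} - {up:.3f}")
```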

Digitizing an analog signal

I have an array of CSV values representing a digital output. It was captured with an analog oscilloscope, so it is not a perfect digital signal. I'm trying to filter the data to get a perfect digital signal for calculating the periods (which may vary).
I would also like to determine the maximum error I get from this filtering.
Something like this:
Idea
Apply a threshold to the data. Here is some pseudocode:
for data_point_raw in data_array:
    if data_point_raw < 0.8:
        data_point_perfect = LOW
    elif data_point_raw > 2:
        data_point_perfect = HIGH
    else:
        # area between thresholds: keep the previous state
        data_point_perfect = previous_data_point_perfect
    previous_data_point_perfect = data_point_perfect
There are two problems bothering me.
This seems like a common problem in digital signal processing; however, I haven't found a predefined standard function for it. Is this an OK way to perform the filtering?
How would I get the maximum error?
Here's a bit of code that might help.
from __future__ import division
import numpy as np

def find_transition_times(t, y, threshold):
    """
    Given the input signal `y` with samples at times `t`,
    find the times where `y` increases through the value `threshold`.
    `t` and `y` must be 1-D numpy arrays.
    Linear interpolation is used to estimate the time `t` between
    samples at which the transitions occur.
    """
    # Find where y crosses the threshold (increasing).
    lower = y < threshold
    higher = y >= threshold
    transition_indices = np.where(lower[:-1] & higher[1:])[0]
    # Linearly interpolate the time values where the transition occurs.
    t0 = t[transition_indices]
    t1 = t[transition_indices + 1]
    y0 = y[transition_indices]
    y1 = y[transition_indices + 1]
    slope = (y1 - y0) / (t1 - t0)
    transition_times = t0 + (threshold - y0) / slope
    return transition_times

def periods(t, y, threshold):
    """
    Given the input signal `y` with samples at times `t`,
    find the time periods between the times at which the
    signal `y` increases through the value `threshold`.
    `t` and `y` must be 1-D numpy arrays.
    """
    transition_times = find_transition_times(t, y, threshold)
    deltas = np.diff(transition_times)
    return deltas

if __name__ == "__main__":
    import matplotlib.pyplot as plt
    # Time samples
    t = np.linspace(0, 50, 501)
    # Use a noisy time to generate a noisy y.
    tn = t + 0.05 * np.random.rand(t.size)
    y = 0.6 * ( 1 + np.sin(tn) + (1./3) * np.sin(3*tn) + (1./5) * np.sin(5*tn) +
               (1./7) * np.sin(7*tn) + (1./9) * np.sin(9*tn))
    threshold = 0.5
    deltas = periods(t, y, threshold)
    print("Measured periods at threshold %g:" % threshold)
    print(deltas)
    print("Min: %.5g" % deltas.min())
    print("Max: %.5g" % deltas.max())
    print("Mean: %.5g" % deltas.mean())
    print("Std dev: %.5g" % deltas.std())
    trans_times = find_transition_times(t, y, threshold)
    plt.plot(t, y)
    plt.plot(trans_times, threshold * np.ones_like(trans_times), 'ro-')
    plt.show()
The output:
Measured periods at threshold 0.5:
[ 6.29283207 6.29118893 6.27425846 6.29580066 6.28310224 6.30335003]
Min: 6.2743
Max: 6.3034
Mean: 6.2901
Std dev: 0.0092793
You could use numpy.histogram and/or matplotlib.pyplot.hist to further analyze the array returned by periods(t, y, threshold).
This is not an answer to your question, just a suggestion that may help. I'm writing it here because I can't put an image in a comment.
I think you should normalize the data somehow before any processing.
After normalizing to the range 0...1, you can apply your filter.
If you're really only interested in the period, you could plot the Fourier transform: you'll get a peak at the frequency of the signal (and from that you have the period). The wider the peak in the Fourier domain, the larger the error in your period measurement.
import numpy as np
data = np.asarray(my_data)
np.fft.fft(data)
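To go from the transform to an actual period, something along these lines might work. This is only a sketch: the function name dominant_period and the sample spacing dt are assumptions, it uses np.fft.rfft (the real-input variant of the fft call above), and it assumes evenly spaced samples with one dominant frequency.

```python
import numpy as np

def dominant_period(y, dt=1.0):
    """Estimate the period of y (sampled every dt) from its largest FFT peak."""
    y = np.asarray(y, dtype=float)
    # remove the mean so the zero-frequency (DC) bin does not dominate
    spectrum = np.abs(np.fft.rfft(y - y.mean()))
    freqs = np.fft.rfftfreq(y.size, d=dt)
    # skip bin 0 (zero frequency) when looking for the peak
    peak = spectrum[1:].argmax() + 1
    return 1.0 / freqs[peak]

# test signal: a square wave with a period of 20 samples
t = np.arange(400)
square = (t % 20 < 10).astype(float)
period = dominant_period(square)
print(period)
```

The frequency resolution is 1 / (number of samples * dt), so short recordings give only a coarse period estimate.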
Your filtering is fine; it's basically the same as a Schmitt trigger. The main problem you might have with it is speed. The benefit of using NumPy is that it can be as fast as C, whereas your loop has to iterate over each element in Python.
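If you want to keep the two-threshold behaviour from the question but avoid the Python loop, it can be vectorised. The following is a sketch of one way to do it, not a standard library function: the name hysteresis_threshold and the initial argument (the state assumed before the first decisive sample) are made up for illustration.

```python
import numpy as np

def hysteresis_threshold(x, low, high, initial=False):
    """Schmitt-trigger style thresholding without a Python loop.

    Samples above `high` become True, samples below `low` become False,
    and samples in between keep the state of the last decisive sample.
    """
    x = np.asarray(x)
    is_high = x > high
    decisive = is_high | (x < low)
    ind = np.nonzero(decisive)[0]          # indices of decisive samples
    if ind.size == 0:
        return np.full(x.shape, initial, dtype=bool)
    # cnt[i] = number of decisive samples at or before position i
    cnt = np.cumsum(decisive)
    # look up the state of the most recent decisive sample
    return np.where(cnt > 0, is_high[ind[cnt - 1]], initial)

raw = np.array([0.0, 1.0, 2.5, 1.5, 0.9, 0.5, 1.0, 3.0])
digital = hysteresis_threshold(raw, low=0.8, high=2.0)
print(digital.astype(int))  # [0 0 1 1 1 0 0 1]
```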
You can achieve something similar using the median filter from SciPy. The following should achieve a similar result (and not be dependent on any magnitudes):
import numpy
import scipy.signal
filtered = scipy.signal.medfilt(raw)
filtered = numpy.where(filtered > numpy.mean(filtered), 1, 0)
You can tune the strength of the median filtering with medfilt(raw, n_samples); the window size n_samples must be odd and defaults to 3.
As for the error, that's going to be very subjective. One way would be to discretise the signal without filtering and then compare for differences. For example:
discrete = numpy.where(raw > numpy.mean(raw), 1, 0)
errors = numpy.count_nonzero(filtered != discrete)
error_rate = errors / len(discrete)
