Weighted mean in numpy/python

Weighted mean in numpy/python - python

I have a big continuous array of values that ranges from (-100, 100)
Now for this array I want to calculate the weighted average described here
since it's continuous I want also to set breaks for the values every 20
i.e the values should be discrete as
-100
-80
-60
....
60
80
100
How can I do this in NumPy or python in general?
EDIT: the difference here from the normal mean, that the mean is calculated according to the frequency of values

You actually have 2 different questions.
How to make data discrete, and
How to make a weighted average.
It's usually better to ask 1 question at a time, but anyway.
Given your specification:
xmin = -100
xmax = 100
binsize = 20
First, let's import numpy and make some data:
import numpy as np
data = numpy.array(range(xmin, xmax))
Then let's make the binnings you are looking for:
bins_arange = numpy.arange(xmin, xmax + 1, binsize)
From this we can convert the data to the discrete form:
counts, edges = numpy.histogram(data, bins=bins_arange)
Now to calculate the weighted average, we can use the binning middle (e.g. numbers between -100 and -80 will be on average -90):
bin_middles = (edges[:-1] + edges[1:]) / 2
Note that this method does not require the binnings to be evenly "spaced", contrary to the integer division method.
Then let's make some weights:
weights = numpy.array(range(len(counts)) / sum(range(len(counts))
Then to bring it all together:
average = np.sum(bin_middles * counts * 1) / sum(counts)
weighted_average = np.sum(bin_middles * counts * weights) / sum(counts)

For the discretization (breaks), here is a method using the python integer division :
import numpy as np
values = np.array([0, 5, 10, 11, 21, 24, 48, 60])
(values/20) *20
# or (a/10).astype(int)*10 to force rounding
that will print :
aarray([ 0, 0, 0, 0, 20, 20, 40, 60])
For the weighted mean, if you have another array with the weights for each point, you can use :
weighted_means = sum([ w*v for w,v in zip(weights, values)]) / sum( w*w )

Related

Efficiently masking and reducing a large multidimensional array using values from another array

I have a xarray.DataArray with two 3-dimensional (time, y, x) variables a and b:
import numpy as np
import xarray as xr
# Random data
a = np.random.rand(100, 3000, 3000).astype(np.float32)
b = np.random.rand(100, 3000, 3000).astype(np.float32)
# Create xarray.Dataset with two vars
ds = xr.Dataset(
data_vars={
"a": xr.DataArray(a, dims=("time", "y", "x")),
"b": xr.DataArray(b, dims=("time", "y", "x")),
}
)
I need to calculate the median value of my variable a across the time dimension when my variable b is between a min and a max threshold. These thresholds vary for each x, y pixel (i.e. they can be expressed as 2-dimensional (x, y) arrays):
random_vals = np.random.rand(1, 3000, 3000) / 10.0
min_threshold = 0.5 - random_vals
max_threshold = 0.5 + random_vals
Currently, I'm doing this by identifying pixels in b that are between my thresholds, using this boolean array to mask a using xarray's .where, then finally calculating the median value of a along the time dimension:
b_within_threshold = (ds.b > min_threshold) & (ds.b < max_threshold)
ds.a.where(b_within_threshold).median(dim='time')
This works, but the challenge is that is is extremely slow: 7.97 s ± 0 ns per loop for this example (my actual arrays can be far larger: e.g. shape=(500, 5000, 5000)). In my analysis, I need to do this calculation hundreds of times for different sets of min/max thresholds, for example:
for i in np.linspace(0, 1, 100):
# Create thresholds
random_vals = np.random.rand(1, 3000, 3000) / 10.0
min_threshold = i - random_vals
max_threshold = i + random_vals
# Apply mask and compute median
b_within_threshold = (ds.b > min_threshold) & (ds.b < max_threshold)
ds.a.where(b_within_threshold).median(dim='time')
Is there a more efficient/faster way I could apply this kind of calculation to my data? I'm happy with either an xarray, numpy or pandas solution - the speed of my current approach is just impractical given the amount of data I need to process, even when attempting to parallelise my code using multiprocessing or Dask.

The median is a rather expensive operation because it involves (partially) sorting a list and picking the middle value from it. Applying this operation along the time dimension, as you do, thus translates to sorting millions of (short-ish) lists hundreds of times. This simply takes time.
Your solution is already near-optimal on a python level, so your only two options are to either change your requirements or to optimize the constant overhead and parallelize. There are four things you can improve:
If you can, choose mean over median because it is less expensive to compute.
Use dask to parallelize the computation. You mentioned in a comment that you are already familiar with it, so I will not do this here.
Compile your own kernel using numba or cython to avoid expensive intermediate copies.
Ensure that your data is aligned/contiguous along the dimension you are computing. In this case time, so either switch to using fortran-ordered arrays or make time the last dimension of the array.
To make this concrete, here are the timings of your solution vs using your own (fortran-aligned) numba kernel:
Your Approach: 79.0010 s
Numba JIT: 7.7854 s
So about 10x faster. Keep in mind that these are single-core timings and that you will get further speedup from parallel processing. Here is the code for above timings:
import numpy as np
import xarray as xr
from timeit import timeit
import numba as nb
def time_solution(solution, number=1):
return timeit(
f"{solution.__name__}(data, threshold_value, low, high)",
setup=f"from __main__ import data, threshold_value, low, high, {solution.__name__}",
number=number,
)
def your_solution(data, threshold_value, low, high):
ds = xr.Dataset(
data_vars={
"a": xr.DataArray(data, dims=("time", "y", "x")),
"b": xr.DataArray(threshold_value, dims=("time", "y", "x")),
}
)
result = ds.a.where((ds.b > low) & (ds.b < high)).median(dim="time")
return result
#nb.jit(
"float32[:, :](float32[::1, :, :], float32[::1, :, :], float32[::1, :, :], float32[::1, :, :])",
nopython=True,
nogil=True,
)
def numba_magic(data, threshold_value, low, high):
output = np.empty(data.shape[1:], dtype=np.float32)
for height in range(data.shape[1]):
for width in range(data.shape[2]):
threshold = threshold_value[:, height, width]
mask = (low[:, height, width] < threshold) & (
threshold < high[:, height, width]
)
buffer = np.where(mask, data[:, height, width], np.nan)
output[height, width] = np.nanmedian(buffer)
return output
# Time solutions
# ==============
shape = (100, 3000, 3000)
rng = np.random.default_rng()
data = rng.random(shape).astype(np.float32, order="F")
threshold_value = rng.random(shape).astype(np.float32, order="F")
random_vals = (rng.random(shape) / 10).astype(np.float32, order="F")
low = 0.5 - random_vals
high = 0.5 + random_vals
# assert equality of solutions
expected = np.asarray(your_solution(data, threshold_value, low, high))
actual_numba = numba_magic(data, threshold_value, low, high)
assert np.allclose(expected, actual_numba, equal_nan=True)
# compare timings of solutions
repeats = 10
print("""
Timings
-------""")
print(f"Your Approach: {time_solution(your_solution, repeats)/repeats:.4f} s")
print(f"Numba JIT: {time_solution(numba_magic, repeats)/repeats:.4f} s")

One improvement could be to sort the arrays on the time-dimension. That has a lot of upfront computational cost but only once at the beginning.
After that you can continue the same way calculating the thresholds and masking the a-array.
Then calculate the median not by calling the median-function but by directly accessing the middle element from the a-array (respectively the average of the two middle elements if the array-length is even).
for i in np.linspace(0, 1, 100):
# Create thresholds
random_vals = np.random.rand(1, 3000, 3000) / 10.0
min_threshold = i - random_vals
max_threshold = i + random_vals
# Apply mask and compute median
b_within_threshold = (ds.b > min_threshold) & (ds.b < max_threshold)
a_masked = ds.a.where(b_within_threshold)
# Faster way to calculate median on a sorted array
len_a_masked = len(a_masked)
if len_a_masked == 0:
median = None
elif len_a_masked % 2 == 0:
median = 0.5 * (a_masked[(len_a_masked - 1) // 2] + a_masked[len_a_masked // 2])
else:
median = a_masked[(len_a_masked - 1) // 2]
Depending on how many medians you are calculating this should be a significant improvement, as you have the additional cost of sorting the array only once, but the improvement from the faster calculation of the median for every threshold-iteration.

Frequencies from a FFT shift based on size of data set?

I am working on finding the frequencies from a given dataset and I am struggling to understand how np.fft.fft() works. I thought I had a working script but ran into a weird issue that I cannot understand.
I have a dataset that is roughly sinusoidal and I wanted to understand what frequencies the signal is composed of. Once I took the FFT, I got this plot:
However, when I take the same dataset, slice it in half, and plot the same thing, I get this:
I do not understand why the frequency drops from 144kHz to 128kHz which technically should be the same dataset but with a smaller length.
I can confirm a few things:
Step size between data points 0.001
I have tried interpolation with little luck.
If I slice the second half of the dataset I get a different frequency as well.
If my dataset is indeed composed of both 128 and 144kHz, then why doesn't the 128 peak show up in the first plot?
What is even more confusing is that I am running a script with pure sine waves without issues:
T = 0.001
fs = 1 / T
def find_nearest_ind(data, value):
return (np.abs(data - value)).argmin()
x = np.arange(0, 30, T)
ff = 0.2
y = np.sin(2 * ff * np.pi * x)
x = x[:len(x) // 2]
y = y[:len(y) // 2]
n = len(y) # length of the signal
k = np.arange(n)
T = n / fs
frq = k / T * 1e6 / 1000 # two sides frequency range
frq = frq[:len(frq) // 2] # one side frequency range
Y = np.fft.fft(y) / n # dft and normalization
Y = Y[:n // 2]
frq = frq[:50]
Y = Y[:50]
fig, (ax1, ax2) = plt.subplots(2)
ax1.plot(x, y)
ax1.set_xlabel("Time (us)")
ax1.set_ylabel("Electric Field (V / mm)")
peak_ind = find_nearest_ind(abs(Y), np.max(abs(Y)))
ax2.plot(frq, abs(Y))
ax2.axvline(frq[peak_ind], color = 'black', linestyle = '--', label = F"Frequency = {round(frq[peak_ind], 3)}kHz")
plt.legend()
plt.xlabel('Freq(kHz)')
ax1.title.set_text('dV/dX vs. Time')
ax2.title.set_text('Frequencies')
fig.tight_layout()
plt.show()

Here is a breakdown of your code, with some suggestions for improvement, and extra explanations. Working through it carefully will show you what is going on. The results you are getting are completely expected. I will propose a common solution at the end.
First set up your units correctly. I assume that you are dealing with seconds, not microseconds. You can adjust later as long as you stay consistent.
Establish the period and frequency of the sampling. This means that the Nyquist frequency for the FFT will be 500Hz:
T = 0.001 # 1ms sampling period
fs = 1 / T # 1kHz sampling frequency
Make a time domain of 30e3 points. The half domain will contain 15000 points. That implies a frequency resolution of 500Hz / 15k = 0.03333Hz.
x = np.arange(0, 30, T) # time domain
n = x.size # number of points: 30000
Before doing anything else, we can define our time domain right here. I prefer a more intuitive approach than what you are using. That way you don't have to redefine T or introduce the auxiliary variable k. But as long as the results are the same, it does not really matter:
F = np.linspace(0, 1 - 1/n, n) / T # Notice F[1] = 0.03333, as predicted
Now define the signal. You picked ff = 0.2. Notice that 0.2Hz. 0.2 / 0.03333 = 6, so you would expect to see your peak in exactly bin index 6 (F[6] == 0.2). To better illustrate what is going on, let's take ff = 0.22. This will bleed the spectrum into neighboring bins.
ff = 0.22
y = np.sin(2 * np.pi * ff * x)
Now take the FFT:
Y = np.fft.fft(y) / n
maxbin = np.abs(Y).argmax() # 7
maxF = F[maxbin] # 0.23333333: This is the nearest bin
Since your frequency bins are 0.03Hz wide, the best resolution you can expect 0.015Hz. For your real data, which has much lower resolution, the error is much larger.
Now let's take a look at what happens when you halve the data size. Among other things, the frequency resolution becomes smaller. Now you have a maximum frequency of 500Hz spread over 7.5k samples, not 15k: the resolution drops to 0.066666Hz per bin:
n2 = n // 2 # 15000
F2 = np.linspace(0, 1 - 1 / n2, n2) / T # F[1] = 0.06666
Y2 = np.fft.fft(y[:n2]) / n2
Take a look what happens to the frequency estimate:
maxbin2 = np.abs(Y2).argmax() # 3
maxF2 = F2[maxbin2] # 0.2: This is the nearest bin
Hopefully, you can see how this applies to your original data. In the full FFT, you have a resolution of ~16.1 per bin with the full data, and ~32.2kHz with the half data. So your original result is within ~±8kHz of the right peak, while the second one is within ~±16kHz. The true frequency is therefore between 136kHz and 144kHz. Another way to look at it is to compare the bins that you showed me:
full: 128.7 144.8 160.9
half: 96.6 128.7 160.9
When you take out exactly half of the data, you drop every other frequency bin. If your peak was originally closest to 144.8kHz, and you drop that bin, it will end up in either 128.7 or 160.9.
Note: Based on the bin numbers you show, I suspect that your computation of frq is a little off. Notice the 1 - 1/n in my linspace expression. You need that to get the right frequency axis: the last bin is (1 - 1/n) / T, not 1 / T, no matter how you compute it.
So how to get around this problem? The simplest solution is to do a parabolic fit on the three points around your peak. That is usually a sufficiently good estimator of the true frequency in the data when you are looking for essentially perfect sinusoids.
def peakF(F, Y):
index = np.abs(Y).argmax()
# Compute offset on normalized domain [-1, 0, 1], not F[index-1:index+2]
y = np.abs(Y[index - 1:index + 2])
# This is the offset from zero, which is the scaled offset from F[index]
vertex = (y[0] - y[2]) / (0.5 * (y[0] + y[2]) - y[1])
# F[1] is the bin resolution
return F[index] + vertex * F[1]
In case you are wondering how I got the formula for the parabola: I solved the system with x = [-1, 0, 1] and y = Y[index - 1:index + 2]. The matrix equation is
[(-1)^2 -1 1] [a] Y[index - 1]
[ 0^2 0 1] # [b] = Y[index]
[ 1^2 1 1] [c] Y[index + 1]
Computing the offset using a normalized domain and scaling afterwards is almost always more numerically stable than using whatever huge numbers you have in F[index - 1:index + 2].
You can plug the results in the example into this function to see if it works:
>>> peakF(F, Y)
0.2261613409657391
>>> peakF(F2, Y2)
0.20401580936430794
As you can see, the parabolic fit gives an improvement, however slight. There is no replacement for just increasing frequency resolution through more samples though!

Speeding up normal distribution probability mass allocation

We have N users with P avg. points per user, where each point is a single value between 0 and 1. We need to distribute the mass of each point using a normal distribution with a known density of 0.05 as the points have some uncertainty. Additionally, we need to wrap the mass around 0 and 1 such that e.g. a point at 0.95 will also allocate mass around 0. I've provided a working example below, which bins the normal distribution into D=50 bins. The example uses the Python typing module, but you can ignore that if you'd like.
from typing import List, Any
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
D = 50
BINS: List[float] = np.linspace(0, 1, D + 1).tolist()
def probability_mass(distribution: Any, x0: float, x1: float) -> float:
"""
Computes the area under the distribution, wrapping at 1.
The wrapping is done by adding the PDF at +- 1.
"""
assert x1 > x0
return (
(distribution.cdf(x1) - distribution.cdf(x0))
+ (distribution.cdf(x1 + 1) - distribution.cdf(x0 + 1))
+ (distribution.cdf(x1 - 1) - distribution.cdf(x0 - 1))
)
def point_density(x: float) -> List[float]:
distribution: Any = scipy.stats.norm(loc=x, scale=0.05)
density: List[float] = []
for i in range(D):
density.append(probability_mass(distribution, BINS[i], BINS[i + 1]))
return density
def user_density(points: List[float]) -> Any:
# Find the density of each point
density: Any = np.array([point_density(p) for p in points])
# Combine points and normalize
combined = density.sum(axis=0)
return combined / combined.sum()
if __name__ == "__main__":
# Example for one user
data: List[float] = [.05, .3, .5, .5]
density = user_density(data)
# Example for multiple users (N = 2)
print([user_density(x) for x in [[.3, .5], [.7, .7, .7, .9]]])
### NB: THE REMAINING CODE IS FOR ILLUSTRATION ONLY!
### NB: THE IMPORTANT THING IS TO COMPUTE THE DENSITY FAST!
middle: List[float] = []
for i in range(D):
middle.append((BINS[i] + BINS[i + 1]) / 2)
plt.bar(x=middle, height=density, width=1.0 / D + 0.001)
plt.xlim(0, 1)
plt.xlabel("x")
plt.ylabel("Density")
plt.show()
In this example N=1, D=50, P=4. However, we want to scale this approach to N=10000 and P=100 while being as fast as possible. It's unclear to me how we'd vectorize this approach. How do we best speed up this?
EDIT
The faster solution can have slightly different results. For instance, it could approximate the normal distribution instead of using the precise normal distribution.
EDIT2
We only care about computing density using the user_density() function. The plot is only to help explain the approach. We do not care about the plot itself :)
EDIT3
Note that P is the avg. points per user. Some users may have more and some may have less. If it helps, you can assume that we can throw away points such that all users have a max of 2 * P points. It's fine to ignore this part while benchmarking as long as the solution can handle a flexible # of points per user.

You could get below 50ms for largest case (N=10000, AVG[P]=100, D=50) by using using FFT and creating data in numpy friendly format. Otherwise it will be closer to 300 msec.
The idea is to convolve a single normal distribution centered at 0 with a series Dirac deltas.
See image below:
Using circular convolution solves two issues.
naturally deals with wrapping at the edges
can be efficiently computed with FFT and Convolution Theorem
First one must create a distribution to be copied. Function mk_bell() created a histogram of a normal distribution of stddev 0.05 centered at 0.
The distribution wraps around 1. One could use arbitrary distribution here. The spectrum of the distribution is computed are used for fast convolution.
Next a comb-like function is created. The peaks are placed at indices corresponding to peaks in user density. E.g.
peaks_location = [0.1, 0.3, 0.7]
D = 10
maps to
peak_index = (D * peak_location).astype(int) = [1, 3, 7]
dist = [0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0] # ones at [1, 3, 7]
You can quickly create a composition of Diract Deltas by computing indices of the bins for each peak location with help of np.bincount() function.
In order to speed things even more one can compute comb-functions for user-peaks in parallel.
Array dist is 2D-array of shape NxD. It can be linearized to 1D array of shape (N*D). After this change element on position [user_id, peak_index] will be accessible from index user_id*D + peak_index.
With numpy-friendly input format (described below) this operation is easily vectorized.
The convolution theorem says that spectrum of convolution of two signals is equal to product of spectrums of each signal.
The spectrum is compute with numpy.fft.rfft which is a variant of Fast Fourier Transfrom dedicated to real-only signals (no imaginary part).
Numpy allows to compute FFT of each row of the larger matrix with one command.
Next, the spectrum of convolution is computed by simple multiplication and use of broadcasting.
Next, the spectrum is computed back to "time" domain by Inverse Fourier Transform implemented in numpy.fft.irfft.
To use the full speed of numpy one should avoid variable size data structure and keep to fixed size arrays. I propose to represent input data as three arrays.
uids the identifier for user, integer 0..N-1
peaks, the location of the peak
mass, the mass of the peek, currently it is 1/numer-of-peaks-for-user
This representation of data allows quick vectorized processing.
Eg:
user_data = [[0.1, 0.3], [0.5]]
maps to:
uids = [0, 0, 1] # 2 points for user_data[0], one from user_data[1]
peaks = [0.1, 0.3, 0.5] # serialized user_data
mass = [0.5, 0.5, 1] # scaling factors for each peak, 0.5 means 2 peaks for user 0
The code:
import numpy as np
import matplotlib.pyplot as plt
import time
def mk_bell(D, SIGMA):
# computes normal distribution wrapped and centered at zero
x = np.linspace(0, 1, D, endpoint=False);
x = (x + 0.5) % 1 - 0.5
bell = np.exp(-0.5*np.square(x / SIGMA))
return bell / bell.sum()
def user_densities_by_fft(uids, peaks, mass, D, N=None):
bell = mk_bell(D, 0.05).astype('f4')
sbell = np.fft.rfft(bell)
if N is None:
N = uids.max() + 1
# ensure that peaks are in [0..1) internal
peaks = peaks - np.floor(peaks)
# convert peak location from 0-1 to the indices
pidx = (D * (peaks + uids)).astype('i4')
dist = np.bincount(pidx, mass, N * D).reshape(N, D)
# process all users at once with Convolution Theorem
sdist = np.fft.rfft(dist)
sdist *= sbell
res = np.fft.irfft(sdist)
return res
def generate_data(N, Pmean):
# generateor for large data
data = []
for n in range(N):
# select P uniformly from 1..2*Pmean
P = np.random.randint(2 * Pmean) + 1
# select peak locations
chunk = np.random.uniform(size=P)
data.append(chunk.tolist())
return data
def make_data_numpy_friendly(data):
uids = []
chunks = []
mass = []
for uid, peaks in enumerate(data):
uids.append(np.full(len(peaks), uid))
mass.append(np.full(len(peaks), 1 / len(peaks)))
chunks.append(peaks)
return np.hstack(uids), np.hstack(chunks), np.hstack(mass)
D = 50
# demo for simple multi-distribution
data, N = [[0, .5], [.7, .7, .7, .9], [0.05, 0.3, 0.5, 0.5]], None
uids, peaks, mass = make_data_numpy_friendly(data)
dist = user_densities_by_fft(uids, peaks, mass, D, N)
plt.plot(dist.T)
plt.show()
# the actual measurement
N = 10000
P = 100
data = generate_data(N, P)
tic = time.time()
uids, peaks, mass = make_data_numpy_friendly(data)
toc = time.time()
print(f"make_data_numpy_friendly: {toc - tic}")
tic = time.time()
dist = user_densities_by_fft(uids, peaks, mass, D, N)
toc = time.time()
print(f"user_densities_by_fft: {toc - tic}")
The results on my 4-core Haswell machine are:
make_data_numpy_friendly: 0.2733159065246582
user_densities_by_fft: 0.04064297676086426
It took 40ms to process the data. Notice that processing data to numpy friendly format takes 6 times more time than the actual computation of distributions.
Python is really slow when it comes to looping.
Therefore I strongly recommend to generate input data directly in numpy-friendly way in the first place.
There are some issues to be fixed:
precision, can be improved by using larger D and downsampling
accuracy of peak location could be further improved by widening the spikes.
performance, scipy.fft offers move variants of FFT implementation that may be faster

This would be my vectorized approach:
data = np.array([0.05, 0.3, 0.5, 0.5])
np.random.seed(31415)
# random noise
randoms = np.random.normal(0,1,(len(data), int(1e5))) * 0.05
# samples with noise
samples = data[:,None] + randoms
# wrap [0,1]
samples = (samples % 1).ravel()
# histogram
hist, bins, patches = plt.hist(samples, bins=BINS, density=True)
Output:

I was able to reduce the time from about 4 seconds per sample of 100 datapoints to about 1 ms per sample.
It looks to me like you're spending quite a lot of time simulating a very large number of normal distributions. Since you're dealing with a very large sample size anyway, you may as well just use standard normal distribution values, because it'll all just average out anyway.
I recreated your approach (BaseMethod class), then created an optimized class (OptimizedMethod class), and evaluated them using a timeit decorator. The primary difference in my approach is the following line:
# Generate a standardized set of values to add to each sample to simulate normal distribution
self.norm_vals = np.array([norm.ppf(x / norm_val_n) * 0.05 for x in range(1, norm_val_n, 1)])
This creates a generic set of datapoints based on an inverse normal cumulative distribution function that we can add to each datapoint to simulate a normal distribution around that point. Then we just reshape the data into user samples and run np.histogram on the samples.
import numpy as np
import scipy.stats
from scipy.stats import norm
import time
# timeit decorator for evaluating performance
def timeit(method):
def timed(*args, **kw):
ts = time.time()
result = method(*args, **kw)
te = time.time()
print('%r %2.2f ms' % (method.__name__, (te - ts) * 1000 ))
return result
return timed
# Define Variables
N = 10000
D = 50
P = 100
# Generate sample data
np.random.seed(0)
data = np.random.rand(N, P)
# Run OP's method for comparison
class BaseMethod:
def __init__(self, d=50):
self.d = d
self.bins = np.linspace(0, 1, d + 1).tolist()
def probability_mass(self, distribution, x0, x1):
"""
Computes the area under the distribution, wrapping at 1.
The wrapping is done by adding the PDF at +- 1.
"""
assert x1 > x0
return (
(distribution.cdf(x1) - distribution.cdf(x0))
+ (distribution.cdf(x1 + 1) - distribution.cdf(x0 + 1))
+ (distribution.cdf(x1 - 1) - distribution.cdf(x0 - 1))
)
def point_density(self, x):
distribution = scipy.stats.norm(loc=x, scale=0.05)
density = []
for i in range(self.d):
density.append(self.probability_mass(distribution, self.bins[i], self.bins[i + 1]))
return density
#timeit
def base_user_density(self, data):
n = data.shape[0]
density = np.empty((n, self.d))
for i in range(data.shape[0]):
# Find the density of each point
row_density = np.array([self.point_density(p) for p in data[i]])
# Combine points and normalize
combined = row_density.sum(axis=0)
density[i, :] = combined / combined.sum()
return density
base = BaseMethod(d=D)
# Only running base method on first 2 rows of data because it's slow
density = base.base_user_density(data[:2])
print(density[:2, :5])
class OptimizedMethod:
def __init__(self, d=50, norm_val_n=50):
self.d = d
self.norm_val_n = norm_val_n
self.bins = np.linspace(0, 1, d + 1).tolist()
# Generate a standardized set of values to add to each sample to simulate normal distribution
self.norm_vals = np.array([norm.ppf(x / norm_val_n) * 0.05 for x in range(1, norm_val_n, 1)])
#timeit
def optimized_user_density(self, data):
samples = np.empty((data.shape[0], data.shape[1], self.norm_val_n - 1))
# transform datapoints to normal distributions around datapoint
for i in range(self.norm_vals.shape[0]):
samples[:, :, i] = data + self.norm_vals[i]
samples = samples.reshape(samples.shape[0], -1)
#wrap around [0, 1]
samples = samples % 1
#loop over samples for density
density = np.empty((data.shape[0], self.d))
for i in range(samples.shape[0]):
hist, bins = np.histogram(samples[i], bins=self.bins)
density[i, :] = hist / hist.sum()
return density
om = OptimizedMethod()
#Run optimized method on first 2 rows for apples to apples comparison
density = om.optimized_user_density(data[:2])
#Run optimized method on full data
density = om.optimized_user_density(data)
print(density[:2, :5])
Running on my system, the original method took about 8.4 seconds to run on 2 rows of data, while the optimized method took 1 millisecond to run on 2 rows of data and completed 10,000 rows in 4.7 seconds. I printed the first five values of the first 2 samples for each method.
'base_user_density' 8415.03 ms
[[0.02176227 0.02278653 0.02422535 0.02597123 0.02745976]
[0.0175103 0.01638513 0.01524853 0.01432158 0.01391156]]
'optimized_user_density' 1.09 ms
'optimized_user_density' 4755.49 ms
[[0.02142857 0.02244898 0.02530612 0.02612245 0.0277551 ]
[0.01673469 0.01653061 0.01510204 0.01428571 0.01326531]]

Generate data based on exponential distribution

I want to generate a dataset of 30 entries such between the range of (50-5000) such that it follows an increasing curve(log curve) i.e. increasing in the start and then stagnant in the end.
I came across the from scipy.stats import expon but I am not sure how to use the package in my scenario.
Can anyone help.
A possible output would look like [300, 1000, 1500, 1800, 1900, ...].

First you need to generate 30 random x values (uniformly). Then you get log(x). Ideally, log(x) should be in range [50, 5000). However, in such case you would need e^50 <= x <= e^5000 (overflow!!). A possible solution is to generate random x values in [min_x, max_x), get the logarithmic values and then scale them to the desired range [50, 5000).
import numpy as np
min_y = 50
max_y = 5000
min_x = 1
# any number max_x can be chosen
# this number controls the shape of the logarithm, therefore the final distribution
max_x = 10
# generate (uniformly) and sort 30 random float x in [min_x, max_x)
x = np.sort(np.random.uniform(min_x, max_x, 30))
# get log(x), i.e. values in [log(min_x), log(max_x))
log_x = np.log(x)
# scale log(x) to the new range [min_y, max_y)
y = (max_y - min_y) * ((log_x - np.log(min_x)) / (np.log(max_x) - np.log(min_x))) + min_y

Autocorrelation to estimate periodicity with numpy

I have a large set of time series (> 500), I'd like to select only the ones that are periodic. I did a bit of literature research and I found out that I should look for autocorrelation. Using numpy I calculate the autocorrelation as:
def autocorr(x):
norm = x - np.mean(x)
result = np.correlate(norm, norm, mode='full')
acorr = result[result.size/2:]
acorr /= ( x.var() * np.arange(x.size, 0, -1) )
return acorr
This returns a set of coefficients (r?) that when plot should tell me if the time series is periodic or not.
I generated two toy examples:
#random signal
s1 = np.random.randint(5, size=80)
#periodic signal
s2 = np.array([5,2,3,1] * 20)
When I generate the autocorrelation plots I obtain:
The second autocorrelation vector clearly indicates some periodicity:
Autocorr1 = [1, 0.28, -0.06, 0.19, -0.22, -0.13, 0.07 ..]
Autocorr2 = [1, -0.50, -0.49, 1, -0.50, -0.49, 1 ..]
My question is, how can I automatically determine, from the autocorrelation vector, if a time series is periodic? Is there a way to summarise the values into a single coefficient, e.g. if = 1 perfect periodicity, if = 0 no periodicity at all. I tried to calculate the mean but it is not meaningful. Should I look at the number of 1?

I would use mode='same' instead of mode='full' because with mode='full' we get covariances for extreme shifts, where just 1 array element overlaps self, the rest being zeros. Those are not going to be interesting. With mode='same' at least half of the shifted array overlaps the original one.
Also, to have the true correlation coefficient (r) you need to divide by the size of the overlap, not by the size of the original x. (in my code these are np.arange(n-1, n//2, -1)). Then each of the outputs will be between -1 and 1.
A glance at Durbin–Watson statistic, which is similar to 2(1-r), suggests that people consider its values below 1 to be a significant indication of autocorrelation, which corresponds to r > 0.5. So this is what I use below. For a statistically sound treatment of the significance of autocorrelation refer to statistics literature; a starting point would be to have a model for your time series.
def autocorr(x):
n = x.size
norm = (x - np.mean(x))
result = np.correlate(norm, norm, mode='same')
acorr = result[n//2 + 1:] / (x.var() * np.arange(n-1, n//2, -1))
lag = np.abs(acorr).argmax() + 1
r = acorr[lag-1]
if np.abs(r) > 0.5:
print('Appears to be autocorrelated with r = {}, lag = {}'. format(r, lag))
else:
print('Appears to be not autocorrelated')
return r, lag
Output for your two toy examples:
Appears to be not autocorrelated
Appears to be autocorrelated with r = 1.0, lag = 4

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Weighted mean in numpy/python - python

Related

Efficiently masking and reducing a large multidimensional array using values from another array

Frequencies from a FFT shift based on size of data set?

Speeding up normal distribution probability mass allocation

Generate data based on exponential distribution

Autocorrelation to estimate periodicity with numpy

Categories

Resources