Implementing a weighted average on the fly in Python


I have a stream of incoming data and I want to compute a moving average on the fly. If all the elements in the moving average have the same weight, it is fairly easy to implement using a 'Queue', but I want the most recent elements to have higher weights, with the weights distributed linearly (not exponentially).
For example, if the moving average is of length 5, the current value should have weight '1', the previous one should have weight '0.8' and so on until the fifth element in the queue, which should have weight '0.2'; so the weight vector is: [0.2, 0.4, 0.6, 0.8, 1.0].
I was wondering if anybody knows how to implement this in Python. If there is a faster way to do this, please recommend it to me; efficiency is important for my specific job.

If you want to keep a weighting vector such as the one described (linearly decreasing weights), you will need to keep all the information about your past stream. I quickly tried to sketch a mathematical function that would avoid keeping your past scalars in memory, without success. This is where exponential weighting has a powerful advantage:
average_(t) = x_(t) + aa*average_(t-1)
You only need to keep two variables in memory.
Anyway, if memory is not a constraint, your problem boils down to a vector multiplication. Therefore, I would suggest using the numpy library. [1] [2]. See an example solution below (you may well find a more efficient one):
import numpy as np
stream = np.array((20, 40))
n = len(stream)
latest_scalar = 60
stream = np.append(stream, latest_scalar)
n += 1
# n is the length of the stream; it is tracked by hand on the assumption
# that this is cheaper than calling len(), at the risk of n and the
# array getting out of sync
weights = np.arange(1, n+1)
# [1, 2, 3]
average = np.dot(stream, weights) / (n*(n+1)/2)
# (n*(n+1)/2): total of the weights
# output: 46.666... ok!
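For a fixed window length, as in the question, the values inside the window still have to be stored, but the weighted sum itself does not need to be recomputed from scratch on every new sample. The sketch below is one possible way to do that, not the answerer's method: it keeps a plain sum and a weighted sum and updates both in constant time. The class name LinearWeightedAverage and its attributes are made up for this illustration, and the weights are 1..n (oldest to newest), which after normalization is equivalent to [0.2, ..., 1.0] for n = 5.

```python
from collections import deque


class LinearWeightedAverage:
    """Moving average with linear weights 1..n (oldest..newest), O(1) per update."""

    def __init__(self, window):
        self.window = window
        self.values = deque()   # values currently inside the window
        self.total = 0.0        # plain sum of the stored values
        self.weighted = 0.0     # sum of weight * value

    def update(self, x):
        if len(self.values) == self.window:
            # Every stored value loses one unit of weight; the oldest one
            # drops from weight 1 to weight 0 and leaves the window.
            self.weighted -= self.total
            self.total -= self.values.popleft()
        self.values.append(x)
        self.total += x
        self.weighted += len(self.values) * x   # newest value gets the top weight
        n = len(self.values)
        return self.weighted / (n * (n + 1) / 2)  # divide by the sum of the weights


lwa = LinearWeightedAverage(window=3)
for sample in (20, 40, 60, 80):
    print(sample, lwa.update(sample))
```

With float inputs, the running sums can accumulate rounding error over a very long stream, so recomputing them from the stored window every so often may be worthwhile.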

Related

Custom vectorized non-linear filter in Numpy

In digital image processing, many filters are non-linear, such as the Harmonic Mean Filter.
I know that Numpy provides many vectorized functions which can speed up computation tremendously, but so far I have not found any that work well with non-linear masks.
Specifically, I want to speed up my implementation of the above filter by removing the two ugly, snail-paced Python for loops:
import math as m
import numpy as np
import cv2 as cv
def harmonic(im, ksize):
    # Make a copy of the original image
    result = im.copy().astype(np.float32)
    # Calculate padding size, and pad the original image
    psize = m.floor(ksize/2)  # padding size
    im = cv.copyMakeBorder(im, psize, psize, psize, psize, cv.BORDER_REFLECT)
    # Perform non-linear operations
    for i in range(0, result.shape[0]):
        for j in range(0, result.shape[1]):
            # Get the neighborhood same size as kernel
            neighbor = im[i:(i+2*psize+1), j:(j+2*psize+1)].astype(np.float32)
            # Calculate the reciprocal sum; zero pixels contribute 0
            # (without out=, the entries skipped by where= are left uninitialized)
            recp_sum = np.sum(np.reciprocal(neighbor, where=neighbor != 0, out=np.zeros_like(neighbor)))
            # Harmonic mean for that neighborhood
            if recp_sum != 0:
                result[i][j] = float((ksize*ksize)/recp_sum)
    return result.astype(np.uint8)
In general, can we use Numpy to create arbitrary custom vectorized operations on an array, or only a limited set of operations, and if so, which ones? What could I do specifically to optimize the above code?
I have been exploring Numpy vectorization recently, and np.vectorize really caught my attention. However, the examples in the documentation seemed (as far as I can tell) irrelevant to the problem I am trying to solve. (English is not my native language, so I may have missed something; I'd be happy to have it explained!)
Regarding np.vectorize, I do not really understand the pyfunc parameter. Does it really eliminate the traditional Python loops wrapped in that pyfunc? Or is it there just to define a specific mapping at a specific pixel in the array?

The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals. That is,
tmp = 1 / im.astype(np.float32)
tmp = cv2.blur(tmp, (ksize, ksize))
out = 1 / tmp
You might want to add a bit of code there to avoid division by zero. The simplest way is to replace zeros with very small values.
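The same idea (reciprocal, box average, reciprocal again) can be sketched with NumPy alone, including the guard against division by zero. This is an illustration under assumptions, not the answerer's code: it swaps cv2.blur for a mean over sliding windows, assumes an odd ksize and a 2D grayscale image, and uses a made-up function name and a made-up eps parameter as the "very small value" that replaces zeros.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def harmonic_mean_filter(im, ksize, eps=1e-6):
    """Harmonic mean over ksize x ksize neighborhoods of a 2D image (odd ksize)."""
    pad = ksize // 2
    # Mirror the image at the borders so the output keeps the input shape.
    padded = np.pad(im.astype(np.float64), pad, mode="symmetric")
    # Replace zeros (and anything below eps) to avoid division by zero.
    padded = np.maximum(padded, eps)
    # View of every ksize x ksize window of the reciprocal image, no copy.
    windows = sliding_window_view(1.0 / padded, (ksize, ksize))
    mean_of_reciprocals = windows.mean(axis=(-2, -1))
    return 1.0 / mean_of_reciprocals


image = np.array([[10, 10, 10],
                  [10, 40, 10],
                  [10, 10, 10]], dtype=np.uint8)
print(harmonic_mean_filter(image, 3))
```

Note that with this guard a window containing a zero pixel yields a result close to zero, which differs from the loop in the question, where zero pixels are skipped in the reciprocal sum.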

Exclude/Ignore data region in polynomial fit (zfit)

I wanted to know if there's a way to exclude one or more data regions in a polynomial fit. Currently this doesn't seem to work as I would expect. Here is a small example:
import numpy as np
import pandas as pd
import zfit
# Create test data
left_data = np.random.uniform(0, 3, size=1000).tolist()
mid_data = np.random.uniform(3, 6, size=5000).tolist()
right_data = np.random.uniform(6, 9, size=1000).tolist()
testsample = pd.DataFrame(left_data + mid_data + right_data, columns=["x"])
# Define fit parameter
coeff1 = zfit.Parameter('coeff1', 0.1, -3, 3)
coeff2 = zfit.Parameter('coeff2', 0.1, -3, 3)
# Define Space for the fit
obs_all = zfit.Space("x", limits=(0, 9))
# Perform the fit
bkg_fit = zfit.pdf.Chebyshev(obs=obs_all, coeffs=[coeff1, coeff2], coeff0=1)
new_testsample = zfit.Data.from_pandas(obs=obs_all, df=testsample.query("x<3 or x>6"), weights=None)
nll = zfit.loss.UnbinnedNLL(model=bkg_fit, data=new_testsample)
minimizer = zfit.minimize.Minuit()
result = minimizer.minimize(nll)
[Image: TestSample.png]
Here I've created a small test sample from three uniformly distributed regions. I only want to use the data in x < 3 OR x > 6 and ignore the 'peak' in between. Because the two outer regions have the same shape and height, I'd expect coeff1 and coeff2 to be (nearly) zero and the fitted curve to be a straight, horizontal line. Obviously this doesn't happen, because zfit assumes that there are simply no entries between 3 and 6.
I also tried using MultiSpaces to ignore that region via
limit1 = zfit.Space("x", limits=(0, 3))
limit2 = zfit.Space("x", limits=(6, 9))
obs_data = limit1 + limit2
But this leads to a
ValueError: obs need to be a Space with exactly one limit if rescaling is requested.
Does anyone have an idea how to solve this?
Thanks in advance ^^

Indeed, this is a bit of a tricky problem, but it may just need a small update in zfit.
What you are doing is correct: simply use only the data in the desired region. However, this is not the whole story, because there is a "normalization range": probabilistically speaking, it is like conditioning on a certain region, as we know the data can only be in a specific region. Hence the normalization of the PDF should only integrate over the included (LOW and HIGH) regions.
This can normally be done in two ways:
Using multispace
Use the multispace property, as you do. This should work (though it is most probably not the way to go in the future), except for a quirk in the polynomial functions: the polynomials are defined from -1 to 1. Currently, the data is therefore simply rescaled to lie within -1 and 1 (and for that it uses the "space" property of the PDF). At present, this requires a simple space (a multispace could also be allowed in principle, using the minimum and maximum of the limits).
Simultaneous fit
As mentioned in the comments by @jtlz2, you can do a simultaneous fit. That is nothing to worry about; it simply splits the likelihood into two parts. As the likelihood is a product of probabilities, we can conceptually split it into two products and multiply them (or add their logs).
So you can have the pdf fit the lower region and the upper at the same time. However, this does not solve the problem of the normalization: what should the PDF be normalized to? We will run into the same problem.
Solution 1: different space and norm
The space and the normalization range are, however, not the same thing. By default, the space (usually called 'obs') is also used as the normalization range, but this is not required. So you could use one space going from the lowest to the highest point as the obs and then set the norm range to your multispace (set_norm should do it, or set_norm_range if you're not using the newest version). This, I think, should do the trick.
Solution 2: manual re-scaling
The actual problem is that it complains that the re-scaling to -1 and 1 can't be done. Every polynomial that does this rescaling can be told not to by passing the apply_scaling=False argument. With that, you're responsible for scaling the data to lie within -1 and 1 (as the polynomials are not defined outside), and there should not be any error.
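For the manual re-scaling in Solution 2, the data has to be mapped linearly from the fit range onto [-1, 1] before it is handed to the PDF. A minimal NumPy sketch of that mapping follows. The helper name rescale_to_chebyshev_domain is hypothetical and not part of zfit, and the limits 0 and 9 are taken from the question's obs_all.

```python
import numpy as np


def rescale_to_chebyshev_domain(x, low, high):
    """Map values from [low, high] linearly onto [-1, 1]."""
    x = np.asarray(x, dtype=float)
    return 2.0 * (x - low) / (high - low) - 1.0


low, high = 0.0, 9.0  # limits of obs_all in the question
# Boundaries of the regions in the question, expressed in the rescaled variable.
edges = rescale_to_chebyshev_domain([0.0, 3.0, 6.0, 9.0], low, high)
print(edges)
```

After this mapping, the two sidebands x < 3 and x > 6 correspond to the rescaled ranges below -1/3 and above 1/3, so any spaces or limits defined for the fit would have to be expressed in the rescaled variable as well.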

How to use sklearn's IncrementalPCA partial_fit

I've got a rather large dataset that I would like to decompose but is too big to load into memory. Researching my options, it seems that sklearn's IncrementalPCA is a good choice, but I can't quite figure out how to make it work.
I can load in the data just fine:
import h5py
f = h5py.File('my_big_data.h5')
features = f['data']
And from this example, it seems I need to decide what size chunks I want to read from it:
num_rows = features.shape[0] # total number of rows in data
chunk_size = 10 # how many rows at a time to feed ipca
Then I can create my IncrementalPCA, stream the data chunk-by-chunk, and partially fit it (also from the example above):
from sklearn.decomposition import IncrementalPCA
ipca = IncrementalPCA(n_components=2)
for i in range(0, num_rows//chunk_size):
    ipca.partial_fit(features[i*chunk_size : (i+1)*chunk_size])
This all goes without error, but I'm not sure what to do next. How do I actually do the dimension reduction and get a new numpy array I can manipulate further and save?
EDIT
The code above was for testing on a smaller subset of my data; as @ImanolLuengo correctly points out, it would be way better to use a larger number of dimensions and chunk size in the final code.

As you guessed, the fitting is done properly, although I would suggest increasing the chunk_size to 100 or 1000 (or even higher, depending on the shape of your data).
What you have to do now to transform the data is to actually call transform on it:
out = my_new_features_dataset # shape N x 2
for i in range(0, num_rows//chunk_size):
    out[i*chunk_size:(i+1) * chunk_size] = ipca.transform(features[i*chunk_size : (i+1)*chunk_size])
And that should give you your new transformed features. If you still have too many samples to fit in memory, I would suggest using out as another hdf5 dataset.
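One detail of the loop over range(0, num_rows//chunk_size): when num_rows is not a multiple of chunk_size, the trailing rows are never visited. A small helper that yields slices, including a shorter final one, avoids that. The name chunk_slices is made up for this sketch. The commented usage assumes the ipca, features and out objects from the snippets above.

```python
def chunk_slices(num_rows, chunk_size):
    """Yield slices covering rows 0..num_rows-1 in order, last one possibly shorter."""
    for start in range(0, num_rows, chunk_size):
        yield slice(start, min(start + chunk_size, num_rows))


# Usage with the names from the transform loop (not run here):
# for sl in chunk_slices(num_rows, chunk_size):
#     out[sl] = ipca.transform(features[sl])

print(list(chunk_slices(25, 10)))
```

This is aimed at the transform loop. For partial_fit, as far as I know, a final chunk with fewer rows than n_components is rejected, so there the leftover rows may need to be merged into the previous chunk instead.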
Also, I would argue that reducing a huge dataset to 2 components is probably not a very good idea, but it is hard to say without knowing the shape of your features. I would suggest reducing them to sqrt(features.shape[1]) components, as it is a decent heuristic, or pro tip: use ipca.explained_variance_ratio_ to determine the best number of components for the information loss you can afford.
Edit: as for explained_variance_ratio_, it returns a vector of dimension n_components (the n_components that you pass as a parameter to IPCA) where each value i indicates the fraction of the variance of your original data explained by the i-th new component.
You can follow the procedure in this answer to extract how much information is preserved by the first n components:
>>> print(ipca.explained_variance_ratio_.cumsum())
[ 0.32047581 0.59549787 0.80178824 0.932976 1. ]
Note: the numbers are fictitious, taken from the answer above, assuming that you have reduced IPCA to 5 components. The i-th number indicates how much of the original data is explained by the first [0, i] components, as it is the cumulative sum of the explained variance ratio.
Thus, what is usually done is to fit your PCA with the same number of components as your original data:
ipca = IncrementalPCA(n_components=features.shape[1])
Then, after training on your whole data (with iteration + partial_fit) you can plot explained_variance_ratio_.cumsum() and choose how much data you want to lose. Or do it automatically:
k = np.argmax(ipca.explained_variance_ratio_.cumsum() > 0.9) + 1
np.argmax returns the first (zero-based) index in the cumsum array where the value is > 0.9; adding 1 turns that index into the number of PCA components that preserve at least 90% of the original data.
Then you can tweak the transformation to reflect it:
cs = chunk_size
out = my_new_features_dataset # shape N x k
for i in range(0, num_rows//chunk_size):
    out[i*cs:(i+1)*cs] = ipca.transform(features[i*cs:(i+1)*cs])[:, :k]
NOTE the slicing to :k, which selects only the first k components and ignores the rest.
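The choice of k can be checked against the illustrative cumulative values quoted earlier in this answer. Since np.argmax returns a zero-based index, the number of components to keep is that index plus one. The function name components_for_variance is made up for this sketch, and the fallback for a threshold that is never reached is my own assumption about the desired behavior.

```python
import numpy as np


def components_for_variance(explained_variance_ratio, threshold=0.9):
    """Smallest number of leading components whose cumulative ratio exceeds threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    reached = cumulative > threshold
    if not reached.any():
        # Threshold never reached: keep every component (assumed fallback).
        return len(cumulative)
    return int(np.argmax(reached)) + 1  # zero-based index -> component count


# Fictitious cumulative values from the answer, converted back to per-component ratios.
cumulative = np.array([0.32047581, 0.59549787, 0.80178824, 0.932976, 1.0])
ratios = np.diff(cumulative, prepend=0.0)
k = components_for_variance(ratios, threshold=0.9)
print(k)
```

With these numbers, the first three components explain about 80% of the variance and the first four about 93%, so four components are needed to pass the 90% mark.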

How can I speed up nearest neighbor search with python?

I have code which finds, for each unassigned voxel, the nearest assigned voxel. That is, I have an array of voxels; some voxels already have scalar values (1, 2, 3, 4, etc.) assigned, and some voxels are empty (let's say a value of '0'). The code below finds the nearest assigned voxel to an unassigned voxel and assigns that voxel the same scalar. So, a voxel with a scalar '0' will be assigned a value (1 or 2 or 3, ...) based on the nearest voxel. The code below works, but it takes too much time.
Is there an alternative to this? Or do you have any feedback on how to improve it further?
# self.voxels is a 3D numpy array
def fill_empty_voxel1(self, argx, argy, argz):
    """argx, argy, argz are the locations of the voxels whose value is zero"""
    argx1, argy1, argz1 = np.where(self.voxels != 0)  # find the non zero voxels
    a = np.column_stack((argx1, argy1, argz1))
    b = np.column_stack((argx, argy, argz))
    tree = cKDTree(a, leafsize=a.shape[0]+1)
    distances, ndx = tree.query(b, k=1, distance_upper_bound=self.mean)  # self.mean is a mean radius search value
    argx2, argy2, argz2 = a[ndx, 0], a[ndx, 1], a[ndx, 2]
    self.voxels[argx, argy, argz] = self.voxels[argx2, argy2, argz2]  # update the voxel array
Example
Here is a small example with a small dataset:
import numpy as np
from scipy.spatial import cKDTree
import timeit
voxels = np.zeros((10,10,5), dtype=np.uint8)
voxels[1:2,:,:] = 5.
voxels[5:6,:,:] = 2.
voxels[:,3:4,:] = 1.
voxels[:,8:9,:] = 4.
argx, argy, argz = np.where(voxels==0)
tic=timeit.default_timer()
argx1, argy1, argz1 = np.where(voxels!=0) # non zero voxels
a = np.column_stack((argx1, argy1, argz1))
b = np.column_stack((argx, argy, argz))
tree = cKDTree(a, leafsize=a.shape[0]+1)
distances, ndx = tree.query(b, k=1, distance_upper_bound= 5.)
argx2, argy2, argz2 = a[ndx][:][:,0],a[ndx][:][:,1],a[ndx][:][:,2]
voxels[argx,argy,argz] = voxels[argx2,argy2,argz2]
toc=timeit.default_timer()
timetaken = toc - tic #elapsed time in seconds
print('\nTime to fill empty voxels', timetaken)
For visualization:
from mayavi import mlab
data = voxels.astype('float')
scalar_field = mlab.pipeline.scalar_field(data)
iso_surf = mlab.pipeline.iso_surface(scalar_field)
surf = mlab.pipeline.surface(scalar_field)
vol = mlab.pipeline.volume(scalar_field,vmin=0,vmax=data.max())
mlab.outline()
mlab.show()
Now, if the voxels array has dimensions like (500,500,500), the nearest-neighbor search takes far too long. How can I overcome this? Could parallel computation reduce the time? (I have no idea whether the code can be parallelized; if you do, please let me know.)
A potential fix:
I could substantially improve the computation time by adding the n_jobs=-1 parameter to the cKDTree query (newer SciPy releases call this parameter workers).
distances, ndx = tree.query(b, k=1, distance_upper_bound= 5., n_jobs=-1)
I was able to compute the distances in less than an hour for an array of (400,100,100) on a 13-core CPU. With 1 processor, the same array takes around 18 hours.
Thanks to @gsamaras for the answer!

You can switch to approximate nearest neighbors (ANN) algorithms which usually take advantage of sophisticated hashing or proximity graph techniques to index your data quickly and perform faster queries. One example is Spotify's Annoy. Annoy's README includes a plot which shows precision-performance tradeoff comparison of various ANN algorithms published in recent years. The top-performing algorithm (at the time this comment was posted), hnsw, has a Python implementation under Non-Metric Space Library (NMSLIB).

It would be interesting to try sklearn.neighbors.NearestNeighbors, which offers an n_jobs parameter:
The number of parallel jobs to run for neighbors search.
This package also provides the Ball Tree algorithm, which you can test versus the kd-tree one, however my hunch is that the kd-tree will be better (but that again does depend on your data, so research that!).
You might also want to use dimensionality reduction, which is easy. The idea is that you reduce your dimensions, so your data contain less information and the Nearest Neighbour Problem can be tackled much faster. Of course, there is a trade-off here: accuracy!
You might/will get less accuracy with dimensionality reduction, but it might be worth a try. However, this usually applies in a high-dimensional space, and you are just in 3D. So I don't know if it would make sense to use sklearn.decomposition.PCA for your specific case.
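To make the NearestNeighbors suggestion concrete, here is a minimal sketch of the voxel-filling step with scikit-learn. It is an illustration, not a benchmarked replacement: the function name fill_empty_voxels is made up, there is no distance cap like distance_upper_bound in the question (every empty voxel gets a value), and whether n_jobs actually helps will depend on the data and the machine.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def fill_empty_voxels(voxels, n_jobs=-1):
    """Give every zero voxel the value of its nearest non-zero voxel."""
    filled = voxels.copy()
    assigned = np.argwhere(voxels != 0)  # coordinates of assigned voxels
    empty = np.argwhere(voxels == 0)     # coordinates of voxels to fill
    if len(assigned) == 0 or len(empty) == 0:
        return filled
    nn = NearestNeighbors(n_neighbors=1, algorithm="kd_tree", n_jobs=n_jobs)
    nn.fit(assigned)
    _, idx = nn.kneighbors(empty)        # idx has shape (n_empty, 1)
    nearest = assigned[idx[:, 0]]
    filled[tuple(empty.T)] = voxels[tuple(nearest.T)]
    return filled


voxels = np.zeros((10, 10, 5), dtype=np.uint8)
voxels[1:2, :, :] = 5
voxels[5:6, :, :] = 2
filled = fill_empty_voxels(voxels)
print(np.unique(filled))
```

Swapping algorithm="kd_tree" for "ball_tree" is enough to compare the two tree types mentioned above.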
A remark:
If you really want high performance, though, you won't get it with Python; you could switch to C++ and use CGAL, for example.

Using Numpy to find the average distance in a set of points

I have an array of points in unknown dimensional space, such as:
data=numpy.array(
    [[ 115, 241, 314],
     [ 153, 413, 144],
     [ 535, 2986, 41445]])
and I would like to find the average euclidean distance between all points.
Please note that I have over 20,000 points, so I would like to do this as efficiently as possible.
Thanks.

If you have access to scipy, you could try the following:
scipy.spatial.distance.cdist(data,data)
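cdist(data, data) returns the full square matrix, with zeros on the diagonal and every pair counted twice, so averaging it directly does not give the mean over distinct pairs. A possible variant, which is my substitution rather than what the answer proposes, is scipy.spatial.distance.pdist, which returns each pair once. The function name mean_pairwise_distance is made up for this sketch.

```python
import numpy as np
from scipy.spatial.distance import pdist


def mean_pairwise_distance(points):
    """Average Euclidean distance over all distinct pairs of points."""
    return pdist(np.asarray(points, dtype=float)).mean()


data = np.array([[115, 241, 314],
                 [153, 413, 144],
                 [535, 2986, 41445]])
print(mean_pairwise_distance(data))
```

For 20,000 points, pdist still materializes about 200 million distances (roughly 1.6 GB as float64), so this trades memory for simplicity.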

Well, I don't think that there is a super fast way to do this, but this should do it:
tot = 0.
for i in range(data.shape[0]-1):
    tot += ((((data[i+1:]-data[i])**2).sum(1))**.5).sum()
avg = tot/((data.shape[0]-1)*(data.shape[0])/2.)

Now that you've stated your goal of finding the outliers, you are probably better off computing the sample mean and, with that, the sample variance, since both of those are O(nd) operations. With those, you should be able to find outliers (e.g. excluding points further from the mean than some multiple of the std. dev.), and that filtering step can also be done in O(nd) time, for a total of O(nd).
You might be interested in a refresher on Chebyshev's inequality.
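The O(nd) approach described above can be sketched as follows. The details are assumptions made for illustration: the function name flag_outliers is made up, the "std. dev." is taken to be the root-mean-square distance of the points from their mean, and the default cutoff of 3 is arbitrary.

```python
import numpy as np


def flag_outliers(data, n_std=3.0):
    """Boolean mask of points further than n_std * spread from the sample mean."""
    data = np.asarray(data, dtype=float)
    center = data.mean(axis=0)                     # O(nd)
    dist = np.linalg.norm(data - center, axis=1)   # O(nd)
    spread = np.sqrt(np.mean(dist ** 2))           # RMS distance from the mean
    return dist > n_std * spread


# 99 identical points plus one far away from them.
points = np.vstack([np.zeros((99, 2)), [[100.0, 0.0]]])
mask = flag_outliers(points)
print(np.flatnonzero(mask))
```

With this definition of spread, the same argument that underlies Chebyshev's inequality bounds the fraction of flagged points by 1/n_std**2, which can help when choosing the cutoff.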

Is it ever worthwhile to optimize without a working solution? Also, computation of a distance matrix over the entire data set rarely needs to be fast because you only do it once--when you need to know a distance between two points, you just look it up, it's already calculated.
So if you don't have a place to start, here's one. If you want to do this in Numpy without writing any inline Fortran or C, that should be no problem, though you may want to include the small vector-based virtual machine called "numexpr" (available on PyPI, trivial to install), which in this case gave a 5x performance boost versus Numpy alone.
Below I've calculated a distance matrix for 10,000 points in 2D space (a 10k x 10k matrix giving the distance between all 10k points). This took 59 seconds on my MBP.
import numpy as NP
import numexpr as NE
# data are points in 2D space (x, y)--obviously, this code can accept data of any dimension
x = NP.random.randint(0, 10, 10000)
y = NP.random.randint(0, 10, 10000)
fnx = lambda q : q - NP.reshape(q, (len(q), 1))
delX = fnx(x)
delY = fnx(y)
dist_mat = NE.evaluate("(delX**2 + delY**2)**0.5")

There's no getting around the number of evaluations:
Sum[n-i, {i, 0, n}] = n(n+1)/2
But you can save yourself the expense of all those square roots if you can get by with an approximate result. It depends on your needs.
If you're going to calculate an average, I would advise you to not try putting all the values into an array before calculating. Just calculate the sum (and sum of squares if you need standard deviation as well) and throw away each value as you calculate it.
Since [formula missing] and [formula missing], I don't know if this means you have to multiply by two somewhere.

If you want a fast and inexact solution, you could probably adapt the Fast Multipole Method algorithm.
Points that are separated by a small distance have a smaller contribution to the final average distance, so it would make sense to group points into clusters and compare the distances between clusters.
