Best path in Dynamic Time Warping - python

I’ve been using different python DTW to find the shift between two images. Suppose we have image1 and image2 with the same shape. I apply dtw to each row of the two, i.e.
for i in range(image1.shape[0]):
    alignment = dtw(image2[i,:], image1[i,:])
    shift = alignment.index1 - alignment.index2
I need all_shifts, the array that collects the shift of every row, to have the same shape as both images. However, in all the Python DTW packages I have tried, alignment.index1 and alignment.index2 are longer than the input sequences s1 and s2, because the warping path repeats indices. I have tried to select unique indices with simple criteria. For example, in a synthetic test I kept, for each duplicated index, the entry with the maximum or minimum shift, but the results are noisy. In the real case I don't know the shifts between the two images in advance, so choosing the unique indices is not straightforward.
Could anyone advise me how to find the best unique indices of the alignment, so that I get a shift array with the same length as s1 and s2?
Thank you,
I have tried the following code:
import numpy as np
import pandas as pd
from dtw import *
def warp_function(image2, image1):
    warp_arr = np.zeros(image1.shape)
    warped_img = np.zeros(image2.shape)
    for i in range(image1.shape[0]):
        print(f'[Warping function]: trace {i} out of {image1.shape[0]}')
        reference = np.zeros(image1.shape[1])
        query = np.zeros(image2.shape[1])
        reference[:] = image1[i, :]
        query[:] = image2[i, :]
        alignment = dtw(query,reference)
        shift = alignment.index2 - alignment.index1
        warp_dict = {'shift': shift, 'index1': alignment.index1, 'index2': alignment.index2}
        df = pd.DataFrame(warp_dict)
        #df2 = df.drop_duplicates(subset=['index2'], keep='last')
        df2 = find_uniques(df)
        warp_arr[i,:] = df2['shift'].to_numpy()
        warp_index = df2['index1'].to_numpy()
        warped_img[i,:] = query[warp_index]
    return warp_arr, warped_img
find_uniques is a function that picks the minimum or maximum shift for each duplicated index.
I've been trying to find the shift between the two images using dtw. When I apply the shift, image2 and warped_img are expected to be the same. However, they are not. It shows that find_unique doesn't find the best unique indices.
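One possible way to get exactly one shift per sample, sketched here under assumptions, is to average all query indices that the warping path pairs with each reference index instead of picking a single duplicate. The helper name shifts_per_reference_sample is hypothetical and not part of any DTW package; it only needs the two index arrays of the path. Averaging is just one choice and is not claimed to be the "best" unique index.

```python
import numpy as np

def shifts_per_reference_sample(index1, index2, n_reference):
    """Collapse a DTW warping path to one shift per reference sample.

    index1 / index2 are the query / reference indices of the path
    (equal length, with repeats). Hypothetical helper, not a dtw API.
    """
    index1 = np.asarray(index1, dtype=float)
    index2 = np.asarray(index2, dtype=int)
    # How many path steps touch each reference index, and the sum of
    # the query indices paired with it.
    counts = np.bincount(index2, minlength=n_reference)
    sums = np.bincount(index2, weights=index1, minlength=n_reference)
    # Mean query index per reference index (NaN if an index is never visited).
    mean_query_index = np.full(n_reference, np.nan)
    hit = counts > 0
    mean_query_index[hit] = sums[hit] / counts[hit]
    # Same sign convention as the question: index2 - index1.
    return np.arange(n_reference) - mean_query_index

# Toy path: reference sample 0 is matched to query samples 0 and 1.
index1 = np.array([0, 1, 2, 2, 3])
index2 = np.array([0, 0, 1, 2, 3])
shift = shifts_per_reference_sample(index1, index2, n_reference=4)
print(shift)
```

The result has one entry per reference sample, so it fits a row of an array shaped like the images. The shifts can be fractional, so applying them would need interpolation rather than integer indexing.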


Measuring the Feret diameter of multiple particles per TIFF image

I am looking to measure the minimum and maximum diameter of multiple particles (groups) in a TIFF image.
This is my current code:
from PIL import Image
import numpy as np
from skimage import measure
import os
import pandas as pd
import warnings; warnings.filterwarnings(action='once')
dirname1 = path
final1 = []
for fname in os.listdir(dirname1):
    im = Image.open(os.path.join(dirname1, fname))
    imarray = np.array(im)
    final1.append(imarray)
final1 = np.asarray(final1)
groups, group_count = measure.label(final1 == 0, return_num = True, connectivity = 2)
print('Groups: \n', groups)
print(f'Number of particles: {group_count}')
df1 = (pd.DataFrame(dict(zip(['Particle #', 'Size [pixel #]'],
                             np.unique(groups, return_counts=True))))
       .loc[lambda d: d['Particle #'].ne(0)])
df1.index -= 1
props = measure.regionprops_table(groups, properties = ['label', 'equivalent_diameter'])
df1_new = pd.DataFrame(props)
My TIFF image looks like this: Image example (I normally work with multiple TIFF images)
In my code, I have used skimage to calculate the equivalent diameter. However, I need the min/max Feret diameter in the df1 DataFrame as well.
Thank you.
If you change the last two lines in your code to:
props = measure.regionprops_table(groups[0], properties = ['label', 'equivalent_diameter', 'axis_major_length', 'axis_minor_length'])
props['equiv_dia'] = props.pop('equivalent_diameter')
props['min_feret'] = props.pop('axis_minor_length')
props['max_feret'] = props.pop('axis_major_length')
df1_new = pd.DataFrame(props)
For my test image, you will get the following output:
   label  equiv_dia  min_feret  max_feret
0      1   2.580762   0.000000   3.651484
1      2  10.802272   3.651484  26.226566
2      3   3.059832   2.000000   3.651484
3      4   3.578801   3.096570   4.195087
4      5   5.497605   3.651484   8.032033
The changes to your code solve the following issues:
Error message because of bad array shapes
Lack of columns for Feret values in the DataFrame
measure.regionprops_table() offers the 'axis_major_length' and 'axis_minor_length' properties, which I use here as stand-ins for the max/min Feret values. They are the axis lengths of the ellipse that has the same normalized second central moments as the region, so they approximate the Feret diameters rather than reproduce them exactly.
If you are not satisfied with the values these properties provide (I'm not), I suggest deciding which other tool to use for the calculation. That may be worth a separate question about calculating Feret values.
I created the tag feret-values, so you can use it along with python in a new question about the differences between min/max Feret values calculated by different Python scripts or modules.
I tried the Feret property in scikit-image myself, but it throws the error 'RegionProperties' object has no attribute 'feret_diameter_max'.
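If the ellipse-based values are not close enough, one alternative is a brute-force caliper estimate computed directly from the pixel coordinates of a single particle. The sketch below is only an illustration under assumptions: feret_diameters is a hypothetical helper (not a scikit-image function), it measures between pixel centres, and it samples a finite number of directions, so the values are approximate.

```python
import numpy as np

def feret_diameters(mask, n_angles=180):
    """Approximate (min, max) Feret diameter of one binary particle.

    mask is a 2D boolean array that is True on the particle.
    Distances are measured between pixel centres over n_angles
    sampled directions in [0, 180) degrees. Hypothetical helper.
    """
    rows, cols = np.nonzero(mask)
    coords = np.column_stack((cols, rows)).astype(float)  # (x, y) pairs
    angles = np.deg2rad(np.arange(n_angles) * 180.0 / n_angles)
    directions = np.column_stack((np.cos(angles), np.sin(angles)))
    # Project every pixel onto every direction: shape (n_pixels, n_angles).
    projections = coords @ directions.T
    # Caliper width of the particle along each direction.
    widths = projections.max(axis=0) - projections.min(axis=0)
    return widths.min(), widths.max()

# Toy particle: a 3 x 10 pixel rectangle.
mask = np.zeros((20, 20), dtype=bool)
mask[5:8, 4:14] = True
min_feret, max_feret = feret_diameters(mask)
print(min_feret, max_feret)
```

For a labelled image, the helper could be called once per particle with a mask such as groups[0] == label. Adding one pixel to both values would convert the centre-to-centre distances into approximate edge-to-edge distances, if that convention is preferred.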

How to efficiently index a numpy array based on varying start and stop indexes per row

I have a 2D numpy array with rows being time series of a feature, based on which I'm training a neural network. For generalisation purposes, I would like to subset these time series at random points. I'd like them to have a minimum subset length as well. However, the network requires fixed length time series, so I need to pre-pad the resulting subsets with zeroes.
Currently, I'm doing it using the code below, which includes a nasty for-loop, because I don't know how I can use fancy indexing for this particular problem. As this piece of code is part of the network data generator, it needs to be fast to keep up to pace with the data-hungry GPU. Does anyone know a numpy-way of doing this without the for-loop?
import numpy as np
import matplotlib.pyplot as plt
# Amount of time series to consider
batchsize = 25
# Original length of the time series
timesteps = 150
# As an example, fill the 2D array with sine function time series
sinefunction = np.expand_dims(np.sin(np.arange(timesteps)), axis=0)
originalarray = np.repeat(sinefunction, batchsize, axis=0)
# Now the real thing, we want:
# - to start the time series at a random moment (between 0 and maxstart)
# - to end the time series at a random moment
# - however with a minimum length of the resulting subset time series (minlength)
maxstart = 50
minlength = 75
# get random starts
randomstarts = np.random.choice(np.arange(0, maxstart), size=batchsize)
# get random stops
randomstops = np.random.choice(np.arange(maxstart + minlength, timesteps), size=batchsize)
# determine the resulting random sizes of the subset time series
randomsizes = randomstops - randomstarts
# finally create a new 2D array with all the randomly subset time series, however pre-padded with zeros
cutarray = np.zeros_like(originalarray)
for i in range(batchsize):
    cutarray[i, -randomsizes[i]:] = originalarray[i, randomstarts[i]:randomstops[i]]
To show what goes in and out of the function:
# Show that it worked
f, ax = plt.subplots(2, 1)
ax[0].set_title('original array')
ax[1].set_title('zero-padded subset array')
Approach #1 : Views-based
We can leverage scikit-image's view_as_windows, which is built on np.lib.stride_tricks.as_strided, to get sliding-window views into a zero-padded version of the input and assign into a zero-padded version of the output. All of that padding is needed for a vectorized solution because the subsets are ragged. The upside is that working on views is efficient in both memory and performance.
The implementation would look something like this -
from skimage.util.shape import view_as_windows
n = randomsizes.max()
max_extent = randomstarts.max()+n
padlen = max_extent - originalarray.shape[1]
p = np.zeros((originalarray.shape[0],padlen),dtype=originalarray.dtype)
a = np.hstack((originalarray,p))
w = view_as_windows(a,(1,n))[...,0,:]
out_vals = w[np.arange(len(randomstarts)),randomstarts]
out_starts = originalarray.shape[1]-randomsizes
out_extensions_max = out_starts.max()+n
out = np.zeros((originalarray.shape[0],out_extensions_max),dtype=originalarray.dtype)
w2 = view_as_windows(out,(1,n))[...,0,:]
w2[np.arange(len(out_starts)),out_starts] = out_vals
cutarray_out = out[:,:originalarray.shape[1]]
Approach #2 : With masking
cutarray_out = np.zeros_like(originalarray)
r = np.arange(originalarray.shape[1])
m = (randomstarts[:,None]<=r) & (randomstops[:,None]>r)
s = originalarray.shape[1]-randomsizes
m2 = s[:,None]<=r
cutarray_out[m2] = originalarray[m]
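As a quick sanity check, the masking approach can be compared against the loop from the question on random data. This is only a sketch: the random generator, the seed and the names source_mask, target_mask, expected and result are my own choices for illustration. It relies on boolean indexing visiting elements in row-major order and on both masks selecting the same number of elements in each row.

```python
import numpy as np

rng = np.random.default_rng(0)
batchsize, timesteps, maxstart, minlength = 25, 150, 50, 75

originalarray = rng.normal(size=(batchsize, timesteps))
randomstarts = rng.integers(0, maxstart, size=batchsize)
randomstops = rng.integers(maxstart + minlength, timesteps, size=batchsize)
randomsizes = randomstops - randomstarts

# Reference result: the explicit loop from the question.
expected = np.zeros_like(originalarray)
for i in range(batchsize):
    expected[i, -randomsizes[i]:] = originalarray[i, randomstarts[i]:randomstops[i]]

# Masking approach: one mask selects the source elements, the other
# marks the right-aligned target positions. Each row has the same
# number of True entries in both masks.
r = np.arange(timesteps)
source_mask = (randomstarts[:, None] <= r) & (r < randomstops[:, None])
target_mask = (timesteps - randomsizes)[:, None] <= r
result = np.zeros_like(originalarray)
result[target_mask] = originalarray[source_mask]

print(np.array_equal(result, expected))
```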

How can I remove sharp jumps in data?

I have some skin temperature data (collected at 1Hz) which I intend to analyse.
However, the sensors were not always in contact with the skin. So I have a challenge of removing this non-skin temperature data, whilst preserving the actual skin temperature data. I have about 100 files to analyse, so I need to make this automated.
I'm aware that there is already this similar post, however I've not been able to use that to solve my problem.
My data roughly looks like this:
df =
timeStamp Temp
2018-05-04 10:08:00 28.63
. .
. .
2018-05-04 21:00:00 31.63
The first step I've taken is to simply apply a minimum threshold- this has got rid of the majority of the non-skin data. However, I'm left with the sharp jumps where the sensor was either removed or attached:
To remove these jumps, I was thinking about taking an approach where I use the first order differential of the temp and then use another set of thresholds to get rid of the data I'm not interested in.
df_diff = df.diff(60) # period of about 60 makes jumps stick out
filter_index = np.nonzero((df_diff.Temp < -1) | (df_diff.Temp > 0.5)) # diff below -1 or above 0.5: most likely data jumps
However, I find myself stuck here. The main problem is that:
1) I don't know how to now use this index list to delete the non-skin data in df. How is best to do this?
The more minor problem is that
2) I think I will still be left with some residual artefacts from the data jumps near the edges (e.g. where a tighter threshold would start to chuck away good data). Is there either a better filtering strategy or a way to then get rid of these artefacts?
*Edit as suggested I've also calculated the second order diff, but to be honest, I think the first order diff would allow for tighter thresholds (see below):
*Edit 2: Link to sample data
Try the code below (I used a tangent function to generate data). I used the second order difference idea from Mad Physicist in the comments.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame()
df[0] = np.arange(0,10,0.005)
df[1] = np.tan(df[0])
#the following line calculates the absolute value of a second order finite
#difference (derivative)
df[2] = 0.5*(df[1].diff()+df[1].diff(periods=-1)).abs()
df.loc[df[2] < .05][1].plot() #select out regions of a high rate-of-change
df[1].plot() #plot original data
Following is a zoom of the output showing what got filtered. Matplotlib plots a line from beginning to end of the removed data.
I believe your first question is answered by the .loc selection above.
Your second question will take some experimentation with your dataset. The code above only selects out high-derivative data. You'll still need your threshold selection to remove zeroes and the like. You can experiment with where to make the derivative selection, and a histogram of the derivative can give you a hint as to what to select out.
Also, higher-order difference equations are possible to help with smoothing. This should help remove artifacts without having to trim around the cuts.
A fourth-order finite difference can be applied using this:
df[2] = (df[1].diff(periods=1)-df[1].diff(periods=-1))*8/12 - \
    (df[1].diff(periods=2)-df[1].diff(periods=-2))*1/12
df[2] = df[2].abs()
It's reasonable to think that it may help. The coefficients above can be worked out or derived from the following link for higher orders.
Finite Difference Coefficients Calculator
Note: The above second and fourth order central difference equations are not proper first derivatives. One must divide by the interval length (in this case 0.005) to get the actual derivative.
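To make the note about dividing by the interval length concrete, here is a small NumPy sketch of the standard fourth-order central difference for a first derivative, checked against a function whose derivative is known. The helper name central_diff_4th and the sine test signal are assumptions made for illustration only.

```python
import numpy as np

def central_diff_4th(y, h):
    """Fourth-order central difference estimate of dy/dx.

    Uses (8*(y[i+1] - y[i-1]) - (y[i+2] - y[i-2])) / (12*h).
    The two samples at each end have no full stencil and stay NaN.
    Hypothetical helper written for this example.
    """
    y = np.asarray(y, dtype=float)
    d = np.full_like(y, np.nan)
    d[2:-2] = (8.0 * (y[3:-1] - y[1:-3]) - (y[4:] - y[:-4])) / (12.0 * h)
    return d

h = 0.005
x = np.arange(0, 10, h)
deriv = central_diff_4th(np.sin(x), h)

# Compare with the exact derivative cos(x) away from the edges.
max_error = np.max(np.abs(deriv[2:-2] - np.cos(x[2:-2])))
print(max_error)
```

Without the division by h the result is only proportional to the derivative, which is still fine for thresholding as long as the threshold is chosen on the same scale.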
Here's a suggestion that targets your issues regarding
[...]an approach where I use the first order differential of the temp and then use another set of thresholds to get rid of the data I'm not interested in.
[..]I don't know how to now use this index list to delete the non-skin data in df. How is best to do this?
using stats.zscore() and pandas.merge()
As it is, it will still have a minor issue with your concerns regarding
[...]left with some residual artefacts from the data jumps near the edges[...]
But we'll get to that later.
First, here's a snippet to produce a dataframe that shares some of the challenges with your dataset:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
# A function for noisy data with a trend element
def sample():
    base = 100
    nsample = 50
    sigma = 10
    # Basic df with trend and sinus seasonality
    trend1 = np.linspace(0,1, nsample)
    y1 = np.sin(trend1)
    dates = pd.date_range('2016-01-01', periods=nsample).tolist()
    df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)
    # Gaussian Noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + (np.sin(trend1))
    df['trend2'] = 1/(np.cos(trend1)/1.05)
    df['y4'] = df['y3'] * df['trend2']
    # Keep only the final series and name it 'Temp'
    df = df['y4'].to_frame()
    df.columns = ['Temp']
    # Insert missing values
    df.iloc[20:31, 0] = np.nan
    # Insert spikes on both sides of the missing values
    df.iloc[19, 0] = df.iloc[39, 0]/4000
    df.iloc[31, 0] = df.iloc[15, 0]/4000
    return df
# Dataframe with random data
df_raw = sample()
As you can see, there are two distinct spikes with missing numbers between them. And it's really the missing numbers that are causing the problems here if you prefer to isolate values where the differences are large. The first spike is not a problem since you'll find the difference between a very small number and a number that is more similar to the rest of the data:
But for the second spike, you're going to get the (nonexisting) difference between a very small number and a non-existing number, so that the extreme data-point you'll end up removing is the difference between your outlier and the next observation:
This is not a huge problem for one single observation. You could just fill it right back in there. But for larger data sets that would not be a very viable solution. Anyway, if you can manage without that particular value, the below code should solve your problem. You will also have a similar problem with your very first observation, but I think it would be far more trivial to decide whether or not to keep that one value.
The steps:
# 1. Get some info about the original data:
firstVal = df_raw[:1]
colName = df_raw.columns
# 2. Take the first difference
df_diff = df_raw.diff()
# 3. Remove missing values
df_clean = df_diff.dropna()
# 4. Select a level for a Z-score to identify and remove outliers
level = 3
df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
ix_keep = df_Z.index
# 5. Subset the raw dataframe with the indexes you'd like to keep
df_keep = df_raw.loc[ix_keep]
# 6.
# df_keep will be missing some indexes.
# Do the following if you'd like to keep those indexes
# and, for example, fill missing values with the previous values
df_out = pd.merge(df_keep, df_raw, how='outer', left_index=True, right_index=True)
# 7. Keep only the first column
df_out = df_out.iloc[:,0].to_frame()
# 8. Fill missing values
df_complete = df_out.ffill()
# 9. Replace first value
df_complete.iloc[0] = firstVal.iloc[0]
# 10. Reset column names
df_complete.columns = colName
# Result
Here's the whole thing for an easy copy-paste:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
# A function for noisy data with a trend element
def sample():
    base = 100
    nsample = 50
    sigma = 10
    # Basic df with trend and sinus seasonality
    trend1 = np.linspace(0,1, nsample)
    y1 = np.sin(trend1)
    dates = pd.date_range('2016-01-01', periods=nsample).tolist()
    df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)
    # Gaussian Noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + (np.sin(trend1))
    df['trend2'] = 1/(np.cos(trend1)/1.05)
    df['y4'] = df['y3'] * df['trend2']
    # Keep only the final series and name it 'Temp'
    df = df['y4'].to_frame()
    df.columns = ['Temp']
    # Insert missing values
    df.iloc[20:31, 0] = np.nan
    # Insert spikes on both sides of the missing values
    df.iloc[19, 0] = df.iloc[39, 0]/4000
    df.iloc[31, 0] = df.iloc[15, 0]/4000
    return df
# A function for removing outliers
def noSpikes(df, level, keepFirst):
    # 1. Get some info about the original data:
    firstVal = df[:1]
    colName = df.columns
    # 2. Take the first difference
    df_diff = df.diff()
    # 3. Remove missing values
    df_clean = df_diff.dropna()
    # 4. Select a level for a Z-score to identify and remove outliers
    df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
    ix_keep = df_Z.index
    # 5. Subset the input dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]
    # 6.
    # df_keep will be missing some indexes.
    # Do the following if you'd like to keep those indexes
    # and, for example, fill missing values with the previous values
    df_out = pd.merge(df_keep, df, how='outer', left_index=True, right_index=True)
    # 7. Keep only the first column
    df_out = df_out.iloc[:,0].to_frame()
    # 8. Fill missing values
    df_complete = df_out.ffill()
    # 9. Reset column names
    df_complete.columns = colName
    # Keep the first value
    if keepFirst:
        df_complete.iloc[0] = firstVal.iloc[0]
    return df_complete
# Dataframe with random data
df_raw = sample()
# Remove outliers
df_cleaned = noSpikes(df=df_raw, level = 3, keepFirst = True)
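For the residual artefacts near the edges of a jump, one option is to widen the outlier mask by a few samples on each side before dropping or filling values. This is an assumption on my part and not something the code above does. The helper widen_mask, the margin value and the toy data are hypothetical and purely for illustration.

```python
import numpy as np

def widen_mask(bad, margin):
    """Extend every True run in a boolean mask by `margin` samples
    on both sides. Hypothetical helper; assumes len(bad) >= 2*margin + 1
    so that mode='same' keeps the length of `bad`.
    """
    kernel = np.ones(2 * margin + 1, dtype=int)
    # A sample is flagged if any flagged sample lies within `margin` of it.
    return np.convolve(bad.astype(int), kernel, mode='same') > 0

# Toy series with one spike at position 3.
temps = np.array([30.1, 30.2, 30.0, 12.0, 30.3, 30.2, 30.1, 30.2])
bad = np.abs(temps - np.median(temps)) > 5
widened = widen_mask(bad, margin=1)

# Blank out the spike and its neighbours, then treat them like any
# other missing values (for example with a forward fill).
cleaned = np.where(widened, np.nan, temps)
print(cleaned)
```

The margin trades lost good data against leftover edge effects, so it would need tuning per dataset, much like the Z-score level.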

python - combining argsort with masking to get nearest values in moving window

I have some code for calculating missing values in an image, based on neighbouring values in a 2D circular window. It also uses the values from one or more temporally-adjacent images at the same locations (i.e. the same 2D window shifted in the 3rd dimension).
For each position that is missing, I need to calculate the value based not necessarily on all the values available in the whole window, but only on the spatially-nearest n cells that do have values (in both images / Z-axis positions), where n is some value less than the total number of cells in the 2D window.
At the minute, it's much quicker to calculate for everything in the window, because my means of sorting to get the nearest n cells with data is the slowest part of the function as it has to be repeated each time even though the distances in terms of window coordinates do not change. I'm not sure this is necessary and feel I must be able to get the sorted distances once, and then mask those in the process of only selecting available cells.
Here's my code for selecting the data to use within a window of the gap cell location:
import numpy as np
# radius will in reality be ~100
radius = 2
y,x = np.ogrid[-radius:radius+1, -radius:radius+1]
dist = np.sqrt(x**2 + y**2)
circle_template = dist > radius
# this will in reality be a very large 3 dimensional array
# representing daily images with some gaps, indicated by 0s
dataStack = np.zeros((2,5,5))
dataStack[1] = (np.random.random(25) * 100).reshape(dist.shape)
dataStack[0] = (np.random.random(25) * 100).reshape(dist.shape)
testdata = dataStack[1]
alternatedata = dataStack[0]
random_gap_locations = (np.random.random(25) * 30).reshape(dist.shape) > testdata
testdata[random_gap_locations] = 0
testdata[radius, radius] = 0
# in reality we will go through every gap (zero) location in the data
# for each image and for each gap use slicing to get a window of
# size (radius*2+1, radius*2+1) around it from each image, with the
# gap being at the centre i.e.
# testgaplocation = [radius, radius]
# and the variables testdata, alternatedata below will refer to these
# slices
locations_to_exclude = np.logical_or(circle_template,
                                     np.logical_or(testdata==0, alternatedata==0))
# the places that are inside the circular mask and where both images
# have data
locations_to_include = ~locations_to_exclude
number_available = np.count_nonzero(locations_to_include)
# we only want to do the interpolation calculations from the nearest n
# locations that have data available, n will be ~100 in reality
number_required = 3
available_distances = dist[locations_to_include]
available_data = testdata[locations_to_include]
available_alternates = alternatedata[locations_to_include]
if number_available > number_required:
    # In this case we need to find the closest number_required of elements, based
    # on distances recorded in dist, from available_data and available_alternates
    # Having to repeat this argsort for each gap cell calculation is slow and feels
    # like it should be avoidable
    sortedDistanceIndices = available_distances.argsort(kind = 'mergesort',axis=None)
    requiredIndices = sortedDistanceIndices[0:number_required]
    selected_data = np.take(available_data, requiredIndices)
    selected_alternates = np.take(available_alternates , requiredIndices)
else:
    # we just use available_data and available_alternates as they are...
    selected_data = available_data
    selected_alternates = available_alternates
# now do stuff with the selected data to calculate a value for the gap cell
This works, but over half of the total time of the function is taken by the argsort of the masked spatial distance data (~900 µs of a total 1.4 ms), and this function will be running tens of billions of times, so this is an important difference!
I am sure that I must be able to do this argsort just once outside of the function, when the spatial distance window is originally set up, and then include those sort indices in the masking, to get the first number_required indices without having to redo the sort. The answer might involve putting the various bits that we are extracting from into a record array, but I can't figure out how, if so. Can anyone see how I can make this part of the process more efficient?
So you want to do the sorting outside of the loop:
sorted_dist_idcs = dist.argsort(kind='mergesort', axis=None)
Then using some variables from the original code, this is what I could come up with, though it still feels like a major round-trip..
loc_to_incl_sorted = locations_to_include.take(sorted_dist_idcs)
sorted_dist_idcs_to_incl = sorted_dist_idcs[loc_to_incl_sorted]
required_idcs = sorted_dist_idcs_to_incl[:number_required]
selected_data = testdata.take(required_idcs)
selected_alternates = alternatedata.take(required_idcs)
Note that required_idcs refers to locations in testdata, not in available_data as in the original code. In this snippet I used take because it conveniently indexes the flattened array.
@moarningsun - thanks for the comment and answer. These got me on the right track, but don't quite work for me when the gap is < radius from the edge of the data: in this case I use a window around the gap cell which is "trimmed" to the data bounds. In this situation the indices reflect the "full" window and thus can't be used to select cells from the bounded window.
Unfortunately I edited that part of my code out when I clarified the original question but it's turned out to be relevant.
I've realised now that if you use argsort again on the output of argsort then you get ranks; i.e. the position that each item would have when the overall array was sorted. We can safely mask these and then take the smallest number_required of them (and do this on a structured array to get the corresponding data at the same time).
This implies another sort within the loop, but in fact we can use partitioning rather than a full sort, because all we need is the smallest number_required items. If number_required is substantially less than the number of data items then this is much faster than doing the argsort.
For example with number_required = 80 and number_available = 15000 the full argsort takes ~900 µs whereas argpartition followed by index and slice to get the first 80 takes ~110 µs. We still need to do the argsort to get the ranks at the outset (rather than just partitioning based on distance) in order to get the stability of the mergesort, and thus get the "right one" when distance is not unique.
My code as shown below now runs in ~610 µs on real data, including the actual calculations that aren't shown here. I'm happy with that now, but there seem to be several other apparently minor factors that can have an influence on the runtime that's hard to understand.
For example putting the circle_template in the structured array along with dist, ranks, and another field not shown here, doubles the runtime of the overall function (even if we don't access circle_template in the loop!). Even worse, using np.partition on the structured array with order=['ranks'] increases the overall function runtime by almost two orders of magnitude vs using np.argpartition as shown below!
import numpy as np
# radius will in reality be ~100
radius = 2
y,x = np.ogrid[-radius:radius+1, -radius:radius+1]
dist = np.sqrt(x**2 + y**2)
circle_template = dist > radius
ranks = dist.argsort(axis=None,kind='mergesort').argsort().reshape(dist.shape)
diam = radius * 2 + 1
# putting circle_template in this array too doubles overall function runtime!
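fullWindowArray = np.zeros((diam,diam),dtype=[('ranks',ranks.dtype.str),
                                              ('thisdata',float),
                                              ('alternatedata',float),
                                              ('dist',dist.dtype.str),
                                              ('circle_template',bool)])
fullWindowArray['ranks'] = ranks
fullWindowArray['dist'] = dist
fullWindowArray['circle_template'] = circle_template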
# this will in reality be a very large 3 dimensional array
# representing daily images with some gaps, indicated by 0s
dataStack = np.zeros((2,5,5))
dataStack[1] = (np.random.random(25) * 100).reshape(dist.shape)
dataStack[0] = (np.random.random(25) * 100).reshape(dist.shape)
testdata = dataStack[1]
alternatedata = dataStack[0]
random_gap_locations = (np.random.random(25) * 30).reshape(dist.shape) > testdata
testdata[random_gap_locations] = 0
testdata[radius, radius] = 0
# in reality we will loop here to go through every gap (zero) location in the data
# for each image
gapz, gapy, gapx = 1, radius, radius
desLeft, desRight = gapx - radius, gapx + radius+1
desTop, desBottom = gapy - radius, gapy + radius+1
extentB, extentR = dataStack.shape[1:]
# handle the case where the gap is < search radius from the edge of
# the data. If this is the case, we can't use the full
# diam * diam window
dataL = max(0, desLeft)
maskL = 0 if desLeft >= 0 else abs(dataL - desLeft)
dataT = max(0, desTop)
maskT = 0 if desTop >= 0 else abs(dataT - desTop)
dataR = min(desRight, extentR)
maskR = diam if desRight <= extentR else diam - (desRight - extentR)
dataB = min(desBottom,extentB)
maskB = diam if desBottom <= extentB else diam - (desBottom - extentB)
# get the slice that we will be working within
# ranks, dist and circle are already populated
boundedWindowArray = fullWindowArray[maskT:maskB,maskL:maskR]
boundedWindowArray['alternatedata'] = alternatedata[dataT:dataB, dataL:dataR]
boundedWindowArray['thisdata'] = testdata[dataT:dataB, dataL:dataR]
locations_to_exclude = np.logical_or(boundedWindowArray['circle_template'],
                                     np.logical_or(boundedWindowArray['thisdata']==0,
                                                   boundedWindowArray['alternatedata']==0))
# the places that are inside the circular mask and where both images
# have data
locations_to_include = ~locations_to_exclude
number_available = np.count_nonzero(locations_to_include)
# we only want to do the interpolation calculations from the nearest n
# locations that have data available, n will be ~100 in reality
number_required = 3
data_to_use = boundedWindowArray[locations_to_include]
if number_available > number_required:
    # argpartition seems to be v fast when number_required is
    # substantially < data_to_use.size
    # But partition on the structured array itself with order=['ranks']
    # is almost 2 orders of magnitude slower!
    reqIndices = np.argpartition(data_to_use['ranks'],number_required)[:number_required]
    data_to_use = np.take(data_to_use,reqIndices)
# otherwise we just use data_to_use as it is...
# now do stuff with the selected data to calculate a value for the gap cell
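The rank-then-partition idea can be checked on a toy window by comparing it with a stable argsort of the masked distances. The sketch below is an illustration under assumptions: the deterministic include mask is arbitrary, and the names nearest_fast and nearest_ref are my own. Ties in distance are broken by flat index in both variants, which is why the two selections should agree.

```python
import numpy as np

radius = 3
y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
dist = np.sqrt(x**2 + y**2)

# Done once: the rank of every window cell in a stable sort by distance.
ranks = dist.argsort(axis=None, kind='mergesort').argsort().reshape(dist.shape)

# Per gap: an arbitrary mask of cells that are inside the circle and
# have data (deterministic here so the example is reproducible).
include = (dist <= radius) & ((x + 2 * y) % 3 != 0)
number_required = 5

# Fast variant: partition the precomputed ranks of the available cells.
flat_idx = np.flatnonzero(include)
available_ranks = ranks.ravel()[flat_idx]
part = np.argpartition(available_ranks, number_required)[:number_required]
nearest_fast = np.sort(flat_idx[part])

# Reference variant: stable argsort of the masked distances each time.
order = dist.ravel()[flat_idx].argsort(kind='mergesort')[:number_required]
nearest_ref = np.sort(flat_idx[order])

print(np.array_equal(nearest_fast, nearest_ref))
```

Note that argpartition does not return the selected cells in rank order, hence the np.sort before comparing. If the order matters, the few selected ranks can be sorted afterwards at little cost.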

Moving average of an array in Python

I have an array where discrete sinewave values are recorded and stored. I want to find the max and min of the waveform. Since the sinewave data is recorded voltages using a DAQ, there will be some noise, so I want to do a weighted average. Assuming self.yArray contains my sinewave values, here is my code so far:
filterarray = []
filtersize = 2
length = len(self.yArray)
for x in range (0, length-(filtersize+1)):
    for y in range (0,filtersize):
        summation = sum(self.yArray[x+y])
    ave = summation/filtersize
My issue seems to be in the second for loop, where depending on my averaging window size (filtersize), I want to sum up the values in the window to take the average of them. I receive an error saying:
summation = sum(self.yArray[x+y])
TypeError: 'float' object is not iterable
I am an EE with very little experience in programming, so any help would be greatly appreciated!
The other answers correctly describe your error, but this type of problem really calls out for using numpy. Numpy will run faster, be more memory efficient, and is more expressive and convenient for this type of problem. Here's an example:
import numpy as np
import matplotlib.pyplot as plt
# make a sine wave with noise
times = np.arange(0, 10*np.pi, .01)
noise = .1*np.random.ranf(len(times))
wfm = np.sin(times) + noise
# smoothing it with a running average in one line using a convolution
# using a convolution, you could also easily smooth with other filters
# like a Gaussian, etc.
n_ave = 20
smoothed = np.convolve(wfm, np.ones(n_ave)/n_ave, mode='same')
plt.plot(times, wfm, times, -.5+smoothed)
If you don't want to use numpy, note that there's a logical error in your program that results in the TypeError. The problem is that in the line
summation = sum(self.yArray[x+y])
you're using sum inside the loop in which you're also accumulating the sum. Either use sum without the loop, or loop through the array and add up the elements yourself, but not both (doing both, i.e. applying sum to a single indexed array element, is what leads to the error in the first place). That is, here are two solutions:
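# Solution 1: sum over a slice of the array
filterarray = []
filtersize = 2
length = len(self.yArray)
for x in range (0, length-(filtersize+1)):
    summation = sum(self.yArray[x:x+filtersize]) # sum over section of array
    ave = summation/filtersize
    filterarray.append(ave) # store the average of this window
# Solution 2: accumulate the sum in an explicit inner loop
filterarray = []
filtersize = 2
length = len(self.yArray)
for x in range (0, length-(filtersize+1)):
    summation = 0.
    for y in range (0,filtersize):
        summation += self.yArray[x+y]
    ave = summation/filtersize
    filterarray.append(ave) # store the average of this window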
self.yArray[x+y] returns a single item from the self.yArray list. If you are trying to get a subset of yArray, use the slice operator instead:
summation = sum(self.yArray[x:x+filtersize])
This returns an iterable that the sum builtin can use.
A bit more information about python slices can be found here (scroll down to the "Sequences" section):
You could use numpy, like:
import numpy
filtersize = 2
ysums = numpy.cumsum(numpy.array(self.yArray, dtype=float))
ylags = numpy.roll(ysums, filtersize)
ylags[0:filtersize] = 0.0
moving_avg = (ysums - ylags) / filtersize
Your original code attempts to call sum on the float value stored at yArray[x+y], where x+y evaluates to an integer index of that float value. Summing a slice of the list avoids the error:
summation = sum(self.yArray[x:x+filtersize])
Indeed numpy is the way to go. One of the nice features of python is list comprehensions, allowing you to do away with the typical nested for loop constructs. Here goes an example, for your particular problem...
import numpy as np
res=[np.sum(myarr[i:i+step],dtype=float)/step for i in range(len(myarr)-step+1)]
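To tie the answers together, here is a small self-contained sketch of two equivalent NumPy moving averages that only return full windows, followed by the max/min of the smoothed waveform that the question asks for. The function names, the window length and the noisy test signal are assumptions chosen for illustration.

```python
import numpy as np

def moving_average_convolve(values, window):
    """Moving average via convolution; 'valid' keeps only full windows."""
    return np.convolve(values, np.ones(window) / window, mode='valid')

def moving_average_cumsum(values, window):
    """Same result from differences of a cumulative sum."""
    csum = np.cumsum(np.insert(np.asarray(values, dtype=float), 0, 0.0))
    return (csum[window:] - csum[:-window]) / window

# Small check where the result is easy to verify by hand.
small = moving_average_cumsum([1, 2, 3, 4, 5], 2)
print(small)

# Noisy sine wave standing in for the recorded DAQ voltages.
rng = np.random.default_rng(0)
times = np.arange(0, 10 * np.pi, 0.01)
wfm = np.sin(times) + 0.1 * rng.normal(size=times.size)

smoothed = moving_average_convolve(wfm, 20)
print(smoothed.max(), smoothed.min())
```

With mode='valid' the output is shorter than the input by window - 1 samples, which avoids the edge effects that mode='same' introduces at both ends.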
