Simple moving average for time series data - Python

I need to calculate the SMA (simple moving average) for time series data.
In particular, I want to plot that averaged series on the x-axis and another array on the y-axis.
Before applying the SMA, these two arrays had the same length.
Then I found this example online to calculate the SMA:
# Python program to calculate
# simple moving averages using pandas
import pandas as pd
arr = [1, 2, 3, 7, 9]
window_size = 3
# Convert array of integers to pandas series
numbers_series = pd.Series(arr)
# Get the window of series
# of observations of specified window size
windows = numbers_series.rolling(window_size)
# Create a series of moving
# averages of each window
moving_averages = windows.mean()
# Convert pandas series back to list
moving_averages_list = moving_averages.tolist()
# Remove null entries from the list
final_list = moving_averages_list[window_size - 1:]
print(final_list)
I tried to adapt that to my case, but I get this error:
"ValueError: x and y must have same first dimension, but have shapes
(261228,) and (261237,)"
Here is part of my code, in case it helps to understand the problem better:
y_Int_40_dmRe=pd.Series(y_Int_40_dmRe)
windows_40_dmRe = y_Int_40_dmRe.rolling(window_size)
moving_averages_40_dmRe = windows_40_dmRe.mean()
moving_averages_40_dmRe_list = moving_averages_40_dmRe.tolist()
final_list_40_dmRe = moving_averages_40_dmRe_list[window_size - 1:]
plt.plot(final_list_40_dmRe,y_TotEn_40_dmRe, linewidth=2, label="40° - dmRe")
I'm here if you need more information; thank you in advance for your help.
Chiara
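For reference, a minimal sketch (with made-up arrays, not the asker's data) showing why the two lengths diverge and two ways to realign them; window_size = 10 is only illustrative:
import numpy as np
import pandas as pd

window_size = 10                      # illustrative value
x = pd.Series(np.random.rand(100))    # stands in for y_Int_40_dmRe
y = np.random.rand(100)               # stands in for y_TotEn_40_dmRe

# Option 1: trim y the same way the moving-average list was trimmed
sma = x.rolling(window_size).mean().tolist()[window_size - 1:]
y_trimmed = y[window_size - 1:]
assert len(sma) == len(y_trimmed)

# Option 2: keep the original length by allowing partial windows
sma_full = x.rolling(window_size, min_periods=1).mean()
assert len(sma_full) == len(y)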

Related

Re-distributing 2d data with max in middle

Hey all, I have a set of seemingly random 2D data that I want to reorder. This is really for an image with specific values at each pixel, but the concept will be the same.
I have a large 2D array that looks very random, say:
x = 100
y = 120
np.random.random((x,y))
and I want to redistribute the 2D matrix so that the maximum value is in the center, with the remaining values surrounding it in descending order, giving it a sort of Gaussian fall-off from the center.
small example:
output = [[0.0, 0.5, 1.0, 1.0, 1.0, 0.5, 0.0],
          [0.0, 1.0, 1.0, 1.5, 1.0, 0.5, 0.0],
          [0.5, 1.0, 1.5, 2.0, 1.5, 1.0, 0.5],
          [0.0, 1.0, 1.0, 1.5, 1.0, 0.5, 0.0],
          [0.0, 0.5, 1.0, 1.0, 1.0, 0.5, 0.0]]
I know it won't really be a Gaussian, I'm just trying to give a visualization of what I would like. I was thinking of sorting the 2D array into a list from max to min and then using that to create a new 2D array, but I'm not sure how to distribute the values down to fill the matrix the way I want.
Thank you very much!
If anyone looks at this in the future and needs help, here is some advice on how to do this effectively for a lot of data. The code is posted below.
import itertools
import numpy as np

def datasort(inputarray, spot_in_x, spot_in_y):
    # Get the data ready
    center_of_y = spot_in_y
    center_of_x = spot_in_x
    M = len(inputarray[0])
    N = len(inputarray)
    l_list = list(itertools.chain(*inputarray))  # flattened data
    l_sorted = sorted(l_list, reverse=True)      # flattened data sorted from max to min
    # Reorder
    to_reorder = list(np.arange(0, len(l_sorted), 1))
    x = np.linspace(-1, 1, M)
    y = np.linspace(-1, 1, N)
    centerx = int(M / 2 - center_of_x) * 0.01
    centery = int(N / 2 - center_of_y) * 0.01
    [X, Y] = np.meshgrid(x, y)
    R = np.sqrt((X + centerx)**2 + (Y + centery)**2)  # distance from the (possibly offset) centre
    R_list = list(itertools.chain(*R))
    values = zip(R_list, to_reorder)
    sortedvalues = sorted(values)
    unzip = list(zip(*sortedvalues))
    unzip2 = unzip[1]
    l_reorder = zip(unzip2, l_sorted)
    l_reorder = sorted(l_reorder)
    l_unzip = list(zip(*l_reorder))
    l_unzip2 = l_unzip[1]
    sorted_list = np.reshape(l_unzip2, (N, M))
    return sorted_list
This code basically flattens your data and reorders it into a sorted list. It then zips that together with a list of indices sorted by a circular distance function. Using the zip and sort commands, you can create the distribution of data you wish to have based on your distance function; in my case it's a circle whose centre can be offset.
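A possible usage sketch (the array size and centre coordinates are made up for illustration):
# Hypothetical example: redistribute a random 100x120 array around a chosen centre
data = np.random.random((100, 120))
redistributed = datasort(data, spot_in_x=60, spot_in_y=50)
print(redistributed.shape)  # (100, 120); the largest values now cluster around the centre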

Peak detection in unevenly spaced timeseries

I'm working with a dataset containing measurements paired with a datetime, like:
datetime value
2017-01-01 00:01:00,32.7
2017-01-01 00:03:00,37.8
2017-01-01 00:04:05,35.0
2017-01-01 00:05:37,101.1
2017-01-01 00:07:00,39.1
2017-01-01 00:09:00,38.9
I'm trying to detect and remove potential peaks that might appear, like the 2017-01-01 00:05:37,101.1 measurement.
Some things that I have found so far:
The time spacing ranges from 15 seconds all the way up to 25 minutes, making the series very uneven;
The width of the peaks cannot be determined beforehand;
The height of the peaks clearly and significantly deviates from the other values;
Normalization of the time step should only happen after the removal of the outliers, since they would interfere with the results;
It's "impossible" to make the series evenly spaced because of other anomalies (e.g. negative values, flat lines), and even without them the peaks would still produce wrong values;
find_peaks expects an evenly spaced timeseries, so the solution from that earlier question didn't work for the irregular timeseries we have;
In that question I forgot to mention the critical point that the timeseries is unevenly spaced.
I've searched everywhere and I couldn't find anything. The implementation is going to be in Python but I'm willing to dig around other languages to get the logic.
I've posted this code on GitHub for anyone who runs into this problem, or a similar one, in the future.
After a lot of trial and error I think I created something that works. Using what #user58697 told me, I managed to create code that detects every peak that exceeds a threshold.
Using the logic they explained, if (flow[i+1] - flow[i]) / (time[i+1] - time[i]) > threshold, I wrote the following code:
I started by reading the .csv and parsing the dates, then split the data into two numpy arrays:
dataset = pd.read_csv('https://raw.githubusercontent.com/MigasTigas/peak_removal/master/dataset_simple_example.csv', parse_dates=['date'])
dataset = dataset.sort_values(by=['date']).reset_index(drop=True).to_numpy() # Sort and convert to numpy array
# Split into 2 arrays
values = [float(i[1]) for i in dataset] # Flow values, in float
values = np.array(values)
dates = [i[0].to_pydatetime() for i in dataset]
dates = np.array(dates)
Then I applied (flow[i+1] - flow[i]) / (time[i+1] - time[i]) to the whole dataset:
flow = np.diff(values)
time = np.diff(dates).tolist()
time = np.divide(time, np.power(10, 9))
slopes = np.divide(flow, time) # (flow[i+1] - flow[i]) / (time[i+1] - time[i])
slopes = np.insert(slopes, 0, 0, axis=0)  # The diff drops the first index, so insert a 0 at the front to keep the alignment
And finally to detect the peaks we reduced the data to rolling windows of x seconds each. That way we can detect them easily:
# ROLLING WINDOW
import datetime  # needed for timedelta below

size = len(dataset)
rolling_window = []
rolling_window_indexes = []
RW = []
RWi = []
window_size = 240  # Seconds
# Note: dates was already built above; re-extracting it from dataset['date'] here
# would fail, since dataset has been converted to a numpy array.

# Create the rolling windows
for line in range(size):
    limit_stamp = dates[line] + datetime.timedelta(seconds=window_size)
    for subline in range(line, size, 1):
        if dates[subline] <= limit_stamp:
            rolling_window.append(slopes[subline])    # Values of the slopes
            rolling_window_indexes.append(subline)    # Indexes of the respective values
        else:
            RW.append(rolling_window)
            if line != size:  # To prevent clearing the last rolling window
                rolling_window = []
            RWi.append(rolling_window_indexes)
            if line != size:
                rolling_window_indexes = []
            break
    else:
        # To get the last rolling window, since the inner loop ends without breaking
        RW.append(rolling_window)
        RWi.append(rolling_window_indexes)
After getting all rolling windows we start the fun:
t = 0.3  # Threshold
peaks = []

for index, rollWin in enumerate(RW):
    if rollWin[0] > t:                             # If the first value is greater than the threshold
        top = rollWin[0]                           # Set it as a possible peak
        bottom = np.min(rollWin)                   # Find the minimum of the window
        if bottom < -t:                            # If it is below the negative threshold
            bottomIndex = int(np.argmin(rollWin))  # Find its index
            # Append all points from the start of the rolling window up to bottomIndex
            for peak in range(0, bottomIndex, 1):
                peaks.append(RWi[index][peak])
The idea behind this code is that every peak has a rise and a fall; if both exceed the stated threshold (in absolute value), it is treated as an outlier peak, along with all points between them.
(Plots of the detected peaks on the example dataset and on the real dataset, posted on GitHub, were shown here.)
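If the goal is then to drop those points, a possible cleanup step (a sketch reusing the values, dates, and peaks arrays from above) would be:
# Remove the detected peak indices from both arrays; deduplicate first,
# since overlapping windows can flag the same index more than once
peaks = sorted(set(peaks))
values_clean = np.delete(values, peaks)
dates_clean = np.delete(dates, peaks)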

How to efficiently index a numpy array based on varying start and stop indexes per row

I have a 2D numpy array with rows being time series of a feature, based on which I'm training a neural network. For generalisation purposes, I would like to subset these time series at random points. I'd like them to have a minimum subset length as well. However, the network requires fixed length time series, so I need to pre-pad the resulting subsets with zeroes.
Currently, I'm doing it using the code below, which includes a nasty for-loop, because I don't know how I can use fancy indexing for this particular problem. As this piece of code is part of the network's data generator, it needs to be fast to keep pace with the data-hungry GPU. Does anyone know a numpy way of doing this without the for-loop?
import numpy as np
import matplotlib.pyplot as plt
# Amount of time series to consider
batchsize = 25
# Original length of the time series
timesteps = 150
# As an example, fill the 2D array with sine function time series
sinefunction = np.expand_dims(np.sin(np.arange(timesteps)), axis=0)
originalarray = np.repeat(sinefunction, batchsize, axis=0)
# Now the real thing, we want:
# - to start the time series at a random moment (between 0 and maxstart)
# - to end the time series at a random moment
# - however with a minimum length of the resulting subset time series (minlength)
maxstart = 50
minlength = 75
# get random starts
randomstarts = np.random.choice(np.arange(0, maxstart), size=batchsize)
# get random stops
randomstops = np.random.choice(np.arange(maxstart + minlength, timesteps), size=batchsize)
# determine the resulting random sizes of the subset time series
randomsizes = randomstops - randomstarts
# finally create a new 2D array with all the randomly subset time series, however pre-padded with zeros
# THIS IS THE FOR LOOP WE SHOULD TRY TO AVOID
cutarray = np.zeros_like(originalarray)
for i in range(batchsize):
    cutarray[i, -randomsizes[i]:] = originalarray[i, randomstarts[i]:randomstops[i]]
To show what goes in and out of the function:
# Show that it worked
f, ax = plt.subplots(2, 1)
ax[0].imshow(originalarray)
ax[0].set_title('original array')
ax[1].imshow(cutarray)
ax[1].set_title('zero-padded subset array')
Approach #1 : Views-based
We can leverage scikit-image's view_as_windows (built on np.lib.stride_tricks.as_strided) to get sliding windowed views into a zero-padded version of the input and assign into a zero-padded version of the output. All that padding is needed for a vectorized solution on account of the ragged nature of the subsets. The upside is that working on views is efficient in terms of memory and performance.
The implementation would look something like this -
from skimage.util.shape import view_as_windows

n = randomsizes.max()
max_extent = randomstarts.max() + n
padlen = max_extent - originalarray.shape[1]
p = np.zeros((originalarray.shape[0], padlen), dtype=originalarray.dtype)
a = np.hstack((originalarray, p))
w = view_as_windows(a, (1, n))[..., 0, :]
out_vals = w[np.arange(len(randomstarts)), randomstarts]
out_starts = originalarray.shape[1] - randomsizes
out_extensions_max = out_starts.max() + n
out = np.zeros((originalarray.shape[0], out_extensions_max), dtype=originalarray.dtype)
w2 = view_as_windows(out, (1, n))[..., 0, :]
w2[np.arange(len(out_starts)), out_starts] = out_vals
cutarray_out = out[:, :originalarray.shape[1]]
Approach #2 : With masking
cutarray_out = np.zeros_like(originalarray)
r = np.arange(originalarray.shape[1])
m = (randomstarts[:, None] <= r) & (randomstops[:, None] > r)
s = originalarray.shape[1] - randomsizes
m2 = s[:, None] <= r
cutarray_out[m2] = originalarray[m]
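Either approach can be sanity-checked against the original loop; a sketch reusing the names from the question:
# Rebuild the loop-based reference result and compare
reference = np.zeros_like(originalarray)
for i in range(batchsize):
    reference[i, -randomsizes[i]:] = originalarray[i, randomstarts[i]:randomstops[i]]

print(np.allclose(cutarray_out, reference))  # expected: True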

How can I remove sharp jumps in data?

I have some skin temperature data (collected at 1Hz) which I intend to analyse.
However, the sensors were not always in contact with the skin. So I have a challenge of removing this non-skin temperature data, whilst preserving the actual skin temperature data. I have about 100 files to analyse, so I need to make this automated.
I'm aware that there is already this similar post, however I've not been able to use that to solve my problem.
My data roughly looks like this:
df =
timeStamp Temp
2018-05-04 10:08:00 28.63
. .
. .
2018-05-04 21:00:00 31.63
The first step I've taken is to simply apply a minimum threshold- this has got rid of the majority of the non-skin data. However, I'm left with the sharp jumps where the sensor was either removed or attached:
To remove these jumps, I was thinking about taking an approach where I use the first order differential of the temp and then use another set of thresholds to get rid of the data I'm not interested in.
e.g.
df_diff = df.diff(60) # period of about 60 makes jumps stick out
filter_index = np.nonzero((df_diff.Temp < -1) | (df_diff.Temp > 0.5))  # where the diff is less than -1 or greater than 0.5, most likely a data jump
However, I find myself stuck here. The main problem is that:
1) I don't know how to use this index list to delete the non-skin data in df. What is the best way to do this?
The more minor problem is that:
2) I think I will still be left with some residual artefacts from the data jumps near the edges (e.g. where a tighter threshold would start to throw away good data). Is there either a better filtering strategy or a way to get rid of these artefacts afterwards?
Edit: As suggested, I've also calculated the second-order diff, but to be honest, I think the first-order diff would allow for tighter thresholds (see below).
Edit 2: Link to sample data
Try the code below (I used a tangent function to generate data). I used the second order difference idea from Mad Physicist in the comments.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame()
df[0] = np.arange(0,10,0.005)
df[1] = np.tan(df[0])
#the following line calculates the absolute value of a second order finite
#difference (derivative)
df[2] = 0.5*(df[1].diff()+df[1].diff(periods=-1)).abs()
df.loc[df[2] < .05][1].plot()  # keep only the low rate-of-change regions (high-derivative points filtered out)
df[1].plot() #plot original data
plt.show()
Following is a zoom of the output showing what got filtered. Matplotlib plots a line from beginning to end of the removed data.
Your first question I believe is answered with the .loc selection above.
Your second question will take some experimentation with your dataset. The code above only filters out high-derivative data. You'll also need a threshold selection to remove zeroes or the like. You can experiment with where to make the derivative cut. You can also plot a histogram of the derivative to give you a hint as to what to select out.
Also, higher order difference equations are possible to help with smoothing. This should help remove artifacts without having to trim around the cuts.
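As a small illustration of the histogram suggestion above (a sketch using the df built in the snippet above):
# Histogram of the second-order difference column to help pick a cutoff
df[2].hist(bins=100)
plt.show()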
Edit:
A fourth-order finite difference can be applied using this:
df[2] = (df[1].diff(periods=1)-df[1].diff(periods=-1))*8/12 - \
(df[1].diff(periods=2)-df[1].diff(periods=-2))*1/12
df[2] = df[2].abs()
It's reasonable to think that it may help. The coefficients above can be worked out or derived from the following link for higher orders.
Finite Difference Coefficients Calculator
Note: The above second and fourth order central difference equations are not proper first derivatives. One must divide by the interval length (in this case 0.005) to get the actual derivative.
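To illustrate that note (a sketch, using the 0.005 spacing from the generated data above and writing the result into a new, hypothetical column df[3]):
h = 0.005  # the sample spacing used when df[0] was built above
# Fourth-order central estimate of the first derivative:
# divide the finite-difference combination by the interval length h
df[3] = ((df[1].diff(periods=1) - df[1].diff(periods=-1)) * 8 / 12
         - (df[1].diff(periods=2) - df[1].diff(periods=-2)) * 1 / 12) / h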
Here's a suggestion that targets your issues regarding
[...]an approach where I use the first order differential of the temp and then use another set of thresholds to get rid of the data I'm not interested in.
[..]I don't know how to now use this index list to delete the non-skin data in df. How is best to do this?
using stats.zscore() and pandas.merge()
As it is, it will still have a minor issue with your concerns regarding
[...]left with some residual artefacts from the data jumps near the edges[...]
But we'll get to that later.
First, here's a snippet to produce a dataframe that shares some of the challenges with your dataset:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(22)
# A function for noisy data with a trend element
def sample():
    base = 100
    nsample = 50
    sigma = 10

    # Basic df with trend and sinus seasonality
    trend1 = np.linspace(0, 1, nsample)
    y1 = np.sin(trend1)
    dates = pd.date_range('2016-01-01', periods=nsample).tolist()  # pd.datetime is deprecated, so use a plain date string
    df = pd.DataFrame({'dates': dates, 'trend1': trend1, 'y1': y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)

    # Gaussian noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + (np.sin(trend1))
    df['trend2'] = 1 / (np.cos(trend1) / 1.05)
    df['y4'] = df['y3'] * df['trend2']
    df = df['y4'].to_frame()
    df.columns = ['Temp']
    df['Temp'][20:31] = np.nan

    # Insert spikes and missing values
    df['Temp'][19] = df['Temp'][39] / 4000
    df['Temp'][31] = df['Temp'][15] / 4000

    return df
# Dataframe with random data
df_raw = sample()
df_raw.plot()
As you can see, there are two distinct spikes with missing numbers between them. And it's really the missing numbers that cause the problems here if you prefer to isolate values where the differences are large. The first spike is not a problem, since you'll find the difference between a very small number and a number that is more similar to the rest of the data.
For the second spike, though, you're going to get the difference between a very small number and a missing number, so the extreme data point you'll end up removing is the difference between your outlier and the next observation.
This is not a huge problem for one single observation. You could just fill it right back in there. But for larger data sets that would not be a very viable solution. Anyway, if you can manage without that particular value, the code below should solve your problem. You will also have a similar problem with your very first observation, but I think it would be far more trivial to decide whether or not to keep that one value.
The steps:
# 1. Get some info about the original data:
firstVal = df_raw[:1]
colName = df_raw.columns
# 2. Take the first difference
df_diff = df_raw.diff()
# 3. Remove missing values
df_clean = df_diff.dropna()
# 4. Select a level for a Z-score to identify and remove outliers
level = 3
df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
ix_keep = df_Z.index
# 5. Subset the raw dataframe with the indexes you'd like to keep
df_keep = df_raw.loc[ix_keep]
# 6.
# df_keep will be missing some indexes.
# Do the following if you'd like to keep those indexes
# and, for example, fill missing values with the previous values
df_out = pd.merge(df_keep, df_raw, how='outer', left_index=True, right_index=True)
# 7. Keep only the first column
df_out = df_out.iloc[:, 0].to_frame()  # .ix is gone from pandas, so use .iloc
# 8. Fill missing values
df_complete = df_out.fillna(axis=0, method='ffill')
# 9. Replace first value
df_complete.iloc[0] = firstVal.iloc[0]
# 10. Reset column names
df_complete.columns = colName
# Result
df_complete.plot()
Here's the whole thing for an easy copy-paste:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(22)
# A function for noisy data with a trend element
def sample():
    base = 100
    nsample = 50
    sigma = 10

    # Basic df with trend and sinus seasonality
    trend1 = np.linspace(0, 1, nsample)
    y1 = np.sin(trend1)
    dates = pd.date_range('2016-01-01', periods=nsample).tolist()
    df = pd.DataFrame({'dates': dates, 'trend1': trend1, 'y1': y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)

    # Gaussian noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + (np.sin(trend1))
    df['trend2'] = 1 / (np.cos(trend1) / 1.05)
    df['y4'] = df['y3'] * df['trend2']
    df = df['y4'].to_frame()
    df.columns = ['Temp']
    df['Temp'][20:31] = np.nan

    # Insert spikes and missing values
    df['Temp'][19] = df['Temp'][39] / 4000
    df['Temp'][31] = df['Temp'][15] / 4000

    return df

# A function for removing outliers
def noSpikes(df, level, keepFirst):
    # 1. Get some info about the original data:
    firstVal = df[:1]
    colName = df.columns

    # 2. Take the first difference
    df_diff = df.diff()

    # 3. Remove missing values
    df_clean = df_diff.dropna()

    # 4. Use a Z-score at the chosen level to identify and remove outliers
    df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
    ix_keep = df_Z.index

    # 5. Subset the raw dataframe with the indexes you'd like to keep
    df_keep = df.loc[ix_keep]

    # 6. df_keep will be missing some indexes.
    #    Do the following if you'd like to keep those indexes
    #    and, for example, fill missing values with the previous values
    df_out = pd.merge(df_keep, df, how='outer', left_index=True, right_index=True)

    # 7. Keep only the first column (.ix is gone from pandas, so use .iloc)
    df_out = df_out.iloc[:, 0].to_frame()

    # 8. Fill missing values
    df_complete = df_out.fillna(axis=0, method='ffill')

    # 9. Reset column names
    df_complete.columns = colName

    # Keep the first value
    if keepFirst:
        df_complete.iloc[0] = firstVal.iloc[0]

    return df_complete
# Dataframe with random data
df_raw = sample()
df_raw.plot()
# Remove outliers
df_cleaned = noSpikes(df=df_raw, level = 3, keepFirst = True)
df_cleaned.plot()
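As a follow-up to the experimentation point, the level argument can be tightened; a hypothetical stricter pass might look like this:
# A stricter Z-score level removes more points
df_strict = noSpikes(df=df_raw, level=2, keepFirst=True)
df_strict.plot()
plt.show()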

Rolling PCA on pandas dataframe

I'm wondering if anyone knows of how to implement a rolling/moving window PCA on a pandas dataframe. I've looked around and found implementations in R and MATLAB but not Python. Any help would be appreciated!
This is not a duplicate - moving window PCA is not the same as PCA on the entire dataframe. Please see pandas.DataFrame.rolling() if you do not understand the difference
Unfortunately, pandas.DataFrame.rolling() seems to flatten the df before rolling, so it cannot be used as one might expect to roll over the rows of the df and pass windows of rows to the PCA.
The following is a work-around for this based on rolling over indices instead of rows. It may not be very elegant but it works:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Generate some data (1000 time points, 10 features)
data = np.random.random(size=(1000, 10))
df = pd.DataFrame(data)

# Set the window size
window = 100

# Initialize an empty df of appropriate size for the output
df_pca = pd.DataFrame(np.zeros((data.shape[0] - window + 1, data.shape[1])))

# Define PCA fit-transform function
# Note: Instead of attempting to return the result,
# it is written into the previously created output dataframe.
def rolling_pca(window_data):
    # window_data is a float array of row indices, so cast it to int
    idx = np.asarray(window_data, dtype=int)
    pca = PCA()
    transf = pca.fit_transform(df.iloc[idx])
    df_pca.iloc[idx[0]] = transf[0, :]
    return True

# Create a df containing row indices for the workaround
df_idx = pd.DataFrame(np.arange(df.shape[0]))

# Use `rolling` to apply the PCA function
# raw=True passes plain ndarrays of indices to the function
_ = df_idx.rolling(window).apply(rolling_pca, raw=True)

# The results are now contained here:
print(df_pca)
A quick check reveals that the values produced by this are identical to control values computed by slicing appropriate windows manually and running PCA on them.
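For instance, a spot check along those lines might look like this (a sketch; the start index is arbitrary and reuses df, window, and df_pca from above):
# Compare one rolling-PCA row against a manually sliced window
start = 123
manual = PCA().fit_transform(df.iloc[start:start + window])
print(np.allclose(df_pca.iloc[start].values, manual[0, :]))  # expected: True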
