Identifying extremes in pandas timeseries

Identifying extremes in pandas timeseries - python

I'm looking for a way to identify the local extremes in a pandas timeseries.
A MWE would be
import math
import matplotlib.pyplot as plt
import pandas as pd
sin_list = []
for i in range(200):
sin_list.append(math.sin(i / 10) + i / 100)
idx = pd.date_range('2018-01-01', periods=200, freq='H')
ts = pd.Series(sin_list, index=idx)
ts.plot(style='.')
plt.show()
and the red lines would mark the timestamps which I want to identify. Note that there are, of course, finite steps in this series.
A possible solution could be to fit a curve to it, derive it and then identify the exact place where the gradient is 0. This does seem like a big effort to program myself, and I assume such an implementation exists somewhere.

I developed a solution for the problem based on the .diff() module. The crucial attribute here is the p factor of the get_percentile function. Since the finite number of values means that the gradient will not attain value 0, the solution space must be blurred a bit. This means, the less values there are, the higher the p factor must be. In my solution 0.05 proved to be sufficient to identify the extremes, but small enough to locate the extremes with reasonable accuracy.
This is the code:
import copy
import math
import matplotlib.pyplot as plt
import pandas as pd
def get_percentile(data: list, p: float):
_data = copy.copy(data)
_data.sort()
result = _data[math.floor(len(_data) * p) - 1]
return result
sin_list = []
for i in range(200):
sin_list.append(math.sin(i / 10) + i / 100)
idx = pd.date_range('2018-01-01', periods=200, freq='H')
ts = pd.Series(sin_list, index=idx)
gradient_ts = abs(ts.diff())
percentile = get_percentile(gradient_ts.values, p=0.05)
binary_ts = gradient_ts.where(gradient_ts > percentile, 1).where(gradient_ts < percentile, 0)
fig, ax = plt.subplots()
binary_ts.plot(drawstyle="steps", ax=ax)
ax.fill_between(binary_ts.index, binary_ts, facecolor='green', alpha=0.5, step='pre')
ts.plot(secondary_y=True, style='.')
plt.show()

Related

Fast Fourier Transform in Python

I am new to the fourier theory and I've seen very good tutorials on how to apply fft to a signal and plot it in order to see the frequencies it contains. Somehow, all of them create a mix of sines as their data and i am having trouble adapting it to my real problem.
I have 242 hourly observations with a daily periodicity, meaning that my period is 24. So I expect to have a peak around 24 on my fft plot.
A sample of my data.csv is here:
https://pastebin.com/1srKFpJQ
Data plotted:
My code:
data = pd.read_csv('data.csv',index_col=0)
data.index = pd.to_datetime(data.index)
data = data['max_open_files'].astype(float).values
N = data.shape[0] #number of elements
t = np.linspace(0, N * 3600, N) #converting hours to seconds
s = data
fft = np.fft.fft(s)
T = t[1] - t[0]
f = np.linspace(0, 1 / T, N)
plt.ylabel("Amplitude")
plt.xlabel("Frequency [Hz]")
plt.bar(f[:N // 2], np.abs(fft)[:N // 2] * 1 / N, width=1.5) # 1 / N is a normalization factor
plt.show()
This outputs a very weird result where it seems I am getting the same value for every frequency.
I suppose that the problems comes with the definition of N, t and T but I cannot find anything online that has helped me understand this clearly. Please help :)
EDIT1:
With the code provided by charles answer I have a spike around 0 that seems very weird. I have used rfft and rfftfreq instead to avoid having too much frequencies.
I have read that this might be because of the DC component of the series, so after substracting the mean i get:
I am having trouble interpreting this, the spikes seem to happen periodically but the values in Hz don't let me obtain my 24 value (the overall frequency). Anybody knows how to interpret this ? What am I missing ?

The problem you're seeing is because the bars are too wide, and you're only seeing one bar. You will have to change the width of the bars to 0.00001 or smaller to see them show up.
Instead of using a bar chart, make your x axis using fftfreq = np.fft.fftfreq(len(s)) and then use the plot function, plt.plot(fftfreq, fft):
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = pd.read_csv('data.csv',index_col=0)
data.index = pd.to_datetime(data.index)
data = data['max_open_files'].astype(float).values
N = data.shape[0] #number of elements
t = np.linspace(0, N * 3600, N) #converting hours to seconds
s = data
fft = np.fft.fft(s)
fftfreq = np.fft.fftfreq(len(s))
T = t[1] - t[0]
f = np.linspace(0, 1 / T, N)
plt.ylabel("Amplitude")
plt.xlabel("Frequency [Hz]")
plt.plot(fftfreq,fft)
plt.show()

Plancks Law, Frequency figures

I want to plot the frequency version of planck's law. I first tried to do this independently:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
# Planck's Law
# Constants
h = 6.62607015*(10**-34) # J*s
c = 299792458 # m * s
k = 1.38064852*(10**-23) # J/K
T = 20 # K
frequency_range = np.linspace(10**-19,10**19,1000000)
def plancks_law(nu):
a = (2*h*nu**3) / (c**2)
e_term = np.exp(h*nu/(k*T))
brightness = a /(e_term - 1)
return brightness
plt.plot(frequency_range,plancks_law(frequency_range))
plt.gca().set_xlim([1*10**-16 ,1*10**16 ])
plt.gca().invert_xaxis()
This did not work, I have an issue with scaling somehow. My next idea was to attempt to use this person's code from this question: Plancks Formula for Blackbody spectrum
import matplotlib.pyplot as plt
import numpy as np
h = 6.626e-34
c = 3.0e+8
k = 1.38e-23
def planck_f(freq, T):
a = 2.0*h*(freq**3)
b = h*freq/(k*T)
intensity = a/( (c**2 * (np.exp(b) - 1.0) ))
return intensity
# generate x-axis in increments from 1nm to 3 micrometer in 1 nm increments
# starting at 1 nm to avoid wav = 0, which would result in division by zero.
wavelengths = np.arange(1e-9, 3e-6, 1e-9)
frequencies = np.arange(3e14, 3e17, 1e14, dtype=np.float64)
intensity4000 = planck_f(frequencies, 4000.)
plt.gca().invert_xaxis()
This didn't work, because I got a divide by zero error. Except that I don't see where there is a division by zero, the denominator shouldn't ever be zero since the exponential term shouldn't ever be equal to one. I chose the frequencies to be the conversions of the wavelength values from the example code.
Can anyone help fix the problem or explain how I can get planck's law for frequency instead of wavelength?

You can not safely handle such large numbers; even for comparably "small" values of b = h*freq/(k*T) your float64 will overflow, e.g np.exp(709.)=8.218407461554972e+307 is ok, but np.exp(710.)=inf. You'll have to adjust your units (exponents) accordingly to avoid this!
Note that this is also the case in the other question you linked to, if you insert print( np.exp(b)[:10] ) within the definition of planck(), you can examine the first ten evaluated b's and you'll see the overflow in the first few occurrences. In any case, simply use the answer posted within the other question, but convert the x-axis in plt.plot(wavelengths, intensity) to frequency (i hope you know how to get from one to the other) :-)

Is there any solution for better fit beta prime distribution to data than using Scipy?

I was trying to fit beta prime distribution to my data using python. As there's scipy.stats.betaprime.fit, I tried this:
import numpy as np
import math
import scipy.stats as sts
import matplotlib.pyplot as plt
N = 5000
nb_bin = 100
a = 12; b = 106; scale = 36; loc = -a/(b-1)*scale
y = sts.betaprime.rvs(a,b,loc,scale,N)
a_hat,b_hat,loc_hat,scale_hat = sts.betaprime.fit(y)
print('Estimated parameters: \n a=%.2f, b=%.2f, loc=%.2f, scale=%.2f'%(a_hat,b_hat,loc_hat,scale_hat))
plt.figure()
count, bins, ignored = plt.hist(y, nb_bin, normed=True)
pdf_ini = sts.betaprime.pdf(bins,a,b,loc,scale)
pdf_est = sts.betaprime.pdf(bins,a_hat,b_hat,loc_hat,scale_hat)
plt.plot(bins,pdf_ini,'g',linewidth=2.0,label='ini');plt.grid()
plt.plot(bins,pdf_est,'y',linewidth=2.0,label='est');plt.legend();plt.show()
It shows me the result that:
Estimated parameters:
a=9935.34, b=10846.64, loc=-90.63, scale=98.93
which is quite different from the original one and the figure from the PDF:
If I give the real value of loc and scale as the input of fit function, the estimation result would be better. Has anyone worked on this part already or got a better solution?

What's the difference between pandas ACF and statsmodel ACF?

I'm calculating the Autocorrelation Function for a stock's returns. To do so I tested two functions, the autocorr function built into Pandas, and the acf function supplied by statsmodels.tsa. This is done in the following MWE:
import pandas as pd
from pandas_datareader import data
import matplotlib.pyplot as plt
import datetime
from dateutil.relativedelta import relativedelta
from statsmodels.tsa.stattools import acf, pacf
ticker = 'AAPL'
time_ago = datetime.datetime.today().date() - relativedelta(months = 6)
ticker_data = data.get_data_yahoo(ticker, time_ago)['Adj Close'].pct_change().dropna()
ticker_data_len = len(ticker_data)
ticker_data_acf_1 = acf(ticker_data)[1:32]
ticker_data_acf_2 = [ticker_data.autocorr(i) for i in range(1,32)]
test_df = pd.DataFrame([ticker_data_acf_1, ticker_data_acf_2]).T
test_df.columns = ['Pandas Autocorr', 'Statsmodels Autocorr']
test_df.index += 1
test_df.plot(kind='bar')
What I noticed was the values they predicted weren't identical:
What accounts for this difference, and which values should be used?

The difference between the Pandas and Statsmodels version lie in the mean subtraction and normalization / variance division:
autocorr does nothing more than passing subseries of the original series to np.corrcoef. Inside this method, the sample mean and sample variance of these subseries are used to determine the correlation coefficient
acf, in contrary, uses the overall series sample mean and sample variance to determine the correlation coefficient.
The differences may get smaller for longer time series but are quite big for short ones.
Compared to Matlab, the Pandas autocorr function probably corresponds to doing Matlabs xcorr (cross-corr) with the (lagged) series itself, instead of Matlab's autocorr, which calculates the sample autocorrelation (guessing from the docs; I cannot validate this because I have no access to Matlab).
See this MWE for clarification:
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf
import matplotlib.pyplot as plt
plt.style.use("seaborn-colorblind")
def autocorr_by_hand(x, lag):
# Slice the relevant subseries based on the lag
y1 = x[:(len(x)-lag)]
y2 = x[lag:]
# Subtract the subseries means
sum_product = np.sum((y1-np.mean(y1))*(y2-np.mean(y2)))
# Normalize with the subseries stds
return sum_product / ((len(x) - lag) * np.std(y1) * np.std(y2))
def acf_by_hand(x, lag):
# Slice the relevant subseries based on the lag
y1 = x[:(len(x)-lag)]
y2 = x[lag:]
# Subtract the mean of the whole series x to calculate Cov
sum_product = np.sum((y1-np.mean(x))*(y2-np.mean(x)))
# Normalize with var of whole series
return sum_product / ((len(x) - lag) * np.var(x))
x = np.linspace(0,100,101)
results = {}
nlags=10
results["acf_by_hand"] = [acf_by_hand(x, lag) for lag in range(nlags)]
results["autocorr_by_hand"] = [autocorr_by_hand(x, lag) for lag in range(nlags)]
results["autocorr"] = [pd.Series(x).autocorr(lag) for lag in range(nlags)]
results["acf"] = acf(x, unbiased=True, nlags=nlags-1)
pd.DataFrame(results).plot(kind="bar", figsize=(10,5), grid=True)
plt.xlabel("lag")
plt.ylim([-1.2, 1.2])
plt.ylabel("value")
plt.show()
Statsmodels uses np.correlate to optimize this, but this is basically how it works.

As suggested in comments, the problem can be decreased, but not completely resolved, by supplying unbiased=True to the statsmodels function. Using a random input:
import statistics
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf
DATA_LEN = 100
N_TESTS = 100
N_LAGS = 32
def test(unbiased):
data = pd.Series(np.random.random(DATA_LEN))
data_acf_1 = acf(data, unbiased=unbiased, nlags=N_LAGS)
data_acf_2 = [data.autocorr(i) for i in range(N_LAGS+1)]
# return difference between results
return sum(abs(data_acf_1 - data_acf_2))
for value in (False, True):
diffs = [test(value) for _ in range(N_TESTS)]
print(value, statistics.mean(diffs))
Output:
False 0.464562410987
True 0.0820847168593

In the following example, Pandas autocorr() function gives the expected results but statmodels acf() function does not.
Consider the following series:
import pandas as pd
s = pd.Series(range(10))
We expect that there is perfect correlation between this series and any of its lagged series, and this is actually what we get with autocorr() function
[ s.autocorr(lag=i) for i in range(10) ]
# [0.9999999999999999, 1.0, 1.0, 1.0, 1.0, 0.9999999999999999, 1.0, 1.0, 0.9999999999999999, nan]
But using acf() we get a different result:
from statsmodels.tsa.stattools import acf
acf(s)
# [ 1. 0.7 0.41212121 0.14848485 -0.07878788
# -0.25757576 -0.37575758 -0.42121212 -0.38181818 -0.24545455]
If we try acf with adjusted=True the result is even more unexpected because for some lags the result is less than -1 (note that correlation has to be in [-1, 1])
acf(s, adjusted=True) # 'unbiased' is deprecated and 'adjusted' should be used instead
# [ 1. 0.77777778 0.51515152 0.21212121 -0.13131313
# -0.51515152 -0.93939394 -1.4040404 -1.90909091 -2.45454545]

binning data in python with scipy/numpy

is there a more efficient way to take an average of an array in prespecified bins? for example, i have an array of numbers and an array corresponding to bin start and end positions in that array, and I want to just take the mean in those bins? I have code that does it below but i am wondering how it can be cut down and improved. thanks.
from scipy import *
from numpy import *
def get_bin_mean(a, b_start, b_end):
ind_upper = nonzero(a >= b_start)[0]
a_upper = a[ind_upper]
a_range = a_upper[nonzero(a_upper < b_end)[0]]
mean_val = mean(a_range)
return mean_val
data = rand(100)
bins = linspace(0, 1, 10)
binned_data = []
n = 0
for n in range(0, len(bins)-1):
b_start = bins[n]
b_end = bins[n+1]
binned_data.append(get_bin_mean(data, b_start, b_end))
print binned_data

It's probably faster and easier to use numpy.digitize():
import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
An alternative to this is to use numpy.histogram():
bin_means = (numpy.histogram(data, bins, weights=data)[0] /
numpy.histogram(data, bins)[0])
Try for yourself which one is faster... :)

The Scipy (>=0.11) function scipy.stats.binned_statistic specifically addresses the above question.
For the same example as in the previous answers, the Scipy solution would be
import numpy as np
from scipy.stats import binned_statistic
data = np.random.rand(100)
bin_means = binned_statistic(data, data, bins=10, range=(0, 1))[0]

Not sure why this thread got necroed; but here is a 2014 approved answer, which should be far faster:
import numpy as np
data = np.random.rand(100)
bins = 10
slices = np.linspace(0, 100, bins+1, True).astype(np.int)
counts = np.diff(slices)
mean = np.add.reduceat(data, slices[:-1]) / counts
print mean

The numpy_indexed package (disclaimer: I am its author) contains functionality to efficiently perform operations of this type:
import numpy_indexed as npi
print(npi.group_by(np.digitize(data, bins)).mean(data))
This is essentially the same solution as the one I posted earlier; but now wrapped in a nice interface, with tests and all :)

I would add, and also to answer the question find mean bin values using histogram2d python that the scipy also have a function specially designed to compute a bidimensional binned statistic for one or more sets of data
import numpy as np
from scipy.stats import binned_statistic_2d
x = np.random.rand(100)
y = np.random.rand(100)
values = np.random.rand(100)
bin_means = binned_statistic_2d(x, y, values, bins=10).statistic
the function scipy.stats.binned_statistic_dd is a generalization of this funcion for higher dimensions datasets

Another alternative is to use the ufunc.at. This method applies in-place a desired operation at specified indices.
We can get the bin position for each datapoint using the searchsorted method.
Then we can use at to increment by 1 the position of histogram at the index given by bin_indexes, every time we encounter an index at bin_indexes.
np.random.seed(1)
data = np.random.random(100) * 100
bins = np.linspace(0, 100, 10)
histogram = np.zeros_like(bins)
bin_indexes = np.searchsorted(bins, data)
np.add.at(histogram, bin_indexes, 1)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Identifying extremes in pandas timeseries - python

Related

Fast Fourier Transform in Python

Plancks Law, Frequency figures

Is there any solution for better fit beta prime distribution to data than using Scipy?

What's the difference between pandas ACF and statsmodel ACF?

binning data in python with scipy/numpy

Categories

Resources