Hurst Exponent in python - python

from datetime import datetime
from pandas.io.data import DataReader
from numpy import cumsum, log, polyfit, sqrt, std, subtract
from numpy.random import randn
def hurst(ts):
"""Returns the Hurst Exponent of the time series vector ts"""
# Create the range of lag values
lags = range(2, 100)
# Calculate the array of the variances of the lagged differences
# Here it calculates the variances, but why it uses
# standard deviation and then make a root of it?
tau = [sqrt(std(subtract(ts[lag:], ts[:-lag]))) for lag in lags]
# Use a linear fit to estimate the Hurst Exponent
poly = polyfit(log(lags), log(tau), 1)
# Return the Hurst exponent from the polyfit output
return poly[0]*2.0
# Download the stock prices series from Yahoo
aapl = DataReader("AAPL", "yahoo", datetime(2012,1,1), datetime(2015,9,18))
# Call the function
hurst(aapl['Adj Close'])
From this code for estimating Hurst Exponent, when we want to calculate the variance of the lagged difference, why we still use a standard deviation and take a square root? I am confused for a long time, and I don't know why others don't have the same confuse. Do I misunderstand the math behind it? Thanks!

I'm just as confused. I don't understand where the sqrt of std comes from either, and have spent 3 days trying to figure it out. In the end I noticed QuantStart credits Dr Tom Starke who uses a slightly different code. Dr Tom Starke credits Dr Ernie Chan, and going to his blog. I was able to find enough information to put together my own code from his principles. This doesn't use sqrt, uses variance instead of std and uses a 2.0 divisor at the end instead of a 2.0 multiplier. In the end, it seems to give the same results as the quantstart code you post, but I am able to understand it from first principles, which I guess is important. I put together a Jupyter Notebook which makes it clearer, but I'm not sure if I can post that here, so I will try to explain as best I can here. Code is pasted first, then an explanation.
lags = range(2,100)
def hurst_ernie_chan(p):
variancetau = []; tau = []
for lag in lags:
# Write the different lags into a vector to compute a set of tau or lags
tau.append(lag)
# Compute the log returns on all days, then compute the variance on the difference in log returns
# call this pp or the price difference
pp = subtract(p[lag:], p[:-lag])
variancetau.append(var(pp))
# we now have a set of tau or lags and a corresponding set of variances.
#print tau
#print variancetau
# plot the log of those variance against the log of tau and get the slope
m = polyfit(log10(tau),log10(variancetau),1)
hurst = m[0] / 2
return hurst
Dr Chan doesn't give any code on this page (I believe he works in MATLAB not Python anyway). Hence I needed to put together my own code from the notes he gives in his blog and answers he gives to questions posed on his blog.
Dr Chan states that if z is the log price, then volatility, sampled at intervals of τ, is volatility(τ)=√(Var(z(t)-z(t-τ))). To me another way of describing volatility is standard deviation, so std(τ)=√(Var(z(t)-z(t-τ)))
std is just the root of variance so var(τ)=(Var(z(t)-z(t-τ)))
Dr Chan then states: In general, we can write Var(τ) ∝ τ^(2H) where H is the Hurst exponent
Hence (Var(z(t)-z(t-τ))) ∝ τ^(2H)
Taking the log of each side we get log (Var(z(t)-z(t-τ))) ∝ 2H log τ
[ log (Var(z(t)-z(t-τ))) / log τ ] / 2 ∝ H (gives the Hurst exponent) where we know the term in square brackets on far left is the slope of a log-log plot of tau and a corresponding set of variances.
If you run that function and compare the answers to the Quantstart function, they should be the same. Not sure if that helped.

All that is going on here is a variation on math notation
I'll define
d = subtract(ts[lag:], ts[:-lag])
Then it is clear that
np.log(np.std(d)**2) == np.log(np.var(d))
np.log(np.std(d)) == .5*np.log(np.var(d))
Then you have the equivalence
2*np.log(np.sqrt(np.std(d))) == .5*np.log(np.sqrt(np.var(d)))
The functional output of polyfit scales proportionally to its input

As per intuitive definition taken from Ernest Chan's "Algorithmic trading" (p.44):
Intuitively speaking, a “stationary” price series means that the prices diffuse
from its initial value more slowly than a geometric random walk would.
one would want to check variance of time series with increasing lags against lag(s). This is because for normal distribution -- and log prices are believed to be normal (to certain extent) -- variance of sum of normal distributions is a sum of constituents' variances.
As per Ernest Chan's citation, for mean reverting processes the realized variance will be less than theoretically projected.
Putting this in code:
def hurst(p, l):
"""
Arguments:
p: ndarray -- the price series to be tested
l: list of integers or an integer -- lag(s) to test for mean reversion
Returns:
Hurst exponent
"""
if isinstance(l, int):
lags = [1, l]
else:
lags = l
assert lags[-1] >=2, "Lag in prices must be greater or equal 2"
print(f"Price lags of {lags[1:]} are included")
lp = np.log(p)
var = [np.var(lp[l:] - lp[:-l]) for l in lags]
hr = linregress(np.log(lags), np.log(var))[0] / 2
return hr

The code posted by OP is correct.
The reason for the confusion is that it does a square-root first, and then counters it by multiplying the slope (returned by polyfit) with 2.
For a more detailed explanation, continue reading.
tau is calculated with an "extra" square-root. Then, its log is calculated. log(sqrt(x)) = log(x^0.5) = 0.5*log(x) (this is the key).
polyfit now conducts the fitting with y multiplied by an "extra 0.5". So, the result obtained is also multiplied by nearly 0.5. Returning twice of that (return poly[0]*2.0) counters the initial (seemingly) extra 0.5.
Hope this makes it clearer.

Related

Average of an uarray in python uncertainties

My problem:
I have an array of ufloats (e.g. an unarray) in pythons uncertainties package.
All values of the array got their own errors, and I need a funktion, that gives me the average of the array in respect to both, the error
I get when calculating the mean of the nominal values and the influence the values errors have.
I have an uarray:
2 +/- 1
3 +/- 2
4 +/- 3
and need a funktion, that gives me an average value of the array.
Thanks
Assuming Gaussian statistics, the uncertainties stem from Gaussian parent distributions. In such a case, it is standard to weight the measurements (nominal values) by the inverse variance. This application to the general weighted average gives,
$$ \frac{\sum_i w_i x_i}{\sum_i w_i} = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2} $$.
One need only perform good 'ol error propagation on this to get an uncertainty of the weighted average as,
$$ \sqrt{\sum_i \frac{1}{1/\sum_i \sigma_i^2}} $$
I don't have an n-length formula to do this syntactically speaking on hand, but here's how one could get the weighted average and its uncertainty in a simple case:
a = un.ufloat(5, 2)
b = un.ufloat(8, 4)
wavg = un.ufloat((a.n/a.s**2 + b.n/b.s**2)/(1/a.s**2 + 1/b.s**2),
np.sqrt(2/(1/a.s**2 + 1/b.s**2)))
print(wavg)
>>> 5.6+/-2.5298221281347035
As one would expect, the result tends more-so towards the value with the smaller uncertainty. This is good since a smaller uncertainty in a measurement implies that its associated nominal value is closer to the true value in the parent distribution than those with larger uncertainties.
Unless I'm missing something, you could calculate the sum divided by the length of the array:
from uncertainties import unumpy, ufloat
import numpy as np
arr = np.array([ufloat(2, 1), ufloat(3, 2), ufloat(4,3)])
print(sum(arr)/len(arr))
# 3.0+/-1.2
You can also define it like this:
arr1 = unumpy.uarray([2, 3, 4], [1, 2, 3])
print(sum(arr1)/len(arr1))
# 3.0+/-1.2
uncertainties takes care of the rest.
I used Captain Morgan's answer to serve up some sweet Python code for a project and discovered that it needed a little extra ingredient:
import uncertainties as un
from un.unumpy import unp
epsilon = unp.nominal_values(values).mean()/(1e12)
wavg = ufloat(sum([v.n/(v.s**2+epsilon) for v in values])/sum([1/(v.s**2+epsilon) for v in values]),
np.sqrt(len(values)/sum([1/(v.s**2+epsilon) for v in values])))
if wavg.s <= np.sqrt(epsilon):
wavg = ufloat(wavg.n, 0.0)
Without that little something (epsilon) we'd get div/0 errors from observations recorded with zero uncertainty.
If you already have a .csv file which stores variables in 'mean+/-sted' format, you could try the code below; it works for me.
from uncertainties import ufloat_fromstr
df=pd.read_csv('Z:\compare\SL2P_PAR.csv')
for i in range(len(df.uncertainty)):
df['mean'] = ufloat_fromstr(df['uncertainty'][I]).n
df['sted'] = ufloat_fromstr(df['uncertainty'][I]).s

Calculating thd in python

I'm trying to calculate the total harmonic distortion values of ac voltage supplied. I am sampling voltage data using Arduino at over 8 KHz rate and storing those data into a text file. Then I'm trying to calculate thd using the following code snippet written in python:
import numpy as np
import scipy.fftpack
from scipy.fftpack import fft
from numpy import genfromtxt
sampled_data = genfromtxt('/../file.txt',delimiter=',')
abs_yf=np.abs(fft(sampled_data))
#As far as I know, THD=sqrt(sum of square magnitude of
#harmonics+noise)/Fundamental value (Is it correct?)So I'm
#just summing up square of all frequency data obtained from FFT,
#sqrt() them and dividing them with fundamental frequecy value.
def thd(abs_data):
sq_sum=0.0
for r in range(len(abs_data)):
sq_sum=sq_sum+(abs_data[r])**2
sq_harmonics=sq_sum-(max(abs_data))**2.0
thd=100*sq_harmonics**0.5/max(abs_data)
return thd
print "Total Harmonic Distortion(in percent):"
print thd(abs_yf)
Problem is, The obtained Thd values vary within 5% to 25% in my case. (In reality it's not more than 5% actually). What am I doing wrong? Is there any other way to find out thd?
Though this is long quiet, for anyone encountering this post like me: There are a couple of problems with the OP method.
1) The magnitudes returned by FFT include a magnitude of the 0 frequency bin, so the assumption that max(abs_data) is the magnitude corresponding to the fundamental frequency is not correct if there is any DC bias in the signal. This is a problem in the line
thd = 100*sq_harmonics**0.5 / max(abs_data)
The amplitude associated with the 0 frequency can just be ignored as a quick solution.
2) The second half of the abs_data should be thrown out, it is a "mirrored" reflection of the first. This is due to the nature of the Fourier transform.
Both these issues can be addressed by changing the input to the function, i.e by replacing
print thd(abs_yf)
with
print( thd(abs_yf[1:int(len(abs_yf)/2) ]) )
where we have changed the input to not include the first or the last N/2 elements.
The result is still not ideal because the window needs to be exactly an integer number of cycles as the previous answers noted above. Testing with a pure sine with offset and adjusting the window demonstrates that the method works fairly well but fails terribly if significant window errors.
t0=0
tf = 0.02 # integer number of cycles
dt = 1e-4
offset = 0.5
N = int((tf-t0)/dt)
time = np.linspace(0.0,tf,N ) #;
commandSigFreq = 100
Amplitude = 2
waveOfSin = Amplitude*np.sin(2.0*pi*commandSigFreq*time) + offset
abs_yf = np.abs(fft(waveOfSin))
#print("freq is" + str(scipy.fftpack.fftfreq(sampled_data, dt ) ))
#As far as I know, THD=sqrt(sum of square magnitude of
#harmonics+noise)/Fundamental value (Is it correct?)So I'm
#just summing up square of all frequency data obtained from FFT,
#sqrt() them and dividing them with fundamental frequency value.
def thd(abs_data):
sq_sum=0.0
for r in range( len(abs_data)):
sq_sum = sq_sum + (abs_data[r])**2
sq_harmonics = sq_sum -(max(abs_data))**2.0
thd = 100*sq_harmonics**0.5 / max(abs_data)
return thd
print("Total Harmonic Distortion(in percent):")
print(thd(abs_yf[1:int(len(abs_yf)/2) ]))
It is quite likely that you add additional distortion by the measurement process itself.
If you compare an Arduino ADC with a high class measurement device, the values of the Arduino will very likely much worse. At least you need a very stable and jitter-free clock.
Furthermore, the output of the data (I guess via UART) might interfere with the timing of the ADC measurement.

Can normal distribution prob density be greater than 1?... based on python code checkup

I have a question:
Given mean and variance I want to calculate the probability of a sample using a normal distribution as probability basis.
The numbers are:
mean = -0.546369
var = 0.006443
curr_sample = -0.466102
prob = 1/(np.sqrt(2*np.pi*var))*np.exp( -( ((curr_sample - mean)**2)/(2*var) ) )
I get a probability which is larger than 1! I get prob = 3.014558...
What is causing this? The fact that the variance is too small messes something up? It's a totally legal input to the formula and should give something small not greater than 1! Any suggestions?
Ok, what you compute is not a probability, but a probability density (which may be larger than one). In order to get 1 you have to integrate over the normal distribution like so:
import numpy as np
mean = -0.546369
var = 0.006443
curr_sample = np.linspace(-10,10,10000)
prob = np.sum( 1/(np.sqrt(2*np.pi*var))*np.exp( -( ((curr_sample - mean)**2)/(2*var) ) ) * (curr_sample[1]-curr_sample[0]) )
print prob
witch results in
0.99999999999961509
The formula you give is a probability density, not a probability. The density formula is such that when you integrate it between two values of x, you get the probability of being in that interval. However, this means that the probability of getting any particular sample is, in fact, 0 (it's the density times the infinitesimally small dx).
So what are you actually trying to calculate? You probably want something like the probability of getting your value or larger, the so-called tail probability, which is often used in statistics (it so happens that this is given by the error function when you're talking about a normal distribution, although you need to be careful of exactly how it's defined).
When considering the bell-shaped probability distribution function (PDF) of given mean and variance, the peak value of the curve (height of mode) is 1/sqrt(2*pi*var). It is 1 for standard normal distribution (mean 0 and var 1). Hence when trying to calculate a specific value of a general normal distribution pdf, values larger than 1 are possible.

Numpy 1 Degree of Freedom

In the following code I'm confused as to what the third line means. What does the ddof = 1 do. I tried looking it up, but I still don't quite understand the concept or the purpose. I would really appreciate it if somebody could point me in the right direction.
Thanks
data = stats.binom.rvs(n = 10, p = 0.3, size = 10000)
print "Mean: %g" % np.mean(data)
print "SD: %g" % np.std(data, **ddof=1**)
First read the documentation:
Means Delta Degrees of Freedom. The divisor used in
calculations is N - ddof, where N represents the number of elements.
By default ddof is zero.
Searching for Degrees of Freedom then explains the statistical concept (emphasis mine):
Estimates of statistical parameters can be based upon different
amounts of information or data. The number of independent pieces of
information that go into the estimate of a parameter are called the
degrees of freedom. In general, the degrees of freedom of an estimate
of a parameter are equal to the number of independent scores that go
into the estimate minus the number of parameters used as intermediate
steps in the estimation of the parameter itself (most of the time the
sample variance has N − 1 degrees of freedom, since it is computed
from N random scores minus the only 1 parameter estimated as
intermediate step, which is the sample mean).
Degrees of freedom is an important concept which you may want to look it up, but the computational difference is actually straight forward, consider these:
In [20]:
x = np.array([6,5,4,6,6,7,2])
In [21]:
np.std(x)
Out[21]:
1.5518257844571737
#default is ddof=0, what this actually does:
In [22]:
np.sqrt((((x-x.mean())**2)/len(x)).sum())
Out[22]:
1.5518257844571737
In [23]:
np.std(x, ddof=1)
Out[23]:
1.6761634196950517
#what ddof=1 does:
In [24]:
np.sqrt((((x-x.mean())**2)/(len(x)-1)).sum())
Out[24]:
1.6761634196950517
In most languages (R, SAS etc), the default is to return std of ddof=1. numpy's default is ddof=0, which something worth noting.
It refers to denominator degrees of freedom
In some cases (e.g. working with population level data) your denominator is N. In other cases (e.g. sample level data) your denominator is N-1 (or whatever your ddof value is set to).
So the difference here is do you want to divide by N, or divide by N-ddof? Where and when you divide by each is a more domain/context specific question
In numpy, the default is ddof = 0 (divide by N), so if you want a different denominator value you have to manually specify it
ddof = 1 refers to Degrees of Freedom.
It's a statistical concept.
The formula for Standard Deviation of population:
But most of the time, we are trying to use sample Standard Deviation to estimate the true Standard Deviation of population. The formula above is a downward-biased estimation, using N-1 instead of N gives us a correction.
Check wiki: https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation for more information

Model I-V in Python

Model I-V.
Method:
Perform an integral, as a function of E, which outputs Current for each Voltage value used. This is repeated for an array of v_values. The equation can be found below.
Although the limits in this equation range from -inf to inf, the limits must be restricted so that (E+eV)^2-\Delta^2>0 and E^2-\Delta^2>0, to avoid poles. (\Delta_1 = \Delta_2). Therefore there are currently two integrals, with limits from -inf to -gap-e*v and gap to inf.
However, I keep returning a math range error although I believe I have excluded the troublesome E values by using the limits stated above. Pastie of errors: http://pastie.org/private/o3ugxtxai8zbktyxtxuvg
Apologies for the vagueness of this question. But, can anybody see obvious mistakes or code misuse?
My attempt:
from scipy import integrate
from numpy import *
import scipy as sp
import pylab as pl
import numpy as np
import math
e = 1.60217646*10**(-19)
r = 3000
gap = 400*10**(-6)*e
g = (gap)**2
t = 0.02
k = 1.3806503*10**(-23)
kt = k*t
v_values = np.arange(0,0.001,0.0001)
I=[]
for v in v_values:
val, err = integrate.quad(lambda E:(1/(e*r))*(abs(E)/np.sqrt(abs(E**2-g)))*(abs(E+e*v)/(np.sqrt(abs((E+e*v)**2-g))))*((1/(1+math.exp((E+e*v)/kt)))-(1/(1+math.exp(E/k*t)))),-inf,(-gap-e*v)*0.9)
I.append(val)
I = array(I)
I2=[]
for v in v_values:
val2, err = integrate.quad(lambda E:(1/(e*r))*(abs(E)/np.sqrt(abs(E**2-g)))*(abs(E+e*v)/(np.sqrt(abs((E+e*v)**2-g))))*((1/(1+math.exp((E+e*v)/kt)))-(1/(1+math.exp(E/k*t)))),gap*0.9,inf)
I2.append(val2)
I2 = array(I2)
I[np.isnan(I)] = 0
I[np.isnan(I2)] = 0
pl.plot(v_values,I,'-b',v_values,I2,'-b')
pl.show()
This question is better suited for the Computational Science site. Still here are some points for you to think about.
First, the range of integration is the intersection of (-oo, -eV-gap) U (-eV+gap, +oo) and (-oo, -gap) U (gap, +oo). There are two possible cases:
if eV < 2*gap then the allowed energy values are in (-oo, -eV-gap) U (gap, +oo);
if eV > 2*gap then the allowed energy values are in (-oo, -eV-gap) U (-eV+gap, -gap) U (gap, +oo).
Second, you are working in a very low temperature region. With t equal to 0.02 K, the denominator in the Boltzmann factor is 1.7 µeV, while the energy gap is 400 µeV. In this case the value of the exponent is huge for positive energies and it soon goes off the limits of the double precision floating point numbers, used by Python. As this is the minimum possible positive energy, things would not get any better at higher energies. With negative energies the value would always be very close to zero. Note that at this temperature, the Fermi-Dirac distribution has a very sharp edge and resembles a reflected theta function. At E = gap you would have exp(E/kT) of approximately 6.24E+100. You would run out of range when E/kT > 709.78 or E > 3.06*gap.
Yet it makes no sense to go to such energies since at that temperature the difference between the two Fermi functions very quickly becomes zero outside the [-eV, 0] interval which falls entirely inside the gap for the given temperature when V < (2*gap)/e (0.8 mV). That's why one would expect that the current would be very close to zero when the bias voltage is less than 0.8 mV. When it is more than 0.8 mV, then the main value of the integral would come from the integrand in (-eV+gap, -gap), although some non-zero value would come from the region near the singularity at E = gap and some from the region near the singularity at E = -eV-gap. You should not avoid the singularities in the DoS, otherwise you would not get the expected discontinuities (vertical lines) in the I(V) curve (image taken from Wikipedia):
Rather, you have to derive equivalent approximate expressions in the vicinity of each singularity and integrate them instead.
As you can see, there are many special cases for the value of the integrand and you have to take them all into account when computing numerically. If you don't want to do that, you should probably turn to some other mathematical package like Maple or Mathematica. These have much more sophisticated numerical integration routines and might be able to directly handle your formula.
Note that this is not an attempt to answer your question but rather a very long comment that would not fit in any comment field.
The reason for the math range error is that your exponential goes to infinity. Taking v = 0.0009 and E = 5.18e-23, the expression exp((E + e*v) / kt) (I corrected the typo pointed out by Hristo Liev in your Python expression) is exp(709.984..) which is beyond the range you can represent with double precision numbers (up to ca. 1E308).
Two additional notes:
As noted by others, you should probably rescale your equation by using a unit system which delivers numbers in a smaller range. Maybe, atomic units are a possible choice as it would set e = 1, but I did not try to convert your equation into it. (Probably, your timestep would then become quite large, as in atomic units the time unit is about is 1/40 fs).
Usually, one uses the exponential notation for float point numbers: e = 1.60217E-19 instead of e = 1.60217*10**(-19).
The best way to approach this problem in the end was to use a heaviside function to preventE variable from exceeding \Delta variable.

Categories