Precision in Matlab and Python (NumPy)

I'm converting a Matlab script to Python and I am getting results that differ on the order of 10**-4.
In Matlab:
f_mean=f_mean+nanmean(f);
f = f - nanmean(f);
f_t = gradient(f);
f_tt = gradient(f_t);
if n_loop==1
    theta = atan2( sum(f.*f_tt), sum(f.^2) );
end
theta = -2.2011167e+03
In Python:
f_mean = f_mean + np.nanmean(vel)
vel = vel - np.nanmean(vel)
firstDerivative = np.gradient(vel)
secondDerivative = np.gradient(firstDerivative)
if numberLoop == 1:
    theta = np.arctan2(np.sum(vel * secondDerivative),
                       np.sum([vel**2]))
Although firstDerivative and secondDerivative give the same results in Python and Matlab, f_mean is slightly different: -0.0066412 (Matlab) vs. -0.0066414 (Python); and so is theta: -0.4126186 (M) vs. -0.4124718 (P). It is a small difference, but in the end it leads to different results in my scripts.
I know some people have asked about this kind of difference before, but always regarding std (which I understand), never regarding mean values. I wonder why that is.

One possible source of the initial difference you describe (between means) could be numpy's use of pairwise summation, which on large arrays will typically be appreciably more accurate than the naive method:
a = np.random.uniform(-1, 1, (10**6,))
a = np.r_[-a, a]
# so the sum should be zero
a.sum()
# 7.815970093361102e-14
# use cumsum to get naive summation:
a.cumsum()[-1]
# -1.3716805469243809e-11
Edit (thanks @sascha): for the last word, and as a "provably exact" reference, you could use math.fsum:
import math
math.fsum(a)
# 0.0
I don't have Matlab, so I can't check what it does internally.
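To see how much accumulation order alone can move a result, here is a small sketch (the float32 array of 0.1s is a contrived example chosen to exaggerate the effect, not the asker's data):

```python
import numpy as np

# Ten million copies of 0.1 in float32; the exact sum is 1,000,000.
a = np.full(10**7, 0.1, dtype=np.float32)

# np.sum uses pairwise summation and stays close to the true value.
pairwise = a.sum()

# cumsum accumulates strictly left to right, like a naive loop,
# and drifts far away once the running total dwarfs each addend.
naive = a.cumsum()[-1]

print(pairwise, naive)
```

The same mechanism, at float64 precision and with less pathological data, is enough to flip the last couple of digits of a mean.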

Fourier transform integration in Python

I have to calculate the Fourier transform of acceleration data that I've already loaded. I have to do it the old-fashioned way (I mean, without the numpy np.fft.fft command, even though I haven't mastered that either). So, this is what I have for the integration:
ri = 1j  # first time defining a complex number in python
Fmax = 50  # Hz, the maximum frequency to consider
df = 0.01  # frequency differential
nf = int(Fmax / df)  # number of sample points for frequency
# and I already have UD_Acc defined as a 1D numpy array, then the "for loop":
Int_UD = []
for i in range(UD_Acc.size):
    w = []
    for j in range(nf):
        w.append(2 * np.pi * df * (j - 1))
    Int_UD.append(Int_UD[i - 1] + UD_Acc[i] * np.exp(ri * w * (i - 1) * dt1))
First of all, in the for loop the w variable has a warning as:
Expected type 'complex', got 'List[Union[Union[float, int], Any]]' instead
And then, even if I run it, it says that the list index is out of range.
I know it may seem a little rudimentary to integrate like this, or to find a Fourier transform without using scipy or np.fft, but it's for class, and I'm trying to understand the basics, so thanks in advance.
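For what it's worth, one way the loop could be restructured: build w once as a NumPy array, keep the time axis separate from the frequency axis, and accumulate one complex sum per frequency. This is only a sketch; dt1 and UD_Acc below are made-up stand-ins for the asker's actual data:

```python
import numpy as np

dt1 = 0.01                            # assumed sampling step (stand-in)
t = np.arange(0.0, 1.0, dt1)          # assumed time axis
UD_Acc = np.sin(2 * np.pi * 5.0 * t)  # dummy 5 Hz "acceleration" signal

Fmax = 50   # Hz, maximum frequency to consider
df = 0.01   # frequency differential
nf = int(Fmax / df)

w = 2 * np.pi * df * np.arange(nf)    # angular frequencies, built once

# Riemann-sum approximation of F(w) = integral a(t) * exp(-i w t) dt
Int_UD = np.array([np.sum(UD_Acc * np.exp(-1j * wk * t)) * dt1 for wk in w])

peak_hz = np.argmax(np.abs(Int_UD)) * df  # should land near the 5 Hz input
print(peak_hz)
```

The list-vs-complex warning disappears because w is a float array rather than a Python list, and nothing indexes past the end of Int_UD because each frequency bin is computed independently.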

What is a robust method for calculating the cosine of the angle of two vectors in python?

I want to calculate the cosine of the angle of two 2D-vectors that lie on the same plane in Python. I use the classic formula cosθ = v1∙v2/(|v1||v2|). However, depending on how I implement it, the rounding errors can give values greater than 1 or less than -1 for parallel vectors. Three examples of the implementation follow:
import numpy as np

def cos2Dvec_1(v1, v2):
    v1norm2 = v1[0]**2 + v1[1]**2
    v2norm2 = v2[0]**2 + v2[1]**2
    v1v2 = v1[0]*v2[0] + v1[1]*v2[1]
    if v1v2 > 0:
        return np.sqrt(v1v2**2/(v1norm2*v2norm2))
    else:
        return -np.sqrt(v1v2**2/(v1norm2*v2norm2))

def cos2Dvec_2(v1, v2):
    r1 = np.linalg.norm(v1)
    r2 = np.linalg.norm(v2)
    return np.dot(v1, v2)/(r1*r2)

def cos2Dvec_3(v1, v2):
    v1 = v1/np.linalg.norm(v1)
    v2 = v2/np.linalg.norm(v2)
    return np.dot(v1, v2)

v2 = np.array([2.0, 3.0])
v1 = -0.01 * v2
print(cos2Dvec_1(v2, v1), cos2Dvec_2(v1, v2), cos2Dvec_3(v1, v2))
When I execute this code on my machine (Python 3.7.6, Numpy 1.18.1), I get:
-1.0 -1.0000000000000002 -1.0
The difference between the 2nd and 3rd version of the function is especially weird to me...
If I use the second function to calculate the angle via the inverse cosine, I will get a nan, which I want to avoid (and I don't want to check every time whether the result lies in [-1, 1]). I was wondering, though, whether versions 1 and 3 (which give the correct result) are numerically stable in all cases.
You could use the clip function to ensure values will be coerced (or "clamped" or "clipped") between -1 and 1.
return np.clip(YourOriginalResult, -1, 1)
Rounding errors are inevitable when manipulating floats. Unfortunately, I don't have enough expertise in your calculations to recommend any one of your methods.
I'd suggest you check which of them behave correctly after adding the clip part to mitigate those "out of range" issues.
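For instance, wrapping the dot-product version (the function name here is just illustrative):

```python
import numpy as np

def cos2Dvec_clipped(v1, v2):
    c = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # clamp so np.arccos never receives a value outside [-1, 1]
    return np.clip(c, -1.0, 1.0)

v2 = np.array([2.0, 3.0])
v1 = -0.01 * v2
c = cos2Dvec_clipped(v1, v2)
print(c, np.arccos(c))  # arccos stays finite even for anti-parallel vectors
```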
After timing numpy.clip(x,-1,1), I saw that a much faster alternative is just plain old y = max(-1, min(x,1)). Timing results for the following code snippet:
import time
import numpy as np

t1 = time.time()
vals = [-1.001, -0.99, 0.99, 1.001]
for ii in range(1):
    for x in vals:
        y = np.clip(x, -1, 1)
        # y = max(-1, min(x, 1))
        print(y)
t2 = time.time()
print(t2 - t1)
Numpy clip was almost 23 times slower on my laptop.
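A single pass over four values mostly measures print overhead, so the gap is easier to trust with timeit; a quick sketch (absolute numbers will vary by machine):

```python
import timeit

import numpy as np

x = 0.99
n = 100_000

# np.clip on a Python scalar pays array-conversion overhead on every call;
# the plain min/max chain does not.
t_clip = timeit.timeit(lambda: np.clip(x, -1, 1), number=n)
t_minmax = timeit.timeit(lambda: max(-1, min(x, 1)), number=n)

print(f"np.clip: {t_clip:.3f}s  max/min: {t_minmax:.3f}s")
```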

Difference and subsequent error between using Pandas Series and Numpy Arrays

When doing some estimations, calculations, and other fun stuff in Python, I came across something really weird and upsetting.
I have this thing where I estimate some parameters using ML-estimation, and have previously assumed that everything was peachy and fine. I read csv-data with pandas, and use the subsequent data for the estimation. Therefore, the data has originally been passed down to the ML-estimation function as Pandas Series. Today I wanted to try some matrix-operations on a thing in the calculation for kicks-and-giggles, and converted the input-data to numpy arrays. However, when I ran the code, the estimation results were different. After restoring some of the multiplications, it was still different. Then I changed back to using Pandas series, and it returned to the previously expected result.
This is where I got curious, and now I turn to you: can the rounding behaviour of float64 NumPy arrays and float64 Pandas Series really differ so much that my calculations come out drastically different?
Consider the following code-example containing a sample from my ML-estimator
import pandas as pd
import numpy as np
import math
values = [3.41527085753, 3.606855606852, 3.5550625070226231, 3.680327020956565, \
3.30270511221, 3.704752803295, 3.6307205395804001, 3.200863997609199, \
2.90599272353, 3.555062501231, 2.8047528032711295, 3.415270760685753, \
3.50599277872, 3.445622506242, 3.3047528084632258, 3.219431344191376, \
3.68032756565, 3.451542245654, 3.2244456543387564, 2.999848273256456]
Ps = pd.Series(values, dtype=np.float64)
Narr = np.array(values, dtype=np.float64)
def getLambda(S, delta=1/255):
    n = len(S) - 1
    Sx = sum(S[0:-1])
    Sy = sum(S[1:])
    Sxx = sum(S[0:-1]**2)
    Sxy = sum(S[0:-1]*S[1:])
    mu = (Sy*Sxx - Sx*Sxy) / (n*(Sxx - Sxy) - (Sx**2 - Sx*Sy))
    lambd = np.log((Sxy - mu*Sx - mu*Sy + n*mu**2) / (Sxx - 2*mu*Sx + n*mu**2)) / delta
    a = math.exp(-lambd*delta)
    return mu, a, lambd
print("Numpy Array calculation gives me mu = {}, alpha = {} and Lambda = {}".format(getLambda(Narr)[0], getLambda(Narr)[1], getLambda(Narr)[2]))
print("Pandas Series calculation gives me mu = {}, alpha = {} and Lambda = {}".format(getLambda(Ps)[0], getLambda(Ps)[1], getLambda(Ps)[2]))
The values are just some random values picked from the original data in my larger project.
This will, at least for me, print:
>> Numpy Array calculation gives me mu = 3.378432651661709, alpha = 102.09644146650535 and Lambda = -1179.6090571432392
>> Pandas Series calculation gives me mu = 3.3981019891871247, alpha = nan and Lambda = nan
The procedure, method, and original data are identical, yet there is already a difference of about 0.019669 in the calculation of mu, which to me is really weird and upsetting.
If this is due to a difference in rounding between the two ways of handling the data (keep in mind that I explicitly stated float64 in both cases), it's weird, as it makes me question which of them I should use and why. Otherwise, is there a bug in one of them? Or is there a third explanation that I did not know of to begin with?
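One thing worth checking (an assumption about the cause, not a certainty): pandas aligns operands on their index rather than by position, so S[0:-1] * S[1:] is not the shifted elementwise product it would be with NumPy arrays. A minimal sketch of the effect:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
a = s.to_numpy()

# NumPy multiplies by position: 1*2 + 2*3 + 3*4 = 20
np_val = (a[0:-1] * a[1:]).sum()

# pandas aligns on the index: s[0:-1] carries labels 0..2, s[1:] labels 1..3.
# Only labels 1 and 2 overlap (2*2 + 3*3 = 13); the rest become NaN
# and are silently skipped by .sum().
pd_val = (s[0:-1] * s[1:]).sum()

print(np_val, pd_val)
```

If that is what is happening in Sxy (and similarly in the other sums), the two code paths are computing genuinely different quantities, not the same quantity with different rounding.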

Working out an equation

I'm trying to solve a differential equation numerically, and am writing code that will give me an array of the solution at each time point.
import numpy as np
import matplotlib.pylab as plt
pi=np.pi
sin=np.sin
cos=np.cos
sqrt=np.sqrt
alpha=pi/4
g=9.80665
y0=0.0
theta0=0.0
sina = sin(alpha)**2
second_term = g*sin(alpha)*cos(alpha)
x0 = float(raw_input('What is the initial x in meters?'))
x_vel0 = float(raw_input('What is the initial velocity in the x direction in m/s?'))
y_vel0 = float(raw_input('what is the initial velocity in the y direction in m/s?'))
t_f = int(raw_input('What is the maximum time in seconds?'))
r0 = x0
vtan = sqrt(x_vel0**2+y_vel0**2)
dt = 1000
n = range(0,t_f)
r_n = r0*(n*dt)
r_nm1 = r0((n-1)*dt)
F_r = ((vtan**2)/r_n)*sina-second_term
r_np1 = 2*r_n - r_nm1 + dt**2 * F_r
data = [r0]
for time in n:
    data.append(float(r_np1))
print data
I'm not sure how to make the equation solve for r_np1 at each time in the range n. I'm still new to Python and would like some help understanding how to do something like this.
First issue is:
n = range(0,t_f)
r_n = r0*(n*dt)
Here you define n as a list and try to multiply the list n with the integer dt. This will not work. Pure Python is NOT a vectorized language like NumPy or Matlab where you can do vector multiplication like this. You could make this line work with
n = np.arange(0,t_f)
r_n = r0*(n*dt),
but you don't have to. Instead, you should move everything inside the for loop so the calculation is performed at each timestep. As it stands, you do the calculation once and then append that same single result t_f times to the data list.
Of course, you have to leave your initial conditions (which is a key part of ODE solving) OUTSIDE of the loop, because they only affect the first step of the solution, not all of them.
So:
# Initial conditions
r0 = x0
data = [r0]

# Loop along timesteps
for n in range(t_f):
    # calculations performed at each timestep
    vtan = sqrt(x_vel0**2 + y_vel0**2)
    dt = 1000
    r_n = r0*(n*dt)
    r_nm1 = r0*((n-1)*dt)
    F_r = ((vtan**2)/r_n)*sina - second_term
    r_np1 = 2*r_n - r_nm1 + dt**2 * F_r
    # append result to output list
    data.append(float(r_np1))

# do something with output list
print data
plt.plot(data)
plt.show()
I did not add any new code; I only rearranged your lines. Notice that the part:
n = range(0,t_f)
for time in n:
Can be simplified to:
for time in range(0,t_f):
However, you use n as a time variable in the calculation (previously - and wrongly - defined as a list instead of a single number). Thus you can write:
for n in range(0,t_f):
Note 1: I do not know if this code is right mathematically, as I don't even know the equation you're solving. The code runs now and provides a result - you have to check if the result is good.
Note 2: Pure Python is not the best tool for this purpose. You should try some of SciPy's highly optimized built-ins for ODE solving, as already hinted in the comments.
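As a side note, the update r_np1 = 2*r_n - r_nm1 + dt**2 * F_r is a Verlet-style step, which normally advances from the two previous positions rather than recomputing r_n from r0 every iteration. A hedged sketch of what such a loop might look like (vtan, dt, and the initial values below are assumptions, not the asker's inputs):

```python
import numpy as np

g = 9.80665
alpha = np.pi / 4
sina = np.sin(alpha)**2
second_term = g * np.sin(alpha) * np.cos(alpha)

vtan = 5.0              # assumed tangential speed
dt = 0.001              # assumed small timestep (not the 1000 s above)
r_prev = r_curr = 1.0   # assumed initial radius, starting at rest

data = [r_curr]
for _ in range(1000):
    F_r = (vtan**2 / r_curr) * sina - second_term
    r_next = 2*r_curr - r_prev + dt**2 * F_r   # Verlet step
    r_prev, r_curr = r_curr, r_next
    data.append(r_curr)

print(data[-1])
```

Each step depends only on the two previous positions, so the solution actually evolves in time instead of being recomputed from the initial condition.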

Python Numpy Random Numbers - inconsistent?

I am trying to generate log-normally distributed random numbers in python (for later MC simulation), and I find the results to be quite inconsistent when parameters are a bit larger.
Below I am generating a series of LogNormals from Normals (and then using Exp) and directly from LogNormals.
The resulting means are bearable, but the variances are quite imprecise. This also holds for mu = 4, 5, ...
If you re-run the below code a couple of times - the results come back quite different.
Code:
import numpy as np
mu = 10;
tmp1 = np.random.normal(loc=-mu, scale=np.sqrt(mu*2),size=1e7)
tmp1 = np.exp(tmp1)
print tmp1.mean(), tmp1.var()
tmp2 = np.random.lognormal(mean=-mu, sigma=np.sqrt(mu*2), size=1e7)
print tmp2.mean(), tmp2.var()
print 'True Mean:', np.exp(0), 'True Var:',(np.exp(mu*2)-1)
Any advice how to fix this?
I've tried this on Wakari.io as well - the behaviour is the same there.
Update:
I've taken the 'True' Mean and Variance formula from Wikipedia: https://en.wikipedia.org/wiki/Log-normal_distribution
Snapshots of results:
1)
0.798301881219 57161.0894726
1.32976988569 2651578.69947
True Mean: 1.0 True Var: 485165194.41
2)
1.20346203176 315782.004309
0.967106664211 408888.403175
True Mean: 1.0 True Var: 485165194.41
3) Last one with n=1e8 random numbers
1.17719369919 2821978.59163
0.913827160458 338931.343819
True Mean: 1.0 True Var: 485165194.41
Even with the large sample size that you have, with these parameters, the estimated variance is going to change wildly from run to run. That's just the nature of the fat-tailed lognormal distribution. Try running the np.exp(np.random.normal(...)).var() several times. You will see a similar swing of values as np.random.lognormal(...).var().
In any case, np.random.lognormal() is just implemented as np.exp(np.random.normal()) (well, the C equivalent).
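To see this concretely, repeating the experiment a few times shows the sample variance jumping around even at a million draws per run (the seed and sizes below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = -10.0, np.sqrt(20.0)

# Five independent runs: with sigma**2 = 20 the sample variance is
# dominated by a handful of extreme draws, so it swings between runs.
variances = [np.exp(rng.normal(mu, sigma, 10**6)).var() for _ in range(5)]
print(variances)
```

The theoretical variance, exp(2*mu + sigma**2) * (exp(sigma**2) - 1), is astronomically larger than any of these estimates, which is exactly the fat-tail effect described above.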
OK: using the sample you have just built, the notation from Wikipedia (first section, mu and sigma), and your example:
from numpy import log, exp, sqrt
import numpy as np
mu = -10
scale = sqrt(2*10) # scale is sigma, not variance
tmp1 = np.random.normal(loc=mu, scale=scale, size=1e8)
# Just checking
print tmp1.mean(), tmp1.std()
# 10.0011028634 4.47048010775, perfectly accurate
tmp1_exp = exp(tmp1)  # not sensible to use the same name for two samples
# WIKIPEDIA NOTATION!
m = tmp1_exp.mean()  # until proven wrong, this is a measure of the mean
v = tmp1_exp.var()   # again, until proven wrong, this is sigma**2
#Now, according to wikipedia
print "This: ", log(m**2/sqrt(v+m**2)), "should be similar to", mu
# I get This: 13.9983309499 should be similar to 10
print "And this:", sqrt(log(1+v/m**2)), "should be similar to", scale
# I get And this: 3.39421327037 should be similar to 4.472135955
So, even if the values are not exactly perfect, I wouldn't claim that they are completely wrong.
