When doing some estimations, calculations, and other fun stuff in Python, I came across something really weird and upsetting.
I have a routine that estimates some parameters using ML estimation, and I had previously assumed that everything was peachy and fine. I read CSV data with pandas and use the resulting data for the estimation, so the data has always been passed to the ML-estimation function as a pandas Series. Today, for kicks and giggles, I wanted to try some matrix operations on a part of the calculation and converted the input data to numpy arrays. However, when I ran the code, the estimation results were different. Even after restoring the original multiplications, they were still different. When I switched back to pandas Series, the previously expected result returned.
This is where I got curious, and now I turn to you. Can rounding really differ between float64 numpy arrays and float64 pandas Series so much that my calculations come out this drastically different?
Consider the following code example containing a sample from my ML estimator:
import pandas as pd
import numpy as np
import math
values = [3.41527085753, 3.606855606852, 3.5550625070226231, 3.680327020956565, \
3.30270511221, 3.704752803295, 3.6307205395804001, 3.200863997609199, \
2.90599272353, 3.555062501231, 2.8047528032711295, 3.415270760685753, \
3.50599277872, 3.445622506242, 3.3047528084632258, 3.219431344191376, \
3.68032756565, 3.451542245654, 3.2244456543387564, 2.999848273256456]
Ps = pd.Series(values, dtype=np.float64)
Narr = np.array(values, dtype=np.float64)
def getLambda(S, delta=1/255):
    n = len(S) - 1
    Sx = sum( S[0:-1] )
    Sy = sum( S[1:] )
    Sxx = sum( S[0:-1]**2 )
    Sxy = sum( S[0:-1]*S[1:] )
    mu = (Sy*Sxx - Sx*Sxy) / ( n*(Sxx - Sxy) - (Sx**2 - Sx*Sy) )
    lambd = np.log( (Sxy - mu*Sx - mu*Sy + n*mu**2) / (Sxx - 2*mu*Sx + n*mu**2) ) / delta
    a = math.exp(-lambd*delta)
    return mu, a, lambd
print("Numpy Array calculation gives me mu = {}, alpha = {} and Lambda = {}".format(getLambda(Narr)[0], getLambda(Narr)[1], getLambda(Narr)[2]))
print("Pandas Series calculation gives me mu = {}, alpha = {} and Lambda = {}".format(getLambda(Ps)[0], getLambda(Ps)[1], getLambda(Ps)[2]))
The values are just some random values picked from the original data in my larger project.
This will, at least for me, print:
>> Numpy Array calculation gives me mu = 3.378432651661709, alpha = 102.09644146650535 and Lambda = -1179.6090571432392
>> Pandas Series calculation gives me mu = 3.3981019891871247, alpha = nan and Lambda = nan
The procedure, the method, and the original data are identical, and yet there is already a difference of about 0.019669 in the calculation of mu, which to me is really weird and upsetting.
If this is due to a difference in rounding between the two ways of handling data (keep in mind that I explicitly stated float64 in both cases), that is weird, as it just makes me question which of them I should use and why. Otherwise, does one of them have a bug? Or is there a third alternative that explains everything and that I simply did not know about to begin with?
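A likely third alternative (rather than a float64 rounding difference): pandas Series arithmetic aligns operands by index label, not by position, so inside getLambda the expression S[0:-1]*S[1:] does not compute a lag product at all when S is a Series; it pairs equal labels, multiplying each value by itself, and fills the two non-overlapping labels with NaN. A minimal sketch of the difference, reusing the Ps and Narr defined above:
print((Narr[0:-1] * Narr[1:])[:3])   # positional pairing: values[0]*values[1], values[1]*values[2], ...
print((Ps[0:-1] * Ps[1:]).head(3))   # label alignment: NaN at label 0, then values[1]*values[1], ...
# hypothetical fix, not from the original code: strip the index so a Series
# input reproduces the numpy result, e.g.
# Sxy = sum(S[0:-1].values * S[1:].values)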
I have to calculate the Fourier transform of some acceleration data I've already coded up. I have to do it the old-fashioned way (I mean, without the numpy np.fft.fft command, even though I don't master that either). So, this is what I have for the integration:
ri = 1j     # first time defining a complex number in python
Fmax = 50   # Hz, the maximum frequency to consider
df = 0.01   # frequency differential
nf = int(Fmax / df)   # number of sample points for frequency
# and I already have UD_Acc defined as a 1D numpy array, then the "for loop":
Int_UD = []
for i in range(UD_Acc.size):
    w = []
    for j in range(nf):
        w.append(2 * np.pi * df * (j - 1))
    Int_UD.append(Int_UD[i - 1] + UD_Acc[i] * np.exp(ri * w * (i - 1) * dt1))
First of all, inside the for loop the w variable gets a warning:
Expected type 'complex', got 'List[Union[Union[float, int], Any]]' instead
And then, even if I run it anyway, it says that the list index is out of range.
I know it may seem a little rudimentary to integrate like this, or to find a Fourier transform without using scipy or np.fft, but it's for class and I'm trying to understand the basics, so thanks in advance.
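For comparison, here is a minimal sketch of how I believe the intended sum can be written without the type and index errors: build the angular frequencies as one numpy array and keep one complex accumulator per frequency. The sampling interval dt1 and the signal are placeholders here, since the snippet above doesn't define them:
import numpy as np

ri = 1j
Fmax = 50            # Hz, maximum frequency to consider
df = 0.01            # frequency differential
nf = int(Fmax / df)  # number of frequency sample points
dt1 = 0.005                    # assumed sampling interval (placeholder)
UD_Acc = np.random.rand(1000)  # placeholder for the real acceleration array

w = 2 * np.pi * df * np.arange(nf)    # angular frequencies as one numpy array
Int_UD = np.zeros(nf, dtype=complex)  # running integral, one entry per frequency
for i in range(UD_Acc.size):
    # Riemann sum for F(w) = integral a(t) * exp(-j*w*t) dt, all w at once
    Int_UD += UD_Acc[i] * np.exp(-ri * w * i * dt1) * dt1
The warning disappears because w is now an ndarray rather than a Python list, and the index error disappears because nothing reads Int_UD[i - 1] before it exists.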
I have two numpy (complex) arrays A[t], B[t] defined over a grid of points "t". These two arrays are to be convolved so that I get a third array C[y] = (A*B)(y), where "y" ranges over exactly the same points as the "t" grid. The point is that both A and B need to be integrated from -\infty to \infty according to the standard convolution operation.
I'm using scipy.signal.convolve for this, and I would also like to use fftconvolve, since my real arrays are supposed to be quite big. However, when I try the module on a minimal working example, I seem to be doing things very wrong. Here is a piece of the code, where I choose A(t) = exp(-t**2) and B(t) = exp(-t). The convolution of these two functions in Mathematica gives:
C(y) = \int_{-\infty}^{\infty} A(t)\, B(y - t)\, dt = \sqrt{\pi}\, e^{0.25 - y}
But then I try this in Python and get very wrong results:
import scipy.signal as scp
import numpy as np
import matplotlib.pyplot as plt
delta = 0.001
t = np.arange(1000)*delta
a = np.exp( -t**2 )
b = np.exp( -t )
c = scp.convolve(a, b, mode='same')*delta
d = np.sqrt(np.pi)*np.exp( 0.25 - t )
plt.plot(np.arange(len(c)) * delta, c)
plt.plot(t[::50], d[::50], 'o')
As far as I understood, the "same" mode evaluates the convolution over the same points as the original grids, but that doesn't seem to be the case here... Any help is greatly appreciated!
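If I read the problem correctly, two pieces of bookkeeping are missing, so here is a hedged sketch rather than a definitive fix: the grid has to cover the support of the Gaussian (the analytic result integrates over all of t, not just [0, 1)), and with mode='full' sample k of the output approximates the integral at y = t_a[0] + t_b[0] + k*delta:
import numpy as np
import scipy.signal as scp

delta = 0.001
t = np.arange(-5, 5, delta)   # widened grid (my choice) so exp(-t**2) is fully covered
a = np.exp(-t**2)
b = np.exp(-t)
c = scp.convolve(a, b, mode='full') * delta
# sample k of the full convolution lives at y = t[0] + t[0] + k*delta
y = 2 * t[0] + np.arange(c.size) * delta
d = np.sqrt(np.pi) * np.exp(0.25 - y)   # analytic result, for comparison
Away from the edges of the y grid, c and d should agree closely; near the edges the truncation of the infinite integral shows up.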
I want to plot the frequency version of Planck's law. I first tried to do this independently:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
# Planck's Law
# Constants
h = 6.62607015*(10**-34)  # J*s
c = 299792458             # m/s
k = 1.38064852*(10**-23)  # J/K
T = 20                    # K
frequency_range = np.linspace(10**-19, 10**19, 1000000)
def plancks_law(nu):
    a = (2*h*nu**3) / (c**2)
    e_term = np.exp(h*nu/(k*T))
    brightness = a / (e_term - 1)
    return brightness
plt.plot(frequency_range,plancks_law(frequency_range))
plt.gca().set_xlim([1*10**-16 ,1*10**16 ])
plt.gca().invert_xaxis()
This did not work; I have an issue with scaling somehow. My next idea was to try this person's code from this question: Plancks Formula for Blackbody spectrum
import matplotlib.pyplot as plt
import numpy as np
h = 6.626e-34
c = 3.0e+8
k = 1.38e-23
def planck_f(freq, T):
    a = 2.0*h*(freq**3)
    b = h*freq/(k*T)
    intensity = a/( c**2 * (np.exp(b) - 1.0) )
    return intensity
# generate x-axis in increments from 1nm to 3 micrometer in 1 nm increments
# starting at 1 nm to avoid wav = 0, which would result in division by zero.
wavelengths = np.arange(1e-9, 3e-6, 1e-9)
frequencies = np.arange(3e14, 3e17, 1e14, dtype=np.float64)
intensity4000 = planck_f(frequencies, 4000.)
plt.gca().invert_xaxis()
This didn't work either, because I got a divide-by-zero error. Except that I don't see where there would be a division by zero: the denominator shouldn't ever be zero, since the exponential term shouldn't ever be equal to one. I chose the frequencies to be the conversions of the wavelength values from the example code.
Can anyone help fix the problem or explain how I can get Planck's law as a function of frequency instead of wavelength?
You cannot safely handle such large numbers; even for comparably "small" values of b = h*freq/(k*T), your float64 will overflow: np.exp(709.) = 8.218407461554972e+307 is fine, but np.exp(710.) = inf. You'll have to adjust your units (exponents) accordingly to avoid this!
Note that this is also the case in the other question you linked to: if you insert print( np.exp(b)[:10] ) inside the definition of planck(), you can examine the first ten evaluated b's, and you'll see the overflow in the first few occurrences. In any case, simply use the answer posted in the other question, but convert the x-axis in plt.plot(wavelengths, intensity) to frequency (I hope you know how to get from one to the other) :-)
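To make "adjust your units" concrete, here is a hedged sketch (the temperature and grid bounds are my own choices, not from the question) that derives the frequency range from the exponent itself, so b = h*nu/(k*T) stays far below the ~709 overflow threshold:
import numpy as np
import matplotlib.pyplot as plt

h = 6.626e-34  # J*s
c = 3.0e8      # m/s
k = 1.38e-23   # J/K
T = 4000.0     # K

# frequency where b = h*nu/(k*T) reaches 50; the spectrum has decayed to
# essentially zero long before this point
nu_max = 50 * k * T / h
frequencies = np.linspace(1e12, nu_max, 100000)

def planck_f(freq, T):
    b = h * freq / (k * T)
    # np.expm1(b) computes exp(b) - 1, and is also more accurate for small b
    return 2.0 * h * freq**3 / (c**2 * np.expm1(b))

plt.plot(frequencies, planck_f(frequencies, T))
plt.xlabel("frequency [Hz]")
plt.ylabel("spectral radiance")
plt.show()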
I'm converting a Matlab script to Python and I am getting results that differ on the order of 10**-4.
In Matlab:
f_mean=f_mean+nanmean(f);
f = f - nanmean(f);
f_t = gradient(f);
f_tt = gradient(f_t);
if n_loop==1
    theta = atan2( sum(f.*f_tt), sum(f.^2) );
end
theta = -2.2011167e+03
In Python:
f_mean = f_mean + np.nanmean(vel)
vel = vel - np.nanmean(vel)
firstDerivative = np.gradient(vel)
secondDerivative = np.gradient(firstDerivative)
if numberLoop == 1:
    theta = np.arctan2(np.sum(vel * secondDerivative),
                       np.sum([vel**2]))
Although firstDerivative and secondDerivative give the same results in Python and Matlab, f_mean is slightly different: -0.0066412 (Matlab) vs -0.0066414 (Python); and so is theta: -0.4126186 (M) vs -0.4124718 (P). It is a small difference, but in the end it leads to different results in my scripts.
I know some people have asked about this kind of difference before, but always regarding std (which I understand), never regarding mean values. I wonder why it happens.
One possible source of the initial difference you describe (between the means) could be numpy's use of pairwise summation, which on large arrays will typically be appreciably more accurate than the naive method:
a = np.random.uniform(-1, 1, (10**6,))
a = np.r_[-a, a]
# so the sum should be zero
a.sum()
# 7.815970093361102e-14
# use cumsum to get naive summation:
a.cumsum()[-1]
# -1.3716805469243809e-11
Edit (thanks @sascha): for the last word, and as a "provably exact" reference, you could use math.fsum:
import math
math.fsum(a)
# 0.0
I don't have Matlab, so I can't check what it is doing.
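If you want to test this hypothesis on your own data, one hedged check (vel below is a random stand-in for your actual array) is to compare numpy's pairwise mean against a deliberately naive left-to-right accumulation:
import numpy as np

vel = np.random.uniform(-1, 1, 10**6)  # stand-in for your real data
x = vel[~np.isnan(vel)]                # drop NaNs, as nanmean does
pairwise_mean = x.mean()               # numpy uses pairwise summation here
naive_mean = x.cumsum()[-1] / x.size   # cumsum accumulates strictly left to right
print(pairwise_mean - naive_mean)      # typically small but nonzero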
I am playing with SciPy today and wanted to test least-squares fitting. The function malo(time) works perfectly, returning calculated concentrations, if I put it in a loop that iterates over an array of timesteps (time in the code).
Now I want to compare my calculated concentrations with my measured ones. I created a residuals function which calculates the difference between the measured concentrations (an array called conc in the script) and the concentrations modelled with malo(time).
With optimize.leastsq I want to fit the parameter PD so that both curves match as well as possible. I don't see a mistake in my code: malo(time) performs well, but whenever I run the optimize.leastsq command, Python says "only length-1 arrays can be converted to Python scalars". If I set the timedt array to a single value, the code runs without any error.
Do you see any way to convince Python to use my array of timesteps in the loop?
import pylab as p
import math as m
import numpy as np
from scipy import optimize
Q = 0.02114
M = 7500.0
dt = 30.0
PD = 0.020242215
tom = 26.0        # minutes
tos = tom * 60.0  # seconds
timedt = np.array([30., 60., 90.])
conc = np.array([ 2.7096, 2.258 , 1.3548, 0.9032, 0.9032])
def malo(time):
    M1 = M/Q
    M2 = 1/(tos*m.sqrt(4*m.pi*PD*((time/tos)**3)))
    M3a = (1 - time/tos)**2
    M3b = 4*PD*(time/tos)
    M3 = m.exp(-1*(M3a/M3b))
    out = M1 * M2 * M3
    return out
def residuals(p, y, time):
    PD = p
    err = y - malo(timedt)
    return err
p0 = 0.05
p1 = optimize.leastsq(residuals, p0, args=(conc, timedt))
Notice that you're working here with arrays defined in the NumPy module, e.g.:
timedt = np.array([30.,60.,90])
conc= np.array([ 2.7096, 2.258 , 1.3548, 0.9032, 0.9032])
Now, those arrays are not part of standard Python, which is a general-purpose language. The problem is that you're mixing arrays with regular operations from the math module, which is part of standard Python and is only meant to work on scalars.
So, for example:
M2 = 1/(tos*m.sqrt(4*m.pi*PD*((time/tos)**3)))
will work if you use np.sqrt instead, which is designed to work on arrays:
M2 = 1/(tos*np.sqrt(4*m.pi*PD*((time/tos)**3)))
And so on.
NB: SciPy and other modules meant for numeric/scientific programming know about NumPy and are built on top of it, so their functions should all work on arrays. Just don't use math when working with arrays: NumPy comes with its own versions of all those functions (sqrt, cos, exp, ...) that do.
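A hedged sketch of the full fix along these lines, with the math calls swapped for their numpy counterparts. Two more repairs are flagged in comments, since in the original residuals ignores its parameter p (it keeps using the global PD) and conc has five entries while timedt has three:
import numpy as np
from scipy import optimize

Q = 0.02114
M = 7500.0
tos = 26.0 * 60.0  # seconds
timedt = np.array([30., 60., 90.])
conc = np.array([2.7096, 2.258, 1.3548])  # trimmed to timedt's length (assumption)

def malo(time, PD):
    M1 = M / Q
    M2 = 1 / (tos * np.sqrt(4 * np.pi * PD * (time / tos)**3))  # np.sqrt handles arrays
    M3 = np.exp(-((1 - time / tos)**2) / (4 * PD * (time / tos)))
    return M1 * M2 * M3

def residuals(p, y, time):
    return y - malo(time, p[0])  # thread the fitted parameter through to the model

p0 = [0.05]
p1 = optimize.leastsq(residuals, p0, args=(conc, timedt))
print(p1)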