I'm trying to get the 3D Fourier Transform of the gaussian function e^(-r^(2)/2) in python using the numpy.fft library.
I've attempted using different ffts from the library with different inputs, shifting the results with np.fft.fftshift, trying to find a multiplicative factor and many other things, the last thing I tried was using the 1D fft function, and then cubing the result, here's the corresponding source code:
import numpy as np
R = float(10)
N = float(100)
y= np.dtype(np.float64)
dr = R/N
def F(x):
return np.exp(-((x*dr)**2)/2)
Frange = np.arange(1,int(N)+1)
y = np.zeros((int(N)))
i = 0
while i<int(N):
y[i] = F(Frange[i])
i += 1
y = y/3
y_fft = np.fft.fftshift(np.abs(np.fft.fft(y)))**3
print (y_fft)
The first values I get:
4.62e-03, 4.63e-03, 4.65e-03, 4.69e-03, 4.74e-03
According to Lado, Fred. (1971) Numerical Fourier transforms in one, two, and three dimensions for liquid state calculations, the analytic solution to the problem is: (2pi )^(3/2)*e^(-k^(2)/2)
And the first values of the analytic solution with the same values of R and N are:
14.99, 12.92, 10.10, 7.15, 4.58
I also created a DFT program using a formula provided in the previous article which gives the expected results, but I haven't been able to replicate the analytic results in any of my attempts using the NumPy or SciPy fft libraries.
Here's my program for the analytic and DFT results:
import math
import numpy as np
def F(r):
x=math.exp((-1/2)*(r**2))
return x
def FT(r):
x=((2*math.pi)**(3/2))*(math.exp((-1/2)*(r**2)))
return x
R = float(10)
N = int(100)
ft = np.zeros(N)
fta = np.zeros(N)
dr = R/N
dk = math.pi/R
print ("\tk \t\t\t Discrete \t\t\t Analytic")
for j in range (1, N):
kj = j*dk
#Discrete Transform
sum = 0
for i in range(1, N):
ri = i*dr
sum = sum + (dr*ri*(F(ri))*(math.sin(kj*ri)))
ft[j] = ((4*math.pi)/kj)*sum
#Analytic Transform
fta[j] = FT(kj)
#Print results
print(kj, f" \t\t{ft[j]:.10E} \t\t{fta[j]:.10E}")
And these are the first few results:
k Discrete Analytic
0.3141592653589793 1.4991263193E+01 1.4991263193E+01
0.6283185307179586 1.2928362116E+01 1.2928362116E+01
0.9424777960769379 1.0101494686E+01 1.0101494686E+01
1.2566370614359172 7.1509645344E+00 7.1509645344E+00
1.5707963267948966 4.5864901093E+00 4.5864901093E+00
Related
So I am trying to solve the differential equation $\frac{d^2y}{dx^2} = -y(x)$ subject to boundary conditions y(0) = 0 and y(1) = 1 ,the analytic solution is y(x) = sin(x)/sin(1).
I am using three point stencil to approximate the double derivative.
The curves obtained through these ways should match at least at the boundaries ,but my solutions have small differences even at the boundaries.
I am attaching the code, Please tell me what is wrong.
import numpy as np
import scipy.linalg as lg
from scipy.sparse.linalg import eigs
from scipy.sparse.linalg import inv
from scipy import sparse
import matplotlib.pyplot as plt
a = 0
b = 1
N = 1000
h = (b-a)/N
r = np.arange(a,b+h,h)
y_a = 0
y_b = 1
def lap_three(r):
h = r[1]-r[0]
n = len(r)
M_d = -2*np.ones(n)
#M_d = M_d + B_d
O_d = np.ones(n-1)
mat = sparse.diags([M_d,O_d,O_d],offsets=(0,+1,-1))
#print(mat)
return mat
def f(r):
h = r[1]-r[0]
n = len(r)
return -1*np.ones(len(r))*(h**2)
def R_mat(f,r):
r_d = f(r)
R_mat = sparse.diags([r_d],offsets=[0])
#print(R_mat)
return R_mat
#def R_mat(r):
# M_d = -1*np.ones(len(r))
def make_mat(r):
main = lap_three(r) - R_mat(f,r)
return main
main = make_mat(r)
main_mat = main.toarray()
print(main_mat)
'''
eig_val , eig_vec = eigs(main, k = 20,which = 'SM')
#print(eig_val)
Val = eig_vec.T
plt.plot(r,Val[0])
'''
main_inv = inv(main)
inv_mat = main_inv.toarray()
#print(inv_mat)
#print(np.dot(main_mat,inv_mat))
n = len(r)
B_d = np.zeros(n)
B_d[0] = 0
B_d[-1] = 1
#print(B_d)
#from scipy.sparse.linalg import spsolve
A = np.abs(np.dot(inv_mat,B_d))
plt.plot(r[0:10],A[0:10],label='calculated solution')
real = np.sin(r)/np.sin(1)
plt.plot(r[0:10],real[0:10],label='analytic solution')
plt.legend()
#plt.plot(r,real)
#plt.plot(r,A)
'''diff = A-real
plt.plot(r,diff)'''
There is no guarantee of what the last point in arange(a,b+h,h) will be, it will mostly be b, but could in some cases also be b+h. Better use
r,h = np.linspace(a,b,N+1,retstep=True)
The linear system consists of the equations for the middle positions r[1],...,r[N-1]. These are N-1 equations, thus your matrix size is one too large.
You could keep the matrix construction shorter by including the h^2 term already in M_d.
If you use sparse matrices, you can also use the sparse solver A = spsolve(main, B_d).
The equations that make up the system are
A[k-1] + (-2+h^2)*A[k] + A[k+1] = 0
The vector on the right side thus needs to contain the values -A[0] and -A[N]. This should clear up the sign problem, no need to cheat with the absolute value.
The solution vector A corresponds, as constructed from the start, to r[1:-1]. As there are no values for postitions 0 and N inside, there can also be no difference.
PS: There is no relaxation involved here, foremost because this is no iterative method. Perhaps you meant a finite difference method.
I am trying to use SVD and an Eigendecomposition for some data analysis using Dynamic Mode Decomposition. I am running into a simple problem of getting different results from Matlab and Python. I'm confused and don't know why Python is giving me wrong results/matrix values but everything looks (I think IS) correct.
So instead of using real data this time and looking at large data sets, I generated data. I will try to look at an eigenvalue plot after the eigendecomposition. I also use a delay embedding for the data because I will work with a data vector which is only (2x100), so I will perform a type of Hankel matrix to enrich the data with 10 delays.
clear all; close all; clc;
data = linspace(1,100);
data2 = linspace(2,101);
data = [data;data2];
numDelays = 10;
relTol= 10^-6;
%% Create first and second snap shot matrices for DMD. Any columns with missing
% data are not used.
disp('Constructing Data Matricies:')
X = zeros((numDelays+1)*size(data,1),size(data,2)-(numDelays+1));
Y = zeros(size(X));
for i = 1:numDelays+1
X(1 + (i-1)*size(data,1):i*size(data,1),:) = ...
data(:,(i):size(data,2)-(numDelays+1) + (i-1));
Y(1 + (i-1)*size(data,1):i*size(data,1),:) = ...
data(:,(i+1):size(data,2)-(numDelays+1) + (i));
end
[U,S,V] = svd(X);
r = find(diag(S)>S(1,1)*relTol,1,'last');
disp(['DMD subspace dimension:',num2str(r)])
U = U(:,1:r);
S = S(1:r,1:r);
V = V(:,1:r);
Atil = (U'*Y)*V*(S^-1);
[what,lambda] = eig(Atil);
Phi = (Y*V)*(S^-1)*what;
Keigs = diag(lambda);
tt = linspace(0,2*pi,101);
figure;
plot(real(Keigs),imag(Keigs),'ro')
hold on
plot(cos(tt),sin(tt),'--')
import scipy.io as sc
import math as m
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sys
from numpy import dot, multiply, diag, power, pi, exp, sin, cos, cosh, tanh, real, imag
from scipy.linalg import expm, sinm, cosm, fractional_matrix_power, svd, eig, inv
def dmd(X, Y, relTol):
U2,Sig2,Vh2 = svd(X, False) # SVD of input matrix
S = np.zeros((Sig2.shape[0], Sig2.shape[0])) # Create S matrix with zeros based on Diag of S
np.fill_diagonal(S, Sig2) # Fill diagonal of S matrix with the nonzero values
r = np.count_nonzero(np.diag(S) > S[0,0] * relTol) # rank truncation
U = U2[:,:r]
Sig = diag(Sig2)[:r,:r] #GOOD =)
V = Vh2.conj().T[:,:r]
Atil = dot(dot(dot(U.conj().T, Y), V), inv(Sig)) # build A tilde
print(Atil)
mu,W = eig(Atil)
Phi = dot(dot(dot(Y, V), inv(Sig)), W) # build DMD modes
return mu, Phi
data = np.array([(np.linspace(1,100,100)),(np.linspace(2,101,100))])
Data = np.array(data)
######### Choose number of Delays ###########
# observable (coordinates of feature points). Setting to zero means only
# experimental observables will be used.
numDelays = 10
relTol = 10**-6
########## Create Data Matrices for DMD ###############
# Create first and second snap shot matrices for DMD. Any columns with missing
# data are not used.
X = np.zeros(((numDelays + 1) * data.shape[0], data.shape[1] - (numDelays + 1)))
Y = np.zeros(X.shape)
for i in range(1, numDelays + 2):
X[0 + (i - 1) * Data.shape[0]:i * Data.shape[0], :] = Data[:, (i):Data.shape[1] - (numDelays + 1) + (i - 0)]
Y[0 + (i - 1) * Data.shape[0]:i * Data.shape[0], :] = Data[:, (i + 0):Data.shape[1] - (numDelays + 1) + (i)]
Keigs, Phi = dmd(X, Y, relTol)
tt = np.linspace(0,2*np.pi,101)
plt.figure()
plt.plot(np.cos(tt),np.sin(tt),'--')
plt.plot(Keigs.real,Keigs.imag,'ro')
plt.title('DMD Eigenvalues')
plt.xlabel(r'Real $\ lambda$')
plt.ylabel(r'Imaginary $\ lambda$')
# plt.axes().set_aspect('equal')
plt.show()
So in matlab and python, I get my eigenvalues to all sit on the unit circle (as expect) and I get precisely one, sitting at 1.
So the problem comes when I look at the matrices from SVD, they appear to have different values. The only matrix that is the same is the 'S or Sig' matrix. The rest will differ a number or +/- sign. The biggest thing that peaked my interest is the Atil matrix.
In matlab, it looks like,
[1.0157, -0.3116; 7.91229e-4, 0.9843]
And python it looks like,
[1.0, -4.508e-15; -4.439e-18, 1.0]
Now this may look slightly off due to numerical error possibly but when I look at real data and these differ, it messes up my analysis.
SVD of a non-square matrix is not unique in U and V. Even if you have a square matrix with non-zero, non-degenerate singular values, singular vectors in U and V are only unique up to a sign factor.
https://math.stackexchange.com/questions/644327/how-unique-on-non-unique-are-u-and-v-in-singular-value-decomposition-svd
Moreover, Matlab (LAPACK + BLAS) and scipy.linalg.svd may use different algorithms for SVD.
This can lead to the differences you have experienced.
lately i am been working fitting a fourier series function to a periodic signal for retrieve the amplitude and the phase of each component via least squares, so i modified the code of this file for it:
import math
import numpy as np
#period of the signal
per=1.0
w = 2.0*np.pi/per
#number of fourier components.
nf = 5
fp = open("file.cat","r")
# m1 is the number of unknown coefficients.
m1 = 2*nf + 1
# Create empty matrices.
x = np.zeros((m1,m1))
y = np.zeros((m1,1))
xi = [0.0]*m1
# Read (time, value) from each line of the file.
for line in fp:
t = float(line.split()[0])
yi = float(line.split()[1])
xi[0] = 1.0
for k in range(1,nf+1):
xi[2*k-1] = np.sin(k*w*t)
xi[2*k] = np.cos(k*w*t)
for j in range(m1):
for k in range(m1):
x[j,k] += xi[j]*xi[k]
y[j] += yi*xi[j]
fp.close()
# Copy to big matrices.
X = np.mat( x.copy() )
Y = np.mat( y.copy() )
# Invert X and multiply by Y to get coefficients.
A = X.I*Y
A0 = A[0]
# Solution is A0 + Sum[ Amp*sin(k*wt + phi) ]
print "a[0] = %f" % A[0]
for k in range(1,nf+1):
amp = math.sqrt(A[2*k-1]**2 + A[2*k]**2)
phs = math.atan2(A[2*k],A[2*k-1])
print "amp[%d] = %f phi = %f" % (k, amp, phs)
but the plot show this (without the points, of course):
and it should show something like this:
somebody can tell me how can i compute the phase and the amplitude in another simpler way? a guide maybe, i will be very grateful.
cheers!
PD. I will attach the FILE that i used, just because :)
EDITED
The error was with a index :(
First, I defined the vector with the values:
amp = np.array([np.sqrt((A[2*k-1])**2 + (A[2*k])**2) for k in range(1,nf+1)])
phs = np.array([math.atan2(A[2*k],A[2*k-1]) for k in range(1,nf+1)])
and then, to build the signal, I defined:
def term(t): return np.array([amp[k]*np.sin(k*w*t + phs[k]) for k in range(len(amp))])
Signal = np.array([A0+sum(term(phase[i])) for i in range(len(mag))])
but within the np.sin(), k should be k+1, because the index start in 0 ·__·
def term(t): return np.array([amp[k]*np.sin((k+1)*w*t + phs[k]) for k in range(len(amp))])
plt.plot(phase,Signal,'r-',lw=3)
and that is all.
Thanks Marco Tompitak for the help!!
You're specifying the wrong period for the signal:
#period of the signal
per=0.178556
This gives you the resulting Fourier fit, indeed with a maximum period of ~0.17. The problem is that this number specifies the longest period that is present in your Fourier series. The function only has components with perior 0.17 or shorter. Apparently you are expecting a fit with period ~1, so it can never approximate that properly. You should specify per=1.0. There's nothing wrong with the algorithm; a quick writeup of a similar algorithm in Mathematica gives the same output and plausible results:
I have used the finite element method to approximate the laplace equation and thus have turned it into a matrix system AU = F where A is the stiffness vector and solved for U (not massively important for my question).
I have now got my approximation U, which when i find AU i should get the vector F (or at least similar) where F is:
AU gives the following plot for x = 0 to x = 1 (say, for 20 nodes):
I then need to interpolate U to a longer vector and find AU (for a bigger A too, but not interpolating that). I interpolate U by the following:
U_inter = interp1d(x,U)
U_rich = U_inter(longer_x)
which seems to work okay until i multiply it with the longer A matrix:
It seems each spike is at a node of x (i.e. the nodes of the original U). Does anybody know what could be causing this? The following is my code to find A, U and F.
import numpy as np
import math
import scipy
from scipy.sparse import diags
import scipy.sparse.linalg
from scipy.interpolate import interp1d
import matplotlib
import matplotlib.pyplot as plt
def Poisson_Stiffness(x0):
"""Finds the Poisson equation stiffness matrix with any non uniform mesh x0"""
x0 = np.array(x0)
N = len(x0) - 1 # The amount of elements; x0, x1, ..., xN
h = x0[1:] - x0[:-1]
a = np.zeros(N+1)
a[0] = 1 #BOUNDARY CONDITIONS
a[1:-1] = 1/h[1:] + 1/h[:-1]
a[-1] = 1/h[-1]
a[N] = 1 #BOUNDARY CONDITIONS
b = -1/h
b[0] = 0 #BOUNDARY CONDITIONS
c = -1/h
c[N-1] = 0 #BOUNDARY CONDITIONS: DIRICHLET
data = [a.tolist(), b.tolist(), c.tolist()]
Positions = [0, 1, -1]
Stiffness_Matrix = diags(data, Positions, (N+1,N+1))
return Stiffness_Matrix
def NodalQuadrature(x0):
"""Finds the Nodal Quadrature Approximation of sin(pi x)"""
x0 = np.array(x0)
h = x0[1:] - x0[:-1]
N = len(x0) - 1
approx = np.zeros(len(x0))
approx[0] = 0 #BOUNDARY CONDITIONS
for i in range(1,N):
approx[i] = math.sin(math.pi*x0[i])
approx[i] = (approx[i]*h[i-1] + approx[i]*h[i])/2
approx[N] = 0 #BOUNDARY CONDITIONS
return approx
def Solver(x0):
Stiff_Matrix = Poisson_Stiffness(x0)
NodalApproximation = NodalQuadrature(x0)
NodalApproximation[0] = 0
U = scipy.sparse.linalg.spsolve(Stiff_Matrix, NodalApproximation)
return U
x = np.linspace(0,1,10)
rich_x = np.linspace(0,1,50)
U = Solver(x)
A_rich = Poisson_Stiffness(rich_x)
U_inter = interp1d(x,U)
U_rich = U_inter(rich_x)
AUrich = A_rich.dot(U_rich)
plt.plot(rich_x,AUrich)
plt.show()
comment 1:
I added a Stiffness_Matrix = Stiffness_Matrix.tocsr() statement to avoid an efficiency warning. FE calculations are complex enough that I'll have to print out some intermediate values before I can identify what is going on.
comment 2:
plt.plot(rich_x,A_rich.dot(Solver(rich_x))) plots nice. The noise you get is the result of the difference between the inperpolated U_rich and the true solution: U_rich-Solver(rich_x).
comment 3:
I don't think there's a problem with your code. The problem is with idea that you can test an interpolation this way. I'm rusty on FE theory, but I think you need to use the shape functions to interpolate, not a simple linear one.
comment 4:
Intuitively, with A_rich.dot(U_rich) you are asking, what kind of forcing F would produce U_rich. Compared to Solver(rich_x), U_rich has flat spots, regions where it's value is less than the true solution. What F would produce that? One that is spiky, with NodalQuadrature(x) at the x points, but near zero values in between. That's what your plot is showing.
A higher order interpolation will eliminate the flat spots, and produce a smoother back calculated F. But you really need to revisit the FE theory.
You might find it instructive to look at
plt.plot(x,NodalQuadrature(x))
plt.plot(rich_x, NodalQuadrature(rich_x))
The second plot is much smoother, but only about 1/5 as high.
Better yet look at:
plt.plot(rich_x,AUrich,'-*') # the spikes
plt.plot(x,NodalQuadrature(x),'o') # original forcing
plt.plot(rich_x, NodalQuadrature(rich_x),'+') # new forcing
In the model the forcing isn't continuous, it is a value at each node. With more nodes (rich_x) the magnitude at each node is less.
Short summary: How do I quickly calculate the finite convolution of two arrays?
Problem description
I am trying to obtain the finite convolution of two functions f(x), g(x) defined by
To achieve this, I have taken discrete samples of the functions and turned them into arrays of length steps:
xarray = [x * i / steps for i in range(steps)]
farray = [f(x) for x in xarray]
garray = [g(x) for x in xarray]
I then tried to calculate the convolution using the scipy.signal.convolve function. This function gives the same results as the algorithm conv suggested here. However, the results differ considerably from analytical solutions. Modifying the algorithm conv to use the trapezoidal rule gives the desired results.
To illustrate this, I let
f(x) = exp(-x)
g(x) = 2 * exp(-2 * x)
the results are:
Here Riemann represents a simple Riemann sum, trapezoidal is a modified version of the Riemann algorithm to use the trapezoidal rule, scipy.signal.convolve is the scipy function and analytical is the analytical convolution.
Now let g(x) = x^2 * exp(-x) and the results become:
Here 'ratio' is the ratio of the values obtained from scipy to the analytical values. The above demonstrates that the problem cannot be solved by renormalising the integral.
The question
Is it possible to use the speed of scipy but retain the better results of a trapezoidal rule or do I have to write a C extension to achieve the desired results?
An example
Just copy and paste the code below to see the problem I am encountering. The two results can be brought to closer agreement by increasing the steps variable. I believe that the problem is due to artefacts from right hand Riemann sums because the integral is overestimated when it is increasing and approaches the analytical solution again as it is decreasing.
EDIT: I have now included the original algorithm 2 as a comparison which gives the same results as the scipy.signal.convolve function.
import numpy as np
import scipy.signal as signal
import matplotlib.pyplot as plt
import math
def convolveoriginal(x, y):
'''
The original algorithm from http://www.physics.rutgers.edu/~masud/computing/WPark_recipes_in_python.html.
'''
P, Q, N = len(x), len(y), len(x) + len(y) - 1
z = []
for k in range(N):
t, lower, upper = 0, max(0, k - (Q - 1)), min(P - 1, k)
for i in range(lower, upper + 1):
t = t + x[i] * y[k - i]
z.append(t)
return np.array(z) #Modified to include conversion to numpy array
def convolve(y1, y2, dx = None):
'''
Compute the finite convolution of two signals of equal length.
#param y1: First signal.
#param y2: Second signal.
#param dx: [optional] Integration step width.
#note: Based on the algorithm at http://www.physics.rutgers.edu/~masud/computing/WPark_recipes_in_python.html.
'''
P = len(y1) #Determine the length of the signal
z = [] #Create a list of convolution values
for k in range(P):
t = 0
lower = max(0, k - (P - 1))
upper = min(P - 1, k)
for i in range(lower, upper):
t += (y1[i] * y2[k - i] + y1[i + 1] * y2[k - (i + 1)]) / 2
z.append(t)
z = np.array(z) #Convert to a numpy array
if dx != None: #Is a step width specified?
z *= dx
return z
steps = 50 #Number of integration steps
maxtime = 5 #Maximum time
dt = float(maxtime) / steps #Obtain the width of a time step
time = [dt * i for i in range (steps)] #Create an array of times
exp1 = [math.exp(-t) for t in time] #Create an array of function values
exp2 = [2 * math.exp(-2 * t) for t in time]
#Calculate the analytical expression
analytical = [2 * math.exp(-2 * t) * (-1 + math.exp(t)) for t in time]
#Calculate the trapezoidal convolution
trapezoidal = convolve(exp1, exp2, dt)
#Calculate the scipy convolution
sci = signal.convolve(exp1, exp2, mode = 'full')
#Slice the first half to obtain the causal convolution and multiply by dt
#to account for the step width
sci = sci[0:steps] * dt
#Calculate the convolution using the original Riemann sum algorithm
riemann = convolveoriginal(exp1, exp2)
riemann = riemann[0:steps] * dt
#Plot
plt.plot(time, analytical, label = 'analytical')
plt.plot(time, trapezoidal, 'o', label = 'trapezoidal')
plt.plot(time, riemann, 'o', label = 'Riemann')
plt.plot(time, sci, '.', label = 'scipy.signal.convolve')
plt.legend()
plt.show()
Thank you for your time!
or, for those who prefer numpy to C. It will be slower than the C implementation, but it's just a few lines.
>>> t = np.linspace(0, maxtime-dt, 50)
>>> fx = np.exp(-np.array(t))
>>> gx = 2*np.exp(-2*np.array(t))
>>> analytical = 2 * np.exp(-2 * t) * (-1 + np.exp(t))
this looks like trapezoidal in this case (but I didn't check the math)
>>> s2a = signal.convolve(fx[1:], gx, 'full')*dt
>>> s2b = signal.convolve(fx, gx[1:], 'full')*dt
>>> s = (s2a+s2b)/2
>>> s[:10]
array([ 0.17235682, 0.29706872, 0.38433313, 0.44235042, 0.47770012,
0.49564748, 0.50039326, 0.49527721, 0.48294359, 0.46547582])
>>> analytical[:10]
array([ 0. , 0.17221333, 0.29682141, 0.38401317, 0.44198216,
0.47730244, 0.49523485, 0.49997668, 0.49486489, 0.48254154])
largest absolute error:
>>> np.max(np.abs(s[:len(analytical)-1] - analytical[1:]))
0.00041657780840698155
>>> np.argmax(np.abs(s[:len(analytical)-1] - analytical[1:]))
6
Short answer: Write it in C!
Long answer
Using the cookbook about numpy arrays I rewrote the trapezoidal convolution method in C. In order to use the C code one requires three files (https://gist.github.com/1626919)
The C code (performancemodule.c).
The setup file to build the code and make it callable from python (performancemodulesetup.py).
The python file that makes use of the C extension (performancetest.py)
The code should run upon downloading by doing the following
Adjust the include path in performancemodule.c.
Run the following
python performancemodulesetup.py build
python performancetest.py
You may have to copy the library file performancemodule.so or performancemodule.dll into the same directory as performancetest.py.
Results and performance
The results agree neatly with one another as shown below:
The performance of the C method is even better than scipy's convolve method. Running 10k convolutions with array length 50 requires
convolve (seconds, microseconds) 81 349969
scipy.signal.convolve (seconds, microseconds) 1 962599
convolve in C (seconds, microseconds) 0 87024
Thus, the C implementation is about 1000 times faster than the python implementation and a bit more than 20 times as fast as the scipy implementation (admittedly, the scipy implementation is more versatile).
EDIT: This does not solve the original question exactly but is sufficient for my purposes.