Plotting a mixture distribution in sympy.stats - python

I'd like to create a mixture of two Gamma distributions and plot the result, evaluated over a given range.
It would appear that sympy.stats is capable of this, because it is able to compute the expectation of the mixture and sample from it. I'm quite new to sympy, so I'm not sure whether there is a better way to evaluate and plot in this situation than the one I've been using.
%matplotlib inline
from matplotlib import pyplot as plt
from sympy.stats import Gamma, E, density
import numpy as np
G1 = Gamma("G1", 5, 2.5)
G2 = Gamma("G2", 4, 1.5)
f1 = 0.7; f2 = 1-f1
G3 = f1*G1 + f2*G2
Expectation gives me a single sensible number for all 3
In [19]: E(G1)
Out[19]: 12.5000000000000
In [20]: E(G2)
Out[20]: 6.00000000000000
In [21]: E(G3)
Out[21]: 10.5500000000000
...but plotting fails on the mixture
u = np.linspace(0, 50)
D1 = density(G1); D2 = density(G2); D3 = density(G3)
v1 = [D1.args[1].subs(D1.args[0][0], i).evalf() for i in u]
v2 = [D2.args[1].subs(D2.args[0][0], i).evalf() for i in u]
v3 = [D3.args[1].subs(D3.args[0][0], i).evalf() for i in u]
plt.plot(u, v1)
plt.plot(u, v2)
plt.plot(u, v3) # this one fails with error 'can't convert expression to float'
The problem would appear to be that the mixture terms still contain free symbols
In [44]: v1[0].free_symbols
Out[44]: set()
In [45]: v3[0].free_symbols
Out[45]: {x}
...as I said, sympy.stats appears to be dealing with this ok somehow in computing the expectation, I assume. So I think I need to apply that machinery here in evaluating and plotting the mixture distribution (?)

It looks like this was fixed. I can reproduce your error in SymPy 0.7.3 but it works just fine in 0.7.4.1, the latest version.
First off, you don't need the finagling with the .args. The expressions returned by density are callable. Just call D1(i).evalf() to get the numerical value of D1 at i, like
D1 = density(G1); D2 = density(G2); D3 = density(G3)
v1 = [D1(i).evalf() for i in u]
v2 = [D2(i).evalf() for i in u]
v3 = [D3(i).evalf() for i in u]
I've uploaded a working version to http://nbviewer.ipython.org/gist/asmeurer/8486176.
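One thing worth checking in the question: 0.7*G1 + 0.3*G2 is a weighted sum of two random variables, which is not the same as a mixture in the usual sense (a density equal to the weighted sum of the component densities). If a mixture density is what is wanted, a minimal sketch, assuming a recent SymPy with NumPy available, is to build the density expression by hand and lambdify it. The names mix_pdf and mix_fn are made up for this illustration.

```python
import numpy as np
from sympy import Symbol, lambdify
from sympy.stats import Gamma, density

x = Symbol("x", positive=True)
G1 = Gamma("G1", 5, 2.5)
G2 = Gamma("G2", 4, 1.5)
f1 = 0.7
f2 = 1 - f1

# Mixture density: weighted sum of the two component densities
# (hypothetical helper names, not part of sympy.stats).
mix_pdf = f1 * density(G1)(x) + f2 * density(G2)(x)

# Turn the symbolic expression into a fast NumPy function.
mix_fn = lambdify(x, mix_pdf, "numpy")

u = np.linspace(0, 50, 400)
v = mix_fn(u)  # array of density values, ready for plt.plot(u, v)
```

Evaluating through lambdify avoids the per-point subs/evalf loop, which matters once the grid gets dense.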

Related

SymPy lambdify gives wrong result, while *.subs gives the accurate one

Sorry for bothering you with this. I have a serious issue and I'm on the clock to solve it, so here is my question.
I have an issue where I lambdify a quantity, but the result differs from the ".subs" result. Sometimes it's way off, or it's a NaN where in reality there is a real number (found by subs).
Here is a small MWE where you can see the issue. Thanks in advance for your time.
import sympy as sy
import numpy as np
##STACK
#some quantities needed before you see the problem
r = sy.Symbol('r', real=True)
th = sy.Symbol('th', real=True)
e_c = 1e51
lf0 = 100
A = 1.6726e-24
#here are some quantities I define to get to the problem
lfac = lf0+2
rd = 4*3.14/4/sy.pi/A/lfac**2
xi = r/rd #rescaled r
#now to the problem:
#QUANTITY
lfxi = xi**(-3)*(lfac+1)/2*(sy.sqrt( 1 + 4*lfac/(lfac+1)*xi**(3) + (2*xi**(3)/(lfac+1))**2) -1)
#RESULT WITH SUBS
print(lfxi.subs({th:1.00,r:1.00}).evalf())
#RESULT WITH LAMBDIFY
lfxi_l = sy.lambdify((r,th),lfxi)
lfxi_l(0.01,1.00)
##gives 0
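The issue is that your mpmath precision needs to be set higher!
By default mpmath uses prec=53 and dps=15, but your expression requires much higher precision than this: it subtracts 1 from the square root of a number that exceeds 1 by only a tiny amount (note the 1e-61 and 1e-125 coefficients in the printed expression).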
# print(lfxi)
3.0256512324559e+62*(sqrt(1.09235114769539e-125*pi**6*r**6 + 6.74235013645028e-61*pi**3*r**3 + 1) - 1)/(pi**3*r**3)
...
from mpmath import mp
lfxi_l = sy.lambdify((r,th),lfxi, modules=["mpmath"])
mp.dps = 125
print(lfxi_l(1.00,1.00))
# 101.999... result
Changing a couple of the constants to "modest" values:
In [89]: e_c=1; A=1
The different methods produce essentially the same thing:
In [91]: lfxi.subs({th:1.00,r:1.00}).evalf()
Out[91]: 1.00000000461176
In [92]: lfxi_l = sy.lambdify((r,th),lfxi)
In [93]: lfxi_l(1.0,1.00)
Out[93]: 1.000000004611762
In [94]: lfxi_m = sy.lambdify((r,th),lfxi, modules=["mpmath"])
In [95]: lfxi_m(1.0,1.00)
Out[95]: mpf('1.0000000046117619')
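The zero returned by the plain lambdify call looks like a case of catastrophic cancellation: in double precision 1 + y is exactly 1 when y is far below 1e-16, so sqrt(1 + y) - 1 collapses to 0. As an alternative to raising the mpmath precision, the subtraction can be rewritten algebraically. This is only a sketch of the idea on the inner term, with made-up function names, not a drop-in replacement for the full expression.

```python
import math

def naive_form(y):
    """sqrt(1 + y) - 1 evaluated directly; loses all digits for tiny y."""
    return math.sqrt(1.0 + y) - 1.0

def stable_form(y):
    """Same quantity rewritten as y / (sqrt(1 + y) + 1); no cancellation."""
    return y / (math.sqrt(1.0 + y) + 1.0)

y = 1e-20
print(naive_form(y))   # cancellation wipes out the result
print(stable_form(y))  # close to y / 2
```

The two forms are mathematically identical (multiply and divide by the conjugate), so the rewritten one can be used in ordinary floating point without touching mp.dps.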

Numpy.dot dot product function for statsmodels

I am learning the statsmodels.api module in order to use Python for regression analysis, so I started with the simple OLS model.
In econometrics, the model is written as y = Xb + e,
where X is NxK, b is Kx1, and e is Nx1, so y is Nx1. This is perfectly fine from a linear algebra point of view.
But I followed the tutorial from statsmodels as follows:
import numpy as np
import statsmodels.api as sm
nsample = 100 # total obs is 100
x = np.linspace(0, 10, 100) # using np.linspace(start, stop, number)
X = np.column_stack((x, x**2))
beta = np.array([1, 0.1, 10])
# draw numbers from a normal distribution;
# defaults are mu = 0 and std.dev = 1, size is set by the user
e = np.random.normal(size = nsample)
# e has shape (n,)
# Now, we add the constant/intercept term to X
X = sm.add_constant(X)
# Now, we compute the y
y = np.dot(X, beta) + e
So this generates the correct answer. But I have a question about the generation of beta = np.array([1,0.1,10]). This beta, if we use:
beta.shape
(3,)
It has a dimension of (3,), the same goes with y and e except X:
X.shape
(100,3)
e.shape
(100,)
y.shape
(100,)
So I tried initializing the array in the following three ways:
o = array([1,2,3])
o1 = array([[1],[2],[3]])
o2 = array([[1,2,3]])
print(o.shape)
print(o1.shape)
print(o2.shape)
----------------
(3,)
(3, 1)
(1, 3)
If I use beta = array([[1],[2],[3]]), which has shape (3,1), np.dot(X, beta) gets me a wrong answer, although the dimensions seem to work.
If I use array([[1,2,3]]), which is a row vector, the dimensions don't match for the dot product, either in numpy or in linear algebra.
So I am wondering why, for an NxK by Kx1 dot product in numpy, we have to use (N,K) dot (K,) instead of (N,K) dot (K,1). What makes only np.array([1, 0.1, 10]) work with numpy.dot() while np.array([[1], [0.1], [10]]) doesn't?
Thank you very much.
Some update
Sorry about the confusion. The data in the statsmodels example are randomly generated, so I fixed the X and tried the following input:
f = array([[1,2,3],[4,5,6],[7,8,9],[10,11,12],[13,14,15]])
o = array([1,2,3])
o1 = array([[1],[2],[3]])
o2 = array([[1,2,3]])
print(o.shape)
print(o1.shape)
print(o2.shape)
print("---------")
print(np.dot(f,o))
print(np.dot(f,o1))
r1 = np.dot(f,o)
r2 = np.dot(f,o1)
type1 = type(np.dot(f,o))
type2 = type(np.dot(f,o1))
tf = type1 is type2
tf2 = type1 == type2
print(type1)
print(type2)
print(tf)
print(tf2)
-------------------------
(3,)
(3, 1)
(1, 3)
---------
[14 32 50 68 86]
[[14]
[32]
[50]
[68]
[86]]
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
True
True
Sorry again for the confusion and inconvenience, they worked fine.
Python/NumPy is not a matrix-based language the way Matlab, Octave, or Scilab are. Those follow the rules of matrix multiplication strictly and treat every vector as a 2-D row or column, so
np.dot(f,o1) ---------> f*o1 in Matlab/Octave/Scilab, with o1 a 3x1 column
np.dot(f,o) ---------> has no direct counterpart in Matlab/Octave/Scilab, because o is 1-D
NumPy also has 'broadcasting', the set of rules for how arrays of different shapes combine in element-wise operations. np.dot does not broadcast here: with a 2-D second argument it is an ordinary matrix product, and with a 1-D second argument it sums over the last axis of f and returns a 1-D result. You will have to consult the docs for the details.
In python/numpy the * is not a matrix operator. You can find out what the broadcasting gives for
print(f*o)
print(f*o1)
print(f*o2)
Rather recently python/numpy has introduced the matrix operator @. You might find out what happens with
print(f@o)
print(f@o1)
print(f@o2)
Does this give you some idea?
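To make the shape rules concrete, here is a small sketch using the same f, o, and o1 as in the question's update. One plausible cause of the "wrong answer" reported with a (3,1) beta is the later addition of the (N,) error term: a (N,1) product plus a (N,) vector broadcasts to (N,N). The variable names y_1d, y_2d, and e are chosen for this illustration.

```python
import numpy as np

f = np.arange(1, 16).reshape(5, 3)   # same values as f in the question
o = np.array([1, 2, 3])              # shape (3,)
o1 = o.reshape(3, 1)                 # shape (3, 1)
e = np.zeros(5)                      # stand-in for the (N,) error term

y_1d = f @ o     # (5, 3) @ (3,)   -> shape (5,)
y_2d = f @ o1    # (5, 3) @ (3, 1) -> shape (5, 1)

print(y_1d.shape, y_2d.shape)
print((y_1d + e).shape)   # stays (5,)
print((y_2d + e).shape)   # broadcasts to (5, 5)

# Keeping everything 2-D avoids the surprise:
print((y_2d + e.reshape(-1, 1)).shape)
```

So both forms of the product are valid; what matters is that the error term has a matching shape before it is added.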

simple t-test in python with CIs of difference

What is the most straightforward way to perform a t-test in python and to include CIs of the difference? I've seen various posts but everything is different and when I tried to calculate the CIs myself it seemed slightly wrong... Here:
import numpy as np
from scipy import stats
g1 = np.array([48.7107107,
36.8587287,
67.7129929,
39.5538852,
35.8622661])
g2 = np.array([62.4993857,
49.7434833,
67.7516511,
54.3585559,
71.0933957])
m1, m2 = np.mean(g1), np.mean(g2)
dof = (len(g1)-1) + (len(g2)-1)
MSE = (np.var(g1) + np.var(g2)) / 2
stderr_diffs = np.sqrt((2 * MSE)/len(g1))
tcl = stats.t.ppf([.975], dof)
lower_limit = (m1-m2) - (tcl) * (stderr_diffs)
upper_limit = (m1-m2) + (tcl) * (stderr_diffs)
print(lower_limit, upper_limit)
returns:
[-30.12845447] [-0.57070077]
However, when I run the same test in SPSS, although I have the same t and p values, the CIs are -31.87286, 1.17371, and this is also the case in R. I can't seem to find the correct way to do this and would appreciate some help.
You're subtracting 1 when you compute dof, but when you compute the variance you're not using the sample variance:
MSE = (np.var(g1) + np.var(g2)) / 2
should be
MSE = (np.var(g1, ddof=1) + np.var(g2, ddof=1)) / 2
which gives me
[-31.87286426] [ 1.17370902]
That said, instead of doing the manual implementation, I'd probably use statsmodels' CompareMeans:
In [105]: import statsmodels.stats.api as sms
In [106]: r = sms.CompareMeans(sms.DescrStatsW(g1), sms.DescrStatsW(g2))
In [107]: r.tconfint_diff()
Out[107]: (-31.872864255548553, 1.1737090155485568)
(really we should be using a DataFrame here, not an ndarray, but I'm lazy).
Remember though that you're going to want to consider what assumption you want to make about the variance:
In [110]: r.tconfint_diff(usevar='pooled')
Out[110]: (-31.872864255548553, 1.1737090155485568)
In [111]: r.tconfint_diff(usevar='unequal')
Out[111]: (-32.28794665832114, 1.5887914183211436)
and if your g1 and g2 are representative, the assumption of equal variance might not be a good one.
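For reference, the manual calculation can be wrapped in a small function that covers both variance assumptions. This is a sketch with a made-up name (mean_diff_ci); the unequal-variance branch uses the Welch-Satterthwaite degrees of freedom, which I assume is what usevar='unequal' corresponds to.

```python
import numpy as np
from scipy import stats

def mean_diff_ci(a, b, level=0.95, equal_var=True):
    """Confidence interval for mean(a) - mean(b) (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    v1, v2 = np.var(a, ddof=1), np.var(b, ddof=1)  # sample variances
    diff = a.mean() - b.mean()
    if equal_var:
        dof = n1 + n2 - 2
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / dof
        se = np.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    else:
        se = np.sqrt(v1 / n1 + v2 / n2)
        # Welch-Satterthwaite approximation to the degrees of freedom
        dof = (v1 / n1 + v2 / n2) ** 2 / (
            (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
        )
    tcrit = stats.t.ppf(0.5 + level / 2.0, dof)
    return diff - tcrit * se, diff + tcrit * se

g1 = np.array([48.7107107, 36.8587287, 67.7129929, 39.5538852, 35.8622661])
g2 = np.array([62.4993857, 49.7434833, 67.7516511, 54.3585559, 71.0933957])
print(mean_diff_ci(g1, g2))                   # pooled variance
print(mean_diff_ci(g1, g2, equal_var=False))  # unequal variances
```

With equal_var=True this should reproduce the pooled interval quoted above; the unequal-variance interval comes out wider because the degrees of freedom shrink.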

Obtaining Legendre polynomial form once Legendre coefficients are determined

I have obtained the coefficients for the Legendre polynomial that best fits my data. Now I need to determine the value of that polynomial at each time-step of my data, so that I can subtract the fit from my data. I have looked at the documentation for the Legendre module, and I'm not sure if I just don't understand my options or if there isn't a native tool in place for what I want. If my data-points were evenly spaced, linspace would be a good option, but that's not the case here. Does anyone have a suggestion for what to try?
For those who would like to demand a minimum working example of code, just use a random array, get the coefficients, and tell me from there how you would proceed. The values themselves don't matter. It's the technique that I'm asking about here. Thanks.
To simplify Ahmed's example
In [1]: from numpy.polynomial import Polynomial, Legendre
In [2]: p = Polynomial([0.5, 0.3, 0.1])
In [3]: x = np.random.rand(10) * 10
In [4]: y = p(x)
In [5]: pfit = Legendre.fit(x, y, 2)
In [6]: plot(*pfit.linspace())
Out[6]: [<matplotlib.lines.Line2D at 0x7f815364f310>]
In [7]: plot(x, y, 'o')
Out[7]: [<matplotlib.lines.Line2D at 0x7f81535d8bd0>]
The Legendre functions are scaled and offset, as the data should be confined to the interval [-1, 1] to get any advantage over the usual power basis. If you want the coefficients for plain old Legendre functions
In [8]: pfit.convert()
Out[8]: Legendre([ 0.53333333, 0.3 , 0.06666667], [-1., 1.], [-1., 1.])
But that isn't recommended.
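Since the fitted object is callable, the original question (evaluate the fit at unevenly spaced time-steps and subtract it) comes down to calling it on the sample points. A minimal sketch, with made-up data and the names t and residual chosen for illustration:

```python
import numpy as np
from numpy.polynomial import Legendre

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 10, 25))   # unevenly spaced sample times
y = 0.5 + 0.3 * t + 0.1 * t**2        # data that is exactly quadratic

pfit = Legendre.fit(t, y, 2)          # degree-2 Legendre series fit

fit_at_t = pfit(t)                    # evaluate at the original time-steps
residual = y - fit_at_t               # data with the fit subtracted
```

Calling pfit directly also takes care of the domain scaling mentioned above, so there is no need to convert the coefficients first.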
Once you have a function, you can just generate a numpy array for the timepoints:
>>> import numpy as np
>>> timepoints = [1,3,7,15,16,17,19]
>>> myarray = np.array(timepoints)
>>> def mypolynomial(bins, pfinal): #pfinal is just the estimate of the final array (i'll do quadratic)
...     a,b,c = pfinal # obviously, for a*x^2 + b*x + c
...     return (a*bins**2) + b*bins + c
>>> mypolynomial(myarray, (1,1,0))
array([ 2, 12, 56, 240, 272, 306, 380])
The function is automatically evaluated for each timepoint in the numpy array.
Now all you have to do is rewrite mypolynomial to go from a simple quadratic example to a proper one for a Legendre polynomial. Treat the function as if it were evaluating a float to return the value, and when called on the numpy array it will automatically evaluate it for each value.
EDIT:
Let's say I wanted to generalize this to all standard polynomials:
>>> import numpy as np
>>> timepoints = [1,3,7,15,16,17,19]
>>> myarray = np.array(timepoints)
>>> def mypolynomial(bins, pfinal): #pfinal holds the coefficients, highest exponent first
...     hist = 0 # running total; becomes an array on the first addition
...     for i in range(len(pfinal)):
...         # fixed a typo here, was pfinal[-i] which would give -0 rather than -1, since negative indexing starts at -1, not -0
...         const = pfinal[-i-1] # negative index to go from 0 exponent to highest exponent
...         hist = hist + const*(bins**i)
...     return hist
>>> mypolynomial(myarray, (1,1,0))
array([ 2, 12, 56, 240, 272, 306, 380])
EDIT2: Typo fix
EDIT3:
@Ahmed is perfectly right when he states that Horner's rule is good for numerical stability. The implementation here would be as follows:
>>> def horner(coeffs, x):
...     acc = 0
...     for c in coeffs:
...         acc = acc * x + c
...     return acc
>>> horner((1,1,0), myarray)
array([ 2, 12, 56, 240, 272, 306, 380])
Slightly modified to keep the same argument order as before, from the code here:
http://rosettacode.org/wiki/Horner%27s_rule_for_polynomial_evaluation#Python
When you're using a nice library to fit polynomials, the library will in my experience usually have a function to evaluate them. So I think it is useful to know how you're generating these coefficients.
In the example below, I used two functions in numpy, legfit and legval which made it trivial to both fit and evaluate the Legendre polynomials without any need to invoke Horner's rule or do the bookkeeping yourself. (Though I do use Horner's rule to generate some example data.)
Here's a complete example where I generate some sparse data from a known polynomial, fit a Legendre polynomial to it, evaluate that polynomial on a dense grid, and plot. Note that the fitting and evaluating part takes three lines thanks to the numpy library doing all the heavy lifting.
It produces the following figure:
import numpy as np
### Setup code
def horner(coeffs, x):
    """Evaluate a polynomial at a point or array"""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc
x = np.random.rand(10) * 10
true_coefs = [0.1, 0.3, 0.5]
y = horner(true_coefs, x)
### Fit and evaluate
legendre_coefs = np.polynomial.legendre.legfit(x, y, 2)
new_x = np.linspace(0, 10)
new_y = np.polynomial.legendre.legval(new_x, legendre_coefs)
### Plotting only
try:
    import pylab
    pylab.ion() # turn on interactive plotting
    pylab.figure()
    pylab.plot(x, y, 'o', new_x, new_y, '-')
    pylab.xlabel('x')
    pylab.ylabel('y')
    pylab.title('Fitting Legendre polynomials and evaluating them')
    pylab.legend(['original sparse data', 'fit'])
except:
    print("Can't start plots.")

Estimate formants using LPC in Python

I'm new to signal processing (and numpy, scipy, and matlab for that matter). I'm trying to estimate vowel formants with LPC in Python by adapting this matlab code:
http://www.mathworks.com/help/signal/ug/formant-estimation-with-lpc-coefficients.html
Here is my code so far:
#!/usr/bin/env python
import sys
import numpy
import wave
import math
from scipy.signal import lfilter, hamming
from scikits.talkbox import lpc
"""
Estimate formants using LPC.
"""
def get_formants(file_path):
    # Read from file.
    spf = wave.open(file_path, 'r') # http://www.linguistics.ucla.edu/people/hayes/103/Charts/VChart/ae.wav
    # Get file as numpy array.
    x = spf.readframes(-1)
    x = numpy.fromstring(x, 'Int16')
    # Get Hamming window.
    N = len(x)
    w = numpy.hamming(N)
    # Apply window and high pass filter.
    x1 = x * w
    x1 = lfilter([1., -0.63], 1, x1)
    # Get LPC.
    A, e, k = lpc(x1, 8)
    # Get roots.
    rts = numpy.roots(A)
    rts = [r for r in rts if numpy.imag(r) >= 0]
    # Get angles.
    angz = numpy.arctan2(numpy.imag(rts), numpy.real(rts))
    # Get frequencies.
    Fs = spf.getframerate()
    frqs = sorted(angz * (Fs / (2 * math.pi)))
    return frqs
print(get_formants(sys.argv[1]))
Using this file as input, my script returns this list:
[682.18960189917243, 1886.3054773107765, 3518.8326108511073, 6524.8112723782951]
I didn't even get to the last steps where they filter the frequencies by bandwidth because the frequencies in the list aren't right. According to Praat, I should get something like this (this is the formant listing for the middle of the vowel):
Time_s F1_Hz F2_Hz F3_Hz F4_Hz
0.164969 731.914588 1737.980346 2115.510104 3191.775838
What am I doing wrong?
Thanks very much
UPDATE:
I changed this
x1 = lfilter([1., -0.63], 1, x1)
to
x1 = lfilter([1], [1., 0.63], x1)
as per Warren Weckesser's suggestion and am now getting
[631.44354635609318, 1815.8629524985781, 3421.8288991389031, 6667.5030877036006]
I feel like I'm missing something since F3 is very off.
UPDATE 2:
I realized that the order being passed to scikits.talkbox.lpc was off due to a difference in sampling frequency. Changed it to:
Fs = spf.getframerate()
ncoeff = 2 + Fs / 1000
A, e, k = lpc(x1, ncoeff)
Now I'm getting:
[257.86573127888488, 774.59006835496086, 1769.4624576002402, 2386.7093679399809, 3282.387975973973, 4413.0428174593926, 6060.8150432549655, 6503.3090645887842, 7266.5069407315023]
Much closer to Praat's estimation!
The problem had to do with the order being passed to the lpc function. 2 + fs / 1000 where fs is the sampling frequency is the rule of thumb according to:
http://www.phon.ucl.ac.uk/courses/spsci/matlab/lect10.html
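With a higher order, the root list contains more candidates than there are formants, which is where the bandwidth filtering step mentioned in the question comes in. Below is a rough sketch of that step on a synthetic LPC polynomial, so it runs without an audio file. The function name, the bandwidth relation B = -(Fs/pi)*ln|r| (a common approximation for a pole at radius r), and the 90 Hz / 400 Hz thresholds are all assumptions for illustration and may differ from the linked MATLAB example.

```python
import numpy as np

def formants_from_lpc(a, fs, min_freq=90.0, max_bw=400.0):
    """Return (frequency, bandwidth) pairs in Hz for narrow-band poles.

    Hypothetical helper; thresholds are illustrative defaults.
    """
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)   # pole angle -> Hz
    bws = -(fs / np.pi) * np.log(np.abs(roots))  # pole radius -> bandwidth
    order = np.argsort(freqs)
    return [(float(f), float(b)) for f, b in zip(freqs[order], bws[order])
            if f > min_freq and b < max_bw]

# Synthetic test polynomial: a sharp resonance at 700 Hz and a very
# broad one at 2000 Hz, for a sampling rate of 8000 Hz.
fs = 8000.0
sharp = 0.98 * np.exp(2j * np.pi * 700.0 / fs)
broad = 0.50 * np.exp(2j * np.pi * 2000.0 / fs)
a = np.real(np.poly([sharp, np.conj(sharp), broad, np.conj(broad)]))

print(formants_from_lpc(a, fs))  # only the sharp resonance should survive
```

In the real script, a would be the A returned by lpc and fs the value from spf.getframerate().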
I have not been able to get the results you expect, but I do notice two things which might cause some differences:
Your code uses [1, -0.63] where the MATLAB code from the link you provided has [1 0.63].
Your processing is being applied to the entire x vector at once instead of smaller segments of it (see where the MATLAB code does this: x = mtlb(I0:Iend); ).
Hope that helps.
There are at least two problems:
According to the link, the "pre-emphasis filter is a highpass all-pole (AR(1)) filter". The signs of the coefficients given there are correct: [1, 0.63]. If you use [1, -0.63], you get a lowpass filter.
You have the first two arguments to scipy.signal.lfilter reversed.
So, try changing this:
x1 = lfilter([1., -0.63], 1, x1)
to this:
x1 = lfilter([1.], [1., 0.63], x1)
I haven't tried running your code yet, so I don't know if those are the only problems.
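To see what each coefficient choice does, the gain at DC and at the Nyquist frequency can be computed directly from the transfer function H(z) = B(z)/A(z), where b and a are the first two arguments of lfilter in that order. This is a small NumPy-only sketch with a made-up helper name.

```python
import numpy as np

def gain(b, a, w):
    """|H(e^{jw})| for numerator b and denominator a, w in rad/sample."""
    b = np.asarray(b, dtype=float)
    a = np.asarray(a, dtype=float)
    num = np.sum(b * np.exp(-1j * w * np.arange(len(b))))
    den = np.sum(a * np.exp(-1j * w * np.arange(len(a))))
    return abs(num / den)

filters = {
    "all-pole, a=[1, 0.63]":  ([1.0], [1.0, 0.63]),
    "all-pole, a=[1, -0.63]": ([1.0], [1.0, -0.63]),
    "FIR, b=[1, -0.63]":      ([1.0, -0.63], [1.0]),
}
for name, (b, a) in filters.items():
    # A highpass filter has more gain at Nyquist (w = pi) than at DC (w = 0).
    print(name, "DC:", gain(b, a, 0.0), "Nyquist:", gain(b, a, np.pi))
```

Run this way, the all-pole filter with [1, 0.63] boosts high frequencies and the one with [1, -0.63] boosts low frequencies, as described above. The FIR call from the original script, lfilter([1., -0.63], 1, x1), also has more gain at Nyquist than at DC, so it is a different kind of highpass filter rather than a lowpass one.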
