Constraint on parameters in lmfit - python

I am trying to fit 3 peaks using lmfit with a skewed-Voigt profile (this is not that important for my question). I want to set a constraint on the peaks centers of the form:
from lmfit.models import SkewedVoigtModel

peak1 = SkewedVoigtModel(prefix='sv1_')
pars = peak1.make_params()
pars['sv1_center'].set(x)          # x = initial guess for the shift
peak2 = SkewedVoigtModel(prefix='sv2_')
pars.update(peak2.make_params())
pars['sv2_center'].set(1000 + x)
peak3 = SkewedVoigtModel(prefix='sv3_')
pars.update(peak3.make_params())
pars['sv3_center'].set(2000 + x)
Basically I want them to be 1000 apart from each other, but I need to fit for the actual shift, x. I know that I can force some parameters to be equal using pars['sv2_center'].set(expr='sv1_center'), but what I would need is pars['sv2_center'].set(expr='sv1_center'+1000) (which doesn't work just like that). How can I achieve what I need? Thank you!

Just do:
pars['sv2_center'].set(expr='sv1_center+1000')
pars['sv3_center'].set(expr='sv1_center+2000')
The constraint expression is a Python expression that will be evaluated every time the constrained parameter needs to get its value.

How to give an arbitrary initial condition to odeint in python

I'm trying to solve a first order linear differential equation in one variable, and am currently using the odeint function from scipy.integrate. However, the initial condition it takes, $y_0$, is evaluated at the initial boundary of the domain, $x_0$, while what I have is the value of $y$ at some arbitrary interior point $x$.
Suggestions on similar questions were to use solve_bvp, which doesn't quite solve my problem either.
How do I go about this?
Numerical integrators always march in only one direction from the initial point. To get a two-sided solution, one has to call the numerical integrator twice, once backward and once forward, for instance as
# backward from x0 down to a (a < x0), then forward from x0 up to b
ta = np.linspace(x0, a, Na+1)
ya = odeint(f, y0, ta)
tb = np.linspace(x0, b, Nb+1)
yb = odeint(f, y0, tb)
You can keep these two parts separate for further use such as plotting, or join them into one array each, reversing the backward part and dropping the duplicated point at x0:
t = np.concatenate([ta[::-1], tb[1:]])
y = np.concatenate([ya[::-1], yb[1:]])
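A complete, runnable version of this two-sided integration, using the simple test problem y' = y with the condition y(0) = 1 known at an interior point of the domain [-1, 2] (the problem, bounds, and grid sizes are illustrative choices):

```python
import numpy as np
from scipy.integrate import odeint

def f(y, t):
    return y                      # y' = y, exact solution y = exp(t)

x0, a, b = 0.0, -1.0, 2.0         # condition known at x0, domain [a, b]
y0 = 1.0                          # y(x0)
Na, Nb = 100, 200

ta = np.linspace(x0, a, Na + 1)   # decreasing t-array: odeint integrates backward
ya = odeint(f, y0, ta)
tb = np.linspace(x0, b, Nb + 1)   # increasing t-array: forward integration
yb = odeint(f, y0, tb)

# reverse the backward part and drop the duplicated point at x0
t = np.concatenate([ta[::-1], tb[1:]])
y = np.concatenate([ya[::-1], yb[1:]])
```

The combined arrays cover the whole domain in increasing order of t and can be compared against the exact solution exp(t).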

Converting from R to Python, trying to understand a line

I have a fairly simple question. I have been converting some statistical analysis code from R to Python. Up until now, I have been doing just fine, but I have gotten stuck on this particular line:
nlsfit <- nls(N~pnorm(m, mean=mean, sd=sd),data=data4fit,start=list(mean=mu, sd=sig), control=list(maxiter=100,warnOnly = TRUE))
Essentially, the program is calculating the non-linear least-squares fit for a set of data, the "nls" command. In the original text, the "tilde" looks like an "enye", I'm not sure if that is significant.
As I understand it, the equivalent of pnorm in Python is norm.cdf from scipy.stats. What I want to know is, what does the "tilde/enye" do before the pnorm function is invoked? "m" is a predefined variable, while "mean" and "sd" are not.
I also found some code, essentially reproducing nls in Python: nls Python code. However, because of the date of the post (2013), I was wondering if there are any more recent equivalents, preferably written in Python 3.
Any advice is appreciated, thanks!
As you can see from ?nls, the first argument to nls is formula:
formula: a nonlinear model formula including variables and parameters.
Will be coerced to a formula if necessary
Now, if you do ?formula, we can read this:
The models fit by, e.g., the lm and glm functions are specified in a
compact symbolic form. The ~ operator is basic in the formation of
such models. An expression of the form y ~ model is interpreted as a
specification that the response y is modelled by a linear predictor
specified symbolically by model
Therefore, in your nls call the ~ joins the response/dependent/regressand variable on the left with the regressors/explanatory variables on the right-hand side of your nonlinear least squares model.
Best!
This minimizes
sum((N - pnorm(m, mean=mean, sd=sd))^2)
using starting values for mean and sd specified in start. It will perform a maximum of 100 iterations and, because warnOnly = TRUE is set, it will return (with a warning) instead of signalling an error if it terminates before convergence.
The first argument to nls is an R formula which specifies the regression where the left hand side of the tilde (N) is the dependent variable and the right side is the function of the parameters (mean, sd) and data (m) used to predict it.
Note that formula objects do not have a fixed meaning in R but rather each function can interpret them in any way it likes. For example, formula objects used by nls are interpreted differently than formula objects used by lm. In nls the formula y ~ a + b * x would be used to specify a linear regression but in lm the same regression would be expressed as y ~ x .
See ?pnorm, ?nls, ?nls.control and ?formula .
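A rough Python analogue of this nls call can be sketched with scipy.optimize.curve_fit. The data here are made-up stand-ins for the R variables (m as predictor, N as response), and curve_fit's maxfev is only an approximate counterpart of nls's maxiter:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# stand-in data: in the R code, m is the predictor and N the response
m = np.linspace(-3, 3, 50)
N = norm.cdf(m, loc=0.2, scale=1.1)

def model(m, mean, sd):
    # Python analogue of R's pnorm(m, mean=mean, sd=sd)
    return norm.cdf(m, loc=mean, scale=sd)

# p0 plays the role of start=list(mean=mu, sd=sig)
popt, pcov = curve_fit(model, m, N, p0=[0.0, 1.0], maxfev=1000)
mean_hat, sd_hat = popt
```

Like nls, curve_fit minimizes the sum of squared residuals between N and model(m, mean, sd), starting from the supplied initial values.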

Confused about X in GaussianHMM.fit([X])

With this code:
X = numpy.array(range(0,5))
model = GaussianHMM(n_components=3,covariance_type='full', n_iter=1000)
model.fit([X])
I get
tuple index out of range
self.n_features = obs[0].shape[1]
So what are you supposed to pass .fit() exactly? The hidden states AND emissions in a tuple? If so in what order? The documentation is less than helpful.
I noticed it likes being passed tuples as this does not give an error:
X = numpy.column_stack([range(0,5),range(0,5)])
model = GaussianHMM(n_components=3,covariance_type='full', n_iter=1000)
model.fit([X])
Edit:
Let me clarify a bit: the documentation indicates that the shape of the array must be:
List of array-like observation sequences (shape (n_i, n_features)).
This would almost indicate that you pass a tuple for each sample that indicates in a binary fashion which observations are present. However their example indicates otherwise:
# pack diff and volume for training
X = np.column_stack([diff, volume])
hence the confusion
It would appear the GaussianHMM function is for multivariate-emission-only HMM problems, hence the requirement to have >1 emission vectors. When the documentation refers to 'n_features' they are not referring to the number of ways emissions can express themselves but the number of orthogonal emission vectors.
Hence, "features" (the orthogonal emission vectors) are not to be confused with "symbols" which, in sklearn's parlance (which is likely shared with the greater hmm community for all I know), refer to what actual unique values the system is capable of emitting.
For univariate emission-vector problems, use MultinomialHMM.
Hope that clarifies for anyone else who want to use this stuff without becoming the world's foremost authority on HMMs :)
I realize this is an old thread but the problem in the example code is still there. I believe the example is now at this link and still giving the same error:
tuple index out of range
self.n_features = obs[0].shape[1]
The offending line of code is:
model = GaussianHMM(n_components=5, covariance_type="diag", n_iter=1000).fit(X)
Which should be:
model = GaussianHMM(n_components=5, covariance_type="diag", n_iter=1000).fit([X])
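The underlying issue can be seen with plain NumPy, without needing the HMM library at all: the API expects each observation sequence to be a 2-D array of shape (n_i, n_features), so a 1-D array has no shape[1] to read, hence the "tuple index out of range". A minimal sketch of the shapes involved:

```python
import numpy as np

X = np.array(range(0, 5))          # shape (5,): 1-D, no axis 1
# X.shape[1] would raise IndexError: tuple index out of range

col = X.reshape(-1, 1)             # shape (5, 1): 5 samples, 1 feature
two = np.column_stack([range(0, 5), range(0, 5)])   # shape (5, 2): 2 features
```

Reshaping a univariate sequence into a single column gives it the (n_i, n_features) shape the documentation describes.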

What exactly does the "returned value" in langid.py mean?

Besides the correct language ID, langid.py returns a certain value - "The value returned is a score for the language. It is not a probability estimate, as it is not normalized by the document probability since this is unnecessary for classification."
But what does the value mean??
I'm actually the author of langid.py. Unfortunately, I've only just spotted this question now, almost a year after it was asked. I've tidied up the handling of the normalization since this question was asked, so all the README examples have been updated to show actual probabilities.
The value that you see there (and that you can still get by turning normalization off) is the un-normalized log-probability of the document. Because log/exp are monotonic, we don't actually need to compute the probability to decide the most likely class. The actual value of this log-prob is not actually of any use to the user. I should probably have never included it, and I may remove its output in the future.
I think this is the important chunk of langid.py code:
def nb_classify(fv):
    # compute the log-factorial of each element of the vector
    logfv = logfac(fv).astype(float)
    # compute the probability of the document given each class
    pdc = np.dot(fv, nb_ptc) - logfv.sum()
    # compute the probability of the document in each class
    pd = pdc + nb_pc
    # select the most likely class
    cl = np.argmax(pd)
    # turn the pd into a probability distribution
    pd /= pd.sum()
    return cl, pd[cl]
It looks to me that the author is calculating something like the multinomial log-posterior of the data for each of the possible languages. logfv calculates the logarithm of the denominator of the PMF (x_1!...x_k!). np.dot(fv,nb_ptc) calculates the
logarithm of the p_1^x_1...p_k^x_k term. So, pdc looks like the list of language conditional log-likelihoods (except that it's missing the n! term). nb_pc looks like the prior probabilities, so pd would be the log-posteriors. The normalization line, pd /= pd.sum() confuses me, since one usually normalizes probability-like values (not log-probability values); also, the examples in the documentation (('en', -55.106250761034801)) don't look like they've been normalized---maybe they were generated before the normalization line was added?
Anyway, the short answer is that this value, pd[cl] is a confidence score. My understanding based on the current code is that they should be values between 0 and 1/97 (since there are 97 languages), with a smaller value indicating higher confidence.
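For reference, the usual way to turn a vector of log-probabilities into a normalized distribution is to exponentiate via a numerically stable softmax, rather than dividing the log values directly. A sketch with made-up per-language log scores:

```python
import numpy as np

def softmax_from_logs(log_scores):
    # stable softmax: shift by the max before exponentiating so that
    # exp() never overflows, then normalize to sum to 1
    shifted = log_scores - np.max(log_scores)
    probs = np.exp(shifted)
    return probs / probs.sum()

# made-up unnormalized per-language log-probabilities
log_scores = np.array([-55.1, -60.3, -72.8])
p = softmax_from_logs(log_scores)
```

With this normalization the largest log score maps to the largest probability, which matches the intuition that the least negative log-probability is the most confident guess.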
Looks like a value that tells you how certain the engine is that it guessed the correct language for the document. I think generally the closer to 0 the number, the more sure it is, but you should be able to test that by mixing languages together and passing them in to see what values you get out. It allows you to fine tune your program when using langid depending upon what you consider 'close enough' to count as a match.

Python: create a polynomial of degree n

I have a feature set
[x1,x2....xm]
Now I want to create polynomial feature set
What that means is that if degree is two, then I have the feature set
[x1, ..., xm, x1^2, x2^2, ..., xm^2, x1*x2, x1*x3, ..., x1*xm, ..., x(m-1)*xm]
So it contains all terms up to order 2.
Likewise, if the order is three, it will contain cubic terms as well.
How to do this?
Edit 1: I am working on a machine learning project where I have close to 7 features, and a non-linear regression on these linear features is giving an OK result. Hence I thought that, to get more features, I could map these features to a higher dimension.
So one way is to consider polynomial order of the feature vector...
Also generating x1*x1 is easy.. :) but getting the rest of the combinations are a bit tricky..
Can combinations give me the x1*x2*x3 result if the order is 3?
Use
itertools.combinations(list, r)
where list is the feature set, and r is the order of desired polynomial features. Then multiply elements of the sublists given by the above. That should give you {x1*x2, x1*x3, ...}. You'll need to construct other ones, then union all parts.
[Edit]
Better: itertools.combinations_with_replacement(list, r) will nicely give sorted length-r tuples with repeated elements allowed.
You could use itertools.product to create all the possible sets of n values that are chosen from the original set; but keep in mind that this will generate (x2, x1) as well as (x1, x2).
Similarly, itertools.combinations will produce sets without repetition or re-ordering, but that means you won't get (x1, x1) for example.
What exactly are you trying to do? What do you need these result values for? Are you sure you do want those x1^2 type terms (what does it mean to have the same feature more than once)? What exactly is a "feature" in this context anyway?
Using Karl's answer as inspiration, try using product and then taking advantage of the set object. Something like,
set(frozenset(comb) for comb in itertools.product(range(5), range(5)))
(The inner collections must be frozenset rather than set so that they are hashable.)
This will get rid of recurring pairs. Then you can turn the set back into a list and sort it or iterate over it as you please.
EDIT:
This will actually kill the x_m^2 terms, so build sorted tuples instead of sets. This keeps the terms hashable and non-repeating.
set([tuple(sorted(comb)) for comb in itertools.product(range(5),range(5))])
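Putting the pieces together, combinations_with_replacement generates each degree-r monomial exactly once, squares included, and the products can then be computed directly. A small sketch with illustrative feature values (2, 3, 5 standing in for x1, x2, x3):

```python
from itertools import combinations_with_replacement
from math import prod

features = [2, 3, 5]          # hypothetical feature values x1, x2, x3
degree = 2

# every degree-2 monomial, squares included, each unordered pair once:
# x1*x1, x1*x2, x1*x3, x2*x2, x2*x3, x3*x3
terms = [prod(c) for c in combinations_with_replacement(features, degree)]
```

For the full polynomial feature set up to a given order, repeat this for each degree from 1 to the order and concatenate the results.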
