I'm looking for a Python- or MATLAB-based package which can estimate parameters for the following model:
In the original paper they refer to this code by Koop. The problem I have is that this program, as well as the standard packages from Python's statsmodels, estimates a DFM of the form:
The difference from the model in the paper is that with two factors A_1 is a 2x2 matrix, but in the model I want to estimate we only want to estimate a_11 and assume a_12 = 0. Is there a package which can estimate such models?
There are two ways to do this in Statsmodels, although there are trade-offs to each approach:
(1) If you are okay with 1 lag for the error terms (i.e. if it is okay to have e(i,t) = \phi(i,1) e(i,t-1) + u(i,t), from your linked "Model" equations), then you can use the DynamicFactorMQ class. For two factors that evolve independently, you can use the following:
mod = sm.tsa.DynamicFactorMQ(y, factors=['f1', 'f2'],
                             factor_orders={'f1': 1, 'f2': 1},
                             idiosyncratic_ar1=True)
res = mod.fit()
See here for more details on how the factors and factor_orders arguments work. Basically, by specifying factor_orders={'f1':1, 'f2':1} instead of factor_orders={('f1', 'f2'):1} (which is the default if you don't specify anything), the factors evolve separately (which is the same as having diagonal A matrices).
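For contrast, the default grouping mentioned above, which puts both factors in a single VAR block and therefore estimates a full (non-diagonal) A matrix, would be written like this (shown only for comparison, not something you need here):

mod_joint = sm.tsa.DynamicFactorMQ(y, factors=['f1', 'f2'],
                                   factor_orders={('f1', 'f2'): 1},
                                   idiosyncratic_ar1=True)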
(2) Otherwise, if you do not have too many left-hand-side variables, you could use the DynamicFactor class with fixed parameters:
mod = sm.tsa.DynamicFactor(std, factor_order=1, k_factors=2,
                           error_order=1,
                           enforce_stationarity=False)
with mod.fix_params({'L1.f2.f1': 0, 'L1.f1.f2': 0}):
    res = mod.fit()
In this case, when you do mod.fix_params({'L1.f2.f1': 0, 'L1.f1.f2': 0}), you are specifying that a_12 = a_21 = 0. See here for some more details about using fix_params.
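If you are not sure of the exact parameter names to pass to fix_params, one option (a small sketch, not part of the original answer) is to inspect the model's param_names attribute before fixing anything:

mod = sm.tsa.DynamicFactor(std, factor_order=1, k_factors=2,
                           error_order=1, enforce_stationarity=False)
# the transition parameters include the cross-factor terms (a_12, a_21)
print([name for name in mod.param_names if name.startswith('L1.')])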
But in general, the DynamicFactorMQ class from option (1) above is more robust, and is likely the better option.
I wanted to know if there's a way to exclude one or more data regions in a polynomial fit. Currently this doesn't seem to work as I would expect. Here is a small example:
import numpy as np
import pandas as pd
import zfit
# Create test data
left_data = np.random.uniform(0, 3, size=1000).tolist()
mid_data = np.random.uniform(3, 6, size=5000).tolist()
right_data = np.random.uniform(6, 9, size=1000).tolist()
testsample = pd.DataFrame(left_data + mid_data + right_data, columns=["x"])
# Define fit parameter
coeff1 = zfit.Parameter('coeff1', 0.1, -3, 3)
coeff2 = zfit.Parameter('coeff2', 0.1, -3, 3)
# Define Space for the fit
obs_all = zfit.Space("x", limits=(0, 9))
# Perform the fit
bkg_fit = zfit.pdf.Chebyshev(obs=obs_all, coeffs=[coeff1, coeff2], coeff0=1)
new_testsample = zfit.Data.from_pandas(obs=obs_all, df=testsample.query("x<3 or x>6"), weights=None)
nll = zfit.loss.UnbinnedNLL(model=bkg_fit, data=new_testsample)
minimizer = zfit.minimize.Minuit()
result = minimizer.minimize(nll)
Here I've created a small test sample from three uniform distributions. I only want to use the data in x < 3 OR x > 6 and ignore the 'peak' in between. Because of their equal shape and height, I'd expect coeff1 and coeff2 to be (nearly) zero and the fitted curve to be a straight, horizontal line. Obviously this doesn't happen, because zfit assumes there are just no entries between 3 and 6.
I also tried using MultiSpaces to ignore that region via
limit1 = zfit.Space("x", limits=(0, 3))
limit2 = zfit.Space("x", limits=(6, 9))
obs_data = limit1 + limit2
But this leads to a
ValueError: obs need to be a Space with exactly one limit if rescaling is requested.
Does anyone have an idea how to solve this?
Thanks in advance ^^
Indeed, this is a bit of a tricky problem, but it may just need a small update in zfit.
What you are doing is correct: simply use only the data in the desired region. However, this is not the whole story, because there is a "normalization range": probabilistically speaking, it's like conditioning on a certain region, since we know the data can only be in a specific region. Hence the normalization of the PDF should only integrate over the included (low and high) regions.
This can normally be done in two ways:
Using multispace
Using the multispace property as you do should work (though it is most probably not the way to go in the future), except for a quirk in the polynomial function: the polynomials are defined on the interval from -1 to 1. Currently, the data is therefore simply rescaled to lie within -1 and 1 (and for that it uses the "space" property of the PDF). This currently requires a simple space (multispaces could in principle also be allowed, using the minimum and maximum of the limits).
Simultaneous fit
As mentioned in the comments by @jtlz2, you can do a simultaneous fit. That is nothing to worry about; it simply means splitting the likelihood into two parts. Since the likelihood is a product of probabilities, we can conceptually split it into two products and multiply them (or add their logs).
So you can have the PDF fit the lower and the upper region at the same time. However, this does not solve the problem of the normalization: what should the PDF be normalized to? We run into the same problem.
Solution 1: different space and norm
Space and the normalization range are, however, not the same thing. By default, the space (usually called 'obs') is also used as the default normalization range, but that is not required. So you could use one space going from the lowest to the highest point as the obs and then set the norm range to your multispace (set_norm should do it, or set_norm_range if you're not on the newest version). This, I think, should do the trick; a rough sketch is below.
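A minimal sketch of that idea, reusing the names from the question (the exact method, set_norm versus set_norm_range, depends on your zfit version as noted above):

obs_all = zfit.Space("x", limits=(0, 9))
bkg_fit = zfit.pdf.Chebyshev(obs=obs_all, coeffs=[coeff1, coeff2], coeff0=1)

# normalize only over the two sidebands
limit1 = zfit.Space("x", limits=(0, 3))
limit2 = zfit.Space("x", limits=(6, 9))
bkg_fit.set_norm_range(limit1 + limit2)  # or bkg_fit.set_norm(limit1 + limit2) on newer versions

nll = zfit.loss.UnbinnedNLL(model=bkg_fit, data=new_testsample)
result = zfit.minimize.Minuit().minimize(nll)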
Solution 2: manual re-scaling
The actual problem is that it complains about the rescaling to -1 and 1 that can't be done. Every polynomial that does this rescaling can also be told not to, by passing the apply_scaling=False argument. With that, you are responsible for scaling the data to within -1 and 1 yourself (the polynomials are not defined outside that range), and there should not be any error. A sketch of this is shown below as well.
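A sketch of the manual rescaling, again with the names from the question; the linear map from (0, 9) onto (-1, 1) is an assumption based on the range used above:

# define the space directly on (-1, 1) and disable the automatic rescaling
obs_scaled = zfit.Space("x", limits=(-1, 1))
bkg_fit = zfit.pdf.Chebyshev(obs=obs_scaled, coeffs=[coeff1, coeff2],
                             coeff0=1, apply_scaling=False)

# map x from (0, 9) onto (-1, 1) by hand before building the dataset
df_sel = testsample.query("x < 3 or x > 6").copy()
df_sel["x"] = 2 * df_sel["x"] / 9 - 1
data_scaled = zfit.Data.from_pandas(obs=obs_scaled, df=df_sel)

nll = zfit.loss.UnbinnedNLL(model=bkg_fit, data=data_scaled)
result = zfit.minimize.Minuit().minimize(nll)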
I've set up the following binomial switchpoint model in PyMC3:
import pymc3 as pm

with pm.Model() as switchpoint_model:
    switchpoint = pm.DiscreteUniform('switchpoint', lower=df['covariate'].min(),
                                     upper=df['covariate'].max())
    # Priors for pre- and post-switch parameters
    early_rate = pm.Beta('early_rate', 1, 1)
    late_rate = pm.Beta('late_rate', 1, 1)
    # Allocate the appropriate binomial probability to covariate values before and after the switchpoint
    p = pm.math.switch(switchpoint >= df['covariate'].values, early_rate, late_rate)
    p = pm.Deterministic('p', p)
    y = pm.Binomial('y', p=p, n=df['trials'].values, observed=df['successes'].values)
It seems to run fine, except that it entirely centers in on one value for the switchpoint (999), as shown below.
Upon further investigation it seems that the results for this model are highly dependent on the starting value (in PyMC3, "testval"). The below shows what happens when I set the testval = 750.
switchpoint = pm.DiscreteUniform('switchpoint', lower=gp['covariate'].min(),
                                 upper=gp['covariate'].max(), testval=750)
I get similarly different results with additional different starting values.
For context, this is what my dataset looks like:
My questions are:
Is my model somehow incorrectly specified?
If it's correctly specified, how should I interpret these results? In particular, how do I compare / select results generated by different testvals? The only idea I've had has been using WAIC to evaluate out of sample performance...
Models with discrete values can be problematic: all the nice sampling techniques that use derivatives no longer work, and such models can behave a lot like multimodal distributions. I don't really see why this would be that problematic in this case, but you could try using a continuous variable for the switchpoint instead (wouldn't that also make more sense conceptually?). A rough sketch of that idea is below.
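A minimal sketch of that suggestion, reusing the variable names from the question; the sigmoid-based smooth switch is only one illustrative way to make the transition continuous:

with pm.Model() as continuous_switchpoint_model:
    switchpoint = pm.Uniform('switchpoint', lower=df['covariate'].min(),
                             upper=df['covariate'].max())
    early_rate = pm.Beta('early_rate', 1, 1)
    late_rate = pm.Beta('late_rate', 1, 1)
    # weight moves smoothly from 0 to 1 as the covariate crosses the switchpoint
    weight = pm.math.sigmoid(df['covariate'].values - switchpoint)
    p = pm.Deterministic('p', (1 - weight) * early_rate + weight * late_rate)
    y = pm.Binomial('y', p=p, n=df['trials'].values, observed=df['successes'].values)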
I am trying to predict the quality of a metal coil. The coils are 10 meters wide and 1 to 6 kilometers long. As training data I have ~600 parameters measured every 10 meters, and a final quality-control mark for the whole coil: good/bad. Bad means there is at least one place where the coil is bad, but there is no data on where exactly. I have data for approximately 10000 coils.
Let's imagine we want to train a logistic regression for this data (with 2 factors).
X = [[0, 0],
...
[0, 0],
[1, 1], # coil is actually broken here, but we don't know it yet.
[0, 0],
...
[0, 0]]
Y = ?????
I can't just put all "bad" in Y and run the classifier, because that would be confusing for the classifier. I can't put all "good" and one "bad" in Y because I don't know where the bad position is.
The solution I have in mind is the following: I could define the loss function as sum((Y - min(F(x1, x2)))^2) (the min taken over all F belonging to one coil) instead of sum((Y - F(x1, x2))^2). In this case F would probably be trained correctly to point to the bad place. I need a gradient for that; it is impossible to calculate it at all points because the min is not differentiable everywhere, but I could use a weak (sub)gradient instead (using the values of the function that is minimal over the coil at each place).
I more or less know how to implement this myself; the question is what the simplest way to do it in Python with scikit-learn is. Ideally it should be the same (or easily adaptable) for several learning methods (many methods are based on a loss function and a gradient). Is it possible to make some wrapper for learning methods which works this way?
Update: looking at gradient_boosting.py, there is an internal abstract class LossFunction with the ability to calculate the loss and the gradient, which looks promising. It seems there is no common solution.
What you are considering here is known in the machine learning community as superset learning: instead of the typical supervised setting where you have a training set of the form {(x_i, y_i)}, you have {({x_1, ..., x_N}, y_1)} such that you know at least one element of the set has property y_1. This is not a very common setting, but it exists and there is some research available; google for papers in the domain.
In terms of your own loss functions, scikit-learn is a no-go. Scikit-learn is about simplicity: it provides you with a small set of ready-to-use tools with very little flexibility. It is not a research tool, and your problem is research-y. What can you use instead? I suggest you go for any automatic-differentiation solution, for example autograd, which gives you the ability to differentiate through Python code; simply apply scipy.optimize.minimize on top of it and you are done! Any custom loss function will work just fine. A rough sketch is below.
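Here is a rough, hypothetical sketch of that approach with the "min over the coil" loss from the question; predict, coil_loss, coils and labels are made-up names for illustration only:

import autograd.numpy as anp
from autograd import grad
from scipy.optimize import minimize

def predict(w, X):
    # per-segment probability from a simple logistic model
    return 1.0 / (1.0 + anp.exp(-(anp.dot(X, w[:-1]) + w[-1])))

def coil_loss(w, coils, labels):
    # coils: list of (n_segments, n_features) arrays, labels: 0/1 label per coil
    total = 0.0
    for X, y in zip(coils, labels):
        coil_score = anp.min(predict(w, X))   # the weakest segment drives the coil label
        total = total + (y - coil_score) ** 2
    return total

# w0 = anp.zeros(n_features + 1)
# objective = lambda w: coil_loss(w, coils, labels)
# res = minimize(objective, w0, jac=grad(objective))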
As a side note, the minimum operator is not differentiable, so the model might have a hard time figuring out what is going on. You could instead try sum((Y - prod_x F(x_1, x_2))^2), since multiplication is nicely differentiable, and you will still get a similar effect: if at least one element is predicted to be 0, it will remove any "1" answer from the remaining ones. You can even go one step further to make it more numerically stable and do:
if Y == 0 then loss = sum_x log(F(x_1, x_2))
if Y == 1 then loss = sum_x log(1 - F(x_1, x_2))
which translates to
Y * sum_x log(1 - F(x_1, x_2)) + (1 - Y) * sum_x log(F(x_1, x_2))
You can notice the similarity with the cross-entropy cost, which makes perfect sense since your problem is indeed a classification. And now you have a perfectly probabilistic loss: you are attaching probabilities of each segment being "bad" or "good", so the probability of the whole object being bad is either high (if Y == 0) or low (if Y == 1).
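For concreteness, a direct (hypothetical) translation of that last expression for a single coil, with F_seg standing for the per-segment predictions:

import numpy as np

def coil_log_loss(F_seg, Y):
    # minimizing this pushes some F_seg towards 0 when Y == 0 (bad coil)
    # and every F_seg towards 1 when Y == 1 (good coil)
    return Y * np.sum(np.log(1 - F_seg)) + (1 - Y) * np.sum(np.log(F_seg))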
I have a variable A which is Bernoulli distributed, A = pymc.Bernoulli('A', p_A), but I don't have a hard value for p_A and want to sample for it. I do know that it should be small, so I want to use an exponential distribution p_A = pymc.Exponential('p_A', 10).
However, the exponential distribution can return values higher than 1, which would throw off A. Is there a way of bounding the output of p_A without having to re-implement either the Bernoulli or the Exponential distributions in my own @pymc.stochastic-decorated function?
You can use a deterministic function to truncate the Exponential distribution. Personally I believe it would be better to use a distribution that is bounded between 0 and 1, but to solve exactly the problem you stated you can do as follows:
import pymc as pm

p_A = pm.Exponential('p_A', 10)

@pm.deterministic
def p_B(p=p_A):
    return min(1, p)

A = pm.Bernoulli('A', p_B)

model = dict(p_A=p_A, p_B=p_B, A=A)
S = pm.MCMC(model)
S.sample(1000)

p_B_trace = S.trace('p_B')[:]
PyMC provides bounds. The following should also work:
p_A = pymc.Bound(pymc.Exponential, upper=1)('p_A', lam=10)
For any other lost souls who come across this:
I think the best solution for my purposes (that is, I was only using the exponential distribution because the probabilities I was looking to generate were probably small, rather than out of mathematical convenience) was to use a Beta function instead.
For certain parameter values it approximates the shape of an exponential function (and can do the same for binomials and normals), but it is bounded to [0, 1]. Probably only useful for doing things numerically, though, as I imagine it's a pain to do any analysis with.
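A tiny, hypothetical illustration of that idea; the shape parameters below are chosen only as an example of pushing the prior mass towards small values:

import pymc as pm

# alpha=1, beta=10 gives a mean of ~0.09 and a roughly exponential-looking, decreasing shape on [0, 1]
p_A = pm.Beta('p_A', alpha=1, beta=10)
A = pm.Bernoulli('A', p_A)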
I've tried to model a mixture of exponentials by copying the mixture-of-Gaussians example given here. The code is below. I know there are some funky aspects to the inference here, but my question is more about how to debug the calculations in models like this.
The idea is that it's a mixture of three exponentials, with scale parameters taken from the Gamma assigned to scales. However, all observations get assigned to the zeroth exponential during the ElemwiseCategoricalStep. You can see that the assignments of the observations to the exponential components are initially diverse by looking at initial_assignments, and you can see that all observations are assigned to the zeroth component on all iterations from the fact that set(tr['exp'].flatten()) contains only 0.
I assume this is because all of the values assigned to p in the expression array([logp(v * self.sh) for v in self.values]) in ElemwiseCategoricalStep.astep are minus infinity. I would like to know why that is and how to correct it, but even more, I would like to know what tools are available to debug this kind of thing. Is there any way for me to step through the calculation of logp(v * self.sh) to see how the result is determined? If I try to do it using pdb, I think I get stymied at outputs = self.fn() in theano.compile.function_module.Function.__call__, which I guess I can't step into because it's a native function.
Even knowing how to compute the pdf for a given set of model parameters would be a useful start.
import numpy as np
import pymc as pm
from pymc import Model, Gamma, Normal, Dirichlet, Exponential
from pymc import Categorical
from pymc import sample, Metropolis, ElemwiseCategoricalStep

durations = np.concatenate(
    [np.random.exponential(1/lam, 10)
     for lam in [1e-3, 7e-5, 2e-6]])

initial_assignments = np.random.randint(0, 3, len(durations))
print 'initial_assignments', initial_assignments

with Model() as model:
    scales = Gamma('hp', 1, 1, shape=3)
    props = Dirichlet('props', a=np.array([1., 1., 1.]), shape=3)
    category = Categorical('exp', p=props, shape=len(durations))
    points = Exponential('obs', lam=scales[category], observed=durations)
    step1 = pm.Metropolis(vars=[props, scales])
    step2 = ElemwiseCategoricalStep(var=category, values=[0, 1, 2])
    start = {'exp': initial_assignments,
             'hp': np.ones(3),
             'props': np.ones(3)}
    tr = sample(3000, step=[step1, step2], start=start)

print set(tr['exp'].flatten())
Excellent question. One thing you can do is look at the pdf for each of the components.
The Model and each of the variables should have both a .logp and an .elemwise_logp property on them, which returns a function that can take a point or parameter values.
Thus you can say something like print scales.logp(start) or print model.logp(start) or print scales.dlogp()(start).
For now, I think you unfortunately have to specify all the parameter values (even ones that don't affect the result for a particular variable).
Model, FreeRV and ObservedRV all inherit from Factor, which provides this functionality and has a few other methods. You'll probably want the non-fast versions, since those are more forgiving in the kinds of arguments they accept.
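For example, with the model from the question above, a first (hypothetical) check might look like this; depending on your pymc3 version you may need to adjust the point for transformed variables:

print model.logp(start)    # total model log-probability at the start point
print scales.logp(start)   # log-probability of the Gamma-distributed scales
print points.logp(start)   # log-likelihood of the observed durations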
Does that help? Please let me know if you have other ideas for things that might help you in debugging. This is one area where we know pymc3 and theano need some work.