Defining Input feature importance with Vector Auto Regression

Defining Input feature importance with Vector Auto Regression - python

I am using following package: https://www.statsmodels.org/dev/generated/statsmodels.tsa.vector_ar.var_model.VAR.html
I am trying to find out which of my input features has the biggest influence on the forecast that I am performing.
I am performing a multi-variate time-series forecast.
I differenced all variables until they became stationary.
I now want to find out which variables have the biggest influence in defining the output. (The idea similar to extracting the feature importance in a random forest)
As far as I understand, in VAR, the coefficients relate to the lags of each variable until the order is achieved.
You can write the formula as something like
Y = alpha + B1 x Y t-1 + ... + Bp x Yt-p
Y(t-1) = alpha + B(var1(t-1))* var1(t-1) + B(var2(t-1))* var2(t-1) + ... + B(varX(t-1))*varX(t-1)
With 'B' being the value of the coefficients and 'p' the current order (= lag)
Based on this formula, it would seem that the bigger the value of the coefficient, the bigger its effect on the output.
However, my data is not normalized / standardized (as is not necessary to do so in VAR), and the coeffients go from 1 - p (so until the order is achieved).
Based on above formula, there are p coefficients for each input variable.
And each input variable is scaled differently.
Is there an obvious / simple way in defining the most important features in a VAR?
Or a metric provided by the VAR algorithm that helps me define the most important features?
Right now, what I think of doing is do this:
Influence_var1 = B1 * B(var1(t-1))* var1(t-1) + B2 * B(var(t-2))* var1(t-2) + ... + Bp * B(var(t-p))* var1(t-p)
To determine the influence of var1, and do this for each input variable until varX.
But I am fairly certain I don't have to do all of that and that there are simpler solutions.
With the summary function from the package, the VAR system is estimated.

Related

Interpreting logistic regression coefficients of scaled features

I'm using a logistic regression to estimate the probability of scoring a goal in soccer/footbal. I've got 5 features. My target values are 1 (goal) or 0 (no goal).
As is always a must, I've scaled my features before fitting my model. I've used the MinMaxScaler, who scales all features in the range [0-1] as follows:
X_scaled = (x - x_min)/(x_max - x_min)
The coefficients of my logistic regression model are the following:
coef = [[-2.26286643 4.05722387 0.74869811 0.20538172 -0.49969841]]
My first thoughts are that the second features is the most important, followed by the first. Is this always true?
I read that "In other words, for a one-unit increase in the 'the second feature', the expected change in log odds is 4.05722387." on this site, but there, their features were normalized with a mean of 50 and some std deviation.
If I do not scale my features, the coefficients of the model are the following:
coef = [[-0.04743728 0.04394143 -0.00247654 0.23769469 -0.55051824]]
And now it seems that the first feature is more important than the second one. I read in literature about my topic that this is indeed true. So this confuses me off course.
My questions are:
Which of my features is the most important and what/why is the best methodology to find it?
How can I interprete the meaning of the scaled coefficients? E.g. what does an increase with 1 meter in feature 1 mean? Can I throw 1 meter in the MinMaxScaler, see what comes out and use that as 'the one inut increase'?
Is it true that the final probability wil be computed as y = 1/(1 + exp(-fx)) with fx = intercept + feature1*coef1 + feature2*coef2 + ... (with all features scaled).

Which of my features is the most important and what/why is the best methodology to find it?
Look at several versions of marginal effects calculations. For example, see overview/discussion in a blog Stata's example resources for R
How can I interprete the meaning of the scaled coefficients? E.g. what does an increase with 1 meter in feature 1 mean? Can I throw 1 meter in the MinMaxScaler, see what comes out and use that as 'the one inut increase'?
The interpretation depends on which marginal effects you calculate. You just need to account for scaling when you talk about one unit of X increasing/decreasing the change in probability or odds ratio etc.
Is it true that the final probability wil be computed as y = 1/(1 + exp(-fx)) with fx = intercept + feature1coef1 + feature2coef2 + ... (with all features scaled).
Yes, it's just that features x are in scaled measures.

How do I get raw coefficients from bspline when using Patsy in pystatsmodels

I'm running a GLM and have to hand over discrete values that come from the variable*coefficient to our IT department.
That said, I'm not sure how to calculate the slopes in a piecewise regression model using the bs() function from patsy.
Let's say I have the following model:
y ~ bs(length, degree = 1, knots = [32]
This gives me two rows of the standard pystatsmodel parameters (coefficeints, pvalues, standard error, etc).
Those values are,
variable coeff
y ~ bs(length, degree = 1, knots = [32][0] .3763
y ~ bs(length, degree = 1, knots = [32][1] .4335
I can also run it like this:
y ~ length + np.maximum(length-32,0)
Which yields
variable coeff
length .0118
length -.0074
What I don't understand is when I run a test set through both of these models, they yield the same prediction.
I'm not sure what patsy is doing in the background in either case and I'm wondering, to answer my question, should I
slope 1 for length should come right from the exponent of the coefficient and
slope 2 for length is the exponent(coefficient1 + ceoff2). If that's the case, does that rule apply to both types of syntax?

Fitting a bias in a weight decay regression using least-squares

I'm calcualting the weights for a linear regression with weight-decay, i.e. normally I am trying to find beta = (X'X + lambda I)^-1 X'Y where X has n rows of D features each and Y is a vector of outputs for each row of X.
I've been fitting without a bias term by using:
def wd_fit(A, y, lamb=0):
n_col = A.shape[1]
return np.linalg.lstsq(A.T.dot(A) + lamb * np.identity(n_col), A.T.dot(y))
I'd like to also calculate a bias or intercept term for the fit, instead of having it pass through the origin. I'd like to keep the same call to lstsq, so if there's some matrix transform I can carry out, that would be ideal. My inclination is to append column of 1s somewhere, so that X_mod say would then have D+1 features where the last relates to the intercept value, but I'm not quite sure where that should be or even if it's correct.

If you don't want to mean-center your variables, adding a column of ones will work and is a perfectly acceptable solution.
The bias term will just be the coefficient at the position of the added column.

Possible to use Rank Correlation as cost function in TensorFlow?

I'm working with extremely noisy data occasionally peppered with outliers, so I'm relying mostly on correlation as a measure of accuracy in my NN.
Is it possible to explictly use something like rank correlation (the Spearman correlation coefficient) as my cost function? Up to now, I've relied mostly on MSE as a proxy for correlation.
I have three major stumbling blocks right now:
1) The notion of ranking becomes much fuzzier with mini-batches.
2) How do you dynamically perform rankings? Will TensorFlow not have a gradient error/be unable to track how a change in a weight/bias affects the cost?
3) How do you determine the size of the tensors you're looking at during runtime?
For example, the code below is what I'd like to roughly do if I were to just use correlation. In practice, length needs to be passed in rather than determined at runtime.
length = tf.shape(x)[1] ## Example code. This line not meant to work.
original_loss = -1 * length * tf.reduce_sum(tf.mul(x, y)) - (tf.reduce_sum(x) * tf.reduce_sum(y))
divisor = tf.sqrt(
(length * tf.reduce_sum(tf.square(x)) - tf.square(tf.reduce_sum(x))) *
(length * tf.reduce_sum(tf.square(y)) - tf.square(tf.reduce_sum(y)))
)
original_loss = tf.truediv(original_loss, divisor)

Here is the code for the Spearman correlation:
predictions_rank = tf.nn.top_k(predictions_batch, k=samples, sorted=True, name='prediction_rank').indices
real_rank = tf.nn.top_k(real_outputs_batch, k=samples, sorted=True, name='real_rank').indices
rank_diffs = predictions_rank - real_rank
rank_diffs_squared_sum = tf.reduce_sum(rank_diffs * rank_diffs)
six = tf.constant(6)
one = tf.constant(1.0)
numerator = tf.cast(six * rank_diffs_squared_sum, dtype=tf.float32)
divider = tf.cast(samples * samples * samples - samples, dtype=tf.float32)
spearman_batch = one - numerator / divider
The problem with the Spearman correlation is that you need to use a sorting algorithm (top_k in my code). And there is no way to translate it to a loss value. There is no derivade of a sorting algorithm. You can use a normal correlation but I think there is no mathematically difference to use the mean squared error.
I am working on this right now for images. What I have read in papers that they use to add the ranking into the loss function is to compare 2 or 3 images (where I say images you can say anything you want to rank).
Comparing two elements:
Where N is the total number of elements and α a margin value. I got this equation from Photo Aesthetics Ranking Network with Attributes and Content Adaptation
You can also use losses with 3 elemens where you compare two of them with similar ranking with another one with a different one:
But in this equation you also need to add the direction of the ranking, more details in Will People Like Your Image?. In the case of this paper they use a vector encodig instead of a real value but you can do it for just a number too.
In the case of images, the comparison between images makes more sense when those images are related. So it is a good idea to run a clustering algorithm to create (maybe?) 10 clusters, so you can use elements of the same cluster to make comparisons instead of very different things. This will help the network as the inputs are related somehow and not completely different.
As a side note you should know what is more important for you, if it is the final rank order or the rank value. If it is the value you should go with mean square error, if it is the rank order you can use the losses I wrote before. Or you can even combine them.
How do you determine the size of the tensors you're looking at during runtime?
tf.shape(tensor) returns a tensor with the shape. Then you can use tf.gather(tensor,index) to get the value you want.

PyMC: Hierachical Hidden Markov Model

This is a follow up on PyMC: Parameter estimation in a Markov system
I have a system which is defined by its position and velocity at each timestep. The behavior of the system is defined as:
vel = vel + damping * dt
pos = pos + vel * dt
So, here is my PyMC model. To estimate vel, pos and most importantly damping.
# PRIORS
damping = pm.Normal("damping", mu=-4, tau=(1 / .5**2))
# we assume some system noise
tau_system_noise = (1 / 0.1**2)
# the state consist of (pos, vel); save in lists
# vel: we can't judge the initial velocity --> assume it's 0 with big std
vel_states = [pm.Normal("v0", mu=-4, tau=(1 / 2**2))]
# pos: the first pos is just the observation
pos_states = [pm.Normal("p0", mu=observations[0], tau=tau_system_noise)]
for i in range(1, len(observations)):
new_vel = pm.Normal("v" + str(i),
mu=vel_states[-1] + damping * dt,
tau=tau_system_noise)
vel_states.append(new_vel)
pos_states.append(
pm.Normal("s" + str(i),
mu=pos_states[-1] + new_vel * dt,
tau=tau_system_noise)
)
# we assume some observation noise
tau_observation_noise = (1 / 0.5**2)
obs = pm.Normal("obs", mu=pos_states, tau=tau_observation_noise, value=observations, observed=True)
This is how I run the sampling:
mcmc = pm.MCMC([damping, obs, vel_states, pos_states])
mcmc.sample(50000, 25000)
pm.Matplot.plot(mcmc.get_node("damping"))
damping_samples = mcmc.trace("damping")[:]
print "damping -- mean:%f; std:%f" % (mean(damping_samples), std(damping_samples))
print "real damping -- %f" % true_damping
The value for damping is dominated by the prior. Even if I change the prior to Uniform or whatever, it is still the case.
What am I doing wrong? It's pretty much like the previous example, just with another layer.
The full IPython notebook of this problem is available here: http://nbviewer.ipython.org/github/sotte/random_stuff/blob/master/PyMC%20-%20HMM%20Dynamic%20System.ipynb
[EDIT: Some clarifications & code for sampling.]
[EDIT2: #Chris answer didn't help. I could not use AdaptiveMetropolis since the *_states don't seem to be part of the model.]

There are a couple of issues with the model, looking at it again. First and foremost, you did not add all of your PyMC objects to the model. You have only added [damping, obs]. You should pass all of the PyMC nodes to the model.
Also, note that you don't need to call both Model and MCMC. This is fine:
model = pm.MCMC([damping, obs, vel_states, pos_states])
The best workflow for PyMC is to keep your model in a separate file from the running logic. That way, you can just import the model and pass it to MCMC:
import my_model
model = pm.MCMC(my_model)
Alternately, you can write your model as a function, returning locals (or vars), then calling the function as the argument for MCMC. For example:
def generate_model():
# put your model definition here
return locals()
model = pm.MCMC(generate_model())

Assuming you know the structure of your model -- you are doing parameter estimation, not system identification -- you can construct your PyMC model as a regression, with unknown damping, initial position and initial velocity as parameters and the array of positions, your observations.
That is, with class PM representing the point-mass system:
pm = PM(true_damping)
positions, velocities = pm.integrate(true_pos, true_vel, N, dt)
# Assume little system noise
std_system_noise = 0.05
tau_system_noise = 1.0/std_system_noise**2
# Treat the real positions as observations
observations = positions + np.random.randn(N,)*std_system_noise
# Damping is modelled with a Uniform prior
damping = mc.Uniform("damping", lower=-4.0, upper=4.0, value=-0.5)
# Initial position & velocity unknown -> assume Uniform priors
init_pos = mc.Uniform("init_pos", lower=-1.0, upper=1.0, value=0.5)
init_vel = mc.Uniform("init_vel", lower=0.0, upper=2.0, value=1.5)
#mc.deterministic
def det_pos(d=damping, pi=init_pos, vi=init_vel):
# Apply damping, init_pos and init_vel estimates and integrate
pm.damping = d.item()
pos, vel = pm.integrate(pi, vi, N, dt)
return pos
# Standard deviation is modelled with a Uniform prior
std_pos = mc.Uniform("std", lower=0.0, upper=1.0, value=0.5)
#mc.deterministic
def det_prec_pos(s=std_pos):
# Precision, based on standard deviation
return 1.0/s**2
# The observations are based on the estimated positions and precision
obs_pos = mc.Normal("obs", mu=det_pos, tau=det_prec_pos, value=observations, observed=True)
# Create the model and sample
model = mc.Model([damping, init_pos, init_vel, det_prec_pos, obs_pos])
mcmc = mc.MCMC(model)
mcmc.sample(50000, 25000)
The full listing is here:
https://gist.github.com/stuckeyr/7762371
Increasing N and decreasing dt will improve your estimates markedly.

What do you mean by unreasonable? Are they shrunken toward the prior? Damping seems to have a pretty tight variance -- what if you give it a more diffuse prior?
Also, you might try using the AdaptiveMetropolis sampler on the *_states arrays:
my_model.use_step_method(AdaptiveMetropolis, my_model.vel_states)
It sometimes mixes better for correlated variables, as these likely are.

I think that your initial approach is fine and should work, except that the "obs" variable has not been included in the list of nodes supplied to MCMC (see In[10] in your notebook). After including this variable, the MCMC solver runs fine and does enforce the conditional dependencies specified by your model. I'd like to repeat the point made by Chris that it is best to define the model in a different file or under a function to avoid such mistakes.
The reason why you don't get the right results, is that your priors have been chosen arbitrarily and in some cases, the values are such that it is very difficult for the model to mix properly in order to converge. Your toy problem tries to estimate a damping value such that the positions converge to vector of observed positions. For this, your model should have the flexibility to choose velocity and damping values in a wide range so that stochastic errors in the position/velocity can be corrected when going from one time step to the next. Otherwise, as a result of your Euler integration scheme, the errors just keep getting propagated. I think Chris referred to the same thing when he suggested choosing a more diffuse prior.
I suggest playing around with the tau values for each of the Normal variables. For instance, I changed the following values:
damping = pm.Normal("damping", mu=0, tau=1/20.**2) # was tau=1/2.**2
new_vel = pm.Normal("v" + str(i),
mu=vel_states[-1] + damping * dt,
tau=(1/2.**2)) # was tau=tau_system_noise=(1 / 0.5**2)
tau_observation_noise = (1 / 0.005**2) # was 1 / 0.5**2
You can see the modified file here.
The plots at the bottom show that the positions are indeed converging. The velocities are all over the place. The estimated mean value of damping is 6.9, which is very different from -1.5. Perhaps you can achieve better estimates by choosing appropriate values for the priors.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.