Passing two lists to a Python function

I am trying to run the Python package pyabc (Approximate Bayesian Computation) for model selection between two lists of values, i.e. model_1 = [2, 3, 4, 5] and model_2 = [3, 4, 2, 5]. The main entry point of pyabc is ABCSMC, whose definition is:

ABCSMC(models: Union[List[Model], Model],
       parameter_priors: Union[List[Distribution], Distribution, Callable],
       distance_function: Union[Distance, Callable] = None,
       population_size: Union[PopulationStrategy, int] = 100,
       summary_statistics: Callable[[model_output], dict] = identity,
       model_prior: RV = None)

I don't know where to define and pass my two lists model_1 and model_2 in the code below. I have tried several times but have not been able to do it, as I am new to Python. I am following an example, whose code is given below.
import os
import tempfile

import scipy.stats as st

import pyabc

# Define a Gaussian model
sigma = .5

def model(parameters):
    # sample from a Gaussian
    y = st.norm(parameters.x, sigma).rvs()
    # return the sample as a dictionary
    return {"y": y}

# We define two models, but they are identical so far
models = [model, model]

# However, our models' priors are not the same.
# Their means differ.
mu_x_1, mu_x_2 = 0, 1
parameter_priors = [
    pyabc.Distribution(x=pyabc.RV("norm", mu_x_1, sigma)),
    pyabc.Distribution(x=pyabc.RV("norm", mu_x_2, sigma)),
]

abc = pyabc.ABCSMC(
    models, parameter_priors,
    pyabc.PercentileDistance(measures_to_use=["y"]))

# y_observed is the observed datum; any fixed observed value works here
y_observed = 1

db_path = ("sqlite:///" +
           os.path.join(tempfile.gettempdir(), "test.db"))
history = abc.new(db_path, {"y": y_observed})
print("ABC-SMC run ID:", history.id)

# We run the ABC until either criterion is met
history = abc.run(minimum_epsilon=0.2, max_nr_populations=5)

Model selection in pyABC aims to decide which of several candidate models best describes a common set of observed data. In your code above, you would thus typically put different models in the list of models, models = [model1, model2]. I am not sure what you mean by model selection over lists.
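For illustration, a minimal sketch of what two genuinely different candidate models could look like in the code above (the Laplace alternative here is an arbitrary choice for demonstration, not from the original example):

import scipy.stats as st

sigma = .5

def model_1(parameters):
    # Gaussian observation model
    return {"y": st.norm(parameters.x, sigma).rvs()}

def model_2(parameters):
    # Laplace observation model with the same location parameter
    return {"y": st.laplace(parameters.x, sigma).rvs()}

models = [model_1, model_2]  # passed to pyabc.ABCSMC as before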
For the underlying algorithm and problem that it tries to solve, see also the original paper https://doi.org/10.1093/bioinformatics/btp619.

Related

Summarise the posterior of a single parameter from an array with arviz

I am estimating a model using the PyMC3 library in Python. In my "real" model there are four parameter arrays, two of which contain over 170,000 parameters. Summarising such an array of parameters is too computationally intensive on my computer. I have been trying to figure out whether the summary function in ArviZ will let me summarise only one (or a small number) of the parameters in the array. Below is a reprex where the same problem is present, though the model is a lot simpler. In the linear regression model below, the parameter array b has three parameters: b[0], b[1], b[2]. I would like to know how to get the summary for just b[0] and b[1], or alternatively for just a single parameter, e.g. b[0].
import pandas as pd
import pymc3 as pm
import arviz as az

d = pd.read_csv("https://quantoid.net/files/mtcars.csv")
mpg = d['mpg'].values
hp = d['hp'].values
weight = d['wt'].values

with pm.Model() as model:
    b = pm.Normal("b", mu=0, sigma=10, shape=3)
    sig = pm.HalfCauchy("sig", beta=2)
    mu = pm.Deterministic('mu', b[0] + b[1]*hp + b[2]*weight)
    like = pm.Normal('like', mu=mu, sigma=sig, observed=mpg)
    fit = pm.fit(10000, method='advi')
    samp = fit.sample(1500)

with model:
    smry = az.summary(samp, var_names=["b"])
It looked like the coords argument to the summary() function would do it, but after googling around and finding a few examples (like the one here with plot_posterior() instead of summary()), I was unable to get anything to work. In particular, I tried the following in the hope that it would return the summary for b[0] and b[1]:

with model:
    smry = az.summary(samp, var_names=["b"], coords={"b_dim_0": range(1)})

or this, to return the summary of b[0]:

with model:
    smry = az.summary(samp, var_names=["b"], coords={"b_dim_0": [0]})
I suspect I am missing something simple (I'm an R user who dabbles occasionally with Python). Any help is greatly appreciated.
(BTW, I am using Python 3.8.0, pyMC3 3.9.3, arviz 0.10.0)
To use coords for this, you need to update to the development version of ArviZ (which will still report 0.11.2 but has the code from GitHub) or to any release newer than 0.11.2. Up to and including 0.11.2, the coords argument of summary was not used to subset the data (as it is in all plotting functions); it was only taken into account if the input was not already InferenceData, in which case it was passed to the converter.
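With such a version, coords subsets the summary directly; a hedged sketch (assuming the trace has already been converted to InferenceData, as shown below for older versions):

az.summary(idata, var_names=["b"], coords={"b_dim_0": [0, 1]})  # summary for b[0] and b[1]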
With older versions, you need to use xarray to subset the data before passing it to summary. Therefore you need to explicitly convert the trace to InferenceData beforehand. In the example above it would look like:

with model:
    ...
    samp = fit.sample(1500)
    idata = az.from_pymc3(samp)

az.summary(idata.posterior[["b"]].sel({"b_dim_0": [0]}))
Moreover, you may also want to tell summary to compute only a subset of the stats/diagnostics, as shown in the docstring examples.
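For instance, a hedged sketch assuming an ArviZ version that supports the kind argument:

az.summary(idata, var_names=["b"], kind="stats")  # skip the convergence-diagnostics columns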

Partial fit or incremental learning for autoregressive model

I have two time series representing two independent periods of data observation. I would like to fit an autoregressive model to this data. In other words, I would like to perform two partial fits, or two sessions of incremental learning.
This is a simplified description of a not-unusual scenario which could also apply to batch fitting on large datasets.
How do I do this (in statsmodels or otherwise)? Bonus points if the solution can generalise to other time-series models like ARIMA.
In pseudocode, something like:
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# This is the standard single fit usage
res = AutoReg(data_1, lags=12).fit()
res.aic
# This is more like what I would like to do
model = AutoReg(lags=12)
model.partial_fit(data_1)
model.partial_fit(data_2)
model.results.aic
Statsmodels does not directly have this functionality. As Kevin S mentioned, though, pmdarima does have a wrapper that provides it: specifically, the update method. Per their documentation: "Update the model fit with additional observed endog/exog values."
See the example below, built around your particular code:
from pmdarima.arima import ARIMA
import statsmodels.api as sm
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# This is the standard single fit usage
model = ARIMA(order=(12,0,0))
model.fit(data_1)
# update the model parameters with the new parameters
model.update(data_2)
I don't know how to achieve that with AutoReg; I think it can be done somehow, but you would need to evaluate the results manually or add the data yourself.
With ARIMA and SARIMAX, however, it's already implemented, and it's simple.
For incremental learning there are three related methods, documented here. The first is apply, which uses the fitted parameters on new, unrelated data. Then there are extend and append; append can optionally refit. I don't know the exact difference between them, though.
Here is my example, which is different but similar:

from statsmodels.tsa.api import ARIMA
import numpy as np

data = np.array(range(200))
order = (4, 2, 1)
model = ARIMA(data, order=order)
fitted_model = model.fit()
prediction = fitted_model.forecast(7)

new_data = np.array(range(600, 800))
fitted_model = fitted_model.apply(new_data)
new_prediction = fitted_model.forecast(7)

print(prediction)      # [200. 201. 202. 203. 204. 205. 206.]
print(new_prediction)  # [800. 801. 802. 803. 804. 805. 806.]
This replaces all the data, so it can be used on unrelated data (with an unknown index). I profiled it, and apply is very fast in comparison to fit.
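For completeness, a hedged sketch of append and extend on the same fitted results (assuming the statsmodels ARIMAResults API): both continue the sample last seen by the results object, and append can optionally re-estimate the parameters.

more_data = np.array(range(800, 860))  # continues the series passed to apply
appended = fitted_model.append(more_data, refit=True)  # re-estimates the parameters
extended = fitted_model.extend(more_data)              # keeps the old parameters
print(appended.forecast(3))
print(extended.forecast(3))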

Only save transformed parameters in PyMC3

I want to shift my variables in PyMC3. I'm currently just using deterministic transforms, but when I perform the inference, it saves both the original samples and the shifted samples (which is expected behavior).
Example code:
import pymc3 as pm

x_lower = -3

with pm.Model():
    x = pm.Gamma('x', alpha=2., beta=1.5)
    x_shift = pm.Deterministic("x_shift", x + x_lower)
    trace = pm.sample(1000, tune=1000)

trace.remove_values("x")  # my current solution
tp = pm.traceplot(trace)
# other analysis...
# other analysis...
Now trace stores all the x samples as well as all the x_shift samples, which is clearly a waste when the number of variables and samples increases. I can do trace.remove_values("x") before continuing with the analysis, but I would prefer simply not to save x at all.
Another option is to not save x_shift at all, but I can't find how to add x_lower on to the samples after inference. So this isn't really a solution if I want to use the in-built analysis tools.
Can I save only the x_shift samples, and not the x samples, when I sample?
You can specify exactly what you want to save by setting the trace argument in the pm.sample() function, e.g.,
trace = pm.sample(1000, tune=1000, trace=[x_shift])
In case anyone finds this...
I went for a still-unsatisfying solution: I'm not using Deterministic transformations any more. I still transform the variables, but don't save them; I just transform the saved (original) samples after sampling.
The above code now looks like this:

import pymc3 as pm

x_lower = -3

def f_x(x):
    return x + x_lower

with pm.Model():
    x = pm.Gamma('x', alpha=2., beta=1.5)
    x_shift = f_x(x)
    # keep = {"x": f_x}  # something like this for more variables
    trace = pm.sample(1000, tune=1000)

trace = {"x": f_x(trace["x"])}  # note this merges all chains
# trace = {varname: f(trace[varname]) for varname, f in keep.items()}
tp = pm.traceplot(trace)
# other analysis...
I think pm.traceplot(trace) still works with trace in this form; otherwise, just import arviz and use it directly, since it does work with dicts like this.
Note: take care with more complicated transformations, e.g. you'll have to use pm.math.exp in the model but np.exp in the post-sampling transformation.
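For instance, a hedged illustration with an exponential shift (hypothetical variants of f_x):

import numpy as np

def f_x_model(x):  # inside the model: x is a Theano tensor
    return pm.math.exp(x + x_lower)

def f_x_post(x):   # after sampling: x is a NumPy array
    return np.exp(x + x_lower)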

Investigating the features' importance and weights evolution in given DL model

I apologize for a longer-than-usual intro, but it is important for the question:
I've recently been assigned to work on an existing project, which uses Keras+TensorFlow to create a fully connected net.
Overall the model has 3 fully connected layers with 500 neurons each and 2 output classes. The first layer's 500 neurons are connected to 82 input features. The model is used in production and is retrained weekly, using that week's information generated by an outside source.
The engineer who designed the model is no longer working here, and I'm trying to reverse engineer and understand its behavior.
A couple of objectives I have defined for myself:
Understand the feature selection process and feature importance.
Understand and control the weekly retraining process.
To try to answer both of them, I've implemented an experiment where I feed my code two models: one from the previous week and one from the current week:
import pickle
import numpy as np
import matplotlib.pyplot as plt
from keras.models import model_from_json
path1 = 'C:/Model/20190114/'
path2 = 'C:/Model/20190107/'
model_name1 = '0_10.1'
model_name2 = '0_10.2'
models = [path1 + model_name1, path2 + model_name2]
features_cum_weight = {}
I then take each feature and sum the absolute values of all the weights connecting it to the first hidden layer.
This way I create two vectors of 82 values:
for model_name in models:
    structure_filename = model_name + "_structure.json"
    weights_filename = model_name + "_weights.h5"
    with open(structure_filename, 'r') as model_json:
        model = model_from_json(model_json.read())
    model.load_weights(weights_filename)
    in_layer_weights = model.layers[0].get_weights()[0]
    in_layer_weights = abs(in_layer_weights)
    features_cum_weight[model_name] = in_layer_weights.sum(axis=1)
I then plot them using Matplotlib:

# Plot the evolution of input neuron weights:
keys = list(features_cum_weight.keys())
weights_1 = features_cum_weight[keys[0]]
weights_2 = features_cum_weight[keys[1]]

fig, ax = plt.subplots(nrows=2, ncols=2)
width = 0.35  # the width of the bars
n_plots = 4
batch = int(np.ceil(len(weights_1)/n_plots))
for i in range(n_plots):
    start = i*(batch+1)
    stop = min(len(weights_1), start + batch + 1)
    cur_w1 = weights_1[start:stop]
    cur_w2 = weights_2[start:stop]
    ind = np.arange(len(cur_w1))
    cur_ax = ax[i//2][i%2]
    cur_ax.bar(ind - width/2, cur_w1, width, color='SkyBlue', label='Current Model')
    cur_ax.bar(ind + width/2, cur_w2, width, color='IndianRed', label='Previous Model')
    cur_ax.set_ylabel('Sum of Weights')
    cur_ax.set_title('Sum of all weights connected by feature')
    cur_ax.set_xticks(ind)
    cur_ax.legend()
    cur_ax.set_ylim(0, 30)
plt.show()
Resulting in the following plot:
[Matplotlib figure: four bar-chart panels comparing the summed per-feature weights of the current and previous models]
I then try to compare the vectors to deduce:
If the vectors have changed drastically, there might be some major change in the training data or some problem while retraining the model.
If some value is close to zero, the model might have recognized this feature as unimportant.
I want your opinion and insights on the following:
The overall approach to this experiment.
Advice on other ideas for reverse engineering a given model.
Insights on the output I provide here.
Thank you all; I am open to any suggestions and criticism!
This type of deduction is not entirely valid. The combinations between features are not linear. It is true that a weight of exactly 0 means the feature does not matter through that connection, but it may be that the feature is then recombined in another way in a deeper layer.
This would hold if your model were linear. In fact, this is how PCA works: it searches for linear relationships through the covariance matrix, and the eigenvalues indicate the importance of each component.
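As a minimal sketch of that PCA point (synthetic data, purely to show the mechanics): the eigenvalues of the covariance matrix rank the principal directions by explained variance.

import numpy as np

X = np.random.rand(200, 3)              # synthetic data with 3 features
cov = np.cov(X, rowvar=False)           # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
print(eigvals[::-1])                    # largest first: most variance explained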
I think there are several ways to confirm your suspicions:
Eliminate the features you think are unimportant, train again, and see the result. If it is similar, your suspicions are correct.
Apply the current model: take an example (we will call it a pivot) to evaluate, significantly change the features you consider irrelevant, and create many perturbed examples. Do this for several pivots. If the results are similar, that field should not matter. Example (I take the first feature to be irrelevant):
import numpy as np

data = np.array([[0.5, 1, 0.5], [1, 2, 5]])
range_values = 50
new_data = []
for i in range(data.shape[0]):
    sample = data[i]
    # Create many copies of the sample with the first feature perturbed
    for _ in range(1000):
        noise = np.random.rand() * range_values
        new_sample = sample.copy()
        new_sample[0] += noise
        new_data.append(new_sample)
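A hedged continuation of that idea (assuming one of the Keras models loaded earlier is available as model): compare the predictions on the perturbed copies; if they barely change, the perturbed feature likely has little influence.

new_data = np.array(new_data)
predictions = model.predict(new_data)
# A small spread across the perturbed copies suggests the first feature
# has little influence on the model output.
print(predictions.std(axis=0))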

statsmodels - printing summary of ARMA fit throws error

I want to fit an ARMA(p, q) model to simulated data, y, and check the effect of different estimation methods on the results. However, fitting the same model object twice, like so,

model = tsa.ARMA(y, (1, 1))
result_mle = model.fit(trend='c', method='mle', disp=False)
result_css = model.fit(trend='c', method='css', disp=False)

and printing the results

print result_mle.summary()
print result_css.summary()
generates the following error:

File "C:\Anaconda\lib\site-packages\statsmodels\tsa\arima_model.py", line 1572, in summary
    smry.add_table_params(self, alpha=alpha, use_t=False)
File "C:\Anaconda\lib\site-packages\statsmodels\iolib\summary.py", line 885, in add_table_params
    use_t=use_t)
File "C:\Anaconda\lib\site-packages\statsmodels\iolib\summary.py", line 475, in summary_params
    exog_idx]
IndexError: index 3 is out of bounds for axis 0 with size 3
If, instead, I do this

model1 = tsa.ARMA(y, (1, 1))
model2 = tsa.ARMA(y, (1, 1))
result_mle = model1.fit(trend='c', method='css-mle', disp=False)
print result_mle.summary()
result_css = model2.fit(trend='c', method='css', disp=False)
print result_css.summary()

no error occurs. Is that intended behaviour, or a bug that should be fixed?
BTW, I generated the ARMA process as follows:

from __future__ import division
import statsmodels.tsa.api as tsa
import numpy as np

# generate ARMA(1,1) data
a = -0.7
b = -0.7
c = 2
s = 10
y1 = np.random.normal(c/(1-a), s*(1+(a+b)**2/(1-a**2)))
e = np.random.normal(0, s, (100,))
y = [y1]
for t in xrange(e.size-1):
    arma = c + a*y[-1] + e[t+1] + b*e[t]
    y.append(arma)
y = np.array(y)
You could report this as a bug, even though it looks like a consequence of the current design.
Some attributes of the model change when the estimation method is changed, which should in general be avoided. Since both results instances access the same model, the older one becomes inconsistent with it in this case.
http://www.statsmodels.org/dev/pitfalls.html#repeated-calls-to-fit-with-different-parameters
In general, statsmodels tries to keep all parameters that need to change the model in model.__init__ rather than as arguments to fit, and to attach the outcome of fit to the Results instance.
However, this is not followed everywhere, especially not in older models that gained new options along the way.
trend is an example of something that is supposed to go into ARMA.__init__, because it is now handled together with exog (which makes it an ARMAX model), but it wasn't in pure ARMA. The estimation method belongs in fit and should not cause problems like these.
Aside: there is a helper function to simulate an ARMA process that uses scipy.signal.lfilter and should be much faster than an iteration loop in Python.
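That helper is presumably arma_generate_sample; a hedged sketch matching the process above (note the lag-polynomial sign convention, and that the generated series is zero mean, so the mean implied by the constant is added afterwards):

import numpy as np
from statsmodels.tsa.arima_process import arma_generate_sample

a, b, c, s = -0.7, -0.7, 2, 10
ar = np.array([1, -a])  # AR lag polynomial 1 - a*L
ma = np.array([1, b])   # MA lag polynomial 1 + b*L
y = arma_generate_sample(ar, ma, nsample=100, scale=s)
y += c / (1 - a)        # shift by the mean implied by the constant c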
