How to use external model data with Emukit python package

I am implementing this code (found here: https://emukit.readthedocs.io/en/latest/notebooks/Emukit-tutorial-custom-model.html)
import numpy as np
from emukit.experimental_design import ExperimentalDesignLoop
from emukit.core import ParameterSpace, ContinuousParameter
from emukit.core.loop import UserFunctionWrapper
from sklearn.gaussian_process import GaussianProcessRegressor
x_min = -30.0
x_max = 30.0
X = np.random.uniform(x_min, x_max, (10, 1))
Y = np.sin(X) + np.random.randn(10, 1) * 0.05
sklearn_gp = GaussianProcessRegressor()
sklearn_gp.fit(X, Y)
from emukit.core.interfaces import IModel
class SklearnGPModel(IModel):
    def __init__(self, sklearn_model):
        self.model = sklearn_model
    def predict(self, X):
        mean, std = self.model.predict(X, return_std=True)
        return mean[:, None], np.square(std)[:, None]
    def set_data(self, X: np.ndarray, Y: np.ndarray) -> None:
        self.model.fit(X, Y)
    def optimize(self, verbose: bool = False) -> None:
        # There is no separate optimization routine for sklearn models
        pass
    @property
    def X(self) -> np.ndarray:
        return self.model.X_train_
    @property
    def Y(self) -> np.ndarray:
        return self.model.y_train_
emukit_model = SklearnGPModel(sklearn_gp)
p = ContinuousParameter('c', x_min, x_max)
space = ParameterSpace([p])
loop = ExperimentalDesignLoop(space, emukit_model)
loop.run_loop(np.sin, 50)
I am trying to implement this code, but with an external data set. To do this, I need to understand whether I can extract the 50 x-values that are passed through the np.sin function when loop.run_loop(np.sin, 50) is executed. Then, having obtained these 50 inputs (x-values), I need to run them through external software, which saves the result as a .csv file.
The information that I would have, and that needs to be "put through" loop.run_loop(), is as follows:
So, I need to make the loop.run_loop() code work by loading external results data, but I do not know how to implement that.

If I understand your question correctly, passing data does not make sense in this context. The default acquisition function selects the next input (or experiment) based on your model. The model is updated at each iteration with the outcome of your experiment, so the next experiment depends on the previous observations; it is not random.
Passing your samples independently of this loop would be significantly less informative.
In short, you need to define a function, similar to np.sin, that can be queried.
Hope this makes sense!
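One way to read the suggestion above is to wrap the external software in a Python callable that accepts the inputs proposed by the loop and returns the matching outputs. The following is only a sketch under stated assumptions: run_external_software is a hypothetical stand-in (here it simply computes np.sin), the file names inputs.csv and outputs.csv are invented for illustration, and a single input dimension is assumed, as in the question.

```python
import os
import tempfile

import numpy as np


def run_external_software(input_csv, output_csv):
    # Hypothetical stand-in for the external program: it reads the inputs
    # from one CSV file and writes one result per row to another CSV file.
    # Replace the body with a call to the real software.
    X = np.loadtxt(input_csv, delimiter=",", ndmin=2)
    np.savetxt(output_csv, np.sin(X), delimiter=",")


def csv_backed_function(X):
    # Accepts an (n_points, 1) array, like np.sin in the question,
    # and returns an (n_points, 1) array of results.
    X = np.atleast_2d(X)
    with tempfile.TemporaryDirectory() as workdir:
        input_csv = os.path.join(workdir, "inputs.csv")
        output_csv = os.path.join(workdir, "outputs.csv")
        np.savetxt(input_csv, X, delimiter=",")
        run_external_software(input_csv, output_csv)
        Y = np.loadtxt(output_csv, delimiter=",", ndmin=2)
    return Y.reshape(-1, 1)


if __name__ == "__main__":
    X_new = np.array([[0.0], [1.5], [-3.0]])
    print(csv_backed_function(X_new))
```

With Emukit installed, such a callable would take the place of np.sin, i.e. loop.run_loop(csv_backed_function, 50), so that each input proposed by the loop is sent to the external software at the moment it is requested.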

Related

HAC standard errors with GenericLikelihoodModel

I am fitting a linear model using maximum likelihood estimation based on the GenericLikelihoodModel class. The errors exhibit heteroskedasticity and serial correlation, so I want to estimate HAC standard errors and display them in the main output. Although this is straightforward to do for OLS estimation, I am unable to implement it for the ML model.
I have spent a lot of time searching but no answer has come up. My background is in econometrics (Eviews, Stata, Matlab), so I am familiar with ML estimation and HAC standard errors, I am just struggling to implement it in Python. I understand that it could just be done manually after the estimation, but I would like to do it using the available statsmodels tools and presented in the main estimation results.
Is there a way to use the statsmodels cov_type parameter with the GenericLikelihoodModel class, or would we need to just code the errors from scratch afterwards?
The code is below.
# = = = = = = = = = = = = = = = = = = = #
# MLE with a linear model #
# = = = = = = = = = = = = = = = = = = = #
# Gives the (almost) exact same results as OLS when using normal errors
# https://www.statsmodels.org/dev/examples/notebooks/generated/generic_mle.html
# https://rlhick.people.wm.edu/posts/estimating-custom-mle.html
import numpy as np
import pandas as pd
from scipy.stats import norm
import statsmodels.api as sm
from statsmodels.base.model import GenericLikelihoodModel
from scipy.optimize import minimize
# --- Set up --- #
def _ll_ols(y, X, beta, sigma): # This is a python function that calculates the log-likelihood value.
    mu = X.dot(beta)
    log_likelihood = norm(mu, sigma).logpdf(y).sum() # log_likelihood = np.sum(norm.logpdf(y,mu,sigma)) is another way
    return log_likelihood
class linear_MLE(GenericLikelihoodModel): # We are creating a class called 'linear_MLE' using 'GenericLikelihoodModel' as a template
    def __init__(self, endog, exog, **kwds): # **kwds just carries the
        super(linear_MLE, self).__init__(endog, exog, **kwds) # This gives the 'linear_MLE' we are creating all of the same properties as the 'GenericLikelihoodModel' class
    def nloglikeobs(self, params): # 'GenericLikelihoodModel' has a set of associated "methods" (basically functions), here we create a new one
        sigma = params[-1] # Pull the sigma and the beta from the model, this is just anything that you need in the log likelihood function
        beta = params[:-1]
        ll = _ll_ols(self.endog, self.exog, beta, sigma) # Calculate the log likelihood based on the function that we created above.
        return -ll # Basically 'GenericLikelihoodModel' gives us the freedom/ability to set 'nloglikeobs' with our own likelihood value.
    def fit(self, start_params=None, maxiter=10000, maxfun=5000, **kwds): # Update the fit for any other values we needed in the likelihood function and the starting values
        self.exog_names.append('*** sigma ***') # we have one additional parameter and we need to add it for summary
        if start_params is None:
            ols_fit = sm.OLS(self.endog, self.exog).fit() # Use the model's own data rather than the global y and X
            start_params = np.append(ols_fit.params, ols_fit.scale**.5) # Set the starting values as the OLS estimates.
            #start_params = np.append(np.ones(self.exog.shape[1]), .5) # Set some reasonable starting values. Play around with this if you have issues with the Hessian.
        return super(linear_MLE, self).fit(start_params=start_params,maxiter=maxiter,maxfun=maxfun,**kwds)
# --- Data --- #
n = 100
k = 2
error = np.random.randn(n)
heteroskedastic_error = np.append(error[:n//2],error[n//2:]*3) # np.int was removed from recent NumPy releases; integer division is used instead
HA_error = (np.append(heteroskedastic_error[-1:],heteroskedastic_error[:-1])+np.append(heteroskedastic_error[-2:],heteroskedastic_error[:-2]))/2
X = pd.DataFrame(np.append([[1]]*n,np.random.randn(k)*np.random.randn(n,k)+np.random.randn(k),axis=1))
y = pd.DataFrame(np.dot(X,np.random.randn(k+1))+HA_error)
# --- Models --- #
ols_results = sm.OLS(y,X).fit()
print(ols_results.summary())
ols_results = sm.OLS(y,X).fit(cov_type='HAC',cov_kwds={'maxlags':2})
print(ols_results.summary())
mle_results = linear_MLE(y,X).fit()
print(mle_results.summary())
mle_results = linear_MLE(y,X).fit(cov_type='HAC',cov_kwds={'maxlags':2})
print(mle_results.summary())

GPflow 2 custom kernel construction: fine upon construction, but kernel of size None in optimization

I'm creating some GPflow models in which I need the observations pre and post of a threshold x0 to be independent a priori. I could achieve this with just GP models, or with a ChangePoints kernel with infinite steepness, but both solutions don't work well with my future extensions in mind (MOGP in particular).
I figured I could easily construct what I want from scratch, so I made a new Combination kernel object, which uses the appropriate child kernel pre- or post x0. This works as intended when I evaluate the kernel on a set of input points; the expected correlations between points before and after threshold are zero, and the rest is determined by the children kernels:
import numpy as np
import gpflow
from gpflow.kernels import Matern32
import matplotlib.pyplot as plt
import tensorflow as tf
from gpflow.kernels import Combination
class IndependentKernel(Combination):
    def __init__(self, kernels, x0, forcing_variable=0, name=None):
        self.x0 = x0
        self.forcing_variable = forcing_variable
        super().__init__(kernels, name=name)
    def K(self, X, X2=None):
        # threshold X, X2 based on self.x0, and construct a joint tensor
        if X2 is None:
            X2 = X
        fv = self.forcing_variable
        mask = tf.dtypes.cast(X[:, fv] >= self.x0, tf.int32)
        X_partitioned = tf.dynamic_partition(X, mask, 2)
        X2_partitioned = tf.dynamic_partition(X2, mask, 2)
        K_pre = self.kernels[0].K(X_partitioned[0], X2_partitioned[0])
        K_post = self.kernels[1].K(X_partitioned[1], X2_partitioned[1])
        zero_block_1 = tf.zeros([K_pre.shape[0], K_post.shape[1]], tf.float64)
        zero_block_2 = tf.zeros([K_post.shape[0], K_pre.shape[1]], tf.float64)
        upper_row = tf.concat([K_pre, zero_block_1], axis=1)
        lower_row = tf.concat([zero_block_2, K_post], axis=1)
        return tf.concat([upper_row, lower_row], axis=0)
    def K_diag(self, X):
        fv = self.forcing_variable
        mask = tf.dtypes.cast(X[:, fv] >= self.x0, tf.int32)
        X_partitioned = tf.dynamic_partition(X, mask, 2)
        return tf.concat([self.kernels[0].K_diag(X_partitioned[0]),
                          self.kernels[1].K_diag(X_partitioned[1])],
                         axis=1)
def f(x):
    return np.sin(6*(x-0.7))
x0 = 0.3
n = 100
x = np.linspace(0, 1, n)
sigma = 0.5
y = np.random.normal(loc=f(x), scale=sigma)
fv = 0
X = x[:, None]
kernel = IndependentKernel([Matern32(), Matern32()], x0=x0, name='indep')
x_pred = np.linspace(0, 1, 100)
K = kernel.K(x_pred[:, None]) # <- kernel is evaluated correctly here
However, when I want to train a GPflow model with this kernel, I receive the error message TypeError: Expected int32, got None of type 'NoneType' instead. This appears to result from the sub-kernel matrices K_pre and K_post to be of size (None, 1), instead of the expected squares (which they correctly are if I evaluate the kernel 'manually').
m = gpflow.models.GPR(data=(X, y[:, None]), kernel=kernel)
gpflow.optimizers.Scipy().minimize(m.training_loss,
m.trainable_variables,
options=dict(maxiter=10000),
method="L-BFGS-B") # <- K_pre & K_post are of size (None, 1) now?
What can I do to make the kernel properly trainable?
I am using GPflow 2.1.3 and TensorFlow 2.4.1.

This is not a GPflow issue but a subtlety of TensorFlow's eager vs. graph mode. In eager mode (the default when you interact with tensors "manually", as when calling the kernel directly), K_pre.shape works just as expected. In graph mode (which is what you get when code is wrapped in tf.function()), the static shape is not always fully known (e.g. it might depend on tf.Variables with None shapes), and you have to use tf.shape(K_pre) instead to obtain the dynamic shape (which depends on the actual values inside the variables). GPflow's Scipy class by default wraps the loss and gradient computation inside tf.function() to speed up optimization. If you explicitly turn this off by passing compile=False to the minimize() call, your code example runs fine. If you fix it properly by replacing the .shape attributes with tf.shape() calls, it will likewise run fine.
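To illustrate the tf.shape() fix in isolation, here is a minimal sketch that does not use GPflow at all. It assumes a toy linear kernel X Xᵀ in place of the Matern32 children and a hard-coded threshold of 0.3; the names zero_block and block_diagonal_gram are made up for this example. The input signature with a None dimension mimics the situation in which the static shape is unknown.

```python
import tensorflow as tf


def zero_block(K_rows, K_cols):
    # tf.shape(...) returns the dynamic shape as a tensor, so it is defined
    # in graph mode even when the static .shape contains None.
    n_rows = tf.shape(K_rows)[0]
    n_cols = tf.shape(K_cols)[1]
    return tf.zeros(tf.stack([n_rows, n_cols]), dtype=K_rows.dtype)


@tf.function(input_signature=[tf.TensorSpec(shape=[None, 1], dtype=tf.float64)])
def block_diagonal_gram(X):
    # Toy stand-in for the question's K(): a linear kernel on each side of
    # the threshold, with zero blocks between the two sides.
    mask = tf.cast(X[:, 0] >= 0.3, tf.int32)
    X_pre, X_post = tf.dynamic_partition(X, mask, 2)
    K_pre = tf.matmul(X_pre, X_pre, transpose_b=True)
    K_post = tf.matmul(X_post, X_post, transpose_b=True)
    upper_row = tf.concat([K_pre, zero_block(K_pre, K_post)], axis=1)
    lower_row = tf.concat([zero_block(K_post, K_pre), K_post], axis=1)
    return tf.concat([upper_row, lower_row], axis=0)


if __name__ == "__main__":
    X = tf.constant([[0.1], [0.2], [0.5]], dtype=tf.float64)
    print(block_diagonal_gram(X).numpy())
```

The same substitution, tf.shape(K_pre)[0] in place of K_pre.shape[0] and so on, should carry over to the zero blocks in the question's K() method.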

Use Python lmfit with a variable number of parameters in function

I am trying to deconvolve complex gas chromatogram signals into individual gaussian signals. Here is an example, where the dotted line represents the signal I am trying to deconvolve.
I was able to write the code to do this using scipy.optimize.curve_fit; however, once applied to real data the results were unreliable. I believe being able to set bounds to my parameters will improve my results, so I am attempting to use lmfit, which allows this. I am having a problem getting lmfit to work with a variable number of parameters. The signals I am working with may have an arbitrary number of underlying gaussian components, so the number of parameters I need will vary. I found some hints here, but still can't figure it out...
Creating a python lmfit Model with arbitrary number of parameters
Here is the code I am currently working with. The code will run, but the parameter estimates do not change when the model is fit. Does anyone know how I can get my model to work?
import numpy as np
from collections import OrderedDict
from scipy.stats import norm
from lmfit import Parameters, Model
def add_peaks(x_range, *pars):
    y = np.zeros(len(x_range))
    for i in np.arange(0, len(pars), 3):
        curve = norm.pdf(x_range, pars[i], pars[i+1]) * pars[i+2]
        y = y + curve
    return(y)
# generate some fake data
x_range = np.linspace(0, 100, 1000)
peaks = [50., 40., 60.]
a = norm.pdf(x_range, peaks[0], 5) * 2
b = norm.pdf(x_range, peaks[1], 1) * 0.1
c = norm.pdf(x_range, peaks[2], 1) * 0.1
fake = a + b + c
param_dict = OrderedDict()
for i in range(0, len(peaks)):
    param_dict['pk' + str(i)] = peaks[i]
    param_dict['wid' + str(i)] = 1.
    param_dict['mult' + str(i)] = 1.
# In case, you'd like to see the plot of fake data
#y = add_peaks(x_range, *param_dict.values())
#plt.plot(x_range, y)
#plt.show()
# Initialize the model and fit
pmodel = Model(add_peaks)
params = pmodel.make_params()
for i in param_dict.keys():
    params.add(i, value=param_dict[i])
result = pmodel.fit(fake, params=params, x_range=x_range)
print(result.fit_report())

I think you would be better off using lmfit's ability to build composite models.
That is, with a single peak defined with
from scipy.stats import norm
def peak(x, amp, center, sigma):
    return amp * norm.pdf(x, center, sigma)
(see also lmfit.models.GaussianModel), you can build a model with many peaks:
npeaks = 3
model = Model(peak, prefix='p1_')
for i in range(1, npeaks):
    model = model + Model(peak, prefix='p%d_' % (i+1))
params = model.make_params()
Now model will be a sum of 3 Gaussian functions, and the params created for that model will have names like p1_amp, p1_center, p2_amp, ..., to which you can add sensible initial values and/or bounds and/or constraints.
Given your example data, you could pass in initial values to make_params like
params = model.make_params(p1_amp=2.0, p1_center=50., p1_sigma=2,
p2_amp=0.2, p2_center=40., p2_sigma=2,
p3_amp=0.2, p3_center=60., p3_sigma=2)
result = model.fit(fake, params, x=x_range)

I was able to find a solution here:
https://lmfit.github.io/lmfit-py/builtin_models.html#example-3-fitting-multiple-peaks-and-using-prefixes
Building on the code above, the following accomplishes what I was trying to do...
import matplotlib.pyplot as plt  # needed for plt.show() below
from lmfit.models import GaussianModel
gauss1 = GaussianModel(prefix='g1_')
gauss2 = GaussianModel(prefix='g2_')
gauss3 = GaussianModel(prefix='g3_')
gauss4 = GaussianModel(prefix='g4_')
gauss5 = GaussianModel(prefix='g5_')
gauss = [gauss1, gauss2, gauss3, gauss4, gauss5]
prefixes = ['g1_', 'g2_', 'g3_', 'g4_', 'g5_']
mod = np.sum(gauss[0:len(peaks)])
pars = mod.make_params()
for i, prefix in zip(range(0, len(peaks)), prefixes[0:len(peaks)]):
    pars[prefix + 'center'].set(peaks[i])
init = mod.eval(pars, x=x_range)
out = mod.fit(fake, pars, x=x_range)
print(out.fit_report(min_correl=0.5))
out.plot_fit()
plt.show()

Python Information gain implementation

I am currently using scikit-learn for text classification on the 20ng dataset. I want to calculate the information gain for a vectorized dataset. It has been suggested to me that this can be accomplished, using mutual_info_classif from sklearn. However, this method is really slow, so I was trying to implement information gain myself based on this post.
I came up with the following solution:
from scipy.stats import entropy
import numpy as np
def information_gain(X, y):
    def _entropy(labels):
        counts = np.bincount(labels)
        return entropy(counts, base=None)
    def _ig(x, y):
        # indices where x is set/not set
        x_set = np.nonzero(x)[1]
        x_not_set = np.delete(np.arange(x.shape[1]), x_set)
        h_x_set = _entropy(y[x_set])
        h_x_not_set = _entropy(y[x_not_set])
        return entropy_full - (((len(x_set) / f_size) * h_x_set)
                               + ((len(x_not_set) / f_size) * h_x_not_set))
    entropy_full = _entropy(y)
    f_size = float(X.shape[0])
    scores = np.array([_ig(x, y) for x in X.T])
    return scores
Using a very small dataset, most scores from sklearn and my implementation are equal. However, sklearn seems to take frequencies into account, which my algorithm clearly doesn't. For example
from time import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
categories = ['talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)
X, y = newsgroups_train.data, newsgroups_train.target
cv = CountVectorizer(max_df=0.95, min_df=2,
                     max_features=100,
                     stop_words='english')
X_vec = cv.fit_transform(X)
t0 = time()
res_sk = mutual_info_classif(X_vec, y, discrete_features=True)
print("Time passed for sklearn method: %3f" % (time()-t0))
t0 = time()
res_ig = information_gain(X_vec, y)
print("Time passed for ig: %3f" % (time()-t0))
for name, res_mi, res_ig in zip(cv.get_feature_names(), res_sk, res_ig):
    print("%s: mi=%f, ig=%f" % (name, res_mi, res_ig))
sample output:
center: mi=0.011824, ig=0.003548
christian: mi=0.128629, ig=0.127122
color: mi=0.028413, ig=0.026397
com: mi=0.041184, ig=0.030458
computer: mi=0.020590, ig=0.012327
cs: mi=0.007291, ig=0.001574
data: mi=0.020734, ig=0.008986
did: mi=0.035613, ig=0.024604
different: mi=0.011432, ig=0.005492
distribution: mi=0.007175, ig=0.004675
does: mi=0.019564, ig=0.006162
don: mi=0.024000, ig=0.017605
earth: mi=0.039409, ig=0.032981
edu: mi=0.023659, ig=0.008442
file: mi=0.048056, ig=0.045746
files: mi=0.041367, ig=0.037860
ftp: mi=0.031302, ig=0.026949
gif: mi=0.028128, ig=0.023744
god: mi=0.122525, ig=0.113637
good: mi=0.016181, ig=0.008511
gov: mi=0.053547, ig=0.048207
So I was wondering if my implementation is wrong, or it is correct, but a different variation of the mutual information algorithm scikit-learn uses.

A little late with my answer, but you should look at Orange's implementation. Within their app it is used as a behind-the-scenes processor to help inform the dynamic model parameter building process.
The implementation itself looks fairly straightforward and could most likely be ported out. The entropy calculation comes first, in the section starting at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L233
def _entropy(dist):
    """Entropy of class-distribution matrix"""
    p = dist / np.sum(dist, axis=0)
    pc = np.clip(p, 1e-15, 1)
    return np.sum(np.sum(- p * np.log2(pc), axis=0) * np.sum(dist, axis=0) / np.sum(dist))
Then the second portion.
https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L305
class GainRatio(ClassificationScorer):
    """
    Information gain ratio is the ratio between information gain and
    the entropy of the feature's
    value distribution. The score was introduced in [Quinlan1986]_
    to alleviate overestimation for multi-valued features. See `Wikipedia entry on gain ratio
    <http://en.wikipedia.org/wiki/Information_gain_ratio>`_.
    .. [Quinlan1986] J R Quinlan: Induction of Decision Trees, Machine Learning, 1986.
    """
    def from_contingency(self, cont, nan_adjustment):
        h_class = _entropy(np.sum(cont, axis=1))
        h_residual = _entropy(np.compress(np.sum(cont, axis=0), cont, axis=1))
        h_attribute = _entropy(np.sum(cont, axis=0))
        if h_attribute == 0:
            h_attribute = 1
        return nan_adjustment * (h_class - h_residual) / h_attribute
The actual scoring process happens at https://github.com/biolab/orange3/blob/master/Orange/preprocess/score.py#L218
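As a rough idea of what a port might look like, the sketch below computes plain information gain (the numerator of the gain ratio above) from a contingency table using only NumPy. It is an adaptation, not Orange's code: the names entropy_bits and information_gain are made up here, the nan_adjustment factor is left out, and the table is assumed to have one row per class and one column per feature value.

```python
import numpy as np


def entropy_bits(dist):
    # Entropy in bits. For a 1-D vector of counts this is the usual entropy;
    # for a 2-D table it is the column-wise entropy, weighted by column totals.
    dist = np.asarray(dist, dtype=float)
    col_totals = np.sum(dist, axis=0)
    p = dist / col_totals
    pc = np.clip(p, 1e-15, 1)  # avoid log2(0); such terms are multiplied by p = 0
    return np.sum(np.sum(-p * np.log2(pc), axis=0) * col_totals / np.sum(dist))


def information_gain(cont):
    # cont[i, j] = number of samples of class i with feature value j.
    cont = np.asarray(cont, dtype=float)
    h_class = entropy_bits(np.sum(cont, axis=1))
    nonempty = np.sum(cont, axis=0) > 0  # drop feature values that never occur
    h_residual = entropy_bits(cont[:, nonempty])
    return h_class - h_residual


if __name__ == "__main__":
    # A feature that separates two balanced classes perfectly, and one that
    # carries no information about the class.
    print(information_gain([[10, 0], [0, 10]]))
    print(information_gain([[5, 5], [5, 5]]))
```

For the bag-of-words case in the question, the contingency table for one word would have to be built first, for example with one column for "word absent" and one for "word present", or one column per distinct count if frequencies should matter.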

Using multiprocessing in emcee library inside a class

I have tried to use the emcee library to implement Markov chain Monte Carlo inside a class and also get the multiprocessing module to work, but after running this test code:
import numpy as np
import emcee
import scipy.optimize as op
# Choose the "true" parameters.
m_true = -0.9594
b_true = 4.294
f_true = 0.534
# Generate some synthetic data from the model.
N = 50
x = np.sort(10*np.random.rand(N))
yerr = 0.1+0.5*np.random.rand(N)
y = m_true*x+b_true
y += np.abs(f_true*y) * np.random.randn(N)
y += yerr * np.random.randn(N)
class modelfit():
    def __init__(self):
        self.x=x
        self.y=y
        self.yerr=yerr
        self.m=-0.6
        self.b=2.0
        self.f=0.9
    def get_results(self):
        def func(a):
            model=a[0]*self.x+a[1]
            inv_sigma2 = 1.0/(self.yerr**2 + model**2*np.exp(2*a[2]))
            return 0.5*(np.sum((self.y-model)**2*inv_sigma2 + np.log(inv_sigma2)))
        result = op.minimize(func, [self.m, self.b, np.log(self.f)],options={'gtol': 1e-6, 'disp': True})
        m_ml, b_ml, lnf_ml = result["x"]
        return result["x"]
    def lnprior(self,theta):
        m, b, lnf = theta
        if -5.0 < m < 0.5 and 0.0 < b < 10.0 and -10.0 < lnf < 1.0:
            return 0.0
        return -np.inf
    def lnprob(self,theta):
        lp = self.lnprior(theta)
        likelihood=self.lnlike(theta)
        if not np.isfinite(lp):
            return -np.inf
        return lp + likelihood
    def lnlike(self,theta):
        m, b, lnf = theta
        model = m * self.x + b
        inv_sigma2 = 1.0/(self.yerr**2 + model**2*np.exp(2*lnf))
        return -0.5*(np.sum((self.y-model)**2*inv_sigma2 - np.log(inv_sigma2)))
    def run_mcmc(self,nstep):
        ndim, nwalkers = 3, 100
        pos = [self.get_results() + 1e-4*np.random.randn(ndim) for i in range(nwalkers)]
        self.sampler = emcee.EnsembleSampler(nwalkers, ndim, self.lnprob,threads=10)
        self.sampler.run_mcmc(pos, nstep)
test=modelfit()
test.x=x
test.y=y
test.yerr=yerr
test.get_results()
test.run_mcmc(5000)
I got this error message :
File "MCMC_model.py", line 157, in run_mcmc
self.sampler.run_mcmc(theta0, nstep)
File "build/bdist.linux-x86_64/egg/emcee/sampler.py", line 157, in run_mcmc
File "build/bdist.linux-x86_64/egg/emcee/ensemble.py", line 198, in sample
File "build/bdist.linux-x86_64/egg/emcee/ensemble.py", line 382, in _get_lnprob
File "build/bdist.linux-x86_64/egg/emcee/interruptible_pool.py", line 94, in map
File "/vol/aibn84/data2/zahra/anaconda/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
cPickle.PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed
I reckon it has something to do with how I have used multiprocessing in the class, but I could not figure out how to keep the structure of my class the way it is and still use multiprocessing.
I would appreciate any tips.
P.S. I should mention that the code works perfectly if I remove threads=10 from the last function.

There are a number of SO questions that discuss what's going on:
https://stackoverflow.com/a/21345273/2379433
https://stackoverflow.com/a/28887474/2379433
https://stackoverflow.com/a/21345308/2379433
https://stackoverflow.com/a/29129084/2379433
…including this one, which seems to be your response… to nearly the same question:
https://stackoverflow.com/a/25388586/2379433
However, the difference here is that you are not using multiprocessing directly, but emcee is. Therefore, the pathos.multiprocessing solution (from the links above) is not available to you. Since emcee uses cPickle, you'll have to stick to things that pickle knows how to serialize. You are out of luck for class instances. Typical workarounds are to either use copy_reg to register the type of object you want to serialize, or to add a __reduce__ method to tell Python how to serialize it. You can see that several of the answers from the above links suggest similar things… but none enable you to keep the class the way you have written it.

For the record, you can now create a pathos.multiprocessing pool, and pass it to emcee using the pool argument. However, be aware that the overhead of multiprocessing can actually slow things down, unless your likelihood is particularly time-consuming to compute.
