Optimizing conditional multiclass softmax objective function in XGBoost - python

I have successfully implemented a custom multiclass softmax objective function in XGBoost based on this tutorial. The reason for the customization is that the classes I want to predict are conditional on some data inputs - i.e. of the 24 possible classes being predicted, only a certain subset are valid. valid_transitions is a list of index lists identifying the classes we want to make predictions on, and invalid_transitions is the complementary set of indices.
I have implemented .fit() and .predict_proba() such that they take valid_transitions and invalid_transitions as arguments, which tell softprob_obj() and softmax() which classes to null out during training and prediction.
def softmax(x, valid_transitions, invalid_transitions):
    for i in range(len(x)):
        e = np.exp(x[i, valid_transitions[i]])
        x[i, valid_transitions[i]] = e / np.sum(e)
        x[i, invalid_transitions[i]] = 0
    return x
def softprob_obj(labels, predt, data, valid_transitions, invalid_transitions):
    '''Loss function. Computes the gradient and approximated hessian (diagonal).
    Reimplements the `multi:softprob` objective inside XGBoost.
    '''
    kRows = len(data)
    kClasses = len(np.unique(labels))
    # The prediction is of shape (rows, classes); each element in a row
    # represents a raw prediction (leaf weight, hasn't gone through softmax
    # yet). In XGBoost 1.0.0, the prediction is transformed by a softmax
    # function, fixed in later versions.
    assert predt.shape == (kRows, kClasses)
    eps = 1e-6
    # Compute the gradient and hessian. Slow iteration in Python, only
    # suitable for demo. Also, the one in native XGBoost core is more robust to
    # numeric overflow, as we don't do anything to mitigate the `exp` in
    # `softmax` here.
    probs = softmax(predt, valid_transitions, invalid_transitions)
    labels = labels.astype(int)
    hess = np.maximum((2.0 * probs * (1.0 - probs)), eps)
    probs[np.arange(len(probs)), labels] -= 1
    # Right now (XGBoost 1.0.0), reshaping is necessary
    grad = probs.reshape((kRows * kClasses, 1))
    hess = hess.reshape((kRows * kClasses, 1))
    return grad, hess
This works, but training is pretty slow, presumably because the core xgboost functions I'm replacing are not written in Python. I made some attempts to vectorize the calculation in numpy to avoid the for loop in softmax(), but ran into issues with the ragged arrays that valid_transitions and invalid_transitions create. Was wondering if anyone had any ideas on how to optimize this within Python. Thanks!
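One direction worth trying (a sketch only, not benchmarked against the original data): convert the ragged valid_transitions lists into a dense boolean mask of shape (rows, classes) once, and then the per-row loop in softmax() can be replaced with masked, vectorized numpy operations:

import numpy as np

def masked_softmax(x, valid_mask):
    """Vectorized softmax restricted to the valid classes.

    x          : (rows, classes) array of raw scores
    valid_mask : (rows, classes) boolean array, True where a class is valid
                 (each row is assumed to have at least one valid class)
    """
    masked = np.where(valid_mask, x, -np.inf)
    # subtract the row-wise max over valid entries for numerical stability
    masked -= masked.max(axis=1, keepdims=True)
    e = np.exp(masked)  # exp(-inf) == 0, so invalid classes drop out
    return e / e.sum(axis=1, keepdims=True)

def build_valid_mask(valid_transitions, n_rows, n_classes):
    """One-off conversion of the ragged index lists into a dense mask."""
    mask = np.zeros((n_rows, n_classes), dtype=bool)
    for i, cols in enumerate(valid_transitions):
        mask[i, cols] = True
    return mask

The mask only needs to be built once per dataset (that loop is cheap compared to running it every boosting round), and the same mask could then replace the index lists inside softprob_obj() as well.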

Related

Tensorflow model architecture for sparse dataset

I have a regression dataset where approximately 95% of the target variables are zeros (the other 5% are between 1 and 30), and I am trying to design a Tensorflow model for that data. I am thinking of implementing a model that combines a classifier and a regressor (check the output of the classifier submodel; if it's less than a threshold, pass it to the regression submodel). I have the intuition that this should be built using the functional API, but I couldn't find helpful resources on that. Any ideas?
Here is the code that generates the data that I am using to replicate the problem:
n = 10000
zero_percentage = 0.95
zeros = np.zeros(round(n * zero_percentage))
non_zeros = np.random.randint(1,30,size=round(n * (1- zero_percentage)))
y = np.concatenate((zeros,non_zeros))
np.random.shuffle(y)
a = 50
b = 10
x = np.array([np.random.randint(31,60) if element == 0 else (element - b) / a for element in y])
y_classification = np.array([0 if element == 0 else 1 for element in y])
Note: I experimented with probabilistic models (Poisson regression and regression with a discretized logistic mixture distribution), and they provided good results but the training was unstable (loss diverges very often).
Instead of trying to find some heuristic to balance the training between the zero values and the others, you might want to try some input preprocessing method that can handle imbalanced training sets better (usually by mapping to another space before running the model, then doing the inverse with the results); for example, an embedding layer. Alternatively, normalize the values to a small range (like [-1, 1]) and apply an activation function before evaluating the model on the data.
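Separately, if you still want to try the combined classifier + regressor idea from the question, a minimal sketch with the functional API could look like the following (layer sizes, names, and the gating threshold are illustrative assumptions, not part of the original post):

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(1,))
h = layers.Dense(32, activation="relu")(inputs)
h = layers.Dense(32, activation="relu")(h)

# head 1: probability that the target is non-zero
clf_out = layers.Dense(1, activation="sigmoid", name="is_nonzero")(h)
# head 2: regression value, only trusted when the classifier fires
reg_out = layers.Dense(1, activation="relu", name="value")(h)

model = Model(inputs, [clf_out, reg_out])
model.compile(optimizer="adam",
              loss={"is_nonzero": "binary_crossentropy", "value": "mse"})
model.fit(x[:, None],
          {"is_nonzero": y_classification[:, None], "value": y[:, None]},
          epochs=10)

# at inference time, gate the regressor by the classifier's decision
p, v = model.predict(x[:, None])
y_hat = np.where(p > 0.5, v, 0.0)

One caveat of this naive setup is that the "value" head is also trained on the zero targets; masking or weighting that loss (e.g. with sample weights equal to y_classification) is a common refinement.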

Use Scipy Optimizer with Tensorflow 2.0 for Neural Network training

After the introduction of Tensorflow 2.0 the scipy interface (tf.contrib.opt.ScipyOptimizerInterface) has been removed. However, I would still like to use the scipy optimizer scipy.optimize.minimize(method='L-BFGS-B') to train a neural network (keras Sequential model). In order for the optimizer to work, it requires as input a function fun(x0) with x0 being an array of shape (n,). Therefore, the first step is to "flatten" the weight matrices to obtain a vector of the required shape. To this end, I modified the code provided by https://pychao.com/2019/11/02/optimize-tensorflow-keras-models-with-l-bfgs-from-tensorflow-probability/, which provides a function factory meant to create such a function fun(x0). However, the code does not seem to work and the loss function does not decrease. I would be really grateful if someone could help me work this out.
Here is the piece of code I am using:
func = function_factory(model, loss_function, x_u_train, u_train)
# convert initial model parameters to a 1D tf.Tensor
init_params = tf.dynamic_stitch(func.idx, model.trainable_variables)
init_params = tf.cast(init_params, dtype=tf.float32)
# train the model with L-BFGS solver
results = scipy.optimize.minimize(fun=func, x0=init_params, method='L-BFGS-B')
def loss_function(x_u_train, u_train, network):
    u_pred = tf.cast(network(x_u_train), dtype=tf.float32)
    loss_value = tf.reduce_mean(tf.square(u_train - u_pred))
    return tf.cast(loss_value, dtype=tf.float32)

def function_factory(model, loss_f, x_u_train, u_train):
    """A factory to create a function required by tfp.optimizer.lbfgs_minimize.
    Args:
        model [in]: an instance of `tf.keras.Model` or its subclasses.
        loss [in]: a function with signature loss_value = loss(pred_y, true_y).
        train_x [in]: the input part of training data.
        train_y [in]: the output part of training data.
    Returns:
        A function that has a signature of:
            loss_value, gradients = f(model_parameters).
    """
    # obtain the shapes of all trainable parameters in the model
    shapes = tf.shape_n(model.trainable_variables)
    n_tensors = len(shapes)

    # we'll use tf.dynamic_stitch and tf.dynamic_partition later, so we need to
    # prepare the required information first
    count = 0
    idx = []   # stitch indices
    part = []  # partition indices
    for i, shape in enumerate(shapes):
        n = np.product(shape)
        idx.append(tf.reshape(tf.range(count, count + n, dtype=tf.int32), shape))
        part.extend([i] * n)
        count += n
    part = tf.constant(part)

    def assign_new_model_parameters(params_1d):
        """A function updating the model's parameters with a 1D tf.Tensor.
        Args:
            params_1d [in]: a 1D tf.Tensor representing the model's trainable parameters.
        """
        params = tf.dynamic_partition(params_1d, part, n_tensors)
        for i, (shape, param) in enumerate(zip(shapes, params)):
            model.trainable_variables[i].assign(tf.cast(tf.reshape(param, shape), dtype=tf.float32))

    # now create a function that will be returned by this factory
    def f(params_1d):
        """This function is created by function_factory.
        Args:
            params_1d [in]: a 1D tf.Tensor.
        Returns:
            A scalar loss.
        """
        # update the parameters in the model
        assign_new_model_parameters(params_1d)
        # calculate the loss
        loss_value = loss_f(x_u_train, u_train, model)
        # print out iteration & loss
        f.iter.assign_add(1)
        tf.print("Iter:", f.iter, "loss:", loss_value)
        return loss_value

    # store this information as members so we can use it outside the scope
    f.iter = tf.Variable(0)
    f.idx = idx
    f.part = part
    f.shapes = shapes
    f.assign_new_model_parameters = assign_new_model_parameters

    return f
Here model is a tf.keras.Sequential object.
Thank you in advance for any help!
Changing from TF1 to TF2 I ran into the same question, and after a bit of experimenting I found the solution below, which shows how to establish the interface between a function decorated with tf.function and a scipy optimizer. The important changes compared to the question are:
As mentioned by Ives, scipy's L-BFGS needs to get both the function value and the gradient, so you need to provide a function that delivers both and then set jac=True.
scipy's L-BFGS is a Fortran routine that expects the interface to provide np.float64 arrays, while a tensorflow tf.function uses tf.float32, so one has to cast input and output.
I provide an example of how this can be done for a toy problem below.
import tensorflow as tf
import numpy as np
import scipy.optimize as sopt

def model(x):
    return tf.reduce_sum(tf.square(x - tf.constant(2, dtype=tf.float32)))

@tf.function
def val_and_grad(x):
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = model(x)
    grad = tape.gradient(loss, x)
    return loss, grad

def func(x):
    return [vv.numpy().astype(np.float64)
            for vv in val_and_grad(tf.constant(x, dtype=tf.float32))]

resdd = sopt.minimize(fun=func, x0=np.ones(5),
                      jac=True, method='L-BFGS-B')

print("info:\n", resdd)
displays
info:
fun: 7.105427357601002e-14
hess_inv: <5x5 LbfgsInvHessProduct with dtype=float64>
jac: array([-2.38418579e-07, -2.38418579e-07, -2.38418579e-07, -2.38418579e-07,
-2.38418579e-07])
message: b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
nfev: 3
nit: 2
status: 0
success: True
x: array([1.99999988, 1.99999988, 1.99999988, 1.99999988, 1.99999988])
Benchmark
For comparing speed I use the L-BFGS optimizer for a style transfer problem (see here for the network). Note that for this problem the network parameters are fixed and the input signal is adapted. As the optimized parameters (the input signal) are 1D, the function factory is not needed.
I compare four implementations:
TF1.12: TF1 with ScipyOptimizerInterface
TF2.0 (E): the approach above without tf.function decorators
TF2.0 (G): the approach above with tf.function decorators
TF2.0/TFP: the lbfgs minimizer from tensorflow_probability
For this comparison the optimization is stopped after 300 iterations (for convergence the problem generally requires 3000 iterations).
Results

Method       runtime (300 it)   final loss
TF1.12       240 s              0.045 (baseline)
TF2.0 (E)    299 s              0.045
TF2.0 (G)    233 s              0.045
TF2.0/TFP    226 s              0.053
The TF2.0 eager mode (TF2.0(E)) works correctly but is about 20% slower than the TF1.12 baseline version. TF2.0(G) with tf.function works fine and is marginally faster than TF1.12, which is a good thing to know.
The optimizer from tensorflow_probability (TF2.0/TFP) is slightly faster than TF2.0 (G) using scipy's L-BFGS, but does not achieve the same error reduction. In fact, the decrease of the loss over time is not monotonic, which seems a bad sign. Comparing the two implementations of L-BFGS (scipy and tensorflow_probability = TFP), it is clear that the Fortran code in scipy is significantly more complex.
So either the simplified algorithm in TFP is hurting here, or the fact that TFP performs all calculations in float32 may also be a problem.
Here is a simple solution using a library (autograd_minimize) that I wrote building on the answer of Roebel:
import numpy as np
import tensorflow as tf
from autograd_minimize import minimize

def rosen_tf(x):
    return tf.reduce_sum(100.0*(x[1:] - x[:-1]**2.0)**2.0 + (1 - x[:-1])**2.0)

res = minimize(rosen_tf, np.array([0., 0.]))
print(res.x)
>>> array([0.99999912, 0.99999824])
It also works with keras models as shown with this naive example of linear regression:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from autograd_minimize.tf_wrapper import tf_function_factory
from autograd_minimize import minimize
import tensorflow as tf

#### Prepares data
X = np.random.random((200, 2))
y = X[:, :1]*2 + X[:, 1:]*0.4 - 1

#### Creates model
model = keras.Sequential([keras.Input(shape=2),
                          layers.Dense(1)])

# Transforms model into a function of its parameters
func, params = tf_function_factory(model, tf.keras.losses.MSE, X, y)

# Minimization
res = minimize(func, params, method='L-BFGS-B')
print(res.x)
>>> [array([[2.0000016 ],
            [0.40000062]]), array([-1.00000164])]
I guess SciPy does not know how to calculate gradients of TensorFlow objects. Try to use the original function factory (i.e., the one that also returns the gradients together with the loss), and set jac=True in scipy.optimize.minimize.
I tested the python code from the original Gist and replaced tfp.optimizer.lbfgs_minimize with SciPy optimizer. It worked with BFGS method:
results = scipy.optimize.minimize(fun=func, x0=init_params, jac=True, method='BFGS')
jac=True means SciPy knows that func also returns gradients.
For L-BFGS-B, however, it's tricky. After some effort, I finally made it work. I had to comment out the @tf.function lines and let func return grads.numpy() instead of the raw TF Tensor. I guess that's because the underlying implementation of L-BFGS-B is a Fortran function, so there might be some issue converting data from tf.Tensor -> numpy array -> Fortran array. Forcing func to return the ndarray version of the gradients resolves the problem, but then it's not possible to use @tf.function.
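For reference, a sketch of that wrapper (assuming f is the factory-built function and has been modified to return both the loss and the flat gradient tensor; the names come from the question, the float64 casts are the point):

import numpy as np
import scipy.optimize
import tensorflow as tf

def func(params_1d):
    # f returns tf.Tensors; L-BFGS-B's Fortran core wants float64 ndarrays
    loss, grads = f(tf.constant(params_1d, dtype=tf.float32))
    return loss.numpy().astype(np.float64), grads.numpy().astype(np.float64)

results = scipy.optimize.minimize(fun=func,
                                  x0=init_params.numpy().astype(np.float64),
                                  jac=True, method='L-BFGS-B')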
(Similar Question to: Is there a tf.keras.optimizers implementation for L-BFGS?)
While this is nowhere near as official as tf.contrib, it's an implementation of L-BFGS (and of any other scipy.optimize.minimize solver) for your consideration, in case it fits your use case:
https://pypi.org/project/kormos/
https://github.com/mbhynes/kormos
The package has models that extend keras.Model and keras.Sequential models, and can be compiled with .compile(..., optimizer="L-BFGS-B") to use L-BFGS in TF2, or compiled with any of the other standard optimizers (because flipping between stochastic & deterministic should be easy!):
kormos.models.BatchOptimizedModel
kormos.models.BatchOptimizedSequentialModel
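A minimal sketch of what that looks like in practice (the model class and the compile call come from the package description above; the data, layer sizes, and loss are illustrative):

import numpy as np
from tensorflow.keras import layers
from kormos.models import BatchOptimizedSequentialModel

# drop-in replacement for keras.Sequential that accepts scipy solvers
model = BatchOptimizedSequentialModel([
    layers.Dense(16, activation="relu", input_shape=(2,)),
    layers.Dense(1),
])

# "L-BFGS-B" routes training through scipy.optimize.minimize;
# any standard Keras optimizer string also works here
model.compile(optimizer="L-BFGS-B", loss="mse")

X = np.random.random((256, 2))
y = X.sum(axis=1, keepdims=True)
model.fit(X, y)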

Implementing a Gaussian-based loss function in Keras

I'm trying to implement a custom loss function in Keras using the TensorFlow backend. The idea is for the neural network to output the coefficients of four Gaussians, and for the loss to compare their sum to the target data - so we're fitting Gaussians to the data. I'd like y_pred to have the form [a_0, b_0, c_0, a_1, ..., c_3], compute the sum of a_i * e^(-(x - b_i)^2 / (2 c_i)) for i = 0, 1, 2, 3, and then take, for example, the mean absolute error between this function and y_true. What I tried was
def gauss_loss(y_true, y_pred):
    # zs is the size of y_true
    # the size of y_pred is 12
    xs = np.linspace(0, 1, zs)
    gauss_sum = 0
    for i in range(0, 12, 3):
        gauss_sum += y_pred[:, i] * K.exp(-(xs - y_pred[:, i+1])**2 / (2*y_pred[:, i+2]))
    return 1./zs * sum(K.abs(y_true - gauss_sum))
I get the error "TypeError: Tensor objects are not iterable when eager execution is not enabled. To iterate over this tensor use tf.map_fn".
However, I don't think I can use tf.map_fn either because it only accepts one argument so I can't use the first entry of y_pred as coefficient a and the next as b in the same formula.
All examples I find just use tensor operations for the entire matrix. It seems to me that this might not even be possible in Keras. Is this possible and if so, how is it done?
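For what it's worth, one way this could be expressed with pure tensor operations is to reshape y_pred to (batch, 4, 3) and broadcast against the x grid; a sketch (assuming zs is fixed and known when the loss is built, and not tested against the original model):

import numpy as np
from tensorflow.keras import backend as K

zs = 64  # assumed length of y_true

def gauss_loss(y_true, y_pred):
    # (batch, 12) -> (batch, 4, 3): one (a, b, c) triple per Gaussian
    coeffs = K.reshape(y_pred, (-1, 4, 3))
    a = coeffs[:, :, 0:1]   # (batch, 4, 1)
    b = coeffs[:, :, 1:2]
    c = coeffs[:, :, 2:3]

    xs = K.constant(np.linspace(0, 1, zs)[None, None, :])  # (1, 1, zs)

    # sum of the four Gaussians, broadcast over the x grid -> (batch, zs)
    gauss_sum = K.sum(a * K.exp(-(xs - b) ** 2 / (2.0 * c)), axis=1)

    return K.mean(K.abs(y_true - gauss_sum), axis=-1)

The per-sample loop disappears because the subtraction, exp, and final sum all broadcast across the (batch, 4, zs) intermediate tensor.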

Keras: handling batch size dimension for custom pearson correlation metric

I want to create a custom metric for pearson correlation as defined here
I'm not sure how exactly to apply it to batches of y_pred and y_true
What I did:
def pearson_correlation_f(y_true, y_pred):
    y_true, _ = tf.split(y_true[:, 1:], 2, axis=1)
    y_pred, _ = tf.split(y_pred[:, 1:], 2, axis=1)
    fsp = y_pred - K.mean(y_pred, axis=-1, keepdims=True)
    fst = y_true - K.mean(y_true, axis=-1, keepdims=True)
    corr = K.mean(K.sum(fsp * fst, axis=-1)) / K.mean(
        K.sqrt(K.sum(K.square(y_pred - K.mean(y_pred, axis=-1, keepdims=True)), axis=-1) *
               K.sum(K.square(y_true - K.mean(y_true, axis=-1, keepdims=True)), axis=-1)))
    return corr
Is it necessary for me to use keepdims and handle the batch dimension manually and the take the mean over it? Or does Keras somehow do this automatically?
When you use K.mean without an axis, Keras automatically calculates the mean for the entire batch.
And the backend already has standard deviation functions, so it might be cleaner (and perhaps faster) to use them.
If your true data is shaped like (BatchSize,1), I'd say keep_dims is unnecessary. Otherwise I'm not sure and it would be good to test the results.
(I don't understand why you use split, but it seems also unnecessary).
So, I'd try something like this:
fsp = y_pred - K.mean(y_pred)  # K.mean is a scalar here, so it is automatically subtracted from every element of y_pred
fst = y_true - K.mean(y_true)
devP = K.std(y_pred)
devT = K.std(y_true)
return K.mean(fsp*fst)/(devP*devT)
If it's relevant to have the loss for each feature instead of putting them all in the same group:
#original shapes: (batch, 10)
fsp = y_pred - K.mean(y_pred,axis=0) #you take the mean over the batch, keeping the features separate.
fst = y_true - K.mean(y_true,axis=0)
#mean shape: (1,10)
#fst shape keeps (batch,10)
devP = K.std(y_pred,axis=0)
devT = K.std(y_true,axis=0)
#dev shape: (1,10)
return K.sum(K.mean(fsp*fst,axis=0)/(devP*devT))
#mean shape: (1,10), making all tensors in the expression be (1,10).
#sum is only necessary because we need a single loss value
Summing the results over the ten features or taking their mean is equivalent, one being 10 times the other (that is not very relevant to keras models, affecting only the effective learning rate, and many optimizers quickly find their way around this).

Python + Theano: Logistic regression weights do not update

I've compared extensively to existing tutorials, but I can't figure out why my weights don't update. Here is the function that returns the list of updates:
def get_updates(cost, params, learning_rate):
    updates = []
    for param in params:
        updates.append((param, param - learning_rate * T.grad(cost, param)))
    return updates
It is defined at the top level, outside of any classes. This is standard gradient descent for each param. The 'params' parameter here is fed in as mlp.params, which is simply the concatenated lists of the param lists for each layer. I removed every layer except for a logistic regression one to isolate the reason as to why my cost was not decreasing. The following is the definition of mlp.params in MLP's constructor. It follows the definition of each layer and their respective param lists.
self.params = []
for layer in self.layers:
    self.params += layer.params
The following is the train function, which I call for each minibatch during each epoch:
train = theano.function(
    [minibatch_index], cost,
    updates=get_updates(cost, mlp.params, learning_rate),
    givens={
        x: train_set_x[minibatch_index * batch_size: (minibatch_index + 1) * batch_size],
        y: train_set_y[minibatch_index * batch_size: (minibatch_index + 1) * batch_size]
    })
If you require further details, the entire file is available here: http://pastebin.com/EeNmXfGD
I don't know how many people use Theano (it doesn't seem like many); if you've read this far, thank you.
Fixed: I've determined that I can't use average squared error as the cost function. It works as expected after replacing it with a negative log-likelihood.
This behavior is caused by a few things, but it comes down to the cost not being properly computed. In your implementation, the output of the LogisticRegression layer is the predicted class for every input digit (obtained with the argmax operation), and you take the squared difference between it and the expected prediction.
This gives you gradients of 0 with respect to every parameter in your model, because the gradient of the output of the argmax (the predicted class) with respect to its input (the class probabilities) is 0.
Instead, the LogisticRegression layer should output the probabilities of the classes:
def output(self, input):
    input = input.flatten(2)
    self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)
    return self.p_y_given_x
And then in the MLP class, you compute the cost. You could use the mean squared error between the desired probabilities for each class and the probabilities computed by the model, but people tend to use the negative log-likelihood of the expected classes, which you can implement in the MLP class as follows:
def neg_log_likelihood(self, x, y):
    p_y_given_x = self.output(x)
    return -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
Then you can use this function to compute your cost, and the model trains:
cost = mlp.neg_log_likelihood(x_, y)
A few additional things:
At line 215, when you print your cost, you format it as an integer value, but it is a floating point value; this loses precision in the monitoring.
Initializing all the weights to 0, as you do in your LogisticRegression class, is generally not recommended. Weights should differ in their initial values so as to help break symmetry.
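A common way to do that in Theano (a sketch; the uniform bounds follow the usual Glorot-style heuristic and the sizes are illustrative):

import numpy as np
import theano

rng = np.random.RandomState(1234)
n_in, n_out = 784, 10  # illustrative layer sizes

# small random values break the symmetry between units; biases can stay at zero
W_values = np.asarray(
    rng.uniform(low=-np.sqrt(6.0 / (n_in + n_out)),
                high=np.sqrt(6.0 / (n_in + n_out)),
                size=(n_in, n_out)),
    dtype=theano.config.floatX)
W = theano.shared(value=W_values, name='W', borrow=True)
b = theano.shared(value=np.zeros((n_out,), dtype=theano.config.floatX),
                  name='b', borrow=True)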
