Question
In CS231n's "Computing the Analytic Gradient with Backpropagation" section, which first implements a Softmax classifier, the gradient from (softmax + log loss) is divided by the batch size (the number of examples used in one cycle of forward loss computation and backpropagation during training).
Please help me understand why it needs to be divided by the batch size.
The chain rule used to derive the gradient is shown below. Where should I incorporate the division?
(Figure: derivative of the softmax loss function)
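For reference, the figure shows the standard softmax + cross-entropy result (not something specific to this code): with p_k = e^{f_k} / \sum_j e^{f_j} and L_i = -\log p_{y_i}, the gradient on the scores of example i is

\frac{\partial L_i}{\partial f_k} = p_k - \mathbb{1}(k = y_i).

The division enters because the data loss is the batch average, L = \frac{1}{N} \sum_i L_i (the data_loss below), so the chain rule multiplies every per-example score gradient by 1/N; that factor is exactly the dscores /= num_examples line.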
Code
import numpy as np

N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
#Train a Linear Classifier
# initialize parameters randomly
W = 0.01 * np.random.randn(D,K)
b = np.zeros((1,K))
# some hyperparameters
step_size = 1e-0
reg = 1e-3 # regularization strength
# gradient descent loop
num_examples = X.shape[0]
for i in range(200):

  # evaluate class scores, [N x K]
  scores = np.dot(X, W) + b

  # compute the class probabilities
  exp_scores = np.exp(scores)
  probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True) # [N x K]

  # compute the loss: average cross-entropy loss and regularization
  correct_logprobs = -np.log(probs[range(num_examples),y])
  data_loss = np.sum(correct_logprobs)/num_examples
  reg_loss = 0.5*reg*np.sum(W*W)
  loss = data_loss + reg_loss
  if i % 10 == 0:
    print("iteration %d: loss %f" % (i, loss))

  # compute the gradient on scores
  dscores = probs
  dscores[range(num_examples),y] -= 1
  dscores /= num_examples # <---------------------- Why?

  # backpropagate the gradient to the parameters (W,b)
  dW = np.dot(X.T, dscores)
  db = np.sum(dscores, axis=0, keepdims=True)
  dW += reg*W # regularization gradient

  # perform a parameter update
  W += -step_size * dW
  b += -step_size * db
It's because you are averaging the gradients instead of directly taking the sum of all the gradients.
You could of course skip the division, but it has several advantages. The main one is that it acts as a kind of regularization (to avoid overfitting): with smaller gradients the weights cannot grow out of proportion.
The normalization also makes results comparable across experiments that use different batch sizes (how could you compare two runs if the gradient magnitude depended on the batch size?).
And if you divide the summed gradients by the batch size, you can work with larger learning rates to make training faster.
This answer on the Cross Validated community is quite useful.
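To make the averaging concrete, here is a small self-contained sketch (my own illustration, not from the course notes) showing that dividing dscores by N gives the same dW as averaging the per-example gradients one at a time:

import numpy as np

np.random.seed(0)
N, D, K = 5, 3, 4                       # tiny batch for illustration
X = np.random.randn(N, D)
y = np.random.randint(K, size=N)
W = 0.01 * np.random.randn(D, K)

scores = X @ W
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# vectorized: divide the score gradient by the batch size
dscores = probs.copy()
dscores[np.arange(N), y] -= 1
dW_vectorized = X.T @ (dscores / N)

# per-example: gradient of each example's loss, then average
dW_average = np.zeros_like(W)
for i in range(N):
    dscore_i = probs[i].copy()
    dscore_i[y[i]] -= 1
    dW_average += np.outer(X[i], dscore_i)
dW_average /= N

print(np.allclose(dW_vectorized, dW_average))   # True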
I came to notice that the dot product in dW = np.dot(X.T, dscores) for the gradient at W is a sum (Σ) over the num_examples instances. Because dscores, which holds the softmax probabilities, was divided by num_examples before that point, I had not realized that the division was the normalization for the dot product and the sum later in the code. Now I understand that dividing by num_examples is required (it may still work without the normalization if the learning rate is tuned, though).
I believe the code below explains better.
# compute the gradient on scores
dscores = probs
dscores[range(num_examples),y] -= 1
# backpropagate the gradient to the parameters (W,b)
dW = np.dot(X.T, dscores) / num_examples
db = np.sum(dscores, axis=0, keepdims=True) / num_examples
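A quick sanity check (my own, reusing probs, X, y, and num_examples from the training loop above) that this reordering matches the original code:

# original ordering: scale dscores first
dscores_a = probs.copy()
dscores_a[range(num_examples), y] -= 1
dscores_a /= num_examples
dW_a = np.dot(X.T, dscores_a)
db_a = np.sum(dscores_a, axis=0, keepdims=True)

# reordering: scale after the dot/sum
dscores_b = probs.copy()
dscores_b[range(num_examples), y] -= 1
dW_b = np.dot(X.T, dscores_b) / num_examples
db_b = np.sum(dscores_b, axis=0, keepdims=True) / num_examples

assert np.allclose(dW_a, dW_b) and np.allclose(db_a, db_b)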
I am developing a custom model in TensorFlow. I am trying to implement a Virtual Adversarial Training (VAT) model from https://arxiv.org/abs/1704.03976. The model makes use of both labeled and unlabeled data in its classification task. Therefore, in the model's train_step, I need to split each batch into labeled (0 or 1) and unlabeled (-1) data. It seems to work as expected when compiling the model with run_eagerly=True, but when I use run_eagerly=False, it gives me the following error:
ValueError: Number of mask dimensions must be specified, even if some dimensions are None. E.g. shape=[None] is ok, but shape=None is not.
which seems to be produced in:
X_l, y_l = tf.boolean_mask(X, tf.logical_not(missing)), tf.boolean_mask(y, tf.logical_not(missing))
I am not sure what is causing the error, but it seems to be a tensor shape issue that only occurs when run_eagerly=False. I need the boolean_mask functionality in order to distinguish the labeled and unlabeled data. I hope someone can help me out. In order to reproduce the error, I added the model and a small simulation example below. The simulation will produce the error when run_eagerly=False is set.
Thanks in advance.
Model definition:
from tensorflow import keras
import tensorflow as tf
metric_acc = keras.metrics.BinaryAccuracy()
metric_loss = keras.metrics.Mean('loss')
class VAT(keras.Model):

    def __init__(self, units_1=16, units_2=16, dropout=0.3, xi=1e-6, epsilon=2.0, alpha=1.0):
        super(VAT, self).__init__()

        # Set model parameters
        self.units_1 = units_1
        self.units_2 = units_2
        self.dropout = dropout
        self.xi = xi
        self.epsilon = epsilon
        self.alpha = alpha

        # First hidden
        self.dense1 = keras.layers.Dense(self.units_1)
        self.activation1 = keras.layers.Activation(tf.nn.leaky_relu)
        self.dropout1 = keras.layers.Dropout(self.dropout)

        # Second hidden
        self.dense2 = keras.layers.Dense(self.units_2)
        self.activation2 = keras.layers.Activation(tf.nn.leaky_relu)
        self.dropout2 = keras.layers.Dropout(self.dropout)

        # Output layer
        self.dense3 = keras.layers.Dense(1)
        self.activation3 = keras.layers.Activation("sigmoid")

    def call(self, inputs, training=None, mask=None):
        x1 = self.dense1(inputs)
        x2 = self.activation1(x1)
        x3 = self.dropout1(x2, training=True)
        x4 = self.dense2(x3)
        x5 = self.activation2(x4)
        x6 = self.dropout2(x5, training=True)
        x7 = self.dense3(x6)
        x8 = self.activation3(x7)
        return x8
    def generate_perturbation(self, inputs):
        # Generate normal vectors
        d = tf.random.normal(shape=tf.shape(inputs))

        # Normalize vectors
        d = tf.math.l2_normalize(d, axis=1)

        # Calculate r
        r = self.xi * d

        # Make predictions
        p = self(inputs, training=True)

        # Tape gradient
        with tf.GradientTape() as tape:
            tape.watch(r)

            # Perturbed predictions
            p_perturbed = self(inputs + r, training=True)

            # Calculate divergence
            D = keras.losses.KLD(p, p_perturbed) + keras.losses.KLD(1 - p, 1 - p_perturbed)

        # Calculate gradient
        gradient = tape.gradient(D, r)

        # Calculate r_vadv
        r_vadv = tf.math.l2_normalize(gradient, axis=1)

        # Return virtual adversarial perturbation
        return r_vadv
    @tf.function
    def train_step(self, data):
        # Unpack data
        X, y = data

        # Missing label boolean indices
        missing = tf.squeeze(tf.equal(y, -1))

        # Split data into labeled and unlabeled data
        X_l, y_l = tf.boolean_mask(X, tf.logical_not(missing)), tf.boolean_mask(y, tf.logical_not(missing))
        X_u = tf.boolean_mask(X, missing)

        # Calculate virtual perturbations for labeled and unlabeled
        r_l = self.generate_perturbation(X_l)
        r_u = self.generate_perturbation(X_u)

        # Tape gradient
        with tf.GradientTape() as model_tape:
            model_tape.watch(self.trainable_variables)

            # Calculate probabilities real data
            prob_l, prob_u = self(X_l, training=True), self(X_u, training=True)

            # Calculate probabilities perturbed data
            prob_r_l, prob_r_u = self(X_l + self.epsilon * r_l, training=True), self(X_u + self.epsilon * r_u, training=True)

            # Calculate loss
            loss = vat_loss(y_l, prob_l, prob_u, prob_r_l, prob_r_u, self.alpha)

        # Calculate gradient
        model_gradient = model_tape.gradient(loss, self.trainable_variables)

        # Update weights
        self.optimizer.apply_gradients(zip(model_gradient, self.trainable_variables))

        # Compute metrics
        metric_acc.update_state(y_l, prob_l)
        metric_loss.update_state(loss)
        return {'loss': metric_loss.result(), 'accuracy': metric_acc.result()}

    @property
    def metrics(self):
        return [metric_loss, metric_acc]
def vat_loss(y_l, prob_l, prob_u, prob_r_l, prob_r_u, alpha):
    N_l = tf.cast(tf.size(prob_l), dtype=tf.dtypes.float32)
    N_u = tf.cast(tf.size(prob_u), dtype=tf.dtypes.float32)

    if tf.equal(N_l, 0):
        # No labeled examples: get contribution from unlabeled data using perturbations
        R_vadv = tf.reduce_sum(
            keras.losses.KLD(prob_u, prob_r_u)
            + keras.losses.KLD(1 - prob_u, 1 - prob_r_u)
        )
        return alpha * R_vadv / N_u

    elif tf.equal(N_u, 0):
        # No unlabeled examples: get contribution from labeled data
        R = tf.reduce_sum(keras.losses.binary_crossentropy(y_l, prob_l))
        R_vadv = tf.reduce_sum(
            keras.losses.KLD(prob_l, prob_r_l)
            + keras.losses.KLD(1 - prob_l, 1 - prob_r_l)
        )
        return R / N_l + alpha * R_vadv / N_l

    else:
        # Get contribution from labeled data
        R = tf.reduce_sum(keras.losses.binary_crossentropy(y_l, prob_l))

        # Get contribution from labeled and unlabeled data using perturbations
        R_vadv = tf.reduce_sum(
            keras.losses.KLD(prob_l, prob_r_l)
            + keras.losses.KLD(1 - prob_l, 1 - prob_r_l)
        ) + tf.reduce_sum(
            keras.losses.KLD(prob_u, prob_r_u)
            + keras.losses.KLD(1 - prob_u, 1 - prob_r_u)
        )
        return R / N_l + alpha * R_vadv / (N_l + N_u)
Simulation example:
To show that the model/code works as desired (when using run_eagerly=True), I made a simulation example. In this example, I bias which observations are labeled/unlabeled. The figure below illustrates the labeled observations used by the model (yellow or purple) and the unlabeled observations (blue).
The VAT produces an accuracy of around ~0.75, whereas the reference model produces an accuracy of around ~0.58. These accuracies are produced without hyperparameter tuning.
from modules.vat import VAT
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
def create_biased_sample(x, proportion_labeled):
    labeled = np.random.choice([True, False], p=[proportion_labeled, 1 - proportion_labeled])
    if x[0] < 0.0:
        return False
    elif x[0] > 1.0:
        return False
    else:
        return labeled
# Simulation parameters
N = 2000
proportion_labeled = 0.15
# Model training parameters
BATCH_SIZE = 128
BUFFER_SIZE = 60000
EPOCHS = 100
# Generate a dataset
X, y = datasets.make_moons(n_samples=N, noise=.05, random_state=3)
X, y = X.astype('float32'), y.astype('float32')
y = y.reshape(-1, 1)
# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
# Simulate missing labels
sample_biased = lambda x: create_biased_sample(x, proportion_labeled)
labeled = np.array([sample_biased(k) for k in X_train])
y_train[~ labeled] = -1
# Estimate VAT model
vat = VAT(dropout=0.2, units_1=16, units_2=16, epsilon=0.5)
vat.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), run_eagerly=True)
vat.fit(X_train, y_train, batch_size=BATCH_SIZE, epochs=EPOCHS, shuffle=True)
# Estimate a reference model
reference = keras.models.Sequential([
    keras.layers.Input(shape=(2,)),
    keras.layers.Dense(16),
    keras.layers.Activation(tf.nn.leaky_relu),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(16),
    keras.layers.Activation(tf.nn.leaky_relu),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(1),
    keras.layers.Activation("sigmoid")
])
reference.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01), loss=keras.losses.binary_crossentropy, run_eagerly=False)
reference.fit(X_train[y_train.flatten() != -1, :], y_train[y_train.flatten() != -1], batch_size=BATCH_SIZE, epochs=EPOCHS, shuffle=True)
# Calculate out-of-sample accuracies
test_acc_vat = tf.reduce_mean(keras.metrics.binary_accuracy(y_test, vat(X_test, training=False)))
test_acc_reference = tf.reduce_mean(keras.metrics.binary_accuracy(y_test, reference(X_test, training=False)))
# Print results
print('Test accuracy of VAT: {}'.format(test_acc_vat))
print('Test accuracy of reference model: {}'.format(test_acc_reference))
# Plot scatter
plt.scatter(X_test[:, 0], X_test[:, 1])
plt.scatter(X_train[y_train.flatten() != -1, 0], X_train[y_train.flatten() != -1, 1], c=y_train.flatten()[y_train.flatten() != -1])
For anyone who is interested, I solved the issue by adding the following in the train_step() method:
missing.set_shape([None])
It should go right after declaring the tensor missing. I found the solution in this thread: Tensorflow boolean_mask with dynamic mask.
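For clarity, here is a minimal sketch of the fix in context (only the first lines of train_step are shown; everything else stays unchanged). The set_shape call tells the graph tracer that missing is a rank-1 tensor of unknown length, which is what tf.boolean_mask needs when run_eagerly=False:

    @tf.function
    def train_step(self, data):
        # Unpack data
        X, y = data

        # Missing label boolean indices
        missing = tf.squeeze(tf.equal(y, -1))
        missing.set_shape([None])  # declare the mask as rank 1 so boolean_mask can trace

        # Split data into labeled and unlabeled data
        X_l, y_l = tf.boolean_mask(X, tf.logical_not(missing)), tf.boolean_mask(y, tf.logical_not(missing))
        X_u = tf.boolean_mask(X, missing)
        # ... rest of train_step unchanged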
I'm trying to learn some PyTorch and am referencing this discussion here
The author provides a minimum working piece of code that illustrates how you can use PyTorch to solve for an unknown linear function that has been polluted with random noise.
This code runs fine for me.
However, when I change the function so that t = X^2, the parameters do not seem to converge.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
# Let's make some data for a linear regression.
A = 3.1415926
b = 2.7189351
error = 0.1
N = 100 # number of data points
# Data
X = Variable(torch.randn(N, 1))
# (noisy) Target values that we want to learn.
t = X * X + Variable(torch.randn(N, 1) * error)
# Creating a model, making the optimizer, defining loss
model = nn.Linear(1, 1)
optimizer = optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()
# Run training
niter = 50
for _ in range(0, niter):
    optimizer.zero_grad()
    predictions = model(X)
    loss = loss_fn(predictions, t)
    loss.backward()
    optimizer.step()

    print("-" * 50)
    print("error = {}".format(loss.data[0]))
    print("learned A = {}".format(list(model.parameters())[0].data[0, 0]))
    print("learned b = {}".format(list(model.parameters())[1].data[0]))
When I execute this code, the learned A and b parameters appear random, so the model does not converge. I thought it should converge because you can approximate any function with a slope and an offset. My theory is that I'm using PyTorch incorrectly.
Can anyone identify a problem with my t = X * X + Variable(torch.randn(N, 1) * error) line of code?
You cannot fit a 2nd-degree polynomial with a linear function, so you cannot expect better than essentially random parameters (your targets are samples from the polynomial, which a line cannot capture).
What you can do is use two inputs, x and x^2, and fit from them:
model = nn.Linear(2, 1) # you have 2 inputs now
X_input = torch.cat((X, X**2), dim=1) # have 2 inputs per entry
# ...
predictions = model(X_input) # 2 inputs -> 1 output
loss = loss_fn(predictions, t)
# ...
# learning t = c*x^2 + a*x + b
print("learned a = {}".format(list(model.parameters())[0].data[0, 0]))
print("learned c = {}".format(list(model.parameters())[0].data[0, 1]))
print("learned b = {}".format(list(model.parameters())[1].data[0]))
I am new to Spark; my current version is 1.3.1. I want to implement logistic regression with PySpark, so I found this example in the Spark Python MLlib documentation:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint
from numpy import array
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])
data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)
# Build the model
model = LogisticRegressionWithLBFGS.train(parsedData)
# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
And I found that the attributes of the model are:
In [21]: model.<TAB>
model.clearThreshold model.predict model.weights
model.intercept model.setThreshold
How can I get the coefficients of logistic regression?
As you noticed, the way to obtain the coefficients is through LogisticRegressionModel's attributes.
Parameters:
weights – Weights computed for every feature.
intercept – Intercept computed for this model. (Only used in Binary Logistic Regression. In Multinomial Logistic Regression, the intercepts will not be a single value, so the intercepts will be part of the weights.)
numFeatures – the dimension of the features.
numClasses – the number of possible outcomes for a k-class classification problem in Multinomial Logistic Regression. By default it is binary logistic regression, so numClasses will be set to 2.
Don't forget that hθ(x) = 1 / (1 + e^-(θ0 + θ1*x1 + ... + θn*xn)), where θ0 represents the intercept, [θ1, ..., θn] the weights, and n is the number of features.
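As a concrete illustration (my own sketch, assuming the trained binary model and parsedData from the question), you can read the coefficients directly and reproduce hθ(x) by hand:

from math import exp

# Coefficients and intercept of the trained binary model
print(model.weights)      # DenseVector of per-feature weights [θ1, ..., θn]
print(model.intercept)    # θ0

# Reproduce the probability for one training point by hand
point = parsedData.first()
margin = model.weights.dot(point.features) + model.intercept
prob = 1.0 / (1.0 + exp(-margin))
print(prob, model.predict(point.features))   # predict() returns the thresholded class by default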
Edit
This is how the prediction is done; you can check LogisticRegressionModel's source:
def predict(self, x):
    """
    Predict values for a single data point or an RDD of points
    using the model trained.
    """
    if isinstance(x, RDD):
        return x.map(lambda v: self.predict(v))

    x = _convert_to_vector(x)
    if self.numClasses == 2:
        margin = self.weights.dot(x) + self._intercept
        if margin > 0:
            prob = 1 / (1 + exp(-margin))
        else:
            exp_margin = exp(margin)
            prob = exp_margin / (1 + exp_margin)
        if self._threshold is None:
            return prob
        else:
            return 1 if prob > self._threshold else 0
    else:
        best_class = 0
        max_margin = 0.0
        if x.size + 1 == self._dataWithBiasSize:
            for i in range(0, self._numClasses - 1):
                margin = x.dot(self._weightsMatrix[i][0:x.size]) + \
                    self._weightsMatrix[i][x.size]
                if margin > max_margin:
                    max_margin = margin
                    best_class = i + 1
        else:
            for i in range(0, self._numClasses - 1):
                margin = x.dot(self._weightsMatrix[i])
                if margin > max_margin:
                    max_margin = margin
                    best_class = i + 1
        return best_class
I'm using
model.coefficients
and it works!
Documentation:
https://spark.apache.org/docs/2.4.5/api/python/pyspark.ml.html?highlight=coefficients#pyspark.ml.classification.LogisticRegressionModel.coefficients
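Note that coefficients (and intercept) belong to the DataFrame-based pyspark.ml API linked above, not to the RDD-based pyspark.mllib model from the question, and they require a newer Spark (2.x). A minimal sketch of how that might look; the tiny training frame here is purely illustrative:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny illustrative training frame with "label" and "features" columns
train = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 1.1])),
    (1.0, Vectors.dense([2.0, 1.0])),
    (1.0, Vectors.dense([2.2, 1.5])),
], ["label", "features"])

lr = LogisticRegression(maxIter=10)
model = lr.fit(train)

print(model.coefficients)   # vector of per-feature coefficients
print(model.intercept)      # intercept term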
I'm implementing logistic regression based on the Coursera documentation, both in python and Octave.
In Octave, I managed to do it and achieve the right training accuracy, but in Python, since I don't have access to fminunc, I cannot figure out a workaround.
Currently, this is my code:
import pandas
import numpy as np

df = pandas.DataFrame.from_csv('ex2data2.txt', header=None, index_col=None)
df.columns = ['x1', 'x2', 'y']
y = df[df.columns[-1]].as_matrix()
m = len(y)
y = y.reshape(m, 1)
X = df[df.columns[:-1]]
X = X.as_matrix()
from sklearn.preprocessing import PolynomialFeatures
feature_mapper = PolynomialFeatures(degree=6)
X = feature_mapper.fit_transform(X)
def sigmoid(z):
    return 1/(1+np.power(np.e, z))

def cost_function_reg(theta):
    _theta = theta.copy().reshape(-1, 1)
    shifted_theta = np.insert(_theta[1:], 0, 0)
    h = sigmoid(np.dot(X, _theta))
    reg = (_lambda / (2.0*m)) * shifted_theta.T.dot(shifted_theta)
    J = ((1.0/m)*(-y.T.dot(np.log(h)) - (1 - y).T.dot(np.log(1-h)))) + reg
    return J

def gradient(theta):
    _theta = theta.copy().reshape(-1, 1)
    shifted_theta = np.insert(_theta[1:], 0, 0)
    h = sigmoid(np.dot(X, _theta))
    gradR = _lambda*shifted_theta
    gradR.shape = (gradR.shape[0], 1)
    grad = (1.0/m)*(X.T.dot(h-y)+gradR)
    return grad.flatten()
from scipy.optimize import *
theta = fmin_ncg(cost_f, initial_theta, fprime=gradient)
predictions = predict(theta, X)
accuracy = np.mean(np.double(predictions == y)) * 100
print 'Train Accuracy: %.2f' % accuracy
The output is:
Warning: Desired error not necessarily achieved due to precision loss.
Current function value: 0.693147
Iterations: 0
Function evaluations: 22
Gradient evaluations: 12
Hessian evaluations: 0
Train Accuracy: 50.85
In Octave, the accuracy is 83.05.
Any help is appreciated.
There were two problems with that implementation:
First, fmin_ncg is not ideal for that minimization. I had used it in the previous exercise, but it was failing to find theta with that gradient function, which is identical to the one in Octave.
Switching to
theta = fmin_bfgs(cost_function_reg, initial_theta)
Fixed that issue.
The second issue was that the accuracy was being miscalculated.
Once I optimized with fmin_bfgs and achieved a cost matching the Octave result (0.529), the (predictions == y) comparison was between arrays of mismatched shapes (predictions against y of shape (118, 1)), which broadcast to a (118, 118) matrix instead of a vector.
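A sketch of the corrected ending, assuming the rest of the question's code (I assume predict returns a flat 0/1 array, which is why the reshape is needed):

from scipy.optimize import fmin_bfgs

theta = fmin_bfgs(cost_function_reg, initial_theta)

# Make shapes match before comparing, otherwise broadcasting gives a (118, 118) matrix
predictions = predict(theta, X).reshape(-1, 1)   # column vector, same shape as y
accuracy = np.mean(np.double(predictions == y)) * 100
print('Train Accuracy: %.2f' % accuracy)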