I am working on applying DL to a regression problem and some of the outputs need to be integers while others can be floats. So far I have built a NN which returns floats for all but I want to go to the next step and actually return ints vs floats for the different outputs.
Previously I asked a question where I provided a simple example of regression for y = m * x + b which I was able to solve on my own. In this example, how would the code be changed to ensure b is integer while m is float?
#!/usr/bin/env python3
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
ARANGE = (-5.0, 5.0) # Possible values for m in training data
BRANGE = (0.0, 10.0) # Possible values for b in training data
X_MIN = 1.0
X_MAX = 9.0
N = 10 # Number of grid points
M = 2 # Number of {(x,y)} sets to train on
def gen_ab(arange, brange):
    """ arange, brange are tuples of floats """
    a = (arange[1] - arange[0])*np.random.rand() + arange[0]
    b = (brange[1] - brange[0])*np.random.rand() + brange[0]
    return (a, b)
def build_model(x_data, y_data):
    """ Build the model using input / output training data
    Args:
        x_data (np array): Size (m, n*2) grid of input training data.
        y_data (np array): Size (m, 2) grid of output training data.
    Returns:
        model (Sequential model)
    """
    model = keras.Sequential()
    model.add(layers.Dense(64, activation='relu', input_dim=len(x_data[0])))
    # Linear output layer with one unit per target column (a, b)
    model.add(layers.Dense(len(y_data[0])))
    optimizer = tf.keras.optimizers.RMSprop(0.001)
    model.compile(loss='mse', optimizer=optimizer, metrics=['mae', 'mse'])
    return model
def gen_data(xs, arange, brange, m):
    """ Generate training data for lines of y = a*x + b
    Args:
        xs (np array): Grid points (size n)
        arange (tuple): Range to use for a (a_min, a_max)
        brange (tuple): Range to use for b (b_min, b_max)
        m (int): Number of y grids to generate
    Returns:
        x_data (np array): Size (m, n*2) grid of input training data.
        y_data (np array): Size (m, 2) grid of output training data.
    """
    n = len(xs)
    x_data = np.zeros((m, 2*n))
    y_data = np.zeros((m, 2))
    for ix in range(m):
        (a, b) = gen_ab(arange, brange)
        ys = a*xs + b*np.ones(xs.size)
        x_data[ix, :] = np.concatenate((xs, ys))
        y_data[ix, :] = [a, b]
    return (x_data, y_data)
def main():
    """ Main routine """
    # Generate the x axis grid to be used for all training sets
    xs = np.linspace(X_MIN, X_MAX, N)
    # Generate the training data
    # x_train has M rows (M is the number of training samples)
    # x_train has 2*N columns (first N columns are x, second N columns are y)
    # y_train has M rows, each of which has two columns (a, b) for y = ax + b
    (x_train, y_train) = gen_data(xs, ARANGE, BRANGE, M)
    model = build_model(x_train, y_train)
    model.fit(x_train, y_train, epochs=10, batch_size=32)
    ### Test example ###
    (a, b) = gen_ab(ARANGE, BRANGE)
    ys = a*xs + b*np.ones(xs.size)
    rys = np.concatenate((xs, ys))
    ab1 = model.predict(x_train)
    ab2 = model.predict(np.array([rys]))

if __name__ == "__main__":
    main()
I think this is possible, but it is not as trivial as it sounds. Unfortunately, you can't simply have the NN output an int and a float and keep the MSE loss you are using now: the discrete nature of the int values prevents the loss function from being continuously differentiable, which the optimisers need.
If one really wanted to do it, one could treat the int output as a multi-class output (while treating the float output as before). You would need to craft a loss function that combines the two outputs (multi-class + float). You could one-hot-encode the integer targets and apply a softmax to the multi-class outputs. An interesting complication is that the neural network would not know that the multi-class output is actually ordinal (ordered, since 1<2<3<4 etc.). There have been interesting attempts in the past to help NNs realise this (see Neural Network Ordinal Classification for Age).
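To make the "multi class + float" idea a bit more concrete, here is a rough NumPy sketch of such a combined loss. It is only an illustration under assumptions: the helper names (softmax, mixed_loss), the weight parameter and the choice of 11 classes (integers 0 to 10, matching BRANGE) are mine and do not come from the question's code.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def mixed_loss(m_pred, b_logits, m_true, b_true, weight=1.0):
    """MSE on the float output plus cross-entropy on the integer class index.

    weight is a hypothetical knob that balances the two terms.
    """
    mse = np.mean((m_pred - m_true) ** 2)
    probs = softmax(b_logits)
    ce = -np.mean(np.log(probs[np.arange(len(b_true)), b_true]))
    return mse + weight * ce

# Toy batch of two samples; b is assumed to be an integer in 0..10 (11 classes)
m_true = np.array([1.5, -2.0])
b_true = np.array([3, 7])
m_pred = np.array([1.4, -2.2])

b_logits = np.zeros((2, 11))
b_logits[0, 3] = 5.0   # confident and correct
b_logits[1, 6] = 5.0   # confident but off by one

loss = mixed_loss(m_pred, b_logits, m_true, b_true)

# At prediction time the integer is recovered with an argmax
b_pred = softmax(b_logits).argmax(axis=1)

# Same float predictions, but correct classes: the loss should drop
good_logits = np.zeros((2, 11))
good_logits[0, 3] = 5.0
good_logits[1, 7] = 5.0
better_loss = mixed_loss(m_pred, good_logits, m_true, b_true)
```

Note that the cross-entropy term penalises "off by one" exactly as much as "off by five", which is the ordinal issue mentioned above.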
I could not fully understand how the gradients were computed, especially with regard to the matrix transposes. My question is about dW2, but if you also want to discuss the computation of the other gradients and extend my question, I am open to discussion. Mathematically things seem a little different from what I expected, but this code is reliable and on GitHub, so I trust it.
from __future__ import print_function
from builtins import range
from builtins import object
import numpy as np
import matplotlib.pyplot as plt
from past.builtins import xrange
class TwoLayerNet(object):
A two-layer fully-connected neural network. The net has an input dimension of
D, a hidden layer dimension of H, and performs classification over C classes.
We train the network with a softmax loss function and L2 regularization on the
weight matrices. The network uses a ReLU nonlinearity after the first fully
connected layer.
In other words, the network has the following architecture:
input - fully connected layer - ReLU - fully connected layer - softmax
The outputs of the second fully-connected layer are the scores for each class.
def __init__(self, input_size, hidden_size, output_size, std=1e-4):
Initialize the model. Weights are initialized to small random values and
biases are initialized to zero. Weights and biases are stored in the
variable self.params, which is a dictionary with the following keys:
W1: First layer weights; has shape (D, H)
b1: First layer biases; has shape (H,)
W2: Second layer weights; has shape (H, C)
b2: Second layer biases; has shape (C,)
- input_size: The dimension D of the input data.
- hidden_size: The number of neurons H in the hidden layer.
- output_size: The number of classes C.
self.params = {}
self.params['W1'] = std * np.random.randn(input_size, hidden_size)
self.params['b1'] = np.zeros(hidden_size)
self.params['W2'] = std * np.random.randn(hidden_size, output_size)
self.params['b2'] = np.zeros(output_size)
def loss(self, X, y=None, reg=0.0):
Compute the loss and gradients for a two layer fully connected neural
- X: Input data of shape (N, D). Each X[i] is a training sample.
- y: Vector of training labels. y[i] is the label for X[i], and each y[i] is
an integer in the range 0 <= y[i] < C. This parameter is optional; if it
is not passed then we only return scores, and if it is passed then we
instead return the loss and gradients.
- reg: Regularization strength.
If y is None, return a matrix scores of shape (N, C) where scores[i, c] is
the score for class c on input X[i].
If y is not None, instead return a tuple of:
- loss: Loss (data loss and regularization loss) for this batch of training
- grads: Dictionary mapping parameter names to gradients of those parameters
with respect to the loss function; has the same keys as self.params.
# Unpack variables from the params dictionary
W1, b1 = self.params['W1'], self.params['b1']
W2, b2 = self.params['W2'], self.params['b2']
N, D = X.shape
# Compute the forward pass
scores = None
# TODO: Perform the forward pass, computing the class scores for the input. #
# Store the result in the scores variable, which should be an array of #
# shape (N, C). #
# perform the forward pass and compute the class scores for the input
# input - fully connected layer - ReLU - fully connected layer - softmax
# define lambda function for relu
relu = lambda x: np.maximum(0, x)
# a1 = X x W1 = (N x D) x (D x H) = N x H
a1 = relu(X.dot(W1) + b1) # activations of fully connected layer #1
# store the result in the scores variable, which should be an array of
# shape (N, C).
# scores = a1 x W2 = (N x H) x (H x C) = N x C
scores = a1.dot(W2) + b2 # class scores (the softmax itself is applied in the loss below)
# If the targets are not given then jump out, we're done
if y is None:
return scores
# Compute the loss
loss = None
# TODO: Finish the forward pass, and compute the loss. This should include #
# both the data loss and L2 regularization for W1 and W2. Store the result #
# in the variable loss, which should be a scalar. Use the Softmax #
# classifier loss. #
# shift values for 'scores' for numeric reasons (over-flow cautious)
# figure out the max score across all classes
# scores.shape is N x C
scores -= scores.max(axis = 1, keepdims = True)
# probs.shape is N x C
probs = np.exp(scores)/np.sum(np.exp(scores), axis = 1, keepdims = True)
loss = -np.log(probs[np.arange(N), y])
# loss is a single number
loss = np.sum(loss)
# Right now the loss is a sum over all training examples, but we want it
# to be an average instead so we divide by N.
loss /= N
# Add regularization to the loss.
loss += reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
# Backward pass: compute gradients
grads = {}
# TODO: Compute the backward pass, computing the derivatives of the weights #
# and biases. Store the results in the grads dictionary. For example, #
# grads['W1'] should store the gradient on W1, and be a matrix of same size #
# since dL(i)/df(k) = p(k) - 1 (if k = y[i]), where f is a vector of scores for the given example
# i is the training sample and k is the class
dscores = probs.reshape(N, -1) # dscores is (N x C)
dscores[np.arange(N), y] -= 1
# since scores = a1.dot(W2), we get dW2 by multiplying a1.T and dscores
# W2 is H x C so dW2 should also match those dimensions
# a1.T x dscores = (H x N) x (N x C) = H x C
dW2 = np.dot(a1.T, dscores)
# Right now the gradient is a sum over all training examples, but we want it
# to be an average instead so we divide by N.
dW2 /= N
# b2 gradient: sum dscores over the N samples (one value per class)
db2 = dscores.sum(axis = 0)/N
# since a1 = X.dot(W1), we get dW1 by multiplying X.T and da1
# W1 is D x H so dW1 should also match those dimensions
# X.T x da1 = (D x N) x (N x H) = D x H
# first get da1 using scores = a1.dot(W2)
# a1 is N x H so da1 should also match those dimensions
# dscores x W2.T = (N x C) x (C x H) = N x H
da1 = dscores.dot(W2.T)
da1[a1 == 0] = 0 # set gradient of units that did not activate to 0
dW1 = X.T.dot(da1)
# Right now the gradient is a sum over all training examples, but we want it
# to be an average instead so we divide by N.
dW1 /= N
# b1 gradient: sum da1 over the N samples (one value per hidden unit)
db1 = da1.sum(axis = 0)/N
# Add regularization loss to the gradient
dW1 += 2 * reg * W1
dW2 += 2 * reg * W2
grads = {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
return loss, grads
def train(self, X, y, X_val, y_val,
learning_rate=1e-3, learning_rate_decay=0.95,
reg=5e-6, num_iters=100,
batch_size=200, verbose=False):
Train this neural network using stochastic gradient descent.
- X: A numpy array of shape (N, D) giving training data.
- y: A numpy array of shape (N,) giving training labels; y[i] = c means that
X[i] has label c, where 0 <= c < C.
- X_val: A numpy array of shape (N_val, D) giving validation data.
- y_val: A numpy array of shape (N_val,) giving validation labels.
- learning_rate: Scalar giving learning rate for optimization.
- learning_rate_decay: Scalar giving factor used to decay the learning rate
after each epoch.
- reg: Scalar giving regularization strength.
- num_iters: Number of steps to take when optimizing.
- batch_size: Number of training examples to use per step.
- verbose: boolean; if true print progress during optimization.
num_train = X.shape[0]
iterations_per_epoch = max(num_train // batch_size, 1) # integer division so that it % iterations_per_epoch works as intended
# Use SGD to optimize the parameters in self.model
loss_history = []
train_acc_history = []
val_acc_history = []
for it in range(num_iters):
X_batch = None
y_batch = None
# TODO: Create a random minibatch of training data and labels, storing #
# them in X_batch and y_batch respectively. #
# generate random indices
indices = np.random.choice(num_train, batch_size)
X_batch, y_batch = X[indices], y[indices]
# Compute loss and gradients using the current minibatch
loss, grads = self.loss(X_batch, y=y_batch, reg=reg)
# TODO: Use the gradients in the grads dictionary to update the #
# parameters of the network (stored in the dictionary self.params) #
# using stochastic gradient descent. You'll need to use the gradients #
# stored in the grads dictionary defined above. #
self.params['W1'] -= learning_rate * grads['W1']
self.params['W2'] -= learning_rate * grads['W2']
self.params['b1'] -= learning_rate * grads['b1']
self.params['b2'] -= learning_rate * grads['b2']
if verbose and it % 100 == 0:
print('iteration %d / %d: loss %f' % (it, num_iters, loss))
# Every epoch, check train and val accuracy and decay learning rate.
if it % iterations_per_epoch == 0:
# Check accuracy
train_acc = (self.predict(X_batch) == y_batch).mean()
val_acc = (self.predict(X_val) == y_val).mean()
# Decay learning rate
learning_rate *= learning_rate_decay
return {
'loss_history': loss_history,
'train_acc_history': train_acc_history,
'val_acc_history': val_acc_history,
}
def predict(self, X):
Use the trained weights of this two-layer network to predict labels for
data points. For each data point we predict scores for each of the C
classes, and assign each data point to the class with the highest score.
- X: A numpy array of shape (N, D) giving N D-dimensional data points to
- y_pred: A numpy array of shape (N,) giving predicted labels for each of
the elements of X. For all i, y_pred[i] = c means that X[i] is predicted
to have class c, where 0 <= c < C.
y_pred = None
# TODO: Implement this function; it should be VERY simple! #
# define lambda function for relu
relu = lambda x: np.maximum(0, x)
# activations of fully connected layer #1
a1 = relu(X.dot(self.params['W1']) + self.params['b1'])
# class scores (argmax of the scores equals argmax of the softmax)
# scores = a1 x W2 = (N x H) x (H x C) = N x C
scores = a1.dot(self.params['W2']) + self.params['b2']
y_pred = np.argmax(scores, axis = 1)
return y_pred
With regard to the code above, I could not understand how dW2 was computed. I took a picture of the point I need clarified, and I would like an explanation of the difference.
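One way to convince yourself that dW2 = a1.T.dot(dscores) / N is right, transpose included, is to compare it against a numerical gradient. The snippet below is a minimal sketch and not part of the original code: the sizes, the random data and the helper name softmax_loss are assumptions chosen only for the check, and regularization is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
N, H, C = 5, 4, 3                       # small, arbitrary sizes
a1 = rng.standard_normal((N, H))        # stands in for the hidden activations
W2 = rng.standard_normal((H, C))
b2 = rng.standard_normal(C)
y = rng.integers(0, C, size=N)

def softmax_loss(W):
    """Mean softmax cross-entropy loss as a function of W2 only."""
    scores = a1.dot(W) + b2
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean()
    return loss, probs

# Analytic gradient, written exactly as in the class above
_, probs = softmax_loss(W2)
dscores = probs.copy()
dscores[np.arange(N), y] -= 1
dW2 = a1.T.dot(dscores) / N             # (H x N) x (N x C) = H x C

# Numerical gradient via central differences, one entry of W2 at a time
eps = 1e-6
dW2_num = np.zeros_like(W2)
for h in range(H):
    for c in range(C):
        W_plus, W_minus = W2.copy(), W2.copy()
        W_plus[h, c] += eps
        W_minus[h, c] -= eps
        dW2_num[h, c] = (softmax_loss(W_plus)[0] - softmax_loss(W_minus)[0]) / (2 * eps)

max_diff = np.abs(dW2 - dW2_num).max()
```

The transpose is what makes the shapes line up: entry (h, c) of dW2 is the sum over samples i of a1[i, h] * dscores[i, c], which is exactly what a1.T.dot(dscores) computes.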
In TensorFlow, I want to transform a tensor using the Taylor series of sin(x) with a certain number of approximation terms. I have tried this on a grayscale image (shape (32,32)) and it works fine. Now I have trouble applying the same approach to an RGB image with shape (32,32,3): it doesn't give me the correct array. Can anyone show me a possible way of doing this in TensorFlow?
my attempt:
here is the Taylor expansion of sin(x) at x=0 with three expansion terms: x - x**3/6 + x**5/120.
from tensorflow.keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
x= X_train[1,:,:,1]
k= 3
func = 'sin(x)'
new_x = np.zeros((x.shape[0], x.shape[1]*k))
new_x = new_x.astype('float32')
nn = 0
for i in range(x.shape[1]):
col_d = x[:,i].ravel()
new_x[:,nn] = col_d
if n_terms > 0:
for j in range(1,k):
if func == 'cos(x)':
new_x[:,nn+j] = new_x[:,nn+j-1]
I think I could do this more efficiently with TensorFlow, but it is not intuitive to me how to do it. Can anyone suggest a possible workaround to make this work?
For the 2-dim array, col_d = x[:,i].ravel() is a flattened pixel vector. Similarly, we could reshape the 3-dim array to 2 dims like this: x.transpose(0,1,2).reshape(x.shape[1],-1), so inside the for loop it could be x[:,i].transpose(0,1,2).reshape(x.shape[1],-1), but this is still not correct. I think TensorFlow might have a better way of doing this.
Intuitively, x in the Taylor series of sin(x) is a tensor, and if we want only 2 or 3 approximation terms for each element, I want to concatenate those terms in a new tensor. How should we do this efficiently in TensorFlow?
new_x = np.zeros((x.shape[0], x.shape[1]*n_terms))
This line makes little sense: why allocate space for 96 elements when there are only 3 Taylor expansion terms?
(new_x[:, 3:] == 0.0).all()  # check: evaluates to True
For pixelwise taylor expansion with n-terms
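import math

def sin_exp_step(x, i):
    # i-th term of the Taylor series of sin(x): (-1)^i * x^(2i+1) / (2i+1)!
    c1 = 2 * i + 1
    c2 = (-1) ** i / math.factorial(c1)  # np.math is not available in recent NumPy releases
    t = c2 * (x ** c1)
    return t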
# validate
n_terms = 6  # with only 3 terms the error at 45 degrees is about 4e-5
x = 45.0
x = (np.pi / 180.0) * x
y = np.sin(x)
approx_y = 0
for i in range(n_terms):
    approx_y += sin_exp_step(x, i)
abs(approx_y - y) < 1e-8
x= X_train[1,:,:,:]
n_terms = 3
func = 'sin(x)'
new_x = np.zeros((*x.shape, n_terms))
for i in range(0, n_terms):
if func == 'sin(x)': # sin(x)
new_x[..., i] += sin_exp_step(x, i)
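As a self-contained illustration of the loop above, the sketch below stacks the Taylor terms of sin(x) on a new last axis for an array shaped like an RGB image. It is an assumption-laden example: a random array with values in [0, 1) stands in for X_train[1], and the helper name sin_taylor_terms is hypothetical.

```python
import math
import numpy as np

def sin_taylor_terms(x, n_terms):
    """Return the first n_terms Taylor terms of sin(x), stacked on a new last axis."""
    terms = [
        ((-1) ** i / math.factorial(2 * i + 1)) * x ** (2 * i + 1)
        for i in range(n_terms)
    ]
    return np.stack(terms, axis=-1)

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))            # stand-in for one RGB image scaled to [0, 1)

terms = sin_taylor_terms(img, n_terms=3)  # shape (32, 32, 3, 3)
approx = terms.sum(axis=-1)               # x - x**3/6 + x**5/120 per pixel
max_err = np.abs(approx - np.sin(img)).max()
```

Summing over the last axis recovers the three-term approximation. The series converges slowly for large |x|, so raw pixel values in 0..255 would need rescaling first.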
Numerical approximation methods like this are commonly avoided, as they are computationally expensive (e.g. the factorial) and less stable, so gradient-based optimization is usually the better choice. When higher-order derivatives are needed, algorithms such as BFGS and L-BFGS approximate the Hessian matrix (second-order derivatives). Optimizers such as Adam and SGD are sufficient and come at a much lower computational cost. Using a neural network, we might be able to find a much better expansion.
Tensorflow solution for n-terms expansion
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.layers import Input, LocallyConnected2D
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = tf.constant(x_train, dtype=tf.float32)
x_test = tf.constant(x_test, dtype=tf.float32)
def expansion_approx_of(func):
def reconstruction_loss(y_true, y_pred):
loss = (y_pred - func(y_true)) ** 2
loss = 0.5 * K.mean(loss)
return loss
return reconstruction_loss
class Expansion2D(LocallyConnected2D): # n-terms expansion layer
def __init__(self, i_shape, n_terms, kernel_size=(1, 1), *args, **kwargs):
if len(i_shape) != 3:
raise ValueError('...')
self.i_shape = i_shape
self.n_terms = n_terms
filters = self.n_terms * self.i_shape[-1]
super(Expansion2D, self).__init__(filters=filters, kernel_size=kernel_size,
use_bias=False, *args, **kwargs)
def call(self, inputs):
shape = (-1, self.i_shape[0], self.i_shape[1], self.i_shape[-1], self.n_terms)
out = super().call(inputs)
expansion = tf.reshape(out, shape)
out = tf.math.reduce_sum(expansion, axis=-1)
return out, expansion
inputs = Input(shape=(32, 32, 3))
# expansion: might be a taylor expansion or something better.
out, expansion = Expansion2D(i_shape=(32, 32, 3), n_terms=3)(inputs)
model = Model(inputs, [out, expansion])
opt = tf.keras.optimizers.Adam(learning_rate=0.0001, beta_1=0.9, beta_2=0.999)
loss = expansion_approx_of(K.sin)
model.compile(optimizer=opt, loss=[loss])
model.fit(x_train, x_train, batch_size=1563, epochs=100)
x_pred, x_exp = model.predict_on_batch(x_test[:32])
print((x_exp[0].sum(axis=-1) == x_pred[0]).all())
err = abs(x_pred - np.sin(x_test[0])).mean()
Put three expansion terms into a tensor at axis=1
x = tf.ones([8, 32, 32, 3], tf.float32) * 0.5 # example batchsize=8, imageshape=[32, 32, 3]
x = tf.stack([x, - (1/6) * tf.math.pow(x, 3), (1/120) * tf.math.pow(x, 5)], axis=1) # expansion of three terms of sin(x), [8, 3, 32, 32, 3]
If you go with the tf.keras Functional API or Sequential API, you might write a Keras custom layer.
Edit: In the first version of this answer I recommended tf.keras.layers.Lambda, but it might not work with tf.math.pow or tf.stack (I haven't tried). A Keras custom layer is the safer route.
I think you can do this for a 1D tensor as:
def expend_func(x):
    p1 = x
    p2 = x - ((x**2)/2)
    new_x = K.concatenate([p1, p2], axis=1)
    return new_x
Note that x is your 1D tensor and new_x holds two terms. If you need new_x with three terms, you can modify expend_func to add a third. For a 2D tensor you should use tf.stack(), which is not the most elegant way, but it might help.
I saw this question: Implementing custom loss function in keras with condition And I need to do the same thing but with code that seems to need loops.
I have a custom NumPy function which calculates the mean Euclidean distance from the mean vector. I wrote this based on the paper https://arxiv.org/pdf/1801.05365.pdf:
import numpy as np
def mean_euclid_distance_from_mean_vector(n_vectors):
    dists = []
    for (i, v) in enumerate(n_vectors):
        # all vectors except the i-th one
        n_vectors_rest = n_vectors[np.arange(len(n_vectors)) != i]
        print("rest of vectors: ")
        # calculate mean vector
        mean_rest = n_vectors_rest.mean(axis=0)
        print("mean rest vector")
        dist = v - mean_rest
        print("dist vector")
        dists.append(dist)
    # dists is now a matrix of distance vectors (distance from the mean vector)
    dists = np.array(dists)
    print("distance vector matrix")
    # here we matmult each vector
    # sum them up
    # and divide by the total number of elements
    result = np.sum([np.matmul(d, d) for d in dists]) / dists.size
    return result
features = np.array([
    [1, 2, 3, 4],
    [2, 4, 4, 3],
    [3, 2, 1, 4],
], dtype=np.float64)
c = mean_euclid_distance_from_mean_vector(features)
However, I need this function to work inside TensorFlow with Keras, for example as a custom Lambda layer: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Lambda
I'm not sure how to implement the above in Keras/TensorFlow, since it has loops, and the way the paper describes calculating the m_i seems to require loops, as in my implementation above.
For reference, the PyTorch version of this code is here: https://github.com/PramuPerera/DeepOneClass
Given a feature map like:
features = np.array([
[1, 2, 3, 4],
[2, 4, 4, 3],
[3, 2, 1, 4],
], dtype=np.float64)
reflecting a batch_size of
batch_size = features.shape[0]
k = features.shape[1]
One has that implementing the above Formulas in Tensorflow could be expressed (prototyped) by:
dim = (batch_size, features.shape[1])
def zero(i):
    arr = np.ones(dim)
    arr[i] = 0
    return arr
mapper = [zero(i) for i in range(batch_size)]
elems = (features, mapper)
m = (1 / (batch_size - 1)) * tf.map_fn(lambda x: tf.math.reduce_sum(x[0] * x[1], axis=0), elems, dtype=tf.float64)
pairs = tf.map_fn(lambda x: tf.concat(x, axis=0) , tf.stack([features, m], 1), dtype=tf.float64)
compactness_loss = (1 / (batch_size * k)) * tf.map_fn(lambda x: tf.math.reduce_euclidean_norm(x), pairs, dtype=tf.float64)
with tf.Session() as sess:
print("loss value output is: ", compactness_loss.eval())
Which yields:
loss value output is: [0.64549722 0.79056942 0.64549722]
However a single measure is required for the batch, therefore it is necessary to reduce it; by the summation of all values.
The wanted Compactness Loss function à la Tensorflow is:
def compactness_loss(actual, features):
features = Flatten()(features)
k = 7 * 7 * 512
dim = (batch_size, k)
def zero(i):
z = tf.zeros((1, dim[1]), dtype=tf.dtypes.float32)
o = tf.ones((1, dim[1]), dtype=tf.dtypes.float32)
arr = []
for k in range(dim[0]):
arr.append(o if k != i else z)
res = tf.concat(arr, axis=0)
return res
masks = [zero(i) for i in range(batch_size)]
m = (1 / (batch_size - 1)) * tf.map_fn(
# row-wise summation
lambda mask: tf.math.reduce_sum(features * mask, axis=0),
masks, dtype=tf.float32)
dists = features - m
sqrd_dists = tf.pow(dists, 2)
red_dists = tf.math.reduce_sum(sqrd_dists, axis=1)
compact_loss = (1 / (batch_size * k)) * tf.math.reduce_sum(red_dists)
return compact_loss
Of course, the Flatten() could be moved back into the model for convenience, and k could be derived directly from the feature map; this answers your question. You may just have some trouble finding out what the expected values for the model are: feature maps from VGG16 (or any other architecture) trained on ImageNet, for instance?
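As a side note, the leave-one-out means m_i do not actually need a loop or per-sample masks: the mean of all rows except row i equals (sum of all rows - row i) / (n - 1). The NumPy sketch below checks this against a loop version modelled on the question's function, using the small features array from above. The function names are hypothetical, and porting the vectorized expression to TensorFlow ops is left as an assumption, not something tested here.

```python
import numpy as np

def compactness_loop(x):
    """Loop version: distance of each row from the mean of all the other rows."""
    n = len(x)
    dists = []
    for i, v in enumerate(x):
        rest = x[np.arange(n) != i]
        dists.append(v - rest.mean(axis=0))
    dists = np.array(dists)
    return np.sum(dists ** 2) / dists.size

def compactness_vectorized(x):
    """Same quantity without loops, using the leave-one-out mean identity."""
    n = x.shape[0]
    rest_mean = (x.sum(axis=0, keepdims=True) - x) / (n - 1)
    dists = x - rest_mean
    return np.sum(dists ** 2) / dists.size

features = np.array([
    [1, 2, 3, 4],
    [2, 4, 4, 3],
    [3, 2, 1, 4],
], dtype=np.float64)

loop_value = compactness_loop(features)
vec_value = compactness_vectorized(features)
```

Both versions give 1.875 for this array (22.5 summed squared distance divided by 12 elements).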
The paper says:
In our formulation (shown in Figure 2 (e)), starting from a pre-trained deep model, we freeze initial features (gs) and learn (gl) and (hc). Based on the output of the classification sub-network (hc), two losses compactness loss and descriptiveness loss are evaluated. These two losses, introduced in the subsequent sections, are used to assess the quality of the learned deep feature. We use the provided one-class dataset to calculate the compactness loss. An external multi-class reference dataset is used to evaluate the descriptiveness loss. As shown in Figure 3, weights of gl and hc are learned in the proposed method through back-propagation from the composite loss. Once training is converged, system shown in setup in Figure 2(d) is used to perform classification where the resulting model is used as the pre-trained model.
then looking at the "Framework" backbone here plus:
AlexNet Binary and VGG16 Binary (Baseline). A binary CNN is trained by having ImageNet samples and one-class image samples as the two classes using AlexNet and VGG16 architectures, respectively. Testing is performed using k-nearest neighbor, One-class SVM [43], Isolation Forest [3] and Gaussian Mixture Model [3] classifiers.
This makes me wonder whether it would be reasonable to add the suggested dense layers to both the Secondary and the Reference Networks, ending in a single-class output (sigmoid) or even a binary-class output (softmax), and to use mean_squared_error as the so-called Compactness Loss and binary_cross_entropy as the Descriptiveness Loss.
I am working through Nielsen's Neural Networks and Deep Learning. To develop my understanding Nielsen suggests rewriting his back-propagation algorithm to take a matrix based approach (supposedly much quicker due to optimizations in linear algebra libraries).
Currently I get a very low/fluctuating accuracy between 9-10% every single time. Normally, I'd continue working on my understanding, but I have worked this algorithm for the better part of 3 days and I feel like I have a pretty good handle on the math behind backprop. Regardless, I continue to generate mediocre results for accuracy, so any insight would be greatly appreciated!!!
I'm using the MNIST handwritten digits database.
the neural network functions (backprop in here)
neural_net.py modified to use matrix operations
# Libs
import random
import numpy as np
# Neural Network
class Network(object):
def __init__(self, sizes):
self.num_layers = len(sizes) # Number of layers in network
self.sizes = sizes # Number of neurons in each layer
self.biases = [np.random.randn(y, 1) for y in sizes[1:]] # Bias vector, 1 bias for each neuron in each layer, except input neurons
self.weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])] # Weight matrix
# Feed Forward Function
# Returns network output for input a
def feedforward(self, a):
for b, w in zip(self.biases, self.weights): # a’ = σ(wa + b)
a = sigmoid(np.dot(w, a)+b)
return a
# Stochastic Gradient Descent
def SGD(self, training_set, epochs, m, eta, test_data):
if test_data: n_test = len(test_data)
n = len(training_set)
# Epoch loop
for j in range(epochs):
# Shuffle training data & parcel out mini batches
random.shuffle(training_set)
mini_batches = [training_set[k:k+m] for k in range(0, n, m)]
# Pass mini batches one by one to be updated
for mini_batch in mini_batches:
self.update_mini_batch(mini_batch, eta)
# End of Epoch (optional epoch testing)
if test_data:
    evaluation = self.evaluate(test_data)
    print("Epoch %6i: %5i / %5i" % (j, evaluation, n_test))
else:
    print("Epoch %5i complete" % (j))
# Update Mini Batch (Matrix approach)
def update_mini_batch(self, mini_batch, eta):
m = len(mini_batch)
nabla_b = []
nabla_w = []
# Build activation & answer matrices
x = np.asarray([_x.ravel() for _x,_y in mini_batch]) # 10x784 where each row is an input vector
y = np.asarray([_y.ravel() for _x,_y in mini_batch]) # 10x10 where each row is an desired output vector
nabla_b, nabla_w = self.backprop(x, y) # Feed matrices into backpropagation
# Train Biases & weights
self.biases = [b-(eta/m)*nb for b, nb in zip(self.biases, nabla_b)]
self.weights = [w-(eta/m)*nw for w, nw in zip(self.weights, nabla_w)]
def backprop(self, x, y):
# Gradient arrays
nabla_b = [0 for i in self.biases]
nabla_w = [0 for i in self.weights]
w = self.weights
# Vars
m = len(x) # Mini batch size
a = x # Activation matrix temp variable
a_s = [x] # Activation matrix record
z_s = [] # Weighted Activation matrix record
special_b = [] # Special bias matrix to facilitate matrix operations
# Build special bias matrix (repeating biases for each example)
for j in range(len(self.biases)):
    special_b.append([])
    for k in range(m):
        special_b[j].append(self.biases[j].flatten())
    special_b[j] = np.asarray(special_b[j])
# Forward pass
# Starting at the input layer move through each layer
for l in range(len(self.sizes)-1):
    z = a @ w[l].transpose() + special_b[l]
    z_s.append(z)
    a = sigmoid(z)
    a_s.append(a)
# Backward pass
delta = cost_derivative(a_s[-1], y) * sigmoid_prime(z_s[-1])
nabla_b[-1] = delta
nabla_w[-1] = delta @ a_s[-2]
for n in range(2, self.num_layers):
    z = z_s[-n]
    sp = sigmoid_prime(z)
    delta = self.weights[-n+1].transpose() @ delta * sp.transpose()
    nabla_b[-n] = delta
    nabla_w[-n] = delta @ a_s[-n-1]
# Create bias vectors by summing bias columns elementwise
for i in range(len(nabla_b)):
    temp = []
    for j in nabla_b[i]:
        temp.append(sum(j))
    nabla_b[i] = np.asarray(temp).reshape(-1,1)
return [nabla_b, nabla_w]
def evaluate(self, test_data):
test_results = [(np.argmax(self.feedforward(t[0])), t[1]) for t in test_data]
return sum(int(x==y) for (x, y) in test_results)
# Cost Derivative Function
# Returns the vector of partial derivatives dC_x/da for the output activations
def cost_derivative(output_activations, y):
    return (output_activations - y)
# Sigmoid Function
def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))
# Sigmoid Prime (Derivative) Function
def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))
test script
import mnist_data
import neural_net_batch as nn
# Data Sets
training_data, validation_data, test_data = mnist_data.load_data_wrapper()
training_data = list(training_data)
validation_data = list(validation_data)
test_data = list(test_data)
# Network
net = nn.Network([784, 30, 10])
# Perform Stochastic Gradient Descent using MNIST training & test data,
# 30 epochs, mini_batch size of 10, and learning rate of 3.0
net.SGD(list(training_data), 30, 10, 3.0, test_data=test_data)
A very helpful Redditor (u/xdaimon) gave me the following answer (on Reddit):
Your backward pass should be
# Backward pass
delta = cost_derivative(a_s[-1], y) * sigmoid_prime(z_s[-1])
nabla_b[-1] = delta.T
nabla_w[-1] = delta.T @ a_s[-2]
for n in range(2, self.num_layers):
    z = z_s[-n]
    sp = sigmoid_prime(z)
    delta = delta @ self.weights[-n+1] * sp
    nabla_b[-n] = delta.T
    nabla_w[-n] = delta.T @ a_s[-n-1]
One way to find this bug is to remember that there should be a
transpose somewhere in the product that computes nabla_w.
And if you're interested, the transpose shows up in the matrix
implementation of backprop because AB is the same as the sum of outer
products of the columns of A and the rows of B. In this case A=delta.T
and B=a_s[-n-1] and so the outer products are between the rows of
delta and the rows of a_s[-n-1]. Each term in the sum is nabla_w for a
single element in the batch which is exactly what we want. If the
minibatch size is 1 you can easily see that delta.T @ a_s[-n-1] is just
the outer product of the delta vector and activation vector.
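The outer-product argument above can be checked numerically. This is a small NumPy sketch with made-up shapes (a mini batch of 10 examples, 30 inputs and 10 outputs for the layer) and random values; the names delta and a_prev only mirror the roles of delta and a_s[-n-1] in the code, they are not taken from a real run.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_in, n_out = 10, 30, 10                 # assumed batch size and layer widths

delta = rng.standard_normal((m, n_out))     # one row of errors per example
a_prev = rng.standard_normal((m, n_in))     # one row of activations per example

# Matrix form: a single product for the whole mini batch
nabla_w = delta.T @ a_prev                  # shape (n_out, n_in), same as the weight matrix

# Per-example form: sum of outer products, as in the non-matrix backprop
nabla_w_sum = sum(np.outer(delta[k], a_prev[k]) for k in range(m))

# With a mini batch of size 1 the product is a single outer product
single = delta[:1].T @ a_prev[:1]
```

Without the transpose the shapes (m, n_out) and (m, n_in) only multiply when they happen to be compatible, as with m = 10 and 10 output neurons here, and then the result is not the weight gradient.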
Testing shows not only is the network accurate again, the expected speedup is present.
I am currently trying to learn logistic regression, and am stuck on plotting a line from the weights after training. I am expecting an array of 3 values, but when I print the weights to check them, I get (with different values each time, but the same format):
[array([[ 0.42433906],
[-0.67847246]], dtype=float32)
array([-0.06681705], dtype=float32)]
My question is: why are the weights in this format of 2 arrays, rather than 1 array of length 3? And how do I interpret these weights so that I can plot the separating line?
Here is my code:
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import L1L2
import random
import numpy as np
# return the array data of shape (m, 2) and the array labels of shape (m, 1)
def get_random_data(w, b, mu, sigma, m): # slope, y-intercept, mean of the data, standard deviation, size of arrays
data = np.empty((m, 2))
labels = np.empty((m, 1))
# fill the arrays with random data
for i in range(m):
c = (random.random() > 0.5) # 0 with probability 1/2 and 1 with probability 1/2
n = random.normalvariate(mu, sigma) # noise using normal distribution
x_1 = random.random() # uniform distribution on [0, 1)
x_2 = w * x_1 + b + (-1)**c * n
labels[i] = c
data[i][0] = x_1
data[i][1] = x_2
# the train set is the first 80% of our data, and the test set is the following 20%
train_length = int(round(m * 0.8, 1))
train_data = np.empty((train_length, 2))
train_labels = np.empty((train_length, 1))
test_data = np.empty((m - train_length, 2))
test_labels = np.empty((m - train_length, 1))
for i in range(train_length):
train_data[i] = data[i]
train_labels[i] = labels[i]
for i in range(train_length, m):
test_data[i - train_length] = data[i]
test_labels[i - train_length] = labels[i]
return (train_data, train_labels), (test_data, test_labels)
(train_data, train_labels), (test_data, test_labels) = get_random_data(2,3,100,100,200)
model = Sequential()
kernel_regularizer=L1L2(l1=0.0, l2=0.1),
model.fit(train_data, train_labels, epochs=100, validation_data=(test_data,test_labels))
weights = np.asarray(model.get_weights())
print("the weights are " , weights)
The first array holds the weights (coefficients) of the two input features and the second array holds the bias.
So you have an equation like the one below.
h(x) = 0.42433906*x1 - 0.67847246*x2 - 0.06681705
Logistic regression takes this equation and applies the sigmoid function to squeeze the result between 0 and 1.
So if you want to draw the separating line, you can do it with the returned weights as explained above: the line is where h(x) = 0, i.e. where the sigmoid output is 0.5.
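For plotting, the boundary w1*x1 + w2*x2 + b = 0 can be solved for x2, which gives x2 = -(w1*x1 + b) / w2. Here is a small sketch using the example weights printed in the question; the function names are hypothetical, and with a trained model the three numbers would come from model.get_weights() instead.

```python
import math

# Example values copied from the printed weights in the question
w1, w2 = 0.42433906, -0.67847246
b = -0.06681705

def boundary_x2(x1):
    """x2 coordinate of the separating line for a given x1 (requires w2 != 0)."""
    return -(w1 * x1 + b) / w2

def predict_proba(x1, x2):
    """Sigmoid of the linear score, as computed by the single Dense unit."""
    return 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b)))

slope = -w1 / w2        # slope of the separating line in the (x1, x2) plane
intercept = -b / w2     # value of x2 where the line crosses x1 = 0

# Two points are enough to draw the line over the data range x1 in [0, 1)
line_points = [(x1, boundary_x2(x1)) for x1 in (0.0, 1.0)]

# Any point on the line should get probability 0.5
p_on_line = predict_proba(0.3, boundary_x2(0.3))
```

The two entries of line_points can be passed to a plotting library as the end points of the line.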