I'm trying to predict a direction given by two angles (theta and phi). I defined a loss function as the angular distance between the predicted and the true direction, but I keep getting NaN values after a few epochs.
Given that the output activation is linear, my custom loss is:
import tensorflow as tf
import numpy as np
import tensorflow.keras.backend as K
def square_angular_distance(y_true, y_pred):
    the_pred = K.abs(y_pred[:, 0])
    phi_pred = (y_pred[:, 1]) % (2*np.pi)
    the_true = y_true[:, 0]
    phi_true = y_true[:, 1]

    cos_phi_pred = K.cos(phi_pred)
    cos_phi_true = K.cos(phi_true)
    sin_phi_pred = K.sin(phi_pred)
    sin_phi_true = K.sin(phi_true)
    cos_the_pred = K.cos(the_pred)
    cos_the_true = K.cos(the_true)
    sin_the_pred = K.sin(the_pred)
    sin_the_true = K.sin(the_true)

    v_true = K.stack((sin_the_true*cos_phi_true, sin_the_true*sin_phi_true, cos_the_true), axis=1)
    v_pred = K.stack((sin_the_pred*cos_phi_pred, sin_the_pred*sin_phi_pred, cos_the_pred), axis=1)

    v_dot = K.batch_dot(v_true, v_pred)
    angle_dist = tf.math.acos(K.clip(v_dot, -1., 1.))*180./np.pi

    return K.mean(K.square(angle_dist), axis=-1)
Where y_pred[:, 0] and y_pred[:, 1] are respectively the theta and phi angles of a unit vector (same for y_true).
I tried using regularizers, clipping the gradients, and lowering the learning rate, and I also checked that the data contains no NaN/Inf values.
I also tried clipping the output values with a custom activation function on the output layer, but it didn't resolve the problem.
Any suggestions on what I am doing wrong?
The comment from @ATony resolved the problem.
Shrinking the input domain of tf.math.acos prevented the loss from becoming NaN:
K.clip(v_dot, -.999, .999)
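A minimal sketch of why the tighter clip helps (not from the original post): the derivative of acos(x) is -1/sqrt(1 - x**2), which blows up at x = ±1, so a dot product sitting exactly on the clip boundary produces a NaN gradient, while the 0.999 clip keeps it finite (the gradient is simply zeroed where the clip is active):
import numpy as np
import tensorflow as tf

for bound in (1.0, 0.999):
    v_dot = tf.Variable([1.0])  # perfectly aligned vectors -> dot product of exactly 1
    with tf.GradientTape() as tape:
        angle = tf.math.acos(tf.clip_by_value(v_dot, -bound, bound)) * 180. / np.pi
        loss = tf.reduce_mean(tf.square(angle))
    print(bound, tape.gradient(loss, v_dot).numpy())
# bound = 1.0   -> [nan] (0 * inf from the acos derivative at the boundary)
# bound = 0.999 -> [0.]  (finite; the clip zeroes the gradient there)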
In TensorFlow, I want to manipulate a tensor with the Taylor series of sin(x) using a certain number of approximation terms. I have done this for a grayscale image with shape (32, 32) and it works fine. Now I am having trouble doing the same thing for an RGB image with shape (32, 32, 3): it doesn't give me the correct array. Can anyone show me a possible way of doing this in TensorFlow?
my attempt:
Here is the Taylor expansion of sin(x) at x=0 with three expansion terms: x - x**3/6 + x**5/120.
from tensorflow.keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()
x = X_train[1, :, :, 1]

n_terms = 3
func = 'sin(x)'

new_x = np.zeros((x.shape[0], x.shape[1]*n_terms))
new_x = new_x.astype('float32')

nn = 0
for i in range(x.shape[1]):
    col_d = x[:, i].ravel()
    new_x[:, nn] = col_d
    if n_terms > 0:
        for j in range(1, n_terms):
            if func == 'cos(x)':
                new_x[:, nn+j] = new_x[:, nn+j-1]
I think I could do this more efficiently with TensorFlow, but it's not obvious to me how. Can anyone suggest a possible workaround to make this work?
update:
For a 2-D array, col_d = x[:, i].ravel() is the flattened pixel vector of one column. Similarly, we could reshape a 3-D array to 2-D with x.transpose(0, 1, 2).reshape(x.shape[1], -1), so inside the loop it could be x[:, i].transpose(0, 1, 2).reshape(x.shape[1], -1), but this is still not correct. I think TensorFlow might have a better way of doing this. How can we manipulate the tensor with the Taylor series of sin(x) more efficiently?
goal:
Intuitively, x in the Taylor series of sin(x) is a tensor, and if we want only 2 or 3 approximation terms of the series for each tensor, I want to concatenate them into a new tensor. How can we do this efficiently in TensorFlow?
new_x = np.zeros((x.shape[0], x.shape[1]*n_terms))
This line doesn't make sense: why allocate space for 96 columns for 3 Taylor expansion terms?
(new_x[:, 3:] == 0.0).all()  # evaluates to True
For a pixelwise Taylor expansion with n terms:
def sin_exp_step(x, i):
    c1 = 2 * i + 1
    c2 = (-1) ** i / np.math.factorial(c1)
    t = c2 * (x ** c1)
    return t
# validate
x = 45.0
x = (np.pi / 180.0) * x
y = np.sin(x)

approx_y = 0
for i in range(n_terms):
    approx_y += sin_exp_step(x, i)

abs(approx_y - y) < 1e-8
x = X_train[1, :, :, :]
n_terms = 3
func = 'sin(x)'

new_x = np.zeros((*x.shape, n_terms))
for i in range(0, n_terms):
    if func == 'sin(x)':  # sin(x)
        new_x[..., i] += sin_exp_step(x, i)
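As a quick sanity check (an addition, not part of the original answer): new_x now has one extra trailing axis holding the n_terms series terms per pixel, and summing over that axis gives the truncated series value for each pixel.
print(new_x.shape)            # (*x.shape, n_terms), e.g. (32, 32, 3, 3) for a (32, 32, 3) RGB image
approx = new_x.sum(axis=-1)   # truncated Taylor series of sin evaluated at each pixel
print(approx.shape)           # same shape as x
# Note: for raw pixel values in [0, 255] the truncated sin series is a poor
# approximation; rescale x first (e.g. to radians in a small range, as in the
# validation above) for the series to be meaningful.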
Commonly, numerical approximation methods are avoided because they are computationally expensive (e.g. the factorial) and less stable, so gradient-based optimization is usually best. Higher-order algorithms such as BFGS and L-BFGS approximate the Hessian matrix (2nd-order derivative), while optimizers such as Adam and SGD are sufficient and come with much less computational cost. Using a neural network, we might be able to find much better expansions.
Tensorflow solution for n-terms expansion
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.layers import Input, LocallyConnected2D
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = tf.constant(x_train, dtype=tf.float32)
x_test = tf.constant(x_test, dtype=tf.float32)
def expansion_approx_of(func):
    def reconstruction_loss(y_true, y_pred):
        loss = (y_pred - func(y_true)) ** 2
        loss = 0.5 * K.mean(loss)
        return loss
    return reconstruction_loss

class Expansion2D(LocallyConnected2D):  # n-terms expansion layer
    def __init__(self, i_shape, n_terms, kernel_size=(1, 1), *args, **kwargs):
        if len(i_shape) != 3:
            raise ValueError('...')
        self.i_shape = i_shape
        self.n_terms = n_terms
        filters = self.n_terms * self.i_shape[-1]
        super(Expansion2D, self).__init__(filters=filters, kernel_size=kernel_size,
                                          use_bias=False, *args, **kwargs)

    def call(self, inputs):
        shape = (-1, self.i_shape[0], self.i_shape[1], self.i_shape[-1], self.n_terms)
        out = super().call(inputs)
        expansion = tf.reshape(out, shape)
        out = tf.math.reduce_sum(expansion, axis=-1)
        return out, expansion
inputs = Input(shape=(32, 32, 3))
# expansion: might be a taylor expansion or something better.
out, expansion = Expansion2D(i_shape=(32, 32, 3), n_terms=3)(inputs)
model = Model(inputs, [out, expansion])
opt = tf.keras.optimizers.Adam(learning_rate=0.0001, beta_1=0.9, beta_2=0.999)
loss = expansion_approx_of(K.sin)
model.compile(optimizer=opt, loss=[loss])
model.summary()
model.fit(x_train, x_train, batch_size=1563, epochs=100)
x_pred, x_exp = model.predict_on_batch(x_test[:32])
print((x_exp[0].sum(axis=-1) == x_pred[0]).all())
err = abs(x_pred - np.sin(x_test[0])).mean()
print(err)
Put three expansion terms into a tensor at axis=1
x = tf.ones([8, 32, 32, 3], tf.float32) * 0.5 # example batchsize=8, imageshape=[32, 32, 3]
x = tf.stack([x, - (1/6) * tf.math.pow(x, 3), (1/120) * tf.math.pow(x, 5)], axis=1) # expansion of three terms of sin(x), [8, 3, 32, 32, 3]
If you go with the tf.keras Functional API or Sequential API, you can make a Keras custom layer that uses:
tf.math.pow
tf.stack
Edit: I first recommended tf.keras.layers.Lambda, but it might not work with tf.math.pow or tf.stack (I haven't tried). You should go with a Keras custom layer instead.
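For illustration, a minimal sketch of such a custom layer (not from the original answer), stacking the same three sin(x) terms along axis=1:
import tensorflow as tf

class TaylorSinExpansion(tf.keras.layers.Layer):
    """Stacks the first three Taylor terms of sin(x) along a new axis=1."""
    def call(self, x):
        terms = [x,
                 -(1.0 / 6.0) * tf.math.pow(x, 3),
                 (1.0 / 120.0) * tf.math.pow(x, 5)]
        return tf.stack(terms, axis=1)  # [batch, 3, 32, 32, 3] for [batch, 32, 32, 3] input

inputs = tf.keras.Input(shape=(32, 32, 3))
outputs = TaylorSinExpansion()(inputs)
model = tf.keras.Model(inputs, outputs)
print(model.output_shape)  # (None, 3, 32, 32, 3)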
I think you can do this for a 1D tensor as:
def expend_func(x):
    p1 = x
    p2 = x - ((x**2)/2)
    new_x = K.concatenate([p1, p2], axis=1)
    return new_x
Note that x is your 1D tensor and new_x has two terms. If you need new_x with three terms, you can modify expend_func to add a third term (a sketch of a three-term variant follows). For a 2D tensor you should use tf.stack(), which is not the most elegant way but might help.
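A hedged sketch of such a three-term variant (an extension of expend_func above, using the usual sin(x) series terms x, -x**3/6 and x**5/120):
import tensorflow.keras.backend as K

def expend_func_three_terms(x):
    # x is a batch of 1-D vectors, shape (batch, n), as assumed by expend_func above
    p1 = x
    p2 = -(x ** 3) / 6.0
    p3 = (x ** 5) / 120.0
    return K.concatenate([p1, p2, p3], axis=1)  # shape (batch, 3*n)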
I've been searching for how to perform matrix factorization for the very simple and basic case that I will show, but didn't find anything. I only found complex and long solutions, so I will present what I want to solve:
U x V = A
I would just like to know how to solve this equation in TensorFlow 2, where A is a known sparse matrix and U and V are two randomly initialized matrices. So I would like to find U and V such that their product is approximately equal to A.
For example, having these variables:
# I use this function to build a toy dataset for the sparse matrix
def build_rating_sparse_tensor(ratings):
    indices = ratings[['U_num', 'V_num']].values
    values = ratings['rating'].values
    return tf.SparseTensor(
        indices=indices,
        values=values,
        dense_shape=[ratings.U_num.max()+1, ratings.V_num.max()+1])
# here I create what will be the matrix A
ratings = (pd.DataFrame({'U_num': list(range(0,10_000))*30,
'V_num': list(range(0,60_000))*5,
'rating': np.random.randint(6, size=300_000)})
.sample(1000)
.drop_duplicates(subset=['U_num','V_num'])
.sort_values(['U_num','V_num'], ascending=[1,1]))
# Variables
A = build_rating_sparse_tensor(ratings)
U = tf.Variable(tf.random.normal(
    [A.shape[0], embeddings], stddev=init_stddev))

# this matrix would be transposed in the equation
V = tf.Variable(tf.random.normal(
    [A.shape[1], embeddings], stddev=init_stddev))
# loss function
def sparse_mean_square_error(sparse_ratings, user_embeddings, movie_embeddings):
    predictions = tf.reduce_sum(
        tf.gather(user_embeddings, sparse_ratings.indices[:, 0]) *
        tf.gather(movie_embeddings, sparse_ratings.indices[:, 1]),
        axis=1)
    loss = tf.losses.mean_squared_error(sparse_ratings.values, predictions)
    return loss
Is it possible to do this with a particular loss function, optimizer and learning schedule?
Thank you very much.
A naive and straightforward approach using TensorFlow 2:
Note that rating was converted to float32. TensorFlow cannot calculate gradients over integers, see https://github.com/tensorflow/tensorflow/issues/20524.
A = build_rating_sparse_tensor(ratings)
indices = ratings[["U_num", "V_num"]].values
embeddings = 3000
U = tf.Variable(tf.random.normal([A.shape[0], embeddings]), dtype=tf.float32)
V = tf.Variable(tf.random.normal([embeddings, A.shape[1]]), dtype=tf.float32)
optimizer = tf.optimizers.Adam()
trainable_weights = [U, V]
for step in range(100):
    with tf.GradientTape() as tape:
        A_prime = tf.matmul(U, V)
        # indexing the result based on the indices of A that contain a value
        A_prime_sparse = tf.gather(
            tf.reshape(A_prime, [-1]),
            indices[:, 0] * tf.shape(A_prime)[1] + indices[:, 1],
        )
        loss = tf.reduce_sum(tf.metrics.mean_squared_error(A_prime_sparse, A.values))
    grads = tape.gradient(loss, trainable_weights)
    optimizer.apply_gradients(zip(grads, trainable_weights))
    if step % 20 == 0:
        print(f"Training loss at step {step}: {loss:.4f}")
We take advantage of the sparsity of A by calculating the loss only over the actual values of A. However, we still have to allocate two really big dense tensors for the trainable weights U and V. For big numbers like in your example, you will probably encounter some OOM errors.
Maybe it could be worth exploring another representation for your data.
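For instance, a hedged sketch (building on the sparse_mean_square_error from the question, not a tested solution): never materialize the dense product of U and V, and instead gather only the embedding rows that correspond to observed entries of A:
u_idx = indices[:, 0]
v_idx = indices[:, 1]
targets = tf.cast(A.values, tf.float32)

U = tf.Variable(tf.random.normal([A.shape[0], embeddings]))
V = tf.Variable(tf.random.normal([A.shape[1], embeddings]))  # (items, embeddings), used "transposed"

optimizer = tf.optimizers.Adam()
for step in range(100):
    with tf.GradientTape() as tape:
        # dot product of the gathered user/item embeddings, one value per observed rating
        preds = tf.reduce_sum(tf.gather(U, u_idx) * tf.gather(V, v_idx), axis=1)
        loss = tf.reduce_mean(tf.square(preds - targets))
    grads = tape.gradient(loss, [U, V])
    optimizer.apply_gradients(zip(grads, [U, V]))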
I saw this question: Implementing custom loss function in keras with condition, and I need to do the same thing, but with code that seems to need loops.
I have a custom numpy function which calculates the mean Euclidean distance from the mean vector. I wrote this based on the paper https://arxiv.org/pdf/1801.05365.pdf:
import numpy as np
def mean_euclid_distance_from_mean_vector(n_vectors):

    dists = []

    for (i, v) in enumerate(n_vectors):
        n_vectors_rest = n_vectors[np.arange(len(n_vectors)) != i]

        print("rest of vectors: ")
        print(n_vectors_rest)

        # calculate mean vector
        mean_rest = n_vectors_rest.mean(axis=0)

        print("mean rest vector")
        print(mean_rest)

        dist = v - mean_rest

        print("dist vector")
        print(dist)

        dists.append(dist)

    # dists is now a matrix of distance vectors (distance from the mean vector)
    dists = np.array(dists)

    print("distance vector matrix")
    print(dists)

    # here we matmult each vector
    # sum them up
    # and divide by the total number of elements
    result = np.sum([np.matmul(d, d) for d in dists]) / dists.size

    return result
features = np.array([
[1,2,3,4],
[4,3,2,1]
])
c = mean_euclid_distance_from_mean_vector(features)
print(c)
I need this function, however, to work inside TensorFlow with Keras, i.e. as a custom Lambda layer: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Lambda
However, I'm not sure how to implement the above in Keras/TensorFlow since it has loops, and the way the paper describes calculating the m_i seems to require loops, like my implementation above.
For reference, the PyTorch version of this code is here: https://github.com/PramuPerera/DeepOneClass
Given a feature map like:
features = np.array([
[1, 2, 3, 4],
[2, 4, 4, 3],
[3, 2, 1, 4],
], dtype=np.float64)
reflecting a batch_size of
batch_size = features.shape[0]
and
k = features.shape[1]
The paper's compactness loss for a batch of n = batch_size feature vectors x_i of length k is l_C = (1/(n*k)) * sum_i z_i^T z_i, with z_i = x_i - m_i and m_i = (1/(n-1)) * sum_{j != i} x_j. Implementing these formulas in TensorFlow could be expressed (prototyped) as:
dim = (batch_size, features.shape[1])

def zero(i):
    arr = np.ones(dim)
    arr[i] = 0
    return arr

mapper = [zero(i) for i in range(batch_size)]
elems = (features, mapper)
m = (1 / (batch_size - 1)) * tf.map_fn(lambda x: tf.math.reduce_sum(x[0] * x[1], axis=0), elems, dtype=tf.float64)
pairs = tf.map_fn(lambda x: tf.concat(x, axis=0), tf.stack([features, m], 1), dtype=tf.float64)
compactness_loss = (1 / (batch_size * k)) * tf.map_fn(lambda x: tf.math.reduce_euclidean_norm(x), pairs, dtype=tf.float64)

with tf.Session() as sess:
    print("loss value output is: ", compactness_loss.eval())
Which yields:
loss value output is: [0.64549722 0.79056942 0.64549722]
However, a single measure is required for the batch, so it is necessary to reduce it by summing all values.
The desired Compactness Loss function à la TensorFlow is:
from tensorflow.keras.layers import Flatten

def compactness_loss(actual, features):
    features = Flatten()(features)
    k = 7 * 7 * 512
    dim = (batch_size, k)

    def zero(i):
        z = tf.zeros((1, dim[1]), dtype=tf.dtypes.float32)
        o = tf.ones((1, dim[1]), dtype=tf.dtypes.float32)
        arr = []
        for k in range(dim[0]):
            arr.append(o if k != i else z)
        res = tf.concat(arr, axis=0)
        return res

    masks = [zero(i) for i in range(batch_size)]
    m = (1 / (batch_size - 1)) * tf.map_fn(
        # row-wise summation
        lambda mask: tf.math.reduce_sum(features * mask, axis=0),
        masks,
        dtype=tf.float32,
    )
    dists = features - m
    sqrd_dists = tf.pow(dists, 2)
    red_dists = tf.math.reduce_sum(sqrd_dists, axis=1)
    compact_loss = (1 / (batch_size * k)) * tf.math.reduce_sum(red_dists)
    return compact_loss
Of course the Flatten() could be moved back into the model for convenience, and k could be derived directly from the feature map; this answers your question. You may just have some trouble finding out what the expected values for the model are - feature maps from VGG16 (or any other architecture) trained on ImageNet, for instance?
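For example, a rough usage sketch (a hypothetical setup, assuming a global batch_size and a backbone that outputs a 7x7x512 feature map, such as VGG16 without its top):
import tensorflow as tf
from tensorflow.keras.applications import VGG16

batch_size = 32
# outputs (7, 7, 512) feature maps for 224x224x3 inputs
backbone = VGG16(include_top=False, weights=None, input_shape=(224, 224, 3))
backbone.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss=compactness_loss)

# The loss ignores `actual`, so dummy targets of any compatible shape work, e.g.:
# x = ...  # one-class images, shape (num_samples, 224, 224, 3)
# backbone.fit(x, np.zeros((len(x), 1)), batch_size=batch_size)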
The paper says:
In our formulation (shown in Figure 2 (e)), starting from a pre-trained deep model, we freeze initial features (gs) and learn (gl) and (hc). Based on the output of the classification sub-network (hc), two losses compactness loss and descriptiveness loss are evaluated. These two losses, introduced in the subsequent sections, are used to assess the quality of the learned deep feature. We use the provided one-class dataset to calculate the compactness loss. An external multi-class reference dataset is used to evaluate the descriptiveness loss. As shown in Figure 3, weights of gl and hc are learned in the proposed method through back-propagation from the composite loss. Once training is converged, the system shown in the setup in Figure 2(d) is used to perform classification, where the resulting model is used as the pre-trained model.
then looking at the "Framework" backbone here plus:
AlexNet Binary and VGG16 Binary (Baseline). A binary CNN is trained by having ImageNet samples and one-class image samples as the two classes using AlexNet and VGG16 architectures, respectively. Testing is performed using k-nearest neighbor, One-class SVM [43], Isolation Forest [3] and Gaussian Mixture Model [3] classifiers.
This makes me wonder whether it would not be reasonable to add the suggested dense layers to both the Secondary and the Reference Networks with a single-class output (Sigmoid), or even a binary-class output (using Softmax), and to use mean_squared_error as the so-called Compactness Loss and binary_cross_entropy as the Descriptiveness Loss.
I am watching some videos from Stanford CS231n: Convolutional Neural Networks for Visual Recognition, but do not quite understand how to calculate the analytical gradient of the softmax loss function using numpy.
From this stackexchange answer, the softmax gradient is calculated as:

\nabla_{w_j} L_i = (p_j - 1{y_i = j}) * x_i,   where   p_j = e^{f_j} / \sum_k e^{f_k}
Python implementation for above is:
num_classes = W.shape[0]
num_train = X.shape[1]
for i in range(num_train):
    for j in range(num_classes):
        p = np.exp(f_i[j]) / sum_i
        dW[j, :] += (p - (j == y[i])) * X[:, i]
Could anyone explain how the above snippet works? The detailed implementation of the softmax loss is also included below.
def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops)
    Inputs:
    - W: C x D array of weights
    - X: D x N array of data. Data are D-dimensional columns
    - y: 1-dimensional array of length N with labels 0...K-1, for K classes
    - reg: (float) regularization strength
    Returns:
    a tuple of:
    - loss as single float
    - gradient with respect to weights W, an array of same size as W
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    #############################################################################
    # Compute the softmax loss and its gradient using explicit loops.           #
    # Store the loss in loss and the gradient in dW. If you are not careful     #
    # here, it is easy to run into numeric instability. Don't forget the        #
    # regularization!                                                           #
    #############################################################################

    # Get shapes
    num_classes = W.shape[0]
    num_train = X.shape[1]

    for i in range(num_train):
        # Compute vector of scores
        f_i = W.dot(X[:, i])  # in R^{num_classes}

        # Normalization trick to avoid numerical instability, per http://cs231n.github.io/linear-classify/#softmax
        log_c = np.max(f_i)
        f_i -= log_c

        # Compute loss (and add to it, divided later)
        # L_i = - f(x_i)_{y_i} + log \sum_j e^{f(x_i)_j}
        sum_i = 0.0
        for f_i_j in f_i:
            sum_i += np.exp(f_i_j)
        loss += -f_i[y[i]] + np.log(sum_i)

        # Compute gradient
        # dw_j = 1/num_train * \sum_i[x_i * (p(y_i = j) - Ind{y_i = j})]
        # Here we are computing the contribution to the inner sum for a given i.
        for j in range(num_classes):
            p = np.exp(f_i[j]) / sum_i
            dW[j, :] += (p - (j == y[i])) * X[:, i]

    # Compute average
    loss /= num_train
    dW /= num_train

    # Regularization
    loss += 0.5 * reg * np.sum(W * W)
    dW += reg * W

    return loss, dW
Not sure if this helps, but:

1{y_i = j}

is really the indicator function, as described here. This forms the expression (j == y[i]) in the code.

Also, the gradient of the loss with respect to the weights is:

\nabla_{w_j} L_i = (p_j - 1{y_i = j}) * x_i

where

p_j = e^{f_j} / \sum_k e^{f_k}   and   f = W x_i

which is the origin of the X[:, i] in the code.
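As a small self-contained check of that formula (a sketch, not from the course materials), one can compare the analytic per-example gradient (p_j - 1{y_i = j}) * x_i against a finite-difference approximation:
import numpy as np

rng = np.random.default_rng(0)
C, D = 3, 5                       # num_classes, num_features
W = rng.normal(size=(C, D))
x = rng.normal(size=D)            # one column of X
y = 1                             # true class label

def loss(W):
    f = W.dot(x)
    f -= f.max()                  # numerical stability trick (does not change the loss value)
    return -f[y] + np.log(np.exp(f).sum())

# Analytic gradient: dL/dw_j = (p_j - 1{j == y}) * x
f = W.dot(x); f -= f.max()
p = np.exp(f) / np.exp(f).sum()
dW = (p - (np.arange(C) == y))[:, None] * x[None, :]

# Finite-difference approximation
h = 1e-6
dW_num = np.zeros_like(W)
for j in range(C):
    for d in range(D):
        Wp = W.copy(); Wp[j, d] += h
        Wm = W.copy(); Wm[j, d] -= h
        dW_num[j, d] = (loss(Wp) - loss(Wm)) / (2 * h)

print(np.max(np.abs(dW - dW_num)))  # should be tiny, around 1e-9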
I know this is late but here's my answer:
I'm assuming you are familiar with the cs231n Softmax loss function.
We know that:

L_i = -log( e^{f_{y_i}} / \sum_j e^{f_j} )

So, just as we did with the SVM loss function, the gradients are as follows:

\nabla_{w_{y_i}} L_i = (p_{y_i} - 1) * x_i,   \nabla_{w_j} L_i = p_j * x_i  for j != y_i,   where p_j = e^{f_j} / \sum_k e^{f_k}
Hope that helped.
A supplement to this answer with a small example.
I came across this post and still was not 100% clear how to arrive at the partial derivatives.
For that reason I took another approach to get to the same results - maybe it is helpful to others too.