from numpy import exp, array, random, dot, matrix, asarray

class NeuralNetwork():
    def __init__(self):
        random.seed(1)
        self.synaptic_weights = 2 * random.random((3, 1)) - 1  # init weights from -1 to 1

    def __sigmoid(self, x):
        return 1 / (1 + exp(-x))

    def __sigmoid_derivative(self, x):
        return x * (1 - x)

    def train(self, train_input, train_output, iter):
        for i in range(iter):
            output = self.think(train_input)
            error = train_output - output
            adjustment = dot(train_input.T, error * self.__sigmoid_derivative(output))
            self.synaptic_weights += adjustment

    def think(self, inputs):
        return self.__sigmoid(dot(inputs, self.synaptic_weights))

neural_network = NeuralNetwork()
train = matrix([[0, 0, 1, 0], [1, 1, 1, 1], [1, 0, 1, 1], [0, 1, 1, 0]])
train_input = asarray(train[:, 0:3])
train_output = asarray(train[:, 3])
neural_network.train(train_input, train_output, 10000)
This code is a basic neural network. It works well when I convert the training set using asarray, but it does not work with the matrix itself. It seems the matrix cannot calculate the sigmoid derivative, and the terminal shows ValueError: shapes (4,1) and (4,1) not aligned: 1 (dim 1) != 4 (dim 0).
Why does matrix not work in this code?
The error is in the
x * (1 - x)
expression. x has shape (4,1). For an array, * is element-by-element multiplication, so x*(1-x) works fine and returns another (4,1) result.
But if x is a (4,1) matrix, then * is the matrix product, the same as np.dot for arrays. That would require shapes (4,1) * (1,4) => (4,4), or (1,4) * (4,1) => (1,1). Since you are already using dot for the matrix product, this derivative is clearly meant to be the element-wise one.
If you see machine learning code that uses np.matrix, it is probably based on older examples and retains matrix for backward compatibility. It is better to use array, and call dot explicitly where you need the matrix product.
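A quick way to see the difference (a minimal sketch, not part of the original code above):
import numpy as np

x_arr = np.array([[0.1], [0.2], [0.3], [0.4]])  # (4, 1) ndarray
x_mat = np.matrix(x_arr)                        # same data as np.matrix

print(x_arr * (1 - x_arr))              # element-wise, stays (4, 1)
try:
    x_mat * (1 - x_mat)                 # * is the matrix product for np.matrix
except ValueError as e:
    print(e)                            # shapes (4,1) and (4,1) not aligned ...
print(np.multiply(x_mat, 1 - x_mat))    # explicit element-wise works for both types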
In my previous question I found how to use PyTorch's autograd to differentiate. And it worked:
#autograd
import torch
from torch.autograd import grad
import torch.nn as nn
import torch.optim as optim

class net_x(nn.Module):
    def __init__(self):
        super(net_x, self).__init__()
        self.fc1 = nn.Linear(1, 20)
        self.fc2 = nn.Linear(20, 20)
        self.out = nn.Linear(20, 4)

    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        x = torch.tanh(self.fc2(x))
        x = self.out(x)
        return x

nx = net_x()
r = torch.tensor([1.0], requires_grad=True)
print('r', r)
y = nx(r)
print('y', y)
print('')

for i in range(y.shape[0]):
    # prints the vector (dy_i/dr_0, dy_i/dr_1, ... dy_i/dr_n)
    print(grad(y[i], r, retain_graph=True))
>>>
r tensor([1.], requires_grad=True)
y tensor([ 0.1698, -0.1871, -0.1313, -0.2747], grad_fn=<AddBackward0>)
(tensor([-0.0124]),)
(tensor([-0.0952]),)
(tensor([-0.0433]),)
(tensor([-0.0099]),)
The problem I currently have is that I have to differentiate a very large tensor, and iterating through it the way I'm doing now (for i in range(y.shape[0])) is taking forever.
The reason I'm iterating is that, from my understanding, grad only knows how to propagate gradients from a scalar tensor, which y is not. So I need to compute the gradients with respect to each coordinate of y separately.
I know that TensorFlow is capable of differentiating tensors, from here:
tf.gradients(
ys, xs, grad_ys=None, name='gradients', gate_gradients=False,
aggregation_method=None, stop_gradients=None,
unconnected_gradients=tf.UnconnectedGradients.NONE
)
"ys and xs are each a Tensor or a list of tensors. grad_ys is a list of Tensor, holding the gradients received by the ys. The list must be the same length as ys.
gradients() adds ops to the graph to output the derivatives of ys with respect to xs. It returns a list of Tensor of length len(xs) where each tensor is the sum(dy/dx) for y in ys and for x in xs."
And was hoping that there's a more efficient way to differentiate tensors in PyTorch.
For example:
a = range(100)
b = range(100)
c = range(100)
d = range(100)
my_tensor = torch.tensor([a,b,c,d])
t = range(100)
#derivative = grad(my_tensor, t) --> not working
#Instead what I'm currently doing:
for i in range(len(t)):
    a_grad = grad(a[i], t[i], retain_graph=True)
    b_grad = grad(b[i], t[i], retain_graph=True)
    #etc.
I was told that it might work if I could run autograd on the forward pass rather than the backward pass, but from here it seems that this is not currently a feature PyTorch has.
Update 1:
@jodag mentioned that what I'm looking for might be just the diagonal of the Jacobian. I'm following the link he attached and trying out the faster method. However, this doesn't seem to work and gives me an error:
RuntimeError: grad can be implicitly created only for scalar outputs.
Code:
nx = net_x()
x = torch.rand(10, requires_grad=True)
x = torch.reshape(x, (10,1))
x = x.unsqueeze(1).repeat(1, 4, 1)
y = nx(x)
dx = torch.diagonal(torch.autograd.grad(torch.diagonal(y, 0, -2, -1), x), 0, -2, -1)
I believe I solved it using @jodag's advice -- to simply calculate the Jacobian and take the diagonal.
Consider the following network:
import torch
from torch.autograd import grad
import torch.nn as nn
import torch.optim as optim
class net_x(nn.Module):
    def __init__(self):
        super(net_x, self).__init__()
        self.fc1 = nn.Linear(1, 20)
        self.fc2 = nn.Linear(20, 20)
        self.out = nn.Linear(20, 4)  # a, b, c, d

    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        x = torch.tanh(self.fc2(x))
        x = self.out(x)
        return x
nx = net_x()
#input
t = torch.tensor([1.0, 2.0, 3.2], requires_grad = True) #input vector
t = torch.reshape(t, (3,1)) #reshape for batch
My approach so far was to iterate through the input since grad wants a scalar value as mentioned above:
#method 1
for timestep in t:
    y = nx(timestep)
    print(grad(y[0], timestep, retain_graph=True))  # 0 for the first vector (i.e. "a"), 1 for the 2nd vector (i.e. "b")
>>>
(tensor([-0.0142]),)
(tensor([-0.0517]),)
(tensor([-0.0634]),)
Using the diagonal of the Jacobian seems more efficient and gives the same results:
#method 2
dx = torch.autograd.functional.jacobian(lambda t_: nx(t_), t)
dx = torch.diagonal(torch.diagonal(dx, 0, -1), 0)[0] #first vector
#dx = torch.diagonal(torch.diagonal(dx, 1, -1), 0)[0] #2nd vector
#dx = torch.diagonal(torch.diagonal(dx, 2, -1), 0)[0] #3rd vector
#dx = torch.diagonal(torch.diagonal(dx, 3, -1), 0)[0] #4th vector
dx
>>>
tensor([-0.0142, -0.0517, -0.0634])
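If all four vectors are needed at once, one option (a sketch under the same setup; the indexing assumes the (3, 4, 3, 1) Jacobian shape produced by a (3, 1) input and 4 outputs) is to pull out the per-sample entries in a single advanced-indexing step:
#method 2b (sketch)
full = torch.autograd.functional.jacobian(lambda t_: nx(t_), t)  # shape (3, 4, 3, 1)
idx = torch.arange(t.shape[0])
per_sample = full[idx, :, idx, 0]  # shape (3, 4); row i holds dy_i/dt_i for all 4 outputs
per_sample[:, 0]                   # first vector, should match method 2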
I am trying to implement in Python, from scratch, a gradient descent function that I have already implemented and that works in GNU Octave. Unfortunately I am stuck. I fiddled with it for a while and checked the NumPy documentation, but so far no luck.
I am aware of libraries such as scikit-learn; however, my purpose is to learn to code such a function from scratch. Perhaps I am going about it the wrong way.
Below you will find all the code necessary to reproduce the error.
Thanks in advance for your help.
Actual result: test fails with error -> "ValueError: matmul: Input operand 0 does not have enough dimensions (has 0, gufunc core with signature (n?,k),(k,m?)->(n?,m?) requires 1)"
Expected result: an array with the values [5.2148, -0.5733]
Function gradientDescent() in Octave:
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);  % number of training examples
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    theta = theta - (alpha/m)*X'*(X*theta-y);
    J_history(iter) = computeCost(X, y, theta);
  end
Function gradient_descent() in Python:
from numpy import zeros

def compute_cost(X, y, theta):
    m = len(y)
    ans = (X.T @ theta).T - y
    J = (ans @ ans.T) / (2 * m)
    return J[0, 0]

def gradient_descent(X, y, theta, alpha, num_iters):
    m = len(y)
    J_history = zeros((num_iters, 1), dtype=int)
    for iter in range(num_iters):
        theta = theta - (alpha / m) @ X.T @ (X @ theta - y)
        J_history[iter] = compute_cost(X, y, theta)
    return theta
The test file, test_ml_utils.py:
import unittest
import numpy as np
from ml.ml_utils import compute_cost, gradient_descent

class TestGradientDescent(unittest.TestCase):
    # TODO: implement tests for Gradient Descent function
    # [theta J_hist] = gradientDescent([1 5; 1 2; 1 4; 1 5],[1 6 4 2]',[0 0]',0.01,1000);
    def test_gradient_descent_00(self):
        X = np.array([[1, 5], [1, 2], [1, 4], [1, 5]])
        y = np.array([1, 6, 4, 2])
        theta = np.zeros(2)
        alpha = 0.01
        num_iter = 1000
        r_theta = np.array([5.2148, -0.5733])
        result = gradient_descent(X, y, theta, alpha, num_iter)
        self.assertEqual((round(result, 4), r_theta), 'Result is wrong!')

if __name__ == '__main__':
    unittest.main()
The __matmul__ operator @ in Python binds more tightly than -. That means you're trying to do matrix multiplication with the operands (alpha / m), which is a scalar, and X.T, which is actually a matrix. See operator precedence.
In the Octave code, (alpha/m)*X' is doing scalar multiplication, not matrix multiplication, so if you want that same behavior in Python, use * rather than @. This is because Octave overloads the * operator to perform scalar multiplication if one operand is a scalar, but matrix multiplication if both operands are matrices.
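A minimal sketch of the corrected update line, assuming the same X, y, theta, alpha, and m as above:
# the step size is a scalar, so multiply element-wise; keep @ for the matrix products
theta = theta - (alpha / m) * (X.T @ (X @ theta - y))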
Adding to Adam's answer (which is correct with regard to the error you're getting).
However, I wanted to add that, more generally, this code is hard to follow for a reader without some sort of hint (whether programmatically or in the form of a comment) about what dimensions the different variables take.
E.g., there is a hint in the code that y is likely intended to be 2-dimensional, yet you're using len to get its size. Just as an example of how this might fail silently, consider this:
>>> y = numpy.array([[1,2,3,4,5]])
>>> len( y )
1
whereas presumably you want
>>> numpy.shape( y )
(1, 5)
or
>>> numpy.size( y )
5
I note in your unit test that you're passing a rank-1 vector instead of a rank-2 one, so y turns out to be 1-D rather than 2-D, yet it still operates with the 2-D X thanks to broadcasting. Your code therefore works despite the logic it implies, but in the absence of making such things explicit, this is a runtime error waiting to happen.
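One way to make those shape assumptions explicit (a sketch using the names from the test, not part of the original code):
y = np.asarray(y, dtype=float).reshape(-1, 1)          # force a column vector (m, 1)
theta = np.asarray(theta, dtype=float).reshape(-1, 1)  # force a column vector (n, 1)
assert X.shape == (y.shape[0], theta.shape[0])         # fail loudly if dimensions drift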
This is the documentation for tf.nn.conv2d: Given an input tensor of shape [batch, in_height, in_width, in_channels] and a filter / kernel tensor of shape [filter_height, filter_width, in_channels, out_channels], this op performs the following:
1. Flattens the filter to a 2-D matrix with shape [filter_height * filter_width * in_channels, output_channels].
2. Extracts image patches from the input tensor to form a virtual tensor of shape [batch, out_height, out_width, filter_height * filter_width * in_channels].
3. For each patch, right-multiplies the filter matrix and the image patch vector.
In other words, it takes in a tensor of n images and does convolution with out_channel filters.
I am trying to translate this into code that uses only NumPy operations; the code is the following:
def my_conv2d(x, kernel):
    nf = kernel.shape[-1]  # number of filters
    rf = kernel.shape[0]   # filter size
    w = kernel
    s = 1                  # stride
    h_range = int((x.shape[2] - rf) / s) + 1  # (W - F + 2P) / S
    w_range = int((x.shape[1] - rf) / s) + 1  # (W - F + 2P) / S
    np_o = np.zeros((1, h_range, w_range, nf))
    for i in range(x.shape[0]):
        for z in range(nf):
            for _h in range(h_range):
                for _w in range(w_range):
                    np_o[0, _h, _w, z] = np.sum(x[i, _h * s:_h * s + rf, _w * s:_w * s + rf, :]
                                                * w[:, :, :, z])
    return np_o
The problem is that this code is extremely slow. Are there any NumPy or SciPy functions that can replicate what TensorFlow's conv2d is doing with similar efficiency? I have looked at https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.convolve2d.html and it performs a single 2-D convolution, meaning I would have to pass a 2-D input alongside a 2-D kernel (it does not handle multiple filters).
None of the previous Stack Overflow questions helped much with this.
Thanks
Edit: I did some testing and my code is about 44000% slower than tf.nn.conv2d!
Things are slow for you because you are using loops. Implementing it with vectorized operations will be much faster, though still not as efficient as high-level APIs such as tf.nn.conv2d or tf.nn.convolution. This post should be able to help you with a vectorized implementation of the same in NumPy: https://wiseodd.github.io/techblog/2016/07/16/convnet-conv-layer/
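For reference, here is one common way to vectorize it (a sketch, not the approach from the linked post; it assumes NumPy 1.20+ for sliding_window_view, stride 1, no padding, an NHWC input and an HWIO kernel as in the question):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_nhwc(x, w):
    # x: (batch, H, W, in_ch), w: (rf, rf, in_ch, out_ch)
    rf = w.shape[0]
    # all rf x rf patches: shape (batch, H-rf+1, W-rf+1, in_ch, rf, rf)
    windows = sliding_window_view(x, (rf, rf), axis=(1, 2))
    # contract every patch against every filter in a single einsum call
    return np.einsum('bhwcij,ijco->bhwo', windows, w)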
I have a polynomial of the form:
p(y) = A + By + Cy^2 ... + Dy^n
Here, each of the coefficients A, B, ..., D is a matrix (and therefore p(y) is also a matrix). Say I have evaluated the polynomial at n+1 points; I should now be able to solve this system. I'm trying to do this in NumPy. I have the following code right now:
a = np.vander([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,1.1,1.2]) #polynomial degree is 12, a -> (12x12)
b = np.random.rand(12,60,60) #p(x) is a 60x60 matrix that I have evaluated at 12 points
x = np.linalg.solve(a,b)
I get the following error:
ValueError: solve: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (m,m),(m,n)->(m,n) (size 60 is different from 12)
How can I solve this system in Numpy to get x? Is there a general mathematical trick to this?
Essentially, you're just doing 3600 independent 12-coefficient polynomial fits and composing the coefficients into matrices. For instance, the component p(y)[0, 0] is just:
p(y)[0, 0] = A[0, 0] + B[0, 0] * y + C[0, 0] * y**2 ... + D[0, 0] * y**n
The problem is that np.linalg.solve only accepts a right-hand side with a single dimension of stacked columns. But since your matrix elements are all independent (y is a scalar), you can ravel them, do the calculation in the form (m,m),(m,n**2) -> (m,n**2), and reshape back to matrices. So try:
a = np.vander([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,1.1,1.2]) #polynomial degree is 12, a -> (12x12)
b = np.random.rand(12,60,60) #p(x) is a 60x60 matrix that I have evaluated at 12 points
s = b.shape
x = np.linalg.solve(a, b.reshape(s[0], -1))
x = x.reshape(s)
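As a quick sanity check (a sketch, not part of the original answer): every ravelled column is solved independently, so solving a single entry on its own should agree with the corresponding slice of the reshaped result.
import numpy as np

a = np.vander([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2])
b = np.random.rand(12, 60, 60)
s = b.shape
x = np.linalg.solve(a, b.reshape(s[0], -1)).reshape(s)

# coefficients for entry (0, 0) solved alone should match the batched solve
x00 = np.linalg.solve(a, b[:, 0, 0])
assert np.allclose(x[:, 0, 0], x00)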
I'm trying to implement the softmax function for a neural network written in Numpy. Let h be the softmax value of a given signal i.
I've struggled to implement the softmax activation function's partial derivative.
I'm currently stuck at an issue where all the partial derivatives approach 0 as the training progresses. I've cross-referenced my math with this excellent answer, but my math does not seem to work out.
import numpy as np
def softmax_function( signal, derivative=False ):
    # Calculate activation signal
    e_x = np.exp( signal )
    signal = e_x / np.sum( e_x, axis = 1, keepdims = True )

    if derivative:
        # Return the partial derivation of the activation function
        return np.multiply( signal, 1 - signal ) + sum(
            # handle the off-diagonal values
            - signal * np.roll( signal, i, axis = 1 )
            for i in xrange(1, signal.shape[1] )
        )
    else:
        # Return the activation signal
        return signal
#end activation function
The signal parameter contains the input signal sent into the activation function and has the shape (n_samples, n_features).
# sample signal (3 samples, 3 features)
signal = [[0.3394572666491664, 0.3089068053925853, 0.3516359279582483], [0.33932706934615525, 0.3094755563319447, 0.3511973743219001], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256]]
The following code snippet is a fully working activation function and is only included as a reference and proof (mostly for myself) that the conceptual idea actually works.
from scipy.special import expit
import numpy as np

def sigmoid_function( signal, derivative=False ):
    # Prevent overflow.
    signal = np.clip( signal, -500, 500 )

    # Calculate activation signal
    signal = expit( signal )

    if derivative:
        # Return the partial derivation of the activation function
        return np.multiply(signal, 1 - signal)
    else:
        # Return the activation signal
        return signal
#end activation function
Edit
The problem persists even with simple single-layer networks. The softmax (and its derivative) is applied at the final layer.
This is an answer on how to calculate the derivative of the softmax function in a more vectorized NumPy fashion. However, the fact that the partial derivatives approach zero might not be a math issue; it could just be a problem of the learning rate or the known dying-weight issue with complex deep neural networks. Layers like ReLU help prevent the latter issue.
First, I've used the following signal (just duplicating your last entry) to make it 4 samples x 3 features, so it is easier to see what is going on with the dimensions.
>>> signal = np.array([[0.3394572666491664, 0.3089068053925853, 0.3516359279582483], [0.33932706934615525, 0.3094755563319447, 0.3511973743219001], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256]])
>>> signal.shape
(4, 3)
Next, you want to compute the Jacobian matrix of your softmax function. According to the cited page it is defined as -hi * hj for the off-diagonal entries (majority of the matrix for n_features > 2), so lets start there. In numpy, you can efficiently calculate that Jacobian matrix using broadcasting:
>>> J = - signal[..., None] * signal[:, None, :]
>>> J.shape
(4, 3, 3)
The first signal[..., None] (equivalent to signal[:, :, None]) reshapes the signal to (4, 3, 1) while the second signal[:, None, :] reshapes the signal to (4, 1, 3). Then, the * just multiplies both matrices element-wise. Numpy's internal broadcasting repeats both matrices to form the n_features x n_features matrix for every sample.
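A quick check of that claim (a small sketch, not from the original answer): each per-sample block is just the negative outer product of that sample's softmax vector with itself.
>>> np.allclose(J[0], -np.outer(signal[0], signal[0]))
True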
Then, we need to fix the diagonal elements:
>>> iy, ix = np.diag_indices_from(J[0])
>>> J[:, iy, ix] = signal * (1. - signal)
The above lines extract the diagonal indices for an n_features x n_features matrix. It is equivalent to doing iy = np.arange(n_features); ix = np.arange(n_features). Then, it replaces the diagonal entries with your definition hi * (1 - hi).
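For reference, a tiny illustration of what those indices look like for a 3 x 3 block (a sketch, not from the original answer):
>>> np.diag_indices_from(np.zeros((3, 3)))
(array([0, 1, 2]), array([0, 1, 2]))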
Last, according to the linked source, you need to sum across rows for each of the samples. That can be done as:
>>> J = J.sum(axis=1)
>>> J.shape
(4, 3)
Find below a summarized version:
if derivative:
    J = - signal[..., None] * signal[:, None, :]  # off-diagonal Jacobian
    iy, ix = np.diag_indices_from(J[0])
    J[:, iy, ix] = signal * (1. - signal)  # diagonal
    return J.sum(axis=1)  # sum across rows for each sample
Comparison of the derivatives:
>>> signal = [[0.3394572666491664, 0.3089068053925853, 0.3516359279582483], [0.33932706934615525, 0.3094755563319447, 0.3511973743219001], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256]]
>>> e_x = np.exp( signal )
>>> signal = e_x / np.sum( e_x, axis = 1, keepdims = True )
Yours:
>>> np.multiply( signal, 1 - signal ) + sum(
...     # handle the off-diagonal values
...     - signal * np.roll( signal, i, axis = 1 )
...     for i in xrange(1, signal.shape[1] )
... )
array([[ 2.77555756e-17, -2.77555756e-17, 0.00000000e+00],
[ -2.77555756e-17, -2.77555756e-17, -2.77555756e-17],
[ 2.77555756e-17, 0.00000000e+00, 2.77555756e-17],
[ 2.77555756e-17, 0.00000000e+00, 2.77555756e-17]])
Mine:
>>> J = - signal[..., None] * signal[:, None, :]
>>> iy, ix = np.diag_indices_from(J[0])
>>> J[:, iy, ix] = signal * (1. - signal)
>>> J.sum(axis=1)
array([[ 4.16333634e-17, -1.38777878e-17, 0.00000000e+00],
[ -2.77555756e-17, -2.77555756e-17, -2.77555756e-17],
[ 2.77555756e-17, 1.38777878e-17, 2.77555756e-17],
[ 2.77555756e-17, 1.38777878e-17, 2.77555756e-17]])
@Imanol Luengo's solution is wrong the moment he takes the sum across rows.
@Harveyslash also makes a good point, since he noticed extremely low "gradients" in his solution: the NN won't learn, or will learn in the wrong direction.
We have 4 samples x 3 inputs
The thing is that softmax is not a scalar function that takes just one input; it takes 3 in this case. Remember that all softmax outputs for one sample must add up to one, so you can't compute the value of one output without knowing the others. This implies that the gradient must be a square matrix, because we need to take into account all of the partial derivatives.
TLDR: The output of the gradient of softmax(sample) in this case must be a 3x3 matrix.
This is correct:
J = - signal[..., None] * signal[:, None, :] # off-diagonal Jacobian
iy, ix = np.diag_indices_from(J[0])
J[:, iy, ix] = signal * (1. - signal) # diagonal
Up until this point, Imanol uses fast vector operations to compute the Jacobian of the softmax function at 4 points, resulting in a 3x3 matrix stacked 4 times: 4 x 3 x 3.
Now I think what the OP really wants is dJdZ, the first step in ANN backprop:
dJdZ(4x3) = dJdy(4x3) * gradSoftmax[layer signal(4x3)](?,?)
The problem is that we usually (with sigmoid, ReLU, ... any scalar activation function) can compute the gradient as a stacked vector and then multiply it element-wise with dJdy, but here we have a stacked matrix. How can we marry the two concepts?
The vector can be seen as the non-zero elements of a diagonal matrix. All this time we have been getting away with simple element-wise multiplication only because our activation function was scalar! For softmax it's not that simple, and therefore we have to use matrix multiplication.
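To see why the element-wise shortcut is just a special case of that (a small sketch, not from the original answer; g and v are made-up stand-ins for a scalar activation's gradient vector and a dJdy row):
g = np.array([0.2, 0.5, 0.3])        # gradient of a scalar activation, one sample
v = np.array([1.0, -2.0, 0.5])       # dJdy for that sample
np.allclose(v * g, v @ np.diag(g))   # element-wise == matmul with the diagonal matrix -> True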
dJdZ(4x3) = dJdy(4 stacked 1x3) * anygradient[layer signal(4x3)](4 stacked 3x3)
Now we multiply each 1x3 dJdy vector by the corresponding 3x3 gradient, for each of the 4 samples, but the usual operators won't do this on their own; we need to specify along which dimensions to multiply (use np.einsum). End result:
#For reference 'mnr,mrk->mk' = 4x1x3,4x3x3->4x3
dJdZ = np.einsum('mnr,mrk->mk', dJdy[:,None,:], gradient)
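Equivalently (a sketch assuming dJdy with shape (4, 3) and gradient with shape (4, 3, 3) as above), the same contraction is just a batched matrix product:
dJdZ = (dJdy[:, None, :] @ gradient)[:, 0, :]  # (4,1,3) @ (4,3,3) -> (4,1,3) -> (4,3)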