I am currently working through the Stanford CS231n course. While completing the softmax_loss function, I found it hard to write a fully vectorized version, especially for the dW term. Below is my code. Can somebody help me optimize it? Any help would be appreciated.
def softmax_loss_vectorized(W, X, y, reg):
    loss = 0.0
    dW = np.zeros_like(W)
    num_train = X.shape[0]
    num_classes = W.shape[1]
    scores = X.dot(W)
    scores -= np.max(scores, axis = 1)[:, np.newaxis]
    exp_scores = np.exp(scores)
    sum_exp_scores = np.sum(exp_scores, axis = 1)
    correct_class_score = scores[range(num_train), y]
    loss = np.sum(np.log(sum_exp_scores)) - np.sum(correct_class_score)
    exp_scores = exp_scores / sum_exp_scores[:,np.newaxis]
    # maybe this loop can be rewritten as matrix operations
    for i in range(num_train):
        dW += exp_scores[i] * X[i][:,np.newaxis]
        dW[:, y[i]] -= X[i]
    loss /= num_train
    loss += 0.5 * reg * np.sum( W*W )
    dW /= num_train
    dW += reg * W
    return loss, dW
Here's a vectorized implementation below. But I suggest you spend a little more time on it and try to get to the solution yourself. The idea is to construct a matrix with all the softmax values and subtract 1 from the entries that correspond to the correct classes.
def softmax_loss_vectorized(W, X, y, reg):
    num_train = X.shape[0]
    scores = X.dot(W)
    scores -= np.max(scores)
    correct_scores = scores[np.arange(num_train), y]
    # Compute the softmax per correct scores in bulk, and sum over its logs.
    exponents = np.exp(scores)
    sums_per_row = np.sum(exponents, axis=1)
    softmax_array = np.exp(correct_scores) / sums_per_row
    information_array = -np.log(softmax_array)
    loss = np.mean(information_array)
    # Compute the softmax per whole scores matrix, which gives the matrix for X rows coefficients.
    # Their linear combination is algebraically dot product X transpose.
    all_softmax_matrix = (exponents.T / sums_per_row).T
    grad_coeff = np.zeros_like(scores)
    grad_coeff[np.arange(num_train), y] = -1
    grad_coeff += all_softmax_matrix
    dW = np.dot(X.T, grad_coeff) / num_train
    # Regularization
    loss += 0.5 * reg * np.sum(W * W)
    dW += reg * W
    return loss, dW
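One way to gain confidence in the vectorized gradient is to compare it with the explicit loop on a small random problem. The sketch below is only an illustration: the helper names (softmax_probs, grad_loop, grad_vectorized) and the random data are made up for this check, and the regularization term is left out because it is identical in both versions.

```python
import numpy as np

def softmax_probs(W, X):
    """Row-wise softmax of X.dot(W), shifted per row for numerical stability."""
    scores = X.dot(W)
    scores -= scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def grad_loop(W, X, y):
    """Data-term gradient accumulated one training example at a time."""
    probs = softmax_probs(W, X)
    dW = np.zeros_like(W)
    for i in range(X.shape[0]):
        dW += np.outer(X[i], probs[i])   # same as probs[i] * X[i][:, np.newaxis]
        dW[:, y[i]] -= X[i]
    return dW / X.shape[0]

def grad_vectorized(W, X, y):
    """Same gradient: subtract 1 at the correct classes, then one matrix product."""
    num_train = X.shape[0]
    grad_coeff = softmax_probs(W, X)
    grad_coeff[np.arange(num_train), y] -= 1
    return X.T.dot(grad_coeff) / num_train

# Hypothetical small problem: 6 examples, 4 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
W = rng.standard_normal((4, 3))
y = rng.integers(0, 3, size=6)

dW_loop = grad_loop(W, X, y)
dW_vec = grad_vectorized(W, X, y)
print(np.allclose(dW_loop, dW_vec))
```

The loop adds X[i] scaled by each class probability and subtracts X[i] once in the correct column, which is exactly what the single product X.T.dot(grad_coeff) does for all rows at once.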
I am new to PyTorch and I would like to implement linear regression partly with PyTorch and partly on my own. I want to use squared features for my regression:
import torch
# init
x = torch.tensor([1,2,3,4,5])
y = torch.tensor([[1],[4],[9],[16],[25]])
w = torch.tensor([[0.5], [0.5], [0.5]], requires_grad=True)
iterations = 30
alpha = 0.01
def forward(X):
    # feature transformation [1, x, x^2]
    psi = torch.tensor([[1.0, x[0], x[0]**2]])
    for i in range(1, len(X)):
        psi = torch.cat((psi, torch.tensor([[1.0, x[i], x[i]**2]])), 0)
    return torch.matmul(psi, w)
def loss(y, y_hat):
    return ((y-y_hat)**2).mean()
for i in range(iterations):
    y_hat = forward(x)
    l = loss(y, y_hat)
    l.backward()
    with torch.no_grad():
        w -= alpha * w.grad
        w.grad.zero_()
    if i%10 == 0:
        print(f'Iteration {i}: The weight is:\n{w.detach().numpy()}\nThe loss is:{l}\n')
When I execute my code, the regression doesn't learn the correct weights and the loss keeps increasing. The output is the following:
Iteration 0: The weight is:
[[0.57 ]
[0.81 ]
[1.898]]
The loss is:25.450000762939453
Iteration 10: The weight is:
[[ 5529.5835]
[22452.398 ]
[97326.12 ]]
The loss is:210414632960.0
Iteration 20: The weight is:
[[5.0884394e+08]
[2.0662339e+09]
[8.9567642e+09]]
The loss is:1.7820802835250162e+21
Does anybody know why my model is not learning?
UPDATE
Is there a reason why it performs so poorly? I thought it was because of the small amount of training data, but even with 10 data points it does not perform well.
You should normalize your data. Also, since you're trying to fit x -> ax² + bx + c, c is essentially the bias. It would be wiser to remove it from the training data (I'm referring to psi here) and use a separate parameter for the bias.
What could be done:
normalize your input data and targets with mean and standard deviation.
separate the parameters into w (a two-component weight tensor) and b (the bias).
you don't need to construct psi on every inference since x is identical.
you can build psi with torch.stack([torch.ones_like(x), x, x**2], 1), but here we won't need the ones, as we've essentially detached the bias from the weight tensor.
Here's what it would look like:
x = torch.tensor([1,2,3,4,5]).float()
psi = torch.stack([x, x**2], 1).float()
psi = (psi - psi.mean(0)) / psi.std(0)
y = torch.tensor([[1],[4],[9],[16],[25]]).float()
y = (y - y.mean(0)) / y.std(0)
w = torch.tensor([[0.5], [0.5]], requires_grad=True)
b = torch.tensor([0.5], requires_grad=True)
iterations = 30
alpha = 0.02
def loss(y, y_hat):
    return ((y-y_hat)**2).mean()
for i in range(iterations):
    y_hat = torch.matmul(psi, w) + b
    l = loss(y, y_hat)
    l.backward()
    with torch.no_grad():
        w -= alpha * w.grad
        b -= alpha * b.grad
        w.grad.zero_()
        b.grad.zero_()
    if i%10 == 0:
        print(f'Iteration {i}: The weight is:\n{w.detach().numpy()}\nThe loss is:{l}\n')
And the results:
Iteration 0: The weight is:
[[0.49954653]
[0.5004535 ]]
The loss is:0.25755801796913147
Iteration 10: The weight is:
[[0.49503425]
[0.5049657 ]]
The loss is:0.07994867861270905
Iteration 20: The weight is:
[[0.49056274]
[0.50943726]]
The loss is:0.028329044580459595
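To see the effect of normalization in isolation, here is a rough NumPy sketch (not PyTorch, so the gradient is written out by hand) that runs the same plain gradient descent on the raw [1, x, x^2] features and on standardized features. The helper run_gd and the choice to keep the bias as a column of ones are assumptions made for this illustration, not part of the code above.

```python
import numpy as np

def run_gd(features, targets, lr, steps):
    """Plain gradient descent on mean squared error; returns the loss per step."""
    w = np.full((features.shape[1], 1), 0.5)
    history = []
    for _ in range(steps):
        residual = features @ w - targets
        history.append(float(np.mean(residual ** 2)))
        grad = 2.0 * features.T @ residual / len(targets)
        w -= lr * grad
    return history

x = np.arange(1.0, 6.0)                  # 1, 2, 3, 4, 5
y = (x ** 2).reshape(-1, 1)

# Raw features as in the question: [1, x, x^2]
raw = np.stack([np.ones_like(x), x, x ** 2], axis=1)

# Standardized [x, x^2] plus a bias column, and standardized targets
poly = np.stack([x, x ** 2], axis=1)
scaled = (poly - poly.mean(0)) / poly.std(0)
scaled = np.hstack([np.ones((len(x), 1)), scaled])
y_scaled = (y - y.mean()) / y.std()

raw_losses = run_gd(raw, y, lr=0.01, steps=30)
scaled_losses = run_gd(scaled, y_scaled, lr=0.02, steps=30)

print("raw:   ", raw_losses[0], "->", raw_losses[-1])
print("scaled:", scaled_losses[0], "->", scaled_losses[-1])
```

With the raw features the x^2 column dominates the curvature of the loss, so a step size of 0.01 overshoots on every update; after standardization all columns have comparable scale and the same kind of step size is stable.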
I'm currently taking Andrew Ng's "Machine Learning" course on Coursera. In exercise 5, we built a model that predicts digits, trained on the MNIST dataset. I completed this task successfully in Matlab, but I wanted to migrate the code to Python, just to see how different things are and maybe continue to play around with the model.
I managed to implement the cost function and the back propagation algorithm correctly. I know that because I compared the metrics with my working Matlab model and it produces the same numbers.
Now, because in the course we train the model using fmincg, I tried to do the same using SciPy's fmin_cg function.
My problem is that the optimizer takes extremely small steps and the cost fails to converge.
Here is my code for the network:
import numpy as np
import utils
import scipy.optimize as op
class Network:
    def __init__(self, layers):
        self.layers = layers
        self.weights = self.generate_params()

    # Function for generating theta multidimensional matrix
    def generate_params(self):
        theta = []
        epsilon = 0.12
        for i in range(len(self.layers) - 1):
            current_layer_units = self.layers[i]
            next_layer_units = self.layers[i + 1]
            theta_i = np.multiply(
                np.random.rand(next_layer_units, current_layer_units + 1),
                2 * epsilon - epsilon
            )
            # Appending the params to the theta matrix
            theta.append(theta_i)
        return theta

    # Function to append bias row/column to matrix X
    def append_bias(self, X, d):
        m = X.shape[0]
        n = 1 if len(X.shape) == 1 else X.shape[1]
        if (d == 'column'):
            ones = np.ones((m, n + 1))
            ones[:, 1:] = X.reshape((m, n))
        elif (d == 'row'):
            ones = np.ones((m + 1, n))
            ones[1:, :] = X.reshape((m, n))
        return ones

    # Function for computing the gradient for 1 training example
    def back_prop(self, y, feed, theta):
        activations = feed["activations"]
        weighted_layers = feed["weighted_layers"]
        delta_output = activations[-1] - y.reshape(len(y), 1)
        current_delta = delta_output
        # Initializing gradients
        gradients = []
        for i, theta_i in enumerate(theta):
            gradients.append(np.zeros(theta_i.shape))
        # Performing delta calculations.
        # Here, we continue to propagate the delta values backwards
        # until we arrive to the second layer.
        for i in reversed(range(len(theta))):
            theta_i = theta[i]
            if (i > 0):
                i_weighted_inputs = self.append_bias(weighted_layers[i - 1], 'row')
                t_theta_i = np.transpose(theta_i)
                delta_i = np.multiply(np.dot(t_theta_i, current_delta), utils.sigmoidGradient(i_weighted_inputs))
                delta_i = delta_i[1:]
                gradients[i] = current_delta * np.transpose(activations[i])
                # Setting current delta for the next layer
                current_delta = delta_i
            else:
                gradients[i] = current_delta * np.transpose(activations[i])
        return gradients

    # Function for computing the cost and the derivatives
    def compute_cost(self, theta, X, y, r12n = 0):
        m = len(X)
        num_labels = self.layers[-1]
        costs = np.zeros(m)
        # Initializing gradients
        gradients = []
        for i, theta_i in enumerate(theta):
            gradients.append(np.zeros(theta_i.shape))
        # Iterating over the training set
        for i in range(m):
            inputs = X[i]
            observed = utils.create_output_vector(y[i], num_labels)
            feed = self.feed_forward(inputs)
            predicted = feed["activations"][-1]
            total_cost = 0
            for k, o in enumerate(observed):
                if (o == 1):
                    total_cost += np.log(predicted[k])
                else:
                    total_cost += np.log(1 - predicted[k])
            cost = -1 * total_cost
            # Storing the cost for the i-th training example
            costs[i] = cost
            # Calculating the gradient for this training example
            # using back propagation algorithm
            gradients_i = self.back_prop(observed, feed, theta)
            for i, gradient in enumerate(gradients_i):
                gradients[i] += gradient
        # Calculating the avg regularization term for the cost
        sum_of_theta = 0
        for i, theta_i in enumerate(theta):
            squared_theta = np.power(theta_i[:, 1:], 2)
            sum_of_theta += np.sum(squared_theta)
        r12n_avg = r12n * sum_of_theta / (2 * m)
        total_cost = np.sum(costs) / m + r12n_avg
        # Applying regularization terms to the gradients
        for i, theta_i in enumerate(theta):
            lambda_i = np.copy(theta_i)
            lambda_i[:, 0] = 0
            lambda_i = np.multiply((r12n / m), lambda_i)
            # Adding the r12n matrix to the gradient
            gradients[i] = gradients[i] / m + lambda_i
        return total_cost, gradients

    # Function for training the neural network using conjugate gradient algorithm
    def train_cg(self, X, y, r12n = 0, iterations = 50):
        weights = self.weights

        def Cost(theta, X, y):
            theta = utils.roll_theta(theta, self.layers)
            cost, _ = self.compute_cost(theta, X, y, r12n)
            print(cost)
            return cost

        def Gradient(theta, X, y):
            theta = utils.roll_theta(theta, self.layers)
            _, gradient = self.compute_cost(theta, X, y, r12n)
            return utils.unroll_theta(gradient)

        unrolled_theta = utils.unroll_theta(weights)
        result = op.fmin_cg(f = Cost,
                            x0 = unrolled_theta,
                            args=(X, y),
                            fprime=Gradient,
                            maxiter = iterations)
        self.weights = utils.roll_theta(result, self.layers)

    # Function for feeding forward the network
    def feed_forward(self, X):
        # Useful variables
        activations = []
        weighted_layers = []
        weights = self.weights
        currentActivations = self.append_bias(X, 'row')
        activations.append(currentActivations)
        for i in range(len(self.layers) - 1):
            layer_weights = weights[i]
            weighted_inputs = np.dot(layer_weights, currentActivations)
            # Storing the weighted inputs
            weighted_layers.append(weighted_inputs)
            activation_nodes = []
            # If the next layer is not the output layer, we'd like to add a bias unit to it
            # (Excluding the input and the output layer)
            if (i < len(self.layers) - 2):
                activation_nodes = self.append_bias(utils.sigmoid(weighted_inputs), 'row')
            else:
                activation_nodes = utils.sigmoid(weighted_inputs)
            # Appending the layer of nodes to the activations array
            activations.append(activation_nodes)
            currentActivations = activation_nodes
        data = {
            "activations": activations,
            "weighted_layers": weighted_layers
        }
        return data

    def predict(self, X):
        data = self.feed_forward(X)
        output = data["activations"][-1]
        # Finding the max index in the output layer
        return np.argmax(output, axis=0)
Here is the invocation of the code:
import numpy as np
from network import Network
# %% Load data
X = np.genfromtxt('data/mnist_data.csv', delimiter=',')
y = np.genfromtxt('data/mnist_outputs.csv', delimiter=',').astype(int)
# %% Create network
num_labels = 10
input_layer = 400
hidden_layer = 25
output_layer = num_labels
layers = [input_layer, hidden_layer, output_layer]
# Create a new neural network
network = Network(layers)
# %% Train the network and save the weights
network.train_cg(X, y, r12n = 1, iterations = 20)
This is what the code emits after each iteration:
15.441233231650283
15.441116436313076
15.441192262452514
15.44122384651483
15.441231216030646
15.441232804294314
15.441233141284435
15.44123321255294
15.441233227614855
As you can see, the changes to the cost are very small.
I checked the shapes of the parameter vector and the gradient and they both seem fine, just like in my Matlab implementation. I'm not sure what I'm doing wrong here.
If you guys could help me, that'd be great :)
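For reference, here is a minimal sketch of how fmin_cg calls its callbacks, using a toy quadratic instead of the network. The names cost, gradient and target are hypothetical. The optimizer passes a flat 1-D theta to both f and fprime on every evaluation, so both must compute their result from that argument. In the class above, feed_forward reads self.weights rather than the theta given to compute_cost, so it may be worth checking whether the forward pass actually changes when the optimizer proposes new parameters.

```python
import numpy as np
import scipy.optimize as op

def cost(theta, target):
    # Depends only on the theta argument handed in by the optimizer.
    return float(np.sum((theta - target) ** 2))

def gradient(theta, target):
    # Flat 1-D array with the same shape as theta.
    return 2.0 * (theta - target)

target = np.array([1.0, -2.0, 3.0])   # hypothetical minimizer of the toy cost
theta0 = np.zeros(3)

result = op.fmin_cg(f=cost,
                    x0=theta0,
                    fprime=gradient,
                    args=(target,),
                    maxiter=50,
                    disp=False)
print(result)
```

If the cost printed inside f barely moves between calls even though theta changes, that usually means the value being returned does not depend on theta in the way the gradient claims.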
This is not a question for a specific problem I am trying to solve. I am just trying to understand why a gradient is calculated by multiplying the layers (matrices) in a mostly backward fashion. I also didn't know subtracting y from the prediction could also give you something called a gradient.
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
I'm not sure what I expected PyTorch to be doing to find the gradients. I figured it was some kind of algorithm that applied the power rule and the other derivative rules somehow.
import numpy as np
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)
    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
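One way to convince yourself that these matrix products really are derivatives is to compare them with a finite-difference estimate. The sketch below uses much smaller, made-up dimensions and a hypothetical helper loss_fn. Since the loss is the sum of (y_pred - y)^2, its derivative with respect to y_pred is 2 * (y_pred - y), and the chain rule through y_pred = h_relu.dot(w2) gives h_relu.T.dot(grad_y_pred).

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2          # small sizes chosen for the check
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn(w1, w2):
    """Same forward pass and loss as in the training loop."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

# Analytic gradient, exactly as in the loop above
h_relu = np.maximum(x.dot(w1), 0)
y_pred = h_relu.dot(w2)
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.T.dot(grad_y_pred)

# Numerical gradient: nudge one entry of w2 at a time (central differences)
eps = 1e-6
num_grad_w2 = np.zeros_like(w2)
for i in range(H):
    for j in range(D_out):
        step = np.zeros_like(w2)
        step[i, j] = eps
        num_grad_w2[i, j] = (loss_fn(w1, w2 + step) - loss_fn(w1, w2 - step)) / (2 * eps)

print(np.allclose(grad_w2, num_grad_w2, atol=1e-4))
```

The backward order follows from the chain rule: the loss depends on w1 only through h, h_relu and y_pred, so the derivative with respect to y_pred has to be available before it can be pushed back through w2 and the ReLU.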
I am trying to create a basic Linear Regression Model implementing Coordinate Descent (I have made it inherit from OrdinaryLinearRegression, because it implements the same predict and score functions).
Using the loss function as the Residual Sum of squares:
$$L_{RSS} = \frac{1}{N} \lVert Xw - y \rVert^2$$
Our gradient descent should be:
$$w' = w - \alpha \frac{2}{N} X^T (Xw - y)$$
Implementing the code:
def scalingfeatures(X):
    scaler = StandardScaler()
    scaler.fit(X)
    return scaler.transform(X)

class OrdinaryLinearRegressionCoordinateDescent(OrdinaryLinearRegression):
    def __init__(self,lr,num_iter):
        self.lr = lr
        self.num_iter = num_iter

    def lossfunction(self,X,y,w):
        m = np.size(y)
        #Cost function in vectorized form
        y_pred = X @ w
        # J = 1/N * Sum((ŷ - y)**2)
        J = float((1./(2*m)) * (y_pred - y).T @ (y_pred - y))
        return J

    def fit(self,X,y):
        X = scalingfeatures(X)
        X = np.concatenate((np.ones((X.shape[0],1)),X),axis=1)
        m,n = X.shape
        np.random.seed(42)
        w = np.random.randn(n,1)
        y = y.reshape(-1,1)
        for iter in range(self.num_iter):
            for j in range(n):
                #Coordinate descent in vectorized form
                X_j = X[:,j].reshape(-1,1)
                y_pred = X @ w
                gradient = X_j.T @ (y_pred-y)
                w[j] = w[j] - self.lr * (2/n) * gradient
            loss = self.lossfunction(X,y,w)
            print(loss)
        self.w = w
        return self

OLRCD = OrdinaryLinearRegressionCoordinateDescent(lr=0.05,num_iter=500)
train = OLRCD.fit(X,y)
print("The training MSE for ORLGD is: ",train.score(X,y))
When I run the code, I get that with every iteration the loss only increases...
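For comparison, here is a standalone sketch of the update rule above applied one coordinate at a time, with N taken to be the number of samples (rows of X). That reading of N is an assumption; note that the fit method above scales the step by 2/n, where n is the number of columns. The synthetic data, true_w and the mse helper are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
m, d = 200, 3                                        # samples, raw features
X_raw = rng.standard_normal((m, d))
X_std = (X_raw - X_raw.mean(0)) / X_raw.std(0)       # standardize the features
X = np.hstack([np.ones((m, 1)), X_std])              # prepend the bias column
true_w = np.array([[1.0], [2.0], [-3.0], [0.5]])     # hypothetical ground truth
y = X @ true_w + 0.1 * rng.standard_normal((m, 1))

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

w = rng.standard_normal((X.shape[1], 1))
lr = 0.05
losses = [mse(X, y, w)]
for _ in range(500):
    for j in range(X.shape[1]):
        X_j = X[:, j].reshape(-1, 1)
        # Partial derivative of (1/m) * ||Xw - y||^2 with respect to w[j]
        gradient_j = (2.0 / m) * (X_j.T @ (X @ w - y))
        w[j] -= lr * gradient_j.item()
    losses.append(mse(X, y, w))

print(losses[0], "->", losses[-1])
```

Because X_j.T @ (X @ w - y) is a sum over all m samples, dividing by the number of columns instead of the number of samples makes the effective step roughly m/n times larger, which can be enough to make every update overshoot.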
I am a freshman and a beginner.
I am studying machine learning with open tutorials.
I am having trouble implementing the gradient descent algorithm.
I have to complete the body of "for _ in range(max_iter):", but I don't know numpy well, so I don't know what code I should add.
Could you please help me fill in the blank?
I know this type of question is rude... sorry, but I need your help :(
Thank you in advance.
from sklearn import datasets
import numpy as np
from sklearn.metrics import accuracy_score
X, y = datasets.make_classification(
    n_samples = 200, n_features = 2, random_state = 333,
    n_informative =2, n_redundant = 0 , n_clusters_per_class= 1)
def sigmoid(s):
    return 1 / (1 + np.exp(-s))
def loss(y, h):
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
def gradient(X, y, w):
    return -(y * X) / (1 + np.exp(-y * np.dot(X, w)))
X_bias = np.append(np.ones((X.shape[0], 1)), X, axis=1)
y = np.array([[1] if label == 0 else [0] for label in y])
w = np.array([[random.uniform(-1, 1)] for _ in range(X.shape[1]+1)])
max_iter = 100
learning_rate = 0.1
threshold = 0.5
for _ in range(max_iter):
    #fill in the blank
    what code should i add ????
probabilities = sigmoid(np.dot(X_bias, w))
predictions = [[1] if p > threshold else [0] for p in probabilities]
print("loss: %.2f, accuracy: %.2f" %
      (loss(y, probabilities), accuracy_score(y, predictions)))
Inside the for loop, we have to first compute the probabilities. Then find the gradients and then update the weights.
For computing probabilities, you can use the code below
probs=sigmoid(np.dot(X_bias,w))
np.dot is the NumPy function for matrix multiplication. Next we calculate the loss and its gradient.
J=loss(y,probs)
dJ=gradient(X_bias,y,w)
Now we will update the weights.
w=w-learning_rate*dJ
So the final code will be
from sklearn import datasets
import numpy as np
from sklearn.metrics import accuracy_score
X, y = datasets.make_classification(
    n_samples = 200, n_features = 2, random_state = 333,
    n_informative =2, n_redundant = 0 , n_clusters_per_class= 1)
def sigmoid(s):
    return 1 / (1 + np.exp(-s))
def loss(y, h):
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
def gradient(X, y, w):
    return -(y * X) / (1 + np.exp(-y * np.dot(X, w)))
X_bias = np.append(np.ones((X.shape[0], 1)), X, axis=1)
y = np.array([[1] if label == 0 else [0] for label in y])
w = np.array([[np.random.uniform(-1, 1)] for _ in range(X.shape[1]+1)])
max_iter = 100
learning_rate = 0.1
threshold = 0.5
for _ in range(max_iter):
    probs=sigmoid(np.dot(X_bias,w))
    J=loss(y,probs)
    dJ=gradient(X_bias,y,w)
    w=w-learning_rate*dJ
probabilities = sigmoid(np.dot(X_bias, w))
predictions = [[1] if p > threshold else [0] for p in probabilities]
print("loss: %.2f, accuracy: %.2f" %
      (loss(y, probabilities), accuracy_score(y, predictions)))
Note: inside the for loop there is no need to compute probs and the loss, since only the gradient is needed to update the weights. I included them to make the code easier to follow.
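One caveat about shapes: the gradient function as given in the question returns an array of shape (n_samples, 3), one row per sample, while w has shape (3, 1), so the two cannot be subtracted directly. The sketch below swaps in the standard gradient of the mean cross-entropy loss for 0/1 labels, X.T.dot(h - y) / n_samples. That substitution is an assumption about what the tutorial intends, and the two Gaussian blobs are only a NumPy stand-in for make_classification so that the example has no extra dependencies.

```python
import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

def loss(y, h):
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

def gradient(X, y, w):
    # Gradient of the mean cross-entropy loss for labels in {0, 1};
    # the result has the same shape as w.
    h = sigmoid(np.dot(X, w))
    return np.dot(X.T, h - y) / X.shape[0]

# Stand-in data: two Gaussian blobs, 100 points each (hypothetical)
rng = np.random.default_rng(333)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)),
               rng.normal(1.0, 1.0, (100, 2))])
y = np.vstack([np.zeros((100, 1)), np.ones((100, 1))])
X_bias = np.append(np.ones((X.shape[0], 1)), X, axis=1)

w = rng.uniform(-1, 1, (X_bias.shape[1], 1))
max_iter = 100
learning_rate = 0.1
threshold = 0.5

initial_loss = loss(y, sigmoid(np.dot(X_bias, w)))
for _ in range(max_iter):
    dJ = gradient(X_bias, y, w)
    w = w - learning_rate * dJ

probabilities = sigmoid(np.dot(X_bias, w))
final_loss = loss(y, probabilities)
accuracy = np.mean((probabilities > threshold) == y)
print("loss: %.2f -> %.2f, accuracy: %.2f" % (initial_loss, final_loss, accuracy))
```

The same loop body works unchanged with the sklearn data from the question, as long as gradient returns an array with the same shape as w.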