I am implementing the logistic regression algorithm with two features, x1 and x2. I am writing the code for the cost function in logistic regression.
def computeCost(X, y, theta):
    J = np.sum(-y * np.log(sigmoid(np.dot(X, theta))) - (1 - y) * np.log(1 - sigmoid(np.dot(X, theta)))) / m
    return J
Here X is the training set matrix and y is the output vector.
The shape of X is (100, 3) and the shape of y is (100,), as reported by NumPy's shape attribute. My theta initially contains all zeros and has shape (3, 1). When I compute the cost with these parameters I get 69.314, but that is incorrect; the correct cost is 0.69314. I do get the correct cost when I reshape my y vector as y = numpy.reshape(y, (-1, 1)).
But I don't understand how this reshaping corrects my cost.
Here m (the number of training examples) is 100.
First of all, please don't just dump your code in future posts. A post (code + explanation) should be as descriptive as possible without being verbose, otherwise it's hard to read and answer. Here is what your code is doing, rewritten in a readable form:
def computeCost(X, y, theta):
    '''
    Binary cross-entropy (log loss) cost
    X: (100, 3)
    y: (100, 1)
    theta: (3, 1)
    Returns the scalar cost
    Cost = -( labels*log(predictions) + (1-labels)*log(1-predictions) ) / len(labels)
    '''
    m = len(y)
    # calculate the predictions
    predictions = sigmoid(np.dot(X, theta))
    # error for when the label is of class 1
    class1_cost = -y * np.log(predictions)
    # error for when the label is of class 0
    class2_cost = (1 - y) * np.log(1 - predictions)
    # total cost
    cost = class1_cost - class2_cost
    # average cost
    cost = cost.sum() / m
    return cost
You should first understand how the dot product works mathematically and what input shapes your algorithm needs in order to give the correct answer; don't throw random shapes at it. Your feature_vector is of shape (100, 3), which, when multiplied by your theta of shape (3, 1), outputs a prediction vector of shape (100, 1).
Matrix multiplication: the product of an M x N matrix and an N x K matrix is an M x K matrix. The new matrix takes the rows of the 1st and the columns of the 2nd.
So your y should have shape (100, 1) and not (100,). Huge difference! One is [[3],[4],[6],[7],[9],...] and the other is [3,4,6,7,9,...].
Your dimensions must match to get the correct output.
A better way of asking the question would be: how do I calculate the error/cost in logistic regression using the correct dimensions for my labels?
For additional understanding, consider the numbered cases below:
import numpy as np

label_type1 = np.random.rand(100, 1)
label_type2 = np.random.rand(100,)
predictions = np.random.rand(100, 1)
print(label_type1.shape, label_type2.shape, predictions.shape)

# 1. When you multiply (100,1) with (100,1) --> (100,1)
print((label_type1 * predictions).shape)

# 2. When you take the dot product of (100,1) with (100,1) --> Error, for which you would
#    have to take a transpose, which isn't relevant in this context!
# print(np.dot(label_type1, predictions).shape)  # error: shapes (100,1) and (100,1) not aligned: 1 (dim 1) != 100 (dim 0)
print(np.dot(label_type1.T, predictions).shape)  # (1, 1)
print('*' * 5)

# 3. When you multiply (100,) with (100,1) --> (100,100) !
print((label_type2 * predictions).shape)

# 4. When you take the dot product of (100,) with (100,1) --> (1,)
print(np.dot(label_type2, predictions).shape)
print('*' * 5)

# 5. What you are doing with the reshape: adding a dimension
label_type2_addDim = np.reshape(label_type2, (-1, 1))
print(label_type2_addDim.shape)  # (100, 1)
So, coming straight to the point: what you want is an element-wise cost of shape (100, 1). You get that either with case 1 directly (which you aren't doing, because your y is (100,)), or with case 5, where the reshape adds a dimension to your y, turning it from (100,) into (100, 1), so the * operation behaves like case 1 and gives shape (100, 1) instead of broadcasting to (100, 100).
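To see the effect on the cost itself, here is a minimal, self-contained sketch (my own illustration, using random data, an all-zero theta as in the question, and a standard sigmoid helper): with y of shape (100,) the broadcast inflates the sum, while y of shape (100, 1) gives the expected 0.693.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m = 100
X = np.hstack([np.ones((m, 1)), np.random.rand(m, 2)])   # bias column plus two features
theta = np.zeros((3, 1))
y_flat = np.random.randint(0, 2, size=m).astype(float)   # labels with shape (100,)
y_col = y_flat.reshape(-1, 1)                            # labels with shape (100, 1)

def computeCost(X, y, theta):
    predictions = sigmoid(np.dot(X, theta))               # shape (100, 1)
    cost = -y * np.log(predictions) - (1 - y) * np.log(1 - predictions)
    return cost.sum() / m

print(computeCost(X, y_flat, theta))   # ~69.3: (100,) * (100,1) broadcasts to (100,100)
print(computeCost(X, y_col, theta))    # ~0.693 = log(2), the expected cost for theta = 0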
I'm trying to use the dot product in NumPy between two matrices with different dimensions: w is (1, 5) and X is (3, 5). I'm not sure which command I can use to change the dimensions, as I am new to Python. Thank you.
When I try running my function, it gives me an error saying:
ValueError: shapes (1,5) and (3,5) not aligned: 5 (dim 1) != 3 (dim 0)
import numpy as np

def L(w, X, y):
    """
    Arguments:
    w -- vector of size n representing the weights of the n input features
    X -- matrix of size m x n representing the input data, m data samples with n features each
    y -- vector of size m (true labels)
    Returns:
    loss -- the value of the loss function defined above
    """
    ### START CODE HERE ### (2-4 lines of code)
    # w needs to match the X matrix
    # w = (1, 5)
    # X = (3, 5)
    yhat = np.dot(w, X)  # this is the line that raises the ValueError above
    L1 = y - yhat
    loss = np.dot(L1, L1)
    ### END CODE HERE ###
    return loss
Here is the picture of the directions: [image of directions]
The dot product of two vectors is the sum of the products of elements with regard to position: the first element of the first vector is multiplied by the first element of the second vector, and so on. The sum of these products is the dot product, which can be computed with the np.dot() function.
Since we multiply elements at the same positions, the two vectors must have the same length in order to have a dot product.
import numpy as np
a = np.array([[1,2],[3,4]])
b = np.array([[11,12],[13,14]])
np.dot(a,b)
It will produce the following output:
[[37 40]
[85 92]]
Note that the dot product is calculated as:
[[1*11+2*13, 1*12+2*14],[3*11+4*13, 3*12+4*14]]
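To tie this back to the shapes in the question above (a sketch of one possible fix, my own suggestion; whether it is what the assignment's directions intend, I can't tell from the image): w is (1, 5) and X is (3, 5), so np.dot(w, X) is not aligned, but transposing X makes the inner dimensions match.
import numpy as np

w = np.random.rand(1, 5)   # weights, shape (1, 5)
X = np.random.rand(3, 5)   # data, shape (3, 5): 3 samples with 5 features each

# np.dot(w, X)             # ValueError: shapes (1,5) and (3,5) not aligned
yhat = np.dot(w, X.T)      # (1,5) . (5,3) -> (1,3): one prediction per sample
print(yhat.shape)          # (1, 3)

# equivalently, np.dot(X, w.T) gives the predictions as a (3, 1) column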
You get full flexibility with tensordot, which implements tensor products with an arbitrary choice of axes.
A nice application is estimating the covariance matrix, without messing with transpositions:
import numpy as np
from scipy.stats import multivariate_normal

dist = multivariate_normal(mean=[0, 0], cov=[[1, 1], [1, 2]])
samples = dist.rvs(size=1000)  # shape (1000, 2)
np.tensordot(samples, samples, axes=[0, 0]) / len(samples)  # close to [[1,1],[1,2]]
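As an aside (my own sanity check, not part of the original answer): with axes=[0,0], tensordot contracts the sample axis of both arrays, which is exactly the same contraction as samples.T @ samples.
import numpy as np

samples = np.random.randn(1000, 2)
a = np.tensordot(samples, samples, axes=[0, 0]) / len(samples)
b = samples.T @ samples / len(samples)
print(np.allclose(a, b))  # True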
I was going through this amazing playlist on SVD by Steve Brunton on YouTube. I think I got the majority of the concepts, but there are some gaps. Let me add a couple of screenshots so that it's easier for me to explain.
He is considering the input matrix X to be a collection of images. So, considering an image is 28x28 pixels, we flatten it to create a 784x1 column vector. So, each column denotes an image, and the rows denote pixel indices. Let's take the dimension of X to be n x m. Now, after computing the economy SVD, if we keep only the first r (<< m) singular values, then the approximation of X is given by
X' = σ1·u1·v1.T + σ2·u2·v2.T + ... + σr·ur·vr.T
I understand that here we're throwing away information, so the reconstructed images would be pixelated, but they would still have the same dimension (28x28). So, how are we achieving compression here? Is it because instead of storing 784m pixel values, we only have to store r x (784 (length of each u) + m (length of each v)) values, plus the r singular values? Or is there something more to it?
My second question is about drawing an analogy to numerical features, e.g. a housing price dataset with 50 features and 1000 data points. Our X matrix then has dimension 50 x 1000 (each column being a feature vector). In that case, if there are useless features, we'll get << 50 features (maybe 20, or 10, whatever) after applying PCA, right? I'm not able to grasp how that smaller feature vector is derived when we select only the biggest r singular values, because X and X' have the same dimensions.
Let's have a sample code. The dimensions are reversed because of how sklearn expects it.
pca = PCA(n_components=10)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape: ", X.shape) # original shape: (1000, 50)
print("transformed shape:", X_pca.shape) # transformed shape: (1000, 10)
So, how are we going from 50 to 10 here? I get that in this case there would be 50 U basis vectors. So, even if we choose the top r from these 50, the dimensions will still be the same, right? Any help is appreciated.
I've been searching for the answer all over the web, and it finally clicked when I saw this video tutorial. We know X = U x ∑ x V.T. Here, the columns of U give us the principal components for the column space of X. Similarly, the rows of V.T give us the principal components for the row space of X. Since in PCA we tend to represent a feature vector by a row (unlike in SVD), we select the first r principal components from the matrix V.T.
Let's assume the dimensions of X to be mxn. So, we have m samples each having n features. That gives us the following dimensions for the SVD:
U: mxm
∑: mxn
V: nxn
Now, if we select only r (<< n) principal components, then the projection of X onto the r-dimensional space is given by X.[v1 v2 ... vr]. Here each of v1, v2, ..., vr is a column vector, so the dimension of [v1 v2 ... vr] is nxr. If we now multiply X by this matrix we get an mxr matrix, which is nothing but the projection of all the data points onto r dimensions.
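Here is a short sketch to make the shapes concrete (my own illustration, assuming the data is mean-centered before the SVD; sklearn's PCA centers internally, so the two routes agree only up to the sign of each component):
import numpy as np
from sklearn.decomposition import PCA

m, n, r = 1000, 50, 10
X = np.random.rand(m, n)

# sklearn route: project onto the top r principal components
X_pca = PCA(n_components=r).fit_transform(X)          # shape (1000, 10)

# SVD route on the mean-centered data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)     # Vt has shape (50, 50)
X_proj = Xc @ Vt[:r].T                                # (1000, 50) @ (50, 10) -> (1000, 10)

print(X_pca.shape, X_proj.shape)                      # (1000, 10) (1000, 10)
# the two agree up to a sign flip per component
print(np.allclose(np.abs(X_pca), np.abs(X_proj), atol=1e-6))  # True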
a_test = Phi_train_M.T.dot(Phi_train_M) + lambda_reg * IdendityMatrix
b_test = Phi_train_M.T.dot(data_train[:, 1])
error_reg_test = np.linalg.lstsq(a_test, b_test, rcond=None)[1]
I want to check the regularized sum of squared errors/residuals. a_test has dimensions (16, 16) and b_test has dimensions (16,). The docs say:
residuals{(1,), (K,), (0,)} ndarray
Sums of squared residuals: Squared Euclidean 2-norm for each column in b - a @ x. If the rank of a is < N or M <= N, this is an empty array. If b is 1-dimensional, this is a (1,) shape array. Otherwise the shape is (K,).
As I typed this, I noticed that it doesn't work if 'a' is an N x N matrix, since M <= N means the residuals array comes back empty. Does someone know a workaround, or how to solve this?
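One straightforward workaround (my own sketch, not tested against your actual Phi_train_M and data_train): take the solution vector from lstsq and compute the sum of squared residuals yourself, which works regardless of the shape of a.
import numpy as np

# random stand-ins for the (16, 16) matrix and (16,) vector from the question
a_test = np.random.rand(16, 16)
b_test = np.random.rand(16,)

x, residuals, rank, sv = np.linalg.lstsq(a_test, b_test, rcond=None)
print(residuals)  # empty array, because M <= N here

# compute the sum of squared residuals manually
manual_residual = np.sum((a_test @ x - b_test) ** 2)
print(manual_residual)  # close to zero here, since a square full-rank system is solved exactly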
I have two tensors named x_t and x_k with the following shapes, NxHxW and KxNxHxW respectively, where K is the number of autoencoders used to reconstruct x_t (if you have no idea what this is, assume they're K different nets aiming to predict x_t; this probably has nothing to do with the question anyway), N is the batch size, H the matrix height, and W the matrix width.
I'm trying to apply the Kullback-Leibler divergence algorithm to both tensors (after broadcasting x_t as x_k along the Kth dimension) using PyTorch's nn.functional.kl_div method.
However, it does not seem to be working as I expected. I'm looking to calculate the kl_div between each observation in x_t and x_k, resulting in a tensor of size KxN (i.e., the kl_div of each observation for each of the K autoencoders).
The actual output is a single value if I use the reduction argument, and the same tensor size (i.e., KxNxHxW) if I do not use it.
Has anyone tried something similar?
Reproducible example:
import torch
import torch.nn.functional as F
# K N H W
x_t = torch.randn( 10, 5, 5)
x_k = torch.randn( 3, 10, 5, 5)
x_broadcasted = x_t.expand_as(x_k)
loss = F.kl_div(x_t, x_k, reduction="none") # or "batchmean", or there are many options
It's unclear to me what exactly constitutes a probability distribution in your model. With reduction='none', kl_div, given log(x_n) and y_n, computes kl_div = y_n * (log(y_n) - log(x_n)), which is the "summed" part of the actual Kullback-Leibler divergence. Summation (or, in other words, taking the expectation) is up to you. If your point is that H, W are the two dimensions over which you want to take expectation, it's as simple as
loss = F.kl_div(x_t, x_k, reduction="none").sum(dim=(-1, -2))
Which is of shape [K, N]. If your network output is to be interpreted differently, you need to better specify which are the event dimensions and which are sample dimensions of your distribution.
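For completeness, here is a sketch of one way to get a [K, N] tensor end to end (these are my own assumptions, not something stated in the question: each HxW map is turned into a probability distribution by flattening it and applying a softmax, and the first argument of F.kl_div is given as log-probabilities):
import torch
import torch.nn.functional as F

K, N, H, W = 3, 10, 5, 5
x_t = torch.randn(N, H, W)
x_k = torch.randn(K, N, H, W)

# treat each HxW map as a distribution over H*W bins (an assumption about the model)
log_p = F.log_softmax(x_k.reshape(K, N, -1), dim=-1)            # input: log-probabilities
q = F.softmax(x_t.reshape(N, -1), dim=-1).expand(K, N, H * W)   # target: probabilities

loss = F.kl_div(log_p, q, reduction="none").sum(dim=-1)         # sum over the H*W bins
print(loss.shape)  # torch.Size([3, 10]) -> one KL value per (autoencoder, observation) pair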
I want to whiten the CIFAR10 dataset using ZCA. The input X_train is of shape (40000, 32, 32, 3) where 40000 is the number of images, and 32x32x3 is the size of each image. I'm using the code from this answer for this purpose:
X_flat = np.reshape(X_train, (-1, 32*32*3))
# compute the covariance of the image data
cov = np.cov(X_flat, rowvar=True) # cov is (N, N)
# singular value decomposition
U,S,V = np.linalg.svd(cov) # U is (N, N), S is (N,)
# build the ZCA matrix
epsilon = 1e-5
zca_matrix = np.dot(U, np.dot(np.diag(1.0/np.sqrt(S + epsilon)), U.T))
# transform the image data zca_matrix is (N,N)
zca = np.dot(zca_matrix, X_flat) # zca is (N, 3072)
However, at run time I encountered the following warning:
D:\toolkits.win\anaconda3-5.2.0\envs\dlwin36\lib\site-packages\ipykernel_launcher.py:8: RuntimeWarning: invalid value encountered in sqrt
So after I got the SVD output, I tried:
print(np.min(S)) # prints -1.7798217
Which is unexpected because S can only have positive values. Also, the ZCA whitening result was not correct and it contained nan values.
I tried reproducing this by re-running this same code a second time and this time I did not encounter any warnings or any negative S values, but instead I got:
print(np.min(S)) # prints nan
Any idea for why this might have happened?
Update: I restarted the kernel to free up CPU and RAM resources, and tried running this code again. I again got the same warning about feeding negative values into np.sqrt(). Not sure if it helps, but I've also attached the CPU and RAM utilization figures:
[activity monitor figures]
Here are a couple of ideas. I don't have your dataset so I can't be totally sure that these will fix your problem, but I'm confident enough to post this as an answer instead of a comment.
First. Your X_train is 40'000 by 3072, where each row is a data vector, and each column is a variable or feature. You want the covariance matrix that is 3072 by 3072: pass in rowvar=False to np.cov.
I'm not really sure why the 40'000 by 40'000 covariance matrix's SVD is diverging. Assuming you have enough RAM to store the 12 GB covariance matrix, the one thing I can think of is numerical overflow, because you're perhaps not removing the mean of the data, as is expected by ZCA (and any other whitening technique)?
So second. Remove the mean: X_zeromean = X_flat - np.mean(X_flat, 0).
If you do these, then the final step has to be modified a tiny bit (to make dimensions line up). Here's a quick check using uniform random data:
import numpy as np
X_flat = np.random.rand(40000, 32*32*3)
X_zeromean = X_flat - np.mean(X_flat, 0)
cov = np.cov(X_zeromean, rowvar=False)
U,S,V = np.linalg.svd(cov)
epsilon = 1e-5
zca_matrix = np.dot(U, np.dot(np.diag(1.0/np.sqrt(S + epsilon)), U.T))
zca = np.dot(zca_matrix, X_zeromean.T) # <-- transpose needed here
As a sanity check np.cov(zca) now is very close to the identity matrix, as desired (zca will have flipped dimensions as the input).
(As a sidenote, this is a really expensive and numerically unstable way to whiten the data array: you don't need to compute the covariance and then take the SVD—you're doing twice the work. You can take the skinny SVD of the data matrix itself (np.linalg.svd with the full_matrices=False flag) and compute the whitening matrix directly from there, without ever evaluating the expensive outer product for the covariance matrix.)
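For reference, here is a minimal sketch of that skinny-SVD shortcut (my own illustration on a smaller random array standing in for the real data; the epsilon and the 1/(m-1) normalization are chosen to match the covariance-based version above):
import numpy as np

m, n = 1000, 50                                    # small stand-in for (40000, 3072)
X_flat = np.random.rand(m, n)
X_zeromean = X_flat - X_flat.mean(axis=0)

# skinny SVD of the data itself: U is (m, n), S is (n,), Vt is (n, n)
U, S, Vt = np.linalg.svd(X_zeromean, full_matrices=False)

# the eigenvalues of the covariance matrix are S**2 / (m - 1)
epsilon = 1e-5
scale = 1.0 / np.sqrt(S ** 2 / (m - 1) + epsilon)

# ZCA-whiten each zero-mean row: rotate into the V basis, rescale, rotate back
zca = ((X_zeromean @ Vt.T) * scale) @ Vt           # shape (m, n)

# sanity check: the covariance of the whitened data is close to the identity
print(np.allclose(np.cov(zca, rowvar=False), np.eye(n), atol=1e-3))  # True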