I am trying to write code that computes the covariance matrix of a dataset using for loops instead of NumPy's built-in routines. The code I have so far produces an error:
def cov_naive(X):
    """Compute the covariance for a dataset of size (D,N)
    where D is the dimension and N is the number of data points"""
    D, N = X.shape
    ### Edit the code below to compute the covariance matrix by iterating over the dataset.
    covariance = np.zeros((D, D))
    mean = np.mean(X, axis=1)
    for i in range(D):
        for j in range(D):
            covariance[i, j] += (X[:, i] - mean[i]) @ (X[:, j] - mean[j])
    return covariance / N
I am trying to perform the below test to validate that it works:
# Let's first test the functions on some hand-crafted dataset.
X_test = np.arange(6).reshape(2,3)
expected_test_mean = np.array([1., 4.]).reshape(-1, 1)
expected_test_cov = np.array([[2/3., 2/3.], [2/3.,2/3.]])
print('X:\n', X_test)
print('Expected mean:\n', expected_test_mean)
print('Expected covariance:\n', expected_test_cov)
np.testing.assert_almost_equal(mean(X_test), expected_test_mean)
np.testing.assert_almost_equal(mean_naive(X_test), expected_test_mean)
np.testing.assert_almost_equal(cov(X_test), expected_test_cov)
np.testing.assert_almost_equal(cov_naive(X_test), expected_test_cov)
and get the following error:
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-21-6a6498089109> in <module>()
     12
     13 np.testing.assert_almost_equal(cov(X_test), expected_test_cov)
---> 14 np.testing.assert_almost_equal(cov_naive(X_test), expected_test_cov)

AssertionError:
Arrays are not almost equal to 7 decimals
Any help would be greatly appreciated!
The mistake lies in the indexing, not the mean. Since X has shape (D, N), the line

mean = np.mean(X, axis=1)

correctly gives one mean per dimension (row), but X[:,i] selects the i-th data point (column) rather than the i-th dimension. The loop body should index rows instead:

covariance[i, j] += (X[i, :] - mean[i]) @ (X[j, :] - mean[j])

so that each dimension is centered by its own mean before taking the dot product over the N samples.
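For reference, a minimal corrected version (a sketch following the same structure as the code above) that passes the hand-crafted test from the question:

import numpy as np

def cov_naive(X):
    """Compute the covariance for a dataset of size (D,N)
    where D is the dimension and N is the number of data points"""
    D, N = X.shape
    covariance = np.zeros((D, D))
    mean = np.mean(X, axis=1)  # one mean per dimension (row)
    for i in range(D):
        for j in range(D):
            # dot product over the N samples of two centered dimensions
            covariance[i, j] = (X[i, :] - mean[i]) @ (X[j, :] - mean[j])
    return covariance / N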
Why am I getting a singular matrix error when I run this as a function, when the same code worked correctly piece by piece before, using the same matrices in both cases? I am following along from a video and cannot see what I did differently from what was shown.
# Artificial data for learning purposes, provided by Alex
X = [
    [148, 24, 1385],
    [132, 25, 2031],
    [453, 11, 86],
    [158, 24, 185],
    [172, 25, 201],
    [413, 11, 86],
    [38, 54, 185],
    [142, 25, 431],
    [453, 31, 86]
]
X = np.array(X)
# Add a column of ones for the bias (base value) term before the feature columns.
ones = np.ones(X.shape[0])
X = np.column_stack([ones, X])
# y values provided by Alex
y = [10000,20000,15000,20050,10000,20000,15000,25000,12000]
XTX = X.T.dot(X)
XTX_inv = np.linalg.inv(XTX)
w_full = XTX_inv.dot(X.T).dot(y)
# bias value
w0 = w_full[0]
# features
w = w_full[1:]
print(w0, w)
#Output: 25844.754055766753, array([ -16.08906468, -199.47254894, -1.22802883])
At this point the code runs as expected. However, this function gives an error:
# Error in function
def train_linear_regression(X, y):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)
    return w[0], w[1:]
w0, w = train_linear_regression(X, y)
print(w0, w)
---------------------------------------------------------------------------
LinAlgError Traceback (most recent call last)
<ipython-input-48-ed5d7ddc40b1> in <module>
----> 1 train_linear_regression(X,y)
2 frames
<__array_function__ internals> in inv(*args, **kwargs)
/usr/local/lib/python3.7/dist-packages/numpy/linalg/linalg.py in _raise_linalgerror_singular(err, flag)
86
87 def _raise_linalgerror_singular(err, flag):
---> 88 raise LinAlgError("Singular matrix")
89
90 def _raise_linalgerror_nonposdef(err, flag):
LinAlgError: Singular matrix
It looks like you are adding/stacking the bias column (ones) onto X twice: first in the normal flow, and a second time inside the function. The duplicated column of ones makes the design matrix rank-deficient, so XTX has determinant 0 and np.linalg.inv fails.
So you need to remove the addition of ones from one of the two places.
def train_linear_regression(X, y):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])  # stacking ones here again - REMOVE
    ....
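A minimal sketch of one fix: keep the stacking inside the function and call it with the raw feature matrix (here X_raw stands for the original 9x3 array from the question, before any ones were stacked onto it):

import numpy as np

def train_linear_regression(X, y):
    # The bias column is added here, and only here
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)
    return w[0], w[1:]

w0, w = train_linear_regression(X_raw, y)  # X_raw: features without the ones column
print(w0, w)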
I keep getting the error IndexError: tuple index out of range and I am not sure what is happening. My code was working just fine; however, after I restarted the Jupyter notebook I started receiving this error.
This is my code:
X = df.Tweet
y = df.target
from sklearn import linear_model
import pyswarms as ps
# Create an instance of the classifier
classifier = linear_model.LogisticRegression()
# Define objective function
def f_per_particle(m, alpha):
    total_features = X.shape[1]
    # Get the subset of the features from the binary mask
    if np.count_nonzero(m) == 0:
        X_subset = X
    else:
        X_subset = X[:, m == 1]
    # Perform classification and store performance in P
    classifier.fit(X_subset, y)
    P = (classifier.predict(X_subset) == y).mean()
    # Compute the objective function
    j = (alpha * (1.0 - P)
         + (1.0 - alpha) * (1 - (X_subset.shape[1] / total_features)))
    return j
[some more code]
options = {'c1': 0.5, 'c2': 0.5, 'w':0.9, 'k': 30, 'p':2}
# Call instance of PSO
dimensions = X.shape[1] # dimensions should be the number of features
optimizer = ps.discrete.BinaryPSO(n_particles=30, dimensions=dimensions, options=options)
# Perform optimization
cost, pos = optimizer.optimize(f, iters=1000)
I received the following traceback:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-76-bea8cf064cd2> in <module>
2
3 # Call instance of PSO
----> 4 dimensions = X.shape[1] # dimensions should be the number of features
5 optimizer = ps.discrete.BinaryPSO(n_particles=30, dimensions=dimensions, options=options)
6
IndexError: tuple index out of range
It is not absolutely clear, but it seems that your df variable is a Pandas DataFrame and df.Tweet is a Pandas Series.
In that case, being a Series, your X has only one dimension, so X.shape is a one-element tuple (only X.shape[0] exists), and X.shape[1] raises the index-out-of-range exception in your code. You only get two dimensions when the variable is a DataFrame.
More information: https://www.google.com/amp/s/www.geeksforgeeks.org/python-pandas-series-shape/amp/
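A toy example (hypothetical data) showing the difference:

import pandas as pd

df = pd.DataFrame({'Tweet': ['a', 'b', 'c'], 'target': [0, 1, 0]})
print(df.Tweet.shape)       # (3,)   -- a Series is one-dimensional; shape[1] fails
print(df[['Tweet']].shape)  # (3, 1) -- a DataFrame has two dimensions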
It is known that when the number of variables (p) is larger than the number of samples (n), the least squares estimator is not defined, because X^T X is singular. Yet in sklearn I receive these values:
In [30]: lm = LinearRegression().fit(xx,y_train)
In [31]: lm.coef_
Out[31]:
array([[ 0.20092363, -0.14378298, -0.33504391, ..., -0.40695124,
0.08619906, -0.08108713]])
In [32]: xx.shape
Out[32]: (1097, 3419)
I would expect the call in [30] to return an error. How does sklearn work when p > n, as in this case?
EDIT:
It seems that the right-hand-side matrix b is padded with zeros so it can hold the longer solution vector (from scipy's lstsq source):
if n > m:
    # need to extend b matrix as it will be filled with
    # a larger solution matrix
    if len(b1.shape) == 2:
        b2 = np.zeros((n, nrhs), dtype=gelss.dtype)
        b2[:m, :] = b1
    else:
        b2 = np.zeros(n, dtype=gelss.dtype)
        b2[:m] = b1
    b1 = b2
When the linear system is underdetermined, sklearn.linear_model.LinearRegression finds the minimum-L2-norm solution, i.e.

argmin_w ||w||_2  subject to  Xw = y
This is always well defined and obtainable by applying the pseudoinverse of X to y, i.e.
w = np.linalg.pinv(X).dot(y)
The specific implementation of scipy.linalg.lstsq, which is used by LinearRegression uses get_lapack_funcs(('gelss',), ... which is precisely a solver that finds the minimum norm solution via singular value decomposition (provided by LAPACK).
Check out this example
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(5, 10)
y = rng.randn(5)
from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=False)
coef1 = lr.fit(X, y).coef_
coef2 = np.linalg.pinv(X).dot(y)
print(coef1)
print(coef2)
And you will see that coef1 == coef2. (Note that fit_intercept=False is specified in the constructor of the sklearn estimator, because otherwise it would subtract the mean of each feature before fitting the model, yielding different coefficients)
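To see the minimum-norm property in action, here is a short sketch building on the example above: adding any null-space component of X to coef2 yields another exact solution, but always one with a larger norm.

from scipy.linalg import null_space

N = null_space(X)                      # orthonormal basis of the null space of X
w_other = coef2 + N.dot(rng.randn(N.shape[1]))        # another solution of Xw = y
print(np.allclose(X.dot(w_other), y))                 # True: still an exact solution
print(np.linalg.norm(w_other) > np.linalg.norm(coef2))  # True: but a larger L2 norm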
I would like to fit my surface equation to some data. I already tried scipy.optimize.leastsq, but since I cannot specify bounds it gives me unusable results. I also tried scipy.optimize.least_squares, but it gives me an error:
ValueError: too many values to unpack
My equation is:
f(x,y,z)=(x-A+y-B)/2+sqrt(((x-A-y+B)/2)^2+C*z^2)
Parameters A, B, and C should be found so that the expression above is as close as possible to zero when the following (x, y, z) points are plugged in:
[
    [-0.071, -0.85, 0.401],
    [-0.138, -1.111, 0.494],
    [-0.317, -0.317, -0.317],
    [-0.351, -2.048, 0.848]
]
The bounds would be A > 0, B > 0, C > 1
How should I obtain such a fit? What is the best tool in Python to do that? I searched for examples of fitting 3D surfaces, but most function-fitting examples deal with lines or flat surfaces.
I've edited this answer to provide a more general example of how this problem can be solved with scipy's general optimize.minimize method as well as scipy's optimize.least_squares method.
First lets set up the problem:
import numpy as np
import scipy.optimize
# ===============================================
# SETUP: define common components of the problem
def our_function(coeff, data):
    """
    The function we care to optimize.

    Args:
        coeff (np.ndarray): the parameters that we care to optimize.
        data (np.ndarray): the input data
    """
    A, B, C = coeff
    x, y, z = data.T
    return (x - A + y - B) / 2 + np.sqrt(((x - A - y + B) / 2) ** 2 + C * z ** 2)
# Define some training data
data = np.array([
[-0.071, -0.85, 0.401],
[-0.138, -1.111, 0.494],
[-0.317, -0.317, -0.317],
[-0.351, -2.048, 0.848]
])
# Define training target
# This is what we want the target function to be equal to
target = 0
# Make an initial guess as to the parameters
# either a constant or random guess is typically fine
num_coeff = 3
coeff_0 = np.ones(num_coeff)
# coeff_0 = np.random.rand(num_coeff)
This isn't strictly least squares, but how about something like this?
This solution is like throwing a sledgehammer at the problem. There probably is a way to use least squares to get a solution more efficiently using an SVD solver, but if you're just looking for an answer, scipy.optimize.minimize will find you one.
# ===============================================
# FORMULATION #1: a general minimization problem
# Here the bounds and error are all specified within the general objective function
def general_objective(coeff, data, target):
    """
    General function that simply returns a scalar value to be minimized.
    The coeff vector will be adjusted to minimize whatever this function returns.
    """
    # Constraint: keep all coefficients non-negative
    if np.any(coeff < 0):
        # If any constraint is violated return infinity
        return np.inf
    # The function we care about
    prediction = our_function(coeff, data)
    # (optional) L2 regularization to keep coeff small
    # (optional) reg_amount = 0.0
    # (optional) reg = reg_amount * np.sqrt((coeff ** 2).sum())
    losses = (prediction - target) ** 2
    # (optional) losses += reg
    # Return the sum of squared errors
    loss = losses.sum()
    return loss
general_result = scipy.optimize.minimize(general_objective, coeff_0,
                                         method='Nelder-Mead',
                                         args=(data, target))
# Test what the squared error of the returned result is
coeff = general_result.x
general_output = our_function(coeff, data)
print('====================')
print('general_result =\n%s' % (general_result,))
print('---------------------')
print('general_output = %r' % (general_output,))
print('====================')
The output looks like this:
====================
general_result =
final_simplex: (array([[ 2.45700466e-01, 7.93719271e-09, 1.71257109e+00],
[ 2.45692680e-01, 3.31991619e-08, 1.71255150e+00],
[ 2.45726858e-01, 6.52636219e-08, 1.71263360e+00],
[ 2.45713989e-01, 8.06971686e-08, 1.71260234e+00]]), array([ 0.00012404, 0.00012404, 0.00012404, 0.00012404]))
fun: 0.00012404137498459109
message: 'Optimization terminated successfully.'
nfev: 431
nit: 240
status: 0
success: True
x: array([ 2.45700466e-01, 7.93719271e-09, 1.71257109e+00])
---------------------
general_output = array([ 0.00527974, -0.00561568, -0.00719941, 0.00357748])
====================
I found in the documentation that all you need to do to adapt this to actual least squares is to specify the function that computes the residuals.
# ===============================================
# FORMULATION #2: a special least squares problem
# Here all that is needed is a function that computes the vector of residuals
# the optimization function takes care of the rest
def least_squares_residuals(coeff, data, target):
    """
    Return the vector of residuals between the predicted values
    and the target value. Here we want each predicted value to be close to zero.
    """
    prediction = our_function(coeff, data)
    vector_of_residuals = prediction - target
    return vector_of_residuals
# Here the bounds are specified in the optimization call
bound_gt = np.full(shape=num_coeff, fill_value=0.0, dtype=np.float64)
bound_lt = np.full(shape=num_coeff, fill_value=np.inf, dtype=np.float64)
bounds = (bound_gt, bound_lt)
lst_sqrs_result = scipy.optimize.least_squares(least_squares_residuals, coeff_0,
                                               args=(data, target), bounds=bounds)
# Test what the squared error of the returned result is
coeff = lst_sqrs_result.x
lst_sqrs_output = our_function(coeff, data)
print('====================')
print('lst_sqrs_result =\n%s' % (lst_sqrs_result,))
print('---------------------')
print('lst_sqrs_output = %r' % (lst_sqrs_output,))
print('====================')
The output here is:
====================
lst_sqrs_result =
active_mask: array([ 0, -1, 0])
cost: 6.197329866927735e-05
fun: array([ 0.00518416, -0.00564099, -0.00710112, 0.00385024])
grad: array([ -4.61826888e-09, 3.70771396e-03, 1.26659198e-09])
jac: array([[-0.72611025, -0.27388975, 0.13653112],
[-0.74479565, -0.25520435, 0.1644325 ],
[-0.35777232, -0.64222767, 0.11601263],
[-0.77338046, -0.22661953, 0.27104366]])
message: '`gtol` termination condition is satisfied.'
nfev: 13
njev: 13
optimality: 4.6182688779976278e-09
status: 1
success: True
x: array([ 2.46392438e-01, 5.39025298e-17, 1.71555150e+00])
---------------------
lst_sqrs_output = array([ 0.00518416, -0.00564099, -0.00710112, 0.00385024])
====================
I'm currently trying to take the dot product of a row from a diagonal matrix formed by the scipy.sparse.diags function in Poisson_Stiffness below, but I am having trouble selecting the row and am getting the following error:
TypeError: 'dia_matrix' object has no attribute '__getitem__'
The following is my code:
def Poisson_Stiffness(x0):
    """Finds the Poisson equation stiffness matrix with any non uniform mesh x0"""
    x0 = np.array(x0)
    N = len(x0) - 1  # The number of elements; x0, x1, ..., xN
    h = x0[1:] - x0[:-1]

    a = np.zeros(N+1)
    a[0] = 1  # BOUNDARY CONDITIONS
    a[1:-1] = 1/h[1:] + 1/h[:-1]
    a[-1] = 1/h[-1]
    a[N] = 1  # BOUNDARY CONDITIONS

    b = -1/h
    b[0] = 0  # BOUNDARY CONDITIONS

    c = -1/h
    c[N-1] = 0  # BOUNDARY CONDITIONS: DIRICHLET

    data = [a.tolist(), b.tolist(), c.tolist()]
    Positions = [0, 1, -1]
    Stiffness_Matrix = diags(data, Positions, (N+1, N+1))
    return Stiffness_Matrix
def Error_Indicators(Uh, U_mesh, Z, Z_mesh, f):
    """Take in U, interpolate to the same mesh as Z, then solve for eta"""
    u_inter = interp1d(U_mesh, Uh)  # Interpolation of the old mesh
    U2 = u_inter(Z_mesh)  # u evaluated on the new mesh
    Bz = Poisson_Stiffness(Z_mesh)
    eta = np.empty(len(Z_mesh))
    for i in range(len(Z_mesh)):
        eta[i] = (f[i] - np.dot(Bz[i, :], U2[:])) * Z[i]
    return eta
The error is specifically coming from the following line of code:
eta[i] = (f[i] - np.dot(Bz[i, :], U2[:])) * Z[i]
It is the Bz matrix that is causing the error: it is created using scipy.sparse.diags, which will not let me index a row.
Bz = Poisson_Stiffness(np.linspace(0,1,40))
print Bz[0,0]
This code will also produce the same error.
Any help is very much appreciated. Thanks!
sparse.dia_matrix apparently does not support indexing. The way around that is to convert it to another format; for calculation purposes tocsr() would be appropriate.
The documentation for dia_matrix is limited, though the code is plain Python and easy to inspect. I'd have to double-check, but I think it is a matrix construction tool rather than a fully developed working format.
In [97]: M=sparse.dia_matrix(np.ones((3,4)),[0,-1,2])
In [98]: M[1,:]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-98-a74bcb661a8d> in <module>()
----> 1 M[1,:]
TypeError: 'dia_matrix' object is not subscriptable
In [99]: M.tocsr()[1,:]
Out[99]:
<1x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
A dia_matrix stores its information in .data and .offsets attributes, which are simple modifications of the input parameters. They are not amenable to 2-D indexing.
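Applied to the loop in the question, a minimal sketch of the fix (assuming U2, f, and Z are as defined there):

Bz = Poisson_Stiffness(Z_mesh).tocsr()  # convert once; CSR rows are indexable
eta = np.empty(len(Z_mesh))
for i in range(len(Z_mesh)):
    # row i of the sparse matrix times the dense vector U2 gives a length-1 array
    eta[i] = (f[i] - Bz[i, :].dot(U2)[0]) * Z[i]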