Gradient descent from scratch in Python not working

I am trying to implement a gradient descent algorithm from scratch in Python, which should be fairly easy. However, I have been scratching my head for quite a while with my code now, unable to make it work.
I generate data as follows:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
#Defining the x array.
x=np.array(range(1,100))
#Defining the y array.
y=10+2*x.ravel()
y=y+np.random.normal(loc=0, scale=70, size=99)
Then define the parameters:
alpha = 0.01 # Which will be the learning rate
NbrIter = 100 # Representing the number of iterations
m = len(y)
theta = np.random.randn(2,1)
and my GD is as follows:
for iter in range(NbrIter):
    theta = theta - (1/m) * alpha * ( X.T @ ((X @ theta) - y) )
What I get is a huge matrix, meaning that I have some problem with the linear algebra. However, I really fail to see where the issue is.
(Playing around with the matrices to try to get them to match, I reached a theta of the correct shape (2x1) with:
theta = theta - (1/m) * alpha * ( X.T @ ((X @ theta).T - y).T )
But it does look wrong, and the actual values are way off (array([[-8.92647663e+148],
[-5.92079000e+150]])).)

I guess you were hit by broadcasting. Variable y's shape is (99,). When y is subtracted from the result of X @ theta, which is a column vector, y is broadcast to a row vector of shape (1,99), so the subtraction yields a (99,99) matrix. To fix this, reshape y as a column vector with y.reshape(-1,1).
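A quick shape check (a small sketch of my own, mimicking the 99-point data from the question) makes the stray broadcast visible:
import numpy as np
y = np.arange(99.)                      # shape (99,), like y in the question
col = np.ones((99, 1))                  # column vector, like X @ theta
print((col - y).shape)                  # (99, 99) -- y was broadcast to a row
print((col - y.reshape(-1, 1)).shape)   # (99, 1)  -- the intended column result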
Now, a few optimizations:
X.T @ ((X @ theta) - y[:,None])
can be rewritten as:
(X.T @ X) @ theta - (X.T @ y[:,None])
The most costly computation can be taken out of the loop:
XtX = X.T @ X
Xty = X.T @ y[:,None]
for iter in range(NbrIter):
    theta = theta - (1/m) * alpha * (XtX @ theta - Xty)
Now you operate on a 2x2 matrix rather than a 99x2 one.
Let's take a look at convergence.
Assuming that X is constructed like X = np.column_stack((x, np.ones_like(x))), it is possible to check the matrix condition number:
np.linalg.cond(XtX)
Which produced:
13475.851490419038
It means that the ratio between the largest and smallest eigenvalues is about 13k. Therefore, using an alpha larger than 1/13k will likely result in bad convergence.
If you use alpha=1e-5 the algorithm will converge.
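Putting the pieces together, here is a minimal runnable sketch of the fixed loop (my own assembly of the steps above; alpha is chosen per the condition-number argument):
import numpy as np

x = np.arange(1, 100)
y = 10 + 2 * x + np.random.normal(loc=0, scale=70, size=99)
X = np.column_stack((x, np.ones_like(x)))  # design matrix: [x, 1]
m = len(y)
theta = np.random.randn(2, 1)

XtX = X.T @ X
Xty = X.T @ y[:, None]
alpha = 1e-5  # safely below 1/cond(XtX) ~ 1/13k
for _ in range(10000):
    theta = theta - (1 / m) * alpha * (XtX @ theta - Xty)
# theta no longer blows up; note that the poorly conditioned (intercept)
# direction converges slowly, so expect many iterations before theta[1]
# settles near 10.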
Good luck!

Related

Gaussian RBF Visualization in Python

For some days now I have been trying to visualize the so-called kernel trick resulting from an RBF kernel transformation in an SVC model. Basically, I am trying to map a 2D space to a 3D space in order to let the viewer see how the kernel trick adds a dimension that linearly separates the space between two classes.
Following sklearn examples, I managed to plot a 2D example of the trick. However, I feel it is not enough to really grasp what is happening behind the scenes.
Below is what I managed to plot:
I would like to plot the same data on a three dimensional space, representing also the plane that splits the space between the two classes.
I am not asking for the actual code here. Rather, I would like to understand what goes on the axis of the third dimension. I think that axis should be equal to exp(-gamma||x-y||^2). However, due to my poor vector algebra skills, I do not know how to compute it.
Any help would be much appreciated.
Cheers!
UPDATE
The following allowed me to build a new matrix for a 3D plot:
def feature_map_2(X):
    return np.asarray((X[:,0], X[:,1], np.exp(-gam * (X[:,0]**2 + X[:,1]**2 - 2*X[:,0]*X[:,1])))).T
Z = feature_map_2(X)
Where gam = 1/n_features
Then, I computed the boundary as follows:
#SVM
from sklearn import svm
clf = svm.NuSVC(kernel='linear', nu=0.5)
clf.fit(Z, y)
w = clf.coef_.flatten()
b = clf.intercept_.flatten()
# create x,y
xx, yy = np.meshgrid(np.linspace(-6,6), np.linspace(-2,2))
# calculate corresponding z
boundary = lambda xx, yy: (-w[0] * xx - w[1] * yy - b) * 1. /w[2]
However, results differ from what one might have expected looking at the 2D plot.
Do you mean an SVM model, like this demo? https://jgreitemann.github.io/svm-demo
Visualizing in 3D may be difficult because you will need to project onto the screen, reducing it again to a 2D image.
To find the plane in the 3D space, you simply apply your kernel to make your classes linearly separable and then apply a linear SVM.
The equation w'x - b = 0, expressed in terms of scalars as w[0] * x[0] + w[1] * x[1] + w[2] * x[2] - b = 0, can be made parametric by solving for an element of x with a non-zero coefficient. For instance, if w[2] != 0 you can write the plane as:
(U, V, (b - w[0] * U - w[1] * V) / w[2])
And this may be used in common surface plot functions; for instance, in Python it would be like this:
U, V = np.meshgrid(np.linspace(-1, 1, 100), np.linspace(-1, 1, 100))
plt.pcolormesh(U, V, (b - w[0] * U - w[1] * V) / w[2])
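For an actual 3D rendering, a sketch along these lines should work (my own illustration, assuming Z, y, w and b from the update above):
import numpy as np
import matplotlib.pyplot as plt

U, V = np.meshgrid(np.linspace(-1, 1, 100), np.linspace(-1, 1, 100))
W = (b - w[0] * U - w[1] * V) / w[2]  # third coordinate of the plane w'x - b = 0

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.plot_surface(U, V, W, alpha=0.3)          # separating plane
ax.scatter(Z[:, 0], Z[:, 1], Z[:, 2], c=y)   # mapped data points
plt.show()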

Numpy polyfit: possible error in the scaling of the covariance matrix?

I am having a hard time figuring out the scaling for the covariance matrix in numpy polyfit.
In the documentation I read that the scaling factor to go from an unscaled to a scaled covariance matrix is
chi2 / sqrt(N - DOF).
In the code attached below, it seems that the scaling factor actually is
chi2 / DOF
Here is my code
# Generate synthetically the data
# True parameters
import numpy as np
true_slope = 3
true_intercept = 7
x_data = np.linspace(-5, 5, 30)
# The y-data will have a noise term, to simulate imperfect observations
sigma = 1
y_data = true_slope * np.linspace(-5, 5, 30) + true_intercept
y_obs = y_data + np.random.normal(loc=0.0, scale=sigma, size=x_data.size)
# Here I generate artificially some unequal uncertainties
# (even if there is no reason for them to be so)
y_uncertainties = sigma * np.random.normal(loc=1.0, scale=0.5*sigma, size=x_data.size)
# Make the fit
popt, pcov = np.polyfit(x_data, y_obs, 1, w=1/y_uncertainties, cov='unscaled')
popt, pcov_scaled = np.polyfit(x_data, y_obs, 1, w=1/y_uncertainties, cov=True)
my_scale_factor = np.sum((y_obs - popt[0] * x_data - popt[1])**2 / y_uncertainties**2) \
                  / (len(y_obs) - 2)
scale_factor = pcov_scaled[0,0] / pcov[0,0]
If I run the code, I see that the actual scale factor is chi2 / DOF and not the value reported in the documentation. Is this true or am I missing something?
I have a further question. Why is it suggested to use the inverse of the y-data errors, rather than the square of the inverse, for the weights when the uncertainties are normally distributed?
Edit to add the data generated by a run of the code
x_data = array([-5. , -4.65517241, -4.31034483, -3.96551724, -3.62068966,
-3.27586207, -2.93103448, -2.5862069 , -2.24137931, -1.89655172,
-1.55172414, -1.20689655, -0.86206897, -0.51724138, -0.17241379,
0.17241379, 0.51724138, 0.86206897, 1.20689655, 1.55172414,
1.89655172, 2.24137931, 2.5862069 , 2.93103448, 3.27586207,
3.62068966, 3.96551724, 4.31034483, 4.65517241, 5. ])
y_obs = array([-7.27819725, -8.41939411, -3.9089926 , -5.24622589, -3.78747379,
-1.92898727, -1.375255 , -1.84388812, -0.37092441, 0.27572306,
2.57470918, 3.860485 , 4.62580789, 5.34147103, 6.68231985,
7.38242258, 8.28346559, 9.46008873, 10.69300274, 12.46051285,
13.35049975, 13.28279961, 14.31604781, 16.8226239 , 16.81708308,
18.64342284, 19.37375515, 19.6714002 , 20.13700708, 22.72327533])
y_uncertainties = array([ 0.63543112, 1.07608924, 0.83603265, -0.03442888, -0.07049299,
1.30864191, 1.36015322, 1.42125414, 1.04099854, 1.20556608,
0.43749964, 1.635056 , 1.00627014, 0.40512511, 1.19638787,
1.26230966, 0.68253139, 0.98055035, 1.01512232, 1.83910276,
0.96763007, 0.57373151, 1.69358475, 0.62068133, 0.70030971,
0.34648312, 1.85234844, 1.18687269, 1.23841579, 1.19741206])
With this data I obtain that scale_factor = 1.6534129347542432, my_scale_factor = 1.653412934754234 and that the "nominal" scale factor reported in the documentation, i.e.
nominal_scale_factor = np.sum((y_obs - popt[0] * x_data - popt[1])**2 /
                              y_uncertainties**2) / np.sqrt(len(y_obs) - len(y_obs) + 2)
has value nominal_scale_factor = 32.73590595145554
PS. My numpy version is 1.18.5, on Python 3.7.7 (default, May 6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)].
Regarding the numpy.polyfit documentation:
By default, the covariance are scaled by chi2/sqrt(N-dof), i.e., the weights are presumed to be unreliable except in a relative sense and everything is scaled such that the reduced chi2 is unity.
This looks like a documentation bug. The correct scaling factor for the covariance is chi_square/(N-M) where M is the number of fit parameters and N-M is the number of degrees of freedom. It looks like np.polyfit is implemented correctly, because my_scale_factor and scale_factor are consistent.
Regarding the question on why not "the square of the inverse of the y-data errors": a polynomial fit, or more generally a least-squares fit, involves solving for the p vector in
A @ p = y
where A is an (N, M) matrix for N data points in y and M elements in p and each column in A is the polynomial term evaluated at the corresponding x values.
The solution minimizes
SUM_i [ (SUM_j A[i, j] p[j] - y[i])^2 / sigma_y[i]^2 ]
Computationally, the cheapest way to calculate this is by multiplying each row in A and each y value by the corresponding 1/sigma_y and then taking a standard least-squares solution of the A @ p = y equation. By having the user supply the inverse errors, it saves the fit routine from handling division-by-zero issues and slow square-root operations.
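To make that concrete, here is a hedged sketch (not numpy's actual source) of a weighted degree-1 fit done exactly that way, which should reproduce popt from the question:
import numpy as np

w = 1 / y_uncertainties          # the weights np.polyfit expects
A = np.vander(x_data, 2)         # columns: x, 1 (highest degree first)
p, *_ = np.linalg.lstsq(A * w[:, None], y_obs * w, rcond=None)
# p should match popt from np.polyfit(x_data, y_obs, 1, w=w)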
Regarding the first part, I opened a Github issue
https://github.com/numpy/numpy/issues/16842
The conclusion on that thread is that the documentation is wrong, but the function behaves correctly.
The documentation should be updated to
By default, the covariance is scaled by chi2/dof, i.e., the weights are presumed to be unreliable except in a relative sense and everything is scaled such that the reduced chi2 is unity.

Calculate correlation in xarray with missing data

I am trying to calculate a correlation between two datasets in xarray along the time dimension. My datasets are both lat x lon x time. One of my datasets has enough data missing that it isn't reasonable to interpolate and eliminate gaps; instead, I would like to just ignore missing values. I have some simple bits of code that are working somewhat, but none that fits my exact use case. For example:
def covariance(x, y, dims=None):
    return xr.dot(x - x.mean(dims), y - y.mean(dims), dims=dims) / x.count(dims)

def correlation(x, y, dims=None):
    return covariance(x, y, dims) / (x.std(dims) * y.std(dims))
This works well if no data is missing, but of course can't work with NaNs. While there is a good example written for xarray here, even with this code I am struggling to calculate the Pearson correlation, not the Spearman:
import numpy as np
import xarray as xr
import bottleneck
def covariance_gufunc(x, y):
    return ((x - x.mean(axis=-1, keepdims=True))
            * (y - y.mean(axis=-1, keepdims=True))).mean(axis=-1)

def pearson_correlation_gufunc(x, y):
    return covariance_gufunc(x, y) / (x.std(axis=-1) * y.std(axis=-1))

def spearman_correlation_gufunc(x, y):
    x_ranks = bottleneck.rankdata(x, axis=-1)
    y_ranks = bottleneck.rankdata(y, axis=-1)
    return pearson_correlation_gufunc(x_ranks, y_ranks)

def spearman_correlation(x, y, dim):
    return xr.apply_ufunc(
        spearman_correlation_gufunc, x, y,
        input_core_dims=[[dim], [dim]],
        dask='parallelized',
        output_dtypes=[float])
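For what it's worth, the Pearson version can be wrapped exactly the same way; a minimal sketch mirroring spearman_correlation above:
def pearson_correlation(x, y, dim):
    return xr.apply_ufunc(
        pearson_correlation_gufunc, x, y,
        input_core_dims=[[dim], [dim]],
        dask='parallelized',
        output_dtypes=[float])
Note that, like the gufuncs above, this still assumes gap-free data along dim.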
Finally, there was a useful discussion on GitHub about adding this as a feature to xarray, but it has yet to be implemented. Is there an efficient way to do this on datasets with data gaps?
I've been following this GitHub discussion and the subsequent attempts to implement a .corr() method; it seems like we're pretty close, but it's still not there yet.
In the meantime, the basic code which most are attempting to merge is outlined pretty well in this other answer (How to apply linear regression to every pixel in a large multi-dimensional array containing NaNs?). It's a good solution which leverages vectorized operations in NumPy and with some small tweaking (see accepted answer in the link) can be made to account for NaNs along the time axis.
def lag_linregress_3D(x, y, lagx=0, lagy=0):
    """
    Input: Two xr.DataArrays of any dimensions with the first dim being time.
    Thus the input data could be a 1D time series, or for example, have three
    dimensions (time, lat, lon).
    Datasets can be provided in any order, but note that the regression slope
    and intercept will be calculated for y with respect to x.
    Output: Covariance, correlation, regression slope and intercept, p-value,
    and standard error on regression between the two datasets along their
    aligned time dimension.
    Lag values can be assigned to either of the data, with lagx shifting x, and
    lagy shifting y, with the specified lag amount.
    """
    # 1. Ensure that the data are properly aligned to each other.
    x, y = xr.align(x, y)

    # 2. Add lag information if any, and shift the data accordingly
    if lagx != 0:
        # If x lags y by 1, x must be shifted 1 step backwards.
        # But as the 'zero-th' value is nonexistent, xr assigns it as invalid
        # (nan). Hence it needs to be dropped.
        x = x.shift(time=-lagx).dropna(dim='time')
        # Next important step is to re-align the two datasets so that y adjusts
        # to the changed coordinates of x.
        x, y = xr.align(x, y)
    if lagy != 0:
        y = y.shift(time=-lagy).dropna(dim='time')
        x, y = xr.align(x, y)

    # 3. Compute data length, mean and standard deviation along time axis:
    n = y.notnull().sum(dim='time')
    xmean = x.mean(axis=0)
    ymean = y.mean(axis=0)
    xstd = x.std(axis=0)
    ystd = y.std(axis=0)

    # 4. Compute covariance along time axis
    cov = np.sum((x - xmean) * (y - ymean), axis=0) / n

    # 5. Compute correlation along time axis
    cor = cov / (xstd * ystd)

    # 6. Compute regression slope and intercept:
    slope = cov / (xstd**2)
    intercept = ymean - xmean * slope

    # 7. Compute P-value and standard error
    # Compute t-statistics
    tstats = cor * np.sqrt(n - 2) / np.sqrt(1 - cor**2)
    stderr = slope / tstats

    from scipy.stats import t
    pval = t.sf(tstats, n - 2) * 2
    pval = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)

    return cov, cor, slope, intercept, pval, stderr
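Usage is then a one-liner; a hypothetical example (sst and precip are illustrative names for (time, lat, lon) DataArrays sharing a time coordinate):
cov, cor, slope, intercept, pval, stderr = lag_linregress_3D(sst, precip)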
Hope this helps! Fingers crossed the merge comes soon for this.
The solution is in the GitHub thread https://github.com/pydata/xarray/issues/1115:
def covariance(x, y, dim=None):
    valid_values = x.notnull() & y.notnull()
    valid_count = valid_values.sum(dim)
    demeaned_x = (x - x.mean(dim)).fillna(0)
    demeaned_y = (y - y.mean(dim)).fillna(0)
    return xr.dot(demeaned_x, demeaned_y, dims=dim) / valid_count

def correlation(x, y, dim=None):
    # dim should default to the intersection of x.dims and y.dims
    return covariance(x, y, dim) / (x.std(dim) * y.std(dim))
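A quick usage sketch (toy data of my own) showing that gaps are simply ignored:
import numpy as np
import xarray as xr

a = xr.DataArray(np.random.randn(10, 4, 5), dims=('time', 'lat', 'lon'))
b = a + 0.5 * xr.DataArray(np.random.randn(10, 4, 5), dims=('time', 'lat', 'lon'))
b[0, 0, 0] = np.nan                 # introduce a gap
r = correlation(a, b, dim='time')   # (lat, lon) map of Pearson r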

Python Earth Mover Distance of 2D arrays

I would like to compute the Earth Mover Distance between two 2D arrays (these are not images).
Right now I go through two libraries: scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html) and pyemd (https://pypi.org/project/pyemd/).
#define a sampeling method
def sampeling2D(n, mu1, std1, mu2, std2):
    #sample from N(0, 1) in the 2D hyperspace
    x = np.random.randn(n, 2)
    #scale N(0, 1) -> N(mu, std)
    x[:,0] = (x[:,0]*std1) + mu1
    x[:,1] = (x[:,1]*std2) + mu2
    return x
#generate two sets
Y1 = sampeling2D(1000, 0, 1, 0, 1)
Y2 = sampeling2D(1000, -1, 1, -1, 1)
#compute the distance
distance = pyemd.emd_samples(Y1, Y2)
The scipy version doesn't accept 2D arrays and returns an error, while the pyemd method returns a value. But the documentation says pyemd accepts only 1D arrays, so I think the output is wrong. How can I calculate this distance in this case?
So if I understand you correctly, you're trying to transport the sampling distribution, i.e. calculate the distance for a setup where all clusters have weight 1. In general, you can treat the calculation of the EMD as an instance of minimum cost flow, and in your case, this boils down to the linear assignment problem: Your two arrays are the partitions in a bipartite graph, and the weights between two vertices are your distance of choice. Assuming that you want to use the Euclidean norm as your metric, the weights of the edges, i.e. the ground distances, may be obtained using scipy.spatial.distance.cdist, and in fact SciPy provides a solver for the linear sum assignment problem as well in scipy.optimize.linear_sum_assignment (which recently saw huge performance improvements which are available in SciPy 1.4. This could be of interest to you, should you run into performance problems; the 1.3 implementation is a bit slow for 1000x1000 inputs).
In other words, what you want to do boils down to
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

n = len(Y1)  # 1000 points in each set
d = cdist(Y1, Y2)
assignment = linear_sum_assignment(d)
print(d[assignment].sum() / n)
It is also possible to use scipy.sparse.csgraph.min_weight_full_bipartite_matching as a drop-in replacement for linear_sum_assignment; while made for sparse inputs (which yours certainly isn't), it might provide performance improvements in some situations.
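Since that function expects a sparse biadjacency matrix (and treats explicit zeros as missing edges), a drop-in use on your dense cost matrix would look roughly like this (my sketch; requires a recent SciPy, 1.6+ if I recall correctly):
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import min_weight_full_bipartite_matching

# Note: any exact zeros in d would be dropped by csr_matrix and read as
# missing edges, so this assumes strictly positive distances.
row_ind, col_ind = min_weight_full_bipartite_matching(csr_matrix(d))
print(d[row_ind, col_ind].sum() / n)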
It might be instructive to verify that the result of this calculation matches what you would get from a minimum cost flow solver; one such solver is available in NetworkX, where we can construct the graph by hand:
import networkx as nx

G = nx.DiGraph()
# Represent elements in Y1 by 0, ..., 999, and elements in
# Y2 by 1000, ..., 1999.
for i in range(n):
    G.add_node(i, demand=-1)
    G.add_node(n + i, demand=1)
for i in range(n):
    for j in range(n):
        G.add_edge(i, n + j, capacity=1, weight=d[i, j])
At this point, we can verify that the approach above agrees with the minimum cost flow:
In [16]: d[assignment].sum() == nx.algorithms.min_cost_flow_cost(G)
Out[16]: True
Similarly, it's instructive to see that the result agrees with scipy.stats.wasserstein_distance for 1-dimensional inputs:
from scipy.stats import wasserstein_distance
np.random.seed(0)
n = 100
Y1 = np.random.randn(n)
Y2 = np.random.randn(n) - 2
d = np.abs(Y1 - Y2.reshape((n, 1)))
assignment = linear_sum_assignment(d)
print(d[assignment].sum() / n) # 1.9777950447866477
print(wasserstein_distance(Y1, Y2)) # 1.977795044786648

KDE in python with different mu, sigma / mapping a function to an array

I have a 2-dimensional array of values that I would like to perform a Gaussian KDE on, with a catch: the points are assumed to have different variances. For that, I have a second 2-dimensional array (with the same shape) that is the variance of the Gaussian to be used for each point. In the simple example,
import numpy as np
data = np.array([[0.4,0.2],[0.1,0.5]])
sigma = np.array([[0.05,0.1],[0.02,0.3]])
there would be four Gaussians, the first of which is centered at x=0.4 with σ=0.05. (Note: the actual data is much larger than 2x2.)
I am looking for one of two things:
A Gaussian KDE solver that will allow for bandwidth to change for each point
or
A way to map the results of each Gaussian into a 3-dimensional array, with each Gaussian evaluated across a range of points (say, evaluate each center/σ pair along np.linspace(0,1,101)). In this case, I could e.g. get the KDE value at x=0.5 by taking outarray[:,:,50].
The best way I found to handle this is through array multiplication of a sigma array and a data array. Then, I stack the arrays for each value I want to solve the KDE for.
import numpy as np

def solve_gaussian(val, data_array, sigma_array):
    return (1. / sigma_array) * np.exp(-(val - data_array) * (val - data_array) / (2 * sigma_array * sigma_array))

def solve_kde(xlist, data_array, sigma_array):
    kde_array = np.array([])
    for xx in xlist:
        single_kde = solve_gaussian(xx, data_array, sigma_array)
        if np.ndim(kde_array) == 3:
            kde_array = np.concatenate((kde_array, single_kde[np.newaxis, :, :]), axis=0)
        else:
            kde_array = np.dstack(single_kde)
    return kde_array

xlist = np.linspace(0, 1, 101)  # Adjust as needed
kde_array = solve_kde(xlist, data_array, sigma_array)
kde_vector = np.sum(np.sum(kde_array, axis=2), axis=1)
mode_guess = xlist[np.argmax(kde_vector)]
Caveat, for anyone attempting to use this code: the value of the Gaussian is along axis 0, not axis 2 as specified in the original question.
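As a side note, the loop can be avoided entirely with broadcasting; a sketch of the same computation (my suggestion, with the grid axis first, matching the caveat above):
kde_array = solve_gaussian(xlist[:, None, None], data_array, sigma_array)
# shape (101, 2, 2): one slice per grid point, all Gaussians evaluated at once
kde_vector = kde_array.sum(axis=(1, 2))
mode_guess = xlist[np.argmax(kde_vector)]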
