how to find a transformation matrix with sgd - python

This seems like it would be simple, but I can't get things to work. I have two 100-dimensional vector spaces, with several vectors in each space that are matched. I want to find the transformation matrix (W) such that:
a_vector[0] in vector space A x W = b_vector[0] in vector space B (or a close approximation).
A paper mentions the formula for this: just a linear map, so no bias term and no activation function of the kind we typically see.
I've tried using sklearn's LinearRegression without much success.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
regression_model = LinearRegression(fit_intercept=True)
regression_model.fit(X_train, y_train)
regression_model.score(X_test, y_test)
# -1451478.4589335269 (!!???)
y_predict = regression_model.predict(X_test)
regression_model_mse = mean_squared_error(y_predict, y_test)
# regression_model_mse -> 524580.06
I tried TensorFlow without much success either. I don't care about the tool - TensorFlow, sklearn - I'm just looking for help with a solution.
Thanks.
EDIT
So I hand-rolled the code below - maximizing cosine similarity (which represents how close the predicted points are to the real points; 1.00 = a perfect match) - but it is VERY SLOW.
shape = (100, 100)
W1 = np.random.randn(*shape).astype(np.float64) / np.sqrt(sum(shape))
avgs = []
for epoch in range(1000):
    shuffle(endevec)
    distance = [0]
    for i, pair in enumerate(endevec):
        pred1 = pair[0].dot(W1)
        cosine = 1 - scipy.spatial.distance.cosine(pred1, pair[1])
        distance.append(cosine)
        diff = pred1 - pair[0]
        gradient = W1.T.dot(diff) / W1.shape[0]
        W1 += -gradient * .0001
    avgs.append(np.mean(distance))
    # overwrite the progress line in place with the running average cosine
    sys.stdout.write('\r')
    sys.stdout.write(str(avgs[-1]))
    sys.stdout.flush()
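For comparison, here is a minimal vectorized sketch of the same idea: plain mini-batch SGD on the mean squared error ||A·W - B||^2, where A and B are hypothetical (n_pairs, 100) arrays holding the matched vectors (note it takes the error against the target vector rather than the input). This is only a sketch under those assumptions, not the code from the question.
import numpy as np

def fit_transform_sgd(A, B, lr=1e-3, epochs=200, batch=64, seed=0):
    # A, B: hypothetical arrays of matched vectors, shape (n_pairs, 100) each
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((A.shape[1], B.shape[1])) / np.sqrt(A.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(A))
        for start in range(0, len(A), batch):
            a, b = A[idx[start:start + batch]], B[idx[start:start + batch]]
            diff = a @ W - b                  # prediction error against the target
            W -= lr * (a.T @ diff) / len(a)   # gradient of the mean squared error
    return W
Because every pair in a batch is handled as one matrix product, this avoids the per-vector Python loop that makes the hand-rolled version slow.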
EDIT 2
Jeanne Dark below had a great answer for finding the transformation matrix using:
M=np.linalg.lstsq(source_mtrx[:n],target_mtrx[:n])[0]
On my dataset of matched vectors, the cosine similarities of the vectors predicted using the TM found with this method were:
minmax=(-0.09405095875263214, 0.9940633773803711)
mean=0.972490919224675 (1.0 being a perfect match)
variance=0.0011325349465895844
skewness=-18.317443753033665
kurtosis=516.5701661370497
There was a tiny number of really big outliers.
The plot of the cosine similarities was: [plot not included here]

I was having exactly the same problem yesterday. I ended up using numpy.linalg.lstsq and I think it works.
# find transformation matrix M so that: source_matrix ∙ M = target_matrix,
# based on the top n most frequent terms in the target corpus
n = 500  # the choice of n depends on the size of your vocabulary
M = np.linalg.lstsq(source_mtrx[:n], target_mtrx[:n])[0]
print(M.shape)  # returns (100, 100)
# apply this transformation to the source matrix:
new_mtrx = np.array([np.dot(i, M) for i in source_mtrx])
Also check out the paper Lexical Comparison Between Wikipedia and Twitter Corpora by Using Word Embeddings. It builds on the paper you mentioned and follows the same method, but explains the implementation in more detail. For example, the authors suggest that in order to find the transformation matrix M we use only the vectors of the top n most frequent terms, and then, after applying the transformation to the source matrix, we calculate the similarity for the remaining terms.
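As a rough sketch of that last step (assuming, hypothetically, that the rows of source_mtrx and target_mtrx are aligned by term and the top n rows were used to fit M), the similarity for the remaining terms could be computed like this:
import numpy as np

def remaining_term_similarities(source_mtrx, target_mtrx, M, n=500):
    # cosine similarity between transformed source vectors and their matched targets
    pred = source_mtrx[n:] @ M
    tgt = target_mtrx[n:]
    num = np.sum(pred * tgt, axis=1)
    den = np.linalg.norm(pred, axis=1) * np.linalg.norm(tgt, axis=1)
    return num / den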
Please let me know if you find another solution for calculating M based on SGD.

Related

Scaling wide-range data in scikit-learn

I'm trying to use an MLPRegressor from scikit-learn to do a nonlinear regression on a set of 260 examples (X, Y). Each example is composed of 200 features for X and 1 feature for Y.
File containing X
File containing Y
The relationship between X and Y is not obvious when they are plotted directly against each other, but if we plot x = log10(sum(X)) and y = log10(Y), the relationship is almost linear.
As a first approach, I tried to apply my neural network directly on X and Y without success.
I have read that scaling would improve the regression. In my case, Y contains data over a very wide range of values (from 1e-12 to 1e-5). When computing the error, 1e-5 of course carries much more weight than 1e-12, but I would like my neural network to approximate both correctly. With a linear scaling, say preprocessing.MinMaxScaler from scikit-learn, 1e-8 maps to roughly -0.99 and 1e-12 to roughly -1, so I'm losing all the information in my target.
My question here is: what kind of scaling could I use to get consistent results?
The only solution I have found is to apply log10(Y) but of course, error is increased exponentially.
The best I could get is with the code below:
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"]=(20,10)
freqIter=[]
for i in np.arange(0, 0.2, 0.001):
    freqIter.append([i, i + 0.001])
#############################################################################
X = np.zeros((len(learningFiles),len(freqIter)))
Y = np.zeros(len(learningFiles))
# Import X: loadtxt()
# Import Y: loadtxt
maxy = np.amax(Y)
Y *= 1/maxy
Y = Y.reshape(-1, 1)
maxx = np.amax(X)
X *= 1/maxx
#############################################################################
reg = MLPRegressor(hidden_layer_sizes=(8,2), activation='tanh', solver='adam', alpha=0.0001, learning_rate='adaptive', max_iter=10000, verbose=False, tol = 1e-7)
reg.fit(X, Y)
#############################################################################
plt.scatter([np.log10(np.sum(kou*maxx)) for kou in X],Y*maxy,label = 'INPUTS',color='blue')
plt.scatter([np.log10(np.sum(kou*maxx)) for kou in X],reg.predict(X)*maxy,label='Predicted',color='red')
plt.grid()
plt.legend()
plt.show()
Result: [plot not included here]
Thanks for your help.
You may want to look at a FunctionTransformer. The example given applies a logarithmic transformation as part of pre-processing. You can also do it for an arbitrary mathematical function.
I would also suggest trying a ReLU activation function if you scale logarithmically. After the transformation your data looks fairly linear, so it may converge a little faster - but that's just a hunch.
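As a hedged sketch of that suggestion (assuming Y is the raw target array from the question), the log transform and its inverse could be wrapped like this:
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log10-transform the target, keeping the inverse handy to map predictions back
log_tf = FunctionTransformer(func=np.log10, inverse_func=lambda v: 10 ** v)
Y_log = log_tf.fit_transform(Y.reshape(-1, 1))   # 1e-12 .. 1e-5 becomes -12 .. -5
# fit the regressor on Y_log, then invert the predictions, e.g.:
# Y_pred = log_tf.inverse_transform(reg.predict(X).reshape(-1, 1)).ravel()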
I've finally found something interesting that works well in my case.
First, I used a log scaling for Y. I think it is the most appropriate scaling when the range of values is very wide, such as mine (from 1e-12 to 1e-5). The target is then between -12 and -5.
Secondly, my mistake when scaling X was to apply the same scaling to all features. Say my X contains 200 features: I was dividing by the maximum over all features of all examples. My solution is to scale feature1 by the maximum of feature1 across all examples, and then to repeat this for every feature. This gives me feature1 between 0 and 1 for all examples, instead of a far smaller range (feature1 could be between 0 and 0.0001 with my previous scaling), as sketched below.
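A minimal sketch of that per-feature scaling (assuming X and Y are the raw arrays loaded earlier; this only illustrates the idea described above):
import numpy as np

# scale each feature (column) by its own maximum instead of one global maximum
col_max = np.amax(X, axis=0)
col_max[col_max == 0] = 1.0          # guard against all-zero columns
X_scaled = X / col_max               # every feature now spans roughly 0..1

# log-scale the target so 1e-12 .. 1e-5 becomes -12 .. -5
Y_scaled = np.log10(Y)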
I get better results; my main issue now is selecting the correct parameters (number of layers, tolerance, ...), but that is another problem.

Bad quality of Viterbi Algorithm (HMM)

I've been trying to get into hidden Markov models and the Viterbi algorithm recently. I found a library called hmmlearn (http://hmmlearn.readthedocs.io/en/latest/tutorial.html) to help me generate a state sequence for two states (with Gaussian emissions). Then I wanted to re-determine the state sequence using Viterbi. My code works, but predicts approximately 5% of the states wrong (depending on the means and variances of the Gaussian emissions). The hmmlearn library has a .predict method which also uses Viterbi to determine the state sequence.
My problem now is that the Viterbi algorithm by hmmlearn is much better than my hand-written one (error rate is lower than 0.5% compared to my 5%). I couldn't find any major problem in my code, so I'm not sure why this is the case. Below is my code where I first generate the state and observation sequence Z and X, predict Z with hmmlearn and finally predict it with my own code:
# Import libraries
import numpy as np
import scipy.stats as st
from hmmlearn import hmm
# Generate a sequence
model = hmm.GaussianHMM(n_components = 2, covariance_type = "spherical")
model.startprob_ = pi
model.transmat_ = A
model.means_ = obs_means
model.covars_ = obs_covars
X, Z = model.sample(T)
## Predict the states from generated observations with the hmmlearn library
Z_pred = model.predict(X)
# Predict the state sequence with Viterbi by hand
B = np.concatenate((st.norm(mean_1,var_1).pdf(X), st.norm(mean_2,var_2).pdf(X)), axis = 1)
delta = np.zeros(shape = (T, 2))
psi = np.zeros(shape= (T, 2))
### Calculate starting values
for s in np.arange(2):
    delta[0, s] = np.log(pi[s]) + np.log(B[0, s])
psi = np.zeros((T, 2))
### Take everything in log space since values get very low as t -> T
for t in range(1, T):
    for s_post in range(0, 2):
        delta[t, s_post] = np.max([delta[t - 1, :] + np.log(A[:, s_post])], axis = 1) + np.log(B[t, s_post])
        psi[t, s_post] = np.argmax([delta[t - 1, :] + np.log(A[:, s_post])], axis = 1)
### Backtrack
states = np.zeros(T, dtype=np.int32)
states[T-1] = np.argmax(delta[T-1])
for t in range(T-2, -1, -1):
    states[t] = psi[t+1, states[t+1]]
I'm not sure if I have a big error in my code or if hmmlearn just uses a more refined Viterbi algorithm. Looking into the falsely predicted states, I have noticed that the impact of the emission probability B seems to be too big: it causes the states to change too frequently even when the transition probability to the other state is really low.
I'm rather new to python so please excuse my ugly coding. Thanks in advance for any tips you might have!
Edit: As you can see in the code, I'm stupid and used the variances instead of the standard deviations to determine the emission probabilities. After fixing this, I get the same result as hmmlearn's implemented Viterbi algorithm.
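For anyone skimming, the fix described in the edit amounts to passing the standard deviation (the square root of the variance) as scipy's scale parameter, along these lines:
# scipy.stats.norm takes the standard deviation, not the variance, as its scale
B = np.concatenate((st.norm(mean_1, np.sqrt(var_1)).pdf(X),
                    st.norm(mean_2, np.sqrt(var_2)).pdf(X)), axis = 1)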

Unsupervised learning clustering 1D array

I am faced with the following array:
y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
What I would like to do is extract the cluster with the highest scores. That would be
best_cluster = [200,297,275,243]
I have checked quite a few questions on Stack Overflow on this topic, and most of them recommend using k-means, although a few mention that k-means might be overkill for clustering 1D arrays.
However, k-means is a supervised learning algorithm; this means I would have to pass in the number of centroids. As I need to generalize this problem to other arrays, I cannot pass the number of centroids for each one of them. Therefore I am looking to implement some sort of unsupervised learning algorithm that would be able to figure out the clusters by itself and select the highest one.
In array y I would see 3 clusters, like so: [1,2,4,7,9,5,4,7,9], [56,57,54,60], [200,297,275,243].
What algorithm would best fit my needs, considering computation cost and accuracy and how could I implement it for my problem?
Try MeanShift. From the sklearn user guide on MeanShift:
The algorithm automatically sets the number of clusters, ...
Modified demo code:
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
# #############################################################################
# Generate sample data
X = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
X = np.reshape(X, (-1, 1))
# #############################################################################
# Compute clustering with MeanShift
# The following bandwidth can be automatically detected using
# bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=100)
ms = MeanShift(bandwidth=None, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
print("number of estimated clusters : %d" % n_clusters_)
print(labels)
Output:
number of estimated clusters : 2
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
Note that MeanShift is not scalable with the number of samples. The recommended upper limit is 10,000.
BTW, as rahlf23 already mentioned, k-means is an unsupervised learning algorithm. The fact that you have to specify the number of clusters does not mean it is supervised.
See also:
Overview of clustering methods
Choosing the right estimator
Clustering is overkill here
Just compute the differences of subsequent elements. I.e. look at x[i]-x[i-1].
Choose the k largest differences as split points, or define a threshold on when to split (e.g. 20); that depends on your knowledge of the data.
This is O(n), much faster than all the others mentioned. Also very understandable and predictable.
On one dimensional ordered data, any method that doesn't use the order will be slower than necessary.
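A small sketch of that gap-based idea applied to the example array (sorting first, then cutting at the k largest jumps, with k chosen by hand here); it is just one way to implement the suggestion above:
import numpy as np

y = np.array([1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243])

# sort, find the k largest gaps between consecutive values, and split there
order = np.sort(y)
gaps = np.diff(order)
k = 2                                        # k splits -> k+1 clusters
cut_points = np.sort(np.argsort(gaps)[-k:]) + 1
clusters = np.split(order, cut_points)
best_cluster = clusters[-1]                  # the cluster with the highest values
print(best_cluster)                          # [200 243 275 297]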
HDBSCAN is the best clustering algorithm and you should always use it.
Basically all you need to do is provide a reasonable min_cluster_size, a valid distance metric and you're good to go.
For min_cluster_size I suggest using 3 since a cluster of 2 is lame and for metric the default euclidean works great so you don't even need to mention it.
Don't forget that distance metrics apply to vectors and here we have scalars so some ugly reshaping is in order.
To put it all together and assuming by "cluster with the highest scores" you mean the cluster that includes the max value we get:
from hdbscan import HDBSCAN
import numpy as np
y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
y = np.reshape(y, (-1, 1))
clusterer = HDBSCAN(min_cluster_size=3)
cluster_labels = clusterer.fit_predict(y)
best_cluster = clusterer.exemplars_[cluster_labels[y.argmax()]].ravel()
print(best_cluster)
The output is [297 200 275 243]. Original order is not preserved. C'est la vie.

Get parameters of fitted model of KernelRidge class scikit learn library

I want to use the KernelRidge class of the scikit-learn library to fit a nonlinear regression model to my data, but I am confused about how to do that.
from sklearn.kernel_ridge import KernelRidge
import numpy as np
n_samples, n_features = 20,1
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)
Krr = KernelRidge(alpha=1.0, kernel='linear',degree = 4)
Krr.fit(X, y)
I am expecting 5 coefficients to be set for this model; how can I get them?
The above code will transform the 1-D data into a degree-4 polynomial feature space and fit the model to the data. I think it should find the best c0, c1, c2, c3, c4 from the training data. My question is: how can I access c0, c1, c2, c3, c4?
EDIT:
I made a mistake in my code above: the kernel parameter should be "polynomial" instead of "linear" in line 7.
Krr = KernelRidge(alpha=1.0, kernel='polynomial',degree = 4)
But my question is same as before.
http://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html#sklearn.kernel_ridge.KernelRidge
dual_coef_ : array, shape = [n_features] or [n_targets, n_features]
so
Krr.dual_coef_
should do it.
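For the 20-sample example in the question, this should come back with one dual coefficient per training sample rather than one per polynomial term (which is what the edit below elaborates on):
print(Krr.dual_coef_.shape)   # expected: (20,)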
EDIT:
OK, so dual_coef_ is the coefficient vector in the kernel space. For a linear kernel, the kernel K(X, X') is X·X'.T, so it is an NxN matrix (N being the number of samples), hence the number of coefficients equals the number of samples (the length of y).
There are 3 equations we need to understand.
The first is the standard ridge regression weight estimate.
The second is the partially kernelised version, and the third is the relation linking the two.
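The equations themselves appear to have been images in the original answer and are not reproduced here; reconstructed from the standard ridge / kernel ridge identities (with \lambda the regularisation strength), they are presumably:
(1)  w = (X^T X + \lambda I)^{-1} X^T y      (standard ridge weights)
(2)  \alpha = (X X^T + \lambda I)^{-1} y     (dual coefficients, i.e. dual_coef_)
(3)  w = X^T \alpha                          (relation between the two)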
dual_coef_ returns the alpha of equation (2). Therefore, to get the weight vector in the 'normal' feature space, rather than the kernel space in which it is returned, you need to compute X.T * Krr.dual_coef_.
We can check this is correct because KRR and Ridge Regression are the same if the kernel is linear.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge
rng = np.random.RandomState(0)
X = 5 * rng.rand(100, 1)
y = np.sin(X).ravel()
Krr = KernelRidge(alpha=1.0, kernel='linear', coef0=0)
R = Ridge(alpha=1.0,fit_intercept=False)
Krr.fit(X, y)
R.fit(X, y)
print(np.dot(X.transpose(), Krr.dual_coef_))
print(R.coef_)
This outputs:
[-0.03997686]
[-0.03997686]
which shows they are equivalent (you have to change the intercept options, as the defaults differ between the two models).
As the degree parameter is ignored for the linear kernel, as I mentioned in the comments, the coefficient should be 1x1 in this case (as it is).
If you want to know exactly what a particular model returns, I recommend looking at the source code on github, which I think is the only way to gain a deeper understanding of how this stuff works. https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/kernel_ridge.py
Additionally, for a non-linear kernel, the intuition of the weights can easily be lost, so always start from first principles if you do this.
Illustration of how KernelRidge prediction works (image not included here). Hope it will help someone to understand the model.

Standard errors for multivariate regression coefficients

I've done a multivariate regression using sklearn.linear_model.LinearRegression and obtained the regression coefficients doing this:
import numpy as np
from sklearn import linear_model
clf = linear_model.LinearRegression()
TST = np.vstack([x1,x2,x3,x4])
TST = TST.transpose()
clf.fit (TST,y)
clf.coef_
Now, I need the standard errors for these same coefficients. How can I do that?
Thanks a lot.
Based on this stats question and wikipedia, my best guess is:
MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)
However, my linear algebra and stats are both quite poor, so I could be missing something important. Another option might be to bootstrap the variance estimate.
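A rough sketch of that bootstrap option, reusing TST, y and linear_model from the question (resample rows with replacement, refit, and take the spread of the fitted coefficients):
import numpy as np

rng = np.random.default_rng(0)
y_arr = np.asarray(y)
boot_coefs = []
for _ in range(1000):
    idx = rng.integers(0, len(y_arr), size=len(y_arr))      # resample rows with replacement
    boot_fit = linear_model.LinearRegression().fit(TST[idx], y_arr[idx])
    boot_coefs.append(boot_fit.coef_)
SE_boot = np.std(boot_coefs, axis=0)                        # bootstrap standard errors of the coefficients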
MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)
I think this answer is not entirely correct.
In particular, if I am not wrong, according to your code sklearn adds the constant (intercept) term by default when computing your coefficients.
You therefore need to include a column of ones in your matrix TST. Then the code is correct and will give you an array with all the standard errors.
This code has been tested with data; it is correct.
Find the X matrix for each data set (arrays() here is the poster's own helper); n is the length of the dataset and m is the number of variables:
X, n, m = arrays(data)
y = ***.reshape((n, 1))
linear = linear_model.LinearRegression()
linear.fit(X, y, n_jobs=-1)  # drop n_jobs=-1 if there is only one variable
Sum of squares:
s = np.sum((linear.predict(X) - y) ** 2) / (n - (m - 1) - 1)
Standard deviation, the square root of the diagonal of the variance-covariance matrix (via the pseudo-inverse / singular value decomposition):
sd_alpha = np.sqrt(s * (np.diag(np.linalg.pinv(np.dot(X.T, X)))))
t-statistic (use linear.intercept_ for one variable):
t_stat_alpha = linear.intercept_[0] / sd_alpha[0]
I found that the accepted answer had some mathematical glitches that in total would require edits beyond the recommended etiquette for modifying posts. So here is a solution to compute the standard error estimate for the coefficients obtained through the linear model (using an unbiased estimate as suggested here):
# preparation
X = np.concatenate((np.ones((TST.shape[0], 1)), TST), axis=1)
y_hat = clf.predict(TST).T
m, n = X.shape
# computation
MSE = np.sum((y_hat - y)**2) / (m - n)
coef_var_est = MSE * np.diag(np.linalg.pinv(np.dot(X.T, X)))
coef_SE_est = np.sqrt(coef_var_est)
Note that we have to add a column of ones to TST as the original post used the linear_model.LinearRegression in a way that will fit the intercept term. Furthermore, we need to compute the mean squared error (MSE) as in ANOVA. That is, we need to divide the sum of squared errors (SSE) by the degrees of freedom for the error, i.e., df_error = df_observations - df_features.
The resulting array coef_SE_est contains the standard error estimates of the intercept and all other coefficients in coef_SE_est[0] and coef_SE_est[1:] resp. To print them out you could use
print('intercept: coef={:.4f} / std_err={:.4f}'.format(clf.intercept_[0], coef_SE_est[0]))
for i, coef in enumerate(clf.coef_[0,:]):
    print('x{}: coef={:.4f} / std_err={:.4f}'.format(i+1, coef, coef_SE_est[i+1]))
The example from the documentation shows how to get the mean square error and explained variance score:
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean square error
print("Residual sum of squares: %.2f"
% np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))
Does this cover what you need?
