SKLearn NMF Vs Custom NMF - python

I am trying to build a recommendation system using non-negative matrix factorization. Using scikit-learn's NMF as the model, I fit my data, which results in a certain loss (i.e., reconstruction error). Then I generate recommendations for new data using the inverse_transform method.
Now I do the same with another model I built in TensorFlow. The reconstruction error after training is close to the one obtained with the sklearn approach above.
However, neither the latent factors nor the final recommendations are similar between the two.
One difference between the 2 approaches that I am aware of is:
In sklearn, I am using the coordinate descent solver, whereas in TensorFlow I am using the AdamOptimizer, which is based on gradient descent.
Everything else seems to be the same:
Loss function used is the Frobenius norm
No regularization in either case
Tested on the same data using the same number of latent dimensions
Relevant code that I am using:
1. scikit-learn approach:
from sklearn.decomposition import NMF

# Plain NMF with the coordinate-descent solver and no regularization
model = NMF(alpha=0.0, init='random', l1_ratio=0.0, max_iter=200,
            n_components=2, random_state=0, shuffle=False, solver='cd',
            tol=0.0001, verbose=0)
model.fit(data)
# Reconstruction used for the recommendations
result = model.inverse_transform(model.transform(data))
2. TensorFlow approach:
import tensorflow as tf  # TF 1.x API

# Non-negative factors W and H, kept >= 0 via the projection constraint
x = tf.placeholder(tf.float32, shape=data.shape)
w = tf.get_variable('w', initializer=tf.abs(tf.random_normal((data.shape[0], 2))),
                    constraint=lambda p: tf.maximum(0., p))
h = tf.get_variable('h', initializer=tf.abs(tf.random_normal((2, data.shape[1]))),
                    constraint=lambda p: tf.maximum(0., p))
# Frobenius norm of the reconstruction error
loss = tf.sqrt(tf.reduce_sum(tf.squared_difference(x, tf.matmul(w, h))))
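The training loop around this is roughly the following (a simplified sketch; the learning rate and number of iterations are placeholders, and the data is fed through the x placeholder defined above):

train_op = tf.train.AdamOptimizer(learning_rate=0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(5000):
        _, err = sess.run([train_op, loss], feed_dict={x: data})
    # w @ h is the reconstruction used for the recommendations
    reconstruction = sess.run(tf.matmul(w, h))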
My question is: if the recommendations generated by these two approaches do not match, how can I determine which ones are right?
Based on my use case, sklearn's NMF is giving me good results, but the TensorFlow implementation is not. How can I achieve the same with my custom implementation?

The choice of optimizer has a big impact on the quality of the training. Some very simple models (I'm thinking of GloVe, for example) work with some optimizers and not at all with others. To answer your questions:
How can I determine which are the right ones?
The evaluation is as important as the design of your model, and it is just as hard. You can try both models on several available datasets and use some metrics to score them. You could also run A/B testing on a real application to estimate the relevance of your recommendations.
How can I achieve the same using my custom implementation?
First, try to find a coordinate descent optimizer for TensorFlow and make sure every step you implement is exactly the same as in scikit-learn. Then, if you can't reproduce the result, try different solutions (why don't you start with a simple gradient descent optimizer?) and take advantage of the great modularity that TensorFlow offers!
Finally, if the recommendations produced by your implementation are that bad, I suspect there is an error in it. Try comparing against some existing code.
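As a concrete starting point, the classic multiplicative-update rules (which scikit-learn also exposes as solver='mu') keep W and H non-negative by construction and optimize the same Frobenius objective. A minimal NumPy sketch of the textbook Lee-Seung updates, not scikit-learn's exact implementation:

import numpy as np

def nmf_mu(X, k, n_iter=200, eps=1e-10, seed=0):
    """Lee-Seung multiplicative updates minimizing ||X - WH||_F."""
    rng = np.random.RandomState(seed)
    W = np.abs(rng.randn(X.shape[0], k))
    H = np.abs(rng.randn(k, X.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

W, H = nmf_mu(data, k=2)
reconstruction = W @ H

Comparing this against both your TensorFlow model and scikit-learn's output should tell you whether the gap comes from the optimizer or from a bug.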

Related

Best Practice for Transforming y_pred in Tensorflow's Metric

In a previous project, I needed to frame an image classification task as a regression problem. I implemented the regression model in TensorFlow, using a standard Sequential model whose last layer is a single-node Dense layer with no activation function. To measure performance, I need to use standard classification metrics, such as accuracy and Cohen's kappa.
However, I can't use those metrics directly because my model is a regression model, so I need to clip and round the output before feeding it to the metrics. I currently work around this by defining my own metric, but that workaround is not practical. Therefore, I'm thinking about contributing to TensorFlow by implementing a custom transformation_function that transforms y_pred with a Tensor lambda function before it is stored in the update_state method. After reading the source code, I have doubts about this idea. So I'm asking you, fellow TensorFlow users/contributors: what is the best practice for transforming y_pred before feeding it to a metric? Is this functionality already implemented in the newest version?
Thank you!
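A minimal example of the kind of wrapper metric involved, assuming a recent TF 2.x (the class name, the number of classes, and the clipping range below are illustrative, not the actual code from the question):

import tensorflow as tf

class RoundedMetric(tf.keras.metrics.Metric):
    """Clips and rounds regression outputs before delegating to a wrapped classification metric."""
    def __init__(self, inner_metric, num_classes, name=None, **kwargs):
        super().__init__(name=name or 'rounded_' + inner_metric.name, **kwargs)
        self.inner = inner_metric
        self.num_classes = num_classes

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Turn the raw regression output into a valid class index
        y_pred = tf.round(tf.clip_by_value(y_pred, 0.0, self.num_classes - 1.0))
        return self.inner.update_state(y_true, y_pred, sample_weight)

    def result(self):
        return self.inner.result()

    def reset_state(self):
        self.inner.reset_state()

# e.g. model.compile(..., metrics=[RoundedMetric(tf.keras.metrics.Accuracy(), num_classes=5)])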

Optimizers other than GradientDescentOptimizer in TensorFlow having zero gradient values

I am really at my wit's end and don't know where else to ask, so I am asking here. I know my question may not be of the best quality, but I am hoping for at least some guidance on where to look to figure out my problem.
I am replicating scikit-learn's implementation of elastic net multiple linear regression in TensorFlow and TensorBoard as a learning exercise, so I can eventually move on to implementing and visualizing more difficult machine learning algorithms.
I have some code that does a multiple linear regression using elastic net regularization as the loss function. With gradient descent, it converges to a suboptimal solution compared to scikit-learn's algorithm. Through some searching, I learned that scikit-learn initializes weights using the Xavier method, so I did that in TensorFlow as well. Performance improved slightly but was still nowhere close to sklearn's. My next improvement was to change the optimizer to try to match performance, although my research told me that scikit-learn uses coordinate descent, a method that isn't implemented in TensorFlow.
However, this is where I am stuck. Simply switching out the optimizer for another one does not seem to work (not that I expected it to, but I'm also having trouble finding material that explains how to set it up properly). Currently I've simply performed the switch the following way; can anyone give me a hint why my gradients are 0?
Thanks!
# Declare optimizer
my_opt = tf.train.GradientDescentOptimizer(0.001)
# Switch to Adam by rebinding the name (this replaces the line above)
my_opt = tf.train.AdamOptimizer(epsilon=0.1)
Histogram of gradients:
Loss function showing that Adam optimizer isn't doing anything:
EDIT:
I have updated my learning rate to be higher, but convergence still doesn't seem that great. I think I will proceed to implement coordinate descent in TensorFlow to match scikit-learn's method as closely as possible. I've attached an image of the difference for those curious:
In comparison to SGD:
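A minimal way to inspect the raw gradients directly, rather than only through the TensorBoard histograms (a TF 1.x sketch; loss is the existing loss tensor and feed stands in for the usual feed_dict):

import numpy as np
import tensorflow as tf

my_opt = tf.train.AdamOptimizer(epsilon=0.1)
grads_and_vars = my_opt.compute_gradients(loss)      # list of (gradient, variable) pairs
train_step = my_opt.apply_gradients(grads_and_vars)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    grads = sess.run([g for g, v in grads_and_vars], feed_dict=feed)  # feed: your usual feed_dict
    print([np.abs(g).max() for g in grads])  # all zeros here points at the graph, not the optimizer

Also worth noting: epsilon=0.1 is far larger than Adam's default of 1e-8, and with small gradients that can make the parameter updates much smaller than the steps stock Adam would take.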

What is the best way to minimize the RMSE?

I am using LinearRegression() from sklearn to predict. I have created different features for X and am trying to understand how I can select the best features automatically. Let's say I have defined 50 different features for X and only one output y. Is there a way to select the best-performing features automatically instead of doing it manually?
Also, I can get the RMSE using the following command:
scores = np.sqrt(-cross_val_score(lm, X, y, cv=20, scoring='neg_mean_squared_error')).mean()
From here on, how can I use this RMSE score? Do I have to make multiple predictions? How am I going to use the RMSE? There must be a way to predict() with some optimization, but I couldn't find it.
Actually, sklearn doesn't have a stepwise algorithm, which would help in understanding the importance of features. However, it does provide recursive feature elimination, which is a greedy feature-elimination algorithm similar to sequential backward selection.
See the documentation here:
Recursive Feature Elimination
Note that it will not necessarily reduce your RMSE. You might also try techniques like ridge and lasso regression.
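For example, a minimal RFE plus cross-validated RMSE sketch (the choice of 10 features and 20 folds is arbitrary; X and y are the data from the question):

import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

lm = LinearRegression()
selector = RFE(lm, n_features_to_select=10)   # keep the 10 "best" features
X_reduced = selector.fit_transform(X, y)

rmse = np.sqrt(-cross_val_score(lm, X_reduced, y, cv=20,
                                scoring='neg_mean_squared_error')).mean()
print(selector.support_)   # boolean mask of the selected columns
print(rmse)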
RMSE measures the average magnitude of the prediction error.
RMSE gives high weight to large errors; the lower the value, the better. RMSE can be improved only if you have a decent model. For feature selection, you can use PCA, stepwise regression, or a basic correlation technique. If you see a lot of multicollinearity, go for lasso or ridge regression. Also, make sure you have a decent split of test and train data; if your test data is bad, you will get poor results. Also check the training R-squared and test R-squared to make sure the model doesn't overfit.
It would be helpful if you added information on the number of observations in your test and train data and the R-squared value. Hope this helps.

Fitting Keras L1 models

I have a simple Keras model (a plain Lasso linear model) where the inputs feed a single 'neuron', Dense(1, kernel_regularizer=l1(fdr))(input_layer), but the weights of this model are never set exactly to zero. I find this interesting since scikit-learn's Lasso can set coefficients exactly to zero.
I have used Adam and TensorFlow's FtrlOptimizer for optimisation, and they have the same problem.
I've already checked this question, but that does not explain why sklearn can set values exactly to zero, not to mention how its models converge in ~500 ms on my server when the same model in Keras takes 2.4 s even with early termination.
Is this all because of the optimizer being used or am I missing something?
Is this all because of the optimizer being used or am I missing something?
Indeed. If you look into the actual function that gets called when you fit Lasso from scikit-learn (it is called from the ElasticNet class), you see that it uses a different optimization algorithm.
Coordinate descent in scikit-learn's ElasticNet starts with a coefficient vector equal to zero and then considers adding nonzero entries one at a time (this is related to stepwise feature selection for linear regression).
Other methods used to optimize L1-regularized regression work in the same way: for example, LARS (least-angle regression) can also be used from scikit-learn.
In contrast, a paper on the FTRL algorithm says:
Unfortunately, OGD is not particularly effective at producing sparse models. In fact, simply adding a subgradient of the L1 penalty to the gradient of the loss (∇ℓt(w)) will essentially never produce coefficients that are exactly zero.
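To see the contrast concretely, coordinate descent really does return exact zeros, which you can verify in a few lines (synthetic data, purely for illustration):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(200, 50)
y = X[:, :5] @ rng.randn(5) + 0.1 * rng.randn(200)   # only 5 informative features

coef = Lasso(alpha=0.1).fit(X, y).coef_
print((coef == 0).sum(), "of", coef.size, "coefficients are exactly zero")

A gradient-based L1 penalty in Keras will instead leave most of those weights small but nonzero.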

How to do regression as opposed to classification using logistic regression and scikit learn

The target variable that I need to predict consists of probabilities (as opposed to labels). The corresponding column in my training data is also in this form. I do not want to lose information by thresholding the targets to turn this into a classification problem.
If I train a logistic regression classifier with binary labels, scikit-learn's logistic regression API allows obtaining the probabilities at prediction time. However, I need to train it with probabilities. Is there a way to do this in scikit-learn, or a suitable Python package that scales to 100K data points of 1K dimensions?
I want the regressor to use the structure of the problem. One such structure is that the targets are probabilities.
You can't use cross-entropy loss with non-indicator probabilities in scikit-learn; this is not implemented and not supported by the API. It is a limitation of scikit-learn.
In general, according to scikit-learn's docs, a loss function is of the form Loss(prediction, target), where prediction is the model's output and target is the ground-truth value.
In the case of logistic regression, prediction is a value in (0,1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").
For logistic regression you can approximate probabilities as targets by oversampling instances according to the probabilities of their labels. For example, if for a given sample class_1 has probability 0.2 and class_2 has probability 0.8, then generate 10 training instances (copies of the sample): 8 with class_2 as the "ground truth target label" and 2 with class_1.
Obviously it is a workaround and is not extremely efficient, but it should work properly.
If you're OK with the upsampling approach, you can pip install eli5 and use eli5.lime.utils.fit_proba with a LogisticRegression classifier from scikit-learn.
An alternative solution is to implement (or find an implementation?) of logistic regression in TensorFlow, where you can define the loss function as you like (see the sketch below).
In putting together this solution I used answers from scikit-learn - multinomial logistic regression with probabilities as a target variable and scikit-learn classification on soft labels. I recommend those for more insight.
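If you go the TensorFlow route, the key point is that sigmoid cross-entropy does not require hard 0/1 labels, so the probabilities can be fed directly as targets. A minimal TF 1.x sketch (n_features and the learning rate are placeholders):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, n_features])
p = tf.placeholder(tf.float32, [None, 1])            # target probabilities in [0, 1]

w = tf.get_variable('w', shape=[n_features, 1])
b = tf.get_variable('b', shape=[1])
logits = tf.matmul(x, w) + b

# sigmoid_cross_entropy_with_logits accepts soft labels, not just 0/1
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)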
This is an excellent question because (contrary to what people might believe) there are many legitimate uses of logistic regression as... regression!
There are three basic approaches you can use if you insist on true logistic regression, and two additional options that should give similar results. They all assume your target output is between 0 and 1. Most of the time you will have to generate training/test sets "manually", unless you are lucky enough to be using a platform that supports SGD regression with custom kernels and cross-validation out of the box.
Note that, given your particular use case, the "not quite true logistic regression" options may be necessary. The downside of these approaches is that it takes more work to see the weight/importance of each feature, in case you want to reduce your feature space by removing weak features.
Direct Approach using Optimization
If you don't mind doing a bit of coding, you can just use scipy's optimize module. This is dead simple:
Create a function of the following type:
y_o = inverse_logit(a_0 + a_1*x_1 + a_2*x_2 + ...)
where inverse_logit(z) = exp(z) / (1 + exp(z))
Use scipy minimize to minimize the sum of -1 * [y_t*log(y_o) + (1-y_t)*log(1 - y_o)], summed over all datapoints. To do this you have to set up a function that takes (a_0, a_1, ...) as parameters, builds the prediction function, and then computes the loss.
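A sketch of that direct approach (X is the feature matrix, y_t the vector of probability targets; the clipping constant just avoids log(0)):

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(a, X, y_t, eps=1e-12):
    z = a[0] + X @ a[1:]                    # a_0 + a_1*x_1 + a_2*x_2 + ...
    y_o = 1.0 / (1.0 + np.exp(-z))          # inverse logit
    y_o = np.clip(y_o, eps, 1 - eps)
    return -np.sum(y_t * np.log(y_o) + (1 - y_t) * np.log(1 - y_o))

a0 = np.zeros(X.shape[1] + 1)
result = minimize(neg_log_likelihood, a0, args=(X, y_t), method='L-BFGS-B')
coefficients = result.x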
Stochastic Gradient Descent with Custom Loss
If you happen to be using a platform that has SGD regression with a custom loss, then you can just use that, specifying a loss of -[y_t*log(y_o) + (1-y_t)*log(1 - y_o)].
One way to do this is to fork scikit-learn and add log loss to the regression SGD solver.
Convert to Classification Problem
You can convert your problem to a classification problem by oversampling, as described by @jo9k. But note that even in this case you should not use standard X-validation, because the data points are no longer independent. You will need to break your data into train/test sets manually and oversample only after you have split them.
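A sketch of that oversampling step (k = 10 copies per sample, matching the 0.2/0.8 example above; p is the vector of target probabilities):

import numpy as np

def oversample_soft_labels(X, p, k=10):
    """Replicate each row k times; give round(p*k) copies label 1 and the rest label 0."""
    counts = np.rint(np.asarray(p) * k).astype(int)   # e.g. p=0.8, k=10 -> 8 positive copies
    X_rep = np.repeat(X, k, axis=0)
    y_rep = np.concatenate([[1] * c + [0] * (k - c) for c in counts])
    return X_rep, y_rep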
Convert to SVM
(Edit: I did some testing and found that on my test sets sigmoid kernels were not behaving well. I think they require some special pre-processing to work as expected. An SVM with a sigmoid kernel is equivalent to a 2-layer tanh neural network, which should be amenable to a regression task where the training outputs are probabilities. I might come back to this after further review.)
You should get results similar to logistic regression using an SVM with a sigmoid kernel. You can use scikit-learn's SVR function and specify the kernel as sigmoid. You may run into performance difficulties with 100,000s of data points across 1000 features... which leads me to my final suggestion:
Convert to SVM using Approximated Kernels
This method will give results a bit further away from true logistic regression, but it is extremely performant. The process is the following:
Use scikit-learn's RBFSampler to explicitly construct an approximate RBF kernel feature map for your dataset.
Process your data through that feature map and then use scikit-learn's SGDRegressor with an SVM-style (epsilon-insensitive) loss to realize a super-performant SVM on the transformed data.
The above is laid out with code here
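A sketch of that pipeline (gamma, n_components, and the SGDRegressor settings are arbitrary; epsilon_insensitive is the SVR-style loss that SGDRegressor provides):

from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    RBFSampler(gamma=1.0, n_components=500, random_state=0),   # explicit approximate RBF features
    SGDRegressor(loss='epsilon_insensitive', penalty='l2', max_iter=1000, tol=1e-3),
)
model.fit(X_train, y_train)        # y_train: probabilities in [0, 1]
pred = model.predict(X_test)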
Instead of using predict in the scikit-learn library, use the predict_proba function.
refer here:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba
