Mixed linear model with Probabilistic Layers Regression

Mixed linear model with Probabilistic Layers Regression - python

I'm interested in fitting a linear mixed model using the variational inference capabilities of tensorflow probabilities and keras. However, I cannot find a straight-forward answer on how to implement such analysis. Using the regression example in TF probabilities (see Case 3 here), I am able to grasp how to fit these models if we have only random variables in the model (the example is regression using a single feature). Following the radon example here, we have two features: floor (fixed) and county (random). My understanding is the latter should only be passed to the denseVariational layers, while the former can be passed to a regular dense layer. So I guess I would have to jointly train two networks, one for fixed and one for the random features and some how merge their outputs.
So my questions are:
(1) If these are fit jointly can the same loss function be applied to both? I often see mean square error used, while in VI negative log likelihood is used (I think this equivalent to maximizing evidence of lower bound).
(2) Does the input need to be split before-hand and fed as input to two networks?

Related

Why ReLU function after every layer in CNN?

I am taking intro to ML on Coursera offered by Duke, which I recommend if you are interested in ML. The instructors of this course explained that "We typically include nonlinearities between layers of a neural network.There's a number of reasons to do so.For one, without anything nonlinear between them, successive linear transforms (fully connected layers) collapse into a single linear transform, which means the model isn't any more expressive than a single layer. On the other hand, intermediate nonlinearities prevent this collapse, allowing neural networks to approximate more complex functions." I am curious that, if I apply ReLU, aren't we losing information since ReLU is transforming every negative value to 0? Then how is this transformation more expressive than that without ReLU?
In Multilayer Perceptron, I tried to run MLP on MNIST dataset without a ReLU transformation, and it seems that the performance didn't change much (92% with ReLU and 90% without ReLU). But still, I am curious why this tranformation gives us more information rather than lose information.

the first point is that without nonlinearities, such as the ReLU function, in a neural network, the network is limited to performing linear combinations of the input. In other words, the network can only learn linear relationships between the input and output. This means that the network can't approximate complex functions that are not linear, such as polynomials or non-linear equations.
Consider a simple example where the task is to classify a 2D data point as belonging to one of two classes based on its coordinates (x, y). A linear classifier, such as a single-layer perceptron, can only draw a straight line to separate the two classes. However, if the data points are not linearly separable, a linear classifier will not be able to classify them accurately. A nonlinear classifier, such as a multi-layer perceptron with a nonlinear activation function, can draw a curved decision boundary and separate the two classes more accurately.
ReLU function increases the complexity of the neural network by introducing non-linearity, which allows the network to learn more complex representations of the data. The ReLU function is defined as f(x) = max(0, x), which sets all negative values to zero. By setting all negative values to zero, the ReLU function creates multiple linear regions in the network, which allows the network to represent more complex functions.
For example, suppose you have a neural network with two layers, where the first layer has a linear activation function and the second layer has a ReLU activation function. The first layer can only perform a linear transformation on the input, while the second layer can perform a non-linear transformation. By having a non-linear function in the second layer, the network can learn more complex representations of the data.
In the case of your experiment, it's normal that the performance did not change much when you removed the ReLU function, because the dataset and the problem you were trying to solve might not be complex enough to require a ReLU function. In other words, a linear model might be sufficient for that problem, but for more complex problems, ReLU can be a critical component to achieve good performance.
It's also important to note that ReLU is not the only function to introduce non-linearity and other non-linear activation functions such as sigmoid and tanh could be used as well. The choice of activation function depends on the problem and dataset you are working with.

Neural networks are inspired by the structure of brain. Neurons in the brain transmit information between different areas of the brain by using electrical impulses and chemical signals. Some signals are strong and some are not. Neurons with weak signals are not activated.
Neural networks work in the same fashion. Some input features have weak and some have strong signals. These depend on the features. If they are weak, the related neurons aren't activated and don't transmit the information forward. We know that some features or inputs aren't crucial players in contributing to the label. For the same reason, we don't bother with feature engineering in neural networks. The model takes care of it. Thus, activation functions help here and tell the model which neurons and how much information they should transmit.

Machine Learning Regression To Support Multi Variable Regression

I have a data set of ~150 samples where for each sample I have 11 inputs and 3 outputs. I tried to build a full regression model to take in 11 inputs to be trained to predict the 3 outputs The issue is with so few samples training a full model is almost impossible. For this reason I am experimenting with regression such as linear regression in pythons sklearn. From what I can find most regression models either support one input to predict one output (after regression is complete) or many inputs to predict one output.
Question: Are there any types of regression that support many inputs to predict many outputs. Or perhaps any regression types at all that may be better suited for my needs.
Thank you for any help!

Have you considered simply performing separate linear regressions for each dependent variable?
Also, you need to decide which inputs are theoretically significant (in terms of explaining the variation in the output) and then test for statistical significance to determine which ones should be retained in the model.
Additionally, test for multicollinearity to determine if you have variables which are statistically related and inadvertently influencing the output. Then, test in turn for serial correlation (if the data is a time series) and heteroscedasticity.
The approach you are describing of "garbage in, garbage out" risks overfitting - since you don't seem to be screening the inputs themselves for their relevance in predicting the output.

Pre Processing spiral dataset to use for Logistic Regression

So I need to classify a spiral dataset. I have been experimenting with a bunch of algorithms like KNN, Kernel SVM, etc. I would like to try to improve the performance of Logistic Regression using feature engineering, preprocessing, etc.
I am also using scikit learn to do all of the classifications.
I fully understand Logistic Regression is not the proper algorithm to do this sort of problem. This is more of a learning excerise for Pre processing and other feature engineering/extraction methods to see how much I can improve this specific model.
Here is an example dataset I would use for the classification. Any suggestions of how I can manipulate the dataset to use in the Logistic Regression algorithm would be helpful.
I also have datasets with multiple spirals as well. some datasets have 2 classes or sometimes up to 5. This means up to 5 spirals.

Logistic Regression is generally used as a linear classifier i.e the decision boundary separating one class samples from the other is a linear(straight-line) but it can be used for non-linear decision boundaries as well.
Using the kernel trick in SVC is also good option as it maps the data in the lower dimension to higher dimension making it linearly separable.
example:
In the above example, the data is not linearly separable in lower dimension, but after applying the transformation ϕ(x) = x² and adding the second dimension to the features we have the right side graph that becomes linearly separable.
You can start transforming the data by creating new features for applying logistic regression.
Also try SVC(Support Vector Classifier) that uses kernel trick. For SVC you don't have to transform the data into higher dimensions explicitly.
There are few resources which are great for learning are one and two

Since the data doesn't seem to be linearly separable, you can try using the Kernel Trick method commonly used in Support Vector Classification. The kernel function accepts inputs in the original lower-dimensional space and returns the dot product of the transformed vectors in the higher dimensional space. That means transformed vector ϕ(x) is just some function of the coordinates in the corresponding lower-dimensional vector x.

How to do regression as opposed to classification using logistic regression and scikit learn

The target variable that I need to predict are probabilities (as opposed to labels). The corresponding column in my training data are also in this form. I do not want to lose information by thresholding the targets to create a classification problem out of it.
If I train the logistic regression classifier with binary labels, sk-learn logistic regression API allows obtaining the probabilities at prediction time. However, I need to train it with probabilities. Is there a way to do this in scikits-learn, or a suitable Python package that scales to 100K data points of 1K dimension.

I want the regressor to use the structure of the problem. One such
structure is that the targets are probabilities.
You can't have cross-entropy loss with non-indicator probabilities in scikit-learn; this is not implemented and not supported in API. It is a scikit-learn's limitation.
In general, according to scikit-learn's docs a loss function is of the form Loss(prediction, target), where prediction is the model's output, and target is the ground-truth value.
In the case of logistic regression, prediction is a value on (0,1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").
For logistic regression you can approximate probabilities as target by oversampling instances according to probabilities of their labels. e.g. if for given sample class_1 has probability 0.2, and class_2 has probability0.8, then generate 10 training instances (copied sample): 8 withclass_2as "ground truth target label" and 2 withclass_1`.
Obviously it is workaround and is not extremely efficient, but it should work properly.
If you're ok with upsampling approach, you can pip install eli5, and use eli5.lime.utils.fit_proba with a Logistic Regression classifier from scikit-learn.
Alternative solution is to implement (or find implementation?) of LogisticRegression in Tensorflow, where you can define loss function as you like it.
In compiling this solution I worked using answers from scikit-learn - multinomial logistic regression with probabilities as a target variable and scikit-learn classification on soft labels. I advise those for more insight.

This is an excellent question because (contrary to what people might believe) there are many legitimate uses of logistic regression as.... regression!
There are three basic approaches you can use if you insist on true logistic regression, and two additional options that should give similar results. They all assume your target output is between 0 and 1. Most of the time you will have to generate training/test sets "manually," unless you are lucky enough to be using a platform that supports SGD-R with custom kernels and X-validation support out-of-the-box.
Note that given your particular use case, the "not quite true logistic regression" options may be necessary. The downside of these approaches is that it is takes more work to see the weight/importance of each feature in case you want to reduce your feature space by removing weak features.
Direct Approach using Optimization
If you don't mind doing a bit of coding, you can just use scipy optimize function. This is dead simple:
Create a function of the following type:
y_o = inverse-logit (a_0 + a_1x_1 + a_2x_2 + ...)
where inverse-logit (z) = exp^(z) / (1 + exp^z)
Use scipy minimize to minimize the sum of -1 * [y_t*log(y_o) + (1-y_t)*log(1 - y_o)], summed over all datapoints. To do this you have to set up a function that takes (a_0, a_1, ...) as parameters and creates the function and then calculates the loss.
Stochastic Gradient Descent with Custom Loss
If you happen to be using a platform that has SGD regression with a custom loss then you can just use that, specifying a loss of y_t*log(y_o) + (1-y_t)*log(1 - y_o)
One way to do this is just to fork sci-kit learn and add log loss to the regression SGD solver.
Convert to Classification Problem
You can convert your problem to a classification problem by oversampling, as described by #jo9k. But note that even in this case you should not use standard X-validation because the data are not independent anymore. You will need to break up your data manually into train/test sets and oversample only after you have broken them apart.
Convert to SVM
(Edit: I did some testing and found that on my test sets sigmoid kernels were not behaving well. I think they require some special pre-processing to work as expected. An SVM with a sigmoid kernel is equivalent to a 2-layer tanh Neural Network, which should be amenable to a regression task structured where training data outputs are probabilities. I might come back to this after further review.)
You should get similar results to logistic regression using an SVM with sigmoid kernel. You can use sci-kit learn's SVR function and specify the kernel as sigmoid. You may run into performance difficulties with 100,000s of data points across 1000 features.... which leads me to my final suggestion:
Convert to SVM using Approximated Kernels
This method will give results a bit further away from true logistic regression, but it is extremely performant. The process is the following:
Use a sci-kit-learn's RBFsampler to explicitly construct an approximate rbf-kernel for your dataset.
Process your data through that kernel and then use sci-kit-learn's SGDRegressor with a hinge loss to realize a super-performant SVM on the transformed data.
The above is laid out with code here

Instead of using predict in the scikit learn library use predict_proba function
refer here:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba

Supervised Dimensionality Reduction for Text Data in scikit-learn

I'm trying to use scikit-learn to do some machine learning on natural language data. I've got my corpus transformed into bag-of-words vectors (which take the form of a sparse CSR matrix) and I'm wondering if there's a supervised dimensionality reduction algorithm in sklearn capable of taking high-dimensional, supervised data and projecting it into a lower dimensional space which preserves the variance between these classes.
The high-level problem description is that I have a collection of documents, each of which can have multiple labels on it, and I want to predict which of those labels will get slapped on a new document based on the content of the document.
At it's core, this is a supervised, multi-label, multi-class problem using a sparse representation of BoW vectors. Is there a dimensionality reduction technique in sklearn that can handle that sort of data? Are there other sorts of techniques people have used in working with supervised, BoW data in scikit-learn?
Thanks!

I am a bit confused by your question. In my experience, dimensionality reduction is never really supervised... but it seems that what you want is some sort of informed feature selection, which is impossible to do before the classification is done. In other words, you cannot know which features are more informative before your classifier is trained and validated.
However, reducing the size and complexity of your data is always good, and you have various ways to do it with text data. The applicability and performance depends on the type of vectors you have (frequency counts, tfidf) and you will always have to determine the number of dimensions (components) you want in your output. The implementations in scikit-learn are mostly in the decomposition module.
The most popular method in Natural Language Processing is Singular Value Decomposition (SVD), which is at the core of Latent Semantic Analysis (LSA, also LSI). Staying with scikit-learn, you can simply apply TruncatedSVD() on your data. A similar method is Non-negative matrix factorization, implemented in scikit-learn as NMF().
An increasingly popular approach uses transformation by random projections, Random Indexing. You can do this in scikit-learn with the functions in random_projection.
As someone pointed out in another answer, Latent Dirichlet Allocation is also an alternative, although it is much slower and computationally more demanding than the methods above. Besides, it is at the time of writing unavailable in scikit-learn.
If all you want is to simplify your data in order to feed it to a classifier, I would suggest SVD with n_components between 100 and 500, or random projection with n_components between 500 and 2000 (common values from the literature).
If you are interested in using the reduced dimensions as some sort of classification/clustering already (people call this topic extraction, although you are really not extracting topics, rather latent dimensions), then LDA might be better option. Beware, it is slow and it only takes pure frequency counts (no tfidf). And the number of components is a parameter that you have to determine in advance (no estimation possible).
Returning to your problem, I would make a sckit-learn pipeline with a vectorizer, dimensionality reduction options and classifier and would carry out a massive parameter search. In this way, you will see what gives you best results with the label set you have.

You can use latent dirichlet allocation (here's the wiki) to discover the topics in your documents. For the assignment of a label to a document, you can use the conditional probability distribution for a document label (given the distribution over the topics in your document). If you have labels for your documents already, then you just need to learn the CPD, which is trivial. Unfortunately, scikit-learn does not have an LDA implementation, but gensim does.
PS: Here's another paper that may help. If you're not very well versed in statistical inference/learning or machine learning, I suggest that your start here (note: it's still assumes a high level of mathematical maturity).

Several existing scikit modules do something similar to what you asked for.
Linear Discriminant Analysis is probably closest to what you asked for. It find a projection of the data that maximizes the distance between the class centroids relative to the projected variances.
Cross decomposition includes methods like Partial Least Squares which fit linear regression models for multidimentional targets via a projection through a lower dimentonial intermediate space. It is a lot like a single hidden layer neural net without the sigmoids.
These are linear regression methods, but you could apply a 0-1 encoding to you target signal
and use these models anyway.
You could use an L1 regularized classifier like LogisticRegression or SGDClassifier to do feature selection. RandomizedLogisticRegression combines this with bootstrapping get a more stable feature set.

Try ISOMAP. There's a super simple built-in function for it in scikits.learn. Even if it doesn't have some of the preservation properties you're looking for, it's worth a try.

Use a multi-layer neural net for classification. If you want to see what the representation of the input is in the reduced dimension, look at the activations of the hidden layer. The role of the hidden layer is by definition optimised to distinguish between the classes, since that's what's directly optimised when the weights are set.
You should remember to use a softmax activation on the output layer, and something non-linear on the hidden layer (tanh or sigmoid).

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.