I have a machine learning problem that I'm trying to solve. I'm using a Gaussian HMM (from hmmlearn) with 5 states, modelling extreme negative, negative, neutral, positive and extreme positive in the sequence. I have set up the model in the gist below
https://gist.github.com/stevenwong/cb539efb3f5a84c8d721378940fa6c4c
import numpy as np
import pandas as pd
from hmmlearn.hmm import GaussianHMM
x = pd.read_csv('data.csv')
x = np.atleast_2d(x.values)
h = GaussianHMM(n_components=5, n_iter=10, verbose=True, covariance_type="full")
h = h.fit(x)
y = h.predict(x)
The problem is that most of the observations are assigned to the middle state. I can clearly see runs of positive values and runs of negative values in the data, but they all get lumped together. Any idea how I can get the model to fit the data better?
EDIT 1:
Here is the transition matrix. I believe hmmlearn reads it across the rows (i.e., row 0 holds the probabilities of state 0 transitioning to itself, then to states 1, 2, 3, ...).
In [3]: h.transmat_
Out[3]:
array([[ 0.19077231, 0.11117929, 0.24660208, 0.20051377, 0.25093255],
[ 0.12289066, 0.17658589, 0.24874935, 0.24655888, 0.20521522],
[ 0.15713787, 0.13912972, 0.25004413, 0.22287976, 0.23080852],
[ 0.14199694, 0.15423031, 0.25024992, 0.2332739 , 0.22024893],
[ 0.17321093, 0.12500688, 0.24880728, 0.21205912, 0.2409158 ]])
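The row-wise reading can be checked with plain NumPy. This is a minimal sketch, not part of the original gist: the variable names and the dwell-time summary are assumptions added for illustration. If each row is a distribution over next states, every row sums to 1, the diagonal holds the probability of staying put, and 1 / (1 - a_ii) is the expected number of consecutive steps spent in state i.

```python
import numpy as np

# Transition matrix copied from the h.transmat_ output above.
transmat = np.array([
    [0.19077231, 0.11117929, 0.24660208, 0.20051377, 0.25093255],
    [0.12289066, 0.17658589, 0.24874935, 0.24655888, 0.20521522],
    [0.15713787, 0.13912972, 0.25004413, 0.22287976, 0.23080852],
    [0.14199694, 0.15423031, 0.25024992, 0.2332739, 0.22024893],
    [0.17321093, 0.12500688, 0.24880728, 0.21205912, 0.2409158],
])

row_sums = transmat.sum(axis=1)           # should all be 1 under the row convention
stay_prob = np.diag(transmat)             # probability of remaining in each state
expected_dwell = 1.0 / (1.0 - stay_prob)  # mean run length of each state, in steps

print("row sums:", np.round(row_sums, 6))
print("P(stay):", np.round(stay_prob, 3))
print("expected dwell (steps):", np.round(expected_dwell, 2))
```

Every entry is close to 0.2, so each state is expected to last only a little over one step. That suggests the fitted model has picked up almost no persistence, which is consistent with the states not separating into long runs.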
If I set all the transition probs to 0.2, it looks like this (if I do average by state the separation is worse).
Apparently, your model learned a large variance for state 2. A Gaussian HMM is a generative model trained with a maximum-likelihood criterion, so in some sense you got the optimal fit to the data. I can see it provides meaningful predictions in extreme cases, so if you want it to attribute more observations to classes other than 2, I would try the following:
Data preprocessing. Try to use log values for your input to make the difference between them sharper.
Look at your transition matrix, maybe transition probs from state 2 are too low. Try to set all probabilities to equal and see what happens.
I'm trying to use GaussianProcessRegressor in sklearn to predict values at unseen inputs.
The target values are typically between 1000 and 10000.
Since they do not match a zero-mean prior, I set up the model with normalize_y = False, which is the default.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
gpr = GaussianProcessRegressor(kernel=RBF(), random_state=0, alpha=1e-10, normalize_y=False)
When I predict at unseen inputs with the gpr model, the returned std values are unrealistically small, on the order of 0.1, which is about 0.001% of the predicted values.
When I change the setting to normalize_y = True, the returned std values are more realistic, around 500.
Can someone explain exactly what normalize_y does here, and whether I should set it to True or False in this case?
I found the closest answer HERE: https://github.com/scikit-learn/scikit-learn/issues/15612
"OK I think I know what might be going on here. It's a bit tricky to see but I think that none of the kernels have a vertical length scale parameter, so kernel(x,x) is always equal to 1. All the diagonal elements of K are equal to 1 (before we add the ridge to it), for example.
We can then see that the variance of the predictions can only be between 0 and 1. For example, if we're predicting at a point far from the training data (so kernel(X, x_new) is a vector of zeros) then the variance is just
sigma^2 = kernel(x_new, x_new) = 1
I think the real problem here is that the prior is for data with unit variance, but the data doesn't have unit variance. The solution would be to normalise the data so that it has unit variance after it 'enters' the GP, conduct the GP analysis, and then 'unnormalise' it back again at the end. The code already removes the mean automatically, so I think we just need to divide by the standard deviation at the same point and it would work OK.
So could just need a few extra lines!"
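For this reason, the kernel needs a vertical scale that matches the data: setting normalize_y = True (in releases that also rescale the targets to unit variance) or multiplying the kernel by a ConstantKernel should fix this issue. Changing length_scale_bounds only affects the horizontal length scale, so on its own it does not change the prior variance.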
I hope this helps those who land here as I faced the same issue!
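The effect described in the quoted issue can be reproduced with a small sketch. The data here are synthetic and the variable names are illustrative. It assumes a scikit-learn release in which normalize_y standardises the targets to unit variance as well as zero mean, and it uses optimizer=None so the kernel stays fixed and the comparison is deterministic.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = rng.uniform(1000, 10000, size=10)   # targets on the scale from the question
X_far = np.array([[1e3]])               # far from every training point

# Same unit-amplitude RBF kernel, with and without target normalisation.
raw = GaussianProcessRegressor(kernel=RBF(), normalize_y=False, optimizer=None).fit(X, y)
scaled = GaussianProcessRegressor(kernel=RBF(), normalize_y=True, optimizer=None).fit(X, y)

_, std_raw = raw.predict(X_far, return_std=True)
_, std_scaled = scaled.predict(X_far, return_std=True)
print("normalize_y=False:", std_raw)     # capped by the unit prior variance
print("normalize_y=True: ", std_scaled)  # rescaled by the spread of y
```

Far from the data the predictive std of the unnormalised model cannot exceed 1, whatever the scale of y, while the normalised model reports an uncertainty on the scale of the targets.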
Running Python 3.7.3
I have made a simple GMM and fit it to some data. The predict_proba method returns 1's and 0's instead of probabilities of the input belonging to each Gaussian.
I initially tried this on a bigger data set and then tried to get a minimum example.
from sklearn.mixture import GaussianMixture
import numpy as np
import pandas as pd
feat_1 = [1,1.8,4,4.1, 2.2]
feat_2 = [1.4,.9,4,3.9, 2.3]
test_df = pd.DataFrame({'feat_1': feat_1, 'feat_2': feat_2})
gmm_test = GaussianMixture(n_components =2 ).fit(test_df)
gmm_test.predict_proba(test_df)
gmm_test.predict_proba(np.array([[8,-1]]))
I'm getting arrays that are just 1's and 0's, or almost (10^-30 or whatever).
Unless I'm interpreting something incorrectly, the return should be a probability of each, so for example,
gmm_test.predict_proba(np.array([[8,-1]]))
should certainly not be [1,0] or [0,1].
The example you gave produces odd results because you are fitting 2 mixture components to only 5 data points, which basically causes overfitting.
If you do check the means and covariances of your components:
print(gmm_test.means_)
>>> [[4.05 3.95 ]
[1.66666667 1.53333333]]
print(gmm_test.covariances_)
>>> [[[ 0.002501 -0.0025 ]
[-0.0025 0.002501 ]]
[[ 0.24888989 0.13777778]
[ 0.13777778 0.33555656]]]
From this you can see that the first Gaussian is basically fitted with a very small covariance matrix, meaning that unless a point is very close to its center (4.05,3.95), the probability to belong to this Gaussian will always be negligible.
To convince you that despite this, your model is working as expected, try this:
epsilon = 0.005
print(gmm_test.predict_proba([gmm_test.means_[0]+epsilon]))
>>> array([[0.03142181, 0.96857819]])
As soon as you increase epsilon, it will only return array([[0., 1.]]), as you observed.
It might be useful to know that increasing reg_covar will decrease the confidence:
gmm_test = GaussianMixture(n_components =2,reg_covar=1).fit(test_df)
# output [[0.56079116 0.43920884]]
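To see how the regularisation changes the posterior, the same five points can be refit with a few reg_covar values. This is a minimal sketch: the chosen values, the fixed random_state and the `results` dictionary are assumptions made for illustration, and the exact probabilities will depend on the scikit-learn version and the initialisation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# The five training points from the question and the far-away query point.
X = np.array([[1.0, 1.4], [1.8, 0.9], [4.0, 4.0], [4.1, 3.9], [2.2, 2.3]])
query = np.array([[8.0, -1.0]])

results = {}
for reg in (1e-6, 1e-2, 1.0):  # 1e-6 is the scikit-learn default
    gmm = GaussianMixture(n_components=2, reg_covar=reg, random_state=0).fit(X)
    results[reg] = gmm.predict_proba(query)
    print(f"reg_covar={reg:g}: {np.round(results[reg], 4)}")
```

reg_covar is added to the diagonal of every covariance matrix, so larger values widen both components and pull the posterior away from 0 and 1.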
good day everyone. I have got the following:
I am using a GaussianProcessRegressor object from the Sklearn library.
After fitting the model, I want to sample points using predict to get a better idea of what the model looks like so far. The issue is that it predicted zero everywhere except at the training points.
I reset the alpha value of the regressor from my initial 1e-5 back to the default 1e-10 and n_restarts_optimizer from 9 back to the default of zero; my kernel is a Matern kernel with nearly standard settings. Now I do get non-zero values, but I am not sure how to proceed:
I have the following:
a = df_reduced.values[0:4, :]
print("a[0,0]: ", a[0,0])
gp.predict(a)
Of course this gives me a nice result (since it's the fitting data):
a[0,0]: 150.0
Out[47]:
array([[10.4 ],
[ 9.3 ],
[78.39990234],
[78.39990234]])
Now I slightly alter the first feature of the first sample within its initial vicinity:
a = df_reduced.values[1:4, :]
a[0, 0] = 151
gp.predict(a)
array([[4.85703698e-254],
[7.83999023e+001],
[7.83999023e+001]])
and for a[0, 0] = 152:
array([[ 0. ],
[78.39990234],
[78.39990234]])
So it seems that in most of the area the function is simply zero, which is a problem because I want to use this for Gaussian-process hyperparameter optimization with global minimisation. Would somebody have a lead on how to improve this?
Btw I am using 16 features, and fitting on 30 samples so far and the output function takes values between 0 and 100.
Parameters are as follows (copy-paste):
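from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, Matern
matern = C(1.0)*Matern(length_scale=1.0, nu=2.5)
gp = GaussianProcessRegressor(kernel=matern)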
gp.fit(df_reduced.values, Y) # df_reduced.values, because meanwhile we have overwritten X_reduced
Thanks already for any lead,
Best regards,
robTheBob86
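The symptom in this question can be reproduced with synthetic data. This is a sketch under stated assumptions, not a diagnosis of the asker's model: the data are random, the length scale is pinned at 1.0 with optimizer=None, and the names `zero_mean` and `centred` are illustrative. A GP with a zero prior mean predicts 0 once the query is many length scales away from every training point, and with normalize_y=True it falls back to the mean of the training targets instead.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, Matern

rng = np.random.default_rng(0)
X = rng.uniform(0, 200, size=(30, 16))  # 30 samples, 16 unscaled features (synthetic)
Y = rng.uniform(0, 100, size=30)        # outputs between 0 and 100

kernel = C(1.0) * Matern(length_scale=1.0, nu=2.5)
query = X[:1].copy()
query[0, 0] += 50.0                     # move one feature 50 length scales away

# optimizer=None keeps the length scale at 1.0 for a deterministic illustration.
zero_mean = GaussianProcessRegressor(kernel=kernel, optimizer=None).fit(X, Y)
centred = GaussianProcessRegressor(kernel=kernel, optimizer=None, normalize_y=True).fit(X, Y)

pred_zero = zero_mean.predict(query)
pred_centred = centred.predict(query)
print(pred_zero, pred_centred, Y.mean())
```

If this is what is happening, standardising the features so that a length scale near 1 is meaningful, and setting normalize_y=True, are plausible things to try. Neither is confirmed as the fix in the thread.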
Can scikit-learn be used for removal of the features which are highly-correlated while using multiple linear regression?
With regard to the answer posted by @behzad.nouri to "Capturing high multi-collinearity in statsmodels", I have some questions to clear up my confusion.
He tested for high multicollinearity among 5 columns (features) of independent variables, each with 100 rows of data, and found that w[0] is close to zero. Can I therefore say that the first column, i.e. the first independent variable, should be removed to avoid very high multicollinearity?
For detecting the cause of multicollinearity, you can simply check the correlation matrix (the first two lines in behzad.nouri's answer) to see which variables are highly correlated with each other (look for values close to 1).
Another alternative is to look at variance inflation factors (VIFs). The statsmodels package can compute VIF values as well. There is no standard threshold, but VIF values greater than 4 are often considered problematic.
import numpy as np
import statsmodels.stats.outliers_influence as oi
mean = [0, 0, 0]
cov = [[100, 90, 5], [90, 95, 10], [5, 10, 30]]
x, y, z = np.random.multivariate_normal(mean, cov, 1000).T
print(np.corrcoef([x, y, z]))
In the above code I've created three random variables x, y, and z. The covariance between x and y is high, so if you print out the correlation matrix you will see that the correlation between these two variables is very high as well (0.931).
array([[ 1. , 0.93109838, 0.1051695 ],
[ 0.93109838, 1. , 0.18838079],
[ 0.1051695 , 0.18838079, 1. ]])
At this phase you can discard either x or y as the correlation between them is very high and using only one of them would be enough to explain most of the variation.
You can check the VIF values as well:
exog = np.array([x,y,z]).transpose()
vif0 = oi.variance_inflation_factor(exog, 0)
If you print out vif0 it will give you 7.21 for the first variable, which is a high number and indicative of high multicollinearity of the first variable with other variables.
Which one to exclude from the analysis (x or y) is up to you. You can check their standardized regression coefficients to see which one has a higher impact. You can also use techniques like ridge regression or lasso if you have a multicollinearity problem. If you want to go deeper, I would suggest asking on CrossValidated instead.
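The VIF check can also be run for all columns at once with NumPy alone. This is a minimal sketch under the assumption of a regression with an intercept, in which case VIF_i = 1 / (1 - R_i^2) equals the i-th diagonal entry of the inverse correlation matrix. The numbers can differ slightly from statsmodels' variance_inflation_factor, which uses the exog columns exactly as given, and they depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)
mean = [0, 0, 0]
cov = [[100, 90, 5], [90, 95, 10], [5, 10, 30]]
exog = rng.multivariate_normal(mean, cov, size=1000)  # columns: x, y, z

# Diagonal of the inverse correlation matrix gives one VIF per column.
corr = np.corrcoef(exog, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip("xyz", vif):
    print(f"VIF({name}) = {value:.2f}")
```

With this covariance, x and y should both come out well above the threshold of 4, while z stays close to 1.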
I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Amongst others, I want to use the Naive Bayes classifier, but my problem is that I have a mix of categorical data (e.g. "Registered online", "Accepts email notifications") and continuous data (e.g. "Age", "Length of membership"). I haven't used scikit-learn much before, but I suppose that Gaussian Naive Bayes is suitable for continuous data and that Bernoulli Naive Bayes can be used for categorical data. However, since I want to have both categorical and continuous data in my model, I don't really know how to handle this. Any ideas would be much appreciated!
You have at least two options:
Transform all your data into a categorical representation by computing percentiles for each continuous variable and then binning the continuous variables using the percentiles as bin boundaries. For instance, for the height of a person create the following bins: "very small", "small", "regular", "big", "very big", ensuring that each bin contains approximately 20% of the population of your training set. At the time of writing there was no utility to perform this automatically in scikit-learn (more recent releases provide KBinsDiscretizer in sklearn.preprocessing), but it should not be too complicated to do it yourself. Then fit a single multinomial NB on this categorical representation of your data.
Independently fit a gaussian NB model on the continuous part of the data and a multinomial NB model on the categorical part. Then transform all the dataset by taking the class assignment probabilities (with predict_proba method) as new features: np.hstack((multinomial_probas, gaussian_probas)) and then refit a new model (e.g. a new gaussian NB) on the new features.
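The percentile binning from the first option can be sketched with NumPy alone. The heights below are synthetic and the variable names are illustrative; the bin labels are the ones suggested above.

```python
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(loc=170.0, scale=10.0, size=1000)  # synthetic heights in cm

# Quintile boundaries, computed on the training set only.
edges = np.percentile(height, [20, 40, 60, 80])
labels = np.array(["very small", "small", "regular", "big", "very big"])

codes = np.digitize(height, edges)  # integer bin index 0..4 for each sample
height_binned = labels[codes]       # the same bins as readable categories

counts = np.bincount(codes, minlength=len(labels))
print(dict(zip(labels.tolist(), counts.tolist())))
```

Each bin ends up with roughly 20% of the training samples. New data should be binned with the same `edges` rather than with percentiles recomputed on the new data.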
Hope I'm not too late. I recently wrote a library called Mixed Naive Bayes, written in NumPy. It can assume a mix of Gaussian and categorical (multinoulli) distributions on the training data features.
https://github.com/remykarem/mixed-naive-bayes
The library is written such that the APIs are similar to scikit-learn's.
In the example below, let's assume that the first 2 features are from a categorical distribution and the last 2 are Gaussian. When constructing MixedNB, just specify categorical_features=[0,1], indicating that columns 0 and 1 follow a categorical distribution.
from mixed_naive_bayes import MixedNB
X = [[0, 0, 180.9, 75.0],
[1, 1, 165.2, 61.5],
[2, 1, 166.3, 60.3],
[1, 1, 173.0, 68.2],
[0, 2, 178.4, 71.0]]
y = [0, 0, 1, 1, 0]
clf = MixedNB(categorical_features=[0,1])
clf.fit(X,y)
clf.predict(X)
Pip installable via pip install mixed-naive-bayes. More information on the usage in the README.md file. Pull requests are greatly appreciated :)
The simple answer: multiply the results! It's the same.
Naive Bayes is based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features. This means you compute the probability contributed by each feature without conditioning on the others, so the algorithm multiplies the probability from one feature by the probability from the next (and we ignore the denominator, since it is just a normalizer).
So the right answer is:
1. Calculate the probability from the categorical variables.
2. Calculate the probability from the continuous variables.
3. Multiply 1. and 2.
@Yaron's approach needs an extra step (4. below):
1. Calculate the probability from the categorical variables.
2. Calculate the probability from the continuous variables.
3. Multiply 1. and 2.
4. Divide 3. by the sum of the product of 1. and 2. EDIT: What I actually mean is that the denominator should be (probability of the evidence given the hypothesis is yes) + (probability of the evidence given the hypothesis is no) (assuming a binary problem, without loss of generality). Thus, the probabilities of the hypotheses (yes or no) given the evidence would sum to 1.
Step 4. is the normalization step. Take a look at @remykarem's mixed-naive-bayes as an example (lines 268-278):
if self.gaussian_features.size != 0 and self.categorical_features.size != 0:
    finals = t * p * self.priors
elif self.gaussian_features.size != 0:
    finals = t * self.priors
elif self.categorical_features.size != 0:
    finals = p * self.priors

normalised = finals.T/(np.sum(finals, axis=1) + 1e-6)
normalised = np.moveaxis(normalised, [0, 1], [1, 0])

return normalised
The probabilities of the Gaussian and Categorical models (t and p respectively) are multiplied together in line 269 (line 2 in extract above) and then normalized as in 4. in line 275 (fourth line from the bottom in extract above).
For hybrid features, you can check this implementation.
The author has presented mathematical justification in his Quora answer, you might want to check.
You will need the following steps:
1. Calculate the probability from the categorical variables (using the predict_proba method of BernoulliNB).
2. Calculate the probability from the continuous variables (using the predict_proba method of GaussianNB).
3. Multiply 1. and 2.
4. Divide 3. by the prior (either from BernoulliNB or from GaussianNB, since they are the same).
5. Divide 4. by the sum (over the classes) of 4. This is the normalisation step.
It should be easy enough to see how you can add your own prior instead of using those learned from the data.
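The steps listed above can be sketched with scikit-learn as follows. The data are synthetic and the variable names are illustrative. The sketch relies on both models estimating the same empirical class prior from the training labels, which holds when both are fit on the same y with default settings.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
# Synthetic data: three binary features and two continuous ones.
X_cat = (rng.random((n, 3)) < np.where(y[:, None] == 1, 0.7, 0.3)).astype(int)
X_num = rng.normal(loc=1.5 * y[:, None], scale=1.0, size=(n, 2))

bnb = BernoulliNB().fit(X_cat, y)
gnb = GaussianNB().fit(X_num, y)

p_cat = bnb.predict_proba(X_cat)  # step 1: P(class | categorical features)
p_num = gnb.predict_proba(X_num)  # step 2: P(class | continuous features)
prior = gnb.class_prior_          # matches np.exp(bnb.class_log_prior_)

unnormalised = p_cat * p_num / prior  # steps 3 and 4
posterior = unnormalised / unnormalised.sum(axis=1, keepdims=True)  # step 5
y_pred = bnb.classes_[posterior.argmax(axis=1)]
print("training accuracy:", (y_pred == y).mean())
```

Dividing by the prior once is needed because each predict_proba output already contains it, so the plain product would count the prior twice. To use a custom prior, one option is to divide by the learned prior twice and multiply by the new one before normalising.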