Can scikit-learn be used for removal of the features which are highly-correlated while using multiple linear regression?
With regard to the answer posted by @behzad.nouri to "Capturing high multi-collinearity in statsmodels", I have some questions to clear up my confusion.
He tested for high multicollinearity among 5 columns (features) of independent variables, each with 100 rows of data, and found that w[0] is close to zero. Can I conclude that the first column (the first independent variable) should be removed to avoid very high multicollinearity?
For detecting the cause of multicollinearity, you can simply check the correlation matrix (the first two lines in behzad.nouri's answer) to see which variables are highly correlated with each other (look for values close to 1).
Another alternative is to look at variance inflation factors (VIFs). The statsmodels package reports VIF values as well. There is no standard threshold, but VIF values greater than 4 are often considered problematic (5 and 10 are also commonly used cutoffs).
import numpy as np
import statsmodels.stats.outliers_influence as oi
mean = [0, 0, 0]
cov = [[100, 90, 5], [90, 95, 10], [5, 10, 30]]
x, y, z = np.random.multivariate_normal(mean, cov, 1000).T
print(np.corrcoef([x, y, z]))
In the above code I've created three random variables x, y, and z. The covariance between x and y is high, so if you print out the correlation matrix you will see that the correlation between these two variables is very high as well (0.931).
array([[ 1.        ,  0.93109838,  0.1051695 ],
       [ 0.93109838,  1.        ,  0.18838079],
       [ 0.1051695 ,  0.18838079,  1.        ]])
At this phase you can discard either x or y as the correlation between them is very high and using only one of them would be enough to explain most of the variation.
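The pruning step described above can be sketched with NumPy alone. The helper name `drop_highly_correlated` and the 0.8 cutoff are illustrative assumptions, not functions or defaults from scikit-learn or statsmodels. The helper keeps a column only if its absolute correlation with every already-kept column is at most the cutoff.

```python
import numpy as np

def drop_highly_correlated(X, threshold=0.8):
    """Return indices of columns to keep (hypothetical helper, greedy left to right)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(corr.shape[0]):
        # Keep column j only if it is not too correlated with any kept column.
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(0)
mean = [0, 0, 0]
cov = [[100, 90, 5], [90, 95, 10], [5, 10, 30]]
X = rng.multivariate_normal(mean, cov, size=1000)  # columns play the role of x, y, z

kept = drop_highly_correlated(X, threshold=0.8)
X_reduced = X[:, kept]
print(kept)
```

Because the loop is greedy, the column that appears first (x here) is the one that survives; reorder the columns if you would rather keep y.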
You can check the VIF values as well:
exog = np.array([x,y,z]).transpose()
vif0 = oi.variance_inflation_factor(exog, 0)
If you print out vif0 it will give you 7.21 for the first variable, which is a high number and indicative of high multicollinearity of the first variable with other variables.
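To get the VIF of every column at once, one option is the identity VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of the inverse correlation matrix. The sketch below uses only NumPy and synthetic data drawn with an arbitrary seed, so the exact numbers are assumptions of this example. It corresponds to auxiliary regressions that include an intercept, so values may differ slightly from a routine that is handed raw columns without a constant.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = [[100, 90, 5], [90, 95, 10], [5, 10, 30]]
exog = rng.multivariate_normal([0, 0, 0], cov, size=1000)

# Diagonal of the inverse correlation matrix gives one VIF per column.
corr = np.corrcoef(exog, rowvar=False)
vifs = np.diag(np.linalg.inv(corr))

for name, vif in zip("xyz", vifs):
    print(f"VIF({name}) = {vif:.2f}")
```

Both x and y should come out with a high VIF, while z stays close to 1.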
Which one to exclude from the analysis (x or y) is up to you. You can check their standardized regression coefficients to see which one has a higher impact. You can also use techniques like ridge regression or lasso if you have a multicollinearity problem. If you want to go deeper, I would suggest asking on CrossValidated instead.
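As a rough illustration of the ridge option in scikit-learn: the response `y` below is invented for this example, and `alpha=100.0` is an arbitrary assumption that would normally be tuned (for instance by cross-validation). The sketch only shows that the L2 penalty shrinks the coefficient vector relative to ordinary least squares.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
cov = [[100, 90, 5], [90, 95, 10], [5, 10, 30]]
X = rng.multivariate_normal([0, 0, 0], cov, size=1000)

# Hypothetical response, made up purely for illustration.
y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=5.0, size=1000)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=100.0).fit(X, y)  # alpha is an arbitrary choice here

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```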
Related
I'm running a chi-square test on some categorical values pertaining to race, and whether different racial groups participated in a clinic. As there are about a dozen different races in this data, I bucketed them down to 'White', 'Black' and 'Other', just for the purposes of testing (as the correlations indicated most of the activity occurring between 'White' and 'Black'). However, using Python's .chi2_contingency() method, I'm getting results back that seem unusual. The table is below:
Appointment Status       No      Yes
Black                  9170    33372
White                 15137   152307
Other                  8864    56165
The Python method returns the following:
X^2: 5207.16
p-value: 0.0
df: 2
expected values array: array([[  5131.21350472,  37410.78649528],
       [  7843.48838791,  57185.51161209],
       [ 20196.29810738, 147247.70189262]])
The df is good, but the chi square value and p-value both don't seem right. Is there something anyone can see that I might be doing methodologically that might be producing these values, or might there be something going on behind the scenes in Python that's doing this? Thanks!
The test statistic and p-value are correct (and perhaps also understandable). Let me explain the outcome step by step. The section entitled "Example chi-squared test for categorical data" on Wikipedia (https://en.wikipedia.org/wiki/Chi-squared_test#Example_chi-squared_test_for_categorical_data) might help as well.
The expected count is the number of observations that would end up in a given cell of the table if we assumed independence. The fractions of Black and No are 0.15468974 and 0.12061524, respectively. Under independence, we expect 0.15468974 x 0.12061524 x 275015 = 5131.21350472 observations in the sample to be labeled as Black and No (note: 275015 is the total number of observations).
All other expected counts are calculated similarly. Note that the differences between the expected and observed counts (i.e. the numbers in your table) are rather large. This should be a first indication that the null hypothesis of independence might be false.
The test statistic is calculated by computing (Obs-Exp)^2/Exp for each cell and summing over all cells in the table. The result is indeed 5207.162302393083 (see code below). Under the null hypothesis, this test statistic is chi2 distributed with 2 df (as you already mentioned). Compared to this distribution, the value 5207.162302393083 is extremely far out in the tail, making it extremely unlikely to observe this outcome under the null of independence. The p-value is therefore reported as zero: it is so small that it underflows to 0.0 in floating point.
The code posted below replicates all the numbers and plots the PDF of the chi2 distribution with 2 degrees of freedom. I hope this helps.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2
# Data and properties
TrueCounts = np.array( [ [9170,33372],[15137,152307],[8864,56165] ])
Datadimension = TrueCounts.shape
TotalCounts = np.sum(TrueCounts)
print(TotalCounts)
# Fractions
fracAnswer = np.sum(TrueCounts, axis=0)/TotalCounts
fracRace = np.sum(TrueCounts, axis=1)/TotalCounts
# Calculate expected counts
ExpCounts = np.zeros(np.shape(TrueCounts))
for iter1 in range(Datadimension[0]):
    for iter2 in range(Datadimension[1]):
        ExpCounts[iter1, iter2] = fracRace[iter1]*fracAnswer[iter2]*TotalCounts
print('=== Fractions ===')
print(fracAnswer)
print(fracRace)
print('=== True and expected counts ===')
print(TrueCounts)
print(ExpCounts)
print('=== Test summary ===')
TestStat = np.sum( (TrueCounts-ExpCounts)**2/ExpCounts )
print(TestStat)
# Make chi2 plot for comparison
x = np.arange(0, 20, 0.05)
plt.plot(x, chi2.pdf(x, df=2))
plt.show()
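For a quick cross-check, the same calculation can be written without loops and compared against scipy.stats.chi2_contingency. This is only a compact sketch of the steps above; the variable names are chosen for this example.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[9170, 33372], [15137, 152307], [8864, 56165]])

# Expected counts under independence: (row total * column total) / grand total.
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
expected = np.outer(row_totals, col_totals) / observed.sum()

# Pearson chi-square statistic, summed over all cells.
stat = ((observed - expected) ** 2 / expected).sum()

# Cross-check against SciPy's implementation.
stat_sp, p_sp, dof_sp, expected_sp = chi2_contingency(observed)
print(stat, stat_sp, p_sp, dof_sp)
```

The row order of the table does not affect the statistic, which is why the expected-values array in the question can list the groups in a different order.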
Suppose I have the following data:
array([[0.88574245, 0.3749999 , 0.39727183, 0.50534724],
       [0.22034441, 0.81442653, 0.19313024, 0.47479565],
       [0.46585887, 0.68170517, 0.85030437, 0.34167736],
       [0.18960739, 0.25711086, 0.71884116, 0.38754042]])
and I know that this data follows a normal distribution. How do I calculate the AIC?
The formula is
2K - 2log(L)
K is the total number of parameters; for the normal distribution I count 3 (mean, variance and residual). I'm stuck on L: L is supposed to be the maximized likelihood, but I'm not sure what to pass in for data that follows a normal distribution. What about Cauchy or exponential? Thank you.
Update: this question appeared in one of my coding interviews.
For a given normal distribution, the probability density of y is given by:
import scipy.stats
def prob( y = 0, mean = 0, sd = 1 ):
    return scipy.stats.norm( mean, sd ).pdf( y )
For example, given mean = 0 and sd = 1, the density at the value 0 is prob( 0, 0, 1 ).
If we have a set of values 0 to 8, the log likelihood is the sum of the logs of these densities. In this case the best (maximum likelihood) parameters are the mean of x and the standard deviation of x, as in:
import numpy as np
x = range( 9 )
logLik = sum( np.log( prob( x, np.mean( x ), np.std( x ) ) ) )
Then AIC is simply:
K = 2
2*K - 2*( logLik )
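For the normal case the maximized log likelihood also has a closed form, which is a handy check on the numeric sum. The derivation below assumes n i.i.d. observations and the maximum likelihood estimates (sample mean, and variance with divisor n, which is what np.std computes by default):

```latex
\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat\mu)^2

\log \hat{L}
  = -\frac{n}{2}\log\!\left(2\pi\hat\sigma^2\right)
    - \frac{1}{2\hat\sigma^2}\sum_{i=1}^{n}(x_i-\hat\mu)^2
  = -\frac{n}{2}\left[\log\!\left(2\pi\hat\sigma^2\right) + 1\right]

\mathrm{AIC} = 2K - 2\log\hat{L}
  = 2K + n\left[\log\!\left(2\pi\hat\sigma^2\right) + 1\right],
  \qquad K = 2
```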
For the data you provide, I am not sure what the four rows and four columns reflect. Do you have to calculate a separate mean and standard deviation for each column? It's not very clear.
Hopefully the above can get you started.
I think the interview question leaves out some stuff, but maybe part of the point is to see how you handle that.
Anyway, AIC is essentially a penalized log likelihood calculation. Log likelihood is great -- the greater the log likelihood, the better the model fits the data. However, if you have enough free parameters, you can always make the log likelihood greater. Hmm. So various penalty terms, which counter the effect of more free parameters, have been proposed. AIC (Akaike Information Criterion) is one of them.
So the problem, as it is stated, is (1) find the log likelihood for each of the three models given (normal, exponential, and Cauchy), (2) count up the free parameters for each, and (3) calculate AIC from (1) and (2).
Now for (1) you need (1a) to look up or derive the maximum likelihood estimator for each model. For normal, it's just the sample mean and sample variance. I don't remember the others, but you can look them up, or work them out. Then (1b) you need to apply the estimators to the given data, and then (1c) calculate the likelihood, or equivalently, the log likelihood of the estimated parameters for the given data. The log likelihood of any parameter value is just sum(log(p(x|params))) where params = parameters as estimated by maximum likelihood.
As for (2), there are 2 parameters for a normal distribution, mu and sigma^2. For an exponential, there's 1 (it might be called lambda or theta or something). For a Cauchy, there might be a scale parameter and a location parameter. Or, maybe there are no free parameters (centered at zero and scale = 1). So in each case, K = 1 or 2 or maybe K = 0, 1, or 2.
Going back to (1b), the data look a little funny to me. I would expect a one dimensional list, but it seems like the array is two dimensional (with 4 rows and 4 columns if I counted right). One might need to go back and ask about that. If they really mean to have 4 dimensional data, then the conceptual basis remains the same, but the calculations are going to be a little more complex than in the 1-d case.
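Steps (1) to (3) can be sketched with scipy.stats as follows. Two assumptions are made here that the question does not settle: the 16 values are flattened and treated as one i.i.d. sample, and the exponential is fitted with its location fixed at zero (so K = 1), while the Cauchy gets free location and scale (K = 2). The helper `aic` is a name made up for this example.

```python
import numpy as np
from scipy import stats

data = np.array([
    [0.88574245, 0.3749999, 0.39727183, 0.50534724],
    [0.22034441, 0.81442653, 0.19313024, 0.47479565],
    [0.46585887, 0.68170517, 0.85030437, 0.34167736],
    [0.18960739, 0.25711086, 0.71884116, 0.38754042],
])
x = data.ravel()  # assumption: all 16 values form one i.i.d. sample

def aic(dist, params, n_free):
    """AIC = 2K - 2 log L at the fitted parameters (hypothetical helper)."""
    log_lik = np.sum(dist.logpdf(x, *params))
    return 2 * n_free - 2 * log_lik

results = {
    # norm.fit returns the MLE (mean, std with divisor n)
    "normal": aic(stats.norm, stats.norm.fit(x), 2),
    # floc=0 pins the location, leaving only the scale free
    "exponential": aic(stats.expon, stats.expon.fit(x, floc=0), 1),
    # the Cauchy MLE has no closed form; fit() optimises numerically
    "cauchy": aic(stats.cauchy, stats.cauchy.fit(x), 2),
}

for name, value in results.items():
    print(f"{name}: AIC = {value:.3f}")
```

The model with the lowest AIC is the preferred one among the candidates.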
Good luck and have fun, it's a good problem.
Good day everyone. I have the following problem:
I am using a GaussianProcessRegressor object from the sklearn library.
After fitting the model, I want to sample points using predict to get a better idea of what the model looks like so far. However, the model predicts zero everywhere except at the training points.
I reset the alpha value of the regressor from my initial 1e-5 back to the default 1e-10 and n_restarts_optimizer from 9 back to the default of zero; my kernel is a Matern kernel with nearly standard settings. Now I do get non-zero values, but I am not sure how to proceed:
I have the following:
a = df_reduced.values[0:4, :]
print("a[0,0]: ", a[0,0])
gp.predict(a)
Of course this gives me a nice result (since it's the fitting data):
a[0,0]: 150.0
Out[47]:
array([[10.4       ],
       [ 9.3       ],
       [78.39990234],
       [78.39990234]])
Now I slightly alter the first feature of the first sample, staying in its initial vicinity:
a = df_reduced.values[1:4, :]
a[0, 0] = 151
gp.predict(a)
array([[4.85703698e-254],
       [7.83999023e+001],
       [7.83999023e+001]])
and for a[0, 0] = 152:
array([[ 0.        ],
       [78.39990234],
       [78.39990234]])
So it seems that over most of the domain the function is simply zero, which is a problem because I want to use this for Gaussian-process hyperparameter optimization, minimising globally. Would somebody have a lead on how to optimise better?
Btw I am using 16 features, and fitting on 30 samples so far and the output function takes values between 0 and 100.
Parameters are as follows (copy-paste):
matern = C(1.0)*Matern(length_scale=1.0, nu=2.5)
gp = GaussianProcessRegressor(kernel=matern)
gp.fit(df_reduced.values, Y) # df_reduced.values, because meanwhile we have overwritten X_reduced
Thanks already for any lead,
Best regards,
robTheBob86
I have a machine learning problem that I'm trying to solve. I'm using a Gaussian HMM (from hmmlearn) with 5 states, modelling extreme negative, negative, neutral, positive and extreme positive in the sequence. I have set up the model in the gist below
https://gist.github.com/stevenwong/cb539efb3f5a84c8d721378940fa6c4c
import numpy as np
import pandas as pd
from hmmlearn.hmm import GaussianHMM
x = pd.read_csv('data.csv')
x = np.atleast_2d(x.values)
h = GaussianHMM(n_components=5, n_iter=10, verbose=True, covariance_type="full")
h = h.fit(x)
y = h.predict(x)
The problem is that most of the estimated states converge to the middle, even though I can clearly see that there are spates of positive values and spates of negative values; they are all lumped together. Any idea how I can get it to better fit the data?
EDIT 1:
Here is the transition matrix. I believe the way it's read in hmmlearn is across the row (i.e., row[0] means prob of transiting to itself, state 1, 2, 3...)
In [3]: h.transmat_
Out[3]:
array([[ 0.19077231,  0.11117929,  0.24660208,  0.20051377,  0.25093255],
       [ 0.12289066,  0.17658589,  0.24874935,  0.24655888,  0.20521522],
       [ 0.15713787,  0.13912972,  0.25004413,  0.22287976,  0.23080852],
       [ 0.14199694,  0.15423031,  0.25024992,  0.2332739 ,  0.22024893],
       [ 0.17321093,  0.12500688,  0.24880728,  0.21205912,  0.2409158 ]])
If I set all the transition probs to 0.2, it looks like this (if I do average by state the separation is worse).
Apparently, your model learned a large variance for state 2. GMM is a generative model trained with a maximum likelihood criterion, so in some sense you got the optimal fit to the data. I can see it provides meaningful predictions in extreme cases, so if you want it to attribute more observations to classes other than 2, I would try the following:
1. Data preprocessing. Try to use log values for your input to make the difference between them sharper.
2. Look at your transition matrix; maybe the transition probabilities from state 2 are too low. Try setting all probabilities to be equal and see what happens.
I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Amongst others, I want to use the Naive Bayes classifier, but my problem is that I have a mix of categorical data (e.g. "Registered online", "Accepts email notifications") and continuous data (e.g. "Age", "Length of membership"). I haven't used scikit-learn much before, but I suppose that Gaussian Naive Bayes is suitable for continuous data and that Bernoulli Naive Bayes can be used for categorical data. However, since I want to have both categorical and continuous data in my model, I don't really know how to handle this. Any ideas would be much appreciated!
You have at least two options:
1. Transform all your data into a categorical representation by computing percentiles for each continuous variable and then binning the continuous variables using the percentiles as bin boundaries. For instance, for the height of a person create the following bins: "very small", "small", "regular", "big", "very big", ensuring that each bin contains approximately 20% of the population of your training set. We don't have any utility to perform this automatically in scikit-learn, but it should not be too complicated to do it yourself. Then fit a single multinomial NB on this categorical representation of your data.
2. Independently fit a Gaussian NB model on the continuous part of the data and a multinomial NB model on the categorical part. Then transform the whole dataset by taking the class assignment probabilities (with the predict_proba method) as new features: np.hstack((multinomial_probas, gaussian_probas)), and then refit a new model (e.g. a new Gaussian NB) on the new features.
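The percentile binning in the first option could look roughly like this with NumPy. The `height` values are synthetic and the bin labels are taken from the example above; in practice the bin edges would be computed on the training set only and then reused for new data. More recent scikit-learn releases also ship sklearn.preprocessing.KBinsDiscretizer (with strategy="quantile") and sklearn.naive_bayes.CategoricalNB, which may be worth a look.

```python
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(loc=170.0, scale=10.0, size=1000)  # synthetic training data

# Interior bin boundaries at the 20th, 40th, 60th and 80th percentiles.
edges = np.percentile(height, [20, 40, 60, 80])
labels = np.array(["very small", "small", "regular", "big", "very big"])

# np.digitize maps each value to the index (0..4) of the bin it falls into.
bin_index = np.digitize(height, edges)
height_binned = labels[bin_index]

counts = np.bincount(bin_index, minlength=5)
print(counts)  # each bin should hold roughly 20% of the samples
```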
Hope I'm not too late. I recently wrote a library called Mixed Naive Bayes, written in NumPy. It can assume a mix of Gaussian and categorical (multinoulli) distributions on the training data features.
https://github.com/remykarem/mixed-naive-bayes
The library is written such that the APIs are similar to scikit-learn's.
In the example below, let's assume that the first 2 features are from a categorical distribution and the last 2 are Gaussian. When constructing MixedNB, just specify categorical_features=[0,1], indicating that columns 0 and 1 are to follow a categorical distribution.
from mixed_naive_bayes import MixedNB
X = [[0, 0, 180.9, 75.0],
[1, 1, 165.2, 61.5],
[2, 1, 166.3, 60.3],
[1, 1, 173.0, 68.2],
[0, 2, 178.4, 71.0]]
y = [0, 0, 1, 1, 0]
clf = MixedNB(categorical_features=[0,1])
clf.fit(X,y)
clf.predict(X)
Pip installable via pip install mixed-naive-bayes. More information on the usage in the README.md file. Pull requests are greatly appreciated :)
The simple answer: multiply the results! It's the same.
Naive Bayes is based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features given the class. This means you calculate the probability for a specific feature without regard to the others, so the algorithm multiplies the probability from one feature by the probability from the next (and we ignore the denominator, since it is just a normalizer).
So the right answer is:
1. Calculate the probability from the categorical variables.
2. Calculate the probability from the continuous variables.
3. Multiply 1. and 2.
@Yaron's approach needs an extra step (4. below):
1. Calculate the probability from the categorical variables.
2. Calculate the probability from the continuous variables.
3. Multiply 1. and 2.
4. Divide 3. by the sum of the product of 1. and 2. EDIT: What I actually mean is that the denominator should be (probability of the evidence given the hypothesis is yes) + (probability of the evidence given the hypothesis is no) (assuming a binary problem, without loss of generality). Thus, the probabilities of the hypotheses (yes or no) given the evidence would sum to 1.
Step 4. is the normalization step. Take a look at @remykarem's mixed-naive-bayes as an example (lines 268-278):
if self.gaussian_features.size != 0 and self.categorical_features.size != 0:
    finals = t * p * self.priors
elif self.gaussian_features.size != 0:
    finals = t * self.priors
elif self.categorical_features.size != 0:
    finals = p * self.priors

normalised = finals.T/(np.sum(finals, axis=1) + 1e-6)
normalised = np.moveaxis(normalised, [0, 1], [1, 0])

return normalised
The probabilities of the Gaussian and Categorical models (t and p respectively) are multiplied together in line 269 (line 2 in extract above) and then normalized as in 4. in line 275 (fourth line from the bottom in extract above).
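The multiply-then-normalise idea can be checked on a tiny hand-made example. All numbers below are invented for illustration: two samples, two classes, with per-class likelihoods for the categorical part and the continuous part.

```python
import numpy as np

priors = np.array([0.5, 0.5])  # P(class), assumed equal here

# Rows are samples, columns are classes (made-up likelihood values).
p_categorical = np.array([[0.2, 0.6], [0.7, 0.1]])  # P(categorical features | class)
p_gaussian = np.array([[0.5, 0.1], [0.3, 0.3]])     # p(continuous features | class)

# Steps 1-3: multiply the likelihoods and the prior.
unnormalised = p_categorical * p_gaussian * priors

# Step 4: divide by the sum over classes so each row sums to 1.
posterior = unnormalised / unnormalised.sum(axis=1, keepdims=True)
print(posterior)
```

For the first sample the unnormalised scores are 0.05 and 0.03, which normalise to 0.625 and 0.375.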
For hybrid features, you can check this implementation.
The author has presented mathematical justification in his Quora answer, you might want to check.
You will need the following steps:
1. Calculate the probability from the categorical variables (using the predict_proba method of BernoulliNB).
2. Calculate the probability from the continuous variables (using the predict_proba method of GaussianNB).
3. Multiply 1. and 2.
4. Divide 3. by the prior (either from BernoulliNB or from GaussianNB, since they are the same).
5. Divide 4. by the sum (over the classes) of 4. This is the normalisation step.
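These steps can be sketched with scikit-learn as follows. The data are synthetic and generated only for this example. The division by the prior is needed because each predict_proba output already contains the class prior once, so their product would otherwise count it twice.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Synthetic data: 2 binary features and 2 continuous features, binary target.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X_bin = (rng.random((n, 2)) < np.where(y[:, None] == 1, 0.7, 0.3)).astype(int)
X_cont = rng.normal(loc=y[:, None] * 1.5, scale=1.0, size=(n, 2))

bnb = BernoulliNB().fit(X_bin, y)
gnb = GaussianNB().fit(X_cont, y)

p_cat = bnb.predict_proba(X_bin)     # step 1
p_cont = gnb.predict_proba(X_cont)   # step 2
prior = gnb.class_prior_             # empirical class frequencies

joint = p_cat * p_cont / prior       # steps 3 and 4
posterior = joint / joint.sum(axis=1, keepdims=True)  # step 5

y_pred = bnb.classes_[posterior.argmax(axis=1)]
print("training accuracy:", (y_pred == y).mean())
```

To plug in your own prior, replace `prior` in the division and multiply the result by the prior you want before normalising.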
It should be easy enough to see how you can add your own prior instead of using those learned from the data.