I'm using a logistic regression to estimate the probability of scoring a goal in soccer/football. I've got 5 features. My target values are 1 (goal) or 0 (no goal).
As is commonly recommended, I've scaled my features before fitting my model. I've used the MinMaxScaler, which scales all features to the range [0, 1] as follows:
X_scaled = (x - x_min)/(x_max - x_min)
The coefficients of my logistic regression model are the following:
coef = [[-2.26286643 4.05722387 0.74869811 0.20538172 -0.49969841]]
My first thought is that the second feature is the most important, followed by the first. Is this always true?
I read that "In other words, for a one-unit increase in the 'the second feature', the expected change in log odds is 4.05722387." on this site, but there, their features were normalized with a mean of 50 and some std deviation.
If I do not scale my features, the coefficients of the model are the following:
coef = [[-0.04743728 0.04394143 -0.00247654 0.23769469 -0.55051824]]
And now it seems that the first feature is more important than the second one. I read in the literature about my topic that this is indeed true. So this confuses me, of course.
My questions are:
Which of my features is the most important and what/why is the best methodology to find it?
How can I interpret the meaning of the scaled coefficients? E.g. what does an increase of 1 meter in feature 1 mean? Can I throw 1 meter into the MinMaxScaler, see what comes out and use that as 'the one-unit increase'?
Is it true that the final probability will be computed as y = 1/(1 + exp(-fx)) with fx = intercept + feature1*coef1 + feature2*coef2 + ... (with all features scaled)?
Which of my features is the most important and what/why is the best methodology to find it?
Look at several versions of marginal effects calculations. For example, see the overview/discussion in a blog, Stata's examples, and the resources for R.
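One common version is the average marginal effect: for logistic regression the derivative of the predicted probability with respect to feature j is p*(1-p)*coef_j, averaged over the sample. The following is a minimal sketch of that idea, not a full marginal-effects package; the data are synthetic and the coefficients used to generate them are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical): three features, the third has no real effect.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_logit = 0.5 + 1.5 * X[:, 0] - 0.5 * X[:, 1]
y = (rng.random(1000) < 1 / (1 + np.exp(-true_logit))).astype(int)

model = LogisticRegression().fit(X, y)
p = model.predict_proba(X)[:, 1]

# dP/dx_j = p * (1 - p) * coef_j for each observation; average over the sample.
ame = np.mean(p * (1 - p)) * model.coef_[0]
print("average marginal effects:", ame)
```

Each entry is the average change in predicted probability per one-unit change of that feature, so the unit of the feature (raw or scaled) still matters when comparing entries.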
How can I interpret the meaning of the scaled coefficients? E.g. what does an increase of 1 meter in feature 1 mean? Can I throw 1 meter into the MinMaxScaler, see what comes out and use that as 'the one-unit increase'?
The interpretation depends on which marginal effects you calculate. You just need to account for scaling when you talk about one unit of X increasing/decreasing the change in probability or odds ratio etc.
Is it true that the final probability will be computed as y = 1/(1 + exp(-fx)) with fx = intercept + feature1*coef1 + feature2*coef2 + ... (with all features scaled)?
Yes, it's just that the features x are measured in scaled units.
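To make the scaling concrete, here is a minimal sketch. It assumes a hypothetical two-feature shot dataset (distance in meters, angle in degrees) generated at random. It checks the probability formula by hand and converts the coefficients fitted on MinMax-scaled features back to "per original unit" coefficients by dividing by (x_max - x_min).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Hypothetical shot data: distance to goal (m) and shot angle (degrees).
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(1, 40, 500), rng.uniform(5, 90, 500)])
true_logit = 1.0 - 0.15 * X[:, 0] + 0.03 * X[:, 1]
y = (rng.random(500) < 1 / (1 + np.exp(-true_logit))).astype(int)

scaler = MinMaxScaler().fit(X)
X_scaled = scaler.transform(X)
model = LogisticRegression().fit(X_scaled, y)

# Probability by hand: sigmoid(intercept + sum(coef * scaled feature)).
fx = model.intercept_[0] + X_scaled @ model.coef_[0]
p_manual = 1 / (1 + np.exp(-fx))

# One scaled unit spans the whole range, so a one-original-unit increase
# changes the log odds by coef / (x_max - x_min).
span = scaler.data_range_
coef_per_unit = model.coef_[0] / span
intercept_raw = model.intercept_[0] - np.sum(model.coef_[0] * scaler.data_min_ / span)
p_raw = 1 / (1 + np.exp(-(intercept_raw + X @ coef_per_unit)))

print("log-odds change per scaled unit:", model.coef_[0])
print("log-odds change per original unit:", coef_per_unit)
```

Note that passing the value 1 through the scaler returns (1 - x_min)/(x_max - x_min), which is a position on the scaled axis, not the size of a one-meter step; the step size is 1/(x_max - x_min).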
I am using SGDRegressor with a constant learning rate and default loss function. I am curious to know how changing the alpha parameter in the function from 0.0001 to 100 will change regressor behavior. Below is the sample code I have:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor

# Note: a, b, phi and the abline() plotting helper are not defined in this snippet.
out = [(0, 2), (21, 13), (-23, -15), (22, 14), (23, 14)]  # outliers, added one at a time
alpha = [0.0001, 1, 100]
N = len(out)
plt.figure(figsize=(20, 15))
j = 1
for i in alpha:
    # For every alpha we want to start again from the original dataset
    X = b * np.sin(phi)
    Y = a * np.cos(phi)
    for num in range(N):
        plt.subplot(3, N, j)
        X = np.append(X, out[num][0])  # append outlier to main X
        Y = np.append(Y, out[num][1])  # append outlier to main Y
        j = j + 1  # move on to the next plot
        model = SGDRegressor(alpha=i, eta0=0.001, learning_rate='constant', random_state=0)
        model.fit(X.reshape(-1, 1), Y)  # fit the model
        plt.scatter(X, Y)
        plt.title("alpha = " + str(i) + " | " + "Slope :" + str(round(model.coef_[0], 4)))
        abline(model.coef_[0], model.intercept_)  # plot the fitted line
plt.show()
As shown above, I have the main dataset of X and Y, and in each iteration I add one point as an outlier to the main dataset, train the model and plot the regression line (hyperplane). Below you can see the result for different values of alpha:
I am looking at the results and am still confused; I can't draw a solid conclusion about how the alpha parameter changes the model. What is the effect of alpha? Is it causing overfitting? Underfitting?
From scikit-learn:
alpha : float, default=0.0001
Constant that multiplies the regularization term. The higher the value, the stronger the regularization. Also used to compute the learning rate when learning_rate is set to 'optimal'.
As for regularization, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting. If there is noise (not "true" data) in the training data, then the model's estimated coefficients won’t generalize well to the future (test) data. This is where regularization comes in and shrinks or regularizes these learned estimates towards zero.
From Towards Data Science (paraphrased):
A standard least squares model tends to have some variance in it, i.e. this model won’t generalize well for a data set different from its training data. Regularization significantly reduces the variance of the model, without a substantial increase in its bias. The tuning parameter alpha controls the impact on bias and variance. As the value of alpha rises, it reduces the value of the coefficients, thus reducing the variance. Up to a point, this increase in alpha is beneficial, as it only reduces the variance (hence avoiding overfitting) without losing any important properties in the data. But beyond a certain value, the model starts losing important properties, giving rise to bias in the model and thus underfitting.
In your example, comparing the rows of the third column highlights this effect (slope).
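The shrinkage can also be seen without plots. The sketch below is an illustration on synthetic data (a made-up line y = 2x plus noise, not the dataset from the question) and reuses the same SGDRegressor settings; only alpha changes between fits.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic data (hypothetical): true slope is 2.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.1, size=200)

alphas = [0.0001, 1, 100]
slopes = []
for alpha in alphas:
    model = SGDRegressor(alpha=alpha, eta0=0.001,
                         learning_rate="constant", random_state=0)
    model.fit(x.reshape(-1, 1), y)
    slopes.append(model.coef_[0])
    print(f"alpha={alpha}: slope={model.coef_[0]:.4f}")
```

With a tiny alpha the fitted slope stays near the true one; as alpha grows, the penalty pulls the slope toward zero, so a very large alpha flattens the line and underfits rather than overfits.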
My dataset has feature columns and a target label of 0 and 1.
When I use SVM classifier for binary classification it predicts well.
But my question is: how is the prediction made mathematically?
The marginal hyperplanes H1 and H2 have the equations:
W^T X + b >= 1 and W^T X + b <= -1
meaning that if the value is at least +1 the point falls in one class, and if it is at most -1 it falls in the other class.
But we have given the target labels 0 and 1.
How actually is it done mathematically?
Anyone expert please.....
Basically, SVM wants to find the optimal hyperplane that splits the datapoints in such a way that the margin between the closest datapoints of each class (the so-called support vectors) is maximized. This all breaks down to the following Lagrangian optimization problem:
w: the vector that determines the optimum hyperplane (for intuition, make yourself familiar with the geometrical meaning of a dot product)
(w^T∙x_i+b): a scalar that is proportional to the signed geometrical distance between a single datapoint x_i and the maximum margin hyperplane (divide by the norm of w to get the actual distance)
b: the bias term, a scalar offset (I think it comes from an indefinite integral in the derivation of SVM); more on that you can find here: Stanford University, Computer Science, Lecture 3, SVM
λ_i: the Lagrangian multiplier
y_i: the class label of datapoint x_i, normalized to -1 or +1
Solving the optimization problem leads to all necessary parameters of w, b, and lambda.
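For reference, the standard hard-margin SVM Lagrangian, written with the symbols listed above, is usually given in the form below. That this matches the answer's own formula term for term is an assumption, since the formula itself is not reproduced in the text.

```latex
L(w, b, \lambda) = \frac{1}{2}\lVert w \rVert^{2} - \sum_{i} \lambda_i \left[ y_i \left( w^{T} x_i + b \right) - 1 \right], \qquad \lambda_i \ge 0
```

It is minimized with respect to w and b and maximized with respect to the multipliers λ_i; only the support vectors end up with λ_i > 0.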
To answer your question in one sentence: the class boundaries [-1, 1] are set arbitrarily. It is really just a definition.
The labels of your binary data [0; 1] (so-called dummy variables) have nothing to do with the boundaries. It is just a convenient way to label binary data. The labels are only needed to link the features to their corresponding class or category. Internally, the solver simply maps your two labels to -1 and +1 (the y_i above) and maps the sign of w^T∙x+b back to your labels when predicting.
The only non-parameter in Formula (8) is x_i, your datapoint in feature space.
At least that's how I understand SVM. Feel free to correct me if I am wrong or imprecise.
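A quick way to convince yourself is to fit the same linear SVM twice, once with labels 0/1 and once with labels -1/+1. This is a minimal sketch on six made-up, linearly separable points using scikit-learn's SVC; the large C is chosen here only to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Six hypothetical, linearly separable points.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y01 = np.array([0, 0, 0, 1, 1, 1])
y_pm = np.where(y01 == 1, 1, -1)  # same classes, relabelled as -1 / +1

clf01 = SVC(kernel="linear", C=1e3).fit(X, y01)
clf_pm = SVC(kernel="linear", C=1e3).fit(X, y_pm)

# decision_function returns w^T x + b; its sign decides the class.
scores = clf01.decision_function(X)
pred_from_sign = (scores > 0).astype(int)

print("w:", clf01.coef_[0], "b:", clf01.intercept_[0])
print("scores:", scores)
print("predictions:", clf01.predict(X))
```

Both fits give the same w and b, and a positive score is reported as the larger of the two labels (1 here), a negative score as the smaller one (0).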
So I'm having a hard time conceptualizing how to write a mathematical representation of my solution for a simple logistic regression problem. I understand what is happening conceptually and have implemented it, but I am answering a question which asks for a final solution.
Say I have a simple two-column dataset denoting something like the likelihood of getting a promotion per year worked, so the likelihood would increase as the person accumulates experience. X denotes the year and Y is a binary indicator for receiving a promotion:
X | Y
1 | 0
2 | 1
3 | 0
4 | 1
5 | 1
6 | 1
I implement logistic regression to find the probability per year worked of receiving a promotion, and get an output set of probabilities that seem correct.
I get an output weight vector that has two items, which makes sense as there are only two inputs: the number of years X and, because I fit an intercept to handle bias, an added column of 1s. So one weight for years, one for bias.
So I have two questions about this.
Since it is easy to get an equation of the form y = mx + b as a decision boundary for something like linear regression or a PLA, how can I similarly denote a mathematical solution with the weights of the logistic regression model? Say I have a weights vector [0.9, -0.34], how can I convert this into an equation?
Secondly, I am performing gradient descent, which returns a gradient, and I multiply that by my learning rate. Am I supposed to update the weights at every epoch? My gradient never returns zeros in this case, so I am always updating.
Thank you for your time.
Logistic regression tries to map the input value (x = years) to the output value (y = likelihood) through this relationship:
y = 1 / (1 + exp(-(theta*x + b)))
where theta and b are the weights you are trying to find.
The decision boundary is then defined by L(x) > p or L(x) < p, where L(x) is the right-hand side of the equation above and p is the chosen probability threshold. That is the relationship you want.
You can transform it to a linear form like that of linear regression by moving the exponential to the numerator and taking the log of both sides: log(y / (1 - y)) = theta*x + b.
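As a worked example with the weights from the question, here is a small sketch. It assumes that the vector [0.9, -0.34] is ordered as [weight for years, bias]; the question does not say which entry is which, so swap them if your implementation puts the bias column first. The threshold of 0.5 is also an assumption.

```python
import math

weights = [0.9, -0.34]
theta, b = weights  # ASSUMPTION: order is [weight for years, bias]

def promotion_probability(years):
    """Logistic model: y = 1 / (1 + exp(-(theta * years + b)))."""
    return 1.0 / (1.0 + math.exp(-(theta * years + b)))

# Decision boundary for threshold p: log(p / (1 - p)) = theta * x + b.
threshold = 0.5  # assumed cut-off
boundary_years = (math.log(threshold / (1 - threshold)) - b) / theta

for years in range(1, 7):
    print(years, round(promotion_probability(years), 3))
print("boundary at x =", boundary_years)
```

So the "equation" is the sigmoid above, and in one dimension the decision boundary is just the single value of x at which the log odds cross the threshold.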
I have a Logistic Regression model. There are around 10 features, 3 of which are highly correlated (let's call them x_5, x_6, x_7). In fact x_5 + x_6 = x_7. But they are all important in the business sense.
I did a log transformation on the data, and since there are quite a number of zeros, I also added 1 to all data. That means:
1) x_5 + x_6 = x_7
2) I did log(1 + x_5), log(1 + x_6) and log(1 + x_7) (and also other features)
I then fit a Logistic Regression in different cases and checked the coefficients (let's call them beta_5, beta_6, beta_7 for x_5, x_6, x_7 respectively). The cases are summarized below. (Zero means I omitted the variable, i.e. in case 2 I omitted x_7.)
There are some things that I find confusing.
1) The signs of beta_5 and beta_6 change from case 1 to case 2. I understand this is because of the multicollinearity issue. But does it affect the predictive power of my logistic model?
2) The value of beta_7 drops quite significantly from case 1 to case 3. Does case 3 explain the importance of x_7 better?
3) Based on these findings, which case should I use? Or how should I make the decision?
Thanks for your help!
Since you have the governing equation x_5 + x_6 = x_7, you may drop one of them from the beginning.
To be more confident about the final choice, you could apply L1 (Lasso) regularization to see which feature(s) can be removed.
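A minimal sketch of the Lasso idea follows, on synthetic count data built so that x_5 + x_6 = x_7 before the log(1 + x) transform. The data, the generating coefficients and the penalty strength C=0.05 are all made up for illustration; in practice C would be tuned, for example by cross-validation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic counts (hypothetical) with the exact relation x5 + x6 = x7.
rng = np.random.default_rng(0)
n = 2000
x5 = rng.poisson(3.0, n)
x6 = rng.poisson(2.0, n)
x7 = x5 + x6
true_logit = -2.0 + 0.4 * x7
y = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

# Same preprocessing as in the question: log(1 + x), then standardize
# so that the L1 penalty treats all features on the same scale.
features = np.log1p(np.column_stack([x5, x6, x7]))
Z = StandardScaler().fit_transform(features)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(Z, y)
coef = lasso.coef_[0]
for name, value in zip(["x_5", "x_6", "x_7"], coef):
    print(name, round(value, 4))
```

Features whose coefficient is driven to exactly zero are candidates for removal. Which of several correlated features survives can depend on the penalty strength and the sample, so it is worth checking the result across a few values of C.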
I am working on a classification problem. I have around 1000 features and the target variable has 2 classes. All 1000 features have values 1 or 0. I am trying to find the feature importance, but my feature importance values vary from 0.0 to 0.003. I am not sure if such low values are meaningful.
Is there a way I can increase the feature importance?
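from sklearn.ensemble import RandomForestClassifier

# Variable importance
rf = RandomForestClassifier(min_samples_split=10, random_state=1)
rf.fit(X, Y)
print("Features sorted by their score:")
# Iterating over X yields the column names when X is a pandas DataFrame
a = sorted(zip(map(lambda x: round(x, 3), rf.feature_importances_), X), reverse=True)
print(a)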
I would really appreciate any help! Thanks
Since you only have two target classes, you can perform an unequal-variance t-test (Welch's t-test), which I have found useful for finding important features in binary classification tasks when all other feature ranking methods have failed me. You can implement this using the scipy.stats.ttest_ind function. It is a statistical test that checks whether the means of two groups differ. If the returned p-value is less than 0.05, the two groups can be assumed to come from different distributions. To implement it for each feature, follow these steps:
Extract all values of the predictor for class 1 and class 2 respectively.
Run ttest_ind on these two samples, specifying that their variances are not assumed to be equal (equal_var=False), and make sure it's a two-tailed t-test (the default).
If the p-value is less than 0.05, this feature is important.
Alternatively, you can do this for all your features and use the p-value as the measure of feature importance. The lower the p-value, the higher the importance of a feature.
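The steps above can be sketched as follows. The data are synthetic (five made-up binary features, of which only the first is constructed to depend on the class), so the numbers are illustrative only.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic binary data (hypothetical): 5 features, only feature 0 is informative.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)
X = rng.integers(0, 2, (n, 5))
agree = rng.random(n) < 0.9
X[:, 0] = np.where(agree, y, 1 - y)  # feature 0 matches the label 90% of the time

# Welch's t-test per feature (column), two-sided by default.
result = ttest_ind(X[y == 1], X[y == 0], axis=0, equal_var=False)
pvalues = result.pvalue

ranking = np.argsort(pvalues)  # most important (lowest p-value) first
for j in ranking:
    print(f"feature {j}: p = {pvalues[j]:.3g}")
```

With around 1000 features, roughly 5% of uninformative features would be expected to fall below a 0.05 cut-off purely by chance, so ranking by p-value or applying a multiple-testing correction is safer than using the raw threshold on its own.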
Cheers!