During implementing a linear regression model on a bag of words, python returned very large/low values. train_data_features contains all words, which are in the training data. The training data contains about 400 comments of each less than 500 characters with a ranking between 0 and 5. Afterwards, I created a bag of words for each document. While trying to perform a linear regression on the matrix of all bag of words,
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit(train_data_features, train['dim_hate'])
coef = clf.coef_
words = vectorizer.get_feature_names()
for i in range(len(words)):
print(str(words[i]) + " " + str(coef[i]))
the result seems to be very strange (just an example of 3 from 4000). It shows the factors of the created regression function for the words.
btw -0.297473967075
land 54662731702.0
landesrekord -483965045.253
I'm very confused because the target variable is between 0 and 5, but the factors are so different. Most of them have very high/low numbers and I was expecting only values like the one of btw.
Do you have an idea, why the results are like they are?
It might be that your model is overfitting to the data, since it's trying to exactly match the outputs. You're right to be worried and suspicious, because it means that your model is probably overfitting to your data and will not generalize well to new data. You can try one of two things:
Run LinearRegression(normalize=True) and see if it helps with the coefficients. But it will only be a temporary solution.
Use Ridge regression instead. It is basically doing Linear Regression, except adding a penalty for having coefficients that are too large.
Check for correlated features in your data-set.
You may run into the problem if your features are highly correlated. For example expenses per customer -
jan_expenses, feb_expenses, mar_expenses, Q1_expenses
the Q1 feature is the sum of the jan-mar and therefore your coefficients, when trying to fit, will go 'crazy' as it will struggle to find a line that best describes the monthly features and the Q feature. Try and remove the highly correlated features and re-run.
(btw Ridge regression also solved the problem for me but I was curious as to why this happens so i dug in a bit)
Related
I have a movie review dataset and I want to perform sentiment analysis on it.
I have implemented this using logistic regression. Following are the steps that I took in the process:
Removed stop words and punctuation from each row in the dataset.
Split the data into train, validation and test set.
Created a vocabulary of words from the training set.
Added every word in the vocabulary as a feature. If this word is in the current row, its TF-IDF value is set as the value of the feature, else 0 is set as the value.
Train the model. During training, sigmoid function is used for calculating the hypothesis and cross entropy loss is used for cost function. Then using gradient descent, the weights of the model were updated.
Tune hyperparameters using validation set
Evaluate model using test set
Now, I need to implement the same thing using Naive Bayes and I'm confused as to how to approach this problem. I assume the first 4 steps are going to be the same. But what is the training step when using Naive Bayes? What is the loss function and cost function in this case? And where do I use the Bayes' theorem to calculate the conditional probability? And how do I update the weights and biases?
I've searched a lot of resources on the web and I've mostly only found implementations using sklearn with model.fit and model.predict and I'm having a hard time figuring out the math behind this and how it could be implemented using vanilla python.
In the case of Logistic regression or SVM, the model is trying to predict the hyperplane which best fits the data. And so these models will determine the weights and biases.
Naive Bayes is moreover a probabilistic approach. It completely depends on Bayes' theorem.
There will be NO weights and biases in NB, there will only be CLASS WISE probability values for each of the features (i.e, words in case of text).
To avoid zero probabilities or to handle the case of unseen data (words in case of text), use Laplace Smoothing.
α is called the smoothening factor. And this will be hyperparameter in NB
Use log for numeric stability.
Test example: This movie is great
After removing the stopwords: movie great
From the training data, we already know prob value for the words movie and great both for +ve & -ve class. Refer STEP 2.
Prob of great for +ve class would be greater than the prob of great for -ve class. And for the word movie, prob values could be almost the same. (This highly depends on your training data. Here I am just making an assumption)
positive class prob = P(movie/+ve) * P(great/+ve)
negative class prob = P(movie/-ve) * P(great/-ve)
Compare the class prob values & return the one having high prob value.
P.S
If the number of words in the sentence is large in numbers, then the class value would become very very small. Using log would solve this problem.
If the word great wasn't there in the training set, the class prob value would be 0. So use smoothening factor-α (Laplace smoothing)
Refer sk-learn naive bayes for more detailed info
I have about 8000 features measuring a two level response variable i.e. output can belong to class 1 or 0.
The 8000 features consist of about 3000 features with 0-1 values and about 5000 features (which are basically words from text data and their tfidf scores.
I am building a linear svm model on this to predict my output variable and am getting decent results/ accuracy, recall and precision around 60-70%
I am looking for help with the following:
Standardization: do the 0-1 values need to be standardized? Do tfidf scores need to be standardized even if I use sublinear tdf=true ?
Dimension reduction: I have tried f_classif using SelectPercentile function of sklearn so far. Any other dimension reduction techniques that can be suggested? I have gone through the sklearn dimension reduction url which also talks about chi2 dim reduction but that isn't giving me good results. Can pca be applied if the data is a mix of 0-1 columns and tfidf score columns?
Remove collinearity: How can I remove highly correlated independent variables.
I am fairly new to python and machine learning, so any help would be appreciated.
(edited to include additional questions)
1 - I would centre and scale your variables for a linear model. I don't know if it's strictly necessary for SVMs, but if I recall correctly, spatial based models are better if the variables are in the same ranges. I don't think there's any harm in doing this anyway (vs. unscaled/uncentred). Someone may correct me - I don't do much by way of text analysis.
2 - (original answer) = Could you try applying a randomForest model, then inspecting the importance scores (discarding those with low importance). With so many features I'd worry about memory issues but if your machine can handle it...?
Another good approach here would be to use ridge/lasso logistic regression. This by its very nature is good at identifying (and discarding) redundant variables, and can help with your question 3 (correlated variables).
Appreciate you're new to this, but both these models above are good at getting around correlation / non-significant variables, so you may want to use these on the way to finalising an SVM.
3 - There's no magic bullet that I know of. The above may help. I predominantly use R, and within that there's a package called Boruta which is good for this step. There may be a Python equivalent?
I am working with a dataset of about 400.000 x 250.
I have a problem with the model yielding a very good R^2 score when testing it on the training set, but extremely poorly when used on the test set. Initially, this sounds like overfitting. But the data is split into training/test set at random and the data set i pretty big, so I feel like there has to be something else.
Any suggestions?
Splitting dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['SalePrice'],
axis=1), df.SalePrice, test_size = 0.3)
Sklearn's Linear Regression estimator
from sklearn import linear_model
linReg = linear_model.LinearRegression() # Create linear regression object
linReg.fit(X_train, y_train) # Train the model using the training sets
# Predict from training set
y_train_linreg = linReg.predict(X_train)
# Predict from test set
y_pred_linreg = linReg.predict(X_test)
Metric calculation
from sklearn import metrics
metrics.r2_score(y_train, y_train_linreg)
metrics.r2_score(y_test, y_pred_linreg)
R^2 score when testing on training set: 0,64
R^2 score when testing on testing set: -10^23 (approximatly)
While I agree with Mihai that your problem definitely looks like overfitting, I don't necessarily agree on his answer that neural network would solve your problem; at least, not out of the box. By themselves, neural networks overfit more, not less, than linear models. You need somehow to take care of your data, hardly any model can do that for you. A few options that you might consider (apologies, I cannot be more precise without looking at the dataset):
Easiest thing, use regularization. 400k rows is a lot, but with 250 dimensions you can overfit almost whatever you like. So try replacing LinearRegression by Ridge or Lasso (or Elastic Net or whatever). See http://scikit-learn.org/stable/modules/linear_model.html (Lasso has the advantage of discarding features for you, see next point)
Especially if you want to go outside of linear models (and you probably should), it's advisable to first reduce the dimension of the problem, as I said 250 is a lot. Try using some of the Feature selection techniques here: http://scikit-learn.org/stable/modules/feature_selection.html
Probably most importantly than anything else, you should consider adapting your input data. The very first thing I'd try is, assuming you are really trying to predict a price as your code implies, to replace it by its logarithm, or log(1+x). Otherwise linear regression will try very very hard to fit that single object that was sold for 1 Million $ ignoring everything below $1k. Just as important, check if you have any non-numeric (categorical) columns and keep them only if you need them, in case reducing them to macro-categories: a categorical column with 1000 possible values will increase your problem dimension by 1000, making it an assured overfit. A single column with a unique categorical data for each input (e.g. buyer name) will lead you straight to perfect overfitting.
After all this (cleaning data, reducing dimension via either one of the methods above or just Lasso regression until you get to certainly less than dim 100, possibly less than 20 - and remember that this includes any categorical data!), you should consider non-linear methods to further improve your results - but that's useless until your linear model provides you at least some mildly positive R^2 value on test data. sklearn provides a lot of them: http://scikit-learn.org/stable/modules/kernel_ridge.html is the easiest to use out-of-the-box (also does regularization), but it might be too slow to use in your case (you should first try this, and any of the following, on a subset of your data, say 1000 rows once you've selected only 10 or 20 features and see how slow that is). http://scikit-learn.org/stable/modules/svm.html#regression have many different flavours, but I think all but the linear one would be too slow. Sticking to linear things, http://scikit-learn.org/stable/modules/sgd.html#regression is probably the fastest, and would be how I'd train a linear model on this many samples. Going truly out of linear, the easiest techniques would probably include some kind of trees, either directly http://scikit-learn.org/stable/modules/tree.html#regression (but that's an almost-certain overfit) or, better, using some ensemble technique (random forests http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees are the typical go-to algorithm, gradient boosting http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting sometimes works better). Finally, state-of-the-art results are indeed generally obtained via neural networks, see e.g. http://scikit-learn.org/stable/modules/neural_networks_supervised.html but for these methods sklearn is generally not the right answer and you should take a look at dedicated environments (TensorFlow, Caffe, PyTorch, etc.)... however if you're not familiar with those it is certainly not worth the trouble!
I have trained my model on a data set and i used decision trees to train my model and it has 3 output classes - Yes,Done and No , and I got to know the feature that are most decisive in making a decision by checking feature importance of the classifier. I am using python and sklearn as my ML library. Now that I have found the feature that is most decisive I would like to know how that feature contributes, in the sense that if the relation is positive such that if the feature value increases the it leads to Yes and if it is negative It leads to No and so on and I would also want to know the magnitude for the same.
I would like to know if there a solution to this and also would to know a solution that is independent of the algorithm of choice, Please try to provide solutions that are not specific to decision tree but rather general solution for all the algorithms.
If there is some way that would tell me like:
for feature x1 the relation is 0.8*x1^2
for feature x2 the relation is -0.4*x2
just so that I would be able to analyse the output depends based on input feature x1 ,x2 and so on
Is it possible to find out the whether a high value for particular feature to a certain class, or a low value for the feature.
You can use Partial Dependency Plots (PDPs). scikit has a built-in PDP for their GBM - http://scikit-learn.org/stable/modules/ensemble.html#partial-dependence which was created in Friedman's Greedy Function Approximation Paper http://statweb.stanford.edu/~jhf/ftp/trebst.pdf pp26-28.
If you used scikit-learn GBM, use their PDP function. If you used another estimator, you can create your own PDP which is a few lines of code. PDPs and this method is algorithm agnostic as you asked. It just will not scale.
Logic
Take your training data
For the feature you are examining, get all unique values or some quantiles to reduce the time
Take a unique value
For the feature you are examining, in all observations, replace with the value from (3)
Predict all training observations
Get the mean of all predictions
Plot the point (unique value, mean)
Repeat 3-7 taking the next unique value until no more values
You now have a 1-way PDP. When the feature increases (X-axis), what on average happens to the prediction (y-axis). What is the magnitude of the change.
Taking the analysis further, you can fit a smooth curve or splines to the PDP which may help understand the relationship. As #Maxim said, there is not a perfect rule so you are looking for the trend here, trying to understand a relationship. We tend to run this for the most important features and/or features you are curious about.
The above scikit-learn reference has more examples.
For a Decision Tree, you can use the algorithmic short-cut as described by Friedman and implemented by scikit-learn. You need to walk the tree so the code is tied to the package and algorithm, hence it does not answer your question and I will not describe it. But it is on that scikit-learn page I referenced and in the paper.
def pdp_data(clf, X, col_index):
X_copy = np.copy(X)
results = {}
results['x_values'] = np.sort(np.unique(X_copy[:, col_index]))
results['y_values'] = []
for value in results['x_values']:
X_copy[:, col_index] = value
y_predict = clf.predict_log_proba(X_copy)[:, 1]
results['y_values'].append(np.mean(y_predict))
return results
Edited to answer new part of question:
For the addition to your question, you are looking for a linear model with coefficients. If you must interpret the model with linear coefficients, build a linear model.
Sometimes how you need to interpret the model guides what type of model you build.
In general - no. Decision trees work differently that that. For example it could have a rule under the hood that if feature X > 100 OR X < 10 and Y = 'some value' than answer is Yes, if 50 < X < 70 - answer is No etc. In the instance of decision tree you may want to visualize its results and analyse the rules. With RF model it is not possible, as far as I know, since you have a lot of trees working under the hood, each has independent decision rules.
I have a classification problem (predicting whether a sequence belongs to a class or not), for which I decided to use multiple classification methods, in order to help filter out the false positives.
(The problem is in bioinformatics - classifying protein sequences as being Neuropeptide precursors sequences. Here's the original article if anyone's interested, and the code used to generate features and to train a single predictor) .
Now, the classifiers have roughly similar performance metrics (83-94% accuracy/precision/etc' on the training set for 10-fold CV), so my 'naive' approach was to simply use multiple classifiers (Random Forests, ExtraTrees, SVM (Linear kernel), SVM (RBF kernel) and GRB) , and to use a simple majority vote.
MY question is:
How can I get the performance metrics for the different classifiers and/or their votes predictions?
That is, I want to see if using the multiple classifiers improves my performance at all, or which combination of them does.
My intuition is maybe to use the ROC score, but I don't know how to "combine" the results and to get it from a combination of classifiers. (That is, to see what the ROC curve is just for each classifier alone [already known], then to see the ROC curve or AUC for the training data using combinations of classifiers).
(I currently filter the predictions using "predict probabilities" with the Random Forests and ExtraTrees methods, then I filter arbitrarily for results with a predicted score below '0.85'. An additional layer of filtering is "how many classifiers agree on this protein's positive classification").
Thank you very much!!
(The website implementation, where we're using the multiple classifiers - http://neuropid.cs.huji.ac.il/ )
The whole shebang is implemented using SciKit learn and python. Citations and all!)
To evaluate the performance of the ensemble, simply follow the same approach as you would normally. However, you will want to get the 10 fold data set partitions first, and for each fold, train all of your ensemble on that same fold, measure the accuracy, rinse and repeat with the other folds and then compute the accuracy of the ensemble. So the key difference is to not train the individual algorithms using k fold cross-validation when evaluating the ensemble. The important thing is not to let the ensemble see the test data either directly or by letting one of it's algorithms see the test data.
Note also that RF and Extra Trees are already ensemble algorithms in their own right.
An alternative approach (again making sure the ensemble approach) is to take the probabilities and \ or labels output by your classifiers, and feed them into another classifier (say a DT, RF, SVM, or whatever) that produces a prediction by combining the best guesses from these other classifiers. This is termed "Stacking"
You can use a linear regression for stacking. For each 10-fold, you can split the data with:
8 training sets
1 validation set
1 test set
Optimise the hyper-parameters for each algorithm using the training set and validation set, then stack yours predictions by using a linear regression - or a logistic regression - over the validation set. Your final model will be p = a_o + a_1 p_1 + … + a_k p_K, where K is the number of classifier, p_k is the probability given by model k and a_k is the weight of the model k. You can also directly use the predicted outcomes, if the model doesn't give you probabilities.
If yours models are the same, you can optimise for the parameters of the models and the weights in the same time.
If you have obvious differences, you can do different bins with different parameters for each. For example one bin could be short sequences and the other long sequences. Or different type of proteins.
You can use the metric whatever metric you want, as long as it makes sens, like for not blended algorithms.
You may want to look at the 2007 Belkor solution of the Netflix challenges, section Blending. In 2008 and 2009 they used more advances technics, it may also be interesting for you.