I have a data set of tweets, where each tweet has an average confidence score.
For example
Tweet Average Confidence Standard Deviation
too much thoughts inside his headdd we can t even imagine 0.3 0.163951
His ass need to stay up 0.8 0.161962
First time I heard his name in camp, he seems amazing 0.19 0.181962
Average Confidence is the average of the confidences predicted by several supervised models for a specific instance to belong to the positive class.
Standard Deviation is the confidences' standard deviation from Average Confidence for a particular instance.
If i consider it as a regression task, how to handle multi label data
EDIT
Didn't understood your question quite yet, so answered as best I thought :)
Basically the score (in your case average) is used alone in sentiment analysis to classify good-bad sentences (data), pick a threshold which yield the best classification results, lets say 0.6 so
if score >= 0.6
classify as GOOD
else
classify as BAD
I suggest to see if this simple approach is good enough for your requirements
in case you want to classify using more variables (information) e.g 'averageandstd` you can use another classification model (like logistic-regression, decision trees, svm and more...)
in case you would like to use some regression approach, I suggest a logistic regression (it's pretty strait forward)
because your current model consists of just 2 variables average and std a svm, might give better results (basically it project the data on higher dimension and do the classification there)
keep in mind all methods (maybe except decision-trees and such) will output another score, like prob of classification between 0 to 1 so a thresholds will always have to be applied at the end
Related
I'm working on predicting if any task breaches a given deadline or not(Binary Classification Problem)
I've used Logistic Regression, Random Forest and XGBoost. All of them give an F1 score of around 56% for the class label 1(i.e the F1 score of the positive class only).
I've used:
StandardScaler()
GridSearchCV for Hyperparameter Tuning
Recursive Feature Elimination(for feature selection)
SMOTE(the dataset is imbalanced so I used SMOTE to create new examples from existing examples)
to try and improve the F score of this model.
I've also created an ensemble model using EnsembleVoteClassifier.As you can see from the picture, the weighted F score is 94% however the F score for class 1 (i.e positive class which says that the task will cross the deadline) is just 57%.
After applying all those methods mentioned above, I have been able to improve the f1 score of label 1 from 6% to 57%. However, I'm not sure what else to do to further improve the F score of the label 1.
You should also experiment with Under-Sampling. In general, you won't get much improvement by simply changing the algorithm. You should look into more advanced ensemble based techniques specifically designed for dealing with class imbalance.
You can also try out the approach used in this paper: https://www.sciencedirect.com/science/article/abs/pii/S0031320312001471
Alternatively, you could look into more advanced data synthesis methods.
Clearly, the fact that you have a relatively small number of True 1s samples in you datasets affects the performance of your classifier.
You have an "imbalanced data", you have much more of the 0s samples than of 1s.
There are multiple way to deal with imbalanced data. Each learner you have applied have its own "trick" for it. However, a general thing you can try is to resample the 1s samples. That is, artificially increase the proportion of the 1s in your dataset.
You can read more about different options here:
https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18
I have a data of dimension (13961,48 ) initially, and after one hot encoding and also basic massaging of data the dimension observed around (13961,862). the data is imbalance with two categories of 'Retained' around 6% and 'not Retained' around 94%.
While running any algorithms such as logistic,knn,decision tree,random forest, the data results in very high accuracy even without any feature selection process carried out and the accuracy crosses more than 94% mostly except 'Naive bias classifier'.
This seems like odd and even by having any two features randomly also--> that gives accuracy more than 94% , which seems non reality in general.
Applying SMOTE also, provide result of more than 94% of accuracy even for baseline model of any algorithms said above such as logistic,knn,decision tree,random forest,
After removing the top 20 features also , this gives accuracy of good result more than 94% ( checked for understanding the genuineness )
g = data[Target_col_Y_name]
df = pd.concat([g.value_counts(),
g.value_counts(normalize=True).mul(100)],axis=1, keys=('counts','percentage'))
print('The % distribution between the retention and non-retention flag\n')
print (df)
# The code o/p to show the imbalance is
The % distribution between the retention and non-retention flag
counts percentage
Non Retained 13105 93.868634
Retained 856 6.131366
My data have 7 numerical variables such as month, amount, interest rate and all others ( around 855) as one-hot-encoding transformed categorical variables.
Any methodology , to handle this kind of data on baseline,feature selection or imbalance optimization techniques ? please guide by looking at the dimensionality and the imbalance count for each levels.
I would like to add something in addition to Elias answer.
Firstly, you have to understand that even if you's create "dumb classifier", which always predicts "not retained", you'd still be correct 94% of times. So accuracy is clearly weak metric in this case.
You should definitely learn about confusion matrix and metrics that come along with it (like AUC).
One of these metrics is F1 score, which is harmonic average of precision and recall. It is better that accuracy in imbalanced class setting, but... it doesn't have to be the best. F1 will favor these classifiers that have similar precision and recall. But this is not necessary something that is important for you.
For instance, if you'd build sfw content filter, you would be ok with labeling some SFW content as nsfw (negative class), which would increase false negative rate (and decrease recall), but you would like to be sure that you kept only safe ones (high precision).
In your case you can reason what is worse: retaining something bad or abandoning something good, and pick the metric in that way.
As far as strategy is concerned: there are plenty of ways to handle class imbalance: sampling techniques (try down-sampling, up-sampling besides SMOTE or ROSE) and check out whether your validation score (training metrics alone are almost useless) improved. Just remember to apply sampling/augmentations techniques after the train-validation split.
Moreover, some models have special hyperparametrs to focus more on rare class (for instance in xgboost there is scale_pos_weight parameter). From my experience, tunning this hyperparam is way more effective than SMOTE.
Good luck
Accuracy is not a very good measure in general, particularly for imbalanced classes. I would recommend this other stackoverflow answer, that explains when to use F1 score and when to use AUROC, which are both far better measures than accuracy; in this case F1 is better.
Few points just to clear up:
For models such as random forest, you should not have to remove features to improve the accuracy, as it will just regard them as insignificant features. I recommend random forests as it tends to be very accurate (except in some cases) and can show significant features just by using clf.feature_significances_ (if using the scipy random forest).
Decision trees will almost always perform worse than random forests, as random forests are many aggregated decision trees.
I am using sklearn's random forests module to predict a binary target variable based on 166 features.
When I increase the number of dimensions to 175 the accuracy of the model decreases (from accuracy = 0.86 to 0.81 and from recall = 0.37 to 0.32) .
I would expect more data to only make the model more accurate, especially when the added features were with business value.
I built the model using sklearn in python.
Why the new features did not get weight 0 and left the accuracy as it was ?
Basically, you may be "confusing" your model with useless features. MORE FEATURES or MORE DATA WILL NOT ALWAYS MAKE YOUR MODEL BETTER. The new features will also not get weight zero because the model will try hard to use them! Because there are so many (175!), RF is just not able to come back to the previous "pristine" model with better accuracy and recall (maybe these 9 features are really not adding anything useful).
Think about how a decision tree essentially works. These new features will cause some new splits that can worsen the results. Try to work it out from the basics and slowly adding new information always checking the performance. In addition, pay attention to for example the number of features used per split (mtry). For so many features, you would need to have a very high mtry (to allow for a big sample to be considered for every split). Have you considered adding 1 or 2 more and checking how the accuracy responds? Also, don't forget mtry!
More data does not always make the model more accurate. Random forest is a traditional machine learning method where the programmer has to do the feature selection. If the model is given a lot of data but it is bad, then the model will try to make sense out of that bad data too and will end up messing things up. More data is better for neural networks as those networks select the best possible features out of the data on their own.
Also, 175 features is too much and you should definitely look into dimensionality reduction techniques and select the features which are highly correlated with the target. there are several methods in sklearn to do that. You can try PCA if your data is numerical or RFE to remove bad features, etc.
I have trained my model on a data set and i used decision trees to train my model and it has 3 output classes - Yes,Done and No , and I got to know the feature that are most decisive in making a decision by checking feature importance of the classifier. I am using python and sklearn as my ML library. Now that I have found the feature that is most decisive I would like to know how that feature contributes, in the sense that if the relation is positive such that if the feature value increases the it leads to Yes and if it is negative It leads to No and so on and I would also want to know the magnitude for the same.
I would like to know if there a solution to this and also would to know a solution that is independent of the algorithm of choice, Please try to provide solutions that are not specific to decision tree but rather general solution for all the algorithms.
If there is some way that would tell me like:
for feature x1 the relation is 0.8*x1^2
for feature x2 the relation is -0.4*x2
just so that I would be able to analyse the output depends based on input feature x1 ,x2 and so on
Is it possible to find out the whether a high value for particular feature to a certain class, or a low value for the feature.
You can use Partial Dependency Plots (PDPs). scikit has a built-in PDP for their GBM - http://scikit-learn.org/stable/modules/ensemble.html#partial-dependence which was created in Friedman's Greedy Function Approximation Paper http://statweb.stanford.edu/~jhf/ftp/trebst.pdf pp26-28.
If you used scikit-learn GBM, use their PDP function. If you used another estimator, you can create your own PDP which is a few lines of code. PDPs and this method is algorithm agnostic as you asked. It just will not scale.
Logic
Take your training data
For the feature you are examining, get all unique values or some quantiles to reduce the time
Take a unique value
For the feature you are examining, in all observations, replace with the value from (3)
Predict all training observations
Get the mean of all predictions
Plot the point (unique value, mean)
Repeat 3-7 taking the next unique value until no more values
You now have a 1-way PDP. When the feature increases (X-axis), what on average happens to the prediction (y-axis). What is the magnitude of the change.
Taking the analysis further, you can fit a smooth curve or splines to the PDP which may help understand the relationship. As #Maxim said, there is not a perfect rule so you are looking for the trend here, trying to understand a relationship. We tend to run this for the most important features and/or features you are curious about.
The above scikit-learn reference has more examples.
For a Decision Tree, you can use the algorithmic short-cut as described by Friedman and implemented by scikit-learn. You need to walk the tree so the code is tied to the package and algorithm, hence it does not answer your question and I will not describe it. But it is on that scikit-learn page I referenced and in the paper.
def pdp_data(clf, X, col_index):
X_copy = np.copy(X)
results = {}
results['x_values'] = np.sort(np.unique(X_copy[:, col_index]))
results['y_values'] = []
for value in results['x_values']:
X_copy[:, col_index] = value
y_predict = clf.predict_log_proba(X_copy)[:, 1]
results['y_values'].append(np.mean(y_predict))
return results
Edited to answer new part of question:
For the addition to your question, you are looking for a linear model with coefficients. If you must interpret the model with linear coefficients, build a linear model.
Sometimes how you need to interpret the model guides what type of model you build.
In general - no. Decision trees work differently that that. For example it could have a rule under the hood that if feature X > 100 OR X < 10 and Y = 'some value' than answer is Yes, if 50 < X < 70 - answer is No etc. In the instance of decision tree you may want to visualize its results and analyse the rules. With RF model it is not possible, as far as I know, since you have a lot of trees working under the hood, each has independent decision rules.
I've got about 300k documents stored in a Postgres database that are tagged with topic categories (there are about 150 categories in total). I have another 150k documents that don't yet have categories. I'm trying to find the best way to programmaticly categorize them.
I've been exploring NLTK and its Naive Bayes Classifier. Seems like a good starting point (if you can suggest a better classification algorithm for this task, I'm all ears).
My problem is that I don't have enough RAM to train the NaiveBayesClassifier on all 150 categoies/300k documents at once (training on 5 categories used 8GB). Furthermore, accuracy of the classifier seems to drop as I train on more categories (90% accuracy with 2 categories, 81% with 5, 61% with 10).
Should I just train a classifier on 5 categories at a time, and run all 150k documents through the classifier to see if there are matches? It seems like this would work, except that there would be a lot of false positives where documents that don't really match any of the categories get shoe-horned into on by the classifier just because it's the best match available... Is there a way to have a "none of the above" option for the classifier just in case the document doesn't fit into any of the categories?
Here is my test class http://gist.github.com/451880
You should start by converting your documents into TF-log(1 + IDF) vectors: term frequencies are sparse so you should use python dict with term as keys and count as values and then divide by total count to get the global frequencies.
Another solution is to use the abs(hash(term)) for instance as positive integer keys. Then you an use scipy.sparse vectors which are more handy and more efficient to perform linear algebra operation than python dict.
Also build the 150 frequencies vectors by averaging the frequencies of all the labeled documents belonging to the same category. Then for new document to label, you can compute the cosine similarity between the document vector and each category vector and choose the most similar category as label for your document.
If this is not good enough, then you should try to train a logistic regression model using a L1 penalty as explained in this example of scikit-learn (this is a wrapper for liblinear as explained by #ephes). The vectors used to train your logistic regression model should be the previously introduced TD-log(1+IDF) vectors to get good performance (precision and recall). The scikit learn lib offers a sklearn.metrics module with routines to compute those score for a given model and given dataset.
For larger datasets: you should try the vowpal wabbit which is probably the fastest rabbit on earth for large scale document classification problems (but not easy to use python wrappers AFAIK).
How big (number of words) are your documents? Memory consumption at 150K trainingdocs should not be an issue.
Naive Bayes is a good choice especially when you have many categories with only a few training examples or very noisy trainingdata. But in general, linear Support Vector Machines do perform much better.
Is your problem multiclass (a document belongs only to one category exclusivly) or multilabel (a document belongs to one or more categories)?
Accuracy is a poor choice to judge classifier performance. You should rather use precision vs recall, precision recall breakeven point (prbp), f1, auc and have to look at the precision vs recall curve where recall (x) is plotted against precision (y) based on the value of your confidence-threshold (wether a document belongs to a category or not). Usually you would build one binary classifier per category (positive training examples of one category vs all other trainingexamples which don't belong to your current category). You'll have to choose an optimal confidence threshold per category. If you want to combine those single measures per category into a global performance measure, you'll have to micro (sum up all true positives, false positives, false negatives and true negatives and calc combined scores) or macro (calc score per category and then average those scores over all categories) average.
We have a corpus of tens of million documents, millions of training examples and thousands of categories (multilabel). Since we face serious training time problems (the number of documents are new, updated or deleted per day is quite high), we use a modified version of liblinear. But for smaller problems using one of the python wrappers around liblinear (liblinear2scipy or scikit-learn) should work fine.
Is there a way to have a "none of the
above" option for the classifier just
in case the document doesn't fit into
any of the categories?
You might get this effect simply by having a "none of the above" pseudo-category trained each time. If the max you can train is 5 categories (though I'm not sure why it's eating up quite so much RAM), train 4 actual categories from their actual 2K docs each, and a "none of the above" one with its 2K documents taken randomly from all the other 146 categories (about 13-14 from each if you want the "stratified sampling" approach, which may be sounder).
Still feels like a bit of a kludge and you might be better off with a completely different approach -- find a multi-dimensional doc measure that defines your 300K pre-tagged docs into 150 reasonably separable clusters, then just assign each of the other yet-untagged docs to the appropriate cluster as thus determined. I don't think NLTK has anything directly available to support this kind of thing, but, hey, NLTK's been growing so fast that I may well have missed something...;-)