For research purposes, I find myself needing to traing SVM via SGD on a large DS (that is, a large number of examples). This makes using scikit-learn's implementation (SGDClassifier) problematic, as it requires loading the entire DS at once.
The algorithm I am familiar with uses n step of SGD to obtain n different separators w_i, and then averages them (specifics can be seen in slide 12 of https://www.cse.huji.ac.il/~shais/Lectures2014/lecture8.pdf).
This made me think that maybe I can use scikit-learn to train multiple such classifiers and then take the average of the resulting linear separators (assume no bias).
Is this a reasonable line of thinking, or does scikit-learn's implementation not fall under my logic?
Edit: I am well aware of the alternatives for training SVM in different ways, but this is for a specific research purpose. I would just like to know if this line of thinking is possible with scikit-learn's implementation, or if you are aware of an alternative that will allow me to train SVM using SGD without loading an entire DS to memory.
SGDClassifier have a partial_fit method, and one of the primary objectives of partial_fit method is to scale sklearn models to large-scale datasets. Using this, you can load a part of the dataset into RAM, feed it to SGD, and keep repeating this unless full dataset is used.
In code below, I use KFold mainly to imitate loading chunk of dataset.
class GD_SVM(BaseEstimator, ClassifierMixin):
def __init__(self):
self.sgd = SGDClassifier(loss='hinge',random_state=42,fit_intercept=True,l1_ratio=0,tol=.001)
def fit(self,X,y):
cv = KFold(n_splits=10,random_state=42,shuffle=True)
for _,test_id in cv.split(X,y):
xt,yt = X[test_id],y[test_id]
self.sgd = self.sgd.partial_fit(xt,yt,classes=np.unique(y))
def predict(self,X):
return self.sgd.predict(X)
To test this against regular (linear) SVM:
X,y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X) #For simplicity, Pipeline is better choice
cv = RepeatedStratifiedKFold(n_splits=5,n_repeats=5,random_state=43)
sgd = GD_SVM()
svm = LinearSVC(loss='hinge',max_iter=1,random_state=42,
C=1.0,fit_intercept=True,tol=.001)
r = cross_val_score(sgd,X,y,cv=cv) #cross_val_score(svm,X,y,cv=cv)
print(r.mean())
This returned 95% accuracy for above GD_SVM, and 96% for SVM. In Digits dataset SVM had 93% accuracy, while GD_SVM had 91%. While these performances are broadly similar, as these measurements show, please note that they are not identical. This is expected, since these algorithms use pretty different optimization algorithms, but I think careful tuning of hyper-parameter would reduce the gap.
Based on the concern of loading all of the data in memory, if you have access to more compute resources, you may want to use PySpark's SVM implementation: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#linear-support-vector-machine, as that Spark is built for large scale data processing. I don't know if averaging the separators from multiple Scikit-Learn models would work as expected; there isn't a clean way to instantiate a new model with new separators, based on the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), so it would probably have to be implemented as an ensemble approach.
If you insist on using the whole DS for training instead of sampling (btw that is what the slides describe) and you do not care about performance, I would train n classifiers, and then select only their support vectors and retrain final version on those support vectors only. This way you effectively dismiss most of the data and concentrate only on the points that are important for the classification.
Related
I have a project that asks to do binary classification for whether an employee will leave the company or not, based on about 52 features and 2000 rows of data. The data is somewhat balanced with 1200 neg to 800 pos. I have done extensive EDA and data cleansing. I chose to try several different models from sklearn, Logarithmic Regression, SVM, and Random Forests. I am getting very poor and similar results from all of them. I only used 15 of the 52 features for this run, but the results are almost identical to when I used all 52 features. Of the 52 features, 6 were categorical that I converted to dummies (between 3-6 categories per feature) and 3 were datetime that I converted to days-since-epoch. There were no null values to fill.
This is the code and confusion matrix my most recent run with a random forest.
x_train, x_test, y_train, y_test = train_test_split(small_features, endreason, test_size=0.2, random_state=0)
RF = RandomForestClassifier(bootstrap = True,
max_features = 'sqrt',
random_state=0)
RF.fit(x_train, y_train)
RF.predict(x_test)
cm = confusion_matrix(y_test, rf_predictions)
plot_confusion_matrix(cm, classes = ['Negative', 'Positive'],
title = 'Confusion Matrix')
What are steps I can do to help better fit this model?
The results you are showing definitely seem a bit dis-encouraging for the methods your propose and balance of the data you describe. However, from the description of the problem there definitely seems to be a lot of room for improvement.
When you are using train_test_split make sure you pass stratify=endreason to make sure there's no issues regarding the labels when splitting the dataset. Moving on to helpful points to improve your model:
First of all, dimensionality reduction: Since you are dealing with many features, some of them might be useless or even contaminate the classification problem you are trying to solve. It is very important to considering fitting different dimension reduction techniques to your data and using this fitted data to feed your model. Some common approaches that might be worth trying:
PCA (Principal component analysis)
Low Variance & Correlation filter
Random Forests feature importance
Secondly understanding the model: While Logistic Regression might prove to be an excellent baseline for a linear classifier, it might not necessarily be what you need for this task. Random Forests seem to be much better when capturing non-linear relationships but needs to be controlled and pruned to avoid overfitting and might require a lot of data. On the other hand, SVM is a very powerful method with non-linear kernels but might prove inefficient when working with huge amounts of data. XGBoost and LightGBM are very powerful gradient boosting algorithms that have won multiple kaggle competitions and work very well in almost every case, of course there needs to be some preprocessing as XGBoost is not preparred to work with categorical features (LightGBM is). My suggestion is to try these last 2 methods. From worse to last (in general case scenarios) I would list:
LightGBM / XGBoost
RandomForest / SVM / Logistic Regression
Last but not least hyperparameter tunning: Regardless of the method you choose, there will always be some fine-tuning that needs to be done. Sklearn offers gridsearch which comes in really handy. However you would need to understand how your classifiers are behaving in order to know what should you be looking for. I will not go in-depth in this as it will be off-topic and not suited for SO but you can definitely have a read here
I have a highly unbalanced dataset of 3 classes. To address this, I applied the sample_weight array in the XGBClassifier, but I'm not noticing any changes in the modelling results? All of the metrics in the classification report (confusion matrix) are the same. Is there an issue with the implementation?
The class ratios:
military: 1171
government: 34852
other: 20869
Example:
pipeline = Pipeline([
('bow', CountVectorizer(analyzer=process_text)), # convert strings to integer counts
('tfidf', TfidfTransformer()), # convert integer counts to weighted TF-IDF scores
('classifier', XGBClassifier(sample_weight=compute_sample_weight(class_weight='balanced', y=y_train))) # train on TF-IDF vectors w/ Naive Bayes classifier
])
Sample of Dataset:
data = pd.DataFrame({'entity_name': ['UNICEF', 'US Military', 'Ryan Miller'],
'class': ['government', 'military', 'other']})
Classification Report
First, most important: use a multiclass eval_metric. eval_metric=merror or mlogloss, then post us the results. You showed us ['precision','recall','f1-score','support'], but that's suboptimal, or outright broken unless you computed them in a multi-class-aware, imbalanced-aware way.
Second, you need weights. Your class ratio is military: government: other 1:30:18, or as percentages 2:61:37%.
You can manually set per-class weights with xgb.DMatrix..., weights)
Look inside your pipeline (use print or verbose settings, dump values), don't just blindly rely on boilerplate like sklearn.utils.class_weight.compute_sample_weight('balanced', ...) to give you optimal weights.
Experiment with manually setting per-class weights, starting with 1 : 1/30 : 1/18 and try more extreme values. Reciprocals so the rarer class gets higher weight.
Also try setting min_child_weight much higher, so it requires a few exemplars (of the minority classes). Start with min_child_weight >= 2(* weight of rarest class) and try going higher. Beware of overfitting to the very rare minority class (this is why people use StratifiedKFold crossvalidation, for some protection, but your code isn't using CV).
We can't see your other parameters for xgboost classifier (how many estimators? early stopping on or off? what was learning_rate/eta? etc etc.). Seems like you used the defaults - they'll be terrible. Or else you're not showing your code. Distrust xgboost's defaults, esp. for multiclass, don't expect xgboost to give good out-of-the-box results. Read the doc and experiment with values.
Do all that experimentation, post your results, check before concluding "it doesn't work". Don't expect optimal results from out-of-the-box. Distrust or double-check the sklearn util functions, try manual alternatives. (Often, just because sklearn has a function to do something, doesn't mean it's good or best or suitable for all use-cases, like imbalanced multiclass)
I am working with a dataset of about 400.000 x 250.
I have a problem with the model yielding a very good R^2 score when testing it on the training set, but extremely poorly when used on the test set. Initially, this sounds like overfitting. But the data is split into training/test set at random and the data set i pretty big, so I feel like there has to be something else.
Any suggestions?
Splitting dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['SalePrice'],
axis=1), df.SalePrice, test_size = 0.3)
Sklearn's Linear Regression estimator
from sklearn import linear_model
linReg = linear_model.LinearRegression() # Create linear regression object
linReg.fit(X_train, y_train) # Train the model using the training sets
# Predict from training set
y_train_linreg = linReg.predict(X_train)
# Predict from test set
y_pred_linreg = linReg.predict(X_test)
Metric calculation
from sklearn import metrics
metrics.r2_score(y_train, y_train_linreg)
metrics.r2_score(y_test, y_pred_linreg)
R^2 score when testing on training set: 0,64
R^2 score when testing on testing set: -10^23 (approximatly)
While I agree with Mihai that your problem definitely looks like overfitting, I don't necessarily agree on his answer that neural network would solve your problem; at least, not out of the box. By themselves, neural networks overfit more, not less, than linear models. You need somehow to take care of your data, hardly any model can do that for you. A few options that you might consider (apologies, I cannot be more precise without looking at the dataset):
Easiest thing, use regularization. 400k rows is a lot, but with 250 dimensions you can overfit almost whatever you like. So try replacing LinearRegression by Ridge or Lasso (or Elastic Net or whatever). See http://scikit-learn.org/stable/modules/linear_model.html (Lasso has the advantage of discarding features for you, see next point)
Especially if you want to go outside of linear models (and you probably should), it's advisable to first reduce the dimension of the problem, as I said 250 is a lot. Try using some of the Feature selection techniques here: http://scikit-learn.org/stable/modules/feature_selection.html
Probably most importantly than anything else, you should consider adapting your input data. The very first thing I'd try is, assuming you are really trying to predict a price as your code implies, to replace it by its logarithm, or log(1+x). Otherwise linear regression will try very very hard to fit that single object that was sold for 1 Million $ ignoring everything below $1k. Just as important, check if you have any non-numeric (categorical) columns and keep them only if you need them, in case reducing them to macro-categories: a categorical column with 1000 possible values will increase your problem dimension by 1000, making it an assured overfit. A single column with a unique categorical data for each input (e.g. buyer name) will lead you straight to perfect overfitting.
After all this (cleaning data, reducing dimension via either one of the methods above or just Lasso regression until you get to certainly less than dim 100, possibly less than 20 - and remember that this includes any categorical data!), you should consider non-linear methods to further improve your results - but that's useless until your linear model provides you at least some mildly positive R^2 value on test data. sklearn provides a lot of them: http://scikit-learn.org/stable/modules/kernel_ridge.html is the easiest to use out-of-the-box (also does regularization), but it might be too slow to use in your case (you should first try this, and any of the following, on a subset of your data, say 1000 rows once you've selected only 10 or 20 features and see how slow that is). http://scikit-learn.org/stable/modules/svm.html#regression have many different flavours, but I think all but the linear one would be too slow. Sticking to linear things, http://scikit-learn.org/stable/modules/sgd.html#regression is probably the fastest, and would be how I'd train a linear model on this many samples. Going truly out of linear, the easiest techniques would probably include some kind of trees, either directly http://scikit-learn.org/stable/modules/tree.html#regression (but that's an almost-certain overfit) or, better, using some ensemble technique (random forests http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees are the typical go-to algorithm, gradient boosting http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting sometimes works better). Finally, state-of-the-art results are indeed generally obtained via neural networks, see e.g. http://scikit-learn.org/stable/modules/neural_networks_supervised.html but for these methods sklearn is generally not the right answer and you should take a look at dedicated environments (TensorFlow, Caffe, PyTorch, etc.)... however if you're not familiar with those it is certainly not worth the trouble!
I have two data set with different size.
1) Data set 1 is with high dimensions 4500 samples (sketches).
2) Data set 2 is with low dimension 1000 samples (real data).
I suppose that "both data set have the same distribution"
I want to train an non linear SVM model using sklearn on the first data set (as a pre-training ), and after that I want to update the model on a part of the second data set (to fit the model).
How can I develop a kind of update on sklearn. How can I update a SVM model?
In sklearn you can do this only for linear kernel and using SGDClassifier (with appropiate selection of loss/penalty terms, loss should be hinge, and penalty L2). Incremental learning is supported through partial_fit methods, and this is not implemented for neither SVC nor LinearSVC.
Unfortunately, in practise fitting SVM in incremental fashion for such small datasets is rather useless. SVM has easy obtainable global solution, thus you do not need pretraining of any form, in fact it should not matter at all, if you are thinking about pretraining in the neural network sense. If correctly implemented, SVM should completely forget previous dataset. Why not learn on the whole data in one pass? This is what SVM is supposed to do. Unless you are working with some non-convex modification of SVM (then pretraining makes sense).
To sum up:
From theoretical and practical point of view there is no point in pretraining SVM. You can either learn only on the second dataset, or on both in the same time. Pretraining is only reasonable for methods which suffer from local minima (or hard convergence of any kind) thus need to start near actual solution to be able to find reasonable model (like neural networks). SVM is not one of them.
You can use incremental fitting (although in sklearn it is very limited) for efficiency reasons, but for such small dataset you will be just fine fitting whole dataset at once.
I got linearsvc working against training set and test set using load_file method i am trying to get It working on Multiprocessor enviorment.
How can i get multiprocessing work on LinearSVC().fit() LinearSVC().predict()? I am not really familiar with datatypes of scikit-learn yet.
I am also thinking about splitting samples into multiple arrays but i am not familiar with numpy arrays and scikit-learn data structures.
Doing this it will be easier to put into multiprocessing.pool() , with that , split samples into chunks , train them and combine trained set back later , would it work ?
EDIT:
Here is my scenario:
lets say , we have 1 million files in training sample set , when we want to distribute processing of Tfidfvectorizer on several processors we have to split those samples (for my case it will only have two categories , so lets say 500000 each samples to train) . My server have 24 cores with 48 GB , so i want to split each topics into number of chunks 1000000 / 24 and process Tfidfvectorizer on them. Like that i would do to Testing sample set , as well as SVC.fit() and decide(). Does it make sense?
Thanks.
PS: Please do not close this .
I think using SGDClassifier instead of LinearSVC for this kind of data would be a good idea, as it is much faster. For the vectorization, I suggest you look into the hash transformer PR.
For the multiprocessing: You can distribute the data sets across cores, do partial_fit, get the weight vectors, average them, distribute them to the estimators, do partial fit again.
Doing parallel gradient descent is an area of active research, so there is no ready-made solution there.
How many classes does your data have btw? For each class, a separate will be trained (automatically). If you have nearly as many classes as cores, it might be better and much easier to just do one class per core, by specifying n_jobs in SGDClassifier.
For linear models (LinearSVC, SGDClassifier, Perceptron...) you can chunk your data, train independent models on each chunk and build an aggregate linear model (e.g. SGDClasifier) by sticking in it the average values of coef_ and intercept_ as attributes. The predict method of LinearSVC, SGDClassifier, Perceptron compute the same function (linear prediction using a dot product with an intercept_ threshold and One vs All multiclass support) so the specific model class you use for holding the average coefficient is not important.
However as previously said the tricky point is parallelizing the feature extraction and current scikit-learn (version 0.12) does not provide any way to do this easily.
Edit: scikit-learn 0.13+ now has a hashing vectorizer that is stateless.