I am trying to balance my dataset, but I am struggling to find the right way to do it. Let me set up the problem. I have a multiclass dataset with the following class weights:
class weight
2.0 0.700578
4.0 0.163401
3.0 0.126727
1.0 0.009294
As you can see, the dataset is pretty unbalanced. What I would like to do is obtain a balanced dataset in which each class is represented with the same weight.
There are a lot of questions about this, but:
Scikit-learn balanced subsampling: these subsamples can overlap, which is wrong for my approach. Moreover, I would like to do this with sklearn or other well-tested packages.
How to perform undersampling (the right way) with python scikit-learn?: here they suggest using an unbalanced dataset with a balanced class-weight vector; however, I need the balanced dataset itself, it is not a matter of which model and which weights.
https://github.com/scikit-learn-contrib/imbalanced-learn: a lot of questions refer to this package. Below is an example of how I am trying to use it.
Here is the example:
from imblearn.ensemble import EasyEnsembleClassifier

# build 2 estimators, undersampling every class except the minority one
eec = EasyEnsembleClassifier(random_state=42, sampling_strategy='not minority', n_estimators=2)
eec.fit(data_for, label_all.loc[data_for.index, 'LABEL_O_majority'])
new_data = eec.estimators_samples_  # indices of the samples drawn for each estimator
However, the returned indices are all the indices of the initial data, repeated n_estimators times.
Here is the result:
[array([ 0, 1, 2, ..., 1196, 1197, 1198]),
array([ 0, 1, 2, ..., 1196, 1197, 1198])]
Finally, a lot of techniques use oversampling, but I would rather not use them. Only for class 1 can I tolerate oversampling, as it is very predictable.
I am wondering whether sklearn, or this contrib package, really has no function that does this.
Based on my experience, under-sampling doesn't always help, since we are not using all of the available data, and the approach can lead to a lot of overfitting. The Synthetic Minority Over-sampling Technique (SMOTE) has worked well for me with most types of data (both structured and unstructured, such as images), although it can sometimes be slow. It is easy to use and available through imblearn. If you want to try oversampling techniques, this article might help: https://medium.com/@adib0073/how-to-use-smote-for-dealing-with-imbalanced-image-dataset-for-solving-classification-problems-3aba7d2b9cad
But for undersampling, as mentioned in the comments above, you would have to slice your dataframe or array of the majority class to match the size of the minority class (see the sketch below).
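For the single balanced dataset asked about in the question, imbalanced-learn's RandomUnderSampler does this slicing for you. Here is a minimal sketch, reusing the data_for / label_all names from the question and assuming they are a feature DataFrame and a label Series:

from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# Undersample every class except the minority one; each original row is used
# at most once, so the resulting subsample contains no duplicates.
rus = RandomUnderSampler(sampling_strategy='not minority', random_state=42)
X_bal, y_bal = rus.fit_resample(
    data_for, label_all.loc[data_for.index, 'LABEL_O_majority'])

print(Counter(y_bal))  # every class now has the minority class's count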
Try applying iterative-stratification (a usage sketch follows the abstract below).
On the Stratification of Multi-label Data:
Stratified sampling is a sampling method that takes into account the existence of disjoint groups within a population and produces samples where the proportion of these groups is maintained. In single-label classification tasks, groups are differentiated based on the value of the target variable. In multi-label learning tasks, however, where there are multiple target variables, it is not clear how stratified sampling could/should be performed. This paper investigates stratification in the multi-label data context. It considers two stratification methods for multi-label data and empirically compares them along with random sampling on a number of datasets and based on a number of evaluation criteria. The results reveal some interesting conclusions with respect to the utility of each method for particular types of multi-label datasets.
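A rough sketch of how the iterative-stratification package might be used (assuming pip install iterative-stratification and its MultilabelStratifiedKFold class), with toy multiclass labels one-hot encoded into the multi-label indicator format it expects:

import numpy as np
from sklearn.preprocessing import label_binarize
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

X = np.random.rand(100, 5)                       # toy features
y = np.random.choice([1.0, 2.0, 3.0, 4.0], 100)  # toy multiclass labels
Y = label_binarize(y, classes=[1.0, 2.0, 3.0, 4.0])

mskf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in mskf.split(X, Y):
    # each fold keeps the class proportions roughly constant
    print(len(train_idx), len(test_idx))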
I have a dataset that was generated from an IoT device, and I'm trying to predict '1' (the machine will break down, a rare event) and '0' (it will not). The dataset is highly imbalanced and I'm considering using an LSTM for prediction. I'm not sure how to prepare my data for this task. Do I remove all zero values per row, since most columns contain them? Only a few of those columns do not contain outliers. Below is an example of what the distribution of my data looks like, but not entirely. FYI, I have more columns not included in the snapshot, and about 75% of the columns in the data are like this.
The common approach when dealing with imbalanced datasets is to use resampling techniques such as undersampling and oversampling. In Python, imbalanced-learn is a popular library used for both of these methods.
Undersampling removes samples from the majority class, whereas oversampling duplicates samples from the minority class. Oversampling is generally preferred, as you are not removing data. Lastly, you can use an advanced oversampling technique called SMOTE to create new synthetic minority-class data. This is generally the most performant option; see the sketch below for a minimal example.
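A minimal sketch of SMOTE with imbalanced-learn on a toy imbalanced dataset (the names X, y and all parameter values below are illustrative, not taken from the question):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# toy highly imbalanced binary data, similar in spirit to a rare-event label
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.99, 0.01], random_state=0)
print(Counter(y))            # the minority class has only a few dozen samples

smote = SMOTE(random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y_res))        # both classes are now equally represented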
I have about 8000 features measuring a two-level response variable, i.e. the output can belong to class 1 or 0.
The 8000 features consist of about 3000 features with 0-1 values and about 5000 features which are basically words from text data and their tfidf scores.
I am building a linear SVM model on this to predict my output variable, and I am getting decent results: accuracy, recall and precision around 60-70%.
I am looking for help with the following:
Standardization: do the 0-1 values need to be standardized? Do tfidf scores need to be standardized even if I use sublinear_tf=True?
Dimension reduction: I have tried f_classif with sklearn's SelectPercentile function so far. Are there any other dimension-reduction techniques you can suggest? I have gone through the sklearn dimension-reduction page, which also talks about chi2 reduction, but that isn't giving me good results. Can PCA be applied if the data is a mix of 0-1 columns and tfidf score columns?
Remove collinearity: how can I remove highly correlated independent variables?
I am fairly new to python and machine learning, so any help would be appreciated.
(edited to include additional questions)
1 - I would centre and scale your variables for a linear model. I don't know if it's strictly necessary for SVMs, but if I recall correctly, spatially based models work better if the variables are on the same ranges. I don't think there's any harm in doing this anyway (vs. unscaled/uncentred). Someone may correct me - I don't do much by way of text analysis.
2 - (original answer) Could you try applying a random forest model, then inspecting the importance scores and discarding the features with low importance? With so many features I'd worry about memory issues, but if your machine can handle it...?
Another good approach here would be to use ridge/lasso logistic regression. By its very nature it is good at identifying (and discarding) redundant variables, and it can help with your question 3 (correlated variables); see the sketch after this answer.
Appreciate you're new to this, but both of the models above are good at getting around correlated / non-significant variables, so you may want to use them on the way to finalising an SVM.
3 - There's no magic bullet that I know of. The above may help. I predominantly use R, and within that there's a package called Boruta which is good for this step. There may be a Python equivalent?
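Below is a hedged sketch of the lasso-style route (point 2) and a simple correlation filter (point 3); the toy DataFrame, column names and thresholds are illustrative assumptions, not the asker's actual data:

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# toy stand-in for the real feature matrix (the real one has ~8000 columns)
X_arr, y = make_classification(n_samples=500, n_features=50, n_informative=10,
                               random_state=0)
X = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(50)])

# L1-penalised logistic regression tends to zero out redundant features;
# SelectFromModel keeps only the features with non-zero coefficients.
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector = SelectFromModel(l1_model).fit(X, y)
X_reduced = X.loc[:, selector.get_support()]

# A simple correlation filter: drop one column from every highly correlated pair.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
X_decorrelated = X.drop(columns=to_drop)

print(X_reduced.shape, X_decorrelated.shape)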
I have come across a peculiar situation when preprocessing data.
Let's say I have a dataset A. I split the dataset into A_train and A_test. I fit a scaler (any of the scikit-learn scalers) on A_train and transform A_test with it. Now, training the neural network with A_train and validating on A_test works well: no overfitting, and performance is good.
Let's say I have dataset B with the same features as in A, but with different ranges of values for the features. A simple example of A and B could be Boston and Paris housing datasets respectively (This is just an analogy to say that features ranges like the cost, crime rate, etc vary significantly ). To test the performance of the above trained model on B, we transform B according to scaling attributes of A_train and then validate. This usually degrades performance, as this model is never shown the data from B.
The peculiar thing is that if I fit and transform on B directly, instead of using the scaling attributes of A_train, the performance is a lot better. Usually this reduces performance if I test it on A_test; in this scenario it seems to work, although it's not right.
Since I work mostly on climate datasets, training on every dataset is not feasible. Therefore I would like to know the best way to scale such different datasets with the same features to get better performance.
Any ideas, please.
PS: I know training my model with more data can improve performance, but I am more interested in the right way of scaling. I tried removing outliers from datasets and applied QuantileTransformer, it improved performance but could be better.
One possible solution could be like this.
Normalize (pre-process) the dataset A such that the range of each features is within a fixed interval, e.g., between [-1, 1].
Train your model on the normalized set A.
Whenever you are given a new dataset like B:
(3.1) Normalize the new dataset such that the features have the same range as they have in A ([-1, 1]).
(3.2) Apply your trained model (step 2) on the normalized new set (3.1).
As you have a one-to-one mapping between set B and its normalized version, you can read off the predictions for set B from the predictions on the normalized set B.
Note that you do not need access to set B in advance (or to such sets, if there are hundreds of them). You normalize each one as soon as you are given it and want to test your trained model on it.
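A minimal sketch of this per-dataset normalization, using MinMaxScaler as one possible choice; the arrays A_train, A_test and B below are toy stand-ins for the datasets in the question:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

A_train = np.random.rand(200, 4) * 100     # toy data for dataset A
A_test  = np.random.rand(50, 4) * 100
B       = np.random.rand(80, 4) * 1000     # same features, very different ranges

# Steps 1-2: normalize A to [-1, 1] with A_train's statistics and train on it.
scaler_A = MinMaxScaler(feature_range=(-1, 1)).fit(A_train)
A_train_n = scaler_A.transform(A_train)
A_test_n  = scaler_A.transform(A_test)

# Step 3.1: when a new dataset B arrives, normalize it to the same interval
# using its own statistics, then apply the model trained on A_train_n to B_n.
scaler_B = MinMaxScaler(feature_range=(-1, 1)).fit(B)
B_n = scaler_B.transform(B)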
I am working with a dataset of about 400,000 rows x 250 columns.
I have a problem with the model yielding a very good R^2 score when tested on the training set, but an extremely poor one when used on the test set. Initially, this sounds like overfitting. But the data is split into training/test sets at random and the dataset is pretty big, so I feel like there has to be something else.
Any suggestions?
Splitting dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['SalePrice'], axis=1),
                                                     df.SalePrice, test_size=0.3)
Sklearn's Linear Regression estimator
from sklearn import linear_model
linReg = linear_model.LinearRegression() # Create linear regression object
linReg.fit(X_train, y_train) # Train the model using the training sets
# Predict from training set
y_train_linreg = linReg.predict(X_train)
# Predict from test set
y_pred_linreg = linReg.predict(X_test)
Metric calculation
from sklearn import metrics
metrics.r2_score(y_train, y_train_linreg)
metrics.r2_score(y_test, y_pred_linreg)
R^2 score when testing on training set: 0.64
R^2 score when testing on testing set: approximately -10^23
While I agree with Mihai that your problem definitely looks like overfitting, I don't necessarily agree with his answer that a neural network would solve it; at least, not out of the box. By themselves, neural networks overfit more, not less, than linear models. You need to take care of your data somehow; hardly any model can do that for you. A few options that you might consider (apologies, I cannot be more precise without looking at the dataset):
Easiest thing: use regularization. 400k rows is a lot, but with 250 dimensions you can overfit almost whatever you like. So try replacing LinearRegression with Ridge or Lasso (or Elastic Net or whatever); see http://scikit-learn.org/stable/modules/linear_model.html (Lasso has the advantage of discarding features for you, see the next point and the sketch after this list).
Especially if you want to go beyond linear models (and you probably should), it's advisable to first reduce the dimension of the problem; as I said, 250 is a lot. Try using some of the feature-selection techniques here: http://scikit-learn.org/stable/modules/feature_selection.html
Probably most important of all, you should consider adapting your input data. The very first thing I'd try, assuming you really are trying to predict a price as your code implies, is to replace the target with its logarithm, or log(1+x). Otherwise linear regression will try very, very hard to fit the single object that was sold for 1 million dollars while ignoring everything below $1k. Just as important, check whether you have any non-numeric (categorical) columns and keep them only if you need them, reducing them to macro-categories if necessary: a categorical column with 1000 possible values will increase your problem dimension by 1000, making overfitting all but certain. A single column with a unique categorical value for each input (e.g. the buyer's name) will lead you straight to perfect overfitting.
After all this (cleaning the data and reducing the dimension via one of the methods above, or just Lasso regression, until you get to well under 100 dimensions, possibly under 20 - and remember that this includes any categorical data!), you should consider non-linear methods to further improve your results - but that's useless until your linear model gives you at least a mildly positive R^2 value on test data. sklearn provides a lot of them.

http://scikit-learn.org/stable/modules/kernel_ridge.html is the easiest to use out of the box (it also does regularization), but it might be too slow in your case (you should first try this, and any of the following, on a subset of your data, say 1000 rows, once you've selected only 10 or 20 features, and see how slow that is). http://scikit-learn.org/stable/modules/svm.html#regression has many different flavours, but I think all but the linear one would be too slow. Sticking to linear models, http://scikit-learn.org/stable/modules/sgd.html#regression is probably the fastest, and is how I'd train a linear model on this many samples.

Going truly non-linear, the easiest techniques would probably include some kind of trees, either directly (http://scikit-learn.org/stable/modules/tree.html#regression, but that's an almost-certain overfit) or, better, via an ensemble technique (random forests, http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees, are the typical go-to algorithm; gradient boosting, http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting, sometimes works better). Finally, state-of-the-art results are indeed generally obtained via neural networks, see e.g. http://scikit-learn.org/stable/modules/neural_networks_supervised.html, but for these methods sklearn is generally not the right answer and you should look at dedicated environments (TensorFlow, Caffe, PyTorch, etc.)... however, if you're not familiar with those, it is certainly not worth the trouble!
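As a hedged illustration of the two easiest suggestions above (regularization and a log(1+x) target), here is a sketch that reuses the X_train / X_test split from the question; Ridge with alpha=1.0 is an arbitrary starting point, not a tuned choice, and it assumes all remaining columns are numeric:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score

# Ridge adds L2 regularization; the target is fitted on log(1 + SalePrice)
# and predictions are mapped back with expm1.
model = TransformedTargetRegressor(regressor=Ridge(alpha=1.0),
                                   func=np.log1p, inverse_func=np.expm1)
model.fit(X_train, y_train)

print(r2_score(y_train, model.predict(X_train)))  # training R^2
print(r2_score(y_test, model.predict(X_test)))    # test R^2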
Use case:
I have a small dataset with about 3-10 samples in each class. I am using sklearn's SVC with an RBF kernel to classify them.
I need the confidence of the prediction along with the predicted class, so I used the predict_proba method of SVC.
I was getting weird results with it. I searched a bit and found out that it is meaningful only for larger datasets.
Found this question on stack Scikit-learn predict_proba gives wrong answers.
The author of the question verified this by multiplying the dataset, thereby duplicating the dataset.
My questions:
1) If I multiply my dataset by, let's say, 100, so that each sample appears 100 times, it increases the "correctness" of predict_proba. What side effects will this have? Overfitting?
2) Is there any other way I can calculate the confidence of the classifier? Like distance from the hyperplanes?
3) For this small sample size, is SVM a recommended algorithm or should I choose something else?
First of all: Your data set seems very small for any practical purposes. That being said, let's see what we can do.
SVMs are mainly popular in high-dimensional settings. It is currently unclear whether that applies to your project. They build planes on a handful of (or even single) supporting instances, and are often outperformed by neural nets in situations with large training sets. A priori, they might not be your worst choice.
Oversampling your data will do little for an approach using an SVM. SVMs are based on the notion of support vectors, which are basically the outliers of a class that define what is in the class and what is not. Oversampling will not construct new support vectors (I am assuming you are already using the train set as the test set).
Plain oversampling in this scenario will also not give you any new information on confidence, other than artifacts constructed by unbalanced oversampling, since the instances will be exact copies and no distribution changes will occur. You might be able to find some information by using SMOTE (Synthetic Minority Oversampling Technique). You would basically generate synthetic instances based on the ones you have. In theory this provides you with new instances that are not exact copies of the ones you have, and that might thus fall a little outside the normal classification. Note: by definition, all these examples will lie in between the original examples in your sample space. This does not mean that they will lie in between them in your projected SVM space, so you may end up learning effects that aren't really true.
Lastly, you can estimate confidence with the distance to the hyperplane; see the sketch below and https://stats.stackexchange.com/questions/55072/svm-confidence-according-to-distance-from-hyperline
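A minimal sketch of using the distance to the decision boundary as a confidence score instead of predict_proba; the toy data sizes roughly mirror the 3-10 samples per class described in the question:

import numpy as np
from sklearn.svm import SVC

X = np.random.rand(30, 5)          # 30 samples, 5 features
y = np.repeat(np.arange(3), 10)    # 3 classes with 10 samples each

clf = SVC(kernel='rbf', gamma='scale').fit(X, y)

# Signed distances to the decision boundaries: larger magnitude means the
# sample lies further from the boundary, i.e. a more confident prediction.
# With the default one-vs-rest shape this gives one score per class.
scores = clf.decision_function(X[:2])
print(scores)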