I have a multi-class classification problem and the data is heavily skewed. My target variable (y) has 3 classes, and their percentages in the data are as follows:
- 0=3%
- 1=90%
- 2=7%
I am looking for packages in R that can do multi-class oversampling, undersampling, or both.
If it is not doable in R, then where can I handle this problem?
PS:
I tried the ROSE package in R, but it works only for binary-class problems.
Well, there is the caret package, which offers a wide range of ML algorithms, including for multi-class problems.
It can also apply down- and upsampling via downSample() and upSample():
trainclass <- data.frame("label" = c(rep("class1", 100), rep("class2", 20), rep("class3", 180)),
                         "predictor1" = rnorm(300, 0, 1),
                         "predictor2" = sample(c("this", "that"), 300, replace = TRUE))
> table(trainclass$label)
class1 class2 class3
100 20 180
# then use
set.seed(234)
dtrain <- downSample(x = trainclass[, -1],
                     y = trainclass$label)
> table(dtrain$Class)
class1 class2 class3
20 20 20
Nice feature: it can also do downsampling and upsampling, as well as SMOTE and ROSE, while applying resampling procedures (such as cross-validation).
The following performs 10-fold cross-validation using downsampling:
ctrl <- caret::trainControl(method = "cv",
                            number = 10,
                            verboseIter = FALSE,
                            summaryFunction = multiClassSummary,
                            sampling = "down")
set.seed(42)
model_rf_under <- caret::train(Class ~ .,
                               data = data,
                               method = "rf",
                               trControl = ctrl)
See further information here:
https://topepo.github.io/caret/subsampling-for-class-imbalances.html
Also check out the mlr package:
https://mlr.mlr-org.com/articles/tutorial/over_and_undersampling.html#sampling-based-approaches
You can use the SMOTE function from the DMwR package. I have created a sample dataset with three imbalanced classes.
install.packages("DMwR")
library(DMwR)
## A small example with a data set created artificially from the IRIS
## data
data(iris)
#setosa 90%, versicolor 3% and virginica 7%
Species <- c(rep("setosa", 135), rep("versicolor", 5), rep("virginica", 10))
data <- cbind(iris[, 1:4], Species)
table(data$Species)
Imbalanced classes:
setosa versicolor virginica
135 5 10
Now, to recover the two minority classes, apply the SMOTE function twice to the data:
First_Imbalance_recover <- DMwR::SMOTE(Species ~ ., data, perc.over = 2000, perc.under = 100)
Final_Imbalance_recover <- DMwR::SMOTE(Species ~ ., First_Imbalance_recover, perc.over = 2000, perc.under = 200)
table(Final_Imbalance_recover$Species)
Final balanced classes:
setosa versicolor virginica
79 81 84
NOTE: The new examples are generated using information from the k nearest
neighbors of each example of the minority class; the parameter k controls
how many of these neighbors are used. So the class counts may vary from run
to run, which shouldn't affect the overall balancing.
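As an aside on the question's follow-up ("where else can I handle this problem?"): if you are open to Python, the imbalanced-learn package supports multi-class over- and undersampling out of the box. A minimal sketch, assuming imbalanced-learn is installed, using synthetic data that mimics the 3%/90%/7% split:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# synthetic 3-class data, roughly 3% / 90% / 7%
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=4,
                           weights=[0.03, 0.90, 0.07], random_state=42)
print(Counter(y))

# oversample all minority classes up to the majority count
X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_over))

# undersample all majority classes down to the minority count
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_under))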
I'm running an LGBM classifier model and I'm able to use lgbm.plot_importance to plot the most important features, but I would prefer to have a list of these features instead. Does anybody know how to go about doing this?
The lightgbm.Booster object has a method .feature_importance() which can be used to access feature importances.
That method returns an array with one importance value per feature, and supports two types of importance, based on the value of importance_type:
"gain" = "cumulative gain of all splits using this feature"
"split" = "number of splits this feature was used in"
You can explore this using the following code. I ran this with lightgbm==3.3.0, numpy==1.21.0, pandas==1.2.3, and scikit-learn==0.24.1, using Python 3.8.
import lightgbm as lgb
import pandas as pd
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
data = lgb.Dataset(X, label=y)
# train model
bst = lgb.train(
    params={"objective": "binary"},
    train_set=data,
    num_boost_round=10
)
# compute importances
importance_df = (
    pd.DataFrame({
        'feature_name': bst.feature_name(),
        'importance_gain': bst.feature_importance(importance_type='gain'),
        'importance_split': bst.feature_importance(importance_type='split'),
    })
    .sort_values('importance_gain', ascending=False)
    .reset_index(drop=True)
)
print(importance_df)
Here's an example of the output.
feature_name importance_gain importance_split
0 Column_22 1051.204456 8
1 Column_23 862.363854 10
2 Column_27 262.272097 19
3 Column_7 161.842017 13
4 Column_21 66.431762 24
This says that, for example, feature Column_21 was used in more splits than the other top features, but the improvement those splits provided was much less impactful than that of the 8 splits using Column_22.
It seems like you are using the scikit-learn API for LightGBM. This should help.
General idea:
LGBMClassifier.feature_importances_
Particular case:
model_name.feature_importances_
Full code snippet (assuming pandas dataframe was used for training):
features = train_x.columns
importances = model.feature_importances_
feature_importance = (
    pd.DataFrame({'importance': importances, 'features': features})
    .sort_values('importance', ascending=False)
    .reset_index(drop=True)
)
feature_importance
Also you can plot importances:
lgb.plot_importance(model_name)
I am trying to use a wrapper method in Python to implement simple forward selection using KNN on the data I have.
My data:
ID S_LENGTH S_WIDTH P_LENGTH P_WIDTH SPECIES
------------------------------------------------------------------
1 3.5 2.5 5.6 1.7 VIRGINICA
2 4.5 5.6 3.4 8.7 SETOSA
This is where I have defined X and y:
X = df[['S_LENGTH', 'S_WIDTH', 'P_LENGTH', 'P_WIDTH']].values
y = df['SPECIES'].values
This is a simple KNN model:
clf = neighbors.KNeighborsClassifier()
clf.fit(X_fs,y)
predictions = clf.predict(X_fs)
metrics.accuracy_score(y, predictions)
Therefore, how would I implement a KNN model using forward selection?
Thanks!
I do not believe that KNN has a built-in feature importance, so you basically have three options. First, you can use a model-agnostic version of feature importance, like permutation importance.
Second, you can try adding one feature at a time at each step and pick the model that most increases performance; a sketch of this appears after the code below.
Third (closely related to the second), just try every combination! Since you only have 4 features, and assuming you don't have too much data, you can just try all combinations of features. There are 4 models with one feature, 6 (4 choose 2) with two features, 4 with three, and 1 with all four. That's probably less computation than the two ideas above.
So something like this:
feat_lists = [
['S_LENGTH'],
['S_WIDTH'],
...
['S_LENGTH', 'S_WIDTH', 'P_LENGTH'],
['S_LENGTH', 'S_WIDTH', 'P_WIDTH'],
...
['S_LENGTH', 'S_WIDTH', 'P_LENGTH', 'P_WIDTH']
]
for feats in feat_lists:
    X = df[feats].values
    y = df['SPECIES'].values
    ...all your other code...
    print(feats)
    print(metrics.accuracy_score(y, predictions))
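If you do want a true forward selection (the second option), here is a minimal sketch, assuming scikit-learn >= 0.24 is available; SequentialFeatureSelector greedily adds, at each step, the feature that most improves cross-validated accuracy:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SequentialFeatureSelector

X = df[['S_LENGTH', 'S_WIDTH', 'P_LENGTH', 'P_WIDTH']]
y = df['SPECIES'].values

selector = SequentialFeatureSelector(
    KNeighborsClassifier(),
    n_features_to_select=2,  # how many features to keep; tune as needed
    direction='forward',     # add features one at a time
    scoring='accuracy',
    cv=5,
)
selector.fit(X, y)
print(X.columns[selector.get_support()])  # names of the selected features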
To clarify, I'm assuming that's not actually your data, but only the first two rows, correct? If you only have two rows, you have bigger problems :)
Let's say there are some 20 categorical columns in the data, each having a different set of unique categorical values. Now a train/test split has to be done, and one needs to ensure that all unique categories are included in the train set. How can this be done? I have not tried it yet, but should all these columns be included in the stratify argument?
Yes. That's correct.
For demonstration, I'm using Melbourne Housing Dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
Meta = pd.read_csv('melb_data.csv')
Meta = Meta[["Rooms", "Type", "Method", "Bathroom"]]
print(Meta.head())
print("\nBefore split -- Method feature distribution\n")
print(Meta.Method.value_counts(normalize=True))
print("\nBefore split -- Type feature distribution\n")
print(Meta.Type.value_counts(normalize=True))
train, test = train_test_split(Meta, test_size = 0.2, stratify=Meta[["Method", "Type"]])
print("\nAfter split -- Method feature distribution\n")
print(train.Method.value_counts(normalize=True))
print("\nAfter split -- Type feature distribution\n")
print(train.Type.value_counts(normalize=True))
Output
Rooms Type Method Bathroom
0 2 h S 1.0
1 2 h S 1.0
2 3 h SP 2.0
3 3 h PI 2.0
4 4 h VB 1.0
Before split -- Method feature distribution
S 0.664359
SP 0.125405
PI 0.115169
VB 0.088292
SA 0.006775
Name: Method, dtype: float64
Before split -- Type feature distribution
h 0.695803
u 0.222165
t 0.082032
Name: Type, dtype: float64
After split -- Method feature distribution
S 0.664396
SP 0.125368
PI 0.115151
VB 0.088273
SA 0.006811
Name: Method, dtype: float64
After split -- Type feature distribution
h 0.695784
u 0.222202
t 0.082014
Name: Type, dtype: float64
You want all categories from all categorical variables to be in your train split.
Using:
train, test = train_test_split(Meta, test_size = 0.2, stratify=Meta[["Method", "Type"]])
ensures that all categories are in both the train split and the test split, which is more than what you asked for.
Note that the more categorical variables you stratify on, the more likely it is that some combination of categories has only one record; if that happens, the split will fail with this error message:
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
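One possible workaround, as a sketch (reusing the Meta frame from the answer above): build a single combined stratification key and drop the combinations that occur only once, so train_test_split has at least 2 rows per group:
import pandas as pd
from sklearn.model_selection import train_test_split

# combine the stratification columns into one key per row
strat_key = Meta["Method"].astype(str) + "_" + Meta["Type"].astype(str)

# keep only rows whose combination occurs at least twice
counts = strat_key.value_counts()
keep = strat_key.isin(counts[counts >= 2].index)

train, test = train_test_split(Meta[keep], test_size=0.2,
                               stratify=strat_key[keep])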
How can I use more than one feature/class as input/output in an LSTM using Sequential from Keras in Python?
To be more specific, I would like to use [FeatureA][FeatureB][FeatureC] as both the input and the output of the network:
FeatureA is a categorical feature with 100 possible values, indicating the sensor that collected the data;
FeatureB is an on/off indicator, either 0 or 1;
FeatureC is also a categorical feature, with 5 unique values.
Data Example:
1. 40,1,5
2. 58,1,2
3. 57,1,5
4. 40,0,1
5. 57,1,4
6. 23,0,3
When using the raw data and loss='categorical_crossentropy' in model.compile, the loss is over 10.0.
When normalizing the data to values between 0 and 1 and using mean_squared_error as the loss, it averages 0.27. But when testing predictions, the results do not make any sense.
Any suggestions here or tutorials I could consult?
Thanks in advance.
You need to convert FeatureC to a binary category. Mean squared error is for regression, and as best I can tell you are trying to predict which class a certain combination of sensors and states belongs to. Since there are 5 possible classes, you can think of it as trying to predict whether the class is Red, Green, Blue, Yellow, or Purple. Right now those are represented by numbers, but a regression model is going to predict values like 3.24, which doesn't make sense.
In effect you're converting the values of FeatureC to 5 columns of binary values. Since it seems like the classes are exclusive, there should be a single 1 per row and the rest of the columns would be 0s. So if the first row is 'red', it would be [1, 0, 0, 0, 0].
For best results you should also convert FeatureA to binary categorical features. For the same reason as above, sensor 80 is not 4x "more" than sensor 20; it is a different entity.
The last layer of your model should be a softmax layer with 5 neurons: in the end, your model is going to predict the probability of each class.
It looks like you are working with a dataframe since there's an index. Therefore I would try:
import keras
import numpy as np
import pandas as pd # assume that this has probably already been done though
feature_a = data.loc[:, "FeatureA"].values # pull out feature A
labels = data.loc[:, "FeatureC"].values # y are the labels that we're trying to predict
feature_b = data.loc[:, "FeatureB"].values # pull out B to make things more clear later
# make sure that feature_b.shape = (rows, 1) otherwise reset the shape
# so hstack works later
feature_b = feature_b.reshape(feature_b.shape[0], 1)
labels -= 1 # subtract 1 from all labels values to zero base (0 - 4)
y = keras.utils.to_categorical(labels)
# y.shape should be (rows, 5)
# convert 1-100 to binary columns
# zero base again
feature_a -= 1
# Before: feature_a.shape=(rows, 1)
feature_a_data = keras.utils.to_categorical(feature_a)
# After: feature_a_data.shape=(rows, 100)
data = np.hstack([feature_a_data, feature_b])
# data.shape should be (rows, 101)
# y.shape should be (rows, 5)
Now you're ready to train/test split and so on.
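For completeness, here is a minimal sketch of the model described above; the hidden-layer size and optimizer are my assumptions, not from the question, and it uses a plain dense network on the encoded features (the LSTM the question mentions would additionally require reshaping the rows into sequences):
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation="relu", input_shape=(101,)))  # 100 one-hot sensor columns + FeatureB
model.add(Dense(5, activation="softmax"))  # one probability per FeatureC class
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(data, y, epochs=10, batch_size=32)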
Here's something to look at which has multi-class prediction:
https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
Problem
In an online process consisting of different steps, I have data on people that complete the process and people that drop out. For each user, the data consist of a sequence of process steps, one per time interval, let's say a second.
An example of such a sequence for a completed user would be [1,1,1,1,2,2,2,3,3,3,3....-1], where the user is in step 1 for four seconds,
followed by step 2 for three seconds and step 3 for four seconds, and so on, before reaching the end of the process (denoted by -1).
An example of a dropout would be [1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2], where the user spends an excessively long time in step 1, then 5 seconds in step 2, and then closes the webpage (so never reaching the end, -1).
Based on a model I would like to be able to predict/classify online (as in in 'real-time') the probability of the user completing the process or dropping out.
Approach
I have read about HMMs and I would like to apply the following principle:
- train one model using the sequences of people that completed the process
- train another model using the sequences of people that did not complete the process
- collect the stream of incoming data of an unseen user, and at each timestep use the forward algorithm on each model to see which of the two is most likely to have produced this stream; the corresponding model then represents the label associated with this stream
What is your opinion? Is this doable? I have been looking at the Python libraries hmmlearn and pomegranate, but I cannot seem to create a small working example to test with. Some test code of mine can be found below, with some artificial data:
from pomegranate import *
import numpy as np
# generate data of some sample sequences of length 4
# mean and std of each step in sequence
means = [1,2,3,4]
stds = [0.1, 0.1, 0.1, 0.1]
num_data = 100
data = []
for mean, std in zip(means, stds):
    d = np.random.normal(mean, std, num_data)
    data.append(d)
data = np.array(data).T
# create model (based on sample code of pomegranate https://github.com/jmschrei/pomegranate/blob/master/tutorials/Tutorial_3_Hidden_Markov_Models.ipynb)
s1 = State( NormalDistribution( 1, 1 ), name="s1" )
s2 = State( NormalDistribution( 2, 1 ), name="s2" )
model = HiddenMarkovModel()
model.add_states( [s1, s2] )
model.add_transition( model.start, s1, 0.5, pseudocount=4.2 )
model.add_transition( model.start, s2, 0.5, pseudocount=1.3 )
model.add_transition( s1, s2, 0.5, pseudocount=5.2 )
model.add_transition( s2, s1, 0.5, pseudocount=0.9 )
model.bake()
#model.plot()
# fit model
model.fit( data, use_pseudocount=False, algorithm = 'baum-welch', verbose=False )
# get probability of very clean sequence (mean of each step)
p = model.probability([1,2,3,4])
print(p)  # 3.51e-112
I would expect the probability of this very clean sequence to be close to 1, since the values are the means of each of the step distributions. How can I make this example better, and eventually apply it to my application?
Concerns
I am not sure what states and transitions my model should comprise. What is a 'good' model? How do you know, given the data, whether you need to add more states to make the model more expressive? The tutorials of pomegranate are nice but insufficient for me to apply HMMs in this context.
Yes, an HMM is a viable way to do this, although it's a bit of overkill, since the FSM is a simple linear chain. The "model" can also be built from the mean and variation for each string length, and you can simply compare the distance of the partial string to each set of parameters, rechecking at each desired time point.
The states are simple enough:
1 ==> 2 ==> 3 ==> ... ==> done
Each state has a loop back to itself; this is the most frequent choice.
There is also a transition to "failed" from any state.
Thus, the Markov matrix will be sparse, something like:
          1     2     3     4    done  failed
1        0.8   0.1   0     0    0     0.1
2        0     0.8   0.1   0    0     0.1
3        0     0     0.8   0.1  0     0.1
4        0     0     0     0.8  0.1   0.1
done     0     0     0     0    1.0   0
failed   0     0     0     0    0     1.0
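Since the steps themselves are observed, each "model" here is really a plain Markov chain, so the forward algorithm collapses to multiplying transition probabilities along the sequence. A minimal numpy sketch of scoring a partial stream against such a matrix (the matrices for completers and dropouts would each be estimated from that group's sequences):
import numpy as np

state_index = {1: 0, 2: 1, 3: 2, 4: 3, "done": 4, "failed": 5}

# the sparse transition matrix from above; each row sums to 1
T = np.array([
    [0.8, 0.1, 0.0, 0.0, 0.0, 0.1],
    [0.0, 0.8, 0.1, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.8, 0.1, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.8, 0.1, 0.1],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])

def log_likelihood(seq, T):
    # log-probability of an observed step sequence under transition matrix T
    # (a zero-probability transition yields -inf, i.e. "impossible under this model")
    idx = [state_index[s] for s in seq]
    return sum(np.log(T[i, j]) for i, j in zip(idx, idx[1:]))

# score a partial stream under each model and pick the label with the
# higher log-likelihood, rechecking at every timestep as the stream grows
stream = [1, 1, 1, 2, 2, 3]
print(log_likelihood(stream, T))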