Issues with imbalanced dataset in case of binary classification


I have a binary classification problem where the class distribution is {0: 85%, 1: 15%}. I have tried re-weighting class_weights and other sampling approaches, but all the approaches that I have used give me unsatisfactory results.
My dataset is (91125,57).
Accuracy:1
F1-Score:1
F2-Score:1
Precision:1
Recall:1
AUCROC:1
Kappa:1
Is there any other method I can use to handle such a situation?

Make sure you're dropping the target variable from your features before feeding the data to the classifier:
X = df.drop('target',axis=1)
y = df['target']
I'd also check whether some independent variables are highly correlated with the target. It may give you an idea of what causes an unrealistically perfect classification:
import seaborn as sns
sns.heatmap(X_train.corr())
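A heatmap of X_train alone does not include the target, so it can help to rank the features by their correlation with the target directly. The following is only a sketch on synthetic data: the column names, the deliberately leaky column, and the 0.95 threshold are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 15% positives, as in the question.
rng = np.random.default_rng(0)
n = 1000
y = (rng.random(n) < 0.15).astype(int)
df = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
    "leaky_copy": y * 2.0 + 1.0,  # hypothetical column derived from the target
    "target": y,
})

X = df.drop("target", axis=1)

# Absolute correlation of every feature with the target, strongest first.
target_corr = X.corrwith(df["target"]).abs().sort_values(ascending=False)

# The 0.95 cut-off is an arbitrary assumption; tune it for your own data.
suspects = target_corr[target_corr > 0.95].index.tolist()

print(target_corr)
print("Suspicious columns:", suspects)
```

Any column flagged this way is worth inspecting by hand before training, since it may be a disguised copy of the label.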

Related

ValueError: Unknown label type: 'continuous' in DecisionTreeClassifier()

I am trying to create a model which predicts results column below:
Date Open High Close Result
1/22/2010 25.95 31.29 30.89 0.176104
2/19/2010 23.98 24.22 23.60 -0.343760
3/19/2010 21.46 23.16 22.50 0.124994
4/23/2010 21.32 21.77 21.06 -0.765601
5/21/2010 55.41 55.85 49.06 0.302556
The code I am using is:
import pandas
from sklearn.tree import DecisionTreeClassifier
dataset = pandas.read_csv('data.csv')
X = dataset.drop(columns=['Date','Result'])
y = dataset.drop(columns=['Date', 'Open', 'High', 'Close'])
model = DecisionTreeClassifier()
model.fit(X, y)
But I am getting an error:
ValueError: Unknown label type: 'continuous'
Suggestion for using other algorithms are also welcome.

In ML, it's important as a first step to consider the nature of your problem. Is it a regression or a classification problem? Do you have target data (supervised learning), or is this a problem where you don't have a target and want to learn more about your data's inherent structure (unsupervised learning)? Then, consider what steps you need to take in your pipeline to prepare your data (preprocessing).
In this case, you are passing floats (floating point numbers) to a Classifier (DecisionTreeClassifier). The problem with this is that a classifier generally separates distinct classes, and so this classifier expects a string or an integer type to distinguish different classes from each other (this is known as the "target"). You can read more about this in an introduction to classifiers.
The problem you seek to solve is to determine a continuous numerical output, Result. This is known as a regression problem, so you need to use a regression algorithm (such as the DecisionTreeRegressor). You can try other regression algorithms once you have this simple one working. It is a good place to start: it is fairly straightforward to understand, fairly transparent, fast, and easily implemented, so decision trees were a great choice of starting point!
As a further note, it is important to consider preprocessing your data. You have done some of this simply by separating your target from your input data:
X = dataset.drop(columns=['Date','Result'])
y = dataset.drop(columns=['Date', 'Open', 'High', 'Close'])
However, you may wish to look into preprocessing further, particularly standardisation of your data. Many ML algorithms need this step to interpret your data properly (tree-based models are largely insensitive to feature scale, but it matters for many others). There's a saying that goes: "Garbage in, garbage out".
Part of preprocessing sometimes requires you to change the data type of a given column. The error posted in your question, at face value, leads one to think that the issue on hand is that you need to change data types. But, as explained, in the case of your problem, it wouldn't help to do that, given that you seek to use regression to determine a continuous output.
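To see why the classifier rejects this target, one can ask scikit-learn how it categorises a label array. The sketch below uses type_of_target from sklearn.utils.multiclass on the Result values from the question and, for contrast, on a made-up array of integer class labels (the latter is an assumption for illustration only).

```python
from sklearn.utils.multiclass import type_of_target

# The Result values from the question: arbitrary floats.
result_values = [0.176104, -0.343760, 0.124994, -0.765601, 0.302556]

# Hypothetical class labels, e.g. 1 = "went up", 0 = "went down".
class_labels = [1, 0, 1, 0, 1]

float_kind = type_of_target(result_values)
label_kind = type_of_target(class_labels)

print(float_kind)  # the same label type named in the error message
print(label_kind)
```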

You are using DecisionTreeClassifier, which is a classifier and will only predict categorical values such as 0 or 1. Your Result column is continuous, so you should use DecisionTreeRegressor.
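A minimal sketch of that swap, using the five rows shown in the question in place of data.csv (the inline DataFrame and random_state=0 are assumptions made so the snippet is self-contained):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# The five sample rows from the question, standing in for data.csv.
dataset = pd.DataFrame({
    "Date": ["1/22/2010", "2/19/2010", "3/19/2010", "4/23/2010", "5/21/2010"],
    "Open": [25.95, 23.98, 21.46, 21.32, 55.41],
    "High": [31.29, 24.22, 23.16, 21.77, 55.85],
    "Close": [30.89, 23.60, 22.50, 21.06, 49.06],
    "Result": [0.176104, -0.343760, 0.124994, -0.765601, 0.302556],
})

X = dataset.drop(columns=["Date", "Result"])
y = dataset["Result"]  # a 1-D Series of continuous values

# A regressor accepts continuous targets, unlike DecisionTreeClassifier.
model = DecisionTreeRegressor(random_state=0)
model.fit(X, y)

predictions = model.predict(X)
print(predictions)
```

Predicting on the training rows, as done here, only shows that the call works; it says nothing about how well the model generalises.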

A few suggestions
Your approach is a good try, but I don't think it's the right approach.
In ML modelling, there are 3 main categories of models:
Regression: Have you heard of Newton's laws? These are the kind of ML models that help identify the hidden rules and logic in data.
Classification: These are the type of ML models that are used to separate data into different categories.
Time series ML models: This is like stock market data analytics. Unlike the above, here the value on date X depends on the values at X-1, X-2, X-3 and so on. This is somewhat closer to regression, but it requires models like ARIMA.
As for the error, DecisionTreeClassifier is supposed to be used for identifying categories like 1, 2, 3, 4 and so on, but only for a limited set of classes.
For a series like your Result, which is continuous and fractional, you should use regression models or ARIMA-like time series models.

MinMaxScaler + DecisionTree classifier with numerical and categorical data

I would like to know how I should manage the following situation:
I have a dataset which I need to analyze. It is labeled data and I need to perform a classification task on it. Some features are numerical and others are categorical (non-ordinal), and my problem is that I don't know how to handle the categorical ones.
Before classifying, I usually apply a MinMaxScaler. But I can't do this with this particular dataset because of the categorical features.
I've read about one-hot encoding, but I don't understand how to apply it to my case, because my dataset has some numerical features and 10 categorical features. One-hot encoding generates more columns in the dataframe, and I don't know how I need to prepare the resulting dataframe to send it to the decision tree classifier.
In order to clarify the situation the code I'm using so far is the following:
y = df['class']
X = df.drop(['class'] , axis=1)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# call DecisionTree classifier
When the df has categorical features I get the following error: TypeError: data type not understood. So, if I apply one-hot encoding I get a dataframe with many columns, and I don't know if the decision tree classifier is going to understand the real structure of my data. I mean, how can I tell the classifier that a group of columns belongs to a specific feature? Am I understanding the whole situation wrong? Sorry if this is a confusing question, but I am a newbie and I feel pretty confused about how to handle this.

I don't have enough reputation to comment, but note that decision tree classifiers don't require their input to be scaled. So if you're using a decision tree classifier, just use the features as they appear.
If you're using a method that requires feature scaling, then you should probably do one-hot-encoding and feature scaling separately - see this answer: https://stackoverflow.com/a/43798994/9988333
Alternatively, you could use a method that handles categorical variables 'out of the box', such as LGBM.
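One way to handle the two kinds of columns separately is scikit-learn's ColumnTransformer, sketched below. The toy DataFrame and its column names are assumptions made for illustration; only 'class' comes from the question. The scaler is kept only to show where it would go for a model that needs it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: two numerical features, one categorical, one label.
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 19, 63],
    "income": [1800.0, 3200.0, 4100.0, 2900.0, 1500.0, 3900.0],
    "city": ["Rome", "Oslo", "Rome", "Lima", "Oslo", "Lima"],
    "class": [0, 1, 1, 0, 0, 1],
})

y = df["class"]
X = df.drop(columns=["class"])

numeric_cols = ["age", "income"]
categorical_cols = ["city"]

# Scale only the numerical columns and one-hot encode only the categorical ones.
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])
clf.fit(X, y)

# Two scaled numeric columns plus one indicator column per city.
encoded_shape = clf.named_steps["preprocess"].transform(X).shape
predictions = clf.predict(X)
print(encoded_shape)
```

The classifier does not need to be told which indicator columns belong together; it simply treats each one as a separate 0/1 feature.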

Linear regression: Good results for training data, horrible for test data

I am working with a dataset of about 400.000 x 250.
I have a problem with the model yielding a very good R^2 score when testing it on the training set, but an extremely poor one when used on the test set. Initially, this sounds like overfitting. But the data is split into training/test sets at random and the data set is pretty big, so I feel like there has to be something else.
Any suggestions?
Splitting dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['SalePrice'],
axis=1), df.SalePrice, test_size = 0.3)
Sklearn's Linear Regression estimator
from sklearn import linear_model
linReg = linear_model.LinearRegression() # Create linear regression object
linReg.fit(X_train, y_train) # Train the model using the training sets
# Predict from training set
y_train_linreg = linReg.predict(X_train)
# Predict from test set
y_pred_linreg = linReg.predict(X_test)
Metric calculation
from sklearn import metrics
metrics.r2_score(y_train, y_train_linreg)
metrics.r2_score(y_test, y_pred_linreg)
R^2 score when testing on training set: 0,64
R^2 score when testing on testing set: -10^23 (approximately)

While I agree with Mihai that your problem definitely looks like overfitting, I don't necessarily agree with his answer that a neural network would solve your problem; at least, not out of the box. By themselves, neural networks overfit more, not less, than linear models. You need to take care of your data somehow; hardly any model can do that for you. A few options that you might consider (apologies, I cannot be more precise without looking at the dataset):
Easiest thing, use regularization. 400k rows is a lot, but with 250 dimensions you can overfit almost whatever you like. So try replacing LinearRegression by Ridge or Lasso (or Elastic Net or whatever). See http://scikit-learn.org/stable/modules/linear_model.html (Lasso has the advantage of discarding features for you, see next point)
Especially if you want to go outside of linear models (and you probably should), it's advisable to first reduce the dimension of the problem, as I said 250 is a lot. Try using some of the Feature selection techniques here: http://scikit-learn.org/stable/modules/feature_selection.html
Probably more important than anything else, you should consider adapting your input data. The very first thing I'd try, assuming you are really trying to predict a price as your code implies, is to replace it by its logarithm, or log(1+x). Otherwise linear regression will try very, very hard to fit that single object that was sold for $1 million while ignoring everything below $1k. Just as important, check whether you have any non-numeric (categorical) columns and keep them only if you need them, reducing them to macro-categories where necessary: a categorical column with 1000 possible values will increase your problem dimension by 1000, making it an assured overfit. A single column with a unique categorical value for each input (e.g. buyer name) will lead you straight to perfect overfitting.
After all this (cleaning data, reducing dimension via either one of the methods above or just Lasso regression until you get to certainly less than dim 100, possibly less than 20 - and remember that this includes any categorical data!), you should consider non-linear methods to further improve your results - but that's useless until your linear model provides you at least some mildly positive R^2 value on test data. sklearn provides a lot of them: http://scikit-learn.org/stable/modules/kernel_ridge.html is the easiest to use out-of-the-box (also does regularization), but it might be too slow to use in your case (you should first try this, and any of the following, on a subset of your data, say 1000 rows once you've selected only 10 or 20 features and see how slow that is). http://scikit-learn.org/stable/modules/svm.html#regression have many different flavours, but I think all but the linear one would be too slow. Sticking to linear things, http://scikit-learn.org/stable/modules/sgd.html#regression is probably the fastest, and would be how I'd train a linear model on this many samples. Going truly out of linear, the easiest techniques would probably include some kind of trees, either directly http://scikit-learn.org/stable/modules/tree.html#regression (but that's an almost-certain overfit) or, better, using some ensemble technique (random forests http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees are the typical go-to algorithm, gradient boosting http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting sometimes works better). Finally, state-of-the-art results are indeed generally obtained via neural networks, see e.g. http://scikit-learn.org/stable/modules/neural_networks_supervised.html but for these methods sklearn is generally not the right answer and you should take a look at dedicated environments (TensorFlow, Caffe, PyTorch, etc.)... 
However, if you're not familiar with those, it is certainly not worth the trouble!
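The first and third suggestions (regularization and a log-transformed target) can be combined in a few lines. This is a sketch on synthetic data, not the asker's dataset: the generated prices, alpha=1.0, and the choice of Ridge over Lasso are all assumptions. TransformedTargetRegressor fits on log(1+price) and converts predictions back to prices.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic prices whose logarithm is roughly linear in two of 20 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
log_price = 10.0 + 0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * rng.normal(size=2000)
sale_price = np.exp(log_price)

X_train, X_test, y_train, y_test = train_test_split(
    X, sale_price, test_size=0.3, random_state=0
)

# Ridge adds L2 regularization; the target is modelled as log(1 + price).
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X_train, y_train)

# R^2 is computed on the original price scale for both splits.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(train_r2, test_r2)
```

Comparing the two scores is the point: a large gap between them suggests the model is still overfitting and needs stronger regularization or fewer features.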

How to do linear regression using Python and Scikit learn using one hot encoding?

I am trying to use linear regression in combination with python and scikitlearn to answer the question "can user session lengths be predicted given user demographic information?"
I am using linear regression because the user session lengths are in milliseconds, which is continuous. I one hot encoded all of my categorical variables including gender, country, and age range.
I am not sure how to take into account my one hot encoding, or if I even need to.
Input Data:
I tried reading here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
I understand that the main inputs are whether to fit an intercept, whether to normalize, whether to copy X (all boolean), and then n_jobs.
I'm not sure what factors to take into account when deciding on these inputs. I'm also concerned whether my one hot encoding of the variables makes an impact.

You can do it like this:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
# X is a numpy array with your features
# y is the label array
enc = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it to a dense array
X_transform = enc.fit_transform(X).toarray()
# apply your linear regression as you want
model = LinearRegression()
model.fit(X_transform, y)
print("Mean squared error: %.2f" % np.mean((model.predict(X_transform) - y) ** 2))
Please note that in this example I am training and testing on the same dataset! The reported error can therefore hide overfitting in your model. You should avoid that by splitting the data or doing cross-validation.
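As a sketch of the cross-validation option, the encoder and the regression can be wrapped in a pipeline and scored on held-out folds. The synthetic demographic data and the effect sizes below are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical features and a session length that depends on them.
rng = np.random.default_rng(0)
genders = rng.choice(["f", "m"], size=200)
countries = rng.choice(["BR", "DE", "US"], size=200)
X = np.column_stack([genders, countries])
y = (
    1000.0 * (countries == "US")
    + 200.0 * (genders == "f")
    + rng.normal(scale=50.0, size=200)
)

# The encoder is refit inside every fold, so no information leaks from the
# held-out part into the training part.
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), LinearRegression())

# scikit-learn reports negated MSE so that higher is always better.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
cv_mse = -scores.mean()
print("Cross-validated mean squared error: %.2f" % cv_mse)
```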

I just wanted to fit a linear regression with sklearn which I use as benchmark for other non-linear approaches, such as MLPRegressor, but also variations of linear regression, such as Ridge, Lasso and ElasticNet (see here for an introduction to this group: http://scikit-learn.org/stable/modules/linear_model.html).
Doing it the same way as described by @silviomoreto (which worked for all other models) actually resulted in an erroneous model (very high errors) for me. This is most likely due to the so-called dummy variable trap, which occurs due to multicollinearity in the variables when you include one dummy variable per category for categorical variables, which is exactly what OneHotEncoder does! See also the following discussion on stats.stackexchange: https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn.
To avoid this, I wrote a simple wrapper that excludes one variable, which then acts as the default.
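from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder

# Written against an older scikit-learn API: the n_values and sparse
# arguments of OneHotEncoder are not available in recent releases.
class DummyEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, n_values='auto'):
        self.n_values = n_values

    def transform(self, X):
        ohe = OneHotEncoder(sparse=False, n_values=self.n_values)
        # [:, :-1] drops only the last column of the encoded matrix
        return ohe.fit_transform(X)[:, :-1]

    def fit(self, X, y=None, **fit_params):
        return self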
So building on the code of @silviomoreto, you would change the line that creates the encoder to:
enc = DummyEncoder()
This solved the problem for me. Note that OneHotEncoder worked fine (and better) for all other models, such as Ridge, Lasso and ANN.
I chose this way, because I wanted to include it in my feature pipeline. But you seem to have the data already encoded. Here, you would have to drop one column per category (e.g. for male/female only include one). So if you for example used pandas.get_dummies(...), this can be done with the parameter drop_first=True.
Last but not least, if you really need to go deeper into linear regression in Python, and not use it just as a benchmark, I would recommend statsmodels over scikit-learn (https://pypi.python.org/pypi/statsmodels), as it provides better model statistics, e.g. p-values per variable, etc.

how to prepare data for sklearn LinearRegression
OneHotEncoder should only be used on the intended columns: those with categorical variables or strings, or integers that are essentially levels rather than numeric values.
DO NOT apply OneHotEncoder to your entire dataset, including numerical variables or Booleans.
To prepare the data for sklearn LinearRegression, the numerical and categorical columns should be handled separately:
numerical columns: standardize if your model contains interactions or polynomial terms
categorical columns: apply one-hot encoding either through sklearn or pd.get_dummies. pd.get_dummies is more flexible, while OneHotEncoder is more consistent with the sklearn API.
drop='first'
As of version 0.22, OneHotEncoder in sklearn has a drop option. For example, OneHotEncoder(drop='first').fit(X) is similar to pd.get_dummies(drop_first=True).
use regularized linear regression
If you use regularized linear regression such as Lasso, multicollinear variables will be penalized and shrunk.
limitation of p-value statistics
The p-value in OLS is only valid when the OLS assumptions are more or less true. While there are methods to deal with situations when p-values cannot be trusted, one potential solution is to use cross validation or leave-one-out for gaining confidence on the model.

python classification without having to impute missing values

I have a dataset that is working nicely in weka. It has a lot of missing values represented by '?'. Using a decision tree, I am able to deal with the missing values.
However, in scikit-learn, I see that the estimators can't be used with data that has missing values. Is there an alternative library I can use instead that would support this?
Otherwise, is there a way to get around this in scikit-learn?

The py-earth package supports missing data. It's still in development and not yet on pypi, but it's pretty usable and well tested at this point and interacts well with scikit-learn. Missingness is handled as described in this paper. It does not assume missingness-at-random, and in fact missingness is treated as potentially predictive. The important assumption is that the distribution of missingness in your training data must be the same as in whatever data you use the model with in operation.
The Earth class provided by py-earth is a regressor. To create a classifier, you need to put it in a pipeline with some other scikit-learn classifier (I usually use LogisticRegression for this). Here's an example:
from pyearth import Earth
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# X and y are some training data (numpy arrays, pandas DataFrames, or
# similar) and X may have some values that are missing (nan, None, or
# some other standard signifier of missingness)
from your_data import X, y
# Create an Earth-based classifier that accepts missing data
earth_classifier = Pipeline([('earth', Earth(allow_missing=True)),
                             ('logistic', LogisticRegression())])
# Fit on the training data
earth_classifier.fit(X, y)
The Earth model handles missingness in a nice way, and the LogisticRegression only sees the transformed data coming out of Earth.transform.
Disclaimer: I am an author of py-earth.
