SciKit LogisticRegression failing to predict accurately - python

I have a set of generated data describing web connections in CSV that looks like this:
conn_duration,conn_destination,response_size,response_code,is_malicious
1564,130,279,532,302,0
1024,200,627,1032,307,0
2940,130,456,3101,201,1
Full CSV here
The class indicates which ones are of interest based on duration, destination_id and response code.
I think LogisticRegression would be a good fit here but the results I'm getting aren't great. On the generated dataset I've got 750 rows with a 0 class and 150 with a 1.
This is how I'm manipulating and providing the data:
import numpy
import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import Normalizer

names = ['conn_duration', 'conn_destination', 'response_size', 'response_code', 'is_malicious']
dataframe = pandas.read_csv(path, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:4]
y = array[:,4]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5,:])
model = LogisticRegression()
model.fit(X, y)
# Two test bits of data, expect the first to be predicted 1 and the second to be 0
Xnew = [[[3492, 150, 750, 200]], [[3492, 120, 901, 200]]]
for conn in Xnew:
    # make a prediction
    ynew = model.predict(conn)
    print("X=%s, Predicted=%s" % (conn[0], ynew[0]))
The criteria for a malicious bit of traffic are that the response code is 200, conn_destination is 150, and the response size is greater than 500.
I'm getting reasonable predictions, but I wonder whether LogisticRegression is the right algorithm to be using?
TIA!

If the code is working, but you aren't sure what algorithm to use, I would recommend trying an SVM, random forest, etc. Use the GridSearchCV module to determine which algorithm gives the best performance.
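As a rough sketch of that comparison (assuming the normalizedX and y arrays from the question; the candidate models and grid values are only illustrative):
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# quick comparison of a few candidate classifiers with 5-fold cross-validation
for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                  ("svm", SVC()),
                  ("forest", RandomForestClassifier(random_state=0))]:
    print(name, cross_val_score(clf, normalizedX, y, cv=5).mean())

# then tune the most promising one with GridSearchCV
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(normalizedX, y)
print(search.best_params_, search.best_score_)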

Since there's a simple rule to classify the traffic, as "response code is 200, conn_destination is 150, and the response size is greater than 500", you don't actually need a model to solve it. Don't overkill a simple problem.
For studying purposes it's ok, but the model should get very close to 100% because it should learn this rule.
Anyway, conn_destination and response_code are categorical data; if you normalize them directly, the algorithm will treat 200 as closer to 201 than to 300, but they are categories, not numbers.
Here's a reference on some ways to treat categorical data: Linear regression analysis with string/categorical features (variables)?
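For illustration, a minimal sketch of that idea with scikit-learn's OneHotEncoder (column names come from the question; the pipeline itself is just one possible setup):
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    # treat the categorical columns as categories, not magnitudes
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["conn_destination", "response_code"]),
    # scale only the genuinely numeric columns
    ("numeric", StandardScaler(), ["conn_duration", "response_size"]),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
model.fit(dataframe.drop(columns=["is_malicious"]), dataframe["is_malicious"])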

I would try XGBoost (Extreme Gradient Boosting). On large datasets SVM is computationally costly, and I especially like Random Forests when you have a highly imbalanced dataset.
Logistic regression can be seen as a building block of a neural network, if you want to develop something more accurate and sophisticated: tuning hyperparameters, avoiding overfitting and improving generalization. You can also do that in XGBoost, for example by pruning trees.
XGBoost and neural networks would be my choices for a classification problem, but the whole thing is bigger than that. It's not just about choosing an algorithm, but about understanding how it works, what is going on under the hood, and how you can adjust it so that it accurately predicts classes.
Also, data preparation, variable selection, outlier detection and descriptive statistics are very important for the quality and accuracy of your model.
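As a rough sketch of the XGBoost suggestion (assuming the xgboost package is installed and a train/test split named X_train/X_test/y_train/y_test; the hyperparameter values are only placeholders):
from xgboost import XGBClassifier

clf = XGBClassifier(
    n_estimators=200,   # number of boosted trees
    max_depth=4,        # shallower trees act like pruning and curb overfitting
    learning_rate=0.1,
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))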

Correct use of LinearSVC

I am trying to implement a machine learning algorithm which detects irregular ECG signals. I extracted some features, but I am not sure how to construct the correct input for the classifier.
I have 20k different ECG signals, each with 1000 values. They are all labeled as correct or incorrect.
I chose, for example, the two features heart_rate and xposition_of_3_highest_peaks, but how do I feed them into the classifier?
Below you can see my attempt, but every time I add the second feature the score decreases. Why?
from sklearn import svm
from sklearn.preprocessing import StandardScaler
import numpy as np

clf = svm.SVC()
#[64,70,48,89...74,58]
X_train_heartRate = StandardScaler().fit_transform(fe.get_avg_heart_rate(X_train))
X_test_heartRate = StandardScaler().fit_transform(fe.get_avg_heart_rate(X_test))
#[[23,56,89],[24,45,78],...,[21,58,90]]
X_train_3_peaks = StandardScaler().fit_transform(fe.get_intervalls(X_train))
X_test_3_peaks = StandardScaler().fit_transform(fe.get_intervalls(X_test))
X_tr = np.concatenate((X_train_heartRate,X_train_3_peaks),axis =1)
X_te = np.concatenate((X_test_heartRate,X_test_3_peaks),axis =1)
clf.fit(X_tr, Y_train)
print("Prediction:", clf.predict(X_te))
print("real Solution:", Y_test)
print(clf.score(X_te,Y_test))
I am not sure if the StandardScaler().fit_transform is necessary or if the np.concatenate is correct? Maybe there is even a better classifier for this use case?
Sorry I am a complete beginner, please be kind :)
When you are doing any transformations for pre-processing, you must use the same process from the training data and apply it to the validation / test data. However, this process must use the same statistics from the training data, because you are assuming that the validation / test data also come from this same distribution. Therefore, you need to create an object to store the transformations of the training data, then apply it to the training and test data equally. Your decreased performance is because you are not applying the right statistics to both training and validation / test correctly. You are scaling both datasets using separate means and standard deviations, which can cause out-of-distribution predictions if your sample size isn't large enough.
Therefore, call fit_transform on the training data, then just transform on the validation / test data. fit_transform will simultaneously find the parameters of the scaling for each column, then apply it to the input data and return the transformed data to you. transform assumes an already fit scaler, such as what was done in fit_transform and applies the scaling accordingly. I sometimes like to separate the operations and do a separate fit on the training data, then transform on the training and validation/test data after. This is a common source of confusion for new practitioners. You also need to save the scaler object so you can apply this to your validation / test data later.
clf = svm.SVC()
#[64,70,48,89...74,58]
heartRate_scaler = StandardScaler()
X_train_heartRate = heartRate_scaler.fit_transform(fe.get_avg_heart_rate(X_train))
X_test_heartRate = heartRate_scaler.transform(fe.get_avg_heart_rate(X_test))
#[[23,56,89],[24,45,78],...,[21,58,90]]
three_peaks_scaler = StandardScaler()
X_train_3_peaks = three_peaks_scaler.fit_transform(fe.get_intervalls(X_train))
X_test_3_peaks = three_peaks_scaler.transform(fe.get_intervalls(X_test))
X_tr = np.concatenate((X_train_heartRate,X_train_3_peaks),axis =1)
X_te = np.concatenate((X_test_heartRate,X_test_3_peaks),axis =1)
clf.fit(X_tr, Y_train)
print("Prediction:", clf.predict(X_te))
print("real Solution:", Y_test)
print(clf.score(X_te,Y_test))
Take note that you can concatenate the features you want first, then apply the StandardScaler after the fact because the method applies the standardization to each feature/column independently. The above method of scaling the different sets of features and concatenating them after is no different than concatenating the features first, then scaling after.
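For example, the concatenate-first variant of the same code could look like this (still using the asker's fe feature extractor):
combined_scaler = StandardScaler()
X_tr = combined_scaler.fit_transform(
    np.concatenate((fe.get_avg_heart_rate(X_train), fe.get_intervalls(X_train)), axis=1))
X_te = combined_scaler.transform(
    np.concatenate((fe.get_avg_heart_rate(X_test), fe.get_intervalls(X_test)), axis=1))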
Minor Note
I forgot to ask about the fe object. What is that doing under the hood? Does it use the training data in any way to produce your features? Just as the pre-processing must match between training and validation/test, the statistics must also match inside this fe object. I assume it either applies the training data's statistics to both sets of data, or it is an independent transformation that does not depend on the data at all. Either way, you haven't specified what it is doing under the hood, so I will assume the happy path.
Possible Improvement
Consider using a decision tree-based algorithm like a Random Forest Classifier that does not require scaling of the input features, as its job is to partition the feature space of your data into N-dimensional hyperrectangles, with N being the number of features in your dataset (if N=2 this would be a rectangle, if N=3 a box, and so on). Depending on how your data is distributed, tree-based algorithms can do better, and they are among the first things to try in Kaggle competitions.
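A minimal sketch of that suggestion, reusing the concatenated features from above (scaling does no harm to trees, it is simply not required):
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_tr, Y_train)
print(rf.score(X_te, Y_test))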

How to fix a very high false negative rate in fairly balanced binary classification?

I have a project that asks to do binary classification for whether an employee will leave the company or not, based on about 52 features and 2000 rows of data. The data is somewhat balanced, with 1200 negatives to 800 positives. I have done extensive EDA and data cleansing. I chose to try several different models from sklearn: Logistic Regression, SVM, and Random Forests. I am getting very poor and similar results from all of them. I only used 15 of the 52 features for this run, but the results are almost identical to when I used all 52 features. Of the 52 features, 6 were categorical that I converted to dummies (between 3-6 categories per feature) and 3 were datetime that I converted to days-since-epoch. There were no null values to fill.
This is the code and confusion matrix from my most recent run with a random forest.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(small_features, endreason, test_size=0.2, random_state=0)
RF = RandomForestClassifier(bootstrap=True,
                            max_features='sqrt',
                            random_state=0)
RF.fit(x_train, y_train)
rf_predictions = RF.predict(x_test)
cm = confusion_matrix(y_test, rf_predictions)
# plot_confusion_matrix is a custom plotting helper defined elsewhere
plot_confusion_matrix(cm, classes=['Negative', 'Positive'],
                      title='Confusion Matrix')
What are steps I can do to help better fit this model?
The results you are showing definitely seem a bit discouraging for the methods you propose and the class balance you describe. However, from the description of the problem there definitely seems to be a lot of room for improvement.
When you are using train_test_split, make sure you pass stratify=endreason so there are no issues with the label distribution when splitting the dataset. Moving on to helpful points to improve your model:
First of all, dimensionality reduction: since you are dealing with many features, some of them might be useless or even contaminate the classification problem you are trying to solve. It is very important to consider fitting different dimensionality reduction techniques to your data and feeding your model with the transformed data. Some common approaches that might be worth trying:
PCA (Principal component analysis)
Low Variance & Correlation filter
Random Forests feature importance
Secondly, understanding the model: while Logistic Regression might prove to be an excellent baseline for a linear classifier, it might not necessarily be what you need for this task. Random Forests seem to be much better at capturing non-linear relationships but need to be controlled and pruned to avoid overfitting, and might require a lot of data. On the other hand, SVM is a very powerful method with non-linear kernels but might prove inefficient when working with huge amounts of data. XGBoost and LightGBM are very powerful gradient boosting algorithms that have won multiple Kaggle competitions and work very well in almost every case; of course some preprocessing is needed, as XGBoost is not prepared to work with categorical features (LightGBM is). My suggestion is to try these last 2 methods. From best to worst (in general scenarios) I would rank:
LightGBM / XGBoost
RandomForest / SVM / Logistic Regression
Last but not least, hyperparameter tuning: regardless of the method you choose, there will always be some fine-tuning that needs to be done. Scikit-learn offers GridSearchCV, which comes in really handy. However, you would need to understand how your classifiers are behaving in order to know what you should be looking for. I will not go in-depth on this as it would be off-topic and not suited for SO, but you can definitely have a read here
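Putting the stratified split and a small grid search together, a rough sketch might look like this (variable names follow the question; the grid values are illustrative, not tuned for this data):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    small_features, endreason, test_size=0.2, random_state=0, stratify=endreason)
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(x_train, y_train)
print(search.best_params_, search.score(x_test, y_test))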

low training (~64%) and test accuracy (~14%) with 5 different models

I'm struggling to find a learning algorithm that works for my dataset.
I am working with a typical regression problem. There are 6 features in the dataset that I am concerned with, and about 800 data points. The features and the predicted values have a high non-linear correlation, so the features are not useless (as far as I understand). The predicted values have a bimodal distribution, so I disregarded linear models pretty quickly.
So I have tried 5 different models: random forest, extra trees, AdaBoost, gradient boosting and the xgb regressor. The training dataset returns an accuracy of about 64% and the test data returns 11%-14%. Both numbers scare me haha. I tried tuning the parameters of the random forest, but it seems like nothing makes a drastic difference.
Function to tune the parameters
from sklearn.model_selection import GridSearchCV

def hyperparatuning(model, train_features, train_labels, param_grid={}):
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
    grid_search.fit(train_features, train_labels)
    print(grid_search.best_params_)
    return grid_search.best_estimator_
Function to evaluate the model
import numpy as np

def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
I expect the output to be at least, you know, acceptable, but instead I got 64% on the training data and 12-14% on the test data. It is a real horror to look at these numbers!
There are several issues with your question.
For starters, you are trying to use accuracy in what seems to be a regression problem, which is meaningless.
Although you don't provide the exact models (it would arguably be a good idea), this line in your evaluation function
errors = abs(predictions - test_labels)
is actually the basis of the mean absolute error (MAE - although you should actually take its mean, as the name implies). MAE, like MAPE, is indeed a performance metric for regression problems; but the formula you use next
accuracy = 100 - mape
does not actually hold, nor is it used in practice.
It is true that, intuitively, one might want to get the 1-MAPE quantity; but this is not a good idea, as MAPE itself has a lot of drawbacks which seriously limit its use; here is a partial list from Wikipedia:
It cannot be used if there are zero values (which sometimes happens for example in demand data) because there would be a division by zero.
For forecasts which are too low the percentage error cannot exceed 100%, but for forecasts which are too high there is no upper limit to the percentage error.
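A small sketch of reporting standard regression metrics instead, using the predictions and test_labels arrays from the question's evaluate function:
from sklearn.metrics import mean_absolute_error, r2_score

mae = mean_absolute_error(test_labels, predictions)
r2 = r2_score(test_labels, predictions)
print("MAE: {:0.4f}".format(mae))
print("R^2: {:0.4f}".format(r2))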
It is an overfitting problem. You are fitting the hypothesis very well on your training data.
Possible solutions to your problem:
You can try getting more training data (not more features).
Try a less complex model, like decision trees, since highly complex models (random forests, neural networks, etc.) fit the hypothesis too closely on the training data.
Cross-validation: it allows you to tune hyperparameters using only your original training set, so you can keep your test set as a truly unseen dataset for selecting your final model.
Regularization: the method will depend on the type of learner you're using. For example, you could prune a decision tree, use dropout on a neural network, or add a penalty parameter to the cost function in regression.
I would suggest you use the Pipeline class, since it allows you to chain preprocessing steps and a model and tune them together.
An example of that:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# assumed definitions: PCA plus any linear classifier exposing an `alpha` penalty
pca = PCA()
logistic = SGDClassifier(loss='log_loss', penalty='l2')
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
# Parameters of pipelines can be set using '__' separated parameter names:
param_grid = {
    'pca__n_components': [5, 20, 30, 40, 50, 64],
    'logistic__alpha': np.logspace(-4, 4, 5),
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)  # fit on features and labels, not two feature sets
I would suggest improving the preprocessing of the data. Try to manually remove outliers, and look at Cook's distance to find elements which have a strong negative influence on your model. Also, you could scale the data differently than standard scaling: use log scaling if elements in your data are too big or too small, or use feature transformations like a DCT or SVD transform.
Or, simplest of all, you could create your own features from the existing data. For example, if you have yesterday's closing price and today's opening price as 2 features in stock price prediction, you can create a new feature for the percentage difference between them, which could help your accuracy a lot.
Do some linear regression analysis to inspect the beta values and get a better understanding of which features contribute most to the target value. You can use feature_importances_ in random forests for the same purpose, and try to improve those features as much as possible so that the model can learn better.
This is just the tip of the iceberg of what could be done. I hope this helps.
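For instance, a tiny sketch of the feature_importances_ idea (model is assumed to be a fitted random forest and feature_names a matching list of column names):
importances = sorted(zip(feature_names, model.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
for name, score in importances:
    print(name, round(score, 3))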
Currently, you are overfitting so what you are looking for is regularization. For example, to reduce the capacity of models that are ensembles of trees, you can limit the maximum depth of the trees (max_depth), increase the minimum required samples at a node to split (min_samples_split), reduce the number of learners (n_estimators), etc.
When performing cross-validation, you should fit on the training set and evaluate on your validation set and the best configuration should be the one that performs the best on the validation set. You should also keep a test set in order to evaluate your model on completely new observations.
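A hedged sketch of what such regularization plus cross-validation could look like for one of the tree ensembles (the specific values are only starting points to search over; train_features/train_labels follow the question's function signature):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

reg = RandomForestRegressor(
    n_estimators=100,       # fewer learners
    max_depth=6,            # limit tree depth
    min_samples_split=10,   # require more samples before splitting a node
    random_state=0,
)
scores = cross_val_score(reg, train_features, train_labels, cv=5,
                         scoring="neg_mean_absolute_error")
print(-scores.mean())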

Linear regression: Good results for training data, horrible for test data

I am working with a dataset of about 400,000 x 250.
I have a problem with the model yielding a very good R^2 score when tested on the training set, but an extremely poor one when used on the test set. Initially, this sounds like overfitting. But the data is split into training/test sets at random and the dataset is pretty big, so I feel like there has to be something else.
Any suggestions?
Splitting dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['SalePrice'], axis=1),
                                                     df.SalePrice, test_size=0.3)
Sklearn's Linear Regression estimator
from sklearn import linear_model
linReg = linear_model.LinearRegression() # Create linear regression object
linReg.fit(X_train, y_train) # Train the model using the training sets
# Predict from training set
y_train_linreg = linReg.predict(X_train)
# Predict from test set
y_pred_linreg = linReg.predict(X_test)
Metric calculation
from sklearn import metrics
metrics.r2_score(y_train, y_train_linreg)
metrics.r2_score(y_test, y_pred_linreg)
R^2 score when testing on the training set: 0.64
R^2 score when testing on the test set: approximately -10^23
While I agree with Mihai that your problem definitely looks like overfitting, I don't necessarily agree with his answer that a neural network would solve your problem; at least, not out of the box. By themselves, neural networks overfit more, not less, than linear models. You somehow need to take care of your data; hardly any model can do that for you. A few options that you might consider (apologies, I cannot be more precise without looking at the dataset):
Easiest thing, use regularization. 400k rows is a lot, but with 250 dimensions you can overfit almost whatever you like. So try replacing LinearRegression by Ridge or Lasso (or Elastic Net or whatever). See http://scikit-learn.org/stable/modules/linear_model.html (Lasso has the advantage of discarding features for you, see next point)
Especially if you want to go outside of linear models (and you probably should), it's advisable to first reduce the dimension of the problem, as I said 250 is a lot. Try using some of the Feature selection techniques here: http://scikit-learn.org/stable/modules/feature_selection.html
Probably more important than anything else, you should consider adapting your input data. The very first thing I'd try, assuming you are really trying to predict a price as your code implies, is to replace it by its logarithm, or log(1+x). Otherwise linear regression will try very, very hard to fit the single object that was sold for 1 million $ while ignoring everything below $1k. Just as important, check if you have any non-numeric (categorical) columns and keep them only if you need them, reducing them to macro-categories where possible: a categorical column with 1000 possible values will increase your problem dimension by 1000, making it an assured overfit. A single column with a unique categorical value for each input (e.g. buyer name) will lead you straight to perfect overfitting.
After all this (cleaning data, reducing dimension via either one of the methods above or just Lasso regression until you get to certainly less than dim 100, possibly less than 20 - and remember that this includes any categorical data!), you should consider non-linear methods to further improve your results - but that's useless until your linear model provides you at least some mildly positive R^2 value on test data.
sklearn provides a lot of them: http://scikit-learn.org/stable/modules/kernel_ridge.html is the easiest to use out-of-the-box (it also does regularization), but it might be too slow in your case (you should first try this, and any of the following, on a subset of your data, say 1000 rows once you've selected only 10 or 20 features, and see how slow that is). http://scikit-learn.org/stable/modules/svm.html#regression has many different flavours, but I think all but the linear one would be too slow. Sticking to linear things, http://scikit-learn.org/stable/modules/sgd.html#regression is probably the fastest, and would be how I'd train a linear model on this many samples.
Going truly beyond linear, the easiest techniques would probably include some kind of trees, either directly http://scikit-learn.org/stable/modules/tree.html#regression (but that's an almost-certain overfit) or, better, using some ensemble technique (random forests http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees are the typical go-to algorithm, gradient boosting http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting sometimes works better). Finally, state-of-the-art results are indeed generally obtained via neural networks, see e.g. http://scikit-learn.org/stable/modules/neural_networks_supervised.html, but for these methods sklearn is generally not the right answer and you should take a look at dedicated environments (TensorFlow, Caffe, PyTorch, etc.)... however, if you're not familiar with those it is certainly not worth the trouble!
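To make the first and third points concrete, a minimal sketch combining Ridge regularization with a log-transformed price (variable names follow the question's code; alpha is just a default to tune):
import numpy as np
from sklearn.linear_model import Ridge
from sklearn import metrics

ridge = Ridge(alpha=1.0)
ridge.fit(X_train, np.log1p(y_train))        # fit on log(1 + price)
y_pred = np.expm1(ridge.predict(X_test))     # back to the original price scale
print(metrics.r2_score(y_test, y_pred))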

Scikit learn: measure of goodness of fit, better splitting the dataset or use all of it?

Sort of taking inspiration from here.
My problem
So I have a dataset with 3 features and n observations. I also have n responses. Basically I want to see if this model is a good fit or not.
From the question above, people use R^2 for this purpose. But I am not sure I understand it.
Can I just fit the model and then calculate the Mean Squared Error?
Should I use train/test split?
All of these seem to have in common prediction, but here I just want to see how good it is at fitting it.
For instance, this is my idea:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]  # same single feature as in the split example below
#my idea
regr = linear_model.LinearRegression()
regr.fit(diabetes_X, diabetes.target)
print(np.mean((regr.predict(diabetes_X)-diabetes.target)**2))
However I often see people doing things like
diabetes_X = diabetes.data[:, np.newaxis, 2]
# split X
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# split y
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# instantiate and fit
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
# MSE but based on the prediction on test
print('Mean squared error: %.2f' % np.mean((regr.predict(diabetes_X_test)-diabetes_y_test)**2))
In the first instance we get 3890.4565854612724, while in the second case we get 2548.07. Which one is correct?
IMPORTANT: I WANT THIS TO WORK IN MULTIPLE REGRESSION, THIS IS JUST A MWE!
Can I just fit the model and then calculate the Mean Squared Error? Should I use train/test split?
No, you would run the risk of overfitting the model. That's the reason the data is split into train and test (or even validation) datasets: so that the model doesn't just 'memorize' what it sees but learns to perform well on new, unseen samples.
It's always preferable to evaluate the performance of the model on a new set of data that wasn't observed during training. If you're going to optimize hyperparameters or choose among several models, an additional validation set is the right choice.
However, sometimes the data is scarce and entirely removing data from the training process is prohibitive. In these cases, I strongly recommend you to use more efficient ways of validating your models such as k-fold cross-validation (see KFold and StratifiedKFold in scikit-learn).
Finally, it is a good idea to ensure that your partitions behave in a similar way in the training and test sets. I recommend sampling the data uniformly over the target space so you can be sure you train/validate your model with the same distribution of target values.
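As a small illustration of the k-fold suggestion on the diabetes example from the question:
from sklearn import datasets, linear_model
from sklearn.model_selection import KFold, cross_val_score

diabetes = datasets.load_diabetes()
regr = linear_model.LinearRegression()
scores = cross_val_score(regr, diabetes.data, diabetes.target,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0),
                         scoring="neg_mean_squared_error")
print(-scores.mean())   # average MSE across the five held-out folds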
