I'm currently trying to use Scikit Learn to create a simple anomaly detection snippet.
The program receives a .csv file and parses it into a pandas DataFrame.
The DataFrame has 8 columns: 'Src IP', 'Dst IP', 'sPort', 'dPort', 'Protocol', 'Load', 'Packets', 'TCP Flags'.
I fit the data into an IsolationForest like so:
from sklearn.ensemble import IsolationForest

# Note: the 'behaviour' parameter was deprecated and later removed from scikit-learn, so it is omitted here.
iForest = IsolationForest(n_estimators=128, max_samples='auto', max_features=1, contamination='auto', random_state=None, n_jobs=-1, verbose=0, bootstrap=True)
usecols = ["Src IP", "Dst IP", "sPort", "dPort", "Protocol", "Load", "Packets", "TCP Flags"]
iForest.fit(data[usecols])
And then I get the outliers/anomalies from the IsolationForest:
pred = iForest.predict(data[usecols])
data['anomaly'] = pred
outliers = data.loc[data['anomaly'] == -1]
It all works well. However, my question here is:
How can I use Isolation Forest to detect anomalies on the network while being independent of the 'contamination' property?
In an IDS, having a low False Positive rate is crucial. In my case I'm effectively deciding which entries are 'contaminated' by choosing a percentage.
My goal is to have Isolation Forest set the contamination factor automatically: knowing that x.csv is 100% free of contamination, find the percentage of contamination in y.csv.
This is meant to be part of a hybrid IDS that uses both signature analysis and behaviour analysis to detect intrusions based on flow data (NetFlow).
TL;DR: IsolationForest needs to receive a clean .csv (no contamination) and then detect anomalies on a new set of data (another .csv or piped data). How is that possible using scikit-learn?
If you have a training set that contains only normal data, set contamination as low as it will go (note that scikit-learn's IsolationForest only accepts 'auto' or a float in (0, 0.5], so in practice keep 'auto' and choose your own threshold on the raw anomaly scores instead of relying on predict). To choose an appropriate threshold for anomalies, use a validation set and plot the histogram of anomaly scores. Without labeled data this can only be done heuristically:
For maximum True Positives (at the cost of more False Positives), set the threshold based on a budget of what resources you have for looking into Positives. You can compute this from the inbound data rate, the expected number of Positives from the histogram statistics, and a cost (time/money) per evaluation.
For minimizing False Positives, set the threshold to be a bit outside the existing scores. The assumption is then that the training/validation contains practically no anomalies, and anything new and different is anomalous. Sometimes this is called novelty detection.
If it is possible to determine by looking at the data whether something was a true or false anomaly, I would recommend doing that for some 10-100 of the items with the highest anomaly scores. This will usually be very fast compared to labeling all the data, and can help estimate the False Positive Rate.
When you put this model into production, your protocol for acting on anomalies should ensure that cases are evaluated and scored as anomaly/not. Then this is your future labeled validation/test data, which you can use to adjust the thresholds.
If you do have anomaly/not-anomaly labels in the validation/test set (but not in training), you can use them to optimize the threshold to maximize/minimize the desired metric using a hyper-parameter search.
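A minimal sketch of this workflow, assuming the column and file names from the question, that all columns are already numeric or encoded, and a hypothetical held-out clean validation file; the margin used for the threshold is a heuristic, not a recommendation:

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

usecols = ["Src IP", "Dst IP", "sPort", "dPort", "Protocol", "Load", "Packets", "TCP Flags"]
train = pd.read_csv("x.csv")       # assumed 100% clean of contamination
valid = pd.read_csv("valid.csv")   # held-out clean data used only to pick the threshold
new = pd.read_csv("y.csv")         # data to score

iforest = IsolationForest(n_estimators=128, contamination="auto", n_jobs=-1, random_state=42)
iforest.fit(train[usecols])

# score_samples: higher = more normal, lower = more anomalous
val_scores = iforest.score_samples(valid[usecols])

# Novelty-style threshold: slightly below anything seen on clean validation data
threshold = val_scores.min() - 0.01

new_scores = iforest.score_samples(new[usecols])
new["anomaly"] = np.where(new_scores < threshold, -1, 1)
outliers = new.loc[new["anomaly"] == -1]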
I have two questions that I would like to clear up if possible:
I've recently understood (I hope) the random forest classification algorithm, and have tried to apply it using sklearn on Python on a rather large dataset of pixels derived from satellite images (with the features being the different bands, and the labels being specific features that I outlined by myself, i.e., vegetation, cloud, etc). I then wanted to understand if the model was experiencing a variance problem, and so the first thought that came to my mind was to compare between the training and testing data.
Now this is where the confusion kicks in for me - I understand that there have been many different posts about:
How CV error should/should not be used compared to the out of bag (OOB) error
How by design, the training error of a random forest classifier is almost always ~0 (i.e., fitting my model on the training data and using it to predict on the same set of training data) - seems to be the case regardless of the tree depth
Regarding point 2, it seems that I can never compare my training and test error as the former will always be low, and so I decided to use the OOB error as my 'representative' training error for the entire model. I then realized that the OOB error might be a pseudo test error as it essentially tests trees on points that they did not specifically learn (in the case of bootstrapped trees), and so I defaulted to CV error being my new 'representative' training error for the entire model.
Looking back at the usage of CV error, I initially used it for hyperparameter tuning (e.g., max tree depth, number of trees, criterion type, etc), and so I was again doubting myself if I should use it as my official training error to be compared against my test error.
What makes this worse is that it's hard for me to validate what I think is true based on posts across the web, because each one answers only a small part and they may contradict each other. So would anyone kindly help me with my predicament on what to use as my official training error to be compared to my test error?
My second question revolves around how the OOB error might be a pseudo test error, based on data points not selected during bootstrapping. If that were true, would it be fair to say this does not hold if bootstrapping is disabled (the algorithm is technically still a random forest, since features are still randomly subsampled for each tree; it's just that the correlation between trees is probably higher)?
Thank you!!!!
Generally, you want to distinctly break a dataset into training, validation, and test. Training is data fed into the model, validation is to monitor progress of the model as it learns, and test data is to see how well your model is generalizing to unseen data. As you've discovered, depending on the application and the algorithm, you can mix up training and validation data or even forgo validation data entirely. For random forest, if you want to forgo having a distinct validation set and just use OOB to monitor progress, that is fine. If you have enough data, I think it still makes sense to have a distinct validation set. No matter what, you should still reserve some data for testing. Depending on your data, you may even need to be careful about how you split up the data (e.g. if there's unevenness in the labels).
As to your second point about comparing training and test sets, I think you may be confused. The test set is really all you care about. You can compare the two to see if you're overfitting, so that you can change hyperparameters to generalize more, but otherwise the whole point is that the test set is the sole truthful evaluation. If you have a really small dataset, you may need to bootstrap a number of models with a CV scheme like stratified CV to generate a more accurate test evaluation.
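As a minimal sketch of how you might put the OOB score, a cross-validated score, and a held-out test score side by side (synthetic data stands in for your pixel features and labels):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# stand-in for the band features (X) and the vegetation/cloud/etc. labels (y)
X, y = make_classification(n_samples=5000, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)

print("OOB score:  ", rf.oob_score_)
print("5-fold CV:  ", cross_val_score(rf, X_train, y_train, cv=5).mean())
print("Test score: ", rf.score(X_test, y_test))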
I've been using the scikit learn sklearn.ensemble.IsolationForest implementation of the isolation forest to detect anomalies in my datasets that range from 100s of rows to millions of rows worth of data. It seems to be working well and I've overridden the max_samples to a very large integer to handle some of my larger datasets (essentially not using sub-sampling). I noticed that the original paper states that larger sample sizes create risk of swamping and masking.
Is it okay to use the isolation forest on large sample sizes if it seems to be working okay? I tried training with a smaller max_samples and the testing produced too many anomalies. My data has really started to grow and I'm wondering if a different anomaly detection algorithm would be better for such a large sample size.
Citing the original paper:
The isolation characteristic of iTrees enables them to build partial models and exploit sub-sampling to an extent that is not feasible in existing methods. Since a large part of an iTree that isolates normal points is not needed for anomaly detection; it does not need to be constructed. A small sample size produces better iTrees because the swamping and masking effects are reduced.
From your question, I have the feeling that you are confusing the size of the dataset with the size of the sample taken from it to construct each iTree. Isolation Forest can handle very large datasets; it works better when it sub-samples them.
The original paper discusses this in Section 3:
The data set has two anomaly clusters located close to one large cluster of normal points at the centre. There are interfering normal points surrounding the anomaly clusters, and the anomaly clusters are denser than normal points in this sample of 4096 instances. Figure 4(b) shows a sub-sample of 128 instances of the original data. The anomaly clusters are clearly identifiable in the sub-sample. Those normal instances surrounding the two anomaly clusters have been cleared out, and the size of anomaly clusters becomes smaller which makes them easier to identify. When using the entire sample, iForest reports an AUC of 0.67. When using a sub-sampling size of 128, iForest achieves an AUC of 0.91.
Isolation forest is not a perfect algorithm and needs parameter tuning for your specific data. It might even perform poorly on some datasets. If you wish to consider other methods, Local Outlier Factor is also included in sklearn. You may also combine several methods (ensemble).
Here you can find a nice comparison of different methods.
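A minimal sketch of the sub-sampling idea, using random data as a stand-in for a large dataset; max_samples=256 is the paper's default, not a tuned value:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_large = rng.randn(1_000_000, 8)   # stand-in for millions of rows of real data

# each iTree is built on only 256 sampled points, regardless of the dataset size
iforest = IsolationForest(n_estimators=100, max_samples=256, n_jobs=-1, random_state=0)
iforest.fit(X_large)

scores = iforest.score_samples(X_large)   # scoring still covers every row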
I have a labeled database with a separate class "-1" in which all the outliers are.
I am currently using sklearn's LocalOutlierFactor and OneClassSVM, fitting them on outlier-free training data and then testing them on test data that contains outliers. My objective is to check whether new unseen examples are outliers before classifying them with a classification model.
It appears that since the training data which I use to fit the models is free of outliers and since the examples in each class are very similar, I get the best results (precision and recall) on the test data if I set the hyperparameter contamination of the LocalOutlierFactor as low as possible, something like 10**-100. The higher I set this value the more false outliers my model detects on new data.
I observe similar behaviour with OneClassSVM: the hyperparameters gamma and nu have to be extremely low for the model to give me the best results.
According to scikit-learn, the training data should not be polluted by outliers when performing novelty detection. This is explained at the top here.
Given this, my question is whether I am missing something or whether my approach is legitimate. I don't get why I even have to set the contamination hyperparameter if the training data is not supposed to be polluted. It is perfectly clean in my case, contamination can't be set to 0, and it seems weird to set it manually to such a low value.
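For reference, a minimal sketch of the setup described above (synthetic blobs stand in for the real classes; the hyperparameter values are placeholders, not recommendations):

from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# stand-in for outlier-free training data and a test set that contains outliers
X_train, _ = make_blobs(n_samples=500, centers=3, random_state=0)
X_test, _ = make_blobs(n_samples=100, centers=3, cluster_std=6.0, random_state=1)

# novelty=True is required to call predict() on new, unseen data
lof = LocalOutlierFactor(n_neighbors=20, novelty=True, contamination=1e-6).fit(X_train)
ocsvm = OneClassSVM(gamma=1e-3, nu=1e-3).fit(X_train)

lof_pred = lof.predict(X_test)    # -1 = outlier, 1 = inlier
svm_pred = ocsvm.predict(X_test)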
I am trying to solve the outlier detection problem with several algorithms. When I use the Local Outlier Factor API of scikit-learn, I have to set a very important parameter, n_neighbors. However, with different n_neighbors I receive different ROC_AUC scores. For example, with n_neighbors=5 I get ROC_AUC=56; with n_neighbors=6, ROC_AUC=85; with n_neighbors=7, ROC_AUC=94; and so on. In general, ROC_AUC is very high if n_neighbors>=6.
I want to ask three questions:
(1) Why does the n_neighbors parameter of Local Outlier Factor affect the ROC_AUC?
(2) How do I choose an appropriate n_neighbors in an unsupervised learning setting?
(3) Should I choose a high n_neighbors to get a high ROC_AUC?
If the results would not be affected, the parameter would not be needed, right?
Considering more neighbors is more costly. But it also means more data is used, so I'm not surprised that results improve. Did you read the paper that explains what the parameter does?
If you choose the parameter based on the evaluation, then you are cheating. It is an unsupervised method; you are not supposed to have such labels in a real use case.
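A small sketch of how the scores of the same points move as n_neighbors changes (the injected outliers and the values of k are arbitrary examples):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

X, _ = make_blobs(n_samples=300, centers=[[0, 0]], cluster_std=1.0, random_state=0)
X = np.vstack([X, [[8.0, 8.0], [9.0, -7.0]]])   # two injected, obvious outliers

for k in (5, 6, 7, 20):
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit_predict(X)
    # negative_outlier_factor_: the more negative, the more outlying
    print(k, lof.negative_outlier_factor_[-2:])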
I'm trying to build a multilabel-classifier to predict the probabilities of some input data being either 0 or 1. I'm using a neural network and Tensorflow + Keras (maybe a CNN later).
The problem is the following:
The data is highly skewed: there are a lot more negative examples than positive ones, maybe 90:10. So my neural network nearly always outputs very low probabilities for positive examples. If I rounded to binary predictions, it would predict 0 in most cases.
The performance is > 95% for nearly all classes, but this is due to the fact that it nearly always predicts zero...
Therefore the number of false negatives is very high.
Some suggestions how to fix this?
Here are the ideas I considered so far:
Punishing false negatives more with a customized loss function (my first attempt failed), i.e. weighting positive examples within a class more heavily than negative ones. This is similar to class weights, but applied within a class.
How would you implement this in Keras?
Oversampling positive examples by cloning them so that positive and negative examples are balanced, and then fitting the neural network on the balanced set.
Thanks in advance!
You're on the right track.
Usually, you would either balance your data set before training, i.e. reduce the over-represented class, or generate artificial (augmented) data for the under-represented class to boost its occurrence.
Reduce over-represented class
This one is simpler: you just randomly pick as many samples from the over-represented class as there are in the under-represented class, discard the rest, and train on the new subset. The disadvantage, of course, is that you're losing some learning potential, depending on how complex your task is (how many features it has).
Augment data
Depending on the kind of data you're working with, you can "augment" it. That just means that you take existing samples from your data, modify them slightly, and use them as additional samples. This works very well with image and sound data. You could flip/rotate, scale, add noise, increase/decrease brightness, crop, etc.
The important thing here is that you stay within the bounds of what could happen in the real world. If, for example, you want to recognize a "70 mph speed limit" sign, flipping it doesn't make sense: you will never encounter an actual flipped 70 mph sign. If you want to recognize a flower, flipping or rotating it is permissible. The same goes for sound: changing the volume/frequency slightly won't matter much, but reversing the audio track changes its "meaning", and you won't have to recognize backwards-spoken words in the real world.
Now if you have to augment tabular data like sales data, metadata, etc... that's much trickier as you have to be careful not to implicitly feed your own assumptions into the model.
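If your inputs are images, a minimal sketch using the built-in Keras preprocessing layers could look like this (the layer choices and ranges are just examples and assume a reasonably recent TensorFlow):

import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),   # skip flips for signs/text, as noted above
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomBrightness(0.2),
])

# applied on the fly, e.g. inside a tf.data pipeline:
# dataset = dataset.map(lambda x, y: (augment(x, training=True), y))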
I think your two suggestions are already quite good.
You can also simply undersample the negative class, of course.
import numpy as np

def balance_occurences(dataframe, zielspalte, faktor=1):
    # Undersample: keep every row of the least frequent class and draw
    # (faktor x its size) rows from each of the other classes.
    least_frequent_observation = dataframe[zielspalte].value_counts().idxmin()
    bottleneck = len(dataframe[dataframe[zielspalte] == least_frequent_observation])
    balanced_indices = dataframe.index[dataframe[zielspalte] == least_frequent_observation].tolist()
    for value in set(dataframe[zielspalte]) - {least_frequent_observation}:
        full_list = dataframe.index[dataframe[zielspalte] == value].tolist()
        selection = np.random.choice(a=full_list, size=bottleneck * faktor, replace=False)
        balanced_indices = np.append(balanced_indices, selection)
    df_balanced = dataframe[dataframe.index.isin(balanced_indices)]
    return df_balanced
Your loss function could also take the recall of the positive class into account, combined with some other measurement.
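As one possible way to penalize false negatives more heavily in Keras, here is a minimal sketch using tf.nn.weighted_cross_entropy_with_logits; the layer sizes and the pos_weight of 9.0 (roughly matching a 90:10 split) are assumptions you would tune:

import tensorflow as tf

n_features, n_labels = 20, 5            # placeholders for the real input/output sizes

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(n_labels),    # outputs logits: no sigmoid here, the loss applies it
])

def make_weighted_loss(pos_weight):
    # pos_weight > 1 makes missed positives (false negatives) cost more than false positives
    def loss(y_true, logits):
        y_true = tf.cast(y_true, logits.dtype)
        return tf.reduce_mean(
            tf.nn.weighted_cross_entropy_with_logits(labels=y_true, logits=logits, pos_weight=pos_weight))
    return loss

# threshold 0.0 on logits corresponds to 0.5 on probabilities
model.compile(optimizer="adam",
              loss=make_weighted_loss(9.0),
              metrics=[tf.keras.metrics.Recall(thresholds=0.0)])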