Python SKlearn contamination must be in (0, 0.5] error - python

I'm new to Machine Learning and working on a project using Python(3.6), Pandas, Numpy and SKlearn. I have done classifications and reshaping but while in prediction it throws an error as contamination must be in (0, 0.5].
Here's what i have tried:
# Determine no of fraud cases in dataset
Fraud = data[data['Class'] == 1]
Valid = data[data['Class'] == 0]
# calculate percentages for Fraud & Valid
outlier_fraction = len(Fraud) / float(len(Valid))
print(outlier_fraction)
print('Fraud Cases : {}'.format(len(Fraud)))
print('Valid Cases : {}'.format(len(Valid)))
# Get all the columns from dataframe
columns = data.columns.tolist()
# Filter the columns to remove data we don't want
columns = [c for c in columns if c not in ["Class"] ]
# store the variables we want to predicting on
target = "Class"
X = data.drop(target, 1)
Y = data[target]
# Print the shapes of X & Y
print(X.shape)
print(Y.shape)
# define a random state
state = 1
# define the outlier detection method
classifiers = {
"Isolation Forest": IsolationForest(max_samples=len(X),
contamination=outlier_fraction,
random_state=state),
"Local Outlier Factor": LocalOutlierFactor(
contamination = outlier_fraction)
}
# fit the model
n_outliers = len(Fraud)
for i, (clf_name, clf) in enumerate(classifiers.items()):
# fit te data and tag outliers
if clf_name == "Local Outlier Factor":
y_pred = clf.fit_predict(X)
scores_pred = clf.negative_outlier_factor_
else:
clf.fit(X)
scores_pred = clf.decision_function(X)
y_pred = clf.predict(X)
# Reshape the prediction values to 0 for valid and 1 for fraudulent
y_pred[y_pred == 1] = 0
y_pred[y_pred == -1] = 1
n_errors = (y_pred != Y).sum()
# run classification metrics
print('{}:{}'.format(clf_name, n_errors))
print(accuracy_score(Y, y_pred ))
print(classification_report(Y, y_pred ))
Here's what it returns :
ValueError: contamination must be in (0, 0.5]
and it throws this error for y_pred = clf.predict(X) line, as pointed in Traceback.
I'm new to machine learning, don't have much idea about ** contamination**, so where i did something wrong?
Help me, please!
Thanks in advance!

ValueError: contamination must be in (0, 0.5]
This means that contamination must be strictly larger than 0.0 and less than or equal to 0.5. (What does this square bracket and parenthesis bracket notation mean [first1,last1)? is a good question on the brackets notation) As you have commented, print(outlier_fraction) outputs 0.0, the problem lies in the first 6 lines of the code you posted.

LocalOutlierFactor is an unsupervised outlier detection algorithm, introduced in this paper. Each algorithm, has its own parameters which really change the behavior of the algorithm. You should always study those parameters and their effect on the algorithm before applying the method, or else you may be lost in the land of massive parameter options.
In the case of LocalOutlierFactor, it assumes your outliers are not more than half of your dataset. In practice, I'd say, even if the outliers take up to 30% of your dataset, they're not outliers anymore. They're simply a different type, or class of data.
On the other hand, you cannot expect the outlier detection algorithm to work if you tell it that you have 0 outliers, which may be the case for you if the outlier_fraction is actually 0.

Related

Squared Error Relevance Area (SERA) implementation in Python as custom evaluation metric

I'm facing an imbalanced regression problem and I've already tried several ways to solve this problem. Eventually I came a cross this new metric called SERA (Squared Error Relevance Area) as a custom scoring function for imbalanced regression as mentioned in this paper. https://link.springer.com/article/10.1007/s10994-020-05900-9
In order to calculate SERA you have to compute the relevance function phi, which is varied from 0 to 1 in small steps. For each value of relevance (phi) (e.g. 0.45) a subset of the training dataset is selected where the relevance is greater or equal to that value (e.g. 0.45). And for that selected training subset sum of squared errors is calculated i.e. sum(y_true - y_pred)**2 which is known as squared error relevance (SER). Then a plot us created for SER vs phi and area under the curve is calculated i.e. SERA.
Here is my implementation, inspired by this other question here in StackOverflow:
import pandas as pd
from scipy.integrate import simps
from sklearn.metrics import make_scorer
def calc_sera(y_true, y_pred,x_relevance=None):
# creating a list from 0 to 1 with 0.001 interval
start_range = 0
end_range = 1
interval_size = 0.001
list_1 = [round(val * interval_size, 3) for val in range(1, 1000)]
list_1.append(start_range)
list_1.append(end_range)
epsilon = sorted(list_1, key=lambda x: float(x))
df = pd.concat([y_true,y_pred,x_relevance],axis=1,keys= ['true', 'pred', 'phi'])
# Initiating lists to store relevance(phi) and squared-error relevance (ser)
relevance = []
ser = []
# Converting the dataframe to a numpy array
rel_arr = x_relevance
# selecting a phi value
for phi in epsilon:
relevance.append(phi)
error_squared_sum = 0
error_squared_sum = sum((df[df.phi>=phi]['true'] - df[df.phi>=phi]['pred'])**2)
ser.append(error_squared_sum)
# squared-error relevance area (sera)
# numerical integration using simps(y, x)
sera = simps(ser, relevance)
return sera
sera = make_scorer(calc_sera, x_relevance=X['relevance'], greater_is_better=False)
I implemented a simple GridSearch using this score as an evaluation metric to select the best model:
model = CatBoostRegressor(random_state=0)
cv = KFold(n_splits = 5, shuffle = True, random_state = 42)
parameters = {'depth': [6,8,10],'learning_rate' : [0.01, 0.05, 0.1],'iterations': [100, 200, 500,1000]}
clf = GridSearchCV(estimator=model,
param_grid=parameters,
scoring=sera,
verbose=0,cv=cv)
clf.fit(X=X.drop(columns=['relevance']),
y=y,
sample_weight=X['relevance'])
print("Best parameters:", clf.best_params_)
print("Lowest SERA: ", clf.best_score_)
I also added the relevance function as weights to the model so it could apply this weights in the learning task. However, what I am getting as output is this:
Best parameters: {'depth': 6, 'iterations': 100, 'learning_rate': 0.01}
Lowest SERA: nan
Any clue on why SERA value is returning nan? Should I implement this another way?
Whenever you get unexpected NaN scores in a grid search, you should set the parameter error_score="raise" to get an error traceback, and debug from there.
In this case I think I see the problem though: sera is defined with x_relevance=X['relevance'], which includes all the rows of X. But in the search, you're cross-validating: each testing set has fewer rows, and those are what sera will be called on. I can think of a couple of options; I haven't tested either, so let me know if something doesn't work.
Use pandas index
In your pd.concat, just set join="inner". If y_true is a slice of the original pandas series (I think this is how GridSearchCV will pass it...), then the concatenation will join on those row indices, so keeping the whole of X['relevance'] is fine: it will just drop the irrelevant rows. y_pred may well be a numpy array, so you might need to set its index appropriately first?
Keep relevance in X
Then your scorer can reference the relevance column directly from the sliced X. For this, you will want to drop that column from the fitting data, which could be difficult to do for the training but not the testing set; however, CatBoost has an ignored_features parameter that I think ought to work.

LSTM with more features / classes

How can I use more than one feature/class as input/output on a LSTM using Sequential from Keras Models in Python?
To be more specific, I would like to use as input and output to the network: [FeatureA][FeatureB][FeatureC].
FeatureA is a categorial class with 100 different possible values indicating the sensor which collected the data;
FeatureB is a on/off indicator, being 0 or 1;
FeatureC is a categorial class too having 5 unique values.
Data Example:
1. 40,1,5
2. 58,1,2
3. 57,1,5
4. 40,0,1
5. 57,1,4
6. 23,0,3
When using the raw data and loss='categorical_crossentropy' on model.compile, the loss is over than 10.0.
When normalisating the data to values between 0-1 and using mean_squared_error on loss, it gets an average of 0.27 on loss. But when testing it on prediction, the results does not makes any sense.
Any suggestions here or tutorials I could consult?
Thanks in advance.
You need to convert FeatureC to a binary category. Mean squared error is for regression and as best I can tell you are trying to predict which class a certain combination of sensors and states belongs to. Since there's 5 possible classes you can kind of think that you're trying to predict if the class is Red, Green, Blue, Yellow, or Purple. Right now those are represented by numbers but for a regression you model is going to be predicting values like 3.24 which doesn't make sense.
In effect you're converting values of FeatureC to 5 columns of binary values. Since it seems like the classes are exclusive there should be a single 1 and the rest of the columns of a row would be 0s. So if the first row is 'red' it would be [1, 0, 0, 0, 0]
For best results you should also convert FeatureA to binary categorical features. For the same reason as above, sensor 80 is not 4x more than sensor 20, but instead a different entity.
The last layer of your model should be of softmax type with 5 neurons. Basically your model is going to try to be predicting the probability of each class, in the end.
It looks like you are working with a dataframe since there's an index. Therefore I would try:
import keras
import numpy as np
import pandas as pd # assume that this has probably already been done though
feature_a = data.loc[:, "FeatureA"].values # pull out feature A
labels = data.loc[:, "FeatureC"].values # y are the labels that we're trying to predict
feature_b = data.loc[:, "FeatureB"].values # pull out B to make things more clear later
# make sure that feature_b.shape = (rows, 1) otherwise reset the shape
# so hstack works later
feature_b = feature_b.reshape(feature_b.shape[0], 1)
labels -= 1 # subtract 1 from all labels values to zero base (0 - 4)
y = keras.utils.to_categorical(labels)
# labels.shape should be (rows, 5)
# convert 1-100 to binary columns
# zero base again
feature_a -= 1
# Before: feature_a.shape=(rows, 1)
feature_a_data = keras.utils.to_categorical(feature_a)
# After: feature_a_data.shape=(rows, 100)
data = np.hstack([feature_a_data, feature_b])
# data.shape should be (rows, 101)
# y.shape should be (rows, 5)
Now you're ready to train/test split and so on.
Here's something to look at which has multi-class prediction:
https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py

Apparently contradictory results from topk/sort and pick

I'm predicting roughly one of 100K possible outputs with a MXNet model, using a fairly standard softmax output. I want to compare the probability assigned to the true label versus the top predictions under the model. To get the former I'm using the pick operator; the later I've tried the cheap version (topk operator) and the expensive version (sort/argsort + slice).
In both cases I'm getting contradictory results. Specifically, there are numerous cases where the probability of the true label (retrieved with pick) is significantly higher than the highest probability output (retrieved with topk/sort). I think this means I'm doing something wrong but don't understand what. It does not happen for all predictions, but it does for a significant fraction.
Can anybody give me a hint as to what is going on?
Code follows:
for batch in data_iter:
model.forward(batch, is_train=False)
predictions = model.get_outputs()[0]
labels = batch.label[0].as_in_context(predictions.context)
# scores = mx.nd.topk(predictions, axis=1, k=6, ret_typ='value')
scores = mx.nd.sort(predictions, axis=1, is_ascend=0)
scores = mx.nd.slice_axis(scores, axis=1, begin=0, end=6)
label_score = mx.nd.pick(predictions, labels, axis=1)
equal = label_score.asnumpy() <= scores.asnumpy()[:, 0]
if not np.all(equal):
#I think this should never happen but it does frequently
Testing with MXNet 1.1.0, the following code shows that the problem doesn't happen:
for _ in range(10):
predictions = nd.random.uniform(shape=(100, 100000))
labels = nd.array(np.random.randint(0, 99999, size=(100, 1)))
scores = mx.nd.sort(predictions, axis=1, is_ascend=0)
scores = mx.nd.slice_axis(scores, axis=1, begin=0, end=6)
label_score = mx.nd.pick(predictions, labels, axis=1)
equal = label_score.asnumpy() <= scores.asnumpy()[:, 0]
if not np.all(equal):
print("ERROR")

How to get predictive attributes of each target in `Random Forest`?

I've been messing around with Random Forest models lately and they are really useful w/ the feature_importance_ attribute!
It would be useful to know which variables are more predictive of particular targets.
For example, what if the 1st and 2nd attributes were more predictive of distringuishing target 0 but the 3rd and 4th attributes were more predictive of target 1?
Is there a way to get the feature_importance_ array for each target separately? With sklearn, scipy, pandas, or numpy preferably.
# Iris dataset
DF_iris = pd.DataFrame(load_iris().data,
index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
columns = load_iris().feature_names)
Se_iris = pd.Series(load_iris().target,
index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
name = "Species")
# Import modules
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
# Split Data
X_tr, X_te, y_tr, y_te = train_test_split(DF_iris, Se_iris, test_size=0.3, random_state=0)
# Create model
Mod_rf = RandomForestClassifier(random_state=0)
Mod_rf.fit(X_tr,y_tr)
# Variable Importance
Mod_rf.feature_importances_
# array([ 0.14334485, 0.0264803 , 0.40058315, 0.42959169])
# Target groups
Se_iris.unique()
# array([0, 1, 2])
This is not really how RF works. Since there is no simple "feature voting" (which takes place in linear models) it is really hard to answer the question what "feature X is more predictive for target Y" even means. What feature_importance of RF captures is "how probable is, in general, to use this feature in the decision process". The problem with addressing your question is that if you ask "how probable is, in general, to use this feature in decision process leading to label Y" you would have to pretty much run the same procedure but remove all subtrees which do not contain label Y in a leaf - this way you remove parts of the decision process which do not address the problem "is it Y or not Y" but rather try to answer which "not Y" it is. However, in practice, due to very stochastic nature of RF, cutting its depth etc. this might barely reduce anything. The bad news is also, that I never seen it implemented in any standard RF library, you could do this on your own, just the way I said:
for i = 1 to K (K is number of distinct labels)
tmp_RF = deepcopy(RF)
for tree in tmp_RF:
tree = remove_all_subtrees_that_do_not_contain_given_label(tree, i)
for x in X (X is your dataset)
features_importance[i] += how_many_times_each_feature_is_used(tree, x) / |X|
features_importance[i] /= |tmp_RF|
return features_importance
in particular you could use existing feature_importance codes, simply by doing
for i = 1 to K (K is number of distinct labels)
tmp_RF = deepcopy(RF)
for tree in tmp_RF:
tree = remove_all_subtrees_that_do_not_contain_given_label(tree, i)
features_importance[i] = run_regular_feature_importance(tmp_RF)
return features_importance

Stepwise Regression in Python

How to perform stepwise regression in python? There are methods for OLS in SCIPY but I am not able to do stepwise. Any help in this regard would be a great help. Thanks.
Edit: I am trying to build a linear regression model. I have 5 independent variables and using forward stepwise regression, I aim to select variables such that my model has the lowest p-value. Following link explains the objective:
https://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CEAQFjAD&url=http%3A%2F%2Fbusiness.fullerton.edu%2Fisds%2Fjlawrence%2FStat-On-Line%2FExcel%2520Notes%2FExcel%2520Notes%2520-%2520STEPWISE%2520REGRESSION.doc&ei=YjKsUZzXHoPwrQfGs4GQCg&usg=AFQjCNGDaQ7qRhyBaQCmLeO4OD2RVkUhzw&bvm=bv.47244034,d.bmk
Thanks again.
Trevor Smith and I wrote a little forward selection function for linear regression with statsmodels: http://planspace.org/20150423-forward_selection_with_statsmodels/ You could easily modify it to minimize a p-value, or select based on beta p-values with just a little more work.
You may try mlxtend which got various selection methods.
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
clf = LinearRegression()
# Build step forward feature selection
sfs1 = sfs(clf,k_features = 10,forward=True,floating=False, scoring='r2',cv=5)
# Perform SFFS
sfs1 = sfs1.fit(X_train, y_train)
You can make forward-backward selection based on statsmodels.api.OLS model, as shown in this answer.
However, this answer describes why you should not use stepwise selection for econometric models in the first place.
I developed this repository https://github.com/xinhe97/StepwiseSelectionOLS
My Stepwise Selection Classes (best subset, forward stepwise, backward stepwise) are compatible to sklearn. You can do Pipeline and GridSearchCV with my Classes.
The essential part of my code is as follows:
################### Criteria ###################
def processSubset(self, X,y,feature_index):
# Fit model on feature_set and calculate rsq_adj
regr = sm.OLS(y,X[:,feature_index]).fit()
rsq_adj = regr.rsquared_adj
bic = self.myBic(X.shape[0], regr.mse_resid, len(feature_index))
rsq = regr.rsquared
return {"model":regr, "rsq_adj":rsq_adj, "bic":bic, "rsq":rsq, "predictors_index":feature_index}
################### Forward Stepwise ###################
def forward(self,predictors_index,X,y):
# Pull out predictors we still need to process
remaining_predictors_index = [p for p in range(X.shape[1])
if p not in predictors_index]
results = []
for p in remaining_predictors_index:
new_predictors_index = predictors_index+[p]
new_predictors_index.sort()
results.append(self.processSubset(X,y,new_predictors_index))
# Wrap everything up in a nice dataframe
models = pd.DataFrame(results)
# Choose the model with the highest rsq_adj
# best_model = models.loc[models['bic'].idxmin()]
best_model = models.loc[models['rsq'].idxmax()]
# Return the best model, along with model's other information
return best_model
def forwardK(self,X_est,y_est, fK):
models_fwd = pd.DataFrame(columns=["model", "rsq_adj", "bic", "rsq", "predictors_index"])
predictors_index = []
M = min(fK,X_est.shape[1])
for i in range(1,M+1):
print(i)
models_fwd.loc[i] = self.forward(predictors_index,X_est,y_est)
predictors_index = models_fwd.loc[i,'predictors_index']
print(models_fwd)
# best_model_fwd = models_fwd.loc[models_fwd['bic'].idxmin(),'model']
best_model_fwd = models_fwd.loc[models_fwd['rsq'].idxmax(),'model']
# best_predictors = models_fwd.loc[models_fwd['bic'].idxmin(),'predictors_index']
best_predictors = models_fwd.loc[models_fwd['rsq'].idxmax(),'predictors_index']
return best_model_fwd, best_predictors
Statsmodels has additional methods for regression: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html. I think it will help you to implement stepwise regression.
"""Importing the api class from statsmodels"""
import statsmodels.formula.api as sm
"""X_opt variable has all the columns of independent variables of matrix X
in this case we have 5 independent variables"""
X_opt = X[:,[0,1,2,3,4]]
"""Running the OLS method on X_opt and storing results in regressor_OLS"""
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Using the summary method, you can check in your kernel the p values of your
variables written as 'P>|t|'. Then check for the variable with the highest p
value. Suppose x3 has the highest value e.g 0.956. Then remove this column
from your array and repeat all the steps.
X_opt = X[:,[0,1,3,4]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Repeat these methods until you remove all the columns which have p value higher than the significance value(e.g 0.05). In the end your variable X_opt will have all the optimal variables with p values less than significance level.
Here's a method I just wrote that uses "mixed selection" as described in Introduction to Statistical Learning. As input, it takes:
lm, a statsmodels.OLS.fit(Y,X), where X is an array of n ones, where n is the
number of data points, and Y, where Y is the response in the training data
curr_preds- a list with ['const']
potential_preds- a list of all potential predictors.
There also needs to be a pandas dataframe X_mix that has all of the data, including 'const', and all of the data corresponding to the potential predictors
tol, optional. The max pvalue, .05 if not specified
def mixed_selection (lm, curr_preds, potential_preds, tol = .05):
while (len(potential_preds) > 0):
index_best = -1 # this will record the index of the best predictor
curr = -1 # this will record current index
best_r_squared = lm.rsquared_adj # record the r squared of the current model
# loop to determine if any of the predictors can better the r-squared
for pred in potential_preds:
curr += 1 # increment current
preds = curr_preds.copy() # grab the current predictors
preds.append(pred)
lm_new = sm.OLS(y, X_mix[preds]).fit() # create a model with the current predictors plus an addional potential predictor
new_r_sq = lm_new.rsquared_adj # record r squared for new model
if new_r_sq > best_r_squared:
best_r_squared = new_r_sq
index_best = curr
if index_best != -1: # a potential predictor improved the r-squared; remove it from potential_preds and add it to current_preds
curr_preds.append(potential_preds.pop(index_best))
else: # none of the remaining potential predictors improved the adjust r-squared; exit loop
break
# fit a new lm using the new predictors, look at the p-values
pvals = sm.OLS(y, X_mix[curr_preds]).fit().pvalues
pval_too_big = []
# make a list of all the p-values that are greater than the tolerance
for feat in pvals.index:
if(pvals[feat] > tol and feat != 'const'): # if the pvalue is too large, add it to the list of big pvalues
pval_too_big.append(feat)
# now remove all the features from curr_preds that have a p-value that is too large
for feat in pval_too_big:
pop_index = curr_preds.index(feat)
curr_preds.pop(pop_index)

Categories