Getting an error while running in a Jupyter notebook - Python

ERROR
Invalid parameter C for estimator DecisionTreeClassifier(class_weight=None, criterion='gini',
    max_depth=None, max_features=None, max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, presort=False, random_state=None,
    splitter='best'). Check the list of available parameters with `estimator.get_params().keys()`.
CODE
def train(X_train, y_train, X_test):
    # Scaling features
    X_train = preprocessing.scale(X_train)
    X_test = preprocessing.scale(X_test)

    Cs = 10.0 ** np.arange(-2, 3, .5)
    gammas = 10.0 ** np.arange(-2, 3, .5)
    param = [{'gamma': gammas, 'C': Cs}]
    skf = StratifiedKFold(n_splits=5)
    skf.get_n_splits(X_train, y_train)
    cvk = skf
    classifier = DecisionTreeClassifier()
    clf = GridSearchCV(classifier, param_grid=param, cv=cvk)
    clf.fit(X_train, y_train)
    print("The best classifier is: ", clf.best_estimator_)
    clf.best_estimator_.fit(X_train, y_train)
    # Estimate score
    scores = model_selection.cross_val_score(clf.best_estimator_, X_train, y_train, cv=5)
    print(scores)
    print('Estimated score: %0.5f (+/- %0.5f)' % (scores.mean(), scores.std() / 2))
    title = 'Learning Curves (SVM, rbf kernel, $\gamma=%.6f$)' % clf.best_estimator_.gamma
    plot_learning_curve(clf.best_estimator_, title, X_train, y_train, cv=5)
    plt.show()
    # Predict class
    y_pred = clf.best_estimator_.predict(X_test)
    return y_test, y_pred

It looks like you are making param a list with a single dictionary inside. param needs to be just a dictionary:
EDIT:
Looking into this further, as mentioned by @DzDev, passing a list containing a single dictionary is also a valid way to pass in parameters.
It appears that your issue is that you are mixing the concepts of two different types of estimators: you are passing in the parameters for svm.SVC but sending in a DecisionTreeClassifier estimator. So the error is exactly what it says: 'C' is not a valid parameter for a decision tree. You should either switch to an svm.SVC estimator or update your parameters to ones that are valid for DecisionTreeClassifier.
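For example, a minimal sketch of both options (the parameter values are illustrative, and X_train/y_train are assumed to come from the question):
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

skf = StratifiedKFold(n_splits=5)

# Option 1: keep DecisionTreeClassifier and search over parameters it actually accepts
tree_param = {'max_depth': np.arange(1, 11), 'min_samples_leaf': [1, 5, 10]}
tree_search = GridSearchCV(DecisionTreeClassifier(), param_grid=tree_param, cv=skf)

# Option 2: keep the C/gamma grid and use the estimator it belongs to (an RBF SVC)
Cs = 10.0 ** np.arange(-2, 3, .5)
gammas = 10.0 ** np.arange(-2, 3, .5)
svc_search = GridSearchCV(SVC(kernel='rbf'), param_grid={'C': Cs, 'gamma': gammas}, cv=skf)

# either search is then fitted the same way, e.g. svc_search.fit(X_train, y_train)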

Related

show overfitting with sklearn & random forest

I followed this tutorial to create a simple image classification script:
https://blog.hyperiondev.com/index.php/2019/02/18/machine-learning/
train_data = scipy.io.loadmat('extra_32x32.mat')
# extract the images and labels from the dictionary object
X = train_data['X']
y = train_data['y']
X = X.reshape(X.shape[0]*X.shape[1]*X.shape[2],X.shape[3]).T
y = y.reshape(y.shape[0],)
X, y = shuffle(X, y, random_state=42)
....
clf = RandomForestClassifier()
print(clf)
start_time = time.time()
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test,preds))
It gave me an accuracy of approximately 0.7.
Is there some way to visualize or show where/when/if the model is overfitting? I believe this can be shown by training the model until we see that the training accuracy keeps increasing while the validation accuracy decreases. But how can I do that in the code?
There are multiple ways you can test for overfitting and underfitting. If you want to look specifically at train and test scores and compare them, you can do this with sklearn's cross_validate (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate). If you read the documentation, it returns a dictionary with train scores (if you pass return_train_score=True) and test scores for the metrics that you supply.
Sample code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

model = RandomForestClassifier(n_estimators=1000, random_state=1, criterion='entropy', bootstrap=True, oob_score=True, verbose=1)
# cv_dict['train_score'] and cv_dict['test_score'] can then be compared fold by fold
cv_dict = cross_validate(model, X, y, return_train_score=True)
You can also simply create a hold-out test set with train_test_split and compare your training and test scores on the held-out data.
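A minimal sketch of that comparison (assuming the X and y arrays built earlier in the question):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# a large gap between these two scores is a common sign of overfitting
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))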
Another option is to use a library like Optuna, which will test various hyperparameters for you and you could use the methods mentioned above.

Decision tree with a probability target

I'm currently working on a model to predict the probability of fatality once a person is infected with the coronavirus.
I'm using a Dutch dataset with categorical variables: date of infection, fatality or cured, gender, age-group etc.
It was suggested to use a decision tree, which I've already built.
Since I'm new to decision trees I would like some assistance.
I would like to have the prediction (target variable) expressed as a probability (%), not as a binary output.
How can I achieve this?
I also want to play around with samples by inputting the data myself and seeing what the outcome is.
For instance: let's take someone who is 40, male, etc. and calculate what their survival chance is.
How can I achieve this?
I've attached the code below:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import random as rnd
filename = '/Users/sef/Downloads/pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1234)
model = DecisionTreeClassifier()
model.fit(X_train, Y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
rnd.seed(123458)
X_new = X[rnd.randrange(X.shape[0])]
X_new = X_new.reshape(1,8)
YHat = model.predict_proba(X_new)
df = pd.DataFrame(X_new, columns = names[:-1])
df["predicted"] = YHat
print(df)
You can use the method predict_proba of the DecisionTreeClassifier to compute class probabilities instead of the binary classification values.
In order to test individual data that you create by hand, you have to build an array with the same shape as a single row of your X_test data (i.e. with just one entry). Then you can use it with model.predict(array) or model.predict_proba(array).
By the way, your tree is currently not useful for retrieving probabilities. There is an article that explains the problem very well: https://rpmcruz.github.io/machine%20learning/2018/02/09/probabilities-trees.html
So you can fix your code by limiting the max_depth of your tree:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import random as rnd
filename = 'pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1234)
model = DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
model.fit(X_train, Y_train)
rnd.seed(123458)
X_new = X[rnd.randrange(X.shape[0])]
X_new = X_new.reshape(1,8)
YHat = model.predict_proba(X_new)
df = pd.DataFrame(X_new, columns = names[:-1])
df["predicted"] = list(YHat)
print(df)
A decision tree can also estimate the probability that an instance belongs to a particular class. Use predict_proba() as below with your feature data to return the probability of each class you want to predict; model.predict() returns the class with the highest probability.
model.predict_proba()
Use the function called predict_proba
model.predict_proba(X_test)
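As a minimal sketch (assuming the fitted model and X_test from the question), the result has one row per sample and one column per class, so the probability of interest is the column for the positive class:
proba = model.predict_proba(X_test)        # shape: (n_samples, n_classes)
print(model.classes_)                      # order of the columns, e.g. [0. 1.]
positive_probability = proba[:, 1]         # probability of the second class
print(positive_probability[:5])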
To the second part of your question, here is what you will have to do.
Create your own custom dataset with exactly the same column names as the ones you trained on.
Read your data from a csv and apply the same encoder values, if any.
You can also save your label encoder object in a much more efficient way:
label_encoder = preprocessing.LabelEncoder()
label_encoded_columns = ['Date_statistics_type', 'Agegroup', 'Sex', 'Province', 'Hospital_admission', 'Municipal_health_service', 'Deceased']
for col in label_encoded_columns:
    dataframe[col] = dataframe[col].astype(str)
Label_Encoder = label_encoder.fit(dataframe[label_encoded_columns].values.flatten())
Encoded_Array = (Label_Encoder.transform(dataframe[label_encoded_columns].values.flatten())).reshape(dataframe[label_encoded_columns].shape)
LE_Dataframe = pd.DataFrame(Encoded_Array, columns=label_encoded_columns, index=dataframe.index)
LE_mapping = dict(zip(Label_Encoder.classes_, Label_Encoder.transform(Label_Encoder.classes_).tolist()))
##### This should give you a dictionary mapping every value to its encoded label,
##### e.g. {'Apple': 0, 'Banana': 1}
For the second part of your question, there can be two ways.
The first one is pretty straightforward: you can use values of X_test to get a resulting prediction.
model.predict(X_test.iloc[0:30])        # first 30 rows
model.predict_proba(X_test.iloc[0:30])
In the second one, if you are talking about introducing new data, you will have to label encode the raw data once again.
If a value was not present in the training data, this may give you a 'previously unseen labels' error.
Refer to this link in that case.
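A minimal sketch of scoring one hand-made record, assuming a classifier dutch_model fitted on the label-encoded Dutch columns above and the fitted Label_Encoder; the column values below are invented for illustration and must be values the encoder has already seen:
import pandas as pd

# hypothetical single record; any value the encoder has not seen before
# makes LabelEncoder.transform raise an error about previously unseen labels
new_sample = pd.DataFrame([{
    'Date_statistics_type': 'DON', 'Agegroup': '40-49', 'Sex': 'Male',
    'Province': 'Utrecht', 'Hospital_admission': 'No',
    'Municipal_health_service': 'GGD Utrecht',
}])

encoded = new_sample.apply(lambda col: Label_Encoder.transform(col.astype(str)))
print(dutch_model.predict_proba(encoded))   # one probability per class for this record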

Failing to tune decision tree classifier parameters using gridsearch

I am trying to tune parameters using GridSearchCV but keep encountering this error message
ValueError: Invalid parameter decisiontreeclassifier for estimator DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best'). Check the list of available parameters with `estimator.get_params().keys()`.
This is the code I have written
accuracy_score = make_scorer(accuracy_score,greater_is_better = True)
dtc = DecisionTreeClassifier()
depth = np.arange(1,30)
leaves = [1,2,4,5,10,20,30,40,80,100]
param_grid =[{'decisiontreeclassifier__max_depth':depth,
'decisiontreeclassifier__min_samples_leaf':leaves}]
grid_search = GridSearchCV(estimator = dtc,param_grid = param_grid,
scoring=accuracy_score,cv=10)
grid_search = grid_search.fit(X_train,y_train)
Use max_depth instead of decisiontreeclassifier__max_depth in your param_grid. (The same thing applies to the other parameter.) The notation that you're using is for pipelines with multiple estimators chained together.
accuracy_score = make_scorer(accuracy_score,greater_is_better = True)
dtc = DecisionTreeClassifier()
depth = np.arange(1,30)
leaves = [1,2,4,5,10,20,30,40,80,100]
param_grid =[{'max_depth':depth,
'min_samples_leaf':leaves}]
grid_search = GridSearchCV(estimator = dtc,param_grid = param_grid,
scoring=accuracy_score,cv=10)
grid_search = grid_search.fit(X_train,y_train)
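If you did want the prefixed form, it only applies when the estimator is wrapped in a pipeline; a minimal sketch, assuming X_train and y_train from the question and adding a hypothetical scaling step:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# make_pipeline names each step after its class in lowercase,
# so parameters are addressed as '<stepname>__<parameter>'
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier())
param_grid = {'decisiontreeclassifier__max_depth': np.arange(1, 30),
              'decisiontreeclassifier__min_samples_leaf': [1, 2, 4, 5, 10, 20, 30, 40, 80, 100]}
grid_search = GridSearchCV(pipe, param_grid=param_grid, scoring='accuracy', cv=10)
grid_search = grid_search.fit(X_train, y_train)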

Print decision tree and feature_importance when using BaggingClassifier

Obtaining the decision tree and the important features is easy when using DecisionTreeClassifier in scikit-learn. However, I am not able to obtain either of them when I use a bagging function, e.g., BaggingClassifier.
Since we need to fit the model using the BaggingClassifier, I cannot return the results (print the trees (graphs), feature_importances_, ...) related to the DecisionTreeClassifier.
Here is my script:
seed = 7
n_iterations = 199
DTC = DecisionTreeClassifier(random_state=seed,
max_depth=None,
min_impurity_split= 0.2,
min_samples_leaf=6,
max_features=None, #If None, then max_features=n_features.
max_leaf_nodes=20,
criterion='gini',
splitter='best',
)
#parametersDTC = {'max_depth':range(3,10), 'max_leaf_nodes':range(10, 30)}
parameters = {'max_features':range(1,200)}
dt = RandomizedSearchCV(BaggingClassifier(base_estimator=DTC,
#max_samples=1,
n_estimators=100,
#max_features=1,
bootstrap = False,
bootstrap_features = True, random_state=seed),
parameters, n_iter=n_iterations, n_jobs=14, cv=kfold,
error_score='raise', random_state=seed, refit=True) #min_samples_leaf=10
# Fit the model
fit_dt= dt.fit(X_train, Y_train)
print(dir(fit_dt))
tree_model = dt.best_estimator_
# Print the important features (NOT WORKING)
features = tree_model.feature_importances_
print(features)
rank = np.argsort(features)[::-1]
print(rank[:12])
print(sorted(list(zip(features))))
# Importing the image (NOT WORKING)
from sklearn.externals.six import StringIO
tree.export_graphviz(dt.best_estimator_, out_file='tree.dot') # necessary to plot the graph
dot_data = StringIO() # need to understand but it probably relates to read of strings
tree.export_graphviz(dt.best_estimator_, out_file=dot_data, filled=True, class_names= target_names, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
img = Image(graph.create_png())
print(dir(img)) # with dir we can check what are the possibilities in graph.create_png
with open("my_tree.png", "wb") as png:
png.write(img.data)
I obtain errors like: 'BaggingClassifier' object has no attribute 'tree_' and 'BaggingClassifier' object has no attribute 'feature_importances'. Does anyone know how I can obtain them? Thanks.
Based on the documentation, the BaggingClassifier object indeed doesn't have the attribute feature_importances_. You could still compute it yourself, as described in the answer to this question: Feature importances - Bagging, scikit-learn
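A minimal sketch of that computation, assuming dt is the fitted RandomizedSearchCV from the question (so dt.best_estimator_ is the fitted BaggingClassifier) and X_train has the full set of columns:
import numpy as np

bagging = dt.best_estimator_

# average the importances of the individual trees; since bootstrap_features=True,
# each tree saw only a (possibly repeated) subset of columns, so map its importances
# back onto the full feature space before averaging
importances = np.zeros(X_train.shape[1])
for sub_tree, features in zip(bagging.estimators_, bagging.estimators_features_):
    np.add.at(importances, features, sub_tree.feature_importances_)
importances /= len(bagging.estimators_)

rank = np.argsort(importances)[::-1]
print(rank[:12])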
You can access the trees that were produced during the fitting of BaggingClassifier using the attribute estimators_, as in the following example:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
iris = datasets.load_iris()
clf = BaggingClassifier(n_estimators=3)
clf.fit(iris.data, iris.target)
clf.estimators_
clf.estimators_ is a list of the 3 fitted decision trees:
[DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=1422640898, splitter='best'),
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=1968165419, splitter='best'),
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=2103976874, splitter='best')]
So you can iterate over the list and access each one of the trees.
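Applied to the original question, a minimal sketch of exporting each bagged tree to an image (assuming dt.best_estimator_ is the fitted BaggingClassifier, and target_names is defined as in the question's script):
from sklearn import tree
import pydotplus

bagging = dt.best_estimator_
for i, sub_tree in enumerate(bagging.estimators_):
    # export_graphviz returns the dot source as a string when out_file=None
    dot_data = tree.export_graphviz(sub_tree, out_file=None, filled=True,
                                    class_names=target_names, rounded=True,
                                    special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data)
    graph.write_png("bagged_tree_%d.png" % i)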

sklearn SVM performing awfully poor

I have 9164 points, where 4303 are labeled as the class I want to predict and 4861 are labeled as not that class. There are no duplicate points.
Following How to split into train, test and evaluation sets in sklearn?, and since my dataset is a tuple of 3 items (id, vector, label), I do:
df = pd.DataFrame(dataset)
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
train_labels = construct_labels(train)
train_data = construct_data(train)
test_labels = construct_labels(test)
test_data = construct_data(test)
def predict_labels(test_data, classifier):
    labels = []
    for test_d in test_data:
        labels.append(classifier.predict([test_d]))
    return np.array(labels)

def construct_labels(df):
    labels = []
    for index, row in df.iterrows():
        if row[2] == 'Trump':
            labels.append('Trump')
        else:
            labels.append('Not Trump')
    return np.array(labels)

def construct_data(df):
    first_row = df.iloc[0]
    data = np.array([first_row[1]])
    for index, row in df.iterrows():
        if first_row[0] != row[0]:
            data = np.concatenate((data, np.array([row[1]])), axis=0)
    return data
and then:
>>> classifier = SVC(verbose=True)
>>> classifier.fit(train_data, train_labels)
[LibSVM].......*..*
optimization finished, #iter = 9565
obj = -2718.376533, rho = 0.132062
nSV = 5497, nBSV = 2550
Total nSV = 5497
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=True)
>>> predicted_labels = predict_labels(test_data, classifier)
>>> for p, t in zip(predicted_labels, test_labels):
...     if p == t:
...         correct = correct + 1
and I get only 943 correct labels out of 1833 (= len(test_labels)) -> (943 * 100 / 1833 = 51.4%)
I suspect I am missing something big here; maybe I should set a parameter on the classifier to do more refined work, or something?
Note: first time using SVMs here, so anything you might take for granted, I might not even have imagined...
Attempt:
I went ahead and decreased the number of negative examples to 4303 (the same number as positive examples). This slightly improved accuracy.
Edit after the answer:
>>> print(clf.best_estimator_)
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> classifier = SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
... decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
... max_iter=-1, probability=False, random_state=None, shrinking=True,
... tol=0.001, verbose=False)
>>> classifier.fit(train_data, train_labels)
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Also I tried clf.fit(train_data, train_labels), which performed the same.
Edit with data (the data are not random):
>>> train_data[0]
array([ 20.21062112, 27.924016 , 137.13815308, 130.97432804,
... # there are 256 coordinates in total
67.76352596, 56.67798138, 104.89566517, 10.02616417])
>>> train_labels[0]
'Not Trump'
>>> train_labels[1]
'Trump'
Most estimators in scikit-learn, such as SVC, are initialized with a number of input parameters, also known as hyperparameters. Depending on your data, you will have to figure out what to pass to the estimator during initialization. If you look at the SVC documentation in scikit-learn, you will see that it can be initialized with several different input parameters.
For simplicity, let's consider the kernel, which can be 'rbf' or 'linear' (among a few other choices), and C, which is a penalty parameter, and suppose you want to try the values 0.01, 0.1, 1, 10, 100 for C. That leads to 10 different possible models to create and evaluate.
One simple solution is to write two nested for-loops, one for kernel and the other for C, create the 10 possible models, and see which one is best. However, if you have several hyperparameters to tune, you would have to write several nested for-loops, which gets tedious.
Luckily, scikit-learn has a better way to create different models based on different combinations of hyperparameter values and choose the best one. For that, you use GridSearchCV. GridSearchCV is initialized with two things: an instance of an estimator, and a dictionary of hyperparameters with the desired values to examine. It will then create and evaluate all possible models given the choices of hyperparameters and find the best one, so you don't need to write any nested for-loops. Here is an example:
from sklearn.model_selection import GridSearchCV
print("Fitting the classifier to the training set")
param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'kernel': ['rbf', 'linear']}
clf = GridSearchCV(SVC(class_weight='balanced'), param_grid)
clf = clf.fit(train_data, train_labels)
print("Best estimator found by grid search:")
print(clf.best_estimator_)
You will need to use something similar to this example and play with different hyperparameters. If you have a good variety of values for your hyperparameters, there is a very good chance you will find a much better model this way.
It is, however, possible for GridSearchCV to take a very long time to create all these models in order to find the best one. A more practical approach is to use RandomizedSearchCV instead, which samples a random subset of all possible models (defined by the hyperparameters). It should run much faster if you have a lot of hyperparameters, and its best model is usually pretty good.
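A minimal sketch of the randomized variant (assuming the same train_data and train_labels; the distributions and n_iter are illustrative):
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

param_distributions = {
    'C': loguniform(1e-2, 1e3),        # sample C on a log scale
    'gamma': loguniform(1e-4, 1e1),
    'kernel': ['rbf', 'linear'],
}
search = RandomizedSearchCV(SVC(class_weight='balanced'),
                            param_distributions, n_iter=30, random_state=0)
search = search.fit(train_data, train_labels)
print(search.best_estimator_)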
After the comments of sascha and the answer of shahins, I did this eventually:
df = pd.DataFrame(dataset)
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
train_labels = construct_labels(train)
train_data = construct_data(train)
test_labels = construct_labels(test)
test_data = construct_data(test)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_data = scaler.fit_transform(train_data)
from sklearn.svm import SVC
# Classifier found with shahins' answer
classifier = SVC(C=10, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
classifier = classifier.fit(train_data, train_labels)
test_data = scaler.fit_transform(test_data)
predicted_labels = predict_labels(test_data, classifier)
and got:
>>> correct_labels = count_correct_labels(predicted_labels, test_labels)
>>> print_stats(correct_labels, len(test_labels))
Correct labels = 1624
Accuracy = 88.5979268958
with these methods:
def count_correct_labels(predicted_labels, test_labels):
    correct = 0
    for p, t in zip(predicted_labels, test_labels):
        if p[0] == t:
            correct = correct + 1
    return correct

def print_stats(correct_labels, len_test_labels):
    print("Correct labels = " + str(correct_labels))
    print("Accuracy = " + str(correct_labels * 100 / float(len_test_labels)))
I was able to optimize more with more hyper parameter tuning!
Helpful link: RBF SVM parameters
Note: If I don't transform the test_data, accuracy is 52.7%.
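As a side note on the scaling step: the snippet above refits the scaler on test_data, but the usual pattern is to fit the scaler on the training data only and reuse those statistics for the test set. A minimal sketch:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_data = scaler.fit_transform(train_data)  # learn mean/std on the training set only
test_data = scaler.transform(test_data)        # apply the same mean/std to the test set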
