Premise: I have been working on an ML dataset and found my AdaBoost (ADA) and SVM models to be extremely good at detecting true positives. The confusion matrix, identical for both models, is shown below.
Here's the image:
Out of the 10 models I have trained, 2 of them are ADA and SVM. Of the other 8, some have lower accuracy and others higher, by roughly ±2%.
MAIN QUESTION:
How do I chain/pipeline so that all my test cases are handled in the following manner?
Pass all the cases through SVM and ADA. If either SVM or ADA has 80%+ confidence, return that result.
Otherwise, if neither SVM nor ADA has high confidence, have only those test cases evaluated by the other 8 models for a final decision.
Potential Solution:
My attempt involved the use of 2 voting classifiers: one classifier with just ADA and SVM, and a second classifier with the other 8 models. But I don't know how to make this work.
Here's the code for my approach:
from sklearn.ensemble import VotingClassifier
ensemble1=VotingClassifier(estimators=[
('SVM',model[5]),
('ADA',model[7]),
], voting='hard').fit(X_train,Y_train)
print('The accuracy for ensembled model is:',ensemble1.score(X_test, Y_test))
# I was trying to make ensemble 1 the "first pass": if it was more than 80% confident in its decision, return the result
# ELSE, ensemble 2 jumps in to make a decision
ensemble2=VotingClassifier(estimators=[
('LR',model[0]),
('DT',model[1]),
('RFC',model[2]),
('KNN',model[3]),
('GBB',model[4]),
('MLP',model[6]),
('EXT',model[8]),
('XG',model[9])
], voting='hard').fit(X_train,Y_train)
#I don't know how to make these two models work together though.
Extra Questions:
These questions address some extra concerns I had and are NOT the main question:
Is what I am trying to do worth it?
Is it normal to have a confusion matrix with just True Positives and False Positives, as seen above in the picture for Model 5? Or is this indicative of incorrect training?
Are the accuracies of my models, taken individually, considered good? The models predict the likelihood of developing heart disease. Accuracies below:
Sorry for the long post and thanks for all your input and suggestions. I'm new to ML so I'd appreciate any pointers.
This is a simple implementation that hopefully solves your main problem of chaining multiple estimators:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class ChainEstimator(BaseEstimator, ClassifierMixin):
    def __init__(self, est1, est2):
        self.est1 = est1
        self.est2 = est2

    def fit(self, X, y):
        self.est1.fit(X, y)
        self.est2.fit(X, y)
        return self

    def predict(self, X):
        ans = np.zeros((len(X),)) - 1
        probs = self.est1.predict_proba(X)          # averaged confidence of Ada & SVC
        conf_samples = np.any(probs >= .8, axis=1)  # samples with >=80% confidence
        # predicted classes of the confident samples
        ans[conf_samples] = self.est1.classes_[np.argmax(probs[conf_samples, :], axis=1)]
        if conf_samples.sum() < len(X):             # use est2 for the non-confident samples
            ans[~conf_samples] = self.est2.predict(X[~conf_samples])
        return ans
Which you can call like this:
from sklearn.ensemble import VotingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
est1 = VotingClassifier(estimators=[('ada', AdaBoostClassifier()), ('svm', SVC(probability=True))], voting='soft')
est2 = VotingClassifier(estimators=[('dt', DecisionTreeClassifier()), ('knn', KNeighborsClassifier())])
clf = ChainEstimator(est1, est2).fit(X_train, Y_train)
ans = clf.predict(X_test)
Now, if you want to base your chaining on the performance of est1, you can do something like this to record its performance during training, and then add a few more ifs to the predict method (a sketch of such a predict follows the snippet):
from sklearn.model_selection import cross_val_score

def fit(self, X, y):
    self.est1.fit(X, y)
    self.est1_perf = cross_val_score(self.est1, X, y, cv=4, scoring='f1_macro')
    self.est2.fit(X, y)
    self.est2_perf = cross_val_score(self.est2, X, y, cv=4, scoring='f1_macro')
    return self
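For illustration, here is one possible sketch of those extra ifs, not from the original answer: it only trusts est1's confident predictions when its cross-validated F1 clears a threshold (the 0.9 value is an arbitrary assumption):
def predict(self, X):
    ans = np.zeros((len(X),)) - 1
    probs = self.est1.predict_proba(X)
    # only trust est1's confident samples if its cross-validated F1 was high enough
    if self.est1_perf.mean() >= 0.9:                 # hypothetical threshold
        conf_samples = np.any(probs >= .8, axis=1)
    else:
        conf_samples = np.zeros(len(X), dtype=bool)  # fall back to est2 for everything
    ans[conf_samples] = self.est1.classes_[np.argmax(probs[conf_samples, :], axis=1)]
    if conf_samples.sum() < len(X):
        ans[~conf_samples] = self.est2.predict(X[~conf_samples])
    return ans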
Note that you shouldn't be using simple accuracy for a problem like this.
Related
I made a short Jupyter notebook to go with my question regarding the TransformedTargetRegressor.
I wanted to put a transformer inside a pipeline to play with a parameter grid but the scores didn't match.
...
model = linear_model.LinearRegression()
lg_tr = preprocessing.FunctionTransformer(func=np.log, inverse_func=np.exp, check_inverse=True)
y_log = lg_tr.transform(y)
score_0 = model.fit(X, y_log).score(X, y_log)
...
model = compose.TransformedTargetRegressor(func=np.log, inverse_func=np.exp, check_inverse=True,
regressor=linear_model.LinearRegression())
score_1 = model.fit(X, y).score(X, y)
The score_0 value is correct. Why is the one from score_1 not?
I don't have a problem with the prediction, which works fine; only with the score.
Did I miss something?
Thank you =)
Typically, you are interested in how well your model performs (or scores) when predicting the actual values in their original range or scale. That, however, is what you are measuring with score_1, not with score_0.
score_0 represents the model's performance when the target variable is in log scale, which is in most cases not very useful.
score_1, however, uses the score method of TransformedTargetRegressor, which makes sure the target variable is back in its original scale before computing any performance metric. Thus, judgements should be based on score_1.
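As a quick sanity check (a sketch assuming the same X, y, np, linear_model and compose imports as in the snippets above), you can reproduce both numbers by hand: score_1 is an R² computed on the original scale, while score_0 is an R² computed on log-scale targets:
from sklearn.metrics import r2_score

# score_1: TransformedTargetRegressor fits on log(y) internally, but predict()
# applies np.exp, so the R^2 is computed against y in its original scale
ttr = compose.TransformedTargetRegressor(func=np.log, inverse_func=np.exp,
                                         regressor=linear_model.LinearRegression()).fit(X, y)
assert np.isclose(ttr.score(X, y), r2_score(y, ttr.predict(X)))

# score_0: scoring a plain LinearRegression on log-scale targets
lr = linear_model.LinearRegression().fit(X, np.log(y))
assert np.isclose(lr.score(X, np.log(y)), r2_score(np.log(y), lr.predict(X)))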
I have been constructing my own Extra Trees (XT) classifier in Rust for binary classification. To verify the correctness of my classifier, I have been comparing it against Sklearn's implementation of XT, but I constantly get different results. I thought there must be a bug in my code at first, but now I realize it's not a bug, but instead a different method of calculating votes amongst the different trees in the ensemble. In my code, each tree votes based on the most frequent classification in a leaf's subset of data. So, for example, if we are traversing a tree and find ourselves at a leaf node that has 40 classifications of 0 and 60 classifications of 1, the tree classifies the data as 1.
Looking at Sklearn's documentation for XT (as seen here), I read the following line in regard to the predict method:
The predicted class of an input sample is a vote by the trees in the forest, weighted by their probability estimates. That is, the predicted class is the one with highest mean probability estimate across the trees.
While this gives me some idea about how individual trees vote, I still have more questions. Perhaps an exact mathematical expression of how these weights are calculated would help, but I have yet to find one in the documentation.
I will provide more details in the upcoming paragraphs, but I wish to ask my question concisely here. How are these weights calculated at a high level, what are the mathematics behind it? Is there a way to change how individual XT trees calculate their votes?
---------------------------------------- Additional Details -----------------------------------------------
For my current tests, this is how I build my classifier
classifier = ExtraTreesClassifier(n_estimators=5, criterion='gini',
max_depth=1, max_features=5,random_state=0)
To predict unseen transactions X, I use classifier.predict(X). Digging through the source code of predict (seen here, line 630-ish), I see that this is all the code that executes for binary classification
proba = self.predict_proba(X)

if self.n_outputs_ == 1:
    return self.classes_.take(np.argmax(proba, axis=1), axis=0)
What this code is doing is relatively obvious to me. It merely determines the most likely classification of transactions by taking the argmax of proba. What I fail to understand is how this proba value is made in the first place. I believe that the predict_proba method that predict uses is defined here at line 650-ish. Here is what I believe the relevant source code to be:
check_is_fitted(self)
# Check data
X = self._validate_X_predict(X)

# Assign chunk of trees to jobs
n_jobs, _, _ = _partition_estimators(self.n_estimators, self.n_jobs)

# avoid storing the output of every estimator by summing them here
all_proba = [np.zeros((X.shape[0], j), dtype=np.float64)
             for j in np.atleast_1d(self.n_classes_)]
lock = threading.Lock()
Parallel(n_jobs=n_jobs, verbose=self.verbose,
         **_joblib_parallel_args(require="sharedmem"))(
    delayed(_accumulate_prediction)(e.predict_proba, X, all_proba, lock)
    for e in self.estimators_)

for proba in all_proba:
    proba /= len(self.estimators_)

if len(all_proba) == 1:
    return all_proba[0]
else:
    return all_proba
I fail to understand what exactly is being calculated here. This is where my trail goes a bit cold and I get confused, and find myself in need of help.
Trees can predict probability estimates, according to the training sample proportions in each leaf. In your example, the probability of class 0 is 0.4, and 0.6 for class 1.
Random forests and extremely random trees in sklearn perform soft voting: each tree predicts the class probabilities as above, and then the ensemble just averages those across trees. That produces a probability for each class, and then the predicted class is the one with the largest probability.
In the code, the relevant bit is _accumulate_prediction, which just sums the probability estimates, followed by the division by the number of estimators.
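To make the averaging concrete, here is a small sketch (my own illustration on a toy dataset, not from the question's data) that reproduces ExtraTreesClassifier.predict_proba and predict by hand:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = ExtraTreesClassifier(n_estimators=5, max_depth=1, random_state=0).fit(X, y)

# each tree returns the class proportions of the leaf a sample falls into;
# the forest simply averages those per-tree probabilities
manual_proba = np.mean([tree.predict_proba(X) for tree in clf.estimators_], axis=0)
assert np.allclose(manual_proba, clf.predict_proba(X))

# the predicted class is the argmax of the averaged probabilities
manual_pred = clf.classes_.take(np.argmax(manual_proba, axis=1))
assert np.array_equal(manual_pred, clf.predict(X))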
Context to what I'm trying to achieve:
I have a problem regarding image classification using scikit. I have CIFAR-10 data: training and testing images. There are 10000 training images and 1000 testing images. Each test/train image is stored in a test/train npy file as a 4-D matrix (height, width, rgb, sample). I also have test/train labels. I have a 'computeFeatures' method that uses the Histogram of Oriented Gradients method to represent image-domain features as a vector. I am trying to iterate this method over both the training and testing data so that I can create an array of features that can be used later to classify the images. I have tried creating a for loop using i and storing the results in a numpy array. I must then continue to apply PCA/LDA and do image classification with SVC, CNN, etc. (any method of image classification).
import numpy as np
import skimage.feature
from sklearn.decomposition import PCA
trnImages = np.load('trnImage.npy')
tstImages = np.load('tstImage.npy')
trnLabels = np.load('trnLabel.npy')
tstLabels = np.load('tstLabel.npy')
from sklearn.svm import SVC
def computeFeatures(image):
hog_feature, hog_as_image = skimage.feature.hog(image, visualize=True, block_norm='L2-Hys')
return hog_feature
trnArray = np.zeros([10000,324])
tstArray = np.zeros([1000,324])
for i in range(0, 10000):
    trnFeatures = computeFeatures(trnImages[:, :, :, i])
    trnArray[i, :] = trnFeatures

for i in range(0, 1000):
    tstFeatures = computeFeatures(tstImages[:, :, :, i])
    tstArray[i, :] = tstFeatures
pca = PCA(n_components = 2)
trnModel = pca.fit_transform(trnArray)
pca = PCA(n_components = 2)
tstModel = pca.fit_transform(tstArray)
# Divide the dataset into the two sets.
test_data = tstModel
test_labels = tstLabels
train_data = trnModel
train_labels = trnLabels
C = 1
model = SVC(kernel='linear', C=C)
model.fit(train_data, train_labels.ravel())
y_pred = model.predict(test_data)
accuracy = np.sum(np.equal(test_labels, y_pred)) / test_labels.shape[0]
print('Percentage accuracy on testing set is: {0:.2f}%'.format(accuracy))
The accuracy prints out as 100%. I'm pretty sure this is wrong, but I'm not sure why.
First of all,
pca = PCA(n_components = 2)
tstModel = pca.fit_transform(tstArray)
this is wrong. You have to use:
tstModel = pca.transform(tstArray)
Secondly, how did you select the dimensionality of the PCA? Why 2? Why not 25 or 100? Two principal components may be too few for images. Also, as far as I can tell, the data is not scaled prior to PCA.
Just for interest, check the balance of classes.
Regarding 'shall we use PCA before SVM or not': it highly depends on the data. Try both cases and then decide. SVC may be pretty slow to compute, so PCA (or another dimensionality reduction technique) may speed it up a little. But you need to check both cases.
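One way to avoid fitting PCA separately on the test set is to put the scaler, PCA and SVC in a single Pipeline that is fitted on the training data only. A minimal sketch, assuming the trnArray/tstArray features and labels from the question (the n_components=100 value is just an illustrative choice):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# scaling and PCA are fitted on the training features only; the same fitted
# transforms are then applied to the test features inside score()
clf = make_pipeline(StandardScaler(), PCA(n_components=100), SVC(kernel='linear', C=1))
clf.fit(trnArray, trnLabels.ravel())
print('Test accuracy: {:.2%}'.format(clf.score(tstArray, tstLabels.ravel())))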
The immediate concern in this sort of situation is that the model is over-fitted. Any professional reviewer would immediately return this to the investigator. In this case, I suspect it is a result of the statistical approach used.
I don't work with images, but I would question why PCA was being stacked onto SVM. In plain terms, you are using two successive methods that reduce/collapse a high-dimensional space, which is quite likely to produce a foregone outcome. If you collapse the high-dimensional space once, why repeat it?
The PCA is standard for images, but should be followed by something very simple such as K-means.
The other approach instead of PCA is, of course, NMF and I would recommend it if you feel PCA is not providing the resolution sought.
Otherwise the calculation looks fine.
accuracy = np.sum(np.equal(test_labels, y_pred)) / test_labels.shape[0]
On second thoughts, the accuracy index might not be concerned with over-fitting, IF (that's a grammatical emphasis type 'IF'), test_labels contained a prediction of the image (of which ~50% are incorrect).
I'm just guessing that this is what the "test_labels" data is, though, and we have no idea how that prediction was derived. So I'm not sure there's enough information to answer the question.
BTW, could someone explain "shape[0]", please? Is it needed?
One obvious problem with your approach is that you apply PCA in a rather peculiar way. You should typically only estimate one transform -- on the training data -- and then use it to transform any evaluation set as well.
This way, you kind of... implement SVM with whitening batch-norm, which sounds cool, but is at least rather unusual. So it would need much care. E.g. this way, you cannot classify a single sample. Still, it may work as an unsupervised adaptation technique.
Apart from that, it's hard to tell without access to your data. Are you sure that the test and train sets are disjoint?
I am using the code below to save a random forest model. I am using cPickle to save the trained model. As I see new data, can I train the model incrementally?
Currently, the train set has about 2 years of data. Is there a way to train on another 2 years and (kind of) append it to the existing saved model?
rf = RandomForestRegressor(n_estimators=100)
print("Trying to fit the Random Forest model --> ")

if os.path.exists('rf.pkl'):
    print("Trained model already pickled -- >")
    with open('rf.pkl', 'rb') as f:
        rf = cPickle.load(f)
else:
    df_x_train = x_train[col_feature]
    rf.fit(df_x_train, y_train)
    print("Training for the model done ")
    with open('rf.pkl', 'wb') as f:
        cPickle.dump(rf, f)

df_x_test = x_test[col_feature]
pred = rf.predict(df_x_test)
EDIT 1: I don't have the compute capacity to train the model on 4 years of data all at once.
What you're talking about, updating a model with additional data incrementally, is discussed in the sklearn User Guide:
Although not all algorithms can learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates. Actually, the ability to learn incrementally from a mini-batch of instances (sometimes called "online learning") is key to out-of-core learning as it guarantees that at any given time there will be only a small amount of instances in the main memory.
They include a list of classifiers and regressors implementing partial_fit(), but RandomForest is not among them. You can also confirm that RandomForestRegressor does not implement partial_fit() on its documentation page.
Some possible ways forward:
Use a regressor which does implement partial_fit(), such as SGDRegressor (see the sketch after this list)
Check your RandomForest model's feature_importances_ attribute, then retrain your model on 3 or 4 years of data after dropping unimportant features
Train your model on only the most recent two years of data, if you can only use two years
Train your model on a random subset drawn from all four years of data.
Change the max_depth parameter to constrain how complicated your model can get. This saves computation time and so may allow you to use all your data. It can also prevent overfitting. Use cross-validation to select the best max_depth hyperparameter for your problem
Set your RF model's n_jobs=-1 if you haven't already, to use multiple cores/processors on your machine
Use a faster ensemble-tree-based algorithm, such as xgboost
Run your model-fitting code on a large machine in the cloud, such as AWS or dominodatalab
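A minimal sketch of the first option, assuming the yearly data arrives as chunks; the chunk variable names and the scaling step are placeholders of mine, not part of the original code:
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

sgd = SGDRegressor()
scaler = StandardScaler()

# feed the data in chunks (e.g. one year at a time) instead of all at once;
# x_train_year1/y_train_year1 etc. are placeholder names for the yearly chunks
for X_chunk, y_chunk in [(x_train_year1, y_train_year1), (x_train_year2, y_train_year2)]:
    X_scaled = scaler.partial_fit(X_chunk).transform(X_chunk)  # SGD is sensitive to feature scale
    sgd.partial_fit(X_scaled, y_chunk)

pred = sgd.predict(scaler.transform(x_test[col_feature]))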
You can set the warm_start parameter to True in the model. This retains what was learned in previous fit calls.
The same model, learning incrementally in two steps (train_X[:1], then train_X[1:2]) after setting warm_start:
forest_model = RandomForestRegressor(warm_start=True)
forest_model.fit(train_X[:1],train_y[:1])
pred_y = forest_model.predict(val_X[:1])
mae = mean_absolute_error(pred_y,val_y[:1])
print("mae :",mae)
print('pred_y :',pred_y)
forest_model.fit(train_X[1:2],train_y[1:2])
pred_y = forest_model.predict(val_X[1:2])
mae = mean_absolute_error(pred_y,val_y[1:2])
print("mae :",mae)
print('pred_y :',pred_y)
mae : 1290000.0
pred_y : [ 1630000.]
mae : 925000.0
pred_y : [ 1630000.]
A model trained only on the last chunk (train_X[1:2]):
forest_model = RandomForestRegressor()
forest_model.fit(train_X[1:2],train_y[1:2])
pred_y = forest_model.predict(val_X[1:2])
mae = mean_absolute_error(pred_y,val_y[1:2])
print("mae :",mae)
print('pred_y :',pred_y)
mae : 515000.0
pred_y : [ 1220000.]
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
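Note that for random forests, warm_start=True keeps the already fitted trees and only adds new ones when n_estimators is increased between fit calls; existing trees are not retrained. A minimal sketch of growing the forest on a new chunk of data (the variable names are placeholders of mine):
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor(n_estimators=100, warm_start=True)
forest_model.fit(X_years_1_2, y_years_1_2)    # fits trees 1-100 on the first two years

forest_model.set_params(n_estimators=200)     # request 100 additional trees
forest_model.fit(X_years_3_4, y_years_3_4)    # only the new trees are fit, on the new data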
I have done some simple Bayesian classification:
from sklearn.naive_bayes import BernoulliNB

X = [[1, 0, 0], [1, 1, 0]]  # there is more data, of course
Y = [1, 0]
classifier = BernoulliNB()
classifier.fit(X, Y)
Now I have got some "insider tips" that the first element in every X is more important than the others.
Can I incorporate this knowledge before I train the model please?
If sklearn doesn't allow it, is there any other classifier or other library that allows us to incorporate our prior before model training please?
I do not know the answer to question 2, but I can answer question 1.
The suggestion in the comments to "multiply the first element for each observation by different values" is the wrong approach.
When you are using BernoulliNB (or any binomial model), the way to incorporate prior knowledge is to add that knowledge to the sample (data).
Let's say you are flipping a coin and you know that the coin is rigged towards more heads. Then you add more samples that show heads. If your prior knowledge says 70% heads and 30% tails, you can add a total of 100 samples, 70 heads and 30 tails, to your data X.
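If the prior knowledge you want to encode is specifically about how likely each class is, note that BernoulliNB also exposes this directly through its class_prior parameter, so you don't need synthetic samples for that part. A minimal sketch, reusing the 70/30 split from the coin example:
from sklearn.naive_bayes import BernoulliNB

# class_prior fixes p(class = k) instead of estimating it from the data;
# here class 0 gets prior 0.3 and class 1 gets prior 0.7
classifier = BernoulliNB(class_prior=[0.3, 0.7])
classifier.fit(X, Y)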
Think about what the algorithm is actually doing. Naive Bayes performs the following classification:
p(class = k | data) ~ p(class = k) * p(data | class = k)
In words: The (posterior) probability of an observation being in class k is proportional to the probability of any observation being in class k (that's the prior) times the probability of seeing the observation, given it came from class k (the likelihood).
Usually when we don't know anything, we assume that p(class = k) just reflects the distribution of the observed data.
In your case, you're saying that you have some information, in addition to the observed data, that leads you to believe that the prior, p(class = k) should be amended. This is perfectly legitimate. In fact, that's the beauty of Bayesian inference. Whatever your prior knowledge is, you should incorporate that into this term. So in your case, perhaps that's increasing the probability of being in a particular class (i.e. increasing its weight as suggested in the comments), if you know that it's more likely to occur than the data suggests.