Scikit-learn - How to use Cross Validation correctly - python

I am working on a program where I have some data (labeled and unlabeled) and 2 different groups ("artritis" and "fibro"). I would like to obtain the classifier's accuracy and then classify the unlabeled data. My problem is that I am testing it with 2 classifiers (LDA and QDA). With the first one I obtain an accuracy of 81%, and when I classify the unlabeled data (39 objects) it classifies everything correctly. However, when I use QDA I obtain an accuracy of 93.74%, but when it classifies the unlabeled data (the same 39 objects) it labels 3 of them with the wrong group. Can someone help me find my error?
My code:
#"listaTrain" has a list of dictionaries which are the labeled data and will be used for
# training and Cross-Validation
#"listaLabels" has a list of the train labels
#"listaClasificar" has a list of dictionaries which are the unlabeled data
# which I want to label
#"clasificador" is my classifier
X=vec.fit_transform(listaTrain) #I transform the dictionaries to
#a format that sklearn can use
X=preprocessing.scale(X.toarray()) #I scale the values
clasificador.fit(X, listaLabels) #I train the classifier with the train data and
# the train labels
n_samples = X.shape[0]
cv = cross_validation.ShuffleSplit(n_samples, n_iter=300, test_size=0.6, random_state=4)
#I make Cross-Validation dividing the X's data (40% for training and 60% for testing)
scores = cross_validation.cross_val_score(clasificador, X, listaLabels,v=cv)
#I obtain the Cross-validation accuracy
scores.mean() #I obtain the accuracy mean (here is where i obtain 81% and 93%)
testX=vec.transform(listaClasificar) #I transform the dictionaries to a
#format that sklearn can use
testX=preprocessing.scale(testX.toarray()) #I scale the values
predicted=clasificador.predict(testX) #I predict the labels of the unlabeled data
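For reference, a minimal sketch of the same cross-validation step written against the current sklearn API (ShuffleSplit and cross_val_score moved to sklearn.model_selection); variable names are kept from the snippet above, and wrapping the scaler in a Pipeline is one way to make sure it is fitted on the training folds only:

from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X here would be the unscaled output of vec.fit_transform(listaTrain).toarray()
modelo = make_pipeline(StandardScaler(), clasificador)  # scaling happens inside each CV fold
cv = ShuffleSplit(n_splits=300, test_size=0.6, random_state=4)
scores = cross_val_score(modelo, X, listaLabels, cv=cv)
print(scores.mean())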

Related

Getting a straight line while creating an ARIMA model

I have a Fan Speed (RPM) dataset of 192,405 values (train + test values). I am training the ARIMA model, trying to predict the rest of the future values of our dataset, and comparing the results.
When forecasting over the test data I get a straight line for the predictions.
from sklearn.model_selection import train_test_split
from statsmodels.tsa.arima_model import ARIMA

dfx = df[(df['Tarih'] > '2020-07-23') & (df['Tarih'] < '2020-10-23')]
X_train = dfx[:int(dfx.shape[0] * 0.8)]  # first 2 months
X_test = dfx[int(dfx.shape[0] * 0.8):]   # rest, 1 month
model = ARIMA(X_train.Value, order=(4, 1, 4))
model_fit = model.fit(disp=0)
print(model_fit.summary())
test = X_test
train = X_train
What could I do now?
Your ARIMA model uses the last 4 observations to make a prediction. The first prediction is based on the last four known data points. The second prediction is based on the first prediction and the last three known data points. The third prediction is based on the first and second predictions and the last two known data points, and so on. Your fifth prediction will be based entirely on predicted values, and the hundredth prediction will be based on predicted values of predicted values of predicted values. Each prediction deviates slightly from the actual value, and these prediction errors accumulate over time. This often leads to ARIMA simply predicting a straight line when you try to forecast such large horizons.
If your model uses the MA component, represented by the q parameter, then you can only predict q steps into the future. That means your model is only able to predict the next four data points; after that the prediction will converge to a straight line.
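One common way to sanity-check this is a rolling one-step-ahead forecast, where each real observation is fed back in before predicting the next point. A minimal sketch, using the newer statsmodels ARIMA API and the variable names from the question (refitting at every step is slow on a series this large, so you may want to try it on a slice first):

from statsmodels.tsa.arima.model import ARIMA

history = list(X_train.Value)
preds = []
for actual in X_test.Value:
    res = ARIMA(history, order=(4, 1, 4)).fit()
    preds.append(res.forecast(steps=1)[0])  # predict only the next point
    history.append(actual)                  # then feed the real observation back in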

Is it overfitting or a data leakage problem?

I have applied Sklearn DecisionTreeClassifier() on a personalized dataset to perform binary classification (class 0 and class 1).
Initially the classes were not balanced, so I tried to balance them using:
rus = RandomUnderSampler(random_state=42, replacement=True)
data_rus, target_rus = rus.fit_resample(X, y)
So my dataset was balanced, with 186404 samples for class 0 and 186404 samples for class 1. The training samples were 260965 and the testing samples were 111843.
I calculated the accuracy using sklearn.metrics and I got the next result:
clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(data_rus, target_rus)
accuracy_score(y_test, clf.predict(X_test))  # I got 100% for both training and testing
clf.score(X_test, y_test)                    # I got 100% for both training and testing
So I got 100% accuracy for both the training and testing phases. I am sure this result is abnormal, but I could not tell whether it is overfitting or data leakage, even though I had shuffled my data before splitting it. I then decided to plot both training and testing accuracy using
sklearn.model_selection.validation_curve
I got the following figure and I could not interpret it:
I tried two other classification algorithms, Logistic Regression and SVM, and got testing accuracies of 99.84% and 99.94%, respectively.
Update
In my original dataset I have 4 categorical columns, which I mapped using the following code:
DataFrame['Color'] = pd.Categorical(DataFrame['Color'])
DataFrame['code_Color'] = DataFrame.Color.cat.codes
After using RandomUnderSampler to undersample my original data and obtain a class balance, I split the data into train and test datasets with sklearn's train_test_split.
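For context, the overall workflow described above looks roughly like this (a sketch only; the 0.3 test size and the split settings are assumptions chosen to match the reported sample counts):

import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# encode each categorical column as integer codes (as in the Update above)
DataFrame['code_Color'] = pd.Categorical(DataFrame['Color']).codes

# balance the classes, then split into train and test sets
rus = RandomUnderSampler(random_state=42, replacement=True)
data_rus, target_rus = rus.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(
    data_rus, target_rus, test_size=0.3, shuffle=True, random_state=0)

# note: the snippet above fits on data_rus (the full resampled set);
# here the fit uses X_train only
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)
print(accuracy_score(y_train, clf.predict(X_train)))  # training accuracy
print(accuracy_score(y_test, clf.predict(X_test)))    # testing accuracy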
Any idea would be helpful, please!

Is train_test_split necessary for binary classification? And why are there 4 outcomes?

Why are there 4 outcomes to train_test_split in sklearn? Why is there y_test, if the testing data has no y_data?
The reason you get 4 outcomes is that you get: train_features, test_features, train_labels and test_labels (X_train, X_test, y_train, y_test). So it not only splits the dataset into train and test sets, but also splits the labels (so 2 + 2 = 4 outcomes).
Looking into the documentation, you can see that the first parameter is *arrays, which means you can pass as many arrays as you want there. Now, what does it return?
Returns: splitting : list, length=2 * len(arrays)
Which means it returns twice the number of arrays passed into the train_test_split function.
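For instance (the array names here are just for illustration), passing two arrays returns four pieces and passing one returns two:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)                 # 10 labels

# two arrays in -> 2 * 2 = 4 arrays out
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# one array in -> 2 * 1 = 2 arrays out
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)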
So, if you already have a training and a testing set, it only makes sense to split the training set, so you can have a validation set to check the model performance.
E.g.:
train_data, validation_data, train_label, validation_label = train_test_split(original_train_data, original_train_label)
Note that you must also split the labels in case you have the data and the labels in separate vectors.
Because you have split your original data into train and test parts, there are four outcomes:
1. (X_train, Y_train), where X_train are the training points and Y_train are their respective class labels. This is your training data, which will be used to train your model with any classical model such as k-NN, logistic regression or decision trees.
2. (X_test, Y_test), where X_test are your test data points and Y_test are their respective class labels. Once you have trained your model and calculated your training error/accuracy, you can use these points to see whether the trained model predicts the data correctly. The lower the difference between your training and test error, the better it is.
That is why you get 4 outcomes, in two pairs of 2.
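For instance, a minimal sketch of that train/test comparison (the dataset and model choice are just for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, Y_train)
print(model.score(X_train, Y_train))  # training accuracy
print(model.score(X_test, Y_test))    # test accuracy; the smaller the gap, the better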
Hope this helps.

What is meta-classifier in StackingClassifier function in mlxtend?

In the mlxtend library, there is an ensemble-learning meta-classifier for stacking called StackingClassifier.
Here is an example of a StackingClassifier function call:
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
                          meta_classifier=lr)
What is meta_classifier here? What is it used for?
What is stacking?
Stacking is an ensemble learning technique to combine multiple classification models via a meta-classifier. The individual classification models are trained based on the complete training set; then, the meta-classifier is fitted based on the outputs -- meta-features -- of the individual classification models in the ensemble.
Source : StackingClassifier-mlxtend
So the meta_classifier parameter lets us choose the classifier that is fitted on the outputs of the individual models.
Example:
Assume that you have used 3 binary classification models, say LogisticRegression, DT & KNN, for stacking, and that 0, 0, 1 are the classes they predict. Now we need a classifier that will do majority voting on the predicted values, and that classifier is the meta_classifier. In this example it would pick 0 as the predicted class.
You can extend this to probability values as well.
Refer mlxtend-API for more info
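A minimal, self-contained sketch of the pattern described above, assuming mlxtend is installed (the base models and dataset are just for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from mlxtend.classifier import StackingClassifier

X, y = load_iris(return_X_y=True)

clf1 = LogisticRegression(max_iter=1000)
clf2 = DecisionTreeClassifier(random_state=0)
clf3 = KNeighborsClassifier()
lr = LogisticRegression(max_iter=1000)  # meta-classifier, fitted on the base models' outputs

sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], meta_classifier=lr)
print(cross_val_score(sclf, X, y, cv=5).mean())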
The meta-classifier is the one that takes in all the predicted values of your models. In your example you have three classifiers clf1, clf2, clf3; let's say clf1 is naive Bayes, clf2 is a random forest and clf3 is an SVM. For every data point x_i in your dataset, all three models compute h_1(x_i), h_2(x_i), h_3(x_i), where h_1, h_2, h_3 are the functions learned by clf1, clf2, clf3. These three models give three predicted y_i values, and they can all run in parallel. A model is then trained on these predicted values, which is known as the meta-classifier, and that is logistic regression in your case.
So for a new query point x_q the prediction is h'(h_1(x_q), h_2(x_q), h_3(x_q)), where h' is the function that computes y_q.
The advantage of a meta-classifier or ensemble model is that if clf1 gives an accuracy of 90%, clf2 gives 92% and clf3 gives 93%, the stacked model trained with the meta-classifier can often reach an accuracy above 93%. Stacking classifiers are used extensively in Kaggle competitions.
meta_classifier is simply the classifier that makes the final prediction by using all the other classifiers' predictions as features. So it takes the classes predicted by the various classifiers and picks the final one as the result you need.

Python vectorization for classification [duplicate]

This question already has an answer here: Scikit learn - fit_transform on the test set (1 answer). Closed 8 years ago.
I am currently trying to build a text classification model (document classification) with roughly 80 classes. When I build and train the model using random forest (after vectorizing the text into a TF-IDF matrix), the model works well. However, when I introduce new data, the words in it aren't necessarily identical to those used to build my RF. This is a problem because I end up with a different number of features in my training set than in my test set (the training set has fewer dimensions than the test set).
####### Convert bag of words to TF-IDF matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(data)
print(tfidf_matrix.shape)
## number of features = 421
####### Train Random Forest model
clf = RandomForestClassifier(max_depth=None, min_samples_split=2, random_state=1, n_jobs=-1)
####### k-fold cross-validation
scores = cross_val_score(clf, tfidf_matrix.toarray(), labels, cv=7, n_jobs=-1)
print(scores.mean())
### this is the new data matrix for unseen data
new_tfidf = tfidf_vectorizer.fit_transform(new_X)
### number of features = 619
clf.fit(tfidf_matrix.toarray(), labels)
clf.predict(new_tfidf.toarray())
How can I go about creating a working RF model for classification that will incorporate new features (words) that weren't seen in the training?
Do not call fit_transform on the unseen data, only transform! That will keep the dictionary from the training set.
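In terms of the question's code, that means keeping the vectorizer fitted on the training documents and calling transform on the new ones. A minimal sketch (the two small document lists stand in for the question's data and new_X):

from sklearn.feature_extraction.text import TfidfVectorizer

data = ["the cat sat on the mat", "dogs chase cats"]          # training documents
new_X = ["a brand new document about cats and birds"]          # unseen documents

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(data)  # fit learns the training vocabulary
new_tfidf = tfidf_vectorizer.transform(new_X)        # reuse it: same columns as the training matrix
print(tfidf_matrix.shape[1] == new_tfidf.shape[1])   # True: identical feature count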
You cannot introduce new features into the test set that were not part of your training set. The model is trained on a specific dictionary of terms, and that same dictionary must be used across training, validation, testing, and production. Furthermore, the indices of the words in your feature vector cannot change either.
You should create one large matrix using all of your data and then split the rows into your train and test sets. This will guarantee that you have the same feature set for train and test.
