Python RandomForest classifier (how to test it)

I have been able to create a RandomForestClassifier on a dataset.
clf = RandomForestClassifier(n_estimators=100, random_state = 101)
I can then use it on the test data like this:
prediction = pd.DataFrame(clf.predict(x)) # x = Matrix of predictor values
So my question is: how can I test clf.predict outside of Python? How can I see the values it is using, and how can I replicate the model "manually"? For example, in a regression you can take the betas, plug them into Excel, and reproduce the model. How can I do the same with a random forest in Python?
Also, is there a metric similar to R-squared for assessing the model's explanatory power?
Thanks!

The RandomForestClassifier is an ensemble model, which means it is composed of multiple decision trees.
To inspect the trees I would suggest doing it in Python itself: you can access all of the fitted trees through the classifier's estimators_ attribute and then export each one as a graph with export_graphviz from the sklearn.tree module.
If you insist on reproducing the model outside Python, you will need to export all the rules that each tree is composed of. For that, you can follow these instructions from the sklearn docs.
Regarding the metrics, for a classification problem you could use accuracy_score from the sklearn.metrics module.
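For illustration, here is a minimal sketch of both ideas, using the iris dataset as a stand-in for your own data (all variable names below are placeholders):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import export_graphviz
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=101)
clf = RandomForestClassifier(n_estimators=100, random_state=101)
clf.fit(X_train, y_train)
# every fitted tree is available in clf.estimators_
first_tree = clf.estimators_[0]
export_graphviz(first_tree, out_file="tree_0.dot")  # render/inspect with Graphviz
# accuracy on held-out data plays a role comparable to R-squared in regression
print(accuracy_score(y_test, clf.predict(X_test)))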

Related

Using sklearn's cross_val_score with different training and testing datasets

I have a quick question about the following short snippet of code (my version of sklearn, from which cross_val_score and LinearDiscriminantAnalysis are imported, is 1.1.1):
cv_results = cross_val_score(LinearDiscriminantAnalysis(),data,isTarget,cv=kfold,scoring='accuracy')
I am trying to train a LinearDiscriminantAnalysis ML algorithm on the 'data' and 'isTarget' variables, which are, respectively, a numpy array of the features of the samples in my dataset and a list marking which samples are targets (1) or non-targets (0). kfold is just a method for scoring the algorithm; it isn't important here.
My question is this: I am trying to score this algorithm by training it on 'data' and 'isTarget', but I would like to test it on a different dataset, 'data_val' and 'isTarget_val', and cross_val_score does not have parameters for training an algorithm on one dataset and testing it on another. I've been searching for other functions that will do this, and I feel there is a really simple answer that I just can't find.
Can someone help me out? Thanks :)
This is how cross-validation is designed to work. The cv argument you are supplying specifies that you want to do K-Fold cross-validation, which means that the entirety of your dataset will be used for both training and testing in K different folds.
You can read up more on cross-validation here.
You can accomplish this using a PredefinedSplit (docs) as the cv argument.
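A rough sketch of that idea, reusing the arrays from the question (data/isTarget for training, data_val/isTarget_val for testing): rows marked -1 in test_fold are only ever used for training, and rows marked 0 form the single test fold.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import PredefinedSplit, cross_val_score
test_fold = np.concatenate([np.full(len(data), -1), np.zeros(len(data_val), dtype=int)])
X = np.concatenate([data, data_val])
y = np.concatenate([isTarget, isTarget_val])
cv = PredefinedSplit(test_fold)
cv_results = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=cv, scoring='accuracy')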

I am new to ML and I don't know how to solve this problem. Can someone help me?

Download the dataset, where the first four columns are features, and the last column corresponds to categories (3 labels). Perform the following tasks.
1. Split the dataset into train and test sets (80:20).
2. Construct the Naive Bayes classifier from scratch and train it on the train set. Assume a Gaussian distribution to compute the probabilities.
3. Evaluate the performance on the test set using the following metrics:
   a. Confusion matrix
   b. Overall and class-wise accuracy
   c. ROC curve, AUC
4. Use any library (e.g. scikit-learn) and repeat 1 to 3.
5. Compare and comment on the performance of the results of the classifier in 2 and 4.
6. Calculate the Bayes risk.
Consider,
λ =
2 1 6
4 2 4
6 3 1
Where λ is the loss function and the rows and columns correspond to the classes (ci) and actions (aj) respectively, e.g. λ(a3 / c2) = 4.
It's not clear what specific part of the problem you're having trouble with, which makes it hard to give specific advice.
With that in mind, here is some reading that might help get you started:
If the dataset is in CSV format, you can read it into a dataframe using pd.read_csv() as discussed here: https://www.geeksforgeeks.org/python-read-csv-using-pandas-read_csv/
To split the df into a train set and test set, you can import scikit-learn (sklearn) and then use train_test_split() as discussed here: https://www.stackvidhya.com/train-test-split-using-sklearn-in-python/
It sounds like your professor (or whoever is the source of this question) wants you to write a function that duplicates a Naive Bayes classifier, so I'll leave you to figure that out. Sklearn does provide a Naive Bayes classifier you can read about here and use to verify your results: https://scikit-learn.org/stable/modules/naive_bayes.html
For confusion matrices, sklearn (again) provides some functionality that will let you plot a confusion matrix: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html#sklearn.metrics.ConfusionMatrixDisplay.from_predictions
For the ROC curve, you can see here: https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
Hope this is enough to get you started.
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB

# features = the four feature columns, targets = the class labels
X_train, X_test, y_train, y_test = train_test_split(features, targets,
                                                    test_size=0.20, random_state=42)

# example of Gaussian naive Bayes
# define the model
model = GaussianNB()
# fit the model
model.fit(X_train, y_train)
predict = model.predict(X_test)
matrix = classification_report(y_test, predict)
print('Classification report :\n', matrix)
https://scikit-learn.org/stable/modules/cross_validation.html
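Building on the snippet above, a rough sketch of tasks 3a-3c (confusion matrix, class-wise accuracy, ROC/AUC), reusing model, X_test, y_test and predict from that code; treat it as a starting point rather than a complete solution.
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, roc_auc_score
# a. confusion matrix (counts of true vs. predicted classes)
cm = confusion_matrix(y_test, predict)
ConfusionMatrixDisplay.from_predictions(y_test, predict)
# b. overall and class-wise accuracy derived from the confusion matrix
print('overall accuracy:', cm.trace() / cm.sum())
print('class-wise accuracy:', cm.diagonal() / cm.sum(axis=1))
# c. AUC, one-vs-rest, from the predicted class probabilities (3 classes)
probas = model.predict_proba(X_test)
print('AUC (ovr):', roc_auc_score(y_test, probas, multi_class='ovr'))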

Machine learning procedure splitting the data into 3 sets

Reading documentation and tutorials on machine learning techniques, for both classification and regression, I came across a topic that is new to me: the recommended procedure before training and testing seems to be to split the data into three different sets, training, validation and testing. Since this procedure makes sense to me, I was wondering how I should proceed with it. Let's say we split the data into these three sets, following an approach like the one described here:
Stratified Train/Validation/Test-split in scikit-learn
Taking this into account, let's say we want to build a classifier using LogisticRegression (any classifier, really). The procedure, as far as I understand it, should be something like this, right?
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Now if we want to make predictions we could use:
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)
And when one has to estimate the accuracy of the model, a common approach is:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
And here is where my question comes in: should the validation set that was split off earlier be used for calculating accuracy, or should it be used for validating the model somehow, for instance with k-fold cross-validation instead? For example:
# Perform 10-fold cross validation
scores = cross_val_score(logreg, df, y, cv=10)
Any hint about the procedure with these three sets would be really appreciated. My feeling is that the validation set should be used together with the training set, but I do not really know in which way.
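For reference, one common way to obtain the three sets described above (following the stratified train/validation/test-split post linked earlier) is to chain two train_test_split calls; this is only a sketch of the splitting step, assuming X and y are the full feature matrix and labels.
from sklearn.model_selection import train_test_split
# 60/20/20 split: first carve out the test set, then split the remainder
# into train and validation (stratify keeps the class proportions in each set)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)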

Predicting Multiple output based on multiple input like Month and Fixed values column

I have data like that shown in the image; it is about 25,000 rows. The data contains details for the 12 months of each of the past 4 years. I want to predict the client and the number of positions opened for a particular month and a particular job title.
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df_final['Clientname_numeric'] = le.fit_transform(df_final['ClientName'])
X = df_final[['MONTH','JobTitleID']]
y = df_final[['PositionsOpened','Clientname_numeric']]
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size = 0.05 )
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
clf = RandomForestClassifier()
clf.fit(x_train, y_train)
predictions = clf.predict(x_test)
predictions = predictions.astype(int)
accuracy = accuracy_score(y_test,predictions)
I am using above code and getting error
ValueError: multiclass-multioutput is not supported
You could use the scikit-learn package and its random forest classifier. I should point out that I only have very superficial knowledge of machine learning, so this might just be the wrong choice for your specific case. The RandomForestClassifier, however, does allow you to predict multiple outputs at once.
In general, given your data, you would approach it like this (using Scikit Learn):
Split the tables into input columns and output columns. This can probably be done most easily using the pandas package. Then split those into training and test subsets; scikit-learn offers an off-the-shelf solution for this.
Create an instance of a classifier like RandomForestClassifier and train it using the input and output data from your training set (classifier.fit(inputs_train, outputs_train)).
Given the inputs of your test data, predict the outputs (classifier.predict(inputs_predict)). Decide whether you are satisfied with the predictive quality of your classifier.
For classifying multiple outputs, sklearn also provides the sklearn.multioutput module (for example MultiOutputClassifier), which expects a base estimator such as a random forest, gradient boosting, etc.
It supports both multi-output regression and classification.
Hope this helps!
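As a rough sketch of that last point, here is how the multi-output wrapper from sklearn.multioutput could be applied to the question's data (df_final and its columns are taken from the question; everything else is a placeholder):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
X = df_final[['MONTH', 'JobTitleID']]
y = df_final[['PositionsOpened', 'Clientname_numeric']]  # two output columns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05)
# the wrapper fits one classifier per output column
clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=100))
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)  # shape: (n_samples, 2)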

How to use scikit's preprocessing/normalization along with cross validation?

As an example of cross-validation without any preprocessing, I can do something like this:
tuned_params = [{"penalty" : ["l2", "l1"]}]
from sklearn.linear_model import SGDClassifier
SGD = SGDClassifier()
from sklearn.grid_search import GridSearchCV
clf = GridSearchCV(SGD, tuned_params, verbose=5)
clf.fit(x_train, y_train)
I would like to preprocess my data using something like
from sklearn import preprocessing
x_scaled = preprocessing.scale(x_train)
But it would not be a good idea to do this before setting up the cross-validation, because then the training and testing sets would be normalized together. How do I set up the cross-validation so that it preprocesses the corresponding training and test sets separately on each run?
Per the documentation, if you employ Pipeline, this can be done for you. From the docs, just above section 3.1.1.1, emphasis mine:
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction [...] A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation[.]
More relevant information on pipelines is available here.
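A minimal sketch of that advice applied to the question's setup: the scaler goes inside the Pipeline, so on each cross-validation split it is fitted on the training fold only and then applied to the held-out fold (x_train and y_train are the question's variables; the GridSearchCV import path shown is the current sklearn.model_selection one).
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([
    ("scale", StandardScaler()),  # fitted on each training fold only
    ("sgd", SGDClassifier()),
])
# pipeline parameters are addressed as <step name>__<parameter name>
tuned_params = {"sgd__penalty": ["l2", "l1"]}
clf = GridSearchCV(pipe, tuned_params, verbose=5)
clf.fit(x_train, y_train)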
