Can I set feature priors in sklearn Bayesian classifier? - python

I have done some simple Bayesian classification:
from sklearn.naive_bayes import BernoulliNB

X = [[1, 0, 0], [1, 1, 0]]  # there is more data, of course
Y = [1, 0]
classifier = BernoulliNB()
classifier.fit(X, Y)
Now I have got some "insider tips" that the first element in every X is more important than the others.
Can I incorporate this knowledge before I train the model?
If sklearn doesn't allow it, is there another classifier or library that lets me incorporate this prior before training the model?

I do not know the answer to question 2, but I can answer question 1.
The suggestion in the comments to "multiply the first element for each observation by different values" is the wrong approach.
When you are using BernoulliNB (or a binomial model), the way you incorporate prior knowledge is by adding that knowledge to the sample (the data).
Let's say you are flipping a coin and you know it is rigged towards heads: then you add samples that show more heads. If your prior knowledge says 70% heads and 30% tails, you can add 100 samples in total, 70 heads and 30 tails, to your data X.
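A rough sketch of that pseudo-sample idea applied to the question's data (the feature values of the added rows are placeholders I made up; note that they will also influence the per-feature likelihoods, not just the class prior):
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[1, 0, 0], [1, 1, 0]])      # real observations (toy data from the question)
Y = np.array([1, 0])

X_pseudo = np.tile([1, 0, 0], (100, 1))   # 100 placeholder feature rows
Y_pseudo = np.array([1] * 70 + [0] * 30)  # 70% class 1, 30% class 0

classifier = BernoulliNB()
classifier.fit(np.vstack([X, X_pseudo]), np.concatenate([Y, Y_pseudo]))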

Think about what the algorithm is actually doing. Naive Bayes performs the following classification:
p(class = k | data) ~ p(class = k) * p(data | class = k)
In words: The (posterior) probability of an observation being in class k is proportional to the probability of any observation being in class k (that's the prior) times the probability of seeing the observation, given it came from class k (the likelihood).
Usually when we don't know anything, we assume that p(class = k) just reflects the distribution of the observed data.
In your case, you're saying that you have some information, in addition to the observed data, that leads you to believe that the prior, p(class = k) should be amended. This is perfectly legitimate. In fact, that's the beauty of Bayesian inference. Whatever your prior knowledge is, you should incorporate that into this term. So in your case, perhaps that's increasing the probability of being in a particular class (i.e. increasing its weight as suggested in the comments), if you know that it's more likely to occur than the data suggests.
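If what you want to amend is specifically p(class = k), sklearn also lets you set it directly: BernoulliNB accepts a class_prior argument, given in the order of the sorted class labels. A minimal sketch (the 0.3/0.7 values are purely illustrative):
from sklearn.naive_bayes import BernoulliNB

X = [[1, 0, 0], [1, 1, 0]]  # toy data from the question
Y = [1, 0]

# classes_ will be [0, 1], so this fixes p(class=0)=0.3 and p(class=1)=0.7
# regardless of the class counts observed in Y.
classifier = BernoulliNB(class_prior=[0.3, 0.7])
classifier.fit(X, Y)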

Related

How are the votes of individual trees calculated for Random Forest and Extra Trees in Sklearn?

I have been constructing my own Extra Trees (XT) classifier in Rust for binary classification. To verify the correctness of my classifier, I have been comparing it against Sklearn's implementation of XT, but I constantly get different results. At first I thought there must be a bug in my code, but now I realize it's not a bug, just a different method of calculating votes amongst the trees in the ensemble. In my code, each tree votes based on the most frequent classification in a leaf's subset of the data. So, for example, if we are traversing a tree and find ourselves at a leaf node with 40 classifications of 0 and 60 classifications of 1, the tree classifies the data as 1.
Looking at Sklearn's documentation for XT (as seen here), I read the following line regarding the predict method:
The predicted class of an input sample is a vote by the trees in the forest, weighted by their probability estimates. That is, the predicted class is the one with highest mean probability estimate across the trees.
While this gives me some idea about how individual trees vote, I still have more questions. Perhaps an exact mathematical expression of how these weights are calculated would help, but I have yet to find one in the documentation.
I will provide more details in the upcoming paragraphs, but I wish to ask my question concisely here. How are these weights calculated at a high level, what are the mathematics behind it? Is there a way to change how individual XT trees calculate their votes?
---------------------------------------- Additional Details -----------------------------------------------
For my current tests, this is how I build my classifier
classifier = ExtraTreesClassifier(n_estimators=5, criterion='gini',
                                  max_depth=1, max_features=5, random_state=0)
To predict unseen transactions X, I use classifier.predict(X). Digging through the source code of predict (seen here, line 630-ish), I see that this is all the code that executes for binary classification
proba = self.predict_proba(X)

if self.n_outputs_ == 1:
    return self.classes_.take(np.argmax(proba, axis=1), axis=0)
What this code is doing is relatively obvious to me. It merely determines the most likely classification of the transactions by taking the argmax of proba. What I fail to understand is how this proba value is produced in the first place. I believe the predict_proba method that predict uses is defined here, at line 650-ish. Here is what I believe the relevant source code to be:
check_is_fitted(self)
# Check data
X = self._validate_X_predict(X)

# Assign chunk of trees to jobs
n_jobs, _, _ = _partition_estimators(self.n_estimators, self.n_jobs)

# avoid storing the output of every estimator by summing them here
all_proba = [np.zeros((X.shape[0], j), dtype=np.float64)
             for j in np.atleast_1d(self.n_classes_)]
lock = threading.Lock()
Parallel(n_jobs=n_jobs, verbose=self.verbose,
         **_joblib_parallel_args(require="sharedmem"))(
    delayed(_accumulate_prediction)(e.predict_proba, X, all_proba, lock)
    for e in self.estimators_)

for proba in all_proba:
    proba /= len(self.estimators_)

if len(all_proba) == 1:
    return all_proba[0]
else:
    return all_proba
I fail to understand what exactly is being calculated here. This is where my trail goes a bit cold and I get confused, and find myself in need of help.
Trees can predict probability estimates, according to the training sample proportions in each leaf. In your example, the probability of class 0 is 0.4, and 0.6 for class 1.
Random forests and extremely random trees in sklearn perform soft voting: each tree predicts the class probabilities as above, and then the ensemble just averages those across trees. That produces a probability for each class, and then the predicted class is the one with the largest probability.
In the code, the relevant bit is _accumulate_prediction, which just sums the probability estimates across trees; the loop that follows divides by the number of estimators.
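To make that concrete, here is a small check on toy data (make_classification is just a stand-in; this is not the library's internal code) that reproduces the ensemble's predict_proba by averaging the per-tree probabilities and taking the argmax:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = ExtraTreesClassifier(n_estimators=5, criterion='gini', max_depth=1,
                           max_features=5, random_state=0).fit(X, y)

# Each tree's predict_proba returns the class proportions of the leaf a sample lands in;
# the ensemble averages these across trees (soft voting) and predicts the argmax.
mean_proba = np.mean([tree.predict_proba(X) for tree in clf.estimators_], axis=0)
print(np.allclose(mean_proba, clf.predict_proba(X)))                            # True (up to float rounding)
print((clf.classes_.take(mean_proba.argmax(axis=1)) == clf.predict(X)).all())   # True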

Statsmodels: Implementing a direct and recursive multi-step forecasting strategy with ARIMA

I am currently trying to implement both direct and recursive multi-step forecasting strategies using the statsmodels ARIMA library and it has raised a few questions.
A recursive multi-step forecasting strategy would be training a one-step model, predicting the next value, appending the predicted value onto the end of my exogenous values fed into the forecast method and repeating. This is my recursive implementation:
def arima_forecast_recursive(history, horizon=1, config=None):
    # make list so can add / remove elements
    history = history.tolist()
    model = ARIMA(history, order=config)
    model_fit = model.fit(trend='nc', disp=0)
    for i, x in enumerate(history):
        yhat = model_fit.forecast(steps=1, exog=history[i:])
        yhat.append(history)
    return np.array(yhat)
def walk_forward_validation(dataframe, config=None):
    n_train = 52  # Give a minimum of 2 forecasting periods to capture any seasonality
    n_test = 26   # Test set should be the size of one forecasting horizon
    n_records = len(dataframe)
    tuple_list = []
    for index, i in enumerate(range(n_train, n_records)):
        # create the train-test split
        train, test = dataframe[0:i], dataframe[i:i + n_test]
        # Test set is less than forecasting horizon so stop here.
        if len(test) < n_test:
            break
        yhat = arima_forecast_recursive(train, n_test, config)
        results = smape3(test, yhat)
        tuple_list.append(results)
    return tuple_list
Similarly to perform a direct strategy I would just fit my model on the available training data and use this to predict the total multi-step forecast at once. I am not sure how to achieve this using the statsmodels library.
My attempt (which produces results) is below:
def walk_forward_validation(dataframe, config=None):
    # This currently implements a direct forecasting strategy
    n_train = 52  # Give a minimum of 2 forecasting periods to capture any seasonality
    n_test = 26   # Test set should be the size of one forecasting horizon
    n_records = len(dataframe)
    tuple_list = []
    for index, i in enumerate(range(n_train, n_records)):
        # create the train-test split
        train, test = dataframe[0:i], dataframe[i:i + n_test]
        # Test set is less than forecasting horizon so stop here.
        if len(test) < n_test:
            break
        yhat = arima_forecast_direct(train, n_test, config)
        results = smape3(test, yhat)
        tuple_list.append(results)
    return tuple_list
def arima_forecast_direct(history, horizon=1, config=None):
    model = ARIMA(history, order=config)
    model_fit = model.fit(trend='nc', disp=0)
    return model_fit.forecast(steps=horizon)[0]
What confuses me specifically is whether the model should be fit once for all predictions or refit multiple times to produce the multi-step forecast. Souhaib Ben Taieb's doctoral thesis (page 35, paragraph 3) states that the direct strategy estimates H models, where H is the length of the forecast horizon, so in my example with a forecast horizon of 26, 26 models should be estimated instead of just one. As shown above, my current implementation only fits one model.
What I do not understand is how, if I call the ARIMA.fit() method multiple times on the same training data, I will get a fit that is any different beyond the expected normal stochastic variation.
My final question is about optimisation. Using a method such as walk-forward validation gives me statistically very significant results, but for many time series it is very computationally expensive. Both of the above implementations are already called using joblib's parallel loop execution, which significantly reduced the runtime on my laptop. However, I would like to know whether anything in the implementations above can be changed to make them even more efficient. When running these methods for ~2000 separate time series (~500,000 data points in total across all series), the runtime is 10 hours. I have profiled the code, and most of the execution time is spent inside the statsmodels library, which is fine, but there is a discrepancy between the runtime of the walk_forward_validation() method and ARIMA.fit(). This is expected, since walk_forward_validation() obviously does more than just call the fit method, but if anything in it can be changed to speed up execution time, please let me know.
The idea of this code is to find an optimal ARIMA order per time series, since it isn't feasible to investigate 2000 time series individually, and as such walk_forward_validation() is called 27 times per time series, so roughly 27,000 times overall. Therefore any performance saving found within this method will have an impact, no matter how small it is.
Normally, ARIMA can only perform recursive forecasting, not direct forecasting. There might be some research on variations of ARIMA for direct forecasting, but they wouldn't be implemented in Statsmodels. In statsmodels (or in R's auto.arima()), when you set a horizon h > 1, it simply performs a recursive forecast to get there.
As far as I know, none of the standard forecasting libraries have direct forecasting implemented yet; you're going to have to code it yourself.
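To illustrate the first point: with the current statsmodels API (statsmodels.tsa.arima.model.ARIMA; the toy series and order below are only illustrative), asking forecast() for 26 steps already performs the recursion internally, so no manual append-and-forecast loop is needed:
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

y = np.sin(np.arange(200) / 5.0) + np.random.normal(scale=0.1, size=200)  # toy series
res = ARIMA(y, order=(2, 0, 1)).fit()
recursive_forecast = res.forecast(steps=26)   # 26-step-ahead recursive forecast
print(recursive_forecast)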
Taken from Souhaib Ben Taieb's doctoral thesis (page 35 paragraph 3) it is presented that direct model will estimate H models, where H is the length of the forecast horizon, so in my example with a forecast horizon of 26, 26 models should be estimated instead of just one.
I haven't read Ben Taieb's thesis, but from his paper "Machine Learning Strategies for Time Series Forecasting", for direct forecasting, there is only one model for one value of H. So for H=26, there will be only one model. There will be H models if you need to forecast for every value between 1 and H, but for one H, there is only one model.
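To make that distinction concrete, here is a rough sketch of the direct strategy outside of ARIMA (the lag count, the LinearRegression choice and the toy series are all illustrative, not anything from statsmodels): for a single fixed horizon H there is one model that maps the last p observations straight to the value H steps ahead, and forecasting every step 1..H directly would require H such models.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_direct(history, horizon, n_lags=4):
    # One model per horizon: regress the value `horizon` steps ahead on the last n_lags values.
    X, y = [], []
    for t in range(n_lags, len(history) - horizon + 1):
        X.append(history[t - n_lags:t])
        y.append(history[t + horizon - 1])
    return LinearRegression().fit(np.array(X), np.array(y))

history = np.sin(np.arange(200) / 5.0)                      # toy series
models = {h: fit_direct(history, h) for h in range(1, 27)}  # 26 models for horizons 1..26
last_window = history[-4:].reshape(1, -1)
direct_forecast = [models[h].predict(last_window)[0] for h in range(1, 27)]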

SKlearn prediction on test dataset with different shape from training dataset shape

I'm new to ML and would be grateful for any assistance provided. I've run a linear regression prediction using test set A and training set A. I saved the linear regression model and would now like to use the same model to predict the test set A target using features from test set B. Each time I run the model it throws the error below.
How can I successfully predict a test data set from features and a target with different shapes?
Input
print(testB.shape)
print(testA.shape)
Output
(2480, 5)
(1315, 6)
Input
saved_model = joblib.load(filename)
testB_result = saved_model.score(testB_features, testA_target)
print(testB_result)
Output
ValueError: Found input variables with inconsistent numbers of samples: [1315, 2480]
Thanks again
They are inconsistent shapes, which is why the error is being thrown. Have you tried reshaping the data so the shapes match? From a quick look, it seems that testB has more samples and one fewer column than testA.
Think about it: if you have trained your model with 5 features, you cannot then ask the same model to make a prediction given 6 features. You speak of using a linear regressor; the equation is roughly:
y = b + w0*x0 + w1*x1 + w2*x2 + .. + wN-1*xN-1
Where {
y is your output/label
N is the number of features
b is the bias term
w(i) is the ith weight
x(i) is the ith feature value
}
You have trained a linear regressor with 5 features, effectively producing the following
y (your output/label) = b + w0*x0 + w1*x1 + w2*x2 + w3*x3 + w4*x4
You then ask it to make a prediction given 6 features but it only knows how to deal with 5.
Aside from that issue, you also have too many samples, testB has 2480 and testA has 1315. These need to match, as the model wants to make 2480 predictions, but you only give it 1315 outputs to compare it to. How can you get a score for 1165 missing samples? Do you now see why the data has to be reshaped?
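As a toy illustration of that feature-count constraint (hypothetical random data, not the asker's datasets): a regressor fitted on 5 features stores 5 weights and rejects 6-column input.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
model = LinearRegression().fit(rng.normal(size=(100, 5)), rng.normal(size=100))
print(model.coef_.shape)   # (5,) -- one weight w(i) per training feature

try:
    model.predict(rng.normal(size=(10, 6)))   # 6 features where 5 are expected
except ValueError as err:
    print(err)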
EDIT
Assuming you have datasets with an equal number of features, as discussed above, you may now look at reshaping (removing data from) testB like so:
testB = testB[0:1315, :]
testB.shape
(1315, 5)
Or, if you would prefer a solution using the numpy API:
testB = np.delete(testB, np.s_[0:(len(testB)-len(testA))], axis=0)
testB.shape
(1315, 5)
Keep in mind that when doing this you slice out a number of samples. If those samples matter to you (which they may), it may be better to introduce a pre-processing step to deal with the missing values, namely imputing them. It is also worth noting that the data you are reshaping should be shuffled (unless it already is), as you may otherwise be removing parts of the data the model should be learning about. Neglecting to do this could result in a model that does not generalise as well as you hoped.
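As a generic sketch of the imputation idea mentioned above (toy data with NaNs; SimpleImputer with the mean strategy is just one option among several):
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

imputer = SimpleImputer(strategy='mean')
print(imputer.fit_transform(X))   # NaNs replaced by the respective column means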

How can I improve the cosine similarity of two documents(sentences) in doc2vec model?

I am building an NLP chat application in Python using the gensim library's doc2vec model. I have hard-coded documents and, given a set of training examples, I am testing the model by posing a user question and then finding the most similar documents as a first step. In this case my test question is an exact copy of a document from the training examples.
import gensim
from gensim import models

sentence = models.doc2vec.LabeledSentence(words=[u'sampling', u'what', u'is', u'tell', u'me', u'about'], tags=["SENT_0"])
sentence1 = models.doc2vec.LabeledSentence(words=[u'eligibility', u'what', u'is', u'my', u'limit', u'how', u'much', u'can', u'I', u'claim'], tags=["SENT_1"])
sentence2 = models.doc2vec.LabeledSentence(words=[u'eligibility', u'I', u'am', u'retiring', u'how', u'much', u'can', u'claim', u'have', u'resigned'], tags=["SENT_2"])
sentence3 = models.doc2vec.LabeledSentence(words=[u'what', u'is', u'my', u'eligibility', u'post', u'my', u'promotion'], tags=["SENT_3"])
sentence4 = models.doc2vec.LabeledSentence(words=[u'what', u'is', u'my', u'eligibility', u'post', u'my', u'promotion'], tags=["SENT_4"])

sentences = [sentence, sentence1, sentence2, sentence3, sentence4]

class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        for uid, line in enumerate(open(self.filename)):
            yield LabeledSentence(words=line.split(), labels=['SENT_%s' % uid])

model = models.Doc2Vec(alpha=0.03, min_alpha=.025, min_count=2)
model.build_vocab(sentences)

for epoch in range(30):
    model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
    model.alpha -= 0.002            # decrease the learning rate
    model.min_alpha = model.alpha   # fix the learning rate, no decay

model.save("my_model.doc2vec")
model_loaded = models.Doc2Vec.load('my_model.doc2vec')
print(model_loaded.docvecs.most_similar(["SENT_4"]))
Result:
[('SENT_1', 0.043695494532585144), ('SENT_2', 0.0017897281795740128), ('SENT_0', -0.018954679369926453), ('SENT_3', -0.08253869414329529)]
The similarity of SENT_4 and SENT_3 is only -0.08253869414329529, when it should be 1 since they are exactly the same. How should I improve this accuracy? Is there a specific way of training documents that I am missing?
Word2Vec/Doc2Vec don't work well on toy-sized examples (such as few texts, short texts, and few total words). Many of the desirable properties are only reliably achieved with training sets of millions of words, or tens-of-thousands of documents.
In particular, with only 5 examples, and only a dozen or two words, but 100-dimensions of modeling vectors, the training isn't forced to do the main thing which makes word-vectors/doc-vectors useful: compress representations into dense embeddings, where similar items need to be incrementally nudged near each other in vector space, because there's no way to retain all the original variation in a sort-of-giant-lookup-table. With more dimensions than corpus variation, your identical-tokens SENT_3 and SENT_4 can adopt wildly different doc-vectors, and the model is still large enough to do great on its training task (essentially, 'overfit'), without the desired end-state of similar-texts having similar-vectors being forced.
You can sometimes squeeze a little more meaning out of small datasets with more training iterations, and a much-smaller model (in terms of vector size), but really: these vectors need large, varied datasets to become meaningful.
That's the main issue. Some other inefficiencies or errors in your example code:
Your code doesn't use the class LabeledLineSentence, so there's no need to include it here – it's irrelevant boilerplate. (Also, TaggedDocument is the preferred name for the words+tags document class in recent gensim versions, rather than LabeledSentence.)
Your custom-management of alpha and min_alpha is unlikely to do anything useful. These are best left at their defaults unless you already have something working, understand the algorithm well, and then want to try subtle optimizations.
train() will do its own iterations, so you don't need to call it many times in an outer loop. (This code as written does in its first loop 5 model.iter iterations at alpha values gradually descending from 0.03 to 0.025, then 5 iterations at a fixed alpha of 0.028, then 5 more at 0.026, then 27 more sets of 5 iterations at decreasing alpha, ending on the 30th loop at a fixed alpha of -0.028. That's a nonsense ending value – the learning-rate should never be negative – at the end of a nonsense progression. Even with a big dataset, these 150 iterations, about half happening at negative alpha values, would likely yield weird results.)
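Putting those points together, a minimal sketch of the simpler training pattern (gensim 4.x naming, where TaggedDocument replaces LabeledSentence; the vector_size and epochs values are only illustrative): let train() run all the passes itself and manage alpha internally.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=['what', 'is', 'my', 'eligibility', 'post', 'my', 'promotion'],
                   tags=['SENT_3']),
    TaggedDocument(words=['what', 'is', 'my', 'eligibility', 'post', 'my', 'promotion'],
                   tags=['SENT_4']),
    # ... the rest of the corpus ...
]

model = Doc2Vec(vector_size=50, min_count=2, epochs=40)   # smaller vectors suit a small corpus
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
print(model.dv.most_similar(['SENT_4']))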

how to set the number of features to use in random selection sklearn

I am using the sklearn RandomForest/Bagging classifiers for learning and I am not getting the expected results when compared to the Java/Weka machine learning library.
In Weka, I am learning the model with: a random forest of 10 trees, each constructed while considering 6 random features (setNumFeatures needs to be set; the default is 10 trees).
In sklearn, I am not sure how to specify the number of features to randomly consider while constructing a random forest of 10 trees. This is what I am doing:
rf_classifier = RandomForestClassifier(n_estimators=num_trees, max_features=6)
rf_classifier = rf_classifier.fit(train_file, train_file_label)
for items in rf_classifier.estimators_:
    classifier_list.append(items)
I saw the docs and there is a parameter, max_features, but I am not sure if it serves this purpose. I get this error when I try to calculate voting entropy:
# code to calculate voting entropy for all features (unlabeled data)
vote_count_for_features = list(classifier_list[0].predict(feature_data_arr))
for i in range(1, len(classifier_list)):
    res_temp = []
    res_temp = list(classifier_list[i].predict(feature_data_arr))
    vote_count_for_features = [x + y for x, y in zip(vote_count_for_features, res_temp)]
If I set that parameter to 6, then my code fails with the error message:
Number of features of the model must match the input. Model n_features
is 6 and input n_features is 31
Inputs: a sample set of 1 million records with 31 features. When I run Weka, around 1000 rules are extracted, whereas when I run the same thing through sklearn I get hardly 70 rules.
I am new to Python and sklearn and I am trying to figure out where I am going wrong. (The Weka code has been tested well and gives 95% precision and 80% recall, so I am assuming that's good.)
Note: I have used the sklearn imputer to impute missing values using the 'mean' strategy, whereas Weka has its own ways of handling NaN.
This is what I am trying to achieve: learn a random forest on a sample set, extract the rules, evaluate the rules and then apply them to the bigger set.
Any suggestions or input will really help me debug the issue and solve it quickly.
I think the issue is that the individual trees get confused since they only use 6 features, but you give them 31. You can try to get the prediction to work by setting check_input = False:
list(classifier_list[i].predict(feature_data_arr, check_input = False))
