Learning and using augmented Bayes classifiers in python

Learning and using augmented Bayes classifiers in python - python

I'm trying to use a forest (or tree) augmented Bayes classifier (Original introduction, Learning) in python (preferably python 3, but python 2 would also be acceptable), first learning it (both structure and parameter learning) and then using it for discrete classification and obtaining probabilities for those features with missing data. (This is why just discrete classification and even good naive classifiers are not very useful for me.)
The way my data comes in, I'd love to use incremental learning from incomplete data, but I haven't even found anything doing both of these in the literature, so anything that does structure and parameter learning and inference at all is a good answer.
There seem to be a few very separate and unmaintained python packages that go roughly in this direction, but I haven't seen anything that is moderately recent (for example, I would expect that using pandas for these calculations would be reasonable, but OpenBayes barely uses numpy), and augmented classifiers seem completely absent from anything I have seen.
So, where should I look to save me some work implementing a forest augmented Bayes classifier? Is there a good implementation of Pearl's message passing algorithm in a python class, or would that be inappropriate for an augmented Bayes classifier anyway?
Is there a readable object-oriented implementation for learning and inference of TAN Bayes classifiers in some other language, which could be translated to python?
Existing packages I know of, but found inappropriate are
milk, which does support classification, but not with Bayesian classifiers (and I defitinetly need probabilities for the classification and unspecified features)
pebl, which only does structure learning
scikit-learn, which only learns naive Bayes classifiers
OpenBayes, which has only barely changed since somebody ported it from numarray to numpy and documentation is negligible.
libpgm, which claims to support an even different set of things. According to the main documentation, it does inference, structure and parameter learning. Except there do not seem to be any methods for exact inference.
Reverend claims to be a “Bayesian Classifier”, has negligible documentation, and from looking at the source code I am lead to the conclusion that it is mostly a Spam classifier, according to Robinson's and similar methods, and not a Bayesian classifier.
eBay's bayesian Belief Networks allows to build generic Bayesian networks and implements inference on them (both exact and approximate), which means that it can be used to build a TAN, but there is no learning algorithm in there, and the way BNs are built from functions means implementing parameter learning is more difficult than it might be for a hypothetical different implementation.

I'm afraid there is not an out-of-the-box implementation of Random Naive Bayes classifier (not that I am aware of) because it is still academic matters. The following paper present the method to combine RF and NB classifiers (behind a paywall) : http://link.springer.com/chapter/10.1007%2F978-3-540-74469-6_35
I think you should stick with scikit-learn, which is one of the most popular statistical module for Python (along with NLTK) and which is really well documented.
scikit-learn has a Random Forest module : http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees . There is a submodule which may (I insist of the uncertainty) be used to pipeline towards NB classifier :
RandomTreesEmbedding implements an unsupervised transformation of the
data. Using a forest of completely random trees, RandomTreesEmbedding
encodes the data by the indices of the leaves a data point ends up in.
This index is then encoded in a one-of-K manner, leading to a high
dimensional, sparse binary coding. This coding can be computed very
efficiently and can then be used as a basis for other learning tasks.
The size and sparsity of the code can be influenced by choosing the
number of trees and the maximum depth per tree. For each tree in the
ensemble, the coding contains one entry of one. The size of the coding
is at most n_estimators * 2 ** max_depth, the maximum number of leaves
in the forest.
As neighboring data points are more likely to lie within the same leaf
of a tree, the transformation performs an implicit, non-parametric
density estimation.
And of course there is a out-of-core implementation of Naive Bayes classifier, which can be used incrementally : http://scikit-learn.org/stable/modules/naive_bayes.html
Discrete naive Bayes models can be used to tackle large scale text
classification problems for which the full training set might not fit
in memory. To handle this case both MultinomialNB and BernoulliNB
expose a partial_fit method that can be used incrementally as done
with other classifiers as demonstrated in Out-of-core classification
of text documents.

I was similarly confused as to how to do exact inference with libpgm. However, turns out it is possible. For example (from libpgm docs),
import json
from libpgm.graphskeleton import GraphSkeleton
from libpgm.nodedata import NodeData
from libpgm.discretebayesiannetwork import DiscreteBayesianNetwork
from libpgm.tablecpdfactorization import TableCPDFactorization
# load nodedata and graphskeleton
nd = NodeData()
skel = GraphSkeleton()
nd.load("../tests/unittestdict.txt")
skel.load("../tests/unittestdict.txt")
# toporder graph skeleton
skel.toporder()
# load evidence
evidence = dict(Letter='weak')
query = dict(Grade='A')
# load bayesian network
bn = DiscreteBayesianNetwork(skel, nd)
# load factorization
fn = TableCPDFactorization(bn)
# calculate probability distribution
result = fn.condprobve(query, evidence)
# output
print json.dumps(result.vals, indent=2)
print json.dumps(result.scope, indent=2)
print json.dumps(result.card, indent=2)
print json.dumps(result.stride, indent=2)
To get the example to work, here is the datafile (I replaced None with null and saved as a .json).
I know this is quite late to the game, but this was the best post I found when searching for a resource to do Bayesian networks with Python. I thought I'd answer in case anyone else is looking for this. (Sorry, would have commented, but just signed up for SO to answer this and rep isn't high enough.)

R's bnlearn has implementations for both Naive Bayes and Tree-augmented Naive Bayes classifiers. You can use rpy2 to port these to Python.
http://cran.r-project.org/web/packages/bnlearn/bnlearn.pdf

There seems to be no such thing yet.
The closest thing currently seems to be eBay's open source implementation bayesian of Belief Networks. It implements inference (two exact ways, and approximate), which means that it can be used to build a TAN. An example (at the moment still an ugly piece of spaghetti code) for that can be found in my open20q repository.
Advantages:
It works.
That is, I now have an implementation of TAN inference, based on bayesian belief network inference.
With Apache 2.0 and 3-clause BSD style licenses respectively, it is legally possible to combine bayesian code and libpgm code to try to get inference and learning to work.
Disadvantages:
There is no learning whatsoever in bayesian. Trying to combine something like libpgm learning with bayesian classes and inference will be a challenge.
Even more so as bayesian assumes that nodes are given by factors which are fixed python functions. Parameter learning requires some wrapping code to enable tweaking the probabilities.
bayesian is written in pure python, using dicts etc. as basic structures, not making use of any speedup numpy, pandas or similar packages might bring, and is therefore quite slow even for the tiny example I build.

I know it's a bit late in the day, but the Octave forge NaN package might be of interest to you. One of the classifiers in this package is an Augmented Naive Bayesian Classifier. The code is GPL'ed so you could easily port it to Python.

Related

What's the difference between scikit-learn and tensorflow? Is it possible to use them together?

I cannot get a satisfying answer to this question. As I understand it, TensorFlow is a library for numerical computations, often used in deep learning applications, and Scikit-learn is a framework for general machine learning.
But what is the exact difference between them, what is the purpose and function of TensorFlow? Can I use them together, and does it make any sense?

Your understanding is pretty much spot on, albeit very, very basic. TensorFlow is more of a low-level library. Basically, we can think of TensorFlow as the Lego bricks (similar to NumPy and SciPy) that we can use to implement machine learning algorithms whereas Scikit-Learn comes with off-the-shelf algorithms, e.g., algorithms for classification such as SVMs, Random Forests, Logistic Regression, and many, many more. TensorFlow really shines if we want to implement deep learning algorithms, since it allows us to take advantage of GPUs for more efficient training. TensorFlow is a low-level library that allows you to build machine learning models (and other computations) using a set of simple operators, like “add”, “matmul”, “concat”, etc.
Makes sense so far?
Scikit-Learn is a higher-level library that includes implementations of several machine learning algorithms, so you can define a model object in a single line or a few lines of code, then use it to fit a set of points or predict a value.
Tensorflow is mainly used for deep learning while Scikit-Learn is used for machine learning.
Here is a link that shows you how to do Regression and Classification using TensorFlow. I would highly suggest downloading the data sets and running the code yourself.
https://stackabuse.com/tensorflow-2-0-solving-classification-and-regression-problems/
Of course, you can do many different kinds of Regression and Classification using Scikit-Learn, without TensorFlow. I would suggesting reading through the Scikit-Learn documentation when you have a chance.
https://scikit-learn.org/stable/user_guide.html
It's going to take a while to get through everything, but if yo make it to the end, you will have learned a ton!!! Finally, you can get the 2,600+ page user guide for Scikit-Learn from the link below.
https://scikit-learn.org/stable/_downloads/scikit-learn-docs.pdf

The Tensorflow is a library for constructing Neural Networks. The scikit-learn contains ready to use algorithms. The TF can work with a variety of data types: tabular, text, images, audio. The scikit-learn is intended to work with tabular data.
Yes, you can use both packages. But if you need only classic Multi-Layer implementation then the MLPClassifier and MLPRegressor available in scikit-learn is a very good choice. I have run a comparison of MLP implemented in TF vs Scikit-learn and there weren't significant differences and scikit-learn MLP works about 2 times faster than TF on CPU. You can read the details of the comparison in my blog post.
Below the scatter plots of performance comparison:

Both are 3rd party machine learning modules, and both are good at it.
Tensorflow is the more popular of the two.
Tensorflow is typically used more in Deep Learning and Neural Networks.
SciKit learn is more general Machine Learning.
And although I don't think I've come across anyone using both simultaneously, no one is saying you can't.

Machine Learning detecting random string

I do apologies in advance if something similar has been posted but from the research I've done I can't find anything specific.
I'm currently looking at http://scikit-learn.org and the content here looks great but I'm confused what type I should be using for my problem.
I want to able to have 2 labels.
**Suspicious**
1hbn34uqrup7a13t
qmr30zoyswr21cdxolg
1qmqnbetqx
**Not-Suspicious**
cheesemix
reg526
animato12
What type of machine learning algorithm could I feed the data in above as to teach it what I'd class as suspicious through supervised learning?
I'm leaning towards classification but there are so many models to choose from my slightly lost.

The first step in such machine learning problems is to think about the "features". You can't use e.g. a linear classifier directly on these strings. Thus, you have to extract some meaningful features that describe the string. In computer vision, these features are often edges, corner points, SIFT features. You basically have to options:
Design features yourself.
Learn the features.
1) This is the "classical" machine learning approach: you manually design a list of representative features, which you can extract from your input data. In your case, you could start with e.g.
length of the string
number of different characters
number of special characters
something about the sorting?
...
That will give you a vector of numbers for each string. Now, you can use any of the classifiers from scikit-learn to classify the data. You can start choosing your algorithm with the help of this flowchart. You should start with a simple model, e.g. a linear model (e.g. linear SVM). If performance is not sufficient, use a more complex model (e.g. SVM with kernels), or rethink your choice of features.
2) This is the "modern" approach, which is gaining more and more popularity. Designing the features is a crucial step in 1) and it requires good knowledge of your data. Now, by using a deep neural network, you can feed your raw data (the string) into the network, and let the network learn such "features" itself. This, however, requires a large amount of labeled training data, and a lot of processing power (GPUs).
LSTM networks are todays state-of-the-art in natural language processing and similar tasks. LSTMs would be well suited to your tasks, as the input can be of variable length.
tl;dr: Either design features yourself and use a classifier of your choice, or dive into deep neural networks and let a network learn both the features and the classification.

Signal feature identification

I'm am trying to identify phonemes in voices using a training database of known ones.
I'm wondering if there is a way of identifying common features within my training sample and using that to classify a new one.
It seems like there are two paths:
Give the process raw/normalised data and it will return similar ones
Extract certain metrics such as pitch, formants etc and compare to training set
My interest is the first!
Any recommendations on machine learning or regression methods/algorithms?

Since you tagged Python, I highly recommend looking into scikit-learn, an excellent Python library for Machine Learning. Their docs are very thorough, and should give you a good crash course in Machine Learning algorithms and implementation (including classification, regression, clustering, etc)

Your points 1 and 2 are not very different: 1) is the end results of a classification problem 2) is the feature that you give for classification. What you need is a good classifier (SVM, decision trees, hierarchical classifiers etc.) and a good set of features (pitch, formants etc. that you mentioned).

Random Forest interpretation in scikit-learn

I am using scikit-learn's Random Forest Regressor to fit a random forest regressor on a dataset. Is it possible to interpret the output in a format where I can then implement the model fit without using scikit-learn or even Python?
The solution would need to be implemented in a microcontroller or maybe even an FPGA. I am doing analysis and learning in Python but want to implement on a uC or FPGA.

You can check out graphviz, which uses 'dot language' for storing models (which is quite human-readable if you'd want to build some custom interpreter, shouldn't be hard). There is an export_graphviz function in scikit-learn. You can load and process the model in C++ through boost library read_graphviz method or some of other custom interpreters available.

You could trying extracting rules from the tree ensemble model and implement the rules in hardware.
You can use TE2Rules (Tree Ensembles to Rules) to extract human understandable rules to explain a scikit tree ensemble (like GradientBoostingClassifier). It provides levers to control interpretability, fidelity and run time budget to extract useful explanations. Rules extracted by TE2Rules are guaranteed to closely approximate the tree ensemble, by considering the joint interactions of multiple trees in the ensemble.
References:
TE2Rules: You can find the code: https://github.com/linkedin/TE2Rules and documentation: https://te2rules.readthedocs.io/en/latest/ here.
Disclosure: I'm one of the core developers of TE2Rules.

It's unclear what you mean by this part:
Now, that I have the results, is it possible to interpret this in some format where I can then implement the fit without using sklearn or even python?
Implement the fitting process for a given dataset? tree topology? choice of parameters?
As to 'implement... without using sklearn or python', did you mean 'port the bytecode or binary' or 'clean-code a totally new implementation'?
Assuming you meant the latter, I'd suggest GPU rather than FPGA or uC.

Support Vector Regression with High Dimensional Output using python's libsvm

I would like to ask if anyone has an idea or example of how to do support vector regression in python with high dimensional output( more than one) using a python binding of libsvm? I checked the examples and they are all assuming the output to be one dimensional.

libsvm might not be the best tool for this task.
The problem you describe is called multivariate regression, and usually for regression problems, SVM's are not necessarily the best choice.
You could try something like group lasso (http://www.di.ens.fr/~fbach/grouplasso/index.htm - matlab) or sparse group lasso (http://spams-devel.gforge.inria.fr/ - seems to have a python interface), which solve the multivariate regression problem with different types of regularization.

Support Vector Machines as a mathematical framework is formulated in terms of a single prediction variable. Hence most libraries implementing them will reflect this as using one single target variable in their API.
What you could do is train a single SVM model for each target dimension in your data.
on the plus side, you can train them in // on a cluster as each model is independent of one another
on the minus side, sub-models will share nothing and won't benefit from what they individually discovered in the structure of the input data and potentially need a lot of memory to store the model as they will have no shared intermediate representation
Variant of SVMs can probably be devised in a multi-task learning setting to learn some common kernel-based intermediate representation suitable for reuse to predict multi-dimensional targets however this is not implemented in libsvm AFAIK. Google for multi task learning SVM if you want to learn more.
Alternatively, multi-layer perceptrons (a kind of feed forward neural networks) can naturally deal with multi-dimensional outcomes and hence should be better at sharing intermediate representations of the data reused across targets, especially if they are deep enough with the first layers pre-trained in an unsupervised manner using an autoencoder objective function.
You might want to have a look at http://deeplearning.net/tutorial/ for a nice introduction to various neural network architectures and practical tools and examples to implement them efficiently.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.