unbalanced dataset, anomalies have same distribution as normal data

unbalanced dataset, anomalies have same distribution as normal data - python

I worked with a dataset which contains 2 classes (95%, 5%).
And the features of these 2 classes have almost the same distribution.
Question is: How can I classify these 2 classes and explain which principal the model uses to classify the test set?

Actually the distribution of features makes sense, but you have to make more detailed exploratory analysis than simple distribution of features. I suggest to have a look some 3D plots. You can have a look at some links about EDA:
https://www.kaggle.com/dejavu23/titanic-eda-to-ml-beginner
https://www.kaggle.com/dejavu23/house-prices-eda-to-ml-beginner
Regarding to classification models, I would suggest to have use Decision Tree based models, such as Random Forest or Gradient Tree Boosting.
The idea behind Decision Tree is partition of feature space and making the same prediction for each part of it. You can plot Decision Trees, using some packages and it will help to understand principles behind the model. You can read more about all these models in the nice book:
http://www-bcf.usc.edu/~gareth/ISL/
Links to packages:
https://lightgbm.readthedocs.io/en/latest/
https://scikit-learn.org/stable/modules/tree.html
https://scikit-learn.org/stable/modules/ensemble.html
You can read about decision tree visualization:
https://medium.com/#rnbrown/creating-and-visualizing-decision-trees-with-python-f8e8fa394176
https://www.kaggle.com/willkoehrsen/visualize-a-decision-tree-w-python-scikit-learn

Related

What's a base model and why/when do we go for other ML algorithms

Let's assume we're dealing with continuous features and responses. We fit a linear regression model (let's say first order) and after CV we get a somehow good r^2 (let's say r^2=0.8).
Why do we go for other ML algorithms? I've read some research papers and they were trying different ML algorithms and taking the simple linear model as a base model for comparison. In these papers, the linear model outperformed other algorithms, what I have difficulty understanding is why do we go for other ML algorithms then? Why can't we just be satisfied with the linear model especially in the specific case where other algorithms perform poorly?
The other question is what do they gain from presenting the other algorithms in their research papers if these algorithms performed poorly?

The Best model for solving predictive problems such as continuous output is the regression model especially if you do it using a Neural network (polynomial or linear) with hyperparameter tuning based on the problem.
Using other ML algorithms such as Decision Tree or SVM or any other model where their main goal is classification but on the paper, they say it can do regression also in fact, they can't predict any new values.
but in the field of research people always try to find a better way to predict values other than regression, like in the classification world we start with Logistic regression -> decision tree and now we have SVM and ensembling models and DeepLearning.

I think the answer is because you never know.
especially in the specific case where other algorithms perform poorly?
You know they performed poorly because someone tried dose models. It's always worthy trying various models.

How to classify unlabelled data?

I am new to Machine Learning. I am trying to build a classifier that classifies the text as having a url or not having a url. The data is not labelled. I just have textual data. I don't know how to proceed with it. Any help or examples is appreciated.

Since it's text, you can use bag of words technique to create vectors.
You can use cosine similarity to cluster the common type text.
Then use classifier, which would depend on number of clusters.
This way you have a labeled training set.
If you have two cluster, binary classifier like logistic regression would work.
If you have multiple classes, you need to train model based on multinomial logistic regression
or train multiple logistic models using One vs Rest technique.
Lastly, you can test your model using k-fold cross validation.

You cannot train a classifier with unlabeled data. You need labeled examples. There are services that will label it for you, but it might be simpler for you to do it by hand (I assume you can go through one per minute).
Stack Overflow is for programming; this question would be better suited in, say, Cross-Validated. Maybe they'll have better suggestions than me.
After you've labeled the data, there's a lot of info on the web on this subject - for example, this blog is a good place to start if you already have some grip on the issue.
Good luck!

What is the difference between xgboost, extratreeclassifier, and randomforrestclasiffier?

I am new to all these methods and am trying to get a simple answer to that or perhaps if someone could direct me to a high level explanation somewhere on the web. My googling only returned kaggle sample codes.
Are the extratree and randomforrest essentially the same? And xgboost uses boosting when it chooses the features for any particular tree i.e. sampling the features. But then how do the other two algorithms select the features?
Thanks!

Extra-trees(ET) aka. extremely randomized trees is quite similar to random forest (RF). Both methods are bagging methods aggregating some fully grow decision trees. RF will only try to split by e.g. a third of features, but evaluate any possible break point within these features and pick the best. However, ET will only evaluate a random few break points and pick the best of these. ET can bootstrap samples to each tree or use all samples. RF must use bootstrap to work well.
xgboost is an implementation of gradient boosting and can work with decision trees, typical smaller trees. Each tree is trained to correct the residuals of previous trained trees. Gradient boosting can be more difficult to train, but can achieve a lower model bias than RF. For noisy data bagging is likely to be most promising. For low noise and complex data structures boosting is likely to be most promising.

Signal feature identification

I'm am trying to identify phonemes in voices using a training database of known ones.
I'm wondering if there is a way of identifying common features within my training sample and using that to classify a new one.
It seems like there are two paths:
Give the process raw/normalised data and it will return similar ones
Extract certain metrics such as pitch, formants etc and compare to training set
My interest is the first!
Any recommendations on machine learning or regression methods/algorithms?

Since you tagged Python, I highly recommend looking into scikit-learn, an excellent Python library for Machine Learning. Their docs are very thorough, and should give you a good crash course in Machine Learning algorithms and implementation (including classification, regression, clustering, etc)

Your points 1 and 2 are not very different: 1) is the end results of a classification problem 2) is the feature that you give for classification. What you need is a good classifier (SVM, decision trees, hierarchical classifiers etc.) and a good set of features (pitch, formants etc. that you mentioned).

Learning and using augmented Bayes classifiers in python

I'm trying to use a forest (or tree) augmented Bayes classifier (Original introduction, Learning) in python (preferably python 3, but python 2 would also be acceptable), first learning it (both structure and parameter learning) and then using it for discrete classification and obtaining probabilities for those features with missing data. (This is why just discrete classification and even good naive classifiers are not very useful for me.)
The way my data comes in, I'd love to use incremental learning from incomplete data, but I haven't even found anything doing both of these in the literature, so anything that does structure and parameter learning and inference at all is a good answer.
There seem to be a few very separate and unmaintained python packages that go roughly in this direction, but I haven't seen anything that is moderately recent (for example, I would expect that using pandas for these calculations would be reasonable, but OpenBayes barely uses numpy), and augmented classifiers seem completely absent from anything I have seen.
So, where should I look to save me some work implementing a forest augmented Bayes classifier? Is there a good implementation of Pearl's message passing algorithm in a python class, or would that be inappropriate for an augmented Bayes classifier anyway?
Is there a readable object-oriented implementation for learning and inference of TAN Bayes classifiers in some other language, which could be translated to python?
Existing packages I know of, but found inappropriate are
milk, which does support classification, but not with Bayesian classifiers (and I defitinetly need probabilities for the classification and unspecified features)
pebl, which only does structure learning
scikit-learn, which only learns naive Bayes classifiers
OpenBayes, which has only barely changed since somebody ported it from numarray to numpy and documentation is negligible.
libpgm, which claims to support an even different set of things. According to the main documentation, it does inference, structure and parameter learning. Except there do not seem to be any methods for exact inference.
Reverend claims to be a “Bayesian Classifier”, has negligible documentation, and from looking at the source code I am lead to the conclusion that it is mostly a Spam classifier, according to Robinson's and similar methods, and not a Bayesian classifier.
eBay's bayesian Belief Networks allows to build generic Bayesian networks and implements inference on them (both exact and approximate), which means that it can be used to build a TAN, but there is no learning algorithm in there, and the way BNs are built from functions means implementing parameter learning is more difficult than it might be for a hypothetical different implementation.

I'm afraid there is not an out-of-the-box implementation of Random Naive Bayes classifier (not that I am aware of) because it is still academic matters. The following paper present the method to combine RF and NB classifiers (behind a paywall) : http://link.springer.com/chapter/10.1007%2F978-3-540-74469-6_35
I think you should stick with scikit-learn, which is one of the most popular statistical module for Python (along with NLTK) and which is really well documented.
scikit-learn has a Random Forest module : http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees . There is a submodule which may (I insist of the uncertainty) be used to pipeline towards NB classifier :
RandomTreesEmbedding implements an unsupervised transformation of the
data. Using a forest of completely random trees, RandomTreesEmbedding
encodes the data by the indices of the leaves a data point ends up in.
This index is then encoded in a one-of-K manner, leading to a high
dimensional, sparse binary coding. This coding can be computed very
efficiently and can then be used as a basis for other learning tasks.
The size and sparsity of the code can be influenced by choosing the
number of trees and the maximum depth per tree. For each tree in the
ensemble, the coding contains one entry of one. The size of the coding
is at most n_estimators * 2 ** max_depth, the maximum number of leaves
in the forest.
As neighboring data points are more likely to lie within the same leaf
of a tree, the transformation performs an implicit, non-parametric
density estimation.
And of course there is a out-of-core implementation of Naive Bayes classifier, which can be used incrementally : http://scikit-learn.org/stable/modules/naive_bayes.html
Discrete naive Bayes models can be used to tackle large scale text
classification problems for which the full training set might not fit
in memory. To handle this case both MultinomialNB and BernoulliNB
expose a partial_fit method that can be used incrementally as done
with other classifiers as demonstrated in Out-of-core classification
of text documents.

I was similarly confused as to how to do exact inference with libpgm. However, turns out it is possible. For example (from libpgm docs),
import json
from libpgm.graphskeleton import GraphSkeleton
from libpgm.nodedata import NodeData
from libpgm.discretebayesiannetwork import DiscreteBayesianNetwork
from libpgm.tablecpdfactorization import TableCPDFactorization
# load nodedata and graphskeleton
nd = NodeData()
skel = GraphSkeleton()
nd.load("../tests/unittestdict.txt")
skel.load("../tests/unittestdict.txt")
# toporder graph skeleton
skel.toporder()
# load evidence
evidence = dict(Letter='weak')
query = dict(Grade='A')
# load bayesian network
bn = DiscreteBayesianNetwork(skel, nd)
# load factorization
fn = TableCPDFactorization(bn)
# calculate probability distribution
result = fn.condprobve(query, evidence)
# output
print json.dumps(result.vals, indent=2)
print json.dumps(result.scope, indent=2)
print json.dumps(result.card, indent=2)
print json.dumps(result.stride, indent=2)
To get the example to work, here is the datafile (I replaced None with null and saved as a .json).
I know this is quite late to the game, but this was the best post I found when searching for a resource to do Bayesian networks with Python. I thought I'd answer in case anyone else is looking for this. (Sorry, would have commented, but just signed up for SO to answer this and rep isn't high enough.)

R's bnlearn has implementations for both Naive Bayes and Tree-augmented Naive Bayes classifiers. You can use rpy2 to port these to Python.
http://cran.r-project.org/web/packages/bnlearn/bnlearn.pdf

There seems to be no such thing yet.
The closest thing currently seems to be eBay's open source implementation bayesian of Belief Networks. It implements inference (two exact ways, and approximate), which means that it can be used to build a TAN. An example (at the moment still an ugly piece of spaghetti code) for that can be found in my open20q repository.
Advantages:
It works.
That is, I now have an implementation of TAN inference, based on bayesian belief network inference.
With Apache 2.0 and 3-clause BSD style licenses respectively, it is legally possible to combine bayesian code and libpgm code to try to get inference and learning to work.
Disadvantages:
There is no learning whatsoever in bayesian. Trying to combine something like libpgm learning with bayesian classes and inference will be a challenge.
Even more so as bayesian assumes that nodes are given by factors which are fixed python functions. Parameter learning requires some wrapping code to enable tweaking the probabilities.
bayesian is written in pure python, using dicts etc. as basic structures, not making use of any speedup numpy, pandas or similar packages might bring, and is therefore quite slow even for the tiny example I build.

I know it's a bit late in the day, but the Octave forge NaN package might be of interest to you. One of the classifiers in this package is an Augmented Naive Bayesian Classifier. The code is GPL'ed so you could easily port it to Python.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.