Using sklearn and Python for a large application classification/scraping exercise - python

I am working on a relatively large text-based web classification problem and I am planning on using the multinomial Naive Bayes classifier in sklearn in python and the scrapy framework for the crawling. However, I am a little concerned that sklearn/python might be too slow for a problem that could involve classifications of millions of websites. I have already trained the classifier on several thousand websites from DMOZ.
The research framework is as follows:
1) The crawler lands on a domain name and scrapes the text from 20 links on the site (with depth no larger than one). (The number of tokenized words here seems to vary from a few thousand up to 150K for a sample run of the crawler.)
2) Run the sklearn multinomial NB classifier with around 50,000 features and record the domain name depending on the result
My question is whether a Python-based classifier would be up to the task for such a large-scale application, or should I try re-writing the classifier (and maybe the scraper and word tokenizer as well) in a faster environment? If so, what might that environment be?
Or perhaps Python is enough if accompanied with some parallelization of the code?
Thanks

Use the HashingVectorizer and one of the linear classification modules that support the partial_fit API (for instance SGDClassifier, Perceptron or PassiveAggressiveClassifier) to incrementally learn the model without having to vectorize and load all the data in memory upfront. You should not have any issue learning a classifier on hundreds of millions of documents with hundreds of thousands of (hashed) features.
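As a minimal out-of-core sketch of that idea (not taken from the answer): iter_batches() below is a placeholder for whatever streams (texts, labels) chunks out of your crawler, and the class labels are made up for illustration.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2 ** 20)   # stateless: no fit, no vocabulary held in memory
clf = SGDClassifier(loss="hinge")
all_classes = ["relevant", "irrelevant"]             # every label must be declared up front

for texts, labels in iter_batches():                 # placeholder generator yielding (texts, labels)
    X = vectorizer.transform(texts)                  # sparse hashed feature matrix for this chunk
    clf.partial_fit(X, labels, classes=all_classes)  # incremental update, one chunk at a time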
You should however load a small subsample that fits in memory (e.g. 100k documents) and grid search good parameters for the vectorizer using a Pipeline object and the RandomizedSearchCV class of the master branch. You can also fine tune the value of the regularization parameter (e.g. C for PassiveAggressiveClassifier or alpha for SGDClassifier) using the same RandomizedSearchCV on a larger, pre-vectorized dataset that fits in memory (e.g. a couple of million documents).
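For illustration, a hedged sketch of what that tuning step could look like (in current scikit-learn, RandomizedSearchCV lives in sklearn.model_selection; small_texts and small_labels are placeholders for your in-memory subsample):

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("vec", HashingVectorizer(n_features=2 ** 18)),
    ("clf", SGDClassifier(loss="hinge")),
])
param_dist = {
    "vec__ngram_range": [(1, 1), (1, 2)],        # tune the vectorizer...
    "clf__alpha": np.logspace(-7, -2, 6),        # ...and the regularization strength
}
search = RandomizedSearchCV(pipe, param_dist, n_iter=10, cv=3, n_jobs=-1)
search.fit(small_texts, small_labels)            # placeholders: the ~100k document subsample
print(search.best_params_)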
Also linear models can be averaged (average the coef_ and intercept_ of 2 linear models) so that you can partition the dataset, learn linear models independently and then average the models to get the final model.
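A rough sketch of that averaging trick, assuming two SGDClassifier models trained on disjoint partitions (X_part_a/y_part_a and X_part_b/y_part_b are placeholders); whether predict() is happy with hand-set attributes can depend on the scikit-learn version, so treat this as an outline rather than a guaranteed recipe:

from sklearn.linear_model import SGDClassifier

clf_a = SGDClassifier(loss="hinge").fit(X_part_a, y_part_a)   # model for partition A
clf_b = SGDClassifier(loss="hinge").fit(X_part_b, y_part_b)   # model for partition B

avg = SGDClassifier(loss="hinge")
avg.coef_ = (clf_a.coef_ + clf_b.coef_) / 2.0                 # average the weights
avg.intercept_ = (clf_a.intercept_ + clf_b.intercept_) / 2.0  # average the intercepts
avg.classes_ = clf_a.classes_                                 # predict() also needs the class labels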

Fundamentally, if you rely on numpy, scipy, and sklearn, Python will not be a bottleneck as most critical portions of those libraries are implemented as C-extensions.
But, since you're scraping millions of sites, you're going to be bounded by your single machine's capabilities. I would consider using a service like PiCloud [1] or Amazon Web Services (EC2) to distribute your workload across many servers.
An example would be to funnel your scraping through Cloud Queues [2].
[1] http://www.picloud.com
[2] http://blog.picloud.com/2013/04/03/introducing-queues-creating-a-pipeline-in-the-cloud/

Related

What method should be used to tag specific texts, when the dataset is too small for training a model?

My problem is the following:
The task is to use a Machine Learning Algorithm to perform tagging on texts. This would be simple enough if those texts were about common topics where large datasets for training are readily available.
But these texts describe company KPI's and Products. The tagging should improve search inside the upcoming Data Catalog.
What complicates the matter: the amount of already human-tagged texts is very small (around 50)
So I was searching for a method to train a model on a dataset that has nothing to do with our dataset since it is highly specialized.
Is there a way to adapt pre-trained text classifiers to our data?
I'm fairly new to NLP/Machine Learning since I recently started studying this field. I mainly program in Python. If a solution in Python is available, it would be great!
Any help would be much appreciated!

Spacy train ner using multiprocessing

I am trying to train a custom NER model using spaCy. Currently I have more than 2k records for training, each text consists of more than 100 words, and there are at least 2 entities per record. I am running it for 50 iterations.
It is taking more than 2 hours to train completely.
Is there any way to train using multiprocessing? Will it improve the training time?
Short answer... probably not
It's very unlikely that you will be able to get this to work for a few reasons:
The network being trained is performing iterative optimization
Without knowing the results from the batch before, the next batch cannot be optimized
There is only a single network
Any parallel training would be creating divergent networks...
...which you would then somehow have to merge
Long answer... there's plenty you can do!
There are a few different things you can try however:
Get GPU training working if you haven't
It's a pain, but can speed up training time a bit
It will dramatically lower CPU usage however
Try to use spaCy command line tools
The JSON format is a pain to produce but...
The benefit is you get a well optimised algorithm written by the experts
It can have dramatically faster / better results than hand crafted methods
If you have different entities, you can train multiple specialised networks
Each of these may train faster
These networks could be done in parallel to each other (CPU permitting)
Optimise your python and experiment with parameters
Speed and quality are very dependent on parameter tweaking (batch size, repetitions, etc.)
Also check the Python code that provides the batches (make sure this is top notch)
Pre-process your examples
spaCy NER extraction requires a surprisingly small amount of context to work
You could try pre-processing your snippets to contain 10 or 15 surrounding words and see how your time and accuracy fare (see the rough sketch after this list)
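By way of illustration only, here is one way such trimming could look in plain Python, assuming training examples in the common (text, {"entities": [(start, end, label)]}) character-offset format:

def trim_example(text, entities, window=15):
    # Keep `window` words of context on either side of the span covering all entities.
    first = min(s for s, _, _ in entities)
    last = max(e for _, e, _ in entities)
    left_ctx = " ".join(text[:first].split()[-window:])
    right_ctx = " ".join(text[last:].split()[:window])
    prefix = left_ctx + " " if left_ctx else ""
    snippet = prefix + text[first:last] + ((" " + right_ctx) if right_ctx else "")
    shift = first - len(prefix)
    # Re-base the character offsets so they point into the trimmed snippet.
    new_ents = [(s - shift, e - shift, lbl) for s, e, lbl in entities]
    return snippet, {"entities": new_ents}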
Final thoughts... when is your network "done"?
I have trained networks with many entities on thousands of examples longer than specified and the long and short is, sometimes it takes time.
However 90% of the increase in performance is captured in the first 10% of training.
Do you need to wait for 50 batches?
... or are you looking for a specific level of performance?
If you monitor the quality every X batches, you can bail out when you hit a pre-defined level of quality.
You can also keep old networks you have trained on previous batches and then "top them up" with new training to get to a level of performance you couldn't by starting from scratch in the same time.
Good luck!
Hi, I did the same kind of project: I created a custom NER model using spaCy 3 and extracted 26 entities from a large dataset. It really depends on how you are passing your data. Follow the steps below; it might work on CPU:
Annotate your text files and save into JSON
Convert your JSON files into the .spacy format, because this is the format spaCy accepts (a conversion sketch is shown after these steps).
Now, the point to note is how you pass and serialize your .spacy data into spaCy Doc objects.
Passing all your JSON text at once will take more time in training, so split your data and pass it iteratively. Don't pass one consolidated blob; split it.
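A rough sketch of step 2, not from the answer: converting annotations to .spacy with spaCy 3's DocBin, assuming the usual (text, {"entities": [(start, end, label)]}) format and a placeholder training_data list:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin()
for text, ann in training_data:                      # placeholder: your annotated examples
    doc = nlp.make_doc(text)
    spans = []
    for start, end, label in ann["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is not None:                         # skip annotations that do not align to tokens
            spans.append(span)
    doc.ents = spans
    db.add(doc)
db.to_disk("./train.spacy")                          # feed this file to the spacy train CLI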

Save Naive Bayes Classifier in memory

I am new to NLTK and machine learning. I'm using Python with the NLTK Naive Bayes classifier. I have created a Naive Bayes classifier for text classification using NLTK and saved it to disk. I am also able to load it when needed to classify some test data by using this Python code:
import pickle
# open in binary mode: pickled classifiers must be read as bytes
f = open('classifier.pickle', 'rb')
classifier = pickle.load(f)
f.close()
But my problem is that whenever new test data comes in, I have to load this classifier into memory again and again, which takes a lot of time (2-3 minutes) because of its large size. Also, if I have to run two instances of the same sentiment analysis program, that will take double the RAM, as both programs will load this classifier separately. My question is: Is there any technique to store this classifier in memory so that the sentiment analysis programs can read it directly from memory whenever needed, or is there any other method through which the load time of the classifier can be minimized?
Thanks in advance for your help.
You can't have it both ways. You can either keep pickling/unpickling one at a time to use less RAM, or you can store both in memory, using twice as much RAM but reducing load times and disk I/O wait times.
Are the two classifiers trained on different training data, or are you using the same classifier in parallel? It sounds like the latter from your mention of "two instances"; in that case you may want to look into threading, so that the same in-memory classifier can work on two sets of data (some parallelism can be achieved by classifying part of the data, then doing other work such as results processing while the other thread classifies, and repeating).
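As a hedged sketch of that threading idea (batch_a and batch_b are placeholders for your two lists of feature dicts): the classifier is loaded once and shared, so the 2-3 minute load happens only at startup and the RAM cost is paid once. Note that with pure-Python NLTK code the GIL limits true CPU parallelism, so the gain is mostly in memory and in overlapping classification with other work.

import pickle
from concurrent.futures import ThreadPoolExecutor

with open('classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)                      # loaded a single time, shared by both threads

def classify_batch(featuresets):
    # classify() only reads the model, so concurrent calls are safe
    return [classifier.classify(fs) for fs in featuresets]

with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(classify_batch, [batch_a, batch_b]))  # placeholders: two lists of feature dicts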
My expertise in this comes from having started an open source NLTK based sentiment analysis system: https://bitbucket.org/tommyjcarpenter/evopminer.

Learning and using augmented Bayes classifiers in python

I'm trying to use a forest (or tree) augmented Bayes classifier (Original introduction, Learning) in python (preferably python 3, but python 2 would also be acceptable), first learning it (both structure and parameter learning) and then using it for discrete classification and obtaining probabilities for those features with missing data. (This is why just discrete classification and even good naive classifiers are not very useful for me.)
The way my data comes in, I'd love to use incremental learning from incomplete data, but I haven't even found anything doing both of these in the literature, so anything that does structure and parameter learning and inference at all is a good answer.
There seem to be a few very separate and unmaintained python packages that go roughly in this direction, but I haven't seen anything that is moderately recent (for example, I would expect that using pandas for these calculations would be reasonable, but OpenBayes barely uses numpy), and augmented classifiers seem completely absent from anything I have seen.
So, where should I look to save me some work implementing a forest augmented Bayes classifier? Is there a good implementation of Pearl's message passing algorithm in a python class, or would that be inappropriate for an augmented Bayes classifier anyway?
Is there a readable object-oriented implementation for learning and inference of TAN Bayes classifiers in some other language, which could be translated to python?
Existing packages I know of, but found inappropriate are
milk, which does support classification, but not with Bayesian classifiers (and I definitely need probabilities for the classification and unspecified features)
pebl, which only does structure learning
scikit-learn, which only learns naive Bayes classifiers
OpenBayes, which has barely changed since somebody ported it from numarray to numpy, and whose documentation is negligible.
libpgm, which claims to support an even different set of things. According to the main documentation, it does inference, structure and parameter learning. Except there do not seem to be any methods for exact inference.
Reverend claims to be a “Bayesian Classifier”, has negligible documentation, and from looking at the source code I am led to the conclusion that it is mostly a spam classifier, following Robinson's and similar methods, and not a Bayesian classifier.
eBay's bayesian Belief Networks allows you to build generic Bayesian networks and implements inference on them (both exact and approximate), which means that it can be used to build a TAN, but there is no learning algorithm in there, and the way BNs are built from functions means implementing parameter learning is more difficult than it might be for a hypothetical different implementation.
I'm afraid there is no out-of-the-box implementation of a Random Naive Bayes classifier (not that I am aware of), because it is still an academic matter. The following paper presents a method to combine RF and NB classifiers (behind a paywall): http://link.springer.com/chapter/10.1007%2F978-3-540-74469-6_35
I think you should stick with scikit-learn, which is one of the most popular statistical modules for Python (along with NLTK) and which is really well documented.
scikit-learn has a Random Forest module: http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees . There is a submodule which may (I stress the uncertainty) be used to pipeline towards an NB classifier:
RandomTreesEmbedding implements an unsupervised transformation of the data. Using a forest of completely random trees, RandomTreesEmbedding encodes the data by the indices of the leaves a data point ends up in. This index is then encoded in a one-of-K manner, leading to a high dimensional, sparse binary coding. This coding can be computed very efficiently and can then be used as a basis for other learning tasks. The size and sparsity of the code can be influenced by choosing the number of trees and the maximum depth per tree. For each tree in the ensemble, the coding contains one entry of one. The size of the coding is at most n_estimators * 2 ** max_depth, the maximum number of leaves in the forest.
As neighboring data points are more likely to lie within the same leaf of a tree, the transformation performs an implicit, non-parametric density estimation.
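A speculative sketch of that pipelining idea (my pairing, not from the quoted docs): feed the sparse one-of-K coding produced by RandomTreesEmbedding into a Bernoulli Naive Bayes; X_train and y_train are placeholders for your data and labels.

from sklearn.ensemble import RandomTreesEmbedding
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    RandomTreesEmbedding(n_estimators=50, max_depth=5),  # unsupervised, sparse binary leaf coding
    BernoulliNB(),                                        # NB over the binary leaf indicators
)
model.fit(X_train, y_train)                               # placeholders for your data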
And of course there is an out-of-core implementation of the Naive Bayes classifier, which can be used incrementally: http://scikit-learn.org/stable/modules/naive_bayes.html
Discrete naive Bayes models can be used to tackle large scale text classification problems for which the full training set might not fit in memory. To handle this case both MultinomialNB and BernoulliNB expose a partial_fit method that can be used incrementally as done with other classifiers as demonstrated in Out-of-core classification of text documents.
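A minimal sketch of that incremental usage, where iter_chunks() is a placeholder that yields (X_counts, y) mini-batches of non-negative count features:

from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
all_classes = [0, 1]                               # every class label must be listed on the first call
for X_counts, y in iter_chunks():                  # placeholder generator over mini-batches
    nb.partial_fit(X_counts, y, classes=all_classes)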
I was similarly confused as to how to do exact inference with libpgm. However, it turns out it is possible. For example (from the libpgm docs),
import json
from libpgm.graphskeleton import GraphSkeleton
from libpgm.nodedata import NodeData
from libpgm.discretebayesiannetwork import DiscreteBayesianNetwork
from libpgm.tablecpdfactorization import TableCPDFactorization
# load nodedata and graphskeleton
nd = NodeData()
skel = GraphSkeleton()
nd.load("../tests/unittestdict.txt")
skel.load("../tests/unittestdict.txt")
# toporder graph skeleton
skel.toporder()
# load evidence
evidence = dict(Letter='weak')
query = dict(Grade='A')
# load bayesian network
bn = DiscreteBayesianNetwork(skel, nd)
# load factorization
fn = TableCPDFactorization(bn)
# calculate probability distribution
result = fn.condprobve(query, evidence)
# output
print(json.dumps(result.vals, indent=2))
print(json.dumps(result.scope, indent=2))
print(json.dumps(result.card, indent=2))
print(json.dumps(result.stride, indent=2))
To get the example to work, here is the datafile (I replaced None with null and saved as a .json).
I know this is quite late to the game, but this was the best post I found when searching for a resource to do Bayesian networks with Python. I thought I'd answer in case anyone else is looking for this. (Sorry, would have commented, but just signed up for SO to answer this and rep isn't high enough.)
R's bnlearn has implementations for both Naive Bayes and Tree-augmented Naive Bayes classifiers. You can use rpy2 to port these to Python.
http://cran.r-project.org/web/packages/bnlearn/bnlearn.pdf
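In case it helps, here is a very rough sketch of how that rpy2 bridge could look; the bnlearn function names (tree.bayes for TAN structure learning, bn.fit for parameter learning) and the "Class" column are my reading of the linked manual, so double-check them, and note that importr maps R's dots to underscores.

from rpy2.robjects import r
from rpy2.robjects.packages import importr

bnlearn = importr("bnlearn")                              # R package, accessed from Python
r('df <- read.csv("train.csv", colClasses = "factor")')   # bnlearn classifiers expect discrete data
tan = bnlearn.tree_bayes(r['df'], "Class")                # tree.bayes(): TAN structure; "Class" is a placeholder target column
fitted = bnlearn.bn_fit(tan, r['df'])                     # bn.fit(): parameter learning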
There seems to be no such thing yet.
The closest thing currently seems to be eBay's open source Belief Networks implementation, bayesian. It implements inference (two exact methods, and approximate), which means that it can be used to build a TAN. An example (at the moment still an ugly piece of spaghetti code) for that can be found in my open20q repository.
Advantages:
It works.
That is, I now have an implementation of TAN inference, based on bayesian belief network inference.
With Apache 2.0 and 3-clause BSD style licenses respectively, it is legally possible to combine bayesian code and libpgm code to try to get inference and learning to work.
Disadvantages:
There is no learning whatsoever in bayesian. Trying to combine something like libpgm learning with bayesian classes and inference will be a challenge.
Even more so as bayesian assumes that nodes are given by factors which are fixed python functions. Parameter learning requires some wrapping code to enable tweaking the probabilities.
bayesian is written in pure Python, using dicts etc. as basic structures, not making use of any speedup that numpy, pandas or similar packages might bring, and is therefore quite slow even for the tiny example I built.
I know it's a bit late in the day, but the Octave forge NaN package might be of interest to you. One of the classifiers in this package is an Augmented Naive Bayesian Classifier. The code is GPL'ed so you could easily port it to Python.

Multiprocessing scikit-learn

I got LinearSVC working on a training set and test set using the load_files method, and I am now trying to get it working in a multiprocessor environment.
How can I get multiprocessing to work on LinearSVC().fit() and LinearSVC().predict()? I am not really familiar with the datatypes of scikit-learn yet.
I am also thinking about splitting the samples into multiple arrays, but I am not familiar with numpy arrays and scikit-learn data structures.
That way it would be easier to put into multiprocessing.Pool(): split the samples into chunks, train them, and combine the trained models back together later. Would that work?
EDIT:
Here is my scenario:
Let's say we have 1 million files in the training sample set. When we want to distribute the TfidfVectorizer processing over several processors, we have to split those samples (in my case there will only be two categories, so let's say 500,000 samples each to train on). My server has 24 cores with 48 GB of RAM, so I want to split each topic into 1,000,000 / 24 chunks and process TfidfVectorizer on them. I would do the same for the testing sample set, as well as for SVC.fit() and decide(). Does that make sense?
Thanks.
PS: Please do not close this .
I think using SGDClassifier instead of LinearSVC for this kind of data would be a good idea, as it is much faster. For the vectorization, I suggest you look into the hash transformer PR.
For the multiprocessing: You can distribute the data sets across cores, do partial_fit, get the weight vectors, average them, distribute them to the estimators, do partial fit again.
Doing parallel gradient descent is an area of active research, so there is no ready-made solution there.
How many classes does your data have, by the way? For each class, a separate classifier will be trained (automatically). If you have nearly as many classes as cores, it might be better and much easier to just do one class per core, by specifying n_jobs in SGDClassifier.
For linear models (LinearSVC, SGDClassifier, Perceptron...) you can chunk your data, train independent models on each chunk and build an aggregate linear model (e.g. SGDClassifier) by sticking the average values of coef_ and intercept_ into it as attributes. The predict methods of LinearSVC, SGDClassifier and Perceptron compute the same function (a linear prediction using a dot product with an intercept_ threshold, and one-vs-all multiclass support), so the specific model class you use for holding the averaged coefficients is not important.
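A rough sketch of that chunk-and-average idea with multiprocessing (not from the answer): X_chunks and y_chunks are placeholders for pre-vectorized partitions, and each chunk should contain examples of every class so the per-chunk models have the same shape.

from multiprocessing import Pool

import numpy as np
from sklearn.linear_model import SGDClassifier

def fit_chunk(args):
    X_chunk, y_chunk = args
    return SGDClassifier(loss="hinge").fit(X_chunk, y_chunk)              # one independent model per chunk

chunks = list(zip(X_chunks, y_chunks))                                    # placeholders: pre-vectorized partitions
with Pool(processes=24) as pool:
    models = pool.map(fit_chunk, chunks)

combined = SGDClassifier(loss="hinge")
combined.coef_ = np.mean([m.coef_ for m in models], axis=0)               # average the weights
combined.intercept_ = np.mean([m.intercept_ for m in models], axis=0)     # average the intercepts
combined.classes_ = models[0].classes_                                    # labels needed by predict()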
However as previously said the tricky point is parallelizing the feature extraction and current scikit-learn (version 0.12) does not provide any way to do this easily.
Edit: scikit-learn 0.13+ now has a hashing vectorizer that is stateless.
