I have been working on the KNN (K Nearest Neighbors) machine learning algorithm with Python and the scikit-learn machine learning API.
I have created sample code with a toy dataset using Python and scikit-learn, and my KNN is working fine. But, as we know, the scikit-learn API is built to work on a single machine, so once I replace my toy data with a dataset of millions of records, performance will suffer.
I have searched for options, help, and code examples that would distribute my machine learning processing in parallel using Spark with the scikit-learn API, but I have not found any proper solution or examples.
Can you please let me know how I can increase performance with Apache Spark and scikit-learn's K Nearest Neighbors?
Thanks in advance!!
Well, according to the discussion at https://issues.apache.org/jira/browse/SPARK-2336, MLlib (the machine learning library for Apache Spark) does not have an implementation of KNN.
You could try https://github.com/saurfang/spark-knn.
I am working on a project that compares the performance of unsupervised ML models on a loan default dataset, and I am having trouble finding examples/tips on using unsupervised ML models within a scikit-learn pipeline. My pipeline is set up with pre-processing such as OneHotEncoder, StandardScaler, SimpleImputer, and a couple of custom transformers.
The lack of examples/tips available on the internet seems to suggest that this is not an advised route to go down for unsupervised ML models.
Has anyone come across any examples/tips, or does anyone have any advice of their own to aid my project?
Any help would be greatly appreciated!
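For context, here is a minimal sketch (not from the original question) of this kind of pipeline with an unsupervised final step; KMeans, the column names, and the number of clusters are all assumptions, and the custom transformers are omitted:

from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Separate preprocessing for numeric and categorical columns (names are hypothetical).
numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("encode", OneHotEncoder(handle_unknown="ignore"))])
preprocess = ColumnTransformer([("num", numeric, ["loan_amount", "income"]),
                                ("cat", categorical, ["purpose", "home_ownership"])])

# An unsupervised estimator can sit at the end of the pipeline just like a classifier.
pipe = Pipeline([("preprocess", preprocess),
                 ("cluster", KMeans(n_clusters=5, random_state=0))])
labels = pipe.fit_predict(loans_df)  # loans_df is the (assumed) loan-default DataFrame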
I cannot get a satisfying answer to this question. As I understand it, TensorFlow is a library for numerical computations, often used in deep learning applications, and Scikit-learn is a framework for general machine learning.
But what is the exact difference between them, what is the purpose and function of TensorFlow? Can I use them together, and does it make any sense?
Your understanding is pretty much spot on, albeit very, very basic. TensorFlow is more of a low-level library. Basically, we can think of TensorFlow as the Lego bricks (similar to NumPy and SciPy) that we can use to implement machine learning algorithms, whereas Scikit-Learn comes with off-the-shelf algorithms, e.g., algorithms for classification such as SVMs, Random Forests, Logistic Regression, and many, many more. TensorFlow really shines if we want to implement deep learning algorithms, since it allows us to take advantage of GPUs for more efficient training. As a low-level library, TensorFlow lets you build machine learning models (and other computations) out of a set of simple operators, like "add", "matmul", "concat", etc.
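For instance, a quick sketch of those low-level operators in TensorFlow 2's eager mode (the matrices are arbitrary toy values):

import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])
print(tf.add(a, b))               # elementwise addition
print(tf.matmul(a, b))            # matrix multiplication
print(tf.concat([a, b], axis=0))  # stack the two matrices vertically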
Makes sense so far?
Scikit-Learn is a higher-level library that includes implementations of several machine learning algorithms, so you can define a model object in a single line or a few lines of code, then use it to fit a set of points or predict a value.
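For example, a ready-made classifier really does come down to a few lines, here sketched with one of scikit-learn's bundled toy datasets:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100).fit(X, y)  # define and fit the model
print(clf.predict(X[:3]))  # predict the class of the first three samples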
TensorFlow is mainly used for deep learning, while Scikit-Learn is used for more general machine learning.
Here is a link that shows you how to do Regression and Classification using TensorFlow. I would highly suggest downloading the data sets and running the code yourself.
https://stackabuse.com/tensorflow-2-0-solving-classification-and-regression-problems/
Of course, you can do many different kinds of regression and classification using Scikit-Learn, without TensorFlow. I would suggest reading through the Scikit-Learn documentation when you have a chance.
https://scikit-learn.org/stable/user_guide.html
It's going to take a while to get through everything, but if you make it to the end, you will have learned a ton! Finally, you can get the 2,600+ page user guide for Scikit-Learn from the link below.
https://scikit-learn.org/stable/_downloads/scikit-learn-docs.pdf
TensorFlow is a library for constructing neural networks. Scikit-learn contains ready-to-use algorithms. TensorFlow can work with a variety of data types: tabular, text, images, audio. Scikit-learn is intended to work with tabular data.
Yes, you can use both packages. But if you need only a classic multi-layer perceptron implementation, then the MLPClassifier and MLPRegressor available in scikit-learn are a very good choice. I have run a comparison of an MLP implemented in TensorFlow vs. scikit-learn: there were no significant differences in accuracy, and the scikit-learn MLP ran about 2 times faster than TensorFlow on CPU. You can read the details of the comparison in my blog post.
[The original post includes scatter plots of the performance comparison.]
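For completeness, a minimal sketch of the scikit-learn MLP mentioned above (the layer sizes and dataset are arbitrary choices, not from the original comparison):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))  # accuracy on the held-out split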
Both are third-party machine learning libraries, and both are good at what they do.
Tensorflow is the more popular of the two.
Tensorflow is typically used more in Deep Learning and Neural Networks.
Scikit-learn is for more general machine learning.
And although I don't think I've come across anyone using both simultaneously, no one is saying you can't.
I am working on a machine learning dataset of shape 1,456,354 x 53. I want to do feature selection for my data set. I know how to do feature selection in Python using the following code.
import numpy as np
from sklearn.feature_selection import RFECV, RFE
from sklearn.linear_model import LogisticRegression

# df holds the feature DataFrame, arrythmia holds the target labels
logreg = LogisticRegression()
rfe = RFE(logreg, step=1, n_features_to_select=28)
rfe = rfe.fit(df.values, arrythmia.values)

# keep only the column names that RFE selected
features_bool = np.array(rfe.support_)
features = np.array(df.columns)
result = features[features_bool]
print(result)
However, I could not find any article showing how I can perform recursive feature elimination in PySpark.
I tried to import the sklearn libraries in PySpark, but it gave me an error: sklearn module not found. I am running PySpark on a Google Dataproc cluster.
Could someone please help me achieve this in PySpark?
You have a few options for doing this.
If the model you need is implemented in either Spark's MLlib or spark-sklearn, you can adapt your code to use the corresponding library.
If you can train your model locally and just want to deploy it to make predictions, you can use User Defined Functions (UDFs) or vectorized UDFs to run the trained model on Spark. Here's a good post discussing how to do this.
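As a rough illustration of that second option, here is a hedged sketch (not from the linked post): train a scikit-learn model locally, broadcast it, and score a Spark DataFrame with a pandas UDF. The column names, the toy training data, and the choice of LogisticRegression are all assumptions, and scikit-learn still has to be installed on the workers (see the next option):

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
from sklearn.linear_model import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Train locally on a small sample that fits in memory.
X_sample = np.random.rand(1000, 2)
y_sample = (X_sample[:, 0] + X_sample[:, 1] > 1.0).astype(int)
local_model = LogisticRegression().fit(X_sample, y_sample)

# Ship the fitted model to the workers once.
bc_model = spark.sparkContext.broadcast(local_model)

@pandas_udf(DoubleType())
def predict_udf(f1: pd.Series, f2: pd.Series) -> pd.Series:
    features = pd.concat([f1, f2], axis=1).values
    return pd.Series(bc_model.value.predict_proba(features)[:, 1])

sdf = spark.createDataFrame(pd.DataFrame(np.random.rand(100, 2), columns=["f1", "f2"]))
sdf.withColumn("score", predict_udf("f1", "f2")).show(5)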
If you need to run an sklearn model on Spark that is not supported by spark-sklearn, you'll need to make sklearn available to Spark on each worker node in your cluster. You can do this by manually installing sklearn on each node in your Spark cluster (make sure you are installing into the Python environment that Spark is using).
Alternatively, you can package and distribute the sklearn library with the Pyspark job. In short, you can pip install sklearn into a local directory near your script, then zip the sklearn installation directory and use the --py-files flag of spark-submit to send the zipped sklearn to all workers along with your script. This article has a complete overview of how to accomplish this.
We can try the following feature selection methods in PySpark (a Chi-Squared selector sketch follows the list):
Chi-Squared selector
Random forest selector
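For instance, a minimal sketch of the Chi-Squared selector from Spark ML; the tiny DataFrame here is just for illustration:

from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
     (Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
     (Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)],
    ["features", "label"])

# Keep the two features most associated with the label according to a chi-squared test.
selector = ChiSqSelector(numTopFeatures=2, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="label")
selector.fit(df).transform(df).show()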
References:
https://spark.apache.org/docs/2.2.0/ml-features.html#feature-selectors
https://databricks.com/session/building-custom-ml-pipelinestages-for-feature-selection
I suggest a stepwise regression model: you can easily find the important features and use only those in your logistic regression. Stepwise regression works on correlation, but it has variations.
The link below will help you implement stepwise regression for feature selection.
https://datascience.stackexchange.com/questions/24405/how-to-do-stepwise-regression-using-sklearn
I am working on a relatively large text-based web classification problem and I am planning on using the multinomial Naive Bayes classifier in sklearn in python and the scrapy framework for the crawling. However, I am a little concerned that sklearn/python might be too slow for a problem that could involve classifications of millions of websites. I have already trained the classifier on several thousand websites from DMOZ.
The research framework is as follows:
1) The crawler lands on a domain name and scrapes the text from 20 links on the site (of depth no larger than one). (The number of tokenized words here seems to vary between a few thousand and 150K for a sample run of the crawler.)
2) Run the sklearn multinomial NB classifier with around 50,000 features and record the domain name depending on the result.
My question is whether a Python-based classifier would be up to the task for such a large scale application or should I try re-writing the classifier (and maybe the scraper and word tokenizer as well) in a faster environment? If yes what might that environment be?
Or perhaps Python is enough if accompanied with some parallelization of the code?
Thanks
Use the HashingVectorizer and one of the linear classification modules that support the partial_fit API (for instance SGDClassifier, Perceptron, or PassiveAggressiveClassifier) to incrementally learn the model without having to vectorize and load all the data in memory upfront; you should not have any issue learning a classifier on hundreds of millions of documents with hundreds of thousands of (hashed) features.
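A minimal out-of-core sketch of that idea; stream_minibatches() is a hypothetical generator yielding (texts, labels) batches from disk or the crawler:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=False)
clf = SGDClassifier()
all_classes = [0, 1]  # every label must be declared on the first partial_fit call

for texts, labels in stream_minibatches():
    X = vectorizer.transform(texts)  # hashing is stateless, so no fit pass is needed
    clf.partial_fit(X, labels, classes=all_classes)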
You should, however, load a small subsample that fits in memory (e.g. 100k documents) and grid search good parameters for the vectorizer using a Pipeline object and the RandomizedSearchCV class (only on the master branch at the time of writing). You can also fine-tune the value of the regularization parameter (e.g. C for PassiveAggressiveClassifier or alpha for SGDClassifier) using the same RandomizedSearchCV or a larger, pre-vectorized dataset that fits in memory (e.g. a couple of million documents).
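A hedged sketch of that tuning step, using today's sklearn.model_selection import path; texts_sample and labels_sample stand in for a subsample that fits in memory, and the parameter ranges are made up:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([("vec", HashingVectorizer(alternate_sign=False)),
                 ("clf", SGDClassifier())])
param_distributions = {"vec__ngram_range": [(1, 1), (1, 2)],
                       "clf__alpha": [1e-6, 1e-5, 1e-4, 1e-3]}
search = RandomizedSearchCV(pipe, param_distributions, n_iter=8, n_jobs=-1)
search.fit(texts_sample, labels_sample)  # e.g. ~100k documents held in memory
print(search.best_params_)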
Also, linear models can be averaged (average the coef_ and intercept_ of two linear models), so you can partition the dataset, learn linear models independently on each partition, and then average the models to get the final model, as in the sketch below.
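A hedged sketch of that averaging idea; it assumes all the models are fitted linear classifiers (e.g. SGDClassifier) trained on the same classes and the same hashed feature space:

import numpy as np

def average_linear_models(models):
    # Reuse the first estimator as a container and overwrite its parameters
    # with the mean of all the independently trained models.
    merged = models[0]
    merged.coef_ = np.mean([m.coef_ for m in models], axis=0)
    merged.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
    return merged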
Fundamentally, if you rely on numpy, scipy, and sklearn, Python will not be a bottleneck as most critical portions of those libraries are implemented as C-extensions.
But, since you're scraping millions of sites, you're going to be bounded by your single machine's capabilities. I would consider using a service like PiCloud [1] or Amazon Web Services (EC2) to distribute your workload across many servers.
An example would be to funnel your scraping through Cloud Queues [2].
[1] http://www.picloud.com
[2] http://blog.picloud.com/2013/04/03/introducing-queues-creating-a-pipeline-in-the-cloud/
I am using scikit-learn SVC to classify some data. I would like to increase the training performance.
clf = svm.SVC(cache_size=4000, probability=True, verbose=True)
Since scikit-learn interfaces with libsvm and libsvm uses OpenMP, I was hoping that:
export OMP_NUM_THREADS=16
would run on multiple cores.
Unfortunately this did not help.
Any Ideas?
Thanks
There is no OpenMP support in the current binding for libsvm in scikit-learn. However, if you have performance issues with sklearn.svm.SVC, it is very likely that you should use a more scalable model instead.
If your data is high-dimensional, it might be linearly separable. In that case, it is advisable to first try simpler models such as naive Bayes models or sklearn.linear_model.Perceptron, which are known to be very fast to train. You can also try sklearn.linear_model.LogisticRegression and sklearn.svm.LinearSVC, both implemented using liblinear, which is more scalable than libsvm, albeit less memory-efficient than the other linear models in scikit-learn.
If your data is not linearly separable, you can try sklearn.ensemble.ExtraTreesClassifier (adjust the n_estimators parameter to trade off training speed vs. predictive accuracy).
Alternatively you can try to approximate a RBF kernel using the RBFSampler transformer of scikit-learn + fitting a linear model on the output:
http://scikit-learn.org/dev/modules/kernel_approximation.html
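A minimal sketch of that kernel-approximation route, with toy data standing in for the real dataset and an arbitrary gamma:

from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model = make_pipeline(RBFSampler(gamma=0.5, n_components=300, random_state=0),
                      SGDClassifier())
model.fit(X, y)
print(model.score(X, y))  # training accuracy of the approximate-RBF linear model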
If you are using cross validation or grid search in scikit-learn then you can use multiple CPUs with the n_jobs parameter:
GridSearchCV(..., n_jobs=-1)
cross_val_score(..., n_jobs=-1)
Note that cross_val_score only needs one job per fold, so if your number of folds is less than your number of CPUs, you still won't be using all of your processing power.
LibSVM can use OpenMP if you can compile it and use it directly, as per these instructions in the LibSVM FAQ. So you could export your scaled data in LibSVM format (here's a StackOverflow question on how to do that) and use LibSVM directly to train your data. But that will only be of benefit if you're grid searching or wanting to know accuracy scores; as far as I know, the model LibSVM creates cannot be used in scikit-learn.
There is also a GPU accelerated version of LibSVM which I have tried and is extremely fast, but is not based on the current LibSVM version. I have talked to the developers and they say they hope to release a new version soon.
Although this thread is more than a year old, I thought it was worth answering.
I wrote a patch for openmp support on scikit-learn for both libsvm and liblinear (linearSVC) that's available here - https://github.com/fidlr/sklearn-openmp.
It is based on libsvm's FAQ on how to add OpenMP support, and the multi-core implementation of liblinear.
Just clone the repo and run sklearn-build-openmp.sh to apply the patch and build it.
Timing OMP_NUM_THREADS=4 python plot_permutation_test_for_classification.py:
libsvm with a linear kernel: timing dropped by a factor of 2.3
RBF kernel: same.
liblinear with 4 threads: timing dropped by a factor of 1.6
Details and usage information can be found here:
http://fidlr.org/post/137303264732/scikit-learn-017-with-libsvm-openmp-support