I read about kernel regression:
https://en.wikipedia.org/wiki/Kernel_regression
Does sklearn contain this regression?
I saw sklearn.kernel_ridge.KernelRidge, but it doesn't seem to be the same.
Do I need to implement kernel regression myself, or does sklearn have its own kernel regression models (with different types of kernels)?
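For reference, this is roughly the estimator I mean (my own minimal sketch of Nadaraya-Watson for 1-D inputs, not something taken from sklearn); it is a locally weighted average of the targets, rather than the regularized least-squares fit that KernelRidge performs:

import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth=1.0):
    # Gaussian-kernel weights between every query point and every training point
    d2 = (x_query[:, None] - x_train[None, :]) ** 2
    weights = np.exp(-d2 / (2 * bandwidth ** 2))
    # prediction = locally weighted average of the training targets
    return (weights @ y_train) / weights.sum(axis=1)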
I am working on a machine learning dataset of shape 1,456,354 x 53. I want to do feature selection on my data set. I know how to do feature selection in Python using the following code.
from sklearn.feature_selection import RFECV, RFE
from sklearn.linear_model import LogisticRegression
import numpy as np

# recursive feature elimination with logistic regression as the base estimator
logreg = LogisticRegression()
rfe = RFE(logreg, step=1, n_features_to_select=28)
rfe = rfe.fit(df.values, arrythmia.values)

# keep only the column names that RFE selected
features_bool = np.array(rfe.support_)
features = np.array(df.columns)
result = features[features_bool]
print(result)
However, I could not find any article showing how I can perform recursive feature selection in PySpark.
I tried to import the sklearn libraries in PySpark, but it gave me an error: sklearn module not found. I am running PySpark on a Google Dataproc cluster.
Could someone please help me achieve this in PySpark?
You have a few options for doing this.
If the model you need is implemented in either Spark's MLlib or spark-sklearn, you can adapt your code to use the corresponding library.
If you can train your model locally and just want to deploy it to make predictions, you can use User Defined Functions (UDFs) or vectorized UDFs to run the trained model on Spark. Here's a good post discussing how to do this.
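As a rough sketch of that second option (the names clf, spark_df and feature_cols are placeholders, and it assumes Spark 2.3+ for pandas UDFs; the trained sklearn model is captured in the UDF's closure, so sklearn still has to be importable on the workers):

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

# clf: an sklearn model already trained on the driver
# spark_df: a Spark DataFrame whose columns feature_cols match the training features
@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def predict_udf(*cols):
    # each element of cols is a pandas Series holding a batch of one feature column
    X = pd.concat(cols, axis=1)
    return pd.Series(clf.predict(X))

scored = spark_df.withColumn("prediction", predict_udf(*feature_cols))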
If you need to run an sklearn model on Spark that is not supported by spark-sklearn, you'll need to make sklearn available to Spark on each worker node in your cluster. You can do this by manually installing sklearn on each node in your Spark cluster (make sure you are installing into the Python environment that Spark is using).
Alternatively, you can package and distribute the sklearn library with the Pyspark job. In short, you can pip install sklearn into a local directory near your script, then zip the sklearn installation directory and use the --py-files flag of spark-submit to send the zipped sklearn to all workers along with your script. This article has a complete overview of how to accomplish this.
We can try the following feature selection methods in PySpark:
Chi-Squared selector
Random forest selector
References:
https://spark.apache.org/docs/2.2.0/ml-features.html#feature-selectors
https://databricks.com/session/building-custom-ml-pipelinestages-for-feature-selection
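For example, the Chi-Squared selector from the list above could be used roughly like this (a sketch; df is assumed to be a Spark DataFrame with numeric columns feature_cols and a label column named "label"):

from pyspark.ml.feature import VectorAssembler, ChiSqSelector

# assemble the raw columns into a single vector column, as the selector expects
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(df)

# keep the 28 features with the strongest chi-squared association with the label
selector = ChiSqSelector(numTopFeatures=28, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="label")
selected = selector.fit(assembled).transform(assembled)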
I suggest a stepwise regression model: you can easily find the important features and use only those in the logistic regression. Stepwise regression works on correlation, but it has variations.
The link below will help you implement stepwise regression for feature selection.
https://datascience.stackexchange.com/questions/24405/how-to-do-stepwise-regression-using-sklearn
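If you end up doing the selection in plain sklearn anyway, one way to approximate stepwise selection is SequentialFeatureSelector (available since sklearn 0.24); this is just a sketch, with X and y standing in for your features and target:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# forward (stepwise) selection of 28 features, scored with logistic regression
logreg = LogisticRegression(max_iter=1000)
sfs = SequentialFeatureSelector(logreg, n_features_to_select=28,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(X.columns[sfs.get_support()])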
I'm new to the field of data science. I want to train a model using SVM on a dataset with 500k rows and 81 columns.
So far, it's taking hours to train this model in SciPy. I have access to 100+ compute nodes with 16 cores apiece, but I am not sure how to take advantage of them due to my lack of knowledge of how I should be running this SVM code.
Can someone point me in the right direction for how I should go about solving this resource problem?
What kernel function are you using?
SVMs don't scale very well: the training time is O(n^3), where n is the number of training samples.
If you don't use a kernel function, you can create a Spark cluster and use Spark MLlib's SVM, which is a linear classifier:
https://spark.apache.org/docs/latest/mllib-linear-methods.html
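A minimal sketch of what that looks like with MLlib's RDD-based API (the file path is a placeholder and sc is assumed to be an existing SparkContext):

from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.util import MLUtils

# load LIBSVM-formatted data and split it into train/test RDDs of LabeledPoints
data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
train, test = data.randomSplit([0.8, 0.2], seed=42)

# train a linear SVM with stochastic gradient descent
model = SVMWithSGD.train(train, iterations=100)

# evaluate accuracy on the held-out split
correct = test.map(lambda p: (p.label, model.predict(p.features))) \
              .filter(lambda lp: lp[0] == lp[1]).count()
accuracy = correct / test.count()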
If you use a non-linear kernel function, you can use LIBIRWLS, which is multicore, so you can exploit parallelization on a single machine with 16 cores:
https://github.com/RobeDM/LIBIRWLS
I am using sklearn for training and testing my data. I want to use lasso and elastic net regressors with some kernel instead of a linear model. Is there a way this can be done?
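The only approach I can think of is approximating the kernel explicitly (e.g. with sklearn.kernel_approximation.Nystroem) and then fitting the linear Lasso or ElasticNet on the transformed features, roughly like this sketch (X and y are my training data; the kernel and its parameters are just placeholders):

from sklearn.pipeline import make_pipeline
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import Lasso

# approximate an RBF kernel feature map, then fit a linear Lasso on top of it
model = make_pipeline(Nystroem(kernel="rbf", gamma=0.1, n_components=300),
                      Lasso(alpha=0.01))
model.fit(X, y)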
How would I go about splitting up the training of an SVC classifier from the scikit-learn library among multiple processes (or completely separate computers)? I've found several papers that mention the feasibility of training an SVM iteratively in parallel, but I haven't found any specific examples that apply to the scikit-learn library.