Feature Selection in PySpark

I am working on a machine learning model with a data set of shape 1,456,354 x 53. I want to do feature selection on my data set. I know how to do feature selection in Python using the following code.
from sklearn.feature_selection import RFECV, RFE
from sklearn.linear_model import LogisticRegression
import numpy as np

# Recursive feature elimination: keep the 28 best features according to a logistic regression.
logreg = LogisticRegression()
rfe = RFE(logreg, step=1, n_features_to_select=28)
rfe = rfe.fit(df.values, arrythmia.values)

# Map the boolean support mask back to the column names.
features_bool = np.array(rfe.support_)
features = np.array(df.columns)
result = features[features_bool]
print(result)
However, I could not find any article showing how to perform recursive feature elimination in PySpark.
I tried to import the sklearn libraries in PySpark, but it gave me an error: sklearn module not found. I am running PySpark on a Google Dataproc cluster.
Could someone please help me achieve this in PySpark?

You have a few options for doing this.
If the model you need is implemented in either Spark's MLlib or spark-sklearn, you can adapt your code to use the corresponding library.
If you can train your model locally and just want to deploy it to make predictions, you can use User Defined Functions (UDFs) or vectorized UDFs to run the trained model on Spark. Here's a good post discussing how to do this.
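For illustration, here is a minimal sketch of the vectorized-UDF approach, assuming Spark 3.x, an active SparkSession named spark, a Spark DataFrame sdf whose feature columns are (hypothetically) named f1, f2, f3, and a locally trained sklearn model local_model; all of these names are placeholders:
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

# Broadcast the locally fitted sklearn model so every executor gets a copy.
broadcast_model = spark.sparkContext.broadcast(local_model)

@pandas_udf(DoubleType())
def predict_udf(f1: pd.Series, f2: pd.Series, f3: pd.Series) -> pd.Series:
    # Reassemble the feature matrix and score it with the broadcast model.
    features = pd.concat([f1, f2, f3], axis=1)
    return pd.Series(broadcast_model.value.predict(features.values).astype(float))

scored = sdf.withColumn("prediction", predict_udf("f1", "f2", "f3"))
Note that sklearn still has to be importable on the workers for the UDF to run, which is where the next two options come in.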
If you need to run an sklearn model on Spark that is not supported by spark-sklearn, you'll need to make sklearn available to Spark on each worker node in your cluster. You can do this by manually installing sklearn on each node in your Spark cluster (make sure you are installing into the Python environment that Spark is using).
Alternatively, you can package and distribute the sklearn library with the Pyspark job. In short, you can pip install sklearn into a local directory near your script, then zip the sklearn installation directory and use the --py-files flag of spark-submit to send the zipped sklearn to all workers along with your script. This article has a complete overview of how to accomplish this.

We can try the following feature selection methods in PySpark:
Chi-Squared selector
Random forest selector (via feature importances)
References:
https://spark.apache.org/docs/2.2.0/ml-features.html#feature-selectors
https://databricks.com/session/building-custom-ml-pipelinestages-for-feature-selection
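For illustration, here is a rough sketch of both ideas with pyspark.ml, assuming a Spark DataFrame data that already has a features vector column (e.g. built with VectorAssembler) and a numeric label column; the column names and numTopFeatures value are placeholders:
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.classification import RandomForestClassifier

# Chi-squared selector: keep the 28 features most associated with the label.
selector = ChiSqSelector(numTopFeatures=28, featuresCol="features",
                         labelCol="label", outputCol="selectedFeatures")
selected = selector.fit(data).transform(data)

# Random forest "selector": fit a forest and rank features by impurity-based importance.
rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=50)
rf_model = rf.fit(data)
print(rf_model.featureImportances)  # SparseVector of per-feature importance scores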

I suggest a stepwise regression model: you can easily find the important features and use only those in your logistic regression. Stepwise regression works on correlation, but it has variations.
The link below will help you implement stepwise regression for feature selection.
https://datascience.stackexchange.com/questions/24405/how-to-do-stepwise-regression-using-sklearn
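As a local (non-Spark) illustration, newer versions of scikit-learn (0.24+) ship SequentialFeatureSelector, which implements forward/backward stepwise selection; a minimal sketch, assuming a pandas DataFrame X of features and a label series y (n_features_to_select is a placeholder):
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Forward stepwise selection: greedily add the feature that most improves the CV score.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=10,
                                direction="forward")
sfs.fit(X, y)
print(X.columns[sfs.get_support()])  # names of the selected features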

Related

model.feature_names_in_ not working for decision trees in Anaconda Jupyter Notebook

Is there any way I can get the feature names of the decision tree model defined below using sklearn or any other packages in the Anaconda Jupyter Notebook? I have been trying to work on this for a long time now, but have not been able to find an exact walkthrough.
[The question attached its code as separate snippets: the snippet with the error, the code for preparing the model, the code for running the model, and the code for visualizing the decision tree.]
You just need to train your decision tree model on a pandas DataFrame that carries the feature names, instead of on a NumPy array with no feature names.
Scikit-learn decision tree documentation
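A minimal sketch of the idea, assuming scikit-learn 1.0+ (the toy data and column names below are placeholders):
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Train on a pandas DataFrame so the column names travel with the data.
X = pd.DataFrame({"age": [25, 32, 47, 51], "income": [40, 60, 80, 120]})
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier().fit(X, y)
print(tree.feature_names_in_)  # ['age' 'income'] -- populated because X was a DataFrame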

How can I explain predictions of an imblearn pipeline?

I have an imblearn (not sklearn) pipeline consisting of the following steps:
Column selector
Preprocessing pipeline (ColumnTransformer with OneHotEncoders and CountVectorizers on different columns)
imblearn's SMOTE
XGBClassifier
I have a tabular dataset and I'm trying to explain my predictions.
I managed to work out feature importance plots with some work, but can't get either eli5 or lime to work.
Lime requires that I transform the data to the state it was in before the last transformation step (because the transformers in the Pipeline, like the custom vectorizers, create new columns).
In principle, I can slice my Pipeline like this: pipeline[:-1].predict(instance). However, I get the following error: {AttributeError}'SMOTE' object has no attribute 'predict'.
I also tried an eli5 explainer, since it supposedly works with Sklearn Pipelines.
However, after running eli5.sklearn.explain_prediction.explain_prediction_sklearn_not_supported(pipeline, instance_to_explain) I get the message that the classifier is not supported.
Will appreciate any ideas on how to proceed with this.
Imblearn's samplers are effectively no-op (i.e. identity) transformers during prediction. Therefore, it should be safe to delete them after the pipeline has been fitted.
Try the following workflow:
Construct an Imblearn pipeline, and fit it.
Extract the steps of the fitted Imblearn pipeline to a new Scikit-Learn pipeline.
Delete the SMOTE step.
Explain your predictions using standard Scikit-Learn pipeline explanation tools.
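A minimal sketch of that workflow, assuming the fitted imblearn pipeline is called pipeline, its SMOTE step is named "smote" (the step name is a placeholder), and instance is a row you want to explain:
from sklearn.pipeline import Pipeline

# Rebuild a plain scikit-learn pipeline from the already-fitted steps, skipping the sampler.
sk_pipeline = Pipeline([(name, step) for name, step in pipeline.steps
                        if name != "smote"])

# sk_pipeline now behaves like an ordinary sklearn pipeline at prediction time,
# so slicing works for explanation tools such as lime or eli5:
transformed_instance = sk_pipeline[:-1].transform(instance)
prediction = sk_pipeline.predict(instance)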

Using SVM with Linear Kernel in Pyspark

I want to use the SVM algorithm with an RBF kernel. I tried using the sklearn library, but as my dataset has 100k rows it has been running for the last 2 days. I couldn't think of any optimization/parallelization to do in sklearn. So now I'm thinking of using Spark's SVM with an RBF kernel. I couldn't find any API call in PySpark for SVM. Does anyone know of a PySpark API for the RBF kernel?
Thanks

Spark Multiclass Classification using python

I am trying to implement multiclass classification using PySpark. I have spent loads of time searching the web, and I have read that it is possible now using Spark 2.1.0.
I have generated my own dataset with all-numerical features, and I have created a DataFrame as shown below.
I have three classes in 'Service_Level', which are either 0, 1, or 2.
Questions:
Do I have to use LabeledPoints if I have features like these?
how do I use a multilayer perceptron instead of logistic regression?
Thanks.
Since there was no answer, I will share what I observed during my research. Using LabeledPoints is fine when using Spark MLlib, which is now in maintenance mode as of Spark 2.1.0. However, my features were categorical, so with the DataFrame API (Spark ML) I had to convert them to vectors using StringIndexer, OneHotEncoder, and Pipelines to select my features and labels.
Answering the questions:
Yes, LabeledPoints can be used with those features, but only with Spark MLlib. I was not able to implement the Multilayer Perceptron because somehow it required libsvm-formatted data, which I did not have and could not convert my CSV into.
In the final implementation, I had to use the DataFrame-based API (Spark ML).
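For reference, here is a rough sketch of that DataFrame-based workflow, assuming a Spark DataFrame df with a categorical column category, numeric columns num1 and num2, and the Service_Level label; the column names and layer sizes are placeholders:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import MultilayerPerceptronClassifier

# Index and one-hot encode the categorical feature.
indexer = StringIndexer(inputCol="category", outputCol="category_idx")
encoder = OneHotEncoder(inputCol="category_idx", outputCol="category_vec")

# Assemble everything into a single features vector.
assembler = VectorAssembler(inputCols=["category_vec", "num1", "num2"],
                            outputCol="features")

# Multilayer perceptron: the first layer size must equal the length of the assembled
# feature vector, and the last layer size must equal the number of classes (3 here).
mlp = MultilayerPerceptronClassifier(featuresCol="features", labelCol="Service_Level",
                                     layers=[4, 8, 3])

model = Pipeline(stages=[indexer, encoder, assembler, mlp]).fit(df)
predictions = model.transform(df)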

Scikit-learn KNN(K Nearest Neighbors ) parallelize using Apache Spark

I have been working on the machine learning KNN (K Nearest Neighbors) algorithm with Python and Python's scikit-learn machine learning API.
I have created sample code with a toy dataset simply using Python and scikit-learn, and my KNN is working fine. But as we know, the scikit-learn API is built to work on a single machine, so once I replace my toy data with millions of records it will hurt my performance.
I have searched for many options, help, and code examples that would distribute my machine learning processing in parallel using Spark with the scikit-learn API, but I have not found any proper solution or examples.
Can you please let me know how I can achieve this and increase my performance with Apache Spark and the scikit-learn API's K Nearest Neighbors?
Thanks in advance!!
Well, according to the discussion at https://issues.apache.org/jira/browse/SPARK-2336, MLlib (the machine learning library for Apache Spark) does not have an implementation of KNN.
You could try https://github.com/saurfang/spark-knn.
