I read about kernel regression:
https://en.wikipedia.org/wiki/Kernel_regression
Does sklearn contain this regression?
I saw sklearn.kernel_ridge.KernelRidge, but it doesn't seem to be the same.
Do I need to implement kernel regression myself, or does sklearn have its own kernel regression models (with different types of kernels)?
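For reference, this is roughly the estimator I mean (my own minimal sketch of Nadaraya-Watson for 1-D inputs, not something taken from sklearn); it is a locally weighted average of the targets, rather than the regularized least-squares fit that KernelRidge performs:

import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth=1.0):
    # Gaussian-kernel weights between every query point and every training point
    d2 = (x_query[:, None] - x_train[None, :]) ** 2
    weights = np.exp(-d2 / (2 * bandwidth ** 2))
    # prediction = locally weighted average of the training targets
    return (weights @ y_train) / weights.sum(axis=1)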
I am working on a machine learning dataset of shape 1,456,354 x 53. I want to do feature selection on my data set. I know how to do feature selection in Python using the following code.
from sklearn.feature_selection import RFECV, RFE
from sklearn.linear_model import LogisticRegression
import numpy as np

# recursive feature elimination with logistic regression as the base estimator
logreg = LogisticRegression()
rfe = RFE(logreg, step=1, n_features_to_select=28)
rfe = rfe.fit(df.values, arrythmia.values)

# keep only the column names that RFE selected
features_bool = np.array(rfe.support_)
features = np.array(df.columns)
result = features[features_bool]
print(result)
However, I could not find any article showing how I can perform recursive feature selection in PySpark.
I tried to import the sklearn libraries in PySpark, but it gave me an error: sklearn module not found. I am running PySpark on a Google Dataproc cluster.
Could someone please help me achieve this in PySpark?
You have a few options for doing this.
If the model you need is implemented in either Spark's MLlib or spark-sklearn, you can adapt your code to use the corresponding library.
If you can train your model locally and just want to deploy it to make predictions, you can use User Defined Functions (UDFs) or vectorized UDFs to run the trained model on Spark. Here's a good post discussing how to do this.
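As a rough sketch of that second option (the names clf, spark_df and feature_cols are placeholders, and it assumes Spark 2.3+ for pandas UDFs; the trained sklearn model is captured in the UDF's closure, so sklearn still has to be importable on the workers):

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

# clf: an sklearn model already trained on the driver
# spark_df: a Spark DataFrame whose columns feature_cols match the training features
@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def predict_udf(*cols):
    # each element of cols is a pandas Series holding a batch of one feature column
    X = pd.concat(cols, axis=1)
    return pd.Series(clf.predict(X))

scored = spark_df.withColumn("prediction", predict_udf(*feature_cols))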
If you need to run an sklearn model on Spark that is not supported by spark-sklearn, you'll need to make sklearn available to Spark on each worker node in your cluster. You can do this by manually installing sklearn on each node in your Spark cluster (make sure you are installing into the Python environment that Spark is using).
Alternatively, you can package and distribute the sklearn library with the Pyspark job. In short, you can pip install sklearn into a local directory near your script, then zip the sklearn installation directory and use the --py-files flag of spark-submit to send the zipped sklearn to all workers along with your script. This article has a complete overview of how to accomplish this.
We can try the following feature selection methods in PySpark:
Chi-Squared selector
Random forest selector
References:
https://spark.apache.org/docs/2.2.0/ml-features.html#feature-selectors
https://databricks.com/session/building-custom-ml-pipelinestages-for-feature-selection
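For example, the Chi-Squared selector from the list above could be used roughly like this (a sketch; df is assumed to be a Spark DataFrame with numeric columns feature_cols and a label column named "label"):

from pyspark.ml.feature import VectorAssembler, ChiSqSelector

# assemble the raw columns into a single vector column, as the selector expects
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(df)

# keep the 28 features with the strongest chi-squared association with the label
selector = ChiSqSelector(numTopFeatures=28, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="label")
selected = selector.fit(assembled).transform(assembled)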
I suggest a stepwise regression model: you can easily find the important features and use only those in the logistic regression. Stepwise regression works on correlation, but it has variations.
The link below will help you implement stepwise regression for feature selection.
https://datascience.stackexchange.com/questions/24405/how-to-do-stepwise-regression-using-sklearn
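If you end up doing the selection in plain sklearn anyway, one way to approximate stepwise selection is SequentialFeatureSelector (available since sklearn 0.24); this is just a sketch, with X and y standing in for your features and target:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# forward (stepwise) selection of 28 features, scored with logistic regression
logreg = LogisticRegression(max_iter=1000)
sfs = SequentialFeatureSelector(logreg, n_features_to_select=28,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(X.columns[sfs.get_support()])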
I'm new to the field of data science. I want to train a model using SVM on a dataset with 500k rows and 81 columns.
So far, it's taking hours to train this model in SciPy. I have access to 100+ compute nodes with 16 cores apiece, but I am not sure how to take advantage of them due to my lack of knowledge of how I should be running this SVM code.
Can someone point me in the right direction for how I should go about solving this resource problem?
What kernel function are you using?
SVMs don't scale very well: the training time is O(n^3), where n is the number of training samples.
If you don't use a kernel function, you can create a Spark cluster and use Spark MLlib's SVM, which is a linear classifier:
https://spark.apache.org/docs/latest/mllib-linear-methods.html
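A minimal sketch of what that looks like with MLlib's RDD-based API (the file path is a placeholder and sc is assumed to be an existing SparkContext):

from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.util import MLUtils

# load LIBSVM-formatted data and split it into train/test RDDs of LabeledPoints
data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
train, test = data.randomSplit([0.8, 0.2], seed=42)

# train a linear SVM with stochastic gradient descent
model = SVMWithSGD.train(train, iterations=100)

# evaluate accuracy on the held-out split
correct = test.map(lambda p: (p.label, model.predict(p.features))) \
              .filter(lambda lp: lp[0] == lp[1]).count()
accuracy = correct / test.count()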
If you use a non-linear kernel function, you can use LIBIRWLS, which is multicore, so you can exploit parallelization on a single machine with 16 cores:
https://github.com/RobeDM/LIBIRWLS
I am using sklearn for training and testing my data. I want to use lasso and elastic net regressors with some kernel instead of a linear model. Is there a way this can be done?
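The only approach I can think of is approximating the kernel explicitly (e.g. with sklearn.kernel_approximation.Nystroem) and then fitting the linear Lasso or ElasticNet on the transformed features, roughly like this sketch (X and y are my training data; the kernel and its parameters are just placeholders):

from sklearn.pipeline import make_pipeline
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import Lasso

# approximate an RBF kernel feature map, then fit a linear Lasso on top of it
model = make_pipeline(Nystroem(kernel="rbf", gamma=0.1, n_components=300),
                      Lasso(alpha=0.01))
model.fit(X, y)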
How would I go about splitting up the training of an SVC classifier from the scikit-learn library among multiple processes (or completely separate computers)? I've found several papers that mention the feasibility of training an SVM iteratively in parallel, but I haven't found any specific examples that apply to the scikit-learn library.