How to add MLlib library to Spark? - python

I was given an assignment to run some code and show the results using Apache Spark with Python. I installed the Apache Spark server using the following steps: https://phoenixnap.com/kb/install-spark-on-windows-10. I tried my code and everything was fine. Now I have been given another assignment: it needs MLlib linear regression, and we were provided with some code that should run, to which we will then add our own code. When I try to run the code I get some errors and warnings; some of them appeared in the previous assignment too, but the code still worked then. I believe the issue is that there are additional things related to the MLlib library that should be added so the code will run correctly. Does anybody have any idea what files should be added to Spark so it can run the MLlib-related code?
I am using Windows 10, and spark-3.0.1-bin-hadoop2.7
This is my code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import StandardScaler
conf = SparkConf().setMaster("local").setAppName("LinearRegression")
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
# Load training data
df = sqlContext.read.format("libsvm").option("numFeatures", 13).load("boston_housing.txt")
# Data needs to be scaled for better results and interpretation
# Initialize the `standardScaler`
standardScaler = StandardScaler(inputCol="features", outputCol="features_scaled")
# Fit the DataFrame to the scaler
scaler = standardScaler.fit(df)
# Transform the data in `df` with the scaler
scaled_df = scaler.transform(df)
# Initialize the linear regression model
lr = LinearRegression(labelCol="label", maxIter=10, regParam=0.3, elasticNetParam=0.8)
# Fit the data to the model
linearModel = lr.fit(scaled_df)
# Print the coefficients for the model
print("Coefficients: %s" % str(linearModel.coefficients))
print("Intercept: %s" % str(linearModel.intercept))
Here is a screenshot of what I get when I run the code:

Try doing pip install numpy (or pip3 install numpy if that fails). The traceback says the numpy module is not found.
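A quick way to check (a minimal sketch): run the following with the same Python interpreter that launches PySpark. pyspark.ml needs numpy at runtime, so if this import fails you have reproduced the error and need to install numpy into that environment.

import numpy as np  # raises ModuleNotFoundError if numpy is missing
print(np.__version__)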


How to mlflow-autolog a sklearn ConfusionMatrixDisplay?

I'm trying to log the plot of a confusion matrix generated with scikit-learn for a test set, using mlflow's support for scikit-learn.
For this, I tried something that resembles the code below (I'm using mlflow hosted on Databricks, and sklearn==1.0.1):
import sklearn.datasets
import pandas as pd
import numpy as np
import mlflow
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Users/name.surname/plotcm")
data = sklearn.datasets.fetch_20newsgroups(categories=['alt.atheism', 'sci.space'])
df = pd.DataFrame(data=np.c_[data['data'], data['target']])\
    .rename({0: 'text', 1: 'class'}, axis='columns')
train, test = train_test_split(df)
my_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', SGDClassifier(loss='modified_huber')),
])
mlflow.sklearn.autolog()
from sklearn.metrics import ConfusionMatrixDisplay # should I import this after the call to `.autolog()`?
my_pipeline.fit(train['text'].values, train['class'].values)
cm = ConfusionMatrixDisplay.from_predictions(
    y_true=test["class"], y_pred=my_pipeline.predict(test["text"])
)
While the confusion matrix for the training set is saved in my mlflow run, no PNG file is created in the mlflow frontend for the test set.
If I try to add
cm.figure_.savefig('test_confusion_matrix.png')
mlflow.log_artifact('test_confusion_matrix.png')
that does the job, but requires explicitly logging the artifact.
Is there an idiomatic/proper way to autolog the confusion matrix computed using a test set after my_pipeline.fit()?
The proper way to do this is to use mlflow.log_figure, a fluent API announced in MLflow 1.13.0. You can read the documentation here. This code will do the job:
mlflow.log_figure(cm.figure_, 'test_confusion_matrix.png')
This function implicitly stores the image and then calls log_artifact against that path, similar to what you did manually.
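For completeness, a minimal sketch of how that fits into the question's code, assuming an active MLflow run (e.g. opened with mlflow.start_run()):

with mlflow.start_run():
    my_pipeline.fit(train['text'].values, train['class'].values)
    cm = ConfusionMatrixDisplay.from_predictions(
        y_true=test["class"], y_pred=my_pipeline.predict(test["text"])
    )
    # log_figure saves the matplotlib figure and attaches it to the run as an artifact
    mlflow.log_figure(cm.figure_, 'test_confusion_matrix.png')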

Exception when trying to use saved Spark ML model for distributed computations with SHAP

I am getting this error:
Exception: It appears that you are attempting to reference
SparkContext from a broadcast variable, action, or transformation.
SparkContext can only be used on the driver, not in code that it run
on workers. For more information, see SPARK-5063.
and
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I have a saved LightGBM Spark model which I want to use with the SHAP package to get explanations of my predictions.
# loading LightGBM model
from synapse.ml.lightgbm import LightGBMClassificationModel
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType, FloatType
import shap

# `model_path`, `features`, `spark`, `data_explainer` and `datap` are defined elsewhere
loaded_model = LightGBMClassificationModel.loadNativeModelFromFile(model_path)

# Custom predict_proba method
assembler = VectorAssembler(handleInvalid="keep",
                            inputCols=features,
                            outputCol="features")

def spark_to_pandas(X):
    return spark.createDataFrame(X)

def predict_proba(X):
    sdf = assembler.transform(spark_to_pandas(X).select(*features))
    getNegative = F.udf(lambda x: float(x[0]), FloatType())
    getPositive = F.udf(lambda x: float(x[1]), FloatType())
    predictions = (
        loaded_model.transform(sdf)
        .select("probability")
        .withColumn("0", getNegative(F.col("probability")))
        .withColumn("1", getPositive(F.col("probability")))
        .select("0", "1")
    )
    return predictions.toPandas()

# Creating SHAP explainer
explainer = shap.KernelExplainer(model=predict_proba,
                                 data=data_explainer.select(*features).limit(100).toPandas())

# Trying to get explanations
def explain_df(explainer, df):
    return [e.tolist() for e in explainer.shap_values(df)[1]]

explain = F.udf(lambda x: list(explain_df(explainer, x)), ArrayType(ArrayType(DoubleType())))
gr = datap.groupBy('_partition').agg(F.collect_list('features').alias('features'))
The last step apparently doesn't work because I am trying to use a Spark model inside a UDF; the same approach works with sklearn models.
I ran into the same issue and discovered that Synapse ML has its own explainers built in; the documentation is severely lacking, however. The alternative, if you really need to use the shap package, is to rebuild the model with the lightgbm package: you'd just need to convert the Synapse ML LightGBMClassifier hyperparameters to the format accepted by lightgbm.
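For reference, a rough sketch of what the built-in explainer route can look like, based on the SynapseML tabular SHAP example. It reuses features, assembler, loaded_model and data_explainer from the question, and the exact parameters may vary between SynapseML versions:

from pyspark.ml import PipelineModel
from pyspark.sql.functions import broadcast, rand
from synapse.ml.explainers import TabularSHAP

# Wrap the assembler and the loaded model so the explainer sees raw feature columns
pipeline_model = PipelineModel(stages=[assembler, loaded_model])
shap_explainer = TabularSHAP(
    inputCols=features,
    outputCol="shapValues",
    model=pipeline_model,
    targetCol="probability",
    targetClasses=[1],  # explain the positive class
    backgroundData=broadcast(
        data_explainer.select(*features).orderBy(rand()).limit(100).cache()
    ),
)
# Runs distributed inside Spark, so nothing references the SparkContext from a UDF
shap_df = shap_explainer.transform(data_explainer)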

How do you create a spark dataframe on a worker node when using HyperOpt and SparkTrials?

I'm trying to run ML trials in parallel using HyperOpt with SparkTrials on Databricks.
My objective function converts the outputs to a spark dataframe using spark.createDataFrame(results) (to reuse some preprocessing code I've previously created; I'd prefer not to have to rewrite this).
However, this causes an error when attempting to use HyperOpt and SparkTrials, as the SparkContext used to create the dataframe "should only be created or accessed on the driver". Is there any way I can create a Spark DataFrame in my objective function here?
For a reproducible example:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK, Trials
from pyspark.sql import SparkSession
# If you are running Databricks Runtime for Machine Learning, `mlflow` is already installed and you can skip the following line.
import mlflow
# Load the iris dataset from scikit-learn
iris = load_iris()
X = iris.data
y = iris.target
def objective(C):
    # Create a support vector classifier model
    clf = SVC(C)
    # THESE TWO LINES CAUSE THE PROBLEM
    ss = SparkSession.builder.getOrCreate()
    sdf = ss.createDataFrame([('Alice', 1)])
    # Use the cross-validation accuracy to compare the models' performance
    accuracy = cross_val_score(clf, X, y).mean()
    # Hyperopt tries to minimize the objective function. A higher accuracy value means a better model, so you must return the negative accuracy.
    return {'loss': -accuracy, 'status': STATUS_OK}
search_space = hp.lognormal('C', 0, 1.0)
algo=tpe.suggest
# THIS WORKS (It's not using SparkTrials)
argmin = fmin(
    fn=objective,
    space=search_space,
    algo=algo,
    max_evals=16)
from hyperopt import SparkTrials
spark_trials = SparkTrials()
# THIS FAILS
argmin = fmin(
    fn=objective,
    space=search_space,
    algo=algo,
    max_evals=16,
    trials=spark_trials)
I have tried looking at this question, "How can I get the current SparkSession in any place of the codes?", but it is solving a different problem and I can't see an obvious way to apply it to my situation.
I think the short answer is that it's not possible. The SparkContext can only exist on the driver node. Creating a new instance would be a kind of nesting; see this related question:
Nesting parallelizations in Spark? What's the right approach?
I solved my problem in the end by rewriting the transformations in pandas, which would then work.
If the transformations are too big for a single node then you'd probably have to pre-compute them and let hyperopt choose which version as part of the optimisation.
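A minimal sketch of that pandas rewrite applied to the example above (the column names are hypothetical); the objective stays node-local, so SparkTrials can pickle it and run it on workers:

import pandas as pd

def objective(C):
    clf = SVC(C)
    # Node-local replacement for ss.createDataFrame([('Alice', 1)])
    pdf = pd.DataFrame([('Alice', 1)], columns=['name', 'value'])
    # ...re-implement any Spark preprocessing against pdf here...
    accuracy = cross_val_score(clf, X, y).mean()
    return {'loss': -accuracy, 'status': STATUS_OK}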

Scikit learn does not appear to respect global / local random_states in unittests

I'm trying to write an integration test that uses the descriptive statistics (.describe().to_list()) of the results of a model prediction (model.predict(X)). However, even though I've set np.random.seed(###), the descriptive statistics are different after running the tests in the console vs. in the environment created by PyCharm.
Here's an MRE run locally:
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression
import numpy as np
import pandas as pd
np.random.seed(42)
X, y = make_regression(n_features=2, random_state=42)
regr = ElasticNet(random_state=42)
regr.fit(X, y)
pred = regr.predict(X)
# Theory: This result should be the same from the result in a class
pd.Series(pred).describe().to_list()
And an example test-file:
from unittest import TestCase
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression
import numpy as np
import pandas as pd
np.random.seed(42)
class TestPD(TestCase):
    def testExpectedPrediction(self):
        np.random.seed(42)
        X, y = make_regression(n_features=2, random_state=42)
        regr = ElasticNet(random_state=42)
        regr.fit(X, y)
        pred = pd.Series(regr.predict(X))
        for i in pred.describe().to_list():
            print(i)
        # here we would have a self.assertTrue/Equals f.e. element
What appears to happen is that when I run this test in the Python Console, I get one result, but when I run it using PyCharm's unittests for the folder, I get another result. Importantly, in PyCharm, the project interpreter is used to create an environment for the console that ought to be the same as the test environment. This leads me to believe that I'm missing something about the way random_state is passed along. My expectation, given that I have set a seed, is that the results would be reproducible. But that doesn't appear to be the case, and I would like to understand:
Why aren't they equal?
What can I do to make them equal?
I haven't been able to find a lot of best practices with respect to testing against expected model results. So commentary in that regard would also be helpful.
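One common pattern for that last point (a hedged sketch, not from the question): assert against expected statistics with a numeric tolerance rather than exact equality, so small platform-level floating-point differences don't break the test. The expected values below are hypothetical placeholders:

import numpy as np

expected = [100.0, -0.84, 96.27, -204.97, -60.29, -1.18, 58.77, 210.67]  # hypothetical
np.testing.assert_allclose(pred.describe().to_list(), expected, rtol=1e-6)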

Create file OLS in Python Statsmodels

I don't have much knowledge of Python, but I have to crack this to complete an assessment.
Question:
Run the following code to load the required libraries and create the data set to fit the model.
from sklearn.datasets import load_boston
import pandas as pd
boston = load_boston()
dataset = pd.DataFrame(boston.data, columns=boston.feature_names)
dataset['target'] = boston.target
print(dataset.head())
I have to perform the following steps to complete this scenario.
1. For the boston dataset loaded in the above code snippet, perform linear regression.
2. Use the target variable as the dependent variable.
3. Use the RM variable as the independent variable.
4. Fit a single linear regression model using the statsmodels package in Python.
5. Import the statsmodels packages appropriately in your code.
6. Upon fitting the model, identify the coefficients.
7. Finally, print the model summary in your code.
You can write your code using vim app.py .
Press i for insert mode.
Press esc and then :wq to save and quit the editor.
Please help me understand how to get this completed. Your valuable comments are much appreciated.
Thanks in advance.
from sklearn.datasets import load_boston
import pandas as pd
boston = load_boston()
dataset = pd.DataFrame(boston.data, columns=boston.feature_names)
dataset['target'] = boston.target
print(dataset.head())
import statsmodels.api as sm

X = dataset["RM"]
y = dataset['target']
# Add an intercept term; without it the regression is forced through the origin
X = sm.add_constant(X)
# OLS lives in statsmodels.api; the formula interface would be smf.ols with a formula string
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
print(model.summary())
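To identify just the coefficients (step 6 above), they are available on the fitted results object as a pandas Series:

# 'const' is the intercept added by add_constant; 'RM' is the slope
print(model.params)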
