I have loaded a scikit-learn model successfully but am unable to make predictions on a PySpark DataFrame. When running the code below, I get the error shown underneath it. Please help me get the code to make predictions with a scikit-learn model on PySpark. I have also searched related questions but could not find a solution.
sc = spark.sparkContext
braodcast_model = sc.broadcast(loaded_model)
braodcast_model.value
#update prediction method
def predictor(cols):
    # call predict method for model
    return model.value.predict(*cols)
udf_predictor = udf(predictor, FloatType())
#apply the udf to dataframe
df_prediction = df.withColumn("prediction", udf_predictor(df.select(list_of_columns)))
I get the following error message
TypeError: Invalid argument, not a string or column. For column literals, use 'lit', 'array',
'struct' or 'create_map' function.
I think you were on the right track for reaching your expected output.
I managed to find two possible solutions for this problem: one uses a Spark UDF, the other a pandas UDF.
Spark UDF
from pyspark.sql.functions import udf
@udf('integer')
def predict_udf(*cols):
    return int(braodcast_model.value.predict((cols,)))
list_of_columns = df.columns
df_prediction = df.withColumn('prediction', predict_udf(*list_of_columns))
Pandas UDF
import pandas as pd
from pyspark.sql.functions import pandas_udf
@pandas_udf('integer')
def predict_pandas_udf(*cols):
    X = pd.concat(cols, axis=1)
    return pd.Series(braodcast_model.value.predict(X))
list_of_columns = df.columns
df_prediction = df.withColumn('prediction', predict_pandas_udf(*list_of_columns))
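One practical note on the pandas UDF route: it needs pyarrow installed on the cluster, and the number of rows handed to predict_pandas_udf per call is controlled by a standard Spark setting. A small, optional tweak (the value below is just an example and also happens to be the default):
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")  # rows per Arrow batch passed to the pandas UDF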
Reproducible example
Here I used a Databricks Community cluster with Spark 3.1.2, pandas==1.2.4 and pyarrow==4.0.0.
The broadcast model (braodcast_model below) is a simple logistic regression from scikit-learn, trained on the breast cancer dataset.
import pandas as pd
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from pyspark.sql.functions import udf, pandas_udf
# load dataset
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
# split in training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=28)
# create a small pipeline with standardization and model
pipe = make_pipeline(StandardScaler(), LogisticRegression())
# fit the pipeline, then save and reload it
pipe.fit(X_train, y_train)
path = '/databricks/driver/test_model.joblib'
joblib.dump(pipe, path)
loaded_model = joblib.load(path)
# sample of unseen data
df = spark.createDataFrame(X_test.sample(50, random_state=42))
# create broadcasted model
sc = spark.sparkContext
braodcast_model = sc.broadcast(loaded_model)
Then I used the two methods illustrated above, and you will see that the resulting df_prediction is the same in both cases.
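For completeness, this is how the two UDFs defined above are applied to this df (same names as in the snippets above):
list_of_columns = df.columns
df_prediction_spark = df.withColumn('prediction', predict_udf(*list_of_columns))
df_prediction_pandas = df.withColumn('prediction', predict_pandas_udf(*list_of_columns))
df_prediction_spark.show(5)
df_prediction_pandas.show(5)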
I have data that I am trying to build a tree model on. The dataset has 6 million rows.
I'm successfully using a pandas UDF to train multiple LightGBM models on multiple nodes.
I am using Databricks with the following configuration:
Worker type: r5.24xlarge
Driver type: r5.24xlarge
Max workers: 12
Since a grouped pandas UDF supposedly does the computation in parallel for each group, I thought I could finish the training faster by using a cluster with more workers.
Checking the Stage Detail page in the web UI, I see that there is only one stage responsible for the pandas UDF training, and it took 4 hours. Accessing the details of this stage, I see it had only one task, with {'Locality Level': 'PROCESS_LOCAL'}, that took the whole 4 hours. Screenshots below.
My assumption is that the training is not being parallelized. I have attached the code below, but I am not able to pinpoint where the mistake is.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, DoubleType

# reading data into a pandas dataframe
train_gbm_x = s3_read_csv()
train_gbm_x_original = train_gbm_x.copy()
# create replicas of the dataset so that I can run 4 versions of these in parallel
for i in range(1, 5, 1):
    x_t_c = train_gbm_x.copy()
    x_t_c['id'] = i
    print(train_gbm_x_original.shape)
    train_gbm_x_original = train_gbm_x_original.append([x_t_c], ignore_index=True)

train_gbm_sp = spark.createDataFrame(train_gbm_x_original)  # convert to spark dataframe
# main UDF
schema = StructType([
    StructField('r_squared', DoubleType(), True)
])

@F.pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
def train_RF(data):
    import pandas as pd
    import numpy as np
    import lightgbm as lgb
    import xgboost as xgb
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LinearRegression
    from scipy.stats import pearsonr
    # params_xgb['objective'] = 'binary:logistic'
    xgb_model = lgb.LGBMClassifier(bagging_fraction=0.9500000000000001, boosting_type='goss',
                                   feature_fraction=0.53, learning_rate=0.045075773005757123,
                                   max_depth=2, metric='binary_logloss', min_child_samples=70,
                                   min_child_weight=0, min_data_in_bin=54, min_data_in_leaf=219,
                                   min_gain_to_split=0.75, n_estimators=9900, num_leaves=1000,
                                   objective='binary', random_state=4, reg_alpha=7.906971438083483,
                                   reg_lambda=9.905206242985752, verbosity=-1)
    # create data and label groups
    y_train = data['label']
    X_train = data.drop(['employee_id', 'date', 'label'], axis=1)
    xgb_model.fit(X_train, y_train)
    # make predictions on the training data
    y_pred = xgb_model.predict(X_train)
    r = pearsonr(y_pred, y_train)
    print(y_pred)
    # return the squared correlation (R squared) for this group
    return pd.DataFrame({'r_squared': (r[0] ** 2)}, index=[0])
The call to the UDF is as follows:
train_gbm_sp = train_gbm_sp.sort(F.col("emp_id"), F.col("id"))
model_coeffs_r = train_gbm_sp.groupby('id').apply(train_RF)
model_coeffs_r.show(5)
Spark UI screenshot:
This view shows that it is being run on a single executor.
I am not sure where the exact issue is; it would help if someone could point out what I might be doing wrong in the setup here.
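If it helps, the partitioning that the grouped apply sees can be inspected with standard Spark calls (a small sketch, not a fix):
print(train_gbm_sp.rdd.getNumPartitions())             # partitions of the input DataFrame
print(spark.conf.get("spark.sql.shuffle.partitions"))  # partitions used after the groupby shuffle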
Thank you
As part of a sklearn pipeline, I'd like to bin my response variable into a variable with k ordinal categories and then do classification on these categories. I found KBinsDiscretizer, which seems to perform this transformation, but it appears to work only on feature columns, not on the target column.
Reproducible example
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.datasets import load_boston
data = load_boston()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['target'] = data['target']
binarizer_col_y = make_column_transformer(
    [sklearn.preprocessing.KBinsDiscretizer(n_bins=3, encode='ordinal'), ['target']],
    remainder='passthrough'
)
pipeline = Pipeline(steps=[
    ('preprocess', binarizer_col_y),
    ('ols', LinearRegression())
])
pipeline.fit(df[data['feature_names']], df['target'])
This errors with
pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'target'
The above exception was the direct cause of the following exception:
...
[ another key error for 'target']
I also found sklearn.compose.TransformedTargetRegressor for transforming the response (but I want to do classification), and that I can write my own transformers, but they apparently only modify X, not y.
Can anyone tell me how to modify y in a pre-processing step prior to classification as part of a pipeline?
Why inside the pipeline?
The idea is to move as many transformations as possible into the pipeline, reducing boilerplate code, avoiding data leaks, and simplifying model deployment (e.g. services like the Databricks model registry can deploy a sklearn model with the pre-processing expected to happen inside the model).
You get the error because the target column is not available: the transformation is applied only to X, not to y.
A sklearn Pipeline does not support transforming the target y in the way you tried to write it.
However, there is sklearn.compose.TransformedTargetRegressor, which can wrap your model and be given instructions on how to transform the target.
A warning: it is not well supported, and I ran into many issues when trying to work with it on a real project. You may prefer to do the target transformation manually.
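If you go the manual route, a minimal self-contained sketch (my assumption of what you want: bin y first, then fit a classifier on the binned labels) could look like this:
from sklearn.datasets import load_boston
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.linear_model import LogisticRegression

data = load_boston()
X, y = data["data"], data["target"]
# KBinsDiscretizer expects a 2D array, so reshape the 1D target before binning
kbins = KBinsDiscretizer(n_bins=3, encode="ordinal")
y_binned = kbins.fit_transform(y.reshape(-1, 1)).ravel().astype(int)
# fit an ordinary classifier on the binned labels
clf = LogisticRegression(max_iter=1000).fit(X, y_binned)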
Here is a little demo of TransformedTargetRegressor that might work for you.
import sklearn
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.compose import TransformedTargetRegressor
data = load_boston()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
X = df[data["feature_names"]]
y = data["target"]
pipeline = Pipeline(
    steps=[
        (
            "ols",
            TransformedTargetRegressor(
                LinearRegression(),
                transformer=sklearn.preprocessing.KBinsDiscretizer(
                    n_bins=3, encode="ordinal"
                ),
            ),
        )
    ]
)
pipeline.fit(X, y)
pipeline.predict(X)
Or, a more readable snippet that shows how to create the target transformer:
model = LinearRegression()
kbins = sklearn.preprocessing.KBinsDiscretizer(n_bins=3, encode="ordinal")
ttr = TransformedTargetRegressor(model, transformer=kbins)
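And then fit and predict with it directly (or put ttr into a Pipeline step as in the demo above):
ttr.fit(X, y)
predictions = ttr.predict(X)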
I am using the iris data set from sklearn to do some basic predictive modelling. I am splitting the data into training and test sets, and for a list of given proportions, I want to sample different proportions of the training data without replacement. I need to sample using np.random.choice; I cannot use df.sample.
But what I am doing to sample seems incorrect. I would greatly appreciate any insights.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
props = [0.2, 0.5, 0.7, 0.9]
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])
y = df[list(df.loc[:, df.columns.values == 'target'])]
X = df[list(df.loc[:, df.columns.values != 'target'])]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, train_size=0.7)
for i in props:
    sampleX = np.random.choice(X_train, size=i, replace=False)  # ----> code to sample
You can sample the index:
props = [0.2, 0.5, 0.7, 0.9]
for i in props:
    ix = np.random.choice(X_train.index, size=int(i * len(X_train)), replace=False)
    sampleX = X_train.loc[ix]
Or simply use a binomial mask, where each row is kept with probability i (so the sample size is only approximately a fraction i of the rows):
for i in props:
    sampleX = X_train[np.random.binomial(1, i, len(X_train)).astype(bool)]
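If you also need the matching targets for each sample, the sampled index from the first approach can be reused (a small sketch, assuming the y_train from your split):
for i in props:
    ix = np.random.choice(X_train.index, size=int(i * len(X_train)), replace=False)
    sampleX = X_train.loc[ix]
    sampleY = y_train.loc[ix]  # targets aligned with the sampled rows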
I want to create some random data and try to improve my model with PolynomialFeatures; however, I'm running into trouble doing so.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
import random
import pandas as pd
import numpy as np
import statsmodels.api as sm
#create some artificial data
x=np.linspace(-1,1,1000)
x=pd.array(random.choices(x,k=1000))
y=x**2+np.random.randn(1000)
#divide sample
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
#define data frame for further use with PolynomialFeatures
df=pd.DataFrame([x_train,x_test])
df=df.transpose()
data = df
# perform a polynomial features transform of the dataset
trans = PolynomialFeatures(degree=2)
data = trans.fit_transform(data)
model = sm.OLS(y_train,data).fit()
And then I get the error: ValueError: unrecognized data structures: <class 'pandas.core.arrays.numpy_.PandasArray'> / <class 'numpy.ndarray'>
Do you have any ideas about what should be done to make my regression work properly?
Use the to_numpy() function to convert the pandas array to a NumPy array:
model = sm.OLS(y_train.to_numpy(),data).fit()
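With that change the rest of your snippet runs as before; for example, to inspect the fit (nothing else needs to change):
print(model.summary())   # coefficient table and fit statistics
print(model.params)      # just the fitted coefficients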
I use the dataset from UCI repo: http://archive.ics.uci.edu/ml/datasets/Energy+efficiency
Then I do the following:
from pandas import *
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split  # was sklearn.cross_validation in older scikit-learn versions
dataset = read_excel('/Users/Half_Pint_boy/Desktop/ENB2012_data.xlsx')
dataset = dataset.drop(['X1','X4'], axis=1)
trg = dataset[['Y1','Y2']]
trn = dataset.drop(['Y1','Y2'], axis=1)
Then I set up the models and split the data:
models = [LinearRegression(),
          RandomForestRegressor(n_estimators=100, max_features='sqrt'),
          KNeighborsRegressor(n_neighbors=6),
          SVR(kernel='linear'),
          LogisticRegression()
          ]
Xtrn, Xtest, Ytrn, Ytest = train_test_split(trn, trg, test_size=0.4)
I'm creating a regression model for predicting values but have a problem. Here is the code:
TestModels = DataFrame()
tmp = {}
for model in models:
    m = str(model)
    tmp['Model'] = m[:m.index('(')]
    for i in range(Ytrn.shape[1]):
        model.fit(Xtrn, Ytrn[:, i])
        tmp[str(i+1)] = r2_score(Ytest[:, 0], model.predict(Xtest))
    TestModels = TestModels.append([tmp])
TestModels.set_index('Model', inplace=True)
It shows TypeError: unhashable type: 'slice' for the line model.fit(Xtrn, Ytrn[:,i]).
How can it be avoided and made working?
Thanks!
I think I had a similar problem before! Try converting your data to NumPy arrays before feeding them to the sklearn estimators; it will most probably solve the hashing problem. For instance, you can do:
Xtrn_array = Xtrn.as_matrix()
Ytrn_array = Ytrn.as_matrix()
and use Xtrn_array and Ytrn_array when you fit your data to estimators.
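Note that as_matrix() was removed in pandas 1.0; on a recent pandas version the equivalent conversion is:
Xtrn_array = Xtrn.to_numpy()
Ytrn_array = Ytrn.to_numpy()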