I have data that I am trying to build a tree model on. The dataset is 6 million rows.
I'm successfully using a pandas UDF to train multiple LightGBM models on multiple nodes.
Using Databricks with the following configuration:
worker type: r5.24xlarge
Driver type: r5.24xlarge
Max workers: 12
Since a pandas UDF supposedly does the computation in parallel for each group, I thought I could finish the training faster by using a cluster with more workers.
Checking the Stage Detail page in the web UI, I see that there is only one stage responsible for the pandas UDF training, and it took 4 hours. Accessing the details of this stage, I see it had only one task, with {'Locality Level': 'PROCESS_LOCAL'}, that took the whole 4 hours. Screenshots below.
My assumption is that the training is not being parallelized. I have attached the code below, but I am not able to pinpoint where the mistake is.
# reading data into a pandas dataframe
train_gbm_x = s3_read_csv()
train_gbm_x_original = train_gbm_x.copy()

# create replicas of the dataset so that I can run 4 versions of these in parallel
for i in range(1, 5, 1):
    x_t_c = train_gbm_x.copy()
    x_t_c['id'] = i
    print(train_gbm_x_original.shape)
    train_gbm_x_original = train_gbm_x_original.append([x_t_c], ignore_index=True)

train_gbm_sp = spark.createDataFrame(train_gbm_x_original)  # convert to spark dataframe
# main UDF
schema = StructType([
    StructField('r_squared', DoubleType(), True)])

@f.pandas_udf(schema, f.PandasUDFType.GROUPED_MAP)
def train_RF(data):
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LinearRegression
    from scipy.stats.stats import pearsonr
    import numpy as np
    import pandas as pd
    import lightgbm as lgb
    # params_xgb['objective'] = 'binary:logistic'
    import xgboost as xgb

    xgb_model = lgb.LGBMClassifier(bagging_fraction=0.9500000000000001, boosting_type='goss',
                                   feature_fraction=0.53, learning_rate=0.045075773005757123,
                                   max_depth=2, metric='binary_logloss', min_child_samples=70,
                                   min_child_weight=0, min_data_in_bin=54, min_data_in_leaf=219,
                                   min_gain_to_split=0.75, n_estimators=9900, num_leaves=1000,
                                   objective='binary', random_state=4, reg_alpha=7.906971438083483,
                                   reg_lambda=9.905206242985752, verbosity=-1)
    # create data and label groups
    y_train = data['label']
    X_train = data.drop(['employee_id', 'date', 'label'], axis=1)
    xgb_model.fit(X_train, y_train)
    # make predictions
    y_pred = xgb_model.predict(X_train)
    r = pearsonr(y_pred, y_train)
    print(y_pred)
    # return the R-squared value
    return pd.DataFrame({'r_squared': (r[0]**2)}, index=[0])
The call to the UDF is as follows:
train_gbm_sp = train_gbm_sp.sort(F.col("emp_id"), F.col("id"))
model_coeffs_r = train_gbm_sp.groupby('id').apply(train_RF)
model_coeffs_r.show(5)
Spark UI screenshot: this view shows that the job is being run on a single executor ID.
I am not sure where the exact issue is; it would help if someone could point out what I might be doing wrong in the setup here.
Thank you
I'm having a tough time trying to apply a postprocessing step with the sklearn2pmml package. What I'm trying to do is apply a linear transformation after calling the predict_proba method within the PMMLPipeline class in the sklearn2pmml package. Any idea how to do this?
Even a solution outside this package, as long as it can be automated, would help me (like automatically modifying the XML of the PMML file).
Here's an example so you can get a deeper understanding of what I'm trying to do:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
# FORGET ABOUT TRAIN TEST SPLIT; we only care if the PMML pipeline works for now
BIRTHDAY_SEED = 1995
nrows, cols = 1000, 5
X, y = make_classification(n_samples=nrows, n_features=cols, n_informative=2, n_redundant=3, n_classes=2, shuffle=True, random_state=BIRTHDAY_SEED)
X, y = pd.DataFrame(X), pd.Series(y)
model = DecisionTreeClassifier()
model.fit(X,y)
def postprocessing_linear_transformation(probabilities, a, b):
    """This function multiplies probabilities by a and adds b."""
    return probabilities * a + b
# the pipeline should look like this
# first predict probabilities
probabilities = model.predict_proba(X)[:,0]
# then scale them (apply linear transformation)
probabilities_scaled = postprocessing_linear_transformation(probabilities, a=1000, b=100)
# of course this does not work:
pmml_pipeline = PMMLPipeline([
    # here we would place the category preprocessor; I know it does not work, but it shows the idea
    ('decisiontree', model),
    ('postprocessing_apply_linear_transformation', postprocessing_linear_transformation)
])
sklearn2pmml(pmml_pipeline, "example_pipeline_pmml.pmml", with_repr = True)
On second thought, you don't need a full-blown LinearRegression step to perform a deterministic a * x + b probability scaling operation. A simple ExpressionTransformer step is more than adequate:
from sklearn2pmml.preprocessing import ExpressionTransformer

pipeline = PMMLPipeline([
    ("decisiontree", model)
], predict_proba_transformer = ExpressionTransformer("X[0] * 1000 + 100"))
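The resulting pipeline can then be exported exactly as in your own snippet:

sklearn2pmml(pipeline, "example_pipeline_pmml.pmml", with_repr = True)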
I'm having a tough time trying to apply a postprocessing step with the sklearn2pmml package.
Don't blame the SkLearn2PMML package for your troubles. It is the Scikit-Learn framework that prohibits you from inserting two estimator objects into a single Pipeline object.
In the current case, you should rephrase your problem. What you're really trying to do is build a "chain of two models" (the first model feeding into the second model). The SkLearn2PMML package provides a sklearn2pmml.ensemble.EstimatorChain estimator type, which allows you to accomplish exactly that.
I have loaded a scikit-learn model successfully, but I am unable to make predictions on a PySpark dataframe. When running the code given below, I get the error shown underneath it. Please help me fix the code so I can make predictions with a scikit-learn model on PySpark. I have also searched relevant questions but could not find a solution.
sc = spark.sparkContext
broadcast_model = sc.broadcast(loaded_model)
broadcast_model.value

# prediction method
def predictor(cols):
    # call the predict method of the broadcast model
    return broadcast_model.value.predict(*cols)

udf_predictor = udf(predictor, FloatType())
# apply the udf to the dataframe
df_prediction = df.withColumn("prediction", udf_predictor(df.select(list_of_columns)))
I get the following error message
TypeError: Invalid argument, not a string or column. For column literals, use 'lit', 'array',
'struct' or 'create_map' function.
I think you were on the right track towards your expected output.
I managed to find two possible solutions for this problem: one uses a Spark UDF, the other a Pandas UDF.
Spark UDF
from pyspark.sql.functions import udf

@udf('integer')
def predict_udf(*cols):
    return int(broadcast_model.value.predict((cols,)))

list_of_columns = df.columns
df_prediction = df.withColumn('prediction', predict_udf(*list_of_columns))
Pandas UDF
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('integer')
def predict_pandas_udf(*cols):
    X = pd.concat(cols, axis=1)
    return pd.Series(broadcast_model.value.predict(X))

list_of_columns = df.columns
df_prediction = df.withColumn('prediction', predict_pandas_udf(*list_of_columns))
Reproducible example
Here I used a Databricks Community cluster with Spark 3.1.2, pandas==1.2.4 and pyarrow==4.0.0.
broadcast_model is a simple logistic regression from scikit-learn, trained on the breast cancer dataset.
import pandas as pd
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from pyspark.sql.functions import udf, pandas_udf
# load dataset
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
# split in training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=28)
# create a small pipeline with standardization and model, then fit it
pipe = make_pipeline(StandardScaler(), LogisticRegression())
model = pipe.fit(X_train, y_train)
# save and reload the model
path = '/databricks/driver/test_model.joblib'
joblib.dump(model, path)
loaded_model = joblib.load(path)
# sample of unseen data
df = spark.createDataFrame(X_test.sample(50, random_state=42))
# create broadcasted model
sc = spark.sparkContext
broadcast_model = sc.broadcast(loaded_model)
Then I used the two methods illustrated above, and you will see that the output df_prediction is the same in both cases.
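For instance, a quick way to verify this (a rough sketch, reusing df and the two UDFs defined above) is to collect both outputs and compare them:

list_of_columns = df.columns

# both approaches should give identical predictions
preds_spark_udf = df.withColumn('prediction', predict_udf(*list_of_columns)).select('prediction').collect()
preds_pandas_udf = df.withColumn('prediction', predict_pandas_udf(*list_of_columns)).select('prediction').collect()
assert preds_spark_udf == preds_pandas_udf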
What is the recommended way to parallelize the predict() method of a scikit-learn pipeline?
Here is a minimal working example that illustrates the problem by attempting parallel predict() on the iris data using an SVM pipeline and 5 parallel jobs:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.externals.joblib import Parallel, delayed
from sklearn import datasets
# Load Iris data
iris = datasets.load_iris()
# Create a pipeline with 2 steps: scaler and SVM; train
pipe = make_pipeline(StandardScaler(), SVC()).fit(X=iris.data, y=iris.target)
# Split data array in 5 chunks
n_chunks = 5
n_samples = iris.data.shape[0]
slices = [(int(n_samples*i/n_chunks), int(n_samples*(i+1)/n_chunks)) for i in range(n_chunks)]
data_chunks = [iris.data[i[0]:i[1]] for i in slices]
# Setup 5 parallel jobs
jobs = (delayed(pipe.predict)(array) for array in data_chunks)
parallel = Parallel(n_jobs=n_chunks)
# Run jobs: fails
results = parallel(jobs)
This code fails with:
PicklingError: Can't pickle <function Pipeline.predict at 0x000000001746B730>: it's not the same object as sklearn.pipeline.Pipeline.predict
However, applying parallelization to the SVM classifier directly, instead of the pipeline, works:
# Load Iris data
iris = datasets.load_iris()
# Create SVM classifier, train
svc = SVC().fit(X=iris.data, y=iris.target)
# Split data array in 5 chunks
n_chunks = 5
n_samples = iris.data.shape[0]
slices = [(int(n_samples*i/n_chunks), int(n_samples*(i+1)/n_chunks)) for i in range(n_chunks)]
data_chunks = [iris.data[i[0]:i[1]] for i in slices]
# Setup 5 parallel jobs
jobs = (delayed(svc.predict)(array) for array in data_chunks)
parallel = Parallel(n_jobs=n_chunks)
# Run jobs: works
results = parallel(jobs)
I am able to work around the problem by essentially taking the pipeline apart: first applying scaling to the whole array, then splitting it into chunks and parallelizing svc.predict() like above (a sketch of this follows below). However, it is inconvenient and largely negates the advantages offered by pipelines: e.g. I have to keep track of intermediate results, and the code would have to change if additional transformer steps were added to the pipeline.
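For reference, a rough sketch of that workaround (reusing pipe, slices and n_chunks from above; the step names are make_pipeline's default lower-cased class names):

# pull the individual steps out of the fitted pipeline
scaler = pipe.named_steps['standardscaler']
classifier = pipe.named_steps['svc']

# apply scaling once to the whole array, then split into the same 5 chunks
scaled = scaler.transform(iris.data)
scaled_chunks = [scaled[i[0]:i[1]] for i in slices]

# parallelize only the classifier's predict()
jobs = (delayed(classifier.predict)(chunk) for chunk in scaled_chunks)
results = Parallel(n_jobs=n_chunks)(jobs)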
Is there a way to parallelize using pipelines directly?
Many thanks,
Aleksey
So I have this dataset (1000 columns by 1000 rows) which has two classes, zero or one. I applied the code below and it gave me a prediction rate of 58%. I want to tune it, but I am really confused about the different classes and how to select their parameters with this type of data, so I would appreciate some guidance here.
#here I am importing the libraries that I need for this situation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm
#reading the data
data = pd.read_csv('train.csv')
x = data.loc[:, 'D_0':'D_1023']
y = data['Class']
test = pd.read_csv('test.csv')
model = svm.SVC(kernel='linear', C=1)
model.fit(x,y)
model.score(x,y)
predictions = model.predict(test)
pd.DataFrame(predictions,
             columns=['PredictedScore']).to_csv('prediction.csv')
(Sample of the dataset attached as an image, omitted here.)
The parameters really depend on the data, so there is no general guideline. However, at least trying the "rbf" kernel is worth the effort, I think. Also, I would start by changing the C parameter, as that usually has the largest effect. But again, it depends a lot on the data.
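For example, a minimal sketch of such a search (reusing x and y from the question's code; the candidate values here are just illustrative):

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# try both kernels and a few values of C, with 5-fold cross-validation
param_grid = {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10, 100]}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(x, y)
print(search.best_params_, search.best_score_)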
I am new to PySpark and am trying to create an ML model in PySpark.
My goal is to create a TF-IDF vectorizer and pass those features to my SVM model.
I tried this
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("local[2]").setAppName("Stream")
sc = SparkContext(conf=conf)
parallelized = sc.parallelize(Dataset.CleanText)
# Dataset is a pandas dataframe with CleanText as one of the columns
from pyspark.mllib.feature import HashingTF, IDF
hashingTF = HashingTF()
tf = hashingTF.transform(parallelized)
# While applying HashingTF only needs a single pass to the data, applying IDF needs two passes:
# First to compute the IDF vector and second to scale the term frequencies by IDF.
#tf.cache()
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
print ("vecs: ",tfidf.glom().collect())
#This is printing all the TFidf vectors
import numpy as np
labels = np.array(Dataset['LabelNo'])
Now how should I pass these Tfidf and label values to my model?
I followed this
http://spark.apache.org/docs/2.0.0/api/python/pyspark.mllib.html
and tried to create labeled point as
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
spark = SparkSession.builder.appName("SparkSessionZipsExample").getOrCreate()
dd = [(labels[i], Vectors.dense(tfidf[i])) for i in range(len(labels))]
df = spark.createDataFrame(sc.parallelize(dd),schema=["label", "features"])
print ("df: ",df.glom().collect())
But this is giving me error as:
---> 15 dd = [(labels[i], Vectors.dense(tfidf[i])) for i in range(len(labels))]
     16 df = spark.createDataFrame(sc.parallelize(dd),schema=["label", "features"])
     17

TypeError: 'RDD' object does not support indexing
The error explains itself clearly: an RDD does not support indexing. You are trying to get the i-th row of tfidf by using i as its index (tfidf[i] on line 15). RDDs don't work like lists; they are distributed datasets whose rows are spread across the workers.
You would have to collect tfidf to a single node for your code to work, but that would defeat the purpose of a distributed framework like Spark.
I would advise you to work with DataFrames instead of RDDs, as they are much faster, and the DataFrame-based pyspark.ml library supports most of the operations (HashingTF, IDF) provided by mllib.
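A rough sketch of that DataFrame-based approach (assuming Dataset is the pandas dataframe from your code, with CleanText and LabelNo columns):

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.appName("TfidfExample").getOrCreate()
df = spark.createDataFrame(Dataset[['CleanText', 'LabelNo']])

# split the text into words, then hash the terms and rescale by IDF
tokenizer = Tokenizer(inputCol="CleanText", outputCol="words")
words = tokenizer.transform(df)

hashing_tf = HashingTF(inputCol="words", outputCol="raw_features")
tf = hashing_tf.transform(words)

idf_model = IDF(inputCol="raw_features", outputCol="features").fit(tf)
tfidf = idf_model.transform(tf).withColumnRenamed("LabelNo", "label")

# the label/features columns can now be fed to a pyspark.ml classifier
tfidf.select("label", "features").show(5)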