Use statsmodels model fit on another dataset - python

Suppose I fit a model on the dataset dataset1 using SARIMAX from statsmodels.tsa.statespace.sarimax - is it possible to then use this fit to make predictions on another dataset dataset2?
Namely, consider the following:
from statsmodels.tsa.statespace.sarimax import SARIMAX
import pandas as pd
import numpy as np
# generate example data
n=90
idx = pd.PeriodIndex(pd.date_range(start = '2015-01-02',end='2015-04-01',freq='D'))
dat = np.sin(np.linspace(0,12*np.pi,n)) + np.random.randn(n)/10
dataset1 = pd.Series(dat, index = idx)
# fit model
fit = SARIMAX(dataset1, order = (1,0,1)).fit()
# make 30 day forecast on dataset1
fit.forecast(30)
How would I go about using fit to make a prediction on dataset2?
dat = np.sin(np.linspace(0,12*np.pi,n)) + np.random.randn(n)/10
dataset2 = pd.Series(dat, index = idx)
Ideally, it'd be something super simple akin to fit(dataset2).forecast(30) but that clearly isn't the case.
I know I can extract the estimated parameters with fit.params, but short of going through the tedious process of rebuilding the model by hand, is there a built-in way, or a hack, to reuse the existing fit instance?

You can use the apply method of the results object:
from statsmodels.tsa.statespace.sarimax import SARIMAX
import pandas as pd
import numpy as np
# generate example data
n=90
idx = pd.PeriodIndex(pd.date_range(start = '2015-01-02',end='2015-04-01',freq='D'))
dat = np.sin(np.linspace(0,12*np.pi,n)) + np.random.randn(n)/10
dataset1 = pd.Series(dat, index = idx)
# fit model
fit = SARIMAX(dataset1, order = (1,0,1)).fit()
# make 30 day forecast on dataset1
fit.forecast(30)
# ------------------------------------
# get the new dataset
dat = np.sin(np.linspace(0,12*np.pi,n)) + np.random.randn(n)/10
dataset2 = pd.Series(dat, index = idx)
# apply the parameters from `fit` to the new dataset
fit2 = fit.apply(dataset2)
# make 30 day forecast on dataset2
fit2.forecast(30)
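Note that apply reuses the parameters estimated on dataset1 as they are, without re-estimating anything. If you do want the parameters re-estimated on the new data, apply also accepts a refit flag; a small sketch continuing from the code above:
# reuse dataset1's parameters unchanged (the default)
fit2 = fit.apply(dataset2)
# or re-estimate the parameters on dataset2 instead
fit3 = fit.apply(dataset2, refit=True)
fit3.forecast(30)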

Related

T-distributed Stochastic Neighbor Embedding (t-SNE)

I am trying to run t-distributed Stochastic Neighbor Embedding (t-SNE) in Jupyter but keep running into an issue with
ValueError: could not convert string to float: '<Null>'
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
# Reading the data using pandas
df = pd.read_csv(r"E:\Field data\Output\Pixel values7.csv")
# print the first few rows of df
print(df.head(9))
# save the labels into a variable l.
l = df['label']
# Drop the label feature and store the pixel data in d.
d = df.drop("label", axis = 1)
I get the error after this line:
# Data preprocessing: standardizing the data
standardized_data = StandardScaler().fit_transform(df)
print(standardized_data.shape)
# TSNE
# Picking the top 1000 points as TSNE
# takes a lot of time for 15K points
data_1000 = standardized_data[0:1000, :]
labels_1000 = l[0:1000]
model = TSNE(n_components = 2, random_state = 0)
# configuring the parameters
# the number of components = 2
# default perplexity = 30
# default learning rate = 200
# default Maximum number of iterations
# for the optimization = 1000
tsne_data = model.fit_transform(data_1000)
# create a new data frame to help plot the result
tsne_data = np.vstack((tsne_data.T, labels_1000)).T
tsne_df = pd.DataFrame(data=tsne_data,
                       columns=("Dim_1", "Dim_2", "label"))
# Plotting the result of t-SNE
sn.FacetGrid(tsne_df, hue="label", size=6).map(
    plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.show()
I got this code from somewhere; I am not an expert in Python, so I would really appreciate your help. I am trying to run this program on my own data but keep getting the error
ValueError: could not convert string to float: '<Null>'
If there is any other code for t-distributed Stochastic Neighbor Embedding (t-SNE), please let me know.
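The error itself says that some cells contain the literal string '<Null>', which StandardScaler cannot convert to a number. A minimal sketch of one way to clean the data before standardizing (assuming the same file and label column as above):
import numpy as np
import pandas as pd
df = pd.read_csv(r"E:\Field data\Output\Pixel values7.csv")
# treat the literal string '<Null>' as missing and drop those rows
df = df.replace('<Null>', np.nan).dropna()
l = df['label']
# standardize only the numeric pixel columns, not the label
d = df.drop('label', axis=1).astype(float)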

How to input dataset while using Salesforce-merlion package for timeseries forecasting

I have installed the Salesforce-Merlion package in my conda environment. Now I want to run the forecasting algorithm on my own dataset; I need only one univariate series to forecast, but I cannot figure out how to do that, as there are some variables I cannot work out how to initialize. The example provided on GitHub uses an already split dataset. Can someone help me out here?
The GitHub example for forecasting looks like this:
from merlion.utils import TimeSeries
from ts_datasets.forecast import M4
# Data loader returns pandas DataFrames, which we convert to Merlion TimeSeries
time_series, metadata = M4(subset="Hourly")[0]
train_data = TimeSeries.from_pd(time_series[metadata.trainval])
test_data = TimeSeries.from_pd(time_series[~metadata.trainval])
The complete code with internal dataset is available in the following link:
https://github.com/salesforce/Merlion/tree/main/examples/forecast
(Here they are using their internal dataset M4)
Now I have to use my own dataset, so my code looks like this:
from merlion.utils import TimeSeries
df = pd.read_csv(r'C:\Users\Doyel_De_Sarkar\Desktop\forecasting\15786_GIK.csv')
df.dropna(inplace=True)
df['ts'] = pd.to_datetime(df['ts'])
df.sort_values('ts', inplace=True)
trainval = []
for i in range(len(df)):
    if i <= round(len(df) * 0.75):
        trainval.append(True)
    else:
        trainval.append(False)
df['trainval'] = trainval
df = df.drop(columns=['wday', 'hour'])
from merlion.utils import UnivariateTimeSeries
kpi = UnivariateTimeSeries(
    time_stamps=df.ts,      # timestamps in units of seconds
    values=df.saps_total,   # time series values
    name="kpi"              # optional: a name for this univariate
)
kpi_label = UnivariateTimeSeries(
    time_stamps=df.ts,      # timestamps in units of seconds
    values=df.trainval      # time series values
)
time_series, metadata = kpi, kpi_label
train_data = TimeSeries.from_pd(time_series[metadata.trainval])
test_data = TimeSeries.from_pd(time_series[~metadata.trainval])
I am getting the following error
'UnivariateTimeSeries' object has no attribute 'trainval'
at this line:
train_data = TimeSeries.from_pd(time_series[metadata.trainval])
The reason you're getting this error is that trainval is not an attribute of the UnivariateTimeSeries class. In the GitHub example you shared, metadata is a pandas DataFrame, but you're constructing a UnivariateTimeSeries object out of kpi_label.
I'm not sure exactly what your dataset looks like, but try using:
kpi_labels = df.trainval
instead.
Thank you SalmonKiller for taking the time to look into the issue. The dataset used in the GitHub example has a very weird data structure, so I had to create the trainval column myself and set the metadata to the column df[['trainval']]. The univariate I had created was of no use. The issue was with the indexing: after I set the timestamp column as the index, the problem got solved.
Here is the code which is running fine now.
import os
import numpy as np
import pandas as pd
from merlion.models.forecast.smoother import MSESConfig, MSES
from merlion.transform.resample import TemporalResample
from merlion.utils import TimeSeries
df = pd.read_csv(r'<file.csv>')
df['ts'] = pd.to_datetime(df['ts'])
df.set_index('ts', inplace=True)
df.sort_values('ts', inplace=True)
hours = pd.date_range(start=df.index[0], end=df.index[-1], freq='H')
mean = df.saps_total.mean()
df = df.reindex(hours, fill_value=mean)
trainval = []
for i in range(len(df)):
    if i <= round(len(df) * 0.75):
        trainval.append(True)
    else:
        trainval.append(False)
df['trainval'] = trainval
df = df.drop(columns=['wday', 'hour'])
time_series = df[['saps_total']]
metadata = df[['trainval']]
train_data = TimeSeries.from_pd(time_series[metadata.trainval])
test_data = TimeSeries.from_pd(time_series[~metadata.trainval])
from merlion.models.forecast.arima import Arima, ArimaConfig
config1 = ArimaConfig(max_forecast_steps=len(time_series[~metadata.trainval].index), order=(0, 1, 0),
                      transform=TemporalResample(granularity="1h"))
model1 = Arima(config1)
model1.train(train_data=train_data)
test_pred, test_err = model1.forecast(time_stamps=test_data.time_stamps)
print(test_pred)
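To sanity-check the forecast against the held-out data, the Merlion TimeSeries objects can be converted back to pandas with to_pd(); a minimal sketch using the variables above:
# compare the forecast with the actual values using plain pandas
pred_df = test_pred.to_pd()
actual_df = test_data.to_pd()
mae = (pred_df.iloc[:, 0] - actual_df.iloc[:, 0]).abs().mean()
print("MAE: %.3f" % mae)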

H2O GLM in Python Repeats Lambda Values in Regularization

When fitting a GLM in H2O (H2O_cluster_version: 3.32.0.5) with lambda_search=True, nlambdas=20, and lambda_min_ratio=0.0001, my team and I receive 24 lambdas in our regularization path. The last 4 lambdas in the path are repeats of the first 4, the largest values.
Here is a reproducible example:
import pandas as pd
import numpy as np
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.grid.grid_search import H2OGridSearch
# start or connect to an H2O cluster
h2o.init()
#sample data
resp = np.random.choice(range(0,100),size=1000)
pred1 = np.random.choice(range(40,50),size=1000)
pred2 = np.random.choice(range(20,30),size=1000)
pred3 = np.random.choice([1,2,3,4,5],size=1000)
weight = np.random.choice([1,1,1,.9,.37],size=1000)
folds = np.random.choice([1,2,3,4,5],size=1000)
data = pd.DataFrame({'resp': resp, 'pred1':pred1,'pred2':pred2,'pred3':pred3,'weight':weight,'fold_column':folds})
predictors = ['pred1','pred2','pred3']
# convert pandas df to h2oframe
H2Odata = h2o.H2OFrame(data, column_names=data.columns.tolist())
# set up model
model = H2OGeneralizedLinearEstimator(
    family="tweedie",
    tweedie_link_power=0,
    tweedie_variance_power=1.7,
    lambda_search=True,
    early_stopping=False,
    lambda_min_ratio=0.0001,
    nlambdas=20,
    alpha=.5,
    standardize=True,
    weights_column='weight',
    solver='IRLSM',
    #beta_constraints = constraints,
    keep_cross_validation_models=True,
    keep_cross_validation_predictions=True,
    keep_cross_validation_fold_assignment=True
)
# Train the model with training and validation data
model.train(
    x=predictors,
    y='resp',
    training_frame=H2Odata,
    fold_column='fold_column'
)
# get full regularization paths
# list of cross-validation model objects
regpath_h2o_cv = []
for i in range(len(model.cross_validation_models())):
    regpath_h2o_cv.append(H2OGeneralizedLinearEstimator.getGLMRegularizationPath(model.cross_validation_models()[i]))
H2OGeneralizedLinearEstimator.getGLMRegularizationPath(model.cross_validation_models()[0])['lambdas']
When I run this, there is an extra lambda, a repeat of the first lambda.
Can anyone provide guidance on why H2O is providing more lambdas than requested, and especially repeated lambdas?
Does this mean it is fitting unnecessary models?
Our real use case is on very large data, and any time we can save by avoiding unnecessary modeling will be helpful.
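As a quick sanity check, you can count how many distinct lambdas were actually fit; a small sketch using the regularization-path dicts collected above:
# compare returned vs. distinct lambdas in the first CV model's path
lambdas = regpath_h2o_cv[0]['lambdas']
unique_lambdas = sorted(set(lambdas), reverse=True)
print("returned: %d, distinct: %d" % (len(lambdas), len(unique_lambdas)))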

Supervised Time Series efficiency improvement

The data I have is recorded hourly over the past 4 months. I am building a time series model and have tried several methods so far (ARIMA, LSTMs, Prophet), but they can be quite slow for my task since I have to run the model on thousands of time series in different locations. So I thought it might be interesting to transform the problem into a supervised one and use regression.
I extracted 4 features from the univariate time series and its time index, namely: dayofweek, hour, daily average, and hourly average. So at the moment I am using these 4 predictors but could possibly extract more (like beginning of the day, noon, etc.; if you have any other suggestions here, they are very welcome :) )
I've used XGBoost for the regression and here are parts of the code:
# XGB
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
# Functions needed
def convert_dates(x):
    x['date'] = pd.to_datetime(x['date'])
    #x['month'] = x['date'].dt.month
    #x['year'] = x['date'].dt.year
    x['dayofweek'] = x['date'].dt.dayofweek
    x['hour'] = x['date'].dt.hour
    #x['week_no'] = pd.to_numeric(x['date'].index.strftime("%V"))
    x.pop('date')
    return x

def add_avg(x):
    x['daily_avg'] = x.groupby(['dayofweek'])['y'].transform('mean')
    x['hourly_avg'] = x.groupby(['dayofweek', 'hour'])['y'].transform('mean')
    #x['monthly_avg'] = x.groupby(['month'])['y'].transform('mean')
    #x['weekly_avg'] = x.groupby(['week_no'])['y'].transform('mean')
    return x

xgb_mape_r2_dict = {}
I then run a for loop in which I select a location and build the model for it; inside it I split the data into a train part and a test part. I knew there might be problems due to the Easter holidays in my country last week, because those are rare events, which is why I split the training and test data the way I did: I consider the data from the beginning of the year up to two weeks ago as training data, and the very next week after that as test data.
for j in range(10, 20):
    data = df_all.loc[df_all['Cell_Id'] == top_cells[j]]
    data.drop(['Cell_Id', 'WDay'], axis=1, inplace=True)
    data['date'] = data.index
    period = 168
    data_train = data.iloc[:-2*period, :]
    data_test = data.iloc[-2*period:-period, :]
    data_train = convert_dates(data_train)
    data_test = convert_dates(data_test)
    data_train.columns = ['y', 'dayofweek', 'hour']
    data_test.columns = ['y', 'dayofweek', 'hour']
    data_train = add_avg(data_train)
    daily_avg = data_train.groupby(['dayofweek'])['y'].mean().reset_index()
    hourly_avg = data_train.groupby(['dayofweek', 'hour'])['y'].mean().reset_index()
Now, for the test data I add the past averages, namely the 7 daily averages from the past and the 168 hourly averages from the past as well. This is actually the part that takes the longest amount of time to run and I would like to improve its efficiency.
    # (still inside the loop over locations)
    value_dict = {}
    for k in range(168):
        value_dict[tuple(hourly_avg.iloc[k])[:2]] = tuple(hourly_avg.iloc[k])[2]
    data_test['daily_avg'] = 0
    data_test['hourly_avg'] = 0
    for i in range(len(data_test)):
        data_test['daily_avg'][i] = daily_avg['y'][data_test['dayofweek'][i]]
        data_test['hourly_avg'][i] = value_dict[(data_test['dayofweek'][i], data_test['hour'][i])]
My current run time is about 30 seconds for every iteration of the for loop, which is way too slow, because of the poor way I add the averages to the test data. I would really appreciate it if anyone could point out how I could implement this bit faster.
I will also add the rest of my code and make some other observations as well:
    # (continuing inside the loop)
    x_train = data_train.drop('y', axis=1)
    x_test = data_test.drop('y', axis=1)
    y_train = data_train['y']
    y_test = data_test['y']

    def XGBmodel(x_train, x_test, y_train, y_test):
        matrix_train = xgb.DMatrix(x_train, label=y_train)
        matrix_test = xgb.DMatrix(x_test, label=y_test)
        model = xgb.train(params={'objective': 'reg:linear', 'eval_metric': 'mae'},
                          dtrain=matrix_train, num_boost_round=500,
                          early_stopping_rounds=20, evals=[(matrix_test, 'test')])
        return model

    model = XGBmodel(x_train, x_test, y_train, y_test)
    #submission = pd.DataFrame(x_pred.pop('id'))
    y_pred = model.predict(xgb.DMatrix(x_test), ntree_limit=model.best_ntree_limit)
    #submission['sales'] = y_pred
    y_pred = pd.DataFrame(y_pred)
    y_test = pd.DataFrame(y_test)
    y_test.reset_index(inplace=True, drop=True)
    compare_df = pd.concat([y_test, y_pred], axis=1)
    compare_df.columns = ['Real', 'Predicted']
    compare_df.plot()
    mape = (np.abs((y_test['y'] - y_pred[0]) / y_test['y']).mean()) * 100
    r2 = r2_score(y_test['y'], y_pred[0])
    xgb_mape_r2_dict[top_cells[j]] = [mape, r2]
I've used both R-squared and MAPE as accuracy measures, although I don't think MAPE is indicated anymore since I've transformed the time series problem into a regression problem. Any thoughts on this subject?
Thank you very much for your time and consideration. Any help is very much appreciated.
Update: I have managed to fix the issue using pandas' merge. I first created two dataframes containing the daily averages and hourly averages from the training data and then merged these dataframes with the test data:
data_test = merge(data_test, daily_avg, ['dayofweek'], 'daily_avg')
data_test = merge(data_test, hourly_avg, ['dayofweek', 'hour'], 'hourly_avg')
data_test.columns = ['y', 'dayofweek', 'hour', 'daily_avg', 'hourly_avg']
where we used the merge function defined as:
def merge(x, y, col, col_name):
    x = pd.merge(x, y, how='left', on=None, left_on=col, right_on=col,
                 left_index=False, right_index=False, sort=True,
                 copy=True, indicator=False, validate=None)
    x = x.rename(columns={'sales': col_name})
    return x
I can now run the model for 2000 locations per hour on a laptop with decent results but I will try to improve it while keeping it fast. Thank you very much once again.
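For reference, a plain-dictionary lookup avoids the merge altogether and is also fast; a minimal sketch, assuming daily_avg and hourly_avg are the grouped frames built in the loop above:
# build lookup tables once per location
daily_lookup = daily_avg.set_index('dayofweek')['y'].to_dict()
hourly_lookup = hourly_avg.set_index(['dayofweek', 'hour'])['y'].to_dict()
# map the day-of-week directly; zip the (dayofweek, hour) pairs for the hourly table
data_test['daily_avg'] = data_test['dayofweek'].map(daily_lookup)
data_test['hourly_avg'] = [hourly_lookup[k] for k in zip(data_test['dayofweek'], data_test['hour'])]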

Apache Spark StringIndexer applies non-existent labels (Unseen label Exception)

I am trying to do a Random Forest classification using PySpark 2.3.0. My dataset contains three columns which are strings, so I am using StringIndexer to convert them to numbers. Unfortunately, during evaluation the indexer suddenly finds labels which do not exist anywhere in the dataset.
Here is an extract of my dataset (the last column is the label 0/1):
Year,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance,DepDelay15Min
2004,1,12,1,623,UA,ORD,CLT,599,0
2004,1,13,2,621,UA,ORD,CLT,599,0
2004,1,14,3,633,UA,ORD,CLT,599,0
Here is my script:
CSV_PATH = "data/mllib/2004_10000_small.csv"
APP_NAME = "Random Forest Example"
SPARK_URL = "local[*]"
RANDOM_SEED = 13579
TRAINING_DATA_RATIO = 0.7
RF_NUM_TREES = 10
RF_MAX_DEPTH = 30
RF_MAX_BINS = 2048
LABEL = "DepDelay15Min"
CATEGORICAL_FEATURES = ["UniqueCarrier", "Origin", "Dest"]
from pyspark import SparkContext
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql import SparkSession
from time import *
# Creates Spark Session
spark = SparkSession.builder.appName(APP_NAME).master(SPARK_URL).getOrCreate()
# Reads in CSV file as DataFrame
# header: The first line of files are used to name columns and are not included in data. All types are assumed to be string.
# inferSchema: Automatically infer column types. It requires one extra pass over the data.
df = spark.read.options(header = "true", inferschema = "true").csv(CSV_PATH)
# Transforms all strings into indexed numbers
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df) for column in CATEGORICAL_FEATURES]
pipeline = Pipeline(stages=indexers)
df = pipeline.fit(df).transform(df)
# Removes old string columns
df = df.drop(*CATEGORICAL_FEATURES)
# Moves the label to the last column
df = StringIndexer(inputCol=LABEL, outputCol=LABEL+"_label").fit(df).transform(df)
df = df.drop(LABEL)
# Converts the DataFrame into a LabeledPoint Dataset with the last column being the label and the rest the features.
transformed_df = df.rdd.map(lambda row: LabeledPoint(row[-1], Vectors.dense(row[0:-1])))
# Splits the dataset into a training and testing set according to the defined ratio using the defined random seed.
splits = [TRAINING_DATA_RATIO, 1.0 - TRAINING_DATA_RATIO]
training_data, test_data = transformed_df.randomSplit(splits, RANDOM_SEED)
print("Number of training set rows: %d" % training_data.count())
print("Number of test set rows: %d" % test_data.count())
# Run algorithm and measure runtime
start_time = time()
model = RandomForest.trainClassifier(training_data, numClasses=2, categoricalFeaturesInfo={}, numTrees=RF_NUM_TREES, featureSubsetStrategy="auto", impurity="gini", maxDepth=RF_MAX_DEPTH, maxBins=RF_MAX_BINS, seed=RANDOM_SEED)
end_time = time()
elapsed_time = end_time - start_time
print("Time to train model: %.3f seconds" % elapsed_time)
# Make predictions and compute accuracy
predictions = model.predict(test_data.map(lambda x: x.features))
labels_and_predictions = test_data.map(lambda x: x.label).zip(predictions)
acc = labels_and_predictions.filter(lambda x: x[0] == x[1]).count() / float(test_data.count())
print("Model accuracy: %.3f%%" % (acc * 100))
When executing the labels_and_predictions.filter() at the very end I get the following error message:
Caused by: org.apache.spark.SparkException: Unseen label: OR. To handle unseen labels, set Param handleInvalid to keep.
at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$9.apply(StringIndexer.scala:260)
However, the label "OR" does not exist anywhere in the dataset, only "ORD". I tried different datasets and it turned out that Spark keeps cutting off the last letter of the "Origin" column. I have not the slightest idea which part of the script could be responsible for this. Any ideas how I should proceed with the investigation? Thanks in advance!
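As a side note, the workaround the error message itself suggests is to construct the indexers with handleInvalid="keep", which maps unseen labels to an extra index instead of throwing (available for StringIndexer since Spark 2.2). This hides the symptom rather than explaining the truncation, but for reference:
# keep unseen labels instead of raising an exception
indexers = [StringIndexer(inputCol=column, outputCol=column + "_index", handleInvalid="keep").fit(df)
            for column in CATEGORICAL_FEATURES]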
As Erik pointed out, I was using the outdated MLlib instead of the ML library. I still do not understand why the original script was not working, but after porting it to ML it does. Here is the new solution, inspired by this example: https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
CSV_PATH = "data/mllib/2004_10000_small.csv"
APP_NAME = "Random Forest Example"
SPARK_URL = "local[*]"
RANDOM_SEED = 13579
TRAINING_DATA_RATIO = 0.7
VI_MAX_CATEGORIES = 4
RF_NUM_TREES = 10
RF_MAX_DEPTH = 30
RF_MAX_BINS = 2048
LABEL = "DepDelay15Min"
CATEGORICAL_FEATURES = ["UniqueCarrier", "Origin", "Dest"]
from pyspark import SparkContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import IndexToString, StringIndexer, VectorAssembler, VectorIndexer
from pyspark.sql import SparkSession
from time import *
# Creates Spark Session
spark = SparkSession.builder.appName(APP_NAME).master(SPARK_URL).getOrCreate()
# Reads in CSV file as DataFrame
# header: The first line of files are used to name columns and are not included in data. All types are assumed to be string.
# inferSchema: Automatically infer column types. It requires one extra pass over the data.
data = spark.read.options(header = "true", inferschema = "true").csv(CSV_PATH)
# Transforms all string features into indexed numbers
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(data) for column in CATEGORICAL_FEATURES]
pipeline = Pipeline(stages=indexers)
data = pipeline.fit(data).transform(data)
# Removes old string columns
data = data.drop(*CATEGORICAL_FEATURES)
# Indexes the label and moves it to the last column
data = StringIndexer(inputCol=LABEL, outputCol="label").fit(data).transform(data)
data = data.drop(LABEL)
# Assembles all feature columns and moves them to the last column
assembler = VectorAssembler(inputCols=data.columns[0:-1], outputCol="features")
data = assembler.transform(data)
# Remove all columns but label and features
data = data.drop(*data.columns[0:-2])
# Splits the dataset into a training and testing set according to the defined ratio using the defined random seed.
splits = [TRAINING_DATA_RATIO, 1.0 - TRAINING_DATA_RATIO]
trainingData, testData = data.randomSplit(splits, RANDOM_SEED)
print("Number of training set rows: %d" % trainingData.count())
print("Number of test set rows: %d" % testData.count())
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > VI_MAX_CATEGORIES distinct values are treated as continuous.
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=VI_MAX_CATEGORIES).fit(data)
# Train a RandomForest model.
randomForest = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=RF_NUM_TREES, maxBins=RF_MAX_BINS)
# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)
# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, randomForest, labelConverter])
# Train model. This also runs the indexers. Measures the execution time as well.
start_time = time()
model = pipeline.fit(trainingData)
end_time = time()
print("Time to train model: %.3f seconds" % (end_time - start_time))
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
rfModel = model.stages[2]
print(rfModel) # summary only
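If you want to dig into the trained forest a bit further, the fitted model also exposes per-feature importances; a small sketch with the same rfModel as above:
# relative importance of each assembled feature (returned as a vector)
print(rfModel.featureImportances)
# full textual description of all trees (can be very long)
# print(rfModel.toDebugString)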
