Apache Spark StringIndexer applies non-existent labels (Unseen label Exception) - python

I am trying to do a Random Forest classification using PySpark 2.3.0. My dataset contains three string columns, so I am using StringIndexer to convert them to numbers. Unfortunately, during evaluation the indexer suddenly finds labels that do not exist anywhere in the dataset.
Here is an extract of my dataset (the last column is the label 0/1):
Year,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance,DepDelay15Min
2004,1,12,1,623,UA,ORD,CLT,599,0
2004,1,13,2,621,UA,ORD,CLT,599,0
2004,1,14,3,633,UA,ORD,CLT,599,0
Here is my script:
CSV_PATH = "data/mllib/2004_10000_small.csv"
APP_NAME = "Random Forest Example"
SPARK_URL = "local[*]"
RANDOM_SEED = 13579
TRAINING_DATA_RATIO = 0.7
RF_NUM_TREES = 10
RF_MAX_DEPTH = 30
RF_MAX_BINS = 2048
LABEL = "DepDelay15Min"
CATEGORICAL_FEATURES = ["UniqueCarrier", "Origin", "Dest"]
from pyspark import SparkContext
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql import SparkSession
from time import time
# Creates Spark Session
spark = SparkSession.builder.appName(APP_NAME).master(SPARK_URL).getOrCreate()
# Reads in CSV file as DataFrame
# header: The first line of files are used to name columns and are not included in data. All types are assumed to be string.
# inferSchema: Automatically infer column types. It requires one extra pass over the data.
df = spark.read.options(header = "true", inferSchema = "true").csv(CSV_PATH)
# Transforms all strings into indexed numbers
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df) for column in CATEGORICAL_FEATURES]
pipeline = Pipeline(stages=indexers)
df = pipeline.fit(df).transform(df)
# Removes old string columns
df = df.drop(*CATEGORICAL_FEATURES)
# Moves the label to the last column
df = StringIndexer(inputCol=LABEL, outputCol=LABEL+"_label").fit(df).transform(df)
df = df.drop(LABEL)
# Converts the DataFrame into a LabeledPoint Dataset with the last column being the label and the rest the features.
transformed_df = df.rdd.map(lambda row: LabeledPoint(row[-1], Vectors.dense(row[0:-1])))
# Splits the dataset into a training and testing set according to the defined ratio using the defined random seed.
splits = [TRAINING_DATA_RATIO, 1.0 - TRAINING_DATA_RATIO]
training_data, test_data = transformed_df.randomSplit(splits, RANDOM_SEED)
print("Number of training set rows: %d" % training_data.count())
print("Number of test set rows: %d" % test_data.count())
# Run algorithm and measure runtime
start_time = time()
model = RandomForest.trainClassifier(training_data, numClasses=2, categoricalFeaturesInfo={}, numTrees=RF_NUM_TREES, featureSubsetStrategy="auto", impurity="gini", maxDepth=RF_MAX_DEPTH, maxBins=RF_MAX_BINS, seed=RANDOM_SEED)
end_time = time()
elapsed_time = end_time - start_time
print("Time to train model: %.3f seconds" % elapsed_time)
# Make predictions and compute accuracy
predictions = model.predict(test_data.map(lambda x: x.features))
labels_and_predictions = test_data.map(lambda x: x.label).zip(predictions)
acc = labels_and_predictions.filter(lambda x: x[0] == x[1]).count() / float(test_data.count())
print("Model accuracy: %.3f%%" % (acc * 100))
When executing the labels_and_predictions.filter() at the very end I get the following error message:
Caused by: org.apache.spark.SparkException: Unseen label: OR. To handle unseen labels, set Param handleInvalid to keep.
at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$9.apply(StringIndexer.scala:260)
However, the label "OR" does not exist anywhere in the dataset, only "ORD". I tried different datasets, and it turned out that Spark keeps cutting off the last letter of values in the "Origin" column. I have not the slightest idea which part of the script could be responsible for this. Any ideas on how I should proceed with the investigation? Thanks in advance!

As Erik pointed out, I was using the outdated MLlib API instead of the ML library. I still do not understand why the original script was not working, but after porting it to ML it does. Here is the new solution, which is inspired by this example: https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
CSV_PATH = "data/mllib/2004_10000_small.csv"
APP_NAME = "Random Forest Example"
SPARK_URL = "local[*]"
RANDOM_SEED = 13579
TRAININGDATA_RATIO = 0.7
VI_MAX_CATEGORIES = 4
RF_NUM_TREES = 10
RF_MAX_DEPTH = 30
RF_MAX_BINS = 2048
LABEL = "DepDelay15Min"
CATEGORICAL_FEATURES = ["UniqueCarrier", "Origin", "Dest"]
from pyspark import SparkContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import IndexToString, StringIndexer, VectorAssembler, VectorIndexer
from pyspark.sql import SparkSession
from time import time
# Creates Spark Session
spark = SparkSession.builder.appName(APP_NAME).master(SPARK_URL).getOrCreate()
# Reads in CSV file as DataFrame
# header: The first line of files are used to name columns and are not included in data. All types are assumed to be string.
# inferSchema: Automatically infer column types. It requires one extra pass over the data.
data = spark.read.options(header = "true", inferSchema = "true").csv(CSV_PATH)
# Transforms all string features into indexed numbers
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(data) for column in CATEGORICAL_FEATURES]
pipeline = Pipeline(stages=indexers)
data = pipeline.fit(data).transform(data)
# Removes old string columns
data = data.drop(*CATEGORICAL_FEATURES)
# Indexes the label and moves it to the last column
data = StringIndexer(inputCol=LABEL, outputCol="label").fit(data).transform(data)
data = data.drop(LABEL)
# Assembles all feature columns and moves them to the last column
assembler = VectorAssembler(inputCols=data.columns[0:-1], outputCol="features")
data = assembler.transform(data)
# Remove all columns but label and features
data = data.drop(*data.columns[0:-2])
# Splits the dataset into a training and testing set according to the defined ratio using the defined random seed.
splits = [TRAININGDATA_RATIO, 1.0 - TRAININGDATA_RATIO]
trainingData, testData = data.randomSplit(splits, RANDOM_SEED)
print("Number of training set rows: %d" % trainingData.count())
print("Number of test set rows: %d" % testData.count())
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > VI_MAX_CATEGORIES distinct values are treated as continuous.
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=VI_MAX_CATEGORIES).fit(data)
# Train a RandomForest model.
randomForest = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=RF_NUM_TREES, maxBins=RF_MAX_BINS)
# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)
# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, randomForest, labelConverter])
# Train model. This also runs the indexers. Measures the execution time as well.
start_time = time()
model = pipeline.fit(trainingData)
end_time = time()
# Make predictions.
predictions = model.transform(testData)
# Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)
# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
rfModel = model.stages[2]
print(rfModel) # summary only
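Side note: the exception message itself points at StringIndexer's handleInvalid parameter. As a minimal sketch (assuming Spark 2.3+, where handleInvalid="keep" maps unseen labels to an extra index instead of raising), the StringIndexer stages above could be built like this; it suppresses the exception rather than explaining where the truncated "OR" value comes from:
# keep (rather than fail on) labels that were not seen during fit
indexers = [StringIndexer(inputCol=column, outputCol=column + "_index", handleInvalid="keep")
            for column in CATEGORICAL_FEATURES]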

Related

Use statsmodels model fit on another dataset

Suppose I fit a model on the dataset dataset1 using SARIMAX from statsmodels.tsa.statespace.sarimax - is it possible to then use this fit to make predictions on another dataset dataset2?
Namely, consider the following:
from statsmodels.tsa.statespace.sarimax import SARIMAX
import pandas as pd
import numpy as np
# generate example data
n=90
idx = pd.PeriodIndex(pd.date_range(start = '2015-01-02',end='2015-04-01',freq='D'))
dat = np.sin(np.linspace(0,12*np.pi,n)) + np.random.randn(n)/10
dataset1 = pd.Series(dat, index = idx)
# fit model
fit = SARIMAX(dataset1, order = (1,0,1)).fit()
# make 30 day forecast on dataset1
fit.forecast(30)
How would I go about using fit to make a prediction on dataset2?
dat = np.sin(np.linspace(0,12*np.pi,n)) + np.random.randn(n)/10
dataset2 = pd.Series(dat, index = idx)
Ideally, it'd be something super simple akin to fit(dataset2).forecast(30) but that clearly isn't the case.
I know I can extract the estimated parameters with fit.params, but short of going through this tedious process, is there a built-in way or a hack to use the existing fit instance?
You can use the apply method of the results object:
from statsmodels.tsa.statespace.sarimax import SARIMAX
import pandas as pd
import numpy as np
# generate example data
n=90
idx = pd.PeriodIndex(pd.date_range(start = '2015-01-02',end='2015-04-01',freq='D'))
dat = np.sin(np.linspace(0,12*np.pi,n)) + np.random.randn(n)/10
dataset1 = pd.Series(dat, index = idx)
# fit model
fit = SARIMAX(dataset1, order = (1,0,1)).fit()
# make 30 day forecast on dataset1
fit.forecast(30)
# ------------------------------------
# get the new dataset
dat = np.sin(np.linspace(0,12*np.pi,n)) + np.random.randn(n)/10
dataset2 = pd.Series(dat, index = idx)
# apply the parameters from `fit` to the new dataset
fit2 = fit.apply(dataset2)
# make 30 day forecast on dataset2
fit2.forecast(30)
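Note that apply() reuses the parameters estimated on dataset1 instead of re-estimating them on dataset2 (unless you pass refit=True). A quick sanity check, as a small sketch on the objects above:
# the applied results carry over the original coefficients unchanged
print(fit.params)
print(fit2.params)  # same values as fit.params; only the underlying data differs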

How to build a full trainset when loading data from predefined folds in Surprise?

I am using Surprise to evaluate various recommender system algorithms. I would like to calculate predictions and prediction coverage on all possible user and item permutations. My data is loaded in from predefined splits.
My strategy to calculate prediction coverage is to:
1. build a full trainset and fit,
2. get lists of all users and items,
3. iterate through the lists and make predictions,
4. count the exceptions where a prediction is impossible, in order to calculate prediction coverage.
Trying to call data.build_full_trainset() yields the following error:
AttributeError: 'DatasetUserFolds' object has no attribute 'build_full_trainset'
Is there a way to build a full trainset when loading data from predefined folds?
Alternatively, I could combine the data outside of Surprise into a dataframe and redo the process. Or are there better approaches?
Thank you.
# %% #https://surprise.readthedocs.io/en/stable/getting_started.html#basic-usage
import random
import pickle
import numpy as np
import pandas as pd
# from survey.data_cleaning import long_ratings
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
# from surprise.model_selection import LeaveOneOut, KFold
from surprise.model_selection import PredefinedKFold
#set random seed for reproducibility
my_seed = 0
random.seed(my_seed)
np.random.seed(my_seed)
path = 'data/recommenders/'
def load_splits():
    """
    Loads the splits created by the colab code and stored to files; used in surprise_recommenders.py.
    Returns the splits as a Dataset.
    """
    # path to dataset folder
    files_dir = 'data/recommenders/splits/'
    # This time, we'll use the built-in reader.
    reader = Reader(line_format='user item rating', sep=' ', skip_lines=0, rating_scale=(1, 5))
    # folds_files is a list of tuples containing file paths:
    # [(u1.base, u1.test), (u2.base, u2.test), ... (u5.base, u5.test)]
    train_file = files_dir + 'u%d.base'
    test_file = files_dir + 'u%d.test'
    folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]
    data = Dataset.load_from_folds(folds_files, reader=reader)
    return data
data = load_splits()
pkf = PredefinedKFold()
algos = {
    'NormalPredictor': {'constructor': NormalPredictor,
                        'param_grid': {}},
}
key = "stratified_5_fold"
cv_results={}
print(f"Performing {key} cross validation.")
for algo_name, v in algos.items():
    print("Working on algorithm: ", algo_name)
    gs = GridSearchCV(v['constructor'], v['param_grid'], measures=['rmse', 'mae'], cv=pkf)
    gs.fit(data)
    # best RMSE score
    print(gs.best_score['rmse'])
    # combination of parameters that gave the best RMSE score
    print(gs.best_params['rmse'])
    # Predict on full dataset
    # Use the weights that yield the best rmse:
    algo = gs.best_estimator['rmse']
    algo.fit(data.build_full_trainset())  # predefined folds break this call
    cv_results[algo_name] = pd.DataFrame.from_dict(gs.cv_results)
TL;DR: The model_selection documentation in Surprise describes a refit method that fits on a full trainset; however, it explicitly does not work with predefined folds.
Another major issue: oyyablokov's comment on this issue suggests you cannot fit a model with data that has NaNs. So even if you have a full trainset, how does one create a full prediction matrix to calculate things like prediction coverage, which requires all user and item combinations, with or without ratings?
My workaround was to create three Surprise datasets:
1. The dataset from predefined folds, used to compute best_params.
2. The full dataset of ratings, built by combining all folds outside of Surprise (a sketch of this step follows this list).
3. The full prediction-matrix dataset, including all possible combinations of users and items (with or without ratings).
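A minimal sketch of step 2, assuming the u%d.base / u%d.test files read by load_splits() above contain space-separated "user item rating" triples; if the folds overlap, duplicates are dropped before building the trainset:
import pandas as pd
from surprise import Dataset, Reader

files_dir = 'data/recommenders/splits/'
frames = []
for i in (1, 2, 3, 4, 5):
    for suffix in ('base', 'test'):
        frames.append(pd.read_csv(files_dir + 'u%d.%s' % (i, suffix), sep=' ',
                                  names=['user', 'item', 'rating']))
# the same rating can appear in several folds, so de-duplicate before building the trainset
full_df = pd.concat(frames).drop_duplicates(subset=['user', 'item'])
full_data = Dataset.load_from_df(full_df[['user', 'item', 'rating']], Reader(rating_scale=(1, 5)))
full_trainset = full_data.build_full_trainset()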
After you find your best parameters with grid search cross validation, you can find your predictions and coverage with something like this:
import pandas as pd
from surprise import Dataset, Reader
def get_pred_coverage(data_matrix, algo_constructor, best_params, verbose=False):
    """
    Calculates prediction coverage.
    Inputs:
        data_matrix: numpy matrix with columns 0, 1, 2 as user, service, rating
        algo_constructor: the Surprise algorithm constructor to pass the best params into
        best_params: Surprise gs.best_params to pass into the algorithm
    Returns: coverage & full predictions
    """
    reader = Reader(rating_scale=(1, 5))
    full_predictions = []  # list to store prediction results
    df = pd.DataFrame(data_matrix)
    if verbose: print(df.info())
    df_no_nan = df.dropna(subset=[2])
    if verbose: print(df_no_nan.head())
    no_nan_dataset = Dataset.load_from_df(df_no_nan[[0, 1, 2]], reader)
    full_dataset = Dataset.load_from_df(df[[0, 1, 2]], reader)
    # Predict on full dataset
    # Use the weights that yield the best rmse:
    algo = algo_constructor(**best_params)  # pass the dictionary as double-star keyword arguments to the algorithm constructor
    # Create a no-NaN trainset to fit on
    no_nan_trainset = no_nan_dataset.build_full_trainset()
    algo.fit(no_nan_trainset)
    if verbose: print('Number of trainset users: ', no_nan_trainset.n_users, '\n')
    if verbose: print('Number of trainset items: ', no_nan_trainset.n_items, '\n')
    pred_set = full_dataset.build_full_trainset()
    if verbose: print('Number of users: ', pred_set.n_users, '\n')
    if verbose: print('Number of items: ', pred_set.n_items, '\n')
    # get all item ids
    pred_set_iids = list(pred_set.all_items())
    # print(f'pred_set iids are {pred_set_iids}')
    iid_converter = lambda x: pred_set.to_raw_iid(x)
    pred_set_raw_iids = list(map(iid_converter, pred_set_iids))
    # get all user ids
    pred_set_uids = list(pred_set.all_users())
    uid_converter = lambda x: pred_set.to_raw_uid(x)
    pred_set_raw_uids = list(map(uid_converter, pred_set_uids))
    # print(f'pred_set uids are {pred_set_uids}')
    for user in pred_set_raw_uids:
        for item in pred_set_raw_iids:
            r_ui = float(df[2].loc[(df[0] == user) & (df[1] == item)])  # find the rating by user and item
            # print(f"r_ui is type {type(r_ui)} and value {r_ui}")
            prediction = algo.predict(uid=user, iid=item, r_ui=r_ui)
            # print(prediction)
            full_predictions.append(prediction)
    # each prediction is a tuple; its 5th element is a dictionary with the key "was_impossible"
    impossible_count = 0
    for prediction in full_predictions:
        impossible_count += prediction[4]['was_impossible']
    if verbose: print(f"for algo {algo}, impossible_count is {impossible_count} ")
    prediction_coverage = (pred_set.n_users * pred_set.n_items - impossible_count) / (pred_set.n_users * pred_set.n_items)
    print(f"prediction_coverage is {prediction_coverage}")
    return prediction_coverage, full_predictions
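A hypothetical call, reusing the names from the grid-search loop above (data_matrix stands for whatever user/item/rating matrix you assembled for step 3, with NaN for unrated pairs; it is not defined in this snippet):
coverage, full_predictions = get_pred_coverage(
    data_matrix, NormalPredictor, gs.best_params['rmse'], verbose=True)
print(f"coverage: {coverage}, total predictions: {len(full_predictions)}")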

What is "UserWarning: No features were selected"?

I am using a LassoCV() model for feature selection. It gives me this warning and does not select any features:
"C:\Users\xyz\Anaconda3\lib\site-packages\sklearn\feature_selection\base.py:80: UserWarning: No features were selected: either the data is too noisy or the selection test too strict. UserWarning)"
The code is given below.
The data is in https://www.kaggle.com/jtrofe/beer-recipes/downloads/recipeData.csv/3
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
# dataset URL = https://www.kaggle.com/jtrofe/beer-recipes/downloads/recipeData.csv/3
dataframe = pd.read_csv('Brewer Friend Beer Recipes.csv', encoding = 'latin')
# Encoding the non numerical columns
def encoding_data(dataframe):
    if dataframe.dtype == 'object':
        return LabelEncoder().fit_transform(dataframe.astype(str))
    else:
        return dataframe
# Feature Selection using the selected Target Feature
def feature_selection(raw_dataframe, target_feature_list):
    output_list = []
    # preprocessing: converting categorical data into numeric data
    dataframe = raw_dataframe.apply(encoding_data)
    column_list = dataframe.columns.tolist()
    dataframe = dataframe.dropna()
    for target in target_feature_list:
        target_feature = target
        x = dataframe.drop(columns=[target_feature])
        y = dataframe[target_feature].values
        # Lasso feature selection
        estimator = LassoCV(cv=3, n_alphas=1)
        featureselection = SelectFromModel(estimator)
        featureselection.fit(x, y)
        features = featureselection.transform(x)
        feature_list = x.columns[featureselection.get_support()]
        features = ''
        features = ', '.join(feature_list)
        l = (target, features)
        output_list.append(l)
    output_df = pd.DataFrame(output_list, columns=['Name', 'Selected Features'])
    print('\nThe feature selection is done with the respective target feature(s)')
    return output_df
print(feature_selection(dataframe, ['BrewMethod']))
I am getting this warning and no features are selected.
"C:\Users\xyz\Anaconda3\lib\site-packages\sklearn\feature_selection\base.py:80: UserWarning: No features were selected: either the data is too noisy or the selection test too strict. UserWarning)"
Any idea how to rectify this?
If no features have been selected, you can gradually decrease lambda (or, in scikit-learn's case, alpha). This will reduce the penalization and probably return some nonzero coefficients.
It is extremely unusual that no coefficients are selected at all. You should think about checking the correlations in your data; maybe you have a lot of collinearity.
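As a minimal sketch of that idea (reusing x and y from the question's feature_selection function), you could let LassoCV search a grid that includes much smaller alphas instead of fixing n_alphas=1:
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
# search 50 alphas spanning several orders of magnitude, so the penalty can become
# weak enough for some coefficients to stay non-zero
estimator = LassoCV(cv=3, alphas=np.logspace(-4, 1, 50))
featureselection = SelectFromModel(estimator)
featureselection.fit(x, y)
print(x.columns[featureselection.get_support()])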

Python running extremely slow on one line of code

I'm running the code below.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
train=pd.read_csv('C:\\path_here\\train.csv')
test=pd.read_csv('C:\\path_here\\test.csv')
train['Type']='Train' #Create a flag for Train and Test Data set
test['Type']='Test'
fullData = pd.concat([train,test],axis=0) #Combined both Train and Test Data set
fullData.columns # This will show all the column names
fullData.head(10) # Show first 10 records of dataframe
fullData.describe() #You can look at summary of numerical fields by using describe() function
ID_col = ['REF_NO']
target_col = ['Status']
cat_cols = ['children','age_band','status','occupation','occupation_partner','home_status','family_income','self_employed', 'self_employed_partner','year_last_moved','TVarea','post_code','post_area','gender','region']
num_cols= list(set(list(fullData.columns)))
other_col=['Type'] #Test and Train Data set identifier
fullData.isnull().any()#Will return the feature with True or False,True means have missing value else False
num_cat_cols = num_cols+cat_cols # Combined numerical and Categorical variables
#Create a new variable for each variable having missing value with VariableName_NA
# and flag missing value with 1 and other with 0
for var in num_cat_cols:
    if fullData[var].isnull().any() == True:
        fullData[var+'_NA'] = fullData[var].isnull()*1
#Impute numerical missing values with mean
fullData[num_cols] = fullData[num_cols].fillna(fullData[num_cols].mean(),inplace=True)
#Impute categorical missing values with 0
fullData[cat_cols] = fullData[cat_cols].fillna(value = 0)
#create label encoders for categorical features
for var in cat_cols:
    number = LabelEncoder()
    fullData[var] = number.fit_transform(fullData[var].astype('str'))
#Target variable is also a categorical so convert it
fullData["Account.Status"] = number.fit_transform(fullData["Account.Status"].astype('str'))
train=fullData[fullData['Type']=='Train']
test=fullData[fullData['Type']=='Test']
train['is_train'] = np.random.uniform(0, 1, len(train)) <= .75
Train, Validate = train[train['is_train']==True], train[train['is_train']==False]
features=list(set(list(fullData.columns))-set(ID_col)-set(target_col)-set(other_col))
x_train = Train[list(features)].values
y_train = Train["Account.Status"].values
x_validate = Validate[list(features)].values
y_validate = Validate["Account.Status"].values
x_test=test[list(features)].values
random.seed(100)
rf = RandomForestClassifier(n_estimators=1000)
rf.fit(x_train, y_train)
It seems to run endlessly on this line:
fullData[cat_cols] = fullData[cat_cols].fillna(value = 0)
I can't get it past that spot. How can I see what's happening in the background? Is there some way to see the work that's being done? Thanks.
One way to check where the code is getting to is to add print statements. For example, you can add (right before the label encoder):
print("Code got before label encoder")
And then after that code block add another print statement. You can see in your console exactly where the code is getting stuck and debug that specific line.
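A slightly more informative variant of the same idea is to time the suspect steps individually, for example (a sketch reusing fullData, cat_cols, and LabelEncoder from the question):
import time

start = time.time()
fullData[cat_cols] = fullData[cat_cols].fillna(value=0)
print("fillna on categorical columns took %.1f seconds" % (time.time() - start))

start = time.time()
for var in cat_cols:
    fullData[var] = LabelEncoder().fit_transform(fullData[var].astype('str'))
print("label encoding took %.1f seconds" % (time.time() - start))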

Scikit Learn - Identifying target from loading a CSV

I'm loading a CSV with NumPy as a dataset to create a decision tree model in Python. The extract below places columns 0-7 in X and the last column as the target in Y.
#load and set data
data = np.loadtxt("data/tmp.csv", delimiter=",")
X = data[:,0:7] #identify columns as data sets
Y = data[:,8] #identify last column as target
#create model
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
What I'd like to know is whether it's possible to have the class label in any column. For example, if it's in the fourth column, would the following code still fit the model correctly, or would it produce errors when it comes to predicting?
#load and set data
data = np.loadtxt("data/tmp.csv", delimiter=",")
X = data[:,0:8] #identify columns as data sets
Y = data[:,3] #identify fourth column as target
#create model
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
If you have >4 columns, and the 4th one is the target and the others are features, here's one way (out of many) to load them:
# load data
X = np.hstack([data[:, :4], data[:, 5:]]) # features: everything except the target column
Y = data[:,4] # target
# process X & Y
(with belated thanks to @omerbp for reminding me hstack takes a tuple/list, not naked arguments!)
First of all, as suggested by @mescalinum in a comment to the question, think of this situation:
.... 4th_feature ... label
.... 1 ... 1
.... 0 ... 0
.... 1 ... 1
............................
In this example, the classifier (any classifier, not DecisionTreeClassifier in particular) will learn that the 4th feature can best predict the label, since the 4th feature is the label. Unfortunately, this issue happens a lot (by accident, I mean).
Secondly, if you want the 4th column to be the label, you can just swap the columns:
arr[:,[frm, to]] = arr[:,[to, frm]]
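For example, a small usage sketch that moves the 4th column (index 3) of the question's data array to the end and then splits as in the original code:
# swap the target (4th column, index 3) with the last column in place,
# then slice features and target exactly as before
data[:, [3, -1]] = data[:, [-1, 3]]
X = data[:, :-1]  # features
Y = data[:, -1]   # target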
@Ahmed Fasih's answer can also do the trick; however, it's around 10 times slower:
import timeit
setup_code = """
import numpy as np
i, j = 400000, 200
my_array = np.arange(i*j).reshape(i, j)
"""
swap_cols = """
def swap_cols(arr, frm, to):
    arr[:,[frm, to]] = arr[:,[to, frm]]
"""
stack ="np.hstack([my_array[:, :3], my_array[:, 5:]])"
swap ="swap_cols(my_array, 4, 8)"
print "hstack - total time:", min(timeit.repeat(stmt=stack,setup=setup_code,number=20,repeat=3))
#hstack - total time: 3.29988478635
print "swap - total time:", min(timeit.repeat(stmt=swap,setup=setup_code+swap_cols,number=20,repeat=3))
#swap - total time: 0.372791106328
