I am using the code below to save a random forest model. I am using cPickle to save the trained model. As I see new data, can I train the model incrementally?
Currently, the train set has about 2 years of data. Is there a way to train on another 2 years and (kind of) append it to the existing saved model?
rf = RandomForestRegressor(n_estimators=100)
print ("Trying to fit the Random Forest model --> ")
if os.path.exists('rf.pkl'):
    print ("Trained model already pickled -- >")
    with open('rf.pkl', 'rb') as f:
        rf = cPickle.load(f)
else:
    df_x_train = x_train[col_feature]
    rf.fit(df_x_train, y_train)
    print ("Training for the model done ")
    with open('rf.pkl', 'wb') as f:
        cPickle.dump(rf, f)
df_x_test = x_test[col_feature]
pred = rf.predict(df_x_test)
EDIT 1: I don't have the compute capacity to train the model on 4 years of data all at once.
What you're talking about, updating a model with additional data incrementally, is discussed in the sklearn User Guide:
Although not all algorithms can learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates. Actually, the ability to learn incrementally from a mini-batch of instances (sometimes called “online learning”) is key to out-of-core learning as it guarantees that at any given time there will be only a small amount of instances in the main memory.
They include a list of classifiers and regressors implementing partial_fit(), but RandomForest is not among them. You can also confirm that RandomForestRegressor does not implement partial_fit() on its documentation page.
Some possible ways forward:
Use a regressor which does implement partial_fit(), such as SGDRegressor (see the sketch after this list)
Check your RandomForest model's feature_importances_ attribute, then retrain your model on 3 or 4 years of data after dropping unimportant features
Train your model on only the most recent two years of data, if you can only use two years
Train your model on a random subset drawn from all four years of data
Change the max_depth parameter to constrain how complicated your model can get. This saves computation time and so may allow you to use all your data. It can also prevent overfitting. Use cross-validation to select the best tree-depth hyperparameter for your problem
Set your RF model's param n_jobs=-1 if you haven't already, to use multiple cores/processors on your machine
Use a faster ensemble-tree-based algorithm, such as xgboost
Run your model-fitting code on a large machine in the cloud, such as AWS or dominodatalab
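A minimal sketch of the first option, assuming the training data can be streamed in mini-batches; the chunks_of_training_data iterable is hypothetical, not something from the question's code:
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

sgd = SGDRegressor()
scaler = StandardScaler()

# Hypothetical iterable of (X, y) mini-batches, e.g. yearly chunks read from disk
for x_chunk, y_chunk in chunks_of_training_data:
    scaler.partial_fit(x_chunk)                          # incremental feature scaling, chunk by chunk
    sgd.partial_fit(scaler.transform(x_chunk), y_chunk)  # incremental model update on this chunk only

pred = sgd.predict(scaler.transform(df_x_test))          # df_x_test as in the question's code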
You can set the warm_start parameter to True in the model. This ensures the retention of what was learnt by previous calls to fit.
The same model learning incrementally two times (on train_X[:1], then train_X[1:2]) after setting warm_start:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(warm_start=True)
forest_model.fit(train_X[:1], train_y[:1])
pred_y = forest_model.predict(val_X[:1])
mae = mean_absolute_error(pred_y, val_y[:1])
print("mae :", mae)
print('pred_y :', pred_y)

forest_model.fit(train_X[1:2], train_y[1:2])
pred_y = forest_model.predict(val_X[1:2])
mae = mean_absolute_error(pred_y, val_y[1:2])
print("mae :", mae)
print('pred_y :', pred_y)
mae : 1290000.0
pred_y : [ 1630000.]
mae : 925000.0
pred_y : [ 1630000.]
A model trained only on the last batch (train_X[1:2]):
forest_model = RandomForestRegressor()
forest_model.fit(train_X[1:2],train_y[1:2])
pred_y = forest_model.predict(val_X[1:2])
mae = mean_absolute_error(pred_y,val_y[1:2])
print("mae :",mae)
print('pred_y :',pred_y)
mae : 515000.0
pred_y : [ 1220000.]
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
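One caveat worth noting: in scikit-learn, a warm_start fit only grows additional trees when n_estimators is increased between calls (the already-fitted trees are kept untouched), so the usual incremental pattern looks something like this sketch, reusing the batch variables from the answer above:
forest_model = RandomForestRegressor(n_estimators=10, warm_start=True)
forest_model.fit(train_X[:1], train_y[:1])       # grows the first 10 trees on the first batch

forest_model.n_estimators += 10                  # request 10 additional trees
forest_model.fit(train_X[1:2], train_y[1:2])     # only the new trees are fit, on the second batch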
My question is about the practical implementation of "Domain Adaptation" in a functional Keras model with the TensorFlow backend.
Description of the problem:
I have a collection of particle collision samples, each consisting of n variables. One half of them is simulated data with certain class labels (e.g. "W-Boson"). The other half is real collision data, which is not labeled. The key idea is to set up a Keras model which has two outputs: one for classifying the class of a sample and one for classifying the domain, i.e. whether it is simulated or real data. The catch is that the model shall be trained so that the domain classifier performs very poorly. This is achieved by flipping the sign of the incoming gradient from the domain end of the network during training. This technique is called "Domain Adaptation". The model is expected to learn domain-invariant features, or in other words, to perform the same on simulated and real collision data.
The framework I am working with has an existing functional Keras model, which I wanted to expand with said domain classifier. This is a prototype I came up with:
# common layers
inputs = keras.Input(shape=(n_variables, ))
X = layers.Dense(units=50, activation="relu")(inputs)
# domain end
flip_layer = flipGradientTF.GradientReversal(hp_lambda=0.3)(X)
X_domain = layers.Dense(units=50, activation="relu")(flip_layer)
domain_out = layers.Dense(units=2, activation="softmax", name="domain_out")(X_domain)
# class end
X_class = layers.Dense(units=50, activation="relu")(X)
class_out = layers.Dense(units=n_classes, activation="softmax", name="class_out")(X_class)
The code for flipGradientTF is taken from https://github.com/michetonu/gradient_reversal_keras_tf
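For reference, a minimal TF 2-style sketch of the same gradient-reversal idea (this is not the linked module, just an illustration of the mechanism):
import tensorflow as tf
from tensorflow import keras

class GradientReversal(keras.layers.Layer):
    """Identity in the forward pass; multiplies the incoming gradient by -hp_lambda in the backward pass."""
    def __init__(self, hp_lambda=1.0, **kwargs):
        super().__init__(**kwargs)
        self.hp_lambda = hp_lambda

    def call(self, inputs):
        @tf.custom_gradient
        def _reverse(x):
            def grad(dy):
                return -self.hp_lambda * dy   # flip (and scale) the gradient flowing back into the shared layers
            return tf.identity(x), grad
        return _reverse(inputs)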
And further on for compiling and training the model:
model = keras.Model(inputs=inputs, outputs=[class_out, domain_out])
model.compile(optimizer="adam", loss=loss_function, metrics="accuracy")
# train model
model.fit(
    x = train_data,
    y = [train_class_labels, train_domain_labels],
    batch_size = 200,
    epochs = 200,
    sample_weight = {"class_out": class_weights, "domain_out": None}
)
For train_data I am passing the dataframe which contains the data from both domains. Since I have tried using either "categorical_crossentropy" or "sparse_categorical_crossentropy" as the loss_function, train_class_labels and train_domain_labels were either in the one-hot representation or in the integer representation. My biggest issue is figuring out what to use for the class labels of the unlabeled data, and this has led to a gut feeling that I am on the wrong track here.
So in a nutshell:
Is this implementation strategy legit and assuming it is, what should I do about the class labels for the unlabeled data? And if it is not legit, what would be a better way of attacking this problem?
Any help would be much appreciated :)
There is a deep-learning-based model using transfer learning and LSTM in this article; the author used 10-fold cross validation (as explained in Table 3) and took the average of the results.
I am familiar with 10-fold cross validation, as we need to divide the data and pass it to the model; however, in this code (here) I can't figure out how to partition the data and pass it.
There are two train/test/dev datasets (one for emotion analysis and one for sentiment analysis; we use both for transfer learning, but my focus is on emotion analysis). The raw data is in a couple of files in txt format, and after running the model it produces two new txt files, one for predicted labels and one for true labels.
There is a line of code in the main file:
model = BiLstm(args, data, ckpt_path='./' + args.data_name + '_output/')
if args.mode == 'train':
    model.train(data)
    sess = model.restore_last_session()
    model.predict(data, sess)
if args.mode == 'test':
    sess = model.restore_last_session()
    model.predict(data, sess)
in which 'data' is an instance of the Data class (code) that holds the test/train/dev datasets:
This is where I think I need to pass the partitioned data. If I am right, how can I do the partitioning and perform 10-fold cross validation?
data = Data('./data/'+args.data_name+'data_sample.bin','./data/'+args.data_name+'vocab_sample.bin',
'./data/'+args.data_name+'word_embed_weight_sample.bin',args.batch_size)
class Data(object):
    def __init__(self, data_path, vocab_path, pretrained, batch_size):
        self.batch_size = batch_size
        data, vocab, pretrained = self.load_vocab_data(data_path, vocab_path, pretrained)
        self.train = data['train']
        self.valid = data['valid']
        self.test = data['test']
        self.train2 = data['train2']
        self.valid2 = data['valid2']
        self.test2 = data['test2']
        self.word_size = len(vocab['word2id']) + 1
        self.max_sent_len = vocab['max_sent_len']
        self.max_topic_len = vocab['max_topic_len']
        self.word2id = vocab['word2id']
        word2id = vocab['word2id']
        #self.id2word = dict((v, k) for k, v in word2id.iteritems())
        self.id2word = {}
        for k, v in six.iteritems(word2id):
            self.id2word[v] = k
        self.pretrained = pretrained
By the look of it, it seems the train method can take a session and continue training from an existing model: def train(self, data, sess=None)
So, with very minimal changes to the existing code and libraries, you can do something like the following.
First load all the data and build the model:
data = Data('./data/'+args.data_name+'data_sample.bin','./data/'+args.data_name+'vocab_sample.bin',
'./data/'+args.data_name+'word_embed_weight_sample.bin',args.batch_size)
model = BiLstm(args, data, ckpt_path='./' + args.data_name + '_output/')
Then create the cross-validation datasets, something like:
def get_new_data_object():
    return Data('./data/'+args.data_name+'data_sample.bin','./data/'+args.data_name+'vocab_sample.bin',
                './data/'+args.data_name+'word_embed_weight_sample.bin',args.batch_size)

cross_validation = []
for i in range(10):
    tmp_data = get_new_data_object()
    tmp_data.train  = ...  # get 90% of tmp_data.train
    tmp_data.valid  = ...  # get 90% of tmp_data.valid
    tmp_data.test   = ...  # get 90% of tmp_data.test
    tmp_data.train2 = ...  # get 90% of tmp_data.train2
    tmp_data.valid2 = ...  # get 90% of tmp_data.valid2
    tmp_data.test2  = ...  # get 90% of tmp_data.test2
    cross_validation.append(tmp_data)
Then run the model n times (10 for 10-fold cross validation):
sess = None
for data in cross_validation:
    model.train(data, sess)
    sess = model.restore_last_session()
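One hedged way to produce those "90%" splits is sklearn's KFold, assuming data.train (and the other attributes) are list-like/NumPy-indexable, which is an assumption about the Data class rather than something its code guarantees:
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=0)
train_array = np.asarray(data.train)              # assumption: indexable samples

cross_validation = []
for train_idx, _ in kf.split(train_array):        # each train_idx covers ~90% of the samples
    tmp_data = get_new_data_object()
    tmp_data.train = train_array[train_idx]
    # ...apply the same idea to valid, test, train2, valid2 and test2...
    cross_validation.append(tmp_data)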
Keep in mind some key points:
I don't know exactly how your data is structured, but that affects the way you split it into test, train and (in your case) valid sets.
The split has to be consistent across each triple of test, train and valid; it can be done randomly or by taking a different part every time, as long as it is consistent.
You can train the model n times with cross validation, or create n models and pick the best, to avoid overfitting.
This code is just a draft; you can implement it however you like. There are some great libraries that already implement such functionality, and it can of course be optimized (e.g. not reading the whole data files each time).
One more consideration is to separate the model creation from the data, especially the data argument of the model constructor; from a quick look it seems to use only the dimensions of the data, so it's good practice not to pass the whole object.
Moreover, if the model integrates other properties of the data object into its state (when it is created), like the data itself, my code might not work and a more surgical approach would be needed.
Hope it helps and points you in the right direction.
I'm trying to build a NN with Keras and Tensorflow to predict the final chart position of a song, given a set of 5 features.
After playing around with it for a few days I realised that although my MAE was getting lower, this was because the model had just learned to predict the mean value of my training set for all input, and this was the optimal solution. (This is illustrated in the scatter plot below)
This is a random sample of 50 data points from my testing set vs what the network thinks they should be
At first I thought this was probably because my network was too complicated. I had one input layer with shape (5,) and a single node in the output layer, but then 3 hidden layers with over 32 nodes each.
I then stripped back the excess layers and moved to just a single hidden layer with a couple of nodes, as shown here:
self.model = keras.Sequential([
    keras.layers.Dense(4,
                       activation='relu',
                       input_dim=num_features,
                       kernel_initializer='random_uniform',
                       bias_initializer='random_uniform'),
    keras.layers.Dense(1)
])
Training this with a gradient descent optimiser still results in exactly the same prediction being made the whole time.
Then it occurred to me that perhaps the actual problem I'm trying to solve isn't hard enough for the network, that maybe it's linearly separable. Since this would respond better to not having a hidden layer at all, essentially just doing regular linear regression, I tried that. I changed my model to:
inp = keras.Input(shape=(num_features,))
out = keras.layers.Dense(1, activation='relu')(inp)
self.model = keras.Model(inp,out)
This also changed nothing. My MAE and the predicted values are all the same.
I've tried so many different things, different permutations of optimisation functions, learning rates, network configurations, and nothing can help. I'm pretty sure the data is good, but I've included a sample of it just in case.
chartposition,tagcount,dow,artistscore,timeinchart,finalpos
121,3925,5,35128,7,227
131,4453,3,85545,25,130
69,2583,4,17594,24,523
145,1165,3,292874,151,187
96,1679,5,102593,111,540
134,3494,5,1252058,37,370
6,34895,7,6824048,22,5
A sample of my dataset; finalpos is the value I'm trying to predict. The dataset contains ~40,000 records, split 80/20 training/testing.
def __init__(self, validation_split, num_features, should_log):
    self.should_log = should_log
    self.validation_split = validation_split
    inp = keras.Input(shape=(num_features,))
    out = keras.layers.Dense(1, activation='relu')(inp)
    self.model = keras.Model(inp, out)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    self.model.compile(loss='mae',
                       optimizer=optimizer,
                       metrics=['mae'])

def train(self, data, labels, plot=False):
    early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
    history = self.model.fit(data,
                             labels,
                             epochs=self.epochs,
                             validation_split=self.validation_split,
                             verbose=0,
                             callbacks=[PrintDot(), early_stop])
    if plot: self.plot_history(history)
All code relevant to constructing and training the network.
def normalise_dataset(df, mini, maxi):
return (df - mini)/(maxi-mini)
Normalisation of the input data. Both my testing and training data are normalised to the max and min of the testing set
Graph of my loss vs validation curves with the one hidden layer network with an adamoptimiser, learning rate 0.01
Same graph but with linear regression and a gradient descent optimiser.
So I am pretty sure that your normalization is the issue: you are not normalizing by feature (as is the de facto industry standard), but across all data.
That means that if you have two different features with very different orders of magnitude/ranges (in your case, compare timeinchart with artistscore), the large-scale feature dominates and the other gets squashed into a tiny range the network can barely learn from.
Instead, you might want to normalize using something like scikit-learn's StandardScaler. Not only does this normalize per column (so you can pass all features at once), but it also scales to unit variance (which makes some assumptions about your data, but can potentially help, too).
To transform your data, use something along these lines:
from sklearn.preprocessing import StandardScaler
import numpy as np
raw_data = np.array([[1,40], [2, 80]])
scaler = StandardScaler()
processed_data = scaler.fit_transform(raw_data)
# fit() calculates mean etc, transform() puts it to the new range.
print(processed_data) # returns [[-1, -1], [1,1]]
Note that you have two possibilities for normalizing/standardizing your data:
Either fit the scaler on all the data together and then split into train/test afterwards,
or fit the scaler only on the training data, and then use the same fitted scaler to transform your test data (see the sketch below).
Never fit_transform your test set separately from the training data!
Since you could end up with different mean/min/max values, you can end up with totally wrong predictions! In a sense, the StandardScaler is your definition of your "data source distribution", which is inherently still the same for your test set, even though it might be a subset not exactly following the same properties (due to small sample size etc.).
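A minimal sketch of the second option; the feature-array names here are placeholders, not variables from the question's code:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_features_scaled = scaler.fit_transform(train_features)  # fit ONLY on the training data
test_features_scaled = scaler.transform(test_features)        # reuse the same mean/variance for the test data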
Additionally, you might want to use a more advanced optimizer, like Adam, or specify some momentum (0.9 is a good choice in practice, as a rule of thumb) for your SGD.
Turns out the error was a really stupid and easy-to-miss bug.
When importing my dataset I shuffle it; however, I was accidentally applying the shuffle only to the labels, not to the dataset as a whole.
As a result, each label was being assigned to a completely random feature set, and of course the model didn't know what to do with this.
Thanks to @dennlinger for suggesting that I look in the place where I eventually found this bug.
I am new to Machine Learning and to Python. I am trying to build a Random Forest Regression model on one of the datasets from the UCI repository. This is my first ML model. I may be entirely wrong in my approach.
The dataset is available here - https://archive.ics.uci.edu/ml/datasets/abalone
Below is the entire working code that I have written. I am using Python 3.6.4 with Windows 7 x64 OS (forgive me for the lengthy code).
import tkinter as tk # Required for enabling GUI options
from tkinter import messagebox # Required for pop-up window
from tkinter import filedialog # Required for getting full path of file
import pandas as pd # Required for data handling
from sklearn.model_selection import train_test_split # Required for splitting data into training and test set
from sklearn.ensemble import RandomForestRegressor # Required to build random forest
#------------------------------------------------------------------------------------------------------------------------#
# Create an instance of tkinter and hide the window
root = tk.Tk() # Create an instance of tkinter
root.withdraw() # Hides root window
#root.lift() # Required for pop-up window management
root.attributes("-topmost", True) # To make pop-up window stay on top of all other windows
#------------------------------------------------------------------------------------------------------------------------#
# This block of code reads input file using tkinter GUI options
print("Reading input file...")
# Pop up window to ask user the input file
File_Checker = messagebox.askokcancel("Random Forest Regression Prompt",
                                      "At The Prompt, Enter 'Abalone_Data.csv' File.")
# Kill the execution if user selects "Cancel" in the above pop-up window
if (File_Checker == False):
    quit()
else:
    del(File_Checker)

file_loop = 0
while (file_loop == 0):
    # Get path of base file
    file_path = filedialog.askopenfilename(initialdir = "/",
                                           title = "File Selection Prompt",
                                           filetypes = (("CSV Files", "*.*"), ))
    # Condition to check if user selected a file or not
    if (len(file_path) < 1):
        # Pop-up window to warn user that no file was selected
        result = messagebox.askretrycancel("File Selection Prompt Error",
                                           "No file has been selected. \nWhat do you want to do?")
        # Condition to repeat the loop or quit program execution
        if (result == True):
            continue
        else:
            quit()
    # Get file name
    file_name = file_path.split("/")    # Splits the path with "/" as the delimiter and returns a list
    file_name = file_name[-1]           # Extracts the last element of the list
    # Condition to check if correct file was selected or not
    if (file_name != "Abalone_Data.csv"):
        result = messagebox.askretrycancel("File Selection Prompt Error",
                                           "Incorrect file selected. \nWhat do you want to do?")
        # Condition to repeat the loop or quit program execution
        if (result == True):
            continue
        else:
            quit()
    # Read the base file
    input_file = pd.read_csv(file_path,
                             sep = ',',
                             encoding = 'utf-8',
                             low_memory = True)
    break

# Delete unwanted variables
del(file_loop, file_name)
#------------------------------------------------------------------------------------------------------------------------#
print("Preparing dependent and independent variables...")
# Create Separate dataframe consisting of only dependent variable
y = pd.DataFrame(input_file['Rings'])
# Create Separate dataframe consisting of only independent variable
X = input_file.drop(columns = ['Rings'], inplace = False, axis = 1)
#------------------------------------------------------------------------------------------------------------------------#
print("Handling Dummy Variable Trap...")
# Create a new dataframe to handle categorical data
# This method splits the categorical data column into separate columns
# This is to ensure we get rid of the dummy variable trap
dummy_sex = pd.get_dummies(X['Sex'], prefix = 'Sex', prefix_sep = '_', drop_first = True)
# Remove the specific columns from the dataframe
# These are the categorical data columns which were split into separate columns in the previous step
X.drop(columns = ['Sex'], inplace = True, axis = 1)
# Merge the new columns to the original dataframe
X = pd.concat([X, dummy_sex], axis = 1)
#------------------------------------------------------------------------------------------------------------------------#
y = y.values
X = X.values
#------------------------------------------------------------------------------------------------------------------------#
print("Splitting datasets to training and test sets...")
# Splitting the data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
#------------------------------------------------------------------------------------------------------------------------#
print("Fitting Random Forest regression on training set")
# Fitting the regression model to the dataset
regressor = RandomForestRegressor(n_estimators = 100, random_state = 50)
regressor.fit(X_train, y_train.ravel()) # Using ravel() to avoid getting 'DataConversionWarning' warning message
#------------------------------------------------------------------------------------------------------------------------#
print("Predicting Values")
# Predicting a new result with regression
y_pred = regressor.predict(X_test)
# Enter values for new prediction as a Dictionary
test_values = {'Sex_I' : 0,
               'Sex_M' : 0,
               'Length' : 0.5,
               'Diameter' : 0.35,
               'Height' : 0.8,
               'Whole_Weight' : 0.223,
               'Shucked_Weight' : 0.09,
               'Viscera_Weight' : 0.05,
               'Shell_Weight' : 0.07}
# Convert dictionary into dataframe
test_values = pd.DataFrame(test_values, index = [0])
# Rearranging columns as required
test_values = test_values[['Length', 'Diameter', 'Height', 'Whole_Weight', 'Shucked_Weight', 'Viscera_Weight',
                           'Shell_Weight', 'Sex_I', 'Sex_M']]
# Applying feature scaling
#test_values = sc_X.transform(test_values)
# Predicting values of new data
new_pred = regressor.predict(test_values)
#------------------------------------------------------------------------------------------------------------------------#
"""
print("Building Confusion Matrix...")
# Making the confusion matrix
cm = confusion_matrix(y_test, y_pred)
"""
#------------------------------------------------------------------------------------------------------------------------#
print("\n")
print("Getting Model Accuracy...")
# Get regression details
#print("Estimated Coefficient = ", regressor.coef_)
#print("Estimated Intercept = ", regressor.intercept_)
print("Training Accuracy = ", regressor.score(X_train, y_train))
print("Test Accuracy = ", regressor.score(X_test, y_test))
print("\n")
print("Printing predicted result...")
print("Result_of_Treatment = ", new_pred)
When I look at the model accuracy, below is what I get.
Getting Model Accuracy...
Training Accuracy = 0.9359702279804791
Test Accuracy = 0.5695080680053354
Below are my questions.
1) Why are the Training Accuracy and Test Accuracy so far apart?
2) How do I know if this model is being over/under fitted?
3) Is Random Forest Regression the right model to use? If not, how do I determine the right model for this use case?
4) How can I build a confusion matrix using the variables I have created?
5) How do I validate the performance of the model?
I am looking for your guidance so that I too can learn from my mistakes and improve on my modelling skills.
Before trying to answer your points, a comment: I see you are using a regressor with accuracy as the metric. But accuracy is a metric used in classification problems; in regression models you usually use other metrics, such as Mean Squared Error (MSE). See here.
If you just switch to a more appropriate metric, you may find that your model is not so bad.
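As a minimal sketch, regression metrics can be computed directly on the predictions already produced in the question's code (y_test and y_pred):
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)   # this is what regressor.score() already reports
print("MSE =", mse, " MAE =", mae, " R^2 =", r2)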
I'm going to reply to your questions anyway.
Why are the Training Accuracy and Test Accuracy so far apart?
This means that you overfitted your training samples: your model is very strong at predicting the data in the training dataset, but unable to generalise. It is like having a model trained on a set of cat pictures which believes that only those pictures are cats, and that all the other pictures of all the other cats are not. In fact, your score on the test set is ~0.57, which is not much better than a naive guess.
How do I know if this model is being over/under fitted?
Exactly from the difference in accuracy between the two sets. The closer they are to each other, the better the model is able to generalise. You already know what an overfit looks like. An underfit is generally recognisable by a low accuracy on both sets.
Is Random Forest Regression the right model to use? If not, how do I determine the right model for this use case?
There is no single right model to use. Random Forest, and in general all tree-based models (LightGBM, XGBoost), are the Swiss army knife of machine learning when you are dealing with structured data, because of their simplicity and reliability. Models based on deep learning can perform better in theory, but are much more complex to set up.
How can I build a confusion matrix using the variables I have created?
Confusion matrices can be created when you build a classification model, and are built on the output of your model.
You are using a regressor, so it does not make much sense here.
How do I validate the performance of the model?
In general, for a reliable validation of performance you split the data in three: you train on one part (a.k.a. the training set), tune the model on a second (a.k.a. the validation set; this is what you call the test set), and finally, when you are happy with the model and its hyper-parameters, you test it on the third (a.k.a. the test set, not to be confused with the one you currently call the test set). This last one tells you whether your model generalises well or not, because when you choose and tune the model you can also overfit the validation set (the one you call the test set), perhaps by selecting a set of hyper-parameters which performs well only on that set.
Also, you have to choose a reliable metric, and this depends both on the data and on the model. With regressions, the MSE is pretty good.
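A minimal sketch of such a three-way split, reusing the X and y arrays from the question (the split ratios are only illustrative):
from sklearn.model_selection import train_test_split

# Hold back a final test set first, then split the rest into train/validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
# Tune hyper-parameters against (X_val, y_val); touch (X_test, y_test) only once, at the very end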
With trees and ensembles, you have to be careful with some settings. In your case, the difference comes from "overfitting": your model has learned your training data "too well" and is not able to generalise to other data.
One important thing to do is to limit the depth of the trees. Each tree has a branching factor of 2, which means that at depth d you will have 2^d branches.
Let's imagine you have 1000 training values. If you don't limit depth (or/and min_samples_leaf), you can learn your complete dataset with a depth of 10 (because 2^10 = 1024 > N_training).
What you can do is compare training accuracy and test accuracy for a range of depths (from, let's say, 3 to log2(n)). If the depth is too low, both accuracies will be low, as you need more branches to learn the data properly; they will rise to a peak, after which the training score will keep rising while the test score goes down. It should look something like the following picture, where Model Complexity is your depth.
You can also play with min_samples_split and/or min_samples_leaf, which let you split a node only if you have enough data points in that branch. As a result, this also limits the depth and allows the tree to have a different depth per branch. As explained previously, you can play with these values to look for the best ones (with a grid search); a sketch follows below.
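As a hedged sketch of such a grid search over depth-related parameters, reusing X_train and y_train from the question (the grid values are only illustrative):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 7, 10, None],     # illustrative candidate depths
    'min_samples_leaf': [1, 5, 10, 20],   # illustrative leaf sizes
}
search = GridSearchCV(RandomForestRegressor(n_estimators=100, random_state=50),
                      param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X_train, y_train.ravel())
print(search.best_params_, search.best_score_)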
I hope it helps,
Playing around with Python's scikit-learn Linear Support Vector Classification (LinearSVC), I'm running into an error when I attempt to make predictions:
ten_percent = len(raw_routes_data) / 10
# Training
training_label = all_labels[ten_percent:]
training_raw_data = raw_routes_data[ten_percent:]
training_data = DictVectorizer().fit_transform(training_raw_data).toarray()
learner = svm.LinearSVC()
learner.fit(training_data, training_label)
# Predicting
testing_label = all_labels[:ten_percent]
testing_raw_data = raw_routes_data[:ten_percent]
testing_data = DictVectorizer().fit_transform(testing_raw_data).toarray()
testing_predictions = learner.predict(testing_data)
m = metrics.classification_report(testing_label, testing_predictions)
The raw_data is represented as a Python dictionary with categories of arrival times for various travel options and categories for weather data:
{'72_bus': '6.0 to 11.0', 'uber_eta': '2.0 to 3.5', 'tweet_delay': '0', 'c_train': '1.0 to 4.0', 'weather': 'Overcast', '52_bus': '16.0 to 21.0', 'uber_surging': '1.0 to 1.15', 'd_train': '17.6666666667 to 21.8333333333', 'feels_like': '27.6666666667 to 32.5'}
When I train and fit the training data I use a DictVectorizer on 90% of the data and turn it into an array.
The provided testing_labels are represented as:
[1,2,3,3,1,2,3, ... ]
It's when I attempt to use the LinearSVC to predict that I'm informed:
ValueError: X has 27 features per sample; expecting 46
What am I missing here? Obviously it is the way I fit and transform the data.
The problem is that you are creating and fitting a different DictVectorizer for the training data and for the test data.
You should create and fit only one DictVectorizer on the training data, and use that object's transform method on your test data to create the feature representation of your test data.
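A minimal sketch of that fix, reusing the variable names from the question:
from sklearn import svm
from sklearn.feature_extraction import DictVectorizer

vectorizer = DictVectorizer()
training_data = vectorizer.fit_transform(training_raw_data).toarray()  # learn the feature space on the training data
testing_data = vectorizer.transform(testing_raw_data).toarray()        # reuse the SAME feature space for the test data

learner = svm.LinearSVC()
learner.fit(training_data, training_label)
testing_predictions = learner.predict(testing_data)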
Yes, I had a similar concern while working with CountVectorizer.
When I removed the additional fitting done on the test data and only used the transform method based on the fitting done on the training data, it worked like a gem.
Sharing this in case it helps the community with similar concerns when predicting outcomes on test data.
Thanks,
Shabir Jameel