Attributes mismatch between training and testing data in sklearn - linear regression - python

I am trying to train a linear regression model using sklearn to predict the likes of given tweets. I have the following features/attributes:
['id', 'month', 'hour', 'text', 'hasMedia', 'hasHashtag', 'followers_count', 'retweet_count', 'favourite_count', 'sentiment', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust', ......keywords............]
I use TfidfVectorizer for extracting keywords. The problem is that, depending on the size of the training data, the number of keywords differs, and therefore the number of independent attributes differs. Because of this there is a mismatch of attributes between the training and testing data. I get ValueError: Shape of passed values is (1, 1678), indices imply (1, 1928).
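To see concretely why the widths differ: each independently fitted TfidfVectorizer builds its vocabulary from whatever corpus it is given, so two corpora of different sizes yield matrices of different widths. A minimal illustration, with made-up documents:

from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["great match today", "love this team", "great goal"]
new_docs = ["what a save"]

# One column per distinct training term -> shape (3, 7)
print(TfidfVectorizer().fit_transform(train_docs).shape)
# A separately fitted vectorizer sees a different vocabulary -> shape (1, 2)
print(TfidfVectorizer().fit_transform(new_docs).shape)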
It works fine when I split the same data into train and test and predict with test as below.
Program for training and prediction
def train_favourite_prediction(result):
    result = result.drop(['retweet_count'], axis=1)
    result = result.dropna()
    X = result.loc[:, result.columns != 'favourite_count']
    y = result['favourite_count']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    # now you can save it to a file
    joblib.dump(regressor, os.path.join(dirname, '../../knowledge_base/knowledge_favourite.pkl'))
    return None
def predict_favourites(result):
    result = result.drop(['retweet_count'], axis=1)
    result = result.dropna()
    X = result.loc[:, result.columns != 'favourite_count']
    y = result['favourite_count']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # and later you can load it
    regressor = joblib.load(os.path.join(dirname, '../../knowledge_base/knowledge_favourite.pkl'))
    coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
    print(coeff_df)
    y_pred = regressor.predict(X_test)
    df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
    print(df)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print("the large training just finished")
    return None
Code for fit vectorization
Have a look at "Applying Tfidfvectorizer on list of pos tags gives ValueError" to understand the format of my 'text' column.
def ready_for_training(dataset):
    dataset = dataset.head(1000)
    dataset['text'] = dataset.text.apply(lambda x: literal_eval(x))
    dataset['text'] = dataset['text'].apply(
        lambda row: [item for sublist in row for item in sublist])
    tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
    keyword_response = tfidf.fit_transform(dataset['text'])
    keyword_matrix = pd.DataFrame(keyword_response.todense(), columns=tfidf.get_feature_names())
    keyword_matrix = keyword_matrix.loc[:, (keyword_matrix != 0).any(axis=0)]
    dataset['sentiments'] = dataset['sentiments'].map(eval)
    dataset = pd.concat([dataset.drop(['sentiments'], axis=1), dataset['sentiments'].apply(pd.Series)], axis=1)
    dataset = dataset.drop(['neg', 'neu', 'pos'], axis=1)
    dataset['emotions'] = dataset['emotions'].map(eval)
    dataset = pd.concat([dataset.drop(['emotions'], axis=1), dataset['emotions'].apply(pd.Series)], axis=1)
    dataset = dataset.drop(['id', 'month', 'text'], axis=1)
    result = pd.concat([dataset, keyword_matrix], axis=1, sort=False)
    return result
What I need is to predict 'favourite_count' when a single new tweet is given. When I extract the keywords for this tweet I get only a few, while I trained with 1000+ keywords. I have stored the trained model in a .pkl file. How should I handle this mismatch of attributes? To fill in the missing columns of the testing tweet, as in "Keep same dummy variable in training and testing data", I would need the training set as a dataframe; but I have only stored the trained model as a .pkl and cannot access the training columns from it.
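One way to handle this (a sketch, untested against the code above; the file paths and the new_tweet variable are illustrative) is to persist the fitted TfidfVectorizer and the final training column order next to the regressor, then at prediction time call transform(), never fit_transform(), and align the columns with DataFrame.reindex:

import joblib
import pandas as pd

# At training time: save the fitted vectorizer and the column order used to fit.
joblib.dump(tfidf, 'knowledge_base/tfidf.pkl')
joblib.dump(list(X.columns), 'knowledge_base/feature_columns.pkl')

# At prediction time: reuse the same vocabulary and align the columns.
tfidf = joblib.load('knowledge_base/tfidf.pkl')
feature_columns = joblib.load('knowledge_base/feature_columns.pkl')
keyword_matrix = pd.DataFrame(
    tfidf.transform(new_tweet['text']).todense(),  # transform, NOT fit_transform
    columns=tfidf.get_feature_names())
features = pd.concat([new_tweet.drop(['text'], axis=1).reset_index(drop=True), keyword_matrix], axis=1)
# Training columns missing from this tweet become 0; unseen extra columns are dropped.
features = features.reindex(columns=feature_columns, fill_value=0)
y_pred = regressor.predict(features)

Note that ready_for_training also drops all-zero keyword columns, so even the training matrix width is data-dependent; reindexing against a saved column list removes that source of mismatch as well.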

Related

How can I predict a result from new data with a trained existing model in python?

# Defining Model Function
def models(X_train, Y_train):
    log = LogisticRegression(random_state=0)
    log.fit(X_train, Y_train)
    tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
    tree.fit(X_train, Y_train)
    forest = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
    forest.fit(X_train, Y_train)
    return log, tree, forest

# Main Code
df = pd.read_csv("data.csv")
X = df.iloc[:, 2:31].values  # Selecting data from 3rd column to 31st column
Y = df.iloc[:, 1].values  # Selecting data from 2nd column
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)
model = models(X_train, Y_train)
newdata = pd.read_csv("new.csv")  # This is the new data
X_new = newdata.iloc[:, 2:31].values  # These data I want to check against the trained model
predictnewdata1 = model[0].predict(X_test)
print("new data1")  # Printing test data from model[0], i.e. the log model
print(predictnewdata1)
predictnewdata2 = model[1].predict(X_test)
print("new data2")  # Printing test data from model[1], i.e. the tree model
print(predictnewdata2)
predictnewdata3 = model[2].predict(X_test)
print("new data3")  # Printing test data from model[2], i.e. the forest model
print(predictnewdata3)
predictnewdata4 = model[2].predict(X_new)  # This code is anyhow wrong
print(predictnewdata4)  # Printing wrong values
I want to test any of the models with new data, but somehow my algorithm/code isn't working. I searched a lot but couldn't find any solution, so please help in this regard.
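The likely issue: X_new is never scaled, and X_test is scaled with fit_transform, which re-fits the scaler on the test set. The scaler should be fitted once on the training data and then only applied with transform everywhere else. A sketch of the fix, reusing the names from the question:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit on the training data only
X_test = sc.transform(X_test)        # transform, not fit_transform

model = models(X_train, Y_train)

X_new = newdata.iloc[:, 2:31].values
X_new = sc.transform(X_new)          # same scaling as the training data
predictnewdata4 = model[2].predict(X_new)
print(predictnewdata4)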

Predictive model performs exceedingly well during training and testing, but predicts zero when predicting the very same data

I've created a binary classification model which predicts whether an article belongs to the positive or negative class. I am using TF-IDF fed into an XGBoost classifier alongside another feature. I get an AUC score of very close to 1 both when training/testing and when cross-validating.
I got a .5 score when testing on my holdout data. This seemed odd to me, so I fed the very same training data into my model, and even that returns a .5 AUC score. The code below takes in a dataframe, fits and transforms the TF-IDF vectors, and formats it all into a DMatrix.
def format_to_dmatrix(known_targets):
    y = known_targets['target']
    X = known_targets[['body', 'day_of_year']]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=42)
    tfidf.fit(X_train['body'])
    pickle.dump(tfidf.vocabulary_, open("tfidf_features.pkl", "wb"))
    X_train_enc = tfidf.transform(X_train['body']).toarray()
    X_test_enc = tfidf.transform(X_test['body']).toarray()
    new_cols = tfidf.get_feature_names()
    new_cols.append('day_of_year')
    a = np.array(X_train['day_of_year'])
    a = a.reshape(a.shape[0], 1)
    b = np.array(X_test['day_of_year'])
    b = b.reshape(b.shape[0], 1)
    X_train = np.append(X_train_enc, a, axis=1)
    X_test = np.append(X_test_enc, b, axis=1)
    dtrain = xgb.DMatrix(X_train, label=y_train.values, feature_names=new_cols)
    dtest = xgb.DMatrix(X_test, label=y_test.values, feature_names=new_cols)
    return dtrain, dtest, tfidf
I cross validate and find a test-auc-mean of .9979, so I save the model as shown below.
best_model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")],
)
This is my code to load in new data:
def test_newdata(data):
    tf1 = pickle.load(open("tfidf_features.pkl", 'rb'))
    tf1_new = TfidfVectorizer(max_features=1500, lowercase=True, analyzer='word',
                              stop_words='english', ngram_range=(1, 1), vocabulary=tf1.keys())
    encoded_body = tf1_new.fit_transform(data['body']).toarray()
    new_cols = tf1_new.get_feature_names()
    new_cols.append('day_of_year')
    day_of_year = np.array(data['day_of_year'])
    day_of_year = day_of_year.reshape(day_of_year.shape[0], 1)
    formatted_test_data = np.append(encoded_body, day_of_year, axis=1)
    df = pd.DataFrame(formatted_test_data, columns=new_cols)
    return xgb.DMatrix(df)
And the code below shows that my AUC score is .5 despite loading in the very same data. Is there an error I've missed somewhere?
loaded_model = xgb.Booster()
loaded_model.load_model("earn_modelv3.model")
holdout = known_targets
formatted_test_data = test_newdata(holdout)
holdout_preds = loaded_model.predict(formatted_test_data)
predictions_binary = np.where(holdout_preds > .5, 1, 0)
round(roc_auc_score(holdout['target'], predictions_binary), 4)
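One likely culprit, following the same pattern as the question above: test_newdata builds a brand-new TfidfVectorizer from the saved vocabulary and calls fit_transform(), which recomputes the IDF weights on the new corpus, so the encoded features no longer match what the model was trained on. A sketch of the alternative, persisting the whole fitted vectorizer (vocabulary and IDF weights) instead of only vocabulary_; the file name here is illustrative:

import pickle

# At training time, inside format_to_dmatrix: save the entire fitted vectorizer.
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(tfidf, f)

# At prediction time, inside test_newdata: reuse it with transform(), never fit_transform().
with open("tfidf_vectorizer.pkl", "rb") as f:
    tfidf_loaded = pickle.load(f)
encoded_body = tfidf_loaded.transform(data['body']).toarray()  # transform only
new_cols = tfidf_loaded.get_feature_names()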

Python DummyRegressor with min of group

Trying to use sklearn.dummy DummyRegressor to create a baseline for my model, which is a regression model with encoded categorical variables that predicts a continuous target. The baseline strategy is 'min', and I'd like the min by group. Below is a reproducible example. My actual dataset is larger; it is a collection of runners ('a' ids) racing on courses ('c' ids), with the time recorded for that performance as the target 'T'. I'm trying to see if the model performs better than the runner's best/fastest recorded time (min).
df = pd.DataFrame([['a1', 'c1', 10],
                   ['a1', 'c2', 15],
                   ['a1', 'c3', 20],
                   ['a1', 'c1', 15],
                   ['a2', 'c2', 26],
                   ['a4', 'c3', 15],
                   ['a2', 'c1', 23],
                   ['a2', 'c2', 15],
                   ['a3', 'c3', 20],
                   ['a3', 'c3', 13],
                   ['a1', 'c3', 19],
                   ['a4', 'c3', 19],
                   ['a3', 'c3', 12],
                   ['a3', 'c3', 20]], columns=['aid', 'cid', 'T'])
X = pd.get_dummies(df, columns=['aid', 'cid'], prefix_sep='', prefix='')
X.drop(['T'], axis=1, inplace=True)
y = df['T']
# train test split 80-20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regr = LinearRegression()
Lin_model = regr.fit(X_train, y_train)
y_pred = Lin_model.predict(X_test)
print('R-squared:', metrics.r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
For comparison I'd like to use DummyRegressor. Using 'mean' as the strategy works; as I understand it, it uses the mean of the entire target column.
dummy_mean = DummyRegressor(strategy='mean')
dummy_mean.fit(X_train, y_train)
y_pred2 = dummy_mean.predict(X_test)
print('R-squared:', metrics.r2_score(y_test, y_pred2))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred2))
To compare to the lowest T (the fastest/best time), I tried the 'constant' strategy and defined the constant as the min value by group:
min_value = df.groupby('aid').agg({'T': ['min']})
dummy_min = DummyRegressor(strategy='constant',constant = min_value)
dummy_min.fit(X_train, y_train)
y_pred4 = dummy_min.predict(X_test)
which returns
ValueError: could not broadcast input array from shape (1,3) into shape (3,1)
what am I missing?
When you use min_value = df.groupby('aid').agg({'T': ['min']}), the shape of the resulting data frame is (3, 1); try changing it to min_value = df.groupby('aid').agg({'T': ['min']}).values.reshape(1, -1). Hope it helps.
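Note that DummyRegressor only accepts a single constant (or one value per output), not one value per group, so even after the reshape it will not give a per-runner baseline. A sketch of computing the per-runner minimum baseline directly with pandas instead, learned from the training rows only (using the index alignment that train_test_split preserves):

# Best (minimum) recorded time per runner, computed on the training rows only.
best_time = df.loc[y_train.index].groupby('aid')['T'].min()

# Map each test row's runner id to that runner's best training time;
# runners unseen in training fall back to the global training minimum.
aid_test = df.loc[y_test.index, 'aid']
y_pred_min = aid_test.map(best_time).fillna(y_train.min())

print('MAE:', metrics.mean_absolute_error(y_test, y_pred_min))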

Scikit-learn Pipeline: Size of predictions on test set is equal to size of training set

I'm trying to get predictions on the test dataset. I'm using a sklearn Pipeline with MLPRegressor. However, I keep getting predictions the size of the train set, even though I am using 'test.csv'.
Where can I modify the code to obtain predictions with the same length as the test data?
train_pipeline.py
# Read training data
data = pd.read_csv(data_path, sep=';', low_memory=False, parse_dates=parse_dates)
# Fill all None records
data[config.TARGET] = data[config.TARGET].fillna(0)
data[config.TARGET] = data[config.TARGET].apply(lambda x: split_join_string(x) if (type(x) == str and len(x.split('.')) > 0) else x)
# Divide train and test
X_train, X_test, y_train, y_test = train_test_split(
    data[config.FEATURES],
    data[config.TARGET],
    test_size=0.1,
    random_state=0)  # we are setting the seed here
# Transform the target
y_train = y_train.apply(lambda x: np.log(float(x)) if x != 0 else 0)
y_test = y_test.apply(lambda x: np.log(float(x)) if x != 0 else 0)
data_test = pd.concat([X_test, y_test], axis=1)
# Save the test set to a '.csv' file without index
data_test.to_csv(data_path_test, sep=';', index=False)
pipeline.order_pipe.fit(X_train[config.FEATURES], y_train)
save_pipeline(pipeline_to_persist=pipeline.order_pipe)
predict.py
def make_prediction(*, input_data) -> dict:
    """Make a prediction using the saved model pipeline."""
    data = pd.DataFrame(input_data)
    validated_data = validate_inputs(input_data=data)
    prediction = _order_pipe.predict(validated_data[config.FEATURES])
    output = np.exp(prediction)
    # score = _order_pipe.score(validated_data[config.FEATURES], validated_data[config.TARGET])
    results = {'predictions': output, 'version': _version}
    _logger.info(f'Making predictions with model version: {_version}'
                 f'\nInputs: {validated_data}'
                 f'\nPredictions: {results}')
    return results
I expect the predictions to have the size of 'test.csv', but they actually have the size of 'train.csv'. Do I need to fit or transform the test dataset in 'order_pipe' to get predictions of the right size?
I solved this problem by removing a preprocessor that was corrupting the size of X_test. Because of it, X_test was being replaced by X_train, so I was not able to make predictions of the right shape.
Furthermore, there was another preprocessor (creating dummies with pd.get_dummies()) that was inserting new columns and caused more problems during X_test predictions. I also replaced that preprocessor, encoding the categorical features using groupby() and map().
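For reference, a sketch of that kind of encoding (here a mean-target encoding; the 'category' column name is made up): the mapping is learned on the training split only and then applied with map(), so X_test keeps exactly the same columns as X_train and unseen categories get a fallback value.

# Learn the encoding on the training split only.
encoding = y_train.groupby(X_train['category']).mean()
global_mean = y_train.mean()

# Apply the same mapping to both splits; unseen categories fall back to the global mean.
X_train['category_enc'] = X_train['category'].map(encoding)
X_test['category_enc'] = X_test['category'].map(encoding).fillna(global_mean)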

Grading System - Input Features

I am working on a Grading System (graduation project). I have preprocessed the data, then used TfidfVectorizer on it and fitted the model with LinearSVC.
The system has 265 definitions of arbitrary lengths; in total, they vectorize to a shape of (265, 8581).
So when I try to input some new random sentence to predict against, I get this message:
Error Message
You could have a look at the code used (full and long) if you want to.
Code used:
def normalize(df):
    lst = []
    for x in range(len(df)):
        text = re.sub(r"[,.'!?]", '', df[x])
        lst.append(text)
    filtered_sentence = ' '.join(lst)
    return filtered_sentence

def stopWordRemove(df):
    stop = stopwords.words("english")
    needed_words = []
    for x in range(len(df)):
        words = word_tokenize(df)
        for word in words:
            if word not in stop:
                needed_words.append(word)
    return needed_words

def prepareDataSets(df):
    sentences = []
    for index, d in df.iterrows():
        Definitions = stopWordRemove(d['Definitions'].lower())
        Definitions_normalized = normalize(Definitions)
        if d['Results'] == 'F':
            sentences.append([Definitions, 'false'])
        else:
            sentences.append([Definitions, 'true'])
    df_sentences = DataFrame(sentences, columns=['Definitions', 'Results'])
    for x in range(len(df_sentences)):
        df_sentences['Definitions'][x] = ' '.join(df_sentences['Definitions'][x])
    return df_sentences

def featureExtraction(data):
    vectorizer = TfidfVectorizer(min_df=10, max_df=0.75, ngram_range=(1, 3))
    tfidf_data = vectorizer.fit_transform(data)
    return tfidf_data

def learning(clf, X, Y):
    X_train, X_test, Y_train, Y_test = \
        cross_validation.train_test_split(X, Y, test_size=.2, random_state=43)
    classifier = clf()
    classifier.fit(X_train, Y_train)
    predict = cross_validation.cross_val_predict(classifier, X_test, Y_test, cv=5)
    scores = cross_validation.cross_val_score(classifier, X_test, Y_test, cv=5)
    print(scores)
    print("Accuracy of %s: %0.2f (+/- %0.2f)" % (classifier, scores.mean(), scores.std() * 2))
    print(classification_report(Y_test, predict))
Then I run these scripts, after which I get the mentioned error:
test = LinearSVC()
data, target = preprocessed_df['Definitions'], preprocessed_df['Results']
tfidf_data = featureExtraction(data)
X_train, X_test, Y_train, Y_test = \
    cross_validation.train_test_split(tfidf_data, target, test_size=.2, random_state=43)
test.fit(tfidf_data, target)
predict = cross_validation.cross_val_predict(test, X_test, Y_test, cv=10)
scores = cross_validation.cross_val_score(test, X_test, Y_test, cv=10)
print(scores)
print("Accuracy of %s: %0.2f (+/- %0.2f)" % (test, scores.mean(), scores.std() * 2))
print(classification_report(Y_test, predict))
Xnew = ["machine learning is playing games in home"]
tvect = TfidfVectorizer(min_df=1, max_df=1.0, ngram_range=(1, 3))
X_test = tvect.fit_transform(Xnew)
ynew = test.predict(X_test)
Never call fit_transform() on the test data; only call transform(), and use the same vectorizer that was fitted on the training data.
Do this:
def featureExtraction(data):
    vectorizer = TfidfVectorizer(min_df=10, max_df=0.75, ngram_range=(1, 3))
    tfidf_data = vectorizer.fit_transform(data)
    # Here I am returning the vectorizer as well, which was used to generate the training data
    return vectorizer, tfidf_data
...
...
tfidf_vectorizer, tfidf_data = featureExtraction(data)
...
...
# Now using the same vectorizer on the test data
X_test = tfidf_vectorizer.transform(Xnew)
...
In your code, you are using a new TfidfVectorizer, which obviously will not know about the training data, nor that the training data has 8581 features.
The test data should always be prepared in the same way as the train data. Otherwise, even if you get no error, the results are wrong and the model will not perform like that in real-world scenarios.
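A convenient way to guarantee this is to bundle the vectorizer and the classifier into a single sklearn Pipeline, so that fit() and predict() always go through the same fitted vocabulary. A sketch, reusing the parameters and names from the question:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

clf = Pipeline([
    ('tfidf', TfidfVectorizer(min_df=10, max_df=0.75, ngram_range=(1, 3))),
    ('svc', LinearSVC()),
])
clf.fit(data, target)  # fits the vectorizer and the classifier together
print(clf.predict(["machine learning is playing games in home"]))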
See my other answers explaining similar situation for different feature preprocessing techniques:
https://stackoverflow.com/a/47205199/3374996
https://stackoverflow.com/a/50461140/3374996
https://stackoverflow.com/a/44671967/3374996
I would have tagged this question as a duplicate of one of these, but seeing that you are using a new vectorizer altogether and have a different method for transforming the train data, I answered it. From next time, please search for the issue first and try to understand what's happening in similar scenarios before posting a question.
