How to implement a model on a new data set - python

I'm new to machine learning with Python. I'm trying to predict a target, let's say the price of a house, and I'm using polynomial features of higher degree to build the model.
So I have two data sets, and I've prepared my model using one of them.
How do I apply this model to an entirely new data set?
I'm attaching my code below:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

data1 = pd.read_csv(r"C:\Users\DELL\Desktop\experimental data/xyz1.csv", engine = 'c', dtype=float, delimiter = ",")
data2 = pd.read_csv(r"C:\Users\DELL\Desktop\experimental data/xyz2.csv", engine = 'c', dtype=float, delimiter = ",")
# I have to do this step, otherwise I get a NaN or infinite value error every time
data1.fillna(0.000, inplace=True)
data2.fillna(0.000, inplace=True)
X_train = data1.drop('result', axis = 1)
y_train = data1.result
X_test = data2.drop('result', axis = 1)
y_test = data2.result
x2_ = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_train)
x3_ = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X_train)
model2 = LinearRegression().fit(x2_, y_train)
model3 = LinearRegression().fit(x3_, y_train)
r_sq2 = model2.score(x2_, y_train)
r_sq3 = model3.score(x3_, y_train)
y_pred2 = model2.predict(x2_)
y_pred3 = model3.predict(x3_)
So basically I'm stuck after this.
How do I apply this same model to my test data to predict the y_test values and compute the score?

To reproduce the effect of PolynomialFeatures, you need to store the fitted object itself (once for degree=2 and again for degree=3). Otherwise, you have no way to apply the fitted transform to the test dataset.
X_train = data1.drop('result', axis = 1)
y_train = data1.result
X_test = data2.drop('result', axis = 1)
y_test = data2.result
# store these data transform objects
pf2 = PolynomialFeatures(degree=2, include_bias=False)
pf3 = PolynomialFeatures(degree=3, include_bias=False)
# then apply the transform to the training set
x2_ = pf2.fit_transform(X_train)
x3_ = pf3.fit_transform(X_train)
model2 = LinearRegression().fit(x2_, y_train)
model3 = LinearRegression().fit(x3_, y_train)
r_sq2 = model2.score(x2_, y_train)
r_sq3 = model3.score(x3_, y_train)
y_pred2 = model2.predict(x2_)
y_pred3 = model3.predict(x3_)
# now apply the fitted transform to the test set
x2_test = pf2.transform(X_test)
x3_test = pf3.transform(X_test)
# apply trained model to transformed test data
y2_test_pred = model2.predict(x2_test)
y3_test_pred = model3.predict(x3_test)
# compute the model accuracy for the test data
r_sq2_test = model2.score(x2_test, y_test)
r_sq3_test = model3.score(x3_test, y_test)
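As a side note, the same workflow can be wrapped in a scikit-learn Pipeline, which keeps the fitted PolynomialFeatures transform and the LinearRegression estimator together, so you only call fit on the training data and predict/score on the test data. A minimal sketch using the variables above:
from sklearn.pipeline import make_pipeline

# the pipeline fits the transform and the regressor together on the training data
poly2_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
poly2_model.fit(X_train, y_train)
# predict/score apply the already-fitted transform to whatever data you pass in
y2_test_pred = poly2_model.predict(X_test)
r_sq2_test = poly2_model.score(X_test, y_test)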


OSError: exception: access violation reading 0x0000000000000008 with XGBOOST classifier

I initially trained my model using an XGBOOST classifier and everything worked fine.
Now, I am trying to train the model on the same data set using an XGBOOST classifier but I am running into this error: OSError: exception: access violation reading 0x0000000000000008.
This time around, I am using sklearn's bootstrapping method to randomly sample from the dataset.
I first split the data set into a train set and a test set. Then I randomly sampled from the train and test sets to create 50 samples each for training and testing respectively.
The error is raised around the .fit() line.
Kindly direct me on how I can fix this error.
I tried running the model outside the for loop and everything works fine, but when I try it with the bootstrap method I hit the error again.
# Read each file and do analysis
for i in range(50):
    # read train and test data
    train_data = pd.read_csv(train_path + "\\" + "train" + str(i) + ".csv")
    test_data = pd.read_csv(test_path + "\\" + "test" + str(i) + ".csv")
    # Convert gender to binary
    train_data['gender'] = train_data['gender'].map({1:1, 2:0})
    test_data['gender'] = test_data['gender'].map({1:1, 2:0})
    # Apply standard scaler to numerical columns
    sc = StandardScaler()
    train_data[['age', 'RXDCOUNT', 'income', 'RXDDAYS', 'ALQ130', 'OCD270', 'BMXBMI', 'BMXHT', 'BMXWT']] = sc.fit_transform(train_data[['age', 'RXDCOUNT', 'income', 'RXDDAYS', 'ALQ130', 'OCD270', 'BMXBMI', 'BMXHT', 'BMXWT']])
    test_data[['age', 'RXDCOUNT', 'income', 'RXDDAYS', 'ALQ130', 'OCD270', 'BMXBMI', 'BMXHT', 'BMXWT']] = sc.fit_transform(test_data[['age', 'RXDCOUNT', 'income', 'RXDDAYS', 'ALQ130', 'OCD270', 'BMXBMI', 'BMXHT', 'BMXWT']])
    # Create X_train, X_test, y_train, y_test
    y_train = train_data["depression"]
    y_test = test_data["depression"]
    X_train = train_data.drop("depression", axis=1, inplace=True)
    X_test = test_data.drop("depression", axis=1, inplace=True)
    #print(y_train)
    # Create model
    model = XGBClassifier(use_label_encoder=False)
    # Fit model with train data
    _ = model.fit(X_train, y_train)
    # Predict on test set
    y_pred = model.predict(X_test)
    # Get accuracy of model
    acc = model.score(X_test, y_test)
    # get balanced accuracy
    balAcc = balanced_accuracy_score(y_test, y_pred)
    # roc_auc
    roc_auc = roc_auc_score(y_true=y_test, y_score=model.predict_proba(X_test)[:,1])
    # add y_pred to test set
    predict_dataframe = prediction_dataframe(test_data, y_pred)
    # define protected attributes
    p_attr1 = "gender"
    p_attr2 = "ethnicity"
    # compute TP, FP, TN, FN based on single protected attributes
    tp, fp, tn, fn = compute_metrics_s(predict_dataframe, p_attr1)
    # compute TPR based on single protected attributes
    tpr_male = list(tp.values())[0] / np.add(list(tp.values())[0], list(fn.values())[0])
    tpr_female = list(tp.values())[1] / np.add(list(tp.values())[1], list(fn.values())[1])
    EOD = np.subtract(tpr_male, tpr_female)
    dic_data["roc_auc"].append(roc_auc)
    dic_data["bacc"].append(balAcc)
    dic_data["EOD"].append(EOD)
    dic_data["tpr_male"].append(tpr_male)
    dic_data["tpr_female"].append(tpr_female)
    i += 1
    if i == 49:
        df = pd.DataFrame.from_dict(dic_data)
        df.to_csv(r"C:\Users\dzadq001\OneDrive -University\Experiments\Depression\Datasets\nhanes\results\dataframe\suppression\gender.csv", index=True)
The issue was that my X_train and X_test were None (drop with inplace=True returns None). So when I modified the following lines:
X_train = train_data.drop("depression", axis=1, inplace=True)
X_test = test_data.drop("depression", axis=1, inplace=True)
to:
X_train = train_data.drop("depression", axis=1)
X_test = test_data.drop("depression", axis=1)
the problem was solved.
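For reference, DataFrame.drop returns the new frame only when inplace is left at its default of False; with inplace=True it modifies the frame and returns None, which is why X_train and X_test ended up as None above. A quick illustration:
import pandas as pd

df = pd.DataFrame({"depression": [0, 1], "age": [30, 45]})
X1 = df.drop("depression", axis=1)                # returns a new DataFrame without the column
X2 = df.drop("depression", axis=1, inplace=True)  # modifies df in place and returns None
print(type(X1).__name__, X2)                      # DataFrame None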

Predictive model performs exceedingly well during training and testing, but predicts zero when predicting the very same data

I've created a binary classification model which predicts whether an article belongs to the positive or negative class. I am using TF-IDF fed into an XGBoost classifier alongside one other feature. I get an AUC score very close to 1 both when training/testing and when cross-validating.
I got a .5 score when testing on my holdout data. This seemed odd to me, so I fed the very same training data into my model, and even that returns a .5 AUC score. The code below takes in a dataframe, fits and transforms the tf-idf vectors, and formats it all into a DMatrix.
def format_to_dmatrix(known_targets):
    y = known_targets['target']
    X = known_targets[['body', 'day_of_year']]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=42)
    tfidf.fit(X_train['body'])
    pickle.dump(tfidf.vocabulary_, open("tfidf_features.pkl", "wb"))
    X_train_enc = tfidf.transform(X_train['body']).toarray()
    X_test_enc = tfidf.transform(X_test['body']).toarray()
    new_cols = tfidf.get_feature_names()
    new_cols.append('day_of_year')
    a = np.array(X_train['day_of_year'])
    a = a.reshape(a.shape[0], 1)
    b = np.array(X_test['day_of_year'])
    b = b.reshape(b.shape[0], 1)
    X_train = np.append(X_train_enc, a, axis=1)
    X_test = np.append(X_test_enc, b, axis=1)
    dtrain = xgb.DMatrix(X_train, label=y_train.values, feature_names=new_cols)
    dtest = xgb.DMatrix(X_test, label=y_test.values, feature_names=new_cols)
    return dtrain, dtest, tfidf
I cross validate and find a test-auc-mean of .9979, so I save the model as shown below.
best_model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")],
)
This is my code to load in new data:
def test_newdata(data):
    tf1 = pickle.load(open("tfidf_features.pkl", 'rb'))
    tf1_new = TfidfVectorizer(max_features=1500, lowercase=True, analyzer='word', stop_words='english', ngram_range=(1, 1), vocabulary=tf1.keys())
    encoded_body = tf1_new.fit_transform(data['body']).toarray()
    new_cols = tf1_new.get_feature_names()
    new_cols.append('day_of_year')
    day_of_year = np.array(data['day_of_year'])
    day_of_year = day_of_year.reshape(day_of_year.shape[0], 1)
    formatted_test_data = np.append(encoded_body, day_of_year, axis=1)
    df = pd.DataFrame(formatted_test_data, columns=new_cols)
    return xgb.DMatrix(df)
And the code below shows that my AUC score is .5 despite loading in the very same data. Is there an error I've missed somewhere?
loaded_model = xgb.Booster()
loaded_model.load_model("earn_modelv3.model")
holdout = known_targets
formatted_test_data = test_newdata(holdout)
holdout_preds = loaded_model.predict(formatted_test_data)
predictions_binary = np.where(holdout_preds > .5, 1, 0)
round(roc_auc_score(holdout['target'], predictions_binary), 4)
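This question is still open, but one thing worth checking (an assumption based on the code above, not a confirmed diagnosis): test_newdata builds a brand-new TfidfVectorizer and calls fit_transform, so the IDF weights are re-learned from the new data rather than reusing the ones the model was trained with. A sketch of keeping the transform identical would be to pickle the fitted vectorizer itself and only call transform at prediction time (the file name tfidf_vectorizer.pkl is just illustrative):
# at training time: persist the fitted vectorizer, not just its vocabulary
pickle.dump(tfidf, open("tfidf_vectorizer.pkl", "wb"))

# at prediction time: reload it and apply the already-fitted transform (no fit_transform)
tf1 = pickle.load(open("tfidf_vectorizer.pkl", "rb"))
encoded_body = tf1.transform(data['body']).toarray()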

Training loop for XGBoost on different datasets

I have built several different datasets and I want to write a for loop that runs the training for each of them and, at the end, gives me an RMSE for each dataset. I tried doing it in a for loop, but it does not work: it gives back the same value for every dataset, while I know the values should be different. The code that I have written is below:
for i in NEW_middle_index:
    DF = df1.iloc[i-100:i+100,:]
    # Append an empty sublist inside the list
    FINAL_DF.append(DF)
    y = DF.iloc[:,3]
    X = DF.drop(columns='Target')
    index_train = int(0.7 * len(X))
    X_train = X[:index_train]
    y_train = y[:index_train]
    X_test = X[index_train:]
    y_test = y[index_train:]
    scaler_x = MinMaxScaler().fit(X_train)
    X_train = scaler_x.transform(X_train)
    X_test = scaler_x.transform(X_test)
    xgb_r = xg.XGBRegressor(objective ='reg:linear',
                            n_estimators = 20, seed = 123)

for i in range(len(NEW_middle_index)):
    # print(i)
    # Fitting the model
    xgb_r.fit(X_train, y_train)
    # Predict the model
    pred = xgb_r.predict(X_test)
    # RMSE Computation
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    # print(rmse)
    RMSE.append(rmse)
Not sure if you indented it correctly. You are overwriting X_train and X_test, so when you fit your model it is always on the same (last) dataset, hence you get the same results.
One option is to fit the model as soon as you create the train/test dataframes. Otherwise, if you want to keep the train/test sets, you could store them in a list of dictionaries without changing too much of your code, something like below:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import xgboost as xg
df1 = pd.DataFrame(np.random.normal(0,1,(600,3)))
df1['Target'] = np.random.uniform(0,1,600)
NEW_middle_index = [100,300,500]
NEWDF = []
for i in NEW_middle_index:
    y = df1.iloc[i-100:i+100, 3]
    X = df1.iloc[i-100:i+100, :].drop(columns='Target')
    index_train = int(0.7 * len(X))
    scaler_x = MinMaxScaler().fit(X)
    X_train = scaler_x.transform(X[:index_train])
    y_train = y[:index_train]
    X_test = scaler_x.transform(X[index_train:])
    y_test = y[index_train:]
    NEWDF.append({'X_train':X_train, 'y_train':y_train, 'X_test':X_test, 'y_test':y_test})
Then we fit and calculate RMSE:
RMSE = []
xgb_r = xg.XGBRegressor(objective ='reg:linear', n_estimators = 20, seed = 123)
for i in range(len(NEW_middle_index)):
    xgb_r.fit(NEWDF[i]['X_train'], NEWDF[i]['y_train'])
    pred = xgb_r.predict(NEWDF[i]['X_test'])
    rmse = np.sqrt(mean_squared_error(NEWDF[i]['y_test'], pred))
    RMSE.append(rmse)
RMSE
[0.3524827559800294, 0.3098101362502435, 0.3843173269966071]
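One design note on the sketch above: the MinMaxScaler is fitted on the whole 200-row window before splitting, so the test rows influence the scaling. If you want to avoid that leakage, you could fit the scaler on the training slice only, for example:
scaler_x = MinMaxScaler().fit(X[:index_train])
X_train = scaler_x.transform(X[:index_train])
X_test = scaler_x.transform(X[index_train:])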

Scikit-learn Pipeline: Size of predictions on test set is equal to size of training set

I'm trying to get predictions on the test dataset. I'm using a scikit-learn Pipeline with MLPRegressor. However, I'm getting predictions the size of the train set, even though I'm using 'test.csv'.
What should I modify to obtain predictions whose length matches the test data?
train_pipeline.py
# Read training data
data = pd.read_csv(data_path, sep=';', low_memory=False, parse_dates=parse_dates)
# Fill all None records
data[config.TARGET] = data[config.TARGET].fillna(0)
data[config.TARGET] = data[config.TARGET].apply(lambda x: split_join_string(x) if (type(x) == str and len(x.split('.')) > 0) else x)
# Divide train and test
X_train, X_test, y_train, y_test = train_test_split(
    data[config.FEATURES],
    data[config.TARGET],
    test_size=0.1,
    random_state=0)  # we are setting the seed here
# Transform the target
y_train = y_train.apply(lambda x: np.log(float(x)) if x != 0 else 0)
y_test = y_test.apply(lambda x: np.log(float(x)) if x != 0 else 0)
data_test = pd.concat([X_test, y_test], axis=1)
# Save the test dataset to a '.csv' file without index
data_test.to_csv(data_path_test, sep=';', index=False)
pipeline.order_pipe.fit(X_train[config.FEATURES], y_train)
save_pipeline(pipeline_to_persist=pipeline.order_pipe)
predict.py
def make_prediction(*, input_data) -> dict:
    """Make a prediction using the saved model pipeline."""
    data = pd.DataFrame(input_data)
    validated_data = validate_inputs(input_data=data)
    prediction = _order_pipe.predict(validated_data[config.FEATURES])
    output = np.exp(prediction)
    # score = _order_pipe.score(validated_data[config.FEATURES], validated_data[config.TARGET])
    results = {'predictions': output, 'version': _version}
    _logger.info(f'Making predictions with model version: {_version}'
                 f'\nInputs: {validated_data}'
                 f'\nPredictions: {results}')
    return results
I expect the predictions to have the size of 'test.csv', but the actual predictions have the size of 'train.csv'. Do I need to fit or transform the test dataset in 'order_pipe' to get predictions of the right size?
I solved this problem by removing a preprocessor that was mangling the size of X_test. Because of it, X_test was being replaced by X_train and I was not able to make predictions of the right shape.
Furthermore, there was another preprocessor (creating dummies using pd.get_dummies()) that was inserting new columns and caused more problems during X_test predictions. I also replaced that preprocessor, encoding the categorical features using groupby() and map().
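If the dummy-encoding step was what changed the column set between train and test, another option (a sketch, not the pipeline actually used here) is to keep the encoding inside the Pipeline with a OneHotEncoder(handle_unknown='ignore'), so categories unseen at training time are encoded as all zeros instead of producing new columns; categorical_cols below is a placeholder for the actual list of categorical features:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPRegressor

# one-hot encode the categorical features, pass the rest through unchanged
preprocess = ColumnTransformer(
    [('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='passthrough')
order_pipe = Pipeline([('preprocess', preprocess), ('regressor', MLPRegressor())])
order_pipe.fit(X_train, y_train)
predictions = order_pipe.predict(X_test)   # length == len(X_test)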

Attributes mismatch between training and testing data in sklearn - linear regression

I am trying to train a linear regression model using sklearn to predict the likes of given tweets. I have the following features/attributes.
['id', 'month', 'hour', 'text', 'hasMedia', 'hasHashtag', 'followers_count', 'retweet_count', 'favourite_count', 'sentiment', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust', ......keywords............]
I use TfidfVectorizer for extracting keywords. The problem is that the number of keywords, and therefore the number of independent attributes, differs depending on the size of the training data. Because of this there is a mismatch of attributes between training and testing data, and I get ValueError: Shape of passed values is (1, 1678), indices imply (1, 1928).
It works fine when I split the same data into train and test sets and predict on the test split, as below.
Program for training and prediction
def train_favourite_prediction(result):
    result = result.drop(['retweet_count'], axis=1)
    result = result.dropna()
    X = result.loc[:, result.columns != 'favourite_count']
    y = result['favourite_count']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    # now you can save it to a file
    joblib.dump(regressor, os.path.join(dirname, '../../knowledge_base/knowledge_favourite.pkl'))
    return None

def predict_favourites(result):
    result = result.drop(['retweet_count'], axis=1)
    result = result.dropna()
    X = result.loc[:, result.columns != 'favourite_count']
    y = result['favourite_count']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    regressor = LinearRegression()
    # and later you can load it
    regressor = joblib.load(os.path.join(dirname, '../../knowledge_base/knowledge_favourite.pkl'))
    coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
    print(coeff_df)
    y_pred = regressor.predict(X_test)
    df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
    print(df)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print("the large training just finished")
    return None
Code for fit vectorization
Have a look at Applying Tfidfvectorizer on list of pos tags gives ValueError to understand the format of my 'text' column.
def ready_for_training(dataset):
    dataset = dataset.head(1000)
    dataset['text'] = dataset.text.apply(lambda x: literal_eval(x))
    dataset['text'] = dataset['text'].apply(
        lambda row: [item for sublist in row for item in sublist])
    tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
    keyword_response = tfidf.fit_transform(dataset['text'])
    keyword_matrix = pd.DataFrame(keyword_response.todense(), columns=tfidf.get_feature_names())
    keyword_matrix = keyword_matrix.loc[:, (keyword_matrix != 0).any(axis=0)]
    dataset['sentiments'] = dataset['sentiments'].map(eval)
    dataset = pd.concat([dataset.drop(['sentiments'], axis=1), dataset['sentiments'].apply(pd.Series)], axis=1)
    dataset = dataset.drop(['neg', 'neu', 'pos'], axis=1)
    dataset['emotions'] = dataset['emotions'].map(eval)
    dataset = pd.concat([dataset.drop(['emotions'], axis=1), dataset['emotions'].apply(pd.Series)], axis=1)
    dataset = dataset.drop(['id', 'month', 'text'], axis=1)
    result = pd.concat([dataset, keyword_matrix], axis=1, sort=False)
    return result
What I need is to predict 'favourite_count' when a new single tweet is given. When I get the keywords for this tweet I get only a few, whereas in training I trained with 1000+ keywords. I have stored the trained knowledge in a .pkl file. How should I handle this mismatch of attributes? To fill the missing columns in the testing tweet, as in Keep same dummy variable in training and testing data, I may need the training set as a dataframe. But I have stored the trained knowledge as a .pkl and won't be able to access the columns in it.
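One way to handle this (a sketch, assuming the keyword columns come only from the TfidfVectorizer) is to persist the fitted vectorizer alongside the regressor and, at prediction time, call transform on it so a new tweet is mapped into exactly the same keyword columns the model was trained on. The file name tfidf_favourite.pkl and the new_tweet dataframe are illustrative:
# training: save the fitted vectorizer next to the model
joblib.dump(tfidf, os.path.join(dirname, '../../knowledge_base/tfidf_favourite.pkl'))

# prediction: reload it and transform the new tweet's text
# (the text must be preprocessed the same way as in training: literal_eval'd and flattened)
tfidf = joblib.load(os.path.join(dirname, '../../knowledge_base/tfidf_favourite.pkl'))
keyword_matrix = pd.DataFrame(tfidf.transform(new_tweet['text']).todense(), columns=tfidf.get_feature_names())
Note that the step in ready_for_training that drops all-zero keyword columns would also need to be removed (or the surviving columns recorded), otherwise the training and prediction matrices will still have different widths.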
