I used MinMax normalization to normalize my dataset, both the features and the label. My question is: is it correct to normalize the label as well? If so, how can I denormalize the output of the neural network (the predictions made on the normalized test set)?
I can't upload the dataset, but it consists of 18 features and 1 label. It is a regression task, and the features and the label are physical quantities.
The problem is that y_train_pred and y_test_pred are between 0 and 1. How can I predict the "real" values?
The code:
dataset = pd.read_csv('DataSet.csv', decimal=',', delimiter = ";")
label = dataset.iloc[:,-1]
features = dataset.drop(columns = ['Label'])
features = features[best_features]
X_train1, X_test1, y_train1, y_test1 = train_test_split(features, label, test_size = 0.25, random_state = 1, shuffle = True)
y_test2 = y_test1.to_frame()
y_train2 = y_train1.to_frame()
scaler1 = preprocessing.MinMaxScaler()
scaler2 = preprocessing.MinMaxScaler()
X_train = scaler1.fit_transform(X_train1)
X_test = scaler2.fit_transform(X_test1)
scaler3 = preprocessing.MinMaxScaler()
scaler4 = preprocessing.MinMaxScaler()
y_train = scaler3.fit_transform(y_train2)
y_test = scaler4.fit_transform(y_test2)
optimizer = tf.keras.optimizers.Adamax(lr=0.001)
model = Sequential()
model.add(Dense(80, input_shape = (X_train.shape[1],), activation = 'relu',kernel_initializer='random_normal'))
model.add(Dropout(0.15))
model.add(Dense(120, activation = 'relu',kernel_initializer='random_normal'))
model.add(Dropout(0.15))
model.add(Dense(80, activation = 'relu',kernel_initializer='random_normal'))
model.add(Dense(1,activation = 'linear'))
model.compile(loss = 'mse', optimizer = optimizer, metrics = ['mse'])
history = model.fit(X_train, y_train, epochs = 300,
validation_split = 0.1, shuffle=False, batch_size=120
)
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
You should denormalize so that you get real-world predictions from your neural network, rather than numbers between 0 and 1.
Min-max normalization is defined by:
z = (x - min) / (max - min)
with z being the normalized value, x the label value, max the maximum x value, and min the minimum x value. So if we have z, min, and max, we can solve for x as follows:
x = z * (max - min) + min
Thus, before you normalize your data, store the minimum and maximum values of the label (assuming it is continuous). Then, after you get your predicted values, you can use the following function:
y_max_pre_normalize = max(label)
y_min_pre_normalize = min(label)
def denormalize(y):
    final_value = y * (y_max_pre_normalize - y_min_pre_normalize) + y_min_pre_normalize
    return final_value
And apply this function to your y_test/y_pred to get the corresponding value.
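Alternatively, since MinMaxScaler objects were already fitted on the labels in the code above, scikit-learn can undo the scaling for you via inverse_transform. A minimal sketch, assuming the model and the label scaler (scaler3) from the question are still in scope; note that, in general, a scaler should be fitted on the training data only and then reused to transform the test data:
# map the [0, 1] predictions back to the original physical units
y_train_pred_real = scaler3.inverse_transform(y_train_pred)
y_test_pred_real = scaler3.inverse_transform(y_test_pred)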
I have built a multi-class classification model with Keras, and now that training is finished I would like to predict the value for one of my test inputs.
This is the part where I scaled the features:
x = dataframe.drop("workTime", axis = 1)
x = dataframe.drop("creation", axis = 1)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = pd.DataFrame(sc.fit_transform(x))
y = dataframe["workTime"]
import seaborn as sb
corr = dataframe.corr()
sb.heatmap(corr, cmap="Blues", annot=True)
print("Scaled features:", x.head(3))
Then I did:
y_cat = to_categorical(y)
x_train, x_test, y_train, y_test = train_test_split(x.values, y_cat, test_size=0.2)
And built model:
model = Sequential()
model.add(Dense(16, input_shape = (9,), activation = "relu"))
model.add(Dense(8, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(6, activation = "softmax"))
model.compile(Adam(lr = 0.0001), "categorical_crossentropy", metrics = ["categorical_accuracy"])
model.summary()
model.fit(x_train, y_train, verbose=1, batch_size = 8, epochs=100, shuffle=True)
After training finished, I wanted to take the first element from the test data and predict/classify its value.
print(x_test.shape, x_train.shape)  # (1550, 9) (6196, 9)

firstTest = x_test[:1]
# [[ 2.76473141  1.21064165  0.18816548 -0.94077449 -0.30981017 -0.37723917
#   -0.44471711 -1.44141792  0.20222467]]

prediction = model.predict(firstTest)
print(prediction)
# [[7.5265622e-01 2.4710520e-01 2.3643016e-04 2.1405797e-06 3.8411264e-19
#   9.4137732e-23]]

print(prediction[0])
# [7.5265622e-01 2.4710520e-01 2.3643016e-04 2.1405797e-06 3.8411264e-19
#  9.4137732e-23]

unscaled = sc.inverse_transform(prediction)
print("prediction", unscaled)
This raises:
ValueError: operands could not be broadcast together with shapes (1,6) (9,) (1,6)
I think it may be related to my scalers.
And please correct me if I'm wrong, but what I want to achieve here is either a single output value that tells me how this entry was classified, or an array of probabilities for each class label.
Thank you for any hints.
Your StandardScaler was used to scale the input features; you can't apply it (or its inverse) to the outputs!
If you are looking for the probabilities of the test sample being in each class, you already have it in prediction[0].
If you want the final class predicted, just take the one with the largest probability with argmax: tf.math.argmax(prediction[0]).
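A minimal sketch of both options, assuming prediction comes from the model above (np.argmax works just as well as tf.math.argmax here):
import numpy as np

probabilities = prediction[0]               # per-class probabilities from the softmax layer
predicted_class = np.argmax(probabilities)  # index of the most likely class
print(predicted_class, probabilities[predicted_class])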
I've created a binary classification model which predicts whether an article belongs to the positive or the negative class. I am using TF-IDF features fed into an XGBoost classifier, alongside one other feature. I get an AUC score very close to 1 both when training/testing and when cross-validating.
However, I got a 0.5 score when testing on my holdout data. That seemed odd to me, so I fed the very same training data into my model, and even that returns a 0.5 AUC score. The code below takes in a dataframe, fits and transforms the TF-IDF vectors, and formats it all into a DMatrix.
def format_to_dmatrix(known_targets):
    y = known_targets['target']
    X = known_targets[['body', 'day_of_year']]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=42)

    tfidf.fit(X_train['body'])
    pickle.dump(tfidf.vocabulary_, open("tfidf_features.pkl", "wb"))
    X_train_enc = tfidf.transform(X_train['body']).toarray()
    X_test_enc = tfidf.transform(X_test['body']).toarray()

    new_cols = tfidf.get_feature_names()
    new_cols.append('day_of_year')

    a = np.array(X_train['day_of_year'])
    a = a.reshape(a.shape[0], 1)
    b = np.array(X_test['day_of_year'])
    b = b.reshape(b.shape[0], 1)

    X_train = np.append(X_train_enc, a, axis=1)
    X_test = np.append(X_test_enc, b, axis=1)

    dtrain = xgb.DMatrix(X_train, label=y_train.values, feature_names=new_cols)
    dtest = xgb.DMatrix(X_test, label=y_test.values, feature_names=new_cols)
    return dtrain, dtest, tfidf
I cross-validate and find a test-auc-mean of .9979, so I save the model as shown below.
best_model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")]
)
This is my code to load in new data:
def test_newdata(data):
    tf1 = pickle.load(open("tfidf_features.pkl", 'rb'))
    tf1_new = TfidfVectorizer(max_features=1500, lowercase=True, analyzer='word',
                              stop_words='english', ngram_range=(1, 1), vocabulary=tf1.keys())
    encoded_body = tf1_new.fit_transform(data['body']).toarray()

    new_cols = tf1_new.get_feature_names()
    new_cols.append('day_of_year')

    day_of_year = np.array(data['day_of_year'])
    day_of_year = day_of_year.reshape(day_of_year.shape[0], 1)

    formatted_test_data = np.append(encoded_body, day_of_year, axis=1)
    df = pd.DataFrame(formatted_test_data, columns=new_cols)
    return xgb.DMatrix(df)
And the code below shows that my AUC score is 0.5 despite loading in the very same data. Is there an error I've missed somewhere?
loaded_model = xgb.Booster()
loaded_model.load_model("earn_modelv3.model")
holdout = known_targets
formatted_test_data = test_newdata(holdout)
holdout_preds = loaded_model.predict(formatted_test_data)
predictions_binary = np.where(holdout_preds > .5, 1, 0)
round(roc_auc_score(holdout['target'], predictions_binary), 4)
I am trying to predict 100 future values of a function f(x). I found a script online that predicted a stock price into the future and decided to modify it for the purposes of my research. Unfortunately, the prediction is quite a bit off. Here is the script:
#import packages
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
import tensorflow as tf
#for normalizing data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
#read the file
df = pd.read_csv('MyDataSet')
#print the head
df.head()
look_back = 60
num_periods = 20
NUM_NEURONS_FirstLayer = 128
NUM_NEURONS_SecondLayer = 64
EPOCHS = 1
num_prediction = 100
training_size = 8000 #initial
#creating dataframe
data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(0,len(df)),columns=['XValue', 'YValue'])
for i in range(0, len(data)):
    new_data['XValue'][i] = data['XValue'][i]
    new_data['YValue'][i] = data['YValue'][i]
#setting index
new_data.index = new_data.XValue
new_data.drop('XValue', axis=1, inplace=True)
#creating train and test sets
dataset = new_data.values
train = dataset[0:training_size,:]
valid = dataset[training_size:,:]
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(dataset)
x_train, y_train = [], []
for i in range(look_back, len(train)):
    x_train.append(scaled_data[i-look_back:i, 0])
    y_train.append(scaled_data[i, 0])
x_train, y_train = np.array(x_train), np.array(y_train)
x_train = np.reshape(x_train, (x_train.shape[0],x_train.shape[1],1))
#Custom Loss Function
def my_loss_fn(y_true, y_pred):
    squared_difference = tf.square(y_true - y_pred)
    return tf.reduce_mean(squared_difference, axis=-1)
#Training Phase
model = Sequential()
model.add(LSTM(NUM_NEURONS_FirstLayer,input_shape=(look_back,1),
return_sequences=True))
model.add(LSTM(NUM_NEURONS_SecondLayer,input_shape=(NUM_NEURONS_FirstLayer,1)))
model.add(Dense(1))
model.compile(loss=my_loss_fn, optimizer='adam')
model.fit(x_train,y_train,epochs=EPOCHS,batch_size=2, verbose=2)
inputs = dataset[(len(dataset) - len(valid) - look_back):]
inputs = inputs.reshape(-1,1)
inputs = scaler.transform(inputs)
X_test = []
for i in range(look_back, inputs.shape[0]):
    X_test.append(inputs[i-look_back:i, 0])
X_test = np.array(X_test)
#Validation Phase
X_test = np.reshape(X_test, (X_test.shape[0],X_test.shape[1],1))
PredictValue = model.predict(X_test)
PredictValue = scaler.inverse_transform(PredictValue)
#Measures the RMSD between the validation data and the predicted data
rms=np.sqrt(np.mean(np.power((valid-PredictValue ),2)))
train = new_data[:int(new_data.index[training_size])]
valid = new_data[int(new_data.index[training_size]):]
valid['Predictions'] = PredictValue
valid = np.asarray(valid).astype('float32')
valid = valid.reshape(-1,1)
#Prediction Phase
def predict(num_prediction, model):
    prediction_list = valid[-look_back:]
    for _ in range(num_prediction):
        x = prediction_list[-look_back:]
        x = x.reshape((1, look_back, 1))
        out = model.predict(x)[0][0]
        prediction_list = np.append(prediction_list, out)
    prediction_list = prediction_list[look_back-1:]
    return prediction_list
#Get the predicted x-value
def predict_x(num_prediction):
    last_x = df['XValue'].values[-1]
    XValue = range(int(last_x), int(last_x) + num_prediction + 1)
    prediction_x = list(XValue)
    return prediction_x
#Predict the y-value and the x-value
forecast = predict(num_prediction, model)
forecast_x = predict_x(num_prediction)
The file I was using contains the function f(x) = sin(2πx/10), where x runs over the natural numbers from 0 to 10,000. Following the above, I get the following plot:
[Plot: Prediction-1]
Granted, I am only training for 1 epoch, and I am currently running a new job with 50 epochs, but I was hoping for something a lot better. Any advice on improving the prediction here?
I also modified the script to predict only one value at a time, append it to the dataset, and re-run the script, 100 times over. It takes a long time and obviously has its own issues (such as feeding the predicted value from the previous step back in as an input), but it's the best solution I can think of right now. This is the plot I got for it. I can attach this file if you want, but it's not the approach I want to keep focusing on for this project:
[Plot: Prediction-2]
Any help in getting a better prediction would be much appreciated.
I'm new to machine learning with Python. I'm trying to predict a target variable, say the price of a house, but I'm using polynomial features of higher degree to create the model.
I have two data sets, and I've built my model using one of them.
How do I apply this model to an entirely new data set?
I'm attaching my code below:
data1 = pd.read_csv(r"C:\Users\DELL\Desktop\experimental data/xyz1.csv", engine = 'c', dtype=float, delimiter = ",")
data2 = pd.read_csv(r"C:\Users\DELL\Desktop\experimental data/xyz2.csv", engine = 'c', dtype=float, delimiter = ",")
# I have to do this step, otherwise I always get a NaN or infinite value error
data1.fillna(0.000, inplace=True)
data2.fillna(0.000, inplace=True)
X_train = data1.drop('result', axis = 1)
y_train = data1.result
X_test = data2.drop('result', axis = 1)
y_test = data2.result
x2_ = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_train)
x3_ = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X_train)
model2 = LinearRegression().fit(x2_, y_train)
model3 = LinearRegression().fit(x3_, y_train)
r_sq2 = model2.score(x2_, y_train)
r_sq3 = model3.score(x3_, y_train)
y_pred2 = model2.predict(x2_)
y_pred3 = model3.predict(x3_)
So basically I'm stuck after this.
How do I apply this same model to my test data to predict the y_test values and compute the score?
To reproduce the effect of PolynomialFeatures, you need to store the fitted object itself (once for degree=2 and again for degree=3). Otherwise, you have no way to apply the same transform to the test dataset.
X_train = data1.drop('result', axis = 1)
y_train = data1.result
X_test = data2.drop('result', axis = 1)
y_test = data2.result
# store these data transform objects
pf2 = PolynomialFeatures(degree=2, include_bias=False)
pf3 = PolynomialFeatures(degree=3, include_bias=False)
# then apply the transform to the training set
x2_ = pf2.fit_transform(X_train)
x3_ = pf3.fit_transform(X_train)
model2 = LinearRegression().fit(x2_, y_train)
model3 = LinearRegression().fit(x3_, y_train)
r_sq2 = model2.score(x2_, y_train)
r_sq3 = model3.score(x3_, y_train)
y_pred2 = model2.predict(x2_)
y_pred3 = model3.predict(x3_)
# now apply the fitted transform to the test set
x2_test = pf2.transform(X_test)
x3_test = pf3.transform(X_test)
# apply trained model to transformed test data
y2_test_pred = model2.predict(x2_test)
y3_test_pred = model3.predict(x3_test)
# compute the model accuracy for the test data
r_sq2_test = model2.score(x2_test, y_test)
r_sq3_test = model3.score(x3_test, y_test)
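As a side note, a scikit-learn Pipeline can bundle the polynomial expansion and the regression together, so the fitted PolynomialFeatures object automatically travels with the model. A minimal sketch, reusing the same variable names as above:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# fit the polynomial expansion and the regression in one step
pipe2 = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
pipe2.fit(X_train, y_train)

# predict and score on new data without handling the transform manually
y2_test_pred = pipe2.predict(X_test)
r_sq2_test = pipe2.score(X_test, y_test)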
I'm having trouble with LSTM and Keras.
I'm trying to classify domain names as normal or fake.
My dataset is like this:
domain,fake
google, 0
bezqcuoqzcjloc,1
...
with 50% normal and 50% fake domains
Here's my LSTM model:
def build_model(max_features, maxlen):
    """Build LSTM model"""
    model = Sequential()
    model.add(Embedding(max_features, 128, input_length=maxlen))
    model.add(LSTM(64))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['acc'])
    return model
then I preprocess my text data to transform it into numbers:
"""Run train/test on logistic regression model"""
indata = data.get_data()
# Extract data and labels
X = [x[1] for x in indata]
labels = [x[0] for x in indata]
# Generate a dictionary of valid characters
valid_chars = {x:idx+1 for idx, x in enumerate(set(''.join(X)))}
max_features = len(valid_chars) + 1
maxlen = 100
# Convert characters to int and pad
X = [[valid_chars[y] for y in x] for x in X]
X = sequence.pad_sequences(X, maxlen=maxlen)
# Convert labels to 0-1
y = [0 if x == 'benign' else 1 for x in labels]
Then I split my data into training, testing and validation sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print("Build model...")
model = build_model(max_features, maxlen)
print("Train...")
X_train, X_holdout, y_train, y_holdout = train_test_split(X_train, y_train, test_size=0.2)
And then I train my model on training data and validation data, and evaluate on test data:
history = model.fit(X_train, y_train, epochs=max_epoch, validation_data=(X_holdout, y_holdout), shuffle=False)
scores = model.evaluate(X_test, y_test, batch_size=batch_size)
At the end of training/testing, I get these scores when evaluating on the test dataset:
loss = 0.060554939906234596
accuracy = 0.978109902033532
However when I predict on a sample of the dataset like this:
LSTM_model = load_model('LSTMmodel_64_sgd.h5')
data = pickle.load(open('traindata.pkl', 'rb'))
#### LSTM ####
"""Run train/test on logistic regression model"""
# Extract data and labels
X = [x[1] for x in data]
labels = [x[0] for x in data]
X1, _, labels1, _ = train_test_split(X, labels, test_size=0.9)
# Generate a dictionary of valid characters
valid_chars = {x:idx+1 for idx, x in enumerate(set(''.join(X1)))}
max_features = len(valid_chars) + 1
maxlen = 100
# Convert characters to int and pad
X1 = [[valid_chars[y] for y in x] for x in X1]
X1 = sequence.pad_sequences(X1, maxlen=maxlen)
# Convert labels to 0-1
y = [0 if x == 'benign' else 1 for x in labels1]
y_pred = LSTM_model.predict(X1)
I have very poor performance:
accuracy = 0.5934741842730341
confusion matrix = [[25201 14929]
                    [17589 22271]]
F1-score = 0.5780171295094731
Can someone explain to me why?
I have tried 64 instead of 128 LSTM units, adam and rmsprop as optimizers, and increasing the batch_size, but performance remains very low.
OK, so I have found the answer.
It is this line:
valid_chars = {x:idx+1 for idx, x in enumerate(set(''.join(X1)))}
In Python 3, hash randomization means that iterating over a set of strings yields a different order every time a new interpreter is started, so the character-to-integer mapping built here at prediction time does not match the one used during training.
So running the code in Python 2 (where hash randomization is off by default) resolved my issue!
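If you need to stay on Python 3, here is a minimal sketch of two ways to make the mapping deterministic: sort the character set, or persist the training-time valid_chars and reload it at prediction time. The variable names match the snippets above:
import pickle

# Option 1: sort the characters so the mapping is the same on every run
valid_chars = {c: idx + 1 for idx, c in enumerate(sorted(set(''.join(X))))}

# Option 2: save the mapping built at training time...
with open('valid_chars.pkl', 'wb') as f:
    pickle.dump(valid_chars, f)

# ...and reload the exact same mapping at prediction time
with open('valid_chars.pkl', 'rb') as f:
    valid_chars = pickle.load(f)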