I'm using the Cardiovascular Disease Dataset from Kaggle.
The model has been trained, and what I want to do is label a single input (a row of 13 values) supplied dynamically at runtime.
The dataset has 13 features + 1 target and about 66k rows.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# prepare dataset for train and test
dfCardio = pd.read_csv("cleanCardio.csv")
y = dfCardio['cardio']
x = dfCardio.drop('cardio', axis=1, inplace=False)
model = KNeighborsClassifier()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
model.fit(x_train, y_train)
# make predictions for test data
y_pred = model.predict(x_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
The model is trained; what I want is to predict the label of this single row:
['69','1','151','22','37','0','65','140','90','2','1','0','0','1']
so that it returns 0 or 1 for the target.
So I wrote this code:
import numpy as np
import pandas as pd
single = np.array(['69','1','151','22','37','0','65','140','90','2','1','0','0','1'])
singledf = pd.DataFrame(single)
final = singledf.transpose()
prediction = model.predict(final)
print(prediction)
but it gives the error: query data dimension must match training data dimension
How can I fix the labeling for a single row? Why am I not able to predict a single case?
Each instance in your dataset has 13 features and 1 label.
x = dfCardio.drop('cardio',axis = 1, inplace=False)
This line in the code removes what I assume is the label column from the data, leaving only the (13) feature columns.
The feature vector you are trying to predict on is 14 elements long. You can only predict on feature vectors that are 13 elements long, because that is what the model was trained on.
If you are looking for a quick, working solution, you can use this:
import numpy as np
import pandas as pd
single = np.array([['69','1','151','22','37','0','65','140','90','2','1','0','0']])
prediction = model.predict(single)
print(prediction)
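A slightly more robust variant (just a sketch, assuming the training features are still available as the DataFrame x from above) is to build the single row as a one-row DataFrame with the same column names and order as the training data, using numeric values instead of strings:
import pandas as pd
# the 13 feature values, in the same order as the columns of x (target excluded)
values = [69, 1, 151, 22, 37, 0, 65, 140, 90, 2, 1, 0, 0]
# reuse the training column names so the feature order matches what the model was fit on
single_row = pd.DataFrame([values], columns=x.columns)
prediction = model.predict(single_row)
print(prediction)  # array containing a single 0 or 1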
I disagree with the others; this is not a problem with including the target.
I had this problem too. The only way I got around it was to pass in part of x.
So:
x2 = x.iloc[0:3]
Then give the first row a new value:
x2.iloc[0] = single
ypred = model.predict(x2)
and just look at ypred[0].
Or try a DataFrame with two rows.
I have a dataset that shows whether a person has diabetes based on several indicators; it looks like this (original dataset):
I've created a straightforward model in order to predict the last column (Outcome).
#Libraries imported
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#Dataset imported
data = pd.read_csv('diabetes.csv')
#Assign X and y
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values
#Data preprocessed
sc = StandardScaler()
X = sc.fit_transform(X)
#Dataset split between train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Predicting the results for the whole dataset
y_pred2 = model.predict(data)
#Add prediction column to original dataset
data['prediction'] = y_pred2
However, I get the following error: ValueError: X has 9 features per sample; expecting 8.
My questions are:
Why can't I create a new column with the predictions for my entire dataset?
How can I make predictions for blank outcomes (that need to be predicted)? That is to say, should I upload the file again? Let's say I want to predict the following:
Rows to predict:
Please let me know if my questions are clear!
You are feeding data (with all 9 initial features) to a model that was trained with X (8 features, since Outcome has been removed to create y), hence the error.
What you need to do is:
Get predictions using X instead of data
Append the predictions to your initial data set
i.e.:
y_pred2 = model.predict(X)
data['prediction'] = y_pred2
Keep in mind that this means your prediction column will come both from data that have already been used for model fitting (the X_train part) and from data unseen by the model during training (the X_test part). I'm not quite sure what your final objective is (and it isn't what the question is about anyway), but this is a rather unusual situation from an ML point of view.
If you have a new dataset data_new and want to predict the outcome, you do it in a similar way, always assuming that X_new has the same features as X (i.e. again removing the Outcome column, as you did for X):
y_new = model.predict(X_new)
data_new['prediction'] = y_new
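One caveat worth adding (a sketch, not part of the original code): in the question, X was standardized with a StandardScaler (sc) before fitting, so a new feature matrix should be passed through that same fitted scaler before calling predict. Here X_new is assumed to be the raw new features with the Outcome column removed:
# apply the scaler that was fitted on the original features
X_new_scaled = sc.transform(X_new)
y_new = model.predict(X_new_scaled)
data_new['prediction'] = y_new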
I have an Excel file which contains 3 simple columns: Unit Price, Quantity Sold, and Total. The Total is simply obtained by multiplying the unit price by the quantity. So I set up a simple sklearn linear regression code to predict this value:
import numpy as np
import pandas as pd
from sklearn import linear_model, model_selection
data = pd.read_excel("Sample.xls")[["Units", "Unit Cost", "Total"]]
predict = "Total"
X = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])
x_train, x_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.1)
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
acc = linear.score(x_test, y_test)
print(acc)
I ran the print function 3 times and it gave me the following results:
-1.517292267629657
0.9167778505657976
0.15292336331892453
Why am I getting this and not 100% accuracy? The model should recognise that the prediction is simply the multiplication of the first column by the second.
Linear regression doesn't work that way. Plot a graph using matplotlib and see whether you get a straight line for the training data. If your inputs are X1 and X2, your output is X1*X2, which is not of the form a*X1 + b*X2. The model is predicting correctly; what is wrong is the way the model is being applied.
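One way to see this (a sketch with made-up toy data, not from the original post): if you add the product of the two columns as an extra feature, a linear model can represent the target exactly and the test score becomes ~1.0:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# toy data mimicking the problem: total = units * unit_cost
rng = np.random.default_rng(0)
units = rng.integers(1, 100, size=500)
unit_cost = rng.uniform(1.0, 50.0, size=500)
total = units * unit_cost
# add the interaction term units * unit_cost as a third feature
X = np.column_stack([units, unit_cost, units * unit_cost])
X_train, X_test, y_train, y_test = train_test_split(X, total, test_size=0.1, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # ~1.0, since the target is now linear in the features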
I am currently trying to build a neural network to predict a time series, but the question is: is it possible to predict further than just the test dataset? For example, I have a dataset of about 3000 values, of which I keep 90% for training and 10% for testing. When I compare the prediction with the actual test values, it matches, but is it possible, for instance, to ask the program to predict the next 500 values (i.e. from 3001 to 3500)?
Here is a snippet of the code I use.
import csv
import numpy as np
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from keras.layers import Dense, Activation, Dropout, LSTM, GRU
from keras.models import Sequential
from keras import optimizers
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import learning_curve
from sklearn.kernel_ridge import KernelRidge
import time
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (-1, 1))
def load_data(datasetname, column, seq_len, normalise_window):
    # A support function to help prepare datasets for an RNN/LSTM/GRU
    data = datasetname.loc[:, column]
    sequence_length = seq_len + 1
    result = []
    for index in range(len(data) - sequence_length):
        result.append(data[index: index + sequence_length])
    result = np.array(result)
    result.reshape(-1, 1)  # note: this has no effect, the reshaped array is not assigned
    training_set_scaled = sc.fit_transform(result)
    print(result)
    # Last 10% is used for validation test, first 90% for training
    row = round(0.9 * training_set_scaled.shape[0])
    train = training_set_scaled[:int(row), :]
    #np.random.shuffle(train)
    x_train = train[:, :-1]
    y_train = train[:, -1]
    X_test = training_set_scaled[int(row):, :-1]
    y_test = training_set_scaled[int(row):, -1]
    print("x_train shape", x_train.shape)
    print("X_test shape", X_test.shape)
    x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
    X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
    return [x_train, X_test, y_train, y_test]
def build_model():
    model = Sequential()
    layers = {'input': 100, 'hidden1': 150, 'hidden2': 256, 'hidden3': 100, 'output': 10}
    model.add(LSTM(
        50,
        return_sequences=True,
        input_shape=(200, 1)
    ))
    model.add(Dropout(0.2))
    model.add(LSTM(
        layers['hidden2'],
        return_sequences=True,
    ))
    model.add(Dropout(0.2))
    model.add(LSTM(
        layers['hidden3'],
        return_sequences=False,
    ))
    model.add(Dropout(0.2))
    model.add(Activation("linear"))
    model.add(Dense(layers['output']))
    start = time.time()
    model.compile(loss="mean_squared_error", optimizer="adam")
    print("Compilation Time : ", time.time() - start)
    return model
dataset = pd.read_csv('data.csv')
X_train, X_test, y_train, y_test = load_data(dataset, 'mean anomaly', 200, False)
model = build_model()
print ("train",X_train)
print ("test",X_test)
model.fit(X_train, y_train, batch_size=256, epochs=1, validation_split=0.05)
predictions = model.predict(X_test)
predictions = np.reshape(predictions, (predictions.size,))
plt.figure(1)
plt.subplot(311)
plt.title("Actual Test Signal w/Anomalies & noise")
plt.plot(y_test)
plt.subplot(312)
plt.title("predicted signal")
plt.plot(predictions, 'g')
plt.subplot(313)
plt.title("training signal")
plt.plot(y_train, 'b')
plt.plot(y_test, 'y')
plt.legend(['train', 'test'])
plt.show()
I have read that I should increase the output dimension of the dense layer to get more than one predicted value, or increase the size of my window in the load_data function?
Here is the result: the yellow plot is supposed to come after the blue one and represents my input test data; the first plot is a zoom on this data and the second one is the prediction.
If you want to predict the output value of your series at t+x based on data at time t, the data you feed to the network should already have this format.
Time series data formatting:
If you have 3000 data points and want to predict the output value for the next "virtual" 500 points, you should offset the output value by this amount. For example:
In your dataset, your 500th data point corresponds to the 500th output value. If you want to predict "future" values, then the 500th data point should be paired with the 1000th output value. You can do this in pandas with the shift function. Be aware that you will lose the last 500 data points by doing so, as they will no longer have an output value.
Then when you predict on data point x_i you will get the output value y_(i+500). You should find some basic guides for time series forecasting on sites like machinelearningmastery.
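A minimal sketch of that shift in pandas, assuming a DataFrame loaded from the same data.csv with a 'mean anomaly' column as in the question:
import pandas as pd
df = pd.read_csv('data.csv')
horizon = 500  # how many steps ahead you want to predict
# pair each data point with the value `horizon` steps in the future
df['target'] = df['mean anomaly'].shift(-horizon)
# the last `horizon` rows no longer have a future value, so drop them
df = df.dropna(subset=['target'])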
Good practice for model evaluation:
If you want to better evaluate the quality of your model, first find some metrics that suit your problem and try to increase the test set percentage. While graphics are a good way to visualise results, they can be deceiving, so try combining them with some metrics! (Be careful with Mean Squared Error: it can give you a biased score when values are in the range [-1, 1], as the square of an error in this range is always less than the actual error; try Mean Absolute Error instead.)
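For reference, a short sketch computing both metrics with sklearn, using y_test and predictions as produced by the question's code:
from sklearn.metrics import mean_absolute_error, mean_squared_error
print("MAE:", mean_absolute_error(y_test, predictions))
print("MSE:", mean_squared_error(y_test, predictions))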
Data leakage when scaling data:
While scaling data is usually a good thing, you need to be careful doing so. You committed something called a data leak: you scaled the whole dataset before splitting it into training and test sets. Further reading about this data leak.
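A sketch of how the scaling step inside load_data could be rearranged to avoid the leak, reusing the result array and the MinMaxScaler sc from the question's code:
# split first, on the raw (unscaled) sequences
row = round(0.9 * result.shape[0])
train_raw = result[:int(row), :]
test_raw = result[int(row):, :]
# fit the scaler on the training portion only, then apply it to both parts
train_scaled = sc.fit_transform(train_raw)
test_scaled = sc.transform(test_raw)
x_train, y_train = train_scaled[:, :-1], train_scaled[:, -1]
X_test, y_test = test_scaled[:, :-1], test_scaled[:, -1]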
Update
I think I misunderstood your problem.
If you want to "predict further than just the test dataset" you will need some unseen/new data to make more predictions. The test set is only there to evaluate the performance of the learning phase.
Now, if you want to predict further than just the next step (this won't allow you to "predict further than just the test dataset", because of the way you change your dataset; see below):
Your model, as it is built, will only ever predict the next step.
In your example you feed the algorithm series of length 'seq_len' and give as output the value right after the end of those series. If you want your algorithm to learn to predict more than one step into the future, your y_train must contain the value at the corresponding time in the future, for example:
x = [0,1,2,3,4,5,6,7,8,9,10,...]
seq_len = 5
step_to_predict = 5
So to predict not one step into the future but five, your series will have to look like this :
x_serie_1 = [0,1,2,3,4]
y_serie_1 = [9]
x_serie_2 = [1,2,3,4,5]
y_serie_2 = [10]
This is a way to get your model to learn how to make predictions further into the future than just the next step.
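A small sketch of building such (input, target) pairs with NumPy; the names seq_len and step_to_predict follow the example above:
import numpy as np
data = np.arange(100)   # stand-in for the raw series
seq_len = 5
step_to_predict = 5     # how many steps ahead the target lies
x_series, y_series = [], []
for i in range(len(data) - seq_len - step_to_predict + 1):
    x_series.append(data[i:i + seq_len])                      # e.g. [0, 1, 2, 3, 4]
    y_series.append(data[i + seq_len + step_to_predict - 1])  # e.g. 9
x_series = np.array(x_series)
y_series = np.array(y_series)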
import pandas as pd
import numpy
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

fi = "df.csv"
# Open the file for reading and read in data
file_handler = open(fi, "r")
data = pd.read_csv(file_handler, sep=",")
file_handler.close()
# split the data into training and test data
train, test = train_test_split(data, test_size=0.6, random_state=0)
# initialise Gaussian Naive Bayes
naive_b = GaussianNB()
train_features = train.iloc[:, 0:127]
train_label = train.iloc[:, 127]
test_features = test.iloc[:, 0:127]
test_label = test.iloc[:, 127]
naive_b.fit(train_features, train_label)
test_data = pd.concat([test_features, test_label], axis=1)
test_data["p_malw"] = naive_b.predict_proba(test_features)
print("test_data\n", test_data["p_malw"])
print("Accuracy:", naive_b.score(test_features, test_label))
I have written this code to accept input from a CSV file with 128 columns, where 127 columns are features and the 128th column is the class label.
I want to predict the probability that a sample belongs to each class (there are 5 classes, 1-5), print it in the form of a matrix, and determine the class of each sample based on the prediction. predict_proba() is not giving the desired output. Please suggest the required changes.
GaussianNB.predict_proba returns the probabilities of the samples for each class in the model. In your case, it should return a result with five columns and the same number of rows as your test data. You can verify which column corresponds to which class using naive_b.classes_, so it is not clear why you are saying that this is not the desired output. Perhaps your problem comes from the fact that you are assigning the output of predict_proba to a DataFrame column. Try:
pred_prob = naive_b.predict_proba(test_features)
instead of
test_data["p_malw"] = naive_b.predict_proba(test_features)
and verify its shape using pred_prob.shape. The second dimension should be 5.
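If you then want that matrix with labelled columns, a small sketch using the classes_ attribute mentioned above:
import pandas as pd
# one column per class, one row per test sample
prob_matrix = pd.DataFrame(pred_prob, columns=naive_b.classes_, index=test_features.index)
print(prob_matrix)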
If you want the predicted label for each sample you can use the predict method, followed by confusion matrix to see how many labels have been predicted correctly.
from sklearn.metrics import confusion_matrix
naive_b.fit(train_features, train_label)
pred_label = naive_b.predict(test_features)
confusion_m = confusion_matrix(test_label, pred_label)
confusion_m
Here is some useful reading.
sklearn GaussianNB - http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB.predict_proba
sklearn confusion_matrix - http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
I was trying to train and test my dataset using MLPRegressor. I have two datasets (a train dataset and a test dataset); both of them have exactly the same feature and label columns. Here's an example of my datasets:
Full,Id,Id & PPDB,Id & Words Sequence,Id & Synonyms,Id & Hypernyms,Id & Hyponyms,Gold Standard
1.667,0.476,0.952,0.476,1.429,0.952,0.476,2.345
3.056,1.111,1.667,1.111,3.056,1.389,1.111,1.9
1.765,1.176,1.176,1.176,1.765,1.176,1.176,2.2
0.714,0.714,0.714,0.714,0.714,0.714,0.714,0.0
................
Here's my code:
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPRegressor
randomseed = 0  # integer seed passed to random_state below
datatraining = pd.read_csv("datatrain.csv")
datatesting = pd.read_csv("datatest.csv")
columns = ["Full","Id","Id & PPDB","Id & Words Sequence","Id & Synonyms","Id & Hypernyms","Id & Hyponyms"]
labeltrain = datatraining["Gold Standard"].values
featurestrain = datatraining[list(columns)].values
labeltest = datatesting["Gold Standard"].values
featurestest = datatesting[list(columns)].values
X_train = featurestrain
y_train = labeltrain
X_test = featurestest
y_test = labeltest
mlp = MLPRegressor(solver='lbfgs', hidden_layer_sizes=50, max_iter=1000, learning_rate='constant', random_state=randomseed)
mlp.fit(X_train, y_train)
print('Accuracy training : {:.3f}'.format(mlp.score(X_train, y_train)))
print()
predicting = mlp.predict(X_test)
print(predicting)
print()
And here's the result of the prediction :
[ 1.97553444 3.43401776 3.04097607 2.7015464 2.03777686 3.63274593
3.37826962 -0.60260337 0.41626517 3.5374289 3.66114929 3.244683
2.6313756 2.14243075 3.20841434 2.105238 4.9805092 4.00868273
2.45508505 4.53332828 3.41862096 3.35721078 3.23069344 3.72149434
4.9805092 2.61705563 1.55052494 -0.14135979 2.65875196 3.05328206
3.51127424 0.51076396 2.39947967 1.95916595 3.71520651 2.1526807
2.26438616 0.73249057 2.46888695 3.56976227 1.03109988 2.15894353
2.06396103 0.66133707 4.72861602 2.4592647 2.84176811 2.3157664
1.68426416 2.56022955 -0.00518545 1.67213609 0.6998739 3.25940136
3.25369266 3.88888542 1.9168694 2.26036302 3.97917769 2.00322903
3.03121106 3.29083723 0.6998739 4.33375678 0.6998739 2.71141538
-4.23755447 3.958574 2.67765274 2.68715423 2.32714117 2.6500056
........]
As we can see, there are some negative results. How can I avoid predicting negative results? Besides, my datasets contain only positive values.
I'm assuming you have no categorical variables. Also, you mentioned in the question that all your values are positive.
Try standardizing your data using StandardScaler(). Use your X_train to fit the scaler.
from sklearn import preprocessing as pre
...
scaler = pre.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the scaler fitted on the training data
After initializing the model with the best parameters for your case, fit it on the scaled data:
mlp.fit(X_train_scaled, y_train)
...
predicting = mlp.predict(X_test_scaled)
This should do it. Let me know how it goes.
Also, there are some good reads,
https://stats.stackexchange.com/questions/189652/is-it-a-good-practice-to-always-scale-normalize-data-for-machine-learning
https://stats.stackexchange.com/questions/7757/data-normalization-and-standardization-in-neural-networks
Add a second hidden layer with one ReLU node.
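As a rough sketch of that suggestion (an assumed configuration, not tested on the asker's data), it could look like this:
from sklearn.neural_network import MLPRegressor
# first hidden layer of 50 units, second hidden layer with a single node;
# 'relu' is the activation applied to all hidden layers in MLPRegressor
mlp = MLPRegressor(solver='lbfgs',
                   hidden_layer_sizes=(50, 1),
                   activation='relu',
                   max_iter=1000,
                   learning_rate='constant',
                   random_state=0)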