I was trying to train and test my dataset using MLPRegressor. I have two datasets (train dataset and test dataset), both of them have the exact same columns of features and label. Here's the example of my datasets :
Full,Id,Id & PPDB,Id & Words Sequence,Id & Synonyms,Id & Hypernyms,Id & Hyponyms,Gold Standard
1.667,0.476,0.952,0.476,1.429,0.952,0.476,2.345
3.056,1.111,1.667,1.111,3.056,1.389,1.111,1.9
1.765,1.176,1.176,1.176,1.765,1.176,1.176,2.2
0.714,0.714,0.714,0.714,0.714,0.714,0.714,0.0
................
here's my code :
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPRegressor
randomseed = np.random.seed(0)
datatraining = pd.read_csv("datatrain.csv")
datatesting = pd.read_csv("datatest.csv")
columns = ["Full","Id","Id & PPDB","Id & Words Sequence","Id & Synonyms","Id & Hypernyms","Id & Hyponyms"]
labeltrain = datatraining["Gold Standard"].values
featurestrain = datatraining[list(columns)].values
labeltest = datatesting["Gold Standard"].values
featurestest = datatesting[list(columns)].values
X_train = featurestrain
y_train = labeltrain
X_test = featurestest
y_test = labeltest
mlp = MLPRegressor(solver='lbfgs', hidden_layer_sizes=50, max_iter=1000, learning_rate='constant', random_state=randomseed)
mlp.fit(X_train, y_train)
print('Accuracy training : {:.3f}'.format(mlp.score(X_train, y_train)))
print
predicting = mlp.predict(X_test)
print predicting
print
And here's the result of the prediction :
[ 1.97553444 3.43401776 3.04097607 2.7015464 2.03777686 3.63274593
3.37826962 -0.60260337 0.41626517 3.5374289 3.66114929 3.244683
2.6313756 2.14243075 3.20841434 2.105238 4.9805092 4.00868273
2.45508505 4.53332828 3.41862096 3.35721078 3.23069344 3.72149434
4.9805092 2.61705563 1.55052494 -0.14135979 2.65875196 3.05328206
3.51127424 0.51076396 2.39947967 1.95916595 3.71520651 2.1526807
2.26438616 0.73249057 2.46888695 3.56976227 1.03109988 2.15894353
2.06396103 0.66133707 4.72861602 2.4592647 2.84176811 2.3157664
1.68426416 2.56022955 -0.00518545 1.67213609 0.6998739 3.25940136
3.25369266 3.88888542 1.9168694 2.26036302 3.97917769 2.00322903
3.03121106 3.29083723 0.6998739 4.33375678 0.6998739 2.71141538
-4.23755447 3.958574 2.67765274 2.68715423 2.32714117 2.6500056
........]
As we can see, there are some negative results. How not to predict negative results? Besides, my datasets are contain of all positive values.
Assuming you have no categorical variables. Also, you mentioned in the question that you have all positive values.
Try to standardize your data using SatandardSacler(). Use your X_train and y_train to standardize data.
from sklearn import preprocessing as pre
...
scaler = pre.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)
After initializing the model with best parameters based on your case, fit scaled data,
mlp.fit(X_train_scaled, y_train)
...
predicting = mlp.predict(X_test_scaled)
This should do it. Let me know how it goes.
Also, there are some good reads,
https://stats.stackexchange.com/questions/189652/is-it-a-good-practice-to-always-scale-normalize-data-for-machine-learning
https://stats.stackexchange.com/questions/7757/data-normalization-and-standardization-in-neural-networks
Add a second hidden layer with one ReLU node.
Related
I have test and train subset separately. I need to apply regression models on those datasets. Just tried to make an example but something is wrong. My prediction column for regression is 'survival-time'. Also i have 93 column on each datasets. Here is code.
NOTE: Specially, troubling on 'score = svr.score(X_train,y_train)', should i use test for train argument?
import pandas as pd
import numpy as np
from sklearn.svm import SVR
train = pd.read_csv('training.csv',sep=";")
test = pd.read_csv('test.csv',sep=";")
X_train = train.drop('survival-time',axis=1)
y_train = train['survival-time']
X_test = test.drop('survival-time',axis=1)
y_test = test['survival-time']
svr = SVR().fit(X_train, y_train)
print(svr)
yfit = svr.predict(X_train)
score = svr.score(X_train,y_train)
print("R-squared:", score)
print("MSE:", mean_squared_error(y_train, yfit))
print("RMSLE:", mean_squared_log_error(y_train, yfit))
I have a dataset that shows whether a person has diabetes based on indicators, it looks like this (original dataset):
I've created a straightforward model in order to predict the last column (Outcome).
#Libraries imported
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#Dataset imported
data = pd.read_csv('diabetes.csv')
#Assign X and y
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values
#Data preprocessed
sc = StandardScaler()
X = sc.fit_transform(X)
#Dataset split between train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Predicting the results for the whole dataset
y_pred2 = model.predict(data)
#Add prediction column to original dataset
data['prediction'] = y_pred2
However, I get the following error: ValueError: X has 9 features per sample; expecting 8.
My questions are:
Why can't I create a new column with the predictions for my entire dataset?
How can I make predictions for blank outcomes (that need to be predicted), that is to say, should I upload the file again? Let's say I want to predict the folowing:
Rows to predict:
Please let me know if my questions are clear!
You are feeding data (with all 9 initial features) to a model that was trained with X (8 features, since Outcome has been removed to create y), hence the error.
What you need to do is:
Get predictions using X instead of data
Append the predictions to your initial data set
i.e.:
y_pred2 = model.predict(X)
data['prediction'] = y_pred2
Keep in mind that this means that your prediction variable will come from both data that have already been used for model fitting (i.e. the X_train part) as well as from data unseen by the model during training (the X_test part). Not quite sure what your final objective is (and neither this is what the question is about), but this is a rather unusual situation from an ML point of view.
If you have a new dataset data_new to predict the outcome, you do it in a similar way; always assuming that X_new has the same features with X (i.e. again removing the Outcome column as you have done with X):
y_new = model.predict(X_new)
data_new['prediction'] = y_new
I'm using dataset from Kaggle - Cardiovascular Disease Dataset.
The model has been trained and what I want to do is to label a single input(a row of 13 values)
inserted in dynamic way.
Shape of Dataset is 13 Features + 1 Target, 66k rows
#prepare dataset for train and test
dfCardio = load_csv("cleanCardio.csv")
y = dfCardio['cardio']
x = dfCardio.drop('cardio',axis = 1, inplace=False)
model = knn = KNeighborsClassifier()
x_train,x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=42)
model.fit(x_train, y_train)
# make predictions for test data
y_pred = model.predict(x_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
ML is trained, what I want to do is to predict the label of this single row :
['69','1','151','22','37','0','65','140','90','2','1','0','0','1']
to return 0 or 1 for Target.
So I wrote this code :
import numpy as np
import pandas as pd
single = np.array(['69','1','151','22','37','0','65','140','90','2','1','0','0','1'])
singledf = pd.DataFrame(single)
final=singledf.transpose()
prediction = model.predict(final)
print(prediction)
but it gives error : query data dimension must match training data dimension
how can I fix the labeling for single row ? why I'm not able to predict a single case ?
Each instance in your dataset has 13 features and 1 label.
x = dfCardio.drop('cardio',axis = 1, inplace=False)
This line in the code removes what I assume is the label column from the data, leaving only the (13) feature columns.
The feature vector on which you are trying to predict, is 14 elements long. You can only predict on feature vectors that are 13 elements long because that is what the model was trained on.
if you are looking for a real and quick solution you can use this
import numpy as np
import pandas as pd
single = np.array([['69','1','151','22','37','0','65','140','90','2','1','0','0']])
prediction = model.predict(single)
print(prediction)
I disagree with the others, this is not a problem with including the target.
I had this problem too. The only way I got around it was to input part of x.
So:
x2=x.iloc[0:3]
then give the first row a new value:
x2.iloc[0]=single
ypred=model.predict(x2)
and just look at ypred[0].
Or try a dataframe with 2 values
I'm trying to train a decision tree classifier using Python. I'm using MinMaxScaler() to scale the data, and f1_score for my evaluation metric. The strange thing is that I'm noticing my model giving me different results in a pattern at each run.
data in my code is a (2000, 7) pandas.DataFrame, with 6 feature columns and the last column being the target value. Columns 1, 3, and 5 are categorical data.
The following code is what I did to preprocess and format my data:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score
# Data Preprocessing Step
# =============================================================================
data = pd.read_csv("./data/train.csv")
X = data.iloc[:, :-1]
y = data.iloc[:, 6]
# Choose which columns are categorical data, and convert them to numeric data.
labelenc = LabelEncoder()
categorical_data = list(data.select_dtypes(include='object').columns)
for i in range(len(categorical_data)):
X[categorical_data[i]] = labelenc.fit_transform(X[categorical_data[i]])
# Convert categorical numeric data to one-of-K data, and change y from Series to ndarray.
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(X).toarray()
y = y.values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
min_max_scaler = MinMaxScaler()
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_val_scaled = min_max_scaler.fit_transform(X_val)
The next code is for the actual decision tree model training:
dectree = DecisionTreeClassifier(class_weight='balanced')
dectree = dectree.fit(X_train_scaled, y_train)
predictions = dectree.predict(X_val_scaled)
score = f1_score(y_val, predictions, average='macro')
print("Score is = {}".format(score))
The output that I get (i.e. the score) varies, but in a pattern. For example, it would circulate among data within the range of 0.39 and 0.42.
On some iterations, I even get the UndefinedMetricWarning, that claims "F-score is ill-defined and being set to 0.0 in labels with no predicted samples."
I'm familiar with what the UndefinedMetricWarning means, after doing some searching on this community and Google. I guess the two questions I have may be organized to:
Why does my output vary for each iteration? Is there something in the preprocessing stage that happens which I'm not aware of?
I've also tried to use the F-score with other data splits, but I always get the warning. Is this unpreventable?
Thank you.
You are splitting the dataset into train and test which randomly divides sets for both train and test. Due to this, when you train your model with different training data everytime, and testing it with different test data, you will get a range of F score depending on how well the model is trained.
In order to replicate the result each time you run, use random_state parameter. It will maintain a random number state which will give you the same random number each time you run. This shows that the random numbers are generated in the same order. This can be any number.
#train test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=13)
#Decision tree model
dectree = DecisionTreeClassifier(class_weight='balanced', random_state=2018)
I am attempting to train models with GradientBoostingClassifier using categorical variables.
The following is a primitive code sample, just for trying to input categorical variables into GradientBoostingClassifier.
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
import pandas
iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]
# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]
X_train = pandas.DataFrame(X_train)
# Insert fake categorical variable.
# Just for testing in GradientBoostingClassifier.
X_train[0] = ['a']*40 + ['b']*40
# Model.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)
The following error appears:
ValueError: could not convert string to float: 'b'
From what I gather, it seems that One Hot Encoding on categorical variables is required before GradientBoostingClassifier can build the model.
Can GradientBoostingClassifier build models using categorical variables without having to do one hot encoding?
R gbm package is capable of handling the sample data above. I'm looking for a Python library with equivalent capability.
pandas.get_dummies or statsmodels.tools.tools.categorical can be used to convert categorical variables to a dummy matrix. We can then merge the dummy matrix back to the training data.
Below is the example code from the question with the above procedure carried out.
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve,auc
from statsmodels.tools import categorical
import numpy as np
iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]
# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]
###########################################################################
###### Convert categorical variable to matrix and merge back with training
###### data.
# Fake categorical variable.
catVar = np.array(['a']*40 + ['b']*40)
catVar = categorical(catVar, drop=True)
X_train = np.concatenate((X_train, catVar), axis = 1)
catVar = np.array(['a']*10 + ['b']*10)
catVar = categorical(catVar, drop=True)
X_test = np.concatenate((X_test, catVar), axis = 1)
###########################################################################
# Model and test.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)
prob = clf.predict_proba(X_test)[:,1] # Only look at P(y==1).
fpr, tpr, thresholds = roc_curve(y_test, prob)
roc_auc_prob = auc(fpr, tpr)
print(prob)
print(y_test)
print(roc_auc_prob)
Thanks to Andreas Muller for instructing that pandas Dataframe should not be used for scikit-learn estimators.
Sure it can handle it, you just have to encode the categorical variables as a separate step on the pipeline. Sklearn is perfectly capable of handling categorical variables as well as R or any other ML package. The R package is still (presumably) doing one-hot encoding behind the scenes, it just doesn't separate the concerns of encoding and fitting in this case (as it arguably should).