I have successfully built a logistic regression model using the train dataset below.
X = train.drop('y', axis=1)
y = train['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
logreg1 = LogisticRegression()
logreg1.fit(X_train, y_train)
score = logreg1.score(X_test, y_test)
cvs = cross_val_score(logreg1, X_test, y_test, cv=5).mean()
My problem is that I want to bring in a separate test dataset and predict its unknown y values. There's no y column in the test data. How can I predict the y values using this separate test dataset?
Use predict():
y_pred = logreg1.predict(X_test)
print(y_pred)  # see the predictions
Note that logreg1.score(X_test, y_pred) would only compare the predictions with themselves and always return 1.0; scoring needs the true labels (for example logreg1.score(X_test, y_test)), which your separate test dataset does not have.
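If the separate test dataset is loaded into its own DataFrame, it has to go through the same fitted scaler before calling predict(). A minimal sketch, assuming the file name test.csv and that it has exactly the same feature columns as the training data (no y column):
test_df = pd.read_csv('test.csv')           # hypothetical file name
X_new = test_df[X.columns]                  # keep the training feature columns, in the same order
X_new_scaled = scaler.transform(X_new)      # reuse the scaler fitted on X_train
y_new_pred = logreg1.predict(X_new_scaled)  # predicted y values for the unseen data
print(y_new_pred)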
I'm trying to split a multi-label dataset into train, val and test datasets. I want to do something similar to
from skmultilearn.model_selection.iterative_stratification import IterativeStratification
def iterative_train_test_split(X, y, test_size):
    stratifier = IterativeStratification(
        n_splits=2, order=1, sample_distribution_per_fold=[test_size, 1 - test_size])
    train_indices, test_indices = next(stratifier.split(X, y))
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    return X_train, X_test, y_train, y_test
but with n_splits=3. When I try to set n_splits=3 I still only get 2 sets of indices out. Am I doing something wrong?
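One workaround that sidesteps n_splits=3 entirely is to apply the two-way helper above twice: first carve off the test set, then split the remainder into train and validation. The fractions below are illustrative assumptions, not a confirmed fix:
# 0.25 of the remaining 80% is roughly 20% of the original data
X_trainval, X_test, y_trainval, y_test = iterative_train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = iterative_train_test_split(X_trainval, y_trainval, test_size=0.25)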
I have a dataset and have one-hot encoded the target column (5 different strings throughout the entire column) using pd.get_dummies. I then used sklearn's train_test_split function to create the training, testing and validation sets. The training set features were then normalized with StandardScaler(). I have fit the training features and target to a logistic regression model.
I am now trying to calculate the accuracy score for the training, validation and test sets but am having no luck. My code up to this part is below:
dataset = pd.read_csv('tabular_data/clean_tabular_data.csv')
features, label = load_airbnb(dataset, 'Category')
label_series = dataset['Category']
label_encoded = pd.get_dummies(label_series)
X_train, X_test, y_train, y_test = train_test_split(features, label_encoded, test_size=0.3)
X_test, X_validation, y_test, y_validation = train_test_split(X_test, y_test, test_size=0.5)
# normalize the features
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_validation_scaled = scaler.transform(X_validation)
X_test_scaled = scaler.transform(X_test)
# get baseline classification model
model = LogisticRegression()
y_train = y_train.iloc[:, 0]
model.fit(X_train_scaled, y_train)
y_train_pred = model.predict(X_train_scaled)
y_train_pred = np.argmax(y_train_pred, axis=0)
y_validation_pred = model.predict(X_validation_scaled)
y_validation_pred = np.argmax(y_validation_pred, axis =0)
y_test_pred = model.predict(X_test_scaled)
y_test_pred = np.argmax(y_test_pred, axis = 0)
# evaluate model using accuracy
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)
validation_acc = accuracy_score(y_validation, y_validation_pred)
The error I am getting is here: "File "C:\Users\lcox1\Documents\VSCode\AiCore\Data science\classification_prac.py", line 56, in
train_acc = accuracy_score(y_train, y_train_pred)
TypeError: Singleton array 16 cannot be considered a valid collection."
I am fairly new to Python, so I have no idea what the issue is. Any help appreciated.
You are getting that error because of these lines:
y_train_pred = model.predict(X_train_scaled)
y_train_pred = np.argmax(y_train_pred, axis=0)
model.predict() returns an array of predicted labels, not probabilities. Taking np.argmax of that 1-D array collapses it to a single value (the index of the maximum), so accuracy_score later receives a scalar instead of a collection and raises the error.
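A tiny illustration of that collapse (the label values here are made up):
import numpy as np

labels = np.array([0, 2, 1, 2])   # what predict() returns: one label per sample
print(np.argmax(labels))          # 1 -- a single scalar index, not a collection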
Most likely you mean to do:
y_train_pred = model.predict_proba(X_train_scaled)
y_train_pred = np.argmax(y_train_pred, axis=1)
y_train_pred
As @BenReiniger pointed out in the comments, if you are training a model on multi-class labels, you should not one-hot encode them. Try something like the below, where I used an example dataset and kept the labels as a category:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
data = load_iris()
features = data.data
label_series = pd.Series(data.target).map({0: "setosa", 1: "versicolor", 2: "virginica"})
label_series = pd.Categorical(label_series)
le = LabelEncoder()
label_encoded = le.fit_transform(label_series)
Running your code with some changes:
X_train, X_test, y_train, y_test = train_test_split(features, label_encoded, test_size=0.3)
X_test, X_validation, y_test, y_validation = train_test_split(X_test, y_test, test_size=0.5)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_validation_scaled = scaler.transform(X_validation)
X_test_scaled = scaler.transform(X_test)
# get baseline classification model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
y_train_pred = model.predict_proba(X_train_scaled)
y_train_pred = np.argmax(y_train_pred, axis=1)
y_validation_pred = model.predict_proba(X_validation_scaled)
y_validation_pred = np.argmax(y_validation_pred, axis=1)
y_test_pred = model.predict_proba(X_test_scaled)
y_test_pred = np.argmax(y_test_pred, axis=1)
# evaluate model using accuracy
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)
validation_acc = accuracy_score(y_validation, y_validation_pred)
The results:
print(train_acc,test_acc,validation_acc)
0.9809523809523809 0.9090909090909091 1.0
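Since the labels here are plain integer classes, the predict_proba/argmax detour is optional; assuming the fitted model and scaled splits above, predict() and score() give the same accuracies directly:
# predict() already returns class labels, and score() computes accuracy for classifiers
train_acc = accuracy_score(y_train, model.predict(X_train_scaled))
test_acc = model.score(X_test_scaled, y_test)
validation_acc = model.score(X_validation_scaled, y_validation)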
#Defining Model Function
def models(X_train, Y_train):
    log = LogisticRegression(random_state=0)
    log.fit(X_train, Y_train)
    tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
    tree.fit(X_train, Y_train)
    forest = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
    forest.fit(X_train, Y_train)
    return log, tree, forest
#Main Code
df = pd.read_csv("data.csv")
X = df.iloc[:, 2:31].values #Selecting data from the 3rd to the 31st column
Y = df.iloc[:, 1].values #Selecting Data from 2nd column
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)
model = models(X_train, Y_train)
newdata = pd.read_csv("new.csv") #This is new data
X_new = newdata.iloc[:, 2:31].values #These data I want to check with trained data
predictnewdata1=model[0].predict(X_test)
print("new data1") #Printing test data from model[0] i.e. log model
print(predictnewdata1)
predictnewdata2=model[1].predict((X_test))
print("new data2") #Printing test data from model[1] i.e. tree model
print(predictnewdata2)
predictnewdata3=model[2].predict(X_test)
print("new data3") #Printing test data from model[2] i.e. forest model
print(predictnewdata3)
predictnewdata4 = model[2].predict(X_new) #This code is anyhow wrong
print(predictnewdata4) #Printing wrong values
I want to test any of the models with new data, but somehow my code isn't working. I searched a lot but couldn't find any solution, so please help in this regard.
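A likely culprit, offered as a sketch rather than a definitive fix (it assumes new.csv has exactly the same feature columns, in the same order, as data.csv): the new data needs to be scaled with the scaler fitted on the training data, and the test data should be transformed with that same fitted scaler rather than re-fit. The relevant part of the script would then look roughly like this:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)    # fit the scaler on the training data only
X_test = sc.transform(X_test)          # reuse it on the test data (transform, not fit_transform)
model = models(X_train, Y_train)
X_new = newdata.iloc[:, 2:31].values   # same feature columns as training (assumption)
X_new = sc.transform(X_new)            # scale the new data with the same fitted scaler
predictnewdata4 = model[2].predict(X_new)
print(predictnewdata4)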
I need to create a FOR loop in Python that will repeat steps 1-2 1,000 times.
1. Split the sample randomly into training and test sets using a 632:368 ratio.
2. Build the model on the 63.2% training data and compute the R-squared on the 36.8% holdout data.
I can't seem to get the R-squared for each split:
y=data['Amount']
xall = data
xall.drop(["No","Amount", "Class"], axis = 1, inplace = True)
for seed in range(10_00):
    X_train, X_test, y_train, y_test = train_test_split(xall, y,
                                                         test_size=0.382,
                                                         random_state=seed)
    modelall = LinearRegression()
    modelall.fit(xall, y)
    modelall = LinearRegression().fit(xall, y)
    r_sq = modelall.score(xall, y)
    print('coefficient of determination:', r_sq)
Fit the model using the TRAINING data and estimate the score using the TEST data.
Use this:
y=data['Amount']
xall = data
xall.drop(["No","Amount", "Class"], axis = 1, inplace = True)
for seed in range(100):
    X_train, X_test, y_train, y_test = train_test_split(xall, y, test_size=0.382, random_state=seed)
    modelall = LinearRegression()
    modelall.fit(X_train, y_train)
    r_sq = modelall.score(X_test, y_test)
    print('coefficient of determination:', r_sq)
In your version you are fitting the linear model to the whole dataset (xall) on every iteration, so changing the seed has no effect: the seed only controls the split, and fitting on the same full data gives the same output every time.
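If you also want to keep the R-squared from every iteration rather than just print it, a small addition to the loop above (the list name is an arbitrary choice):
r_squares = []
for seed in range(100):
    X_train, X_test, y_train, y_test = train_test_split(xall, y, test_size=0.382, random_state=seed)
    modelall = LinearRegression().fit(X_train, y_train)
    r_squares.append(modelall.score(X_test, y_test))
print('mean R-squared over all splits:', sum(r_squares) / len(r_squares))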
I'm trying to make a confusion matrix to determine how well my model performed. I split my data into x and y training and test sets; however, to make my confusion matrix, I need the y_test data (the predicted data) and the actual data. Is there a way I can see the actual results of the y_test data?
Here's a little snippet of my code:
x_train, x_test, y_train, y_test = train_test_split(a, yy, test_size=0.2, random_state=1)
model = MultinomialNB() #don't forget these brackets here
model.fit(x_train,y_train.ravel())
#CONFUSION MATRIX
confusion = confusion_matrix(y_test, y_test)
print(confusion)
print(len(y_test))
Your y_test is the actual data and the results from the predict method will be the predicted data.
y_pred = model.predict(x_test)
confusion = confusion_matrix(y_test, y_pred)
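If you want to inspect the actual values next to the predictions directly, a small illustrative addition (assuming pandas and numpy are available):
import numpy as np
import pandas as pd

# put actual and predicted labels side by side for inspection
comparison = pd.DataFrame({'actual': np.ravel(y_test), 'predicted': y_pred})
print(comparison.head(20))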