What am I doing wrong when training a model?

I'm solving the following problem:

We have collected more data on cats and dogs, and are ready to train
our robot to classify them! Download the training dataset https://stepik.org/media/attachments/course/4852/dogs_n_cats.csv and train a
Decision Tree on it. After that, download the dataset from the
assignment and predict which class each observation belongs to. Enter the
number of dogs in your dataset. A certain margin of error is allowed in the
assignment.
I trained the model:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.model_selection import train_test_split, cross_val_score
df = pd.read_csv('dogs_n_cats.csv')
X = df.drop(['Вид', 'Шерстист'], axis=1)  # drop the target ('Вид', species) and the 'Шерстист' (woolliness) column
y = df['Вид']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.67, random_state=42)
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=4)
clf.fit(X_train, y_train)
After that, I downloaded the dataset from the task https://stepik.org/api/attempts/540562013/file and began to determine the number of dogs in the dataset:
df2 = pd.read_json('we.txt')
X2 = df.drop(['Вид', 'Шерстист'], axis=1)
y2 = df['Вид']
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, train_size=0.67, random_state=42)
df2_predict = clf.predict(X2)
l = list(df2_predict)
l.count('собачка')
The number of dogs in the task should be 49, but after executing l.count('собачка') I get 500. What am I doing wrong when training the model?

This seems to be a typo. In your snippet, you're using the first dataframe to create X2.
I cannot access the second file, but changing this line should do the trick:
X2 = df.drop(['Вид', 'Шерстист'], axis=1)
-->
X2 = df2.drop(['Вид', 'Шерстист'], axis=1)
Besides that, you're already provided with a training set and test set, so none of the calls to train_test_split should be necessary.
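For reference, a minimal sketch of the corrected end-to-end flow. This is an assumption-laden sketch: it presumes 'we.txt' holds JSON records with the same feature columns as the training CSV, and that it may or may not contain the target column.
import pandas as pd
from sklearn import tree
# Train on the full training file; no split is needed for this assignment
df = pd.read_csv('dogs_n_cats.csv')
X = df.drop(['Вид', 'Шерстист'], axis=1)
y = df['Вид']
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=4)
clf.fit(X, y)
# Build the prediction features from df2, not df
df2 = pd.read_json('we.txt')
X2 = df2.drop(['Вид', 'Шерстист'], axis=1, errors='ignore')  # ignore columns that are absent
predictions = clf.predict(X2)
print(list(predictions).count('собачка'))  # the number of dogs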

Related

What's wrong with this seemingly perfect ML model?

I wanted to find an optimal model to solve the assigned classification problem. Everything went smoothly until I applied the pd.get_dummies() function to preprocess the data; after that, the experiment showed an impossibly perfect result. I know this is unlikely to happen, but I don't know why. Any help would be highly appreciated.
The code for preprocessing the data is below:
# Encoding Booking Status
status_dict = {'Not_Canceled':1, 'Canceled':0}
df.booking_status = df.booking_status.map(status_dict)
df.drop('Booking_ID',axis=1, inplace=True)
df = df.dropna()
df = pd.get_dummies(df)
# Standardizing Data
from sklearn.preprocessing import StandardScaler
import numpy as np
X = df.iloc[:,0:-1]
y = df.iloc[:,-1]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
np.set_printoptions(precision=3)
print(rescaledX[0:5,:])
And I split my data into training and testing with a proportion of 0.3
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(rescaledX, y, test_size=0.3, random_state=15)
I used several models, and the amazing result is shown below:
[screenshot: near-perfect scores across all of the models tried]
Simple code, stupid me. By the way, I'm just a beginner in the ML field. Any advice on mastering it?
It was caused by data leakage. You must split your data before any pre-processing step. For example,
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=15)  # split the raw features, not rescaledX
Then do the scaling on the training and test data separately, fitting the scaler on the training set only:
scaler = StandardScaler().fit(X_train)
rescaledX_train = scaler.transform(X_train)
rescaledX_test = scaler.transform(X_test)  # reuse the scaler fitted on the training data
You could use a Pipeline as well to avoid data leakage.
# correct data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the pipeline
steps = list()
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', LogisticRegression()))
pipeline = Pipeline(steps=steps)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))
Ref: https://machinelearningmastery.com/data-preparation-without-data-leakage/

Can you train your model on a different dataframe than the one you test it on? (Python)

For a school project I need to make a forecast of the baseline sales. I've split the dataset into an X and a Y set: X = all my variables except Total Units (baseline), Y = Total Units (baseline).
I would love to train my model on the weeks where there are no promotions (FINAL_df[FINAL_df['PROMOTIONAL_PRESENCE']==0]) and test it on all the weeks, not only the weeks without promotion (FINAL_df). This should give a better result than training and testing on the same set (FINAL_df).
But I have no idea how to build the training set separately from the test set. (I know this part of my code is wrong: X_train = FINAL_df[FINAL_df['PROMOTIONAL_PRESENCE']==0], but I don't know how to correct it.)
(I am new to coding and ML, so any help is very much appreciated!) Thanks in advance!
code:
X = FINAL_df.drop(["Total_Units"],axis="columns")
Y = FINAL_df.Total_Units
from sklearn.model_selection import train_test_split
X_train = FINAL_df[FINAL_df['PROMOTIONAL_PRESENCE']==0]
from sklearn.linear_model import LinearRegression
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2, random_state=101)
lin_model = LinearRegression()
lin_model.fit(X_train,Y_train)
--> Here I get an error: Unable to allocate 3.48 GiB for an array with shape (453971, 1028) and data type float64
from sklearn.model_selection import train_test_split
FINAL_df_prom_pres0 = FINAL_df[FINAL_df['PROMOTIONAL_PRESENCE']==0]
Y = FINAL_df_prom_pres0.Total_Units
X = FINAL_df_prom_pres0.drop(["Total_Units"],axis="columns")
from sklearn.linear_model import LinearRegression
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2, random_state=101)
lin_model = LinearRegression()
lin_model.fit(X_train,Y_train)
I changed the answer; can you please try it like this?
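If the goal is specifically to fit on the no-promotion weeks but evaluate on every week, a rough sketch along these lines may be closer to what the question asks (hypothetical; it assumes all feature columns in FINAL_df are numeric):
from sklearn.linear_model import LinearRegression
# Fit only on the weeks without promotions
train_df = FINAL_df[FINAL_df['PROMOTIONAL_PRESENCE'] == 0]
X_train = train_df.drop(["Total_Units"], axis="columns")
Y_train = train_df.Total_Units
# Evaluate on every week, promotions included
X_all = FINAL_df.drop(["Total_Units"], axis="columns")
Y_all = FINAL_df.Total_Units
lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)
print(lin_model.score(X_all, Y_all))  # R^2 over all weeks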

Performing regression while dividing sample

Let's take this data:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
And consider the following code:
# Defining X, y - the independent variables and the dependent variable
X = df.drop(df.columns[[1]], axis=1)
y = (df[1] == 'B').astype(int)
clf = LogisticRegression(solver="lbfgs")
kfold = StratifiedKFold(n_splits=10, shuffle=True)
for train, validation in kfold.split(X, y):
    # Fit the model
    clf.fit(X[train], y[train])
Running this raises an error (the traceback screenshot is not shown). Do you have any idea why it occurs? I don't think I did anything complicated, so I'm not sure what exactly I did wrong.
X is a DataFrame so you need to use .iloc to select the indices:
for train_index, validation_index in kfold.split(X, y):
    # Fit the model
    X_train = X.iloc[train_index]
    y_train = y.iloc[train_index]
    clf.fit(X_train, y_train)
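Alternatively, a short sketch using cross_val_score, which handles the fold indexing for you:
from sklearn.model_selection import cross_val_score
# Splits, fits, and scores in one call, reusing the same StratifiedKFold object
scores = cross_val_score(clf, X, y, cv=kfold)
print(scores.mean())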

100% error rate on test set with one class svm

I am trying to detect outlier images. But I'm getting bizarre results from the model.
I've read in the images with cv2, flattened them into 1d-arrays, and turned them into a pandas dataframe and then fed that into the SVM.
import numpy as np
import cv2
import glob
import pandas as pd
import sys, os
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import *
import seaborn as sns
Load the labels and files:
labels_wt = np.loadtxt("labels_wt.txt", delimiter="\t", dtype="str")
files_wt = np.loadtxt("files_wt.txt", delimiter="\t", dtype="str")
Load and flatten the images:
wt_images_tmp = [cv2.imread(file) for file in files_wt]
wt_images = [image.flatten() for image in wt_images_tmp]
tmp3 = np.array(wt_images)
# files_mut is assumed to be loaded the same way as files_wt (not shown in the question)
mutant_images_tmp = [cv2.imread(file) for file in files_mut]
mutant_images = [image.flatten() for image in mutant_images_tmp]
tmp4 = np.array(mutant_images)
X = pd.DataFrame(tmp3) #load the wild-type images
y = pd.Series(labels_wt)
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=42)
X_outliers = pd.DataFrame(tmp4)
clf = svm.OneClassSVM(nu=0.15, kernel="rbf", gamma=0.0001)
clf.fit(X_train)
Then I evaluated the results following the scikit-learn tutorial on one-class SVM.
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)
n_error_train = y_pred_train[y_pred_train == -1].size
n_error_test = y_pred_test[y_pred_test == -1].size
n_error_outliers = y_pred_outliers[y_pred_outliers == 1].size
print(n_error_train / len(y_pred_train))
print(float(n_error_test) / float(len(y_pred_test)))
print(n_error_outliers / len(y_pred_outliers))
My error rates on the training set have been variable (10-30%), but on the test set they have never gone below 100%. Am I doing this wrong?
My guess is that setting random_state=42 biases your train_test_split to always produce the same splitting pattern. You can read more about it in this answer. Don't specify any state and run the code again:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
This will show different results. Once you are sure this works, make sure you then do cross-validation, possibly using k-fold validation. Let us know if this helps.
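A sketch of what such a k-fold evaluation could look like for a one-class SVM (assuming X is the DataFrame of flattened wild-type images from the question):
from sklearn import svm
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True)
for train_idx, test_idx in kf.split(X):
    clf = svm.OneClassSVM(nu=0.15, kernel="rbf", gamma=0.0001)
    clf.fit(X.iloc[train_idx])
    preds = clf.predict(X.iloc[test_idx])
    print((preds == -1).mean())  # fraction of held-out inliers flagged as outliers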

Sklearn DecisionTreeClassifier F-Score Different Results with Each run

I'm trying to train a decision tree classifier using Python. I'm using MinMaxScaler() to scale the data and f1_score as my evaluation metric. The strange thing is that my model gives me different results, in a pattern, at each run.
data in my code is a (2000, 7) pandas.DataFrame, with 6 feature columns and the last column being the target value. Columns 1, 3, and 5 are categorical data.
The following code is what I did to preprocess and format my data:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score
# Data Preprocessing Step
# =============================================================================
data = pd.read_csv("./data/train.csv")
X = data.iloc[:, :-1]
y = data.iloc[:, 6]
# Choose which columns are categorical data, and convert them to numeric data.
labelenc = LabelEncoder()
categorical_data = list(data.select_dtypes(include='object').columns)
for i in range(len(categorical_data)):
    X[categorical_data[i]] = labelenc.fit_transform(X[categorical_data[i]])
# Convert categorical numeric data to one-of-K data, and change y from Series to ndarray.
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(X).toarray()
y = y.values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
min_max_scaler = MinMaxScaler()
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_val_scaled = min_max_scaler.fit_transform(X_val)  # note: this re-fits the scaler on the validation data; transform(X_val) alone is the usual choice
The next code is for the actual decision tree model training:
dectree = DecisionTreeClassifier(class_weight='balanced')
dectree = dectree.fit(X_train_scaled, y_train)
predictions = dectree.predict(X_val_scaled)
score = f1_score(y_val, predictions, average='macro')
print("Score is = {}".format(score))
The output that I get (i.e. the score) varies, but in a pattern. For example, it fluctuates within the range of 0.39 to 0.42.
On some iterations, I even get the UndefinedMetricWarning, that claims "F-score is ill-defined and being set to 0.0 in labels with no predicted samples."
I'm familiar with what the UndefinedMetricWarning means, after doing some searching on this community and Google. I guess my two questions can be organized as:
Why does my output vary for each iteration? Is there something in the preprocessing stage that happens which I'm not aware of?
I've also tried to use the F-score with other data splits, but I always get the warning. Is this unpreventable?
Thank you.
You are splitting the dataset into train and test sets, and that division is random each time. Because of this, the model is trained on different training data and tested on different test data on every run, so you get a range of F-scores depending on how well the model happens to be trained.
In order to replicate the result on each run, use the random_state parameter. It maintains a random-number state so that the same random numbers are generated, in the same order, every time. It can be any number.
#train test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=13)
#Decision tree model
dectree = DecisionTreeClassifier(class_weight='balanced', random_state=2018)
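To quantify that spread instead of relying on a single split, a sketch using cross-validation (reusing X and y from the preprocessing step above):
from sklearn.model_selection import cross_val_score
dectree = DecisionTreeClassifier(class_weight='balanced', random_state=2018)
# Macro-averaged F1 across 5 stratified folds; the std shows the run-to-run spread
scores = cross_val_score(dectree, X, y, cv=5, scoring='f1_macro')
print("F1 (macro): %.3f +/- %.3f" % (scores.mean(), scores.std()))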
