Using sklearn.model_selection to split unbalanced dataset - python

I am using the following code to split my dataset into train/val/test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
The problem is that my dataset is really unbalanced. Some classes have 500 samples while others have only 70, for example. Is this splitting method appropriate in this situation? Is the sampling purely random, or does sklearn use some method to keep the class distribution the same in all sets?

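By default the sampling is purely random, so the class proportions can differ between the sets. You should use the stratify option (see the docs), which keeps the class proportions of the array you pass to it roughly the same in both parts of each split:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=42, stratify=y_data)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42, stratify=y_test)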
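To check that stratification does what you expect, here is a minimal sketch on made-up labels (500 samples of one class, 70 of another, mirroring the numbers in the question). The data, the placeholder features, and the helper name minority_share are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 500 samples of class 0 and 70 of class 1.
y_data = np.array([0] * 500 + [1] * 70)
X_data = np.arange(len(y_data)).reshape(-1, 1)  # placeholder features

# 70% train, then split the remaining 30% evenly into test and validation.
X_train, X_rest, y_train, y_rest = train_test_split(
    X_data, y_data, test_size=0.3, random_state=42, stratify=y_data
)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest
)


def minority_share(y):
    """Fraction of samples that belong to class 1 (hypothetical helper)."""
    return float(np.mean(y == 1))


# The minority share should be nearly identical in every subset.
shares = {
    name: minority_share(labels)
    for name, labels in [
        ("all", y_data), ("train", y_train), ("test", y_test), ("val", y_val)
    ]
}
print(shares)
```

Without stratify, the minority share in the small test and validation sets can drift noticeably from the share in the full dataset.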

Related

How to quantify how good the model is after using train_test_split

I'm using the train_test_split from sklearn.model_selection. My code looks like the following:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1234)
Edit: After this is done, how do I fit these to a linear regression model and then see how good the model is? That is, which of the four components (x_train, x_test, y_train, or y_test) would I use to calculate MSE or RMSE, and how exactly would I do that?
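One common approach is to fit on x_train and y_train, predict on x_test, and compare those predictions with y_test. A minimal sketch, assuming made-up data in place of the real x and y (the synthetic data and the variable names model, y_pred, mse, and rmse are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up data: y is a noisy linear function of two features.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = 3.0 * x[:, 0] - 2.0 * x[:, 1] + rng.normal(scale=0.1, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=1234
)

# Fit on the training portion only.
model = LinearRegression()
model.fit(x_train, y_train)

# Evaluate on the held-out portion the model has not seen.
y_pred = model.predict(x_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(mse, rmse)
```

Computing the error on x_train and y_train instead would measure how well the model memorized the training data, not how well it generalizes.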

UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names warnings.warn(

Help please?
I already tried adding .values to the X's, but it still resulted in an error. Any suggestions?
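from sklearn import linear_model
from sklearn.model_selection import train_test_split
# df is an existing pandas DataFrame that contains these columns
X = df[['Personal income','Personal saving']]
y = df['Gross domestic product']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
regr = linear_model.LinearRegression().fit(X_train, y_train)
sample = [10000, 1000]
# Predicting on a plain list (no column names) is what triggers the warning
sample_pred = regr.predict([sample])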
The warning appears because the model was fitted on a DataFrame, which carries column names, while predict was called on a plain list, which does not. As stated in this issue (https://github.com/tylerjrichards/Getting-Started-with-Streamlit-for-Data-Science/issues/5), converting the X_train DataFrame to a NumPy array (X_train.values) before fitting removes the warning.
It did in my testing. You can try either of the following:
X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X.to_numpy(), y, test_size=0.2, random_state=42)
In addition, this warning doesn't affect the precision of the calculation, so you can ignore it and continue working until the library versions are updated.
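Another option, as far as I can tell, is to keep fitting on the DataFrame and pass predict a DataFrame with the same column names. A minimal sketch, assuming made-up numbers in place of the real df (the values below are invented for illustration):

```python
import warnings

import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up numbers standing in for the real df.
df = pd.DataFrame({
    "Personal income": [9000.0, 10000.0, 11000.0, 12000.0],
    "Personal saving": [800.0, 1000.0, 900.0, 1100.0],
    "Gross domestic product": [15000.0, 16500.0, 17800.0, 19400.0],
})
X = df[["Personal income", "Personal saving"]]
y = df["Gross domestic product"]
regr = LinearRegression().fit(X, y)

# Wrap the sample in a DataFrame that reuses the training column names.
sample = pd.DataFrame([[10000.0, 1000.0]], columns=X.columns)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    sample_pred = regr.predict(sample)

# Collect only warnings that mention feature names.
feature_name_warnings = [w for w in caught if "feature names" in str(w.message)]
print(sample_pred, len(feature_name_warnings))
```

This keeps the column names attached to the model, so a sample with columns in the wrong order is also easier to spot.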

Can the train_test_split be replaced from sklearn.model_selection?

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.2, random_state=42)
Can this call be replaced with
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
If your question is whether both snippets do the same job, then yes. The first form needs import sklearn.model_selection beforehand. Use the second one, where you import train_test_split directly; it makes the code simpler to read and understand.
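If you want to convince yourself, a quick check shows that both spellings refer to the same function object (the variable name same_function is just illustrative):

```python
import sklearn.model_selection
from sklearn.model_selection import train_test_split

# Both names are bound to the very same function object.
same_function = sklearn.model_selection.train_test_split is train_test_split
print(same_function)
```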

ValueError: Expected 2D array, got 1D array instead for K-fold cross validation

I am trying to perform K-Fold Cross Validation for my dataset but an error message popped up instead
ValueError: Expected 2D array, got 1D array instead:
The data looks like this:
Fuel Consumption    Distance
13.046              298.89
14.717              468.60
15.032              464.38
Below is the code that I have used:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3,shuffle=True,random_state=42)
kf
x = data.loc[:,'Fuel Consumption'].values
y = data.loc[:, 'Distance'].values
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3,random_state=42, shuffle=True)
for train_index, test_index in kf.split(y):
    print(train_index, test_index)
def get_score(model, x_train, x_test, y_train, y_test):
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)
scores = []
best_svr = SVR(kernel='rbf')
cv = KFold(n_splits=3, random_state=42, shuffle=True)
for train_index, test_index in cv.split(x):
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    x_train, x_test, y_train, y_test = x[train_index], x[test_index], y[train_index], y[test_index]
    x_train = x_train.reshape(-1, 1)
    y_train = y_train.reshape(-1, 1)
    best_svr.fit(x_train, y_train)
    scores.append(best_svr.score(x_test, y_test))
Thanks!
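A likely cause, judging from the code above: x is a 1D array, and only x_train is reshaped inside the loop, so best_svr.score(x_test, y_test) still receives a 1D x_test. Reshaping y_train to 2D is also unnecessary, since SVR expects a 1D target. A minimal sketch of one possible fix, assuming made-up data in place of the real file (the synthetic fuel and distance arrays are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR

# Made-up data standing in for the 'Fuel Consumption' and 'Distance' columns.
rng = np.random.default_rng(42)
fuel = rng.uniform(10, 20, size=30)
distance = 30 * fuel + rng.normal(scale=5, size=30)

# Reshape the single feature to 2D once, up front; keep the target 1D.
x = fuel.reshape(-1, 1)
y = distance

cv = KFold(n_splits=3, shuffle=True, random_state=42)
best_svr = SVR(kernel="rbf")
scores = []
for train_index, test_index in cv.split(x):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    best_svr.fit(x_train, y_train)
    scores.append(best_svr.score(x_test, y_test))  # R^2 on the held-out fold

print(scores)
```

Because x is 2D before the loop starts, every slice taken from it (train or test) is 2D as well, so no reshaping is needed inside the loop.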

How to fix "ValueError: Found input variables with inconsistent numbers of samples: [10000, 60000]"?

I have a problem training my model using Stochastic Gradient Descent on the MNIST dataset.
from sklearn.datasets import fetch_mldata
from sklearn.linear_model import SGDClassifier
mnist = fetch_mldata('MNIST original')
X, y = mnist["data"], mnist["target"]
some_digit = X[36000]
some_digit_image = some_digit.reshape(28, 28)
X_train, X_train, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
Error at the end of the process (I think the last line of code is the problem):
ValueError: Found input variables with inconsistent numbers of samples: [10000, 60000]
It's a typo on your side, you are assigning to X_train twice:
X_train, X_train, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
The correct line would be:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
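By the way, fetch_mldata is deprecated and has been removed from newer scikit-learn releases, so it would be a better idea to use fetch_openml. Note that fetch_openml typically returns the MNIST labels as strings, so you may need to cast them (for example with y.astype(int)) before comparing against 5: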
from sklearn.datasets import fetch_openml
X, y = fetch_openml("mnist_784", version=1, return_X_y=True)
I would also suggest using a stratified split between the train and test datasets, because otherwise some classes might have a skewed representation in the training set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
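To catch this kind of mismatch early, you can check that features and labels have the same number of samples before fitting. A minimal sketch, using the small digits dataset bundled with scikit-learn (load_digits) as a stand-in for MNIST so that it runs without a download; that substitution and the variable name accuracy are assumptions for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Stand-in for MNIST: 8x8 digit images bundled with scikit-learn.
X, y = load_digits(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

# Features and labels must have matching sample counts before fitting.
assert X_train.shape[0] == y_train.shape[0]
assert X_test.shape[0] == y_test.shape[0]

# Binary target: is the digit a 5 or not?
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
accuracy = sgd_clf.score(X_test, y_test_5)
print(accuracy)
```

With the original typo, the first assertion would fail immediately and point at the split rather than at fit.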
