Can the train_test_split be replaced from sklearn.model_selection? - python

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.2, random_state=42)
Can this call be replaced with
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42)

If your question is whether both snippets do the same job, then yes. Prefer the second one, where you import train_test_split directly: it makes the code shorter and easier to read.
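One way to convince yourself is to check that both spellings resolve to the same function object. This is a minimal sketch; the toy arrays X and y are assumptions for illustration and do not come from the question.

```python
import numpy as np
import sklearn.model_selection  # makes the fully qualified name available
from sklearn.model_selection import train_test_split

# Toy data, assumed for illustration only
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Both names point to the same function object
same_function = sklearn.model_selection.train_test_split is train_test_split

# With the same random_state, both calls produce identical splits
split_a = sklearn.model_selection.train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

print(same_function, same_split)
```

Note that the fully qualified form needs the submodule to be imported, which is one more reason the direct import is usually preferred.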

Related

How to quantify how good the model is after using train_test_split

I'm using the train_test_split from sklearn.model_selection. My code looks like the following:
x_train, x_test , y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1234)
Edit: After this is done, how do I fit these to a linear regression model and then see how good the model is? That is, which of the four components (x_train, x_test, y_train, or y_test) would I use to calculate MSE or RMSE, and how exactly would I do that?
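One common way to do this is to fit on the training pair, predict on x_test, and compare those predictions with y_test. The sketch below is only an illustration: the synthetic x and y stand in for the asker's data, and the names model, y_pred, mse, rmse and r2 are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's data: y is roughly linear in x
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = x @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=1234
)

# Fit on the training portion only
model = LinearRegression().fit(x_train, y_train)

# Evaluate on the held-out portion
y_pred = model.predict(x_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = model.score(x_test, y_test)  # R^2 on the test set

print(f"MSE={mse:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
```

Computing the same metrics on x_train and y_train as well can help spot overfitting, since a large gap between training and test error is a warning sign.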

UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names

Help please? I already tried adding .values to the X's, but it still resulted in an error. Any suggestions?
from sklearn import linear_model
from sklearn.model_selection import train_test_split
X = df[['Personal income','Personal saving']]
y = df['Gross domestic product']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
regr = linear_model.LinearRegression().fit(X_train, y_train)
sample = [10000, 1000]
sample_pred = regr.predict([sample])
As stated in this issue (https://github.com/tylerjrichards/Getting-Started-with-Streamlit-for-Data-Science/issues/5), converting the X_train DataFrame to a NumPy array (X_train.values) before fitting removes the warning.
It did in my testing. You can try either of the following:
X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X.to_numpy(), y, test_size=0.2, random_state=42)
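In addition, this warning does not affect the computed predictions. It only signals that the model was fitted on a DataFrame with column names while predict received input without them, so you can ignore it as long as the values are passed in the same column order as during fitting.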

I keep on getting the error name 'y_test' is not defined

I really need your help! I've written this code:
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score
def train_test_rmse(x,y):
    X = df_new[feature_cols]
    y = df_new['TOTAL CONSTRUCTION COST - EXCLUDING TAX']
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2,random_state=123)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_test = train_test_split(x, y, test_size = 0.2,random_state=123)
    y_pred = linreg.predict(X_test)
    print(accuracy_score(y_test, y_pred))
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))
^ The code above runs correctly. But when I try to plot a scatter plot in the cell beneath:
import matplotlib.pyplot as plt
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Y')
plt.ylabel('Predicted Y')
plt.show()
I get the error "name 'y_test' is not defined". Please let me know how to fix it. Thanks.
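In your code, y_test is assigned inside the train_test_rmse function, so it is local to that function and does not exist in the next cell. You need to define it outside the function (the same applies to y_pred, which the plot also uses), and the function must be called before plotting.
Your code should work with a few changes, as follows:
y_test = None
y_pred = None
def train_test_rmse(x,y):
    global y_test, y_pred
    X = df_new[feature_cols]
    y = df_new['TOTAL CONSTRUCTION COST - EXCLUDING TAX']
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2,random_state=123)
    # ... remainder of the function ...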
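An alternative that avoids global is to return the values the plot needs. This is a sketch under several assumptions: synthetic data replaces df_new and feature_cols, the function uses only its parameters, the second train_test_split call from the question is dropped because it would overwrite y_test with a list of four arrays, and accuracy_score is left out because it is a classification metric.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def train_test_rmse(x, y):
    """Fit a linear model and return RMSE plus the arrays needed for plotting."""
    X_train, X_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, random_state=123
    )
    linreg = LinearRegression().fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    return rmse, y_test, y_pred


# Synthetic stand-in for df_new[feature_cols] and the cost column
rng = np.random.default_rng(123)
x = rng.normal(size=(100, 2))
y = 3.0 * x[:, 0] - x[:, 1] + rng.normal(scale=0.2, size=100)

# The returned names now exist at module (notebook cell) level
rmse, y_test, y_pred = train_test_rmse(x, y)
print(rmse, y_test.shape, y_pred.shape)
```

After this call, plt.scatter(y_test, y_pred) in a later cell can see both arrays.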

Using sklearn.model_selection to split unbalanced dataset

I am using the following code to split my dataset into train/val/test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(
    X_test, y_test, test_size=0.5, random_state=42)
The problem is that my dataset is really unbalanced. For example, some classes have 500 samples while others have 70. Is this splitting method appropriate in this situation? Is the sampling purely random, or does sklearn use some method to keep the class distribution the same in all sets?
You should use the stratify option (see the docs):
X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, test_size=0.3, random_state=42, stratify=y_data)
X_test, X_val, y_test, y_val = train_test_split(
    X_test, y_test, test_size=0.5, random_state=42, stratify=y_test)
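To check that stratification keeps the class proportions, you can compare the minority share in each subset. The sketch below uses synthetic two-class labels with the 500 and 70 counts mentioned in the question; the helper minority_share and the intermediate names X_rest and y_rest are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 500 samples of class 0, 70 of class 1
y_data = np.array([0] * 500 + [1] * 70)
X_data = np.arange(len(y_data)).reshape(-1, 1)

X_train, X_rest, y_train, y_rest = train_test_split(
    X_data, y_data, test_size=0.3, random_state=42, stratify=y_data
)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest
)


def minority_share(labels):
    """Fraction of samples that belong to class 1."""
    return float(np.mean(labels == 1))


shares = {
    "all": minority_share(y_data),
    "train": minority_share(y_train),
    "test": minority_share(y_test),
    "val": minority_share(y_val),
}
for name, share in shares.items():
    print(f"{name}: {share:.3f}")
```

Without stratify, the split is a plain random shuffle, so small classes can end up noticeably over- or under-represented in the smaller subsets.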

How to fix "ValueError: Found input variables with inconsistent numbers of samples: [10000, 60000]"?

I have a problem training my model using stochastic gradient descent on the MNIST database.
from sklearn.datasets import fetch_mldata
from sklearn.linear_model import SGDClassifier
mnist = fetch_mldata('MNIST original')
X, y = mnist["data"], mnist["target"]
some_digit = X[36000]
some_digit_image = some_digit.reshape(28, 28)
X_train, X_train, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
The error appears at the end of the process (I think the last line of code is at fault):
ValueError: Found input variables with inconsistent numbers of samples: [10000, 60000]
It's a typo on your side: you are assigning to X_train twice, so X_train ends up holding the 10000 test rows while y_train has 60000 labels:
X_train, X_train, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
The correct line would be:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
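To see why the typo produces this message, the sketch below reproduces it on a small stand-in array. It calls sklearn.utils.check_consistent_length, a length check of the kind estimators apply to their inputs. The 70-row array and the split at 60 are assumptions for illustration.

```python
import numpy as np
from sklearn.utils import check_consistent_length

# Small stand-in for MNIST: 70 rows instead of 70000, split at 60
X = np.zeros((70, 4))
y = np.zeros(70)

# The typo: X_train is assigned twice, so it ends up holding X[60:]
X_train, X_train, y_train, y_test = X[:60], X[60:], y[:60], y[60:]

try:
    check_consistent_length(X_train, y_train)
    message = ""
except ValueError as err:
    message = str(err)

print(X_train.shape, y_train.shape)
print(message)
```

Printing the shapes of X_train and y_train before calling fit is a quick way to catch this kind of mismatch.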
By the way, fetch_mldata was deprecated in scikit-learn 0.20 and removed in 0.22, so it is better to use:
from sklearn.datasets import fetch_openml
X, y = fetch_openml("mnist_784", version=1, return_X_y=True)
I would suggest using a stratified split between the train and test datasets, because otherwise some classes might be over- or under-represented in the training set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
