Split dataframes into train/test to feed into a CNN - python

I have two dataframes: a 1065 x 75000 training matrix and a 1065 x 1 target matrix. I want to split them into training and testing sets so I can feed them to my CNN model. Any help would be appreciated.
I have tried the following:
y = target_dataframe["Target_column"]
X_train, X_test, Y_train, Y_test = train_test_split(df_training, y, test_size=0.2)
and
x_train, X_test = train_test_split(df_training)
y_train, y_test = train_test_split(df_target)
The CNN works with both of them, but how would I know which approach is better? And if there is something new, please share it with me.
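For reference, the first approach (a single `train_test_split` call on both arrays) is the standard one, because it shuffles features and targets with the same indices; splitting them in two separate calls, as in the second attempt, pairs rows at random. A minimal sketch with random stand-in data (the real matrix is 1065 x 75000; a narrower one is used here only for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-ins for the 1065 x 75000 training matrix and 1065 x 1 target
df_training = np.random.rand(1065, 100)
y = np.random.randint(0, 2, size=1065)

# one call keeps the rows of X and y aligned after shuffling
X_train, X_test, y_train, y_test = train_test_split(
    df_training, y, test_size=0.2, random_state=0)
```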

Related

Cannot concatenate object of type '<class 'numpy.ndarray'>'

I have a problem when I try to concatenate the train set and validation set. I split my dataset into train, validation and test sets, then I scale them with StandardScaler():
X_train, X_test, t_train, t_test = train_test_split(x, t, test_size=0.20, random_state=1)
X_train, X_valid, t_train, t_valid = train_test_split(X_train, t_train, test_size=0.25, random_state=1)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_valid = sc.transform(X_valid)
X_test = sc.transform(X_test)
Then after model selection i want concatenate training and validation set:
X_train = pd.concat([X_train, X_valid])
t_train = pd.concat([t_train, t_valid])
But it doesn't work. It gives me this error:
cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid
Can someone help me? Thanks
X_train, X_valid, t_train and t_valid are all NumPy arrays (train_test_split and StandardScaler return arrays), so they need to be concatenated using NumPy:
X_train = np.concatenate([X_train, X_valid])
t_train = np.concatenate([t_train, t_valid])
As suggested in the comments, it is most likely not a good idea to merge the train and validation sets together. Make sure you understand why datasets are split into training, validation and testing parts. You can apply cross-validation to use all the data for training and evaluation across multiple folds.
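One hedged sketch of that cross-validation suggestion: wrapping the scaler and an estimator in a pipeline re-fits `StandardScaler` on each training fold, so no scaling statistics leak from the held-out fold. The classifier and synthetic data here are placeholders, not the asker's model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=1)

# the pipeline refits the scaler inside every training fold
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)
```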

TensorFlow: How to validate on 1 specific row of data and train on the rest?

I have 11 rows of data, and my goal is to train the network on 10, and validate on 1 specific row (not random).
The aim is to validate on each single row in turn while training on the other 10, until I have a prediction for all 11 rows.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
The train/test split shown above doesn't seem like it will work, as it is random. Is there a way to specify exactly which rows are used for training and testing?
What you are looking for seems to be k-fold cross-validation. With n_splits equal to the number of rows, each row serves once as the validation set while the model trains on the remaining 10, and so forth. I would suggest using sklearn's built-in method.
from sklearn.model_selection import KFold

n_splits = 11  # one fold per row
for train_idx, test_idx in KFold(n_splits).split(x):
    x_train, x_test = x[train_idx], x[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # do your stuff
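Since n_splits equals the number of rows here, this is leave-one-out cross-validation, and sklearn also exposes it directly as `LeaveOneOut` (no fold count to keep in sync with the data size). A small sketch with dummy arrays of 11 rows:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

x = np.arange(22).reshape(11, 2)
y = np.arange(11)

# exactly one row is held out per iteration
for train_idx, test_idx in LeaveOneOut().split(x):
    x_train, x_test = x[train_idx], x[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
```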

Simple linear regression model

Please help me. After splitting my data into
X_train, y_train, X_test, y_test = train_test_split(X,y)
then passing it to my linear regression model, i.e.
linereg = LinearRegression().fit(X_train, y_train)
it raises an error saying the array must be 2D, not 1D. How can I make it a 2D array?
First, split the data correctly (note the return order: X_train, X_test, y_train, y_test):
X_train, x_test, Y_train,y_test=train_test_split(features,labels,train_size=0.7, test_size=0.3, random_state=2)
Then try reshaping x_train and x_test using the reshape method (this works when there is a single feature column):
x_test=x_test.reshape(-1,1)
x_train=x_train.reshape(-1,1)
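Putting the two steps together as a self-contained sketch (the toy single-feature data below is a stand-in for the asker's dataset): scikit-learn requires `X` to have shape `(n_samples, n_features)`, so a 1D feature array must be reshaped to `(-1, 1)` before fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# a single feature stored as a 1D array, which scikit-learn rejects for X
X = np.arange(20, dtype=float)
y = 2 * X + 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, test_size=0.3, random_state=2)

# reshape the 1D feature arrays to shape (n_samples, 1)
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)

model = LinearRegression().fit(X_train, y_train)
```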

What should be passed as an input parameter when using the train_test_split function twice in Python 3.6

Basically I wanted to split my dataset into training, testing and validation sets, so I have used the train_test_split function twice. I have a dataset of around 10 million rows.
On the first split I divided it 70/30 into 7 million training rows and 3 million testing rows. Now, to get a validation set, I am a bit confused whether to pass the split testing data or the training data as the input to train_test_split. Please give some advice. TIA
X = features
y = target
# dividing X, y into train, test and validation sets: 70% training, 15% testing, 15% validation
from sklearn.model_selection import train_test_split
# features and labels split 70/30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# the test data is further split into test and validation sets, 15% each
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5)
Don't make the testing set too small; a 20% testing set is fine. It would be better to split your training dataset into training and validation parts (80%/20% is a fair split). With that in mind, change your code as follows:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)
This is a common way to split a dataset.
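The arithmetic behind those two calls: 20% is held out for testing, then 25% of the remaining 80% becomes validation, giving a 60/20/20 split overall. A sketch with 1000 dummy rows standing in for the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# first hold out 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# then take 25% of the remaining 80% as validation -> 60/20/20 overall
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)
```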

Saving order of splitting with a vector of index

I want to split my data into train and test sets, and I also have a vector that contains image names (it serves me as an index and reference).
name_images has a shape of (2440,)
My data are :
data has a shape of (2440, 3072)
labels has a shape of (2440,)
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions
x_train, x_test, y_train, y_test= train_test_split(data, labels, test_size=0.3)
but I also want to split name_images into name_images_train and name_images_test consistently with the split of data and labels.
I tried
x_train, x_test, y_train, y_test,name_images_train,name_images_test= train_test_split(data, labels,name_images, test_size=0.3)
but it doesn't preserve the order. Any suggestions? Thank you.
EDIT1:
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.3, random_state=42)
name_images_train, name_images_test = train_test_split(name_images, test_size=0.3, random_state=42)
EDIT1 doesn't preserve the order either.
There are multiple ways to accomplish this.
The most straightforward is to use the random_state parameter of train_test_split. As the documentation states:
random_state : int or RandomState :-
Pseudo-random number generator state used for random sampling.
When you fix random_state, the indices generated for splitting the arrays into train and test are exactly the same each time. And when you pass all three arrays to a single call, they are all split with those same indices.
So change your code to:
x_train, x_test, y_train, y_test, name_images_train, name_images_test = train_test_split(
    data, labels, name_images, test_size=0.3, random_state=42)
For more understanding on random_state, see my answer here:
https://stackoverflow.com/a/42197534/3374996
In my case, I realized that my input arrays were not in the proper order in the first place. So for future Googlers: you may want to double-check whether (data, labels) are in the same order to begin with.
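That alignment can be verified directly on toy data (the small arrays below are illustrative stand-ins for data, labels and name_images): after a single three-array call, each name still matches its label and row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20).reshape(10, 2)
labels = np.arange(10)
name_images = np.array(["img_%d" % i for i in range(10)])

# one call shuffles all three arrays with the same indices
x_tr, x_te, y_tr, y_te, names_tr, names_te = train_test_split(
    data, labels, name_images, test_size=0.3, random_state=42)
```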