I was not able to find the answer to this anywhere. I have data spanning three months and would like to use the first two months ('Jan-19', 'Feb-19') as the training set and the last month ('Mar-19') as the test set.
Previously I have done random sampling with simple code like this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,random_state=109)
and before that I assigned y as the label and X as the predictor columns. I'm not sure how to assign the training and test sets to the specific months I want.
Thank you
If your data is in a pandas dataframe, you can use subsetting like this:
X_train = X[X['month'] != 'Mar-19']
y_train = y[X['month'] != 'Mar-19']
X_test = X[X['month'] == 'Mar-19']
y_test = y[X['month'] == 'Mar-19']
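As a runnable sketch with a toy frame (the `month` column and values here are stand-ins for your own data):

```python
import pandas as pd

# Toy data with an assumed 'month' column; replace with your own frame.
X = pd.DataFrame({
    "month": ["Jan-19", "Feb-19", "Mar-19", "Jan-19", "Mar-19"],
    "feature": [1, 2, 3, 4, 5],
})
y = pd.Series([10, 20, 30, 40, 50])

# Build the mask once so train and test are guaranteed to be complements.
train_mask = X["month"] != "Mar-19"
X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[~train_mask], y[~train_mask]

print(len(X_train), len(X_test))  # 3 2
```

Computing the mask once and reusing it (with `~` for the test rows) keeps the feature and label splits aligned.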
You can try this option (it assumes df has a sorted DatetimeIndex) and see if it helps:
dataset_train = df['2004-02-12 11:02:39':'2004-02-13 23:52:39']
dataset_test = df['2004-02-13 23:52:39':]
I am trying to use train_test_split to get my train data to be the dataframe between indexes 31 and 39.
I want to write something like X_train, X_test, y_train, y_test = train_test_split(faces.data, faces.target, test_size = 0.3) where faces is faces = sk.datasets.fetch_olivetti_faces()
How can I select which indexes I want to go into my train data?
As @berkayln suggested, I'm not sure your train-test split strategy is advisable, but to split the data as you're suggesting, I believe you can use:
from sklearn import datasets
import numpy as np
faces = datasets.fetch_olivetti_faces()
X_train = faces.data[31:40]
X_test = faces.data[np.r_[0:31, 40:400]]
y_train = faces.target[31:40]
y_test = faces.target[np.r_[0:31, 40:400]]
You can do this easily with plain index slicing:
n = 40  # size of the training set
X_train = faces.data[:n]
y_train = faces.target[:n]
X_test = faces.data[n:]
y_test = faces.target[n:]
I know how to utilize a basic train_test_split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
However, what if I want to divide my training and test sets by a variable, in this case year? I want all values where year == 2019 to be my test set, while year < 2019 is my training set. How can I alter the code above to make that happen?
Let me explain with an example:
If your corpus has 1000 data points and you want a 700/300 train/test split, find the data points with year == 2019, move them to the end of the corpus, and treat them as test data, like below (suppose 200 data points satisfy the year == 2019 condition):
X_test, y_test = X[800:1000], y[800:1000]
and if, for example, 300 data points have year < 2019, after moving them to the top:
X_train, y_train = X[0:300], y[0:300]
Now, for the rest of your corpus (rows 300 to 800), redefine X and Y:
X = data.iloc[300:800]
Y = label.iloc[300:800]
and then use train_test_split on the new X and Y, and join the resulting X_test, y_test, X_train, y_train with the previous ones.
I have 11 rows of data, and my goal is to train the network on 10, and validate on 1 specific row (not random).
The aim is to work through validating on each single row while training on the other 10, until I have a prediction for all 11 rows.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
The train/test split shown above doesn't seem like it will work, since it is random. Is there a way to specify exactly which rows are to be used for training and testing?
What you are looking for seems to be k-fold cross validation. This will use each row as a validation set in turn, training on the remaining 10 rows, until every row has been predicted once. I would suggest using sklearn's built-in method.
from sklearn.model_selection import KFold

n_splits = 11
for train_idx, test_idx in KFold(n_splits).split(x):
    x_train, x_test = x[train_idx], x[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # do your stuff
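With exactly 11 rows and one validation row per fold, `KFold(n_splits=11)` is equivalent to `LeaveOneOut`, which spares you hard-coding the count. A minimal sketch with dummy arrays standing in for your data:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

x = np.arange(22).reshape(11, 2)  # dummy features: 11 rows, 2 columns
y = np.arange(11)                 # dummy targets

# One fold per row: train on 10 rows, validate on the remaining 1.
folds = list(LeaveOneOut().split(x))
print(len(folds))  # 11

train_idx, test_idx = folds[0]
print(len(train_idx), len(test_idx))  # 10 1
```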
I have two dataframes: one of shape 1065×75000, which is my training matrix, and one of shape 1065×1, which is my target matrix. I want to split them into training and testing sets so I can feed them to my CNN model. Any help would be appreciated.
I have tried the following:
y = target_dataframe["Target_column"]
X_train, X_test, Y_train, Y_test = train_test_split(df_training, y, test_size=0.2)
and
x_train, X_test = train_test_split(df_training)
y_train, y_test = train_test_split(df_target)
The CNN works with both of them, but how would I know which is the better approach? And if there is something newer, please share it with me.
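For what it's worth, the first pattern (a single `train_test_split` call receiving both the features and the targets) is the safer one: one call splits both with the same shuffle, whereas two separate calls shuffle the features and the targets independently and break the row-to-target pairing. A minimal sketch with dummy frames (shapes shrunk from the question's 1065×75000; `Target_column` follows the question):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Dummy stand-ins for df_training (features) and the target column.
df_training = pd.DataFrame(np.arange(20).reshape(10, 2))
y = pd.Series(np.arange(10), name="Target_column")

X_train, X_test, y_train, y_test = train_test_split(
    df_training, y, test_size=0.2, random_state=0
)

# Feature rows and targets still share the same index after the split.
print((X_train.index == y_train.index).all())  # True
```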
I ran a random forest model in Python to see the importance of features. However, the predicted value (y) is not actually dropped from the features, and it ends up being treated as one of the input parameters, taking over 98% of the importance.
The code is as below:
temp=pd.read_csv('temp_data.csv',sep=',',engine='python')
temp['y'] = temp['temp_actual']
y = temp['y'].values
temp = temp.drop(['y'],axis=1)
#X = temp.loc[:,:]
x= temp.values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
Please help me correct the code. Thanks!
In your code you copied the target feature into a new column y:
temp['y'] = temp['temp_actual']
Then you set y to the values in that column:
y = temp['y'].values
You then dropped the column y from the dataframe with the following code:
temp = temp.drop(['y'],axis=1)
Now if you look at the columns of the dataframe temp, you can see that y is gone but temp_actual is still there.
You have to remove that column from the dataframe as well; to do so, use either of the following:
del temp['temp_actual']
OR
temp = temp.drop(['temp_actual'], axis=1)
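Putting the fix together, a minimal sketch of the corrected feature/target separation (a dummy two-row frame stands in for temp_data.csv here, with one assumed feature column):

```python
import pandas as pd

# Dummy stand-in for the CSV: one feature column plus the target column.
temp = pd.DataFrame({"feature_a": [1.0, 2.0], "temp_actual": [0.5, 0.7]})

# Take the target values, then drop the target column from the features.
y = temp["temp_actual"].values
X = temp.drop(columns=["temp_actual"]).values

print(X.shape)  # (2, 1) -- the target column is really gone now
```

With the target removed from X, it can no longer dominate the feature importances.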