I have built a multi-step, multivariate LSTM model to predict the target variable 5 days into the future with 5 days of look-back. The model runs smoothly (even though it could be further improved), but once I get my predictions I cannot correctly invert the transformation I applied.
I have seen on the web that there are many ways to pre-process and transform data. I decided to follow these steps:
Data fetching and cleaning
df = yfinance.download(['^GSPC', '^GDAXI', 'CL=F', 'AAPL'], period='5y', interval='1d')['Adj Close']
df.dropna(axis=0, inplace=True)
df.describe()
(df.describe() output table omitted)
Split the data set into train and test
size = int(len(df) * 0.80)
df_train = df.iloc[:size]
df_test = df.iloc[size:]
Scaled the data with MinMaxScaler() fitted on the training set only
scaler = MinMaxScaler(feature_range=(0,1))
df_train_sc = scaler.fit_transform(df_train)
df_test_sc = scaler.transform(df_test)
Creation of 3D X and y time-series compatible with the LSTM model
I borrowed the following function from this article
def create_X_Y(ts: np.array, lag=1, n_ahead=1, target_index=0) -> tuple:
    """
    A method to create X and Y matrix from a time series array for the training of
    deep learning models
    """
    # Extracting the number of features that are passed from the array
    n_features = ts.shape[1]

    # Creating placeholder lists
    X, Y = [], []

    if len(ts) - lag <= 0:
        X.append(ts)
    else:
        for i in range(len(ts) - lag - n_ahead):
            Y.append(ts[(i + lag):(i + lag + n_ahead), target_index])
            X.append(ts[i:(i + lag)])

    X, Y = np.array(X), np.array(Y)

    # Reshaping the X array to an RNN input shape
    X = np.reshape(X, (X.shape[0], lag, n_features))

    return X, Y
# In this example let's assume that the first column (AAPL) is the target variable.
trainX, trainY = create_X_Y(df_train_sc, lag=5, n_ahead=5, target_index=0)
testX, testY = create_X_Y(df_test_sc, lag=5, n_ahead=5, target_index=0)
Model creation
def build_model(optimizer):
    grid_model = Sequential()
    grid_model.add(LSTM(64, activation='tanh', return_sequences=True,
                        input_shape=(trainX.shape[1], trainX.shape[2])))
    grid_model.add(LSTM(64, activation='tanh', return_sequences=True))
    grid_model.add(LSTM(64, activation='tanh'))
    grid_model.add(Dropout(0.2))
    grid_model.add(Dense(trainY.shape[1]))
    grid_model.compile(loss='mse', optimizer=optimizer)
    return grid_model
grid_model = KerasRegressor(build_fn=build_model, verbose=1, validation_data=(testX, testY))

parameters = {'batch_size': [12, 24],
              'epochs': [8, 30],
              'optimizer': ['adam', 'Adadelta']}

grid_search = GridSearchCV(estimator=grid_model,
                           param_grid=parameters,
                           cv=3)

grid_search = grid_search.fit(trainX, trainY)
grid_search.best_params_
my_model = grid_search.best_estimator_.model
Get predictions
yhat = my_model.predict(testX)
Invert transformation of predictions and actual values
Here my problems begin, because I am not sure which way to go. I have read many tutorials, but it seems that those authors prefer to apply MinMaxScaler() to the entire dataset before splitting it into train and test. I do not agree with this, because otherwise the training data would be scaled with information we should not use (i.e. the test set). So I followed my own approach, but I am stuck here.
I found this possible solution on another post, but it's not working for me:
# invert scaling for forecast
pred_scaler = MinMaxScaler(feature_range=(0, 1)).fit(df_test.values[:,0].reshape(-1, 1))
inv_yhat = pred_scaler.inverse_transform(yhat)
# invert scaling for actual
inv_y = pred_scaler.inverse_transform(testY)
In fact, when I double-check the last values of the target in my original data set, they don't match the inverse-transformed version of testY.
Can someone please help me on this? Many thanks in advance for your support!
A few things could be mentioned here. First, you cannot inverse-transform something the scaler has never seen. This happens because you use two different scalers: the network predicts values in the range of scaler 1 (fitted on the training data), and there is no guarantee that these lie within the range of scaler 2 (fitted on the test data). Second, the best practice is to fit your scaler on the training set and use the same scaler (transform only) on the test data as well. You should then be able to inverse-transform your test results. Third, if the scaling goes off because the test set has completely different values, e.g. with live streaming data, it is up to you to deal with it; for instance, the min-max scaler will then produce values > 1.0.
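As a minimal sketch of that second point (assuming the scaler fitted on df_train, plus the yhat and testY arrays from the question, with the target in column 0): invert the 5-step-ahead predictions with the min/max that the training-set scaler learned for the target column, instead of fitting a new scaler on the test set.
# Sketch: invert predictions using the statistics of the scaler fitted on df_train.
# scaler.data_min_ / scaler.data_max_ hold per-column values learned from the
# training set; column 0 is the target (AAPL in this example).
t_min = scaler.data_min_[0]
t_max = scaler.data_max_[0]

inv_yhat = yhat * (t_max - t_min) + t_min    # shape (n_samples, 5), back in price units
inv_y = testY * (t_max - t_min) + t_min

# inv_y should now line up with the raw target values in df_test (up to the windowing offset)
Either way, the essential point is that the statistics used for the inversion come from the training data, not the test data.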
I essentially want to tag each of the targets so that afterward I can analyze each sample individually and look at it in much greater detail. I have a dataset, and I split it into training, validation, and testing sets.
import numpy as np
from sklearn.model_selection import train_test_split
ins = np.load('ins.npy')
ins.shape
# (100,12,60,60)
type(ins)
# <class 'numpy.ndarray'>
And then I send it through the training. I assume here I need to attach another np array that gives identifying information about the sample.
ins_train, ins_test, outs_train, outs_test = train_test_split(ins, outs, test_size=0.25, random_state=3)
ins_train, ins_val, outs_train, outs_val = train_test_split(ins_train, outs_train, test_size=0.2, random_state=2)
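For example, I imagine something like the following (just a sketch; timestamps here is a hypothetical array of identifying information aligned with ins), since train_test_split accepts any number of arrays and splits them consistently:
# hypothetical array of identifying information, one entry per sample in ins
timestamps = np.load('timestamps.npy')   # e.g. shape (100,)

(ins_train, ins_test,
 outs_train, outs_test,
 ts_train, ts_test) = train_test_split(ins, outs, timestamps,
                                        test_size=0.25, random_state=3)

(ins_train, ins_val,
 outs_train, outs_val,
 ts_train, ts_val) = train_test_split(ins_train, outs_train, ts_train,
                                      test_size=0.2, random_state=2)

# ts_test[i] would then identify the sample behind ins_test[i] and any prediction made on it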
However, I do use a generator to send this data through a keras neural network. Should I provide that code as well? I will provide some, let me know if more is needed.
Here I make the generator:
def training_set_generator_images(ins, outs, batch_size=10,
                                  input_name='input',
                                  output_name='output'):
    '''
    Generator for producing random minibatches of image training samples.

    @param ins Full set of training set inputs (examples x row x col x chan)
    @param outs Corresponding set of sample (examples x nclasses)
    @param batch_size Number of samples for each minibatch
    @param input_name Name of the model layer that is used for the input of the model
    @param output_name Name of the model layer that is used for the output of the model
    '''
    while True:
        # Randomly select a set of example indices
        example_indices = random.choices(range(ins.shape[0]), k=batch_size)

        # The generator will produce a pair of return values: one for inputs and one for outputs
        yield({input_name: ins[example_indices, :, :, :]},
              {output_name: outs[example_indices, :, :, :]})
I create a keras neural network. I assume any neural network would do, but I'm not sure.
model = create_uNet(ins_train.shape[1:])
# call the generator
generator = training_set_generator_images(ins_train, outs_train, batch_size=50,
                                          input_name='input',
                                          output_name='output')
And then we fit the model.
history = model.fit(x=generator,epochs=1000)
# and we save the results
results = {}
results['predict_training'] = model.predict(ins_train)
results['predict_training_eval'] = model.evaluate(ins_train, outs_train)
results['true_training'] = outs_train
results['predict_validation'] = model.predict(ins_val)
results['predict_validation_eval'] = model.evaluate(ins_val, outs_val)
results['true_validation'] = outs_val
results['true_testing'] = outs_test
results['predict_testing'] = model.predict(ins_test)
results['predict_testing_eval'] = model.evaluate(ins_test, outs_test)
results['history'] = history.history
Within results['true_testing'] are the truths for the test set, and results['predict_testing'] are the corresponding predictions for the test set. For this to work, each example results['predict_testing'][i] will need to carry, in addition to the image, the identifying information (in our case, a timestamp). Alternatively, there could be three extra dictionary entries:
results['training_timestamps'] = training_timestamps
results['validation_timestamps'] = validation_timestamps
results['testing_timestamps'] = testing_timestamps
Please let me know if you need more information. It seems like the solution might be quick and easy, but the generator throws me off.
I am training a GCN (Graph Convolutional Network) on Cora dataset.
The Cora dataset has the following attributes:
Number of graphs: 1
Number of features: 1433
Number of classes: 7
Number of nodes: 2708
Number of edges: 10556
Number of training nodes: 140
Training node label rate: 0.05
Is undirected: True
Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])
Since my code is very long, I only put the relevant parts here. First, I split the Cora dataset as follows:
def to_mask(index, size):
    mask = torch.zeros(size, dtype=torch.bool)
    mask[index] = 1
    return mask


def cora_splits(data, num_classes):
    indices = []
    for i in range(num_classes):
        # returns all indices of the elements = i from the data.y tensor
        index = (data.y == i).nonzero().view(-1)

        # returns a random permutation of integers from 0 to index.size(0)
        index = index[torch.randperm(index.size(0))]

        # indices is a list of tensors and it has a length of 7
        indices.append(index)

    # select 20 nodes from each class for training
    train_index = torch.cat([i[:20] for i in indices], dim=0)

    rest_index = torch.cat([i[20:] for i in indices], dim=0)
    rest_index = rest_index[torch.randperm(len(rest_index))]

    data.train_mask = to_mask(train_index, size=data.num_nodes)
    data.val_mask = to_mask(rest_index[:500], size=data.num_nodes)
    data.test_mask = to_mask(rest_index[500:], size=data.num_nodes)

    return data
The training function is as follows (taken from here with a few modifications):
def train(model, optimizer, data, epoch):
    t = time.time()
    model.train()
    optimizer.zero_grad()
    output = model(data)
    loss_train = F.nll_loss(output[data.train_mask], data.y[data.train_mask])
    acc_train = accuracy(output[data.train_mask], data.y[data.train_mask])
    loss_train.backward()
    optimizer.step()
    loss_val = F.nll_loss(output[data.val_mask], data.y[data.val_mask])
    acc_val = accuracy(output[data.val_mask], data.y[data.val_mask])


def accuracy(output, labels):
    preds = output.max(1)[1].type_as(labels)
    correct = preds.eq(labels).double()
    correct = correct.sum()
    return correct / len(labels)
When I ran my code for 200 epochs in 10 runs, I obtained:
tensor([0.7690, 0.8030, 0.8530, 0.8760, 0.8600, 0.8550, 0.8850, 0.8580, 0.8940, 0.8830])
Val Loss: 0.5974, Test Accuracy: 0.854 ± 0.039
where each value in the tensor is the test accuracy of one run, and the mean accuracy over those 10 runs is 0.854 with a standard deviation of ± 0.039.
As can be observed, the accuracy increases substantially from the first run to the 10th. Therefore, I think the model is overfitting. One reason for overfitting could be that the test data has been seen by the model at training time: in the train function there is a line output = model(data), so the model is run over the whole data. What I intend to do is to train my model only on a part of the data (something similar to data[data.train_mask]), but the problem is that I cannot pass data[data.train_mask], due to the forward function of the GCN model (from this repository):
def forward(self, data):
    x, edge_index = data.x, data.edge_index
    x = F.relu(self.conv1(x, edge_index))
    for conv in self.convs:
        x = F.relu(conv(x, edge_index))
    x = F.relu(self.lin1(x))
    x = F.dropout(x, p=0.5, training=self.training)
    x = self.lin2(x)
    return F.log_softmax(x, dim=-1)
If I pass data[data.train_mask] to the GCN model, then in the line x, edge_index = data.x, data.edge_index of the above forward function, x and edge_index cannot be retrieved from data[data.train_mask]. Therefore, I need a way to split the Cora dataset such that I can pass a specific part of it, with its nodes, edge_index and other attributes, to the model. My question is: how can I do that?
Also, any suggestion about k-fold cross validation is much appreciated.
I guess you are a little confused by the nature of transductive learning and the question you asked doesn't actually address the problem you are facing.
As can be observed, the accuracy increases substantially from the first run to the 10th. Therefore, I think the model is overfitting.
Not necessarily. Increasing test accuracy can be normal behavior while your model is learning from the training samples, and the learning can last for several dozens of epochs due to the complexity and non-convexity of the loss function. The clearest signal of overfitting is when your training accuracy keeps increasing but your test accuracy decreases significantly.
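For example, here is a rough sketch of how you could check this in your own setting (reusing the model, data, optimizer and accuracy pieces from your question; the epoch count is arbitrary): track training and validation accuracy per epoch and look for a growing gap.
train_accs, val_accs = [], []
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    output = model(data)
    loss = F.nll_loss(output[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        output = model(data)
    train_accs.append(accuracy(output[data.train_mask], data.y[data.train_mask]))
    val_accs.append(accuracy(output[data.val_mask], data.y[data.val_mask]))

# overfitting shows up as training accuracy still rising while validation accuracy flattens or drops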
One reason for overfitting could be that the test data has been seen by the model at training time: in the train function there is a line output = model(data), so the model is run over the whole data.
The model has indeed seen the entire graph (adjacency matrix) during training, but it only sees the labels of the nodes in the training set and knows nothing about the labels of the nodes in the test set. This is exactly what transductive learning does.
In the end, if you are 100% sure you want to avoid the paradigm of transductive learning, you might need to write your own split algorithm to achieve that. But I would like to point out that in real-world use cases transduction is perfectly suitable. An example is predicting potential links between social-network users, where we have the whole network structure as input and simply want to run edge prediction, i.e. transduction. Thus it doesn't make a lot of sense to avoid it.
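If you do want to go that route anyway, one possible sketch (not the only way) is to build the induced subgraph on the training nodes with PyTorch Geometric's subgraph utility, so the model only ever receives the training nodes and the edges between them. Here induced_split is a hypothetical helper, and data, model and F are the objects from your question:
from torch_geometric.data import Data
from torch_geometric.utils import subgraph

def induced_split(data, mask):
    # keep only the nodes selected by `mask` and the edges between them,
    # relabelling node indices so they match the reduced feature matrix
    edge_index, _ = subgraph(mask, data.edge_index, relabel_nodes=True,
                             num_nodes=data.num_nodes)
    return Data(x=data.x[mask], y=data.y[mask], edge_index=edge_index)

train_data = induced_split(data, data.train_mask)
output = model(train_data)                    # forward() still finds .x and .edge_index
loss_train = F.nll_loss(output, train_data.y)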
Depending on your task, you could take a look at how StellarGraph's EdgeSplitter class (docs) and scikit-learn's train_test_split function (docs) achieve the split.
Node classification
If your task is node classification, this Node classification with Graph Convolutional Network (GCN) example is a good illustration of how to load data and do a train-test split. It uses the Cora dataset as an example. The most important steps are the following:
dataset = sg.datasets.Cora()
display(HTML(dataset.description))
G, node_subjects = dataset.load()

train_subjects, test_subjects = model_selection.train_test_split(
    node_subjects, train_size=140, test_size=None, stratify=node_subjects
)
val_subjects, test_subjects = model_selection.train_test_split(
    test_subjects, train_size=500, test_size=None, stratify=test_subjects
)

train_gen = generator.flow(train_subjects.index, train_targets)
val_gen = generator.flow(val_subjects.index, val_targets)
test_gen = generator.flow(test_subjects.index, test_targets)
Basically, it's the same as a train-test split in a normal classification task, except that what we split here are nodes.
Edge classification
If your task is edge classification, you could have a look at this Link prediction example: GCN on the Cora citation dataset. The most relevant code for the train-test split is:
# Define an edge splitter on the original graph G:
edge_splitter_test = EdgeSplitter(G)

# Randomly sample a fraction p=0.1 of all positive links, and the same number of
# negative links, from G, and obtain the reduced graph G_test with the sampled
# links removed:
G_test, edge_ids_test, edge_labels_test = edge_splitter_test.train_test_split(
    p=0.1, method="global", keep_connected=True
)

# Define an edge splitter on the reduced graph G_test:
edge_splitter_train = EdgeSplitter(G_test)

# Randomly sample a fraction p=0.1 of all positive links, and the same number of
# negative links, from G_test, and obtain the reduced graph G_train with the
# sampled links removed:
G_train, edge_ids_train, edge_labels_train = edge_splitter_train.train_test_split(
    p=0.1, method="global", keep_connected=True
)

# For training we create a generator on the G_train graph, and make an
# iterator over the training links using the generator's flow() method:
train_gen = FullBatchLinkGenerator(G_train, method="gcn")
train_flow = train_gen.flow(edge_ids_train, edge_labels_train)

test_gen = FullBatchLinkGenerator(G_test, method="gcn")
test_flow = train_gen.flow(edge_ids_test, edge_labels_test)
Here the splitting algorithm behind the EdgeSplitter class (docs) is more complex: it needs to maintain the graph structure while doing the split, for example keeping the graph connected. For more details, see the source code for EdgeSplitter.
I have a bunch of images like this one of someone playing a video game (a simple game I created in Tkinter):
The idea of the game is that the user controls the box at the bottom of the screen in order to dodge the falling balls (they can only dodge left and right).
My goal is to have the neural network output the position of the player on the bottom of the screen. If they're totally on the left, the neural network should output a 0, if they're in the middle, a .5, and all the way right, a 1, and all the values in-between.
My images are 300x400 pixels. I stored my data very simply. I recorded each of the images and position of the player as a tuple for each frame in a 50-frame game. Thus my result was a list in the form [(image, player position), ...] with 50 elements. I then pickled that list.
So in my code I try to create an extremely basic feed-forward network that takes in the image and outputs a value between 0 and 1 representing where the box on the bottom of the image is. But my neural network is only outputting 1s.
What should I change in order to get it to train and output values close to what I want?
Of course, here is my code:
# machine learning code mostly from https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
import pickle
def pil_image_to_np_array(image):
    '''Takes an image and converts it to a numpy array'''
    # from https://stackoverflow.com/a/45208895
    # all my images are black and white, so I only need one channel
    return np.array(image)[:, :, 0:1]

def data_to_training_set(data):
    # split the list in the form [(frame 1 image, frame 1 player position), ...] into [[all images], [all player positions]]
    inputs, outputs = [list(val) for val in zip(*data)]
    for index, image in enumerate(inputs):
        # convert the PIL images into numpy arrays so Keras can process them
        inputs[index] = pil_image_to_np_array(image)
    return (inputs, outputs)

if __name__ == "__main__":
    # fix random seed for reproducibility
    np.random.seed(7)

    # load data
    # data will be in the form [(frame 1 image, frame 1 player position), (frame 2 image, frame 2 player position), ...]
    with open("position_data1.pkl", "rb") as pickled_data:
        data = pickle.load(pickled_data)
    X, Y = data_to_training_set(data)

    # get the width of the images
    width = X[0].shape[1]  # == 400

    # convert the player position (a value between 0 and the width of the image) to values between 0 and 1
    for index, output in enumerate(Y):
        Y[index] = output / width

    # flatten the image inputs so they can be passed to a neural network
    for index, inpt in enumerate(X):
        X[index] = np.ndarray.flatten(inpt)

    # keras expects an array (not a list) of image-arrays for input to the neural network
    X = np.array(X)
    Y = np.array(Y)

    # create model
    model = Sequential()
    # my images are 300 x 400 pixels, so each input will be a flattened array of 120000 gray-scale pixel values
    # keep it super simple by not having any deep learning
    model.add(Dense(1, input_dim=120000, activation='sigmoid'))

    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')

    # Fit the model
    model.fit(X, Y, epochs=15, batch_size=10)

    # see what the model is doing
    predictions = model.predict(X, batch_size=10)
    print(predictions)  # this prints all 1s! # TODO fix
EDIT: print(Y) gives me:
so it's definitely not all zeroes.
Of course, a deeper model might give you better accuracy, but considering that your images are simple, a fairly simple (shallow) model with only one hidden layer should give medium to high accuracy. So here are the modifications you need to make this happen:
Make sure X and Y are of type float32 (currently, X is of type uint8):
X = np.array(X, dtype=np.float32)
Y = np.array(Y, dtype=np.float32)
When training a neural network, it is much better to normalize the training data. Normalization helps the optimization process go smoothly and speeds up convergence to a solution. It also prevents large values from causing large gradient updates, which would be disruptive. Usually, the values of each feature in the input data should fall in a small range, where two common ranges are [-1, 1] and [0, 1]. Therefore, to make sure all values roughly fall in the range [-1, 1], we subtract from each feature its mean and divide it by its standard deviation:
X_mean = X.mean(axis=0)
X -= X_mean
X_std = X.std(axis=0)
X /= X_std + 1e-8 # add a very small constant to prevent division by zero
Note that we are normalizing each feature (i.e. each pixel in this case), not each image. When you want to predict on new data, i.e. in inference or test mode, you need to subtract X_mean from the test data and divide it by X_std (you should NEVER EVER subtract the test data's own mean or divide by its own standard deviation; rather, use the mean and std of the training data):
X_test -= X_mean
X_test /= X_std + 1e-8
If you apply the changes in points one and two, you may notice that the network no longer predicts only ones or only zeros. Rather, it shows some faint signs of learning and predicts a mix of zeros and ones. This is not bad, but it is far from good, and we have high expectations! The predictions should be much better than a mix of only zeros and ones. Here you should take the (forgotten!) learning rate into account. Since the network has a relatively large number of parameters for a relatively simple problem (and there are only a few training samples), you should choose a smaller learning rate to smooth the gradient updates and the learning process:
from keras import optimizers
model.compile(loss='mean_squared_error', optimizer=optimizers.Adam(lr=0.0001))
You will notice the difference: the loss value reaches around 0.01 after 10 epochs, and the network no longer predicts a mix of zeros and ones; rather, the predictions are much more accurate and close to what they should be (i.e. Y).
Don't forget! We have high (logical!) expectations. So, how can we do better without adding any new layers to the network (obviously, we assume that adding more layers might help!!)?
4.1. Gather more training data.
4.2. Add weight regularization. Common ones are L1 and L2 regularization (I highly recommend the Jupyter notebooks of the book Deep Learning with Python, written by François Chollet, the creator of Keras. Specifically, here is the one which discusses regularization.) A short sketch follows below.
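As an illustration of point 4.2, a minimal sketch of what L2 weight regularization could look like on the single-layer model above (the 0.01 factor is just a starting point to tune):
from keras import regularizers

model = Sequential()
# penalize large weights: the L2 term adds 0.01 * sum(weight ** 2) to the loss
model.add(Dense(1, input_dim=120000, activation='sigmoid',
                kernel_regularizer=regularizers.l2(0.01)))
model.compile(loss='mean_squared_error', optimizer=optimizers.Adam(lr=0.0001))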
You should always evaluate your model in a proper and unbiased way. Evaluating it on the training data (which you have used to train it) does not tell you anything about how well your model would perform on unseen (i.e. new or real-world) data points (for example, consider a model which stores or memorizes all the training data: it would perform perfectly on the training data, but it would be a useless model that performs poorly on new data). So we should have separate train and test datasets: we train the model on the training data and evaluate it on the test (i.e. new) data. However, during the process of coming up with a good model you perform lots of experiments: for example, you first change the type and number of layers, train the model, and then evaluate it on the test data to make sure it is good; then you change something else, say the learning rate, train it again, and evaluate it again on the test data... To make it short, these cycles of tuning and evaluation somehow cause overfitting on the test data. Therefore, we need a third dataset called validation data (read more: What is the difference between test set and validation set?):
# first shuffle the data to make sure it isn't in any particular order
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
X = X[indices]
Y = Y[indices]
# you have 200 images
# we select 100 images for training,
# 50 images for validation and 50 images for test data
X_train = X[:100]
X_val = X[100:150]
X_test = X[150:]
Y_train = Y[:100]
Y_val = Y[100:150]
Y_test = Y[150:]
# train and tune the model
# you can attempt train and tune the model multiple times,
# each time with different architecture, hyper-parameters, etc.
model.fit(X_train, Y_train, epochs=15, batch_size=10, validation_data=(X_val, Y_val))
# only and only after completing the tuning of your model
# you should evaluate it on the test data for just one time
model.evaluate(X_test, Y_test)
# after you are satisfied with the model performance
# and want to deploy your model for production use (i.e. real world)
# you can train your model once more on the whole data available
# with the best configurations you have found out in your tunings
model.fit(X, Y, epochs=15, batch_size=10)
(Actually, when we have little training data available, it would be wasteful to set aside both validation and test data from the whole available data. In this case, and if the model is not computationally expensive, instead of holding out a separate validation set one can do K-fold cross-validation, or iterated K-fold cross-validation if very few data samples are available.)
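For completeness, a minimal sketch of plain K-fold cross-validation in this setting (assuming the X and Y arrays from above; create_model is a hypothetical helper that rebuilds the single-layer model so each fold starts from fresh weights):
from sklearn.model_selection import KFold

def create_model():
    m = Sequential()
    m.add(Dense(1, input_dim=120000, activation='sigmoid'))
    m.compile(loss='mean_squared_error', optimizer='adam')
    return m

kfold = KFold(n_splits=5, shuffle=True, random_state=7)
fold_losses = []
for train_idx, val_idx in kfold.split(X):
    model = create_model()                                   # fresh weights per fold
    model.fit(X[train_idx], Y[train_idx], epochs=15, batch_size=10, verbose=0)
    fold_losses.append(model.evaluate(X[val_idx], Y[val_idx], verbose=0))

print(np.mean(fold_losses), '+/-', np.std(fold_losses))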
It is around 4 AM at the time of writing this answer and I am feeling sleepy, but I would like to mention one more thing which is not directly related to your question: by using the NumPy library and its functionality and methods you can write more concise and efficient code and also save yourself a lot of time. So make sure you practice using it more, as it is heavily used in the machine learning community and libraries. To demonstrate this, here is the same code you have written, but with more use of NumPy (note that I have not applied all the changes mentioned above in this code):
# machine learning code mostly from https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
import pickle
def pil_image_to_np_array(image):
    '''Takes an image and converts it to a numpy array'''
    # from https://stackoverflow.com/a/45208895
    # all my images are black and white, so I only need one channel
    return np.array(image)[:, :, 0]

def data_to_training_set(data):
    # split the list in the form [(frame 1 image, frame 1 player position), ...] into [[all images], [all player positions]]
    inputs, outputs = zip(*data)
    inputs = [pil_image_to_np_array(image) for image in inputs]
    inputs = np.array(inputs, dtype=np.float32)
    outputs = np.array(outputs, dtype=np.float32)
    return (inputs, outputs)

if __name__ == "__main__":
    # fix random seed for reproducibility
    np.random.seed(7)

    # load data
    # data will be in the form [(frame 1 image, frame 1 player position), (frame 2 image, frame 2 player position), ...]
    with open("position_data1.pkl", "rb") as pickled_data:
        data = pickle.load(pickled_data)
    X, Y = data_to_training_set(data)

    # get the width of the images
    width = X.shape[2]  # == 400

    # convert the player position (a value between 0 and the width of the image) to values between 0 and 1
    Y /= width

    # flatten the image inputs so they can be passed to a neural network
    X = np.reshape(X, (X.shape[0], -1))

    # create model
    model = Sequential()
    # my images are 300 x 400 pixels, so each input will be a flattened array of 120000 gray-scale pixel values
    # keep it super simple by not having any deep learning
    model.add(Dense(1, input_dim=120000, activation='sigmoid'))

    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')

    # Fit the model
    model.fit(X, Y, epochs=15, batch_size=10)

    # see what the model is doing
    predictions = model.predict(X, batch_size=10)
    print(predictions)
I use Python's scikit-learn module to predict some values in a CSV file. I am using RandomForestRegressor to do it. As an example, I have 8 train values and 3 values to predict. Which of the following code variants should I use? Do I give all target values at once (Variant A) or separately (Variant B)?
Variant A:
# Reading CSV file
dataset = genfromtxt(open('Data/for training.csv', 'r'), delimiter=',', dtype='f8')[1:]

# Target values to predict
target = [x[8:11] for x in dataset]

# Training features
train = [x[0:8] for x in dataset]

# Start training
rf = RandomForestRegressor(n_estimators=300, compute_importances=True)
rf.fit(train, target)
Variant B:
# Reading CSV file
dataset = genfromtxt(open('Data/for training.csv', 'r'), delimiter=',', dtype='f8')[1:]

# Target values to predict
target1 = [x[8] for x in dataset]
target2 = [x[9] for x in dataset]
target3 = [x[10] for x in dataset]

# Training features
train = [x[0:8] for x in dataset]

# Start training
rf1 = RandomForestRegressor(n_estimators=300, compute_importances=True)
rf1.fit(train, target1)

rf2 = RandomForestRegressor(n_estimators=300, compute_importances=True)
rf2.fit(train, target2)

rf3 = RandomForestRegressor(n_estimators=300, compute_importances=True)
rf3.fit(train, target3)
Which version is correct?
Thanks in advance!
Both are possible, but do different things.
Fitting one forest on the full three-column target (Variant A) learns a joint model for all entries of y, while fitting three separate forests (Variant B) learns independent models for the different entries of y. If there are meaningful relations between the entries of y that can be learned, the joint model should be more accurate.
As you are training on very little data and don't regularize, I imagine you are simply overfitting in the joint case. I am not entirely sure about the splitting criterion in the regression case, but it takes much longer for a leaf to become "pure" if the label space is three-dimensional than if it is one-dimensional, so you will learn more complex models that are not warranted by the little data you have.
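For illustration only, a small sketch comparing the two set-ups on synthetic data with cross-validation (make_regression and the random_state values are just for this example; the scores are R², averaged over targets in the joint case):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, Y = make_regression(n_samples=200, n_features=8, n_targets=3, random_state=0)

# Variant A: one forest trained jointly on all three targets
joint = RandomForestRegressor(n_estimators=300, random_state=0)
print('joint:', cross_val_score(joint, X, Y, cv=5).mean())

# Variant B: one independent forest per target
per_target = [cross_val_score(RandomForestRegressor(n_estimators=300, random_state=0),
                              X, Y[:, i], cv=5).mean()
              for i in range(Y.shape[1])]
print('independent:', np.mean(per_target))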
"8 train values and 3 values" is probably best expressed as "8 features and 3 target variables" in usual machine learning parlance.
Both variants should work and yield similar predictions, as RandomForestRegressor was made to support multi-output regression.
The predictions won't be exactly the same, as RandomForestRegressor is a non-deterministic algorithm, but on average the predictive quality of both approaches should be the same.
Edit: see Andreas' answer instead.