I have a classification problem for which I want to create a train and test dataframe with 21 lags of multiple features (X-variables). I already have an easy way to do this with only one feature but I don't know how to adjust this code if I want to use more variables (e.g. df['ETHLogReturn']).
The code I have for one variable is:
import numpy as np
from sklearn.linear_model import LogisticRegression

Ntest = 252
train = df.iloc[:-Ntest]
test = df.iloc[-Ntest:]
# Create data ready for machine learning algorithm
series = df['BTCLogReturn'].to_numpy()[1:] # first change is NaN
# Did the price go up or down?
target = (series > 0) * 1
T = 21 # 21 Lags
X = []
Y = []
for t in range(len(series)-T):
    x = series[t:t+T]
    X.append(x)
    y = target[t+T]
    Y.append(y)
X = np.array(X).reshape(-1,T)
Y = np.array(Y)
N = len(X)
print("X.shape", X.shape, "Y.shape", Y.shape)
#output --> X.shape (8492, 21) Y.shape (8492,)
Then I create my train and test datasets like this:
Xtrain, Ytrain = X[:-Ntest], Y[:-Ntest]
Xtest, Ytest = X[-Ntest:], Y[-Ntest:]
# example of model:
lr = LogisticRegression()
lr.fit(Xtrain, Ytrain)
print(lr.score(Xtrain, Ytrain))
print(lr.score(Xtest, Ytest))
Does anyone have a suggestion how to adjust this code for a model with lagging variables of multiple columns? Like:
df[['BTCLogReturn','ETHLogReturn']]
Many thanks for your help!
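One possible way to extend the loop above to several columns is to slice a 2-D array of features instead of a 1-D series and to flatten each window into a single row. The sketch below is only an illustration under assumptions: the helper name make_lagged_features is hypothetical, the target is still the sign of BTCLogReturn, the first row is dropped because the first log return is NaN (as in the original code), and the demo data is random.

```python
import numpy as np
import pandas as pd


def make_lagged_features(df, feature_cols, target_col, T=21):
    """Build flattened windows of T lags for several columns (hypothetical helper)."""
    # Drop the first row, where the log returns are NaN.
    values = df[feature_cols].to_numpy()[1:]           # shape (n, n_features)
    target = (df[target_col].to_numpy()[1:] > 0) * 1   # 1 if the return was positive

    X, Y = [], []
    for t in range(len(values) - T):
        # Window of T consecutive rows, flattened to length T * n_features.
        X.append(values[t:t + T].ravel())
        # Label: direction of the target column right after the window.
        Y.append(target[t + T])
    return np.array(X), np.array(Y)


# Small synthetic example (random numbers, not real returns).
rng = np.random.default_rng(0)
cols = ["BTCLogReturn", "ETHLogReturn"]
demo = pd.DataFrame(rng.normal(size=(100, 2)), columns=cols)
demo.iloc[0] = np.nan  # mimic the NaN produced by the first difference

X, Y = make_lagged_features(demo, cols, "BTCLogReturn", T=21)
print("X.shape", X.shape, "Y.shape", Y.shape)
```

With two columns and T = 21, each row of X has 42 entries, ordered time step by time step. The train/test split with X[:-Ntest] and X[-Ntest:] should then work unchanged.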
I want to scale time series data with outliers and use it in an LSTM model with Keras.
My code for the scaling is:
# Train Data
scaler = RobustScaler().fit(train)
train = pd.DataFrame(scaler.fit_transform(train))
train = train.values
# Test Data
test = pd.DataFrame(scaler.transform(test))
test = test.values
Afterwards, I put the data into 3D format for Keras:
# split a multivariate sequence into samples
def split_sequences(sequences, n_steps):
    X, y = list(), list()
    for i in range(len(sequences)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the dataset
        if end_ix > len(sequences)-1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequences[i:end_ix, :], sequences[end_ix, :12]
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X), np.array(y)
# choose a number of time steps
n_steps = 30
# convert into train input/output
X_trai, y_trai = split_sequences(train, n_steps)
print(X_trai.shape, y_trai.shape)
# convert into test input/output
X_test, y_test = split_sequences(test, n_steps)
print(X_test.shape, y_test.shape)
Training and prediction work well, but I am not able to inverse-transform the predicted y values of the test dataset.
My questions:
Is the above scaling method correct?
If yes, how can I recover the original scale of my y_hat predictions so that I can compare them with the original y test data?
Thank you!
I am pretty new to this, but this is the way I've been able to do it successfully:
1. Create a transformer for the various features
2. Create a transformer for the target (i.e., the thing you are predicting)
3. Transform the data
4. Create the dataset
5. Create and run the model
6. Use the "inverse_transform" function to bring the data back to its original form
Here is an example of what I am trying to describe.
Note: this does trigger a pandas "SettingWithCopyWarning" that I still need to resolve: "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
iloc._setitem_with_indexer(indexer, value, self.name)"
from sklearn.preprocessing import RobustScaler
f_columns = ['A',
'B',
'C',
'D'
]
f_transformer = RobustScaler()
target_transformer = RobustScaler()
f_transformer = f_transformer.fit(train[f_columns].to_numpy())
target_transformer = target_transformer.fit(train[['target']])
train.loc[:, f_columns] = f_transformer.transform(train[f_columns].to_numpy()).copy()
train['target'] = target_transformer.transform(train[['target']]).copy()
#Steps 4 and 5
y_trai_inv = target_transformer.inverse_transform(y_trai.reshape(1, -1))
y_test_inv = target_transformer.inverse_transform(y_test.reshape(1, -1))
y_pred_inv = target_transformer.inverse_transform(y_pred)
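The recipe above, with one scaler for the features and a separate one for the target, can be sketched end to end as follows. This is a minimal illustration on synthetic data rather than the answerer's full code: the column values, the random data, and the stand-in y_pred (no model is trained here) are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Synthetic training frame (made-up numbers).
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "A": rng.normal(100, 20, size=200),
    "B": rng.normal(-5, 1, size=200),
    "target": rng.normal(10_000, 500, size=200),
})
f_columns = ["A", "B"]

# Separate scalers: one for the features, one for the target column.
f_transformer = RobustScaler().fit(train[f_columns])
target_transformer = RobustScaler().fit(train[["target"]])

# Work on an explicit copy so the assignments do not touch a view.
train_scaled = train.copy()
train_scaled[f_columns] = f_transformer.transform(train[f_columns])
train_scaled["target"] = target_transformer.transform(train[["target"]]).ravel()

# Stand-in for model output: values that live in the scaled target space.
y_pred = train_scaled["target"].to_numpy()[:5]

# inverse_transform expects a 2-D array with one column, shape (n_samples, 1).
y_pred_inv = target_transformer.inverse_transform(y_pred.reshape(-1, 1)).ravel()
print(y_pred_inv)
```

Because the target has its own scaler, the predictions can be mapped back without needing the feature columns. Assigning into an explicit copy of the frame is one common way to avoid the SettingWithCopyWarning mentioned above.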
I'm trying to use TensorFlow in Python to make predictions on cryptocurrency data. The problem is that the predictions come out as numbers between roughly 0.1 and 0.9, whereas the cryptocurrency prices are in the 10000-10100 range, and I can't find a way to convert the 0.* numbers back to real prices.
I tried to build a ratio by computing max - min of the predicted values and max - min of the test data and dividing one by the other, but when I multiply the predictions by this ratio the error is large (I get about 14000 instead of about 10000).
Here is some code:
train_start = 0
train_end = int(np.floor(0.7*n))
test_start = train_end
test_end = n
data_train = data[np.arange(train_start, train_end), :]
data_test = data[np.arange(test_start, test_end), :]
Scale data:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_train = scaler.fit_transform(data_train)
data_test = scaler.transform(data_test)
Build X and y:
X_train = data_train[:, 1:]
y_train = data_train[:, 0]
X_test = data_test[:, 1:]
y_test = data_test[:, 0]
.
.
.
n_data = 10
n_neurons_1 = 1024
n_neurons_2 = 512
n_neurons_3 = 256
n_neurons_4 = 128
n_target = 1
X = tf.compat.v1.placeholder(dtype=tf.compat.v1.float32, shape=[None, n_data])
Y = tf.compat.v1.placeholder(dtype=tf.compat.v1.float32, shape=[None])
Hidden layer
..
Output layer (must be transposed)
..
Cost function
..
Optimizer
..
Make Session:
sess = tf.compat.v1.Session()
Run initializer:
sess.run(tf.compat.v1.global_variables_initializer())
Setup interactive plot:
plt.ion()
fig = plt.figure()
ax1 = fig.add_subplot(111)
line1, = ax1.plot(y_test)
line2, = ax1.plot(y_test*0.5)
plt.show()
epochs = 10
batch_size = 256
for e in range(epochs):
    # Shuffle training data
    shuffle_indices = np.random.permutation(np.arange(len(y_train)))
    X_train = X_train[shuffle_indices]
    y_train = y_train[shuffle_indices]
    # Minibatch training
    for i in range(0, len(y_train) // batch_size):
        start = i * batch_size
        batch_x = X_train[start:start + batch_size]
        batch_y = y_train[start:start + batch_size]
        # Run optimizer with batch
        sess.run(opt, feed_dict={X: batch_x, Y: batch_y})
        # Show progress
        if np.mod(i, 5) == 0:
            # Prediction
            pred = sess.run(out, feed_dict={X: X_test})
            # This pred var is the output of the prediction
I persist my results to a file, and this is what it looks like:
2019-08-21 06-AM;15310.444858356934;0.50021994;
2019-08-21 12-PM;14287.717187390663;0.46680558;
2019-08-21 06-PM;14104.63871795706;0.46082407;
For example, the last prediction is 0.46, but when I try to convert it I get 14104, whereas it should be closer to 10000.
Does anyone have an idea how to convert those predictions?
Thanks!
You will have to use the inverse_transform method of MinMaxScaler to convert the output, which is in the 0-1 range, back to the original scale.
You have not shown your model, but I assume it is a regression network with a few dense layers. You will have to keep minimizing your loss: with mean squared error, the larger the loss, the more likely it is that your output is far from the desired values.
If the loss is small and the results are good on the training samples but the predictions are bad on the test dataset, consider enlarging your training dataset so that more cases are covered. If that is not possible, consider reducing the number of neurons in your network to curb over-fitting.
You can also post-process the output to restrict it to a desired range.
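In the question the scaler was fitted on all columns at once, while the network predicts only the first column, and inverse_transform expects an array with the same number of columns the scaler was fitted on. One way around this, sketched below under the assumption of a default MinMaxScaler (feature_range (0, 1)) fitted on the training data, is to undo the scaling for column 0 by hand. The arrays and the helper name inverse_minmax_column are made up for illustration; with a fitted scikit-learn scaler the same per-column values are available as scaler.data_min_ and scaler.data_range_.

```python
import numpy as np


def inverse_minmax_column(scaled, col_min, col_max):
    """Undo x_scaled = (x - min) / (max - min) for a single column."""
    return np.asarray(scaled) * (col_max - col_min) + col_min


# Made-up training data: column 0 is the price, the rest are features.
data_train = np.array([
    [10000.0, 1.0, 5.0],
    [10050.0, 2.0, 7.0],
    [10100.0, 3.0, 6.0],
])

# Min and max of the target column, taken from the training data only.
col_min = data_train[:, 0].min()
col_max = data_train[:, 0].max()

# Stand-in predictions in the scaled 0-1 space.
pred_scaled = np.array([0.0, 0.46, 1.0])
pred_price = inverse_minmax_column(pred_scaled, col_min, col_max)
print(pred_price)
```

The min and max must be those of the training data used to fit the scaler. Taking them from the predictions or from the test set, as in the ratio attempt described in the question, gives a different mapping.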
I have a problem with LightGBM (lgb). When I write
lgb.train(.......)
it finishes in less than a millisecond (for a dataset of shape (10 000, 25)),
and when I call predict, all the output values are the same.
train = pd.read_csv('data/train.csv', dtype = dtypes)
test = pd.read_csv('data/test.csv')
test.head()
X = train.iloc[:10000, 3:-1].values
y = train.iloc[:10000, -1].values
sc = StandardScaler()
X = sc.fit_transform(X)
#pca = PCA(0.95)
#X = pca.fit_transform(X)
d_train = lgb.Dataset(X, label=y)
params = {}
params['learning_rate'] = 0.003
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'binary_logloss'
params['sub_feature'] = 0.5
params['num_leaves'] = 10
params['min_data'] = 50
params['max_depth'] = 10
num_round = 10
clf = lgb.train(params, d_train, num_round, verbose_eval=1000)
X_test = sc.transform(test.iloc[:100,3:].values)
pred = clf.predict(X_test, num_iteration = clf.best_iteration)
When I print pred, all the values are 0.49.
It's my first time using the lightgbm module. Is there an error in my code, or should I look for mismatches in the dataset?
Your num_round is too small: the model just starts to learn and then stops. Also, make verbose_eval smaller so that you can follow the results during training. I suggest trying lgb.train as below:
clf = lgb.train(params, d_train, num_boost_round=5000, verbose_eval=10, early_stopping_rounds = 3500)
Always use early_stopping_rounds, since training should stop if there is no evident learning or the model starts to overfit. Note that early stopping needs at least one validation set, passed through valid_sets, to monitor.
Do not hesitate to ask more. Have fun.
I am using Keras with a TensorFlow backend. I am using the ImageDataGenerator with the validation_split argument to split my data into a training set and a validation set. To do so, I use flow_from_directory with subset set to "training" and "validation" like so:
data_generator = ImageDataGenerator(validation_split=0.3)
train_gen = data_generator.flow_from_directory(my_dir, target_size=(input_size, input_size), shuffle=False, seed=13,
class_mode='categorical', batch_size=BATCH_SIZE, subset="training")
valid_gen = data_generator.flow_from_directory(my_dir, target_size=(input_size, input_size), shuffle=False, seed=13,
class_mode='categorical', batch_size=32, subset="validation")
This is amazingly convenient, as it allows me to use only one directory instead of two (one for training and one for validation). Now I wonder whether it is possible to extend this process to generate a third set, i.e. a test set?
This is not possible out of the box. You should be able to do it with some minor modifications to the source code of ImageDataGenerator:
if subset is not None:
    if subset not in {'training', 'validation'}: # add a third subset here
        raise ValueError('Invalid subset name:', subset,
                         '; expected "training" or "validation".') # adjust message
    split_idx = int(len(x) * image_data_generator._validation_split)
    # you'll need two split indices here
    if subset == 'validation':
        x = x[:split_idx]
        x_misc = [np.asarray(xx[:split_idx]) for xx in x_misc]
        if y is not None:
            y = y[:split_idx]
    elif subset == '...' # add extra case here
    else:
        x = x[split_idx:]
        x_misc = [np.asarray(xx[split_idx:]) for xx in x_misc] # change slicing
        if y is not None:
            y = y[split_idx:] # change slicing
Edit: this is how you could modify the code:
if subset is not None:
    if subset not in {'training', 'validation', 'test'}:
        raise ValueError('Invalid subset name:', subset,
                         '; expected "training" or "validation" or "test".')
    # a tuple (not a generator) so that the indices can be subscripted below
    split_idxs = tuple(int(len(x) * v) for v in image_data_generator._validation_split)
    if subset == 'validation':
        x = x[:split_idxs[0]]
        x_misc = [np.asarray(xx[:split_idxs[0]]) for xx in x_misc]
        if y is not None:
            y = y[:split_idxs[0]]
    elif subset == 'test':
        x = x[split_idxs[0]:split_idxs[1]]
        x_misc = [np.asarray(xx[split_idxs[0]:split_idxs[1]]) for xx in x_misc]
        if y is not None:
            y = y[split_idxs[0]:split_idxs[1]]
    else:
        x = x[split_idxs[1]:]
        x_misc = [np.asarray(xx[split_idxs[1]:]) for xx in x_misc]
        if y is not None:
            y = y[split_idxs[1]:]
Basically, validation_split is now expected to be a tuple of two floats instead of a single float. The validation data will be the fraction of data between 0 and validation_split[0], test data between validation_split[0] and validation_split[1] and training data between validation_split[1] and 1. This is how you can use it:
import keras
# keras_custom_preprocessing is how i named my directory
from keras_custom_preprocessing.image import ImageDataGenerator
generator = ImageDataGenerator(validation_split=(0.1, 0.5))
# First 10%: validation data - next 40% test data - rest: training data
gen = generator.flow_from_directory(directory='./data/', subset='test')
# Finds 40% of the images in the dir
You will need to modify the file in two or three additional lines (there is a typecheck you will have to change), but that's it and that should work. I have the modified file, let me know if you are interested, I can host it on my github.
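The index arithmetic behind this modification can be tried in isolation. The sketch below is not the Keras source; it only mirrors the slicing described above on a plain list, and the function name three_way_split is hypothetical.

```python
def three_way_split(items, validation_split=(0.1, 0.5)):
    """Split a sequence into validation, test and training parts.

    validation_split holds two cumulative fractions: everything before the
    first goes to validation, everything between the two goes to test, and
    the rest goes to training.
    """
    n = len(items)
    first, second = (int(n * v) for v in validation_split)
    return {
        "validation": items[:first],
        "test": items[first:second],
        "training": items[second:],
    }


parts = three_way_split(list(range(10)), validation_split=(0.1, 0.5))
for name, chunk in parts.items():
    print(name, chunk)
```

With ten items and the fractions (0.1, 0.5), one item lands in validation, four in test and five in training, which matches the 10% / 40% / 50% description above.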
For some reason the rows of this dataset are being interpreted as features: "Model n_features is 16 and input n_features is 18189", where 18189 is the number of rows and 16 is the correct number of features.
The suspect code is here:
for var in cat_cols:
    num = LabelEncoder()
    train[var] = num.fit_transform(train[var].astype('str'))
train['output'] = num.fit_transform(train['output'].astype('str'))
for var in cat_cols:
    num = LabelEncoder()
    test[var] = num.fit_transform(test[var].astype('str'))
test['output'] = num.fit_transform(test['output'].astype('str'))
clf = RandomForestClassifier(n_estimators = 10)
xTrain = train[list(features)].values
yTrain = train["output"].values
xTest = test[list(features)].values
xTest = test["output"].values
clf.fit(xTrain,yTrain)
clfProbs = clf.predict(xTest)#Error happens here.
Anyone got any ideas?
Sample training data csv
tr4,42,"JobCat4","divorced","tertiary","yes",2,"yes","no","unknown",5,"may",0,1,-1,0,"unknown","TypeA"
Sample test data csv
tst2,47,"JobCat3","married","unknown","no",1506,"yes","no","unknown",5,"may",0,1,-1,0,"unknown",?
You have a small typo: you create the variable xTest and then immediately overwrite it with the output column. Change the offending lines to:
xTest = test[list(features)].values
yTest = test["output"].values
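To see why the error message reports the number of rows as the number of features, it can help to compare the shapes of the two arrays. The snippet below uses a tiny made-up frame; the column names and values are assumptions, not the asker's data.

```python
import pandas as pd

features = ["age", "balance"]  # hypothetical feature list
test = pd.DataFrame({
    "age": [47, 35, 52],
    "balance": [1506, 200, 90],
    "output": [0, 1, 0],
})

xTest = test[list(features)].values  # 2-D: (n_rows, n_features)
yTest = test["output"].values        # 1-D: (n_rows,)

print(xTest.shape)
print(yTest.shape)
```

The feature matrix has one column per feature, whereas the label array has a single dimension of length n_rows. Passing the label array to predict in place of the feature matrix is consistent with the reported mismatch.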