Number of features of the model must match the input - python

For some reason the features of this dataset are being interpreted as rows: "Model n_features is 16 and input n_features is 18189", where 18189 is the number of rows and 16 is the correct number of features.
The suspect code is here:
for var in cat_cols:
    num = LabelEncoder()
    train[var] = num.fit_transform(train[var].astype('str'))
train['output'] = num.fit_transform(train['output'].astype('str'))
for var in cat_cols:
    num = LabelEncoder()
    test[var] = num.fit_transform(test[var].astype('str'))
test['output'] = num.fit_transform(test['output'].astype('str'))
clf = RandomForestClassifier(n_estimators = 10)
xTrain = train[list(features)].values
yTrain = train["output"].values
xTest = test[list(features)].values
xTest = test["output"].values
clf.fit(xTrain,yTrain)
clfProbs = clf.predict(xTest) # Error happens here.
Anyone got any ideas?
Sample training data csv
tr4,42,"JobCat4","divorced","tertiary","yes",2,"yes","no","unknown",5,"may",0,1,-1,0,"unknown","TypeA"
Sample test data csv
tst2,47,"JobCat3","married","unknown","no",1506,"yes","no","unknown",5,"may",0,1,-1,0,"unknown",?

You have a small typo - you created the variable xTest and then immediately overwrite it with something incorrect. Change the offending lines to:
xTest = test[list(features)].values
yTest = test["output"].values
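Putting the fix in context, the tail of the script might then look like this (a sketch reusing the names from the question, with the prediction run on the corrected feature matrix):
xTrain = train[list(features)].values
yTrain = train["output"].values
xTest = test[list(features)].values  # 16 feature columns, matching the model
yTest = test["output"].values        # target kept in its own variable
clf = RandomForestClassifier(n_estimators=10)
clf.fit(xTrain, yTrain)
clfProbs = clf.predict(xTest)        # no shape mismatch: predict sees 16 features, not 18189 rows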

Create train and test with lags of multiple features

I have a classification problem for which I want to create a train and test dataframe with 21 lags of multiple features (X-variables). I already have an easy way to do this with only one feature but I don't know how to adjust this code if I want to use more variables (e.g. df['ETHLogReturn']).
The code I have for one variable is:
Ntest = 252
train = df.iloc[:-Ntest]
test = df.iloc[-Ntest:]
# Create data ready for machine learning algorithm
series = df['BTCLogReturn'].to_numpy()[1:] # first change is NaN
# Did the price go up or down?
target = (series > 0) * 1
T = 21 # 21 lags
X = []
Y = []
for t in range(len(series)-T):
    x = series[t:t+T]
    X.append(x)
    y = target[t+T]
    Y.append(y)
X = np.array(X).reshape(-1,T)
Y = np.array(Y)
N = len(X)
print("X.shape", X.shape, "Y.shape", Y.shape)
# output --> X.shape (8492, 21) Y.shape (8492,)
Then I create my train and test datasets like this:
Xtrain, Ytrain = X[:-Ntest], Y[:-Ntest]
Xtest, Ytest = X[-Ntest:], Y[-Ntest:]
# example of model:
lr = LogisticRegression()
lr.fit(Xtrain, Ytrain)
print(lr.score(Xtrain, Ytrain))
print(lr.score(Xtest, Ytest))
Does anyone have a suggestion on how to adjust this code for a model with lagged variables from multiple columns? Like:
df[['BTCLogReturn','ETHLogReturn']]
Many thanks for your help!
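For what it's worth, here is one way the loop could be extended to several columns (a sketch, not from the original post; the column list is an assumption and the target is still derived from BTCLogReturn):
cols = ['BTCLogReturn', 'ETHLogReturn']        # hypothetical feature columns
series = df[cols].to_numpy()[1:]               # shape (N, n_features); first change is NaN
target = (df['BTCLogReturn'].to_numpy()[1:] > 0) * 1
T = 21                                         # 21 lags per feature
X = []
Y = []
for t in range(len(series) - T):
    X.append(series[t:t+T].flatten())          # T lags of every column, concatenated per sample
    Y.append(target[t+T])
X = np.array(X).reshape(-1, T * len(cols))     # (n_samples, 21 * n_features)
Y = np.array(Y)
The train/test split and the LogisticRegression part then work exactly as in the single-feature version.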

Use RobustScaler in a LSTM with Keras

I want to scale time series data with outliers and use it in an LSTM model with Keras.
My code for the scaling is:
# Train Data
scaler = RobustScaler().fit(train)
train = pd.DataFrame(scaler.fit_transform(train))
train = train.values
# Test Data
test = pd.DataFrame(scaler.transform(test))
test = test.values
Afterwards, I put the data into 3D format for Keras:
# split a multivariate sequence into samples
def split_sequences(sequences, n_steps):
    X, y = list(), list()
    for i in range(len(sequences)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the dataset
        if end_ix > len(sequences)-1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequences[i:end_ix, :], sequences[end_ix, :12]
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X), np.array(y)
# choose a number of time steps
n_steps = 30
# convert into train input/output
X_trai, y_trai = split_sequences(train, n_steps)
print(X_trai.shape, y_trai.shape)
# convert into test input/output
X_test, y_test = split_sequences(test, n_steps)
print(X_test.shape, y_test.shape)
The training and prediction work well; however, I am not able to inverse transform the predicted y data of the test dataset.
My questions:
Is the above scaling method correct?
If yes, how can I regain the original scale of my y_hat predictions to compare them with the original y test data?
Thank you!
I am pretty new to this, but this is the way I've been able to do it successfully:
1. Create a transformer for the various features
2. Create a transformer for the target (i.e., the thing you are predicting)
3. Transform the data
4. Create the dataset
5. Create and run the model
6. Use the "inverse_transform" function to bring the data back to its original form
Here is an example of what I am trying to describe, below.
Note: this does cause a "SettingWithCopyWarning" that I still need to resolve, as in:
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
iloc._setitem_with_indexer(indexer, value, self.name)"
from sklearn.preprocessing import RobustScaler
f_columns = ['A', 'B', 'C', 'D']
f_transformer = RobustScaler()
target_transformer = RobustScaler()
f_transformer = f_transformer.fit(train[f_columns].to_numpy())
target_transformer = target_transformer.fit(train[['target']])
train.loc[:, f_columns] = f_transformer.transform(train[f_columns].to_numpy()).copy()
train['target'] = target_transformer.transform(train[['target']]).copy()
# Steps 4 and 5 (create the dataset, create and run the model) go here; then step 6:
y_trai_inv = target_transformer.inverse_transform(y_trai.reshape(1, -1))
y_test_inv = target_transformer.inverse_transform(y_test.reshape(1, -1))
y_pred_inv = target_transformer.inverse_transform(y_pred)
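For completeness, steps 4 and 5 from the list above are not shown in the snippet; a minimal Keras sketch of them, built on the split_sequences output from the question, could look roughly like this (layer sizes and training settings are assumptions, not from the original answer):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_features = X_trai.shape[2]                 # split_sequences returns (samples, n_steps, features)
model = Sequential([
    LSTM(64, input_shape=(n_steps, n_features)),
    Dense(y_trai.shape[1])                   # one output per target column
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_trai, y_trai, epochs=10, batch_size=32, validation_data=(X_test, y_test))
y_pred = model.predict(X_test)               # then apply step 6 (inverse_transform) as above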

UnboundLocalError: local variable 'X_train' referenced before assignment

def CreateData(self, n_samples, seed_in=5,
               train_prop=0.9, bound_limit=6., n_std_devs=1.96, **kwargs):
    np.random.seed(seed_in)
    scale_c = 1.0  # default
    shift_c = 1.0
    # for ideal boundary
    X_ideal = np.linspace(start=-bound_limit, stop=bound_limit, num=50000)
    y_ideal_U = np.ones_like(X_ideal) + 1.  # default
    y_ideal_L = np.ones_like(X_ideal) - 1.
    y_ideal_mean = np.ones_like(X_ideal) + 0.5
    if self.type_in[:1] == '~':
        if self.type_in == "~boston":
            path = 'boston_housing_data.csv'
            data = np.loadtxt(path, skiprows=0)
        elif self.type_in == "~concrete":
            path = 'Concrete_Data.csv'
            data = np.loadtxt(path, delimiter=',', skiprows=1)
        elif self.type_in == "~wind":
            path = '/content/Deep_Learning_Prediction_Intervals/code/canada_CSV.csv'
            data = np.loadtxt(path, delimiter=',', skiprows=1, usecols=(1, 2))  ## CHECK WHETHER TO HAVE LOADTXT OR ANYTHING ELSE PARUL
        # work out normalisation constants (needed when unnormalising later)
        scale_c = np.std(data[:, -1])
        shift_c = np.mean(data[:, -1])
        # normalise data for ALL COLUMNS
        for i in range(0, data.shape[1]):  ## i reads the columns one by one
            # avoid zero variance features (exist one or two)
            # nonlocal X_train, y_train, X_val, y_val  ## ADDED BY PARUL
            sdev_norm = np.std(data[:, i])
            sdev_norm = 0.001 if sdev_norm == 0 else sdev_norm
            data[:, i] = (data[:, i] - np.mean(data[:, i])) / sdev_norm
        # split into train/test
        perm = np.random.permutation(data.shape[0])  ## shuffle all the rows
        train_size = int(round(train_prop * data.shape[0]))
        train = data[perm[:train_size], :]
        test = data[perm[train_size:], :]
        y_train = train[:, -1].reshape(-1, 1)  ## LAST COLUMN IS CONSIDERED AS THE TARGET AND RESHAPED TO (-1, 1)
        X_train = train[:, :-1]  ## INPUTS ARE ALL EXCEPT LAST COLUMN
        y_val = test[:, -1].reshape(-1, 1)
        X_val = test[:, :-1]
    # save important stuff
    self.X_train = X_train
    self.y_train = y_train
    self.X_val = X_val
    self.y_val = y_val
    self.X_ideal = X_ideal
    self.y_ideal_U = y_ideal_U
    self.y_ideal_L = y_ideal_L
    self.y_ideal_mean = y_ideal_mean
    self.scale_c = scale_c
    self.shift_c = shift_c
    return X_train, y_train, X_val, y_val
This gives me the error
'UnboundLocalError: local variable 'X_train' referenced before assignment'
Any help will be appreciated; I am stuck at this point. I have tried initialising X_train with X_train = [] and also tried making the variables global, but that didn't help. Please help me so that I can move forward.
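As background, the error pattern itself can be reproduced in a few lines (a hypothetical minimal example, not taken from the code above): a local name that is only assigned inside a conditional branch cannot be returned when that branch is skipped.
def create_data(type_in):
    if type_in[:1] == '~':
        X_train = [1, 2, 3]   # only assigned when the condition holds
    return X_train            # UnboundLocalError if the branch was skipped

create_data('boston')         # raises: local variable 'X_train' referenced before assignment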

lightGBM predicts same value

I have one problem concerning lgb. When I write
lgb.train(.......)
it finishes in less than a millisecond (for a dataset of shape (10000, 25)),
and when I call predict, all of the outputs have the same value.
train = pd.read_csv('data/train.csv', dtype = dtypes)
test = pd.read_csv('data/test.csv')
test.head()
X = train.iloc[:10000, 3:-1].values
y = train.iloc[:10000, -1].values
sc = StandardScaler()
X = sc.fit_transform(X)
#pca = PCA(0.95)
#X = pca.fit_transform(X)
d_train = lgb.Dataset(X, label=y)
params = {}
params['learning_rate'] = 0.003
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'binary_logloss'
params['sub_feature'] = 0.5
params['num_leaves'] = 10
params['min_data'] = 50
params['max_depth'] = 10
num_round = 10
clf = lgb.train(params, d_train, num_round, verbose_eval=1000)
X_test = sc.transform(test.iloc[:100,3:].values)
pred = clf.predict(X_test, num_iteration = clf.best_iteration)
When I print pred, all the values are 0.49.
It's my first time using the lightgbm module. Do I have an error in the code, or should I look for some mismatch in the dataset?
Your num_round is too small: the model just starts to learn and then stops. Other than that, make verbose_eval smaller so you can see the evaluation results during training. My suggestion is to try the lgb.train call below:
clf = lgb.train(params, d_train, num_boost_round=5000, verbose_eval=10, early_stopping_rounds = 3500)
Always use early_stopping_rounds, since the model should stop if there is no evident learning or if it starts to overfit.
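Note that early stopping only kicks in when a validation set is passed via valid_sets; a sketch of what that might look like is below (the 80/20 split and the stopping patience are assumptions, and recent LightGBM versions expect callbacks rather than the early_stopping_rounds / verbose_eval keyword arguments):
from sklearn.model_selection import train_test_split

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)
d_train = lgb.Dataset(X_tr, label=y_tr)
d_valid = lgb.Dataset(X_va, label=y_va, reference=d_train)
clf = lgb.train(params, d_train,
                num_boost_round=5000,
                valid_sets=[d_valid],
                callbacks=[lgb.early_stopping(stopping_rounds=100),  # stop when validation loss stops improving
                           lgb.log_evaluation(period=10)])           # print evaluation results every 10 rounds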
Do not hesitate to ask more. Have fun.

Keras: How to expand validation_split to generate a third set i.e. test set?

I am using Keras with a TensorFlow backend. I am using the ImageDataGenerator with the validation_split argument to split my data into a train set and a validation set. As such, I use flow_from_directory with subset set to "training" and "validation", like so:
data_generator = ImageDataGenerator(validation_split=0.3)
train_gen = data_generator.flow_from_directory(my_dir, target_size=(input_size, input_size), shuffle=False, seed=13,
                                               class_mode='categorical', batch_size=BATCH_SIZE, subset="training")
valid_gen = data_generator.flow_from_directory(my_dir, target_size=(input_size, input_size), shuffle=False, seed=13,
                                               class_mode='categorical', batch_size=32, subset="validation")
This is amazingly convenient, as it allows me to use only one directory instead of two (one for training and one for validation). Now I wonder if it is possible to expand this process in order to generate a third set, i.e. a test set?
This is not possible out of the box. You should be able to do it with some minor modifications to the source code of ImageDataGenerator:
if subset is not None:
    if subset not in {'training', 'validation'}:  # add a third subset here
        raise ValueError('Invalid subset name:', subset,
                         '; expected "training" or "validation".')  # adjust message
    split_idx = int(len(x) * image_data_generator._validation_split)
    # you'll need two split indices here
    if subset == 'validation':
        x = x[:split_idx]
        x_misc = [np.asarray(xx[:split_idx]) for xx in x_misc]
        if y is not None:
            y = y[:split_idx]
    elif subset == '...':  # add extra case here
        ...
    else:
        x = x[split_idx:]
        x_misc = [np.asarray(xx[split_idx:]) for xx in x_misc]  # change slicing
        if y is not None:
            y = y[split_idx:]  # change slicing
Edit: this is how you could modify the code:
if subset is not None:
    if subset not in {'training', 'validation', 'test'}:
        raise ValueError('Invalid subset name:', subset,
                         '; expected "training", "validation" or "test".')
    split_idxs = [int(len(x) * v) for v in image_data_generator._validation_split]
    if subset == 'validation':
        x = x[:split_idxs[0]]
        x_misc = [np.asarray(xx[:split_idxs[0]]) for xx in x_misc]
        if y is not None:
            y = y[:split_idxs[0]]
    elif subset == 'test':
        x = x[split_idxs[0]:split_idxs[1]]
        x_misc = [np.asarray(xx[split_idxs[0]:split_idxs[1]]) for xx in x_misc]
        if y is not None:
            y = y[split_idxs[0]:split_idxs[1]]
    else:
        x = x[split_idxs[1]:]
        x_misc = [np.asarray(xx[split_idxs[1]:]) for xx in x_misc]
        if y is not None:
            y = y[split_idxs[1]:]
Basically, validation_split is now expected to be a tuple of two floats instead of a single float. The validation data will be the fraction of the data between 0 and validation_split[0], the test data the fraction between validation_split[0] and validation_split[1], and the training data the fraction between validation_split[1] and 1. This is how you can use it:
import keras
# keras_custom_preprocessing is how I named my directory
from keras_custom_preprocessing.image import ImageDataGenerator
generator = ImageDataGenerator(validation_split=(0.1, 0.5))
# First 10%: validation data - next 40% test data - rest: training data
gen = generator.flow_from_directory(directory='./data/', subset='test')
# Finds 40% of the images in the dir
You will need to modify two or three additional lines in the file (there is a type check you will have to change), but that's it and it should work. I have the modified file; let me know if you are interested and I can host it on my GitHub.
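Assuming the modification above, requesting all three subsets from a single directory could then look like this (the split values and directory are just the ones from the example):
generator = ImageDataGenerator(validation_split=(0.1, 0.5))
valid_gen = generator.flow_from_directory(directory='./data/', subset='validation')  # first 10% of the images
test_gen = generator.flow_from_directory(directory='./data/', subset='test')         # next 40%
train_gen = generator.flow_from_directory(directory='./data/', subset='training')    # remaining 50%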
