I'm trying to build and train a simple MLP model using keras.Sequential().
However, I'm having issues when, after each training epoch, I try to evaluate the current status of the model on the train and test data.
I'm having this problem on a couple of different datasets; one of them is the "CAR DETAILS FROM CAR DEKHO" dataset, which you can find here.
This is what I'm doing so far:
import numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split

def main():
    ## read, preprocess and split data
    df_data = pd.read_csv('car_data_CAR_DEKHO.csv')
    df_data = pre_process(df_data)
    X_train, y_train, X_test, y_test = split_data(df_data)  ## -> these are PANDAS DATAFRAMES!

    train(X_train, X_test, y_train, y_test)
def train(X_train, X_test, y_train, y_test):
    ##--------------------
    ## building model
    ##--------------------
    batch = 5000
    epochs = 500
    lr = 0.001

    data_iter = load_array((X_train, y_train), batch)

    initializer = tf.initializers.RandomNormal(stddev=0.01)
    net = tf.keras.Sequential()
    net.add(tf.keras.layers.Dense(1, kernel_initializer=initializer))

    loss = tf.keras.losses.MeanSquaredError()
    trainer = tf.keras.optimizers.SGD(learning_rate=lr)

    ##--------------#
    ## training     #
    ##--------------#
    for epoch in range(1, epochs + 1):
        for X_batch, Y_batch in data_iter:
            with tf.GradientTape() as tape:
                l = loss(net(X_batch, training=True), Y_batch)
            grads = tape.gradient(l, net.trainable_variables)
            trainer.apply_gradients(zip(grads, net.trainable_variables))

        # test on train set after epoch
        y_pred_train = net(X_train)  ## ERROR HERE!!!
        l_train = loss(y_pred_train, y_train)

        y_pred_test = net(X_test)
        l_test = loss(y_pred_test, y_test)
def load_array(data_arrays, batch_size, is_train=True):
    """Construct a TensorFlow data iterator."""
    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
    if is_train:
        dataset = dataset.shuffle(buffer_size=1000)
    dataset = dataset.batch(batch_size)
    return dataset
def split_data(df_data):
    X = df_data.copy()
    y = X.pop('selling_price')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    return X_train, y_train, X_test, y_test
def pre_process(df_data):
    ## check NaNs and drop rows if any
    print(df_data.isnull().sum())
    df_data.dropna(inplace=True)

    ## drop weird outlier, turns out it has 1 km_driven
    df_data.drop([1312], inplace=True)

    ## features engineering
    df_data['name'] = df_data['name'].map(lambda x: x.split(' ')[0])
    df_data['owner'] = df_data['owner'].map(lambda x: x.split(' ')[0])
    df_data['selling_price'] = df_data['selling_price']/1000

    df_data_dummies = pd.get_dummies(df_data, drop_first=True)
    df_data_dummies = normalize(df_data_dummies)  ## this is a simple min-max scaling, I do it manually but you can use sklearn or something similar
    return df_data_dummies
def normalize(df):
    print('Data normalization:')
    result = df.copy()
    for feature_name in df.columns:
        if feature_name == 'selling_prize':
            pass
        else:
            max_value = df[feature_name].max()
            min_value = df[feature_name].min()
            result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
            if result[feature_name].isnull().values.any():
                result.drop([feature_name], axis=1, inplace=True)
                print(f'Something wrong in {feature_name}, dropped.')
                print(f'now shape is {len(result)}, {len(result.columns)}')
    print(f'\treturning {len(result)}, {len(result.columns)}')
    return result
and I'm getting the error:
File "/home/lews/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow/python/keras/engine/input_spec.py", line 232, in assert_input_compatibility
ndim = x.shape.rank
AttributeError: 'tuple' object has no attribute 'rank'
I guess the error is due to me passing X_train (which is a dataframe) directly to net.
I also tried wrapping the data the same way as when creating the training batches:
y_pred_train = net(tf.data.Dataset.from_tensor_slices(X_train))
but it returns another error:
File "/home/lews/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow/python/keras/engine/input_spec.py", line 201, in assert_input_compatibility
raise TypeError('Inputs to a layer should be tensors. Got: %s' % (x,))
TypeError: Inputs to a layer should be tensors. Got: <TensorSliceDataset shapes: (19,), types: tf.float64>
Finally, I tried using:
y_pred_train = net.predict(X_train)
The weird thing in this case is that I get an OOM error referring to a tensor with shape [76571,76571]:
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[76571,76571] and type double on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [Op:SquaredDifference]
but the X_train dataframe has shape (76571, 19), so I don't understand what is happening.
What is the correct way to do this?
Your code mostly looks OK; the issue must be with the data that you pass.
Check the content and datatypes of the data that you feed.
Try converting pandas slices into np.arrays, re-check their dimensions and then feed np.arrays to load_array().
Also try smaller batches, like 64 (not 5000).
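For example, a minimal sketch of that conversion (the explicit float32 cast is an assumption on my part, not something stated above):
X_train_np = X_train.to_numpy(dtype='float32')  # plain NumPy array instead of a DataFrame
y_train_np = y_train.to_numpy(dtype='float32')
print(X_train_np.shape, y_train_np.shape)       # re-check dimensions before feeding
data_iter = load_array((X_train_np, y_train_np), 64)  # smaller batch than the original 5000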
UPDATE:
Apparently, when you pass X_batch to the model you pass a tf.Tensor, but later, when you pass the whole X_train or X_test, you pass pd.DataFrames and the model gets confused.
You should change just 2 lines:
y_pred_train = net(tf.constant(X_train)) # pass TF.tensor - best
#alternative:
y_pred_train = net(X_train.values) # pass np.array - also good
y_pred_test = net(tf.constant(X_test)) # make similar change here
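Applied to the evaluation block from the question, the change would look roughly like this (the float32 cast and reshaping y into a column vector are my additions, just to keep dtypes and shapes consistent with the model's (N, 1) output):
y_pred_train = net(tf.constant(X_train.values, dtype=tf.float32))
l_train = loss(y_pred_train, tf.constant(y_train.values.reshape(-1, 1), dtype=tf.float32))

y_pred_test = net(tf.constant(X_test.values, dtype=tf.float32))
l_test = loss(y_pred_test, tf.constant(y_test.values.reshape(-1, 1), dtype=tf.float32))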
The issue looks like it is related to the data (as Poe Dator says). What I believe is going on is that your network has some input shape based on the batches of data it is receiving. Then when you are trying to predict or call your network on the data (this also recomputes shapes since it calls the build() function), it tries to get the data into the shape it expects. I think specifically it is expecting a shape of (batch, 1, 19) but with your data in (76571, 19) it is not finding the correct shape.
A couple of easy steps to work on this would be:
Call net.summary() to see what shapes it believes it is getting, before and after training
Provide the input shape to the first layer, net.add(tf.keras.layers.Dense(1, kernel_initializer=initializer, input_shape=(1, 19)))
Slice your X data in the same shape as your training data.
Add a dimension to your data so it is (76571, 1, 19) to explicitly shape it as well.
Also, as noted above, smaller batch sizes would be best. I would also recommend using Keras's compile()/fit() workflow instead of handling gradients yourself if you don't have a lot of experience with TensorFlow; it saves you code and makes it easier to ensure you are handling your model correctly during training. A sketch follows below.
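A minimal sketch of that compile()/fit() route, assuming the 19 feature columns mentioned in the question (note it uses input_shape=(19,) to match the 2-D (samples, features) data rather than the (1, 19) shape from the list above, and a batch size of 64 as suggested):
net = tf.keras.Sequential([
    tf.keras.layers.Dense(1, kernel_initializer=initializer, input_shape=(19,))
])
net.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
            loss=tf.keras.losses.MeanSquaredError())
net.summary()  # inspect the shapes the model actually expects

# Keras handles the gradient updates and evaluates on the held-out data after every epoch.
history = net.fit(X_train.values, y_train.values,
                  batch_size=64, epochs=epochs,
                  validation_data=(X_test.values, y_test.values))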
Related
I am following TensorFlow's regression tutorial and have created a multivariable linear regression and a deep neural network. However, when I try to collect the test set results in test_results, I get the following error:
ValueError: Exception encountered when calling layer "normalization" (type Normalization).
Dimensions must be equal, but are 7 and 8 for '{{node sequential/normalization/sub}} = Sub[T=DT_FLOAT](sequential/Cast, sequential/normalization/sub/y)' with input shapes: [?,7], [1,8].
Call arguments received by layer "normalization" (type Normalization):
• inputs=tf.Tensor(shape=(None, 7), dtype=float32)
Here is some of the code for the linear regression, starting from splitting the labels. The error appears on the last line, test_results['linear_model'] = linear_model.evaluate(test_features, test_labels, verbose=0). However, I am able to generate the error plots and everything else seems to work fine, so I'm not entirely sure what is going wrong when collecting the test results.
Any help would be much appreciated!
#Split labels
train_features = train_dataset.copy()
test_features = test_dataset.copy()
train_labels = train_features.pop('HCO3')
test_labels = test_features.pop('HCO3')
train_features = np.asarray(train_dataset.copy()).astype('float32')
#print(train_features.tail())
#Normalization
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(np.array(train_features))
first = np.array(train_features[:1])
linear_model = tf.keras.Sequential([
    normalizer,
    layers.Dense(units=1)
])
#Compilation
linear_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error'
)

history = linear_model.fit(
    train_features,
    train_labels,
    epochs=100,
    # Suppress logging.
    verbose=0,
    # Calculate validation results on 20% of the training data.
    validation_split=0.2)
#Track error for later
test_results = {}
test_results['linear_model'] = linear_model.evaluate(test_features, test_labels, verbose = 0)
You lost the outcome column in the dataframe because of pop. Try extracting that column using:
train_labels = train_features['HCO3']
test_labels = test_features['HCO3']
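A minimal sketch of that suggestion in context (note this keeps the label column inside the feature DataFrames, which is what indexing instead of pop() implies):
train_features = train_dataset.copy()
test_features = test_dataset.copy()

# Index the label column instead of pop(), so the feature frames keep all of their columns.
train_labels = train_features['HCO3']
test_labels = test_features['HCO3']

test_results = {}
test_results['linear_model'] = linear_model.evaluate(test_features, test_labels, verbose=0)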
I'm trying to develop a network and use a Python generator as the data provider. Everything looks OK until the model starts to fit; then I receive this error:
ValueError: `y` argument is not supported when using dataset as input.
I checked every line and I think the problem is in the format of x_test and y_test fed to the network. After hours of googling and changing the format several times, the error is still there.
Can you help me to fix it? You can find the whole code below:
import os
import numpy as np
import pandas as pd
import re # To match regular expression for extracting labels
import tensorflow as tf
print(tf.__version__)
def xfiles(filename):
    if re.match(r'^\w{12}_x\.csv$', filename) is None:
        return False
    else:
        return True

def data_generator():
    folder = "i:/Stockpred/csvdbase/datasets/DS0002"
    file_list = os.listdir(folder)
    x_files = list(filter(xfiles, file_list))
    x_files.sort()
    np.random.seed(1729)
    np.random.shuffle(x_files)
    for file in x_files:
        filespec = folder + '/' + file
        xs = pd.read_csv(filespec, header=None)
        yfile = file.replace('_x', '_y')
        yfilespec = folder + '/' + yfile
        ys = pd.read_csv(open(yfilespec, 'r'), header=None, usecols=[1])
        xs = np.asarray(xs, dtype=np.float32)
        ys = np.asarray(ys, dtype=np.float32)
        for i in range(xs.shape[0]):
            yield xs[i][1:169], ys[i][0]

dataset = tf.data.Dataset.from_generator(
    data_generator,
    (tf.float32, tf.float32),
    (tf.TensorShape([168, ]), tf.TensorShape([])))
dataset = dataset.shuffle(buffer_size=16000, seed=1729)
# dataset = dataset.batch(4000, drop_remainder=True)
dataset = dataset.cache('R:/Temp/model')
def is_test(i, d):
    return i % 4 == 0

def is_train(i, d):
    return not is_test(i, d)
recover = lambda i, d: d
test_dataset = dataset.enumerate().filter(is_test).map(recover)
train_dataset = dataset.enumerate().filter(is_train).map(recover)
x_test = test_dataset.map(lambda x, y: x)
y_test = test_dataset.map(lambda x, y: y)
x_train = train_dataset.map(lambda x, y: x)
y_train = train_dataset.map(lambda x, y: y)
print(x_train.element_spec)
print(y_train.element_spec)
print(x_test.element_spec)
print(y_test.element_spec)
# define an object (initializing RNN)
model = tf.keras.models.Sequential()
# first LSTM layer
model.add(tf.keras.layers.LSTM(units=168, activation='relu', return_sequences=True, input_shape=(168, 1)))
# dropout layer
model.add(tf.keras.layers.Dropout(0.2))
# second LSTM layer
model.add(tf.keras.layers.LSTM(units=168, activation='relu', return_sequences=True))
# dropout layer
model.add(tf.keras.layers.Dropout(0.2))
# third LSTM layer
model.add(tf.keras.layers.LSTM(units=80, activation='relu', return_sequences=True))
# dropout layer
model.add(tf.keras.layers.Dropout(0.2))
# fourth LSTM layer
model.add(tf.keras.layers.LSTM(units=120, activation='relu'))
# dropout layer
model.add(tf.keras.layers.Dropout(0.2))
# output layer
model.add(tf.keras.layers.Dense(units=1))
model.summary()
# compile the model
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(x_train.as_numpy_iterator(), y_train.as_numpy_iterator(), batch_size=32, epochs=100)
predicted_stock_price = model.predict(x_test)
Everything looks OK until the model starts to fit, and then I receive this error:
ValueError: `y` argument is not supported when using dataset as input.
Can you help me fix it?
As the docs say:
y - Target data. Like the input data x, it could be either Numpy array(s) or TensorFlow tensor(s). It should be consistent with x (you cannot have Numpy inputs and tensor targets, or inversely). If x is a dataset, generator, or keras.utils.Sequence instance, y should not be specified (since targets will be obtained from x).
So, I suppose you should have one generator serving tuples of sample and label.
If you are providing Dataset as input, then
type(train_dataset) should be tensorflow.python.data.ops.dataset_ops.BatchDataset
if so, simply feed this Dataset (which includes your X and y bundle) into the model,
model.fit(train_dataset, batch_size=32, epochs=100)
(Yes, this is a slightly different convention from sklearn, where X and y are passed separately.)
Meanwhile, if you want TensorFlow to explicitly use a separate dataset for validation, you must use the kwarg:
model.fit(train_dataset, validation_data=val_dataset, batch_size=32, epochs=100)
where val_dataset is a separate dataset you have set aside for validation during model training (not the test set).
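For instance, a minimal sketch under the question's setup (the expand_dims map is my assumption, added so each sample matches the LSTM's (168, 1) input shape; the batch size of 32 comes from the dataset itself, not from a batch_size argument):
train_ds = train_dataset.map(lambda x, y: (tf.expand_dims(x, -1), y)).batch(32)
val_ds = test_dataset.map(lambda x, y: (tf.expand_dims(x, -1), y)).batch(32)

# No separate `y` argument: Keras reads the targets from the (x, y) pairs in the dataset.
model.fit(train_ds, validation_data=val_ds, epochs=100)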
Alternatively, use model.fit_generator and serve tuples (x, y) of input data and labels. So altogether:
model.fit_generator(train_dataset.as_numpy_iterator(), epochs=100)
Suppose I have x_train and y_train, which are arrays where each element is itself a data point in array form (so x_train is indexed as x_train[i][j], and x_train[0] represents the first data point in the training set), and suppose I want to create a simple regression.
So I coded this:
input = tf.placeholder(tf.float32, shape=[len(data[0]),None])
target = tf.placeholder(tf.float32, shape=[len(data[0]),None])
network = tf.layers.Dense(10, tf.keras.activations.relu)(input)
network = tf.layers.BatchNormalization()(network)
network = tf.layers.Dense(10,tf.keras.activations.relu)(network)
network = tf.layers.BatchNormalization()(network)
network = tf.layers.Dense(10,tf.keras.activations.linear)(network)
cost = tf.reduce_mean((target - network)**2)
optimizer = tf.train.AdamOptimizer().minimize(cost)
with tf.Session() as sess:
    for epoch in range(1000):
        _, val = sess.run([optimizer,cost], feed_dict={input: x_train, target: y_train})
        print(val)
But is this correct? I'm not sure if the dimensions for the placeholders even match. When I try to run this code,
I get the error message
ValueError: The last dimension of the inputs to `Dense` should be defined. Found `None`.
So what I tried was to interchange the positions of the dimension sizes for the placeholders, so the changed placeholders were:
input = tf.placeholder(tf.float32, shape=[None,len(data[0])])
target = tf.placeholder(tf.float32, shape=[None,len(data[0])])
But with these, I then get the error message
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value dense/bias
[[{{node dense/bias/read}}]]
I was able to solve the above issue by performing np.expand_dims() on x_train & y_train at axis=0 and initializing the batch norm and network parameters with sess.run(tf.global_variables_initializer()) before optimizing the model.
Note: The presence of None in the first dimension of the shape of placeholder is alright as it allows TensorFlow to train models when batch_size is unknown (the same is true even for other dimensions of placeholder's shape). The error is due to mismatch in input and placeholder dimensions. Your inputs (x_train & y_train) were probably one-dimensional tensors while the placeholders either needed two-dimensional ones or one-dimensional vectors reshaped to two-dimensions.
Please find my below implementation for the same and a matplotlib plot that verifies the implementation:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
data = [[1,2,3,4,5,6,7,8,9,10],
[11,12,13,14,15,16,17,18,19,20]]
x_train = data[0]
y_train = data[1]
x_train = np.expand_dims(x_train, 0)
y_train = np.expand_dims(y_train, 0)
input = tf.placeholder(tf.float32, shape=[None, len(data[0])])
target = tf.placeholder(tf.float32, shape=[None, len(data[1])])
network = tf.layers.Dense(10, tf.keras.activations.relu)(input)
network = tf.layers.BatchNormalization()(network)
network = tf.layers.Dense(10,tf.keras.activations.relu)(network)
network = tf.layers.BatchNormalization()(network)
network = tf.layers.Dense(10,tf.keras.activations.linear)(network)
cost = tf.reduce_mean((target - network)**2)
optimizer = tf.train.AdamOptimizer().minimize(cost)
costs = []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(1000):
        _, val = sess.run([optimizer,cost], feed_dict={input: x_train, target: y_train})
        costs.append(val)
        print(val)
fig, ax = plt.subplots(figsize=(11, 8))
ax.plot(range(1000), costs)
ax.set_title("Costs vs epochs")
ax.set_xlabel("Epoch")
ax.set_ylabel("Cost (MSE)")
Here's the plot of costs vs epochs:
Costs vs Epochs
Additionally, to test the network on new data (say) x_test = [[21,22,23,24,25,26,27,28,29,30]], you could use below code:
y_pred = sess.run(network,feed_dict={input: x_test})
PS: Ensure you use the same TensorFlow Session sess created above to run the inference (unless you are saving and restoring the model checkpoint).
I'm learning how to use pytorch and I was able to get a grasp on the overall process of construction and execution of ML models. However, what I am not able to grasp is how to "format" or "reshape" the data before executing the model. I keep getting errors like:
RuntimeError: size mismatch, m1: [1 x 700], m2: [1 x 1] at c:\programdata\miniconda3\conda-bld\pytorch_1524543037166\work\aten\src\th\generic/THTensorMath.c:2033
Or,
Expected object of type Variable[torch.DoubleTensor] but found type Variable[torch.FloatTensor] for argument #1 ‘mat2’
So, I have a CSV file named "train.csv" with attributes 'x' and 'y', and there are 700 samples in it. I want to perform a simple linear regression on the data, and I parse it using pandas. How do I format or reshape the data so that it executes smoothly, and how does PyTorch iterate through the input data?
The most recent code I executed is:
import torch
import torch.nn as nn
from torch.autograd import Variable
import pandas as pd
class Linear_Reg(nn.Module):
    def __init__(self, inp_sz, out_sz):
        super(Linear_Reg, self).__init__()
        self.linear = nn.Linear(inp_sz, out_sz)

    def forward(self, x):
        out = self.linear(x)
        return out
train = pd.read_csv('C:\\Users\\hgstr\\Jupyter_Files\\Data_Sets\\linear_regression\\train.csv')
test = pd.read_csv('C:\\Users\\hgstr\\Jupyter_Files\\Data_Sets\\linear_regression\\test.csv')
x_train = torch.Tensor(train['x'])
y_train = torch.Tensor(train['y'])
x_test = torch.Tensor(test['x'])
y_test = torch.Tensor(test['y'])
x_train = torch.Tensor(x_train)
x_train = x_train.view(1,-1)
#================================
input_sz = 1;
output_sz = 1
epochs = 60
learning_rate = 0.001
#================================
model = Linear_Reg(input_sz, output_sz)
crit = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), learning_rate)
for e in range(epochs):
    opt.zero_grad()
    out = model(x_train)
    loss = crit(out, y_train)
    loss.backward()
    opt.step()
    print('epoch {}, loss {}'.format(e, loss.data[0]))
And it gave out the following:
RuntimeError: size mismatch, m1: [1 x 700], m2: [1 x 1] at c:\programdata\miniconda3\conda-bld\pytorch_1524543037166\work\aten\src\th\generic/THTensorMath.c:2033
Solutions?
According to the error, I believe that your data is not correctly formatted. The tensor should be in the form [700, 1] (batch x data) and yours is [1, 700] (data x batch). This makes the model 'think' that you are adding only one entry as training, with 700 features, instead of 700 entries with only 1 feature.
Reshaping the x_train variable should make the code work. Just remove the line x_train = x_train.view(1,-1).
Regarding the second error, it can be that after reading the .csv into a variable its type is Double (due to pd.read_csv) while in pytorch by default Tensors are created as floats. I think that casting your input data before feeding it to the model should be enough: model(x_train.float()) or specifying it in the Tensor creation part x_train = torch.FloatTensor(train['x']). Note that you should cast all the Tensors that are not Floats.
Edit: this piece of code works for me:
import torch
import torch.nn as nn
import pandas as pd
class Linear_Reg(nn.Module):
    def __init__(self, inp_sz, out_sz):
        super(Linear_Reg, self).__init__()
        self.linear = nn.Linear(inp_sz, out_sz)

    def forward(self, x):
        out = self.linear(x)
        return out
train = pd.read_csv('yourpath')
test = pd.read_csv('yourpath')
x_train = torch.Tensor(train['x']).to(torch.float).view(700, 1)
y_train = torch.Tensor(train['y']).to(torch.float).view(700, 1)
x_test = torch.Tensor(test['x']).to(torch.float).view(300, 1)
y_test = torch.Tensor(test['y']).to(torch.float).view(300, 1)
# ================================
input_sz = 1;
output_sz = 1
epochs = 60
learning_rate = 0.001
# ================================
model = Linear_Reg(input_sz, output_sz)
crit = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), learning_rate)
for e in range(epochs):
    opt.zero_grad()
    out = model(x_train)
    loss = crit(out, y_train)
    loss.backward()
    opt.step()
    print('epoch {}, loss {}'.format(e, loss.data[0]))
I'm trying to build a Deep Neural Network classifier modeled after the example in the TensorFlow directory. The code of the example is shown here:
def main(unused_argv):
    # Load dataset.
    iris = learn.datasets.load_dataset('iris')
    x_train, x_test, y_train, y_test = cross_validation.train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=42)

    # Build 3 layer DNN with 10, 20, 10 units respectively.
    classifier = learn.DNNClassifier(hidden_units=[10, 20, 10], n_classes=3)

    # Fit and predict.
    classifier.fit(x_train, y_train, steps=200)
    score = metrics.accuracy_score(y_test, classifier.predict(x_test))
    print('Accuracy: {0:f}'.format(score))
I'm doing the exact same thing, except I'm using my data, which has the same structure as the iris data (continuous feature values and discrete 0 or 1 target values). A sample of my data is shown here:
G1 G2 G3 G4 Target
7.733347 6.933914 6.493334 5.31336 0
6.555225 6.924448 6.353376 5.568334 1
7.515558 6.326627 6.197123 5.565245 0
7.132243 6.733111 7.107221 5.681575 1
I'm reading my data with the following code:
def extract_examples_labels(filepath):
    data = pd.read_csv(filepath).as_matrix()
    num_inputs = len(data[0]) - 1
    data_examples = data[:, range(num_inputs)]
    data_labels = data[:, len(data[0]) - 1]
    return data_examples, data_labels
I then do the EXACT same thing as in the TensorFlow example but I use my data instead. However, I keep getting an error that says:
ValueError: Target's dtype should be int32, int64 or compatible. Instead got dtype: 'float64'
So I figure this means that since my y_train is a float, I need to cast it to an int, which I do using:
y_train = y_train.astype(int)
I confirm it's of type int64 and run the classifier again, but get the following error:
ValueError: Targets are incompatible with given information. Given targets: Tensor("output:0", shape=(?,), dtype=int64), required signatures: TensorSignature(dtype=tf.float64, shape=TensorShape([Dimension(None)]), is_sparse=False).
Now it says it wants a float64, so I'm confused about what I'm doing wrong. Any suggestions or obvious mistakes?
After doing a little digging, I found a solution. If you look in the following directory in the TensorFlow package:
tensorflow.contrib.learn.python.learn.datasets
You can find a file called base.py which has the csv file loading functions. Basically, I just modified the function called load_csv to take in my file. The code is shown below:
# imports needed to make this snippet self-contained (they are present in base.py)
import collections
import csv

import numpy as np
from tensorflow.python.platform import gfile

Dataset = collections.namedtuple('Dataset', ['data', 'target'])
Datasets = collections.namedtuple('Datasets', ['train', 'validation', 'test'])

def load_csv(filename, target_dtype, target_column=-1, has_header=True):
    """Load dataset from CSV file."""
    with gfile.Open(filename) as csv_file:
        data_file = csv.reader(csv_file)
        if has_header:
            header = next(data_file)
            n_samples = int(header[0])
            n_features = int(header[1])
            data = np.empty((n_samples, n_features))
            target = np.empty((n_samples,), dtype=np.int)
            for i, ir in enumerate(data_file):
                target[i] = np.asarray(ir.pop(target_column), dtype=target_dtype)
                data[i] = np.asarray(ir, dtype=np.float64)
        else:
            data, target = [], []
            for ir in data_file:
                target.append(ir.pop(target_column))
                data.append(ir)
    return Dataset(data=data, target=target)
If you look at the code above, I think the problem I was having was with the target_dtype argument. Even though I changed the dtype of the target array, I didn't change the target_dtype, which made it seem incompatible when TensorFlow checked the signatures. My code works now. If you have any questions or would like further clarification, please feel free to ask!
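For example, a hedged usage sketch of the modified loader (the file name and n_classes value are placeholders, not from the original post, and the has_header=True path expects the header to carry the sample and feature counts):
my_data = load_csv('my_training_data.csv', target_dtype=np.int64)
classifier = learn.DNNClassifier(hidden_units=[10, 20, 10], n_classes=2)
classifier.fit(my_data.data, my_data.target, steps=200)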