IndexError - Python - EMNIST dataset

I've been trying to construct a neural network to train on the EMNIST datasets. The two segments of code below live in entirely different cells of a Jupyter notebook, but they are the ones that cause the error stated below. My problem is that for one dataset the code runs fine, yet for this particular dataset I receive an error. If anyone could tell me where I've been going wrong, it would be greatly appreciated.
IndexError: index 540774 is out of bounds for size 540774
def dense_to_one_hot(labels_dense, num_classes):
    num_labels = labels_dense.shape[0]
    index_offset = np.arange(num_labels) * num_classes
    labels_one_hot = np.zeros((num_labels, num_classes))
    labels_one_hot.flat[index_offset + labels_dense.ravel()] = 1
    return labels_one_hot

test_labels_flat = test_data_labels[["1"]].values.ravel()
test_labels_count = np.unique(test_labels_flat).shape[0]
test_labels = dense_to_one_hot(test_labels_flat, test_labels_count)
test_labels = test_labels.astype(np.uint8)
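For reference, here is a minimal, self-contained sketch (toy labels, not the asker's data) reproducing the same failure mode: when the label values are not zero-based, the largest flat index computed by dense_to_one_hot equals the size of the one-hot array, which yields exactly an "index N is out of bounds for size N" error.

import numpy as np

def dense_to_one_hot(labels_dense, num_classes):
    num_labels = labels_dense.shape[0]
    index_offset = np.arange(num_labels) * num_classes
    labels_one_hot = np.zeros((num_labels, num_classes))
    labels_one_hot.flat[index_offset + labels_dense.ravel()] = 1
    return labels_one_hot

# Toy labels running 1..3 instead of 0..2: np.unique reports 3 classes,
# but the last flat index is 2*3 + 3 = 9, one past the end of the 3x3 array.
labels = np.array([1, 2, 3])
num_classes = np.unique(labels).shape[0]  # 3

try:
    dense_to_one_hot(labels, num_classes)
except IndexError as err:
    print(err)  # index 9 is out of bounds for size 9

# Shifting to zero-based labels keeps every flat index inside the array:
print(dense_to_one_hot(labels - labels.min(), num_classes))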

Related

Tensorflow getting ' ValueError: Exception encountered when calling layer "normalization" Dimensions must be equal'

I am following TensorFlow's regression tutorial and have created a multivariable linear regression and a deep neural network. However, when I try to collect the test set in test_results, I get the following error:
ValueError: Exception encountered when calling layer "normalization" (type Normalization).
Dimensions must be equal, but are 7 and 8 for '{{node sequential/normalization/sub}} = Sub[T=DT_FLOAT](sequential/Cast, sequential/normalization/sub/y)' with input shapes: [?,7], [1,8].
Call arguments received by layer "normalization" (type Normalization):
• inputs=tf.Tensor(shape=(None, 7), dtype=float32)
Here is some of the code for the linear regression, starting from splitting the labels. The error appears on the last line, test_results['linear_model'] = linear_model.evaluate(test_features, test_labels, verbose=0). However, I am able to generate the error plots and everything else seems to work fine, so I'm not entirely sure what the problem with getting the test results is.
Any help would be much appreciated!
# Split labels
train_features = train_dataset.copy()
test_features = test_dataset.copy()
train_labels = train_features.pop('HCO3')
test_labels = test_features.pop('HCO3')
train_features = np.asarray(train_dataset.copy()).astype('float32')
# print(train_features.tail())

# Normalization
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(np.array(train_features))
first = np.array(train_features[:1])

linear_model = tf.keras.Sequential([
    normalizer,
    layers.Dense(units=1)
])

# Compilation
linear_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error'
)

history = linear_model.fit(
    train_features,
    train_labels,
    epochs=100,
    # Suppress logging.
    verbose=0,
    # Calculate validation results on 20% of the training data.
    validation_split=0.2)

# Track error for later
test_results = {}
test_results['linear_model'] = linear_model.evaluate(test_features, test_labels, verbose=0)
You lost the outcome column in the dataframe because of pop. Try extracting that column with plain indexing instead:
train_labels = train_features['HCO3']
test_labels = test_features['HCO3']
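A rough sketch of that change in the context of the question's code (the column name 'HCO3' is taken from the question; everything else stays as in the original). Selecting the label column instead of popping it keeps the train and test features at the same width that the Normalization layer was adapted on.

train_features = train_dataset.copy()
test_features = test_dataset.copy()

train_labels = train_features['HCO3']
test_labels = test_features['HCO3']

normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(np.array(train_features))

# ... build, compile and fit linear_model as before ...
test_results['linear_model'] = linear_model.evaluate(test_features, test_labels, verbose=0)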

Tensorflow: data api for big datasets

I'm learning about neural networks from the book "Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow". On page 410 the author shows the following function, describing it as a small helper function: it will create and return a dataset that efficiently loads data from multiple CSV files, then shuffles it, preprocesses it and batches it. It is a good input pipeline for large datasets that don't fit in memory (RAM).
I tried running the function with 3 small files pretending they were the training set, and the entire "training set" (all three files together) was loaded into memory. More precisely, the function tf.data.TextLineDataset() is reading the whole file. I thought it would load batches of, say, 32 instances at a time from the hard drive into RAM, but that is not what's happening.
So I don't understand what is going on here. Why is it reading the whole dataset?
X_mean, X_std = [...]  # mean and scale of each feature in the training set
n_inputs = 8

def preprocess(line):
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return (x - X_mean) / X_std, y

def csv_reader_dataset(filepaths, repeat=1, n_readers=5, shuffle_buffer_size=10000,
                       n_parse_threads=5, batch_size=32):
    dataset = tf.data.Dataset.list_files(filepaths).repeat(repeat)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers)
    dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)

train_set = csv_reader_dataset(train_filepaths)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)

model = keras.models.Sequential([...])
model.compile([...])
model.fit(train_set, steps_per_epoch=len(X_train) // batch_size,
          epochs=10, validation_data=valid_set,
          validation_steps=len(X_valid) // batch_size)
For reading multiple CSV files from disk on demand, you can use the make_csv_dataset function.
train_ds = tf.data.experimental.make_csv_dataset(
    tf.io.gfile.glob("data/multiple/*.csv"),
    batch_size=10,
    column_names=None,
    column_defaults=[0, 0, 0, 0, 0., "a", 0],
    label_name='target',
    select_columns=['age', 'sex', 'cp', 'trestbps', 'oldpeak', 'thal', 'target'],
    shuffle=True,
    shuffle_seed=101,
)
For more information see here.
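As a rough usage sketch (assuming the column names from the snippet above and TF 2.x eager execution), the returned dataset yields (features, label) batches lazily, so it can be iterated or handed to Keras without loading all the CSV files into memory:

# Peek at one batch; `features` is a dict mapping column names to tensors.
for features, labels in train_ds.take(1):
    for name, column in features.items():
        print(name, column.shape)
    print("labels:", labels.shape)

# A Keras model with matching named inputs could then consume it directly:
# model.fit(train_ds, epochs=10)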

Fine-tuning a neural network in tensorflow

I've been working on this neural network with the intent to predict TBA (time-based availability) of simulated windmill parks based on certain attributes. The neural network runs just fine and gives me some predictions; however, I'm not quite satisfied with the results. It fails to notice some very obvious correlations that I can clearly see myself. Here is my current code:
# Import
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

maxi = 0.96
mini = 0.7

# Make data a np.array
data = pd.read_csv('datafile_ML_no_avg.csv')
data = data.values

# Shuffle the data
shuffle_indices = np.random.permutation(np.arange(len(data)))
data = data[shuffle_indices]

# Training and test data
data_train = data[0:int(len(data)*0.8), :]
data_test = data[int(len(data)*0.8):int(len(data)), :]

# Scale data
scaler = MinMaxScaler(feature_range=(mini, maxi))
scaler.fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)

# Build X and y
X_train = data_train[:, 0:5]
y_train = data_train[:, 6:7]
X_test = data_test[:, 0:5]
y_test = data_test[:, 6:7]

# Number of stocks in training data
n_args = X_train.shape[1]
multi = int(8)

# Neurons
n_neurons_1 = 8*multi
n_neurons_2 = 4*multi
n_neurons_3 = 2*multi
n_neurons_4 = 1*multi

# Session
net = tf.InteractiveSession()

# Placeholders
X = tf.placeholder(dtype=tf.float32, shape=[None, n_args])
Y = tf.placeholder(dtype=tf.float32, shape=[None, 1])

# Initializers
sigma = 1
weight_initializer = tf.variance_scaling_initializer(mode="fan_avg",
                                                     distribution="uniform", scale=sigma)
bias_initializer = tf.zeros_initializer()

# Hidden weights
W_hidden_1 = tf.Variable(weight_initializer([n_args, n_neurons_1]))
bias_hidden_1 = tf.Variable(bias_initializer([n_neurons_1]))
W_hidden_2 = tf.Variable(weight_initializer([n_neurons_1, n_neurons_2]))
bias_hidden_2 = tf.Variable(bias_initializer([n_neurons_2]))
W_hidden_3 = tf.Variable(weight_initializer([n_neurons_2, n_neurons_3]))
bias_hidden_3 = tf.Variable(bias_initializer([n_neurons_3]))
W_hidden_4 = tf.Variable(weight_initializer([n_neurons_3, n_neurons_4]))
bias_hidden_4 = tf.Variable(bias_initializer([n_neurons_4]))

# Output weights
W_out = tf.Variable(weight_initializer([n_neurons_4, 1]))
bias_out = tf.Variable(bias_initializer([1]))

# Hidden layers
hidden_1 = tf.nn.relu(tf.add(tf.matmul(X, W_hidden_1), bias_hidden_1))
hidden_2 = tf.nn.relu(tf.add(tf.matmul(hidden_1, W_hidden_2), bias_hidden_2))
hidden_3 = tf.nn.relu(tf.add(tf.matmul(hidden_2, W_hidden_3), bias_hidden_3))
hidden_4 = tf.nn.relu(tf.add(tf.matmul(hidden_3, W_hidden_4), bias_hidden_4))

# Output layer (transpose!)
out = tf.transpose(tf.add(tf.matmul(hidden_4, W_out), bias_out))

# Cost function
mse = tf.reduce_mean(tf.squared_difference(out, Y))

# Optimizer
opt = tf.train.AdamOptimizer().minimize(mse)

# Init
net.run(tf.global_variables_initializer())

# Fit neural net
batch_size = 10
mse_train = []
mse_test = []

# Run
epochs = 10
for e in range(epochs):
    # Shuffle training data
    shuffle_indices = np.random.permutation(np.arange(len(y_train)))
    X_train = X_train[shuffle_indices]
    y_train = y_train[shuffle_indices]
    # Minibatch training
    for i in range(0, len(y_train) // batch_size):
        start = i * batch_size
        batch_x = X_train[start:start + batch_size]
        batch_y = y_train[start:start + batch_size]
        # Run optimizer with batch
        net.run(opt, feed_dict={X: batch_x, Y: batch_y})
        # Show progress
        if np.mod(i, 50) == 0:
            mse_train.append(net.run(mse, feed_dict={X: X_train, Y: y_train}))
            mse_test.append(net.run(mse, feed_dict={X: X_test, Y: y_test}))

pred = net.run(out, feed_dict={X: X_test})
print(pred)
I have tried tweaking the number of hidden layers, the number of nodes per layer, and the number of epochs, and trying different activation functions and optimizers. However, I am quite new to neural networks, so there might be something very obvious that I'm missing.
Thanks in advance to anyone who managed to read through all of that.
It would make it much easier if you shared a small dataset that illustrates the problem. However, I will state some of the issues with non-standard datasets and how to overcome them.
Possible solutions
Regularization and validation-based optimization - methods that are always good to try when looking for some extra accuracy. See dropout methods here (original paper), and some overview here.
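For that first point, a minimal sketch of what dropout could look like in the asker's TF 1.x graph (keep_prob is a hypothetical placeholder added here, not part of the original code):

# Hypothetical dropout inserted between the question's hidden layers (TF 1.x).
keep_prob = tf.placeholder_with_default(1.0, shape=())

hidden_1 = tf.nn.relu(tf.add(tf.matmul(X, W_hidden_1), bias_hidden_1))
hidden_1 = tf.nn.dropout(hidden_1, keep_prob)
hidden_2 = tf.nn.relu(tf.add(tf.matmul(hidden_1, W_hidden_2), bias_hidden_2))
hidden_2 = tf.nn.dropout(hidden_2, keep_prob)

# Feed e.g. keep_prob=0.8 only during training; the default of 1.0 disables
# dropout at evaluation time:
# net.run(opt, feed_dict={X: batch_x, Y: batch_y, keep_prob: 0.8})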
Unbalanced data - Sometimes time-series categories/events behave like anomalies, or are simply unbalanced. If you read a book, words like "the" or "it" will appear many more times than "warehouse". This can become a problem if your main task is to detect the word "warehouse" and you train your network (even LSTMs) in the traditional way. A way to overcome this problem is to balance the samples (create balanced datasets) or to give more weight to low-frequency categories.
Model structure - sometimes fully connected layers are not enough. See computer vision problems, for instance, where we train using convolutional layers. The convolution and pooling layers enforce structure on the model, which suits images. This is also a form of regularization, since those layers have fewer parameters. In time-series problems convolutions are also possible, and it turns out they work just fine. See the example in Conditional Time Series Forecasting with Convolutional Neural Networks.
The above suggestions are presented in the order I would suggest trying them.
Good luck!

Understanding dimension of input to pre-defined LSTM

I am trying to design a model in TensorFlow to predict the next word using an LSTM.
The TensorFlow tutorial for RNNs gives pseudocode for how to use an LSTM on the PTB dataset.
I have reached the step of generating batches and labels.
def generate_batches(raw_data, batch_size):
    global data_index
    data_len = len(raw_data)
    num_batches = data_len // batch_size
    # batch = dict.fromkeys([i for i in range(num_batches)])
    # labels = dict.fromkeys([i for i in range(num_batches)])
    batch = np.ndarray(shape=(batch_size), dtype=np.float)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.float)
    for i in xrange(batch_size):
        batch[i] = raw_data[i + data_index]
        labels[i, 0] = raw_data[i + data_index + 1]
    data_index = (data_index + 1) % len(raw_data)
    return batch, labels
This code gives batches and labels of size (batch_size x 1).
The batches and labels could also be of size (batch_size x vocabulary_size) using tf.nn.embedding_lookup().
So the problem is how to proceed from here, using the function rnn_cell.BasicLSTMCell or a user-defined LSTM model. What will the input dimension to the LSTM cell be, and how is it used with num_steps?
Which size of batch and labels is useful in which scenario?
The full example for PTB is in the source code. There are recommended defaults (SmallConfig, MediumConfig, and LargeConfig) that you can use.
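To make the shape question concrete, here is a minimal TF 1.x sketch (the hyperparameter values are made up, not taken from the tutorial): the LSTM cell consumes inputs of shape [batch_size, num_steps, embedding_size], produced by an embedding lookup over word ids of shape [batch_size, num_steps].

import tensorflow as tf

batch_size, num_steps = 20, 35        # hypothetical values
vocab_size, embedding_size = 10000, 200
hidden_size = 200

# Word ids: one row per sequence, num_steps ids per row.
input_ids = tf.placeholder(tf.int32, [batch_size, num_steps])
embedding = tf.get_variable("embedding", [vocab_size, embedding_size])
inputs = tf.nn.embedding_lookup(embedding, input_ids)  # [batch, steps, embed]

cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
initial_state = cell.zero_state(batch_size, tf.float32)

# dynamic_rnn unrolls the cell over the num_steps axis.
outputs, state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state)
# outputs: [batch_size, num_steps, hidden_size]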

Cannot seem to get read_batch_examples working alongside an Estimator

EDIT: I'm using TensorFlow version 0.10.0rc0
I'm currently trying to get tf.contrib.learn.read_batch_examples working while using a TensorFlow (SKFlow/tf.contrib) Estimator, specifically the LinearClassifier. I create a read_batch_examples op, feeding in a CSV file, with tf.decode_csv as the parse_fn parameter and appropriate default records. I then feed that op to the input_fn for fitting the Estimator, but when that's run I receive the following error:
ValueError: Tensor("centered_bias_weight:0", shape=(1,), dtype=float32_ref) must be from the same graph as Tensor("linear/linear/BiasAdd:0", shape=(?, 1), dtype=float32).
I'm confused because neither of those Tensors appears to come from the read_batch_examples op. The code works if I run the op beforehand and feed the input in as an array of values instead. While this workaround exists, it is unhelpful because I am working with large datasets for which I need to batch my inputs. Looping over Estimator.fit (currently equivalent to Estimator.partial_fit) in iterations isn't nearly as fast as being able to feed in data as it trains, so getting this working would be ideal. Any ideas? I'll post the non-functioning code below.
def input_fn(examples_dict):
    continuous_cols = {k: tf.cast(examples_dict[k], dtype=tf.float32)
                       for k in CONTINUOUS_FEATURES}
    categorical_cols = {
        k: tf.SparseTensor(
            indices=[[i, 0] for i in xrange(examples_dict[k].get_shape()[0])],
            values=examples_dict[k],
            shape=[int(examples_dict[k].get_shape()[0]), 1])
        for k in CATEGORICAL_FEATURES}
    feature_cols = dict(continuous_cols)
    feature_cols.update(categorical_cols)
    label = tf.contrib.layers.one_hot_encoding(labels=examples_dict[LABEL],
                                               num_classes=2,
                                               on_value=1,
                                               off_value=0)
    return feature_cols, label

filenames = [...]
csv_headers = [...]  # features and label headers
batch_size = 50
min_after_dequeue = int(num_examples * min_fraction_of_examples_in_queue)
queue_capacity = min_after_dequeue + 3 * batch_size

examples = tf.contrib.learn.read_batch_examples(
    filenames,
    batch_size=batch_size,
    reader=tf.TextLineReader,
    randomize_input=True,
    queue_capacity=queue_capacity,
    num_threads=1,
    read_batch_size=1,
    parse_fn=lambda x: tf.decode_csv(
        x, [tf.constant([''], dtype=tf.string) for _ in xrange(len(csv_headers))]))

examples_dict = {}
for i, header in enumerate(csv_headers):
    examples_dict[header] = examples[:, i]

categorical_cols = []
for header in CATEGORICAL_FEATURES:
    categorical_cols.append(tf.contrib.layers.sparse_column_with_keys(
        header,
        keys  # Keys for that particular feature, source not shown here
    ))
continuous_cols = []
for header in CONTINUOUS_FEATURES:
    continuous_cols.append(tf.contrib.layers.real_valued_column(header))
feature_columns = categorical_cols + continuous_cols

model = tf.contrib.learn.LinearClassifier(
    model_dir=model_dir,
    feature_columns=feature_columns,
    optimizer=optimizer,
    n_classes=num_classes)

# Above code is ok up to this point
model.fit(input_fn=lambda: input_fn(examples_dict),
          steps=200)  # This line causes the error ****
Any alternatives for batching would be appreciated as well!
I was able to figure out my mistake through the help of the great TensorFlow team! read_batch_examples has to be called within input_fn, otherwise the op has to be run beforehand as it'll be from a different graph.
Edit
Here is the modified code that functions properly for those who are interested:
def input_fn(file_names, batch_size):
    examples_dict = read_csv_examples(file_names, batch_size)

    # Continuous features
    feature_cols = {k: tf.string_to_number(examples_dict[k], out_type=tf.float32)
                    for k in CONTINUOUS_FEATURES}

    # Categorical features
    feature_cols.update({
        k: tf.SparseTensor(
            indices=[[i, 0] for i in range(examples_dict[k].get_shape()[0])],
            values=examples_dict[k],
            shape=[int(examples_dict[k].get_shape()[0]), 1])
        for k in CATEGORICAL_FEATURES})

    # Change out_type for classification/regression
    out_type = tf.int32 if CLASSIFICATION else tf.float32
    label = tf.string_to_number(examples_dict[LABEL], out_type=out_type)

    return feature_cols, label

def read_csv_examples(file_names, batch_size):
    def parse_fn(record):
        record_defaults = [tf.constant([''], dtype=tf.string)] * len(FEATURE_HEADERS)
        return tf.decode_csv(record, record_defaults)

    examples_op = tf.contrib.learn.read_batch_examples(
        file_names,
        batch_size=batch_size,
        reader=tf.TextLineReader,
        parse_fn=parse_fn)

    # Important: convert examples to a dict for ease of use in `input_fn`.
    # Map each header to its respective column (FEATURE_HEADERS order matters!)
    examples_dict_op = {}
    for i, header in enumerate(FEATURE_HEADERS):
        examples_dict_op[header] = examples_op[:, i]

    return examples_dict_op
This code is near minimal for producing a generic input function for your data. Also note that if you would like to pass num_epochs to read_batch_examples, you'll need to do something different for your categorical features (see this answer for details). Disclaimer: I wrote that answer. Hope this helps!
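For completeness, a minimal sketch of wiring this into the Estimator (same names as in the snippet above; steps=200 is simply the value from the question). Because read_batch_examples is now created inside input_fn, everything lives in the graph that Estimator.fit builds:

model.fit(input_fn=lambda: input_fn(file_names, batch_size), steps=200)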
