tf.data.datasets set each batch (prefetch) - python

I am looking for help thinking through this.
I have a function (that is not a generator) that will give me any number of samples.
Let's say that all the data I want to train on (1000 samples) can't fit into memory.
So I want to call this function 10 times, each time getting a smaller number of samples that does fit into memory.
This is a dummy example for simplicity.
import numpy as np

def get_samples(num_samples: int, random_seed=0):
    np.random.seed(random_seed)
    x = np.random.randint(0, 100, num_samples)
    y = np.random.randint(0, 2, num_samples)
    return np.array(list(zip(x, y)))
Again, let's say get_samples(1000, 0) won't fit into memory.
So in theory I am looking for something like this:
batch_size = 100
total_num_samples = 1000
batches = []
for i in range(total_num_samples//batch_size):
    batches.append(get_samples(batch_size, i))
But this still loads everything into memory.
Again, this function is a dummy representation; the real one is already defined and is not a generator.
In TensorFlow land, I was hoping that:
tf.data.Dataset.batch[0] would equal the output of get_samples(100,0)
tf.data.Dataset.batch[1] would equal the output of get_samples(100,1)
tf.data.Dataset.batch[2] would equal the output of get_samples(100,2)
...
tf.data.Dataset.batch[9] would equal the output of get_samples(100,9)
I understand that I can use tf.data.Dataset with a generator (and I think you can set a generator per batch). But the function I have returns more than a single sample, and its setup is too expensive to run for every single sample.
I wanted to use tf.data.Dataset.prefetch() to run the get_samples function once for every batch. And of course, it would call get_samples with the same parameters on every epoch.
Sorry if the explanation is convoluted. I am trying my best to describe the problem.
Anyone have any ideas?

This is what I came up with:
import numpy as np
import tensorflow as tf

def simple_static_synthesizer(batch_size, seed=1, verbose=True):
    if verbose:
        print(f"Creating Synthetic Data with seed {seed}")
    rng = np.random.default_rng(seed)
    all_x = []
    all_y = []
    for i in range(batch_size):
        x = np.array(np.concatenate((rng.integers(0,100, 1, dtype=int), rng.integers(0,100, 1, dtype=int), rng.integers(0,100, 1, dtype=int))))
        y = np.array(rng.integers(0,2,1, dtype=int))
        all_x.append(x)
        all_y.append(y)
    return all_x, all_y

def my_generator(total_size, batch_size, seed=0, verbose=True):
    counter = 0
    for i in range(total_size):
        # Regenerate the data at the start of every batch, with a new seed
        if counter%batch_size == 0:
            x,y = simple_static_synthesizer(batch_size,seed,verbose)
            seed += 1
        yield x[i%batch_size],y[i%batch_size]
        counter += 1

my_gen = my_generator(10,2,seed=1)
# See values
for x,y in my_gen:
    print(x,y)
# Call again; this gives the same output as above
my_gen = my_generator(10,2,seed=1)
for x,y in my_gen:
    print(x,y)
# Dataset with small batches to see if it is doing it correctly
total_samples = 10
batch_size = 2
seed = 5
dataset = tf.data.Dataset.from_generator(
    my_generator,
    args=[total_samples,batch_size,seed],
    output_signature=(
        tf.TensorSpec(shape=(3,), dtype=tf.uint8),
        tf.TensorSpec(shape=(1,), dtype=tf.uint8),
    )
)
for i,(x,y) in enumerate(dataset):
    print(x.numpy(),y.numpy())
    if i == 4:
        break # shows first 3 syn calls
Wish we could have notebook answers!
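Since the expensive function already returns a whole batch, another option is to have the generator yield one full batch per call instead of single samples. The sketch below is only an illustration under assumptions: get_samples is a dummy stand-in like the one in the question (rewritten here with default_rng), batch_generator is a hypothetical name, and the TensorFlow wiring is left out so the block runs with NumPy alone.

```python
import numpy as np


def get_samples(num_samples: int, random_seed=0):
    """Dummy stand-in for the expensive function from the question."""
    rng = np.random.default_rng(random_seed)
    x = rng.integers(0, 100, num_samples)
    y = rng.integers(0, 2, num_samples)
    return np.stack([x, y], axis=1)  # shape (num_samples, 2)


def batch_generator(total_num_samples, batch_size):
    """Yield one whole batch per step; batch i always uses seed i."""
    for i in range(total_num_samples // batch_size):
        # Only the batch currently being yielded has to exist in memory.
        yield get_samples(batch_size, i)


# Two passes over the generator stand in for two epochs (small sizes for the demo).
first_epoch = list(batch_generator(20, 5))
second_epoch = list(batch_generator(20, 5))
```

If this generator were passed to tf.data.Dataset.from_generator with an output_signature describing a whole batch, for example tf.TensorSpec(shape=(batch_size, 2), dtype=tf.int64), followed by .prefetch(1), each dataset element should correspond to one get_samples call and no further .batch() call would be needed. Because the seed depends only on the batch index, every epoch would see the same batches.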

Related

Yielding generator data in batches while reading files from disk is piling up in memory just like loading whole data

I am using Keras to train an Inception-ResNet on my MRI data. The data is spread across 151 files, and each file corresponds to a patient (I cannot merge the data into one file, because the filenames are patient IDs, which provide the labels). The data is too large to fit in memory (larger than 25 GB), so I used fit_generator in Keras. I am trying to read batch_size files, do some processing on them, yield that data, and repeat. The problem is that when I use the generator and watch my RAM at every next(data), RAM usage increases every time instead of staying roughly the same (the file sizes are nearly equal), and at around 88 files my code crashes. What am I missing? I've tried a lot. Here is my generator code:
def file_reading(datafiles, batch_size, labels, d_type, N):
    while True:
        batch_data = []
        batch_label = []
        batch_start = 0
        batch_end = batch_size
        label = -1
        i=0
        while (batch_end<=N):
            for path in datafiles[batch_start:batch_end] :
                id = int(path.split('/')[-1].split('.')[0].lstrip("0"))
                if id in labels:
                    dlabel_of_id = labels[id][d_type]
                else:
                    pass
                if (dlabel_of_id != 'na'):
                    if (dlabel_of_id == 'yes'):
                        label = 1
                    elif (dlabel_of_id == 'no'):
                        label = 0
                    raw = mne.io.read_raw_edf(path, preload=True);
                    if max(raw.times) < 660:
                        raw.crop(tmin=60, tmax=max(raw.times));
                    else:
                        raw.crop(tmin=60, tmax=660);
                    temp = montages(raw)
                    raw = 0
                    temp = np.resize(temp,(2966,2966,3))
                    batch_data.append(temp)
                    batch_label.append(label)
            yield (batch_data, batch_label)
            batch_start += batch_size
            batch_end += batch_size
Moreover, when I do this
data = file_reading(paths, 16, labels, 'EEG', 151)
len(next(data)[0])
Then the memory does not pile up and the generator works fine.
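One thing that stands out in the generator code: batch_data and batch_label are created before the inner while loop starts and are never cleared, so each yielded batch also carries every earlier batch. A minimal sketch of the usual pattern follows; load_item and batch_reader are hypothetical names, with load_item standing in for the EDF reading and montage step, and the label is a placeholder.

```python
def load_item(path):
    """Hypothetical stand-in for reading and preprocessing one file."""
    return [0.0] * 4


def batch_reader(paths, batch_size):
    """Yield (data, labels) batches that never grow beyond batch_size."""
    while True:  # loop forever, one pass over the files per epoch
        for start in range(0, len(paths) - batch_size + 1, batch_size):
            # Fresh lists per batch, so earlier batches can be garbage collected.
            batch_data, batch_label = [], []
            for path in paths[start:start + batch_size]:
                batch_data.append(load_item(path))
                batch_label.append(0)  # placeholder label
            yield batch_data, batch_label


reader = batch_reader([f"{i:03d}.edf" for i in range(10)], batch_size=4)
# Pull five batches; this wraps around into a second pass over the files.
sizes = [len(next(reader)[0]) for _ in range(5)]
```

As in the original loop, a trailing partial batch is skipped rather than yielded.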

change() missing 1 required positional argument: 'X' while predicting future value

I am trying to predict a future value from three inputs, forecasting every hour. Here g = temperature, p = humidity and c = wind, and I want to predict the temperature for the next hour from these inputs, which is why I set n_out to 1. I wrote the code as a function, and then tried to use that function's output as my x and y values, because I am going to split them into train and test sets. But I got the error shown in the title. I am going to predict the future value using an LSTM, and after this I don't know how to feed this data into the LSTM model as train and test sets. Can anyone help me solve this problem?
Here is my code and CSV file.
def change(train, X, n_out=1):
    data = train.reshape((train.shape[0]))
    x, y = list(), list()
    in_start = 0
    # step over the entire history one time step at a time
    for _ in range(len(data)):
        # define the end of the input sequence
        in_end = in_start + X
        out_end = in_end + n_out
        # ensure we have enough data for this instance
        if out_end < len(data):
            x_input = data[in_start:in_end, 0]
            x_input = x_input.reshape((len(x_input), 3))
            x.append(x_input)
            y.append(data[in_end:out_end, 0])
        # move along one time step
        in_start += 1
    return array(x), array(y)
data= pd.DataFrame(data,columns=['g','p','c'])
data.columns = ['g', 'p', 'c',]
pd.options.display.float_format = '{:,.0f}'.format
data = data.dropna ()
cols=['g', 'p', 'c']
X=data[cols]
x,y = change(data)
The error came as
my csv file:
After editing the code it gave me this error:
In the definition of your function you have three parameters:
train, X and n_out=1
def change(train, X, n_out=1)
When you call the function, you provide just one argument (data):
x,y = change(data)
Because n_out has a default value of 1 but X does not, you need to provide X as well, or define your function without it:
def change(train, n_out=1)
NOTE:
You need to provide X when you call your function, for example:
x,y = change(data, 1)
or give X a default value in the definition:
def change(train, X=1, n_out=1)
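The difference between the two signatures can be checked with a toy version of the function. This is only a sketch: the bodies below are placeholders that report the arguments they received, and change_strict is a hypothetical name for the variant whose X has no default.

```python
def change(train, X=1, n_out=1):
    """Toy version with a default for X; only reports its arguments."""
    return len(train), X, n_out


def change_strict(train, X, n_out=1):
    """Same signature as in the question: X has no default."""
    return len(train), X, n_out


data = [1, 2, 3]

try:
    change_strict(data)  # X is missing, so Python raises TypeError
    error_message = None
except TypeError as exc:
    error_message = str(exc)

with_default = change(data)             # X falls back to 1
with_explicit = change_strict(data, 2)  # X passed explicitly
```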

Odd Results on Entropy Calculation

I am trying to write a function that properly calculates the entropy of a given dataset. However, I am getting very weird entropy values.
I am following the understanding that all entropy calculations must fall between 0 and 1, yet I am consistently getting values above 2.
Note: I must use log base 2 for this
Can someone explain why I am getting incorrect entropy results?
The dataset I am testing is the ecoli dataset from the UCI Machine Learning Repository.
import numpy
import math
#################### DATA HANDLING LIBRARY ####################
def csv_to_array(file):
    # Open the file, and load it in delimiting on the ',' for a comma separated value file
    data = open(file, 'r')
    data = numpy.loadtxt(data, delimiter=',')
    # Loop through the data in the array
    for index in range(len(data)):
        # Utilize a try catch to try and convert to float, if it can't convert to float, converts to 0
        try:
            data[index] = [float(x) for x in data[index]]
        # The more specific exception must come first, otherwise it is never reached
        except ValueError:
            data[index] = 0
        except Exception:
            data[index] = 0
    # Return the now type-formatted data
    return data
# Function that utilizes the numpy library to randomize the dataset.
def randomize_data(csv):
    # numpy.random.shuffle shuffles in place and returns None
    numpy.random.shuffle(csv)
    return csv
# Function to split the data into test, training set, and validation sets
def split_data(csv):
    # Call the randomize data function
    randomize_data(csv)
    # Grab the number of rows and calculate where to split
    num_rows = csv.shape[0]
    validation_split = int(num_rows * 0.10)
    training_split = int(num_rows * 0.72)
    testing_split = int(num_rows * 0.18)
    # Validation set as the first 10% of the data
    validation_set = csv[:validation_split]
    # Training set as the next 72%
    training_set = csv[validation_split:training_split + validation_split]
    # Testing set as the last 18%
    testing_set = csv[training_split + validation_split:]
    # Split the data into classes vs actual data
    training_cols = training_set.shape[1]
    testing_cols = testing_set.shape[1]
    validation_cols = validation_set.shape[1]
    training_classes = training_set[:, training_cols - 1]
    testing_classes = testing_set[:, testing_cols - 1]
    validation_classes = validation_set[:, validation_cols - 1]
    # Take the sets and remove the last (classification) column
    training_set = training_set[:, :-1]
    testing_set = testing_set[:, :-1]
    validation_set = validation_set[:, :-1]
    # Return the datasets
    return testing_set, testing_classes, training_set, training_classes, validation_set, validation_classes
#################### DATA HANDLING LIBRARY ####################
# This function returns the list of classes, and their associated weights (i.e. distributions)
# for a given dataset
def class_distribution(dataset):
    # Ensure the dataset is a numpy array
    dataset = numpy.asarray(dataset)
    # Collect # of total rows and columns, using numpy
    num_total_rows = dataset.shape[0]
    num_columns = dataset.shape[1]
    # Create a numpy array of just the classes
    classes = dataset[:, num_columns - 1]
    # Use numpy.unique to remove duplicates
    classes = numpy.unique(classes)
    # Create an empty array for the class weights
    class_weights = []
    # Loop through the classes one by one
    for aclass in classes:
        # Create storage variables
        total = 0
        weight = 0
        # Now loop through the dataset
        for row in dataset:
            # If the class of the dataset is equal to the current class you are evaluating, increase the total
            if numpy.array_equal(aclass, row[-1]):
                total = total + 1
            # If not, continue
            else:
                continue
        # Divide the # of occurrences by total rows
        weight = float((total / num_total_rows))
        # Add that weight to the list of class weights
        class_weights.append(weight)
    # Turn the weights into a numpy array
    class_weights = numpy.asarray(class_weights)
    # Return the array
    return classes, class_weights
# This function returns the entropy for a given dataset
# Can be used across an entire csv, or just for a column of data (feature)
def get_entropy(dataset):
    # Set initial entropy
    entropy = 0.0
    # Determine the classes and their frequencies (weights) of the dataset
    classes, class_freq = class_distribution(dataset)
    # Utilize numpy's quicksort to test the most occurring class first
    numpy.sort(class_freq)
    # Determine the max entropy for the dataset
    max_entropy = math.log(len(classes), 2)
    print("MAX ENTROPY FOR THIS DATASET: ", max_entropy)
    # Loop through the frequencies and use given formula to calculate entropy
    # For...Each simulates the sequence operator
    for freq in class_freq:
        entropy += float(-freq * math.log(freq, 2))
    # Return the entropy value
    return entropy

def main():
    ecol = csv_to_array('ecoli.csv')
    testing_set, testing_classes, training_set, training_classes, validation_set, validation_classes = split_data(ecol)
    entropy = get_entropy(ecol)
    print(entropy)

main()
The following function was used to calculate Entropy:
# Function to return Shannon's Entropy
def entropy(attributes, dataset, targetAttr):
    freq = {}
    entropy = 0.0
    index = 0
    for item in attributes:
        if (targetAttr == item):
            break
        else:
            index = index + 1
    index = index - 1
    for item in dataset:
        if ((item[index]) in freq):
            # Increase the count for this value
            freq[item[index]] += 1.0
        else:
            # Initialize the count for this value at 1
            freq[item[index]] = 1.0
    # Use a separate name for the loop variable so the freq dict is not shadowed
    for count in freq.values():
        entropy = entropy + (-count / len(dataset)) * math.log(count / len(dataset), 2)
    return entropy
As @MattTimmermans indicated, the range of entropy depends on the number of classes. For exactly 2 classes, it lies in the range 0 to 1 (inclusive). For more than 2 classes (which is what was being tested), the upper bound is the base-2 logarithm of the number of classes, so values above 1 are expected; the function above implements the calculation in Python. This post here explains the mathematics and calculations in a bit more detail.
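The dependence on the number of classes is easy to check on small inputs. The following is a minimal sketch, independent of the code above; shannon_entropy is a hypothetical helper that takes a plain sequence of class labels.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


two_classes = shannon_entropy(["a", "b"] * 4)      # uniform over 2 classes
eight_classes = shannon_entropy(list("abcdefgh"))  # uniform over 8 classes
one_class = shannon_entropy(["a"] * 5)             # a single class, no uncertainty
```

A uniform distribution over 2 classes gives 1 bit and a uniform distribution over 8 classes gives 3 bits, which matches the log2(number of classes) upper bound.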

Initialize a batch-dependent variable in Tensorflow

I have a tensorflow code that runs well and accurately, but occupies a lot of memory. Specifically, in my code, I have a for-loop that looks something like this:
K = 10
myarray1 = tf.placeholder(tf.float32, shape=[None,5,5]) # shape = [None, 5, 5]
myarray2 = tf.Variable( np.zeros([K,5,5]), dtype=tf.float32 )
vals = []
for k in range(0,K):
    tmp = tf.reduce_sum(myarray1*myarray2[k],axis=(1,2))
    vals.append(tmp)
result = tf.reduce_min( tf.stack(vals,axis=-1), axis=-1 )
Unfortunately, that takes a lot of memory as K gets to be big in my application. So, I want to have a better way of doing it. For example, in numpy/python, you would just keep track of the minimum value as you iterate through the loops, and update it on each iteration. It seems like I could use tf.assign, as:
K = 10
myarray1 = tf.placeholder(tf.float32, shape=[None,5,5]) # shape = [None, 5, 5]
myarray2 = tf.Variable( np.zeros([K,5,5]), dtype=tf.float32 )
min_value = tf.Variable(myarray1, validate_shape=False, trainable=False)
for k in range(0,K):
    tmp = myarray1*myarray2[k]
    idx = tf.where(tmp<min_value)
    tf.scatter_nd_assign(min_value, idx, tmp[idx], use_locking=True)
result = min_value
While this code builds the graph (when validate_shape=False), it fails to run because it complains that min_value has not been initialized. The issue is, when I run the initializer as:
sess.run(tf.global_variables_initializer())
or
sess.run(tf.variables_initializer(tf.trainable_variables()))
it complains that I am not feeding in a placeholder. This actually makes sense because the definition of min_value depends on myarray1 in the graph.
What I would actually want to do is define a dummy variable that doesn't depend on myarray1's values, but does match its shape. I would like these values to be initialized as some number (in this case something large is fine), as I will manually ensure these are overwritten in the network.
Note: as far as I know, you currently cannot define a variable with an unknown shape unless you feed in another variable of the desired shape and set validate_shape=False. Maybe there is another way?
Any help / suggestions appreciated.
Try this. If you don't know how to feed a placeholder, read the tutorial.
K = 10
myarray1 = tf.placeholder(tf.float32, shape=[None,5,5]) # shape = [None, 5, 5]
###################ADD THIS ####################
sess=tf.Session()
FOO = sess.run(myarray1,feed_dict={myarray1: YOURDATA}) #get myarray1 value
#replace all myarray1 below with FOO
################################################
myarray2 = tf.Variable( np.zeros([K,5,5]), dtype=tf.float32 )
min_value = tf.Variable(FOO, validate_shape=False, trainable=False)
for k in range(0,K):
    tmp = FOO*myarray2[k]
    idx = tf.where(tmp<min_value)
    tf.scatter_nd_assign(min_value, idx, tmp[idx], use_locking=True)
result = min_value
-------above new 15.April.2018------
Since I don't know your input data, I will walk through the steps.
Step_1: make a placeholder for the input data
x = tf.placeholder(tf.float32, shape=[None,2])
Step_2: get batches of data
batch_x=[[1,2],[3,4]] #example
#since x has shape [None,2], the batch size here is 2 (the number of rows in batch_x)
Step_3: make a session
sess=tf.Session()
If you have variables, add the following code to initialize them before the calculation:
init=tf.global_variables_initializer()
sess.run(init)
Step_4: run the graph, feeding the placeholder
your_placeholder_dictionary={x: batch_x}
sess.run(x, feed_dict=your_placeholder_dictionary)
Always feed your placeholder so it has a value to calculate with.
"Tensorflow and Deep Learning without a PhD" is a very helpful PDF file; you can also find it on YouTube under this title.
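The running-minimum idea mentioned in the question can be sketched in NumPy to check that it matches the stack-then-reduce version. This is an illustration only: batch and weights are hypothetical random arrays playing the roles of myarray1 and myarray2, and the sizes are small for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 4, 10
batch = rng.normal(size=(N, 5, 5))    # plays the role of myarray1
weights = rng.normal(size=(K, 5, 5))  # plays the role of myarray2

# Stack-then-reduce, as in the first snippet of the question.
stacked = np.stack(
    [(batch * weights[k]).sum(axis=(1, 2)) for k in range(K)], axis=-1
)
result_stacked = stacked.min(axis=-1)

# Running minimum: only one vector of length N is kept between iterations.
result_running = np.full(N, np.inf)
for k in range(K):
    tmp = (batch * weights[k]).sum(axis=(1, 2))
    result_running = np.minimum(result_running, tmp)
```

In TensorFlow the same update could be written with tf.minimum on plain tensors, starting from the k = 0 term, which would avoid a variable whose initial value depends on the placeholder. Note that the second snippet in the question compares the elementwise products rather than the per-example sums, which is a different quantity from the first snippet.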

tensorflow: memory allocation for a 'for' cycle

I am trying to use TensorFlow for calculating minimum Euclidean distance between each column in the matrix and all other columns (excluding itself):
with graph.as_default():
    ...
    def get_diversity(matrix):
        num_rows = matrix.get_shape()[0].value
        num_cols = matrix.get_shape()[1].value
        identity = tf.ones([1, num_cols], dtype=tf.float32)
        diversity = 0
        for i in range(num_cols):
            col = tf.reshape(matrix[:, i], [num_rows, 1])
            # Repeat the column across all columns so it can be compared with the whole matrix
            col_extended_to_matrix = tf.matmul(col, identity)
            difference_matrix = (col_extended_to_matrix - matrix) ** 2
            sum_vector = tf.reduce_sum(difference_matrix, 0)
            mask = tf.greater(sum_vector, 0)
            non_zero_vector = tf.select(mask, sum_vector, tf.ones([num_cols], dtype=tf.float32) * 9e99)
            min_diversity = tf.reduce_min(non_zero_vector)
            diversity += min_diversity
        return diversity / num_cols
    ...
    diversity = get_diversity(matrix1)
    ...
When I call get_diversity() once per 1000 iterations (on the scale of 300k) it works just fine. But when I try to call it at every iteration the interpreter returns:
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 2.99MiB. See logs for memory state.
I was thinking that was because TF creates a new set of variables each time get_diversity() is called. I tried this:
def get_diversity(matrix, scope):
    scope.reuse_variables()
    ...
with tf.variable_scope("diversity") as scope:
    diversity = get_diversity(matrix1, scope)
But it did not fix the problem.
How can I fix this allocation issue and use get_diversity() with large number of iterations?
Assuming you call get_diversity() multiple times in your training loop, Aaron's comment is a good one: build the op once, outside the loop, and feed it through a placeholder, something like the following:
diversity_input = tf.placeholder(tf.float32, [None, None], name="diversity_input")
diversity = get_diversity(diversity_input)
# ...
with tf.Session() as sess:
    for _ in range(NUM_ITERATIONS):
        # ...
        diversity_val = sess.run(diversity, feed_dict={diversity_input: ...})
This will avoid creating new operations each time round the loop, which should prevent the memory leak. This answer has more details.
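When restructuring the graph it can help to have a reference value to compare against. The sketch below reproduces in NumPy the quantity that get_diversity() appears to compute, namely the mean over columns of the smallest nonzero squared distance to any other column; get_diversity_np is a hypothetical name and the example matrix is made up.

```python
import numpy as np


def get_diversity_np(matrix):
    """Mean over columns of the smallest nonzero squared distance to another column."""
    # diff[r, i, j] = matrix[r, i] - matrix[r, j]
    diff = matrix[:, :, None] - matrix[:, None, :]
    sq_dist = (diff ** 2).sum(axis=0)  # shape (num_cols, num_cols)
    # Ignore zero distances (the column itself and exact duplicates),
    # mirroring the tf.greater mask in the question.
    sq_dist = np.where(sq_dist > 0, sq_dist, np.inf)
    return sq_dist.min(axis=1).mean()


# Columns are (0, 0), (3, 4) and (3, 5).
example = np.array([[0.0, 3.0, 3.0],
                    [0.0, 4.0, 5.0]])
diversity_value = get_diversity_np(example)
```

For the example, the squared distances are 25, 34 and 1, so the per-column minima are 25, 1 and 1 and the result is 9.0. The broadcasted difference array has num_rows × num_cols × num_cols entries, so this form is only meant for checking small cases.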
