I have a list named 'datastate' of shape (?,10) which is getting filled with the results of 10 batch samples in tensorflow (all tensors). In other words, with a batch size of 256, this will be populated with 10 different tensors of size 256.
In pseudocode:
datastate = {}
for sample in range(num_samples):
    datastate[sample] = batch_results
What I would like to do next is define a variable like 'datastate_change', which would determine whether the i-th record of batch_results changed versus the (i-1)-th record. If Pandas-style syntax worked here, it might look something like the following, but I'm not clear on how to do this inside of tf during the sess.run.
for sample in range(num_samples):
    datastate[sample] = batch_results
    datastate_change[sample] = batch_results - batch_results.shift(1)
To be a bit more concrete: if a single instance of batch_results is [1,1,1,0,1], I would like to have datastate[1] = [1,1,1,0,1] and datastate_change[1] = [1,0,0,-1,1].
I found a satisfactory answer on my own: the key was that numpy is a better analogue for this than pandas....
First I create a copy of my datastate which is padded along the top with zeros.
Then I slice off the bottom row of this copy.
Lastly I subtract the two.
top_paddings = tf.constant([[1, 0]])  # pad one zero at the front, nothing at the end
top_padded_datastate_[sample] = tf.pad(datastate[sample], top_paddings, "CONSTANT")
top_padded_datastate[sample] = top_padded_datastate_[sample][:-1]  # drop the last element
datastate_changes[sample] = tf.subtract(datastate[sample], top_padded_datastate[sample])
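For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same pad/slice/subtract idea on a single sample (written for TF 2.x eager execution; the values are the toy example from the question):

import tensorflow as tf

batch_results = tf.constant([1, 1, 1, 0, 1])

# Pad one zero at the front, then drop the last element to get a "shifted" copy.
top_paddings = tf.constant([[1, 0]])
shifted = tf.pad(batch_results, top_paddings, "CONSTANT")[:-1]  # [0, 1, 1, 1, 0]

datastate_change = tf.subtract(batch_results, shifted)
print(datastate_change.numpy())  # [ 1  0  0 -1  1]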
I am following the example: https://www.tensorflow.org/tutorials/structured_data/time_series.
In my case I have a sensor which collects data every hour. It has not been very reliable during the last months and I have lost some data. To work around this, the missing values have been replaced with the previous valid value. As a result I have many duplicated values, and I think this is the reason why my NN is unable to predict anything. I do not want to skip the wrong values before creating the dataset, because that would create time series with non-consecutive values.
I would like to create the time series dataset as in the example and then either remove the entries/outputs (tensors) which have a certain amount of duplication in the data, or update those tensor values to 0.
from collections import Counter

def hasMultipleDuplicatedElements(mylist, multiplicity):
    # True if the most common value in the first column appears more than `multiplicity` times
    return Counter(mylist[:, 0]).most_common(1)[0][1] > multiplicity

WindowGenerator.hasMultipleDuplicatedElements = hasMultipleDuplicatedElements
def dsCleanedRowsWithHighMultiplicity(self, ds, multiplicity):
    for batch in ds:
        dataBatch = batch.numpy()
        for j in range(len(dataBatch)):
            selectedDataBatch = dataBatch[j]
            indices = tf.constant([[k] for k in range(len(selectedDataBatch))])
            inputData = selectedDataBatch[:self.input_width]
            labelData = selectedDataBatch[self.input_width:]
            if (hasMultipleDuplicatedElements(inputData, multiplicity) or
                    hasMultipleDuplicatedElements(labelData, multiplicity)):
                # This builds a new zeroed tensor, but the result is discarded;
                # it does not modify the original tensor inside the batch.
                tf.tensor_scatter_nd_update(batch[j], indices,
                                            tf.zeros(shape=selectedDataBatch.shape, dtype=tf.float32),
                                            name=None)

WindowGenerator.dsCleanedRowsWithHighMultiplicity = dsCleanedRowsWithHighMultiplicity
def make_dataset(self, data):
    data = np.array(data, dtype=np.float32)
    ds = tf.keras.preprocessing.timeseries_dataset_from_array(
        data=data,
        targets=None,
        sequence_length=self.total_window_size,
        sequence_stride=1,
        shuffle=True,
        batch_size=32,)
    self.dsCleanedRowsWithHighMultiplicity(ds, 10)
    ds = ds.map(self.split_window)
    return ds
The dataset contains batches, each one with 32 entries/outputs (tensors). I scan every entry/output looking for data duplicated at least 10 times. I manage to spot these entries and create a new zeroed tensor with tf.tensor_scatter_nd_update, but what I would like is to update the original tensor inside the batch.
If there is a way to remove the wrong tensor from the batch, it would also be an acceptable solution.
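For the removal route, this is roughly what I picture (an untested sketch: keep_window is a made-up helper, and the Counter-based check is the one defined above):

def keep_window(window, multiplicity=10):
    # tf.py_function lets the eager Counter-based check run on the tensor's numpy data
    def _check(w):
        return not hasMultipleDuplicatedElements(w.numpy(), multiplicity)
    return tf.py_function(_check, [window], tf.bool)

# Unbatch so each window can be filtered individually, then rebatch.
ds = ds.unbatch().filter(keep_window).batch(32)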
Thanks in advance!
I have the following code:
A = Tensor of [186,3]
If I create a new empty tensor as follows:
tempTens = torch.tensor(np.zeros((186, 3)), requires_grad=True).cuda()
And I apply some operations on a block of A and output it into tempTens, which I use totally for further computation, say like this:
tempTens[20,:] = SomeMatrix * A[20,:]
Will the gradients actually be propagated correctly, say if I have a cost function that optimizes the output of tempTens against some ground truth?
In this case, tempTens[20,:] = SomeMatrix * A[20,:] is an in-place operation with respect to tempTens, which is generally not guaranteed to work with autograd. However, if you create a new variable by applying an operation like concatenation
output = torch.cat([SomeMatrix * A[20, :], torch.zeros(166, 3, device='cuda')], dim=0)
you will get the same result in terms of math (a matrix with the first 20 rows from SomeMatrix * A[20, :] followed by 166 rows of 0s), but this will work properly with autograd. This is, generally speaking, the right way to approach this kind of problem.
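A self-contained sketch of that approach with a gradient check (the shape of SomeMatrix is an assumption here, chosen as (20, 3) so the product broadcasts to 20 rows; run on CPU for brevity):

import torch

A = torch.randn(186, 3, requires_grad=True)
SomeMatrix = torch.randn(20, 3)  # assumed shape: broadcasts against A[20, :]

# Build the output with out-of-place ops only -- no writes into a preallocated tensor
output = torch.cat([SomeMatrix * A[20, :], torch.zeros(166, 3)], dim=0)  # (186, 3)

loss = ((output - torch.ones_like(output)) ** 2).mean()  # toy cost vs. ground truth
loss.backward()
print(A.grad[20])  # non-zero: the gradient reached A through the cat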
I want to apply a filter to a tensor and remove values that do not meet my criteria. For example, lets say I have a tensor that looks like this:
softmax_tensor = [[0.05, 0.05, 0.2, 0.7], [0.25, 0.25, 0.3, 0.2]]
Right now, the classifier picks the argmax of the tensors to predict:
predictions = [[3],[2]]
But this isn't exactly what I want, because I lose information about the confidence of that prediction. I would rather make no prediction than an incorrect prediction. So what I would like to do is return filtered tensors like so:
new_softmax_tensor = [[ 0.05 , 0.05, 0.2, 0.7]]
new_predictions = [[3]]
If this were straight-up python, I'd have no trouble:
new_softmax_tensor = []
new_predictions = []
for idx, listItem in enumerate(softmax_tensor):
    # get the two highest values and see if they are far enough apart
    M = max(listItem)
    M2 = max(n for n in listItem if n != M)
    if M - M2 > 0.3:  # just making up a criterion here
        new_softmax_tensor.append(listItem)
        new_predictions.append(predictions[idx])
but given that tensorflow works on tensors, I'm not sure how to do this - and if I did, would it break the computation graph?
A previous SO post suggested using tf.gather_nd, but in that scenario they already had a tensor that they wanted to filter on. I've also looked at tf.cond but still don't understand it. I would imagine many other people would benefit from this exact same solution.
Thanks all.
Two things I would do to solve your problem:
First, I would fetch the value of the softmax tensor. Keep a reference to it when you create it (or find it back in the appropriate tensor collection), evaluate it with sess.run([softmax_tensor, prediction], feed_dict=...), and then play with it in plain python as much as you like.
Second, if you want to stay within the graph, I would use the built-in tf.where(), which works much like the np.where function from the numpy package (doc there).
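For instance, a minimal sketch of the tf.where route (TF 2.x eager assumed for brevity; the boolean mask here is hard-coded purely for illustration and would in practice come from your confidence criterion):

import tensorflow as tf

scores = tf.constant([[0.05, 0.05, 0.2, 0.7],
                      [0.25, 0.25, 0.3, 0.2]])
keep = tf.constant([True, False])   # stand-in for the confidence criterion
idx = tf.where(keep)[:, 0]          # indices of the rows to keep
filtered = tf.gather(scores, idx)   # -> [[0.05, 0.05, 0.2, 0.7]]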
Ok. I've got it sorted out now. Here is a working example.
import tensorflow as tf

# Set dummy example tensor
original_softmax_tensor = tf.Variable([
    [0.4, 0.2, 0.2, 0.9, 0.1],
    [0.5, 0.2, 0.2, 0.9, 0.1],
    [0.6, 0.2, 0.2, 0.1, 0.99],
    [0.1, 0.8, 0.2, 0.09, 0.99]
], name='original_softmax_tensor')

# Set dummy prediction tensor
original_predictions = tf.Variable([3, 3, 4, 4], name='original_predictions')

# Now create a place to store my new variables
new_softmax_tensor = original_softmax_tensor
new_predictions = original_predictions

# Set my cutoff variable
min_diff = tf.constant(0.3)

# Initialize
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)  # execute init_op
    # There's probably a better way to do this, but I had to do this hack to get
    # the difference between the top 2 scores
    tmp_diff1, _ = tf.nn.top_k(original_softmax_tensor, k=2, sorted=True)
    tmp_diff2, _ = tf.nn.top_k(original_softmax_tensor, k=1, sorted=True)
    # Subtracting the max score from both makes the largest one '0'
    actual_diff = tf.subtract(tmp_diff2, tmp_diff1)
    # The max value for each row will be the actual value of interest
    actual_diff = tf.reduce_max(actual_diff, reduction_indices=[1])
    # Create a boolean tensor that says which rows to keep
    cond_result = actual_diff > min_diff
    # Keep only the values I want
    new_predictions = tf.boolean_mask(original_predictions, cond_result)
    new_softmax_tensor = tf.boolean_mask(new_softmax_tensor, cond_result)
    new_predictions.eval()
    new_softmax_tensor.eval()
    # return these if this is in a function
I am trying to create a training data file which is structured as follows:
[Rows = Samples, Columns = features]
So if I have 100 samples and 2 features the shape of my np.array would be (100,2).
The list below contains path strings to the .nrrd 3D sample patch-data files.
['/Users/FK/Documents/image/0128/subject1F_200.nrrd',
'/Users/FK/Documents/image/0128/subject2F_201.nrrd']
This is the code I have so far:
training_file = []
# For each sample in my image folder
for patches in dir_0128_list:
    # Reads the 64x64x64 numpy array
    data, options = nrrd.read(patches)
    # Calculates the median and sum of the 3D array file. 2 features per sample
    f_median = np.median(data)
    training_file.append(f_median)
    f_sum = np.sum(data)
    training_file.append(f_sum)
    # Calculates a numpy array with shape (169,) containing 169 features per sample
    f_mof = my_own_function(data)
    training_file.append(f_mof)

training_file = np.array(training_file, dtype=np.float32)
# training_file = np.column_stack((core_training_list))
If I don't use the np.column_stack function I get a (173,1) matrix, and (1,173) if I do use it. In this scenario it should have a (2,171) shape.
I want to calculate the sum and median and append them to a list or numpy array column-wise. At the end of each loop iteration I want to jump one row down and append the features column-wise for the next sample, and so on...
Very simple solution
Instead of
f_median = np.median(data)
training_file.append(f_median)
f_sum = np.sum(data)
training_file.append(f_sum)
you could do
training_file.append((np.median(data), np.sum(data)))
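Each append then contributes one (median, sum) tuple, so np.array(training_file) comes out with one row per sample, i.e. shape (num_samples, 2).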
Slightly longer solution
With that, though, you would still have one piece of consecutive code that is not easy to reuse and test individually. I would structure the script into different parts:
Iterate over the files to read the patches
Calculate the median and sum
Aggregate to the requested format
Read the patches
def read_patches(files):
    for file in files:
        yield nrrd.read(file)
This makes a generator that yields the patch info.
Calculate
def parse_patch(patch):
    data, options = patch
    return np.median(data), np.sum(data)
Putting it together
from pathlib import Path
file_dir = Path(<my_filedir>)
files = file_dir.glob('*.nrrd')
patches = read_patches(files)
training_file = np.array([parse_patch(patch) for patch in patches], dtype=np.float32)
This might look convoluted, but it allows for easy testing of each of the sub-blocks.
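If the 169 extra features from my_own_function are needed as well (to reach the (num_samples, 171) shape mentioned in the question), parse_patch could be extended along these lines (a sketch; my_own_function is the asker's own helper):

def parse_patch(patch):
    data, options = patch
    # 2 scalar features followed by the 169 features from my_own_function -> (171,)
    return np.concatenate([[np.median(data), np.sum(data)], my_own_function(data)])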
I am trying to do a mean operation given the actual lengths of sequences (masking zero vectors).
My input sequence_outputs has shape (batch_size, max_len, dimensions).
I have a tensor that stores the actual lengths of each sequence in the batch. I used the function from https://danijar.com/variable-sequence-lengths-in-tensorflow/
def length(sequence):
    used = tf.sign(tf.reduce_max(tf.abs(sequence), reduction_indices=2))
    length = tf.reduce_sum(used, reduction_indices=1)
    length = tf.cast(length, tf.int64)
    return length
I do this:
lengths = length(sequence_outputs)
lengths = tf.cast(lengths, tf.float32)
lengths = tf.expand_dims(lengths, 1)
sequence_outputs = tf.reduce_sum(sequence_outputs, 1) / lengths
The graph compiles but I am getting NaN loss values. Furthermore my lengths become negative values when debugging with eval().
This seems to be a simple problem but I've been stuck with this for sometime and would appreciate some help!
Thanks!
I see no issue. Your code is slightly over-complicated. The following code
import numpy as np
import tensorflow as tf

# creating data
B = 15
MAX_LEN = 4
data = np.zeros([B, MAX_LEN], dtype=np.float32)

for b in range(B):
    current_len = np.random.randint(2, MAX_LEN)
    current_vector = np.concatenate([np.random.randn(current_len), np.zeros(MAX_LEN - current_len)], axis=-1)
    print("{}\t\t{}".format(current_vector, current_vector.shape))
    data[b, ...] = current_vector

data_op = tf.convert_to_tensor(data)

def tf_length(x):
    assert len(x.get_shape().as_list()) == 2
    # float dtype so the division below type-checks against the float32 sums
    length = tf.count_nonzero(x, axis=1, keepdims=True, dtype=tf.float32)
    return length

# keepdims=True keeps both operands at shape (B, 1) so no unwanted broadcasting occurs
x = tf.reduce_sum(data_op, axis=1, keepdims=True) / tf_length(data_op)

# test gradients
grads = tf.gradients(tf.reduce_mean(x), [data_op])

with tf.Session() as sess:
    print(sess.run(grads))
runs perfectly fine here without any NaNs. Are you sure you are really using this code? If I had to guess, I would bet you forgot the tf.abs somewhere in your sequence-length computation.
Be aware: your length function, as well as tf_length in this post, assumes non-zero values in the sequence! Calculating the sequence length should be the task of the data producer and fed into the computation graph; everything else I consider a hacky solution.
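A minimal sketch of that suggestion (all names illustrative): compute the lengths on the producer side with numpy and feed them into the graph alongside the data:

import numpy as np
import tensorflow as tf

MAX_LEN, DIM = 4, 3
data_ph = tf.placeholder(tf.float32, [None, MAX_LEN, DIM])
len_ph = tf.placeholder(tf.float32, [None, 1])

# sum over time, divide by the true length fed from outside the graph
masked_mean = tf.reduce_sum(data_ph, axis=1) / len_ph

batch = np.zeros([2, MAX_LEN, DIM], dtype=np.float32)
batch[0, :2] = 1.0  # sequence 0 has true length 2
batch[1, :3] = 2.0  # sequence 1 has true length 3

# producer side: count timesteps that contain any non-zero feature
lengths = (np.abs(batch).sum(axis=2) > 0).sum(axis=1, keepdims=True).astype(np.float32)

with tf.Session() as sess:
    print(sess.run(masked_mean, feed_dict={data_ph: batch, len_ph: lengths}))
    # -> [[1. 1. 1.], [2. 2. 2.]]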