TensorFlow: modify an entry (tensor) inside a batch - python

I am following the example: https://www.tensorflow.org/tutorials/structured_data/time_series.
In my case I have a sensor which collects data every hour. It has not been very reliable during the last months and I have lost some data. To work around this, the missing values have been replaced with the previous valid value. As a result I have many duplicated values, and I think this is the reason why my NN is unable to predict anything. I do not want to skip the wrong values before creating the dataset, because that would create time series with non-consecutive values.
I would like to create the time-series dataset as in the example and then either remove the entries/outputs (tensors) that contain a certain level of duplication in the data, or update the tensor values to 0.
from collections import Counter

def hasMultipleDuplicatedElements(mylist, multiplicity):
    # True if any value in the first column appears more than `multiplicity` times
    return Counter(mylist[:, 0]).most_common(1)[0][1] > multiplicity
WindowGenerator.hasMultipleDuplicatedElements = hasMultipleDuplicatedElements

def dsCleanedRowsWithHighMultiplicity(self, ds, multiplicity):
    for batch in ds:
        dataBatch = batch.numpy()
        for j in range(len(dataBatch)):
            selectedDataBatch = dataBatch[j]
            indices = tf.constant([[i] for i in range(len(selectedDataBatch))])
            inputData = selectedDataBatch[:self.input_width]
            labelData = selectedDataBatch[self.input_width:]
            if (hasMultipleDuplicatedElements(inputData, multiplicity) or
                    hasMultipleDuplicatedElements(labelData, multiplicity)):
                # print(batch[j])
                # Returns a new tensor; it does NOT modify batch[j] in place
                tf.tensor_scatter_nd_update(batch[j], indices,
                                            tf.zeros(shape=selectedDataBatch.shape, dtype=tf.float32),
                                            name=None)
                # print(batch[j])
WindowGenerator.dsCleanedRowsWithHighMultiplicity = dsCleanedRowsWithHighMultiplicity
def make_dataset(self, data):
    data = np.array(data, dtype=np.float32)
    ds = tf.keras.preprocessing.timeseries_dataset_from_array(
        data=data,
        targets=None,
        sequence_length=self.total_window_size,
        sequence_stride=1,
        shuffle=True,
        batch_size=32)
    self.dsCleanedRowsWithHighMultiplicity(ds, 10)
    ds = ds.map(self.split_window)
    return ds
The dataset contains batches, each with 32 entries/outputs (tensors). I scan every entry/output looking for a value that is duplicated at least 10 times. I manage to spot these entries and create a new tensor with tf.tensor_scatter_nd_update, but what I would like is to update the original tensor inside the batch.
If there is a way to remove the wrong tensor from the batch instead, that would also be an acceptable solution.
Thanks in advance!
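One possible direction, as a minimal sketch under assumptions (dummy data, a made-up has_low_multiplicity predicate, and a plain dataset rather than the WindowGenerator class above): instead of trying to modify tensors in place, flatten the dataset with unbatch(), drop the bad windows with filter(), and re-batch:
import numpy as np
import tensorflow as tf

def has_low_multiplicity(window, multiplicity=10):
    # window has shape (sequence_length, num_features); inspect the first feature column
    _, _, counts = tf.unique_with_counts(window[:, 0])
    return tf.reduce_max(counts) <= multiplicity

data = np.random.rand(1000, 3).astype(np.float32)  # dummy data for illustration
ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=data, targets=None, sequence_length=24,
    sequence_stride=1, shuffle=True, batch_size=32)

# Flatten to individual windows, keep only the acceptable ones, then re-batch
ds = ds.unbatch().filter(has_low_multiplicity).batch(32)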

Related

How can I optimize this data smoothing python loop?

I am trying to write a data smoothing function for a set of data. I am using a Savitzky-Golay filter from SciPy to do that: I collect an array of data and call the filter on it.
But since I am looping through a specific element across different frames, I have neither spatial locality nor temporal locality.
dataobj.body.data[j][0][i]
holds (x, y) and I am only collecting the ys.
Here's the loop:
import scipy.signal

def smooth_data(dataobj):
    number_of_frames = len(dataobj.body.data)
    for i in range(0, 137):
        arr = []
        for j in range(0, number_of_frames):
            arr.append(dataobj.body.data[j][0][i][1])
        newdata = scipy.signal.savgol_filter(arr, 25, 3)
        for k in range(0, number_of_frames):
            dataobj.body.data[k][0][i][1] = newdata[k]
    return dataobj
I'd like to make it run faster; right now, when the number of frames is over 1000, it takes a considerable amount of time, something like 30 seconds.
Thanks a lot to all of the helpers!
If the input data is a multi-dimensional numpy array, then you can pass in a slice of the numpy array to the scipy method, and then insert the resulting array back into the original data object:
def smooth_data(dataobj):
    number_of_frames = len(dataobj[:, 0, 0, 1])
    number_of_records = len(dataobj[0, 0, :, 1])
    for i in range(0, number_of_records):
        newdata = scipy.signal.savgol_filter(dataobj[:, 0, i, 1], 3, 1)
        dataobj[:, 0, i, 1] = newdata
    return dataobj
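As a hedged aside (not part of the answer above): if dataobj really is a 4-D numpy array, savgol_filter also accepts an axis argument, so the whole loop can be replaced by a single vectorized call, assuming axis 0 is the frame axis and the last axis holds (x, y):
import scipy.signal

def smooth_data_vectorized(dataobj, window=25, polyorder=3):
    # Smooth every y series over the frame axis in one call
    dataobj[:, 0, :, 1] = scipy.signal.savgol_filter(
        dataobj[:, 0, :, 1], window, polyorder, axis=0)
    return dataobj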
What about training a Krige model (or just a polynomial interpolation) with 50% of your x and y data, and then taking the predicted ŷ of the model over your whole set x?
Krige model example code (using the smt module):
from smt.surrogate_models import KRG

t = KRG(theta0=[1e-2] * ndim, print_prediction=False)
t.set_training_values(xt, yt)  # training inputs, outputs
t.train()
# Prediction of the other points
y = t.predict_values(xtest)

Logical operation on the contents of a tensor

I have a list named 'datastate' of shape (?,10) which is getting filled with the results of 10 batch samples in tensorflow (all tensors). In other words, with a batch size of 256, this will be populated with 10 different tensors of size 256.
In pseudocode:
datastate = {}
for sample in range(num_samples):
    datastate[sample] = batch_results
What I would like to do next is define a variable like 'datastate_change', which would determine whether the i-th record of batch_results changed versus the (i-1)-th record. This might look something like the following if pandas-style syntax worked... but I'm not clear on how to do this inside TensorFlow during the sess.run.
for sample in range(num_samples):
    datastate[sample] = batch_results
    datastate_change[sample] = batch_results - batch_results.shift(1)
To be a bit more concrete, if a single instance of batch_results is [1,1,1,0,1], I would like to have datastate[1] = [1,1,1,0,1] and datastate_change[1] = [1,0,0,-1,1].
Found a satisfactory answer on my own - the key was that NumPy is a better analogue here than pandas....
First I create a copy of my datastate which is padded along the top with zeros
Then I slice off the bottom row of this copy
Lastly I subtract the two.
top_paddings = tf.constant([[1, 0]]) #New tensor with the 'top' being zeros
top_padded_datastate_[sample] = tf.pad(datastate[sample], top_paddings, "CONSTANT")
top_padded_datastate[sample] = top_padded_datastate_[sample][:-1]
datastate_changes[sample] = tf.subtract(datastate[sample], top_padded_datastate[sample])
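For what it's worth, here is a minimal self-contained version of the same shift-and-subtract trick in TF 2 eager style (plain constants instead of the datastate dict, so it may not map one-to-one onto the original graph code):
import tensorflow as tf

batch_results = tf.constant([1, 1, 1, 0, 1])
top_padded = tf.pad(batch_results, [[1, 0]], "CONSTANT")[:-1]  # zero at the top, drop the last element
datastate_change = batch_results - top_padded
print(datastate_change.numpy())  # [ 1  0  0 -1  1]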

Get data set as numpy array from TFRecordDataset

I'm using the new tf.data API to create an iterator for the CIFAR10 dataset. I'm reading the data from two .tfrecord files: one which holds the training data (train.tfrecords) and another which holds the test data (test.tfrecords). This all works fine. At some point, however, I need both data sets (training data and test data) as numpy arrays.
Is it possible to retrieve a data set as numpy array from a tf.data.TFRecordDataset object?
You can use the tf.data.Dataset.batch() transformation and tf.contrib.data.get_single_element() to do this.
As a refresher, dataset.batch(n) will take up to n consecutive elements of dataset and convert them into one element by concatenating each component. This requires all elements to have a fixed shape per component. If n is larger than the number of elements in dataset (or if n doesn't divide the number of elements exactly), then the last batch can be smaller. Therefore, you can choose a large value for n and do the following:
import numpy as np
import tensorflow as tf
# Insert your own code for building `dataset`. For example:
dataset = tf.data.TFRecordDataset(...) # A dataset of tf.string records.
dataset = dataset.map(...) # Extract components from each tf.string record.
# Choose a value of `max_elems` that is at least as large as the dataset.
max_elems = np.iinfo(np.int64).max
dataset = dataset.batch(max_elems)
# Extracts the single element of a dataset as one or more `tf.Tensor` objects.
# No iterator needed in this case!
whole_dataset_tensors = tf.contrib.data.get_single_element(dataset)
# Create a session and evaluate `whole_dataset_tensors` to get arrays.
with tf.Session() as sess:
    whole_dataset_arrays = sess.run(whole_dataset_tensors)
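As a hedged side note (not from the original answer): on TF 2.x with eager execution, the same result can be obtained without a Session via Dataset.as_numpy_iterator(), shown here with a stand-in dataset:
import numpy as np
import tensorflow as tf

# Stand-in for the parsed TFRecordDataset; replace with your own pipeline.
dataset = tf.data.Dataset.from_tensor_slices(np.arange(10, dtype=np.int64))
whole_dataset_array = np.array(list(dataset.as_numpy_iterator()))
print(whole_dataset_array)  # [0 1 2 3 4 5 6 7 8 9]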

How can I append features column wise and start a row for each sample

I am trying to create a training data file which is structured as follows:
[Rows = Samples, Columns = features]
So if I have 100 samples and 2 features the shape of my np.array would be (100,2).
The list below contains path strings to the .nrrd 3D sample patch-data files.
['/Users/FK/Documents/image/0128/subject1F_200.nrrd',
'/Users/FK/Documents/image/0128/subject2F_201.nrrd']
This is the code I have so far:
training_file = []
# For each sample in my image folder
for patches in dir_0128_list:
    # Reads the 64x64x64 numpy array
    data, options = nrrd.read(patches)
    # Calculates the median and sum of the 3D array file. 2 features per sample
    f_median = np.median(data)
    training_file.append(f_median)
    f_sum = np.sum(data)
    training_file.append(f_sum)
    # Calculates a numpy array with shape (169,) containing 169 features per sample.
    f_mof = my_own_function(data)
    training_file.append(f_mof)
training_file = np.array((training_file), dtype=np.float32)
# training_file = np.column_stack((core_training_list))
If I don't use the np.column_stack function I get a (173, 1) matrix, and (1, 173) if I do use it. In this scenario it should have a (2, 171) shape.
I want to calculate the sum and median and append them to a list or numpy array column-wise. At the end of each loop iteration I want to move down one row and append the features column-wise for the next sample, and so on...
Very simple solution
Instead of
f_median = np.median(data)
training_file.append(f_median)
f_sum = np.sum(data)
training_file.append(f_sum)
you could do
training_file.append((np.median(data), np.sum(data)))
slightly longer solution
You would still have one piece of consecutive code that is not easy to reuse and test individually.
I would structure the script into different parts:
Iterate over the files to read the patches
Calculate the mean and sum
Aggregate to the requested format
Read the patches
def read_patches(files):
    for file in files:
        yield nrrd.read(file)
This makes a generator yielding the patch info.
Calculate
def parse_patch(patch):
    data, options = patch
    return np.median(data), np.sum(data)
Putting it together
from pathlib import Path
file_dir = Path(<my_filedir>)
files = file_dir.glob('*.nrrd')
patches = read_patches(files)
training_file = np.array([parse_patch(patch) for patch in patches], dtype=np.float32)
This might look convoluted, but it allows for easy testing of each of the sub-blocks.

Using Machine Learning in Python to load custom datasets?

Here's the problem:
It takes two variable inputs and predicts a result.
For example: price and volume as inputs, and a decision to buy/sell as the result.
I tried implementing this using K-nearest neighbors with no success. How would you go about this?
from sklearn.neighbors import KNeighborsClassifier

X = cleanedData['ES1 End Price']  # only accounts for 1 variable; don't know how to input another
y = cleanedData["Result"]
print(X.shape, y.shape)

kmm = KNeighborsClassifier(n_neighbors=5)
kmm.fit(X, y)  # ValueError for size inconsistency, but both are the same size
Thanks!
X needs to be a matrix/2d array where each column stands for a feature, which doesn't seem to be the case in your code; try reshaping X to 2d with X[:,None]:
kmm.fit(X[:,None], y)
Or, without resorting to reshape, it is better to always use a list when extracting features from a data frame:
X = cleanedData[['ES1 End Price']]
Or with more than one column:
X = cleanedData[['ES1 End Price', 'volume']]
Then X would be a 2d array, and can be used directly in fit:
kmm.fit(X, y)
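A minimal end-to-end sketch (the DataFrame contents and the 'volume' column below are made up for illustration; only 'ES1 End Price' and 'Result' come from the question):
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

cleanedData = pd.DataFrame({
    'ES1 End Price': [100.5, 101.2, 99.8, 100.1, 100.9, 99.5],
    'volume':        [1500, 1800, 1200, 1600, 1700, 1100],
    'Result':        [1, 1, 0, 0, 1, 0],
})

X = cleanedData[['ES1 End Price', 'volume']]  # 2-D: one column per feature
y = cleanedData['Result']

kmm = KNeighborsClassifier(n_neighbors=3)
kmm.fit(X, y)
print(kmm.predict(pd.DataFrame({'ES1 End Price': [100.0], 'volume': [1400]})))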
