Get data set as numpy array from TFRecordDataset - python

I'm using the new tf.data API to create an iterator for the CIFAR10 dataset. I'm reading the data from two .tfrecord files: one holding the training data (train.tfrecords) and another holding the test data (test.tfrecords). This all works fine. At some point, however, I need both data sets (training data and test data) as numpy arrays.
Is it possible to retrieve a data set as numpy array from a tf.data.TFRecordDataset object?

You can use the tf.data.Dataset.batch() transformation and tf.contrib.data.get_single_element() to do this.
As a refresher, dataset.batch(n) will take up to n consecutive elements of dataset and convert them into one element by concatenating each component. This requires all elements to have a fixed shape per component. If n is larger than the number of elements in dataset (or if n doesn't divide the number of elements exactly), then the last batch can be smaller. Therefore, you can choose a large value for n and do the following:
import numpy as np
import tensorflow as tf

# Insert your own code for building `dataset`. For example:
dataset = tf.data.TFRecordDataset(...)  # A dataset of tf.string records.
dataset = dataset.map(...)  # Extract components from each tf.string record.

# Choose a value of `max_elems` that is at least as large as the dataset.
max_elems = np.iinfo(np.int64).max
dataset = dataset.batch(max_elems)

# Extracts the single element of a dataset as one or more `tf.Tensor` objects.
# No iterator needed in this case!
whole_dataset_tensors = tf.contrib.data.get_single_element(dataset)

# Create a session and evaluate `whole_dataset_tensors` to get arrays.
with tf.Session() as sess:
    whole_dataset_arrays = sess.run(whole_dataset_tensors)
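Note: if you are on TF 2.x with eager execution, a simpler route (assuming the whole dataset fits in memory and each element is a single fixed-shape tensor) is tf.data.Dataset.as_numpy_iterator(); a minimal sketch:

import numpy as np
import tensorflow as tf

# `dataset` is the mapped TFRecordDataset from above, *before* the
# batch(max_elems) step. np.stack combines the per-element numpy arrays;
# for tuple elements, stack each component separately.
whole_dataset_array = np.stack(list(dataset.as_numpy_iterator()))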

Related

Split my dataset in train/validation using MapDataset in python

Hi everyone, I'm facing an issue after processing my images and labels. To create a single dataset I use the zip function. After processing, both images and labels contain 18k elements, which is correct, but when I call zip(images, labels), the items become 563.
Here is some code to help you understand:
# Map the load_and_preprocess_image function over the dataset of image paths
images = image_paths.map(load_and_preprocess_image)
# Map the extract_label function over the dataset of image paths
labels = image_paths.map(extract_label)
# Zip the labels and images together to create a dataset of (image, label) pairs
#HERE SOMETHING STRANGE HAPPENS
data = tf.data.Dataset.zip((images,labels))
# Shuffle and batch the data
data = data.shuffle(buffer_size=1000).batch(32)
# Split the data into train and test sets
data = data.shuffle(buffer_size=len(data))
# Convert the dataset into a collection of data
num_train = int(0.8 * len(data))
train_data = image_paths.take(num_train)
val_data = image_paths.skip(num_train)
I cannot see where the error is. Can you help me, please? Thanks.
I'd like to have a dataset of 18k (image, label) pairs.
tf's zip
tf.data.Dataset.zip is not like Python's zip. Its inputs must be tf.data.Dataset objects, so check that the images/labels returned from your map calls are the correct tf.data.Dataset objects.
check tf.ds
Make sure your image and label datasets are valid tf.data.Dataset objects with the expected element spec and cardinality:
print("ele: ", images_dataset.element_spec)
print("num: ", images_dataset.cardinality().numpy())
print("ele: ", labels_dataset.element_spec)
print("num: ", labels_dataset.cardinality().numpy())
workaround
In your case, combine the image and label processing in one map function and return both, bypassing tf.data.Dataset.zip altogether:
# load_and_preprocess_image_and_label
def load_and_preprocess_image_and_label(image_path):
    """ load image and label then some operations """
    return image, label

# Map the combined function over the dataset of image paths
train_list = tf.data.Dataset.list_files(str(PATH / 'train/*.jpg'))
data = train_list.map(load_and_preprocess_image_and_label,
                      num_parallel_calls=tf.data.AUTOTUNE)
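For illustration only, a hedged sketch of what the combined function could look like, assuming JPEG images and a label taken from the parent directory name (both are assumptions; substitute whatever your load_and_preprocess_image and extract_label actually do):

import os
import tensorflow as tf

def load_and_preprocess_image_and_label(image_path):
    # Read and decode the image file (assumes JPEG input).
    image = tf.io.read_file(image_path)
    image = tf.io.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    # Hypothetical label: the parent directory name of the file.
    label = tf.strings.split(image_path, os.sep)[-2]
    return image, label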

tsfresh time series feature extraction

I am using tsfresh to extract features from my data.
Initial data:
My initial data was time series data from a machine sensor.
I used the third column to add another column named hub, which represents the cycles of the machine. I also converted the Timestamp to an integer "Timesteps" value for each cycle, resulting in this DataFrame:
When extracting the features, the algorithm returns 787 features for each of my data rows.
from tsfresh import extract_features
extracted_features = extract_features(df_sample, column_id="hub", column_sort="step")
features = extracted_features.columns.tolist()
But when I use the select_features method with the label vector y, it gives back an empty DataFrame.
I don't understand why.
from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute
impute(extracted_features)
features_filtered = select_features(extracted_features, y)
I am pretty new to feature extraction.
If anybody has any pointers as to how I could extract good features from this cyclic time series data, I would be very thankful.
The plot of the sensor value over the crankshaft position is shown below.
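As a general note on why select_features can come back empty: it runs a statistical significance test per feature against y and keeps only the features that pass, so with few labeled cycles it is common for no feature to survive. A hedged sketch for inspecting the per-feature p-values instead, assuming y is a pandas Series indexed by the same hub ids as extracted_features:

from tsfresh.feature_selection.relevance import calculate_relevance_table

# Each row describes one extracted feature: its test p-value and whether
# it was judged relevant for predicting y.
relevance_table = calculate_relevance_table(extracted_features, y)
print(relevance_table.sort_values("p_value").head(20))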

Faiss : How to create an Index of 10M vectors of size 1024

I want to create an index of nearly 10M vectors of size 1024. Here is the code that I used.
import numpy as np
import faiss
import random

f = 1024
vectors = []
no_of_vectors = 10000000
for k in range(no_of_vectors):
    v = [random.gauss(0, 1) for z in range(f)]
    vectors.append(v)
np_vectors = np.array(vectors).astype('float32')

index = faiss.IndexFlatL2(f)
index.add(np_vectors)
faiss.write_index(index, "faiss_index.index")
The code works for a small number of vectors, but the memory limit is exceeded when the number of vectors reaches about 2M. I used index.add() instead of appending the vectors to a list (vectors=[]), but that didn't work either.
I want to know how to create an index for a large number of vectors.
If you want to continue using Faiss, there is a reference for choosing a different index type, maybe HNSW or IVFPQ.
ref: https://wangzwhu.github.io/home/file/acmmm-t-part3-ann.pdf, see the last page.
Another option is to try a distributed solution such as Milvus, which is built on top of ANN libraries like Faiss.
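To make the IVFPQ suggestion concrete, here is a minimal sketch; parameter values such as nlist and m are illustrative assumptions, not tuned recommendations. It trains on a sample and then adds the vectors in chunks, so they never all sit in a Python list at once:

import numpy as np
import faiss

d = 1024       # vector dimensionality
nlist = 4096   # number of IVF cells (illustrative)
m = 64         # PQ sub-quantizers; must divide d (illustrative)
nbits = 8      # bits per sub-quantizer code

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

# Train on a sample of the data, then add the full set in chunks.
train_sample = np.random.randn(200_000, d).astype('float32')
index.train(train_sample)

chunk_size = 100_000
for _ in range(100):  # 100 chunks of 100k = 10M vectors
    chunk = np.random.randn(chunk_size, d).astype('float32')
    index.add(chunk)

faiss.write_index(index, "faiss_ivfpq.index")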

Tensorflow. Batch Tensor modify an entry (tensor)

I am following the example: https://www.tensorflow.org/tutorials/structured_data/time_series.
In my case I have a sensor which collects data every hour. It has not been very reliable during the last months and I have lost some data. To solve this problem, the missing values have been replaced with the previous valid value. As a result I have many duplicated values, and I think this is the reason why my NN is unable to predict anything. I do not want to drop the wrong values before creating the dataset, because that would create time series with non-consecutive values.
I would like to create the time series dataset as in the example and then either remove the entries/outputs (tensors) which have a certain amount of duplication in the data, or update the tensor values with the value 0.
def hasMultipleDuplicatedElements(mylist, multiplicity):
    return Counter(mylist[:, 0]).most_common(1)[0][1] > multiplicity
WindowGenerator.hasMultipleDuplicatedElements = hasMultipleDuplicatedElements

def dsCleanedRowsWithHighMultiplycity(self, ds, multiplicity):
    for batch in ds:
        dataBatch = batch.numpy()
        for j in range(len(dataBatch)):
            selectedDataBatch = dataBatch[j]
            indices = tf.constant([[j] for j in range(len(selectedDataBatch))])
            inputData = selectedDataBatch[:self.input_width]
            labelData = selectedDataBatch[self.input_width:]
            if (hasMultipleDuplicatedElements(inputData, multiplicity) or
                    hasMultipleDuplicatedElements(labelData, multiplicity)):
                #print(batch[j])
                tf.tensor_scatter_nd_update(batch[j], indices,
                                            tf.zeros(shape=selectedDataBatch.shape, dtype=tf.float32),
                                            name=None)
                #print(batch[j])
WindowGenerator.dsCleanedRowsWithHighMultiplycity = dsCleanedRowsWithHighMultiplycity

def make_dataset(self, data):
    data = np.array(data, dtype=np.float32)
    ds = tf.keras.preprocessing.timeseries_dataset_from_array(
        data=data,
        targets=None,
        sequence_length=self.total_window_size,
        sequence_stride=1,
        shuffle=True,
        batch_size=32,)
    self.dsCleanedRowsWithHighMultiplycity(ds, 10)
    ds = ds.map(self.split_window)
    return ds
The dataset contains batches, each one with 32 entries/outputs (tensors). I scan every entry/output looking for data duplicated a minimum of 10 times. I manage to spot these entries and create a new tensor with tf.tensor_scatter_nd_update, but what I would like is to update the original tensor inside the batch.
If there is a way to remove the offending tensor from the batch instead, that would also be an acceptable solution.
Thanks in advance!
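One possible direction (a hedged sketch, not a verified fix): instead of mutating tensors inside a batch, filter the offending windows out before batching, using tf.data.Dataset.unbatch() and filter(). The names data and total_window_size below stand in for the corresponding values inside make_dataset:

import tensorflow as tf

def has_high_multiplicity(window, multiplicity=10):
    # window has shape [total_window_size, num_features]; check how often the
    # most common value appears in the first feature column.
    values = window[:, 0]
    _, _, counts = tf.unique_with_counts(values)
    return tf.reduce_max(counts) > multiplicity

ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=data, targets=None, sequence_length=total_window_size,
    sequence_stride=1, shuffle=True, batch_size=32)

# Drop windows with too much duplication, then re-batch; split_window can be
# mapped over the result as in make_dataset.
ds = ds.unbatch()
ds = ds.filter(lambda w: tf.logical_not(has_high_multiplicity(w)))
ds = ds.batch(32)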

How can I append features column wise and start a row for each sample

I am trying to create a training data file which is structured as follows:
[Rows = Samples, Columns = features]
So if I have 100 samples and 2 features the shape of my np.array would be (100,2).
The list below contains path strings to the .nrrd 3D sample patch-data files.
['/Users/FK/Documents/image/0128/subject1F_200.nrrd',
'/Users/FK/Documents/image/0128/subject2F_201.nrrd']
This is the code I have so far:
training_file = []
# For each sample in my image folder
for patches in dir_0128_list:
    # Reads the 64x64x64 numpy array
    data, options = nrrd.read(patches)
    # Calculates the median and sum of the 3D array file. 2 features per sample
    f_median = np.median(data)
    training_file.append(f_median)
    f_sum = np.sum(data)
    training_file.append(f_sum)
    # Calculates a numpy array with shape (169,) containing 169 features per sample.
    f_mof = my_own_function(data)
    training_file.append(f_mof)

training_file = np.array((training_file), dtype=np.float32)
# training_file = np.column_stack((core_training_list))
If I don't use the np.column_stack function I get a (173,1) matrix, and (1,173) if I do use it. In this scenario it should have a (2,171) shape.
I want to calculate the sum and median and append them to a list or numpy array column-wise. At the end of each loop iteration I want to jump one row down and append the 2 features column-wise for the next sample, and so on...
Very simple solution
Instead of
f_median = np.median(data)
training_file.append(f_median)
f_sum = np.sum(data)
training_file.append(f_sum)
you could do
training_file.append((np.median(data), np.sum(data)))
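If the 169 values from my_own_function should end up in the same row (as the (2,171) target shape suggests), the tuple can be extended in the same spirit; this assumes my_own_function returns a flat iterable of length 169:

# Hypothetical extension: one row per sample with median, sum, and the
# 169 values from my_own_function (assumed to be a flat iterable).
training_file.append((np.median(data), np.sum(data), *my_own_function(data)))

np.array(training_file, dtype=np.float32) then has shape (n_samples, 171).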
Slightly longer solution
You would still have one piece of consecutive code that is not easy to reuse and test individually.
I would structure the script into different parts:
1. Iterate over the files to read the patches
2. Calculate the median and sum
3. Aggregate into the requested format
Read the patches
def read_patches(files):
    for file in files:
        yield nrrd.read(file)
This makes a generator yielding the patch info.
Calculate
def parse_patch(patch):
    data, options = patch
    return np.median(data), np.sum(data)
Putting it together
from pathlib import Path

file_dir = Path(<my_filedir>)
files = file_dir.glob('*.nrrd')
patches = read_patches(files)
training_file = np.array([parse_patch(patch) for patch in patches],
                         dtype=np.float32)
This might look convoluted, but it allows for easy testing of each of the sub-blocks.
