Using tf.data to read data from disk - python

I have a directory of images with random ids and a text file that maps each id to its corresponding label. I was wondering if there is a way to read the data directly from disk without loading the entire dataset into RAM as a matrix.
I know it can be done with a Python generator, followed by placeholders to feed the data:
def generator_(path1, filename):
    .
    .
    yield x, y

x = tf.placeholder(tf.float32, shape=[None, w, h, 3])
y = tf.placeholder(tf.float32, shape=[None, n_c])
x, y = generator_(path_image, 'labels.txt')
But what is a better way to do it using the tf.data API?

Supposing your labels.txt has a structure like this (comma-separated image IDs and labels):
1, 0
2, 2
3, 1
...
42, 2
and your images are stored like:
/data/
|---- image1.jpg
|---- image2.jpg
...
|---- image42.jpg
You could then use tf.data like this:
import tensorflow as tf

def generate_parser(separator=",", image_path=["/data/image", ".jpg"]):
    image_path = [tf.constant(image_path[0]), tf.constant(image_path[1])]

    def _parse_data(line):
        # Split the line according to separator:
        line_split = tf.string_split([line], separator)
        # Convert label value to int:
        label = tf.string_to_number(line_split.values[1], out_type=tf.int32)
        # Build complete image path from ID:
        image_filepath = image_path[0] + line_split.values[0] + image_path[1]
        # Open image:
        image_string = tf.read_file(image_filepath)
        image_decoded = tf.image.decode_image(image_string)
        return image_decoded, label

    return _parse_data

label_file = "/var/data/labels.txt"

dataset = (tf.data.TextLineDataset([label_file])
           .map(generate_parser(separator=",", image_path=["/data/image", ".jpg"])))
# add .batch(), .repeat(), etc.
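For completeness, here is a minimal sketch of how the resulting dataset might be batched and consumed with the TF 1.x iterator API. The batch size, shuffle buffer and session loop are illustrative assumptions, not part of the original answer, and the images may need to be resized to a common shape before batching since decode_image leaves the shape undefined:

# Minimal consumption sketch (assumes `dataset` built above; sizes are arbitrary).
# Note: resize images to a fixed shape in _parse_data before batching.
dataset = dataset.shuffle(buffer_size=1000).batch(32).repeat()
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()

with tf.Session() as sess:
    batch_images, batch_labels = sess.run([images, labels])
    print(batch_images.shape, batch_labels.shape)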

Related

Split my dataset in train/validation using MapDataset in python

Hi everyone, I'm facing an issue after I process my images and labels. To create a single dataset I use the zip function. After processing, both images and labels have 18k elements, which is correct, but when I call zip(images, labels), the resulting dataset has only 563 items.
Here is some code to illustrate:
# Map the load_and_preprocess_image function over the dataset of image paths
images = image_paths.map(load_and_preprocess_image)
# Map the extract_label function over the dataset of image paths
labels = image_paths.map(extract_label)
# Zip the labels and images together to create a dataset of (image, label) pairs
#HERE SOMETHING STRANGE HAPPENS
data = tf.data.Dataset.zip((images,labels))
# Shuffle and batch the data
data = data.shuffle(buffer_size=1000).batch(32)
# Split the data into train and test sets
data = data.shuffle(buffer_size=len(data))
# Convert the dataset into a collection of data
num_train = int(0.8 * len(data))
train_data = image_paths.take(num_train)
val_data = image_paths.skip(num_train)
I cannot see where the error is. Can you help me please? Thanks.
I'd like to end up with a dataset of 18k (image, label) pairs.
tf's zip
tf.data.Dataset.zip is not like Python's zip: its inputs must be tf.data datasets. You should check that the images/labels returned from your map calls are the correct tf.data.Dataset objects.
check tf.ds
Make sure your image/label datasets are correct tf.data.Dataset objects:
print("ele: ", images_dataset.element_spec)
print("num: ", images_dataset.cardinality().numpy())
print("ele: ", labels_dataset.element_spec)
print("num: ", labels_dataset.cardinality().numpy())
workaround
In your case, combine the image and label processing in one map function and return both, to bypass tf.data.Dataset.zip entirely:
# load_and_preprocess_image_and_label
def load_and_preprocess_image_and_label(image_path):
    """ load image and label then some operations """
    return image, label

# Map the load_and_preprocess_image_and_label function over the dataset of image/label paths
train_list = tf.data.Dataset.list_files(str(PATH / 'train/*.jpg'))
data = train_list.map(load_and_preprocess_image_and_label,
                      num_parallel_calls=tf.data.AUTOTUNE)
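As an illustration only (the original answer leaves the body of the function open), a minimal combined map function could look like the sketch below. Deriving the label from the parent directory name, the JPEG decoding, and the 224x224 resize are all assumptions, not details from the question:

import tensorflow as tf

def load_and_preprocess_image_and_label(image_path):
    # Read and decode the image file (JPEG assumed).
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0  # arbitrary size/normalization
    # Hypothetical labeling scheme: the parent directory name is the class label.
    label = tf.strings.split(image_path, '/')[-2]
    return image, label

train_list = tf.data.Dataset.list_files('data/train/*/*.jpg')  # hypothetical layout
data = train_list.map(load_and_preprocess_image_and_label,
                      num_parallel_calls=tf.data.AUTOTUNE)

Because the map function already returns an (image, label) pair, the resulting dataset has one element per file and the 18k count is preserved without any call to zip.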

Creating an image LMDB for Caffe2

In the original Caffe framework, there was an executable under caffe/build/tools called convert_imageset, which took a directory of JPEG images and a text file with labels for each image, and output an LMDB that could be fed to a Caffe model to train, test, etc.
What is the best way to convert raw JPEG images and labels to an LMDB that Caffe2 can ingest using the AddInput() function from this MNIST tutorial on the Caffe2 website?
According to my research, you cannot simply create an LMDB file using this tool and feed a Caffe2 model.
The tutorial script just downloads two LMDBs (mnist-train-nchw-lmdb and mnist-test-nchw-lmdb) and passes them to AddInput(), but gives no insight as to how the LMDBs were created.
There is a binary called make_image_db.cc which does precisely what you are describing. It is located in caffe2/build/bin/make_image_db:
// This script converts an image dataset to a database.
//
// caffe2::FLAGS_input_folder is the root folder that holds all the images
//
// caffe2::FLAGS_list_file is the path to a file containing a list of files
// and their labels, as follows:
//
// subfolder1/file1.JPEG 7
// subfolder1/file2.JPEG 7
// subfolder2/file1.JPEG 8
// ...
As described in https://github.com/caffe2/caffe2/issues/1755 you can use the binary in the following way (also with fewer parameters):
caffe2/build/bin/make_image_db -color -db lmdb -input_folder ./some_input_folder
-list_file ./labels_file -num_threads 10 -output_db_name ./some_output_folder -raw -scale 256 -shuffle
A full Caffe2 example of how to create and read an LMDB database (with random images) can be found in the official GitHub repository and can be used as a skeleton to adapt to your own images: https://github.com/caffe2/caffe2/blob/master/caffe2/python/examples/lmdb_create_example.py. Since I have not used this method yet, I will simply copy the example. To create the database, one can use:
import argparse
import numpy as np
import lmdb
from caffe2.proto import caffe2_pb2
from caffe2.python import workspace, model_helper

def create_db(output_file):
    print(">>> Write database...")
    LMDB_MAP_SIZE = 1 << 40   # MODIFY
    env = lmdb.open(output_file, map_size=LMDB_MAP_SIZE)
    checksum = 0
    with env.begin(write=True) as txn:
        for j in range(0, 128):
            # MODIFY: add your own data reader / creator
            label = j % 10
            width = 64
            height = 32
            img_data = np.random.rand(3, width, height)
            # ...
            # Create TensorProtos
            tensor_protos = caffe2_pb2.TensorProtos()
            img_tensor = tensor_protos.protos.add()
            img_tensor.dims.extend(img_data.shape)
            img_tensor.data_type = 1
            flatten_img = img_data.reshape(np.prod(img_data.shape))
            img_tensor.float_data.extend(flatten_img)
            label_tensor = tensor_protos.protos.add()
            label_tensor.data_type = 2
            label_tensor.int32_data.append(label)
            txn.put(
                '{}'.format(j).encode('ascii'),
                tensor_protos.SerializeToString()
            )
            checksum += np.sum(img_data) * label
            if (j % 16 == 0):
                print("Inserted {} rows".format(j))
    print("Checksum/write: {}".format(int(checksum)))
    return checksum
The database can then be loaded with:
def read_db_with_caffe2(db_file, expected_checksum):
    print(">>> Read database...")
    model = model_helper.ModelHelper(name="lmdbtest")
    batch_size = 32
    data, label = model.TensorProtosDBInput(
        [], ["data", "label"], batch_size=batch_size,
        db=db_file, db_type="lmdb")
    checksum = 0
    workspace.RunNetOnce(model.param_init_net)
    workspace.CreateNet(model.net)
    for _ in range(0, 4):
        workspace.RunNet(model.net.Proto().name)
        img_datas = workspace.FetchBlob("data")
        labels = workspace.FetchBlob("label")
        for j in range(batch_size):
            checksum += np.sum(img_datas[j, :]) * labels[j]
    print("Checksum/read: {}".format(int(checksum)))
    assert np.abs(expected_checksum - checksum) < 0.1, \
        "Read/write checksums don't match"
Last but not least, there is also a tutorial on how to create a minidb database: https://github.com/caffe2/caffe2/blob/master/caffe2/python/tutorials/create_your_own_dataset.ipynb. For this, one could use the following function:
from caffe2.python import core, utils  # needed for core.C.create_db and NumpyArrayToCaffe2Tensor

def write_db(db_type, db_name, features, labels):
    db = core.C.create_db(db_type, db_name, core.C.Mode.write)
    transaction = db.new_transaction()
    for i in range(features.shape[0]):
        feature_and_label = caffe2_pb2.TensorProtos()
        feature_and_label.protos.extend([
            utils.NumpyArrayToCaffe2Tensor(features[i]),
            utils.NumpyArrayToCaffe2Tensor(labels[i])])
        transaction.put(
            'train_{:03d}'.format(i),
            feature_and_label.SerializeToString())
    # Close the transaction, and then close the db.
    del transaction
    del db
Features would be a tensor containing your images as numpy arrays. Labels are the corresponding true labels for the features. You would then simply call the function as
write_db("minidb", "train_images.minidb", train_features, train_labels)
Finally, you would load the images from the database by
net_proto = core.Net("example_reader")
dbreader = net_proto.CreateDB([], "dbreader", db="train_images.minidb", db_type="minidb")
net_proto.TensorProtosDBInput([dbreader], ["X", "Y"], batch_size=16)
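The snippet above only builds the reader net; to actually pull a batch out, something along the following lines should work (a sketch based on the linked tutorial, using the blob names from the snippet):

from caffe2.python import workspace

# Run the reader net once and fetch the first batch of 16 examples.
workspace.RunNetOnce(net_proto)
images = workspace.FetchBlob("X")
labels = workspace.FetchBlob("Y")
print(images.shape, labels.shape)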
To create the database in LMDB:
create the train data folder
create a train.txt file containing filename and label
create the validation data folder
create a val.txt file containing filename and label
Then edit this file:
gedit examples/imagenet/create_imagenet.sh
EXAMPLE= path to where the *.lmdb folder will be stored
DATA= path where val.txt and train.txt are present
TOOLS=build/tools
TRAIN_DATA_ROOT=test/make_caffe_data/train/ # path to trainfiles
VAL_DATA_ROOT=test/make_caffe_data/val/ # path to test_files
Set RESIZE=true to resize the images to 256x256. Leave as false if images have
already been resized using another tool.
RESIZE=true
./examples/imagenet/create_imagenet.sh
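For reference, train.txt and val.txt are expected to list one image path (relative to TRAIN_DATA_ROOT or VAL_DATA_ROOT) and one integer label per line; the file names below are made up:

cat/img001.jpg 0
cat/img002.jpg 0
dog/img001.jpg 1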

How to split preloaded data into fixed lengths along the 0-dim, to use with the QueueRunner in TensorFlow?

Is there anything comparable to tf.FixedLengthRecordReader, except that the data is loaded from a tensor instead of a file? I am trying to build an input pipeline that looks like this (my problem is described under point 4):
1. Load data into dictionaries
...
# Each dictionary contains two 'key/value' pairs:
# [b'images'] / List_of_Arrays
# [b'labels'] / List_of_Integers
dict_1 = unpickle(path_1)
dict_2 = unpickle(path_2)
...
dict_n = unpickle(path_n)
2. Create new dictionary
# Select certain individual data points from the N dictionaries
# and merge them into a new dictionary or array.
..
dict_new = ...
....
3. Create a tensor with training data points
class PRELOADdata(object):
    pass

pre_load_data = PRELOADdata()

# Images
dict_value_img = dict_new[b'images']
array_image = np.asarray(dict_value_img, np.float32)
pre_load_data.images = tf.convert_to_tensor(array_image, np.float32)

# Labels
dict_value_lbl = dict_new[b'labels']
array_label = np.asarray(dict_value_lbl, np.float32)
pre_load_data.labels = tf.convert_to_tensor(array_label, np.float32)
...
return pre_load_data
4. Here I need help :)
At this point I would like to consume the data similarly to a file that is read with the read() function of tf.FixedLengthRecordReader. In my current solution, the whole data set ends up packed into a single batch.
class DATABASERecord(object):
    pass

result = DATABASERecord()
database = get_pre_load_data()
... ???
result.image = ..
result.label = ..
return result
5. Do some operations on the 'result'
data_point = get_result()
label = data_point.label
image = tf.cast(data_point.image, tf.int32)
#... tf.random_crop, tf.image.random_flip_left_right,
#... tf.image.random_brightness, tf.image.random_contrast,
#... tf.image.per_image_standardization
...
6. Create Batch, QUEUE ..
...
image_batch, label_batch = tf.train.batch(
    [image, label], batch_size=BATCH_SIZE, num_threads=THREADS,
    capacity=BA_CAPACITY * BATCH_SIZE)
...
batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
    [image_batch, label_batch], capacity=QU_CAPACITY)
...
..batch_queue.dequeue()
...
tf.train.start_queue_runners(sess=my_sess)
I don't know if it's relevant, but the whole thing runs on a multi-GPU system.
EDIT:
I don't have an answer to the question itself yet, but I found a workaround that solves the underlying problem. Instead of keeping the data points in a tensor, I save them to a binary file and load them with tf.FixedLengthRecordReader. This answer helped me a lot: Attach a queue to a numpy array in tensorflow for data fetch instead of files?
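For what it's worth, a common TF 1.x way to queue examples directly from preloaded tensors instead of a file is tf.train.slice_input_producer. A rough sketch, reusing the names from steps 3 and 6 above (an illustration only, not the poster's actual solution):

# Queue single examples sliced along dim 0 of the preloaded tensors.
image, label = tf.train.slice_input_producer(
    [pre_load_data.images, pre_load_data.labels], shuffle=True)

# `image` and `label` are single examples, so the per-example ops from step 5
# (random crop, flips, standardization, ...) and the batching from step 6
# can be applied to them unchanged.
image_batch, label_batch = tf.train.batch(
    [image, label], batch_size=BATCH_SIZE, num_threads=THREADS,
    capacity=BA_CAPACITY * BATCH_SIZE)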

Tensorflow read images with labels

I am building a standard image classification model with TensorFlow. For this I have input images, each assigned a label (a number in {0,1}). The data can hence be stored in a list using the following format:
/path/to/image_0 label_0
/path/to/image_1 label_1
/path/to/image_2 label_2
...
I want to use TensorFlow's queuing system to read my data and feed it to my model. Ignoring the labels, one can easily achieve this by using string_input_producer and WholeFileReader. Here is the code:
def read_my_file_format(filename_queue):
    reader = tf.WholeFileReader()
    key, value = reader.read(filename_queue)
    example = tf.image.decode_png(value)
    return example

# removing label, obtaining list containing /path/to/image_x
image_list = [line[:-2] for line in image_label_list]

input_queue = tf.train.string_input_producer(image_list)
input_images = read_my_file_format(input_queue)
However, the labels are lost in that process as the image data is purposely shuffled as part of the input pipeline. What is the easiest way of pushing the labels together with the image data through the input queues?
Using slice_input_producer provides a much cleaner solution. slice_input_producer allows us to create an input queue containing arbitrarily many separable values. The snippet from the question would look like this:
def read_labeled_image_list(image_list_file):
    """Reads a .txt file containing paths and labels.
    Args:
        image_list_file: a .txt file with one "/path/to/image label" per line
    Returns:
        Lists of all filenames and labels in the file image_list_file.
    """
    f = open(image_list_file, 'r')
    filenames = []
    labels = []
    for line in f:
        filename, label = line[:-1].split(' ')
        filenames.append(filename)
        labels.append(int(label))
    return filenames, labels

def read_images_from_disk(input_queue):
    """Consumes a single (filename, label) element of the input queue.
    Args:
        input_queue: a list of two tensors: the filename and the label.
    Returns:
        Two tensors: the decoded image, and the label.
    """
    label = input_queue[1]
    file_contents = tf.read_file(input_queue[0])
    example = tf.image.decode_png(file_contents, channels=3)
    return example, label

# Reads paths of images together with their labels
image_list, label_list = read_labeled_image_list(filename)

images = ops.convert_to_tensor(image_list, dtype=dtypes.string)
labels = ops.convert_to_tensor(label_list, dtype=dtypes.int32)

# Makes an input queue
input_queue = tf.train.slice_input_producer([images, labels],
                                            num_epochs=num_epochs,
                                            shuffle=True)

image, label = read_images_from_disk(input_queue)

# Optional Preprocessing or Data Augmentation
# tf.image implements most of the standard image augmentation
image = preprocess_image(image)
label = preprocess_label(label)

# Optional Image and Label Batching
image_batch, label_batch = tf.train.batch([image, label],
                                          batch_size=batch_size)
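To actually run such a queue-based pipeline, the queue runners have to be started inside a session; a minimal sketch (the training step itself is a placeholder):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())  # needed because num_epochs creates a local variable

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            imgs, lbls = sess.run([image_batch, label_batch])
            # train_step(imgs, lbls)  # placeholder for the actual training call
    except tf.errors.OutOfRangeError:
        pass  # reached num_epochs
    finally:
        coord.request_stop()
        coord.join(threads)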
See also the generic_input_producer from the TensorVision examples for full input-pipeline.
There are three main steps to solving this problem:
Populate tf.train.string_input_producer() with a list of the original space-delimited strings, each containing a filename and its label.
Use tf.read_file(filename) rather than tf.WholeFileReader() to read your image files. tf.read_file() is a stateless op that consumes a single filename and produces a single string containing the contents of the file. It has the advantage that it's a pure function, so it's easy to associate data with the input and the output. For example, your read_my_file_format function would become:
def read_my_file_format(filename_and_label_tensor):
    """Consumes a single filename and label as a ' '-delimited string.
    Args:
        filename_and_label_tensor: A scalar string tensor.
    Returns:
        Two tensors: the decoded image, and the string label.
    """
    filename, label = tf.decode_csv(filename_and_label_tensor, [[""], [""]], " ")
    file_contents = tf.read_file(filename)
    example = tf.image.decode_png(file_contents)
    return example, label
Invoke the new version of read_my_file_format by passing a single dequeued element from the input_queue:
image, label = read_my_file_format(input_queue.dequeue())
You can then use the image and label tensors in the remainder of your model.
In addition to the answers provided, there are a few other things you can do:
Encode your label into the filename. If you have N different categories you can rename your files to something like: 0_file001, 5_file002, N_file003. Afterwards, when you read the data from a reader with key, value = reader.read(filename_queue), the key is the filename and the value is the contents of that file. Then parse the filename, extract the label and convert it to int. This will require a little bit of preprocessing of the data.
Use TFRecords which will allow you to store the data and labels at the same file.
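As a minimal sketch of that last option, this writes image bytes and labels into a single TFRecord file with the TF 1.x API; the output file name and the feature keys ('image_raw', 'label') are arbitrary choices:

import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

# image_label_list holds lines like "/path/to/image_0 0", as in the question.
with tf.python_io.TFRecordWriter("images.tfrecords") as writer:
    for line in image_label_list:
        path, label = line.strip().split(' ')
        with open(path, 'rb') as f:
            image_bytes = f.read()
        example = tf.train.Example(features=tf.train.Features(feature={
            'image_raw': _bytes_feature(image_bytes),
            'label': _int64_feature(int(label)),
        }))
        writer.write(example.SerializeToString())

The records can later be read back with a TFRecordReader (or tf.data.TFRecordDataset) and tf.parse_single_example using the same feature keys.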

OpenCV and Content Based Image retrieval - Is there a way to work with an online database of images without downloading them

I'm trying to build a CBIR system and recently wrote a program in Python using OpenCV functions that lets me query a local database of images and return a result (I followed this tutorial). I now need to link this up with another web scraping module (built with Scrapy) which outputs ~1000 links to images online. These images are scattered throughout the web and should be input to the first OpenCV module. Is it possible to perform calculations on this online image set without downloading it?
These are the steps I followed for the OpenCV module
1) Define the region-based color image descriptor
2) Extract features from dataset (Indexing) (dataset to be passed as command line argument)
# import the necessary packages
import sys
sys.path.append('/usr/local/lib/python2.7/site-packages')
from colordescriptor import ColorDescriptor
import argparse
import glob
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required = True,
    help = "Path to the directory that contains the images to be indexed")
ap.add_argument("-i", "--index", required = True,
    help = "Path to where the computed index will be stored")
args = vars(ap.parse_args())

# initialize the color descriptor
cd = ColorDescriptor((8, 12, 3))

# open the output index file for writing
output = open(args["index"], "w")

# use glob to grab the image paths and loop over them
for imagePath in glob.glob(args["dataset"] + "/*.jpg"):
    # extract the image ID (i.e. the unique filename) from the image
    # path and load the image itself
    imageID = imagePath[imagePath.rfind("/") + 1:]
    image = cv2.imread(imagePath)

    # describe the image
    features = cd.describe(image)

    # write the features to file
    features = [str(f) for f in features]
    output.write("%s,%s\n" % (imageID, ",".join(features)))

# close the index file
output.close()
3) Defining the similarity metric
# import the necessary packages
import numpy as np
import sys
sys.path.append('/usr/local/lib/python2.7/site-packages')
import csv

class Searcher:
    def __init__(self, indexPath):
        # store our index path
        self.indexPath = indexPath

    def search(self, queryFeatures, limit = 5):
        # initialize our dictionary of results
        results = {}

        # open the index file for reading
        with open(self.indexPath) as f:
            # initialize the CSV reader
            reader = csv.reader(f)

            # loop over the rows in the index
            for row in reader:
                # parse out the image ID and features, then compute the
                # chi-squared distance between the features in our index
                # and our query features
                features = [float(x) for x in row[1:]]
                d = self.chi2_distance(features, queryFeatures)

                # now that we have the distance between the two feature
                # vectors, we can update the results dictionary -- the
                # key is the current image ID in the index and the
                # value is the distance we just computed, representing
                # how 'similar' the image in the index is to our query
                results[row[0]] = d

            # close the reader
            f.close()

        # sort our results, so that the smaller distances (i.e. the
        # more relevant images) are at the front of the list
        results = sorted([(v, k) for (k, v) in results.items()])

        # return our (limited) results
        return results[:limit]

    def chi2_distance(self, histA, histB, eps = 1e-10):
        # compute the chi-squared distance
        d = 0.5 * np.sum([((a - b) ** 2) / (a + b + eps)
            for (a, b) in zip(histA, histB)])

        # return the chi-squared distance
        return d
4) Perform the actual search
# import the necessary packages
from colordescriptor import ColorDescriptor
from searcher import Searcher
import sys
sys.path.append('/usr/local/lib/python2.7/site-packages')
import argparse
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--index", required = True,
    help = "Path to where the computed index will be stored")
ap.add_argument("-q", "--query", required = True,
    help = "Path to the query image")
ap.add_argument("-r", "--result-path", required = True,
    help = "Path to the result path")
args = vars(ap.parse_args())

# initialize the image descriptor
cd = ColorDescriptor((8, 12, 3))

# load the query image and describe it
query = cv2.imread(args["query"])
features = cd.describe(query)

# perform the search
searcher = Searcher(args["index"])
results = searcher.search(features)

# display the query
cv2.imshow("Query", query)

# loop over the results
for (score, resultID) in results:
    # load the result image and display it
    result = cv2.imread(args["result_path"] + "/" + resultID)
    cv2.imshow("Result", result)
    cv2.waitKey(0)
And the final command line command is:
python search.py --index index.csv --query query.png --result-path dataset
where index.csv is the file generated after step 2 on the database of images. query.png is my query image and dataset is the folder containing the ~100 images.
So is it possible to modify the indexing so that I don't need a local dataset, and so that querying and indexing can be done directly from the list of URLs?
