I am using Python with the Caffe classifier. I get images from my camera and perform prediction on them using a model trained on my training set. It works well, but the problem is that it is very slow: only about 4 frames/second. Could you suggest some ways to improve the computational time in my code?
The problem can be explained as follows. I have to load a network model, age_net.caffemodel, whose size is about 80MB, with the following code:
age_net_pretrained='./age_net.caffemodel'
age_net_model_file='./deploy_age.prototxt'
age_net = caffe.Classifier(age_net_model_file, age_net_pretrained,
                           mean=mean,
                           channel_swap=(2,1,0),
                           raw_scale=255,
                           image_dims=(256, 256))
And for each input image (caffe_input), I call the predict function
prediction = age_net.predict([caffe_input])
I think that because the network is very large, the predict function takes a long time per image. I believe that is where the slowness comes from.
This is my full reference code, with my changes.
from conv_net import *
import matplotlib.pyplot as plt
import numpy as np
import cv2
import glob
import os
caffe_root = './caffe/'  # note the trailing slash: 'python' is appended directly below
import sys
sys.path.insert(0, caffe_root + 'python')
import caffe
DATA_PATH = './face/'
cnn_params = './params/gender_5x5_5_5x5_10.param'
face_params = './params/haarcascade_frontalface_alt.xml'
def format_frame(frame):
    img = frame.astype(np.float32)/255.
    img = img[...,::-1]
    return img
if __name__ == '__main__':
    files = glob.glob(os.path.join(DATA_PATH, '*.*'))

    # This is the configuration of the full convolutional part of the CNN
    # `d` is a list of dicts, where each dict represents a convolution-maxpooling
    # layer.
    # Eg c1 - first layer, convolution window size
    #    p1 - first layer pooling window size
    #    f_in1 - first layer no. of input feature arrays
    #    f_out1 - first layer no. of output feature arrays
    d = [{'c1':(5,5),
          'p1':(2,2),
          'f_in1':1, 'f_out1':5},
         {'c2':(5,5),
          'p2':(2,2),
          'f_in2':5, 'f_out2':10}]

    # This is the configuration of the mlp part of the CNN
    # first tuple has the fan_in and fan_out of the input layer
    # of the mlp and so on.
    nnet = [(800,256),(256,2)]

    c = ConvNet(d, nnet, (45,45))
    c.load_params(cnn_params)

    face_cascade = cv2.CascadeClassifier(face_params)
    cap = cv2.VideoCapture(0)
    cv2.namedWindow("Image", cv2.WINDOW_NORMAL)

    plt.rcParams['figure.figsize'] = (10, 10)
    plt.rcParams['image.interpolation'] = 'nearest'
    plt.rcParams['image.cmap'] = 'gray'

    mean_filename = './mean.binaryproto'
    proto_data = open(mean_filename, "rb").read()
    a = caffe.io.caffe_pb2.BlobProto.FromString(proto_data)
    mean = caffe.io.blobproto_to_array(a)[0]

    age_net_pretrained = './age_net.caffemodel'
    age_net_model_file = './deploy_age.prototxt'
    age_net = caffe.Classifier(age_net_model_file, age_net_pretrained,
                               mean=mean,
                               channel_swap=(2,1,0),
                               raw_scale=255,
                               image_dims=(256, 256))

    age_list = ['(0, 2)','(4, 6)','(8, 12)','(15, 20)','(25, 32)','(38, 43)','(48, 53)','(60, 100)']

    while True:
        val, image = cap.read()
        if image is None:
            break
        image = cv2.resize(image, (320,240))
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, 1.3, 5, minSize=(30,30))
        for f in faces:
            x, y, w, h = f
            cv2.rectangle(image, (x,y), (x+w,y+h), (0,255,255))
            face_image_rgb = image[y:y+h, x:x+w]
            caffe_input = cv2.resize(face_image_rgb, (256, 256)).astype(np.float32)
            prediction = age_net.predict([caffe_input])
            print 'predicted age:', age_list[prediction[0].argmax()]
        cv2.imshow('Image', image)
        ch = 0xFF & cv2.waitKey(1)
        if ch == 27:
            break
        #break
Try calling age_net.predict([caffe_input]) with oversample=False:
prediction = age_net.predict([caffe_input], oversample=False)
The default behavior of predict is to create 10 slightly different crops of the input image and feed them to the network to classify; by disabling this option you should get roughly a 10x speedup.
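As a rough sanity check, you can time both modes yourself. A minimal sketch, reusing the age_net and caffe_input variables from the question:
import time
t0 = time.time()
prediction = age_net.predict([caffe_input])                    # default: 10 crops per image
t1 = time.time()
prediction = age_net.predict([caffe_input], oversample=False)  # single center crop
t2 = time.time()
print 'oversample=True : %.3f s' % (t1 - t0)
print 'oversample=False: %.3f s' % (t2 - t1)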
For all of you who still use Caffe, I'd recommend trying OpenVINO to decrease inference time. OpenVINO optimizes your model by converting it to Intermediate Representation (IR), performing graph pruning, and fusing some operations into others while preserving accuracy. Then it uses vectorization at runtime. OpenVINO is optimized for Intel hardware, but it should work with any CPU.
Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[caffe]
Use Model Optimizer to convert Caffe model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the Caffe model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (change the data_type flag). Run in the command line:
mo --input_model "age_net.caffemodel" --data_type FP32 --source_layout "[n,c,h,w]" --target_layout "[n,h,w,c]" --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. If you care about latency or throughput, I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement.
import cv2
import numpy as np
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/age_net.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT": "LATENCY"})  # alternatively THROUGHPUT or CUMULATIVE_THROUGHPUT
# Get input and output layers
input_layer_ir = compiled_model_ir.input(0)
output_layer_ir = compiled_model_ir.output(0)
# Resize and reshape input image
height, width = list(input_layer_ir.shape)[1:3]
input_image = cv2.resize(input_image, (width, height))[np.newaxis, ...]
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
I successfully converted a Keras H5 model into a TensorFlow pb file, but I get totally different results when making a prediction.
In Python I use 2 Keras modules to preprocess the data before feeding the network:
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array
Here is how I preprocess the data in my Python code:
# extract the object ROI, convert it from BGR to RGB channel
# ordering, resize it to 224x224, and preprocess it
moving_object = img_orig[startY:endY, startX:endX]
moving_object = cv2.cvtColor(moving_object, cv2.COLOR_BGR2RGB)
moving_object = cv2.resize(moving_object, (224, 224))
moving_object = img_to_array(moving_object)
moving_object = preprocess_input(moving_object)
objects.append(moving_object)
Then I make batch predictions via the Keras predict method:
# only make predictions if at least one object was detected
if len(objects) > 0:
    objects = np.array(objects, dtype="float32")
    preds = wine_plant_model.predict(objects)
Here is how I preprocess the data in C++:
vector<Mat> detected_objects;
//extract the object ROI
Mat image_roi = img_orig(roi);
detected_objects.push_back(image_roi);
and how I make batch predictions in C++:
if (detected_objects.size() > 0) {
    vector<Mat> preds;
    Mat inputBlobs = cv::dnn::blobFromImages(detected_objects, 1.0, Size(224, 224));
    net.setInput(inputBlobs);
    Mat outputs = net.forward();
}
It seems that I am not preprocessing the image the right way in C++, and therefore I am not getting the same results. But I cannot find an equivalent for the Keras preprocess_input() method in C++.
Looking at the Keras documentation, the Python preprocess_input() method scales the data to the range [-1, 1]. So I do not know whether I should normalize the data using the cv::normalize method or do something with the blobFromImages scale factor. I am a bit confused here.
Could you please tell me how to preprocess the data the same way in C++, even if it is not through Keras, which does not seem to be available in C++.
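For reference, here is a minimal Python sketch of the equivalence to aim for (cv2.dnn's Python bindings mirror the C++ API one-to-one; img_bgr is a placeholder for an OpenCV BGR image such as the object ROI). mobilenet_v2's preprocess_input scales pixels as x / 127.5 - 1, and blobFromImages computes (pixel - mean) * scalefactor, so the same preprocessing can be expressed through its parameters:
import cv2

blob = cv2.dnn.blobFromImages(
    [img_bgr],                   # list of BGR images (placeholder name)
    scalefactor=1.0 / 127.5,     # divide by 127.5 ...
    size=(224, 224),
    mean=(127.5, 127.5, 127.5),  # ... after subtracting 127.5 -> range [-1, 1]
    swapRB=True)                 # BGR -> RGB, like the cv2.cvtColor call above
The C++ overload of blobFromImages takes the same scalefactor, Size, Scalar mean, and swapRB arguments, so this call translates directly.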
I am developing an image classification model/program for the Raspberry Pi 0 W. I was wondering if it is possible to make a code upgrade that will accelerate image processing.
General information:
the main model was trained on EfficientNetB5
image dimensions are 240x320 in grayscale
on the Raspberry Pi it should do image classification only; there is no need for 'live streaming' or object detection
I acknowledge that the Raspberry Pi 0 W is not the best match for TF, but maybe there is still a way to accelerate it
at the moment one image is predicted in 60 seconds, which is too much
My thoughts are that maybe I should train the model with lower input dimensions, and maybe the learning_rate of the main model can affect the Pi's speed?
Below I am attaching two scripts.
TensorFlow saved model transformation into a TFLite quantized model
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.models import load_model
model = load_model('../models/effnet_v22.h5')
TFLITE_QUANT_MODEL = "../tflite_models/effnet_v22_quant.tflite"
run_model = tf.function(lambda x : model(x))
# Save the concrete function.
concrete_func = run_model.get_concrete_function(
    tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype)
)
# Convert the model to quantized version with post-training quantization
converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
open(TFLITE_QUANT_MODEL, "wb").write(tflite_quant_model)
print("TFLite Quantized Model Is Created")
Processing one image on the Raspberry Pi 0
import tensorflow as tf
import numpy as np
import matplotlib.image as img
import cv2
# uploading tflite model
tflite_interpreter = tf.lite.Interpreter(
    model_path='../../tflite_models/effnet_v22_quant.tflite')
# taking pre-trained model parameters
input_details = tflite_interpreter.get_input_details()
output_details = tflite_interpreter.get_output_details()
img_width = input_details[0]['shape'][2]
img_height = input_details[0]['shape'][1]
# uploading and processing the image to be predicted
testimg=img.imread('../img/c21.jpg')
testimg=cv2.resize(testimg, (img_width,img_height))
testimg=cv2.cvtColor(testimg, cv2.COLOR_BGR2GRAY)
testimg=testimg[np.newaxis, ..., np.newaxis]
testimg=np.array(testimg, dtype=np.float32)
# resizing tflite's tensors
tflite_interpreter.resize_tensor_input(input_details[0]['index'], (1, img_height, img_width, 1))
tflite_interpreter.resize_tensor_input(output_details[0]['index'], (1, 8))
tflite_interpreter.allocate_tensors()
input_details = tflite_interpreter.get_input_details()
output_details = tflite_interpreter.get_output_details()
tflite_interpreter.set_tensor(input_details[0]['index'], testimg)
tflite_interpreter.invoke()
tflite_model_predictions = tflite_interpreter.get_tensor(output_details[0]['index'])
# TFLite prediction results
classes = np.array([101,102,104,105, 107, 110, 113, 115]) # class array creation
mat = np.vstack([classes, tflite_model_predictions])
np.set_printoptions(suppress=True, precision = 10) # to get rid of scientific numbers
if np.max(mat[1,:]) > 0.50:
    theclass = int(mat[0, np.argmax(mat[1,:])])
else:
    theclass = "NO_CLASS"
print(mat)
print("The predicted class is", theclass)
You are using an EfficientNet-B5 model, which has nearly 30M parameters. Even with the benefits of TensorFlow Lite and quantization, it is very hard to get inference latency below 30 ms even on a high-performance CPU like the one in the Pixel 4. Considering you are using a very limited embedded system, 60 seconds for one inference is expected.
There is a well-explained blog post about latency of the EfficientNet-Lite models, which you can visit here: https://blog.tensorflow.org/2020/03/higher-accuracy-on-vision-models-with-efficientnet-lite.html
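If you want to push quantization further, below is a minimal sketch of full-integer post-training quantization, which usually reduces latency on ARM CPUs more than weight-only optimization. It reuses the conversion pattern from your own script; the representative_dataset is a placeholder (it yields random arrays shaped like your 240x320 grayscale inputs, so substitute a few hundred real training images), and the output path is illustrative:
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.models import load_model

model = load_model('../models/effnet_v22.h5')

run_model = tf.function(lambda x: model(x))
concrete_func = run_model.get_concrete_function(
    tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))

def representative_dataset():
    # placeholder: yield a few hundred real (1, 240, 320, 1) training images here
    for _ in range(100):
        yield [np.random.rand(1, 240, 320, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_int8_model = converter.convert()
open('../tflite_models/effnet_v22_int8.tflite', 'wb').write(tflite_int8_model)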
I am using tf.keras to build my network, and I am doing all the augmentation tensor-wise, since my data lives in TFRecord files. I then needed shearing and ZCA for augmentation, but couldn't find a proper implementation in TensorFlow. And I can't use ImageDataGenerator, which does both operations I need, because, as I said, my data doesn't fit in memory and is in TFRecord format. So all my augmentation has to be tensor-wise.
fchollet here suggested a way to use ImageDataGenerator with a large dataset.
My first question is:
if I use fchollet's way, which is basically running ImageDataGenerator on an X-sized sample of the large data and then using train_on_batch to train the network, how can I feed my validation data to the network?
My second question: is there any tensor-wise implementation of the shear and ZCA operations? Some people, like here, suggested using tf.contrib.image.transform, but I couldn't understand how. If someone has an idea of how to do it, I would appreciate it.
Update:
This is my attempt to construct the transformation matrix through skimage:
from skimage import io
from skimage import transform as trans
import tensorflow as tf

def augment(image):
    afine_tf = trans.AffineTransform(shear=0.2)
    transform = tf.contrib.image.matrices_to_flat_transforms(tf.linalg.inv(afine_tf.params))
    transform = tf.cast(transform, tf.float32)
    image = tf.contrib.image.transform(image, transform)  # image here is a tensor
    return image
dataset_train = tf.data.TFRecordDataset(training_files, num_parallel_reads=calls)
dataset_train = dataset_train.apply(tf.contrib.data.shuffle_and_repeat(buffer_size=1000 + 4 * batch_size))
dataset_train = dataset_train.map(decode_train, num_parallel_calls=calls)
dataset_train = dataset_train.map(augment, num_parallel_calls=calls)
dataset_train = dataset_train.batch(batch_size)
dataset_train = dataset_train.prefetch(tf.contrib.data.AUTOTUNE)
I will answer the second question.
Today one of my old questions received a comment from a user, but the comments had been deleted by the time I was adding more details on how to use tf.contrib.image.transform. I guess it's you, right?
So, I have edited my question and added an example, check it here.
TL;DR:
def transformImg(imgIn, forward_transform):
    t = tf.contrib.image.matrices_to_flat_transforms(tf.linalg.inv(forward_transform))
    # please notice that forward_transform must be a float matrix,
    # e.g. [[2.0,0,0],[0,1.0,0],[0,0,1]] will work
    # but [[2,0,0],[0,1,0],[0,0,1]] will not
    imgOut = tf.contrib.image.transform(imgIn, t, interpolation="BILINEAR", name=None)
    return imgOut

def shear_transform_example(filename, shear_lambda):
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_jpeg(image_string, channels=3)
    img = transformImg(image_decoded, [[1.0, shear_lambda, 0], [0, 1.0, 0], [0, 0, 1.0]])
    # Notice that this is a shear transformation parallel to the x axis
    # If you want a y axis version, use this:
    # img = transformImg(image_decoded, [[1.0,0,0],[shear_lambda,1.0,0],[0,0,1.0]])
    return img

img = shear_transform_example("white_square.jpg", 0.1)
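If it helps, a sketch of plugging this into the tf.data pipeline from your question (the shear value 0.2 is just the one from your trial; transformImg is defined above):
def augment(image):
    # shear parallel to the x axis
    return transformImg(image, [[1.0, 0.2, 0], [0, 1.0, 0], [0, 0, 1.0]])

dataset_train = dataset_train.map(augment, num_parallel_calls=calls)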
I have converted the .pb file to a tflite file using bazel. Now I want to load this tflite model in my Python script, just to test whether it gives me the correct output or not.
You can use the TensorFlow Lite Python interpreter to load the tflite model in a Python shell and test it with your input data.
The code will be like this:
import numpy as np
import tensorflow as tf
# Load TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="converted_model.tflite")
interpreter.allocate_tensors()
# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Test model on random input data.
input_shape = input_details[0]['shape']
input_data = np.array(np.random.random_sample(input_shape), dtype=np.float32)
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
# The function `get_tensor()` returns a copy of the tensor data.
# Use `tensor()` in order to get a pointer to the tensor.
output_data = interpreter.get_tensor(output_details[0]['index'])
print(output_data)
The above code is from the TensorFlow Lite official guide; for more detailed information, read this.
Using TensorFlow lite models in Python:
The verbosity of TensorFlow Lite is powerful because it allows you more control, but in many cases you just want to pass input and get an output, so I made a class that wraps this logic:
The following works with classification models from tfhub.dev, for example: https://tfhub.dev/tensorflow/lite-model/mobilenet_v2_1.0_224/1/metadata/1
# Usage (labels is a list of class names, ordered to match the model's outputs)
model = TensorflowLiteClassificationModel("path/to/model.tflite", labels)
(label, probability) = model.run_from_filepath("path/to/image.jpeg")[-1]  # most probable class
import tensorflow as tf
import numpy as np
from PIL import Image


class TensorflowLiteClassificationModel:
    def __init__(self, model_path, labels, image_size=224):
        self.interpreter = tf.lite.Interpreter(model_path=model_path)
        self.interpreter.allocate_tensors()
        self._input_details = self.interpreter.get_input_details()
        self._output_details = self.interpreter.get_output_details()
        self.labels = labels
        self.image_size = image_size

    def run_from_filepath(self, image_path):
        input_data_type = self._input_details[0]["dtype"]
        image = np.array(Image.open(image_path).resize((self.image_size, self.image_size)), dtype=input_data_type)
        if input_data_type == np.float32:
            image = image / 255.

        image = image[np.newaxis, ...]  # add the batch dimension expected by run()
        if image.shape == (1, 224, 224):
            # grayscale input: replicate it across 3 channels
            image = np.stack([image] * 3, axis=-1)

        return self.run(image)

    def run(self, image):
        """
        args:
          image: a (1, image_size, image_size, 3) np.array

        Returns list of [Label, Probability], of type List<str, float>
        """

        self.interpreter.set_tensor(self._input_details[0]["index"], image)
        self.interpreter.invoke()
        tflite_interpreter_output = self.interpreter.get_tensor(self._output_details[0]["index"])
        probabilities = np.array(tflite_interpreter_output[0])

        # create list of ["label", probability], ordered ascending by probability
        label_to_probabilities = []
        for i, probability in enumerate(probabilities):
            label_to_probabilities.append([self.labels[i], float(probability)])
        return sorted(label_to_probabilities, key=lambda element: element[1])
Caution
However, you'll need to modify this to support different use cases, since I am passing images as input and getting classification ([label, probability]) output. You'll need to adapt it for text input (NLP) or for other kinds of output, such as object detection (bounding boxes, labels, and probabilities) or plain classification (just labels).
Also, if you are expecting different-size image inputs, you'd have to change the input size and reallocate the model (self.interpreter.allocate_tensors()), as sketched below. This is slow (inefficient). It's better to use the platform's resizing functionality (e.g., the Android graphics library) instead of a TensorFlow Lite model to do the resizing. Alternatively, you could do the resizing with a separate model, which would be much quicker to allocate_tensors() for.
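For completeness, a minimal sketch of that resize-and-reallocate path (new_height and new_width are placeholders):
input_details = model.interpreter.get_input_details()
model.interpreter.resize_tensor_input(input_details[0]['index'], (1, new_height, new_width, 3))
model.interpreter.allocate_tensors()  # slow: do this once, not per image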
Question:
I have trained a convolutional neural network (CNN) to determine/detect whether an object of interest is present or not in a given image patch.
Now, given a large image, I am trying to locate all occurrences of the object in a sliding-window fashion by applying my CNN model to the patch surrounding each pixel in the image. However, this is very slow.
The size of my test images is (512 x 512). For my caffe net, the test batch size is 1024 and the patch size is (65 x 65 x 1).
I tried to apply my caffe net to a batch of patches (size = test_batch_size) instead of a single patch at a time. Even then it is slow.
Below is my current solution, which is very slow. I would appreciate any suggestions other than down-sampling my test image to speed this up.
Current solution that is very slow:
def detectObjects(net, input_file, output_file):

    # read input image
    inputImage = plt.imread(input_file)

    # get test_batch_size and patch_size used for cnn net
    test_batch_size = net.blobs['data'].data.shape[0]
    patch_size = net.blobs['data'].data.shape[2]

    # collect all patches
    w = np.int(patch_size / 2)
    num_patches = (inputImage.shape[0] - patch_size) * \
                  (inputImage.shape[1] - patch_size)

    patches = np.zeros((patch_size, patch_size, num_patches))
    patch_indices = np.zeros((num_patches, 2), dtype='int64')

    count = 0
    for i in range(w + 1, inputImage.shape[0] - w):
        for j in range(w + 1, inputImage.shape[1] - w):
            # store patch center index
            patch_indices[count, :] = [i, j]
            # store patch
            patches[:, :, count] = \
                inputImage[(i - w):(i + w + 1), (j - w):(j + w + 1)]
            count += 1

    print "Extracted %s patches" % num_patches

    # Classify patches using cnn and write result to output image
    outputImage = np.zeros_like(inputImage)
    outputImageFlat = np.ravel(outputImage)

    pad_w = test_batch_size - num_patches % test_batch_size

    patches = np.pad(patches, ((0, 0), (0, 0), (0, pad_w)),
                     'constant')
    patch_indices = np.pad(patch_indices, ((0, pad_w), (0, 0)),
                           'constant')

    start_time = time.time()

    for i in range(0, num_patches, test_batch_size):

        # get current batch of patches
        cur_pind = patch_indices[i:i + test_batch_size, :]
        cur_patches = patches[:, :, i:i + test_batch_size]
        cur_patches = np.expand_dims(cur_patches, 0)
        cur_patches = np.rollaxis(cur_patches, 3)

        # apply cnn on current batch of images
        net.blobs['data'].data[...] = cur_patches
        output = net.forward()
        prob_obj = output['prob'][:, 1]

        if i + test_batch_size > num_patches:
            # remove padded part
            num_valid = num_patches - i
            prob_obj = prob_obj[0:num_valid]
            cur_pind = cur_pind[0:num_valid, :]

        # set output
        cur_pind_lin = np.ravel_multi_index((cur_pind[:, 0],
                                             cur_pind[:, 1]),
                                            outputImage.shape)
        outputImageFlat[cur_pind_lin] = prob_obj

    end_time = time.time()
    print 'Took %s seconds' % (end_time - start_time)

    # Save output
    skimage.io.imsave(output_file, outputImage * 255.0)
I was hoping that with the lines
net.blobs['data'].data[...] = cur_patches
output = net.forward()
caffe would classify all the patches in cur_patches in parallel using the GPU. I wonder why it is still slow.
I think what you are looking for is described in the section Casting a Classifier into a Fully Convolutional Network of the "net surgery" tutorial.
What this solution basically says is that instead of conv layers followed by an "InnerProduct" layer for classification, the "InnerProduct" layer can be transformed into an equivalent conv layer, resulting in a fully convolutional net that can process images of any size and output a prediction whose shape follows the input size.
Moving to a fully convolutional architecture will significantly reduce the number of redundant computations you are currently making, and should significantly speed up your process.
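For reference, a minimal sketch of the weight transplant that the tutorial performs (the layer names 'fc6'/'fc6-conv' and file names are placeholders; your deploy_full_conv.prototxt must declare convolution layers with shapes equivalent to the original InnerProduct layers):
import caffe

net = caffe.Net('deploy.prototxt', 'net.caffemodel', caffe.TEST)
net_full_conv = caffe.Net('deploy_full_conv.prototxt', 'net.caffemodel', caffe.TEST)

fc_layers = ['fc6', 'fc7', 'fc8']                   # placeholder layer names
conv_layers = ['fc6-conv', 'fc7-conv', 'fc8-conv']

for fc, conv in zip(fc_layers, conv_layers):
    # conv weights are (out, in, h, w) and fc weights are (out, in*h*w),
    # so the raw values can be copied flat
    net_full_conv.params[conv][0].data.flat = net.params[fc][0].data.flat
    net_full_conv.params[conv][1].data[...] = net.params[fc][1].data

net_full_conv.save('net_full_conv.caffemodel')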
Another possible direction for speedup is to approximate high-dimensional "InnerProduct" layers by a product of two lower-rank matrices using the truncated SVD trick.
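As a minimal numpy sketch of that trick (the weight shape and rank k are placeholders): factor the InnerProduct weight W into two thinner layers whose product approximates it, cutting the multiply-adds from out*in to k*(out+in).
import numpy as np

out_dim, in_dim, k = 4096, 9216, 256  # placeholder sizes and truncation rank
W = np.random.randn(out_dim, in_dim).astype(np.float32)  # stands in for real weights

U, S, Vt = np.linalg.svd(W, full_matrices=False)
W1 = S[:k, None] * Vt[:k, :]   # first, thinner layer: (k, in_dim)
W2 = U[:, :k]                  # second, thinner layer: (out_dim, k)

# x -> W2.dot(W1.dot(x)) approximates W.dot(x)
x = np.random.randn(in_dim).astype(np.float32)
approx = W2.dot(W1.dot(x))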
If you still use Caffe, I'd recommend trying OpenVINO to decrease inference time. OpenVINO optimizes your model by converting it to Intermediate Representation (IR), performing graph pruning, and fusing some operations into others while preserving accuracy. Then it uses vectorization at runtime. OpenVINO is optimized for Intel hardware, but it should work with any CPU.
The instructions on how to use it are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[caffe]
Use Model Optimizer to convert Caffe model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the Caffe model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (change the data_type flag). Run in the command line:
mo --input_model "model.caffemodel" --data_type FP32 --source_layout "[n,c,h,w]" --target_layout "[n,h,w,c]" --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. It seems you care about throughput, so I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement.
import cv2
import numpy as np
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT": "CUMULATIVE_THROUGHPUT"})  # alternatively THROUGHPUT
# Get input and output layers
input_layer_ir = compiled_model_ir.input(0)
output_layer_ir = compiled_model_ir.output(0)
# Resize and reshape input image
height, width = list(input_layer_ir.shape)[1:3]
input_image = cv2.resize(input_image, (width, height))[np.newaxis, ...]
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.