I have a function in Python that creates an image difference hash for each file and stores the results in a dict:
import glob
import dhash
from alive_progress import alive_bar
from wand.image import Image

def get_photo_hashes(dir='./'):
    file_list = {}
    imgs = glob.glob(dir + '*.jpg')
    total = len(imgs)
    with alive_bar(total) as bar:
        for i in imgs:
            with Image(filename=i) as image:
                row, col = dhash.dhash_row_col(image)
                hash_val = dhash.format_hex(row, col)
                file_list[i] = hash_val
                bar()
    return file_list
The performance of hashing a folder of 10,000 JPEG images (500 kB - 1 MB each) is surprisingly slow, around 2 hashes per second. How can I improve the performance of this function? Thanks.
Multiprocessing would be ideal for this, as the hashing is CPU intensive.
Here's what I suggest:
import dhash
import glob
from wand.image import Image
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager

PATH = '*.jpg'

def makehash(t):
    filename, d = t
    with Image(filename=filename) as image:
        row, col = dhash.dhash_row_col(image)
        d[filename] = dhash.format_hex(row, col)

def main():
    with Manager() as manager:
        d = manager.dict()
        with ProcessPoolExecutor() as executor:
            executor.map(makehash, [(jpg, d) for jpg in glob.glob(PATH)])
        print(d)

if __name__ == '__main__':
    main()
Some stats:
I have a folder containing 129 JPGs. The average size of each file is over 12 MB. The net processing time is ~19 s.
I like @JCaesar's answer a lot, and decided to have a play with it. I created 1,000 JPEGs of around 500 kB each with:
parallel magick -size 640x480 xc: +noise random {}.jpg ::: {1..1000}
Then I tried the code from his answer using Wand, and got 21.3 s for 1,000 images. I then switched to PIL and, using the same images, the time dropped to 9.6 s. I then had a think and realised that the perceptual hash algorithm converts the image to greyscale and shrinks it to 8x8 pixels, and that the JPEG library has a "shrink-on-load" feature which you can use in PIL by calling Image.draft(mode, size). That reduces the time to load and also the amount of I/O needed. Enabling that feature further reduces the time to 6 s for the same images. The code looks like this:
#!/usr/bin/env python3
# https://stackoverflow.com/a/70709538/2836621

import dhash
import glob
from PIL import Image
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager

PATH = '*.jpg'

def makehash(t):
    filename, d = t
    with Image.open(filename) as image:
        image.draft('L', (32, 32))
        row, col = dhash.dhash_row_col(image)
        d[filename] = dhash.format_hex(row, col)

def main():
    with Manager() as manager:
        d = manager.dict()
        with ProcessPoolExecutor() as executor:
            executor.map(makehash, [(jpg, d) for jpg in glob.glob(PATH)])
        print(d)

if __name__ == '__main__':
    main()
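A small variation that may also be worth knowing about: ProcessPoolExecutor.map already returns the workers' results, so the Manager dict is not strictly needed. A sketch of the same idea that builds the dict from returned (filename, hash) pairs instead (untimed here, so no claim that it is faster):

import dhash
import glob
from PIL import Image
from concurrent.futures import ProcessPoolExecutor

PATH = '*.jpg'

def makehash(filename):
    # Same per-file work as above, but the result is returned to the parent
    # process instead of being written into a shared Manager dict
    with Image.open(filename) as image:
        image.draft('L', (32, 32))  # shrink-on-load, as in the answer above
        row, col = dhash.dhash_row_col(image)
        return filename, dhash.format_hex(row, col)

def main():
    with ProcessPoolExecutor() as executor:
        d = dict(executor.map(makehash, glob.glob(PATH)))
    print(d)

if __name__ == '__main__':
    main()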
Let us suppose we have a table with the following dimensions:
print(metadata.shape)  # (8732, 8)
Let us suppose we want to read slice_file_name for each row (and then read the sound files from the drive) and extract mel frequencies:
import librosa
import numpy as np

def feature_extractor(file_name):
    audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
    return mfccs_scaled_features
If I use the following loop:
import os
from tqdm import tqdm

extracted_features = []
for index_num, row in tqdm(metadata.iterrows()):
    file_name = os.path.join(os.path.abspath(Base_Directory), str(row["slice_file_name"]))
    final_class_labels = row["class"]
    data = feature_extractor(file_name)
    extracted_features.append([data, final_class_labels])
it takes the following amount of time in total:
3555it [21:15, 2.79it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1323
n_fft, y.shape[-1]
8326it [48:40, 3.47it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1103
n_fft, y.shape[-1]
8329it [48:41, 3.89it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1523
n_fft, y.shape[-1]
8732it [50:53, 2.86it/s]
How can I optimize this code to do the same thing in less time? Is it possible?
You could try running the feature extractor in parallel; this would give you a new column in your dataframe with the mfccs_scaled_features.
import os
import librosa
import numpy as np
from pandarallel import pandarallel

pandarallel.initialize()

PATH = os.path.abspath(Base_Directory)

def feature_extractor(file_name):
    # If using Windows, you may need to put the imports here instead
    # import librosa
    # import numpy as np
    # import os
    file_name = os.path.join(PATH, file_name)
    audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
    return mfccs_scaled_features

metadata['mfccs_scaled_features'] = metadata['slice_file_name'].parallel_apply(feature_extractor)
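If pandarallel is not an option, a plain concurrent.futures version of the same idea might look like this (a sketch, assuming the metadata frame and the feature_extractor above, which joins PATH with the file name itself):

from concurrent.futures import ProcessPoolExecutor

if __name__ == '__main__':
    file_names = metadata['slice_file_name'].astype(str).tolist()
    with ProcessPoolExecutor() as executor:
        # Each worker runs feature_extractor on one file; results come back in input order
        features = list(executor.map(feature_extractor, file_names))
    metadata['mfccs_scaled_features'] = features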
To train an image classification model I'm loading the input data as NumPy arrays; I deal with thousands of images. Currently, I'm looping through each image and converting it into a NumPy array as shown below.
import glob
import cv2
import numpy as np
from time import time

tem_arr_list = []
images_list = glob.glob(r'C:\Datasets\catvsdogs\cat\*.jpg')

start = time()
for idx, image_path in enumerate(images_list):
    temp_arr = np.array(cv2.imread(image_path))
    # print(temp_arr.shape)
    tem_arr_list.append(temp_arr)
print("Total time taken {}".format(time() - start))
Running this method takes a lot of time when the data is huge. So I tried using a list comprehension as below:
tem_arr_list = [np.array(cv2.imread(image_path)) for image_path in images_list]
which is slightly quicker than looping but still not fast enough.
I'm looking for any other way to reduce the time of this operation. Any help or suggestion on this will be appreciated.
Use a multiprocessing pool to load the data in parallel. On my PC the CPU count is 16. I tried loading 100 images and below you can see the time taken.
import multiprocessing
import cv2
import glob
from time import time

def load_image(image_path):
    return cv2.imread(image_path)

if __name__ == '__main__':
    image_path_list = glob.glob('*.png')

    try:
        cpus = multiprocessing.cpu_count()
    except NotImplementedError:
        cpus = 2  # arbitrary default
    pool = multiprocessing.Pool(processes=cpus)

    start = time()
    images = pool.map(load_image, image_path_list)
    print("Total time taken using multiprocessing pool {} seconds".format(time() - start))

    images = []
    start = time()
    for image_path in image_path_list:
        images.append(load_image(image_path))
    print("Total time taken using for loop {} seconds".format(time() - start))

    start = time()
    images = [load_image(image_path) for image_path in image_path_list]
    print("Total time taken using list comprehension {} seconds".format(time() - start))
Output:
Total time taken using multiprocessing pool 0.2922379970550537 seconds
Total time taken using for loop 1.4935636520385742 seconds
Total time taken using list comprehension 1.4925990104675293 seconds
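Since the end goal is input data for training, note that the list returned by pool.map can be turned into a single NumPy array afterwards. A small follow-up sketch, assuming every image decodes to the same (height, width, channels) shape (np.stack raises an error otherwise):

import numpy as np

# images is the list returned by pool.map above
data = np.stack(images)
print(data.shape)  # (num_images, height, width, 3) for colour images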
I have a set of large TIFF images (3200 x 3200). I want to load them in a Jupyter notebook, compute averages and create a NumPy array. To make it concrete, each movie clip is composed of 10 TIFF images and there are several movie clips. I want to compute the average of the 10 frames for each movie, and return an array of size (# movie clips x 3200 x 3200).
import numpy as np
from tifffile import tifffile

def uselibtiff(f):
    tif = tifffile.imread(f)  # read the tiff file into a NumPy array
    return tif

def avgFrames(imgfile_Movie, im_shape):
    # calculating the sum of all frames of a movie;
    # imgfile_Movie is a list of tiff image files for a particular movie
    im = np.zeros(im_shape)
    for img_frame in imgfile_Movie:
        frame_im = uselibtiff(img_frame)
        im += frame_im
    return im

# imgfiles_byMovie is a 2D list
im_avg_all = np.zeros((num_movies, im_shape[0], im_shape[1]))
for n, imgfile_Movie in enumerate(imgfiles_byMovie):
    print(n)
    # start = timeit.default_timer()
    results = avgFrames(imgfile_Movie, im_shape) / num_frames
    # end = timeit.default_timer()
    # print(end - start)
    im_avg_all[n] = results
So far I've noticed that avgFrames takes about 3 seconds per movie (10 frames per movie). I wanted to speed things up and tried using multiprocessing in Python.
import multiprocessing as mp

print("Number of processors: ", mp.cpu_count())
pool = mp.Pool(mp.cpu_count())

results = [pool.apply(avgFrames, args=(imgfile_Movie, im_shape)) for imgfile_Movie in imgfiles_byMovie]

pool.close()
Unfortunately, the above code runs very slowly and I had to terminate it without seeing the results. What am I doing wrong here?
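For context on the code above: Pool.apply blocks until each call returns, so the list comprehension still processes the movies one at a time, only with extra inter-process overhead on top. A non-blocking sketch using Pool.starmap instead, assuming avgFrames, imgfiles_byMovie, im_shape and num_frames as defined earlier:

import multiprocessing as mp
import numpy as np

if __name__ == '__main__':
    with mp.Pool(mp.cpu_count()) as pool:
        # starmap unpacks each (imgfile_Movie, im_shape) tuple into avgFrames'
        # arguments and distributes the calls across the worker processes
        sums = pool.starmap(avgFrames, [(m, im_shape) for m in imgfiles_byMovie])
    im_avg_all = np.stack(sums) / num_frames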
I'm using Librosa and I have a memory problem.
I have a lot of audio files, let's say a hundred.
I process audio files one by one.
Each audio file is loaded by chunks of 1 minute.
Each chunk is processed before I move to the next chunk.
This way, I know that I never have more than 60s of audio in memory at a given time.
This allows me to avoid using too much memory during the whole process.
For some reason, the memory used by the process is growing over time.
Here is a simpler version of the code:
import librosa
import matplotlib.pyplot as plt
import os
import psutil

SAMPLING_RATE = 22050
N_FFT = 2048
HOP_LENGTH = 1024

def foo():
    y, sr = librosa.load("clip_04.wav", sr=SAMPLING_RATE, offset=600, duration=60)
    D = librosa.stft(y, n_fft=N_FFT, hop_length=HOP_LENGTH)
    spec_mag = abs(D)
    spec_db = librosa.amplitude_to_db(spec_mag)
    return 42  # I return a constant to make sure the memory is (or should be) released

def main():
    process = psutil.Process(os.getpid())
    array = []
    for i in range(100):
        foo()
        m = int(process.memory_info().rss / 1024**2)
        array.append(m)

    plt.figure()
    plt.plot(array)
    plt.xlabel('iterations')
    plt.ylabel('MB')
    plt.show()

if __name__ == '__main__':
    main()
Using this code, the memory usage keeps increasing with each iteration.
Is that normal? And if it is, is there a way to clear Librosa memory at each iteration?
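One general-purpose workaround, without diagnosing where the memory is actually being held, is to run each chunk in a short-lived worker process, so that everything the worker allocated is returned to the OS when it exits. A minimal sketch reusing the foo() defined above (maxtasksperchild=1 makes the pool replace the worker after every task):

import multiprocessing as mp

def run_chunk(_):
    return foo()  # foo() as defined above

if __name__ == '__main__':
    with mp.Pool(processes=1, maxtasksperchild=1) as pool:
        # each call to foo() runs in a fresh process, so its memory is
        # released back to the OS when that process exits
        results = pool.map(run_chunk, range(100))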
I would like to get the features of several images located in the same folder.
My code is as follows. Prerequisites (libraries needed):
import numpy as np
from PIL import Image
import glob
import cv2
import os
Definition of the folder where the images are located (around 6,000):
images_dir = "TrainImages"
Creation of a function that defines the different variables and computes them:
def get_data_from_image(image_path):
    cv_img = cv2.imread(image_path)
    (means, stds) = cv2.meanStdDev(cv_img)
    stats = np.concatenate([means, stds]).flatten()
    image_features_list = [stats.tolist()]
    return image_features_list
Creation of a variable that scans the folder and collects the image paths:
image_files = [x.path for x in os.scandir(images_dir)]
Creation of a loop:
i = 0
mylist = []
for i in range(4):  # I test only 4 images, could be more
    mylist.append(get_data_from_image(image_files[i]))
Running the stuff
image_features_list = get_data_from_image(image_files[i])
Look at the output
image_features_list
The output provides the features of only one image, instead of all the images located in the folder.
[Out]:
[[114.31548828125001,
139.148388671875,
139.57832682291667,
50.54138521536725,
53.82290182999255,
51.946187641459595]]
I would be grateful for a solution on how to get the features of all the images (not only one). To this effect, do not hesitate to correct the code.
Thanks and kindest regards
After some comments from friendly persons, here is some additional information for those who are interested in the answer: the output to look at is mylist.
mylist
[Out]:
[[[144.28788548752834,
151.28145691609978,
148.6195351473923,
51.50620316379085,
53.36979275398226,
52.2493589172815]],
[[56.220865079365076,
59.99653968253968,
60.28386507936508,
66.72797279655177,
65.24673515467009,
64.93141350917332]],
[[125.2066064453125,
118.1168994140625,
145.0827685546875,
68.95463582009148,
52.65138276425348,
56.68269683130363]],
[[114.31548828125001,
139.148388671875,
139.57832682291667,
50.54138521536725,
53.82290182999255,
51.946187641459595]]]
Thanks for your help. It is a great forum here!
Try this approach and tell me if it's successful:
import os, os.path
import numpy as np
from PIL import Image
import cv2

def get_data_from_image(image_path):
    cv_img = cv2.imread(image_path)
    (means, stds) = cv2.meanStdDev(cv_img)
    stats = np.concatenate([means, stds]).flatten()
    image_features_list = [stats.tolist()]
    return image_features_list

images_dir = 'C:\\Users\\User\\Directory\\TrainImages\\'

images_names = []
with os.scandir(images_dir) as dirs:
    for entry in dirs:
        images_names.append(entry.name)

for image in images_names:
    path = images_dir + image
    image_features_list = get_data_from_image(path)
    print(image_features_list)
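If you want to keep all the features rather than only print them, a small extension of the loop above can collect them in one list (all_features is just a hypothetical name for illustration):

all_features = []  # hypothetical container: one feature list per image
for image in images_names:
    path = images_dir + image
    all_features.append(get_data_from_image(path))

print(len(all_features))  # number of images processed
print(all_features[0])    # features of the first image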