I want to compute the GLCM (gray-level co-occurrence matrix) using Python (NumPy).
I've written this code and it gives correct results for the four angles, but it is very slow: processing 1000 pictures of dimension 128x128 takes about 35 minutes.
def getGLCM(image, distance, direction):
    npPixel = np.array(image)  # image as a NumPy array
    glcm = np.zeros((255, 255), dtype=int)
    if direction == 1:  # direction 90° up ↑
        for i in range(distance, npPixel.shape[0]):
            for j in range(0, npPixel.shape[1]):
                glcm[npPixel[i, j], npPixel[i-distance, j]] += 1
    elif direction == 2:  # direction 45° up-right ↗
        for i in range(distance, npPixel.shape[0]):
            for j in range(0, npPixel.shape[1] - distance):
                glcm[npPixel[i, j], npPixel[i - distance, j + distance]] += 1
    elif direction == 3:  # direction 0° right →
        for i in range(0, npPixel.shape[0]):
            for j in range(0, npPixel.shape[1] - distance):
                glcm[npPixel[i, j], npPixel[i, j + distance]] += 1
    elif direction == 4:  # direction -45° down-right ↘
        for i in range(0, npPixel.shape[0] - distance):
            for j in range(0, npPixel.shape[1] - distance):
                glcm[npPixel[i, j], npPixel[i + distance, j + distance]] += 1
    return glcm
I need help to make this code faster
Thanks.
There is a bug in your code: you need to change the initialization of the gray-level co-occurrence matrix to glcm = np.zeros((256, 256), dtype=int), otherwise getGLCM will raise an IndexError whenever the image contains a pixel with intensity level 255.
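A minimal reproduction of that failure mode:

import numpy as np

glcm = np.zeros((255, 255), dtype=int)
glcm[255, 254] += 1  # IndexError: index 255 is out of bounds for axis 0 with size 255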
Here's a pure NumPy version that vectorizes the extraction of the pixel pairs (the accumulation loop remains, so the speedup is modest, as the timings below show):
def vectorized_glcm(image, distance, direction):
    img = np.array(image)
    glcm = np.zeros((256, 256), dtype=int)
    if direction == 1:    # 90° up
        first = img[distance:, :]
        second = img[:-distance, :]
    elif direction == 2:  # 45° up-right
        first = img[distance:, :-distance]
        second = img[:-distance, distance:]
    elif direction == 3:  # 0° right
        first = img[:, :-distance]
        second = img[:, distance:]
    elif direction == 4:  # -45° down-right
        first = img[:-distance, :-distance]
        second = img[distance:, distance:]
    for i, j in zip(first.ravel(), second.ravel()):
        glcm[i, j] += 1
    return glcm
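If you also want to drop that remaining Python loop, one option (a sketch, not benchmarked against the timings below) is np.add.at, which performs unbuffered in-place accumulation and therefore handles repeated index pairs correctly:

import numpy as np

def glcm_from_pairs(first, second):
    # first, second: the two offset views produced exactly as in vectorized_glcm above
    glcm = np.zeros((256, 256), dtype=int)
    np.add.at(glcm, (first.ravel(), second.ravel()), 1)
    return glcm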
If you are open to using other packages, I would strongly recommend scikit-image's greycomatrix. As shown below, it speeds up the calculation by two orders of magnitude.
Demo
In [93]: from skimage import data
In [94]: from skimage.feature import greycomatrix
In [95]: img = data.camera()
In [96]: a = getGLCM(img, 1, 1)
In [97]: b = vectorized_glcm(img, 1, 1)
In [98]: c = greycomatrix(img, distances=[1], angles=[-np.pi/2], levels=256)
In [99]: np.array_equal(a, b)
Out[99]: True
In [100]: np.array_equal(a, c[:, :, 0, 0])
Out[100]: True
In [101]: %timeit getGLCM(img, 1, 1)
240 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [102]: %timeit vectorized_glcm(img, 1, 1)
203 ms ± 3.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [103]: %timeit greycomatrix(img, distances=[1], angles=[-np.pi/2], levels=256)
1.46 ms ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
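Since the original goal is four directions for each of 1000 images, note that greycomatrix accepts several angles at once and returns them all in a single 4-D array. A sketch (the angle-to-direction mapping is my assumption, extrapolated from the -np.pi/2 case verified above; it is worth checking each one with np.array_equal as done in the demo):

angles = [-np.pi/2, -np.pi/4, 0, np.pi/4]  # presumed to match directions 1..4 for distance 1
glcms = greycomatrix(img, distances=[1], angles=angles, levels=256)
# glcms has shape (256, 256, 1, 4); e.g. glcms[:, :, 0, 0] should equal getGLCM(img, 1, 1)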
Related
I am trying to implement LDA using Gibbs sampling. In the step that updates each topic proportion I have a four-layer loop; it runs extremely slowly, and I am not sure how to improve the efficiency of this code.
N_W is the number of words, N_D is the number of documents, Z[i,j] is the topic assignment (one of K possible topics), X[i,j] is the count of the j-th word in the i-th document, and Beta is of dimension [K, N_W].
And the update is the following:
for k in range(K):  # iteratively for each topic update
    n_k = np.zeros(N_W)  # vocab size
    for w in range(N_W):
        for i in range(N_D):
            for j in range(N_W):
                # counting number of times a word is assigned to a topic
                n_k[w] += (X[i,j] == w) and (Z[i,j] == k)
    # update
    Beta[k,:] = np.random.dirichlet(gamma + n_k)
You could get rid of the last two for loops using NumPy's logical functions:
for k in range(K):  # iteratively for each topic update
    n_k = np.zeros(N_W)  # vocab size
    for w in range(N_W):
        a = np.logical_not(X-w)  # all X(i,j) == w become True, the others False
        b = np.logical_not(Z-k)  # all Z(i,j) == k become True, the others False
        c = np.logical_and(a,b)  # all (i,j) where X(i,j) == w and Z(i,j) == k are True, the others False
        n_k[w] = np.sum(c)  # sum all True values
Or even as a one-liner:
n_k = np.array([[np.sum(np.logical_and(np.logical_not(X[:N_D,:N_W]-w), np.logical_not(Z[:N_D,:N_W]-k))) for w in range(N_W)] for k in range(K)])
Each row of n_k can then be used for the Beta calculation. This version also applies N_W and N_D as bounds, in case they are smaller than the dimensions of X and Z.
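For completeness, a small usage sketch of that last step (my own illustration, reusing the gamma from the question):

# n_k has shape (K, N_W); row k feeds the Dirichlet update for topic k
Beta = np.array([np.random.dirichlet(gamma + n_k[k]) for k in range(K)])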
I did some testing with the following matrices:
import numpy as np
K = 90
N_D = 11
N_W = 12
Z = np.random.randint(0, K, size=(N_D, N_W))
X = np.random.randint(0, N_W, size=(N_D, N_W))
gamma = 1
The original code:
%%timeit
Beta = np.zeros((K, N_W))
for k in range(K):  # iteratively for each topic update
    n_k = np.zeros(N_W)  # vocab size
    for w in range(N_W):
        for i in range(N_D):
            for j in range(N_W):
                # counting number of times a word is assigned to a topic
                n_k[w] += (X[i,j] == w) and (Z[i,j] == k)
    # update
    Beta[k,:] = np.random.dirichlet(gamma + n_k)
865 ms ± 8.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Then vectorising only the inner two loops:
%%timeit
Beta = np.zeros((K, N_W))
for k in range(K):  # iteratively for each topic update
    n_k = np.zeros(N_W)  # vocab size
    for w in range(N_W):
        n_k[w] = np.sum((X == w) & (Z == k))
    # update
    Beta[k,:] = np.random.dirichlet(gamma + n_k)
21.6 ms ± 542 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Finally, with some creative use of broadcasting and by hoisting the common term out of the loop:
%%timeit
Beta = np.zeros((K, N_W))
w = np.arange(N_W)
X_eq_w = np.equal.outer(X, w)
for k in range(K):  # iteratively for each topic update
    n_k = np.sum(X_eq_w & (Z == k)[:, :, None], axis=(0, 1))
    # update
    Beta[k,:] = np.random.dirichlet(gamma + n_k)
4.6 ms ± 92.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The trade-off here is between speed and memory. For the shapes I used this was not so memory-intensive, but the intermediate three-dimensional arrays I built in the last solution could get quite large.
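To make that trade-off concrete, a rough back-of-the-envelope check (the larger shapes are hypothetical, not from the question; NumPy booleans take one byte each):

N_D, N_W = 11, 12
print(N_D * N_W * N_W, "bytes")      # 1584 bytes for the shapes tested above
N_D, N_W = 5000, 20000               # a hypothetical larger corpus
print(N_D * N_W * N_W / 1e9, "GB")   # 2000 GB: the (N_D, N_W, N_W) intermediate no longer fits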
scipy.spatial provides the Delaunay class, and its documentation includes an example of how to calculate barycentric coordinates.
Following that example, the code below calculates barycentric coordinates using a loop.
import numpy as np
from scipy.spatial import Delaunay

points = np.array([(0,0),(0,1),(1,0),(1,1)])
samples = np.array([(0.5,0.5),(0,0),(0.1,0.1)])
dim = len(points[0])  # determine the dimension of the samples
simp = Delaunay(points)  # create simplexes for the defined points
s = simp.find_simplex(samples)  # find the corresponding simplex for each sample
b0 = np.zeros((len(samples),dim))  # reserve space for each barycentric coordinate
for ii in range(len(samples)):
    b0[ii,:] = simp.transform[s[ii],:dim].dot((samples[ii] - simp.transform[s[ii],dim]).transpose())
coord = np.c_[b0, 1 - b0.sum(axis=1)]
This is fine for a short list of samples, but for very large lists of samples the performance is poor. How can this be modified to take advantage of vectorized math in numpy/scipy to improve performance?
Consider the following modification (for-loop replaced with numpy methods):
import numpy as np
import scipy.spatial as ssp

def f_1(points, samples):
    """ original """
    dim = len(points[0])
    simp = ssp.Delaunay(points)
    s = simp.find_simplex(samples)
    b0 = np.zeros((len(samples), dim))
    for ii in range(len(samples)):
        b0[ii, :] = simp.transform[s[ii], :dim].dot(
            (samples[ii] - simp.transform[s[ii], dim]).transpose())
    coord = np.c_[b0, 1 - b0.sum(axis=1)]
    return coord

def f_2(points, samples):
    """ modified """
    simp = ssp.Delaunay(points)
    s = simp.find_simplex(samples)
    b0 = (simp.transform[s, :points.shape[1]].transpose([1, 0, 2]) *
          (samples - simp.transform[s, points.shape[1]])).sum(axis=2).T
    coord = np.c_[b0, 1 - b0.sum(axis=1)]
    return coord
Test case:
import itertools

N = 100
points = np.array(list(itertools.product(range(N), repeat=2)))
samples = np.random.rand(100_000, 2) * N
Result:
%timeit f_1(points, samples)
712 ms ± 2.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2(points, samples)
422 ms ± 809 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
With the modified version, the line simp.find_simplex(samples) accounts for about 95% of the running time, so I guess there is nothing more to gain from vectorization. To improve performance further you would need another implementation of the find_simplex method, or another approach to the problem.
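A quick way to verify that on your own data (a sketch; no numbers shown since they depend on the machine):

simp = ssp.Delaunay(points)
%timeit simp.find_simplex(samples)   # compare against %timeit f_2(points, samples)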
I'm generating a few filters for implementing Fast Fourier Transform blur and sharpen operations on images. The filters are generated correctly; however, the computation takes very long.
The way I'm currently generating the filters is by iterating over the dimensions of the desired filter, item by item. I understand that I need to use NumPy to solve this problem, but I can't figure out how exactly. Here is my code for generating the Gaussian filter:
def gaussian_filter(mode, size, cutoff):
    filterImage = np.zeros(size, np.float64)
    cutoffTerm = 2 * (cutoff ** 2)
    v = np.asarray([size[0] // 2, size[1] // 2])
    for px in range(0, size[0]):
        for py in range(0, size[1]):
            u = np.asarray([px, py])
            Duv = np.linalg.norm(u - v)
            distance = -1 * (Duv ** 2)
            result = pow(np.e, distance / cutoffTerm)
            if mode == 'low':
                filterImage.itemset((px, py), result)
            elif mode == 'high':
                filterImage.itemset((px, py), 1 - result)
    return filterImage
Generating the filter of size 1920 x 1080 takes 70.36 seconds, which is completely unacceptable. Any ideas would be much appreciated.
Here's a vectorized one leveraging broadcasting -
def gaussian_filter_vectorized(mode, size, cutoff):
    cutoffTerm = 2 * (cutoff ** 2)
    v = np.asarray([size[0] // 2, size[1] // 2])
    I,J = np.ogrid[:size[0],:size[1]]
    p,q = I-v[0],J-v[1]
    Dsq = p**2 + q**2
    d = -1 * Dsq
    R = np.power(np.e,d/cutoffTerm)
    if mode == 'low':
        return R
    elif mode == 'high':
        return 1-R
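A minor stylistic note (my variant, not what was timed below): np.power(np.e, x) is simply np.exp(x), so the core step could equivalently be written as:

R = np.exp(-(p**2 + q**2) / cutoffTerm)  # identical result, arguably clearer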
Timings on big sizes -
In [80]: N = 100
...: %timeit gaussian_filter(mode='low', size=(N,N), cutoff=N)
...: %timeit gaussian_filter_vectorized(mode='low', size=(N,N), cutoff=N)
10 loops, best of 3: 65.2 ms per loop
1000 loops, best of 3: 225 µs per loop
In [81]: N = 1000
...: %timeit gaussian_filter(mode='low', size=(N,N), cutoff=N)
...: %timeit gaussian_filter_vectorized(mode='low', size=(N,N), cutoff=N)
1 loop, best of 3: 6.5 s per loop
10 loops, best of 3: 29.8 ms per loop
200x+ speedups!
Leverage numexpr on large data computations for further perf. boost
When working with large data, we can also use the numexpr module, which supports multi-core processing when the intended operations can be expressed as arithmetic ones. To solve our case, we can replace the steps Dsq = p**2 + q**2 and R = np.power(np.e, d/cutoffTerm) with their numexpr equivalents, using the numexpr.evaluate function.
So, we would end up with something like this -
import numexpr as ne

def gaussian_filter_vectorized_numexpr(mode, size, cutoff):
    cutoffTerm = 2 * (cutoff ** 2)
    I,J = np.ogrid[:size[0],:size[1]]
    v0,v1 = size[0] // 2, size[1] // 2
    p,q = I-v0,J-v1
    E = np.e
    if mode == 'low':
        return ne.evaluate('E**(-1*(p**2+q**2)/cutoffTerm)')
    elif mode == 'high':
        return ne.evaluate('1-E**(-1*(p**2+q**2)/cutoffTerm)')
Timings on 1920x1080 sized image -
In [2]: M,N=1920,1080
...: %timeit gaussian_filter(mode='low', size=(M,N), cutoff=N)
...: %timeit gaussian_filter_vectorized(mode='low', size=(M,N), cutoff=N)
...: %timeit gaussian_filter_vectorized_numexpr(mode='low',size=(M,N),cutoff=N)
1 loop, best of 3: 13.9 s per loop
10 loops, best of 3: 63.3 ms per loop
100 loops, best of 3: 9.48 ms per loop
Close to 1500x speedup here!
This was with 8 threads, so with more threads available for compute it should improve further. See the related post on how to control numexpr's multi-core functionality; a minimal sketch follows below.
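For reference, this is roughly how the numexpr thread count can be inspected and capped (a sketch, assuming a reasonably recent numexpr):

import numexpr as ne

print(ne.detect_number_of_cores())  # how many cores numexpr has detected
ne.set_num_threads(4)               # cap the worker pool at 4 threads
# alternatively, set the NUMEXPR_NUM_THREADS environment variable before importing numexpr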
A few days ago I tried to find a Python implementation of Daugman's iris detection algorithm. I found only one repo on GitHub, but the implementation was very slow, around 46 s on my laptop. So I rewrote everything according to this formula:
The final result is ~85 times faster (~530 ms on my laptop), but it is still not fast enough for real-time video processing (at 10 fps, at least).
I've read these Stack Overflow topics (1, 2) and tried to vectorize the code.
I've tested map(), np.vectorize() and np.fromiter(), but they were no faster than my current solution (some were even slower), and I am out of ideas.
Is there a way to vectorize the code below, or do I need to use C extensions or try running it on PyPy?
My solution:
import itertools

import cv2
import numpy as np

def daugman(center, start_r, gray_img):
    """return maximal intense radius for given center
    center -- tuple(x, y)
    start_r -- int
    gray_img -- grayscale picture as np.array(), it should be square
    """
    # get separate coordinates
    x, y = center
    # get img dimensions
    h, w = gray_img.shape
    # for calculation convenience
    img_shape = np.array([h, w])
    c = np.array(center)
    # define some other vars
    tmp = []
    mask = np.zeros_like(gray_img)
    # for every radius in range
    # we are presuming that iris will be no bigger than 1/3 of picture
    for r in range(start_r, int(h/3)):
        # draw circle on mask
        cv2.circle(mask, center, r, 255, 1)
        # get pixels from original image
        radii = gray_img & mask  # it is faster than np or cv2
        # normalize
        tmp.append(radii[radii > 0].sum()/(2*3.1415*r))
        # refresh mask
        mask.fill(0)
    # calculate delta of radius intensity
    tmp = np.array(tmp)
    tmp = tmp[1:] - tmp[:-1]
    # apply gaussian filter
    tmp = abs(cv2.GaussianBlur(tmp[:-1], (1, 5), 0))
    # get maximum value
    idx = np.argmax(tmp)
    # return value, center coords, radius
    return tmp[idx], [center, idx + start_r]
def find_iris(gray, start_r):
    """Apply daugman() on every pixel in the calculated image slice
    gray -- grayscale img as np.array()
    start_r -- initial radius as int
    Selection of the image slice guarantees that every
    radius will be drawn within the image borders, so we don't need to check it (speed up)
    To speed up the whole process we need to pregenerate all centers for detection
    """
    _, s = gray.shape
    # reduce step for better accuracy
    # 's/3' is the maximum radius of a daugman() search
    a = range(0 + int(s/3), s - int(s/3), 3)
    all_points = list(itertools.product(a, a))
    values = []
    coords = []
    for p in all_points:
        tmp = daugman(p, start_r, gray)
        if tmp is not None:
            val, circle = tmp
            values.append(val)
            coords.append(circle)
    # return the radius with the biggest intensity delta on the image
    # [(xc, yc), radius]
    return coords[np.argmax(values)]
UPD: image for testing
UPD 2: I've tried to create a tensor with all possible x, y, r values and map a function over it -- it is not faster.
Tensor creation:
# prepare img
img = cv2.imread('eye.jpg')
img = img[20:130, 20:130]
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
start_r = 10
# prepare some vars
h, w = gray.shape
img_shape = np.array([h, w])
mask = np.zeros_like(gray)
# generate points
_, s = gray.shape
a = range(0 + int(s/3), s - int(s/3), 3)
b = range(start_r, int(s/3))
all_points = list(itertools.product(a, a, b))
all_points_arr = np.array(all_points)
Rewritten daugman() for this case:
def daugman_6(point, gray_img=gray):
    """return maximal intense radius for a given (x, y, r) point
    point -- tuple(x, y, r)
    gray_img -- grayscale picture as np.array(), it should be square
    """
    # get separate coordinates
    x, y, r = point
    # we are presuming that iris will be no bigger than 1/3 of picture
    # draw circle on mask
    cv2.circle(mask, (x, y), r, 255, 1)
    # get pixels from original image
    radii = gray_img & mask  # it is faster than np or cv2
    # refresh mask
    mask.fill(0)
    return radii[radii > 0].sum()/(2*3.1415*r)
Results on server with Core i9:
# iterate via numpy array with for
%%timeit
[daugman_6(i) for i in all_points_arr]
#80.6 ms ± 2.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# np.fromiter() on numpy array
%%timeit
np.fromiter((daugman_6(p) for p in all_points_arr), dtype=np.float32)
#82.9 ms ± 3.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# iteration over list of tuples
%%timeit
[daugman_6(i) for i in all_points]
#70 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# original implementation
%%timeit
find_iris(gray, 10)
#71.6 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Per https://stackoverflow.com/a/48981834/1840471, this is an implementation of the weighted Gini coefficient in Python:
import numpy as np

def gini(x, weights=None):
    if weights is None:
        weights = np.ones_like(x)
    # Calculate mean absolute deviation in two steps, for weights.
    count = np.multiply.outer(weights, weights)
    mad = np.abs(np.subtract.outer(x, x) * count).sum() / count.sum()
    rmad = mad / np.average(x, weights=weights)
    # Gini equals half the relative mean absolute deviation.
    return 0.5 * rmad
This is clean and works well for medium-sized arrays, but as warned in its initial suggestion (https://stackoverflow.com/a/39513799/1840471) it's O(n²). On my computer that means it breaks after ~20k rows:
n = 20000 # Works, 30000 fails.
gini(np.random.rand(n), np.random.rand(n))
Can this be adjusted to work for larger datasets? Mine is ~150k rows.
Here is a version that is much faster than the one you provided above; it also uses a simplified formula for the unweighted case to get even faster results there.
import numpy as np

def gini(x, w=None):
    # The rest of the code requires numpy arrays.
    x = np.asarray(x)
    if w is not None:
        w = np.asarray(w)
        sorted_indices = np.argsort(x)
        sorted_x = x[sorted_indices]
        sorted_w = w[sorted_indices]
        # Force float dtype to avoid overflows
        cumw = np.cumsum(sorted_w, dtype=float)
        cumxw = np.cumsum(sorted_x * sorted_w, dtype=float)
        return (np.sum(cumxw[1:] * cumw[:-1] - cumxw[:-1] * cumw[1:]) /
                (cumxw[-1] * cumw[-1]))
    else:
        sorted_x = np.sort(x)
        n = len(x)
        cumx = np.cumsum(sorted_x, dtype=float)
        # The above formula, with all weights equal to 1, simplifies to:
        return (n + 1 - 2 * np.sum(cumx) / cumx[-1]) / n
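A quick sanity check of that simplification (my own usage sketch): passing an explicit weight vector of ones should reproduce the unweighted branch up to floating-point error.

x = np.random.rand(1000)
print(gini(x), gini(x, np.ones_like(x)))  # the two values should agree closely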
Here is some test code to check we get (mostly) the same results:
>>> x = np.random.rand(1000000)
>>> w = np.random.rand(1000000)
>>> gini_max_ghenis(x, w)
0.33376310938610521
>>> gini(x, w)
0.33376310938610382
But the speed is very different:
%timeit gini(x, w)
203 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit gini_max_ghenis(x, w)
55.6 s ± 3.35 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you remove the pandas ops from the function, it is already much faster:
%timeit gini_max_ghenis_no_pandas_ops(x, w)
1.62 s ± 75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you want to squeeze out the last drop of performance you could use numba or cython, but that would gain only a few percent, because most of the time is spent sorting.
%timeit ind = np.argsort(x); sx = x[ind]; sw = w[ind]
180 ms ± 4.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Edit: gini_max_ghenis is the code used in Max Ghenis' answer (the pandas-based version shown below).
Adapting the StatsGini R function from here:
import numpy as np
import pandas as pd

def gini(x, w=None):
    # Array indexing requires reset indexes.
    x = pd.Series(x).reset_index(drop=True)
    if w is None:
        w = np.ones_like(x)
    w = pd.Series(w).reset_index(drop=True)
    n = x.size
    wxsum = sum(w * x)
    wsum = sum(w)
    sxw = np.argsort(x)
    sx = x[sxw] * w[sxw]
    sw = w[sxw]
    pxi = np.cumsum(sx) / wxsum
    pci = np.cumsum(sw) / wsum
    g = 0.0
    for i in np.arange(1, n):
        g = g + pxi.iloc[i] * pci.iloc[i - 1] - pci.iloc[i] * pxi.iloc[i - 1]
    return g
This works for large vectors, at least up to 10M rows:
n = int(1e7)
gini(np.random.rand(n), np.random.rand(n)) # Takes ~15s.
It also produces the same result as the function provided in the question; for example, it gives 0.2553 for:
gini(np.array([3, 1, 6, 2, 1]), np.array([4, 2, 2, 10, 1]))
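As a cross-check (my own addition, not part of either answer): the cumulative-sum gini() from the earlier answer in this thread returns the same value on this example, 228/893 ≈ 0.2553. Renaming one of the two same-named functions is assumed:

# assumption: the cumsum-based implementation above was saved as gini_cumsum
x = np.array([3, 1, 6, 2, 1])
w = np.array([4, 2, 2, 10, 1])
print(gini_cumsum(x, w), gini(x, w))  # both print ≈ 0.2553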