I need my screenshot function to be as fast as possible, but right now every call takes about 0.2 seconds.
This is the function:
def get_screenshot(self, width, height):
    image = self.screen_capture.grab(self.monitor)
    image = Image.frombuffer('RGB', image.size, image.bgra, 'raw', 'BGRX')
    image = image.resize((int(width), int(height)), Image.BICUBIC)  # Resize to 0.8 of the original picture size
    image = np.array(image)
    image = np.swapaxes(image, 0, 1)
    # The code below is supposed to replace each black pixel ([0,0,0]) with the color [0,0,1]
    # r1,g1,b1 = [0,0,0] and r2,g2,b2 = [0,0,1]
    red, green, blue = image[:, :, 0], image[:, :, 1], image[:, :, 2]
    mask = (red == r1) & (green == g1) & (blue == b1)
    image[:, :, :3][mask] = [r2, g2, b2]
    return image
Do you see any changes I could make to speed up the function?
Edit: Some details that I forgot to mention:
My screen dimensions are 1920*1080
This function is a part of a live stream project that I am currently working on. The solution that Carlo has suggested below is not appropriate in this case because the remote computer will not be synchronized with our computer screen.
As your code is incomplete, I can only guess what might help, so here are a few thoughts...
I started with a 1200x1200 image, because I don't know how big yours is, and reduced it by a factor of 0.8x to 960x960 because of a comment in your code.
My ideas for speeding it up are based on either using a different interpolation method, or using OpenCV which is highly optimised SIMD code. Either, or both, may be appropriate, but as I don't know what your images look like, only you can say.
So, here we go, first with PIL resize() and different interpolation methods:
# Open image with PIL
i = Image.open('start.png').convert('RGB')
In [91]: %timeit s = i.resize((960,960), Image.BICUBIC)
16.2 ms ± 28 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [92]: %timeit s = i.resize((960,960), Image.BILINEAR)
10.9 ms ± 87.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [93]: %timeit s = i.resize((960,960), Image.NEAREST)
440 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So, BILINEAR is 1.5x faster than BICUBIC and the real winner here is NEAREST at about 37x faster.
Now, converting to a Numpy array (as you are doing anyway) and using the highly optimised OpenCV SIMD code to resize:
# Now make into Numpy array for OpenCV methods
n = np.array(i)
In [100]: %timeit s = cv2.resize(n, (960,960), interpolation = cv2.INTER_CUBIC)
806 µs ± 9.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [101]: %timeit s = cv2.resize(n, (960,960), interpolation = cv2.INTER_LINEAR)
3.69 ms ± 29 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [102]: %timeit s = cv2.resize(n, (960,960), interpolation = cv2.INTER_AREA)
12.3 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [103]: %timeit s = cv2.resize(n, (960,960), interpolation = cv2.INTER_NEAREST)
692 µs ± 448 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
And the winner here looks like INTER_CUBIC, which is 20x faster than PIL's resize() with BICUBIC (INTER_NEAREST is marginally faster still, at lower quality).
Please try them all and see what works for you! Just remove the IPython magic %timeit at the start of the line and run what's left.
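Separately from the resize, the black-pixel replacement in the question can be written with a single mask over the channel axis instead of three per-channel comparisons. This is a minimal sketch, assuming the image is an (H, W, 3) uint8 array; replace_black is a hypothetical helper name, not part of the original code, and whether it is actually faster on your data is something to time yourself:

```python
import numpy as np

def replace_black(image, new_color=(0, 0, 1)):
    """Replace pure black pixels in place with new_color (hypothetical helper)."""
    # A pixel is black when none of its channels is non-zero.
    mask = ~image.any(axis=-1)
    image[mask] = new_color
    return image

demo = np.zeros((2, 2, 3), dtype=np.uint8)
demo[0, 0] = (10, 20, 30)   # one non-black pixel that must stay untouched
replace_black(demo)
```

The mask has shape (H, W), so the assignment broadcasts the three-element colour over every selected pixel.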
This is just an example of what I mean.
If it solves the problem, let me know.
You can create two different threads: one that takes the screenshots, and another that processes them later. Both add their results to a list.
This improves the speed of the get_screenshot function, but processing each image still takes as long as that part of the function needs to execute.
import threading
import time
# import your stuff (screen capture library, PIL.Image, numpy as np, ...)
class your_class(object):
    def __init__(self, width, height):
        self.width, self.height = width, height
        self.images = list()
        self.elaborated_images = list()
        # set up self.screen_capture and self.monitor here, before starting the threads
        # daemon threads stop when the main program exits
        threading.Thread(name="Take_Screen", target=self.get_screenshot, daemon=True).start()
        threading.Thread(name="Elaborate_Screen", target=self.elaborate_screenshot, daemon=True).start()
    def get_screenshot(self):
        while True:
            self.images.append(self.screen_capture.grab(self.monitor))
    def elaborate_screenshot(self):
        while True:
            if not self.images:
                time.sleep(0.001)  # nothing captured yet, wait for the capture thread
                continue
            image = self.images.pop(0)
            image = Image.frombuffer('RGB', image.size, image.bgra, 'raw', 'BGRX')
            image = image.resize((int(self.width), int(self.height)), Image.BICUBIC)  # Resize to 0.8 of the original picture size
            image = np.array(image)
            image = np.swapaxes(image, 0, 1)
            # The code below is supposed to replace each black pixel ([0,0,0]) with the color [0,0,1]
            # r1,g1,b1 = [0,0,0] and r2,g2,b2 = [0,0,1]
            red, green, blue = image[:, :, 0], image[:, :, 1], image[:, :, 2]
            mask = (red == r1) & (green == g1) & (blue == b1)
            image[:, :, :3][mask] = [r2, g2, b2]
            self.elaborated_images.append(image)
your_class(width, height)
Because I don't have your full code, I can't tailor this any better.
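One way to hand frames from the capture thread to the processing thread without polling a list is a bounded queue.Queue. The sketch below only illustrates that pattern: fake_grab and fake_process are stand-ins (assumptions) for the real screen grab and image processing, and capture_loop and process_loop are hypothetical names:

```python
import queue
import threading

def capture_loop(grab, frames, n_frames):
    """Producer: push n_frames captured frames, then a None sentinel."""
    for _ in range(n_frames):
        frames.put(grab())          # blocks while the queue is full
    frames.put(None)

def process_loop(process, frames, results):
    """Consumer: process frames until the sentinel arrives."""
    while True:
        frame = frames.get()        # blocks until a frame is available
        if frame is None:
            break
        results.append(process(frame))

# Stand-ins for the real screen grab and image processing (assumptions).
counter = iter(range(1000))
fake_grab = lambda: next(counter)
fake_process = lambda frame: frame * 2

frames = queue.Queue(maxsize=4)     # bounded so captures cannot pile up
results = []
producer = threading.Thread(target=capture_loop, args=(fake_grab, frames, 5))
consumer = threading.Thread(target=process_loop, args=(fake_process, frames, results))
producer.start()
consumer.start()
producer.join()
consumer.join()
print(results)
```

The bound on the queue makes the capture thread wait when processing falls behind, which keeps memory use flat in a long-running stream.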
I currently have sixteen images (A,B,C,D,E,F,G,...) which must be concatenated into one as part of a Tensorflow Dataset workflow. Each image is 128 x 128 and has the shape of (128, 128, 3). The final output should be a 512 x 512 image of shape (512,512,3). All of the images come from an image sequence, known as imgs_seq. This imgs_seq has the shape of (None, 128, 128, 3).
Right now, this is accomplished through the following code:
@tf.function
def glue_to_one(imgs_seq):
    first_row = tf.concat((imgs_seq[0], imgs_seq[1], imgs_seq[2], imgs_seq[3]), 0)
    second_row = tf.concat((imgs_seq[4], imgs_seq[5], imgs_seq[6], imgs_seq[7]), 0)
    third_row = tf.concat((imgs_seq[8], imgs_seq[9], imgs_seq[10], imgs_seq[11]), 0)
    fourth_row = tf.concat((imgs_seq[12], imgs_seq[13], imgs_seq[14], imgs_seq[15]), 0)
    img_glue = tf.stack((first_row, second_row, third_row, fourth_row), axis=1)
    img_glue = tf.reshape(img_glue, [512,512,3])
    return img_glue
It is suspected that this method is inefficient and is leading to a bottleneck.
A different approach would be to allocate a 512 x 512 tensor and then replace the elements. Would this be more efficient? How would it be done? Can you please recommend a better approach?
Simply use the tf.split method instead of writing that much code.
Your input seems to be a list of inputs:
def stack_and_concat(x):
    t = tf.split(x, 16, axis=0)
    t = tf.reshape(tf.stack([tf.concat(t[(i*4): 4 * (i+1)], axis=1) for i in range(4)], axis=2), (512,512,3))
    return t
stack_and_concat(x).shape
TensorShape([512, 512, 3])
For a thousand iterations, my method took 3.28 secs but yours took 10.35 secs.
You can make it about 3 times faster using something like this:
def glue_answer(imgs_seq):
    image = tf.reshape(imgs_seq, (4, 4, 128, 128, 3))
    image = tf.concat(image, axis=1)
    image = tf.concat(image, axis=1)
    return image
I tested the performance as follows:
def glue_to_one(imgs_seq):
    first_row = tf.concat((imgs_seq[0], imgs_seq[1], imgs_seq[2], imgs_seq[3]), 0)
    second_row = tf.concat((imgs_seq[4], imgs_seq[5], imgs_seq[6], imgs_seq[7]), 0)
    third_row = tf.concat((imgs_seq[8], imgs_seq[9], imgs_seq[10], imgs_seq[11]), 0)
    fourth_row = tf.concat((imgs_seq[12], imgs_seq[13], imgs_seq[14], imgs_seq[15]), 0)
    img_glue = tf.stack((first_row, second_row, third_row, fourth_row), axis=1)
    img_glue = tf.reshape(img_glue, [512,512,3])
    return img_glue
def glue_answer(imgs_seq):
    image = tf.reshape(imgs_seq, (4, 4, 128, 128, 3))
    image = tf.concat(image, axis=1)
    image = tf.concat(image, axis=1)
    return image
print("Method in question:")
%timeit -n 1000 -r 10 glue_to_one(imgs_seq)
print("Method in answer:")
%timeit -n 1000 -r 10 glue_answer(imgs_seq)
Output:
Method in question:
1.7 ms ± 212 µs per loop (mean ± std. dev. of 10 runs, 1,000 loops each)
Method in answer:
540 µs ± 28.8 µs per loop (mean ± std. dev. of 10 runs, 1,000 loops each)
tf.squeeze(tf.concat(tf.split(tf.concat(imgs_seq, axis=0),4),axis=1))
Testing:
imgs_seq = [ tf.random.normal(shape=(128,128,3)) for _ in range(16)]
out1 = glue_to_one(imgs_seq)
out2 = tf.squeeze(tf.concat(tf.split(tf.concat(imgs_seq, axis=0),4),axis=1))
#check whether both outputs are same.
np.testing.assert_allclose(out1.numpy(), out2.numpy())
%timeit glue_to_one(imgs_seq)
1.52 ms ± 123 µs per loop
%timeit tf.squeeze(tf.concat(tf.split(tf.concat(imgs_seq,axis=0),4),axis=1))
308 µs ± 13.6 µs per loop
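The same tiling can also be expressed as one reshape, one transpose and one reshape, with no concatenation at all. The sketch below uses NumPy purely to illustrate and check the index layout (glue_reference and glue_transpose are hypothetical names); tf.reshape and tf.transpose with perm=(1, 2, 0, 3, 4) take the same shapes and permutation, but this variant has not been timed against the answers above:

```python
import numpy as np

def glue_reference(imgs):
    """NumPy transcription of glue_to_one from the question."""
    cols = [np.concatenate(imgs[4 * k:4 * k + 4], axis=0) for k in range(4)]
    return np.stack(cols, axis=1).reshape(512, 512, 3)

def glue_transpose(imgs):
    """Same layout via one reshape, one transpose, one reshape."""
    grid = imgs.reshape(4, 4, 128, 128, 3)    # axes: (k, i, h, w, c)
    grid = grid.transpose(1, 2, 0, 3, 4)      # axes: (i, h, k, w, c)
    return grid.reshape(512, 512, 3)          # row = i*128 + h, col = k*128 + w

imgs = np.random.default_rng(0).normal(size=(16, 128, 128, 3)).astype(np.float32)
same = np.array_equal(glue_reference(imgs), glue_transpose(imgs))
```

Here image number 4*k + i lands in block row i and block column k, which is the arrangement the original glue_to_one produces.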
I have pure Python code here, apart from creating a NumPy array. My problem is that the result I get is completely wrong when I use @jit, but when I remove it the result is correct. Could anyone give me any tips on why this is?
@jit
def grayFun(image: np.array) -> np.array:
    gray_image = np.empty_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            gray = gray_image[i][j][0]*0.21 + gray_image[i][j][1]*0.72 + gray_image[i][j][2]*0.07
            gray_image[i][j] = (gray,gray,gray)
    gray_image = gray_image.astype("uint8")
    return gray_image
This will return a grayscale image using your conversion formula. Usually, you do not need to duplicate the channels; a grayscale image with shape (X,Y) can be used just like an image with shape (X,Y,3).
def gray(image):
    return image[:,:,0]*0.21+image[:,:,1]*0.72 + image[:,:,2]*0.07
This should work just fine with numba. @TimRoberts's answer is definitely fast, so you may just want to go with that implementation. But the biggest win is simply from vectorization. I'm sure others could find additional performance tweaks but at this point I think we've whittled down most of the runtime & issues:
# your implementation, but fixed so that `gray` is calculated from `image`
def grayFun(image: np.array) -> np.array:
    gray_image = np.empty_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            gray = image[i][j][0]*0.21 + image[i][j][1]*0.72 + image[i][j][2]*0.07
            gray_image[i][j] = (gray,gray,gray)
    gray_image = gray_image.astype("uint8")
    return gray_image
# a vectorized numpy version of your implementation
def grayQuick(image: np.array) -> np.array:
    return np.tile(
        np.expand_dims(
            (image[:, :, 0]*0.21 + image[:, :, 1]*0.72 + image[:, :, 2]*0.07), -1
        ),
        (1, 1, 3)
    ).astype(np.uint8)
# a parallelized implementation in numba
@numba.jit
def gray_numba(image: np.array) -> np.array:
    out = np.empty_like(image)
    for i in numba.prange(image.shape[0]):
        for j in numba.prange(image.shape[1]):
            gray = np.uint8(image[i, j, 0]*0.21 + image[i, j, 1]*0.72 + image[i, j, 2]*0.07)
            out[i, j, :] = gray
    return out
# a 2D solution leveraging @TimRoberts's speedup
def gray_2D(image):
    return image[:,:,0]*0.21+image[:,:,1]*0.72 + image[:,:,2]*0.07
I loaded a reasonably large image:
In [69]: img = matplotlib.image.imread(os.path.expanduser(
...: "~/Desktop/Screen Shot.png"
...: ))
...: image = (img[:, :, :3] * 256).astype('uint8')
...:
In [70]: image.shape
Out[70]: (1964, 3024, 3)
Now, running all four reveals a speedup from numba over the vectorized NumPy version, while the fastest is the 2D solution:
In [71]: %%timeit
...: grey = grayFun(image) # watch out - this takes ~21 minutes
...:
...:
2min 56s ± 1min 58s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [72]: %%timeit
...: grey_np = grayQuick(image)
...:
...:
556 ms ± 25.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [73]: %%timeit
...: grey = gray_numba(image)
...:
...:
246 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [74]: %%timeit
...: grey = gray_2D(image)
...:
...:
117 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note that numba will be noticeably slower on the first iteration, so the vectorized numpy solutions will significantly outperform numba if you're only doing this once. But if you're going to call the function repeatedly within the same python session numba is a good option. You could of course use numba for the 2D result to get a further speedup - I'm not sure if this would outperform numpy though.
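As a further illustration of the vectorized approach, the weighted channel sum can be written as a single matrix product. This is a minimal sketch, assuming an (H, W, 3) or (H, W, 4) uint8 input; gray_dot is a hypothetical name, the coefficients are the ones from the question, and it has not been benchmarked against the versions above:

```python
import numpy as np

WEIGHTS = np.array([0.21, 0.72, 0.07])   # same coefficients as in the question

def gray_dot(image):
    """Weighted channel sum as one matrix product; returns an (H, W) uint8 array."""
    # The product is computed in float64, then truncated like astype("uint8") above.
    return (image[..., :3] @ WEIGHTS).astype(np.uint8)

image = np.random.default_rng(0).integers(0, 256, size=(4, 5, 3), dtype=np.uint8)
gray = gray_dot(image)
# Only if a 3-channel result is really needed:
gray3 = np.repeat(gray[..., None], 3, axis=-1)
```

Because the result is truncated rather than rounded, individual pixels may differ by one grey level from a version that sums the three terms in a different order.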
I've got the following situation.
I have a list of coordinates (x, y, z), indexed by i, and have to compute all triples inside a cutoff sphere, such that r_ij and r_ik are both smaller than a cutoff value.
To do so, I compute a matrix r_ij that contains all pairwise distances.
To compute the triples, my idea is to construct an r_ijk matrix.
I've done this with a loop over the number of elements i:
import tensorflow as tf
r_ij = tf.reshape(tf.range(4*4), (4, 4))
r_ijk = []
for i in range(r_ij.shape[0]):  # one roll per element; the original used len(x), the length of the coordinate list
    r_ijk.append(tf.roll(r_ij, shift=-i, axis=1))
tf.stack(r_ijk)
I want to improve this code for two reasons.
Primarily because I assume that it could be fully vectorized.
But also because, to use this in my model, I need to alter it:
@tf.function
def get_triplets(full_r_ij, r_cut):
    r_ij = tf.norm(full_r_ij, axis=-1)  # Shape of full_r_ij is (n_timesteps, n_atoms, n_atoms, 3)
    n_atoms = tf.shape(r_ij)[1]
    r_ijk = r_ij[None]
    for atom in range(1, n_atoms):
        tf.autograph.experimental.set_loop_options(
            shape_invariants=[(r_ijk, tf.TensorShape([None, None, None, None]))]
        )
        tmp = tf.roll(r_ij, shift=-atom, axis=2)
        r_ijk = tf.concat([r_ijk, tmp[None]], axis=0)  # shape is (n_atoms, n_timesteps, n_atoms, n_atoms)
    r_ijk = tf.transpose(r_ijk, perm=(1, 0, 2, 3))
    r_ijk = tf.where(r_ijk == 0, tf.ones_like(r_ijk) * r_cut, r_ijk)
    intermediate_indices = tf.where(
        tf.math.logical_and(r_ijk[:, 0, None] == 3.0, r_ijk[:, 1:] == 3.0)
    )
    n_atoms = tf.cast(n_atoms, dtype=tf.int64)
    t, n, i, j = tf.unstack(intermediate_indices, axis=1)
    k = j + n + 1
    k = tf.where(k >= n_atoms, k - n_atoms, k)
    triples = tf.stack([t, i, j, k], axis=1)
    return triples
and use tf.autograph.experimental.set_loop_options because I am kind of looping over the r_ij tensor.
Is there a way to improve the first code example (or the second as well)?
I tested two further implementations using tf.vectorized_map and tf.map_fn, which both performed worse than the initial function I wrote. All tests were performed with r_ij = tf.random.normal((32, 150, 150))
@tf.function
def roll_loop(r_ij, n_atoms):
    r_ijk = r_ij[None]
    for atom in range(1, n_atoms):
        tf.autograph.experimental.set_loop_options(
            shape_invariants=[(r_ijk, tf.TensorShape([None, None, None, None]))]
        )
        tmp = tf.roll(r_ij, shift=-atom, axis=2)
        r_ijk = tf.concat([r_ijk, tmp[None]], axis=0)  # shape is (n_atoms, n_timesteps, n_atoms, n_atoms)
    return r_ijk
It took 129 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
@tf.function
def roll_vect(r_ij, n_atoms):
    r_ijk = tf.repeat(r_ij[None], repeats=n_atoms, axis=0)
    def roll(args):
        x, shift = args
        return tf.roll(x, shift=shift, axis=2)
    return tf.vectorized_map(roll, [r_ijk, tf.range(n_atoms)])
It took 225 ms ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
@tf.function
def roll_map(r_ij, n_atoms):
    r_ijk = tf.repeat(r_ij[None], repeats=n_atoms, axis=0)
    def roll(args):
        x, shift = args
        return tf.roll(x, shift=shift, axis=2)
    return tf.map_fn(roll, (r_ijk, tf.range(n_atoms)), fn_output_signature=tf.float32)
It took 327 ms ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So it seems like going for tf.function with python for loop is fastest (so far). All functions were compiled before testing.
EDIT:
Using tf.TensorArray seems to be the best way for this task.
I tested it with a few different inputs and it performs as well as, or even a little better than, tf.autograph.experimental.set_loop_options
@tf.function
def roll_loop(r_ij, n_atoms):
    r_ijk = tf.TensorArray(tf.float32, size=n_atoms)
    for atom in range(0, n_atoms):
        tmp = tf.roll(r_ij, shift=-atom, axis=2)
        r_ijk = r_ijk.write(atom, tmp)
    return r_ijk.stack()
It took 128 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
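A loop-free alternative is to build all rolled copies with one gather, because rolling by -s along an axis is the same as reading column (j + s) % n for every target column j. The sketch below checks that index arithmetic in NumPy; roll_gather is a hypothetical name, and the assumption is that tf.gather(r_ij, idx, axis=2) followed by tf.transpose would give the same layout in TensorFlow. It has not been timed against the versions above:

```python
import numpy as np

def roll_gather(r_ij):
    """Stack of np.roll(r_ij, -s, axis=2) for s = 0..n-1, built with one fancy index."""
    n = r_ij.shape[2]
    # idx[s, j] = (j + s) % n is the source column for shift s and target column j.
    idx = (np.arange(n)[None, :] + np.arange(n)[:, None]) % n
    gathered = r_ij[:, :, idx]               # (n_timesteps, n_atoms, n_shifts, n_atoms)
    return gathered.transpose(2, 0, 1, 3)    # (n_shifts, n_timesteps, n_atoms, n_atoms)

r_ij = np.random.default_rng(0).normal(size=(3, 5, 5))
expected = np.stack([np.roll(r_ij, -s, axis=2) for s in range(5)])
matches = np.array_equal(roll_gather(r_ij), expected)
```

The result materialises n_atoms copies of r_ij at once, so memory use grows exactly as in the loop versions.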
I want to calculate a rolling quantile of a large 2D matrix with dimensions (1e6, 1e5), column wise. I am looking for the fastest way, since I need to perform this operation thousands of times, and it's very computationally expensive. For experiments window=1000 and q=0.1 is used.
import numpy as np
import pandas as pd
import multiprocessing as mp
from functools import partial
import numba as nb
X = np.random.random((10000,1000)) # Original array has dimensions of about (1e6, 1e5)
My current approaches:
Pandas: %timeit: 5.8 s ± 15.5 ms per loop
def pd_rolling_quantile(X, window, q):
    return pd.DataFrame(X).rolling(window).quantile(quantile=q)
Numpy Strided: %timeit: 2min 42s ± 3.29 s per loop
def strided_app(a, L, S):
    nrows = ((a.size-L)//S)+1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows,L), strides=(S*n,n))
def np_1d(x, window, q):
    return np.pad(np.percentile(strided_app(x, window, 1), q*100, axis=-1), (window-1, 0), mode='constant')
def np_rolling_quantile(X, window, q):
    results = []
    for i in np.arange(X.shape[1]):
        results.append(np_1d(X[:,i], window, q))
    return np.column_stack(results)
Multiprocessing: %timeit: 1.13 s ± 27.6 ms per loop
def mp_rolling_quantile(X, window, q):
    pool = mp.Pool(processes=12)
    results = pool.map(partial(pd_rolling_quantile, window=window, q=q), [X[:,i] for i in np.arange(X.shape[1])])
    pool.close()
    pool.join()
    return np.column_stack(results)
Numba: %timeit: 2min 28s ± 182 ms per loop
@nb.njit
def nb_1d(x, window, q):
    out = np.zeros(x.shape[0])
    for i in np.arange(x.shape[0]-window+1)+window:
        out[i-1] = np.quantile(x[i-window:i], q=q)
    return out
def nb_rolling_quantile(X, window, q):
    results = []
    for i in np.arange(X.shape[1]):
        results.append(nb_1d(X[:,i], window, q))
    return np.column_stack(results)
The timings are not great, and ideally I would target a 10-50x speedup. I would appreciate any suggestions on how to speed it up. Maybe someone has ideas on using lower level languages (Cython), or other ways to speed it up with Numpy/Numba/Tensorflow based methods. Thanks!
I would recommend the new rolling-quantiles package.
To demonstrate, even the somewhat naive approach of constructing a separate filter for each column outperforms the above single-threaded pandas experiment:
pipes = [rq.Pipeline(rq.LowPass(window=1000, quantile=0.1)) for i in range(1000)]
%timeit [pipe.feed(X[:, i]) for i, pipe in enumerate(pipes)]
1.34 s ± 7.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
versus
df = pd.DataFrame(X)
%timeit df.rolling(1000).quantile(0.1)
5.63 s ± 27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Both can be trivially parallelized by means of multiprocessing, as you showed.
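To make the cost difference concrete: the strided and numba attempts in the question re-sort or re-select the whole window at every step, whereas an incremental approach only inserts the newest sample and removes the oldest one from an already sorted window. The pure-Python sketch below shows that idea for one column. It is not the algorithm used by rolling-quantiles or pandas, it will be slow in plain Python, and rolling_quantile_1d is a hypothetical name; the interpolation is the usual linear rule at fractional rank q * (window - 1):

```python
from bisect import bisect_left, insort
from math import ceil, floor

def rolling_quantile_1d(values, window, q):
    """Rolling quantile with linear interpolation; the first window-1 outputs are None."""
    out = [None] * len(values)
    sorted_win = []
    pos = q * (window - 1)                  # fractional rank inside the sorted window
    lo, hi = floor(pos), ceil(pos)
    for i, v in enumerate(values):
        insort(sorted_win, v)               # add the newest sample, keeping order
        if i >= window:
            # drop the sample that just left the window
            del sorted_win[bisect_left(sorted_win, values[i - window])]
        if i >= window - 1:
            a, b = sorted_win[lo], sorted_win[hi]
            out[i] = a + (b - a) * (pos - lo)
    return out

data = [5, 1, 4, 2, 3, 9, 0, 7]
medians = rolling_quantile_1d(data, window=3, q=0.5)
```

The same update rule is what a compiled implementation would use with a more suitable container (for example two heaps or a skip list) in place of the Python list.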
My goal is to convert a list of pixels from RGB to Hex as quickly as possible. The format is a 3-dimensional NumPy array (RGB colorspace) and ideally it would be converted from RGB to Hex and maintain its shape.
My attempt uses a list comprehension and, with the exception of performance, it solves the problem. Performance-wise, adding the ravel and the list comprehension really slows this down. Unfortunately, I just don't know enough math to know how to speed this up:
Edited: Updated code to reflect the most recent changes. Currently running at around 24 ms on a 35,000-pixel image.
def np_array_to_hex(array):
    array = np.asarray(array, dtype='uint32')
    array = (1 << 24) + ((array[:, :, 0]<<16) + (array[:, :, 1]<<8) + array[:, :, 2])
    return [hex(x)[-6:] for x in array.ravel()]
>>> np_array_to_hex(img)
['afb3bc', 'abaeb5', 'b3b4b9', ..., '8b9dab', '92a4b2', '9caebc']
I tried it with a LUT ("Look Up Table") - it takes a few seconds to initialise and it uses 100MB (0.1GB) of RAM, but that's a small price to pay amortised over a million images:
#!/usr/bin/env python3
import numpy as np
def np_array_to_hex1(array):
    array = np.asarray(array, dtype='uint32')
    array = ((array[:, :, 0]<<16) + (array[:, :, 1]<<8) + array[:, :, 2])
    return array
def np_array_to_hex2(array):
    array = np.asarray(array, dtype='uint32')
    array = (1 << 24) + ((array[:, :, 0]<<16) + (array[:, :, 1]<<8) + array[:, :, 2])
    return [hex(x)[-6:] for x in array.ravel()]
def me(array, LUT):
    h, w, d = array.shape
    # Reshape to a color vector, widened to 32 bits so the arithmetic below cannot overflow uint8
    z = np.reshape(array, (-1, 3)).astype(np.uint32)
    # Pack each pixel into a single 24-bit colour number
    y = z[:,0]*65536 + z[:,1]*256 + z[:,2]
    return LUT[y]
# Define dummy image of 35,000 RGB pixels
w,h = 175, 200
im = np.random.randint(0,256,(h,w,3),dtype=np.uint8)
# %timeit np_array_to_hex1(im)
# 112 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# %timeit np_array_to_hex2(im)
# 8.42 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# This may take time to set up, but amortize that over a million images...
LUT = np.zeros((256*256*256), dtype='S6')  # 'S6' is the 6-byte string type (the older alias is 'a6')
for i in range(256*256*256):
    h = hex(i)[2:].zfill(6)
    LUT[i] = h
# %timeit me(im,LUT)
# 499 µs ± 8.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So that appears to be 4x slower than your fastest, which doesn't work, and 17x faster than your slowest, which does work.
My next suggestion is to use multi-threading or multi-processing so all your CPU cores get busy in parallel and reduce your overall time by a factor of 4 or more assuming you have a reasonably modern 4+ core CPU.
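If the 100 MB table or its setup time is a concern, a much smaller table is possible: look up each channel byte in a 256-entry table of two-character strings, then fuse the three pairs of every pixel by reinterpreting the bytes. This is a minimal sketch, assuming an (H, W, 3) uint8 image; HEX2 and rgb_to_hex_small_lut are hypothetical names, and the variant has not been timed against the figures above:

```python
import numpy as np

# 256-entry table: byte value -> two lowercase hex digits
HEX2 = np.array([f"{i:02x}".encode() for i in range(256)], dtype="S2")

def rgb_to_hex_small_lut(im):
    """Map an (H, W, 3) uint8 image to an (H, W) array of 6-character byte strings."""
    pairs = HEX2[im]                    # (H, W, 3), each element is 2 bytes
    # Reinterpret each pixel's three 2-byte strings as one 6-byte string.
    return pairs.view("S6")[..., 0]

im = np.array([[[175, 179, 188], [0, 0, 1]]], dtype=np.uint8)
hexes = rgb_to_hex_small_lut(im)
```

Unlike the list-based version in the question, the result keeps the image's height and width, which is what the question asked for.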