How to load and process 2 images at a time - python

I have 20 images in a folder.
I want to load the first two images and process them, then load the next two images and process them, and so on.
I want to know how to achieve this in Python with OpenCV.
Sequence to follow: load images 1, 2 > process (I will do this bit), then load images 2, 3 > process, then 3, 4 > process, then 4, 5 > process, and so on.

I don't really know whether you want to process them two by two or two at the same time, so here is both!
Process 2 by 2 sequentially:
import os
import cv2

folder = '<image_folder>'
files = os.listdir(folder)
for i in range(0, len(files), 2):
    # os.listdir returns bare filenames, so join them with the folder path
    image1 = cv2.imread(os.path.join(folder, files[i]))
    image2 = cv2.imread(os.path.join(folder, files[i + 1]))
    process(image1)
    process(image2)
Process 2 at the same time:
A useful tool is the map function in Python's multiprocessing library. It's actually very simple to use, for example:
from multiprocessing import Pool

p = Pool(2)
for i in range(0, len(files), 2):
    # hand each pair of images to the pool; map blocks until both are processed
    p.map(process, [cv2.imread(os.path.join(folder, files[i])),
                    cv2.imread(os.path.join(folder, files[i + 1]))])
The list holds your elements, and you want to apply the function process to each of them in parallel. p.map will do that for you, no problem!
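One thing worth noting: cv2.imread runs in the parent process above, so only process itself is parallelized. A variant worth trying (just a sketch, with read_and_process as a hypothetical wrapper around your own process function, and assuming JPEGs in <image_folder>) reads inside the worker so the pool does the I/O as well:
import glob
from multiprocessing import Pool

import cv2

def read_and_process(path):
    # Reading inside the worker keeps both the I/O and the processing in the pool.
    img = cv2.imread(path)
    return process(img)

if __name__ == '__main__':
    paths = sorted(glob.glob('<image_folder>/*.jpg'))
    with Pool(2) as p:
        p.map(read_and_process, paths)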
Good luck!

import glob2
import cv2

images = glob2.glob('imageFolder/*.jpg')
# pair each image with the next one; the last image wraps around to pair with the first
images = list(zip(images, images[1:] + images[:1]))
for item in images:
    img1 = cv2.imread(item[0])
    img2 = cv2.imread(item[1])
    # process here
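Note that the zip trick above wraps around, so the last image gets paired with the first. If you want exactly the sequence from the question (1, 2 then 2, 3 then 3, 4, stopping at the last pair), a minimal variant is:
import glob2
import cv2

images = sorted(glob2.glob('imageFolder/*.jpg'))
for first, second in zip(images, images[1:]):
    img1 = cv2.imread(first)
    img2 = cv2.imread(second)
    # process img1 and img2 here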


How to distribute multiprocess CPU usage over multiple nodes?

I am trying to run a job on an HPC using multiprocessing. Each process has a peak memory usage of ~44GB. The job class I can use allows 1-16 nodes to be used, each with 32 CPUs and 124GB of memory. Therefore, if I want to run the code as quickly as possible (and within the max walltime limit), I should be able to run 2 processes on each node, up to a maximum of 32 across all 16 nodes. However, when I specify mp.Pool(32) the job quickly exceeds the memory limit, I assume because more than two processes were used on a node.
My natural instinct was to specify 2 CPUs as the maximum in the PBS script I run my Python script from; however, this configuration is not permitted on the system. I would really appreciate any insight - I have been scratching my head on this one for most of today, and have faced and worked around similar problems in the past without addressing the fundamentals at play here.
Simplified versions of both scripts below:
#!/bin/sh
#PBS -l select=16:ncpus=32:mem=124gb
#PBS -l walltime=24:00:00
module load anaconda3/personal
source activate py_env
python directory/script.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import multiprocessing as mp

def df_function(df, arr1, arr2):
    df['col3'] = some_algorithm(df, arr1, arr2)
    return df

def parallelize_dataframe(df, func, num_cores):
    df_split = np.array_split(df, num_cores)
    with mp.Pool(num_cores, maxtasksperchild=10 ** 3) as pool:
        df = pd.concat(pool.map(func, df_split))
    return df

def main():
    # Loading input data
    direc = '/home/dir1/dir2/'
    file = 'input_data.csv'
    a_file = 'array_a.npy'
    b_file = 'array_b.npy'
    df = pd.read_csv(direc + file)
    a = np.load(direc + a_file)
    b = np.load(direc + b_file)
    # Globally defining function with keyword defaults
    global f
    def f(df):
        return df_function(df, arr1=a, arr2=b)
    num_cores = 32  # i.e. 2 per node if evenly distributed.
    # Running the function as a multiprocess:
    df = parallelize_dataframe(df, f, num_cores)
    # Saving:
    df.to_csv(direc + 'outfile.csv', index=False)

if __name__ == '__main__':
    main()
To run your job as-is, you could simply request ncpus=32 and then in your Python script set num_cores = 2. Obviously this has you paying for 32 cores and then leaving 30 of them idle, which is wasteful.
The real problem here is that your current algorithm is memory-bound, not CPU-bound. You should be going to great lengths to read only chunks of your files into memory, operating on the chunks, and then writing the result chunks to disk to be organized later.
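For illustration only, here is what that chunked pattern looks like in plain pandas (a sketch under the assumption that some_algorithm can operate on one slice of the dataframe at a time):
import numpy as np
import pandas as pd

direc = '/home/dir1/dir2/'
a = np.load(direc + 'array_a.npy')
b = np.load(direc + 'array_b.npy')

# Stream the CSV in fixed-size chunks instead of loading it all at once,
# process each chunk, and append the result to the output file.
first = True
for chunk in pd.read_csv(direc + 'input_data.csv', chunksize=100_000):
    chunk['col3'] = some_algorithm(chunk, a, b)
    chunk.to_csv(direc + 'outfile.csv', mode='w' if first else 'a',
                 header=first, index=False)
    first = False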
Fortunately Dask is built to do exactly this kind of thing. As a first step, you can take out the parallelize_dataframe function and directly load and map your some_algorithm with a dask.dataframe and dask.array:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import dask.dataframe as dd
import dask.array as da

def main():
    # Loading input data
    direc = '/home/dir1/dir2/'
    file = 'input_data.csv'
    a_file = 'array_a.npy'
    b_file = 'array_b.npy'
    df = dd.read_csv(direc + file, blocksize=25e6)
    a_and_b = da.from_npy_stack(direc)
    df['col3'] = df.apply(some_algorithm, args=(a_and_b,))
    # dask is lazy, this is the only line that does any work
    # Saving:
    df.to_csv(
        direc + 'outfile.csv',
        index=False,
        compute_kwargs={"scheduler": "threads"},  # also "processes", but try threads first
    )

if __name__ == '__main__':
    main()
That will require some tweaks to some_algorithm, and to_csv and from_npy_stack work a bit differently, but you will be able to reasonably run this thing just on your own laptop and it will scale to your cluster hardware. You can level up from here by using the distributed scheduler or even deploying it directly to your cluster with dask-jobqueue.
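As a rough sketch of that last step (the resource_spec, walltime, and sizing below are placeholders taken from the job description in the question, not a tested configuration), dask-jobqueue can describe one node per PBS job and scale out:
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# One PBS job per node; only 2 workers per node fit within the 124GB memory limit.
cluster = PBSCluster(
    cores=2,
    memory="124GB",
    resource_spec="select=1:ncpus=32:mem=124gb",
    walltime="24:00:00",
)
cluster.scale(jobs=16)    # up to 16 nodes
client = Client(cluster)  # dask computations now run on the cluster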

Split video into images with ffmpeg-python

As far as I understand, ffmpeg-python is the main package in Python for driving ffmpeg directly.
Now I want to take a video and save its frames as separate files at some fps.
There are plenty of command line ways to do it, e.g. ffmpeg -i video.mp4 -vf fps=1 img/output%06d.png described here
But I want to do it in Python. Also there are solutions [1] [2] that use Python's subprocess to call the ffmpeg CLI, but that looks dirty to me.
Is there any way to do it using ffmpeg-python?
The following works for me:
import ffmpeg

(
    ffmpeg
    .input(url)
    .filter('fps', fps='1/60')
    .output('thumbs/test-%d.jpg', start_number=0)
    .overwrite_output()
    .run(quiet=True)
)
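If you would rather get the frames straight into memory instead of writing JPEGs, the ffmpeg-python README shows a rawvideo-to-numpy pattern; a sketch of it adapted to the one-frame-per-minute case (assuming the probe reports a video stream with width and height) is:
import ffmpeg
import numpy as np

probe = ffmpeg.probe(url)
video = next(s for s in probe['streams'] if s['codec_type'] == 'video')
width, height = int(video['width']), int(video['height'])

# Decode one frame per minute as raw RGB bytes on stdout, then reshape.
out, _ = (
    ffmpeg
    .input(url)
    .filter('fps', fps='1/60')
    .output('pipe:', format='rawvideo', pix_fmt='rgb24')
    .run(capture_stdout=True, quiet=True)
)
frames = np.frombuffer(out, np.uint8).reshape([-1, height, width, 3])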
I'd suggest you try imageio module and use the following code as a starting point:
import imageio

reader = imageio.get_reader('imageio:cockatoo.mp4')
for frame_number, im in enumerate(reader):
    # im is a numpy array
    if frame_number % 10 == 0:
        imageio.imwrite(f'frame_{frame_number}.jpg', im)
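If you want "one frame every N seconds" rather than "every 10th frame", the reader's metadata usually carries the source frame rate (a sketch; the 'fps' key is available for most video formats handled by the ffmpeg plugin):
import imageio

reader = imageio.get_reader('imageio:cockatoo.mp4')
fps = reader.get_meta_data()['fps']   # source frame rate
step = max(1, int(round(fps * 5)))    # keep roughly one frame every 5 seconds
for frame_number, im in enumerate(reader):
    if frame_number % step == 0:
        imageio.imwrite(f'frame_{frame_number}.jpg', im)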
You can also use openCV for that.
Reference code:
import cv2

video_capture = cv2.VideoCapture("your_video_path")
video_capture.set(cv2.CAP_PROP_FPS, <your_desired_fps_here>)
saved_frame_name = 0
while video_capture.isOpened():
    frame_is_read, frame = video_capture.read()
    if frame_is_read:
        cv2.imwrite(f"frame{saved_frame_name}.jpg", frame)
        saved_frame_name += 1
    else:
        print("Could not read the frame.")
        break
video_capture.release()
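Be aware that setting CAP_PROP_FPS often has no effect when reading from a file rather than a camera. If that happens, one workaround (a sketch, assuming you want roughly one saved frame per second of video) is to read every frame and keep only every Nth one based on the reported frame rate:
import cv2

video_capture = cv2.VideoCapture("your_video_path")
source_fps = video_capture.get(cv2.CAP_PROP_FPS) or 30  # fall back if fps is not reported
step = max(1, int(round(source_fps)))                   # keep ~1 frame per second
frame_number = 0
saved_frame_name = 0
while video_capture.isOpened():
    frame_is_read, frame = video_capture.read()
    if not frame_is_read:
        break
    if frame_number % step == 0:
        cv2.imwrite(f"frame{saved_frame_name}.jpg", frame)
        saved_frame_name += 1
    frame_number += 1
video_capture.release()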
The solution from @norus above is actually good, but for me it was missing the ss and r parameters in the input. I used a local file instead of a URL.
This is my solution:
import ffmpeg

ffmpeg.input('<path/to/file>', ss=0, r=1)\
    .filter('fps', fps='1/60')\
    .output('thumbs/test-%d.jpg', start_number=0)\
    .overwrite_output()\
    .run(quiet=True)
ss is the starting second; in the code above it starts at 0.
r is the rate: because the filter fps is set to 1/60, an r of 1 will return 1 frame per second, an r of 2 one frame every 2 seconds, an r of 0.5 a frame every half second, and so on.

Proper handling of parallel function that writes output in Python

I have a function that takes a text file as input, does some processing, and writes a pickled result to file. I'm trying to perform this in parallel across multiple files. The order in which files are processed doesn't matter, and the processing of each is totally independent. Here's what I have now:
import multiprocessing as mp
import pandas as pd
from glob import glob

def processor(fi):
    df = pd.read_table(fi)
    # ...do some processing to the df...
    filename = fi.split('/')[-1][:-4]
    df.to_pickle('{}.pkl'.format(filename))

if __name__ == '__main__':
    files = glob('/path/to/my/files/*.txt')
    pool = mp.Pool(8)
    for _ in pool.imap_unordered(processor, files):
        pass
Now, this actually works totally fine as far as I can tell, but the syntax seems really hinky and I'm wondering if there is a better way of going about it. E.g. can I get the same result without having to perform an explicit loop?
I tried map_async(processor, files), but this doesn't generate any output files (but doesn't throw any errors).
Suggestions?
You can use map_async, but you need to wait for it to finish, since the async bit means "don't block after setting off the jobs; return immediately". If you don't wait and there's nothing after that call, your program will exit and all subprocesses will be killed immediately, before they complete - not what you want!
The following example should help:
from multiprocessing.pool import Pool
from time import sleep

def my_func(val):
    print('Executing %s' % val)
    sleep(0.5)
    print('Done %s' % val)

pl = Pool()
async_result = pl.map_async(my_func, [1, 2, 3, 4, 5])
res = async_result.get()
print('Pool done: %s' % res)
The output of which (when I ran it) is:
Executing 2
Executing 1
Executing 3
Executing 4
Done 2
Done 1
Executing 5
Done 4
Done 3
Done 5
Pool done: [None, None, None, None, None]
Alternatively, using plain map would also do the trick, and then you don't have to wait for it since it is not "asynchronous" and synchronously waits for all jobs to be complete:
pl = Pool()
res = pl.map(my_func, [1, 2, 3, 4, 5])
print('Pool done: %s' % res)
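Applied to the original example (a sketch reusing the processor function and glob pattern from the question), the whole thing collapses to:
import multiprocessing as mp
from glob import glob

if __name__ == '__main__':
    files = glob('/path/to/my/files/*.txt')
    # The context manager tears the pool down cleanly, and map blocks until
    # every file has been processed, so no explicit loop is needed.
    with mp.Pool(8) as pool:
        pool.map(processor, files)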

parallel image processing in python

I am doing some image processing, but I have a lot of images (~10,000). Thus I would like to do it in parallel, but for some reason it does not go as fast as it should. I am using a MacBook Pro with 16GB and an i7. The code is like this:
import os
from multiprocessing import Pool

import cv2

def process_image(img_name):
    im = cv2.imread('image/' + img_name)
    tfs_im = some_function(im)  # use opencv, skimage and math
    cv2.imwrite('new_img/' + img_name, tfs_im)

if __name__ == '__main__':
    ### Set Working Dir
    wd_path = os.path.dirname(os.path.realpath(__file__))
    os.chdir(wd_path + '/..')
    img_list = os.listdir('images')
    pool = Pool(processes=8)
    pool.map(process_image, img_list)  # process data_inputs iterable with pool
I also tried a more basic approach using queueing.
import os
from multiprocessing import Process, Queue

import cv2

def process_image(img_names):
    for img_name in img_names:
        im = cv2.imread('image/' + img_name)
        tfs_im = some_function(im)  # use opencv, skimage and math
        cv2.imwrite('new_img/' + img_name, tfs_im)

if __name__ == '__main__':
    ### Set Working Dir
    wd_path = os.path.dirname(os.path.realpath(__file__))
    os.chdir(wd_path + '/..')
    q = Queue()
    img_list = os.listdir('image')
    # split work into 8 processes
    processes = 8
    def splitlist(inlist, chunksize):
        return [inlist[x:x + chunksize] for x in range(0, len(inlist), chunksize)]
    list_splitted = splitlist(img_list, len(img_list) // processes + 1)
    for imgs in list_splitted:
        p = Process(target=process_image, args=(imgs,))
        p.daemon = True
        p.start()
Neither of those approaches is giving the expected speed. I know that there is some setup time expected for each process, and thus the code will not run 8 times faster, but as of now it is barely running 2 times faster than a single thread.
Maybe some tasks are not meant to be parallelized, such as writing/reading images from/to the same folder in different processes?
Thanks for any tips or advice!

Reading bmp files in Python

Is there a way to read in a bmp file in Python that does not involve using PIL? PIL doesn't work with version 3, which is the one I have. I tried to use the Image object from graphics.py, Image(anchorPoint, filename), but that only seems to work with gif files.
In Python it can simply be read as:
import os
from scipy import misc

path = 'your_file_path'
image = misc.imread(os.path.join(path, 'image.bmp'), flatten=0)
## flatten=0 if image is required as it is
## flatten=1 to flatten the color layers into a single gray-scale layer
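Note that scipy.misc.imread has been removed from recent SciPy releases; if the import above fails, imageio offers essentially the same call (shown here as an assumed drop-in for the non-flattened case):
import os
import imageio

path = 'your_file_path'
image = imageio.imread(os.path.join(path, 'image.bmp'))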
I realize that this is an old question, but I found it when solving this problem myself and I figured that this might help someone else in the future.
It's actually pretty easy to read a BMP file as binary data, depending of course on how broad the support is and how many corner cases you need to handle.
Below is a simple parser that ONLY works for 1920x1080 24-bit BMPs (like ones saved from MS Paint). It should be easy to extend though. It spits out the pixel values as a Python list like (255, 0, 0, 255, 0, 0, ...) for a red image, as an example.
If you need more robust support there's information on how to properly read the header in answers to this question: How to read bmp file header in python?. Using that information you should be able to extend the simple parser below with any features you need.
There's also more information on the BMP file format over at wikipedia https://en.wikipedia.org/wiki/BMP_file_format if you need it.
def read_rows(path):
    image_file = open(path, "rb")
    # Blindly skip the BMP header.
    image_file.seek(54)
    # We need to read pixels in as rows to later swap the order
    # since BMP stores pixels starting at the bottom left.
    rows = []
    row = []
    pixel_index = 0
    while True:
        if pixel_index == 1920:
            pixel_index = 0
            rows.insert(0, row)
            if len(row) != 1920 * 3:
                raise Exception("Row length is not 1920*3 but " + str(len(row)) + " / 3.0 = " + str(len(row) / 3.0))
            row = []
        pixel_index += 1
        r_string = image_file.read(1)
        g_string = image_file.read(1)
        b_string = image_file.read(1)
        if len(r_string) == 0:
            # This is expected to happen when we've read everything.
            if len(rows) != 1080:
                print("Warning!!! Read to the end of the file at the correct sub-pixel (red) but we've not read 1080 rows!")
            break
        if len(g_string) == 0:
            print("Warning!!! Got 0 length string for green. Breaking.")
            break
        if len(b_string) == 0:
            print("Warning!!! Got 0 length string for blue. Breaking.")
            break
        r = ord(r_string)
        g = ord(g_string)
        b = ord(b_string)
        row.append(b)
        row.append(g)
        row.append(r)
    image_file.close()
    return rows

def repack_sub_pixels(rows):
    print("Repacking pixels...")
    sub_pixels = []
    for row in rows:
        for sub_pixel in row:
            sub_pixels.append(sub_pixel)
    diff = len(sub_pixels) - 1920 * 1080 * 3
    print("Packed", len(sub_pixels), "sub-pixels.")
    if diff != 0:
        print("Error! Number of sub-pixels packed does not match 1920*1080: (" + str(len(sub_pixels)) + " - 1920 * 1080 * 3 = " + str(diff) + ").")
    return sub_pixels

rows = read_rows("my image.bmp")
# This list is raw sub-pixel values. A red image is for example (255, 0, 0, 255, 0, 0, ...).
sub_pixels = repack_sub_pixels(rows)
Use Pillow for this. After you have installed it, simply import it:
from PIL import Image
Then you can load the BMP file:
img = Image.open(r'path_to_file\file.bmp')
If you need the image to be a numpy array, use np.array:
import numpy as np
img = np.array(Image.open(r'path_to_file\file.bmp'))
For an RGB image the resulting array already has shape (height, width, channels), for example (512, 512, 3) for a 512x512 RGB image, so you normally do not need to reshape() it.
I had to work on a project where I needed to read a BMP file using Python; it was quite interesting. Actually, the best way is to review the BMP file format (https://en.wikipedia.org/wiki/BMP_file_format) and then read it as a binary file to extract the data.
You will need to use Python's struct library to perform the extraction.
You can use this tutorial to see how it proceeds https://youtu.be/0Kwqdkhgbfw
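For instance, a minimal sketch (assuming the common layout of a 14-byte file header followed by a 40-byte BITMAPINFOHEADER) of pulling the basics out of the header with struct looks like this:
import struct

def read_bmp_header(path):
    # Read the fixed 54-byte header: BITMAPFILEHEADER (14 bytes) + BITMAPINFOHEADER (40 bytes).
    with open(path, 'rb') as f:
        header = f.read(54)
    signature = header[0:2]                               # should be b'BM'
    file_size, = struct.unpack('<I', header[2:6])         # total file size in bytes
    data_offset, = struct.unpack('<I', header[10:14])     # where the pixel array starts
    width, height = struct.unpack('<ii', header[18:26])   # image dimensions in pixels
    bits_per_pixel, = struct.unpack('<H', header[28:30])  # e.g. 24 for RGB
    return signature, file_size, data_offset, width, height, bits_per_pixel

print(read_bmp_header('image.bmp'))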
Use the excellent matplotlib library
import matplotlib.pyplot as plt
im = plt.imread('image.bmp')
It depends on what you are trying to achieve and on which platform.
Anyway, using a C library to load BMP may work, e.g. http://code.google.com/p/libbmp/ or http://freeimage.sourceforge.net/, and C libraries can easily be called from Python, e.g. using ctypes or by wrapping the library as a Python module.
Or you can compile this Python 3 port of PIL: https://github.com/sloonz/pil-py3k
If you're doing this on Windows, this site should allow you to get PIL (and many other popular packages) up and running with most versions of Python: Unofficial Windows Binaries for Python Extension Packages
The common port of PIL to Python 3.x is called "Pillow".
I would also suggest the pygame library for simple tasks. It is a library full of features for creating games, and reading some common image formats is among them. It works with Python 3.x as well.
