Merge large images on disk - python

The main problem:
I have a map step where I render a large amount of sectors of an image in parallel:
1 2
3 4
worker a -> 1
worker b -> 2
...
merge 1,2,3,4 to make final image
If it can fit in memory
With images that are relatively small and can fit in RAM, one can simply use PIL's functionality:
from PIL import Image

def merge_images(image_files, x, y):
    images = [Image.open(f) for f in image_files]
    width, height = images[0].size
    new_im = Image.new('RGB', (width * x, height * y))
    for n, im in enumerate(images):
        # column is n % x, row is n // x for an x-wide grid
        new_im.paste(im, ((n % x) * width, (n // x) * height))
    return new_im
Unfortunately, I am going to have many, many large sectors. I want to merge the pictures into a single final image of about 40,000 x 60,000 pixels, which I estimate to be around 20 GB, or maybe even more.
So obviously we can't approach this problem in RAM. I know there are alternatives like memmap'ing arrays and writing to slices, which I will try. However, I am looking for as-out-of-the-box-as-possible solutions.
Does anyone know of any easier alternatives? Even though all the approaches I've tried so far are in Python, the solution doesn't need to be in Python.
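For reference, the memmap approach I mentioned would look something like this (a sketch only; the tile size, dtype and row-major tile order are assumptions):

```python
import numpy as np

def merge_to_memmap(tile_arrays, x, y, out_path):
    """Paste x*y equally sized RGB tiles into one memory-mapped image.

    tile_arrays: iterable of (h, w, 3) uint8 arrays in row-major tile order.
    Only one tile plus the mapped window is in RAM at any time.
    """
    tile_arrays = list(tile_arrays)
    h, w, c = tile_arrays[0].shape
    out = np.memmap(out_path, dtype=np.uint8, mode='w+',
                    shape=(h * y, w * x, c))
    for n, tile in enumerate(tile_arrays):
        row, col = n // x, n % x
        out[row * h:(row + 1) * h, col * w:(col + 1) * w] = tile
    out.flush()
    return out.shape
```

In a real pipeline each worker would write its sector into its own slice of the memmap rather than passing arrays around.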

pyvips can do exactly what you want very quickly and efficiently. For example:
import sys
import pyvips

images = [pyvips.Image.new_from_file(filename, access="sequential")
          for filename in sys.argv[2:]]
final = pyvips.Image.arrayjoin(images, across=10)
final.write_to_file(sys.argv[1])
The access="sequential" option tells pyvips that you want to stream the image. It will only load pixels on demand as it generates output, so you can merge enormous images using only a little memory. The arrayjoin operator joins an array of images into a grid, across tiles wide. It has quite a few layout options: you can specify borders, overlaps, background, centring behaviour and so on.
I can run it like this:
$ for i in {1..100}; do cp ~/pics/k2.jpg $i.jpg; done
$ time ../arrayjoin.py x.tif *.jpg
real 0m2.498s
user 0m3.579s
sys 0m1.054s
$ vipsheader x.tif
x.tif: 14500x20480 uchar, 3 bands, srgb, tiffload
So it joined 100 JPEG images into a 14,500 x 20,480 pixel mosaic in about 2.5 s on this laptop and, from watching top, needed about 300 MB of memory. I've used it to join over 30,000 images into a single file, and it would go higher; I've made images of over 300,000 x 300,000 pixels.
The pyvips equivalent of PIL's paste is insert. You could use that too, though it won't work so well for very large numbers of images.
There's also a command-line interface, so you could just enter:
vips arrayjoin "$(echo *.jpg)" x.tif --across 10
to join up a large set of JPEG images.

I would suggest using the TIFF file format. Most TIFF files are striped (one or more scan lines are stored as a block in the file), but it is possible to write tiled TIFF files (where the image is divided into tiles, and each tile is stored as an independent block in the file).
LibTIFF is the canonical way of reading and writing TIFF files. It has an easy way of creating a new TIFF file and adding tiles one at a time. Thus, your program can create the TIFF file, obtain one sector, write it as one or more tiles to the TIFF file, obtain the next sector, and so on. You would have to choose your tile size so it evenly divides one sector.
There is a Python binding to LibTIFF called, what else, PyLibTIFF. It should allow you to follow the above model from within Python. The same repository has a pure Python module to read and write TIFF files; I don't know if it is able to write TIFF files in tiles, or whether it allows writing them in chunks. There are many other Python modules for reading and writing TIFF files, but most will write one matrix as a TIFF file, rather than allow you to write a file one tile at a time.
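As a point of comparison, the tifffile library can also write a tiled TIFF from a generator of tiles, so the full image never has to be in memory at once. A sketch under those assumptions (tile size and dtype are choices, not requirements):

```python
import numpy as np
import tifffile

def write_tiled(out_path, tile_iter, shape, tile=(256, 256)):
    """Stream tiles into a tiled TIFF without holding the full image in RAM.

    tile_iter must yield uint8 tiles of size `tile`, in row-major order,
    covering the full (height, width) given by `shape`.
    """
    tifffile.imwrite(out_path, tile_iter, shape=shape,
                     dtype=np.uint8, tile=tile)
```

Each rendered sector would then be cut into tile-sized pieces and yielded in the right order.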

Related

How to efficiently read part of a huge tiff image?

I have a collection of rather large tiff files (typical resolution is 30k x 30k to 50k x 50k). These have some interesting properties: 1. they contain several different resolutions of the same image, and 2. the data seems to be compressed as JPEG tiles (512x512). They also seem to be BigTIFFs.
I'd like to read specific regions of interest (more or less random access pattern) and I'd like it to be as fast as possible (~about as fast as decompressing the needed tiles plus a little bit of overhead). Decompressing the whole file, or even one of the resolution layers, is not practical. I'm using python.
I tried opening the file with skimage.io.MultiImage('test.tiff')[level], and this works (produces correct data). It does decompress each resolution layer completely, although not the whole file; so this is okay at the low resolutions but not really useful for the highest ones. There didn't seem to be any way to select a region of interest in skimage.io, or in any of the libraries it uses (imread, PIL, ...).
I also tried OpenSlide, using img.read_region((x, y), level, (width, height)). This library seems made exactly for this type of data, and is very fast, but unfortunately produces incorrect data for some regions. Until the bug is fixed upstream, I can't use it.
Lastly, using a very recent version of tifffile=2020.6.3 and imagecodecs=2020.5.30 (older versions don't work - I think at least 2018.10.18 is needed), I could list the tiles, using code modified from the tifffile documentation:
import tifffile

with tifffile.TiffFile('test.tiff') as tif:
    fh = tif.filehandle
    for page in tif.pages:
        jpegtables = page.tags.get('JPEGTables', None)
        if jpegtables is not None:
            jpegtables = jpegtables.value
        for index, (offset, bytecount) in enumerate(
                zip(page.dataoffsets, page.databytecounts)):
            fh.seek(offset)
            data = fh.read(bytecount)
            tile, indices, shape = page.decode(data, index, jpegtables)
            print(tile.shape, indices, shape)
It seems the page.decode() call actually decompresses each tile (tile is a numpy array with pixel data). It is not obvious how to get only a tile's index without decompressing it, and I'm also not sure how fast this would be. This leaves the selection of a region of interest and the reassembly of tiles as an exercise for the user.
How do I efficiently read regions of interest out of files like this? Does someone have example code to do that with tifffile? Or, is there another library that would do the trick?
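For what it's worth, once tiles can be decoded individually, the bookkeeping for a region of interest is just working out which tiles intersect it. A pure-Python sketch, assuming row-major tile order (as tifffile reports it) and a known tile size:

```python
def tiles_for_roi(x, y, w, h, image_width, tile=512):
    """Return flat indices of the tiles intersecting the ROI (x, y, w, h).

    image_width is the full image width in pixels; tiles are assumed to be
    tile x tile pixels, numbered row by row starting from the top-left.
    """
    tiles_per_row = (image_width + tile - 1) // tile
    first_col, last_col = x // tile, (x + w - 1) // tile
    first_row, last_row = y // tile, (y + h - 1) // tile
    return [r * tiles_per_row + c
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]
```

Those indices would then be looked up in page.dataoffsets / page.databytecounts, decoded, and pasted into an output array at (c*tile - x, r*tile - y).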

Apply image smoothing and edge detection algorithms on a 2.4 GB GeoTIFF image

So I have over 150,000 huge GeoTIFF images (each 2.4 GB) which I need to run image smoothing and edge detection (LoG Filter) on, to get a sharpened image. I read the image using Gdal, smoothed it, subsampled it, created a high-pass filter (level 5) and reconstructed the image.
This works fine for a normal .jpg file.
But I'm not able to accomplish this for a huge TIFF file because I keep running into memory errors, even on a machine with 32 GB of RAM, an 8-core processor and 4 TB of disk space.
What is the best way to do heavyweight image processing / image segmentation in Python 3.6 on Ubuntu 18.04 LTS?
pyvips can process huge images quickly and in little memory. It's LGPL, runs on Linux, macOS and Windows, and works on every version of Python. Most Linux distributions (including Ubuntu) have it in the package manager.
It's a demand-driven, streaming image processing library. Instead of processing images in single huge lumps, it constructs a network of image processing operators behind your back and pixels are pulled through your computer's memory in small regions by the need to create the output.
For example, I can run this program:
import sys
import pyvips

# access='sequential' puts pyvips into streaming mode for this image
im = pyvips.Image.new_from_file(sys.argv[1], access='sequential')
im = im.crop(100, 100, im.width - 200, im.height - 200)

# 10% shrink, lanczos3 (by default)
im = im.resize(0.9)

mask = pyvips.Image.new_from_array([[-1, -1, -1],
                                    [-1, 16, -1],
                                    [-1, -1, -1]], scale=8)

# integer convolution ... you can use large float masks too, canny,
# sobel, etc. etc.
im = im.conv(mask, precision='integer')

im.write_to_file(sys.argv[2])
On a 40k x 30k pixel GeoTIFF image:
$ vipsheader SAV_X5S_transparent_mosaic_group1.tif
SAV_X5S_transparent_mosaic_group1.tif: 42106x29852 uchar, 4 bands, srgb, tiffload
On this 2015 laptop it runs like this:
$ /usr/bin/time -f %M:%e python3 bench.py SAV_X5S_transparent_mosaic_group1.tif x.tif
257012:101.43
i.e. about 260 MB of RAM and 101 s of elapsed time. It should be quite a bit quicker on your large machine.
One issue you might have is with the GeoTIFF tags: they won't be preserved by pyvips. Perhaps you won't need them in later processing.
Typically, such large images are processed tile-wise. The idea is to divide the image up into tiles, read each one in independently, with enough "overlap" to account for the filtering applied, process it and write it to file.
The TIFF standard includes a "tiled" format (I believe GeoTIFF files are typically stored in a tiled format). This format is explicitly designed to make it easy to read in a small window of the image without having to piece together bits and pieces from all over the file. Each tile in the TIFF file can be indexed by location, and is encoded and compressed independently, making it easy to read in and write out one tile at a time.
The overlap you need depends on the filtering applied. If you apply, say, a Laplacian of Gaussian filter with a 9x9 window (which reaches 4 pixels past the central pixel), then the overlap needs to be only 4 pixels. If you chain filters, you typically want to add the reach of each filter to obtain the total overlap value.
Next, divide the image into some multiple of the tile size in the TIFF file. Say the file has tiles of 512x512 pixels. You could choose to process a 4x4 block of 16 tiles at once, a region of 2048x2048 pixels.
Now loop over the image in steps of 2048 in each dimension. Read in that block of tiles, plus a margin from the neighboring tiles, and crop so that you get square sub-images of 2048 + 2*4 pixels on a side. Process the sub-image, remove the overlap region, and write the rest to the corresponding tiles in the output TIFF file. The output TIFF file should be set up in the same way as the input TIFF file.
I am sure there is software out there that automates this process, but I don't know of any in Python. If you implement this yourself, you will need to learn how to read and write individual tiles in the TIFF file. One option is pylibtiff.
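The loop described above can be sketched in NumPy terms. Here the image is sliced in memory for brevity; in a real implementation each block would be read from and written to TIFF tiles, and the filter below is a placeholder:

```python
import numpy as np

def process_tilewise(image, step=2048, overlap=4, filt=lambda a: a):
    """Apply `filt` to `image` block by block.

    Each block is read with `overlap` extra pixels on every side (clipped at
    the image borders), filtered, and the overlap is discarded before the
    valid region is written to the output.
    """
    h, w = image.shape
    out = np.empty_like(image)
    for y in range(0, h, step):
        for x in range(0, w, step):
            # block bounds including overlap, clipped to the image
            y0, x0 = max(y - overlap, 0), max(x - overlap, 0)
            y1, x1 = min(y + step + overlap, h), min(x + step + overlap, w)
            block = filt(image[y0:y1, x0:x1])
            # position of the valid region inside the filtered block
            oy, ox = y - y0, x - x0
            bh, bw = min(step, h - y), min(step, w - x)
            out[y:y + bh, x:x + bw] = block[oy:oy + bh, ox:ox + bw]
    return out
```

With a sufficient overlap for the filter's reach, the result is identical to filtering the whole image at once, but only one block is ever resident.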

Load portions of matrix into RAM

I'm writing some image processing routines for a micro-controller that supports MicroPython. The bad news is that it only has 0.5 MB of RAM. This means that if I want to work with relatively big images/matrices like 256x256, I need to treat them as a collection of smaller matrices (e.g. 32x32) and perform the operation on those. Leaving aside the problem of reconstructing the final output of the original (256x256) matrix from its (32x32) submatrices, I'd like to focus on how to do the loading/saving from/to disk (an SD card in this case) of these smaller matrices from a big image.
Given that intro, here is my question: assuming I have a 256x256 image on disk that I'd like to apply some operation to (e.g. convolution), what's the most convenient way of storing that image so it's easy to load it in 32x32 patches? I've seen there is a MicroPython implementation of the pickle module; is it a good fit for my problem?
Sorry, but your question contains the answer: if you need to work with 32x32 tiles, the best format is one that represents your big image as a sequence of tiles (and e.g. not as one big 256x256 image, though reading tiles out of that is not rocket science either and should be fairly trivial to code in MicroPython; 32x32 tiles would be more efficient, of course).
You don't describe the exact format of your images, but I wouldn't use the pickle module for this. Store the images as raw bytes and load them into array.array() objects (using the in-place .readinto() operation).
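A minimal sketch of that approach, assuming 8-bit grayscale 32x32 tiles stored back to back as raw bytes (the storage layout is an assumption; adjust for your pixel format):

```python
import array

TILE = 32  # 32x32 8-bit grayscale tiles, stored consecutively

def load_tile(f, index):
    """Read tile number `index` from an open binary file into an array.array.

    readinto() fills a preallocated buffer in place, so no temporary
    bytes object is created - important with only 0.5 MB of RAM.
    """
    buf = array.array('B', bytes(TILE * TILE))
    f.seek(index * TILE * TILE)
    f.readinto(buf)
    return buf
```

The same buffer can be reused across calls by passing it in, which avoids any per-tile allocation at all.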

Extract tiles from tiled TIFF and store in numpy array

My overall goal is to crop several regions from an input mirax (.mrxs) slide image to JPEG output files.
Here is what one of these images looks like:
Note that the darker grey area is part of the image, and the regions I ultimately wish to extract in JPEG format are the 3 black square regions.
Now, for the specifics:
I'm able to extract the color channels from the mirax image into 3 separate TIFF files using vips on the command line:
vips extract_band INPUT.mrxs OUTPUT.tiff[tile,compression=jpeg] C --n 1
Where C corresponds to the channel number (0-2), and each output file is about 250 MB in size.
The next job is to somehow recognize and extract the regions of interest from the images, so I turned to several python imaging libraries, and this is where I encountered difficulties.
When I try to load any of the TIFFs using OpenCV using:
i = cv2.imread('/home/user/input_img.tiff',cv2.IMREAD_ANYDEPTH)
I get the error: (-211) The total matrix size does not fit to "size_t" type in function setSize
I managed to get a little more traction with Pillow, by doing:
from PIL import Image
tiff = Image.open('/home/user/input_img.tiff')
print len(tiff.tile)
print tiff.tile[0]
print tiff.info
which outputs:
636633
('jpeg', (0, 0, 128, 128), 8, ('L', ''))
{'compression': 'jpeg', 'dpi': (25.4, 25.4)}
However, beyond loading the image, I can't seem to perform any useful operations; for example, tiff.tostring() results in a MemoryError (I do this in an attempt to convert the PIL object to a numpy array). I'm not sure this operation is even valid given the existence of tiles.
From my limited understanding, these TIFFs store the image data in 'tiles' (of which the above image contains 636633) in a JPEG-compressed format.
It's not clear to me, however, how one would extract these tiles for use as regular JPEG images, or even whether the sequence of steps I outlined above is a useful way of accomplishing the overall goal of extracting the ROIs from the mirax image.
If I'm on the right track, then some guidance would be appreciated, or, if there's another way to accomplish my goal using vips/openslide without python I would be interested in hearing ideas. Additionally, more information about how I could deal with or understand the TIFF files I described would also be helpful.
The ideal situations would include:
1) Some kind of autocropping feature in vips/openslide which can generate JPEGs from either the TIFFs or the original mirax image, along the lines of what the following command does, but without generating tens of thousands of images:
vips dzsave CMU-1.mrxs[autocrop] pyramid
2) Being able to extract tiles from the TIFFs and store the data corresponding to an image region as a numpy array, in order to detect the 3 ROIs using OpenCV or another method.
I would use the vips Python binding; it's very like PIL but can handle these huge images. Try something like:
import sys
from gi.repository import Vips

slide = Vips.Image.new_from_file(sys.argv[1])
# left, top, width, height: pixel coordinates of the region you want
tile = slide.extract_area(left, top, width, height)
tile.write_to_file(sys.argv[2])
You can also extract areas on the command-line, of course:
$ vips extract_area INPUT.mrxs OUTPUT.tiff left top width height
Though that will be a little slower than a loop in Python. You can use crop as a synonym for extract_area.
openslide attaches a lot of metadata to the image describing the layout and position of the various subimages. Try:
$ vipsheader -a myslide.mrxs
And have a look through the output. You might be able to calculate the position of your subimages from that. I would also ask on the openslide mailing list, they are very expert and very helpful.
One more thing you could try: get a low-res overview, corner-detect on that, then extract the tiles from the high-res image. To get a low-res version of your slide, try:
$ vips copy myslide.mrxs[level=7] overview.tif
Level 7 is downsampled by 2 ** 7, so 128x.

Create image file from array

Essentially my problem is just finding an easy way to create an image file from an array.
My problem is unparsing CUPS raster files into images. The CUPS RGB raster file header is 1800 bytes. If I input the width and height I can read the raster array contained in the file correctly into Photoshop in Mac order, with interleaved 16 bit data 00RRGGBB. I have written a utility which extracts the width and height from the header.
I'd like to write another command-line utility which takes the width, height and file-name as inputs, truncates the first 1800 bytes off the raster file, and creates a Tiff or BMP or whatever is easiest to write image with the array that is contained in the rest - any well-known image format will do.
The program should be in C or Python and run under Mac and Linux.
For Python, PIL is the tool for this task. Use the putdata() method on image objects to put the pixels from a list into an image.
You can also try GDAL, which supports many image formats. You can use its RasterIO(...) method for reading image data.
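If no imaging library is available at all, a minimal uncompressed 24-bit BMP writer needs only the standard library. A sketch (the header fields follow the BMP format; the input is assumed to be interleaved, top-down RGB bytes, as produced by truncating the 1800-byte CUPS header):

```python
import struct

def write_bmp(path, width, height, rgb):
    """Write interleaved 8-bit RGB bytes (row-major, top-down) as a 24-bit BMP."""
    pad = (4 - (width * 3) % 4) % 4          # BMP rows are padded to 4 bytes
    image_size = (width * 3 + pad) * height
    with open(path, 'wb') as f:
        # 14-byte file header: magic, file size, reserved, pixel-data offset
        f.write(struct.pack('<2sIHHI', b'BM', 54 + image_size, 0, 0, 54))
        # 40-byte BITMAPINFOHEADER: positive height means bottom-up rows
        f.write(struct.pack('<IiiHHIIiiII', 40, width, height, 1, 24,
                            0, image_size, 2835, 2835, 0, 0))
        for y in range(height - 1, -1, -1):  # write rows bottom-up
            row = bytearray()
            for x in range(width):
                i = (y * width + x) * 3
                row += bytes((rgb[i + 2], rgb[i + 1], rgb[i]))  # RGB -> BGR
            f.write(row + b'\x00' * pad)
```

This skips the 16-bit 00RRGGBB unpacking step described in the question; that would have to be converted to plain 8-bit RGB first.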
