My overall goal is to crop several regions from an input mirax (.mrxs) slide image to JPEG output files.
Here is what one of these images looks like:
Note that the darker grey area is part of the image, and the regions I ultimately wish to extract in JPEG format are the 3 black square regions.
Now, for the specifics:
I'm able to extract the color channels from the mirax image into 3 separate TIFF files using vips on the command line:
vips extract_band INPUT.mrxs OUTPUT.tiff[tile,compression=jpeg] C --n 1
Where C corresponds to the channel number (0-2), and each output file is about 250 MB in size.
The next job is to somehow recognize and extract the regions of interest from the images, so I turned to several python imaging libraries, and this is where I encountered difficulties.
When I try to load any of the TIFFs using OpenCV using:
i = cv2.imread('/home/user/input_img.tiff',cv2.IMREAD_ANYDEPTH)
I get the error: (-211) The total matrix size does not fit to "size_t" type in function setSize
I managed to get a little more traction with Pillow, by doing:
from PIL import Image
tiff = Image.open('/home/user/input_img.tiff')
print len(tiff.tile)
print tiff.tile[0]
print tiff.info
which outputs:
636633
('jpeg', (0, 0, 128, 128), 8, ('L', ''))
{'compression': 'jpeg', 'dpi': (25.4, 25.4)}
However, beyond loading the image, I can't seem to perform any useful operations; for example, calling tiff.tostring() results in a MemoryError (I do this in an attempt to convert the PIL object to a numpy array). I'm not sure this operation is even valid given the existence of tiles.
From my limited understanding, these TIFFs store the image data in 'tiles' (of which the above image contains 636633) in a JPEG-compressed format.
It's not clear to me, however, how one would extract these tiles for use as regular JPEG images, or even whether the sequence of steps I outlined above is a potentially useful way of accomplishing the overall goal of extracting the ROIs from the mirax image.
If I'm on the right track, then some guidance would be appreciated, or, if there's another way to accomplish my goal using vips/openslide without python I would be interested in hearing ideas. Additionally, more information about how I could deal with or understand the TIFF files I described would also be helpful.
The ideal situations would include:
1) Some kind of autocropping feature in vips/openslide which can generate JPEGs from either the TIFFs or the original mirax image, along the lines of what the following command does, but without generating tens of thousands of images:
vips dzsave CMU-1.mrxs[autocrop] pyramid
2) Being able to extract tiles from the TIFFs and store the data corresponding to the image region as a numpy array in order to detect the 3 ROIs using OpenCV or another method.
I would use the vips Python binding, it's very like PIL but can handle these huge images. Try something like:
import sys
from gi.repository import Vips
slide = Vips.Image.new_from_file(sys.argv[1])
tile = slide.extract_area(left, top, width, height)
tile.write_to_file(sys.argv[2])
You can also extract areas on the command-line, of course:
$ vips extract_area INPUT.mrxs OUTPUT.tiff left top width height
Though that will be a little slower than a loop in Python. You can use crop as a synonym for extract_area.
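Such a loop might look like the sketch below; the ROI coordinates are hypothetical placeholders you would substitute with your own, and writing to a .jpg filename makes vips pick its JPEG saver:

```python
# Hypothetical (left, top, width, height) rectangles for the three ROIs;
# substitute the real coordinates for your slide.
ROIS = [
    (1000, 2000, 512, 512),
    (3000, 2000, 512, 512),
    (5000, 2000, 512, 512),
]

def extract_rois(slide_path, rois=ROIS):
    # Import deferred so the ROI table above can be read without the binding installed.
    from gi.repository import Vips
    slide = Vips.Image.new_from_file(slide_path)
    for n, (left, top, width, height) in enumerate(rois):
        tile = slide.extract_area(left, top, width, height)
        tile.write_to_file('roi_%d.jpg' % n)
```

Calling extract_rois('myslide.mrxs') would write roi_0.jpg through roi_2.jpg.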
openslide attaches a lot of metadata to the image describing the layout and position of the various subimages. Try:
$ vipsheader -a myslide.mrxs
And have a look through the output. You might be able to calculate the position of your subimages from that. I would also ask on the openslide mailing list, they are very expert and very helpful.
One more thing you could try: get a low-res overview, corner-detect on that, then extract the tiles from the high-res image. To get a low-res version of your slide, try:
$ vips copy myslide.mrxs[level=7] overview.tif
Level 7 is downsampled by 2 ** 7, so 128x.
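Once you have detected the three dark squares on the overview, their coordinates only need to be multiplied back up by that downsample factor before calling extract_area on the full-resolution slide. A minimal sketch of the mapping (the detection step itself is omitted):

```python
LEVEL = 7
SCALE = 2 ** LEVEL  # level 7 is downsampled 128x

def to_full_res(box, scale=SCALE):
    """Map a (left, top, width, height) box found on the low-res
    overview back to full-resolution slide coordinates."""
    left, top, width, height = box
    return (left * scale, top * scale, width * scale, height * scale)
```

For example, a 4x4-pixel square at (10, 20) on the level-7 overview corresponds to a 512x512 region at (1280, 2560) in the full image.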
So I have over 150,000 huge GeoTIFF images (each 2.4 GB) on which I need to run image smoothing and edge detection (a LoG filter) to get a sharpened image. I read the image using GDAL, smoothed it, subsampled it, created a high-pass filter (level 5) and reconstructed the image.
This works fine for a normal .jpg file.
But I'm not able to accomplish this for a huge TIFF file because I keep running into memory errors, even with 32 GB of RAM, an 8-core processor and 4 TB of disk space.
What is the best way to do heavyweight image processing / image segmentation in Python 3.6 on Ubuntu 18.04 LTS?
pyvips can process huge images quickly and in little memory. It's LGPL, runs on Linux, macOS and Windows, and works with every version of Python. Most Linux distributions (including Ubuntu) have it in their package manager.
It's a demand-driven, streaming image processing library. Instead of processing images in single huge lumps, it constructs a network of image processing operators behind your back and pixels are pulled through your computer's memory in small regions by the need to create the output.
For example, I can run this program:
import sys
import pyvips
# access='sequential' puts pyvips into streaming mode for this image
im = pyvips.Image.new_from_file(sys.argv[1], access='sequential')
im = im.crop(100, 100, im.width - 200, im.height - 200)
# 10% shrink, lanczos3 (by default)
im = im.resize(0.9)
mask = pyvips.Image.new_from_array([[-1, -1, -1],
                                    [-1, 16, -1],
                                    [-1, -1, -1]], scale=8)
# integer convolution ... you can use large float masks too, canny,
# sobel, etc. etc.
im = im.conv(mask, precision='integer')
im.write_to_file(sys.argv[2])
On a 40k x 30k pixel GeoTIFF image:
$ vipsheader SAV_X5S_transparent_mosaic_group1.tif
SAV_X5S_transparent_mosaic_group1.tif: 42106x29852 uchar, 4 bands, srgb, tiffload
On this 2015 laptop, it runs like this:
$ /usr/bin/time -f %M:%e python3 bench.py SAV_X5S_transparent_mosaic_group1.tif x.tif
257012:101.43
i.e. about 260 MB of RAM and 101 s of elapsed time. It should be quite a bit quicker on your large machine.
One issue you might have is with the GeoTIFF tags: they won't be preserved by pyvips. Perhaps you won't need them in later processing.
Typically, such large images are processed tile-wise. The idea is to divide the image up into tiles, read each one in independently, with enough "overlap" to account for the filtering applied, process it and write it to file.
The TIFF standard supports a "tiled" format (I believe GeoTIFF files are typically stored in a tiled format). This format is explicitly designed to make it easy to read in a small window of the image without having to piece together bits and pieces from all over the file. Each tile in the TIFF file can be indexed by location, and is encoded and compressed independently, making it easy to read in and write out one tile at a time.
The overlap you need depends on the filtering applied. If you apply, say, a Laplacian of Gaussian filter with a 9x9 window (it reaches 4 pixels past the central pixel), then the overlap needs to be only 4 pixels. If you chain filters, you typically want to add the reach of each filter to obtain a total overlap value.
Next, divide the image into regions that are some multiple of the tile size in the TIFF file. Say the file has tiles of 512x512 pixels. You could choose to process a region of 2048x2048 pixels (4x4 tiles) at a time.
Now loop over the image in steps of 2048 in each dimension. Read in those tiles, plus enough of the neighboring tiles to give you a sub-image of 2048+2*4 pixels on each side. Process the sub-image, remove the overlap region, and write the rest to the corresponding tiles in the output TIFF file. The output TIFF file should be set up in the same way as the input TIFF file.
I am sure there is software out there that automates this process, but I don't know of any in Python. If you implement this yourself, you will need to learn how to read and write individual tiles in the TIFF file. One option is pylibtiff.
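The read/write window arithmetic described above can be sketched as a small helper. This assumes the 2048-pixel step and 4-pixel overlap from the example, and clamps the read windows at the image edges:

```python
def tile_windows(img_w, img_h, step=2048, overlap=4):
    """Yield (read_box, write_box) pairs for overlapped tile-wise processing.

    read_box  = (x, y, w, h) to read from the input: the step region grown by
                `overlap` on each side, clamped at the image edges
    write_box = (x, y, w, h) region of the output this step covers
    """
    for y in range(0, img_h, step):
        for x in range(0, img_w, step):
            x0 = max(0, x - overlap)
            y0 = max(0, y - overlap)
            x1 = min(img_w, x + step + overlap)
            y1 = min(img_h, y + step + overlap)
            w = min(step, img_w - x)
            h = min(step, img_h - y)
            yield (x0, y0, x1 - x0, y1 - y0), (x, y, w, h)
```

After filtering each read window, you would discard the overlap margins and write only the write_box region to the output file.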
The main problem:
I have a map step where I render a large amount of sectors of an image in parallel:
1 2
3 4
worker a -> 1
worker b -> 2
...
merge 1,2,3,4 to make final image
If it can fit in memory
With images that are relatively small and can fit in RAM, one can simply use PIL's functionality:
def merge_images(image_files, x, y):
    images = list(map(Image.open, image_files))
    width, height = images[0].size
    new_im = Image.new('RGB', (width * x, height * y))
    for n, im in enumerate(images):
        # column = n % x, row = n // x (note: n // y would misplace
        # tiles whenever x != y)
        new_im.paste(im, ((n % x) * width, (n // x) * height))
    return new_im
Unfortunately, I am going to have many, many large sectors. I want to merge the pictures into a single final image of about 40,000 x 60,000 pixels, which I estimate to be around 20 GB (or maybe even more).
So obviously, we can't approach this problem on RAM. I know there are alternatives like memmap'ing arrays and writing to slices, which I will try. However, I am looking for as-out-of-the-box-as-possible solutions.
Does anyone know of any easier alternatives? All the approaches I've tried so far are in Python, but it doesn't need to be in Python.
pyvips can do exactly what you want very quickly and efficiently. For example:
import sys
import pyvips
images = [pyvips.Image.new_from_file(filename, access="sequential")
          for filename in sys.argv[2:]]
final = pyvips.Image.arrayjoin(images, across=10)
final.write_to_file(sys.argv[1])
The access="sequential" option tells pyvips that you want to stream the image. It will only load pixels on demand as it generates output, so you can merge enormous images using only a little memory. The arrayjoin operator joins an array of images into a grid, with across setting the number of tiles per row. It has quite a few layout options: you can specify borders, overlaps, background, centring behaviour and so on.
I can run it like this:
$ for i in {1..100}; do cp ~/pics/k2.jpg $i.jpg; done
$ time ../arrayjoin.py x.tif *.jpg
real 0m2.498s
user 0m3.579s
sys 0m1.054s
$ vipsheader x.tif
x.tif: 14500x20480 uchar, 3 bands, srgb, tiffload
So it joined 100 JPG images to make a 14,500 x 20,480 pixel mosaic in about 2.5 s on this laptop and, from watching top, needed about 300 MB of memory. I've used it to join over 30,000 images into a single file, and it would go higher. I've made images of over 300,000 by 300,000 pixels.
The pyvips equivalent of PIL's paste is insert. You could use that too, though it won't work so well for very large numbers of images.
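An insert-based version might be sketched as follows; the grid arithmetic mirrors the PIL paste positions, and the sketch assumes all tiles share one size:

```python
def paste_position(n, across, tile_w, tile_h):
    """Top-left corner of the n-th tile in a grid laid out `across` tiles wide."""
    return (n % across) * tile_w, (n // across) * tile_h

def mosaic(filenames, across, out_path):
    # Import deferred so the layout helper above is usable on its own.
    import pyvips
    tiles = [pyvips.Image.new_from_file(f, access="sequential") for f in filenames]
    down = (len(tiles) + across - 1) // across  # rows needed for the grid
    base = pyvips.Image.black(tiles[0].width * across,
                              tiles[0].height * down, bands=tiles[0].bands)
    for n, tile in enumerate(tiles):
        x, y = paste_position(n, across, tile.width, tile.height)
        base = base.insert(tile, x, y)
    base.write_to_file(out_path)
```

As noted above, this style becomes slow for very large numbers of images; arrayjoin is the better fit there.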
There's also a command-line interface, so you could just enter:
vips arrayjoin "$(echo *.jpg)" x.tif --across 10
To join up a large set of JPG images.
I would suggest using the TIFF file format. Most TIFF files are striped (one or more scan lines are stored as a block on file), but it is possible to write tiled TIFF files (where the image is divided into tiles, and each is stored as an independent block on file).
LibTIFF is the canonical way of reading and writing TIFF files. It has an easy way of creating a new TIFF file and adding tiles one at a time. Thus, your program can create the TIFF file, obtain one sector, write it as (one or more) tiles to the TIFF file, obtain the next sector, and so on. You would have to choose your tile size to evenly divide one sector.
There is a Python binding to LibTIFF called, what else, PyLibTIFF. It should allow you to follow the above model from within Python. That same repository has a pure Python module to read and write TIFF files; I don't know whether it is able to write tiled TIFF files, or whether it allows writing them in chunks. There are many other Python modules for reading and writing TIFF files, but most will write one matrix as a TIFF file rather than allow you to write a file one tile at a time.
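As an aside, pyvips (used elsewhere on this page) can also write tiled TIFF directly; its tiffsave options tile, tile_width and tile_height select the tiled layout. A sketch, together with a small helper showing how many tile blocks an image occupies:

```python
def tiles_needed(width, height, tile=512):
    """Number of tile blocks a width x height image occupies."""
    return ((width + tile - 1) // tile) * ((height + tile - 1) // tile)

def save_tiled(in_path, out_path, tile=512):
    # Import deferred; pyvips is only needed for the actual file I/O.
    import pyvips
    im = pyvips.Image.new_from_file(in_path, access="sequential")
    # tile=True selects the tiled (rather than striped) TIFF layout.
    im.write_to_file(out_path, tile=True, tile_width=tile, tile_height=tile)
```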
I'm working on a tool that cuts a large image into smaller tiles using ImageMagick via Python. And I need all the tiles to be on the same format (png, 8 or 16 bits).
In most cases it works just fine, but on monochromatic tiles ImageMagick compresses the picture when writing the file. For instance, pure black tiles are compressed to a 1-bit picture.
I use the plain save method, as explained in the docs.
I found no documentation about this autocompressing feature nor any way to avoid this.
Is there a workaround for this or a way I can avoid this happening?
edit:
For instance, if I use this code to import a 24bit rgb picture:
from wand.image import Image
img = Image(filename='http://upload.wikimedia.org/wikipedia/commons/6/68/Solid_black.png')
print img.type
I get this as type
bilevel
if I add this,
img.type = 'grayscale'
print img.type
Once again I get
bilevel
If I try to force the pixel depth like this,
img.depth = 16
print img.type
print img.depth
I get:
bilevel
16
I thought that maybe it had actually changed the depth, but once I save the image, it becomes 1-bit depth again.
So it seems to me that ImageMagick just automatically compresses the picture and that I have no control over it. It even refuses to change the image type.
Any ideas to avoid this? Any way to force the pixel depth?
I am trying to make a three-colour FITS image using the aplpy.make_rgb_image function. I use three separate FITS images in RGB to do so and am able to save a colour image in png, jpeg, ... formats, but I would prefer to save it as a FITS file.
When I try that I get the following error.
IOError: FITS save handler not installed
I've tried to find a solution on the web for a few days but was unable to get any good results.
Would anyone know how to get such a handler installed, or perhaps any other approach I could use to get this done?
I don't think there is enough information for me to answer your question completely; for example, I don't know what call you are making to perform the image save. But I can guess:
FITS does not store RGB data like you wish it to. FITS can store multi-band data as individual monochromatic data layers in a multi-extension data "cube". Software, including ds9 and aplpy, can read that FITS data cube and author RGB images in RGB formats (png, jpg...). The error you see comes from PIL, which has no backend to author FITS files (I think, but the validity of that point doesn't matter).
So I think you should use aplpy.make_rgb_cube to save a 3-HDU FITS cube based on your 3 input FITS files, then import that FITS cube back into aplpy and use aplpy.make_rgb_image to output RGB-compatible formats. This way you have the saved FITS cube in a near-native astronomy format, and a means to create RGB formats from a variety of tools that can import that cube.
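A sketch of that two-step workflow; the filenames here are hypothetical placeholders:

```python
def fits_rgb(r_fits, g_fits, b_fits, cube='cube.fits', png='rgb.png'):
    """Build a 3-HDU FITS cube from the three inputs, then render it to RGB."""
    # Import deferred so the sketch can be read without aplpy installed.
    import aplpy
    aplpy.make_rgb_cube([r_fits, g_fits, b_fits], cube)  # reprojects to a common frame
    aplpy.make_rgb_image(cube, png)
    return cube, png
```

The cube.fits file is the near-native astronomy product; rgb.png is just one possible rendering of it.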
How can I replace one colour with another across multiple images in Python? I have a folder with 400 sprite animations. I would like to replace the block-coloured shadow (111, 79, 51) with one that has alpha transparency. I could easily find the files to batch-convert using:
imgs = glob.glob(filepath + r'\*.bmp')
however I don't know how I could change the pixel colours. If it makes any difference, the images are all 96x96 and I don't care how long the process takes. I am using Python 3.2.2, so I can't really use PIL (I think).
BMP is a Windows file format, so you will need PIL or something like it, or you can roll your own reader/writer. The basic modules won't help as far as I'm aware. You can read PPM and GIF using Tk (PhotoImage()), which is part of the standard distribution, and use get() and put() on that image to change pixel values. See references online, because it's not straightforward: the pixels come from get() as 3-tuples of integers, but need to go back to put() as space-separated hex text!
Are your images in indexed mode (8 bits per pixel with a palette), or "truecolor" 32bpp images? If they are in an indexed mode, it would not be hard to simply mark the palette entry for that colour as transparent across all files.
Otherwise, you will really have to process all the pixel data. It could also be done by writing a Python script for GIMP, but that would require Python 2 nonetheless.
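For the indexed-mode case, the palette lookup itself is simple. A sketch using Pillow for the file I/O (assuming a Pillow build is available for your Python, and noting the output must be PNG, since BMP has no transparency):

```python
def find_palette_index(palette, rgb):
    """Return the palette slot holding `rgb`, given a flat [r,g,b, r,g,b, ...] list."""
    for i in range(0, len(palette), 3):
        if tuple(palette[i:i + 3]) == tuple(rgb):
            return i // 3
    return None

def make_shadow_transparent(path, out_path, shadow=(111, 79, 51)):
    # Import deferred; Pillow does the file I/O.
    from PIL import Image
    img = Image.open(path)  # must be a palette ("P" mode) image
    idx = find_palette_index(img.getpalette(), shadow)
    if idx is not None:
        # PNG supports single-index transparency for palette images.
        img.save(out_path, transparency=idx)
```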