I'm trying to use multiprocessing.Pool to compare images based on similarity. I have code working on a single core, using a for loop or map() to iterate over the data, but it's dreadfully slow on large groups of images. For that reason I've been trying to implement multiprocessing, but I can't seem to get it right. My main question is: why doesn't getssim() in the code below change the list?
The structure of the iterable looks something like this:
[[("images/000.jpg",np.ndarray),0.923],...]
Where the float is the similarity index of an image compared to the current image being tested. Here is the (somewhat abbreviated) non-working code:
import cv2
import glob
import os
import operator
import multiprocessing
from skimage.measure import structural_similarity as ssim

def makeSimilarList(imagesdata):
    simImgList = []  # list of images ordered by their similarity
    while imagesdata:
        simImg = findSimilar(imagesdata)
        simImgList.append(os.path.basename(simImg))
    return simImgList

def getssim(imgd):
    similarityIndex = ssim(img1, imgd[0][1])
    print(similarityIndex)  # this prints correctly
    imgd[1] = similarityIndex
    return imgd  # this appears to have no effect

def findSimilar(imagesdata):
    limg = imagesdata.pop()
    global img1  # making img1 accessible to getssim, a bad idea!
    img1 = limg[0][1]
    p = multiprocessing.Pool(processes=multiprocessing.cpu_count(), maxtasksperchild=2)
    p.map(getssim, imagesdata)
    p.close()
    p.join()
    imagesdata.sort(key=operator.itemgetter(1))
    return limg[0][0]  # return name of image

images = [f for f in glob.glob(src + "*." + ftype)]
images.reverse()
imagesdata = [[(f, cv2.imread(f, 0)), ""] for f in images]
finalList = makeSimilarList(imagesdata)

with open("./simlist.txt", 'w') as f:
    f.write('\n'.join(finalList))
Thanks for the help!!
You forgot to assign the result of Pool.map to a variable: map() pickles each item and sends it to a worker process, so mutating imgd inside getssim changes only the worker's copy; the updated items come back as map()'s return value. The key function should probably read:
def findSimilar(imagesdata):
    limg = imagesdata.pop()
    global img1  # making img1 accessible to getssim, a bad idea!
    img1 = limg[0][1]
    p = multiprocessing.Pool(maxtasksperchild=2)
    imagesdata[:] = p.map(getssim, imagesdata)  # slice-assign so the caller's list sees the results
    p.close()
    p.join()
    imagesdata.sort(key=operator.itemgetter(1))
    return limg[0][0]  # return name of image
Since you don't give enough details, I could not test your code, but I think this was the crucial point.
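To see why the assignment is needed: arguments to Pool.map are pickled and shipped to worker processes, so in-place mutation never reaches the parent process. A minimal, self-contained sketch (independent of the image code above):

import multiprocessing

def double(item):
    item[1] = item[1] * 2  # mutates only this worker's copy
    return item

if __name__ == "__main__":
    data = [["a", 1], ["b", 2]]
    with multiprocessing.Pool(2) as pool:
        result = pool.map(double, data)
    print(data)    # [['a', 1], ['b', 2]] -- the parent's list is unchanged
    print(result)  # [['a', 2], ['b', 4]] -- the returned copies carry the changes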
I want to know how to apply a function to a folder of images and save each result to a separate file. For one image it works successfully, but I cannot apply it to all the images.
import glob
import numpy as np
from PIL import Image

images = glob.glob('/Desktop/Dataset/Images/*')
for img in images:
    img = np.array(Image.open(img))
    output = 'Desktop/Dataset/Output'
    MyFn(img = img, saveFile = output)
You did not define the sv value in your 2nd code snippet.
As the output file will otherwise be overwritten on every iteration, try this code:

import glob
import numpy as np
from PIL import Image

images = glob.glob('/Desktop/Dataset/Images/*')
i = 0
for img in images:
    i += 1  # counter so each image gets its own output name
    img = np.array(Image.open(img))
    output = 'Desktop/Dataset/Output'
    MyFn(img = img, saveFile = output + str(i))
Try using the os library directly:

import os
entries = os.listdir('image/')

This will return a list of all the files in your folder.
This is because you are not setting the sv value in your loop. You should set it to a different value at each iteration in order for it to write to different files.
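Putting both points together, a minimal sketch (with the processing step left as a placeholder, since MyFn isn't shown):

import glob
import numpy as np
from PIL import Image

for i, path in enumerate(glob.glob('/Desktop/Dataset/Images/*')):
    img = np.array(Image.open(path))
    # ... apply your processing to img here ...
    out_path = 'Desktop/Dataset/Output/result_{}.png'.format(i)  # unique name per iteration
    Image.fromarray(img).save(out_path)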
Thank you massively to Paul M who posted the following code in response to my first query on how to compile a "stack" of randomly selected images into a unique image:
from pathlib import Path
from random import choice
from PIL import Image

layers = [list(Path(directory).glob("*.png")) for directory in ("bigcircle/", "mediumcircle/")]
selected_paths = [choice(paths) for paths in layers]

img = Image.new("RGB", (4961, 4961), color=(0, 220, 15))
for path in selected_paths:
    layer = Image.open(str(path), "r")
    img.paste(layer, (0, 0), layer)
I have this code sitting inside a for _ in itertools.repeat(None, num): loop, where num defines the number of different images being generated. I end the loop with the following, to save each image under a unique (incremental) file name:
i = 0
while os.path.exists("Finished Pieces/image %s.png" % i):
    i += 1
img.save("Finished Pieces/image %s.png" % i)
So far, so good. The challenge I'm struggling with now is how to append to a data.csv file with the details of each image created.
For example, in loop 1 bigcircle1.png is selected from bigcircle/ folder and mediumcircle6.png from mediumcircle/, loop 2 uses bigcircle3.png and mediumcircle2.png, and so on. At the end of this loop the data.csv file would read:
Filename,bigcircle,mediumcircle
image 0,bigcircle1.png,mediumcircle6.png
image 1,bigcircle3.png,mediumcircle2.png
I've tried the following, which I know wouldn't give the desired result, but which I thought might be a good start to run and tweak until it's right; however, it doesn't generate any output (and I am importing numpy as np):
np.savetxt('data.csv', [p for p in zip(img, layer)], delimiter=',', fmt='%s')
If it's not too much of an ask, ideally the first iteration of the loop would create data.csv and store the first record, with the second iteration onwards appending this file.
Me again ;)
I think it would make sense to split up the functionality of the program into separate functions. I would start maybe with a function called something like discover_image_paths, which discovers (via glob) all image paths. It might make sense to store the paths according to what kind of circle they represent - I'm envisioning a dictionary with "big" and "medium" keys, and lists of paths as associated values:
def discover_image_paths():
    from pathlib import Path
    keys = ("bigcircle", "mediumcircle")
    return dict(zip(keys, (list(Path(directory).glob("*.png")) for directory in (key + "/" for key in keys))))

def main():
    global paths
    paths = discover_image_paths()
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
In the terminal:
>>> paths["bigcircle"]
[WindowsPath('bigcircle/big1.png'), WindowsPath('bigcircle/big2.png'), WindowsPath('bigcircle/big3.png')]
>>> paths["mediumcircle"]
[WindowsPath('mediumcircle/med1.png'), WindowsPath('mediumcircle/med2.png'), WindowsPath('mediumcircle/med3.png')]
>>>
As you can see, to test the script, I created some dummy image files - three for each category.
Extending this by adding a function to generate an output image (given an iterable of paths to combine and an output file name) and a main loop to generate num_images number of images (sorry, I'm not familiar with numpy):
def generate_image(paths, color, output_filename):
    from PIL import Image
    dimensions = (4961, 4961)
    image = Image.new("RGB", dimensions, color=color)
    for path in paths:
        layer = Image.open(path, "r")
        image.paste(layer, (0, 0), layer)
    image.save(output_filename)

def discover_image_paths(keys):
    from pathlib import Path
    return dict(zip(keys, (list(Path(directory).glob("*.png")) for directory in (key + "/" for key in keys))))

def main():
    from random import choice, choices
    from csv import DictWriter

    field_names = ["filename", "color"]
    keys = ["bigcircle", "mediumcircle"]
    paths = discover_image_paths(keys)
    num_images = 5

    with open("data.csv", "w", newline="") as file:
        writer = DictWriter(file, fieldnames=field_names + keys)
        writer.writeheader()
        for image_no in range(1, num_images + 1):
            selected_paths = {key: choice(category_paths) for key, category_paths in paths.items()}
            file_name = "output_{}.png".format(image_no)
            color = tuple(choices(range(0, 256), k=3))
            generate_image(map(str, selected_paths.values()), color, file_name)
            row = {**dict(zip(field_names, [file_name, color])), **{key: path.name for key, path in selected_paths.items()}}
            writer.writerow(row)
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Example output in the CSV:
filename,color,bigcircle,mediumcircle
output_1.png,"(49, 100, 190)",big3.png,med1.png
output_2.png,"(228, 37, 227)",big2.png,med3.png
output_3.png,"(251, 14, 193)",big1.png,med1.png
output_4.png,"(35, 12, 196)",big1.png,med3.png
output_5.png,"(62, 192, 170)",big2.png,med2.png
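If you also want the CSV to survive separate runs of the script (created on the first run, appended to afterwards), one hedged variant is to open it in append mode and only write the header when the file is new:

import os
import csv

def append_row(csv_path, row, field_names):
    new_file = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=field_names)
        if new_file:
            writer.writeheader()  # header only once, when the file is created
        writer.writerow(row)

# example usage with the columns from the answer above
append_row("data.csv",
           {"filename": "output_6.png", "color": "(1, 2, 3)",
            "bigcircle": "big1.png", "mediumcircle": "med2.png"},
           ["filename", "color", "bigcircle", "mediumcircle"])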
I am new to Google Earth Engine and was trying to understand how to use the Google Earth Engine Python API. I can create an image collection, but apparently the getDownloadURL() method operates only on individual images. So I am trying to understand how to iterate over and download all of the images in the collection.
Here is my basic code. I broke it out in great detail for some other work I am doing.
import ee
ee.Initialize()

col = ee.ImageCollection('LANDSAT/LC08/C01/T1')
col = col.filterDate('2015-01-01', '2015-04-30')  # filters return a new collection, so reassign

pt = ee.Geometry.Point([-2.40986111110000012, 26.76033333330000019])
buff = pt.buffer(300)
region = ee.Feature.bounds(buff)
col = col.filterBounds(region)
So I pulled the Landsat collection and filtered by date and a buffered point geometry. I should have something like 7-8 images in the collection (with all bands).
However, I could not seem to get iteration to work over the collection.
For example:

for i in col:
    print(i)
The error indicates TypeError: 'ImageCollection' object is not iterable
So if the collection is not iterable, how can I access the individual images?
Once I have an image, I should be able to use the usual
path = col[i].getDownloadUrl({
    'scale': 30,
    'crs': 'EPSG:4326',
    'region': region
})
It's a good idea to use ee.batch.Export for this. Also, it's good practice to avoid mixing client and server functions (reference). For that reason, a for-loop can be used, since Export is a client function. Here's a simple example to get you started:
import ee
ee.Initialize()

rectangle = ee.Geometry.Rectangle([-1, -1, 1, 1])
sillyCollection = ee.ImageCollection([ee.Image(1), ee.Image(2), ee.Image(3)])

# This is OK for small collections
collectionList = sillyCollection.toList(sillyCollection.size())
collectionSize = collectionList.size().getInfo()
for i in range(collectionSize):  # xrange in Python 2
    ee.batch.Export.image.toDrive(
        image=ee.Image(collectionList.get(i)).clip(rectangle),
        fileNamePrefix='foo' + str(i + 1),
        dimensions='128x128').start()
Note that converting a collection to a list in this manner is also dangerous for large collections (reference). However, this is probably the most scalable method if you really need to download.
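As a hedged usage note: Export.image.toDrive returns a task object, so keeping a handle on it lets you poll progress instead of firing and forgetting:

task = ee.batch.Export.image.toDrive(
    image=ee.Image(collectionList.get(0)).clip(rectangle),
    fileNamePrefix='foo1',
    dimensions='128x128')
task.start()
print(task.status())  # a dict with a 'state' key: READY, RUNNING, COMPLETED, FAILED, ...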
Here is my solution:
import ee
ee.Initialize()

pt = ee.Geometry.Point([-2.40986111110000012, 26.76033333330000019])
region = pt.buffer(10)
col = ee.ImageCollection('LANDSAT/LC08/C01/T1')\
    .filterDate('2015-01-01', '2015-04-30')\
    .filterBounds(region)

bands = ['B4', 'B5']  # Change it!

def accumulate(image, img):
    name_image = image.get('system:index')
    image = image.select([0], [name_image])
    cumm = ee.Image(img).addBands(image)
    return cumm

for band in bands:
    col_band = col.map(lambda img: img.select(band)
                       .set('system:time_start', img.get('system:time_start'))
                       .set('system:index', img.get('system:index')))
    # ImageCollection to List
    col_list = col_band.toList(col_band.size())
    # Define the initial value for iterate.
    base = ee.Image(col_list.get(0))
    base_name = base.get('system:index')
    base = base.select([0], [base_name])
    # Eliminate the image 'base'.
    new_col = ee.ImageCollection(col_list.splice(0, 1))
    img_cummulative = ee.Image(new_col.iterate(accumulate, base))
    task = ee.batch.Export.image.toDrive(
        image=img_cummulative.clip(region),
        folder='landsat',
        fileNamePrefix=band,
        scale=30)
    task.start()  # start() returns None, so keep the task object separately
    print('Export Image ' + band + ' was submitted, please wait ...')

img_cummulative.bandNames().getInfo()
You can find a reproducible example here: https://colab.research.google.com/drive/1Nv8-l20l82nIQ946WR1iOkr-4b_QhISu
You could possibly use ee.ImageCollection.iterate() with a function that gets the image and adds it to a list.
import ee

def accumulate_images(image, images):
    images.append(image)
    return images

for img in col.iterate(accumulate_images, []):
    url = img.getDownloadURL(dict(scale=30, crs='EPSG:4326', region=region))
Unfortunately I am not able to test this code as I do not have access to the API, but it might help you arrive at a solution.
I had a similar problem and was not able to solve it with the presented solutions, so I put together some sample code for this purpose. It iterates over an image collection on the client side, so it is not affected by the (server-side only) limitations of .map() or .iterate().
It is possible to download the code and see its explanation here
It basically transforms the ImageCollection into a list (ic.toList()). Then it performs a standard loop; for each individual image it is possible to convert it back with ee.Image(list.get(i)) and process the images one by one across the whole collection.
In your particular case, to download each image, the function called within the loop could be getDownloadURL() or getThumbURL():
var url = imgNew.getDownloadURL({
    region: geometry,
});

var thumbURL = imgNew.getThumbURL({region: geometry, dimensions: 512, format: 'png'});
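For readers on the Python API, a rough (untested) translation of the same client-side loop might look like this, reusing col and region from the question:

img_list = col.toList(col.size())
for i in range(img_list.size().getInfo()):
    img_new = ee.Image(img_list.get(i))
    url = img_new.getDownloadURL({'region': region, 'scale': 30})
    print(url)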
My problem is really quite simple.
I have 100 images on my computer; those images are called 1.ppm, 2.ppm, and so on up to 100.ppm.
I want to read each image to a variable using imread, and then perform a few operations. I want to do the exact same thing to all of the images.
My question is this: instead of copy-pasting one hundred times, is it possible to use imread in a loop? Something like:
for i in range(1,100):
    X = io.imread('/home/oria/Desktop/more pics/'i'.ppm')
Instead of copy pasting the same code block and just changing the picture number a hundred times, I want to do this in a loop.
I have a similar issue with numpy.load. I want to load files called ICA1 ICA2 etc up to ICA100. Is it possible to write something like
numpy.load('/home/oria/Desktop/ICA DB/ICA'i'.npy)?
Like this:
for i in range(1,100):
    X = io.imread('/home/oria/Desktop/more pics/%s.ppm' % (i))
Or, like this:
for i in range(1,100):
    X = io.imread('/home/oria/Desktop/more pics/' + str(i) + '.ppm')
Go ahead and read the article on basic string operations as well as this simple article on string formatting
If I correctly understand what you're asking, it could be done as:
for i in range(1, 101):
    x = io.imread('/home/oria/Desktop/more pics/' + str(i) + '.ppm')
Note that the high end of the range function is not inclusive, so using range(1, 100) would only produce 1, 2, 3...99. Also note that i must be converted to a string or you will receive TypeError: cannot concatenate 'str' and 'int' objects.
import cv2
import os

def load_images_from_folder(folder):
    images = []
    for filename in os.listdir(folder):
        img = cv2.imread(os.path.join(folder, filename))
        if img is not None:
            images.append(img)
    return images
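One caveat with this approach: os.listdir returns names in arbitrary order, so if the numeric order of 1.ppm ... 100.ppm matters, sort numerically first. A small sketch (it assumes file stems are plain integers, as in the question):

import os

def numerically_sorted(folder, ext='.ppm'):
    # '2.ppm' should come before '10.ppm', so sort on the integer stem
    names = [f for f in os.listdir(folder) if f.endswith(ext)]
    return sorted(names, key=lambda f: int(os.path.splitext(f)[0]))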
Just use str.format, passing the variable i:
for i in range(1,100):
    X = io.imread('/home/oria/Desktop/more pics/{}.ppm'.format(i))
When you want to load with numpy do the same thing again:
for i in range(1,100):
    X = numpy.load('/home/oria/Desktop/ICA DB/ICA{}.npy'.format(i))
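On Python 3.6 or newer, an f-string does the same job a little more readably:

for i in range(1, 100):
    X = io.imread(f'/home/oria/Desktop/more pics/{i}.ppm')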
I wrote a function in Python 2.7 (on Windows, 64-bit) to calculate the mean value of the intersection area between a reference polygon (Ref) and one or more segmented (Seg) polygons in ESRI shapefile format. The code is quite slow because I have more than 2000 reference polygons, and for each Ref polygon the function runs over all Seg polygons (more than 7000). I am sorry, but the function is a prototype.
I wish to know if multiprocessing can help me increase the speed of my loop, or if there are better-performing solutions. If multiprocessing is a possible solution, I wish to know the best way to optimize the following function:
import os
import numpy as np
import osgeo.ogr
from shapely.geometry import Polygon

def AreaInter(reference, segmented, outFile):
    # open shapefiles
    ref = osgeo.ogr.Open(reference)
    if ref is None:
        raise SystemExit('Unable to open %s' % reference)
    seg = osgeo.ogr.Open(segmented)
    if seg is None:
        raise SystemExit('Unable to open %s' % segmented)
    ref_layer = ref.GetLayer()
    seg_layer = seg.GetLayer()
    # create outfile
    if not os.path.split(outFile)[0]:
        file_path, file_name_ext = os.path.split(os.path.abspath(reference))
        outFile_filename = os.path.splitext(os.path.basename(outFile))[0]
        file_out = open(os.path.abspath("{0}\\{1}.txt".format(file_path, outFile_filename)), "w")
    else:
        file_path_name, file_ext = os.path.splitext(outFile)
        file_out = open(os.path.abspath("{0}.txt".format(file_path_name)), "w")
    # For each reference object-i
    for index in xrange(ref_layer.GetFeatureCount()):
        ref_feature = ref_layer.GetFeature(index)
        # get FID (=Feature ID)
        FID = str(ref_feature.GetFID())
        ref_geometry = ref_feature.GetGeometryRef()
        pts = ref_geometry.GetGeometryRef(0)
        points = []
        for p in xrange(pts.GetPointCount()):
            points.append((pts.GetX(p), pts.GetY(p)))
        # convert to a shapely polygon
        ref_polygon = Polygon(points)
        # get the area
        ref_Area = ref_polygon.area
        # create empty lists
        Area_seg, Area_intersect = ([] for _ in range(2))
        # For each segmented object-j
        for segment in xrange(seg_layer.GetFeatureCount()):
            seg_feature = seg_layer.GetFeature(segment)
            seg_geometry = seg_feature.GetGeometryRef()
            pts = seg_geometry.GetGeometryRef(0)
            points = []
            for p in xrange(pts.GetPointCount()):
                points.append((pts.GetX(p), pts.GetY(p)))
            seg_polygon = Polygon(points)
            Area_seg.append(seg_polygon.area)
            # intersection (overlap) of reference object with the segmented object
            intersect_polygon = ref_polygon.intersection(seg_polygon)
            # area of intersection (= 0 if no intersection)
            Area_intersect.append(intersect_polygon.area)
        # Average over all segmented objects (1 or more segmented polygons can intersect the reference polygon)
        seg_Area_average = np.average(Area_seg)
        intersect_Area_average = np.average(Area_intersect)
        file_out.write(" ".join(["%s" % i for i in [FID, ref_Area, seg_Area_average, intersect_Area_average]]) + "\n")
    file_out.close()
You can use the multiprocessing package, and especially the Pool class. First create a function that does all the stuff you want to do within the for loop, and that takes as an argument only the index:
def process_reference_object(index):
    ref_feature = ref_layer.GetFeature(index)
    # all your code goes here
    return (" ".join(["%s" % i for i in [FID, ref_Area, seg_Area_average, intersect_Area_average]]) + "\n")
Note that this doesn't write to a file itself; that would be messy, because you'd have multiple processes writing to the same file at the same time. Instead, it returns the string that needs to be written. Also note that there are objects in this function, like ref_layer or ref_geometry, that will need to reach it somehow; that's up to you (you could make process_reference_object a method of a class initialized with them, or it could be as ugly as just defining them globally).
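One hedged way to get those objects into the workers is a Pool initializer that opens the shapefiles once per worker process (OGR datasource handles generally can't be pickled, so each child should open its own); the paths here are hypothetical:

from multiprocessing import Pool
import osgeo.ogr

def init_worker(reference, segmented):
    # each worker opens its own OGR handles; keep the datasources alive as globals
    global ref_ds, seg_ds, ref_layer, seg_layer
    ref_ds = osgeo.ogr.Open(reference)
    seg_ds = osgeo.ogr.Open(segmented)
    ref_layer = ref_ds.GetLayer()
    seg_layer = seg_ds.GetLayer()

# later, in the parent:
# p = Pool(initializer=init_worker, initargs=("ref.shp", "seg.shp"))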
Then, you create a pool of process resources, and run all of your indices using Pool.imap_unordered (which will itself allocate each index to a different process as necessary):
from multiprocessing import Pool

p = Pool()  # run multiple processes
for l in p.imap_unordered(process_reference_object, range(ref_layer.GetFeatureCount())):
    file_out.write(l)
This will parallelize the independent processing of your reference objects across multiple processes, and write them to the file (in an arbitrary order, note).
Threading can help to a degree, but first you should make sure you can't simplify the algorithm. If you're checking each of 2000 reference polygons against 7000 segmented polygons (perhaps I misunderstood), then you should start there. Stuff that runs in O(n²) is going to be slow, so maybe you can prune away things that will definitely not intersect, or find some other way to speed things up. Otherwise, running multiple processes or threads will only improve things linearly when your data grows geometrically.
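For the pruning idea, a minimal sketch using Shapely's STRtree spatial index (assuming Shapely 1.x, where query() returns geometries; in Shapely 2.x it returns indices). The dummy polygons stand in for the ones built from your Seg layer:

from shapely.geometry import Polygon
from shapely.strtree import STRtree

# dummy stand-ins for the polygons built from the Seg layer
seg_polygons = [Polygon([(i, 0), (i + 1, 0), (i + 1, 1), (i, 1)]) for i in range(7000)]
tree = STRtree(seg_polygons)

ref_polygon = Polygon([(3.5, 0.2), (4.5, 0.2), (4.5, 0.8), (3.5, 0.8)])
# the index prunes by bounding box; confirm each hit with a true intersection test
hits = [g for g in tree.query(ref_polygon) if ref_polygon.intersects(g)]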