Python multiprocess lists of images

I want to use multiprocessing to stack many images. Each stack consists of 5 images, so I have a list of images made up of sublists of the images that should be combined:

img_lst = [['01_A', '01_B', '01_C', '01_D', '01_E'], ['02_A', '02_B', '02_C', '02_D', '02_E'], ['03_A', '03_B', '03_C', '03_D', '03_E']]

At the moment I call my function do_stacking(sub_lst) with a loop:

for sub_lst in img_lst:
    # example: do_stacking(['01_A', '01_B', '01_C', '01_D', '01_E'])
    do_stacking(sub_lst)

I want to speed this up with multiprocessing, but I am not sure how to call the pool.map function:
if __name__ == '__main__':
    from multiprocessing import Pool
    # I store my lists in a file
    f_in = open(stacking_path + "stacks.txt", 'r')
    f_stack = f_in.readlines()
    for data in f_stack:
        data = data.strip()
        data = data.split('\t')
        # data is now my sub_lst
    # Not sure what to do here, set the sublist, f_stack?
    pool = Pool()
    pool.map(do_stacking, ???)
    pool.close()
    pool.join()
Edit:

I have a list of lists:

[
    ['01_A', '01_B', '01_C', '01_D', '01_E'],
    ['02_A', '02_B', '02_C', '02_D', '02_E'],
    ['03_A', '03_B', '03_C', '03_D', '03_E']
]

Each sublist should be passed to a function called do_stacking(sublist); the function should only process a single sublist, not the entire list.

My question is how to handle the loop over the list (for x in img_lst): should I create a Pool for each iteration of the loop?

Pool.map works like the built-in map function. It fetches one element at a time from the second argument and passes it to the function given as the first argument.
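In other words, Pool.map is the parallel counterpart of the plain sequential call (same inputs, same list of results):

results = map(do_stacking, img_list)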
if __name__ == '__main__':
    from multiprocessing import Pool
    # I store my lists in a file
    f_in = open(stacking_path + "stacks.txt", 'r')
    f_stack = f_in.readlines()
    img_list = []
    for data in f_stack:
        data = data.strip()
        data = data.split('\t')
        # data is now one sub_lst
        img_list.append(data)
    print img_list  # check that img_list is right
    pool = Pool()
    pool.map(do_stacking, img_list)  # each worker call receives one sublist
    pool.close()
    pool.join()
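For completeness, a minimal sketch of what do_stacking itself might look like, assuming each sublist holds image file paths and that a pixel-wise mean is the desired stacking operation (both are assumptions; adapt to your actual data):

import numpy as np
from PIL import Image

def do_stacking(sub_lst):
    # Load each image as a float array and average the stack pixel-wise.
    stack = np.mean([np.asarray(Image.open(p), dtype=np.float64) for p in sub_lst], axis=0)
    # Save the result; the output filename here is purely hypothetical.
    Image.fromarray(stack.astype(np.uint8)).save(sub_lst[0] + '_stacked.png')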

Related

How can I efficiently implement multithreading/multiprocessing in a Python web bot?

Let's say I have a web bot, written in Python, that sends data via POST request to a web site. The data is pulled from a text file line by line and passed into an array. Currently, I'm testing each element in the array through a simple for-loop. How can I effectively implement multithreading to iterate through the data more quickly? Let's say the text file is fairly large. Would attaching a thread to each request be smart? What do you think the best approach to this would be?
with open("c:\\file.txt") as f:
    dataArr = f.read().splitlines()
dataLen = len(dataArr)

def test(data):
    # This next part is pseudo code
    result = testData('www.example.com', data)
    if result == 'whatever':
        print 'success'

for i in range(dataLen):
    test(dataArr[i])
I was thinking of something along the lines of this, but I feel it would cause issues depending on the size of the text file. I know there is software that allows the end user to specify the number of threads when working with large amounts of data. I'm not entirely sure how that works, but that's something I'd like to implement.
import threading

with open("c:\\file.txt") as f:
    dataArr = f.read().splitlines()

def test(data):
    # This next part is pseudo code
    result = testData('www.example.com', data)
    if result == 'whatever':
        print 'success'

jobs = []
for x in range(len(dataArr)):
    thread = threading.Thread(target=test, args=(dataArr[x],))  # note the comma: args must be a tuple
    jobs.append(thread)
for j in jobs:
    j.start()
for j in jobs:
    j.join()
This sounds like a recipe for multiprocessing.Pool
See here: https://docs.python.org/2/library/multiprocessing.html#introduction
from multiprocessing import Pool

def test(num):
    if num % 2 == 0:
        return True
    else:
        return False

if __name__ == "__main__":
    list_of_datas_to_test = [0, 1, 2, 3, 4, 5, 6, 7, 8]
    p = Pool(4)  # create 4 processes to do our work
    print(p.map(test, list_of_datas_to_test))  # distribute our work
Output looks like:
[True, False, True, False, True, False, True, False, True]
Threads are slow in Python for CPU-bound work because of the Global Interpreter Lock (GIL). You should consider using multiple processes with the Python multiprocessing module instead of threads. Using multiple processes can increase the "ramp up" time of your code, since spawning a real process takes more time than starting a light thread, but because of the GIL, threading won't give you the speedup you're after.
Here and here are a couple of basic resources on using the multiprocessing module. Here's an example from the second link:
import multiprocessing as mp
import random
import string

# Define an output queue
output = mp.Queue()

# Define an example function
def rand_string(length, output):
    """ Generates a random string of numbers, lower- and uppercase chars. """
    rand_str = ''.join(random.choice(
                           string.ascii_lowercase
                           + string.ascii_uppercase
                           + string.digits)
                       for i in range(length))
    output.put(rand_str)

# Set up a list of processes that we want to run
processes = [mp.Process(target=rand_string, args=(5, output)) for x in range(4)]

# Run processes
for p in processes:
    p.start()

# Exit the completed processes
for p in processes:
    p.join()

# Get process results from the output queue
results = [output.get() for p in processes]
print(results)
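Applied to the original bot, the same Pool pattern might look roughly like this; it is only a sketch, with testData and the 'whatever' check standing in for whatever POST logic and success test the real code uses:

from multiprocessing import Pool

def test(data):
    # placeholder for the real POST request and response check
    result = testData('www.example.com', data)
    return result == 'whatever'

if __name__ == '__main__':
    with open("c:\\file.txt") as f:
        dataArr = f.read().splitlines()
    p = Pool(8)  # tune the worker count to your machine and the site's limits
    outcomes = p.map(test, dataArr)  # one line per task, distributed across workers
    print sum(outcomes), "successes"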

How can I multithread a function that reads a list of objects in Python? Astrophysics example code

This is my first post to Stack Overflow. I'll try to include all the necessary information, but please let me know if there's more I can provide to clarify my question.

I'm trying to parallelize a costly function for an astrophysics code in Python using pool.map. The function takes as input a list of objects. The basic code structure is like this:
There's a class of Stars with physical properties:
class Stars:
    def __init__(self, mass, metals, positions, age):
        self.mass = mass
        self.metals = metals
        self.positions = positions
        self.age = age

    def info(self):
        return (self.mass, self.metals, self.positions, self.age)
and there's a list of these objects:
stars_list = []
for i in range(nstars):
    stars_list.append(Stars(mass[i], metals[i], positions[i], age[i]))
(where mass, metals, positions and age are known from another script).
There's a costly function that I run with these star objects that returns a spectrum for each one:
def newstars_gen(stars_list):
    ....
    return stellar_nu, stellar_fnu
where stellar_nu and stellar_fnu are numpy arrays
What I would like to do is break the list of star objects (stars_list) up into chunks, and then run newstars_gen on these chunks on multiple threads to gain a speedup. So, to do this, I split the list up into three sublists, and then try to run my function through pool.map:
p = Pool(processes=3)
nchunks = 3
chunk_start_indices = []
chunk_start_indices.append(0)  # the start index is 0
delta_chunk_indices = nstars / nchunks
for n in range(1, nchunks):
    chunk_start_indices.append(chunk_start_indices[n-1] + delta_chunk_indices)
for n in range(nchunks):
    stars_list_chunk = stars_list[chunk_start_indices[n]:chunk_start_indices[n]+delta_chunk_indices]
    # if we're on the last chunk, we might not have the full list included,
    # so need to make sure that we have that here
    if n == nchunks-1:
        stars_list_chunk = stars_list[chunk_start_indices[n]:]
    chunk_sol = p.map(newstars_gen, stars_list_chunk)
But when I do this, I get this error:

File "/Users/[username]/python2.7/multiprocessing/pool.py", line 250, in map
    return self.map_async(func, iterable, chunksize).get()
File "/Users/[username]/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
AttributeError: Stars instance has no attribute '__getitem__'
So I'm confused about what sort of attribute I should include in the Stars class. I've tried reading about this online and am not sure how to define an appropriate __getitem__ for this class. I'm quite new to object-oriented programming (and Python in general).
Any help is much appreciated!
So, it looks like there may be a couple of things wrong here that could be cleaned up or made more pythonic. However, the key problem is that you are using multiprocessing.Pool.map incorrectly for what you have. Your newstars_gen function expects a list, but p.map is going to break up the list you give it and hand the function one Stars instance at a time. You should probably rewrite newstars_gen to operate on one star at a time and then throw away all but the first and last lines of your last code block, as in the sketch below. If the calculations in newstars_gen aren't independent between Stars (e.g., the mass of one impacts the calculation for another), you will have to do a more dramatic refactoring.
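A minimal sketch of that restructuring (assuming stars_list from above, independent per-star calculations, and leaving the costly body elided):

from multiprocessing import Pool

def newstars_gen(star):
    # ... the costly calculation for a single Stars instance ...
    return stellar_nu, stellar_fnu

if __name__ == '__main__':
    p = Pool(processes=3)
    # Pool.map chunks the list itself and hands each worker one star at a time
    solutions = p.map(newstars_gen, stars_list)
    p.close()
    p.join()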
It also looks like it would behoove you to learn about list comprehensions. Be aware that the other built-in structures (e.g., set, dict) have equivalents, and also look into generator expressions.
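For example, the stars_list loop above collapses to a single list comprehension with the same behavior:

stars_list = [Stars(mass[i], metals[i], positions[i], age[i]) for i in range(nstars)]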
I've written a function for distributing the processing of an iterable (like your list of stars objects) among multiple processors, which I'm pretty sure will work well for you.
from multiprocessing import Process, cpu_count, Lock
from sys import stdout
from time import clock

def run_multicore_function(iterable, function, func_args=[], max_processes=0):
    # Directly pass in a function that is going to be looped over, and fork those
    # loops onto independent processors. Any arguments the function needs must be
    # provided as a list.
    if max_processes == 0:
        cpus = cpu_count()
        if cpus > 7:
            max_processes = cpus - 3
        elif cpus > 3:
            max_processes = cpus - 2
        elif cpus > 1:
            max_processes = cpus - 1
        else:
            max_processes = 1
    running_processes = 0
    child_list = []
    start_time = round(clock())
    elapsed = 0
    counter = 0
    print "Running function %s() on %s cores" % (function.__name__, max_processes)
    # fire up the multi-core!!
    stdout.write("\tJob 0 of %s" % len(iterable))
    stdout.flush()
    for next_iter in iterable:
        if type(iterable) is dict:
            next_iter = iterable[next_iter]
        while 1:  # Only fork a new process when there is a free processor.
            if running_processes < max_processes:
                # Start new process
                stdout.write("\r\tJob %s of %s (%i sec)" % (counter, len(iterable), elapsed))
                stdout.flush()
                if len(func_args) == 0:
                    p = Process(target=function, args=(next_iter,))
                else:
                    p = Process(target=function, args=(next_iter, func_args))
                p.start()
                child_list.append(p)
                running_processes += 1
                counter += 1
                break
            else:
                # processor wait loop
                while 1:
                    for next in range(len(child_list)):
                        if child_list[next].is_alive():
                            continue
                        else:
                            child_list.pop(next)
                            running_processes -= 1
                            break
                    if (start_time + elapsed) < round(clock()):
                        elapsed = round(clock()) - start_time
                        stdout.write("\r\tJob %s of %s (%i sec)" % (counter, len(iterable), elapsed))
                        stdout.flush()
                    if running_processes < max_processes:
                        break
    # wait for remaining processes to complete --> this is the same code as the
    # processor wait loop above
    while len(child_list) > 0:
        for next in range(len(child_list)):
            if child_list[next].is_alive():
                continue
            else:
                child_list.pop(next)
                running_processes -= 1
                break  # need to break out of the for-loop, because the child_list index is changed by pop
        if (start_time + elapsed) < round(clock()):
            elapsed = round(clock()) - start_time
            stdout.write("\r\tRunning job %s of %s (%i sec)" % (counter, len(iterable), elapsed))
            stdout.flush()
    print " --> DONE\n"
    return
As a usage example, let's use your star_list and send the result of newstars_gen to a shared file. Start by setting up your iterable, the output file, and a file lock:
star_list = []
for i in range(nstars):
    star_list.append(Stars(mass[i], metals[i], positions[i], age[i]))

outfile = "some/where/output.txt"
file_lock = Lock()
Define your costly function like so:
def newstars_gen(stars_list_item, args):  # args = [outfile, file_lock]
    outfile, file_lock = args
    ....
    with file_lock:
        with open(outfile, "a") as handle:
            handle.write("%s\t%s\n" % (stellar_nu, stellar_fnu))  # write() takes a single string
Now send your list of stars into run_multicore_function()
run_multicore_function(star_list, newstars_gen, [outfile,file_lock])
After all of your items have been calculated, you can go back into the output file to grab the data and carry on. Instead of writing to a file, you can also share state with multiprocessing.Value or multiprocessing.Array, but I've run into occasional issues with data getting lost when my list is large and the function I'm calling is fairly fast. Maybe someone else out there can see why that's happening.
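For reference, a minimal sketch of that shared-state alternative; the 'd'/'i' typecodes, the counter, and the squares are illustrative choices, not part of the original answer:

import multiprocessing as mp

def work(shared_total, shared_slots, idx, value):
    # Locking the Value guards the read-modify-write against races.
    with shared_total.get_lock():
        shared_total.value += value
    shared_slots[idx] = idx * idx

if __name__ == '__main__':
    total = mp.Value('d', 0.0)   # one shared double
    slots = mp.Array('i', 4)     # four shared ints, zero-initialized
    procs = [mp.Process(target=work, args=(total, slots, i, i)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print total.value, list(slots)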
Hopefully this all makes sense!
Good luck,
-Steve

Multiprocess itertools combinations with two arguments

I have the following function that I would like to run using multiprocessing:
def bruteForcePaths3(paths, availableNodes):
    results = []
    # start by taking each combination 2 at a time, then 3, etc
    for i in range(1, len(availableNodes)+1):
        print "combo number: %d" % i
        currentCombos = combinations(availableNodes, i)
        for combo in currentCombos:
            # get a fresh copy of paths for this combination
            currentPaths = list(paths)
            currentRemainingPaths = []
            # print combo
            for node in combo:
                # determine a better way to remove nodes; for now, if it's in, we remove it
                currentRemainingPaths = [path for path in currentPaths if not (node in path)]
                currentPaths = currentRemainingPaths
            # if there are no paths left
            if len(currentRemainingPaths) == 0:
                # save this combination
                print combo
                results.append(frozenset(combo))
    return results
Based on a few other posts (Combining itertools and multiprocessing?), I tried to multiprocess this as follows:
def grouper_nofill(n, iterable):
    it = iter(iterable)
    def take():
        while 1:
            yield list(islice(it, n))
    return iter(take().next, [])

def mp_bruteForcePaths(paths, availableNodes):
    pool = multiprocessing.Pool(4)
    chunksize = 256
    async_results = []
    def worker(paths, combos, out_q):
        """ The worker function, invoked in a process. 'nums' is a
            list of numbers to factor. The results are placed in
            a dictionary that's pushed to a queue.
        """
        results = bruteForcePaths2(paths, combos)
        print results
        out_q.put(results)
    for i in range(1, len(availableNodes)+1):
        currentCombos = combinations(availableNodes, i)
        for finput in grouper_nofill(chunksize, currentCombos):
            args = (paths, finput)
            async_results.extend(pool.map_async(bruteForcePaths2, args).get())
    print async_results

def bruteForcePaths2(args):
    paths, combos = args
    results = []
    for combo in combos:
        # get a fresh copy of paths for this combination
        currentPaths = list(paths)
        currentRemainingPaths = []
        # print combo
        for node in combo:
            # determine a better way to remove nodes; for now, if it's in, we remove it
            currentRemainingPaths = [path for path in currentPaths if not (node in path)]
            currentPaths = currentRemainingPaths
        # if there are no paths left
        if len(currentRemainingPaths) == 0:
            # save this combination
            print combo
            results.append(frozenset(combo))
    return results
I need to be able to pass two arguments to the brute-force function. I'm getting the error:

"too many values to unpack"

So, a three-part question:

How can I multiprocess the brute-force function over nproc CPUs, splitting the combinations iterator?
How can I pass in the two arguments, paths and combinations?
How do I get the result (I think map_async should do that for me)?

Thanks.
This
args = (paths, finput)
pool.map_async(bruteForcePaths2, args)
makes these two calls, which is not your intent:
bruteForcePaths2(paths)
bruteForcePaths2(finput)
You can use apply_async instead to submit single function calls to the pool. Note also that if you call get immediately, it will wait for the result, and you don't get any advantage from multiprocessing.
You could do it like this:
for i in range(1, len(availableNodes)+1):
    currentCombos = combinations(availableNodes, i)
    for finput in grouper_nofill(chunksize, currentCombos):
        args = (paths, finput)
        # bruteForcePaths2 takes a single tuple argument, so wrap args once more
        async_results.append(pool.apply_async(bruteForcePaths2, (args,)))
results = [x.get() for x in async_results]
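For the "two arguments" part, another sketch (reusing grouper_nofill, chunksize, paths, and availableNodes from the question) is to keep bruteForcePaths2's single-tuple signature and generate one (paths, chunk) tuple per task, letting imap_unordered consume them lazily:

# bruteForcePaths2 keeps its signature: def bruteForcePaths2(args): paths, combos = args; ...
results = []
for i in range(1, len(availableNodes)+1):
    currentCombos = combinations(availableNodes, i)
    # pair the constant paths with each chunk of combinations, one tuple per task
    tasks = ((paths, chunk) for chunk in grouper_nofill(chunksize, currentCombos))
    for chunk_result in pool.imap_unordered(bruteForcePaths2, tasks):
        results.extend(chunk_result)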

Python Multiprocessing map_async

I’d like to skip results that are returned from map_async. They are growing in memory but I don’t need them.
Here is some code:
def processLine(line):
    # process something
    print "result"

pool = Pool(processes=8)
for line in sys.stdin:
    lines.append(line)
    if len(lines) >= 100000:
        pool.map_async(processLine, lines, 2000)
pool.close()
pool.join()
When I have to process a file with hundreds of millions of rows, the Python process grows to a few gigabytes of memory. How can I resolve that?
Thanks for your help :)
Your code has a bug:

for line in sys.stdin:
    lines.append(line)
    if len(lines) >= 100000:
        pool.map_async(processLine, lines, 2000)

This waits until lines has accumulated 100000 lines; after that, pool.map_async is called on the entire, never-cleared list again for every additional line.
It is not clear exactly what you are really trying to do, but
if you don't want the return value, use pool.apply_async, not pool.map_async. Maybe something like this:
import sys
import multiprocessing as mp

def processLine(line):
    # process something
    print "result"

if __name__ == '__main__':
    pool = mp.Pool(processes=8)
    for line in sys.stdin:
        pool.apply_async(processLine, args=(line,))
    pool.close()
    pool.join()
Yes, you're right, there is a bug. I meant:
def processLine(line):
    # process something
    print "result"

if __name__ == '__main__':
    pool = Pool(processes=8)
    lines = []
    for line in sys.stdin:
        lines.append(line)
        if len(lines) >= 100000:
            pool.map_async(processLine, lines, 2000)
            lines = []  # to clear the buffer
    pool.map_async(processLine, lines, 2000)  # submit whatever is left over
    pool.close()
    pool.join()
I used map_async because it has a configurable chunksize, so it is more efficient when there are lots of lines whose processing time is quite short.
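If the goal is simply bounded memory with results discarded, a different sketch is to stream the input through imap_unordered, which pulls lines lazily and lets the loop drop each result as it arrives (the chunk size of 2000 is carried over from the code above):

import sys
import multiprocessing as mp

def processLine(line):
    # process something
    print "result"

if __name__ == '__main__':
    pool = mp.Pool(processes=8)
    # iterate purely for the side effects; each result is discarded immediately
    for _ in pool.imap_unordered(processLine, sys.stdin, 2000):
        pass
    pool.close()
    pool.join()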

Python: Question about multiprocessing / multithreading and shared resources

Here's the simplest multiprocessing example I've found so far:
import multiprocessing

def calculate(value):
    return value * 10

if __name__ == '__main__':
    pool = multiprocessing.Pool(None)
    tasks = range(10000)
    results = []
    r = pool.map_async(calculate, tasks, callback=results.append)
    r.wait()  # Wait on the results
    print results
I have two lists and one index to access the elements in each list. The ith position in the first list is related to the ith position in the second. I didn't use a dict because the lists are ordered.

What I was doing was something like:

for i in xrange(len(first_list)):
    # do something with first_list[i] and second_list[i]
So, using that example, I think I can make a function sort of like this:

# global variables first_list, second_list, i
first_list, second_list, i = None, None, 0

# initialize the lists
...

# have a function that does what the loop did and increments i inside it
def function():
    # do stuff
    i += 1

But that makes i a shared resource, and I'm not sure if that'd be safe. It also seems to me that my design is not lending itself well to this multithreaded approach, but I'm not sure how to fix it.
Here's a working example of what I wanted (edit in an image URL you want to use):

import multiprocessing
import subprocess, shlex

links = ['http://www.example.com/image.jpg']*10  # don't use this URL
names = [str(i) + '.jpg' for i in range(10)]

def download(i):
    command = 'wget -O ' + names[i] + ' ' + links[i]
    print command
    args = shlex.split(command)
    return subprocess.call(args, shell=False)

if __name__ == '__main__':
    pool = multiprocessing.Pool(None)
    tasks = range(10)
    r = pool.map_async(download, tasks)
    r.wait()  # Wait on the results
First off, it might be beneficial to make one list of tuples, for example

new_list = zip(first_list, second_list)  # so that new_list[i] == (first_list[i], second_list[i])

That way, as you change i, you ensure that you are always operating on the same items from first_list and second_list.
Secondly, assuming there are no relations between the i and i-1 entries in your lists, you can use your function to operate on one given i value, and let a pool worker handle each i value. Consider
indices = range(len(new_list))
results = []
r = pool.map_async(your_function, indices, callback=results.append)
r.wait() # Wait on the results
This should give you what you want.
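A minimal self-contained variation on that idea, mapping directly over the paired tuples instead of indices (a design choice for the sketch, not part of the original answer):

import multiprocessing

first_list = [1, 2, 3]
second_list = ['a', 'b', 'c']

def your_function(pair):
    first_item, second_item = pair
    # do something with the two related items; here we just combine them
    return '%s-%s' % (first_item, second_item)

if __name__ == '__main__':
    new_list = zip(first_list, second_list)  # [(1, 'a'), (2, 'b'), (3, 'c')]
    pool = multiprocessing.Pool(None)
    results = pool.map(your_function, new_list)  # each worker gets one pair
    print results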
