Python Multiprocessing map_async - python

I’d like to skip results that are returned from map_async. They are growing in memory but I don’t need them.
Here is some code:
def processLine(line):
#process something
print "result"
pool = Pool(processes = 8)
for line in sys.stdin:
lines.append(line)
if len(lines) >= 100000:
pool.map_async(processLine, lines, 2000)
pool.close()
pool.join()
When I have to process file with hundreds of millions of rows, the python process grows in memory to a few gigabytes. How can I resolve that?
Thanks for your help :)

Your code has a bug:
for line in sys.stdin:
lines.append(line)
if len(lines) >= 100000:
pool.map_async(processLine, lines, 2000)
This is going to wait until lines accumulates more than 100000 lines. After that, pool.map_async is being called on the entire list of 100000+ lines for each additional line.
It is not clear exactly what you are really trying to do, but
if you don't want the return value, use pool.apply_async, not pool.map_async. Maybe something like this:
import multiprocessing as mp
def processLine(line):
#process something
print "result"
if __name__ == '__main__':
pool = mp.Pool(processes = 8)
for line in sys.stdin:
pool.apply_async(processLine, args = (line, ))
pool.close()
pool.join()

Yes you're right. There is some bug
I mean:
def processLine(line):
#process something
print "result"
pool = Pool(processes = 8)
if __name__ == '__main__':
for line in sys.stdin:
lines.append(line)
if len(lines) >= 100000:
pool.map_async(processLine, lines, 2000)
lines = [] #to clear buffer
pool.map_async(processLine, lines, 2000)
pool.close()
pool.join()
I used map_async because it has configurable chunk_size so it is more efficient if there are lots of lines which processing time is quite short.

Related

Python Multiprocessing starmap

i am trying to run a function in parallel with Multiprocessing starmap.
data = [(i, board) for i in range(board.width)]
if __name__ == '__main__':
p = mp.Pool(processes=mp.cpu_count())
ratings = p.starmap(self.rate, data)
print("Ratings: " + ratings)
My problem is that print is never executed. The Function just returns with None.
self.rate() should return a number.
Github: https://github.com/Builder20/Connect4/tree/develop
Any ideas?
Obvious the function loops. you can add logging to see which parts loops, I see the only candidate.
while canSet == -1:
opponentColumn = random.randint(0, board.width)
canSet = board.setStone(self.other, opponentColumn)
add
logging.info(board)
to see how it progresses

Python multiprocess lists of images

I want to use multi process to stack many images. Each stack consists of 5 images, which means I have a list of images with a sublist of the images which should be combined:
img_lst = [[01_A, 01_B, 01_C, 01_D, 01_E], [02_A, 02_B, 02_C, 02_D, 02_E], [03_A, 03_B, 03_C, 03_D, 03_E]]
At them moment I call my function do_stacking(sub_lst) with a loop:
for sub_lst in img_lst:
# example: do_stacking([01_A, 01_B, 01_C, 01_D, 01_E])
do_stacking(sub_lst)
I want to speed up with multiprocessing but I am not sure how to call pool.map function:
if __name__ == '__main__':
from multiprocessing import Pool
# I store my lists in a file
f_in = open(stacking_path + "stacks.txt", 'r')
f_stack = f_in.readlines()
for data in f_stack:
data = data.strip()
data = data.split('\t')
# data is now my sub_lst
# Not sure what to do here, set the sublist, f_stack?
pool = Pool()
pool.map(do_stacking, ???)
pool.close()
pool.join()
Edit:
I have a list of list:
[
[01_A, 01_B, 01_C, 01_D, 01_E],
[02_A, 02_B, 02_C, 02_D, 02_E],
[03_A, 03_B, 03_C, 03_D, 03_E]
]
Each sublist should be passed to a function called do_stacking(sublist). I only want to proceed with the sublist and not with the entire list.
My question is how to handle the loop of the list (for x in img_lst)? Should I create a loop for each Pool?
Pool.map works like the builtin map function.It fetch one element from the second argument each time and pass it to the function that represent by the first argument.
if __name__ == '__main__':
from multiprocessing import Pool
# I store my lists in a file
f_in = open(stacking_path + "stacks.txt", 'r')
f_stack = f_in.readlines()
img_list = []
for data in f_stack:
data = data.strip()
data = data.split('\t')
# data is now my sub_lst
img_list.append(data)
print img_list # check if the img_list is right?
# Not sure what to do here, set the sublist, f_stack?
pool = Pool()
pool.map(do_stacking, img_list)
pool.close()
pool.join()

How to get all pool.apply_async processes to stop once any one process has found a match in python

I have the following code that is leveraging multiprocessing to iterate through a large list and find a match. How can I get all processes to stop once a match is found in any one processes? I have seen examples but I none of them seem to fit into what I am doing here.
#!/usr/bin/env python3.5
import sys, itertools, multiprocessing, functools
alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ12234567890!##$%^&*?,()-=+[]/;"
num_parts = 4
part_size = len(alphabet) // num_parts
def do_job(first_bits):
for x in itertools.product(first_bits, *itertools.repeat(alphabet, num_parts-1)):
# CHECK FOR MATCH HERE
print(''.join(x))
# EXIT ALL PROCESSES IF MATCH FOUND
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=4)
results = []
for i in range(num_parts):
if i == num_parts - 1:
first_bit = alphabet[part_size * i :]
else:
first_bit = alphabet[part_size * i : part_size * (i+1)]
pool.apply_async(do_job, (first_bit,))
pool.close()
pool.join()
Thanks for your time.
UPDATE 1:
I have implemented the changes suggested in the great approach by #ShadowRanger and it is nearly working the way I want it to. So I have added some logging to give an indication of progress and put a 'test' key in there to match.
I want to be able to increase/decrease the iNumberOfProcessors independently of the num_parts. At this stage when I have them both at 4 everything works as expected, 4 processes spin up (one extra for the console). When I change the iNumberOfProcessors = 6, 6 processes spin up but only for of them have any CPU usage. So it appears 2 are idle. Where as my previous solution above, I was able to set the number of cores higher without increasing the num_parts, and all of the processes would get used.
I am not sure about how to refactor this new approach to give me the same functionality. Can you have a look and give me some direction with the refactoring needed to be able to set iNumberOfProcessors and num_parts independently from each other and still have all processes used?
Here is the updated code:
#!/usr/bin/env python3.5
import sys, itertools, multiprocessing, functools
alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ12234567890!##$%^&*?,()-=+[]/;"
num_parts = 4
part_size = len(alphabet) // num_parts
iProgressInterval = 10000
iNumberOfProcessors = 6
def do_job(first_bits):
iAttemptNumber = 0
iLastProgressUpdate = 0
for x in itertools.product(first_bits, *itertools.repeat(alphabet, num_parts-1)):
sKey = ''.join(x)
iAttemptNumber = iAttemptNumber + 1
if iLastProgressUpdate + iProgressInterval <= iAttemptNumber:
iLastProgressUpdate = iLastProgressUpdate + iProgressInterval
print("Attempt#:", iAttemptNumber, "Key:", sKey)
if sKey == 'test':
print("KEY FOUND!! Attempt#:", iAttemptNumber, "Key:", sKey)
return True
def get_part(i):
if i == num_parts - 1:
first_bit = alphabet[part_size * i :]
else:
first_bit = alphabet[part_size * i : part_size * (i+1)]
return first_bit
if __name__ == '__main__':
# with statement with Py3 multiprocessing.Pool terminates when block exits
with multiprocessing.Pool(processes = iNumberOfProcessors) as pool:
# Don't need special case for final block; slices can
for gotmatch in pool.imap_unordered(do_job, map(get_part, range(num_parts))):
if gotmatch:
break
else:
print("No matches found")
UPDATE 2:
Ok here is my attempt at trying #noxdafox suggestion. I have put together the following based on the link he provided with his suggestion. Unfortunately when I run it I get the error:
... line 322, in apply_async
raise ValueError("Pool not running")
ValueError: Pool not running
Can anyone give me some direction on how to get this working.
Basically the issue is that my first attempt did multiprocessing but did not support canceling all processes once a match was found.
My second attempt (based on #ShadowRanger suggestion) solved that problem, but broke the functionality of being able to scale the number of processes and num_parts size independently, which is something my first attempt could do.
My third attempt (based on #noxdafox suggestion), throws the error outlined above.
If anyone can give me some direction on how to maintain the functionality of my first attempt (being able to scale the number of processes and num_parts size independently), and add the functionality of canceling all processes once a match was found it would be much appreciated.
Thank you for your time.
Here is the code from my third attempt based on #noxdafox suggestion:
#!/usr/bin/env python3.5
import sys, itertools, multiprocessing, functools
alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ12234567890!##$%^&*?,()-=+[]/;"
num_parts = 4
part_size = len(alphabet) // num_parts
iProgressInterval = 10000
iNumberOfProcessors = 4
def find_match(first_bits):
iAttemptNumber = 0
iLastProgressUpdate = 0
for x in itertools.product(first_bits, *itertools.repeat(alphabet, num_parts-1)):
sKey = ''.join(x)
iAttemptNumber = iAttemptNumber + 1
if iLastProgressUpdate + iProgressInterval <= iAttemptNumber:
iLastProgressUpdate = iLastProgressUpdate + iProgressInterval
print("Attempt#:", iAttemptNumber, "Key:", sKey)
if sKey == 'test':
print("KEY FOUND!! Attempt#:", iAttemptNumber, "Key:", sKey)
return True
def get_part(i):
if i == num_parts - 1:
first_bit = alphabet[part_size * i :]
else:
first_bit = alphabet[part_size * i : part_size * (i+1)]
return first_bit
def grouper(iterable, n, fillvalue=None):
args = [iter(iterable)] * n
return itertools.zip_longest(*args, fillvalue=fillvalue)
class Worker():
def __init__(self, workers):
self.workers = workers
def callback(self, result):
if result:
self.pool.terminate()
def do_job(self):
print(self.workers)
pool = multiprocessing.Pool(processes=self.workers)
for part in grouper(alphabet, part_size):
pool.apply_async(do_job, (part,), callback=self.callback)
pool.close()
pool.join()
print("All Jobs Queued")
if __name__ == '__main__':
w = Worker(4)
w.do_job()
You can check this question to see an implementation example solving your problem.
This works also with concurrent.futures pool.
Just replace the map method with apply_async and iterated over your list from the caller.
Something like this.
for part in grouper(alphabet, part_size):
pool.apply_async(do_job, part, callback=self.callback)
grouper recipe
multiprocessing isn't really designed to cancel tasks, but you can simulate it for your particular case by using pool.imap_unordered and terminating the pool when you get a hit:
def do_job(first_bits):
for x in itertools.product(first_bits, *itertools.repeat(alphabet, num_parts-1)):
# CHECK FOR MATCH HERE
print(''.join(x))
if match:
return True
# If we exit loop without a match, function implicitly returns falsy None for us
# Factor out part getting to simplify imap_unordered use
def get_part(i):
if i == num_parts - 1:
first_bit = alphabet[part_size * i :]
else:
first_bit = alphabet[part_size * i : part_size * (i+1)]
if __name__ == '__main__':
# with statement with Py3 multiprocessing.Pool terminates when block exits
with multiprocessing.Pool(processes=4) as pool:
# Don't need special case for final block; slices can
for gotmatch in pool.imap_unordered(do_job, map(get_part, range(num_parts))):
if gotmatch:
break
else:
print("No matches found")
This will run do_job for each part, returning results as fast as it can get them. When a worker returns True, the loop breaks, and the with statement for the Pool is exited, terminate-ing the Pool (dropping all work in progress).
Note that while this works, it's kind of abusing multiprocessing; it won't handle canceling individual tasks without terminating the whole Pool. If you need more fine grained task cancellation, you'll want to look at concurrent.futures, but even there, it can only cancel undispatched tasks; once they're running, they can't be cancelled without terminating the Executor or using a side-band means of termination (having the task poll some interprocess object intermittently to determine if it should continue running).

How can I efficiently implement multithreading/multiprocessing in a Python web bot?

Let's say I have a web bot written in python that sends data via POST request to a web site. The data is pulled from a text file line by line and passed into an array. Currently, I'm testing each element in the array through a simple for-loop. How can I effectively implement multi-threading to iterate through the data quicker. Let's say the text file is fairly large. Would attaching a thread to each request be smart? What do you think the best approach to this would be?
with open("c:\file.txt") as file:
dataArr = file.read().splitlines()
dataLen = len(open("c:\file.txt").readlines())-1
def test(data):
#This next part is pseudo code
result = testData('www.example.com', data)
if result == 'whatever':
print 'success'
for i in range(0, dataLen):
test(dataArr[i])
I was thinking of something along the lines of this, but I feel it would cause issues depending on the size of the text file. I know there is software that exists which allows the end-user to specify the amount of the threads when working with large amounts of data. I'm not entirely sure of how that works, but that's something I'd like to implement.
import threading
with open("c:\file.txt") as file:
dataArr = file.read().splitlines()
dataLen = len(open("c:\file.txt").readlines())-1
def test(data):
#This next part is pseudo code
result = testData('www.example.com', data)
if result == 'whatever':
print 'success'
jobs = []
for x in range(0, dataLen):
thread = threading.Thread(target=test, args=(dataArr[x]))
jobs.append(thread)
for j in jobs:
j.start()
for j in jobs:
j.join()
This sounds like a recipe for multiprocessing.Pool
See here: https://docs.python.org/2/library/multiprocessing.html#introduction
from multiprocessing import Pool
def test(num):
if num%2 == 0:
return True
else:
return False
if __name__ == "__main__":
list_of_datas_to_test = [0, 1, 2, 3, 4, 5, 6, 7, 8]
p = Pool(4) # create 4 processes to do our work
print(p.map(test, list_of_datas_to_test)) # distribute our work
Output looks like:
[True, False, True, False, True, False, True, False, True, False]
Threads are slow in python because of the Global Interpreter Lock. You should consider using multiple processes with the Python multiprocessing module instead of threads. Using multiple processes can increase the "ramp up" time of your code, as spawning a real process takes more time than a light thread, but due to the GIL, threading won't do what you're after.
Here and here are a couple of basic resources on using the multiprocessing module. Here's an example from the second link:
import multiprocessing as mp
import random
import string
# Define an output queue
output = mp.Queue()
# define a example function
def rand_string(length, output):
""" Generates a random string of numbers, lower- and uppercase chars. """
rand_str = ''.join(random.choice(
string.ascii_lowercase
+ string.ascii_uppercase
+ string.digits)
for i in range(length))
output.put(rand_str)
# Setup a list of processes that we want to run
processes = [mp.Process(target=rand_string, args=(5, output)) for x in range(4)]
# Run processes
for p in processes:
p.start()
# Exit the completed processes
for p in processes:
p.join()
# Get process results from the output queue
results = [output.get() for p in processes]
print(results)

How can I multithread a function that reads a list of objects in Python? Astrophysics example code

This is my first post to stack overflow. I'll try to include all the necessary information, but please let me know if there's more info I can provide to clarify my question.
I'm trying to multithread a costly function for an astrophysical code in python using pool.map. The function takes as an input a list of objects. The basic code structure is like this:
There's a class of Stars with physical properties:
Class Stars:
def __init__(self,mass,metals,positions,age):
self.mass = mass
self.metals = metals
self.positions = positions
self.age = age
def info(self):
return(self.mass,self.metals,self.positions,self.age)
and there's a list of these objects:
stars_list = []
for i in range(nstars):
stars_list.append(Stars(mass[i],metals[i],positions[i],age[i]))
(where mass, metals, positions and age are known from another script).
There's a costly function that I run with these star objects that returns a spectrum for each one:
def newstars_gen(stars_list):
....
return stellar_nu,stellar_fnu
where stellar_nu and stellar_fnu are numpy arrays
What I would like to do is break the list of star objects (stars_list) up into chunks, and then run newstars_gen on these chunks on multiple threads to gain a speedup. So, to do this, I split the list up into three sublists, and then try to run my function through pool.map:
p = Pool(processes = 3)
nchunks = 3
chunk_start_indices = []
chunk_start_indices.append(0) #the start index is 0
delta_chunk_indices = nstars / nchunks
for n in range(1,nchunks):
chunk_start_indices.append(chunk_start_indices[n-1]+delta_chunk_indices)
for n in range(nchunks):
stars_list_chunk = stars_list[chunk_start_indices[n]:chunk_start_indices[n]+delta_chunk_indices]
#if we're on the last chunk, we might not have the full list included, so need to make sure that we have that here
if n == nchunks-1:
stars_list_chunk = stars_list[chunk_start_indices[n]:-1]
chunk_sol = p.map(newstars_gen,stars_list_chunk)
But when I do this, I get as the error:
File "/Users/[username]/python2.7/multiprocessing/pool.py", line 250, in map
return self.map_async(func, iterable, chunksize).get()
File "/Users/[username]/python2.7/multiprocessing/pool.py", line 554, in get
raise self._value
AttributeError: Stars instance has no attribute '__getitem__'
So, I'm confused as to what sort of attribute I should include with the Stars class. I've tried reading about this online and am not sure how to define an appropriate __getitem__ for this class. I'm quite new to object oriented programming (and python in general).
Any help is much appreciated!
So, it looks like there may be a couple things wrong here and that could be cleaned up or made more pythonic. However, the key problem is that you are using pool.multiprocessing.Pool.map incorrectly for what you have. Your newstars_gen function expects a list, but p.map is going to break up the list you give it into chunks and hand it one Star at a time. You should probably rewrite newstars_gen to operate on one star at a time and then throw away all but the first and last lines of your last code block. If the calculations in newstars_gen aren't independent between Stars (e.g., the mass of one impacts the calculation for another), you will have to do a more dramatic refactoring.
It also looks like it would behoove you to learn about list comprehensions. Be aware that the other built in structures (e.g., set, dict) have equivalents, and also look into generator comprehensions.
I've written a function for distributing the processing of an iterable (like your list of stars objects) among multiple processors, which I'm pretty sure will work well for you.
from multiprocessing import Process, cpu_count, Lock
from sys import stdout
from time import clock
def run_multicore_function(iterable, function, func_args = [], max_processes = 0):
#directly pass in a function that is going to be looped over, and fork those
#loops onto independant processors. Any arguments the function needs must be provided as a list.
if max_processes == 0:
cpus = cpu_count()
if cpus > 7:
max_processes = cpus - 3
elif cpus > 3:
max_processes = cpus - 2
elif cpus > 1:
max_processes = cpus - 1
else:
max_processes = 1
running_processes = 0
child_list = []
start_time = round(clock())
elapsed = 0
counter = 0
print "Running function %s() on %s cores" % (function.__name__,max_processes)
#fire up the multi-core!!
stdout.write("\tJob 0 of %s" % len(iterable),)
stdout.flush()
for next_iter in iterable:
if type(iterable) is dict:
next_iter = iterable[next_iter]
while 1: #Only fork a new process when there is a free processor.
if running_processes < max_processes:
#Start new process
stdout.write("\r\tJob %s of %s (%i sec)" % (counter,len(iterable),elapsed),)
stdout.flush()
if len(func_args) == 0:
p = Process(target=function, args=(next_iter,))
else:
p = Process(target=function, args=(next_iter,func_args))
p.start()
child_list.append(p)
running_processes += 1
counter += 1
break
else:
#processor wait loop
while 1:
for next in range(len(child_list)):
if child_list[next].is_alive():
continue
else:
child_list.pop(next)
running_processes -= 1
break
if (start_time + elapsed) < round(clock()):
elapsed = round(clock()) - start_time
stdout.write("\r\tJob %s of %s (%i sec)" % (counter,len(iterable),elapsed),)
stdout.flush()
if running_processes < max_processes:
break
#wait for remaining processes to complete --> this is the same code as the processor wait loop above
while len(child_list) > 0:
for next in range(len(child_list)):
if child_list[next].is_alive():
continue
else:
child_list.pop(next)
running_processes -= 1
break #need to break out of the for-loop, because the child_list index is changed by pop
if (start_time + elapsed) < round(clock()):
elapsed = round(clock()) - start_time
stdout.write("\r\tRunning job %s of %s (%i sec)" % (counter,len(iterable),elapsed),)
stdout.flush()
print " --> DONE\n"
return
As a usage example, let's use your star_list, and send the result of newstars_gen to a shared file. Start by setting up your iterable, file, and a file lock
star_list = []
for i in range(nstars):
stars_list.append(Stars(mass[i],metals[i],positions[i],age[i]))
outfile = "some/where/output.txt"
file_lock = Lock()
Define your costly function like so:
def newstars_gen(stars_list_item,args): #args = [outfile,file_lock]
outfile,file_lock = args
....
with file_lock:
with open(outfile,"a") as handle:
handle.write(stellar_nu,stellar_fnu)
Now send your list of stars into run_multicore_function()
run_multicore_function(star_list, newstars_gen, [outfile,file_lock])
After all of your items have been calculated, you can go back into the output file to grab the data and carry on. Instead of writing to a file, you can also share the state with multiprocessing.Value or multiprocessing.Array, but I've ran into the occasional issue with data getting lost if my list is large and the function I'm calling is fairly fast. Maybe someone else out there can see why that's happening.
Hopefully this all makes sense!
Good luck,
-Steve

Categories