How can I efficiently implement multithreading/multiprocessing in a Python web bot?

Let's say I have a web bot written in Python that sends data via POST requests to a web site. The data is pulled from a text file line by line and stored in an array. Currently, I'm testing each element in the array with a simple for-loop. How can I effectively implement multithreading to iterate through the data more quickly? Let's say the text file is fairly large. Would attaching a thread to each request be smart? What do you think the best approach would be?
with open(r"c:\file.txt") as file:  # raw string so "\f" is not read as a form feed
    dataArr = file.read().splitlines()
def test(data):
    # This next part is pseudo code
    result = testData('www.example.com', data)
    if result == 'whatever':
        print('success')
for data in dataArr:
    test(data)
I was thinking of something along the lines of this, but I feel it would cause issues depending on the size of the text file. I know there is software that lets the end user specify the number of threads when working with large amounts of data. I'm not entirely sure how that works, but that's something I'd like to implement.
import threading
with open(r"c:\file.txt") as file:  # raw string so "\f" is not read as a form feed
    dataArr = file.read().splitlines()
def test(data):
    # This next part is pseudo code
    result = testData('www.example.com', data)
    if result == 'whatever':
        print('success')
jobs = []
for data in dataArr:
    # args must be a tuple, hence the trailing comma
    thread = threading.Thread(target=test, args=(data,))
    jobs.append(thread)
for j in jobs:
    j.start()
for j in jobs:
    j.join()

This sounds like a recipe for multiprocessing.Pool
See here: https://docs.python.org/2/library/multiprocessing.html#introduction
from multiprocessing import Pool
def test(num):
    if num%2 == 0:
        return True
    else:
        return False
if __name__ == "__main__":
    list_of_datas_to_test = [0, 1, 2, 3, 4, 5, 6, 7, 8]
    p = Pool(4) # create 4 processes to do our work
    print(p.map(test, list_of_datas_to_test)) # distribute our work
Output looks like:
[True, False, True, False, True, False, True, False, True]
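Applied to the bot from the question, the same map pattern might look like the sketch below. It swaps in multiprocessing.pool.ThreadPool, which exposes the same map interface but uses threads, on the assumption that the work is dominated by waiting on the network. check_line is a hypothetical stand-in for the question's testData() call, and num_workers is the user-chosen worker count the question asks about.

```python
from multiprocessing.pool import ThreadPool


def check_line(data):
    """Hypothetical stand-in for the question's testData() call."""
    # A real bot would send the POST request here and inspect the response.
    return data, data.endswith("7")


def run_checks(lines, num_workers=4):
    """Check every line using at most num_workers concurrent workers."""
    with ThreadPool(num_workers) as pool:
        # map blocks until every line has been checked and keeps input order
        return pool.map(check_line, lines)


if __name__ == "__main__":
    sample = ["user%d" % i for i in range(10)]  # stands in for the file's lines
    for data, ok in run_checks(sample, num_workers=3):
        if ok:
            print("success:", data)
```

The pool size caps how many requests are in flight at once, so a large file does not translate into one thread per line.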

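Threads in Python cannot run Python bytecode in parallel because of the Global Interpreter Lock (GIL), so they give little speedup for CPU-bound work. You should consider using multiple processes with the Python multiprocessing module instead of threads. Spawning a real process takes more time than starting a light thread, so multiple processes can increase the "ramp up" time of your code, but they are not limited by the GIL. (The GIL is released during blocking I/O such as network requests, so threads can still help when most of the time is spent waiting on the server.)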
Here's an example, taken from a basic tutorial on using the multiprocessing module:
import multiprocessing as mp
import random
import string
# define an example function
def rand_string(length, output):
    """ Generates a random string of numbers, lower- and uppercase chars. """
    rand_str = ''.join(random.choice(
                        string.ascii_lowercase
                        + string.ascii_uppercase
                        + string.digits)
                   for i in range(length))
    output.put(rand_str)
if __name__ == '__main__':
    # Define an output queue
    output = mp.Queue()
    # Setup a list of processes that we want to run
    processes = [mp.Process(target=rand_string, args=(5, output)) for x in range(4)]
    # Run processes
    for p in processes:
        p.start()
    # Get process results from the output queue (drain it before joining)
    results = [output.get() for p in processes]
    # Wait for the processes to exit
    for p in processes:
        p.join()
    print(results)

Related

Returning the first non-zero result from Pool.map_async

I am using the python multiprocessing library in order to run a number of tests on a large array of numbers.
I have the following code:
import multiprocessing as mp
pool = mp.Pool(processes=6)
res = pool.map_async(testFunction, arrayOfNumbers)
However I want to return the first number that passes the test, and then exit. I am not interested in storing the array of results.
Currently testFunction returns 0 for any number that fails, so without any optimisation I would wait for it to finish and use:
return next(x for x in res.get() if x != 0)
assuming there is a result. However since it is running asynchronously, I want to get the non-zero value as soon as possible.
What is the best approach to this?
I am not sure if this is the best approach, but it is a working one. Adding tasks to the pool with apply_async is non-blocking, so the program keeps running. By storing all the returned objects, I can iterate over them myself.
Each returned object is an AsyncResult, which behaves much like a promise: its ready() method tells me whether the result is available, and its get() method returns the value. If I see that the value is 0, I can terminate the pool early and return the final result.
A minimal working example demonstrating this is the following:
import time
import multiprocessing as mp
def worker(value):
    print('working')
    time.sleep(3)
    return value
def main():
    pool = mp.Pool(2)  # Only two workers
    results = []
    for n in range(0, 8):
        value = 0 if n == 0 else 1
        results.append(pool.apply_async(worker, (value,)))
    running = True
    while running:
        for result in results:
            if result.ready() and result.get() == 0:
                print("There was a zero returned")
                pool.terminate()
                running = False
                break
        if all(result.ready() for result in results):
            running = False
    pool.close()
    pool.join()
if __name__ == '__main__':
    main()
The expected output would be:
working
working
working
There was a zero returned
Process finished with exit code 0
I created a small pool of 2 processes that call a function which sleeps for 3 seconds and then returns either 1 or 0. Here the first task returns 0, so the program terminates early once that result is available.
If no task triggers termination, the lines:
if all(result.ready() for result in results):
    running = False
end the loop once all tasks are done.
If you would like to know all the results, you can use:
print([result.get() for result in results if result.ready()])

Python Multiprocessing starmap

I am trying to run a function in parallel with multiprocessing's starmap.
data = [(i, board) for i in range(board.width)]
if __name__ == '__main__':
    p = mp.Pool(processes=mp.cpu_count())
    ratings = p.starmap(self.rate, data)
    print("Ratings:", ratings)
My problem is that print is never executed. The Function just returns with None.
self.rate() should return a number.
Github: https://github.com/Builder20/Connect4/tree/develop
Any ideas?
Obviously the function is looping forever. You can add logging to see which part loops; I see only one candidate:
while canSet == -1:
    opponentColumn = random.randint(0, board.width)
    canSet = board.setStone(self.other, opponentColumn)
Add
    logging.info(board)
inside the loop to see how it progresses.
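A sketch of what such logging could look like follows. The Board class is a hypothetical stand-in (the real one lives in the linked repository), and the sketch assumes, as the loop above suggests, that setStone returns -1 when the stone cannot be placed. It also caps the number of attempts so a full board cannot spin forever, and uses random.randrange because random.randint(0, board.width) includes board.width itself, which may be one past the last column.

```python
import logging
import random

logging.basicConfig(level=logging.INFO)


class Board:
    """Hypothetical stand-in for the question's board; every column is full."""
    width = 7

    def setStone(self, player, column):
        return -1  # assumed convention: -1 means the stone could not be placed


def pick_column(board, player, max_attempts=20):
    """Try random columns, logging each attempt; give up after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        column = random.randrange(board.width)  # 0 .. width - 1
        result = board.setStone(player, column)
        logging.info("attempt %d: column %d -> %d", attempt, column, result)
        if result != -1:
            return result
    return None  # no free column found; the caller decides what to do


if __name__ == "__main__":
    print(pick_column(Board(), "X", max_attempts=5))
```

With the stand-in board every attempt fails, so the log shows five attempts and the function returns None instead of hanging.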

Python multiprocess lists of images

I want to use multiprocessing to stack many images. Each stack consists of 5 images, which means I have a list whose sublists each hold the images that should be combined:
img_lst = [[01_A, 01_B, 01_C, 01_D, 01_E], [02_A, 02_B, 02_C, 02_D, 02_E], [03_A, 03_B, 03_C, 03_D, 03_E]]
At the moment I call my function do_stacking(sub_lst) in a loop:
for sub_lst in img_lst:
    # example: do_stacking([01_A, 01_B, 01_C, 01_D, 01_E])
    do_stacking(sub_lst)
I want to speed this up with multiprocessing, but I am not sure how to call the pool.map function:
if __name__ == '__main__':
    from multiprocessing import Pool
    # I store my lists in a file
    f_in = open(stacking_path + "stacks.txt", 'r')
    f_stack = f_in.readlines()
    for data in f_stack:
        data = data.strip()
        data = data.split('\t')
        # data is now my sub_lst
        # Not sure what to do here, set the sublist, f_stack?
    pool = Pool()
    pool.map(do_stacking, ???)
    pool.close()
    pool.join()
Edit:
I have a list of list:
[
[01_A, 01_B, 01_C, 01_D, 01_E],
[02_A, 02_B, 02_C, 02_D, 02_E],
[03_A, 03_B, 03_C, 03_D, 03_E]
]
Each sublist should be passed to a function called do_stacking(sublist). I only want to proceed with the sublist and not with the entire list.
My question is how to handle the loop of the list (for x in img_lst)? Should I create a loop for each Pool?
Pool.map works like the builtin map function. It takes one element at a time from its second argument and passes it to the function given as its first argument.
if __name__ == '__main__':
    from multiprocessing import Pool
    # I store my lists in a file
    f_in = open(stacking_path + "stacks.txt", 'r')
    f_stack = f_in.readlines()
    img_list = []
    for data in f_stack:
        data = data.strip()
        data = data.split('\t')
        # data is now my sub_lst
        img_list.append(data)
    print(img_list) # check if the img_list is right?
    pool = Pool()
    pool.map(do_stacking, img_list)
    pool.close()
    pool.join()
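To make the "works like the builtin map" point concrete, here is a small self-contained sketch. The do_stacking body is a hypothetical placeholder that only joins the names, and the file names are invented; the point is that each sublist reaches the function whole and that pool.map returns results in input order.

```python
from multiprocessing import Pool


def do_stacking(sub_lst):
    """Hypothetical placeholder: report which images would be stacked."""
    return "+".join(sub_lst)


# Invented names standing in for the real image files
img_lst = [["01_A", "01_B"], ["02_A", "02_B"], ["03_A", "03_B"]]

if __name__ == "__main__":
    with Pool(2) as pool:
        parallel = pool.map(do_stacking, img_lst)
    # Same result as the builtin map, only computed by worker processes
    assert parallel == list(map(do_stacking, img_lst))
    print(parallel)  # ['01_A+01_B', '02_A+02_B', '03_A+03_B']
```

No extra loop around the pool is needed: one Pool and one map call cover the whole outer list.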

How to get all pool.apply_async processes to stop once any one process has found a match in python

I have the following code that uses multiprocessing to iterate through a large list and find a match. How can I get all processes to stop once a match is found in any one process? I have seen examples, but none of them seem to fit what I am doing here.
#!/usr/bin/env python3.5
import sys, itertools, multiprocessing, functools
alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ12234567890!@#$%^&*?,()-=+[]/;"
num_parts = 4
part_size = len(alphabet) // num_parts
def do_job(first_bits):
    for x in itertools.product(first_bits, *itertools.repeat(alphabet, num_parts-1)):
        # CHECK FOR MATCH HERE
        print(''.join(x))
        # EXIT ALL PROCESSES IF MATCH FOUND
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    results = []
    for i in range(num_parts):
        if i == num_parts - 1:
            first_bit = alphabet[part_size * i :]
        else:
            first_bit = alphabet[part_size * i : part_size * (i+1)]
        pool.apply_async(do_job, (first_bit,))
    pool.close()
    pool.join()
Thanks for your time.
UPDATE 1:
I have implemented the changes suggested in the great approach by @ShadowRanger and it is nearly working the way I want it to. I have added some logging to give an indication of progress and put a 'test' key in there to match.
I want to be able to increase or decrease iNumberOfProcessors independently of num_parts. At this stage, when I have them both at 4, everything works as expected: 4 processes spin up (plus one extra for the console). When I set iNumberOfProcessors = 6, 6 processes spin up but only four of them show any CPU usage, so it appears 2 are idle. With my previous solution above, I was able to set the number of cores higher without increasing num_parts, and all of the processes would get used.
I am not sure how to refactor this new approach to give me the same functionality. Can you have a look and give me some direction on the refactoring needed to set iNumberOfProcessors and num_parts independently of each other and still have all processes used?
Here is the updated code:
#!/usr/bin/env python3.5
import sys, itertools, multiprocessing, functools
alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ12234567890!@#$%^&*?,()-=+[]/;"
num_parts = 4
part_size = len(alphabet) // num_parts
iProgressInterval = 10000
iNumberOfProcessors = 6
def do_job(first_bits):
    iAttemptNumber = 0
    iLastProgressUpdate = 0
    for x in itertools.product(first_bits, *itertools.repeat(alphabet, num_parts-1)):
        sKey = ''.join(x)
        iAttemptNumber = iAttemptNumber + 1
        if iLastProgressUpdate + iProgressInterval <= iAttemptNumber:
            iLastProgressUpdate = iLastProgressUpdate + iProgressInterval
            print("Attempt#:", iAttemptNumber, "Key:", sKey)
        if sKey == 'test':
            print("KEY FOUND!! Attempt#:", iAttemptNumber, "Key:", sKey)
            return True
def get_part(i):
    if i == num_parts - 1:
        first_bit = alphabet[part_size * i :]
    else:
        first_bit = alphabet[part_size * i : part_size * (i+1)]
    return first_bit
if __name__ == '__main__':
    # with statement with Py3 multiprocessing.Pool terminates when block exits
    with multiprocessing.Pool(processes = iNumberOfProcessors) as pool:
        # Don't need special case for final block; slices can
        for gotmatch in pool.imap_unordered(do_job, map(get_part, range(num_parts))):
            if gotmatch:
                break
        else:
            print("No matches found")
UPDATE 2:
OK, here is my attempt at trying @noxdafox's suggestion. I have put together the following based on the link he provided with his suggestion. Unfortunately, when I run it I get the error:
... line 322, in apply_async
raise ValueError("Pool not running")
ValueError: Pool not running
Can anyone give me some direction on how to get this working?
Basically the issue is that my first attempt did multiprocessing but did not support canceling all processes once a match was found.
My second attempt (based on @ShadowRanger's suggestion) solved that problem, but broke the ability to scale the number of processes and the num_parts size independently, which is something my first attempt could do.
My third attempt (based on @noxdafox's suggestion) throws the error outlined above.
If anyone can give me some direction on how to keep the functionality of my first attempt (being able to scale the number of processes and the num_parts size independently) and add the ability to cancel all processes once a match is found, it would be much appreciated.
Thank you for your time.
Here is the code from my third attempt, based on @noxdafox's suggestion:
#!/usr/bin/env python3.5
import sys, itertools, multiprocessing, functools
alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ12234567890!@#$%^&*?,()-=+[]/;"
num_parts = 4
part_size = len(alphabet) // num_parts
iProgressInterval = 10000
iNumberOfProcessors = 4
def find_match(first_bits):
    iAttemptNumber = 0
    iLastProgressUpdate = 0
    for x in itertools.product(first_bits, *itertools.repeat(alphabet, num_parts-1)):
        sKey = ''.join(x)
        iAttemptNumber = iAttemptNumber + 1
        if iLastProgressUpdate + iProgressInterval <= iAttemptNumber:
            iLastProgressUpdate = iLastProgressUpdate + iProgressInterval
            print("Attempt#:", iAttemptNumber, "Key:", sKey)
        if sKey == 'test':
            print("KEY FOUND!! Attempt#:", iAttemptNumber, "Key:", sKey)
            return True
def get_part(i):
    if i == num_parts - 1:
        first_bit = alphabet[part_size * i :]
    else:
        first_bit = alphabet[part_size * i : part_size * (i+1)]
    return first_bit
def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)
class Worker():
    def __init__(self, workers):
        self.workers = workers
    def callback(self, result):
        if result:
            self.pool.terminate()
    def do_job(self):
        print(self.workers)
        pool = multiprocessing.Pool(processes=self.workers)
        for part in grouper(alphabet, part_size):
            pool.apply_async(do_job, (part,), callback=self.callback)
        pool.close()
        pool.join()
        print("All Jobs Queued")
if __name__ == '__main__':
    w = Worker(4)
    w.do_job()
You can check this question to see an implementation example solving your problem.
This also works with a concurrent.futures pool.
Just replace the map method with apply_async and iterate over your list from the caller.
Something like this:
for part in grouper(alphabet, part_size):
    pool.apply_async(do_job, (part,), callback=self.callback)
grouper is the recipe of that name from the itertools documentation.
multiprocessing isn't really designed to cancel tasks, but you can simulate it for your particular case by using pool.imap_unordered and terminating the pool when you get a hit:
def do_job(first_bits):
    for x in itertools.product(first_bits, *itertools.repeat(alphabet, num_parts-1)):
        # CHECK FOR MATCH HERE
        print(''.join(x))
        if match:
            return True
    # If we exit loop without a match, function implicitly returns falsy None for us
# Factor out part getting to simplify imap_unordered use
def get_part(i):
    if i == num_parts - 1:
        first_bit = alphabet[part_size * i :]
    else:
        first_bit = alphabet[part_size * i : part_size * (i+1)]
    return first_bit
if __name__ == '__main__':
    # with statement with Py3 multiprocessing.Pool terminates when block exits
    with multiprocessing.Pool(processes=4) as pool:
        # Don't need special case for final block; slices can
        for gotmatch in pool.imap_unordered(do_job, map(get_part, range(num_parts))):
            if gotmatch:
                break
        else:
            print("No matches found")
This will run do_job for each part, returning results as fast as it can get them. When a worker returns True, the loop breaks, and the with statement for the Pool is exited, terminate-ing the Pool (dropping all work in progress).
Note that while this works, it's kind of abusing multiprocessing; it won't handle canceling individual tasks without terminating the whole Pool. If you need more fine grained task cancellation, you'll want to look at concurrent.futures, but even there, it can only cancel undispatched tasks; once they're running, they can't be cancelled without terminating the Executor or using a side-band means of termination (having the task poll some interprocess object intermittently to determine if it should continue running).

Python: Question about multiprocessing / multithreading and shared resources

Here's the simplest multi threading example I found so far:
import multiprocessing
import subprocess
def calculate(value):
    return value * 10
if __name__ == '__main__':
    pool = multiprocessing.Pool(None)
    tasks = range(10000)
    results = []
    r = pool.map_async(calculate, tasks, callback=results.append)
    r.wait() # Wait on the results
    print(results)
I have two lists and one index to access the elements in each list. The ith position on the first list is related to the ith position on the second. I didn't use a dict because the lists are ordered.
What I was doing was something like:
for i in range(len(first_list)):
    # do something with first_list[i] and second_list[i]
So, using that example, I think I can make a function sort of like this:
#global variables first_list, second_list, i
first_list, second_list, i = None, None, 0
#initialize the lists
...
#have a function to do what the loop did and inside it increment i
def function():
    #do stuff
    i += 1
But, that makes i a shared resource and I'm not sure if that'd be safe. It also seems to me my design is not lending itself well to this multithreaded approach, but I'm not sure how to fix it.
Here's a working example of what I wanted (edit in the URL of an image you want to use):
import multiprocessing
import subprocess, shlex
links = ['http://www.example.com/image.jpg']*10 # don't use this URL
names = [str(i) + '.jpg' for i in range(10)]
def download(i):
    command = 'wget -O ' + names[i] + ' ' + links[i]
    print(command)
    args = shlex.split(command)
    return subprocess.call(args, shell=False)
if __name__ == '__main__':
    pool = multiprocessing.Pool(None)
    tasks = range(10)
    r = pool.map_async(download, tasks)
    r.wait() # Wait on the results
First off, it might be beneficial to make one list of tuples, for example
new_list[i] = (first_list[i], second_list[i])
That way, as you change i, you ensure that you are always operating on the same items from first_list and second_list.
Secondly, assuming there are no relations between the i and i-1 entries in your lists, you can write your function to operate on one given i value and let the pool hand each i value to a worker. Consider
indices = range(len(new_list))
results = []
r = pool.map_async(your_function, indices, callback=results.append)
r.wait() # Wait on the results
This should give you what you want.
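Taking the tuple idea one step further, the pairs themselves can be passed to the pool, so the function never touches a shared index at all. This is a sketch with invented list contents; process_pair is a hypothetical stand-in for whatever the loop body did.

```python
from multiprocessing import Pool


def process_pair(pair):
    """Hypothetical work on one (first, second) pair; no shared index needed."""
    first, second = pair
    return "%s -> %s" % (first, second)


# Invented data standing in for first_list and second_list
first_list = ["0.jpg", "1.jpg", "2.jpg"]
second_list = ["http://www.example.com/a", "http://www.example.com/b",
               "http://www.example.com/c"]
pairs = list(zip(first_list, second_list))  # the i-th items travel together

if __name__ == "__main__":
    with Pool() as pool:
        results = pool.map(process_pair, pairs)
    print(results)
```

Each worker receives its own pair as an argument, so there is no counter to increment and nothing to synchronise.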
