I am trying to use concurrent.futures to run a function across multiple threads and speed up the code.
I have read the documentation and this guide, but I believe I may not be doing this correctly. This MRE should allow us to test a number of different string lengths and list sizes to compare performance:
import pandas as pd, tqdm, string, random
from thefuzz import fuzz, process
from concurrent.futures import ThreadPoolExecutor

def generate_string(items=10, lengths=5):
    # generate `items` distinct random strings, each of length `lengths`
    return [''.join(random.choice(string.ascii_letters) for i in range(lengths)) for _ in range(items)]

def matching(a, b):
    matches = {}
    scorers = {'token_sort_ratio': fuzz.token_sort_ratio, 'token_set_ratio': fuzz.token_set_ratio,
               'partial_token_sort_ratio': fuzz.partial_token_sort_ratio,
               'Quick': fuzz.QRatio, 'Unicode Quick': fuzz.UQRatio,
               'Weighted': fuzz.WRatio, 'Unweighted': fuzz.UWRatio}
    for x in tqdm.tqdm(a):
        best = 0
        for _, scorer in scorers.items():
            res = process.extractOne(x, b, scorer=scorer)
            if res[1] > best:
                best = res[1]
                matches[x] = res
    return matches

list_a = generate_string(100, 10)
list_b = generate_string(10, 5)

with ThreadPoolExecutor(max_workers=5) as executor:
    future = executor.submit(matching, list_a, list_b)
This code runs with no error; how can I use multiple workers to execute these loops in parallel so that the code will run faster?
Thanks to a hint from @Anentropic, I was able to make the following change with multiprocessing:
from multiprocessing import Pool
import os

if __name__ == '__main__':
    list_a = generate_string(500, 10)
    list_b = generate_string(500, 10)
    pool = Pool(os.cpu_count() - 2)
    # starmap unpacks each (a, b) pair into matching(a, b)
    res = pool.starmap(matching, zip(list_a, list_b))
    norm_res = matching(list_a, list_b)  # single-process baseline for comparison
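Since matching processes each element of list_a independently, another option (a minimal sketch, not the exact change above; the chunks helper is illustrative and not from the original post) is to hand each worker a slice of list_a together with the full list_b and merge the partial results:

from multiprocessing import Pool
import os

def chunks(seq, n):
    # split seq into n roughly equal slices
    k = max(1, len(seq) // n)
    for i in range(0, len(seq), k):
        yield seq[i:i + k]

if __name__ == '__main__':
    list_a = generate_string(500, 10)
    list_b = generate_string(500, 10)
    workers = max(1, os.cpu_count() - 2)
    with Pool(workers) as pool:
        partials = pool.starmap(matching, [(chunk, list_b) for chunk in chunks(list_a, workers)])
    # merge the per-chunk dictionaries back into one result
    matches = {k: v for part in partials for k, v in part.items()}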
I want to distribute the calculation over a list (map). Given this example:
from multiprocessing import Pool

def fun(elem):
    lis = []
    for i in range(0, elem):
        lis.append(i)
    return lis

p = Pool(3)
p.map(fun, [2, 3, 4])
How can I make something like this work?
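A minimal sketch of how that example is usually made to work (the __main__ guard and the print are added here for illustration; they are not part of the snippet above):

from multiprocessing import Pool

def fun(elem):
    # build the list [0, 1, ..., elem-1] for each input element
    lis = []
    for i in range(0, elem):
        lis.append(i)
    return lis

if __name__ == '__main__':
    with Pool(3) as p:
        results = p.map(fun, [2, 3, 4])
    print(results)   # [[0, 1], [0, 1, 2], [0, 1, 2, 3]]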
I have a 2-dimensional array whose rows produce a huge (>300 GB) list of pairwise combinations, so I'd like to iterate lazily over the iterator produced by itertools.combinations and parallelize this operation. The problem is that I need to filter the output, and this isn't supported by multiprocessing. My existing workaround requires loading the combinations into memory, which also doesn't work because of the size of the list.
import itertools
import numpy as np
import scipy.stats
from multiprocessing import Pool

n_nodes = np.random.randn(10, 100)
cutoff = 0.3

def node_combinations(nodes):
    return itertools.combinations(list(range(len(nodes))), 2)

def pfilter(func, candidates):
    # materialize the candidates so they can be zipped with the mapped results;
    # this is the step that loads the whole combinations list into memory
    candidates = list(candidates)
    return np.asarray([c for c, keep in zip(candidates, pool.map(func, candidates)) if keep])

def pearsonr(xy: tuple):
    correlation_coefficient = scipy.stats.pearsonr(n_nodes[xy[0]], n_nodes[xy[1]])[0]
    return correlation_coefficient >= cutoff

pool = Pool()
edgelist = pfilter(pearsonr, node_combinations(n_nodes))
I'm looking for a way to do lazy evaluation of a large iterator using multiprocessing with filter instead of map.
The following uses a Semaphore to slow down the over-eager pool feeder thread. It is not a proper solution, since it doesn't fix other issues; for example, when nested loops share the same pool and loop over the result of imap, all of the outer loop's jobs finish before any of the inner loop's jobs even get to start. But it does limit the memory usage:
import threading

def slowdown(n=16):
    # allow at most n items to be in flight between inner() and outer()
    s = threading.Semaphore(n)

    def inner(it):
        for item in it:
            s.acquire()
            yield item

    def outer(it):
        for item in it:
            s.release()
            yield item

    return outer, inner
This is used to wrap pool.imap as such:
outer, inner = slowdown()
outer(pool.imap(func, inner(candidates)))
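Putting this together as a lazy parallel filter (a sketch built on the question's pearsonr and node_combinations; keep_if_correlated is an illustrative helper, not from the original post):

from multiprocessing import Pool

def keep_if_correlated(xy):
    # the worker returns the index pair itself when it passes the cutoff, else None
    return xy if pearsonr(xy) else None

if __name__ == '__main__':
    outer, inner = slowdown()
    with Pool() as pool:
        candidates = node_combinations(n_nodes)
        # inner() throttles how fast imap's feeder thread pulls candidates;
        # outer() frees one slot each time a result is consumed here
        edges = [xy for xy in outer(pool.imap(keep_if_correlated, inner(candidates)))
                 if xy is not None]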
Hoxha's suggestion works fine -- thanks!
@Dan the issue is that even empty lists take up memory, which at 42 billion pairings is nearly 3 TB in memory.
Here's my implementation:
import more_itertools
import itertools
import multiprocessing as mp
import numpy as np
import scipy.stats
from tqdm import tqdm

n_nodes = np.random.randn(10, 100)
# n choose 2 pairs: (n**2 - n) // 2
num_combinations = (n_nodes.shape[0] ** 2 - n_nodes.shape[0]) // 2
cpu_count = 8
cutoff = 0.3

def node_combinations(nodes):
    return itertools.combinations(list(range(len(nodes))), 2)

def edge_gen(xy_iterator):
    edges = []
    for cand in tqdm(xy_iterator, total=num_combinations // cpu_count):
        if pearsonr(cand):
            edges.append(cand)
    return edges

def pearsonr(xy: tuple):
    correlation_coefficient = scipy.stats.pearsonr(n_nodes[xy[0]], n_nodes[xy[1]])[0]
    return correlation_coefficient >= cutoff

slices = more_itertools.distribute(cpu_count, node_combinations(n_nodes))
pool = mp.Pool(cpu_count)
results = list(pool.imap(edge_gen, slices))
pool.close()
pool.join()
I am using a function that takes too much time to finish, since it takes a large input and uses two nested for loops.
The code of the function:
def transform(self, X):
    global brands
    result = []
    for x in X:
        index = 0
        count = 0
        for brand in brands:
            # count case-insensitive occurrences of the brand name in x
            all_matches = re.findall(re.escape(brand), x, flags=re.I)
            count_all_match = len(all_matches)
            if count_all_match > count:
                count = count_all_match
                index = brands.index(brand)
        result.append([index])
    return np.array(result)
How can I change this function so that it uses multiprocessing and reduces the running time?
I don't see self being used in the transform method, so I made it a plain function.
import re
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def transformer(x):
    global brands
    index = 0
    count = 0
    for brand in brands:
        all_matches = re.findall(re.escape(brand), x, flags=re.I)
        count_all_match = len(all_matches)
        if count_all_match > count:
            count = count_all_match
            index = brands.index(brand)
    return [index]

def transform(X):
    with ProcessPoolExecutor() as executor:
        result = executor.map(transformer, X)
    return np.array(list(result))
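A hypothetical usage example (the brands list and inputs are made up for illustration; brands is defined at module level so the worker processes can see it):

brands = ['acme', 'globex', 'initech']   # hypothetical brand list

if __name__ == '__main__':
    X = ['I bought an Acme toaster', 'Globex globex rules', 'no brand mentioned']
    print(transform(X))   # array of best-matching brand indices, shape (3, 1)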
I am having difficulty understanding how to use Python's multiprocessing module.
I have a sum from 1 to n, where n = 10^10. That is too large to fit into a list, and building a list seems to be the thrust of many of the multiprocessing examples online.
Is there a way to "split up" the range into segments of a certain size and then perform the sum for each segment?
For instance
def sum_nums(low, high):
    result = 0
    for i in range(low, high+1):
        result += i
    return result
And I want to compute sum_nums(1, 10**10) by breaking it up into many sum_nums(1,1000) + sum_nums(1001,2000) + sum_nums(2001,3000)... and so on. I know there is a closed-form n(n+1)/2, but pretend we don't know that.
Here is what I've tried
import multiprocessing

def sum_nums(low, high):
    result = 0
    for i in range(low, high+1):
        result += i
    return result

if __name__ == "__main__":
    n = 1000
    procs = 2
    sizeSegment = n/procs

    jobs = []
    for i in range(0, procs):
        process = multiprocessing.Process(target=sum_nums, args=(i*sizeSegment+1, (i+1)*sizeSegment))
        jobs.append(process)

    for j in jobs:
        j.start()
    for j in jobs:
        j.join()

    #where is the result?
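The Process objects above discard the return value of sum_nums, which is why there is nothing to collect at the end. One way to get results back from bare Process workers (a sketch, not part of the original question; the worker wrapper is illustrative) is to push them onto a multiprocessing.Queue:

import multiprocessing

def sum_nums(low, high):
    return sum(range(low, high + 1))

def worker(low, high, queue):
    # each process puts its partial sum on the shared queue
    queue.put(sum_nums(low, high))

if __name__ == "__main__":
    n, procs = 1000, 2
    size = n // procs
    queue = multiprocessing.Queue()
    jobs = [multiprocessing.Process(target=worker, args=(i*size + 1, (i+1)*size, queue))
            for i in range(procs)]
    for j in jobs:
        j.start()
    partials = [queue.get() for _ in jobs]   # collect one result per process
    for j in jobs:
        j.join()
    print(sum(partials))                     # 500500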
I find the use of multiprocessing.Pool and map() much simpler.
Using your code:
from multiprocessing import Pool

def sum_nums(args):
    low = int(args[0])
    high = int(args[1])
    return sum(range(low, high+1))

if __name__ == "__main__":
    n = 1000
    procs = 2
    sizeSegment = n/procs

    # Create the list of (low, high) segments
    jobs = []
    for i in range(0, procs):
        jobs.append((i*sizeSegment+1, (i+1)*sizeSegment))

    pool = Pool(procs).map(sum_nums, jobs)
    result = sum(pool)
    print result   # 500500
You can do this sum without multiprocessing at all, and it's probably simpler, if not faster, to just use generators.
# prepare a generator of generators each at 1000 point intervals
>>> xr = (xrange(1000*i+1,i*1000+1001) for i in xrange(10000000))
>>> list(xr)[:3]
[xrange(1, 1001), xrange(1001, 2001), xrange(2001, 3001)]
# sum, using two map functions
>>> xr = (xrange(1000*i+1,i*1000+1001) for i in xrange(10000000))
>>> sum(map(sum, map(lambda x:x, xr)))
50000000005000000000L
However, if you want to use multiprocessing, you can also do this too. I'm using a fork of multiprocessing that is better at serialization (but otherwise, not really different).
>>> xr = (xrange(1000*i+1,i*1000+1001) for i in xrange(10000000))
>>> import pathos
>>> mmap = pathos.multiprocessing.ProcessingPool().map
>>> tmap = pathos.multiprocessing.ThreadingPool().map
>>> sum(tmap(sum, mmap(lambda x:x, xr)))
50000000005000000000L
The version w/o multiprocessing is faster and takes about a minute on my laptop. The multiprocessing version takes a few minutes due to the overhead of spawning multiple python processes.
If you are interested, get pathos here: https://github.com/uqfoundation
First, the best way to get around the memory issue is to use an iterator/generator instead of a list:
def sum_nums(low, high):
    result = 0
    for i in xrange(low, high+1):
        result += i
    return result
(In Python 3, range() is already lazy, so this is only needed in Python 2.)
Now, multiprocessing comes in when you want to split the processing across different processes or CPU cores. If you don't need to control the individual workers, the easiest method is to use a process pool. This lets you map a function over the pool and get the output. Alternatively, you can use apply_async to submit jobs to the pool one at a time and get a delayed result, which you can retrieve with .get():
import multiprocessing
from multiprocessing import Pool
from time import time

def sum_nums(low, high):
    result = 0
    for i in xrange(low, high+1):
        result += i
    return result

# map requires a function that handles a single argument
def sn((low, high)):
    return sum_nums(low, high)

if __name__ == '__main__':
    #t = time()
    # takes forever
    #print sum_nums(1,10**10)
    #print '{} s'.format(time() -t)

    p = Pool(4)
    n = int(1e8)
    r = range(0, 10**10+1, n)
    results = []

    # using apply_async
    t = time()
    for arg in zip([x+1 for x in r], r[1:]):
        results.append(p.apply_async(sum_nums, arg))
    # wait for results
    print sum(res.get() for res in results)
    print '{} s'.format(time() - t)

    # using process pool
    t = time()
    print sum(p.map(sn, zip([x+1 for x in r], r[1:])))
    print '{} s'.format(time() - t)
On my machine, just calling sum_nums with 10**10 takes almost 9 minutes, but using a Pool(8) and n=int(1e8) reduces this to just over a minute.
I have a simulation that is currently running, but the ETA is about 40 hours -- I'm trying to speed it up with multi-processing.
It iterates over 3 values of one variable (L) and over 99 values of a second variable (a). Using these values, it runs a complex simulation and returns 9 different standard deviations. Thus (even though I haven't coded it that way yet) it is essentially a function that takes two values as inputs (L, a) and returns 9 values.
Here is the essence of the code I have:
STD_1 = []
STD_2 = []
# etc.

for L in range(0, 6, 2):
    for a in range(1, 100):
        ### simulation code ###
        STD_1.append(value_1)
        STD_2.append(value_2)
        # etc.
Here is what I can modify it to:
master_list = []

def simulate(a, L):
    ### simulation code ###
    return (a, L, STD_1, STD_2)  # etc.

for L in range(0, 6, 2):
    for a in range(1, 100):
        master_list.append(simulate(a, L))
Since each of the simulations is independent, this seems like an ideal place to use some sort of multi-threading/processing.
How exactly would I go about coding this?
EDIT: Also, will everything be returned to the master list in order, or could it possibly be out of order if multiple processes are working?
EDIT 2: This is my code -- but it doesn't run correctly. It asks if I want to kill the program right after I run it.
import multiprocessing

data = []
for L in range(0, 6, 2):
    for a in range(1, 100):
        data.append((L, a))
print(data)

def simulation(arg):
    # unpack the tuple
    a = arg[1]
    L = arg[0]
    STD_1 = a**2
    STD_2 = a**3
    STD_3 = a**4
    # simulation code #
    return (STD_1, STD_2, STD_3)

print("1")
p = multiprocessing.Pool()
print("2")
results = p.map(simulation, data)
EDIT 3: Also, what are the limitations of multiprocessing? I've heard that it doesn't work on OS X. Is this correct?
Wrap the data for each iteration up into a tuple.
Make a list data of those tuples
Write a function f to process one tuple and return one result
Create p = multiprocessing.Pool() object.
Call results = p.map(f, data)
This will run as many instances of f as your machine has cores in separate processes.
Edit1: Example:
from multiprocessing import Pool
data = [('bla', 1, 3, 7), ('spam', 12, 4, 8), ('eggs', 17, 1, 3)]
def f(t):
    name, a, b, c = t
    return (name, a + b + c)
p = Pool()
results = p.map(f, data)
print results
Edit2:
Multiprocessing should work fine on UNIX-like platforms such as OSX. Only platforms that lack os.fork (mainly MS Windows) need special attention. But even there it still works. See the multiprocessing documentation.
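For completeness, on platforms that spawn workers instead of forking (MS Windows, and macOS by default in recent Python versions), the pool creation and the map call should sit under an import guard; here is the same example sketched with that guard:

from multiprocessing import Pool

data = [('bla', 1, 3, 7), ('spam', 12, 4, 8), ('eggs', 17, 1, 3)]

def f(t):
    name, a, b, c = t
    return (name, a + b + c)

if __name__ == '__main__':
    # the guard stops child processes from re-running the pool setup on import
    with Pool() as p:
        print(p.map(f, data))   # [('bla', 11), ('spam', 24), ('eggs', 21)]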
Here is one way to run it in parallel threads:
import threading

L_a = []
for L in range(0, 6, 2):
    for a in range(1, 100):
        L_a.append((L, a))
        # Add the rest of your objects here

def RunParallelThreads():
    # Create an index list
    indexes = range(0, len(L_a))
    # Create the output list
    output = [None for i in indexes]
    # Create all the parallel threads
    threads = [threading.Thread(target=simulate, args=(output, i)) for i in indexes]
    # Start all the parallel threads
    for thread in threads: thread.start()
    # Wait for all the parallel threads to complete
    for thread in threads: thread.join()
    # Return the output list
    return output

def simulate(list, index):
    (L, a) = L_a[index]
    list[index] = (a, L)  # Add the rest of your objects here

master_list = RunParallelThreads()
Use Pool().imap_unordered if ordering is not important. It will return results in a non-blocking fashion.
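A minimal sketch of that approach using the data list from the question (returning the inputs alongside the results so out-of-order results can still be matched up):

import multiprocessing

def simulation(arg):
    L, a = arg
    # return the inputs along with the results so order doesn't matter
    return (L, a, a**2, a**3, a**4)

if __name__ == '__main__':
    data = [(L, a) for L in range(0, 6, 2) for a in range(1, 100)]
    with multiprocessing.Pool() as p:
        for L, a, std_1, std_2, std_3 in p.imap_unordered(simulation, data):
            # results arrive as soon as each worker finishes, not in input order
            print(L, a, std_1, std_2, std_3)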