I have a python script which runs a method in parallel.
from multiprocessing import Pool

parsers = {
    'parser1': parser1.process,
    'parser2': parser2.process
}

def process_items((key, value)):  # Python 2 tuple-parameter syntax
    parsers[key](value)

pool = Pool(4)
pool.map(process_items, items)
process_items is my method and items is a list of tuples, with two elements per tuple. The items list has around 100k entries.
process_items then calls a different parser method depending on the parameters given. My problem is that roughly 70% of the list can run with high parallelism, but the other 30% can only run with one or two processes, otherwise it causes a failure that is outside of my control.
So in my code I have around 10 different parser processes. For parsers 1-8 I want to run with Pool(4), but for parsers 9-10 only Pool(2).
What is the best way to optimise this?
I think your best option is to use two pools here:
from multiprocessing import Pool

# import parsers here

parsers = {
    'parser1': parser1.process,
    'parser2': parser2.process,
    'parser3': parser3.process,
    'parser4': parser4.process,
    'parser5': parser5.process,
    'parser6': parser6.process,
    'parser7': parser7.process,
}

# Sets that define which items can use high parallelism,
# and which must use low
high_par = {"parser1", "parser3", "parser4", "parser6", "parser7"}
low_par = {"parser2", "parser5"}

def process_items(item):
    key, value = item  # Pool.map passes each (key, value) tuple as a single argument
    parsers[key](value)

def run_pool(func, items, num_procs, check_set):
    pool = Pool(num_procs)
    out = pool.map(func, (item for item in items if item[0] in check_set))
    pool.close()
    pool.join()
    return out

if __name__ == "__main__":
    items = [('parser2', x), ...]  # Your list of tuples

    # Process with high parallelism
    high_results = run_pool(process_items, items, 4, high_par)
    # Process with low parallelism
    low_results = run_pool(process_items, items, 2, low_par)
Trying to do this in one Pool is possible, through clever use of synchronization primitives, but I don't think it would end up looking much cleaner than this. It could also end up running less efficiently, since the pool would sometimes need to wait for work to finish before it can process a low-parallelism item, even when high-parallelism items are available behind it in the queue.
This would get a bit more complicated if you needed the results of each process_items call in the same order as the original iterable, since the results from the two pools would have to be merged, but based on your example I don't think that's a requirement. Let me know if it is, and I'll try to adjust my answer accordingly.
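If you do want to try the single-pool route, here is a rough sketch of what I mean by synchronization primitives (untested against your parsers; it assumes the same parsers, high_par and low_par definitions as above): a Semaphore shared through the pool initializer caps how many workers may run a low-parallelism parser at any one moment.

from multiprocessing import Pool, Semaphore

low_par_slots = None  # set inside each worker by the initializer

def init_worker(semaphore):
    global low_par_slots
    low_par_slots = semaphore

def process_items(item):
    key, value = item
    if key in low_par:
        # at most two workers may be inside this block at once
        with low_par_slots:
            return parsers[key](value)
    return parsers[key](value)

if __name__ == "__main__":
    items = [...]  # your list of (parser_name, value) tuples
    sem = Semaphore(2)
    pool = Pool(4, initializer=init_worker, initargs=(sem,))
    results = pool.map(process_items, items)
    pool.close()
    pool.join()

The trade-off is exactly what is described above: a worker that picks up a low-parallelism item may sit blocked on the semaphore while high-parallelism items wait behind it in the queue.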
You can specify the number of worker processes in the constructor for multiprocessing.Pool:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(5)  # 5 is the number of worker processes
    print(pool.map(f, [1, 2, 3]))
Related
I am learning multiprocessing in Python and thinking about a problem. For a shared list (nums = mp.Manager().list()), is there any way to automatically split the list across all the processes so that they do not compute the same numbers in parallel?
Current code:
import time
import multiprocessing as mp

# multiple processes
nums = mp.Manager().list(range(10000))
results = mp.Queue()

def get_square(list_of_num, results_sharedlist):
    # simple get square
    results_sharedlist.put(list(map(lambda x: x**2, list_of_num)))

start = time.time()
process1 = mp.Process(target=get_square, args=(nums, results))
process2 = mp.Process(target=get_square, args=(nums, results))
process1.start()
process2.start()
process1.join()
process2.join()
print(time.time() - start)

for i in range(results.qsize()):
    print(results.get())
Current Behaviour
It computes the squares of the same list twice.
What I want
I want process 1 and process 2 to compute the squares of the nums list only once, in parallel, without me defining the split manually.
You can make the function decide which data it should operate on. In your scenario, you want the function to divide the square-calculation work on its own, based on how many processes are running in parallel.
To do so, you need to let the function know which process it is running in and how many other processes are running alongside it, so that it only works on its own slice of the data. You can just pass two more parameters to the function that carry this information: current_process and total_process.
If you have a list whose length is divisible by 2 and you want to calculate the squares using two processes, the function would look something like this:
def get_square(list_of_num, results_sharedlist, current_process, total_process):
    total_length = len(list_of_num)
    start = (total_length // total_process) * (current_process - 1)
    end = (total_length // total_process) * current_process
    results_sharedlist.put(list(map(lambda x: x**2, list_of_num[start:end])))

TOTAL_PROCESSES = 2
process1 = mp.Process(target=get_square, args=(nums, results, 1, TOTAL_PROCESSES))
process2 = mp.Process(target=get_square, args=(nums, results, 2, TOTAL_PROCESSES))
The assumption I have made here is that the length of the list you are working on is a multiple of the number of processes you allocate. If it is not, the current logic will leave some numbers unprocessed (one way to handle that case is sketched below).
Hope this answers your question!
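If you do need to handle lists whose length is not a multiple of the process count, one possible tweak (my own variation, not part of the answer above) is to let the last process also take the leftover items:

def get_square(list_of_num, results_sharedlist, current_process, total_process):
    chunk = len(list_of_num) // total_process
    start = chunk * (current_process - 1)
    # the last process runs to the end of the list, picking up any remainder
    if current_process == total_process:
        end = len(list_of_num)
    else:
        end = chunk * current_process
    results_sharedlist.put(list(map(lambda x: x**2, list_of_num[start:end])))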
I agree with Jake's answer here, but as a bonus:
if you are using a multiprocessing.Pool(), it keeps an internal counter of the worker processes it spawns, so you can avoid passing an extra parameter to identify the current process by reading _identity from multiprocessing's current_process(), like this:
from multiprocessing import current_process, Pool

# inside a function executed by a Pool worker:
p = current_process()
print('process counter:', p._identity[0])
more info from this answer.
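For completeness, here is a small self-contained sketch of that idea (my own example, not from the linked answer); note that _identity is an internal attribute, so it may change between Python versions:

from multiprocessing import Pool, current_process

def square(x):
    worker_index = current_process()._identity[0]  # 1-based counter assigned by the Pool
    return (worker_index, x * x)

if __name__ == '__main__':
    pool = Pool(2)
    print(pool.map(square, range(8)))
    pool.close()
    pool.join()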
I am fairly new to Python, so kindly excuse me if any information is missing. As part of the curriculum I was introduced to Python for quants/finance, and I am studying multiprocessing and trying to understand it better. I tried modifying the given problem and now I am mentally stuck.
Problem:
I have a function which gives me ticks in OHLC format:
{'scrip_name':'ABC','timestamp':1504836192,'open':301.05,'high':303.80,'low':299.00,'close':301.10,'volume':100000}
every minute. I wish to do the following calculations concurrently and preferably append/insert the results into the same list:
Find the Moving Average of the last 5 close values
Find the Median of the last 5 open values
Save the tick data to a database.
so the expected record would look something like:
{'scrip_name':'ABC','timestamp':1504836192,'open':301.05,'high':303.80,'low':299.00,'close':301.10,'volume':100000,'MA_5_close':300.25,'Median_5_open':300.50}
Assuming that the data is going to a db, it's fairly easy to write a simple insert routine for the database; I don't see that as a great challenge, since I can spawn a process to execute an insert statement every minute.
How do I sync three different functions/processes (one to insert into the db, one to calculate the average, one to calculate the median), while holding the last 5 ticks in memory to compute the 5-period simple Moving Average and the median, and push them back into the dict/list?
This is what challenges me in writing the multiprocessing routine; can someone guide me? I don't want to use a pandas DataFrame.
==== REVISION/UPDATE ====
The reason I don't want any solution based on pandas/numpy is that my objective is to understand the basics, not the nuances of a new library. Please don't mistake my need for understanding for arrogance or an unwillingness to be open to suggestions.
The advantage of having
p1 = Process(target=Median, args=(sourcelist,))
p2 = Process(target=Average, args=(sourcelist,))
p3 = Process(target=insertdb, args=(updatedlist,))
is that it would help me understand how to scale processes based on the number of functions/algo components. But how should I make sure p1 & p2 stay in sync, while p3 executes only after p1 & p2 have finished?
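For instance, is something along these lines (a rough illustrative sketch with made-up values, ignoring the rolling 5-tick buffering) the right way to think about it, where p3 only starts once p1 and p2 have been joined?

from multiprocessing import Process, Queue

def Median(window, out):
    values = sorted(tick['open'] for tick in window)
    out.put(('Median_5_open', values[len(values) // 2]))

def Average(window, out):
    values = [tick['close'] for tick in window]
    out.put(('MA_5_close', sum(values) / len(values)))

def insertdb(record):
    print('would insert:', record)  # placeholder for the real db insert

if __name__ == '__main__':
    window = [{'open': 300 + i, 'close': 301 + i} for i in range(5)]  # last 5 ticks
    out = Queue()
    p1 = Process(target=Median, args=(window, out))
    p2 = Process(target=Average, args=(window, out))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    record = dict(window[-1])
    record.update(dict([out.get(), out.get()]))
    p3 = Process(target=insertdb, args=(record,))
    p3.start()
    p3.join()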
Here is an example of how to use multiprocessing:
from multiprocessing import Pool, cpu_count
from functools import partial
from statistics import median

def db_func(ma, med):
    # placeholder: save the computed values to your database here
    db.save(something)

def backtest_strat(d, db_func):
    a = d.get('avg')
    s = sum(a)
    db_func(s / len(a), median(a))

if __name__ == '__main__':
    with Pool(cpu_count()) as p:
        bs = partial(backtest_strat, db_func=db_func)
        print(p.map(bs, [{'avg': [1, 2, 3, 4, 5], 'median': [1, 2, 3, 4, 5]}]))
Also see: https://stackoverflow.com/a/24101655/2026508
Note that this will not speed anything up unless there are a lot of slices.
So for the speed-up part:
def get_slices(data):
    for window in data:
        # build one dict per slice of the incoming data
        yield {'avg': [1, 2, 3, 4, 5], 'median': [1, 2, 3, 4, 5]}

p.map(bs, get_slices(data))
From what I understand, multiprocessing works by message passing via pickles, so when pool.map is called it has access to all three things: the two arrays and the db_func function. There are of course other ways to go about it, but hopefully this shows one of them.
Question: how should I make sure p1 & p2 are in sync, while p3 executes only after p1 & p2?
If you synchronize all processes, computing one task (p1, p2, p3) can't be faster than the slowest of those processes.
In the meantime, the other processes sit idle.
This is the classic "Producer - Consumer Problem".
A solution using a Queue serializes all the data, so no explicit synchronization is required.
# Process-1
def Producer():
    task_queue.put(data)

# Process-2
def Consumer(task_queue):
    data = task_queue.get()
    # process data
You want multiple Consumer processes and one process that gathers all the results.
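For reference, a minimal runnable version of that pattern might look like this (my own illustrative sketch, not code from the question): one producer feeds a Queue, two consumers process items, and a sentinel value tells each consumer to stop.

import multiprocessing as mp

def consumer(task_queue, result_queue):
    while True:
        data = task_queue.get()
        if data is None:            # sentinel: no more work
            break
        result_queue.put(data * data)

if __name__ == '__main__':
    task_queue = mp.Queue()
    result_queue = mp.Queue()
    workers = [mp.Process(target=consumer, args=(task_queue, result_queue)) for _ in range(2)]
    for w in workers:
        w.start()

    for item in range(10):          # the producer side
        task_queue.put(item)
    for _ in workers:               # one sentinel per consumer
        task_queue.put(None)

    results = [result_queue.get() for _ in range(10)]
    for w in workers:
        w.join()
    print(sorted(results))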
But you said you don't want to use a Queue; you want synchronization primitives.
The example below lets all processes run independently.
Only the Result process waits until it is notified.
The example uses an unbounded task buffer, tasks = mp.Manager().list().
Its size could be kept small if list entries for finished tasks are reused.
If you have some very fast algorithms, combine several of them into one process.
import multiprocessing as mp
from random import randrange

# Base class for all WORKERS
# (this sketch relies on the fork start method, so the class-level
#  shared objects are inherited by the worker processes)
class Worker(mp.Process):
    tasks = mp.Manager().list()
    task_ready = mp.Condition()
    lock = mp.Lock()
    parties = mp.Manager().Value(int, 0)

    def get_task(self):
        # Find the first task this worker has not produced a result for yet
        for i, task in enumerate(Worker.tasks):
            if task is None:
                continue
            if self.__class__.__name__ not in task['result']:
                return (i, task['range'])
        return (None, None)

    # Main process loop
    def run(self):
        while True:
            # Get a task for this WORKER
            idx, _range = self.get_task()
            if idx is None:
                break
            # Compute this worker's result for this _range
            result = self.compute(_range)
            # Update Worker.tasks
            with Worker.lock:
                task = Worker.tasks[idx]
                task['result'][self.__class__.__name__] = result
                parties = len(task['result'])
                Worker.tasks[idx] = task
            # If this was the last worker for the task, notify the Result process
            if parties == Worker.parties.value:
                with Worker.task_ready:
                    Worker.task_ready.notify()

class Result(Worker):
    # Main process loop
    def run(self):
        while True:
            with Worker.task_ready:
                # Wake on notification, or periodically in case one was missed
                Worker.task_ready.wait(timeout=0.5)
            # Get (idx, _range) from the tasks list
            idx, _range = self.get_task()
            if idx is None:
                break  # every task entry is marked done
            task = Worker.tasks[idx]
            if len(task['result']) < Worker.parties.value:
                continue  # not all workers have finished this task yet
            # process the task results, e.g. print them
            print(_range, task['result'])
            # Mark this tasks list entry as done so it can be reused
            Worker.tasks[idx] = None

class Average(Worker):
    def compute(self, _range):
        values = [DATA[i]['close'] for i in range(*_range)]
        return sum(values) / len(values)

class Median(Worker):
    def compute(self, _range):
        values = sorted(DATA[i]['open'] for i in range(*_range))
        return values[len(values) // 2]

if __name__ == '__main__':
    DATA = mp.Manager().list()

    # Example creates a task every 5 records
    for i in range(1, 16):
        DATA.append({'id': i, 'open': 300 + randrange(0, 5), 'close': 300 + randrange(-5, 5)})
        if i % 5 == 0:
            Worker.tasks.append({'range': (i - 5, i), 'result': {}})

    # Both computing workers (Average and Median) must finish a task
    # before the Result process is notified
    Worker.parties.value = 2

    WORKERS = [Result(), Average(), Median()]
    for w in WORKERS:
        w.start()

    # Wait until all data has been processed
    for w in WORKERS:
        w.join()
Tested with Python: 3.4.2
Is there a way to implement multithreading for multiple for loops inside a single function? I am aware that it can be achieved with separate functions, but is it possible within the same function?
For example:
def sqImport():
    for i in (0, 50):
        do something specific to 0-49
    for i in (50, 100):
        do something specific to 50-99
    for i in (100, 150):
        do something specific to 100-149
If there are 3 separate functions for 3 different for loops then we can do:
threadA = Thread(target = loopA)
threadB = Thread(target = loopB)
threadC = Thread(target = loopC)
threadA.run()
threadB.run()
threadC.run()
# Do work independent of loopA and loopB
threadA.join()
threadB.join()
threadC.join()
But is there a way to achieve this under a single function?
First of all: I think you should really take a look at multiprocessing.pool.ThreadPool if you are going to use this in a production system. What I describe below is just a possible workaround (which might be simpler and therefore could be used for testing purposes).
You could pass an id to the function and use it to decide which loop to run, like so:
from threading import Thread

def sqImport(tId):
    if tId == 0:
        for i in range(0, 50):
            print(i)
    elif tId == 1:
        for i in range(50, 100):
            print(i)
    elif tId == 2:
        for i in range(100, 150):
            print(i)

threadA = Thread(target=sqImport, args=[0])
threadB = Thread(target=sqImport, args=[1])
threadC = Thread(target=sqImport, args=[2])

threadA.start()
threadB.start()
threadC.start()

# Do work independent of loopA and loopB

threadA.join()
threadB.join()
threadC.join()
Note that I used start() instead of run(), because run() does not start a separate thread but executes in the current thread's context. Moreover, I changed your for i in (x, y) loops to for i in range(x, y) loops, because I think you want to iterate over a range and not a tuple (which would iterate only over x and y).
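A tiny demo of that point, in case it helps (my own illustration): run() executes in the calling thread, while start() actually spawns a new one.

from threading import Thread, current_thread

def report():
    print('running in: %s' % current_thread().name)

t1 = Thread(target=report, name='worker-1')
t1.run()    # prints MainThread: no new thread was created
t2 = Thread(target=report, name='worker-2')
t2.start()  # prints worker-2: report() runs in the new thread
t2.join()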
An alternative solution using multiprocessing.dummy's thread pool might look like this:
from multiprocessing.dummy import Pool as ThreadPool

# The worker function
def sqImport(data):
    for i in data:
        print(i)

# The three ranges for the three different threads
ranges = [
    range(0, 50),
    range(50, 100),
    range(100, 150)
]

# Create a thread pool with 3 threads
pool = ThreadPool(3)

# Run sqImport() on all ranges
pool.map(sqImport, ranges)

pool.close()
pool.join()
You can use multiprocessing.pool.ThreadPool, which will divide your tasks equally between the running threads.
See "Threading pool similar to the multiprocessing Pool?" for more on this.
If you are really looking for parallel execution, go for processes, because threads run into the Python GIL (Global Interpreter Lock).
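A rough sketch of that process-based variant (reusing the sqImport(data) worker from the previous answer; whether it pays off depends on how heavy the real work is):

from multiprocessing import Pool

def sqImport(data):
    for i in data:
        print(i)

if __name__ == '__main__':
    ranges = [range(0, 50), range(50, 100), range(100, 150)]
    pool = Pool(3)                 # three worker processes, one per range
    pool.map(sqImport, ranges)
    pool.close()
    pool.join()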
Please bear with me, as this is a bit of a contrived example of my real application. Suppose I have a list of numbers and I want to add a single number to each number in the list using multiple (2) processes. I can do something like this:
import multiprocessing

my_list = list(range(100))
my_number = 5
data_line = [{'list_num': i, 'my_num': my_number} for i in my_list]

def worker(data):
    return data['list_num'] + data['my_num']

pool = multiprocessing.Pool(processes=2)
pool_output = pool.map(worker, data_line)
pool.close()
pool.join()
Now, however, there's a wrinkle to my problem. Suppose that I want to alternate between adding two numbers (instead of just one). Around half the time I want to add my_number1, and the other half of the time I want to add my_number2. It doesn't matter which number gets added to which item in the list. However, the one requirement is that the different processes must never be adding the same number at the same time. What this boils down to, essentially (I think), is that I want Process 1 to use the first number and Process 2 to use the second number exclusively, so that the processes are never simultaneously adding the same number. So something like:
my_num1 = 5
my_num2 = 100
data_line = [{'list_num': i, 'my_num1': my_num1, 'my_num2': my_num2} for i in my_list]

def worker(data):
    # if in Process 1:
    return data['list_num'] + data['my_num1']
    # if in Process 2:
    return data['list_num'] + data['my_num2']
    # and so forth
Is there an easy way to specify specific inputs per process? Is there another way to think about this problem?
multiprocessing.Pool lets you pass an initializer function which is executed in each worker before the actual given function is run.
You can use it together with a global variable to let your function know which process it is running in.
You probably want to control the initial number each process gets. You can use a Queue to tell the processes which number to pick up.
This solution is not optimal but it works.
import multiprocessing

process_number = None

def initializer(queue):
    global process_number
    process_number = queue.get()  # atomically get this worker's index

def function(value):
    print("I'm process %s" % process_number)
    return value[process_number]

def main():
    queue = multiprocessing.Queue()
    for index in range(multiprocessing.cpu_count()):
        queue.put(index)

    pool = multiprocessing.Pool(initializer=initializer, initargs=[queue])
    tasks = [{0: 'Process-0', 1: 'Process-1', 2: 'Process-2'}, ...]
    print(pool.map(function, tasks))

if __name__ == '__main__':
    main()
My PC is a dual core, so as you can see only Process-0 and Process-1 appear in the results.
I'm process 0
I'm process 0
I'm process 1
I'm process 0
I'm process 1
...
['Process-0', 'Process-0', 'Process-1', 'Process-0', ... ]
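Applied to your actual add-a-number example, the same pattern might look like this (my own adaptation, assuming exactly two workers): each worker learns its index once in the initializer and then always adds "its" number, so the two numbers are never used by the same process.

import multiprocessing

process_number = None

def initializer(queue):
    global process_number
    process_number = queue.get()

def worker(data):
    my_nums = (data['my_num1'], data['my_num2'])
    return data['list_num'] + my_nums[process_number]

if __name__ == '__main__':
    my_list = list(range(100))
    data_line = [{'list_num': i, 'my_num1': 5, 'my_num2': 100} for i in my_list]

    queue = multiprocessing.Queue()
    queue.put(0)
    queue.put(1)

    pool = multiprocessing.Pool(processes=2, initializer=initializer, initargs=[queue])
    print(pool.map(worker, data_line))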
I have a list of numbers. I want to perform some time-consuming operation on each number in the list and make a new list with all the results. Here's a simplified version of what I have:
def calcNum(n):  # some arbitrary, time-consuming calculation on a number
    m = n
    for i in range(5000000):
        m += i % 25
        if m > n*n:
            m /= 2
    return m

nums = [12, 25, 76, 38, 8, 2, 5]
finList = []
for i in nums:
    return_val = calcNum(i)
    finList.append(return_val)

print(finList)
Now, I wanted to take advantage of the multiple cores in my CPU and give each of them one of the numbers to process; since the "number calculation" function is self-contained from start to finish, I figured this would be fairly simple to do and a perfect situation for multiprocessing/threading.
My question is, which one should I use (multiprocessing or threading?), and what is the simplest way to do this?
I did a test with various code I found in other questions to achieve this, and while it runs fine it doesn't seem to be doing any actual multithreading/processing and takes just as long as my first test:
from multiprocessing.pool import ThreadPool

def calcNum(n):  # some arbitrary, time-consuming calculation on a number
    m = n
    for i in range(5000000):
        m += i % 25
        if m > n*n:
            m /= 2
    return m

pool = ThreadPool(processes=3)
nums = [12, 25, 76, 38, 8, 2, 5]
finList = []
for i in nums:
    async_result = pool.apply_async(calcNum, (i,))
    return_val = async_result.get()
    finList.append(return_val)

print(finList)
multiprocessing.pool and pool.map are your best friends here. They save a lot of headache by hiding all the complex queues and whatnot you would otherwise need to make this work. All you need to do is set up the pool, give it the maximum number of processes, and point it at the function and the iterable. See the working code below.
Because of the join, and because of how pool.map is intended to work, the program will wait until ALL processes have returned something before giving you the result.
from multiprocessing.pool import Pool

def calcNum(n):  # some arbitrary, time-consuming calculation on a number
    print("Calcs Started on", n)
    m = n
    for i in range(5000000):
        m += i % 25
        if m > n*n:
            m /= 2
    return m

if __name__ == "__main__":
    p = Pool(processes=3)
    nums = [12, 25, 76, 38, 8, 2, 5]
    result = p.map(calcNum, nums)
    p.close()
    p.join()
    print(result)
That will get you something like this:
Calcs Started on 12
Calcs Started on 25
Calcs Started on 76
Calcs Started on 38
Calcs Started on 8
Calcs Started on 2
Calcs Started on 5
[72, 562, 5123, 1270, 43, 23, 23]
Regardless of when each process is started or when it completes, map waits for each to finish and then puts them all back in the correct order (corresponding to the input iterable).
As #Guy mentioned, the GIL hurts us here. You can change the Pool to ThreadPool in the code above and see how it affects the timing of the calculations. Since calcNum is CPU-bound Python code, the GIL only lets one thread execute it at a time, so it still runs nearly serially.
Multiprocessing with a Process or Pool essentially starts further instances of your script, which gets around the GIL. If you watch your running processes during the run above, you'll see extra instances of 'python.exe' start while the pool is running. In this case, you'll see a total of 4.
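If you want to see the GIL effect for yourself, a quick (unscientific) timing comparison along the lines suggested above could look like this:

import time
from multiprocessing.pool import Pool, ThreadPool

def calcNum(n):  # same arbitrary, time-consuming calculation as above
    m = n
    for i in range(5000000):
        m += i % 25
        if m > n*n:
            m /= 2
    return m

if __name__ == "__main__":
    nums = [12, 25, 76, 38, 8, 2, 5]
    for label, pool_class in (("processes", Pool), ("threads", ThreadPool)):
        start = time.time()
        p = pool_class(3)
        p.map(calcNum, nums)
        p.close()
        p.join()
        print("%s took %.2f seconds" % (label, time.time() - start))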
I guess you are affected by the Python Global Interpreter Lock:
The GIL is controversial because it prevents multithreaded CPython programs from taking full advantage of multiprocessor systems in certain situations.
Try to use multiprocessing instead:
from multiprocessing import Pool