EDIT: Updated with environment information (see first section)
Environment
I'm using Python 2.7
Ubuntu 16.04
Issue
I have an application which I've simplified into a three-stage process:
Gather data from multiple data sources (HTTP requests, system info, etc)
Compute metrics based on this data
Output these metrics in various formats
Each of these stages must complete before moving on to the next stage, however each stage consists of multiple sub-tasks that can be run in parallel (I can send off 3 HTTP requests and read system logs while waiting for them to return)
I've divided up the stages into modules and the sub-tasks into submodules, so my project hierarchy looks like so:
+ datasources
|-- __init__.py
|-- data_one.py
|-- data_two.py
|-- data_three.py
+ metrics
|-- __init__.py
|-- metric_one.py
|-- metric_two.py
+ outputs
|-- output_one.py
|-- output_two.py
- app.py
app.py looks roughly like so (pseudo-code for brevity):
import datasources
import metrics
import outputs
for datasource in dir(datasources):
datasource.refresh()
for metric in dir(metrics):
metric.calculate()
for output in dir(outputs):
output.dump()
(There's additional code wrapping the dir call to ignore system modules, there's exception handling, etc -- but this is the gist of it)
Each datasource sub-module looks roughly like so:
data = []
def refresh():
# Populate the "data" member somehow
data = [1, 2, 3]
return
Each metric sub-module looks roughly like so:
import datasources.data_one as data_one
import datasources.data_two as data_two
data = []
def calculate():
# Use the datasources to compute the metric
data = [sum(x) for x in zip(data_one, data_two)]
return
In order to parallelize the first stage (datasources) I wrote something simple like the following:
def run_thread(datasource):
datasource.refresh()
threads = []
for datasource in dir(datasources):
thread = threading.Thread(target=run_thread, args=(datasource))
threads.append(thread)
thread.start()
for thread in threads:
thread.join()
This works, and afterwards I can compute any metric and the datasources.x.data attribute is populated
In order to parallelize the second stage (metrics) because it depends less on I/O and more on CPU, I felt like simple threading wouldn't actually speed things up and I would need the multiprocessing module in order to take advantage of multiple cores. I wrote the following:
def run_pool(calculate):
calculate()
pool = multiprocessing.Pool()
pool.map(run_pool, [m.calculate for m in dir(metrics)]
pool.close()
pool.join()
This code runs for a few seconds (so I think it's working?) but then when I try:
metrics.metric_one.data
it returns [], like the module was never run
Somehow by using the multiprocessing module it seems to be scoping the threads so that they no longer share the data attribute. How should I go about rewriting this so that I can compute each metric in parallel, taking advantage of multiple cores, but still have access to the data when it's done?
Updated again, per comments:
Since you're in 2.7, and you're dealing with modules instead of objects, you're having problems pickling what you need. The workaround is not pretty. It involves passing the name of each module to your operating function. I updated the partial section, and also updated to remove the with syntax.
A few things:
First, in general, it's better to multicore than thread. With threading, you always run a risk of dealing with the Global Interpreter Lock, which can be extremely inefficient. This becomes a non-issue if you use multicore.
Second, you've got the right concept, but you make it strange by having a global-to-the-module data member. Make your sources return the data you're interested in, and make your metrics (and outputs) take a list of data as input and output the resultant list.
This would turn your pseudocode into something like this:
app.py:
import datasources
import metrics
import outputs
pool = multiprocessing.Pool()
data_list = pool.map(lambda o: o.refresh, list(dir(datasources)))
pool.close()
pool.join()
pool = multiprocessing.Pool()
metrics_funcs = [(m, data_list) for m in dir(metrics)]
metrics_list = pool.map(lambda m: m[0].calculate(m[1]), metrics_funcs)
pool.close()
pool.join()
pool = multiprocessing.Pool()
output_funcs = [(o, data_list, metrics_list) for o in dir(outputs)]
output_list = pool.map(lambda o: o[0].dump(o[1], o[2]), output_funcs)
pool.close()
pool.join()
Once you do this, your data source would look like this:
def refresh():
# Populate the "data" member somehow
return [1, 2, 3]
And your metrics would look like this:
def calculate(data_list):
# Use the datasources to compute the metric
return [sum(x) for x in zip(data_list)]
And finally, your output could look like this:
def dump(data_list, metrics_list):
# do whatever; you now have all the information
Removing the data "global" and passing it makes each piece a lot cleaner (and a lot easier to test). This highlights making each piece completely independent. As you can see, all I'm doing is changing what's in the list that gets passed to map, and in this case, I'm injecting all the previous calculations by passing them as a tuple and unpacking them in the function. You don't have to use lambdas, of course. You can define each function separately, but there's really not much to define. However, if you do define each function, you could use partial functions to reduce the number of arguments you pass. I use that pattern a lot, and in your more complicated code, you may need to. Here's one example:
from functools import partial
do_dump(module_name, data_list, metrics_list):
globals()[module_name].dump(data_list, metrics_list)
invoke = partial(do_dump, data_list=data_list, metrics_list=metrics_list)
with multiprocessing.Pool() as pool:
output_list = pool.map(invoke, [o.__name__ for o in dir(outputs)])
pool.close()
pool.join()
Update, per comments:
When you use map, you're guaranteed that the order of your inputs matches the order of your outputs, i.e. data_list[i] is the output for running dir(datasources)[i].refresh(). Rather than importing the datasources modules into metrics, I would make this change to app.py:
data_list = ...
pool.close()
pool.join()
data_map = {name: data_list[i] for i, name in enumerate(dir(datasources))}
And then pass data_map into each metric. Then the metric gets the data that it wants by name, e.g.
d1 = data_map['data_one']
d2 = data_map['data_two']
return [sum(x) for x in zip([d1, d2])]
Process and Thread behave quite differently in python. If you want to use multiprocessing you will need to use a synchronized data type to pass information around.
For example you could use multiprocessing.Array, which can be shared between your processes.
For detail see the docs: https://docs.python.org/2/library/multiprocessing.html#sharing-state-between-processes
Related
I've never done anything with multiprocessing before, but I recently ran into a problem with one of my projects taking an excessive amount of time to run. I have about 336,000 files I need to process, and a traditional for loop would likely take about a week to run.
There are two loops to do this, but they are effectively identical in what they return so I've only included one.
import json
import os
from tqdm import tqdm
import multiprocessing as mp
jsons = os.listdir('/content/drive/My Drive/mrp_workflow/JSONs')
materials = [None] * len(jsons)
def asyncJSONs(file, index):
try:
with open('/content/drive/My Drive/mrp_workflow/JSONs/{}'.format(file)) as f:
data = json.loads(f.read())
properties = process_dict(data, {})
properties['name'] = file.split('.')[0]
materials[index] = properties
except:
print("Error parsing at {}".format(file))
process_list = []
i = 0
for file in tqdm(jsons):
p = mp.Process(target=asyncJSONs,args=(file,i))
p.start()
process_list.append(p)
i += 1
for process in process_list:
process.join()
Everything in that relating to multiprocessing was cobbled together from a collection of google searches and articles, so I wouldn't be surprised if it wasn't remotely correct. For example, the 'i' variable is a dirty attempt to keep the information in some kind of order.
What I'm trying to do is load information from those JSON files and store it in the materials variable. But when I run my current code nothing is stored in materials.
As you can read in other answers - processes don't share memory and you can't set value directly in materials. Function has to use return to send result back to main process and it has to wait for result and get it.
It can be simpler with Pool. It doesn't need to use queue manually. And it should return results in the same order as data in all_jsons. And you can set how many processes to run at the same time so it will not block CPU for other processes in system.
But it can't use tqdm.
I couldn't test it but it can be something like this
import os
import json
from multiprocessing import Pool
# --- functions ---
def asyncJSONs(filename):
try:
fullpath = os.path.join(folder, filename)
with open(fullpath) as f:
data = json.loads(f.read())
properties = process_dict(data, {})
properties['name'] = filename.split('.')[0]
return properties
except:
print("Error parsing at {}".format(filename))
# --- main ---
# for all processes (on some systems it may have to be outside `__main__`)
folder = '/content/drive/My Drive/mrp_workflow/JSONs'
if __name__ == '__main__':
# code only for main process
all_jsons = os.listdir(folder)
with Pool(5) as p:
materials = p.map(asyncJSONs, all_jsons)
for item in materials:
print(item)
BTW:
Other modules: concurrent.futures, joblib, ray,
Going to mention a totally different way of solving this problem. Don't bother trying to append all the data to the same list. Extract the data you need, and append it to some target file in ndjson/jsonlines format. That's just where, instead of objects part of a json array [{},{}...], you have separate objects on each line.
{"foo": "bar"}
{"foo": "spam"}
{"eggs": "jam"}
The workflow looks like this:
spawn N workers with a manifest of files to process and the output file to write to. You don't even need MP, you could use a tool like rush to parallelize.
worker parses data, generates the output dict
worker opens the output file with append flag. dump the data and flush immediately:
with open(out_file, 'a') as fp:
print(json.dumps(data), file=fp, flush=True)
Flush ensure that as long as your data is less than the buffer size on your kernel (usually several MB), your different processes won't stomp on each other and conflict writes. If they do get conflicted, you may need to write to a separate output file for each worker, and then join them all.
You can join the files and/or convert to regular JSON array if needed using jq. To be honest, just embrace jsonlines. It's a way better data format for long lists of objects, since you don't have to parse the whole thing in memory.
You need to understand how multiprocessing works. It starts a brand new process for EACH task, each with a brand new Python interpreter, which runs your script all over again. These processes do not share memory in any way. The other processes get a COPY of your globals, but they obviously can't be the same memory.
If you need to send information back, you can using a multiprocessing.queue. Have the function stuff the results in a queue, while your main code waits for stuff to magically appear in the queue.
Also PLEASE read the instructions in the multiprocessing docs about main. Each new process will re-execute all the code in your main file. Thus, any one-time stuff absolutely must be contained in a
if __name__ == "__main__":
block. This is one case where the practice of putting your mainline code into a function called main() is a "best practice".
What is taking all the time here? Is it reading the files? If so, then you might be able to do this with multithreading instead of multiprocessing. However, if you are limited by disk speed, then no amount of multiprocessing is going to reduce your run time.
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures import as_completed
import numpy as np
import time
#creating iterable
testDict = {}
for i in range(1000):
testDict[i] = np.random.randint(1,10)
#default method
stime = time.time()
newdict = []
for k, v in testDict.items():
for i in range(1000):
v = np.tanh(v)
newdict.append(v)
etime = time.time()
print(etime - stime)
#output: 1.1139910221099854
#multi processing
stime = time.time()
testresult = []
def f(item):
x = item[1]
for i in range(1000):
x = np.tanh(x)
return x
def main(testDict):
with ProcessPoolExecutor(max_workers = 8) as executor:
futures = [executor.submit(f, item) for item in testDict.items()]
for future in as_completed(futures):
testresult.append(future.result())
if __name__ == '__main__':
main(testDict)
etime = time.time()
print(etime - stime)
#output: 3.4509658813476562
Learning multiprocessing and testing stuff. Ran a test to check if I have implemented this correctly. Looking at the output time taken, concurrent method is 3 times slower. So what's wrong?
My objective is to parallelize a script which mostly operates on a dictionary of around 500 items. Each loop, values of those 500 items are processed and updated. This loops for let's say 5000 generations. None of the k,v pairs interact with other k,v pairs. [Its a genetic algorithm].
I am also looking at guidance on how to parallelize the above described objective. If I use the correct concurrent futures method on each of my function in my genetic algorithm code, where each function takes an input of a dictionary and outputs a new dictionary, will it be useful? Any guides/resources/help is appreciated.
Edit: If I run this example: https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor-example, it takes 3 times more to solve than a default for loop check.
There are a couple basic problems here, you're using numpy but you're not vectorizing your calculations. You'll not benefit from numpy's speed benefit with the way you write your code here, and might as well just use the standard library math module, which is faster than numpy for this style of code:
# 0.089sec
import math
for k, v in testDict.items():
for i in range(1000):
v = math.tanh(v)
newdict.append(v)
Once you vectorise the operation, only then you see the benefit of numpy:
# 0.016sec
for k, v in testDict.items():
arr = no.full(1000, v)
arr2 = np.tanh(arr)
newdict.append(arr2[-1])
For comparison, your original single threaded code runs in 1.171sec on my test machine. As you can see here, when it's not used properly, NumPy can be a couple orders of magnitude slower than even pure Python.
Now on to why you're seeing what you're seeing.
To be honest, I can't replicate your timing results. Your original multiprocessing code runs in 0.299sec for me macOS on Python 3.6), which is faster than the single process code. But if I have to take a guess, you're probably using Windows? In some platforms like Windows, creating a child process and setting up an environment to run multiprocessing task is very expensive, so using multiprocessing for a task that lasts less than a few seconds is of dubious benefit. If your are interested in why, read here.
Also, in platforms that lacks a usable fork() like MacOS after Python 3.8 or Windows, when you use multiprocessing, the child process has to reimport the module, so if you put both code in the same file, it has to run your single threaded code in the child processes before it can run the multiprocessing code. You'll likely want to put your test code in a function and protect the top level code with if __name__ == "__main__" block. On Mac with Python 3.8 or higher, you can also revert to using fork method by calling multiprocessing.set_start_method("fork") if you're not calling into Mac's non-fork-safe framework libraries.
With that out of the way, on to your title question.
When you use multiprocessing, you need to copy data to the child process and back to the main process to retrieve the result and there's a cost to spawning child processes. To benefit from multiprocessing, you need to design your workload so that this part of the cost is negligible.
If your data comes from external source, try loading the data in the child processes, rather than having the main process load the data then transfer it to the child process, have the main process tell the child how to fetch its slice of data. Here you're generating the testDict in the main process, so if you can, parallelize that and move them to the children instead.
Also, since you're using numpy, if you vectorise your operations properly, numpy will release the GIL while doing vectorised operations, so you may be able to just use multithreading instead. Since numpy doesn't hold GIL during vector operation, you can take advantage of multiple threads in a single Python process, and you don't need to fork or copy data over to child processes, as threads share memory.
I'm very new to Python (and coding in general) and I need help parallising the code below. I looked around and found some packages (eg. Multiprocessing & JobLib) which could be useful.
However, I have trouble using it in my example. My code makes an outputfile, and updates it doing the loop(s). Therefore is it not directly paralisable, so I think I need to make smaller files. After this, I could merge the files together.
I'm unable to find a way to do this, could someone be so kind and give me a decent start?
I appreciate any help,
A code newbie
Code:
def delta(graph,n,t,nx,OutExt):
fout_=open(OutExt+'Delta'+str(t)+'.txt','w')
temp=nx.Graph(graph)
for u in range(0,n):
#print "stamp: "+str(t)+" node: "+str(u)
for v in range(u+1,n):
#print str(u)+"\t"+str(v)
Stat = dict()
temp.add_edge(u,v)
MineDeltaGraphletTransitionsFromDynamicNetwork(graph,temp,Stat,u,v)
for a in Stat:
for b in Stat[a]:
fout_.write(str(t)+"\t"+str(u)+"\t"+str(v)+"\t"+str(a)+"\t"+str(b)+"\t"+str(Stat[a][b])+"\n")
if not graph.has_edge(u,v):
temp.remove_edge(u,v)
del temp
fout_.close()
As a start, find the part of the code that you want to be able to execute in parallel with something (perhaps with other invocations of that very same function). Then, figure out how to make this code not share mutable state with anything else.
Mutable state is the enemy of parallel execution. If two pieces of code are executing in parallel and share mutable state, you don't know what the outcome will be (and the outcome will be different each time you run the program). This is becaues you don't know what order the code from the parallel executions will run in. Perhaps the first will mutate something and then the second one will compute something. Or perhaps the second one will compute something and then the first one will mutate it. Who knows? There are solutions to that problem but they involve fine-grained locking and careful reasoning about what can change and when.
After you have an algorithm with a core that doesn't share mutable state, factor it into a separate function (turning locals into parameters).
Finally, use something like the threading (if your computations are primarily in CPython extension modules with good GIL behavior) or multiprocessing (otherwise) modules to execute the algorithm core function (which you have abstracted out) at some level of parallelism.
The particular code example you've shared is a challenge because you use the NetworkX library and a lot of shared mutable state. Each iteration of your loop depends on the results of the previous, apparently. This is not obviously something you can parallelize. However, perhaps if you think about your goals more abstractly you will be able to think of a way to do it (remember, the key is to be able to expressive your algorithm without using shared mutable state).
Your function is called delta. Perhaps you can split your graph into sub-graphs and compute the deltas of each (which are now no longer shared) in parallel.
If the code within your outermost loop is concurrent safe (I don't know if it is or not), you could rewrite it like this for parallel execution:
from multiprocessing import Pool
def do_one_step(nx, graph, n, t, OutExt, u):
# Create a separate output file for this set of results.
name = "{}Delta{}-{}.txt".format(OutExt, t, u)
fout_ = open(name, 'w')
temp = nx.Graph(graph)
for v in range(u+1,n):
Stat = dict()
temp.add_edge(u,v)
MineDeltaGraphletTransitionsFromDynamicNetwork(graph,temp,Stat,u,v)
for a in Stat:
for b in Stat[a]:
fout_.write(str(t)+"\t"+str(u)+"\t"+str(v)+"\t"+str(a)+"\t"+str(b)+"\t"+str(Stat[a][b])+"\n")
if not graph.has_edge(u,v):
temp.remove_edge(u,v)
fout_.close()
def delta(graph,n,t,nx,OutExt):
pool = Pool()
pool.map(
partial(
do_one_step,
nx,
graph,
n,
t,
OutExt,
),
range(0,n),
)
This supposes that all of the arguments can be serialized across processes (required for any argument you pass to a function you call with multiprocessing). I suspect that nx and graph may be problems but I don't know what they are.
And again, this assumes it's actually correct to concurrently execute the inner loop.
Best use pool.map. Here an example that shows what you need to do. Here a simple example of how multiprocessing works with pool:
Single threaded, basic function:
def f(x):
return x*x
if __name__ == '__main__':
print(map(f, [1, 2, 3]))
>> [1, 4, 9]
Using multiple processors:
from multiprocessing import Pool
def f(x):
return x*x
if __name__ == '__main__':
p = Pool(3) # 3 parallel pools
print(p.map(f, [1, 2, 3]))
Using 1 processor
from multiprocessing.pool import ThreadPool as Pool
def f(x):
return x*x
if __name__ == '__main__':
p = Pool(3) # 3 parallel pools
print(p.map(f, [1, 2, 3]))
When you use map you can easily get a list back from the results of your function.
I am trying to create workers for a task that involves reading a lot of files and analyzing them.
I want something like this:
list_of_unique_keys_from_csv_file = [] # About 200mb array (10m rows)
# a list of uniquekeys for comparing inside worker processes to a set of flatfiles
I need more threads as it is going very slow, doing the comparison with one process (10 minutes per file).
I have another set of flat-files that I compare the CSV file to, to see if unique keys exist. This seems like a map reduce type of problem.
main.py:
def worker_process(directory_glob_of_flat_files, list_of_unique_keys_from_csv_file):
# Do some parallel comparisons "if not in " type stuff.
# generate an array of
# lines of text like : "this item_x was not detected in CSV list (from current_flatfile)"
if current_item not in list_of_unique_keys_from_csv_file:
all_lines_this_worker_generated.append(sometext + current_item)
return all_lines_this_worker_generated
def main():
all_results = []
pool = Pool(processes=6)
partitioned_flat_files = [] # divide files from glob by 6
results = pool.starmap(worker_process, partitioned_flat_files, {{{{i wanna pass in my read-only parameter}}}})
pool.close()
pool.join()
all_results.extend(results )
resulting_file.write(all_results)
I am using both a linux and a windows environment, so perhaps I need something cross-platform compatible (the whole fork() discussion).
Main Question: Do I need some sort of Pipe or Queue, I can't seem to find good examples of how to transfer around a big read-only string array, a copy for each worker process?
You can just split your read-only parameters and then pass them in. The multiprocessing module is cross-platform compatible, so don't worry about it.
Actually, every process, even sub-process, has its own resources, that means no matter how you pass the parameters to it, it will keep a copy of the original one instead of sharing it. In this simple case, when you pass the parameters from main process into sub-processes, Pool automatically makes a copy of your variables. Because sub-processes just have the copies of original one, so the modification cannot be shared. It doesn't matter in this case as your variables are read-only.
But be careful about your code, you need to wrap the parameters you need into an
iterable collection, for example:
def add(a, b):
return a + b
pool = Pool()
results = pool.starmap(add, [(1, 2), (3, 4)])
print(results)
# [3, 7]
I am totaly new in multiprocessing. I am trying to change my code in order to run part of it simultaneously.
I have a huge list where I have to call an API for each node. Since, the APIs are independence, I don't need the result of the first one in order to proceed to the second one. So, I have this code:
def xmlpart1(id):
..call the api..
..retrieve the xml..
..find the part of xml I want..
return xml_part1
def xmlpart2(id):
..call the api..
..retrieve the xml..
..find the part of xml I want..
return xml_part2
def main(index):
mylist = [[..,..],[..,..],[..,..],[..,...]] # A huge list of lists with ids I need for calling the APIs
myL= mylist[index] c
mydic = {}
for i in myL:
flag1 = xmlpart1(i)
flag2 = xmlpart2(i)
mydic[flag1] = flag2
root = "myfilename %s.json" %(str(index))
with open(root, "wb") as f:
json.dump(mydic,f)
from multiprocessing import Pool
if __name__=='__main__':
Pool().map(main, [0,1,2,3])
After a few suggestions from here and from the chat, I end up with this code. The problem is still there. I run the script at 9:50. At 10:25 the first file "myfilename 0.json" appeared in my folder. Now it is 11:25 and neither of the other files have been appeared. The sublists have equal length and they do the same thing, so they need approximately the same time.
This is something more suited to the multiprocessing.Pool() class.
Here's a simple example:
from multiprocessing import Pool
def job(args):
"""Your job function"""
Pool().map(job, inputs)
Where:
inputs is your list of inputs. Each input gets passed to job and processed in a separate process.
You get the results back as a list when all jobs have completed.
multiprocessing.Pool().map is just like the Python builtin map() but sets up a process pool of workers for you and passes each input to the given function.
See the docs for more details: http://docs.python.org/2/library/multiprocessing.html