Is it possible in python (maybe using dask, maybe using multiprocessing) to 'emplace' generators on cores, and then, in parallel, step through the generators and process the results?
It needs to be generators in particular (or objects with __iter__); lists of all the elements the generators yield won't fit into memory.
In particular:
With pandas, I can call read_csv(..., iterator=True), which gives me an iterator (TextFileReader) - I can iterate over it with for or explicitly call next() multiple times. The entire CSV never gets read into memory. Nice.
Every time I read a next chunk from the iterator, I also perform some expensive computation on it.
But now I have 2 such files. I would like to create 2 such generators, and 'emplace' 1 on one core and 1 on another, such that I can:
result = expensive_process(next(iterator))
on each core, in parallel, and then combine and return the result. Repeat this step until one or both generators are exhausted.
It looks like the TextFileReader is not pickleable, nor is a generator. I can't figure out how to do this in dask or multiprocessing. Is there a pattern for this?
Dask's read_csv is designed to load data from multiple files in chunks, with a chunk-size that you can specify. When you operate on the resultant dataframe, you will be working chunk-wise, which is exactly the point of using Dask in the first place. There should be no need to use your iterator method.
The dask dataframe method you will want to use, most likely, is map_partitions().
If you really wanted to use the iterator idea, you should look into dask.delayed, which is able to parallelise arbitrary Python functions by sending each invocation of the function (with a different file name for each) to your workers.
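A rough sketch of both ideas (assuming expensive_process takes a pandas DataFrame and returns one; the file names and blocksize are placeholders):

import pandas as pd
import dask.dataframe as dd
from dask import delayed, compute

# Approach 1: let dask do the chunking; the expensive function runs per partition.
df = dd.read_csv(['file1.csv', 'file2.csv'], blocksize='64MB')
processed = df.map_partitions(expensive_process)  # dask may need a `meta` hint for unusual return types
result = processed.compute()

# Approach 2: dask.delayed, one task per whole file.
def load_and_process(path):
    return expensive_process(pd.read_csv(path))

tasks = [delayed(load_and_process)(p) for p in ['file1.csv', 'file2.csv']]
results = compute(*tasks)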
So luckily I think this problem maps nicely onto Python's multiprocessing Process and Queue.
from multiprocessing import Process, Queue

def data_generator(whatever):
    for v in something(whatever):
        yield v

def generator_constructor(whatever):
    def generator(outputQueue):
        for d in data_generator(whatever):
            outputQueue.put(d)
        outputQueue.put(None)  # sentinel: this generator is exhausted
    return generator

def procSumGenerator():
    outputQs = [Queue(size) for _ in range(NumCores)]
    procs = [Process(target=generator_constructor(whatever),
                     args=(outputQs[i],))
             for i in range(NumCores)]

    for proc in procs: proc.start()

    # Until any output queue returns a None sentinel, collect
    # one item from each queue and yield the combination.
    done = False
    while not done:
        results = [oq.get() for oq in outputQs]
        done = any(res is None for res in results)
        if not done:
            yield some_combination_of(results)

    for proc in procs: proc.terminate()

for v in procSumGenerator():
    print(v)
Maybe this can be done better with Dask? I find that my solution fairly quickly saturates the network for large volumes of generated data - I'm manipulating CSVs with pandas and returning large numpy arrays.
https://github.com/colinator/doodle_generator/blob/master/data_generator_uniform_final.ipynb
I was wondering if there was a more efficient way of doing the following without using loops.
I have a numpy array with the shape (i, x, y, z). Essentially I have i elements of the shape (x, y, z).
I want to write each element to a separate file so that I have i files, each with the data from a single element.
In my case, each element is an image, but I'm sure a solution can be format agnostic.
I'm currently looping through each of the i elements and writing them out one at a time.
As i gets really large, this takes a progressively longer time. Is there a better way or a useful library which could make this more efficient?
Update
I tried the suggestion to use multiprocessing via concurrent.futures, with both the thread pool and the process pool. The code was simpler, but the time to complete was 4x slower.
i in this case is approximately 10000, while x and y are approximately 750.
This sounds very suitable for multiprocessing, as the different elements need to be processed separately and can be saved to disk independently.
Python has a useful package for this, called multiprocessing, with a variety of pooling, processing, and other options.
Here's a simple (and comment-documented) example of usage:
from multiprocessing import Process
import numpy as np

# This should be your existing function
def write_file(element):
    # write file
    pass

# You'll still be looping, of course, but in parallel over batches.
# This is a helper function for looping over a "batch".
def write_list_of_files(elements_list):
    for element in elements_list:
        write_file(element)

# Your data goes here...
all_elements = np.ones((1000, 256, 256, 3))
num_procs = 10  # Depends on system limitations, number of CPU cores, etc.

# Each process in this list runs "write_list_of_files", but on a separate
# slice of the data, thanks to the "k::num_procs" indexing trick.
procs = [Process(target=write_list_of_files, args=[all_elements[k::num_procs, ...]])
         for k in range(num_procs)]

for p in procs:
    p.start()  # Each process starts running independently
for p in procs:
    p.join()   # Ensures the code won't continue until all processes are done. Optional, obviously...

print('All done!')  # This only runs once all procs are done, due to "p.join()"
Assume I have code that, using multiprocessing, returns a list of 200000 items, each of which is a list of its own (marked in the code as point 1). Since I need a single list containing only the inner items, I iterate over the received list again (marked in the code as point 2).
Problem: the line marked as point 2 does not run in parallel and therefore takes valuable time. Is there a way to write all the data directly from the function cu into docs?
import multiprocessing as mp

def cu(num):
    return range(num)

pool = mp.Pool(processes=384)
results = [pool.apply_async(cu, args=(20, )) for ind in range(200000)]
docs = [p.get() for p in results]                    # point 1
docs = [point for item in docs for point in item]    # point 2
pool.close()
pool.join()
I suspect that replacing multithreading with multiprocessing would solve this problem, but I am afraid that it will not save time.
Note: this is a minimal example.
The problem is that you are running the pool once per item, 200k times, when you want the pool to process 200k items in a single mapping call. You can use the lazy pool.imap (instead of apply_async) together with a generator and itertools.chain.from_iterable:
import itertools

docs = itertools.chain.from_iterable(
    pool.imap(cu, (20 for _ in range(200000)))
)
This solution is lazy, which means that you need to consume the iterator (by iterating over it) to obtain the values. You can easily use:
docs = list(docs)
or, if you don't want the results stored:
for doc in docs:
    ...  # Do your stuff here
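Side note: with work as trivial as this minimal example, most of the time goes into inter-process communication rather than computation. Passing a chunksize (supported by map, imap and imap_unordered) batches many items per worker round trip, for example:

docs = itertools.chain.from_iterable(
    pool.imap(cu, (20 for _ in range(200000)), chunksize=1000)
)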
I'm trying to start a variable number of threads to compute the results of functions for one of my automated trading modules. I have about 14 functions, all of which are computationally expensive. I've been calculating each function sequentially, but it takes around 3 minutes to complete, and since my platform is high frequency I need to cut that computation time down to 1 minute or less.
I've read up on multiprocessing and multithreading, but I can't find a solution that fits my need.
What I'm trying to do is define "n" number of threads to use, then divide my list of functions into "n" groups, then compute each group of functions in a separate thread. Essentially:
import numpy as np

functionList = [func1,func2,func3,func4]
outputList = [func1out,func2out,func3out,func4out]
argsList = [func1args,func2args,func3args,func4args]
# number of threads
n = 3
functionSplit = np.array_split(np.array(functionList),n)
outputSplit = np.array_split(np.array(outputList),n)
argSplit = np.array_split(np.array(argsList),n)
Now I'd like to start "n" separate threads, each processing the functions according to the split lists. Then I'd like to name the output of each function according to outputList and create a master dict of the outputs from each function. I will then loop through the output dict and create a dataframe with column ID numbers according to the information in each column (I already have this part worked out, I just need the multithreading).
Is there any way to do something like this? I've been looking into creating a subclass of the threading.Thread class and passing the functions, output names, and arguments into the run() method, but I don't know how to name and output the results of the functions from each thread! Nor do I know how to call functions in a list according to their corresponding arguments!
The reason that I'm doing this is to discover the optimum thread number balance between computational efficiency and time. Like I said, this will be integrated into a high frequency trading platform I'm developing where time is my major constraint!
Any ideas?
You can use the multiprocessing library as below:
import multiprocessing

def callfns(fnList, argList, outList, d):
    for i in range(len(fnList)):
        # Call each function with its own arguments and store the result
        # under its output name.
        d[outList[i]] = fnList[i](*argList[i])

...

manager = multiprocessing.Manager()
d = manager.dict()
processes = []
for i in range(len(functionSplit)):
    process = multiprocessing.Process(target=callfns,
                                      args=(functionSplit[i], argSplit[i], outputSplit[i], d))
    processes.append(process)

for j in processes:
    j.start()

for j in processes:
    j.join()

# use d here
You can use a server process to share the dictionary between these processes. To interact with the server process you need a Manager; then you can create a dictionary in the server process with manager.dict(). Once all processes have joined back to the main process, you can use the dictionary d.
I hope this helps you solve your problem.
You should use multiprocessing instead of threading for CPU-bound tasks.
Manually creating and managing processes can be difficult and requires more effort. Do check out concurrent.futures and try the ProcessPoolExecutor for maintaining a pool of processes. You can submit tasks to it and retrieve results.
The Pool.map method from the multiprocessing module can take a function and an iterable and then process them in chunks in parallel to compute faster. The iterable is broken into separate chunks; these chunks are passed to the function in separate processes, and the results are then put back together.
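A rough sketch of that idea against the question's lists (this assumes each entry of argsList is a tuple of positional arguments, and that the functions are defined at module level so they can be pickled):

from concurrent.futures import ProcessPoolExecutor, as_completed

def run_all(functionList, argsList, outputList, max_workers=3):
    results = {}
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # Submit each function with its own arguments; remember its output name.
        futures = {executor.submit(fn, *args): name
                   for fn, args, name in zip(functionList, argsList, outputList)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results  # master dict keyed by the names in outputList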
I am using multiprocessing.Pool() to parallelize some heavy computations.
The target function returns a lot of data (a huge list). I'm running out of RAM.
Without multiprocessing, I'd just change the target function into a generator, by yielding the resulting elements one after another, as they are computed.
I understand multiprocessing does not support generators -- it waits for the entire output and returns it at once, right? No yielding. Is there a way to make the Pool workers yield data as soon as they become available, without constructing the entire result array in RAM?
Simple example:
from multiprocessing import Pool

def target_fnc(arg):
    result = []
    for i in xrange(1000000):
        result.append('dvsdbdfbngd')  # <== would like to just use yield!
    return result

def process_args(some_args):
    pool = Pool(16)
    for result in pool.imap_unordered(target_fnc, some_args):
        for element in result:
            yield element
This is Python 2.7.
This sounds like an ideal use case for a Queue: http://docs.python.org/2/library/multiprocessing.html#exchanging-objects-between-processes
Simply feed your results into the queue from the pooled workers and ingest them in the master.
Note that you may still run into memory pressure issues unless you drain the queue nearly as fast as the workers are populating it. You could limit the queue size (the maximum number of objects that will fit in the queue), in which case the pooled workers would block on the queue.put statements until space is available in the queue. This would put a ceiling on memory usage. But if you're doing this, it may be time to reconsider whether you require pooling at all and/or whether it might make sense to use fewer workers.
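A rough sketch of that pattern applied to the example above (the Manager queue, the maxsize and the None sentinels are my additions, not part of the original code; the queue bound is what caps memory):

from multiprocessing import Pool, Manager

def target_fnc(packed):
    arg, queue = packed
    for i in xrange(1000000):
        queue.put('dvsdbdfbngd')   # blocks while the queue is full
    queue.put(None)                # sentinel: this task has finished

def process_args(some_args):
    manager = Manager()
    queue = manager.Queue(maxsize=10000)   # bounded queue caps memory use
    pool = Pool(16)
    pool.map_async(target_fnc, [(arg, queue) for arg in some_args])
    remaining = len(some_args)
    while remaining:
        item = queue.get()
        if item is None:
            remaining -= 1
        else:
            yield item
    pool.close()
    pool.join()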
From your description, it sounds like you're not so much interested in processing the data as they come in, as in avoiding passing a million-element list back.
There's a simpler way of doing that: Just put the data into a file. For example:
import os
import tempfile
from multiprocessing import Pool

def target_fnc(arg):
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, 'w') as f:
        for i in xrange(1000000):
            f.write('dvsdbdfbngd\n')
    return path

def process_args(some_args):
    pool = Pool(16)
    for result in pool.imap_unordered(target_fnc, some_args):
        with open(result) as f:
            for element in f:
                yield element
Obviously, if your results can contain newlines, or aren't strings, etc., you'll want to use a CSV file, a numpy file, etc. instead of a simple text file, but the idea is the same.
That being said, even if this is simpler, there are usually benefits to processing the data a chunk at a time, so breaking up your tasks or using a Queue (as the other two answers suggest) may be better, if the downsides (respectively, needing a way to break the tasks up, or having to be able to consume the data as fast as they're produced) are not deal-breakers.
If your tasks can return data in chunks… can they be broken up into smaller tasks, each of which returns a single chunk? Obviously, this isn't always possible. When it isn't, you have to use some other mechanism (like a Queue, as Loren Abrams suggests). But when it is, it's probably a better solution for other reasons, as well as solving this problem.
With your example, this is certainly doable. For example:
def target_fnc(packed_args):
    arg, low, high = packed_args
    result = []
    for i in xrange(low, high):
        result.append('dvsdbdfbngd')  # <== would like to just use yield!
    return result

def process_args(some_args):
    pool = Pool(16)
    pool_args = []
    for low in range(0, 1000000, 10000):
        pool_args.extend((arg, low, low + 10000) for arg in some_args)
    for result in pool.imap_unordered(target_fnc, pool_args):
        for element in result:
            yield element
(You could of course replace the loop with a nested comprehension, or a zip and flatten, if you prefer.)
So, if some_args is [1, 2, 3], you'll get 300 tasks—(1, 0, 10000), (2, 0, 10000), (3, 0, 10000), (1, 10000, 20000), …—each of which only returns 10000 elements instead of 1000000.
Both list comprehensions and map calculations should -- at least in theory -- be relatively easy to parallelize: each calculation inside a list comprehension could be done independently of the calculation of all the other elements. For example, in the expression
[ x*x for x in range(1000) ]
each x*x calculation could (at least in theory) be done in parallel.
My question is: Is there any Python-Module / Python-Implementation / Python Programming-Trick to parallelize a list-comprehension calculation (in order to use all 16 / 32 / ... cores or distribute the calculation over a Computer-Grid or over a Cloud)?
As Ken said, it can't, but with 2.6's multiprocessing module, it's pretty easy to parallelize computations.
import multiprocessing

try:
    cpus = multiprocessing.cpu_count()
except NotImplementedError:
    cpus = 2  # arbitrary default

def square(n):
    return n * n

pool = multiprocessing.Pool(processes=cpus)
print(pool.map(square, range(1000)))
There are also examples in the documentation that show how to do this using Managers, which should allow for distributed computations as well.
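For illustration, a bare-bones sketch of the remote-manager pattern from those docs (the address, port and authkey here are made-up values):

from multiprocessing.managers import BaseManager
import queue

class QueueManager(BaseManager):
    pass

# --- server process: owns the real queue and serves it over the network ---
task_queue = queue.Queue()
QueueManager.register('get_tasks', callable=lambda: task_queue)
manager = QueueManager(address=('', 50000), authkey=b'secret')
server = manager.get_server()
server.serve_forever()

# --- worker processes (possibly on other machines): connect and get a proxy ---
QueueManager.register('get_tasks')
client = QueueManager(address=('server-host', 50000), authkey=b'secret')
client.connect()
tasks = client.get_tasks()   # proxy object; tasks.put()/tasks.get() go over the wire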
For shared-memory parallelism, I recommend joblib:
from joblib import delayed, Parallel
def square(x): return x*x
values = Parallel(n_jobs=NUM_CPUS)(delayed(square)(x) for x in range(1000))
On automatic parallelisation of list comprehensions
IMHO, effective automatic parallelisation of list comprehensions would be impossible without additional information (such as that provided using directives in OpenMP), or without limiting it to expressions that involve only built-in types/methods.
Unless there is a guarantee that the processing done on each list item has no side effects, there is a possibility that the results will be invalid (or at least different) if done out of order.
# Artificial example
counter = 0

def g(x):  # func with side-effect
    global counter
    counter = counter + 1
    return x + counter

vals = [g(i) for i in range(100)]  # diff result when not done in order
There is also the issue of task distribution. How should the problem space be decomposed?
If the processing of each element forms a task (~ task farm), then when there are many elements each involving a trivial calculation, the overheads of managing the tasks will swamp the performance gains of parallelisation.
One could also take the data decomposition approach where the problem space is divided equally among the available processes.
The fact that list comprehensions also work with generators makes this slightly tricky; however, this is probably not a show stopper if the overhead of pre-iterating the generator is acceptable. Of course, there is also the possibility of generators with side effects, which can change the outcome if subsequent items are prematurely iterated. Very unlikely, but possible.
A bigger concern would be load imbalance across processes. There is no guarantee that each element takes the same amount of time to process, so statically partitioned data may result in one process doing most of the work while the others idle their time away.
Breaking the list down into smaller chunks and handing them out as each child process becomes available is a good compromise; however, a good choice of chunk size is application dependent and hence not doable without more information from the user.
Alternatives
As mentioned in several other answers, there are many approaches and parallel computing modules/frameworks to choose from, depending on one's requirements.
Having used only MPI (in C) with no experience using Python for parallel processing, I am not in a position to vouch for any (although, upon a quick scan through,
multiprocessing, jug, pp and pyro stand out).
If a requirement is to stick as close as possible to list comprehension, then jug seems to be the closest match. From the tutorial, distributing tasks across multiple instances can be as simple as:
from glob import glob
from jug.task import Task
from yourmodule import process_data

tasks = [Task(process_data, infile) for infile in glob('*.dat')]
While that does something similar to multiprocessing.Pool.map(), jug can use different backends for synchronising processes and storing intermediate results (redis, filesystem, in-memory), which means the processes can span nodes in a cluster.
As the above answers point out, this is actually pretty hard to do automatically. Then I think the question is actually how to do it in the easiest way possible. Ideally, a solution wouldn't require you to know things like "how many cores do I have". Another property that you might want is to be able to still do the list comprehension in a single readable line.
Some of the given answers already seem to have nice properties like this, but another alternative is Ray (docs), which is a framework for writing parallel Python. In Ray, you would do it like this:
import ray

# Start Ray. This creates some processes that can do work in parallel.
ray.init()

# Add this decorator to signify that the function can be run in parallel (as a
# "task"). Ray will load-balance different `square` tasks automatically.
@ray.remote
def square(x):
    return x * x

# Create some parallel work using a list comprehension, then block until the
# results are ready with `ray.get`.
ray.get([square.remote(x) for x in range(1000)])
Using the futures.{Thread,Process}PoolExecutor.map(func, *iterables, timeout=None) and futures.as_completed(future_instances, timeout=None) functions from the new 3.2 concurrent.futures package could help.
It's also available as a 2.6+ backport.
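A minimal sketch with ProcessPoolExecutor against the question's squaring example:

from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        # map() preserves input order and runs the calls in worker processes,
        # side-stepping the GIL for CPU-bound work.
        results = list(executor.map(square, range(1000)))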
No, because list comprehension itself is a sort of a C-optimized macro. If you pull it out and parallelize it, then it's not a list comprehension, it's just a good old fashioned MapReduce.
But you can easily parallelize your example. Here's a good tutorial on using MapReduce with Python's parallelization library:
http://mikecvet.wordpress.com/2010/07/02/parallel-mapreduce-in-python/
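As a very small illustration of that map/reduce split using only the standard library (the reduce step here simply sums the squares; adjust to taste):

from multiprocessing import Pool
from functools import reduce
import operator

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool() as pool:
        mapped = pool.map(square, range(1000))    # parallel "map" phase
    total = reduce(operator.add, mapped, 0)       # sequential "reduce" phase
    print(total)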
Not within a list comprehension AFAIK.
You could certainly do it with a traditional for loop and the multiprocessing/threading modules.
There is a comprehensive list of parallel packages for Python here:
http://wiki.python.org/moin/ParallelProcessing
I'm not sure if any handle the splitting of a list comprehension construct directly, but it should be trivial to formulate the same problem in a non-list comprehension way that can be easily forked to a number of different processors. I'm not familiar with cloud computing parallelization, but I've had some success with mpi4py on multi-core machines and over clusters. The biggest issue that you'll have to think about is whether the communication overhead is going to kill any gains you get from parallelizing the problem.
Edit: The following might also be of interest:
http://www.mblondel.org/journal/2009/11/27/easy-parallelization-with-data-decomposition/
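For reference, a bare-bones mpi4py sketch of the data-decomposition idea (launch it with mpiexec; the script name is a placeholder):

# run with: mpiexec -n 4 python squares_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    data = list(range(1000))
    # contiguous, order-preserving chunks (handles uneven division too)
    chunks = [data[i * len(data) // size:(i + 1) * len(data) // size]
              for i in range(size)]
else:
    chunks = None

local = comm.scatter(chunks, root=0)          # each rank receives one chunk
local_result = [x * x for x in local]         # the "list comprehension" part
gathered = comm.gather(local_result, root=0)  # collect per-rank results

if rank == 0:
    results = [y for part in gathered for y in part]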
You can use asyncio (see the asyncio documentation). It is used as a foundation for multiple Python asynchronous frameworks that provide high-performance network and web servers, database connection libraries, distributed task queues, etc. Plus it has both high-level and low-level APIs to accommodate any kind of problem.
import asyncio
import functools

def background(f):
    def wrapped(*args, **kwargs):
        # Run f in the default executor (a thread pool) without blocking the caller.
        return asyncio.get_event_loop().run_in_executor(
            None, functools.partial(f, *args, **kwargs))
    return wrapped

@background
def your_function(argument):
    # code
    pass
Now this function runs in the background whenever it is called, without putting the main program into a wait state. You can use it to parallelize a for loop as well: the loop itself is still sequential, but every iteration is handed to an executor as soon as the interpreter gets there, and so runs concurrently with the main program.
For your specific case, you can do:
import asyncio
import functools
import time

def background(f):
    def wrapped(*args, **kwargs):
        return asyncio.get_event_loop().run_in_executor(
            None, functools.partial(f, *args, **kwargs))
    return wrapped

@background
def op(x):  # Do any operation you want
    time.sleep(1)
    print(f"function called for {x=}\n", end='')
    return x * x
loop = asyncio.get_event_loop()                       # Have a new event loop
looper = asyncio.gather(*[op(i) for i in range(20)])  # Gather the futures; doing 20 for a better demo
results = loop.run_until_complete(looper)             # Wait until all of them finish
print('List comprehension has finished and results are gathered!')
print(results)
This produces the following output:
function called for x=5
function called for x=4
function called for x=2
function called for x=0
function called for x=6
function called for x=1
function called for x=7
function called for x=3
function called for x=8
function called for x=9
function called for x=10
function called for x=12
function called for x=11
function called for x=15
function called for x=13
function called for x=14
function called for x=16
function called for x=17
function called for x=18
function called for x=19
List comprehension has finished and results are gathered!
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]
Note that all function calls ran in parallel, so the prints are shuffled; however, the original order is preserved in the resulting list.