I am trying to rewrite an entire project that has been developed with classes. Little by little, the heaviest computational chunks should be parallelized; we clearly have a lot of independent sequential loops. An example with classes that mimics the behaviour is this toy problem (I'm a mathematician obsessed with the p-sums):
class Summer:
    def __init__(self, p):
        self.p = p

    def sum(self):
        return sum(pow(i, -self.p) for i in range(1, 1000000))

total = sum([Summer(p).sum() for p in range(2, 20)])
If I replace the last line by:
from dask.distributed import Client

def psum(p):
    return Summer(p).sum()

client = Client()
A = client.map(psum, range(2, 20))
total = client.submit(sum, A).result()
My runtime is cut by a factor of 4 (the number of cores available on my machine). This ideal behaviour does NOT persist when I use my real classes, which are data intensive (big pandas structures taking up memory). Is there a recommended alternative to dask.distributed? I'm seeing bad slowdowns, which I attribute to data being passed around.
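For what it's worth, a common dask.distributed pattern for that situation is to scatter the large object to the workers once and pass around only the resulting future, so each task receives a lightweight reference instead of a pickled copy. The sketch below assumes the heavy state can be factored out into a single pandas object; big_df, psum and the column name "x" are made-up stand-ins, not part of the real project:

import pandas as pd
from dask.distributed import Client

def psum(df, p):
    # stand-in for a real data-intensive method
    return (1.0 / df["x"] ** p).sum()

if __name__ == "__main__":
    client = Client()
    big_df = pd.DataFrame({"x": range(1, 1000000)})   # placeholder for the real data
    df_ref = client.scatter(big_df, broadcast=True)   # ship the big object to the workers once
    futures = client.map(psum, [df_ref] * 18, range(2, 20))
    total = client.submit(sum, futures).result()
    print(total)

Whether this helps depends on whether the slowdown really comes from shipping the data with every task.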
Related
I don't have any formal training in programming, but I routinely come across this question when I am making classes and running individual methods of that class in sequence. Which is better: saving results as class variables, or returning them and using them as inputs to subsequent method calls? For example, here is a class where the variables are returned and used as inputs:
import pandas as pd

class ProcessData:
    def __init__(self):
        pass

    def get_data(self, path):
        data = pd.read_csv(f"{path}/data.csv")
        return data

    def clean_data(self, data):
        data.set_index("timestamp", inplace=True)
        data.drop_duplicates(inplace=True)
        return data

def main():
    processor = ProcessData()
    temp = processor.get_data("path/to/data")
    processed_data = processor.clean_data(temp)
And here is an example where the results are saved/used to update the class variable:
import pandas as pd

class ProcessData:
    def __init__(self):
        self.data = None

    def get_data(self, path):
        data = pd.read_csv(f"{path}/data.csv")
        self.data = data

    def clean_data(self):
        self.data.set_index("timestamp", inplace=True)
        self.data.drop_duplicates(inplace=True)

def main():
    processor = ProcessData()
    processor.get_data("path/to/data")
    processor.clean_data()
I have a suspicion that the latter method is better, but I could also see instances where the former might have its advantages. I am sure the answer to my question is "it depends", but I am curious in general, what are the best practices?
Sketch the class based on usage, then create it
Instead of inventing classes to make your high level coding easier, tap your heels together and write the high-level code as if the classes already existed. Then create the classes with the methods and behavior that exactly fits what you need.
PEP AS AN EXAMPLE
If you look at several PEPs, you'll notice that the rationale or motivation is given before the details. The rationale and motivation show how the new Python feature is going to solve a problem and how it is going to be used, sometimes with code examples.
Example from PEP 289 – Generator Expressions:
Generator expressions are especially useful with functions like sum(),
min(), and max() that reduce an iterable input to a single value:
max(len(line) for line in file if line.strip())
Generator expressions also address some examples of functionals coded
with lambda:
reduce(lambda s, a: s + a.myattr, data, 0)
reduce(lambda s, a: s + a[3], data, 0)
These simplify to:
sum(a.myattr for a in data)
sum(a[3] for a in data)
The methodology given above is the same as describing the motivation and rationale for a class in terms of its use, because you are writing the code that is actually going to use the class first.
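Applied to the data-processing example above, a usage-first sketch might look something like this; DataPipeline and its chained interface are just one hypothetical outcome of the exercise, not a prescription:

import pandas as pd

# Step 1: write the high-level code you wish you could write.
def main():
    pipeline = DataPipeline("path/to/data")
    cleaned = pipeline.load().clean()
    print(cleaned.head())

# Step 2: only now create the class whose interface makes that code work.
class DataPipeline:
    def __init__(self, path):
        self.path = path
        self.data = None

    def load(self):
        self.data = pd.read_csv(f"{self.path}/data.csv")
        return self  # returning self enables the chained call sketched above

    def clean(self):
        self.data = self.data.set_index("timestamp").drop_duplicates()
        return self.data

if __name__ == "__main__":
    main()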
I am developing a heuristic algorithm to find "good" solutions for an NP (hence CPU-intensive) problem.
I am implementing my solution in Python (I agree it is not the best choice when speed is a concern, but so be it), and I am splitting the workload across many subprocesses, each one in charge of exploring a branch of the space of possible solutions.
To improve performance, I would like to share some information gathered during the execution of each subprocess among all subprocesses.
The "obvious" way to gather this information is to collect it in a dictionary whose keys are (frozen)sets of integers and whose values are lists (or sets) of integers.
Hence the shared dictionary must be both readable and writable from each subprocess, but I can safely expect reads to be far more frequent than writes, because a subprocess will write to the shared dict only when it finds something "interesting" and will read it far more frequently to check whether a certain solution has already been evaluated by other processes (to avoid exploring the same branch twice or more).
I do not expect the size of this dictionary to exceed 10 MB.
At the moment I have implemented the shared dict using an instance of multiprocessing.Manager(), which takes care of handling concurrent access to the shared dictionary out of the box.
However (according to what I have found), this way of sharing data is implemented using pipes between processes, which are a lot slower than plain and simple shared memory (moreover, the dictionary must be pickled before being sent through the pipe and unpickled when it is received).
So far my code looks like this:
# main.py
import multiprocessing as mp
import os

def worker(a, b, c, shared_dict):
    while condition:
        # do things
        # sometimes read from shared_dict to check whether a candidate solution
        # has already been evaluated by another process;
        # if not, evaluate it and store it in shared_dict together with some related info
        ...
    return worker_result

def main():
    with mp.Manager() as manager:
        # setup params a, b, c, ...
        # ...
        shared_dict = manager.dict()
        n_processes = os.cpu_count()
        with mp.Pool(processes=n_processes) as pool:
            async_results = [pool.apply_async(worker, (a, b, c, shared_dict))
                             for _ in range(n_processes)]
            results = [res.get() for res in async_results]
        # gather the overall result from the 'results' list

if __name__ == '__main__':
    main()
To avoid the overhead due to pipes I would like to use shared memory, but it doesn't seem that the Python standard library offers a straightforward way to handle a dictionary in shared memory.
As far as I know the Python standard library offers helpers to store data in shared memory only for standard ctypes (with multiprocessing.Value and multiprocessing.Array) or gives you access to raw areas of shared memory.
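For illustration only, here is a minimal sketch of what those stdlib helpers look like; they cover single ctypes values and flat numeric arrays, not a dict keyed by frozensets, which is exactly the limitation described here:

import multiprocessing as mp

def worker(counter, arr):
    with counter.get_lock():   # Value/Array come with an optional lock
        counter.value += 1
    arr[0] = 42.0              # writes land in shared memory, visible to the parent

if __name__ == "__main__":
    counter = mp.Value("i", 0)  # shared C int
    arr = mp.Array("d", 10)     # shared array of 10 C doubles
    p = mp.Process(target=worker, args=(counter, arr))
    p.start()
    p.join()
    print(counter.value, arr[0])  # -> 1 42.0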
I do not want to implement my own hash table in a raw area of shared memory, since I am an expert neither in hash tables nor in concurrent programming; instead, I am wondering whether there are other, faster solutions to my needs that don't require writing everything from scratch.
For example, I have seen that the ray library allows reading data written to shared memory much faster than going through pipes; however, it seems that you cannot modify a dictionary once it has been serialized and written to a shared-memory area.
Any help?
Unfortunately, objects in Ray's shared memory must be immutable. For mutable state, it is typically recommended that you use actors (see here).
You can do a couple of tricks with actors. For example, you can store object references in your dict if the values are immutable. Then the dict itself won't be in shared memory, but all of its objects would be.
import numpy as np
import ray

@ray.remote
class DictActor:
    def __init__(self):
        self._dict = {}

    def put(self, key, value):
        self._dict[key] = ray.put(value)

    def get(self, key):
        return self._dict[key]

d = DictActor.remote()
ray.get(d.put.remote("a", np.zeros(100)))
ray.get(d.get.remote("a"))  # This result is in shared memory.
I have a generator (or, a list of generators). Let's call them gens
Each generator in gens is a complicated function that returns the next value of a complicated procedure. Fortunately, they are all independent of one another.
I want to call gen.__next__() for each element gen in gens, and return the resulting values in a list. However, multiprocessing is unhappy with pickling generators.
Is there a fast, simple way to do this in Python? I would like it such that gens of length m is mapped to n cores locally on my machine, where n could be larger or smaller than m. Each generator should run on a separate core.
If this is possible, can someone provide a minimal example?
You can't pickle generators. Read more about it here.
There is a blog post which explains it in much more detail. Referring a quote from it:
Let's ignore that problem for a moment and look at what we would need to do to pickle a generator. Since a generator is essentially a souped-up function, we would need to save its bytecode, which is not guaranteed to be backward-compatible between Python versions, and its frame, which holds the state of the generator such as local variables, closures and the instruction pointer. And the latter is rather cumbersome to accomplish, since it basically requires making the whole interpreter picklable. So, any support for pickling generators would require a large number of changes to CPython's core.
Now if an object unsupported by pickle (e.g., a file handle, a socket, a database connection, etc.) occurs in the local variables of a generator, then that generator could not be pickled automatically, regardless of any pickle support for generators we might implement. So in that case, you would still need to provide custom __getstate__ and __setstate__ methods. This problem renders any pickling support for generators rather limited.
He also suggests a solution: use simple iterators.
the best solution to this problem is to rewrite the generators as simple iterators (i.e., ones with a __next__ method). Iterators are easy and space-efficient to pickle because their state is explicit. You would still need to handle objects representing some external state explicitly, however; you cannot get around this.
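As a rough illustration of that suggestion (using a trivial made-up countdown, not code from the question), the same logic can be written once as a generator and once as an iterator whose state lives in instance attributes, which pickle can handle:

import pickle

def countdown_gen(n):          # generator version: state lives in the frame, cannot be pickled
    while n > 0:
        yield n
        n -= 1

class Countdown:               # iterator version: state is explicit
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        value = self.n
        self.n -= 1
        return value

it = Countdown(3)
next(it)                                  # -> 3
clone = pickle.loads(pickle.dumps(it))    # works; pickle.dumps(countdown_gen(3)) raises TypeError
print(list(clone))                        # -> [2, 1]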
Another offered solution (which I haven't tried) suggests the following:
Convert the generator to a class in which the generator code is the __iter__ method
Add __getstate__ and __setstate__ methods to the class, to handle pickling. Remember that you can't pickle file objects, so __setstate__ will have to re-open files as necessary.
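A minimal sketch of that recipe might look as follows, assuming the generator reads lines from a file; LineReader and its attributes are hypothetical names, not from the question. Note that the read position is not preserved here, so restoring it would require extra state (e.g. the result of tell()):

class LineReader:
    def __init__(self, path):
        self.path = path
        self._fh = open(path)

    def __iter__(self):
        # the old generator code lives here
        for line in self._fh:
            yield line.rstrip("\n")

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_fh"]            # file objects cannot be pickled
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._fh = open(self.path)  # re-open the file on unpickling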
If your subtasks are truly parallel (do not rely on any shared state), you can do this with multiprocessing.Pool().
Take a look at
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool
This requires you to make the arguments of pool.map() serializable. You can't pass a generator to your worker, but you can achieve something similar by defining your generator inside the target function and passing in initialization arguments through the multiprocessing library:
import multiprocessing as mp
import time

def worker(value):
    # The generator is defined inside the multiprocessed function
    def gen():
        for k in range(value):
            time.sleep(1)  # Simulate long running task
            yield k

    # Execute the generator
    for x in gen():
        print(x)
        # Do something with x?

if __name__ == "__main__":
    pool = mp.Pool()
    pool.map(worker, [2, 5, 2])
    pool.close()  # No more tasks will be submitted; lets the workers exit when done
    pool.join()   # Wait for all the work to be finished
The output will be:
0
0
0
1
1
1
2
3
4
Note that this solution only really works if you build your generators, then use them only once, as their final state is lost at the end of the worker function.
Keep in mind that any time you want to use multiprocessing, you have to use serializable objects due to the limitations of inter-process communication; this can often prove limiting.
If your process is not CPU bound but instead I/O bound (disk access, network access, etc), you'll have a much easier time using threads.
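For completeness, a minimal thread-based sketch of that idea; fetch() and the URL list are made-up stand-ins for whatever I/O-bound work you actually do:

from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    # the GIL is released while the thread blocks on network I/O
    with urllib.request.urlopen(url) as resp:
        return len(resp.read())

urls = ["https://example.com"] * 3
with ThreadPoolExecutor(max_workers=8) as pool:
    sizes = list(pool.map(fetch, urls))
print(sizes)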
You don't need to pickle the generator, just send an index of the generator to the processing pool.
import multiprocessing

M = len(gens)
N = multiprocessing.cpu_count()

def proc(gen_idx):
    return [r for r in gens[gen_idx]()]

if __name__ == "__main__":
    with multiprocessing.Pool(N) as p:
        for r in p.imap_unordered(proc, range(M)):
            print(r)
Note that I don't call/initialize the generator until within the processing function.
Using imap_unordered will allow you to process the results as each generator completes.
It's quite easy to implement: just don't block the threads synchronously; instead, constantly loop through their states and join them on completion. This template should be good enough to give an idea; self.done always needs to be set last on thread completion and last on thread reuse.
import threading as th
import random
import time

class Gen_thread(th.Thread):
    def is_done(self):
        return self.done

    def get_result(self):
        return self.work_result

    def __init__(self, *args, **kwargs):
        self.g_id = kwargs['id']
        self.kwargs = kwargs
        self.args = args
        self.work_result = None
        self.done = False
        th.Thread.__init__(self)

    def run(self):
        # time.sleep(*self.args) to pass variables
        time.sleep(random.randint(1, 4))
        self.work_result = 'Thread {0} done'.format(self.g_id + 1)
        self.done = True

class Gens(object):
    def __init__(self, n):
        self.n_needed = 0
        self.n_done = 0
        self.n_loop = n
        self.workers_tmp = None
        self.workers = []

    def __iter__(self):
        return self

    def __next__(self):
        # start all worker threads on the first call
        if self.n_needed == 0:
            for w in range(self.n_loop):
                self.workers.append(Gen_thread(id=w))
                self.workers[w].start()
        # every thread has already been joined and its result returned
        if not self.workers and self.n_needed > 0:
            raise StopIteration()
        self.n_needed += 1
        # poll until one more thread has completed
        while self.n_done != self.n_needed:
            for w in range(self.n_loop):
                if self.workers[w].is_done():
                    self.workers[w].join()
                    self.workers_tmp = self.workers[w].get_result()
                    self.workers.pop(w)
                    self.n_loop -= 1
                    self.n_done += 1
                    break
        return self.workers_tmp

if __name__ == '__main__':
    for gen in Gens(4):
        print(gen)
Is it possible to access a fasttext model (gensim) using multithreading?
Currently, I'm trying to load the model once (due to its size and loading time) so that it stays in memory, and then access its similarity functions many thousands of times in a row. I want to do that in parallel, and my current approach uses a wrapper class that loads the model and is then passed to the workers. But it looks like it does not return any results.
The wrapper class, instantiated once:
from os import path

from gensim.models.fasttext import load_facebook_model

class FastTextLocalModel:
    def __init__(self):
        self.model_name = "cc.de.300.bin"
        self.model_path = path.join("data", "models", self.model_name)
        self.fast_text = None

    def load_model(self):
        self.fast_text = load_facebook_model(self.model_path)

    def similarity(self, word1: str = None, word2: str = None):
        return self.fast_text.wv.similarity(word1, word2)
And the Processor class makes use of the FastTextLocalModel methods above:
import concurrent.futures
import multiprocessing

fast_text_instance = FastTextLocalModel()
fast_text_instance.load_model()

with concurrent.futures.ThreadPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
    docs = corpus.get_documents()  # docs is iterable
    processor = ProcessorClass(model=fast_text_instance)
    executor.map(processor.process, docs)
Using max_workers=1 seems to work.
I have to mention that I have no expertise in python multithreading.
There may be useful ideas for you in this prior answer, which may need adaptation for FastText & latest versions of gensim:
https://stackoverflow.com/a/43067907/130288
(The keys are:
(1) even redundantly loading the model in different processes may not use redundant memory, if the key memory-consuming arrays are mmapped and thus automatically shared at the OS level; and
(2) you have to do a little extra trickery to prevent the usual recalculation of normed vectors, which normally happens after loading and before the first similarity operations, and which would destroy the sharing
...but messiness in the FastText code might make these a bit harder there.)
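Adapted very loosely from that linked answer, the general pattern might look like the sketch below. Whether it carries over cleanly to FastText, and how the normed-vector trick is spelled in your gensim version, is exactly the "messiness" mentioned above, so treat every name and path here as an untested assumption:

from gensim.models.fasttext import FastText, load_facebook_model

# one-off conversion: re-save the Facebook .bin model in gensim's native format,
# which stores the big vector arrays as separate files that can be mmapped
load_facebook_model("cc.de.300.bin").save("cc.de.300.gensim")

def load_shared_model():
    # every worker process loads with mmap='r'; the OS then shares the
    # read-only vector arrays between processes instead of copying them
    return FastText.load("cc.de.300.gensim", mmap="r")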
I have some complex class A that computes data (large matrix calculations) while consuming input data from class B.
A itself uses multiple cores. However, when A needs the next chunk of data, it waits for quite some time since B runs in the same main thread.
Since A mainly uses the GPU for computations, I would like to have B collecting data concurrently on the CPU.
My latest approach was:
# every time A needs data
def some_computation_method(self):
    data = B.get_data()
    # start computations with data
...and B looks approximately like this:
import multiprocessing
import time

class B(object):
    def __init__(self, ...):
        ...
        self._queue = multiprocessing.Queue(10)
        loader = multiprocessing.Process(target=self._concurrent_loader)
        loader.start()

    def _concurrent_loader(self):
        while True:
            if not self._queue.full():
                # here: data loading from disk and pre-processing
                # that requires access to instance variables
                # like self.path, self.batch_size, ...
                self._queue.put(data_chunk)
            else:
                # don't eat CPU time if A is too busy to consume
                # the queue at the moment
                time.sleep(1)

    def get_data(self):
        return self._queue.get()
Could this approach be considered a "pythonic" solution?
Since I do not have much experience with Python's multiprocessing module, I've built an easy/simplistic approach; however, it looks kind of "hacky" to me.
What would be a better solution to have a class B loading data from disk concurrently and supplying it via some queue, while the main thread runs heavy computations and consumes data from the queue from time to time?
While your solution is perfectly OK, especially for "small" projects, it has the downside that the threading gets tightly coupled with class B. Hence, if you for some reason wanted to use B in a non-threaded manner, you're out of luck.
I would personally write the class in a thread-safe manner and then call it using threads from outside:
import multiprocessing

class B(object):
    def __init__(self):
        self._queue = multiprocessing.Queue(10)
        ...

if __name__ == '__main__':
    b = B()
    loader = multiprocessing.Process(target=b._concurrent_loader)
    loader.start()
This makes B more flexible, separates dependencies better and is easier to test. It also makes the code more readable by being explicit about the thread creation, as compared to it happening implicitly on class creation.