How to share a cache between multiple processes? - python

I'm using an LRU cache to speed up some rather heavy-duty processing. It works well and speeds things up considerably. However...
When I multiprocess, each process creates its own separate cache and there are 8 copies of the same thing. That doesn't appear to be a problem, until the box runs out of memory and bad things happen as a result...
Ideally I only need a cache size of around 300 items for the application, and 1*300 will fit in the 7GB I have to work with, but the 8*300 just doesn't fit.
How do I get all the processes to share the same cache?

I believe you can use a Manager to share a dict between processes. That should in theory let you use the same cache for all functions.
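A minimal sketch of that Manager approach (assuming Python 3; the names expensive, worker and shared_cache are illustrative, not from the question):
import multiprocessing

def expensive(key):
    return key * key  # stand-in for the heavy-duty computation

def worker(args):
    key, cache = args
    if key not in cache:              # every process sees the same proxied dict
        cache[key] = expensive(key)   # (two processes may still race on the same key)
    return cache[key]

if __name__ == "__main__":
    with multiprocessing.Manager() as manager:
        shared_cache = manager.dict()
        work_items = [(k % 300, shared_cache) for k in range(1000)]
        with multiprocessing.Pool(processes=8) as pool:
            results = pool.map(worker, work_items)
Every access then goes through the manager process, so lookups are slower than a local lru_cache, but there is only ever one copy of the ~300 cached items.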
However, I think a saner logic would be to have one process that responds to queries by looking them up in the cache, and if they are not present then delegating the work to a subprocess, and caching the result before returning it. You could easily do that with
import concurrent.futures
import functools

with concurrent.futures.ProcessPoolExecutor() as e:
    @functools.lru_cache(maxsize=300)
    def work(*args, **kwargs):
        return e.submit(slow_work, *args, **kwargs)
Note that work will return Future objects, which the consumer will have to wait on. The lru_cache will cache the Future objects, so they will be returned automatically; you can call .result() on the same Future more than once, so cache hits still give you the data.
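For example, still inside the with block above (a rough sketch; the argument value is arbitrary):
    f1 = work(42)
    f2 = work(42)           # cache hit: the very same Future comes back
    print(f1 is f2)         # True
    print(f1.result())      # .result() can be called repeatedly
    print(f2.result())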
If you're not using Python 3, you'll have to install backported versions of concurrent.futures and functools.lru_cache.

Pass the shared cache to each process. The parent process can instantiate a single cache and pass it to each process as an argument...
@utils.lru_cache(maxsize=300)
def get_stuff(key):
    """This is the routine that does the stuff which can be cached.
    """
    return Stuff(key)

def process(stuff_obj):
    """This is the routine which multiple processes call to do work with that Stuff
    """
    # get_stuff(key)  <-- Wrong; I was calling the cache from here
    stuff_obj.execute()

def iterate_stuff(keys):
    """This generates work for the processes.
    """
    for key in keys:
        yield get_stuff(key)  # <-- I can call the cache from the parent

def main():
    ...
    keys = get_list_of_keys()
    for result in pool.imap(process, iterate_stuff(keys)):
        evaluate(result)
    ...
This example is simple because I can do the cache lookup before calling the process. Some scenarios might prefer to pass a reference to the cache rather than the value, e.g.:
yield (key, get_stuff)
Katriel's answer put me on the right track and I would have implemented that solution but, silly me, my mistake was even simpler to solve than what he suggested.

Related

Python memory issues - memory doesn't get released after finishing a method

I have a quite complex Python (2.7 on Ubuntu) code base which is leaking memory unexpectedly. To break it down: a method is called repeatedly (and itself calls different methods) and returns a very small object. After the method finishes, the used memory is not released. As far as I know it is not unusual to reserve some memory for later use, but with big enough input my machine eventually consumes all memory and freezes. This does not happen if I run the work in a subprocess with concurrent.futures' ProcessPoolExecutor, so I have to assume it is not my code but some underlying problem?!
Is this a known issue? Might it be a problem in 3rd party libraries I am using (e.g. PyQgis)? Where should I start to search for the problem?
Some more background to eliminate silly reasons (because I am still somewhat of a beginner):
The method uses some global variables, but to my understanding these should only be visible in the file where they are declared, and in any case should be overwritten on the next call of the method?!
To clarify in pseudocode:
def main():
    load input from file
    for x in input:
        result = extra_file.initialization(x)
        # here is the point where memory should get released in my opinion

# extra file
def initialization(x):
    global input
    input = x
    result_container = []
    while not result:
        part_of_result = method1()
        result_container.append(part_of_result)
        if result_container fulfills condition to be the final result:
            result = result_container
    del input
    return result

def method1():
    # do stuff
    method2()
    # do stuff
    return part_of_result

def method2():
    # do stuff with input, not altering it
Numerous different methods and global variables are involved; the global declaration is used to avoid passing five or so input variables through multiple methods that don't even use them.
Should I try using garbage collection? All references should be gone after the method finishes, and Python itself should take care of them?
Definitely try using garbage collection. I don't believe it's a known problem.
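A minimal thing to try is forcing a collection between iterations (a sketch based on the pseudocode above; load_input_from_file is a hypothetical stand-in for "load input from file"):
import gc

def main():
    input_data = load_input_from_file()
    for x in input_data:
        result = extra_file.initialization(x)
        unreachable = gc.collect()   # force a full collection of cyclic garbage
        print("collected %d unreachable objects" % unreachable)
If memory still grows after explicit collections, it is being held by live references (e.g. module-level globals or caches inside a third-party library) rather than by uncollected cycles.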

Memoization of recursive functions in Python, but only for the duration of the top-level call

I am writing a top-down parser which consists of a top-level function that initiates a recursive parse down the text with lower-level functions. Note that the lower-level functions never call the top-level function, but the lower-level functions are mutually recursive.
I noticed that the parser runs somewhat slowly, and I suspect this to be caused by exponential growth in the recursion, because the parser might repeatedly try to parse the same type of object on the same text at the same offset, resulting in wasted effort.
For this reason I want to memoize the lower-level function calls, but after the top-level function returns, I want to clear the memoization cache to release the memory.
That means that if the user calls the top-level function multiple times with the same parameters, the program should actually go through the whole parsing procedure again.
My motivation is that it is unlikely the same text will be parsed at top-level multiple times, so the memory overhead is not worth it (each parse will generate a fairly large cache).
One possible solution is to rewrite all the lower-level functions to take an additional cache argument like this:
def low_level_parse(text, start, cache):
    if (text, start) not in cache:
        # Do something to compute a result
        # ...
        cache[(text, start)] = result
    return cache[(text, start)]
and rewrite all calls to the low-level functions to pass down the cache argument (which is initially set to {} in the top-level function).
Unfortunately there are many low-level parse functions, and each may also call other low-level parse functions many times. Refactoring the code to implement caching this way would be very tedious and error prone.
Another solution would be to use decorators, and I believe this would be best in terms of maintainability, but I don't know how to implement the memoize decorator in such a way that its cache exists only during the top-level function scope.
I also thought of defining the cache as a global variable in my module, and clear it explicitly after returning from the top-level function. This would spare me the need to modify the low-level functions to take the cache argument explicitly, and I could then use a memoize decorator that makes use of the global cache. But I am not sure the global cache would be a good idea if this is used in a multi-threaded environment.
I found this link to Decorators with Arguments which I think is what is needed here:
class LowLevelProxy:
    def __init__(self, cache):
        self.cache = cache

    def __call__(self, f):
        def wrapped_f(*args, **kwargs):
            key = (f, args)  # <== had to remove kwargs as dicts cannot be keys
            if key not in self.cache:
                result = f(*args, **kwargs)
                self.cache[key] = result
            return self.cache[key]
        return wrapped_f
N.B. each function that is wrapped will have its own section in the cache.
You might be able to wrap each of your low-level functions like this:
@LowLevelProxy(cache)
def low_level(param_1, param_2):
    # do stuff
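To get the behaviour asked for (a cache that only lives for one top-level parse), one possible arrangement is to share a single dict between all the wrapped low-level parsers and clear it when the top-level function returns. A sketch, with illustrative names:
cache = {}

@LowLevelProxy(cache)
def low_level_parse(text, start):
    ...  # mutually recursive parsing work goes here

def top_level_parse(text):
    try:
        return low_level_parse(text, 0)
    finally:
        cache.clear()  # drop all memoized results once this parse is done
In a multi-threaded setting the single shared dict would need a lock, or a threading.local() holding one cache per thread.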

Using Python multiprocessing.Pool without returning result objects

I have a large number of CPU-bounded tasks that I want to run in parallel. Most of those tasks will return similar results and I only need to store unique results and count non-unique ones.
Here's how it is currently designed: I use two managed dictionaries - one for results and another one for result counters. My tasks are checking those dictionaries using unique result keys for the results they found and either write into both dictionaries or only increase the counters for non-unique results (if I have to write I acquire the lock and check again to avoid inconsistency).
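In sketch form (the names are illustrative, not my actual code):
from multiprocessing import Manager

def record_result(key, value, results, counters, lock):
    """Store a unique result, or bump the counter for a result seen before."""
    if key in results:                # cheap check without the lock
        with lock:
            counters[key] = counters[key] + 1
        return
    with lock:                        # re-check under the lock so two workers
        if key not in results:        # don't both insert the same key
            results[key] = value
            counters[key] = 1
        else:
            counters[key] = counters[key] + 1

# in the parent:
# manager = Manager()
# results, counters, lock = manager.dict(), manager.dict(), manager.Lock()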
What I am concerned about: since Pool.map actually returns a result list, the results will still pile up in memory until they are garbage collected, even though I never save a reference to them. Even though it would just be millions of Nones (I process results in a different manner and all my tasks simply return None), I cannot rely on specific garbage-collector behaviour, so the program might eventually run out of memory. I still want to keep the nice features of the pool but leave out this built-in result handling. Is my understanding correct, and is my concern valid? If so, are there any alternatives?
Also, now that I've laid it out on paper it looks really clumsy :) Do you see a better way to design such a thing?
Thanks!
Question: I still want to keep nice features of the pool
To remove the returned result from multiprocessing.Pool:
Copy class MapResult from multiprocessing.pool and inherit from mp.pool.ApplyResult.
Add, replace, or comment out the following:
import multiprocessing as mp
from multiprocessing.pool import Pool

class MapResult(mp.pool.ApplyResult):
    def __init__(self, cache, chunksize, length, callback, error_callback):
        super().__init__(cache, callback, error_callback=error_callback)
        ...
        # self._value = [None] * length
        self._value = None
        ...

    def _set(self, i, success_result):
        ...
        if success:
            # self._value[i*self._chunksize:(i+1)*self._chunksize] = result
Create your own class myPool(Pool) that inherits from multiprocessing.Pool.
Copy def _map_async(...) from multiprocessing.Pool.
Add, replace, or comment out the following:
class myPool(Pool):
    def __init__(self, processes=1):
        super().__init__(processes=processes)

    def _map_async(self, func, iterable, mapper, chunksize=None, callback=None,
                   error_callback=None):
        ...
        # if self._state != RUN:
        if self._state != mp.pool.RUN:
        ...
        # return result
Tested with Python: 3.4.2
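A different, less invasive way to get a similar effect (not part of the recipe above) is to keep the stock Pool but consume the results lazily so they never accumulate as one big list. A sketch, with an illustrative task function:
import multiprocessing as mp

def task(item):
    # ... CPU-bound work; findings are recorded in the managed dicts ...
    return None

if __name__ == "__main__":
    with mp.Pool(processes=8) as pool:
        for _ in pool.imap_unordered(task, range(10000000), chunksize=100):
            pass  # each returned None is discarded as soon as it arrives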

Python Dependency Injection for Lazy Callables

In programming for fun, I've noticed that managing dependencies feels like a boring chore that I want to minimize. After reading this, I've come up with a super trivial dependency injector, whereby the dependency instances are looked up by a string key:
def run_job(job, args, instance_keys, injected):
    args = list(args) + [injected[key] for key in instance_keys]
    return job(*args)
This cheap trick works since calls in my program are always lazily defined (where the function handle is stored separately from its arguments) in an iterator, e.g.:
jobs_to_run = [[some_func, ("arg1", "arg2"), ("obj_key",)], [other_func, (), ()]]
The reason is a central main loop that must schedule all events. It has a reference to all dependencies, so the injection for "obj_key" can be passed in a dict object, e.g.:
# inside main loop
injection = {"obj_key": injected_instance}
for (callable, with_args, and_dependencies) in jobs_to_run:
    run_job(callable, with_args, and_dependencies, injection)
So when an event happens (user input, etc.), the main loop may call update() on a particular object which reacts to that input, and which in turn builds a list of jobs for the main loop to schedule when there are resources. To me it is cleaner to key-reference any dependencies for someone else to inject rather than having all objects form direct relationships.
Because I am lazily defining all callables (functions) for a game clock engine to run of its own accord, the above naive approach worked with very little added complexity. Still, there is a code smell in having to reference objects by strings. At the same time, it felt just as smelly to pass dependencies around explicitly, and constructor or setter injection would be overkill, as would perhaps most large dependency injection libraries.
For the special case of injecting dependencies in callables that are lazily defined, are there more expressive design patterns in existence?
I've noticed that managing dependencies feels like a boring chore that I want to minimize.
First of all, you shouldn't assume that dependency injection is a means to minimize the chore of dependency management. It doesn't go away, it is just deferred to another place and time and possibly delegated to someone else.
That said, if what you are building is going to be used by others, it would be wise to include some form of version checking in your 'injectables' so that your users have an easy way to check whether their version matches the expected one.
are there more expressive design patterns in existence?
Your method, as I understand it, is essentially the Strategy pattern: the job's code (the callable) relies on calling methods on one of several concrete objects. The way you do it is perfectly reasonable - it works and is efficient.
You may want to formalize it a bit more to make it easier to read and maintain, e.g.
from collections import namedtuple

Job = namedtuple('Job', ['callable', 'args', 'strategies'])

def run_job(job, using=None):
    strategies = {k: using[k] for k in job.strategies}
    return job.callable(*job.args, **strategies)

jobs_to_run = [
    Job(callable=some_func, args=(1, 2), strategies=('A', 'B')),
    Job(callable=other_func, ...),
]

strategies = {"A": injected_strategy, ...}
for job in jobs_to_run:
    run_job(job, using=strategies)

# actual job
def some_func(arg1, arg2, A=None, B=None):
    ...
As you can see the code still does the same thing, but it is instantly more readable, and it concentrates knowledge about the structure of the Job() objects in run_job. Also, the call to a job function like some_func will fail if the wrong number of arguments is passed, and the job functions are easier to code and debug thanks to their explicitly listed and named arguments.
About the strings: you could just make them constants in a dependencies.py file and use those constants.
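For instance (a hypothetical dependencies.py; the constant names are made up, the rest mirrors the question's example):
# dependencies.py
OBJ_KEY = "obj_key"
CLOCK = "clock"

# elsewhere
from dependencies import OBJ_KEY
jobs_to_run = [[some_func, ("arg1", "arg2"), (OBJ_KEY,)], [other_func, (), ()]]
injection = {OBJ_KEY: injected_instance}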
A more robust option, still with little overhead, would be to use a dependency injection framework such as Injectable:
@autowired
def job42(some_instance: Autowired("SomeInstance", lazy=True)):
    ...

# some_instance is autowired to job42 calls and
# it will be automatically injected for you
job42()
Disclosure: I am the project maintainer.

Is filter thread-safe

I have a thread which is updating a list called l. Am I right in saying that it is thread-safe to do the following from another thread?
filter(lambda x: x[0] == "in", l)
If it's not thread-safe, is this then the correct approach:
import threading
import time
import Queue

class Logger(threading.Thread):
    def __init__(self, log):
        super(Logger, self).__init__()
        self.log = log
        self.data = []
        self.finished = False
        self.data_lock = threading.Lock()

    def run(self):
        while not self.finished:
            try:
                with self.data_lock:
                    self.data.append(self.log.get(block=True, timeout=0.1))
            except Queue.Empty:
                pass

    def get_data(self, cond):
        with self.data_lock:
            d = filter(cond, self.data)
        return d

    def stop(self):
        self.finished = True
        self.join()
        print("Logger stopped")
where the get_data(self, cond) method is used to retrieve a small subset of the data in the self.data in a thread safe manner.
First, to answer the question in the title: filter is just a function, so its thread safety depends on the data structure you use it with.
As pointed out in the comments already, list operations themselves are thread-safe in CPython and protected by the GIL, but that is arguably only an implementation detail of CPython that you shouldn't really rely on. Even if you could rely on it, the thread safety of individual list operations is probably not the kind of thread safety you actually mean:
The problem is that iterating over a sequence with filter is in general not an atomic operation. The sequence could be changed during iteration. Depending on the data-structure underlying your iterator this might cause more or less weird effects. One way to overcome this problem is by iterating over a copy of the sequence that is created with one atomic action. Easiest way to do this for standard sequences like tuple, list, string is with the slice operator like this:
filter(lambda x: x[0] == "in", l[:])
Apart from this not necessarily being thread-safe for other data-types, there's one problem with this though: it's only a shallow copy. As your list's elements seem to be list-like as well, another thread could in parallel do del l[1000][:] to empty one of the inner lists (which are pointed to in your shallow copy as well). This would make your filter expression fail with an IndexError.
All that said, it's not a shame to use a lock to protect access to your list and I'd definitely recommend it. Depending on how your data changes and how you work with the returned data, it might even be wise to deep-copy the elements while holding the lock and to return those copies. That way you can guarantee that once returned the filter condition won't suddenly change for the returned elements.
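Such a deep-copying get_data could look roughly like this (a sketch, not the original code):
import copy

def get_data(self, cond):
    # drop-in replacement for Logger.get_data: copy matching elements while the
    # lock is held, so later mutation of the shared list (or of its nested lists)
    # cannot affect what the caller received
    with self.data_lock:
        return [copy.deepcopy(item) for item in self.data if cond(item)]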
Regarding your Logger code: I'm not 100% sure how you plan to use this and whether it's critical for you to run several threads on one queue and join them. What looks weird to me is that you never use Queue.task_done() (assuming that self.log is a Queue). Also, your polling of the queue is potentially wasteful. If you don't need the join of the thread, I'd suggest at least turning the lock acquisition around:
class Logger(threading.Thread):
    def __init__(self, log):
        super(Logger, self).__init__()
        self.daemon = True
        self.log = log
        self.data = []
        self.data_lock = threading.Lock()

    def run(self):
        while True:
            l = self.log.get()  # thread will sleep here indefinitely
            with self.data_lock:
                self.data.append(l)
            self.log.task_done()

    def get_data(self, cond):
        with self.data_lock:
            d = filter(cond, self.data)
            # maybe deepcopy d here
        return d
Externally you could still do log.join() to make sure that all of the elements of the log queue are processed.
If one thread writes to a list and another thread reads that list, the two must be synchronized. It doesn't matter for that aspect whether the reader uses filter(), an index or iteration or whether the writer uses append() or any other method.
In your code, you achieve the necessary synchronization using a threading.Lock. Since you only access the list within the context of with self.data_lock, the accesses are mutually exclusive.
In summary, your code is formally correct concerning the list handling between threads. But:
You do access self.finished without the lock, which is problematic. Assigning to that member changes self, i.e. the mapping from the object to its members, so this should be synchronized. In practice it won't hurt here, because True and False are global constants; at worst there is a short delay between setting the state in one thread and seeing it in the other. It remains bad, though, because it is habit-forming.
As a rule, when you use a lock, always document which objects that lock protects, and document which object is accessed by which thread. Then the fact that self.finished is shared and requires synchronization would have been obvious. Making a visual distinction between public functions and data and private ones (beginning with an _underscore, see PEP 8) also helps to keep track of this, and it helps other readers.
A similar issue is your base class. In general, inheriting from threading.Thread is a bad idea. Rather, hold an instance of the thread class as a member and give it a function like self._main_loop to run. The reason is that inheriting says your Logger is a Thread and that all of its base class' public members are also public members of your class, which is probably a much wider interface than you intended.
You should never block while holding a lock. In your code, you block in self.log.get(block=True, timeout=0.1) while holding the lock. During that time, even if nothing actually happens, no other thread will be able to complete a call to get_data(). There is actually just a tiny window between unlocking the mutex and locking it again in which a caller of get_data() does not have to wait, which is very bad for performance. I could even imagine that your question is motivated by the really bad performance this causes. Instead, call log.get(...) without the lock; it shouldn't need one. Then, with the lock held, append the data to self.data and check self.finished, as sketched below.
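In sketch form, that last suggestion applied to the original (joinable, timeout-polling) Logger might look like this:
    def run(self):
        while True:
            try:
                # block on the queue without holding the lock
                item = self.log.get(block=True, timeout=0.1)
            except Queue.Empty:
                item = None
            with self.data_lock:  # lock only around the shared state
                if self.finished:
                    break
                if item is not None:
                    self.data.append(item)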
