Why can't the APScheduler class be serialized? - python

I'm trying to make a singleton using shared memory to make sure that the class object will be the same between all processes, but for this I need to serialize it:
    class Singleton(type):
        filename = path.basename(path.abspath(__file__)).split('.')[0]

        def __call__(cls, *args, **kwargs):
            instance = super(Singleton, cls).__call__(*args, **kwargs)
            pickled = pickle.dumps(instance)  # raises the TypeError below

    class SingleScheduler(APScheduler, metaclass=Singleton):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
TypeError: Schedulers cannot be serialized. Ensure that you are not passing a scheduler instance as an argument to a job, or scheduling an instance method where the instance contains a scheduler as an attribute.
Is there some way to serialize this?
P.S. Yes, I know that I could just use Redis and pass the parameters and status through it, but I want to try to do this without any dependencies...

The scheduler was explicitly made unserializable because it contains synchronization primitives and references to the job store, so it would not have serialized correctly anyway. A lot of people ran into pickling errors because of this, which is why the 3.7.0 release added this explicit error message.
With enough effort, all of this could be worked around, with the exception of the memory job store. If your intention is to create multiple copies of the scheduler sharing a single external store, that is not going to work on APScheduler 3.x because it was not designed for distributed use. The 4.0 release will rectify this, and it is the most prominent feature there. If, on the other hand, you were using a memory-backed job store, serializing would just create multiple copies of the jobs, at which point you're better off creating a new scheduler for each process anyway.
If you want to share a scheduler between multiple processes, the way to do it on APScheduler 3.x is to have the scheduler run in a dedicated process and then communicate with said process over either RPyC, some web server or similar IPC or network mechanism.
See also: the FAQ entry on scheduler sharing, and the example of how to use RPyC to provide a shared scheduler.
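For illustration, here is a minimal sketch of that dedicated-process approach, loosely modelled on the RPyC example in the APScheduler repository; the service name, port, and textual job reference are illustrative, not part of any fixed API:

    # scheduler_server.py - the only process that owns a scheduler instance
    import rpyc
    from rpyc.utils.server import ThreadedServer
    from apscheduler.schedulers.background import BackgroundScheduler

    scheduler = BackgroundScheduler()

    class SchedulerService(rpyc.Service):
        # expose thin wrappers instead of handing out the scheduler itself
        def exposed_add_job(self, func_ref, trigger, **kwargs):
            # func_ref is a textual reference such as 'mymodule:my_task'
            return scheduler.add_job(func_ref, trigger, **kwargs).id

        def exposed_remove_job(self, job_id):
            scheduler.remove_job(job_id)

    if __name__ == '__main__':
        scheduler.start()
        ThreadedServer(SchedulerService, port=12345).start()

Any other process then talks to this server over the connection instead of holding a scheduler of its own:

    import rpyc

    conn = rpyc.connect('localhost', 12345)
    conn.root.add_job('mymodule:my_task', 'interval', seconds=30)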

Related

multiprocessing initargs - how does it work under the hood?

I've assumed that multiprocessing.Pool uses pickle to pass initargs to child processes.
However, I find the following strange:

    value = multiprocessing.Value('i', 1)
    multiprocessing.Pool(initializer=worker, initargs=(value, ))  # works
But this does not work:
    pickle.dumps(value)
throwing:
RuntimeError: Synchronized objects should only be shared between processes through inheritance
Why is that, and how can multiprocessing's initargs bypass it, given that it also uses pickle?
As I understand it, multiprocessing.Value uses shared memory behind the scenes; what is the difference between inheriting it and passing it via initargs? I'm asking specifically about Windows, where the code does not fork, so a new instance of multiprocessing.Value is created.
And if you had instead passed an instance of multiprocessing.Lock(), the error message would have been RuntimeError: Lock objects should only be shared between processes through inheritance. These things can be passed as arguments if you are creating an instance of multiprocessing.Process, which is in fact what is being used when you say initializer=worker, initargs=(value,). The test being made is whether a process is currently being spawned, which is not the case when you already have an existing process pool and are now just submitting some work for it. But why this restriction?
Would it make sense for you to be able to pickle this shared memory to a file and then, a week later, try to unpickle it and use it? Of course not! Python cannot know that you would not be doing anything so foolish, so it places great restrictions on how shared memory and locks can be pickled/unpickled: only for passing to other processes.
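A tiny sketch of the allowed path versus the forbidden one (the initializer body here is just for illustration):

    import multiprocessing

    def init_worker(shared):
        # runs once in each child; the Value arrived via initargs while the
        # worker processes were being spawned (the "inheritance" path)
        with shared.get_lock():
            shared.value += 1

    if __name__ == '__main__':
        value = multiprocessing.Value('i', 1)

        # allowed: the Value is handed to the children at spawn time
        pool = multiprocessing.Pool(2, initializer=init_worker, initargs=(value,))
        pool.close()
        pool.join()
        print(value.value)  # 3: both workers bumped the shared counter

        # not allowed: pickling outside of process creation raises the
        # RuntimeError quoted in the question
        # import pickle; pickle.dumps(value)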

Django - collecting info per request

I want to collect data in each request (single request can cause several changes), and process the data at the end of the request.
So I'm using a singleton class to collect the data, and I'm processing the data in it on the request_finished signal. Should this work, or should I expect data loss or other issues?
singleton class:
    class Singleton(type):
        _instances = {}

        def __call__(cls, *args, **kwargs):
            if cls not in cls._instances:
                cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
            return cls._instances[cls]

    class DataManager(object, metaclass=Singleton):
        ....
using it in another signal:

    item_created = DataManager().ItemCreated(item)
    DataManager().add_event_to_list(item_created)
request finished signal:

    @receiver(request_finished, dispatch_uid="request_finished")
    def my_request_finished_handler(sender, **kwargs):
        DataManager().process_data()
A singleton means you have one single instance per process. A typical production Django setup has one or more front servers running multiple long-lived Django processes, each of them serving any incoming request. FWIW, you could even serve Django from concurrent threads in the same process, AFAICT. In this context, the same user can have subsequent requests served by different processes/threads, and any long-lived 'global' object will be shared by all requests served by the current process. The net result is, as Daniel Roseman rightly comments, that "Singletons will never do what you want in a multi-process multi-user environment like Django".
If you want to collect per-request-response cycle data, your best bet is to store them on the request object itself, using a middleware to initialize the collector on request processing cycle start and do something with collected data on request processing cycle end. This of course requires passing the request all along... which is indeed kind of clumsy.
A workaround here could be to connect some of your per-request "collector" object's methods as signal handlers, taking care of properly setting the "dispatch_uid" so you can disconnect them before sending the response, and preferably using weak references to avoid memory leaks.
NB: if you want to collect per-user information, that's what the session framework is for, but I assume you already understood this.
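To make the middleware approach described above concrete, here is a rough sketch; CollectorMiddleware and DataCollector are invented names standing in for the question's DataManager, not existing Django APIs:

    class DataCollector:
        """Illustrative per-request stand-in for the question's DataManager."""
        def __init__(self):
            self.events = []

        def add_event(self, event):
            self.events.append(event)

        def process_data(self):
            pass  # whatever post-processing is needed at the end of the request

    class CollectorMiddleware:
        def __init__(self, get_response):
            self.get_response = get_response

        def __call__(self, request):
            # one collector per request-response cycle, carried on the request itself
            request.data_collector = DataCollector()
            response = self.get_response(request)
            request.data_collector.process_data()
            return response

Anything that has access to the request (views, other middleware) can then append to request.data_collector without any global state.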
I've edited my answer; the first version was not well thought out.
After further thought, it's not a good idea to use the request in the signals, neither via a global nor by passing it down.
Django has two main paths: Views and Commands. Views are used for web requests, which is generating a 'request' object. Commands are used through the console and do not generate a 'request' object. Ideally, your models (thus, also signals) should be able to support both paths (for example: for data migrations during a project's lifespan). So it's inherently incorrect to tie down your signals to the request object.
It would be better to use something like a thread-local memory space and make sure your thread-global class does not rely on anything from the request. For example:
Is there a way to access the context from everywhere in Django?
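A bare-bones sketch of that thread-local idea, with invented names, just to show the shape of it:

    # collector.py - per-thread storage, independent of the request object
    import threading

    _local = threading.local()

    def get_collector():
        # lazily create one collector list per thread
        if not hasattr(_local, 'events'):
            _local.events = []
        return _local.events

    def reset_collector():
        # threads are reused between requests, so reset at the start of each one
        _local.events = []

Signal handlers append to get_collector(), while a middleware calls reset_collector() when a request starts and processes the collected list when it finishes.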

Objects in Multiprocess Shared Memory?

I have a set of object states that is larger than I think would be reasonable to handle on a 1:1 thread-per-state or process-per-state basis; let's say it looks like this:
    class SubState(object):
        def __init__(self):
            self.stat_1 = None
            self.stat_2 = None
            self.list_1 = []

    class State(object):
        def __init__(self):
            self.my_sub_states = {'a': SubState(), 'b': SubState(), 'c': SubState()}
What I'd like to do is make each of the sub-states under the self.my_sub_states keys shared, and simply access one by grabbing a single lock for the entire sub-state - i.e. self.locks = {'a': multiprocessing.Lock(), 'b': multiprocessing.Lock(), ...} - and releasing it when I'm done. Is there a class I can inherit from to share an entire SubState object under a single Lock?
The actual process workers would be pulling tasks from a queue (I can't pass the sub_states as args into the processes because they don't know which sub_state they need until they get the next task).
Edit: I'd also prefer not to use a manager - managers are atrociously slow (I haven't done the benchmarks, but I'm inclined to think an in-memory database would be faster than a manager if it came down to it).
As the multiprocessing docs state, you've really only got two options for actually sharing state between multiprocessing.Process instances (at least without going to third-party options - e.g. redis):
Use a Manager
Use multiprocessing.sharedctypes
A Manager will allow you to share pure Python objects, but as you pointed out, both read and write access to objects being shared this way is quite slow.
multiprocessing.sharedctypes will use actual shared memory, but you're limited to sharing ctypes objects. So you'd need to convert your SubState object to a ctypes.Struct. Also of note is that each multiprocessing.sharedctypes object has its own lock built-in, so you can synchronize access to each object by taking that lock explicitly before operating on it.
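As a rough illustration of that conversion, assuming the stats fit in doubles and list_1 can be capped at a fixed length (the field layout and names are invented):

    import ctypes
    import multiprocessing

    MAX_LIST = 16  # hypothetical fixed bound for list_1

    class SubStateStruct(ctypes.Structure):
        _fields_ = [
            ('stat_1', ctypes.c_double),
            ('stat_2', ctypes.c_double),
            ('list_1', ctypes.c_double * MAX_LIST),
            ('list_len', ctypes.c_int),
        ]

    if __name__ == '__main__':
        # lock=True (the default) gives the shared object its own built-in lock
        shared = multiprocessing.Value(SubStateStruct, lock=True)

        with shared.get_lock():            # take the built-in lock explicitly
            sub = shared.get_obj()
            sub.stat_1 = 1.5
            sub.list_1[0] = 2.0
            sub.list_len = 1

The shared Value still has to be handed to each worker when the worker process is created (e.g. via Process args or Pool initargs), for the same inheritance reasons discussed in the initargs question above.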

Asynchronous object instantiation

How can I make the following object instantiation asynchronous:
    class IndexHandler(tornado.web.RequestHandler):
        def get(self, id):
            # Async the following
            data = MyDataFrame(id)
            self.write(data.getFoo())
The MyDataFrame returns a pandas DataFrame object and can take some time depending on the file it has to parse.
MyDataFrame() is a synchronous interface; to use it without blocking you need to do one of two things:
Rewrite it to be asynchronous. You can't really make an __init__ method asynchronous, so you'll need to refactor things into a static factory function instead of a constructor. In most cases this path only makes sense if the method depends on network I/O (and not reading from the filesystem or processing the results on the CPU).
Run it on a worker thread and asynchronously wait for its result on the main thread. From the way you've framed the question, this sounds like the right approach for you. I recommend the concurrent.futures package (in the standard library since Python 3.2; available via pip install futures for 2.x).
This would look something like:
    @tornado.gen.coroutine
    def get(self, id):
        data = yield executor.submit(MyDataFrame, id)
        self.write(data.getFoo())
where executor is a global instance of ThreadPoolExecutor.
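Putting the pieces together, a sketch of what the whole handler might look like (the executor size is arbitrary, and MyDataFrame is assumed to be defined elsewhere as in the question):

    from concurrent.futures import ThreadPoolExecutor

    import tornado.gen
    import tornado.web

    executor = ThreadPoolExecutor(max_workers=4)  # module-level, shared by all requests

    class IndexHandler(tornado.web.RequestHandler):
        @tornado.gen.coroutine
        def get(self, id):
            # MyDataFrame runs on a worker thread; the IOLoop stays free meanwhile
            data = yield executor.submit(MyDataFrame, id)
            self.write(data.getFoo())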

python, using multiprocess.Pool to spawn several subprocess calls

I've searched several related posts, but none explicitly answer my query. I'm trying to create a class that will use multiprocessing to distribute jobs to a machine. The 'jobs' are system calls using subprocess and I don't wish for the script to stay connected to the jobs once spawned. I've gotten everything to work using the Process class, but I would like to try the Pool class, and I'm having problems.
My code is here:
https://gist.github.com/2627589
The relevant method is run_queue. You can see that in the Test_pool class I override the run_queue method of my Runner class. However, when I run this, I get an error:

    PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed
My goal is to be able to define a MAX_NUM_CORES that should be kept busy, and continually submit jobs as long as the jobs distributed are not using more than the maximum number of cores defined (e.g. MAX_NUM_CORES). Maybe I'm not using the right design pattern? Suggestions are welcome.
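Without seeing the gist inline, one common way around that PicklingError (bound methods can't be pickled on Python 2) is to keep the callable you submit to the Pool a module-level function and pass it plain picklable arguments; the names below are invented for illustration:

    import multiprocessing
    import subprocess

    MAX_NUM_CORES = 4

    def run_job(cmd):
        # module-level functions pickle fine, unlike bound methods on Python 2
        return subprocess.call(cmd, shell=True)

    class Runner(object):
        def __init__(self):
            self.pool = multiprocessing.Pool(processes=MAX_NUM_CORES)

        def run_queue(self, commands):
            # apply_async returns immediately, so the script does not stay tied
            # to the spawned jobs; the Pool never runs more than MAX_NUM_CORES
            # commands at once
            return [self.pool.apply_async(run_job, (cmd,)) for cmd in commands]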
