Django - collecting info per request - python

I want to collect data in each request (single request can cause several changes), and process the data at the end of the request.
So I'm using a singleton class to collect the data, and I'm processing the data in it on the request_finished signal. Should this work, or should I expect data loss or other issues?
singleton class:
class Singleton(type):
    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
        return cls._instances[cls]


class DataManager(object, metaclass=Singleton):
    ....
Using it in another signal handler:
item_created = DataManager().ItemCreated(item)
DataManager().add_event_to_list(item_created)
The request_finished signal handler:
@receiver(request_finished, dispatch_uid="request_finished")
def my_request_finished_handler(sender, **kwargs):
    DataManager().process_data()

A singleton means you have one single instance per process. A typical production Django setup has one or more front-end servers running multiple long-lived Django processes, each of them serving whatever requests come in; a single process may even serve requests from concurrent threads. In this context, subsequent requests from the same user can be served by different processes/threads, and any long-lived 'global' object will be shared by all requests served by the current process. The net result is, as Daniel Roseman rightly comments, that "singletons will never do what you want in a multi-process multi-user environment like Django".
If you want to collect data per request-response cycle, your best bet is to store it on the request object itself, using a middleware to initialize the collector when request processing starts and to do something with the collected data when it ends. This of course requires passing the request along... which is admittedly a bit clumsy.
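A minimal sketch of that middleware approach, assuming Django's new-style middleware; the collector class and the request attribute name are illustrative:

class RequestDataCollector:
    # Illustrative per-request collector; the name and API are assumptions.
    def __init__(self):
        self.events = []

    def add_event(self, event):
        self.events.append(event)

    def process(self):
        ...  # whatever post-processing you need on the collected events


class CollectorMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # initialize the collector at the start of the request/response cycle
        request.data_collector = RequestDataCollector()
        response = self.get_response(request)
        # process the collected data at the end of the cycle
        request.data_collector.process()
        return response

Views (and anything else that receives the request) can then call request.data_collector.add_event(...) during the cycle.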
A workaround here could be to connect some of your per-request "collector" object's methods as signal handlers, taking care to set dispatch_uid properly so you can disconnect them before sending the response, and preferably using weak references to avoid memory leaks.
NB: if you want to collect per-user information, that's what the session framework is for, but I assume you already understood this.

Edited my answer; the first one was not well thought out.
After further thought, it's not a good idea to use the request in the signals, neither via a global nor by passing it down.
Django has two main entry paths: views and management commands. Views serve web requests and therefore have a 'request' object; commands are run from the console and do not. Ideally, your models (and thus your signals) should support both paths (for example, for data migrations during a project's lifespan), so it's inherently incorrect to tie your signals to the request object.
It would be better to use something like thread-local storage and make sure your thread-global class does not rely on anything from the request (see the sketch after the link below). For example:
Is there a way to access the context from everywhere in Django?
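A minimal sketch of the thread-local approach, assuming the server handles one request per thread; the module and attribute names are illustrative:

import threading

_local = threading.local()

def get_collector():
    # lazily create a per-thread list of collected events
    if not hasattr(_local, 'events'):
        _local.events = []
    return _local.events

def clear_collector():
    # call this at the end of each request (e.g. from a middleware or the
    # request_finished handler) so data never leaks into the next request
    # served by the same thread
    _local.events = []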

Why isn't the APScheduler class serialized?

I'm trying to make a singleton using shared memory to make sure that the class object will be the same between all processes, but for this I need to serialize it:
import pickle
from os import path

class Singleton(type):
    filename = path.basename(path.abspath(__file__)).split('.')[0]

    def __call__(cls, *args, **kwargs):
        instance = super(Singleton, cls).__call__(*args, **kwargs)
        pickled = pickle.dumps(instance)

class SingleScheduler(APScheduler, metaclass=Singleton):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
TypeError: Schedulers cannot be serialized. Ensure that you are not passing a scheduler instance as an argument to a job, or scheduling an instance method where the instance contains a scheduler as an attribute.
Is there some way to serialize this?
P.S. Yes, I know that I could just use Redis and pass parameters and status through it, but I want to try to do this without any extra dependencies...
The scheduler was explicitly made unserializable because it would not serialize correctly anyway: it contains synchronization primitives and references to the job store. A lot of people ran into pickling errors because of this, so the 3.7.0 release added this explicit error message.
With enough effort, all of this could be worked around, with the exception of the memory job store. If your intention is to create multiple copies of the scheduler sharing a single external store, that is not going to work on APScheduler 3.x because it was not designed for distributed use. The 4.0 release will rectify this, and it is the most prominent feature there. If, on the other hand, you were using a memory-backed job store, serializing would just create multiple copies of the jobs, at which point you're better off creating a new scheduler for each process anyway.
If you want to share a scheduler between multiple processes, the way to do it on APScheduler 3.x is to have the scheduler run in a dedicated process and then communicate with said process over either RPyC, some web server or similar IPC or network mechanism.
FAQ entry on scheduler sharing
Example on how to use RPyC to provide a shared scheduler
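A minimal sketch of the dedicated-process approach, loosely following the RPyC example linked above; the service name, port, and exposed methods are illustrative, and jobs should be passed as textual references (e.g. 'mymodule:myfunc') rather than as live objects:

# scheduler_server.py -- run the scheduler in its own process and expose it over RPyC
import rpyc
from rpyc.utils.server import ThreadedServer
from apscheduler.schedulers.background import BackgroundScheduler


class SchedulerService(rpyc.Service):
    def exposed_add_job(self, func, *args, **kwargs):
        # func is a textual reference so it can cross the process boundary
        return scheduler.add_job(func, *args, **kwargs)

    def exposed_remove_job(self, job_id, jobstore=None):
        scheduler.remove_job(job_id, jobstore)


if __name__ == '__main__':
    scheduler = BackgroundScheduler()
    scheduler.start()
    server = ThreadedServer(SchedulerService, port=12345,
                            protocol_config={'allow_public_attrs': True})
    try:
        server.start()
    finally:
        scheduler.shutdown()

Web processes would then connect with rpyc.connect('localhost', 12345) and call conn.root.add_job(...) instead of owning a scheduler themselves.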

Overwriting class variable and concurrent Flask requests

I'm running a Python Flask server to perform tricky algorithms, one of which assigns cables to tubes.
from typing import List

class Tube:
    max_capacity = 5
    cables: List[str]

    def has_capacity(self):
        return len(self.cables) < self.max_capacity
The max capacity was always 5, but now there's a new customer that actually has tubes that can fit 6 cables.
When I receive a request, I now just set Tube.max_capacity = request.args.get('max_capacity', 5). Then each instance of Tube will have the correct setting.
I was wondering if this will keep working if there are multiple requests being handled at the same time?
Are the Flask processes (I use Gunicorn as the WSGI server) all separate from each other, so that this is safe to do? I don't want to end up with strange bugs because the max capacity changed halfway through a request because another request came in.
EDIT:
I tried this out and it appears to work as intended:
import time
from random import randint

@app.route('/concurrency')
def concurrency():
    my_value = randint(0, 100)
    Concurrency.value = my_value
    time.sleep(8)
    return f"My value: {my_value} should be equal to Concurrency.value {Concurrency.value}"


class Concurrency:
    value = 10
Still, I want to know more about how multiple Flask/Gunicorn requests work to be certain.
WSGI applications are typically served by multiple processes, possibly on different servers, and subsequent requests from the same user are handled by whichever process is available first. IOW: you do NOT want to change any module-level or class-level variable on a per-request basis; this is *guaranteed* to mess up everything.
It's impossible to tell you exactly how to solve the issue without much more context, but in all cases, you'll have to rethink your design.
EDIT:
how do processes behave? If one of them sets the value, does another process see that value as well?
Of course not - each process is totally isolated from the others - so changing a module-level variable or class attribute will only affect the current process. But since processes are not tied to clients (which process handles a given request is totally unpredictable), such a change made in one process will not necessarily be seen by the next request if it's served by another process. AND:
Or, is a process re-used, and then still has the value from the previous request?
Processes are of course reused, but that doesn't mean the same process will be reused for the next request from the same user - and this is the second part of the issue: when serving another user, that process will still use the "updated" max_capacity value from the previous user.
IOW, what you're doing is guaranteed to mess things up for all your users. That's why we use external (out-of-process) means to store and share per-user data between requests - either sessions (for volatile data) or a database (for permanent storage).
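A minimal sketch of the per-request alternative the answer implies: read the value from the request and pass it to the objects that need it, instead of mutating the class attribute. The route, helper names, and tube count are illustrative:

from typing import List
from flask import Flask, jsonify, request

app = Flask(__name__)

class Tube:
    def __init__(self, max_capacity: int = 5):
        self.max_capacity = max_capacity
        self.cables: List[str] = []

    def has_capacity(self) -> bool:
        return len(self.cables) < self.max_capacity

@app.route('/assign')
def assign():
    # request.args values are strings, so convert explicitly
    max_capacity = int(request.args.get('max_capacity', 5))
    tubes = [Tube(max_capacity=max_capacity) for _ in range(3)]
    return jsonify(capacities=[t.max_capacity for t in tubes])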

Django Threading Structure

First of all: yes, I checked and googled this topic, but couldn't find anything that gives me a clear answer to my question. I am a beginner in Django and am studying its documentation, where I read about the thread-safety considerations for the render method of nodes for template tags (here is the link to the documentation: Link). My question concerns the statement that once the node is parsed, the render method for that node might be called multiple times. I am confused about whether this refers to the template tag being used in several places in the same document, for the same user, within a single instance on the server, or to the template tag being used for multiple requests coming from users all around the world sharing the same Django instance in memory. If it's the latter, doesn't Django create a new instance at the server level for every new user request, with separate resources for every user in memory, or am I wrong about this?
It's the latter.
A WSGI server usually runs a number of persistent processes, and in each process it runs a number of threads. While some automatic scaling can be applied, the number of processes and threads is more or less constant and determines how many concurrent requests Django can handle. The days when each request would spawn a new CGI process are long gone; in most cases persistent processes are much more efficient.
Each process has its own memory, and the communication between processes is usually handled by the database, the cache etc. They can't communicate directly through memory.
Each thread within a process shares the same memory. That means that any object that is not locally scoped (e.g. only defined inside a function), is accessible from the other threads. The cached template loader parses each template once per process, and each thread uses the same parsed nodes. That also means that if you set e.g. self.foo = 'bar' in one thread, each thread will then read 'bar' when accessing self.foo. Since multiple threads run at the same time, this can quickly become a huge mess that's impossible to debug, which is why thread safety is so important.
As the documentation says, as long as you don't store data on self, but put it into context.render_context, you should be fine.
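A minimal sketch of the context.render_context pattern the documentation recommends, in the spirit of Django's own cycle-tag example; the node class and the counter it keeps are illustrative:

from django import template

class CounterNode(template.Node):
    # Illustrative node that counts its own renders without touching self.
    def render(self, context):
        # Per-render state lives in render_context, keyed by the node itself,
        # because self is shared by every thread rendering this template.
        if self not in context.render_context:
            context.render_context[self] = 0
        context.render_context[self] += 1
        return str(context.render_context[self])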

How to handle local long-living objects in WSGI env

INTRO
I've recently switched to Python, after about 10 years of PHP development and habits.
E.g. in Symfony2, every request to the server (Apache, for instance) has to load e.g. the container class and instantiate it, to construct the "rest" of the objects.
As far as I understand (I hope) Python's WSGI environment, an app is created once, and until that app is shut down, every request just calls its methods/functions.
This means that I could have e.g. one instance of some class that is accessed every time a request is dispatched, without having to instantiate it in every request. Am I right?
QUESTION
I want to have one instance of a class, since the call to __init__ is very expensive (in both computation and resource locking). In PHP, instantiating this in every request degrades performance. Am I right that with Python's WSGI I can instantiate this once, on app startup, and use it across requests? If so, how do I achieve this?
WSGI is merely a standardized interface that makes it possible to build the various components of a web-server architecture so that they can talk to each other.
Pyramid is a framework whose components are glued together through WSGI.
Pyramid, like other WSGI frameworks, makes it possible to choose the actual server part of the stack, like gunicorn, Apache, or others. That choice is for you to make, and there lies the ultimate answer to your question.
What you need to know is whether your server is multi-threaded or multi-process. In the latter case, it's not enough to check whether a global variable has been instantiated in order to initialize costly resources, because subsequent requests might end up in separate processes, that don't share state.
If your model is multi-threaded, then you might indeed rely on global state, but be aware of the fact that you are introducing a strong dependency in your code. Maybe a singleton pattern coupled with dependency-injection can help to keep your code cleaner and more open to change.
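A minimal sketch of per-process initialization in a multi-threaded setup: create the expensive object lazily, once per process, behind a lock, and hand it out through a function instead of re-instantiating it in every request. ExpensiveService and the module layout are illustrative:

import threading

class ExpensiveService:
    # Illustrative stand-in for an object that is costly to construct.
    def __init__(self):
        ...  # expensive setup, runs once per process

_service = None
_service_lock = threading.Lock()

def get_service():
    # double-checked locking: cheap test first, then create under the lock
    global _service
    if _service is None:
        with _service_lock:
            if _service is None:
                _service = ExpensiveService()
    return _service

Remember that each worker process still gets its own copy, so this only avoids re-initialization per request, not per process.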
The best method I found was mentioned (and I missed it earlier) in Pyramid docs:
From Pyramid Docs#Startup
Note that an augmented version of the values passed as **settings to the Configurator constructor will be available in Pyramid view callable code as request.registry.settings. You can create objects you wish to access later from view code, and put them into the dictionary you pass to the configurator as settings. They will then be present in the request.registry.settings dictionary at application runtime.
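A minimal sketch of that approach; ExpensiveService and the route/view names are illustrative:

from pyramid.config import Configurator
from pyramid.response import Response


class ExpensiveService:
    # Illustrative stand-in for an object that is costly to construct.
    def __init__(self):
        ...  # expensive setup, done once at application startup


def home_view(request):
    # the object created at startup is reachable from any view
    service = request.registry.settings['expensive_service']
    return Response('service id: %s' % id(service))


def main(global_config, **settings):
    settings['expensive_service'] = ExpensiveService()  # created once per process
    config = Configurator(settings=settings)
    config.add_route('home', '/')
    config.add_view(home_view, route_name='home')
    return config.make_wsgi_app()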
There are a number of ways to do this in Pyramid, depending on what you want to accomplish in the end. It might be useful to look closely at the Pyramid/SQLAlchemy tutorial as an example of how to handle an expensive initialization (database connection and metadata setup) and then pass that into the request-handling engine.
Note that in the referenced link, the important part for your question is the __init__.py file's handling of initialize_sql and the subsequent creation of DBSession.

Is there any way to make an asynchronous function call from Python [Django]?

I am creating a Django application that does various long computations with uploaded files. I don't want to make the user wait for the file to be handled - I just want to show the user a page reading something like 'file is being parsed'.
How can I make an asynchronous function call from a view?
Something that may look like that:
def view(request):
    ...
    if form.is_valid():
        form.save()
        async_call(handle_file)
    return render_to_response(...)
Rather than trying to manage this via subprocesses or threads, I recommend you separate it out completely. There are two approaches: the first is to set a flag in a database table somewhere, and have a cron job running regularly that checks the flag and performs the required operation.
The second option is to use a message queue. Your file upload process sends a message on the queue, and a separate listener receives the message and does what's needed. I've used RabbitMQ for this sort of thing, but others are available.
Either way, your user doesn't have to wait for the process to finish, and you don't have to worry about managing subprocesses.
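A minimal sketch of the first approach (a flag in a database table plus a periodic job); the model, fields, app name, and management-command name are illustrative, and handle_file is the function from the question. The command would be run regularly from cron:

# models.py
from django.db import models

class UploadedFile(models.Model):
    file = models.FileField(upload_to='uploads/')
    processed = models.BooleanField(default=False)


# management/commands/process_uploads.py
from django.core.management.base import BaseCommand

from myapp.models import UploadedFile  # 'myapp' is an illustrative app name

class Command(BaseCommand):
    help = "Process uploaded files that have not been handled yet"

    def handle(self, *args, **options):
        for upload in UploadedFile.objects.filter(processed=False):
            handle_file(upload)  # the long-running work from the question
            upload.processed = True
            upload.save(update_fields=['processed'])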
I have tried to do the same and failed after multiple attempts, due to the nature of Django and of asynchronous calls in general.
The solution I came up with, which could be a bit over the top for you, is to have another asynchronous server in the background processing message queues fed from the web requests, and pushing chunked JavaScript which gets parsed directly by the browser in an asynchronous way (i.e. Ajax).
Everything is made transparent to the end user via a mod_proxy setting.
Unless you specifically need to use a separate process, which seems to be the gist of the other questions S.Lott is indicating as duplicates of yours, the threading module from the Python standard library (documented here) may offer the simplest solution. Just make sure that handle_file does not access any globals that might get modified, nor, especially, modify any globals itself; ideally it should communicate with the rest of your process only through Queue instances; etc., etc. - all the usual recommendations about threading ;-).
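A minimal sketch of that threading approach, using Python 3 module names; the queue, worker, and async_call names are illustrative, and handle_file is the question's function:

import queue
import threading

_jobs = queue.Queue()

def _worker():
    # single background thread that executes queued calls one at a time
    while True:
        func, args, kwargs = _jobs.get()
        try:
            func(*args, **kwargs)
        finally:
            _jobs.task_done()

threading.Thread(target=_worker, daemon=True).start()

def async_call(func, *args, **kwargs):
    # returns immediately; the worker thread does the actual processing
    _jobs.put((func, args, kwargs))

The view from the question can then call async_call(handle_file) (optionally passing whatever arguments handle_file needs) and return the "file is being parsed" page right away.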
threading will break runserver if I'm not mistaken. I've had good luck with multiprocessing in request handlers with mod_wsgi and runserver. Maybe someone can enlighten me as to why this is bad:
def _bulk_action(action, objs):
    # mean ponies here

def bulk_action(request, t):
    ...
    objs = model.objects.filter(pk__in=pks)
    if request.method == 'POST':
        objs.update(is_processing=True)
        from multiprocessing import Process
        p = Process(target=_bulk_action, args=(action, objs))
        p.start()
        return HttpResponseRedirect(next_url)
    context = {'t': t, 'action': action, 'objs': objs, 'model': model}
    return render_to_response(...)
http://docs.python.org/library/multiprocessing.html
New in 2.6
