Check for atomic context - python

One of my methods doesn't work when run on atomic context. I want to ask Django if it's running a transaction.
The method can create a thread or a process and saves the result to database. This is a bit odd but there is a huge performance benefit when a process can be used.
I find that especially processes are a bit sketchy with Django. I know that Django will raise an exception if the method chooses to save the results in a process and the method is run on atomic context.
If I can check for an atomic context then I can throw an exception straight away (instead of getting odd errors) or force the method to only create a thread.
I found the is_managed() method but according to this question it's been removed in Django 1.8.

According to this ticket there are a couple ways to detect this: not transaction.get_autocommit() (using a public API) or transaction.get_connection().in_atomic_block (using a private API).

Related

Two flask apps using one database

Hello I don't think this is in the right place for this question but I don't know where to ask it. I want to make a website and an api for that website using the same SQLAlchemy database would just running them at the same time independently be safe or would this cause corruption from two write happening at the same time.
SQLA is a python wrapper for SQL. It is not it's own database. If you're running your website (perhaps flask?) and managing your api from the same script, you can simply use the same reference to your instance of SQLA. Meaning, when you use SQLA to connect to a database and save to a variable, what is really happening is it saves the connection to a variable, and you continually reference that variable, as opposed to the more inefficient method of creating a new connection every time. So when you say
using the same SQLAlchemy database
I believe you are actually referring to the actual underlying database itself, not the SQLA wrapper/connection to it.
If your website and API are not running in the same script (or even if they are, depending on how your API handles simultaneous requests), you may encounter a race condition, which, according to Wikipedia, is defined as:
the condition of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events. It becomes a bug when one or more of the possible behaviors is undesirable.
This may be what you are referring to when you mentioned
would this cause corruption from two write happening at the same time.
To avoid such situations, when a process accesses a file, (depending on the OS,) check is performed to see if there is a "lock" on that file, and if so, the OS refuses to open that file. A lock is created when a process accesses a file (and there is no other process holding a lock on that file), such as by using with open(filename): and is released when the process no longer holds an open reference to the file (such as when python execution leaves the with open(filename): indentation block.) This may be the real issue you might encounter when using two simultaneous connections to a SQLite db.
However, if you are using something like MySQL, where you connect to a SQL server process, and NOT a file, since there is no direct access to a file, there will be no lock on the database, and you may run in to that nasty race condition in the following made up scenario:
Stack Overflow queries the reputation an account to see if it should be banned due to negative reputation.
AT THE EXACT SAME TIME, Someone upvotes an answer made by that account that sets it one point under the account ban threshold.
The outcome is now determined by the speed of execution of these 2 tasks.
If the upvoter has, say, a slow computer, and the "upvote" does not get processed by StackOverflow before the reputation query completes, the account will be banned. However, if there is some lag on Stack Overflow's end, and the upvote processes before the account query finishes, the account will not get banned.
The key concept behind this example is that all of these steps can occur within fractions of a second, and the outcome depends of the speed of execution on both ends.
To address the issue of data corruption, most databases have a system in place that properly order database read and writes, however, there are still semantic issues that may arise, such as the example given above.
Two applications can use the same database as the DB is a separate application that will be accessed by each flask app.
What you are asking can be done and is the methodology used by many large web applications, specially when the API is written in a different framework than the main application.
Since SQL databases are ACID compliant, they have a system in place to queue the multiple read/write requests put to it and perform them in the correct order while ensuring data reliability.
One question to ask though is whether it is useful to write two separate applications. For most flask-only projects the best approach would be to separate the project using blueprints, having a “main” blueprint and a “api” blueprint.

Django Threading Structure

First of all to begin with 'Yes' i checked and googled this topic but can't find anything that gives me a clear answer to my question? I am a beginner in Djagno and studying its documentation where i read about the Thread Safety Considerations for render method of nodes for Templates Tags. Here is the link to the documentation Link. My question lies where it states that Once the node is parsed the render method for that node might be called multiple times i am confused whether it is talking about the use of the template tag in the same document at different places for the same user at the single instance level of the user on the server or the use of the template tag for multiple request coming from users all around the world sharing the same django instance in memory? If its the latter one does't django create a new instance at the server level for every new user request and have separate resources for every user in the memory or am i wrong about this?
It's the latter.
A WSGI server usually runs a number of persistent processes, and in each process it runs a number of threads. While some automatic scaling can be applied, the number of processes and threads is more or less constant, and determines how many concurrent requests Django can handle. The days where each request would create a new CGI process are long gone, and in most cases persistent processes are much more efficient.
Each process has its own memory, and the communication between processes is usually handled by the database, the cache etc. They can't communicate directly through memory.
Each thread within a process shares the same memory. That means that any object that is not locally scoped (e.g. only defined inside a function), is accessible from the other threads. The cached template loader parses each template once per process, and each thread uses the same parsed nodes. That also means that if you set e.g. self.foo = 'bar' in one thread, each thread will then read 'bar' when accessing self.foo. Since multiple threads run at the same time, this can quickly become a huge mess that's impossible to debug, which is why thread safety is so important.
As the documentation says, as long as you don't store data on self, but put it into context.render_context, you should be fine.

Python GIL: is django save() blocking?

My django app saves django models to a remote database. Sometimes the saves are bursty. In order to free the main thread (*thread_A*) of the application from the time toll of saving multiple objects to the database, I thought of transferring the model objects to a separate thread (*thread_B*) using collections.deque and have *thread_B* save them sequentially.
Yet I'm unsure regarding this scheme. save() returns the id of the new database entry, so it "ends" only after the database responds, which is at the end of the transaction.
Does django.db.models.Model.save() really block GIL-wise and release other python threads during the transaction?
Django's save() does nothing special to the GIL. In fact, there is hardly anything you can do with the GIL in Python code -- when it is executed, the thread must hold the GIL.
There are only two ways the GIL could get released in save():
Python decides to switch threads (after sys.getcheckinterval() instructions)
Django calls a database interface routine that is implemented to release the GIL
The second point could be what you are looking for -- a SQL COMMITis executed and during that execution, the SQL backend releases the GIL. However, this depends on the SQL interface, and I'm not sure if the popular ones actually release the GIL*.
Moreover, save() does a lot more than just running a few UPDATE/INSERT statements and a COMMIT; it does a lot of bookkeeping in Python, where it has to hold the GIL. In summary, I'm not sure that you will gain anything from moving save() to a different thread.
UPDATE: From looking at the sources, I learned that both the sqlite module and psycopg do release the GIL when they are calling database routines, and I guess that other interfaces do the same.
Generally you should never have to worry about threads in a Django application. If you're serving your application with Apache, gunicorn or nearly any other server other than the development server, the server will spawn multiple processes and evade the GIL entirely. The exception is if you're using gunicorn with gevent, in which case there will be multiple processes but also microthreads inside those processes -- in that case concurrency helps a bit, but you don't have to manage the threads yourself to take advantage of that. The only case where you need to worry about the GIL is if you're trying to spawn multiple threads to handle a single request, which is not usually a good idea.
The Django save() method does not release the GIL itself, but the database backend will (in most cases the bulk of the time spent in save() will be doing database I/O). However, it's almost impossible to properly take advantage of this in a well-designed web application. Responses from your view should be fast even when done synchronously -- if they are doing too much work to be fast, then use a delayed job with Celery or another taskmaster to finish up the extra work. If you try to thread in your view, you'll have to finish up that thread before sending a response to the client, which in most cases won't help anything and will just add extra overhead.
I think python dont lock anything by itself, but database does.

Set / get objects with memcached

In a Django Python app, I launch jobs with Celery (a task manager). When each job is launched, they return an object (lets call it an instance of class X) that lets you check on the job and retrieve the return value or errors thrown.
Several people (someday, I hope) will be able to use this web interface at the same time; therefore, several instances of class X may exist at the same time, each corresponding to a job that is queued or running in parallel. It's difficult to come up with a way to hold onto these X objects because I cannot use a global variable (a dictionary that allows me to look up each X objects from a key); this is because Celery uses different processes, not just different threads, so each would modify its own copy of the global table, causing mayhem.
Subsequently, I received the great advice to use memcached to share the memory across the tasks. I got it working and was able to set and get integer and string values between processes.
The trouble is this: after a great deal of debugging today, I learned that memcached's set and get don't seem to work for classes. This is my best guess: Perhaps under the hood memcached serializes objects to the shared memory; class X (understandably) cannot be serialized because it points at live data (the status of the job), and so the serial version may be out of date (i.e. it may point to the wrong place) when it is loaded again.
Attempts to use a SQLite database were similarly fruitless; not only could I not figure out how to serialize objects as database fields (using my Django models.py file), I would be stuck with the same problem: the handles of the launched jobs need to stay in RAM somehow (or use some fancy OS tricks underneath), so that they update as the jobs finish or fail.
My best guess is that (despite the advice that thankfully got me this far) I should be launching each job in some external queue (for instance Sun/Oracle Grid Engine). However, I couldn't come up with a good way of doing that without using a system call, which I thought may be bad style (and potentially insecure).
How do you keep track of jobs that you launch in Django or Django Celery? Do you launch them by simply putting the job arguments into a database and then have another job that polls the database and runs jobs?
Thanks a lot for your help, I'm quite lost.
I think django-celery does this work for you. Did you had a look at the tables made by django-celery? I.e. djcelery_taskstate holds all data for a given task like state, worker_id and so on. For periodic tasks there is a table called djcelery_periodictask.
In a Django view you can access the TaskMeta object:
from djcelery.models import TaskMeta
task = TaskMeta.objects.get(task_id=task_id)
print task.status

Are there any concerns I should have about storing a Python Lock object in a Beaker session?

There is a certain page on my website where I want to prevent the same user from visiting it twice in a row. To prevent this, I plan to create a Lock object (from Python's threading library). However, I would need to store that across sessions. Is there anything I should watch out for when trying to store a Lock object in a session (specifically a Beaker session)?
Storing a threading.Lock instance in a session (or anywhere else that needs serialization) is a terrible idea, and presumably you'll get an exception if you try to (since such an object cannot be serialized, e.g., it cannot be pickled). A traditional approach for cooperative serialization of processes relies on file locking (on "artificial" files e.g. in a directory such as /tmp/locks/<username> if you want the locking to be per-user, as you indicate). I believe the wikipedia entry does a good job of describing the general area; if you tell us what OS you're running order, we might suggest something more specific (unfortunately I do not believe there is a cross-platform solution for this).
I just realized that this was a terrible question since locking a lock and saving it to the session takes two steps thus defeating the purpose of the lock's atomic actions.

Categories