Here is a simple example of a Django view with a potential race condition:
# myapp/views.py
from django.contrib.auth.models import User
from my_libs import calculate_points

def add_points(request):
    user = request.user
    user.points += calculate_points(user)
    user.save()
The race condition should be fairly obvious: a user can make this request twice, and the application could potentially execute both requests simultaneously. Each request reads the same initial value of user.points, so whichever save happens last overwrites the other's update.
Suppose the function calculate_points is relatively complicated, and makes calculations based on all kinds of weird stuff that cannot be placed in a single update and would be difficult to put in a stored procedure.
So here is my question: what kind of locking mechanisms are available to Django to deal with situations like this?
Django 1.4+ supports select_for_update. In earlier versions you can execute a raw SQL query such as SELECT ... FOR UPDATE, which, depending on the underlying database, locks the row against any other updates; you can then do whatever you want with that row until the end of the transaction. For example:
from django.contrib.auth.models import User
from django.db import transaction
from my_libs import calculate_points

@transaction.commit_manually
def add_points(request):
    user = User.objects.select_for_update().get(id=request.user.id)
    # you can go back at this point if something is not right
    if user.points > 1000:
        # too many points
        return
    user.points += calculate_points(user)
    user.save()
    transaction.commit()
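For reference, commit_manually was deprecated in Django 1.6 and later removed; on current versions the same idea is written with transaction.atomic, which select_for_update requires anyway. A minimal sketch, assuming the same User.points field and calculate_points helper from the question:

from django.contrib.auth.models import User
from django.db import transaction
from my_libs import calculate_points

@transaction.atomic
def add_points(request):
    # select_for_update() blocks any other transaction that tries to lock
    # this row until the surrounding atomic block commits or rolls back.
    user = User.objects.select_for_update().get(id=request.user.id)
    if user.points > 1000:
        return  # too many points; the lock is released when the transaction ends
    user.points += calculate_points(user)
    user.save()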
As of Django 1.1 you can use the ORM's F() expressions to solve this specific problem.
from django.db.models import F
user = request.user
user.points = F('points') + calculate_points(user)
user.save()
For more details see the documentation:
https://docs.djangoproject.com/en/1.8/ref/models/instances/#updating-attributes-based-on-existing-fields
https://docs.djangoproject.com/en/1.8/ref/models/expressions/#django.db.models.F
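If you don't need the loaded instance afterwards, the same race-free increment can be written as a single queryset update(). A minimal sketch, assuming request.user.pk identifies the row and calculate_points only needs the in-memory user:

from django.contrib.auth.models import User
from django.db.models import F
from my_libs import calculate_points

def add_points(request):
    # The addition happens inside one UPDATE statement in the database,
    # so concurrent requests cannot overwrite each other's increment.
    User.objects.filter(pk=request.user.pk).update(
        points=F('points') + calculate_points(request.user)
    )

One caveat with F(): after user.save() the in-memory user.points still holds the expression rather than the number, so call user.refresh_from_db() (available since Django 1.8) if you need the updated value.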
Database locking is the way to go here. There are plans to add "select for update" support to Django (here), but for now the simplest would be to use raw SQL to UPDATE the user object before you start to calculate the score.
Pessimistic locking is now supported by Django 1.4's ORM when the underlying DB (such as Postgres) supports it. See the Django 1.4a1 release notes.
You have many ways to single-thread this kind of thing.
One standard approach is Update First. You do an update which will seize an exclusive lock on the row; then do your work; and finally commit the change. For this to work, you need to bypass the ORM's caching.
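A hedged sketch of that update-first approach using Django's raw cursor; the table and column names (auth_user, points) are assumptions carried over from the question, and calculate_points is the question's helper:

from django.db import connection, transaction
from my_libs import calculate_points

def add_points(request):
    with transaction.atomic():
        with connection.cursor() as cursor:
            # A no-op UPDATE seizes an exclusive lock on the row, so
            # concurrent requests for the same user are serialized here.
            cursor.execute("UPDATE auth_user SET points = points WHERE id = %s",
                           [request.user.id])
            # Re-read the locked row directly, bypassing the ORM's cached copy.
            cursor.execute("SELECT points FROM auth_user WHERE id = %s",
                           [request.user.id])
            (points,) = cursor.fetchone()
            cursor.execute("UPDATE auth_user SET points = %s WHERE id = %s",
                           [points + calculate_points(request.user), request.user.id])
    # The lock is released when the atomic block commits.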
Another standard approach is to have a separate, single-threaded application server that isolates the Web transactions from the complex calculation.
Your web application can create a queue of scoring requests, spawn a separate process, and then write the scoring requests to this queue. The spawn can be put in Django's urls.py so it happens on web-app startup. Or it can be put into a separate manage.py admin script. Or it can be done "as needed" when the first scoring request is attempted.
You can also create a separate WSGI-flavored web server using Werkzeug which accepts WS requests via urllib2. If you have a single port number for this server, requests are queued by TCP/IP. If your WSGI handler has one thread, then you've achieved serialized single-threading. This is slightly more scalable, since the scoring engine is a WS request and can be run anywhere.
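A minimal sketch of such a single-threaded scoring service with Werkzeug; the port, endpoint and payload format are made up for illustration:

from werkzeug.wrappers import Request, Response
from werkzeug.serving import run_simple

@Request.application
def score_app(request):
    # Because the server runs with a single thread and a single process,
    # scoring requests are handled strictly one at a time.
    user_id = request.form.get("user_id")
    # ... call calculate_points() for user_id and persist the result ...
    return Response("ok")

if __name__ == "__main__":
    # threaded=False and processes=1 give the serialized behaviour described
    # above; pending TCP connections simply wait in the listen queue.
    run_simple("127.0.0.1", 8001, score_app, threaded=False, processes=1)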
Yet another approach is to have some other resource that has to be acquired and held to do the calculation.
A Singleton object in the database. A single row in a dedicated table can be updated with a session ID to seize control, and updated with a session ID of NULL to release control. The essential update has to include a WHERE SESSION_ID IS NULL filter to ensure that the update fails when the lock is held by someone else. This is interesting because it's inherently race-free -- it's a single update, not a SELECT-UPDATE sequence (see the sketch after this list).
A garden-variety semaphore can be used outside the database. Queues (generally) are easier to work with than a low-level semaphore.
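A hedged sketch of that singleton-row lock with the Django ORM; the ScoringLock model and its field names are invented for illustration:

from django.db import models

class ScoringLock(models.Model):
    # Exactly one row of this table ever exists.
    session_id = models.CharField(max_length=64, null=True, default=None)

def try_acquire(session_id):
    # A single UPDATE ... WHERE session_id IS NULL. update() returns the
    # number of rows changed: 1 means we took the lock, 0 means someone
    # else holds it. There is no SELECT-then-UPDATE race.
    return ScoringLock.objects.filter(session_id__isnull=True).update(session_id=session_id) == 1

def release(session_id):
    ScoringLock.objects.filter(session_id=session_id).update(session_id=None)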
This may be oversimplifying your situation, but what about just a JavaScript link replacement? In other words, when the user clicks the link or button, wrap the request in a JavaScript function that immediately disables / "greys out" the link and replaces its text with "Loading..." or "Submitting request..." or something similar. Would that work for you?
Now, you must use:
Model.objects.select_for_update().get(foo=bar)
In my Django (1.9) project I need to construct a table from an expensive JOIN. I would therefore like to store the table in the DB and only redo the query if the tables involved in the JOIN change. As I need the table as a basis for later JOIN operations I definitely want to store it in my database and not in any cache.
The problem I'm facing is that I'm not sure how to determine whether the data in the tables has changed. Connecting to the post_save and post_delete signals of the respective models doesn't seem right, since the models might be updated in bulk via CSV upload and I don't want the expensive query to be fired each time a new row is imported, because the DB table will change right away. My current approach is to check whether the data has changed at regular intervals, which would be perfectly fine for me. For this purpose I use a new thread, which compares the checksums of the involved tables (see code below) to run this task. As I'm not really familiar with multithreading, especially on web servers, I do not know whether this is acceptable. My questions therefore:
Is the threading approach acceptable for running this single task?
Would a Distributed Task Queue like Celery be more appropriate?
Is there any way to disconnect a signal for a certain time after it is received, so that a bulk upload does not trigger the signal over and over again?
This is my current code:
import threading
import time

from django.apps import apps
from .models import SomeModel

def check_for_table_change():
    app_label = SomeModel._meta.app_label

    def join():
        """Join the tables and save the resulting table to the DB."""
        ...

    def get_involved_models(app_label):
        """Get all the models that are involved in the join."""
        ...

    involved_models = get_involved_models(app_label)
    involved_dbtables = tuple(model._meta.db_table for model in involved_models)
    sql = 'CHECKSUM TABLE %s' % ', '.join(involved_dbtables)
    old_checksums = None
    while True:
        # Get the result of the query as named tuples
        # (from_db is the project's own raw-SQL helper, not shown here).
        checksums = from_db(sql, fetch_as='namedtuple')
        if old_checksums is not None:
            # Compare checksums.
            for pair in zip(checksums, old_checksums):
                if pair[0].Checksum != pair[1].Checksum:
                    print('db changed, table is rejoined')
                    join()
                    break
        old_checksums = checksums
        time.sleep(60)

check_tables_thread = threading.Thread(target=check_for_table_change)
check_tables_thread.start()
I'm grateful for any suggestions.
Materialized Views and Postgresql
If you were on PostgreSQL, you could use what's known as a materialized view: you create a view based on your join and it exists almost like a real table. This is very different from a normal view, where the underlying query has to be executed each and every time the view is used. Now the bad news: MySQL does not have materialized views.
If you switched to PostgreSQL, you might even find that materialized views are not needed after all. That's because PostgreSQL can use more than one index per table in a query, so the join that seems slow on MySQL at the moment might be made to run faster simply through better use of indexes. Of course this depends heavily on what your structure is like.
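For illustration, a hedged sketch of how such a view could be created and refreshed from Django on PostgreSQL; the view name and the join itself are placeholders:

from django.db import connection

def create_joined_view():
    with connection.cursor() as cursor:
        # One-off creation; normally this would live in a migration.
        cursor.execute("""
            CREATE MATERIALIZED VIEW IF NOT EXISTS joined_table AS
            SELECT a.*, b.some_column
            FROM myapp_modela a
            JOIN myapp_modelb b ON b.a_id = a.id
        """)

def refresh_joined_view():
    with connection.cursor() as cursor:
        # Re-runs the stored query and replaces the view's contents.
        cursor.execute("REFRESH MATERIALIZED VIEW joined_table")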
Signals vs Triggers
The problem I'm facing is that I'm not sure how to determine whether
the data in the tables have changed. Connecting to the post_save,
post_delete signals of the respective models seems not to be right
since the models might be updated in bulk via CSV upload and I don't
want the expensive query to be fired each time a new row is imported,
because the DB table will change right away.
As you have rightly determined, Django signals aren't the right way. This is the sort of task that is best done at the database level. Since you don't have materialized views, this is a job for triggers. However, there is a lot of hard work involved either way (whether you use triggers or signals).
Is the threading approach acceptable for running this single task?
Why not use Django as a CLI here? That effectively means a Django script (for example a custom management command) invoked by cron, or executed by some other mechanism independently of your website.
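A minimal sketch of such a management command; the command name is made up and the body is a placeholder for your existing join() logic. It could then be scheduled with cron as python manage.py rebuild_join_table:

# myapp/management/commands/rebuild_join_table.py
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Recompute the expensive join and store the result in its own table."

    def handle(self, *args, **options):
        # Placeholder: call your existing join() logic here.
        ...
        self.stdout.write("join table rebuilt")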
Would a Distributed Task Queue like Celery be more appropriate?
Very much so. Each time the data changes, you can fire off a task that does the update of the table.
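A hedged sketch, assuming Celery is already configured for the project; the task body is a placeholder for the existing join() logic:

# myapp/tasks.py
from celery import shared_task

@shared_task
def rebuild_join_table():
    # Placeholder: run the expensive join and store the result table here.
    ...

The CSV import can then call rebuild_join_table.delay() once, after the whole upload has finished, instead of once per imported row.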
Is there any way to disconnect a signal for a certain time after it is received, so that a bulk upload does not trigger the signal over and over again?
Keyword here is 'TRIGGER' :-)
Alternatives.
Having said all that, doing a join and physically populating a table is going to be very, very slow if your table grows to even a few thousand rows. This is because you will need an elaborate query to determine which records have changed (unless you use a separate queue for that). You would then need to insert or update the records in the 'join table'; update/insert is generally slower than retrieval, so as the data grows this becomes progressively worse.
The real solution may be to optimize your queries and/or tables. May I suggest you post a new question with the slow query and also share your table structures?
One of my methods doesn't work when run in an atomic context. I want to ask Django whether it is currently running a transaction.
The method can create a thread or a process and save the result to the database. This is a bit odd, but there is a huge performance benefit when a process can be used.
I find that processes in particular are a bit sketchy with Django. I know that Django will raise an exception if the method chooses to save the results in a process while it is run in an atomic context.
If I can check for an atomic context then I can throw an exception straight away (instead of getting odd errors) or force the method to only create a thread.
I found the is_managed() method but according to this question it's been removed in Django 1.8.
According to this ticket there are a couple of ways to detect this: not transaction.get_autocommit() (using a public API) or transaction.get_connection().in_atomic_block (using a private API).
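A minimal sketch of how that check could be wrapped up in practice; the helper and the calling function are made-up names:

from django.db import transaction

def in_transaction(using=None):
    # Public-API variant: atomic() turns autocommit off for the connection.
    if not transaction.get_autocommit(using=using):
        return True
    # Private-API variant, kept as a belt-and-braces fallback.
    return transaction.get_connection(using).in_atomic_block

def save_result_in_process(result):
    if in_transaction():
        # Fail fast instead of producing the odd errors described above.
        raise RuntimeError("must not be called inside an atomic block")
    # ... spawn the process and save the result ...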
My development stack is Django/Python (that includes Django REST framework).
I am currently looking for ways to do multiple distinct API calls.
client.py
import requests

def submit_order(list_of_orders):
    # submit each element in list_of_orders in a thread
    for order in list_of_orders:
        try:
            response = requests.post("www.confidential.url.com", data=order)
        except requests.RequestException:
            pass  # retry again
        else:
            if response.status_code != 200:
                pass  # retry again
In the above method I am currently submitting orders one by one; I want to submit all orders at once. Secondly, I want to retry a submission up to x times if it fails.
I currently do not know how best to achieve this.
I am looking for ways that python libraries or Django application provide rather than re-inventing the wheel.
Thanks
As @Selcuk said, you can try django-celery, which is the recommended approach in my opinion, but you will need to do some configuration and read some manuals.
On the other hand, you can try using multiprocessing like this:
from multiprocessing import Pool

def process_order(order):
    # Handle each order here, doing requests.post and then retrying if necessary
    pass

def submit_order(list_of_orders):
    orders_pool = Pool(len(list_of_orders))
    results = orders_pool.map(process_order, list_of_orders)
    # Do something with the results here
It will depend on what you need to get done: if the requests can be processed in the background and your API user can be notified later, just use django-celery and notify the user accordingly; but if you want a simple approach that reacts immediately, you can use the one I "prototyped" for you.
You should also account for some delay in the responses to your requests (as you are doing POST requests), so make sure the number of POST requests doesn't grow too large, because it could affect the experience of the API clients calling your services.
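For the retry part, a hedged sketch of what process_order could look like with a simple bounded retry loop; the URL and the retry count are placeholders:

import requests

MAX_RETRIES = 3

def process_order(order):
    # Try each order up to MAX_RETRIES times; return the response on
    # success, or None if every attempt failed.
    for attempt in range(MAX_RETRIES):
        try:
            response = requests.post("www.confidential.url.com", data=order)
        except requests.RequestException:
            continue
        if response.status_code == 200:
            return response
    return None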
I have a question about SQLAlchemy and object refreshing.
I am in a situation where I have two sessions, and the same object has been queried in both. For particular reasons I cannot close one of the sessions.
I have modified the object and committed the changes in session A, but in session B the attributes still hold their initial values, without the modifications.
Shall I implement a notification system to communicate the changes, or is there a built-in way to do this in SQLAlchemy?
Sessions are designed to work like this. The object in Session B WILL keep the attribute values it had when it was first queried in Session B. Additionally, SQLAlchemy will not attempt to automatically refresh objects in other sessions when they change, nor do I think it would be wise to try to build something like that.
You should be actively thinking of the lifespan of each session as a single transaction in the database. How and when sessions need to deal with the fact that their objects might be stale is not a technical problem that can be solved by an algorithm built into SQLAlchemy (or any extension for SQLAlchemy): it is a "business" problem whose solution you must determine and code yourself. The "correct" response might be to say that this isn't a problem: the logic that occurs with Session B could be valid if it used the data at the time that Session B started. Your "problem" might not actually be a problem. The docs actually have an entire section on when to use sessions, but it gives a pretty grim response if you are hoping for a one-size-fits-all solution...
A Session is typically constructed at the beginning of a logical
operation where database access is potentially anticipated.
The Session, whenever it is used to talk to the database, begins a
database transaction as soon as it starts communicating. Assuming the
autocommit flag is left at its recommended default of False, this
transaction remains in progress until the Session is rolled back,
committed, or closed. The Session will begin a new transaction if it
is used again, subsequent to the previous transaction ending; from
this it follows that the Session is capable of having a lifespan
across many transactions, though only one at a time. We refer to these
two concepts as transaction scope and session scope.
The implication here is that the SQLAlchemy ORM is encouraging the
developer to establish these two scopes in his or her application,
including not only when the scopes begin and end, but also the expanse
of those scopes, for example should a single Session instance be local
to the execution flow within a function or method, should it be a
global object used by the entire application, or somewhere in between
these two.
The burden placed on the developer to determine this scope is one area
where the SQLAlchemy ORM necessarily has a strong opinion about how
the database should be used. The unit of work pattern is specifically
one of accumulating changes over time and flushing them periodically,
keeping in-memory state in sync with what’s known to be present in a
local transaction. This pattern is only effective when meaningful
transaction scopes are in place.
That said, there are a few things you can do to change how the situation works:
First, you can reduce how long your session stays open. Session B queries the object, and later you do something with that object (in the same session) for which you want the attributes to be up to date. One solution is to perform this second operation in a separate session.
Another is to use the expire/refresh methods, as the docs show...
# immediately re-load attributes on obj1, obj2
session.refresh(obj1)
session.refresh(obj2)
# expire objects obj1, obj2, attributes will be reloaded
# on the next access:
session.expire(obj1)
session.expire(obj2)
You can use session.refresh() to immediately get an up-to-date version of the object, even if the session already queried the object earlier.
Run this to force the session to fetch the latest values from your database on the next access:
session.expire_all()
There is excellent documentation about the default behavior and lifespan of the session.
I just had this issue and the existing solutions didn't work for me for some reason. What did work was to call session.commit(). After calling that, the object had the updated values from the database.
TL;DR Rather than working on Session synchronization, see if your task can be reasonably easily coded with SQLAlchemy Core syntax, directly on the Engine, without the use of (multiple) Sessions
For someone coming from a SQL and JDBC background, one critical thing to learn about SQLAlchemy, which unfortunately I didn't clearly come across while reading through the various documents for months, is that SQLAlchemy consists of two fundamentally different parts: the Core and the ORM. As the ORM documentation is listed first on the website and most examples use ORM-like syntax, one gets thrown into working with it and sets oneself up for errors and confusion when thinking about the ORM in terms of SQL/JDBC. The ORM uses its own abstraction layer that takes complete control over how and when actual SQL statements are executed. The rule of thumb is that a Session is cheap to create and kill, and it should never be re-used for anything in the program's flow and logic that may involve re-querying, synchronization or multi-threading. On the other hand, the Core is direct, no-frills SQL, very much like a JDBC driver. There is one place in the docs I found that "suggests" using Core over ORM:
it is encouraged that simple SQL operations take place here, directly on the Connection, such as incrementing counters or inserting extra rows within log
tables. When dealing with the Connection, it is expected that Core-level SQL
operations will be used; e.g. those described in SQL Expression Language Tutorial.
That said, it appears that using a Connection causes the same side effect as using a Session: re-querying a specific record returns the same result as the first query, even if the record's content in the DB has changed. So Connections are apparently as "unreliable" as Sessions for reading DB content in "real time", but direct Engine execution seems to work fine, as it picks a Connection object from the pool (assuming the retrieved Connection is never in the same "reuse" state, relative to the query, as the specific open Connection). The Result object should be closed explicitly, as per the SA docs.
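A hedged sketch of that Core-style, engine-level access (SQLAlchemy 1.4-style syntax; the connection string, table and column names are placeholders):

from sqlalchemy import create_engine, MetaData, Table, select

engine = create_engine("postgresql://user:password@localhost/mydb")
metadata = MetaData()
users = Table("users", metadata, autoload_with=engine)

# Each block checks a connection out of the pool, runs plain SQL in a
# short transaction, and returns current database state rather than a
# session-cached copy.
with engine.connect() as conn:
    result = conn.execute(select(users).where(users.c.id == 42))
    row = result.first()
    result.close()  # explicit close, as recommended above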
What is your isolation level set to?
SHOW GLOBAL VARIABLES LIKE 'transaction_isolation';
By default, MySQL InnoDB's transaction_isolation is set to REPEATABLE-READ.
+-----------------------+-----------------+
| Variable_name | Value |
+-----------------------+-----------------+
| transaction_isolation | REPEATABLE-READ |
+-----------------------+-----------------+
Consider setting it to READ-COMMITTED.
You can set this for your sqlalchemy engine only via:
create_engine("mysql://<connection_string>", isolation_level="READ COMMITTED")
I think another option is:
engine = create_engine("mysql://<connection_string>")
# execution_options() returns a copy of the engine with the option applied
engine = engine.execution_options(isolation_level="READ COMMITTED")
Or set it globally in the DB via:
SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;
https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html
and
https://docs.sqlalchemy.org/en/14/orm/session_transaction.html#setting-transaction-isolation-levels-dbapi-autocommit
If you have added the incorrect model to the session, you can do:
db.session.rollback()
Most of the longest (most time-consuming) logic I've encountered basically involves two things: sending email and committing items to the database.
Is there any kind of built-in mechanism for doing these things asynchronously so as not to slow down page load?
Validation should be handled synchronously, but it really seems that the most performant way to email and write to the database should be asynchronously.
For example, let's say that I want to track pageviews. Thus, every time I get a view, I do:
pv = PageView.objects.get(page = request.path)
pv.views = pv.views + 1
pv.save() # SLOWWWWWWWWWWWWWW
Is it natural to think that I should speed this up by making the whole process asynchronous?
Take a look at Celery. It gives you asynchronous workers to offload tasks exactly like you're asking about: sending e-mails, counting page views, etc. It was originally designed to work only with Django, but now works in other environments too.
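For the page-view example above, a hedged sketch of what the offloaded task could look like; the task module, import paths and the defaults for get_or_create are assumptions:

# myapp/tasks.py
from celery import shared_task
from django.db.models import F
from myapp.models import PageView

@shared_task
def record_page_view(path):
    # get_or_create plus an F() update keeps the increment atomic even
    # when several workers process views for the same page at once.
    pv, _ = PageView.objects.get_or_create(page=path, defaults={"views": 0})
    PageView.objects.filter(pk=pv.pk).update(views=F("views") + 1)

The view then only needs to call record_page_view.delay(request.path) and return immediately.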
I use this pattern to update the text index (which is slow), since that can be done in the background. This way the user sees a fast response time:
# create an empty file (like the "touch" shell command)
dir = os.path.join(settings.DT.HOME, 'var', 'belege-changed')
file = os.path.join(dir, str(self.id))
fd = open(file, 'a')
fd.close()
A cron-job scans this directory every N minutes and updates the text index.
In your case, I would write request.path to a file and update the PageView model in the background. This would improve performance, since you don't need to hit the database for every increment operation.
You can have a Python thread pool and hand the database writes off to it. Although the GIL prevents Python threads from running truly in parallel, this allows the response flow to continue before the write is finished.
I use this technique when the result of the write is not important for rendering the response.
Of course, if you are handling a POST request and want to return a 201, this is not a good practice.
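A hedged sketch of that idea with the standard library's ThreadPoolExecutor; the model and field names reuse the page-view example from the question:

from concurrent.futures import ThreadPoolExecutor

from django.db.models import F
from myapp.models import PageView  # assumed import path

executor = ThreadPoolExecutor(max_workers=4)

def _bump_view_count(path):
    PageView.objects.filter(page=path).update(views=F("views") + 1)

def track_view(request):
    # Submit the write and return right away; the caller never waits on
    # the future because the result is not needed to render the response.
    executor.submit(_bump_view_count, request.path)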
http://www.mongodb.org/ can do this.