In my Django (1.9) project I need to construct a table from an expensive JOIN. I would therefore like to store the table in the DB and only redo the query if the tables involved in the JOIN change. As I need the table as a basis for later JOIN operations I definitely want to store it in my database and not in any cache.
The problem I'm facing is that I'm not sure how to determine whether the data in the tables have changed. Connecting to the post_save, post_delete signals of the respective models seems not to be right since the models might be updated in bulk via CSV upload and I don't want the expensive query to be fired each time a new row is imported, because the DB table will change right away.
My current approach is to check at a fixed time interval whether the data has changed, which would be perfectly fine for me. For this purpose I run a new thread that compares the checksums of the involved tables (see code below). As I'm not really familiar with multithreading, especially on web servers, I don't know whether this is acceptable. My questions therefore:
Is the threading approach acceptable for running this single task?
Would a Distributed Task Queue like Celery be more appropriate?
Is there any way to disconnect a signal for a certain time after it is received, so that a bulk upload does not trigger the signal over and over again?
This is my current code:
import threading
import time

from django.apps import apps
from .models import SomeModel


def check_for_table_change():
    app_label = SomeModel._meta.app_label

    def join():
        """Join the tables and save the resulting table to the DB."""
        ...

    def get_involved_models(app_label):
        """Get all the models that are involved in the join."""
        ...

    involved_models = get_involved_models(app_label)
    involved_dbtables = tuple(model._meta.db_table for model in involved_models)
    sql = 'CHECKSUM TABLE %s' % ', '.join(involved_dbtables)
    old_checksums = None
    while True:
        # Get the result of the query as named tuples.
        # (from_db is a small project helper that runs raw SQL.)
        checksums = from_db(sql, fetch_as='namedtuple')
        if old_checksums is not None:
            # Compare checksums.
            for pair in zip(checksums, old_checksums):
                if pair[0].Checksum != pair[1].Checksum:
                    print('db changed, table is rejoined')
                    join()
                    break
        old_checksums = checksums
        time.sleep(60)


check_tables_thread = threading.Thread(target=check_for_table_change)
check_tables_thread.start()
I'm grateful for any suggestions.
Materialized Views and PostgreSQL
If you were on PostgreSQL, you could use what's known as a materialized view: you create a view based on your join and it exists almost like a real table, unlike a normal view whose query needs to be executed each and every time the view is used. Now the bad news: MySQL does not have materialized views.
If you switched to PostgreSQL, you might even find that materialized views are not needed after all. That's because PostgreSQL can use more than one index per table in a query, so the join that seems slow on MySQL at the moment might be made to run faster with better use of indexes on PostgreSQL. Of course this depends very much on what your structure looks like.
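To make that concrete, here is a minimal sketch of what the PostgreSQL route could look like from Django; the view name joined_result and the table/column names are made up, not taken from your schema:
# Illustrative sketch only: create and refresh a PostgreSQL materialized view
# from Django. All object names are placeholders.
from django.db import connection

def create_materialized_view():
    with connection.cursor() as cursor:
        cursor.execute("""
            CREATE MATERIALIZED VIEW joined_result AS
            SELECT a.id, a.col1, b.col2
            FROM app_tablea a
            JOIN app_tableb b ON b.a_id = a.id
        """)

def refresh_materialized_view():
    # Re-runs the stored query; cheap to call from a cron job or a Celery task.
    with connection.cursor() as cursor:
        cursor.execute("REFRESH MATERIALIZED VIEW joined_result")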
Signals vs Triggers
The problem I'm facing is that I'm not sure how to determine whether
the data in the tables have changed. Connecting to the post_save,
post_delete signals of the respective models seems not to be right
since the models might be updated in bulk via CSV upload and I don't
want the expensive query to be fired each time a new row is imported,
because the DB table will change right away.
As you have rightly determined, Django signals aren't the right way. This is the sort of task that is best done at the database level. Since you don't have materialized views, this is a job for triggers. However, there is a fair amount of hard work involved either way (whether you use triggers or signals).
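To give an idea of the trigger route on MySQL, here is a rough sketch that stamps a small bookkeeping table whenever a source table changes, so your periodic job only has to compare one timestamp instead of checksumming everything; the table and trigger names are invented for the example:
# Illustrative sketch only; you would need one trigger per event
# (INSERT/UPDATE/DELETE) per involved table.
from django.db import connection

def install_change_tracking():
    with connection.cursor() as cursor:
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS table_changelog (
                table_name VARCHAR(64) PRIMARY KEY,
                changed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
            )
        """)
        cursor.execute("""
            CREATE TRIGGER somemodel_after_insert
            AFTER INSERT ON app_somemodel FOR EACH ROW
                INSERT INTO table_changelog (table_name, changed_at)
                VALUES ('app_somemodel', NOW())
                ON DUPLICATE KEY UPDATE changed_at = NOW()
        """)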
Is the threading approach acceptable for running this single task?
Why not use Django as a CLI here? That effectively means a Django script (for example a management command) invoked by cron, or executed by some other mechanism, independently of your website.
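A sketch of that idea, assuming you refactor the loop above into a function that does a single checksum pass (the module and function names here are placeholders):
# myapp/management/commands/check_tables.py
from django.core.management.base import BaseCommand

from myapp.checks import run_single_checksum_pass  # hypothetical helper

class Command(BaseCommand):
    help = "Rejoin the cached table if any source table has changed."

    def handle(self, *args, **options):
        run_single_checksum_pass()

# cron entry (illustrative), replacing the endless while loop and the thread:
# * * * * * /path/to/venv/bin/python /path/to/project/manage.py check_tables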
Would a Distributed Task Queue like Celery be more appropriate?
Very much so. Each time the data changes, you can fire off a task that does the update of the table.
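For example (a rough sketch; the task and module names are made up), the bulk CSV import would fire one task when it finishes, rather than one per row:
from celery import shared_task

@shared_task(ignore_result=True)
def rebuild_join_table():
    # join() is the function from your snippet that recreates the table.
    from myapp.joins import join  # hypothetical location of the join logic
    join()

# at the very end of the CSV import:
# rebuild_join_table.delay()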
Is there any way to disconnect a signal for a certain time after it is received, so that a bulk upload does not trigger the signal over and over again?
Keyword here is 'TRIGGER' :-)
Alternatives.
Having said all that, doing a join and physically populating a table is going to be very slow if your table grows to even a few thousand rows. That's because you will need an elaborate query to determine which records have changed (unless you use a separate queue for that), and you would then need to insert or update the records in the 'join table'. Updates/inserts are generally slower than reads, so as the data grows this will become progressively worse.
The real solution may be to optimize your queries and/or tables. May I suggest you post a new question with the slow query and also share your table structures?
Related
I have two processes: 1) a scraper that takes info from another website, does the needed calculations and puts the results in the DB; 2) a web app (Flask) that takes data from the DB and draws plots with the help of matplotlib. The two processes do not communicate, but they use one DB.
Problem: everything works fine now, but drawing the plots and saving them to the web app project's folder is a time-consuming operation; it takes about 5 seconds to display the page with plots. The pictures are created every time the webpage is requested, since data can be added to the DB at any moment and the user should get a plot with all the information.
How I see the solution: create a table in the DB with only one row and only two columns: id = 1 and a boolean field. Let's call the boolean column IS_UPDATED. When the scraper process puts new data into the DB, it sets IS_UPDATED to True. When the web app process asks for data from the DB, it sets IS_UPDATED to False. This way the pictures are recreated only when new data was provided by the scraper process; otherwise we use the old pictures.
Is my solution fine? Please share any other ways to do the same.
I think it's not a good idea to implement a locking mechanism when the DBMS already has its own. For example, if App1 crashes between setting IS_UPDATED=True and IS_UPDATED=False, the lock stays active.
A typical solution is based on transaction isolation levels. I suppose App1 does its updates strictly within the scope of a single transaction, so if you set the REPEATABLE READ or, better, the SERIALIZABLE level, then App2 will always get the last consistent version of the data.
The second point is the isolation implementation approach. With a blocking DBMS, App2 will wait until App1 finishes, whereas a version-based DBMS lets App2 immediately read the last consistent data, excluding App1's modifications that are not committed yet. SQL Server, for example, supports both approaches but requires some settings to do so.
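A minimal sketch, assuming the Flask app talks to the database through SQLAlchemy (the connection URL is a placeholder):
from sqlalchemy import create_engine

# Every connection from this engine runs at REPEATABLE READ, so the web app
# always sees the last committed, consistent snapshot written by the scraper.
engine = create_engine(
    "postgresql://user:password@localhost/mydb",
    isolation_level="REPEATABLE READ",
)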
For more details, have a look at the "Programming with databases" book, which contains explanations and examples (the first edition exists in Russian, too).
I have an application which was running very quickly. Let's say it took 10 seconds to run. All it does is read a csv, parse it, and store some of the info in sqlalchemy objects which are written to the database. (We never attempt to read the database, only to write).
After adding a many-to-many relationship to the entity we are building, relating it to an address entity which we now also build, the time to process the file has increased by an order of magnitude. We are doing very little additional work: just instantiating an address and storing it in the relationship collection on our entity using append. The setup looks roughly like the sketch below (simplified and with the models renamed).
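from sqlalchemy import Column, ForeignKey, Integer, String, Table
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

person_address = Table(
    "person_address", Base.metadata,
    Column("person_id", ForeignKey("person.id"), primary_key=True),
    Column("address_id", ForeignKey("address.id"), primary_key=True),
)

class Address(Base):
    __tablename__ = "address"
    id = Column(Integer, primary_key=True)
    line1 = Column(String)

class Person(Base):
    __tablename__ = "person"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    addresses = relationship("Address", secondary=person_address)

# per CSV row, something like:
# person.addresses.append(Address(line1=row["address"]))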
Most of the time appears to be lost in _load_for_state as can be seen in the attached profiling screenshot:
I'm pretty sure this is unnecessarily lost time, because it looks like it is trying to do some loading even though we never make any queries against the database (we always instantiate new objects and save them in this app).
Anyone have an idea how to optimize sqlalchemy here?
update
I tried setting SQLALCHEMY_ECHO = True just to see if it is doing a bunch of database reads, or maybe some extra writes. Bizarrely, it only accesses the database itself at the same times it did before (following a db.session.commit()). I'm pretty sure all this extra time is not being spent due to database access.
My colleague runs a script that pulls data from the DB periodically. He is using the query:
'SELECT url, data FROM table LIMIT {} OFFSET {}'.format(OFFSET, PAGE * OFFSET)
We use Amazon Aurora and he has his own slave server, but every time the script runs it pushes the load to 98%+.
The table has millions of records.
Would it be better to go for an SQL dump instead of SQL queries for fetching the data?
The options that come to my mind are:
SQL dump of selected tables (not sure of the benchmark)
Federated tables based on certain references (date, ID, etc.)
Thanks
I'm making some fairly big assumptions here, but from
"without choking it"
I'm guessing you mean that when your colleague runs the SELECT to grab the large amount of data, the database performance drops for all other operations - presumably your primary application - while the data is being prepared for export.
You mentioned SQL dump, so I'm also assuming that this colleague will be satisfied with data that is roughly correct, i.e. it doesn't have to be up-to-the-instant, transactionally correct data - just good enough for something like analytics work.
If those assumptions are close, your colleague and your database might benefit from
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
This line of code should be used carefully and almost never in a line of business application but it can help people querying the live database with big queries, as long as you fully understand the implications.
To use it, simply start a transaction and put this line before any queries you run.
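For example, if the colleague's script uses a Python MySQL client such as pymysql (an assumption; any client works the same way), it would look roughly like this:
import pymysql

# Connection details and page size are placeholders.
conn = pymysql.connect(host="replica-host", user="reader",
                       password="secret", db="mydb")
limit, page = 10000, 0

with conn.cursor() as cursor:
    # Session-wide form of the statement above.
    cursor.execute("SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED")
    cursor.execute("SELECT url, data FROM table LIMIT %s OFFSET %s",
                   (limit, page * limit))
    rows = cursor.fetchall()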
The 'choking'
What you are seeing when your colleague runs a large query is record locking. Your database engine is - quite correctly - set up to provide an accurate view of your data, at any point. So, when a large query comes along the database engine first waits for all write locks (transactions) to clear, runs the large query and holds all future write locks until the query has run.
This actually happens for all transactions, but you only really notice it for the big ones.
What READ UNCOMMITTED does
By setting the transaction isolation level to READ UNCOMMITTED, you are telling the database engine that this transaction doesn't care about write locks and to go ahead and read anyway.
This is known as a 'dirty read', in that the long-running query could well read a table with a write lock on it and will ignore the lock. The data actually read could be the data before the write transaction has completed, or a different transaction could start and modify records before this query gets to it.
The data returned from anything with READ UNCOMMITTED is not guaranteed to be correct in the ACID sense of a database engine, but for some use cases it is good enough.
What the effect is
Your large queries magically run faster and don't lock the database while they are running.
Use with caution and understand what it does before you use it though.
MySQL Manual on transaction isolation levels
This is a rather specific question for advanced users of Celery. Let me explain the use case I have:
Use case
I have to run ~1k-100k tasks that will each run a simulation (a movie) and return the data of the simulation as a rather large list of smaller objects (frames), say 10k-100k per frame and 1k frames. So the total amount of data produced will be very large, but assume that I have a database that can handle this. Speed is not a key factor here. Later I need to compute features from each frame, which can be done completely independently.
The frames look like a dict that points to some numpy arrays and other simple data like strings and numbers, and each frame has a unique identifier (UUID).
Important is that the final objects of interest are arbitrary joins and splits of these generated lists. As a metaphor, consider the resulting movies being chopped up and recombined into new movies. These final lists (movies) are then basically lists of references to the frames by their UUIDs.
Now, I consider using Celery to generate these first movies, and since the results will end up in the backend DB anyway, I might just keep them indefinitely, at least the ones I specify to keep.
My question
Can I configure a backend, preferably a NoSQL DB, in a way that keeps the results and lets me access them later, independently of Celery, using the object's UUID? And if so, does that make sense in terms of overhead, performance, etc.?
Another possibility would be to not return anything and let the worker store the result in a DB itself. Is that preferred? It seems unnecessary to have a second channel of communication to another DB when Celery can do this already.
I am also interested in comments on using Celery in general for highly independent tasks that run long (>1h) and return large result objects. A failure is not problematic and the task can just be restarted. The resulting movies are stochastic, so functional approaches can be problematic; even storing the random seed might not guarantee reproducible results, although I do not have side effects. I just might have lots of workers available that are widely distributed: imagine lots of desktop machines in a closed environment where every worker helps, even if it is slow. Network speed and security are not an issue here. I know that this is not the original use case, but it seemed very easy to use Celery for these cases. The best analogy I found is projects like Folding@home.
Can I configure a backend, preferably a NoSQL DB, in a way that keeps the results and lets me access them later, independently of Celery, using the object's UUID?
Yes, you can configure Celery to store its results in a NoSQL database such as Redis for access by UUID later. The two settings that control the behavior of interest for you are result_expires and result_backend.
result_backend specifies which NoSQL database you want to store your results in (e.g. Elasticsearch or Redis), while result_expires specifies how long after a task completes its result will remain available for access.
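A configuration sketch (Celery 4 lowercase setting names; older versions use the upper-case CELERY_* equivalents), assuming a local Redis instance:
# celeryconfig.py
broker_url = "redis://localhost:6379/0"
result_backend = "redis://localhost:6379/1"
result_expires = None  # None means results are kept until you delete them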
After the task completes, you can access the results in python like this:
from celery.result import AsyncResult

result = task_name.delay()  # task_name is whatever Celery task you defined
print(result.id)

uuid = result.id
checked_result = AsyncResult(uuid)
# and you can access the result output here however you'd like
And if so, does that make sense in terms of overhead, performance, etc.?
I think this strategy makes perfect sense. I have used it a number of times when generating long-running reports for web users. The initial POST returns the UUID of the celery task. The web client can poll the app server via JavaScript using the UUID to see if the task is ready/complete. Once the report is ready, the page can redirect the user to the route that lets them download or view the report by passing in the UUID.
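As a rough sketch of that polling pattern in a Django view (the URL layout and names are illustrative):
from celery.result import AsyncResult
from django.http import JsonResponse

def report_status(request, task_uuid):
    result = AsyncResult(task_uuid)
    if result.ready():
        # The client-side JavaScript can now redirect to the download route.
        return JsonResponse({"state": result.state,
                             "download_url": "/reports/%s/" % task_uuid})
    return JsonResponse({"state": result.state})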
Here is a simple example of a django view with a potential race condition:
# myapp/views.py
from django.contrib.auth.models import User
from my_libs import calculate_points

def add_points(request):
    user = request.user
    user.points += calculate_points(user)
    user.save()
The race condition should be fairly obvious: a user can make this request twice, and the application could potentially execute user = request.user simultaneously, causing one of the requests to overwrite the other.
Suppose the function calculate_points is relatively complicated, and makes calculations based on all kinds of weird stuff that cannot be placed in a single update and would be difficult to put in a stored procedure.
So here is my question: What kind of locking mechanisms are available to django, to deal with situations similar to this?
Django 1.4+ supports select_for_update; in earlier versions you can execute raw SQL, e.g. SELECT ... FOR UPDATE, which (depending on the underlying DB) will lock the row against any updates, and you can do whatever you want with that row until the end of the transaction. For example:
from django.db import transaction

@transaction.commit_manually
def add_points(request):
    user = User.objects.select_for_update().get(id=request.user.id)
    # you can go back at this point if something is not right
    if user.points > 1000:
        # too many points; roll back to release the row lock
        transaction.rollback()
        return
    user.points += calculate_points(user)
    user.save()
    transaction.commit()
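On Django 1.6+ the same idea is usually written with transaction.atomic instead of manual commit handling; a rough sketch:
from django.db import transaction

def add_points(request):
    with transaction.atomic():
        user = User.objects.select_for_update().get(id=request.user.id)
        if user.points > 1000:
            # too many points; the lock is released when the block exits
            return
        user.points += calculate_points(user)
        user.save()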
As of Django 1.1 you can use the ORM's F() expressions to solve this specific problem.
from django.db.models import F
user = request.user
user.points = F('points') + calculate_points(user)
user.save()
For more details see the documentation:
https://docs.djangoproject.com/en/1.8/ref/models/instances/#updating-attributes-based-on-existing-fields
https://docs.djangoproject.com/en/1.8/ref/models/expressions/#django.db.models.F
Database locking is the way to go here. There are plans to add "select for update" support to Django (here), but for now the simplest would be to use raw SQL to UPDATE the user object before you start to calculate the score.
Pessimistic locking is now supported by Django 1.4's ORM when the underlying DB (such as Postgres) supports it. See the Django 1.4a1 release notes.
You have many ways to single-thread this kind of thing.
One standard approach is Update First. You do an update which will seize an exclusive lock on the row; then do your work; and finally commit the change. For this to work, you need to bypass the ORM's caching.
Another standard approach is to have a separate, single-threaded application server that isolates the Web transactions from the complex calculation.
Your web application can create a queue of scoring requests, spawn a separate process, and then write the scoring requests to this queue. The spawn can be put in Django's urls.py so it happens on web-app startup, or it can be put into a separate manage.py admin script, or it can be done "as needed" when the first scoring request is attempted.
You can also create a separate WSGI-flavored web server using Werkzeug which accepts web-service (WS) requests via urllib2. If you have a single port number for this server, requests are queued by TCP/IP. If your WSGI handler has one thread, then you've achieved serialized single-threading. This is slightly more scalable, since the scoring engine handles a WS request and can be run anywhere.
Yet another approach is to have some other resource that has to be acquired and held to do the calculation.
A Singleton object in the database. A single row in a unique table can be updated with a session ID to seize control, and updated with a session ID of None to release control. The essential update has to include a WHERE SESSION_ID IS NULL filter to ensure the update fails when the lock is held by someone else. This is interesting because it's inherently race-free -- it's a single update, not a SELECT-UPDATE sequence. (A sketch follows below.)
A garden-variety semaphore can be used outside the database. Queues (generally) are easier to work with than a low-level semaphore.
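A sketch of the singleton-row idea in the ORM, using a hypothetical ScoringLock model that holds exactly one row with a nullable session_id field:
from myapp.models import ScoringLock  # hypothetical single-row model

def acquire_lock(session_id):
    # A single UPDATE ... WHERE session_id IS NULL; update() returns the
    # number of rows changed, so 0 means someone else holds the lock.
    return ScoringLock.objects.filter(session_id__isnull=True).update(
        session_id=session_id) == 1

def release_lock(session_id):
    ScoringLock.objects.filter(session_id=session_id).update(session_id=None)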
This may be oversimplifying your situation, but what about just a JavaScript link replacement? In other words when the user clicks the link or button wrap the request in a JavaScript function which immediately disables / "greys out" the link and replaces the text with "Loading..." or "Submitting request..." info or something similar. Would that work for you?
Now, you must use:
Model.objects.select_for_update().get(foo=bar)