I'm starting a Django project and need to shard multiple tables that are all likely to grow too large. I've looked through threads here and elsewhere, and followed the Django multi-db documentation, but I'm still not sure how it all stitches together. My models have relationships that would be broken by sharding, so it seems the options are to either drop the foreign keys or forgo sharding the respective models.
For argument's sake, consider the classic Author, Publisher and Book scenario, but throw in book copies and users that can own them. Say books and users had to be sharded. How would you approach that? A user may own a copy of a book that's not in the same database.
In general, what are the best practices you have used for routing and for the sharding itself? Did you use Django database routers, manually select a database inside commands based on your sharding logic, or override some parts of the ORM to achieve that?
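For concreteness, the router-based option I'm picturing is roughly this (the shard aliases and the modulo scheme are just placeholders, not something I've settled on):

# settings.DATABASES would contain 'default', 'shard_1' and 'shard_2' (hypothetical aliases).
class ShardRouter:
    SHARDED_MODELS = {'user', 'bookcopy'}  # lowercased model names, placeholders

    def _shard_for(self, pk):
        # naive modulo sharding on the primary key
        return 'shard_1' if pk % 2 == 0 else 'shard_2'

    def db_for_read(self, model, **hints):
        if model._meta.model_name in self.SHARDED_MODELS:
            instance = hints.get('instance')
            if instance is not None and instance.pk is not None:
                return self._shard_for(instance.pk)
        return None  # no opinion; fall through to 'default'

    def db_for_write(self, model, **hints):
        return self.db_for_read(model, **hints)

    def allow_relation(self, obj1, obj2, **hints):
        # cross-shard foreign keys can't be enforced by the database
        return None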
I'm using PostgreSQL on Ubuntu, if it matters.
Many thanks.
In the past I've done something similar using PostgreSQL table partitioning; however, this merely splits a table up within the same DB, which helps reduce table search time. It's also nice because you don't need to modify your Django code much. (Make sure your queries filter on the fields you're using for the partition constraints.)
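Roughly, the DDL side of that, wrapped in a Django migration so it stays in the codebase, looks like this (table and column names are made up, and this assumes a PostgreSQL version with declarative partitioning; older versions use inheritance plus triggers instead):

# Hypothetical migration: range-partition a readings table by month.
from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [('myapp', '0001_initial')]  # placeholder dependency

    operations = [
        migrations.RunSQL(
            sql="""
            CREATE TABLE myapp_reading (
                id bigserial,
                taken_at timestamptz NOT NULL,
                value double precision NOT NULL,
                PRIMARY KEY (id, taken_at)  -- partition key must be part of the PK
            ) PARTITION BY RANGE (taken_at);

            CREATE TABLE myapp_reading_2024_01
                PARTITION OF myapp_reading
                FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
            """,
            reverse_sql='DROP TABLE myapp_reading;',
        ),
    ]

A managed = False model can then be pointed at myapp_reading and queried as usual, as long as queries filter on taken_at so the planner can prune partitions.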
But it's not sharding.
If you haven't seen it yet, you should check out Sharding Postgres with Instagram.
I agree with #DanielRoseman. Also, how many rows is too many? If you are careful with indexing, you can handle a lot of rows with no performance problems. Keep your indexed values small (ints). I've got tables in excess of 400 million rows that produce sub-second responses, even when joining with other many-million-row tables.
It might make more sense to break the user up into multiple tables, so that the user object has a core of commonly used things and the "profile" info lives elsewhere (the standard Django setup). Copies would be a small table referencing books, which holds the bulk of the data. Considering how much RAM you can put into a DB server these days, sharding before you have to seems wrong.
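A rough sketch of that layout (field names are invented):

from django.conf import settings
from django.db import models

class Profile(models.Model):
    # wide, less frequently used data, linked to the lean auth user row
    user = models.OneToOneField(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    bio = models.TextField(blank=True)
    location = models.CharField(max_length=100, blank=True)

class Book(models.Model):
    title = models.CharField(max_length=255)
    description = models.TextField()  # the bulk of the data lives here

class Copy(models.Model):
    # small rows: essentially just two indexed foreign keys
    book = models.ForeignKey(Book, on_delete=models.CASCADE)
    owner = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)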
Related
Can you take form data and change the database schema? Is it a good idea? Is there a downside to many migrations from a 'default' database?
I want users to be able to add / remove tables, columns, and rows. Making schema changes requires migrations, so adding that functionality would require writing a view that takes form data and passes it to a function that then uses Flask-Migrate.
If I manage to build this, don't migrations build the required separate scripts and everything that goes along with that each time something is added or removed? Is that practical for something like this, where 10 or 20 new tables might be added to the starting database?
If I allow users to add columns to a table, that would have to modify the table's model class. Is that possible, or even a safe idea? If not, I'd appreciate it if someone could help me out and at least get me pointed in the right direction.
In a typical web application, the deployed database does not change its schema at runtime. The schema is only changed during an upgrade, and only the developers make these changes. Operations that users perform on the application can add, remove or modify rows, but not modify the tables or columns themselves.
If you need to offer your users a way to add flexible data structures, then you should design your database schema in a way that makes this possible. For example, if you wanted your users to add custom key/value pairs, you could have a table with columns user_id, key_name and value. You may also want to investigate whether a schema-less database fits your needs better.
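Since the question mentions Flask-Migrate, a minimal sketch of that key/value table with Flask-SQLAlchemy might look like this (class and table names are invented):

from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()

class CustomAttribute(db.Model):
    # one row per user-defined key/value pair
    id = db.Column(db.Integer, primary_key=True)
    user_id = db.Column(db.Integer, db.ForeignKey('user.id'), nullable=False, index=True)
    key_name = db.Column(db.String(100), nullable=False)
    value = db.Column(db.Text)

    __table_args__ = (db.UniqueConstraint('user_id', 'key_name'),)

Users can then "add a column" by inserting a row, with no schema change or migration involved.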
Here are my entities:
class Article(db.Entity):
    id = PrimaryKey(int, auto=True)
    creation_time = Required(datetime)
    last_modification_time = Optional(datetime, default=datetime.now)
    title = Required(str)
    contents = Required(str)
    authors = Set('Author')

class Author(db.Entity):
    id = PrimaryKey(int, auto=True)
    first_name = Required(str)
    last_name = Required(str)
    articles = Set(Article)
And here is the code I'm using to get some data:
return left_join((article, author) for article in entities.Article
for author in article.authors).prefetch(entities.Author)[:]
Whether I'm using the prefetch method or not, the generated SQL always looks the same:
SELECT DISTINCT "article"."id", "t-1"."author"
FROM "article" "article"
LEFT JOIN "article_author" "t-1"
ON "article"."id" = "t-1"."article"
And then, when I iterate over the results, Pony issues additional queries:
SELECT "id", "creation_time", "last_modification_time", "title", "contents"
FROM "article"
WHERE "id" = %(p1)s
SELECT "id", "first_name", "last_name"
FROM "author"
WHERE "id" IN (%(p1)s, %(p2)s)
The desired behavior for me would be for the ORM to issue just one query that loads all the needed data. So how do I achieve that?
The author of PonyORM here. We don't want to load all these objects using just one query, because that would be inefficient.
The only benefit of using a single query to load a many-to-many relation is to reduce the number of round-trips to the database. But replacing three queries with one is not a major improvement: when your database server is located near your application server, these round-trips are actually very fast compared with processing the resulting data in Python.
On the other hand, when both sides of a many-to-many relation are loaded in the same query, the same object's data is inevitably repeated over and over across multiple rows. This has several drawbacks:
The size of the data transferred from the database becomes much larger compared to the situation where no duplicate information is transferred. In your example, if you have ten articles and each is written by three authors, the single query will return thirty rows, with large fields like article.contents duplicated multiple times. Separate queries transfer the minimum amount of data possible; the difference in size may easily be an order of magnitude, depending on the specific many-to-many relation.
The database server is usually written in a compiled language like C and works very fast. The same is true for the networking layer. But Python code is interpreted, and the time consumed by Python code is (contrary to some opinions) usually much greater than the time spent in the database. You can see the profiling tests performed by the SQLAlchemy author Mike Bayer, after which he came to this conclusion:
A great misconception I seem to encounter often is the notion that communication with the database takes up a majority of the time spent in a database-centric Python application. This perhaps is a common wisdom in compiled languages such as C or maybe even Java, but generally not in Python. Python is very slow, compared to such systems (...) Whether a database driver (DBAPI) is written in pure Python or in C will incur significant additional Python-level overhead. For just the DBAPI alone, this can be as much as an order of magnitude slower.
When all the data of a many-to-many relation is loaded in the same query and the same data is repeated across many rows, it is necessary to parse all of this repeated data in Python just to throw most of it away. As Python is the slowest part of the process, such "optimization" may actually decrease performance.
To support my words I can point to the Django ORM. That ORM has two methods which can be used for query optimization. The first one, called select_related, loads all related objects in a single query, while the more recently added prefetch_related loads objects the way Pony does by default (both are sketched briefly below). According to Django users, the second method works much faster:
In some scenarios, we have found up to a 30% speed improvement.
The database is required to perform joins, which consume precious resources of the database server.
While Python code is the slowest part when processing a single request, the database server's CPU time is a shared resource used by all parallel requests. You can scale Python code easily by starting multiple Python processes on different servers, but it is much harder to scale the database. Because of this, in a high-load application it is better to offload work from the database server to the application server, so that it can be done in parallel by multiple application servers.
When the database performs a join, it spends additional time doing it. But for Pony it is irrelevant whether the database does the join or not, because in any case the objects will be interlinked inside the ORM identity map. So the work the database does to perform the join is just a waste of database time. On the other hand, using the identity map pattern, Pony can link objects equally fast regardless of whether they arrive in the same database row or not.
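For illustration, here is roughly what the two Django methods mentioned above look like (Article, Author and publisher here are hypothetical Django models and fields, not the Pony entities from the question):

from myapp.models import Article  # hypothetical app and models

# select_related: follows foreign keys with one big JOIN (FK / one-to-one only).
articles = Article.objects.select_related('publisher')

# prefetch_related: issues a separate query per relation and links the objects
# in Python -- the same strategy Pony uses for many-to-many relations by default.
articles = Article.objects.prefetch_related('authors')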
Returning to the number of round-trips, Pony has a dedicated mechanism to eliminate the "N+1 query" problem. The "N+1 query" anti-pattern arises when an ORM sends hundreds of very similar queries, each of which loads a separate object from the database. Many ORMs suffer from this problem. Pony can detect it and replace the N repeated queries with a single query that loads all the necessary objects at once. This mechanism is very efficient and can greatly reduce the number of round-trips. But when we speak about loading a many-to-many relation, there are no N queries here; there are just three queries which are more efficient when executed separately, so there is no benefit in trying to execute a single query instead.
To summarize: ORM performance is very important to us, the Pony ORM developers. And because of that, we don't want to implement loading a many-to-many relation in a single query, as it would almost certainly be slower than our current solution.
So, to answer your question, you cannot load both sides of a many-to-many relation in a single query. And I think this is a good thing.
This should work
from pony.orm import select

# iterate over pairs of related articles and authors
select((article, author) for article in Article for author in article.authors)[:]
I am working on a project which requires me to create a separate table for every user who registers on the website, using the username of that user as the table name. The columns in the table are the same for every user.
While researching I found this: Django dynamic model fields. I am not sure how to use django-mutant to accomplish this. Also, is there any way I could do this without using any external apps?
PS: The backend I am using is MySQL.
An interesting question, which might be of wider interest.
Creating one table per user is a maintenance nightmare. You should instead define a single table to hold all users' data, and then use the database's capabilities to retrieve only those rows pertaining to the user of interest (after checking permissions if necessary, since it is not a good idea to give any user unrestricted access to another user's data without specific permissions having been set).
Adopting your proposed solution requires that you construct SQL statements containing the relevant user's table name. Successive queries to the database will mostly be different, and this will slow the work down, because every SQL statement has to be "prepared" (the syntax has to be checked, the names of tables and columns have to be verified, the requesting user's permission to access the named resources has to be checked, and so on).
By using a single table (model) the same queries can be used repeatedly, with parameters used to vary specific data values (in this case the name of the user whose data is being sought). Your database work will move along faster, you will only need a single model to describe all users' data, and database management will not be a nightmare.
A further advantage is that Django (which you appear to be using) has an extensive user-based permission model, and can easily be used to authenticate user logins (once you know how). These advantages are so compelling I hope you will recant from your heresy and decide you can get away with a single table (and, if you are planning to use standard Django logins, a relationship with the User model that comes as a central part of any Django project).
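A rough sketch of the single-table approach (model and field names are invented):

from django.conf import settings
from django.db import models

class Record(models.Model):
    # every user's rows live in the same table, tagged with their owner
    owner = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    name = models.CharField(max_length=200)
    value = models.TextField(blank=True)

# In a view, only the requesting user's rows are ever fetched:
#     Record.objects.filter(owner=request.user)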
Please feel free to ask more questions as you proceed. It seems you are new to database work, so I have tried to present an appropriate level of detail. There are many pitfalls such as this if you cannot access knowledgeable advice. People on SO will help you.
This page shows how to create a model and install its table into the database on the fly. So, you could use type('table_with_username', (models.Model,), attrs) to create a model and use django.core.management to install it into the database.
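Very roughly, and substituting the newer schema editor for the django.core.management step, that could look like this (the field names and app label are just placeholders):

from django.db import connection, models

def make_user_table_model(username):
    # build a model class at runtime...
    attrs = {
        '__module__': 'myapp.models',  # required by Django's model metaclass
        'Meta': type('Meta', (), {'app_label': 'myapp'}),
        'name': models.CharField(max_length=200),
        'value': models.TextField(blank=True),
    }
    model = type('table_%s' % username, (models.Model,), attrs)
    # ...and issue the CREATE TABLE for it
    with connection.schema_editor() as editor:
        editor.create_model(model)
    return model

Keep in mind the maintenance concerns raised in the other answer before going down this road.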
Need a way to improve performance on my website's SQL based Activity Feed. We are using Django on Heroku.
Right now we are using actstream, which is a Django App that implements an activity feed using Generic Foreign Keys in the Django ORM. Basically, every action has generic foreign keys to its actor and to any objects that it might be acting on, like this:
Action:
(Clay - actor) wrote a (comment - action object) on (Andrew's review of Starbucks - target)
As we've scaled, it's become way too slow, which is understandable because it relies on big, expensive SQL joins.
I see at least two options:
Put a Redis layer on top of the SQL database and get activity feeds from there.
Try to circumvent the Django ORM and do all the queries in raw SQL, which I understand can improve performance.
If anyone has thoughts on either of these two, or other ideas, I'd love to hear them.
You might want to look at Materialized Views. Since you're on Heroku, which generally uses PostgreSQL, you could look at Materialized View Support for PostgreSQL. It is not as mature as in other database servers, but as far as I understand it can be made to work. To work with the Django ORM, you would probably have to create a new "entity" (I'm not familiar with Django, so modify as needed) for the feed, and then do queries over it as if it were a table. Manual management of the view is a consideration, so look into it carefully before you commit to it.
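Very roughly, the idea would be something like this; the view definition and the model fields below are invented for illustration:

# 1) Create the materialized view once (in a migration or directly in psql),
#    and refresh it periodically; the source table/columns depend on your schema:
#
#    CREATE MATERIALIZED VIEW feed_entries AS
#        SELECT id, verb, timestamp FROM actstream_action;
#    REFRESH MATERIALIZED VIEW feed_entries;
#
# 2) Map an unmanaged Django model over it and query it like a table:
from django.db import models

class FeedEntry(models.Model):
    verb = models.CharField(max_length=255)
    timestamp = models.DateTimeField()

    class Meta:
        managed = False            # Django never creates or migrates this
        db_table = 'feed_entries'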
Hope this helps!
You said Redis? Everything is better with Redis.
Caching is one of the best ideas in software development. Even if you use Materialized Views, you should also consider caching them; believe me, your users will notice the difference.
Went with an approach that sort of combined the two suggestions.
We created a master list of every action in the database, which includes all the information we need about each action, and stuck it in Redis. Given an action ID, we can now do a Redis lookup on it and get a dictionary object that is ready to be returned to the front end.
We also created action ID lists that correspond to all the different types of activity streams available to a user. So given a user ID, we have his friends' activity, his own activity, favorite-places activity, etc. available for lookup. (These correspond somewhat to materialized views, although they live in Redis, not in PostgreSQL.)
So we get a user's feed as a list of action IDs, then get the details of those actions by looking up the IDs in the master action list, and return the feed to the front end.
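In code, the lookup pattern is roughly this (key names and data layout are made up):

import json
import redis

r = redis.StrictRedis()

def get_feed(user_id, start=0, stop=49):
    # each feed is a Redis list of action IDs, newest first
    action_ids = r.lrange('feed:friends:%s' % user_id, start, stop)
    # the master list is a hash mapping action ID -> ready-to-serve JSON blob
    pipe = r.pipeline()
    for action_id in action_ids:
        pipe.hget('actions', action_id)
    return [json.loads(blob) for blob in pipe.execute() if blob]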
Thanks for the suggestions, guys.
I am not very familiar with databases, and so I do not know how to partition a table using SQLAlchemy.
Your help would be greatly appreciated.
There are two kinds of partitioning: Vertical Partitioning and Horizontal Partitioning.
From the docs:
Vertical Partitioning

Vertical partitioning places different kinds of objects, or different tables, across multiple databases:

engine1 = create_engine('postgres://db1')
engine2 = create_engine('postgres://db2')

Session = sessionmaker(twophase=True)

# bind User operations to engine 1, Account operations to engine 2
Session.configure(binds={User: engine1, Account: engine2})

session = Session()
Horizontal Partitioning

Horizontal partitioning partitions the rows of a single table (or a set of tables) across multiple databases.

See the "sharding" example in attribute_shard.py
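For the horizontal case, the sharding extension used in that example looks roughly like this; the hook names have shifted between SQLAlchemy versions, so treat it as an outline rather than the exact current API:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.horizontal_shard import ShardedSession

# placeholder engines and a naive parity-based scheme
shards = {
    'shard1': create_engine('postgresql://db1'),
    'shard2': create_engine('postgresql://db2'),
}

def shard_chooser(mapper, instance, clause=None):
    # where should a new or updated object be written?
    # (assumes the id is assigned by the application before flush)
    return 'shard1' if instance.id % 2 == 0 else 'shard2'

def id_chooser(query, ident):
    # given a primary key, which shards could hold it?
    return ['shard1', 'shard2']

def query_chooser(query):
    # which shards should an arbitrary query be sent to?
    return ['shard1', 'shard2']

Session = sessionmaker(class_=ShardedSession)
Session.configure(
    shards=shards,
    shard_chooser=shard_chooser,
    id_chooser=id_chooser,
    query_chooser=query_chooser,
)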
Just ask if you need more information on those, preferably providing more information about what you want to do.
It's quite an advanced subject for somebody not familiar with databases, but try Essential SQLAlchemy (you can read the key parts on Google Book Search -- pp. 122-124; the example on pp. 125-126 is not freely readable online, so you'd have to purchase the book or read it on a commercial service such as O'Reilly's Safari -- maybe on a free trial -- if you want to read the example).
Perhaps you can get better answers if you mention whether you're talking about vertical or horizontal partitioning, why you need partitioning, and what underlying database engines you are considering for the purpose.
Automatic partitioning is a very database-engine-specific concept, and SQLAlchemy doesn't provide any generic tools to manage partitioning, mostly because they wouldn't provide anything really useful while being another API to learn. If you want to do database-level partitioning, then issue the CREATE TABLE statements using custom Oracle DDL (see the Oracle documentation on how to create partitioned tables and migrate data into them). You can use a partitioned table in SQLAlchemy just like a normal table; you just need the table declaration so that SQLAlchemy knows what to query. You can reflect the definition from the database, or simply duplicate the table declaration in SQLAlchemy code.
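For example, once the partitioned table exists in the database, the SQLAlchemy side is just an ordinary table declaration (or a reflection of one); everything below is an invented example:

from sqlalchemy import Table, Column, Integer, DateTime, Numeric, MetaData, create_engine

engine = create_engine('oracle://user:password@dsn')  # placeholder URL
metadata = MetaData()

# matches a table that was created and partitioned with database-side DDL
measurements = Table(
    'measurements', metadata,
    Column('id', Integer, primary_key=True),
    Column('taken_at', DateTime, nullable=False),  # the partition key
    Column('value', Numeric),
)

# ...or reflect the definition straight from the database instead:
# measurements = Table('measurements', metadata, autoload=True, autoload_with=engine)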
Very large datasets are usually time-based, with older data becoming read-only or read-mostly and queries usually only look at data from a time interval. If that describes your data, you should probably partition your data using the date field.
There's also application-level partitioning, or sharding, where you use your application to split data across different database instances. This isn't all that popular in the Oracle world due to the exorbitant pricing models. If you do want to use sharding, look at the SQLAlchemy documentation and examples to see how it can support you, but be aware that application-level sharding will affect how you need to build your application code.