Suppose I have a User model and an Article model, where User and Article are in a one-to-many relation, so I can access a user's articles like this:
user = session.query(User).filter_by(id=1).one()
print(user.articles)
But this lists all of the user's articles. What if I want to limit them to 10? In Rails there is an all() method which can take limit/offset options. SQLAlchemy also has an all() method, but it takes no parameters. How can I achieve this?
Edit:
It seems user.articles[10:20] is valid, but the generated SQL doesn't use 10/20 anywhere. So does it in fact load all matching rows and do the slicing in Python?
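(A quick way to confirm what SQL is emitted, as a sketch with an illustrative database URL: create the engine with echo=True and watch the log while slicing.)

from sqlalchemy import create_engine

engine = create_engine('sqlite:///app.db', echo=True)  # URL is illustrative
# With the default lazy='select' relationship, evaluating
# user.articles[10:20] logs a SELECT with no LIMIT/OFFSET:
# the full collection is loaded and the slice happens on the Python list.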
The solution is to use a dynamic relationship as described in the collection configuration techniques section of the SQLAlchemy documentation.
By specifying the relationship as
class User(...):
    # ...
    articles = relationship('Article', order_by='desc(Article.date)', lazy='dynamic')
you can then write user.articles.limit(10), which will generate and execute a query fetching the user's ten most recent articles. Or, if you prefer, you can use the [x:y] slice syntax, which automatically generates a LIMIT (and OFFSET) clause.
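For example, a minimal sketch of both forms, assuming the session and models from the question:

user = session.query(User).filter_by(id=1).one()

latest = user.articles.limit(10).all()  # SELECT ... ORDER BY ... LIMIT 10
page = user.articles[10:20]             # SELECT ... LIMIT 10 OFFSET 10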
Performance should be reasonable unless you want to query the latest ten articles for 100 or so users (in which case at least 101 queries will be sent to the server).
Related
I'm using Django 2.0 and have a Content model with a ForeignKey(User, ...). I also have a list of user IDs for which I'd like to fetch that Content, ordered by "newest first", but only up to 25 elements per user. I know I can do this:
Content.objects.filter(user_id__in=[1, 2, 3, ...]).order_by('-id')
...to fetch all the Content objects created by each of these users, plus I'll get it all sorted with newest elements first. But I'd like to fetch up to 25 elements for each of these users (some users might create hundreds of these objects, some might create zero). There's of course the dumb way:
for user in [1, 2, 3, ...]:
    Content.objects.filter(user_id=user).order_by('-id')[:25]
This, however, hits the database as many times as there are IDs in the list, which can get quite high (around 100 or so per page view). Is there any way to optimize this case? (I've tried looking at select_related, but it seems to fetch as many related models as possible.)
There are plenty of ways to form a greatest-n-per-group query, but in this case you could form a union of top-n queries of all users:
contents = Content.objects.\
    none().\
    union(*[Content.objects.
            filter(user_id=uid).
            order_by('-id')[:25] for uid in user_ids],
          all=True)
Using prefetch_related() you can then produce a queryset that fetches the users and attaches each user's latest content as an attribute:
users = User.objects.\
    filter(id__in=user_ids).\
    prefetch_related(models.Prefetch(
        'content_set',
        queryset=contents,
        to_attr='latest_content'))
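Consuming the prefetched attribute is then a plain loop; each user carries up to 25 of their newest Content objects, and no extra queries are issued inside the loop (a sketch):

for user in users:
    for content in user.latest_content:
        print(user.id, content.id)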
Does it actually hit the database that many times? I have not looked at the raw SQL, but according to the documentation slicing is equivalent to a LIMIT clause, and the docs also state: "Generally, slicing a QuerySet returns a new QuerySet – it doesn't evaluate the query".
https://docs.djangoproject.com/en/2.0/topics/db/queries/#limiting-querysets
I would be curious to see the raw SQL if you have looked at it and it does NOT do this, as I use this pattern myself.
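For what it's worth, you can see the SQL a sliced queryset would run without executing it, by printing its query attribute (a sketch; the table name in the output is illustrative):

print(Content.objects.order_by('-id')[:25].query)
# SELECT ... FROM "app_content" ORDER BY "app_content"."id" DESC LIMIT 25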
I'm trying to provide an interface for users to write custom queries against the database. I need to make sure they can only query the records they are allowed to see. To do that, I decided to apply row-based access control using django-guardian.
Here is what my schema looks like:
class BaseClass(models.Model):
    somefield = models.TextField()

    class Meta:
        permissions = (
            ('view_record', 'View record'),
        )
class ClassA(BaseClass):
    # some other fields here
    classb = models.ForeignKey('ClassB', on_delete=models.CASCADE)

class ClassB(BaseClass):
    # some fields here
    classc = models.ForeignKey('ClassC', on_delete=models.CASCADE)

class ClassC(BaseClass):
    # some fields here
    pass
I would like to be able to use get_objects_for_group as follows:
>>> group = Group.objects.create(name='some group')
>>> class_c = ClassC.objects.create(somefield='ClassC')
>>> class_b = ClassB.objects.create(somefield='ClassB', classc=class_c)
>>> class_a = ClassA.objects.create(somefield='ClassA', classb=class_b)
>>> assign_perm('view_record', group, class_c)
>>> assign_perm('view_record', group, class_b)
>>> assign_perm('view_record', group, class_a)
>>> get_objects_for_group(group, 'view_record')
This gives me a QuerySet. Can I use the BaseClass that I defined above and write a raw query over other related classes?
>>> qs.intersection(get_objects_for_group(group, 'view_record'), \
...     BaseClass.objects.raw('select * from table_a a '
...                           'join table_b b on a.id=b.table_a_id '
...                           'join table_c c on b.id=c.table_b_id '
...                           'where some conditions here'))
Does this approach make sense? Is there a better way to tackle this problem?
Thanks!
Edit:
Another way to tackle the problem might be creating a separate table for each user. I understand the complexity this might add to my application but:
The number of users will not exceed the hundreds for a long time; this is not a consumer application.
Per our use case, it's quite unlikely that I'll need to query across these tables. I won't write a query that needs to aggregate anything from table1, table2, table3 that belongs to the same model.
Maintaining a separate table per customer could have an advantage.
Do you think this is a viable approach?
After researching many options, I found out that I can solve this problem at the database level using Row Level Security on PostgreSQL. It ends up being the easiest and most elegant solution.
This article helped me a lot in bridging application-level users with PostgreSQL policies.
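For illustration, here is a minimal sketch of wiring such a policy up from a Django migration; the table, column, and setting names are assumptions, not taken from the article:

from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [('myapp', '0001_initial')]  # hypothetical app/migration

    operations = [
        migrations.RunSQL(
            sql=[
                'ALTER TABLE myapp_record ENABLE ROW LEVEL SECURITY;',
                # The policy compares an (assumed) owner_id column against a
                # per-connection setting the application sets after login.
                "CREATE POLICY user_isolation ON myapp_record "
                "USING (owner_id = current_setting('app.current_user_id')::int);",
            ],
            reverse_sql=[
                'DROP POLICY user_isolation ON myapp_record;',
                'ALTER TABLE myapp_record DISABLE ROW LEVEL SECURITY;',
            ],
        ),
    ]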
What I learned by doing my research is:
Separate tables could still be an option in the future, since customers are allowed to run arbitrary queries and could potentially affect each other's query performance.
Trying to solve it at the ORM level is almost impossible if you are planning to use raw or ad-hoc queries.
I think you already know what you need to do: the word you are looking for is multitenancy. It is not quite one table per customer, though; the best fit for you would be one schema per customer. Unfortunately, the best article I had on multitenancy is no longer available; see if you can find a cached version: https://msdn.microsoft.com/en-us/library/aa479086.aspx. Otherwise there are numerous articles available on the internet.
Another viable approach is to take a look at custom managers. You could write one custom manager per model/customer combination and query through it accordingly, as in the sketch below. But all this adds application complexity and can quickly get out of hand, and any bug in an application-level security layer is a nightmare.
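A rough sketch of that idea (the model and field names are assumptions, not from the question):

from django.db import models

class CustomerScopedManager(models.Manager):
    def for_customer(self, customer):
        # Narrow every query made through this manager to one customer.
        return self.get_queryset().filter(customer=customer)

class Record(models.Model):
    customer = models.ForeignKey('Customer', on_delete=models.CASCADE)
    objects = CustomerScopedManager()

# Usage: Record.objects.for_customer(current_customer).filter(...)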
Weighing both, I'd be inclined to say that the multitenancy solution you described in your edit is by far the best approach.
First, you should provide more details about how your Django architecture is set up and built so that we can help you. Have you implemented an API? Using Django templates is not really a good idea if you are building a large-scale application that consumes a lot of data, because this can affect the query load massively. I would suggest separating your front end from the back end.
For example, I have 1000 Users with lots of related objects that I use in a template.
Is it right that this:
User.objects.all()[:10]
Will always perform better than this:
User.objects.all().prefetch_related('educations', 'places')[:10]
This line will do two extra queries, one for educations and one for places, to fetch the related objects.
User.objects.all().prefetch_related('educations', 'places')[:10]
However, it will only fetch the related objects for the sliced queryset User.objects.all()[:10], so you don't have to worry that it will fetch the related objects for the thousands of other users in your database.
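If you want to verify the query count yourself, here is a sketch using Django's test utilities:

from django.db import connection
from django.test.utils import CaptureQueriesContext

with CaptureQueriesContext(connection) as ctx:
    users = list(User.objects.all().prefetch_related('educations', 'places')[:10])

# Expect 3 queries: one for the ten users, plus one per prefetched lookup,
# each restricted to those ten users' IDs.
print(len(ctx.captured_queries))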
Here are my entities:
class Article(db.Entity):
    id = PrimaryKey(int, auto=True)
    creation_time = Required(datetime)
    last_modification_time = Optional(datetime, default=datetime.now)
    title = Required(str)
    contents = Required(str)
    authors = Set('Author')

class Author(db.Entity):
    id = PrimaryKey(int, auto=True)
    first_name = Required(str)
    last_name = Required(str)
    articles = Set(Article)
And here is the code I'm using to get some data:
return left_join((article, author)
                 for article in entities.Article
                 for author in article.authors).prefetch(entities.Author)[:]
Whether I use the prefetch method or not, the generated SQL always looks the same:
SELECT DISTINCT "article"."id", "t-1"."author"
FROM "article" "article"
LEFT JOIN "article_author" "t-1"
ON "article"."id" = "t-1"."article"
And then, when I iterate over the results, Pony issues additional queries:
SELECT "id", "creation_time", "last_modification_time", "title", "contents"
FROM "article"
WHERE "id" = %(p1)s
SELECT "id", "first_name", "last_name"
FROM "author"
WHERE "id" IN (%(p1)s, %(p2)s)
The desired behavior would be for the ORM to issue a single query that loads all the needed data. How do I achieve that?
Author of PonyORM here. We don't want to load all these objects using just one query, because that would be inefficient.
The only benefit of using a single query to load a many-to-many relation is reducing the number of round-trips to the database. But replacing three queries with one is not a major improvement: when your database server is located near your application server, these round-trips are actually very fast compared with processing the resulting data in Python.
On the other hand, when both sides of a many-to-many relation are loaded in the same query, the same object's data is inevitably repeated over and over in multiple rows. This has many drawbacks:
The size of the data transferred from the database becomes much larger than when no duplicate information is transferred. In your example, if you have ten articles and each is written by three authors, a single query returns thirty rows, with large fields like article.contents duplicated multiple times. Separate queries transfer the minimum amount of data possible; the difference in size can easily be an order of magnitude, depending on the specific many-to-many relation.
The database server is usually written in a compiled language like C and works very fast; the same is true for the networking layer. But Python code is interpreted, and the time consumed by Python code is (contrary to some opinions) usually much greater than the time spent in the database. You can see the profiling tests performed by the SQLAlchemy author Mike Bayer, after which he came to this conclusion:
A great misconception I seem to encounter often is the notion that communication with the database takes up a majority of the time spent in a database-centric Python application. This perhaps is a common wisdom in compiled languages such as C or maybe even Java, but generally not in Python. Python is very slow, compared to such systems (...) Whether a database driver (DBAPI) is written in pure Python or in C will incur significant additional Python-level overhead. For just the DBAPI alone, this can be as much as an order of magnitude slower.
When all the data of a many-to-many relation is loaded with the same query and the same data is repeated in many rows, it is necessary to parse all of this repeated data in Python just to throw most of it away. As Python is the slowest part of the process, such "optimization" may lead to decreased performance.
In support of my words I can point to the Django ORM. That ORM has two methods which can be used for query optimization. The first one, called select_related, loads all related objects in a single query, while the more recently added prefetch_related loads objects the way Pony does by default. According to Django users, the second method often works much faster:
In some scenarios, we have found up to a 30% speed improvement.
The database is required to perform joins, which consume precious resources of the database server.
While Python code is the slowest part when processing a single request, the database server's CPU time is a shared resource used by all parallel requests. You can scale Python code easily by starting multiple Python processes on different servers, but it is much harder to scale the database. Because of this, in a high-load application it is better to offload work from the database server to the application servers, so that the work can be done in parallel by multiple application servers.
When the database performs a join, it has to spend additional time doing it. But for Pony it is irrelevant whether the database makes the join or not, because in any case the objects will be interlinked inside the ORM identity map, so the work the database does to perform the join is just wasted database time. On the other hand, using the identity map pattern, Pony can link objects equally fast regardless of whether they arrive in the same database row or not.
Returning to the number of round-trips, Pony has a dedicated mechanism to eliminate the "N+1 query" problem. The "N+1 query" anti-pattern arises when an ORM sends hundreds of very similar queries, each of which loads a separate object from the database. Many ORMs suffer from this problem, but Pony can detect it and replace the N repeated queries with a single query that loads all the necessary objects at once. This mechanism is very efficient and can greatly reduce the number of round-trips (see the sketch after this paragraph). But when we speak of loading a many-to-many relation, there are no N queries here; there are just three queries which are more efficient when executed separately, so there is no benefit in trying to execute a single query instead.
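To illustrate what that looks like from the user's side, here is a sketch using the question's entities (my illustration of the behavior described above, not code from the answer):

from pony.orm import db_session, select

with db_session:
    articles = select(a for a in Article)[:10]  # one query for the articles
    for article in articles:
        # Iterating does not issue one query per article: Pony loads the
        # authors of the fetched articles with a single
        # SELECT ... WHERE "id" IN (...) query, as in the log above.
        print(article.title, [au.last_name for au in article.authors])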
To summarize, I should say that ORM performance is very important to us, the Pony ORM developers. Because of that, we don't want to implement loading a many-to-many relation in a single query, as it would almost certainly be slower than our current solution.
So, to answer your question, you cannot load both sides of a many-to-many relation in a single query, and I think this is a good thing.
This should work:
from pony.orm import select

select((article, author) for article in Article
                         for author in article.authors)[:]
I am trying to design a tagging system with a model like this:
class Tag(models.Model):
    content = models.CharField(max_length=100)                   # max_length assumed
    creator = models.ForeignKey(User, on_delete=models.CASCADE)  # User model assumed
    used = models.IntegerField(default=0)
It is a many-to-many relationship between tags and what's been tagged.
Every time I insert a record into the association table,
Tag.used is incremented by one, and it is decremented by one on deletion.
Tag.used is maintained because I want to quickly answer the question 'How many times has this tag been used?'.
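For concreteness, the bookkeeping described above might look like this (a sketch with assumed Django names, using F() so the counter update happens atomically in SQL):

from django.db.models import F

def tag_item(item, tag):
    item.tags.add(tag)  # insert into the association table
    Tag.objects.filter(pk=tag.pk).update(used=F('used') + 1)

def untag_item(item, tag):
    item.tags.remove(tag)
    Tag.objects.filter(pk=tag.pk).update(used=F('used') - 1)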
However, this obviously slows insertion down.
Please tell me how to improve this design.
Thanks in advance.
http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html
If your database supports materialized/indexed views, you might want to create one for this. You can get a large performance boost for frequently run queries that aggregate data, which I think you have here.
Your view would be based on a query like:
SELECT TagID, COUNT(*)
FROM YourTable
GROUP BY TagID
The aggregations can be precomputed and stored in the index to minimize expensive computations during query execution.
I don't think it's a good idea to denormalize your data like that.
I think a more elegant solution is to use Django aggregation to count how many times a tag has been used: http://docs.djangoproject.com/en/dev/topics/db/aggregation/
You could attach the used count to your tag object by calling something like this:
from django.db.models import Count

my_tag = Tag.objects.annotate(used=Count('post'))[0]
and then accessing it like this:
my_tag.used
assuming that you have a Post model class that has a ManyToMany field to your Tag class
You can order the Tags by the named annotated field if needed:
Tag.objects.annotate(used=Count('post')).order_by('-used')