Raw query and row level access control over multiple models in Django

I'm trying to provide an interface for users to write custom queries over the database. I need to make sure they can only query the records they are allowed to see. To do that, I decided to apply row-level access control using django-guardian.
Here is what my schema looks like:
from django.db import models

class BaseClass(models.Model):
    somefield = models.TextField()

    class Meta:
        permissions = (
            ('view_record', 'View record'),
        )

class ClassA(BaseClass):
    # some other fields here
    classb = models.ForeignKey('ClassB', on_delete=models.CASCADE)

class ClassB(BaseClass):
    # some fields here
    classc = models.ForeignKey('ClassC', on_delete=models.CASCADE)

class ClassC(BaseClass):
    # some fields here
    pass
I would like to be able to use get_objects_for_group as follows:
>>> from django.contrib.auth.models import Group
>>> from guardian.shortcuts import assign_perm, get_objects_for_group
>>> group = Group.objects.create(name='some group')
>>> class_c = ClassC.objects.create(somefield='ClassC')
>>> class_b = ClassB.objects.create(somefield='ClassB', classc=class_c)
>>> class_a = ClassA.objects.create(somefield='ClassA', classb=class_b)
>>> assign_perm('view_record', group, class_c)
>>> assign_perm('view_record', group, class_b)
>>> assign_perm('view_record', group, class_a)
>>> get_objects_for_group(group, 'view_record')
This gives me a QuerySet. Can I use the BaseClass that I defined above and write a raw query over other related classes?
>>> qs.intersection(get_objects_for_group(group, 'view_record'),
...                 BaseClass.objects.raw('select * from table_a a '
...                                       'join table_b b on a.id=b.table_a_id '
...                                       'join table_c c on b.id=c.table_b_id '
...                                       'where some conditions here'))
Does this approach make sense? Is there a better way to tackle this problem?
Thanks!
Edit:
Another way to tackle the problem might be to create a separate table for each user. I understand the complexity this would add to my application, but:
The number of users will not be more than a few hundred for a long time; this is not a consumer application.
Per our use case, it's quite unlikely that I'll need to query across these tables. I won't write a query that needs to aggregate anything from table1, table2 and table3 belonging to the same model.
Maintaining a separate table per customer could have advantages of its own.
Do you think this is a viable approach?

After researching many options, I found out that I can solve this problem at the database level using Row Level Security in PostgreSQL. It ends up being the easiest and most elegant solution.
This article helped me a lot to bridge the application-level users with PostgreSQL policies.
What I learned while doing my research:
Separate tables could still be an option in the future, since customers are allowed to run arbitrary queries and could potentially affect each other's query performance.
Trying to solve it at the ORM level is almost impossible if you are planning to use raw or ad-hoc queries.
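To give a rough idea of what that bridge looks like, here is a minimal sketch, not the article's exact code: the setting name app.current_user_id, the middleware and the policy are all assumptions.

# Hypothetical middleware: stamp every request's database session with the
# authenticated user's id so PostgreSQL RLS policies can read it back with
# current_setting('app.current_user_id').
from django.db import connection

class RowLevelSecurityMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        if request.user.is_authenticated:
            with connection.cursor() as cursor:
                # set_config(name, value, is_local): is_local=false keeps the
                # setting for the whole session rather than one transaction.
                cursor.execute(
                    "SELECT set_config('app.current_user_id', %s, false)",
                    [str(request.user.id)],
                )
        return self.get_response(request)

On the database side you would still enable RLS and create a policy in SQL, along the lines of ALTER TABLE ... ENABLE ROW LEVEL SECURITY and CREATE POLICY ... USING (owner_id = current_setting('app.current_user_id')::integer), so the filtering applies to raw and ad-hoc queries alike.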

I think you already know what you need to do. The word you are looking for is multitenancy, although it is not one table per customer; the best fit for you would be one schema per customer. Unfortunately, the best article I had on multitenancy is no longer available. See if you can find a cached version: https://msdn.microsoft.com/en-us/library/aa479086.aspx; otherwise there are numerous articles available on the internet.
Another viable approach is to take a look at custom managers. You could write one custom manager for each model/customer combination and query it accordingly. But all of this adds application complexity and will soon get out of hand, and any bug in the application-level security layer is a nightmare for you.
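For illustration, a minimal sketch of that idea, assuming the customer lives as a foreign key on the model (the Record and Customer names are made up):

from django.db import models

class Customer(models.Model):
    name = models.CharField(max_length=200)

class CustomerQuerySet(models.QuerySet):
    def for_customer(self, customer):
        # Every query in the application has to remember to call this,
        # which is exactly where the bugs creep in.
        return self.filter(customer=customer)

class Record(models.Model):
    customer = models.ForeignKey(Customer, on_delete=models.CASCADE)
    payload = models.TextField()

    objects = CustomerQuerySet.as_manager()

# Usage: Record.objects.for_customer(request_customer)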
Weighing the two, I'd be inclined to say that the multitenancy solution, as you said in your edit, is by far the better approach.

First, you should give us more details about how your architecture is set up and built with Django so that we can help you. Have you implemented an API? Using Django templates is not really a good idea if you are building a large-scale application that consumes a lot of data, because this can affect the query load massively. I would suggest separating your front end from your back end.

Related

How to implement composition/aggregation with NDB on GAE

How do we implement aggregation or composition with NDB on Google App Engine? What is the best way to proceed depending on the use case?
Thanks!
I've tried using a repeated property. In this very simple example, a Project has a list of Tag keys (I chose to code it this way instead of using StructuredProperty because many Project objects can share Tag objects).
class Project(ndb.Model):
    name = ndb.StringProperty()
    # Tag is defined elsewhere; the string kind avoids a forward reference
    tags = ndb.KeyProperty(kind='Tag', repeated=True)
    budget = ndb.FloatProperty()
    date_begin = ndb.DateProperty(auto_now_add=True)
    date_end = ndb.DateProperty(auto_now_add=True)

    @classmethod
    def all(cls):
        return cls.query()

    @classmethod
    def addTags(cls, from_str):
        tagname_list = from_str.split(',')
        tag_list = []
        for tag in tagname_list:
            tag_list.append(Tag.addTag(tag))
        cls.tags = tag_list
--
Edit 2:
Thanks. Finally, I have chosen to create a new model class, 'Relation', representing a relation between two entities. It's more of an association; I confess that my first design was ill-suited.
An alternative would be to use BigQuery. At first we used NDB, with a RawModel which stores individual, non-aggregated records, and an AggregateModel which stores the aggregated values.
The AggregateModel was updated every time a RawModel was created, which caused some inconsistency issues. In hindsight, properly using parent/ancestor keys as Tim suggested would have worked, but in the end we found BigQuery much more pleasant and intuitive to work with.
We just have cron jobs that run every day: one pushes RawModel records to BigQuery and another creates the AggregateModel records with data fetched back from BigQuery.
(Of course, this is only effective if you have lots of data to aggregate)
It really does depend on the use case. For small numbers of items, StructuredProperty and repeated properties may well be the best fit.
For large numbers of entities you would then look at setting the parent/ancestor in the key for composition, and at a KeyProperty pointing to the primary entity for a many-to-one aggregation.
However, the choice will also depend heavily on the actual usage pattern, which is where considerations of efficiency kick in.
The best I can suggest is to consider carefully how you plan to use these relationships: how active they are (are they constantly changing, with additions and deletions?), and whether you need to see all members of the relation most of the time or just subsets. These considerations may well require adjustments to the approach.
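As a small sketch of those two patterns (the model names and fields are invented for illustration):

from google.appengine.ext import ndb

class Project(ndb.Model):
    name = ndb.StringProperty()

class Task(ndb.Model):
    # Composition: each Task is created with its Project as parent, so the
    # project key is embedded in the task key and ancestor queries are
    # strongly consistent.
    title = ndb.StringProperty()

class Comment(ndb.Model):
    # Aggregation (many-to-one): Comments simply point at the Project with a
    # KeyProperty; they are not owned by it.
    project = ndb.KeyProperty(kind=Project)
    text = ndb.StringProperty()

project_key = Project(name='demo').put()
Task(parent=project_key, title='write docs').put()

tasks = Task.query(ancestor=project_key).fetch()
comments = Comment.query(Comment.project == project_key).fetch()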

When should I use objects.raw() in Django?

I am quite new to Python and I have seen that both
Entries.objects.raw("SELECT * FROM register_entries")
and
Entries.objects.filter()
do the same query.
In which cases is better to use one or the other?
It depends on the database backend that you are using. The first assumes that you have a SQL-based database engine, which is not always true. The second one, on the other hand, will work with any backend that is designed for it; a few years ago, for instance, there was an LDAP backend built that way, and LDAP queries do not use SQL at all.
In all cases, I advise you to use the second one. It is the better way to go if you want to write reusable, long-term code.
There are also other reasons to prefer the second one over the first:
it avoids possible SQL injections;
there is no need to know the database layout (table names, field names);
generic code is better than specific code;
and moreover, it is shorter.
You will sometimes have to use the first one when you do specific operations (calling backend-specific functions), but avoid it as much as possible.
In a nutshell, use the second one!
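To make the comparison concrete, here is the same lookup both ways; the title field is invented for the example, the table name comes from the question, and the raw version should at least use parameter binding rather than string formatting:

# ORM version: backend-agnostic and safe against SQL injection.
entries = Entries.objects.filter(title__startswith='a')

# Raw version: tied to the SQL schema; pass user input as parameters,
# never by interpolating it into the query string.
entries = Entries.objects.raw(
    "SELECT * FROM register_entries WHERE title LIKE %s",
    ['a%'],
)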
From the Django documentation:
When the model query APIs don’t go far enough, you can fall back to writing raw SQL
The Django QuerySet API offers many options for customising your queries, but in some cases you need very specific queries for which it becomes insufficient. Before you reach for raw SQL, though, it is best to read the QuerySet API docs and learn everything the ORM can already do.

Sharding a Django Project

I'm starting a Django project and need to shard multiple tables that are likely to all end up with too many rows. I've looked through threads here and elsewhere, and followed the Django multi-db documentation, but I am still not sure how it all stitches together. My models have relationships that would be broken by sharding, so it seems like the options are to either drop the foreign keys or forgo sharding the respective models.
For argument's sake, consider the classic Author, Publisher and Book scenario, but throw in book copies and users that can own them. Say books and users had to be sharded. How would you approach that? A user may own a copy of a book that's not in the same database.
In general, what are the best practices you have used for routing and for the sharding itself? Did you use Django database routers, manually select a database inside commands based on your sharding logic, or override some parts of the ORM to achieve it?
I'm using PostgreSQL on Ubuntu, if it matters.
Many thanks.
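For reference, the router option mentioned above looks roughly like this; the shard aliases, the hash-on-primary-key strategy and the model names are assumptions, not a recommendation:

# settings.py would declare the extra databases, e.g.
#   DATABASES = {'default': {...}, 'shard_0': {...}, 'shard_1': {...}}
#   DATABASE_ROUTERS = ['myapp.routers.ShardRouter']

SHARDS = ['shard_0', 'shard_1']

class ShardRouter:
    """Hypothetical router that pins sharded models to a shard by primary key."""

    sharded_models = {'user', 'copy'}

    def _shard_for(self, pk):
        return SHARDS[pk % len(SHARDS)]

    def db_for_read(self, model, **hints):
        instance = hints.get('instance')
        if model._meta.model_name in self.sharded_models and getattr(instance, 'pk', None) is not None:
            return self._shard_for(instance.pk)
        return None  # fall back to 'default'

    db_for_write = db_for_read

    def allow_relation(self, obj1, obj2, **hints):
        # Cross-shard foreign keys (a user owning a copy stored elsewhere)
        # cannot be enforced by the database; that is the trade-off above.
        return True

Note that routers only receive limited context (the model and optional hints), so many projects end up selecting the database explicitly with .using() instead.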
In the past I've done something similar using PostgreSQL table partitioning, although that merely splits a table up within the same database. It is helpful in reducing table search time, and it is also nice because you don't need to modify your Django code much. (Make sure you perform queries with the fields you're using for the partition constraints.)
But it's not sharding.
If you haven't seen it yet, you should check out Sharding Postgres with Instagram.
I agree with @DanielRoseman. Also, how many rows is too many? If you are careful with indexing, you can handle a lot of rows without performance problems. Keep your indexed values small (ints). I've got tables in excess of 400 million rows that produce sub-second responses, even when joining with other many-million-row tables.
It might make more sense to break the user up into multiple tables so that the user object has a core of commonly used things and the "profile" info lives elsewhere (the standard Django setup). Copies would be a small table referencing books, which holds the bulk of the data. Considering how much RAM you can put into a DB server these days, sharding before you have to seems wrong.
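A sketch of that split, following the usual Django convention (the Profile fields are placeholders):

from django.conf import settings
from django.db import models

class Profile(models.Model):
    # Rarely-used columns live here, one row per user, joined only when needed.
    user = models.OneToOneField(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    bio = models.TextField(blank=True)
    website = models.URLField(blank=True)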

Designing a Tag table that tells how many times it's used

I am trying to design a tagging system with a model like this:
Tag:
    content = CharField
    creator = ForeignKey
    used = IntegerField
It is a many-to-many relationship between tags and whatever is being tagged.
Every time I insert a record into the association table, Tag.used is incremented by one, and decremented by one in case of deletion.
Tag.used is maintained because I want to speed up answering the question 'How many times has this tag been used?'.
However, this obviously slows insertion down.
Please tell me how to improve this design.
Thanks in advance.
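For what it's worth, the bookkeeping described above is usually done with F() expressions so concurrent inserts don't lose updates; a sketch, assuming Tag is the Django model from the question and you call these wherever the association rows are created and deleted:

from django.db.models import F

def tag_attached(tag_id):
    # Single UPDATE ... SET used = used + 1 executed in the database.
    Tag.objects.filter(pk=tag_id).update(used=F('used') + 1)

def tag_detached(tag_id):
    Tag.objects.filter(pk=tag_id).update(used=F('used') - 1)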
http://www.pui.ch/phred/archives/2005/06/tagsystems-performance-tests.html
If your database supports materialized indexed views, you might want to create one for this. You can get a large performance boost for frequently run queries that aggregate data, which I think you have here.
Your view would be based on a query like:
SELECT TagID, COUNT(*)
FROM YourTable
GROUP BY TagID
The aggregations can be precomputed and stored in the index to minimize expensive computations during query execution.
I don't think it's a good idea to denormalize your data like that.
I think a more elegant solution is to use django aggregation to track how many times the tag has been used http://docs.djangoproject.com/en/dev/topics/db/aggregation/
You could attach the used count to your tag object by calling something like this:
my_tag = Tag.objects.annotate(used=Count('post'))[0]
and then accessing it like this:
my_tag.used
assuming that you have a Post model class that has a ManyToMany field to your Tag class
You can order the Tags by the named annotated field if needed:
Tag.objects.annotate(used=Count('post')).order_by('-used')

Django objects.filter, how "expensive" would this be?

I am trying to make a search view in Django. It is a search form with free-text input plus some options to select, so that you can filter on years and so on. Below is the part of the view that does the filtering so far, and I would like some input on how expensive this would be for the database server.
soknad_list = Soknad.objects.all()

if var1:
    soknad_list = soknad_list.filter(pub_date__year=var1)
if var2:
    soknad_list = soknad_list.filter(muncipality__name__exact=var2)
if var3:
    soknad_list = soknad_list.filter(genre__name__exact=var3)

# TEXT SEARCH
stop_word_list = re.compile(STOP_WORDS, re.IGNORECASE)
search_term = '%s' % request.GET['q']
cleaned_search_term = stop_word_list.sub('', search_term)
cleaned_search_term = cleaned_search_term.strip()
if len(cleaned_search_term) != 0:
    soknad_list = soknad_list.filter(Q(dream__icontains=cleaned_search_term) | Q(tags__icontains=cleaned_search_term) | Q(name__icontains=cleaned_search_term) | Q(school__name__icontains=cleaned_search_term))
So what I do is make a queryset of all objects first, then check which variables exist (I fetch these with GET at an earlier point) and filter the results if they do. But this doesn't seem very elegant; it probably runs a lot of queries to achieve the result, so is there a better way to do this?
It does exactly what I want, but I guess there is a better/smarter way to do it. Any ideas?
filter() itself doesn't execute a query; no query is executed until you explicitly fetch items from the queryset (e.g. with get()), and list(queryset) also executes it.
You can see the query that will be generated by using:
str(soknad_list.query)
You can then put that into your database shell to see how long the query takes, or use EXPLAIN (if your database backend supports it) to see how expensive it is.
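On a reasonably recent Django (2.1 or later) the queryset can also produce the plan for you; the filter values below are just examples:

qs = Soknad.objects.filter(pub_date__year=2008).filter(genre__name__exact='rock')
print(qs.query)      # the single SELECT that all the chained filters collapse into
print(qs.explain())  # the database's execution plan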
As Aaron mentioned, you should get hold of the query text that is going to be run against the database and use EXPLAIN (or some other method) to view the query execution plan. Once you have the execution plan for the query you can see what is going on in the database itself. There are a lot of operations that seem very expensive to run through procedural code but are trivial for the database, especially if you provide indexes that the database can use to speed up your query.
If I read your question correctly, you're retrieving a result set of all rows in the Soknad table. Once you have these results back you use the filter() method to trim your results down to meet your criteria. From looking at the Django documentation, it looks like this will do an in-memory filter rather than re-query the database (of course, this really depends on which data access layer you're using and not on Django itself).
The most optimal solution would be to use a full-text search engine (Lucene, Ferret, etc.) to handle this for you. If that is not available or practical, the next best option would be to construct a query predicate (WHERE clause) before issuing your query to the database and let the database perform the filtering.
However, as with all things that involve the database, the real answer is 'it depends.' The best suggestion is to try out several different approaches using data that is close to production and benchmark them over at least 3 iterations before settling on a final solution to the problem. It may be just as fast, or even faster, to filter in memory rather than filter in the database.
