Ordering queryset by filtered child objects - python

I have the following (simplified) data model, see visual representation below the post:
Articles, which have Attributes
Attributes refer by PK to a Type, which have a field code
Attributes have a field value
The value refers to the field uuid in another model called Record, which also has a field sorting_code
I now want to return a list of articles in a certain ordering, using pagination. I am using a default ViewSet for it. The pagination forces me to do the ordering in database, instead of later in Python. However, I cannot seem to find the correct ordering clause that orders these articles. What I need to do is:
Fetch the Attribute with a specific type
Look up the values in those Attributes in the Record table
Order the articles by sorting_code
The following SQL query does the job (for all articles):
SELECT art.order_id, art.uuid, att.value, mdr.code, mdr.name, mdr.sorting_code
FROM ow_order_article art
INNER JOIN ow_order_attribute att ON att.article_id = art.uuid
INNER JOIN ow_masterdata_type mdt ON att.masterdata_type_id = mdt.uuid
INNER JOIN ow_masterdata_record mdr ON att.value = mdr.uuid
WHERE mdt.code = 'article_structure'
ORDER BY mdr.sorting_code, mdr.code
What would be the correct way to get this ordering in a queryset in Django?
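To pin down the ordering any ORM translation has to reproduce, the SQL above can be run on a toy dataset. This is a minimal sketch using Python's built-in sqlite3 module; the table layouts are reduced to the columns the query touches and every row is invented for illustration. On the Django side, annotating each article with a Subquery keyed on OuterRef and ordering by that annotation may be one route, but that is an assumption and is not shown or tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced schema: only the columns the query needs.
conn.executescript("""
CREATE TABLE ow_order_article (uuid TEXT PRIMARY KEY, order_id INTEGER);
CREATE TABLE ow_masterdata_type (uuid TEXT PRIMARY KEY, code TEXT);
CREATE TABLE ow_masterdata_record (
    uuid TEXT PRIMARY KEY, code TEXT, name TEXT, sorting_code INTEGER
);
CREATE TABLE ow_order_attribute (
    article_id TEXT, masterdata_type_id TEXT, value TEXT
);
INSERT INTO ow_order_article VALUES ('art-1', 1), ('art-2', 1), ('art-3', 1);
INSERT INTO ow_masterdata_type VALUES
    ('type-s', 'article_structure'), ('type-c', 'colour');
INSERT INTO ow_masterdata_record VALUES
    ('rec-a', 'A', 'Frame', 20),
    ('rec-b', 'B', 'Door', 10),
    ('rec-c', 'C', 'Red', 1);
INSERT INTO ow_order_attribute VALUES
    ('art-1', 'type-s', 'rec-a'),
    ('art-1', 'type-c', 'rec-c'),
    ('art-2', 'type-s', 'rec-b'),
    ('art-3', 'type-c', 'rec-c');
""")

# Same joins, filter and ordering as the query in the question.
QUERY = """
SELECT art.uuid, mdr.sorting_code
FROM ow_order_article art
INNER JOIN ow_order_attribute att ON att.article_id = art.uuid
INNER JOIN ow_masterdata_type mdt ON att.masterdata_type_id = mdt.uuid
INNER JOIN ow_masterdata_record mdr ON att.value = mdr.uuid
WHERE mdt.code = 'article_structure'
ORDER BY mdr.sorting_code, mdr.code
"""
ordered_articles = conn.execute(QUERY).fetchall()
print(ordered_articles)
```

Note that the inner joins drop art-3, which has no article_structure attribute. An ORM version built on annotations would probably keep such articles with a NULL sort value, so the two are not automatically equivalent.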

Related

Assigning identifiable unique IDs to rows when importing data from XML

I'm designing a database in which I'll be importing a large amount of data from XML daily to create or update existing rows.
Item data spans dozens of tables all related to the item_id in the main item table
For every item in the XML file, I need to check if it already exists in the database and update or create if it's not there.
Every XML belongs to a source_id and every item in the XML contains a unique alphanumeric ID up to 50 chars (but those IDs are not unique across all XMLs), so source_id:xml_item_id would be unique here
What I need is a way of finding if the item already exists in the database. Ideally, I will search by pk and use the same pk to join other tables
Attempt 1
I've tried encoding source_id:xml_item_id into a bigint for the pk, and decoding the bigint back to the original source_id:xml_item_id, but most of the time this overflows
So this is not going to work
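A rough size estimate shows why the reversible bigint encoding overflows. The sketch below assumes a 36-symbol alphabet (digits plus case-insensitive letters), which the question does not specify, and it ignores the extra bits the source_id would need.

```python
def bits_needed(alphabet_size: int, length: int) -> int:
    """Bits needed to give every string of `length` symbols its own integer."""
    return (alphabet_size ** length - 1).bit_length()


def longest_id_that_fits(alphabet_size: int, bits: int = 63) -> int:
    """Longest ID length whose whole value range fits into `bits` bits."""
    length = 0
    while alphabet_size ** (length + 1) <= 2 ** bits:
        length += 1
    return length


ALPHABET = 36  # assumption: 0-9 plus a-z, case-insensitive

# A signed 64-bit bigint leaves 63 bits for non-negative values.
print(bits_needed(ALPHABET, 50))       # bits for a 50-character ID
print(longest_id_that_fits(ALPHABET))  # longest ID that fits in a bigint
```

Under that assumption a 50-character ID needs 259 bits, while a bigint can reversibly hold at most 12 such characters, so any lossless packing of the full ID into a bigint is bound to overflow.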
Attempt 2
Use a UUID for the pk and source_id:xml_item_id as unique_id (string) for which to search by, but join related tables to UUID
While I don't see anything wrong here (IMO), JOINs might be affected, and I would prefer numeric pk for use in URLs
Attempt 3
Use source_id:xml_item_id as pk (string)
Same worries as with Attempt 2
The reason I've avoided auto-increment (AI) PKs in all attempts is that this data will quite possibly be sharded in the future, and I'd like that to have a relatively low impact on how PKs are generated
What would be the best approach to handle this?
To identify if items already exist in the database
Have a user-friendly pk for URLs
Try not to impact JOIN performance too much
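You can use unique_together:
from django.db import models

class Data(models.Model):
    # CharField requires max_length; 50 matches the ID length given in the question
    source_id = models.CharField(max_length=50)
    xml_item_id = models.CharField(max_length=50)
    # ... other fields

    class Meta:
        unique_together = ("source_id", "xml_item_id")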
Then in your import function just:
scid = your_xml_source_id
xmlid = your_xml_id
obj, created = Data.objects.get_or_create(source_id=scid, xml_item_id=xmlid)
if created:
    # it's a new object, populate obj with the rest of the data
    obj.other_field = your_xml_other_field
else:
    # it's an existing object, update it with the new value
    obj.other_field = new_value
obj.save()
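At the database level this approach amounts to a numeric surrogate key plus a UNIQUE constraint over the (source_id, xml_item_id) pair. The following sketch shows the same get-or-create flow with the built-in sqlite3 module; the item table, the other_field column and the import_item helper are hypothetical names chosen for illustration, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE item (
    id INTEGER PRIMARY KEY,           -- numeric key for joins and URLs
    source_id TEXT NOT NULL,
    xml_item_id TEXT NOT NULL,
    other_field TEXT,
    UNIQUE (source_id, xml_item_id)   -- what unique_together enforces
)
""")


def import_item(conn, source_id, xml_item_id, other_field):
    """Create the row if the pair is new, otherwise update it.

    Returns (row id, created flag), similar in spirit to get_or_create.
    """
    row = conn.execute(
        "SELECT id FROM item WHERE source_id = ? AND xml_item_id = ?",
        (source_id, xml_item_id),
    ).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO item (source_id, xml_item_id, other_field) "
            "VALUES (?, ?, ?)",
            (source_id, xml_item_id, other_field),
        )
        return cur.lastrowid, True
    conn.execute(
        "UPDATE item SET other_field = ? WHERE id = ?", (other_field, row[0])
    )
    return row[0], False


first_id, created = import_item(conn, "src-1", "ABC", "v1")
second_id, created_again = import_item(conn, "src-1", "ABC", "v2")
other_id, _ = import_item(conn, "src-2", "ABC", "v1")  # same ID, other source
```

The lookup goes through the unique index on the pair, while related tables can keep joining on the compact integer id.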

Django ORM INNER JOIN

I have not been able to do this query with Django ORM.
How do I make an inner join, and how do I write this query so that it returns only the columns I want?
SELECT establecimiento.nombre, categoria.titulo
FROM establecimiento INNER JOIN
categoria ON establecimiento.categoria = categoria.id
Based on the information in your comment responding to pdxwebdev (that you have a foreign key field declared) this is simple. Django automates much of the join behavior needed for foreign key relationships.
To precisely replicate that query, including selecting only two fields from the join, any of values, values_list or only should do it, depending on exactly what Python objects you want to get back. E.g., here's a query using values to retrieve an iterable queryset of dictionaries:
Establecimiento.objects.values('nombre', 'categoria__titulo')
values_list will retrieve tuples instead of dictionaries, and only will retrieve Establecimiento instances on which all model fields other than those two are deferred (they have not been retrieved from the database but will be looked up as needed).
When you use __ to follow a foreign key relationship like that, Django will do the inner join automatically.
You can also use select_related on a queryset to ask it to do the join even when you're not retrieving specific fields, e.g.:
Establecimiento.objects.select_related('categoria')
This should produce a query of SELECT * from ..., and return a queryset of Establecimiento instances that have their categoria data already loaded into memory.
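To make the result shapes concrete, here is a small sketch that runs the question's SQL with the built-in sqlite3 module and converts the rows to dictionaries and tuples, which is roughly how values and values_list differ. The table definitions and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows usable both by name and by position
conn.executescript("""
CREATE TABLE categoria (id INTEGER PRIMARY KEY, titulo TEXT);
CREATE TABLE establecimiento (
    id INTEGER PRIMARY KEY,
    nombre TEXT,
    categoria INTEGER REFERENCES categoria (id)
);
INSERT INTO categoria VALUES (1, 'Cafe'), (2, 'Bar');
INSERT INTO establecimiento VALUES (1, 'La Esquina', 1), (2, 'El Puerto', 2);
""")

rows = conn.execute("""
SELECT establecimiento.nombre, categoria.titulo
FROM establecimiento
INNER JOIN categoria ON establecimiento.categoria = categoria.id
ORDER BY establecimiento.id
""").fetchall()

# Dictionaries, comparable to values(); note that Django would name the
# second key 'categoria__titulo' rather than 'titulo'.
as_dicts = [dict(row) for row in rows]
# Plain tuples, comparable to values_list().
as_tuples = [tuple(row) for row in rows]
```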
I'm not sure I understand the question.
establecimiento.categoria just needs to be a foreign key field to categoria model. categoria.id is the primary key so this will be done automatically.
To return only certain columns, just use the .only() method.
https://docs.djangoproject.com/en/dev/ref/models/querysets/#only

Django: Selecting distinct values on maximum foreign key value

I have the following models which I'm testing with SQLite3 and MySQL:
# (various model fields extraneous to discussion removed...)
class Run(models.Model):
    runNumber = models.IntegerField()

class Snapshot(models.Model):
    t = models.DateTimeField()

class SnapshotRun(models.Model):
    snapshot = models.ForeignKey(Snapshot)
    run = models.ForeignKey(Run)
    # other fields which make it possible to have multiple distinct Run objects per Snapshot
I want a query which will give me a set of runNumbers & snapshot IDs for which the Snapshot.id is below some specified value. Naively I would expect this to work:
print SnapshotRun.objects.filter(snapshot__id__lte=ss_id)\
    .order_by("run__runNumber", "-snapshot__id")\
    .distinct("run__runNumber", "snapshot__id")\
    .values("run__runNumber", "snapshot__id")
But this blows up with
NotImplementedError: DISTINCT ON fields is not supported by this database backend
for both database backends. Postgres is unfortunately not an option.
Time to fall back to raw SQL?
Update:
Since Django's ORM won't help me out of this one (thanks @jknupp), I did manage to get the following raw SQL to work:
cursor.execute("""
    SELECT r.runNumber, ssr1.snapshot_id
    FROM livedata_run AS r
    JOIN livedata_snapshotrun AS ssr1
      ON ssr1.id =
      (
        SELECT id
        FROM livedata_snapshotrun AS ssr2
        WHERE ssr2.run_id = r.id
          AND ssr2.snapshot_id <= %s
        ORDER BY snapshot_id DESC
        LIMIT 1
      );
""", [max_ss_id])  # query parameters are passed as a sequence
Here livedata is the Django app these tables live in.
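The correlated subquery can be tried outside Django with the built-in sqlite3 module. The sketch below uses made-up rows and a hypothetical latest_snapshots helper. sqlite3 uses ? placeholders where Django's cursor uses %s, and an ORDER BY is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE livedata_run (id INTEGER PRIMARY KEY, runNumber INTEGER);
CREATE TABLE livedata_snapshotrun (
    id INTEGER PRIMARY KEY, snapshot_id INTEGER, run_id INTEGER
);
INSERT INTO livedata_run VALUES (1, 100), (2, 200);
INSERT INTO livedata_snapshotrun VALUES
    (1, 1, 1), (2, 2, 1), (3, 3, 1),   -- run 100 appears in snapshots 1-3
    (4, 2, 2), (5, 5, 2);              -- run 200 appears in snapshots 2 and 5
""")

LATEST_PER_RUN = """
SELECT r.runNumber, ssr1.snapshot_id
FROM livedata_run AS r
JOIN livedata_snapshotrun AS ssr1
  ON ssr1.id = (
    SELECT id
    FROM livedata_snapshotrun AS ssr2
    WHERE ssr2.run_id = r.id
      AND ssr2.snapshot_id <= ?
    ORDER BY snapshot_id DESC
    LIMIT 1
  )
ORDER BY r.runNumber
"""


def latest_snapshots(max_ss_id):
    """Newest snapshot per run, considering only snapshots <= max_ss_id."""
    return conn.execute(LATEST_PER_RUN, (max_ss_id,)).fetchall()


print(latest_snapshots(4))
```

Runs with no snapshot at or below the limit drop out of the result, because the subquery yields NULL and the join condition then never matches.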
The note in the Django documentation is pretty clear:
Note:
Any fields used in an order_by() call are included in the SQL SELECT columns. This can sometimes lead to unexpected results when used in conjunction with distinct(). If you order by fields from a related model, those fields will be added to the selected columns and they may make otherwise duplicate rows appear to be distinct. Since the extra columns don’t appear in the returned results (they are only there to support ordering), it sometimes looks like non-distinct results are being returned.
Similarly, if you use a values() query to restrict the columns selected, the columns used in any order_by() (or default model ordering) will still be involved and may affect uniqueness of the results.
The moral here is that if you are using distinct() be careful about ordering by related models. Similarly, when using distinct() and values() together, be careful when ordering by fields not in the values() call.
Also, below that:
This ability to specify field names (with distinct) is only available in PostgreSQL.
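When only the highest snapshot ID per run number is needed, a plain GROUP BY with MAX avoids DISTINCT ON and is not tied to PostgreSQL. The sketch below shows the idea with the built-in sqlite3 module on invented rows and simplified table names. On the ORM side, values("run__runNumber") followed by annotate(Max("snapshot__id")) is presumably the counterpart, though that is not tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE run (id INTEGER PRIMARY KEY, runNumber INTEGER);
CREATE TABLE snapshotrun (
    id INTEGER PRIMARY KEY, snapshot_id INTEGER, run_id INTEGER
);
INSERT INTO run VALUES (1, 100), (2, 200);
INSERT INTO snapshotrun VALUES (1, 1, 1), (2, 3, 1), (3, 2, 2), (4, 5, 2);
""")

# One row per run number, carrying the largest qualifying snapshot ID.
max_per_run = conn.execute("""
SELECT r.runNumber, MAX(ssr.snapshot_id)
FROM snapshotrun AS ssr
JOIN run AS r ON r.id = ssr.run_id
WHERE ssr.snapshot_id <= ?
GROUP BY r.runNumber
ORDER BY r.runNumber
""", (4,)).fetchall()
print(max_per_run)
```

This only works because the wanted value is an aggregate of the grouping itself. If other columns of the winning row are needed, a subquery is still required.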

Djapian - filtering results

I use Djapian to search for object by keywords, but I want to be able to filter results. It would be nice to use Django's QuerySet API for this, for example:
if query.strip():
    results = Model.indexer.search(query).prefetch()
else:
    results = Model.objects.all()
results = results.filter(somefield__lt=somevalue)
return results
But Djapian returns a ResultSet of Hit objects, not Model objects. I can of course filter the objects "by hand", in Python, but it's not realistic in case of filtering all objects (when query is empty) - I would have to retrieve the whole table from database.
Am I out of luck with using Djapian for this?
I went through its source and found that Djapian has a filter method that can be applied to its results. I have just tried the below code and it seems to be working.
My indexer is as follows:
class MarketIndexer(djapian.Indexer):
    fields = ['name', 'description', 'tags_string', 'state']
    tags = [('state', 'state')]
Here is how I filter results (never mind the first line that does stuff for wildcard usage):
objects = model.indexer.search(q_wc).flags(djapian.resultset.xapian.QueryParser.FLAG_WILDCARD).prefetch()
objects = objects.filter(state=1)
When executed, it now brings Markets that have their state equal to "1".
I don't know Djapian, but I am familiar with Xapian. In Xapian you can filter the results with a MatchDecider.
The decision function of the match decider gets called on every document which matches the search criteria so it's not a good idea to do a database query for every document here, but you can of course access the values of the document.
For example at ubuntuusers.de we have a xapian database which contains blog posts, forum posts, planet entries, wiki entries and so on and each document in the xapian database has some additional access information stored as value. After the query, an AuthMatchDecider filters the potential documents and returns the filtered MSet which are then displayed to the user.
If the decision procedure is as simple as somefield < somevalue, you could also simply add the value of somefield to the values of the document (using the sortable_serialize function provided by xapian) and add (using OP_FILTER) an OP_VALUE_RANGE query to the original query.

How can I make a Django query for the first occurrence of a foreign key in a column?

Basically, I have a table with a bunch of foreign keys and I'm trying to query only the first occurrence of a particular key by the "created" field. Using the Blog/Entry example, if the Entry model has a foreign key to Blog and a foreign key to User, then how can I construct a query to select all Entries in which a particular User has written the first one for the various Blogs?
class Blog(models.Model):
    ...

class User(models.Model):
    ...

class Entry(models.Model):
    blog = models.ForeignKey(Blog)
    user = models.ForeignKey(User)
I assume there's some magic I'm missing to select the first entries of a blog, and that I can simply filter further down to a particular user by appending:
query = Entry.objects.magicquery.filter(user=user)
But maybe there's some other more efficient way. Thanks!
query = Entry.objects.filter(user=user).order_by('id')[0]
Basically order by id (lowest to highest), and slice it to get only the first hit from the QuerySet.
I don't have a Django install available right now to test my line, so please check the documentation in case I have a typo above:
order by
limiting querysets
By the way, there is an interesting note in the "limiting querysets" section of the manual:
To retrieve a single object rather than a list (e.g. SELECT foo FROM bar LIMIT 1), use a simple index instead of a slice. For example, this returns the first Entry in the database, after ordering entries alphabetically by headline:
Entry.objects.order_by('headline')[0]
EDIT: OK, this is the best I could come up with so far (after your comment and mine). It doesn't return Entry objects, but their ids as entry_id.
query = Entry.objects.values('blog').filter(user=user).annotate(Count('blog')).annotate(entry_id=Min('id'))
I'll keep looking for a better way.
Ancient question, I realise - @zalew's response is close but will likely result in the error:
ProgrammingError: SELECT DISTINCT ON expressions must match initial
ORDER BY expressions
To correct this, try aligning the ordering and distinct parts of the query:
Entry.objects.filter(user=user).distinct("blog").order_by("blog", "created")
As a bonus, in case of multiple entries being created at exactly the same time (unlikely, but you never know!), you could add determinism by including id in the order_by:
Entry.objects.filter(user=user).distinct("blog").order_by("blog", "created", "id")
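The reason the ordering has to start with the distinct field can be seen from a plain-Python emulation of DISTINCT ON: sort the rows, then keep the first row of each group. This is only an illustrative sketch; distinct_on is a hypothetical helper, the sample entries are made up, and created is reduced to an integer.

```python
from itertools import groupby
from operator import itemgetter


def distinct_on(rows, key, order):
    """Keep the first row per `key` after sorting by the `order` fields.

    Like PostgreSQL, insist that the ordering starts with the distinct
    field; otherwise rows of one group would not be adjacent.
    """
    if order[0] != key:
        raise ValueError("ordering must start with the DISTINCT ON field")
    ordered = sorted(rows, key=itemgetter(*order))
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]


entries = [
    {"id": 1, "blog": "a", "created": 5},
    {"id": 2, "blog": "a", "created": 3},
    {"id": 3, "blog": "b", "created": 3},
    {"id": 4, "blog": "b", "created": 3},  # same timestamp as id 3
]

# Mirrors distinct("blog").order_by("blog", "created", "id").
first_per_blog = distinct_on(entries, key="blog", order=("blog", "created", "id"))
print([entry["id"] for entry in first_per_blog])
```

For blog b the two entries share a timestamp, so the trailing id in the ordering is what decides which one is kept.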
I can't test this at the moment:
Entry.objects.filter(user=user).distinct("blog").order_by("id")
