Django Query to Combine Multiple ArrayFields to one text string - python

I have an object model where Documents are long text files that can have Attachments and both sets of objects can also have spreadsheet-like Tables. Each table has a rectangular array with text. I want users to be able to search for a keyword across the table contents, but the results will be displayed by the main document (so instead of seeing each table that matches, you'll just see the document that has the most tables that match your query).
Below you can see a test query I'm trying to run that in an ideal world would convert all of the table contents (across all attachments) to one long string, that I can then pass to a SearchHighlight to make the headline. For some reason, the test query returns the tables as different objects, rather than concatenated to one long string.
I'm using a custom function that mimics the Postgres 13 StringAgg as I'm using Postgres 10.
Thanks in advance for your help, let me know if I need to provide more information to replicate this.
my models.py:
class Document(AbstractDocument):
    tables = GenericRelation(Table)
class Attachment(AbstractDocument):
    tables_new = GenericRelation(Table)
    main_document = ForeignKey(Document, on_delete=CASCADE, related_name="attachments")
class Table(models.Model):
    content_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    object_id = models.SlugField()
    content_object = GenericForeignKey()
    content = ArrayField(ArrayField(models.TextField(null=True)))
my query:
def myStringAgg(field: str):
    return Func(
        F(field),
        Value(" "),
        Value(""),
        function="array_to_string",
        output_field=models.TextField(),
    )
s = Document.objects.all() \
    .annotate(tt=myStringAgg("attachments__tables__content")) \
    .values_list('tt', flat=True)
# what I get
>>> <DocumentSet ['table1', 'table2']>
# what I want
>>> <DocumentSet ['table1 table2']>
I'm using Django 3.2 and Postgres 10.
To clarify my full scope, this is what the final query would look like:
qs = (
    Document.objects.filter(
        Q(tables__search_vector=query) |
        Q(attachments__tables__search_vector=query)
    )
    .annotate(rank=rank)
    .order_by("-rank")
    .annotate(snippet=SearchHeadline(
        myStringAgg("attachments__tables__content"),
        query, max_fragments=5)
    )
)

You can use the str.join method to create a single string from a list:
s = Document.objects.all() \
    .annotate(tt=myStringAgg("attachments__tables__content")) \
    .values_list('tt', flat=True)
s = " ".join(list(s))
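One caveat with joining in Python: a single join over the whole result merges the strings of every document into one value, and it raises a TypeError if any row is None. If the goal is one string per document, the rows can be grouped first. The sketch below is not Django code; it assumes rows shaped like the output of values_list("pk", "tt"), and both the sample data and the join_per_document helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows shaped like values_list("pk", "tt"):
# one row per joined table, so a document can appear several times.
rows = [(1, "table1"), (1, "table2"), (2, None), (3, "table3")]


def join_per_document(rows, sep=" "):
    """Group the text fragments by document id and join each group."""
    grouped = defaultdict(list)
    for doc_id, text in rows:
        parts = grouped[doc_id]  # make sure every document gets an entry
        if text:  # skip NULL or empty fragments
            parts.append(text)
    return {doc_id: sep.join(parts) for doc_id, parts in grouped.items()}


joined = join_per_document(rows)
```

This keeps the concatenation out of the database, so it only suits result sets small enough to load into memory.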

Related

Query based in Embedded Documents List fields in mongoengine

I'm running into an issue using mongoengine. A raw query that works in Compass isn't working using __raw__ in mongoengine. I'd like to rewrite it using mongoengine's methods, but I'd also like to understand why it's not working using __raw__.
I'm using an embedded document list field that has inheritance. The query is "give me all sequences that have a 'type A' Assignment".
My schema:
class Sequence(Document):
    seq = StringField(required = True)
    samples = EmbeddedDocumentListField(Sample)
    assignments = EmbeddedDocumentListField(Assignment)
class Sample(EmbeddedDocument):
    name = StringField()
class Assignment(EmbeddedDocument):
    name = StringField()
    meta = {'allow_inheritance': True}
class TypeA(Assignment):
    pass
class TypeB(Assignment):
    other_field = StringField()
    pass
Writing {'assignments._cls': 'TypeA'} into Compass returns a list. But with mongoengine I get an empty result:
from mongo_objects import Sequence
def get_samples_assigned_as_class(cls : str):
    query_raw = Sequence.objects(__raw__={'assignments._cls': cls}) # raw query, fails
    #query2 = Sequence.objects(assignments___cls = cls) # First attempt, failed
    #query3 = Sequence.objects.get().assignments.filter(cls = cls) # Second attempt, also failed. Didn't like that it queried everything first
    print(query_raw) # empty list, iterating does nothing.
get_samples_assigned_as_class('TypeA')
"Assignments" is a list because one sequence may have multiples of the same class. An in-depth answer on how to query these lists for categorical information would be ideal, as I'm not sure how to properly go about it. I'm mostly filtering on the inheritance _cls, but eventually I'd like to do nested queries (cls : TypeA, sample : Sample_1)
Thanks

How to do efficient reverse foreign key check on Django multi-table inheritance model?

I've got file objects of different types, which inherit from a BaseFile, and add custom attributes, methods and maybe fields. The BaseFile also stores the File Type ID, so that the corresponding subclass model can be retrieved from any BaseFile object:
class BaseFile(models.Model):
    name = models.CharField(max_length=80, db_index=True)
    size = models.PositiveIntegerField()
    time_created = models.DateTimeField(default=datetime.now)
    file_type = models.ForeignKey(ContentType, on_delete=models.PROTECT)
class FileType1(BaseFile):
    storage_path = '/path/for/filetype1/'
    def custom_method(self):
        <some custom behaviour>
class FileType2(BaseFile):
    storage_path = '/path/for/filetype2/'
    extra_field = models.CharField(max_length=12)
I also have different types of events which are associated with files:
class FileEvent(models.Model):
    file = models.ForeignKey(BaseFile, on_delete=models.PROTECT)
    time = models.DateTimeField(default=datetime.now)
I want to be able to efficiently get all files of a particular type which have not been involved in a particular event, such as:
unprocessed_files_type1 = FileType1.objects.filter(fileevent__isnull=True)
However, looking at the SQL executed for this query:
SELECT "app_basefile"."id", "app_basefile"."name", "app_basefile"."size", "app_basefile"."time_created", "app_basefile"."file_type_id", "app_filetype1"."basefile_ptr_id"
FROM "app_filetype1"
INNER JOIN "app_basefile"
ON("app_filetype1"."basefile_ptr_id" = "app_basefile"."id")
LEFT OUTER JOIN "app_fileevent" ON ("app_basefile"."id" = "app_fileevent"."file_id")
WHERE "app_fileevent"."id" IS NULL
It looks like this might not be very efficient because it joins on BaseFile.id instead of FileType1.basefile_ptr_id, so it will check ALL BaseFile ids to see whether they are present in FileEvent.file_id, when I only need to check the BaseFile ids corresponding to FileType1, or FileType1.basefile_ptr_ids.
This could result in a significant performance difference if there are a very large number of BaseFiles, but FileType1 is only a small subset of that, because it will be doing a large amount of unnecessary lookups.
Is there a way to force Django to join on "app_filetype1"."basefile_ptr_id" or otherwise achieve this functionality more efficiently?
Thanks for the help
UPDATE:
Using annotations and Exists subquery seems to do what I'm after, however the resulting SQL still appears strange:
unprocessed_files_type1 = FileType1.objects.annotate(
    file_event=Exists(FileEvent.objects.filter(file=OuterRef('pk')))
).filter(file_event=False)
SELECT "app_basefile"."id", "app_basefile"."name", "app_basefile"."size", "app_basefile"."time_created", "app_basefile"."file_type_id", "app_filetype1"."basefile_ptr_id",
EXISTS(
SELECT U0."id", U0."file_id", U0."time"
FROM "app_fileevent" U0
WHERE U0."file_id" = ("app_filetype1"."basefile_ptr_id"))
AS "file_event"
FROM "app_filetype1"
INNER JOIN "app_basefile" ON ("app_filetype1"."basefile_ptr_id" = "app_basefile"."id")
WHERE EXISTS(
SELECT U0."id", U0."file_id", U0."time"
FROM "app_fileevent" U0
WHERE U0."file_id" = ("app_filetype1"."basefile_ptr_id")) = 0
It appears to be doing the WHERE EXISTS subquery twice instead of just using the annotated 'file_event' label... Maybe this is just a Django/SQLite driver bug?
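For comparison, the two query shapes discussed above (LEFT OUTER JOIN ... IS NULL versus a correlated NOT EXISTS) can be run side by side against an in-memory SQLite database. This is only a sketch: the table names follow the SQL in the question, but the schema is cut down to the key columns and the rows are invented. It checks that both shapes select the same files and says nothing about which one a given database will plan more efficiently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE app_basefile (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE app_filetype1 (basefile_ptr_id INTEGER PRIMARY KEY);
CREATE TABLE app_fileevent (id INTEGER PRIMARY KEY, file_id INTEGER);
-- Four base files; only files 1 and 2 are FileType1.
INSERT INTO app_basefile (id, name) VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
INSERT INTO app_filetype1 (basefile_ptr_id) VALUES (1), (2);
-- Files 1 and 3 each have an event.
INSERT INTO app_fileevent (file_id) VALUES (1), (3);
""")

LEFT_JOIN_SQL = """
SELECT b.id
FROM app_filetype1 t
INNER JOIN app_basefile b ON t.basefile_ptr_id = b.id
LEFT OUTER JOIN app_fileevent e ON b.id = e.file_id
WHERE e.id IS NULL
ORDER BY b.id
"""

NOT_EXISTS_SQL = """
SELECT b.id
FROM app_filetype1 t
INNER JOIN app_basefile b ON t.basefile_ptr_id = b.id
WHERE NOT EXISTS (
    SELECT 1 FROM app_fileevent e WHERE e.file_id = t.basefile_ptr_id
)
ORDER BY b.id
"""

left_join_ids = [row[0] for row in conn.execute(LEFT_JOIN_SQL)]
not_exists_ids = [row[0] for row in conn.execute(NOT_EXISTS_SQL)]
```

To compare the plans on real data, prefixing either statement with EXPLAIN QUERY PLAN (in SQLite) shows how the join is actually carried out.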

Associate classes with django-filters

Hello, I have a question regarding django-filters. My problem is:
I have two classes defined in my models.py that are:
class Volcano(models.Model):
    vd_id = models.AutoField("ID, Volcano Identifier (Index)",
                             primary_key=True)
    [...]
class VolcanoInformation(models.Model):
    # Primary key
    vd_inf_id = models.AutoField("ID, volcano information identifier (index)",
                                 primary_key=True)
    # Other attributes
    vd_inf_numcal = models.IntegerField("Number of calderas")
    [...]
    # Foreign key(s)
    vd_id = models.ForeignKey(Volcano, null=True, related_name='vd_inf_vd_id',
                              on_delete=models.CASCADE)
The two of them are linked through the vd_id attribute.
I want to develop a search tool that allows the user to search a volcano by its number of calderas (vd_inf_numcal).
I am using django-filters and for now here's my filters.py:
from django import forms
from .models import *
import django_filters
class VolcanoFilter(django_filters.FilterSet):
    vd_name = django_filters.ModelChoiceFilter(
        queryset=Volcano.objects.values_list('vd_name', flat=True),
        widget=forms.Select, label='Volcano name',
        to_field_name='vd_name',
    )
    vd_inf_numcal = django_filters.ModelChoiceFilter(
        queryset=VolcanoInformation.objects.values_list('vd_inf_numcal', flat=True),
        widget=forms.Select, label='Number of calderas',
    )
    class Meta:
        model = Volcano
        fields = ['vd_name', 'vd_inf_numcal']
My views.py is:
def search(request):
    feature_list = Volcano.objects.all()
    feature_filter = VolcanoFilter(request.GET, queryset = feature_list)
    return render(request, 'app/search_list.html', {'filter' : feature_filter, 'feature_type': feature_type})
In my application, a dropdown list of the possible number of calderas appears but the search returns no result which is normal because there is no relation between VolcanoInformation.vd_inf_numcal, VolcanoInformation.vd_id and Volcano.vd_id.
It even says "Select a valid choice. That choice is not one of the available choices."
My question is: how can I make this link using django_filters?
I guess I should write some method within the class, but I have absolutely no idea how to do it.
If anyone has the answer, I would be more than thankful!
In general, you need to answer two questions:
1. What field are we querying against, and what query/lookup expressions need to be generated?
2. What kinds of values should we be filtering with?
These answers are essentially the left-hand and right-hand sides of your .filter() call.
In this case, you're filtering across the reverse side of the Volcano-VolcanoInformation relationship (vd_inf_vd_id), against the number of calderas (vd_inf_numcal) for a Volcano. Additionally, you want an exact match.
For the values, you'll need a set of choices containing integers.
AllValuesFilter will look at the DB column and generate the choices from the column values. However, the downside is that the choices will only include values that actually occur in the column, so any gaps in the sequence look odd when rendered. You could either adapt this filter, or use a plain ChoiceFilter and generate the values yourself.
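from django.db.models import Max
def num_calderas_choices():
    # Get the maximum number of calderas (the aggregate is None for an empty table)
    max_count = VolcanoInformation.objects.aggregate(result=Max('vd_inf_numcal'))['result'] or 0
    # Generate a list of two-tuples for the select dropdown, from 0 to max_count inclusive
    # e.g. [(0, 0), (1, 1), (2, 2), ...]
    # range() excludes its end value, hence the + 1; a list can be iterated more than once
    return [(n, n) for n in range(max_count + 1)]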
class VolcanoFilter(django_filters.FilterSet):
    name = ...
    num_calderas = django_filters.ChoiceFilter(
        # related field traversal (note the connecting '__')
        field_name='vd_inf_vd_id__vd_inf_numcal',
        label='Number of calderas',
        choices=num_calderas_choices
    )
    class Meta:
        model = Volcano
        fields = ['name', 'num_calderas']
Note that I haven't tested the above code myself, but it should be close enough to get you started.
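Two details are easy to get wrong when generating integer choices like this in Python 3: range(n) stops at n - 1, and zip() returns a one-shot iterator rather than a list. The sketch below leaves Django out entirely; build_choices is a hypothetical helper name, and the maximum of 3 is an arbitrary stand-in for the aggregated value.

```python
def build_choices(max_count):
    """Return [(0, 0), (1, 1), ..., (max_count, max_count)] as a reusable list."""
    return [(n, n) for n in range(max_count + 1)]


choices = build_choices(3)

# For contrast: a zip object can only be iterated once.
one_shot = zip(range(4), range(4))
first_pass = list(one_shot)
second_pass = list(one_shot)  # the iterator is already exhausted here
```

Returning a list matters whenever the same choices object may be iterated more than once, for example when a form is rendered repeatedly.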
Thanks a lot! That's exactly what I was looking for! I didn't understand how .filter() works.
What I did for other attributes is generate the choices in a different way. For instance, if I just wanted to display a list of the available locations, I would use:
# Location attribute
loc = VolcanoInformation.objects.values_list('vd_inf_loc', flat=True)
vd_inf_loc = django_filters.ChoiceFilter(
    field_name='vd_inf_vd_id__vd_inf_loc',
    label='Geographic location',
    choices=zip(loc, loc),
)

How do I order by date when using ReferenceProperty?

I have a simple one-to-many structure like this:
class User(db.Model):
    userEmail = db.StringProperty()
class Comment(db.Model):
    user = db.ReferenceProperty(User, collection_name="comments")
    comment = db.StringProperty()
    date = db.DateTimeProperty()
I fetch a user by email:
q = User.all() # prepare User table for querying
q.filter("userEmail =", "az@example.com") # apply filter, email lookup
results = q.fetch(1) # execute the query, apply limit 1
the_user = results[0] # the results is a list of objects, grab the first one
this_users_comments = the_user.comments # get the user's comments
How can I order the user's comments by date, and limit it to 10 comments?
You will want to use the key keyword argument of the built-in sorted function, and use the "date" property as the key:
import operator
sorted_comments = sorted(this_users_comments, key=operator.attrgetter("date"))
# The comments will probably be sorted with earlier comments at the front of the list
# If you want ten most recent, also add the following line:
# sorted_comments.reverse()
ten_comments = sorted_comments[:10]
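As a self-contained illustration of the sorting approach, the sketch below uses a namedtuple as a stand-in for the Comment model and twelve invented comments. It passes reverse=True to sorted so that the slice yields the ten most recent comments directly; most_recent is a hypothetical helper name.

```python
import operator
from collections import namedtuple
from datetime import datetime, timedelta

# Stand-in for the datastore model: only the fields used here.
Comment = namedtuple("Comment", ["comment", "date"])

start = datetime(2020, 1, 1)
# Twelve comments in scrambled order, one per day.
comments = [
    Comment("comment %d" % i, start + timedelta(days=i))
    for i in (3, 0, 11, 7, 1, 9, 5, 2, 10, 4, 8, 6)
]


def most_recent(comments, limit=10):
    """Return up to `limit` comments, newest first."""
    newest_first = sorted(comments, key=operator.attrgetter("date"), reverse=True)
    return newest_first[:limit]


latest = most_recent(comments)
```

Sorting in Python means every comment is loaded before any is discarded, so this fits small collections better than large ones.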
That query fetches the user. You need to do another query for the comments:
this_users_comments.order('date').limit(10)
for comment in this_users_comments:
    ...

Union and Intersect in Django

class Tag(models.Model):
    name = models.CharField(max_length=100)
class Blog(models.Model):
    name = models.CharField(max_length=100)
    tags = models.ManyToManyField(Tag)
Simple models just to ask my question.
I wonder how I can query blogs using tags in the following ways:
Blog entries that are tagged with "tag1" or "tag2":
Blog.objects.filter(tags__in=[1, 2]).distinct()
Blog objects that are tagged with "tag1" and "tag2": ?
Blog objects that are tagged with exactly "tag1" and "tag2" and nothing else: ??
Tag and Blog is just used for an example.
You could use Q objects for #1:
# Blogs who have either hockey or django tags.
from django.db.models import Q
Blog.objects.filter(
Q(tags__name__iexact='hockey') | Q(tags__name__iexact='django')
)
Unions and intersections, I believe, are a bit outside the scope of the Django ORM, but it's possible to do these. The following examples are from a Django application called django-tagging that provides the functionality.
For part two, you're looking for a union of two queries, basically. See line 346 of models.py:
def get_union_by_model(self, queryset_or_model, tags):
    """
    Create a ``QuerySet`` containing instances of the specified
    model associated with *any* of the given list of tags.
    """
    tags = get_tag_list(tags)
    tag_count = len(tags)
    queryset, model = get_queryset_and_model(queryset_or_model)
    if not tag_count:
        return model._default_manager.none()
    model_table = qn(model._meta.db_table)
    # This query selects the ids of all objects which have any of
    # the given tags.
    query = """
    SELECT %(model_pk)s
    FROM %(model)s, %(tagged_item)s
    WHERE %(tagged_item)s.content_type_id = %(content_type_id)s
      AND %(tagged_item)s.tag_id IN (%(tag_id_placeholders)s)
      AND %(model_pk)s = %(tagged_item)s.object_id
    GROUP BY %(model_pk)s""" % {
        'model_pk': '%s.%s' % (model_table, qn(model._meta.pk.column)),
        'model': model_table,
        'tagged_item': qn(self.model._meta.db_table),
        'content_type_id': ContentType.objects.get_for_model(model).pk,
        'tag_id_placeholders': ','.join(['%s'] * tag_count),
    }
    cursor = connection.cursor()
    cursor.execute(query, [tag.pk for tag in tags])
    object_ids = [row[0] for row in cursor.fetchall()]
    if len(object_ids) > 0:
        return queryset.filter(pk__in=object_ids)
    else:
        return model._default_manager.none()
For part #3, I believe you're looking for an intersection. See line 307 of models.py:
def get_intersection_by_model(self, queryset_or_model, tags):
    """
    Create a ``QuerySet`` containing instances of the specified
    model associated with *all* of the given list of tags.
    """
    tags = get_tag_list(tags)
    tag_count = len(tags)
    queryset, model = get_queryset_and_model(queryset_or_model)
    if not tag_count:
        return model._default_manager.none()
    model_table = qn(model._meta.db_table)
    # This query selects the ids of all objects which have all the
    # given tags.
    query = """
    SELECT %(model_pk)s
    FROM %(model)s, %(tagged_item)s
    WHERE %(tagged_item)s.content_type_id = %(content_type_id)s
      AND %(tagged_item)s.tag_id IN (%(tag_id_placeholders)s)
      AND %(model_pk)s = %(tagged_item)s.object_id
    GROUP BY %(model_pk)s
    HAVING COUNT(%(model_pk)s) = %(tag_count)s""" % {
        'model_pk': '%s.%s' % (model_table, qn(model._meta.pk.column)),
        'model': model_table,
        'tagged_item': qn(self.model._meta.db_table),
        'content_type_id': ContentType.objects.get_for_model(model).pk,
        'tag_id_placeholders': ','.join(['%s'] * tag_count),
        'tag_count': tag_count,
    }
    cursor = connection.cursor()
    cursor.execute(query, [tag.pk for tag in tags])
    object_ids = [row[0] for row in cursor.fetchall()]
    if len(object_ids) > 0:
        return queryset.filter(pk__in=object_ids)
    else:
        return model._default_manager.none()
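The only difference between the two functions is the HAVING COUNT(...) clause. A cut-down sketch of the same GROUP BY technique against an in-memory SQLite database makes that visible. The two-table schema, the sample rows and the blogs_with_tags helper are all invented for illustration and are not part of django-tagging; the sketch also assumes each (blog_id, tag_id) pair appears at most once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blog (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE blog_tags (blog_id INTEGER, tag_id INTEGER);
INSERT INTO blog (id, name) VALUES
    (1, 'only tag 1'), (2, 'tags 1 and 2'), (3, 'tags 1, 2 and 3'), (4, 'untagged');
INSERT INTO blog_tags (blog_id, tag_id) VALUES
    (1, 1), (2, 1), (2, 2), (3, 1), (3, 2), (3, 3);
""")


def blogs_with_tags(tag_ids, require_all):
    """Return ids of blogs having any (or, if require_all, all) of tag_ids."""
    if not tag_ids:
        return []
    placeholders = ",".join("?" * len(tag_ids))
    sql = (
        "SELECT blog_id FROM blog_tags "
        "WHERE tag_id IN (%s) GROUP BY blog_id" % placeholders
    )
    params = list(tag_ids)
    if require_all:
        # A blog matches every tag when its matching row count equals the tag count.
        sql += " HAVING COUNT(blog_id) = ?"
        params.append(len(tag_ids))
    sql += " ORDER BY blog_id"
    return [row[0] for row in conn.execute(sql, params)]


any_ids = blogs_with_tags([1, 2], require_all=False)
all_ids = blogs_with_tags([1, 2], require_all=True)
```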
I've tested these out with Django 1.0:
The "or" queries:
Blog.objects.filter(tags__name__in=['tag1', 'tag2']).distinct()
or you could use the Q class:
Blog.objects.filter(Q(tags__name='tag1') | Q(tags__name='tag2')).distinct()
The "and" query:
Blog.objects.filter(tags__name='tag1').filter(tags__name='tag2')
I'm not sure about the third one, you'll probably need to drop to SQL to do it.
Please don't reinvent the wheel and use django-tagging application which was made exactly for your use case. It can do all queries you describe, and much more.
If you need to add custom fields to your Tag model, you can also take a look at my branch of django-tagging.
This will do the trick for you
Blog.objects.filter(tags__name__in=['tag1', 'tag2']).annotate(tag_matches=models.Count('tags')).filter(tag_matches=2)
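The three result sets asked for in the question differ only in the set relation between a blog's tags and the requested tags: a non-empty overlap for "or", a superset for "and", and equality for "exactly". A pure-Python sketch of those semantics follows; it is not ORM code, and the blog_tags mapping is made-up data.

```python
# Hypothetical data: blog name -> set of tag names.
blog_tags = {
    "blog_a": {"tag1"},
    "blog_b": {"tag1", "tag2"},
    "blog_c": {"tag1", "tag2", "tag3"},
    "blog_d": set(),
}
wanted = {"tag1", "tag2"}

# "or": at least one requested tag is present.
any_match = sorted(b for b, tags in blog_tags.items() if tags & wanted)
# "and": every requested tag is present (other tags are allowed).
all_match = sorted(b for b, tags in blog_tags.items() if wanted <= tags)
# "exactly": the blog's tags are the requested tags and nothing else.
exact_match = sorted(b for b, tags in blog_tags.items() if tags == wanted)
```

Seen this way, the "exactly" case needs two conditions in a database query: all requested tags match, and the blog's total tag count equals the number of requested tags.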
