Django Queryset: Need help in optimizing this set of queries - python

I'm trying to pick out the most common tag combinations from a list of educational question records.
For this example I'm only looking at 2-tag combinations (tag-tag), and the result should look something like:
"point" + "curve" (65 entries)
"add" + "subtract" (40 entries)
...
This is the desired outcome as an SQL statement:
SELECT a.tag, b.tag, count(*)
FROM examquestions.dbmanagement_tag as a
INNER JOIN examquestions.dbmanagement_tag as b on a.question_id_id = b.question_id_id
where a.tag != b.tag
group by a.tag, b.tag
Basically, the query finds pairs of different tags that share a question and groups them by tag combination, so each combination is listed together with the number of questions it appears in.
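To make the join concrete, here is a small self-contained check using sqlite3 from the standard library. The table name `tag`, the column `question_id` and the sample rows are made up for illustration and are not the real `dbmanagement_tag` schema. Note one deliberate change: with `a.tag != b.tag` every pair is counted twice (once in each order), so this sketch uses `a.tag < b.tag` to keep one row per combination.

```python
import sqlite3

# In-memory database with a hypothetical, simplified tag table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (question_id INTEGER, tag TEXT);
    INSERT INTO tag VALUES
        (1, 'point'), (1, 'curve'),
        (2, 'point'), (2, 'curve'),
        (3, 'add'),   (3, 'subtract');
""")

# Self-join on the question; a.tag < b.tag keeps each pair only once.
pair_counts = conn.execute("""
    SELECT a.tag, b.tag, COUNT(*)
    FROM tag AS a
    INNER JOIN tag AS b ON a.question_id = b.question_id
    WHERE a.tag < b.tag
    GROUP BY a.tag, b.tag
    ORDER BY COUNT(*) DESC, a.tag
""").fetchall()

for first, second, count in pair_counts:
    print(f'"{first}" + "{second}" ({count} entries)')
```

Because the comparison is alphabetical, the pair is reported as "curve" + "point" rather than "point" + "curve".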
I have tried to do a similar query using django queryset:
twotaglist = []  # final set of results
alphatags = tag.objects.all().values('tag', 'type').annotate().order_by('tag')
betatags = tag.objects.all().values('tag', 'type').annotate().order_by('tag')
# startindex is increased by 1 each time the atag changes, to shorten the betatag
# range. This avoids comparing the same pair of tags twice.
startindex = 0
for atag in alphatags:
    for btag in betatags[startindex:]:
        if (atag['tag'] != btag['tag']):
            commonQns = []  # to check how many common qns
            atagQns = tag.objects.filter(tag=atag['tag'], question_id__in=qnlist).values('question_id').annotate()
            btagQns = tag.objects.filter(tag=btag['tag'], question_id__in=qnlist).values('question_id').annotate()
            for atagQ in atagQns:
                for btagQ in btagQns:
                    if (atagQ['question_id'] == btagQ['question_id']):
                        commonQns.append(atagQ['question_id'])
            if (len(commonQns) > 0):
                twotaglist.append({'atag': atag['tag'],
                                   'btag': btag['tag'],
                                   'count': len(commonQns)})
    startindex = startindex + 1
The logic works fine, but as I am pretty new to this platform, I'm not sure whether there is a shorter and more efficient way to do it.
Currently the query takes about 45 seconds for a comparison of roughly 5K x 5K tags :(
Addon: Tag class
class tag(models.Model):
    id = models.IntegerField('id', primary_key=True, null=False)
    question_id = models.ForeignKey(question, null=False)
    tag = models.TextField('tag', null=True)
    type = models.CharField('type', max_length=1)

    def __str__(self):
        return str(self.tag)

Unfortunately Django doesn't allow joining unless there's a foreign key (or one-to-one) relation involved, so you're going to have to do it in code. I've found a way (totally untested) to do it with just two queries (one for the questions and one for the prefetched tags), which should improve execution time significantly.
from collections import Counter
from itertools import combinations

# Assuming Models
class Question(models.Model):
    ...

class Tag(models.Model):
    tag = models.CharField(..)
    question = models.ForeignKey(Question, related_name='tags')

c = Counter()
questions = Question.objects.all().prefetch_related('tags')  # prefetch the reverse relation
for q in questions:
    # sort them so 'point' + 'curve' == 'curve' + 'point'
    tags = sorted([t.tag for t in q.tags.all()])  # the field is called tag, not name
    c.update(combinations(tags, 2))  # get all 2-tag combinations and update the counter
c.most_common(5)  # show the top 5
The above code uses Counter, itertools.combinations, and Django's prefetch_related, which should cover most of the bits above that might be unknown. Look at those resources if the code doesn't work exactly as written, and modify accordingly.
If you're not using an M2M field on your Question model, you can still access tags as if it were an M2M field by using the reverse relation. See my edit that changes the reverse relation from tag_set to tags. I've made a couple of other edits that should work with the way you've defined your models.
If you don't specify related_name='tags', then just change tags in q.tags.all() and in prefetch_related to tag_set and you're good to go.
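The counting step itself does not depend on Django, so it can be tried on plain data first. Below is a minimal sketch that assumes the rows arrive as (question_id, tag) tuples, for example from something like `values_list('question_id', 'tag')` on the tag model; the sample rows are invented for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (question_id, tag) rows standing in for one database query.
rows = [
    (1, 'point'), (1, 'curve'),
    (2, 'curve'), (2, 'point'),
    (3, 'add'), (3, 'subtract'), (3, 'point'),
]

# Group the tags by question; a set drops duplicate tags on one question.
tags_by_question = defaultdict(set)
for question_id, tag in rows:
    tags_by_question[question_id].add(tag)

# Count every unordered pair of tags once per question.
pair_counts = Counter()
for tags in tags_by_question.values():
    # sorting makes ('curve', 'point') and ('point', 'curve') the same key
    pair_counts.update(combinations(sorted(tags), 2))

print(pair_counts.most_common(3))
```

This replaces the tag-by-tag nested loops with a single pass over the rows, so the work grows with the number of tags per question rather than with the square of the number of distinct tags.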

If I understood your question correctly, I would keep things simpler and do something like this:
relevant_tags = Tag.objects.filter(question_id__in=qnlist)
# Here relevant_tags has both a and b tags
unique_tags = set()
for tag_item in relevant_tags:
    unique_tags.add(tag_item.tag)
# unique_tags should have your A and B tags
a_tag = unique_tags.pop()
b_tag = unique_tags.pop()
# Some logic to make sure what is A and what is B
# (list comprehensions instead of filter(), so the results can be reused and concatenated)
a_tags = [t for t in relevant_tags if t.tag == a_tag]
b_tags = [t for t in relevant_tags if t.tag == b_tag]
# a_tags and b_tags contain A and B tags filtered from relevant_tags
same_question_tags = dict()
for q in qnlist:
    a_list = [a for a in a_tags if a.question_id == q.id]
    b_list = [b for b in b_tags if b.question_id == q.id]
    same_question_tags[q] = a_list + b_list
The good thing about this is that you can extend it to N tags: iterate over the returned tags to collect all the unique ones, then iterate again to filter them tag by tag.
There are definitely other ways to do this too.

Related

Django: Add a "configuration" list for different code sections to access

I use these code snippets in different parts of my code. To avoid errors creeping in over time, I would like to define one configuration list that both sections can access. The list will get longer over time as entries are added. Do you have an idea about how to achieve that?
Here is the "configuration" list that snippets #1 and #2 should both access in order to perform the filter and the if statement:
list = [TYPE_OF_PEOPLE_ATTENDING, HEARING_ABOUT_THE_EVENT, MISSING_EVENT_INFORMATION, REASON_FOR_ATTENDING]
1
entities = (
    Entity.objects.values("answer__question__focus", "name")
    .annotate(count=Count("pk"))
    .annotate(total_salience=Sum("salience"))
    .filter(
        Q(answer__question__focus=QuestionFocus.TYPE_OF_PEOPLE_ATTENDING) |
        Q(answer__question__focus=QuestionFocus.HEARING_ABOUT_THE_EVENT) |
        Q(answer__question__focus=QuestionFocus.MISSING_EVENT_INFORMATION) |
        Q(answer__question__focus=QuestionFocus.REASON_FOR_ATTENDING)
    )
)
2
if (
    answer_obj.question.focus == QuestionFocus.TYPE_OF_PEOPLE_ATTENDING
    or answer_obj.question.focus == QuestionFocus.HEARING_ABOUT_THE_EVENT
    or answer_obj.question.focus == QuestionFocus.MISSING_EVENT_INFORMATION
    or answer_obj.question.focus == QuestionFocus.REASON_FOR_ATTENDING
):
    entities = analyze_entities(answer_obj.answer)
    bulk_create_entities(entities, response, answer_obj)
You should be able to rewrite both statements to use a list directly:
VALID_TYPES = [
    QuestionFocus.TYPE_OF_PEOPLE_ATTENDING,
    QuestionFocus.HEARING_ABOUT_THE_EVENT,
    QuestionFocus.MISSING_EVENT_INFORMATION,
    QuestionFocus.REASON_FOR_ATTENDING,
]
1
entities = (
    Entity.objects.values("answer__question__focus", "name")
    .annotate(count=Count("pk"))
    .annotate(total_salience=Sum("salience"))
    .filter(answer__question__focus__in=VALID_TYPES)
)
2
if answer_obj.question.focus in VALID_TYPES:
    entities = analyze_entities(answer_obj.answer)
    bulk_create_entities(entities, response, answer_obj)
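One way to keep the two call sites from drifting apart is to define the collection once in a shared module and wrap the membership test in a small helper. The sketch below is only an illustration: the `QuestionFocus` enum is a stand-in for the real one, and the names `ENTITY_ANALYSIS_FOCUSES` and `needs_entity_analysis` are hypothetical.

```python
from enum import Enum


class QuestionFocus(Enum):
    """Stand-in for the project's real QuestionFocus choices (assumed values)."""
    TYPE_OF_PEOPLE_ATTENDING = "type_of_people_attending"
    HEARING_ABOUT_THE_EVENT = "hearing_about_the_event"
    MISSING_EVENT_INFORMATION = "missing_event_information"
    REASON_FOR_ATTENDING = "reason_for_attending"
    OTHER = "other"


# Single source of truth; add new entries here only.
ENTITY_ANALYSIS_FOCUSES = (
    QuestionFocus.TYPE_OF_PEOPLE_ATTENDING,
    QuestionFocus.HEARING_ABOUT_THE_EVENT,
    QuestionFocus.MISSING_EVENT_INFORMATION,
    QuestionFocus.REASON_FOR_ATTENDING,
)


def needs_entity_analysis(focus):
    """Return True if answers with this focus should be analysed for entities."""
    return focus in ENTITY_ANALYSIS_FOCUSES


print(needs_entity_analysis(QuestionFocus.REASON_FOR_ATTENDING))
print(needs_entity_analysis(QuestionFocus.OTHER))
```

The same tuple can then be passed to the `__in` lookup in the queryset, so a new focus only has to be added in one place.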

How can I split a string in Python based on and/or params efficiently?

I have an input string like ((display_name contain 'Fw-1111' or display_name contain 'P1') and site_name contain 'device').
The input can contain any combination of filters in any order.
I want to fetch the filters separately, grouped by type, e.g. all filters on display_name should come together as a single input like (display_name contain 'Fw-1111' or display_name contain 'P1').
I used string splitting to parse it and get the filters separately. It works, but the code is clumsy, and I think there should be a better way to achieve this in Python. Please let me know an efficient way to do it.
This is my clumsy sample code:
column_search = ("((display_name contain 'Fw-1111' or display_name "
                 "contain 'P1') and site_name contain 'device')")
col_search_list = column_search.split('and')
if 'display_name' in col_search_list[0]:
    policy_filter = col_search_list[0]
elif 'site_name' in col_search_list[0]:
    site_filter = col_search_list[0]
if len(col_search_list) > 1:
    if 'display_name' in col_search_list[1]:
        policy_filter = col_search_list[1]
    elif 'site_name' in col_search_list[1]:
        site_filter = col_search_list[1]
Loop over col_search_list instead of handling each index by hand:
col_search_list = column_search.split('and')
for term in col_search_list:
    if 'display_name' in term:
        policy_filter = term
    elif 'site_name' in term:
        site_filter = term
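The loop can be taken one step further by collecting the terms in a dictionary keyed by column name. The sketch below is one possible variant, not the answer's exact method: the function name `group_filters` and its `columns` parameter are made up for illustration. It also splits on the word "and" only when it is surrounded by whitespace, because `str.split('and')` splits on every occurrence of those three letters, including inside a value such as 'brand'. A quoted value that itself contains " and " would still be split, so this is not a full parser.

```python
import re


def group_filters(column_search, columns=("display_name", "site_name")):
    """Map each known column name to the filter term that mentions it."""
    groups = {}
    for term in re.split(r"\s+and\s+", column_search):
        term = term.strip()
        # Drop parentheses left unbalanced by the split.
        while term.startswith("(") and term.count("(") > term.count(")"):
            term = term[1:]
        while term.endswith(")") and term.count(")") > term.count("("):
            term = term[:-1]
        for column in columns:
            if column in term:
                groups[column] = term
                break
    return groups


column_search = ("((display_name contain 'Fw-1111' or display_name "
                 "contain 'P1') and site_name contain 'device')")
filters = group_filters(column_search)
print(filters["display_name"])
print(filters["site_name"])
```

Supporting another column then only means adding its name to `columns`, with no new if/elif branch.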

Django ORM filter by Max column value of two related models

I have 3 related models:
class Program(Model):
    ...  # which aggregates ProgramVersions

class ProgramVersion(Model):
    program = ForeignKey(Program)
    index = IntegerField()

class UserProgramVersion(Model):
    user = ForeignKey(User)
    version = ForeignKey(ProgramVersion)
    index = IntegerField()
ProgramVersion and UserProgramVersion are orderable models based on the index field: the object with the highest index in the table is considered the latest/newest object (this is handled by some custom logic, not relevant here).
I would like to select the latest UserProgramVersion for each Program, i.e. among the UPVs that point to the same Program, only the newest one.
This can be handled by this UserProgramVersion queryset method:
def latest_user_program_versions(self):
    latest = self\
        .order_by('version__program_id', '-version__index', '-index')\
        .distinct('version__program_id')
    return self.filter(id__in=latest)
This works fine, however I'm looking for a solution which does NOT use .distinct().
I tried something like this:
def latest_user_program_versions(self):
    latest = self\
        .annotate(
            max_version_index=Max('version__index'),
            max_index=Max('index'))\
        .filter(
            version__index=F('max_version_index'),
            index=F('max_index'))
    return self.filter(id__in=latest)
This, however, does not work.
Use Subquery() expressions, available since Django 1.11. The example in the docs is similar, and its purpose is also to get the newest item for each required parent record.
(You could probably start from that example with your own objects, but I have also written a more complete, more complicated suggestion that avoids possible performance pitfalls.)
from django.db.models import OuterRef, Subquery
...
def latest_user_program_versions(self, *args, **kwargs):
    # You should filter users by args or kwargs here, for performance reasons.
    # If you do it here, the filter is also applied to the subquery, which is
    # much faster on a big db.
    qs = self.filter(*args, **kwargs)
    parent = Program.objects.filter(pk__in=qs.values('version__program'))
    newest = (
        qs.filter(version__program=OuterRef('pk'))
        .order_by('-version__index', '-index')
    )
    pks = (
        parent.annotate(newest_id=Subquery(newest.values('pk')[:1]))
        .values_list('newest_id', flat=True)
    )
    # Uncomment this if you prefer it to be executed as two shorter SQL queries.
    # pks = list(pks)
    return self.filter(pk__in=pks)
If you improve it considerably, post your solution as an answer.
EDIT: The problem with your second solution:
You cannot cut the branch you are sitting on, not even in SQL, but you can sit on a temporary copy of it in a subquery and survive :-) That is also why I ask for a filter at the beginning. The second problem is that Max('version__index') and Max('index') could come from two different objects, so no row may match both conditions.
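The second problem can be seen without a database. The following sketch uses plain Python objects with made-up data (the `UPV` class and the helper functions are hypothetical, not part of the models above) to contrast "the row that is newest by (version index, index)" with "the two maxima taken independently".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UPV:
    """Flattened stand-in for a UserProgramVersion joined to its version."""
    id: int
    program_id: int
    version_index: int
    index: int


rows = [
    UPV(id=1, program_id=1, version_index=2, index=1),
    UPV(id=2, program_id=1, version_index=1, index=5),
    UPV(id=3, program_id=2, version_index=1, index=1),
]


def latest_per_program(rows):
    """Newest row per program, ordered by version_index first, then index."""
    latest = {}
    for row in rows:
        best = latest.get(row.program_id)
        if best is None or (row.version_index, row.index) > (best.version_index, best.index):
            latest[row.program_id] = row
    return latest


def independent_maxima(rows, program_id):
    """Max of each column on its own, as two separate Max() aggregates would give."""
    group = [r for r in rows if r.program_id == program_id]
    return max(r.version_index for r in group), max(r.index for r in group)


print(latest_per_program(rows)[1])   # the row with id=1
print(independent_maxima(rows, 1))   # (2, 5), a combination no single row has
```

For program 1 the independent maxima are taken from two different rows, so filtering on both at once matches nothing, while the ordered comparison still picks exactly one row.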
EDIT2: Verified: The internal SQL from my query is complicated, but seems correct.
SELECT app_userprogramversion.id,...
FROM app_userprogramversion
WHERE app_userprogramversion.id IN
    (SELECT
        (SELECT U0.id
         FROM app_userprogramversion U0
         INNER JOIN app_programversion U2 ON (U0.version_id = U2.id)
         WHERE (U0.user_id = 123 AND U2.program_id = (V0.id))
         ORDER BY U2.index DESC, U0.index DESC LIMIT 1
        ) AS newest_id
     FROM app_program V0 WHERE V0.id IN
        (SELECT U2.program_id AS Col1
         FROM app_userprogramversion U0
         INNER JOIN app_programversion U2 ON (U0.version_id = U2.id)
         WHERE U0.user_id = 123
        )
    )

How to search in ManyToManyField

I am new to Django, and I am trying to query a many-to-many field.
An example of my query:
In the models I have
class Line(models.Model):
    name = models.CharField("Name of line", max_length=50, blank=True)

class Cross(models.Model):
    lines = models.ManyToManyField(Line, verbose_name="Lines crossed")
    date = models.DateField('Cross Date', null=True, blank=False)
I am building a search that queries all the crosses that have certain lines.
I mean, the query in the search box will look like: line_1, line_2, line_3
and the result should be all the crosses that have all of those lines (line_1, line_2, line_3).
I don't know what the filter condition should be!
all_crosses = Cross.objects.all().filter(???)
The view code:
def inventory(request):
    if request.method == "POST":
        if 'btn_search' in request.POST:
            if 'search_by_lines' in request.POST:
                lines_query = request.POST['search_by_lines']
                queried_lines = split_query(lines_query, ',')
                query = [Q(lines__name=l) for l in queried_lines]
                print(query)
                result = Cross.objects.filter(reduce(operator.and_, query))
Thank you very much
You should be able to do:
crosses = Cross.objects.filter(lines__name__in=['line_1', 'line_2', 'line_3'])
for any of the three values. If you're looking for all of the values that match, you'll need to use a Q object:
from django.db.models import Q
crosses = Cross.objects.filter(
    Q(lines__name='line_1') &
    Q(lines__name='line_2') &
    Q(lines__name='line_3')
)
There is at least one other approach you can use, which is chaining filters:
(Cross.objects.filter(lines__name='line_1')
    .filter(lines__name='line_2')
    .filter(lines__name='line_3'))
If you need to construct the Q objects dynamically, and assuming the "name" value is what you're posting:
import operator
from functools import reduce
lines = [Q(lines__name='{}'.format(line)) for line in request.POST.getlist('lines')]
crosses = Cross.objects.filter(reduce(operator.and_, lines))
[Update]
Turns out, I was dead wrong. I tried a couple of different ways of querying Cross objects where the value of lines matched all of the items searched. Q objects, annotations of counts on the number of objects contained... nothing worked as expected.
In the end, I ended up matching cross.lines as a list to the list of values posted.
In short, the search view I created matched in this fashion:
results = []
posted_lines = []
search_by_lines = 'search_by_lines' in request.POST.keys()
crosses = Cross.objects.all().prefetch_related('lines')
if request.method == 'POST' and search_by_lines:
    posted_lines = request.POST.getlist('line')
    for cross in crosses:
        if list(cross.lines.values_list('name', flat=True)) == posted_lines:
            results.append(cross)
return render(request, 'search.html', {'lines': lines, 'results': results,
                                       'posted_lines': posted_lines})
What I would probably do in this case is add a column on the Cross model to keep a comma separated list of the primary keys of the related lines values, which you could keep in sync via a post_save signal.
With the additional field, you could query directly against the "line" values without joins.
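For reference, "crosses that have all of the given lines" can also be expressed in SQL by filtering on the wanted names, grouping by cross, and keeping only the groups that matched every name. The sketch below demonstrates that idea with sqlite3 from the standard library. The table names `line` and `cross_lines`, the sample data and the function `crosses_with_all` are assumptions for illustration only; they are not the tables Django generates for these models, and this is not the approach used in the view above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE line (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cross_lines (cross_id INTEGER, line_id INTEGER);
    INSERT INTO line VALUES (1, 'line_1'), (2, 'line_2'), (3, 'line_3');
    INSERT INTO cross_lines VALUES
        (10, 1), (10, 2), (10, 3),
        (11, 1), (11, 2),
        (12, 3);
""")


def crosses_with_all(conn, names):
    """Return ids of crosses linked to every line name in `names`."""
    wanted = sorted(set(names))
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT cl.cross_id
        FROM cross_lines AS cl
        JOIN line AS l ON l.id = cl.line_id
        WHERE l.name IN ({placeholders})
        GROUP BY cl.cross_id
        HAVING COUNT(DISTINCT l.name) = ?
        ORDER BY cl.cross_id
    """
    # Only the placeholders are interpolated; the values are bound as parameters.
    rows = conn.execute(sql, [*wanted, len(wanted)])
    return [row[0] for row in rows]


print(crosses_with_all(conn, ["line_1", "line_2"]))
print(crosses_with_all(conn, ["line_1", "line_2", "line_3"]))
```

Unlike the list comparison in the view, this matches crosses that contain the requested lines plus any others, and it does not depend on the order in which the lines were posted.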

Dynamic queries in Elasticsearch and issue of keywords

I'm currently running into a problem trying to build dynamic queries for Elasticsearch in Python. To make a query I use the Q shortcut from elasticsearch_dsl. This is what I am trying to implement:
...
s = Search(using=db, index="reestr")
condition = {"attr_1_":"value 1", "attr_2_":"value 2"} # try to build query from this
must = []
for key in condition:
    must.append(Q('match',key=condition[key]))
But that in fact results in this condition:
[Q('match',key="value 1"),Q('match',key="value 2")]
However, what I want is:
[Q('match',attr_1_="value 1"),Q('match',attr_2_="value 2")]
IMHO, the way this library does queries is not effective. I think this syntax:
Q("match","attribute_name"="attribute_value")
would be much more powerful and would make it possible to do a lot more things than this one:
Q("match",attribute_name="attribute_value")
It seems as if it is impossible to build attribute names dynamically. Or, of course, it is possible that I just do not know the right way to do it.
Suppose,
filters = {'condition1':['value1'],'condition2':['value3','value4']}
Code:
filters = data['filters_data']
must_and = list() # The condition that has only one value
should_or = list() # The condition that has more than 1 value
for key in filters:
    if len(filters[key]) > 1:
        for item in filters[key]:
            should_or.append(Q("match", **{key:item}))
    else:
        must_and.append(Q("match", **{key:filters[key][0]}))
q1 = Bool(must=must_and)
q2 = Bool(should=should_or)
s = s.query(q1).query(q2)
result = s.execute()
One can also use a terms query, which accepts the list of values directly, so no complicated for loops are needed:
Code:
for key in filters:
    must_and.append(Q("terms", **{key:filters[key]}))
q1 = Bool(must=must_and)
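The `**{key: value}` part is ordinary Python keyword-argument unpacking and is not specific to elasticsearch_dsl. The sketch below shows the difference using a stand-in function named `Q` that only records its arguments; it is not the real elasticsearch_dsl `Q`, so it runs without the library installed.

```python
def Q(name, **params):
    """Stand-in for a query factory: records the query type and its keyword arguments."""
    return {name: params}


condition = {"attr_1_": "value 1", "attr_2_": "value 2"}

# key=... always creates a keyword argument literally named "key".
literal_key = [Q("match", key=value) for value in condition.values()]

# **{key: value} uses the *content* of the variable as the keyword name.
dynamic_key = [Q("match", **{key: value}) for key, value in condition.items()]

print(literal_key)
print(dynamic_key)
```

Any string can be used as a keyword name this way, which is what makes dynamically built field names possible.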
