I am working with distance between objects, so I use haversine formaula in a raw query:
query = """SELECT *, (%d*acos(cos(radians(%2f))
*cos(radians(client_address.latitude))*cos(radians(client_address.longitude)-radians(%2f))
+sin(radians(%2f))*sin(radians(client_address.latitude))))
AS distance FROM jobs LEFT OUTER JOIN client_address ON ( client_address.id = jobs.location_id) HAVING
distance < %2f AND status = 1 ORDER BY distance LIMIT 0, %d""" % (
earth,
float(self.latitude),
float(self.longitude),
float(self.latitude),
distance,
limit
)
This query is a object method, that's why you see self reference. I find out that I cant do extra filters for the resulting RawQuerySet unless I perform this query in a .extra() or .annotate() method. I wonder if it is possible, Any suggestion? Thanks!
Related
Background
Suppose we have a set of questions, and a set of students that answered these questions.
The answers have been reviewed, and scores have been assigned, on some unknown range.
Now, we need to normalize the scores with respect to the extreme values within each question.
For example, if question 1 has a minimum score of 4 and a maximum score of 12, those scores would be normalized to 0 and 1 respectively. Scores in between are interpolated linearly (as described e.g. in Normalization to bring in the range of [0,1]).
Then, for each student, we would like to know the mean of the normalized scores for all questions combined.
Minimal example
Here's a very naive minimal implementation, just to illustrate what we would like to achieve:
class Question(models.Model):
pass
class Student(models.Model):
def mean_normalized_score(self):
normalized_scores = []
for score in self.score_set.all():
normalized_scores.append(score.normalized_value())
return mean(normalized_scores) if normalized_scores else None
class Score(models.Model):
student = models.ForeignKey(to=Student, on_delete=models.CASCADE)
question = models.ForeignKey(to=Question, on_delete=models.CASCADE)
value = models.FloatField()
def normalized_value(self):
limits = Score.objects.filter(question=self.question).aggregate(
min=models.Min('value'), max=models.Max('value'))
return (self.value - limits['min']) / (limits['max'] - limits['min'])
This works well, but it is quite inefficient in terms of database queries, etc.
Goal
Instead of the implementation above, I would prefer to offload the number-crunching on to the database.
What I've tried
Consider, for example, these two use cases:
list the normalized_value for all Score objects
list the mean_normalized_score for all Student objects
The first use case can be covered using window functions in a query, something like this:
w_min = Window(expression=Min('value'), partition_by=[F('question')])
w_max = Window(expression=Max('value'), partition_by=[F('question')])
annotated_scores = Score.objects.annotate(
normalized_value=(F('value') - w_min) / (w_max - w_min))
This works nicely, so the Score.normalized_value() method from the example is no longer needed.
Now, I would like to do something similar for the second use case, to replace the Student.mean_normalized_score() method by a single database query.
The raw SQL could look something like this (for sqlite):
SELECT id, student_id, AVG(normalized_value) AS mean_normalized_score
FROM (
SELECT
myapp_score.*,
((myapp_score.value - MIN(myapp_score.value) OVER (PARTITION BY myapp_score.question_id)) / (MAX(myapp_score.value) OVER (PARTITION BY myapp_score.question_id) - MIN(myapp_score.value) OVER (PARTITION BY myapp_score.question_id)))
AS normalized_value
FROM myapp_score
)
GROUP BY student_id
I can make this work as a raw Django query, but I have not yet been able to reproduce this query using Django's ORM.
I've tried building on the annotated_scores queryset described above, using Django's Subquery, annotate(), aggregate(), Prefetch, and combinations of those, but I must be making a mistake somewhere.
Probably the closest I've gotten is this:
subquery = Subquery(annotated_scores.values('normalized_value'))
Score.objects.values('student_id').annotate(mean=Avg(subquery))
But this is incorrect.
Could someone point me in the right direction, without resorting to raw queries?
I may have found a way to do this using subqueries. The main thing is at least from django, we cannot use the window functions on aggregates, so that's what is blocking the calculation of the mean of the normalized values. I've added comments on the lines to explain what I'm trying to do:
# Get the minimum score per question
min_subquery = Score.objects.filter(question=OuterRef('question')).values('question').annotate(min=Min('value'))
# Get the maximum score per question
max_subquery = Score.objects.filter(question=OuterRef('question')).values('question').annotate(max=Max('value'))
# Calculate the normalized value per score, then get the average by grouping by students
mean_subquery = Score.objects.filter(student=OuterRef('pk')).annotate(
min=Subquery(min_subquery.values('min')[:1]),
max=Subquery(max_subquery.values('max')[:1]),
normalized=ExpressionWrapper((F('value') - F('min'))/(F('max') - F('min')), output_field=FloatField())
).values('student').annotate(mean=Avg('normalized'))
# Get the calculated mean per student
Student.objects.annotate(mean=Subquery(mean_subquery.values('mean')[:1]))
The resulting SQL is:
SELECT
"student"."id",
"student"."name",
(
SELECT
AVG(
(
(
V0."value" - (
SELECT
MIN(U0."value") AS "min"
FROM
"score" U0
WHERE
U0."question_id" = (V0."question_id")
GROUP BY
U0."question_id"
LIMIT
1
)
) / (
(
SELECT
MAX(U0."value") AS "max"
FROM
"score" U0
WHERE
U0."question_id" = (V0."question_id")
GROUP BY
U0."question_id"
LIMIT
1
) - (
SELECT
MIN(U0."value") AS "min"
FROM
"score" U0
WHERE
U0."question_id" = (V0."question_id")
GROUP BY
U0."question_id"
LIMIT
1
)
)
)
) AS "mean"
FROM
"score" V0
WHERE
V0."student_id" = ("student"."id")
GROUP BY
V0."student_id"
LIMIT
1
) AS "mean"
FROM
"student"
As mentioned by #bdbd, and judging from this Django issue, it appears that annotating a windowed queryset is not yet possible (using Django 3.2).
As a temporary workaround, I refactored #bdbd's excellent Subquery solution as follows.
class ScoreQuerySet(models.QuerySet):
def annotate_normalized(self):
w_min = Subquery(self.filter(
question=OuterRef('question')).values('question').annotate(
min=Min('value')).values('min')[:1])
w_max = Subquery(self.filter(
question=OuterRef('question')).values('question').annotate(
max=Max('value')).values('max')[:1])
return self.annotate(normalized=(F('value') - w_min) / (w_max - w_min))
def aggregate_student_mean(self):
return self.annotate_normalized().values('student_id').annotate(
mean=Avg('normalized'))
class Score(models.Model):
objects = ScoreQuerySet.as_manager()
...
Note: If necessary, we can add more Student lookups to the values() in aggregate_student_mean(), e.g. student__name. As long as we take care not to mess up the grouping.
Now, if it ever becomes possible to filter and annotate windowed querysets, we can simply replace the Subquery lines by the much simpler Window implementation:
w_min = Window(expression=Min('value'), partition_by=[F('question')])
w_max = Window(expression=Max('value'), partition_by=[F('question')])
I've tried replacing INNER_QUERY with "myApp_Instructionssteptranslation", and also leaving it away but this just gives other errors.
So, how come the generated query seems to work correctly when ran apart, but fails when we want to retrieve its results using django?
And how can I fix the issue so that it behaves like I want it too?We have a model InstructionsStep, which has a foreign key to a Instructions, which in turn is connected to a Library. A InstructionsStep has a description, but as multiple languages might exist this description is stored in a separate model containing a language code and the description translated in that language.
But for performance reasons, we need to be able to get a queryset of Instructionssteps, where the description is annotated in the default language (which is stored in the Library). To achieve this and to circumvent django's limitations on joins within annotations, we created a custom Aggregate function that retrieves this language. (DefaultInstructionsStepTranslationDescription)
class InstructionsStepTranslationQuerySet(models.query.QuerySet):
def language(self, language):
class DefaultInstructionsStepTranslationDescription(Aggregate):
template = '''
(%(function)s %(distinct)s INNER_QUERY."%(expressions)s" FROM (
SELECT "myApp_Instructionssteptranslation"."description" AS "description",
MIN("myUser_library"."default_language") AS "default_language"
FROM "myApp_Instructionssteptranslation"
INNER JOIN "myApp_Instructionsstep" A_ST ON ("myApp_Instructionssteptranslation"."Instructions_step_id" = A_ST."id")
INNER JOIN "myApp_Instructions" ON (A_ST."Instructions_id" = "myApp_Instructions"."id")
LEFT OUTER JOIN "myUser_library" ON ("myApp_Instructions"."library_id" = "myUser_library"."id")
WHERE "myApp_Instructionssteptranslation"."Instructions_step_id" = "myApp_Instructionsstep"."id"
and "myApp_Instructionssteptranslation"."language" = default_language
GROUP BY "myApp_Instructionssteptranslation"."id"
) AS INNER_QUERY
LIMIT 1
'''
function = 'SELECT'
def __init__(self, expression='', **extra):
super(DefaultInstructionsStepTranslationDescription, self).__init__(
expression,
distinct='',
output_field=CharField(),
**extra
)
return self.annotate(
t_description=
Case(
When(id__in = InstructionsStepTranslation.objects\
.annotate( default_language = Min(F("Instructions_step__Instructions__library__default_language")))\
.filter( language=F("default_language") )\
.values_list("Instructions_step_id"),
then=DefaultInstructionsStepTranslationDescription(Value("description"))
),
default=Value("error"),
output_field=CharField()
)
)
This generates the following sql-query (the database is a postgres database)
SELECT "myApp_Instructionsstep"."id",
"myApp_Instructionsstep"."original_id",
"myApp_Instructionsstep"."number",
"myApp_Instructionsstep"."Instructions_id",
"myApp_Instructionsstep"."ccp",
CASE
WHEN "myApp_Instructionsstep"."id" IN
(SELECT U0."Instructions_step_id"
FROM "myApp_Instructionssteptranslation" U0
INNER JOIN "myApp_Instructionsstep" U1 ON (U0."Instructions_step_id" = U1."id")
INNER JOIN "myApp_Instructions" U2 ON (U1."Instructions_id" = U2."id")
LEFT OUTER JOIN "myUser_library" U3 ON (U2."library_id" = U3."id")
GROUP BY U0."id"
HAVING U0."language" = (MIN(U3."default_language"))) THEN
(SELECT INNER_QUERY."description"
FROM
(SELECT "myApp_Instructionssteptranslation"."description" AS "description",
MIN("myUser_library"."default_language") AS "default_language"
FROM "myApp_Instructionssteptranslation"
INNER JOIN "myApp_Instructionsstep" A_ST ON ("myApp_Instructionssteptranslation"."Instructions_step_id" = A_ST."id")
INNER JOIN "myApp_Instructions" ON (A_ST."Instructions_id" = "myApp_Instructions"."id")
LEFT OUTER JOIN "myUser_library" ON ("myApp_Instructions"."library_id" = "myUser_library"."id")
WHERE "myApp_Instructionssteptranslation"."Instructions_step_id" = "myApp_Instructionsstep"."id"
and "myApp_Instructionssteptranslation"."language" = default_language
GROUP BY "myApp_Instructionssteptranslation"."id") AS INNER_QUERY
LIMIT 1)
ELSE 'error'
END AS "t_description"
FROM "myApp_Instructionsstep"
WHERE "myApp_Instructionsstep"."id" = 438
GROUP BY "myApp_Instructionsstep"."id"
ORDER BY "myApp_Instructionsstep"."number" ASC
Which works correctly when pasted in Postico.
However, running this in django,
step_id = 438
# InstructionsStep.objectsobjects is overrided with a custom manager that uses the above defined custon queryset
step_queryset = InstructionsStep.objects.language('en').filter(id=step_id)
retrieved_steps = step_queryset.all()
gives the following error:
LINE 1: ...ge" = (MIN(U3."default_language"))) THEN (SELECT INNER_QUER...
^
HINT: Perhaps you meant to reference the column "inner_query.description".
I've tried replacing INNER_QUERY with "myApp_Instructionssteptranslation", and also leaving it away but this just gives other errors.
So, how come the generated query seems to work correctly when ran apart, but fails when we want to retrieve its results using django?
And how can I fix the issue so that it behaves like I want it too?
Meanwhile, I've found that the printed query with the .query attribute differs from the actual query that's been executed.
In this case it printed SELECT INNER_QUERY."description", but it executed SELECT INNER_QUERY."'description'". The single quotes are added because of the Value("description") expression given to InstructionsStepTranslationQuerySet
I solved my problem in the end by passing the id-field (F("id")) instead and using it instead of A_ST."id". (sadly this is necessary as Aggregate does not allow an empty expression to be passed)
I have this sqlalchemy query:
query = session.query(Store).options(joinedload('salesmen').
joinedload('comissions').
joinedload('orders')).\
filter(Store.store_code.in_(selected_stores))
stores = query.all()
for store in stores:
for salesman in store.salesmen:
for comission in salesman.comissions:
#generate html for comissions for each salesman in each store
#print html document using PySide
This was working perfectly, however I added two new filter queries:
filter(Comissions.payment_status == 0).\
filter(Order.order_date <= self.dateEdit.date().toPython())
If I add just the first filter the application hangs for a couple of seconds, if I add both the application hangs indefinitely
What am I doing wrong here? How do I make this query fast?
Thank you for your help
EDIT: This is the sql generated, unfortunately the class and variable names are in Portuguese, I just translated them to English so it would be easier to undertand,
so Loja = Store, Vendedores = Salesmen, Pedido = Order, Comission = Comissao
Query generated:
SELECT "Loja"."CodLoja", "Vendedores_1"."CodVendedor", "Vendedores_1"."NomeVendedor", "Vendedores_1"."CodLoja", "Vendedores_1"."PercentualComissao",
"Vendedores_1"."Ativo", "Comissao_1"."CodComissao", "Comissao_1"."CodVendedor", "Comissao_1"."CodPedido",
"Pedidos_1"."CodPedido", "Pedidos_1"."CodLoja", "Pedidos_1"."CodCliente", "Pedidos_1"."NomeCliente", "Pedidos_1"."EnderecoCliente", "Pedidos_1"."BairroCliente",
"Pedidos_1"."CidadeCliente", "Pedidos_1"."UFCliente", "Pedidos_1"."CEPCliente", "Pedidos_1"."FoneCliente", "Pedidos_1"."Fone2Cliente", "Pedidos_1"."PontoReferenciaCliente",
"Pedidos_1"."DataPedido", "Pedidos_1"."ValorProdutos", "Pedidos_1"."ValorCreditoTroca",
"Pedidos_1"."ValorTotalDoPedido", "Pedidos_1"."Situacao", "Pedidos_1"."Vendeu_Teflon", "Pedidos_1"."ValorTotalTeflon",
"Pedidos_1"."DataVenda", "Pedidos_1"."CodVendedor", "Pedidos_1"."TipoVenda", "Comissao_1"."Valor", "Comissao_1"."DataPagamento", "Comissao_1"."StatusPagamento"
FROM "Comissao", "Pedidos", "Loja" LEFT OUTER JOIN "Vendedores" AS "Vendedores_1" ON "Loja"."CodLoja" = "Vendedores_1"."CodLoja"
LEFT OUTER JOIN "Comissao" AS "Comissao_1" ON "Vendedores_1"."CodVendedor" = "Comissao_1"."CodVendedor" LEFT OUTER JOIN "Pedidos" AS "Pedidos_1" ON "Pedidos_1"."CodPedido" = "Comissao_1"."CodPedido"
WHERE "Loja"."CodLoja" IN (:CodLoja_1) AND "Comissao"."StatusPagamento" = :StatusPagamento_1 AND "Pedidos"."DataPedido" <= :DataPedido_1
Your FROM clause is producing a Cartesian product and includes each table twice, once for filtering the result and once for eagerly loading the relationship.
To stop this use contains_eager instead of joinedload in your options. This will look for the related attributes in the query's columns instead of constructing an extra join. You will also need to explicitly join to the other tables in your query, e.g.:
query = session.query(Store)\
.join(Store.salesmen)\
.join(Store.commissions)\
.join(Store.orders)\
.options(contains_eager('salesmen'),
contains_eager('comissions'),
contains_eager('orders'))\
.filter(Store.store_code.in_(selected_stores))\
.filter(Comissions.payment_status == 0)\
.filter(Order.order_date <= self.dateEdit.date().toPython())
I have the following code in python. I get this error ->tuple indices must be integers, not str
How can I pass these values into the query? I have other examples where this approach works perfectly, i don't understand why it's failling here.
def request_events_json(uei,interval,conn):
cur = conn.cursor()
events_query ="""select e.nodeid,n.nodelabel,e.ipaddr,count(*) as total,min(e.eventcreatetime),max(e.eventcreatetime),(regexp_matches (e.eventlogmsg,E': %(.*)'))[1] as msglog
from events e, node n where e.eventuei = (%s) and e.eventcreatetime > now() - interval (%s) and n.nodeid=e.nodeid
group by n.nodelabel,e.nodeid,e.ipaddr,msglog
order by e.nodeid, count(*) desc limit 10;"""
try:
print('## Requesting events ##')
cur.execute(events_query,('uei.opennms.org/syslogd/cisco/line','5 min'))
.......
With my version of PostgreSQL the round brackets after interval are forbidden.
Update:
It is the percent-sign in the regexp. Double it.
I have the following query that I'm executing using a Python script (by using the MySQLdb module).
conn=MySQLdb.connect (host = "localhost", user = "root",passwd = "<password>",db = "test")
cursor = conn.cursor ()
preamble='set #radius=%s; set #o_lat=%s; set #o_lon=%s; '%(radius,latitude,longitude)
query='SELECT *, (6371*1000 * acos(cos(radians(#o_lat)) * cos(radians(lat)) * cos(radians(lon) - radians(#o_lon)) + sin(radians(#o_lat)) * sin(radians(lat))) as distance FROM poi_table HAVING distance < #radius ORDER BY distance ASC LIMIT 0, 50)'
complete_query=preamble+query
results=cursor.execute (complete_query)
print results
The values of radius, latitude, and longitude are not important, but they are being defined when the script executes. What bothers me is that the snippet of code above returns no results; essentially meaning that the way that the query is being executed is wonky. I executed the SQL query (including the set variables with actual values, and it returned the correct number of results).
If I modify the query to just be a simple SELECT FROM query (SELECT * FROM poi_table) it returns results. What is going on here?
EDIT: Encapsulated Haversine formula calculation within parenthesis
AFAIK you can't run multiple statements using execute().
You can, however, let MySQLdb handle the value substitutions.
Note that there are two arguments being passed to execute().
Also, just running execute() doesn't actually return any results.
You need to use fetchone() or fetchmany() or fetchall().
cursor.execute('''
SELECT *,
6371*1000 * acos(cos(radians(%s)) *
cos(radians(lat)) * cos(radians(lon) - radians(%s)) +
sin(radians(%s)) * sin(radians(lat))) as distance
FROM
poi_table
WHERE distance < %s
ORDER BY distance ASC
LIMIT 0, 50''', (latitude, longitude, latitude, radius,))
results = cursor.fetchall()
print results
Are you sure that the HAVING clause shouldn't be WHERE distance < #radius instead?