Django query expression for calculated fields with aggregation and without grouping - python

Continuing from here.
I have the following query expression in Django:
fields = {
    'impressions': models.Sum('impressions'),
    'clicks': models.Sum('clicks'),
}
calc_fields = {
    'ctr': models.Case(
        models.When(impressions=0, then=0),
        default=1.0 * models.F('clicks') / models.F('impressions'),
        output_field=models.FloatField(),
    ),
}
Stats.objects.values('product').annotate(**fields).annotate(**calc_fields)
This correctly calculates the aggregated fields impressions, clicks, and ctr, grouped by product, and produces the following Postgres query:
SELECT
"stats"."product_id",
SUM("stats"."impressions") AS "impressions",
SUM("stats"."clicks") AS "clicks",
CASE WHEN SUM("stats"."impressions") = 0 THEN 0 ELSE ((1.0 * SUM("stats"."clicks")) / SUM("stats"."impressions")) END AS "ctr"
FROM "stats"
GROUP BY "stats"."product_id";
Now all I want is to modify this expression so that, when no grouping is provided, it just returns aggregated values across the whole table.
This turned out to be harder than I expected. The rest of the question can be skipped; I'm just listing what I've tried.
1) First attempt to just drop values():
Stats.objects.annotate(**fields).annotate(**calc_fields)
This produces an error:
The annotation 'impressions' conflicts with a field on the model.
This probably happens because it now tries to return a list of full Stats objects with all their fields, rather than only the custom fields I listed before. Maybe there is a way to disable that and return only the custom fields?
I tried renaming the fields, but that results in grouping by the primary key, so it returns all records in the table rather than one.
2) Maybe I need to swap annotate with aggregate.
Stats.objects.aggregate(**fields).aggregate(**calc_fields)
This produces an error:
Error: 'dict' object has no attribute 'aggregate'
OK, maybe they cannot be chained; let's merge the fields together and swap F with Sum inside the Case:
fields = {
    'impressions': models.Sum('impressions'),
    'clicks': models.Sum('clicks'),
    'ctr': models.Case(
        models.When(impressions=0, then=0),
        default=1.0 * models.Sum('clicks') / models.Sum('impressions'),
        output_field=models.FloatField(),
    ),
}
Stats.objects.aggregate(**fields)
This appears to work correctly. But all hell breaks loose again when I add extra fields (all values are the same, only the names differ):
fields = {
    'impressions': models.Sum('impressions'),
    'impressions2': models.Sum('impressions'),
    'clicks': models.Sum('clicks'),
    'clicks2': models.Sum('clicks'),
    'ctr': models.Case(
        models.When(impressions=0, then=0),
        default=1.0 * models.Sum('clicks') / models.Sum('impressions'),
        output_field=models.FloatField(),
    ),
    'ctr2': models.Case(
        models.When(impressions=0, then=0),
        default=1.0 * models.Sum('clicks') / models.Sum('impressions'),
        output_field=models.FloatField(),
    ),
}
It seems the problem is caused by having multiple models.When(impressions=0) clauses; it generates wrong SQL:
SELECT
    CASE WHEN "stats"."impressions" = 0 THEN 0 ELSE ((1.0 * SUM("stats"."clicks")) / SUM("stats"."impressions")) END AS "ctr",
    CASE WHEN SUM("stats"."impressions") = 0 THEN 0 ELSE ((1.0 * SUM("stats"."clicks")) / SUM("stats"."impressions")) END AS "ctr2",
    SUM("stats"."impressions") AS "impressions",
    SUM("stats"."impressions") AS "impressions2",
    SUM("stats"."clicks") AS "clicks",
    SUM("stats"."clicks") AS "clicks2"
FROM "stats";
Notice how the CASE for ctr is incorrectly calculated as:
CASE WHEN "stats"."impressions" = 0
While for ctr2 it is correct:
CASE WHEN SUM("stats"."impressions") = 0
Which causes the database error:
column "stats.impressions" must appear in the GROUP BY clause or be used in an aggregate function
3) Renaming all aggregated fields to be different from model fields so there is no chance of confusion:
fields = {
    'impressions_total': models.Sum('impressions'),
    'impressions_total2': models.Sum('impressions'),
    'clicks_total': models.Sum('clicks'),
    'clicks_total2': models.Sum('clicks'),
    'ctr': models.Case(
        models.When(impressions_total=0, then=0),
        default=1.0 * models.Sum('clicks') / models.Sum('impressions'),
        output_field=models.FloatField(),
    ),
    'ctr2': models.Case(
        models.When(impressions_total=0, then=0),
        default=1.0 * models.Sum('clicks') / models.Sum('impressions'),
        output_field=models.FloatField(),
    ),
}
Stats.objects.aggregate(**fields)
This produces error:
Error: Cannot resolve keyword 'impressions_total' into field. Choices are: clicks, clicks_total, impressions, impressions_total2, product, product_id
If I leave a single ctr without ctr2, it works fine. Again, adding extra fields breaks it for some reason.
Any tips? I would prefer it to just return an array of custom fields rather than full objects.
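One workaround (not from the question, just a sketch) is to let aggregate() return only the plain sums and derive the ratio in Python, sidestepping the Case/When resolution problem entirely. ctr_from_totals below is a hypothetical helper name; `totals` stands for the dict that Stats.objects.aggregate(impressions=models.Sum('impressions'), clicks=models.Sum('clicks')) would return:

```python
def ctr_from_totals(totals):
    """Derive the click-through rate from an aggregate() result dict.

    Sum() yields None on an empty table, so both values are coerced
    to 0 before dividing; the zero guard mirrors the Case/When above.
    """
    impressions = totals.get('impressions') or 0
    clicks = totals.get('clicks') or 0
    return 0.0 if impressions == 0 else clicks / impressions
```

This trades one aggregated column for a trivial post-processing step, which is often acceptable when the query returns a single row of totals anyway.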

Related

Issue with parametrized queries using python

I am trying to use Python to run a parametrized query with a list. This is the code:
loan_records = ['604150062', '604150063', '604150064', '604150065',
                '604150066', '604150067', '604150069', '604150070']
borr_query = """
    select distinct a.nbr_aus,
        cast(a.nbr_trans_aus as varchar(50)) nbr_trans_aus,
        c.amt_finl_item,
        case when a.cd_idx in (-9999, 0) then null else a.cd_idx end as cd_idx,
        a.rate_curr_int,
        case when a.rate_gr_mrtg_mrgn = 0 then null else a.rate_gr_mrtg_mrgn end as rate_gr_mrtg_mrgn,
        a.rate_loln_max_cap,
        case when a.rate_perdc_cap = 0 then null else a.rate_perdc_cap end as rate_perdc_cap
    from db2mant.i_lp_trans a
    left join db2mant.i_lp_trans_borr b
        on a.nbr_aus = b.nbr_aus and a.nbr_trans_aus = b.nbr_trans_aus
    left join db2mant.i_lp_finl_item c
        on a.nbr_aus = c.nbr_aus and a.nbr_trans_aus = c.nbr_trans_aus
    where a.nbr_trans_aus in (?) and c.cd_finl_item = 189
"""
ODS.execute(borr_query, loan_records)
# PML.execute(PML_SUBMN_Query, (first_evnt, last_evnt, x))
ODS_records = ODS.fetchall()
ODS_records = pd.DataFrame(ODS_records, columns=['nbr_aus', 'nbr_trans_aus', 'amt_finl_item',
                                                 'cd_idx', 'rate_curr_int', 'rate_gr_mrtg_mrgn',
                                                 'rate_loln_max_cap', 'rate_perdc_cap'])
When I try to run this code, I get an error message.
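A common pitfall with `in (?)` (and a guess at what is failing here, since the driver and error text aren't shown) is that a single `?` binds exactly one value, not a whole list. Most DB-API drivers need one placeholder per list element, built like this:

```python
loan_records = ['604150062', '604150063', '604150064']

# Build one "?" per list element: "?,?,?"
placeholders = ','.join('?' * len(loan_records))

# Splice the placeholders into the IN clause (the rest of the SQL is elided)
query = (
    "select ... where a.nbr_trans_aus in (%s) and c.cd_finl_item = 189"
    % placeholders
)
# cursor.execute(query, loan_records)  # each element binds to its own "?"
```

Only the placeholder count changes dynamically; the values themselves still go through the driver's parameter binding, so this stays safe against SQL injection.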

Dropdown Menu - Float Values

I am trying to create a dropdown with ipywidgets. It needs to return a variable value, and this value may change depending on other slider input values. When I try to return the variable I get the error: NameError: name 'adjustment_one' is not defined.
calculated_area = input_slider / numstorey_input_slider * 0.7
revised_area = input_slider - calculated_area
adjustment_one = revised_area / input_slider
location = ipywidgets.Dropdown(
    options=[('Internal', 1), ('Roof', adjustment_one), ('External', 2)],
    value=1,
    description='Plant Space Location: ',
)
The above is solved and was an indenting error; however, I have now found that I can't pass a float value using the ipywidgets Dropdown. Below is the code. Based on what is selected in the dropdown, the selection will be mapped to a dataframe as a multiplier. However, when I open the Excel document it only ever returns the value '0'. The value should actually be 0.87 when Roof is selected, and will change depending on slider input values.
calculated_roof_area = gia_input_slider / numstoreys_input_slider * 0.7
revised_gia_roofinput = gia_input_slider - calculated_roof_area
gia_adjustment_rooftop = revised_gia_roofinput / gia_input_slider
plantspace = ipywidgets.Dropdown(
    options=[('Internal', 1), ('Roof', gia_adjustment_rooftop), ('External', 2)],
    value=1,
    description='Plant Space Location: ',
)
print(plantspace)
df = pd.read_excel(r'dataframe_final.xlsx', sheet_name='sheet1')
# below recalculates the adjusted qty when the slider position changes
multipliers = {
    'Building Footprint Factor': bldg_footprint_factor,
    'CAT A Fitout Yes / No': fitout_cat_dropdown,
    'GIA Factor': gia_factor,
    'GIA Adjustment - Rooftop': plantspace_dropdown,
}
df['QP1 Multiplier'] = df['QP1'].map(multipliers)
df['QP2 Multiplier'] = df['QP2'].map(multipliers)
df['QP3 Multiplier'] = df['QP3'].map(multipliers)
df['QP4 Multiplier'] = df['QP4'].map(multipliers)
df['QP5 Multiplier'] = df['QP5'].map(multipliers)
df['QP6 Multiplier'] = df['QP6'].map(multipliers)
df['QP7 Multiplier'] = df['QP7'].map(multipliers)
df['QP8 Multiplier'] = df['QP8'].map(multipliers)
df['QP9 Multiplier'] = df['QP9'].map(multipliers)
Can anybody help?
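One thing worth checking (an assumption, since the widget wiring isn't shown above): a Dropdown's current selection lives in its .value attribute, so a multipliers dict built from the widget object itself will never hold the float the dataframe mapping expects. A minimal stand-in class (FakeDropdown is purely illustrative, not part of ipywidgets) shows the distinction:

```python
class FakeDropdown:
    """Stand-in for ipywidgets.Dropdown: options are (label, value) pairs
    and the selected pair's value is exposed as .value."""
    def __init__(self, options, value):
        self.options = options
        self.value = value

# Selecting 'Roof' should surface the computed adjustment, e.g. 0.87
plantspace = FakeDropdown(
    options=[('Internal', 1), ('Roof', 0.87), ('External', 2)],
    value=0.87,
)

multipliers = {
    # store the widget's .value (a float), not the widget object itself
    'GIA Adjustment - Rooftop': plantspace.value,
}
```

With the real widget, the dict would also need to be rebuilt (or the mapping rerun) whenever the selection changes, e.g. from an observe callback, since .value is read at dict-construction time.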

Graphene GraphQL: only return records where a calculated field is greater than zero

I have a GraphQL endpoint that returns all the Magic: The Gathering cards for a particular MtG set. I have created a field that returns a list of calculated fields, which is the delta in card price between two days. I would like to filter to all the cards where the delta percentage is greater than 0.
So, something along the lines of: magic_sets_cards_pricing.objects.filter(MagicSetsGainsPricing.delta_percent > 0)
Code:
class MagicSetsGainsPricing(graphene.ObjectType):
    value = graphene.String()
    delta_percent = graphene.String()
    delta_value = graphene.String()

    class Meta:
        model = magic_sets_cards_pricing

    def resolve_gains(self, info, days=None, **kwargs):
        us = auth_user_settings.objects.values('region', 'currency').get()
        if us['region'] == 'Europe':
            region_nf = 'eur'
            match us['currency']:
                case 'Pound':
                    symbol = '£'
                    rate = currency_exchanges.objects.filter(c_from='EUR').filter(c_to='GBP').values('rate').get()['rate']
                case 'Dollar':
                    symbol = '$'
                    rate = currency_exchanges.objects.filter(c_from='EUR').filter(c_to='USD').values('rate').get()['rate']
                case _:
                    symbol = '€'
                    rate = 1
        else:
            region_nf = 'usd'
            match us['currency']:
                case 'Pound':
                    symbol = '£'
                    rate = currency_exchanges.objects.filter(c_from='USD').filter(c_to='GBP').values('rate').get()['rate']
                case 'Euro':
                    symbol = '€'
                    rate = currency_exchanges.objects.filter(c_from='USD').filter(c_to='EUR').values('rate').get()['rate']
                case _:
                    symbol = '$'
                    rate = 1
        price_today = magic_sets_cards_pricing.objects.filter(card_id=self).values('date', region_nf).order_by('-date')[0:1][0][region_nf] or 0
        calc_price_today = round((price_today * rate), 2)
        price_previous = magic_sets_cards_pricing.objects.filter(card_id=self).values('date', region_nf).order_by('-date')[days:days + 1][0][region_nf] or 0
        calc_price_previous = round((price_previous * rate), 2)
        calc_delta_value = str(calc_price_today - calc_price_previous)
        calc_delta_percent = str(round(((calc_price_today - calc_price_previous) / (calc_price_previous or 1)) * 100, 2))
        return MagicSetsGainsPricing(
            value=symbol + str(calc_price_today),
            delta_value=symbol + calc_delta_value,
            delta_percent=calc_delta_percent + '%',
        )

class MagicSetsGains(DjangoObjectType):
    pricing = graphene.Field(MagicSetsGainsPricing, days=graphene.Int(), resolver=MagicSetsGainsPricing.resolve_gains)

    class Meta:
        model = magic_sets_cards

    def resolve_pricing(self, info, **kwargs):
        return magic_sets_cards_pricing.objects.filter(card_id=self)

class MagicSetsGainsQuery(ObjectType):
    magic_sets_gains = graphene.List(MagicSetsGains, code=graphene.String())

    def resolve_magic_sets_gains(self, info, code=None, **kwargs):
        sql_number_to_int = r"CAST((REGEXP_MATCH(number, '\d+'))[1] as INTEGER)"
        excluded_sides = ['b', 'c', 'd', 'e']
        return magic_sets_cards.objects.filter(set_id__code=code).exclude(side__in=excluded_sides).extra(select={'int': sql_number_to_int}).order_by('int', 'number').all()
Response:
{
    "data": {
        "magicSetsGains": [
            {
                "number": "1",
                "name": "Adeline, Resplendent Cathar",
                "pricing": {
                    "value": "£2.23",
                    "deltaValue": "£0.52",
                    "deltaPercent": "30.41%"
                }
            },
            {
                "number": "2",
                "name": "Ambitious Farmhand // Seasoned Cathar",
                "pricing": {
                    "value": "£0.07",
                    "deltaValue": "£-0.04",
                    "deltaPercent": "-36.36%"
                }
            },
            ...
        ]
    }
}
So you can't use model.objects.filter(), because that filters at the database level, and the database doesn't know about the "delta_percent" field that you've calculated at the schema level. However, you can filter it in Python using the filter function.
What I would suggest is this. First, move the code inside resolve_gains out of the schema into a utility file, named something like get_pricing_for_card(card, days=None), that returns a dict with {"value": blah, "delta_percent": blah, "delta_value": blah}.
Then your resolve_gains becomes:
def resolve_gains(self, info, days=None, **kwargs):
    pricing = get_pricing_for_card(self, days)
    return MagicSetsGainsPricing(**pricing)
Then MagicSetsGainsQuery becomes this:
class MagicSetsGainsQuery(ObjectType):
    magic_sets_gains = graphene.List(MagicSetsGains, code=graphene.String())

    def resolve_magic_sets_gains(self, info, code=None, **kwargs):
        sql_number_to_int = r"CAST((REGEXP_MATCH(number, '\d+'))[1] as INTEGER)"
        excluded_sides = ['b', 'c', 'd', 'e']
        cards_to_return = magic_sets_cards.objects.filter(set_id__code=code).exclude(side__in=excluded_sides).extra(select={'int': sql_number_to_int}).order_by('int', 'number').all()
        return filter(lambda card: get_pricing_for_card(card)["delta_percent"] > 0, cards_to_return)
Alternatively you can replace the return line with:
return [card for card in cards_to_return if get_pricing_for_card(card)["delta_percent"] > 0]
I know this isn't great, because you essentially end up running get_pricing_for_card twice. This is a small annoyance of the graphene schema: you can't easily pass around values calculated in a resolver. The larger refactor would be to save the pricing, delta percent field and so forth, in the pricing table when you create a card; then you can do pricing.objects.filter(delta_percent__gte=0).
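One caveat with the Python-level filter: the resolver above builds delta_percent as a string like '30.41%', so the hypothetical get_pricing_for_card would need to return a numeric value, or the caller must parse it, before comparing with > 0 (comparing a str against an int raises TypeError). A small sketch with plain dicts, where parse_percent is an illustrative helper:

```python
def parse_percent(text):
    # '30.41%' -> 30.41; assumes the trailing-'%' format the resolver produces
    return float(text.rstrip('%'))

cards = [
    {'name': 'Adeline, Resplendent Cathar', 'delta_percent': '30.41%'},
    {'name': 'Ambitious Farmhand // Seasoned Cathar', 'delta_percent': '-36.36%'},
]

# Keep only the cards whose price delta is positive
gainers = [c for c in cards if parse_percent(c['delta_percent']) > 0]
```

The same parse step applies whether you use filter() or the list comprehension shown above.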

DynamoDB isn't finding overlap between two date ranges

I am trying to search my database to see if a date range I am about to add overlaps with a date range that already exists in the database.
Using this question: Determine Whether Two Date Ranges Overlap, I came up with firstDay <= :end and lastDay >= :start for my FilterExpression.
def create(self, start=None, days=30):
    # Create the start/end times
    if start is None:
        start = datetime.utcnow()
    elif isinstance(start, datetime) is False:
        raise ValueError('Start time must either be "None" or a "datetime"')
    end = start + timedelta(days=days)
    # Format the start and end string "YYYYMMDD"
    start = str(start.year) + str('%02d' % start.month) + str('%02d' % start.day)
    end = str(end.year) + str('%02d' % end.month) + str('%02d' % end.day)
    # Search the database for overlap
    days = self.connection.select(
        filter='firstDay <= :end and lastDay >= :start',
        attributes={
            ':start': {'N': start},
            ':end': {'N': end}
        }
    )
    # if we get one or more days then there is overlap
    if len(days) > 0:
        raise ValueError('There looks to be a time overlap')
    # Add the item to the database
    self.connection.insert({
        "firstDay": {"N": start},
        "lastDay": {"N": end}
    })
I am then calling the function like this:
seasons = dynamodb.Seasons()
seasons.create(start=datetime.utcnow() + timedelta(days=50))
As requested, the method looks like this:
def select(self, conditions='', filter='', attributes={}, names={}, limit=1, select='ALL_ATTRIBUTES'):
    """
    Select one or more items from dynamodb
    """
    # Create the condition, it should contain the datatype hash
    conditions = self.hashKey + ' = :hash and ' + conditions if len(conditions) > 0 else self.hashKey + ' = :hash'
    attributes[':hash'] = {"S": self.hashValue}
    limit = max(1, limit)
    args = {
        'TableName': self.table,
        'Select': select,
        'ScanIndexForward': True,
        'Limit': limit,
        'KeyConditionExpression': conditions,
        'ExpressionAttributeValues': attributes
    }
    if len(names) > 0:
        args['ExpressionAttributeNames'] = names
    if len(filter) > 0:
        args['FilterExpression'] = filter
    return self.connection.query(**args)['Items']
When I run the above, it keeps inserting the above start and end date into the database because it isn't finding any overlap. Why is this happening?
The table structure looks like this (JavaScript):
{
    TableName: 'test-table',
    AttributeDefinitions: [{
        AttributeName: 'dataType',
        AttributeType: 'S'
    }, {
        AttributeName: 'created',
        AttributeType: 'S'
    }],
    KeySchema: [{
        AttributeName: 'dataType',
        KeyType: 'HASH'
    }, {
        AttributeName: 'created',
        KeyType: 'RANGE'
    }],
    ProvisionedThroughput: {
        ReadCapacityUnits: 5,
        WriteCapacityUnits: 5
    }
}
Looks like you are setting Limit=1, probably to say "just return the first match found". In fact, setting Limit to 1 means only the first item found in the Query (i.e. in the partition key range) is evaluated at all, before the filter is applied. You probably need to remove the limit, so that each item in the partition range is evaluated for an overlap.
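The overlap predicate itself is sound, and it is easy to sanity-check in plain Python against the YYYYMMDD numbers the code stores (the dates below are arbitrary examples):

```python
def ranges_overlap(first_day, last_day, start, end):
    # Two inclusive ranges overlap iff each one starts before the other ends.
    return first_day <= end and last_day >= start

# Existing booking Jan 1 - Jan 30; a new range starting Feb 19 does not overlap...
no_clash = ranges_overlap(20230101, 20230130, 20230219, 20230320)
# ...but one starting Jan 20 does.
clash = ranges_overlap(20230101, 20230130, 20230120, 20230218)
```

So when no overlap is reported, the predicate isn't the culprit; it is the Limit cutting the candidate set down to a single item before the FilterExpression runs.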

Group by column to get array results in Postgresql

I have a table called moviegenre which looks like:
moviegenre:
- movie (FK movie.id)
- genre (FK genre.id)
I have a query (ORM-generated) which returns all movie.imdb_id and genre.id pairs that have genre.id's in common with a given movie.imdb_id.
SELECT "movie"."imdb_id",
"moviegenre"."genre_id"
FROM "moviegenre"
INNER JOIN "movie"
ON ( "moviegenre"."movie_id" = "movie"."id" )
WHERE ( "movie"."imdb_id" IN (SELECT U0."imdb_id"
FROM "movie" U0
INNER JOIN "moviegenre" U1
ON ( U0."id" = U1."movie_id" )
WHERE ( U0."last_ingested_on" IS NOT NULL
AND NOT ( U0."imdb_id" IN
( 'tt0169547' ) )
AND NOT ( U0."imdb_id" IN
( 'tt0169547' ) )
AND U1."genre_id" IN ( 2, 10 ) ))
AND "moviegenre"."genre_id" IN ( 2, 10 ) )
The problem is that I'll get results in the format:
[
    ('imdbid22', 'genreid1'),
    ('imdbid22', 'genreid2'),
    ('imdbid44', 'genreid1'),
    ('imdbid55', 'genreid8'),
]
Is there a way, within the query itself, to group all of the genre ids into a list under the movie.imdb_id's? I'd like to do the grouping in the query.
Currently I'm doing it in my web app code (Python), which is extremely slow when 50k+ rows are returned.
[
    ('imdbid22', ['genreid1', 'genreid2']),
    ('imdbid44', 'genreid1'),
    ('imdbid55', 'genreid8'),
]
thanks in advance!
edit: here's the Python code which runs against the current results:
results_list = []
for item in movies_and_genres:
    genres_in_common = len(set([
        i['genre__id'] for i in movies_and_genres
        if i['movie__imdb_id'] == item['movie__imdb_id']
    ]))
    imdb_id = item['movie__imdb_id']
    if genres_in_common >= min_in_common:
        result_item = {
            'movie.imdb_id': imdb_id,
            'count': genres_in_common
        }
        if result_item not in results_list:
            results_list.append(result_item)
return results_list
select m.imdb_id, array_agg(g.genre_id) as genre_id
from moviegenre g
inner join movie m on g.movie_id = m.id
where m.last_ingested_on is not null
  and not m.imdb_id in ('tt0169547')
  and not m.imdb_id in ('tt0169547')
  and g.genre_id in (2, 10)
group by m.imdb_id
array_agg will create an array of all the genre_ids of a certain imdb_id:
http://www.postgresql.org/docs/current/interactive/functions-aggregate.html#FUNCTIONS-AGGREGATE-TABLE
I hope this Python code will be fast enough:
movielist = [
    ('imdbid22', 'genreid1'),
    ('imdbid22', 'genreid2'),
    ('imdbid44', 'genreid1'),
    ('imdbid55', 'genreid8'),
]
genres = {}
for movie, genre in movielist:
    # start a new list on first sight, append thereafter
    # (list.append returns None, so don't assign its result)
    if movie not in genres:
        genres[movie] = [genre]
    else:
        genres[movie].append(genre)
print(genres)
Output:
{'imdbid44': ['genreid1'], 'imdbid55': ['genreid8'], 'imdbid22': ['genreid1', 'genreid2']}
If you just need (movie name, count) pairs, change the SELECT list and GROUP BY in the original query like this and you won't need the Python code:
SELECT "movie"."imdb_id", count("moviegenre"."genre_id")
GROUP BY "movie"."imdb_id"
