Variables in a Postgres view? - python

I have a view in Postgres which queries a master table (150 million rows) and retrieves data from the prior day (a function which returns SELECT yesterday; it was the only way to get the view to respect my partition constraints) and then joins it with two dimension tables. This works fine, but how would I loop through this query in Python? Is there a way to make the date dynamic?
for date in date_range('2016-06-01', '2017-07-31'):
    (query from the view, replacing the date with the date in the loop)
My workaround was to literally copy and paste the entire view as a huge select statement format string, and then pass in the date in a loop. This worked, but it seems like there must be a better solution to utilize an existing view or to pass in a variable which might be useful in the future.

To loop day by day over the interval in a for loop, you could do something like:
import datetime
initialDate = datetime.datetime(2016, 6, 1)
finalDate = datetime.datetime(2017, 7, 31)
# + 1 so that the final date itself is included
for day in range((finalDate - initialDate).days + 1):
    current = (initialDate + datetime.timedelta(days=day)).date()
    print("query from the view, replacing the date with " + current.strftime('%m/%d/%Y'))
Replace the print with the call to your query. If the dates are strings, you can parse them first:
initialDate = datetime.datetime.strptime("06/01/2016", '%m/%d/%Y')
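To make the date dynamic rather than pasting it into the SQL text, one option is to keep a single parameterized statement and pass each day as a parameter. The following is only a sketch: the name my_report and the idea of exposing the view's SELECT as something that accepts a date are assumptions, not part of the original setup, and the %s placeholder assumes a psycopg2-style DB-API driver.

```python
import datetime

def iter_days(first, last):
    """Yield every date from first to last, inclusive."""
    for offset in range((last - first).days + 1):
        yield first + datetime.timedelta(days=offset)

# Hypothetical statement: my_report is an assumed name standing in for
# whatever parameterized form of the view's SELECT you end up with.
SQL = "SELECT * FROM my_report(%s)"

def daily_queries(first, last):
    """Pair the SQL text with one parameter tuple per day."""
    return [(SQL, (day,)) for day in iter_days(first, last)]

queries = daily_queries(datetime.date(2016, 6, 1), datetime.date(2016, 6, 3))
for sql, params in queries:
    # With a DB-API cursor this would be: cursor.execute(sql, params)
    print(sql, params)
```

Passing the date as a parameter leaves quoting to the driver, so the same statement can be reused for every day in the range.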

Related

Group data by month, if the month is in a date range of two fields in Django

I have a contract model containing a start and end datetime field. I want to show in a graph how many contracts are active per month (the month is between start and end time).
How can I get this information without multiple database requests per month?
I can annotate it for each field like this
start_month_contracts = contracts.annotate(
    start_month=TruncMonth("start")
) \
    .values("start_month") \
    .annotate(count=Count("start_month"))
end_month_contracts = contracts.annotate(
    end_month=TruncMonth("end")
) \
    .values("end_month") \
    .annotate(count=Count("end_month"))
but how do I combine both to get the active contracts per month?
Suppose you have the following model with start and end dates:
class Contract(models.Model):
    ...
    start = models.DateTimeField()
    end = models.DateTimeField()
Basic query for "active" contracts in a month
The basic formula is as you stated:
the month is between start and end time
A query can get us this for any given month...
# Get active contracts for December 2020
month = datetime.datetime(2020, 12, 1)
# all Contract records active in december
qs = Contract.objects.filter(start__lte=month, end__gte=month)
# Or, since we just care about the count, we can use `.count()` instead:
december_active_count = Contract.objects.filter(start__lte=month, end__gte=month).count()
If you find you need to tweak the basic query, that's fine. What matters in this explanation is the method, which carries over regardless of what the query happens to be.
Multiple counts in a single query
There are a few ways you can do a single query and chart out your contracts...
Counting records in the django application
A simple naïve approach is to pull all the relevant contracts first in a single query, then count them for each month in Python...
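A minimal sketch of that counting step, assuming each record exposes start and end attributes. The Row namedtuple below is only a stand-in for model instances so the example runs without Django; with the ORM, contracts would come from one Contract.objects.filter(...) call covering the whole charted range.

```python
import datetime
from collections import namedtuple

def count_active(contracts, month_starts):
    """Count, per month start, the contracts with start <= month <= end."""
    counts = {month: 0 for month in month_starts}
    for contract in contracts:
        for month in month_starts:
            if contract.start <= month <= contract.end:
                counts[month] += 1
    return counts

# Stand-in rows (hypothetical data) in place of Contract instances.
Row = namedtuple("Row", ["start", "end"])
contracts = [
    Row(datetime.datetime(2020, 10, 15), datetime.datetime(2020, 12, 20)),
    Row(datetime.datetime(2020, 11, 20), datetime.datetime(2021, 1, 5)),
]
month_starts = [datetime.datetime(2020, 11, 1), datetime.datetime(2020, 12, 1)]
counts = count_active(contracts, month_starts)
print(counts)
```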
This works fine, but there are a few potential problems:
The DB will send data for each record. If you have many records, the number of bytes the DB has to send could get excessive.
While the calculations here are fairly lightweight, Python still needs some CPU time to crunch these numbers for every record, which could take a while if there are many records.
Really, we probably want to have the DB do the counting for us.
Counting on the database
If you wanted to handle this on the database, rather than in Python, you can develop a query to do aggregations DB-side using .aggregate. The benefit here is that the DB only has to transmit the counts, rather than all the records, which is a significantly smaller number of bytes. It also offloads some number crunching from your app to the DB.
Extending on the first example, let's try to get the counts for more than 1 month in a single query. We do this by using aggregate along with the Count aggregation function.
import datetime
from django.db.models import Count, Q
november = datetime.datetime(2020, 11, 1)
december = datetime.datetime(2020, 12, 1)
contract_counts = Contract.objects.aggregate(
    november_counts=Count('pk', filter=Q(start__lte=november, end__gte=november)),
    december_counts=Count('pk', filter=Q(start__lte=december, end__gte=december)),
)
print(contract_counts)
# {'november_counts': 376, 'december_counts': 393}  <-- output
We can apply this same principle to get the counts for all months over a specified time range. To do this, we pre-determine each month between start and end that will be counted and build a Count with a Q filter for each of those months.
Really, this is now just a matter of generating the keyword arguments like above, but dynamically.
I'll also create a custom manager for this model, to make the interface a little nicer.
import calendar
from django.db import models
from django.db.models import Count, Q
class ContractManager(models.Manager):
    def month_counts(self, start, end):
        qs = self.get_queryset()
        # generate keyword arguments for .aggregate
        aggregations = {}
        for month in months(start, end):  # the start of each month in the range
            month_name = calendar.month_name[month.month]
            aggregation_name = f'{month_name}_{month.year}'
            aggregations[aggregation_name] = Count(
                'pk', filter=Q(start__lte=month, end__gte=month)
            )
        return qs.aggregate(**aggregations)
class Contract(models.Model):
    start = models.DateTimeField()
    end = models.DateTimeField()
    objects = ContractManager()
You can then produce the counts like so:
start = datetime.datetime(2020, 1, 1)
end = datetime.datetime(2021, 1, 1)
print(Contract.objects.month_counts(start, end))
The output of this might look something like this:
{'January_2020': 2,
'February_2020': 90,
'March_2020': 163,
'April_2020': 234,
'May_2020': 272,
'June_2020': 284,
'July_2020': 284,
'August_2020': 275,
'September_2020': 247,
'October_2020': 205,
'November_2020': 128,
'December_2020': 68,
'January_2021': 3}
You can also see only 1 query is used:
from django.db import connection
print(len(connection.queries))
# 1
Final thoughts and notes
I should mention that this is not the most efficient way to do this, and there's a lot of room for optimization. You could probably also generate the month intervals on the DB side, instead of in Python, if you wanted. Specific backends may have more performant options available, too, like the daterange functions of Postgres. Still, what we have here should provide enough context for using aggregate to get the counts you want.
I can annotate it for each field like this
I don't think your code here gets you the counts you really want. You're counting the number of contracts that either started or ended in a particular month... but this won't be able to tell you how many contracts were active in any single given month.
P.S.
I omitted the code for the months() function above for brevity. The code can be found here if you're interested. Something like pandas might be more performant, though it shouldn't be a concern, unless your time intervals go over thousands of years :-)
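For reference, here is one way such a months() helper could look. This is a sketch written for illustration, not the code the link pointed to; it assumes start and end are datetime objects and yields the first day of every month from start's month through end's month.

```python
import datetime

def months(start, end):
    """Yield the first day of each month from start's month through end's month."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield datetime.datetime(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1

result = list(months(datetime.datetime(2020, 1, 1), datetime.datetime(2021, 1, 1)))
print(len(result), result[0], result[-1])
```

With the range used above this gives 13 month starts, January 2020 through January 2021, which matches the 13 keys in the sample output.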

How to Loop Through Dates in Python to pass into PostgreSQL query

I have 2 date variables which I pass into a SQL query via Python. It looks something like this:
start = '2019-10-01'
finish = '2019-12-22'
code_block = '''select sum(revenue) from table
where date between '{start}' and '{finish}'
'''.format(start = start, finish = finish)
That gets me the data I want for the current quarter, however I want to be able to loop through this same query for the previous 5 quarters. Can someone help me figure out a way so that this runs for the current quarter, then updates both start and finish to previous quarter, runs the query, and then keeps going until 5 quarters ago?
Consider adding a year and quarter grouping to an aggregate SQL query and avoiding the Python loop altogether. Use a date difference of 15 months (i.e., 5 quarters), and even NOW() for the end date. Also, use parameterization (supported in pandas) rather than string formatting for dynamic querying.
import pandas as pd
code_block = '''select concat(date_part('year', date)::text,
                              'Q', date_part('quarter', date)::text) as yyyyqq,
                       sum(revenue) as sum_revenue
                from table
                where date between (%s::date - INTERVAL '15 MONTHS') and NOW()
                group by date_part('year', date),
                         date_part('quarter', date)
'''
# myconn is an already opened database connection
df = pd.read_sql(code_block, myconn, params=[start])
If you still need separate quarterly data frames, use groupby to build a dictionary of data frames for the 5 quarters.
# DICTIONARY OF QUARTERLY DATA FRAMES
# group by the column name (not a one-item list) so the keys are plain strings
df_dict = {i: g for i, g in df.groupby('yyyyqq')}
df_dict['2019Q4'].head()
df_dict['2019Q3'].tail()
df_dict['2019Q2'].describe()
...
Just define a list with start dates and a list with finish dates and loop through them with:
for date_start, date_finish in zip(start_list, finish_list):
    start = date_start
    finish = date_finish
    # here you insert the query
Hope this is what you are looking for =)
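The start_list and finish_list used in that loop could be generated instead of typed out. The sketch below is one possible way, using hypothetical helper names (quarter_bounds, recent_quarters); it returns full calendar quarters, so the current quarter ends on its last day rather than on today's date as in the question.

```python
import datetime

def quarter_bounds(year, quarter):
    """Return (first_day, last_day) of the given calendar quarter (1-4)."""
    first_month = 3 * (quarter - 1) + 1
    first_day = datetime.date(year, first_month, 1)
    # First day of the following quarter, minus one day.
    if quarter == 4:
        next_first = datetime.date(year + 1, 1, 1)
    else:
        next_first = datetime.date(year, first_month + 3, 1)
    return first_day, next_first - datetime.timedelta(days=1)

def recent_quarters(today, count):
    """Bounds of the quarter containing today and the ones before it, newest first."""
    year, quarter = today.year, (today.month - 1) // 3 + 1
    bounds = []
    for _ in range(count):
        bounds.append(quarter_bounds(year, quarter))
        quarter -= 1
        if quarter == 0:
            year, quarter = year - 1, 4
    return bounds

# Current quarter plus the 5 before it.
ranges = recent_quarters(datetime.date(2019, 12, 22), 6)
start_list = [start.isoformat() for start, _ in ranges]
finish_list = [finish.isoformat() for _, finish in ranges]
print(list(zip(start_list, finish_list)))
```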

timedelta - most elegant way to pass 'days=-5' from string

I am trying to call a function that triggers a report to be generated with a starting date that is either hours or days ago. The code below works fine, but I would like to store the timedelta offset in a MySQL database.
starting_date = datetime.today() - timedelta(days=-5)
I had hoped to store 'days=-5' in the database, extract that database column to variable 'delta_offset' and then run
starting_date = datetime.today() - timedelta(delta_offset)
It doesn't like this because delta_offset is a string. I know I could modify the function to just include the offset and store -5 in my database, like what is below. But I really wanted to store days=-5 in the database because my offset can be hours as well. I could make my offset in the database always hours and store -120, but I was wondering if there is an elegant way to store 'days=-5' in the database without causing type issues.
starting_date = datetime.today() - timedelta(days=delta_offset)
Instead of storing 'days=-5' in your database as a single column, you could break this into two columns named 'value' and 'unit' or similar.
Then you can pass these to timedelta by building a dictionary and unpacking it, like so:
unit = 'days'
value = -5
starting_date = datetime.today() - timedelta(**{unit: value})
This will unpack the dictionary so you get the same result as doing timedelta([unit]=value).
Alternatively, if you really would like to keep 'days=-5' as a value of a single column in your database, you could split the string on '=' then take a similar approach. Here's how:
offset = 'days=-5'
unit, value = offset.split('=')
starting_date = datetime.today() - timedelta(**{unit: int(value)})
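Since the string comes from a database column, it may be worth checking the unit before unpacking it into timedelta. The following is a sketch of that idea; parse_offset and ALLOWED_UNITS are names made up for this example, and the set lists the keyword arguments that timedelta accepts.

```python
from datetime import datetime, timedelta

# Keyword arguments that timedelta accepts.
ALLOWED_UNITS = {"days", "seconds", "microseconds", "milliseconds",
                 "minutes", "hours", "weeks"}

def parse_offset(text):
    """Turn a string such as 'days=-5' into a timedelta."""
    unit, sep, value = text.partition("=")
    unit = unit.strip()
    if not sep or unit not in ALLOWED_UNITS:
        raise ValueError(f"unsupported offset: {text!r}")
    return timedelta(**{unit: int(value)})

offset = parse_offset("hours=-120")
starting_date = datetime.today() - offset
print(offset, starting_date)
```

An unknown unit such as 'years=1' then raises a ValueError with the offending string, instead of a TypeError from inside timedelta.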
I would do it this way:
delta_offset_split = delta_offset.split("=")
kwargs = {delta_offset_split[0]: int(delta_offset_split[1])}
starting_date = datetime.today() - timedelta(**kwargs)

Applying an arbitrary sort to Django ORM querysets

I have a database table containing sets of items representing yearly recurring events. The record sets are stored by month and day. I often need to retrieve the event records corresponding to a range of calendar dates. I'm using the Django ORM to work with the records, so at present I convert the dates to corresponding Q objects (e.g. Q(month=month, day=day)) and OR them together in the call to MyModel.objects.filter().
The problem comes if my date range intersects the new year. If I want the events from Dec. 31, 2013 to Jan 1, 2014, I do something like:
MyModel.objects.filter(Q(month=12, day=31) | Q(month=1, day=1))
But I get my results in the order:
month = 1, day = 1
month = 12, day = 31
Instead, I would like to get my results in the order:
month = 12, day = 31
month = 1, day = 1
For reasons that would unnecessarily complicate the question, I can't simply partition the query into two queries, one for each year. I would like to make one query and get the results in the desired order. I can reformulate the query, if necessary.
I know that extra should be useful, but I don't quite see how to use it.
Update:
To head a little closer to the intended solution, to impose an absolute ordering I could somehow slip the Julian Day into the results as a "calculated" field, and order by that field. But how to do that?
I have a tricky solution using extra and a SQL CASE statement:
start_month = 12
start_day = 31
end_month = 1
end_day = 1
query = (models.MyModel.objects.filter(Q(month=start_month, day=start_day) |
                                       Q(month=end_month, day=end_day))
         .extra(select={'order_me': '''CASE WHEN month*31+day < %s*31+%s
                                            THEN (12+month)*31+day
                                            ELSE (month)*31+day
                                       END''' % (start_month, start_day)})
         .extra(order_by=['order_me']))
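To see what the CASE expression does, here is the same arithmetic as a plain Python sort key (order_key is a name chosen for this illustration). Each date maps to month*31+day, and any date that falls before the start date is pushed 12 months ahead so it sorts after the end of the year.

```python
def order_key(month, day, start_month, start_day):
    """Mirror the CASE expression: dates before the start wrap into next year."""
    key = month * 31 + day
    if key < start_month * 31 + start_day:
        key += 12 * 31
    return key

events = [(1, 1), (12, 31), (6, 15)]
ordered = sorted(events, key=lambda md: order_key(md[0], md[1], 12, 31))
print(ordered)
```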
Now that the order_me field exists (which is not pretty), I think it should also be used in the predicate for the date range, instead of Q(...) | Q(...):
query = (models.MyModel.objects.all()
         .extra(select={'order_me': """CASE WHEN month*31+day < %s*31+%s
                                            THEN (12+month)*31+day
                                            ELSE (month)*31+day
                                       END""" % (start_month, start_day)})
         .extra(order_by=['order_me'])
         .extra(where=['order_me < (12 + %s) * 31 + %s' % (end_month,
                                                           end_day)]))

Django averaging over a date range

I have a simple uptime monitoring app built in Django which checks if a site doesn't respond to a ping request. It stores a timestamp of when it was pinged in the column "dt" and the millisecond response time in the column "ms". The site is pinged every minute and an entry is put into the database.
The Django model looks like this:
class Uptime(models.Model):
    user = models.ForeignKey(User)
    dt = models.DateTimeField()
    ms = models.IntegerField()
I'd like to grab one day at a time from the dt column and take an average of the ms response time for that day. Even though there are 1440 entries per day, I'd like to just grab the day (e.g. 4-19-2013) and get the average ms response time. The code I have been using below is undeniably wrong, but I feel like I'm on the right track. How could I get this to work?
output_date = Uptime.objects.filter(user = request.user).values_list('dt', flat = True).filter(dt__range=(startTime, endTime)).order_by('dt').extra(select={'day': 'extract( day from dt )'}).values('day')
output_ms = Uptime.objects.filter(user = request.user).filter(dt__range=(startTime, endTime)).extra(select={'day': 'date( dt )'}).values('day').aggregate(Avg('ms'))
Thanks!
You need to annotate to do the GROUP BY. Django doesn't have anything in the ORM for extracting only dates, so you need to add an "extra" parameter to the query, which tells it to use only the date part. You then select only those values and annotate.
Try the following:
Uptime.objects.filter(user=request.user).extra({'_date': 'Date(dt)'}).values('_date').annotate(avgMs=Avg('ms'))
This should give you a list like follows:
[{'_date': datetime.date(2012, 7, 5), 'avgMs': 300},{'_date': datetime.date(2012, 7, 6), 'avgMs': 350}]
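To make the grouping concrete, here is a pure-Python sketch of what that query asks the database to compute: bucket the rows by calendar date, then average ms within each bucket. The sample rows are invented for the illustration, and daily_average is not part of Django.

```python
import datetime
from collections import defaultdict

def daily_average(rows):
    """rows: iterable of (dt, ms) pairs; returns {date: mean ms}."""
    buckets = defaultdict(list)
    for dt, ms in rows:
        buckets[dt.date()].append(ms)
    return {day: sum(values) / len(values) for day, values in buckets.items()}

rows = [
    (datetime.datetime(2013, 4, 19, 0, 0), 300),
    (datetime.datetime(2013, 4, 19, 0, 1), 320),
    (datetime.datetime(2013, 4, 20, 0, 0), 350),
]
averages = daily_average(rows)
print(averages)
```

Doing this in the database, as in the query above, avoids pulling 1440 rows per day into Python just to average them.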
