Event Extraction from Recurrence Pattern


I've been working on an event-based AJAX application that stores recurring events in a table in the following format (Django models):
event_id = models.CharField(primary_key=True, max_length=24)
# start_date - the start date of the first event in a series
start_date = models.DateTimeField()
# the end date of the last event in a series
end_date = models.DateTimeField()
# Length of the occurrence
event_length = models.BigIntegerField(null=True, blank=True, default=0)
rec_type = models.CharField(max_length=32)
The rec_type stores data in the following format:
[type]_[count]_[day]_[count2]_[days]#[extra]
type - the type of repetition: 'day', 'week', 'month', 'year'.
count - the interval between events in the “type” units.
day and count2 - define a day of a month ( first Monday, third Friday, etc ).
days - the comma-separated list of affected week days.
extra - the extra info that can be used to change presentation of recurring details.
For example:
day_3___ - every three days
month_2___ - every two months
month_1_1_2_ - second Monday of each month
week_2___1,5 - Monday and Friday of every second week
This works fine and allows many events to be transmitted concisely, but I now have a requirement to extract all events that occur during a given range, for example on a specific date, week or month, and I am a bit lost as to how best to approach it.
In particular, I am stuck on how to check whether an event with a given recurrence pattern is eligible to be in the results.
What is the best approach here?

Personally, I'd store an rrule object from python-dateutil (http://labix.org/python-dateutil) rather than inventing your own recurrence format. Then you can just define some methods that use rrule.between(after, before) to generate instances of your event object for a given range.
One catch, though: dateutil's rrule object doesn't pickle correctly, so you should define your own mechanism for serialising the object to the database. I've generally gone with a JSON representation of the keyword arguments for instantiating the rrule. The annoying edge case is that if you want to store stuff like '2nd Monday of the month', you have to do additional work with MO(2), because the value it returns isn't useful. It's hard to explain, but you'll see the problem when you try it.
I'm not aware of any efficient way to find all eligible events within a range, though; you'll have to load all the Event models that potentially overlap with the range. So you'll always be loading potentially more data than you'll eventually use. Just make sure you're relatively smart about it to reduce the burden. Short of someone adding recurrence handling to databases themselves, I'm not aware of any way to improve this.
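If switching to rrule is not an option, the stored rec_type can also be expanded by hand. Here is a minimal sketch using only the standard library. It only handles the 'day' type, the function names parse_rec_type and daily_occurrences are made up for illustration, and it assumes that end_date bounds the start of the last occurrence and that only occurrence starts (not event_length overlaps) matter:

```python
from datetime import datetime, timedelta


def parse_rec_type(rec_type):
    """Split '[type]_[count]_[day]_[count2]_[days]#[extra]' into a dict."""
    pattern, _, extra = rec_type.partition('#')
    # Pad so that short patterns still unpack into five fields.
    kind, count, day, count2, days = (pattern.split('_') + [''] * 5)[:5]
    return {
        'type': kind,
        'count': int(count) if count else 1,
        'day': int(day) if day else None,
        'count2': int(count2) if count2 else None,
        'days': [int(d) for d in days.split(',')] if days else [],
        'extra': extra,
    }


def daily_occurrences(start_date, end_date, rec_type, range_start, range_end):
    """Yield occurrence starts of a 'day' series inside [range_start, range_end)."""
    rule = parse_rec_type(rec_type)
    if rule['type'] != 'day':
        raise NotImplementedError('this sketch only expands the day type')
    if rule['count'] < 1:
        raise ValueError('the interval must be at least one day')
    step = timedelta(days=rule['count'])
    current = start_date
    if range_start > start_date:
        # Jump to the first occurrence at or after range_start (ceiling division).
        skipped = -((start_date - range_start) // step)
        current = start_date + skipped * step
    # Assumption: end_date bounds the start of the last occurrence.
    while current <= end_date and current < range_end:
        yield current
        current += step
```

An event is eligible for a range when the generator yields at least one value. The 'week' and 'month' types would need the same treatment with their own stepping rules, which is exactly the work rrule saves you.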

Related

Django datetime.date.min and Mysql DATE minimum value is different

Actual Question
How can I deal with the different minimum dates of Django's DateField and MySQL's DATE type? What can be done about the issue?
TL;DR
datetime.date.min is lower than the minimum value MySQL supports for the DATE type. Saving datetime.date.min to MySQL is possible, but it is not guaranteed to work.
Context
I'm working with Django (v1.11) model:
class MyModel(models.Model):
    begin_date = models.DateField(null=True, blank=True)
    end_date = models.DateField(null=True, blank=True)
Currently the fields with NULL value designate the lower and higher date range limits respectively.
Example:
Date range from beginning of time to today will be:
mymodel.begin_date = None
mymodel.end_date = datetime.date(2019, 11, 20)
mymodel.save()
Using None/NULL creates a need to convert it to datetime.date.min/datetime.date.max for any sorting operations etc. Simply put: it's inconvenient and adds unnecessary complexity.
My thought was to do the necessary migrations and start storing datetime.date.min/datetime.date.max instead of None/NULL.
But there is a problem:
MySQL has a different min/max date range
According to the MySQL docs:
The DATE type is used for values with a date part but no time part.
MySQL retrieves and displays DATE values in 'YYYY-MM-DD' format. The
supported range is '1000-01-01' to '9999-12-31'.
According to the Python docs:
The smallest year number allowed in a date or datetime object. MINYEAR is 1.
The largest year number allowed in a date or datetime object. MAXYEAR is 9999.
The earliest representable date, date(MINYEAR, 1, 1).
The latest representable date, date(MAXYEAR, 12, 31).
It turns out that in Django you can still use datetime.date.min and it will be saved to the database. But it is not guaranteed to work.
That is because of this note in the MySQL docs:
For the DATE and DATETIME range descriptions, “supported” means that although earlier values might work, there is no guarantee.
My idea
I'm thinking about doing all the necessary migrations, cleaning the None conversions out of the code, and just using Field.default, set to default=datetime.date(1000, 1, 1) (the minimum date MySQL supports).

I think your idea will work fine.
Note that it's database-specific, though. The more ungainly approach of using None does have the virtue of working across databases.
Also note that apparently MySQL lets you store dates below that ostensible minimum value. Therefore, there could be dates in the database that will not be caught by your filter. So be sure to validate dates before you store them to make sure they are at or above the minimum.
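The validation could look roughly like this. It's a plain-Python sketch, and the name check_mysql_date and the two constants are my own, not anything Django or MySQL provides. The same comparison could be wrapped in a Django field validator that raises ValidationError instead:

```python
import datetime

# Bounds of the range MySQL documents as supported for the DATE type.
MYSQL_MIN_DATE = datetime.date(1000, 1, 1)
MYSQL_MAX_DATE = datetime.date(9999, 12, 31)


def check_mysql_date(value):
    """Return value unchanged, or raise ValueError if it is outside the supported range."""
    if not MYSQL_MIN_DATE <= value <= MYSQL_MAX_DATE:
        raise ValueError(
            '%s is outside the supported range %s to %s'
            % (value, MYSQL_MIN_DATE, MYSQL_MAX_DATE)
        )
    return value
```

Calling it just before saving (or from the model's clean method) keeps datetime.date.min and other pre-1000 values out of the table.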

return datetimes in the active timezone with a django query

I am trying to retrieve the rows from the last n hours of a table and print their datetimes in a given timezone. I am trying to use activate to make Django return the datetimes in that timezone, but it returns them as UTC.
Here is my current code:
min_time = datetime.datetime.now(link.monitor.timezone) - datetime.timedelta(hours=period)
timezone.activate(link.monitor.timezone)
rows = TraceHttp.objects.values_list('time', 'elapsed').filter(time__gt=min_time,link_id=link_id,elapsed__gt=0)
array = []
for row in rows:
    array.append((row[0].astimezone(link.monitor.timezone), row[1]))
I want to avoid using the astimezone function and make Django do this for me. Is there something I'm missing about the activate function?
EDIT
Here are my models, as you can see the timezone to display is saved on the "monitor" model:
class Link(models.Model):
    ...
    monitor = models.ForeignKey(Monitor)
    ...
class Monitor(models.Model):
    ...
    timezone = TimeZoneField(default='Europe/London')
class TraceHttp(models.Model):
    link = models.ForeignKey(Link)
    time = models.DateTimeField()
    elapsed = models.FloatField()

After some research I noticed that Django always returns datetimes as UTC, and it's up to you to interpret them in the correct timezone, either by using the datetime.astimezone(timezone) method or by activating a certain timezone.
Django's activate function just changes the way the datetime is rendered in a template; it doesn't convert the datetime objects that a query returns.
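So the explicit conversion stays in your code. A small standard-library sketch of it follows. The rows are hypothetical stand-ins for what values_list('time', 'elapsed') returns, and a fixed UTC+01:00 offset stands in for the monitor's timezone:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rows, shaped like the (time, elapsed) tuples from values_list().
rows = [
    (datetime(2019, 1, 1, 0, 30, tzinfo=timezone.utc), 0.25),
    (datetime(2019, 1, 1, 12, 0, tzinfo=timezone.utc), 0.40),
]

# Assumption: a fixed offset is enough for the example; a real zone has DST rules.
display_tz = timezone(timedelta(hours=1))

# astimezone() keeps the instant and only changes how it is expressed.
localized = [(moment.astimezone(display_tz), elapsed) for moment, elapsed in rows]
```

The converted values compare equal to the originals, since only the offset they are expressed in changes.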

If you ever find yourself calling timezone.localtime(dt_value) or dt_value.astimezone(tzinfo) in a loop a few million times to work out the current date in your timezone, the best approach as of 1.10 <= django.VERSION <= 2.1 is likely to use django.db.models.functions.Trunc and related functions, i.e. use a queryset like:
from django.db.models.functions import Trunc, TruncDate
qs = MyModel.objects.filter(...).values(
    'dtime',
    ...,
    dtime_at_my_tz=Trunc('dtime', 'second', tzinfo=yourtz),
    date_at_my_tz=TruncDate('dtime', tzinfo=yourtz),
    month=TruncDate(Trunc('dtime', 'month', tzinfo=yourtz)),
    quarter=TruncDate(Trunc('dtime', 'quarter', tzinfo=yourtz))
)
This will return datetimes or dates in the right timezone. You can use other Trunc* functions as shorthand. TruncDate is especially useful if all you need are datetime.date objects.
This offloads date calculations to the database, usually with a big reduction in code complexity and an increase in speed (in my case, over 6.5 million timezone.localtime(ts) calls were contributing 25% of total CPU time).
Note on TruncMonth and timezones
A while ago I found that I couldn't get 'proper' months out of TruncMonth or TruncQuarter: a January 1st would become a December 31st.
TruncMonth uses the currently active timezone, so (correctly) a datetime of 2019-01-01T00:00:00Z gets converted to the previous day for any timezone that has a positive offset from UTC (Western Europe and everywhere further East).
If you're only interested in the 'pure month' of an event datetime (and you probably are if you're using TruncMonth), this isn't helpful. However, if you call timezone.activate(timezone.utc) before executing the query (that is, before evaluating your QuerySet), you'll get the intended result. Keep in mind that events that occurred between your midnight and UTC's midnight will then fall under the previous month (and in the same way, datetimes from your timezone's midnight to UTC's midnight will be converted to the 'wrong' month).

You can use the now() function from django.utils.timezone, but you need to set two variables in settings: USE_TZ and TIME_ZONE, the first to True and the second to the default timezone that will be used to generate the datetime.
You can find more information in the Django documentation.

Django queryset filtering by ISO week number

I have a model that contains a DateField. I'm trying to get a queryset of that model containing the entries from the current week (starting on Monday).
Since a Django DateField holds a plain datetime.date, I assumed I could filter using .isocalendar(). Logically it's exactly what I want, without any extra comparisons or calculations based on the current weekday.
So what I essentially want is to force the .filter statement to follow this logic:
if model.date.isocalendar()[1] == datetime.date.today().isocalendar()[1]
...
But how do I write that inside a filter statement?
.filter(model__date__isocalendar=datetime.date.today().isocalendar()) will give wrong results (the same as comparing to today, not this week).
Digging through http://docs.python.org/library/datetime.html, I have not noticed any other week-related options...
Note from documentation:
date.isocalendar() Return a 3-tuple, (ISO year, ISO week number, ISO
weekday).
Update:
Although I disliked the solution of using ranges, it's the best option.
In my case, however, I made a variable that marks the beginning of the week and just check for values greater than or equal to it, because I'm only looking for matches in the current week. If a week number were given instead, both ends would be required.
today = datetime.date.today()
monday = today - datetime.timedelta(days=today.weekday())
... \
.filter(date__gte=monday)

You're not going to be able to do this. Remember it's not just an issue of what Python supports, Django has to communicate the filter to the database, and the database doesn't support such complex date calculations. You can use __range, though, with a start date and end date.
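Computing the two ends of the range takes only a few lines. This is a sketch with helper names of my own choosing; date.fromisocalendar needs Python 3.8 or newer:

```python
import datetime


def iso_week_bounds(iso_year, iso_week):
    """Return (monday, sunday) of the given ISO week."""
    monday = datetime.date.fromisocalendar(iso_year, iso_week, 1)
    return monday, monday + datetime.timedelta(days=6)


def current_week_bounds(today=None):
    """Return (monday, sunday) of the week containing today."""
    today = today or datetime.date.today()
    monday = today - datetime.timedelta(days=today.weekday())
    return monday, monday + datetime.timedelta(days=6)
```

Both ends are inclusive, so the tuple can be passed straight to a __range lookup on a DateField, e.g. .filter(date__range=current_week_bounds()).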

Even simpler than using the Extract function that Amit mentioned in his answer is the __week field lookup added in Django 1.11, so you can simply do:
.filter(model__date__week=datetime.date.today().isocalendar()[1])

ExtractWeek was introduced in Django 1.11 for filtering based on the ISO week number.
For Django 1.10 and lower, the following solution works for filtering by ISO week number on a Postgres database:
from django.db.models.functions import Extract
from django.db import models
@models.DateTimeField.register_lookup
class ExtractWeek(Extract):
    lookup_name = 'week'
Now query as follows:
queryset.annotate(week=ExtractWeek('date'))\
    .filter(week=week_number)

(This answer should only work for postgres, but might work for other databases.)
A quick and elegant solution for this problem would be to define these two custom transformers:
from django.db import models
from django.db.models.lookups import DateTransform
@models.DateTimeField.register_lookup
class WeekTransform(DateTransform):
    lookup_name = 'week'
@models.DateTimeField.register_lookup
class ISOYearTransform(DateTransform):
    lookup_name = 'isoyear'
Now you can query by week like this:
from django.utils.timezone import now
year, week, _ = now().isocalendar()
MyModel.objects.filter(created__isoyear=year, created__week=week)
Behind the scenes, the Django DateTransform object uses the Postgres EXTRACT function, which supports week and isoyear.

How to deal with "partial" dates (2010-00-00) from MySQL in Django?

In one of my Django projects that uses MySQL as the database, I need date fields that also accept "partial" dates, such as only a year (YYYY) or a year and month (YYYY-MM), in addition to a normal date (YYYY-MM-DD).
The date field in MySQL can deal with that by accepting 00 for the month and the day. So 2010-00-00 is valid in MySQL and represents 2010. The same goes for 2010-05-00, which represents May 2010.
So I started to create a PartialDateField to support this feature. But I hit a wall because MySQLdb, the Python driver for MySQL that Django uses by default, returns a datetime.date object for a date field, and datetime.date supports only real dates. It's possible to modify the converter MySQLdb uses for date fields so that it returns a string in the format 'YYYY-MM-DD'. Unfortunately, the converter is set at the connection level, so it's used for all MySQL date fields. And Django's DateField relies on the database returning a datetime.date object, so if I change the converter to return a string, Django is not happy at all.
Does anyone have an idea or advice on how to solve this problem? How can I create a PartialDateField in Django?
EDIT
I should also add that I already thought of two solutions: create three integer fields for year, month and day (as mentioned by Alison R.), or use a varchar field to keep the date as a string in the format YYYY-MM-DD.
But with both solutions, if I'm not wrong, I will lose the special properties of a date field, such as queries like "get all entries after this date". I could probably re-implement this functionality on the client side, but that is not a valid solution in my case because the database can be queried from other systems (mysql client, MS Access, etc.).

First, thanks for all your answers. None of them, as is, was a good solution for my problem, but, in your defense, I should add that I didn't give all the requirements. Each one helped me think about my problem, and some of your ideas are part of my final solution.
My final solution, on the DB side, is to use a varchar field (limited to 10 chars) and store the date in it as a string in ISO format (YYYY-MM-DD), with 00 for the month and day when there's no month and/or day (like a date field in MySQL). This way, the field works with any database, and the data can be read, understood and edited directly and easily by a human using a simple client (mysql client, phpMyAdmin, etc.). That was a requirement. It can also be exported to Excel/CSV without any conversion. The disadvantage is that the format is not enforced (except in Django). Someone could write 'not a date' or make a mistake in the format and the DB will accept it (if you have an idea about this problem...).
This way it's also possible to do all of the special queries of a date field relatively easily. For queries with WHERE, the operators <, >, <=, >= and = work directly. IN and BETWEEN queries also work directly. To query by day or month you just have to use EXTRACT (DAY|MONTH ...). Ordering also works directly. So I think it covers all the query needs with almost no complication.
On the Django side, I did two things. First, I created a PartialDate object that looks mostly like datetime.date but supports dates without a month and/or day. Inside this object I use a datetime.datetime object to hold the date. I use the hours and minutes as flags that indicate, when set to 1, that the month and day are valid. It's the same idea that steveha proposed, but with a different implementation (and only on the client side). Using a datetime.datetime object gives me a lot of nice features for working with dates (validation, comparison, etc.).
Second, I created a PartialDateField that mostly deals with the conversion between the PartialDate object and the database.
So far, it works pretty well (I have mostly finished my extensive unit tests).
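For the format-enforcement gap mentioned above, a check along these lines could run wherever the string is written. This is only a sketch of the idea, not the PartialDateField from this answer, and the function name parse_partial_date is made up:

```python
import datetime
import re

_PARTIAL_DATE_RE = re.compile(r'(\d{4})-(\d{2})-(\d{2})')


def parse_partial_date(text):
    """Return (year, month, day); month and day are None when stored as 00."""
    match = _PARTIAL_DATE_RE.fullmatch(text)
    if match is None:
        raise ValueError('not a YYYY-MM-DD string: %r' % (text,))
    year, month, day = (int(group) for group in match.groups())
    if month > 12 or (month == 0 and day != 0):
        raise ValueError('invalid partial date: %r' % (text,))
    if day:
        # Let datetime.date reject impossible dates such as 2010-02-30.
        datetime.date(year, month, day)
    elif month:
        # Year and month only: check them against the first of the month.
        datetime.date(year, month, 1)
    return year, month or None, day or None
```

Because the strings are zero-padded and fixed-width, plain string comparison sorts them chronologically, with '2010-00-00' before '2010-05-00' before '2010-05-04', which is why the WHERE and ORDER BY queries described above work directly.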

You could store the partial date as an integer (preferably in a field named for the portion of the date you are storing, such as year, month or day) and do validation and conversion to a date object in the model.
EDIT
If you need real date functionality, you probably need real, not partial, dates. For instance, does "get everything after 2010-0-0" return dates inclusive of 2010 or only dates in 2011 and beyond? The same goes for your other example of May 2010. The ways in which different languages/clients deal with partial dates (if they support them at all) are likely to be highly idiosyncratic, and they are unlikely to match MySQL's implementation.
On the other hand, if you store a year integer such as 2010, it is easy to ask the database for "all records with year > 2010" and understand exactly what the result should be, from any client, on any platform. You can even combine this approach for more complicated dates/queries, such as "all records with year > 2010 AND month > 5".
SECOND EDIT
Your only other (and perhaps best) option is to store truly valid dates and come up with a convention in your application for what they mean. A DATETIME field named like date_month could have a value of 2010-05-01, but you would treat that as representing all dates in May, 2010. You would need to accommodate this when programming. If you had date_month in Python as a datetime object, you would need to call a function like date_month.end_of_month() to query dates following that month. (That is pseudocode, but could be easily implemented with something like the calendar module.)
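A possible implementation of that helper with the calendar module, written as a standalone function rather than a method (the name end_of_month is just the one used in the pseudocode above):

```python
import calendar
import datetime


def end_of_month(value):
    """Return the last day of the month that value falls in."""
    # monthrange() returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(value.year, value.month)[1]
    return value.replace(day=last_day)
```

"Everything after May 2010" then becomes a comparison against end_of_month(datetime.date(2010, 5, 1)).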

It sounds like you want to store a date interval. In Python this would (to my still-somewhat-noob understanding) most readily be implemented by storing two datetime.datetime objects, one specifying the start of the date range and the other specifying the end. In a manner similar to that used to specify list slices, the endpoint would not itself be included in the date range.
For example, this code would implement a date range as a named tuple:
>>> from datetime import datetime
>>> from collections import namedtuple
>>> DateRange = namedtuple('DateRange', 'start end')
>>> the_year_2010 = DateRange(datetime(2010, 1, 1), datetime(2011, 1, 1))
>>> the_year_2010.start <= datetime(2010, 4, 20) < the_year_2010.end
True
>>> the_year_2010.start <= datetime(2009, 12, 31) < the_year_2010.end
False
>>> the_year_2010.start <= datetime(2011, 1, 1) < the_year_2010.end
False
Or even add some magic:
>>> DateRange.__contains__ = lambda self, x: self.start <= x < self.end
>>> datetime(2010, 4, 20) in the_year_2010
True
>>> datetime(2011, 4, 20) in the_year_2010
False
This is such a useful concept that I'm pretty sure somebody has already made an implementation available. For example, a quick glance suggests that the relativedelta class from the dateutil package will do this, and more expressively, by allowing a 'years' keyword argument to be passed to the constructor.
However, mapping such an object into database fields is somewhat more complicated, so you might be better off implementing it simply by just pulling both fields separately and then combining them. I guess this depends on the DB framework; I'm not very familiar with that aspect of Python yet.
In any case, I think the key is to think of a "partial date" as a range rather than as a simple value.
edit
It's tempting, but I think inappropriate, to add more magic methods that will handle uses of the > and < operators. There's a bit of ambiguity there: does a date that's "greater than" a given range occur after the range's end, or after its beginning? It initially seems appropriate to use <= to indicate that the date on the right-hand side of the equation is after the start of the range, and < to indicate that it's after the end.
However, this implies equality between the range and a date within the range, which is incorrect, since it implies that the month of May 2010 is equal to the year 2010, because May the 4th, 2010 equates to both of them. I.e., you would end up with falsehoods like 2010-04-20 == 2010 == 2010-05-04 being true.
So probably it would be better to implement a method like isafterstart to explicitly check if a date is after the beginning of the range. But again, somebody's probably already done it, so it's probably worth a look on pypi to see what's considered production-ready. This is indicated by the presence of "Development Status :: 5 - Production/Stable" in the "Categories" section of a given module's pypi page. Note that not all modules have been given a development status.
Or you could just keep it simple, and using the basic namedtuple implementation, explicitly check
>>> datetime(2012, 12, 21) >= the_year_2010.start
True

Can you store the date together with a flag that tells how much of the date is valid?
Something like this:
YEAR_VALID = 0x04
MONTH_VALID = 0x02
DAY_VALID = 0x01
Y_VALID = YEAR_VALID
YM_VALID = YEAR_VALID | MONTH_VALID
YMD_VALID = YEAR_VALID | MONTH_VALID | DAY_VALID
Then, if you have a date like 2010-00-00, convert that to 2010-01-01 and set the flag to Y_VALID. If you have a date like 2010-06-00, convert that to 2010-06-01 and set the flag to YM_VALID.
So, then, PartialDateField would be a class that bundles together a date and the date-valid flag described above.
P.S. You don't actually need to use the flags the way I showed it; that's the old C programmer in me coming to the surface. You could use Y_VALID, YM_VALID, YMD_VALID = range(3) and it would work about as well. The key is to have some kind of flag that tells you how much of the date to trust.
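To make the conversion concrete, here is a rough sketch. The function name to_date_and_flags is hypothetical, and a 0 for month or day means "unknown", as in the MySQL convention from the question:

```python
import datetime

YEAR_VALID = 0x04
MONTH_VALID = 0x02
DAY_VALID = 0x01


def to_date_and_flags(year, month=0, day=0):
    """Turn a partial date into a real date plus a flag saying how much to trust."""
    flags = YEAR_VALID
    if month:
        flags |= MONTH_VALID
        if day:
            # A day only makes sense when the month is known too.
            flags |= DAY_VALID
    # Unknown parts are replaced by 1 so that the date is always valid.
    safe_day = day if flags & DAY_VALID else 1
    return datetime.date(year, month or 1, safe_day), flags
```

Both values would then be stored side by side, for example in a DateField and a small integer field.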

Although not in Python - here's an example of how the same problem was solved in Ruby - using a single Integer value - and bitwise operators to store year, month and day - with month and day optional.
https://github.com/58bits/partial-date
Look at the source in lib for date.rb and bits.rb.
I'm sure a similar solution could be written in Python.
To persist the date (sortable) you just save the Integer to the database.
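In Python the packing could look something like this. The bit layout (5 bits for the day, 4 for the month, the rest for the year) is my own assumption for the sketch and not necessarily what the Ruby library uses:

```python
def pack_partial_date(year, month=0, day=0):
    """Pack a partial date into one sortable integer; 0 means 'unknown'."""
    if not (0 <= month <= 12 and 0 <= day <= 31):
        raise ValueError('month or day out of range')
    # Layout: year in the high bits, then 4 bits of month, then 5 bits of day.
    return (year << 9) | (month << 5) | day


def unpack_partial_date(value):
    """Inverse of pack_partial_date(): return (year, month, day)."""
    return value >> 9, (value >> 5) & 0x0F, value & 0x1F
```

Since the year occupies the most significant bits, comparing the integers orders the dates chronologically, with a bare year sorting before any month within it.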

Counts of events grouped by date in python?

This is no doubt another noobish question, but I'll ask it anyways:
I have a data set of events with exact datetime in UTC. I'd like to create a line chart showing total number of events by day (date) in the specified date range. Right now I can retrieve the total data set for the needed date range, but then I need to go through it and count up for each date.
The app is running on google app engine and is using python.
What is the best way to create a new data set showing date and corresponding counts (including if there were no events on that date) that I can then use to pass this info to a django template?
Data set for this example looks like this:
class Event(db.Model):
event_name = db.StringProperty()
doe = db.DateTimeProperty()
dlu = db.DateTimeProperty()
user = db.UserProperty()
Ideally, I want something with date and count for that date.
Thanks and please let me know if something else is needed to answer this question!

You'll have to do the binning in-memory (i.e. after the datastore fetch).
The .date() method of a datetime instance will facilitate your binning; it chops off the time element. Then you can use a dictionary to hold the bins:
bins = {}
for event in Event.all().fetch(1000):
    bins.setdefault(event.doe.date(), []).append(event)
Then do what you wish with (e.g. count) the bins. For a direct count:
import collections
counts = collections.defaultdict(int)
for event in Event.all().fetch(1000):
    counts[event.doe.date()] += 1
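To also get a row for dates without any events, you can walk the date range and look each day up in the counts. A sketch with plain datetimes standing in for the doe values (the function name daily_counts is made up):

```python
import collections
import datetime


def daily_counts(timestamps, first_day, last_day):
    """Return [(date, count), ...] for every day from first_day to last_day inclusive."""
    counts = collections.Counter(ts.date() for ts in timestamps)
    total_days = (last_day - first_day).days + 1
    result = []
    for offset in range(total_days):
        day = first_day + datetime.timedelta(days=offset)
        # A Counter returns 0 for days that never occurred in the data.
        result.append((day, counts[day]))
    return result
```

The resulting list of (date, count) pairs can be passed to the template as is and iterated in order for the chart.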

I can't see how that would be possible with single query as GQL has no support for GROUP BY or aggregation generally.

In order to minimize the amount of work you do, you'll probably want to write a task that sums up the per-day totals once, so you can reuse them. I'd suggest using the bulkupdate library to run a once-a-day task that counts events for the previous day, and creates a new model instance, with a key name based on the date, containing the count. Then, you can get all needed data points by doing a query (or better, a batch get) for the set of summary entities you need.
