Shall I bother with storing DateTime data as julianday in SQLite? - python

The SQLite docs specify that the preferred format for storing datetime values in the DB is Julian Day (using the built-in functions).
However, all the Python frameworks I have seen (pysqlite, SQLAlchemy) store datetime.datetime values as ISO-formatted strings. Why do they do so?
I usually try to adapt the frameworks to store datetime as julianday, and it's quite painful. I have started to doubt that it is worth the effort.
Please share your experience in this field with me. Does sticking with julianday make sense?

Julian Day is handy for all sorts of date calculations, but it can't store the time part decently (with precise hours, minutes, and seconds). In the past I've used both Julian Day fields (for dates) and seconds-from-the-Epoch (for datetime instances), but only when I had specific needs for computation (of dates and of times, respectively). The simplicity of ISO-formatted dates and datetimes, I think, should make them the preferred choice, say about 97% of the time.

Store it both ways. Frameworks can be set in their ways and if yours is expecting to find a raw column with an ISO formatted string then that is probably more of a pain to get around than it's worth.
The concern in having two columns is data consistency, but SQLite should have everything you need to make it work. Version 3.3 has support for check constraints and triggers. Read up on the date and time functions. You should be able to do what you need entirely in the database.
CREATE TABLE Table1 (jd, isotime);
CREATE TRIGGER trigger_name_1 AFTER INSERT ON Table1
BEGIN
    UPDATE Table1 SET jd = julianday(isotime) WHERE rowid = new.rowid;
END;
CREATE TRIGGER trigger_name_2 AFTER UPDATE OF isotime ON Table1
BEGIN
    UPDATE Table1 SET jd = julianday(isotime) WHERE rowid = old.rowid;
END;
And if you can't do what you need within the DB, you can write a C extension to perform the functionality you need. That way you won't need to touch the framework other than to load your extension.
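As a rough check of the two-column idea, here is a minimal sketch that runs the triggers against an in-memory database from Python's sqlite3 module. The trigger names, the column types, and the sample timestamps are assumptions made for illustration, and the triggers use new.rowid to identify the affected row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The framework only ever writes isotime; the triggers keep jd in sync.
conn.executescript("""
    CREATE TABLE Table1 (jd REAL, isotime TEXT);
    CREATE TRIGGER sync_jd_insert AFTER INSERT ON Table1
    BEGIN
        UPDATE Table1 SET jd = julianday(new.isotime) WHERE rowid = new.rowid;
    END;
    CREATE TRIGGER sync_jd_update AFTER UPDATE OF isotime ON Table1
    BEGIN
        UPDATE Table1 SET jd = julianday(new.isotime) WHERE rowid = new.rowid;
    END;
""")

conn.execute("INSERT INTO Table1 (isotime) VALUES (?)", ("2010-06-22 00:45:56",))
jd_after_insert = conn.execute("SELECT jd FROM Table1").fetchone()[0]

# Moving the ISO value forward by one day should move jd forward by 1.0.
conn.execute("UPDATE Table1 SET isotime = ?", ("2010-06-23 00:45:56",))
jd_after_update = conn.execute("SELECT jd FROM Table1").fetchone()[0]

print(jd_after_insert, jd_after_update)
```

The application code never touches jd, so a framework that insists on ISO strings can stay unmodified.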

But typically, a human doesn't read directly from the database. The fractional time on a Julian Day is easily converted to a human-readable form by (for example)
void hour_time(GenericDate *ConvertObject)
{
    /* A Julian Day begins at noon, so shift by half a day to count from midnight. */
    double shifted = ConvertObject->jd + 0.5;
    double frac_time = shifted - (int)shifted;
    double hour = 24.0 * frac_time;
    double minute = 60.0 * (hour - (int)hour);
    double second = 60.0 * (minute - (int)minute);
    double microsecond = 1000000.0 * (second - (int)second);
    ConvertObject->hour = hour;
    ConvertObject->minute = minute;
    ConvertObject->second = second;
    ConvertObject->microsecond = microsecond;
}
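For the Python side of the same conversion, a minimal sketch might look like the following. The function names are hypothetical, and the result is a naive datetime that is assumed to represent UTC. It relies on the fact that the Unix epoch (1970-01-01 00:00:00 UTC) is Julian Day 2440587.5.

```python
from datetime import datetime, timedelta

# Julian Day of the Unix epoch, 1970-01-01 00:00:00 UTC.
UNIX_EPOCH_JD = 2440587.5
UNIX_EPOCH = datetime(1970, 1, 1)


def julian_day_to_datetime(jd):
    """Convert a Julian Day number to a naive datetime (assumed UTC)."""
    return UNIX_EPOCH + timedelta(days=jd - UNIX_EPOCH_JD)


def datetime_to_julian_day(dt):
    """Convert a naive datetime (assumed UTC) to a Julian Day number."""
    return (dt - UNIX_EPOCH) / timedelta(days=1) + UNIX_EPOCH_JD


print(julian_day_to_datetime(2455369.5318981484))
```

A double holds a present-day Julian Day to within a few tens of microseconds, so the round trip is accurate to well under a millisecond but not necessarily exact.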

Because 2010-06-22 00:45:56 is far easier for a human to read than 2455369.5318981484. Text dates are great for doing ad-hoc queries in SQLiteSpy or SQLite Manager.
The main drawback, of course, is that text dates require 19 bytes instead of 8.

Related

Is there a way to modify datetime objects through the Django ORM Query?

We've a Django, Postgresql database that contains objects with:
object_date = models.DateTimeField()
as a field.
We need to count the objects by hour per day, so we need to remove some of the extra time data, for example: minutes, seconds and microseconds.
We can remove the extra time data in python:
query = MyModel.objects.values('object_date')
data = [row['object_date'].replace(minute=0, second=0, microsecond=0) for row in query]
Which leaves us with a list containing the date and hour.
My Question: Is there a better, faster, cleaner way to do this in the query itself?
If you simply want to obtain the dates without the time data, you can use extra to declare calculated fields:
query = (MyModel.objects
    .extra(select={
        'object_date_group': 'CAST(object_date AS DATE)',
        'object_hour_group': 'EXTRACT(HOUR FROM object_date)'
    })
    .values('object_date_group', 'object_hour_group'))
You don't gain too much from just that, though; the database is now sending you even more data.
However, with these additional fields, you can use aggregation to instantly get the counts you were looking for, by adding one line:
from django.db.models import Count
query = (MyModel.objects
    .extra(select={
        'object_date_group': 'CAST(object_date AS DATE)',
        'object_hour_group': 'EXTRACT(HOUR FROM object_date)'
    })
    .values('object_date_group', 'object_hour_group')
    .annotate(count=Count('*')))
Alternatively, you could use any valid SQL to combine what I made two fields into one field, by formatting it into a string, for example. The nice thing about doing that, is that you can then use the tuples to construct a Counter for convenient querying (use values_list()).
This query will certainly be more efficient than doing the counting in Python. For a background job that may not be so important, however.
One downside is that this code is not portable; for one, it does not work on SQLite, which you may still be using for testing purposes. In that case, you might save yourself the trouble and write a raw query right away, which will be just as unportable but more readable.
Update
As of Django 1.10 it is possible to perform this query nicely using expressions, thanks to the addition of TruncHour. Here's a suggestion for how the solution could look:
from collections import Counter
from django.db.models import Count
from django.db.models.functions import TruncHour
counts_by_group = Counter(dict(
    MyModel.objects
    .annotate(object_group=TruncHour('object_date'))
    .values_list('object_group')
    .annotate(count=Count('object_group'))
))  # query with counts_by_group[datetime.datetime(year, month, day, hour)]
It's elegant, efficient and portable. :)
count = len(MyModel.objects.filter(object_date__range=(beginning_of_hour, end_of_hour)))
or
count = MyModel.objects.filter(object_date__range=(beginning_of_hour, end_of_hour)).count()
Assuming I understand what you're asking for, this returns the number of objects that have a date within a specific time range. Set the range to be from the beginning of the hour until the end of the hour and you will get all objects created in that hour. count() or len() can be used depending on the desired use: count() lets the database do the counting, while len() loads all the matching objects first. For more information on that, check out https://docs.djangoproject.com/en/1.9/ref/models/querysets/#count
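The snippets above leave beginning_of_hour and end_of_hour undefined. One way to compute them is sketched below; hour_bounds is a hypothetical helper name. Because Django's range lookup is inclusive at both ends, the sketch stops one microsecond short of the next hour so that an object stamped exactly on the hour is not counted twice.

```python
from datetime import datetime, timedelta


def hour_bounds(moment):
    """Return the first and last representable instants of moment's hour."""
    beginning_of_hour = moment.replace(minute=0, second=0, microsecond=0)
    end_of_hour = beginning_of_hour + timedelta(hours=1) - timedelta(microseconds=1)
    return beginning_of_hour, end_of_hour


beginning_of_hour, end_of_hour = hour_bounds(datetime(2016, 5, 4, 13, 27, 45))
print(beginning_of_hour, end_of_hour)
```

The resulting pair can be passed as the tuple for object_date__range.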

Python SQLite, passing date values in sql query

I am having a problem with inserting date values into an SQL query. I am using sqlite3 and Python. The query is:
c.execute("""SELECT tweeterHash.* FROM tweeterHash, tweetDates WHERE
Date(tweetDates.start) > Date(?) AND
Date(tweetDates.end) > Date(?)""",
(start,end,))
The query doesn't return any values, and there is no error message. If I use this query:
c.execute("""SELECT tweeterHash.* FROM tweeterHash, tweetDates WHERE
Date(tweetDates.start) > Date(2014-01-01) AND
Date(tweetDates.end) > Date(2015-01-01)""")
Then I get the values that I want, which is as expected?
The values start and end come from a text file:
f = open('dates.txt','r')
start = f.readline().strip('\n')
end = f.readline().strip('\n')
but I have also just tried declaring it as well:
start = '2014-01-01'
end = '2015-01-01'
I guess I don't understand why passing the string in from the start and end variables doesn't work? What is the best way to pass a date variable into a SQL query? Any help is greatly appreciated.
These aren't the same dates—and it's the non-parameterized ones you've got wrong.
Date(2014-01-01) calculates the arithmetic expression 2014 - 01 - 01, then constructs a Date from the resulting number 2012, which will get you something in 4707 BC.
Date('2014-01-01'), or Date(?) where the parameter is the string '2014-01-01', constructs the date you want, in 2014 AD.
You can see this more easily by just selecting dates directly:
>>> cur.execute('SELECT Date(2014-01-01), Date(?)', ['2014-01-01'])
>>> print(cur.fetchone())
('-4707-05-28', '2014-01-01')
Meanwhile:
What is the best way to pass a date variable into a SQL query?
Ideally, use actual date objects instead of strings. The sqlite3 library knows how to handle datetime.datetime and datetime.date. And don't call Date on the values, just compare them. (Yes, sqlite3 might then compare them as strings instead of dates, but the whole point of using ISO8601-like formats is that this always gives the same result… unless of course you have a bunch of dates from 4707 BC lying around.) So:
start = datetime.date(2014, 1, 1)
end = datetime.date(2015, 1, 1)
c.execute("""SELECT tweeterHash.* FROM tweeterHash, tweetDates WHERE
tweetDates.start > ? AND
tweetDates.end > ?""",
(start,end,))
And would this also mean that when I create the table, I would want: " start datetime, end datetime, "?
That would work, but I wouldn't do that. Python will convert date objects to ISO8601-format strings, but not convert back on SELECT, and SQLite will let you transparently compare those strings to the values returned by the Date function.
You could get the same effect with TEXT, and I believe you'd find it less confusing: DATETIME will set the column affinity to NUMERIC, which can confuse both humans and other tools when you're actually storing strings.
Or you could use the type DATE—which is just as meaningless to SQLite as DATETIME, but it can tell Python to transparently convert return values into datetime.date objects. See Default adapters and converters in the sqlite3 docs.
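A minimal sketch of the DATE option, assuming Python 3.7+ and an in-memory database. The adapter and converter are registered explicitly here rather than relying on the sqlite3 module's built-in defaults, which are deprecated as of Python 3.12. The simplified table and its sample rows are made up for illustration.

```python
import datetime
import sqlite3

# Store date objects as ISO 8601 text, and turn DATE columns back into dates.
sqlite3.register_adapter(datetime.date, lambda d: d.isoformat())
sqlite3.register_converter(
    "DATE", lambda raw: datetime.date.fromisoformat(raw.decode())
)

# PARSE_DECLTYPES makes sqlite3 pick a converter from the declared column type.
conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute('CREATE TABLE tweetDates (start DATE, "end" DATE)')
conn.executemany(
    "INSERT INTO tweetDates VALUES (?, ?)",
    [
        (datetime.date(2013, 6, 1), datetime.date(2014, 6, 1)),
        (datetime.date(2014, 3, 1), datetime.date(2015, 3, 1)),
    ],
)

# No Date() calls: the ISO strings compare correctly as plain text.
rows = conn.execute(
    'SELECT start, "end" FROM tweetDates WHERE start > ? AND "end" > ?',
    (datetime.date(2014, 1, 1), datetime.date(2015, 1, 1)),
).fetchall()
print(rows)
```

Only the second sample row satisfies both conditions, and its values come back as datetime.date objects rather than strings.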
Also, if you haven't read Datatypes in SQLite Version 3 and SQLite and Python types, you really should; there are a lot of things that are both surprising (even—or maybe especially—if you've used other databases), and potentially very useful.
Meanwhile, if you think you're getting the "right" results from passing Date(2014-01-01) around, that means you've actually got a bunch of garbage values in your database. And there's no way to fix them, because the mistake isn't reversible. (After all, 2014-01-01 and 2015-01-02 are both 2012…) Hopefully you either don't need the old data, or can regenerate it. Otherwise, you'll need some kind of workaround that lets you deal with existing data as usefully as possible under the circumstances.

MongoDB date and removed objects

Yesterday I had some strange experience with MongoDB. I am using twisted and txmongo - an asynchronous driver for mongodb (similar to pymongo).
I have a REST service that receives some data and puts it into MongoDB. One field is a timestamp in milliseconds, 13 digits.
First of all, there is no trivial way to convert a millisecond timestamp into a Python datetime. I ended up with something like this:
def date2ts(ts):
    return int((time.mktime(ts.timetuple()) * 1000) + (ts.microsecond / 1000))
def ts2date(ts):
    return datetime.datetime.fromtimestamp(ts / 1000) + datetime.timedelta(microseconds=(ts % 1000))
The problem is that when I save the data to MongoDB, retrieve the datetime and convert it back to a timestamp using my function, I don't get the same result in milliseconds.
I did not understand why this is happening. The datetime is saved in MongoDB as an ISODate object. I tried to query it from the shell and there is indeed a difference of one second or a few milliseconds.
QUESTION 1: Does anybody know why is this happening?
But this is not over. I decided not to use datetime and to save timestamp directly as long. Before that I removed all the data from collection. I was quite surprised that when I tried to save same field not as date but as long, it was represented as ISODate in shell. And when retrieved there was still difference in few milliseconds.
I tried to drop the collection and index. When it did not help I tried to drop entire database. When it did not help I tried to drop entire database and to restart mongod. And after this I guess it started to save it as Long.
QUESTION 2: Does anybody know why is this happening?
Thank you!
Python's timestamp is calculated in seconds since the Unix epoch of Jan 1, 1970. The timestamp in JavaScript (and in turn MongoDB), on the other hand, is in terms of milliseconds.
That said, if you only have the timestamps on hand, you can multiply the Python value by 1000 to get milliseconds and store that value in MongoDB. Likewise, you can take the value from MongoDB and divide it by 1000 to make it a Python timestamp. Python only seems to care for two significant digits after the decimal point instead of three (as it doesn't typically care for milliseconds), so keep that in mind if you are still seeing differences of < 10 milliseconds.
Normally I would suggest working with tuples instead, but the conventions for the value ranges are different for each language (JavaScript is unintuitive in that it numbers months from 0 instead of 1) and may cause issues down the road.
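One way to make the conversion round-trip exactly is to stay in integer arithmetic and in UTC throughout. The sketch below uses hypothetical function names and assumes timezone-aware datetimes. Note that the ts2date function in the question passes the millisecond remainder as microseconds, and that time.mktime interprets its argument as local time; either could account for the kind of difference described.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_datetime(ms):
    """Milliseconds since the Unix epoch -> timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


def datetime_to_ms(dt):
    """Timezone-aware datetime -> whole milliseconds since the Unix epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


ms = 1285078577123
print(ms_to_datetime(ms), datetime_to_ms(ms_to_datetime(ms)))
```

Since a BSON date is itself a 64-bit count of milliseconds since the epoch, a value converted this way should survive storage without losing precision. A driver that returns naive datetimes would need tzinfo attached before calling datetime_to_ms.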
It may be a case of different time zones. Please use the function below to rectify it.
function time_format(d, offset) {
    // Convert the local time to UTC, then shift it to the target time zone
    var utc = d.getTime() + (d.getTimezoneOffset() * 60000);
    var nd = new Date(utc + (3600000*offset));
    return nd;
}
searchdate = time_format(searchdate, '+5.5');
'+5.5' here is the offset, in hours, of the target time zone from GMT.

Extracting Date and Time info from a string.

I have a database full of data, including a date and time string, e.g. Tue, 21 Sep 2010 14:16:17 +0000
What I would like to be able to do is extract various documents (records) from the database based on the time contained within the date string, Tue, 21 Sep 2010 14:16:17 +0000.
From the above date string, how would I use python and regex to extract documents that have the time 15:00:00? I'm using MongoDB by the way, in conjunction with Python.
I don't know MongoDB, but shouldn't something like this work?
SELECT * FROM Database WHERE Date LIKE '%15:00:00%'
If you have a date string, the only place it contains colons will be the time part of the date, so that should be good enough without a regex. It would be better, of course, if you had an actual timestamp instead of a string in your date field.
You can use $where:
db.collection.find({$where: "var d = new Date(this.dateProperty); return d.getUTCHours() == 15 && d.getUTCMinutes() == 0 && d.getUTCSeconds() == 0"})
Or regular expression:
db.collection.find({dateProperty: /15:00:00/})
The second can be a bit faster than the first, but both will be relatively slow. To speed things up, store dates in the built-in date format. Also, if you need to query on datetime components, consider adding an indexable date representation such as {y:2010,m:9,d:21,h:14,i:16,s:17} (the properties depend on your query needs; if you only need to query by hour, you would have {h:14}). Then you can have an index on each component.
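A sketch of how such a component document could be built from the existing string in Python, using only the standard library. The function name is hypothetical, and the keys mirror the example above. The components are taken in the string's own UTC offset, which is +0000 in the sample.

```python
from email.utils import parsedate_to_datetime


def date_components(date_string):
    """Split an RFC 2822 style date string into indexable integer fields."""
    dt = parsedate_to_datetime(date_string)
    return {
        "y": dt.year,
        "m": dt.month,
        "d": dt.day,
        "h": dt.hour,
        "i": dt.minute,
        "s": dt.second,
    }


print(date_components("Tue, 21 Sep 2010 14:16:17 +0000"))
```

The resulting dictionary can be stored next to the original string, and a query for 15:00:00 then becomes an exact match on h, i and s.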
I agree with the other poster. Though this doesn't solve your immediate problem, if you have any control over the database, you should seriously consider creating a date/time column, with either a DATE or TIMESTAMP datatype. That would make your system much more robust, and completely avoid the problem of trying to parse dates from strings (an inherently fragile technique).
To keep things easy, use:
import datetime, dateutil.parser
dateutil.parser.parse("Tue, 21 Sep 2010 14:16:17 +0000").strftime('%X')
# '14:16:17'

How to deal with "partial" dates (2010-00-00) from MySQL in Django?

In one of my Django projects that uses MySQL as the database, I need date fields that also accept "partial" dates: only a year (YYYY) or a year and month (YYYY-MM), in addition to a normal date (YYYY-MM-DD).
The date field in MySQL can deal with that by accepting 00 for the month and the day. So 2010-00-00 is valid in MySQL and represents 2010. Likewise, 2010-05-00 represents May 2010.
So I started to create a PartialDateField to support this feature. But I hit a wall because, by default (and Django uses the default), MySQLdb, the Python driver for MySQL, returns a datetime.date object for a date field, and datetime.date() supports only real dates. It is possible to modify the converter that MySQLdb uses for date fields so that it returns a string in the format 'YYYY-MM-DD'. Unfortunately, the converter used by MySQLdb is set at the connection level, so it is used for all MySQL date fields. But Django's DateField relies on the database returning a datetime.date object, so if I change the converter to return a string, Django is not happy at all.
Does anyone have an idea or advice for solving this problem? How can I create a PartialDateField in Django?
EDIT
I should also add that I have already thought of two solutions: create three integer fields for year, month and day (as mentioned by Alison R.), or use a varchar field to keep the date as a string in the format YYYY-MM-DD.
But with both solutions, if I'm not wrong, I will lose the special properties of a date field, such as being able to run queries like "get all entries after this date". I could probably re-implement this functionality on the client side, but that would not be a valid solution in my case because the database can be queried from other systems (mysql client, MS Access, etc.).
First, thanks for all your answers. None of them, as is, was a good solution for my problem but, in your defense, I should add that I didn't give all the requirements. Each one helped me think about my problem, and some of your ideas are part of my final solution.
So my final solution, on the DB side, is to use a varchar field (limited to 10 chars) and store the date in it, as a string, in the ISO format (YYYY-MM-DD), with 00 for the month and day when there is no month and/or day (like a date field in MySQL). This way, the field can work with any database, and the data can be read, understood and edited directly and easily by a human using a simple client (like mysql client, phpMyAdmin, etc.). That was a requirement. It can also be exported to Excel/CSV without any conversion, etc. The disadvantage is that the format is not enforced (except in Django). Someone could write 'not a date' or make a mistake in the format and the DB will accept it (if you have an idea about this problem...).
This way it's also possible to do all of the special queries of a date field relatively easily. For queries with WHERE, <, >, <=, >= and = work directly. The IN and BETWEEN queries also work directly. For querying by day or month, you just have to use EXTRACT (DAY|MONTH ...). Ordering also works directly. So I think it covers all the query needs with almost no complication.
On the Django side, I did two things. First, I created a PartialDate object that looks mostly like datetime.date but supports dates without a month and/or day. Inside this object I use a datetime.datetime object to keep the date. I use the hours and minutes as flags that tell whether the month and day are valid (when they are set to 1). It's the same idea that steveha proposed, but with a different implementation (and only on the client side). Using a datetime.datetime object gives me a lot of nice features for working with dates (validation, comparison, etc.).
Secondly, I created a PartialDateField that mostly deals with the conversion between the PartialDate object and the database.
So far, it works pretty well (I have mostly finished my extensive unit tests).
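The post does not show the PartialDate code, so the following is only a sketch of how the string format described above could be validated on the Python side. The function name and the rule that a day requires a month are assumptions. It also illustrates why ordering works directly on the stored strings.

```python
import datetime
import re

_PARTIAL_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def parse_partial_date(text):
    """Parse 'YYYY-MM-DD' where MM and DD may be 00, meaning 'unknown'.

    Returns (year, month or None, day or None); raises ValueError otherwise.
    """
    match = _PARTIAL_DATE.fullmatch(text)
    if match is None:
        raise ValueError("not a partial date: %r" % (text,))
    year, month, day = (int(part) for part in match.groups())
    if day and not month:
        raise ValueError("a day was given without a month: %r" % (text,))
    # Let datetime.date reject impossible values such as month 13 or Feb 30.
    datetime.date(year, month or 1, day or 1)
    return year, month or None, day or None


print(parse_partial_date("2010-05-00"))
print(sorted(["2010-05-04", "2010-00-00", "2010-05-00"]))
```

Because 00 sorts before any real month or day, a plain string sort puts the year before its months and each month before its days.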
You could store the partial date as an integer (preferably in a field named for the portion of the date you are storing, such as year, month or day) and do validation and conversion to a date object in the model.
EDIT
If you need real date functionality, you probably need real, not partial, dates. For instance, does "get everything after 2010-0-0" return dates inclusive of 2010 or only dates in 2011 and beyond? The same goes for your other example of May 2010. The ways in which different languages/clients deal with partial dates (if they support them at all) are likely to be highly idiosyncratic, and they are unlikely to match MySQL's implementation.
On the other hand, if you store a year integer such as 2010, it is easy to ask the database for "all records with year > 2010" and understand exactly what the result should be, from any client, on any platform. You can even combine this approach for more complicated dates/queries, such as "all records with year > 2010 AND month > 5".
SECOND EDIT
Your only other (and perhaps best) option is to store truly valid dates and come up with a convention in your application for what they mean. A DATETIME field named like date_month could have a value of 2010-05-01, but you would treat that as representing all dates in May, 2010. You would need to accommodate this when programming. If you had date_month in Python as a datetime object, you would need to call a function like date_month.end_of_month() to query dates following that month. (That is pseudocode, but could be easily implemented with something like the calendar module.)
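The end_of_month pseudocode could be filled in with the calendar module roughly like this. It is written as a plain function rather than a method, since datetime objects have no such method; the name is taken from the pseudocode above.

```python
import calendar
import datetime


def end_of_month(day):
    """Return the last day of the month that contains the given date."""
    last_day = calendar.monthrange(day.year, day.month)[1]
    return day.replace(day=last_day)


date_month = datetime.date(2010, 5, 1)
print(end_of_month(date_month))
```

A query for "everything after May 2010" would then compare against end_of_month(date_month) instead of date_month itself.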
It sounds like you want to store a date interval. In Python this would (to my still-somewhat-noob understanding) most readily be implemented by storing two datetime.datetime objects, one specifying the start of the date range and the other specifying the end. In a manner similar to that used to specify list slices, the endpoint would not itself be included in the date range.
For example, this code would implement a date range as a named tuple:
>>> from datetime import datetime
>>> from collections import namedtuple
>>> DateRange = namedtuple('DateRange', 'start end')
>>> the_year_2010 = DateRange(datetime(2010, 1, 1), datetime(2011, 1, 1))
>>> the_year_2010.start <= datetime(2010, 4, 20) < the_year_2010.end
True
>>> the_year_2010.start <= datetime(2009, 12, 31) < the_year_2010.end
False
>>> the_year_2010.start <= datetime(2011, 1, 1) < the_year_2010.end
False
Or even add some magic:
>>> DateRange.__contains__ = lambda self, x: self.start <= x < self.end
>>> datetime(2010, 4, 20) in the_year_2010
True
>>> datetime(2011, 4, 20) in the_year_2010
False
This is such a useful concept that I'm pretty sure that somebody has already made an implementation available. For example, a quick glance suggests that the relativedelta class from the dateutil package will do this, and more expressively, by allowing a 'years' keyword argument to be passed to the constructor.
However, mapping such an object into database fields is somewhat more complicated, so you might be better off implementing it simply by just pulling both fields separately and then combining them. I guess this depends on the DB framework; I'm not very familiar with that aspect of Python yet.
In any case, I think the key is to think of a "partial date" as a range rather than as a simple value.
edit
It's tempting, but I think inappropriate, to add more magic methods that will handle uses of the > and < operators. There's a bit of ambiguity there: does a date that's "greater than" a given range occur after the range's end, or after its beginning? It initially seems appropriate to use <= to indicate that the date on the right-hand side of the equation is after the start of the range, and < to indicate that it's after the end.
However, this implies equality between the range and a date within the range, which is incorrect, since it implies that the month of May 2010 is equal to the year 2010, because May 4th, 2010 equates to both of them. That is, you would end up with falsisms like 2010-04-20 == 2010 == 2010-05-04 being true.
So probably it would be better to implement a method like isafterstart to explicitly check if a date is after the beginning of the range. But again, somebody's probably already done it, so it's probably worth a look on pypi to see what's considered production-ready. This is indicated by the presence of "Development Status :: 5 - Production/Stable" in the "Categories" section of a given module's pypi page. Note that not all modules have been given a development status.
Or you could just keep it simple, and using the basic namedtuple implementation, explicitly check
>>> datetime(2012, 12, 21) >= the_year_2010.start
True
Can you store the date together with a flag that tells how much of the date is valid?
Something like this:
YEAR_VALID = 0x04
MONTH_VALID = 0x02
DAY_VALID = 0x01
Y_VALID = YEAR_VALID
YM_VALID = YEAR_VALID | MONTH_VALID
YMD_VALID = YEAR_VALID | MONTH_VALID | DAY_VALID
Then, if you have a date like 2010-00-00, convert that to 2010-01-01 and set the flag to Y_VALID. If you have a date like 2010-06-00, convert that to 2010-06-01 and set the flag to YM_VALID.
So, then, PartialDateField would be a class that bundles together a date and the date-valid flag described above.
P.S. You don't actually need to use the flags the way I showed it; that's the old C programmer in me coming to the surface. You could use Y_VALID, YM_VALID, YMD_VALID = range(3) and it would work about as well. The key is to have some kind of flag that tells you how much of the date to trust.
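A sketch of the conversion step described above, using the same flag values. The function name to_flagged_date is hypothetical, and treating 0 as "unknown" follows the MySQL convention from the question.

```python
import datetime

YEAR_VALID = 0x04
MONTH_VALID = 0x02
DAY_VALID = 0x01


def to_flagged_date(year, month=0, day=0):
    """Return (date, flag); a month or day of 0 means 'unknown'."""
    if day and not month:
        raise ValueError("a day needs a month")
    flag = YEAR_VALID
    if month:
        flag |= MONTH_VALID
    if day:
        flag |= DAY_VALID
    # Unknown parts are replaced by 1 so that the date itself is always real.
    return datetime.date(year, month or 1, day or 1), flag


print(to_flagged_date(2010))
print(to_flagged_date(2010, 6))
```

Both values would then be stored together, and the flag tells any reader how much of the date to trust.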
Although not in Python - here's an example of how the same problem was solved in Ruby - using a single Integer value - and bitwise operators to store year, month and day - with month and day optional.
https://github.com/58bits/partial-date
Look at the source in lib for date.rb and bits.rb.
I'm sure a similar solution could be written in Python.
To persist the date (sortable) you just save the Integer to the database.
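A rough Python sketch of the same idea follows. The bit layout here (5 bits for the day, 4 for the month, the remaining bits for the year) is an assumption chosen for illustration and is not necessarily the layout the linked Ruby library uses.

```python
def pack_partial_date(year, month=0, day=0):
    """Pack a partial date into one sortable integer; 0 means 'unknown'."""
    if not (0 <= month <= 12 and 0 <= day <= 31):
        raise ValueError("month or day out of range")
    return (year << 9) | (month << 5) | day


def unpack_partial_date(value):
    """Inverse of pack_partial_date: returns (year, month, day)."""
    return value >> 9, (value >> 5) & 0x0F, value & 0x1F


packed = pack_partial_date(2010, 5)
print(packed, unpack_partial_date(packed))
```

With the year in the highest bits, integer order matches chronological order, and an unknown month or day sorts before any known one.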
