This is no doubt another noobish question, but I'll ask it anyway:
I have a data set of events with exact datetimes in UTC. I'd like to create a line chart showing the total number of events per day (date) over a specified date range. Right now I can retrieve the full data set for the date range, but then I have to iterate through it and count the events for each date.
The app is running on Google App Engine and is using Python.
What is the best way to create a new data set of dates and corresponding counts (including dates with no events) that I can then pass to a Django template?
The data set for this example looks like this:
class Event(db.Model):
    event_name = db.StringProperty()
    doe = db.DateTimeProperty()  # the event's UTC timestamp (what the chart bins on)
    dlu = db.DateTimeProperty()
    user = db.UserProperty()
Ideally, I want something with date and count for that date.
Thanks and please let me know if something else is needed to answer this question!
You'll have to do the binning in-memory (i.e. after the datastore fetch).
The .date() method of a datetime instance will facilitate your binning; it chops off the time element. Then you can use a dictionary to hold the bins:
bins = {}
for event in Event.all().fetch(1000):
    bins.setdefault(event.doe.date(), []).append(event)
Then do what you wish with (e.g. count) the bins. For a direct count:
import collections
counts = collections.defaultdict(int)
for event in Event.all().fetch(1000):
    counts[event.doe.date()] += 1
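Since you also want dates with no events, you can zero-fill the counts afterwards; a minimal sketch, assuming start_date and end_date are datetime.date objects bounding your range:
import datetime
day, series = start_date, []
while day <= end_date:
    series.append((day, counts.get(day, 0)))  # 0 for days with no events
    day += datetime.timedelta(days=1)
series can then be passed straight to your Django template.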
I can't see how that would be possible with a single query, as GQL has no support for GROUP BY or aggregation generally.
In order to minimize the amount of work you do, you'll probably want to write a task that sums up the per-day totals once, so you can reuse them. I'd suggest using the bulkupdate library to run a once-a-day task that counts events for the previous day, and creates a new model instance, with a key name based on the date, containing the count. Then, you can get all needed data points by doing a query (or better, a batch get) for the set of summary entities you need.
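A minimal sketch of that idea, using the db API from the question; the EventDailySummary model and helper names are illustrative, and the bulkupdate/cron plumbing is omitted:
import datetime
class EventDailySummary(db.Model):
    date = db.DateProperty()
    count = db.IntegerProperty(default=0)
def summarize_day(day):
    # Count one day's events and store the total under a date-based key name,
    # so re-running the task simply overwrites the same summary entity.
    start = datetime.datetime.combine(day, datetime.time.min)
    end = start + datetime.timedelta(days=1)
    total = Event.all().filter('doe >=', start).filter('doe <', end).count()
    EventDailySummary(key_name=day.isoformat(), date=day, count=total).put()
def counts_for_range(days):
    # Batch get by key name; a missing entity means no events that day.
    summaries = EventDailySummary.get_by_key_name([d.isoformat() for d in days])
    return [(d, s.count if s else 0) for d, s in zip(days, summaries)]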
Related
I have ~10k documents in my collection, each with 3 fields (name, wait, utc).
The timestamps are too granular for my use, and I want to round them down to the nearest 10 minutes.
I created a function, round_to_10min(), to modify these timestamps; I import it from another Python file of mine called utility_func.py.
It's not slick or anything but it works:
from datetime import datetime as dt
def round_to_10min(my_dt):
    # Keep the hour, truncate the minutes to the previous multiple of 10,
    # and drop the seconds/microseconds.
    hours = my_dt.hour
    minutes = (my_dt.minute // 10) * 10
    return dt(my_dt.year, my_dt.month, my_dt.day, hours, minutes)
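For example, a quick check of the rounding behaviour:
print(round_to_10min(dt(2021, 3, 5, 14, 37, 22)))  # -> 2021-03-05 14:30:00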
Is there a way for me to update the 'utc' field of each document in my collection without pulling the cursor into a list and iterating through it?
An example of what I would like to avoid having to do (it doesn't seem efficient):
alldocs = collection.find({})
for x in alldocs:
    old_value = x['utc']  # already a datetime, so no int() conversion needed
    new_value = utility_func.round_to_10min(old_value)
    update_val = {'$set': {'utc': new_value}}
    collection.update_one({'_id': x['_id']}, update_val)
Here's where I think I should be headed, but the update argument has me stumped...
update_value = {'$set':{'utc':result_from_function}}
collection.update_many({},update_value)
Is this achievable in pymongo?
The mechanism you are seeking will not work.
Pymongo supports MongoDB operations only. If you can find a way to achieve your goal using MongoDB operations, you can perform this in a single update_many or aggregate query.
If you prefer to use Python, then you're limited to your original approach of find, loop, update_one.
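That said, here is a sketch of the single-query route, assuming MongoDB 5.0+ (pipeline-form updates need server 4.2+, and $dateTrunc needs 5.0+) and that 'utc' is stored as a BSON date; the connection names are illustrative:
from pymongo import MongoClient
collection = MongoClient()['mydb']['mycoll']
# One round trip: the server truncates every document's utc to its 10-minute bin.
collection.update_many(
    {},
    [{'$set': {'utc': {'$dateTrunc': {'date': '$utc', 'unit': 'minute', 'binSize': 10}}}}]
)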
We have a Django application with a PostgreSQL database that contains objects with:
object_date = models.DateTimeField()
as a field.
We need to count the objects by hour per day, so we need to strip the extra time data: minutes, seconds and microseconds.
We can remove the extra time data in Python:
query = MyModel.objects.values('object_date')
data = [obj['object_date'].replace(minute=0, second=0, microsecond=0) for obj in query]
Which leaves us with a list containing the date and hour.
My Question: Is there a better, faster, cleaner way to do this in the query itself?
If you simply want to obtain the dates without the time data, you can use extra to declare calculated fields:
query = (MyModel.objects
         .extra(select={
             'object_date_group': 'CAST(object_date AS DATE)',
             'object_hour_group': 'EXTRACT(HOUR FROM object_date)',
         })
         .values('object_date_group', 'object_hour_group'))
You don't gain too much from just that, though; the database is now sending you even more data.
However, with these additional fields, you can use aggregation to instantly get the counts you were looking for, by adding one line:
from django.db.models import Count
query = (MyModel.objects
         .extra(select={
             'object_date_group': 'CAST(object_date AS DATE)',
             'object_hour_group': 'EXTRACT(HOUR FROM object_date)',
         })
         .values('object_date_group', 'object_hour_group')
         .annotate(count=Count('*')))
Alternatively, you could use any valid SQL to combine the two fields I made into a single field, for example by formatting them into one string. The nice thing about doing that is that you can then use the tuples to construct a Counter for convenient querying (use values_list()).
This query will certainly be more efficient than doing the counting in Python. For a background job that may not be so important, however.
One downside is that this code is not portable; for one, it does not work on SQLite, which you may still be using for testing purposes. In that case, you might save yourself the trouble and write a raw query right away, which will be just as unportable but more readable.
Update
As of Django 1.10 it is possible to perform this query nicely using expressions, thanks to the addition of TruncHour. Here's a suggestion for how the solution could look:
from collections import Counter
from django.db.models import Count
from django.db.models.functions import TruncHour
counts_by_group = Counter(dict(
    MyModel.objects
    .annotate(object_group=TruncHour('object_date'))
    .values_list('object_group')
    .annotate(count=Count('object_group'))
))  # query with counts_by_group[datetime.datetime(year, month, day, hour)]
It's elegant, efficient and portable. :)
count = len(MyModel.objects.filter(object_date__range=(beginning_of_hour, end_of_hour)))
or
count = MyModel.objects.filter(object_date__range=(beginning_of_hour, end_of_hour)).count()
Assuming I understand what you're asking for, this returns the number of objects with a date within a specific time range. Set the range from the beginning of the hour to the end of the hour and you will get all objects created in that hour. count() or len() can be used depending on the desired use; for more information, see https://docs.djangoproject.com/en/1.9/ref/models/querysets/#count
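A quick sketch of computing those bounds; note that __range is inclusive at both ends, so an exclusive upper bound (__lt) avoids counting an object stamped exactly on the next hour twice:
import datetime
now = datetime.datetime.utcnow()
beginning_of_hour = now.replace(minute=0, second=0, microsecond=0)
end_of_hour = beginning_of_hour + datetime.timedelta(hours=1)
count = MyModel.objects.filter(
    object_date__gte=beginning_of_hour,
    object_date__lt=end_of_hour,
).count()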
I'm using the Proficy Historian SDK with Python 2.7. I can create a data record object, add the query criteria attributes (sample type, start time, end time, sample interval in milliseconds), and use datarecord.QueryRecordset() to execute a query.
The problem I'm facing is that QueryRecordset seems to work only for small data sets (a few hundred records at most), i.e. a small date range; otherwise it returns no results for any of the SCADA tags. I can sometimes get it to return more (a few thousand) records by slowly incrementing the date range, but it seems unreliable. So, is there a way to fix this, or a different way to perform or set up the query? Most of my queries contain multiple tags. Otherwise, I guess I'll just have to execute the query repeatedly, sliding the date range and pulling a few hundred records at a time.
Update:
I'm performing the query using the following steps:
from win32com.client.gencache import EnsureDispatch
from win32com.client import constants as c
import datetime
ihApp = EnsureDispatch('iHistorian_SDK.Server')
drecord = ihApp.Data.NewRecordset()
drecord.Criteria.Tags = ['Tag_1', 'Tag_2', 'Tag_3']
drecord.Criteria.SamplingMode = c.Calculated
drecord.Criteria.CalculationMode = c.Average
drecord.Criteria.Direction = c.Forward
drecord.Criteria.NumberOfSamples = 0 # This is the default value
drecord.Criteria.SamplingInterval = 30*60*1000 # 30 min interval in ms
# I've tried using the win32com pytime type instead of datetime, but it
# doesn't make a difference
drecord.Criteria.StartTime = datetime.datetime(2015, 11, 1)
drecord.Criteria.EndTime = datetime.datetime(2015, 11, 10)
# Run the query
drecord.Fields.Clear()
drecord.Fields.AllFields()
drecord.QueryRecordset()
One problem may be the use of dates/times in dd/mm/yyyy hh:mm format. When I create a pytime or datetime object, the individual attributes (e.g. year, day, month, hour, minute) are all correct before and after assignment to drecord.Criteria.StartTime and drecord.Criteria.EndTime, but when I print the variable it always comes out in mm/dd/yyyy hh:mm format; that is probably just the object's __str__ or __repr__ method, though.
So, it turns out there were two properties that could be adjusted to increase the number of samples returned and time before a timeout occurred. Both properties are set on the server object (ihApp):
ihApp.MaximumQueryIntervals = MaximumQueryIntervals # Default is 50000
ihApp.MaximumQueryTime = MaximumQueryTime # Default is 60 (seconds)
Increasing both these values seemed to fix my problems. Some tags definitely seem to take longer to query than others (over the same time period and same sampling method), so increasing the max. query time helped make returning query data more reliable.
When QueryRecordset() completes, it returns False if there was an error and doesn't populate any of the data records. The error type can be shown using:
drecord.LastError
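Putting the two together, the final call in the code above can be checked like this:
if not drecord.QueryRecordset():
    # The query failed; LastError describes why (e.g. a timeout).
    print(drecord.LastError)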
I have started using web2py for a web application and am trying to use SQLFORM.grid(...) to display a paginated listing of one of my db table's data, as in the following minimal example.
grid = SQLFORM.grid(query,
                    links=links,
                    fields=[db.example.date, db.example.foo, db.example.bar])
The db.example.date field contains a Python datetime.datetime object in UTC, and at the moment it is displayed just like that. However, I want more control over the actual output, so that I can set the local timezone and modify the output string to something like "2 hours ago".
As seen in another question [0], I can use links to insert new columns. Unfortunately, I can't seem to sort the rows by a field inserted that way. Also, such columns are added on the right instead of replacing my first column, so that does not seem to be a solution.
To sum it up: How do I gain control about the way db.example.date is printed out in the end?
[0] Calculated Fields in web2py sqlgrid
You can achieve your goal when you define the table in your model. The represent parameter in the Field constructor that you use in define_table will be recognized by SQLFORM.grid. For example, if you wanted to print just the date with the month name, you could put the following in your model:
Field('a_date', type='date', represent=lambda x, row: x.strftime("%B %d, %Y")),
Your function could also convert to local time.
You can use prettydate to turn the datetime into a humanized string, calling it in the represent parameter of your Field() descriptor. For example:
from gluon.tools import prettydate
db.example.date.represent = lambda v, r: prettydate(r.date)
That way, any display of db.example.date will be humanized, including through SQLFORM.grid.
If you don't want the date always represented this way (as per David Nehme's answer), you can set db.table.field.represent in the controller, just before creating the grid:
db.example.date.represent = lambda value, row: value.strftime("%B %d, %Y")
followed by:
grid = SQLFORM.grid(query, ...)
I use this often when I join tables. If a represent defined in the model file refers to row.field, it breaks under a join, because it then must be more specific: row.table.field.
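For example, a model-file represent written as lambda value, row: prettydate(row.date) works for a single-table grid, but under a join the row is nested per table, so it must name the table explicitly; a sketch using the tables from this question:
db.example.date.represent = lambda value, row: prettydate(row.example.date)
Using the value argument directly, as in the answers above, sidesteps the issue entirely.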
I have a table AvailableDates with a column date that stores date information.
I want to filter on the date after performing some operation on it, defined by the convert_date_to_type function, which takes a parameter input_variable provided by the user.
def convert_date_to_type(date, input_variable):
    # Perform some operation on date based on input_variable.
    # The return value will be one item from the types list below.
    return type
The list of types:
types = ['type1', 'type2', 'type3']
Now I want to filter the table based on each type, in a for loop:
for t in types:
    # Filter the table here based on t; conceptually something like
    #     AvailableDates.objects.filter(convert_date_to_type(date, input_variable) == t)
    # though that is not valid Django, since filter() cannot call a Python function per row.
    ...
How can I achieve this? Any other approach is much appreciated.
I cannot store the type information in a separate column, because one date can map to different types depending on the input_variable given by the user.
The approach you are taking will result in very time-consuming queries, because you loop over all the objects in Python and therefore give up the querying speed that database systems are built to provide.
The main question you have to answer is: how frequently is this query going to be used?
If it's going to be used a lot, then I suggest the following approach (a sketch follows the list):
1. Create an extra column, or a table with a one-to-one relation to your model.
2. Override the model's save method to process the date and store the result in the column created in step 1 at save time.
3. Run your query against the column created in step 1.
This approach has a space overhead, but it makes the query faster.
If it's not going to be used a lot, but the query can make a web request noticeably slow, the same approach still applies; it will help keep the web experience smooth.
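A minimal sketch of that approach, assuming the type can be pinned down at save time (the date_type field and FIXED_INPUT_VARIABLE are illustrative; as the question notes, this only pays off if input_variable can be fixed, or if one column is stored per possible value):
from django.db import models
FIXED_INPUT_VARIABLE = 'something'  # hypothetical pinned-down input_variable
class AvailableDates(models.Model):
    date = models.DateField()
    date_type = models.CharField(max_length=20, blank=True)  # step 1: the extra column
    def save(self, *args, **kwargs):
        # Step 2: precompute the type once, when the row is written.
        self.date_type = convert_date_to_type(self.date, FIXED_INPUT_VARIABLE)
        super(AvailableDates, self).save(*args, **kwargs)
# Step 3: the filter becomes an ordinary, indexable column lookup.
wanted_dates = AvailableDates.objects.filter(date_type__in=types)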
One solution is to collect the list of matching ids first:
dates_, wanted_ids = AvailableDates.objects.all(), []
for t in types:
    wanted_ids += [d.id for d in dates_ if convert_date_to_type(d.date, input_variable) == t]
wanted_dates = AvailableDates.objects.filter(id__in=wanted_ids)
Not very performant, but it works.