Django filter based on custom function - python

I have a table AvailableDates with a column date that stores date information.
I want to filter on the date after performing some operation on it, defined by the function convert_date_to_type, which takes a parameter input_variable provided by the user.
def convert_date_to_type(date, input_variable):
    # perform some operation on date based on input_variable
    # the return value will be a type: any one item from the types list below
    return type
The list of types:
types = []
types.append('type1')
types.append('type2')
types.append('type3')
Now I want to filter the table based on type. I will do this in a for loop:
for i in range(0, len(types)):
    # filter table here based on types[i]; something like this (not valid syntax, just the idea):
    AvailableDates.objects.filter(convert_date_to_type(date, input_variable)=types[i])
How can I achieve this? Any other approach would be much appreciated.
I cannot store the type information in a separate column, because one date can be of different types depending on the input_variable given by the user.

The approach you are taking will result in very slow queries, because you loop over all the objects in Python and therefore forfeit the querying speed that database systems provide.
The main question you have to answer is: how frequently will this query be used?
If it's going to be used a lot, then I suggest the following approach:
1. Create an extra column (or a table with a one-to-one relation to your model).
2. Override the model's save method to process the date and store the result in the column created in step 1 at save time.
3. Run your query against that column.
This approach has a space overhead, but it will make the query fast.
If it's not going to be used a lot, but the query can still make a web request noticeably slow, use the above approach anyway; it will help keep the web experience smooth.
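A minimal sketch of steps 1-3, assuming input_variable is fixed or its possible values are known in advance (the field name date_type is illustrative, not part of the question's code):

from django.db import models

class AvailableDates(models.Model):
    date = models.DateField()
    # step 1: extra column holding the precomputed type
    date_type = models.CharField(max_length=16, blank=True)

    def save(self, *args, **kwargs):
        # step 2: compute the type once, at save time
        # (assumes input_variable is known here, e.g. a fixed setting)
        self.date_type = convert_date_to_type(self.date, input_variable)
        super().save(*args, **kwargs)

# step 3: query the precomputed column
AvailableDates.objects.filter(date_type='type1')

If input_variable can be truly arbitrary, precomputation indeed breaks down, as the question notes; if it only takes a handful of known values, one precomputed column per variant is a common compromise.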

One solution is to get the list of ids first:
dates_, wanted_ids = AvailableDates.objects.all(), []
for i in range(0, len(types)):
    wanted_ids += [d.id for d in dates_ if convert_date_to_type(d.date, input_variable) == types[i]]
wanted_dates = AvailableDates.objects.filter(id__in=wanted_ids)
Not very performant, but it works.

Related

Update a specific field for each document based on a function

I have ~10k documents in my collection, with 3 fields (name, wait, utc).
The timestamps are too granular for my use, and I want to round them down to the last 10 minutes.
I created a function to do this, round_to_10min(), which I import from another Python file of mine called utility_func.py.
It's not slick or anything, but it works:
from datetime import datetime as dt

def round_to_10min(my_dt):
    hours = my_dt.hour
    minutes = (my_dt.minute // 10) * 10
    date = dt(my_dt.year, my_dt.month, my_dt.day)
    return dt(date.year, date.month, date.day, hours, minutes)
Is there a way for me to update the 'utc' field of each document in my collection without pulling the cursor into a list and iterating through it?
An example of what I would like to avoid doing (it doesn't seem efficient):
alldocs = collection.find({})
for x in alldocs:
    old_value = x['utc']  # a BSON date comes back as a datetime
    new_value = utility_func.round_to_10min(old_value)
    update_val = {"$set": {"utc": new_value}}
    collection.update_one({"_id": x['_id']}, update_val)
Here's where I think I should be headed, but the update argument has me stumped...
update_value = {'$set':{'utc':result_from_function}}
collection.update_many({},update_value)
Is this achievable in pymongo?
The mechanism you are seeking will not work.
PyMongo supports MongoDB operations only. If you can find a way to achieve your goal with MongoDB operators, you can perform this in a single update_many or aggregate query.
If you prefer to use Python, then you're limited to your original approach of find, loop, update_one.
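That said, this particular rounding can be expressed server-side on a modern server. A minimal sketch, assuming MongoDB 5.0+ (for $dateTrunc), PyMongo 3.9+ (for pipeline-style updates), and that utc is stored as a BSON date rather than an integer:

# rounds every utc value down to the nearest 10 minutes, entirely server-side
collection.update_many(
    {},
    [{"$set": {"utc": {"$dateTrunc": {"date": "$utc", "unit": "minute", "binSize": 10}}}}]
)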

Is there a way to modify datetime objects through the Django ORM Query?

We have a Django application with a PostgreSQL database that contains objects with:
object_date = models.DateTimeField()
as a field.
We need to count the objects by hour per day, so we need to remove some of the extra time data, for example: minutes, seconds and microseconds.
We can remove the extra time data in python:
query = MyModel.objects.values('object_date')
data = [obj['object_date'].replace(minute=0, second=0, microsecond=0) for obj in query]
Which leaves us with a list containing the date and hour.
My Question: Is there a better, faster, cleaner way to do this in the query itself?
If you simply want to obtain the dates without the time data, you can use extra() to declare calculated fields:
query = (MyModel.objects
    .extra(select={
        'object_date_group': 'CAST(object_date AS DATE)',
        'object_hour_group': 'EXTRACT(HOUR FROM object_date)',
    })
    .values('object_date_group', 'object_hour_group'))
You don't gain too much from just that, though; the database is now sending you even more data.
However, with these additional fields, you can use aggregation to instantly get the counts you were looking for, by adding one line:
query = (MyModel.objects
    .extra(select={
        'object_date_group': 'CAST(object_date AS DATE)',
        'object_hour_group': 'EXTRACT(HOUR FROM object_date)',
    })
    .values('object_date_group', 'object_hour_group')
    .annotate(count=Count('*')))
Alternatively, you could use any valid SQL to combine the two fields I made into a single field, for example by formatting them into a string. The nice thing about that is that you can then use the resulting tuples to construct a Counter for convenient lookups (use values_list()).
This query will certainly be more efficient than doing the counting in Python; for a background job, however, that may not matter much.
One downside is that this code is not portable; for one, it does not work on SQLite, which you may still be using for testing purposes. In that case, you might save yourself the trouble and write a raw query right away, which will be just as unportable but more readable.
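A minimal sketch of that raw-query route, assuming PostgreSQL and the default table name myapp_mymodel (an assumption; substitute your own):

from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        "SELECT object_date::date AS d, EXTRACT(HOUR FROM object_date) AS h, COUNT(*) "
        "FROM myapp_mymodel GROUP BY d, h ORDER BY d, h"
    )
    rows = cursor.fetchall()  # [(date, hour, count), ...]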
Update
As of Django 1.10 it is possible to perform this query nicely using expressions, thanks to the addition of TruncHour. Here's a suggestion for how the solution could look:
from collections import Counter
from django.db.models import Count
from django.db.models.functions import TruncHour
counts_by_group = Counter(dict(
    MyModel.objects
    .annotate(object_group=TruncHour('object_date'))
    .values_list('object_group')
    .annotate(count=Count('object_group'))
))  # query with counts_by_group[datetime.datetime(year, month, day, hour)]
It's elegant, efficient and portable. :)
count = len(MyModel.objects.filter(object_date__range=(beginning_of_hour, end_of_hour)))
or
count = MyModel.objects.filter(object_date__range=(beginning_of_hour, end_of_hour)).count()
Assuming I understand what you're asking for, this returns the number of objects whose date falls within a specific time range. Set the range to run from the beginning of the hour to the end of the hour and you will get all objects created in that hour. Prefer count() over len(): count() issues a SELECT COUNT(*) in the database, whereas len() fetches and evaluates the whole queryset first. For more information, see https://docs.djangoproject.com/en/1.9/ref/models/querysets/#count

How to store numerical lookup table in Python (with labels)

I have a scientific model which I am running in Python which produces a lookup table as output. That is, it produces a many-dimensional 'table' where each dimension is a parameter in the model and the value in each cell is the output of the model.
My question is how best to store this lookup table in Python. I am running the model in a loop over every possible parameter combination (using the fantastic itertools.product function), but I can't work out how best to store the outputs.
It would seem sensible to simply store the output as an ndarray, but I'd really like to be able to access the outputs based on the parameter values, not just indices. For example, rather than accessing the values as table[16][5][17][14], I'd prefer to access them somehow using variable names/values, for example:
table[solar_z=45, solar_a=170, type=17, reflectance=0.37]
or something similar to that. It'd be brilliant if I were able to iterate over the values and get their parameter values back - that is, being able to find out that table[16]... corresponds to the outputs for solar_z = 45.
Is there a sensible way to do this in Python?
Why don't you use a database? I have found MongoDB (and the official Python driver, PyMongo) to be a wonderful tool for scientific computing. Here are some advantages:
Easy to install - simply download the executables for your platform (2 minutes tops, seriously).
Schema-less data model
Blazing fast
Provides map/reduce functionality
Very good querying functionalities
So, you could store each run as a MongoDB document, for example:
{"_id": "run_unique_identifier",
 "param1": "val1",
 "param2": "val2"  # etcetera
}
Then you could query the entries as you wish:
import pymongo

data = pymongo.MongoClient("localhost", 27017)["mydb"]["mycollection"]
for entry in data.find():  # this will yield all results
    value = entry["param1"]  # do something with param1
Whether or not MongoDB/pymongo are the answer to your specific question, I don't know. However, you could really benefit from checking them out if you are into data-intensive scientific computing.
If you want to access the results by name, you could use a nested Python dictionary instead of an ndarray, and serialize it to a .json text file using the json module.
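A minimal sketch of that idea; run_model, solar_zs and solar_as are placeholder names for your model function and parameter ranges:

import itertools
import json

table = {}
for solar_z, solar_a in itertools.product(solar_zs, solar_as):
    # JSON object keys must be strings, so numeric parameters are stringified
    table.setdefault(str(solar_z), {})[str(solar_a)] = run_model(solar_z, solar_a)

with open('lookup.json', 'w') as f:
    json.dump(table, f)

# lookup by parameter value:
value = table['45']['170']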
One option is to use a numpy ndarray for the data (as you do now), and write a parser function to convert the query values into row/column indices.
For example:
solar_z_dict = {...}  # maps a parameter value to its array index, e.g. {45: 16}
solar_a_dict = {...}
...

def lookup(dataArray, solar_z, solar_a, type, reflectance):
    return dataArray[solar_z_dict[solar_z], solar_a_dict[solar_a], ...]
You could also build the index expression as a string and eval it, if you want some of the fields to be passed as "None" and translated to ":" (to select the full table along that variable).
For example, rather than accessing the values as table[16][5][17][14]
I'd prefer to access them somehow using variable names/values
That's what numpy's dtypes are for:
import pylab as plb
from sys import argv

dt = [('L', 'float64'), ('T', 'float64'), ('NMSF', 'float64'), ('err', 'float64')]
data = plb.loadtxt(argv[1], dtype=dt)
Now you can access the columns by name: data['L'], data['T'], data['NMSF'].
More info on dtypes:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html

Counts of events grouped by date in python?

This is no doubt another noobish question, but I'll ask it anyway:
I have a data set of events with exact datetimes in UTC. I'd like to create a line chart showing the total number of events per day (date) in a specified date range. Right now I can retrieve the full data set for the needed date range, but then I need to go through it and count up the events for each date.
The app is running on Google App Engine and is using Python.
What is the best way to create a new data set of dates and corresponding counts (including dates on which there were no events) that I can then pass to a Django template?
Data set for this example looks like this:
class Event(db.Model):
    event_name = db.StringProperty()
    doe = db.DateTimeProperty()
    dlu = db.DateTimeProperty()
    user = db.UserProperty()
Ideally, I want something with date and count for that date.
Thanks and please let me know if something else is needed to answer this question!
You'll have to do the binning in-memory (i.e. after the datastore fetch).
The .date() method of a datetime instance will facilitate your binning; it chops off the time element. Then you can use a dictionary to hold the bins:
bins = {}
for event in Event.all().fetch(1000):
    bins.setdefault(event.doe.date(), []).append(event)
Then do what you wish with (e.g. count) the bins. For a direct count:
import collections

counts = collections.defaultdict(int)
for event in Event.all().fetch(1000):
    counts[event.doe.date()] += 1
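To also get zeroes for dates with no events, as the question asks, you can walk the requested range afterwards; start_date and end_date are assumed to be the date bounds of your report:

import datetime

day = start_date
while day <= end_date:
    counts.setdefault(day, 0)  # materializes a zero count for empty days
    day += datetime.timedelta(days=1)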
I can't see how that would be possible with a single query, as GQL has no support for GROUP BY or aggregation generally.
In order to minimize the amount of work you do, you'll probably want to write a task that sums up the per-day totals once, so you can reuse them. I'd suggest using the bulkupdate library to run a once-a-day task that counts events for the previous day and creates a new model instance, with a key name based on the date, containing the count. Then you can get all the needed data points with a single query (or better, a batch get) for the set of summary entities you need.
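A minimal sketch of such a summary entity, using the old db API from the question; the model name and key-name scheme are illustrative:

class DailyEventCount(db.Model):
    count = db.IntegerProperty(default=0)

# one entity per day, keyed by the date for cheap batch gets:
DailyEventCount(key_name='2012-01-31', count=42).put()

# fetch a range of days in a single batch get:
key_names = [d.isoformat() for d in wanted_dates]  # wanted_dates: list of date objects
summaries = DailyEventCount.get_by_key_name(key_names)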

How to use Graphlab recommend() for providing recommendations to new user?

In GraphLab,
I am trying to use the recommend() method to see how it provides recommendations for a new user (user_id) that isn't present in the model trained on the given dataset. Since the aim is to determine similar users through this recommendation model, I plan to pass new_user_data to recommend() with exactly the same item ratings as an existing user, to check whether it returns the same ratings.
Here is what I am doing:
(data is the dataset containing UserId, ItemId and Rating columns)
(say 104 is a new UserId which isn't in the data set)
result = graphlab.factorization_recommender.create(data, user_id='UserId',
                                                   item_id='ItemId', target='Rating')
new_user_info = graphlab.SFrame({'UserId':104,'ItemId':['x'],'Rating':9})
r = result.recommend(users=104, new_user_data=new_user_info)
I am getting an error:
raise exc_type(exc_value)
TypeError: object of type 'int' has no len()
Can anyone explain how to use the recommend() method for a new user?
Which of the lines gives you the exception? I think you have problems with both your SFrame creation and your usage of the .recommend() method.
new_user_info=graphlab.SFrame({'UserId':104,'ItemId':['x'],'Rating':9})
# should be
new_user_info=graphlab.SFrame({'UserId':[104],'ItemId':['x'],'Rating':[9]})
# construct SFrames from a dictionary where the values are lists
and
r = result.recommend(users=104,new_user_data=new_user_info)
# should be:
r = result.recommend(users=[104],new_user_data=new_user_info)
# users is a list, not an integer
