Django: how to make this view faster? - python

I have a view. It takes about 1000 records from a model and calculates two values for each. It works correctly, but it is very slow, about 1 minute.
My model. It contains readings for each day:
class Reading(models.Model):
    meter = models.ForeignKey(Meter, verbose_name=_('meter'))
    reading = models.FloatField(verbose_name=_('reading'))
    code = models.ForeignKey(ReadingCode, verbose_name=_('code'))
    date = models.DateTimeField(verbose_name=_('date'))

    class Meta:
        get_latest_by = 'date'
        ordering = ['-date', ]

    def __unicode__(self):
        return u'%s' % (self.date,)

    @property
    def consumption(self):
        try:
            end = self.get_next_by_date(code=self.code, meter=self.meter)
            return (end.reading - self.reading) / (end.date - self.date).days
        except:
            return 0.0

    @property
    def middle_consumption(self):
        data = []
        current_year = self.date.year
        for year in range(current_year - 3, current_year):
            date = datetime.date(year, self.date.month, self.date.day)
            try:
                data.append(Reading.objects.get(
                    date=date,
                    meter=self.meter,
                    code=self.code
                ).consumption)
            except:
                data.append(0.0)
        for i in data:
            if not i:
                data.pop(0)
        return sum(data) / len(data)
My view. It returns JSON with all readings for the requested meter, together with the calculated consumption and the calculated middle consumption over the last 3 years.
class DataForDayChart(TemplateView):
    def get(self, request, *args, **kwargs):
        output = []
        meter = Meter.objects.get(slug=kwargs['slug'])
        # TODO: Make it faster
        for reading in meter.readings_for_period().order_by('date'):
            output.append({
                "label": reading.date.strftime("%d.%m.%Y"),
                "reading": reading.reading,
                "value": reading.consumption / 1000,
                "middle": reading.middle_consumption / 1000
            })
        return HttpResponse(output, mimetype='application/json')
What should I change to make it faster?

The performance issue is probably caused by too many DB operations; e.g. in the method middle_consumption you query the DB at least twice:
end = self.get_next_by_date(code=self.code, meter=self.meter)
...
data.append(Reading.objects.get(
    date=date,
    meter=self.meter,
    code=self.code
).consumption)
You didn't show all the code, so I suppose each step in the following loop needs its own SQL queries:
for reading in meter.readings_for_period().order_by('date'):
And as you said, there are only about 1000 records, so maybe you can load the data once and handle the relations and the calculations in memory, which should improve the overall performance.
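For example, a minimal sketch of that approach for the consumption value (consumption_series is a made-up name; it assumes readings_for_period() returns the relevant queryset, and the 3-year averages could be computed the same way from one broader query):
from collections import defaultdict

def consumption_series(meter):
    # One query: pull every reading for this meter into memory at once.
    readings = list(meter.readings_for_period().order_by('code', 'date'))

    # Group by code so the "next reading" lookup becomes list indexing
    # instead of a get_next_by_date() query per reading.
    by_code = defaultdict(list)
    for r in readings:
        by_code[r.code_id].append(r)

    output = []
    for series in by_code.values():
        for current, nxt in zip(series, series[1:]):
            days = (nxt.date - current.date).days or 1  # guard against same-day readings
            output.append({
                "label": current.date.strftime("%d.%m.%Y"),
                "reading": current.reading,
                "value": (nxt.reading - current.reading) / days / 1000,
            })
    return output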

From the view name, I would imagine its data does not change much during the day; in this case, I would recommend using caching.
Django has a good caching framework which is quite easy and straightforward to setup and use, and it will immediately make a huge difference, without much effort.
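A minimal sketch of that: wrap the view with Django's cache_page where it is hooked into the URLconf, so the class itself stays untouched (the 15-minute timeout is an arbitrary choice):
from django.views.decorators.cache import cache_page

# In the URLconf, wrap the view callable; the first request computes the data,
# later requests within 15 minutes are served straight from the cache.
cached_day_chart = cache_page(60 * 15)(DataForDayChart.as_view())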
Of course, the first call will still be slow; and there, you may want to optimize it.
The usual approach to optimization is, first and foremost, to measure: you should profile the view to see which functions are slowest.
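For example, a throwaway profiling run from a Django shell, using the standard library profiler and the test client (the URL pattern here is an assumption):
import cProfile
import pstats

from django.test import Client  # calls the view without a web server

def profile_day_chart(slug):
    client = Client()
    profiler = cProfile.Profile()
    profiler.enable()
    client.get('/day-chart/%s/' % slug)  # assumed URL for DataForDayChart
    profiler.disable()
    # Show the 20 most expensive calls, sorted by cumulative time.
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)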
Alternatively, you could insert some print statements to get a gist of where the slowdown occurs; the advantage there is that you do not need to learn a new tool.
That said, my best guess is that the slowdown happens in the call to meter.readings_for_period() (whose code you did not post), and is due to some inefficient database query - for instance, inadvertently instructing the ORM to retrieve the records one by one instead of with a single select statement.

Related

Ordering objects by rating accounting for the number of ratings

I'm trying to do something similar to the first response in this SO question: SQL ordering by rating/votes, where resources may be rated (one rating per user per resource), but when ordering the resources based on their ratings, any resources with fewer than X separate ratings will appear below those with X or more.
I'm implementing this in Django and I'd very much prefer to avoid the use of raw query and keep within the Django model and query framework.
So far, this is what I have:
data = []
data_top = Resource.objects.all().annotate(rating=Avg('resourcerating__rating'), rate_count=Count('resourcerating')).exclude(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
for d in data_top:
    data.append(d)
data_bottom = Resource.objects.all().annotate(rating=Avg('resourcerating__rating'), rate_count=Count('resourcerating')).exclude(rate_count__gte=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
for d in data_bottom:
    data.append(d)
This all works and returns the ordering by rating as I need; however, it doesn't feel very efficient, what with running 2 queries and looping over the results of each.
Is there a better way I can code this, either in a single query, or at least avoiding looping through each queryset?
Any help much appreciated.
from itertools import chain
main_query = Resource.objects.all().annotate(rating=Avg('resourcerating__rating'),rate_count=Count('resourcerating'))
data_top_query = main_query.exclude(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
data_bottom_query = main_query.exclude(rate_count__gte=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
data = list(chain(data_top_query, data_bottom_query))
Using itertools.chain is faster than looping over each list and appending elements one by one.
Also, the querysets will get evaluated when list is called on them (as they don't hit the database till then)
FYI, the above will hit the db twice when evaluated.
You're currently querying twice and iterating twice, but you can cut it down to one of each easily: just query for the items ordered by rating, then iterate like this:
data_top = []
data_bottom = []
data = Resource.objects.all().annotate(rating=Avg('resourcerating__rating'), rate_count=Count('resourcerating')).order_by(order_by)
for d in data:
    if d.rate_count >= settings.ORB_RESOURCE_MIN_RATINGS:
        data_top.append(d)
    else:
        data_bottom.append(d)
data = data_top + data_bottom
This can also be done with the query only, by creating another aggregate column which contains the value rate_count < settings.ORB_RESOURCE_MIN_RATINGS (return 0 for values above or at the threshold, 1 for below) and sorting on (new_column, rating). Pretty sure this would require some custom SQL, but perhaps someone else knows otherwise.
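For what it's worth, newer Django versions (1.8+) can express exactly this without raw SQL via conditional expressions; a sketch under that assumption, reusing the names from the question:
from django.conf import settings
from django.db.models import Avg, Case, Count, IntegerField, Value, When

data = (
    Resource.objects
    .annotate(
        rating=Avg('resourcerating__rating'),
        rate_count=Count('resourcerating'),
    )
    .annotate(
        # 1 for resources below the ratings threshold, 0 otherwise.
        below_threshold=Case(
            When(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS, then=Value(1)),
            default=Value(0),
            output_field=IntegerField(),
        )
    )
    .order_by('below_threshold', order_by)
)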

Best way to limit incoming messages to avoid duplicates

I have a system that accepts messages that contain urls, if certain keywords are in the messages, an api call is made with the url as a parameter.
In order to conserve processing and keep my end presentation efficient, I don't want duplicate URLs being submitted within a certain time range.
So if this URL, http://instagram.com/p/gHVMxltq_8/, comes in and it's submitted to the API:
url = incoming.msg['urls']
url = urlparse(url)
if url.netloc == "instagram.com":
    r = requests.get("http://api.some.url/show?url=%s" % url)
and then 3 seconds later the same URL comes in, I don't want it submitted to the API.
What programming method might I deploy to eliminate/limit duplicate messages from being submitted to the API based on time?
UPDATE USING TIM PETERS METHOD:
limit = DecayingSet(86400)
l = limit.add(longUrl)
if l == False:
    pass
else:
    r = requests.get("http://api.some.url/show?url=%s" % url)
This snippet is inside a long-running process that is accepting streaming messages via TCP.
Every time I pass the same URL in, l returns True.
But when I try it in the interpreter everything is good: it returns False when the set time hasn't expired.
Does it have to do with the fact that the script is running, while the set is being added to?
Instance issues?
Maybe overkill, but I like creating a new class for this kind of thing. You never know when requirements will get fancier ;-) For example,
from time import time

class DecayingSet:
    def __init__(self, timeout):  # timeout in seconds
        from collections import deque
        self.timeout = timeout
        self.d = deque()
        self.present = set()

    def add(self, thing):
        # Return True if `thing` not already in set,
        # else return False.
        result = thing not in self.present
        if result:
            self.present.add(thing)
            self.d.append((time(), thing))
        self.clean()
        return result

    def clean(self):
        # forget stuff added >= `timeout` seconds ago
        now = time()
        d = self.d
        while d and now - d[0][0] >= self.timeout:
            _, thing = d.popleft()
            self.present.remove(thing)
As written, it checks for expirations whenever an attempt is made to add a new thing. Maybe that's not what you want, but it should be a cheap check since the deque holds items in order of addition, so gets out at once if no items are expiring. Lots of possibilities.
Why a deque? Because deque.popleft() is a lot faster than list.pop(0) when the number of items becomes non-trivial.
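A quick usage example (a 3-second window, matching the question's scenario; the URL is the one from the question):
limiter = DecayingSet(3)  # 3-second window

url = "http://instagram.com/p/gHVMxltq_8/"
first = limiter.add(url)    # True  -> not seen recently, submit to the API
repeat = limiter.add(url)   # False -> duplicate within the window, skip it
# After 3 or more seconds have passed, add(url) returns True again.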
Suppose your desired interval is 1 hour. Keep 2 counters that increment every hour but are offset 30 minutes from each other, i.e. counter A goes 1, 2, 3, 4 at 11:17, 12:17, 13:17, 14:17 and counter B goes 1, 2, 3, 4 at 11:47, 12:47, 13:47, 14:47.
Now if a link comes in and either of its two counters matches those of an earlier link, consider it a duplicate.
The benefit of this scheme over explicit timestamps is that you can hash url+counterA and url+counterB to quickly check whether the url exists.
Update: You need two data stores: one, a regular database table (slow) with the columns (url, counterA, counterB); and two, a chunk of n bits of memory (fast). Given a URL so.com, counterA 17 and counterB 18, first hash "17,so.com" into the range 0 to n - 1 and see if the bit at that address is turned on. Similarly, hash "18,so.com" and see if that bit is turned on.
If the bit is not turned on in either case, you can be sure it is a fresh URL within an hour, so we are done (quickly).
If the bit is turned on in either case, then look up the URL in the database table to check whether it was really that URL or some other URL that hashed to the same bit.
Further update: Bloom filters are an extension of this scheme.
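A rough sketch of the fast in-memory bit check (the bit-array size and the hash are arbitrary choices; the slow database fallback described above is omitted):
import hashlib

N_BITS = 2 ** 20                  # size of the in-memory bit array (arbitrary)
bits = bytearray(N_BITS // 8)

def _bit_index(key):
    # Hash "counter,url" into the range 0 .. N_BITS - 1.
    return int(hashlib.sha1(key.encode('utf-8')).hexdigest(), 16) % N_BITS

def maybe_seen(url, counter):
    # True means this (counter, url) pair may have been seen already;
    # a True result still needs the (slow) database check described above.
    idx = _bit_index("%d,%s" % (counter, url))
    byte, mask = idx // 8, 1 << (idx % 8)
    hit = bool(bits[byte] & mask)
    bits[byte] |= mask            # remember it for subsequent checks
    return hit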
I'd recommend keeping an in-memory cache of the most-recently-used URLs. Something like a dictionary:
urls = {}
and then for each URL:
if url in urls and (time.time() - urls[url]) < SOME_TIMEOUT:
    # Don't submit the data
    pass
else:
    urls[url] = time.time()
    # Submit the data
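One thing to watch in a long-running process is that the dict grows without bound; an occasional pruning pass (a sketch, reusing the SOME_TIMEOUT placeholder from above) keeps it small:
import time

def prune(urls, timeout=SOME_TIMEOUT):
    # Drop entries older than the timeout so the cache stays small.
    now = time.time()
    for u, ts in list(urls.items()):
        if now - ts >= timeout:
            del urls[u]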

Update multiple columns using django F() object

I have a model that stores stats, one integer stat per column. I have a view that handles the update of said stats, like so:
class PlayerStats(models.Model):
    # In-game stats - these represent the actual keys sent by the game
    NumberOfJumps = models.IntegerField(default=0)
    NumberOfDoubleJumps = models.IntegerField(default=0)
    NumberOfSilverPickups = models.IntegerField(default=0)
    NumberOfGoldPickups = models.IntegerField(default=0)
    NumberOfHealthPickups = models.IntegerField(default=0)
I basically get a dictionary of stats that I need to add to the current stats stored in the database.
I really don't want to pull all of the data out of the model and then update it again, as I would like to do this at the database level, if possible.
A colleague suggested that I use Django's F() object in order to push this out of the view code, mostly to keep it thread-safe and avoid any MySQL deadlocks (the stats table is potentially being updated continually by different threads).
The dictionary contains keys that mirror those used in the database, so at the moment I'm doing it like this:
def update_stats(new_stats):
    player_stats = PlayerStats(user=user, **new_stats)
    old_stats = player_stats.values()[0]
    updated_stats = {}
    for stat in new_stats:
        if old_stats[stat]:
            updated_stats[stat] = old_stats[stat] + new_stats[stat]
    PlayerStats.objects.filter(user=user).update(**updated_stats)
Anybody have any pointers as to how to achieve this by using the F() object?
To update using models.F, you need to construct something like
qs.update(field_1=models.F('field_1') + field_1_delta,
          field_2=models.F('field_2') + field_2_delta,
          ...)
For your code, it might be
new_stats = {
    'NumberOfHealthPickups': 99
    # ...
}

updated_stats = {}
for stat in new_stats:
    updated_stats[stat] = models.F(stat) + new_stats[stat]

PlayerStats.objects.filter(user=user).update(**updated_stats)
One option is to update the fields one by one.
This code will not update all fields at once (so it might be slow, db-access-wise), but it is safe (no deadlocks, no lost updates).
user_stats = PlayerStats.objects.filter(user=user)  # a queryset, so update() runs at the database level
for stat, increment in new_stats.iteritems():
    user_stats.update(**{stat: F(stat) + increment})

How can I make my code more efficient?

I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool.
Here is how I am currently doing this:
# record time for each message matching the specified message for each tool
messageTimes = {}
for row in cdata:  # tool, time, message
    if self.message in row[2]:
        messageTimes[row[0], row[1]] = 1

# now pull out each message that is within the time diff for each matched message
# as well as the matched messages themselves
def determine(tup):
    if self.message in tup[2]: return True  # matched message
    for (tool, date_time) in messageTimes:
        if tool == tup[0]:
            if abs(date_time - tup[1]) <= tdiff:
                return True
    return False

cdata[:] = [tup for tup in cdata if determine(tup)]
This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run.
This data came from a database. Originally I was getting just the data I wanted using SQL, but that was taking too long also. I was selecting just the messages I wanted, then for each one of those doing another query to get the data within the time diff of each. That was resulting in tens of thousands of queries. So I changed it to pull all the potential matches at once and then process it in python, thinking that would be faster. Maybe I was wrong.
Can anyone give me some suggestions on speeding this up?
Updating my post to show what I did in SQL as was suggested.
What I did in SQL was pretty straightforward. The first query was something like:
SELECT tool, date_time, message
FROM event_log
WHERE message LIKE '%foo%'
AND other selection criteria
That was fast enough, but it may return 20 or 30 thousand rows. So then I looped through the result set, and for each row ran a query like this (where dt and t are the date_time and tool from a row from the above select):
SELECT date_time, message
FROM event_log
WHERE tool = t
AND ABS(TIMESTAMPDIFF(SECOND, date_time, dt)) <= timediff
That was taking about an hour.
I also tried doing in one nested query where the inner query selected the rows from my first query, and the outer query selected the time diff rows. That took even longer.
So now I am selecting without the message LIKE '%foo%' clause and I am getting back 600,000 rows and trying to pull out the rows I want from python.
The way to optimize the SQL is to do it all in one query, instead of iterating over 20K rows and doing another query for each one.
Usually this means you need to add a JOIN, or occasionally a sub-query. And yes, you can JOIN a table to itself, as long as you rename one or both copies. So, something like this:
SELECT el2.date_time, el2.message
FROM event_log AS el1
JOIN event_log AS el2
  ON el2.tool = el1.tool
WHERE el1.message LIKE '%foo%'
  AND other selection criteria
  AND ABS(TIMESTAMPDIFF(SECOND, el2.date_time, el1.date_time)) <= timediff
Now, this probably won't be fast enough out of the box, so there are two steps to improve it.
First, look for any columns that obviously need to be indexed. Clearly tool and date_time need simple indices. message may benefit from either a simple index or, if your database has something fancier, maybe something fancier, but given that the initial query was fast enough, you probably don't need to worry about it.
Occasionally, that's sufficient. But usually, you can't guess everything correctly. And there may also be a need to rearrange the order of the queries, etc. So you're going to want to EXPLAIN the query, and look through the steps the DB engine is taking, and see where it's doing a slow iterative lookup when it could be doing a fast index lookup, or where it's iterating over a large collection before a small collection.
For tabular data, you can't go past the Python pandas library, which contains highly optimised code for queries like this.
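For instance, a sketch of that approach (assuming cdata rows are (tool, date_time, message) tuples with real datetimes; 'foo' and tdiff stand in for the question's message and time diff in seconds):
import pandas as pd

df = pd.DataFrame(cdata, columns=['tool', 'date_time', 'message'])

# Rows whose message matches, i.e. the equivalent of LIKE '%foo%'.
matches = df[df['message'].str.contains('foo', regex=False)]

# Pair every row with the matching rows for the same tool, then keep rows
# whose time falls within tdiff seconds of any match.
paired = df.merge(matches[['tool', 'date_time']], on='tool', suffixes=('', '_match'))
near = paired[(paired['date_time'] - paired['date_time_match']).abs()
              <= pd.Timedelta(seconds=tdiff)]
result = near.drop_duplicates(subset=['tool', 'date_time', 'message'])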
I fixed this by changing my code as follows:
-first I made messageTimes a dict of lists keyed by the tool:
from collections import defaultdict

messageTimes = defaultdict(list)  # a dict with sorted lists
for row in cdata:  # tool, time, module, message
    if self.message in row[3]:
        messageTimes[row[0]].append(row[1])
-then in the determine function I used bisect:
import bisect

def determine(tup):
    if self.message in tup[3]: return True  # matched message
    times = messageTimes[tup[0]]
    le = bisect.bisect_right(times, tup[1])
    ge = bisect.bisect_left(times, tup[1])
    return (le and tup[1] - times[le-1] <= tdiff) or (ge != len(times) and times[ge] - tup[1] <= tdiff)
With these changes the code that was taking over 2 hours took under 20 minutes, and even better, a query that was taking 40 minutes took 8 seconds!
I made 2 more changes and now that 20 minute query is taking 3 minutes:
found = defaultdict(int)

def determine(tup):
    if self.message in tup[3]: return True  # matched message
    times = messageTimes[tup[0]]
    idx = found[tup[0]]
    le = bisect.bisect_right(times, tup[1], idx)
    found[tup[0]] = le  # remember where this tool's search left off
    return (le and tup[1] - times[le-1] <= tdiff) or (le != len(times) and times[le] - tup[1] <= tdiff)

Python: why does this code take forever (infinite loop?)

I'm developing an app in Google App Engine. One of my methods is never completing, which makes me think it's caught in an infinite loop. I've stared at it, but can't figure it out.
Disclaimer: I'm using GAEUnit (http://code.google.com/p/gaeunit) to run my tests. Perhaps it's acting oddly?
This is the problematic function:
def _traverseForwards(course, c_levels):
    ''' Looks forwards in the dependency graph '''
    result = {'nodes': [], 'arcs': []}
    if c_levels == 0:
        return result
    model_arc_tails_with_course = set(_getListArcTailsWithCourse(course))
    q_arc_heads = DependencyArcHead.all()
    for model_arc_head in q_arc_heads:
        for model_arc_tail in model_arc_tails_with_course:
            if model_arc_tail.key() in model_arc_head.tails:
                result['nodes'].append(model_arc_head.sink)
                result['arcs'].append(_makeArc(course, model_arc_head.sink))
                # rec_result = _traverseForwards(model_arc_head.sink, c_levels - 1)
                # _extendResult(result, rec_result)
    return result
Originally, I thought it might be a recursion error, but I commented out the recursion and the problem persists. If this function is called with c_levels = 0, it runs fine.
The models it references:
class Course(db.Model):
    dept_code = db.StringProperty()
    number = db.IntegerProperty()
    title = db.StringProperty()
    raw_pre_reqs = db.StringProperty(multiline=True)
    original_description = db.StringProperty()

    def getPreReqs(self):
        return pickle.loads(str(self.raw_pre_reqs))

    def __repr__(self):
        return "%s %s: %s" % (self.dept_code, self.number, self.title)

class DependencyArcTail(db.Model):
    ''' A list of courses that is a pre-req for something else '''
    courses = db.ListProperty(db.Key)

    def equals(self, arcTail):
        for this_course in self.courses:
            if not (this_course in arcTail.courses):
                return False
        for other_course in arcTail.courses:
            if not (other_course in self.courses):
                return False
        return True

class DependencyArcHead(db.Model):
    ''' Maintains a course, and a list of tails with that course as their sink '''
    sink = db.ReferenceProperty()
    tails = db.ListProperty(db.Key)
Utility functions it references:
def _makeArc(source, sink):
    return {'source': source, 'sink': sink}

def _getListArcTailsWithCourse(course):
    ''' returns a LIST, not SET
        there may be duplicate entries
    '''
    q_arc_heads = DependencyArcHead.all()
    result = []
    for arc_head in q_arc_heads:
        for key_arc_tail in arc_head.tails:
            model_arc_tail = db.get(key_arc_tail)
            if course.key() in model_arc_tail.courses:
                result.append(model_arc_tail)
    return result
Am I missing something pretty obvious here, or is GAEUnit acting up?
Also - the test that is making this run slow has no more than 5 models of any kind in the datastore. I know this is potentially slow, but my app only does this once then subsequently caches it.
Ignoring the commented out recursion, I don't think this should be an infinite loop - you are just doing some for-loops over finite results sets.
However, it does seem like this would be really slow. You're looping over entire tables and then doing more datastore queries in every nested loop. It seems unlikely that this sort of request would complete in a timely manner on GAE unless your tables are really, really small.
Some rough numbers:
If H = # of entities in DependencyArcHead and T = average # of tails in each DependencyArcHead then:
_getListArcTailsWithCourse is doing about H*T queries (an underestimate). In the "worst" case, the result returned from this function will have H*T elements.
_traverseForwards loops over all these results H times, and thus does another H*(H*T) queries.
Even if H and T are only on the order of 10s, you could be doing thousands of queries. If they're bigger, then ... (and this ignores any additional queries you'd do if you uncommented the recursive call).
In short, I think you may want to try to organize your data a little differently if possible. I'd make a specific suggestion, but what exactly you're trying to do isn't clear to me.
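That said, one concrete way to cut the query count regardless of the data layout (a sketch): db.get() accepts a list of keys, so _getListArcTailsWithCourse can batch-fetch each head's tails instead of issuing one db.get() per key.
def _getListArcTailsWithCourse(course):
    ''' Same contract as before, but batch-fetches each head's tails. '''
    result = []
    for arc_head in DependencyArcHead.all():
        # db.get() with a list of keys performs a single batch fetch
        # instead of one datastore round trip per tail key.
        for model_arc_tail in db.get(arc_head.tails):
            if model_arc_tail is not None and course.key() in model_arc_tail.courses:
                result.append(model_arc_tail)
    return result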
