Update multiple columns using django F() object - python

I have a model that stores stats on a one-integer-stat-per-column basis. I have a view that handles the update of said stats, like so:
class PlayerStats(models.Model):
    # In-game stats - these represent the actual keys sent by the game
    NumberOfJumps = models.IntegerField(default=0)
    NumberOfDoubleJumps = models.IntegerField(default=0)
    NumberOfSilverPickups = models.IntegerField(default=0)
    NumberOfGoldPickups = models.IntegerField(default=0)
    NumberOfHealthPickups = models.IntegerField(default=0)
I basically get a dictionary of stats that I need to add to the current stats stored in the database.
I really don't want to pull all of the data out of the model and then update it again; I would like to do this at the database level, if possible.
A colleague suggested that I use Django's F() object in order to push this out of the view code, mostly in order to keep it thread-safe and avoid any MySQL deadlocks (the stats table is potentially being updated continually by different threads).
The dictionary contains keys that mirror those used in the database, so at the moment I'm doing it like this:
def update_stats(new_stats):
    player_stats = PlayerStats.objects.filter(user=user)
    old_stats = player_stats.values()[0]
    updated_stats = {}
    for stat in new_stats:
        if old_stats[stat]:
            updated_stats[stat] = old_stats[stat] + new_stats[stat]
    PlayerStats.objects.filter(user=user).update(**updated_stats)
Anybody have any pointers as to how to achieve this by using the F() object?

To update using models.F, you need to construct something like:
qs.update(field_1=models.F('field_1') + field_1_delta,
          field_2=models.F('field_2') + field_2_delta,
          ...)
For your code, it might be:
new_stats = {
    'NumberOfHealthPickups': 99
    # ...
}
updated_stats = {}
for stat in new_stats:
    updated_stats[stat] = models.F(stat) + new_stats[stat]
PlayerStats.objects.filter(user=user).update(**updated_stats)

One option is to update the fields one by one.
This code will not update all fields at once (so it might be slow, db-access-wise), but it is safe (no deadlocks, no lost updates).
user_stats = PlayerStats.objects.filter(user=user)
for stat, increment in new_stats.iteritems():
    user_stats.update(**{stat: F(stat) + increment})

Related

What would be the most efficient or useful way to store attendance data?

I want to move my school's attendance records away from Excel sheets and into Python, but I'm not sure what the best way to store that data would be.
Method 1
Create a dictionary with student names as keys, and the dates they attended as items in a list. Or perhaps a list of the days they were absent would be more efficient.
attendance = {}
attendance["student_1"] = ["2018-08-10", "2018-08-15", "2018-08-20"]
Method 2
Create a dictionary of dates and append a list of students who were present on that day:
attendance = {}
attendance["2018-08-10"] = ["student_1", "student_2", "student_3"]
Method 3
Create nested dictionaries: students' names as the outer keys, all dates as inner keys, with a boolean as the value.
attendance = {}
attendance["student_1"] = {}
attendance["student_1"]["2018-08-10"] = True
All of these would probably work, but there must be a better way of storing this data. Can anyone help?
I should add that I want to be able to access a student's attendance record by name, and to retrieve all the student names that were present on a particular date.
It completely depends on your use case. Each method has its own advantage.
Method 1
attendance = {}
attendance["student_1"] = ["2018-08-10", "2018-08-15", "2018-08-20"]
total_days_present_student_1 = len(attendance["student_1"])
This makes it easy to get the number of days a student was present.
Method 2
attendance = {}
attendance["2018-08-10"] = ["student_1", "student_2", "student_3"]
total_student_present_on_2018_08_10 = len(attendance["2018-08-10"])
This makes it easy to get the total number of students present on a particular day.
Method 3
attendance = {}
attendance["student_1"] = {}
attendance["student_1"]["2018-08-10"] = True
There is no real special advantage here that the other two methods don't provide.
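For completeness, both lookups the asker needs (a student's record by name, and everyone present on a date) can still be answered from the Method 3 structure; here is a minimal sketch, with hypothetical helper names:
# Hypothetical helpers over the Method 3 structure (nested dicts of booleans).
def days_present(attendance, student):
    # All dates on which the given student was marked present.
    return [d for d, present in attendance.get(student, {}).items() if present]

def students_present_on(attendance, date):
    # All students marked present on the given date.
    return [s for s, days in attendance.items() if days.get(date)]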
I'm not sure whether you've delved into OOP (object-oriented programming), but this approach may be useful if you need to store more than just attendance in the future. See my 'basic' example:
Setup objects
students = []

class Student:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        self.attendance = {}

def add_attendance(date, students, values):
    for student, val in zip(students, values):
        student.attendance[date] = val
Setup students
This part could be done by reading from a text file with student data, but I've simplified here for brevity.
students = [
    Student('Bob', 15),
    Student('Sam', 14)
]
Add a day and record attendance
Again, I've hard-coded the dates here, but this would obviously come from an external source; the datetime module may prove useful here.
current_date = '27-08-2018'
attendance_values = [
    True,   # for Student(Bob)
    False   # for Student(Sam)
]
add_attendance(current_date,
               students,
               attendance_values)
Now, I'll add a 2nd day (Hard-coded for demonstration):
current_date = '28-08-2018'
attendance_values = [True, True]
add_attendance(current_date, students, attendance_values)
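As suggested above, the hard-coded date could instead come from the datetime module; a minimal sketch producing the same 'DD-MM-YYYY' string:
from datetime import date

# Today's date in the same format as the hard-coded strings above.
current_date = date.today().strftime('%d-%m-%Y')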
Display information
I can easily display all information:
>>> print('\n'.join([str(s.attendance)
... for s in students]))
{'27-08-2018': True, '28-08-2018': True}
{'27-08-2018': False, '28-08-2018': True}
Or, in a more 'friendly' way, and with each student name:
>>> print('data for 27-08-2018:')
>>> for student in students:
...     print('{:>10}: {}'.format(student.name,
...                               student.attendance['27-08-2018']))
data for 27-08-2018:
Bob: True
Sam: False
Storing externally
Currently, all data will be lost on the program's termination, so a possible text file structure could be the following.
Students:
Bob 15
Sam 14 # more data fields in columns here
Attendance:
27-08-2018
Bob True # or anything else to signify they were present
Sam False
28-08-2018
Bob True
Sam True
Now you could read each file line by line, splitting by whitespace for the 'students' file, but for the 'attendance' file, things will most certainly be more difficult. This all depends on what data you include in your attendance file: it could just be a date with True/False values or a fully formatted record.
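For instance, here is a minimal sketch of reading the 'students' file back in, assuming the whitespace-split layout above (the filename is hypothetical):
# Rebuild Student objects from the hypothetical students.txt layout above.
students = []
with open('students.txt') as f:
    for line in f:
        fields = line.split()
        if fields:
            students.append(Student(fields[0], int(fields[1])))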
I want to move my schools attendance records away from excel sheets and into python, but I'm not sure what the best way to store that data would be.
Actually, none of the examples you posted are about storing data (persisting it between program executions). Updates to your attendance dict during the program's execution will be lost when the process finishes, and I seriously doubt you want your program's users to edit the Python code to add or change data.
To make a long story short, this kind of program wants a SQL database - which not only takes care of persisting your data but also makes querying much easier.
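As a sketch of that direction, Python's built-in sqlite3 module is enough to get started; the table and column names here are assumptions, not anything from the question:
import sqlite3

conn = sqlite3.connect('attendance.db')
conn.executescript("""
    CREATE TABLE IF NOT EXISTS student (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        age  INTEGER
    );
    CREATE TABLE IF NOT EXISTS attendance (
        student_id INTEGER REFERENCES student(id),
        date       TEXT NOT NULL,
        present    INTEGER NOT NULL,
        PRIMARY KEY (student_id, date)
    );
""")

# Both lookups from the question become one-line queries:
days_present = conn.execute(
    "SELECT a.date FROM attendance a JOIN student s ON s.id = a.student_id "
    "WHERE s.name = ? AND a.present = 1", ('Bob',)).fetchall()
present_on_day = conn.execute(
    "SELECT s.name FROM attendance a JOIN student s ON s.id = a.student_id "
    "WHERE a.date = ? AND a.present = 1", ('2018-08-10',)).fetchall()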

Speed up python w/ sqlalchemy function

I have a function that populates a database table using python and sqlalchemy. The function is running fairly slowly right now, taking around 17 minutes. I think the main problem is I am looping through two large sets of data to build the new table. I have included the record count in the code below.
How can I speed this up? Should I try to convert the nested for loop into one big SQLAlchemy query? I profiled this function with PyCharm but am not sure I fully understand the results.
def populate(self):
    """Core function to populate positions."""
    # get raw annotations with tag Org
    # returns 11,659 records
    organizations = model.session.query(model.Annotation) \
        .filter(model.Annotation.tag == 'Org') \
        .filter(model.Annotation.organization_id.isnot(None)).all()
    # get raw annotations with tags Support or Oppose
    # returns 2,947 records
    annotations = model.session.query(model.Annotation) \
        .filter((model.Annotation.tag == 'Support') | (model.Annotation.tag == 'Oppose')).all()
    for org in organizations:
        for anno in annotations:
            # Org overlaps with Support or Oppose tag
            # start and end columns are integers
            if org.start >= anno.start and org.end <= anno.end:
                position = model.Position()
                # set to de-duplicated organization
                position.organization_id = org.organization_id
                position.disposition = anno.tag
                # look up bill_id from document_bill table
                document = model.session.query(model.document_bill) \
                    .filter_by(document_id=anno.document_id).first()
                position.bill_id = document.bill_id
                position.document_id = anno.document_id
                model.session.add(position)
                logging.info('org: {}, disposition: {}, bill: {}'.format(
                    position.organization_id, position.disposition, position.bill_id))
    logging.info('committing to database')
    model.session.commit()
My bets, in order of descending probability:
1. Autocommit is ON, so you're waiting for disk.
2. The query inside the loop (document = model.session.query(model.document_bill)...) is slow (use EXPLAIN ANALYZE).
3. Most of the time is actually spent printing logs to the terminal in the inner loop (you should profile).
4. model.session.add(position) is slow (no idea what that does).
5. (And this one should really be first.) Could a SQL query like INSERT INTO ... SELECT do this in a couple tens of milliseconds? If so, why make a loop in the application at all?
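To illustrate that last bet, here is a hedged sketch of the single-statement version; the table names (annotation, position, document_bill) are guesses at the mapped tables, and the join condition mirrors the Python loop above:
from sqlalchemy import text

# One set-based INSERT ... SELECT replaces the nested Python loops.
stmt = text("""
    INSERT INTO position (organization_id, disposition, bill_id, document_id)
    SELECT org.organization_id, anno.tag, doc.bill_id, anno.document_id
    FROM annotation AS org
    JOIN annotation AS anno
      ON org.start >= anno.start AND org."end" <= anno."end"
    JOIN document_bill AS doc ON doc.document_id = anno.document_id
    WHERE org.tag = 'Org'
      AND org.organization_id IS NOT NULL
      AND anno.tag IN ('Support', 'Oppose')
""")
model.session.execute(stmt)
model.session.commit()
("end" is quoted because it is a reserved word; MySQL would use backticks instead.)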

Ordering objects by rating accounting for the number of ratings

I'm trying to do something similar to the first response in this SO question: SQL ordering by rating/votes. Resources may be rated (one rating per user per resource), but when ordering the resources by rating, any resource with fewer than X separate ratings should appear below those with X or more.
I'm implementing this in Django and I'd very much prefer to avoid the use of raw query and keep within the Django model and query framework.
So far, this is what I have:
data = []
data_top = Resource.objects.all().annotate(
    rating=Avg('resourcerating__rating'),
    rate_count=Count('resourcerating')
).exclude(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
for d in data_top:
    data.append(d)
data_bottom = Resource.objects.all().annotate(
    rating=Avg('resourcerating__rating'),
    rate_count=Count('resourcerating')
).exclude(rate_count__gte=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
for d in data_bottom:
    data.append(d)
This all works and returns the ordering by rating as I need. However, it doesn't feel very efficient, what with running two queries and looping over the results of each.
Is there a better way I can code this, either in a single query, or at least avoiding looping though each query set?
Any help much appreciated.
from itertools import chain
main_query = Resource.objects.all().annotate(rating=Avg('resourcerating__rating'),rate_count=Count('resourcerating'))
data_top_query = main_query.exclude(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
data_bottom_query = main_query.exclude(rate_count__gte=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
data = list(chain(data_top_query, data_bottom_query))
Using itertools.chain is faster than looping over each list and appending elements one by one.
Also, the querysets will only get evaluated when list is called on them (they don't hit the database until then).
FYI, the above will hit the db twice when evaluated.
You're currently querying twice and iterating twice, but you can cut it down to one of each easily: just query for the items ordered by rating, then iterate like this:
data_top = []
data_bottom = []
data = Resource.objects.all().annotate(
    rating=Avg('resourcerating__rating'),
    rate_count=Count('resourcerating')
).order_by(order_by)
for d in data:
    if d.rate_count >= settings.ORB_RESOURCE_MIN_RATINGS:
        data_top.append(d)
    else:
        data_bottom.append(d)
data = data_top + data_bottom
This can also be done with the query only, by creating another aggregate column which contains the value rate_count < settings.ORB_RESOURCE_MIN_RATINGS (return 0 for values above or at the threshold, 1 for below) and sorting on (new_column, rating). Pretty sure this would require some custom SQL, but perhaps someone else knows otherwise.
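For reference, on Django 1.8+ that extra 0/1 column can be built without raw SQL using conditional expressions; a hedged sketch:
from django.db.models import Avg, Case, Count, IntegerField, Value, When

data = (Resource.objects
        .annotate(rating=Avg('resourcerating__rating'),
                  rate_count=Count('resourcerating'))
        # 1 for under-threshold resources, 0 otherwise
        .annotate(below_min=Case(
            When(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS,
                 then=Value(1)),
            default=Value(0),
            output_field=IntegerField()))
        .order_by('below_min', order_by))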

Django how to make this view faster?

I have a view. It takes about 1000 records from a model and calculates two values for each. It works correctly, but it is very slow: about 1 minute.
My model. It contains readings for each day:
class Reading(models.Model):
    meter = models.ForeignKey(Meter, verbose_name=_('meter'))
    reading = models.FloatField(verbose_name=_('reading'))
    code = models.ForeignKey(ReadingCode, verbose_name=_('code'))
    date = models.DateTimeField(verbose_name=_('date'))

    class Meta:
        get_latest_by = 'date'
        ordering = ['-date', ]

    def __unicode__(self):
        return u'%s' % (self.date,)

    @property
    def consumption(self):
        try:
            end = self.get_next_by_date(code=self.code, meter=self.meter)
            return (end.reading - self.reading) / (end.date - self.date).days
        except:
            return 0.0

    @property
    def middle_consumption(self):
        data = []
        current_year = self.date.year
        for year in range(current_year - 3, current_year):
            date = datetime.date(year, self.date.month, self.date.day)
            try:
                data.append(Reading.objects.get(
                    date=date,
                    meter=self.meter,
                    code=self.code
                ).consumption)
            except:
                data.append(0.0)
        for i in data:
            if not i:
                data.pop(0)
        return sum(data) / len(data)
My view. It returns JSON with all readings for the requested meter, along with the calculated consumption and the calculated middle consumption over the last 3 years.
class DataForDayChart(TemplateView):
    def get(self, request, *args, **kwargs):
        output = []
        meter = Meter.objects.get(slug=kwargs['slug'])
        # TODO: Make it faster
        for reading in meter.readings_for_period().order_by('date'):
            output.append({
                "label": reading.date.strftime("%d.%m.%Y"),
                "reading": reading.reading,
                "value": reading.consumption / 1000,
                "middle": reading.middle_consumption / 1000
            })
        return HttpResponse(output, mimetype='application/json')
What should I change to make it faster?
The performance issue may be caused by too many DB operations; e.g., in the method middle_consumption you query the DB at least twice:
end = self.get_next_by_date(code=self.code, meter=self.meter)
...
data.append(Reading.objects.get(
    date=date,
    meter=self.meter,
    code=self.code
).consumption)
You didn't show all the code, so I suppose each step in the following loop needs SQL queries:
for reading in meter.readings_for_period().order_by('date'):
And as you said, there are only 1000 records, so maybe you can load the data once and do the relation handling and calculations in memory, which should improve the overall performance.
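A hedged sketch of that idea, assuming readings_for_period() returns a queryset of Reading objects (field names are taken from the question):
from collections import defaultdict

# Fetch everything once, ordered by date, then work purely in memory.
readings = list(meter.readings_for_period().order_by('date'))

# Group readings by code so a reading's successor is a list lookup,
# not a get_next_by_date() query.
by_code = defaultdict(list)
for r in readings:
    by_code[r.code_id].append(r)

def consumption_in_memory(reading):
    series = by_code[reading.code_id]
    i = series.index(reading)
    if i + 1 < len(series):
        nxt = series[i + 1]
        return (nxt.reading - reading.reading) / (nxt.date - reading.date).days
    return 0.0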
From the view name, I would imagine its data does not change much during the day; in this case, I would recommend you use caching.
Django has a good caching framework which is quite easy and straightforward to set up and use, and it will immediately make a huge difference, without much effort.
Of course, the first call will still be slow; and there, you may want to optimize it.
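For example, the per-view cache can wrap the class-based view at the URL level; a sketch (the URL pattern itself is hypothetical, and the 15-minute timeout is an arbitrary choice):
from django.conf.urls import patterns, url
from django.views.decorators.cache import cache_page

urlpatterns = patterns('',
    # Cache each slug's chart data for 15 minutes.
    url(r'^chart/day/(?P<slug>[\w-]+)/$',
        cache_page(60 * 15)(DataForDayChart.as_view())),
)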
The usual approach to optimization is, first and foremost, to measure: you should profile the view to see which functions are slowest.
Alternatively, you could also insert some print statement to get a gist of where the slowdown occurs; the advantage here, is that you do not need to learn how to use a new tool.
That said, my best guess is that the slowdown happens in the call to meter.readings_for_period() (whose code you did not post), and is due to some inefficient database query - for instance, one that ends up instructing the ORM to retrieve the records one by one instead of with a single SELECT statement.
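If that guess is right, select_related often helps; here is a sketch under the assumption that readings_for_period() builds a queryset filtered on the meter (its real code was not posted, so this is a hypothetical reconstruction):
# Hypothetical reconstruction: fetch the related code rows in the same
# SELECT so the properties above don't trigger one query per reading.
def readings_for_period(self):
    return (Reading.objects
            .filter(meter=self)
            .select_related('code'))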

App Engine Datastore IN Operator - how to use?

Reading: http://code.google.com/appengine/docs/python/datastore/gqlreference.html
I want to use the IN operator, but am unsure how to make it work. Let's assume the following:
class User(db.Model):
    name = db.StringProperty()

class UniqueListOfSavedItems(db.Model):
    str = db.StringProperty()
    datesaved = db.DateTimeProperty()

class UserListOfSavedItems(db.Model):
    name = db.ReferenceProperty(User, collection_name='user')
    str = db.ReferenceProperty(UniqueListOfSavedItems, collection_name='itemlist')
How can I do a query which gets me the list of saved items for a user? Obviously I can do:
q = db.Gql("SELECT * FROM UserListOfSavedItems WHERE name = :1", user[0].name)
but that gets me a list of keys. I now want to take that list and get it into a query to get the str field out of UniqueListOfSavedItems. I thought I could do:
q2 = db.Gql("SELECT * FROM UniqueListOfSavedItems WHERE str IN :1", q)
but something's not right... any ideas? Is it (am at my day job, so can't test this now):
q2 = db.Gql("SELECT * FROM UniqueListOfSavedItems WHERE __key__ IN :1", q)
side note: what a devilishly difficult problem to search on, because all I really care about is the "IN" operator.
Since you have a list of keys, you don't need to do a second query - you can do a batch fetch, instead. Try this:
#and this should get me the items that a user saved
useritems = db.get(saveditemkeys)
(Note you don't even need the guard clause - a db.get on 0 entities is short-circuited appropriately.)
What's the difference, you may ask? Well, a db.get takes about 20-40ms. A query, on the other hand (GQL or not) takes about 160-200ms. But wait, it gets worse! The IN operator is implemented in Python, and translates to multiple queries, which are executed serially. So if you do a query with an IN filter for 10 keys, you're doing 10 separate 160ms-ish query operations, for a total of about 1.6 seconds latency. A single db.get, in contrast, will have the same effect and take a total of about 30ms.
+1 to Adam for getting me on the right track. Based on his pointer, and doing some searching at Code Search, I have the following solution.
usersaveditems = User.Gql("SELECT * FROM UserListOfSavedItems WHERE user = :1", userkey)
saveditemkeys = []
for item in usersaveditems:
    # this should create a list of keys (references) to the saved item table
    saveditemkeys.append(item.str())

if len(saveditemkeys) > 0:
    # and this should get me the items that a user saved
    useritems = db.Gql("SELECT * FROM UniqueListOfSavedItems WHERE __key__ IN :1", saveditemkeys)
