Currently, my project has a layer of models structured like this:
Sensor class: stores the different sensors.
Entry class: stores a record whenever a new data entry happens.
Data class: stores the key/value pairs for each entry.
That additional Data class exists because different sensors have different fields, and I wanted to store them all in one database. Access goes through two methods: one on Entry that assembles its data, and one on Sensor, as follows:
class Data(models.Model):
    key = models.CharField(max_length=50)
    value = models.CharField(max_length=50)
    entry = models.ForeignKey(Entry, on_delete=CASCADE, related_name="dataFields")

    def __str__(self):
        return str(self.id) + "-" + self.key + ":" + self.value
class Entry(models.Model):
    time = models.DateTimeField()
    sensor = models.ForeignKey(Sensor, on_delete=CASCADE, related_name="entriesList")

    def generateFields(self, field=None):
        """
        Generate the fields by merging the Entry and data fields.
        Outputs a dictionary which contains all fields.
        """
        if field is None:
            field_set = self.dataFields.all()
        else:
            field_set = self.dataFields.filter(key=field)

        output = {"timestamp": self.time.timestamp()}
        for key, value in field_set.values_list("key", "value"):
            try:
                floated = float(value)
                isnan = math.isnan(floated)
            except (TypeError, ValueError):
                continue
            if isnan:
                return None
            else:
                output[key] = floated
        return output
class Sensor(models.Model):
    .
    .
    .
    def generateData(self, fromTime, toTime, field=None):
        fromTime = datetime.fromtimestamp(fromTime)
        toTime = datetime.fromtimestamp(toTime)
        entries = self.entriesList.filter(time__range=(toTime, fromTime)).order_by(
            "time"
        )

        output = []
        for entry in entries:
            value = entry.generateFields(field)
            if value is not None:
                output.append(value)
        return output
While troubleshooting the timing issue (running this query for ~5000-10000 entries took too long, almost 10 seconds!), I found that most of the time (about 95%) was spent in generateFields(). I've been looking at options for caching it (using cached_property, among other approaches), but none have really worked so far.
Is there a way to store the results of generateFields() in the database automatically when the model is saved? Or perhaps just to save the results of the reverse query self.dataFields.all()? I can tell that's the main culprit, since for 5000 entries there are at least 25000 data fields on average.
(Thanks to Jacinator for the notes and improvements.) The code above reflects Jacinator's changes, but the issue (and the question) still stands, as caching the fields would speed up the process by almost 25-50 times(!!), which is critical since my actual dataset can be much larger (~1 minute to run a query isn't really acceptable).
I think this is how I would consider writing generateFields.
class Entry(models.Model):
    ...

    def generateFields(self, field=None):
        """
        Generate the fields by merging the Entry and data fields.
        Outputs a dictionary which contains all fields.
        """
        if field is None:
            field_set = self.dataFields.all()
        else:
            field_set = self.dataFields.filter(key=field)

        output = {"timestamp": self.time.timestamp()}
        for key, value in field_set.values_list("key", "value"):
            try:
                floated = float(value)
                isnan = math.isnan(floated)
            except (TypeError, ValueError):
                continue
            if isnan:
                return None
            else:
                output[key] = floated
        return output
First, I'm going to avoid comparing the field (if it's provided) in Python. I can use the queryset .filter to pass that off to the SQL.
if field is None:
    field_set = self.dataFields.all()
else:
    field_set = self.dataFields.filter(key=field)
Second, I'm using QuerySet.values_list to retrieve the values from the entries. I could be wrong (please correct me if so), but I think this also passes off the attribute retrieval to the SQL. I don't actually know if it's faster, but I suspect it might be.
for key, value in field_set.values_list("key", "value"):
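If it helps to confirm what actually gets sent to the database, printing the compiled query is a quick check (this is just a debugging aid of mine, not part of the original suggestion):

# Show the SQL Django will run for this queryset; only the key and value
# columns should appear in the SELECT clause.
print(field_set.values_list("key", "value").query)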
I've restructured the try/except block, but that has less to do with increasing the speed and more to do with making it explicit what errors are being caught and what lines are raising them.
try:
    floated = float(value)
    isnan = math.isnan(floated)
except (TypeError, ValueError):
    continue
The lines outside of the try/except now should be problem free.
if isnan:
    return None
else:
    output[key] = floated
I'm a little unfamiliar with QuerySet.prefetch_related, but I think adding it to this line would also help, since it fetches all the related Data rows up front in a second query instead of issuing one query per entry.
entries = self.entriesList.filter(time__range=(toTime, fromTime)).order_by(
    "time").prefetch_related("dataFields")
While it is not a perfect solution, below is the approach I am using at the moment. (I am still looking for a better solution, but I believe this may be useful to others who face the same situation.)
I am using django-picklefield to store the cached data, declaring it as:
from picklefield.fields import PickledObjectField
.
.
.
class Entry(models.Model):
    time = models.DateTimeField()
    sensor = models.ForeignKey(Sensor, on_delete=CASCADE, related_name="entriesList")
    cachedValues = PickledObjectField(null=True)
Next, I am adding a property that both generates the value and returns it:
@property
def data(self):
    if self.cachedValues is None:
        fields = self.generateFields()
        self.cachedValues = fields
        self.save()
        return fields
    else:
        return self.cachedValues
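One thing to watch for (my own note, not part of the original approach): the pickled cache is never cleared if the underlying Data rows change after it has been computed. A minimal sketch of one way to invalidate it, assuming the Data and Entry models shown above:

from django.db.models.signals import post_save, post_delete
from django.dispatch import receiver

@receiver(post_save, sender=Data)
@receiver(post_delete, sender=Data)
def invalidate_entry_cache(sender, instance, **kwargs):
    # Clear the pickled cache so the next access to entry.data recomputes it.
    Entry.objects.filter(pk=instance.entry_id).update(cachedValues=None)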
Normally, this field will be set automatically when new data is added. However, since I already have a large amount of data present, and while I could simply wait until it is accessed (future accesses will be much faster), I decided to quick-index it by running this:
def mass_set(request):
    clear = lambda: os.system("clear")
    entries = Entry.objects.all()
    length = len(entries)
    for count, entry in enumerate(entries):
        _ = entry.data
        clear()
        print(f"{count}/{length} done")
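For very large tables, a possible refinement (my own suggestion, not something benchmarked here) is to stream the rows instead of loading every Entry into memory at once, and to print progress less often:

def mass_set(request):
    total = Entry.objects.count()
    # iterator() streams rows from the database instead of caching the whole queryset.
    for count, entry in enumerate(Entry.objects.all().iterator(), start=1):
        _ = entry.data  # triggers generateFields() and stores the pickled cache
        if count % 500 == 0:
            print(f"{count}/{total} done")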
Finally, below are the benchmarks for a set of 2230 fields running on my local development machine, measuring the main loop as such:
def generateData(self, fromTime, toTime, field=None):
    fromTime = datetime.fromtimestamp(fromTime)
    toTime = datetime.fromtimestamp(toTime)
    # entries = self.entriesList.filter().order_by('time')
    entries = self.entriesList.filter(time__range=(toTime, fromTime)).order_by(
        "time"
    )

    output = []
    totalTimeLoop = 0
    totalTimeGenerate = 0
    stage1 = time.time()
    stage2 = time.time()
    for entry in entries:
        stage1 = time.time()
        totalTimeLoop += stage1 - stage2

        # value = entry.generateFields(field)
        value = entry.data

        stage2 = time.time()
        totalTimeGenerate += stage2 - stage1
        if value is not None:
            output.append(value)

    print(f"Total time spent generating the loop: {totalTimeLoop}")
    print(f"Total time spent creating the fields: {totalTimeGenerate}")
    return output
Before:
Loop Generation Time: 0.1659650
Fields Generation Time: 3.1726377
After:
Loop Generation Time: 0.1614456
Fields Generation Time: 0.0032608
That is roughly a thousand-fold decrease in field generation time, and about a 20x speed increase overall.
As for the cons, there are mainly two:
Applying it to my current dataset (with 167 thousand fields) takes a considerable amount of time: around 23 minutes on my local machine, and probably an hour or two on the live servers. However, this is a one-time process, as all future entries will be cached automatically, with minimal performance impact, by adding the following lines wherever a new entry is created:
_ = newEntry.data
newEntry.save()
The other is database size: mine went from 100.8 MB to 132.9 MB, a significant 31% increase that could be problematic for a database a few scales larger than mine, but a good enough trade-off for small/medium DBs where storage matters less than speed.
Related
I'm having a bit of trouble implementing a priority queue in Python. Essentially I copied the code from the heapq documentation, but in my application (recreating the data structure for Dijkstra's algorithm) I would like to update the priorities of items.
class PriorityQueue():
    REMOVED = '<removed-task>'  # placeholder for a removed task

    def __init__(self, arr=None):
        if arr is None:
            self.data = []
        else:
            self.data = []
        self.entry_finder = {}  # mapping of tasks to entries
        self.counter = itertools.count()  # unique sequence count

    def add_task(self, task, priority=0):
        'Add a new task or update the priority of an existing task'
        if task in self.entry_finder:
            self.remove_task(task)
        count = next(self.counter)  # increments count each time an item is added (saves index)
        entry = [priority, count, task]
        self.entry_finder[task] = entry
        heapq.heappush(self.data, entry)

    def remove_task(self, task):
        'Mark an existing task as REMOVED. Raise KeyError if not found.'
        entry = self.entry_finder.pop(task)
        entry[-1] = self.REMOVED

    def pop_task(self):
        'Remove and return the lowest priority task. Raise KeyError if empty.'
        while self.data:
            priority, count, task = heapq.heappop(self.data)
            if task is not self.REMOVED:
                del self.entry_finder[task]
                return task
        raise KeyError('pop from an empty priority queue')
list = PriorityQueue()
list.add_task("A", 12)
list.add_task("B", 6)
list.add_task("C", 8)
list.add_task("D", 2)
list.add_task("E", 1)
list.add_task("A", 5)
The code works fine, except that adding a new task "A" leaves the old task "A" renamed to '<removed-task>' and keeping its index in the heap. I would prefer to delete the original entry outright. I was confused as to how calling add_task (which calls remove_task to change the value of entry) was implicitly changing the values of items in self.data. I then realized that in add_task:
entry = [priority, count, task]
self.entry_finder[task] = entry
we have two names referencing the same object. If that is the case, how can I delete an object by its ID? I think this would solve my problem, unless anyone has another solution to find the item in self.data and remove it in O(1). Thanks!
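A minimal illustration of the aliasing described above (my own snippet, not from the original post): the list stored in entry_finder and the list sitting inside the heap are the same object, so mutating it through one name is visible through the other.

entry = [5, 0, "A"]
finder = {"A": entry}
heap = [entry]

finder["A"][-1] = "<removed-task>"
print(heap[0])  # [5, 0, '<removed-task>'] -- the same list object, seen through both names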
I want to log every action performed on certain SQLAlchemy models.
So I have after_insert, after_delete and before_update hooks, where I save the previous and current representation of the model:
def keep_logs(cls):
    @event.listens_for(cls, 'after_delete')
    def after_delete_trigger(mapper, connection, target):
        pass

    @event.listens_for(cls, 'after_insert')
    def after_insert_trigger(mapper, connection, target):
        pass

    @event.listens_for(cls, 'before_update')
    def before_update_trigger(mapper, connection, target):
        prev = cls.query.filter_by(id=target.id).one()
        # comparing previous and current model


MODELS_TO_LOGGING = (
    User,
)

for cls in MODELS_TO_LOGGING:
    keep_logs(cls)
But there is one problem: when I try to find the model in the before_update hook, SQLAlchemy returns the modified (dirty) version.
How can I get the previous version of the model before updating it?
Is there a different way to keep track of model changes?
Thanks!
SQLAlchemy tracks the changes to each attribute. You don't need to (and shouldn't) query the instance again in the event. Additionally, the event is triggered for any instance that has been modified, even if that modification will not change any data. Loop over each column, checking if it has been modified, and store any new values.
@event.listens_for(cls, 'before_update')
def before_update(mapper, connection, target):
    state = db.inspect(target)
    changes = {}

    for attr in state.attrs:
        hist = attr.load_history()

        if not hist.has_changes():
            continue

        # hist.deleted holds the old value
        # hist.added holds the new value
        changes[attr.key] = hist.added

    # now changes maps keys to new values
I had a similar problem, but wanted to keep track of the deltas as changes are made to SQLAlchemy models, not just the new values. I wrote this slight extension of davidism's answer to do that, with slightly better handling of the before and after values (which are sometimes lists and sometimes empty tuples):
from sqlalchemy import inspect


def get_model_changes(model):
    """
    Return a dictionary containing changes made to the model since it was
    fetched from the database.

    The dictionary is of the form {'property_name': [old_value, new_value]}

    Example:
      user = get_user_by_id(420)
      >>> '<User id=420 email="business_email@gmail.com">'

      get_model_changes(user)
      >>> {}

      user.email = 'new_email@who-dis.biz'
      get_model_changes(user)
      >>> {'email': ['business_email@gmail.com', 'new_email@who-dis.biz']}
    """
    state = inspect(model)
    changes = {}
    for attr in state.attrs:
        hist = state.get_history(attr.key, True)

        if not hist.has_changes():
            continue

        old_value = hist.deleted[0] if hist.deleted else None
        new_value = hist.added[0] if hist.added else None
        changes[attr.key] = [old_value, new_value]

    return changes


def has_model_changed(model):
    """
    Return True if there are any unsaved changes on the model.
    """
    return bool(get_model_changes(model))
If an attribute is expired (which sessions do by default on commit) the old value is not available unless it was loaded before being changed. You can see this with the inspection.
state = inspect(entity)
session.commit()
state.attrs.my_attribute.history # History(added=None, unchanged=None, deleted=None)
# Load history manually
state.attrs.my_attribute.load_history()
state.attrs.my_attribute.history # History(added=(), unchanged=['my_value'], deleted=())
In order for attributes to stay loaded, you can tell the session not to expire entities by setting expire_on_commit to False.
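For reference, a minimal sketch of that setting (my own snippet; the engine is assumed to exist already):

from sqlalchemy.orm import sessionmaker

# Entities keep their loaded attributes after commit, so old values stay
# available to the history inspection shown above.
Session = sessionmaker(bind=engine, expire_on_commit=False)
session = Session()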
I want to compare the old and updated fields on a model. I have done this for one field, but I want to do it for all fields:
class MyUser(User):

    def save(self, **kwargs):
        if self.pk is not None:
            orig = MyUser.objects.get(pk=self.pk)
            orig_field_names = orig._meta.get_all_field_names()
            field_names = self._meta.get_all_field_names()

            # I want to do this in a loop
            if orig.first_name != self.first_name:
                print 'first_name changed'
                UpdateLog.objects.create(
                    user=orig,
                    filed_name=self.first_name,
                    update_time=datetime.now()
                )

        super(MyUser, self).save(**kwargs)
Thanks in advance
Here's my go-to function for comparing fields. Gets a little hairy when dealing with foreign keys, but it's not too bad overall:
from django.db.models import fields  # needed for the AutoField / RelatedField checks below

def get_changes_between_objects(object1, object2, excludes=[]):
    """
    Finds the changes between the common fields on two objects

    :param object1: The first object
    :param object2: The second object
    :param excludes: A list of field names to exclude
    """
    changes = {}

    # For every field in the model
    for field in object1._meta.fields:
        # Don't process excluded fields or automatically updating fields
        if not field.name in excludes and not isinstance(field, fields.AutoField):
            # If the field isn't a related field (i.e. a foreign key)..
            if not isinstance(field, fields.related.RelatedField):
                old_val = field.value_from_object(object1)
                new_val = field.value_from_object(object2)

                # If the old value doesn't equal the new value, and they're
                # not both equivalent to null (i.e. None and "")
                if old_val != new_val and not (not old_val and not new_val):
                    changes[field.verbose_name] = (old_val, new_val)
            # If the field is a related field..
            elif isinstance(field, fields.related.RelatedField):
                if field.value_from_object(object1) != field.value_from_object(object2):
                    old_pk = field.value_from_object(object1)
                    try:
                        old_val = field.related.parent_model.objects.get(pk=old_pk)
                    except field.related.parent_model.DoesNotExist:
                        old_val = None

                    new_pk = field.value_from_object(object2)
                    try:
                        new_val = field.related.parent_model.objects.get(pk=new_pk)
                    except field.related.parent_model.DoesNotExist:
                        new_val = None

                    changes[field.verbose_name] = (old_val, new_val)

    return changes
Usage:
>>> item = Item.objects.get(pk=1)
>>> item_old = Item.objects.get(pk=1)
>>> print item.my_attribute
'foo'
>>> item.my_attribute = 'bar'
>>> get_changes_between_objects(item, item_old)
{'My Attribute': ('bar', 'foo')}
You want a signal. For quick reference, here's the introductory paragraph or so from the Django signals documentation:
Django includes a “signal dispatcher” which helps decoupled
applications get notified when actions occur elsewhere in the
framework. In a nutshell, signals allow certain senders to notify a
set of receivers that some action has taken place. They’re especially
useful when many pieces of code may be interested in the same events.
Django provides a set of built-in signals that let user code get
notified by Django itself of certain actions.
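The excerpt doesn't include code for this case, so here is a minimal pre_save sketch of my own (reusing the MyUser and UpdateLog models from the question; the field loop replaces the hard-coded first_name check and is an assumption, not the documented example):

from datetime import datetime

from django.db.models.signals import pre_save
from django.dispatch import receiver

@receiver(pre_save, sender=MyUser)
def log_field_changes(sender, instance, **kwargs):
    if instance.pk is None:
        return  # new object, nothing to compare against
    old = sender.objects.get(pk=instance.pk)
    for field in sender._meta.fields:
        old_val = getattr(old, field.name)
        new_val = getattr(instance, field.name)
        if old_val != new_val:
            # Keyword names follow the question's UpdateLog model; here the
            # changed field's name is stored rather than its new value.
            UpdateLog.objects.create(
                user=old,
                filed_name=field.name,
                update_time=datetime.now(),
            )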
Read the docs before voting -1. Catching the signal is the better way to do this.
I'm trying to make an appraisal system
This is my class
class Goal(db.Expando):
    GID = db.IntegerProperty(required=True)
    description = db.TextProperty(required=True)
    time = db.FloatProperty(required=True)
    weight = db.IntegerProperty(required=True)
    Emp = db.UserProperty(auto_current_user=True)
    Status = db.BooleanProperty(default=False)
The following values are submitted by the employee:
class SubmitGoal(webapp.RequestHandler):
    def post(self):
        dtw = simplejson.loads(self.request.body)
        try:
            maxid = Goal.all().order("-GID").get().GID + 1
        except:
            maxid = 1
        try:
            g = Goal(GID=maxid, description=dtw[0], time=float(dtw[1]), weight=int(dtw[2]))
            g.put()
            self.response.out.write(simplejson.dumps("Submitted"))
        except:
            self.response.out.write(simplejson.dumps("Error"))
Now the manager checks the goals and approves or rejects them. If approved, the status is stored as True in the datastore, otherwise False:
idsta = simplejson.loads(self.request.body)
try:
    g = db.Query(Goal).filter("GID =", int(idsta[0])).get()
    if g:
        if idsta[1]:
            g.Status = True
            try:
                del g.Comments
            except:
                None
        else:
            g.Status = False
            g.Comments = idsta[2]
        db.put(g)
        self.response.out.write(simplejson.dumps("Submitted"))
except:
    self.response.out.write(simplejson.dumps("Error"))
Now, this is where I'm stuck: filter('Status =', True) returns all entities whose status is True, i.e. the approved ones. I want only those entities which are approved AND which have not been assessed by the employee yet.
def get(self):
    t = []
    for g in Goal.all().filter("Status = ", True):
        t.append([g.GID, g.description, g.time, g.weight, g.Emp])
    self.response.out.write(simplejson.dumps(t))

def post(self):
    idasm = simplejson.loads(self.request.body)
    try:
        g = db.Query(Goal).filter("GID =", int(idasm[0])).get()
        if g:
            g.AsmEmp = idasm[1]
            db.put(g)
        self.response.out.write(simplejson.dumps("Submitted"))
    except:
        self.response.out.write(simplejson.dumps("Error"))
How am I supposed to do this? I know that if I add another filter like filter('AsmEmp =', not None), it will only return those entities which have the AsmEmp attribute, and what I need is the opposite.
You explicitly can't do this. As the documentation states:
It is not possible to query for entities that are missing a given property.
Instead, create a property for is_assessed which defaults to False, and query on that.
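A minimal sketch of what that could look like (my own snippet, reusing the Goal model from the question; the property name is_assessed and the fetch limit of 100 are assumptions):

class Goal(db.Expando):
    # ... existing properties ...
    is_assessed = db.BooleanProperty(default=False)

# Approved goals that have not been assessed yet:
pending = Goal.all().filter("Status =", True).filter("is_assessed =", False).fetch(100)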
Could you not simply add another field, e.g. employee_assessed = db.user..., and only populate it at the time the goal is assessed?
The records do not lack the attribute in the datastore, it's simply set to None. You can query for those records with Goal.all().filter('status =', True).filter('AsmEmp =', None).
A few incidental suggestions about your code:
'Status' is a rather unintuitive name for a boolean.
It's generally good Python style to begin properties and attributes with a lower-case letter.
You shouldn't iterate over a query directly. This fetches results in batches, and is much less efficient than doing an explicit fetch. Instead, fetch the number of results you need with .fetch(n) (see the sketch after this list).
A try/except with no exception class specified and no action taken when an exception occurs is a very bad idea, and can mask a wide variety of issues.
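A minimal sketch of the explicit-fetch suggestion above (my own snippet; the limit of 100 is an arbitrary assumption):

# Fetch up to 100 approved goals in one round trip instead of iterating the query.
approved = Goal.all().filter("Status =", True).fetch(100)
t = [[g.GID, g.description, g.time, g.weight, g.Emp] for g in approved]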
Edit: I didn't notice that you were using an Expando - in which case @Daniel's answer is correct. There doesn't seem to be any good reason to use Expando here, though. Adding the property to the model (and updating existing entities) would be the easiest solution here.
Is there some way to have several consecutive Try-Except clauses that trigger a single Else only if all of them are successful?
As an example:
try:
    private.anodization_voltage_meter = Voltmeter(voltage_meter_address.value)  # assign voltmeter location
except visa.VisaIOError:  # channel time out
    private.logger.warning('Volt Meter is not on or not on this channel')
try:
    private.anodization_current_meter = Voltmeter(current_meter_address.value)  # assign voltmeter as current meter location
except visa.VisaIOError:  # channel time out
    private.logger.warning('Ammeter is not on or not on this channel')
try:
    private.sample_thermometer = Voltmeter(sample_thermometer_address.value)  # assign voltmeter as thermometer location for sample
except visa.VisaIOError:  # channel time out
    private.logger.warning('Sample Thermometer is not on or not on this channel')
try:
    private.heater_thermometer = Voltmeter(heater_thermometer_address.value)  # assign voltmeter as thermometer location for heater
except visa.VisaIOError:  # channel time out
    private.logger.warning('Heater Thermometer is not on or not on this channel')
else:
    private.logger.info('Meters initialized')
As you can see, I only want to print "Meters initialized" if all of them succeeded, but as currently written it only depends on the heater thermometer. Is there some way to stack these?
Personally I'd just have a trip variable init_ok or somesuch.
Set it as True, and have all the except clauses set it False, then test at the end?
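A minimal sketch of that trip-variable approach, using the names from the question (only the first meter is shown; the other three follow the same pattern):

init_ok = True

try:
    private.anodization_voltage_meter = Voltmeter(voltage_meter_address.value)
except visa.VisaIOError:
    private.logger.warning('Volt Meter is not on or not on this channel')
    init_ok = False

# ... repeat the same try/except for the other three meters ...

if init_ok:
    private.logger.info('Meters initialized')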
Consider breaking the try/except structure into a function that returns True if the call worked and False if it failed, then use e.g. all() to see that they all succeded:
def initfunc(structure, attrname, address, desc):
    try:
        var = Voltmeter(address.value)
        setattr(structure, attrname, var)
        return True
    except visa.VisaIOError:
        structure.logger.warning('%s is not on or not on this channel' % (desc,))

if all([initfunc(*x) for x in [(private, 'anodization_voltage_meter', voltage_meter_address, 'Volt Meter'), ...]]):
    private.logger.info('Meters initialized')
Try something like this, keeping the original behaviour of not stopping after the first exception:
success = True
for meter, address, message in (
        ('anodization_voltage_meter', voltage_meter_address, 'Volt Meter is not on or not on this channel'),
        ('anodization_current_meter', current_meter_address, 'Ammeter is not on or not on this channel'),
        ('sample_thermometer', sample_thermometer_address, 'Sample Thermometer is not on or not on this channel'),
        ('heater_thermometer', heater_thermometer_address, 'Heater Thermometer is not on or not on this channel')):
    try:
        setattr(private, meter, Voltmeter(address.value))
    except (visa.VisaIOError,):
        success = False
        private.logger.warning(message)

if success:  # everything is ok
    private.logger.info('Meters initialized')
You can keep a boolean, initialized at the beginning to everythingOK = True. Then set it to False in all the except blocks and log the final line only if it is still True.
You've got a number of quite similar objects, and in cases like that it's often better to treat them uniformly and use a data-driven approach.
So, I'd start with your ..._meter_address objects. To me they sound like configuration for the meters, so I'd have a class that looks something like this:
class MeterConfiguration(object):
    def __init__(self, name, address):
        self.name = name
        self.address = address

    def english_name(self):
        """A readable form of the name for this meter."""
        return ' '.join(x.title() for x in self.name.split('_')) + ' Meter'
Perhaps you have some more configuration (currently stored in variables) that could go in here too.
Then, I'd have a summary of all the meters that your program deals with. I'm creating these statically, but it may well be right for you to read all or part of this information from a config file. I've no idea what your addresses look like, so I made something up :)
ALL_METERS = [
    MeterConfiguration('anodization_voltage', 'PORT0001'),
    MeterConfiguration('anodization_current', 'PORT0002'),
    MeterConfiguration('sample_thermometer', 'PORT0003'),
    MeterConfiguration('heater_thermometer', 'PORT0004')
]
Cool, now all configuration is in one place, and we can use this to make things uniform and simpler.
private.meters = {}
any_errors = False
for meter in ALL_METERS:
    try:
        private.meters[meter.name] = Voltmeter(meter.address)
    except VisaError:
        logging.error('%s not found at %s', meter.english_name(), meter.address)
        any_errors = True
if not any_errors:
    logging.info('All meters initialized.')
You can use, for example private.meters['anodization_voltage'] to refer to a particular meter, or iterate over the meters dict if you need to do something to them all.
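For instance, a hypothetical usage loop (read_voltage() is a placeholder of mine, not a known Voltmeter method):

# Iterate over whichever meters initialized successfully.
for name, meter in private.meters.items():
    logging.info('%s reads %s', name, meter.read_voltage())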