What is the recommended way to store a Python exception – in a structured way that allows access to the different parts of that exception – in a Django model?
It is common to design a Django model that records “an event” or “an attempt to do foo”; part of the information to be recorded is “… and the result was this error”. That then becomes a field on the model.
An important aspect of recording the error in the database is to query it in various structured ways: What was the exception type? What was the specific message? What were the other arguments to the exception instance? What was the full traceback?
(I'm aware of services that will let me stream logging or errors to them across the network; that is not what this question asks. I need the structured exception data in the project's own database, for reference directly from other models in the same database.)
The Python exception object is a data structure that has all of these parts – the exception type, the arguments to the exception instance, the “cause” and “context”, the traceback – available separately, as attributes of the object. The objective of this question is to have the Python exception information stored in the database model, such that they can be queried in an equivalent structured manner.
So it isn't enough to have a free-form TextField to record a string representation of the exception. A structured field or set of fields is needed; perhaps even an entire separate model for storing exception instances.
Of course I could design such a thing myself, and spend time making the same mistakes countless people before me have undoubtedly made in trying to implement something like this. Instead, I want to learn from existing art in solving this problem.
What is a general pattern to store structured Python exception data in a Django model, preferably in an existing mature general-purpose library at PyPI that my project can use for its models?
I am not that sure that many people had sought a custom way of storing exception data
Ultimately, you have to know what you need. In all projects I had worked so far, the traceback text had contained enough information to dig to the error source. (sometimes, even naively, so that the TB was escaped multipletimes, still it had been enough to "reverse escape" it, and fix the source with only one instance of the error).
Some debugging tools offer live interactive instrospection in each execution frame when an exception happens - but that has to be "live", because you can't ordinarily serialize Python execution frames and store it on the DB.
That said, if the traceback text is not enough for you, you can have the traceback object by calling sys.exc_info()[2]. That allows you to introspect each frame and know for yourself the file, line number, local and global variables as dictionaries. If you want the variable values to be available on the database, you have to serialize them, but not all values on variables will be easily serializable. So, it is your call to know 'when enough is enough' in this process.
Most modern databases allow for JSON fields, and serializing the exception info to a JSON field restricting data to strings, numbers, bool and None is probably enough.
One way is to run manually each key on the f_globals and f_locals dict for each fame, and try to json.dumps that key's value, and on an exception on the JSON serializtion, use the object's repr instead. Or you can customize a JSON serializer that could store customized relevant data for datetime and dates, open files, sockets and so on - only you can know your needs.
TL;DR: Use a JSON field, at the except clause get hold of the traceback object by calling sys.last_traceback(), and have a custom function to serialize what you want from it into the JSON field.
Or, just use traceback.format_tb to save the traceback text - it probably will be enough anyway.
Most people delegate tasks like this to third-party services like RollBar or LogEntries, or to middleware services running in a VPC like Elastic LogStash.
I'd suggest that the Django ORM is not well suited for this type of storage. Most of the advantage of the ORM is letting you join on relational tables without elaborate SQL and manage schema changes and referential integrity with migrations. These are the problems encountered by the "CRUD" applications Django is designed for — web applications with user accounts, preferences, notification inboxes, etc. The ORM is intended to manage mutable data with more reads than writes.
Your problem, storing Python exceptions that happened in production, is quite different. Your schema needs will almost never change. The data you want to store is never modified at all once written, and all writes are strictly appends. Your data does not contain foreign keys or other fields that would change in a migration. You will almost always query recent data over historical, which will rarely be read outside of offline/bulk analytics.
If you really want to store this information in Django, I'd suggest storing only a rolling window that you periodically rotate into compressed logs on disk. Otherwise you will be maintaining costly indexes in data that is almost never needed. In this case, you should consider constructing your own custom Django model that extracts the Exception metadata you need. You could also put this information into a JSON field that you store as a string as #jsbueno suggests, but this sacrifices indexed selection.
(Note Python exceptions cannot be directly serialized to JSON or pickled. There is a project called tblib that enables pickling, which in turn could be stored as BLOB fields in Django, but I have no idea if the performance would be reasonable. My guess is it would not be worth it.)
In recent years there are many alternative DBMS products for log-like, append-only storage with analytic query patterns. But most of this progress is too recent and too "web-scale" to have off-the-shelf integration with Django, which focuses on smaller, more traditional CRUD applications. You should look for solutions that can be integrated with Python more generally, as complex logging/event storage is mostly out-of-scope for Django.
Related
I've been racking my brain on this for the last few weeks and I just can't seem to understand it. I'm hoping you folks here can give me some clarity.
A LITTLE BACKGROUND
I've built an API to help serve a large website and like all of us, are trying to keep the API as efficient as possible. Part of this efficiency is to NOT create an object that contains custom business logic over and over again (Example: a service class) as requests are made. To give some personal background I come from the Java world so I'm use to using a IoC or DI to help handle object creation and injection into my classes to ensure classes are NOT created over and over on a per request basis.
WHAT I'VE READ
While looking at many Python IoC and DI posts I've become rather confused on how to best approach creating a given class and not having to worry about the server getting overloaded with too many objects based on the amount of requests it may be handling.
Some people say an IoC or DI really isn't needed. But as I run my Django app I find that unless I construct the object I want globally (top of file) for views.py to use later rather than within each view class or def within views.py I run the change of creating multiple classes of the same type, which from what I understand would cause memory bloat on the server.
So what's the right way to be pythonic to keep objects from being built over and over? Should I invest in using an IoC / DI or not? Can I safely rely on setting up my service.py files to just contain def's instead of classes that contain def's? Is the garbage collector just THAT efficient so I don't even have to worry about it.
I've purposely not placed any code in this post since this seems like a general questions, but I can provide a few code examples if that helps.
Thanks
From a confused engineer that wants to be as pythonic as possible
You come from a background where everything needs to be a class, I've programmed web apps in Java too, and sometimes it's harder to unlearn old things than to learn new things, I understand.
In Python / Django you wouldn't make anything a class unless you need many instances and need to keep state.
For a service that's hardly the case, and sometimes you'll notice in Java-like web apps some services are made singletons, which is just a workaround and a rather big anti-pattern in Python
Pythonic
Python is flexible enough so that a "services class" isn't required, you'd just have a Python module (e.g. services.py) with a number of functions, emphasis on being a function that takes in something, returns something, in a completely stateless fashion.
# services.py
# this is a module, doesn't keep any state within,
# it may read and write to the DB, do some processing etc but doesn't remember things
def get_scores(student_id):
return Score.objects.filter(student=student_id)
# views.py
# receives HTTP requests
def view_scores(request, student_id):
scores = services.get_scores(student_id)
# e.g. use the scores queryset in a template return HTML page
Notice how if you need to swap out the service, you'll just be swapping out a single Python module (just a file really), so Pythonistas hardly bother with explicit interfaces and other abstractions.
Memory
Now per each "django worker process", you'd have that one services module, that is used over and over for all requests that come in, and when the Score queryset is used and no longer pointed at in memory, it'll be cleaned up.
I saw your other post, and well, instantiating a ScoreService object for each request, or keeping an instance of it in the global scope is just unnecessary, the above example does the job with one module in memory, and doesn't need us to be smart about it.
And if you did need to keep state in-between several requests, keeping them in online instances of ScoreService would be a bad idea anyway because now every user might need one instance, that's not viable (too many online objects keeping context). Not to mention that instance is only accessible from the same process unless you have some sharing mechanisms in place.
Keep state in a datastore
In case you want to keep state in-between requests, you'd keep the state in a datastore, and when the request comes in, you hit the services module again to get the context back from the datastore, pick up where you left it and do your business, return your HTTP response, then unused things will get garbage collected.
The emphasis being on keeping things stateless, where any given HTTP request can be processed on any given django process, and all state objects are garbage collected after the response is returned and objects go out of scope.
This may not be the fastest request/response cycle we can pull, but it's scalable as hell
Look at some major web apps written in Django
I suggest you look at some open source Django projects and look at how they're organized, you'll see a lot of the things you're busting your brains with, Djangonauts just don't bother with.
I'm quite new to django, and moved to it from Drupal.
In Drupal is possible to define module-level variables (read "application" for django) which are stored in the DB and use one of Drupal's "core tables". The idiom would be something like:
variable_set('mymodule_variablename', $value);
variable_get('mymodule_variablename', $default_value);
variable_del('mymodule_variablename');
The idea is that it wouldn't make sense to have each module (app) to instantiate a whole "module table" to just store one value, so the core provides a common one to be shared across modules.
To the best of my newbie understanding of django, django lack such a functionality, but - since it is a common pattern - I thought to turn to SO community to check if there is a typical/standard/idiomatic way that django devs use to solve this problem.
(BTW: the value is not a constant that I could put in a settings file. It's a value that should be refreshed daily, and should be read at each request).
There are apps to achieve this, but I'd like to recommend django-modeldict from disqus, as its brief
ModelDict is a very efficient way to store things like settings in
your database. The entire model is transformed into a dictionary
(lazily) as well as stored in your cache. It's invalidated only when
it needs to be (both in process and based on CACHE_BACKEND).
Data that is not static is stored in a model. If you need to share data or functions between apps I have seen the convention of making a shared app, something like 'common'. This would house shared models, or utility functions.
In the django projects I have seen the data is usually specific. The data you are storing should be in a model that is representative of that data, I would rather have an explicit model/object representing my data then a generic object that houses vastly different data.
If you are only defining 1 or two variables which are changed daily, perhaps just a key/value store like memcached would work for you?
Another +1 for ModelDict. Another potential, similar solution is Django Constance:
https://github.com/jazzband/django-constance
It's meant to store app config parameters in the database and has the advantage that it exposes a nice backend to edit them for administrators (with the right permissions), handles default values and also has caching etc.
EDIT:
In case it's not clear from the documentation (which it isn't), you can set settings the same the 'Pythonic way.' I.e. to set a setting to a value, you do
from constance import config
config.variable_name = value
I'm developing a framework of sorts. I'm providing a base class, that will be subclassed by other developers to add behavior to the system. The instances of those classes will have attributes that my framework doesn't necessarily expect, except by inspecting those instances' __dict__. To make things even more interesting, some of those classes can be created dynamically, at any time.
I'd like some things to be handled by the framework, namely, I will need to persist those instances, display their attribute values to the user, and let her search/filter instances using those values.
I have to use a relational database. I know there are some decent python OO database out there, but unfortunately they're not an option in this case.
I'm not looking for a full-blown ORM too... and it may not even be an option, given that some of the classes can be created dynamically.
So, my question is, what state of a python instance do I need to serialize to ensure that I can deserialize it later on? Is it enough to look at __dict__, or are there other private attributes that I should be using?
Pickling the instances is not enough, because I'll need to unpickle them to search/filter the attribute values, and I'm afraid it's too much data to do it in-memory (instead of letting the database do it).
Just use an ORM. This is what they are for.
What you are proposing to do is create your own half-assed ORM on your own time. Save your time for your own code that does things, and use the effort other people put for free into solving this problem for you.
Note that all class creation in python is "dynamic" - this is not an issue, for, well, anything at all. In fact, if you are assembling classes programmatically, it is probably slightly easier with an ORM, because they provide reifications of fields.
In the worst case, if you really do need to store your objects in a fake nosql-type schema, you will still only have to write your own backend driver if you use an existing ORM, rather than coding the whole stack yourself. (As it happens, you're not the first person to face this - solutions exist. Goole "python orm store dynamically created models" and "sqlalchemy store dynamically created models")
Candidates include:
Django ORM
SQLAlchemy
Some others you can find by googling "Python ORM".
surfing on the web, reading about django dev best practices points to use pickled model fields with extreme caution.
But in a real life example, where would you use a PickledObjectField, to solve what specific problems?
We have a system of social-networks "backends" which do some generic stuff like "post message", "get status", "get friends" etc. The link between each backend class and user is django model, which keeps user, backend name and credentials. Now imagine how many auth systems are there: oauth, plain passwords, facebook's obscure js stuff etc. This is where JSONField shines, we keep all backend-specif auth data in a dictionary on this model, which is stored in db as json, we can put anything into it no problem.
You would use it to store... almost-arbitrary Python objects. In general there's little reason to use it; JSON is safer and more portable.
You can definitely substitute a PickledObjectField with JSON and some extra logic to create an object out of the JSON. At the end of the day, your use case, when considering to use a PickledObjectField or JSON+logic, is serializing a Python object into your database. If you can trust the data in the Python object, and know that it will always be serialize-able, you can reasonably use the PickledObjectField. In my case (I don't use django's ORM, but this should still apply), I have a couple different object types that can go into my PickledObjectField, and their definitions are constantly mutating. Rather than constantly updating my JSON parsing logic to create an object out of JSON values, I simply use a PickledObjectField to just store the different objects, and then later retrieve them in perfectly usable form (calling their functions). Caveat: If you store an object via PickledObjectField, then you change the object definition, and then you retrieve the object, the old object may have trouble fitting into the new object's definition (depending on what you changed).
The problems to be solved are the efficiency and the convenience of defining and handling a complex object consisting of many parts.
You can turn each part type into a Model and connect them via ForeignKeys.
Or you can turn each part type into a class, dictionary, list, tuple, enum or whathaveyou to your liking and use PickledObjectField to store and retrieve the whole beast in one step.
That approach makes sense if you will never manipulate parts individually, only the complex object as a whole.
Real life example
In my application there are RQdef objects that represent essentially a type with a certain basic structure (if you are curious what they mean, look here).
RQdefs consist of several Aspects and some fixed attributes.
Aspects consist of one or more Facets and some fixed attributes.
Facets consist of two or more Levels and some fixed attributes.
Levels consist of a few fixed attributes.
Overall, a typical RQdef will have about 20-40 parts.
An RQdef is always completely constructed in a single step before it is stored in the database and it is henceforth never modified, only read (but read frequently).
PickledObjectField is more convenient and much more efficient for this purpose than would be a set of four models and 20-40 objects for each RQdef.
I'm using SQLAlchemy's declarative extension. I'd like all changes to tables logs, including changes in many-to-many relationships (mapping tables). Each table should have a separate "log" table with a similar schema, but additional columns specifying when the change was made, who made the change, etc.
My programming model would be something like this:
row.foo = 1
row.log_version(username, change_description, ...)
Ideally, the system wouldn't allow the transaction to commit without row.log_version being called.
Thoughts?
There are too many questions in one, so they that full answers to all them won't fit StackOverflow answer format. I'll try to describe hints in short, so ask separate question for them if it's not enough.
Assigning user and description to transaction
The most popular way to do so is assigning user (and other info) to some global object (threading.local() in threaded application). This is very bad way, that causes hard to discover bugs.
A better way is assigning user to the session. This is OK when session is created for each web request (in fact, it's the best design for application with authentication anyway), since there is the only user using this session. But passing description this way is not as good.
And my favorite solution is to extent Session.commit() method to accept optional user (and probably other info) parameter and assign it current transaction. This is the most flexible, and it suites well to pass description too. Note that info is bound to single transaction and is passed in obvious way when transaction is closed.
Discovering changes
There is a sqlalchemy.org.attributes.instance_state(obj) contains all information you need. The most useful for you is probably state.committed_state dictionary which contains original state for changed fields (including many-to-many relations!). There is also state.get_history() method (or sqlalchemy.org.attributes.get_history() function) returning a history object with has_changes() method and added and deleted properties for new and old value respectively. In later case use state.manager.keys() (or state.manager.attributes) to get a list of all fields.
Automatically storing changes
SQLAlchemy supports mapper extension that can provide hooks before and after update, insert and delete. You need to provide your own extension with all before hooks (you can't use after since the state of objects is changed on flush). For declarative extension it's easy to write a subclass of DeclarativeMeta that adds a mapper extension for all your models. Note that you have to flush changes twice if you use mapped objects for log, since a unit of work doesn't account objects created in hooks.
We have a pretty comprehensive "versioning" recipe at http://www.sqlalchemy.org/trac/wiki/UsageRecipes/LogVersions . It seems some other users have contributed some variants on it. The mechanics of "add a row when something changes at the ORM level" are all there.
Alternatively you can also intercept at the execution level using ConnectionProxy, search through the SQLA docs for how to use that.
edit: versioning is now an example included with SQLA: http://docs.sqlalchemy.org/en/rel_0_8/orm/examples.html#versioned-objects