I've used MongoEngine a lot lately. Apart from the MongoDB integration, I like the idea of defining the structures of entities explicitly. Field definitions make code easier to understand. Also, using those definitions, I can validate objects to catch potential bugs or serialize/deserialize them more accurately.
The problem with MongoEngine is that it is designed specifically to work with a storage engine. The same applies to Django and SQLAlchemy models, which also lack list and set types. My question is, then, is there an object schema/model library for Python that does automated object validation and serialization, but not object-relational mapping or any other fancy stuff?
Let me give an example.
class Wheel(Entity):
    radius = FloatField(1.0)

class Bicycle(Entity):
    front = EntityField(Wheel)
    back = EntityField(Wheel)

class Owner(Entity):
    name = StringField()
    bicycles = ListField(EntityField(Bicycle))

owner = Owner(name='Eser Aygün', bicycles=[])

bmx = Bicycle()
bmx.front = Wheel()
bmx.back = Wheel()

trek = Bicycle()
trek.front = Wheel(1.2)
trek.back = Wheel(1.2)

owner.bicycles.append(bmx)
owner.bicycles.append(trek)

owner.validate()  # checks the structure recursively
Given the structure, it is also easy to serialize and deserialize objects. For example, owner.jsonify() may return the dictionary
{
    'name': 'Eser Aygün',
    'bicycles': [{
        'front': {'radius': 1.0},
        'back': {'radius': 1.0},
    }, {
        'front': {'radius': 1.2},
        'back': {'radius': 1.2},
    }],
}
and you can easily convert it back by calling owner.dejsonify(dic).
If anyone is still looking, as python-entities hasn't been updated for a while, there are some good libraries out there:
schematics - https://schematics.readthedocs.org/en/latest/
colander - http://docs.pylonsproject.org/projects/colander/en/latest/
voluptuous - https://pypi.python.org/pypi/voluptuous (more of a validation library)
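If it helps to compare, the question's Wheel/Bicycle example maps almost directly onto schematics. A minimal sketch (untested, based on the schematics documentation):

from schematics.models import Model
from schematics.types import FloatType, StringType
from schematics.types.compound import ListType, ModelType

class Wheel(Model):
    radius = FloatType(default=1.0)

class Bicycle(Model):
    front = ModelType(Wheel)
    back = ModelType(Wheel)

class Owner(Model):
    name = StringType()
    bicycles = ListType(ModelType(Bicycle))

owner = Owner({'name': 'Eser Aygün',
               'bicycles': [{'front': {'radius': 1.2},
                             'back': {'radius': 1.2}}]})
owner.validate()             # recursive structural validation
data = owner.to_primitive()  # plain dict, ready for json.dumps()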
Check out mongopersist, which uses MongoDB as a persistence layer for Python objects, much like ZODB. It does not perform schema validation, but it lets you move objects between MongoDB and Python transparently.
For validation or other serialization/deserialization scenarios (e.g. forms), consider colander. Colander project description:
Colander is useful as a system for validating and deserializing data obtained via XML, JSON, an HTML form post or any other equally simple data serialization. It runs on Python 2.6, 2.7 and 3.2. Colander can be used to:
Define a data schema.
Deserialize a data structure composed of strings, mappings, and lists into an arbitrary Python structure after validating the data structure against a data schema.
Serialize an arbitrary Python structure to a data structure composed of strings, mappings, and lists.
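To make the comparison concrete, here is a rough sketch of the bicycle example as a colander schema (illustrative only; the schema names are mine, while the MappingSchema/SchemaNode API is colander's own):

import colander

class WheelSchema(colander.MappingSchema):
    radius = colander.SchemaNode(colander.Float(), missing=1.0)

class BicycleSchema(colander.MappingSchema):
    front = WheelSchema()
    back = WheelSchema()

schema = BicycleSchema()
appstruct = schema.deserialize({'front': {'radius': '1.2'},
                                'back': {'radius': '1.2'}})
# appstruct == {'front': {'radius': 1.2}, 'back': {'radius': 1.2}}
cstruct = schema.serialize(appstruct)  # back to strings and mappings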
What you're describing can be achieved with remoteobjects, which contains a mechanism (called dataobject) that lets you define the structure of an object so that it can be validated and easily marshalled to and from JSON.
It also includes some functionality for building a REST client library that makes HTTP requests, but the use of this part is not required.
The main remoteobjects distribution does not come with specific StringField or IntegerField types, but it's easy enough to implement them. Here's an example BooleanField from a codebase I maintain that uses remoteobjects:
class BooleanField(dataobject.fields.Field):
    def encode(self, value):
        if value is not None and type(value) is not bool:
            raise TypeError("Requires boolean")
        return super(BooleanField, self).encode(value)
This can then be used in an object definition:
class ThingWithBoolean(dataobject.DataObject):
    my_boolean = BooleanField()
And then:
thing = ThingWithBoolean.from_dict({"my_boolean": True})
thing.my_boolean = "hello"
return json.dumps(thing.to_dict())  # will fail because my_boolean is not a boolean
As I said earlier in a comment, I've decided to invent my own wheel. I started implementing an open-source Python library, Entities, that just does what I wanted. You can check it out from https://github.com/eseraygun/python-entities/.
The library supports recursive and non-recursive collection types (list, set and dict), nested entities and reference fields. It can automatically validate, serialize, deserialize and generate hashable keys for entities of any complexity. (In fact, the de/serialization feature is not complete yet.)
This is how you use it:
from entities import *

class Account(Entity):
    id = IntegerField(group=PRIMARY)  # this field is in the primary key group
    iban = IntegerField(group=SECONDARY)  # this is in the secondary key group
    balance = FloatField(default=0.0)

class Name(Entity):
    first_name = StringField(group=SECONDARY)
    last_name = StringField(group=SECONDARY)

class Customer(Entity):
    id = IntegerField(group=PRIMARY)
    name = EntityField(Name, group=SECONDARY)
    accounts = ListField(ReferenceField(Account), default=list)

# Create Account objects.
a_1 = Account(1, 111, 10.0)  # __init__() recognizes positional arguments
a_2 = Account(id=2, iban=222, balance=20.0)  # as well as keyword arguments

# Generate a hashable key using the primary key.
print a_1.keyify()  # prints '(1,)'

# Generate a hashable key using the secondary key.
print a_2.keyify(SECONDARY)  # prints '(222,)'

# Create a Customer object.
c = Customer(1, Name('eser', 'aygun'))

# Generate a hashable key using the primary key.
print c.keyify()  # prints '(1,)'

# Generate a hashable key using the secondary key.
print c.keyify(SECONDARY)  # prints "(('eser', 'aygun'),)"

# Try validating an invalid object.
c.accounts.append(123)
try:
    c.validate()  # fails
except ValidationError:
    print 'accounts list is only for Account objects'

# Try validating a valid object.
c.accounts = [a_1, a_2]
c.validate()  # succeeds
I'm hoping there is a best practice or design pattern for the use case of simple nested data/mappings where I need to pre-store data, select the proper data group, and conveniently access specific values. An example is install-script parameters: the install script is given a type of install (a.k.a. a label), and based on that, the proper install folder and other parameter values should be evaluated.
Sample as a nested dict:

install_params = {
    'prod': {'folder_name': 'production', 'param_1': value_a, ...},
    'dev':  {'folder_name': 'development', 'param_1': value_b, ...},
    'test': {'folder_name': 'test', 'param_1': value_c, ...},
}
Nested dicts offer grouping of all data under the parent dict install_params, so all labels can be presented to the user with install_params.keys(), and the sole required functionality in this case, retrieving the proper values for a given installation type/label, is simple (install_params[user_selected_label]).
However, with numerous items in the parent dict, more nesting levels, or more complex structures as parameter values, this becomes cumbersome and error-prone.
I thought of rewriting it as either collections.namedtuple, typing.NamedTuple, @dataclass, or just a generic class, but none of these seems to offer a convenient solution to the generic use case above.
For example, a named tuple offers dot notation to access fields, but I would still have to group the instances somehow and implement the logic that selects the proper instance based on the label:
prod = InstallType('prod', 'prod', 'value_a')
dev = InstallType('dev', 'development', 'value_b')
test = InstallType('test', 'test', 'value_c')
all_types = [prod, dev, test]

if type == 'prod':  # select the prod named tuple
    ...
In the end, I started to think of creating a dictionary of named tuples, or a single data class that keeps the data mapping private and initializes itself based on the label input parameter. The latter would nicely encapsulate all functionality and allow dot notation, but it still uses ugly nested dicts inside:
@dataclass(frozen=True)
class SomeDataClass:
    label: str
    folder_name: str = None
    param_1: float = None

    def __post_init__(self):
        # dreaded nested dict to populate self.folder_name and others based on the submitted label
        (...)

install_params = SomeDataClass(label=user_selection)
print(install_params.folder_name)  # would work as expected since it is coded in SomeDataClass
This seemed promising, but SomeDataClass produced further problems: how to expose all available install labels (prod, dev, test, ...) to the user before instancing the class, and how to present fields as mandatory without requiring the user to submit them, since they should be computed from the submitted label (a plain folder_name: str still requires the user to submit it, while folder_name: str = None is confusing and suggests it is optional and can be set to None), and so on. It also feels like an over-engineered solution for simple nested dict mappings.
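For reference, the "single data class that keeps the mapping private" idea described above could be sketched like this (a minimal, hypothetical illustration; the registry dict and names are assumptions):

from dataclasses import dataclass, field

_INSTALL_PARAMS = {
    'prod': {'folder_name': 'production', 'param_1': 'value_a'},
    'dev':  {'folder_name': 'development', 'param_1': 'value_b'},
    'test': {'folder_name': 'test', 'param_1': 'value_c'},
}

@dataclass(frozen=True)
class InstallParams:
    label: str
    folder_name: str = field(init=False)  # computed, never user-supplied
    param_1: str = field(init=False)

    def __post_init__(self):
        # frozen dataclasses must bypass the normal setattr
        for name, value in _INSTALL_PARAMS[self.label].items():
            object.__setattr__(self, name, value)

    @staticmethod
    def available_labels():
        # expose the labels without instancing the class
        return list(_INSTALL_PARAMS)

install_params = InstallParams(label='prod')
print(install_params.folder_name)  # -> 'production'

This addresses the two concerns above: labels can be listed before instancing, and the computed fields are excluded from __init__ instead of being given misleading None defaults.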
I have made a really long form with the help of ColanderAlchemy and Deform.
This form has 100 or so fields, and currently the only way I know to add the data back to the database once the form is submitted is to explicitly re-define each variable and then add it to the database, but there must be a better way.
# my schema
class All(colander.MappingSchema):
    setup_schema(None, atr)
    atrschema = atr.__colanderalchemy__

    setup_schema(None, chemicals)
    chemicalsschema = chemicals.__colanderalchemy__

    setup_schema(None, data_aquisition)
    data_aquisitionschema = data_aquisition.__colanderalchemy__

    setup_schema(None, depositor)
    depositorschema = depositor.__colanderalchemy__

    setup_schema(None, dried_film)
    dried_filmschema = dried_film.__colanderalchemy__

form = All()
form = deform.Form(form, buttons=('submit',))
# this is how I get it to work by redefining each field, but there must be a better way
if 'submit' in request.POST:
    prism_material = request.params['prism_material']
    angle_of_incidence_degrees = request.params['angle_of_incidence_degrees']
    number_of_reflections = request.params['number_of_reflections']
    prism_size_mm = request.params['prism_size_mm']
    spectrometer_ID = 6
    page = atr(spectrometer_ID=spectrometer_ID,
               prism_size_mm=prism_size_mm,
               number_of_reflections=number_of_reflections,
               angle_of_incidence_degrees=angle_of_incidence_degrees,
               prism_material=prism_material)
    request.dbsession.add(page)
I would like to somehow just be able to remap all of that 'multi dictionary' that is returned back into the database.
So, you have a dict (request.params) and want to pass the key-value pairs from that dict to a function? Python has a way to do that using the **kwargs syntax:
if 'submit' in request.POST:
    page = Page(spectrometer_ID=6, **request.params)
    request.dbsession.add(page)
(This also works because SQLAlchemy provides a default constructor which assigns the passed values to the mapped columns, so there is no need to define one manually.)
Of course, this is a naive approach that works only for the simplest use cases: it may allow passing parameters not defined in your schema, which may create a security problem; the field names in your schema must match the field names in your SQLAlchemy model; and it may not work with lists (i.e., multiple values under the same name, which you can access via request.params.getall(name)).
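One way to mitigate the extra-parameters issue is to whitelist the incoming keys against the model's columns before unpacking. A sketch, assuming the atr model from the question:

allowed = {c.name for c in atr.__table__.columns}
params = {k: v for k, v in request.params.items() if k in allowed}
params.pop('spectrometer_ID', None)  # passed explicitly below
page = atr(spectrometer_ID=6, **params)
request.dbsession.add(page)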
I'm creating an API and I need to return data in a dictionary format so that it can be serialized (by the API mechanism).
The code that currently works is something as simple as:
def mymethod(self):
    queryset1 = MyClass.objects.get(...)  # contains 1 object, easy to deal with
    queryset2 = OtherClass.objects.filter(...)  # contains N objects, hard to deal with!
    return {
        'qs1_result': queryset1.some_attribute  # this works well
    }
Returning data from queryset1 is easy because there is 1 object; I just pick the attribute I need and it works. Now let's say that, in addition, I want to return data from queryset2, where there are many objects and I don't need every attribute of each object.
How would you do that?
I repeat that I do NOT need to do the serialization myself. I just need to return structured data so that the serialization can be done.
Thanks a lot.
From the Django docs: https://docs.djangoproject.com/en/dev/topics/serialization/#subset-of-fields
Subset of fields
If you only want a subset of fields to be serialized, you can specify a fields argument to the serializer:
from django.core import serializers
data = serializers.serialize('json', SomeModel.objects.all(), fields=('name','size'))
In this example, only the name and size attributes of each model will be serialized.
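Applied to the question's mymethod, the 'python' serializer variant may be closer to what is asked, since it yields plain dicts rather than a JSON string (a sketch; the field names are placeholders):

from django.core import serializers

def mymethod(self):
    queryset1 = MyClass.objects.get(...)
    queryset2 = OtherClass.objects.filter(...)
    return {
        'qs1_result': queryset1.some_attribute,
        # a list of dicts, one per object, restricted to the named fields
        'qs2_result': serializers.serialize('python', queryset2,
                                            fields=('name', 'size')),
    }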
I'm working on a multi-tenanted application in which some users can define their own data fields (via the admin) to collect additional data in forms and report on the data. The latter bit makes JSONField not a great option, so instead I have the following solution:
class CustomDataField(models.Model):
    """
    Abstract specification for arbitrary data fields.
    Not used for holding data itself, but metadata about the fields.
    """
    site = models.ForeignKey(Site, default=settings.SITE_ID)
    name = models.CharField(max_length=64)

    class Meta:
        abstract = True

class CustomDataValue(models.Model):
    """
    Abstract specification for arbitrary data.
    """
    value = models.CharField(max_length=1024)

    class Meta:
        abstract = True
Note how CustomDataField has a ForeignKey to Site - each Site will have a different set of custom data fields, but use the same database.
Then the various concrete data fields can be defined as:
class UserCustomDataField(CustomDataField):
    pass

class UserCustomDataValue(CustomDataValue):
    custom_field = models.ForeignKey(UserCustomDataField)
    user = models.ForeignKey(User, related_name='custom_data')

    class Meta:
        unique_together = (('user', 'custom_field'),)
This leads to the following use:
custom_field = UserCustomDataField.objects.create(name='zodiac', site=my_site)  # probably created in the admin
user = User.objects.create(username='foo')
user_sign = UserCustomDataValue(custom_field=custom_field, user=user, value='Libra')
user.custom_data.add(user_sign)  # actually, what does this even do?
But this feels very clunky, particularly with the need to manually create the related data and associate it with the concrete model. Is there a better approach?
Options that have been pre-emptively discarded:
Custom SQL to modify tables on-the-fly. Partly because this won't scale and partly because it's too much of a hack.
Schema-less solutions like NoSQL. I have nothing against them, but they're still not a good fit. Ultimately this data is typed, and the possibility exists of using a third-party reporting application.
JSONField, as listed above, as it's not going to work well with queries.
As of today, there are four available approaches, two of them requiring a certain storage backend:
Django-eav (the original package is no longer maintained but has some thriving forks)
This solution is based on the Entity-Attribute-Value data model; essentially, it uses several tables to store dynamic attributes of objects. The great parts about this solution are that it:
uses several pure and simple Django models to represent dynamic fields, which makes it simple to understand and database-agnostic;
allows you to effectively attach/detach dynamic attribute storage to a Django model with simple commands like:
eav.unregister(Encounter)
eav.register(Patient)
nicely integrates with the Django admin;
while at the same time being really powerful.
Downsides:
Not very efficient. This is more of a criticism of the EAV pattern itself, which requires manually merging the data from a column format to a set of key-value pairs in the model.
Harder to maintain. Maintaining data integrity requires a multi-column unique key constraint, which may be inefficient on some databases.
You will need to select one of the forks, since the official package is no longer maintained and there is no clear leader.
The usage is pretty straightforward:
import eav
from eav.models import Attribute, EnumValue, EnumGroup
from django.db.models import Q
from app.models import Patient, Encounter

eav.register(Encounter)
eav.register(Patient)

Attribute.objects.create(name='age', datatype=Attribute.TYPE_INT)
Attribute.objects.create(name='height', datatype=Attribute.TYPE_FLOAT)
Attribute.objects.create(name='weight', datatype=Attribute.TYPE_FLOAT)
Attribute.objects.create(name='city', datatype=Attribute.TYPE_TEXT)
Attribute.objects.create(name='country', datatype=Attribute.TYPE_TEXT)

yes = EnumValue.objects.create(value='yes')
no = EnumValue.objects.create(value='no')
unknown = EnumValue.objects.create(value='unknown')

ynu = EnumGroup.objects.create(name='Yes / No / Unknown')
ynu.enums.add(yes)
ynu.enums.add(no)
ynu.enums.add(unknown)

Attribute.objects.create(name='fever', datatype=Attribute.TYPE_ENUM,
                         enum_group=ynu)

# When you register a model with EAV,
# you can access all of its EAV attributes:
Patient.objects.create(name='Bob', eav__age=12,
                       eav__fever=no, eav__city='New York',
                       eav__country='USA')

# You can filter queries on EAV fields:
query1 = Patient.objects.filter(Q(eav__city__contains='Y'))
query2 = Q(eav__city__contains='Y') | Q(eav__fever=no)
Hstore, JSON or JSONB fields in PostgreSQL
PostgreSQL supports several more complex data types. Most are supported via third-party packages, but in recent years Django has adopted them into django.contrib.postgres.fields.
HStoreField:
Django-hstore was originally a third-party package, but Django 1.8 added HStoreField as a built-in, along with several other PostgreSQL-supported field types.
This approach is good in the sense that it lets you have the best of both worlds: dynamic fields and a relational database. However, hstore is not ideal performance-wise, especially if you end up storing thousands of items in one field. It also only supports string values.
# app/models.py
from django.contrib.postgres.fields import HStoreField

class Something(models.Model):
    name = models.CharField(max_length=32)
    data = HStoreField(db_index=True)
In Django's shell you can use it like this:
>>> instance = Something.objects.create(
...     name='something',
...     data={'a': '1', 'b': '2'}
... )
>>> instance.data['a']
'1'
>>> empty = Something.objects.create(name='empty')
>>> empty.data
{}
>>> empty.data['a'] = '1'
>>> empty.save()
>>> Something.objects.get(name='something').data['a']
'1'
You can issue indexed queries against hstore fields:
# equivalence
Something.objects.filter(data={'a': '1', 'b': '2'})
# subset by key/value mapping
Something.objects.filter(data__a='1')
# subset by list of keys
Something.objects.filter(data__has_keys=['a', 'b'])
# subset by single key
Something.objects.filter(data__has_key='a')
JSONField:
JSON/JSONB fields support any JSON-encodable data type, not just key/value pairs, but also tend to be faster and (for JSONB) more compact than Hstore.
Several packages implement JSON/JSONB fields including django-pgfields, but as of Django 1.9, JSONField is a built-in using JSONB for storage.
JSONField is similar to HStoreField, and may perform better with large dictionaries. It also supports types other than strings, such as integers, booleans and nested dictionaries.
# app/models.py
from django.contrib.postgres.fields import JSONField

class Something(models.Model):
    name = models.CharField(max_length=32)
    data = JSONField(db_index=True)
Creating in the shell:
>>> instance = Something.objects.create(
...     name='something',
...     data={'a': 1, 'b': 2, 'nested': {'c': 3}}
... )
Indexed queries are nearly identical to HStoreField, except that nesting is possible. Complex indexes may require manual creation (or a scripted migration; see the sketch after the examples below).
>>> Something.objects.filter(data__a=1)
>>> Something.objects.filter(data__nested__c=3)
>>> Something.objects.filter(data__has_key='a')
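For instance, a GIN index on the JSONB column can be created in a raw-SQL migration (a sketch; the app, table and index names are assumptions):

from django.db import migrations

class Migration(migrations.Migration):

    dependencies = [('app', '0002_something_data')]

    operations = [
        migrations.RunSQL(
            'CREATE INDEX something_data_gin ON app_something USING gin (data);',
            'DROP INDEX something_data_gin;',
        ),
    ]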
Django MongoDB
Or other NoSQL Django adaptations -- with them you can have fully dynamic models.
NoSQL Django libraries are great, but keep in mind that they are not 100% Django-compatible. For example, to migrate to Django-nonrel from standard Django you will need to replace ManyToMany with ListField, among other things.
Check out this Django MongoDB example:
from djangotoolbox.fields import DictField

class Image(models.Model):
    exif = DictField()
    ...

>>> image = Image.objects.create(exif=get_exif_data(...))
>>> image.exif
{u'camera_model': 'Spamcams 4242', 'exposure_time': 0.3, ...}
You can even create embedded lists of any Django models:
class Container(models.Model):
    stuff = ListField(EmbeddedModelField())

class FooModel(models.Model):
    foo = models.IntegerField()

class BarModel(models.Model):
    bar = models.CharField()

...

>>> Container.objects.create(
...     stuff=[FooModel(foo=42), BarModel(bar='spam')]
... )
Django-mutant: Dynamic models based on syncdb and South-hooks
Django-mutant implements fully dynamic foreign key and m2m fields, and is inspired by the incredible but somewhat hackish solutions by Will Hardy and Michael Hall.
All of these are based on Django South hooks, which, according to Will Hardy's talk at DjangoCon 2011 (watch it!) are nevertheless robust and tested in production (relevant source code).
The first to implement this was Michael Hall.
Yes, this is magic: with these approaches you can achieve fully dynamic Django apps, models and fields with any relational database backend. But at what cost? Will the stability of the application suffer upon heavy use? These are questions to consider. You also need to be sure to maintain a proper lock in order to allow simultaneous database-altering requests.
If you are using Michael Hall's lib, your code will look like this:
from dynamo import models

test_app, created = models.DynamicApp.objects.get_or_create(
    name='dynamo'
)
test, created = models.DynamicModel.objects.get_or_create(
    name='Test',
    verbose_name='Test Model',
    app=test_app
)
foo, created = models.DynamicModelField.objects.get_or_create(
    name='foo',
    verbose_name='Foo Field',
    model=test,
    field_type='dynamiccharfield',
    null=True,
    blank=True,
    unique=False,
    help_text='Test field for Foo',
)
bar, created = models.DynamicModelField.objects.get_or_create(
    name='bar',
    verbose_name='Bar Field',
    model=test,
    field_type='dynamicintegerfield',
    null=True,
    blank=True,
    unique=False,
    help_text='Test field for Bar',
)
I've been working on pushing the django-dynamo idea further. The project is still undocumented but you can read the code at https://github.com/charettes/django-mutant.
Actually, FK and M2M fields (see contrib.related) also work, and it's even possible to define wrappers for your own custom fields.
There's also support for model options such as unique_together and ordering, plus model bases, so you can subclass model proxies, abstract models or mixins.
I'm actually working on a not-in-memory lock mechanism to make sure model definitions can be shared across multiple running Django instances while preventing them from using obsolete definitions.
The project is still very alpha, but it's a cornerstone technology for one of my projects, so I'll have to take it to production-ready. The big plan is to support django-nonrel too, so we can leverage the MongoDB driver.
Further research reveals that this is a somewhat special case of the Entity-Attribute-Value design pattern, which has been implemented for Django by a couple of packages.
First, there's the original eav-django project, which is on PyPi.
Second, there's a more recent fork of the first project, django-eav, which is primarily a refactor to allow the use of EAV with Django's own models or models in third-party apps.
Is there a way to perform validation on an object after (or as) the properties are set but before the session is committed?
For instance, I have a domain model Device that has a mac property. I would like to ensure that the mac property contains a valid and sanitized mac value before it is added to or updated in the database.
It looks like the Pythonic approach is to do most things as properties (including SQLAlchemy). If I had coded this in PHP or Java, I would probably have opted to create getter/setter methods to protect the data and give me the flexibility to handle this in the domain model itself.
public function mac() { return $this->mac; }

public function setMac($mac) {
    return $this->mac = $this->sanitizeAndValidateMac($mac);
}

public function sanitizeAndValidateMac($mac) {
    if ( ! preg_match(self::$VALID_MAC_REGEX, $mac) ) {
        throw new InvalidMacException($mac);
    }
    return strtolower($mac);
}
What is a Pythonic way to handle this type of situation using SQLAlchemy?
(While I'm aware that validation should also be handled elsewhere (i.e., in the web framework), I would like to figure out how to handle some of these domain-specific validation rules, as they are bound to come up frequently.)
UPDATE
I know that I could use property to do this under normal circumstances. The key part is that I am using SQLAlchemy with these classes. I do not understand exactly how SQLAlchemy is performing its magic but I suspect that creating and overriding these properties on my own could lead to unstable and/or unpredictable results.
You can add data validation inside your SQLAlchemy classes using the @validates() decorator.
From the docs - Simple Validators:
An attribute validator can raise an exception, halting the process of mutating the attribute’s value, or can change the given value into something different.
from sqlalchemy.orm import validates

class EmailAddress(Base):
    __tablename__ = 'address'

    id = Column(Integer, primary_key=True)
    email = Column(String)

    @validates('email')
    def validate_email(self, key, address):
        # you can use assertions, such as
        # assert '@' in address
        # or raise an exception:
        if '@' not in address:
            raise ValueError('Email address must contain an @ sign.')
        return address
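Applied to the Device/mac case from the question, a sketch might look like this (the regex and column definitions are assumptions, not from the original post):

import re
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import validates

MAC_RE = re.compile(r'^([0-9a-f]{2}:){5}[0-9a-f]{2}$')

class Device(Base):
    __tablename__ = 'device'

    id = Column(Integer, primary_key=True)
    mac = Column(String(17))

    @validates('mac')
    def validate_mac(self, key, mac):
        mac = mac.lower()  # sanitize first, then validate
        if not MAC_RE.match(mac):
            raise ValueError('invalid MAC address: %r' % mac)
        return mac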
Yes. This can be done nicely using a MapperExtension.
# uses SQLAlchemy hooks to run model-specific validators before update and insert
class ValidationExtension(sqlalchemy.orm.interfaces.MapperExtension):

    def before_update(self, mapper, connection, instance):
        """Note: not every instance passed here is actually updated in the db, see
        http://www.sqlalchemy.org/docs/reference/orm/interfaces.html?highlight=mapperextension#sqlalchemy.orm.interfaces.MapperExtension.before_update"""
        instance.validate()
        return sqlalchemy.orm.interfaces.MapperExtension.before_update(self, mapper, connection, instance)

    def before_insert(self, mapper, connection, instance):
        instance.validate()
        return sqlalchemy.orm.interfaces.MapperExtension.before_insert(self, mapper, connection, instance)

sqlalchemy.orm.mapper(model, table, extension=ValidationExtension(), **mapper_args)
You may want to check the before_update reference, because not every instance passed there is actually updated in the db.
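Note that MapperExtension was deprecated in favour of the event system in later SQLAlchemy versions. Roughly the same hook, sketched with events (MyModel stands in for your mapped class, which is assumed to expose the same validate() method):

from sqlalchemy import event

def _validate(mapper, connection, instance):
    instance.validate()

# the same validation, attached per mapped class
event.listen(MyModel, 'before_insert', _validate)
event.listen(MyModel, 'before_update', _validate)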
"It looks like the Pythonic approach is to do most things as properties"
It varies, but that's close.
"If I had coded this in PHP or Java, I would probably have opted to create getter/setter methods..."
Good. That's Pythonic enough. Your getter and setter functions are bound up in a property; that's pretty good.
What's the question?
Are you asking how to spell property?
However, "transparent validation" -- if I read your example code correctly -- may not really be all that good an idea.
Your model and your validation should probably be kept separate. It's common to have multiple validations for a single model. For some users, fields are optional, fixed or not used; this leads to multiple validations.
You'll be happier following the Django design pattern of using a Form for validation, separate from the model.
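A bare-bones sketch of that pattern, with the validation living in a form rather than the model (field and class names are illustrative):

from django import forms

class DeviceForm(forms.Form):
    mac = forms.RegexField(regex=r'^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$')

    def clean_mac(self):
        # runs after the regex check; normalize the accepted value
        return self.cleaned_data['mac'].lower()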