I want to be able to take some Python object (more precisely, a dataclass) and dump it to its dict representation using a schema. Let me give you an example:
from marshmallow import Schema, fields
import dataclasses

@dataclasses.dataclass
class Foo:
    x: int
    y: int
    z: int

class FooSchema(Schema):
    x = fields.Int()
    y = fields.Int()

FooSchema().dump(Foo(1, 2, 3))
As you can see, the schema differs from the Foo definition. I want to somehow be able to recognize this when dumping, so that I would get some ValidationError with an explanation that there's an extra field z. It doesn't really have to be .dump(); I looked at .load() and .validate(), but only .dump() seems to accept objects rather than just dicts.
Is there a way to do this in marshmallow? Because for now when I do this dump, I just get a dictionary {"x": 1, "y": 2}, without z of course, but with no errors whatsoever. And I would want the same behavior for the opposite case, when a key is missing from the dumped object (e.g. z was in the schema but not in Foo). This would basically serve me as a sanity check of changes made to the classes themselves. Maybe, if it's not possible in marshmallow, you know some lib/technique that makes it so?
So I had this problem today and did some digging. Based on https://github.com/marshmallow-code/marshmallow/issues/1545, it's something people are considering, but the current implementation of dump iterates through the fields listed out in the schema definition, so it won't work.
The best I could get to work was:
from marshmallow import EXCLUDE, INCLUDE, RAISE, Schema, fields

class FooSchema(Schema):
    class Meta:
        unknown = INCLUDE

    x = fields.Int()
    y = fields.Int()
Which at least sort of displays as a dict.
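If all you need is the sanity check, one technique that works outside marshmallow's dump machinery is to compare the dataclass's declared fields against the schema's declared fields before dumping. A minimal sketch, reusing Foo and FooSchema from the question (check_schema_matches is a hypothetical helper, not marshmallow API):

import dataclasses
from marshmallow import Schema

def check_schema_matches(schema: Schema, obj) -> None:
    obj_fields = {f.name for f in dataclasses.fields(obj)}
    schema_fields = set(schema.fields)    # declared field names on the schema
    extra = obj_fields - schema_fields    # on the object but missing from the schema
    missing = schema_fields - obj_fields  # in the schema but missing from the object
    if extra or missing:
        raise ValueError(f"schema mismatch: extra={extra}, missing={missing}")

check_schema_matches(FooSchema(), Foo(1, 2, 3))  # raises: extra={'z'}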
I am trying to set a super-class field in a subclass using a validator, as follows:
Approach 1
from typing import List
from pydantic import BaseModel, validator, root_validator

class ClassSuper(BaseModel):
    field1: int = 0

class ClassSub(ClassSuper):
    field2: List[int]

    @validator('field1')
    def validate_field1(cls, v, values):
        return len(values["field2"])

sub = ClassSub(field2=[1, 2, 3])
print(sub.field1)  # It prints 0, but expected it to print 3
If I run the code above, it prints 0, but I expected it to print 3 (which is basically len(field2)). However, if I use @root_validator() instead, I get the expected result.
Approach 2
from typing import List
from pydantic import BaseModel, validator, root_validator

class ClassSuper(BaseModel):
    field1: int = 0

class ClassSub(ClassSuper):
    field2: List[int]

    @root_validator()
    def validate_field1(cls, values):
        values["field1"] = len(values["field2"])
        return values

sub = ClassSub(field2=[1, 2, 3])
print(sub.field1)  # This prints 3, as expected
I am new to using pydantic and a bit puzzled about what I am doing wrong with Approach 1. Thank you for your help.
The reason your Approach 1 does not work is that, by default, validators for a field are not called when no value is supplied for that field (see docs).
Your validate_field1 is never even called. If you add always=True to your @validator, the method is called, even if you don't provide a value for field1.
However, if you try that, you'll see that it will still not work, but instead throw an error about the key "field2" not being present in values.
This in turn is due to the fact that validators are called in the order they were defined. In this case, field1 is defined before field2, which means that field2 is not yet validated by the time validate_field1 is called. And values only contains previously-validated fields (see docs). Thus, at the time validate_field1 is called, values is simply an empty dictionary.
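To illustrate, here is a sketch of that (still failing) intermediate attempt, using the pydantic v1 API from the question:

from typing import List
from pydantic import BaseModel, validator

class ClassSuper(BaseModel):
    field1: int = 0

class ClassSub(ClassSuper):
    field2: List[int]

    @validator('field1', always=True)
    def validate_field1(cls, v, values):
        # field1 is defined before field2, so field2 has not been validated
        # yet and is absent from `values` -- this raises KeyError: 'field2'
        return len(values["field2"])

ClassSub(field2=[1, 2, 3])  # fails with the "field2" error described above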
Using the @root_validator is the correct approach here because it receives the entire model's data, regardless of whether or not field values were supplied explicitly or by default.
And just as a side note: if you don't need to specify any parameters for it, you can use @root_validator without the parentheses.
And as another side note: If you are using Python 3.9+, you can use the regular list class as the type annotation. (See standard generic alias types) That means field2: list[int] without the need for typing.List.
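Putting both side notes together, the subclass could be written like this (a sketch; requires Python 3.9+ and pydantic v1, reusing ClassSuper from above):

from pydantic import BaseModel, root_validator

class ClassSuper(BaseModel):
    field1: int = 0

class ClassSub(ClassSuper):
    field2: list[int]  # built-in generic; no typing.List needed

    @root_validator  # no parentheses needed when passing no parameters
    def validate_field1(cls, values):
        values["field1"] = len(values["field2"])
        return values

print(ClassSub(field2=[1, 2, 3]).field1)  # 3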
Hope this helps.
I've been trying to implement an 'abstract' schema class that will automatically convert values in CamelCase (serialized) to snake_case (deserialized).
import marshmallow

class CamelCaseSchema(marshmallow.Schema):
    @marshmallow.pre_load
    def camel_to_snake(self, data):
        # utils.CaseConverter is my own helper module
        return {
            utils.CaseConverter.camel_to_snake(key): value
            for key, value in data.items()
        }

    @marshmallow.post_dump
    def snake_to_camel(self, data):
        return {
            utils.CaseConverter.snake_to_camel(key): value
            for key, value in data.items()
        }
While using something like this works nicely, it does not achieve everything that applying load_from and dump_to to a field does. Namely, it fails to provide the correct field names when there's an issue with deserialization. For instance, I get:
{'something_id': [u'Not a valid integer.']} instead of {'somethingId': [u'Not a valid integer.']}.
While I can post-process these emitted errors, this seems like an unnecessary coupling that I wish to avoid if I'm to make the use of schema fully transparent.
Any ideas? I tried tackling the metaclasses involved, but the complexity was a bit overwhelming and everything seemed exceptionally ugly.
You're using marshmallow 2. Marshmallow 3 is now out and I recommend using it. My answer will apply to marshmallow 3.
In marshmallow 3, load_from / dump_to have been replaced by a single attribute: data_key.
You'd need to alter data_key in each field when instantiating the schema. This happens after field instantiation, but I don't think it matters.
You want to do that as early as possible after the schema is instantiated, to avoid inconsistency issues. The ideal moment would be in the middle of Schema._init_fields, before the data_key attributes are checked for consistency, but duplicating that method would be a pity. Besides, due to the nature of the camel/snake case conversion, the consistency checks can be applied before the conversion anyway.
And since _init_fields is private API, I'd recommend doing the modification at the end of __init__.
from marshmallow import Schema

class CamelCaseSchema(Schema):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        for field_name, field in self.fields.items():
            # set data_key on each field instance, not on the fields module
            field.data_key = utils.CaseConverter.snake_to_camel(field_name)
I didn't try that but I think it should work.
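A quick way to verify the error keys would be something like this (a sketch; snake_to_camel stands in for utils.CaseConverter, which is your own helper):

from marshmallow import Schema, fields

def snake_to_camel(name):
    first, *rest = name.split('_')
    return first + ''.join(part.title() for part in rest)

class CamelCaseSchema(Schema):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        for field_name, field in self.fields.items():
            field.data_key = snake_to_camel(field_name)

class SomethingSchema(CamelCaseSchema):
    something_id = fields.Int()

print(SomethingSchema().validate({"somethingId": "foo"}))
# {'somethingId': ['Not a valid integer.']} -- errors now use the camelCase key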
I wonder: if I use class variables as keys in a dictionary, can it be considered good practice?
For example, this is an object model in peewee.
from peewee import CharField, IntegerField, Model

class Abc(Model):
    name = CharField()
    age = IntegerField()
Then I declare a configuration dict for it like this:

conf = {
    Abc.name: Config(listable=False, format_on_edit=True),
    Abc.age: Config(),
}
I don't want to use strings like 'name' or 'age' as dict keys because I'm afraid of mistyping, and I want to make sure that the object/model field is valid.
I see SQLAlchemy and Peewee using conditions like where(Abc.name == 'abc') or filter(User.age == 25), not where('name', 'abc') like many ORMs in Go or PHP, since those languages don't have class variables.
It's quite good for preventing mistyping.
I've tried hash(Abc.name) and it works. So are class variables immutable enough to be used as dict keys or not?
You may safely use them. Peewee replaces the field instances you declare as model attributes with special Descriptor objects (which then expose the underlying field instance, which is hashable).
For instance, when performing an insert or update, you can specify the data using the fields as keys:
User.insert({User.username: 'charlie'}).execute()
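And as a quick check of the dict use case from the question (a sketch reusing Abc and your own Config class):

conf = {Abc.name: Config(listable=False, format_on_edit=True), Abc.age: Config()}
assert Abc.name in conf                  # membership tests work
assert conf[Abc.name].listable is False  # lookup by field object works
assert hash(Abc.name) == hash(Abc.name)  # the hash is stable across accesses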
Is there a more elegant/faster way to do this?
Here are my models:
from django.db.models import CASCADE, ForeignKey, ManyToManyField, Model

class X(Model):
    ...

class A(Model):
    xs = ManyToManyField(X)

class B(A):
    ...

class C(A):  # NOTE: not relevant here
    ...

class Y(Model):
    b = ForeignKey(B, on_delete=CASCADE)  # Django's many-to-one field
Here is what I want to do:
def foo(x):
    # NOTE: b's pk (b.a_ptr) is actually a's pk (a.id)
    return Y.objects.filter(b__in=x.a_set.all())
But it returns the following error:
<repr(<django.db.models.query.QuerySet at 0x7f6f2c159a90>) failed:
django.core.exceptions.FieldError: Cannot resolve keyword u'a_ptr'
into field. Choices are: ...enumerate all fields of A...>
And here is what I'm doing right now in order to minimize the queries:
def foo(x):
    a_set = x.a_set
    a_set.model = B  # NOTE: force the model to be B
    return Y.objects.filter(b__in=a_set.all())
It works but it's not one line. It would be cool to have something like this:
def foo(x):
    return Y.objects.filter(b__in=x.a_set._as(B).all())
The simpler way in Django seems to be the following:
def foo(x):
    return Y.objects.filter(b__in=B.objects.filter(pk__in=x.a_set.all()))
...but this makes useless sub-queries in SQL:
SELECT Y.* FROM Y WHERE Y.b IN (
SELECT B.a_ptr FROM B WHERE B.a_ptr IN (
SELECT A.id FROM A WHERE ...));
This is what I want:
SELECT Y.* FROM Y WHERE Y.b IN (SELECT A.id FROM A WHERE ...);
(This SQL example is a bit simplified, because the relation between A and X is actually a ManyToMany table; I substituted it with SELECT A.id FROM A WHERE ... for clarity's sake.)
You might get a better result if you follow the relationship one more step, rather than using in.
Y.objects.filter(b__xs=x)
I know next to nothing about databases, but there are many warnings in the Django literature about concrete inheritance (class B(A)) causing database inefficiency because of repeated INNER JOIN operations.
The underlying problem is that relational databases aren't object-orientated.
If you use Postgres, then Django has native support (from 1.8) for Hstore fields, which are searchable but schema-less collections of keyword-value pairs. I intend to use one of these in my base class (your A) to eliminate any need for derived classes (your B and C). Instead, I'll maintain a type field in A (equal to "B" or "C" in your example) and use that and/or careful key-presence checking to access subfields of the Hstore field which are "guaranteed" to exist if (say) instance.type == "C" (a constraint that isn't enforced at the DB integrity level). A sketch of the idea follows.
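A hedged sketch of that idea (requires Postgres, 'django.contrib.postgres' in INSTALLED_APPS, and the hstore extension enabled; model and key names are just illustrative):

from django.contrib.postgres.fields import HStoreField
from django.db import models

class A(models.Model):
    type = models.CharField(max_length=20)  # e.g. "B" or "C"
    extra = HStoreField(default=dict)       # schema-less string key/value pairs

# Subfield access guarded by a type check, since the DB doesn't enforce it:
# if instance.type == "C": depth = instance.extra.get("depth")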
I also discovered the django-hstore module which existed prior to Django 1.8, and which has even greater functionality. I've yet to do anything with it so cannot comment further.
I also found references to wide tables in the literature, where you simply define a model A with a potentially large number of fields which contain useful data only if the object is of type B, and yet more fields for types C, D, ... I don't have the DB knowledge to assess the cost of lots of blank or null fields attached to every row of a table. Clearly, if you have a million sheep records, one pedigree-racehorse record and one stick-insect record as "subclasses" of Animal in the same wide table, it gets silly.
I'd appreciate feedback on either idea if you have the DB understanding that I lack.
Oh, and for completeness, there's the django-polymorphic module, but that builds on concrete inheritance with its attendant inefficiencies.
I've used MongoEngine a lot lately. Apart from the MongoDB integration, I like the idea of defining the structures of entities explicitly. Field definitions make code easier to understand. Also, using those definitions, I can validate objects to catch potential bugs or serialize/deserialize them more accurately.
The problem with MongoEngine is that it is designed specifically to work with a storage engine. The same applies for Django and SQLAlchemy models, which also lack list and set types. My question is, then, is there an object schema/model library for Python that does automated object validation and serialization, but not object-relational mapping or any other fancy stuff?
Let me give an example.
class Wheel(Entity):
    radius = FloatField(1.0)

class Bicycle(Entity):
    front = EntityField(Wheel)
    back = EntityField(Wheel)

class Owner(Entity):
    name = StringField()
    bicycles = ListField(EntityField(Bicycle))
owner = Owner(name='Eser Aygün', bicycles=[])

bmx = Bicycle()
bmx.front = Wheel()
bmx.back = Wheel()

trek = Bicycle()
trek.front = Wheel(1.2)
trek.back = Wheel(1.2)

owner.bicycles.append(bmx)
owner.bicycles.append(trek)

owner.validate()  # checks the structure recursively
Given the structure, it is also easy to serialize and deserialize objects. For example, owner.jsonify() may return the dictionary
{
    'name': 'Eser Aygün',
    'bicycles': [{
        'front': {'radius': 1.0},
        'back': {'radius': 1.0},
    }, {
        'front': {'radius': 1.2},
        'back': {'radius': 1.2},
    }],
}
and you can easily convert it back by calling owner.dejsonify(dic).
If anyone is still looking, as python-entities hasn't been updated for a while, there are some good libraries out there:
schematics - https://schematics.readthedocs.org/en/latest/
colander - http://docs.pylonsproject.org/projects/colander/en/latest/
voluptuous - https://pypi.python.org/pypi/voluptuous (more of a validation library)
Check out mongopersist, which uses mongo as a persistence layer for Python objects like ZODB. It does not perform schema validation, but it lets you move objects between Mongo and Python transparently.
For validation or other serialization/deserialization scenarios (e.g. forms), consider colander. Colander project description:
Colander is useful as a system for validating and deserializing data obtained via XML, JSON, an HTML form post or any other equally simple data serialization. It runs on Python 2.6, 2.7 and 3.2. Colander can be used to:
Define a data schema.
Deserialize a data structure composed of strings, mappings, and lists into an arbitrary Python structure after validating the data structure against a data schema.
Serialize an arbitrary Python structure to a data structure composed of strings, mappings, and lists.
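For a taste of colander's declarative style, here is a sketch of the Wheel/Bicycle example from the question (based on colander's documented API; untested here):

import colander

class Wheel(colander.MappingSchema):
    radius = colander.SchemaNode(colander.Float(), missing=1.0)

class Bicycle(colander.MappingSchema):
    front = Wheel()
    back = Wheel()

# Strings are coerced to floats; absent keys fall back to `missing`.
print(Bicycle().deserialize({'front': {'radius': '1.2'}, 'back': {}}))
# {'front': {'radius': 1.2}, 'back': {'radius': 1.0}}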
What you're describing can be achieved with remoteobjects, which contains a mechanism (called dataobject) to allow you to define the structure of an object such that it can be validated and it can be marshalled easily to and from JSON.
It also includes some functionality for building a REST client library that makes HTTP requests, but the use of this part is not required.
The main remoteobjects distribution does not come with specific StringField or IntegerField types, but it's easy enough to implement them. Here's an example BooleanField from a codebase I maintain that uses remoteobjects:
class BooleanField(dataobject.fields.Field):
    def encode(self, value):
        if value is not None and type(value) is not bool:
            raise TypeError("Requires boolean")
        return super(BooleanField, self).encode(value)
This can then be used in an object definition:
class ThingWithBoolean(dataobject.DataObject):
    my_boolean = BooleanField()
And then:
thing = ThingWithBoolean.from_dict({"my_boolean": True})
thing.my_boolean = "hello"
return json.dumps(thing.to_dict())  # will fail because my_boolean is not a boolean
As I said earlier in a comment, I've decided to invent my own wheel. I started implementing an open-source Python library, Entities, that just does what I wanted. You can check it out from https://github.com/eseraygun/python-entities/.
The library supports recursive and non-recursive collection types (list, set and dict), nested entities and reference fields. It can automatically validate, serialize, deserialize and generate hashable keys for entities of any complexity. (In fact, the de/serialization feature is not complete yet.)
This is how you use it:
from entities import *

class Account(Entity):
    id = IntegerField(group=PRIMARY)      # this field is in the primary key group
    iban = IntegerField(group=SECONDARY)  # this one is in the secondary key group
    balance = FloatField(default=0.0)

class Name(Entity):
    first_name = StringField(group=SECONDARY)
    last_name = StringField(group=SECONDARY)

class Customer(Entity):
    id = IntegerField(group=PRIMARY)
    name = EntityField(Name, group=SECONDARY)
    accounts = ListField(ReferenceField(Account), default=list)

# Create Account objects.
a_1 = Account(1, 111, 10.0)                  # __init__() recognizes positional arguments
a_2 = Account(id=2, iban=222, balance=20.0)  # as well as keyword arguments

# Generate hashable keys using the primary and secondary keys.
print(a_1.keyify())           # prints '(1,)'
print(a_2.keyify(SECONDARY))  # prints '(222,)'

# Create a Customer object.
c = Customer(1, Name('eser', 'aygun'))

# Generate hashable keys using the primary and secondary keys.
print(c.keyify())           # prints '(1,)'
print(c.keyify(SECONDARY))  # prints "(('eser', 'aygun'),)"

# Try validating an invalid object.
c.accounts.append(123)
try:
    c.validate()  # fails
except ValidationError:
    print('accounts list is only for Account objects')

# Try validating a valid object.
c.accounts = [a_1, a_2]
c.validate()  # succeeds