I am trying to use pydantic to validate JSON that is returned in a "parallel" array format: there is an array defining the column names/types, followed by an array of "rows" (similar to how pandas handles df.to_json(orient='split')):
{
  "columns": [
    "sensor",
    "value",
    "range"
  ],
  "data": [
    [
      "a",
      1,
      {"low": 1, "high": 2}
    ],
    [
      "b",
      2,
      {"low": 0, "high": 2}
    ]
  ]
}
I know that I could do this:
class ValueRange(BaseModel):
    low: int
    high: int

class Response(BaseModel):
    columns: Tuple[Literal['sensor'], Literal['value'], Literal['range']]
    data: List[Tuple[str, int, ValueRange]]
But this has a few downsides:
1. After parsing, it doesn't allow for an association of the data with the column names, so you have to do everything by index. Ideally, I would like to parse a response into a List[Row] and then be able to do things like response.data[0].sensor.
2. It hardcodes the column order.
3. It doesn't allow for responses with variable columns. For example, the same endpoint could also return the following:
{
  "columns": ["sensor", "value"],
  "data": [
    ["a", 1],
    ["b", 2]
  ]
}
At first I thought that I could use pydantic's discriminated unions, but I'm not seeing how to do this across arrays.
Anyone know of the best approach for validating this type of data? (I'm currently using pydantic, but am open to other libraries if it makes sense).
Thanks!
TL;DR
A very interesting use case for a custom validator.
from collections.abc import Sequence
from typing import Literal, Optional

from pydantic import BaseModel, validator
from pydantic.fields import Field

Column = Literal["sensor", "value", "range"]

class ValueRange(BaseModel):
    low: int
    high: int

class DataPoint(BaseModel):
    sensor: str
    value: int
    range: Optional[ValueRange]

class Response(BaseModel):
    columns: Optional[tuple[Column, ...]] = Field(exclude=True, repr=False)
    data: list[DataPoint]

    @validator("columns", pre=True)
    def ensure_distinct(cls, v: object) -> object:
        if isinstance(v, Sequence) and len(v) != len(set(v)):
            raise ValueError("`columns` must all be distinct")
        return v

    @validator("data", pre=True, each_item=True)
    def parse_nested_sequence(
        cls,
        v: object,
        values: dict[str, object],
    ) -> object:
        if not isinstance(v, Sequence):
            return v
        columns = values.get("columns")
        if not isinstance(columns, Sequence):
            raise TypeError(
                "If `data` items are provided as sequences, "
                "`columns` must be present as a sequence."
            )
        if len(columns) != len(v):
            raise ValueError(
                "`data` item must be the same length as `columns`"
            )
        return dict(zip(columns, v))
Explanation
Schema
First we need to set up the models to reflect the schema we want to have after parsing and validation are complete.
Since you mentioned that you would like the data field in the response model to be a list of model instances corresponding to a certain schema, we need to define that schema. From your example it seems to need at least the fields sensor, value and range. The range field should be its own model once again. You also mentioned that range should be optional, so we'll encode that too.
The actual top-level Response model will still have the columns field because we will need that for validation later on. But we can hide that field from the string representation and the exporter methods like dict and json. To communicate the variable number of elements in that columns tuple, we'll change the annotation a bit.
Here are the models I would suggest:
from collections.abc import Sequence
from typing import Literal, Optional

from pydantic import BaseModel, validator
from pydantic.fields import Field

Column = Literal["sensor", "value", "range"]

class ValueRange(BaseModel):
    low: int
    high: int

class DataPoint(BaseModel):
    sensor: str
    value: int
    range: Optional[ValueRange]

class Response(BaseModel):
    columns: Optional[tuple[Column, ...]] = Field(exclude=True, repr=False)
    data: list[DataPoint]

    ...  # more code
Field types and parameters
To hide columns we can use the appropriate parameters of the Field constructor. Making it Optional (default will be None) means we could still initialize an instance of Response without it, but we would then of course be forced to provide the data list in the correct format (as dictionaries or DataPoint instances).
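For example, with the models above, a Response can be built without columns as long as the data items are already mappings (or DataPoint instances):

# `columns` is simply omitted (it defaults to None); each `data` item is a dict,
# so the custom validator introduced below passes it through to regular field validation.
response = Response(data=[{"sensor": "a", "value": 1}])
print(response.data[0].sensor)  # a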
Since we used tuple[Column, ...] as the annotation for columns, the elements could be in any order, which is what you wanted, but it could also theoretically be of arbitrary length and contain a bunch of duplicates. The Python typing system does not provide any elegant tool to define the type in a way that indicates that all elements of the tuple must be distinct. We could of course construct a huge type union of all permutations of the literal values that should be valid, but this is hardly practical.
Validators
Instead I would suggest a very simple validator to do this check for us:
...

class Response(BaseModel):
    columns: Optional[tuple[Column, ...]] = Field(exclude=True, repr=False)
    data: list[DataPoint]

    @validator("columns", pre=True)
    def ensure_distinct(cls, v: object) -> object:
        if isinstance(v, Sequence) and len(v) != len(set(v)):
            raise ValueError("`columns` must all be distinct")
        return v

    ...  # more code
The pre=True is actually important here, but only in conjunction with the second validator. That one is much more interesting because it is supposed to bring those columns together with the sequences of data. Here is what I propose:
...

class Response(BaseModel):
    columns: Optional[tuple[Column, ...]] = Field(exclude=True, repr=False)
    data: list[DataPoint]

    @validator("columns", pre=True)
    def ensure_distinct(cls, v: object) -> object:
        if isinstance(v, Sequence) and len(v) != len(set(v)):
            raise ValueError("`columns` must all be distinct")
        return v

    @validator("data", pre=True, each_item=True)
    def parse_nested_sequence(
        cls,
        v: object,
        values: dict[str, object],
    ) -> object:
        if not isinstance(v, Sequence):
            return v
        columns = values.get("columns")
        if not isinstance(columns, Sequence):
            raise TypeError(
                "If `data` items are provided as sequences, "
                "`columns` must be present as a sequence."
            )
        if len(columns) != len(v):
            raise ValueError(
                "`data` item must be the same length as `columns`"
            )
        return dict(zip(columns, v))
The pre=True on both validators ensures that they are run before the default validators for those field types (otherwise we would immediately get a validation error from your example data). Field validators are always called in the order the fields were defined, so we ensured that our custom columns validator is called before our custom data validator.
That order also allows us to access the output of our columns validator in the values dictionary of our data validator.
The each_item=True flag changes the behavior of the validator so that it is applied to each individual element of the data list as opposed to the entire list. That means with our example data the v argument will always be a single "sub-list" (e.g. ["a", 1]).
If the value to validate is not of a sequence type, we don't bother with it. It will then be handled appropriately by the default field validator. If it is a sequence, we need to make sure that columns is present and also a sequence and that they are of the same length. If those checks pass, we can just zip them, consume that zip into a dictionary and send it on its merry way to the default field validator.
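To make that last step concrete, for the second valid payload further down (columns in a different order) the validator effectively does this for each item:

columns = ("value", "sensor")
item = [1, "a"]
print(dict(zip(columns, item)))  # {'value': 1, 'sensor': 'a'}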
That's all.
Demo
Here is a little demo script:
def main() -> None:
    from pydantic import ValidationError

    print(Response.parse_raw(TEST_JSON_VALID_1).json(indent=2), "\n")
    print(Response.parse_raw(TEST_JSON_VALID_2).json(indent=2), "\n")
    try:
        Response.parse_raw(TEST_JSON_INVALID_1)
    except ValidationError as exc:
        print(exc.json(indent=2), "\n")
    try:
        Response.parse_raw(TEST_JSON_INVALID_2)
    except ValidationError as exc:
        print(exc.json(indent=2), "\n")
    try:
        Response.parse_raw(TEST_JSON_INVALID_3)
    except ValidationError as exc:
        print(exc.json(indent=2))

if __name__ == "__main__":
    main()
And here is the test data and its corresponding output:
TEST_JSON_VALID_1
{
  "columns": [
    "sensor",
    "value",
    "range"
  ],
  "data": [
    [
      "a",
      1,
      {"low": 1, "high": 2}
    ],
    [
      "b",
      2,
      {"low": 0, "high": 2}
    ]
  ]
}

{
  "data": [
    {
      "sensor": "a",
      "value": 1,
      "range": {
        "low": 1,
        "high": 2
      }
    },
    {
      "sensor": "b",
      "value": 2,
      "range": {
        "low": 0,
        "high": 2
      }
    }
  ]
}
TEST_JSON_VALID_2 (different order and no range)
{
  "columns": ["value", "sensor"],
  "data": [
    [1, "a"],
    [2, "b"]
  ]
}

{
  "data": [
    {
      "sensor": "a",
      "value": 1,
      "range": null
    },
    {
      "sensor": "b",
      "value": 2,
      "range": null
    }
  ]
}
TEST_JSON_INVALID_1
{
  "columns": ["foo", "value"],
  "data": []
}

[
  {
    "loc": [
      "columns",
      0
    ],
    "msg": "unexpected value; permitted: 'sensor', 'value', 'range'",
    "type": "value_error.const",
    "ctx": {
      "given": "foo",
      "permitted": [
        "sensor",
        "value",
        "range"
      ]
    }
  }
]
TEST_JSON_INVALID_2
{
  "columns": ["value", "value"],
  "data": []
}

[
  {
    "loc": [
      "columns"
    ],
    "msg": "`columns` must all be distinct",
    "type": "value_error"
  }
]
TEST_JSON_INVALID_3
{
  "columns": ["sensor", "value"],
  "data": [["a", 1], ["b"]]
}

[
  {
    "loc": [
      "data",
      1
    ],
    "msg": "`data` item must be the same length as `columns`",
    "type": "value_error"
  }
]
I want to create a type that represents this dictionary
{
  "13fc71ee-e676-4e5d-a561-a20af5bcbaaf": "123",
  "8d742ecf-6ae9-4408-a5a0-b1110f9b5365": "fdsafdsa",
  "literal": {
    "any": "dict",
  },
}
This is the closest type I can come up with:
Dict[UUID | Literal["literal"], str | dict]
The main issue with the above type signature is that the following dictionary would be allowed:
{
  "8d742ecf-6ae9-4408-a5a0-b1110f9b5365": {
    "dictionary": 321,
  },
  "literal": "some string",
}
Is there a way to annotate a dictionary with a type that can either be a UUID key for a string value or a string literal for a dict value?
What you're asking for potentially requires runtime knowledge and thus cannot be encoded (in a useful way) as a type. Perhaps there is a way to create a custom type that can work with static type checkers for some similar types, but I imagine it would be complicated. Thus, I propose some alternatives:
typing.TypedDict:
from typing import Any, TypedDict

class ItemType(TypedDict):
    uuids: dict[str, str]
    literal: dict[str, Any]

d: ItemType = {
    "uuids": {
        "13fc71ee-e676-4e5d-a561-a20af5bcbaaf": "123",
        "8d742ecf-6ae9-4408-a5a0-b1110f9b5365": "fdsafdsa",
    },
    "literal": {
        "any": "dict",
    },
}
Dataclasses:
from dataclasses import dataclass
from typing import Any

@dataclass
class Item:
    uuids: dict[str, str]
    literal: dict[str, Any]

d = Item(**{
    "uuids": {
        "13fc71ee-e676-4e5d-a561-a20af5bcbaaf": "123",
        "8d742ecf-6ae9-4408-a5a0-b1110f9b5365": "fdsafdsa",
    },
    "literal": {
        "any": "dict",
    },
})
Optionally, one can consider overriding __getitem__ to allow dictionary-like access d["literal"].
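A minimal sketch of what that override could look like on the dataclass (the exact routing behavior is up to you):

from dataclasses import dataclass
from typing import Any

@dataclass
class Item:
    uuids: dict[str, str]
    literal: dict[str, Any]

    def __getitem__(self, key: str) -> Any:
        # Route "literal" to the literal dict and treat every other key as a UUID string.
        if key == "literal":
            return self.literal
        return self.uuids[key]

item = Item(uuids={"8d742ecf-6ae9-4408-a5a0-b1110f9b5365": "fdsafdsa"}, literal={"any": "dict"})
print(item["literal"])  # {'any': 'dict'}
print(item["8d742ecf-6ae9-4408-a5a0-b1110f9b5365"])  # fdsafdsa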
Runtime checks
If all else fails, and you really need to verify constraints that the type system cannot handle, you can do various runtime checks via isinstance or perhaps via libraries like beartype.
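For instance, a simple hand-rolled check for the original dict shape could look like this (plain isinstance checks, no third-party library; the function name is just for illustration):

from typing import Any
from uuid import UUID

def check_item(d: dict[str, Any]) -> None:
    """Verify: UUID keys map to strings, the "literal" key maps to a dict."""
    for key, value in d.items():
        if key == "literal":
            if not isinstance(value, dict):
                raise TypeError('"literal" must map to a dict')
        else:
            UUID(key)  # raises ValueError if the key is not a valid UUID
            if not isinstance(value, str):
                raise TypeError(f"UUID key {key!r} must map to a string")

check_item({
    "13fc71ee-e676-4e5d-a561-a20af5bcbaaf": "123",
    "literal": {"any": "dict"},
})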
I have a dynamic json like:
{
"ts": 1111111, // this field is required
... // others are dynamic
}
The fields are all dynamic except ts. E.g:
{"ts":111,"f1":"aa","f2":"bb"}
{"ts":222,"f3":"cc","f4":"dd"}
How to declare this model with Pydantic?
class JsonData(BaseModel):
    ts: int
    ...  # and then?
Equivalent to typescript:
interface JsonData {
    ts: number;
    [key: string]: any;
}
Thanks.
Without Type Validation
If type validation doesn't matter then you could use the extra option allow:
'allow' will assign the attributes to the model.
class JsonData(BaseModel):
    ts: int

    class Config:
        extra = "allow"
This will give:
data = JsonData.parse_raw("""
{
"ts": 222,
"f3": "cc",
"f4": "dd"
}
""")
repr(data)
# "JsonData(ts=222, f4='dd', f3='cc')"
And the fields can be accessed via:
print(data.f3)
# cc
With Type Validation
However, if you could change the request body to contain an object holding the "dynamic fields", like the following:
{
  "ts": 111,
  "fields": {
    "f1": "aa",
    "f2": "bb"
  }
}

or

{
  "ts": 222,
  "fields": {
    "f3": "cc",
    "f4": "dd"
  }
}
you could use a Pydantic model like this one:
from pydantic import BaseModel
class JsonData(BaseModel):
    ts: int
    fields: dict[str, str] = {}
That way, any number of fields could be processed, while the field type would be validated, e.g.:
# No "extra" fields; yields an empty dict:
data = JsonData.parse_raw("""
{
"ts": "222"
}
""")
repr(data)
# "JsonData(ts=222, fields={})"
# One extra field:
data = JsonData.parse_raw("""
{
"ts": 111,
"fields": {
"f1": "aa"
}
}
""")
repr(data)
# "JsonData(ts=111, fields={'f1': 'aa'})"
# Several extra fields:
data = JsonData.parse_raw("""
{
"ts": 222,
"fields": {
"f2": "bb",
"f3": "cc",
"f4": "dd"
}
}
""")
repr(data)
# "JsonData(ts=222, fields={'f2': 'bb', 'f3': 'cc', 'f4': 'dd'})"
The fields can be accessed like this:
print(data.fields["f2"])
# bb
In this context, you might also want to consider the field types to be StrictStr as opposed to str. When using str, other types will get coerced into strings, for example when an integer or float is passed in. This doesn't happen with StrictStr.
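A quick sketch of that stricter variant:

from pydantic import BaseModel, StrictStr

class JsonData(BaseModel):
    ts: int
    fields: dict[str, StrictStr] = {}

# With plain `str` values, {"f1": 1} would be coerced to {"f1": "1"};
# with StrictStr, the same input raises a ValidationError instead.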
See also: this workaround.
I discovered typing in python some time ago and am currently trying to use it in my projects.
I have encountered a problem:
I have two classes. One is an evaluator for machine learning training results. The second one is a comparator, which creates n evaluators and compares their results.
class Parameters(TypedDict):
    modelname: str
    filename: str
    last_ckpt: str
    project_path: str
    number_of_ckpt: int

class ResultEvaluator(object):
    def __init__(
        self,
        parameters: Parameters,
        evaluation_results_folder: str = <default path>,
        kitti_native_results_path: str = <default path>
    ) -> None:
        self.parameters = parameters
        self.evaluation_results_folder = evaluation_results_folder
        self.kitti_native_results_path = kitti_native_results_path

class ComparatorFormatParameters(TypedDict):
    parameters: Parameters
    evaluation_results_folder: str
    kitti_native_results_path: str

class ResultComparator(object):
    def __init__(
        self,
        comperator_objectives_params: List[ComparatorFormatParameters]
    ) -> None:
        self.comparator_objectives: List[re.ResultEvaluator] = [
            re.ResultEvaluator(**objective_params)
            for objective_params in comperator_objectives_params
        ]
An evaluator gets a dictionary and two strings during initialization. The dictionary contains all parameters that must always be specified. The two strings contain paths that never have to be changed during normal operation. However, they must be changed for test purposes.
Therefore, a default value is assigned to them in the init method.
So you can also simply initialize the evaluator with the dictionary.
So that I can use the comparator with n evaluators, I give it a list of the evaluator parameters.
So that these can be easily unpacked for initializing the evaluator, I have also created a typed dict. It contains the evaluator parameter typed dict and the two strings.
This all works so far.
However, if I mark the two strings of the typed dict as optional, mypy reports an error during type checking, because the evaluator's __init__ method expects one parameter dict and two strings.
But since default values are available, there are no problems at runtime.
Do I have to type the two strings in the typed dict as plain strings, even though they are not always included? In the example code above I have done that and don't get any mypy errors, but somehow it doesn't seem quite right to me.
Or is there a completely different and nicer solution?
Thanks for the help!
edit:
At the moment I can initialize a result evaluator like this:
result_evaluator_instance = ResultEvaluator(
    {
        "modelname": <model name>,
        "filename": <filename>,
        "last_ckpt": "120000",
        "project_path": <path>,
        "number_of_ckpt": 12
    },
    <path>,
    <path>
)
or:
result_evaluator_instance = ResultEvaluator(
    {
        "modelname": <model name>,
        "filename": <filename>,
        "last_ckpt": "120000",
        "project_path": <path>,
        "number_of_ckpt": 12
    }
)
Normally this second variant is used.
When I initialize the comparator this is what happens:
comparator = ResultComparator(
    [
        {
            "parameters": {
                "modelname": <model name>,
                "filename": <filename>,
                "last_ckpt": "120000",
                "project_path": <path>,
                "number_of_ckpt": 12
            }
        },
        {
            "parameters": {
                "modelname": <model name>,
                "filename": <filename>,
                "last_ckpt": "120000",
                "project_path": <path>,
                "number_of_ckpt": 12
            }
        },
    ]
)
or:
comparator = ResultComparator(
    [
        {
            "parameters": {
                "modelname": <model name>,
                "filename": <filename>,
                "last_ckpt": "120000",
                "project_path": <path>,
                "number_of_ckpt": 12
            },
            "evaluation_results_folder": <path>,
            "kitti_native_results_path": <path>
        },
        {
            "parameters": {
                "modelname": <model name>,
                "filename": <filename>,
                "last_ckpt": "120000",
                "project_path": <path>,
                "number_of_ckpt": 12
            },
            "evaluation_results_folder": <path>,
            "kitti_native_results_path": <path>
        }
    ]
)
Both variants work in each case. I just think it's misleading that in ComparatorFormatParameters both paths have type str, although they don't always have to be present.
So I have looked for a syntactically correct way to indicate this.
I also tried to assign default values to the two strings in the ComparatorFormatParameters.
class ComparatorFormatParameters(TypedDict):
    parameters: Parameters
    evaluation_results_folder: str = None
    kitti_native_results_path: str = None
Unfortunately these default values then overwrite the default values I assigned in the ResultEvaluator.init.
So far, by "optional" I had tried what I think it amounts to, namely Union[str, None].
But now I have explicitly tested Union[str, None] again, and mypy still returns errors.
I'm trying to use Elasticsearch-dsl-py to index some data from a jsonl file with many fields. Ignoring the less general parts, the code looks like this:
es = Elasticsearch()

for id, line in enumerate(open(jsonlfile)):
    jline = json.loads(line)
    children = jline.pop('allChildrenOfTypeX')
    res = es.index(index="mydocs", doc_type='fatherdoc', id=id, body=jline)
    for ch in children:
        res = es.index(index="mydocs", doc_type='childx', parent=id, body=ch)
trying to run this ends with the error:
RequestError: TransportError(400, u'illegal_argument_exception', u"Can't specify parent if no parent field has been configured")
I guess I need to tell ES in advance that the child type has a parent. However, what I don't want is to map ALL the fields of both types just to do it.
Any help is greatly welcomed!
When creating your mydocs index, in the definition of your childx mapping type, you need to specify the _parent field with the value fatherdoc:
PUT mydocs
{
  "mappings": {
    "fatherdoc": {
      "properties": {
        ... parent type fields ...
      }
    },
    "childx": {
      "_parent": {          <---- add this
        "type": "fatherdoc"
      },
      "properties": {
        ... child type fields ...
      }
    }
  }
}
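If you would rather create the index from the Python client you are already using than send the PUT request by hand, a sketch along these lines should work. The empty properties blocks are placeholders: dynamic mapping will pick up any unmapped fields at index time, so you don't have to spell out every field of either type.

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Create the index with the parent/child mapping before indexing any documents.
es.indices.create(index="mydocs", body={
    "mappings": {
        "fatherdoc": {
            "properties": {}  # parent type fields (left to dynamic mapping here)
        },
        "childx": {
            "_parent": {"type": "fatherdoc"},
            "properties": {}  # child type fields (left to dynamic mapping here)
        }
    }
})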