Python - Validation with multiple schemas using lxml

I'm working with a schema that was built by a third party, and I'd like to validate against it with lxml. The problem is that the schema is split across different xsd files, which reference one another.
For example, a file called "extension.xsd" (which builds upon the "master" schema) has a line like:
<redefine schemaLocation="master.xsd">
If I try to validate with lxml (parsing the schema, building an XMLSchema from it, then validating another document that I already know is valid), only "extension" is taken into account and not "master": in other words, the validation fails, because the XML file contains elements that are only defined in "master" and not in "extension".
How can I solve (or work around) this issue? Thanks!
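For reference, the flow described above boils down to roughly this (a minimal sketch; the file names are placeholders):

from lxml import etree

# Parse the schema that contains the <redefine> and build a validator from it.
schema_doc = etree.parse("extension.xsd")
schema = etree.XMLSchema(schema_doc)

# Validate a document that is known to be valid against the full schema.
doc = etree.parse("document.xml")
if not schema.validate(doc):
    print(schema.error_log)   # lists the elements that were not recognised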

If lxml doesn't support "redefine", the best option would be to fix it and submit a patch. :)
Failing that, the workaround would be to parse the master.xsd file yourself, and then apply the changes from extension.xsd, and output a single xsd file with the combined schema.
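Along those lines, here is a very rough sketch of the manual merge with lxml. It copies the top-level definitions from master.xsd into extension.xsd, hoists the contents of the <redefine> block, and builds one combined XMLSchema. This glosses over the real semantics of <redefine> (a redefined type may extend or restrict the original of the same name), so it is only a starting point, not a drop-in solution:

from lxml import etree

XS = "http://www.w3.org/2001/XMLSchema"

master = etree.parse("master.xsd")
extension = etree.parse("extension.xsd")
ext_root = extension.getroot()

# Locate the <redefine> element referring to master.xsd.
redefine = ext_root.find("{%s}redefine" % XS)

# Copy every top-level definition from master, except the ones that the
# redefine block overrides (matched naively by name).
overridden = {child.get("name") for child in redefine if child.get("name")}
for child in master.getroot():
    if child.get("name") not in overridden:
        ext_root.append(child)

# Hoist the overriding definitions out of <redefine>, then drop it.
for child in list(redefine):
    ext_root.append(child)
ext_root.remove(redefine)

combined = etree.XMLSchema(extension)   # validator for the merged schema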

Related

Mongoengine change document structure

I'm trying mongo for the first time, and I chose mongoengine.
After defining the Document structure, if I try to change it (adding a field, removing a field, renaming, etc.), read operations still work, but any other operation on previously stored documents fails, since they are no longer compliant with the document structure.
Is there any way to manage this situation? Should I just use DynamicDocuments with dictionaries instead of EmbeddedDocuments?
Using DynamicDocument or setting meta = {'strict': False} on your Document may help in some cases, but the only proper solution to this is running a migration script.
I'd recommend doing this with pymongo, but you could also do it from the mongo shell. Every time your model changes in a way that is not compatible, you should run a migration on the existing data so that it fits the new model; otherwise mongoengine will complain at some point. (mongoengine contributor here)
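As a concrete illustration, a migration for a renamed field could look something like this with pymongo (database, collection and field names are placeholders, not taken from the question):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["mycollection"]

# Rename old_field to new_field on every document that still has the old name.
collection.update_many(
    {"old_field": {"$exists": True}},
    {"$rename": {"old_field": "new_field"}},
)

# For an added field, backfill a default so older documents stay loadable.
collection.update_many(
    {"new_flag": {"$exists": False}},
    {"$set": {"new_flag": False}},
)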

Building a solr index using large text file

I have a large text file in the following format:
00001,234234|234|235|7345
00005,788|298|234|735
You can treat the values before the , as keys, and what I want is a quick and dirty way to query those keys and find the result set for each key. After reading a bit I found that Solr provides a good framework to do this.
What would be the starting point?
Can I use Python to read the file and build this index (search engine) using Solr?
Is there a different mechanism for doing this?
You can definitely do that using pysolr, which is a Python library. If the data is in key-value form you can read it in Python as shown here:
https://pypi.python.org/pypi/pysolr/3.1.0
To have more control over the search you need to modify the schema.xml file so that it has the keys as they appear in your text file.
Once you have the data ingested into Solr you can follow the link above to perform searches.
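A rough sketch of that approach with pysolr, assuming a Solr core reachable at the URL below whose schema has an id field and a multivalued values field:

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycore", timeout=10)

# Turn each "key,v1|v2|v3" line into a Solr document.
docs = []
with open("data.txt") as f:
    for line in f:
        key, _, rest = line.strip().partition(",")
        docs.append({"id": key, "values": rest.split("|")})

solr.add(docs)
solr.commit()

# Query back a single key.
for result in solr.search("id:00001"):
    print(result)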
You can index your data directly in Solr using the UpdateCSV handler: you just need to specify the destination field names in the fieldnames parameter of your curl call (or add them as the first line of your file if that is easier). No custom code needed.
Do remember to check that the destination field for the |-separated values splits them into tokens on that character.
See https://wiki.apache.org/solr/UpdateCSV for details.
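The same thing can be scripted from Python by posting the raw file to the CSV handler; the core name and field names below are assumptions, and the fieldnames, f.<field>.split and f.<field>.separator parameters are the ones described on that wiki page:

import requests

params = {
    "commit": "true",
    "fieldnames": "id,values",    # map the two columns onto these fields
    "f.values.split": "true",     # split the second column...
    "f.values.separator": "|",    # ...on the pipe character
}
with open("data.txt", "rb") as f:
    response = requests.post(
        "http://localhost:8983/solr/mycore/update/csv",
        params=params,
        data=f,
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )
print(response.status_code, response.text)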

Can I query a YAML dataset in Python?

Similar to Is there a query language for JSON? and the more specific How can I filter a YAML dataset with an attribute value? - I would like to:
hand-edit small amounts of data in YAML files
perform arbitrary queries on the complete dataset (probably in Python, open to other ideas)
work with the resulting subset in Python
It doesn't appear that PyYAML has a feature like this, and today I can't find the link I had to the YQuery language, which wasn't a mature project anyway (or maybe I dreamt it).
Is there a (Python) library that offers YAML queries? If not, is there a Pythonic way to "query" a set of objects other than just iterating over them?
I don't think there is a direct way to do it. But PyYAML reads YAML files into a dict representing everything in the file, and afterwards you can perform all the usual dict operations on it. The question python query keys in a dictionary based on values mentions some Pythonic "query" styles.
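For example, a small sketch of that load-and-filter style, assuming a YAML file that contains a list of mappings with name and price keys:

import yaml

with open("items.yaml") as f:
    data = yaml.safe_load(f)   # plain Python lists/dicts/scalars

# An arbitrary "query": every item cheaper than 10, sorted by name.
cheap = sorted(
    (item for item in data if item.get("price", 0) < 10),
    key=lambda item: item["name"],
)
print(cheap)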
bootalchemy provides a means to do this via SQLAlchemy. First, define your schema in a SQLAlchemy model. Then load your YAML into a SQLAlchemy session using bootalchemy. Finally, perform queries on that session. (You don't have to commit the session to an actual database.)
Example from the PyPI page (assume model is already defined):
from bootalchemy.loader import Loader

# (this simulates loading the data from YAML)
data = [
    {'Genre': [
        {'name': 'action',
         'description': 'Car chases, guns and violence.'}
    ]}
]

# load YAML data into session using pre-defined model
loader = Loader(model)
loader.from_list(session, data)

# query SQLAlchemy session
genres = session.query(Genre).all()

# print results
print [(genre.name, genre.description) for genre in genres]
Output:
[('action', 'Car chases, guns and violence.')]
You could try to use jsonpath. Yes, it's meant for JSON, not YAML, but as long as you have JSON-compatible data structures it should work, because you're operating on the parsed data, not on the JSON or YAML representation. (It seems to work with the Python libraries jsonpath and jsonpath-rw.)
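A sketch of that idea with jsonpath-rw, where the document layout and the expression are made up for illustration; the YAML is parsed first and the JSONPath expression runs over the resulting Python structure:

import yaml
from jsonpath_rw import parse

data = yaml.safe_load("""
books:
  - title: Dune
    price: 9
  - title: Hyperion
    price: 12
""")

expr = parse("books[*].title")
titles = [match.value for match in expr.find(data)]
print(titles)   # ['Dune', 'Hyperion']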
You can check the following tools:
yq for CLI queries, like with jq,
yaml-query another CLI query tool written in Python.

Generate Python Class and SQLAlchemy code from XSD to store XML on Postgres

I have some very complex XSD schemas to work with. By complex I mean that each of these XSDs would correspond to about 20 classes/tables in a database, with each table having approximately 40 fields. And I have 18 different XSDs like that to program against.
What I'm trying to achieve is: take an XML file defined by the XSD and save all the data in a PostgreSQL database using SQLAlchemy. Basically I need a CRUD application that will persist an XML file in the database following the model of the XSD schema, and can also retrieve an object from the database and produce an XML file.
I want to avoid having to manually create the Python classes, the SQLAlchemy table definitions, and the CRUD code. That would be a monumental job, prone to lots of small mistakes, given the complexity of the XSD files.
I can generate Python classes from XSD in many ways, for example with generateDS or PyXB, but I still need to save those objects in the database. I'm open to any suggestions, even if the suggestion is conceptually different from what I'm describing.
Thank you very much
You can use generateDS to create django models from your XSD. You do this using the gends_run_gen_django script, which is located under the django directory in the generateDS source. Here is some documentation on that functionality. Relevant quote:
Here is an overview of the process:
Step 1. Generate bindings -- Run generateDS.py.
Step 2. Extract simpleType definitions from schema -- Run gends_extract_simple_types.py.
Step 3. Generate models.py and forms.py -- Run gends_generate_django.py.
The script gends_run_gen_django.py performs these three steps.
I believe Django should give you most of the functionality of SQLAlchemy. However, if you decide to use SQLAlchemy instead, then generateDS' Django functionality might be a good model on which to base a similar SQLAlchemy solution.
Not sure if there is a direct way, but you could go indirectly from the XSD to a SQL Server DB, and then import that database from SQLAlchemy.
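For the second half of that route (or any route that ends with existing tables in a database), SQLAlchemy's automap extension can reflect the tables into mapped classes so they don't have to be redeclared by hand; the connection string and table name below are placeholders:

from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session

engine = create_engine("postgresql://user:password@localhost/mydb")

Base = automap_base()
Base.prepare(engine, reflect=True)   # build classes from the live database schema

session = Session(engine)
SomeTable = Base.classes.some_table  # class named after the reflected table
print(session.query(SomeTable).count())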

Getting text from db as django dictionary

I recently joined a company which is using django to build their product. I'm currently responsible for one of the apps, which was already developed a little bit before I was here.
One of the entities in the app has a json dictionary attribute, which has been kept in the db as a text field. Also, this attribute is marked in the model as a text field. So, as you can imagine it's not being handled correctly.
I wanted to change this and set it as a json field using https://github.com/bradjasper/django-jsonfield , which works really well.
However, I've run into a peculiar problem. The data previously stored in the db was not handled correctly, and since it was unicode data, the text field in the db looks like:
{u'key': u'value'}
Now when the entity manager tries to load those values using the JSON field, it of course breaks, since that is not a valid JSON string.
I've done some research on how to overcome this, but haven't found anything.
My question:
Do you have any suggestion on how to overcome this? It can be any type of solution.
Something that I can run overnight, altering that field to transform it into a valid JSON string.
Some changes to the json-field code, which enables it to correctly handle these values.
Additional info
We use postgres with psycopg2 as django's db backend.
Thank you very much.
You're probably just going to need to iterate over the whole table, load the field, convert it into a real Python dict, and dump it back out with json.dumps. ast.literal_eval is a good choice for the conversion stage because it works like the built-in eval but is more restricted, so less risky to your system.
import ast
import json

for obj in MyModel.objects.all():
    value = ast.literal_eval(obj.dict_value)  # parse the repr()-style string into a dict
    obj.dict_value = json.dumps(value)        # re-serialize it as valid JSON
    obj.save()
