I have a large XML file (3 MB+) and an XSD to validate it against.
I am using Python and lxml. I started from this script <>, which does the validation fine, including giving me the line number. But the problem is that the file is all on one line, so when I validate, all I get is the error reported on line 1. When I use pretty print to split the lines up for me, it maxes out at line 65535.
Thanks!
Pretty-print your XML to add newlines to it. Then put it through your validator to get a more helpful line number.
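For example, lxml itself can do the pretty-printing; a minimal sketch (the file names here are placeholders):

from lxml import etree

tree = etree.parse("my.xml")                      # parse the single-line file
tree.write("my_pretty.xml", pretty_print=True)    # write it back with newlines and indentation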
EDIT: On re-reading your question, I see that you have used Notepad++ to add the newlines, but that lxml apparently has a size limitation when it comes to validating your XML.
For a general approach to this problem, please see Validating a HUGE XML file. In particular, the accepted answer begins with:
Instead of using a DOMParser, use a SAXParser. This reads from an input stream or reader so you can keep the XML on disk instead of loading it all into memory.
Basically, you need to use a streaming approach, which SAX offers. So, if your requirement is that you must validate your file in Python, then you'll need to find a validation approach based on streaming. (Perhaps lxml offers validation in a streaming fashion?)
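As far as I can tell, it does: lxml's iterparse() accepts a schema argument, so the document is validated while it streams. A minimal sketch (the file names are placeholders; catching the generic LxmlError base class keeps the example simple):

from lxml import etree

schema = etree.XMLSchema(etree.parse("my.xsd"))

# iterparse streams the document and validates it against the schema as it
# goes; clearing each element keeps memory usage flat.
try:
    for event, elem in etree.iterparse("my.xml", events=("end",), schema=schema):
        elem.clear()
    print("document is valid")
except etree.LxmlError as err:      # base class for lxml parse/validation errors
    print("validation failed:", err)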
However, if your validation requirements are more flexible, then consider a specialized tool such as XMLStarlet.
For example, here's how to validate an XML file against an XSD from the XMLStarlet entry on Wikipedia:
xmlstarlet val -e -s my.xsd my.xml
And a testimonial on using XMLStarlet on very large files.
Related
I have to process XML files that contain potentially large (up to 2 GB) content. In these files, the 'large' part of the content is not spread over the whole file but is contained in one single element (an encrypted file, hex encoded).
I have no leverage on the source of the files, so I need to deal with that situation.
A requirement is to keep a small memory footprint (< 500 MB). I was able to read and process the file's contents in streaming mode using xml.sax, which is doing its job just fine.
The problem is, that these files also need to be validated against an XML schema definition (.xsd file), which seems not to be supported by xml.sax.
I found some up-to-date libraries for schema validation like xmlschema but none for doing the validation in a streaming/lazy fashion.
Can anyone recommend a way to do this?
Many schema processors (such as Xerces and Saxon) operate in streaming mode, so there's no need to hold the data in memory while it's being validated. However, a 2Gb single text node is stretching Java's limits on the size of strings and arrays, and even a streaming processor is quite likely to want to hold the whole of a single node in memory.
If there are no validation constraints on the content of this text node (e.g. you don't need to validate that it is valid xs:base64Binary) then I would suggest using a schema validator (such as Saxon) that accepts SAX input, and supplying the input via a SAX filter that eliminates or condenses the long text value. A SAX parser supplies text to the ContentHandler in multiple chunks so there should be no limit in the SAX parser on the size of a text node. Saxon will try and combine the multiple chunks into a single string (or char array) and may fail at this stage either because of Java limits or because of the amount of memory available; but if your filter cuts out the big text node, this won't happen.
Michael Kay's answer had this nice idea of a content filter that can condense long text. This helped me solve my problem.
I ended up writing a simple text shrinker that pre-processes an XML file for me by reducing the text content size in named tags (like: "only keep the first 64 bytes of the text in the 'Data' and 'CipherValue' elements, don't touch anything else").
The resulting file is then small enough to feed into a validator like xmlschema.
If anyone needs something similar: here is the code of the shrinker
If you use this, be careful: it changes the content of the XML and could potentially cause problems if the XML schema definition contains things like min or max length checks for the affected elements.
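The linked shrinker code isn't reproduced here, but a rough sketch of the same idea, using xml.sax and saxutils.XMLGenerator, might look like the following. The element names and the 64-byte limit come from the description above; this is not the original code.

import xml.sax
from xml.sax.saxutils import XMLGenerator

class TextShrinker(xml.sax.ContentHandler):
    """Copy an XML document, truncating the text inside selected elements."""

    def __init__(self, out, shrink_tags, keep=64):
        super().__init__()
        self._gen = XMLGenerator(out, encoding="utf-8")
        self._shrink_tags = shrink_tags     # e.g. {"Data", "CipherValue"}
        self._keep = keep                   # characters of text to keep
        self._inside = 0                    # nesting depth inside a shrink tag
        self._written = 0                   # characters already emitted

    def startDocument(self):
        self._gen.startDocument()

    def startElement(self, name, attrs):
        if name in self._shrink_tags:
            self._inside += 1
            self._written = 0
        self._gen.startElement(name, attrs)

    def characters(self, content):
        if self._inside:
            remaining = self._keep - self._written
            if remaining <= 0:
                return                      # drop the rest of the huge text node
            content = content[:remaining]
            self._written += len(content)
        self._gen.characters(content)

    def endElement(self, name):
        self._gen.endElement(name)
        if name in self._shrink_tags:
            self._inside -= 1

with open("shrunk.xml", "w", encoding="utf-8") as out:
    xml.sax.parse("huge.xml", TextShrinker(out, {"Data", "CipherValue"}))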
I've got some badly-formed XML files that I'm processing with Python, and I need to figure out what's wrong with them (ie. what the errors are) without actually looking at the data (the files are sensitive client data).
I figure there should be a way to sanitize the XML (ie. remove all content in all nodes) but keep the tags, so that I can see any structural issues.
However, ElementTree doesn't return any detailed information about mismatched tags - just a line number and a character position which is useless if I can't reference the original XML.
Does anyone know how I can either sanitize the XML so I can view it, or get more detailed error messages for badly formed XML (that won't return tag contents)? I could write a custom parser to strip content, but I wanted to exhaust other options first.
It's a hard enough problem to try to automatically fix markup problems when you can look at the file. If you're not permitted to see the document contents, forget about having any reasonable hope of fixing such doubly undefined problems.
Your best bet is to fix the bad "XML" at its source.
If you can't do that, I suggest that you use a tool listed in How to parse invalid (bad / not well-formed) XML? to attempt to automatically repair the well-formedness problem. Then, after you actually have XML, you can use XML tools to strip or sanitize content (if that's even still necessary at that point).
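Once the file actually is well-formed XML, stripping the content while keeping the structure is straightforward. A minimal sketch with lxml (the file names are placeholders):

from lxml import etree

tree = etree.parse("client_data.xml")          # the repaired, well-formed file
for elem in tree.iter():
    elem.text = None                           # drop element text
    elem.tail = None                           # drop trailing text
    for name in list(elem.attrib):
        elem.attrib[name] = ""                 # keep attribute names, blank the values
tree.write("structure_only.xml", pretty_print=True)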
I'm working on a script which involves continuously analyzing data and outputting results in a multi-threaded way. So basically the result file (an XML file) is constantly being updated/modified (sometimes 2-3 times per second).
I'm currently using lxml to parse/modify/update the XML file, which works fine right now. But from what I can tell, you have to rewrite the whole XML file even when you just add one entry/sub-entry like <weather content="sunny" /> somewhere in the file. The XML file is gradually growing bigger, and so is the overhead.
As far as efficiency/resources are concerned, is there any other way to update/modify the XML file? Or will you have to switch to an SQL database or something similar some day, when the XML file is too big to parse/modify/update?
No, you generally cannot - and not just XML files, any file format.
You can only update "in place" if you overwrite bytes exactly (i.e. don't add or remove any characters, just replace some with something of the same byte length).
Using a form of database sounds like a good option.
It certainly sounds like you need some sort of database; as Li-anung Yip states, this would take care of all kinds of nasty multi-threaded sync issues.
You stated that your data is gradually increasing. How is it being consumed? Are clients forced to download the entire result file each time?
Don't know your use case, but perhaps you could consider using an Atom feed to distribute your data changes? Providing support for AtomPub would also effectively REST-enable your data. It's still XML, but in a standards-compliant format that is easy to consume and poll for changes.
I am writing a server-side process in Python that takes XML from a directory and puts it into a database. The XML that is put in the directory is generated from forms that are filled out on remote laptops and sent via HTTP to the server. When we add fields to the form, it adds tags to the XML, which allows for situations where one XML file will have more or fewer tags than another. How can I make my server-side script robust enough to handle these scenarios?
I would do something like mentioned here: https://stackoverflow.com/questions/9845943/how-to-convert-xml-data-in-to-sqlite-database/9879617#9879617
There are different ways you can apply the logic in the for loop depending on any patterns in the XML, but the idea is the same. This should then let you handle the query much more smoothly depending on which values exist.
Make sure you look at http://lxml.de/tutorial.html; there are lots of great tips on using lxml.
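A rough sketch of that idea with lxml and sqlite3 (the element names, file names, and table layout here are made up for illustration); findtext() simply returns None for tags that aren't present, so files with more or fewer tags don't break the insert:

import sqlite3
from lxml import etree

conn = sqlite3.connect("forms.db")
conn.execute("CREATE TABLE IF NOT EXISTS forms (name TEXT, age TEXT, notes TEXT)")

tree = etree.parse("form_0001.xml")
# Missing elements come back as None and are stored as NULL.
row = {field: tree.findtext(field) for field in ("name", "age", "notes")}
conn.execute("INSERT INTO forms VALUES (:name, :age, :notes)", row)
conn.commit()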
A mini example may get you started:
from xml.dom.minidom import parseString
doc = parseString('<one><two>three</two></one>')
for twoElement in doc.getElementsByTagName('two'):
    print(twoElement.firstChild.data)
Maybe you should have a look at the minidom documentation or ask further questions here. But with getElementsByTagName() you can find all matching elements below a given node (here, doc). Of course you can be more specific than searching the whole document.
I have a very large XML file with 40,000 tag elements.
When I use ElementTree to parse this file, it gives errors due to memory.
So is there any module in Python that can read the XML file in chunks without loading the entire XML into memory? And how can I implement that module?
Probably the best library for working with XML in Python is lxml; in this case you should be interested in iterparse/iterwalk.
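A minimal sketch of iterparse (the file name and the tag name "record" are placeholders for whatever element repeats in your file):

from lxml import etree

# Elements are handed over one at a time as their end tags are parsed,
# so the whole document never has to be in memory at once.
for event, elem in etree.iterparse("big.xml", events=("end",), tag="record"):
    print(elem.tag)                            # replace with your own handling
    elem.clear()                               # free the element's content
    while elem.getprevious() is not None:      # drop already-processed siblings
        del elem.getparent()[0]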
This is a problem that people usually solve using sax.
If your huge file is basically a bunch of XML documents aggregated inside and overall XML envelope, then I would suggest using sax (or plain string parsing) to break it up into a series of individual documents that you can then process using lxml.etree.
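A rough sketch of that approach (the inner element name "record" is a placeholder, and it assumes those elements are not nested): xml.sax cuts each inner document out as a string, and lxml.etree then parses the much smaller pieces.

import xml.sax
from xml.sax.saxutils import escape, quoteattr
from lxml import etree

class Splitter(xml.sax.ContentHandler):
    """Re-serialise each target element and hand it to lxml."""

    def __init__(self, target_tag, process):
        super().__init__()
        self._target = target_tag
        self._process = process       # called with one lxml element per inner document
        self._buffer = None           # None means "not inside a target element"

    def startElement(self, name, attrs):
        if name == self._target and self._buffer is None:
            self._buffer = []
        if self._buffer is not None:
            attr_text = "".join(" %s=%s" % (k, quoteattr(attrs[k])) for k in attrs.getNames())
            self._buffer.append("<%s%s>" % (name, attr_text))

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(escape(content))

    def endElement(self, name):
        if self._buffer is not None:
            self._buffer.append("</%s>" % name)
            if name == self._target:
                self._process(etree.fromstring("".join(self._buffer)))
                self._buffer = None

xml.sax.parse("huge.xml", Splitter("record", lambda doc: print(doc.tag)))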