Validating XML with large text element against XML Schema (xsd) - python

I have to process XML files that contain potentially large (up to 2 GB) content. In these files, the 'large' part of the content is not spread over the whole file but is contained in one single element (an encrypted file, hex encoded).
I have no leverage on the source of the files, so I need to deal with that situation.
A requirement is to keep a small memory footprint (< 500 MB). I was able to read and process the file's contents in streaming mode using xml.sax, which is doing its job just fine.
The problem is that these files also need to be validated against an XML schema definition (.xsd file), which does not seem to be supported by xml.sax.
I found some up-to-date libraries for schema validation like xmlschema but none for doing the validation in a streaming/lazy fashion.
Can anyone recommend a way to do this?

Many schema processors (such as Xerces and Saxon) operate in streaming mode, so there's no need to hold the data in memory while it's being validated. However, a 2 GB single text node is stretching Java's limits on the size of strings and arrays, and even a streaming processor is quite likely to want to hold the whole of a single node in memory.
If there are no validation constraints on the content of this text node (e.g. you don't need to validate that it is valid xs:base64Binary), then I would suggest using a schema validator (such as Saxon) that accepts SAX input, and supplying the input via a SAX filter that eliminates or condenses the long text value. A SAX parser supplies text to the ContentHandler in multiple chunks, so there should be no limit in the SAX parser on the size of a text node. Saxon will try to combine the multiple chunks into a single string (or char array) and may fail at this stage either because of Java limits or because of the amount of memory available; but if your filter cuts out the big text node, this won't happen.

Michael Kay's answer had this nice idea of a content filter that can condense long text. This helped me solve my problem.
I ended up writing a simple text shrinker that pre-processes an XML file for me by reducing the text content size in named tags (like: "only keep the first 64 bytes of the text in the 'Data' and 'CipherValue' elements, don't touch anything else").
The resulting file is then small enough to feed into a validator like xmlschema.
If anyone needs something similar: here is the code of the shrinker
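The original shrinker code is not reproduced here; the following is a minimal stdlib-only sketch of the same idea, built on xml.sax and saxutils.XMLGenerator. The class and parameter names (TextShrinker, shrink_tags, max_len) are illustrative, not from the original:

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

class TextShrinker(xml.sax.ContentHandler):
    """Forward SAX events to an XMLGenerator, truncating the text of
    selected elements to at most max_len characters."""

    def __init__(self, out, shrink_tags, max_len=64):
        super().__init__()
        self._gen = XMLGenerator(out, encoding="utf-8")
        self._shrink_tags = set(shrink_tags)
        self._max_len = max_len
        self._budget = None  # remaining characters; None = no shrinking

    def startDocument(self):
        self._gen.startDocument()

    def endDocument(self):
        self._gen.endDocument()

    def startElement(self, name, attrs):
        self._gen.startElement(name, attrs)
        if name in self._shrink_tags:
            self._budget = self._max_len

    def endElement(self, name):
        if name in self._shrink_tags:
            self._budget = None
        self._gen.endElement(name)

    def characters(self, content):
        # SAX delivers long text in multiple chunks, so the whole 2 GB
        # node never has to exist as one string.
        if self._budget is None:
            self._gen.characters(content)
        elif self._budget > 0:
            chunk = content[:self._budget]
            self._budget -= len(chunk)
            self._gen.characters(chunk)
        # budget exhausted: drop the rest of the text

def shrink(src, shrink_tags, max_len=64):
    """Return the shrunken XML as a string."""
    out = io.StringIO()
    data = src.encode("utf-8") if isinstance(src, str) else src
    xml.sax.parseString(data, TextShrinker(out, shrink_tags, max_len))
    return out.getvalue()
```

For a real 2 GB input you would parse from a file object (xml.sax.parse) and write to an output file instead of in-memory buffers.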
If you use this, be careful: it changes the content of the XML and could cause problems if the XML schema definition contains constraints such as minimum or maximum length checks for the affected elements.

Related

fast checking the existence of tag in large XML using python cElementTree

I have XML files ranging from hundreds of megabytes to tens of gigabytes in size and use Python's cElementTree to process them. Due to limited memory and low speed, I don't want to load all the contents into memory using et.parse and then use the find or findall methods to check whether a tag exists (I didn't actually try this). For now I simply use et.iterparse to iterate through all tags. When the tag is located close to the end of the file, this can be very slow as well. I wonder whether there is a better way to achieve this and to get the location of the tag. If I know the top-level element (e.g., an index) under which the tag is located, and that part is much smaller than the rest of the file, is it possible to iterate through the top-level tags and then parse only that part? I searched online, but surprisingly no related questions are posted. Am I missing anything? Thanks in advance.
I solved this by reading the file block by block instead of parsing it with cElementTree. My tags are close to the end of the file, so, following this answer, I read a block of content of a specified size block_size at a time from the end of the file using the file.seek and file.read methods (line = f.read(block_size)), and then simply check "<my_tag " in line (or a more specific tag name to avoid ambiguity) to see whether the tag exists. This is much faster than using iterparse to go through all tags.
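A minimal sketch of that backwards block scan. The tag name and block size are illustrative; blocks are overlapped slightly so a tag split across a block boundary is still found:

```python
def tag_near_end(path, tag, block_size=64 * 1024):
    """Scan a file backwards in blocks and report whether '<tag' appears,
    without parsing the XML at all."""
    marker = ("<" + tag).encode()
    overlap = len(marker)
    with open(path, "rb") as f:
        f.seek(0, 2)                      # jump to the end of the file
        pos = f.tell()
        while pos > 0:
            start = max(0, pos - block_size)
            f.seek(start)
            # read past 'pos' by a few bytes so boundary-straddling
            # tags are not missed
            block = f.read(pos - start + overlap)
            if marker in block:
                return True
            pos = start
    return False
```

Note that a plain substring check can't distinguish a real element from, say, the same text inside a comment or CDATA section, so use a marker specific enough for your data.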

Extract particular fields from json in python

Say I have a lot of json lines to process and I only care about the specific fields in a json line.
{blablabla, 'whatICare': 1, blablabla}
{blablabla, 'whatICare': 2, blablabla}
....
Is there any way to extract whatICare from these json lines without fully loading them? Since the json lines are very long, it may be slow to build objects from them.
Not any reliable way without writing your own parsing code.
But check out ujson! It can be 10x faster than Python's built-in json library, which is a bit on the slow side.
No, you will have to load and parse the JSON before you know what’s inside and to be able to filter out the desired elements.
That being said, if you worry about memory, you could use ijson, which is an iterative parser. Instead of loading all the content at once, it is able to load only what's necessary for the next iteration. So if your file contains an array of objects, you can load and parse one object at a time, reducing the memory impact (as you only need to keep one object in memory, plus the data you actually care about). But it won't become faster, and it also won't magically skip data you are not interested in.
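For the JSON-lines input shown in the question, the stdlib json module alone already keeps memory flat, since each line is a complete document and only one decoded object exists at a time (the field name is taken from the question; the sample records are invented):

```python
import json

def extract_field(lines, field):
    """Parse JSON lines one at a time, yielding only the field we care
    about; at most one decoded object is held in memory."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        if field in obj:
            yield obj[field]

data = [
    '{"other": "x", "whatICare": 1}',
    '{"other": "y", "whatICare": 2}',
]
print(list(extract_field(data, "whatICare")))  # [1, 2]
```

In a real script you would iterate over the file object directly instead of a list, so only one line is ever read into memory.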

Concurrent SAX processing of large, simple XML files?

I have a couple of gigantic XML files (10GB-40GB) that have a very simple structure: just a single root node containing multiple row nodes. I'm trying to parse them using SAX in Python, but the extra processing I have to do for each row means that the 40GB file takes an entire day to complete. To speed things up, I'd like to use all my cores simultaneously. Unfortunately, it seems that the SAX parser can't deal with "malformed" chunks of XML, which is what you get when you seek to an arbitrary line in the file and try parsing from there. Since the SAX parser can accept a stream, I think I need to divide my XML file into eight different streams, each containing [number of rows]/8 rows and padded with fake opening and closing tags. How would I go about doing this? Or — is there a better solution that I might be missing? Thank you!
You can't easily split the SAX parsing into multiple threads, and you don't need to: if you just run the parse without any other processing, it should run in 20 minutes or so. Focus on the processing you do to the data in your ContentHandler.
My suggestion is to read the whole XML file into an intermediate format and do the extra processing afterwards. SAX should be fast enough to read 40 GB of XML in no more than an hour.
Depending on the data you could use a SQLite database or HDF5 file for intermediate storage.
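A stdlib-only sketch of that intermediate-storage idea with sqlite3. The table layout and row attributes are invented for illustration:

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Parse the rows out of the XML stream once and stash them in SQLite,
# so the expensive processing can run afterwards (even in parallel).
xml_data = io.BytesIO(
    b"<root><row id='1' value='a'/><row id='2' value='b'/></root>"
)

conn = sqlite3.connect(":memory:")  # use a file path for real data
conn.execute("CREATE TABLE rows (id TEXT, value TEXT)")

for event, elem in ET.iterparse(xml_data):
    if elem.tag == "row":
        conn.execute("INSERT INTO rows VALUES (?, ?)",
                     (elem.get("id"), elem.get("value")))
        elem.clear()  # keep the parse memory flat
conn.commit()

# The post-processing now queries the database, not the 40 GB XML file.
rows = conn.execute("SELECT id, value FROM rows ORDER BY id").fetchall()
print(rows)  # [('1', 'a'), ('2', 'b')]
```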
By the way, Python is not really multi-threaded (see GIL). You need the multiprocessing module to split the work into different processes.

Are there ways to modify/update xml files other than totally over writing the old file?

I'm working on a script which involves continuously analyzing data and outputting results in a multi-threaded way. So basically the result file(an xml file) is constantly being updated/modified (sometimes 2-3 times/per second).
I'm currently using lxml to parse/modify/update the xml file, which works fine right now. But from what I can tell, you have to rewrite the whole xml file even when you just add one entry/sub-entry like <weather content=sunny /> somewhere in the file. The xml file is gradually growing bigger, and so is the overhead.
As far as efficiency/resources are concerned, is there any other way to update/modify the xml file? Or will you have to switch to an SQL database or something similar some day, when the xml file is too big to parse/modify/update?
No, you generally cannot, and not just for XML files: this applies to any file format.
You can only update "in place" if you overwrite bytes exactly (i.e. don't add or remove any characters, just replace some with something of the same byte length).
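A small stdlib sketch of that same-byte-length constraint, reusing the weather element from the question (the file handling is illustrative):

```python
import os
import tempfile

# Create a sample file to patch.
with tempfile.NamedTemporaryFile("w+b", delete=False) as f:
    f.write(b'<weather content="rainy" />')
    path = f.name

# "r+b" opens for read/write without truncating, which is what makes
# an in-place overwrite possible at all.
with open(path, "r+b") as f:
    data = f.read()
    offset = data.index(b"rainy")
    f.seek(offset)
    f.write(b"sunny")   # 5 bytes replace 5 bytes: valid in-place update

with open(path, "rb") as f:
    result = f.read()
os.unlink(path)
print(result)  # b'<weather content="sunny" />'
```

Writing a longer or shorter value at that offset would silently overwrite the following bytes or leave stale ones behind, which is exactly why general XML edits require rewriting the file.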
Using a form of database sounds like a good option.
It certainly sounds like you need some sort of database; as Li-anung Yip states, this would take care of all kinds of nasty multi-threaded sync issues.
You stated that your data is gradually increasing? How is it being consumed? Are clients forced to download the entire result file each time?
I don't know your use-case, but perhaps you could consider using an Atom feed to distribute your data changes? Providing support for AtomPub would also effectively REST-enable your data. It's still XML, but in a standards-compliant format that is easy to consume and poll for changes.

What is the most efficient way of extracting information from a large number of xml files in python?

I have a directory full (~10^3 to 10^4) of XML files from which I need to extract the contents of several fields.
I've tested different xml parsers, and since I don't need to validate the contents (expensive) I was thinking of simply using xml.parsers.expat (the fastest one) to go through the files, one by one to extract the data.
Is there a more efficient way? (simple text matching doesn't work)
Do I need to issue a new ParserCreate() for each new file (or string) or can I reuse the same one for every file?
Any caveats?
Thanks!
Usually, I would suggest using ElementTree's iterparse, or for extra speed, its counterpart from lxml. Also try to use the multiprocessing module (built in since Python 2.6) to parallelize.
The important thing about iterparse is that you get the element (sub-)structures as they are parsed.
import xml.etree.ElementTree as ET  # cElementTree was merged into ElementTree in Python 3
xml_it = ET.iterparse("some.xml")
event, elem = next(xml_it)
event will always be the string "end" in this case, but you can also initialize the parser to report "start" events, i.e. to tell you about new elements as they are parsed. You don't have any guarantee that all child elements will have been parsed at that point, but the attributes are already there, if those are all you are interested in.
Another point is that you can stop reading elements from the iterator early, i.e. before the whole document has been processed.
If the files are large (are they?), there is a common idiom to keep memory usage constant just as in a streaming parser.
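That idiom is to clear each element once you are done with it, so the in-memory tree never accumulates (shown here on a small in-memory document; the row/total logic is invented for the example):

```python
import io
import xml.etree.ElementTree as ET

xml_data = io.BytesIO(b"<root><row>1</row><row>2</row><row>3</row></root>")

total = 0
for event, elem in ET.iterparse(xml_data):  # default: "end" events only
    if elem.tag == "row":
        total += int(elem.text)
        elem.clear()  # drop the element's text/children: memory stays flat
print(total)  # 6
```

For very large files you would also keep a reference to the root element (via a "start" event) and periodically clear its child list, since cleared children still leave entries there.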
The quickest way would be to match strings (with, e.g., regular expressions) instead of parsing XML - depending on your XMLs this could actually work.
But the most important thing is this: instead of thinking through several options, just implement them and time them on a small set. This will take roughly the same amount of time, and will give you real numbers to drive you forward.
EDIT:
Are the files on a local drive or network drive? Network I/O will kill you here.
The problem parallelizes trivially - you can split the work among several computers (or several processes on a multicore computer).
If you know that the XML files are generated using the ever-same algorithm, it might be more efficient to not do any XML parsing at all. E.g. if you know that the data is in lines 3, 4, and 5, you might read through the file line-by-line, and then use regular expressions.
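A sketch of that line-oriented shortcut. The field name and pattern are invented, and, as the next paragraph warns, it only works while the generator's layout stays fixed:

```python
import re

# Machine-generated XML where the value always sits on its own line:
lines = [
    "<record>",
    "  <id>42</id>",
    "  <name>widget</name>",
    "</record>",
]

name_re = re.compile(r"<name>(.*?)</name>")

names = []
for line in lines:          # line by line: no XML parser involved
    m = name_re.search(line)
    if m:
        names.append(m.group(1))
print(names)  # ['widget']
```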
Of course, that approach would fail if the files are not machine-generated, or originate from different generators, or if the generator changes over time. However, I'm optimistic that it would be more efficient.
Whether or not you recycle the parser objects is largely irrelevant. Many more objects will get created during parsing, so a single parser object doesn't really count for much.
One thing you didn't indicate is whether or not you're reading the XML into a DOM of some kind. I'm guessing that you're probably not, but on the off chance you are, don't. Use xml.sax instead. Using SAX instead of DOM will get you a significant performance boost.
