lxml, parsing in reverse - python

I am parsing a large file (>9 GB) with lxml's iterparse in Python, clearing elements as I go forward. I was wondering: is there a way to parse backwards while clearing? I can see how I would implement this independently of lxml, but it would be nice to use this package.
Thank you in advance!

Yes and no...
There is no 'easy' solution for starting 'from the end' and parsing in reverse.
But there is an iterator that goes all the way to the end and, on its way, 'clears the references' and keeps the read efficient.
Approach 1: split the file along its structure and nodes so you only parse the part you want.
Approach 2: use the 'smart' streaming approach described at [1].
What I did in my case:
I knew beforehand that my data, in a 12 GB file, was in the last 2 GB.
So I used the unix split command and processed only the last piece.
(This is an ugly hack, but in MY case it was simple and worked fast enough. You can use tail too, but I wanted to archive the other pieces as well.)
--> A real python master would use file.seek() (a rough sketch of that is below), but I thought the unix commands were faster.
Now I use the second approach [1].
[1] - http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
I hope this helps you; I had a hard time understanding the xml structure myself.
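For what it's worth, here is a minimal sketch of the file.seek() variant, assuming the data you care about sits in the last ~2 GB of the file and consists of repeated <record> elements; the tag name, the window size and the function name are made up for illustration, not part of the original setup:
import io
import os
from lxml import etree

TAIL_BYTES = 2 * 1024 ** 3   # assumed: the interesting data is in the last 2 GB

def iter_tail_records(path, record_tag='record'):
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        f.seek(max(0, size - TAIL_BYTES))
        chunk = f.read()                # reads the tail into memory; use a temp file if that is too much
    open_tag = b'<' + record_tag.encode()
    close_tag = b'</' + record_tag.encode() + b'>'
    start = chunk.find(open_tag)        # skip the record that was cut in half
    end = chunk.rfind(close_tag)        # drop the original root's closing tag
    if start == -1 or end == -1:
        return
    chunk = b'<fake_root>' + chunk[start:end + len(close_tag)] + b'</fake_root>'
    # stream over the wrapped tail, clearing each record once it has been handled
    for _, elem in etree.iterparse(io.BytesIO(chunk), tag=record_tag):
        yield elem
        elem.clear()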

iterparse() is strictly forward-only, I'm afraid. If you want to read a tree in reverse, you'll have to read it forward, while writing it to some intermediate store (be it in memory or on disc) in some form that's easier for you to parse backwards, and then read that. I'm not aware of any stream parsers that allow XML to be parsed back-to-front.
Off the top of my head, you could use two files, one containing the data and the other an index of offsets to the records in the data file. That would make reading backwards relatively easy once it's been written.
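To make that concrete, a rough sketch of the two-file idea, assuming the big file is a flat sequence of <record> elements (the tag and file names are only illustrative):
from lxml import etree

def build_reverse_store(xml_path, data_path, index_path, tag='record'):
    # Pass 1: stream forward, dump each record and remember where it starts.
    with open(data_path, 'wb') as data, open(index_path, 'w') as index:
        for _, elem in etree.iterparse(xml_path, tag=tag):
            index.write('%d\n' % data.tell())
            data.write(etree.tostring(elem))
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]   # drop references the root keeps to cleared records

def iter_records_backwards(data_path, index_path):
    # Pass 2: walk the offsets from last to first and re-parse each record.
    with open(index_path) as f:
        offsets = [int(line) for line in f]
    with open(data_path, 'rb') as data:
        for i in range(len(offsets) - 1, -1, -1):
            data.seek(offsets[i])
            if i + 1 < len(offsets):
                raw = data.read(offsets[i + 1] - offsets[i])
            else:
                raw = data.read()
            yield etree.fromstring(raw)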

Related

quickest method of accessing key - value pairs?

I hope the question is not too unspecific: I have a huge database-like list (~900,000 entries) which I want to use for processing text files. (More details below.) Since this list will be edited and used with other programs as well, I would prefer to keep it in one separate file and include it in the python code, either directly or by dumping it to some format that python can use. I was wondering if you can advise on what would be the quickest and most efficient way. I have looked at several options, but may not have seen what is best:
1. Include the list as a python dictionary in the form
   my_list = { "key": "value" }
   directly in my python code.
2. Dump the list to an sqlite database and use the sqlite3 module.
3. Have the list as a yml file and use the yaml module.
Any ideas how these approaches would scale if I process a text file and want to do replacements on something like 30,000 lines?
For those interested: this is for linguistic processing, in particular ancient Greek. The list is an exhaustive list of Greek forms and the head words that they are derived from. For every word form in a text file, I want to add the dictionary head word.
Point 1 is much faster than using either YAML or SQL, as @b4hand and @DeepSpace indicated. What you should do, though, is not include the list in the rest of the python code you are developing, as you indicated, but make a separate .py file with just that dictionary definition.
That way the list in that file is easier to write from a program (or to extend by a program). And, on first import, a .pyc will be created, which speeds up re-reading on further runs of your program. This is actually very similar in performance
to using the pickle module to pickle the dictionary to a file and read it back from there, while keeping the dictionary in an easily human-readable and editable form.
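As a small illustration of the separate-file idea (the module name greek_forms.py, the variable name forms and the helper are made up here):
# greek_forms.py -- written/extended by whatever program maintains the list
forms = {
    "key": "value",
    # ... ~900,000 entries ...
}

# main script: on first import a .pyc is created, so later runs load fast
from greek_forms import forms

def add_head_words(line):
    # look each word form up and append its head word, leaving unknown words alone
    return ' '.join('%s(%s)' % (w, forms[w]) if w in forms else w
                    for w in line.split())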
Less than one million entries is not huge and should fit in memory easily. Thus, your best bet is option 1.
If you are looking for speed, option 1 should be the fastest because the other 2 will need to repeatedly access the HD which will be the bottleneck.
I would use a caching mechanism to hold this data or maybe a data structure storage like redis. Loading all of this in memory might become too expensive.

XML parsing in Python for big data

I am trying to parse an XML file using Python. But the problem is that the XML file size is around 30GB. So, it's taking hours to execute:
tree = ET.parse('Posts.xml')
In my XML file, there are millions of child elements of the root. Is there any way to make it faster? I don't need to parse all the children; even the first 100,000 would be fine. All I need is a way to set a limit on how much of it is parsed.
You'll want an XML parsing mechanism that doesn't load everything into memory.
You can use ElementTree.iterparse, or you could use SAX.
Here is a page with some XML processing tutorials for Python.
UPDATE: As @marbu said in the comment, if you use ElementTree.iterparse, be sure to use it in such a way that you get rid of elements once you've finished processing them, so they don't pile up in memory (see the sketch below).
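Something along these lines, for instance, sketched for the Posts.xml case above; it assumes the millions of children are <row> elements (as in the Stack Exchange dumps) and stops after the first 100,000, so adjust the tag and the limit to your file:
import xml.etree.ElementTree as ET

context = ET.iterparse('Posts.xml', events=('start', 'end'))
_, root = next(context)          # the first 'start' event gives us the root element

count = 0
for event, elem in context:
    if event == 'end' and elem.tag == 'row':
        # ... process elem.attrib here ...
        count += 1
        root.clear()             # forget the children already processed
        if count >= 100000:
            break                # no need to read the rest of the ~30 GB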

Concurrent SAX processing of large, simple XML files?

I have a couple of gigantic XML files (10GB-40GB) that have a very simple structure: just a single root node containing multiple row nodes. I'm trying to parse them using SAX in Python, but the extra processing I have to do for each row means that the 40GB file takes an entire day to complete. To speed things up, I'd like to use all my cores simultaneously. Unfortunately, it seems that the SAX parser can't deal with "malformed" chunks of XML, which is what you get when you seek to an arbitrary line in the file and try parsing from there. Since the SAX parser can accept a stream, I think I need to divide my XML file into eight different streams, each containing [number of rows]/8 rows and padded with fake opening and closing tags. How would I go about doing this? Or — is there a better solution that I might be missing? Thank you!
You can't easily split the SAX parsing into multiple threads, and you don't need to: if you just run the parse without any other processing, it should run in 20 minutes or so. Focus on the processing you do to the data in your ContentHandler.
My suggested way is to read the whole XML file into an internal format and do the extra processing afterwards. SAX should be fast enough to read 40GB of XML in no more than an hour.
Depending on the data you could use a SQLite database or HDF5 file for intermediate storage.
By the way, Python threads won't give you true CPU parallelism (see the GIL). You need the multiprocessing module to split the work across different processes.
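If you do want to overlap the parse with the per-row work, one possible shape (with made-up names: process_row() and the 'row' tag are placeholders) is to keep a single SAX pass in the parent process and push the expensive work to a multiprocessing pool:
import multiprocessing
import xml.sax

def process_row(row_text):
    # ... the expensive per-row processing goes here ...
    return len(row_text)

class RowHandler(xml.sax.ContentHandler):
    def __init__(self, pool):
        xml.sax.ContentHandler.__init__(self)
        self.pool = pool
        self.results = []     # for tens of millions of rows you would batch these instead
        self.buffer = []
        self.in_row = False
    def startElement(self, name, attrs):
        if name == 'row':
            self.in_row = True
            self.buffer = []
    def characters(self, content):
        if self.in_row:
            self.buffer.append(content)
    def endElement(self, name):
        if name == 'row':
            self.in_row = False
            # hand the row off asynchronously; the parser keeps reading meanwhile
            self.results.append(self.pool.apply_async(process_row,
                                                      (''.join(self.buffer),)))

if __name__ == '__main__':
    pool = multiprocessing.Pool()          # one worker per core by default
    handler = RowHandler(pool)
    xml.sax.parse('huge.xml', handler)
    pool.close()
    pool.join()
    results = [r.get() for r in handler.results]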

Are there ways to modify/update xml files other than totally over writing the old file?

I'm working on a script which involves continuously analyzing data and outputting results in a multi-threaded way. So basically the result file (an xml file) is constantly being updated/modified (sometimes 2-3 times per second).
I'm currently using lxml to parse/modify/update the xml file, which works fine right now. But from what I can tell, you have to rewrite the whole xml file even if you just add one entry/sub-entry like <weather content=sunny /> somewhere in the file. The xml file is gradually growing bigger, and so is the overhead.
As far as efficiency/resources are concerned, is there any other way to update/modify the xml file? Or will I have to switch to an SQL database or something similar some day, when the xml file is too big to parse/modify/update?
No, you generally cannot - and not just XML files, any file format.
You can only update "in place" if you overwrite bytes exactly (i.e. don't add or remove any characters, just replace some with something of the same byte length).
Using a form of database sounds like a good option.
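If it helps, here is a minimal sqlite3 sketch of what "a form of database" buys you here (the table and column names are invented): an insert writes only the new row, instead of rewriting the whole document.
import sqlite3

# note: open one connection per thread if you keep the multi-threaded design
conn = sqlite3.connect('results.db')
conn.execute('CREATE TABLE IF NOT EXISTS results (tag TEXT, content TEXT)')

def add_entry(tag, content):
    with conn:   # commits on success, rolls back on error
        conn.execute('INSERT INTO results (tag, content) VALUES (?, ?)',
                     (tag, content))

add_entry('weather', 'sunny')   # the <weather content=sunny /> example above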
It certainly sounds like you need some sort of database; as Li-anung Yip states, this would take care of all kinds of nasty multi-threaded sync issues.
You stated that your data is gradually increasing? How is it being consumed? Are clients forced to download the entire result file each time?
I don't know your use-case, but perhaps you could consider using an ATOM feed to distribute your data changes? Providing support for AtomPub would also effectively REST-enable your data. It's still XML, but in a standards-compliant format that is easy to consume and poll for changes.

What is the most efficient way of extracting information from a large number of xml files in python?

I have a directory full (~10^3 to 10^4) of XML files from which I need to extract the contents of several fields.
I've tested different xml parsers, and since I don't need to validate the contents (expensive) I was thinking of simply using xml.parsers.expat (the fastest one) to go through the files, one by one to extract the data.
Is there a more efficient way? (simple text matching doesn't work)
Do I need to issue a new ParserCreate() for each new file (or string) or can I reuse the same one for every file?
Any caveats?
Thanks!
Usually, I would suggest using ElementTree's iterparse, or for extra speed, its counterpart from lxml. Also try to use multiprocessing (the old processing package, built in since 2.6) to parallelize.
The important thing about iterparse is that you get the element (sub-)structures as they are parsed.
import xml.etree.cElementTree as ET
xml_it = ET.iterparse("some.xml")
event, elem = next(xml_it)   # pull the first (event, element) pair off the iterator
event will always be the string "end" in this case, but you can also initialize the parser to tell you about new elements as they are started. You don't have any guarantee that all child elements will have been parsed at that point, but the attributes are already there, if that's all you're interested in.
Another point is that you can stop reading elements from the iterator early, i.e. before the whole document has been processed.
If the files are large (are they?), there is a common idiom for keeping memory usage constant, just as with a streaming parser: clear each element once you are done with it.
The quickest way would be to match strings (with, e.g., regular expressions) instead of parsing XML - depending on your XMLs this could actually work.
But the most important thing is this: instead of thinking through several options, just implement them and time them on a small set. This will take roughly the same amount of time, and will give you real numbers to drive you forward.
EDIT:
Are the files on a local drive or network drive? Network I/O will kill you here.
The problem parallelizes trivially - you can split the work among several computers (or several processes on a multicore computer).
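For instance, a quick sketch of that per-file parallelism with multiprocessing.Pool; extract_fields() and the field names are placeholders for whatever extraction you settle on:
import glob
import multiprocessing
import xml.etree.ElementTree as ET

def extract_fields(path):
    wanted = {}
    for _, elem in ET.iterparse(path):
        if elem.tag in ('title', 'author'):     # assumed field names
            wanted[elem.tag] = elem.text
        elem.clear()
    return path, wanted

if __name__ == '__main__':
    files = glob.glob('data/*.xml')             # assumed location of the ~10^4 files
    pool = multiprocessing.Pool()               # one process per core
    for path, fields in pool.imap_unordered(extract_fields, files):
        print(path, fields)
    pool.close()
    pool.join()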
If you know that the XML files are generated using the ever-same algorithm, it might be more efficient to not do any XML parsing at all. E.g. if you know that the data is in lines 3, 4, and 5, you might read through the file line-by-line, and then use regular expressions.
Of course, that approach would fail if the files are not machine-generated, or originate from different generators, or if the generator changes over time. However, I'm optimistic that it would be more efficient.
Whether or not you recycle the parser objects is largely irrelevant. Many more objects will get created during parsing, so a single parser object doesn't really count for much.
One thing you didn't indicate is whether or not you're reading the XML into a DOM of some kind. I'm guessing that you're probably not, but on the off chance you are, don't. Use xml.sax instead. Using SAX instead of DOM will get you a significant performance boost.
