How to parse XML file in chunks - python

I have a very large XML file with 40,000 tag elements.
When I use ElementTree to parse this file, it gives errors due to memory.
So is there any module in Python that can read the XML file in chunks without loading the entire XML into memory? And how can I implement that module?

Probably the best library for working with XML in Python is lxml; in this case you should be interested in iterparse/iterwalk.
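For illustration, here is a minimal sketch of iterparse; the tag name "record" and the file name "big.xml" are placeholders for whatever your 40,000 elements and your file are actually called. Clearing elements as you go is what keeps memory flat:

from lxml import etree

def handle(elem):
    # placeholder for whatever you need to do with one element
    print(elem.tag, (elem.text or "").strip())

# stream the file, reacting only to end events for the element we care about
context = etree.iterparse("big.xml", events=("end",), tag="record")
for _, elem in context:
    handle(elem)
    elem.clear()                        # drop the element's own content
    while elem.getprevious() is not None:
        del elem.getparent()[0]         # drop already-processed siblings held by the root
del context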

This is a problem that people usually solve using SAX.
If your huge file is basically a bunch of XML documents aggregated inside an overall XML envelope, then I would suggest using SAX (or plain string parsing) to break it up into a series of individual documents that you can then process using lxml.etree.
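As a sketch of the plain-string-parsing variant, assuming each embedded document is closed by a literal </record> tag (replace "record" and the file name with your own, and note that nested or CDATA-embedded close tags would defeat this simple split):

from lxml import etree

START_TAG = "<record"        # assumed start of each embedded document
END_TAG = "</record>"        # assumed closing tag of each embedded document

def split_documents(path):
    # yield one parsed sub-document at a time without loading the whole envelope
    buffer = ""
    with open(path, encoding="utf-8") as f:
        for line in f:
            buffer += line
            while END_TAG in buffer:
                chunk, buffer = buffer.split(END_TAG, 1)
                start = chunk.find(START_TAG)   # skip any envelope text before the document
                yield etree.fromstring(chunk[start:] + END_TAG)

for root in split_documents("huge_envelope.xml"):
    print(root.tag)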

Related

How to remove all the unnecessary tags and signs from html files?

I am trying to extract "only" text information from 10-K reports (e.g. a company's proxy reports) on SEC's EDGAR system by using Python's BeautifulSoup or HTMLParser. However, the parsers I am using do not seem to work well on the 'txt'-format files, which include a large portion of meaningless signs and tags along with some XBRL information that is not needed at all. However, when I apply the parser directly to 'htm'-format files, which are more or less free from the issues of meaningless tags, the parser seems to work relatively fine.
"""for Python 3, from urllib.request import urlopen"""
from urllib2 import urlopen
from bs4 import BeautifulSoup
"""for extracting text data only from txt format"""
txt = urlopen("https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/0001660156-16-000019.txt")
bs_txt = BeautifulSoup(txt.read())
bs_txt_text = bs_txt.get_text()
len(bs_txt_text) # 400051
"""for extracting text data only from htm format"""
html = urlopen("https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/f201510kzec2_10k.htm")
bs_html = BeautifulSoup(html.read())
bs_html_text = bs_html.get_text()
len(bs_html_text) # 98042
But the issue is that I am in a position where I have to rely on the 'txt'-format files, not on the 'htm' ones. So my question is: is there any way to remove all the meaningless signs and tags from the files and extract only the text information, like the text extracted directly from the 'htm' files? I am relatively new to parsing with Python, so if you have any idea on this, it would be of great help. Thank you in advance!
The best way to deal with XBRL data is to use an XBRL processor such as the open-source Arelle (note: I have no affiliation with them) or other proprietary engines.
You can then look at the data with a higher level of abstraction. In terms of the XBRL data model, the process you describe in the question involves
looking for concepts that are text blocks (textBlockItemType) in the taxonomy;
retrieving the value of the facts reported against these concepts in the instance;
additionally, obtaining some meta-information regarding it: who (reporting entity), when (XBRL period), what the text is about (concept metadata and documentation), etc.
An XBRL processor will save you the effort of resolving the entire DTS as well as dealing with the complexity of the low-level syntax.
The second most appropriate way is to use an XML parser, maybe with an XML Schema engine as well as XQuery or XSLT, but this will require more work as you will need to either:
look at the XML Schema (XBRL taxonomy schema) files, navigating them recursively and looking for text block concepts, and deal with namespaces, links, and so on (which an XBRL processor shields you from);
or only look at the instance, ideally the XML file (e.g., https://www.sec.gov/Archives/edgar/data/1660156/000166015616000019/zeci-20151231.xml ) with a few hacks (such as taking XML elements whose names end with TextBlock), but this is at your own risk and not recommended, as it bypasses the taxonomy; a rough sketch of that hack follows below.
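A rough sketch of that last hack, using lxml on the instance document from the question (downloaded locally) and simply matching local names that end with "TextBlock":

from lxml import etree

# zeci-20151231.xml is the XBRL instance from the filing linked above, saved locally
tree = etree.parse("zeci-20151231.xml")

for elem in tree.iter():
    if not isinstance(elem.tag, str):
        continue                        # skip comments and processing instructions
    local_name = etree.QName(elem).localname
    if local_name.endswith("TextBlock") and elem.text:
        # the text block content is usually escaped XHTML; just summarize it here
        print(local_name, len(elem.text))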
Finally, as you suggest in the original question, you can also look at the document-format files (HTML, etc.) rather than at the data files of the SEC filing; however, this defeats the purpose of using XBRL, which is to make the data understandable by a computer thanks to tags and contexts, and it may miss important context information associated with the text -- a bit like opening a spreadsheet file with a text/hex editor.
Of course, there are use cases that could justify using that last approach such as running natural language processing algorithms. All I am saying is that this is then outside of the scope of XBRL.
There is an HTML tag stripper at the pyparsing wiki Examples page. It does not try to build an HTML document; it merely looks for HTML and script tags and strips them out.
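I don't have the wiki page in front of me, but a minimal sketch in the same spirit, assuming pyparsing's built-in anyOpenTag/anyCloseTag and HTML-entity helpers, would look like this:

from pyparsing import anyOpenTag, anyCloseTag, commonHTMLEntity, replaceHTMLEntity

def strip_tags(html_text):
    # suppress any opening/closing tag, and decode common HTML entities
    stripper = (anyOpenTag | anyCloseTag).suppress() | commonHTMLEntity.setParseAction(replaceHTMLEntity)
    return stripper.transformString(html_text)

print(strip_tags("<p>Sales &amp; earnings were <b>up</b> this year</p>"))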

XML parsing in Python for big data

I am trying to parse an XML file using Python. But the problem is that the XML file size is around 30GB. So, it's taking hours to execute:
tree = ET.parse('Posts.xml')
In my XML file, there are millions of child elements of the root. Is there any way to make it faster? I don't need to parse all the children; even the first 100,000 would be fine. All I need is a way to limit how much of the file gets parsed.
You'll want an XML parsing mechanism that doesn't load everything into memory.
You can use ElementTree.iterparse, or you could use SAX.
Here is a page with some XML processing tutorials for Python.
UPDATE: As #marbu said in the comment, if you use ElementTree.iterparse, be sure to use it in such a way that you get rid of elements in memory when you have finished processing them.
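Here is a short sketch of that with the standard library, including stopping after the first 100,000 children as mentioned in the question (the "row" tag name is a guess based on the Posts.xml data dumps; adjust it to your file):

import xml.etree.ElementTree as ET

LIMIT = 100_000
count = 0

# iterparse streams the file instead of building the whole tree at once
context = ET.iterparse("Posts.xml", events=("start", "end"))
_, root = next(context)                 # the first event gives us the root element

for event, elem in context:
    if event == "end" and elem.tag == "row":
        # ... process elem.attrib here ...
        count += 1
        if count >= LIMIT:
            break
        root.clear()                    # drop already-processed children to keep memory flat

print("parsed", count, "rows")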

Concurrent SAX processing of large, simple XML files?

I have a couple of gigantic XML files (10GB-40GB) that have a very simple structure: just a single root node containing multiple row nodes. I'm trying to parse them using SAX in Python, but the extra processing I have to do for each row means that the 40GB file takes an entire day to complete. To speed things up, I'd like to use all my cores simultaneously. Unfortunately, it seems that the SAX parser can't deal with "malformed" chunks of XML, which is what you get when you seek to an arbitrary line in the file and try parsing from there. Since the SAX parser can accept a stream, I think I need to divide my XML file into eight different streams, each containing [number of rows]/8 rows and padded with fake opening and closing tags. How would I go about doing this? Or — is there a better solution that I might be missing? Thank you!
You can't easily split the SAX parsing into multiple threads, and you don't need to: if you just run the parse without any other processing, it should run in 20 minutes or so. Focus on the processing you do to the data in your ContentHandler.
My suggestion is to read the whole XML file into an internal format and do the extra processing afterwards. SAX should be fast enough to read 40GB of XML in no more than an hour.
Depending on the data you could use a SQLite database or HDF5 file for intermediate storage.
By the way, Python is not really multi-threaded (see GIL). You need the multiprocessing module to split the work into different processes.
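A rough sketch of that division of labour, assuming each <row> element carries its data in attributes and that heavy_work stands in for your expensive per-row processing:

import multiprocessing as mp
import xml.sax

def heavy_work(row_attrs):
    # stand-in for the expensive per-row processing
    return len(row_attrs)

class RowCollector(xml.sax.ContentHandler):
    # collect row attributes into batches and fan each batch out to a process pool
    def __init__(self, pool, batch_size=10_000):
        super().__init__()
        self.pool = pool
        self.batch_size = batch_size
        self.batch = []
        self.results = []

    def startElement(self, name, attrs):
        if name == "row":                      # assumed name of the repeated element
            self.batch.append(dict(attrs))
            if len(self.batch) >= self.batch_size:
                self.flush()

    def flush(self):
        if self.batch:
            self.results.extend(self.pool.map(heavy_work, self.batch))
            self.batch = []

if __name__ == "__main__":
    with mp.Pool() as pool:
        handler = RowCollector(pool)
        xml.sax.parse("huge.xml", handler)     # file name assumed
        handler.flush()                        # handle the final partial batch
        print(len(handler.results), "rows processed")
        # for a 40GB input you would write results out as you go rather than keep them all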

xsd validation, get the object that is invalid

I have a large XML file (3 MB+) and I have an XSD to validate it against.
I am using Python and lxml. I started from this script <>, which does the validation fine, including giving me the line number. But the problem is that the file is all on one line, so when I validate, all I get is the error reported on line 1. When I use pretty print to split the lines up for me, it maxes out at line 65535.
Thanks!
Pretty-print your XML to add newlines to it. Then put it through your validator to get a more helpful line number.
EDIT: On re-reading your question, I see that you have used Notepad++ to add the newlines, but that lxml apparently has a size limitation when it comes to validating your XML.
For a general approach to this problem, please see Validating a HUGE XML file. In particular, the accepted answer begins with:
Instead of using a DOMParser, use a SAXParser. This reads from an input stream or reader so you can keep the XML on disk instead of loading it all into memory.
Basically, you need to use a streaming approach, which SAX offers. So, if your requirement is that you must validate your file in Python, then you'll need to find a validation approach based on streaming. (Perhaps lxml offers validation in a streaming fashion?)
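lxml does appear to offer exactly that: iterparse() accepts a schema argument, so the document is validated while it is being streamed. A minimal sketch (file names assumed):

from lxml import etree

schema = etree.XMLSchema(etree.parse("my.xsd"))

try:
    # validate while streaming; the whole document never has to sit in memory
    for _, elem in etree.iterparse("my.xml", schema=schema):
        elem.clear()                    # keep memory flat for a huge file
except etree.LxmlError as err:
    print("validation failed:", err)
else:
    print("document is valid")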
However, if your validation requirements are more flexible, then consider a specialized tool such as XMLStarlet.
For example, here's how to validate an XML file against an XSD from the XMLStarlet entry on Wikipedia:
xmlstarlet val -e -s my.xsd my.xml
And here is a testimonial on using XMLStarlet on very large files.

How to parse multiple XML documents from a single stream?

I've got a socket from which I'm reading XML data. However, this socket will spit out multiple different XML documents, so I can't simply parse all the output I receive.
Is there a good way, preferably using the Python standard library, for me to parse multiple XML documents? In other words, if I end up getting
<foo/>
<bar/>
then is there a way to either get multiple DOM objects or have a SAX parser simply work on such a stream?
If you get separate documents, you'll need something to divide them; and if you have that, you can simply split the stream before you parse the individual documents.
Another possibility would be to wrap the documents in another document, so that each XML document is actually a subdocument of a parent you create (and wrap around them) just for that purpose.
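A small sketch of the wrapping idea with the standard library, assuming the documents have already been read into a string and do not carry <?xml ...?> declarations between them:

import xml.etree.ElementTree as ET

def parse_concatenated(payload):
    # wrap everything in a synthetic root so the result is a single well-formed document
    wrapped = "<stream-wrapper>" + payload + "</stream-wrapper>"
    root = ET.fromstring(wrapped)
    return list(root)                   # one Element per original document

for doc in parse_concatenated("<foo/>\n<bar/>"):
    print(doc.tag)                      # prints: foo, then bar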
