generate xml with sax2 in python - python

I have a data model (an object of a class), and I need to either initialize it by reading from an XML file, or create the object from scratch and output it to an XML file. Previously, I simply used Python string operations to read the XML (file.read + string.find) and write it (file.write), without error checking.
Now I am thinking of using SAX2 for this. I know how to do the reading, but I am not clear about the writing. It looks like SAX2 is meant for the case where there is an original XML document and you want to output it after certain modifications. In my case I want to output my data model to XML, with no original XML at all. I wonder whether SAX2 is suitable for this or whether I should keep using my old way. What is a better way to read/write a class object from/to XML with Python? The class is very simple (just a list of lists of information, i.e., root -> children -> grandchildren) and small.
Thanks for any suggestions.

Try the pythonic XML processing way: ElementTree.
Generating XML output is easy with xml.etree.ElementTree.ElementTree.write().
write(file, encoding="us-ascii", xml_declaration=None, method="xml")
Writes the element tree to a file, as XML. file is a file name, or a file object opened for writing. encoding is the output encoding (default is US-ASCII). xml_declaration controls if an XML declaration should be added to the file. Use False for never, True for always, None for only if not US-ASCII or UTF-8 (default is None). method is either "xml", "html" or "text" (default is "xml"). Returns an encoded string.
Example loading ElementTree object from text file:
>>> from xml.etree.ElementTree import ElementTree
>>> tree = ElementTree()
>>> tree.parse("index.xhtml")
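For the write direction the question asks about, a tree can also be built from scratch and then serialized. The following is a minimal sketch, assuming a made-up data model (a list of child names, each with a list of grandchild values); the element and attribute names are illustrative and not taken from the question.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical data model: a list of (child name, [grandchild values]).
model = [("group1", ["a", "b"]), ("group2", ["c"])]

# Build root -> children -> grandchildren from the model.
root = ET.Element("root")
for child_name, values in model:
    child = ET.SubElement(root, "child", name=child_name)
    for value in values:
        ET.SubElement(child, "grandchild").text = value

# Serialize to an in-memory binary buffer; a file name works the same way.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
xml_bytes = buffer.getvalue()

# Read the XML back into the same list structure.
parsed = ET.fromstring(xml_bytes)
restored = [
    (child.get("name"), [g.text for g in child.findall("grandchild")])
    for child in parsed.findall("child")
]
```

Passing a path such as "model.xml" instead of the buffer would write the same content to disk.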

Related

How to convert an XML tree object to bytes stream? Python

I have a function that saves files to a db, but this one requires a bytes stream as parameter. Something like:
write_to_db("File name", stream_obj)
Now, I want to save an XML document; I am using the xml library.
import xml.etree.ElementTree as ET
Is there a function that converts the XML object to a bytes stream?
The solution I got was:
Save it locally with the function write
Retrieve it with "rb" to get the file as bytes
Now that I have the bytes stream, save it with the function mentioned
Delete the file
Example:
import os
# Saving the XML as a local file
tree = ET.ElementTree(ET.Element("Example"))
tree.write("/This/is/a/path.xml")
# Reading the local file as bytes and saving it to the DB
with open("/This/is/a/path.xml", "rb") as f:
    write_to_db("File name", f)  # "f" is a binary stream because the file was opened with "rb"
# Deleting the local file
os.remove("/This/is/a/path.xml")
But is there a function in the xml library that returns the bytes stream directly? Something like:
tree = ET.ElementTree(ET.Element("Example"))
bytes_file = tree.get_bytes() # <-- Like this?
# Writing to db
write_to_db("File name", bytes_file)
That way I could avoid creating and removing the file in my repository.
Thank you in advance.
Another quick question:
Is the term "bytes stream" correct? If not, what is the difference, and what is the correct term for what I am looking for?
So as Balmy mentioned in the comments, the solution is using:
ET.tostring()
My code at the end looked something like this:
# Here you build your XML (attribute names must not contain spaces)
x = ET.Element("ExampleXML", {"a_tag": "1", "another_tag": "2"})
# Here I am saving it to my db by using the "tostring" function,
# which by default returns the XML as a bytes object.
write_to_db("File name", ET.tostring(x))
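On the terminology question: ET.tostring() returns a bytes object, while open(..., "rb") returns a binary file-like object (a stream) that has a read() method. If the database function really needs something file-like, the bytes can be wrapped in io.BytesIO. A small sketch, where write_to_db is a hypothetical stand-in for the asker's function:

```python
import io
import xml.etree.ElementTree as ET


def write_to_db(name, stream):
    """Hypothetical stand-in for the real function: it only reads the stream."""
    return name, stream.read()


root = ET.Element("Example")

# tostring() gives a bytes object (not a stream).
payload = ET.tostring(root, encoding="utf-8")

# BytesIO turns those bytes into an in-memory binary stream, so no temp file is needed.
stream = io.BytesIO(payload)
saved_name, saved_bytes = write_to_db("File name", stream)
```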

Modifying XML file from string source - how to do it?

I have a problem: I want to change some lines in my XML, but the XML is not in a file, it is in a string. I am using Python 3.x and the xml.etree.ElementTree library for this purpose.
I have this piece of code, which I know works for files in the project, but as I said, I want no files, only operations on string sources.
source_tree = ET.ElementTree(ET.fromstring(source_config))
source_tree_root = ET.fromstring(source_config)
for item in source_tree_root.iter('generation'):
    item.text = item.text.replace(self.firstarg, self.secondarg)
This works, but I don't know how to save it. I tried
source_tree.write(source_config, encoding='latin-1') but this doesn't work (it treats the whole XML string as a file name).
I don't think you need both source_tree and source_tree_root. By having both, you're creating two separate things. When you write using source_tree, you don't get the changes made to source_tree_root.
Try creating an ElementTree from source_tree_root (which is just an Element), like this (untested since you didn't supply an mcve)...
source_tree_root = ET.fromstring(source_config)
for item in source_tree_root.iter('generation'):
    item.text = item.text.replace(self.firstarg, self.secondarg)
ET.ElementTree(source_tree_root).write("output.xml")
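Since the question asks for string-only operations, the modified tree can also be serialized back to a string with ET.tostring() instead of being written to a file. A minimal sketch with a made-up source_config and made-up replacement values:

```python
import xml.etree.ElementTree as ET

# Made-up input standing in for the asker's source_config string.
source_config = "<config><generation>old value</generation></config>"

root = ET.fromstring(source_config)
for item in root.iter("generation"):
    if item.text:  # skip empty elements, whose text is None
        item.text = item.text.replace("old", "new")

# encoding="unicode" makes tostring() return str rather than bytes.
updated_config = ET.tostring(root, encoding="unicode")
```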
OK, I thought that you were using lxml. My bad. Well here is how you would do it with lxml. I think that lxml is superior.
Here is the basic way of parsing a string into an XML document.
from lxml import etree
doc = etree.fromstring(yourstring)
for e in doc.xpath('//sometagname'):
    e.set('foo', 'bar')

MemoryError on generating large xml files using xml.etree.ElementTree [duplicate]

There are many ways to read XML, both all-at-once (DOM) and one-bit-at-a-time (SAX). I have used SAX or lxml to iteratively read large XML files (e.g. wikipedia dump which is 6.5GB compressed).
However after doing some iterative processing (in python using ElementTree) of that XML file, I want to write out the (new) XML data to another file.
Are there any libraries to iteratively write out XML data? I could create the XML tree and then write it out, but that is not possible without oodles of RAM. Is there any way to write the XML tree to a file iteratively, one bit at a time?
I know I could generate the XML myself with print "<%s>" % tag_name, etc., but that seems a bit... hacky.
Fredrik Lundh's elementtree.SimpleXMLWriter will let you write out XML incrementally. Here's the demo code embedded in the module:
from elementtree.SimpleXMLWriter import XMLWriter
import sys
w = XMLWriter(sys.stdout)
html = w.start("html")
w.start("head")
w.element("title", "my document")
w.element("meta", name="generator", value="my application 1.0")
w.end()
w.start("body")
w.element("h1", "this is a heading")
w.element("p", "this is a paragraph")
w.start("p")
w.data("this is ")
w.element("b", "bold")
w.data(" and ")
w.element("i", "italic")
w.data(".")
w.end("p")
w.close(html)
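The standard library offers a similar incremental writer in xml.sax.saxutils.XMLGenerator, which emits each tag as soon as it is called, so nothing has to be held in memory as a tree. This is a swapped-in alternative to SimpleXMLWriter, not part of the answer above; the element names and the write_items helper are made up for illustration.

```python
import io
from xml.sax.saxutils import XMLGenerator


def write_items(out, items):
    """Stream (id, text) pairs to `out` as XML, one element at a time."""
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    gen.startElement("items", {})
    for item_id, text in items:
        gen.startElement("item", {"id": str(item_id)})
        gen.characters(text)  # special characters are escaped automatically
        gen.endElement("item")
    gen.endElement("items")
    gen.endDocument()


# A generator is used as input, so the items are never all in memory at once.
buffer = io.StringIO()
write_items(buffer, ((i, "value %d" % i) for i in range(3)))
result = buffer.getvalue()
```

In real use, `out` would be a file opened for writing rather than a StringIO buffer.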
With lxml, you can use etree.Element to create new nodes, and etree.tostring to write out the XML representation. See, for example, Listing 6. Serialize an element's children from Liza Daly's article "High-performance XML parsing in Python with lxml".
If you're reading in XML dialect1, and have to write XML dialect2, wouldn't it be a good idea to write down the conversion process using xslt? You may not even need any source code that way.
If you don't find anything else, what I'd prefer here is to inherit from ElementTree and create an "iterativeElementTree", adding a "file" attribute to it. I'd subclass the nodes to have a "start_tag_committed" attribute and a "commit" method. When called, this "commit" method would call the render method for a subtree, starting from the farthest parent where "start_tag_committed" is false. With the string in hand, I'd manually strip the closing tags for the parents of the current node. The previously opened but not yet closed siblings of the parents need to be handled as well.
Then, I'd remove the "committed" node from the in-memory model.
You will also need to annotate each node with its parent, as ElementTree does not do that.
(Write to me if there are no better answers and you get stuck; I could implement this.)

Python lxml error "namespace not defined."

I am being driven crazy by some oddly formed xml and would be grateful for some pointers:
The documents are defined like this:
<sphinx:document id="18059090929806848187">
<url>http://www.some-website.com</url>
<page_number>104</page_number>
<size>7865</size>
</sphinx:document>
Now, I need to read lots of these files (500m+, all gz compressed) and grab the text values from a few of the contained tags.
Sample code:
from lxml import objectify, etree
import gzip
with open('file_list') as file_list:
    for file in file_list:
        in_xml = gzip.open(file.strip('\n'))
        xml2 = etree.iterparse(in_xml)
        for action, elem in xml2:
            if elem.tag == "page_number":
                print(elem.text + str(file))
The first value of elem.text is returned, but only for the first file in the list, and it is quickly followed by the error:
lxml.etree.XMLSyntaxError: Namespace prefix sphinx on document is not defined, line 1, column 20
Please excuse my ignorance but xml really hurts my head and I have been struggling with this for a while. Is there a way that I can either define the namespace prefix or handle this in some other more intelligent manner?
Thanks
Your input file is not well-formed XML: it uses the sphinx prefix without declaring it. I assume that it is a snippet from a larger XML document.
Your choices are:
Reconstruct the larger document. How you do this is specific to your application. You may have to consult with the people that created the file you are parsing.
Parse the file in spite of its errors. To do that, use the recover keyword from lxml.etree.iterparse:
xml2 = etree.iterparse(in_xml, recover=True)
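One way to approach the first choice, when the original enclosing document is not available, is to wrap each snippet in a root element that declares the missing prefix. This is only a sketch using the standard library: the namespace URI and the wrapper element name are assumptions, and it would not work as-is if the snippet began with an XML declaration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; the real one depends on whoever produced the files.
SPHINX_NS = "http://example.com/sphinx"

snippet = """<sphinx:document id="18059090929806848187">
<url>http://www.some-website.com</url>
<page_number>104</page_number>
<size>7865</size>
</sphinx:document>"""

# Declaring xmlns:sphinx on a wrapper makes the prefix valid for the snippet inside it.
wrapped = '<wrapper xmlns:sphinx="%s">%s</wrapper>' % (SPHINX_NS, snippet)
root = ET.fromstring(wrapped)

# Unprefixed child tags such as page_number keep their plain names.
page_numbers = [elem.text for elem in root.iter("page_number")]
```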

Following xs:include when parsing XSD as XML with lxml in Python

So, my problem is that I'm trying to do something a little unorthodox. I have a complicated set of XSD files. However, I don't want to use these XSD files to validate an XML file; I want to parse these XSDs as XML and interrogate them just as I would a normal XML file. This is possible because XSDs are valid XML. I am using lxml with Python 3.
The problem I'm having is with the statement:
<xs:include schemaLocation="sdm-extension.xsd"/>
If I instruct lxml to create an XSD for verifying like this:
schema = etree.XMLSchema(schema_root)
this dependency will be resolved (the file exists in the same directory as the one I've just loaded). HOWEVER, I am treating these as XML so, correctly, lxml just treats this as a normal element with an attribute and does not follow it.
Is there an easy or correct way to extend lxml so that I may have the same or similar behaviour as, say
<xi:include href="metadata.xml" parse="xml" xpointer="title"/>
I could, of course, create a separate xml file manually that includes all the dependencies in the XSD schema. That is perhaps a solution?
So it seems like one option is to use the xi:include method and create a separate XML file that includes all the XSDs I want to parse. Something along the lines of:
<fullxsd xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="./xsd-cdisc-sdm-1.0.0/sdm1-0-0.xsd" parse="xml"/>
<xi:include href="./xsd-cdisc-sdm-1.0.0/sdm-ns-structure.xsd" parse="xml"/>
</fullxsd>
Then use some lxml along the lines of
from lxml import etree

def combine(xsd_file):
    with open(xsd_file, 'rb') as f_xsd:
        xsd_source = f_xsd.read()
    parser = etree.XMLParser(recover=True, encoding='utf-8', remove_comments=True, remove_blank_text=True)
    root = etree.fromstring(xsd_source, parser)
    incl = etree.XInclude()
    incl(root)
    print(etree.tostring(root, pretty_print=True))
It's not ideal, but it seems to be the proper way. I've looked at custom URI resolvers in lxml, but that would mean actually altering the XSDs, which seems messier.
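Another possibility, which avoids the extra wrapper file, is to follow the xs:include elements by hand: parse each XSD as plain XML, look for xs:include, and load the referenced file relative to the including one. The sketch below uses only the standard library; load_with_includes is a hypothetical helper, and it handles neither xs:import nor remote schemaLocation URLs.

```python
import os
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"


def load_with_includes(path, seen=None):
    """Return {absolute path: root element} for an XSD and all files it xs:includes."""
    seen = {} if seen is None else seen
    path = os.path.abspath(path)
    if path in seen:  # already loaded; also protects against circular includes
        return seen
    root = ET.parse(path).getroot()
    seen[path] = root
    for include in root.iter(XS + "include"):
        location = include.get("schemaLocation")
        if location:
            # schemaLocation is resolved relative to the including file's directory.
            load_with_includes(os.path.join(os.path.dirname(path), location), seen)
    return seen
```

Each returned root can then be queried like any other XML document, for example with root.iter(XS + "element").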
Try this:
def validate_xml(schema_file, xml_file):
    xsd_doc = etree.parse(schema_file)
    xsd = etree.XMLSchema(xsd_doc)
    xml = etree.parse(xml_file)
    return xsd.validate(xml)
