Let's take the StackOverflow public export of Tags, which looks something like this:
<?xml version="1.0" encoding="utf-8"?>
<tags>
<row Id="1" TagName=".net" Count="303362" ExcerptPostId="3624959" WikiPostId="3607476" />
<row Id="2" TagName="html" Count="1038358" ExcerptPostId="3673183" WikiPostId="3673182" />
<row Id="3" TagName="javascript" Count="2130783" ExcerptPostId="3624960" WikiPostId="3607052" />
...
Let's assume that this file wouldn't fit in memory, but since it's line-separated I think it'd be OK to process without too much trickery. What might be a good approach to process a file like this? My thought was just to process it row by row, faking the XML structure, something like:
for line in file:
    node = etree.fromstring('<x>%s</x>' % line).find('row')
    ...
Is this a common approach for handling "row-oriented" XML that is too big to fit in memory? I see this format commonly with DB exports (actually I think the DB client I use produces that format, though I never use XML exports from a DB).
SAX is an abbreviation of Simple API for XML and provides very fast and lightweight access to XML data. In contrast to DOM, which reads the whole XML document and creates an in-memory tree representation of it, SAX gives the program each piece of information immediately after it is read from the document during parsing. The workflow is as follows: the parser reads the document, and each time it finds a distinct type of XML data, such as the start of an element, the end of an element, a processing instruction, etc., it passes the information to the so-called handler.
See a concrete example here: http://python.zirael.org/e-sax1.html
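As a rough illustration of that handler workflow, here is a minimal sketch using the standard library's xml.sax module. The class name TagRowHandler and the shortened in-memory sample are made up for this example; with the real export you would pass the file path to xml.sax.parse instead of a BytesIO object.

```python
import io
import xml.sax


class TagRowHandler(xml.sax.ContentHandler):
    """Hypothetical handler: counts <row> elements and sums their Count attribute."""

    def __init__(self):
        super().__init__()
        self.rows = 0
        self.total = 0

    def startElement(self, name, attrs):
        # Called once per opening tag; no tree is ever built in memory.
        if name == "row":
            self.rows += 1
            self.total += int(attrs.get("Count", 0))


# Shortened stand-in for the Tags export (assumption: same row layout).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<tags>
  <row Id="1" TagName=".net" Count="303362" />
  <row Id="2" TagName="html" Count="1038358" />
</tags>
"""

handler = TagRowHandler()
xml.sax.parse(io.BytesIO(SAMPLE), handler)
print(handler.rows, handler.total)
```

Because the parser only keeps the current event, memory use should stay flat regardless of how many rows the file contains.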
I have an xml file with this as the header
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type='text/xsl' href='\\segotn12805\ppr\PPRData3\StyleSheet\PPRData3.xslt'?>
when I modify the file I use .write (for example)
mytree.write('output.xml')
but the output file does not contain the header info.
The first two lines of the output file look like this
<ns0:pprdata xmlns:ns0="http://ManHub.PPRData">
<ns0:Group name="Models">
Any ideas on how I can add the header info to the output file?
The first line is the XML declaration. It is optional, and a parser will assume UTF-8 if not specified.
The second line is a processing instruction.
It would be helpful if you provided more code to show what you are doing, but I suspect that you are using ElementTree. The documentation has this note indicating that by default these are skipped:
Note Not all elements of the XML input will end up as elements of the parsed tree. Currently, this module skips over any XML comments, processing instructions, and document type declarations in the input. Nevertheless, trees built using this module’s API rather than parsing from XML text can have comments and processing instructions in them; they will be included when generating XML output. A document type declaration may be accessed by passing a custom TreeBuilder instance to the XMLParser constructor.
As suggested in this answer, you might want to try lxml, which keeps processing instructions when it parses and writes a document.
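If you would rather stay with ElementTree, one possible workaround is to write the two header lines yourself and then let write() append the tree. This is only a sketch: the sample document is simplified (no namespace), the stylesheet href 'style.xslt' is a placeholder, and an in-memory buffer stands in for the real output file.

```python
import io
import xml.etree.ElementTree as ET

# Simplified stand-in for the input file (placeholder stylesheet path).
SOURCE = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b"<?xml-stylesheet type='text/xsl' href='style.xslt'?>\n"
    b'<pprdata><Group name="Models"/></pprdata>'
)

tree = ET.parse(io.BytesIO(SOURCE))  # the declaration and the PI are dropped here

out = io.BytesIO()  # with a real file: open('output.xml', 'wb')
# Re-emit the header lines by hand, since the parsed tree no longer has them.
out.write(b'<?xml version="1.0" encoding="utf-8"?>\n')
out.write(b"<?xml-stylesheet type='text/xsl' href='style.xslt'?>\n")
# xml_declaration=False avoids writing a second declaration.
tree.write(out, encoding="utf-8", xml_declaration=False)

result = out.getvalue().decode("utf-8")
print(result)
```

The obvious drawback is that the header is hard-coded, so it has to be kept in sync with the source file by hand.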
I am pretty new to Python. I know how to marshal/unmarshal objects in Java.
I am looking for something like we did it in Java.
Like:
JAXBContext jaxbContext = JAXBContext.newInstance(com.Request1.class);
Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(message)));
requestStr2 = (com.Request1) unmarshaller.unmarshal(doc);
Where Request1 has the @XmlRootElement annotation.
I don't want to write out multiple elements, subelements, etc. by hand, because I have a very complex XSD structure. I want to generate classes from the XSD with generateDS, then initialize them from a database and serialize them to an XML file.
I saw pyxser, but it is only for Python 2.
What modules could help me with that?
Thank you
Python modules generated by generateDS can parse an XML instance of an XML Schema from a file or from a string. generateDS supports complex XML Schemas, including complex types, abstract types, enumerated lists, mixed content, etc.
For example, here is the command for generateDS to generate a python module called people.py from people.xsd.
python generateDS.py -o people.py people.xsd
On Windows, there is a generateDS.exe wrapper that can be called as a short-cut:
generateDS.exe -o people.py people.xsd
This is an example XML instance conforming to the people.xsd schema.
<people xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<person id="1">
<name>Alberta</name>
</person>
<person id="2">
<name>Bernardo</name>
</person>
...
</people>
The following Python snippet parses the above XML instance from a file named people.xml that conforms to the XML Schema. The parse() function parses the XML document from a file and creates an object structure from the classes associated with the elements.
import people as p
people = p.parse("people.xml", silence=True)
# iterate over each person in the collection
for person in people.get_person():
    print(person.name)
If you want to parse from a string variable xml instead, call the parseString() function on it.
people = p.parseString(xml, silence=True)
Note that if the XML content contains an encoding declaration, you must pass bytes rather than a str, for example:
people = p.parseString(str.encode(xml), silence=True)
See tutorial for details (and more examples) to create a python module from an XML Schema.
I use the xml library in Python 3.5 for reading and writing an XML file. I don't modify the file. I just open it and write it. But the library modifies the file.
Why is it modified?
How can I prevent this? E.g. I just want to replace a specific tag or its value in a quite complex XML file without losing any other information.
This is the example file
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<movie>
<title>Der Eisbär</title>
<ids>
<entry>
<key>tmdb</key>
<value xsi:type="xs:int" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">9321</value>
</entry>
<entry>
<key>imdb</key>
<value xsi:type="xs:string" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">tt0167132</value>
</entry>
</ids>
</movie>
This is the code
import xml.etree.ElementTree as ET
tree = ET.parse('x.nfo')
tree.write('y.nfo', encoding='utf-8')
And the xml-file becomes this
<movie xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<title>Der Eisbär</title>
<ids>
<entry>
<key>tmdb</key>
<value xsi:type="xs:int">9321</value>
</entry>
<entry>
<key>imdb</key>
<value xsi:type="xs:string">tt0167132</value>
</entry>
</ids>
</movie>
Line 1 is gone.
The <movie>-tag in line 2 has attributes now.
The <value>-tags in lines 7 and 11 now have fewer attributes.
Note that "xml package" and "the xml library" are ambiguous. There are several XML-related modules in the standard library: https://docs.python.org/3/library/xml.html.
Why is it modified?
ElementTree moves namespace declarations to the root element, and namespaces that aren't actually used in the document are removed.
Why does ElementTree do this? I don't know, but perhaps it is a way to make the implementation simpler.
How can I prevent this? E.g. I just want to replace a specific tag or its value in a quite complex XML file without losing any other information.
I don't think there is a way to prevent this. The issue has been brought up before. Here are two very similar questions with no answers:
How do I parse and write XML using Python's ElementTree without moving namespaces around?
Keep Existing Namespaces when overwriting XML file with ElementTree and Python
My suggestion is to use lxml instead of ElementTree. With lxml, the namespace declarations will remain where they occur in the original file.
Line 1 is gone.
That line is the XML declaration. It is recommended but not mandatory to have one.
If you always want an XML declaration, use xml_declaration=True in the write() method call.
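A minimal sketch of that write() call, using an in-memory buffer in place of y.nfo and a cut-down version of the example document:

```python
import io
import xml.etree.ElementTree as ET

# Cut-down stand-in for x.nfo.
root = ET.fromstring("<movie><title>Der Eisbär</title></movie>")
tree = ET.ElementTree(root)

buf = io.BytesIO()  # with a real file: tree.write('y.nfo', ...)
tree.write(buf, encoding="utf-8", xml_declaration=True)

output = buf.getvalue().decode("utf-8")
print(output)
```

Note that the declaration ElementTree generates will not be byte-identical to the original one: it uses single quotes, and as far as I know write() has no option for emitting standalone="yes".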
This is my code:
from xml.dom import minidom as md
doc = md.parse('file.props')
# operations with doc
xml_file = open('file.props', "w")
doc.writexml(xml_file, encoding="utf-8")
xml_file.close()
I parse an XML file, do some operations, then open it and write to it. But, for example, if my file contains:
<MY_TAG />
^
it rewrites as:
<MY_TAG/>
^
I know this can seem irrelevant, but my file is tracked by Git, which reports that line as "different" on every write.
The same with the header:
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
it becomes:
<?xml version="1.0" encoding="utf-8"?><Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
Which is pretty annoying. Any clues?
Retaining quirks of formatting in XML through parsing and serialization is pretty well impossible. If you need to do text-level comparisons, the only way to do it is to canonicalize the formats that you are comparing (there are various XML canonicalization tools out there).
In principle you can configure git to use an XML-aware diff tool for comparisons, but please don't ask me for details, it's not something I've ever done myself. I've always just lived with the fact that it works badly.
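As one example of canonicalization, the standard library has offered xml.etree.ElementTree.canonicalize() (C14N 2.0) since Python 3.8. The sketch below assumes that function is available and uses simplified versions of the two formattings from the question, without the namespace; it only shows that both reduce to the same text.

```python
import xml.etree.ElementTree as ET

# Two serializations of the same document with different formatting quirks.
original = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<Project ToolsVersion="4.0"><MY_TAG /></Project>'
)
rewritten = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<Project ToolsVersion="4.0"><MY_TAG/></Project>'
)

# canonicalize() returns the canonical form as a string when no output
# file is given.
canon_original = ET.canonicalize(original)
canon_rewritten = ET.canonicalize(rewritten)

print(canon_original)
print(canon_original == canon_rewritten)
```

The canonical form is not the original formatting either (the XML declaration is dropped and empty elements become start/end tag pairs), so this helps with comparing files, not with preserving them.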
I'm not trying to parse an XML file; I want to reproduce it in Python (i.e. with etree).
I know the basics of etree; all I want to know about is XML namespaces.
I've got 3 namespaces in code. Here is the XML that I want to reproduce.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns1:g2sMessage xmlns:ns1="http://www.gamingstandards.com/g2s/schemas/v1.0.3"><ns1:g2sBody ns1:dateTimeSent="2014-08-14T15:30:48.692+09:00" ns1:egmId="TestAppEGMID" ns1:hostId="1"><ns1:communications ns1:deviceId="0" ns1:sessionId="1001" ns1:sessionType="G2S_request" ns1:timeToLive="100" ns1:commandId="1001" ns1:dateTime="2014-08-14T06:30:48.696Z"><ns1:keepAlive/></ns1:communications></ns1:g2sBody></ns1:g2sMessage>TestAppEGMID1
Does anyone have an idea?
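One possible way to build namespaced elements with ElementTree is to use qualified names of the form {namespace-uri}localname. The sketch below reproduces only part of the message. The helper q() and the prefix "g2s" are choices made for this example: ElementTree reserves prefixes of the form ns<digits> for its own use, so register_namespace() rejects "ns1". The prefix differs from the original text, but the namespace URI, which is what actually identifies the elements, is the same.

```python
import xml.etree.ElementTree as ET

NS = "http://www.gamingstandards.com/g2s/schemas/v1.0.3"

# "ns1" cannot be registered (reserved pattern), so a different prefix is used.
ET.register_namespace("g2s", NS)


def q(name):
    """Hypothetical helper: return the qualified name {NS}name."""
    return "{%s}%s" % (NS, name)


message = ET.Element(q("g2sMessage"))
body = ET.SubElement(
    message,
    q("g2sBody"),
    {q("egmId"): "TestAppEGMID", q("hostId"): "1"},
)
comms = ET.SubElement(body, q("communications"), {q("deviceId"): "0"})
ET.SubElement(comms, q("keepAlive"))

xml_text = ET.tostring(message, encoding="unicode")
print(xml_text)
```

The remaining attributes (dateTimeSent, sessionId, and so on) can be added to the attribute dictionaries in the same way.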