I am pretty new to python. I now how to marshall/unmarshall objects in Java.
I am looking for something like we did it in Java.
Like:
JAXBContext jaxbContext = JAXBContext.newInstance(com.Request1.class);
Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(message)));
requestStr2 = (com.Request1) unmarshaller.unmarshal(doc);
Where Request1 has #XmlRootElement annotation.
I don't want to write multiple elements, subelements etc, because i have very complex xsd structure. I want to generate classes from xsd by generateDS, then initialize it from database and serialize to xml-file
I saw pyxser, but it is only on python 2.
What modules could help me with that?
Thank you
The generateDS generated python-modules can parse a XML instance of XML Schema from file or from string. The generateDS module supports complex XML Schemas including complex types, abstract types, enumerated lists, mixed content, etc.
For example, here is the command for generateDS to generate a python module called people.py from people.xsd.
python generateDS.py -o people.py people.xsd
On Windows, there is a generateDS.exe wrapper that can be called as a short-cut:
generateDS.exe -o people.py people.xsd
This is an example XML instance conforming to the people.xsd schema.
<people xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<person id="1">
<name>Alberta</name>
</person>
<person id="2">
<name>Bernardo</name>
</person>
...
</people>
The following is a snippet of python to parse the above XML instance in file named people.xml that conforms to the XML Schema. The parse() function parses the XML document from a file and creates object structure for classes associated with the elements.
import people as p
people = p.parse("people.xml", silence=True)
# iterate over each person in the collection
for person in people.get_person():
print(person.name)
If wanted to parse from a string variable xml then call the parseString() function on it.
people = p.parseString(xml, silence=True)
Note if the XML content contains encoding declaration then must use bytes input such as doing the following.
people = p.parseString(str.encode(xml), silence=True)
See tutorial for details (and more examples) to create a python module from an XML Schema.
Related
Let's take the StackOverflow public export of Tags, which looks something like this:
<?xml version="1.0" encoding="utf-8"?>
<tags>
<row Id="1" TagName=".net" Count="303362" ExcerptPostId="3624959" WikiPostId="3607476" />
<row Id="2" TagName="html" Count="1038358" ExcerptPostId="3673183" WikiPostId="3673182" />
<row Id="3" TagName="javascript" Count="2130783" ExcerptPostId="3624960" WikiPostId="3607052" />
...
Let's assume that this object wouldn't fit in memory, but since it's line-separated I think it'd be OK to process without doing too much trickery. What might be a good approach to process a file like this? My thought was just to process it by row, building faking the xml structure, something like:
for line in file:
node = etree.fromstring('<x>%s</x>' % line).find('row')
...
Is this a common approach for handling xml that is "row-oriented" that is too big to fit in memory? I see this commonly with DB exports (actually i think the db client I use does that format, though I never use xml-exports from a db).
SAX is a abbreviation of Simple API to XML and provides a very fast and lightweight access to XML data. In contrast to DOM, which reads the whole XML document and creates a in-memory tree representation from it, SAX gives the program available information immediately after they are read from the document during parsing. The workflow is a follows - the parser reads the document and each time it finds a distinct type of XML data, such as start of an element, end of an element, processing instruction, etc. it passes the information to the so called handler
See a concrete example here: http://python.zirael.org/e-sax1.html
I have an xml file with this as the header
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type='text/xsl' href='\\segotn12805\ppr\PPRData3\StyleSheet\PPRData3.xslt'?>
when I modify the file I use .write (for example)
mytree.write('output.xml')
but the output file does not contain the header info.
The first two lines of the output file look like this
<ns0:pprdata xmlns:ns0="http://ManHub.PPRData">
<ns0:Group name="Models">
any ideas on how I can add the header info to the output file?
The first line is the XML declaration. It is optional, and a parser will assume UTF-8 if not specified.
The second line is a processing instruction.
It would be helpful if you provided more code to show what you are doing, but I suspect that you are using ElementTree. The documentation has this note indicating that by default these are skipped:
Note Not all elements of the XML input will end up as elements of the parsed tree. Currently, this module skips over any XML comments, processing instructions, and document type declarations in the input. Nevertheless, trees built using this module’s API rather than parsing from XML text can have comments and processing instructions in them; they will be included when generating XML output. A document type declaration may be accessed by passing a custom TreeBuilder instance to the XMLParser constructor.
As suggested in this answer, you might want to try using lxml
I have some hardware that creates a bazillion record XML file where the xml records look like this:
<Reading>
<DeviceId>13553678</DeviceId>
<Reading>1009735</Reading>
<DataStatus>0</DataStatus>
</Reading>
Every once in awhile, we will experience hardware failure where a character value gets inserted into the Reading tag, like this:
<Reading>
<DeviceId>13553678</DeviceId>
<Reading>100F735</Reading>
<DataStatus>0</DataStatus>
</Reading>
Unfortunately, the application that consumes this XML file will dump the ENTIRE file with a "Input string was not in a correct format" error. I would like to write an intermediary program in Python to remove the bad records from the xml file, archive them, and then rebuild the file for processing. I have used python for simple text manipulation but I believe there are some XML features I could leverage. Any help would be appreciated.
This can easily be done by using the lxml module and XPath expressions. Also see the logging module on how to do proper logging.
Configure a logger with a FileHandler
Get all inner <Reading/> nodes
If their text doesn't consist only of digits, drop the parent node and log
from lxml import etree
import logging
logger = logging.getLogger()
logger.addHandler(logging.FileHandler('dropped_readings.log'))
tree = etree.parse(open('readings.xml'))
readings = tree.xpath('//Reading/Reading')
for reading in readings:
reading_block = reading.getparent()
value = reading.text
if not all(c.isdigit() for c in value):
reading_dump = etree.tostring(reading_block)
logger.warn("Dropped reading '%s':" % value)
logger.warn(reading_dump)
reading_block.getparent().remove(reading_block)
print etree.tostring(tree, xml_declaration=True, encoding='utf-8')
See the all() builtin and generator epxressions for how the condition works.
I'm using Python's elementtree to parse some XML configuration files.
At the top of the file, I have a root element like this:
<?xml version="1.0" encoding="utf-8"?>
<sgx:FooConfig
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:foo="http://ns.au.firm.com/foo.xsd"
xmlns:bar="http://ns.au.firm.com/bar.xsd"
>
The problem is, the bar namespace can be set to one of two different XSDs, depending on the version of the configuration file.
I'm looking for a way to print out the namespace mapping using ElementTree, so I can check which of the two XSDs is being used - then I can get my code to handle the correct case.
Is there a way to print out all the namespace definitions out using Python?
Cheers,
Victor
What you have is not valid xml (undefined prefixes) and I think you can't do this with xml.etree but you should be able to do it using lxml.
import lxml.etree as et
tree = et.XML(yourxml)
print tree.nsmap
How can I convert xml to Python data structure using lxml?
I have searched high and low but can't find anything.
Input example
<ApplicationPack>
<name>Mozilla Firefox</name>
<shortname>firefox</shortname>
<description>Leading Open Source internet browser.</description>
<version>3.6.3-1</version>
<license name="Firefox EULA">http://www.mozilla.com/en-US/legal/eula/firefox-en.html</license>
<ms-license>False</ms-license>
<vendor>Mozilla Foundation</vendor>
<homepage>http://www.mozilla.org/firefox</homepage>
<icon>resources/firefox.png</icon>
<download>http://download.mozilla.org/?product=firefox-3.6.3&os=win&lang=en-GB</download>
<crack required="0"/>
<install>scripts/install.sh</install>
<postinstall name="Clean Up"></postinstall>
<run>C:\\Program Files\\Mozilla Firefox\\firefox.exe</run>
<uninstall>c:\\Program Files\\Mozilla Firefox\\uninstall\\helper.exe /S</uninstall>
<requires name="autohotkey" />
</ApplicationPack>
>>> from lxml import etree
>>> treetop = etree.fromstring(anxmlstring)
converts the xml in the string to a Python data structure, and so does
>>> othertree = etree.parse(somexmlurl)
where somexmlurl is the path to a local XML file or the URL of an XML file on the web.
The Python data structure these functions provide (known as an "element tree", whence the etree module name) is well documented here -- all the classes, functions, methods, etc, that the Python data structure in question supports. It closely matches one supported in the Python standard library, by the way.
If you want some different Python data structure, you'll have to walk through the Python data structure which lxml returns, as above mentioned, and build your different data structure yourself based on the information thus collected; lxml can't specifically help you, except by offering several helpers for finding information in the parsed structure it returns, so that collecting said info is a flexible, easy task (again, see the documentation URL above).
It's not entirely clear what kind of data structure you're looking for, but here's a link to a code sample to convert XML to python dictionary of lists via lxml.etree.