I have an xml file with this as the header
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type='text/xsl' href='\\segotn12805\ppr\PPRData3\StyleSheet\PPRData3.xslt'?>
when I modify the file I use .write (for example)
mytree.write('output.xml')
but the output file does not contain the header info.
The first two lines of the output file look like this
<ns0:pprdata xmlns:ns0="http://ManHub.PPRData">
<ns0:Group name="Models">
any ideas on how I can add the header info to the output file?
The first line is the XML declaration. It is optional, and a parser will assume UTF-8 if not specified.
The second line is a processing instruction.
It would be helpful if you provided more code to show what you are doing, but I suspect that you are using ElementTree. The documentation has this note indicating that by default these are skipped:
Note Not all elements of the XML input will end up as elements of the parsed tree. Currently, this module skips over any XML comments, processing instructions, and document type declarations in the input. Nevertheless, trees built using this module’s API rather than parsing from XML text can have comments and processing instructions in them; they will be included when generating XML output. A document type declaration may be accessed by passing a custom TreeBuilder instance to the XMLParser constructor.
As suggested in this answer, you might want to try using lxml
Related
I am pretty new to python. I now how to marshall/unmarshall objects in Java.
I am looking for something like we did it in Java.
Like:
JAXBContext jaxbContext = JAXBContext.newInstance(com.Request1.class);
Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(message)));
requestStr2 = (com.Request1) unmarshaller.unmarshal(doc);
Where Request1 has #XmlRootElement annotation.
I don't want to write multiple elements, subelements etc, because i have very complex xsd structure. I want to generate classes from xsd by generateDS, then initialize it from database and serialize to xml-file
I saw pyxser, but it is only on python 2.
What modules could help me with that?
Thank you
The generateDS generated python-modules can parse a XML instance of XML Schema from file or from string. The generateDS module supports complex XML Schemas including complex types, abstract types, enumerated lists, mixed content, etc.
For example, here is the command for generateDS to generate a python module called people.py from people.xsd.
python generateDS.py -o people.py people.xsd
On Windows, there is a generateDS.exe wrapper that can be called as a short-cut:
generateDS.exe -o people.py people.xsd
This is an example XML instance conforming to the people.xsd schema.
<people xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<person id="1">
<name>Alberta</name>
</person>
<person id="2">
<name>Bernardo</name>
</person>
...
</people>
The following is a snippet of python to parse the above XML instance in file named people.xml that conforms to the XML Schema. The parse() function parses the XML document from a file and creates object structure for classes associated with the elements.
import people as p
people = p.parse("people.xml", silence=True)
# iterate over each person in the collection
for person in people.get_person():
print(person.name)
If wanted to parse from a string variable xml then call the parseString() function on it.
people = p.parseString(xml, silence=True)
Note if the XML content contains encoding declaration then must use bytes input such as doing the following.
people = p.parseString(str.encode(xml), silence=True)
See tutorial for details (and more examples) to create a python module from an XML Schema.
This is my code:
from xml.dom import minidom as md
doc = md.parse('file.props')
# operations with doc
xml_file = open('file.props', "w")
doc.writexml(xml_file, encoding="utf-8")
xml_file.close()
I parse an XML, I do some operations, than I open and write on it. But for example, if in my file got:
<MY_TAG />
^
it rewrites as:
<MY_TAG/>
^
I know this can seem irrelevant, but my file is constantly monitored by version control GIT, which say that line is "different" on every write.
The same with the header:
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
it becomes:
<?xml version="1.0" encoding="utf-8"?><Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
Which is pretty annoying. Any clues?
Retaining quirks of formatting in XML through parsing and serialization is pretty well impossible. If you need to do text-level comparisons, the only way to do it is to canonicalize the formats that you are comparing (there are various XML canonicalization tools out there).
In principle you can configure git to use an XML-aware diff tool for comparisons, but please don't ask me for details, it's not something I've ever done myself. I've always just lived with the fact that it works badly.
I have some hardware that creates a bazillion record XML file where the xml records look like this:
<Reading>
<DeviceId>13553678</DeviceId>
<Reading>1009735</Reading>
<DataStatus>0</DataStatus>
</Reading>
Every once in awhile, we will experience hardware failure where a character value gets inserted into the Reading tag, like this:
<Reading>
<DeviceId>13553678</DeviceId>
<Reading>100F735</Reading>
<DataStatus>0</DataStatus>
</Reading>
Unfortunately, the application that consumes this XML file will dump the ENTIRE file with a "Input string was not in a correct format" error. I would like to write an intermediary program in Python to remove the bad records from the xml file, archive them, and then rebuild the file for processing. I have used python for simple text manipulation but I believe there are some XML features I could leverage. Any help would be appreciated.
This can easily be done by using the lxml module and XPath expressions. Also see the logging module on how to do proper logging.
Configure a logger with a FileHandler
Get all inner <Reading/> nodes
If their text doesn't consist only of digits, drop the parent node and log
from lxml import etree
import logging
logger = logging.getLogger()
logger.addHandler(logging.FileHandler('dropped_readings.log'))
tree = etree.parse(open('readings.xml'))
readings = tree.xpath('//Reading/Reading')
for reading in readings:
reading_block = reading.getparent()
value = reading.text
if not all(c.isdigit() for c in value):
reading_dump = etree.tostring(reading_block)
logger.warn("Dropped reading '%s':" % value)
logger.warn(reading_dump)
reading_block.getparent().remove(reading_block)
print etree.tostring(tree, xml_declaration=True, encoding='utf-8')
See the all() builtin and generator epxressions for how the condition works.
An SVG file is basically an XML file so I could use the string <?xml (or the hex representation: '3c 3f 78 6d 6c') as a magic number but there are a few opposing reason not to do that if for example there are extra white-spaces it could break this check.
The other images I need/expect to check are all binaries and have magic numbers. How can I fast check if the file is an SVG format without using the extension eventually using Python?
XML is not required to start with the <?xml preamble, so testing for that prefix is not a good detection technique — not to mention that it would identify every XML as SVG. A decent detection, and really easy to implement, is to use a real XML parser to test that the file is well-formed XML that contains the svg top-level element:
import xml.etree.cElementTree as et
def is_svg(filename):
tag = None
with open(filename, "r") as f:
try:
for event, el in et.iterparse(f, ('start',)):
tag = el.tag
break
except et.ParseError:
pass
return tag == '{http://www.w3.org/2000/svg}svg'
Using cElementTree ensures that the detection is efficient through the use of expat; timeit shows that an SVG file was detected as such in ~200μs, and a non-SVG in 35μs. The iterparse API enables the parser to forego creating the whole element tree (module name notwithstanding) and only read the initial portion of the document, regardless of total file size.
You could try reading the beginning of the file as binary - if you can't find any magic numbers, you read it as a text file and match to any textual patterns you wish. Or vice-versa.
This is from man file (here), for the unix file command:
The magic tests are used to check for files with data in particular fixed formats. The canonical example of this is a binary executable ... These files have a “magic number” stored in a particular place near the beginning of the file that tells the UNIX operating system that the file is a binary executable, and which of several types thereof. The concept of a “magic” has been applied by extension to data files. Any file with some invariant identifier at a small fixed offset into the file can usually be described in this way. ...
(my emphasis)
And here's one example of the "magic" that the file command uses to identify an svg file (see source for more):
...
0 string \<?xml\ version=
>14 regex ['"\ \t]*[0-9.]+['"\ \t]*
>>19 search/4096 \<svg SVG Scalable Vector Graphics image
...
0 string \<svg SVG Scalable Vector Graphics image
...
As described by man magic, each line follows the format <offset> <type> <test> <message>.
If I understand correctly, the code above looks for the literal "<?xml version=". If that is found, it looks for a version number, as described by the regular expression. If that is found, it searches the next 4096 bytes until it finds the literal "<svg". If any of this fails, it looks for the literal "<svg" at the start of the file, and so on.
Something similar could be implemented in Python.
Note there's also python-magic, which provides an interface to libmagic, as used by the unix file command.
How can I convert xml to Python data structure using lxml?
I have searched high and low but can't find anything.
Input example
<ApplicationPack>
<name>Mozilla Firefox</name>
<shortname>firefox</shortname>
<description>Leading Open Source internet browser.</description>
<version>3.6.3-1</version>
<license name="Firefox EULA">http://www.mozilla.com/en-US/legal/eula/firefox-en.html</license>
<ms-license>False</ms-license>
<vendor>Mozilla Foundation</vendor>
<homepage>http://www.mozilla.org/firefox</homepage>
<icon>resources/firefox.png</icon>
<download>http://download.mozilla.org/?product=firefox-3.6.3&os=win&lang=en-GB</download>
<crack required="0"/>
<install>scripts/install.sh</install>
<postinstall name="Clean Up"></postinstall>
<run>C:\\Program Files\\Mozilla Firefox\\firefox.exe</run>
<uninstall>c:\\Program Files\\Mozilla Firefox\\uninstall\\helper.exe /S</uninstall>
<requires name="autohotkey" />
</ApplicationPack>
>>> from lxml import etree
>>> treetop = etree.fromstring(anxmlstring)
converts the xml in the string to a Python data structure, and so does
>>> othertree = etree.parse(somexmlurl)
where somexmlurl is the path to a local XML file or the URL of an XML file on the web.
The Python data structure these functions provide (known as an "element tree", whence the etree module name) is well documented here -- all the classes, functions, methods, etc, that the Python data structure in question supports. It closely matches one supported in the Python standard library, by the way.
If you want some different Python data structure, you'll have to walk through the Python data structure which lxml returns, as above mentioned, and build your different data structure yourself based on the information thus collected; lxml can't specifically help you, except by offering several helpers for finding information in the parsed structure it returns, so that collecting said info is a flexible, easy task (again, see the documentation URL above).
It's not entirely clear what kind of data structure you're looking for, but here's a link to a code sample to convert XML to python dictionary of lists via lxml.etree.