lxml pretty_print write file problem - python

I am writing a Python program that writes raw data to an XML file. By design, we get the raw data line by line and then write it into the XML file like this:
<root>\n
<a> value </a>\n
<b> value </b>\n
</root>
The first time I write the XML file with pretty_print=True, I get what I want. But the second time, when I read the file, get the root element, add new elements, and save it back with pretty_print=True, I do not get what I want. The end of the file looks like this:
...\n
<c> value </c></root>
What's wrong with lxml? Or is it my fault?

You might find the answer in the lxml FAQ: Why doesn't the pretty_print option reformat my XML output?
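As a minimal sketch of what that FAQ entry suggests: lxml leaves existing whitespace-only text alone, so a file that was already pretty-printed is not re-indented on the second write. Parsing with remove_blank_text=True discards that whitespace first. The sample data below is a made-up stand-in for the file from the first pass.

```python
from lxml import etree

# Stand-in for the file written on the first pass: it is already
# pretty-printed, so it contains whitespace-only text between the tags.
first_pass = b"<root>\n  <a> value </a>\n  <b> value </b>\n</root>\n"

# remove_blank_text=True drops whitespace-only text nodes while parsing,
# which lets pretty_print re-indent the whole tree on output.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(first_pass, parser)

# Add a new element, as in the second run of the program.
c = etree.SubElement(root, "c")
c.text = " value "

result = etree.tostring(root, pretty_print=True).decode()
print(result)
```

When reading from disk, the same parser object can be passed as the second argument to etree.parse().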

Related

XML closing tag messes up the file

Basically, I download a few XML files and then append them with ElementTree. The problem is that the final file contains these things:
<<?xml version="1.0" encoding="UTF-8" standalone="yes"?> - at the start of each new XML file
...
</product_info> /><product_info> ...
where product_info is the actual closing tag and the /> is what is messing everything up.
I fixed the first part by removing the XML declaration from the original XML file with:
replace('<?xml version="1.0" encoding="UTF-8" standalone="yes"?><','')
# I remove a closing bracket at the end because I cannot remove the opening bracket, as it is not in the original file
I suspect the problem is that, for some reason, each XML file is enclosed in some tag before it is appended?
When I check the 'ET.SubElement(root,response_xml)' this is what prints:
<Element 'product_info article_id="0006303562403"...'
Could the tag be the problem?
Your file won't qualify as XML if it's not well-formed, and you generally cannot use libraries designed to parse XML on data that fails to meet the definition of XML.
Examples of failures to be well-formed include:
Having any content before the XML declaration.
Having multiple root elements.
Not properly closing an element.
Using characters not allowed in component names. (XML attribute names may not start with a ', for example.)
You must fix the code that violates the rules of well-formedness, or repair the data manually, or see this Q/A for other options:
How to parse invalid (bad / not well-formed) XML?
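Judging from the printed <Element 'product_info article_id=...'>, the whole XML text appears to be passed to ET.SubElement as the tag name, which would explain the stray "<" and " />" around each document. A minimal sketch of one way to avoid that, assuming each download is a complete XML document: parse each one and append the resulting element under a single wrapper root. The sample responses and the wrapper name "products" are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the downloaded XML documents.
responses = [
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    b'<product_info article_id="1"><name>A</name></product_info>',
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    b'<product_info article_id="2"><name>B</name></product_info>',
]

# A single wrapper root keeps the combined file well-formed
# (the name "products" is an assumption).
root = ET.Element("products")
for response_xml in responses:
    # Parsing consumes the XML declaration, so no string replace is needed,
    # and the result is a real element rather than a tag name.
    root.append(ET.fromstring(response_xml))

combined = ET.tostring(root, encoding="unicode")
print(combined)
```

The combined output has one root element and no embedded declarations, so it can be parsed again with the same library.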

Write Open Office XML (e.g. docx) with XML that matches the OOXML namespace

I have a python program that edits the XML in a .docx file. I'd like to edit the XML with ETree.
When I read the XML from the .docx file, it begins like this:
b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:wpc="http://schemas.micro'...
This is in a variable called data. I create the element tree with:
import xml.etree.ElementTree as ElementTree
tree = ElementTree.XML(data)
I convert it back with:
data = ElementTree.tostring(tree)
However, there have been subtle changes to the XML. It now looks like this:
b'<ns0:document xmlns:ns0="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:ns1="ht...
Word won't read this, even though it is standard XML.
EDIT: I tried adding the string to my XML, just to get it to round-trip:
XML_HEADER=b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n'
tree = ElementTree.XML(data)
data = XML_HEADER + ElementTree.tostring(tree)
But I still get the error:
We're sorry. We can't open <filename>.docx because we found a problem with its contents.
Details:
The XML data is invalid according to the schema.
Location: Part: /word/document.xml, Line: 0, Column:0
I can't fix Word. I've got to generate XML that looks exactly like the XML that I started with. How do I get ETree to generate that?
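One partial approach, sketched here under assumptions: ElementTree invents prefixes such as ns0 unless the namespaces are registered first, so the prefixes declared in the source can be collected with iterparse and passed to register_namespace before serializing. The sample below is a shortened, hypothetical stand-in for word/document.xml. Note that ElementTree only writes declarations for namespaces that are actually used and emits its own form of the XML declaration, so the output will still not be byte-identical to the original and may not be enough for Word on its own.

```python
import io
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the bytes read from the .docx file.
data = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n'
    b'<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    b'<w:body><w:p/></w:body></w:document>'
)

# Collect every prefix/URI pair declared in the source and register it,
# so serialization reuses "w" instead of inventing "ns0".
for _event, (prefix, uri) in ET.iterparse(io.BytesIO(data), events=("start-ns",)):
    ET.register_namespace(prefix, uri)

tree = ET.XML(data)

# xml_declaration=True (Python 3.8+) makes tostring emit a declaration itself.
out = ET.tostring(tree, encoding="UTF-8", xml_declaration=True)
print(out)
```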

Put a XML file inside a Python script?

I'm trying to create a face-detection script using Python's OpenCV using the haar cascade XML file.
My goal is to upload a python file to a website but due to some weird policies, I can only upload the Python file, without the XML...
The question is, is it possible to somehow put the XML file inside the Python script, say, convert it to a String or something and then generate an XML from that String?
xml = """<?xml version="1.0" encoding="UTF-8"?>
<a>
<b>Yes, you can embed XML in a string literal in Python.</b>
</a>"""
This does not answer the title, but it does answer the question in your description.
The Haar cascade classifier does not accept XML as a string, only as a file. Also, if you put the XML file on a website and pass a link to it to cv2.CascadeClassifier(), it will give an error.
But you can use the requests module in Python to achieve what you want.
It gets the XML from the website and then writes it to a file:
import cv2 as cv
import requests

def function(self, image):
    # download XML from server
    link = LINK_TO_XML  # placeholder for the URL of the XML file
    r = requests.get(link, allow_redirects=True)
    with open('haarcascade_frontalface_default.xml', 'wb') as f:
        f.write(r.content)
    # end of download
    haar_cascade = cv.CascadeClassifier('haarcascade_frontalface_default.xml')
First, copy the contents of the XML file into the Python file and assign the whole thing to a string. Then use the XML library to build a tree data structure, here named root, that holds the contents of the XML file. This tree is traversable, and you can do what you like with it in your program:
import xml.etree.ElementTree as ET
root = ET.fromstring(XML_file_example_as_string)
To generate an XML file from the string, you can use ElementTree.write() like this:
tree = ET.ElementTree(root)
tree.write('example.xml')
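Putting the two ideas together, here is a minimal sketch that keeps the XML in the script as a string and writes it to a temporary file at run time, so that an API which only accepts a file path can read it. The XML content is the small example from above rather than a real cascade file, and the names XML_TEXT and xml_path are illustrative.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# The XML lives inside the script as an ordinary string literal.
XML_TEXT = """<?xml version="1.0" encoding="UTF-8"?>
<a>
<b>Yes, you can embed XML in a string literal in Python.</b>
</a>"""

# Write the string to a temporary .xml file and remember its path.
with tempfile.NamedTemporaryFile(
    "w", suffix=".xml", delete=False, encoding="utf-8"
) as f:
    f.write(XML_TEXT)
    xml_path = f.name

# xml_path can now be handed to any function that expects a file path;
# here it is simply parsed back to show the round trip.
root = ET.parse(xml_path).getroot()
print(root.find("b").text)

# Clean up the temporary file once it is no longer needed.
os.remove(xml_path)
```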

How can I remove extra space in empty xml tags

I have an XML file in which I look for a specific tag (for example, <x>), and if I find it I update its value to a specific text (for example, TEST).
Python version 3.5.0.
Sample xml file:
<root>
<a/>
<b>0</b>
<c/>
<x>some value</x>
</root>
This is my code:
from xml.etree import ElementTree as et
datafile = 'input.xml' # path to the source xml file
datafile_out = 'output.xml' # path to the updated xml
tree = et.parse(datafile)
tree.find('.//x').text ='TEST' # find <x> tag and write there value "TEST"
tree.write(datafile_out) #generating updated xml file
And this is my output:
<root>
<a />
<b>0</b>
<c />
<x>TEST</x>
</root>
Everything works as expected.
But my problem is the extra space in empty tags: <a />
between the tag name "a" and the slash, which was not present in the input XML file.
I'm working with quite big XML files with a lot of empty tags, so every extra space makes these files a lot bigger.
Is there any way to stop ElementTree.write() from adding that extra space?
Note: I would like to use built-in Python modules and not install third-party solutions.
Many thanks for your advice!
Have you tried using regular expressions?
As an example:
yourXmlAsString.replaceAll(">\s*<", "><");
This would remove all whitespace between XML elements.
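For the specific space before the slash, a minimal sketch using only built-in modules is to serialize to a string and post-process it. This relies on the assumption that " />" appears in ElementTree's output only at the end of empty tags, since ElementTree escapes ">" in text and attribute values. The input is the sample file from the question, inlined as a string.

```python
import re
import xml.etree.ElementTree as ET

# The sample input from the question, inlined as a string.
source = "<root>\n<a/>\n<b>0</b>\n<c/>\n<x>some value</x>\n</root>"

root = ET.fromstring(source)
root.find(".//x").text = "TEST"  # update <x> as in the question

# ElementTree serializes empty elements as "<a />".
serialized = ET.tostring(root, encoding="unicode")

# Drop the space that ElementTree puts before "/>".
compact = re.sub(r" />", "/>", serialized)
print(compact)
```

The compact string can then be written to the output file with an ordinary open(...).write() call instead of tree.write().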

ElementTree.write doesn't pretty_print on second pass

I'm having an issue with formatting xml when writing to an xml file. The issue is, the first time I write to the xml file, the xml is formatted properly using pretty_print=True. Any subsequent attempts to append to the xml file are not formatted properly. The xml is written, but not formatted. My code looks like:
# does the library.xml file exist?
if os.path.isfile(libraryFile):
    library = ET.ElementTree()
    library.parse(libraryFile)
else:
    # the library.xml does not exist at the given path
    library = ET.ElementTree(project.getBoilerplateLibrary(path))
root = library.getroot()
root.append(xml)  # xml is an lxml Element object
f = open(libraryFile, 'w')
library.write(f, pretty_print=True)
f.close()
The first time we write to the file I get something like:
<root>
<element>
<foo>bar</foo>
</element>
</root>
Any subsequent attempts to append to this file end up looking like:
<root>
<element>
<foo>bar</foo>
</element><element><bleep>bloop</bleep></element></root>
Any ideas?
The lxml FAQ covers this: Why doesn't the pretty_print option reformat my XML output?
This question has also been asked before on StackOverflow as "lxml pretty_print write file problem".
It is an unfortunate side effect of using XML, where whitespace definitely matters.
