I'm having an issue with formatting xml when writing to an xml file. The issue is, the first time I write to the xml file, the xml is formatted properly using pretty_print=True. Any subsequent attempts to append to the xml file are not formatted properly. The xml is written, but not formatted. My code looks like:
#does the library.xml file exist?
if os.path.isfile(libraryFile):
    library = ET.ElementTree()
    library.parse(libraryFile)
else:
    #the library.xml does not exist at the given path
    library = ET.ElementTree(project.getBoilerplateLibrary(path))
root = library.getroot()
root.append(xml) #xml is a lxml Element object
f = open(libraryFile, 'w')
library.write(f, pretty_print=True)
f.close()
The first time we write to the file I get something like:
<root>
  <element>
    <foo>bar</foo>
  </element>
</root>
Any subsequent attempts to append to this file end up looking like:
<root>
  <element>
    <foo>bar</foo>
  </element><element><bleep>bloop</bleep></element></root>
Any ideas?
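The lxml FAQ covers this: "Why doesn't the pretty_print option reformat my XML output?"
This question has also been asked before on StackOverflow as "lxml pretty print write file problem".
It is an unfortunate side effect of XML, where whitespace can be significant. The file you parse back in already contains whitespace-only text between its elements, and lxml will not pretty-print around existing whitespace because it cannot know whether that whitespace matters.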
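The workaround described in the lxml FAQ is to drop the ignorable whitespace while parsing, so that the serializer is free to re-indent the whole tree. The following is a minimal sketch of that idea, assuming the existing file content is available as a byte string (the sample data and element names are taken from the question, not from any real file):

```python
from lxml import etree

# Content of an already pretty-printed file, as it would be read back in.
existing = b"<root>\n  <element>\n    <foo>bar</foo>\n  </element>\n</root>\n"

# remove_blank_text=True discards whitespace-only text between elements.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(existing, parser)

# Append a new element, as the question does with root.append(xml).
new_element = etree.SubElement(root, "element")
etree.SubElement(new_element, "bleep").text = "bloop"

# With the old whitespace gone, pretty_print indents every element again.
output = etree.tostring(root, pretty_print=True).decode()
print(output)
```

When reading from a file instead of a string, the same parser object can be passed as the second argument to etree.parse().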
Related
I am trying to split a large XML file into smaller ones. I first started off with BeautifulSoup:
from bs4 import BeautifulSoup
import os
# Core settings
rootdir = r'C:\Users\XX\Documents\Grant Data\2010_xml'
extension = ".xml"
to_save = r'C:\Users\XX\Documents\all_patents_as_xml'
index = 0
for root, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith(extension):
            print(file)
            file_name = os.path.join(root,file)
            with open(file_name) as f:
                data = f.read()
            texts = data.split('?xml version="1.0" encoding="UTF-8"?')
            for text in texts:
                index += 1
                filename = to_save + "\\"+ str(index) + ".txt"
                with open(filename, 'w') as f:
                    f.write(text)
However, I got a memory error. Then I switched to xml etree:
from xml.etree import ElementTree as ET
import re
file_name = r'C:\Users\XX\Documents\Grant Data\2010_xml\2010cat_xml.xml'
with open(file_name) as f:
    xml = f.read()
tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")
parser = ET.iterparse(tree)
to_save = r'C:\Users\Yilmaz\Documents\all_patents_as_xml'
index = 0
for event, element in parser:
    # element is a whole element
    if element.tag == '?xml version="1.0" encoding="UTF-8"?':
        index += 1
        filename = to_save + "\\"+ str(index) + ".txt"
        with open(filename, 'w') as f:
            f.write(ET.tostring(element))
    # do something with this element
    # then clean up
    element.clear()
and I get the following error:
OverflowError: size does not fit in an int
I am using the Windows operating system. I know that on Linux you can split the XML files from the console, but in my case I don't know what to do.
If your XML cannot be loaded because of memory limits, you should consider using SAX.
With SAX you read "small bites" of the document and do whatever you want with them (for example, save every N elements to a new file).
Python SAX example 1.
Python SAX example 2.
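As a rough illustration of the SAX approach, here is a minimal sketch using the standard library's xml.sax module. The handler class name and the sample document are made up for the example. Note that SAX still expects well-formed XML, so it only helps with the memory problem, not with malformed input:

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Counts how often each element name occurs, without building a tree."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once for every opening tag as the parser streams the input.
        self.counts[name] = self.counts.get(name, 0) + 1


handler = TagCounter()
# Hypothetical sample input; for a file on disk use xml.sax.parse(path, handler).
xml.sax.parseString(b"<root><item/><item/><other/></root>", handler)
print(handler.counts)
```

Because the handler only keeps the counters, memory use stays flat no matter how large the input is. Writing every N elements to a new file would follow the same pattern, with the bookkeeping done in startElement and endElement.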
There are major issues with your question and your attempts at solving it:
You mention using Beautiful Soup. However, while you import Beautiful Soup in your code, you don't actually do anything with it.
The code you show that uses xml.etree is grossly incorrect. At the line parser = ET.iterparse(tree), tree is an XML tree already parsed with ET.fromstring, but the argument to iterparse must either be a file name or a file object. An XML tree is neither of those. So that attempt is dead on arrival.
But more importantly, it looks like what you are trying to process is a file which contains a bunch of concatenated XML files. In your xml.etree attempt you have this test:
element.tag == '?xml version="1.0" encoding="UTF-8"?'
The only intent I can imagine for this test is that you think that xml.etree will somehow interpret <?xml version="1.0" encoding="UTF-8"?> as an XML element which has a name of '?xml version="1.0" encoding="UTF-8"?'. However, the structure <?xml version="1.0" encoding="UTF-8"?> is not an XML element, it is an XML declaration.
And since your code seems to be attempting to split every time an XML declaration is encountered, it seems that your input is a file that contains multiple XML declarations. This file is not valid XML. The XML specification allows the XML declaration to appear once, and only once at the beginning of an XML file. (Don't confuse the XML declaration with a processing instruction. They look similar because they are both delimited by <? and ?>, but the XML declaration is not a processing instruction.) If you use an XML parser on your input file, and this parser conforms to the XML specification, then it has to reject your file as being not XML because XML does not allow XML declarations to appear at random positions in documents.
Where does that leave you? If all XML declarations present in your source document are the same, there's a relatively easy way to make your document parsable by an XML parser. (The attempts you made suggest that they are all the same, since you do not use a regular expression to match different forms of the XML declaration, e.g. one that would specify the standalone parameter.) You can just remove all XML declarations from your source document, wrap it in a new root element, and parse that with xml.etree. (This assumes that the individual XML documents that were concatenated to make up your source document were all individually well-formed. If they weren't, then this won't work.)
Note, however, that the string <?xml version="1.0" encoding="UTF-8"?> can appear in an XML document in contexts where this string is not actually an XML declaration. Here is a well-formed XML document that would throw off an algorithm that just looks for a string that looks like an XML declaration:
<?xml version = "1.0" encoding = "UTF-8"?>
<a>
<![CDATA[
<?xml version = "1.0" encoding = "UTF-8"?>
]]>
<?q <?xml version = "1.0" encoding = "UTF-8"?> ?>
<!-- <?xml version = "1.0" encoding = "UTF-8"?> -->
</a>
If you know how your source file was created, you may already be able to know for sure that you don't have any of the cases above. Otherwise, you may want to examine your source to make sure none of the above happens.
Once you have taken care of this, a strategy based on ET.iterparse or SAX should work.
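The "remove the declarations, wrap in a new root, parse incrementally" idea can be sketched with the standard library's ET.XMLPullParser, which is fed text piece by piece instead of being handed a file name. This is only a sketch under several assumptions: every declaration is exactly the string in DECLARATION and sits on a single line, none of the false-positive cases listed above occur, and the individual documents carry no DOCTYPE (which would not be allowed inside the wrapper element). The names iter_documents and wrapper are made up for the example:

```python
import io
import xml.etree.ElementTree as ET

DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>'


def iter_documents(lines, declaration=DECLARATION):
    """Yield the root element of each concatenated document, one at a time."""
    parser = ET.XMLPullParser(events=("start", "end"))
    parser.feed("<wrapper>")  # artificial root around all documents
    wrapper = None
    depth = 0
    for line in lines:
        # Strip the declaration text before the parser ever sees it.
        parser.feed(line.replace(declaration, ""))
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
                if depth == 1:
                    wrapper = elem
            else:
                depth -= 1
                if depth == 1:
                    # A direct child of the wrapper just closed: one document.
                    yield elem
                    wrapper.clear()  # release it before parsing the next one
    parser.feed("</wrapper>")
    parser.close()


sample = io.StringIO(
    DECLARATION + "\n<a><x>1</x></a>\n" + DECLARATION + "\n<b/>\n"
)
tags = [doc.tag for doc in iter_documents(sample)]
print(tags)  # ['a', 'b']
```

For a real file, the lines argument could be an open text file object, and each yielded element could be written out with ET.tostring(doc, encoding="unicode") before the loop asks for the next one. Only one document is held in memory at a time.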
I have a python program that edits the XML in a .docx file. I'd like to edit the XML with ETree.
When I read the XML from the .docx file, it begins like this:
b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:wpc="http://schemas.micro'...
This is in a variable called data. I create the element tree with:
import xml.etree.ElementTree as ElementTree
tree = ElementTree.XML(data)
I convert it back with:
data = ElementTree.tostring(tree)
However, there have been subtle changes to the XML. It now looks like this:
b'<ns0:document xmlns:ns0="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:ns1="ht...
Word won't read this, even though it is standard XML.
EDIT: I tried adding the XML declaration back to my XML, just to get it to round-trip:
XML_HEADER=b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n'
tree = ElementTree.XML(data)
data = XML_HEADER + ElementTree.tostring(tree)
But I still get the error:
We're sorry. We can't open <filename>.docx because we found a problem with its contents.
Details:
The XML data is invalid according to the schema.
Location: Part: /word/document.xml, Line: 0, Column:0
I can't fix Word. I've got to generate XML that looks exactly like the XML that I started with. How do I get ETree to generate that?
I'm trying to create a face-detection script using Python's OpenCV using the haar cascade XML file.
My goal is to upload a python file to a website but due to some weird policies, I can only upload the Python file, without the XML...
The question is, is it possible to somehow put the XML file inside the Python script, say, convert it to a String or something and then generate an XML from that String?
xml = """<?xml version="1.0" encoding="UTF-8"?>
<a>
<b>Yes, you can embed XML in a string literal in Python.</b>
</a>"""
This does not answer the title, but it answers the question in your description.
The Haar cascade loader does not accept XML passed as a string; it needs a file. Also, if you host the XML file on a website and pass its URL to cv2.CascadeClassifier(), you will get an error.
But you can use the requests module in Python to achieve what you want.
It gets the XML from the website, then saves it to a file:
import cv2 as cv
import requests

def function(self, image):
    # download the XML from the server and save it to a local file
    link = LINK_TO_XML
    r = requests.get(link, allow_redirects=True)
    with open('haarcascade_frontalface_default.xml', 'wb') as f:
        f.write(r.content)
    # end of download
    haar_cascade = cv.CascadeClassifier('haarcascade_frontalface_default.xml')
First, copy the contents of the XML file into the Python file and assign the whole thing to a string. Then use the xml.etree.ElementTree library to build a tree from that string; fromstring returns the root element. This tree is traversable and you can do what you like with it in your program:
import xml.etree.ElementTree as ET
root = ET.fromstring(XML_file_example_as_string)
To generate an XML file from the string, you can use ElementTree.write() like this:
tree = ET.ElementTree(root)
tree.write('example.xml')
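Since the classifier needs a file path, one way to combine the answers above is to keep the XML in a string inside the script and write it to a temporary file at run time. The sketch below uses only the standard library. CASCADE_XML holds a short placeholder rather than a real cascade, and write_embedded_xml is a hypothetical helper name:

```python
import os
import tempfile

# Placeholder only: paste the full contents of the real cascade file here.
CASCADE_XML = """<?xml version="1.0"?>
<opencv_storage>
</opencv_storage>
"""


def write_embedded_xml(xml_text, suffix=".xml"):
    """Write the embedded XML to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(xml_text)
    return path


cascade_path = write_embedded_xml(CASCADE_XML)
print(cascade_path)
# The path could then be handed to OpenCV, e.g. cv2.CascadeClassifier(cascade_path).
```

The temporary file is not deleted automatically; os.remove(cascade_path) can be called once the classifier has been loaded.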
I am being driven crazy by some oddly formed xml and would be grateful for some pointers:
The documents are defined like this:
<sphinx:document id="18059090929806848187">
<url>http://www.some-website.com</url>
<page_number>104</page_number>
<size>7865</size>
</sphinx:document>
Now, I need to read lots of these files (500m+, all gzip-compressed) and grab the text values from a few of the contained tags.
sample code:
from lxml import objectify, etree
import gzip
with open ('file_list','rb') as file_list:
    for file in file_list:
        in_xml = gzip.open(file.strip('\n'))
        xml2 = etree.iterparse(in_xml)
        for action, elem in xml2:
            if elem.tag == "page_number":
                print elem.text + str(file)
The first elem.text value is returned, but only for the first file in the list, and it is quickly followed by the error:
lxml.etree.XMLSyntaxError: Namespace prefix sphinx on document is not defined, line 1, column 20
Please excuse my ignorance but xml really hurts my head and I have been struggling with this for a while. Is there a way that I can either define the namespace prefix or handle this in some other more intelligent manner?
Thanks
Your input file is not well-formed XML. I assume that it is a snippet from a larger XML document.
Your choices are:
Reconstruct the larger document. How you do this is specific to your application. You may have to consult with the people that created the file you are parsing.
Parse the file in spite of its errors. To do that, use the recover keyword argument of lxml.etree.iterparse:
xml2 = etree.iterparse(in_xml, recover=True)
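A different route, closer to "defining the namespace prefix" as the question asks, is to wrap the snippet in an artificial root element that declares the sphinx prefix. The sketch below is not the recover approach; it uses the standard library's ET.XMLPullParser instead. The namespace URI is a placeholder, because the real one is not given in the question, and page_numbers is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Placeholder URI: the real sphinx namespace is not shown in the question.
SPHINX_NS = "urn:example:sphinx"


def page_numbers(chunks):
    """Yield the text of every page_number element found in the chunks."""
    parser = ET.XMLPullParser(events=("end",))
    # The wrapper declares the prefix, so sphinx:document becomes legal.
    parser.feed('<wrapper xmlns:sphinx="%s">' % SPHINX_NS)
    for chunk in chunks:
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == "page_number":
                yield elem.text
            elem.clear()  # drop content that is no longer needed
    parser.feed("</wrapper>")
    parser.close()


sample = [
    '<sphinx:document id="18059090929806848187">\n',
    "<url>http://www.some-website.com</url>\n",
    "<page_number>104</page_number>\n",
    "<size>7865</size>\n",
    "</sphinx:document>\n",
]
numbers = list(page_numbers(sample))
print(numbers)  # ['104']
```

For the compressed files, the chunks could be the lines of gzip.open(path, "rt", encoding="utf-8"), assuming the files are UTF-8 text. Several sphinx:document snippets in one file are fine, since they all become children of the wrapper.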
I am writing a Python program that writes raw data to an XML file. In my design, we get the raw data line by line and then write it into the XML file like this:
<root>
  <a> value </a>
  <b> value </b>
</root>
The first time I write the XML file with pretty_print=True, I get what I want. But the second time, when I read the file, get the root element, add new elements, and save it back with pretty_print=True, I do not get what I want. It looks like this:
...
  <c> value </c></root>
What's wrong with lxml? Or is it my fault?
You might find the answer in the lxml FAQ: "Why doesn't the pretty_print option reformat my XML output?"