XML file generating unwanted data

XML file generating unwanted data - python

I have tried writing few things to xml file after reading it from a different xml file, everything works smoothly but there are few unwanted tags coming inside the xml file which i generate as output.
Here is what I have tried
from xml.etree import ElementTree as ET
from xml.dom.minidom import getDOMImplementation
from xml.dom.minidom import parseString
tree = ET.parse('C:\\Users\\ca33.xml')
root = tree.getroot()
impl = getDOMImplementation()
#print(root)
header = [root.find('header')]
for h in header:
h1=(parseString(ET.tostring(h)).toprettyxml(''))
#print(h1)
commands = root.findall(".//records//")
recs=[c for c in commands if c.find('soc_id')!=None and c.find('soc_id').text[:9]=='000001051']
bb=""
for rec in recs:
aa=(parseString(ET.tostring(rec)).toprettyxml(''))
bb=bb+aa
#print(bb)
newdoc = impl.createDocument(None, "file"+h1+bb, None)
newdoc.writexml(open('data.xml', 'w'),'\n'.join([line for line in newdoc.toprettyxml(indent=' '*2).split('\n') if line.strip()]))
I get the output data.xml file as.
<?xml version="1.0" ?><?xml version="1.0" ?>
<file<?xml version="1.0" ?>
<header>
<number_of_records>41</number_of_records>
</header>
<?xml version="1.0" ?>
<record>
<soc_id>00000105139E3B82</soc_id>
</record>
<?xml version="1.0" ?>
<soc_id>00000105139E3640</soc_id>
</record>
<?xml version="1.0" ?>
<header>
<number_of_records>41</number_of_records>
So you can see that many tags of <?xml version="1.0" ?> is being generated everywhere and in the last it again starts writing the data from first but leaves a 2 line spacing

So, what I understand is that you are trying to read a xml file at first place and then you are trying to write the same data into a different file.
In this process you are running into problems
from xml.etree import ElementTree as ET
tree = ET.parse('C:\\Users\\ca33.xml')
root = tree.getroot()
for header_ex in root.findall('header'):
h = [ET.tostring(c) for c in header_ex]
str_header=str(h)
for record_ex in root.findall('records'):
r = [ET.tostring(c) for c in record if c.find('soc_id')!=None and c.find('soc_id').text[:9]=='000001051']
for rec in r:
str_rec=str(rec)
with open("output.xml","w") as f:
f.write("<?xml version='1.0' encoding='ASCII' standalone='yes'?>")
f.write("<file>"+"<header>"+str_header+"</header>")
f.close()
Since you have not posted any random data, I assume it to be the way you had posted in question.I assume that record is a tag and it has something more or many sub/child tags inside it and that's the reason for me to loop twice over it.
And also stop using unnecessary imports in your code.

Related

How do I remove a comment outside of the root element of an XML document using python lxml

How do you remove comments above or below the root node of an xml document using python's lxml module? I want to remove only one comment above the root node, NOT all comments in the entire document. For instance, given the following xml document
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- This comment needs to be removed -->
<root>
<!-- This comment needs to STAY -->
<a/>
</root>
I want to output
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
<!-- This comment needs to STAY -->
<a/>
</root>
The usual way to remove an element would be to do element.getparent().remove(element), but this doesn't work for the root element since getparent returns None. I also tried the suggestions from this stackoverflow answer, but the first answer (using a parser that remove comments) removes all comments from the document including the ones I want to keep, and the second answer (adding a dummy opening and closing tag around the document) doesn't work if the document has a directive above the root element.
I can get access to the comment above the root element using the following code, but how do I remove it from the document?
from lxml import etree as ET
tree = ET.parse("./sample_file.xml")
root = tree.getroot()
comment = root.getprevious()
# What do I do with comment now??
I've tried doing the following, but none of them worked:
comment.getparent().remove(comment) says AttributeError: 'NoneType' object has no attribute 'remove'
del comment does nothing
comment.clear() does nothing
comment.text = "" renders an empty comment <!---->
root.remove(comment) says ValueError: Element is not a child of this node.
tree.remove(comment) says AttributeError: 'lxml.etree._ElementTree' object has no attribute 'remove'
tree[:] = [root] says TypeError: 'lxml.etree._ElementTree' object does not support item assignment
Initialize a new tree with tree = ET.ElementTree(root). Serializing this new tree still has the comments somehow.

You could just build another tree by using fromstring() and passing in the root element.
from lxml import etree
tree = etree.parse("sample_file.xml")
new_tree = etree.fromstring(etree.tostring(tree.getroot()))
print(etree.tostring(new_tree, xml_declaration=True, encoding="UTF-8", standalone=True).decode())
printed output...
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<root>
<!-- This comment needs to STAY -->
<a/>
</root>
Note: This will also remove any processing instructions before root, so another option is to append the comment to root before removing...
from lxml import etree
tree = etree.parse("sample_file.xml")
root = tree.getroot()
for comment_to_delete in root.xpath("preceding::comment()"):
root.append(comment_to_delete)
root.remove(comment_to_delete)
print(etree.tostring(tree, xml_declaration=True, encoding="UTF-8", standalone=True).decode())
This produces the same output as above, but will retain any processing instructions that occur before root.

You can parse a XML file with comments with the xmlPullParser:
If your input file looks like:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- This comment needs to be removed -->
<root>
<!-- This comment needs to STAY -->
<a/>
<b>Text</b>
</root>
Parse the file and write it to a new one:
import xml.etree.ElementTree as ET
import re
# Write XML declaration line into neu file without comment 1
def write_delte_xml(input):
with open('Cleaned.xml', 'a') as my_file:
my_file.write(f'{input}')
with open('Remove_Comment.xml', 'r', encoding='utf-8') as xml:
feedstring = xml.readlines()
parser = ET.XMLPullParser(['start','end', 'comment'])
for line in enumerate(feedstring):
if line[0] == 0 and line[1].startswith('<?'):
write_delte_xml(line[1])
parser.feed(line[1])
for event, elem in parser.read_events():
if event == "comment" and line[0] != 1:
write_delte_xml(line[1])
#print(line[1])
if event == "start" and r'\>' not in line[1]:
write_delte_xml(f"{line[1]}")
#print("start",f"{line[1]},Element: {elem}")
if event == "end":
write_delte_xml(f"{line[1]}")
#print(f"END: {line[1]}")
# Clean douplicates
xml_list = []
with open('Cleaned.xml', 'rb') as xml:
lines = xml.readlines()
for line in lines:
if line not in xml_list:
xml_list.append(line)
with open('Cleaned_final.xml', 'wb') as my_file:
for line in xml_list:
my_file.write(line)
print('Cleaned.xml')
Output:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
<!-- This comment needs to STAY -->
<a/>
<b>Text</b>
</root>

How to load xml file with specifc paragraph by xml in Python?

I have a xml file and its structure like that,
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<book>
<toc> <tocdiv pagenum="564">
<title>9thmemo</title>
<tocdiv pagenum="588">
<title>b</title>
</tocdiv>
</tocdiv></toc>
<chapter><title>9thmemo</title>
<para>...</para>
<para>...</para></chapter>
<chapter>...</chapter>
<chapter>...</chapter>
</book>
There are several chapters in the <book>...</book>, and each chapter has a title, I only want to read all content of this chapter,"9thmemo"(not others)
I tried to read by following code:
from xml.dom import minidom
filename = "result.xml"
file = minidom.parse(filename)
chapters = file.getElementsByTagName('chapter')
for i in range(10):
print(chapters[i])
I only get the address of each chapter...
if I add some sub-element like chapters[i].title, it shows cannot find this attribute

I only want to read all content of this chapter,"9thmemo"(not others)
The problem with the code is that it does not try to locate the specific 'chapter' while the answer code uses xpath in order to locate it.
Try the below
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<book>
<toc>
<tocdiv pagenum="564">
<title>9thmemo</title>
<tocdiv pagenum="588">
<title>b</title>
</tocdiv>
</tocdiv>
</toc>
<chapter>
<title>9thmemo</title>
<para>A</para>
<para>B</para>
</chapter>
<chapter>...</chapter>
<chapter>...</chapter>
</book>'''
root = ET.fromstring(xml)
chapter = root.find('.//chapter/[title="9thmemo"]')
para_data = ','.join(p.text for p in chapter.findall('para'))
print(para_data)
output
A,B

Python: Appending children to an already created XML file's root using XML DOM

I'm totally new to XML and I'm stuck on how to append children to a root node of an already exisitng XML file using Python and XML DOM. Right now I have this script to create an output file:
from xml.dom.minidom import Document
doc = Document()
root_node = doc.createElement("notes") # Root
doc.appendChild(root_node)
object_node = doc.createElement("data") # Child
root_node.appendChild(object_node)
object_node.setAttribute("a_version", "something_v001.0001.ma") # Set attributes
object_node.setAttribute("b_user", "Me")
object_node.setAttribute("c_comment", "comment about file")
xml_file = open("C:/Temp/file_notes.xml", "w") # Append
xml_file.write(doc.toprettyxml())
xml_file.close()
This gives me an output file that looks like this:
<?xml version="1.0" ?>
<notes>
<data a_version="Me" b_user="something_v001.0001.ma" c_comment="comment about file"/>
</notes>
I'd like to append future data to this file so it will look something like this after 2 additional versions:
<?xml version="1.0" ?>
<notes>
<data a_version="something_v001.0001.ma" b_user="Me" c_comment="comment about file"/>
<data a_version="something_v001.0002.ma" b_user="You" c_comment="minor save"/>
<data a_version="something_v002.0003.ma" b_user="Them" c_comment="major save"/>
</notes>
But every attempt I make at appending data comes out like this:
<?xml version="1.0" ?>
<notes>
<data a_version="Me" b_user="something_v001.0001.ma" c_comment="comment about file"/>
</notes>
<?xml version="1.0" ?>
<notes>
<data a_version="Me" b_user="something_v001.0001.ma" c_comment="comment about file"/>
</notes>
<?xml version="1.0" ?>
<notes>
<data a_version="Me" b_user="something_v001.0001.ma" c_comment="comment about file"/>
</notes>
If anyone has any alternative methods to accomplish this task by using ElementTree that would be appreciated as well. There seem to be a lot more resources, but I'm not sure how to implement the solution with Maya. Thanks!

You haven't shown us any code demonstrating "every attempt I make at appending data". But never mind, here is how you can use ElementTree to append new elements to an existing XML file.
from xml.etree import ElementTree as ET
from xml.dom import minidom
# Assume that we have an existing XML document with one "data" child
doc = ET.parse("file_notes.xml")
root = doc.getroot()
# Create 2 new "data" elements
data1 = ET.Element("data", {"a_version": "something_v001.0002.ma",
"b_user": "You",
"c_comment": "minor save"})
data2 = ET.Element("data", {"a_version": "something_v001.0003.ma",
"b_user": "Them",
"c_comment": "major save"})
# Append the new "data" elements to the root element of the XML document
root.append(data1)
root.append(data2)
# Now we have a new well-formed XML document. It is not very nicely formatted...
out = ET.tostring(root)
# ...so we'll use minidom to make the output a little prettier
dom = minidom.parseString(out)
print dom.toprettyxml()
Output:
<?xml version="1.0" ?>
<notes>
<data a_version="Me" b_user="something_v001.0001.ma" c_comment="comment about file"/>
<data a_version="something_v001.0002.ma" b_user="You" c_comment="minor save"/>
<data a_version="something_v001.0003.ma" b_user="Them" c_comment="major save"/>
</notes>
ElementTree does not have a built-in pretty-printer, so we use minidom for that. The output contains some superfluous whitespace, but it is better than what ElementTree can provide.

How to get node's value of an XML in Python?

suppose i have an xml file:
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
<quarkSettings>
<UpdatePath></UpdatePath>
<Version>Development</Version>
<Project>ABC</Project>
</quarkSettings>
</configuration>
now i want get Project's value. I have written following code:
import xml.etree.ElementTree as ET
doc1 = ET.parse("Configuration.xml")
for e in doc1.find("Project"):
project =e.text
but it doesn't give the value.

i got the answer:
import xml.etree.ElementTree as ET
doc1 = ET.parse(get_path_for_config_Quark_Release)
root = doc1.getroot()
for element in root.findall("quarkSettings"):
project = element.find("Project").text

XML header getting removed after processing with elementtree

i have an xml file and i used Elementtree to add a new tag to the xml file.My xml file before processing is as follows
<?xml version="1.0" encoding="utf-8"?>
<PackageInfo xmlns="http://someurlpackage">
<data ID="http://someurldata1">data1</data >
<data ID="http://someurldata2">data2</data >
<data ID="http://someurldata3">data3</data >
</PackageInfo>
I used following python code to add a new data tag and write it to my xml file
tree = ET.ElementTree(xmlFile)
root = tree.getroot()
elem= ET.Element('data')
elem.attrib['ID']="http://someurldata4"
elem.text='data4'
root[1].append(elem)
tree = ET.ElementTree(root)
tree.write(xmlFile)
But the resultant xml file have <?xml version="1.0" encoding="utf-8"?> absent and the file looks as below
<PackageInfo xmlns="http://someurlpackage">
<data ID="http://someurldata1">data1</data >
<data ID="http://someurldata2">data2</data >
<data ID="http://someurldata3">data3</data >
</PackageInfo>
Is there any way to include the xml header rather than hardcoding the line

It looks like you need optional arguments to the write method to output the declaration.
http://docs.python.org/library/xml.etree.elementtree.html#elementtree-elementtree-objects
tree.write(xmlfile,xml_declaration=True)
I'm afraid I'm not that familiar with xml.etree.ElementTree and it's variation between python releases.
Here's it working with lxml.etree:
>>> from lxml import etree
>>> sample = """<?xml version="1.0" encoding="utf-8"?>
... <PackageInfo xmlns="http://someurlpackage">
... <data ID="http://someurldata1">data1</data >
... <data ID="http://someurldata2">data2</data >
... <data ID="http://someurldata3">data3</data >
... </PackageInfo>"""
>>>
>>> doc = etree.XML(sample)
>>> data = doc.makeelement("data")
>>> data.attrib['ID'] = 'http://someurldata4'
>>> data.text = 'data4'
>>> doc.append(data)
>>> etree.tostring(doc,xml_declaration=True)
'<?xml version=\'1.0\' encoding=\'ASCII\'?>\n<PackageInfo xmlns="http://someurlpackage">\n<data ID="http://someurldata1">data1</data>\n<data ID="http://someurldata2">data2</data>\n<data ID="http://someurldata3">data3</data>\n<data ID="http://someurldata4">data4</data></PackageInfo>'
>>> etree.tostring(doc,xml_declaration=True,encoding='utf-8')
'<?xml version=\'1.0\' encoding=\'utf-8\'?>\n<PackageInfo xmlns="http://someurlpackage">\n<data ID="http://someurldata1">data1</data>\n<data ID="http://someurldata2">data2</data>\n<data ID="http://someurldata3">data3</data>\n<data ID="http://someurldata4">data4</data></PackageInfo>'

try this:::
tree.write(xmlFile, encoding="utf-8")

If you are using python <=2.6
There is no xml_declaration parameter in ElementTree.write()
def write(self, file, encoding="us-ascii"):
def _write(self, file,node, encoding, namespaces):
You can use lxml.etree
install lxml
sample here:
from lxml import etree
document = etree.Element('outer')
node = etree.SubElement(document, 'inner')
print(etree.tostring(document, xml_declaration=True))
BTW:
I find that it is not necessary to write the xml_declaration
Is the XML declaration node mandatory?
There is no XML declaration necessary for a document to be
successfully readable, since there are defaults for both version and
encoding (1.0 and UTF-8, respectively).
At least,it works even if AndroidManifest.xml does not have an xml_declaration
I have tried :-)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

XML file generating unwanted data - python

Related

How do I remove a comment outside of the root element of an XML document using python lxml

How to load xml file with specifc paragraph by xml in Python?

Python: Appending children to an already created XML file's root using XML DOM

How to get node's value of an XML in Python?

XML header getting removed after processing with elementtree

Categories

Resources