Using python to fix broken XML - python

I have multiple xml files in which the tags are not closed properly. They are often moved to the next line.
My XML
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
<item id='my-img' href='images/dp.jpg' media-type='image/jpeg'/
>
</note>
Is there any pythonic way of converting it to desired XML like below? :
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
<item id='my-img' href='images/dp.jpg' media-type='image/jpeg'/>
</note>
I have tried using beautifulsoup. But it doesn't seem to format this properly.

The simplest way i could think of would be to remove line breaks entirely, load the xml and save a prettyfied version. The default repr only fixes the tags but keeps the format the same.
from bs4 import BeautifulSoup as bs
with open("file.xml","r") as f:
xml_str = f.read(); # Load file
xml = bs(xml_str, "lxml") # parse file
with open("file.xml","w") as f:
f.write(repr(xml)) # write pretty file
Note: we are using the lxml parser which needs to be installed using
pip3 install lxml
Edit: no need to remove the line breaks

It all depends on how many xml fields you need to fix and if there is a pattern in it. From what you put as an example, might be easier to use find and replace (for example using regular expressions).

Related

get only xml data from text file using python

I have a text file where I have some XML data and some HTML data. Both start with "<". Now I want to extract only XML data and save it in another file. How can I do it?
File example:
xyz data:
<note>
<to>john</to>
<from>doe</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
xyz data
<bold>xyz</bold>
text
text
text
<bold>xyz</bold>
again XML data
Note: This file is in .txt format.
I would treat your whole input not as XML, but as an HTML fragment. HTML can contain non-standard elements, so <note> etc. is fine.
For convenience I suggest pyquery (link) to deal with HTML. It works pretty much the same way as jQuery, so if you've worked with that before, it should be familiar.
It's pretty straight-forward. Load your data, wrap it in "<html></html>", parse it, query it.
from pyquery import PyQuery as pq
data = """xyz data:
<note>
<to>john</to>
<from>doe</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
xyz data
<bold>xyz</bold>
text
text
text
<bold>xyz</bold>
again XML data"""
doc = pq(f"<html><body>{data}</body></html>")
note = doc.find("note")
print(note.find("body").text())
which prints "Don't forget me this weekend!".

How to write XML with "usual" self-closing/header tags?

This is my code:
from xml.dom import minidom as md
doc = md.parse('file.props')
# operations with doc
xml_file = open('file.props', "w")
doc.writexml(xml_file, encoding="utf-8")
xml_file.close()
I parse an XML, I do some operations, than I open and write on it. But for example, if in my file got:
<MY_TAG />
^
it rewrites as:
<MY_TAG/>
^
I know this can seem irrelevant, but my file is constantly monitored by version control GIT, which say that line is "different" on every write.
The same with the header:
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
it becomes:
<?xml version="1.0" encoding="utf-8"?><Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
Which is pretty annoying. Any clues?
Retaining quirks of formatting in XML through parsing and serialization is pretty well impossible. If you need to do text-level comparisons, the only way to do it is to canonicalize the formats that you are comparing (there are various XML canonicalization tools out there).
In principle you can configure git to use an XML-aware diff tool for comparisons, but please don't ask me for details, it's not something I've ever done myself. I've always just lived with the fact that it works badly.

python lxml insert namespaces into header

I am creating an xml file and I would like to insert namespaces into header. I want to get this.
<ns2:prenosPodatkovRazporedaOdgovorSporocilo xmlns="http://someurl" xmlns:ns2="http://someotherurl" xmlns:ns3="http://another_someotherurl">
<ns2:statusPrenosa>
<status>09</status>
<nazivStatusa>Napaka prenosa</nazivStatusa>
while I get this
<?xml version='1.0' encoding='UTF-8'?>
<prenosPodatkovRazporedaOdgovorSporocilo>
<ns0:podatkiRazporeda xmlns:ns0="ns2">
<ns2:podatkiRazporeda xmlns:ns2="ns3">
.....
register_namespace does not do the trick or how do I include them when using tostring() function
etree.register_namespace('n2',"http://someurl")
etree.register_namespace('n3',"http://someother_url")
etree.register_namespace('n4',"http://another_someotherurl")
hope I was clear enough
thank you

how to build xml file in python, with formatting

I'm trying to build a xml file in python so I can write it out to a file, but I'm getting complications with new lines and tabbing etc...
I cannot use a module to do this - because Im using a cut down version of python 2. It must all be in pure python.
For instance, how is it possible to create a xml file with this type of formatting, which keeps all the new lines and tabs (whitespace)?
e.g.
<?xml version="1.0" encoding="UTF-8"?>
<myfiledata>
<mydata>
blahblah
</mydata>
</myfiledata>
I've tried enclosing each line
' <myfiledata>' +\n
' blahblah' +\n
etc.
However, the output Im getting from the script is not anything close to how it looks in my python file, there is extra white space and the new lines arent properly working.
Is there any definitive way to do this? I would rather be editing a file that looks somewhat like what I will end up with - for clarity sake...
You can use XMLGenerator from saxutils to generate the XML and xml.dom.minidom to parse it and print the pretty xml (both modules from standard library in Python 2).
Sample code creating a XML and pretty-printing it:
from __future__ import print_function
from xml.sax.saxutils import XMLGenerator
import io
import xml.dom.minidom
def pprint_xml_string(s):
"""Pretty-print an XML string with minidom"""
parsed = xml.dom.minidom.parse(io.BytesIO(s))
return parsed.toprettyxml()
# create a XML file in-memory:
fp = io.BytesIO()
xg = XMLGenerator(fp)
xg.startDocument()
xg.startElement('root', {})
xg.startElement('subitem', {})
xg.characters('text content')
xg.endElement('subitem')
xg.startElement('subitem', {})
xg.characters('text content for another subitem')
xg.endElement('subitem')
xg.endElement('root')
xg.endDocument()
# pretty-print it
xml_string = fp.getvalue()
pretty_xml = pprint_xml_string(xml_string)
print(pretty_xml)
Output is:
<?xml version="1.0" ?>
<root>
<subitem>text content</subitem>
<subitem>text content for another subitem</subitem>
</root>
Note that the text content elements (wrapped in <subitem> tags) aren't indented because doing so would change their content (XML doesn't ignore whitespace like HTML does).
The answer was to use xml.element.tree and from xml.dom import minidom
Which are all available on python 2.5

Use Python to edit XML header

I've written a Python script to create some XML, but I didn't find a way to edit the heading within Python.
Basically, instead of having this as my heading:
<?xml version="1.0" ?>
I need to have this:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
I looked into this as best I could, but found no way to add standalone status from within Python. So I figured that I'd have to go over the file after it had been created and then replace the text. I read in several places that I should stay far away from using readlines() because it could ruin the XML formatting.
Currently the code I have to do this - which I got from another Stackoverflow post about editing XML with Python - is:
doc = parse('file.xml')
elem = doc.find('xml version="1.0"')
elem.text = 'xml version="1.0" encoding="UTF-8" standalone="no"'
That provides me with a KeyError. I've tried to debug it to no avail, which leads me to believe that perhaps the XML heading wasn't meant to be edited in this way. Or my code is just wrong.
If anyone is curious (or miraculously knows how to insert standalone status into Python code), here is my code to write the xml file:
with open('file.xml', 'w') as f:
f.write(doc.toprettyxml(indent=' '))
Some people don't like "toprettyxml", but with my relatively basic level, it seemed like the best bet.
Anyway, if anyone can provide some advice or insight, I would be very grateful.
The xml.etree API does not give you any options to write out a standalone attribute in the XML declaration.
You may want to use the lxml library instead; it uses the same ElementTree API, but offers more control over the output. tostring() supports a standalone flag:
from lxml import etree
etree.tostring(doc, pretty_print=True, standalone=False)
or use .write(), which support the same options:
doc.write(outputfile, pretty_print=True, standalone=False)

Categories