Write over xml root, but keep header comment - python

I have a script that uses parse from the lxml library in Python to read XML files and applies some logic to remove a specified attribute. I would like to write over the XML elements but keep the header comment.
Example.xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Original Header
Some more info -->
<Foo Name = "Bar" Pet = "Able">
<Foo2 Name = "Bar2" />
<Foo3 Name = "Bar3" />
</Foo>
I would like to write back into the xml file after going through the processing logic such that the Example.xml would look like this:
<?xml version="1.0" encoding="utf-8"?>
<!-- Original Header
Some more info -->
<Foo Name = "Bar">
<Foo2 Name = "Bar2" />
<Foo3 Name = "Bar3" />
</Foo>
Removing the attribute is something I have already figured out. Writing back into the same XML file while preserving the header is what I can't find a solution to. In the worst case I could make an output folder and manually do a BeyondCompare with the files, but I would like to automate this completely.

You can do it very easily with BeautifulSoup using the lxml parser. First, open the file for reading, parse the contents using BeautifulSoup, make changes as required then open the file for writing and write to the file.
from bs4 import BeautifulSoup
with open('./Example.xml', 'r') as f:
    xml = f.read()
soup = BeautifulSoup(xml, 'lxml-xml')
foo = soup.find('Foo')
del foo['Pet']
with open('./Example.xml', 'w') as f:
    f.write(soup.prettify())
It can be written more concisely as:
from bs4 import BeautifulSoup
with open('./Example.xml', 'r') as f:
    soup = BeautifulSoup(f.read(), 'xml')
del soup.find('Foo')['Pet']
with open('./Example.xml', 'w') as f:
    f.write(soup.prettify())
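Since the question already uses lxml, the same round trip can probably be done without BeautifulSoup. The following is a minimal sketch, not a definitive solution: `remove_attribute_in_place` is a hypothetical helper name, and the demo writes a throwaway copy of Example.xml to a temporary directory. It relies on lxml keeping comments that sit before the root element when the whole tree (rather than just the root) is serialised.

```python
import os
import tempfile

from lxml import etree


def remove_attribute_in_place(path, tag, attribute):
    """Drop `attribute` from every `tag` element and overwrite `path`."""
    tree = etree.parse(path)
    for element in tree.iter(tag):
        element.attrib.pop(attribute, None)  # no error if the attribute is absent
    # Writing the ElementTree (not only the root element) keeps the
    # comment that precedes the root.
    tree.write(path, xml_declaration=True, encoding="utf-8")


# Demo on a temporary copy of Example.xml from the question
sample = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<!-- Original Header\nSome more info -->\n'
    '<Foo Name = "Bar" Pet = "Able">\n'
    '<Foo2 Name = "Bar2" />\n'
    '<Foo3 Name = "Bar3" />\n'
    '</Foo>\n'
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Example.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    remove_attribute_in_place(path, "Foo", "Pet")
    with open(path, encoding="utf-8") as f:
        result = f.read()

print(result)
```

lxml normalises quoting and attribute spacing on output, so the rewritten file will not be byte-identical to the original apart from the removed attribute; a diff tool will show those formatting changes as well.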

Related

lxml.etree.XMLSyntaxError, Document labelled UTF-16 but has UTF-8 content

I am getting this error using the lxml library in Python. The other solutions/hacks I found (for PHP) replace utf-16 with utf-8 in the file. What is the Pythonic way to solve this?
python code:
import lxml.etree as etree
tree = etree.parse("req.xml")
req.xml:
<?xml version="1.0" encoding="utf-16"?>
<test
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
</test>
Have a look at the documentation of the XMLParser constructor:
>>> help(etree.XMLParser)
Among other options, there is an encoding parameter, which allows you to "override the document encoding", as the docs say.
That's exactly what you need:
parser = etree.XMLParser(encoding='UTF-8')
tree = etree.parse("req.xml", parser=parser)
If the error message is right (i.e. there aren't any other problems with the document), then I expect this to work.
You can also parse the XML content with BeautifulSoup, which may be closer to the Pythonic approach you asked for.
NOTE: Although the file is labelled utf-16, its content can be decoded as UTF-8 while reading the file, before it is parsed.
The code is below.
sample.xml contains following data:
<?xml version="1.0" encoding="utf-16"?>
<test
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
</test>
Code:
from bs4 import BeautifulSoup
# Read the raw bytes and decode them as UTF-8, ignoring undecodable bytes
with open("sample.xml", "rb") as f:
    content = f.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content, 'html.parser')  # parse the decoded content
data = [tag.attrs for tag in soup.find_all("test")]
print(data)
Output:
[{'xmlns:xsd': 'http://www.w3.org/2001/XMLSchema', 'xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance'}]

Parse XML in Python (one correct way to do so) using lxml

I am getting a response using the requests module in Python, and the response is in the form of XML. I want to parse it and get the details out of each 'dt' tag. I am not able to do that using lxml.
Here is the xml response:
<?xml version="1.0" encoding="utf-8" ?>
<entry_list version="1.0">
<entry id="harsh">
<ew>harsh</ew><subj>MD-2</subj><hw>harsh</hw>
<sound><wav>harsh001.wav</wav><wpr>!h#rsh</wpr></sound>
<pr>ˈhärsh</pr>
<fl>adjective</fl>
<et>Middle English <it>harsk,</it> of Scandinavian origin; akin to Norwegian <it>harsk</it> harsh</et>
<def>
<date>14th century</date>
<sn>1</sn>
<dt>:having a coarse uneven surface that is rough or unpleasant to the touch</dt>
<sn>2 a</sn>
<dt>:causing a disagreeable or painful sensory reaction :<sx>irritating</sx></dt>
<sn>b</sn>
<dt>:physically discomforting :<sx>painful</sx></dt>
<sn>3</sn>
<dt>:unduly exacting :<sx>severe</sx></dt>
<sn>4</sn>
<dt>:lacking in aesthetic appeal or refinement :<sx>crude</sx></dt>
<ss>rough</ss>
</def>
<uro><ure>harsh*ly</ure> <fl>adverb</fl></uro>
<uro><ure>harsh*ness</ure> <fl>noun</fl></uro>
</entry>
</entry_list>
A simple way would be to traverse down the hierarchy of the xml document.
import requests
from lxml import etree
# url is the address of the XML endpoint (not shown in the question)
response = requests.get(url)
root = etree.fromstring(response.content)
print(root.xpath('//entry_list/entry/def/dt/text()'))
This will give the text value of each 'dt' tag in the XML document.
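One caveat with the XPath above: `text()` selects only the text nodes that are direct children of `dt`, so words wrapped in a child element such as `<sx>` are left out. If the full definition text is wanted, joining `itertext()` is one option. The sketch below uses the standard-library ElementTree on a shortened copy of the response so that it runs on its own; `root.iter()` and `itertext()` are also available on lxml elements.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response from the question
snippet = """<entry_list version="1.0">
<entry id="harsh">
<def>
<sn>1</sn>
<dt>:having a coarse uneven surface that is rough or unpleasant to the touch</dt>
<sn>2 a</sn>
<dt>:causing a disagreeable or painful sensory reaction :<sx>irritating</sx></dt>
</def>
</entry>
</entry_list>"""

root = ET.fromstring(snippet)

# itertext() walks the element and all of its descendants,
# so the text inside <sx> is included.
definitions = ["".join(dt.itertext()) for dt in root.iter("dt")]

for definition in definitions:
    print(definition)
```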
Alternatively, with minidom from the standard library:
from xml.dom import minidom
# List with dt values
dt_elems = []
# Process xml getting elements by tag name
xmldoc = minidom.parse('text.xml')
itemlist = xmldoc.getElementsByTagName('dt')
# Get the values
for i in itemlist:
    dt_elems.append(" ".join(t.nodeValue for t in i.childNodes if t.nodeType == t.TEXT_NODE))
# Print the list result
print(dt_elems)

Removing a tag and its contents in xml using BeautifulSoup and lxml in Python

I am working with my Evernote data - extracted to an xml file. I have parsed the data using BeautifulSoup and here is a sampling of my xml data.
<note>
<title>
Audio and camera roll from STUDY DAY! in San Francisco
</title>
<content>
<![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">
<en-note><div><en-media type="image/jpeg" hash="e3a84de41c9886b93a6921413b8482d5" width="1080" style="" height="1920"/><en-media type="image/jpeg" hash="b907b22a9f2db379aec3739d65ce62db" width="1123" style="" height="1600"/><en-media type="audio/wav" hash="d3fdcd5a487531dc156a8c5ef6000764" style=""/><br/></div>
</en-note>
]]>
</content>
<created>
20130217T153800Z
</created>
<updated>
20130217T154311Z
</updated>
<note-attributes>
<latitude>
37.78670730072799
</latitude>
<longitude>
-122.4171893858559
</longitude>
<altitude>
42
</altitude>
<source>
mobile.iphone
</source>
<reminder-order>
0
</reminder-order>
</note-attributes>
<resource>
<data encoding="base64">
There are two avenues I would like to explore here:
1. Finding and removing Specific tags (in this case )
2. locating a group/list of tags to extract to another document
This is my current code, which parses the XML, prettifies it, and outputs it to a text file.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('myNotes.xml', 'r'))
with open("file.txt", "w", encoding="utf8") as f:
    f.write(soup.prettify())
You can search nodes by name:
from bs4 import BeautifulSoup
# Parse in XML mode so that tags such as <source> keep their content
soup = BeautifulSoup(open('myNotes.xml', 'r'), 'xml')
source = soup.source
print(source)
#<source>
# mobile.iphone
#</source>
print(source.string)
# mobile.iphone
Another way to do it is the find_all method:
for tag in soup.find_all('source'):
    print(tag.string)
If you want to print every node with the tags stripped, this should do the job:
for tag in soup.find_all():
    print(tag.string)
Hope it helps.
EDIT:
BeautifulSoup assumes you know the structure, although by definition XML is structured data storage.
So you need to tell BeautifulSoup which parts of your XML to read.
row = []
title = soup.note.title.string
created = soup.note.created.string
row.append(title)
row.append(created)
Now you only have to iterate over the XML.
If you're using BeautifulSoup, you could use the getText() method to strip out the tags in the child elements and get one consolidated text:
source.getText()
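For the two goals listed in the question (removing a tag together with its contents, and pulling a group of tags out into another document), BeautifulSoup's `decompose()` and `extract()` methods are the usual tools. The sketch below is illustrative only: the sample is a shortened, hypothetical version of the note, `source` stands in for the tag to delete because the question does not name it, and the `xml` parser requires lxml to be installed.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the Evernote export
xml = """<note>
<title>Audio and camera roll</title>
<created>20130217T153800Z</created>
<note-attributes>
<latitude>37.78670730072799</latitude>
<source>mobile.iphone</source>
</note-attributes>
</note>"""

soup = BeautifulSoup(xml, "xml")

# 1. Remove a tag together with everything inside it.
for tag in soup.find_all("source"):
    tag.decompose()

# 2. Detach a group of tags; extract() removes each tag from the tree
#    and returns it, so it can be written to another document.
extracted = [tag.extract() for tag in soup.find_all(["title", "created"])]
extracted_xml = "\n".join(str(tag) for tag in extracted)

print(extracted_xml)
print(soup.prettify())
```

`decompose()` destroys the tag, while `extract()` hands it back, so the choice depends on whether the removed content is still needed.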

Python and BeautifulSoup shuffles my tags around?

This has been puzzling me for a while now, and I can not figure out what's going on here. This is the original XML file:
<?xml version="1.0" encoding="UTF-8" ?>
<book>
<meta>
<title>Some Title</title>
<creator>Another Author</creator>
<language>en-US</language>
...
</meta>
<chapter>
...
</chapter>
</book>
Then I read the file in:
with open(filename) as f:
    soup = BeautifulSoup(f)
    print(soup.root)
and that, oddly, dumps the following:
<html><body><book>
<meta/>
<title>Some Title</title>
<creator>Some Author</creator>
<language>en-US</language>
...
So why oh why is the <meta> tag in the soup empty, when it is not in the original XML file? (I could swear that this worked just two weeks ago, and I could also swear that I've not touched the script. I did make some slight changes to the XML file further down, but I fail to see a correlation.)
You are opening an XML file in an HTML parser. BeautifulSoup tries to repair the HTML structure it expects.
Instead, use an XML parser, or use BeautifulSoup in XML mode:
soup = BeautifulSoup(f, 'xml')
For this to work you must have lxml installed.
lxml is an excellent XML library in and of itself. You could also use the ElementTree library included with Python; lxml is based on the same API but has more features.
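As a rough illustration of the ElementTree route, here is a sketch using the standard library on a shortened copy of the book file (the `...` placeholders from the question are left out). An XML parser has no special notion of `<meta>`, so the element keeps its children.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the file from the question
book_xml = b"""<?xml version="1.0" encoding="UTF-8" ?>
<book>
<meta>
<title>Some Title</title>
<creator>Another Author</creator>
<language>en-US</language>
</meta>
<chapter>
</chapter>
</book>"""

root = ET.fromstring(book_xml)
meta = root.find("meta")

# <meta> still contains its child elements
child_tags = [child.tag for child in meta]
print(child_tags)
print(meta.find("title").text)
```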

BeautifulSoup XML Only printing first line

I'm using BeautifulSoup4 (and lxml) to parse an XML file. For some reason, when I print soup.prettify() it only prints the first line:
from bs4 import BeautifulSoup
f = open('xmlDoc.xml', "r")
soup = BeautifulSoup(f, 'xml')
print(soup.prettify())
#>>> <?xml version="1.0" encoding="utf-8"?>
Any idea why it's not grabbing everything?
UPDATE:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Data Junction generated file.
Macro type "1000" is reserved. -->
<djmacros>
<macro name="Test" type="5000" value="TestValue">
<description>test</description>
</macro>
<macro name="AnotherTest" type="0" value="TestValue2"/>
<macro name="TestLocation" type="1000" value="C:\RandomLocation">
<description> </description>
</macro>
<djmacros>
The file position is at EOF:
>>> soup = BeautifulSoup("", 'xml')
>>> soup.prettify()
'<?xml version="1.0" encoding="utf-8">\n'
Or the content is not valid xml:
>>> soup = BeautifulSoup("no <root/> element", 'xml')
>>> soup.prettify()
'<?xml version="1.0" encoding="utf-8">\n'
As per J.F. Sebastian's answer, the XML is invalid.
Your final tag is incorrect:
<djmacros>
The correct tag is:
</djmacros>
You can confirm this with an XML validator, e.g. http://www.w3schools.com/xml/xml_validator.asp
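A quick local check is also possible with a strict parser, which raises an error on input that is not well-formed instead of silently returning an empty document. This is a minimal sketch using the standard-library ElementTree; `well_formedness_error` is a hypothetical helper name, and the two strings are shortened stand-ins for the file in the question.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return the parser's error message, or None if the XML parses cleanly."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Shortened stand-ins: the first repeats the opening tag instead of closing it
broken = "<djmacros><macro name='Test'/><djmacros>"
fixed = "<djmacros><macro name='Test'/></djmacros>"

print(well_formedness_error(broken))
print(well_formedness_error(fixed))
```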
I had the same problem with a valid XML file.
The problem was that the XML file is encoded in UTF-8 with BOM.
I discovered that by printing the raw content:
content = open(path, "r").read()
print(content)
And I got (see this thread: What's ï»¿ sign at the beginning of my source file?):
ï»¿<?xml version="1.0" encoding="utf-8"?>
If the file is encoded as UTF-8 with BOM rather than plain UTF-8, parsing may run into problems even if the XML is otherwise valid.
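One way to sidestep the BOM, assuming the file really is UTF-8, is to read it with Python's `utf-8-sig` codec, which drops a leading BOM if one is present. The sketch below writes a small temporary file with a BOM (the file name is hypothetical) and compares the two ways of reading it; whether this resolves a particular BeautifulSoup issue depends on the versions involved.

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "with_bom.xml")  # hypothetical file name

    # Write a small XML file that starts with the UTF-8 byte order mark
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="utf-8"?>\n<root/>\n')

    # Plain utf-8 keeps the BOM as the character U+FEFF at the start
    with open(path, encoding="utf-8") as f:
        raw = f.read()

    # utf-8-sig strips the BOM while decoding
    with open(path, encoding="utf-8-sig") as f:
        clean = f.read()

print(repr(raw[:6]))
print(repr(clean[:6]))
```

The cleaned string can then be passed to the parser in place of the file object.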
