I'm using Python's minidom library to manipulate some XML files. Here is an example file:
<document>
<item>
<link>http://www.this-is-a-url.com/</link>
<description>This is some information!</description>
</item>
<item>
<link>http://www.this-is-a-url.com/</link>
<description>This is some information!</description>
</item>
<item>
<link>http://www.this-is-a-url.com/</link>
<description>This is some information!</description>
</item>
</document>
What I need to do is take the value in "description" and put it into "link", so both say "This is some information!". I've tried to do it like so:
#!/usr/bin/python
from xml.dom.minidom import parse
xmlData = parse("file.xml")
itmNode = xmlData.getElementsByTagName("item")
for n in itmNode:
    n.childNodes[1] = n.childNodes[3]
    n.childNodes[1].tagName = "link"
print xmlData.toxml()
However, "n.childNodes[1] = n.childNodes[3]" seems to link the two nodes together, so when I set n.childNodes[1].tagName = "link" to correct the name, BOTH child nodes become "link", where before they were both "description".
Furthermore, if I use "n.childNodes[1].nodeValue", the changes don't take effect and the XML is printed in its original form. What am I doing wrong?
I'm not sure you can modify the DOM in place with xml.dom.minidom (creating the whole document from scratch with new values should work though).
Anyway, if you accept a solution based on xml.etree.ElementTree (I strongly recommend using it since it provides a friendlier interface), then you could use the following code:
from xml.etree.ElementTree import ElementTree, dump
tree = ElementTree()
tree.parse('file.xml')
items = tree.findall('item')
for item in items:
    link, description = list(item)
    link.text = description.text
dump(tree)
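For comparison, minidom does appear to allow in-place edits, provided you change the text node inside the element rather than reassigning entries of childNodes. An element's own nodeValue is None, because its text lives in a child text node, which would explain why setting nodeValue on the element had no visible effect. The following is a minimal sketch under that assumption; the XML is inlined (and trimmed to one item) only to keep the example self-contained:

```python
from xml.dom.minidom import parseString

# Sample data mirroring the question's file, inlined for the sketch.
XML = """<document>
<item>
<link>http://www.this-is-a-url.com/</link>
<description>This is some information!</description>
</item>
</document>"""

doc = parseString(XML)
for item in doc.getElementsByTagName("item"):
    link = item.getElementsByTagName("link")[0]
    description = item.getElementsByTagName("description")[0]
    # Each element's text lives in a child text node; copy its data across.
    link.firstChild.data = description.firstChild.data

result = doc.toxml()
print(result)
```

Looking elements up by tag name also avoids depending on the childNodes indexes, which count the whitespace text nodes between tags.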
I am getting a response using the requests module in Python, and the response is in the form of XML. I want to parse it and get the details out of each 'dt' tag, but I am not able to do that using lxml.
Here is the xml response:
<?xml version="1.0" encoding="utf-8" ?>
<entry_list version="1.0">
<entry id="harsh">
<ew>harsh</ew><subj>MD-2</subj><hw>harsh</hw>
<sound><wav>harsh001.wav</wav><wpr>!h#rsh</wpr></sound>
<pr>ˈhärsh</pr>
<fl>adjective</fl>
<et>Middle English <it>harsk,</it> of Scandinavian origin; akin to Norwegian <it>harsk</it> harsh</et>
<def>
<date>14th century</date>
<sn>1</sn>
<dt>:having a coarse uneven surface that is rough or unpleasant to the touch</dt>
<sn>2 a</sn>
<dt>:causing a disagreeable or painful sensory reaction :<sx>irritating</sx></dt>
<sn>b</sn>
<dt>:physically discomforting :<sx>painful</sx></dt>
<sn>3</sn>
<dt>:unduly exacting :<sx>severe</sx></dt>
<sn>4</sn>
<dt>:lacking in aesthetic appeal or refinement :<sx>crude</sx></dt>
<ss>rough</ss>
</def>
<uro><ure>harsh*ly</ure> <fl>adverb</fl></uro>
<uro><ure>harsh*ness</ure> <fl>noun</fl></uro>
</entry>
</entry_list>
A simple way would be to traverse down the hierarchy of the xml document.
import requests
from lxml import etree
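response = requests.get(url)  # url is the address of the XML resource being queried
root = etree.fromstring(response.content)
print(root.xpath('//entry_list/entry/def/dt/text()'))
This will give the text nodes directly inside each 'dt' tag in the XML document; text inside nested tags such as 'sx' is not included.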
from xml.dom import minidom
# List with dt values
dt_elems = []
# Process xml getting elements by tag name
xmldoc = minidom.parse('text.xml')
itemlist = xmldoc.getElementsByTagName('dt')
# Get the values
for i in itemlist:
    dt_elems.append(" ".join(t.nodeValue for t in i.childNodes if t.nodeType == t.TEXT_NODE))
# Print the list result
print(dt_elems)
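Both approaches above collect only the text sitting directly inside each dt, so the words wrapped in nested sx tags are left out. If the full definition text is wanted, one possible alternative is ElementTree's itertext(), which also walks the descendants. This is a sketch using only the standard library, with a trimmed copy of the response inlined as sample data:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response from the question, inlined for the sketch.
XML = """<entry_list version="1.0">
<entry id="harsh">
<def>
<sn>1</sn>
<dt>:having a coarse uneven surface that is rough or unpleasant to the touch</dt>
<sn>2 a</sn>
<dt>:causing a disagreeable or painful sensory reaction :<sx>irritating</sx></dt>
</def>
</entry>
</entry_list>"""

root = ET.fromstring(XML)
# itertext() walks the element and its descendants, so the <sx> text is kept.
definitions = ["".join(dt.itertext()) for dt in root.iter("dt")]
for text in definitions:
    print(text)
```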
Can anyone offer some help with using Python to extract information from an XML file? This is my example XML.
<root>
<number index="2">
<info>
<info.RANDOM>Random Text</info.RANDOM>
</info>
</root>
What I want to print out is the information between the root tags. However, I want to print it as is, meaning all the tags, the text between the tags, and the content within each tag (in this case, number index="2"). I have tried itertext(), but that removes the tags and prints only the text between the root tags. So far, I have a makeshift solution that prints only element.tag and element.text, but that does not print the end tags or the content within the tag. Any help would be appreciated! :)
With s as your input,
s='''<root>
<number index="2">
<info>
<info.RANDOM>Random Text</info.RANDOM>
</info>
</number>
</root>'''
Find all elements with the tag name number and convert each one to a string using ET.tostring():
import xml.etree.ElementTree as ET
root = ET.fromstring(s)
for node in root.findall('.//number'):
    # encoding='unicode' makes tostring() return str instead of bytes on Python 3
    print(ET.tostring(node, encoding='unicode'))
Output:
<number index="2">
<info>
<info.RANDOM>Random Text</info.RANDOM>
</info>
</number>
from bs4 import BeautifulSoup
xml = "<root><number index=\"2\"><info><info.RANDOM>Random Text</info.RANDOM></info></root>"
soup = BeautifulSoup(xml, "xml")
output = soup.prettify()
print(output[output.find("<root>") + 7:output.rfind("</root>")])
The + 7 skips past "<root>" and the newline that follows it.
I've been experimenting with the lxml library for a little while, and maybe I'm misunderstanding it or missing something, but I can't figure out how to edit the document after I match a certain XPath and then write the result back out as XML while parsing element by element.
Say we have this xml as an example:
<xml>
<items>
<pie>cherry</pie>
<pie>apple</pie>
<pie>chocolate</pie>
</items>
</xml>
What I would like to do while parsing is, when I hit the XPath "/xml/items/pie", to wrap the pie in a new item element, so that it turns out like this:
<xml>
<items>
<item id="1"><pie>cherry</pie></item>
<item id="2"><pie>apple</pie></item>
<item id="3"><pie>chocolate</pie></item>
</items>
</xml>
That output would need to be written to a file line by line as I hit each tag and edit the XML at certain XPaths. I could print the starting tag, the text, the attribute if it exists, and then the ending tag by hard-coding certain parts, but that would be very messy, and it would be nice to avoid that if possible.
Here's my guess at the code:
from lxml import etree
path=[]
count=0
context=etree.iterparse(file,events=('start','end'))
for event, element in context:
    if event=='start':
        path.append(element.tag)
        if '/'+'/'.join(path)=='/xml/items/pie':
            itemnode=etree.Element('item',id=str(count))
            itemnode.text=""
            element.addprevious(itemnode)#Not the right way to do it of course
            #write/print out xml here.
    else:
        element.clear()
        path.pop()
Edit: Also, I need to run through fairly big files, so I have to use iterparse.
Here's a solution using iterparse(). The idea is to catch all tag "start" events, remember the parent (items) tag, then for every pie tag create an item tag and put the pie into it:
from io import BytesIO
from lxml import etree
from lxml.etree import Element
# iterparse() needs a binary stream, so the sample data is a bytes literal
data = b"""<xml>
<items>
<pie>cherry</pie>
<pie>apple</pie>
<pie>chocolate</pie>
</items>
</xml>"""
stream = BytesIO(data)
context = etree.iterparse(stream, events=("start", ))
for action, elem in context:
    if elem.tag == 'items':
        # remember the parent and restart the numbering
        items = elem
        index = 1
    elif elem.tag == 'pie':
        # swap the pie for a new item, then move the pie inside it
        item = Element('item', {'id': str(index)})
        items.replace(elem, item)
        item.append(elem)
        index += 1
print(etree.tostring(context.root, encoding='unicode'))
prints:
<xml>
<items>
<item id="1"><pie>cherry</pie></item>
<item id="2"><pie>apple</pie></item>
<item id="3"><pie>chocolate</pie></item>
</items>
</xml>
There is a cleaner way to make the modifications you need:
iterate over pie elements
make an item element
use replace() to replace a pie element with item
replace(self, old_element, new_element)
Replaces a subelement with the
element passed as second argument.
from lxml import etree
from lxml.etree import XMLParser, Element
data = """<xml>
<items>
<pie>cherry</pie>
<pie>apple</pie>
<pie>chocolate</pie>
</items>
</xml>"""
tree = etree.fromstring(data, parser=XMLParser())
items = tree.find('.//items')
for index, pie in enumerate(items.xpath('.//pie'), start=1):
    item = Element('item', {'id': str(index)})
    items.replace(pie, item)
    item.append(pie)
print(etree.tostring(tree, pretty_print=True, encoding='unicode'))
prints:
<xml>
<items>
<item id="1"><pie>cherry</pie></item>
<item id="2"><pie>apple</pie></item>
<item id="3"><pie>chocolate</pie></item>
</items>
</xml>
I would suggest using an XSLT template, as it seems a better fit for this task. XSLT is a little tricky until you get used to it, but if all you want is to generate some output from an XML document, it is a great tool.
I'm using Python and the lxml library to produce an XML file that I want to look like this:
<item>
<a:text>hello</a:text>
</item>
However, I can't manage to produce this. I've tried the following code:
import lxml.etree as etree
item = etree.Element('item')
el = etree.SubElement(item, 'text', nsmap={'a': 'http://example.com/'})
But then I end up with:
<item>
<text xmlns="http://example.com/">hello</text>
</item>
I also tried this after some inspiration from the lxml namespaces documentation (http://lxml.de/tutorial.html#namespaces):
import lxml.etree as etree
item = etree.Element('item')
el = etree.SubElement(item, '{a}text')
But that gives me:
<item>
<ns1:text xmlns:ns1="a">hello</ns1:text>
</item>
Is there any way to get the XML format I need with lxml?
The first thing to note is that this...
<item>
<a:text>hello</a:text>
</item>
...is not valid namespace-aware XML. a: is a namespace prefix, but somewhere you have to map it to an actual namespace, as in:
<item xmlns:a="http://example.com/">
<a:text>hello</a:text>
</item>
As you read in the lxml documentation, you can use the {namespace}element syntax to specify a namespace...but this uses an actual namespace, not a namespace prefix (which is why your second example did not work as expected).
You can get what I think you want like this:
>>> from lxml import etree
>>> item = etree.Element('item', nsmap={'a': 'http://example.com/'})
>>> e1 = etree.SubElement(item, '{http://example.com/}text')
Which gives you:
>>> print(etree.tostring(item, pretty_print=True, encoding='unicode'))
<item xmlns:a="http://example.com/">
<a:text/>
</item>
It's also worth noting that from the perspective of XML, the above is exactly equivalent to:
<item>
<text xmlns="http://example.com/">hello</text>
</item>
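One way to check that equivalence is to parse both forms and compare the tag each parser reports for the child element. This is a small sketch using the standard library's ElementTree rather than lxml; the two strings are the examples from above written on one line, with the a prefix declared on item:

```python
import xml.etree.ElementTree as ET

prefixed = '<item xmlns:a="http://example.com/"><a:text>hello</a:text></item>'
default_ns = '<item><text xmlns="http://example.com/">hello</text></item>'

# Both child elements resolve to the same {namespace}localname tag.
tag_prefixed = ET.fromstring(prefixed)[0].tag
tag_default = ET.fromstring(default_ns)[0].tag
print(tag_prefixed)
print(tag_default)
```

Both lines print {http://example.com/}text, which is the name a namespace-aware consumer sees regardless of the prefix used in the file.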
I can read tags, except when there is a prefix. I'm not having luck searching SO for a previous question.
I need to read media:content. I tried image = node.find("media:content").
RSS input:
<channel>
<title>Popular Photography in the last 1 week</title>
<item>
<title>foo</title>
<media:category label="Miscellaneous">photography/misc</media:category>
<media:content url="http://foo.com/1.jpg" height="375" width="500" medium="image"/>
</item>
<item> ... </item>
</channel>
I can read a sibling tag title.
from xml.etree import ElementTree
with open('cache1.rss', 'rt') as f:
    tree = ElementTree.parse(f)
for node in tree.findall('.//channel/item'):
    title = node.find("title").text
I've been using the docs, yet I'm stuck on the 'prefix' part.
Here's an example of using XML namespaces with ElementTree:
>>> x = '''\
<channel xmlns:media="http://www.w3.org/TR/html4/">
<title>Popular Photography in the last 1 week</title>
<item>
<title>foo</title>
<media:category label="Miscellaneous">photography/misc</media:category>
<media:content url="http://foo.com/1.jpg" height="375" width="500" medium="image"/>
</item>
<item> ... </item>
</channel>
'''
>>> node = ElementTree.fromstring(x)
>>> for elem in node.findall('item/{http://www.w3.org/TR/html4/}category'):
...     print(elem.text)
...
photography/misc
media is an XML namespace prefix; it has to be defined somewhere earlier with xmlns:media="...". See http://lxml.de/xpathxslt.html#namespaces-and-prefixes for how to define XML namespaces for use in XPath expressions in lxml.
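With the standard library's ElementTree, the prefix can also be kept in the search path by passing a prefix-to-URI mapping as the second argument to find() or findall(). The sketch below reads the url attribute of media:content this way. The namespace URI is just the one used in the example above; a real feed declares its own URI for the media prefix, so treat it as a placeholder:

```python
import xml.etree.ElementTree as ET

# Placeholder URI taken from the example above; use the URI that the
# feed itself declares in its xmlns:media attribute.
NS = {"media": "http://www.w3.org/TR/html4/"}

RSS = """<channel xmlns:media="http://www.w3.org/TR/html4/">
<item>
<title>foo</title>
<media:content url="http://foo.com/1.jpg" height="375" width="500" medium="image"/>
</item>
</channel>"""

channel = ET.fromstring(RSS)
urls = []
for item in channel.findall("item"):
    # The mapping lets the path use the media: prefix directly.
    content = item.find("media:content", NS)
    if content is not None:
        urls.append(content.get("url"))
print(urls)
```

The prefix in the mapping does not have to match the one in the document; only the URI has to match.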