Add an element before text with lxml - python

I have some XML where I want to insert a new element before the text.
I tried:
from lxml import etree
xml = "<root><foo>some text</foo></root>"
root = etree.fromstring(xml)
foo = root.find("foo")
foo.insert(0, etree.Element("bar"))
etree.tostring(foo)
and the result was
<foo>some text<bar/></foo>
when I was hoping for
<foo><bar/>some text</foo>
Bearing in mind that the foo element may actually be quite complicated.
The best I could come up with was
def insert_before(elem, child):
    elem.insert(0, child)
    child.tail, elem.text = elem.text, None
But is there a function or argument in the API that I missed?
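For what it's worth, the helper above does produce the hoped-for output. Here is a minimal check of that, written against the standard library's xml.etree.ElementTree rather than lxml. That swap is an assumption made only so the snippet runs without extra packages. Both libraries share the same text/tail model, though ElementTree writes the empty element as <bar /> with a space.

```python
import xml.etree.ElementTree as ET


def insert_before(elem, child):
    """Insert child as the first node of elem, ahead of elem's leading text."""
    elem.insert(0, child)
    # The leading text of elem becomes the tail of the new first child.
    # Any tail that child already had is overwritten.
    child.tail, elem.text = elem.text, None


root = ET.fromstring("<root><foo>some text</foo></root>")
foo = root.find("foo")
insert_before(foo, ET.Element("bar"))
result = ET.tostring(foo, encoding="unicode")
print(result)
```

The swap of text and tail is the whole trick: in this data model, "some text" is not a child node of foo but a string attached either to foo (as text) or to a preceding sibling (as tail).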

Related

How to get the xml element as a string with namespace using ElementTree in python?

I need to get the elements from xml as a string. I am trying with below xml format.
<xml>
<prot:data xmlns:prot="prot">
<product-id-template>
<prot:ProductId>PRODUCT_ID</prot:ProductId>
</product-id-template>
<product-name-template>
<prot:ProductName>PRODUCT_NAME</prot:ProductName>
</product-name-template>
<dealer-template>
<xsi:Dealer xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">DEALER</xsi:Dealer>
</dealer-template>
</prot:data>
</xml>
And I tried with below code:
from xml.etree import ElementTree as ET
def get_template(xpath, namespaces):
    tree = ET.parse('cdata.xml')
    elements = tree.getroot()
    for element in elements.findall(xpath, namespaces=namespaces):
        return element
namespace = {"prot" : "prot"}
aa = get_template(".//prot:ProductId", namespace)
print(ET.tostring(aa).decode())
Actual output:
<ns0:ProductId xmlns:ns0="prot">PRODUCT_ID</ns0:ProductId>
Expected output:
<prot:ProductId>PRODUCT_ID</prot:ProductId>
I must not remove the xmlns declaration where it is present in the document, and it must be omitted where it is not present. For example, product-id-template does not contain an xmlns declaration, so its content should be retrieved without one; dealer-template does contain one, so its content should be retrieved with it.
How to achieve this?
You can remove xmlns with regex.
import re
# ...
with_ns = ET.tostring(aa).decode()
no_ns = re.sub(r' xmlns(:\w+)?="[^"]+"', '', with_ns)
print(no_ns)
UPDATE: You can also do something rather wild, although I can't recommend it, because I'm not a Python expert.
I checked the source code and found that this hack works:
def my_serialize_xml(write, elem, qnames, namespaces,
                     short_empty_elements, **kwargs):
    ET._serialize_xml(write, elem, qnames,
                      None, short_empty_elements, **kwargs)
ET._serialize["xml"] = my_serialize_xml
This defines my_serialize_xml, which calls ElementTree._serialize_xml with namespaces=None, and then replaces the value for the key "xml" in the ElementTree._serialize dictionary with my_serialize_xml. From then on, ElementTree.tostring uses my_serialize_xml. Both names are private (underscore-prefixed), so this may break between Python versions.
If you want to try it, place the code above after from xml.etree import ElementTree as ET (but before using ET).
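A less invasive option, sketched here as an assumption about what the asker needs rather than a tested fix for their file: ET.register_namespace makes the serializer emit the prot prefix instead of ns0, and the regex from the first part of this answer can then drop the xmlns declaration. The inline document is a shortened stand-in for cdata.xml.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the prefix "prot" maps to the URI "prot", as in the question.
# register_namespace updates a global registry used by the serializer.
ET.register_namespace("prot", "prot")

# Shortened stand-in for cdata.xml (hypothetical inline copy).
doc = ET.fromstring(
    '<xml><prot:data xmlns:prot="prot">'
    "<product-id-template>"
    "<prot:ProductId>PRODUCT_ID</prot:ProductId>"
    "</product-id-template>"
    "</prot:data></xml>"
)

elem = doc.find(".//prot:ProductId", {"prot": "prot"})

# The serializer still adds the declaration, but now with the registered prefix.
with_ns = ET.tostring(elem, encoding="unicode")

# Strip the declaration while keeping the prefixed tag names.
no_ns = re.sub(r' xmlns(:\w+)?="[^"]+"', "", with_ns)

print(with_ns)
print(no_ns)
```

Keep in mind that the regex removes every xmlns declaration in the serialized string, so it would also strip the one on xsi:Dealer, which the question wants to keep. It only fits elements such as ProductId whose declaration lives on an ancestor.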

How to parse the xml with xmlns attribute using python

<?xml version="1.0" ?>
<school xmlns="loyo:22:2.2">
<profile>
<student xmlns="loyo:5:542">
<marks>
<mark java="java:/lo">
<ca1>200</ca1>
</mark>
</marks>
</student>
</profile>
</school>
I am trying to access the ca1 text. I am using etree but I cannot access it. I'm using the code below.
import xml.etree.ElementTree as ET
tree = ET.parse('mca.xml')
root = tree.getroot()
def getElementsData(xpath):
    elements = list()
    if root.findall(xpath):
        for elem in root.findall(xpath):
            elements.append(elem.text)
        return elements
    else:
        raise SystemExit("Invalid xpath provided")
t = getElementsData('.//ca1')
for i in t:
    print(i)
I tried different ways to access it, but I don't know what the exact problem is. Is it a file type issue?
Your document declares namespaces on the school and student nodes, so you need to incorporate them in your search. Since you are looking for ca1, which is under student, you need to specify the namespace of the student node:
import xml.etree.ElementTree as ET
tree = ET.parse('mca.xml')
root = tree.getroot()
def getElementsData(xpath, namespaces):
    elements = root.findall(xpath, namespaces)
    if elements == []:
        raise SystemExit("Invalid xpath provided")
    return elements
namespaces = {'ns_school': 'loyo:22:2.2', 'ns_student': 'loyo:5:542'}
elements = getElementsData('.//ns_student:ca1', namespaces)
for element in elements:
    print(element.text)
Notes
Since your namespaces have no prefixes, I gave them names such as ns_school and ns_student, but these names can be anything (e.g. ns1, mystudent, ...)
In a more complex system, I recommend raising some other kind of error and letting the caller decide whether or not to exit.
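To make the point about default namespaces concrete, here is a small self-contained sketch. The XML is inlined instead of read from mca.xml (an assumption for the sake of a runnable example), and it shows two equivalent spellings of the search: a prefix resolved through a namespaces dict, and the {uri}tag form that needs no dict.

```python
import xml.etree.ElementTree as ET

# Inline copy of the document from the question (instead of parsing mca.xml).
XML = """<school xmlns="loyo:22:2.2">
  <profile>
    <student xmlns="loyo:5:542">
      <marks>
        <mark java="java:/lo">
          <ca1>200</ca1>
        </mark>
      </marks>
    </student>
  </profile>
</school>"""

root = ET.fromstring(XML)

# A bare tag name only matches elements that are in no namespace.
unqualified = root.findall(".//ca1")

# Spelling 1: an arbitrary prefix resolved through a dict.
namespaces = {"ns_student": "loyo:5:542"}
by_prefix = [e.text for e in root.findall(".//ns_student:ca1", namespaces)]

# Spelling 2: the namespace URI written inline in {uri}tag form.
by_uri = [e.text for e in root.findall(".//{loyo:5:542}ca1")]

print(unqualified, by_prefix, by_uri)
```

The empty result for the bare name is exactly what the question ran into: ca1 inherits the default namespace declared on student, so its real tag is {loyo:5:542}ca1.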
How about traversing like this (getchildren() was removed in Python 3.9, so index the elements directly):
import xml.etree.ElementTree
e = xml.etree.ElementTree.parse('test.xml').getroot()
data = e[0][0][0][0][0].text
print(data)
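Try the following XPath (this needs lxml, since xml.etree has no xpath() method; local-name() is used because ca1 is in a namespace)
tree.xpath('//*[local-name()="ca1"]/text()')[0].strip()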

How delete tag from node in lxml without tail?

Example:
html = <a><b>Text</b>Text2</a>
BeautifulSoup code:
[x.extract() for x in html.findAll('b')]
The result is:
html = <a>Text2</a>
lxml code:
[bad.getparent().remove(bad) for bad in html.xpath(".//b")]
The result is:
html = <a></a>
because lxml treats "Text2" as the tail of <b></b>.
If we only need the text that results from joining the tags, we can use:
for bad in raw.xpath(xpath_search):
    bad.text = ''
But how can I remove the tags without changing the text and without losing the tail?
While the accepted answer from phlou will work, there are easier ways to remove tags without also removing their tails.
If you want to remove a specific element, then the lxml.html method you are looking for is drop_tree.
From the docs:
Drops the element and all its children. Unlike el.getparent().remove(el) this does not remove the tail text; with drop_tree the tail text is merged with the previous element.
If you want to remove all instances of a specific tag, you can use the lxml.etree.strip_elements or lxml.html.etree.strip_elements with with_tail=False.
Delete all elements with the provided tag names from a tree or
subtree. This will remove the elements and their entire subtree,
including all their attributes, text content and descendants. It
will also remove the tail text of the element unless you
explicitly set the with_tail keyword argument option to False.
So, for the example in the original post:
>>> from lxml.html import fragment_fromstring, tostring
>>>
>>> html = fragment_fromstring('<a><b>Text</b>Text2</a>')
>>> for bad in html.xpath('.//b'):
...     bad.drop_tree()
>>> tostring(html, encoding="unicode")
'<a>Text2</a>'
or
>>> from lxml.html import fragment_fromstring, tostring, etree
>>>
>>> html = fragment_fromstring('<a><b>Text</b>Text2</a>')
>>> etree.strip_elements(html, 'b', with_tail=False)
>>> tostring(html, encoding="unicode")
'<a>Text2</a>'
Edit:
Please look at @Joshmakers' answer https://stackoverflow.com/a/47946748/8055036, which is clearly the better one.
I did the following to save the tail text to the previous sibling or parent.
def remove_keeping_tail(self, element):
    """Save the tail text and then delete the element"""
    self._preserve_tail_before_delete(element)
    element.getparent().remove(element)
def _preserve_tail_before_delete(self, node):
    if node.tail: # preserve the tail
        previous = node.getprevious()
        if previous is not None: # if there is a previous sibling it will get the tail
            if previous.tail is None:
                previous.tail = node.tail
            else:
                previous.tail = previous.tail + node.tail
        else: # The parent gets the tail as text
            parent = node.getparent()
            if parent.text is None:
                parent.text = node.tail
            else:
                parent.text = parent.text + node.tail
HTH
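The same tail-preserving idea can be sketched without lxml. The version below is an assumption-laden port to the standard library's xml.etree.ElementTree, which has no getparent() or getprevious(), so the hypothetical helper takes the parent explicitly and finds the previous sibling by index.

```python
import xml.etree.ElementTree as ET


def remove_keeping_tail(parent, child):
    """Remove child from parent but leave child's tail text in the document."""
    index = list(parent).index(child)
    if child.tail:
        if index > 0:
            # A previous sibling exists: append the tail to its tail.
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + child.tail
        else:
            # No previous sibling: the tail becomes part of the parent's text.
            parent.text = (parent.text or "") + child.tail
    parent.remove(child)


# Case 1: the example from the question, <b> is the first child.
first = ET.fromstring("<a><b>Text</b>Text2</a>")
for bad in first.findall("b"):
    remove_keeping_tail(first, bad)
first_result = ET.tostring(first, encoding="unicode")

# Case 2: <b> has a previous sibling that already carries a tail.
second = ET.fromstring("<a>Head<i>x</i>mid<b>Text</b>Text2</a>")
for bad in second.findall("b"):
    remove_keeping_tail(second, bad)
second_result = ET.tostring(second, encoding="unicode")

print(first_result)
print(second_result)
```

With lxml available, drop_tree or strip_elements(..., with_tail=False) from the other answer remains the shorter route. This sketch only spells out what has to happen to the tail.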

Get all text inside a tag in lxml

I'd like to write a code snippet that would grab all of the text inside the <content> tag, in lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()) but that would miss the text in between the tags. I didn't have very much luck searching the API for a relevant function. Could you help me out?
<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"
<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"
<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"
Just use the node.itertext() method, as in:
''.join(node.itertext())
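A quick check of what itertext() gives for the third case in the question. This sketch uses the standard library's xml.etree.ElementTree (an assumption for portability; lxml elements expose the same itertext() method). It suggests that the tags themselves are dropped, so this fits only when plain text is wanted.

```python
import xml.etree.ElementTree as ET

node = ET.fromstring(
    "<content>Text outside tag <div>Text inside tag</div></content>"
)

# itertext() walks the subtree and yields only text and tail strings.
text_only = "".join(node.itertext())
print(text_only)
```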
Does text_content() do what you need?
Try:
def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    parts = ([node.text] +
            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
            [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))
Example:
from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)
Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'
A version of albertov's stringify_children that fixes the bugs reported by hoju:
def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    return ''.join(
        chunk for chunk in chain(
            (node.text,),
            chain(*((tostring(child, with_tail=False), child.tail) for child in node.getchildren())),
            (node.tail,)) if chunk)
The following snippet, which uses Python generators, works well and is efficient (note that it returns the text only, without the tags).
''.join(node.itertext()).strip()
Defining stringify_children this way may be less complicated:
from lxml import etree
def stringify_children(node):
    s = node.text
    if s is None:
        s = ''
    for child in node:
        s += etree.tostring(child, encoding='unicode')
    return s
or in one line
return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))
Rationale is the same as in this answer: leave the serialization of child nodes to lxml. The tail part of node in this case isn't interesting since it is "behind" the end tag. Note that the encoding argument may be changed according to one's needs.
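To check that approach against the three cases from the question, here is a sketch using the standard library's xml.etree.ElementTree instead of lxml (an assumption so that it runs without extra packages; its tostring also includes the child's tail, which is what makes the approach work). The name inner_xml is hypothetical.

```python
import xml.etree.ElementTree as ET


def inner_xml(node):
    """Return everything between node's start and end tag as a string."""
    parts = [node.text or ""]
    for child in node:
        # tostring serializes the child, its subtree, and its tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


case_1 = inner_xml(ET.fromstring("<content><div>Text inside tag</div></content>"))
case_2 = inner_xml(ET.fromstring("<content>Text with no tag</content>"))
case_3 = inner_xml(
    ET.fromstring("<content>Text outside tag <div>Text inside tag</div></content>")
)
with_tail = inner_xml(ET.fromstring("<content>a<b>c</b>d</content>"))

print(case_1, case_2, case_3, with_tail, sep="\n")
```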
Another possible solution is to serialize the node itself and afterwards, strip the start and end tag away:
def stringify_children(node):
    s = etree.tostring(node, encoding='unicode', with_tail=False)
    return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]
which is somewhat horrible. This code is correct only if node has no attributes, and I don't think anyone would want to use it even then.
One of the simplest code snippets that actually worked for me, as per the documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text, is
etree.tostring(html, method="text")
where html is the node/tag whose complete text you are trying to read. Note that it doesn't get rid of script and style tags, though.
import urllib2
from lxml import etree
url = 'some_url'
# fetch the page
test = urllib2.urlopen(url)
page = test.read()
# parse the HTML
tree = etree.HTML(page)
# xpath() returns a list, so take the first match
table = tree.xpath("xpath_here")[0]
res = etree.tostring(table)
# res is the HTML code of the table, including the table tag itself
This did the job for me.
So you can extract a tag's text content with a text() XPath query, and tags including their content with tostring(). Note that xpath() returns a list, so pick an element before passing it to tostring():
div = tree.xpath("//div")[0]
div_res = etree.tostring(div)
text = tree.xpath("//content/text()")
div_3 = tree.xpath("//content")[0]
div_3_res = etree.tostring(div_3).strip('<content>').rstrip('</')
Using strip in this last line is not nice (strip removes a set of characters rather than a prefix, so it can also eat matching characters at the edges of the content), but it worked for me.
Just a quick enhancement as the answer has been given. If you want to clean the inside text:
clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()
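One thing to watch with stripping piece by piece: a whitespace-only piece becomes an empty string, and joining with ' ' then leaves a double space. A variant that joins first and collapses whitespace afterwards avoids that. The sketch below uses the standard library's xml.etree.ElementTree (an assumption for portability) and a hypothetical helper name, clean_text.

```python
import xml.etree.ElementTree as ET


def clean_text(node):
    """Join all text under node, then collapse each whitespace run to one space."""
    return " ".join("".join(node.itertext()).split())


# The tail of <b/> is a single space, i.e. a whitespace-only piece.
node = ET.fromstring("<a>x<b/> <c/>y</a>")

# Strip-then-join, as in the one-liner above.
per_piece = " ".join(n.strip() for n in node.itertext()).strip()

# Join-then-collapse.
collapsed = clean_text(node)

print(repr(per_piece))
print(repr(collapsed))
```

The trade-off is that join-then-collapse adds no space between adjacent pieces that had none in the source, so <p>a<b>b</b></p> yields "ab" rather than "a b". Which behavior is right depends on the markup.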
In response to @Richard's comment above, if you patch stringify_children to read:
parts = ([node.text] +
-- list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
++ list(chain(*([tostring(c)] for c in node.getchildren()))) +
[node.tail])
it seems to avoid the duplication he refers to.
I know that this is an old question, but this is a common problem and I have a solution that seems simpler than the ones suggested so far:
import lxml.html
def stringify_children(node):
    """Given an lxml tag, return its contents as a string
    >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
    >>> node = lxml.html.fragment_fromstring(html)
    >>> stringify_children(node)
    "<strong>Sample sentence</strong> with tags."
    """
    if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
        return ""
    node.attrib.clear()
    opening_tag = len(node.tag) + 2
    closing_tag = -(len(node.tag) + 3)
    return lxml.html.tostring(node)[opening_tag:closing_tag]
Unlike some of the other answers to this question, this solution preserves all of the tags contained within the node and attacks the problem from a different angle than the other working solutions. Note that it clears the node's attributes as a side effect.
Here is a working solution. We can get content with a parent tag and then cut the parent tag from output.
import re
from lxml import etree
def _tostr_with_tags(parent_element, html_entities=False):
    RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$'
    content_with_parent = etree.tostring(parent_element)
    def _replace_html_entities(s):
        RE_ENTITY = r'&#(\d+);'
        def repl(m):
            return unichr(int(m.group(1)))
        replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE|re.UNICODE)
        return replaced
    if not html_entities:
        content_with_parent = _replace_html_entities(content_with_parent)
    content_with_parent = content_with_parent.strip() # remove 'white' characters on margins
    start_tag, content_without_parent, end_tag = re.findall(RE_CUT, content_with_parent, flags=re.UNICODE|re.MULTILINE|re.DOTALL)[0]
    if start_tag != end_tag:
        raise Exception('Start tag does not match to end tag while getting content with tags.')
    return content_without_parent
parent_element must be of type Element.
Please note that if you want text content (not HTML entities in the text), leave the html_entities parameter as False.
lxml has a method for that:
node.text_content()
If this is an a tag, you can try:
node.values()
import re
from lxml import etree
node = etree.fromstring("""
<content>Text before inner tag
<div>Text
<em>inside</em>
tag
</div>
Text after inner tag
</content>""")
print re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node), re.DOTALL).group(1)

Getting unique value when the same tag is in children's tree in XML with Python

I have getElementText as follows which works pretty well with [0] as the XML that I'm working on doesn't have the duplicate tag.
from xml.dom import minidom
def getElementText(element, tagName):
    return str(element.getElementsByTagName(tagName)[0].firstChild.data)
doc = minidom.parse("/Users/smcho/Desktop/hello.xml")
outputTree = doc.getElementsByTagName("Output")[0]
print getElementText(outputTree, "Number")
However, when I parse the following XML, getElementText(outputTree, "Number") returns the value from <ConnectedTerminal><Number>1</Number></ConnectedTerminal> rather than from <Number>0</Number>, because the getElementText function returns the first of the two elements with the tag "Number".
<Output>
<ConnectedTerminal>
<Node>5</Node>
<Number>1</Number>
</ConnectedTerminal>
<Type>int8</Type>
<Number>0</Number>
</Output>
Is there any solution to this problem? Is there a way to get only <Number>0</Number> or only <ConnectedTerminal><Number>1</Number></ConnectedTerminal>?
If lxml is an option (it's much nicer than minidom), you can do:
from lxml import etree
doc = etree.fromstring(xml)
node = doc.find('Number')
print node.text # 0
node = doc.xpath('//ConnectedTerminal/Number')[0]
print node.text # 1
Also see the xpath tutorial.
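If installing lxml is not an option, roughly the same lookups can be sketched with the standard library's xml.etree.ElementTree, which supports a limited XPath subset. The inline XML string is an assumption standing in for hello.xml.

```python
import xml.etree.ElementTree as ET

# Inline copy of the document from the question.
XML = """<Output>
  <ConnectedTerminal>
    <Node>5</Node>
    <Number>1</Number>
  </ConnectedTerminal>
  <Type>int8</Type>
  <Number>0</Number>
</Output>"""

doc = ET.fromstring(XML)

# find() with a bare tag name looks at direct children only.
direct = doc.find("Number").text

# A path expression reaches the nested element explicitly.
nested = doc.find("ConnectedTerminal/Number").text

# iter() walks the whole subtree in document order, like getElementsByTagName.
all_numbers = [e.text for e in doc.iter("Number")]

print(direct, nested, all_numbers)
```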
There's not a direct DOM method to do this, no. But it's fairly easy to write one:
def getChildElementsByTagName(element, tag):
    children= []
    for child in element.childNodes:
        if child.nodeType==child.ELEMENT_NODE and tag in (child.tagName, '*'):
            children.append(child)
    return children
Plus here's a safer text-getting function, so you don't have to worry about multiple nodes, missing nodes due to blank strings, or CDATA sections.
def getTextContent(element):
    texts= []
    for child in element.childNodes:
        if child.nodeType==child.ELEMENT_NODE:
            texts.append(getTextContent(child))
        elif child.nodeType==child.TEXT_NODE:
            texts.append(child.data)
    return u''.join(texts)
then just:
>>> getTextContent(getChildElementsByTagName(doc, u'Number')[0])
u'0'
>>> getTextContent(getChildElementsByTagName(doc, u'Output')[0].getElementsByTagName(u'Number')[0])
u'1'
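For a version that can be pasted and run as-is, here is a compact sketch of the direct-children idea with minidom. The snake_case helper name and the inline XML string are assumptions of this example, and the search starts from doc.documentElement (the Output element) rather than from the document node.

```python
from xml.dom import minidom


def get_child_elements_by_tag_name(element, tag):
    """Return only the direct child elements of element that match tag."""
    return [
        child
        for child in element.childNodes
        if child.nodeType == child.ELEMENT_NODE and tag in (child.tagName, "*")
    ]


doc = minidom.parseString(
    "<Output>"
    "<ConnectedTerminal><Node>5</Node><Number>1</Number></ConnectedTerminal>"
    "<Type>int8</Type>"
    "<Number>0</Number>"
    "</Output>"
)
output = doc.documentElement

# Only the Number that is a direct child of Output.
direct = get_child_elements_by_tag_name(output, "Number")[0].firstChild.data

# The Number nested inside ConnectedTerminal.
terminal = get_child_elements_by_tag_name(output, "ConnectedTerminal")[0]
nested = get_child_elements_by_tag_name(terminal, "Number")[0].firstChild.data

print(direct, nested)
```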
