Python and Selenium - get text excluding child node's text

Using Python 3.
Supposing:
<whatever>
text
<subchild>
other
</subchild>
</whatever>
If I do:
elem = driver.find_element_by_xpath("//whatever")
elem.text contains "text other"
If I do:
elem = driver.find_element_by_xpath("//whatever/text()[normalize-space()]")
elem is not a WebElement.
How may I proceed to grab only "text" (and not "other")?
That is: grab only the text in the node itself, not in its child nodes.
UPDATE:
Original HTML is:
<div class="border-ashes the-code text-center">
VIVEGRPN
<span class="cursor"></span>
<button class="btn btn-ashes zclip" data-clipboard-target=".the-code" data-coupon-code="VklWRUdSUE4=">
<span class="r">Hen, la.</span>
</div>

Bear in mind that the replacement approach mentioned by @Guy doesn't work for many structures.
For instance, having this structure:
<div>
Hello World
<b>e</b>
</div>
The parent text would be Hello World e, the child text would be e, and the replacement would result in Hllo World instead of Hello World.
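To see the failure concretely, here is a minimal sketch in plain Python. The two strings are stand-ins for what elem.text would return for the parent and the child; they are not output captured from a real page.

```python
# Hypothetical values standing in for parent.text and child.text.
all_text = "Hello World e"
child_text = "e"

# str.replace removes every occurrence, not just the child's own text.
naive = all_text.replace(child_text, "").strip()
print(naive)  # Hllo World
```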
A safe solution
To get an element's own text safely, you have to iterate over the node's children and concatenate the text nodes. Since you can't do that in pure Selenium, you have to execute JS code.
OWN_TEXT_SCRIPT = "if(arguments[0].hasChildNodes()){var r='';var C=arguments[0].childNodes;for(var n=0;n<C.length;n++){if(C[n].nodeType==Node.TEXT_NODE){r+=' '+C[n].nodeValue}}return r.trim()}else{return arguments[0].innerText}"
parent_text = driver.execute_script(OWN_TEXT_SCRIPT, elem)
The script is a minified version of this simple function:
if (arguments[0].hasChildNodes()) {
    var res = '';
    var children = arguments[0].childNodes;
    for (var n = 0; n < children.length; n++) {
        if (children[n].nodeType == Node.TEXT_NODE) {
            res += ' ' + children[n].nodeValue;
        }
    }
    return res.trim()
}
else {
    return arguments[0].innerText
}

I had a similar problem recently, where Selenium always gave me all the text inside the element, including the spans. I ended up splitting the string on the newline character "\n", e.g.:
all_text = driver.find_element_by_xpath(xpath).text
req_text = all_text.split("\n")[0]

You can remove the child node's text from the full text:
all_text = driver.find_element_by_xpath("//whatever").text
child_text = driver.find_element_by_xpath("//subchild").text
parent_text = all_text.replace(child_text, '')

You can first extract the outerHTML from the element, then build the soup with BeautifulSoup and remove any element you want.
Small example:
from bs4 import BeautifulSoup
el = driver.find_element_by_css_selector('whatever')
outerHTML = el.get_attribute('outerHTML')
soup = BeautifulSoup(outerHTML, 'html.parser')
inner_elem = soup.select('subchild')[0].extract()
text_inner_elem = inner_elem.text
text_outer_elem = soup.text
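If BeautifulSoup is not available, the same idea (parse the outerHTML and keep only the top-level text) can be sketched with the standard library's html.parser instead. The names OwnTextParser and own_text are made up for this illustration, and the sketch assumes reasonably well-formed markup: an unclosed non-void tag would throw off the depth counter.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class OwnTextParser(HTMLParser):
    """Collects text that sits directly inside the outermost element."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        # depth == 1 means "inside the outer element, outside any child".
        if self.depth == 1:
            self.parts.append(data)


def own_text(outer_html):
    """Return the outer element's own text with whitespace collapsed."""
    parser = OwnTextParser()
    parser.feed(outer_html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


print(own_text("<whatever>\ntext\n<subchild>\nother\n</subchild>\n</whatever>"))
```

With Selenium, the input would be el.get_attribute('outerHTML'), as in the answer above.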

Related

Selenium, Xpath, select a certain part of text within a node

I have a source file like this:
<div class="l_post j_l_post l_post_bright " ...>
<div class="lzl_cnt">
...
<span class="lzl_content_main">
text1
<a class="at j_user_card" username="...">
username
</a>
text3
</span>
</div>
...
</div>
I want to get text3. Currently, I tried this (I am at <div class="lzl_cnt">):
driver.find_element(By.XPATH,'.//span[@class="lzl_content_main"]/text()[1]')
but I got
"Message: invalid selector: The result of the xpath expression
".//span[@class="lzl_content_main"]/text()[1]" is: [object Text]. It
should be an element".
Is there a way to get "text3"?
I should make it clearer:
The above HTML is part of a bigger structure, and I selected it with the following Python code:
for sel in driver.find_elements_by_css_selector('div.l_post.j_l_post.l_post_bright'):
    for i in sel.find_elements_by_xpath('.//div[@class="lzl_cnt"]'):
        #user1 = i.find_element_by_xpath('.//a[@class="at j_user_card "]').text
        try: user2 = i.find_element_by_xpath('.//span[@class="lzl_content_main"]/a[@username]').text
        except: user2 = ""
        text3 = ???
        print(user2, text3)
In Selenium you cannot use an XPath that returns attributes or text nodes, so the /text() syntax is not allowed. If you want only a specific child text node instead of the complete text content (returned by the text property), you can execute JavaScript.
You can use the code below to get the required text node:
...
try: user2 = i.find_element_by_xpath('.//span[@class="lzl_content_main"]/a[@username]').text
except: user2 = ""
span = i.find_element_by_xpath('.//span[@class="lzl_content_main"]')
reply = driver.execute_script('return arguments[0].lastChild.textContent;', span)
You might also need to do reply = reply.strip() to get rid of surrounding whitespace.
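If lastChild turns out to be a whitespace-only node, or the position of the wanted text varies, one option is to return all direct text nodes and pick one in Python. This is only a sketch: direct_text_nodes is a hypothetical helper name, and driver is assumed to be any object that offers Selenium's execute_script method.

```python
# JavaScript that returns the values of the element's direct text nodes.
TEXT_NODES_SCRIPT = """
var out = [];
var nodes = arguments[0].childNodes;
for (var i = 0; i < nodes.length; i++) {
    if (nodes[i].nodeType === Node.TEXT_NODE) {
        out.push(nodes[i].nodeValue);
    }
}
return out;
"""


def direct_text_nodes(driver, element):
    """Return the stripped, non-empty direct text nodes of element."""
    raw = driver.execute_script(TEXT_NODES_SCRIPT, element)
    return [text.strip() for text in raw if text.strip()]
```

For the span in the question, direct_text_nodes(driver, span)[-1] would then be expected to give text3, assuming the page matches the HTML shown.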
Yes:
//div[@class='lzl_cnt']
Then use .text on that element.
Except your span isn't closed, so I am assuming it closes before the div.
Here is a solution for you:
List<WebElement> list = driver.findElements(By.tagName("span"));
for(WebElement el : list){
    String desiredText = el.getAttribute("innerHTML");
    if(desiredText.equalsIgnoreCase("text3")){
        System.out.println("desired text found");
        break;
    }
}
Please use the above code and let me know your feedback.

Selenium - filter by font

I want to retrieve all text from "p" elements that match a particular font.
<p>
Hello there <i> mate </i> !
</p>
So, here, I want only "Hello there !" and not "mate". I already know the font (the whole css property) of "Hello there".
My current code is:
for elem in br.find_elements_by_tag_name('p'):
    if elem.value_of_css_property('font') == stored_font:
        snippets.append(elem.text)
but this also gives me all the italics. How can I recurse over all the children of the "p" and get only the text that matches my stored_font?
Using set.difference() seems appropriate here, assuming that your elements are unique:
p_tags = set(br.find_elements_by_tag_name('p'))
i_tags = set(br.find_elements_by_tag_name('i'))
p_tags_without_i_tags = p_tags.difference(i_tags)
for elem in br.find_elements_by_tag_name('p'):
    if elem.value_of_css_property('font') == stored_font:
        snippets.append(elem.text)
We can get the text content of the text nodes with JavaScript:
script = 'return arguments[0].firstChild.textContent + arguments[0].lastChild.textContent;'
for elem in br.find_elements_by_tag_name('p'):
    if elem.value_of_css_property('font') == stored_font:
        print(br.execute_script(script, elem))
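The firstChild/lastChild script only covers the exact shape of the example. A more general sketch is to walk every text node under the element in JavaScript and keep those whose parent has the wanted computed font. The helper name text_matching_font is hypothetical, and whether the browser's computed 'font' shorthand compares equal to the stored value is an assumption that should be checked against the actual browser.

```python
# JavaScript: collect text nodes whose parent element has the wanted font.
FONT_TEXT_SCRIPT = """
var root = arguments[0], wanted = arguments[1], out = [];
var walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
var node;
while ((node = walker.nextNode())) {
    var font = window.getComputedStyle(node.parentNode).getPropertyValue('font');
    if (font === wanted) {
        out.push(node.nodeValue);
    }
}
return out;
"""


def text_matching_font(driver, element, wanted_font):
    """Join the text pieces under element whose parent uses wanted_font."""
    pieces = driver.execute_script(FONT_TEXT_SCRIPT, element, wanted_font)
    return " ".join(piece.strip() for piece in pieces if piece.strip())
```

In the loop from the question this would be called as snippets.append(text_matching_font(br, elem, stored_font)).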
I have doubts about writing a filtering solution like this for test purposes.
I would either assert on the whole text including 'mate' or use text.contains("Hello there").

Parse text divided by <br> but not inside <span>

I can't figure out how to parse this type of data:
<div id="tabs-1" class="ui-tabs-panel ui-widget-content ui-corner-bottom">
<strong><span itemprop="name">MOS-SCHAUM</span></strong><br>
<span itemprop="description">Antistatická pena čierna na IO 300x300x6mm</span>
<br>RoHS: Áno
<br>Obj.číslo: 13291<br>
</div>
There can be many <span> tags inside the snippet - I don't want to get them. I want only the text which is not inside <span> tags.
So the result would be:
{'RoHS':'Áno',
'Obj.číslo': '13291'}
I was considering .contents, but it's very unpredictable which elements will be at which index.
Do you know how to do that?
EDIT:
Even if I try this:
detail_table = soup.find('div',id="tabs-1")
itemprops = detail_table.find_all('span',itemprop=re.compile('.+'))
for item in itemprops:
    data[item['itemprop']]=item
contents = detail_table.contents[-1].contents[-1].contents[-1].contents
for i,c in enumerate(contents):
    print c
    print '---'
I get this:
RoHS: Áno
# 1st element
---
<br>Obj.číslo: 68664<br>
</br></br> # 2nd element
---
EDIT2: I've just found one solution, but it's not very nice. There must be a more elegant one:
def get_data(url):
    data = {}
    soup = get_soup(url)
    """ TECHNICAL INFORMATION """
    tech_par_table = soup.find('div',id="tabs-2")
    trs = tech_par_table.find_all('tr')
    for tr in trs:
        tds = tr.find_all('td')
        parameter = tds[0].text
        value = tds[1].text
        data[parameter]=value
    """ DETAIL """
    detail_table = soup.find('div',id="tabs-1")
    itemprops = detail_table.find_all('span',itemprop=re.compile('.+'))
    for item in itemprops:
        data[item['itemprop'].replace('\n','').replace('\t','').strip()]=item.text
    contents = detail_table.contents[-1].contents[-1].contents[-1].contents
    for i,c in enumerate(contents):
        if isinstance(c,bs4.element.NavigableString):
            splitted = c.split(':')
            data[splitted[0]]=splitted[1].replace('\n','').replace('\t','').strip()
        if isinstance(c,bs4.element.Tag):
            splitted = c.text.split(':')
            data[splitted[0]]=splitted[1].replace('\n','').replace('\t','').strip()
First you need to get all the br tags and use the .next_element attribute to get whatever was parsed immediately after each br tag; here, that is your text.
d = {}
for br in soup.find_all('br'):
    text = br.next_element.strip()
    if text:
        arr = text.split(':')
        d[arr[0]] = arr[1].strip()
print(d)
yields:
{'Obj.číslo': '13291', 'RoHS': 'Áno'}
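One fragile spot in the loop above is text.split(':'): a value that itself contains a colon would be cut off, and a line without a colon would raise an IndexError. A minimal sketch of a more forgiving key/value step, using str.partition, is shown here. The function name parse_pairs is made up for the illustration, and its input is assumed to be the strings obtained from br.next_element.

```python
def parse_pairs(texts):
    """Turn 'key: value' strings into a dict, skipping anything else."""
    pairs = {}
    for text in texts:
        # partition splits on the first colon only, so values may contain ':'.
        key, sep, value = text.strip().partition(":")
        if sep and key:
            pairs[key.strip()] = value.strip()
    return pairs


# Strings as they might follow each <br> in the snippet from the question.
print(parse_pairs(["\n", "RoHS: Áno\n", "Obj.číslo: 13291", "\n"]))
```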

Python, lxml and removing outer tag from using lxml.html.tostring(el)

I am using the below to get all of the html content of a section to save to a database
el = doc.get_element_by_id('productDescription')
lxml.html.tostring(el)
The product description has a tag that looks like this:
<div id='productDescription'>
<THE HTML CODE I WANT>
</div>
The code works great and gives me all of the HTML code, but how do I remove the outer layer, i.e. the <div id='productDescription'> and the closing tag </div>?
You could convert each child to a string individually:
text = el.text or ''
text += ''.join(lxml.html.tostring(child, encoding='unicode') for child in el.iterchildren())
Or, in an even more hackish way:
el.attrib.clear()
el.tag = '|||'
text = lxml.html.tostring(el)
assert text.startswith('<'+el.tag+'>') and text.endswith('</'+el.tag+'>')
text = text[len('<'+el.tag+'>'):-len('</'+el.tag+'>')]
If your productDescription div contains mixed text/element content, e.g.
<div id='productDescription'>
the
<b> html code </b>
i want
</div>
you can get the content (as a string) using an xpath('node()') traversal:
s = ''
for node in el.xpath('node()'):
    if isinstance(node, str):
        s += node
    else:
        s += lxml.html.tostring(node, with_tail=False, encoding='unicode')
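For comparison, the same "leading text plus each serialized child" idea can be sketched with the standard library's xml.etree.ElementTree instead of lxml. This is a swapped-in technique, not the lxml API: it only handles well-formed XML, and inner_xml is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def inner_xml(el):
    """Serialize everything inside el, without el's own start and end tags."""
    # ET.tostring includes each child's tail text, so mixed content survives.
    children = "".join(ET.tostring(child, encoding="unicode") for child in el)
    return (el.text or "") + children


el = ET.fromstring("<div id='productDescription'>the <b>html code</b> i want</div>")
print(inner_xml(el))  # the <b>html code</b> i want
```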
Here is a function that does what you want.
def strip_outer(xml):
    """
    >>> xml = '''<math xmlns="http://www.w3.org/1998/Math/MathML" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1998/Math/MathML http://www.w3.org/Math/XMLSchema/mathml2/mathml2.xsd">
    ... <mrow>
    ... <msup>
    ... <mi>x</mi>
    ... <mn>2</mn>
    ... </msup>
    ... <mo> + </mo>
    ... <mi>x</mi>
    ... </mrow>
    ... </math>'''
    >>> so = strip_outer(xml)
    >>> so.splitlines()[0]=='<mrow>'
    True
    """
    xml = xml.replace('xmlns=','xmlns:x=')#lxml fails with xmlns= attribute
    xml = '<root>\n'+xml+'\n</root>'#...and it can't strip the root element
    rx = lxml.etree.XML(xml)
    lxml.etree.strip_tags(rx,'math')#strip <math with all attributes
    uc=lxml.etree.tounicode(rx)
    uc=u'\n'.join(uc.splitlines()[1:-1])#remove temporary <root> again
    return uc.strip()
Use a regular expression.
def strip_outer_tag(html_fragment):
    import re
    outer_tag = re.compile(r'^<[^>]+>(.*?)</[^>]+>$', re.DOTALL)
    return outer_tag.search(html_fragment).group(1)
html_fragment = strip_outer_tag(tostring(el, encoding='unicode')) # `encoding` is optional

How can I strip comment tags from HTML using BeautifulSoup?

I have been playing with BeautifulSoup, which is great. My end goal is to try and just get the text from a page. I am just trying to get the text from the body, with a special case to get the title and/or alt attributes from <a> or <img> tags.
So far I have this EDITED & UPDATED CURRENT CODE:
soup = BeautifulSoup(page)
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
page = ''.join(soup.findAll(text=True))
page = ' '.join(page.split())
print page
1) What do you suggest the best way for my special case to NOT exclude those attributes from the two tags I listed above? If it is too complex to do this, it isn't as important as doing #2.
2) I would like to strip <!-- --> tags and everything in between them. How would I go about that?
QUESTION EDIT @jathanism: Here are some comment tags that I have tried to strip, but they remain, even when I use your example
<!-- Begin function popUp(URL) { day = new Date(); id = day.getTime(); eval("page" + id + " = window.open(URL, '" + id + "', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=300,height=330,left = 774,top = 518');"); } // End -->
<!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var whichlink=0 var whichimage=0 var blenddelay=(ie)? document.images.slide.filters[0].duration*1000 : 0 function slideit(){ if (!document.images) return if (ie) document.images.slide.filters[0].apply() document.images.slide.src=imageholder[whichimage].src if (ie) document.images.slide.filters[0].play() whichlink=whichimage whichimage=(whichimage<slideimages.length-1)? whichimage+1 : 0 setTimeout("slideit()",slidespeed+blenddelay) } slideit() //-->
Straight from the documentation for BeautifulSoup, you can easily strip comments (or anything) using extract():
from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
<a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>
I am still trying to figure out why it doesn't find and strip tags like this: <!-- //-->. Those backslashes cause certain tags to be overlooked.
This may be a problem with the underlying SGML parser: see http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps. You can override it by using a markupMassage regex -- straight from the docs:
import re, copy
myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
myNewMassage.extend(myMassage)
BeautifulSoup(badString, markupMassage=myNewMassage)
# Foo<!--This comment is malformed.-->Bar<br />Baz
If you are looking for a solution in BeautifulSoup version 3 (BS3 Docs - Comment):
import re
soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""")
comment = soup.find(text=re.compile("nice"))
Comment = comment.__class__
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
print soup.prettify()
If mutation isn't your bag, you can filter instead:
[t for t in soup.find_all(text=True) if not isinstance(t, Comment)]
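As a point of comparison, here is a minimal sketch that uses only the standard library's html.parser rather than BeautifulSoup. Comments are dropped simply because handle_comment is never overridden, and the title/alt values of <a> and <img> tags are kept, as asked in point 1 of the question. The names TextOnly and visible_text are made up for this illustration; note that script and style contents would still be reported as text by this sketch.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text and selected attribute values; comments are ignored."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Keep title/alt of <a> and <img>, padded so words do not run together.
        if tag in ("a", "img"):
            for name, value in attrs:
                if name in ("title", "alt") and value:
                    self.chunks.append(" " + value + " ")

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the page text with comments removed and whitespace collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


print(visible_text('<p>Hi<!-- hidden --> <img alt="logo"> there</p>'))
```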
