Selenium - filter by font - python

I want to retrieve all text from "p" elements that match a particular font.
<p>
Hello there <i> mate </i> !
</p>
So, here, I want only "Hello there !" and not "mate". I already know the font (the whole css property) of "Hello there".
My current code is:
for elem in br.find_elements_by_tag_name('p'):
    if elem.value_of_css_property('font') == stored_font:
        snippets.append(elem.text)
but this also gives me all the italics. How can I recurse over all the children of the "p" and get only the text that matches my stored_font?

Using set.difference() seems appropriate here, assuming that your elements are unique:
p_tags = set(br.find_elements_by_tag_name('p'))
i_tags = set(br.find_elements_by_tag_name('i'))
p_tags_without_i_tags = p_tags.difference(i_tags)
for elem in p_tags_without_i_tags:
    if elem.value_of_css_property('font') == stored_font:
        snippets.append(elem.text)

We can get the text content of the text nodes with JavaScript:
script = 'return arguments[0].firstChild.textContent + arguments[0].lastChild.textContent;'
for elem in br.find_elements_by_tag_name('p'):
    if elem.value_of_css_property('font') == stored_font:
        print(br.execute_script(script, elem))
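The firstChild/lastChild script above only covers the exact "text, tag, text" layout. A more general sketch, under assumptions, walks every text node below the element in the browser and keeps those whose parent element has the wanted computed font. The names `TEXT_BY_FONT_SCRIPT` and `text_with_font` are hypothetical, and the sketch assumes that `stored_font` was read in the same browser, so that the computed `font` shorthand compares equal as a string (some browsers report an empty shorthand, in which case individual properties such as `font-style` would have to be compared instead).

```python
# JavaScript run in the browser: visit every text node under the element and
# keep the ones whose parent element has the wanted computed font.
TEXT_BY_FONT_SCRIPT = """
var root = arguments[0], wanted = arguments[1], parts = [];
var walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
while (walker.nextNode()) {
    var node = walker.currentNode;
    if (window.getComputedStyle(node.parentElement).font === wanted) {
        parts.push(node.nodeValue);
    }
}
return parts.join('');
"""


def text_with_font(driver, element, font):
    """Return the text under `element` rendered with `font` (hypothetical helper).

    `driver` is any object with a Selenium-style execute_script method.
    """
    raw = driver.execute_script(TEXT_BY_FONT_SCRIPT, element, font)
    # Collapse runs of whitespace left behind by the skipped child elements.
    return " ".join(raw.split())
```

With a real driver this would be called as `text_with_font(br, elem, stored_font)` inside the loop over the "p" elements.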

I have doubts about writing a solution that filters this text for testing needs.
I would either assert on the whole text, including 'mate', or check that the text contains "Hello there".

Related

How to select multiple text parts with Scrapy inside of a tag with subtags?

I have this sample-html:
<div class="classname1">
"This is text inside of"
<b>"a subtag"</b>
"I would like to select."
<br>
"More text I don't need"
</br>
(more br and b tags on the same level)
</div>
The result should be a list containing:
["This is text inside of a subtag I would like to select."]
I tried:
response.xpath('//div[@class="classname1"]//text()[1]').getall()
but this gives me only the first part "This is text inside".
There are two challenges:
Sometimes there is no b tag
There is even more text after the desired section that should be excluded
Maybe a loop?
If anyone has an approach it would be really helpful.
What about this (using "More text I don't need" as a stopword):
parts = []
for text in response.xpath('//div[@class="classname1"]//text()').getall():
    if "More text I don't need" in text:
        break
    parts.append(text)
result = ' '.join(parts)
UPDATE: For example, if you need to extract all text before "Ort: ":
def parse(self, response):
    for card_node in response.xpath('//div[@class="col-md-8 col-sm-12 card-place-container"]'):
        parts = []
        for text in card_node.xpath('.//text()').getall():
            if 'Ort: ' in text:
                break
            parts.append(text)
        before_ort = '\n'.join(parts)
        print(before_ort)
Use the descendant-or-self XPath axis in combination with a position predicate, as below:
response.xpath('//div[@class="classname1"]/descendant-or-self::*/text()[position() <3]').getall()
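The stopword idea from the answers above can be tried without Scrapy. The following is a minimal sketch that swaps in the standard library's `xml.etree.ElementTree` for the Scrapy selector, which only works because this sample happens to be well-formed XML. The sample is written without the quotation marks shown in the question (an assumption that they are browser-inspector decoration), and `text_before` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Sample from the question, assumed well-formed so ElementTree can parse it.
SAMPLE = """<div class="classname1">
This is text inside of
<b>a subtag</b>
I would like to select.
<br>
More text I don't need
</br>
</div>"""


def text_before(root, stopword):
    """Join all text nodes under `root`, in document order, up to the stopword."""
    parts = []
    for text in root.itertext():
        if stopword in text:
            break
        cleaned = " ".join(text.split())  # normalise newlines and spaces
        if cleaned:
            parts.append(cleaned)
    return " ".join(parts)


root = ET.fromstring(SAMPLE)
result = text_before(root, "More text I don't need")
print([result])
```

Because the loop walks text nodes rather than tags, it behaves the same whether or not the `b` tag is present.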

Get the full text of content intersected by a tag with Beautiful Soup

Say I have some HTML like the following.
<p>This is the beginning of the text. <em>Italicized middle</em> This is the end of the text.</p>
It's a tag with another tag inside. I can use Beautiful Soup to get the contents of it:
list_of_tags = full_html.findAll()
for tag in list_of_tags:
    print(tag.find(text = True))
That prints:
This is the beginning of the text.
Italicized middle
It cuts off the end part, that is, everything after the contained tag. How can I find that part?
Thanks to ggorlen's help, I modified my program to work a little differently. It first modifies the contents of an <em> tag so they are italicized in markup (I decided this was a good way to distinguish it for my purposes).
for tag in tag_list:
    if tag.name == "em":
        tag.string.replace_with("*" + tag.string + "*")
    if tag.name == "strong":
        tag.string.replace_with("**" + tag.string + "**")
Then, in a separate loop, I got the text of everything that wasn't a tag I had modified above (otherwise it would be recursive), then added its .text to a list.
for tag in tag_list:
    if tag.name == "strong" or tag.name == "em":
        continue
    else:
        my_list.append(tag.text)
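The two loops above can also be collapsed into a single pass. The sketch below is not the Beautiful Soup code from this thread: it swaps in the standard library's `html.parser` and emits the Markdown markers as the `em` and `strong` tags open and close, so the text after the inner tag is kept. The names `MarkedTextParser` and `to_marked_text` are hypothetical.

```python
from html.parser import HTMLParser


class MarkedTextParser(HTMLParser):
    """Collect all text, wrapping <em> in * and <strong> in **."""

    MARKERS = {"em": "*", "strong": "**"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tags without a marker contribute an empty string.
        self.parts.append(self.MARKERS.get(tag, ""))

    def handle_endtag(self, tag):
        self.parts.append(self.MARKERS.get(tag, ""))

    def handle_data(self, data):
        self.parts.append(data)


def to_marked_text(html):
    parser = MarkedTextParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


html = "<p>This is the beginning of the text. <em>Italicized middle</em> This is the end of the text.</p>"
print(to_marked_text(html))
```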
If you want to get all the data without separation, you can use the following methods.
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''<p>This is the beginning of the text. <em>Italicized middle</em> This is the end of the text.</p>
'''
doc = SimplifiedDoc(html)
print(doc.p.text)
Result:
This is the beginning of the text. Italicized middle This is the end of the text.

Selenium, Xpath, select a certain part of text within a node

I have a source file like this:
<div class="l_post j_l_post l_post_bright " ...>
<div class="lzl_cnt">
...
<span class="lzl_content_main">
text1
<a class="at j_user_card" username="...">
username
</a>
text3
</span>
</div>
...
</div>
I want to get text3. Currently, I tried this (I am at <div class="lzl_cnt">):
driver.find_element(By.XPATH,'.//span[@class="lzl_content_main"]/text()[1]')
but I got
"Message: invalid selector: The result of the xpath expression
".//span[@class="lzl_content_main"]/text()[1]" is: [object Text]. It
should be an element".
Is there a way to get "text3"?
I should make it clearer:
The above HTML is part of the bigger structure, and I selected it out with the following python code:
for sel in driver.find_elements_by_css_selector('div.l_post.j_l_post.l_post_bright'):
    for i in sel.find_elements_by_xpath('.//div[@class="lzl_cnt"]'):
        #user1 = i.find_element_by_xpath('.//a[@class="at j_user_card "]').text
        try: user2 = i.find_element_by_xpath('.//span[@class="lzl_content_main"]/a[@username]').text
        except: user2 = ""
        text3 = ???
        print(user2, text3)
In Selenium you cannot use an XPath that returns attributes or text nodes, so the /text() syntax is not allowed. If you want only a specific child text node instead of the complete text content (returned by the text property), you can execute JavaScript.
You can apply the code below to get the required text node:
...
try: user2 = i.find_element_by_xpath('.//span[@class="lzl_content_main"]/a[@username]').text
except: user2 = ""
span = i.find_element_by_xpath('.//span[@class="lzl_content_main"]')
reply = driver.execute_script('return arguments[0].lastChild.textContent;', span)
You might also need to call reply = reply.strip() to get rid of leading and trailing whitespace.
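`lastChild` is whatever node comes last, so it can be an element or a whitespace-only text node if the markup changes slightly. As a hedged variation on the answer above, the script below scans the child nodes from the end and returns the last non-empty text node. `LAST_TEXT_SCRIPT` and `last_own_text` are hypothetical names; only `execute_script` is the Selenium call already used in this thread.

```python
# JavaScript run in the browser: look at the direct children from last to
# first and return the first text node that is not just whitespace.
LAST_TEXT_SCRIPT = """
var nodes = arguments[0].childNodes;
for (var i = nodes.length - 1; i >= 0; i--) {
    if (nodes[i].nodeType === Node.TEXT_NODE && nodes[i].nodeValue.trim()) {
        return nodes[i].nodeValue.trim();
    }
}
return '';
"""


def last_own_text(driver, element):
    """Return the last non-blank direct text node of `element` (hypothetical helper)."""
    return driver.execute_script(LAST_TEXT_SCRIPT, element)
```

In the loop from the question this would replace the placeholder as `text3 = last_own_text(driver, span)`, with `span` located as in the answer above.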
Yes:
//div[@class='lzl_cnt']
Then you should use .text on that element.
Note that your span isn't closed, so this assumes it closes before the div.
Here is a solution for you:
List<WebElement> list = driver.findElements(By.tagName("span"));
for (WebElement el : list) {
    String desiredText = el.getAttribute("innerHTML");
    if (desiredText.equalsIgnoreCase("text3")) {
        System.out.println("desired text found");
        break;
    }
}
Please use the above code and let me know your feedback.

Python and Selenium - get text excluding child node's text

Using Python 3.
Supposing:
<whatever>
text
<subchild>
other
</subchild>
</whatever>
If I do:
elem = driver.find_element_by_xpath("//whatever")
then elem.text contains "text other".
If I do:
elem = driver.find_element_by_xpath("//whatever/text()[normalize-space()]")
then elem is not a WebElement.
How may I proceed to grab only "text" (and not "other")?
That is: grab only the text in the node itself, not the text in its child nodes.
UPDATE:
Original HTML is:
<div class="border-ashes the-code text-center">
VIVEGRPN
<span class="cursor"></span>
<button class="btn btn-ashes zclip" data-clipboard-target=".the-code" data-coupon-code="VklWRUdSUE4=">
<span class="r">Hen, la.</span>
</div>
Bear in mind that the replacement approach mentioned by @Guy doesn't work for many structures.
For instance, having this structure:
<div>
Hello World
<b>e</b>
</div>
The parent text would be Hello World e, the child text would be e, and the replacement would result in Hllo World instead of Hello World.
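The failure is easy to reproduce with plain strings. This small sketch assumes the two `.text` values are the ones quoted above and involves no browser:

```python
parent_text = "Hello World e"  # what elem.text would return for the <div>
child_text = "e"               # what the <b> child's .text would return

# str.replace removes every occurrence of the child text,
# including the "e" inside "Hello".
naive = parent_text.replace(child_text, "").strip()
print(naive)
```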
A safe solution
To get the own text of an element in a safe manner, you have to iterate over the children of the node, and concat the text nodes. Since you can't do that in pure Selenium, you have to execute JS code.
OWN_TEXT_SCRIPT = "if(arguments[0].hasChildNodes()){var r='';var C=arguments[0].childNodes;for(var n=0;n<C.length;n++){if(C[n].nodeType==Node.TEXT_NODE){r+=' '+C[n].nodeValue}}return r.trim()}else{return arguments[0].innerText}"
parent_text = driver.execute_script(OWN_TEXT_SCRIPT, elem)
The script is a minified version of this simple function:
if (arguments[0].hasChildNodes()) {
    var res = '';
    var children = arguments[0].childNodes;
    for (var n = 0; n < children.length; n++) {
        if (children[n].nodeType == Node.TEXT_NODE) {
            res += ' ' + children[n].nodeValue;
        }
    }
    return res.trim()
}
else {
    return arguments[0].innerText
}
I had a similar problem recently, where Selenium always gave me all the text inside the element, including the spans. I ended up splitting the string on the newline character "\n", e.g.:
all_text = driver.find_element_by_xpath(xpath).text
req_text = all_text.split("\n")[0]
You can remove the child node's text from the full text:
all_text = driver.find_element_by_xpath("//whatever").text
child_text = driver.find_element_by_xpath("//subchild").text
parent_text = all_text.replace(child_text, '')
You can first extract the outerHTML of the element, then build the soup with BeautifulSoup and remove any element you want.
Small example:
from bs4 import BeautifulSoup
el = driver.find_element_by_css_selector('whatever')
outerHTML = el.get_attribute('outerHTML')
soup = BeautifulSoup(outerHTML, 'html.parser')
inner_elem = soup.select('subchild')[0].extract()
text_inner_elem = inner_elem.text
text_outer_elem = soup.text
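The outerHTML route also works without naming the child elements to remove. The sketch below uses the standard library's `html.parser` in place of BeautifulSoup and keeps only text that sits directly inside the outermost element. It assumes the HTML string is a browser-serialised outerHTML (so tags are properly nested); `OwnTextParser` and `own_text` are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class OwnTextParser(HTMLParser):
    """Collect text found at nesting depth 1, i.e. directly in the root element."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 1:
            self.parts.append(data)


def own_text(outer_html):
    parser = OwnTextParser()
    parser.feed(outer_html)
    parser.close()
    # Normalise the whitespace left where child elements used to be.
    return " ".join("".join(parser.parts).split())


print(own_text("<whatever>\ntext\n<subchild>\nother\n</subchild>\n</whatever>"))
```

With Selenium, the argument would be `el.get_attribute('outerHTML')` as in the example above.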

extracting data between span tags with BeautifulSoup Python

I would like to extract data between span tags. Here is a sample of html code:
<p>
<span class="html-italic">3-Acetyl-</span>
<span class="html-italic">(4-acetyl-5-(β</span>
"-"
<span class="html-italic">naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one</span>
"("
<b>5b</b>
</p>
I need to get a full name:
3-Acetyl-4-acetyl-5-(β-naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one (without 5b). I don't know how to extract '-' between the second and the third span tags. Also, a total number of span tags may vary and '-' can be between any span tags. The code I wrote gives me only: 3-Acetyl-4-acetyl-5-(β. Here is a part of my code:
p = soup.find("p")
name = ""
for child in p.children:
    if child.name == "span":
        name += child.text
print(name)
Any help is highly appreciated!
You could use CSS selectors.
>>> ''.join(i.text for i in soup.select('p > span'))
'3-Acetyl-(4-acetyl-5-(βnaphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one'
You can just do something like:
p = soup.find("p")
name = ""
for child in p.children:
    if child.name == "span":
        name += child.text
    elif child.name is None:
        name += child.string.strip("\"\n ")
print(name)
You can use BeautifulSoup's .findAll(text=True) to get all text inside the element, including outside the spans. This returns a list of text parts, which need to be stripped of whitespace and quotation marks. I'm not sure what rule you're using to exclude the last "("5b but maybe it's as easy as slicing the list:
import string
parts = soup.find("p").findAll(text=True)
name = ''.join(p.strip(string.whitespace + '"') for p in parts[:-3])
Result:
u'3-Acetyl-(4-acetyl-5-(β-naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one'
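Slicing off a fixed number of parts depends on the tail of the paragraph always looking the same. A variation, sketched here with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup (possible only because the sample is well-formed), stops at the first `b` tag instead. It assumes that the `b` tag always marks the end of the name and that a trailing "(" belongs to the label and should be dropped; `compound_name` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<p>
<span class="html-italic">3-Acetyl-</span>
<span class="html-italic">(4-acetyl-5-(β</span>
"-"
<span class="html-italic">naphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one</span>
"("
<b>5b</b>
</p>"""


def compound_name(p):
    """Join span text and the text between spans, stopping at the first <b>."""
    parts = [p.text or ""]
    for child in p:
        if child.tag == "b":
            break
        parts.append(child.text or "")  # text inside the span
        parts.append(child.tail or "")  # text between this tag and the next
    cleaned = (part.strip(' \t\r\n"') for part in parts)
    # Assumption: a trailing "(" opens the label (e.g. "5b") and is not part of the name.
    return "".join(cleaned).rstrip("(")


name = compound_name(ET.fromstring(SAMPLE))
print(name)
```

Because each span contributes both its own text and its tail, the "-" between spans is picked up wherever it occurs and however many spans there are.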
Try it like this:
name = ""
for x in soup.find('p'):
    try:
        if x.name == 'span':
            name += x.get_text()
    except:
        pass
print(name)
output:
3-Acetyl-(4-acetyl-5-(βnaphtyl)-4,5-dihydro-1,3,4-oxodiazol-2-yl)methoxy)-2H-chromen-2-one
If you like one-liners, you can do something like:
(your_item.find("p", {"attr": "value"})).find("span").get_text()
