I have a source file like this:
<div class="l_post j_l_post l_post_bright " ...>
<div class="lzl_cnt">
...
<span class="lzl_content_main">
text1
<a class="at j_user_card" username="...">
username
</a>
text3
</span>
</div>
...
</div>
And I want to get text3. Currently I tried this (I am at <div class="lzl_cnt">):
driver.find_element(By.XPATH,'.//span[@class="lzl_content_main"]/text()[1]')
but I got
"Message: invalid selector: The result of the xpath expression
".//span[#class="lzl_content_main"]/text()[1]" is: [object Text]. It
should be an element".
Is there a way to get the "text3"?
I should make it clearer:
The above HTML is part of the bigger structure, and I selected it out with the following python code:
for sel in driver.find_elements_by_css_selector('div.l_post.j_l_post.l_post_bright'):
    for i in sel.find_elements_by_xpath('.//div[@class="lzl_cnt"]'):
        # user1 = i.find_element_by_xpath('.//a[@class="at j_user_card "]').text
        try: user2 = i.find_element_by_xpath('.//span[@class="lzl_content_main"]/a[@username]').text
        except: user2 = ""
        text3 = ???
        print(user2, text3)
In Selenium you cannot use an XPath expression that returns an attribute or a text node, so the /text() syntax is not allowed. If you want to get a specific child text node only, instead of the complete text content (returned by the text property), you might execute JavaScript.
You can apply below code to get required text node:
...
try: user2 = i.find_element_by_xpath('.//span[@class="lzl_content_main"]/a[@username]').text
except: user2 = ""
span = i.find_element_by_xpath('.//span[@class="lzl_content_main"]')
reply = driver.execute_script('return arguments[0].lastChild.textContent;', span)
You might also need to do reply = reply.strip() to get rid of leading/trailing whitespace.
Yes:
//div[@class='lzl_cnt']
And then you should use .text on that element.
Except your span isn't closed, so I'm assuming it closes before the div.
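Note that .text on that div returns the concatenated text of all its descendants, so to isolate text3 you still need to strip out the nested pieces. A rough sketch (assuming the same i element from the question's loop and the class names shown in the HTML):
span = i.find_element_by_xpath('.//span[@class="lzl_content_main"]')
link = span.find_element_by_xpath('./a[@class="at j_user_card"]')
# span.text is roughly "text1 username text3"; splitting on the anchor's
# text and keeping what follows leaves text3
text3 = span.text.split(link.text, 1)[-1].strip()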
Here is a solution for you.
List<WebElement> list = driver.findElements(By.tagName("span"));
for (WebElement el : list) {
    String desiredText = el.getAttribute("innerHTML");
    if (desiredText.equalsIgnoreCase("text3")) {
        System.out.println("desired text found");
        break;
    }
}
Please use the above code and let me know your feedback.
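Since the rest of this question uses the Python bindings, a rough Python translation of the same idea (an untested sketch; it assumes "text3" is the span's entire content, which may not hold for the HTML above):
spans = driver.find_elements_by_tag_name("span")
for el in spans:
    desired_text = el.get_attribute("innerHTML")
    if desired_text.strip().lower() == "text3":
        print("desired text found")
        break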
Related
I have this sample-html:
<div class="classname1">
"This is text inside of"
<b>"a subtag"</b>
"I would like to select."
<br>
"More text I don't need"
</br>
(more br and b tags on the same level)
</div>
The result should be a list containing:
["This is text inside of a subtag I would like to select."]
I tried:
response.xpath('//div[#class="classname1"]//text()[1]').getall()
but this gives me only the first part "This is text inside".
There are two challenges:
Sometimes there is no b tag
There is even more text after the desired section that should be excluded
Maybe a loop?
If anyone has an approach it would be really helpful.
What about this (using "More text I don't need" as a stop word):
parts = []
for text in response.xpath('//div[@class="classname1"]//text()').getall():
    if "More text I don't need" in text:
        break
    parts.append(text)
result = ' '.join(parts)
UPDATE: For example, if you need to extract all the text before "Ort: ":
def parse(self, response):
    for card_node in response.xpath('//div[@class="col-md-8 col-sm-12 card-place-container"]'):
        parts = []
        for text in card_node.xpath('.//text()').getall():
            if 'Ort: ' in text:
                break
            parts.append(text)
        before_ort = '\n'.join(parts)
        print(before_ort)
Use the descendant-or-self XPath axis in combination with a position predicate, as below:
response.xpath('//div[@class="classname1"]/descendant-or-self::*/text()[position() < 3]').getall()
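As a rough follow-up sketch (assuming the sample HTML above sits in a Scrapy response), you can then clean and join the selected text nodes into the single string you asked for:
parts = response.xpath(
    '//div[@class="classname1"]/descendant-or-self::*/text()[position() < 3]'
).getall()
# strip the surrounding newlines/whitespace and join into one sentence
result = ' '.join(p.strip() for p in parts if p.strip())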
I am trying to scrape an imageId out of a button tag, and I want to end up with the result:
"25511e1fd64e99acd991a22d6c2d6b6c".
When I try:
drawing_url = drawing_url.find_all('button', class_='inspectBut')['onclick']
it doesn't work, giving an error:
TypeError: list indices must be integers or slices, not str
Input =
for article in soup.find_all('div', class_='dojoxGridRow'):
    drawing_url = article.find('td', class_='dojoxGridCell', idx='3')
    drawing_url = drawing_url.find_all('button', class_='inspectBut')
    if drawing_url:
        for e in drawing_url:
            print(e)
Output =
<button class="inspectBut" href="#"
onclick="window.open('getImg?imageId=25511e1fd64e99acd991a22d6c2d6b6c&
timestamp=1552011572288','_blank', 'toolbar=0,
menubar=0, modal=yes, scrollbars=1, resizable=1,
height='+$(window).height()+', width='+$(window).width())"
title="Open Image" type="button">
</button>
...
...
Try this one.
import re

# collect the onclick value of every button that has one
btn_onclick_list = [a.get('onclick') for a in soup.find_all('button') if a.get('onclick')]

for click in btn_onclick_list:
    image_id = re.findall(r"imageId=(\w+)", click)[0]
    print(image_id)
You first need to check whether the attribute is present or not.
tag.attrs returns a dictionary of the attributes present in the current tag.
Consider the following code.
Code:
from bs4 import BeautifulSoup
a="""
<td>
<button class='hi' onclick="This Data">
<button class='hi' onclick="This Second">
</td>"""
soup = BeautifulSoup(a,'lxml')
print([btn['onclick'] for btn in soup.find_all('button',class_='hi') if 'onclick' in btn.attrs])
Output:
['This Data','This Second']
or you can simply do this
[btn['onclick'] for btn in soup.find_all('button', attrs={'class' : 'hi', 'onclick' : True})]
You should be searching for
button_list = soup.find_all('button', {'class': 'inspectBut'})
That will give you the list of buttons, and you can then read the url field from each one with
[button['onclick'] for button in button_list]
You will still need to do some parsing of the onclick value, but I hope this can get you on the right track.
Your mistake was indexing the list returned by find_all with a string; find_all returns a list of elements, so you have to pick a single element (or loop over them) before you can read its attributes.
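For that parsing step, one rough sketch (assuming the onclick value looks like the window.open(...) call shown in the question's output) is to pull the URL out of the onclick string and read its imageId query parameter:
from urllib.parse import urlparse, parse_qs

for button in button_list:
    onclick = button.get('onclick', '')
    if 'imageId=' not in onclick:
        continue
    # onclick looks like: window.open('getImg?imageId=...&timestamp=...', '_blank', ...)
    url = onclick.split("'")[1]
    image_id = parse_qs(urlparse(url).query)['imageId'][0]
    print(image_id)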
I want to retrieve all text from "p" elements that match a particular font.
<p>
Hello there <i> mate </i> !
</p>
So, here, I want only "Hello there !" and not "mate". I already know the font (the whole css property) of "Hello there".
My current code is:
for elem in br.find_elements_by_tag_name('p'):
    if elem.value_of_css_property('font') == stored_font:
        snippets.append(elem.text)
but this also gives me all the italics. How can I recurse on all the children of the "p" and only get the text that matched my stored_font ?
Using set.difference() seems appropriate here, assuming that your elements are unique:
p_tags = set(br.find_elements_by_tag_name('p'))
i_tags = set(br.find_elements_by_tag_name('i'))
p_tags_without_i_tags = p_tags.difference(i_tags)

for elem in p_tags_without_i_tags:
    if elem.value_of_css_property('font') == stored_font:
        snippets.append(elem.text)
We can get the text content of a text node via JavaScript:
script = 'return arguments[0].firstChild.textContent + arguments[0].lastChild.textContent;'

for elem in br.find_elements_by_tag_name('p'):
    if elem.value_of_css_property('font') == stored_font:
        print driver.execute_script(script, elem)
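If the paragraph can have more than two child nodes, a slightly more general sketch (same assumptions: br is your webdriver and stored_font is the font you are matching) is to collect every direct text-node child in JavaScript:
script = '''
var parts = [];
for (var node of arguments[0].childNodes) {
    if (node.nodeType === Node.TEXT_NODE) {
        parts.push(node.textContent);
    }
}
return parts.join(' ');
'''
for elem in br.find_elements_by_tag_name('p'):
    if elem.value_of_css_property('font') == stored_font:
        snippets.append(br.execute_script(script, elem).strip())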
I have doubts about writing a solution that filters this out just for test needs.
I would either assert the full text including 'mate', or use text.contains("Hello there").
I have the following html. I'm trying to get the following values saved as variables: Available Now, 7, 148.49, Hatchback, Good. The problem I'm running into is that I can't pull them out independently, since they don't have a class attached to them. I'm wondering how to solve this. Below is the html, then my futile code to solve this.
</div>
<div class="car-profile-info">
<div class="col-md-12 no-padding">
<div class="col-md-6 no-padding">
<strong>Status:</strong> <span class="statusAvail"> Available Now </span><br/>
<strong>Min. Booking </strong>7 Days ($148.89)<br/>
<strong>Style: </strong>Hatchback<br/>
<strong>Transmission: </strong>Automatic<br/>
<strong>Condition: </strong>Good<br/>
</div>
Python 2.7 code - this gives me the entire html!
soup = BeautifulSoup(html)
print soup.find("span", {"class": "statusAvail"}).getText()
for i in soup.select("strong"):
    if i.getText() == "Min. Booking ":
        print i.parent.getText().replace("Min. Booking ", "")
Find all the strong elements under the div element with class="car-profile-info" and, for each element found, get the .next_siblings until you meet the br element:
from bs4 import BeautifulSoup, Tag

for strong in soup.select(".car-profile-info strong"):
    label = strong.get_text()
    value = ""
    for elm in strong.next_siblings:
        if getattr(elm, "name") == "br":
            break
        if isinstance(elm, Tag):
            value += elm.get_text(strip=True)
        else:
            value += elm.strip()
    print(label, value)
You can use ".next_sibling" to navigate to the text you want like this:
for i in soup.select("strong"):
    if i.get_text(strip=True) == "Min. Booking":
        print(i.next_sibling)  # this will print: 7 Days ($148.89)
See also http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-sideways
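If you also want the numbers themselves as separate variables, a rough sketch (assuming the sibling text looks exactly like "7 Days ($148.89)", as in the HTML above) is to pull them out with a regular expression:
import re

for i in soup.select("strong"):
    if i.get_text(strip=True) == "Min. Booking":
        booking = i.next_sibling  # "7 Days ($148.89)"
        match = re.search(r"(\d+)\s*Days\s*\(\$([\d.]+)\)", booking)
        if match:
            days = int(match.group(1))     # 7
            price = float(match.group(2))  # 148.89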
I have an html file which looks like:
...
<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>But do not </strong>
<strong>touch this</strong>
<em>Maybe some other tags as well.</em>
bla bla blah...
</p>
...
What I need is, if all the tags in a 'p' block are 'strong', then combine them into one line, i.e.
<p>
<strong>This is a line which I want to join.</strong>
</p>
Without touching the other block since it contains something else.
Any suggestions? I am using lxml.
UPDATE:
So far I tried:
for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # no text before first element
        children = p.getchildren()
        for child in children:
            if len(children) == 1 or child.tag != 'strong' or child.tail is not None:
                break
        else:
            etree.strip_tags(p, 'strong')
With these code I was able to strip off the strong tag in the part desired, giving:
<p>
This is a line which I want to join.
</p>
So now I just need a way to put the tag back in...
I was able to do this with bs4 (BeautifulSoup):
from bs4 import BeautifulSoup as bs

html = """<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p>"""

soup = bs(html)
s = ''
# note that I use the 0th <p> block ...[0],
# so make the appropriate change in your code
for t in soup.find_all('p')[0].text:
    s = s + t.strip('\n')
s = '<p><strong>' + s + '</strong></p>'
print s  # prints: <p><strong>This is a line which I want to join.</strong></p>
Then use replace_with():
p_tag = soup.p
p_tag.replace_with(bs(s, 'html.parser'))
print soup
prints:
<html><body><p><strong>This is a line which I want to join.</strong></p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p></body></html>
I have managed to solve my own problem.
for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # some conditions specific to my doc
        children = p.getchildren()
        if len(children) > 1:
            for child in children:
                # if other stuff is present, break
                if child.tag != 'strong' or child.tail is not None:
                    break
            else:
                # If we did not break, we found a p block to fix:
                # get rid of everything inside p and put a SubElement in
                etree.strip_tags(p, 'strong')
                tmp_text = p.text_content()
                p.clear()
                subtext = etree.SubElement(p, "strong")
                subtext.text = tmp_text
Special thanks to @Scott, who helped me arrive at this solution. Although I cannot mark his answer as correct, I have no less appreciation for his guidance.
Alternatively, you can use a more specific XPath to get the targeted p elements directly:
p_target = """
//p[strong]
   [not(*[not(self::strong)])]
   [not(text()[normalize-space()])]
"""

for p in self.tree.xpath(p_target):
    # logic inside the loop can also be the same as your `else` block
    content = p.xpath("normalize-space()")
    p.clear()
    strong = etree.SubElement(p, "strong")
    strong.text = content
A brief explanation of the XPath being used:
//p[strong] : find p elements, anywhere in the XML/HTML document, having a child element strong...
[not(*[not(self::strong)])] : ...and having no child element other than strong...
[not(text()[normalize-space()])] : ...and having no non-empty text node child.
normalize-space() : get all text nodes from the current context element, concatenated, with consecutive whitespace normalized to a single space.
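A short end-to-end sketch of the loop above (the two p blocks are condensed from the question's sample, and parsing with lxml.html is an assumption):
from lxml import etree, html

doc = html.fromstring(
    "<body>"
    "<p><strong>This is </strong><strong>a lin</strong>"
    "<strong>e which I want to </strong><strong>join.</strong></p>"
    "<p>2.<strong>But do not </strong><strong>touch this</strong>"
    "<em>Maybe some other tags as well.</em>bla bla blah...</p>"
    "</body>")

p_target = "//p[strong][not(*[not(self::strong)])][not(text()[normalize-space()])]"

for p in doc.xpath(p_target):
    content = p.xpath("normalize-space()")
    p.clear()
    strong = etree.SubElement(p, "strong")
    strong.text = content

# only the first <p> matches and is collapsed into a single <strong>
print(html.tostring(doc, pretty_print=True).decode())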