I have to retrieve text from an HTML table. In the cells, the text is sometimes inside a <div> and sometimes not.
How can I make a div optional in an XPath?
My actual code:
stuff = tree.xpath("/html/body/table/tbody/tr/td[5]/div/text()")
Wanted pseudocode:
stuff = tree.xpath("/html/body/table/tbody/tr/td[5]/div or nothing/text()")
You want the string value of the td[5] element. Use string():
stuff = tree.xpath("string(/html/body/table/tbody/tr/td[5])")
This will return text without markup beneath td[5].
You can also obtain the string value of an element indirectly via normalize-space(), as suggested by splash58 in the comments, if you also want leading and trailing whitespace trimmed and interior runs of whitespace collapsed to a single space.
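As a rough illustration of why the string value makes the div optional, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies, and the sample table and the helper names string_value and normalized are made up for the example. Note that in XPath 1.0, string() applied to a node-set uses only the first node, so the sketch loops over the cells to handle every row.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the fifth cell of the first row wraps its text
# in a <div>, the fifth cell of the second row does not.
SAMPLE = """
<table>
  <tr><td>a</td><td>b</td><td>c</td><td>d</td><td><div>  in   a div </div></td></tr>
  <tr><td>a</td><td>b</td><td>c</td><td>d</td><td>plain text</td></tr>
</table>
""".strip()


def string_value(element):
    """Concatenate every descendant text node, similar to XPath string()."""
    return "".join(element.itertext())


def normalized(element):
    """Trim the ends and collapse whitespace runs, approximating normalize-space()."""
    return " ".join(string_value(element).split())


root = ET.fromstring(SAMPLE)
cells = root.findall("./tr/td[5]")  # fifth cell of every row

raw = [string_value(td) for td in cells]
clean = [normalized(td) for td in cells]
print(raw)
print(clean)
```

Both cells yield their text whether or not a div is present, because the string value ignores the markup between the cell and its text.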
I am trying to get text from an element and then search for it in another place. The problem is that when I get the text using .text, a run of two spaces comes back as one, so the search on the next page can't find it. Is there a way to get the text exactly as it is, spaces included?
Code:
self.name = session.find_element('xpath', './/a[contains(@href, "name")]').text
self.dataBrowser.find_elements(By.XPATH, f'//tr[child::td[child::div[child::strong[text()="{self.name}"]]]]')
The text property does not always hold the text, so try this:
from selenium.webdriver.common.by import By
element = browser.find_element(By.CSS_SELECTOR, 'CSS_EXPRESSION')
text = element.get_attribute('text')
# Fall back to the value attribute when text is None or empty.
if not text:
    text = element.get_attribute('value')
If text and value both return nothing, try reading the innerHTML attribute.
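The fallback chain described above can be wrapped in a small helper. This is only a sketch: read_text is a hypothetical name, and FakeElement is a made-up stand-in for a Selenium WebElement so the example runs without a browser. The only Selenium call it relies on is get_attribute(name).

```python
def read_text(element, attributes=("text", "value", "innerHTML")):
    """Return the first non-empty attribute value, or "" if all are empty."""
    for name in attributes:
        value = element.get_attribute(name)
        if value:  # skips both None and the empty string
            return value
    return ""


class FakeElement:
    """Hypothetical stand-in for a WebElement, for demonstration only."""

    def __init__(self, **attributes):
        self._attributes = attributes

    def get_attribute(self, name):
        return self._attributes.get(name)


# text and value are empty here, so the helper falls through to innerHTML.
link = FakeElement(text="", value=None, innerHTML="John  Smith")
print(repr(read_text(link)))
```

With a real driver, the element returned by find_element would be passed in place of FakeElement.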
I am scraping this webpage and while trying to extract text from one element, I am hitting a dead end.
So the element in question is shown below in the image -
The text in this element is within the <p> tags inside the <div>. I tried extracting the text in the scrapy shell using the following code - response.css("div.home-hero-blurb no-select::text").getall(). I received an empty list as the result.
Alternatively, if I try going a bit further and reference the <p> tags individually, I can get the text. Why does this happen? Isn't the <div> a parent element and shouldn't my code extract the text?
Note - I wanted to use the div because I thought that'll help me get both the <p> tags in one query.
I can see two issues here.
The first is that if you separate the class names with a space, the CSS selector treats the second name as a descendant element with that tag name. So the correct approach is "div.home-hero-blurb.no-select::text" instead of "div.home-hero-blurb no-select::text".
The second issue is that the text you want is inside a p element that is a child of that div. If you only select the div, the selector will return the text directly inside the div, but not the text in its children. Since there is also a strong element as a child of p, I would suggest a more general approach like:
response.css("div.home-hero-blurb.no-select *::text").getall()
This should return all text from the div and its descendants.
It's relevant to point out that extracting text with ::text is an extension to the standard CSS selectors. Scrapy mentions this here.
Edit
If you were to use XPath, this would be the equivalent expression:
response.xpath('//div[@class="home-hero-blurb no-select"]//text()').getall()
I have the following:
This is my text string and this next <a href='https//somelink.org/'>part</a> is only partially enclosed in a tags.
In the above string I have to search for "next part", not only "part". Once I find "next part", I need to check whether there is an a tag in the matched text (sometimes there is no a tag). How can I do that?
In addition to my main question, I can't get my XPath to find "next part" in the elements.
I tried this:
//*[contains(text(),"next part")]
But it doesn't find anything, probably because I have spaces in there. How do I overcome this?
Thank you in advance,
Let's assume this html:
<p>This is my text string and this next <a href='https//somelink.org/'>part</a> is only partially enclosed in a tags.</p>
We can select with selenium:
p = driver.find_element(By.XPATH, '//p[contains(.,"next part")]')
And we can determine if it's partly in an a tag with regex (Tony the Pony notwithstanding):
import re
html = p.get_attribute('innerHTML')
# True when the phrase only appears once the a tags are stripped out.
partly_in_a = 'next part' in re.sub(r'</?a.*?>', '', html) and 'next part' not in html
There's no pure xpath 1.0 solution for this, and it's a mistake in general to depend on xpath for stuff like this.
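The regex check above can be tried without a browser by feeding it the inner HTML directly. The following is a sketch under that assumption. The names A_TAG and partly_in_a are illustrative, and the pattern is slightly stricter than the one above (it uses a word boundary so that tags such as abbr are left alone).

```python
import re

# Matches opening and closing <a> tags only; \b keeps <abbr> and similar intact.
A_TAG = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)


def partly_in_a(inner_html, phrase):
    """True when `phrase` appears only after the <a> tags are removed,
    meaning an <a> boundary falls somewhere inside the phrase."""
    without_a_tags = A_TAG.sub("", inner_html)
    return phrase in without_a_tags and phrase not in inner_html


split_by_tag = "this next <a href='https//somelink.org/'>part</a> is only partially enclosed"
no_tag = "this next part is not enclosed at all"
fully_inside = "this <a href='https//somelink.org/'>next part</a> is fully enclosed"

print(partly_in_a(split_by_tag, "next part"))
print(partly_in_a(no_tag, "next part"))
print(partly_in_a(fully_inside, "next part"))
```

Only the first string reports a partial match. The other two contain the phrase verbatim in the raw HTML, so no tag boundary splits it.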
You'll need to use a nested XPath selector for this.
//*[contains(text(), 'next') and a[contains(text(), 'part')]]
This matches any element whose first text node contains "next" and that also has a child a element whose text contains "part".
To determine whether or not there actually IS a nested a tag, you will need to write a method for this that checks against two different XPaths. There is no easy way around this, other than to evaluate the elements and see what's there.
public bool DoesElementHaveNestedTag()
{
    // The locator matches only when the "next" text has a nested a tag containing "part".
    // FindElements returns an empty collection when nothing matches, so Count > 0 means the tag exists.
    return driver.FindElements(By.XPath("//*[contains(text(), 'next') and a[contains(text(), 'part')]]")).Count > 0;
}
You can change this method to fit your needs, but the idea is the same. There is no way to know if a WebElement has a nested tag or not, unless you try to find the WebElement using two XPaths -- one that checks for the tag, and one that does not.
Recently I've been working on a scraping project. I'm kinda new to it, but could manage to do almost everything, but I'm having trouble with a little issue. I captured every line of a news article doing this:
lines=bs.find('div',{'class':'Text'}).find_all('div')
But for some reason, there's some lines that contain an h2 tag and a br tag, like this one:
<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text
So if I run .text on that snippet I get "Header2Paragraph text". I've already got the "Header2" text stored in another line, so I want to delete this second occurrence.
I managed to isolate those lines doing this:
for n, t in enumerate(lines):
    if t.find('h2') is not None and t.find('br') is not None:
        print('\n', n, ':', t)
But I don't know how to erase the text associated to the h2 tag, so those lines become "Paragraph text" instead of "Header2Paragraph text". What can I do? Thanks
Use .get_text(separator=' ') instead of .text and you get the text with a space: "Header2 Paragraph text".
You can also use a different character, e.g. "|": .get_text(separator='|') gives "Header2|Paragraph text".
Then you can use split("|") to get the list ["Header2", "Paragraph text"] and keep the last element.
You can also find the h2 and call clear() or extract() on it; when you later get the text from the whole div, it will no longer include "Header2".
Documentation: get_text(), clear(), extract()
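The idea of dropping the h2 before collecting the text can also be sketched without BeautifulSoup. The example below swaps in the standard library's html.parser so that it has no dependencies. It is not the BeautifulSoup API, and the class TextSkippingTag, the function text_without, and the closing tags added to the sample snippet are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextSkippingTag(HTMLParser):
    """Collect text nodes, ignoring anything nested inside `skip_tag`."""

    def __init__(self, skip_tag):
        super().__init__()
        self.skip_tag = skip_tag
        self.depth = 0   # > 0 while the parser is inside skip_tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.skip_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.skip_tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside the skipped tag.
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())


def text_without(html, skip_tag="h2"):
    parser = TextSkippingTag(skip_tag)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# The question's snippet, with closing tags assumed.
snippet = "<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text</div></div>"
print(text_without(snippet))
```

With BeautifulSoup itself, removing the h2 with extract() and then calling get_text() on the div achieves the same result in fewer lines.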
I'm creating a parser, and I have the following construction:
quotes = soup.findAll('div',{'class':'text'})
But it strips all HTML tags (like br). How can I change that?
findAll itself will give you a list of HTML nodes.
If you want to retrieve their text content (without tags), use .get_text().
To get the children of these nodes (as objects too), use .contents or .children.
In order to print a node's children as a well-formatted string, you can use .prettify(). Note that this won't exactly preserve the original formatting.
See also:
BeautifulSoup innerhtml?
If you want to take out the tags from the text, you could try something like this:
import re
for item in quotes:
    # str(item) is the element's HTML; the non-greedy pattern removes each tag.
    quote = re.sub(r"<.*?>", "", str(item))