Get text via XPath, ignoring markup

I have to retrieve text from the cells of an HTML table. Sometimes the text in a cell is wrapped in a <div> and sometimes it is not.
How can I make the div step in an XPath optional?
My current code:
stuff = tree.xpath("/html/body/table/tbody/tr/td[5]/div/text()")
Wanted pseudocode:
stuff = tree.xpath("/html/body/table/tbody/tr/td[5]/div or nothing/text()")

You want the string value of the td[5] element. Use string():
stuff = tree.xpath("string(/html/body/table/tbody/tr/td[5])")
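This returns the text beneath td[5] with all markup removed. Note that string() applied to a node-set uses only the first node in document order, so when several rows match, select the td[5] elements first and evaluate string(.) on each of them.
You can also obtain the string value indirectly via normalize-space(), as suggested by splash58 in the comments, if you also want leading and trailing whitespace trimmed and internal runs of whitespace collapsed to a single space.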
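To illustrate why the string value handles both cell layouts, here is a minimal sketch that uses only the standard library. It assumes a made-up two-cell row (ROW) and relies on xml.etree.ElementTree's itertext(), which yields the same descendant text that XPath string() concatenates. The helper names string_value and normalized are hypothetical, and str.split() only approximates normalize-space(). With lxml, the tree.xpath("string(...)") call above does this for you.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample row: one cell wraps its text in <div>, the other does not.
ROW = "<tr><td><div>  wrapped   text </div></td><td>bare text</td></tr>"


def string_value(element):
    """Concatenate all descendant text, like XPath string()."""
    return "".join(element.itertext())


def normalized(element):
    """Trim the ends and collapse whitespace runs, roughly like normalize-space()."""
    return " ".join(string_value(element).split())


cells = ET.fromstring(ROW).findall("td")
values = [normalized(td) for td in cells]
print(values)  # ['wrapped text', 'bare text']
```

Both cells yield their text whether or not a <div> sits in between, which is the "div or nothing" behavior asked for.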

Related

How to get text from an element with selenium

I am trying to get the text of an element and then search for it in another place. The problem is that when I get the text using .text, two consecutive spaces are collapsed into one, so when I search for it on the next page it can't be found. Is there a way to get the text exactly as it is, spaces included?
Code:
self.name = session.find_element('xpath', './/a[contains(@href, "name")]').text
self.dataBrowser.find_elements(By.XPATH, f'//tr[child::td[child::div[child::strong[text()="{self.name}"]]]]')

The text property does not always hold the text you need, so try reading attributes instead:
element = browser.find_element(By.CSS_SELECTOR, 'CSS_EXPRESSION')
text = element.get_attribute('text')
if not text:  # covers both None and the empty string
    text = element.get_attribute('value')
If text and value both return nothing, try reading the innerHTML attribute.
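The fallback chain above can be wrapped in a small helper. This is only a sketch: first_nonempty_attribute and FakeElement are hypothetical names, and FakeElement merely stands in for a Selenium WebElement so the snippet runs without a browser. The textContent entry is an addition that is not part of the answer above; it returns the raw DOM text, so consecutive spaces are not collapsed the way they are in .text.

```python
def first_nonempty_attribute(element,
                             names=("textContent", "text", "value", "innerHTML")):
    """Return the first non-empty attribute among names, or None."""
    for name in names:
        value = element.get_attribute(name)
        if value:  # skips both None and ""
            return value
    return None


class FakeElement:
    """Stand-in for a WebElement so the sketch runs without a browser."""

    def __init__(self, attributes):
        self._attributes = attributes

    def get_attribute(self, name):
        return self._attributes.get(name)


# "text" is empty and "value" is missing, so the helper falls through to innerHTML.
link = FakeElement({"text": "", "innerHTML": "John  Smith"})
print(repr(first_nonempty_attribute(link)))  # 'John  Smith'
```

With a real driver you would pass the element returned by find_element instead of FakeElement.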

Empty list as output from scrapy response object

I am scraping this webpage and while trying to extract text from one element, I am hitting a dead end.
So the element in question is shown below in the image -
The text in this element is within the <p> tags inside the <div>. I tried extracting the text in the scrapy shell using the following code - response.css("div.home-hero-blurb no-select::text").getall(). I received an empty list as the result.
Alternatively, if I try going a bit further and reference the <p> tags individually, I can get the text. Why does this happen? Isn't the <div> a parent element and shouldn't my code extract the text?
Note - I wanted to use the div because I thought that'll help me get both the <p> tags in one query.

I can see two issues here.
The first is that a space in a CSS selector is the descendant combinator, so "div.home-hero-blurb no-select" looks for a <no-select> element inside the div. To match one element that has both classes, chain them: "div.home-hero-blurb.no-select::text" instead of "div.home-hero-blurb no-select::text".
The second issue is that the text you want is inside a p element that is a child of that div. If you only select the div, ::text returns the text nodes directly inside the div, but not those in its children. Since there is also a strong element as a child of p, I would suggest a more general approach like:
response.css("div.home-hero-blurb.no-select *::text").getall()
This should return the text from all of the div's descendant elements.
Note that extracting text with ::text is an extension of the standard CSS selectors; the Scrapy documentation mentions this.
Edit
If you were to use XPath, this would be the equivalent expression:
response.xpath('//div[@class="home-hero-blurb no-select"]//text()').getall()

How to check if matched text in web element is partially enclosed in <a> tag?

I have the following:
This is my text string and this next <a href='https//somelink.org/'>part</a> is only partially enclosed in a tags.
In the above string I have to search for "next part", not only "part". Once I find "next part", I need to check whether there is an a tag inside the matched text (sometimes there is no tag). How can I do that?
In addition to my main question, I can't get my XPath to find "next part" in the elements.
I tried this:
//*[contains(text(),"next part")]
But it doesn't find anything, probably because I have spaces in there. How do I overcome this?
Thank you in advance,

Let's assume this html:
<p>This is my text string and this next <a href='https//somelink.org/'>part</a> is only partially enclosed in a tags.</p>
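We can select the paragraph with Selenium. Using . instead of text() matters here: text() only looks at the element's own text nodes, which the a tag splits into separate pieces, while . is the full string value including descendants. (find_element_by_xpath was removed in Selenium 4.3, so this uses find_element with By.XPATH.)
p = driver.find_element(By.XPATH, '//p[contains(.,"next part")]')
And we can determine whether the match is partly inside an a tag with a regex (Tony the Pony notwithstanding):
import re
html = p.get_attribute('innerHTML')
# True when "next part" only appears once the <a> tags are stripped
partly_in_a = 'next part' in re.sub(r'</?a\b.*?>', '', html) and 'next part' not in html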
There's no pure xpath 1.0 solution for this, and it's a mistake in general to depend on xpath for stuff like this.
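The regex check can be tried without a browser by passing the innerHTML in as a plain string. A minimal sketch, assuming a hypothetical helper name partly_in_a and made-up sample strings; the \b after a keeps tags such as <abbr> from being stripped by mistake.

```python
import re

# Opening or closing <a> tags only; \b stops the pattern matching <abbr>, <article>, ...
A_TAG = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)


def partly_in_a(inner_html, phrase):
    """True if phrase only appears once the <a> tags are stripped from inner_html."""
    without_a = A_TAG.sub("", inner_html)
    return phrase in without_a and phrase not in inner_html


split_by_link = "this next <a href='#'>part</a> is only partially enclosed"
plain = "this next part has no link at all"

print(partly_in_a(split_by_link, "next part"))  # True
print(partly_in_a(plain, "next part"))          # False
```

A phrase that sits entirely inside the a tag also returns False, because it is already present in the unstripped HTML.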

You'll need to use a nested XPath selector for this.
//*[contains(text(), 'next') and a[contains(text(), 'part')]]
This matches any element whose own text contains "next" and that also has a nested a element whose text contains "part".
To determine whether or not there actually IS a nested a tag, you will need to write a method for this that checks against two different XPaths. There is no easy way around this, other than to evaluate the elements and see what's there.
public bool DoesElementHaveNestedTag()
{
    // Check for the locator that requires a nested <a> tag.
    // If FindElements returns at least one element, the nested tag exists.
    return driver.FindElements(By.XPath("//*[contains(text(), 'next') and a[contains(text(), 'part')]]")).Count > 0;
}
You can change this method to fit your needs, but the idea is the same. There is no way to know if a WebElement has a nested tag or not, unless you try to find the WebElement using two XPaths -- one that checks for the tag, and one that does not.

Extract tag text from line BeautifulSoup

Recently I've been working on a scraping project. I'm kinda new to it, but could manage to do almost everything, but I'm having trouble with a little issue. I captured every line of a news article doing this:
lines=bs.find('div',{'class':'Text'}).find_all('div')
But for some reason, there's some lines that contain an h2 tag and a br tag, like this one:
<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text
So if I run .text on that snippet I get "Header2Paragraph text". I've already got the "Header2" text stored in another line, so I want to delete this second occurrence.
I managed to isolate those lines doing this:
for n, t in enumerate(lines):
    if t.find('h2') is not None and t.find('br') is not None:
        print('\n', n, ':', t)
But I don't know how to erase the text associated to the h2 tag, so those lines become "Paragraph text" instead of "Header2Paragraph text". What can I do? Thanks

Use .get_text(separator=' ') instead of .text and you get the text with a space between the pieces: "Header2 Paragraph text".
You can also use a different character, e.g. "|": .get_text(separator='|') gives "Header2|Paragraph text".
You can then use split("|") to get the list ["Header2", "Paragraph text"] and keep the last element.
Alternatively, find the h2 and call clear() or extract() on it; getting the text of the div afterwards returns it without "Header2".
Documentation: get_text(), clear(), extract()
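To show what removing the h2 before reading the text amounts to, here is a dependency-free sketch that uses the standard library's html.parser rather than BeautifulSoup. The class name TextWithoutTag and its skip parameter are hypothetical, and the sample markup is the snippet from the question with its closing tags added as an assumption.

```python
from html.parser import HTMLParser


class TextWithoutTag(HTMLParser):
    """Collect text, ignoring anything nested inside the skipped tag."""

    def __init__(self, skip="h2"):
        super().__init__()
        self.skip = skip
        self.depth = 0   # greater than 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.skip:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.skip and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.parts.append(data)


# Question's snippet; the trailing </div></div> is assumed.
SNIPPET = ("<div><div><h2>Header2</h2></div><div><br/></div>"
           "<div>Paragraph text</div></div>")

parser = TextWithoutTag()
parser.feed(SNIPPET)
parser.close()
text = "".join(parser.parts)
print(text)  # Paragraph text
```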

Saving <br/> in beautifulsoup

I'm writing a parser and have the following code:
quotes = soup.findAll('div',{'class':'text'})
But it strips all HTML tags (like br). How can I change that?

findAll itself will give you a list of HTML nodes.
If you want to retrieve their text content (without tags), use .get_text().
To get the children of these nodes (as objects too), use .contents or .children.
In order to print a node's children as a well-formatted string, you can use .prettify(). Note that this won't exactly preserve the original formatting.
See also:
BeautifulSoup innerhtml?

If you want to take the tags out of the text, you could try something like this:
import re
for item in quotes:
    # item is a Tag, so convert it to its HTML string before substituting
    quote = re.sub(r"<.*?>", "", str(item))
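Going the other way, if the goal is to keep <br/> while dropping every other tag, a negative lookahead can exempt it. This is a rough sketch rather than a robust HTML parser: strip_tags_keep_br is a hypothetical helper, the sample string is made up, and a regex like this will misfire on attribute values that contain >.

```python
import re

# Any tag except <br>, <br/> or <br />; the lookahead rejects those three forms.
NON_BR_TAG = re.compile(r"<(?!br\s*/?>)[^>]*>", re.IGNORECASE)


def strip_tags_keep_br(html):
    """Remove every tag from html except line breaks."""
    return NON_BR_TAG.sub("", html)


sample = '<div class="text">First line<br/>Second <b>bold</b> line</div>'
cleaned = strip_tags_keep_br(sample)
print(cleaned)  # First line<br/>Second bold line
```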
