How to get text from an element with selenium - python

I am trying to get the text from an element and then search for it in another place. The problem is that when I get the text using .text, it drops one of the spaces wherever there are two in a row, so the search on the next page can't find it. Is there a way to get the text exactly as it is, with its spaces?
Code:
self.name = session.find_element('xpath', './/a[contains(@href, "name")]').text
self.dataBrowser.find_elements(By.XPATH, f'//tr[child::td[child::div[child::strong[text()="{self.name}"]]]]')

The text property does not always hold the text, so try this:
from selenium.webdriver.common.by import By
element = browser.find_element(By.CSS_SELECTOR, 'CSS_EXPRESSION')
text = element.get_attribute('text')
# Fall back to the value attribute when text is missing or empty
if not text:
    text = element.get_attribute('value')
If text and value both return nothing, try reading the innerHTML attribute.
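The fallback chain above can be wrapped in a small helper. This is only a sketch: read_text and FakeElement are hypothetical names, and FakeElement stands in for a Selenium WebElement so the code runs without a browser. It also tries textContent first, which is an addition to the answer: Selenium's .text returns the text as rendered, with runs of whitespace collapsed, while textContent returns the raw DOM text, so it may be what keeps the double spaces.

```python
def read_text(element):
    """Return the first non-empty value among several DOM properties."""
    # "textContent" is an assumption added here; the answer names
    # text, value and innerHTML.
    for name in ("textContent", "value", "innerHTML"):
        value = element.get_attribute(name)
        if value:
            return value
    return ""


class FakeElement:
    """Hypothetical stand-in for a WebElement, used only for this demo."""

    def __init__(self, attributes):
        self._attributes = attributes

    def get_attribute(self, name):
        return self._attributes.get(name)


link = FakeElement({"textContent": "John  Smith"})
field = FakeElement({"textContent": "", "value": "typed text"})

print(repr(read_text(link)))   # the two spaces are still there
print(repr(read_text(field)))  # falls back to the value attribute
```

With a real driver, the same function should accept whatever find_element returns, since it only relies on get_attribute.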

Related

How to find text with Selenium

I need to know how to find text with Selenium, this one for example:
Test
I can get the text with the following code:
wait.until(EC.element_to_be_clickable((By.XPATH, '//*[text()="Test"]'))).text
But I need to get text in the following format:
"Key" "email@gmail.com"
I need to be able to get the text above. The email and password may differ from case to case, but the email will always contain .com, so I would like to locate the string by its .com and then have the full value of the string returned.
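Use contains() in the XPath to find text that includes .com:
//*[contains(text(),'.com')]
Alternatively, ends-with() matches only text that ends in .com. Note that ends-with() is an XPath 2.0 function, while browsers (and therefore Selenium) evaluate XPath 1.0, so it only works with an XPath 2.0 engine:
//*[ends-with(text(),'.com')]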
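Because ends-with() belongs to XPath 2.0 and browsers evaluate XPath 1.0, a common workaround is to compare the tail of the string using substring() and string-length(). The helper below is a minimal sketch that builds such an expression; xpath_ends_with is a hypothetical name, and it assumes the suffix contains no single quote.

```python
def xpath_ends_with(suffix):
    """Build an XPath 1.0 expression matching text that ends with `suffix`."""
    if not suffix or "'" in suffix:
        raise ValueError("suffix must be non-empty and contain no single quote")
    # substring() is 1-indexed, so the last len(suffix) characters start at
    # string-length(text()) - (len(suffix) - 1).
    offset = len(suffix) - 1
    return (
        f"//*[substring(text(), string-length(text()) - {offset}) = '{suffix}']"
    )


expression = xpath_ends_with(".com")
print(expression)
```

The resulting string can be passed to find_element(By.XPATH, ...) and the matched element's .text read as in the question.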

Extract tag text from line BeautifulSoup

Recently I've been working on a scraping project. I'm kinda new to it, but could manage to do almost everything, but I'm having trouble with a little issue. I captured every line of a news article doing this:
lines=bs.find('div',{'class':'Text'}).find_all('div')
But for some reason, there are some lines that contain an h2 tag and a br tag, like this one:
<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text
So if I run .text on that snippet I get "Header2Paragraph text". I already have the "Header2" text stored in another line, so I want to delete this second occurrence.
I managed to isolate those lines doing this:
for n, t in enumerate(lines):
    if t.find('h2') is not None and t.find('br') is not None:
        print('\n', n, ':', t)
But I don't know how to erase the text associated to the h2 tag, so those lines become "Paragraph text" instead of "Header2Paragraph text". What can I do? Thanks
Use .get_text(separator=' ') instead of .text and you get the text with a space between the parts: "Header2 Paragraph text".
You can also use a different character, e.g. "|": .get_text(separator='|') gives "Header2|Paragraph text".
Then you can use split("|") to get the list ["Header2", "Paragraph text"] and keep the last element.
Alternatively, find the h2 and call clear() or extract() on it; afterwards, getting the text from the whole div returns it without "Header2".
Documentation: get_text(), clear(), extract()
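Both approaches can be tried on a small sample. The markup below is hypothetical: it is modeled on the snippet in the question with the closing tags filled in, and it is parsed with the built-in "html.parser" backend so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question, closing tags added.
html = (
    "<div><div><h2>Header2</h2></div>"
    "<div><br/></div><div>Paragraph text</div></div>"
)
line = BeautifulSoup(html, "html.parser").div

# Option 1: join the text nodes with a separator, then keep the last piece.
joined = line.get_text(separator="|")
last_piece = joined.split("|")[-1]

# Option 2: detach every h2 from the tree, then read the remaining text.
for heading in line.find_all("h2"):
    heading.extract()
without_heading = line.get_text()

print(joined)
print(last_piece)
print(without_heading)
```

Option 1 assumes the separator never appears in the article text; option 2 avoids that assumption but modifies the parsed tree.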

Scrapy: how can I extract text with hyperlink together

I'm crawling a professor's webpage.
Under her research description, there are two hyperlinks, which are " TEDx UCL" and "here".
I use xpath like '//div[@class="group"]//p/text()'
to get the first 3 paragraphs.
And '//div[@class="group"]/text()' to get the last paragraph with some newlines. But these can be cleaned easily.
The problem is the last paragraph contains only text. The hyperlinks are lost. Though I can extract them separately, it is tedious to put them back to their corresponding position.
How can I get all the text and keep the hyperlinks?
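You can use html2text. Pass it the HTML of the whole div rather than the text() nodes, so the link text stays in place:
import html2text
# Serialize the div, including its <a> children, as an HTML string
sample = response.xpath('//div[@class="group"]').get()
converter = html2text.HTML2Text()
# Keep the link text but drop the URLs; set to False to keep the URLs too
converter.ignore_links = True
print(converter.handle(sample))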
Try this:
'//div[@class="group"]/p//text()[normalize-space(.)]'
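If the goal is to keep each link's address next to its label, one option is to walk the element's children in order and rebuild the text. The sketch below illustrates the idea with the standard library's ElementTree instead of Scrapy, so it needs well-formed markup; text_with_links is a hypothetical helper, and the sample markup and URLs are invented for the demo.

```python
import xml.etree.ElementTree as ET


def text_with_links(elem):
    """Concatenate an element's text in document order, adding each href."""
    parts = [elem.text or ""]
    for child in elem:
        inner = text_with_links(child)
        if child.tag == "a" and child.get("href"):
            inner = f"{inner} ({child.get('href')})"
        parts.append(inner)
        # Text that follows the child, still inside the parent element.
        parts.append(child.tail or "")
    return "".join(parts)


# Invented sample resembling a paragraph with two hyperlinks.
sample = (
    '<div class="group">See my talk at '
    '<a href="https://example.org/talk">TEDx UCL</a> or read more '
    '<a href="https://example.org/more">here</a>.</div>'
)
result = text_with_links(ET.fromstring(sample))
print(result)
```

In a Scrapy spider the same traversal would have to run on the selector's underlying tree or on the serialized HTML, which is not shown here.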

Unable to find element using the following Xpath

I am trying to find the input element with the name statusid_103408 and the text Draft.
Here is the XPath I am using; I am not sure where I am going wrong:
//input[@name='statusid_103408' and contains(text(), 'Draft')]
The reason this xpath does not work is because the text of "Draft" is not actually a property of the input element. It is contained in the li element that is the parent. Therefore, your search is returning no results.
I suggest using only the name in your xpath search (if it is unique). If you definitely need the text in your search, you can search the li item's text first, then find your input, like so:
//li[text()='Draft']/input[@name='statusid_103408']
Use the value attribute instead. It will work because the value is unique, whereas the text is not inside the input tag.

Get text via XPath, ignoring markup

I have to retrieve text from the cells of an HTML table. In some cells the text is inside a <div> and in others it is not.
How can I make a div in an XPath optional?
My actual code:
stuff = tree.xpath("/html/body/table/tbody/tr/td[5]/div/text()")
Wanted pseudocode:
stuff = tree.xpath("/html/body/table/tbody/tr/td[5]/div or nothing/text()")
You want the string value of the td[5] element. Use string():
stuff = tree.xpath("string(/html/body/table/tbody/tr/td[5])")
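This returns the text beneath td[5] without markup. Note that string() applied to a node-set uses only the first node in document order, so if the path matches several rows, only the first cell's text is returned.
You can also obtain the string value of an element indirectly via normalize-space(), as suggested by splash58 in the comments, if you also want whitespace trimmed at the ends and collapsed internally.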
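To collect the text of every matching cell, one approach is to select the td elements and take the string value of each. The sketch below uses the standard library's ElementTree as a stand-in for lxml: joining itertext() yields the same concatenation of descendant text that string() returns for a single element. The table markup is invented, and the column index is 2 rather than 5 to keep the sample short.

```python
import xml.etree.ElementTree as ET

# Invented table: the second cell is wrapped in a div in one row only.
table = ET.fromstring(
    "<table><tbody>"
    "<tr><td>a</td><td><div>wrapped</div></td></tr>"
    "<tr><td>b</td><td> plain </td></tr>"
    "</tbody></table>"
)

cells = table.findall("./tbody/tr/td[2]")
# "".join(td.itertext()) mirrors the XPath string value of td;
# strip() trims the ends, as normalize-space() would.
stuff = ["".join(td.itertext()).strip() for td in cells]
print(stuff)
```

With lxml, the equivalent per-cell call would likely be td.xpath("string()") on each element returned by tree.xpath(".../td[5]").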
