Get enclosing tag of text selenium - python

So, I know there's a bunch of ways to obtain an element on a webpage by looking up its tag in selenium, but I was wondering if it was possible to do the reverse. Can I obtain the immediate enclosing tag of some text that I look up on a page using selenium?
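One possible approach, sketched here rather than offered as a definitive answer: look the text up with an XPath that matches any element owning a text node with that content, then read the element's tag_name. The helper names below (xpath_literal, xpath_for_text, enclosing_tag) are made up for illustration, and driver is assumed to be an already started Selenium WebDriver.

```python
def xpath_literal(text: str) -> str:
    """Quote text for XPath 1.0, which has no escape character."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Both quote styles present: glue the pieces together with concat().
    pieces = [f"'{part}'" for part in text.split("'")]
    return "concat(" + ", \"'\", ".join(pieces) + ")"


def xpath_for_text(text: str) -> str:
    """Match any element that has a direct text node equal to `text`."""
    return f"//*[text()[normalize-space(.)={xpath_literal(text)}]]"


def enclosing_tag(driver, text: str) -> str:
    """Return the tag name of the first element directly containing `text`."""
    # Imported lazily so the helpers above work without selenium installed.
    from selenium.webdriver.common.by import By

    element = driver.find_element(By.XPATH, xpath_for_text(text))
    return element.tag_name
```

Because the predicate is applied to the element's own text nodes, the match should be the immediate parent of the text rather than some outer ancestor. Calling enclosing_tag(driver, "Some text") would raise NoSuchElementException if nothing on the page matches.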

Related

Get Xpath for Element in Python

I've been researching this for two days now. There seems to be no simple way of doing this. I can find an element on a page by downloading the HTML with Selenium and passing it to BeautifulSoup, followed by a search via classes and strings. I want to click on this element after finding it, so I want to pass its XPath to Selenium. I have no minimal working example, only pseudo-code for what I'm hoping to do.
Why is there no function/library that lets me search through the HTML of a webpage, find an element, and then request its XPath? I can do this manually by inspecting the webpage and clicking 'Copy XPath'. I can't find any solutions to this on Stack Overflow, so please don't tell me I haven't looked hard enough.
Pseudo-Code:
# parser is a BeautifulSoup HTML object
for box in parser.find_all('span', class_="icon-type-2"):  # find all elements with a particular icon
    xpath = box.get_xpath()  # hypothetical method: this is what I am looking for
I'm willing to change my code entirely, as long as I can locate a particular element and extract its XPath. So any other ideas involving entirely different libraries are welcome.
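As far as I know, BeautifulSoup has no built-in way to report a tag's XPath, so here is a rough sketch of one alternative. It swaps BeautifulSoup for the standard library's html.parser and records a positional XPath for every element with a given tag and class while the HTML is parsed. XPathCollector and xpaths_for are hypothetical names, and the code assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class XPathCollector(HTMLParser):
    """Record a positional XPath for every <tag> carrying a given class."""

    def __init__(self, tag: str, css_class: str):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.path = []      # steps such as "div[2]" for the currently open elements
        self.counts = [{}]  # per open element: children seen so far, by tag name
        self.matches = []   # XPaths of matching elements, in document order

    def handle_starttag(self, tag, attrs):
        seen = self.counts[-1]
        seen[tag] = seen.get(tag, 0) + 1
        step = f"{tag}[{seen[tag]}]"
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.matches.append("/" + "/".join(self.path + [step]))
        if tag not in VOID_TAGS:
            self.path.append(step)
            self.counts.append({})

    def handle_endtag(self, tag):
        # Assumes well-formed nesting: only pop when the tag closes the innermost element.
        if self.path and self.path[-1].split("[")[0] == tag:
            self.path.pop()
            self.counts.pop()


def xpaths_for(html: str, tag: str, css_class: str) -> list:
    """Return the XPaths of all <tag> elements whose class list contains css_class."""
    collector = XPathCollector(tag, css_class)
    collector.feed(html)
    collector.close()
    return collector.matches
```

Something like xpaths_for(driver.page_source, "span", "icon-type-2") would then give paths that could be handed back to Selenium. Two caveats: the browser's live DOM can differ from the parsed source, so a path may not always resolve, and Selenium can usually locate such elements directly with a CSS selector like span.icon-type-2, which avoids the XPath round trip altogether.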

Scrapy: extract text from span without class or id

I have the following html structure:
I would like to extract the text ("“Business-Thinking”-Fokus im Master-Kurs") from the highlighted span (using Scrapy), but I have trouble reaching it because it does not have any specific class or id.
I tried to access it with the following absolute XPath:
sel.xpath('/html/body/div[4]/div[1]/div/div/h1/span/text()').extract()
I don't get any error, however it returns a blank file, meaning the text is not extracted.
Note: The parent classes are not unique, that's why I'm not using a relative path. As the text varies, I also cannot reach the span by looking for the text it contains.
Do you have any suggestion on how I should modify my xPath to extract the text? Thanks!
If you load the page using scrapy shell <url>, it is fetched without running JavaScript, so the markup Scrapy sees can differ from what the browser's inspector shows.
In the source without JavaScript, the XPath to the span is /html/body/div/div[1]/div/div/h1/span
To load pages that need JavaScript in Scrapy, use Splash.
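To illustrate why the absolute path copied from the browser comes back empty, here is a small sketch. It uses the standard library's xml.etree.ElementTree on made-up, well-formed markup rather than Scrapy or the real page, so RAW_SOURCE is purely an assumption about what the JavaScript-free source might look like.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the raw page source (no JavaScript applied).
RAW_SOURCE = (
    "<html><body>"
    "<div><div><div><div><h1><span>Example title</span></h1></div></div></div></div>"
    "</body></html>"
)

root = ET.fromstring(RAW_SOURCE)  # root is the <html> element

# Absolute path copied from the browser: expects a fourth <div> under <body>.
absolute = root.find("./body/div[4]/div[1]/div/div/h1/span")

# Relative path: any <span> directly inside any <h1>.
relative = root.find(".//h1/span")

print(absolute)       # None, because <body> has only one <div> here
print(relative.text)  # Example title
```

In Scrapy the relative version would be something like sel.xpath('//h1/span/text()').extract(), which does not depend on how many wrapper divs JavaScript adds, provided the page has only one such h1.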

Getting all elements in a page with specific span class python selenium

Hi, I am attempting to scrape multiple pages using Selenium in Python. I am interested in extracting all elements that fall within a span element with a particular class: basically, I would like to get those span elements and then extract the link inside each one. For a single page I can do this with an XPath, but the XPath changes for each object and for each page. Here is an example of what the web elements look like:
Essentially, I would like to extract these elements, since this span is consistent across all the pages that I will be scraping. So my idea is to get these elements and then get the href of each. I have tried to get all the elements on the page using this code:
driver.find_elements_by_xpath("//span[@class='Text__StyledText-jknly0-0 cCEhaW']")
However this has not worked and it returns nothing. I also do not want to use the inner class because it varies by page as well so the only real element to use if I want to automate the scraping without getting too messy is that element I mention. Any way to extract the links for this span class elements on the page?
Try this XPath:
//span[contains(@class,'Text__StyledText')]//a[contains(@class,'Anchor__StyledAnchor')]
To actually grab that element we use the following
driver.find_elements_by_css_selector("span.Text__StyledText-jknly0-0.cCEhaW")
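A minimal sketch of pulling the links out, assuming the stable part of the class name is the Text__StyledText prefix and that each span wraps an a element with an href. The function names are hypothetical, and driver is assumed to be a running WebDriver. As far as I know, the find_elements_by_* helpers have been removed in newer Selenium 4 releases, so the sketch uses find_elements with By instead.

```python
def links_selector(class_prefix: str) -> str:
    """CSS selector for <a href> elements inside a <span> whose class contains the prefix."""
    return f"span[class*='{class_prefix}'] a[href]"


def collect_hrefs(driver, class_prefix: str = "Text__StyledText") -> list:
    """Return the href of every matching link on the current page."""
    # Imported lazily so links_selector works without selenium installed.
    from selenium.webdriver.common.by import By

    anchors = driver.find_elements(By.CSS_SELECTOR, links_selector(class_prefix))
    return [anchor.get_attribute("href") for anchor in anchors]
```

Matching on a substring of the class attribute sidesteps the generated suffix (jknly0-0 cCEhaW), which is the part most likely to change between pages. If the list comes back empty, the content may not have loaded yet when the lookup ran.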

Are there any selenium locators present which can scrape any content of a webpage?

Currently I use Python with Selenium for scraping. There are many ways to locate data in Selenium, and I used to use CSS selectors.
But then I realised that tag names are the only thing that is always present on a website.
For example, not every website uses classes or ids. Take Wikipedia: it mostly uses plain tags,
like <h1> or <a>, without any class or id on them.
That is where the limitation of scraping by tag name comes in: it scrapes every element with that tag.
For example: if I want to scrape table contents that are under a <p> tag, it scrapes the table contents as well as all the descriptions, which are not needed.
My question is: is it possible to scrape only the required elements under a tag, without also copying every other element with that tag?
For instance, if I want to scrape content from, say, Amazon, it should select only the product names under h1 tags, not all the other h1 headings that are not product names.
If you know of any other method or locator, other than tag name, please tell me. The only condition is that it must be present on every website, or at least on most websites.
Any help would be appreciated 😊...

How can Selenium (or BeautifulSoup) be used to access these hidden elements?

Here is an example page with pagination controlling dynamically loaded results.
http://www.rehabs.com/local/jacksonville-fl/
All that I presently know to try is:
curButton = 1
driver.find_element_by_css_selector('ul[class="pagination"]').find_elements_by_tag_name('li')[curButton].click()
Nothing seems to happen (also when trying to access and click the a tag or driver.get() the href of the a element).
Is there another way to access the hidden elements? For instance, when reading the html of the entire page, the elements of different pagination are shown, but are apparently inaccessible with BeautifulSoup.
Pagination was added for humans. Maybe you used the wrong xpath or css. Check it.
Use this xpath:
//div[@id="listing-basic"]/article/div[@class="h3"]/a/@href
You can click on the pagination button using:
driver.find_elements_by_css_selector('.pagination li a')[1].click()
