How to get an XPath from selenium webelement or from lxml? - python

I am using selenium and I need to find the XPaths of some selenium web elements.
For example:
import selenium.webdriver
driver = selenium.webdriver.Firefox()
element = driver.find_element_by_xpath(<some_xpath>)
elements = element.find_elements_by_xpath(<some_relative_xpath>)
for e in elements:
    print e.get_xpath()  # get_xpath() doesn't exist; this is what I'd like to do
I know I can't get the XPath from the element itself, but is there a nice way to get it anyway?
I tried using lxml to parse the HTML, but it doesn't recognize the XPath, <some_xpath>, I passed, even though driver.find_element_by_xpath(<some_xpath>) did manage to find that element.

lxml can auto-generate an absolute XPath for you using the getpath() method.
Example (using the Wikipedia main page, getting the XPath expression for the logo):
import urllib2
from lxml import etree
data = urllib2.urlopen("https://en.wikipedia.org")
tree = etree.parse(data, etree.HTMLParser())  # parse as HTML, not XML
element = tree.xpath('//div[@id="p-logo"]/a')[0]
print tree.getpath(element)
Prints:
/html/body/div[4]/div[2]/div[1]/a
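To tie this back to Selenium: since a WebElement doesn't expose its own XPath, one workaround is to hand Selenium's rendered page_source to lxml and let getpath() produce an absolute path for whatever your expression matches. Here is a minimal sketch, reusing the Wikipedia URL and logo XPath from above; note that lxml parses the raw HTML, so the generated path may differ slightly from what the live browser DOM shows:
import selenium.webdriver
from lxml import etree

driver = selenium.webdriver.Firefox()
driver.get("https://en.wikipedia.org")

# parse the HTML Selenium rendered and ask lxml for absolute paths
root = etree.HTML(driver.page_source)
tree = root.getroottree()
for el in tree.xpath('//div[@id="p-logo"]/a'):
    print(tree.getpath(el))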

Related

Python Search multiple html files in a variable

I have used a Selenium driver to crawl through many site pages. Every time I get a new page I append its HTML to a variable called "All_APP_Pages", so All_APP_Pages holds the HTML of many pages. I did not post that code because it is long and not relevant to the issue. Python reports "All_APP_Pages" as being of type bytes.
from lxml import html
from lxml import etree
import xml.etree.ElementTree as ET
from selenium.webdriver.common.by import By
dom = etree.HTML(All_APP_Pages)
xp = "//tr[.//span[contains(.,'Product Data Solutions (UHC MR)')] and .//td[contains(.,'SQLServer')] and .//td[contains(.,'MR')]]//a"
link = dom.xpath(xp)
print(link)
Once all pages have been scanned I need to get the link from this xpath
"//tr[.//span[contains(.,'Product Data Solutions (ABC MR)')] and .//td[contains(.,'SQLServer')] and .//td[contains(.,'MR')]]//a"
The xpath listed here works. However, it only works with the Selenium driver if the driver is on the page where this link exists. That is why all pages are in one variable, since I don't know which page the link will be on. The print shows this result:
[<Element a at 0x1c39dea1180>]
How do I get this value from link so I can check whether the value is correct?
You need to iterate the list and get the href value
dom = etree.HTML(All_APP_Pages)
xp = "//tr[.//span[contains(.,'Product Data Solutions (UHC MR)')] and .//td[contains(.,'SQLServer')] and .//td[contains(.,'MR')]]//a"
link = dom.xpath(xp)
hrefs=[l.attrib["href"] for l in link]
print(hrefs)
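If the goal is just to verify that the expected link came back, a simple membership check on hrefs is enough (the URL below is a hypothetical placeholder):
expected = "http://example.com/some-report"  # hypothetical expected href
if expected in hrefs:
    print("link found:", expected)
else:
    print("link not found; got:", hrefs)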

how to find proper xpath for selenium?

I'm trying to scrape this page: https://www.bitmex.com/app/trade/XBTUSD
to get the Open Interest data on the left side of the page. I am at this stage:
import bs4
from bs4 import BeautifulSoup
import requests
import re
from selenium import webdriver
import urllib.request
r = requests.get('https://www.bitmex.com/app/trade/XBTUSD')
url = "https://www.bitmex.com/app/trade/XBTUSD"
page = urllib.request.urlopen('https://www.bitmex.com/app/trade/XBTUSD')
soup = bs4.BeautifulSoup(r.text, 'xml')
resultat = soup.find_all(text=re.compile("Open Interest"))
driver = webdriver.Firefox(executable_path='C:\\Users\\Samy\\Desktop\\geckodriver\\geckodriver.exe')
results = driver.find_elements_by_xpath("//*[@class='contractStats hoverContainer block']//*[@class='value']/html/body/div[1]/div/span/div[1]/div/div[2]/li/ul/div/div/div[2]/div[4]/span[2]/span/span[1]")
print(len(results))
I get 0 as a result. I tried several different things for the results variable (also driver.find_elements_by_xpath("//span[@class='price']/text()")), but can't seem to find the way. I know the problem is in the XPath I copy, but I can't clearly understand the issue despite reading Why does this xpath fail using lxml in python? and https://stackoverflow.com/a/43095252/7937578
I was using only the XPath obtained by copying, but after reading those SO questions I added the [@class...] part at the beginning; I'm still missing something, though. Thank you if you know how to help!
If I understood your requirements correctly, the following script should fetch you the required content from that page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
link = "https://www.bitmex.com/app/trade/XBTUSD"
with webdriver.Firefox() as driver:
    driver.get(link)
    wait = WebDriverWait(driver,10)
    items = [item.text for item in wait.until(EC.presence_of_all_elements_located((By.XPATH,"//*[@class='lineItem']/span[@class='hoverHidden'][.//*[contains(.,'Open Interest')]]//span[@class='key' or @class='value']")))]
    print(items)
Output at this moment:
['Open Interest', '640,089,423 USD']
I don't know why it fails, but I think the best way to find any element is by full XPath.
Something that looks like this:
homebutton = driver.find_element_by_xpath("/html/body/header/div/div[1]/a[2]/span")
Give it a try.
A full path is not the best one; it is also harder to read. An XPath is a 'filter': try to find some unique attributes for the needed control, or a unique description of its parent. Look, the needed span has the 'value' class and is located inside a span with the 'tooltipWrapper' class; that parent span also has another child with the 'key' class and the text 'Open Interest'. There are thousands of possible locators; I can suggest two:
//span[@class = 'tooltipWrapper' and span[string() = 'Open Interest']]//span[@class = 'value']
//span[@class = 'key' and text() = 'Open Interest']/..//span[@class = 'value']
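For example, the first locator above could be combined with an explicit wait, along the lines of the earlier answer. A rough sketch (the 10-second timeout is an assumption):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

xp = "//span[@class = 'tooltipWrapper' and span[string() = 'Open Interest']]//span[@class = 'value']"
with webdriver.Firefox() as driver:
    driver.get("https://www.bitmex.com/app/trade/XBTUSD")
    value = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, xp)))
    print(value.text)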

Generate a list from HTML elements using Python

I am using selenium and BeautifulSoup to create a few lists from wikipedia pages. When I look at the page source, the links I want to get the information from are always structured as:
<li>town_name, state</li>
There is a link within the tag that you can click on that will direct you to that town's wiki page. It is always /wiki/town_name,_California
I want to use a for loop in Python to find every item with this structure but am unclear how to write the regular expression. I tried:
my_link = "//wiki//*,California"
and
my_link = "//wiki//*,_California"
But when I tried to run:
br.find_element_by_link_text(my_link)
These returned similar errors:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"link text","selector":"//wiki//*,_California"}
I also tried:
import selenium, time
import html5lib
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
br = webdriver.Chrome()
url = "http://somewikipage.org"
br.get(url)
pg_src = br.page_source.encode("utf")
soup = BeautifulSoup(pg_src)
lnkLst = []
for lnk in br.find_element_by_partial_link_text(",_California"):
    lnkLst.append(lnk)
and got this:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"partial link text","selector":",_California"}
Is there any way I can correct this code so I can build a list of my targeted links?
As you mentioned in your question, br.find_element_by_partial_link_text(",_California") didn't work; that's because ,_California is not really the link text, as per the HTML you provided.
As per your question, we need to find the <a> tag which contains the attribute href="/wiki/town_name,_California". So you can use either of the following options:
css_selector:
br.find_element_by_css_selector("a[href='/wiki/town_name,_California']")
xpath:
br.find_element_by_xpath("//a[@href='/wiki/town_name,_California']")
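Since the town name varies from link to link, a contains() match on the href is probably closer to what the question needs. A sketch (XPath 1.0 has no ends-with(), so contains() is the usual workaround):
links = br.find_elements_by_xpath("//a[contains(@href, ',_California')]")
lnkLst = [a.get_attribute("href") for a in links]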
Read up on css selectors, they are your friend. I think the following should work.
hrefs = [a['href'] for a in soup.select('li a[href^="/wiki/"]')]
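Put together with the Selenium source from the question, that could look roughly like this (the URL is the placeholder from the question; the endswith filter narrows the result to the California links):
from bs4 import BeautifulSoup
from selenium import webdriver

br = webdriver.Chrome()
br.get("http://somewikipage.org")

soup = BeautifulSoup(br.page_source, "html.parser")
lnkLst = [a["href"] for a in soup.select('li a[href^="/wiki/"]')
          if a["href"].endswith(",_California")]
print(lnkLst)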

Python: xpath unable to find element

I want to get the text for the featured companies from a link. I inspected it and got the XPath, but Selenium is unable to find the element. The link always changes, but it always ends with listedcompany.com.
The text I want to scrape is highlighted in the screenshot.
from selenium import webdriver
import time

browser = webdriver.Firefox()
browser.get("https://www.shareinvestor.com/my")
time.sleep(20)
browser.find_element_by_xpath("//*[@href='http://salcon.listedcompany.com']")
The error is
selenium.common.exceptions.NoSuchElementException: Message: u'Unable to locate element: {"method":"xpath","selector":"//*[@href=\'http://salcon.listedcompany.com\']"}' ; Stacktrace:
I want to get the text for those companies
If you need the text below the Featured Companies tab, you can use this code:
import requests
from parsel import Selector
url = 'https://www.shareinvestor.com/my'
r = requests.get(url)
sel = Selector(r.text)
all_text = sel.xpath('//div[@class="sic_scrollPane" and a[img]]')
for ind, text in enumerate(all_text, start=1):
    text = ''.join(text.xpath('p//text()').extract())
    print(ind, text)
It gets you all text from that tab without the use of Selenium.
Note: I use the Parsel library, which is built on top of lxml, but you can use bs4 or lxml directly.
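For comparison, here is roughly the same extraction done with plain lxml instead of Parsel, a sketch reusing the same URL and XPath:
import requests
from lxml import html

r = requests.get('https://www.shareinvestor.com/my')
tree = html.fromstring(r.text)
for ind, node in enumerate(tree.xpath('//div[@class="sic_scrollPane" and a[img]]'), start=1):
    print(ind, ''.join(node.xpath('p//text()')))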
Try the XPath "//a[contains(@href, 'listedcompany.com')]" to match all links with an href attribute that contains "listedcompany.com", as below:
browser = webdriver.Firefox()
browser.get("https://www.shareinvestor.com/my")
time.sleep(20)
link_text_list = [link.text for link in browser.find_elements_by_xpath("//a[contains(@href, 'listedcompany.com')]") if link.text]

Trying to select element by xpath with Selenium but getting error "Unable to locate element"

I am trying to scrape the list of followings for a given instagram user. This requires using Selenium to navigate to the user's Instagram page and then clicking "following". However, I cannot seem to click the "following" button with Selenium.
from selenium import webdriver

driver = webdriver.Chrome()
url = 'https://www.instagram.com/beforeeesunrise/'
driver.get(url)
driver.find_element_by_xpath('//*[@id="react-root"]/section/main/article/header/div[2]/ul/li[3]/a').click()
However, this results in a NoSuchElementException. I copied the XPath from the HTML, tried using the class name, partial link text, and full link text, and cannot seem to get this to work! I've also made sure that the above XPath includes the element with a "click" event listener.
UPDATE: By logging in I was able to get the above information. However (!), now I cannot get the resulting list of "followings". When I click on the button with the driver, the html does not include the information in the pop up dialog that you see on Instagram. My goal is to get all of the users that the given username is following.
Make sure you are using the correct XPath.
Use the following link to get proper XPaths to access web elements and then try:
Selenium Command
Hope this helps to solve the problem!
Try a different XPath. I've verified this is unique on the page.
driver.find_element_by_xpath("//a[contains(.,'following')]")
Providing rich element-finding functionality from a web-scraping perspective is not Selenium's main goal, so a better option is to delegate that task to a dedicated tool like BeautifulSoup. Once we find what we're looking for, we can ask Selenium to interact with the element.
The bridge between Selenium and BeautifulSoup is the amazing function below, which I found here. The function takes a single BeautifulSoup element and generates a unique XPath that we can use in Selenium.
import os
import re
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import itertools

def xpath_soup(element):
    """
    Generate xpath of soup element
    :param element: bs4 text or node
    :return: xpath as string
    """
    components = []
    child = element if element.name else element.parent
    for parent in child.parents:
        """
        @type parent: bs4.element.Tag
        """
        previous = itertools.islice(parent.children, 0, parent.contents.index(child))
        xpath_tag = child.name
        xpath_index = sum(1 for i in previous if i.name == xpath_tag) + 1
        components.append(xpath_tag if xpath_index == 1 else '%s[%d]' % (xpath_tag, xpath_index))
        child = parent
    components.reverse()
    return '/%s' % '/'.join(components)

driver = webdriver.Chrome(executable_path=YOUR_CHROMEDRIVER_PATH)
driver.get(url = 'https://www.instagram.com/beforeeesunrise/')
source = driver.page_source
soup = bs(source, 'html.parser')
button = soup.find('button', text=re.compile(r'Follow'))
xpath_for_the_button = xpath_soup(button)
elm = driver.find_element_by_xpath(xpath_for_the_button)
elm.click()
...and it works!
(But you need to write some code to log in with an account first.)
