Python: XPath unable to find element

I want to get the text of the featured companies from a link. I inspected it and got the XPath, but Selenium is unable to find the element. The link changes every time, but it always ends with listedcompany.com.
The text I want to scrape is highlighted in the screenshot.
from selenium import webdriver
import time

browser = webdriver.Firefox()
browser.get("https://www.shareinvestor.com/my")
time.sleep(20)
browser.find_element_by_xpath("//*[@href='http://salcon.listedcompany.com']")
The error is
selenium.common.exceptions.NoSuchElementException: Message: u'Unable to locate element: {"method":"xpath","selector":"//*[@href=\'http://salcon.listedcompany.com\']"}' ; Stacktrace:
I want to get the text for those companies

If you need the text below the Featured Companies tab, you can use this code:
import requests
from parsel import Selector
url = 'https://www.shareinvestor.com/my'
r = requests.get(url)
sel = Selector(r.text)
all_text = sel.xpath('//div[@class="sic_scrollPane" and a[img]]')
for ind, text in enumerate(all_text, start=1):
    text = ''.join(text.xpath('p//text()').extract())
    print(ind, text)
It gets you all text from that tab without the use of Selenium.
Note: I use the Parsel library, which is built on top of lxml, but you can use bs4 or lxml instead.
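For completeness, a rough bs4 equivalent is sketched below; it assumes the same div.sic_scrollPane structure (a link wrapping a logo image) that the parsel XPath above relies on:
import requests
from bs4 import BeautifulSoup

url = 'https://www.shareinvestor.com/my'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

for ind, div in enumerate(soup.select('div.sic_scrollPane'), start=1):
    if div.select_one('a > img') is None:
        continue  # keep only panes whose link wraps a logo image, mirroring the XPath's a[img] test
    text = ''.join(p.get_text() for p in div.find_all('p'))
    print(ind, text)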

Try the "//a[contains(@href, 'listedcompany.com')]" XPath to match all links whose href attribute contains "listedcompany.com", as below:
import time
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://www.shareinvestor.com/my")
time.sleep(20)
link_text_list = [link.text for link in browser.find_elements_by_xpath("//a[contains(@href, 'listedcompany.com')]") if link.text]
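A fixed time.sleep(20) is fragile; an explicit wait retries until the links actually exist. A minimal sketch with the same locator:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()
browser.get("https://www.shareinvestor.com/my")
# Wait up to 20 seconds for at least one matching link instead of sleeping blindly.
links = WebDriverWait(browser, 20).until(
    EC.presence_of_all_elements_located((By.XPATH, "//a[contains(@href, 'listedcompany.com')]")))
link_text_list = [link.text for link in links if link.text]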

Related

Extract information inside a span tag

I am trying to extract the PMC ID between "span" tags.
To do so, I used find_element_by_xpath, but I'm facing the following error:
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: /div/main/div/details/div/div[2]/details/summary/span[5]
Following is the link:
https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?tool=my_tool&email=my_email@example.com&ids=9811893
Following is my code:
from selenium import webdriver

driver = webdriver.Firefox(executable_path='geckodriver.exe')
driver.implicitly_wait(10)  # poll up to 10 seconds when locating elements
driver.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?tool=my_tool&email=my_email@example.com&ids=9811893")
pmc= driver.find_element_by_xpath('/div/main/div/details/div/div[2]/details/summary/span[5]')
pmc.get_text()
The output should be:
PMC24938
You can use a CSS attribute selector, then get_attribute to read the attribute value:
from selenium import webdriver
driver = webdriver.Firefox(executable_path='geckodriver.exe')
driver.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?tool=my_tool&email=my_email@example.com&ids=9811893")
pmc = driver.find_element_by_css_selector('[pmcid]')
print(pmc.get_attribute('pmcid'))
Result:
PMC24938
You don't need Selenium for this site, though. Use the faster requests and bs4:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?tool=my_tool&email=my_email@example.com&ids=9811893')
soup = bs(r.content, 'lxml')
pmc = soup.select_one('[pmcid]')['pmcid']
print(pmc)
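The ID converter API also offers a JSON response via a format parameter, which avoids parsing markup altogether. A sketch, assuming the documented records/pmcid keys (verify against a live response):
import requests

params = {
    'tool': 'my_tool',
    'email': 'my_email@example.com',
    'ids': '9811893',
    'format': 'json',  # ask the API for JSON instead of the default XML
}
r = requests.get('https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/', params=params)
print(r.json()['records'][0]['pmcid'])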

Generate a list from HTML elements using Python

I am using selenium and BeautifulSoup to create a few lists from wikipedia pages. When I look at the page source, the links I want to get the information from are always structured as:
<li>town_name, state</li>
There is a link within the tag that you can click on that will direct you to that town's wiki page. It is always /wiki/town_name,_California
I want to use a for loop in Python to find every item with this structure but am unclear how to write the regular expression. I tried:
my_link = "//wiki//*,California"
and
my_link = "//wiki//*,_California"
But when I tried to run:
br.find_element_by_link_text(my_link)
These returned similar errors:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"link text","selector":"//wiki//*,_California"}
I also tried:
import selenium, time
import html5lib
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
br = webdriver.Chrome()
url = "http://somewikipage.org"
br.get(url)
pg_src = br.page_source.encode("utf")
soup = BeautifulSoup(pg_src, "html5lib")
lnkLst = []
for lnk in br.find_element_by_partial_link_text(",_California"):
    lnkLst.append(lnk)
and got this:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"partial link text","selector":",_California"}
Is there any way I can correct this code so I can build a list of my targeted links?
As you mentioned in your question, br.find_element_by_partial_link_text(",_California") didn't work; that's because ,_California is not really the link text, as per the HTML you provided.
As per your question, we need to find the <a> tag which contains the attribute href="/wiki/town_name,_California", so you can use either of the following options:
css_selector:
br.find_element_by_css_selector("a[href='/wiki/town_name,_California']")
xpath:
br.find_element_by_xpath("//a[#href='/wiki/town_name,_California']")
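Since the town name varies from link to link, a contains()-based XPath can also collect every California link in one call; a sketch reusing the question's br driver:
# Match every anchor whose href mentions ",_California", then keep the URLs.
links = br.find_elements_by_xpath("//a[contains(@href, ',_California')]")
lnkLst = [link.get_attribute('href') for link in links]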
Read up on CSS selectors; they are your friend. I think the following should work.
hrefs = [a['href'] for a in soup.select('li a[href^="/wiki/"]')]
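End to end, that selector can build the list without Selenium at all; a sketch (the URL is the question's placeholder):
import requests
from bs4 import BeautifulSoup

r = requests.get("http://somewikipage.org")  # placeholder URL from the question
soup = BeautifulSoup(r.text, "html.parser")

# Keep internal wiki links, then narrow to the California town pages.
hrefs = [a['href'] for a in soup.select('li a[href^="/wiki/"]')
         if a['href'].endswith(',_California')]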

Scraping hidden product details on a webpage using Selenium

Sorry, I am a Selenium noob and have done a lot of reading, but I am still having trouble getting the product price (£0.55) from this page:
https://groceries.asda.com/product/spaghetti-tagliatelle/asda-spaghetti/36628. Product details are not visible when parsing the HTML using bs4. Using Selenium I can get a string of the entire page and can see the price in there (using the following code). I should be able to extract the price from this somehow, but I would prefer a less hacky solution.
browser = webdriver.Firefox(executable_path=r'C:\Users\Paul\geckodriver.exe')
browser.get('https://groceries.asda.com/product/tinned-tomatoes/asda-smart-price-chopped-tomatoes-in-tomato-juice/19560')
content = browser.page_source
If I run something like this:
elem = driver.find_element_by_id("bodyContainerTemplate")
print(elem)
It just returns: selenium.webdriver.firefox.webelement.FirefoxWebElement (session="df23fae6-e99c-403c-a992-a1adf1cb8010", element="6d9aac0b-2e98-4bb5-b8af-fcbe443af906")
The price is the text associated with this element: p class="prod-price" but I cannot seem to get this working. How should I go about getting this text (the product price)?
The type of elem is WebElement. If you need to extract the text value of a web element, you can use the code below:
elem = driver.find_element_by_class_name("prod-price-inner")
print(elem.text)
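Since the ASDA page renders the product details with JavaScript, waiting for the element is safer than assuming it is already present. A sketch with the same class name:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://groceries.asda.com/product/spaghetti-tagliatelle/asda-spaghetti/36628')
# Wait up to 10 seconds for the price element rendered by the page's JS.
elem = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'prod-price-inner')))
print(elem.text)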
Try this solution, it works with selenium and beautifulsoup
from bs4 import BeautifulSoup
from selenium import webdriver
url='https://groceries.asda.com/product/spaghetti-tagliatelle/asda-spaghetti/36628'
driver = webdriver.PhantomJS()
driver.get(url)
data = driver.page_source
soup = BeautifulSoup(data, 'html.parser')
ele = soup.find('span',{'class':'prod-price-inner'})
print(ele.text)
driver.quit()
It will print:
£0.55
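Note that PhantomJS support has been removed from newer Selenium releases; headless Firefox gives the same page_source handoff. A sketch:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True  # run Firefox without a visible window
driver = webdriver.Firefox(options=options)
driver.get('https://groceries.asda.com/product/spaghetti-tagliatelle/asda-spaghetti/36628')
soup = BeautifulSoup(driver.page_source, 'html.parser')
ele = soup.find('span', {'class': 'prod-price-inner'})
print(ele.text)
driver.quit()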

Trying to select element by xpath with Selenium but getting error "Unable to locate element"

I am trying to scrape the list of followings for a given instagram user. This requires using Selenium to navigate to the user's Instagram page and then clicking "following". However, I cannot seem to click the "following" button with Selenium.
driver = webdriver.Chrome()
url = 'https://www.instagram.com/beforeeesunrise/'
driver.get(url)
driver.find_element_by_xpath('//*[@id="react-root"]/section/main/article/header/div[2]/ul/li[3]/a').click()
However, this results in a NoSuchElementException. I copied the xpath from the html, tried using the class name, partial link and full link and cannot seem to get this to work! I've also made sure that the above xpath include the element with a "click" event listener.
UPDATE: By logging in I was able to get the above information. However (!), now I cannot get the resulting list of "followings". When I click on the button with the driver, the html does not include the information in the pop up dialog that you see on Instagram. My goal is to get all of the users that the given username is following.
Make sure you are using the correct XPath.
Use the following link to get exact XPaths to access web elements, then try again:
Selenium Command
Hope this helps to solve the problem!
Try a different XPath. I've verified this is unique on the page.
driver.find_element_by_xpath("//a[contains(.,'following')]")
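If the element exists but the click still fails (Instagram renders with React, so elements appear late), wrapping the click in an explicit wait helps. A sketch with the same XPath:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.instagram.com/beforeeesunrise/')
# Wait until the link is actually clickable, not merely present in the DOM.
WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, "//a[contains(.,'following')]"))).click()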
Providing rich element-finding functionality for web scraping is not Selenium's main goal, so a better option is to delegate that task to a dedicated tool like BeautifulSoup. Once we find what we're looking for, we can ask Selenium to interact with the element.
The bridge between selenium and BeautifulSoup will be this amazing function below that I found here. The function gets a single BeautifulSoup element and generates a unique XPATH that we can use on selenium.
import itertools
import re

from bs4 import BeautifulSoup as bs
from selenium import webdriver

def xpath_soup(element):
    """
    Generate xpath of soup element
    :param element: bs4 text or node
    :return: xpath as string
    """
    components = []
    child = element if element.name else element.parent
    for parent in child.parents:
        # type parent: bs4.element.Tag
        previous = itertools.islice(parent.children, 0, parent.contents.index(child))
        xpath_tag = child.name
        xpath_index = sum(1 for i in previous if i.name == xpath_tag) + 1
        components.append(xpath_tag if xpath_index == 1 else '%s[%d]' % (xpath_tag, xpath_index))
        child = parent
    components.reverse()
    return '/%s' % '/'.join(components)
driver = webdriver.Chrome(executable_path=YOUR_CHROMEDRIVER_PATH)
driver.get(url = 'https://www.instagram.com/beforeeesunrise/')
source = driver.page_source
soup = bs(source, 'html.parser')
button = soup.find('button', text=re.compile(r'Follow'))
xpath_for_the_button = xpath_soup(button)
elm = driver.find_element_by_xpath(xpath_for_the_button)
elm.click()
...and it works!
(But you need to write some code to log in with an account.)
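For that login step, a minimal sketch; the field names and button selector are assumptions about Instagram's login form and may well have changed:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.instagram.com/accounts/login/')
# Field names are assumptions; inspect the live form before relying on them.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.NAME, 'username'))).send_keys('your_username')
driver.find_element_by_name('password').send_keys('your_password')
driver.find_element_by_xpath("//button[@type='submit']").click()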

How to get an XPath from selenium webelement or from lxml?

I am using selenium and I need to find the XPaths of some selenium web elements.
For example:
import selenium.webdriver
driver = selenium.webdriver.Firefox()
element = driver.find_element_by_xpath(<some_xpath>)
elements = element.find_elements_by_xpath(<some_relative_xpath>)
for e in elements:
    print(e.get_xpath())
I know I can't get the XPath from the element itself, but is there a nice way to get it anyway?
I tried using lxml to parse the HTML, but it doesn't recognize the XPath, <some_xpath>, I passed, even though driver.find_element_by_xpath(<some_xpath>) did manage to find that element.
lxml can auto-generate an absolute XPath for you using the getpath() method.
Example (using wikipedia main page, getting xpath expression for the logo):
from urllib.request import urlopen
from lxml import etree

data = urlopen("https://en.wikipedia.org")
tree = etree.parse(data, etree.HTMLParser())  # HTML parser: the page is not well-formed XML
element = tree.xpath('//div[@id="p-logo"]/a')[0]
print(tree.getpath(element))
Prints:
/html/body/div[4]/div[2]/div[1]/a
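To tie this back to Selenium, you can feed driver.page_source into lxml and call getpath() the same way; a sketch (absolute paths like these break whenever the page layout shifts):
from lxml import etree
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://en.wikipedia.org")
# Parse the rendered source and wrap it in an ElementTree so getpath() is available.
tree = etree.HTML(driver.page_source).getroottree()
for element in tree.xpath('//div[@id="p-logo"]/a'):
    print(tree.getpath(element))  # an absolute /html/body/... expression Selenium can reuse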
