Generate a list from HTML elements using Python

I am using Selenium and BeautifulSoup to build a few lists from Wikipedia pages. When I look at the page source, the links I want to get information from are always structured as:
<li>town_name, state</li>
Within each tag there is a link you can click that directs you to that town's wiki page; its href is always /wiki/town_name,_California.
I want to use a for loop in Python to find every item with this structure, but I am unclear how to write the matching pattern. I tried:
my_link = "//wiki//*,California"
and
my_link = "//wiki//*,_California"
But when I tried to run:
br.find_element_by_link_text(my_link)
These returned similar errors:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"link text","selector":"//wiki//*,_California"}
I also tried:
import selenium, time
import html5lib
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

br = webdriver.Chrome()
url = "http://somewikipage.org"
br.get(url)

pg_src = br.page_source.encode("utf")
soup = BeautifulSoup(pg_src, "html5lib")

lnkLst = []
for lnk in br.find_element_by_partial_link_text(",_California"):
    lnkLst.append(lnk)
and got this:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"partial link text","selector":",_California"}
Is there any way I can correct this code so I can build a list of my targeted links?

As you mention in your question, br.find_element_by_partial_link_text(",_California") didn't work. That is because ,_California is not really the link text: link-text locators match the visible text of the link (town_name, state in your HTML), not the href.
What we actually need is the <a> tag whose href attribute is "/wiki/town_name,_California". You can use either of the following options:
css_selector:
br.find_element_by_css_selector("a[href='/wiki/town_name,_California']")
xpath:
br.find_element_by_xpath("//a[@href='/wiki/town_name,_California']")
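If you need the whole list rather than a single element, here is a minimal sketch using the suffix form ($=) of the CSS attribute selector (an assumption on my part: the links sit inside <li> items as in your HTML):
# collect every link whose href ends with ",_California"
links = br.find_elements_by_css_selector('li a[href$=",_California"]')
lnkLst = [link.get_attribute("href") for link in links]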

Read up on CSS selectors; they are your friend. I think the following should work:
hrefs = [a['href'] for a in soup.select('li a[href^="/wiki/"]')]
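To narrow that to the California towns only, the suffix selector li a[href$=",_California"] should also work, since $= matches the end of the attribute value.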

Related

extract information inside span tag

I am trying to extract the PMC ID from a <span> tag.
To do so, I used find_element_by_xpath, but I'm facing the following error:
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: /div/main/div/details/div/div[2]/details/summary/span[5]
Following is the link:
https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?tool=my_tool&email=my_email@example.com&ids=9811893
Following is my code:
from selenium import webdriver

driver = webdriver.Firefox(executable_path='geckodriver.exe')
driver.implicitly_wait(10)  # wait up to 10 seconds when locating elements
driver.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?tool=my_tool&email=my_email@example.com&ids=9811893")
pmc = driver.find_element_by_xpath('/div/main/div/details/div/div[2]/details/summary/span[5]')
print(pmc.text)
The output should be:
PMC24938
You can use a CSS attribute selector, then get_attribute to read the attribute value:
from selenium import webdriver
driver = webdriver.Firefox(executable_path='geckodriver.exe')
driver.get("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?tool=my_tool&email=my_email#example.com&ids=9811893")
pmc = driver.find_element_by_css_selector('[pmcid]')
print(pmc.get_attribute('pmcid'))
Result:
PMC24938
Though you don't need Selenium for this site; the faster requests and bs4 combination works:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?tool=my_tool&email=my_email@example.com&ids=9811893')
soup = bs(r.content, 'lxml')
pmc = soup.select_one('[pmcid]')['pmcid']
print(pmc)

Python: xpath unable to find element

I want to get the text of the featured companies from a link. I inspected it and got the XPath, but Selenium is unable to find the element. The links always change, but they all end with listedcompany.com.
The text I want to scrape is highlighted in the screenshot.
import time
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://www.shareinvestor.com/my")
time.sleep(20)
browser.find_element_by_xpath("//*[@href='http://salcon.listedcompany.com']")
The error is
selenium.common.exceptions.NoSuchElementException: Message: u'Unable to locate element: {"method":"xpath","selector":"//*[@href=\'http://salcon.listedcompany.com\']"}' ; Stacktrace:
I want to get the text for those companies
If you need the text below the Featured Companies tab, you can use this code:
import requests
from parsel import Selector

url = 'https://www.shareinvestor.com/my'
r = requests.get(url)
sel = Selector(r.text)
all_text = sel.xpath('//div[@class="sic_scrollPane" and a[img]]')
for ind, text in enumerate(all_text, start=1):
    text = ''.join(text.xpath('p//text()').extract())
    print(ind, text)
It gets you all the text from that tab without using Selenium.
Note: I use the Parsel library, built on top of lxml, but you can use bs4 or lxml instead.
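For comparison, a rough bs4 equivalent of the snippet above (a sketch, assuming the same div.sic_scrollPane markup):
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.shareinvestor.com/my')
soup = BeautifulSoup(r.text, 'lxml')
# keep only the scroll panes that contain a linked image, mirroring the XPath
panes = [div for div in soup.select('div.sic_scrollPane') if div.select_one('a img')]
for ind, div in enumerate(panes, start=1):
    text = ''.join(p.get_text() for p in div.find_all('p'))
    print(ind, text)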
Try to use "//a[contains(#href, 'listedcompany.com')]" XPath to match all links with href attribute that contains "listedcompany.com" as below:
browser = webdriver.Firefox()
browser.get("https://www.shareinvestor.com/my")
time.sleep(20)
lint_text_list = [link.text for link in browser.find_elements_by_xpath("//a[contains(#href, 'listedcompany.com')]") if link.text]

Scraping hidden product details on a webpage using Selenium

Sorry, I am a Selenium noob and have done a lot of reading, but I am still having trouble getting the product price (£0.55) from this page:
https://groceries.asda.com/product/spaghetti-tagliatelle/asda-spaghetti/36628. Product details are not visible when parsing the HTML using bs4. Using Selenium I can get a string of the entire page and can see the price in there (using the following code). I should be able to extract the price from this somehow, but would prefer a less hacky solution.
browser = webdriver.Firefox(executable_path=r'C:\Users\Paul\geckodriver.exe')
browser.get('https://groceries.asda.com/product/tinned-tomatoes/asda-smart-price-chopped-tomatoes-in-tomato-juice/19560')
content = browser.page_source
If I run something like this:
elem = driver.find_element_by_id("bodyContainerTemplate")
print(elem)
It just returns: selenium.webdriver.firefox.webelement.FirefoxWebElement (session="df23fae6-e99c-403c-a992-a1adf1cb8010", element="6d9aac0b-2e98-4bb5-b8af-fcbe443af906")
The price is the text associated with this element: p class="prod-price" but I cannot seem to get this working. How should I go about getting this text (the product price)?
The type of elem is WebElement. If you need to extract the text value of a web element, you can use the code below:
elem = driver.find_element_by_class_name("prod-price-inner")
print(elem.text)
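If the price is rendered late by JavaScript, an explicit wait is more reliable than a bare find_element. A sketch (the class name prod-price-inner is taken from the snippet above):
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# wait up to 10 seconds for the price element to appear, then read its text
elem = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "prod-price-inner")))
print(elem.text)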
Try this solution; it works with Selenium and BeautifulSoup:
from bs4 import BeautifulSoup
from selenium import webdriver
url='https://groceries.asda.com/product/spaghetti-tagliatelle/asda-spaghetti/36628'
driver = webdriver.PhantomJS()
driver.get(url)
data = driver.page_source
soup = BeautifulSoup(data, 'html.parser')
ele = soup.find('span',{'class':'prod-price-inner'})
print(ele.text)
driver.quit()
It will print :
£0.55

Parsing HTML Content using BeautifulSoup & Selenium

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
import csv
import requests
import re
driver2 = webdriver.Chrome()
driver2.get("http://www.squawka.com/match-results?ctl=10_s2015")
soup=BeautifulSoup(driver2.page_source)
print(soup)
driver2.quit()
I'm trying to get the href of every td with class "match-centre". I need Selenium to navigate through the pages, but I'm struggling to combine the two so I can change the menu options, move through the different pages, and feed the links into my other code.
I've researched and tried ('inner-html') and the page_source approach currently in the code, but neither gets the links I need.
Does anyone have a solution to get these links and navigate the page? Could there be a way to get the XML of this page to list all the links?
Not sure why you would need BeautifulSoup (BS) here. Selenium alone is capable of locating elements and navigating through links on a page. For example, to get all the links to the match details pages you can do as follows:
>>> matches = driver.find_elements_by_xpath("//td[#class='match-centre']/a")
>>> print [match.get_attribute("href") for match in matches]
As for navigating through the pages, you can use the following XPath:
//span[contains(@class,'page-numbers')]/following-sibling::a[1]
The above XPath finds the link to the next page. To navigate through all the pages, you can use a while loop; while the link to the next page is found:
perform a click action on the link,
grab all the href values from the current page,
locate the next page link again (a sketch follows this list).
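A minimal sketch of that loop, assuming the two XPath expressions above still match the page:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.squawka.com/match-results?ctl=10_s2015")

next_page_xpath = "//span[contains(@class,'page-numbers')]/following-sibling::a[1]"
all_hrefs = []

while True:
    # grab all match-centre links on the current page
    matches = driver.find_elements_by_xpath("//td[@class='match-centre']/a")
    all_hrefs.extend(match.get_attribute("href") for match in matches)
    # locate the next-page link; stop when there is none
    next_links = driver.find_elements_by_xpath(next_page_xpath)
    if not next_links:
        break
    next_links[0].click()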

How to get an XPath from selenium webelement or from lxml?

I am using selenium and I need to find the XPaths of some selenium web elements.
For example:
import selenium.webdriver
driver = selenium.webdriver.Firefox()
element = driver.find_element_by_xpath(<some_xpath>)
elements = element.find_elements_by_xpath(<some_relative_xpath>)
for e in elements:
    print(e.get_xpath())
I know I can't get the XPath from the element itself, but is there a nice way to get it anyway?
I tried using lxml to parse the HTML, but it doesn't recognize the XPath, <some_xpath>, I passed, even though driver.find_element_by_xpath(<some_xpath>) did manage to find that element.
lxml can auto-generate an absolute XPath for you using the getpath() method.
Example (using the Wikipedia main page, getting the XPath expression for the logo):
from urllib.request import urlopen
from lxml import etree

data = urlopen("https://en.wikipedia.org")
tree = etree.parse(data, etree.HTMLParser())
element = tree.xpath('//div[@id="p-logo"]/a')[0]
print(tree.getpath(element))
Prints:
/html/body/div[4]/div[2]/div[1]/a
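To apply the same idea to a Selenium session, a minimal sketch: parse the browser-rendered source with lxml and call getpath() there. Note the rendered DOM can differ from the raw HTML fetched by urllib, which may be why your XPath worked in Selenium but not in lxml:
from lxml import html
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://en.wikipedia.org")

# parse the rendered page source, then generate absolute paths with getpath()
tree = html.fromstring(driver.page_source)
root = tree.getroottree()
for element in tree.xpath('//div[@id="p-logo"]/a'):
    print(root.getpath(element))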
