New to Python and Selenium WebDriver. I am trying to check all the links on my own webpage and use each one's HTTP status code to see whether it is a broken link or not. The code that I am running (reduced from the original)...
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests

driver = webdriver.Chrome()           # driver setup was trimmed out of the original snippet
driver.get("http://example.com")      # placeholder for my own webpage

links = driver.find_elements_by_xpath("//a[@href]")
while len(links):
    url = links.pop()
    url = url.get_attribute("href")
    print(url)
The html looks like...
<ul>
  <li><a href="http://www.google.com">visit google</a></li>
  <li><a href="http://brokenlink">broken link ex</a></li>
</ul>
When I run my script, the only link that gets printed is the google link and not the broken link. I have run some test cases, and it seems that only the links whose href includes the phrase "http://www" get printed. Although I can change the href links on my webpage to include this phrase, I have specific reasons why they cannot be included.
If I can just get all the links (with or without the "http://www" phrase) using driver.find_elements_by_xpath("//a[@href]"), then I can convert these later in the script to include the phrase and then get the HTTP status codes.
I saw other posts but none that helped me get over this obstacle. Any clarification/workaround/hint would be appreciated.
The following list comprehension should get you a list of all links. It locates all anchor tags and generates a list containing the 'href' attribute of each element:
links = [elem.get_attribute("href") for elem in driver.find_elements_by_tag_name('a')]
Here is the same thing broken down into small steps and wrapped in a function:
def get_all_links(driver):
    links = []
    elements = driver.find_elements_by_tag_name('a')
    for elem in elements:
        href = elem.get_attribute("href")
        links.append(href)
    return links
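Since the end goal is to flag broken links, the hrefs can then be fed to requests (already imported in the question); a minimal sketch, assuming the links are reachable from the machine running the script:
links = get_all_links(driver)
for url in links:
    if url is None:
        continue  # anchors whose href did not resolve
    try:
        # HEAD is enough to read the status code without downloading the body
        status = requests.head(url, allow_redirects=True, timeout=5).status_code
    except requests.RequestException:
        status = None  # a connection-level failure also counts as broken
    print(url, status)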
I have used the Selenium driver to crawl through many site pages. Every time I get a new page I append its html to a variable called "All_APP_Pages", so All_APP_Pages holds the html for many pages. I did not post that code because it's long and not relevant to the issue. Python reports "All_APP_Pages" as being of type bytes.
from lxml import html
from lxml import etree
import xml.etree.ElementTree as ET
from selenium.webdriver.common.by import By
dom = etree.HTML(All_APP_Pages)
xp = "//tr[.//span[contains(.,'Product Data Solutions (UHC MR)')] and .//td[contains(.,'SQLServer')] and .//td[contains(.,'MR')]]//a"
link = dom.xpath(xp)
print(link)
Once all pages have been scanned I need to get the link from this xpath
"//tr[.//span[contains(.,'Product Data Solutions (ABC MR)')] and .//td[contains(.,'SQLServer')] and .//td[contains(.,'MR')]]//a"
The xpath listed here works. However, it only works with the selenium driver if the driver is on the page where this link exists. That is why all pages are in one variable, since I don't know which page the link will be on. The print shows this result:
[<Element a at 0x1c39dea1180>]
How do I get the href value from link so I can check whether the value is correct?
You need to iterate the list and get the href value of each element:
dom = etree.HTML(All_APP_Pages)
xp = "//tr[.//span[contains(.,'Product Data Solutions (UHC MR)')] and .//td[contains(.,'SQLServer')] and .//td[contains(.,'MR')]]//a"
link = dom.xpath(xp)
hrefs = [l.attrib["href"] for l in link]
print(hrefs)
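From there, checking the value is a plain string comparison; a quick sketch, where the expected URL is a hypothetical placeholder:
expected = "https://example.com/expected-target"  # hypothetical expected value
for href in hrefs:
    print(href, "matches" if href == expected else "does not match")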
I'm trying to get a link from a page using Selenium. When checking the page's source code I can clearly see the original link, but when I use Selenium to select the element, and then use element.get_attribute('href'), the link that it returns is a different one.
# Web page url request
driver.get('https://www.facebook.com/ads/library/?active_status=all&ad_type=all&country=BR&q=myshopify&sort_data[direction]=desc&sort_data[mode]=relevancy_monthly_grouped&search_type=keyword_unordered&media_type=all')
driver.maximize_window()
time.sleep(10)
v_link = driver.find_element(By.XPATH, '//*[@id="facebook"]/body/div[5]/div[2]/div/div/div/div/div[3]/span/div[1]/div/div[2]/div[1]/div[2]/div[3]/a')
print(v_link.get_attribute('href'))
The actual link that I need: https://bhalliproducts.store/?_pos=1&_sid=8a26757f5&_ss=r
The link being returned: https://l.facebook.com/l.php?u=https%3A%2F%2Fbhalliproducts.store%2F%3F_pos%3D1%26_sid%3D8a26757f5%26_ss%3Dr&h=AT3KkXQbOn5s3oaaaCV2vjaAnyJqEqkIlqvP16g3eCsCnw-fx3VCNMR66_Zxs50v9JU5JK2DLABhoBHRNHQENH6oyp39Pho2Z6o25NZD5RIvl5kMow0lfd2rdaUWp11e6alEJFtoJp0X_uXgp5B2OYocRg5wGA
You can use the following solution:
from urllib.parse import unquote
href = "https://l.facebook.com/l.php?u=https%3A%2F%2Fbhalliproducts.store%2F%3F_pos%3D1%26_sid%3D8a26757f5%26_ss%3Dr&h=AT3KkXQbOn5s3oaaaCV2vjaAnyJqEqkIlqvP16g3eCsCnw-fx3VCNMR66_Zxs50v9JU5JK2DLABhoBHRNHQENH6oyp39Pho2Z6o25NZD5RIvl5kMow0lfd2rdaUWp11e6alEJFtoJp0X_uXgp5B2OYocRg5wGA"
begin = href.find('=') + 1  # character right after 'u='
end = href.find('&')        # first literal '&' ends the encoded target URL
href = href[begin:end]
href = unquote(href)        # undo the percent-encoding
print(href)
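The slicing above works for this URL, but it silently relies on u being the first query parameter. A more robust sketch parses the query string with the standard library, which also handles the percent-decoding:
from urllib.parse import urlparse, parse_qs

query = parse_qs(urlparse(href).query)
print(query['u'][0])  # parse_qs already percent-decodes the value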
I am trying to extract all the href links from the anchor tags using selenium for my web scraping project in Python.
The site has multiple result pages, and I am trying to access the href elements for a single page.
Below is the code:
url = "https://www.carwale.com/used/cars-for-sale/#sc=-1&so=-1&pn=1"
driver.get(url)
links=driver.find_elements_by_xpath('//*[#href]')
for l in links:
print(l.get_attribute('href'))
On running my code the same href element gets printed multiple times.
Snippet of Output of the code:
https://www.carwale.com/used/cars-in-chennai/ford-figo-2010-2012-d2115418/?slot=4&rk=1&isP=true
https://www.carwale.com/used/cars-in-chennai/ford-figo-2010-2012-d2115418/?slot=4&rk=1&isP=true
https://www.carwale.com/used/cars-in-chennai/ford-figo-2010-2012-d2115418/?slot=4&rk=1&isP=true
How do I get it to print only once?
Do something like:
url = "https://www.carwale.com/used/cars-for-sale/#sc=-1&so=-1&pn=1"
driver.get(url)
processed = []
links = driver.find_elements_by_xpath('//*[#href]')
for link in links:
if link not in processed:
print(link.get_attribute('href'))
processed.append(link)
else:
continue
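If the order of the links doesn't matter, a set comprehension does the de-duplication in one step; a minimal sketch:
unique_hrefs = {link.get_attribute('href') for link in driver.find_elements_by_xpath('//*[@href]')}
for href in unique_hrefs:
    print(href)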
I am trying to scrape the list of followings for a given instagram user. This requires using Selenium to navigate to the user's Instagram page and then clicking "following". However, I cannot seem to click the "following" button with Selenium.
from selenium import webdriver

driver = webdriver.Chrome()
url = 'https://www.instagram.com/beforeeesunrise/'
driver.get(url)
driver.find_element_by_xpath('//*[@id="react-root"]/section/main/article/header/div[2]/ul/li[3]/a').click()
However, this results in a NoSuchElementException. I copied the xpath from the html, tried using the class name, the partial link, and the full link, and cannot seem to get this to work! I've also made sure that the above xpath includes the element with a "click" event listener.
UPDATE: By logging in I was able to get the above information. However (!), now I cannot get the resulting list of "followings". When I click on the button with the driver, the html does not include the information in the pop up dialog that you see on Instagram. My goal is to get all of the users that the given username is following.
Make sure you are using the correct XPath.
Use the following link to learn how to build correct XPaths to access web elements, and then try again:
Selenium Command
Hope this helps to solve the problem!
Try a different XPath. I've verified this is unique on the page.
driver.find_element_by_xpath("//a[contains(.,'following')]")
Selenium's main goal is not to provide rich element-finding functionality from a web-scraping perspective, so a better option is to delegate that task to a dedicated tool like BeautifulSoup. After we find what we're looking for, we can ask selenium to interact with the element.
The bridge between selenium and BeautifulSoup will be the amazing function below that I found here. The function takes a single BeautifulSoup element and generates a unique XPath that we can use in selenium.
import re
import itertools
from selenium import webdriver
from bs4 import BeautifulSoup as bs

def xpath_soup(element):
    """
    Generate xpath of soup element
    :param element: bs4 text or node
    :return: xpath as string
    """
    components = []
    child = element if element.name else element.parent
    for parent in child.parents:
        # type parent: bs4.element.Tag
        previous = itertools.islice(parent.children, 0, parent.contents.index(child))
        xpath_tag = child.name
        xpath_index = sum(1 for i in previous if i.name == xpath_tag) + 1
        components.append(xpath_tag if xpath_index == 1 else '%s[%d]' % (xpath_tag, xpath_index))
        child = parent
    components.reverse()
    return '/%s' % '/'.join(components)

driver = webdriver.Chrome(executable_path=YOUR_CHROMEDRIVER_PATH)
driver.get(url='https://www.instagram.com/beforeeesunrise/')

source = driver.page_source
soup = bs(source, 'html.parser')

button = soup.find('button', text=re.compile(r'Follow'))
xpath_for_the_button = xpath_soup(button)

elm = driver.find_element_by_xpath(xpath_for_the_button)
elm.click()
...and it works!
(But you need to write some code to log in with an account first.)
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
import csv
import requests
import re
driver2 = webdriver.Chrome()
driver2.get("http://www.squawka.com/match-results?ctl=10_s2015")
soup = BeautifulSoup(driver2.page_source, 'html.parser')
print(soup)
driver2.quit()
I'm trying to get the href of every td with class 'match-centre', and I need to use selenium to navigate through the pages, but I'm struggling to incorporate the two so I can change the menu options and navigate through the different pages while feeding the links into my other code.
I've researched and tried ('inner-html') and the page_source currently in the code, but it doesn't get any of the web links I need.
Does anyone have a solution to get these links and navigate the pages? Could there be a way to get the XML of this page to get all the links?
Not sure why you would need BeautifulSoup (BS) here. Selenium alone is capable of locating elements and navigating through links on a page. For example, to get all the links to the match details pages you can do as follows:
>>> matches = driver.find_elements_by_xpath("//td[@class='match-centre']/a")
>>> print([match.get_attribute("href") for match in matches])
As for navigating through the pages, you can use the following XPath :
//span[contains(@class,'page-numbers')]/following-sibling::a[1]
The above XPath finds the link to the next page. To navigate through all the pages, you can try a while loop like the sketch after this list; while the link to the next page is found:
perform a click action on the link,
grab all the hrefs from the current page,
locate the next page link.
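A minimal sketch of that loop, assuming the driver is already on the first results page; a NoSuchElementException on the next-page locator signals the last page:
from selenium.common.exceptions import NoSuchElementException

all_hrefs = []
while True:
    # grab all the match-centre hrefs from the current page
    matches = driver.find_elements_by_xpath("//td[@class='match-centre']/a")
    all_hrefs.extend(match.get_attribute("href") for match in matches)
    try:
        # locate the next page link
        next_page = driver.find_element_by_xpath("//span[contains(@class,'page-numbers')]/following-sibling::a[1]")
    except NoSuchElementException:
        break  # no next-page link, so this is the last page
    next_page.click()

print(all_hrefs)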