I have a script that scrapes 10 pages, one after another.
# hyperlink_list is the list of the pages
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
for i in range(0, 10):
    url = hyperlink_list[i]
    sleep(randint(10, 24))
    driver.get(url)
    time.sleep(10)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
Now from the pages, I am extracting this part:
Only on some pages is there a "show more" link, where the description is longer. I want to click this link and extract the description whenever the "show more" link is available.
Code for show more link:
<a id="rfq-info-header-description-showmorebutton">
show more
</a>
I want to click this link only if it's available; otherwise it will throw an element-not-found error.
Using more = driver.find_element_by_id("rfq-info-header-description-showmorebutton") (assuming the more link can always be found using this id) will throw an exception if the more button is not found. (see here for details)
You should use a try-except block and look for the show more web element. Below I am using find_elements (plural) and len() to get the size; if it is > 0, the web element must be present, and then I try to click on it using explicit waits.
If the size is not > 0, the show more link is not present, and I just print a simple message in that block.
Code:
try:
    if len(driver.find_elements(By.XPATH, "//a[@id='rfq-info-header-description-showmorebutton']")) > 0:
        print("Show more link is available so Selenium bot will click on it.")
        WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, "//a[@id='rfq-info-header-description-showmorebutton']"))).click()
        print('Clicked on show more link')
    else:
        print("Show more link is not available")
except Exception:
    print('Something else went wrong.')
Imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
I am trying to scrape information from a website where the information is not immediately present. When you click a certain button, the page begins to load new content at the bottom, and after it's done loading, red text shows up as "Assists (At Least)". I am able to find the first button, "Go to Prop builder", which doesn't immediately show up on the page, but after the script clicks that button, it times out when trying to find the "Assists (At Least)" text, in spite of the script sleeping and the text being present on the screen.
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('https://www.bovada.lv/sports/basketball/nba')
# this part succeeds
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, "//span[text()='Go to Prop builder']")
    )
)
element.click()
time.sleep(5)
# this part fails
element2 = WebDriverWait(driver, 6).until(
    EC.visibility_of_element_located(
        (By.XPATH, "//*[text()='Assists (At Least)']")
    )
)
time.sleep(2)
innerHTML = driver.execute_script('return document.body.innerHTML')
driver.quit()
soup = BeautifulSoup(innerHTML, 'html.parser')
The problem is the Assist element is under a frame. You need to switch to the frame like this:
frame = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME,"player-props-frame")))
driver.switch_to.frame(frame)
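One detail worth adding: after reading inside the frame, switch back to the top document, or later lookups will fail. A minimal sketch of that round-trip, using made-up stub classes in place of a real driver so the flow can be run without a browser (read_in_frame and the stubs are illustrative, not Selenium API):

```python
# Stub mimicking driver.switch_to.frame()/default_content() bookkeeping.
class _SwitchTo:
    def __init__(self, driver):
        self._driver = driver
    def frame(self, frame):
        self._driver.context = frame
    def default_content(self):
        self._driver.context = "top"

class _StubDriver:
    def __init__(self):
        self.context = "top"
        self.switch_to = _SwitchTo(self)

def read_in_frame(driver, frame, read):
    driver.switch_to.frame(frame)
    try:
        return read(driver)
    finally:
        # always leave the frame, even if the read raises
        driver.switch_to.default_content()

d = _StubDriver()
result = read_in_frame(d, "player-props-frame", lambda drv: "inside " + drv.context)
assert result == "inside player-props-frame"
assert d.context == "top"  # back at the top document afterwards
```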
Increase the timeout to confirm the timeout provided is sufficient; you can also confirm using debug mode. If the issue still persists, please check that the "Assists (At Least)" element does not fall under any frame.
You can also share the DOM and the exact error message if the issue is not resolved.
I have a couple of suggestions you could try:
Make sure that the content loaded at the bottom of the page is not in a frame. If it is, you need to switch to that particular frame.
Check that the XPath is correct; try matching the XPath from the Developer Console.
Inspect the element from the browser; once the Developer Console is open, press CTRL+F and then try your XPath. If it doesn't highlight anything, check the frames.
Check if there are any iframes in the page: search for "iframe" in the view page source, and if you find one for the field you are looking for, then switch to that frame first.
driver.switch_to.frame("name of the iframe")
Try adding retry logic with a timeout, and a refresh button if there is any on the page:
from selenium.common.exceptions import TimeoutException

st = time.time()
while st + 180 > time.time():
    try:
        element2 = WebDriverWait(driver, 6).until(
            EC.visibility_of_element_located(
                (By.XPATH, "//*[text()='Assists (At Least)']")
            )
        )
        break  # stop retrying once the element is found
    except TimeoutException:
        pass
The content you want is in an iFrame. You can access it by switching to it first, like this:
iframe=driver.find_element_by_css_selector('iframe[class="player-props-frame"]')
driver.switch_to.frame(iframe)
Round brackets are the issue here (at least in some cases...). If possible, use a contains() selector:
//*[contains(text(),'Assists ') and contains(text(),'At Least')]
There is a site that streams YouTube videos. I want to get a playlist of them, so I use Selenium WebDriver to get the needed element: a div with class name ytp-title-text, where the YouTube link is located.
It is located here, for example, when I use the browser console to find the element:
<div class="ytp-title-text"><a class="ytp-title-link yt-uix-sessionlink" target="_blank" data-sessionlink="feature=player-title" href="https://www.youtube.com/watch?v=VyCY62ElJ3g">Fears - Jono McCleery</a><div class="ytp-title-subtext"><a class="ytp-title-channel-name" target="_blank" href=""></a></div></div>
I wrote simple script for testing:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
driver = webdriver.Firefox()
driver.get('http://awsmtv.com')
try:
    element = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.CLASS_NAME, "ytp-title-text"))
    )
finally:
    driver.quit()
But no element is found and a timeout exception is thrown. I cannot understand what actions Selenium needs to perform to get the full page source.
The required link is hidden and is also located inside an iframe. Try the below to locate it:
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it("tvPlayer_1"))
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "ytp-title-link")))
    print(element.get_attribute('href'))
finally:
    driver.quit()
Just saw this element is inside an iframe... You need to switch to the iframe first: find it by class name, iframe = ...(By.CLASS_NAME, "player"), then switch to it with driver.switch_to.frame(iframe), and you should now be able to get the wanted element :)
An XPath locator like this one will work (or your locator): "//a[@class='ytp-title-link yt-uix-sessionlink']".
You then need to get, via the element, the href property for the YouTube video URL, or the text of the element for the song title.
If it's still not working, I can suggest getting the page source, html = driver.page_source, which gives you the source of the page, and extracting the info you want via some regex.
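Rather than regex over driver.page_source, the standard-library HTML parser can pull the href and title out more robustly. A sketch, where a trimmed canned string stands in for the real page source and TitleLinkParser is an illustrative name, not part of any library:

```python
from html.parser import HTMLParser

class TitleLinkParser(HTMLParser):
    """Collect (href, text) pairs from anchors with class ytp-title-link."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._capture = False
        self._href = None
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "ytp-title-link" in attrs.get("class", ""):
            self._capture = True          # next text node is the title
            self._href = attrs.get("href")
    def handle_data(self, data):
        if self._capture:
            self.links.append((self._href, data))
            self._capture = False

# Canned snippet standing in for driver.page_source:
page_source = ('<div class="ytp-title-text">'
               '<a class="ytp-title-link yt-uix-sessionlink" '
               'href="https://www.youtube.com/watch?v=VyCY62ElJ3g">'
               'Fears - Jono McCleery</a></div>')
p = TitleLinkParser()
p.feed(page_source)
assert p.links == [("https://www.youtube.com/watch?v=VyCY62ElJ3g",
                    "Fears - Jono McCleery")]
```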
Hi, I'm new to web scraping and have been trying to use Selenium to scrape a forum in Python.
I am trying to get Selenium to click "Next" until the last page, but I am not sure how to break the loop, and I am having trouble with the locator:
When I locate the next button by partial link text, the automated clicking continues into the next thread, e.g. page 1 -> page 2 -> next thread -> page 1 of next thread -> page 2 of next thread.
while True:
    next_link = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, "Next")))
    next_link.click()
When I locate the next button by class name, the automated clicking clicks the "prev" button when it reaches the last page.
while True:
    next_link = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "prevnext")))
    next_link.click()
My questions are:
Which locator should I use? (by class, by partial link text, or any other suggestion?)
How do I break the loop so it stops clicking when it reaches the last page?
There are a couple of things you need to consider, as follows:
There are two elements on the page with the text Next, one at the top and another at the bottom, so you need to decide which element you want to interact with and construct a unique locator strategy.
Moving forward, as you want to invoke click() on the element, instead of the expected condition presence_of_element_located() you need to use element_to_be_clickable().
When there is no more element with the text Next, you need to execute the remaining steps, so invoke click() within a try-except block and, in case of an exception, break out.
As per your requirement, xpath as a locator strategy looks good to me.
Here is the working code block :
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://forums.hardwarezone.com.sg/money-mind-210/hdb-fully-paid-up-5744914.html")
driver.find_element_by_xpath("//a[@id='poststop' and @name='poststop']//following::table[1]//li[@class='prevnext']/a").click()
while True:
    try:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@id='poststop' and @name='poststop']//following::table[1]//li[@class='prevnext']/a[contains(.,'Next')]"))).click()
    except TimeoutException:
        print("No more pages left")
        break
driver.quit()
Console Output :
No more pages left
You can use the below code to click the Next button until the last page is reached, and break the loop when the button is no longer present:
from selenium.common.exceptions import TimeoutException
while True:
    try:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.LINK_TEXT, "Next ›"))).click()
    except TimeoutException:
        break
You can use any locator which gives unique identification. Best practice says the following order:
Id
Name
Class Name
Css Selector
Xpath
Others
To come out of the while loop when it cannot find the element, you can use a try block as given below; the break command is used for the same.
while True:
    try:
        next_link = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "prevnext")))
        next_link.click()
    except TimeoutException:
        break
I have a div which contains the results for a certain search query. The text contained in this div changes as a button to go to the next page is clicked.
In the text contained in this div there is also the number of the corresponding page. After clicking the button to go to the next page, the results still take a while to load, so I want to make the driver wait for the content to load.
As such, I want to wait until the string "Page x" appears inside the div, where x corresponds to the number of the next page to be loaded.
For this, I tried:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
# Search page
driver.get('http://searchpage.com')
timeout = 120
current_page = 1
while True:
    try:
        # Wait until the results have been loaded
        WebDriverWait(driver, timeout).until(
            EC.text_to_be_present_in_element(
                locator=(By.ID, "SEARCHBASE"),
                text_="Page {:}".format(current_page)))
        # Click to go to the next page. This calls some javascript.
        driver.find_element_by_xpath('//a[@href="#"]').click()
        current_page += 1
    except:
        driver.quit()
Although, this always fails to match the text. What am I doing wrong here?
To just detect if anything whatsoever had changed in the page would also do the job, but I haven't found any way to do that.
Try to apply below solution to wait for partial text match:
WebDriverWait(driver, timeout).until(lambda driver: "Page {:}".format(current_page) in driver.find_element(By.ID, "SEARCHBASE").text)
I am trying to extract links from an infinite scroll website.
This is my code for scrolling down the page:
driver = webdriver.Chrome('C:\\Program Files (x86)\\Google\\Chrome\\chromedriver.exe')
driver.get('http://seekingalpha.com/market-news/top-news')
for i in range(0, 2):
    driver.implicitly_wait(15)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(20)
I aim to extract specific links from this page, with class = "market_current_title" and HTML like the following:
<a class="market_current_title" href="/news/3223955-dow-wraps-best-week-since-2011-s-and-p-strongest-week-since-2014" sasource="titles_mc_top_news" target="_self">Dow wraps up best week since 2011; S&P in strongest week since 2014</a>
When I used
URL = driver.find_elements_by_class_name('market_current_title')
I ended up with the error that says "stale element reference: element is not attached to the page document". Then I tried
URL = driver.find_elements_by_xpath("//div[@id='a']//a[@class='market_current_title']")
but it says that there is no such link!
Do you have any idea about solving this problem?
You're probably trying to interact with an element that has already changed (probably an element above your scroll position and off screen). Try this answer for some good options on how to overcome this.
Here's a snippet:
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
import selenium.webdriver.support.expected_conditions as EC
import selenium.webdriver.support.ui as ui
# return True if element is visible within 2 seconds, otherwise False
def is_visible(self, locator, timeout=2):
    try:
        ui.WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.CSS_SELECTOR, locator)))
        return True
    except TimeoutException:
        return False
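The general cure for stale references is to re-find elements by locator after every scroll instead of holding on to references collected earlier. A sketch of that pattern, with a stub driver standing in for the browser so the logic can be run without one (collect_hrefs and the stub classes are made-up names; a real script would call driver.find_elements with By.CLASS_NAME):

```python
def collect_hrefs(driver, locator, scrolls):
    hrefs = []
    for _ in range(scrolls):
        driver.scroll_to_bottom()
        # re-query after every scroll so we never touch a stale element
        for el in driver.find_elements("class name", locator):
            href = el.get_attribute("href")
            if href not in hrefs:
                hrefs.append(href)
    return hrefs

class _StubElement:
    def __init__(self, href):
        self._href = href
    def get_attribute(self, name):
        return self._href

class _StubDriver:
    """Each scroll loads one more batch of links, like an infinite-scroll page."""
    def __init__(self, batches):
        self._batches = batches
        self._loaded = 0
    def scroll_to_bottom(self):
        self._loaded = min(self._loaded + 1, len(self._batches))
    def find_elements(self, by, value):
        # fresh element objects every call, as a real DOM re-query would give
        return [_StubElement(h) for batch in self._batches[:self._loaded]
                for h in batch]

d = _StubDriver([["/news/1", "/news/2"], ["/news/3"]])
links = collect_hrefs(d, "market_current_title", scrolls=2)
assert links == ["/news/1", "/news/2", "/news/3"]
```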