I'm trying to have Selenium scroll a Facebook page until certain text appears, then get the HTML tags from that page. Specifically, I want Selenium to scroll until the post date text 'Oct 5th' is visible. The code below doesn't throw an error, but it doesn't do the task either: right now it keeps scrolling and never stops. How can I achieve this?
driver.get("https://www.facebook.com/search/latest/?q=%23blacklivesmatter")
sleep(4)
wait = WebDriverWait(driver, 10)
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        wait.until(EC.visibility_of_element_located((By.XPATH, "//*[contains(text(), 'Oct 5th')]")))
        html = driver.page_source
        soup = BeautifulSoup(html)
    except TimeoutException:
        break
Edit: We need to look for the presence of an element instead of visibility.
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from time import sleep
driver = webdriver.Chrome()
driver.get("https://www.facebook.com/search/latest/?q=%23blacklivesmatter")
wait = WebDriverWait(driver, 10)
find_elem = None
scroll_from = 0
scroll_limit = 3000
while not find_elem:
    sleep(2)
    # scrollTo takes (x, y); keep x at 0 and move y down by scroll_limit each pass
    driver.execute_script("window.scrollTo(0, %d);" % (scroll_from + scroll_limit))
    scroll_from += scroll_limit
    try:
        find_elem = wait.until(EC.presence_of_element_located((By.XPATH, "//*[contains(text(), 'Oct 5th')]")))
    except TimeoutException:
        pass
driver.close()
First of all, if the text you are looking for is already somewhere on the page, it will be present in the HTML source even when it is not visible in the viewport, so no scrolling is needed to find it. Scrolling is only required when the page has to load additional content that was not in the DOM before.
Now, I would suggest changing the following in your approach:
If the page does need to load data that was unavailable before the scroll, give it enough time to do so. If you scroll and look for the text too quickly, the HTML will not have been updated yet and you will just query the same DOM each time. Given that you don't necessarily know when your text will appear, you will have to wait for a constant hard-coded period after each scroll. A few seconds should be enough, at least initially, just to prove that it works.
Just to exclude possible issues with using wait.until, try looking for the text directly in the HTML source. You can change it later and use wait.until once you ensure that the rest of your script works properly.
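As a rough illustration of both points, here is a minimal sketch of that approach (the target text, pause length, and scroll cap are assumptions, not tested against Facebook's current markup):
target_text = "Oct 5th"  # assumed target post date
max_scrolls = 30         # hard cap so the loop cannot run forever
for _ in range(max_scrolls):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(3)  # hard-coded pause to let the newly loaded content reach the DOM
    if target_text in driver.page_source:
        break  # the text is now in the HTML source; stop scrolling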
I am scraping pages of the Italian website that publishes new laws (Gazzetta Ufficiale) to save the final page, which holds the law text.
I have a loop that builds a list of the pages to download, and I am attaching a fully working code sample which shows the problem I'm running into (the sample is not looped; I am just doing two gets).
What is the best way to handle the rare page which does not show the "Visualizza" (show) button but goes straight to the desired full text?
I hope the code is pretty self-explanatory and commented. Thank you in advance, and a super happy 2022!
import time
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome("/Users/bob/Documents/work/scraper/scrape_gu/chromedriver")
# showing the "normal" behaviour
driver.get(
    "https://www.gazzettaufficiale.it/atto/vediMenuHTML?atto.dataPubblicazioneGazzetta=2021-01-02&atto.codiceRedazionale=20A07300&tipoSerie=serie_generale&tipoVigenza=originario"
)
# this page has a "Visualizza" button, find it and click it.
bottoni = WebDriverWait(driver, 10).until(
    EC.visibility_of_all_elements_located(
        (By.XPATH, '//*[@id="corpo_export"]/div/input[1]')
    )
)
time.sleep(5) # just to see the "normal" result with the "Visualizza" button
bottoni[0].click() # now click it and this shows the desired final webpage
time.sleep(5) # just to see the "normal" desired result
# but unfortunately some pages directly get to the end result WITHOUT the "Visualizza" button.
# as an example see the following get
# showing the problematic behaviour
driver.get(
    "https://www.gazzettaufficiale.it/atto/vediMenuHTML?atto.dataPubblicazioneGazzetta=2021-01-02&atto.codiceRedazionale=20A07249&tipoSerie=serie_generale&tipoVigenza=originario"
)  # get a law page
time.sleep(5)  # as you can see we are now on the final desired full page WITHOUT the Visualizza button
# hence the following code, identical to that above will fail and timeout
bottoni = WebDriverWait(driver, 10).until(
    EC.visibility_of_all_elements_located(
        (By.XPATH, '//*[@id="corpo_export"]/div/input[1]')
    )
)
time.sleep(5) # just to see the result
bottoni[0].click() # and this shows the desired final webpage
# and the program abends with the following message
# File "/Users/bob/Documents/work/scraper/scrape_gu/temp.py", line 33, in <module>
# bottoni = WebDriverWait(driver, 10).until(
# File "/Users/bob/opt/miniconda3/envs/scraping/lib/python3.8/site-packages/selenium/webdriver/support/wait.py", line 80, in until
# raise TimeoutException(message, screen, stacktrace)
# selenium.common.exceptions.TimeoutException: Message:
Catch the exception with a try and except block - if there is no button, extract the text directly - see Handling Exceptions.
...
urls = [
    'https://www.gazzettaufficiale.it/atto/vediMenuHTML?atto.dataPubblicazioneGazzetta=2021-01-02&atto.codiceRedazionale=20A07300&tipoSerie=serie_generale&tipoVigenza=originario',
    'https://www.gazzettaufficiale.it/atto/vediMenuHTML?atto.dataPubblicazioneGazzetta=2021-01-02&atto.codiceRedazionale=20A07249&tipoSerie=serie_generale&tipoVigenza=originario'
]
data = []
for url in urls:
    driver.get(url)
    try:
        bottoni = WebDriverWait(driver, 1).until(
            EC.element_to_be_clickable(
                (By.XPATH, '//input[@value="Visualizza"]')
            )
        )
        bottoni.click()
    except TimeoutException:
        print('no bottoni -')
    finally:
        data.append(driver.find_element(By.XPATH, '//body').text)
driver.close()
print(data)
...
First, using selenium for this task is overkill.
You'd be able to do the same thing using requests or aiohttp coupled with beautifulsoup, except that would be much faster and easier to code.
Now to get back to your question, there are a few solutions.
The simplest would be:
Catch the timeout exception: if the button isn't found, then go straight to parsing the law.
Check whether the button is present before either clicking it or parsing the page: in Python, driver.find_elements (plural) returns an empty list instead of raising, so its truthiness is the check; see the sketch below.
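A minimal sketch of that presence check, reusing the selectors from the question:
# find_elements returns an empty list instead of raising NoSuchElementException,
# so the if below is True only when the export block (and its button) exists
if driver.find_elements(By.ID, "corpo_export"):
    driver.find_element(By.XPATH, '//input[@value="Visualizza"]').click()
law_text = driver.find_element(By.XPATH, '//body').text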
But then again, you'd have a much easier time getting rid of selenium and using beautifulsoup instead.
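For completeness, here is a minimal sketch of the requests/BeautifulSoup route. It assumes the law text is already present in the HTML returned for these URLs; if the Visualizza button triggers a separate request, you would have to replicate that request instead:
import requests
from bs4 import BeautifulSoup

data = []
for url in urls:  # the same list of Gazzetta Ufficiale URLs as above
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, 'html.parser')
    # take the visible text of the whole body; refine the selector as needed
    data.append(soup.body.get_text(separator='\n', strip=True))
print(data)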
I am trying to scrape information from a website where the information is not immediately present. When you click a certain button, the page begins to load new content at the bottom of the page, and after it's done loading, red text shows up as "Assists (At Least)". The script is able to find and click the first button, "Go to Prop builder", which doesn't immediately show up on the page. But after the click, it times out when trying to find the "Assists (At Least)" text, even though the script sleeps first and the text is present on the screen.
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('https://www.bovada.lv/sports/basketball/nba')
# this part succeeds
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, "//span[text()='Go to Prop builder']")
    )
)
element.click()
time.sleep(5)
# this part fails
element2 = WebDriverWait(driver, 6).until(
    EC.visibility_of_element_located(
        (By.XPATH, "//*[text()='Assists (At Least)']")
    )
)
time.sleep(2)
innerHTML = driver.execute_script('return document.body.innerHTML')
driver.quit()
soup = BeautifulSoup(innerHTML, 'html.parser')
The problem is that the Assists element is inside a frame. You need to switch to the frame first, like this:
frame = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME,"player-props-frame")))
driver.switch_to.frame(frame)
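Once you are done inside the frame, switch back before interacting with elements in the main document:
driver.switch_to.default_content()  # return to the top-level document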
Increase the timeout to confirm that the timeout provided is correct; you can also confirm this in debug mode. If the issue still persists, check whether the "Assists (At Least)" element falls under a frame.
You can also share the DOM and the exact error message if the issue is not resolved.
I have a couple of suggestions you could try:
Make sure that the content loaded at the bottom of the page is not in a frame. If it is, you need to switch to that particular frame.
Check that the XPath is correct; try the XPath from the developer console to see whether it matches.
Inspect the element in the browser; once the developer console is open, press CTRL+F and try your XPath. If it doesn't highlight anything, check the frames.
Check if there are any iframes on the page: search for "iframe" in the page source, and if you find one containing the field you are looking for, switch to that frame first.
driver.switch_to.frame("name of the iframe")
Try adding retry logic with a timeout, and use a refresh button if there is one on the page:
from selenium.common.exceptions import TimeoutException

st = time.time()
while st + 180 > time.time():
    try:
        element2 = WebDriverWait(driver, 6).until(
            EC.visibility_of_element_located(
                (By.XPATH, "//*[text()='Assists (At Least)']")
            )
        )
    except TimeoutException:
        pass  # not found yet; keep retrying until the 180-second budget runs out
    else:
        break  # found it; stop retrying
The content you want is in an iFrame. You can access it by switching to it first, like this:
iframe = driver.find_element_by_css_selector('iframe[class="player-props-frame"]')
driver.switch_to.frame(iframe)
Round brackets are the issue here (at least in some cases): an exact text() match can fail on text that contains parentheses. If possible, use the contains() function instead:
//*[contains(text(),'Assists ') and contains(text(),'At Least')]
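Plugged into the failing wait from the question, that would look like:
element2 = WebDriverWait(driver, 6).until(
    EC.visibility_of_element_located(
        (By.XPATH, "//*[contains(text(),'Assists ') and contains(text(),'At Least')]")
    )
)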
This page has a total of 790 products, and I wrote Selenium code to automatically click the product load button until it finishes loading all 790 products. Unfortunately, my code is not working and I'm getting an error. Here is my full code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import time
driver = webdriver.Chrome()
driver.maximize_window()
url ='https://www.billigvvs.dk/maerker/grohe/produkter?min_price=1'
driver.get(url)
time.sleep(5)
#accept cookies
try:
    driver.find_element_by_xpath("//button[@class='coi-banner__accept']").click()
except:
    print('cookies not accepted')
# Wait 20 seconds for page to load.
timeout = 20
try:
    WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.XPATH, "//a[@class='productbox__info__name']")))
except TimeoutException:
    print("Timed out waiting for page to load")
    driver.quit()
#my page load button is not working. I want to load all 790 products on this page
products_load_button = driver.find_element_by_xpath("//div[@class='filterlist__button']").click()
The error that I am getting:
Message: no such element: Unable to locate element: {"method":"xpath","selector":"//div[@class='filterlist__button']"}
(Session info: chrome=87.0.4280.88)
The error message says "Unable to locate element", but the screenshot (not reproduced here) shows that I am selecting the right element.
The class name has a trailing space at the end that you are missing; try this:
products_load_button = driver.find_element_by_xpath("//div[@class='filterlist__button ']").click()
When you work with selectors, it is always good practice to copy and paste them directly from the page; that will save a lot of headaches in the future.
Edit:
The while loop to check if all the elements are loaded looks similar to this:
# assumes the button element itself is stored, without .click() chained on the assignment
products_load_button = driver.find_element_by_xpath("//div[@class='filterlist__button ']")
progress_bar_text = driver.find_element_by_css_selector("div.filterlist__pagination__text").text
# From here you can extract the total items and the loaded items.
# Note: I am doing this because I don't have access to the page; probably
# there is a better way to find out whether the items are loaded by taking
# a look at the attributes of the progress bar.
total_items = int(progress_bar_text.split()[4])
loaded_items = int(progress_bar_text.split()[1])
while loaded_items < total_items:
    # Click the product load button until all the products are loaded
    products_load_button.click()
    # Get the progress bar text and update the loaded_items count
    progress_bar_text = driver.find_element_by_css_selector("div.filterlist__pagination__text").text
    loaded_items = int(progress_bar_text.split()[1])
This is a very simple example and does not consider a lot of scenarios that you will need to handle to make it stable, some of them being:
The elements might disappear or reload after you click products_load_button. For this I recommend taking a look at explicit waits in the Selenium docs, as sketched below.
It is possible that the progress bar disappears or is hidden after the load is complete.
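As a rough sketch of that first point, you can re-locate the button on every pass so that a re-render cannot leave you holding a stale reference (the selector with the trailing space comes from the answer above; untested):
from selenium.common.exceptions import StaleElementReferenceException, TimeoutException

while loaded_items < total_items:
    try:
        # re-find the button each iteration in case the DOM was re-rendered
        button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, "//div[@class='filterlist__button ']"))
        )
        button.click()
    except (StaleElementReferenceException, TimeoutException):
        break  # button gone or never clickable; assume loading has finished
    progress_bar_text = driver.find_element_by_css_selector("div.filterlist__pagination__text").text
    loaded_items = int(progress_bar_text.split()[1])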
I have a div which contains the results for a certain search query. The text contained in this div changes as a button to go to the next page is clicked.
The text contained in this div also includes the number of the current page. After clicking the button to go to the next page, the results still take a moment to load, so I want to make the driver wait for the content to load.
As such, I want to wait until the string "Page x" appears inside the div, where x corresponds to the number of the next page to be loaded.
For this, I tried:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
# Search page
driver.get('http://searchpage.com')
timeout = 120
current_page = 1
while True:
    try:
        # Wait until the results have been loaded
        WebDriverWait(driver, timeout).until(
            EC.text_to_be_present_in_element(
                locator=(By.ID, "SEARCHBASE"),
                text_="Page {:}".format(current_page)))
        # Click to go to the next page. This calls some javascript.
        driver.find_element_by_xpath('//a[@href="#"]').click()
        current_page += 1
    except:
        driver.quit()
However, this always fails to match the text. What am I doing wrong here?
Just detecting whether anything at all on the page has changed would also do the job, but I haven't found any way to do that either.
Try applying the solution below, which waits for a partial text match:
WebDriverWait(driver, timeout).until(lambda driver: "Page {:}".format(current_page) in driver.find_element(By.ID, "SEARCHBASE").text)
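WebDriverWait polls the callable until it returns something truthy, so any Python expression can serve as the condition. Dropped into the loop from the question, it would look like this (a sketch with the same locators as your code):
while True:
    try:
        WebDriverWait(driver, timeout).until(
            lambda driver: "Page {:}".format(current_page) in driver.find_element(By.ID, "SEARCHBASE").text)
        driver.find_element_by_xpath('//a[@href="#"]').click()
        current_page += 1
    except:
        driver.quit()
        break  # stop once the wait times out or the page structure changes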
I am trying to extract links from an infinite-scroll website.
This is my code for scrolling down the page:
driver = webdriver.Chrome('C:\\Program Files (x86)\\Google\\Chrome\\chromedriver.exe')
driver.get('http://seekingalpha.com/market-news/top-news')
for i in range(0, 2):
    driver.implicitly_wait(15)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(20)
I am aiming to extract specific links from this page, with class="market_current_title" and HTML like the following:
<a class="market_current_title" href="/news/3223955-dow-wraps-best-week-since-2011-s-and-p-strongest-week-since-2014" sasource="titles_mc_top_news" target="_self">Dow wraps up best week since 2011; S&P in strongest week since 2014</a>
When I used
URL = driver.find_elements_by_class_name('market_current_title')
I ended up with the error that says "stale element reference: element is not attached to the page document". Then I tried
URL = driver.find_elements_by_xpath("//div[#id='a']//a[#class='market_current_title']")
but it says that there is no such link!
Do you have any idea about solving this problem?
You're probably trying to interact with an element that has already changed (probably an element above your scroll position and off screen). Try this answer for some good options on how to overcome this.
Here's a snippet:
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
import selenium.webdriver.support.expected_conditions as EC
import selenium.webdriver.support.ui as ui
# return True if element is visible within 2 seconds, otherwise False
def is_visible(driver, locator, timeout=2):
    try:
        ui.WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.CSS_SELECTOR, locator)))
        return True
    except TimeoutException:
        return False
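Applied to the infinite-scroll question above, you could keep scrolling until the links become visible (a sketch: the CSS selector mirrors the class from the question, and the scroll cap is an assumption):
for _ in range(20):  # assumed upper bound on the number of scrolls
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    if is_visible(driver, "a.market_current_title"):
        links = driver.find_elements_by_class_name('market_current_title')
        break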