I've written a Python script with Selenium to click on each of the signs available on a map. However, when I execute my script, it throws a timeout exception upon reaching this line: wait.until(EC.staleness_of(item)).
Before hitting that line, the script should have clicked at least once, but it seems it could not. How can I click on all the signs on that map, one after another?
This is the site link.
This is my code so far (perhaps, I'm trying with the wrong selectors):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
link = "https://www.findapetwash.com/"
driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver, 15)
for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "#map .gm-style"))):
    item.click()
    wait.until(EC.staleness_of(item))
driver.quit()
Postscript: I know that their API is https://www.findapetwash.com/api/locations/getAll/, from which I can get the JSON content, but I would like to stick to the Selenium way. Thanks.
I know you wrote that you don't want to use the API, but using Selenium to get the locations from the map markers seems a bit of overkill for this. Instead, why not make a call to their web service with requests and parse the returned JSON?
Here is a working script:
import requests
import json
api_url='https://www.findapetwash.com/api/locations/getAll/'
class Location:
    def __init__(self, json):
        self.id = json['id']
        self.user_id = json['user_id']
        self.name = json['name']
        self.address = json['address']
        self.zipcode = json['zipcode']
        self.lat = json['lat']
        self.lng = json['lng']
        self.price_range = json['price_range']
        self.photo = 'https://www.findapetwash.com' + json['photo']

def get_locations():
    locations = []
    response = requests.get(api_url)
    if response.ok:
        result_json = json.loads(response.text)
        for location_json in result_json['locations']:
            locations.append(Location(location_json))
        return locations
    else:
        print('Error loading locations')
        return False

if __name__ == '__main__':
    locations = get_locations()
    for l in locations:
        print(l.name)
Selenium
If you still want to go the Selenium way, then instead of waiting until all the elements are loaded, you could simply halt the script for a few seconds or even a minute to make sure everything has loaded; this should fix the timeout exception:
import time
driver.get(link)
# Wait 20 seconds
time.sleep(20)
For other possible workarounds, see the accepted answer here: Make Selenium wait 10 seconds
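If you would rather not hard-code a sleep, another workaround is to wait explicitly for the map markers themselves to appear. This is only a sketch; it borrows the li.marker.marker--list selector from the answer below, and I have not verified that selector against the page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.findapetwash.com/")
wait = WebDriverWait(driver, 30)
# wait up to 30 seconds for at least one marker to be present in the DOM
markers = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.marker.marker--list")))
print("Loaded %d markers" % len(markers))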
You can click the signs one by one using Selenium if, for some reason, you cannot use the API. It is also possible to extract the information for each sign without clicking on it; see the sketch after the snippet below.
Here is code to click them one by one:
signs = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.marker.marker--list")))
for sign in signs:
    driver.execute_script("arguments[0].click();", sign)
    # do something
You can also try it without the wait; it will probably still work.
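As for reading the sign data without clicking, the following is only a sketch: it assumes each li.marker.marker--list element carries the location name in its text content, which I have not verified against the page's markup.
signs = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.marker.marker--list")))
for sign in signs:
    # textContent also works for elements that are not currently visible
    name = sign.get_attribute("textContent").strip()
    print(name)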
Related
I am trying to scrape university ranking information from the USNews site. The problem is that when I use Selenium to open the webpage, the 'Load More' button does not work properly. (I think I successfully click it, but in the Chrome window opened by the webdriver, when I scroll down to the button, it says 'We're sorry, there was a problem loading the next page of search results'.)
I am new to web scraping and did a lot of research on this; there are several similar questions, but none of those answers helped. I really need some help. Here is my code:
driver_path = 'xxx'  # chromedriver path
driver = webdriver.Chrome(executable_path=driver_path)
url2 = 'https://www.usnews.com/education/best-global-universities/rankings'
wait = WebDriverWait(driver, 30)
driver.get(url2)
driver.maximize_window()
count = 1
while True:
    try:
        print(1)
        # driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        wait.until(EC.visibility_of_element_located((By.XPATH, "//*[@id='rankings']/div[3]/button")))
        print(2)
        show_more = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='rankings']/div[3]/button")))
        ActionChains(driver).move_to_element(show_more).click().perform()
        print(3)
        # driver.find_element(By.XPATH, "//*[@id='rankings']/div[3]/button").click()
        # print(4)
        # wait.until(EC.visibility_of_element_located((By.XPATH, "//*[@id='rankings']/div[3]/button")))
        # print(5)
        count += 1
        time.sleep(2)
        if count >= 2:
            break
    except Exception:
        # bail out if the button can no longer be found or clicked
        break
I did not write code to close the ad, but I don't think the ad is the problem, since when I manually close it and then click the button, it still does not work. Is it a problem with the website?
import requests
import os
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
It is clear that this site has anti-scraping controls in place. It is always recommended to consult the robots.txt file beforehand to check whether scraping a given site is allowed (a quick check is sketched below).
This particular site simply blocks your IP (try going to other pages afterwards; you will see that you get a 403 error).
In general, however, the approach you used does not seem wrong to me. You could try contacting the site directly to see whether the problem can be solved in some other way.
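As a side note, checking robots.txt needs nothing beyond the standard library; here is a small sketch (the rankings URL is the one from your script):
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.usnews.com/robots.txt')
rp.read()

url = 'https://www.usnews.com/education/best-global-universities/rankings'
# False means the site asks crawlers not to fetch this path
print(rp.can_fetch('*', url))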
I am trying to scrape from Google search results the blue highlighted portion as shown below:
When I use inspect element, it shows: span class="YhemCb". I have tried using various soup.find and soup.find_all commands, but everything I have tried has no
output so far. What command should I use to scrape this part?
Google uses JavaScript to display most of its web elements, so using something like requests and BeautifulSoup is unfortunately not enough.
Instead, use Selenium! It essentially allows you to control a browser with code.
First, you will need to navigate to the Google page you wish to scrape:
google_search = 'https://www.google.com/search?q=courtyard+by+marriott+fayetteville+fort+bragg'
driver.get(google_search)
Then, you have to wait until the review page loads in the browser.
This is done using WebDriverWait: you have to specify an element that needs to appear on the page. The [data-attrid="kc:/local:one line summary"] span CSS selector allows me to select the review info about the hotel.
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '[data-attrid="kc:/local:one line summary"] span'))
review_element = WebDriverWait(driver, timeout).until(expectation)
And finally, print the rating
print(review_element.get_attribute('innerHTML'))
Here's the full code in case you want to play around with it
import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# setup selenium (I am using chrome here, so chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
# navigate to google
google_search = 'https://www.google.com/search?q=courtyard+by+marriott+fayetteville+fort+bragg'
driver.get(google_search)
# wait until the page loads
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '[data-attrid="kc:/local:one line summary"] span'))
review_element = WebDriverWait(driver, timeout).until(expectation)
# print the rating
print(review_element.get_attribute('innerHTML'))
Note: Google is notoriously defensive against anyone trying to scrape it. On the first few attempts you might be successful, but eventually you will have to deal with Google's captcha.
To work around that, I would suggest using a search engine scraper; something like the quickstart guide should get you started!
Disclaimer: I work at Oxylabs.io
How do I track dynamically updating code on a website?
On a website there is a part of the code that shows notifications. This code gets updates frequently, and I would like to use selenium to capture the changes.
Example:
# Setting up the driver
from selenium import webdriver
EXE_PATH = r'C:/Users/mrx/Downloads/chromedriver.exe'
driver = webdriver.Chrome(executable_path=EXE_PATH)
# Navigating to website and element of interest
driver.get('https://whateverwebsite.com/')
element = driver.find_element_by_id('changing-element')
# Printing source at time 1
element.get_attribute('innerHTML')
# Printing source at time 2
element.get_attribute('innerHTML')
The code returned at time 1 and time 2 is different. I could of course capture this using some kind of loop.
# While loop capturing changes
results = list()
while True:
    print("New source")
    source = element.get_attribute('innerHTML')
    new_source = element.get_attribute('innerHTML')
    results.append(source)
    while source == new_source:
        time.sleep(1)
        new_source = element.get_attribute('innerHTML')
Is there a smarter way to do this using selenium's event listener?
Try waiting the Selenium way with WebDriverWait; Selenium provides an expected condition, text_to_be_present_in_element, so you can try the following approach.
First you need the following imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
Try the code below:
element = driver.find_element_by_id('changing-element')
# Printing source at time 1
element.get_attribute('innerHTML')
#something that makes the element change
WebDriverWait(driver, 10).until(expected_conditions.text_to_be_present_in_element((By.ID, 'changing-element'), 'expected_value'))
# Printing source at time 2
element.get_attribute('innerHTML')
But if the text isn't found, it will raise a TimeoutException, so please handle it with try/except.
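For example, a minimal sketch of that handling might look like this (the 'expected_value' placeholder stands for whatever text you are waiting for):
from selenium.common.exceptions import TimeoutException

try:
    WebDriverWait(driver, 10).until(expected_conditions.text_to_be_present_in_element((By.ID, 'changing-element'), 'expected_value'))
    print(element.get_attribute('innerHTML'))
except TimeoutException:
    print('The element text did not change within 10 seconds')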
I am trying to automate speed tests with different browsers, and the main part of the test is inside a loop. The problem is that sometimes an element which was selected before, and for which the script worked correctly, cannot be selected again by Selenium at one of the next iterations, on the same page and with an unchanged XPath, just with a different value. So I cannot repeat my test as often as I want.
Most of the time I have this problem with Edge, and I think one reason could be that the XPaths were found with the help of Chrome or Firefox. (I cannot find the XPath in Edge in the first place; I have searched a lot about this.)
I have also included the different XPaths that I use. What I actually want are the numeric or string values of ping, download, upload, location and server.
Please let me know how I can solve this issue. I have tried different sleep times and two different XPaths; the script always gives me an error when I try to select the element by class_name or css_selector.
firefox:
"/html/body/div[3]/div[2]/div/div/div/div[3]/div[1]/div[3]/div/div[3]/div/div[1]/div[2]/div[1]/div/div[2]/span"
chrome:
"//*[@id='container']/div[2]/div/div/div/div[3]/div[1]/div[3]/div/div[3]/div/div[1]/div[2]/div[1]/div/div[2]/span"
chrome:
"//div[#class='result-item result-item-ping updated']/div[2]/span"
Another question is how I can wait for a page to load completely; the WebDriverWait(driver, some_seconds) approach does not work for me and I have to use time.sleep().
Error:
selenium.common.exceptions.NoSuchElementException: Message: No such element
element = driver.find_element_by_xpath("/html/body/div[3]/div[2]/div/div/div/div[3]/div[1]/div[3]/div/div[3]/div/div[1]/div[2]/div[1]/div/div[2]/span")
To automate the speedtests you can use the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Edge(executable_path=r'C:\WebDrivers\MicrosoftWebDriver.exe')
driver.get("https://www.speedtest.net/")
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.js-start-test.test-mode-multi"))).click()
WebDriverWait(driver, 45).until(EC.url_contains("result"))
print("Ping :"+driver.find_element_by_css_selector("div[title='Reaction Time'] div.result-data.u-align-left>span").get_attribute("innerHTML"))
print("Download: "+driver.find_element_by_css_selector("div[title='Receiving Time'] div.result-data.u-align-left>span").get_attribute("innerHTML"))
print("Upload :"+driver.find_element_by_css_selector("div[title='Sending Time'] div.result-data.u-align-left>span").get_attribute("innerHTML"))
#driver.quit()
Console Output:
Ping :35
Download: 21.53
Upload :3.46
Use the following CSS locators to identify the values:
Download: .result-data-large.number.result-data-value.download-speed
Upload: .result-data-large.number.result-data-value.upload-speed
Ping: .result-data-large.number.result-data-value.ping-speed
Using getText(), you can retrieve their values. Wait for an element on the page to be visible to make sure the page has loaded successfully.
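In Python, that idea could be sketched roughly as follows (the selectors are the ones listed above; driver is assumed to be your existing WebDriver instance, and the 60-second timeout is my own guess):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 60)
# waiting for the download value to become visible also tells us the result page has loaded
download = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".result-data-large.number.result-data-value.download-speed")))
upload = driver.find_element_by_css_selector(".result-data-large.number.result-data-value.upload-speed")
ping = driver.find_element_by_css_selector(".result-data-large.number.result-data-value.ping-speed")
print(ping.text, download.text, upload.text)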
Try with: element = driver.find_element_by_xpath("/html/body/div[3]/div[2]/div/div/div/div[3]/div[1]/div[3]/div/div[3]/div/div[1]/div[2]/div[1]/div/div[2]/")
You may also need to catch the NoSuchElementException for such cases.
I've tested these CSS selectors and they work in both Chrome and Edge.
span.ping-speed # ping
span.download-speed # download
span.upload-speed # upload
div.server-current > div.result-label # server
If you want to know when the page is done loading, you can wait until the URL changes from https://www.speedtest.net to https://www.speedtest.net/results/<some number>. I would just use WebDriverWait and url_contains("results"), e.g.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 10).until(EC.url_contains("results"))
There are some other approaches in this question.
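Putting the selectors and the URL wait together, a rough sketch (the start-button selector is borrowed from the other answers here, and the 60-second timeout is an assumption):
driver.get("https://www.speedtest.net/")
driver.find_element_by_css_selector("a.js-start-test.test-mode-multi").click()

# the result URL looks like https://www.speedtest.net/result/<id>
WebDriverWait(driver, 60).until(EC.url_contains("result"))

ping = driver.find_element_by_css_selector("span.ping-speed").text
download = driver.find_element_by_css_selector("span.download-speed").text
upload = driver.find_element_by_css_selector("span.upload-speed").text
server = driver.find_element_by_css_selector("div.server-current > div.result-label").text
print(ping, download, upload, server)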
WebDriverWait driverWait = new WebDriverWait(driver, 30000);
driver.get("https://www.speedtest.net/");
WebElement goLink = driver.findElement(By.cssSelector(".js-start-test.test-mode-multi"));
driverWait.until(ExpectedConditions.elementToBeClickable(goLink));
goLink.click();
By download = By.cssSelector(".result-data-large.number.result-data-value.download-speed");
By upload = By.cssSelector(".result-data-large.number.result-data-value.upload-speed");
By ping = By.cssSelector(".result-data-large.number.result-data-value.ping-speed");
driverWait.until(ExpectedConditions.urlMatches("https://www.speedtest.net/result/[0-9]"));
String downloadSpeed = driver.findElement(download).getText();
String uploadSpeed = driver.findElement(upload).getText();
String pingValue = driver.findElement(ping).getText();
System.out.println("Download: "+downloadSpeed + "\nUpload: "+ uploadSpeed + "\n Ping: "+pingValue);
Output
Download: 78.82
Upload: 45.93
Ping: 23
I'm currently working on a research project in which we are trying to collect saved image files from Brazil's Hemeroteca database. I've done web scraping on PHP pages before using C/C++ with HTML forms, but since this is a shared script, I need to switch to Python so that everyone in the group can use this tool.
The page which I'm trying to scrape is: http://bndigital.bn.gov.br/hemeroteca-digital/
There are three forms which populate, the first being the newspaper/journal. Upon selecting this, the available times populate, and the final field is the search term. I've inspected the HTML page here and the three IDs of these are respectively: 'PeriodicoCmb1_Input', 'PeriodoCmb1_Input', and 'PesquisaTxt1'.
Some Google searches on this topic led me to the Selenium package, and I've put together this sample code to attempt to read the page:
import webbrowser
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
print("Begin...")
browser = webdriver.Chrome()
url = "http://bndigital.bn.gov.br/hemeroteca-digital/"
browser.get(url)
print("Waiting to load page... (Delay 3 seconds)")
time.sleep(3)
print("Searching for elements")
journal = browser.find_element_by_id("PeriodicoCmb1_Input")
timeRange = browser.find_element_by_id("PeriodoCmb1_Input")
searchTerm = browser.find_element_by_id("PesquisaTxt1")
print(journal)
print("Set fields, delay 3 seconds between input")
search_journal = "Relatorios dos Presidentes dos Estados Brasileiros (BA)"
search_timeRange = "1890 - 1899"
search_text = "Milho"
journal.send_keys(search_journal)
time.sleep(3)
timeRange.send_keys(search_timeRange)
time.sleep(3)
searchTerm.send_keys(search_text)
print("Perform search")
submitButton = browser.find_element_by_id("PesquisarBtn1_input")
submitButton.click()
The script runs to the print(journal) statement, where an error is thrown saying the element cannot be found.
Can anyone take a quick look at the page in question and make sure I've got the general premise of this script right, or point me towards some examples to get me going on this problem?
Thanks!
The DOM elements you are trying to find are located inside an iframe, so before using the find_element_by_id API you should switch to the iframe context.
Here is code showing how to switch to the iframe context:
# add your code
frame_ref = browser.find_elements_by_tag_name("iframe")[0]
browser.switch_to.frame(frame_ref)
journal = browser.find_element_by_id("PeriodicoCmb1_Input")
timeRange = browser.find_element_by_id("PeriodoCmb1_Input")
searchTerm = browser.find_element_by_id("PesquisaTxt1")
# add your code
Here is a link describing switching to iframe context.
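Once you are finished working inside the iframe, remember to switch back to the top-level document before locating elements outside it:
# return to the main page context when done with the iframe
browser.switch_to.default_content()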