Scraping dynamic data from a form on a site - python

I am trying to scrape a dynamic list of options from a form on a site. The site works like this: when you enter some text in the query box, it treats the input as keywords, searches its own database, and generates results accordingly.
I am trying to extract the complete list by scraping with Selenium.
Initially, in the inspect element section, I have one block of markup, and it changes when keywords are typed into the form (the HTML snippets are not included here). This is my scraping attempt:
depart = []
for i in range(1, 100):
    try:
        depart.append(browser.find_elements_by_class_name("accessabilityBar textIndent")[i].text)
    except Exception as e:
        break
print(depart)
So, here is what I get as output: [u'']
Can somebody help me out with this?

browser.find_elements_by_class_name("accessabilityBar textIndent") raises an exception because compound class names are not permitted, but the exception is caught by your except block.
Try the below instead:
depart = [item.text for item in browser.find_elements_by_css_selector("span.accessabilityBar.textIndent")]
If you need to wait until the text is generated, you might need to use an explicit wait:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(browser, 10).until(EC.frame_to_be_available_and_switch_to_it(browser.find_element_by_xpath('//iframe[@src="s.effectivemeasure.net/html/frame_2.3.7.html"]')))
depart = [item.text for item in WebDriverWait(browser, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//span[@class='accessabilityBar textIndent' and normalize-space()]")))]

Related

Selenium wrong selectors leading to no output

I'm trying to scrape this website
Best Western Mornington Hotel
for the names of hotel rooms and the prices of those rooms. I'm using Selenium to try to scrape this data, but I keep getting no output, which I assume is because I'm using the wrong selectors/XPath. Is there any method of identifying the correct XPath/div class/selector? I feel like I have selected the correct ones, but there is no output.
from re import sub
from decimal import Decimal
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

seleniumurl = 'https://www.bestwestern.co.uk/hotels/best-western-mornington-hotel-london-hyde-park-83187/in-2021-06-03/out-2021-06-05/adults-1/children-0/rooms-1'

driver = webdriver.Chrome(executable_path='C:\\Users\\Conor\\Desktop\\diss\\chromedriver.exe')
driver.get(seleniumurl)
time.sleep(5)

working = driver.find_elements_by_class_name('room-type-block')
for work in working:
    name = work.find_elements_by_xpath('.//div/h4').string
    price = work.find_elements_by_xpath('.//div[2]/div[2]/div/div[1]/div/div[3]/div/div[1]/div/div[2]/div[1]/div[2]/div[1]/div[1]/span[2]').string
    print(name, price)
I only work with Selenium in Java, but from what I can see, you're trying to get a collection of WebElements and invoke toString() on them...
Shouldn't that be find_element_by_xpath, to get just one WebElement, and then a call to .text instead of .string?
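In Python, that suggestion would look roughly like the sketch below (a minimal rewrite of the question's loop, untested against the live page; the relative XPath is taken from the question as-is):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.bestwestern.co.uk/hotels/best-western-mornington-hotel-london-hyde-park-83187/in-2021-06-03/out-2021-06-05/adults-1/children-0/rooms-1')

for work in driver.find_elements_by_class_name('room-type-block'):
    # find_element_by_xpath (singular) returns a single WebElement,
    # and its .text property exposes the visible text.
    name = work.find_element_by_xpath('.//div/h4').text
    print(name)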
Marek is right: use .text instead of .string. Or use .get_attribute("innerHTML"). I also think your XPath may be wrong, unless I'm looking at the wrong page. Here are some XPaths from the page you linked.
# This will get all the room type sections.
roomTypes = driver.find_elements_by_xpath("//div[contains(@class,'room-type-box__content')]")

# This will get the room type titles.
titles = driver.find_elements_by_xpath("//div[contains(@class,'room-type-title')]/h3")

# Print out the room type titles.
for t in titles:
    print(t.text)
Please use the selector div#rr_wrp div.room-type-block and the visibility_of_all_elements_located method to get the category div list.
Within each of those elements, you can find the title with the XPath .//h2[@class="room-type--title"], the sub-category with .//strong[@class="trimmedTitle rt-item--title"], and the price with .//div[@class="rt-rate-right--row group"]//span[@data-bind="text: priceText"].
Please try the following code, which uses zip to iterate the parallel lists:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(executable_path='C:\\Users\\Conor\\Desktop\\diss\\chromedriver.exe')
driver.get('https://www.bestwestern.co.uk/hotels/best-western-mornington-hotel-london-hyde-park-83187/in-2021-06-03/out-2021-06-05/adults-1/children-0/rooms-1')
wait = WebDriverWait(driver, 20)

elements = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'div#rr_wrp div.room-type-block')))
for element in elements:
    for room_title in element.find_elements_by_xpath('.//h2[@class="room-type--title"]'):
        print("Main Title ==>> " + room_title.text)
    for room_type, room_price in zip(element.find_elements_by_xpath('.//strong[@class="trimmedTitle rt-item--title"]'),
                                     element.find_elements_by_xpath('.//div[@class="rt-rate-right--row group"]//span[@data-bind="text: priceText"]')):
        print(room_type.text + " " + room_price.text)

driver.quit()

Selenium (Python): check whether or not an element exists

I'm trying to figure out how to run this loop properly. My issue is that, depending on the link being loaded, the page may show an access-denied error; this doesn't happen for all the links. I would like the program to recognize when that particular element loads onto the screen, break out of the current iteration, and start the next one in the for loop. So I'm trying to determine whether the "Access-Denied" element is present: if it is, break; otherwise, continue the for loop.
idList = ["8573", "85678", "2378", "2579"]
for ID in idList:
    print(ID)
    driver.get(f"https://www.someWebsite/username/{ID}")
    element = driver.find_element_by_class_name("Access-Denied")
    print("error loading website")
    break
    if not element:
        print("you may continue the for loop")
Mind you, if the element showing the access-denied page isn't present, I get an error that the 'Access-Denied' element doesn't exist. How can I fix this?
You want to wait for the webpage to receive the proper response. Using the following code, you can wait for the full response to load, and then take appropriate action based on the outcome:
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

...

try:
    _ = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "Access-Denied"))
    )
    print("error loading website")
    break
except TimeoutException:
    print("you may continue the for loop")

...
So you want to loop through the IDs and break if the access-denied element is there.
wait = WebDriverWait(driver, 10)
idList = ["8573", "85678", "2378", "2579"]
for ID in idList:
    print(ID)
    driver.get(f"https://www.someWebsite/username/{ID}")
    try:
        element = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'Access-Denied')))
        break
    except TimeoutException:
        continue
Imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

Getting access to an HTML element in a Twitter iframe without a src property

I have been using Python with BeautifulSoup 4 to scrape the data out of the UN Global Compact website. Some companies on there, like this one: https://www.unglobalcompact.org/what-is-gc/participants/2968-Orsted-A-S
have Twitter accounts. I would like to access the names of the Twitter accounts. The problem is that the name is inside an iframe without a src property. I know that the iframe is loaded by a different request than the rest of the website, but I wonder: is it even possible to access it without a visible src property?
You can use selenium to do this. Here is the full code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

url = "https://www.unglobalcompact.org/what-is-gc/participants/2968-Orsted-A-S"
driver = webdriver.Chrome()
driver.get(url)

iframe = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="twitter-widget-0"]')))
driver.switch_to.frame(iframe)

names = driver.find_elements_by_xpath('//*[@class="TweetAuthor-name Identity-name customisable-highlight"]')
names = [name.text for name in names]

try:
    # Finds the most frequently occurring name. The author's feed also contains
    # retweets of tweets made by others, which carry other people's names; the
    # most frequently occurring name is the name of the author.
    name = max(set(names), key=names.count)
    print(name)
except ValueError:
    print("No Twitter Feed Found!")

driver.close()
Output:
Ørsted

Scrape data from a table whose elements don't load immediately

I've been trying to scrape data from a table using selenium, but when I run the code, it only gets the header of the table.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.panamacompra.gob.pa/Inicio/#!/busquedaAvanzada?BusquedaRubros=true&IdRubro=41')
driver.implicitly_wait(100)
table = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody')
print(table.text)
I also tried finding the element by tag name ('table'), without luck.
you should try this:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.panamacompra.gob.pa/Inicio/#!/busquedaAvanzada?BusquedaRubros=true&IdRubro=41')
driver.implicitly_wait(100)
table = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody')

number = 2
while number < 12:
    content = driver.find_element_by_xpath('//*[@id="body"]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody/tr[' + str(number) + ']')
    print(content.text)
    number += 1
The XPath in 'table' matches just the header; the actual content is at '//*[@id="body"]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody/tr[' + str(number) + ']', which is why you are not getting any content other than the header. Since the XPaths of the rows look like .../tr[2], .../tr[3], .../tr[4], etc., I'm using str(number) with number < 12 to get all the rows; you can also try 50 rows at a time, it's up to you.
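A variation that avoids hard-coding the row count is to fetch all the rows in one call and iterate over them (same XPath as above, minus the index; untested against the live page):
# Grab every row of the table body at once instead of tr[2]..tr[11].
rows = driver.find_elements_by_xpath('//*[@id="body"]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody/tr')
for row in rows:
    print(row.text)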
I would use requests and mimic the POST request made by the page, as it is much faster:
import requests

s = requests.Session()
data = {'METHOD': '0', 'VALUE': '{"BusquedaRubros":"true","IdRubro":"41","Inicio":0}'}
r = s.post('http://www.panamacompra.gob.pa/Security/AmbientePublico.asmx/cargarActosOportunidadesDeNegocio', data=data).json()
print(r['listActos'])
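The Inicio field in that payload looks like a row offset, so you could likely page through the results by incrementing it. That is an assumption about the endpoint, not documented behavior; a sketch:
import json
import requests

s = requests.Session()
url = 'http://www.panamacompra.gob.pa/Security/AmbientePublico.asmx/cargarActosOportunidadesDeNegocio'

# Assumption: 'Inicio' is the starting row of each page of results.
for offset in range(0, 100, 20):
    payload = {'METHOD': '0',
               'VALUE': json.dumps({"BusquedaRubros": "true", "IdRubro": "41", "Inicio": offset})}
    r = s.post(url, data=payload).json()
    print(r['listActos'])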
You need to wait until the loader disappears; you can use invisibility_of_element_located with WebDriverWait and expected_conditions. For the table you can use a css_selector instead of your XPath.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
driver = webdriver.Chrome()
driver.get('http://www.panamacompra.gob.pa/Inicio/#!/busquedaAvanzada?BusquedaRubros=true&IdRubro=41')
time.sleep(2)
WebDriverWait(driver, 50).until(EC.invisibility_of_element_located((By.XPATH, '//img[@src="images/loading.gif"]')))
table = driver.find_element_by_css_selector('.table_asearch.table.table-bordered.table-striped.table-hover.table-condensed')
print(table.text)
driver.quit()
Selenium finds the table as soon as it loads (which happens fairly quickly) and assumes it is done, so the table rows (which load more slowly) never get a chance to appear. One way around this is to repeatedly try to find an element that won't appear until the table is finished loading.
This is FAR from the most elegant solution (and there's probably Selenium libraries that do it better), but you can wait for the table by checking to see if a new table row can be found, and if not, sleep for 1 second before trying again.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Chrome()
driver.get('http://www.panamacompra.gob.pa/Inicio/#!/busquedaAvanzada?BusquedaRubros=true&IdRubro=41')

wvar = 0
while wvar == 0:
    try:
        # try loading one of the elements we want to read
        el = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody/tr[3]')
        wvar = 1
    except NoSuchElementException:
        # not loaded yet
        print('table body empty, waiting...')
        time.sleep(1)

print('table loaded!')

# element got loaded; reload the table
table = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody')
print(table.text)
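For reference, the same poll-until-present logic can be expressed with Selenium's built-in WebDriverWait, which the other answers here already use; a minimal sketch, reusing the driver and row XPath from the snippet above:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 60 seconds for the third row to appear (polling internally),
# then read the whole table body.
WebDriverWait(driver, 60).until(EC.presence_of_element_located(
    (By.XPATH, '/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody/tr[3]')))
table = driver.find_element_by_xpath('/html/body/div[1]/div[2]/div/div[2]/div/div/div[2]/div[2]/div[3]/table/tbody')
print(table.text)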

How to Skip a Webpage After a Period of Time in Selenium

I am parsing a file with a ton of colleges. Selenium googles "Admissions " + college_name, then clicks the first link and gets some data from each page. The issue is that the list of college names I am pulling from is very rough (technically a list of all accredited institutions in America), so some of the links are broken or get stuck in a load loop. How do I set some sort of timer that basically says:
if page load time > x seconds:
    go to next element in list
You could invoke WebDriverWait on the page, and if the wait raises a TimeoutException, you will know the page took too long to load, so you can proceed to the next one.
Given that you do not know what each page's HTML will look like, this is a very challenging problem.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# list of college names
names = []

for name in names:
    # search for the college here

    # get list of search results
    WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='rc']")))
    search_results = driver.find_elements_by_xpath("//div[@class='rc']")

    # get first result
    search_result = search_results[0]

    # attempt to load the page
    try:
        search_result.click()
    except TimeoutException:
        # click operation should time out if next page does not load
        # pass to move on to next URL
        pass
This is a very rough, general outline. As I mentioned, without knowing what the expected page title will be, or what the expected page content will look like, it's incredibly difficult to write a generic method that will successfully accomplish this. This code is meant to be just a starting point for you.
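If you want the literal "if page load time > x seconds, skip it" behavior from the question, Selenium also lets you cap the page-load time directly with set_page_load_timeout. A minimal sketch; the colleges list and the first_search_result_url helper are hypothetical placeholders for the question's file parsing and Google search steps:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.set_page_load_timeout(10)  # driver.get() raises TimeoutException after 10 s

colleges = ["Example College"]  # hypothetical list parsed from the file
for college in colleges:
    try:
        driver.get(first_search_result_url(college))  # hypothetical helper
    except TimeoutException:
        continue  # page took too long: move on to the next college
    # ...scrape the admissions data here...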
