Scraping information from Booking.com - python

I am trying to scrape some information from booking.com. I have already handled things like pagination and extracting the title.
I am trying to extract the number of guests from here.
This is my code:
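import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))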
driver.maximize_window()
test_url = 'https://www.booking.com/hotel/gr/diamandi-20.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaFyIAQGYAQm4ARjIAQzYAQPoAQGIAgGoAgS4ApGp7ZgGwAIB0gIkZTBjOTA2MTQtYTc0MC00YWUwLTk5ZWEtMWNiYzg3NThiNGQ12AIE4AIB&sid=47583bd8c0122ee70cdd7bb0b06b0944&aid=304142&ucfs=1&arphpl=1&checkin=2022-10-24&checkout=2022-10-30&dest_id=-829252&dest_type=city&group_adults=2&req_adults=2&no_rooms=1&group_children=0&req_children=0&hpos=2&hapos=2&sr_order=popularity&srpvid=f0f16af3449102aa&srepoch=1662736362&all_sr_blocks=852390201_352617405_2_0_0&highlighted_blocks=852390201_352617405_2_0_0&matching_block_id=852390201_352617405_2_0_0&sr_pri_blocks=852390201_352617405_2_0_0__30000&from=searchresults#hotelTmpl'
driver.get(test_url)
time.sleep(3)
soup2 = BeautifulSoup(driver.page_source, 'lxml')
guests = soup2.select_one('span.xp__guests__count')
guests = guests.text if guests else None
amenities = soup2.select_one('div.hprt-facilities-block')
The result is this one '\n2 adults\n·\n\n0 children\n\n·\n\n1 room\n\n'
I know that I can extract the information with a regexp, but I would like to understand whether there is a way to extract the "2 adults" directly from the element shown in the picture above.
Thanks.

This is one way to get that information, without using BeautifulSoup (why parse the page twice?):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
[...]
wait = WebDriverWait(browser, 20)
url = 'https://www.booking.com/hotel/gr/diamandi-20.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaFyIAQGYAQm4ARjIAQzYAQPoAQGIAgGoAgS4ApGp7ZgGwAIB0gIkZTBjOTA2MTQtYTc0MC00YWUwLTk5ZWEtMWNiYzg3NThiNGQ12AIE4AIB&sid=47583bd8c0122ee70cdd7bb0b06b0944&aid=304142&ucfs=1&arphpl=1&checkin=2022-10-24&checkout=2022-10-30&dest_id=-829252&dest_type=city&group_adults=2&req_adults=2&no_rooms=1&group_children=0&req_children=0&hpos=2&hapos=2&sr_order=popularity&srpvid=f0f16af3449102aa&srepoch=1662736362&all_sr_blocks=852390201_352617405_2_0_0&highlighted_blocks=852390201_352617405_2_0_0&matching_block_id=852390201_352617405_2_0_0&sr_pri_blocks=852390201_352617405_2_0_0__30000&from=searchresults#hotelTmpl'
browser.get(url)
guest_count = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span[class='xp__guests__count']"))).find_element(By.TAG_NAME, "span")
print(guest_count.text)
Result in terminal:
2 adults
Selenium docs can be found at https://www.selenium.dev/documentation/

I haven't used BeautifulSoup. I use Selenium. This is how I would do it in Selenium:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.maximize_window()
test_url = 'https://www.booking.com/hotel/gr/diamandi-20.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaFyIAQGYAQm4ARjIAQzYAQPoAQGIAgGoAgS4ApGp7ZgGwAIB0gIkZTBjOTA2MTQtYTc0MC00YWUwLTk5ZWEtMWNiYzg3NThiNGQ12AIE4AIB&sid=47583bd8c0122ee70cdd7bb0b06b0944&aid=304142&ucfs=1&arphpl=1&checkin=2022-10-24&checkout=2022-10-30&dest_id=-829252&dest_type=city&group_adults=2&req_adults=2&no_rooms=1&group_children=0&req_children=0&hpos=2&hapos=2&sr_order=popularity&srpvid=f0f16af3449102aa&srepoch=1662736362&all_sr_blocks=852390201_352617405_2_0_0&highlighted_blocks=852390201_352617405_2_0_0&matching_block_id=852390201_352617405_2_0_0&sr_pri_blocks=852390201_352617405_2_0_0__30000&from=searchresults#hotelTmpl'
driver.get(test_url)
time.sleep(3)
element = driver.find_element(By.XPATH, "//span[@class='xp__guests__count']")
adults = int(element.text.split(" adults")[0])
print(str(adults))
Basically, I find the span element that contains the text you are looking for. .text gives you all the inner text (in this case, "2 adults · 0 children · 1 room").
The next line takes only the part of the string that comes before " adults", then casts it as an int.
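If all three values are wanted rather than only the adults, the raw text from the question can also be split on the "·" separator without a regular expression. The sketch below is a plain-Python illustration; the helper name `split_occupancy` is made up for this example, and it assumes the element text keeps the "·"-separated layout shown in the question.

```python
def split_occupancy(raw):
    """Split the guest summary on the middle dot and drop surrounding whitespace."""
    return [part.strip() for part in raw.split("·") if part.strip()]


# Text exactly as returned by BeautifulSoup in the question.
raw = "\n2 adults\n·\n\n0 children\n\n·\n\n1 room\n\n"

parts = split_occupancy(raw)
first = parts[0]                 # the "2 adults" piece
adults = int(first.split()[0])   # leading number of that piece
print(first)
```

Splitting on the separator avoids depending on the word "adults", so the same helper also yields the children and room pieces.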

Related

How can I web scrape information from a website that has all the tags in the <pre>preformatted tag section?

I am creating a Python crawler that scrapes information from the Interpol website. I was able to scrape information from the first page, such as names, dates of birth, and nationalities. To scrape the second page, I first got its URL from the tag and navigated to it in my program. When I opened that URL, I found that all the information (meaning all the tags) was inside a <pre> tag, and I am confused about why that is the case. So my question is: how can I get information from inside the <pre> tag where all the other tags are? I am trying to get people's names, birthdays, their corresponding links, etc. I am using Selenium, by the way. Below are the URL of the main website and the URL of the second page that I found in the tag.
Main Website:
https://www.interpol.int/en/How-we-work/Notices/View-Red-Notices
The second-page link I found in the tag:
https://ws-public.interpol.int/notices/v1/red?resultPerPage=20&page=2
The code I have so far is below:
from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException, NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'https://www.interpol.int/en/How-we-work/Notices/View-Red-Notices'
driver = webdriver.Chrome(executable_path="c:\\SeliniumWebDrivers\\chromedriver.exe")
driver.get(url)  # go to the website
url = []  # URLs of the people
names = []  # names of the people
age = []  # ages of the people
nationality = []  # nationalities of the people
newwindow = []  # links to the following pages
y = 0
g = 1
try:
    driver.get(driver.current_url)
    main = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'noticesResultsItemList'))
    )
    links = main.find_elements_by_tag_name("a")
    years = main.find_elements_by_class_name("age")
    borns = main.find_elements_by_class_name("nationalities")
    for link in links:
        newurl = link.get_attribute('href')
        url.append(newurl)
        names.append(link.text)  # adding the names
        y += 1
    for year in years:
        age.append(year.text)  # adding the age to the list
    for nation in borns:
        nationality.append(nation.text)  # adding the nationality to the list
    driver.get(driver.current_url)
    driver.refresh()
    next = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.ID, 'paginationPanel'))
    )
    pages = next.find_elements_by_tag_name("a")
    for page in pages:
        newlink = page.get_attribute('href')
        newwindow.append(newlink)
    # to get to the next page
    print(newwindow[2])
    driver.get(newwindow[2])
finally:
    pass  # the rest of the try block was cut off in the post
You can use Selenium to click the next-page link instead of collecting its URL. This is just a simple example; you may need to extend the loop to extract the data before clicking through to the next page. I've used the variable browser instead of main, and I've written a function and a for loop to get the data from each page:
from selenium import webdriver
import time
from selenium.common.exceptions import NoSuchElementException,ElementNotInteractableException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
browser = webdriver.Chrome('/home/cam/Downloads/chromedriver')
url='https://www.interpol.int/en/How-we-work/Notices/View-Red-Notices'
browser.get(url)
def get_data():
    # collect the elements of interest on the current page
    links = browser.find_elements_by_tag_name("a")
    years = browser.find_elements_by_class_name("age")
    borns = browser.find_elements_by_class_name("nationalities")
    return links, years, borns
time.sleep(5)
try:
    # dismiss the cookie banner if it is present
    browser.find_element_by_xpath('//*[@id="privacy-cookie-banner__privacy-accept"]').click()
except (NoSuchElementException, ElementNotInteractableException):
    pass
for i in range(1,9):
    print(i)
    get_data()
    print('//*[@id="paginationPanel"]/div/div/ul/li['+str(i+2)+']/a')
    b=WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="paginationPanel"]/div/div/ul/li['+str(i+2)+']/a')))
    b.click()
    time.sleep(10)
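On the <pre> question itself: the second-page URL points at the ws-public.interpol.int web service rather than an HTML page, so what the browser shows inside <pre> is most likely the raw JSON response, not markup. Below is a minimal sketch of reading it, assuming the text of the <pre> element is valid JSON. The helper name and the sample payload (including the `_embedded`/`notices` layout and every field value) are placeholders for illustration, not the documented response format.

```python
import json


def notices_from_pre_text(pre_text):
    """Parse the JSON text shown inside <pre> and return the list of notices.

    ASSUMPTION: the notices sit under payload["_embedded"]["notices"];
    adjust the keys to whatever the real response contains.
    """
    payload = json.loads(pre_text)
    return payload.get("_embedded", {}).get("notices", [])


# With Selenium the text could be read with
#   driver.find_element(By.TAG_NAME, "pre").text
# Here a made-up sample stands in for it.
sample = '{"_embedded": {"notices": [{"forename": "JANE", "name": "DOE", "date_of_birth": "1980/01/01", "nationalities": ["XX"]}]}}'

for notice in notices_from_pre_text(sample):
    print(notice["forename"], notice["name"], notice["date_of_birth"], notice["nationalities"])
```

Once the payload is a Python dictionary, there are no tags left to locate; the fields can be read by key.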

How can I extract the text elements using Selenium in Python?

Consider:
I am using Selenium to scrape the contents from the App Store: https://apps.apple.com/us/app/bank-of-america-private-bank/id1096813830
I tried to extract the text field "As subject matter experts, our team is very engaging..."
I tried to find elements by class
review_ratings = driver.find_elements_by_class_name('we-truncate we-truncate--multi-line we-truncate--interactive ember-view we-customer-review__body')
review_ratingsList = []
for e in review_ratings:
    review_ratingsList.append(e.get_attribute('innerHTML'))
review_ratings
But it returns an empty list [].
Is anything wrong with the code? Or is there a better solution?
Using Requests and Beautiful Soup:
import requests
from bs4 import BeautifulSoup
url = 'https://apps.apple.com/us/app/bank-of-america-private-bank/id1096813830'
res = requests.get(url)
soup = BeautifulSoup(res.text,'lxml')
item = soup.select_one("blockquote > p").text
print(item)
Output:
As subject matter experts, our team is very engaging and focused on our near and long term financial health!
You can use WebDriverWait to wait for the elements to become visible and then read their text. It also helps to choose a good Selenium locator.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
#...
wait = WebDriverWait(driver, 5)
review_ratings = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".we-customer-review")))
for review_rating in review_ratings:
    stars = review_rating.find_element_by_css_selector(".we-star-rating").get_attribute("aria-label")
    title = review_rating.find_element_by_css_selector("h3").text
    review = review_rating.find_element_by_css_selector("p").text
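A likely reason the original call returns an empty list is that find_elements_by_class_name expects a single class name, while the question passes several separated by spaces. One way around it is to turn the class attribute into a compound CSS selector. The helper below is a hypothetical convenience function, not part of Selenium:

```python
def classes_to_css(class_attr):
    """Turn a space-separated class attribute into a compound CSS selector."""
    return "".join("." + name for name in class_attr.split())


selector = classes_to_css("we-truncate we-customer-review__body")
print(selector)
# The result can be passed to a CSS lookup, e.g.
#   driver.find_elements(By.CSS_SELECTOR, selector)
```

In practice one or two stable class names are usually enough; listing every class from the attribute makes the selector brittle.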
Mix Selenium with Beautiful Soup.
Using WebDriver:
from bs4 import BeautifulSoup
from selenium import webdriver
browser = webdriver.Chrome()
url = "https://apps.apple.com/us/app/bank-of-america-private-bank/id1096813830"
browser.get(url)
innerHTML = browser.execute_script("return document.body.innerHTML")
bs = BeautifulSoup(innerHTML, 'html.parser')
bs.blockquote.p.text
Output:
Out[22]: 'As subject matter experts, our team is very engaging and focused on our near and long term financial health!'
Use WebDriverWait and wait for presence_of_all_elements_located and use the following CSS selector.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://apps.apple.com/us/app/bank-of-america-private-bank/id1096813830")
review_ratings = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.we-customer-review__body p[dir="ltr"]')))
review_ratingsList = []
for e in review_ratings:
    review_ratingsList.append(e.get_attribute('innerHTML'))
print(review_ratingsList)
Output:
['As subject matter experts, our team is very engaging and focused on our near and long term financial health!', 'Very much seems to be an unfinished app. Can’t find secure message alert. Or any alerts for that matter. Most of my client team is missing from the “send to” list. I have other functions very useful, when away from my computer.']

Unable to grab certain links from dynamic content

I've written a script in Python with Selenium to scrape the links of the different properties listed in the panel to the right of the map on the landing page.
Link to the landing page
When I click on each block manually in Chrome, the links that open in a new tab contain /for_sale/, whereas the links my script fetches contain /homedetails/.
How can I get the number of results (such as 153 homes for sale) along with the right links to the properties?
My try so far:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
link = "https://www.zillow.com/homes/33155_rb/"
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get(link)
itemcount = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,"#map-result-count-message h2")))
print(itemcount.text)
for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".zsg-photo-card-overlay-link"))):
    print(item.get_attribute("href"))
driver.quit()
One of the current outputs:
https://www.zillow.com/homedetails/6860-SW-48th-Ter-Miami-FL-33155/44206318_zpid/
One of the expected outputs:
https://www.zillow.com/homes/for_sale/Miami-FL-33155/house_type/44184455_zpid/72458_rid/globalrelevanceex_sort/25.776783,-80.256072,25.695446,-80.364905_rect/12_zm/0_mmm/
While analyzing the /homedetails/ and /for_sale/ links, I found that a /homedetails/ link usually contains a code like this:
44206318_zpid
That code acts as a unique identifier for the ad post. I extracted it and appended it to:
https://www.zillow.com/homes/for_sale/
so the final link for the ad post looks like this:
https://www.zillow.com/homes/for_sale/44206318_zpid
It's a valid link and takes you to the ad post.
Here is the final script:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
link = "https://www.zillow.com/homes/33155_rb/"
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get(link)
itemcount = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,"#map-result-count-message h2")))
print(itemcount.text)
for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".zsg-photo-card-overlay-link"))):
    link = item.get_attribute("href")
    if "zpid" in link:
        print("https://www.zillow.com/homes/for_sale/{}".format(link.split('/')[-2]))
I hope this would help.
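The link rewriting described in this answer can also be pulled into a small helper that is easy to check without a browser. This is only a sketch: the function and constant names are made up, and it uses a regular expression instead of split('/')[-2] so that it does not depend on the URL ending with a slash.

```python
import re

FOR_SALE_BASE = "https://www.zillow.com/homes/for_sale/"


def to_for_sale_link(href):
    """Return the /for_sale/ link for a listing URL, or None if it has no zpid."""
    match = re.search(r"(\d+_zpid)", href)
    return FOR_SALE_BASE + match.group(1) if match else None


# The /homedetails/ link quoted in the question.
detail = "https://www.zillow.com/homedetails/6860-SW-48th-Ter-Miami-FL-33155/44206318_zpid/"
print(to_for_sale_link(detail))
```

Returning None for links without a zpid plays the same role as the `if "zpid" in link` check in the script.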
You can loop over the pagination divs and keep a running counter of the number of homes displayed on each page. To parse the html, this answer utilizes BeautifulSoup:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import re, time
def home_num(_d:soup) -> int:
    return len(_d.find_all('a', {'href':re.compile('^/homedetails/')}))
d = webdriver.Chrome('/Users/jamespetullo/Downloads/chromedriver')
d.get('https://www.zillow.com/homes/33155_rb/')
homecount, _links = home_num(soup(d.page_source, 'html.parser')), []
_seen_links, _result_links = [], []
_start = [i for i in d.find_elements_by_tag_name('a') if isinstance(i.get_attribute("href"), str) and re.findall('/homes/for_sale/', i.get_attribute("href")) and i.get_attribute("href") not in _seen_links]
while _start:
    _new_start = _start[0]
    try:
        _new_start.send_keys('\n')
        time.sleep(5)
        _start = [i for i in d.find_elements_by_tag_name('a') if isinstance(i.get_attribute("href"), str) and re.findall('/homes/for_sale/', i.get_attribute("href")) and i.get_attribute("href") not in _seen_links]
    except:
        _seen_links.append(_new_start.get_attribute('href'))
        _start = [i for i in d.find_elements_by_tag_name('a') if isinstance(i.get_attribute("href"), str) and re.findall('/homes/for_sale/', i.get_attribute("href")) and i.get_attribute("href") not in _seen_links]
    else:
        _seen_links.append(_new_start.get_attribute('href'))
        _result_links.append(_new_start.get_attribute('href'))
        homecount += home_num(soup(d.page_source, 'html.parser'))
If you inspect the images on the right-hand side of the page, you will see "homedetails", not "for_sale".
Just try opening a link in a new tab and you will see that the actual link is a "homedetails" one.

I'm trying to scrape data from the at the races website but the scraper is not returning any results

from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
login_url = 'http://www.attheraces.com/racecard/Wolverhampton/6-October-2018/1715'
driver.get(login_url)
html = driver.execute_script("return document.documentElement.outerHTML")
sel_soup = BeautifulSoup(html, 'html.parser')
print(sel_soup.findAll("sectionals-time"))
When I run the last line of the script, it just returns
[]
As far as I am aware it is a dynamic website. When you go to the site and scroll down to the results, you click the sectional times tab, then right-click the first sectional time for the first listed horse and inspect it. This shows me the class attribute as "sectionals-time", so I'm struggling to understand why the script is not producing the sectional times for the horses.
Any advice and help is much appreciated.
This should work. Leave a comment if you need the output to be different.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait
url = 'http://www.attheraces.com/racecard/Wolverhampton/6-October-2018/1715'
driver = webdriver.Chrome()
driver.get(url)
driver.implicitly_wait(2)
driver.find_element_by_xpath('//*[@id="racecard-tabs-1061960"]/div[1]/div/div[1]/ul/li[2]/a').click()
WebDriverWait(driver, 5).until(expected_conditions.presence_of_element_located((By.XPATH, '//*[@id="tab-racecard-sectional-times"]/div/div[1]/div[1]/div[2]/div/button')))
# method 1
for horse in driver.find_elements_by_class_name('card-item'):
    horseName = horse.find_element_by_class_name('form-link').text
    times = horse.find_elements_by_class_name('sectionals-time')
    times = [time.text for time in times]
    print('{}: {}'.format(horseName, times))
print()
# method 2
for horse in driver.find_elements_by_class_name('card-item'):
    for time in horse.find_elements_by_class_name('sectionals-time'):
        print(time.text)
    print()
driver.close()
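It looks to me like you have the selector wrong: findAll("sectionals-time") looks for a tag named sectionals-time, not for a class.
Should you be specifying the following instead?
sel_soup.findAll("span", {"class": "sectionals-time"})
Hope that helps.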
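The difference between matching a tag name and matching a class can be seen in a small self-contained sketch. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; the class name `ClassTextCollector`, the sample markup and the times in it are all made up for illustration.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every <tag> element that carries a given class."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.depth = 0          # > 0 while inside a matching element
        self.results = []
        self.seen_tags = set()  # every tag name met in the document

    def handle_starttag(self, tag, attrs):
        self.seen_tags.add(tag)
        if self.depth:
            if tag == self.tag:
                self.depth += 1  # nested element of the same tag
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.class_name in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


# Made-up markup; the times are placeholders, not real race data.
html = ('<div><span class="sectionals-time">12.3s</span>'
        '<span class="other">n/a</span>'
        '<span class="sectionals-time fast">13.1s</span></div>')

collector = ClassTextCollector("span", "sectionals-time")
collector.feed(html)
print(collector.results)
print("sectionals-time" in collector.seen_tags)
```

No element is named sectionals-time, so a search by tag name finds nothing, while filtering span elements by class finds both values.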

Trouble scraping titles from a webpage

I've written a script in Python with Selenium to parse some results that are populated after filling in an input box and pressing the Go button. My script handles this part well at the moment. However, my main goal is to also parse the title of that container, shown as Toys & Games.
This is my attempt so far (I could not work out how to write a loop that does the same for all the containers):
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
url = "https://www.fbatoolkit.com/"
driver = webdriver.Chrome()
driver.get(url)
time.sleep(3)
driver.find_element_by_css_selector(".estimator-container .estimator-input").send_keys("25000",Keys.RETURN)
time.sleep(2)
item = driver.find_element_by_css_selector(".estimator-result div").text
print(item)
driver.quit()
The result I get:
4 (30 Days Avg)
Result I would like to have:
Toys & Games
4 (30 Days Avg)
Link to an image showing how they look on that site. The expected fields are marked with a pencil to show the location of the fields I'm trying to parse.
Try the code below to get the required output:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
url = "https://www.fbatoolkit.com/"
driver = webdriver.Chrome()
driver.get(url)
for container in wait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[class='chart-container']"))):
    wait(container, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input.estimator-input"))).send_keys("25000", Keys.RETURN)
    title = wait(container, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".chart text"))).text
    item = wait(container, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".estimator-result div"))).text
    print(title, item)
driver.quit()
