I am scraping this page, trying to click the mail icon next to the author names. I have tried many things but cannot seem to find/click it. Some help, please?
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(executable_path='C:/chromedriver.exe')
search_term = input("Enter your search term :")
url = f'https://www.sciencedirect.com/search?qs={search_term}&show=100'
driver.get(url)
driver.maximize_window()
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH,'/html/body/div[3]/div/div/div/button/span'))).click()
divs = driver.find_elements_by_class_name('result-item-content')
links = []
for div in divs:
    link = div.find_element_by_tag_name('a')
    links.append(link)
links[0].click()
div = driver.find_element_by_id('author-group')
print(div.text[0:])
name_links = div.find_elements_by_tag_name('a')
spans =[]
for name in name_links:
    span = name.find_element_by_tag_name('span')
    spans.append(span)
for span in spans:
    mail = span.find_element_by_class_name('icon icon-envelope')
    mail.click()
    break
It seems that not every author has that icon, but even taking that into account, you have a couple of mistakes in the current approach:
you are looking inside each span element of the author group - you don't have to do that
find_element_by_class_name only works with a single class value, not multiple (class is a multi-valued attribute, with a space as the delimiter between values) - see the sketch just below
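For example, a minimal sketch of the fix for that second point: a compound class value has to go through a CSS selector instead:

# invalid: find_element_by_class_name accepts a single class value only
# mail = span.find_element_by_class_name('icon icon-envelope')
# valid: a CSS selector can chain both class values
mail = span.find_element_by_css_selector('.icon.icon-envelope')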
Here is how I would go about this:
from selenium.common.exceptions import NoSuchElementException
author_group = driver.find_element_by_id('author-group')
for author in author_group.find_elements_by_css_selector("a.author"):
    try:
        given_name = author.find_element_by_css_selector(".given-name").text
        surname = author.find_element_by_css_selector(".surname").text
    except NoSuchElementException:
        print("Could not extract first or last name")
        continue
    try:
        mail_icon = author.find_element_by_css_selector(".icon-envelope")
        mail_icon_present = True
    except NoSuchElementException:
        mail_icon_present = False
    print(f"Author {given_name} {surname}. Mail icon present: {mail_icon_present}")
Notes:
note how we iterate over the authors, container by container, and then look for specific properties inside each one
note how we check for the presence of the mail icon in a forgiving EAFP manner
the . before a class value in a CSS selector is special syntax to match an element by a single class value
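If you actually need to click the icon when it is present, as the original code attempted, the second try block can be extended. A sketch under the same assumptions (the envelope element is assumed to be directly clickable):

try:
    mail_icon = author.find_element_by_css_selector(".icon-envelope")
    mail_icon.click()  # assumed to trigger the author's mail link
    mail_icon_present = True
except NoSuchElementException:
    mail_icon_present = False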
Related
I am trying to scrape some information from booking.com. I have already handled some things, like pagination, extracting the title, etc.
I am trying to extract the number of guests from here.
This is my code:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
test_url = 'https://www.booking.com/hotel/gr/diamandi-20.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaFyIAQGYAQm4ARjIAQzYAQPoAQGIAgGoAgS4ApGp7ZgGwAIB0gIkZTBjOTA2MTQtYTc0MC00YWUwLTk5ZWEtMWNiYzg3NThiNGQ12AIE4AIB&sid=47583bd8c0122ee70cdd7bb0b06b0944&aid=304142&ucfs=1&arphpl=1&checkin=2022-10-24&checkout=2022-10-30&dest_id=-829252&dest_type=city&group_adults=2&req_adults=2&no_rooms=1&group_children=0&req_children=0&hpos=2&hapos=2&sr_order=popularity&srpvid=f0f16af3449102aa&srepoch=1662736362&all_sr_blocks=852390201_352617405_2_0_0&highlighted_blocks=852390201_352617405_2_0_0&matching_block_id=852390201_352617405_2_0_0&sr_pri_blocks=852390201_352617405_2_0_0__30000&from=searchresults#hotelTmpl'
driver.get(test_url)
time.sleep(3)
soup2 = BeautifulSoup(driver.page_source, 'lxml')
guests = soup2.select_one('span.xp__guests__count')
guests = guests.text if guests else None
amenities = soup2.select_one('div.hprt-facilities-block')
The result is this: '\n2 adults\n·\n\n0 children\n\n·\n\n1 room\n\n'
I know that with some regexp I can extract the information, but I would like to understand whether there is a way to extract the "2 adults" directly.
Thanks.
This is one way to get that information, without using BeautifulSoup (why parse the page twice?):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
[...]
wait = WebDriverWait(browser, 20)
url = 'https://www.booking.com/hotel/gr/diamandi-20.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaFyIAQGYAQm4ARjIAQzYAQPoAQGIAgGoAgS4ApGp7ZgGwAIB0gIkZTBjOTA2MTQtYTc0MC00YWUwLTk5ZWEtMWNiYzg3NThiNGQ12AIE4AIB&sid=47583bd8c0122ee70cdd7bb0b06b0944&aid=304142&ucfs=1&arphpl=1&checkin=2022-10-24&checkout=2022-10-30&dest_id=-829252&dest_type=city&group_adults=2&req_adults=2&no_rooms=1&group_children=0&req_children=0&hpos=2&hapos=2&sr_order=popularity&srpvid=f0f16af3449102aa&srepoch=1662736362&all_sr_blocks=852390201_352617405_2_0_0&highlighted_blocks=852390201_352617405_2_0_0&matching_block_id=852390201_352617405_2_0_0&sr_pri_blocks=852390201_352617405_2_0_0__30000&from=searchresults#hotelTmpl'
browser.get(url)
guest_count = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span[class='xp__guests__count']"))).find_element(By.TAG_NAME, "span")
print(guest_count.text)
Result in terminal:
2 adults
Selenium docs can be found at https://www.selenium.dev/documentation/
I haven't used BeautifulSoup. I use Selenium. This is how I would do it in Selenium:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.maximize_window()
test_url = 'https://www.booking.com/hotel/gr/diamandi-20.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaFyIAQGYAQm4ARjIAQzYAQPoAQGIAgGoAgS4ApGp7ZgGwAIB0gIkZTBjOTA2MTQtYTc0MC00YWUwLTk5ZWEtMWNiYzg3NThiNGQ12AIE4AIB&sid=47583bd8c0122ee70cdd7bb0b06b0944&aid=304142&ucfs=1&arphpl=1&checkin=2022-10-24&checkout=2022-10-30&dest_id=-829252&dest_type=city&group_adults=2&req_adults=2&no_rooms=1&group_children=0&req_children=0&hpos=2&hapos=2&sr_order=popularity&srpvid=f0f16af3449102aa&srepoch=1662736362&all_sr_blocks=852390201_352617405_2_0_0&highlighted_blocks=852390201_352617405_2_0_0&matching_block_id=852390201_352617405_2_0_0&sr_pri_blocks=852390201_352617405_2_0_0__30000&from=searchresults#hotelTmpl'
driver.get(test_url)
time.sleep(3)
element = driver.find_element(By.XPATH, "//span[@class='xp__guests__count']")
adults = int(element.text.split(" adults")[0])
print(str(adults))
Basically, I find the span element that contains the text you are looking for. .text gives you all the inner text (in this case, "2 adults · 0 children · 1 room").
The next line takes only the part of the string that comes before " adults", then casts it as an int.
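If you want all three counts rather than just the adults, a regular expression over the same inner text is one option. A sketch, assuming the '·'-separated format shown in the question:

import re

text = element.text  # e.g. "2 adults · 0 children · 1 room"
pairs = re.findall(r"(\d+)\s+(adults?|child(?:ren)?|rooms?)", text)
counts = {label: int(num) for num, label in pairs}
print(counts)  # e.g. {'adults': 2, 'children': 0, 'room': 1}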
I'm trying to scrape this website
Best Western Mornington Hotel
for the names of hotel rooms and the prices of those rooms. I'm using Selenium to try to scrape this data, but I keep getting no output, which I assume is because I am using the wrong selectors/XPath. Is there a method for identifying the correct XPath/div class/selector? I feel like I have selected the correct ones, but there is no output.
from re import sub
from decimal import Decimal
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
seleniumurl = 'https://www.bestwestern.co.uk/hotels/best-western-mornington-hotel-london-hyde-park-83187/in-2021-06-03/out-2021-06-05/adults-1/children-0/rooms-1'
driver = webdriver.Chrome(executable_path='C:\\Users\\Conor\\Desktop\\diss\\chromedriver.exe')
driver.get(seleniumurl)
time.sleep(5)
working = driver.find_elements_by_class_name('room-type-block')
for work in working:
    name = work.find_elements_by_xpath('.//div/h4').string
    price = work.find_elements_by_xpath('.//div[2]/div[2]/div/div[1]/div/div[3]/div/div[1]/div/div[2]/div[1]/div[2]/div[1]/div[1]/span[2]').string
    print(name, price)
I only work with Selenium in Java, but from what I can see, you're trying to get a collection of WebElements and invoke toString() on them...
Shouldn't that be find_element_by_xpath, to get just one WebElement, and then a call to .text instead of .string?
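In Python, the corrected line would look roughly like this (a sketch; the same change applies to the price locator):

# find_element (singular) returns one WebElement; .text reads its inner text
name = work.find_element_by_xpath('.//div/h4').text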
Marek is right: use .text instead of .string. Or use .get_attribute("innerHTML"). I also think your XPath may be wrong, unless I'm looking at the wrong page. Here are some XPaths from the page you linked.
# This will get all the room type sections.
room_types = driver.find_elements_by_xpath("//div[contains(@class,'room-type-box__content')]")
# This will get the room type titles.
titles = driver.find_elements_by_xpath("//div[contains(@class,'room-type-title')]/h3")
# Print out the room type titles.
for t in titles:
    print(t.text)
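For comparison, the .get_attribute alternative mentioned above (a one-line sketch):

for t in titles:
    print(t.get_attribute("innerHTML"))  # raw markup inside the element, tags included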
Please use the selector div#rr_wrp div.room-type-block together with the visibility_of_all_elements_located expected condition to get the list of category divs.
Within each of those elements, you can find the title with the XPath .//h2[@class="room-type--title"], the sub-category with .//strong[@class="trimmedTitle rt-item--title"], and the price with .//div[@class="rt-rate-right--row group"]//span[@data-bind="text: priceText"].
Then try the following code, which uses zip to walk the two parallel lists:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(executable_path='C:\\Users\\Conor\\Desktop\\diss\\chromedriver.exe')
driver.get('https://www.bestwestern.co.uk/hotels/best-western-mornington-hotel-london-hyde-park-83187/in-2021-06-03/out-2021-06-05/adults-1/children-0/rooms-1')
wait = WebDriverWait(driver, 20)
elements = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'div#rr_wrp div.room-type-block')))
for element in elements:
    for room_title in element.find_elements_by_xpath('.//h2[@class="room-type--title"]'):
        print("Main Title ==>> " + room_title.text)
    for room_type, room_price in zip(element.find_elements_by_xpath('.//strong[@class="trimmedTitle rt-item--title"]'),
                                     element.find_elements_by_xpath('.//div[@class="rt-rate-right--row group"]//span[@data-bind="text: priceText"]')):
        print(room_type.text + " " + room_price.text)
driver.quit()
I am creating a Python crawler that scrapes information from the Interpol website. I was able to scrape information from the first page, like names of people, date of birth, nationality, etc. To scrape information from the second page, I first got the URL from the tag and clicked on the link using my program. When I went to the URL, I found that all the information (meaning all the tags) was inside a <pre> tag. I am confused about why that is the case. So my question is: how can I get the information from inside the <pre> section, where all the other tags are? I am trying to get names of people, birthdays, their corresponding links, etc. I am using Selenium, btw. I will put the URL of the website below, along with the URL of the second page that I found in the tag. I hope that helps you guys understand what I am talking about.
Main Website:
https://www.interpol.int/en/How-we-work/Notices/View-Red-Notices
The second-page link I found in the tag:
https://ws-public.interpol.int/notices/v1/red?resultPerPage=20&page=2
The code I have so far is posted below:
from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException, NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'https://www.interpol.int/en/How-we-work/Notices/View-Red-Notices'
driver = webdriver.Chrome(executable_path="c:\\SeliniumWebDrivers\\chromedriver.exe")
driver.get(url)  # to go to the website
url = []  # to get all the URLs of the people
names = []  # to get the names of the people
age = []  # to get the age of the people
nationality = []  # to get the nationality of the people
newwindow = []  # to get all the next-page links
y = 0
g = 1
try:
    driver.get(driver.current_url)
    main = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'noticesResultsItemList'))
    )
    links = main.find_elements_by_tag_name("a")
    years = main.find_elements_by_class_name("age")
    borns = main.find_elements_by_class_name("nationalities")
    for link in links:
        newurl = link.get_attribute('href')
        url.append(newurl)
        names.append(link.text)  # adding the names
        y += 1
    for year in years:
        age.append(year.text)  # adding the age to the list
    for nation in borns:
        nationality.append(nation.text)  # adding the nationality to the list
    driver.get(driver.current_url)
    driver.refresh()
    next = WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.ID, 'paginationPanel'))
    )
    pages = next.find_elements_by_tag_name("a")
    for page in pages:
        newlink = page.get_attribute('href')
        newwindow.append(newlink)
    # to get to the next page
    print(newwindow[2])
    driver.get(newwindow[2])
except (StaleElementReferenceException, NoSuchElementException) as e:
    print(e)
You can use Selenium to click the next page instead of getting the URL. This is just a simple example; you may need to use a loop to extract the data and then click the next page. I've used the variable browser instead of main, written a function, and used a for loop to get the data from each page:
from selenium import webdriver
import time
from selenium.common.exceptions import NoSuchElementException,ElementNotInteractableException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
browser = webdriver.Chrome('/home/cam/Downloads/chromedriver')
url='https://www.interpol.int/en/How-we-work/Notices/View-Red-Notices'
browser.get(url)
def get_data():
    # collect the per-page elements; extend this to actually store or print the data
    links = browser.find_elements_by_tag_name("a")
    years = browser.find_elements_by_class_name("age")
    borns = browser.find_elements_by_class_name("nationalities")

time.sleep(5)
try:
    # accept the cookie banner if it is shown
    browser.find_element_by_xpath('//*[@id="privacy-cookie-banner__privacy-accept"]').click()
except ElementNotInteractableException:
    pass
for i in range(1, 9):
    print(i)
    get_data()
    print('//*[@id="paginationPanel"]/div/div/ul/li[' + str(i + 2) + ']/a')
    b = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="paginationPanel"]/div/div/ul/li[' + str(i + 2) + ']/a')))
    b.click()
    time.sleep(10)
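As a side note on the <pre> question itself: the second-page URL you found points at a public JSON API, not an HTML page, which is why the browser renders the whole response inside a <pre> tag. If that endpoint is acceptable for your use case, you could skip Selenium for pagination entirely and request the pages directly. A sketch, with the key names assumed from inspecting the JSON response:

import requests

url = 'https://ws-public.interpol.int/notices/v1/red'
data = requests.get(url, params={'resultPerPage': 20, 'page': 2}).json()
# '_embedded' / 'notices' and the field names are assumptions based on the
# visible response; adjust them if the structure differs.
for notice in data.get('_embedded', {}).get('notices', []):
    print(notice.get('forename'), notice.get('name'), notice.get('date_of_birth'))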
I'm trying to scrape the titles and links from Google search results using Selenium (Python). My problem is that I'm only able to scrape the first 4 results; for the other 6, the results are just empty. My feeling is that this might have something to do with the loading time of the web page, but I'm not sure. I have been looking at implementing a wait.until(EC.visibility_of_element_located(...)) statement, but haven't found a way to make it work.
Anyone with experience on this issue? Much appreciated!
Code:
import urllib
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
root = "https://www.google.com/"
url = "https://google.com/search?q="
query = 'Why do I only see the first 4 results?' # Fill in google query
query = urllib.parse.quote_plus(query)
link = url + query
print(f'Main link to search for: {link}')
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options)
driver.get(link)
WebDriverWait(driver, 10)
headings = driver.find_elements_by_xpath('//div[@class = "g"]')  # Heading elements
for heading in headings:
    title = heading.find_elements_by_tag_name('h3')
    links = heading.get_attribute('href')  # This ain't working either, any help?
    print(links)
    # link = heading.find_element_by_name('a href')
    for t in title:
        print('title:', t.text)
You're trying to obtain only div elements with the class "g". However, by looking at a sample search result myself, I have noticed that not every search result is an element of class g. Some differ.
https://i.imgur.com/QNd6nPm.png
You need a different kind of selector, e.g. iterate through the exact div that contains every search-result element and filter the valid ones by checking whether each element's attributes match a normal search result.
EDIT:
Your attempt to get the link via the attribute "href" probably doesn't work because, in my case, search results with the class "g" don't have any direct href attribute. There's always an a-tag, followed by an href attribute, like so:
https://i.imgur.com/NHPcQTn.png
Considering that the first a-tag in a search result is always the one you're looking for, you could search through the sub-elements of your heading for the first a-tag that's found and then get the "href" attribute from it, something like this:
href = heading.find_element_by_tag_name("a").get_attribute("href")
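Putting both points together, a sketch of the loop, with a guard for heading blocks that contain no link at all:

from selenium.common.exceptions import NoSuchElementException

for heading in headings:
    try:
        href = heading.find_element_by_tag_name("a").get_attribute("href")
    except NoSuchElementException:
        continue  # not a regular search result block
    print(href)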
You incorrectly specified locators for links.
Solution
import urllib
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
root = "https://www.google.com/"
url = "https://google.com/search?q="
query = 'Why do I only see the first 4 results?' # Fill in google query
query = urllib.parse.quote_plus(query)
link = url + query
print(f'Main link to search for: {link}')
options = Options()
# options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options, executable_path='/snap/bin/chromium.chromedriver')
driver.get(link)
wait = WebDriverWait(driver, 15)
wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@class = "g"]')))
headings = driver.find_elements_by_xpath('//div[@class = "g"]')  # Heading elements
for heading in headings:
    title = heading.find_elements_by_tag_name('h3')
    links = heading.find_element_by_css_selector('.yuRUbf>a').get_attribute("href")
    print(links)
    for t in title:
        print('title:', t.text)
Please note that the only two things I fixed were:
1. The way you get the locator.
2. Explicit waits. You did not use them as you should have.
Output:
Main link to search for: https://google.com/search?q=Why+do+I+only+see+the+first+4+results%3F
https://webapps.stackexchange.com/questions/14972/why-on-the-first-page-google-says-there-are-thousands-of-results-but-on-the-last
title: Why on the first page Google says there are thousands of ...
https://www.ltnow.com/how-to-get-more-than-10-results-per-page-in-google-search/
title:
https://www.theleverageway.com/blog/how-far-down-the-search-engine-results-page-will-most-people-go/
title:
https://www.impactplus.com/blog/google-is-limiting-number-of-search-results-per-domain-to-have-more-diversity-in-listings
title:
https://www.forbes.com/sites/forbesagencycouncil/2017/10/30/the-value-of-search-results-rankings/
title:
https://www.washingtonpost.com/news/the-intersect/wp/2015/06/30/always-click-the-first-google-result-you-might-want-to-stop-doing-that/
title: Always click the first Google result? You might want to stop ...
https://en.wikipedia.org/wiki/First_Four
title: First Four - Wikipedia
https://neilpatel.com/blog/first-page-google/
title: How to Show Up on the First Page of Google (Even if You're a ...
https://www.searchenginejournal.com/google-first-page-clicks/374516/
title: Over 25% of People Click the First Google Search Result
https://www.theleverageway.com/blog/how-far-down-the-search-engine-results-page-will-most-people-go/
title: How Far Down the Search Results Page Will Most People Go?
https://www.wordstream.com/blog/ws/2020/08/19/get-on-first-page-google
title: 10+ Free Ways to Get on the First Page of Google | WordStream
https://books.google.ca/books?id=teyaAwAAQBAJ&pg=PA102&lpg=PA102&dq=Why+do+I+only+see+the+first+4+results?&source=bl&ots=iBI-YaNJNc&sig=ACfU3U0GpAnPsH_zTbblyRv1C6eS5xwCUg&hl=en&sa=X&ved=2ahUKEwi-psHB_NjwAhURHs0KHWi-AC0Q6AEwD3oECBEQAw
title: PISA Knowledge and Skills for Life First Results from PISA ...
https://books.google.ca/books?id=8dY8AQAAQBAJ&pg=PA48&lpg=PA48&dq=Why+do+I+only+see+the+first+4+results?&source=bl&ots=x-7WRKNzXs&sig=ACfU3U13RRTc66oxnpWC6WW-CMwyyIAm8A&hl=en&sa=X&ved=2ahUKEwi-psHB_NjwAhURHs0KHWi-AC0Q6AEwEHoECA8QAw
title: OECD Skills Outlook 2013 First Results from the Survey of ...
https://books.google.ca/books?id=zWwVAQAAIAAJ&pg=PA22&lpg=PA22&dq=Why+do+I+only+see+the+first+4+results?&source=bl&ots=u7XMk6B6Qz&sig=ACfU3U2Q8kNocn8W3HHkFxxJnV0b58WYoA&hl=en&sa=X&ved=2ahUKEwi-psHB_NjwAhURHs0KHWi-AC0Q6AEwEXoECBAQAw
title: Results of the First Joint US-USSR Central Pacific ...
Titles for "People also ask" results are not returned because they have a different locator.
I have been using Python with BeautifulSoup 4 to scrape data from the UN Global Compact website. Some companies on there, like this one: https://www.unglobalcompact.org/what-is-gc/participants/2968-Orsted-A-S
have Twitter accounts. I would like to access the names of the Twitter accounts. The problem is that the feed is inside an iframe without a src property. I know that the iframe is loaded by a different request than the rest of the website, but I wonder whether it is even possible to access it without a visible src property?
You can use selenium to do this. Here is the full code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
url = "https://www.unglobalcompact.org/what-is-gc/participants/2968-Orsted-A-S "
driver = webdriver.Chrome()
driver.get(url)
iframe = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[#id="twitter-widget-0"]')))
driver.switch_to.frame(iframe)
names = driver.find_elements_by_xpath('//*[#class="TweetAuthor-name Identity-name customisable-highlight"]')
names = [name.text for name in names]
try:
name = max(set(names), key=names.count) #Finds the most frequently occurring name. This is because the same author has also retweeted tweets made by others. These retweets would contain the name of other people. The most frequently occurring name is the name of the author.
print(name)
except ValueError:
print("No Twitter Feed Found!")
driver.close()
Output:
Ørsted
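If you need to keep working with the parent page after reading the iframe, switch back first:

driver.switch_to.default_content()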