I would like to print each name of every merchant on this page. I tried this:
browser.get('https://www.trovaprezzi.it/televisori-lcd-plasma/prezzi-scheda-prodotto/lg_oled_cx3?sort=prezzo_totale')
Names = browser.find_elements_by_xpath("//span[@class='merchant_name']")
for span in Names:
    print(span.text)
However, when I run the code, it just prints a block of empty lines without any text.
1. You need to get the alt attribute to get a seller name.
2. You need to use waits.
3. Check your indentation when you print the list values.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
browser = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
browser.get('https://www.trovaprezzi.it/televisori-lcd-plasma/prezzi-scheda-prodotto/lg_oled_cx3?sort=prezzo_totale')
wait = WebDriverWait(browser, 10)
wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".merchant_name_and_logo img")))
names = browser.find_elements_by_css_selector(".merchant_name_and_logo img")
for span in names:
    print(span.get_attribute("alt"))
Prints:
Climaconvenienza
Shopdigit
eBay
ePrice
Onlinestore
Shoppyssimo
Prezzo forte
eBay
eBay
eBay
eBay
eBay
eBay
Yeppon
Showprice
Galagross
Sfera Ufficio
Climaconvenienza
Di Lella Shop
Shopdigit
Instead of span.text please try getting the "value" attribute there
Names = browser.find_elements_by_xpath("//span[@class='merchant_name']")
for span in Names:
    print(span.get_attribute("value"))
Also, don't forget to add some wait/delay before
Names = browser.find_elements_by_xpath("//span[@class='merchant_name']")
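The wait/delay advice can be sketched as a small generic polling helper (a plain-Python sketch, not tied to Selenium; the name `poll_until` is made up here):

```python
import time

def poll_until(fetch, timeout=10.0, interval=0.5):
    """Call fetch() repeatedly until it returns a non-empty result
    or the timeout elapses; return the last result (possibly empty)."""
    deadline = time.monotonic() + timeout
    result = fetch()
    while not result and time.monotonic() < deadline:
        time.sleep(interval)
        result = fetch()
    return result
```

With Selenium you would pass something like `lambda: browser.find_elements_by_xpath(...)` as `fetch`, although the built-in `WebDriverWait`/`expected_conditions` machinery shown in the other answer is the preferred approach.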
I am trying to scrape some information from booking.com. I have already handled some things, like pagination and extracting the title.
I am trying to extract the number of guests from here.
This is my code:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
test_url = 'https://www.booking.com/hotel/gr/diamandi-20.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaFyIAQGYAQm4ARjIAQzYAQPoAQGIAgGoAgS4ApGp7ZgGwAIB0gIkZTBjOTA2MTQtYTc0MC00YWUwLTk5ZWEtMWNiYzg3NThiNGQ12AIE4AIB&sid=47583bd8c0122ee70cdd7bb0b06b0944&aid=304142&ucfs=1&arphpl=1&checkin=2022-10-24&checkout=2022-10-30&dest_id=-829252&dest_type=city&group_adults=2&req_adults=2&no_rooms=1&group_children=0&req_children=0&hpos=2&hapos=2&sr_order=popularity&srpvid=f0f16af3449102aa&srepoch=1662736362&all_sr_blocks=852390201_352617405_2_0_0&highlighted_blocks=852390201_352617405_2_0_0&matching_block_id=852390201_352617405_2_0_0&sr_pri_blocks=852390201_352617405_2_0_0__30000&from=searchresults#hotelTmpl'
driver.get(test_url)
time.sleep(3)
soup2 = BeautifulSoup(driver.page_source, 'lxml')
guests = soup2.select_one('span.xp__guests__count')
guests = guests.text if guests else None
amenities = soup2.select_one('div.hprt-facilities-block')
The result is this one '\n2 adults\n·\n\n0 children\n\n·\n\n1 room\n\n'
I know that with some regexp I can extract the information, but I would like to understand whether there is a way to extract the "2 adults" directly from the result above.
Thanks.
This is one way to get that information, without using BeautifulSoup (why parse the page twice?):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
[...]
wait = WebDriverWait(browser, 20)
url = 'https://www.booking.com/hotel/gr/diamandi-20.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaFyIAQGYAQm4ARjIAQzYAQPoAQGIAgGoAgS4ApGp7ZgGwAIB0gIkZTBjOTA2MTQtYTc0MC00YWUwLTk5ZWEtMWNiYzg3NThiNGQ12AIE4AIB&sid=47583bd8c0122ee70cdd7bb0b06b0944&aid=304142&ucfs=1&arphpl=1&checkin=2022-10-24&checkout=2022-10-30&dest_id=-829252&dest_type=city&group_adults=2&req_adults=2&no_rooms=1&group_children=0&req_children=0&hpos=2&hapos=2&sr_order=popularity&srpvid=f0f16af3449102aa&srepoch=1662736362&all_sr_blocks=852390201_352617405_2_0_0&highlighted_blocks=852390201_352617405_2_0_0&matching_block_id=852390201_352617405_2_0_0&sr_pri_blocks=852390201_352617405_2_0_0__30000&from=searchresults#hotelTmpl'
browser.get(url)
guest_count = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span[class='xp__guests__count']"))).find_element(By.TAG_NAME, "span")
print(guest_count.text)
Result in terminal:
2 adults
Selenium docs can be found at https://www.selenium.dev/documentation/
I haven't used BeautifulSoup. I use Selenium. This is how I would do it in Selenium:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.maximize_window()
test_url = 'https://www.booking.com/hotel/gr/diamandi-20.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaFyIAQGYAQm4ARjIAQzYAQPoAQGIAgGoAgS4ApGp7ZgGwAIB0gIkZTBjOTA2MTQtYTc0MC00YWUwLTk5ZWEtMWNiYzg3NThiNGQ12AIE4AIB&sid=47583bd8c0122ee70cdd7bb0b06b0944&aid=304142&ucfs=1&arphpl=1&checkin=2022-10-24&checkout=2022-10-30&dest_id=-829252&dest_type=city&group_adults=2&req_adults=2&no_rooms=1&group_children=0&req_children=0&hpos=2&hapos=2&sr_order=popularity&srpvid=f0f16af3449102aa&srepoch=1662736362&all_sr_blocks=852390201_352617405_2_0_0&highlighted_blocks=852390201_352617405_2_0_0&matching_block_id=852390201_352617405_2_0_0&sr_pri_blocks=852390201_352617405_2_0_0__30000&from=searchresults#hotelTmpl'
driver.get(test_url)
time.sleep(3)
element = driver.find_element(By.XPATH, "//span[@class='xp__guests__count']")
adults = int(element.text.split(" adults")[0])
print(str(adults))
Basically, I find the span element that contains the text you are looking for. .text gives you all the inner text (in this case, "2 adults · 0 children · 1 room").
The next line takes only the part of the string that comes before " adults", then casts it as an int.
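The same split idea extends to all three numbers in the string; a plain-string sketch (the input literal is just the example text from the question, and `parse_occupancy` is a hypothetical helper name):

```python
def parse_occupancy(text):
    """Parse a 'N adults · N children · N rooms' style string into a dict.
    Assumes each '·'-separated part looks like '<number> <label>'."""
    parts = [p.strip() for p in text.split("·")]
    result = {}
    for part in parts:
        number, label = part.split(" ", 1)
        result[label] = int(number)
    return result

print(parse_occupancy("2 adults · 0 children · 1 room"))
# {'adults': 2, 'children': 0, 'room': 1}
```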
I'm trying to scrape some data from a website and I can't seem to get the text in between two tags as it keeps returning None. This is my code, could somebody tell me what is wrong?
uq = "https://my.uq.edu.au/programs-courses/requirements/program/2451/2021"
driver = webdriver.Firefox(executable_path="/Users/Connor/Downloads/geckodriver")
driver.get(uq)
groups = driver.find_element_by_xpath("/html/body/div/div[2]/div/div/div[3]/div[1]/div[2]/div[4]/div[1]/div[2]/a[1]/span[1]")
print(groups)
print(groups.text) #This won't get the text between the span tags e.g. <span>hello</span>
To get that text you'll need to click the parent element first.
However, you can avoid it with .get_attribute("innerHTML") and waiting for the presence of the element. Also, make sure you are using stable locators.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
options = FirefoxOptions()
options.add_argument('window-size=1920x1080')
driver = webdriver.Firefox(options=options)
uq = "https://my.uq.edu.au/programs-courses/requirements/program/2451/2021"
driver.get(uq)
wait = WebDriverWait(driver, 15)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[title='Theory of Computing']>span:nth-of-type(1)")))
group = driver.find_element_by_css_selector("a[title='Theory of Computing']>span:nth-of-type(1)").get_attribute("innerHTML")
print(group)
Prints: COMP2048
You have to perform an action on the WebElement to get its value.
So you have to use this code instead:
print(groups.text)
Yes, same for me. I have tried every way possible. Trying this:
print(speed.get_attribute('outerHTML'))
I get
<span data-download-status-value="NaN" class="result-data-large number result-data-value download-speed"> </span>
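One way to deal with a placeholder like that is to keep waiting until the attribute stops being "NaN". A sketch of a custom wait condition you could pass to `WebDriverWait(...).until(...)` (the locator and attribute names below are assumptions based on the snippet above):

```python
def attribute_is_not_nan(locator, attribute):
    """Return a custom condition for WebDriverWait: truthy once the
    element's attribute holds a real value instead of 'NaN' (or empty)."""
    def _predicate(driver):
        value = driver.find_element(*locator).get_attribute(attribute)
        # Returning False makes WebDriverWait keep polling.
        return value if value and value != "NaN" else False
    return _predicate
```

Usage would then look something like `WebDriverWait(driver, 30).until(attribute_is_not_nan((By.CSS_SELECTOR, ".download-speed"), "data-download-status-value"))`.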
I'm trying to scrape this website
Best Western Mornington Hotel
for the names of the hotel rooms and the price of each room. I'm using Selenium to try to scrape this data, but I keep getting no output, which I assume means I'm using the wrong selectors/XPath. Is there a method of identifying the correct XPath/div class/selector? I feel like I have selected the correct ones, but there is no output.
from re import sub
from decimal import Decimal
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
seleniumurl = 'https://www.bestwestern.co.uk/hotels/best-western-mornington-hotel-london-hyde-park-83187/in-2021-06-03/out-2021-06-05/adults-1/children-0/rooms-1'
driver = webdriver.Chrome(executable_path='C:\\Users\\Conor\\Desktop\\diss\\chromedriver.exe')
driver.get(seleniumurl)
time.sleep(5)
working = driver.find_elements_by_class_name('room-type-block')
for work in working:
    name = work.find_elements_by_xpath('.//div/h4').string
    price = work.find_elements_by_xpath('.//div[2]/div[2]/div/div[1]/div/div[3]/div/div[1]/div/div[2]/div[1]/div[2]/div[1]/div[1]/span[2]').string
    print(name, price)
I only work with Selenium in Java, but from what I can see, you're trying to get a collection of WebElements and invoke toString() on them...
Shouldn't it be find_element_by_xpath, to get just one WebElement, and then a call to .text instead of .string?
Marek is right: use .text instead of .string. Or use .get_attribute("innerHTML"). I also think your XPath may be wrong, unless I'm looking at the wrong page. Here are some XPaths from the page you linked.
# This will get all the room type sections.
roomTypes = driver.find_elements_by_xpath("//div[contains(@class,'room-type-box__content')]")
# Print out the room type title within each section.
for r in roomTypes:
    print(r.find_element_by_xpath(".//div[contains(@class,'room-type-title')]/h3").text)
Please use the selector div#rr_wrp div.room-type-block and the .visibility_of_all_elements_located method to get the list of category divs.
Within each of those, you can find the title with the XPath .//h2[@class="room-type--title"], the sub-category with .//strong[@class="trimmedTitle rt-item--title"], and the price with .//div[@class="rt-rate-right--row group"]//span[@data-bind="text: priceText"].
And please try the following code, which uses zip to iterate over the two lists in parallel:
driver = webdriver.Chrome(executable_path='C:\\Users\\Conor\\Desktop\\diss\\chromedriver.exe')
driver.get('https://www.bestwestern.co.uk/hotels/best-western-mornington-hotel-london-hyde-park-83187/in-2021-06-03/out-2021-06-05/adults-1/children-0/rooms-1')
wait = WebDriverWait(driver, 20)
elements = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'div#rr_wrp div.room-type-block')))
for element in elements:
    for room_title in element.find_elements_by_xpath('.//h2[@class="room-type--title"]'):
        print("Main Title ==>> " + room_title.text)
    for room_type, room_price in zip(element.find_elements_by_xpath('.//strong[@class="trimmedTitle rt-item--title"]'),
                                     element.find_elements_by_xpath('.//div[@class="rt-rate-right--row group"]//span[@data-bind="text: priceText"]')):
        print(room_type.text + " " + room_price.text)
driver.quit()
I have been using Python with BeautifulSoup 4 to scrape data out of the UN Global Compact website. Some companies there, like this one: https://www.unglobalcompact.org/what-is-gc/participants/2968-Orsted-A-S
have Twitter accounts, and I would like to access the names of those accounts. The problem is that the feed is inside an iframe without a src property. I know that the iframe is loaded by a different request than the rest of the website, but I wonder whether it is even possible to access it without a visible src property.
You can use selenium to do this. Here is the full code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
url = "https://www.unglobalcompact.org/what-is-gc/participants/2968-Orsted-A-S"
driver = webdriver.Chrome()
driver.get(url)
iframe = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="twitter-widget-0"]')))
driver.switch_to.frame(iframe)
names = driver.find_elements_by_xpath('//*[@class="TweetAuthor-name Identity-name customisable-highlight"]')
names = [name.text for name in names]
try:
    # The author has also retweeted tweets made by others, which contain other
    # people's names; the most frequently occurring name in the feed is
    # therefore the author's own.
    name = max(set(names), key=names.count)
    print(name)
except ValueError:
    print("No Twitter Feed Found!")
driver.close()
Output:
Ørsted
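The `max(set(names), key=names.count)` idiom can also be written with `collections.Counter`, which tallies the list in a single pass instead of re-scanning it for every candidate; a sketch on a made-up list of names:

```python
from collections import Counter

# Hypothetical scrape result: the author's name mixed with retweeted names.
names = ["Ørsted", "Someone Else", "Ørsted", "Another", "Ørsted"]

# most_common(1) returns a list with the single (name, count) pair
# for the most frequent entry.
most_common_name, count = Counter(names).most_common(1)[0]
print(most_common_name)  # Ørsted
```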
I'm trying to scrape the 'activity' text box from the two pages here and here.
I wrote the base of the code:
options = Options()
options.binary_location=r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
options.add_experimental_option('excludeSwitches', ['enable-logging'])
#options.add_argument("--headless")
driver = webdriver.Chrome(options=options, executable_path='/mnt/c/Users/kela/Desktop/selenium/chromedriver.exe')
url = 'http://www.uwm.edu.pl/biochemia/biopep/peptide_data_page1.php?zm_ID=' + str(i) #where str(i) is either 2500 or 2700 in this example
driver.get(url)
header = driver.find_element_by_css_selector('[name="activity"]')
children = header.find_elements_by_xpath(".//*")
I have two issues:
1. I need to pull out only the activity item that is the selected option; I don't want ALL the activities returned.
2. But if the option is the first item in the list, as is the case with one of the pages shown here, whose activity is 'aami', the selected attribute is not present, because that option is the default.
So I'm stuck on identifying a line or two of code that I could add to my script that would extract:
neuropeptide | ne
alpha-amylase inhibitor | aami
from these two web pages, if anyone could help.
Use the Select class and get the first_selected_option. You need to induce a WebDriverWait with presence_of_element_located.
i=2700
url = 'http://www.uwm.edu.pl/biochemia/biopep/peptide_data_page1.php?zm_ID=' + str(i) #where str(i) is either 2500 or 2700 in this example
driver.get(url)
element = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.NAME, "activity")))
select = Select(element)
print(select.first_selected_option.text)
Output:
neuropeptide | ne
If you change the value to 2500 you will get alpha-amylase inhibitor | aami
Import the following to execute the above code:
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
You should check the attributes of the option elements.
If the 'selected' attribute is on one of the options, get that one.
If the 'selected' attribute is not on any option, get only the first option.
I've implemented the attribute check with BeautifulSoup. You could also implement it with Selenium by executing JavaScript code. Example here.
My Code:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
url = 'http://www.uwm.edu.pl/biochemia/biopep/peptide_data_page1.php?zm_ID=2500'
driver.get(url)
header = driver.find_element_by_css_selector('[name="activity"]')
soup = BeautifulSoup(header.get_attribute("innerHTML"), 'html.parser')
options = soup.find_all('option')
for option in options:
    if 'selected' in option.attrs:
        print(option.text)
        break
else:
    print(options[0].text.strip())
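The for/else construct above can look surprising: the else branch runs only when the loop finishes without hitting break. A stripped-down sketch of the same logic, with plain dicts standing in for the parsed option tags (the helper name and data shape are made up here):

```python
def selected_or_first(options):
    """Return the text of the option carrying a 'selected' attribute,
    falling back to the first option (the browser's default)."""
    for option in options:
        if "selected" in option["attrs"]:
            chosen = option["text"]
            break
    else:
        # else on a for loop: runs only when the loop was never broken.
        chosen = options[0]["text"]
    return chosen

print(selected_or_first([
    {"text": "neuropeptide | ne", "attrs": []},
    {"text": "alpha-amylase inhibitor | aami", "attrs": ["selected"]},
]))  # alpha-amylase inhibitor | aami
```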