How do I click an element using selenium and beautifulsoup? - python

How do I click an element using Selenium and BeautifulSoup in Python? I have these lines of code and I'm finding it difficult to achieve what I want. I want to click every element in each iteration. There is no pagination or next page; there are only about 10 elements, and after clicking the last element it should stop. Does anyone know what I should do? Here is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
import urllib
import urllib.request
from bs4 import BeautifulSoup
chrome_path = r"C:\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
url = 'https://www.99.co/singapore/condos-apartments/a-treasure-trove'
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html,'lxml')
details = soup.select('.FloorPlans__container__rwH_w')  # whole container of the results
for d in details:
    picture = d.find('span', {'class': 'Tappable-inactive'}).click()  # the single element
    print(d)
driver.close()
Here is the site: https://www.99.co/singapore/condos-apartments/a-treasure-trove . I want to scrape the details and the image in every floor-plan section, but it is difficult because the image only appears after you click the specific element. I can get all the details except the image itself. Try it yourself so you can see what I mean.
EDIT:
I tried this method
for d in driver.find_elements_by_xpath('//*[@id="floorPlans"]/div/div/div/div/span'):
    d.click()
The problem is that it clicks too fast and the image doesn't get a chance to load. Also, I'm using Selenium here. Is there any method for selecting elements in a BeautifulSoup-like format, such as picture = d.find('span', {'class': 'Tappable-inactive'}).click()?

You cannot interact with website widgets using BeautifulSoup; you need to work with Selenium. There are two ways to handle this problem.
The first is to get the main wrapper (class) of the 10 elements and then iterate over each child element of the main class.
Alternatively, you can get each element by XPath and increment the last index in the XPath by one on each iteration to move to the next element.
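The index-increment approach can be sketched like this; the container id and child position are assumptions based on the XPath shown in the question's edit, so check them against the live page:

```python
# Build one XPath per floor-plan element by incrementing an index.
# The id "floorPlans" and the nesting are taken from the question's edit;
# the exact child position may need adjusting against the real page.
base = '//*[@id="floorPlans"]/div/div/div[{}]/div/span'
xpaths = [base.format(i) for i in range(1, 11)]  # ~10 elements on the page

# Each XPath can then be passed to driver.find_element_by_xpath() and
# clicked, waiting for the image to become visible between clicks.
print(xpaths[0])  # //*[@id="floorPlans"]/div/div/div[1]/div/span
```

Generating the locator strings up front also makes it easy to stop cleanly after the last element: the loop simply ends when the list is exhausted.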

I printed some results to check your code:
"details" contains only one item,
and "picture" is not a WebElement, so it is not clickable.
details = soup.select('.FloorPlans__container__rwH_w')
print(details)
print(len(details))
for d in details:
    print(d)
    picture = d.find('span', {'class': 'Tappable-inactive'})
    print(picture)
Output:
For your edited version, you should check that each image is visible before you call click().
Use visibility_of_element_located to do this.
Reference: https://selenium-python.readthedocs.io/waits.html
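The idea behind Selenium's explicit waits can be illustrated with a plain-Python polling loop. This is a sketch of the mechanism, not Selenium's actual implementation: WebDriverWait(driver, timeout).until(condition) repeatedly calls the condition (such as visibility_of_element_located) until it returns something truthy or time runs out.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` expires.

    Mirrors what WebDriverWait(driver, timeout).until(...) does: the
    expected condition is just a callable checked once per poll interval.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1fs" % timeout)
        time.sleep(poll)

print(wait_until(lambda: "visible"))  # visible
```

Waiting this way is much more robust than a fixed time.sleep, because the loop returns as soon as the image is actually ready instead of always paying the worst-case delay.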

Related

Obtaining data on clicking multiple radio buttons in a page using selenium in python

I have a page with 3 radio buttons on it. I want my code to click each of these buttons consecutively; as each one is clicked, a value (mpn) is displayed, and I want to obtain that value. I am able to write the code for a single radio button, but I don't understand how I can create a loop in which only the value of the button changes (value = {1, 2, 3}).
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path=r"C:\Users\Home\Desktop\chromedriver.exe")
driver.get("https://www.1800cpap.com/resmed-airfit-n30-nasal-cpap-mask-with-headgear")
soup = BeautifulSoup(driver.page_source, 'html.parser')
size = driver.find_element_by_xpath("//input[@class='product-views-option-tile-input-picker' and @value='2']")
size.click()
mpn = driver.find_element_by_xpath("//span[@class='mpn-value']")
print(mpn.text)
Also, the buttons vary in number and name from page to page. So if there is any general solution that I could extend to all pages and all buttons, it would be highly appreciated. Thanks!
Welcome to SO!
You were one small step away from the correct solution! In particular, find_element_by_xpath() returns a single element, but the similar function find_elements_by_xpath() (mind the plural) returns an iterable list, which you can use to implement a for loop.
Below is an MWE (minimal working example) using the page that you provided:
from selenium import webdriver
import time

driver = webdriver.Firefox()  # initiate the driver
driver.get("https://www.1800cpap.com/resmed-airfit-n30-nasal-cpap-mask-with-headgear")
time.sleep(2)  # sleep for a couple of seconds to ensure the page has loaded
mpn = []  # initiate an empty results list
for button in driver.find_elements_by_xpath("//label[@data-label='label-custcol3']"):
    button.click()
    mpn.append(driver.find_element_by_xpath("//span[@class='mpn-value']").text)
print(mpn)  # print results

How to access text inside div tags using Selenium in Python?

I am trying to make a program in Python using Selenium which prints out the quotes from https://www.brainyquote.com/quote_of_the_day
EDIT:
I was able to access the quotes and the associated authors like so:
authors = driver.find_elements_by_css_selector("""div.col-xs-4.col-md-4 a[title="view author"]""")
for quote, author in zip(quotes, authors):
    print('Quote: ', quote.text)
    print('Author: ', author.text)
I'm not able to group the topics the same way. Doing
total_topics = driver.find_elements_by_css_selector("""div.col-xs-4.col-md-4 a.qkw-btn.btn.btn-xs.oncl_list_kc""")
produces an undesired list.
Earlier I was using Beautiful Soup which did the job perfectly except the fact that the requests library was able to access only the static website. However, I wanted to be able to scroll the website continuously to keep accessing new quotes. For that purpose, I'm trying to use Selenium.
This is how I did it using Soup:
for quote_data in soup.find_all('div', class_='col-xs-4 col-md-4'):
    quote = quote_data.find('a', title='view quote').text
    print('Quote: ', quote)
However, I am unable to find the same using Selenium.
My code in Selenium for basic testing:
driver.maximize_window()
driver.get('https://www.brainyquote.com/quote_of_the_day')
elem = driver.find_element_by_tag_name("body")
elem.send_keys(Keys.PAGE_DOWN)
time.sleep(0.2)
quote = driver.find_element_by_xpath('//div[@title="view quote"]')
I also tried CSS Selectors
print(driver.find_element_by_css_selector('div.col-xs-4 col-md-4'))
The latter gave a NoSuchElementException and the former is not giving any output at all. I would love some tips on where I am going wrong and how I can tackle this.
Thanks!
First scroll to the bottom, then:
quotes = driver.find_elements_by_xpath('//a[@title="view quote"]')
You might need to write some kind of loop to scroll and click on the quotes links until there are no more elements found. Here's a bit of an outline of how I would do that:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get('https://www.brainyquote.com/quote_of_the_day')
while True:
    # wait for all quote elements to appear
    quote_links = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//a[@title='view quote']")))
    # todo - need to check for the end condition. page has infinite scrolling
    # break
    # iterate the quote elements until we reach the end of this list
    for quote_link in quote_links:
        quote_link.click()
        driver.back()
        # now quote_links has gone stale because we are on a different page
        quote_links = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//a[@title='view quote']")))
The above code enters a loop that searches for all of the 'view quote' links on the page. Then we iterate the list of links and click on each one. At that point the elements in the quote_links list have gone stale, because we have navigated to a different page, so we re-find the elements with WebDriverWait before clicking another link.
This is just a rough outline and some extra work will need to be done to determine an end case for the infinite scrolling of the page, and you will need to write in the operations to perform on the quote pages themselves, but hopefully you see the idea here.
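One way to frame the missing end condition for the infinite scroll is to stop once the page height stops growing between scrolls. A minimal sketch of that logic follows; the two callables stand in for driver.execute_script calls, which is an assumption about how you would wire it into Selenium:

```python
def scroll_until_stable(get_height, scroll_once, max_rounds=50):
    """Scroll until the page height stops changing, i.e. no new content loads.

    With Selenium, get_height could be
        lambda: driver.execute_script("return document.body.scrollHeight")
    and scroll_once could be
        lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    """
    last = get_height()
    for _ in range(max_rounds):
        scroll_once()
        new = get_height()
        if new == last:   # nothing new loaded: we have hit the end
            return new
        last = new
    return last

# Fake page whose height grows twice and then stabilises:
heights = iter([100, 200, 300, 300])
print(scroll_until_stable(lambda: next(heights), lambda: None))  # 300
```

In practice you would also give the page a short wait between scroll_once and the height check, since lazy-loaded content takes a moment to arrive.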

Selenium: How to parse through a code after using selenium to click a dropdown

I'm trying to web-scrape through this web page https://www.sigmaaldrich.com/. So far I have managed to use the requests method to drive the search bar. After that, I want to look for the different prices of the compounds. The HTML that includes the prices is not visible until the Price dropdown has been clicked. I have achieved that by using Selenium to click all the dropdowns with the desired class. But after that, I do not know how to get the HTML of the page that is generated after clicking the dropdowns, which is where the price is placed.
Here's my code so far:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep
#get the desired search terms by input
name=input("Reagent: ")
CAS=input("CAS: ")
#search using the name of the compound
data_name = {'term': name, 'interface': 'Product%20Name', 'N': '0+',
             'mode': 'mode%20matchpartialmax', 'lang': 'es', 'region': 'ES',
             'focus': 'product', 'N': '0%20220003048%20219853286%20219853112'}
#search using the CAS of the compound
data_CAS = {'term': CAS, 'interface': 'CAS%20No.', 'N': '0', 'mode': 'partialmax',
            'lang': 'es', 'region': 'ES', 'focus': 'product'}
#get the link of the name search
r=requests.post("https://www.sigmaaldrich.com/catalog/search/", params=data_name.items())
#get the link of the CAS search
n=requests.post("https://www.sigmaaldrich.com/catalog/search/", params=data_CAS.items())
#use selenium to click in the dropdown(only for the name search)
driver=webdriver.Chrome(executable_path=r"C:\webdrivers\chromedriver.exe")
driver.get(r.url)
dropdown=driver.find_elements_by_class_name("expandArrow")
for arrow in dropdown:
    arrow.click()
As I said, after this I need to find a way to get the html code after opening the dropdowns so that I can look for the price class. I have tried different things but I don't seem to get any working solution.
Thanks for your help.
You can try using Selenium's WebDriverWait:
wait = WebDriverWait(driver, 30)
element = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, css)))
First, You should use WebDriverWait as Austen had pointed out.
For your question try this:
from selenium import webdriver

driver = webdriver.Chrome(executable_path=r"C:\webdrivers\chromedriver.exe")
driver.get(r.url)
dropdown = driver.find_elements_by_class_name("expandArrow")
for arrow in dropdown:
    arrow.click()
html_source = driver.page_source
print(html_source)
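Once you have html_source, you can search it for the price markup. As a sketch with Python's standard-library parser (the class name "priceValue" is purely an assumption; inspect the real page for the actual class used on the price elements):

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted = wanted_class
        self.depth = 0      # >0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self.depth or self.wanted in classes:
            self.depth += 1
            if self.depth == 1:
                self.texts.append("")   # start a new text bucket

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data.strip()

# Hypothetical fragment standing in for driver.page_source:
parser = ClassTextExtractor("priceValue")
parser.feed('<div class="priceValue">12,30 EUR</div><div class="other">x</div>')
print(parser.texts)  # ['12,30 EUR']
```

In a real script you would feed html_source into the parser instead of the inline fragment; the same search can of course also be done by passing html_source to BeautifulSoup and using soup.select on the price class.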
Hope this helps you!

Efficient download of images from website with Python and selenium

Disclaimer: I do not have any background in web scraping/HTML/JavaScript/CSS and the like, but I know a bit of Python.
My end goal is to download the 4th image view of each of the 3515 car models on the ShapeNet website, WITH the associated tag.
For instance, the first of the 3515 pairs would be the image found in the collapse menu on the right of this picture (which can be loaded by clicking on the first item of the first page and then on Images), with the associated tag "sport utility", as can be seen in the first picture (first car, top left).
To do that, with the help of @DebanjanB, I wrote a snippet of code that clicks on the sport utility in the first picture, opens the iframe, clicks on Images and then downloads the 4th picture (link to my question). The full working code is this:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
import os
profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.socks", "yourproxy")
profile.set_preference("network.proxy.socks_port", yourport)
#browser = webdriver.Firefox(firefox_profile=profile)
browser = webdriver.Firefox()
browser.get('https://www.shapenet.org/taxonomy-viewer')
#Page is long to load
wait = WebDriverWait(browser, 30)
element = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='02958343_anchor']")))
linkElem = browser.find_element_by_xpath("//*[@id='02958343_anchor']")
linkElem.click()
#Page is also long to display iframe
element = wait.until(EC.element_to_be_clickable((By.ID, "model_3dw_bcf0b18a19bce6d91ad107790a9e2d51")))
linkElem = browser.find_element_by_id("model_3dw_bcf0b18a19bce6d91ad107790a9e2d51")
linkElem.click()
#iframe slow to be displayed
wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, 'viewerIframe')))
#iframe = browser.find_elements_by_id('viewerIframe')
#browser.switch_to_frame(iframe[0])
element = wait.until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[3]/div[3]/h4")))
time.sleep(10)
linkElem = browser.find_element_by_xpath("/html/body/div[3]/div[3]/h4")
linkElem.click()
img = browser.find_element_by_xpath("/html/body/div[3]/div[3]//div[@class='searchResult' and @id='image.3dw.bcf0b18a19bce6d91ad107790a9e2d51.3']/img[@class='enlarge']")
src = img.get_attribute('src')
os.system("wget %s --no-check-certificate"%src)
There are several issues with this. First, I need to know by hand the id model_3dw_bcf0b18a19bce6d91ad107790a9e2d51 for each model, and I also need to extract the tag; both can be found at:
. So I need to extract them by inspecting every displayed image. Then I need to switch pages (there are 22 pages) and maybe even scroll down on each page to be sure I have everything. Secondly, I had to use time.sleep twice because the other method, based on waiting for the element to be clickable, does not seem to work as intended.
I have two questions. The first one is obvious: is this the right way to proceed? I feel that even if this could be quite fast without the time.sleep calls, it feels very much like what a human would do and therefore must be terribly inefficient. Secondly, if this is indeed the way to go: how could I write a double for loop over pages and items to extract the tag and model id efficiently?
EDIT 1: It seems that:
l = browser.find_elements_by_xpath("//div[starts-with(@id, 'model_3dw')]")
might be the first step towards completion
EDIT 2: Almost there, but the code is filled with time.sleep calls. I still need to get the tag name and to loop through the pages.
EDIT 3: Got the tag name; I still need to loop through the pages, and will post a first draft of the solution.
So let me try to understand correctly what you mean, and then see if I can help you solve the problem. I do not know Python, so excuse my syntax errors.
You want to click on each and every one of the 183533 cars, and then download the 4th image within the iframe that pops up. Correct?
Now, if this is the case, let's look at the first thing you need: the elements on the page with all the cars on it.
So to get all 160 cars of page 1, you are going to need:
elements = browser.find_elements_by_xpath("//img[@class='resultImg lazy']")
This is going to return 160 image elements, which is exactly the number of images displayed (on page 1).
Then you can say:
for el in elements:
    # here you place the code you need to download the 4th image,
    # e.g. switch to the iframe, click on the 4th image, etc.
Now, for the first page, you have made a loop which will download the 4th image for every vehicle on it.
This doesn't entirely solve your problem, as you have multiple pages. Thankfully, the page-navigation buttons, previous and next, are greyed out on the first and/or last page.
So you can just say:
browser.find_element_by_xpath("//a[@class='next']").click()
Just make sure you catch the case where the element is not clickable, as it will be greyed out on the last page.
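That advice can be sketched as a small driver-agnostic loop. Here click_next stands in for whatever function clicks the "next" link and reports whether it was still clickable (for instance, by catching the resulting Selenium exception); that wrapper is an assumption, not part of the Selenium API:

```python
def for_each_page(process_page, click_next):
    """Process the current page, then advance until 'next' stops working.

    click_next() should attempt the click and return False once the link
    is greyed out (e.g. by catching the Selenium exception it raises).
    Returns the number of pages processed.
    """
    pages = 0
    while True:
        process_page()
        pages += 1
        if not click_next():
            return pages

# Fake 3-page site: "next" succeeds twice, then fails on the last page.
clicks = iter([True, True, False])
print(for_each_page(lambda: None, lambda: next(clicks)))  # 3
```

Structuring the pagination this way keeps the end-of-pages check in exactly one place, instead of scattering try/except blocks through the scraping code.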
Rather than scraping the site, you might consider examining the URLs that the web page uses to query the data, then use the Python requests package to simply make API requests directly against the server. I'm not a registered user on the site, so I can't provide you with any examples, but the paper that describes the shapenet.org site specifically mentions:
"To provide convenient access to all of the model and annotation data contained within ShapeNet, we construct an index over all the 3D models and their associated annotations using the Apache Solr framework. Each stored annotation for a given 3D model is contained within the index as a separate attribute that can be easily queried and filtered through a simple web-based UI. In addition, to make the dataset conveniently accessible to researchers, we provide a batched download capability."
This suggests that it might be easier to do what you want via API, as long as you can learn what their query language provides. A search in their QA/Forum may be productive too.
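If such an index is reachable, a Solr query is just an HTTP GET with a handful of parameters. A sketch of building one with the standard library follows; the endpoint path and the field name are hypothetical and would have to be discovered from the site or its documentation before any of this works:

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint and field name: check the site/docs for the
# real values before using this.
base = "https://www.shapenet.org/solr/select"
params = {
    "q": "category:car",  # assumed field:value pair
    "rows": 100,          # results per request
    "start": 0,           # offset, for paging through results
    "wt": "json",         # ask Solr for JSON output
}
url = base + "?" + urlencode(params)
print(url)
```

Paging would then be a matter of incrementing start by rows on each request, which replaces all of the click-the-next-button logic in the browser-driven approach.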
I came up with this answer, which kind of works, but I don't know how to remove the several calls to time.sleep. I will not accept my own answer until someone finds something more elegant (also, it currently fails when it reaches the end of the last page):
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
import os
profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.socks", "yourproxy")
profile.set_preference("network.proxy.socks_port", yourport)
#browser = webdriver.Firefox(firefox_profile=profile)
browser = webdriver.Firefox()
browser.get('https://www.shapenet.org/taxonomy-viewer')
#Page is long to load
wait = WebDriverWait(browser, 30)
element = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='02958343_anchor']")))
linkElem = browser.find_element_by_xpath("//*[@id='02958343_anchor']")
linkElem.click()
tag_names = []
page_count = 0
while True:
    if page_count > 0:
        browser.find_element_by_xpath("//a[@class='next']").click()
        time.sleep(2)
    wait.until(EC.presence_of_element_located((By.XPATH, "//div[starts-with(@id, 'model_3dw')]")))
    list_of_items_on_page = browser.find_elements_by_xpath("//div[starts-with(@id, 'model_3dw')]")
    list_of_ids = [e.get_attribute("id") for e in list_of_items_on_page]
    for i, item in enumerate(list_of_items_on_page):
        # Page is also long to display iframe
        current_id = list_of_ids[i]
        element = wait.until(EC.element_to_be_clickable((By.ID, current_id)))
        car_image = browser.find_element_by_id(current_id)
        original_tag_name = car_image.find_element_by_xpath("./div[@style='text-align: center']").get_attribute("innerHTML")
        count = 0
        tag_name = original_tag_name
        while tag_name in tag_names:
            tag_name = original_tag_name + "_" + str(count)
            count += 1
        tag_names.append(tag_name)
        car_image.click()
        wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, 'viewerIframe')))
        element = wait.until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[3]/div[3]/h4")))
        time.sleep(10)
        linkElem = browser.find_element_by_xpath("/html/body/div[3]/div[3]/h4")
        linkElem.click()
        img = browser.find_element_by_xpath("/html/body/div[3]/div[3]//div[@class='searchResult' and @id='image.3dw.%s.3']/img[@class='enlarge']" % current_id.split("_")[2])
        src = img.get_attribute('src')
        os.system("wget %s --no-check-certificate -O %s.png" % (src, tag_name))
        browser.switch_to.default_content()
        browser.find_element_by_css_selector(".btn-danger").click()
        time.sleep(1)
    page_count += 1
One can also import NoSuchElementException from Selenium and use a while True loop with try/except to get rid of the arbitrary time.sleep calls.
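That suggestion can be sketched as a generic polling helper. The exception type is parameterised here so the example is self-contained; in real code you would pass selenium.common.exceptions.NoSuchElementException and a lambda wrapping the find_element call:

```python
import time

def find_with_retry(find, not_found_exc, attempts=20, poll=0.5):
    """Call `find` until it stops raising `not_found_exc`.

    Replaces a blind time.sleep: we only wait as long as actually needed,
    up to attempts * poll seconds in total.
    """
    for attempt in range(attempts):
        try:
            return find()
        except not_found_exc:
            time.sleep(poll)
    raise not_found_exc("element still missing after %d attempts" % attempts)

# Fake "element" that only appears on the third lookup:
state = {"calls": 0}
def fake_find():
    state["calls"] += 1
    if state["calls"] < 3:
        raise KeyError("not rendered yet")
    return "element"

print(find_with_retry(fake_find, KeyError, poll=0.01))  # element
```

Note that Selenium's own WebDriverWait does essentially this, so in practice wait.until(...) with a suitable expected condition is usually the cleaner choice.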

Finding span by class and extracting its contents

I want to extract the text of a particular span, shown in the snapshot. I am unable to find the span by its class attribute. I have attached the HTML source (snapshot) of the data to be extracted as well.
Any suggestions?
import bs4 as bs
import urllib.request

sourceUrl = 'https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2'
source = urllib.request.urlopen(sourceUrl).read()
soup = bs.BeautifulSoup(source, 'html.parser')
count = soup.find('span', {'class': 'number'})
print(len(count))
See the image:
If you disable JavaScript in your browser, you can easily see that the span element you want disappears.
One possible way to get that element is to use a Selenium-driven browser:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2')
span = driver.find_element_by_xpath('//li[3]/span')
print(span.text)
driver.close()
Output:
Another solution: find the desired value deep down in the web page source (in the Chrome browser press Ctrl+U) and extract the span value using a regular expression.
import re
import requests

r = requests.get(
    'https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2')
span = re.search(r'"posts_count":(\d+)', r.text)
print(span.group(1))
Output:
If you know how to use CSS selectors, you can use:
mySpan = soup.select("span.number")
It returns a list of all nodes that match the selector, so mySpan[0] could contain what you need. Then use one of the methods, for example get_text(), to extract the text.
First of all, you need to decode the response:
source = urllib.request.urlopen(sourceUrl).read().decode()
Maybe your issue will disappear after this fix.