Web Scraping using BeautifulSoup, click on element for hidden tab - python

I have an issue while trying to capture specific information on a page.
Website: https://www.target.com/p/prairie-farms-vitamin-d-milk-1gal/-/A-47103206#lnk=sametab
On this page there are hidden tabs named 'Label info', 'Shipping & Returns', and 'Q&A' next to the 'Details' tab under 'About this item' that I want to scrape.
I found that I need to click on these elements before scraping them with BeautifulSoup.
Here is my code; let's say I've already got the pid for each link.
url = 'https://www.target.com' + str(pid)
driver.get(url)
driver.implicitly_wait(5)
soup = bs(driver.page_source, "html.parser")
wait = WebDriverWait(driver, 3)
button = soup.find_all('li', attrs={'class': "TabHeader__StyledLI-sc-25s16a-0 jMvtGI"})
index = button.index('tab-ShippingReturns')
print('The index of ShippingReturns is:', index)
if search(button, 'tab-ShippingReturns'):
    button_shipping_returns = button[index].find_element_by_id("tab-ShippingReturns")
    button_shipping_returns.click()
    time.sleep(3)
My code returns
ResultSet object has no attribute 'find_element_by_id'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Can anyone kindly guide me on how to resolve this?

It seems like the buttons you're trying to interact with have dynamically generated class names (unique values appended at the end), so I would suggest matching them with XPath's contains() function, like:
driver.find_elements_by_xpath("//a[contains(@class,'TabHeader__Styled')]")
So your code should look like this:
elements = driver.find_elements_by_xpath("//a[contains(@class,'TabHeader__Styled')]")
for el in elements:
    el.click()
Clicking works fine if the element is already in view on the page.
If the element is not visible and you need to scroll down to it, you can use ActionChains:
from selenium.webdriver import ActionChains
ActionChains(driver).move_to_element(el).perform()
So the full code looks like this; just plug in your own element parsing:
from selenium import webdriver
from selenium.webdriver import ActionChains
driver = webdriver.Chrome()
url = 'https://www.target.com/p/prairie-farms-vitamin-d-milk-1gal/-/A-47103206#lnk=sametab'
driver.get(url)
driver.implicitly_wait(5)
elements = driver.find_elements_by_xpath("//a[contains(@class,'TabHeader__Styled')]")
for el in elements:
    ActionChains(driver).move_to_element(el).perform()
    el.click()
driver.quit()

The following
button = soup.find_all('li', attrs={'class': "TabHeader__StyledLI-sc-25s16a-0 jMvtGI"})
will return a BeautifulSoup ResultSet of tags, which is essentially a list.
You then try to call a selenium method on that list with:
button_shipping_returns = button[index].find_element_by_id("tab-ShippingReturns")
Instead, you need to call it on the WebDriver element collection:
driver.find_elements_by_css_selector('.TabHeader__StyledLI-sc-25s16a-0.jMvtGI')[index].find_element_by_id("tab-ShippingReturns")
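As a minimal sketch of the overall flow, assuming the tab still carries the id tab-ShippingReturns used in the question (the dynamically generated class names are best avoided), you can click with Selenium first and only then hand the rendered page to BeautifulSoup:
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.target.com/p/prairie-farms-vitamin-d-milk-1gal/-/A-47103206')
# wait until the "Shipping & Returns" tab is clickable, then click it
tab = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, "tab-ShippingReturns")))
tab.click()
# only after the click hand the rendered page source to BeautifulSoup
soup = bs(driver.page_source, "html.parser")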

Related

How can I extract the table from the given URL

I am trying to scrape the data from the table ETH ZERO SEK at the given URL, but I can't make it work. Does anyone have any advice on how I can get it to work?
from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://www.ngm.se/marknaden/vardepapper?symbol=ETH%20ZERO%20SEK'
driver = webdriver.Chrome()
driver.get(url)
element = driver.find_element(By.XPATH, './/*[@id="detailviewDiv"]/table/tbody/tr[1]/td/div')
What happens?
The content you are looking for is provided via an iframe, so your XPath won't work on the top-level page.
How to fix?
Option #1
Change your URL to https://mdweb.ngm.se/detailview.html?locale=sv_SE&symbol=ETH%20ZERO%20SEK and request the content directly.
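For example, with Option #1 you could point the driver straight at that URL and reuse the XPath from your question (a rough sketch, assuming the WebDriverWait/EC imports noted at the end of this answer):
driver.get('https://mdweb.ngm.se/detailview.html?locale=sv_SE&symbol=ETH%20ZERO%20SEK')
# with the iframe URL loaded directly, the XPath from the question should now match
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, './/*[@id="detailviewDiv"]/table/tbody/tr[1]/td/div')))
print(element.text)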
Option #2
Grab the source of the iframe from your original URL:
driver.get('https://www.ngm.se/marknaden/vardepapper?symbol=ETH%20ZERO%20SEK')
Get the src of the iframe that holds your table:
iframe = driver.find_element(By.XPATH, '//iframe').get_attribute("src")
Load the iframe:
driver.get(iframe)
Wait until the tbody of the table is located and store it in element:
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//div[@id="detailviewDiv"]//thead[.//span[contains(text(),"Volym")]]/following-sibling::tbody')))
Assign values from the cells to variables by splitting the element's text:
volym = element.text.split('\n')[-3]
vwap = element.text.split('\n')[-2]
Note: the waits require from selenium.webdriver.support.ui import WebDriverWait and from selenium.webdriver.support import expected_conditions as EC.

Unable to identify what to 'click' for next page using selenium

I am trying to get search results from Yahoo search using Python - Selenium and bs4. I have been able to get the links successfully, but I am not able to click the button at the bottom to go to the next page. I tried one way, but it couldn't identify the button after the second page.
Here is the link:
https://in.search.yahoo.com/search;_ylt=AwrwSY6ratRgKEcA0Bm6HAx.;_ylc=X1MDMjExNDcyMzAwMgRfcgMyBGZyAwRmcjIDc2ItdG9wLXNlYXJjaARncHJpZANidkhMeWFsMlJuLnZFX1ZVRk15LlBBBG5fcnNsdAMwBG5fc3VnZwMxMARvcmlnaW4DaW4uc2VhcmNoLnlhaG9vLmNvbQRwb3MDMARwcXN0cgMEcHFzdHJsAzAEcXN0cmwDMTQEcXVlcnkDc3RhY2slMjBvdmVyZmxvdwR0X3N0bXADMTYyNDUzMzY3OA--?p=stack+overflow&fr=sfp&iscqry=&fr2=sb-top-search
This is what I'm doing to get data from the page, but I need to put it in a loop that changes pages:
page = BeautifulSoup(driver.page_source, 'lxml')
lnks = page.find('div', {'id': 'web'}).find_all('a', href = True)
for i in lnks:
    print(i['href'])
You don't need to scroll down to the bottom. The next button is accessible without scrolling. Suppose you want to navigate 10 pages. The python script can be like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('Yahoo Search URL')
# Let's create a loop containing the XPath for the next button,
# as well as waiting for the next button to be clickable.
for i in range(10):
    WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, '//a[@class="next"]')))
    driver.find_element_by_xpath('//a[@class="next"]').click()
The next page button is at the bottom of the page, so you first need to scroll to that element and then click it. Like this:
import time
from selenium.webdriver.common.action_chains import ActionChains

actions = ActionChains(driver)
next_page_btn = driver.find_element_by_css_selector("a.next")
actions.move_to_element(next_page_btn).perform()
time.sleep(0.5)
next_page_btn.click()
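Putting the pieces together, a rough sketch of a paging loop that also collects the links on each page (reusing the selectors from the question and the answers above; the a.next selector may need adjusting if Yahoo changes its markup):
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
driver.get('Yahoo Search URL')  # replace with the actual search URL
for page_number in range(10):
    # collect the result links from the current page
    page = BeautifulSoup(driver.page_source, 'lxml')
    for link in page.find('div', {'id': 'web'}).find_all('a', href=True):
        print(link['href'])
    # scroll to the "next" button and click it to load the following page
    next_page_btn = driver.find_element_by_css_selector("a.next")
    ActionChains(driver).move_to_element(next_page_btn).perform()
    time.sleep(0.5)
    next_page_btn.click()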

Selenium click() method on angular.js site

I am scraping an angular.js site. My initial link has a search button. I find it by XPath and click it with no issues. After I click search, I want to be able to click each of the athletes in the table to go to their info pages, but I am not having success with the click method. The links are attached to their names.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
TIMEOUT = 5
driver = webdriver.Firefox()
driver.set_page_load_timeout(TIMEOUT)
url = 'https://n.rivals.com/search#?formValues=%7B%22sport%22:%22Football%22,%22recruit_year%22:2021,%22offer_and_visit_type%22:%5B%22Offer%22%5D,%22prospect_profiles.prospect_colleges.offer%22:true,%22page_number%22:1,%22page_size%22:50%7D'
try:
    driver.get(url)
except TimeoutException:
    pass
search_button = driver.find_element_by_xpath('//*[@id="articles"]/div/div[2]/div/div/div[1]/form/div[2]/div[5]/button')
search_button.click()
# below is where I tried, but could not get it to click
first_athlete = driver.find_element_by_xpath('//*[@id="content_"]/td[1]/div[2]/a')
first_athlete.click()
Works if you remove the last /a in the xpath:
first_athlete = driver.find_element_by_xpath('//*[@id="content_"]/td[1]/div[2]')
first_athlete.click()
If you want to search for specific athletes and you have the athletes' names with you, you can use a CSS selector as well.
athlete = driver.find_element_by_css_selector('#content_ > td > div > a[href*="donovan-jackson"]')
athlete.click()
This code will give you a unique web element for each player.
Thanks
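If the goal is instead to visit every athlete's profile, a rough sketch (assuming each result row carries the id content_ as in the XPaths above; the selector may need adjusting):
# collect the profile URLs first, since clicking a link navigates away from the results table
profile_links = [a.get_attribute('href')
                 for a in driver.find_elements_by_css_selector('#content_ > td > div > a')]
for link in profile_links:
    driver.get(link)
    # ... scrape the athlete's info page here ...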

How to loop from a list of urls by clicking the xpath and extract data using Selenium in Python?

I am extracting board members from a list of URLs. For each url in URL_lst, I click the first XPath (ViewMore, to expand the list), then extract values from the second XPath (the board members' info).
Below are the three companies I want to extract info: https://www.bloomberg.com/quote/FB:US, https://www.bloomberg.com/quote/AAPL:US, https://www.bloomberg.com/quote/MSFT:US
My code is shown below but doesn't work: the Outputs list is not aggregated. I know something is wrong with the loop but don't know how to fix it. Can anyone tell me how to correct the code? Thanks!
URL_lst = ['https://www.bloomberg.com/quote/FB:US','https://www.bloomberg.com/quote/AAPL:US','https://www.bloomberg.com/quote/MSFT:US']
Outputs = []
driver = webdriver.Chrome(r'xxx\chromedriver.exe')
for url in URL_lst:
    driver.get(url)
    for c in driver.find_elements_by_xpath("//*[@id='root']/div/div/section[3]/div[10]/div[2]/div/span[1]"):
        c.click()
        for e in c.find_elements_by_xpath('//*[@id="root"]/div/div/section[3]/div[10]/div[1]/div[2]/div/div[2]')[0].text.split('\n'):
            Outputs.append(e)
print(Outputs)
Based on the URLs you provided, I did some refactoring for you. I added a wait on each item you are trying to click and a scrollIntoView JavaScript call to scroll down to the View More button. You were originally clicking View More buttons in a loop, but your XPath only returned one element, so the loop was redundant.
I also refactored your selector for board members to query directly on the div element containing their names. Your original query was finding a div several levels above the actual name text, which is why your Outputs list was returning empty.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from time import sleep
URL_lst = ['https://www.bloomberg.com/quote/FB:US','https://www.bloomberg.com/quote/AAPL:US','https://www.bloomberg.com/quote/MSFT:US']
Outputs = []
driver = webdriver.Chrome(r'xxx\chromedriver.exe')
wait = WebDriverWait(driver, 30)
for url in URL_lst:
    driver.get(url)
    # get "Board Members" header
    board_members_header = wait.until(EC.presence_of_element_located((By.XPATH, "//h2[span[text()='Board Members']]")))
    # scroll down to board members
    driver.execute_script("arguments[0].scrollIntoView();", board_members_header)
    # get view more button
    view_more_button = wait.until(EC.presence_of_element_located((By.XPATH, "//section[contains(@class, 'PageMainContent')]/div/div[2]/div/span[span[text()='View More']]")))
    # click view more button
    view_more_button.click()
    # wait on 'View Less' to exist, meaning the list is expanded now
    wait.until(EC.presence_of_element_located((By.XPATH, "//section[contains(@class, 'PageMainContent')]/div/div[2]/div/span[span[text()='View Less']]")))
    # wait on visibility of board member names
    wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[contains(@class, 'boardWrap')]//div[contains(@class, 'name')]")))
    # get list of board member names
    board_member_names = driver.find_elements_by_xpath("//div[contains(@class, 'boardWrap')]//div[contains(@class, 'name')]")
    for board_member in board_member_names:
        Outputs.append(board_member.text)
    # explicit sleep to avoid being flagged as a bot
    sleep(5)
print(Outputs)
I also added an explicit sleep between URL grabs, so that Bloomberg does not flag you as a bot.

How can I parse table data from website using Selenium?

I'm trying to parse the table present on this website: http://www.espncricinfo.com/rankings/content/page/211270.html using Selenium. As I am a beginner, I'm struggling to do that. Here is my code:
from bs4 import BeautifulSoup
import time
from selenium import webdriver
url = "http://www.espncricinfo.com/rankings/content/page/211270.html"
browser = webdriver.Chrome()
browser.get(url)
time.sleep(3)
html = browser.page_source
soup = BeautifulSoup(html, "lxml")
print(len(soup.find_all("table")))
print(soup.find("table", {"class": "expanded_standings"}))
browser.close()
browser.quit()
That's what I tried, but I'm unable to fetch anything from it. Any suggestions would be really helpful, thanks.
The table you are after is within an iframe. So, to get the data from that table you need to switch to that iframe first and then do the rest. Here is one way you could do it:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("http://www.espncricinfo.com/rankings/content/page/211270.html")
wait = WebDriverWait(driver, 10)
## if you expect a different table, just change the index number within nth-of-type()
## and the appropriate name in the selector
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe[name='testbat']:nth-of-type(1)")))
for table in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "table tr")))[1:]:
    data = [item.text for item in table.find_elements_by_css_selector("th,td")]
    print(data)
driver.quit()
The best approach in this very case, though, would be the following. No browser simulator is used; only requests and BeautifulSoup are used:
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.espncricinfo.com/rankings/content/page/211270.html")
soup = BeautifulSoup(res.text,"lxml")
## if you expect a different table, just change the index number
## and the appropriate name in the selector
item = soup.select("iframe[name='testbat']")[0]['src']
req = requests.get(item)
sauce = BeautifulSoup(req.text,"lxml")
for items in sauce.select("table tr"):
    data = [item.text for item in items.select("th,td")]
    print(data)
Partial results:
['Rank', 'Name', 'Country', 'Rating']
['1', 'S.P.D. Smith', 'AUS', '947']
['2', 'V. Kohli', 'IND', '912']
['3', 'J.E. Root', 'ENG', '881']
It looks like that page's tables are within iframes. If you have a specific table you want to scrape, try inspecting it using browser dev tools (right click, inspect element in Chrome) and find the iframe element that is wrapping it. The iframe should have a src attribute that holds a url to the page that actually contains that table. You can then use a similar method to the one you tried but instead use the src url.
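A short sketch of that src-attribute route, reusing the browser object from your code and the iframe name used in the other answers (testbat; adjust it for the table you actually want):
# read the iframe's src and load that page directly in the same browser session
iframe_src = browser.find_element_by_css_selector("iframe[name='testbat']").get_attribute("src")
browser.get(iframe_src)
soup = BeautifulSoup(browser.page_source, "lxml")
print(len(soup.find_all("table")))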
Selenium can also "jump into" an iframe if you know how to find the iframe in the page's source code.
frame = browser.find_element_by_id("the_iframe_id")
browser.switch_to.frame(frame)
html = browser.page_source  # then parse with BeautifulSoup as before
