My main HTML page has an iframe in it, and I need to get the text Code: LWBAD that lives inside it.
Check picture for a better understanding:
Below is the source of my main HTML page, which contains the iframe:
<td class="centerdata flag"><iframe style="width: 200px; height: 206px;" scrolling="no" src="https://www.example.com/test/somewhere" ></iframe></td>
The iframe page (the redirect link) has this HTML source:
<body>
<a href="http://www.test2.com" target="_blank">
<img src="https://img2.test2.com/LWBAD-1.jpg"></a>
<br/>Code: LWBAD
So far I can get the complete page source from my main html page.
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import html5lib
driver_path = '/usr/local/bin/chromedriver 2'
driver = webdriver.Chrome(driver_path)
driver.implicitly_wait(10)
driver.get('http://example.com')
try:
    time.sleep(4)
    iframe = driver.find_elements_by_tag_name('iframe')
    driver.switch_to_default_content()
    output = driver.page_source
    print(output)
finally:
    driver.quit()
* The URLs are not accessible from outside my network, which is why I used example.com.
You should use:
iframe = driver.find_elements_by_tag_name('iframe')[0]
driver.switch_to.frame(iframe)
# your work to extract link
driver.switch_to.default_content()
For multiple iframes:
find_elements_by_tag_name returns a list, so use a for loop:
iframe = driver.find_elements_by_tag_name('iframe')
for i in iframe:
    driver.switch_to.frame(i)
    # your work to extract link
    driver.switch_to.default_content()
To get only the text, use the following after driver.switch_to.frame(i):
text = driver.find_element_by_tag_name('body').text
Try this:
iframe = driver.find_elements_by_tag_name('iframe')
for i in range(0, len(iframe)):
    # re-locate the iframe on each pass and switch to that element
    f = driver.find_elements_by_tag_name('iframe')[i]
    driver.switch_to.frame(f)
    # your work to extract link
    text = driver.find_element_by_tag_name('body').text
    print(text)
    driver.switch_to.default_content()
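The answers above can be combined into one helper. This is only a sketch under a few assumptions: the function names iframe_body_texts and extract_code are made up for illustration, the locator is passed as the plain string "tag name" (the value behind Selenium's By.TAG_NAME) so the snippet has no imports beyond the standard library, and the regular expression assumes the text always looks like "Code: LWBAD".

```python
import re


def iframe_body_texts(driver):
    """Return the visible body text of every iframe on the current page."""
    texts = []
    # "tag name" is the string value behind selenium's By.TAG_NAME.
    count = len(driver.find_elements("tag name", "iframe"))
    for index in range(count):
        # Re-locate the iframe on each pass, as the last answer above does.
        frame = driver.find_elements("tag name", "iframe")[index]
        driver.switch_to.frame(frame)
        try:
            texts.append(driver.find_element("tag name", "body").text)
        finally:
            # Always return to the main document, even if the lookup fails.
            driver.switch_to.default_content()
    return texts


def extract_code(text):
    """Pull the value after 'Code:' out of a text, or return None if absent."""
    match = re.search(r"Code:\s*(\w+)", text)
    return match.group(1) if match else None
```

With a live driver this would be used as codes = [extract_code(t) for t in iframe_body_texts(driver)]. The try/finally is a design choice: it keeps the driver pointed at the main page even when one iframe has no readable body.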
Related
This is the page I'm trying to scrape:
https://etherscan.io/address/0xCcE984c41630878b91E20c416dA3F308855E87E2
I want to scrape the listbox href next to the Token label.
I need to scrape the href from
class="link-hover d-flex justify-content-between align-items-center"
Here is my code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://etherscan.io/address/0xCcE984c41630878b91E20c416dA3F308855E87E2').text
html = BeautifulSoup(page, 'html.parser')
href = html.find(class_ = 'link-hover d-flex justify-content-between align-items-center')['href']
However, the result is empty.
Can anyone help me?
The element of interest is rendered by JavaScript. Thus, you will need some browser automation software to render the JavaScript, in order to get the full HTML necessary.
Note: You could use requests-html which supports JavaScript rendering. However, it does use a browser automation software itself, so, in my opinion, it's best to get rid of the "middle-man".
Selenium
import bs4
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("https://etherscan.io/address/0xCcE984c41630878b91E20c416dA3F308855E87E2")
elem = browser.find_element_by_id("availableBalanceDropdown")
elem.click()
# page_source holds the rendered HTML after the click
soup = bs4.BeautifulSoup(browser.page_source, features="html.parser")
Playwright
import bs4
from playwright.sync_api import sync_playwright
with sync_playwright() as play:
    browser = play.chromium.launch()
    page = browser.new_page()
    page.goto("https://etherscan.io/address/0xCcE984c41630878b91E20c416dA3F308855E87E2")
    page.click("#availableBalanceDropdown")
    soup = bs4.BeautifulSoup(page.content(), features="html.parser")
    # a Playwright browser is closed with close(), not quit()
    browser.close()
Once you have the bs4.BeautifulSoup object, it's just a matter of scraping for the CSS selector.
import bs4
soup = bs4.BeautifulSoup(...) # From above examples
elems = soup.select(".link-hover.d-flex.justify-content-between.align-items-center")
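The selector above returns the matching elements, but the question asked for their href values. The sketch below shows that last step using only the standard library's html.parser as a stand-in for BeautifulSoup, so it runs without extra installs. The names LinkCollector and collect_hrefs, and the sample markup in the usage note, are hypothetical. Like the CSS selector, it matches the classes regardless of their order, but unlike it, it only looks at a tags.

```python
from html.parser import HTMLParser

# The classes named in the question.
REQUIRED = ("link-hover", "d-flex", "justify-content-between", "align-items-center")


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry all required classes."""

    def __init__(self, required_classes):
        super().__init__()
        self.required = set(required_classes)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = set((attributes.get("class") or "").split())
        if tag == "a" and self.required <= classes and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def collect_hrefs(html, required_classes=REQUIRED):
    collector = LinkCollector(required_classes)
    collector.feed(html)
    return collector.hrefs
```

For example, collect_hrefs('<a class="link-hover d-flex justify-content-between align-items-center" href="/token/0xabc">T</a>') yields a list with the single entry "/token/0xabc". With the soup object from the answer, the equivalent is to read the "href" attribute of each element in elems.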
I am trying to use Selenium and BeautifulSoup to extract some information from https://superbet.ro/pariuri-sportive/live.
I created the URLs for the live matches, and now I'm iterating through them to extract some statistics. But the Statistics tab is not loading when I use this code:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
def get_soup(url):
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    time.sleep(3)
    page = driver.execute_script('return document.body.innerHTML')
    driver.quit()
    soup = BeautifulSoup(page, 'html.parser')
    print(soup)
    return soup
So I'm trying to click the Statistics button to find the divs I need, because the HTML obtained by my script is only partially loaded and differs from the one shown in the Chrome developer tools.
Here is the difference between what I get and what I need:
<div class="statistics__content">
<div class="sa-sdk-v5">
<div class="sa-sdk-unknown-tab" eventdetails="[object Object]">
Here are the divs I need
I don't know exactly how to click on Statistics because I don't have any button tag.
Here are the tabs
Finally, I solved the problem by clicking on that tab.
I would say stick with Selenium for that process.
You will need to:
Locate the element using Selenium. In your case, you will need to grab all the matches and then follow the relative path from each one to find the box that you can click on. I don't think it has to be a button.
Then you can use something like this.
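from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# Both functions are methods of a crawler class; CrawlerException is assumed to be defined elsewhere.
def wait_for_field(self, xpath, driver, interval=10):
    try:
        # wait up to `interval` seconds for the element to appear in the DOM
        element = WebDriverWait(driver, interval).until(EC.presence_of_element_located((By.XPATH, xpath)))
    except Exception as e:
        raise CrawlerException(xpath + " failed, ", str(e))
    return element
def click_on_match(self, page_browser):
    try:
        element_to_click = self.wait_for_field("**fill out here**", page_browser, interval=5)
        print("found match to click")
        # click through JavaScript, which also works when the element is not a real button
        page_browser.execute_script("arguments[0].click()", element_to_click)
    except Exception:
        pass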
Try this:
import requests
url = "https://old.superbet.ro/rest/SBWeb.Models.Casino/getAllGames"
r = requests.get(url)
json_data = r.json()
I am using Selenium with Python to search for a keyword, then click the top 5 URLs in the search results, get the data from the p tags on each page, and go back. So basically I am storing the data from these 5 sites. But after searching for the keyword, I am not able to click the URLs and get the data. I don't know what's wrong. This is the code I have written. Please help.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe")
driver.get("https://www.google.com/")
print(driver.title)
driver.maximize_window()
time.sleep(2)
driver.find_element(By.XPATH, "//input[@name='q']").send_keys('selenium')
driver.find_element(By.XPATH, "//div[@class='FPdoLc tfB0Bf']//input[@name='btnK']").send_keys(Keys.ENTER)
a = driver.find_elements_by_xpath("//div[@class='g']/a[@href]")
links = []
for x in a:
    links.append(x.get_attribute('href'))
link_data = []
for new_url in links:
    print('new url : ', new_url)
    driver.get(new_url)
    link_data.append(driver.page_source)
    b = driver.find_elements(By.TAG_NAME, "p")
    for data in b:
        print(data.text)
    driver.back()
driver.close()
EDIT:
While navigating through the links, it also includes links from the "People also ask" box. I don't want to navigate through this box. How can I do that?
If you want the 16 or so links, use:
driver.get("https://www.google.com/")
print(driver.title)
driver.maximize_window()
time.sleep(2)
driver.find_element(By.XPATH, "//input[@name='q']").send_keys('selenium')
driver.find_element(By.XPATH, "//input[@name='btnK']").send_keys(Keys.ENTER)
a = driver.find_elements_by_xpath("//div[@class='g']/div/div/a")
links = []
for x in a:
    links.append(x.get_attribute('href'))
link_data = []
for new_url in links:
    print('new url : ', new_url)
    driver.get(new_url)
    link_data.append(driver.page_source)
    b = driver.find_elements(By.TAG_NAME, "p")
    for data in b:
        print(data.text)
    driver.back()
You have the wrong XPath for the links; it should be:
"//div[@class='yuRUbf']/a[@href]"
If you look at the relevant part of the code, you'll see the <a> tag is not a child of <div class="g">, but of <div class="yuRUbf">
<div class="g"><!--m-->
<div class="tF2Cxc" data-hveid="CAkQAA" data-ved="2ahUKEwjphfjOoazuAhUO1VkKHVSkA_oQFSgAMAp6BAgJEAA">
<div class="yuRUbf"><a href="https://www.healthline.com/nutrition/selenium-benefits"
data-ved="2ahUKEwjphfjOoazuAhUO1VkKHVSkA_oQFjAKegQICRAC"
ping="/url?sa=t&source=web&rct=j&url=https://www.healthline.com/nutrition/selenium-benefits&ved=2ahUKEwjphfjOoazuAhUO1VkKHVSkA_oQFjAKegQICRAC"><br>
<h3 class="LC20lb DKV0Md"><span>7 Science-Based Health Benefits of Selenium - Healthline</span></h3>
<div class="TbwUpd NJjxre"><cite class="iUh30 Zu0yb qLRx3b tjvcx">www.healthline.com<span
class="dyjrff qzEoUe"><span> › nutrition › selenium-benefits</span></span></cite></div>
</a>
...
</div>
</div>
</div>
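The difference between the two paths can be checked offline. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a small XPath subset and stands in here for the browser's XPath engine. The markup is a trimmed, well-formed version of the snippet above, wrapped in a body element for the sake of the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the search-result markup, made well-formed for the XML parser.
snippet = """
<body>
  <div class="g">
    <div class="tF2Cxc">
      <div class="yuRUbf">
        <a href="https://www.healthline.com/nutrition/selenium-benefits">
          <h3>7 Science-Based Health Benefits of Selenium - Healthline</h3>
        </a>
      </div>
    </div>
  </div>
</body>
""".strip()

root = ET.fromstring(snippet)

# "/a" selects direct children only, so the path from the question finds nothing.
direct_children = root.findall(".//div[@class='g']/a")

# The link is a direct child of the innermost div, so this path finds it.
nested_links = root.findall(".//div[@class='yuRUbf']/a")

print(len(direct_children))
print([a.get("href") for a in nested_links])
```

The first path comes back empty because a single slash never descends more than one level; the second matches the one link in the sample.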
You can also shorten your search lines a bit, though it doesn't change the overall effect:
driver.find_element_by_xpath("//input[@name='q']").send_keys('selenium', Keys.ENTER)
I'm crawling a news website to extract all links, including the archived ones, which is typical of a news website. The site has a View More Stories button that loads more articles. Now this code below
def find_urls():
    start_url = "e.vnexpress.net/news/business"
    r = requests.get("http://" + start_url)
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    links = soup.findAll('a')
    url_list = []
    for url in links:
        all_link = url.get('href')
        if all_link.startswith('http://e.vnexpress.net/news/business'):
            url_list.append(all_link)
    return set(url_list)
successfully loads quite a few URLs, but how do I load more? Here is a snippet of the button:
<a href="javascript:void(0)" id="vnexpress_folder_load_more" data-page="2"
data-cate="1003895">
View more stories
</a>
Can someone help me out? Thanks.
You can use a browser automation tool like Selenium to click the button until it disappears or is disabled. Then you can scrape the entire page with BeautifulSoup in one go.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
#initializing browser
driver = webdriver.Firefox()
driver.set_window_size(1120, 550)
driver.get("http://e.vnexpress.net/news/news")
# repeat this for as long as the button is present
elem = driver.find_element_by_id('vnexpress_folder_load_more')
elem.click()
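The "click until the button is gone" idea might be sketched as the loop below. The function name click_until_gone and the pause and max_clicks parameters are assumptions made for illustration. The locator is passed as the plain string "id" (the value behind Selenium's By.ID), and find_elements is used instead of find_element because it returns an empty list rather than raising when nothing matches.

```python
import time


def click_until_gone(driver, element_id, pause=1.0, max_clicks=50):
    """Click the element with the given id until it disappears or is hidden.

    Returns the number of clicks performed. max_clicks guards against pages
    that never stop offering more content.
    """
    clicks = 0
    while clicks < max_clicks:
        # "id" is the string value behind selenium's By.ID.
        buttons = driver.find_elements("id", element_id)
        if not buttons or not buttons[0].is_displayed():
            break
        buttons[0].click()
        clicks += 1
        # Give the page time to load the next batch of articles.
        time.sleep(pause)
    return clicks
```

A call such as click_until_gone(driver, "vnexpress_folder_load_more") would then be followed by passing driver.page_source to BeautifulSoup. A fixed sleep is the simplest option; an explicit wait on the newly loaded content would likely be more reliable.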
I have added an iframe to an HTML file, maintenance.html:
<iframe name="iframe_name" src="maintenance_state.txt" frameborder="0" height="40" allowtransparency="allowtransparency" width="800" align="middle" ></iframe>
And I want to get the content of the src file maintenance_state.txt using Python and Selenium.
I'm locating the iframe element using:
maintain = driver.find_element_by_name("iframe_name")
However, maintain.text returns an empty value.
How can I get the text written in the maintenance_state.txt file?
Thanks for your help.
As some sites' scripts stop the iframe from working properly if it's loaded as the main document, it's also worth knowing how to read the iframe's source without needing to issue a separate driver.get for its URL:
driver.switch_to.frame(driver.find_element_by_name("iframe_name"))
print(driver.page_source)
driver.switch_to.default_content()
The last line is needed only if you want to be able to do something else with the page afterwards.
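You can get the src attribute, navigate to it, and get the page_source:
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin
# assumption: base_url is the URL of the page that contains the iframe
base_url = driver.current_url
src = driver.find_element_by_name("iframe_name").get_attribute("src")
url = urljoin(base_url, src)
driver.get(url)
print(driver.page_source)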
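To see what urljoin does with different kinds of src values, here is a small standalone sketch. The page URL is hypothetical, as is the helper name resolve_iframe_src.

```python
from urllib.parse import urljoin

# Hypothetical address of the page that contains the iframe.
PAGE_URL = "http://example.com/admin/maintenance.html"


def resolve_iframe_src(page_url, src):
    """Turn an iframe's src value into an absolute URL."""
    return urljoin(page_url, src)


# A relative src is resolved against the directory of the page.
print(resolve_iframe_src(PAGE_URL, "maintenance_state.txt"))
# A root-relative src keeps only the scheme and host of the page.
print(resolve_iframe_src(PAGE_URL, "/static/state.txt"))
# An absolute src is returned unchanged.
print(resolve_iframe_src(PAGE_URL, "http://other.example/state.txt"))
```

Joining is harmless when the value is already absolute, so the same code path works for all three cases.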