I have added an iframe to an HTML file, maintenance.html:
<iframe name="iframe_name" src="maintenance_state.txt" frameborder="0" height="40" allowtransparency="allowtransparency" width="800" align="middle" ></iframe>
And I want to get the content of the src file maintenance_state.txt using Python and Selenium.
I'm locating the iframe element using:
maintain = driver.find_element_by_name("iframe_name")
However, maintain.text returns an empty value.
How can I get the text written in the maintenance_state.txt file?
Thanks for your help.
As some sites' scripts stop the iframe from working properly if it's loaded as the main document, it's also worth knowing how to read the iframe's source without needing to issue a separate driver.get for its URL:
driver.switch_to.frame(driver.find_element_by_name("iframe_name"))
print(driver.page_source)
driver.switch_to.default_content()
The last line is needed only if you want to be able to do something else with the page afterwards.
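As a minimal sketch of the same idea, the switch can be wrapped in a helper so that the driver always returns to the top-level document, even if reading the frame fails. The name read_frame_source is made up for illustration; it assumes a Selenium WebDriver is passed in and relies on switch_to.frame accepting a frame name, an index, or a WebElement:

```python
def read_frame_source(driver, frame_reference):
    """Return the page source of one frame, then switch back.

    `driver` is assumed to be a Selenium WebDriver; `frame_reference` is
    whatever switch_to.frame accepts (frame name/id, index, or WebElement).
    """
    driver.switch_to.frame(frame_reference)
    try:
        return driver.page_source
    finally:
        # Always return to the top-level document, even on errors.
        driver.switch_to.default_content()
```

With the iframe from the question, this would be called as read_frame_source(driver, "iframe_name").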
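You can get the src attribute, navigate to it, and get the page_source:
from urllib.parse import urljoin  # Python 3; on Python 2 use "from urlparse import urljoin"
src = driver.find_element_by_name("iframe_name").get_attribute("src")
url = urljoin(base_url, src)  # base_url is the URL of the page that contains the iframe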
driver.get(url)
print(driver.page_source)
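To illustrate the urljoin step, here is a small Python 3 sketch of how a relative src resolves against the address of the containing page. The example.com addresses are placeholders and not from the question; only the file names are:

```python
from urllib.parse import urljoin

def resolve_src(page_url, src):
    """Resolve an iframe src (relative or absolute) against the page URL."""
    return urljoin(page_url, src)

# Hypothetical page address; only the file names come from the question.
page_url = "http://example.com/site/maintenance.html"

# A relative src is resolved against the directory of the page.
print(resolve_src(page_url, "maintenance_state.txt"))
# http://example.com/site/maintenance_state.txt

# An absolute src is returned unchanged.
print(resolve_src(page_url, "http://other.example/state.txt"))
# http://other.example/state.txt
```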
Situation
I'm using Selenium and Python to extract info from a page
Here is the div I want to extract from:
I want to extract the "Registre-se" and the "Login" text.
My code
from selenium import webdriver
url = 'https://www.bet365.com/#/AVR/B146/R^1'
driver = webdriver.Chrome()
driver.get(url.format(q=''))
elements = driver.find_elements_by_class_name('hm-MainHeaderRHSLoggedOutNarrow_Join ')
for e in elements:
    print(e.text)
elements = driver.find_elements_by_class_name('hm-MainHeaderRHSLoggedOutNarrow_Login ')
for e in elements:
    print(e.text)
Problem
My code doesn't produce any output.
HTML
<div class="hm-MainHeaderRHSLoggedOutNarrow_Join ">Registre-se</div>
<div class="hm-MainHeaderRHSLoggedOutNarrow_Login " style="">Login</div>
Looking at this HTML
<div class="hm-MainHeaderRHSLoggedOutNarrow_Join ">Registre-se</div>
<div class="hm-MainHeaderRHSLoggedOutNarrow_Login " style="">Login</div>
and at your code, the code looks okay to me, except that you are using find_elements for a single web element.
And from reading this comment:
The class name "hm-MainHeaderRHSLoggedOutMed_Login " only appear in
the inspect of the website, but not in the page source. What it's
supposed to do now?
The element is most likely inside either an iframe or a shadow root,
because page_source does not include the contents of iframes.
Please check whether it is in an iframe. If it is, you have to switch to the iframe first, and then you can use the code that you have.
Switch to it like this:
driver.switch_to.frame(driver.find_element_by_xpath('xpath here'))
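If you are not sure which frame holds the element, a rough sketch like the following checks the top-level document and then each top-level iframe in turn. The function name is hypothetical, driver is assumed to be a Selenium 4 WebDriver, and the locator strings "class name" and "tag name" are the values behind By.CLASS_NAME and By.TAG_NAME. It does not look inside nested frames or shadow roots:

```python
def find_texts_in_frames(driver, class_name):
    """Collect the text of elements with `class_name`.

    Searches the top-level document first, then each top-level iframe.
    `driver` is assumed to be a Selenium WebDriver.
    """
    texts = [el.text for el in driver.find_elements("class name", class_name)]
    for frame in driver.find_elements("tag name", "iframe"):
        driver.switch_to.frame(frame)
        try:
            found = driver.find_elements("class name", class_name)
            texts.extend(el.text for el in found)
        finally:
            # Go back to the top-level document before trying the next frame.
            driver.switch_to.default_content()
    return texts
```

For the page in the question, the call would look like find_texts_in_frames(driver, "hm-MainHeaderRHSLoggedOutNarrow_Login"), with the class name passed without the trailing space.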
I am trying to find the URL of the trailer video on this page: https://www.binged.com/streaming-premiere-dates/black-monday/.
I checked the various properties of the div class="wordkeeper-video", but I cannot find it. Can someone help?
Go ahead and play it. An element like this will then appear, and the link is in its src attribute:
<iframe frameborder="0" allowfullscreen="" allow="autoplay" src="https://www.youtube.com/embed/pzxGR6Q-7Mc?rel=0&showinfo=0&autoplay=1"></iframe>
PS: It is in div class="wordkeeper-video"
The video URL is not present initially.
You first need to click the play button (actually the image); after that, the URL appears in an iframe.
The iframe matches the selector .wordkeeper-video iframe
So you have to locate that iframe and then read its src attribute. The attribute sits on the iframe element itself, so no frame switch is needed to read it.
The full URL isn't there but all you need to build it is.
<div class="wordkeeper-video " data-type="youtube" data-embed="pzxGR6Q-7Mc" ...>
The data-embed attribute has what you need.
The URL is
https://www.youtube.com/watch?v=pzxGR6Q-7Mc
where the part after v= is the data-embed value.
You can get this by using
data_embed = driver.find_element_by_css_selector(".wordkeeper-video").get_attribute("data-embed")
video_url = "https://www.youtube.com/watch?v=" + data_embed
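The two approaches above lead to the same video ID, which can be checked with a short sketch using only the standard library. The function names are made up for illustration, and the embed URL is the one quoted from the iframe after pressing play:

```python
from urllib.parse import urlparse

def embed_id_from_src(src):
    """Return the last path segment of a YouTube embed URL (the video ID)."""
    return urlparse(src).path.rstrip("/").rsplit("/", 1)[-1]

def watch_url(video_id):
    """Build a watch URL from a video ID such as the data-embed value."""
    return "https://www.youtube.com/watch?v=" + video_id

src = "https://www.youtube.com/embed/pzxGR6Q-7Mc?rel=0&showinfo=0&autoplay=1"
print(watch_url(embed_id_from_src(src)))
# https://www.youtube.com/watch?v=pzxGR6Q-7Mc
```

Reading data-embed is the simpler route, since it does not require clicking the play button first.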
I'm an amateur at Python, and I'm trying to scrape the URL from the HTML below using Selenium.
<a class="" href="#" style="text-decoration: none; color: #1b1b1b;" onclick="toDetailOrUrl(event, '1641438','')">[안내] 빗썸 - 빗썸 글로벌 간 간편 가상자산 이동 서비스 종료 안내</a>
Ordinarily, the URL I want would be right after 'href=', but in this HTML there is just "#".
When I run the code below, which is the usual way of using Selenium to scrape such HTML, it returns https://cafe.bithumb.com/view/boards/43. But that is just the URL I passed to driver.get(), not the one I want.
url = "https://cafe.bithumb.com/view/boards/43"
driver=webdriver.Chrome('chromedriver.exe')
driver.get(url)
driver.implicitly_wait(30)
bo = driver.find_element_by_xpath("//tbody[1]/tr[@style='cursor:pointer;border-top:1px solid #dee2e6;background-color: white']/td[2]/a")
print(bo.get_attribute('href'))
What I want is https://cafe.bithumb.com/view/board-contents/1641438. You get this URL when you click the item matched by the XPath I wrote above.
I want to get this URL with Selenium or some other programmatic way, without opening Chrome, typing the URL into the address bar, and clicking with the mouse.
You can use
bo.click()
to click the element you want (I assumed you want to click bo). After the click, driver.current_url should report the address of the page that opened.
print(driver.execute_script('return arguments[0].getAttribute("href")', bo))
In Selenium, bo.get_attribute('href') effectively reads the element's href property (like document.getElementById("some-id").href), which returns the fully resolved URL. Because '#' refers to the current page, you get back the URL you passed to get().
If you just need the literal # you can use the execute_script call above.
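Another option that avoids clicking: the ID of the post appears in the onclick attribute, so it can be pulled out with a regular expression. This is only a sketch. The function name is made up, and the URL pattern is an assumption inferred from the single example URL in the question:

```python
import re

# Matches the first quoted number in: toDetailOrUrl(event, '1641438','')
ONCLICK_ID = re.compile(r"toDetailOrUrl\(\s*event\s*,\s*'(\d+)'")

# Assumed URL prefix, taken from the example URL in the question.
BOARD_PREFIX = "https://cafe.bithumb.com/view/board-contents/"

def board_url_from_onclick(onclick, prefix=BOARD_PREFIX):
    """Return the post URL for an onclick value, or None if no ID is found."""
    match = ONCLICK_ID.search(onclick or "")
    if match is None:
        return None
    return prefix + match.group(1)

print(board_url_from_onclick("toDetailOrUrl(event, '1641438','')"))
# https://cafe.bithumb.com/view/board-contents/1641438
```

With Selenium, the attribute value could be read with bo.get_attribute("onclick") and passed to this function.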
I need to scrape the contents of an iframe using Python.
As the web-page loads it submits a request and gets the content of the iframe in the response. When I use BeautifulSoup to get the data it just gives the initial blank iframe contents.
Is there any way I can get the contents? If so, how do I do it in my case?
Here is my code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances" + "/token/generic-tokenholders2?a=0x168296bb09e24a88805cb9c33356536b980d3fc5&s=100000000000000000", headers={"content-type":"type"}, timeout=5)
soup = BeautifulSoup(page.content, "html.parser")
balances = soup.find(id="balances")
print(balances.prettify())
table_items = balances.find_all('tr') #I'm trying to collect all the <tr> tags inside an <iframe>
print(table_items) #It shows an empty list because the <iframe> didn't load
To get content from an iframe you have to fetch it separately. Find the iframe and then make a new request for its src (if the src is a relative URL, resolve it against the page URL first):
iframe_content = requests.get(soup.find("iframe")["src"])
see this question
I am trying to learn a bit of beautiful soup, and to get some html data out of some iFrames - but I have not been very successful so far.
So, parsing the iFrame in itself does not seem to be a problem with BS4, but I do not seem to get the embedded content from this - whatever I do.
For example, consider the below iFrame (this is what I see on chrome developer tools):
<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"
src="http://www.engineeringmaterials.com/boron/728x90.html "width="728" height="90">
#document <html>....</html></iframe>
where, <html>...</html> is the content I am interested in extracting.
However, when I use the following BS4 code:
iFrames=[] # quick bs4 example
for iframe in soup("iframe"):
    iFrames.append(soup.iframe.extract())
I get:
<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO" src="http://www.engineeringmaterials.com/boron/728x90.html" width="728" height="90">
In other words, I get the iFrames without the document <html>...</html> within them.
I tried something along the lines of:
iFrames=[] # quick bs4 example
iframexx = soup.find_all('iframe')
for iframe in iframexx:
    print(iframe.find_all('html'))
.. but this does not seem to work..
So, I guess my question is: how do I reliably extract these document objects <html>...</html> from the iFrame elements?
Browsers load the iframe content in a separate request. You'll have to do the same:
from urllib.request import urlopen  # Python 3; on Python 2 use urllib2.urlopen

for iframe in iframexx:
    response = urlopen(iframe.attrs['src'])
    iframe_soup = BeautifulSoup(response, 'html.parser')
Remember: BeautifulSoup is not a browser; it won't fetch images, CSS and JavaScript resources for you either.
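One detail worth handling before the second request is that an iframe src may be relative or carry stray whitespace. The sketch below collects iframe addresses and resolves them against the page URL. It deliberately uses html.parser from the standard library instead of BeautifulSoup so that it runs without third-party packages; the class and function names and the example.com address are made up for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class IframeSrcCollector(HTMLParser):
    """Collect absolute URLs of all <iframe src=...> tags in a document."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                # Strip stray whitespace and resolve relative addresses.
                self.sources.append(urljoin(self.page_url, src.strip()))

def iframe_urls(html, page_url):
    """Return the resolved src of every iframe found in `html`."""
    collector = IframeSrcCollector(page_url)
    collector.feed(html)
    collector.close()
    return collector.sources

html = '<p><iframe src="/boron/728x90.html " width="728"></iframe></p>'
print(iframe_urls(html, "http://www.example.com/page/index.html"))
# ['http://www.example.com/boron/728x90.html']
```

Each returned address can then be fetched with a separate request, as in the loop above.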