Getting html elements of another page with window.open() - python

I'm trying to access HTML elements from a page I open with window.open().
These are my HTML files:
firstpage.html:
<html>
<body>
<script>
function openPage() {
return window.open("secondpage.html")
}
</script>
</body>
</html>
secondpage.html:
<html>
<body>
<h1>Hello!</h1>
<p>This is the second page</p>
</body>
</html>
This is what I'm doing:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("firstpage.html")
html = browser.execute_script("return openPage().document;")
print(html)
What I'm expecting to get is a reference to the document element of the second page. This seems to work in the Firefox web console. When I test the script, the second page opens, but the first page seems to hang and after a while I get a dialog saying:
"A script on this page may be busy, or it may have stopped responding. You can stop the script now, or you can continue to see if the script will complete."
With the "Stop" and "Continue" buttons. Pressing "Continue" the dialog keeps to appear, when I eventually press "Stop", the html python variable contains the same text of the dialog.
What am I doing wrong?
EDIT:
As in e1che's answer, this is the right way to do it:
firstpage.html:
<html>
<body>
<script>
function openPage() {
window.open("secondpage.html", "secondpagewindow")
}
</script>
</body>
</html>
The python code:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("firstpage.html")
browser.execute_script("openPage()")
browser.switch_to_window("secondpagewindow")
print(browser.page_source)

You're doing well,
but you missed switching your driver to the new window.
So right after your browser.execute_script("return openPage().document;"),
you write something like:
driver.switch_to_window("windowName")
I'll let you search here for more info and tricks ;)
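As a side note, switching by window name only works when the page passes a name to window.open. A more general approach (a sketch, not part of the answer above; the Selenium calls appear only in comments) is to diff the driver's window_handles before and after the popup opens:

```python
def new_window_handle(before, after):
    """Return the handle of the one window present in `after` but not `before`."""
    opened = set(after) - set(before)
    if len(opened) != 1:
        raise ValueError("expected exactly one new window, got %d" % len(opened))
    return opened.pop()

# With a live driver this would look like (hypothetical usage, not run here):
#   before = browser.window_handles
#   browser.execute_script("openPage()")
#   browser.switch_to_window(new_window_handle(before, browser.window_handles))
print(new_window_handle(["w1"], ["w1", "w2"]))  # → w2
```

The set difference makes the code independent of whether the popup was given a name, and raising on anything but exactly one new handle catches popups that were blocked or opened twice.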

Related

<!DOCTYPE html> missing in Selenium Python page_source

I'm using Selenium for functional testing of a Django application and thought I'd try html5lib as a way of validating the html output. One of the validations is that the page starts with a <!DOCTYPE ...> tag.
The unit test checks with response.content.decode() all worked fine, correctly flagging errors, but I found that Selenium driver.page_source output starts with an html tag. I have double-checked that I'm using the correct template by modifying the title and making sure that the change is reflected in the page_source. There is also a missing newline and indentation between the <html> tag and the <title> tag.
This is what the first few lines looks like in the Firefox browser.
<!DOCTYPE html>
<html>
<head>
<title>NetLog</title>
</head>
Here's the Python code.
self.driver.get(f"{self.live_server_url}/netlog/")
print(self.driver.page_source)
And here's the first few lines of the print when run under the Firefox web driver.
<html><head>
<title>NetLog</title>
</head>
The page body looks fine, while the missing newline is also present between </body> and </html>. Is this expected behaviour? I suppose I could just stuff the DOCTYPE tag in front of the string as a workaround but would prefer to have it behave as intended.
Chris
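The workaround the question itself suggests (stuffing the DOCTYPE tag in front of the string) can be sketched as a small helper; this is an illustration, not a Selenium API, and the sample source string below is made up:

```python
def with_doctype(page_source, doctype="<!DOCTYPE html>"):
    """Prepend a doctype unless the serialized source already starts with one."""
    if page_source.lstrip().lower().startswith("<!doctype"):
        return page_source
    return doctype + "\n" + page_source

# Simulating what the Firefox driver's page_source returned (doctype stripped):
src = "<html><head>\n<title>NetLog</title>\n</head><body></body></html>"
print(with_doctype(src).splitlines()[0])  # → <!DOCTYPE html>
```

If you want the document's real doctype rather than assuming HTML5, it should also be recoverable in the browser itself via driver.execute_script("return new XMLSerializer().serializeToString(document.doctype);"), since XMLSerializer can serialize the DocumentType node that page_source omits.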

How to ask a web scraper to wait until JavaScript functions are executed? [duplicate]

Here is the page I read:
<html>
<head>
<script type="text/javascript">
document.write("Hello World")
</script>
</head>
<body>
</body>
</html>
As you can see, "Hello World" is added to the HTML page using JavaScript. When I use an HTML parser like BeautifulSoup to parse it, it can't see "Hello World". Is it possible for me to parse the actual result, the way the client side really sees it? Thanks.
I ran into a similar problem when writing web scrapers in python, and I found Selenium Web Driver in combination with BeautifulSoup very useful. The code ends up looking something like this:
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get("http://www.yoursite.com")
soup = BeautifulSoup(browser.page_source, "html.parser")
...
With Selenium WebDriver, there's also functionality for "wait until a certain DOM element has loaded", which makes the timing with JavaScript elements easier too.
For a correct representation of what the DOM looks like after javascript manipulation, you'll have to actually execute the javascript. This has to be done by something that has a javascript engine and a DOM (rather than text/markup) representation of the document - typically, a browser.
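That "wait until a certain DOM element has loaded" facility is, at its core, a poll-with-timeout loop. A minimal stdlib sketch of the idea (the real Selenium equivalent is shown in the comments and assumes a live driver):

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1fs" % timeout)
        time.sleep(poll)

# The Selenium equivalent (hypothetical element id, assumes a live driver):
#   from selenium.webdriver.common.by import By
#   from selenium.webdriver.support.ui import WebDriverWait
#   from selenium.webdriver.support import expected_conditions as EC
#   WebDriverWait(browser, 10).until(
#       EC.presence_of_element_located((By.ID, "some-id")))
print(wait_until(lambda: "Hello World", timeout=1.0))  # → Hello World
```

Polling plus a deadline is all WebDriverWait does under the hood, which is why it copes with content that JavaScript inserts at an unpredictable time.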

selenium webdriver(chrome) elements differ from those of driver.page_source

I tried to scrape an article on Medium.
But it failed because selenium.webdriver.page_source doesn't contain the target div.
E.g. Demystifying Python Decorators in Less Than 10 Minutes https://medium.com/#adrianmarkperea/demystifying-python-decorators-in-10-minutes-ffe092723c6c
On this site, the content holder div's class is "x y z ab ac ez af ag", but this element doesn't show up in driver.page_source.
Short code below.
It is NOT a timeout problem.
It seems like driver.page_source is not processed with JavaScript, but I'm not sure.
ARTICLE = "https://medium.com/#adrianmarkperea/demystifying-python-decorators-in-10-minutes-ffe092723c6c"
driver.get(ARTICLE)
text_soup = BeautifulSoup(driver.page_source,"html5lib")
text = text_soup.select(".x.y.z.ab.ac.ez.af.ag")
print(text) # => []
I expect the output of driver.page_source is the same as that of the chrome developer console's elements.
Update: I did an experiment.
I suspected the webdriver couldn't get the HTML source as processed by JavaScript, so I "selenium-ed" the HTML file below.
But I got an "element-removed" HTML file.
Result:
webdriver and ordinary Chrome console are the same -> processed
<html lang="en">
<body>
<script type="text/javascript">
document.querySelector("#id").remove();
</script>
</body></html>
wget / requests -> not processed
<html lang="en">
<body>
<div id="id">
test element
</div>
<script type="text/javascript">
document.querySelector("#id").remove();
</script>
</body></html>

How to crawl a web page using selenium - find_element_by_link_text

I'm trying to use Selenium and BeautifulSoup to "click" a javascript:void link. The return of find_element_by_link_text is not NULL. However, nothing is updated when reviewing browser.page_source, so I am not sure whether the crawl succeeded.
Here is the result using
PageTable = soup.find('table',{'id':'rzrqjyzlTable'})
print(PageTable)
<table class="tab1" id="rzrqjyzlTable">
<div id="PageNav" class="PageNav" style="">
<div class="Page" id="PageCont">
Previous3<span class="at">1</span>
2
3
4
5
...
45
Next Page
<span class="txt"> Jump</span><input class="txt" id="PageContgopage">
<a class="btn_link">Go</a></div>
</div>
The code for clicking next page is shown below
try:
    page = browser.find_element_by_link_text(u'Next Page')
    page.click()
    browser.implicitly_wait(3)
except NoSuchElementException:
    print("NoSuchElementException")
soup = BeautifulSoup(browser.page_source, 'html.parser')
PageTable = soup.find('table',{'id':'rzrqjyzlTable'})
print(PageTable )
I am expecting that browser.page_source should be updated
My guess is that you are pulling the source before the page (or subpage) is done reloading. I would try grabbing the Next Page button, clicking it, waiting for it to go stale (which indicates the page is reloading), and then pulling the source.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 10)
page = browser.find_element_by_link_text(u'Next Page')
page.click()
wait.until(EC.staleness_of(page))
# the page should be loading/loaded at this point
# you may need to wait for a specific element to appear to ensure that it's loaded properly, since it doesn't seem to be a full page load
After clicking on Next Page, you can reload the web page.
Code:
driver.refresh()
Or using the JavaScript executor:
driver.execute_script("location.reload()")
After that, try to get the page source the way you are doing.
Hope this helps.

How to get the EXACT, REAL value of 'href'

I'm trying to make a program that can fetch information about my attendance from my college website. To do that, I wrote a script to log in to the website, which leads me to my dashboard, then go to the Attendance tab, get the href, and attach it to the URL of the college website.
The tag in the attendance class looked like this:
Attendance
and when I clicked the attendance link, the webpage had a URL in the address bar that looked like this:
http://erp.college_name.edu/Student/StudentAttendanceView.aspx?SID=7JyKkZE1Eyx2EYNii7tOjQ==|yaE7DmfR9r8=
So it was self-explanatory that I was supposed to attach the href to
'http://erp.college_name.edu'. OK, I did, i.e.
L = 'http://erp.college_name.edu' + str(I.findAll('li')[4].a.get('href').replace('.', ''))
but the problem is that when I fetch the href it is something other than the one in the tag, and it keeps changing. When I print L, this is what I get (which I assumed I would get):
http://erp.college_name.edu/Student/StudentAttendanceViewaspx?SID=aDmK9cEFWwDqvsWw5ZzEOw==|oTeYVRfW1u8=
But the href I'm getting is different from the real URL, and IT KEEPS CHANGING WHEN I RE-RUN THE PROGRAM. The second time I got
http://erp.college_name.edu/Student/StudentAttendanceViewaspx?SID=WM/lbVRchyyBiLsDvkORJw==|MaP8NtvvrHE=
Why am I getting this? Moreover, when I click other links on my Dashboard page and then click the attendance tab again, the href value in the URL in the address bar changes again.
So, after that, when I did
opens = requests.get(L)
soup_2 = BeautifulSoup(opens.text, 'lxml')
print(soup_2)
I got this:
C:\Users\HUNTER\AppData\Local\Programs\Python\Python35-32\python.exe
C:/Users/HUNTER/PycharmProjects/dictionary/erp_1.py
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html><head><title>The page cannot be found</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<style type="text/css">
BODY { font: 8pt/12pt verdana }
H1 { font: 13pt/15pt verdana }
H2 { font: 8pt/12pt verdana }
A:link { color: red }
A:visited { color: maroon }
</style>
</head><body><table border="0" cellspacing="10" width="500"><tr><td>
<h1>The page cannot be found</h1>
The page you are looking for might have been removed, had its name
changed, or is temporarily unavailable.
<hr/>
<p>Please try the following:</p>
<ul>
<li>Make sure that the Web site address displayed in the address bar of
your browser is spelled and formatted correctly.</li>
<li>If you reached this page by clicking a link, contact
the Web site administrator to alert them that the link is incorrectly
formatted.
</li>
<li>Click the Back button to
try
another link.</li>
</ul>
<h2>HTTP Error 404 - File or directory not found.<br/>Internet
Information
Services (IIS)</h2>
<hr/>
<p>Technical Information (for support personnel)</p>
<ul>
<li>Go to <a href="http://go.microsoft.com/fwlink/?
linkid=8180">Microsoft
Product Support Services</a> and perform a title search for the words
<b>HTTP</b> and <b>404</b>.</li>
<li>Open <b>IIS Help</b>, which is accessible in IIS Manager (inetmgr),
and search for topics titled <b>Web Site Setup</b>, <b>Common
Administrative
Tasks</b>, and <b>About Custom Error Messages</b>.</li>
</ul>
</td></tr></table></body></html>
Process finished with exit code 0
UPDATE
I replaced the .replace('.', '') call with [2:] because the replace also removed the . from .aspx in the href, and the problem now changed to this.
But still, how does the value of href keep changing, and how can I fetch that page?
Any help?
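For what it's worth: an SID that changes on every login and every click is almost certainly a server-generated session token, so the href changing between runs is expected; what matters is joining the freshly scraped href onto the base URL instead of patching the string by hand. A sketch using the standard library (the dashboard path and SID value below are made up):

```python
from urllib.parse import urljoin

# Hypothetical values: the dashboard URL and the scraped href are made up.
base = "http://erp.college_name.edu/Student/Dashboard.aspx"
href = "./StudentAttendanceView.aspx?SID=abc123=="
print(urljoin(base, href))
# → http://erp.college_name.edu/Student/StudentAttendanceView.aspx?SID=abc123==
```

urljoin resolves the leading ./ against the base page's path, so the .aspx extension stays intact and no replace or slicing is needed; just remember to fetch the href fresh within the same logged-in session you use to request it.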
