I'm trying to use Selenium's Chrome web driver to navigate to a page and then fill out a form. The problem is that the page loads and only displays the form 5 seconds later, so JavaScript changes the DOM after 5 seconds. I think this means the form's HTML id doesn't exist in the source code the web driver receives.
Chrome's inspect feature shows the form in the live DOM; however, that HTML doesn't appear in the page's source HTML.
Python used to find the element:
answerBox = driver.find_element_by_xpath("//form[@id='answer0problem2']")
How would I access the input field within this form?
Is there a way to refresh the web driver without changing the page?
You're running into this problem because you didn't give the website enough time to load.
Use time.sleep() like this:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://your.website.com')
time.sleep(15)
plain_text = driver.page_source
soup = BeautifulSoup(plain_text, 'lxml')
This works because Selenium drives the browser in its own process, which is not affected by Python's sleep: while the script sleeps, the browser keeps working and finishes loading the website.
It's helpful to allow for page-load time around each Selenium call. The Python process only communicates with the browser when you call into the driver, so calling it before the page has loaded can have consequences like the one you described.
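If you'd rather not hard-code a sleep, Selenium's explicit waits poll for a condition and return as soon as it holds; a minimal sketch using the form id from the question above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('http://your.website.com')
# Wait up to 15 seconds for the form to be added to the DOM,
# but continue as soon as it appears
answer_box = WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.ID, 'answer0problem2'))
)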
I am facing an issue.
I navigate the page via Selenium Chrome. I have timeouts and WebDriverWait in place, as I need the fully loaded page to get JSON out of it.
Then I click the navigation button with
driver.execute_script("arguments[0].click();", element)
because a normal click() never worked.
The navigation itself is fine; I can see Selenium surfing normally. No problem there.
But driver.page_source still holds the first page, the one I loaded via the get() method.
All the timeouts are the same as for the first page, and I can see the new pages render normally, but page_source never updates.
What am I doing wrong?
After navigating to the new page, get the current URL:
url = driver.current_url  # current_url is a property in the Python bindings, not a method
and then:
driver.get(url)
source = driver.page_source
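Put together, and with an explicit wait so page_source is not read before the navigation has finished (a sketch that assumes the click changes the URL):

from selenium.webdriver.support.ui import WebDriverWait

old_url = driver.current_url
driver.execute_script("arguments[0].click();", element)
# Block until the browser has actually navigated away
WebDriverWait(driver, 10).until(lambda d: d.current_url != old_url)
source = driver.page_source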
I'm trying to use Selenium for my automation project.
In this case I noticed that find_elements_by_tag_name() only returns the elements currently displayed in the browser, so I have been sending PAGE_DOWN and running the function again.
The question: is there any way to run find_elements_by_tag_name() over the whole loaded content in Selenium without scrolling down the page?
For example, in my case I use this:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome()
browser.get(url)
images = browser.find_elements_by_tag_name("img")
# Scroll down, then search again for newly displayed images
browser.find_element_by_tag_name("body").send_keys(Keys.PAGE_DOWN)
I don't want to send PAGE_DOWN because I already have the whole page in the browser.
Note: browser.page_source is not a solution.
Thanks for the help.
I have created a script which will fill the form and submit it.
The website then displays the results.
Once I open Chrome using Selenium, driver.page_source gives the correct HTML output of the initial state.
If I read driver.page_source after submitting the form, I only get the source of the initial state again; no change is reflected even though the HTML has changed.
Question: how do I get the HTML output of the page, with its changes, after submitting the form?
Thanks for the help in advance!
PS: I'm new, so yeah..
EDIT:
I found the answer: it was working fine all along, but the web page hadn't fully loaded yet, so I was still getting the old source code. I simply made the driver wait before extracting the new source.
Thank you!
Once you submit the form, and before you pull page_source to check for the change, it is worth mentioning that although the browser may have reached a document.readyState of "complete" at a certain stage and Selenium has regained control of program execution, that does not guarantee that all of the JavaScript and Ajax calls associated with the new page have completed. Until those calls finish, the page is not completely rendered, and you may not be able to observe the intended changes.
An ideal way to check for the change is to use WebDriverWait in conjunction with an expected_conditions clause such as title_contains, as follows:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.find_element_by_xpath("xpath_of_element_changes_page").click()
WebDriverWait(driver, 10).until(EC.title_contains("full_or_partial_text_of_the_new_page_title"))
source = driver.page_source
Note: since the page title resides within the <head> tag of the HTML DOM, a better solution is to wait for the visibility of an element that will be present in all situations within the <body> tag of the DOM tree, as follows:
from selenium.webdriver.common.by import By

driver.find_element_by_xpath("xpath_of_element_changes_page").click()
WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.ID, "id_of_element_present_in_all_situation")))
source = driver.page_source
You can pass Selenium's current page source to Scrapy's Selector and use the usual CSS and/or XPath selectors to get data from it:
from scrapy.selector import Selector

sel_response = Selector(text=driver.page_source)
sel_response.css(<your_css_selector>).extract()
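For example, to pull all image URLs out of the rendered page (the selector here is only an illustration, not taken from the question):

from scrapy.selector import Selector

sel_response = Selector(text=driver.page_source)
image_urls = sel_response.css('img::attr(src)').extract()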
I have a flash-card-making program for Spanish that pulls information from here: http://www.spanishdict.com/examples/zorro (this is just an example). I've set it up so it gets the translations fine, but now I want to add examples. I noticed, however, that the examples on that page are dynamically generated, so I installed Beautiful Soup and the html5lib parser. The tag I'm specifically interested in is:
<span class="megaexamples-pair-part">Los perros siguieron el rastro del <span class="megaexamples-highlight">zorro</span>. </span>
The code I'm using to try and retrieve it is:
from urllib2 import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen("http://www.spanishdict.com/examples/zorro").read(), 'html5lib')
example = soup.findAll("span", {"class": "megaexamples-pair-part"})
However, no matter which way I swing it, I can't seem to get it to pull down the dynamically generated code. I have confirmed I'm getting the page by searching for megaexamples-container, which works fine (as you can see by right-clicking in Google Chrome and hitting View Page Source).
Any ideas?
What you're doing just pulls the static HTML page; the site is likely loading more data from the server via JavaScript calls.
You have 2 options:
Use a webdriver such as Selenium to control a web browser that correctly loads the entire page (you can then parse it with BeautifulSoup or find elements with Selenium's own tools). This incurs some overhead due to the browser usage.
Use the network tab of your browser's developer tools (usually opened with F12) to analyze the incoming and outgoing requests behind the dynamic loading, and use the requests module to replicate them; see the sketch after this answer. This is more efficient but can also be trickier.
Remember to do this only if you have permission from the site's owner, though. In many cases it's against the ToS.
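For the second option, a minimal sketch with the requests module; the endpoint URL, query parameter, and JSON shape below are purely hypothetical stand-ins for whatever you actually find in the network tab:

import requests

# Hypothetical endpoint discovered in the browser's network tab
url = 'http://www.example.com/api/examples'
response = requests.get(url, params={'q': 'zorro'})  # made-up parameter
response.raise_for_status()
data = response.json()  # assumes the endpoint returns JSON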
I used Pedro's answer to get me moving in the right direction. Here is what I did to get it to work:
Install selenium with pip install selenium
Download the driver for the browser you want to emulate. You can download them from this page. The driver must be on your PATH, or you will need to specify its path in the webdriver constructor.
Import selenium with from selenium import webdriver
Now use the following code:
browser = webdriver.Chrome()
browser.get(raw_input("Enter URL: "))
html_source = browser.page_source
Note: If you did not put your driver on your PATH, you have to pass the path to the constructor: browser = webdriver.Chrome(<PATH_TO_DRIVER_HERE>)
Note 2: You can use something like webdriver.Firefox() if you want a different browser.
Now you can parse it with something like: soup = BeautifulSoup(html_source, 'html5lib')
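Putting those steps together for the zorro page from the original question (a sketch; you may still need a wait if the examples render slowly):

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome()
browser.get("http://www.spanishdict.com/examples/zorro")
# page_source now reflects the JavaScript-rendered DOM
soup = BeautifulSoup(browser.page_source, 'html5lib')
examples = soup.findAll("span", {"class": "megaexamples-pair-part"})
browser.quit()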
When this page is scraped with urllib2:
import urllib2

url = 'https://www.geckoboard.com/careers/'
response = urllib2.urlopen(url)
content = response.read()
the FRONT-END ENGINEER element (the link to the job) is nowhere to be found in the source (content), even though it does appear in the full source that gets rendered in a browser.
So it would appear that the element is dynamically loaded by JavaScript. Is it possible to have this JavaScript executed by urllib2 (or another low-level library) without involving e.g. Selenium, BeautifulSoup, or other tools?
The pieces of information are loaded via an Ajax request. You can use the Firebug extension for Mozilla Firefox, or Google Chrome's built-in developer tools: just hit F12 in Chrome while opening the URL and you will find the complete details there.
There you will find a request with the URL https://app.recruiterbox.com/widget/13587/openings/
The information from that URL is what gets rendered into the web page.
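So you can skip the browser entirely and hit that endpoint with the same urllib2 you are already using; a sketch in which the 'objects' and 'title' keys are assumptions about the JSON shape, not verified:

import json
import urllib2

# The careers page populates itself from this Recruiterbox endpoint
url = 'https://app.recruiterbox.com/widget/13587/openings/'
data = json.load(urllib2.urlopen(url))
for opening in data.get('objects', []):  # assumed key
    print opening.get('title')  # assumed key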
From what I understand, you are building something generic for multiple websites and don't want to go deep into how a certain site is loaded or what requests are made under the hood to construct the page. In that case, a real browser is your friend: load the page in a real browser automated via Selenium, then, once the page is loaded, pass the .page_source to lxml.html (which, from what I can see, is your HTML parser of choice) for further parsing.
If you don't want a browser to show up or you don't have a display, you can go headless - PhantomJS or a regular browser on a virtual display.
Here is a sample code to get you started:
from lxml.html import fromstring
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.set_page_load_timeout(15)
driver.get("https://www.geckoboard.com/careers/")
# TODO: you might need a delay here
tree = fromstring(driver.page_source)
driver.close()
# TODO: parse HTML
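To fill in the second TODO, a hedged parsing example; the XPath below is a guess at the openings widget's markup, not verified against the live page:

# Continues from the snippet above, reusing `tree`.
# Hypothetical: assumes each job opening is rendered as a list-item link.
for link in tree.xpath('//li//a'):
    title = link.text_content().strip()
    if title:
        print(title)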
You should also know that there are plenty of ways to locate elements in Selenium, so you might not even need a separate HTML parser here.
I think you're looking for something like this: https://github.com/scrapinghub/splash