Selenium Get jsp generated Page Source - python

I have a website that populates its content using JSP.
I am trying to use Selenium to scrape the content there. When I open the page, I sleep for a few seconds and wait until the page has fully loaded (I can see, by eyeballing it, that the data has finished populating).
However, when I do browser.find_elements_by_class... I cannot find any elements! I don't know how I can solve that issue in Selenium.

Check and see if your elements are inside of a frame or iframe.
http://selenium-python.readthedocs.org/en/latest/navigating.html#moving-between-windows-and-frames documents the Python version, which turns out to be:
driver.switch_to_frame("framename")
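A minimal sketch of the idea (the frame name and class name are placeholders for whatever your page actually uses):
# switch into the frame that contains the JSP-generated content;
# "framename" and "your-class" are placeholders for your page
driver.switch_to.frame("framename")      # switch_to_frame() is the older, deprecated spelling
elements = driver.find_elements_by_class_name("your-class")
driver.switch_to.default_content()       # return to the top-level document when done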

Related

fetching an updated page after scrolling with selenium webdriver

I'm trying to scrape titles and links from a YouTube search using Selenium WebDriver, and I'm currently iterating over the process until a certain condition turns false. Though I can see the page scrolling when it's launched, the data I get seems to be only from the first page fetched, before scrolling a single time. How can I access the updated data after I've scrolled down?
This is some of my code:
driver.get(URL)
while condition:
    # extract data, check for condition and write to CSV file
    driver.execute_script("window.scrollTo(0, 10000)")
    WebDriverWait(driver, 60)  # note: without .until(...) this does not actually wait
    if iteration_terminating_condition:
        break  # terminate the iteration
It depends on what you're using to extract the data. You can do this with Selenium, but if you're extracting lots of data then it's probably not that efficient. Generally, Selenium should be used as a last resort, for data you can't get through other means.
Consider the following other sources of dynamic content:
API - YouTube does provide one and it may be worth checking out. You could use the requests package with it, which is more efficient than Selenium.
Re-engineering HTTP requests - This is based on the fact that JavaScript makes an Asynchronous JavaScript and XML (AJAX) request to display information on a page without the page being refreshed. If we can mimic those requests, we can grab the data we want. This applies to infinite scrolling, which the YouTube website uses, but it can also be used for search forms, etc. A request is made to a server and the response is then displayed on the page by JavaScript. This is also an efficient way to deal with dynamic content; see the sketch after this list.
You could use Splash, which pre-renders pages and can execute JavaScript, and which is slightly more efficient than, say, Selenium.
Selenium, which you're attempting here. It is meant for automated testing and was never really meant for web scraping. That being said, if it's needed then it's needed. But the downsides are that it is incredibly slow for lots of data and it can be quite brittle: if the server takes longer to load a page than your commands allow, you can run into exceptions you don't want.
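A hedged sketch of the re-engineered-request idea, assuming you have found the underlying endpoint in the browser's Network tab; the URL, parameters and headers below are entirely hypothetical:
import requests

# hypothetical AJAX endpoint discovered via the browser's developer tools;
# the real URL, parameters and headers depend on the site you are scraping
response = requests.get(
    "https://example.com/api/search",
    params={"q": "my search", "page": 2},
    headers={"User-Agent": "Mozilla/5.0"},
)
response.raise_for_status()
data = response.json()  # many such endpoints return JSON that is ready to use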
If you are thinking of using Selenium, my advice would be to use as little of Selenium as possible. That is, if the HTML page is updated when you scroll down, parse that HTML with, say, BeautifulSoup rather than using Selenium to grab the data you want. Every single time you use Selenium to extract data or scroll, you are making another HTTP request: Selenium works over HTTP, with the WebDriver client sending each command as an HTTP request to a driver such as chromedriver, which in turn controls the browser. So you can imagine that if you have a lot of lines of code extracting data, the load becomes greater.
You could re-read driver.page_source as you scroll (it changes with each scroll attempt) and parse the data each time. The other option, which may make more sense, is to wait until the scrolling stops and then get driver.page_source once, so you can parse the entire HTML with the data you want.
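A minimal sketch of that second option, assuming Chrome and BeautifulSoup; the scroll-height comparison is a common heuristic for detecting when infinite scrolling has stopped, though the exact scroll target can vary by site (the URL is a placeholder):
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.youtube.com/results?search_query=selenium")

# scroll until the document height stops growing, i.e. no more content loads
last_height = driver.execute_script("return document.documentElement.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight)")
    time.sleep(2)  # crude pause to let new results load
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# a single page_source read at the end, then parse locally with BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")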

How to prevent page updates after load with Python Selenium Webdriver (Firefox)

I'm using Python Selenium with Firefox to save data from a webpage into a spreadsheet, but the page continually updates its data, causing errors relating to stale elements. How do I resolve this?
I've tried turning off JavaScript but that doesn't seem to do anything. Any suggestions would be great!
If you want to save the page's data at a specific moment in time, you can:
get the current page's HTML source using the WebDriver.page_source property
write it into a file
open the file from disk using the WebDriver.get() function
That's it: you should be able to work with the local copy of the page, which will never change.
Example code:
driver.get("http://seleniumhq.org")
with open("mypage.html", "w") as mypage:
mypage.write(driver.page_source)
mypage.close()
driver.get(os.getcwd() + "/" + (mypage.name))
#do what you need with the page source
Another approach is to use the WebDriver.find_element function every time you need to interact with the element.
So instead of
myelement = driver.find_element_by_xpath("//your_selector")
# some other action
myelement.get_attribute("interestingAttribute")
perform the find every time you need to interact with the element:
driver.find_element_by_xpath("//your_selector").get_attribute("interestingAttribute")
or, even better, go for an Explicit Wait on the element you need:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, "//your/selector"))).get_attribute("href")

how to scrape a value which is dynamically created by XHR and doesn't show in selenium driver.page_source

Trying to scrape a website which contains some dynamic data. This data could not be captured with Python Selenium's driver.page_source, even after adding lots of waits. But when I inspected with Firebug, I came to know that these values were computed within the browser by referenced JavaScript files.
I completely inspected the page source taken from Selenium for these values and found no trace of them. All I can see is the element's id.
But in the actual browser, these values are present.
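A hedged sketch of one way to approach this, assuming the value is eventually written into an element whose id you know (the id and URL below are placeholders): rather than relying on a page_source snapshot, wait until the element's text is non-empty and read it from the live DOM.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("http://example.com/page-with-xhr-value")

# wait until the element's text is actually populated (empty text is falsy,
# so the wait keeps polling until the XHR result has been written in)
value = WebDriverWait(driver, 20).until(
    lambda d: d.find_element(By.ID, "the-value-id").text or False
)
print(value)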

Problems automating getting webpages to .pdf

I am trying to automate the process of downloading webpages with technical documentation which I need to update every year or so.
Here is an example page: http://prod.adv-bio.com/ProductDetail.aspx?ProdNo=1197
From this page, the desired end result would be having all the HTML links saved as PDFs.
I am using wget to download the .pdf files
I can't use wget to download the html files, because the .html links on the page can only be accessed by clicking through from the previous page.
I tried using Selenium to open the links in Firefox and print them to PDFs, but the process is slow, frequently misses links, and my work proxy server forces me to re-authenticate every time I access a page for a different product.
I could open a Chrome browser using chromedriver, but could not handle the print dialog, even after trying pywinauto per an answer to a similar question here.
I tried taking screenshots of the HTML pages using Selenium, but could not find out how to get the whole webpage without capturing the entire screen.
I have been through a ton of links related to this topic but have yet to find a satisfying solution to this problem.
Is there a cleaner way to do this?
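For what it's worth, a hedged sketch of one possible route: Selenium 4 with a Chromium-based browser exposes the DevTools Page.printToPDF command via execute_cdp_cmd, which produces a PDF with no print dialog at all in headless mode. The output filename is a placeholder:
import base64
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("http://prod.adv-bio.com/ProductDetail.aspx?ProdNo=1197")

# DevTools command; returns the rendered PDF as base64 (Selenium 4 + Chromium)
result = driver.execute_cdp_cmd("Page.printToPDF", {"printBackground": True})
with open("ProductDetail_1197.pdf", "wb") as f:
    f.write(base64.b64decode(result["data"]))

driver.quit()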

How to get page source in Selenium 2 (Python) after page changed by jQuery

I am writing some test code in Python using Selenium, and I get a stale element error when I call driver.page_source. I understand the reason: I have navigated the page, and the navigation inserts new page elements. But once these new elements are inserted in the page, how can I get the source for the whole modified page?
You can use Splinter: http://splinter.cobrateam.info/
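A minimal sketch of that suggestion, assuming Splinter with Firefox; browser.html reflects the current DOM, so it includes elements inserted by jQuery after the initial load (the URL is a placeholder):
from splinter import Browser  # pip install splinter

browser = Browser("firefox")
browser.visit("http://example.com")
# ... perform the navigation that triggers the jQuery changes ...
html = browser.html  # current DOM, including JS-inserted elements
browser.quit()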
