How to browse a whole website using selenium? - python

Is it possible to go through all the URIs of a given URL (website) using Selenium?
My aim is to launch the Firefox browser using Selenium with a given URL of my choice (I know how to do it thanks to this website), and then let Firefox browse all the pages that URL (website) has. I'd appreciate any hint/help on how to do it in Python.

You can use a recursive method in a class such as the one given below to do this.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
public class RecursiveLinkTest {
    // set of URLs that have already been visited
    static Set<String> linkAlreadyVisited = new HashSet<String>();
    WebDriver driver;
    public RecursiveLinkTest(WebDriver driver) {
        this.driver = driver;
    }
    public void linkTest() {
        // Collect the href values first: WebElement references go stale
        // as soon as the browser navigates to another page.
        List<String> hrefs = new ArrayList<String>();
        for (WebElement link : driver.findElements(By.tagName("a"))) {
            String href = link.getAttribute("href");
            // keep only displayed links that actually point somewhere
            if (href != null && link.isDisplayed()) {
                hrefs.add(href);
            }
        }
        for (String href : hrefs) {
            // Set.add returns false if the URL was already visited
            if (linkAlreadyVisited.add(href)) {
                System.out.println(href);
                // open the link. This loads a new page
                driver.get(href);
                // call linkTest on the new page
                new RecursiveLinkTest(driver).linkTest();
            }
        }
    }
    public static void main(String[] args) throws InterruptedException {
        WebDriver driver = new FirefoxDriver();
        driver.get("http://newtours.demoaut.com/");
        // start the recursive link test
        new RecursiveLinkTest(driver).linkTest();
        driver.quit();
    }
}
Hope this helps you.

As Khyati mentions, it is possible; however, Selenium is not a web crawler or robot. You have to know where/what you are trying to test.
If you really want to go down that path, I would recommend that you hit the page, pull all the elements back, and then loop through them, clicking any element that corresponds to navigation functionality (i.e. "//a", a hyperlink click).
If you go down this path and there is a page that opens another page which then links back, you will want to keep a list of all visited URLs and make sure that you don't visit a page like that twice.
This would work, but it requires a bit of logic to make it happen, and you might find yourself in an endless loop if you aren't careful.
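Since the question asked for Python, here is a minimal sketch of the "keep a list of visited URLs" idea. The names crawl, selenium_links and max_pages are made up for this illustration, and staying on the start URL's host is an assumption about what "the whole website" means. The driver is only used through get, find_elements and get_attribute.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(start_url, get_links, max_pages=100):
    """Visit pages breadth-first, never loading the same URL twice.

    get_links(url) must load `url` and return the href values found on it.
    """
    site = urlparse(start_url).netloc
    visited = []            # pages in the order they were loaded
    seen = {start_url}      # every URL ever queued, which prevents loops
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop "#fragment" parts.
            target = urldefrag(urljoin(url, href)).url
            # Assumption: only follow links on the same host as the start URL.
            if urlparse(target).netloc == site and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


def selenium_links(driver):
    """Build a get_links function on top of a Selenium WebDriver."""
    def get_links(url):
        driver.get(url)
        # "tag name" is the value of selenium's By.TAG_NAME constant.
        anchors = driver.find_elements("tag name", "a")
        hrefs = [a.get_attribute("href") for a in anchors]
        return [h for h in hrefs if h]
    return get_links
```

With Selenium installed, something like crawl("http://newtours.demoaut.com/", selenium_links(webdriver.Firefox())) would drive a real browser. The max_pages cap is there so that a very large site cannot keep the loop running forever.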

I know you asked for a Python example, but I was just in the middle of setting up a simple repo for Protractor tests, and the task you want to accomplish seems very easy to do with Protractor (which is just a wrapper around WebDriver).
Here is the code in JavaScript:
describe( 'stackoverflow scrapping', function () {
var ptor = protractor.getInstance();
beforeEach(function () {
browser.ignoreSynchronization = true;
} );
afterEach(function () {
} );
it( 'should find the number of links in a given url', function () {
browser.get( 'http://stackoverflow.com/questions/24257802/how-to-browse-a-whole-website-using-selenium' );
var script = function () {
var cb = arguments[ 0 ];
var nodes = document.querySelectorAll( 'a' );
nodes = [].slice.call( nodes ).map(function ( a ) {
return a.href;
} );
cb( nodes );
};
ptor.executeAsyncScript( script ).then(function ( res ) {
var visit = function ( url ) {
console.log( 'visiting url', url );
browser.get( url );
return ptor.sleep( 1000 );
};
var doVisit = function () {
var url = res.pop();
if ( url ) {
visit( url ).then( doVisit );
} else {
console.log( 'done visiting pages' );
}
};
doVisit();
} );
} );
} );
You can clone the repo from here
Note: I know Protractor is probably not the best tool for this, but it was so simple to do that I just gave it a try.
I tested this with Firefox (you can use the firefox-conf branch for it, but it requires that you start WebDriver manually) and Chrome. If you're using OS X this should work with no problem (assuming you have Node.js installed).

The Selenium API provides facilities for all kinds of operations: type, click, go to, navigate to, switch between frames, drag and drop, etc.
If I understood properly, what you are aiming to do is simply browse: clicking and opening different URLs within the website. Yes, you can definitely do it with Selenium WebDriver.
For convenience you can also create a properties file in which you pass values such as URLs, the base URI, etc., and run the automation through Selenium WebDriver in different browsers.

This is possible. I have implemented this using Java WebDriver and URI. It was mainly created to identify broken links.
Once the page is open, get all the elements with the "a" tag using WebDriver and save their "href" values.
Check the status of each link using Java's URL class and put it on a stack.
Then pop a link from the stack and "get" it using WebDriver. Again get all the links from the page, removing duplicates that are already on the stack.
Loop until the stack is empty.
You can adapt it to your requirements, for example by limiting the levels of traversal or excluding links that are outside the given website's domain.
Please comment if you have difficulty with the implementation.
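The steps above could look roughly like this in Python. This is a sketch under assumptions, not the Java implementation described: find_broken_links, http_status and max_depth are invented names, a status of 400 or higher is treated as "broken", and link collection is passed in as a function so that it can come from WebDriver or anywhere else.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin, urlparse


def find_broken_links(start_url, get_links, get_status, max_depth=2):
    """Stack-based walk that reports links answering with an error status.

    get_links(url)  -> list of hrefs on the page (e.g. collected via WebDriver)
    get_status(url) -> HTTP status code for the URL
    """
    domain = urlparse(start_url).netloc
    stack = [(start_url, 0)]        # pages still to be opened, with their depth
    checked = {start_url}
    broken = []
    while stack:
        url, depth = stack.pop()
        for href in get_links(url):
            link = urljoin(url, href)
            if link in checked:
                continue            # duplicate: already checked or queued
            checked.add(link)
            if get_status(link) >= 400:
                broken.append(link)
            elif urlparse(link).netloc == domain and depth + 1 <= max_depth:
                # Only same-domain pages within the depth limit are opened.
                stack.append((link, depth + 1))
    return broken


def http_status(url, timeout=10):
    """One possible get_status: a HEAD request with the standard library.

    Connection failures (urllib.error.URLError) are not handled here
    and propagate to the caller.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
```

Links to other domains are still status-checked but never opened, which is one way to read "excluding other links which are not having domain of the given website".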

Related

Hi, can I infinite click with selenium?

Is there any way I can autoclick (spam) a button on a webpage using selenium? What I tried was while True: driver.find_element_by_id("whatev").click()
Most likely this will work. However, some sites have protection against prolonged automated clicking. In that case the site will redirect you to another URL, or load new HTML with different classes, IDs and other attributes, which will break your code.
Here is what you can do in Java; translate my code to Python:
loadSite();
while (true) {
    try {
        driver.findElement(By.id("whatev")).click();
    } catch (Exception e) {
        // the page changed: reload it and click again
        loadSite();
        driver.findElement(By.id("whatev")).click();
    }
}
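A possible Python translation of the loop above, as a sketch only. click_repeatedly, clicks, pause and max_reloads are names invented here; the limits are an addition so that the loop stops instead of spinning forever on a page that never recovers.

```python
import time


def click_repeatedly(driver, url, element_id, clicks, pause=0.0, max_reloads=3):
    """Click the element `clicks` times, reloading `url` when a click fails."""
    driver.get(url)
    done = 0
    reloads = 0
    while done < clicks:
        try:
            # "id" is the value of selenium's By.ID constant.
            driver.find_element("id", element_id).click()
            done += 1
        except Exception:
            # The page may have redirected or re-rendered; load it again.
            # In real use, narrow this to selenium's WebDriverException.
            if reloads == max_reloads:
                raise
            reloads += 1
            driver.get(url)
        time.sleep(pause)
    return done
```

A non-zero pause between clicks is also the politer choice towards the site being clicked.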

How to click on specific text in a paragraph?

I have a paragraph element as follows:
<p>You have logged in successfully. <em>LOGOUT</em></p>
Clicking on "LOGOUT" will initiate a logout procedure (e.g display a confirmation prompt).
How do I simulate this clicking on "LOGOUT" using Selenium WebDriver?
To find and click the "LOGOUT" text with python, you can use the following code:
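from selenium.webdriver.common.by import By
# find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...)
logout = driver.find_element(By.XPATH, "//em[text()='LOGOUT']")
logout.click()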
This could help:
Execute button Click with Selenium
As general advice:
You should first analyze the basic components your tool offers and how it interacts with external systems (selection, execution, listening).
Based on the first link offered as a resource, your code should look something like this:
package postBlo;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
public class singleClickButton {
    public singleClickButton() {
        super();
    }
    public static void main(String[] args) throws Exception {
        System.setProperty("webdriver.chrome.driver", "./exefiles/chromedriver.exe");
        WebDriver driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.get("your-local-site-to-test");
        // Reference an input component and set a value
        driver.findElement(By.name("id-html-tag")).sendKeys("someValue text");
        /* ## Clicking the button
           You can identify the element you need in either of two ways:
           - an XPath expression, which lets you navigate between elements
           - the element's id
           Choose one of the two:
           driver.findElement(By.xpath("expression-xpath")).click();
           driver.findElement(By.id("id-element")).click();
        */
        driver.findElement(By.xpath("/html/body/elemnts-container-button/button")).click();
        driver.findElement(By.id("button-id")).click();
    }
}
As a side note, I don't work with Selenium myself, but the logic is much the same.
Best

Selenium Webdriver: How to wait until document.readyState set to 'complete'? [duplicate]

I am trying to check whether a web page has loaded completely (i.e. that all the controls are loaded) in Selenium.
I tried the code below:
new WebDriverWait(firefoxDriver, pageLoadTimeout).until(
webDriver -> ((JavascriptExecutor) webDriver).executeScript("return document.readyState").equals("complete"));
but even while the page is still loading, the above code does not wait.
I know that I can check whether a particular element is visible/clickable etc., but I am looking for a generic solution.
As to whether there is any generic function to check that the page has completely loaded through Selenium, the answer is no.
First, let us have a look at your code trial, which is as follows:
new WebDriverWait(firefoxDriver, pageLoadTimeout).until(webDriver -> ((JavascriptExecutor) webDriver).executeScript("return document.readyState").equals("complete"));
The parameter pageLoadTimeout in the above line of code doesn't really correspond to the actual pageLoadTimeout().
Here you can find a detailed discussion of pageLoadTimeout in Selenium not working
Now, as your use case relates to the page being completely loaded, you can set pageLoadStrategy() to normal [the supported values being none, eager or normal], either through an instance of the DesiredCapabilities class or through a browser options class (FirefoxOptions for Firefox, ChromeOptions for Chrome) as follows:
Using DesiredCapabilities Class :
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;
import org.openqa.selenium.remote.DesiredCapabilities;
public class myDemo
{
public static void main(String[] args) throws Exception
{
System.setProperty("webdriver.gecko.driver", "C:\\Utility\\BrowserDrivers\\geckodriver.exe");
DesiredCapabilities dcap = new DesiredCapabilities();
dcap.setCapability("pageLoadStrategy", "normal");
FirefoxOptions opt = new FirefoxOptions();
opt.merge(dcap);
WebDriver driver = new FirefoxDriver(opt);
driver.get("https://www.google.com/");
System.out.println(driver.getTitle());
driver.quit();
}
}
Using the FirefoxOptions class:
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxOptions;
import org.openqa.selenium.PageLoadStrategy;
public class myDemo
{
public static void main(String[] args) throws Exception
{
System.setProperty("webdriver.gecko.driver", "C:\\Utility\\BrowserDrivers\\geckodriver.exe");
FirefoxOptions opt = new FirefoxOptions();
opt.setPageLoadStrategy(PageLoadStrategy.NORMAL);
WebDriver driver = new FirefoxDriver(opt);
driver.get("https://www.google.com/");
System.out.println(driver.getTitle());
driver.quit();
}
}
You can find a detailed discussion in Page load strategy for Chrome driver (Updated till Selenium v3.12.0)
Now, setting PageLoadStrategy to NORMAL and your code trial both ensure that the browser client (i.e. the web browser) has reached 'document.readyState' equal to "complete". Once this condition is fulfilled, Selenium performs the next line of code.
You can find a detailed discussion in Selenium IE WebDriver only works while debugging
But the browser client reaching 'document.readyState' equal to "complete" still doesn't guarantee that all the JavaScript and Ajax calls have completed.
To wait for all the JavaScript and Ajax calls to complete, you can write a function as follows:
public void WaitForAjax2Complete() throws InterruptedException
{
while (true)
{
if ((Boolean) ((JavascriptExecutor)driver).executeScript("return jQuery.active == 0")){
break;
}
Thread.sleep(100);
}
}
You can find a detailed discussion in Wait for ajax request to complete - selenium webdriver
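For Python users, the two checks discussed so far (document.readyState and jQuery.active) can be combined into one polling helper with an upper time limit. This is only a sketch: wait_until and page_is_idle are names invented here, and the script assumes that a page without jQuery should count as idle. Selenium's own WebDriverWait does the same polling; it is written out by hand so that the loop is visible.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.1):
    """Poll condition() until it is truthy or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False        # gave up: the caller decides what to do next
        time.sleep(poll)


def page_is_idle(driver):
    """True when the document is complete and jQuery, if present, is idle."""
    script = (
        "return document.readyState === 'complete' && "
        "(typeof jQuery === 'undefined' || jQuery.active === 0);"
    )
    return bool(driver.execute_script(script))
```

Usage would be along the lines of wait_until(lambda: page_is_idle(driver), timeout=30). Unlike the while (true) loop above, this returns False after the timeout instead of blocking indefinitely.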
Now, the above two approaches, through PageLoadStrategy and "return jQuery.active == 0", wait for events of indefinite duration. For a definite wait you can use WebDriverWait in conjunction with ExpectedConditions set to the titleContains() method, which ensures that the page title (i.e. the web page) is visible and assumes that all the elements are visible too, as follows:
driver.get("https://www.google.com/");
new WebDriverWait(driver, 10).until(ExpectedConditions.titleContains("partial_title_of_application_under_test"));
System.out.println(driver.getTitle());
driver.quit();
Now, at times the page title will match your application title while the element you want to interact with still hasn't finished loading. A more granular approach is to use WebDriverWait in conjunction with ExpectedConditions set to the visibilityOfElementLocated() method, which makes your program wait for the desired element to be visible, as follows:
driver.get("https://www.google.com/");
WebElement ele = new WebDriverWait(driver, 10).until(ExpectedConditions.visibilityOfElementLocated(By.xpath("xpath_of_the_desired_element")));
System.out.println(ele.getText());
driver.quit();
References
You can find a couple of relevant detailed discussions in:
Selenium IE WebDriver only works while debugging
Selenium how to manage wait for page load?
I use Selenium too and I had the same problem; to fix it I also wait for jQuery to finish loading.
So if you have the same issue, try this as well:
((Long) ((JavascriptExecutor) browser).executeScript("return jQuery.active") == 0);
You can wrap both checks in a method and poll until both the page and jQuery have loaded.
Implement this; it is working for many of us, including me. It waits on JavaScript, Angular and jQuery if they are present on the web page.
If your application contains only JavaScript and jQuery, you can write the code for just those.
Define it in a single method and you can call it anywhere:
// Wait for jQuery to load
{
ExpectedCondition<Boolean> jQueryLoad = driver -> ((Long) ((JavascriptExecutor) driver).executeScript("return jQuery.active") == 0);
boolean jqueryReady = (Boolean) js.executeScript("return jQuery.active==0");
if (!jqueryReady) {
// System.out.println("JQuery is NOT Ready!");
wait.until(jQueryLoad);
}
}
// Wait for ANGULAR to load
{
String angularReadyScript = "return angular.element(document).injector().get('$http').pendingRequests.length === 0";
ExpectedCondition<Boolean> angularLoad = driver -> Boolean.valueOf(((JavascriptExecutor) driver).executeScript(angularReadyScript).toString());
boolean angularReady = Boolean.valueOf(js.executeScript(angularReadyScript).toString());
if (!angularReady) {
// System.out.println("ANGULAR is NOT Ready!");
wait.until(angularLoad);
}
}
// Wait for Javascript to load
{
ExpectedCondition<Boolean> jsLoad = driver -> ((JavascriptExecutor) driver).executeScript("return document.readyState").toString()
.equals("complete");
boolean jsReady = (Boolean) js.executeScript("return document.readyState").toString().equals("complete");
// Wait Javascript until it is Ready!
if (!jsReady) {
// System.out.println("JS in NOT Ready!");
wait.until(jsLoad);
}
}
Let me know if you get stuck anywhere while implementing it.
It avoids the need for Thread.sleep or explicit waits.
public static void waitForPageToLoad(long timeOutInSeconds) {
ExpectedCondition<Boolean> expectation = new ExpectedCondition<Boolean>() {
public Boolean apply(WebDriver driver) {
return ((JavascriptExecutor) driver).executeScript("return document.readyState").equals("complete");
}
};
try {
System.out.println("Waiting for page to load...");
WebDriverWait wait = new WebDriverWait(Driver.getDriver(), timeOutInSeconds);
wait.until(expectation);
} catch (Throwable error) {
System.out.println(
"Timeout waiting for Page Load Request to complete after " + timeOutInSeconds + " seconds");
}
}
Try this method
This works well for me with dynamically rendered websites:
Wait for the complete page to load:
WebDriverWait wait = new WebDriverWait(driver, 50);
wait.until((ExpectedCondition<Boolean>) wd -> ((JavascriptExecutor) wd).executeScript("return document.readyState").equals("complete"));
Then add a second wait on a dummy condition that will always fail, so that it runs for the full timeout:
try {
wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath("//*[contains(text(),'" + "This text will always fail :)" + "')]"))); // condition you are certain won't be true
}
catch (TimeoutException te) {
}
Finally, instead of getting the HTML source, which in most single-page applications would give you a different result, pull the outerHTML of the first html tag:
String script = "return document.getElementsByTagName(\"html\")[0].outerHTML;";
content = ((JavascriptExecutor) driver).executeScript(script).toString();
There is an easy way to do it. When you first request the state via JavaScript, it tells you that the page is complete, but after that it enters the loading state. The first complete state belonged to the initial page!
So my proposal is to check for a complete state that follows a loading state. Here is the code in PHP; it is easily translatable to another language.
$prevStatus = '';
$checkStatus = function ($driver) use (&$prevStatus){
$status = $driver->executeScript("return document.readyState");
if ($prevStatus=='' && $status=='loading'){
//save the previous status and continue waiting
$prevStatus = $status;
return false;
}
if ($prevStatus=='loading' && $status=='complete'){
//loading -> complete, stop waiting, it is finished!
return true;
}
//continue waiting
return false;
};
$this->driver->wait(20, 150)->until($checkStatus);
Checking for an element to be present also works well, but you need to make sure that this element is present only on the destination page.
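A possible Python translation of the PHP check above, as a sketch. wait_for_reload is a name invented here; the 20 second timeout and 150 ms polling interval mirror the wait(20, 150) call. One caveat that follows from the logic: if the new page finishes loading before the first poll, the 'loading' state is never observed and the function returns False after the timeout.

```python
import time


def wait_for_reload(driver, timeout=20.0, poll=0.15):
    """Return True once readyState has gone from 'loading' to 'complete'."""
    deadline = time.monotonic() + timeout
    saw_loading = False
    while time.monotonic() < deadline:
        state = driver.execute_script("return document.readyState")
        if state == "loading":
            # Remember that the new page has started to load.
            saw_loading = True
        elif state == "complete" and saw_loading:
            # loading -> complete: the destination page is ready.
            return True
        time.sleep(poll)
    return False
```

Call it right after the action that triggers the navigation, for example after clicking a link.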
Something like this should work (please excuse the python in a java answer):
idle = driver.execute_async_script("""
window.requestIdleCallback(() => {
arguments[0](true)
})
""")
This should block until the browser's event loop is idle, which usually means that the assets have been loaded.

The html content that I'm trying to scrape only appears to load when I navigate to a certain anchor within the site

I'm trying to scrape a certain value off the following website: https://www.theice.com/productguide/ProductSpec.shtml?specId=6747556#data
Specifically, I'm trying to grab the "last" value from the table at the bottom of the page in the table with class "data default borderless". The issue is that when I search for that object name, nothing appears.
The code I use is as follows:
from bs4 import BeautifulSoup
import urllib2
url = "https://www.theice.com/productguide/ProductSpec.shtml?specId=6747556#data"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
result = soup.findAll(attrs={"class":"data default borderless"})
print result
One issue I noticed is that when I pull the soup for that URL, it strips off the anchor tag and shows me the html for the url: https://www.theice.com/productguide/ProductSpec.shtml?specId=6747556
It was my understanding that anchor tags just navigate you around the page but all the HTML should be there regardless, so I'm wondering if this table somehow doesn't load unless you've navigated to the "data" section of the webpage.
Does anyone know how to force the table to load before I pull the soup? Is there something else I'm doing wrong that prevents me from seeing the table?
Thanks in advance!
The content is dynamically generated via the JS below:
<script type="text/javascript">
var app = {};
app.isOption = false;
app.urls = {
'spec':'/productguide/ProductSpec.shtml?details=&specId=6747556',
'data':'/productguide/ProductSpec.shtml?data=&specId=6747556',
'confirm':'/reports/dealreports/getSampleConfirm.do?hubId=4080&productId=3418',
'reports':'/productguide/ProductSpec.shtml?reports=&specId=6747556',
'expiry':'/productguide/ProductSpec.shtml?expiryDates=&specId=6747556'
};
app.Router = Backbone.Router.extend({
routes:{
"spec":"spec",
"data":"data",
"confirm":"confirm",
"reports":"reports",
"expiry":"expiry"
},
initialize: function(){
_.bindAll(this, "spec");
},
spec:function () {
this.navigate("");
this._loadPage('spec');
},
data:function () {
this._loadPage('data');
},
confirm:function () {
this._loadPage('confirm');
},
reports:function () {
this._loadPage('reports');
},
expiry:function () {
this._loadPage('expiry');
},
_loadPage:function (cssClass, cb) {
$('#right').html('Loading..').load(this._makeUrlUnique(app.urls[cssClass]), cb);
this._updateNav(cssClass);
},
_updateNav:function (cssClass) {
// the left bar gets hidden on margin rates because the tables get smashed up too much
// so ensure they're showing for the other links
$('#left').show();
$('#right').removeClass('wide');
// update the subnav css so the arrow points to the right location
$('#subnav ul li a.' + cssClass).siblings().removeClass('on').end().addClass('on');
},
_makeUrlUnique:function (urlString) {
return urlString + '&_=' + new Date().getTime();
}
});
// init and start the app
$(function () {
window.router = new app.Router();
Backbone.history.start();
});
</script>
Two things you can do:
1. Figure out the real path and variables it uses to pull the data. See this part: 'data':'/productguide/ProductSpec.shtml?data=&specId=6747556'. It passes a variable in the query string and gets the content back.
2. Use the RSS feed they provide and construct your own table.
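The first option could be sketched like this in Python 3 with only the standard library. The path and the "&_=" timestamp suffix come from the script shown above; make_url_unique, TableRows, parse_rows and fetch_rows are names invented for this sketch, and whether the server still answers this URL with a plain HTML table is an assumption.

```python
import time
from html.parser import HTMLParser
from urllib.request import urlopen

BASE = "https://www.theice.com"
DATA_PATH = "/productguide/ProductSpec.shtml?data=&specId=6747556"


def make_url_unique(path, now_ms=None):
    """Mirror of the page's _makeUrlUnique: append a millisecond timestamp."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return BASE + path + "&_=" + str(now_ms)


class TableRows(HTMLParser):
    """Collect the cell text of every table row in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None       # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def parse_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch_rows():
    """Download the fragment that the page itself loads into #right."""
    with urlopen(make_url_unique(DATA_PATH)) as response:
        return parse_rows(response.read().decode("utf-8", "replace"))
```

fetch_rows() returns one list of cell strings per table row; finding the "last" column is then a matter of looking at the header row of whatever the server actually returns.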
The table is generated by JavaScript, and you can't get it without actually loading the page in a browser.
Alternatively, you could use Selenium to load the page and then evaluate the JavaScript and HTML. Selenium will bring up a visible browser window, but you can use PhantomJS, which makes the browser headless.
But yes, you will need to run the actual JS in a browser to get the HTML it generates.
Take a look at this answer also
Good Luck!
The HTML is generated using JavaScript, so BeautifulSoup won't be able to get the HTML for that table (and actually the whole <div id="right" class="main"> is loaded using JavaScript; judging by the page's script, they're using Backbone.js and jQuery's .load()).
You can check this by printing the value of soup.get_text(). You will see that the table is not in the source.
In that case there is no way for you to access the data unless you do exactly what the script does to fetch the data from the server.

Unable to perform click action in selenium python

I'm writing a test script using selenium in python. I have a web-page containing a tree-view object like this:
I want to traverse over the menu to go to the desired directory. Respective HTML code for plus/minus indications is this:
<a onclick="changeTree('tree', 'close.gif', 'open.gif');">
<img id="someid" src="open.gif" />
</a>
The src attribute of the image can be either open.gif or close.gif.
I can detect whether there is a plus or a minus by simply checking the src attribute of the img tag. I can also easily access the parent tag, a, by using .find_element_by_xpath("..").
The problem is that I can't perform the click action on either the img or the a tag.
I've tried webdriver.Actions(driver).move_to_element(el).click().perform(); but it did not work.
I think I should mention that there is no problem in accessing the elements, since I can print all their attributes; I just can't perform actions on them. Any help?
EDIT 1:
Here's the js code for collapsing and expanding the tree:
function changeTree(tree, image1, image2) {
if (!isTreeviewLocked(tree)) {
var image = document.getElementById("treeViewImage" + tree);
if (image.src.indexOf(image1)!=-1) {
image.src = image2;
} else {
image.src = image1;
}
if (document.getElementById("treeView" + tree).innerHTML == "") {
return true;
} else {
changeMenu("treeView" + tree);
return false;
}
} else {
return false;
}
}
EDIT 2:
I Googled for some hours and found out that there is a problem with triggering JavaScript events through the click action from WebDriver. Additionally, I have a span tag in my web page that has an onclick event, and I have the same problem with it.
After some attempts such as .execute_script("changeTree();"), .submit(), etc., I solved the issue by using the ActionChains class. Now I can click on all elements that have JavaScript onclick events. The code that I used is this:
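from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get('someURL')
# find_element_by_id was removed in Selenium 4; use find_element(By.ID, ...)
el = driver.find_element(By.ID, "someid")
webdriver.ActionChains(driver).move_to_element(el).click(el).perform()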
I don't know whether it happened only to me, but I found that I have to find the element right before performing the action; otherwise the script does not perform it. I think it is related to stale elements or something like that. Anyway, thanks to all for your attention.
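The "find the element right before the action" observation can be wrapped in a small helper. This is a sketch with invented names (act_on_fresh_element, action, attempts, retry_on); the default of retrying on any Exception is deliberately broad, and with Selenium you would probably pass StaleElementReferenceException instead.

```python
def act_on_fresh_element(driver, by, value, action, attempts=3, retry_on=(Exception,)):
    """Look the element up immediately before every attempt, then act on it.

    action(element) performs the click (or any other command).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            # A fresh lookup, so the reference cannot be left over
            # from an earlier state of the page.
            element = driver.find_element(by, value)
            action(element)
            return True
        except retry_on as error:
            last_error = error
    raise last_error
```

With the ActionChains click from above, the call would look like act_on_fresh_element(driver, "id", "someid", lambda el: webdriver.ActionChains(driver).move_to_element(el).click(el).perform()), where "id" is the value of selenium's By.ID constant.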
