Why am I unable to download a zip file with selenium? - python

I am trying to use selenium to download a test file from an HTML webpage. Here is the complete HTML page I have been using as a test object:
<!DOCTYPE html>
<html>
<head>
<title>Testpage</title>
</head>
<body>
<div id="page">
Example Download link:
Download this testzip
</div>
</body>
</html>
which I put in the current working directory, along with some example zip file renamed to testzip.zip.
My selenium code looks as follows:
from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.dir", "/tmp")
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("pdfjs.disabled", True)
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/zip")
profile.set_preference("plugin.disable_full_page_plugin_for_types", "application/zip")
browser = webdriver.Firefox(profile)
browser.get('file:///home/path/to/html/testweb.html')
browser.find_element_by_xpath('//a[contains(text(), "Download this testzip")]').click()
However, if I run the test (with nosetests, for example), a browser opens, but after that nothing happens: no error message and no download; it just seems to hang.
Any idea on how to fix this?

You are not setting up a real web server; you just have an HTML page, but no server to serve static files. You need to at least set up a server first.
But if your question is just about downloading files, you can test against any public website instead. It will work.
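Building on that answer, here is a minimal sketch of a local test server, assuming Python 3's standard library (the port number is an arbitrary choice). Serving the working directory over HTTP means the zip is delivered with a real Content-Type header; one plausible explanation for the hang is that a file:// load carries no Content-Type header, so the browser.helperApps.neverAsk.saveToDisk preference for application/zip may never match.
import http.server
import socketserver

PORT = 8000  # arbitrary choice for local testing
handler = http.server.SimpleHTTPRequestHandler  # serves the current working directory
with socketserver.TCPServer(("", PORT), handler) as httpd:
    httpd.serve_forever()
With that running, the test would point at http://localhost:8000/testweb.html instead of the file:// URL.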

Related

How to identify call stack for dynamic loading of JavaScript files in Selenium?

I am using Selenium with ChromeDriver through Python to load a webpage.
If a page loads a JavaScript file a.js which dynamically loads another JavaScript file b.js to the page, how can I find the "call stack" for each respective script tag?
Example:
I have a page test.html which loads a.js
<html>
<head>
<script src="a.js"></script>
</head>
</html>
a.js in turn loads b.js:
// a.js
var s = document.createElement('script');
s.src = 'b.js';
document.head.appendChild(s);
If I load this page through Selenium, both scripts load and execute as expected. This can be seen by running:
from selenium.webdriver import Chrome

with Chrome() as driver:
    driver.get("http://localhost/test.html")
    script_tags = driver.find_elements_by_tag_name("script")
    for s in script_tags:
        print(s.get_attribute("src"))
This gives the output
http://localhost/a.js
http://localhost/b.js
Is there any way to tell how each respective script was loaded to the page?
In the example above, I would like to know that a.js was loaded directly through the page's HTML source and b.js was loaded by a.js.
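One possible direction (a sketch, assuming a reasonably recent chromedriver; older setups use the capability key "loggingPrefs" instead of "goog:loggingPrefs"): Chrome's performance log records a DevTools Network.requestWillBeSent event for every request, and its initiator field distinguishes scripts requested by the HTML parser from scripts requested by other scripts, complete with a JavaScript stack for the latter.
import json
from selenium import webdriver

caps = webdriver.DesiredCapabilities.CHROME.copy()
caps["goog:loggingPrefs"] = {"performance": "ALL"}  # enable the DevTools performance log
driver = webdriver.Chrome(desired_capabilities=caps)
driver.get("http://localhost/test.html")

for entry in driver.get_log("performance"):
    msg = json.loads(entry["message"])["message"]
    if msg["method"] == "Network.requestWillBeSent":
        params = msg["params"]
        if params["request"]["url"].endswith(".js"):
            # initiator["type"] should be "parser" for a.js and "script" for b.js;
            # for "script" initiators, initiator["stack"] holds the call frames
            print(params["request"]["url"], params["initiator"]["type"])
driver.quit()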

Selenium firewall issue "The requested URL was rejected.[...]" [duplicate]

I did several hours of research and asked a bunch of people on Fiverr, none of whom could solve a specific problem I have.
I installed Selenium and tried to access a website. Unfortunately the site rejects the request and doesn't load at all. However, if I try to access the website with my "normal" Chrome browser, it works fine.
I tried several things such as:
Different IPs
Deleting cookies
Incognito mode
Adding different user agents
Hiding features which might reveal that a WebDriver is being used
Nothing helped.
Here is a screenshot of the error I'm receiving (the page says "The requested URL was rejected"), and here is the very simple script I'm using:
# coding: utf8
from selenium import webdriver
url = 'https://registrierung.gmx.net/'
# Open ChromeDriver
driver = webdriver.Chrome()
# Open URL
driver.get(url)
If anyone has a solution for that I would highly appreciate it.
I'm also willing to give a big tip if someone could help me out here.
Thanks a lot!
Stay healthy everyone.
I took your code, modified it with a couple of arguments, and executed the test. Here are the observations:
Code Block:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://registrierung.gmx.net/")
print(driver.page_source)
Console Output:
<html style="" class=" adownload applicationcache blobconstructor blob-constructor borderimage borderradius boxshadow boxsizing canvas canvastext checked classlist contenteditable no-contentsecuritypolicy no-contextmenu cors cssanimations csscalc csscolumns cssfilters cssgradients cssmask csspointerevents cssreflections cssremunit cssresize csstransforms3d csstransforms csstransitions cssvhunit cssvmaxunit cssvminunit cssvwunit dataset details deviceorientation displaytable display-table draganddrop fileinput filereader filesystem flexbox fullscreen geolocation getusermedia hashchange history hsla indexeddb inlinesvg json lastchild localstorage no-mathml mediaqueries meter multiplebgs notification objectfit object-fit opacity pagevisibility performance postmessage progressbar no-regions requestanimationframe raf rgba ruby scriptasync scriptdefer sharedworkers siblinggeneral smil no-strictmode no-stylescoped supports svg svgfilters textshadow no-time no-touchevents typedarrays userselect webaudio webgl websockets websqldatabase webworkers datalistelem video svgasimg datauri no-csshyphens"><head>
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="-1">
<meta http-equiv="CacheControl" content="no-cache">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link rel="shortcut icon" href="data:;base64,iVBORw0KGgo=">
<script type="text/javascript">
(function(){
window["bobcmn"] = "10111111111010200000005200000005200000006200000001249d50ae8200000096200000000200000002300000000300000000300000006/TSPD/300000008TSPD_10130000000cTSPD_101_DID300000005https3000000b0082f871fb6ab200097a0a5b9e04f342a8fdfa6e9e63434256f3f63e9b3885e118fdacf66cc0a382208ea9dc3b70a28002d902f95eb5ac2e5d23ffe409bb24b4c57f9cb8e1a5db4bcad517230d966c75d327f561cc49e16f4300000002TS200000000200000000";
.
.
<script type="text/javascript" src="/TSPD/082f871fb6ab20009afc88ee053e87fea57bf47d9659e73d0ea3c46c77969984660358739f3d19d0?type=11"></script>
<script type="text/javascript">
(function(){
window["blobfp"] = "01010101b00400000100e803000000000d4200623938653464333234383463633839323030356632343563393735363433343663666464633135393536643461353031366131633362353762643466626238663337210068747470733a2f2f72652e73656375726974792e66356161732e636f6d2f72652f0700545350445f3734";window["slobfp"] = "08c3194e510b10009a08af8b7ee6860a22b5726420e697e4";
})();
</script>
<script type="text/javascript" src="/TSPD/082f871fb6ab20009afc88ee053e87fea57bf47d9659e73d0ea3c46c77969984660358739f3d19d0?type=12"></script>
<noscript>Please enable JavaScript to view the page content.<br/>Your support ID is: 11993951574422772310.</noscript>
</head><body>
<style>canvas {display:none;}</style><canvas width="800" height="600"></canvas></body></html>
Browser Snapshot:
Conclusion
From the page source it's quite clear that the Selenium-driven, ChromeDriver-initiated google-chrome browsing context gets detected and the navigation is blocked.
I could have dug deeper and provided some more insights, but surprisingly now even manually I am unable to access the webpage. Possibly my IP is blacklisted now. Once my IP gets whitelisted I will provide more details.
References
You can find a couple of relevant detailed discussions in:
Can a website detect when you are using selenium with chromedriver?
Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
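Following the second discussion above, a commonly cited mitigation (a sketch only; the site's protection may still detect it) is to mask the navigator.webdriver flag through the Chrome DevTools Protocol before any page script runs:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
driver = webdriver.Chrome(options=options)
# inject before any page script runs, so navigator.webdriver reads as undefined
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})
driver.get("https://registrierung.gmx.net/")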

Can not find out the source of data I need when crawling website

I am writing a web crawler in Python. I came across a problem when trying to find the source of the data I need.
The site I am crawling is: https://www.whoscored.com/Regions/252/Tournaments/2/England-Premier-League, and the data I want is as below:
I can find this data by browsing the page source after the page has been fully loaded by Firefox:
DataStore.prime('standings', { stageId:15151, idx:0, field: 'overall'}, [[15151,32,'Manchester United',1,5,4,1,0,16,2,14,13,1,3,3,0,0,10,0,10,9,7,2,1,1,0,6,2,4,4,[[0,1190179,4,0,2,252,'England',2,'Premier League','2017/2018',32,29,'Manchester United','West Ham','Manchester United','West Ham',4,0,'w'] ......
I thought this data would be requested through AJAX, but I detected no such request using the web console.
Then I simulated the browser's behaviour (setting headers and cookies) and requested the HTML page; this is what I got back:
<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=2977d8d74f63d7f8fedbea018b7a1d05">
</script>
<script>
(function() {
var z="";var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B7661722073746174757......";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval('String.fromCharCode('+z+')'));})();
</script></head>
<body>
<iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe>
</body></html>
I created an .html file with the content above and opened it with Firefox, but it seems the script did not execute. Now I don't know what to do; I need some help, thanks!
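Since the data does appear in the rendered page source once Firefox has fully loaded the page, one possible direction (a sketch, untested against this site; the wait time is an arbitrary guess) is to drive a real browser with Selenium and pull the embedded DataStore.prime call out of the rendered source:
import re
import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.whoscored.com/Regions/252/Tournaments/2/England-Premier-League")
time.sleep(10)  # crude wait for the Incapsula check and the page scripts to finish
# the standings are embedded in the rendered page as a DataStore.prime(...) call
match = re.search(r"DataStore\.prime\('standings'.*", driver.page_source)
if match:
    print(match.group(0)[:300])
driver.quit()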

Looking for solution for an automatically scrolling information board

Right now, my company has about 40 information boards scattered throughout the buildings with information relevant to each area. Each unit is a small Linux-based device that is programmed to launch an RDP session, log in with a user name and password, and pull up the appropriate PowerPoint presentation and start playing it. The boards go down every 4 hours for about 5 minutes, copy over a new version of the presentation (if applicable), and restart.
We now have "demands" for live data. Unfortunately I believe PowerPoint will no longer be an option, as we used the PowerPoint Viewer, which does not support plugins. I wanted to use Google Slides, but we also have the restriction that we cannot have a public-facing service like Google Drive, so there goes that idea.
I was thinking of some kind of way to launch a web browser and have it rotate through a list of specified webpages (perhaps stored in a txt or csv file). I found a way to launch Firefox and have it auto-login to OBIEE via Python:
# source: http://obitool.blogspot.com/2012/12/automatic-login-script-for-obiee-11g_12.html
import unittest
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile

# Hardcoding this information is usually not a good idea!
user = ''        # Put your user name here
password = ''    # Put your password here
serverName = ''  # Put the host name here

class OBIEE11G(unittest.TestCase):
    def setUp(self):
        # Create a new profile
        self.fp = webdriver.FirefoxProfile()
        self.fp.set_preference("browser.download.folderList", 2)
        self.fp.set_preference("browser.download.manager.showWhenStarting", False)
        # Associate the profile with the Firefox selenium session
        self.driver = webdriver.Firefox(firefox_profile=self.fp)
        self.driver.implicitly_wait(2)
        # Build the Analytics url and save it for the future
        self.base_url = "http://" + serverName + ":9704/analytics"

    def login(self):
        # Retrieve the driver variable created in setUp
        driver = self.driver
        # Go to the login page
        driver.get(self.base_url + "/")
        # The 11G login page has the following elements on it
        driver.find_element_by_id("sawlogonuser").clear()
        driver.find_element_by_id("sawlogonuser").send_keys(user)
        driver.find_element_by_id("sawlogonpwd").clear()
        driver.find_element_by_id("sawlogonpwd").send_keys(password)
        driver.find_element_by_id("idlogon").click()

    def test_OBIEE11G(self):
        self.login()

if __name__ == "__main__":
    unittest.main()
If I can use this, I would just need a way to rotate to a new webpage every 30 seconds. Any ideas / recommendations?
You could put a simple JavaScript snippet on each page that waits a specified time and then redirects to the next page. This has the advantage of simple implementation; however, it may be annoying to maintain over many HTML files.
The other option is to write your copy in a markdown file, then have a single HTML page that rotates through a list of files, rendering and displaying the markdown. You would then update the data by rewriting the markdown files. It wouldn't be exactly live, but if 30-second resolution is OK you can get away with it. Something like this for the client code:
HTML
<!DOCTYPE html>
<html lang="en">
<head>
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Message Board</title>
<link rel="stylesheet" type="text/css" href="/css/style.css">
</head>
<body>
<div id="content" class="container"></div>
<!-- jQuery -->
<script src="//code.jquery.com/jquery-2.1.4.min.js" defer></script>
<!-- markdown compiler https://github.com/chjj/marked/ -->
<script src="/js/marked.min.js" defer></script>
<script src="/js/main.js" defer></script>
</body>
</html>
And the javascript
// main.js
// the markdown files
var sites = ["http://mysites.com/site1.md", "http://mysites.com/site2.md", "http://mysites.com/site3.md", "http://mysites.com/site4.md"]
function render(sites) {
window.scrollTo(0,0);
//get first element and rotate list of sites
var file = sites.shift();
sites.push(file);
$.ajax({
url:file,
success: function(data) { $("#content").html(marked(data)); },
cache: false
});
setTimeout(function() { render(sites); }, 30000); // pass a function, not the result of calling render
}
// start the loop
render(sites);
You can then use any method you would like to write out the markdown files.
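Alternatively, staying with the Selenium/Firefox approach from the question, the rotation itself is only a few lines of Python. A minimal sketch (the URL list is a placeholder; it could equally be read from the txt or csv file mentioned above):
import time
from selenium import webdriver

pages = [
    # hypothetical dashboard URLs; substitute your own list
    "http://serverName:9704/analytics/dashboard1",
    "http://serverName:9704/analytics/dashboard2",
]

driver = webdriver.Firefox()
while True:
    for url in pages:
        driver.get(url)
        time.sleep(30)  # show each page for 30 seconds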

Cannot fetch a web site with python urllib.urlopen() or any web browser other than Shiretoko

Here is the URL of the site I want to fetch
https://salami.parc.com/spartag/GetRepository?friend=jmankoff&keywords=antibiotic&option=jmankoff%27s+tags
When I fetch the site with the following code and display the contents:
import urllib
from BeautifulSoup import BeautifulSoup

sock = urllib.urlopen("https://salami.parc.com/spartag/GetRepository?friend=jmankoff&keywords=antibiotic&option=jmankoff's+tags")
html = sock.read()
sock.close()
soup = BeautifulSoup(html)
print soup.prettify()
I get the following output:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>
Error message
</title>
</head>
<body>
<h2>
Invalid input data
</h2>
</body>
</html>
I get the same result with urllib2 as well. Interestingly, this URL works only in the Shiretoko web browser, v3.5.7 (when I say it works, I mean that it brings me the right page). When I feed this URL into Firefox 3.0.15 or Konqueror v4.2.2, I get exactly the same error page (with "Invalid input data"). I have no idea what causes this difference or how I can fetch this page using Python. Any ideas?
Thanks
If you look at the urllib2 docs, they say:
urllib2.build_opener([handler, ...])
.....
If the Python installation has SSL support (i.e., if the ssl module can be imported), HTTPSHandler will also be added.
.....
You can try using urllib2 together with the ssl module. Alternatively, you can use httplib.
That's exactly what you get when you click on the link with a web browser. Maybe you are supposed to be logged in or have a cookie set, or something.
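If a missing cookie or header is indeed the cause, a quick experiment (a sketch in Python 2, matching the question's code; untested against this server) is an opener that keeps cookies and sends a browser-like User-Agent:
import urllib2
import cookielib

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]  # pretend to be a browser
response = opener.open("https://salami.parc.com/spartag/GetRepository?"
                       "friend=jmankoff&keywords=antibiotic&option=jmankoff%27s+tags")
print response.read()[:200]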
I get the same message with Firefox 3.5.8 (Shiretoko) on Linux.
