I am trying to test a webpage's behaviour in response to requests from different referrers. This is what I am doing so far:
webdriver.DesiredCapabilities.PHANTOMJS['phantomjs.page.customHeaders.referer'] = referer
The problem is that the webpage makes AJAX requests which change parts of the HTML, and those AJAX requests should have the webpage itself as the referer, not the referer I gave at the start. It seems that the referer is set once at the start, and every subsequent request, be it AJAX, image, or anchor, carries that same referer no matter how deep you browse. Is there a way to set the referer only for the first request and have it behave dynamically for the rest?
After some searching I found this, and I tried to achieve it through Selenium, but I have not had any success with it yet:
webdriver.DesiredCapabilities.PHANTOMJS['phantomjs.page.onInitialized'] = """function() {page.customHeaders = {};};"""
Any ideas?
From what I can tell you would need to patch PhantomJS to achieve this.
PhantomJS contains a module called GhostDriver which provides the HTTP API that WebDriver uses to communicate with the PhantomJS instance. So anything you want to do via WebDriver needs to be supported by GhostDriver, but it doesn't seem that onInitialized is supported by GhostDriver.
If you're feeling adventurous you could clone the PhantomJS repository and patch the src/ghostdriver/session.js file to do what you want.
The _init method looks like this:
_init = function() {
    var page;
    // Ensure a Current Window is available, if it's found to be `null`
    if (_currentWindowHandle === null) {
        // Create the first Window/Page
        page = require("webpage").create();
        // Decorate it with listeners and helpers
        page = _decorateNewWindow(page);
        // set session-specific CookieJar
        page.cookieJar = _cookieJar;
        // Make the new Window, the Current Window
        _currentWindowHandle = page.windowHandle;
        // Store by WindowHandle
        _windows[_currentWindowHandle] = page;
    }
},
You could try using the code you found:
page.onInitialized = function() {
    page.customHeaders = {};
};
on the page object created there.
Depending on what you are testing, though, you might be able to save a lot of effort by ditching the browser and testing the HTTP requests directly with something like the requests module.
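For example, a minimal sketch of that approach (the URLs and the referrer value are placeholders, not taken from the question):

import requests

# Hypothetical target page and external referrer, just to illustrate the idea:
# the first request carries the external referrer, any follow-up request
# sends the page itself as referrer.
landing_url = "http://example.com/landing"
external_referer = "http://referrer.example.org/"

with requests.Session() as session:
    first = session.get(landing_url, headers={"Referer": external_referer})

    # A subsequent (AJAX-like) request can use the landing page as its referrer.
    ajax = session.get("http://example.com/api/data",
                       headers={"Referer": landing_url})

    print(first.status_code, ajax.status_code)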
Related
I just tried to do URL requests with Selenium like this:
driver.get("example.com/")
but I am confused about how to do the same request with parameters, like I do with requests:
params = {
    'name': 'john-doe',
    'shop_id': '121323233443',
}

requests.get('example.com/', params=params)
requests is a backend (HTTP) library and Selenium is a front-end automation library, so you cannot explicitly send plain HTTP calls using Selenium. A workaround is to use Selenium's JavaScript executor to make HTTP/AJAX calls with JavaScript inside the browser. This is not recommended, as it is over-engineering unless your use case specifically requires it:
https://stackoverflow.com/a/5665291/6793637
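If all you need is a plain GET with query parameters (rather than a real AJAX call made by the page), a simpler alternative is to encode the parameters into the URL yourself and load it with driver.get. A minimal sketch, reusing the parameters from the question (the base URL and the existing driver instance are assumptions):

from urllib.parse import urlencode

params = {
    'name': 'john-doe',
    'shop_id': '121323233443',
}

# Builds "https://example.com/?name=john-doe&shop_id=121323233443"
url = "https://example.com/?" + urlencode(params)
driver.get(url)  # assumes `driver` is an already-created Selenium WebDriver instance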
I am scraping data from a site with a paginated table (max 500 results, 25 results per page). When I use Chrome to "view source" I can see all 500 results; however, once the JS renders in Selenium, only 25 results show when using driver.page_source.
I have tried passing the cookies and headers off to requests, but that's not reliable and I need to stick with Selenium. I have also made a janky solution of clicking through the paginator's next button, but there must be a better way!
So how does one capture the full page source prior to JS rendering using selenium with the python bindings?
There might be a simpler way but it turns out you can do all kinds of asynchronous things from the browser including fetch:
def fetch(url):
    return driver.execute_async_script("""
        (async () => {
            let r = await fetch('""" + url + """')
            arguments[0](await r.text())
        })()
    """)

html = fetch('https://stackoverflow.com/')
Same-origin policy will apply.
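A variation on the same idea (my own sketch, not part of the original answer) passes the URL in as a script argument instead of concatenating it into the JavaScript string:

def fetch(url):
    # execute_async_script appends a callback as the last argument,
    # so arguments[0] is the url and arguments[1] is the callback.
    return driver.execute_async_script("""
        const [url, done] = arguments;
        (async () => {
            const r = await fetch(url);
            done(await r.text());
        })();
    """, url)

html = fetch('https://stackoverflow.com/')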
import requests
from bs4 import BeautifulSoup

a = requests.Session()
soup = BeautifulSoup(a.get("https://www.facebook.com/").content)
payload = {
    "lsd": soup.find("input", {"name": "lsd"})["value"],
    "email": "my_email",
    "pass": "my_password",
    "persistent": "1",
    "default_persistent": "1",
    "timezone": "300",
    "lgnrnd": soup.find("input", {"name": "lgnrnd"})["value"],
    "lgndim": soup.find("input", {"name": "lgndim"})["value"],
    "lgnjs": soup.find("input", {"name": "lgnjs"})["value"],
    "locale": "en_US",
    "qsstamp": soup.find("input", {"name": "qsstamp"})["value"]
}
soup = BeautifulSoup(a.post("https://www.facebook.com/", data=payload).content)
print([i.text for i in soup.find_all("a")])
I'm playing around with requests and have read several threads here on SO about it, so I decided to try it out myself.
I am stumped by this line: "qsstamp":soup.find("input",{"name":"qsstamp"})["value"]
because it returns empty and therefore causes an error.
However, looking at Chrome developer tools, this "qsstamp" is populated. What am I missing here?
The payload is everything shown in the form data in Chrome dev tools, so what is going on?
Using Firebug and searching for qsstamp gives matched results that direct to: Here
You can see: j.createHiddenInputs({qsstamp:u},v)
That means qsstamp is dynamically generated by JavaScript.
requests will not run JavaScript (all it does is fetch that page's HTML). You may want to use something like dryscrape or an emulated browser like Selenium.
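As a rough sketch of the Selenium route (the qsstamp field name comes from the question; the Firefox driver, Selenium 4 syntax, and selector are assumptions):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # assumes geckodriver is available
driver.get("https://www.facebook.com/")

# Once the page's JavaScript has run, the hidden input should be populated.
qsstamp = driver.find_element(By.CSS_SELECTOR, 'input[name="qsstamp"]').get_attribute("value")
print(qsstamp)
driver.quit()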
I need to log into a website to access its html on a login-protected page for a project I'm doing.
I'm using this person's answer with the values I need:
from twill.commands import *
go('https://example.com/login')
fv("3", "email", "myemail#example.com")
fv("3", "password", "mypassword")
submit()
Presumably this should log me in, so I then run:
sock = urllib.urlopen("https://www.example.com/activities")
html_source = sock.read()
sock.close()
print html_source
I thought this would print the HTML of the (now) accessible page, but instead it just gives me the HTML of the login page. I've tried other methods (e.g. with mechanize), but I get an identical result.
What am I missing? Do some sites restrict this type of login or does it not work with https or something? (The site is FitBit, since I couldn't use the url in the question)
You're using one library to log in and another to then retrieve the subsequent page. twill and urllib are not sharing data about your sessions. (Similar issue to this one.) If you do that, then you need to manage the session cookie / authentication yourself. Specifically, you'll need to copy the cookie + data and add that to the post-login request in the other library.
Otherwise, and more logically, use the same one for both the login and post-login requests.
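For example, a rough sketch of doing both steps with a single requests.Session (the URLs and form field names are placeholders; the real ones depend on the site's login form):

import requests

login_url = "https://example.com/login"           # placeholder
protected_url = "https://example.com/activities"  # placeholder

with requests.Session() as session:
    # Placeholder form field names; inspect the real login form to find them.
    session.post(login_url, data={"email": "myemail@example.com",
                                  "password": "mypassword"})

    # The session keeps the login cookies, so this request is authenticated.
    html_source = session.get(protected_url).text
    print(html_source)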
I'm aware of a Python API for sale here (http://oktaykilic.com/my-projects/google-alerts-api-python/), but I'd like to understand why the way I'm doing it now isn't working.
Here is what I have so far:
class GAlerts():
    def __init__(self, uName='USERNAME', passWord='PASSWORD'):
        self.uName = uName
        self.passWord = passWord

    def addAlert(self):
        self.cj = mechanize.CookieJar()
        loginURL = 'https://www.google.com/accounts/ServiceLogin?hl=en&service=alerts&continue=http://www.google.com/alerts'
        alertsURL = 'http://www.google.com/alerts'
        # log into google
        initialRequest = mechanize.Request(loginURL)
        response = mechanize.urlopen(initialRequest)
        # put in form info
        forms = ClientForm.ParseResponse(response, backwards_compat=False)
        forms[0]['Email'] = self.uName
        forms[0]['Passwd'] = self.passWord
        # click form and get cookies
        request2 = forms[0].click()
        response2 = mechanize.urlopen(request2)
        self.cj.extract_cookies(response, initialRequest)
        # now go to alerts page with cookies
        request3 = mechanize.Request(alertsURL)
        self.cj.add_cookie_header(request3)
        response3 = mechanize.urlopen(request3)
        # parse forms on this page
        formsAdd = ClientForm.ParseResponse(response3, backwards_compat=False)
        formsAdd[0]['q'] = 'Hines Ward'
        # click it and submit
        request4 = formsAdd[0].click()
        self.cj.add_cookie_header(request4)
        response4 = mechanize.urlopen(request4)
        print response4.read()

myAlerter = GAlerts()
myAlerter.addAlert()
As far as I can tell, it successfully logs in and gets to the add-alerts homepage, but when I enter a query and "click" submit, it sends me to a page that says "Please enter a valid e-mail address". Is there some kind of authentication I'm missing? I also don't understand how to change the values in Google's custom drop-down menus. Any ideas?
Thanks
The custom drop-down menus are done using JavaScript, so the proper solution would be to figure out the URL parameters and then try to reproduce them (this might be the reason it doesn't work as expected right now: you are omitting required URL parameters that are normally set by JavaScript when you visit the site in a browser).
The lazy solution is to use the galerts library; it looks like it does exactly what you need.
A few hints for future projects involving mechanize (or screen-scraping in general):
Use Fiddler, an extremely useful HTTP debugging tool. It captures HTTP traffic from most browsers and allows you to see what exactly your browser requests. You can then craft the desired request manually and in case it doesn't work, you just have to compare. Tools like Firebug or Google Chrome's developer tools come in handy too, especially for lots of async requests. (you have to call set_proxies on your browser object to use it with Fiddler, see documentation)
For debugging purposes, do something like for f in self.forms(): print f. This shows you all the forms mechanize recognized on a page, along with their names.
Handling cookies is repetitive, so - surprise! - there's an easy way to automate it. Just do this in your browser class constructor: self.set_cookiejar(cookielib.CookieJar()). This keeps track of cookies automatically.
I relied for a long time on custom parsers like BeautifulSoup (and I still use it for some special cases), but in most cases the fastest approach to web screen scraping is using XPath (for example, lxml has a very good implementation).
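Combining the second and third hints, a minimal sketch of what that can look like in a mechanize.Browser subclass (Python 2, matching the mechanize/cookielib code in this thread; the class name and the debug_forms helper are my own placeholders):

import cookielib
import mechanize

class AlertsBrowser(mechanize.Browser):
    def __init__(self):
        mechanize.Browser.__init__(self)
        # Hint 3: track cookies automatically across requests.
        self.set_cookiejar(cookielib.CookieJar())
        # Uncomment to route traffic through Fiddler (hint 1):
        # self.set_proxies({"http": "127.0.0.1:8888", "https": "127.0.0.1:8888"})

    def debug_forms(self, url):
        # Hint 2: print every form mechanize recognized on the page.
        self.open(url)
        for f in self.forms():
            print f

browser = AlertsBrowser()
browser.debug_forms("http://www.google.com/alerts")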
Mechanize doesn't handle JavaScript, and those drop-down menus are JS. If you want to do automation where JavaScript is involved, I suggest using Selenium, which also has Python bindings.
http://seleniumhq.org/
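A bare-bones sketch of driving a page with the Selenium Python bindings (the locators are placeholders; the real custom drop-down would need its own selectors):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # assumes a driver binary is available
driver.get("http://www.google.com/alerts")

# Placeholder locators: click whatever element opens the custom drop-down,
# then pick an option the same way. Because this runs in a real browser,
# the page's JavaScript executes normally.
driver.find_element(By.CSS_SELECTOR, ".some-dropdown-toggle").click()
driver.find_element(By.CSS_SELECTOR, ".some-dropdown-option").click()

driver.quit()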