So, I started out with Mechanize, and apparently the first thing I tried it on is a site with monkey-rhino-level heavy JavaScript navigation.
Now the thing I'm stuck on is submitting the form.
Normally I'd do a submit using the Mechanize built-in submit() function.
import mechanize
browser = mechanize.Browser()
browser.select_form(name = 'foo')
browser.form['bar'] = 'baz'
browser.submit()
This way it'd use the submit button that's available in the HTML form.
However, the site I'm stuck on had to be one that doesn't use HTML submit buttons... No, they're trying to be JavaScript gurus, and do a submit via JavaScript.
The usual submit() doesn't seem to work with this.
So... Is there a way to get around this?
Any help is appreciated. Many thanks!
--[Edit]--
The JavaScript function I'm stuck on:
function foo(bar, baz) {
    var qux = document.forms["qux"];
    qux.bar.value = bar.split("$").join(":");
    qux.baz.value = baz;
    qux.submit();
}
What I did in Python (and what doesn't work):
def foo(browser, bar, baz):
    qux = browser.select_form("qux")
    browser.form[bar] = ":".join(bar.split("$"))
    browser.form[baz] = baz
    browser.submit()
Three ways:
The first method is preferable if the form is submitted with a plain POST or GET request; otherwise you'll have to resort to the second or third method.
Submit the form manually in a browser and inspect the resulting POST/GET request: its parameters and the URL the form is posted to. Popular tools for inspecting requests are the Live HTTP Headers and Firebug extensions for Firefox, and the Developer Tools in Chrome. An example of using the POST/GET method:
import mechanize
import urllib
browser = mechanize.Browser()
# These are the parameters you got from checking with the aforementioned tools
parameters = {'parameter1': 'your content',
              'parameter2': 'a constant value',
              'parameter3': 'unique characters you might need to extract from the page'}
# Encode the parameters (Python 2; in Python 3 this function is urllib.parse.urlencode)
data = urllib.urlencode(parameters)
# Submit the form (POST request). You get post_url and the request type (POST/GET) the same way as the parameters.
browser.open(post_url, data)
# Submit the form (GET request): the encoded parameters go into the query string after a '?'
browser.open(post_url + '?' + data)
Rewrite the javascript and execute it in Python. Check out spidermonkey.
Emulate a full browser. Check out Selenium and Windmill.
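For the foo() function in the question, the first method might look like the sketch below. It is written for Python 3 with only the standard library, so it builds the request without Mechanize and without sending anything. The URL is a placeholder and build_foo_request is a hypothetical helper name; the real URL is whatever the qux form posts to, and the real form may carry extra hidden fields that also have to be sent. Note that the field names are the literal strings 'bar' and 'baz', not the values passed in.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_foo_request(post_url, bar, baz):
    """Mirror the JavaScript foo(): fix up 'bar', set 'baz', then submit."""
    fields = {
        "bar": bar.replace("$", ":"),  # same effect as bar.split("$").join(":")
        "baz": baz,
    }
    body = urlencode(fields).encode("ascii")
    # A Request that carries data is sent as a POST by urllib
    return Request(post_url, data=body)


# Placeholder URL (assumption): use the action URL of the real 'qux' form
req = build_foo_request("http://example.com/qux-action", "a$b$c", "42")
print(req.get_method())
print(req.data)
```

Passing the request to urllib.request.urlopen(req) would actually submit it; with Mechanize, browser.open(post_url, body) plays the same role.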
Related
I'm not sure if such a thing is possible, but I am trying to submit to a form such as https://lambdaschool.com/contact using a POST request.
I currently have the following:
import requests
payload = {"name": "MyName", "lastname": "MyLast", "email": "someemail@gmail.com", "message": "My message"}
r = requests.post('http://lambdaschool.com/contact',params=payload)
print(r.text)
But I get the following error:
<title>405 Method Not Allowed</title>
etc.
Is such a thing possible to submit using a POST request?
If it were that simple, you'd see a lot of bots attacking every login form ever.
That URL evidently doesn't accept POST requests, and the submit button isn't necessarily POSTing to that page anyway (though clicking the button also gives that same error...)
You need to open the Chrome / Firefox dev tools, watch the request that is sent when the form is submitted, and replicate that data in Python.
Another option would be the mechanize or Selenium WebDriver libraries, which simulate a browser and fill out the form.
params is for query parameters. You either want data, for a form encoded body, or json, for a JSON body.
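To see the difference without touching the network, here is a rough equivalent built with the standard library instead of requests. It is only an illustration of where the payload ends up in each case; the URL and the payload values are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

payload = {"name": "MyName", "email": "someemail@example.com"}
url = "http://example.com/contact-form"  # placeholder URL (assumption)

# Roughly what params= does: the payload goes into the URL, the body stays empty
as_query = Request(url + "?" + urlencode(payload), method="POST")

# Roughly what data= does: the payload is form-encoded into the request body
as_form = Request(url, data=urlencode(payload).encode("ascii"))

# Roughly what json= does: the payload is serialised as JSON into the body
as_json = Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

for req in (as_query, as_form, as_json):
    print(req.get_method(), req.full_url, req.data)
```

A server that expects a form submission reads the body, so the first variant looks to it like an empty POST.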
I think the url should be 'http://lambdaschool.com/contact-form'.
I'm currently trying to access a website with Python and I'm having some trouble using the requests and mechanize modules. The way I do this task manually is: I load the website portal and log on, then click a button, fill out a form, and submit it to receive a download. I've gotten to the log-on stage and am having trouble submitting my username and password. I am currently using this method:
import requests
payload = {"username":"user","password":"pass"}
r = requests.post("portal login address",data=payload)
response = r.content
print(response)
but this gives me the exact same output as a GET request where I don't include the payload. I am also wondering how I can simulate these button clicks and form submissions. I know mechanize can be used, but I'm unclear as to how.
You can use the mechanize module, in this way:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.open("<page>")
# You can select the form by name or by some other means;
# nr=0 (the first form on the page) is used here just as an example
br.select_form(nr=0)
br.form["username"] = "your_username"
br.form["password"] = "your_password"
# Submit through the browser object, so cookies and redirects are handled
response = br.submit()
Have a look at Mechanize
I need to log into a website to access its html on a login-protected page for a project I'm doing.
I'm using this person's answer with the values I need:
from twill.commands import *
go('https://example.com/login')
fv("3", "email", "myemail@example.com")
fv("3", "password", "mypassword")
submit()
Presumably this should log me in, so I then run:
sock = urllib.urlopen("https://www.example.com/activities")
html_source = sock.read()
sock.close()
print html_source
Which I thought would print the html of the (now) accessible page but instead just gives me the html of the login page. I've tried other methods (e.g. with mechanize) but I get the identical result.
What am I missing? Do some sites restrict this type of login or does it not work with https or something? (The site is FitBit, since I couldn't use the url in the question)
You're using one library to log in and another to retrieve the subsequent page. twill and urllib do not share any data about your session. If you mix them, you need to manage the session cookie / authentication yourself: specifically, you'll need to copy the cookie data from the login response and add it to the post-login request in the other library.
Otherwise, and more logically, use the same library for both the login and post-login requests.
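A small self-contained demonstration of the point, in Python 3 with only the standard library. It is a sketch, not twill or the real site: DemoSite is a made-up local server that sets a cookie on login and only shows the activities page when that cookie comes back. The same opener (with its cookie jar) is used for both requests, while a second, cookie-less client stands in for the separate urllib call.

```python
import http.cookiejar
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoSite(BaseHTTPRequestHandler):
    """Made-up site: POST /login sets a cookie, GET pages require it."""

    def _reply(self, body, cookie=None):
        payload = body.encode("utf-8")
        self.send_response(200)
        if cookie:
            self.send_header("Set-Cookie", cookie)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # consume the submitted form body
        self._reply("logged in", cookie="session=abc123; Path=/")

    def do_GET(self):
        if "session=abc123" in self.headers.get("Cookie", ""):
            self._reply("activities page")
        else:
            self._reply("login page")

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoSite)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

# ProxyHandler({}) only keeps the demo away from any system proxy settings
no_proxy = urllib.request.ProxyHandler({})

# One opener, one cookie jar: used for the login AND the follow-up request
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    no_proxy, urllib.request.HTTPCookieProcessor(jar)
)
form = urllib.parse.urlencode(
    {"email": "myemail@example.com", "password": "mypassword"}
).encode("ascii")
opener.open(base + "/login", form).read()
with_session = opener.open(base + "/activities").read().decode("utf-8")

# A separate client that never saw the login response has no cookie to send
other_client = urllib.request.build_opener(no_proxy)
without_session = other_client.open(base + "/activities").read().decode("utf-8")

server.shutdown()
print(with_session)
print(without_session)
```

The second client gets the login page back, which is the behaviour described in the question.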
I'm trying to do something very simple using Python's Mechanize library. I want to go to http://careers.force.com/jobs/ts2__JobSearch, select Dublin Ireland from the drop-down list, and then hit enter.
I've written a very short Python script for this, but for some reason when I run it, it returns the HTML for the default search page rather than the search page that is produced after selecting the location (Dublin Ireland) and hitting enter. I have no idea what is going wrong:
import mechanize
link = "http://careers.force.com/jobs/ts2__JobSearch"
br = mechanize.Browser()
br.open(link)
br.select_form('j_id0:j_id1:atsForm' )
br.form['j_id0:j_id1:atsForm:j_id38:1:searchCtrl'] = ["Ireland - Dublin"]
response = br.submit()
newsite = response.read()
This is in case you're still having this problem or if not, in case anyone else is having this problem in the future....
I looked at the POST data that your browser sends when you manually select something, and wrote a function for you that gets you to the page you want by manually performing a POST with data encoded by urllib.urlencode. Cheers.
import re
import mechanize, cookielib, urllib
def get_search(html, controls):
    # viewstate: take the text between 'ViewState" value="' and the next quote
    s = re.search('ViewState" value="', html).span()[1]
    e = re.search('"', html[s:]).span()[0] + s
    state = html[s:e]
    # viewstateversion
    s = re.search('ViewStateVersion', html).span()[1]
    s = s + re.search('value="', html[s:]).span()[1]
    e = re.search('"', html[s:]).span()[0] + s
    version = html[s:e]
    # viewstatemac
    s = re.search('ViewStateMAC', html).span()[1]
    s = s + re.search('value="', html[s:]).span()[1]
    e = re.search('"', html[s:]).span()[0] + s
    mac = html[s:e]
    # the POST fields the browser sends for a "Ireland - Dublin" search
    return {controls[0]: controls[0],
            controls[1]: '',
            controls[2]: 'Ireland - Dublin',
            controls[3]: 'Search',
            'com.salesforce.visualforce.ViewState': state,
            'com.salesforce.visualforce.ViewStateVersion': version,
            'com.salesforce.visualforce.ViewStateMAC': mac}
#Define variables and create a mechanize browser
link = "http://careers.force.com/jobs/ts2__JobSearch"
br = mechanize.Browser()
cj=cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.open(link)
#get the html data
html=br.response().read()
#get the control names from the correct form
br.select_form(nr=1)
controls=[control.name for control in br.form.controls]
#run function with html and control names list as parameters and run urllib.urlencode on what gets returned
postdata=urllib.urlencode(get_search(br.response().read(), controls))
#go to the webpage again but this time also submit the encoded data
br.open(link, postdata)
#There Ya Go
print br.response().read()
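As an alternative to slicing the HTML by hand, the hidden fields could be collected with an HTML parser. The sketch below is Python 3 with the standard library only; HiddenFieldParser is a hypothetical helper, and the sample HTML is invented for illustration (the real page's markup and values will differ).

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


# Invented sample markup (assumption), shaped like the fields used above
sample_html = """
<form id="j_id0:j_id1:atsForm" method="post">
  <input type="text" name="keywords" value="" />
  <input type="hidden" name="com.salesforce.visualforce.ViewState" value="STATE" />
  <input type="hidden" name="com.salesforce.visualforce.ViewStateVersion" value="VERSION" />
  <input type="hidden" name="com.salesforce.visualforce.ViewStateMAC" value="MAC" />
</form>
"""

parser = HiddenFieldParser()
parser.feed(sample_html)
hidden = parser.fields
print(hidden)
```

The resulting dictionary can be merged with the visible search fields before it is URL-encoded and posted.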
I'm aware of a Python API for sale here (http://oktaykilic.com/my-projects/google-alerts-api-python/), but I'd like to understand why the way I'm doing it now isn't working.
Here is what I have so far:
import mechanize
import ClientForm
class GAlerts():
    def __init__(self, uName = 'USERNAME', passWord = 'PASSWORD'):
        self.uName = uName
        self.passWord = passWord
    def addAlert(self):
        self.cj = mechanize.CookieJar()
        loginURL = 'https://www.google.com/accounts/ServiceLogin?hl=en&service=alerts&continue=http://www.google.com/alerts'
        alertsURL = 'http://www.google.com/alerts'
        #log into google
        initialRequest = mechanize.Request(loginURL)
        response = mechanize.urlopen(initialRequest)
        #put in form info
        forms = ClientForm.ParseResponse(response, backwards_compat=False)
        forms[0]['Email'] = self.uName
        forms[0]['Passwd'] = self.passWord
        #click form and get cookies
        request2 = forms[0].click()
        response2 = mechanize.urlopen(request2)
        self.cj.extract_cookies(response, initialRequest)
        #now go to alerts page with cookies
        request3 = mechanize.Request(alertsURL)
        self.cj.add_cookie_header(request3)
        response3 = mechanize.urlopen(request3)
        #parse forms on this page
        formsAdd = ClientForm.ParseResponse(response3, backwards_compat=False)
        formsAdd[0]['q'] = 'Hines Ward'
        #click it and submit
        request4 = formsAdd[0].click()
        self.cj.add_cookie_header(request4)
        response4 = mechanize.urlopen(request4)
        print response4.read()
myAlerter = GAlerts()
myAlerter.addAlert()
As far as I can tell, it successfully logs in and gets to the adding alerts homepage, but when I enter a query and "click" submit it sends me to a page that says "Please enter a valid e-mail address". Is there some kind of authentication I'm missing? I also don't understand how to change the values on google's custom drop-down menus? Any ideas?
Thanks
The custom drop-down menus are done using JavaScript, so the proper solution would be to figure out the URL parameters and then try to reproduce them. This might be the reason it doesn't work as expected right now: you are omitting required URL parameters that are normally set by JavaScript when you visit the site in a browser.
The lazy solution is to use the galerts library, it looks like it does exactly what you need.
A few hints for future projects involving mechanize (or screen-scraping in general):
Use Fiddler, an extremely useful HTTP debugging tool. It captures HTTP traffic from most browsers and allows you to see what exactly your browser requests. You can then craft the desired request manually and in case it doesn't work, you just have to compare. Tools like Firebug or Google Chrome's developer tools come in handy too, especially for lots of async requests. (you have to call set_proxies on your browser object to use it with Fiddler, see documentation)
For debugging purposes, do something like for f in self.forms(): print f. This shows you all forms mechanize recognized on a page, along with their name.
Handling cookies is repetitive, so - surprise! - there's an easy way to automate it. Just do this in your browser class constructor: self.set_cookiejar(cookielib.CookieJar()). This keeps track of cookies automatically.
I relied for a long time on custom parsers like BeautifulSoup (and I still use it for some special cases), but in most cases the fastest approach to screen scraping is XPath (for example, lxml has a very good implementation).
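The "compare" step from the Fiddler hint can be partly automated. The sketch below (Python 3, standard library) diffs the form fields of a request captured from the browser against the ones your script sends. diff_form_bodies is a hypothetical helper, and both sample bodies are invented for illustration; they are not the real Google Alerts parameters.

```python
from urllib.parse import parse_qs


def diff_form_bodies(browser_body, script_body):
    """Return (missing, extra, changed) field names, relative to the browser."""
    browser = parse_qs(browser_body, keep_blank_values=True)
    script = parse_qs(script_body, keep_blank_values=True)
    missing = sorted(set(browser) - set(script))  # sent by browser only
    extra = sorted(set(script) - set(browser))    # sent by script only
    changed = sorted(
        name for name in set(browser) & set(script)
        if browser[name] != script[name]
    )
    return missing, extra, changed


# Invented example bodies (assumption): copy the real ones from Fiddler
captured = "q=Hines+Ward&t=7&f=1&e=user%40example.com"
sent = "q=Hines+Ward"

missing, extra, changed = diff_form_bodies(captured, sent)
print("missing from script:", missing)
print("only in script:", extra)
print("different values:", changed)
```

Fields that show up as missing are the first candidates for parameters that JavaScript sets in a real browser.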
Mechanize doesn't handle JavaScript, and those drop-down menus are JS. If you want to automate anything where JavaScript is involved, I suggest using Selenium, which also has Python bindings.
http://seleniumhq.org/