Python session, cookies, and the web

** Edit:
After a little more thought, I realized I don't have to use mechanize at all,
but I still don't know which Python library I should use to interact with
cookies and session data. Can anyone point me in the right direction? **
I would like to perform a simple login to a site and then reuse the credentials (along with the cookies and session data).
I used mechanize to handle the basic form submission, since the form is built using Javascript:
import mechanize
import cookielib
import urllib

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.set_debug_http(True)
br.set_debug_redirects(True)
br.set_debug_responses(True)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
parameters = {'username': 'w00t',
              'password': 't00w'}
data = urllib.urlencode(parameters)
resp = br.open(url, data)  # url is the site's login endpoint
However, for some reason I can't seem to get any positive response from the server; I don't see any sign of success (for example, a redirect to the desired page), and I don't know how to keep using the cookies and session data once I actually have them.
I was wondering if anyone could point me to the correct documentation, as what I have found does not seem to solve my problem.

I've used the Requests library (http://docs.python-requests.org/en/latest/index.html) for this sort of thing in Python before. I found it very straightforward, and it has great documentation. Here's an example that includes cookies in a request:
>>> import requests
>>> url = 'http://httpbin.org/cookies'
>>> cookies = dict(cookies_are='working')
>>> r = requests.get(url, cookies=cookies)
>>> r.text
'{"cookies": {"cookies_are": "working"}}'
I have used Mechanize, and if I recall correctly, it keeps track of cookies for you. With plain Requests calls, by contrast, you have to pass the cookies back in on each request yourself; a requests.Session object, however, persists cookies across requests automatically.
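For the original question's login flow, here is a minimal sketch using requests.Session, which stores cookies set by the server and sends them on subsequent requests (the URL and form field names are placeholders; substitute the real login endpoint and fields for your site):
import requests
session = requests.Session()
login_data = {'username': 'w00t', 'password': 't00w'}
# The session keeps any cookies the server sets in its response.
resp = session.post('https://example.com/login', data=login_data)
# Later requests through the same session automatically carry those cookies.
profile = session.get('https://example.com/profile')
print profile.status_code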

Related

Web bot to login to site not working

I'm trying to get to grips with writing web bots using Python. I've had some success so far, but there's one bot I'm having issues with.
This bot logs into hushmail.com; it'll be run every few days via cron to make sure the account stays active. I'm using mechanize to do the form filling and cookielib to handle the cookies and sessions. It's cobbled together from other scripts I've found.
The form fills correctly when looking at the debugger output in PyCharm; however, on submitting the second page's form, it doesn't take me to the inbox as expected. Instead it just returns me to the same login form.
#!/usr/bin/env python
import mechanize
import cookielib
#login details
my_user="user#hush.com"
my_pass="sampplepass_sdfnsdfakhsk*876sdfj#("
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# Want debugging messages?
br.set_debug_http(True)
br.set_debug_redirects(True)
br.set_debug_responses(True)
# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
# Open some site, let's pick a random one, the first that pops in mind:
r = br.open('https://www.hushmail.com/')
html = r.read()
print br.title()
print r.info()
br.select_form(nr=0)
br.form['hush_username']=my_user
br.submit()
print br.title()
print r.info()
br.select_form('authenticationform')
br.form['hush_username']=my_user
br.form['hush_passphrase']=my_pass
br.submit()
print br.response().info()
print br.title()
print br.response().read()
I believe the unexpected HTML return values were due to the page returning a mix of Javascript and HTML, which mechanize has issues interpreting.
I switched the Python script to use Selenium WebDriver, which works much better, since it handles Javascript-generated HTML via a Firefox web driver. I used the handy Selenium IDE plugin for Firefox to record my actions in a browser and then used the plugin's Export > Python Script option to create the basis for a more robust web bot.
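A rough sketch of the same login with Selenium (the field names hush_username and hush_passphrase come from the mechanize code above, and my_user/my_pass are the credentials defined there; the rest is illustrative, so verify element names against the live page):
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.hushmail.com/')
# Selenium executes the page's Javascript, so the real form is present.
driver.find_element_by_name('hush_username').send_keys(my_user)
driver.find_element_by_name('hush_passphrase').send_keys(my_pass)
driver.find_element_by_name('hush_passphrase').submit()
print driver.title
driver.quit()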

Submitting a form with mechanize HTTP Error 500

This is my first time using mechanize, and I'm trying to fill out a form with it.
Here are my browser options:
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
I fill out the form with valid values and call br.submit(), but it gives me HTTP Error 500: Internal Server Error. I'm assuming the server is detecting that a bot is hitting the submit? But I figured that's what addheaders was supposed to take care of.
You can use Grab (http://grablib.org/docs/); it is much easier and more efficient. Try it.
Install on Linux:
pip install pycurl lxml
pip install grab
from grab import Grab
g = Grab()
g.go('http://google.com') # go to google.com
g.choose_form(0) #form number
g.set_input('q', 'test') # 'q'-input name, 'test' - search query
g.submit() # send request
print g.xpath_list('//a/text()') # view xpath result link list
Sorry for my English.

python mechanize session not saved

I'm trying to use Python mechanize to retrieve the list of apps on iTunes Connect. Once this list is retrieved, further work will be done with those links.
Logging in succeeds, but when I follow the "Manage Your Applications" link I get redirected back to the login page. It is as if the session gets lost.
import mechanize
import cookielib
from BeautifulSoup import BeautifulSoup
import html2text
filename = 'itunes.html'
# Placeholder credentials; use your real iTunes Connect login here.
username = 'me@example.com'
password = 'secret'
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open('https://itunesconnect.apple.com/WebObjects/iTunesConnect.woa')
br.select_form(name='appleConnectForm')
br.form['theAccountName'] = username
br.form['theAccountPW'] = password
br.submit()
apps_link = br.find_link(text='Manage Your Applications')
print "Manage Your Apps link = ", apps_link
req = br.follow_link(text='Manage Your Applications')
for app_link in br.links():
    print "link is ", app_link
Any ideas what could be wrong?
You need to save/load the cookiejar
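A minimal sketch of persisting an LWPCookieJar to disk between runs (the filename cookies.txt is an arbitrary choice):
import cookielib
cj = cookielib.LWPCookieJar('cookies.txt')
try:
    cj.load(ignore_discard=True, ignore_expires=True)
except IOError:
    pass  # no saved cookies yet
br.set_cookiejar(cj)
# ... log in as usual, then persist the cookies for the next run:
cj.save(ignore_discard=True, ignore_expires=True)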
Figured this out after further investigation. This was due to a known bug in cookielib, documented here: http://bugs.python.org/issue3924
Basically, some sites (notably itunesconnect) set the cookie version as a string rather than an int, which causes an error in cookielib since it does not handle that condition. The fix at the bottom of that issue thread worked for me.
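The workaround is along these lines: subclass the cookie jar and normalize the quoted version attribute before cookielib parses it. This is a sketch adapted from that thread (the class name is my own; check the issue for the exact fix):
import cookielib

class QuirkyCookieJar(cookielib.LWPCookieJar):
    # Some servers send Version="1" (a quoted string); cookielib expects an
    # int, so strip the quotes before the standard processing runs.
    def _cookie_from_cookie_tuple(self, tup, request):
        name, value, standard, rest = tup
        version = standard.get('version', None)
        if version is not None:
            standard['version'] = version.replace('"', '')
        return cookielib.LWPCookieJar._cookie_from_cookie_tuple(
            self, tup, request)

# Use it in place of the plain LWPCookieJar:
cj = QuirkyCookieJar()
br.set_cookiejar(cj)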

python mechanize to access Sharepoint website

I'm trying to access SharePoint using mechanize, but I got a 401 error. Here's the code I'm using:
import mechanize
url = "http://sharepoint:8080/foo/bar/foobar.aspx"
br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/4.0(compatible; MSIE 7.0b; Windows NT 6.0)')]
br.add_password(url, 'domain\\user', 'myPassword')
r = br.open(url)
html = r.read()
Did I miss anything?
Did you happen to try Python NTLM for accessing SharePoint?
The examples in the NTLM docs explain how to use it with urllib2. Pasted below is the code for NTLM authentication using mechanize.
import mechanize
from ntlm import HTTPNtlmAuthHandler
pass_manager = mechanize.HTTPPasswordMgrWithDefaultRealm()
pass_manager.add_password(None, url, user, password)  # url, user, password as defined for your site
auth_NTLM = HTTPNtlmAuthHandler.HTTPNtlmAuthHandler(pass_manager)
browser = mechanize.Browser()
browser.add_handler(auth_NTLM)
r = browser.open(url)
html = r.read()
Try with:
import base64
credentials = base64.b64encode('domain\\user:myPassword')  # Basic auth expects base64("user:password")
br.addheaders = [('User-agent', 'Mozilla/4.0(compatible; MSIE 7.0b; Windows NT 6.0)'), ('Authorization', 'Basic %s' % credentials)]
instead of
br.addheaders = [('User-agent', 'Mozilla/4.0(compatible; MSIE 7.0b; Windows NT 6.0)')]
This should work if your SharePoint server supports Basic Auth.
Looking at the usage in the mechanize docs, you only need to specify the username (e.g. 'john_doe'); try this:
...
br.add_password(url, 'username_string', 'myPassword')
r = br.open(url)
html = r.get_data() # r.get_data() can be called many times without calling seek

Python mechanize SelectControl is empty when it should have values

I'm trying to automate the download of some data from a web form. I'm using Python's mechanize module.
The URL is here: http://www.hiv.lanl.gov/components/sequence/HIV/search/search.html
I need to fill out the Sequence Length, Subtype, and Genomic Region. I've got the Sequence Length and Genomic Region figured out, but I'm having trouble selecting the Subtype. When I load the form, I get an empty SelectControl and mechanize won't let me select anything.
This code should load the website:
import mechanize
import cookielib
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open('http://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html')
br.follow_link(text = 'Search Interface')
br.select_form(nr = 1)
Any help would be greatly appreciated.
-Will
EDIT:
I tried to use BeautifulSoup to re-parse the HTML (as per this SO question), but no luck there.
New Edit:
Below is an excerpt of the mechanize form.
<search POST http://www.hiv.lanl.gov/components/sequence/HIV/search/search.comp multipart/form-data
<TextControl(value SequenceAccessions SA_GenBankAccession 1=)>
<SelectControl(master=[Any, *HIV-1, HIV-2, SIV, SHIV, syntheticwholeDNA, NULL])>
<TextControl(value SEQ_SAMple SSAM_common_name 1=)>
<SelectControl(slave=[])>
<TextControl(value SequenceEntry SE_sequenceLength 4=)>
<CheckboxControl(sample_year_exact=[*1])>
<TextControl(value SEQ_SAMple SSAM_Sample_year 11=)>
<TextControl(value SEQ_SAMple SSAM_Sample_country 1=)>
<CheckboxControl(recombinants=[Recombinants])>
For some reason the slave control is not populated with the possible choices.
The problem is that the Subtype <select> options are populated by Javascript, which mechanize does not execute. The Javascript code runs when the page loads, populating the slave options with the HIV-1 list:
addLoadEvent(fillList(document.forms['search'].slave, lists['HIV-1']));
However, the mapping of Virus -> Subtype is stored in the site's Javascript file, search.js. You may need to store this mapping in your Python script and set the slave form value manually, along these lines:
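A sketch of setting the slave control by hand: add the option that mechanize never saw, then select it (the subtype value 'B' is an illustrative example; take real values from the lists mapping in search.js):
br.select_form(nr = 1)
slave = br.form.find_control('slave')
# Create the option item that the page's Javascript would have added.
mechanize.Item(slave, {'contents': 'B', 'value': 'B', 'label': 'B'})
br.form['slave'] = ['B']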
