I'm trying to put together a Python program that will periodically gather course data from a university website and save it to a CSV file for personal use. After some searching around I stumbled across mechanize. I'm trying to set it up so I can log into my account, but I've run into a stumbling block. The code below is supposed to submit the form containing the login information. The website is https://idp-prod.cc.ucf.edu/idp/Authn/UserPassword. The response page is supposed to display a red error message when a form is submitted with the wrong login information, but the response I keep getting lacks such an error message, and I can't figure out what I'm doing wrong.
import mechanize
from bs4 import BeautifulSoup
import cookielib
# log in page url
url = "https://idp-prod.cc.ucf.edu/idp/Authn/UserPassword"
# create the browser object and give it fake headers
myBrowser = mechanize.Browser()
myBrowser.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
myBrowser.set_handle_robots(False)
# attach the cookie jar before the first request so the session cookie is kept
cj = cookielib.LWPCookieJar()
myBrowser.set_cookiejar(cj)
# have the browser open the site
myBrowser.open(url)
# select third form down
myBrowser.select_form(nr=2)
# fill out the form
myBrowser["j_username"] = "somestudentusername"
myBrowser["pwd"] = "somestudentpassword"
# submit the form and save the response
response = myBrowser.submit()
# upon the submission of incorrect login information an error message is displayed in red
soup = BeautifulSoup(response.read(), 'html.parser')
print (soup.find(color="red")) #find the error message in the response
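To rule out a wrong form index or field name, here is a small debugging sketch that dumps every form and control mechanize sees on the page (the password field in particular may be served under a different name than pwd):
# list each form's index, name and controls before filling anything in
for i, form in enumerate(myBrowser.forms()):
    print "form", i, form.name
    for control in form.controls:
        print "  ", control.type, control.name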
I'm trying to make a script to access a webpage. It loads the first page, finds the login form, fills it in and submits it. The website works just like Facebook: if you already have the cookie you are redirected to your feed, otherwise to the login page.
But as a response I don't get another page; I simply get a string like this:
s1:1MEqkcRcZQ7x6adaszkZUQyRFRhCfXz1z:c2c8d18f12f50ab3e8daA1cf80a0d8b9f64e9d6684b8eb064dd76892d6134cde:1646683
It's four strings separated by ":". I don't know what the first one is. The second is the username, the third is my hashed password (I suppose), and the last one is my user ID.
Testing in Firefox, I found out that it's a JavaScript problem: if you don't have JavaScript enabled in your browser, you get that string after login.
Here's my code:
import mechanize
import urllib
import cookielib
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follow refresh 0 but don't hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# Want debugging messages?
br.set_debug_http(True)
br.set_debug_redirects(True)
br.set_debug_responses(True)
# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.3')]
# Login credentials
wallet = 'username'
password = 'password'
response = br.open('https://www.example.com/')
#html = response.read()
# Show the source
#print html
# or
#print br.response().read()
# Show the html title
print br.title()
#print response.read()
# Show the response headers
#print response.info()
# or
#print br.response().info()
# Show the available forms
for form in br.forms():
    print "Form name:", form.name
    print form
# Select the login form
br.select_form(nr=2)
# Let's login
#br.form['op'] = 'login'
br.form['login'] = wallet
br.form['password'] = password
response1 = br.submit()
print response1
print response1.read()
print "#######################"
cookie = cookielib.Cookie(version=0, name='PON', value=response1.read(), expires=None, port=None, port_specified=False, domain='www.example.com', domain_specified=True, domain_initial_dot=False, path='/', path_specified=True, secure=True, discard=False, comment=None, comment_url=None, rest={'HttpOnly': False}, rfc2109=False)
cj.set_cookie(cookie)
response = br.open('https://www.example.com/')
Because I don't know what the string is, I figured it was a cookie, so I tried to put it in my cookie jar and then tried br.open(url) again, but it always returns the login page.
I have to replicate what the website's JavaScript does in Python, but so far I'm stuck.
Any thoughts? I already tried to read the source code of the website, but I didn't find the script that is causing me trouble. It's probably inside the head tag, right? I don't know.
I got it.
My mistake was trying to set a single cookie with the whole string. Each part of the string was a different cookie. I found the cookie names using the Chrome extension "Live HTTP Headers".
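Roughly, the fix looks like this (a sketch under assumptions: the cookie names s1, user, hash and id are placeholders for the real names shown by Live HTTP Headers):
# split the colon-separated login response into its four parts
raw = response1.read()   # assumes the body hasn't already been consumed
parts = raw.split(':')
# store each part under its own cookie name (placeholder names below)
for name, value in zip(['s1', 'user', 'hash', 'id'], parts):
    cookie = cookielib.Cookie(version=0, name=name, value=value,
        port=None, port_specified=False, domain='www.example.com',
        domain_specified=True, domain_initial_dot=False, path='/',
        path_specified=True, secure=True, expires=None, discard=False,
        comment=None, comment_url=None, rest={}, rfc2109=False)
    cj.set_cookie(cookie)
response = br.open('https://www.example.com/')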
I'm trying to get to grips with writing webbots using Python. I've had some success so far, but there's one bot I'm having issues with.
This bot logs into hushmail.com; it'll be run every few days via cron to make sure the account stays active. I'm using mechanize to do the form filling and cookielib to handle the cookies and sessions. It's cobbled together from other scripts I've found.
The form fills correctly when looking at the debugger output in PyCharm; however, on submitting the second page's form, it doesn't take me to the inbox as expected. Instead it just returns me to the same login form.
#!/usr/bin/env python
import mechanize
import cookielib
#login details
my_user="user#hush.com"
my_pass="sampplepass_sdfnsdfakhsk*876sdfj#("
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follow refresh 0 but don't hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# Want debugging messages?
br.set_debug_http(True)
br.set_debug_redirects(True)
br.set_debug_responses(True)
# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
# Open some site, let's pick a random one, the first that pops in mind:
r = br.open('https://www.hushmail.com/')
html = r.read()
print br.title()
print r.info()
br.select_form(nr=0)
br.form['hush_username']=my_user
br.submit()
print br.title()
print r.info()
br.select_form('authenticationform')
br.form['hush_username']=my_user
br.form['hush_passphrase']=my_pass
br.submit()
print br.response().info()
print br.title()
print br.response().read()
I believe the unexpected HTML returned was due to the page being a mix of JavaScript and HTML, which mechanize has issues interpreting.
I switched the Python script to use Selenium WebDriver, which works much better, handling JavaScript-generated HTML via a Firefox web driver. I used the handy Selenium IDE plugin for Firefox to record my actions in a browser and then used Export > Python Script in the plugin to create the basis for a more robust web bot.
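For reference, a minimal Selenium sketch of the same login flow (assuming the hush_username and hush_passphrase field names from the mechanize script above, and the Selenium 2-era API that matches this code's vintage):
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.hushmail.com/')
# first page: username only
driver.find_element_by_name('hush_username').send_keys(my_user)
driver.find_element_by_name('hush_username').submit()
# second page: username and passphrase
driver.find_element_by_name('hush_username').send_keys(my_user)
driver.find_element_by_name('hush_passphrase').send_keys(my_pass)
driver.find_element_by_name('hush_passphrase').submit()
print driver.title
driver.quit()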
My problem is as follows:
I'm trying to write a scraper that runs through the order process of an airline ticketing website, so I want to scrape a couple of pages that each depend on the results of the pages before (I hope you get what I mean). This is how far I've got:
import mechanize
url = 'any url'
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 5.2; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.47 Safari/536.11')]
br.open(url)
response = br.response().read()
br.select_form(nr=1)
br.form.set_all_readonly(False)
## now I am reading out the variables of form(nr=1)
for control in br.form.controls:
    if not control.name:
        print " - (type) =", control.type
        continue
    print " - (name, type, value) =", (control.name, control.type, br[control.name])
## now I am modifying the variables
br['fromdate'] = '2012/11/03'
br['todate'] = '2012/11/07'
## now I am submitting the form and saving the output in the variable bookingsite
response = br.submit()
bookingsite = response.read()
And here is my problem: how can I use the variable bookingsite, which again contains a form that I want to modify and submit, just like a normal URL? Just by setting
br.open(bookingsite)
Or is there another way of modifying and submitting the output (and then submitting that output again and receiving the new output page)?
There's no need to reopen bookingsite yourself. After your initial response = br.submit(), the mechanize browser is already "on" the result page, so select the next form directly from the browser object:
br.select_form(nr=0)  # use whichever index or name matches the booking form
After you've changed the values of the fields within the form, submit it again:
response = br.submit()
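Put together, the chained flow looks like this (a sketch: the second form's index and field names below are placeholders for whatever the booking page actually serves):
# first page: the search form from the question
br.open(url)
br.select_form(nr=1)
br['fromdate'] = '2012/11/03'
br['todate'] = '2012/11/07'
br.submit()
# second page: the browser now holds the booking page
br.select_form(nr=0)                 # placeholder index; inspect br.forms() to find it
br['passenger_name'] = 'Jane Doe'    # placeholder field name
response = br.submit()
print response.read()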
P.S. If you're automating booking sites, they most likely rely heavily on JavaScript, which mechanize doesn't handle. I'd suggest using Requests instead. You'll be happy you did.
Hello everybody (first post here).
I am trying to send data to a webpage. This webpage requests two fields (a file and an e-mail address); if everything is OK, the webpage returns a page saying "everything is ok" and sends a file to the provided e-mail address. I execute the code below and I get nothing in my e-mail account.
import urllib, urllib2
params = urllib.urlencode({'uploaded': open('file'), 'email': 'user@domain.com'})
req = urllib2.urlopen('http://webpage.com', params)
print req.read()
The print command gives me the code of the home page (I assume it should instead give the code of the "everything is ok" page).
I think (based on a Google search) the poster module should do the trick, but I need to keep dependencies to a minimum, so I would like a solution using only the standard library (if that is possible).
Thanks in advance.
Thanks everybody for your answers. I solved my problem using the mechanize library.
import mechanize
br = mechanize.Browser()
br.open('http://webpage.com')
email = 'user@domain.com'
br.select_form(nr=0)
br['email'] = email
br.form.add_file(open('filename'), 'mime-type', 'filename')
br.form.set_all_readonly(False)
br.submit()
This site could be checking Referer, User-Agent and Cookies.
The way to handle all of this is to use urllib2.OpenerDirector, which you can get from urllib2.build_opener.
import cookielib
import urllib2
# Cookie handling
cj = cookielib.CookieJar()
CookieProcessor = urllib2.HTTPCookieProcessor(cj)
# Build the OpenerDirector
opener = urllib2.build_opener(CookieProcessor)
# Valid User-Agent from Firefox 3.6.8 on Ubuntu 10.04
user_agent = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.8) Gecko/20100723 Ubuntu/10.04 (lucid) Firefox/3.6.8'
# Referer says that you send the request from the site's title page
referer = 'http://webpage.com'
opener.addheaders = [
    ('User-Agent', user_agent),
    ('Referer', referer),
    ('Accept-Charset', 'utf-8')
]
Then prepare the parameters with urllib.urlencode and send the request with opener.open(url, params).
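For instance (a sketch reusing the question's URL and e-mail field; note that a real file upload needs a multipart/form-data body, which urlencode alone cannot produce, which is what the poster module or mechanize's add_file are for):
import urllib
# plain form fields as a urlencoded POST body
params = urllib.urlencode({'email': 'user@domain.com'})
response = opener.open('http://webpage.com', params)
print response.read()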
Documentation for Python 2.7: cookielib, OpenerDirector
I'm trying to automate the download of some data from a webform. I'm using Python's mechanize module.
The URL is here: http://www.hiv.lanl.gov/components/sequence/HIV/search/search.html
I need to fill out the Sequence Length, Subtype and Genomic Region. I've got the Sequence Length and Genomic Region figured out, but I'm having trouble with selecting the Subtype. When I load the form I get an empty SelectControl and mechanize won't let me select anything.
This code should load the website:
import mechanize
import cookielib
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follow refresh 0 but don't hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open('http://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html')
br.follow_link(text = 'Search Interface')
br.select_form(nr = 1)
Any help would be greatly appreciated.
-Will
EDIT:
I tried to use BeautifulSoup to re-parse the HTML (as per this SO question) but no luck there.
New Edit:
Below is an excerpt of the mechanize form.
<search POST http://www.hiv.lanl.gov/components/sequence/HIV/search/search.comp multipart/form-data
<TextControl(value SequenceAccessions SA_GenBankAccession 1=)>
<SelectControl(master=[Any, *HIV-1, HIV-2, SIV, SHIV, syntheticwholeDNA, NULL])>
<TextControl(value SEQ_SAMple SSAM_common_name 1=)>
<SelectControl(slave=[])>
<TextControl(value SequenceEntry SE_sequenceLength 4=)>
<CheckboxControl(sample_year_exact=[*1])>
<TextControl(value SEQ_SAMple SSAM_Sample_year 11=)>
<TextControl(value SEQ_SAMple SSAM_Sample_country 1=)>
<CheckboxControl(recombinants=[Recombinants])>
For some reason the slave control is not populated with the possible choices.
The problem is that the Subtype <select> options are populated by JavaScript, which mechanize does not execute. The JavaScript code runs when the page loads, populating the slave options with the HIV-1 list:
addLoadEvent(fillList(document.forms['search'].slave, lists['HIV-1']));
However, the mapping of Virus -> Subtype is stored in the JavaScript file search.js. You may need to store this mapping in your Python script and manually set the slave form value, as sketched below.
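One way to do that in mechanize is to append the missing options to the slave SelectControl yourself before selecting them. A sketch, assuming you want subtype 'B' (the real values should come from the lists mapping in search.js):
# after br.select_form(nr = 1) from the code above
slave = br.form.find_control('slave')
# add the option that the page's JavaScript would normally insert;
# mechanize.Item attaches itself to the control passed as its first argument
mechanize.Item(slave, {'contents': 'B', 'value': 'B', 'label': 'B'})
br.form['slave'] = ['B']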