This is my first time using mechanize and I'm trying to fill out a form with mechanize
Here are my browser options:
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.addheaders = br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en- US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
I fill out the form with valid values and hit br.submit() but it gives me HTTP: Error 500: Internal Server Error. I'm assuming it's detecting that it's a bot or something hitting the submit? But I figured that's what the addheaders was suppose to take care of.
You can use http://grablib.org/docs/, it is much easier and more efficient. Try it.
Install on linux:
pip install pycurl lxml
pip install grab
from grab import Grab
g = Grab()
g.go('http://google.com') # go to google.com
g.choose_form(0) #form number
g.set_input('q', 'test') # 'q'-input name, 'test' - search query
g.submit() # send request
print g.xpath_list('//a/text()') # view xpath result link list
Sorry for my english.
Related
Hi I'm using python mechanize to get datas from webpages.
I'm trying to get imgurl from google image search webpage to download search result images.
Here's my code
I fill search form as 'dog' and submit. (search 'dog')
import mechanize
import cookielib
import urllib2
import urllib
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time = 1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (x11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'), ('Accept', '*/*') ,('Accept-Language', 'ko-KR')]
br.open('http://www.google.com/imghp?hl=en')
br.select_form(nr=0)
br.form['q'] = 'dog'
a = br.submit()
searched_url = br.geturl()
file0 = open("1.html", "wb")
file0.write(a.read())
file0.close()
when i see page-source from chrome browser, there are 'imgurl's in pagesource. But when i read data from
python mechanize, there's no such things.
also, the size of 1.html(which i write by python) is much smaller than html file downloaded from chrome.
How can i get exactly same html data as web-browsers by using python?
Do i have to set request headers same as web-browsers?
thanks
I'm trying to write a python script that will fill a form on a website, send it, and after sending I want to search for a keyword on the resulting webpage.
More specifically, the form is: https://booking.elal.co.il/newBooking/changeOrder.jsp?LANG=EN&RESSYSTEMID=1
When I fill the form manually on the web, after I press the "continue" button I get kind of "processing page", and afterwards I get the webpage that I want to search on it the keyword.
I tried to use the script here: http://stockrt.github.io/p/handling-html-forms-with-python-mechanize-and-BeautifulSoup/ , but for some reason after submitting the form when I do: print br.response().geturl() I get the url of the "processing page", and not the url of the webpage I want to search on.
My Code:
import mechanize
import cookielib
from BeautifulSoup import BeautifulSoup
import html2text
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
# The site we will navigate into, handling it's session
br.open('https://booking.elal.co.il/newBooking/changeOrder.jsp?LANG=EN&RESSYSTEMID=1')
# Select the first (index zero) form
br.select_form(nr=0)
# User credentials
br.form['REC_LOC'] = '...'
br.form['DIRECT_RETRIEVE_LASTNAME'] = '...'
# Login
br.submit()
#Trying to print the webpage
html = br.response().read()
print html2text.html2text(html)
Is it possible to do what I want, and how can I do it?
** edit
after a short thought, I just figured that I don't have to use mechanize at all
and yet I don't know which Python library I should use in order to interact w/
cookies and session data,
can anyone please hint me ? **
I would like to perform a simple login and use the credentials ( and cookies, session data too ) for some site.
I used mechanize in order to perform the basic form usage, since the form is being built using Javascript
import cookielib
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.set_debug_http(True)
br.set_debug_redirects(True)
br.set_debug_responses(True)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
parameters = { 'username' : 'w00t',
'password' : 't00w'
}
data = urllib.urlencode(parameters)
resp = br.open(url,data)
however for some reason I can't seem to get any positive response from the server, I don't see any sign (for ex redirection to the desired page) , nor I know how to continue once I have the cookies and session to actually continue using these cookies and session data
I was wondering if anyone could hint me or refer me to the correct documentation, as what I have found does not seem to solve my problem
I've used the Requests library (http://docs.python-requests.org/en/latest/index.html) for this sort of thing in Python before. I found it very straight forward and to have great documentation. Here's an example that includes cookies in a request:
>>> url = 'http://httpbin.org/cookies'
>>> cookies = dict(cookies_are='working')
>>> r = requests.get(url, cookies=cookies)
>>> r.text
'{"cookies": {"cookies_are": "working"}}'
I have used Mechanize and if I recall, it keeps track of cookies for you. To the contrary, this library will require you to constantly repost the cookies upon requests.
I'm trying to use python mechanize to retrive the list of apps on iTunes connect. Once this list is retrieved, further work will be done with those links.
Logging in succeeds but then when i follow the "Manage Your Applications" link I get redirected back to the login page. It is as if the session gets lost.
import mechanize
import cookielib
from BeautifulSoup import BeautifulSoup
import html2text
filename = 'itunes.html'
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open('https://itunesconnect.apple.com/WebObjects/iTunesConnect.woa')
br.select_form(name='appleConnectForm')
br.form['theAccountName'] = username
br.form['theAccountPW'] = password
br.submit()
apps_link = br.find_link(text='Manage Your Applications')
print "Manage Your Apps link = ", apps_link
req = br.follow_link(text='Manage Your Applications')
for app_link in br.links():
print "link is ", app_link
Any ideas what could be wrong?
You need to save/load the cookiejar
Figured this out after further investigation. This was due to a known bug in cookielib documented here: http://bugs.python.org/issue3924
Basically some sites (notably itunesconnect), set the cookie version as a string not an int. Which causes an error in cookielib since it does not deal with that condition. The fix at the bottom of that issue thread worked for me.
I'm trying to automate the download of some data from a webform. I'm using python's mechanize module.
The url is here: http://www.hiv.lanl.gov/components/sequence/HIV/search/search.html
I need to fill out the Sequence Length, Subtype and Genomic Region. I've got the Sequence-Length and Genomic-Region figured out but I'm having trouble with selecting the Subtype. When I load the form is get an empty SelectControl and mechanize won't let me select anything.
This code should load the website:
import mechanize
import cookielib
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open('http://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html')
br.follow_link(text = 'Search Interface')
br.select_form(nr = 1)
Any help would be greatly appreciated.
-Will
EDIT:
I tried to use BeautifulSoup to re-parse the HTML (as per this SO question) but no luck there.
New Edit:
Below is an excerpt of the mechanize form.
<search POST http://www.hiv.lanl.gov/components/sequence/HIV/search/search.comp multipart/form-data
<TextControl(value SequenceAccessions SA_GenBankAccession 1=)>
<SelectControl(master=[Any, *HIV-1, HIV-2, SIV, SHIV, syntheticwholeDNA, NULL])>
<TextControl(value SEQ_SAMple SSAM_common_name 1=)>
<SelectControl(slave=[])>
<TextControl(value SequenceEntry SE_sequenceLength 4=)>
<CheckboxControl(sample_year_exact=[*1])>
<TextControl(value SEQ_SAMple SSAM_Sample_year 11=)>
<TextControl(value SEQ_SAMple SSAM_Sample_country 1=)>
<CheckboxControl(recombinants=[Recombinants])>
For some reason the slave control is not populated with the possible choices.
The problem is that the Subtype <select> options are populated by Javascript, which mechanize does not execute. The Javascript code runs when the page loads, populating the slave options with the HIV-1 list:
addLoadEvent(fillList(document.forms['search'].slave, lists['HIV-1']));
However, the mapping of Virus -> Subtype is stored in this Javascript file search.js. You may need to store this mapping in your Python script and manually set the slave form value.