I'm trying to do something very simple using Python's mechanize library. I want to go to http://careers.force.com/jobs/ts2__JobSearch, select Dublin, Ireland from the drop-down list, and then hit enter.
I've written a very short Python script for this, but for some reason when I run it, it returns the HTML for the default search page rather than the search page that is produced after selecting the location (Dublin, Ireland) and hitting enter. I have no idea what is going wrong:
import mechanize
link = "http://careers.force.com/jobs/ts2__JobSearch"
br = mechanize.Browser()
br.open(link)
# select the search form by name, set the location drop-down, and submit
br.select_form('j_id0:j_id1:atsForm')
br.form['j_id0:j_id1:atsForm:j_id38:1:searchCtrl'] = ["Ireland - Dublin"]
response = br.submit()
newsite = response.read()
This is in case you're still having this problem, or, if not, in case anyone else runs into it in the future....
I looked at the POST data your browser sends when you manually select a location, and wrote a function for you that will get you to the page you want by performing the POST yourself with urllib-encoded data. Cheers.
import mechanize, cookielib, urllib, re

def get_search(html, controls):
    # viewstate
    s = re.search('ViewState" value="', html).span()[1]
    e = re.search('"', html[s:]).span()[0] + s
    state = html[s:e]
    # viewstateversion
    s = re.search('ViewStateVersion', html).span()[1]
    s = s + re.search('value="', html[s:]).span()[1]
    e = re.search('"', html[s:]).span()[0] + s
    version = html[s:e]
    # viewstatemac
    s = re.search('ViewStateMAC', html).span()[1]
    s = s + re.search('value="', html[s:]).span()[1]
    e = re.search('"', html[s:]).span()[0] + s
    mac = html[s:e]
    return {controls[0]: controls[0], controls[1]: '', controls[2]: 'Ireland - Dublin',
            controls[3]: 'Search', 'com.salesforce.visualforce.ViewState': state,
            'com.salesforce.visualforce.ViewStateVersion': version,
            'com.salesforce.visualforce.ViewStateMAC': mac}
#Define variables and create a mechanize browser
link = "http://careers.force.com/jobs/ts2__JobSearch"
br = mechanize.Browser()
cj=cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.open(link)
#get the html data
html=br.response().read()
#get the control names from the correct form
br.select_form(nr=1)
controls=[control.name for control in br.form.controls]
#run function with html and control names list as parameters and run urllib.urlencode on what gets returned
postdata=urllib.urlencode(get_search(br.response().read(), controls))
#go to the webpage again but this time also submit the encoded data
br.open(link, postdata)
#There Ya Go
print br.response().read()
I am trying to complete a form on a website automatically for academic purposes using Python's mechanize.
When a human completes the form and submits it, there is no recaptcha.
But when I fill in the controls for the form via mechanize in Python, there is a hidden control that is a recaptcha apparently.
<HiddenControl(recaptcha_response_field=manual_challenge)>
Since this recaptcha is never shown to a human, I don't know what it is looking for, or for that matter what a manual_challenge is.
Thus my question is, how can I pass this challenge so I can continue with automation / mechanize?
I've posted the script I've been using below, in case some fault lies with it.
import mechanize
import re
#constants
TEXT = "hello world!"
br = mechanize.Browser()
#ignore robots.txt
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Firefox')]
#open the page
response = br.open("http://somewebsite.com")
#this is the only form available
br.select_form("form2")
br.form.set_all_readonly(False)
cText = br.form.find_control("text")
cText.value = TEXT
#now submit our response
response = br.submit()
br.back()
#verify the url for error checking
print response.geturl()
#print the data to a text file
s = response.read()
w = open("test.txt", 'w')
print>>w, s
w.close()
This site obviously has protection set up against robots like yours. If this is really for academic purposes, mail them and ask for the data.
Getting around the site's protection measures is a different thing altogether, but you should look into how they know you are a bot: is there any JavaScript you are not running? Are you using the default mechanize user agent? You probably don't want to enter that battlefield with them, though.
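For instance, one cheap thing to rule out is how your requests look to the server. A minimal sketch (the header values below are illustrative guesses, not known to satisfy this particular site):
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
# A bare 'Firefox' User-agent is easy to flag; a full browser header string
# (illustrative value) looks more like ordinary traffic.
br.addheaders = [
    ('User-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Language', 'en-US,en;q=0.5'),
]
response = br.open("http://somewebsite.com")
None of this will solve a real recaptcha, which exists precisely to stop scripts, but it rules out the trivial tells first.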
I am required to retrieve 8000 answers from a website for research purposes (auto-filling a form and submitting it 8000 times). I wrote the script below, but after about 20 submissions Python stops working and I'm unable to get what I need. Could you please help me find the problem with my script?
from mechanize import ParseResponse, urlopen, urljoin
import urllib2
from urllib2 import Request, urlopen, URLError
import mechanize
import time
URL = "url of the website"
br = mechanize.Browser() # Creates a browser
br.set_handle_robots(False) # ignore robots
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
def fetch(val):
    br.open(URL)          # Open the login page
    br.select_form(nr=0)  # Find the login form
    br['subject'] = 'question'
    br['value'] = val
    br.set_all_readonly(False)
    resp = br.submit()
    data = resp.read()
    br.reload()
    x = data.find("the answer is:")
    if x != -1:
        ur = data[x:x+100]
        print ur

val_list = val_list  # This list is available and contains 8000 different values

for i in range(0, 8000):
    fetch(val_list[i])
Having used mechanize in the past to do a similar data-scraping kind of thing, you're almost certainly being rate-limited by the website, as Erbureth mentioned. Websites usually have a way to monitor connections precisely to filter out the type of thing you're attempting, and for good reason.
Putting aside for a moment whatever the purpose of your script may be and moving to your question of why it doesn't work: at the very least, I would put some delays in there so you're not trying to access the site repeatedly in such a short time span. Put a few seconds of pause between calls and maybe it will work, although then you'll have to let it run for hours.
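For example, a minimal sketch of the loop from your script with a pause added (the two-second figure is arbitrary; tune it to whatever the site tolerates):
import time

for i in range(0, 8000):
    fetch(val_list[i])
    time.sleep(2)  # pause between requests so we don't hammer the server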
Using mechanize (and Python) I can go to a website, log in, find a form, fill in some answers, and submit that form. However, I don't know how to open the "response" page, that is, the page that automatically loads once you've submitted the form.
Here's the python code:
br.select_form(name="simTrade")
br.form["symbolTextbox"] = "KO"
br.form["quantityTextbox"] = "10"
br.form["previewOrderButton"]
preview = br.submit()
print preview.read()
With the above code, I can see what the response page holds. But I want to actually open that page and interact with it. How can I do that with mechanize? Thank you.
EDIT: So I answered my own question soon after posting this. Here's the code:
br.select_form(name="simTrade")
br.form["symbolTextbox"] = symbol
br.form["transactionTypeDropDown"] = [order_type]
br.form["quantityTextbox"] = amount
br.form["previewOrderButton"]
no_url = br.submit()
final = no_url.geturl()
x = br.open(final)
print x.read()
To get the HTML source code of the response page (the page that loads when you submit a form), I simply had to get the URL of br.submit(). And there's a built-in mechanize function for that, geturl().
The OP's answer is a bit convoluted and resulted in an AttributeError. This worked better for me:
br.submit()
base_url = br.geturl()
print base_url
Getting the URL of the new page and opening it isn't necessary. Once the form has been submitted, the new page opens automatically and you can keep interacting with it through the same mechanize browser object.
Using the original code from your question, if you wanted to submit the form and store all links on the new page in a list:
br.select_form(name="simTrade")
br.form["symbolTextbox"] = "KO"
br.form["quantityTextbox"] = "10"
br.form["previewOrderButton"]
br.submit()
# Here we store all links on the new page
# but we can use br to do any necessary processing.
links = [link for link in br.links()]
# This will take us back to the original page with the "simTrade" form.
br.back()
I intend to use twill to fill out a form on one page, hit the submit button, and then use BeautifulSoup to parse the resulting page. How can I feed BeautifulSoup the HTML page? I assume I have to read the current URL, but I do not know how to actually return the URL in order to do so. I have tried twill's TwillBrowser.get_url(), but it only returns None.
For any future sufferers: I have had better luck using mechanize instead of twill, since twill is an unmaintained thin wrapper around mechanize. The solution is as follows:
import mechanize
url = "foo.com"
br = mechanize.Browser()
br.open(url)
br.select_form(name = "YOURFORMNAMEHERE") #make sure to leave the quotation marks
br["YOURINPUTFIELDNAMEHERE"] = ["YOURVALUEHERE"] #this must be in a list even if it is only one value
response = br.submit()
print response.geturl()
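And since the original goal was to hand the resulting page to BeautifulSoup, you can feed it the response body directly rather than going back through the URL. A small sketch, assuming the bs4 package (BeautifulSoup 4) is installed:
from bs4 import BeautifulSoup

html = response.read()                     # the page returned by br.submit()
soup = BeautifulSoup(html, "html.parser")  # parse the post-submit page
print soup.title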
Finally figured this out!
If you import twill like so:
import twill.commands as com
then you can get the current URL with:
url = com.browser.get_url()
Source: http://nullege.com/codes/search/twill.commands.browser.get_url
I am trying to scrape some data off the National Vulnerability Database (http://web.nvd.nist.gov). What I want to do is enter a search term, which brings me the first 20 results, and scrape that data. Then I want to click "next 20" until I have traversed all the results.
I am able to successfully submit search terms, but clicking "next 20" is not working at all.
Tools I am using: Python + mechanize.
Here is my code:
import mechanize

# Browser
b = mechanize.Browser()

# The URL to this service
URL = 'http://web.nvd.nist.gov/view/vuln/search'

Search = ['Linux', 'Mac OS X', 'Windows']

def searchDB():
    SearchCounter = 0
    for i in Search:
        # Load the page
        read = b.open(URL)
        # Select the form
        b.select_form(nr=0)
        # Fill out the search form
        b['vulnSearchForm:text'] = Search[int(SearchCounter)]
        b.submit('vulnSearchForm:j_id120')
        result = b.response().read()
        file = open(Search[SearchCounter] + ".txt", "w")
        file.write(result)
        '''Here is where the problem is. vulnResultsForm:j_id116 is the value of the "next 20" button'''
        b.select_form(nr=0)
        b.form.click('vulnResultsForm:j_id116')
        result = b.response().read()

if __name__ == '__main__':
    searchDB()
From the docstring of b.form.click:
Return request that would result from clicking on a control.
The request object is a urllib2.Request instance, which you can pass to urllib2.urlopen (or ClientCookie.urlopen).
So:
request = b.form.click('vulnResultsForm:j_id116')
b.open(request)
result = b.response().read()
I haven't used mechanize outside of zope.testbrowser, which is based on mechanize, so there may be differences, but here goes:
You are clicking on the form... try getting the button and clicking on the button instead.
Something like this, I think:
form.find_control("j_id120").click()
Also:
b['vulnSearchForm:text'] = Search[int(SearchCounter)]
can be replaced with:
b['vulnSearchForm:text'] = i
since i will contain the value. Python is not JavaScript; loop variables are not numbers (unless you want them to be).