I need to retrieve 8000 answers from a website for research purposes (auto-filling a form and submitting it 8000 times). I wrote the script below, but after about 20 submissions Python stops working and I can't get what I need. Could you please help me find the problem with my script?
from mechanize import ParseResponse, urlopen, urljoin
import urllib2
from urllib2 import Request, urlopen, URLError
import mechanize
import time
URL = "url of the website"
br = mechanize.Browser() # Creates a browser
br.set_handle_robots(False) # ignore robots
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
def fetch(val):
    br.open(URL) # Open the login page
    br.select_form(nr=0) # Find the login form
    br['subject']='question'
    br['value'] =val
    br.set_all_readonly(False)
    resp = br.submit()
    data = resp.read()
    br.reload()
    x=data.find("the answer is:")
    if x!=-1:
        ur=data[x:x+100]
        print ur
val_list =val_list # This list is available and contains 8000 different values
for i in range(0,8000):
    fetch(val_list[i])
I've used mechanize in the past for a similar kind of data scraping, and you're almost certainly being rate-limited by the website, as Erbureth mentioned. Websites usually monitor connections to filter out exactly the type of thing you're attempting, and for good reason.
Putting aside for a moment whatever the purpose of your script may be and moving to your question of why it doesn't work: at the very least, I would put some delays in there so you're not hitting the site repeatedly in such a short time span. Pause for a few seconds between calls, and maybe it will work. (Although then you'll have to let it run for hours.)
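As a rough illustration of the delay idea, here is a minimal sketch. The helper name run_throttled and the 2 to 5 second bounds are my own assumptions, not anything the site publishes; fetch stands for whatever function submits the form, and it would need to return its result rather than print it.

```python
import random
import time


def run_throttled(values, fetch, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Call fetch(value) for each value, pausing between consecutive calls.

    The delay bounds are illustrative guesses, not documented limits.
    """
    results = []
    for index, value in enumerate(values):
        if index > 0:
            # Random pause so requests are not sent in a tight loop.
            sleep(random.uniform(min_delay, max_delay))
        try:
            results.append((value, fetch(value)))
        except Exception as exc:
            # Record the failure and keep going so one error does not lose the run.
            results.append((value, exc))
    return results
```

With a few seconds per call, 8000 values take several hours, so it may be worth writing each result to disk as it arrives instead of keeping everything in memory.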
Related
I am a complete beginner in Python. I tried to crawl some product information from my www.Alibaba.com console. When I got to the visitor details page, I found that the cookie changed every time I clicked the search button, that is, on every request. I cannot crawl the data the way I did on other pages, where the cookies stayed fixed for a certain period.
After comparing the cookie data, I found that only 3 key-value pairs had changed. I think those 3 values are what made the crawl fail, so I want to know how to handle this situation.
For Python 3, urllib.request in the standard library can be configured to use an http.cookiejar.CookieJar, which will keep track of cookies automatically.
You can set this up like this:
import http.cookiejar, urllib.request
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
r = opener.open("http://example.com/")
If you're using Python 2, a similar approach works with urllib2:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
r = opener.open("http://example.com/")
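To see the jar at work without touching a real site, here is a small Python 3 sketch that runs a throwaway local server. The handler class, the cookie name session and the function demo_cookie_roundtrip are all hypothetical names made up for the demonstration. The jar only picks up cookies sent in Set-Cookie response headers; cookies that a page sets through JavaScript would not be captured this way.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieEchoHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: sets a cookie and echoes the Cookie header it got."""

    def do_GET(self):
        body = self.headers.get("Cookie", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence the default per-request logging.
        pass


def demo_cookie_roundtrip():
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), CookieEchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_port

    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # bypass any proxy configured in the environment
        urllib.request.HTTPCookieProcessor(jar),
    )
    try:
        first = opener.open(url).read().decode("utf-8")
        second = opener.open(url).read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
    return first, second, [cookie.name for cookie in jar]
```

The first request carries no cookie; the second one sends back whatever the server set on the first response, with no manual header handling.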
I am trying to complete a form on a website automatically for academic purposes using Python's mechanize.
When a human completes the form and submits it, there is no recaptcha.
But when I fill in the controls for the form via mechanize in Python, there is a hidden control that is a recaptcha apparently.
<HiddenControl(recaptcha_response_field=manual_challenge)>
Since this recaptcha is never shown to a human, I don't know what it is looking for, or for that matter what a manual_challenge is.
Thus my question is, how can I pass this challenge so I can continue with automation / mechanize?
I've posted the script I've been using below, in case some fault lies with it.
import mechanize
import re
#constants
TEXT = "hello world!"
br = mechanize.Browser()
#ignore robots.txt
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Firefox')]
#open the page
response = br.open("http://somewebsite.com")
#this is the only form available
br.select_form("form2")
br.form.set_all_readonly(False)
cText = br.form.find_control("text")
cText.value = TEXT
#now submit our response
response = br.submit()
br.back()
#verify the url for error checking
print response.geturl()
#print the data to a text file
s = response.read()
w = open("test.txt", 'w')
print>>w, s
w.close()
This site obviously has protection set up against robots like yours. If this is really for academic purposes, mail them and ask for the data.
Getting around the site's protection measures is a different thing altogether, but you should look into how they know you are a bot: is there any JavaScript you are not running, are you using the mechanize user agent, etc. You probably don't want to enter that battlefield with them, though.
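If you just want to see which hidden fields a form carries, so you can compare them with what a real browser submits, a sketch along these lines may help. It uses Python 3's standard html.parser rather than mechanize, and the names HiddenInputFinder and find_hidden_inputs are my own. It only inspects the page; it does not answer or bypass any challenge.

```python
from html.parser import HTMLParser


class HiddenInputFinder(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if (attributes.get("type") or "").lower() == "hidden":
            self.hidden[attributes.get("name")] = attributes.get("value")


def find_hidden_inputs(html):
    """Return a dict of hidden input names to values found in the HTML text."""
    finder = HiddenInputFinder()
    finder.feed(html)
    return finder.hidden
```

Running this over the page source fetched by the script and over the source saved from a browser shows whether the two are being served different forms.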
I'm trying to do something very simple using Python's Mechanize library. I want to go to http://careers.force.com/jobs/ts2__JobSearch, select Dublin Ireland from the drop-down list, and then hit enter.
I've written a very short Python script for this, but for some reason when I run it, it returns the HTML for the default search page rather than the search page that is produced after selecting the location (Dublin Ireland) and hitting enter. I have no idea what is going wrong:
import mechanize
link = "http://careers.force.com/jobs/ts2__JobSearch"
br = mechanize.Browser()
br.open(link)
br.select_form('j_id0:j_id1:atsForm' )
br.form['j_id0:j_id1:atsForm:j_id38:1:searchCtrl'] = ["Ireland - Dublin"]
response = br.submit()
newsite = response.read()
This is in case you're still having this problem or if not, in case anyone else is having this problem in the future....
I looked at the POST data that your browser sends when you select something manually, and wrote a function that gets you to the page you want by performing the POST yourself with data encoded by urllib.urlencode. Cheers.
import mechanize,cookielib,urllib
import re
def get_search(html,controls):
    #viewstate
    s=re.search('ViewState" value="', html).span()[1]
    e=re.search('"',html[s:]).span()[0]+s
    state=html[s:e]
    #viewstateversion
    s=re.search('ViewStateVersion', html).span()[1]
    s=s+re.search('value="', html[s:]).span()[1]
    e=re.search('"', html[s:]).span()[0]+s
    version=html[s:e]
    #viewstatemac
    s=re.search('ViewStateMAC',html).span()[1]
    s=s+re.search('value=\"',html[s:]).span()[1]
    e=re.search('"',html[s:]).span()[0]+s
    mac=html[s:e]
    #build the POST fields: form controls plus the three hidden ViewState values
    return {
        controls[0]:controls[0],
        controls[1]:'',
        controls[2]:'Ireland - Dublin',
        controls[3]:'Search',
        'com.salesforce.visualforce.ViewState':state,
        'com.salesforce.visualforce.ViewStateVersion':version,
        'com.salesforce.visualforce.ViewStateMAC':mac}
#Define variables and create a mechanize browser
link = "http://careers.force.com/jobs/ts2__JobSearch"
br = mechanize.Browser()
cj=cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.open(link)
#get the html data
html=br.response().read()
#get the control names from the correct form
br.select_form(nr=1)
controls=[control.name for control in br.form.controls]
#run function with html and control names list as parameters and run urllib.urlencode on what gets returned
postdata=urllib.urlencode(get_search(br.response().read(), controls))
#go to the webpage again but this time also submit the encoded data
br.open(link, postdata)
#There Ya Go
print br.response().read()
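The three hidden-field lookups above follow the same pattern, so they can be folded into one helper. This is only a sketch in Python 3 (urllib.parse instead of urllib), with hypothetical names hidden_value and build_postdata, and it assumes each field appears as an input tag whose name attribute equals the POST key and comes before its value attribute.

```python
import re
from urllib.parse import urlencode

# Hidden fields posted by the answer above.
VIEWSTATE_FIELDS = (
    "com.salesforce.visualforce.ViewState",
    "com.salesforce.visualforce.ViewStateVersion",
    "com.salesforce.visualforce.ViewStateMAC",
)


def hidden_value(html, name):
    """Return the value attribute of the input whose name attribute is `name`.

    Assumes name="..." precedes value="..." inside the same tag.
    """
    pattern = r'<input[^>]*\bname="%s"[^>]*\bvalue="([^"]*)"' % re.escape(name)
    match = re.search(pattern, html)
    if match is None:
        raise ValueError("no input named %r found" % name)
    return match.group(1)


def build_postdata(html, visible_fields):
    """Merge caller-supplied fields with the hidden fields and URL-encode them."""
    fields = dict(visible_fields)
    for name in VIEWSTATE_FIELDS:
        fields[name] = hidden_value(html, name)
    return urlencode(fields)
```

Raising an error when a field is missing makes a changed page layout show up immediately instead of silently posting an empty value.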
I am using urllib2 in Python to post login data to a web site.
After successful login, the site redirects my request to another page. Can someone provide a simple code sample on how to do this in Python with urllib2? I guess I will need cookies also to be logged in when I get redirected to another page. Right?
Thanks a lot in advance.
First, get mechanize: http://wwwsearch.sourceforge.net/mechanize/
You could do this kind of stuff with just urllib2, but you will be writing tons of boilerplate code, and it will be buggy.
Then:
import mechanize
br = mechanize.Browser()
br.open('http://somesite.com/account/signin/')
br.select_form('loginForm')
br['username'] = 'jekyll'
br['password'] = 'bananas'
br.submit()
# At this point, you're logged in, redirected, and the
# br object has the cookies and all that.
br.geturl() # e.g. http://somesite.com/loggedin/
Then you can use the Browser object br to do whatever you have to do: click on links, etc. Check the samples on the mechanize site.
I'm working on a simple HTML scraper for Hulu in python 2.6 and am having problems with logging on to my account. Here's my code so far:
import urllib
import urllib2
from cookielib import CookieJar
#make a cookie and redirect handlers
cookies = CookieJar()
cookie_handler= urllib2.HTTPCookieProcessor(cookies)
redirect_handler= urllib2.HTTPRedirectHandler()
opener = urllib2.build_opener(redirect_handler,cookie_handler)#make opener w/ handlers
#build the url
login_info = {'username':USER,'password':PASS}#USER and PASS are defined
data = urllib.urlencode(login_info)
req = urllib2.Request("http://www.hulu.com/account/authenticate",data)#make the request
test = opener.open(req) #open the page
print test.read() #print html results
The code compiles and runs, but all that prints is:
Login.onError("Please \074a href=\"/support/login_faq#cant_login\"\076enable cookies\074/a\076 and try again.");
I assume there is some error in how I'm handling cookies, but just can't seem to spot it. I've heard Mechanize is a very useful module for this type of program, but as this seems to be the only speed bump left, I was hoping to find my bug.
What you're seeing is an AJAX response. The site is probably using JavaScript to set the cookie, and that is screwing up your attempts to authenticate.
The error message you are getting back could be misleading. For example, the server might be looking at the User-Agent and seeing that it's not one of the supported browsers, or looking at HTTP_REFERER and expecting the request to come from the hulu domain. My point is that there are too many variables in the request to keep guessing them one by one.
I recommend using an HTTP analyzer tool, e.g. Charles or the one in Firebug, to figure out what (header fields, cookies, parameters) the client sends to the server when you log in to hulu via a browser. This will give you the exact request that you need to construct in your Python code.
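Once you have captured the browser's request, reproducing its headers looks roughly like this. It is a Python 3 sketch (urllib.request in place of urllib2), the function name build_login_request is made up, and which headers and values actually matter is an assumption until you have checked the capture.

```python
import urllib.parse
import urllib.request


def build_login_request(url, fields, user_agent, referer):
    """Build a POST request carrying browser-like headers.

    The header values should be copied from the captured browser request;
    nothing here is known to be what any particular site checks.
    """
    data = urllib.parse.urlencode(fields).encode("ascii")
    request = urllib.request.Request(url, data=data)
    request.add_header("User-Agent", user_agent)
    request.add_header("Referer", referer)
    return request
```

The returned request can be passed to the same cookie-aware opener as before, e.g. opener.open(request), so cookies and headers are sent together.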