I am trying to complete a form on a website automatically for academic purposes using Python's mechanize.
When a human completes the form and submits it, there is no recaptcha.
But when I fill in the form's controls via mechanize in Python, there is apparently a hidden control that is a reCAPTCHA:
<HiddenControl(recaptcha_response_field=manual_challenge)>
Since this recaptcha is never shown to a human, I don't know what it is looking for, or, for that matter, what a manual_challenge is.
So my question is: how can I pass this challenge so I can continue with the automation via mechanize?
I've posted the script I've been using below, in case some fault lies with it.
import mechanize
import re
#constants
TEXT = "hello world!"
br = mechanize.Browser()
#ignore robots.txt
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Firefox')]
#open the page
response = br.open("http://somewebsite.com")
#this is the only form available
br.select_form("form2")
br.form.set_all_readonly(False)
cText = br.form.find_control("text")
cText.value = TEXT
#now submit our response
response = br.submit()
br.back()
#verify the url for error checking
print response.geturl()
#print the data to a text file
s = response.read()
w = open("test.txt", 'w')
print>>w, s
w.close()
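For reference, a quick way to see the hidden control is to list the form's controls after select_form (a minimal snippet reusing the br object from the script above):
# list every control mechanize sees in the selected form,
# including hidden ones such as recaptcha_response_field
for control in br.form.controls:
    print control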
This site obviously has protection set up against bots like yours. If this is really for academic purposes, email them and ask for the data.
Getting around the site's protection measures is a different thing altogether, but you should look into how they know you are a bot: is there any JavaScript you are not running, are you sending mechanize's default user agent, etc.? You probably don't want to enter that battlefield with them, though.
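As a minimal sketch of the user-agent point (reusing the br object from the question's script; the header string is just an example), you could send a full browser-style User-Agent instead of the bare 'Firefox' string:
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
# a full browser-style User-Agent string instead of just 'Firefox'
br.addheaders = [('User-agent',
                  'Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0')]
response = br.open("http://somewebsite.com")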
Related
I'm currently trying to access a website with Python and I'm having some trouble with the requests and mechanize modules. The way I do this task manually is: I load the website portal, log on, click a button, fill out a form, and submit it to receive a download. I've gotten to the log-on stage, but I'm having trouble submitting my username and password. I'm currently using this method:
import requests
payload = {"username":"user","password":"pass"}
r = requests.post("portal login address",data=payload)
response = r.content
print(response)
but this gives me the exact same output as a GET request where I don't include the payload. I am also wondering how I can simulate the button clicks and form submissions. I know mechanize can be used for this, but I'm unclear as to how.
You can use the mechanize module like this:
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)
br.open("<page>")

# you can access the form by name or index with select_form;
# nr=0 simply picks the first form as an example
br.select_form(nr=0)
br["username"] = "saurabhav"
br["password"] = "8558881858"
response = br.submit()
Have a look at Mechanize
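If you would rather stay with requests, a rough sketch of the same login-then-download flow is to use a Session so the cookies set at login carry over to later requests (the URLs and field names below are placeholders, not taken from your portal):
import requests

# placeholder URLs and field names -- adjust them to the portal's actual login form
LOGIN_URL = "https://portal.example.com/login"
DOWNLOAD_URL = "https://portal.example.com/download"

payload = {"username": "user", "password": "pass"}

with requests.Session() as s:
    # the Session keeps cookies between requests, so the login survives
    r = s.post(LOGIN_URL, data=payload)
    r.raise_for_status()
    download = s.get(DOWNLOAD_URL)
    print(download.status_code)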
I know there are a lot of questions about this, and I have tried most of them.
My goal is to get the article from this page and use it in GAE (Google App Engine).
If I try to log in, it redirects to a long URL; after I log in there, it redirects back to the article.
First I tried urllib2, as mentioned in "how to login to a website with python and mechanize", and it didn't work.
Then I took the SelectLoginForm and login functions from https://github.com/cdhigh/KindleEar/blob/master/books/base.py, and that didn't work either.
Selenium won't work because I am going to run this on GAE; I don't think GAE can support Selenium.
So I started looking into the mechanize module. My current code is:
# -*- coding: cp1254 -*-
import cookielib
import mechanize

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [("User-agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")]

# open the login iframe and submit the credentials
br.open('https://hurpass.com/iframe/login?appkey=52da7ef64037f9497f0acb091390051062215&secret=52da7f0c4037f9497f0acb0b1390051084754&domain=sosyal.hurriyet.com.tr&callback_url=http://sosyal.hurriyet.com.tr/Account/AutoLogin?returnUrl=http://sosyal.hurriyet.com.tr/yazar/ahmet-hakan_131/baskanlik-diktatorluk-getirir-diyenleri-girtlaklamak-istiyorum_28116073&referer=http://sosyal.hurriyet.com.tr&user_page=http://sosyal.hurriyet.com.tr/Account/AutoLogin?returnUrl=http://sosyal.hurriyet.com.tr/yazar/ahmet-hakan_131/baskanlik-diktatorluk-getirir-diyenleri-girtlaklamak-istiyorum_28116073&is_mobile=0&session_timeout=0&is_vative=0&email=')
br.select_form(name='frm_login')
br["email"] = "tasklak#hotmail.com"
br["password"] = "123456"
br.submit(type="submit")

# save the response of the login submission
last_response = br.response()
http_header_dict = last_response.info().dict
html_string_list = last_response.readlines()
html_data = "".join(html_string_list)

# now fetch the article with the same (hopefully logged-in) browser
url = 'http://sosyal.hurriyet.com.tr/yazar/ahmet-hakan_131/baskanlik-diktatorluk-getirir-diyenleri-girtlaklamak-istiyorum_28116073'
page = br.open(url)
print page.read().decode("UTF-8")

ha = open("test.html", 'w')
ha.write(html_data)
ha.close()
Again, I can't get this working, but if I open the HTML file it created, it redirects to the logged-in article page. Could it be a mechanize redirection problem, or is it impossible to log in to this page?
Edit after mihail's answer:
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
user = 'tasklak#hotmail.com'
password = '123456'
xor_password = ''.join(chr(12 ^ ord(c)) for c in password)
auth_url = 'http://auth.hurriyet.com.tr/api/loginuser/{}/?{}'.format(user, xor_password)
url = 'http://www.hurriyet.com.tr/anasayfa/'
sessionidd = urllib2.urlopen(auth_url).read().split(',')[1].split('\"')[3]
print sessionidd
opener.open(url + ';ASPSESSIONID=' + sessionidd)
print cj
Edit 2:
sessionidd = urllib2.urlopen(auth_url).read().split(',')[1].split('\"')[3]
print sessionidd
opener.open(url)
k = 0
for a in cj:
    if k < 2:
        a.value = sessionidd
    k += 1
print cj
First of all, you should know that if there isn't a publicly available API for getting this data without scraping, then it's very likely that what you are doing is not welcomed by the website owners, is against their terms of service, and could even be illegal and punishable by law, depending on where you live.
Unless mechanize can interpret JavaScript code (which I doubt, although I might be wrong), it's not going to be very helpful here. However, skimming through the links you provided with Chrome's DevTools, it looks like you could implement what you want with a few pure urllib2 requests.
For example, when you log in for the first time, you'll see a GET request to the http://auth.hurriyet.com.tr/api/loginuser/tasklak#hotmail.com/?%3D%3E%3F89%3A URL, which includes your username and encoded password and returns some session IDs. The reason mechanize wouldn't work is that the password is encoded by JavaScript that isn't interpreted when you submit the form from your script.
Looking at the source code of the login form, you'll see that when the "Submit" button is clicked, a loginUser() function is called, in which the password is XOR'ed with the following code:
for (i = 0; i < password.length; ++i) {
encoded_password += String.fromCharCode(12 ^ password.charCodeAt(i));
}
which you would have to rewrite in Python. So, to receive the initial session IDs, you'd have something like:
import urllib2
user = 'tasklak#hotmail.com'
password = '123456'
xor_password = ''.join(chr(12 ^ ord(c)) for c in password)
auth_url = 'http://auth.hurriyet.com.tr/api/loginuser/{}/?{}'.format(user, xor_password)
print(urllib2.urlopen(auth_url).read())
It looks like you're then going to need to validate the session IDs you received and retrieve session cookies, which you can then use to get the full articles, but I will leave that to you.
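As a rough sketch of that last step (the cookie name ASPSESSIONID follows the parameter used in your edit above; whether the article URL accepts the session this way is an assumption you would need to verify in DevTools), the idea is to attach the session ID to a cookie-aware opener and then request the article with it:
import cookielib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# session_id would be the value parsed out of the auth_url response above
session_id = 'PASTE-SESSION-ID-HERE'
article_url = ('http://sosyal.hurriyet.com.tr/yazar/ahmet-hakan_131/'
               'baskanlik-diktatorluk-getirir-diyenleri-girtlaklamak-istiyorum_28116073')

# send the session id as a cookie with the article request
opener.addheaders = [('Cookie', 'ASPSESSIONID=' + session_id)]
print opener.open(article_url).read()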
I am required to retrieve 8000 answers from a website for research purposes (auto-filling a form and submitting it 8000 times). I wrote the script below, but when I run it, Python stops working after 20 submissions and I'm unable to get what I need. Could you please help me find the problem with my script?
import time
import mechanize

URL = "url of the website"

br = mechanize.Browser()       # creates a browser
br.set_handle_robots(False)    # ignore robots.txt
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

def fetch(val):
    br.open(URL)           # open the page
    br.select_form(nr=0)   # select the first form
    br.set_all_readonly(False)
    br['subject'] = 'question'
    br['value'] = val
    resp = br.submit()
    data = resp.read()
    br.reload()
    x = data.find("the answer is:")
    if x != -1:
        ur = data[x:x + 100]
        print ur

val_list = val_list  # this list is available and contains 8000 different values

for i in range(0, 8000):
    fetch(val_list[i])
Having used mechanize in the past for a similar data-scraping task, I'd say you're almost certainly being limited by the website, as Erbureth mentioned. Websites usually have a way to monitor connections to filter out exactly the type of thing you're attempting, and for good reason.
Putting aside for a moment whatever the purpose of your script may be, and moving to your question of why it doesn't work: at the very least, I would put some delays in there so you're not hitting the site repeatedly in such a short time span. Put a few seconds of pause between calls, and maybe it will work. (Although then you'll have to let it run for hours.)
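For example, a minimal sketch of that change (reusing the fetch function and val_list from your script, with an arbitrary five-second pause):
import time

# pause a few seconds between submissions so the site isn't hammered;
# at ~5 s per value, 8000 submissions will take roughly 11 hours
for val in val_list:
    fetch(val)
    time.sleep(5)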
I'm trying to do something very simple using Python's mechanize library. I want to go to http://careers.force.com/jobs/ts2__JobSearch, select Dublin Ireland from the drop-down list, and then hit enter.
I've written a very short Python script for this, but for some reason when I run it, it returns the HTML for the default search page rather than the search page that is produced after selecting the location (Dublin Ireland) and hitting enter. I have no idea what is going wrong:
import mechanize
link = "http://careers.force.com/jobs/ts2__JobSearch"
br = mechanize.Browser()
br.open(link)
br.select_form('j_id0:j_id1:atsForm')
br.form['j_id0:j_id1:atsForm:j_id38:1:searchCtrl'] = ["Ireland - Dublin"]
response = br.submit()
newsite = response.read()
This is in case you're still having this problem, or, if not, in case anyone else runs into it in the future...
I looked at the POST data being sent by your browser when you manually selected something, and wrote a function for you that will get you to the page you want by performing the POST yourself with urllib.urlencode'd data. Cheers.
import re
import mechanize
import cookielib
import urllib

def get_search(html, controls):
    # extract the hidden ViewState value
    s = re.search('ViewState" value="', html).span()[1]
    e = re.search('"', html[s:]).span()[0] + s
    state = html[s:e]
    # extract the hidden ViewStateVersion value
    s = re.search('ViewStateVersion', html).span()[1]
    s = s + re.search('value="', html[s:]).span()[1]
    e = re.search('"', html[s:]).span()[0] + s
    version = html[s:e]
    # extract the hidden ViewStateMAC value
    s = re.search('ViewStateMAC', html).span()[1]
    s = s + re.search('value="', html[s:]).span()[1]
    e = re.search('"', html[s:]).span()[0] + s
    mac = html[s:e]
    return {controls[0]: controls[0],
            controls[1]: '',
            controls[2]: 'Ireland - Dublin',
            controls[3]: 'Search',
            'com.salesforce.visualforce.ViewState': state,
            'com.salesforce.visualforce.ViewStateVersion': version,
            'com.salesforce.visualforce.ViewStateMAC': mac}

# define variables and create a mechanize browser
link = "http://careers.force.com/jobs/ts2__JobSearch"
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.open(link)

# get the html data
html = br.response().read()

# get the control names from the correct form
br.select_form(nr=1)
controls = [control.name for control in br.form.controls]

# run the function with the html and control names, then urlencode what it returns
postdata = urllib.urlencode(get_search(html, controls))

# go to the webpage again, but this time also submit the encoded data
br.open(link, postdata)

# there ya go
print br.response().read()
I'm working on a screen scraper for what.cd using Python and BeautifulSoup. I came across this script while working and decided to look at it, since it seems to be similar to what I'm working on. However, every time I run the script I get a message that my credentials are wrong, even though they are not.
As far as I can tell, I'm getting this message because when the script tries to log in to what.cd, what.cd is supposed to return a cookie containing the information that lets me request pages later in the script. The part where the script is failing is:
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username': username,
                               'password': password})
check = opener.open('http://what.cd/login.php', login_data)
soup = BeautifulSoup(check.read())
warning = soup.find('span', 'warning')
if warning:
    exit(str(warning) + '\n\nprobably means username or pw is wrong')
I've tried multiple methods of authenticating with the site, including using CookieFileJar, the script located here, and the requests module. I've gotten the same HTML message with each one. It says, in short, that "JavaScript is disabled" and "Cookies are disabled", and also provides a login box in HTML.
I don't really want to mess around with Mechanize, but I don't see any other way to do it at the moment. If anyone can provide any help, it would be greatly appreciated.
After a few more hours of searching, I found the solution to my problem. I'm still not sure why this code works as opposed to the version above, but it does. Here is the code I'm using now:
import urllib
import urllib2
import cookielib

# set up a cookie jar and install the opener globally so urllib2 keeps the session cookies
cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

# hit the index page first so the site can set its initial cookies
request = urllib2.Request("http://what.cd/index.php", None)
f = urllib2.urlopen(request)
f.close()

# now POST the credentials to the login page
data = urllib.urlencode({"username": "your-login", "password": "your-password"})
request = urllib2.Request("http://what.cd/login.php", data)
f = urllib2.urlopen(request)
html = f.read()
f.close()
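As a follow-up sketch (assuming the login above succeeded), any later urllib2.urlopen call goes through the installed opener and therefore carries the session cookies, so you can fetch pages that require being logged in:
# the opener installed above is used by urllib2.urlopen, so this request
# is sent with the session cookies obtained in the login step
protected = urllib2.urlopen("http://what.cd/index.php")
print protected.read()
protected.close()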
Credit goes to carl.waldbieser from linuxquestions.org. Thanks to everyone who gave input.