I'm trying to automate form filling on Sharepoint site, but my Python script can't get passed this authentication box that pops up when you type in the url from below.
from base64 import b64encode
import mechanize
url = 'http://moss.micron.com/MFG/ProbeTest/Lists/Manufacturing%20Requests/AllItems.aspx'
username = 'username'
password = 'password'
# I have had to add a carriage return ('%s:%s\n'), but
# you may not have to.
b64login = b64encode('%s:%s' % (username, password))
br = mechanize.Browser()
br.addheaders.append(
('Authorization', 'Basic %s' % b64login, )
)
br.open(url)!
This results in the following error:
EDIT:
Here are the results of running wget on the requested page.
--2013-08-30 11:16:17-- http://moss.micron.com/MFG/ProbeTest/Lists/Manufacturing%20Requests/AllItems.aspx
Resolving moss.micron.com... 137.201.88.118
Connecting to moss.micron.com|137.201.88.118|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 401 Unauthorized
Server: Microsoft-IIS/7.0
WWW-Authenticate: Negotiate
WWW-Authenticate: NTLM
X-Powered-By: ASP.NET
MicrosoftSharePointTeamServices: 12.0.0.6341
Date: Fri, 30 Aug 2013 17:16:17 GMT
Connection: keep-alive
Content-Length: 0
Authorization failed.
Your browser is respecting the robots.txt on your site disallowing it.
You can set mechanize.Browser to ignore robots.txt, prior to making the request via:
br.set_handle_robots(False)
Alternately, edit your robots.txt to allow that sort of connection.
If you set a custom UserAgent header in your mechanize.Browser to allow you to filter for it.
See here for basic info about robots.txt.
If you can get to the site with a PC, download Fiddler2 which will allow you to see the transactions required when you log in.
Edit.. Ok. Obviously you have a PC.
Related
I am scraping search results from google using people_also_ask module. The module itself dont have method to use proxies but I manually added proxies in the module. When I got blocked from google I printed the status and it was printing my ip address was banned from sending requests. The code I added in people_also_ask module to use proxies is
proxies = {
'http' : "http://username:passward#ip:port"
}
response = SESSION.get(URL, params=params, headers=HEADERS, proxies=proxies)
.I know it is an illegal activity but I want to know why it happens for education purpose mainly. I think the code to extract the data is irrelevant so I am adding simple code to send request using people_also_ask module
import people_also_ask as paa
queries = ["how to boil eggs","how to make cake","price of poco f1","price of wooden table","best soap in us","how much tesla worth"]
for query in queries:
questions = paa.get_related_questions(query ,40)
Note: The changes are made in first function named search() of google.py of people_also_people module
Note: I am doing searchs from browser without any problem. why is google allowing me to use google but blocked from using the script
The answer is quite simple. Although it is a proxy service, it doesn't guarantee 100% anonymity. When you send the HTTP GET request via the proxy server, the request sent by your program to the proxy server is:
GET http://www.whatsmybrowser.org/ HTTP/1.1
Host: www.whatsmybrowser.org
Connection: keep-alive
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.10.0
Now, when the proxy server sends this request to the actual destination, it sends:
GET http://www.whatsmybrowser.org/ HTTP/1.1
Host: www.whatsmybrowser.org
Accept-Encoding: gzip, deflate
Accept: */*
User-Agent: python-requests/2.10.0
Via: 1.1 naxserver (squid/3.1.8)
X-Forwarded-For: 122.126.64.43
Cache-Control: max-age=18000
Connection: keep-alive
As you can see, it throws your IP (in my case, 122.126.64.43) in the HTTP header: X-Forwarded-For and hence the website knows that the request was sent on behalf of 122.126.64.43
Read more about this header at: https://www.rfc-editor.org/rfc/rfc7239
If you want to host your own squid proxy server and want to disable setting X-Forwarded-For header, read: http://www.squid-cache.org/Doc/config/forwarded_for/
I dont get any credit for the answer I copied this answer from the following post I found Python Requests module - proxy not working
I'm using python requests to send http requests to www.fredmeyer.com
I can't even get past an initial get request to this domain. doing a simple requests.get results in the connection hanging and never timing out. i've verified i have access to this domain and am able to run the request on my local machine. can anyone replicate
The site seems to have some filtering enabled to prohibit bots or similar. The following HTTP request works currently with the site:
GET / HTTP/1.1
Host: www.fredmeyer.com
Connection: keep-alive
Accept: text/html
Accept-Encoding:
If the Connection header is removed or its value changed to close it will hang. If the (empty) Accept-Encoding header is missing it will also hang. If the Accept line is missing it will return 403 Forbidden.
In order to access this site with requests the following currently works for me:
import requests
headers = { 'Accept':'text/html', 'Accept-Encoding': '', 'User-Agent': None }
resp = requests.get('https://www.fredmeyer.com', headers=headers)
print(resp.text)
Note that the heuristics used by the site to detect bots might change, so this might stop working in the future.
I need to log into a website using python but the login page requires a sessionID cookie in the request header. Using Google developer tools along with a webclient(hurl.it), I was able to determine the required format of the request header that is acceptable by the webserver:
Accept: */*
Accept-Encoding: gzip, deflate
Content-Length: 85
Content-Type: application/x-www-form-urlencoded
Cookie: www_amsterdam-dance-event_nl_session=l9Abno8a1UyHPof%2BOyVqk8BxHjesGMi78z6Ot0ZXCCbI%2BxVKqjm30ALTfW%2FR7yKcDaqfEtFOyysTrjIeU8lU5ylv1TOlW6GLHY8jDfKKWSULKsUUJiTh92DbvkuYBuE6zt%2FeLs44lDna6Nz3uMCOaSARN7gCpoSz0TOcFaes8Hk9q6FikP1F9e%2B%2FsMwfUP0RTA0Rc5gJFyJPxHXNCdn%2BT49mhHYnzoIWVlxGHhlaEkZX1PPsYx1xq0BCgpb0WnPViuiZiBnQY2nz%2BBO4Uur0WPNfpSSWZg5Qxz79nYeChlRe16JhYjVOdaiUhnfEvp1jM7h%2BCdR6cUeatd7HGbftRCjINDrVuPeyB5ltVihStmzKEjOmWetI0xNuaNswsPIKKuo%2BV6JFNfdLcA6h3iy1K8o%2FA49tKGMP2rmGe4e5Jec%3Df395212364d1ffc80cf95ebf5abf3b40f9dc6441;
User-Agent: runscope/0.1
email=******%40beatswitch.com&login_token=545a46230b291&password=*****&submission=
I have produced the following request using Python requests module:
POST /my-ade/login/ HTTP/1.1
Host: www.amsterdam-dance-event.nl
Content-Length: 85
Accept-Encoding: gzip,deflate
Accept: */*
User-Agent: runscope/0.1
Connection: keep-alive
Cookie: www_amsterdam-dance-event_nl_session=l9Abno8a1UyHPof%2BOyVqk8BxHjesGMi78z6Ot0ZXCCbI%2BxVKqjm30ALTfW%2FR7yKcDaqfEtFOyysTrjIeU8lU5ylv1TOlW6GLHY8jDfKKWSULKsUUJiTh92DbvkuYBuE6zt%2FeLs44lDna6Nz3uMCOaSARN7gCpoSz0TOcFaes8Hk9q6FikP1F9e%2B%2FsMwfUP0RTA0Rc5gJFyJPxHXNCdn%2BT49mhHYnzoIWVlxGHhlaEkZX1PPsYx1xq0BCgpb0WnPViuiZiBnQY2nz%2BBO4Uur0WPNfpSSWZg5Qxz79nYeChlRe16JhYjVOdaiUhnfEvp1jM7h%2BCdR6cUeatd7HGbftRCjINDrVuPeyB5ltVihStmzKEjOmWetI0xNuaNswsPIKKuo%2BV6JFNfdLcA6h3iy1K8o%2FA49tKGMP2rmGe4e5Jec%3Df395212364d1ffc80cf95ebf5abf3b40f9dc6441;
Content-Type: application/x-www-form-urlencoded
login_token=545a46230b291&password=*****&email=******%40beatswitch.com&submission='
When I load the former request header with hurl.it, everything works perfectly and the webserver lets me log in but trying the almost-same request with the same parameters fails in python. While using python's request, the webserver presents an error page. Any help would be highly appreciated. I need a solution desperately.
EDIT:
Here is the code:
#Open the login page to get sessionID and login_token
loginURL = "https://www.amsterdam-dance-event.nl/my-ade/login/"
loginReq = session.get(loginURL)
loginSoup = BeautifulSoup(loginReq.text)
loginToken = loginSoup.find('input',attrs={'name':'login_token'})['value']
sessionID= loginReq.cookies['www_amsterdam-dance-event_nl_session']
cookie = 'www_amsterdam-dance-event_nl_session='+sessionID
#Construct the header and post it to the webserver
headers = {'Content-Length':'85','Accept':'*/*','User-Agent':' runscope/0.1','Content-Type':'application/x-www-form-urlencoded','Accept-Encoding':'gzip,deflate','Cookie':cookie}
payload = {'email':'*******#beatswitch.com','password':'********','login_token':loginToken,'submission':''}
loggedinReq = session.post(loginURL,headers=headers,data=payload)
I found the solution, thanks to Md. Mohsin. I was trying to handle the request headers and cookies manually while the requests module can handle them by itself. So I REMOVED the following line from the code and let requests take total control, and everything worked as intended:
headers = {'Content-Length':'85','Accept':'*/*','User-Agent':' runscope/0.1','Content-Type':'application/x-www-form-urlencoded','Accept-Encoding':'gzip,deflate','Cookie':cookie}
I'm trying to access pages from my company server with python. The first trail return 401: Unathorized(the server does need domain username/pwd for authentication). And the header content is as follow, and it seems to support 3 authentication protocols, Negotiate, NTLM and Digest, so in my understanding, I can choose any of them, right?
Content-Type: text/html
Server: Microsoft-IIS/7.0
WWW-Authenticate: Negotiate
WWW-Authenticate: NTLM
WWW-Authenticate: Digest qop="auth",algorithm=MD5-sess,nonce="+Upgraded+v184080dc2d18fe10d63520db505929b5b5b929ec98692ce010e80d6347b7a35d4027e59e277ac4fe1c257a95196071258a8e0797bf6129f76",charset=utf-8,realm="Digest"
X-Powered-By: ASP.NET
Date: Tue, 06 Aug 2013 09:24:44 GMT
Connection: close
Content-Length: 1293
Set-Cookie: LB-INFO=1065493258.20480.0000; path=/
I'm using following python codes, but still got 401 unanthorized error, can anybody tell me how can i achieve it? Should I use NTLM? Thanks in advance!
p = urllib2.HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, self.url, username, password)
handler = urllib2.HTTPDigestAuthHandler(p)
opener = urllib2.build_opener(handler)
urllib2.install_opener(opener)
f = opener.open(self.url)
Another very popular form of HTTP Authentication is Digest Authentication, and Requests supports this out of the box as well:
from requests.auth import HTTPDigestAuth
url = 'http://httpbin.org/digest-auth/auth/user/pass'
requests.get(url, auth=HTTPDigestAuth('user', 'pass'))
urllib2 is the python standard library, but not necessarily the best tool for HTTP Requests.
I would highly recommend checking out the requests package, and you can find an authentication tutorial here: http://docs.python-requests.org/en/latest/user/authentication/#digest-authentication
This question was posted on StackApps, but the issue may be more a programming issue than an authentication issue, hence it may deserve a better place here.
I am working on an desktop inbox notifier for StackOverflow, using the API with Python.
The script I am working on first logs the user in on StackExchange, and then requests authorisation for the application. Assuming the application has been authorised through web-browser interaction of the user, the application should be able to make requests to the API with authentication, hence it needs the access token specific to the user. This is done with the URL: https://stackexchange.com/oauth/dialog?client_id=54&scope=read_inbox&redirect_uri=https://stackexchange.com/oauth/login_success.
When requesting authorisation via the web-browser the redirect is taking place and an access code is returned after a #. However, when requesting this same URL with Python (urllib2), no hash or key is returned in the response.
Why is it my urllib2 request is handled differently from the same request made in Firefox or W3m? What should I do to programmatically simulate this request and retrieve the access_token?
Here is my script (it's experimental) and remember: it assumes the user has already authorised the application.
#!/usr/bin/env python
import urllib
import urllib2
import cookielib
from BeautifulSoup import BeautifulSoup
from getpass import getpass
# Define URLs
parameters = [ 'client_id=54',
'scope=read_inbox',
'redirect_uri=https://stackexchange.com/oauth/login_success'
]
oauth_url = 'https://stackexchange.com/oauth/dialog?' + '&'.join(parameters)
login_url = 'https://openid.stackexchange.com/account/login'
submit_url = 'https://openid.stackexchange.com/account/login/submit'
authentication_url = 'http://stackexchange.com/users/authenticate?openid_identifier='
# Set counter for requests:
counter = 0
# Build opener
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
def authenticate(username='', password=''):
'''
Authenticates to StackExchange using user-provided username and password
'''
# Build up headers
user_agent = 'Mozilla/5.0 (Ubuntu; X11; Linux i686; rv:8.0) Gecko/20100101 Firefox/8.0'
headers = {'User-Agent' : user_agent}
# Set Data to None
data = None
# 1. Build up URL request with headers and data
request = urllib2.Request(login_url, data, headers)
response = opener.open(request)
# Build up POST data for authentication
html = response.read()
fkey = BeautifulSoup(html).findAll(attrs={'name' : 'fkey'})[0].get('value').encode()
values = {'email' : username,
'password' : password,
'fkey' : fkey}
data = urllib.urlencode(values)
# 2. Build up URL for authentication
request = urllib2.Request(submit_url, data, headers)
response = opener.open(request)
# Check if logged in
if response.url == 'https://openid.stackexchange.com/user':
print ' Logged in! :) '
else:
print ' Login failed! :( '
# Find user ID URL
html = response.read()
id_url = BeautifulSoup(html).findAll('code')[0].text.split('"')[-2].encode()
# 3. Build up URL for OpenID authentication
data = None
url = authentication_url + urllib.quote_plus(id_url)
request = urllib2.Request(url, data, headers)
response = opener.open(request)
# 4. Build up URL request with headers and data
request = urllib2.Request(oauth_url, data, headers)
response = opener.open(request)
if '#' in response.url:
print 'Access code provided in URL.'
else:
print 'No access code provided in URL.'
if __name__ == '__main__':
username = raw_input('Enter your username: ')
password = getpass('Enter your password: ')
authenticate(username, password)
To respond to comments below:
Tamper data in Firefox requests the above URL (as oauth_url in the code) with the following headers:
Host=stackexchange.com
User-Agent=Mozilla/5.0 (Ubuntu; X11; Linux i686; rv:9.0.1) Gecko/20100101 Firefox/9.0.1
Accept=text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language=en-us,en;q=0.5
Accept-Encoding=gzip, deflate
Accept-Charset=ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection=keep-alive
Cookie=m=2; __qca=P0-556807911-1326066608353; __utma=27693923.1085914018.1326066609.1326066609.1326066609.1; __utmb=27693923.3.10.1326066609; __utmc=27693923; __utmz=27693923.1326066609.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); gauthed=1; ASP.NET_SessionId=nt25smfr2x1nwhr1ecmd4ok0; se-usr=t=z0FHKC6Am06B&s=pblSq0x3B0lC
In the urllib2 request the header provides the user-agent value only. The cookie is not passed explicitly, but the se-usr is available in the cookie jar at the time of the request.
The response headers will be first the redirect:
Status=Found - 302
Server=nginx/0.7.65
Date=Sun, 08 Jan 2012 23:51:12 GMT
Content-Type=text/html; charset=utf-8
Connection=keep-alive
Cache-Control=private
Location=https://stackexchange.com/oauth/login_success#access_token=OYn42gZ6r3WoEX677A3BoA))&expires=86400
Set-Cookie=se-usr=t=kkdavslJe0iq&s=pblSq0x3B0lC; expires=Sun, 08-Jul-2012 23:51:12 GMT; path=/; HttpOnly
Content-Length=218
Then the redirect will take place through another request with the fresh se-usr value from that header.
I don't know how to catch the 302 in urllib2, it handles it by itself (which is great). It would be nice however to see if the access token as provided in the location header would be available.
There's nothing special in the last response header, both Firefox and Urllib return something like:
Server: nginx/0.7.65
Date: Sun, 08 Jan 2012 23:48:16 GMT
Content-Type: text/html; charset=utf-8
Connection: close
Cache-Control: private
Content-Length: 5664
I hope I didn't provide confidential info, let me know if I did :D
The token does not appear because of the way urllib2 handles the redirect. I am not familiar with the details so I won't elaborate here.
The solution is to catch the 302 before the urllib2 handles the redirect. This can be done by sub-classing the urllib2.HTTPRedirectHandler to get the redirect with its hashtag and token. Here is a short example of subclassing the handler:
class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
def http_error_302(self, req, fp, code, msg, headers):
print "Going through 302:\n"
print headers
return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)
In the headers the location attribute will provide the redirect URL in full length, i.e. including the hashtag and token:
Output extract:
...
Going through 302:
Server: nginx/0.7.65
Date: Mon, 09 Jan 2012 20:20:11 GMT
Content-Type: text/html; charset=utf-8
Connection: close
Cache-Control: private
Location: https://stackexchange.com/oauth/login_success#access_token=K4zKd*HkKw5Opx(a8t12FA))&expires=86400
Content-Length: 218
...
More on catching redirects with urllib2 on StackOverflow (of course).