Python: authenticate and launch a private page using webbrowser, urllib and CookieJar

I want to log in with a cookiejar and launch not the login page but a page that can only be seen after authenticating. I know mechanize does that, but besides not working for me right now, I'd rather do this without it. Now I have:
import urllib, urllib2, cookielib, webbrowser

username = 'my_username'
password = 'my_password'
url = 'my_login_page'

# build an opener that keeps the session cookie returned by the login POST
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# the dict keys must match the login form's field names
login_data = urllib.urlencode({'my_username': username, 'my_password': password})
opener.open(url, login_data)

# this opens the system browser, which knows nothing about the cookies in cj
page_to_launch = 'my_authenticated_url'
webbrowser.open(page_to_launch, new=1, autoraise=1)
I am either able to log in and dump the authenticated page to stdout, or to launch the login page in a browser that doesn't recognize the cookie, but I am not able to launch the page I want after logging in. Help appreciated.

You could use the selenium module to do this. It starts a browser (Chrome, Firefox, IE, etc.) with an extension loaded into it that allows you to control the browser.
Here's how you load cookies into it:
from selenium import webdriver
driver = webdriver.Firefox() # open the browser
# Go to the correct domain
driver.get("http://www.example.com")
# Now set the cookie. Here's one for the entire domain.
# The cookie name here is 'key' and its value is 'value'.
driver.add_cookie({'name': 'key', 'value': 'value', 'path': '/'})
# additional keys that can be passed in are:
# 'domain' -> String,
# 'secure' -> Boolean,
# 'expiry' -> seconds since the Unix epoch at which the cookie expires.
# finally we visit the hidden page
driver.get('http://www.example.com/secret_page.html')

Your cookies aren't making it to the browser.
webbrowser has no facility for passing the cookies stored in your CookieJar instance to the browser it launches. It is simply a generic interface for launching a browser with a URL. You will either have to write the cookies into the browser's own cookie store yourself (almost certainly no small task) or use an alternative library that solves this problem for you.
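If selenium (suggested in the other answer) is an option, one workaround is to log in with urllib2 exactly as you do now and then copy the cookies from your CookieJar into a selenium-driven browser. A minimal sketch, assuming cj is the CookieJar populated by your login code and www.example.com stands in for your site:
from selenium import webdriver

driver = webdriver.Firefox()
# selenium only lets you set cookies for the domain currently loaded,
# so visit the site once before adding them
driver.get('http://www.example.com')
for c in cj:
    cookie = {'name': c.name, 'value': c.value, 'path': c.path}
    if c.expires:
        cookie['expiry'] = c.expires  # cookielib stores seconds since the epoch
    driver.add_cookie(cookie)
# the browser now carries the authenticated session
driver.get('http://www.example.com/my_authenticated_url')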

Related

Using python-requests to access a site with a captcha

I've searched the web on how to access a website using requests; essentially, the site asks the user to complete a captcha form before they can access it.
As of now, I understand the process should be:
visit the site using selenium:
import pickle
from selenium import webdriver
browser = webdriver.Chrome('chromedriver.exe')
browser.get('link-to-site')
complete the captcha form
save the cookies from that selenium session (since somehow these cookies will contain data showing that you've completed the captcha):
input('cookies ready ?')
pickle.dump(browser.get_cookies(), open("cookies.pkl", "wb"))
open a requests session and get the site:
import requests
session = requests.session()
r = session.get('link-to-site')
then load the cookies in:
with open('cookies.pkl', 'rb') as f:
    cookies = pickle.load(f)
for cookie in cookies:
    session.cookies.set(cookie['name'], cookie['value'])
But I'm still unable to access the site, so I'm assuming the Google captcha hasn't been solved as far as requests is concerned.
There must be a correct way to go about this; what am I missing?
You need to request the site after setting the cookies; otherwise, the response is what it would be without any cookies. That said, you will normally need to submit the form with selenium first and then collect the cookies, since a captcha doesn't normally set a cookie just by being solved.
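In other words, the order matters: create the session, load the saved cookies into it, and only then request the page. A minimal sketch of that order, assuming the cookies were pickled as in the question and 'link-to-site' stands in for the real URL:
import pickle
import requests

session = requests.session()
# load the cookies saved from the selenium session *before* the first request
with open('cookies.pkl', 'rb') as f:
    for cookie in pickle.load(f):
        session.cookies.set(cookie['name'], cookie['value'],
                            domain=cookie.get('domain'), path=cookie.get('path', '/'))
r = session.get('link-to-site')  # this request now carries the cookies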

How can I log in to this page and read it?

I know there are a lot of questions about this matter, and I have tried most of them.
My goal is to get the article from this page and use it in GAE.
If I try to log in, it redirects to a long URL; after I log in there, it redirects back to the article.
First I tried urllib2, as mentioned in "how to login to a website with python and mechanize", and it didn't work.
Then I took the SelectLoginForm and login functions from https://github.com/cdhigh/KindleEar/blob/master/books/base.py and they didn't work either.
selenium won't work because I'm going to use this in GAE, and I guess GAE can't support selenium.
I started looking into the mechanize module. My current code is:
# -*- coding: cp1254 -*-
import cookielib
import urllib2
import mechanize
# use a single Browser instance throughout
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [("User-agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")]
br.open('https://hurpass.com/iframe/login?appkey=52da7ef64037f9497f0acb091390051062215&secret=52da7f0c4037f9497f0acb0b1390051084754&domain=sosyal.hurriyet.com.tr&callback_url=http://sosyal.hurriyet.com.tr/Account/AutoLogin?returnUrl=http://sosyal.hurriyet.com.tr/yazar/ahmet-hakan_131/baskanlik-diktatorluk-getirir-diyenleri-girtlaklamak-istiyorum_28116073&referer=http://sosyal.hurriyet.com.tr&user_page=http://sosyal.hurriyet.com.tr/Account/AutoLogin?returnUrl=http://sosyal.hurriyet.com.tr/yazar/ahmet-hakan_131/baskanlik-diktatorluk-getirir-diyenleri-girtlaklamak-istiyorum_28116073&is_mobile=0&session_timeout=0&is_vative=0&email=')
br.select_form(name='frm_login')
br["email"] = "tasklak#hotmail.com"
br["password"] = "123456"
br.submit(type="submit")
url = 'http://sosyal.hurriyet.com.tr/yazar/ahmet-hakan_131/baskanlik-diktatorluk-getirir-diyenleri-girtlaklamak-istiyorum_28116073'
last_response = br.response()
http_header_dict = last_response.info().dict
html_string_list = last_response.readlines()
html_data = "".join(html_string_list)
page = br.open(url)
print page.read().decode("UTF-8")
ha = open("test.html", 'w')
ha.write(html_data)
ha.close()
Again, I can't get this working, but if I open the HTML file it creates, it redirects to the logged-in article page. Could this be a mechanize redirection problem, or is it impossible to log in to this page?
Edit, after Mihail's answer:
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
user = 'tasklak#hotmail.com'
password = '123456'
xor_password = ''.join(chr(12 ^ ord(c)) for c in password)
auth_url = 'http://auth.hurriyet.com.tr/api/loginuser/{}/?{}'.format(user, xor_password)
url = 'http://www.hurriyet.com.tr/anasayfa/'
sessionidd = urllib2.urlopen(auth_url).read().split(',')[1].split('\"')[3]
print sessionidd
opener.open(url + ';ASPSESSIONID=' + sessionidd)
print cj
Edit 2:
sessionidd = urllib2.urlopen(auth_url).read().split(',')[1].split('\"')[3]
print sessionidd
opener.open(url)
k = 0
for a in cj:
    if k < 2:
        a.value = sessionidd
    k += 1
print cj
First of all, you should know that if there isn't a publicly available API for all of this, then scraping the site is very likely unwelcome by its owners, against their terms of service, and could even be illegal and punishable by law depending on where you live.
Unless mechanize can interpret JavaScript code (which I doubt, although I might be wrong), it's not going to be very helpful. However, skimming through the links you provided with Chrome's DevTools, it looks like you could implement what you want with a few pure urllib2 requests.
For example, when you log in for the first time, you'll see a GET request to the http://auth.hurriyet.com.tr/api/loginuser/tasklak#hotmail.com/?%3D%3E%3F89%3A URL, which includes your username and encoded password and returns some session IDs. The reason mechanize doesn't work is that the password is encoded by JavaScript code that is not interpreted when you submit the form from your script.
Going into the source code of the login form, you'll see that when the "Submit" button is clicked, a loginUser() function is called, and inside it the password is XORed with the following code:
for (i = 0; i < password.length; ++i) {
    encoded_password += String.fromCharCode(12 ^ password.charCodeAt(i));
}
which you would have to rewrite in Python. So, to receive the initial session IDs, you'd have something like:
import urllib2
user = 'tasklak#hotmail.com'
password = '123456'
xor_password = ''.join(chr(12 ^ ord(c)) for c in password)
auth_url = 'http://auth.hurriyet.com.tr/api/loginuser/{}/?{}'.format(user, xor_password)
print(urllib2.urlopen(auth_url).read())
It looks like you're then going to need to validate the session IDs you received and retrieve the session cookies, which you can then use to fetch full articles, but I will leave that to you.
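As a starting point, here's a hypothetical sketch of that last step: attach the returned session ID to the next request as a cookie. The cookie name ASPSESSIONID and the way session_id is obtained are assumptions taken from the question's own edit, not something verified against the site:
import urllib2

# hypothetical: session_id was parsed out of the auth_url response above,
# and ASPSESSIONID is assumed (from the question's edit) to be the cookie name
opener = urllib2.build_opener()
opener.addheaders.append(('Cookie', 'ASPSESSIONID=' + session_id))
response = opener.open('http://www.hurriyet.com.tr/anasayfa/')
print(response.read())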

Access to the cookies of the default browser

I want to write a program that opens the browser and opens a URL with a given cookie. I don't know how to do this; maybe I could modify the cookies in the browser's default storage location.
import urllib2

# attach the cookie by hand as a request header
opener = urllib2.build_opener()
opener.addheaders.append(('Cookie', 'cookiename=cookievalue'))
f = opener.open("http://example.com/")
Modules to look into:
urllib2
cookielib
Cookie
In Python, you can emulate a browser with the mechanize library. There is also good documentation about mechanize and cookies.
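For instance, a minimal mechanize sketch that collects the cookies it receives and saves them to disk; the file name cookies.txt is arbitrary:
import cookielib
import mechanize

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.open('http://example.com/')
# persist the cookies so a later run (or another tool) can reuse the session
cj.save('cookies.txt', ignore_discard=True, ignore_expires=True)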

Python auth_handler not working for me

I've been reading about Python urllib2's ability to open and read directories that are password protected, but even after looking at examples in the docs and here on Stack Overflow, I can't get my script to work.
import urllib2
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm=None,
                          uri='https://webfiles.duke.edu/',
                          user='someUserName',
                          passwd='thisIsntMyRealPassword')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
socks = urllib2.urlopen('https://webfiles.duke.edu/?path=/afs/acpub/users/a')
print socks.read()
socks.close()
When I print the contents, it prints the contents of the login screen that the URL I'm trying to open redirects you to. Anyone know why this is?
auth_handler is only for basic HTTP authentication. The site here uses an HTML form, so you'll need to submit your username/password as POST data.
I recommend using the mechanize module, which will simplify the login for you.
Quick example:
import mechanize
browser = mechanize.Browser()
browser.open('https://webfiles.duke.edu/?path=/afs/acpub/users/a')
browser.select_form(nr=0)  # select the first form on the page (the login form)
browser.form['user'] = 'username'
browser.form['pass'] = 'password'
req = browser.submit()
print req.read()

Link Checker (Spider Crawler)

I am looking for a link checker to spider my website and log invalid links; the problem is that there is a login page at the start which is required. What I want is a link checker that can be given login details, POST them, and then spider the rest of the website.
Any ideas will be appreciated.
I've just recently solved a similar problem like this:
import urllib
import urllib2
import cookielib
login = 'user#host.com'
password = 'secret'
cookiejar = cookielib.CookieJar()
urlOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))
# adjust this to match the form's field names
values = {'username': login, 'password': password}
data = urllib.urlencode(values)
request = urllib2.Request('http://target.of.POST-method', data)
response = urlOpener.open(request)
# from now on, we're authenticated and we can access the rest of the site
response = urlOpener.open('http://rest.of.user.area')
You want to look at the cookielib module: http://docs.python.org/library/cookielib.html. It provides a full implementation of cookies, which will let you store login details. Once you're using a CookieJar, you just have to get the login details from the user (say, from the console) and submit a proper POST request.
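Building on the snippet above, a small sketch of prompting for credentials on the console; it assumes the urlOpener, the target URL, and the form field names from the earlier example:
import getpass
import urllib

# assumptions: urlOpener and the form field names come from the snippet above
login = raw_input('Username: ')
password = getpass.getpass('Password: ')
data = urllib.urlencode({'username': login, 'password': password})
urlOpener.open('http://target.of.POST-method', data)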
