Using the Python requests module to create an authenticated session in GitHub

My goal is to create an authenticated session on GitHub so I can use the advanced search (which limits functionality for non-authenticated users). Currently I am getting a webpage response to the POST request of "What? Your browser did something unexpected. Please contact us if the problem persists."
Here is the code I am using to try to accomplish my task.
import requests
from lxml import html

s = requests.Session()
payload = (username, password)  # username and password defined elsewhere
_ = s.get('https://www.github.com/login')
p = s.post('https://www.github.com/login', auth=payload)

url = "https://github.com/search?l=&p=0&q=language%3APython+extension%3A.py+sklearn&ref=advsearch&type=Code"
r = s.get(url, auth=payload)
text = r.text
tree = html.fromstring(text)
Is what I'm trying possible? I would prefer not to use the GitHub v3 API, since it is rate-limited and I wanted to do more of my own scraping of the advanced search. Thanks.

As mentioned in the comments, GitHub uses POST data for authentication, so you should put your credentials in the data parameter.
The fields you have to submit are 'login', 'password', and 'authenticity_token'. The value of 'authenticity_token' is dynamic, but you can scrape it from '/login'.
Finally, submit the data to /session and you should have an authenticated session.
import requests
from lxml import html  # note: cssselect() also requires the cssselect package

s = requests.Session()
r = s.get('https://www.github.com/login')
tree = html.fromstring(r.content)
# Collect every input field on the login page, including the hidden
# authenticity_token, then overwrite the credential fields.
data = {i.get('name'): i.get('value') for i in tree.cssselect('input')}
data['login'] = username
data['password'] = password
r = s.post('https://github.com/session', data=data)
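As a quick sanity check, assuming the credentials are valid, you can re-run the search from the question through the same session. The 'logout' test below is only a rough heuristic, since GitHub's markup can change:
url = "https://github.com/search?l=&p=0&q=language%3APython+extension%3A.py+sklearn&ref=advsearch&type=Code"
r = s.get(url)
print(r.status_code)               # expect 200
print('logout' in r.text.lower())  # crude check that the page shows a logged-in state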

Related

How To Authenticate HumbleBundle

I want to write a program to automatically download my Humble Bundle purchases, but I'm struggling to log in to the site. I thought that it should be a pretty straightforward process:
import requests

LOGIN_URL = "https://www.humblebundle.com/processlogin"
data = {
    "username": "username",
    "password": "top_secret",
}
session = requests.Session()
session.params.update({"ajax": "true"})
response = session.post(LOGIN_URL, data=data)
json = response.json()
print(json)
But I get a rather unhelpful failure message:
{'errors': {'_all': ['Invalid request.']}, 'success': False}
What am I doing wrong?
I don't think that it's going to let you do that. If I had to guess, you're going to have to use OAuth.
Humble Bundle uses a CAPTCHA to ensure only humans log in. Only logged-in users seem to be able to retrieve information about their purchases (I have not found another way to authenticate myself).
By design, a CAPTCHA prevents scripts from logging in. My best suggestion is to log in with a regular web browser and store the value of the cookie called '_simpleauth_sess' locally. You can use that to retrieve data as if you're logged in.
Here is an example with the requests library, which the OP uses:
import requests

cookies = dict(_simpleauth_sess='easAFa9afas.......32|32u8')
url = 'https://www.humblebundle.com/api/v1/user/order'
r = requests.get(url, cookies=cookies)
print(r.text)
Or a bit more complex:
session = requests.Session()
session.cookies.set('_simpleauth_sess', 'easAFa9afas.......32|32u8',
                    domain='humblebundle.com', path='/')
r = session.get('https://www.humblebundle.com/api/v1/user/order')
for order_id in [v['gamekey'] for v in r.json()]:
    url = 'https://www.humblebundle.com/api/v1/order/{}?wallet_data=true&all_tpkds=true'.format(order_id)
    r = session.get(url)
    ...

Setting up a login with python requests for indeed.com

I'm trying to write a resume searcher for www.indeed.com (there's no API for resumes, unfortunately). Specifically, I need to provide login details (to get names from resumes). The login page is here:
https://secure.indeed.com/account/login
I was following a guide here: https://kazuar.github.io/scraping-tutorial/
My code so far is:
import requests
from lxml import html

session_requests = requests.session()
login_url = "https://secure.indeed.com/account/login"
result = session_requests.get(login_url)
tree = html.fromstring(result.text)
payload = {
    '_email': 'my@email.com',
    '_password': 'mypassword'
}
result = session_requests.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url)
)
This doesn't seem to work quite right. First off, I think I'm missing some authentication tokens. After inspecting the login page, I think it might be the "surftok" field, but I'm not completely sure. Is this even possible with just the requests module, or will I need Selenium or mechanize to make this work?
You're missing multiple data fields.
This worked for me:
import requests

data = {
    'action': 'Login',
    '__email': 'Your Email',
    '__password': 'Your password',
    'remember': '1',
    'hl': 'en',
    'continue': '/account/view?hl=en',
}
response = requests.post('https://secure.indeed.com/account/login', data=data)
print(response)  # <Response [200]> on success
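If Indeed does require a per-session token such as the "surftok" field the OP mentions, you can combine this with the hidden-input scraping shown in the GitHub answer above. A sketch, with field names taken from the question and answer rather than verified against the live form:
import requests
from lxml import html

s = requests.Session()
login_url = 'https://secure.indeed.com/account/login'
tree = html.fromstring(s.get(login_url).text)
# Start from every hidden input (this would pick up 'surftok' if it exists),
# then fill in the visible fields from the answer above.
data = {i.get('name'): i.get('value')
        for i in tree.xpath('//input[@type="hidden"]')
        if i.get('name')}
data.update({
    'action': 'Login',
    '__email': 'Your Email',
    '__password': 'Your password',
    'remember': '1',
    'hl': 'en',
})
response = s.post(login_url, data=data)
print(response.status_code)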

Retrieve OAuth code in redirect URL provided as POST response

Python newbie here, so I'm sure this is a trivial challenge...
I'm using the Requests module to make a POST request to the Instagram API in order to obtain a code that is used later in the OAuth process to get an access token. The code is usually accessed on the client side, as it's provided at the end of the redirect URL.
I have tried using Requests' response history, like this (the client ID is altered for this post):
OAuthURL = "https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e67355f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code"
OAuth_AccessRequest = requests.post(OAuthURL)
ResHistory = OAuth_AccessRequest.history
for resp in ResHistory:
    print resp.status_code, resp.url
print OAuth_AccessRequest.status_code, OAuth_AccessRequest.url
But the URLs this returns do not reveal the code string; instead, the redirect just looks like this:
302 https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e67355f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code
200 https://instagram.com/accounts/login/?force_classic_login=&next=/oauth/authorize/%3Fclient_id%3Dcb0096f08a3848e67355f%26redirect_uri%3Dhttps%3A//www.smashboarddashboard.com/whathappened%26response_type%3Dcode
whereas if you do this on the client side, in a browser, code would be replaced with the actual number string.
Is there a method or approach I can add to the POST request that will allow me to have access to the actual redirect URL string that appears in the web browser?
It should work in a browser if you are already logged in at Instagram. If you are not logged in, you are redirected to a login page:
https://instagram.com/accounts/login/?force_classic_login=&next=/oauth/authorize/%3Fclient_id%3Dcb0096f08a3848e67355f%26redirect_uri%3Dhttps%3A//www.smashboarddashboard.com/whathappened%26response_type%3Dcode
Your Python client is not logged in, so it is also redirected to Instagram's login page, as shown by the value of OAuth_AccessRequest.url:
>>> import requests
>>> OAuthURL = "https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e67355f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code"
>>> OAuth_AccessRequest = requests.get(OAuthURL)
>>> OAuth_AccessRequest
<Response [200]>
>>> OAuth_AccessRequest.url
u'https://instagram.com/accounts/login/?force_classic_login=&next=/oauth/authorize/%3Fclient_id%3Dcb0096f08a3848e67355f%26redirect_uri%3Dhttps%3A//www.smashboarddashboard.com/whathappened%26response_type%3Dcode'
So, to get to the next step, your Python client needs to log in. This requires the client to extract and set fields to be posted back to the same URL. It also requires cookies, and the Referer header must be properly set. There is a hidden CSRF token that must be extracted from the page (you could use BeautifulSoup, for example), and the form fields username and password must be set. So you would do something like this:
import requests
from bs4 import BeautifulSoup

OAuthURL = "https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e67355f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code"
session = requests.session()  # use a session to handle cookies
OAuth_AccessRequest = session.get(OAuthURL)
soup = BeautifulSoup(OAuth_AccessRequest.content, 'html.parser')
form = soup.form
# The form's first input is the hidden CSRF token; carry its name and value over.
login_data = {form.input.attrs['name']: form.input['value']}
login_data.update({'username': 'your username', 'password': 'your password'})
headers = {'Referer': OAuth_AccessRequest.url}
login_url = 'https://instagram.com{}'.format(form.attrs['action'])
r = session.post(login_url, data=login_data, headers=headers)
>>> r
<Response [400]>
>>> r.json()
{u'error_type': u'OAuthException', u'code': 400, u'error_message': u'Invalid Client ID'}
That looks like it will work once a valid client ID is provided.
As an alternative, you could look at mechanize, which will handle the form submission for you, including the hidden CSRF field:
import mechanize
OAuthURL = "https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e67355f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code"
br = mechanize.Browser()
br.open(OAuthURL)
br.select_form(nr=0)
br.form['username'] = 'your username'
br.form['password'] = 'your password'
r = br.submit()
response = r.read()
But this doesn't work because the Referer header is not being set. However, you could use this method if you can figure out a solution to that.
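One possible workaround, sketched here and not verified against Instagram: mechanize's addheaders attribute is a list of (name, value) pairs added to every request the browser makes, so you could set the Referer yourself before submitting:
import mechanize

OAuthURL = "https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e67355f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code"
br = mechanize.Browser()
br.open(OAuthURL)
br.select_form(nr=0)
br.form['username'] = 'your username'
br.form['password'] = 'your password'
# Assumption: sending the current page's URL as Referer satisfies the check
br.addheaders = [('Referer', br.geturl())]
r = br.submit()
response = r.read()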

Unable to retain login credentials across pages while using requests

I am pretty new to using the urllib and requests modules in Python. I am trying to access a wiki page on my company's website that requires me to provide my login credentials through a pop-up window when I access it through a browser.
I was able to write the following script, which successfully accesses and reads the page:
import sys
import urllib.parse
import urllib.request
import getpass
import http.cookiejar

wiki_page = 'http://wiki.company.com/wiki_page'
top_level_url = 'http://login.company.com/'
username = input("Enter Username: ")
password = getpass.getpass('Enter Password: ')

# Authenticate with login server and fetch the wiki page
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
cj = http.cookiejar.CookieJar()
password_mgr.add_password(None, top_level_url, username, password)
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj), handler)
opener.open(wiki_page)
urllib.request.install_opener(opener)
with urllib.request.urlopen(wiki_page) as response:
    # Do something
    ...
But now I need to use the requests module to do the same. I have tried several approaches, including sessions, but could not get it to work. The following is the code that I think is closest to the actual solution, but it gives Response 200 for the first print and Response 401 for the second:
s = requests.Session()
# I have tried s.post() as well as s.get() on the next line
print(s.post('http://login.company.com/', auth=(username, password)))
print(s.get('http://wiki.company.com/wiki_page'))
The site uses the Basic Auth authorization scheme; you'll need to send the login credentials with each request.
Set the Session.auth attribute to a tuple with the username and password on the session:
s = requests.Session()
s.auth = (username, password)
response = s.get('http://wiki.company.com/wiki_page')
print(response.text)
The urllib.request.HTTPPasswordMgrWithDefaultRealm() object would normally respond only to challenges on URLs that start with http://login.company.com/ (any deeper path will do too) and would not send the password elsewhere.
If the simple approach (setting Session.auth) doesn't work, you'll need to find out what response is returned by accessing http://wiki.company.com/wiki_page directly, which is what your original code does. If the server redirects you to a login page, where you then use the Basic Auth information, you can replicate that:
s = requests.Session()
response = s.get('http://wiki.company.com/wiki_page', allow_redirects=False)
if response.status_code in (302, 303):
    target = response.headers['location']
    authenticated = s.get(target, auth=(username, password))
    # continue on to the wiki again
    response = s.get('http://wiki.company.com/wiki_page')
You'll have to investigate carefully what responses you get from the server. Open an interactive console and see what comes back. Look at response.status_code, response.headers, and response.text for hints. If you leave allow_redirects at its default of True, look at response.history to see whether there were any intermediate redirects.
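For example, a minimal diagnostic pass (just a sketch of the inspection described above, using the same wiki URL) could look like:
s = requests.Session()
s.auth = (username, password)
response = s.get('http://wiki.company.com/wiki_page')

print(response.status_code)                      # 401 means the credentials never took effect
print(response.headers.get('WWW-Authenticate'))  # names the auth scheme the server expects
for hop in response.history:                     # any intermediate redirects
    print(hop.status_code, hop.url)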

python-requests - can't login

I'm trying to scrape some data, but first I need to log in. I am trying to use python-requests, and here is my code so far:
import requests

login_url = "https://www.wehelpen.nl/login/"
users_url = "https://www.wehelpen.nl/ik-zoek-hulp/hulpprofielen/"
profile_url = "https://www.wehelpen.nl/profiel/01136/hulpvragen/"
uname = "****"
pword = "****"

def main():
    s = login(uname, pword, login_url)
    page = s.get(users_url)
    print makeUTF8(page.text)  # grab html and grep for logged in text to make sure!

def login(uname, pword, url):
    s = requests.session()
    s.get(url, auth=(uname, pword))
    csrftoken = s.cookies['csrftoken']
    login_data = dict(username=uname, password=pword,
                      csrfmiddlewaretoken=csrftoken, next='/')
    s.post(url, data=login_data, headers=dict(Referer=url))
    return s

def makeUTF8(text):
    return text.encode('utf-8')
Basically, I need to log in at login_url with a POST request (using a CSRF token, because I get an error otherwise); then, using the session object returned by login(), I want to check that I am logged in by making a GET request to a user page. On the returned page.text I can run a grep command to check for a certain href that tells me whether I am logged in.
So far I am unable to log in and keep a working session object. Can anyone help me? This has been the most tedious Python experience of my life.
EDIT: I have searched, searched, and searched SO for answers and nothing is working...
You need to use the correct names for the dictionary keys. The keys must match the name attributes of the form's input elements; in your case those names are identification and password.
login_data = dict(identification=uname, password=pword, ...)
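Putting that back into the OP's login(), a sketch in which only the dictionary keys change and everything else is the OP's own code:
def login(uname, pword, url):
    s = requests.session()
    s.get(url, auth=(uname, pword))
    csrftoken = s.cookies['csrftoken']
    # 'identification' and 'password' match the form's input names
    login_data = dict(identification=uname, password=pword,
                      csrfmiddlewaretoken=csrftoken, next='/')
    s.post(url, data=login_data, headers=dict(Referer=url))
    return s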
There are lots of options, but I have had success using cookielib instead of trying to "manually" handle the cookies.
import urllib2
import cookielib
cookiejar = cookielib.CookieJar()
cookiejar.clear()
urlOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))
# ...etc...
Some potentially relevant answers on getting this set up are on SO, including: https://stackoverflow.com/a/5826033/1681480
