I've read through dozens of pages on how to log into a web page using Python, but I can't seem to make my code work. I'm trying to log into a site called "Voobly", and I'm wondering if there might be something specific to Voobly that is making this more difficult. Here is my code:
import requests
loginURL = "https://www.voobly.com/login"
matchUrl = "https://www.voobly.com/profile/view/124993231/Matches"
s = requests.session()
loginInfo = {"username":"myUsername", "password":"myPassword"}
firstGetRequest = s.get(loginURL) # Get the login page using our session so we save the cookies
postRequest = s.post(loginURL,data=loginInfo) # Post data to the login page, the data being my login information
getRequest = s.get(matchUrl) # Get content from a login - restricted page
response = getRequest.content.decode() # Get the actual html text from restricted page
if "Page Access Failed" in response: # True if I'm blocked
print("Failed")
else: # If I'm not blocked, I have the result I want
print("Worked!") # I can't achieve this
As mentioned in the comments, the login form is submitted to /login/auth, but the cookie is generated from the /login URL.
Use the following code:
form = {'username': USERNAME, 'password': PASSWORD}

with requests.Session() as s:
    # Get the cookie
    s.get('https://www.voobly.com/login')
    # Post the login form data
    s.post('https://www.voobly.com/login/auth', data=form)
    # Go to home page
    r = s.get('https://www.voobly.com/welcome')
    # Check if username is in response.text
    print(USERNAME in r.text)
    # True
    r2 = s.get('https://www.voobly.com/profile/view/124993231/Matches')
    if "Page Access Failed" in r2.text:
        print("Failed")
    else:
        print("Worked!")
    # Worked!
Note: The "Go to home page" part is not needed for the login at all; it's there just to show that the login was successful.
I am attempting to download a zip file from a website served over an https:// link. I have tried the following, but I can't seem to get any output. Could anyone suggest what I might be doing wrong?
URL = www.somewebsite.com
Download zip file = www.somewebsite.com/output/revisionId=40687821$$Xiiy75&action_id=
import requests
url = 'http://somewebsite.org'
user, password = 'bob', 'I love cats'
resp = requests.get(url, auth=(user, password))
To download a file from a url (passing auth for HTTP basic auth if the url is protected), do something like:
import requests
url = 'http://somewebsite.org'
user, password = 'bob', 'I love cats'
resp = requests.get(url, auth=(user, password))
with open("result.zip", "wb") as fout:
fout.write(resp.content)
Of course you should check whether you got a valid response before writing the zip file.
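For example, a minimal sketch of such a check, reusing the placeholder URL and credentials from the snippet above:
import requests

url = 'http://somewebsite.org'
resp = requests.get(url, auth=('bob', 'I love cats'))
# raise_for_status() raises requests.HTTPError for any 4xx/5xx response
resp.raise_for_status()
# optionally make sure we actually received a zip, not an HTML error page
if 'zip' in resp.headers.get('Content-Type', ''):
    with open("result.zip", "wb") as fout:
        fout.write(resp.content)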
For a considerable number of websites with a login, the following recipe will work.
However, if asite.com uses a lot of javascript, this might not necessarily work.
Use a requests session in order to store any session cookies, and perform the following three steps:
1. GET the login url. This will get potential session cookies or CSRF protection cookies.
2. POST to the login url with the username and password. The names of the form fields to be posted depend on the page; use your web browser in debug mode to learn the right values to post. There can be more parameters than just username and password.
3. GET the document url and save the result to a file.
On Firefox, for example, you go to the website you want to log into, press F12 (for debug mode), click on the network tab and then on reload. Then fill in the login form, submit it, and look in the debug panel for a POST request.
The generic Python code would look like this:
import requests

def login_and_download(login_page_url, login_post_url, form_data, document_url):
    ses = requests.session()
    # Step 1: get the login page; any potentially required cookie will be set
    rslt = ses.get(login_page_url)
    if rslt.status_code != 200:
        print("failed getting login page")
        return False
    # For simple pages you can proceed directly to the login POST.
    # For a little more complicated pages you might have to parse the HTML
    # first (e.g. to extract a CSRF token).
    # For really annoying pages that use loads of javascript it might be
    # even more complicated.
    # Step 2: perform a post request to log in.
    # login_post_url and form_data depend on the site you want to connect
    # to; you have to analyze its login procedure (see above).
    rslt = ses.post(login_post_url, data=form_data)
    if rslt.status_code != 200:
        print("failed logging in")
        return False
    # Step 3: download the url that you want to get
    rslt = ses.get(document_url)
    if rslt.status_code != 200:
        print("failed fetching the file")
        return False
    with open("result.zip", "wb") as fout:
        fout.write(rslt.content)
    return True
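A call might then look like this; all URLs and field names below are hypothetical, so substitute the ones you found in the debug panel:
ok = login_and_download(
    login_page_url="https://www.asite.com/login-home",
    login_post_url="https://www.asite.com/login",          # hypothetical endpoint
    form_data={"username": "bob", "password": "secret"},   # field names vary per site
    document_url="https://www.asite.com/output/file.zip",  # hypothetical document url
)
print("download succeeded" if ok else "download failed")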
I'm new to web scraping and I just couldn't find the solution to my problem.
I'm stuck at the login page.
import requests
POST_LOGIN_URL = 'https://ocjene.skole.hr/pocetna/prijava' # Login page
REQUEST_URL = 'https://ocjene.skole.hr/pregled/predmeti' # Goal page for scraping
with requests.Session() as session:
    session.get(POST_LOGIN_URL)  # Loading all cookies...
    login_page = session.get(POST_LOGIN_URL)  # Login page content (for comparison)
    token = session.cookies["csrf_cookie"]  # This cookie in Chrome has a valid csrf token
    payload = {
        'csrf_token': token,
        'user_login': 'xxx',
        'user_password': 'xxx'
    }
    post = session.post(POST_LOGIN_URL, data=payload)  # Logging in...
    afterLogin = session.get(REQUEST_URL)  # This is where I need to get all the content, but...
    print(afterLogin.content)
    print(login_page.content)
    # These two share the exact same content, except the csrf token is different
I'm not sure if logging in was successful. I double-checked everything: the form data is correct, and I also tried replacing the request headers like so:
post = session.post(POST_LOGIN_URL, data=payload, headers=headers)
What am I missing? Thanks.
It looks like Chrome is posting to posalji/.
Also inspect post.content after the request; that should tell you if it was OK.
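A minimal sketch of that change, assuming the form posts to https://ocjene.skole.hr/pocetna/posalji/ (the exact endpoint is an assumption here; verify it in the network panel):
import requests

LOGIN_PAGE_URL = 'https://ocjene.skole.hr/pocetna/prijava'
LOGIN_POST_URL = 'https://ocjene.skole.hr/pocetna/posalji/'  # assumed; copy from the network panel

with requests.Session() as session:
    session.get(LOGIN_PAGE_URL)  # sets csrf_cookie
    payload = {
        'csrf_token': session.cookies['csrf_cookie'],
        'user_login': 'xxx',
        'user_password': 'xxx',
    }
    post = session.post(LOGIN_POST_URL, data=payload)
    print(post.status_code)
    print(post.content)  # inspect this to see whether the login was accepted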
I am attempting to scrape a website using the following code
import re
import requests

def get_csrf(page):
    matchme = r'name="csrfToken" value="(.*)" /'
    csrf = re.search(matchme, str(page))
    csrf = csrf.group(1)
    return csrf

def login():
    login_url = 'https://www.edline.net/InterstitialLogin.page'
    with requests.Session() as s:
        login_page = s.get(login_url)
        csrf = get_csrf(login_page.text)
        username = 'USER'
        password = 'PASS'
        login = {'screenName': username,
                 'kclq': password,
                 'csrfToken': csrf,
                 'TCNK': 'authenticationEntryComponent',
                 'submitEvent': '1',
                 'enterClicked': 'true',
                 'ajaxSupported': 'yes'}
        page = s.post(login_url, data=login)
        r = s.get("https://www.edline.net/UserDocList.page?")
        print(r.text)

login()
I log into https://www.edline.net/InterstitialLogin.page successfully, but the problem comes when I try to do
r = s.get("https://www.edline.net/UserDocList.page?")
print(r.text)
It doesn't print the expected page; instead it throws an error. Upon further testing I discovered that it throws this error even if you try to go directly to the page from a browser. When I investigated the page source, I found that the button linking to the page I'm trying to scrape is a javascript link labeled:
Private Reports
So essentially I am looking for a way to trigger that javascript code in Python in order to scrape the resulting page.
It is impossible to answer this question fully without more context than this single link.
However, the first thing you want to check, in the case of JavaScript-driven content generation, is the requests your web page makes when you click on that link.
To do this, take a look at the network panel in your browser's developer console. Record the requests being made, looking especially for XHR requests. Then you can try to replicate them, e.g. with the requests library:
response = requests.get('xhr-url')
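Fleshing that out a little; the XHR endpoint below is hypothetical, so use whatever URL the network panel actually shows:
import requests

# placeholder form data - in practice build it as in the question above
login_form = {'screenName': 'USER', 'kclq': 'PASS'}

with requests.Session() as s:
    # log in first so the session holds the authentication cookies
    s.post('https://www.edline.net/InterstitialLogin.page', data=login_form)
    # then replay the XHR request that the "Private Reports" link triggers;
    # this URL is hypothetical - copy the real one from the network panel
    r = s.get('https://www.edline.net/SomeXhrEndpoint.page',
              headers={'X-Requested-With': 'XMLHttpRequest'})
    print(r.text)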
I'm trying to retrieve info from a web page that is protected by username and password.
I went through authentication, sending a POST with my username and password. After that I get back a response object with a redirecting page. When I browse that page I see that I must wait a few seconds or click continue to get to the main page. My problem is how to skip this redirection or force the script to go straight to the main page.
import requests

main_url = 'https://my_main_page.com/edit#'
login = {
    'USER': 'username',
    'PASSWORD': 'password',
}
r = requests.post(main_url, data=login)
# here I now have the redirecting page: r.url is main_url,
# but in the page source I see the redirecting page, not my expected page
print r.url
print r.text
Try this:
r = requests.post(main_url, data=login, allow_redirects=False)
Maybe this can help you:
http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history
Maybe you need to use the request method instead, but the post method should also accept the same arguments as request:
http://docs.python-requests.org/en/latest/api/#requests.request
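As a sketch of what the linked docs describe, reusing the placeholder URL and form fields from the question above:
import requests

main_url = 'https://my_main_page.com/edit#'
login = {'USER': 'username', 'PASSWORD': 'password'}

# with allow_redirects=False you get the interstitial response itself
r = requests.post(main_url, data=login, allow_redirects=False)
print(r.status_code)               # e.g. 302 if the server redirects
print(r.headers.get('Location'))   # where the redirect points, if it's an HTTP redirect

# with redirects enabled (the default), the hops are kept in r.history
r = requests.post(main_url, data=login)
for hop in r.history:
    print(hop.status_code, hop.url)
print(r.url)  # the final url after following redirects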
I am trying to use Python 2.7.6 to log into a website. The login logic contains 2 steps on 2 webpages:
Putting the user ID and password onto page A, which gives back a cookie;
This cookie is used in the header to authenticate the login on page B.
The login only succeeds once B authenticates it.
There's a post here, HTTP POST and GET with cookies for authentication in python, asking a similar question. One solution is to use requests.
import requests
url_0 = "http://www.PAGE_A.com/" # http://webapp.pucrs.br/consulta/principal.jsp, in original example
url = "http://www.PAGE_B.com/" # https://webapp.pucrs.br/consulta/servlet/consulta.aluno.ValidaAluno, in original example
data = {"field1_username": "ABC", "field_password": "123"}
s = requests.session()
s.get(url_0)
r = s.post(url, data)
I tried using this in Python for my case and it doesn't return an error message, so I guess it works fine. But the question is: how do I know it's logged in?
I added the code below to print the logged-in page to see if it returned the right page.
import mechanize
br = mechanize.Browser()
open_page = br.open("http://www.PAGE_B.com/")
read_page = open_page.read()
print read_page
However, it still shows the contents from before login. What went wrong?
How about just going with one of the two? The requests session and the mechanize browser keep separate cookie jars, so a login performed with requests is invisible to mechanize. With mechanize alone:
import mechanize
browser = mechanize.Browser()
browser.addheaders = [('...')]
browser.open(YOUR_URL)
browser.select_form(FORM_NAME)
browser.form['USERNAME_FIELD'] = 'abc'
browser.form['PASSWORD_FIELD'] = 'password'
browser.submit()
print browser.response().read()
print browser.geturl()
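Alternatively, to stay with requests only, verify the login by fetching page B with the same session that performed the POST. A minimal sketch, reusing the placeholder URLs and field names from the question above:
import requests

url_0 = "http://www.PAGE_A.com/"  # login page that hands out the cookie
url = "http://www.PAGE_B.com/"    # page that authenticates the login
data = {"field1_username": "ABC", "field_password": "123"}

s = requests.session()
s.get(url_0)            # pick up the cookie from page A
r = s.post(url, data)   # authenticate against page B
page = s.get(url)       # fetch page B with the SAME session
print(page.text)        # should now show the post-login content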