I am trying to log in to a website using python and the requests module.
My problem is that I am still seeing the log in page even after I have given my username / password and am trying to access pages after the log in - in other words, I am not getting past the log in page, even though it seems successful.
I am learning that it can be a different process with each website and so it's not obvious what I need to add to fix the problem.
It was suggested that I download a web traffic snooper like Fiddler and then try to replicate the actions with my python script.
I have downloaded Fiddler, but I'm a little out of my depth with how I find and replicate the actions that I need.
Any help would be gratefully received.
My original code:
import requests
payload = {
    'login_Email': 'xxxxx#gmail.com',
    'login_Password': 'xxxxx'
}
with requests.Session() as s:
    p = s.post('https://www.auction4cars.com/', data=payload)
    print(p.text)
If you look at the browser developer tools, you may see that the login POST request needs to be submitted to a different URL:
https://www.auction4cars.com/Home/UserLogin
Note also that the payload needs to be:
payload = {
    'login_Email_or_Username': 'xxxxx#gmail.com',
    'login_Password': 'xxxxx'
}
I'd still visit the login page before doing that and set the headers:
HOME_URL = 'https://www.auction4cars.com/'
LOGIN_URL = "https://www.auction4cars.com/Home/UserLogin"
with requests.Session() as s:
    s.headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
    }
    s.get(HOME_URL)
    p = s.post(LOGIN_URL, data=payload)
    print(p.text)  # or use p.json() as, I think, the response format is JSON
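If the developer tools feel opaque, the same two facts (where the form posts to and what its fields are called) can often be read from the login page's HTML. The following is a minimal sketch using only the standard library. The sample markup is hypothetical and merely shaped like the form described above, and LoginFormParser is a name invented here for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical markup for illustration; the real page will differ.
SAMPLE_HTML = """
<form action="/Home/UserLogin" method="post">
  <input type="text" name="login_Email_or_Username">
  <input type="password" name="login_Password">
  <input type="submit" value="Log in">
</form>
"""


class LoginFormParser(HTMLParser):
    """Record the action of the first form and the names of its inputs."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.field_names = []
        self._seen_form = False
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._seen_form:
            self._seen_form = True
            self._in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.field_names.append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


parser = LoginFormParser()
parser.feed(SAMPLE_HTML)

# Resolve a relative action against the URL of the page that served the form.
login_url = urljoin("https://www.auction4cars.com/", parser.action)
print(login_url)
print(parser.field_names)
```

In practice you would feed it the text of the page fetched with s.get(HOME_URL). If the form is submitted by JavaScript, the HTML may not tell the whole story, and the Network tab of the developer tools remains the more reliable source.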
I found the question "How would I log into Instagram using BeautifulSoup4 and Requests, and how would I determine it on my own?",
but this code
import re
import requests
from bs4 import BeautifulSoup
from datetime import datetime
link = 'https://www.instagram.com/accounts/login/'
login_url = 'https://www.instagram.com/accounts/login/ajax/'
time = int(datetime.now().timestamp())
payload = {
    'username': 'login',
    'enc_password': f'#PWD_INSTAGRAM_BROWSER:0:{time}:your_password',
    'queryParams': {},
    'optIntoOneTap': 'false'
}
with requests.Session() as s:
    r = s.get(link)
    csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
    r = s.post(login_url, data=payload, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://www.instagram.com/accounts/login/",
        "x-csrftoken": csrf
    })
    print(r.status_code)
gives me an error with the CSRF token:
line 21, in <module>
csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
IndexError: list index out of range
Other posts on Stack Overflow don't work for me either.
I don't want to use Selenium.
TL;DR
Add a user-agent to your get request header on line 20:
r = s.get(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3 rv:3.0; sl-SI) AppleWebKit/533.38.2 (KHTML, like Gecko) Version/5.0 Safari/533.38.2'})
Long answer
If we look at the error message you posted, we can start to dissect what's gone wrong. Line 21 is attempting to find a csrf_token attribute on the Instagram login page.
Diagnostics
We can see from the error message that the list index is out of range, which in this case means that the list returned by re.findall is empty. This means that one of the following is true:
Your regex is wrong
The HTML returned by your GET request r = s.get(link) on line 20 doesn't contain a csrf_token attribute
The attribute doesn't exist in the source HTML
If we visit the page and look at its html source, we can see that a csrf_token attribute is indeed present on line 261:
<script type="text/javascript">window._sharedData = {"config":{"csrf_token":"TOKEN HERE","viewer":null,"viewerId":null}}</script>
Note: I have excluded the rest of the code for brevity.
Now that we know it's present on the page, we can write the scraped html that you're receiving via your get request to a local file and inspect it:
r = s.get(link)
with open("csrf.html", "w") as f:
    f.write(r.text)
If you open that file and do a Ctrl+f for csrf_token, it's not present. This likely means that Instagram detected that you're accessing the page via a scraper and returned a modified version of the page.
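The same check can be made in code so that a missing token produces a clear result instead of an IndexError. This is only a sketch: extract_csrf_token is a helper name invented here, the first sample string is shaped like the snippet quoted above, and the second is a hypothetical stand-in for a page served without the token.

```python
import re

# Equivalent to the pattern in the question; it expects the token inside inline JSON.
CSRF_PATTERN = re.compile(r'csrf_token":"(.*?)"')


def extract_csrf_token(html):
    """Return the token string, or None when the page does not contain one."""
    match = CSRF_PATTERN.search(html)
    return match.group(1) if match else None


# Markup shaped like the snippet quoted above, with a placeholder token.
browser_html = (
    '<script type="text/javascript">window._sharedData = '
    '{"config":{"csrf_token":"TOKEN HERE","viewer":null}}</script>'
)
# Hypothetical stand-in for a page that was served without the token.
scraper_html = "<html><body>No shared data here</body></html>"

print(extract_csrf_token(browser_html))
print(extract_csrf_token(scraper_html))
```

Calling extract_csrf_token(r.text) and testing the result for None lets the script report that the page came back without a token, which points at the request rather than at the regex.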
The fix
In order to fix this, you need to add a user-agent to your request header, which essentially 'tricks' the page into thinking you're accessing it via a browser. This can be done by changing:
r = s.get(link)
to something like this:
r = s.get(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3 rv:3.0; sl-SI) AppleWebKit/533.38.2 (KHTML, like Gecko) Version/5.0 Safari/533.38.2'})
Note: this is a randomly chosen user agent string.
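Since the later POST needs a browser-like User-Agent as well, one option is to set it once on the session rather than on each call. The sketch below assumes nothing beyond the requests API; it prepares a request without sending it, purely to show which header would go out.

```python
import requests

LOGIN_PAGE = "https://www.instagram.com/accounts/login/"
# Any browser-like string should do; this one is copied from the answer above.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3 rv:3.0; sl-SI) "
    "AppleWebKit/533.38.2 (KHTML, like Gecko) Version/5.0 Safari/533.38.2"
)

session = requests.Session()
# update() keeps the library's other default headers and overrides only this one.
session.headers.update({"User-Agent": USER_AGENT})

# Build the request without sending it, to inspect the headers it would carry.
prepared = session.prepare_request(requests.Request("GET", LOGIN_PAGE))
print(prepared.headers["User-Agent"])
```

Headers passed to an individual s.get or s.post call are merged with the session headers, so the x-csrftoken and Referer headers from the question can still be supplied per request.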
Notes
I appreciate that you don't want to use Selenium for your task, but you might find that the more dynamic the interactions you want to perform, the harder they are to achieve with static scraping libraries like the requests module. Here are some good resources for learning Selenium in Python:
Selenium docs
Python Selenium Tutorial #1 - Web Scraping, Bots & Testing
I'm currently trying to scrape some HTML files from an electronic medical system that I use for work. I already have a Python bot that logs into the system and can download and send faxes for me, but there are some pages I want my bot to grab quickly before it is even logged in and sending faxes. These pages are basic HTML with extremely predictable URLs, and I have confirmed that I can open them manually in my browser, so once my session is established it should be easy work.
The website is: https://kinnser.net/
Login URL: https://kinnser.net/login.cfm
second URL: https://kinnser.net/AM/Message/inbox.cfm
import requests
import json
import logging
from requests.auth import HTTPBasicAuth
from lxml import html
# This URL will be the URL that your login form points to with the "action" attribute.
POST_LOGIN_URL = 'https://kinnser.net/loginlogic.cfm'
# This URL is the page you actually want to pull down with requests.
REQUEST_URL = 'https://kinnser.net/AM/Message/inbox.cfm'
# username-input-name is the "name" attribute associated with the username input field of the login form.
# password-input-name is the "name" attribute associated with the password input field of the login form.
payload = {
    'username': 'XXXXXXXX',
    'password': 'XXXXXXXXX'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'}
with requests.Session() as session:
    post = session.post(POST_LOGIN_URL, data=payload, headers=headers)
    print(post)
    r = session.get(REQUEST_URL)
    print(r.text)  # or whatever else you want to do with the request data!
I played around with the username and password fields by setting the keys to the inputs' name/ID, but that didn't work. I then tried this script on the old EMR we used, just to confirm the script itself wasn't broken, and it worked perfectly. I also experimented with the headers in my request, but still no dice. I'm not sure whether my login is simply failing or whether they are detecting that I'm a bot and serving me the login page over and over, but I have spent about 10 hours researching a solution and have hit a wall with my project.
If anyone sees any mistakes in my code or has a workable solution, please feel free to suggest it. Thanks for the help; hopefully I'll soon understand more about RESTful web services.
Could the HTML actually be in post.text?
Edit:
Try the request with these headers:
...
user_agent_str = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " \
    + "AppleWebKit/537.36 (KHTML, like Gecko) " \
    + "Chrome/78.0.3904.97 " \
    + "Safari/537.36"
content_type_str = "application/json"
headers = {
    "user-agent": user_agent_str,
    "content-type": content_type_str
}
...
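Another edit:
payload is a Python dict, not a JSON string, so the choice of single or double quotes makes no difference. With data=payload, requests form-encodes it. If the endpoint expects JSON, pass json=payload instead and requests will serialize it for you.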
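What requests actually does with the payload can be checked without touching the network by preparing the request and looking at its body. A small sketch using the placeholder credentials from the question:

```python
import json

import requests

URL = "https://kinnser.net/loginlogic.cfm"
payload = {"username": "XXXXXXXX", "password": "XXXXXXXXX"}

# Prepare both variants without sending anything.
as_form = requests.Request("POST", URL, data=payload).prepare()
as_json = requests.Request("POST", URL, json=payload).prepare()

# data= produces a form-encoded body, json= produces a JSON document.
print(as_form.headers["Content-Type"], as_form.body)
print(as_json.headers["Content-Type"], as_json.body)
```

Which of the two the server expects has to be read from the browser's own login request. A Content-Type header set by hand to application/json does not change how a data= body is encoded, so the header and the body should be kept consistent.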
I would suggest trying these two things:
From the network calls, kinnser.net/loginlogic.cfm looks like the POST URL.
Change 'Username' to 'username' and 'Password' to 'password' and try again.
Since I don't have a username and password, I cannot verify this, but these two things might be causing the problem.
I created an API on my site and I'm trying to call it from Python, but I always get a 406 response. However, if I put the URL with its parameters in the browser, I see the correct answer.
I have already tested it on pages that let you test your own API and in the browser, and it works fine.
I also followed a tutorial that explains how to call an API from Python, but I still do not get the correct response :(
This is the URL of the API with the params:
https://icassy.com/api/login.php?usuario_email=warles34%40gmail.com&usuario_clave=123
This is the code I use to call the API from Python
import requests
urlLogin = "https://icassy.com/api/login.php"
params = {'usuario_email': 'warles34@gmail.com', 'usuario_clave': '123'}
r = requests.get(url=urlLogin, data=params)
print(r)
print(r.content)
and I get:
<Response [406]>
b'<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>'
I should receive in JSON format the success message and the apikey like this:
{"message":"Successful login.","apikey":"eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJodHRwOlwvXC9leGFtcGxlLm9yZyIsImF1ZCI6Imh0dHA6XC9cL2ljYXNzeS5jb20iLCJpYXQiOjEzNTY5OTk1MjQsIm5iZiI6MTM1NzAwMDAwMCwiZGF0YSI6eyJ1c3VhcmlvX2lkIjoiMzQiLCJ1c3VhcmlvX25vbWJyZSI6IkNhcmxvcyIsInVzdWFyaW9fYXBlbGxpZG8iOiJQZXJleiIsInVzdWFyaW9fZW1haWwiOiJ3YXJsZXMzNEBnbWFpbC5jb20ifX0.bOhrC-vXhQEHtbbZGmhLByCxvJY7YxDrLhVOfy9zeFc"}
It looks like the server validates whether the request comes from a browser. Adding a User-Agent header should do it:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url=urlLogin, params=params, headers=headers)
This list of user agents might come in handy in the future.
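Note that the snippet above also switches from data= to params=. With params=, the values go into the query string, as in the URL that works in the browser; with data= on a GET, they are sent in the request body and the URL carries no parameters. The sketch below approximates the query-string construction with the standard library rather than calling requests itself.

```python
from urllib.parse import urlencode

url_login = "https://icassy.com/api/login.php"
params = {"usuario_email": "warles34@gmail.com", "usuario_clave": "123"}

# Roughly what params= produces: a percent-encoded query string appended to the URL.
full_url = url_login + "?" + urlencode(params)
print(full_url)
```

The result matches the URL quoted in the question, with the @ in the e-mail address encoded as %40.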
It turned out that the service I was sending requests to was hosted on Akamai, which has a bot manager. It looks at where each request comes from, and if it decides the client is a bot, you get a 406 error.
The solution was either to ask for the server IP to be whitelisted or to send a special header with all server communication.
In my case, I had
'Accept': 'text/plain'
and it worked after I replaced it with
'Accept': 'application/json'
I didn't need to use user-agent at all
I am trying to complete a webscrape of a page that requires a log-in first. I am fairly certain that I have my code and input names ('login' and 'password') correct yet it still gives me a 'Login Failed' page. Here is my code:
payload = {'login': 'MY_USERNAME', 'password': 'MY_PASSWORD'}
login_url = "https://www.spatialgroup.com.au/property_daily/"
with requests.Session() as session:
    session.post(login_url, data=payload)
    response = session.get("https://www.spatialgroup.com.au/cgi-bin/login.cgi")
    html = response.text
    print(html)
I've done some snooping around and have figured out that the session doesn't stay logged in when I run my session.get("LOGGEDIN_PAGE"). For example, if I complete the log in process and then enter a URL into the address bar that I know for a fact is a page only accessible once logged in, it returns me to the 'Login Failed' page. How would I get around this if my login session is not maintained?
As others have mentioned, it's hard to help here without knowing the actual site you are attempting to log in to.
I'd point out that you aren't using any set HTTP headers at all, which is a common validation check for logins on webpages. If you're sure that you are POSTing the data in the right format (form encoded versus json encoded), then I would open up Chrome inspector and copy the user-agent from your browser.
s = requests.Session()
s.headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Accept': '*/*'
}
Also, it's good practice to check the response status code of each web request you make using a try/except pattern. This will help you catch errors as you write and test requests, instead of blindly guessing which requests are erroneous.
r = requests.get('http://mypage.com')
try:
    r.raise_for_status()
except requests.exceptions.HTTPError:
    print('oops bad status code {} on request!'.format(r.status_code))
Edit: Now that you've given us the site, inspecting a login attempt reveals that the form data isn't actually being POSTed to that website, but rather it's being sent to a CGI script url.
To find this, open up Chrome Inspector and watch the "Network" tab as you try to log in. You'll see that the login is actually being sent to https://www.spatialgroup.com.au/cgi-bin/login.cgi, not the actual login page. When you submit to this URL, it executes a 302 redirect after logging in. We can check the location after performing the request to see if the login was successful.
Knowing this I would send a request like this:
s = requests.Session()
# try to login
r = s.post(
    url='https://www.spatialgroup.com.au/cgi-bin/login.cgi',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3'
    },
    data={
        'login': USERNAME,
        'password': PASSWORD
    }
)
# now let's check to make sure we didn't get 4XX or 5XX errors
try:
    r.raise_for_status()
except requests.exceptions.HTTPError:
    print('oops bad status code {} on request!'.format(r.status_code))
else:
    print('our login redirected to: {}'.format(r.url))
# subsequently if the login was successful, you can now make a request to the login-protected page at this point
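The redirect check described above can be wrapped in a small helper. This is a sketch under assumptions: landed_away_from_login is a name invented here, the destination URL is hypothetical, and treating "redirected away from the login script" as success is a guess that has to be confirmed against the real site. The stand-in objects only mimic the history and url attributes of a requests response so that the example runs offline.

```python
from types import SimpleNamespace
from urllib.parse import urlparse

LOGIN_PATH = "/cgi-bin/login.cgi"


def landed_away_from_login(response, login_path=LOGIN_PATH):
    """True if redirects were followed and the final URL is not the login script."""
    was_redirected = len(response.history) > 0
    final_path = urlparse(response.url).path
    return was_redirected and final_path != login_path


# Stand-ins for requests.Response objects; the protected URL is hypothetical.
redirected = SimpleNamespace(
    history=[SimpleNamespace(status_code=302)],
    url="https://www.spatialgroup.com.au/some/protected/page",
)
not_redirected = SimpleNamespace(
    history=[],
    url="https://www.spatialgroup.com.au/cgi-bin/login.cgi",
)

print(landed_away_from_login(redirected))
print(landed_away_from_login(not_redirected))
```

With a real session you would pass the r returned by s.post; requests follows redirects by default and records the intermediate responses in r.history.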
It's very difficult to help you without knowing the actual website you are working with. That being said, I would recommend changing this line:
session.post(login_url, data=payload)
to this one:
session.post(login_url, json=payload)
Hope this helps.
I'm having trouble logging into a website to scrape pages behind login permissions (which I have). I've tried a number of fixes, including using the Requests module (including csrf tokens and hidden tags) and using the BrowserCookie module to try to use cookies from a browser login session. However, nothing seems to work. In the example below, I used a simple requests session. The site returns a 200 code, which supposedly signifies a successful login, but the page redirects back to the login page. Is there anything else I'm missing or is it possible that the website blocks webscrapers from logging in?
import requests
from bs4 import BeautifulSoup as bs
payload = {
    "UserName": "<user>",
    "Password": "<pass>"
}
s = requests.Session()
r1 = s.post("http://<webpage>/login", data=payload)
if r1.status_code == 200:
    print("logged in")
r2 = s.get("<url behind login permissions>")
soup = bs(r2.content, 'lxml')
print(soup.title.string)  # Redirects to login page
Setting the session's headers may work. Here's an example that changes User-Agent and Content-Type:
s.headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36',
    'Content-Type': 'application/json;charset=UTF-8',
}
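A 200 status on the login POST only says that the server answered; a failed login is often answered with the login form again, also with status 200. One rough way to tell the two apart is to look for a password field in the returned HTML. The sketch below uses only the standard library. PasswordFieldFinder and looks_like_login_page are names invented here, and both sample pages are hypothetical.

```python
from html.parser import HTMLParser


class PasswordFieldFinder(HTMLParser):
    """Set found=True if the document contains an <input type="password">."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # A valueless attribute is reported as None, hence the "or ''".
        input_type = (dict(attrs).get("type") or "").lower()
        if tag == "input" and input_type == "password":
            self.found = True


def looks_like_login_page(html):
    """Heuristic: a page with a password input is probably a login form."""
    finder = PasswordFieldFinder()
    finder.feed(html)
    return finder.found


# Hypothetical pages for illustration.
login_page = '<form><input name="UserName"><input type="password" name="Password"></form>'
members_page = "<html><title>Dashboard</title><body>Welcome back</body></html>"

print(looks_like_login_page(login_page))
print(looks_like_login_page(members_page))
```

Running it on r1.text right after the POST gives an earlier and more honest signal than the status code. It is only a heuristic, since a protected page could itself contain a password field.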