So I am currently doing a project for my school and I need to log in to our canteen website using Python. I am using requests, but the code is not working: it just redirects me to the starting page instead of the user page. I have tried this code on another website and it worked just fine. I have found out that this website uses JavaServer Pages. Might that be the problem?
I have tried a few tutorials on YouTube and even searched here, but nothing worked for me.
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36 OPR/58.0.3135.53'
}
login_data = {
    'j_username': '**',
    'j_password': '**',
    'terminal': 'false',
    'type': 'web',
    '_spring_security_remember_me': 'on'
}
with requests.session() as c:
    url = 'https://jidelna.mgo.opava.cz:6204/faces/secured/info.jsp?terminal=false&keyboard=false&printer=false'
    r = c.get(url)
    soup = BeautifulSoup(r.content, features="html.parser")
    # scrape the hidden CSRF and target-URL fields from the login form
    login_data['_csrf'] = soup.find('input', attrs={'name': '_csrf'})['value']
    login_data['targetUrl'] = soup.find('input', attrs={'name': 'targetUrl'})['value']
    r = c.post(url, data=login_data, headers=headers)
You are sending the POST request to the wrong URL. If you use developer tools to inspect the login form, you can get the action attribute of the form.
In the Network tab in developer tools you can see the POST request being made and its parameters. You should make the POST request to https://jidelna.mgo.opava.cz:6204/j_spring_security_check
If all of this does not work, also consider emulating the browser's headers as far as possible. There is a cookie being sent, so you need to keep using a session with requests.
If everything else fails, there is always Selenium.
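Putting that together with the code from the question, a minimal sketch might look like this. The field names and the login-page URL are taken from the question; the /j_spring_security_check path is the one quoted above and should be confirmed against the form's action attribute in developer tools:

import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36 OPR/58.0.3135.53'
}
login_page = 'https://jidelna.mgo.opava.cz:6204/faces/secured/info.jsp?terminal=false&keyboard=false&printer=false'
login_check = 'https://jidelna.mgo.opava.cz:6204/j_spring_security_check'  # the form's action, not the page URL

with requests.Session() as c:
    # GET the login page first: the session stores the cookie and we can
    # scrape the hidden form fields
    r = c.get(login_page, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    login_data = {
        'j_username': '**',
        'j_password': '**',
        'terminal': 'false',
        'type': 'web',
        '_spring_security_remember_me': 'on',
        '_csrf': soup.find('input', attrs={'name': '_csrf'})['value'],
        'targetUrl': soup.find('input', attrs={'name': 'targetUrl'})['value'],
    }
    # POST the credentials to the form's action URL; the session carries
    # the cookie from the GET above
    r = c.post(login_check, data=login_data, headers=headers)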
Related
I found How would I log into Instagram using BeautifulSoup4 and Requests, and how would I determine it on my own?, but the code
import re
import requests
from bs4 import BeautifulSoup
from datetime import datetime

link = 'https://www.instagram.com/accounts/login/'
login_url = 'https://www.instagram.com/accounts/login/ajax/'
time = int(datetime.now().timestamp())
payload = {
    'username': 'login',
    'enc_password': f'#PWD_INSTAGRAM_BROWSER:0:{time}:your_password',
    'queryParams': {},
    'optIntoOneTap': 'false'
}
with requests.Session() as s:
    r = s.get(link)
    csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
    r = s.post(login_url, data=payload, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://www.instagram.com/accounts/login/",
        "x-csrftoken": csrf
    })
    print(r.status_code)
gives me an error with the CSRF token:
line 21, in <module>
csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
IndexError: list index out of range
and the other posts on Stack Overflow don't work for me.
I don't want to use Selenium.
TL;DR
Add a user-agent to your get request header on line 20:
r = s.get(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3 rv:3.0; sl-SI) AppleWebKit/533.38.2 (KHTML, like Gecko) Version/5.0 Safari/533.38.2'})
Long answer
If we look at the error message you posted, we can start to dissect what's gone wrong. Line 21 is attempting to find a csrf_token attribute on the Instagram login page.
Diagnostics
We can see from the error message that the list index is out of range, which in this case means that the list returned by re.findall (docs) is empty; a short example of this failure mode follows the list below. This means that either:
1. your regex is wrong,
2. the HTML returned by your get request (docs) r = s.get(link) on line 20 doesn't contain a csrf_token attribute, or
3. the attribute doesn't exist in the source HTML at all.
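As a quick illustration of that failure mode (a toy string stands in for Instagram's actual response):

import re

# findall returns an empty list when nothing matches, and indexing [0]
# on an empty list raises exactly the IndexError from the question
matches = re.findall(r"csrf_token\":\"(.*?)\"", "<html>no token here</html>")
print(matches)    # []
# matches[0]      # -> IndexError: list index out of range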
If we visit the page and look at its html source, we can see that a csrf_token attribute is indeed present on line 261:
<script type="text/javascript">window._sharedData = {"config":{"csrf_token":"TOKEN HERE","viewer":null,"viewerId":null}}</script>
Note, I have excluded the rest of the code for brevity.
Now that we know it's present on the page, we can write the scraped HTML that you're receiving via your get request to a local file and inspect it:
r = s.get(link)
with open("csrf.html", "w") as f:
    f.write(r.text)
If you open that file and do a Ctrl+f for csrf_token, it's not present. This likely means that Instagram detected that you're accessing the page via a scraper and returned a modified version of the page.
The fix
In order to fix this, you need to add a user-agent to your request header, which essentially 'tricks' the page into thinking you're accessing it via a browser. This can be done by changing:
r = s.get(link)
to something like this:
r = s.get(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3 rv:3.0; sl-SI) AppleWebKit/533.38.2 (KHTML, like Gecko) Version/5.0 Safari/533.38.2'})
Note, this is a random user agent from here.
Notes
I appreciate that you don't want to use Selenium for your task, but you might find that the more dynamic interactions you want to do, the harder it is to achieve them with static scraping libraries like the requests module. Here are some good resources for learning Selenium in Python:
Selenium docs
Python Selenium Tutorial #1 - Web Scraping, Bots & Testing
I am trying to perform web scraping using Python, BeautifulSoup and requests. I first need to log in to the page and then request the following page, from which I would like to scrape the data.
I can say that I log in successfully, as the status code is 200. However, when I request the next page after logging in, I do not get the whole content.
Specifically, I get this line instead of multiple nested divs:
<div id="app"></div>
The actual content, as rendered in the browser, contains many nested divs.
My code is the following. I would like to ask you whether I’m missing anything in order to get all nested divs.
import requests
from bs4 import BeautifulSoup
import html5lib  # parser backend for BeautifulSoup below

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
login_data = {
    'username': 'username',
    'password': 'password',
    'sp-login': 'false'
}
with requests.Session() as s:
    url = "https://api.private.zscaler.com/base/api/zpa/signin"
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html5lib')
    r = s.post(url, data=login_data, headers=headers)
    print(r.content)
    print(r.ok)
    print(r.status_code)

# note: this request uses plain requests.get, not the logged-in session s
r2 = requests.get("https://admin.private.zscaler.com/#dashboard/usersDashboard")
print(r2.text)
The web app you are trying to scrape might be an SPA (Single Page Application) built with something like React / Vue / Angular.
BeautifulSoup alone wouldn't work in this case, because you need to run the JavaScript on the page to build the DOM.
You would have to use something like Selenium to accomplish this, as in the sketch below.
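A minimal sketch of that approach, using the <div id="app"> from the question as the wait target. Selenium 4 syntax is assumed, and the login steps through the rendered page are omitted:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://admin.private.zscaler.com/#dashboard/usersDashboard")
# wait until the JavaScript app has rendered something into <div id="app">
WebDriverWait(driver, 15).until(
    lambda d: d.find_element(By.ID, "app").get_attribute("innerHTML").strip()
)
html = driver.page_source  # the fully built DOM, ready for BeautifulSoup
driver.quit()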
As the title above states, I am getting a 403 error. The URLs generated are valid; I can print them and then open them in my browser just fine.
I've got a user agent; it's the exact same one my browser sends when accessing the page I want to scrape, pulled straight from Chrome DevTools. I've tried using sessions instead of a straight request, I've tried using urllib, and I've tried using a generic requests.get.
Here's the code I'm using that 403s. Same result with requests.get etc.
import requests

headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'}
session = requests.Session()
req = session.get(URL, headers=headers)  # URL is built elsewhere in the script
So yeah, I assume I'm not creating the user agent right, so the site can tell I am scraping. But I'm not sure what I'm missing, or how to find that out.
I got all the headers from DevTools and started removing them one by one, and I found that it needs only Accept-Language: it doesn't need User-Agent and it doesn't need a Session.
import requests

url = 'https://www.g2a.com/lucene/search/filter?&search=The+Elder+Scrolls+V:+Skyrim&currency=nzd&cc=NZD'
headers = {
    'Accept-Language': 'en-US;q=0.7,en;q=0.3',
}
r = requests.get(url, headers=headers)
data = r.json()
print(data['docs'][0]['name'])
Result:
The Elder Scrolls V: Skyrim Special Edition Steam Key GLOBAL
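The elimination step described above can itself be scripted. A rough sketch, with the header values you would copy from DevTools elided:

import requests

url = 'https://www.g2a.com/lucene/search/filter?&search=The+Elder+Scrolls+V:+Skyrim&currency=nzd&cc=NZD'
# start from the full header set copied out of DevTools
full_headers = {
    'User-Agent': '...',
    'Accept-Language': 'en-US;q=0.7,en;q=0.3',
    # ... any other headers DevTools showed
}
# drop one header at a time and see whether the request still succeeds
for name in list(full_headers):
    trimmed = {k: v for k, v in full_headers.items() if k != name}
    r = requests.get(url, headers=trimmed)
    print(name, 'removable' if r.ok else 'needed')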
I've been trying for ages to log in to a web page to scrape some data with Python. I just can't figure out how to do it after using Chrome inspect on the login site; it seems different from any of the answers found here. This is the site https://www.weatherlink.com/ and I would need to log in to it and then scrape wind-speed data from different public stations.
I've tried the requests library with multiple different payloads, without success, using the following code:
import requests

payload = {'username' : 'xx',
           'password': 'yy',
           'localTimezoneOffset': '10800000',
           'keepLogged': ''}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
session_requests = requests.session()
login_url = "https://www.weatherlink.com/"
result = session_requests.post(login_url, data = payload, headers = headers, verify=True)
I expect result.ok == True, but I get False with reason == "Not Allowed" and status_code == 405 (Method Not Allowed, i.e. that URL does not accept POST requests). After logging in I would scrape the data from a station, e.g. from the URL https://www.weatherlink.com/bulletin/4a891aff-0761-4934-bdf9-9115397c12ea
Any help is much appreciated.
It looks like you have the wrong path for the POST request.
Try this:
import requests

payload = {
    'username': 'xxx',
    'password': 'yyy',
    'rememberMe': 'false',
    'localTimezoneOffset': '-14400000',
    'ianaTimeZone': 'America/New_York'
}
headers = {
    # you should be able to skip the user-agent string, unless you're trying to bypass some kind of anti-bot protection
}
session_requests = requests.session()
login_url = "https://www.weatherlink.com/processLogin"
result = session_requests.post(login_url, data = payload, headers = headers, verify=True)
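If the login succeeds, the same session should then be able to fetch the bulletin page mentioned in the question; continuing the snippet above (untested):

if result.ok:
    # reuse the logged-in session; the bulletin URL comes from the question
    page = session_requests.get("https://www.weatherlink.com/bulletin/4a891aff-0761-4934-bdf9-9115397c12ea")
    print(page.status_code)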
I am using the following script to log in to https://www.mbaco.com/login. While I am not getting any error, I can't access the protected pages of the website. Please help.
import requests

url = 'https://www.mbaco.com/login'
payload = {
    '_username': "myusername",
    '_password': "password"
}
session = requests.session()
r = session.post(url, data=payload)
You have the wrong URL; the POST goes to https://www.mbaco.com/login_check. It is also good to add a user-agent:
import requests

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}
url = 'https://www.mbaco.com/login_check'
payload = {
    '_username': "myusername",
    '_password': "password"
}
session = requests.session()
r = session.post(url, data=payload, headers=headers)
If you want to see what gets posted and to where, open developer tools or Firebug and you can see exactly what is happening; in this case you can see, under the network tab, exactly what is posted and to where.
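To verify the login from the script itself, you can request a protected page with the same session and check where you end up; the path below is a hypothetical placeholder for whatever protected page you actually need:

# '/account' is a hypothetical protected path, used only for illustration
r = session.get('https://www.mbaco.com/account')
print(r.status_code, r.url)  # being redirected back to /login would indicate failure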