Sometimes when I try to get the HTML of a website with this code
import requests
url = "https://sit2play.com"
response = requests.get(url)
print(response.content)
I get this response:
<h3 class="ielte9">
The browser you're using is not supported. Please use a different browser like Chrome or Firefox.
How can I avoid this and get the real page content?
Add your user agent to the request headers with:
headers = {
'User-Agent': 'YOUR USER AGENT',
}
response = requests.get(url, headers=headers)
You can find your user agent string on one of the many websites that display it.
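For example, a self-contained version of the request might look like this (the User-Agent string below is only an illustrative browser value; substitute whatever your own browser reports):
import requests

url = "https://sit2play.com"
headers = {
    # Illustrative browser-style User-Agent; replace it with your own browser's string
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
}
response = requests.get(url, headers=headers)
print(response.content)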
Edit
If the solution above doesn't work for you, which might be because you are using an old version of requests, try this one:
headers = requests.utils.default_headers()
headers.update({
'User-Agent': 'YOUR USER AGENT',
})
response = requests.get(url, headers=headers)
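Either way, you can verify which User-Agent was actually sent by inspecting the prepared request attached to the response:
print(response.request.headers['User-Agent'])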
Related
I'm trying to log in to a website using Python requests; however, the webpage has a mandatory data-protection consent pop-up on the first page. I think this is why I cannot log in yet, because posting your login credentials to the login URL requires these consent cookies (which are probably dynamic).
After inspecting the login POST request (via the browser's developer tools), it appears to require cookies set by a CMP, specifically a variable called euconsent-v2 (https://help.consentmanager.net/books/cmp/page/cookies-set-by-the-cmp), so my question is how to get these cookies (and/or any other necessary cookies) from the website after accepting the consent pop-up, so that I can log in.
Here is my code so far:
import requests
# Website
base_url = 'https://www.wg-gesucht.de'
# Login URL
login_url = 'https://www.wg-gesucht.de/ajax/sessions.php?action=login'
# Post headers (just a sample of all variables)
headers = {...,
'Cookie': 'euconsent-v2=********'}
# Post params
payload = {'display_language': "de",
'login_email_username': "******",
'login_form_auto_login': "1",
'login_password': "******"}
# Setup session and login
sess = requests.session()
resp_login = sess.post(login_url, data=payload, headers=headers)
UPDATE: I have searched through all the recorded requests, from loading the site up to logging in, and the only mention of euconsent-v2 is in the response to this request:
cookie_url = 'https://cdn.consentmanager.mgr.consensu.org/delivery/cmp_en.min.js'
referer = 'https://www.wg-gesucht.de'
headers = {'Referer': referer,
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'}
sess = requests.session()
resp_init = sess.get(cookie_url, headers=headers)
But I still cannot get the required cookies.
The best way would be to create a session and then request all the pages that set the cookies you need. With all the cookies collected in that session, you then request the login page.
https://help.consentmanager.net/books/cmp/page/cookies-set-by-the-cmp
On the right-hand side of that page the location of each cookie is listed.
(The image originally shown here was just an example of what I mean: a random site/URL whose response headers set two cookies.) A session will save all the cookies, and once you have the mandatory ones you make a request to the login page with your post data.
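A rough sketch of that idea, reusing the URLs from the question (and assuming, for illustration, that the pages you request really do set the needed cookies via Set-Cookie headers; the credential values are placeholders):
import requests

base_url = 'https://www.wg-gesucht.de'
cookie_url = 'https://cdn.consentmanager.mgr.consensu.org/delivery/cmp_en.min.js'
login_url = 'https://www.wg-gesucht.de/ajax/sessions.php?action=login'

headers = {'Referer': base_url,
           'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'}

with requests.Session() as sess:
    sess.headers.update(headers)
    # Request the pages that are supposed to set the cookies you need
    sess.get(base_url)
    sess.get(cookie_url)
    # Inspect what actually ended up in the session's cookie jar
    print(sess.cookies.get_dict())
    # Once the mandatory cookies are present, post the login data
    payload = {'display_language': 'de',
               'login_email_username': 'your_email',    # placeholder
               'login_form_auto_login': '1',
               'login_password': 'your_password'}       # placeholder
    resp_login = sess.post(login_url, data=payload)
    print(resp_login.status_code)
Note that consent cookies like euconsent-v2 are typically written by the CMP's JavaScript rather than by a Set-Cookie header, so if the cookie never shows up in sess.cookies you may need a browser-automation tool such as Selenium to obtain it.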
I am trying to log in to www.ebay-kleinanzeigen.de using the requests library, but every time I try to post my data (on the register page it's the same as on the login page) I get a 403 error.
Here is the code for the register function:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'user-agent': user_agent, 'Referer': 'https://www.ebay-kleinanzeigen.de'}
with requests.Session() as c:
    url = 'https://www.ebay-kleinanzeigen.de/m-benutzer-anmeldung.html'
    c.headers = headers
    hp = c.get(url, headers=headers)
    soup = BeautifulSoup(hp.content, 'html.parser')
    csrf = soup.find('input', {'name': '_csrf'})['value']
    print(csrf)
    payload = dict(email='test.email@emailzz1.de', password='test123', passwordConfirmation='test123',
                   _marketingOptIn='on', _csrf=csrf)
    page = c.post(url, data=payload, headers=headers)
    print(page.text)
    print(page.url)
    print(page.status_code)
Is the problem that I need some more headers? Aren't a user agent and a referrer enough?
I have tried adding all the requested headers, but then I get no response at all.
I have managed to create a script that successfully completes the registration form you're trying to fill in, using the mechanicalsoup library. Note that you will have to check your email account manually for the email they send you to complete registration.
I realise this doesn't actually answer the question of why the requests-based approach returned a 403 Forbidden error, but it does complete your task without hitting the same error.
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://www.ebay-kleinanzeigen.de/m-benutzer-anmeldung.html")
browser.select_form('#registration-form')
browser.get_current_form().print_summary()
browser["email"] = "mailuser#emailprovider.com"
browser["password"] = "testSO12345"
browser["passwordConfirmation"] = "testSO12345"
response = browser.submit_selected()
rsp_code = response.status_code
#print(response.text)
print("Response code:",rsp_code)
if(rsp_code == 200):
print("Success! Opening a local debug copy of the page... (no CSS formatting)")
browser.launch_browser()
else:
print("Failure!")
I'm new to Python and am trying to use XPath and requests to log on and scrape some data from here, using the methods demonstrated in this tutorial. My Python script is currently as follows:
from lxml import html
import requests
url = "http://www.londoncoffeeguide.com/Venues/Profile/26-Grains"
session_requests = requests.session()
login_url = "http://www.londoncoffeeguide.com/signin?returnurl=%2fVenues"
result = session_requests.get(login_url)
tree = html.fromstring(result.content)
authenticity_token = list(set(tree.xpath("//input[@name='__CMSCsrfToken']/@value")))[0]
payload = {
"p$lt$ctl01$LogonForm_SignIn$Login1$UserName": 'XXX',
"p$lt$ctl01$LogonForm_SignIn$Login1$Password": 'XXX',
"__CMSCsrfToken": authenticity_token
}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'}
with requests.session() as s:
    p = s.post(login_url, data=payload, headers=headers)
    print(p.text)
Unfortunately the text returned by the POST request shows...
<head><title>
System error
</title>
...and then the remainder of the HTML for the login page. I've tried adding the headers line as shown above, double-checked that the login details I'm using are correct, and I'm fairly confident the CMSCsrfToken is correct, but the login doesn't work. Any help with this is much appreciated; I've been googling around, but none of the various responses I've found to similar problems seem to help (so far!).
You put your username and password in the wrong fields. Moreover, there are a few additional fields to add to the payload, such as __VIEWSTATEGENERATOR and __VIEWSTATE, for the script to work. The following script will log you in and then fetch the titles of the different profile items.
from lxml.html import fromstring
import requests
login_url = "http://www.londoncoffeeguide.com/signin?returnurl=%2fVenues"
username = "" #fill this in
password = "" #fill this in as well
with requests.session() as session:
    session.headers['User-Agent'] = 'Mozilla/5.0'
    result = session.get(login_url)
    tree = fromstring(result.text)
    auth_token = tree.xpath("//input[@id='__CMSCsrfToken']/@value")[0]
    viewstate = tree.xpath("//input[@id='__VIEWSTATE']/@value")[0]
    viewgen = tree.xpath("//input[@id='__VIEWSTATEGENERATOR']/@value")[0]
    payload = {
        "__CMSCsrfToken": auth_token,
        "__VIEWSTATEGENERATOR": viewgen,
        "p$lt$ctl02$pageplaceholder$p$lt$ctl00$RowLayout_Bootstrap$RowLayout_Bootstrap_2$ColumnLayout_Bootstrap1$ColumnLayout_Bootstrap1_1$LogonForm_SignIn$Login1$UserName": username,
        "p$lt$ctl02$pageplaceholder$p$lt$ctl00$RowLayout_Bootstrap$RowLayout_Bootstrap_2$ColumnLayout_Bootstrap1$ColumnLayout_Bootstrap1_1$LogonForm_SignIn$Login1$Password": password,
        "__VIEWSTATE": viewstate,
        "p$lt$ctl02$pageplaceholder$p$lt$ctl00$RowLayout_Bootstrap$RowLayout_Bootstrap_2$ColumnLayout_Bootstrap1$ColumnLayout_Bootstrap1_1$LogonForm_SignIn$Login1$LoginButton": "Log on"
    }
    p = session.post(login_url, data=payload)
    root = fromstring(p.text)
    for iteminfo in root.cssselect(".ProfileItem .ProfileItemTitle"):
        print(iteminfo.text)
Make sure to fill in the username and password fields within the script before execution.
I am new to Python and programming and would really appreciate any help here.
I am trying to log in to this website using the code below, and I just cannot get beyond the first page.
Below is the code I have been trying...
import requests
from bs4 import BeautifulSoup
response = requests.get('https://www.dell.com/sts/passive/commercial/v1/us/en/19/Premier/Login/Anonymous?wa=wsignin1.0&wtrealm=http%253a%252f%252fwww.dell.com&wreply=https%253a%252f%252fwww.dell.com%252fidentity%252fv2%252fRedirect')
soup = BeautifulSoup(response.text, 'html.parser')
formtoken = soup.find('input', {'name': '__RequestVerificationToken'}).get('value')
payload = {'UserName': username, 'Password': password, '__RequestVerificationToken': formtoken}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'}
with requests.Session() as s:
    p = s.post('https://www.dell.com/sts/passive/commercial/v1/us/en/19/Premier/Login/Anonymous?wa=wsignin1.0&wtrealm=http%253a%252f%252fwww.dell.com&wreply=https%253a%252f%252fwww.dell.com%252fidentity%252fv2%252fRedirect', data=payload, headers=headers)
    r = s.get('http://www.dell.com/account/', headers=headers)
    print(r.text)
I am just not able to get beyond the login page. What parameters do I need apart from the login credentials? I also tried checking the form data in the Chrome dev tools, but it is encrypted (see the Form Data screenshot from the dev tools).
Any help here is highly appreciated.
EDIT
I have edited the code to pass the token in the payload as suggested below, but I have had no luck yet.
You are not following the correct approach for making a POST request.
Steps you can follow (see the sketch after these steps):
First, make a GET request to your URL.
Extract the verification token from the response.
Use that token in your POST request.
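A minimal sketch of those steps, reusing the token field and URL from the question (the credential values are placeholders, and the real form may require additional dynamically generated fields):
import requests
from bs4 import BeautifulSoup

login_url = ('https://www.dell.com/sts/passive/commercial/v1/us/en/19/Premier/Login/Anonymous'
             '?wa=wsignin1.0&wtrealm=http%253a%252f%252fwww.dell.com'
             '&wreply=https%253a%252f%252fwww.dell.com%252fidentity%252fv2%252fRedirect')

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'
    # 1. GET the login page inside the session so its cookies are kept
    login_page = s.get(login_url)
    soup = BeautifulSoup(login_page.text, 'html.parser')
    # 2. Extract the verification token from the response
    token = soup.find('input', {'name': '__RequestVerificationToken'}).get('value')
    # 3. Use that token in the POST request along with the credentials
    payload = {'UserName': 'your_username',            # placeholder
               'Password': 'your_password',            # placeholder
               '__RequestVerificationToken': token}
    p = s.post(login_url, data=payload)
    print(p.status_code)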
Unfortunately I get the error "HTTP Status 403 - Bots not allowed" when using the following Python code.
import requests
URL = 'http://api.glassdoor.com/api/api.htm?v=1&format=json&t.p={PartnerID}&t.k={Key}&action=employers&q=pharmaceuticals&userip={IP_address}&useragent=Mozilla/%2F4.0'
response = requests.get(URL)
print(response)
The URL does work when I try it from my browser. What can I do to make it work from code?
Update: SOLVED.
Apologies for not posting the question in the right way (I am new at SO).
According to this StackOverflow answer, you need to include a header field (note that this example uses urllib2 rather than requests):
import urllib2, sys
url = "http://api.glassdoor.com/api/api.htm?t.p=yourID&t.k=yourkey&userip=8.28.178.133&useragent=Mozilla&format=json&v=1&action=employers&q="
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(url,headers=hdr)
response = urllib2.urlopen(req)
With the requests module, it's probably:
import requests
URL = 'http://api.glassdoor.com/api/api.htm?v=1&format=json&t.p={PartnerID}&t.k={Key}&action=employers&q=pharmaceuticals&userip={IP_address}&useragent=Mozilla/%2F4.0'
headers = {'user-agent': 'Mozilla/5.0'}
response = requests.get(URL, headers=headers)
print(response)
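If you keep the {PartnerID}, {Key} and {IP_address} placeholders in the URL, one option (just a sketch) is to fill them in with str.format before making the call:
import requests

URL_TEMPLATE = ('http://api.glassdoor.com/api/api.htm?v=1&format=json'
                '&t.p={PartnerID}&t.k={Key}&action=employers&q=pharmaceuticals'
                '&userip={IP_address}&useragent=Mozilla/%2F4.0')

url = URL_TEMPLATE.format(PartnerID='your_partner_id',   # placeholders, not real credentials
                          Key='your_key',
                          IP_address='your_ip')
headers = {'user-agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
print(response.status_code)
print(response.text)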