I'm trying to find all the information inside "inspect" when using a browser for example chrome, currently i can get the page "source" but it doesn't contain everything that inspect contains
when i tried using
with urllib.request.urlopen(section_url) as url:
html = url.read()
I got the following error message: "urllib.error.HTTPError: HTTP Error 403: Forbidden"
Now I'm assuming this is because the url I'm trying to get is from a https url instead of a http one, and i was wondering if there is a specific way to get that information from https since the normal methods arn't working.
Note: I've also tried this, but it didn't show me everything
f = requests.get(url)
print(f.text)
You need to have a user-agent to make the browser think you're not a robot.
import urllib.request, urllib.error, urllib.parse
url = 'http://www.google.com' #Input your url
user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
headers = { 'User-Agent' : user_agent }
req = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(req)
html = response.read()
response.close()
adapted from https://stackoverflow.com/a/3949760/6622817
Related
I find How would I log into Instagram using BeautifulSoup4 and Requests, and how would I determine it on my own? this
but code
import re
import requests
from bs4 import BeautifulSoup
from datetime import datetime
link = 'https://www.instagram.com/accounts/login/'
login_url = 'https://www.instagram.com/accounts/login/ajax/'
time = int(datetime.now().timestamp())
payload = {
'username': 'login',
'enc_password': f'#PWD_INSTAGRAM_BROWSER:0:{time}:your_password',
'queryParams': {},
'optIntoOneTap': 'false'
}
with requests.Session() as s:
r = s.get(link)
csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
r = s.post(login_url, data=payload, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
"X-Requested-With": "XMLHttpRequest",
"Referer": "https://www.instagram.com/accounts/login/",
"x-csrftoken": csrf
})
print(r.status_code)
gives me error with csrftoken
line 21, in <module>
csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
IndexError: list index out of range
and other posts on Stack Overflow don't work for me
I dont want use Selenium
TL;DR
Add a user-agent to your get request header on line 20:
r = s.get(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3 rv:3.0; sl-SI) AppleWebKit/533.38.2 (KHTML, like Gecko) Version/5.0 Safari/533.38.2'})
Long answer
If we look at the error message you posted, we can start to dissect what's gone wrong. Line 21 is attempting to find a csrf_token attribute on the instagram login page.
Diagnostics
We can see from the error message that the list index is out of range, which in this case means that the list returned by re.findall (docs) is empty. This means that either
Your regex is wrong
The html returned by your get request (docs) r = s.get(link) on line 20 doesn't contain a csrf_token attribute
The attribute doesn't exist in the source html
If we visit the page and look at its html source, we can see that a csrf_token attribute is indeed present on line 261:
<script type="text/javascript">window._sharedData = {"config":{"csrf_token":"TOKEN HERE","viewer":null,"viewerId":null}}</script>
Note, I have excluded the rest on the code for brevity.
Now that we know it's present on the page, we can write the scraped html that you're receiving via your get request to a local file and inspect it:
r = s.get(link)
with open("csrf.html", "w") as f:
f.write(html)
If you open that file and do a Ctrl+f for csrf_token, it's not present. This likely means that Instagram detected that you're accessing the page via a scraper and returned a modified version of the page.
The fix
In order to fix this, you need to add a user-agent to your request header which essentially 'tricks' the page into thinking you're accessing it via a browser, This can be done by by changing:
r = s.get(link)
to something like this:
r = s.get(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3 rv:3.0; sl-SI) AppleWebKit/533.38.2 (KHTML, like Gecko) Version/5.0 Safari/533.38.2'})
Note, this is a random user agent from here.
Notes
I appreciate that you don't want to use selenium for your task, but you might find that the more dynamic interactions you want to do, the harder it is to achieve it with static scraping libraries like the requests module. Here are some good resources for learning selenium in python:
Selenium docs
Python Selenium Tutorial #1 - Web Scraping, Bots & Testing
I am trying to login into www.ebay-kleinanzeigen.de using the requests library, but every time I try to post my data (on the register page its the same as on the login page) I am getting a 403 error.
Here is the code for the register function:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'user-agent': user_agent, 'Referer': 'https://www.ebay-kleinanzeigen.de'}
with requests.Session() as c:
url = 'https://www.ebay-kleinanzeigen.de/m-benutzer-anmeldung.html'
c.headers = headers
hp = c.get(url, headers=headers)
soup = BeautifulSoup(hp.content, 'html.parser')
crsf = soup.find('input', {'name': '_csrf'})['value']
print(crsf)
payload = dict(email='test.email#emailzz1.de', password='test123', passwordConfirmation='test123',
_marketingOptIn='on', _crsf=crsf)
page = c.post(url, data=payload, headers=headers)
print(page.text)
print(page.url)
print(page.status_code)
Is the problem that I need some more headers? Isn't a user-agent and a referrer enough?
I have tried adding all requested headers, but then I am getting no response.
I have managed to create a script that will successfully complete the register form you're trying to fill in using the mechanicalsoup library. Note you will have to manually check your email account for the email they send you to complete registration.
I realise this doesn't actually answer the question of why BeautifulSoup returned a 403 forbidden error however it does complete your task without encountering the same error.
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://www.ebay-kleinanzeigen.de/m-benutzer-anmeldung.html")
browser.select_form('#registration-form')
browser.get_current_form().print_summary()
browser["email"] = "mailuser#emailprovider.com"
browser["password"] = "testSO12345"
browser["passwordConfirmation"] = "testSO12345"
response = browser.submit_selected()
rsp_code = response.status_code
#print(response.text)
print("Response code:",rsp_code)
if(rsp_code == 200):
print("Success! Opening a local debug copy of the page... (no CSS formatting)")
browser.launch_browser()
else:
print("Failure!")
I am trying to to scrape the following page using python 3 but I keep getting HTTP Error 400: Bad Request. I have looked at some of the previous answers suggesting to use urllib.quote which didn't work for me since it's python 2. Also, I tried the following code as suggested by another post and still didn't work.
url = requote_uri('http://www.txhighereddata.org/Interactive/CIP/CIPGroup.cfm?GroupCode=01')
with urllib.request.urlopen(url) as response:
html = response.read()
The server deny queries from non human-like User-Agent HTTP header.
Just pick a browser's User-Agent string and set it as header to your query:
import urllib.request
url = 'http://www.txhighereddata.org/Interactive/CIP/CIPGroup.cfm?GroupCode=01'
headers={
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"
}
request = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(request) as response:
html = response.read()
i have basically tried everything but still i get the unsupported browser error. i use requests module. i have tried with headers, other modules but still the same.
import requests
url = "https://www.ratemyagent.com.au/"
response = requests.get(url)
html_icerigi = response.text
soup = BeautifulSoup(html_icerigi, "html.parser")
https://www.ratemyagent.com.au/ this is the adres. so please if you have any idea to get rid of this error let me know.
thanks a lot in advance
edit: this is what i have tried as headers :
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36' headers = { 'User-Agent' : user_agent } response = requests.get(url, headers=headers)
versions // python 3. BeautifulSoup4: 4.6.0 requests:2.20
[this is the output with soup.text or reponse.text][1]
it's not a error
when you normally search https://www.ratemyagent.com.au/ by chrome browser, the page resource your get also has the info like you said:
while the resource your get by requests module is same as what you get from a web browser I don't think it is a error.
I am trying to download a web page with python and access some elements on the page. I have an issue when I download the page: the content is garbage. This is the first lines of the page:
‹í}évÛH²æïòSd±ÏmÉ·’¸–%ÕhµÕ%ÙjI¶«JããIÐ(‰îî{æ1æ÷¼Æ¼Í}’ù"à""’‚d÷t»N‰$–\"ãˈŒˆŒÜøqïíîùï'û¬¼gôÁnžm–úq<ü¹R¹¾¾._›å ìUôv»]¹¡gJÌqÃÍ’‡%z‹[ÎÖ3†[(,jüËȽÚ,í~ÌýX;y‰Ùò×f)æ7q…JzÉì¾F<ÞÅ]Uª
this problem happen only on the following website: http://kickass.to. Is it possible that they have somehow protected their page? this is my python code:
import urllib2
import chardet
url = 'http://kickass.to/'
user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KH
TML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
headers = { 'User-Agent' : user_agent }
req = urllib2.Request(url, None, headers)
response = urllib2.urlopen(req)
page = response.read()
f = open('page.html','w')
f.write(page)
f.close()
print response.headers['content-type']
print chardet.detect(page)
and result:
text/html; charset=UTF-8
{'confidence': 0.0, 'encoding': None}
it looks like an encoding issue but chardet detects 'None'.. Any ideas?
This page is returned in gzip encoding.
(Try printing out response.headers['content-encoding'] to verify this.)
Most likely the web-site doesn't respect 'Accept-Encoding' field in request and suggests that the client supports gzip (most modern browsers do).
urllib2 doesn't support deflating, but you can use gzip module for that as described e.g. in this thread: Does python urllib2 automatically uncompress gzip data fetched from webpage? .