Change user agent used with robotparser in Python - python

I am using the robotparser from the urlib module in Python to determine if can download webpages. One site I am accessing however returns a 403 error when the robot.txt file is accessed via the default user-agent, but correct response if e.g. downloaded via requests with my user-agent string. (The site also gives a 403 when accessed with the requests packages default user-agent, suggesting they are just blocking common/generic user-agent strings, rather than adding them to the robot.txt file).
Anyway, is it possible to change the user-agent in the rootparser module? Or alternatively, to load in a robot.txt file downloaded seperately?

There is no option to fetch robots.txt with User-Agent using RobotFileParser, but you can fetch it yourself and path an array of strings to the parse() method :
from urllib.robotparser import RobotFileParser
import urllib.request
rp = RobotFileParser()
with urllib.request.urlopen(urllib.request.Request('http://stackoverflow.com/robots.txt',
headers={'User-Agent': 'Python'})) as response:
rp.parse(response.read().decode("utf-8").splitlines())
print(rp.can_fetch("*", "http://stackoverflow.com/posts/"))

Related

Extracting file name from url when its name is not in url

So I wanted to create a download manager, which can download multiple files automatically. I had a problem however with extracting the name of the downloaded file from the url. I tried an answer to How to extract a filename from a URL and append a word to it?, more specifically
a = urlparse(URL)
file = os.path.basename(a.path)
but all of them, including the one shown, break when you have a url such as
URL = https://calibre-ebook.com/dist/win64
Downloading it in Microsoft Edge gives you file with the name of calibre-64bit-6.5.0.msi, but downloading it with python, and using the method from the other question to extract the name of the file, gives you win64 instead, which is the intended file.
The URL https://calibre-ebook.com/dist/win64 is a HTTP 302 redirect to another URL https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi. You can see this by running a HEAD request, for example in a macOS/Linux terminal (note 302 and the location header):
$ curl --head https://calibre-ebook.com/dist/win64
HTTP/2 302
server: nginx
date: Wed, 21 Sep 2022 16:54:49 GMT
content-type: text/html
content-length: 138
location: https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi
The browser follows the HTTP redirect and downloads the file, naming it based on the last URL. If you'd like to do the same in Python, you also need to get to the last URL and use that as the file name. The requests library might or might not follow these redirects depending on the version, better to explicitly use allow_redirects=True.
With requests==2.28.1 this code returns the last URL:
import requests
requests.head('https://calibre-ebook.com/dist/win64', allow_redirects=True).url
# 'https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi'
If you'd like to solve it with built-in modules so you won't need to install external libs like requests you can also achieve the same with urllib:
import urllib.request
opener=urllib.request.build_opener()
opener.open('https://calibre-ebook.com/dist/win64').geturl()
# 'https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi'
Then you can split the lat URL by / and get the last section as the file name, for example:
import urllib.request
opener=urllib.request.build_opener()
url = opener.open('https://calibre-ebook.com/dist/win64').geturl()
url.split('/')[-1]
# 'calibre-64bit-6.5.0.msi'
I was using urllib3==1.26.12, requests==2.28.1 and Python 3.8.9 in the examples, if you are using much older versions they might behave differently and might need extra flags to ensure redirects.
The URL results in a 302 redirect, so you don't have enough information with just the URL to get that basename. You have to get the URL from 302 response.
import requests
resp = requests.head("https://calibre-ebook.com/dist/win64")
print(resp.status_code, resp.headers['location'])
>>> 302 https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi
You'd want to have more intelligent handling obviously in case it's not a 302. And you'd want to loop in case the new URL results in another redirect.

Scraping an internal web page

I have to scrape an internal web page of my organization. If I use Beautiful soap I get
"Unauthorized access"
I don't want to put my username/password in the source code because it will be shared across collegues.
If I open the same web url using Firefox It doesn't not ask me to login, the only problem is when I make the same request using python script.
Is there a way to share the same session used by firefox with a python script?
I think my authentication is with my PC because if I log off deleting all cookies When i re-enter I because logged in automatically. Do you know why with my python script this doesn’t not happen?
When you use the browser to login to your organization, you provide your credentials and the server returns a cookie tied to your organization's domain. This cookie has an expiration and allows to use navigate your organization's site without having to login as long as the cookie is valid.
You can read about cookies here:
https://en.wikipedia.org/wiki/HTTP_cookie
Your website scraper does not need to store your credentials. First delete the cookies then, using your browser's developer tools, you can (look at the network tab):
Figure out if your organization uses a separate auth end point
If it's not evident, then you might ask the IT department
Use the auth endpoint to get a cookie using credentials passed in
See how this cookie is used by the system (look at the HTTP request/response headers)
Use this cookie to scrape the website
Share your code freely - if someone needs to scrape the website then they can either pass in their credentials, or use a curl command to get/set a valid cookie header
1) After authenticating in your Firefox browser, make sure to get the cookie key/value.
2) Use that data in the code below :
from bs4 import BeautifulSoup
import requests
browser_cookies = {'your_cookie_key':'your_cookie_value'}
s = requests.Session()
r = s.get(your_url, cookies=browser_cookies)
bsoup = BeautifulSoup(r.text, 'lxml')
The requests.Session() is for persistence.
One more tips, you could also call your script like that :
python3 /path/to/script/script.py cookies_key cookies_value
Then, get the two values with sys module. The code will be :
import sys
browser_cookies = {sys.argv[1]:sys.argv[2]}

python requests: url with another sub-requests

When I execute manually this URL in my webbrowser I see in my network console that three other requests will be executed. it works.
call: www.my.url/publish_something
get this cmd
get that cmd
post that...
How can I do it in Python requests?
That I only call once the "main"-URL including all sub-requests like my webbrowser.
> publish_url = "www.my.url/publish_something" r =
> self.session.get(publish_url, verify=False, params=p)
it seems, when I call this url with python requests-module, he does not execute the sub-requests.
When you open an url in your browser, the browser
- issues a GET request to that url
- parse the content
- issues GET requests for each image tag and for each script, style etc tags mentionning an external source,
- executes the scripts, which may lead to more sub requests and DOM modifications,
- and finally render the final DOM.
When you send a GET request with Python (with python-rquests, the urllib module or whatever), on the first of the above stages is performed, so if you want more you'll have to do it by yourself (parsing the content, retrieving images etc etc).
Or you can use a headless browser like PhantomJS.

How to log in over HTTPS via urllib?

I'm interested in using Python to retrieve a file that exists at an HTTPS url.
I have credentials for the site, and when I access it in my browser I'm able to download the file. How do I use those credentials in my Python script to do the same thing?
So far I have:
import urllib.request
response = urllib.request.urlopen('https:// (some url with an XML file)')
html = response.read()
html.write('C:/Users/mhurley/Portable_Python/notebooks/XMLOut.xml')
This code works for non-secured pages, but (understandably) returns 401:Unauthorized for the https address. I don't understand how urllib handles credentials, and the docs aren't as helpful as I'd like.

Python - Requests module HTTP and HTTPS requests

I wish to make a requests with the Python requests module. I have a large database of urls I wish to download. the urls are in the database of the form page.be/something/something.html
I get a lot of ConnectionError's. If I search the URL in my browser, the page exists.
My Code:
if not webpage.url.startswith('http://www.'):
new_html = requests.get(webpage.url, verify=True, timeout=10).text
An example of a page I'm trying to download is carlier.be/categorie/jobs.html. This gives me a ConnectionError, logged as below:
Connection error, Webpage not available for
"carlier.be/categorie/jobs.html" with webpage_id "229998"
What seems to be the problem here? Why can't requests make the connection, while I can find the page in the browser?
The Requests library requires that you supply a schema for it to connect with (the 'http://' part of the url). Make sure that every url has http:// or https:// in front of it. You may want a try/except block where you catch a requests.exceptions.MissingSchema and try again with "http://" prepended to the url.

Categories