Extracting file name from URL when its name is not in the URL - Python

So I wanted to create a download manager that can download multiple files automatically. However, I had a problem with extracting the name of the downloaded file from the URL. I tried an answer to How to extract a filename from a URL and append a word to it?, more specifically
from urllib.parse import urlparse
import os

a = urlparse(URL)
file = os.path.basename(a.path)
but all of them, including the one shown, break when you have a URL such as
URL = 'https://calibre-ebook.com/dist/win64'
Downloading it in Microsoft Edge gives you a file named calibre-64bit-6.5.0.msi, but downloading it with Python and using the method from the other question to extract the file name gives you win64 instead of the intended name.

The URL https://calibre-ebook.com/dist/win64 is an HTTP 302 redirect to another URL, https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi. You can see this by running a HEAD request, for example in a macOS/Linux terminal (note the 302 status and the location header):
$ curl --head https://calibre-ebook.com/dist/win64
HTTP/2 302
server: nginx
date: Wed, 21 Sep 2022 16:54:49 GMT
content-type: text/html
content-length: 138
location: https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi
The browser follows the HTTP redirect and downloads the file, naming it based on the final URL. If you'd like to do the same in Python, you also need to resolve the final URL and use that as the file name. Note that requests.head() does not follow redirects by default, so explicitly pass allow_redirects=True.
With requests==2.28.1 this code returns the last URL:
import requests
requests.head('https://calibre-ebook.com/dist/win64', allow_redirects=True).url
# 'https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi'
If you'd like to solve it with built-in modules, so you won't need to install external libs like requests, you can achieve the same with urllib:
import urllib.request
opener = urllib.request.build_opener()
opener.open('https://calibre-ebook.com/dist/win64').geturl()
# 'https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi'
Then you can split the last URL by / and take the last section as the file name, for example:
import urllib.request
opener = urllib.request.build_opener()
url = opener.open('https://calibre-ebook.com/dist/win64').geturl()
url.split('/')[-1]
# 'calibre-64bit-6.5.0.msi'
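Splitting on / works for this URL, but if the final URL ever carries a query string, it would leak into the name. A slightly more robust variant (a sketch combining the two approaches above) strips it with urlparse:
import os
from urllib.parse import urlparse
import urllib.request

opener = urllib.request.build_opener()
url = opener.open('https://calibre-ebook.com/dist/win64').geturl()
# basename of the path ignores any ?query or #fragment parts
file_name = os.path.basename(urlparse(url).path)
# 'calibre-64bit-6.5.0.msi'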
I was using urllib3==1.26.12, requests==2.28.1 and Python 3.8.9 in the examples. If you are using much older versions, they might behave differently and might need extra flags to ensure redirects are followed.

The URL results in a 302 redirect, so you don't have enough information from the URL alone to get that basename. You have to get the URL from the 302 response.
import requests
resp = requests.head("https://calibre-ebook.com/dist/win64")
print(resp.status_code, resp.headers['location'])
# 302 https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi
Obviously you'd want more intelligent handling in case it's not a 302, and you'd want to loop in case the new URL results in another redirect, as in the sketch below.
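For example, a minimal sketch of such a loop (the function name and the max_hops guard are mine, added to avoid redirect cycles):
import requests
from urllib.parse import urljoin

def resolve_final_url(url, max_hops=10):
    # Follow Location headers manually, one hop at a time.
    for _ in range(max_hops):
        resp = requests.head(url, allow_redirects=False)
        if resp.status_code not in (301, 302, 303, 307, 308):
            return url
        # urljoin copes with servers that send a relative Location.
        url = urljoin(url, resp.headers['location'])
    raise RuntimeError('too many redirects')

print(resolve_final_url('https://calibre-ebook.com/dist/win64').split('/')[-1])
# 'calibre-64bit-6.5.0.msi'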

Related

I am not able to log in to a website using POST requests in Python

I am trying to log in to a website by passing a username and password. It says the session cookie is missing. I am a beginner with APIs and don't know if I have missed something here. The website is http://testing-ground.scraping.pro/login
import urllib3
http = urllib3.PoolManager()
url = 'http://testing-ground.scraping.pro/login?mode=login'
req = http.request('POST', url, fields={'usr':'admin','pwd':'12345'})
print(req.data.decode('utf-8'))
There are two issues in your code that make you unable to log in successfully.
The content-type issue
In the code you are using urllib3 to send data of content-type multipart/form-data. The website, however, seems to only accept the content-type application/x-www-form-urlencoded.
Try the following cURL commands:
curl -v -d "usr=admin&pwd=12345" http://testing-ground.scraping.pro/login?mode=login
curl -v -F "usr=admin" -F "pwd=12345" http://testing-ground.scraping.pro/login?mode=login
For the first one, the content-type in your request header is application/x-www-form-urlencoded, so the website takes it and logs you in (with a 302 Found response).
The second one, however, sends data with content-type multipart/form-data. The website doesn't take it and therefore rejects your login request (with a 200 OK response).
The cookie issue
Another issue is that urllib3 follows redirects by default. More importantly, cookies are not handled (i.e. stored and sent in subsequent requests) by urllib3 by default. Thus, the second request won't contain the cookie tdsess=TEST_DRIVE_SESSION, and the website therefore responds that you're not logged in.
If you only care about the login request, you can try the following code:
import urllib3
http = urllib3.PoolManager()
url = 'http://testing-ground.scraping.pro/login?mode=login'
req = http.request('POST', url, fields={'usr': 'admin', 'pwd': '12345'}, encode_multipart=False, redirect=False)
print(req.data.decode('utf-8'))
The encode_multipart=False instructs urllib3 to send the fields with content-type application/x-www-form-urlencoded; the redirect=False tells it not to follow the redirect, so that you can see the response to your initial request.
If you do want to complete the whole login process, however, you need to save the cookie from the first response and send it in the second request. You can do it with urllib3 (see the sketch after the Requests example below), or
Use the Requests library
I'm not sure if you have any particular reason to use urllib3. It will definitely work if you implement it well, but I would suggest trying the Requests library, which is much easier to use. For your case, the following code with Requests will work and get you to the welcome page:
import requests
url = 'http://testing-ground.scraping.pro/login?mode=login'
req = requests.post(url, data={'usr':'admin','pwd':'12345'})
print(req.text)
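For completeness, here is a minimal sketch of the full flow with urllib3 alone, saving the cookie from the login response and sending it back when following the redirect manually (the exact cookie value is whatever the site sets, e.g. tdsess=...):
import urllib3
from urllib.parse import urljoin

http = urllib3.PoolManager()
url = 'http://testing-ground.scraping.pro/login?mode=login'
# Step 1: log in without following the redirect, so the cookie is visible.
login = http.request('POST', url,
                     fields={'usr': 'admin', 'pwd': '12345'},
                     encode_multipart=False, redirect=False)
cookie = login.headers['Set-Cookie'].split(';')[0]
# Step 2: request the redirect target manually, sending the cookie back.
target = urljoin(url, login.headers['Location'])
welcome = http.request('GET', target, headers={'Cookie': cookie})
print(welcome.data.decode('utf-8'))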
import requests
auth_credentials = ("admin", "12345")
url = "http://testing-ground.scraping.pro/login?mode=login"
response = requests.post(url=url, auth=auth_credentials)
print(response.text)

Change user agent used with robotparser in Python

I am using the robotparser from the urllib module in Python to determine if I can download webpages. One site I am accessing, however, returns a 403 error when the robots.txt file is accessed via the default user-agent, but responds correctly if the file is downloaded, e.g., via requests with my own user-agent string. (The site also gives a 403 when accessed with the requests package's default user-agent, suggesting they are just blocking common/generic user-agent strings rather than adding them to the robots.txt file.)
Anyway, is it possible to change the user-agent in the robotparser module? Or alternatively, to load in a robots.txt file downloaded separately?
There is no option to fetch robots.txt with a custom User-Agent using RobotFileParser, but you can fetch it yourself and pass a list of strings to the parse() method:
from urllib.robotparser import RobotFileParser
import urllib.request

rp = RobotFileParser()
req = urllib.request.Request('http://stackoverflow.com/robots.txt',
                             headers={'User-Agent': 'Python'})
with urllib.request.urlopen(req) as response:
    rp.parse(response.read().decode("utf-8").splitlines())
print(rp.can_fetch("*", "http://stackoverflow.com/posts/"))

Python - Requests module HTTP and HTTPS requests

I wish to make requests with the Python requests module. I have a large database of URLs I wish to download. The URLs are stored in the database in the form page.be/something/something.html.
I get a lot of ConnectionErrors. If I search for the URL in my browser, the page exists.
My Code:
if not webpage.url.startswith('http://www.'):
    new_html = requests.get(webpage.url, verify=True, timeout=10).text
An example of a page I'm trying to download is carlier.be/categorie/jobs.html. This gives me a ConnectionError, logged as below:
Connection error, Webpage not available for
"carlier.be/categorie/jobs.html" with webpage_id "229998"
What seems to be the problem here? Why can't requests make the connection, while I can find the page in the browser?
The Requests library requires that you supply a schema for it to connect with (the 'http://' part of the URL). Make sure that every URL has http:// or https:// in front of it. You may want a try/except block where you catch a requests.exceptions.MissingSchema and try again with "http://" prepended to the URL, as in the sketch below.
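A minimal sketch of that retry (the fetch_html name and the timeout value are just illustrative):
import requests

def fetch_html(url):
    # Retry with an http:// prefix when the stored URL lacks a scheme.
    try:
        return requests.get(url, verify=True, timeout=10).text
    except requests.exceptions.MissingSchema:
        return requests.get('http://' + url, verify=True, timeout=10).text

html = fetch_html('carlier.be/categorie/jobs.html')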

URL Redirection not working

The following is the code I use,
import unshortenit
unshortened_uri,status = unshortenit.unshorten('http://4sq.com/1iyfyI5')
print unshortened_uri
print status
The following is the output:
https://foursquare.com/edtechschools/checkin/53ac1e5f498e5d8d736ef3be?s=BlinbPzgFfShr0vdUnbEJUnOYYI&ref=tw
Invalid URL u'/tanyaavrith/checkin/53ac1e5f498e5d8d736ef3be?s=BlinbPzgFfShr0vdUnbEJUnOYYI&ref=tw': No schema supplied
whereas if I use the same URL in a browser, it correctly redirects to the actual URL. Any idea why it's not working?
There's a 301 redirect chain:
From:
'http://4sq.com/1iyfyI5'
To:
'https://foursquare.com/edtechschools/checkin/53ac1e5f498e5d8d736ef3be?s=BlinbPzgFfShr0vdUnbEJUnOYYI&ref=tw'
To:
'/tanyaavrith/checkin/53ac1e5f498e5d8d736ef3be?s=BlinbPzgFfShr0vdUnbEJUnOYYI&ref=tw'
unshortenit uses requests, and requests can't understand the last, relative URL when asked to request it directly.
Update:
Actually, the requests lib can handle HTTP redirects automatically with the requests.get method.
e.g.
import requests
r = requests.get('http://4sq.com/1iyfyI5')
r.status_code # 200
r.url # u'https://foursquare.com/tanyaavrith/checkin/53ac1e5f498e5d8d736ef3be?s=BlinbPzgFfShr0vdUnbEJUnOYYI&ref=tw'
But unshortenit does not want the overhead of an HTTP GET; instead it uses HTTP HEAD. If the response to the HTTP HEAD request has a 'Location' field in its header, unshortenit makes a new HTTP HEAD request to that location. The new request is isolated from the original request, so a relative URL no longer works.
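A sketch of a HEAD-based loop that avoids this by resolving each Location value against the current URL (modern Python 3 and requests, whereas the question used Python 2; the max_hops guard is mine):
import requests
from urllib.parse import urljoin

def unshorten(url, max_hops=10):
    # Follow redirects with HEAD only, resolving relative Location
    # headers against the URL of the request that produced them.
    for _ in range(max_hops):
        resp = requests.head(url, allow_redirects=False)
        location = resp.headers.get('Location')
        if location is None:
            return url
        url = urljoin(url, location)
    raise RuntimeError('too many redirects')

print(unshorten('http://4sq.com/1iyfyI5'))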
Reference (from Wikipedia):
While the obsolete IETF standard RFC 2616 (HTTP 1.1) requires a complete absolute URI for redirection, the most popular web browsers tolerate the passing of a relative URL as the value for a Location header field. Consequently, the current revision of HTTP/1.1 makes relative URLs conforming.

HTML: Get direct link to file from embed src

I want to know how to get the direct link to an embedded video (the link to the .flv/.mp4 or whatever file) from just the embed link.
For example, http://www.kumby.com/ano-hana-episode-1/ has
<embed src="http://www.4shared.com/embed/571660264/396a46be"></embed>
, though the link to the video seems to be
"http://dc436.4shared.com/img/571660264/396a46be/dlink__2Fdownload_2FM2b0O5Rr_3Ftsid_3D20120514-093834-29c48ef9/preview.flv"
How does the browser know where to load the video from? How can I write code that converts the embed link to a direct link?
UPDATE:
Thanks for the quick answer, Quentin.
However, I don't seem to receive a 'Location' header when connecting to "http://www.4shared.com/embed/571660264/396a46be".
import urllib2
r=urllib2.urlopen('http://www.4shared.com/embed/571660264/396a46be')
gives me the following headers:
'content-length', 'via', 'x-cache', 'accept-ranges', 'server', 'x-cache-lookup', 'last-modified', 'connection', 'etag', 'date', 'content-type', 'x-jsl'
from urllib2 import Request
r=Request('http://www.4shared.com/embed/571660264/396a46be')
gives me no headers at all.
The server issues a 302 HTTP status code and a Location header.
$ curl -I http://www.4shared.com/embed/571660264/396a46be
HTTP/1.1 302 Moved Temporarily
Server: Apache-Coyote/1.1
(snip cookies)
Location: http://static.4shared.com/flash/player/5.6/player.swf?file=http://dc436.4shared.com/img/M2b0O5Rr/gg_Ano_Hi_Mita_Hana_no_Namae_o.flv&provider=image&image=http://dc436.4shared.com/img/M2b0O5Rr/gg_Ano_Hi_Mita_Hana_no_Namae_o.flv&displayclick=link&link=http://www.4shared.com/video/M2b0O5Rr/gg_Ano_Hi_Mita_Hana_no_Namae_o.html&controlbar=none
Content-Length: 0
Date: Mon, 14 May 2012 10:01:59 GMT
See How do I prevent Python's urllib(2) from following a redirect if you want to get information about the redirect response instead of following the redirect automatically.
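For instance, a sketch using Python 3's urllib.request (the question used the older urllib2; the NoRedirect class name is mine):
import urllib.request
import urllib.error

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None makes urllib raise the 302 as an HTTPError
    # instead of silently following the Location header.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect)
try:
    opener.open('http://www.4shared.com/embed/571660264/396a46be')
except urllib.error.HTTPError as e:
    print(e.code, e.headers.get('Location'))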
