HTML: Get direct link to file from embed src - python

I want to know how to get the direct link to an embedded video (the link to the .flv/.mp4 or whatever file) from just the embed link.
For example, http://www.kumby.com/ano-hana-episode-1/ has
<embed src="http://www.4shared.com/embed/571660264/396a46be"></embed>
, though the link to the video seems to be
"http://dc436.4shared.com/img/571660264/396a46be/dlink__2Fdownload_2FM2b0O5Rr_3Ftsid_3D20120514-093834-29c48ef9/preview.flv"
How does the browser know where to load the video from? How can I write code that converts the embed link to a direct link?
UPDATE:
Thanks for the quick answer, Quentin.
However, I don't seem to receive a 'Location' header when connecting to "http://www.4shared.com/embed/571660264/396a46be".
import urllib2
r=urllib2.urlopen('http://www.4shared.com/embed/571660264/396a46be')
gives me the following headers:
'content-length', 'via', 'x-cache', 'accept-ranges', 'server', 'x-cache-lookup', 'last-modified', 'connection', 'etag', 'date', 'content-type', 'x-jsl'
from urllib2 import Request
r=Request('http://www.4shared.com/embed/571660264/396a46be')
gives me no headers at all.

The server issues a 302 HTTP status code and a Location header.
$ curl -I http://www.4shared.com/embed/571660264/396a46be
HTTP/1.1 302 Moved Temporarily
Server: Apache-Coyote/1.1
(snip cookies)
Location: http://static.4shared.com/flash/player/5.6/player.swf?file=http://dc436.4shared.com/img/M2b0O5Rr/gg_Ano_Hi_Mita_Hana_no_Namae_o.flv&provider=image&image=http://dc436.4shared.com/img/M2b0O5Rr/gg_Ano_Hi_Mita_Hana_no_Namae_o.flv&displayclick=link&link=http://www.4shared.com/video/M2b0O5Rr/gg_Ano_Hi_Mita_Hana_no_Namae_o.html&controlbar=none
Content-Length: 0
Date: Mon, 14 May 2012 10:01:59 GMT
See How do I prevent Python's urllib(2) from following a redirect if you want to get information about the redirect response instead of following the redirect automatically.
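For example, a minimal sketch with the requests library (assuming it is installed), which exposes the redirect once automatic following is disabled:
import requests

# Disable automatic redirect handling so the 302 and its
# Location header are visible directly.
resp = requests.get('http://www.4shared.com/embed/571660264/396a46be',
                    allow_redirects=False)
print(resp.status_code)          # 302
print(resp.headers['Location'])  # the player URL containing the .flv path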

Related

Getting response 444 when making request to a webpage

I am trying to make a request to a webpage using the code below and I am getting response 444.
Is there anything I can do about it?
import requests
url = "https://www.pseudosite.com/"
response = requests.get(url)
print(response) # <Response [444]>
The http.dev website says the following:
When the 444 No Response status code is generated, the server returns no information to the client and closes the HTTP connection. This error message can be found in the nginx logs and will not be sent to the client. It is useful for dealing with malicious HTTP requests, such as one that includes an illegal Host header.
I am trying to scrape that website using Python, but I am blocked at the first step.
I believe that you need to add headers to view this webpage. If you open devtools in your browser, you should see a GET request; if you then click on the 'Headers' tab, you can build a dictionary from that data.
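For example, a minimal sketch, assuming the server rejects requests without browser-like headers (the header values below are illustrative placeholders; copy the real ones from your browser's devtools):
import requests

url = "https://www.pseudosite.com/"
# Hypothetical browser-like headers; replace with the values your browser sends.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml",
}
response = requests.get(url, headers=headers)
print(response.status_code)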

Extracting file name from url when its name is not in url

So I wanted to create a download manager that can download multiple files automatically. I had a problem, however, with extracting the name of the downloaded file from the URL. I tried an answer to How to extract a filename from a URL and append a word to it?, more specifically
from urllib.parse import urlparse
import os

a = urlparse(URL)
file = os.path.basename(a.path)
but all of them, including the one shown, break when you have a URL such as
URL = "https://calibre-ebook.com/dist/win64"
Downloading it in Microsoft Edge saves the file as calibre-64bit-6.5.0.msi (the intended file), but downloading it with Python and extracting the name with the method from the other question gives you win64 instead.
The URL https://calibre-ebook.com/dist/win64 is an HTTP 302 redirect to another URL, https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi. You can see this by running a HEAD request, for example in a macOS/Linux terminal (note the 302 status and the location header):
$ curl --head https://calibre-ebook.com/dist/win64
HTTP/2 302
server: nginx
date: Wed, 21 Sep 2022 16:54:49 GMT
content-type: text/html
content-length: 138
location: https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi
The browser follows the HTTP redirect and downloads the file, naming it based on the final URL. If you'd like to do the same in Python, you also need to get to the final URL and use that as the file name. Note that requests.head does not follow redirects by default, so pass allow_redirects=True explicitly.
With requests==2.28.1 this code returns the last URL:
import requests
requests.head('https://calibre-ebook.com/dist/win64', allow_redirects=True).url
# 'https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi'
If you'd like to solve it with built-in modules, so you won't need to install external libs like requests, you can achieve the same with urllib:
import urllib.request
opener = urllib.request.build_opener()
opener.open('https://calibre-ebook.com/dist/win64').geturl()
# 'https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi'
Then you can split the last URL by / and take the last segment as the file name, for example:
import urllib.request
opener = urllib.request.build_opener()
url = opener.open('https://calibre-ebook.com/dist/win64').geturl()
url.split('/')[-1]
# 'calibre-64bit-6.5.0.msi'
I was using urllib3==1.26.12, requests==2.28.1 and Python 3.8.9 in these examples; if you are on much older versions, they might behave differently and might need extra flags to ensure redirects are followed.
The URL results in a 302 redirect, so the URL alone doesn't give you enough information to get that basename. You have to get the final URL from the 302 response.
import requests
resp = requests.head("https://calibre-ebook.com/dist/win64")
print(resp.status_code, resp.headers['location'])
# 302 https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi
You'd obviously want more intelligent handling in case it's not a 302, and you'd want to loop in case the new URL results in another redirect; a sketch follows below.
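As a rough sketch of that looping logic (the hop limit and the list of redirect statuses are assumptions, not part of the original answer):
import requests
from urllib.parse import urljoin

def resolve_final_url(url, max_hops=10):
    # Follow redirects by hand; urljoin handles relative Location headers.
    for _ in range(max_hops):
        resp = requests.head(url, allow_redirects=False)
        if resp.status_code in (301, 302, 303, 307, 308):
            url = urljoin(url, resp.headers["Location"])
        else:
            return url
    raise RuntimeError("too many redirects")

print(resolve_final_url("https://calibre-ebook.com/dist/win64").split("/")[-1])
# calibre-64bit-6.5.0.msi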

Only Want Response Header Not Response body Python

I am making GET/POST requests to a URL and getting an HTML page in response. I only want the response headers, not the response body.
I have already used the HEAD method, but it does not work in all situations.
Receiving the complete HTML page in the response increases bandwidth usage.
I also need a solution that works for both HTTP and HTTPS requests.
For Example
import urllib2
urllib2.urlopen('http://www.google.com')
If I send a request to this URL using urllib2 or requests, I get both the response body and headers from the server. The response takes 14.08 KB; broken down, the headers take 775 bytes and the body takes 13.32 KB. So if I could get only the response headers, I would save 13.32 KB.
What you want to do is a so-called HEAD request. See this question on how to do it.
Is this what you are looking for:
import urllib2

# Note: urlopen issues a GET, so the server still sends the body;
# this only prints the headers, it does not save bandwidth.
l = urllib2.urlopen('http://www.google.com')
print(l.headers)
# Date: Thu, 11 Oct 2018 09:07:20 GMT
# Expires: -1
# ...
EDIT
This seems to do what you are looking for:
import requests
a = requests.head('https://www.google.com')
a.headers
#{'X-XSS-Protection': '1; mode=block', 'Content-Encoding':...
a.text
#u''
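If you have to stay on urllib2 rather than requests, a common pattern (a sketch, not the only way) is to subclass Request and override get_method so the request goes out as HEAD; it works the same for http:// and https:// URLs:
import urllib2

class HeadRequest(urllib2.Request):
    # Force the HTTP method to HEAD so only headers are returned.
    def get_method(self):
        return 'HEAD'

resp = urllib2.urlopen(HeadRequest('https://www.google.com'))
print(resp.info())  # headers only; a HEAD response has no body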

python - Service Unavailable - urllib proxy not working

I use this to get information from Google. I know that I will be blocked after a few requests, which is why I try to go through proxies. For the proxies I use ProxyBroker from this link:
The Link
However, if I use proxies, Google returns 503. If I click on the error, Google shows me my IP and not the proxy IP.
Here is what I've tried:
# Snippet from a larger class; self.getProxy() and search_url are defined elsewhere.
import contextlib
import urllib.request as urlrequest

usedProxy = self.getProxy()
if usedProxy is not None:
    proxies = {"http": "http://%s" % usedProxy[0]}
    headers = {'User-agent': 'Mozilla/5.0'}
    proxy_support = urlrequest.ProxyHandler(proxies)
    opener = urlrequest.build_opener(proxy_support, urlrequest.HTTPHandler(debuglevel=1))
    urlrequest.install_opener(opener)
    req = urlrequest.Request(search_url, None, headers)
    with contextlib.closing(urlrequest.urlopen(req)) as url:
        htmltext = url.read()
I tried with http and https.
Even when the request goes through, I get a 503 with the following message:
send: b'GET http://www.google.co.in/search?q=Test/ HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.google.co.in\r\nUser-Agent: Mozilla/5.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Date header: Server header: Location header: Pragma header: Expires header: Cache-Control header: Content-Type header: Content-Length header: X-XSS-Protection header: X-Frame-Options header: Connection
send: b'GET http://ipv4.google.com/sorry/index?continue=http://www.google.co.in/search%3Fq%3DTest/&q=EgTCDs9XGMbOgNAFIhkA8aeDS0dE8uXKu31DEbfj5mCVdhpUO598MgFy HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: ipv4.google.com\r\nUser-Agent: Mozilla/5.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
If the above error doesn't happen, I finally get the following Error:
[Errno 54] Connection reset by peer
My questions are:
Is the IP in the error link always my IP, and not the proxy IP?
Google Error Link
And if it is always the host IP that is shown in Google's error message, and the problem comes from the proxies, how can I bypass the error?
It seems that Google knows I am going through a proxy: Google uses HTTPS, and the HTTPS proxies don't seem to work. So the HTTP proxies are detected, which is why I get blocked after 50-60 queries.
My Solution:
I tried all the solutions found on Stack Overflow, such as sleeping for 10 seconds between requests, but they didn't work well. Then I found an article describing the same problem, and its solution was quite easy. First I installed the fake-useragent library for Python, which provides a ton of useful user agents.
I randomly select a user agent from that list on each request. I also restrict it to common user agents, because otherwise the page serves different HTML that doesn't fit my read method.
After installing fake-useragent and selecting a user agent randomly, I added a sleep of between 15 and 90 seconds between requests, because the article's author tried different timespans and still got blocked with a fixed 30 seconds. With these two simple changes my program has been running successfully for 10 hours without trouble.
I hope this helps you too, because it cost me a bunch of time to figure out when Google blocks you. It still detects the requests every time, but lets them through with this configuration.
Have fun, and I wish you all successful crawling!
EDIT:
The program gets through ~1000 requests before it is banned.
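For illustration, a minimal sketch of those two changes (the search URL and query list are placeholders, not the poster's actual code):
import random
import time

import requests
from fake_useragent import UserAgent  # pip install fake-useragent

ua = UserAgent()
for query in ["Test"]:  # hypothetical query list
    # A random common user agent per request, plus a random 15-90 s pause.
    headers = {"User-Agent": ua.random}
    resp = requests.get("http://www.google.co.in/search",
                        params={"q": query}, headers=headers)
    print(resp.status_code)
    time.sleep(random.uniform(15, 90))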

how to get the url after redirection using python

I am working on a web crawler project and right now I am facing a problem.
How do I get the URL after a page redirects?
I tried requests and it returns the value <Response [200]>.
When I crawl the download link of a file such as this one http://filehippo.com/download_firefox/download/f28dbaab19e38f3239d69ed7c350ac5d/ it opens a page that says the program is downloading, but after a few seconds the download actually starts. I want the URL of the download.
Thanks in advance.
The link to download the file directly is located in the meta tag:
<meta http-equiv="Refresh" content="3; url=/download/file/0d48d61bb8c894b7388e83a3c873cde48f0b2cc330872f5ce77a3b38b24a4942/"/>
You need to read that link from the page and then request it. Once you do, it will direct you to the actual file download link:
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://fs41.filehippo.com/9452/f9851528b9974e08bf9fa217a7daa049/Firefox Setup 43.0.3.exe [following]
With requests, this redirect is handled automatically for you, and the end result is that you can then start downloading the file.
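A minimal sketch of that two-step process (the regex is a simple illustration; an HTML parser such as BeautifulSoup would be more robust, and the page structure may have changed since this was written):
import re
from urllib.parse import urljoin

import requests

page_url = 'http://filehippo.com/download_firefox/download/f28dbaab19e38f3239d69ed7c350ac5d/'
html = requests.get(page_url).text

# Pull the relative target out of the meta refresh tag shown above.
match = re.search(r'http-equiv="Refresh"[^>]*?url=([^"]+)', html, re.IGNORECASE)
if match:
    download_url = urljoin(page_url, match.group(1))
    resp = requests.get(download_url)  # requests follows the 301 for you
    print(resp.url)  # final location of the .exe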
Your download is initiated only after some code runs in the browser
Your example URL doesn't seem to redirect with an HTTP redirect; it initiates a download once the browser loads that page and some client-side code executes. Your URL isn't an HTTP redirect.
To see what I mean, open the development console (Firebug, Chrome DevTools, etc.) in your browser on its Network tab and refresh the page to watch everything that happens before the actual file is downloaded. In the Network tab you can also find the URL of the file.
However, it may not be useful to crawl, because the URL may be "salted" with a token that expires or is only valid for the client that requested it, essentially rendering that download URL unshareable.
Browser automation
You might be able to get that URL with a browser automation tool like Selenium or PhantomJS, by watching the network log and grepping for the URL structures you want (e.g. for this file you're looking for a .exe in the URL), as sketched below.
Bottom line: you can get that URL by using a browser automation tool and capturing all of its network data; however, a secure architecture would render that URL unshareable.
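For example, a rough sketch with Selenium 4 and Chrome (assuming chromedriver is available; the log parsing is simplified and the logging capability is Chrome-specific):
import json

from selenium import webdriver

options = webdriver.ChromeOptions()
# Enable the performance log so network events are recorded.
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
driver = webdriver.Chrome(options=options)
driver.get("http://filehippo.com/download_firefox/download/f28dbaab19e38f3239d69ed7c350ac5d/")

for entry in driver.get_log("performance"):
    message = json.loads(entry["message"])["message"]
    if message["method"] == "Network.responseReceived":
        url = message["params"]["response"]["url"]
        if url.endswith(".exe"):  # grep for the URL structure you want
            print(url)

driver.quit()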
A URL that does actually redirect
However, to show you how to handle a URL that actually does redirect over HTTP, here is an example with the Python requests library.
Your URL doesn't redirect
>>> import requests
>>> response = requests.get('http://filehippo.com/download_firefox/download/f28dbaab19e38f3239d69ed7c350ac5d/')
>>> response.history
[] # There's no redirect there
>>> response.status_code
200
Let's try with a test URL that redirects
>>> response = requests.get('http://httpbin.org/redirect/3')
>>> response.history
[<Response [302]>, <Response [302]>, <Response [302]>]
>>> for r in response.history: print r.status_code, r.url
...
302 http://httpbin.org/redirect/3
302 http://httpbin.org/relative-redirect/2
302 http://httpbin.org/relative-redirect/1
>>>
Use geturl() on the response object, or, to get the current URL, use self.request.url.
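For example, with urllib in Python 3 (a small sketch using the httpbin test service):
import urllib.request

# urlopen follows redirects automatically; geturl() reports the final URL.
resp = urllib.request.urlopen('http://httpbin.org/redirect/3')
print(resp.geturl())  # http://httpbin.org/get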
