The following is the code I use:
import unshortenit
unshortened_uri, status = unshortenit.unshorten('http://4sq.com/1iyfyI5')
print(unshortened_uri)
print(status)
The following is the output:
https://foursquare.com/edtechschools/checkin/53ac1e5f498e5d8d736ef3be?s=BlinbPzgFfShr0vdUnbEJUnOYYI&ref=tw
Invalid URL u'/tanyaavrith/checkin/53ac1e5f498e5d8d736ef3be?s=BlinbPzgFfShr0vdUnbEJUnOYYI&ref=tw': No schema supplied
whereas if I use the same URL in a browser, it correctly redirects to the actual URL. Any idea why it's not working?
There's a 301 redirect chain:
From:
'http://4sq.com/1iyfyI5'
To:
'https://foursquare.com/edtechschools/checkin/53ac1e5f498e5d8d736ef3be?s=BlinbPzgFfShr0vdUnbEJUnOYYI&ref=tw'
To:
'/tanyaavrith/checkin/53ac1e5f498e5d8d736ef3be?s=BlinbPzgFfShr0vdUnbEJUnOYYI&ref=tw'
unshortenit uses requests, and requests can't understand the last relative URL.
Updates:
Actually, the requests library handles HTTP redirects automatically with the requests.get method.
e.g.
import requests
r=requests.get('http://4sq.com/1iyfyI5')
r.status_code # 200
r.url # u'https://foursquare.com/tanyaavrith/checkin/53ac1e5f498e5d8d736ef3be?s=BlinbPzgFfShr0vdUnbEJUnOYYI&ref=tw'
But unshortenit wants to avoid the overhead of HTTP GET, so it uses HTTP HEAD instead. If the response to the HEAD request has a 'Location' field in its headers, unshortenit makes a new HEAD request to that location. The new request is isolated from the original one, so a relative URL no longer works.
Reference (from Wikipedia):
While the obsolete IETF standard RFC 2616 (HTTP 1.1) requires a
complete absolute URI for redirection, the most popular web
browsers tolerate the passing of a relative URL as the value for a
Location header field. Consequently, the current revision of HTTP/1.1
makes relative URLs conforming.
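Given that, one way to make the HEAD-based approach work is to resolve each Location value against the URL that issued it. A minimal sketch (illustrative only, not how unshortenit itself is implemented):
import requests
from urllib.parse import urljoin

url = 'http://4sq.com/1iyfyI5'
for _ in range(10):  # cap the number of hops to avoid redirect loops
    r = requests.head(url)
    location = r.headers.get('Location')
    if location is None:
        break
    # Resolve a relative Location like '/tanyaavrith/checkin/...'
    # against the URL that issued it instead of using it verbatim.
    url = urljoin(url, location)
print(url)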
Related
So I wanted to create a download manager that can download multiple files automatically. However, I had a problem extracting the name of the downloaded file from the URL. I tried an answer to How to extract a filename from a URL and append a word to it?, more specifically:
from urllib.parse import urlparse
import os

a = urlparse(URL)
file = os.path.basename(a.path)
but all of them, including the one shown, break when you have a URL such as
URL = "https://calibre-ebook.com/dist/win64"
Downloading it in Microsoft Edge gives you a file named calibre-64bit-6.5.0.msi, but downloading it with Python and extracting the name with the method from the other question gives you win64 instead, even though the downloaded file itself is the intended one.
The URL https://calibre-ebook.com/dist/win64 is an HTTP 302 redirect to another URL, https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi. You can see this by running a HEAD request, for example in a macOS/Linux terminal (note the 302 status and the location header):
$ curl --head https://calibre-ebook.com/dist/win64
HTTP/2 302
server: nginx
date: Wed, 21 Sep 2022 16:54:49 GMT
content-type: text/html
content-length: 138
location: https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi
The browser follows the HTTP redirect and downloads the file, naming it based on the last URL. If you'd like to do the same in Python, you also need to get to the last URL and use that as the file name. Note that requests.head does not follow redirects by default, so pass allow_redirects=True explicitly.
With requests==2.28.1 this code returns the last URL:
import requests
requests.head('https://calibre-ebook.com/dist/win64', allow_redirects=True).url
# 'https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi'
If you'd like to solve it with built-in modules, so you won't need to install external libs like requests, you can achieve the same with urllib:
import urllib.request
opener = urllib.request.build_opener()
opener.open('https://calibre-ebook.com/dist/win64').geturl()
# 'https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi'
Then you can split the last URL by / and take the last section as the file name, for example:
import urllib.request
opener = urllib.request.build_opener()
url = opener.open('https://calibre-ebook.com/dist/win64').geturl()
url.split('/')[-1]
# 'calibre-64bit-6.5.0.msi'
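Note that splitting on / keeps any ?query=... part in the name; if that can happen, combining urlparse with os.path.basename (as in the snippet from the question) on the final URL is safer:
import os
from urllib.parse import urlparse
import urllib.request

opener = urllib.request.build_opener()
url = opener.open('https://calibre-ebook.com/dist/win64').geturl()
# basename of the path component only, so a query string is not
# mistaken for part of the file name
print(os.path.basename(urlparse(url).path))
# 'calibre-64bit-6.5.0.msi'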
I was using urllib3==1.26.12, requests==2.28.1 and Python 3.8.9 in the examples; if you are on much older versions, they might behave differently and might need extra flags to follow redirects.
The URL results in a 302 redirect, so the URL alone doesn't contain enough information to get that basename. You have to take the final URL from the 302 response.
import requests
resp = requests.head("https://calibre-ebook.com/dist/win64")
print(resp.status_code, resp.headers['location'])
# 302 https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi
You'd obviously want more intelligent handling in case it's not a 302, and you'd want to loop in case the new URL results in yet another redirect.
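A rough sketch of such a loop, under the assumption that a HEAD request is enough and that a relative Location header should be resolved against the current URL:
import requests
from urllib.parse import urljoin

def resolve(url, max_hops=10):
    """Follow redirects by hand so each hop can be inspected."""
    for _ in range(max_hops):
        resp = requests.head(url)
        if resp.status_code not in (301, 302, 303, 307, 308):
            return url  # not a redirect, we've reached the final URL
        # Location may be relative, so resolve it against the current URL
        url = urljoin(url, resp.headers['Location'])
    raise RuntimeError('too many redirects')

print(resolve('https://calibre-ebook.com/dist/win64'))
# https://download.calibre-ebook.com/6.5.0/calibre-64bit-6.5.0.msi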
I am using Python request library to scrape robots.txt data from a list of URLs:
for url in urls:
    url = urllib.parse.urljoin(url, "robots.txt")
    try:
        r = requests.get(url, headers=headers, allow_redirects=False)
        r.raise_for_status()
        extract_robots(r)
    except (exceptions.RequestException, exceptions.HTTPError, exceptions.Timeout) as err:
        handle_exeption(err)
In my list of URLs, I have this webpage: https://reward.ff.garena.com. When I request https://reward.ff.garena.com/robots.txt, I am directly redirected to https://reward.ff.garena.com/en, even though I specified allow_redirects=False in my request parameters to say that I don't want redirects.
How can I skip this kind of redirect and make sure I only have domain/robots.txt data calling my extract_robots(data) method?
Do you know for sure that there is a robots.txt at that location?
I note that if I request https://reward.ff.garena.com/NOSUCHFILE.txt, I get the same result as for robots.txt.
allow_redirects=False only stops requests from automatically following 302/Location responses; it doesn't stop the server you're trying to access from returning a redirect as the response to the request you're making.
If you get this type of response, it probably indicates the file you requested isn't available, or some other error is preventing you from accessing it. In the general case of file access this might indicate a need for authentication, but for robots.txt that shouldn't be the problem; the simplest assumption is that the robots.txt isn't there.
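One way to act on that, as a sketch: treat only an actual 200 response as robots.txt data and skip anything answered with a redirect (fetch_robots_txt is a hypothetical stand-in for the extract_robots flow in the question):
import urllib.parse
import requests

def fetch_robots_txt(site_url, headers=None):
    """Return the robots.txt body, or None when the request fails
    or is answered with a redirect (likely no real robots.txt)."""
    robots_url = urllib.parse.urljoin(site_url, '/robots.txt')
    try:
        r = requests.get(robots_url, headers=headers,
                         allow_redirects=False, timeout=10)
    except requests.exceptions.RequestException:
        return None
    # A 3xx status here means the site redirects /robots.txt elsewhere,
    # so treat it the same as "no robots.txt".
    return r.text if r.status_code == 200 else None

print(fetch_robots_txt('https://reward.ff.garena.com'))  # likely None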
I am trying to log in to a website by passing a username and password. It says the session cookie is missing. I am a beginner with APIs and don't know if I have missed something here. The website is http://testing-ground.scraping.pro/login
import urllib3
http = urllib3.PoolManager()
url = 'http://testing-ground.scraping.pro/login?mode=login'
req = http.request('POST', url, fields={'usr':'admin','pwd':'12345'})
print(req.data.decode('utf-8'))
There are two issues in your code that make you unable to log in successfully.
The content-type issue
In the code you are using urllib3 to send data with content-type multipart/form-data. The website, however, seems to accept only the content-type application/x-www-form-urlencoded.
Try the following cURL commands:
curl -v -d "usr=admin&pwd=12345" http://testing-ground.scraping.pro/login?mode=login
curl -v -F "usr=admin&pwd=12345" http://testing-ground.scraping.pro/login?mode=login
For the first one, the content-type in your request header is application/x-www-form-urlencoded, so the website takes it and logs you in (with a 302 Found response).
The second one, however, sends data with content-type multipart/form-data. The website doesn't take it and therefore rejects your login request (with a 200 OK response).
The cookie issue
Another issue is that urllib3 follows redirects by default. More importantly, cookies are not handled (i.e. stored and sent in subsequent requests) by default by urllib3. Thus, the second request won't contain the cookie tdsess=TEST_DRIVE_SESSION, and the website returns the message that you're not logged in.
If you only care about the login request, you can try the following code:
import urllib3
http = urllib3.PoolManager()
url = 'http://testing-ground.scraping.pro/login?mode=login'
req = http.request('POST', url, fields={'usr': 'admin', 'pwd': '12345'}, encode_multipart=False, redirect=False)
print(req.data.decode('utf-8'))
The encode_multipart=False instructs urllib3 to send data with content-type application/x-www-form-urlencoded; the redirect=False tells it not to follow the redirect, so that you can see the response of your initial request.
If you do want to complete the whole login process, however, you need to save the cookie from the first response and send it in the second request. You can do it manually with urllib3, as sketched below.
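A minimal sketch of the manual approach, assuming the 302 login response carries both Set-Cookie and Location headers (a real client should parse the cookie attributes instead of echoing the raw header back):
from urllib.parse import urljoin
import urllib3

http = urllib3.PoolManager()
url = 'http://testing-ground.scraping.pro/login?mode=login'

# Log in without following the redirect, so the Set-Cookie header
# of the 302 response is still visible to us.
login = http.request('POST', url,
                     fields={'usr': 'admin', 'pwd': '12345'},
                     encode_multipart=False, redirect=False)
cookie = login.headers.get('Set-Cookie', '')

# Follow the redirect manually, sending the cookie back.
next_url = urljoin(url, login.headers.get('Location', url))
welcome = http.request('GET', next_url, headers={'Cookie': cookie})
print(welcome.data.decode('utf-8'))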
Use the Requests library
I'm not sure if you have any particular reason to use urllib3. urllib3 will definitely work if you implement it well, but I would suggest trying the Requests library, which is much easier to use. For your case, the following code with Requests will work and get you to the welcome page:
import requests
url = 'http://testing-ground.scraping.pro/login?mode=login'
req = requests.post(url, data={'usr':'admin','pwd':'12345'})
print(req.text)
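If you need to make more requests after logging in, a requests.Session keeps the session cookie and sends it automatically on every subsequent call:
import requests

s = requests.Session()
url = 'http://testing-ground.scraping.pro/login?mode=login'
s.post(url, data={'usr': 'admin', 'pwd': '12345'})
# Cookies set during login (e.g. tdsess) are stored on the session
# and attached to every further request made through it.
print(s.cookies.get_dict())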
import requests
auth_credentials = ("admin", "12345")
url = "http://testing-ground.scraping.pro/login?mode=login"
response = requests.post(url=url, auth=auth_credentials)
print(response.text)
I have a huge list of URLs which redirect to different URLs.
I am supplying them in a for loop from a list, and trying to print the redirected URLs.
The first redirected URL prints fine.
But from the second one onwards, requests stops giving me redirected URLs and just prints the given URL.
I tried implementing this with urllib, urllib2, and mechanize.
They give the first redirected URL fine, then throw an error at the second one and stop.
Can anyone please let me know why this is happening?
Below is the pseudo code/implementation:
for given_url in url_list:
    print("Given URL: " + given_url)
    s = requests.Session()  # note: this session is created but never used
    r = requests.get(given_url, allow_redirects=True)
    redirected_url = r.url
    print("Redirected URL: " + redirected_url)
Output:
Given URL: www.xyz.com
Redirected URL: www.123456789.com
Given URL: www.abc.com
Redirected URL: www.abc.com
Given URL: www.pqr.com
Redirected URL: www.pqr.com
Try a HEAD request; it won't follow redirects or download the entire body:
import requests

r = requests.head('http://google.com/')  # returns a 301 to www.google.com
print(r.headers.get('Location'))
There is nothing wrong with the code snippet you provided, but as you mentioned in the comments, you are getting HTTP 400 and 401 responses. HTTP 401 means Unauthorized, i.e. the site is blocking you. HTTP 400 means Bad Request, which typically means the site doesn't understand your request, but it can also be returned when you are being blocked, which I suspect is the case here too.
When I run your code for the ABC website I get redirected properly, which leads me to believe they are blocking your IP address for sending too many requests in a short period of time and/or for having no User-Agent set.
Since you mentioned you can open the links correctly in a browser, you can try setting your User-Agent string to match that of a browser. However, this is not guaranteed to work, since it is only one of many signals a site may use to detect whether you are a bot.
For example:
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)  # url is one of the URLs from your list
While using requests to download a webpage, we store the result of that operation in a response object. What I could not understand is exactly what is stored in the response object: is it the HTML source code of the page, or the entire string content of the page?
It is an instance of the lower-level Response class of the Python requests library. The literal description from the documentation is:
The Response object, which contains a server's response to an HTTP request.
Every HTTP request sent returns a response from the server (the Response object), which includes quite a bit of information.
You can find all the info you need in the requests documentation, and the source is on GitHub.
The server and client use the HTTP protocol to send and receive information.
The response stores everything returned by the server: the status code, the HTTP headers (for example: cookies), and the HTTP body (mostly HTML, but it can be JSON, a file, or other data).
Wikipedia: HTTP protocol
BTW: the request also stores HTTP headers and an HTTP body (though sometimes the HTTP body is empty).
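A quick way to poke at what's inside (these attribute names are from the current requests API; httpbin.org is just a handy echo service):
import requests

r = requests.get('https://httpbin.org/get')

print(r.status_code)      # numeric HTTP status, e.g. 200
print(r.headers)          # response headers (case-insensitive dict)
print(r.cookies)          # cookies the server set
print(r.text)             # body decoded to str (HTML, JSON, ...)
print(r.content)          # raw body as bytes
print(r.json())           # body parsed as JSON, when it is JSON
print(r.request.headers)  # the request that produced this response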