Content-Length available in curl and wget, but not in Python Requests

I have a URL pointing to a binary file which I need to download after checking its size, because the download should only be (re-)executed if the local file size differs from the remote file size.
This is how it works with wget (anonymized host names and IPs):
$ wget <URL>
--2020-02-17 11:09:18-- <URL>
Resolving <host> (<host>)... <IP>
Connecting to <host> (<host>)|<ip>|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31581872 (30M) [application/x-gzip]
Saving to: ‘[...]’
This also works fine with the --continue flag in order to resume a download, including skipping if the file was completely downloaded earlier.
I can do the same with curl; the content-length is also present:
$ curl -I <url>
HTTP/2 200
date: Mon, 17 Feb 2020 13:11:55 GMT
server: Apache/2.4.25 (Debian)
strict-transport-security: max-age=15768000
last-modified: Fri, 14 Feb 2020 15:42:29 GMT
etag: "[...]"
accept-ranges: bytes
content-length: 31581872
vary: Accept-Encoding
content-type: application/x-gzip
In Python, I try to implement the same logic by checking the Content-length header using the requests library:
with requests.get(url, stream=True) as response:
    total_size = int(response.headers.get("Content-length"))
    if not response.ok:
        logger.error(
            f"Error {response.status_code} when downloading file from {url}"
        )
    elif os.path.exists(file) and os.stat(file).st_size == total_size:
        logger.info(f"File '{file}' already exists, skipping download.")
    else:
        [...]  # download file
It turns out that the Content-length header is never present, i.e. it gets a None value here. I know that this should be worked around by passing a default value to the get() call, but for the purpose of debugging, this example consequently triggers an exception:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
I can confirm manually that the Content-length header is not there:
requests.get(url, stream=True).headers
{'Date': '[...]', 'Server': '[...]', 'Strict-Transport-Security': '[...]', 'Upgrade': '[...]', 'Connection': 'Upgrade, Keep-Alive', 'Last-Modified': '[...]', 'ETag': '"[...]"', 'Accept-Ranges': 'bytes', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Keep-Alive': 'timeout=15, max=100', 'Transfer-Encoding': 'chunked', 'Content-Type': 'application/x-gzip'}
This logic works fine though for other URLs, i.e. I do get the Content-length header.
When using requests.head(url) (omitting the stream=True), I get the same headers except for Transfer-Encoding.
I understand that a server does not have to send a Content-length header.
However, wget and curl clearly do get that header. What do they do differently from my Python implementation?
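One likely difference, judging from the headers shown above (Content-Encoding: gzip together with Transfer-Encoding: chunked): requests sends Accept-Encoding: gzip, deflate by default, while plain wget and curl -I do not, so Apache compresses the response on the fly for requests and drops the Content-length, just as the mod_deflate answer further down describes. A minimal sketch to test that theory (url as in the question):
import requests

# Ask for the identity encoding so the server has no reason to compress
# on the fly; if the theory holds, Content-length reappears.
response = requests.head(url, headers={"Accept-Encoding": "identity"})
print(response.headers.get("Content-length"))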

Not really an answer to the question about the missing Content-length header, but a solution to the underlying problem:
Instead of checking the local file size against the content length of the remote file, I have ended up checking the Last-modified header and comparing it to the mtime of the local file, as sketched below. This is also safer in the (unlikely) case that the remote file is updated but still has the exact same size.
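A minimal sketch of that mtime check (assuming Python 3; url, file, and logger are the names from the question's snippet):
import os
from email.utils import parsedate_to_datetime

import requests

# Compare the remote Last-modified timestamp with the local file's mtime.
response = requests.head(url)
remote_mtime = parsedate_to_datetime(response.headers["Last-Modified"]).timestamp()
if os.path.exists(file) and os.path.getmtime(file) >= remote_mtime:
    logger.info(f"File '{file}' is up to date, skipping download.")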

Related

Apache sending Transfer-Encoding: chunked when deflate module is enabled

I have simple web.py code like the below, deployed with mod_wsgi in Apache.
import web

urls = (
    '/', 'index'
)

class index:
    def GET(self):
        content = 'hello'
        web.header('Content-length', len(content))
        return content

app = web.application(urls, globals())
application = app.wsgifunc()
This website runs well, except for one minor issue. When mod_deflate is turned on, the response is chunked, even though it has a very small response body.
Response Header
HTTP/1.1 200 OK
Date: Wed, 20 May 2015 20:14:12 GMT
Server: Apache/2.4.7 (Ubuntu)
Vary: Accept-Encoding
Content-Encoding: gzip
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html
When mod_deflate is turned off, the Content-Length header is back.
HTTP/1.1 200 OK
Date: Wed, 20 May 2015 20:30:09 GMT
Server: Apache/2.4.7 (Ubuntu)
Content-Length: 5
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=utf-8
I've searched around and someone said reducing the DeflateBufferSize would help, but this response's size is only 5, far below that directive's default value of 8096, so I don't think it interferes with this issue.
And someone said Apache sends a chunked response because it doesn't know the response's size before it begins sending the response to the client, but in my code I do set Content-Length.
I've also tried Flask and Apache/2.2.15 (CentOS), same result.
How do I set Content-Length when the deflate module is enabled? I'd rather not gzip the content in Python.
The response Content-Length has to reflect the final length of the data sent after the compression has been done, not the original length. Thus mod_deflate has to remove the original Content-Length header and use chunked transfer encoding. The only way it could otherwise know the content length, so as to send Content-Length before sending the compressed data, would be to buffer the complete compressed response in memory or in a file and then calculate the length. Buffering all the compressed content isn't practical, and in part defeats the point of compressing the data as the response is streamed.
If you don't want mod_deflate enabled for the whole site, then only enable it for certain URL prefixes by scoping it within a Location block.
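For example, a sketch of such a scoped configuration (the /downloads prefix is hypothetical):
<Location "/downloads">
    SetOutputFilter DEFLATE
</Location>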

Get status code for request sent using pysimplesoap

I'm using Pysimplesoap for sending data to the web service. I want to know the status code of the request. I'm able to print a traceback using trace=True; it does print the status code and other response variables, but how do I store all of that traceback in a variable and then check the status code from it?
Here's my code:
client = SoapClient(
    location = url,
    action = 'http://tempuri.org/IService_1_0/',
    namespace = "http://tempuri.org/",
    soap_ns = 'soap', ns = False, trace = True
)
data = {'AWB_Number': '406438762211111', 'Weight': '0.023', 'Length': '16.4', 'Height': '4.5', 'Width': '9.9'}
response = client.UpdateShipment(
    ShipmentNumber = data['AWB_Number'],
    Weight = Decimal(data['Weight']),
    Length = Decimal(data['Length']),
    Height = Decimal(data['Height']),
    Width = Decimal(data['Width']),
    InboundLane = "2",
    SequenceNumber = "1",
)
I do get a traceback:
Content-length: 526
Content-type: text/xml; charset="UTF-8"
<?xml version="1.0" encoding="UTF-8"?><soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soap:Header/>
<soap:Body>
<UpdateShipment xmlns="http://tempuri.org/">
<SequenceNumber>1</SequenceNumber><Weight>0.023</Weight><Height>4.5</Height><Width>9.9</Width><Length>16.4</Length><ShipmentNumber>406438762211111</ShipmentNumber><InboundLane>2</InboundLane></UpdateShipment>
</soap:Body>
</soap:Envelope>
status: 200
content-length: 293
x-powered-by: ASP.NET
server: Microsoft-IIS/7.5
date: Sat, 23 Aug 2014 07:27:38 GMT
content-type: text/xml; charset=utf-8
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/"><s:Body xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><UpdateShipmentResponse xmlns="http://tempuri.org/"><UpdateShipmentResult/></UpdateShipmentResponse></s:Body></s:Envelope>
===============================================================================
The status code is mentioned there as 200, but I can't store this traceback in a variable to read the status code from it. A human has to intervene and look at the status code. How does my program get to know the status code?
The SoapClient instance retains a .response attribute, containing information about the response. What that information is depends on the transport picked.
If you have just PySimpleSoap installed, the urllib2 library is picked and the status code is not part of the client.response attribute; the information is not retained from the actual response from urllib2, only the HTTP headers are preserved.
The pycurl transport gives you even less info; client.response is always an empty dictionary then.
Only if you also installed httplib2 will you get anything useful out of this; client.response is then set to a dictionary that includes a status code:
>>> import pysimplesoap
>>> client = pysimplesoap.client.SoapClient(wsdl='http://www.webservicex.net/stockquote.asmx?WSDL')
>>> response = client.GetQuote('GOOG')
>>> client.response
{'status': '200', 'content-length': '991', 'x-aspnet-version': '4.0.30319', 'vary': 'Accept-Encoding', 'server': 'Microsoft-IIS/7.0', '-content-encoding': 'gzip', 'cache-control': 'private, max-age=0', 'date': 'Sat, 23 Aug 2014 08:05:19 GMT', 'x-powered-by': 'ASP.NET', 'content-type': 'text/xml; charset=utf-8'}
>>> client.response['status']
'200'
Note that the value is a string, not an integer.
The httplib2 transport is picked by default if available.
As for the trace option: that sets up a logging module logger, calls logging.basicConfig() with a log level, and gets on with it. You could add a custom handler to the pysimplesoap.client.log object, but I wouldn't bother, really. If you see the status logged in the trace, then in all likelihood you are already using httplib2 and can access the status code more directly. For urllib2, for example, no status is logged either; that's because it is the client.response.items() values that are being logged here.
In the pysimplesoap source code, at line 261 of client, the send method is defined, which is used to send the HTTP request from the client's end.
Taking cue from it,
def send(self, method, xml):
    ...
    response, content = self.http.request(
        location, 'POST', body=xml, headers=headers)
    self.response = response
    self.content = content
    ...
you should be able to access client.response to get the raw response for the request. From there, try client.response.status, or do a dir(client.response) to find any attribute or method that exposes the status code.
EDIT
As mentioned here in the transport module, you can specify a library (like urllib2 or httplib2) to make it the one that is picked up.
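For example (a sketch: set_http_wrapper is the transport-selection helper in pysimplesoap.transport, so treat its exact location and signature as an assumption about your installed version):
from pysimplesoap.transport import set_http_wrapper

# Force the httplib2 transport so client.response carries the status code.
set_http_wrapper('httplib2')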

Downloading a *.gz zipped file with python requests corrupts it

I use this code (it is only a part) to download a *.gz archive.
with requests.session() as s:
    s.post(login_to_site_URL, payload)
    load = s.get(scene, stream=True)
    with open(path_to_file, "wb") as save_command:
        for chunk in load.iter_content(chunk_size=1024, decode_unicode=False):
            if chunk:
                save_command.write(chunk)
                save_command.flush()
After the download, the size of the file is about twice what I get when I download the file by clicking "save as" on it. And the file is corrupted.
The link for the file is: http://www.zsrcpod.aviales.ru/modistlm/archive/tlm/geo/00000/28325/terra_77835_20140806_060059.geo.hdf.gz
The file requires a login and password, so I have added a screenshot of what I see when I follow the link: http://i.stack.imgur.com/DGqtS.jpg
It looks like some option is set that defines this archive as text.
The response's headers are:
{'content-length': '58277138',
'content-encoding': 'gzip',
'set-cookie': 'cidaviales=53616c7465645f5fc8f0abdb26f7b0536784ae4e8b302410a288f1f67ccc0afd13ce067d97ba237dc27749d9957f30457f1a1d9763b03637; path=/,
avialestime=1407386483; path=/; expires=Wed,
05-Nov-2014 04:41:23 GMT,
ciddaviales=53616c7465645f5fc8f0abdb26f7b0536784ae4e8b302410a288f1f67ccc0afd13ce067d97ba237dc27749d9957f30457f1a1d9763b03637; domain=aviales.ru; path=/',
'accept-ranges': 'bytes',
'server': 'Apache/1.3.37 (Unix) mod_perl/1.30',
'last-modified': 'Wed, 06 Aug 2014 06:17:14 GMT',
'etag': '"21d4e63-3793d12-53e1c86a"',
'date': 'Thu, 07 Aug 2014 04:41:23 GMT',
'content-type': 'text/plain; charset=windows-1251'}
How do I properly download this file using the Python requests library?
It looks like requests automatically decompresses the content for you. See here:
Requests automatically decompresses gzip-encoded responses, and does its best to decode response content to unicode when possible. You can get direct access to the raw response (and even the socket), if needed as well.
This is the default behaviour when the Accept-Encoding request header contains gzip; you can check this by printing load.request.headers. To get the raw data you could modify that headers dict to exclude gzip. However, in your case the decompressed data looks like a valid hdf file, so just save it with that extension and use it!
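Alternatively, a sketch that sidesteps the automatic decompression entirely by streaming the raw, undecoded body (names reused from the question's snippet):
import shutil

load = s.get(scene, stream=True)
with open(path_to_file, "wb") as save_command:
    # load.raw yields the bytes exactly as the server sent them,
    # with no gzip decoding applied by requests.
    shutil.copyfileobj(load.raw, save_command)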

Script fails randomly due to urllib.error.HTTPError: HTTP Error 302

I have a strange problem I've been trying to google my way out of for several hours.
I've also tried solutions from similar topics on Stack Overflow, but still with no positive result:
How do I set cookies using Python urlopen?
Handling rss redirects with Python/urllib2
So the case is that I want to download a whole set of articles from some webpage. Its sub-links with the proper content differ by just one number, so I loop over the whole range (1 to 400 000) and write the HTML to files. What's important here is that this webpage needs the cookies to be re-sent in order to get to the proper URL, and after reading How to use Python to login to a webpage and retrieve cookies for later usage? I have this done.
But sometimes my script returns an error:
response = meth(req, response)
File "/usr/lib/python3.1/urllib/request.py", line 468, in http_response
'http', request, response, code, msg, hdrs)
....
File "/usr/lib/python3.1/urllib/request.py", line 553, in http_error_302 self.inf_msg + msg, headers, fp)
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
This problem is hard to reproduce because the script generally works fine, but the error happens randomly after a few thousand iterations of the for loop.
Here is the curl output from the server:
$ curl -I "http://my.url/"
HTTP/1.1 200 OK
Date: Wed, 17 Oct 2012 10:14:13 GMT
Server: Apache/2.2.15 (Oracle)
X-Powered-By: PHP/5.3.3
Set-Cookie: Kuuxk=ae7s3isu2cEshhijte4nb1clk5; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Vary: Accept-Encoding
Connection: close
Content-Type: text/html; charset=UTF-8
Some folks suggested using mechanize or trying to catch the exception, but I have no clue how to do that; others said that the error is caused by wrong cookie handling, but I also tried to get and send cookies 'manually' using urllib2 and add_header('cookie', cookie), with a similar result.
I wonder if my for loop and maybe a too-short sleep might cause the script to fail sometimes.
Anyway, any help is appreciated.
edit:
In case this might work: how do I catch the exception and ignore it?
edit:
Solved by simply ignoring this error. Now everything goes fine.
I used
try:
    ...  # here open the url
except urllib.error.HTTPError:
    pass
each time I use the url.open instruction.
TO BE CLOSED.
Let me suggest another solution:
HTTP status code 302 means Found redirection (See: https://en.wikipedia.org/wiki/HTTP_302).
For example:
HTTP/1.1 302 Found
Location: http://www.iana.org/domains/example/
You can grab the Location header and try fetching this url.
There are 8 redirection status codes (301-308). You can check for the Location header if 301 <= status code <= 308.
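A sketch of that approach with modern Python 3's urllib (url is a placeholder; retrying the Location target just once is an assumption of this sketch):
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as err:
    location = err.headers.get("Location")
    if 301 <= err.code <= 308 and location:
        # Follow the redirect target once instead of giving up.
        response = urllib.request.urlopen(location)
    else:
        raise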

python check url type

I wrote a crawler in Python. Fetched URLs have different types: a URL can point to an HTML page, an image, a big archive, or other files. So I need to quickly determine which case it is, to avoid reading big files such as large archives, and continue crawling. What is the best way to determine the URL type at the start of page loading?
I understand that I can do it by URL name (ends with .rar, .jpg, etc.), but I don't think that's a full solution. Do I need to check a header or something like that for this? I also need some page size prediction to prevent large downloads; in other words, to set a limit on the downloaded page size, to keep memory from being eaten up quickly.
If you use an HTTP HEAD request on the resource, you will get the relevant metadata about the resource without the resource data itself. Specifically, the content-length and content-type headers will be of interest.
E.g.
HEAD /stackoverflow/img/favicon.ico HTTP/1.1
host: sstatic.net
HTTP/1.1 200 OK
Cache-Control: max-age=604800
Content-Length: 1150
Content-Type: image/x-icon
Last-Modified: Mon, 02 Aug 2010 06:04:04 GMT
Accept-Ranges: bytes
ETag: "2187d82832cb1:0"
X-Powered-By: ASP.NET
Date: Sun, 12 Sep 2010 13:38:36 GMT
You can do this in python using httplib:
>>> import httplib
>>> conn = httplib.HTTPConnection("sstatic.net")
>>> conn.request("HEAD", "/stackoverflow/img/favicon.ico")
>>> res = conn.getresponse()
>>> print res.getheaders()
[('content-length', '1150'), ('x-powered-by', 'ASP.NET'), ('accept-ranges', 'bytes'), ('last-modified', 'Mon, 02 Aug 2010 06:04:04 GMT'), ('etag', '"2187d82832cb1:0"'), ('cache-control', 'max-age=604800'), ('date', 'Sun, 12 Sep 2010 13:39:26 GMT'), ('content-type', 'image/x-icon')]
This tells you it's an image (image/* mime-type) of 1150 bytes. Enough information for you to decide if you want to fetch the full resource.
Additionally, this header tells you the server accepts HTTP partial content requests (the accept-ranges header), which allows you to retrieve the data in batches.
You will get the same header information if you do a GET directly, but this will also start sending the resource data in the body of the response, something you want to avoid.
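Putting that together, a sketch of a decision helper in the same Python 2 httplib style as the example above (the 5 MB cap and the text/html filter are assumptions to illustrate the idea):
import httplib

MAX_BYTES = 5 * 1024 * 1024  # assumed size limit for the crawler

def should_fetch(host, path):
    # HEAD the resource and decide from the metadata alone.
    conn = httplib.HTTPConnection(host)
    conn.request("HEAD", path)
    res = conn.getresponse()
    content_type = res.getheader("content-type", "")
    content_length = int(res.getheader("content-length", "0"))
    conn.close()
    # Crawl only HTML pages below the size limit.
    return content_type.startswith("text/html") and content_length <= MAX_BYTES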
If you want to learn more about HTTP headers and their meaning, you can use an online tool such as 'Fetch'.
