Downloading a *.gz zipped file with python requests corrupts it

I use this code (it is only a part) to download a *.gz archive:
with requests.session() as s:
    s.post(login_to_site_URL, payload)
    load = s.get(scene, stream=True)
    with open(path_to_file, "wb") as save_command:
        for chunk in load.iter_content(chunk_size=1024, decode_unicode=False):
            if chunk:
                save_command.write(chunk)
                save_command.flush()
After the download, the file is about twice the size it is when I download it by clicking "Save as" on it, and the file is corrupted.
The link for the file is: http://www.zsrcpod.aviales.ru/modistlm/archive/tlm/geo/00000/28325/terra_77835_20140806_060059.geo.hdf.gz
The file requires a login and password, so I add a screenshot of what I see when I follow the link: http://i.stack.imgur.com/DGqtS.jpg
It looks like some option is set that makes the server serve this archive as text.
The response headers are:
{'content-length': '58277138',
'content-encoding': 'gzip',
'set-cookie': 'cidaviales=53616c7465645f5fc8f0abdb26f7b0536784ae4e8b302410a288f1f67ccc0afd13ce067d97ba237dc27749d9957f30457f1a1d9763b03637; path=/, avialestime=1407386483; path=/; expires=Wed, 05-Nov-2014 04:41:23 GMT, ciddaviales=53616c7465645f5fc8f0abdb26f7b0536784ae4e8b302410a288f1f67ccc0afd13ce067d97ba237dc27749d9957f30457f1a1d9763b03637; domain=aviales.ru; path=/',
'accept-ranges': 'bytes',
'server': 'Apache/1.3.37 (Unix) mod_perl/1.30',
'last-modified': 'Wed, 06 Aug 2014 06:17:14 GMT',
'etag': '"21d4e63-3793d12-53e1c86a"',
'date': 'Thu, 07 Aug 2014 04:41:23 GMT',
'content-type': 'text/plain; charset=windows-1251'}
How do I properly download this file using the Python requests library?

It looks like requests automatically decompresses the content for you; see the requests documentation:

Requests automatically decompresses gzip-encoded responses, and does its best to decode response content to unicode when possible. You can get direct access to the raw response (and even the socket), if needed as well.

This is the default behaviour whenever the Accept-Encoding request header contains gzip. You can check this by printing load.request.headers. To get the raw data you should modify that headers dict to exclude gzip. However, in your case the decompressed data looks like a valid .hdf file, so just save it with that extension and use it!
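For completeness, a minimal sketch of both routes, reusing the names from the question (login_to_site_URL, payload, scene, path_to_file); whether this particular server honours Accept-Encoding: identity is an assumption:

import shutil
import requests

with requests.session() as s:
    s.post(login_to_site_URL, payload)

    # Option 1: ask the server not to compress the transfer at all.
    load = s.get(scene, stream=True, headers={"Accept-Encoding": "identity"})

    # Option 2: keep the default headers but copy the undecoded byte
    # stream; load.raw does not apply gzip decoding unless asked to.
    # load = s.get(scene, stream=True)

    with open(path_to_file, "wb") as out:
        shutil.copyfileobj(load.raw, out)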

Related

Content-length available in Curl, Wget, but not in Python Requests

I have a URL pointing to a binary file which I need to download after checking its size, because the download should only be (re-)executed if the local file size differs from the remote file size.
This is how it works with wget (anonymized host names and IPs):
$ wget <URL>
--2020-02-17 11:09:18-- <URL>
Resolving <host> (<host>)... <IP>
Connecting to <host> (<host>)|<IP>|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31581872 (30M) [application/x-gzip]
Saving to: ‘[...]’
This also works fine with the --continue flag in order to resume a download, including skipping if the file was completely downloaded earlier.
I can do the same with curl; the content-length is also present:
$ curl -I <url>
HTTP/2 200
date: Mon, 17 Feb 2020 13:11:55 GMT
server: Apache/2.4.25 (Debian)
strict-transport-security: max-age=15768000
last-modified: Fri, 14 Feb 2020 15:42:29 GMT
etag: "[...]"
accept-ranges: bytes
content-length: 31581872
vary: Accept-Encoding
content-type: application/x-gzip
In Python, I try to implement the same logic by checking the Content-length header using the requests library:
with requests.get(url, stream=True) as response:
    total_size = int(response.headers.get("Content-length"))
    if not response.ok:
        logger.error(
            f"Error {response.status_code} when downloading file from {url}"
        )
    elif os.path.exists(file) and os.stat(file).st_size == total_size:
        logger.info(f"File '{file}' already exists, skipping download.")
    else:
        [...]  # download file
It turns out that the Content-length header is never present for this URL, i.e. headers.get() returns None here. I know that this should be worked around by passing a default value to the get() call, but for the purpose of debugging, this example consequently triggers an exception:
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
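(The straightforward guard mentioned above would be total_size = int(response.headers.get("Content-length", 0)), falling back to 0 when the header is absent.)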
I can confirm manually that the Content-length header is not there:
requests.get(url, stream=True).headers
{'Date': '[...]', 'Server': '[...]', 'Strict-Transport-Security': '[...]', 'Upgrade': '[...]', 'Connection': 'Upgrade, Keep-Alive', 'Last-Modified': '[...]', 'ETag': ''[...]'', 'Accept-Ranges': 'bytes', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Keep-Alive': 'timeout=15, max=100', 'Transfer-Encoding': 'chunked', 'Content-Type': 'application/x-gzip'}
This logic works fine though for other URLs, i.e. I do get the Content-length header.
When using requests.head(url) (omitting the stream=True), I get the same headers except for Transfer-Encoding.
I understand that a server does not have to send a Content-length header.
However, wget and curl clearly do get that header. What do they do differently from my Python implementation?
Not really an answer to the question about the missing Content-length header, but a solution to the underlying problem:
Instead of checking the local file size against the content length of the remote file, I ended up checking the Last-modified header and comparing it to the mtime of the local file. This is also safer in the (unlikely) case that the remote file is updated but still has the exact same size.
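A minimal sketch of that comparison, reusing url, file, and logger from the question (parsedate_to_datetime handles the RFC 2822 date format used by Last-modified):

import os
from email.utils import parsedate_to_datetime

import requests

response = requests.head(url)
remote_mtime = parsedate_to_datetime(response.headers["Last-modified"]).timestamp()

if os.path.exists(file) and os.stat(file).st_mtime >= remote_mtime:
    logger.info(f"File '{file}' is up to date, skipping download.")
else:
    ...  # download, then e.g. os.utime(file, (remote_mtime, remote_mtime))

As for the header itself: the dump above shows Content-Encoding: gzip together with Transfer-Encoding: chunked, which suggests this server compresses the resource on the fly for clients that advertise gzip support (requests does by default) and therefore cannot know the final length up front. Sending Accept-Encoding: identity may bring Content-Length back, though that is an assumption worth verifying against this particular server.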

POST request to API Prestashop with Python

I managed to list and create products through the Prestashop API. I want to automate the product update process on my website a little.
But I have an issue trying to upload images, both when creating a new product with an image and when uploading an image to a product I created through the webservice.
I don't see any error in my code, so I want to know whether I am using the Prestashop API incorrectly.
My code:
def addNewImage(product_id):
    file = 'foto.png'
    fd = io.open(file, "rb")
    data = fd.read()
    r = requests.post(urlimg + product_id, data=data, auth=(key, ""), headers={'Content-Type': 'multipart/form-data'})
    print(r.status_code)
    print(r.headers)
    print(r.content)
Prints:
500
{'Server': 'nginx', 'Date': 'Fri, 31 May 2019 09:18:27 GMT', 'Content-Type': 'text/xml;charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Access-Time': '1559294307', 'X-Powered-By': 'PrestaShop Webservice', 'PSWS-Version': '1.7.5.2', 'Execution-Time': '0.003', 'Set-Cookie': 'PrestaShop-30ff13c7718a401c862ad41ea4c0505f=def50200b7a8c608f3053d32136569a34c897c09cea1230b5f8a0aee977e6caac3e22bea39c63c30bfc955fe344d2cbabf640dc75039c63b33c88c5f33e6b01f2b282047bfb0e05c8f8eb7af08f2cc5b0c906d2060f92fea65f73ce063bf6d87bd8ac4d03d1f9fc0d7b6bf56b1eb152575ef559d95f89fc4f0090124630ae292633b4e08cfee38cee533eb8abe151a7d9c47ed84366a5dd0e241242b809300f84b9bb2; expires=Thu, 20-Jun-2019 09:18:27 GMT; Max-Age=1728000; path=/; domain=example.com; secure; HttpOnly', 'Vary': 'Authorization', 'MS-Author-Via': 'DAV'}
b'<?xml version="1.0" encoding="UTF-8"?>
<prestashop xmlns:xlink="http://www.w3.org/1999/xlink">
<errors>
<error>
<code><![CDATA[66]]></code>
<message><![CDATA[Unable to save this image]]></message>
</error>
</errors>
</prestashop>'
I tried Python's logging library, but it only tells me this:
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): midominio:443
DEBUG:urllib3.connectionpool:https://midominio:443 "POST /api/images/products/20 HTTP/1.1" 500 None
I also tried changing the file config/defines.inc.php to activate debug mode, as I read on the Prestashop forum, but it made no difference.
I also tried the prestapyt library (and prestapyt3), but they don't work with Python 3, and I read that they are not compatible with Prestashop 1.7.
Edit:
display_errors and log_errors are activated in my Plesk panel.
But when I go to var/www/vhosts/midominio/logs/error_log, I can't see any PHP error or 500 error on any line.
Thanks in advance for any suggestion...
Edit: I tried the suggestion in the answer, but it returns the same error.
I think the problem is in the post call, if everything else is working fine on the backend. The data argument is used to send form data and other text data. To upload a file, you should do it like this:
files = {'media': open('test.jpg', 'rb')}
requests.post(url, files=files)
So your code translates to:
def addNewImage(product_id):
    file = 'foto.png'
    fd = io.open(file, "rb")
    # Let requests build the multipart body and Content-Type header
    # (boundary included) itself; setting the header manually omits the
    # boundary and breaks the upload.
    # PrestaShop's image endpoint expects the form field to be named 'image'.
    r = requests.post(urlimg + product_id, auth=(key, ""), files={'image': fd})
    print(r.status_code)
    print(r.headers)
    print(r.content)
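The key detail is that requests generates the multipart Content-Type header, boundary included, on its own when files= is passed; overriding that header manually strips the boundary, which by itself can produce the "Unable to save this image" error even when the image data is fine. The field name image is based on PrestaShop's documented upload examples (e.g. curl -F image=@file.png) rather than on this particular shop, so treat it as an assumption to verify.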

Suppress logging by cloudstorage.open() in google app engine GCS client library Python

I'm using the GCS client library to write files to Google Cloud Storage in my App Engine app in Python.
Before creating a file, I need to make sure it doesn't already exist to avoid overwriting.
To do this I am checking to see if the file exists before trying to create it:
import cloudstorage as gcs

try:
    gcs_file = gcs.open(filename, 'r')
    gcs_file.close()
    return "File Exists!"
except gcs.NotFoundError as e:
    return "File Does Not Exist: " + str(e)
cloudstorage.open() is logging (either directly or indirectly) the fact that it receives a 404 error when trying to read the non-existent file. I would like to suppress this if possible.
Thanks
edit
Here's what is logged:
12:19:32.565 suspended generator _get_segment(storage_api.py:432)
raised NotFoundError(Expect status [200, 206] from Google Storage. But
got status 404. Path:
'/cs-mailing/bde24e63-4d31-41e5-8aff-14b76b239388.html'. Request
headers: {'Range': 'bytes=0-1048575', 'x-goog-api-version': '2',
'accept-encoding': 'gzip, *'}. Response headers:
{'alternate-protocol': '443:quic', 'content-length': '127', 'via':
'HTTP/1.1 GWA', 'x-google-cache-control': 'remote-fetch', 'expires':
'Mon, 02 Jun 2014 11:19:32 GMT', 'server': 'HTTP Upload Server Built
on May 19 2014 09:31:01 (1400517061)', 'cache-control': 'private,
max-age=0', 'date': 'Mon, 02 Jun 2014 11:19:32 GMT', 'content-type':
'application/xml; charset=UTF-8'}. Body: "<Error><Code>NoSuchKey</Code><Message>The specified
key does not exist.</Message></Error>". Extra info: None.
The logging in your case was triggered by str(e).
The issue was fixed in https://code.google.com/p/appengine-gcs-client/source/detail?r=172
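Given that, a minimal sketch of an existence check that avoids stringifying the exception, and so avoids the log line on library versions without that fix:

import cloudstorage as gcs

def gcs_file_exists(filename):
    try:
        gcs_file = gcs.open(filename, 'r')
        gcs_file.close()
        return True
    except gcs.NotFoundError:
        return False  # no str(e), so nothing extra ends up in the log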

Python HttpConnection - write request headers to file

I'm trying to log some connection info in the event that an error occurs. Using httplib's HTTPConnection I can print out the request headers by setting the debug level to 1:
connection = httplib.HTTPConnection('www.example.com')
connection.set_debuglevel(1)
However, this just seems to print directly to the shell, unconditionally. I need to be able to get this info as a string or something to store in a variable, so that I only print it when an exception is thrown.
The specific info that I want is the request headers that the library is generating.
I would use the requests HTTP library.
To get response headers, you just need this little piece of code:
import requests

try:
    r = requests.get("http://www.example.com")
    # Raise an exception in case of a "bad"
    # status code (a 4xx or 5xx response)
    r.raise_for_status()
    print r.headers
except Exception as e:
    print e.message
Output:
{'connection': 'close',
'content-encoding': 'gzip',
'content-length': '1162',
'content-type': 'text/html; charset=UTF-8',
'date': 'Sun, 12 Aug 2012 12:49:44 GMT',
'last-modified': 'Wed, 09 Feb 2011 17:13:15 GMT',
'server': 'Apache/2.2.3 (CentOS)',
'vary': 'Accept-Encoding'}
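Note that r.headers are the response headers; the request headers the library generated, which is what the question asks for, are also exposed on the response object. A one-line sketch with the same example response:

print r.request.headers  # the prepared request headers that were actually sent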
It turns out the trick is to redirect sys.stdout to a StringIO object, the contents of which can then be written to a file or used however you like, since you can get at the string. Check out StringIO.StringIO.
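A minimal sketch of that trick, in the same Python 2 style as the httplib snippet above; the captured text can then be logged only when an exception is thrown:

import sys
import httplib
import StringIO

capture = StringIO.StringIO()
old_stdout = sys.stdout
sys.stdout = capture  # httplib's debug output goes to stdout
try:
    connection = httplib.HTTPConnection('www.example.com')
    connection.set_debuglevel(1)
    connection.request('GET', '/')
    response = connection.getresponse()
finally:
    sys.stdout = old_stdout  # always restore stdout

debug_info = capture.getvalue()  # includes the request headers that were sent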

python check url type

I wrote a crawler in Python. Fetched URLs have different types: a URL can point to an HTML page, an image, a big archive, or other files. So I need to determine the type quickly, to avoid reading big files such as large archives, and continue crawling. What is the best way to determine the URL type at the start of page loading?
I understand that I can do it by the URL name (ends with .rar, .jpg, etc.), but I think that's not a full solution. Do I need to check a header or something like that? I also need some prediction of the page size to prevent large downloads; in other words, I want to set a limit on the downloaded page size to keep memory from being eaten up quickly.
If you use an HTTP HEAD request on the resource, you will get relevant metadata about the resource without the resource data itself. Specifically, the content-length and content-type headers will be of interest.
E.g.
HEAD /stackoverflow/img/favicon.ico HTTP/1.1
host: sstatic.net

HTTP/1.1 200 OK
Cache-Control: max-age=604800
Content-Length: 1150
Content-Type: image/x-icon
Last-Modified: Mon, 02 Aug 2010 06:04:04 GMT
Accept-Ranges: bytes
ETag: "2187d82832cb1:0"
X-Powered-By: ASP.NET
Date: Sun, 12 Sep 2010 13:38:36 GMT
You can do this in Python using httplib:
>>> import httplib
>>> conn = httplib.HTTPConnection("sstatic.net")
>>> conn.request("HEAD", "/stackoverflow/img/favicon.ico")
>>> res = conn.getresponse()
>>> print res.getheaders()
[('content-length', '1150'), ('x-powered-by', 'ASP.NET'), ('accept-ranges', 'bytes'), ('last-modified', 'Mon, 02 Aug 2010 06:04:04 GMT'), ('etag', '"2187d82832cb1:0"'), ('cache-control', 'max-age=604800'), ('date', 'Sun, 12 Sep 2010 13:39:26 GMT'), ('content-type', 'image/x-icon')]
This tells you it's an image (image/* mime-type) of 1150 bytes. Enough information for you to decide if you want to fetch the full resource.
Additionally, this header tells you the server accepts HTTP partial-content requests (the accept-ranges header), which allows you to retrieve the data in batches; see the sketch after the next paragraph.
You will get the same header information if you do a GET directly, but this will also start sending the resource data in the body of the response, something you want to avoid.
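If you do want a bounded download rather than a plain HEAD, here is a small sketch in the same httplib style (same favicon URL as above; whether a server honours the Range header is server-dependent):

import httplib

conn = httplib.HTTPConnection("sstatic.net")
conn.request("GET", "/stackoverflow/img/favicon.ico", headers={"Range": "bytes=0-1023"})
res = conn.getresponse()
# res.status should be 206 (Partial Content) when the range is honoured;
# a server that ignores it returns 200 with the full body instead.
data = res.read(1024)  # cap the read so memory stays bounded either way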
If you want to learn more about HTTP headers and their meaning, you can use an online tool such as 'Fetch'
