Download a zip file and extract it in memory using Python3

I would like to download a zip file from the internet and extract it. I would rather use requests, and I don't want to write to disk.
I knew how to do this in Python 2, but I am clueless for Python 3.3. Apparently, zipfile.ZipFile wants a file-like object, but I don't know how to get one from what requests returns.
If you know how to do it with urllib.request, I would be curious to see that too.

I found out how to do it:
import requests, zipfile
from io import BytesIO

response = requests.get(url)
archive = zipfile.ZipFile(BytesIO(response.content))
What I was missing:
response.content should be used to access the bytes.
io.BytesIO is the correct file-like object for bytes.
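Putting it together, a minimal end-to-end sketch (url and the member choice are placeholders) that lists the archive and reads one member entirely in memory, with no disk I/O:
import requests, zipfile
from io import BytesIO

response = requests.get(url)  # url is a placeholder
with zipfile.ZipFile(BytesIO(response.content)) as archive:
    print(archive.namelist())
    data = archive.read(archive.namelist()[0])  # bytes of the first member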

Here's another approach that saves you having to install requests:
import urllib.request, zipfile
from io import BytesIO

r = urllib.request.urlopen(url)
with zipfile.ZipFile(BytesIO(r.read())) as z:
    print(z.namelist())

Using requests, this can be done very simply. Since the question targets Python 3, io.BytesIO takes the place of Python 2's StringIO:
import requests, zipfile, io

response = requests.get(zip_file_url)
zip_document = zipfile.ZipFile(io.BytesIO(response.content))
io.BytesIO wraps the response's content attribute in a file-like object.
If you want to extract to a directory, you can use ZipFile's extractall() function:
zip_document.extractall()

Related

Using URL of file as file path in Python in Lambda

I am trying to acquire a file from a URL on the web and then open that file for use in an application I'm making in Python on AWS Lambda. There doesn't seem to be a way for me to acquire the file in the form I need it, which I believe to be an os.PathLike object.
Here is what I am trying now, which doesn't work since requests.get returns a response, not a path. I'm posting from a phone right now so I cannot use code tags. Apologies.
filename = requests.get("url.com/file.txt")
f = open(filename, 'rb')
I have also tried urlparse and urllib's urlretrieve on the URL, but that does not return a path-like object either. Note that I don't believe I can just use wget or something at the shell level, as I am using AWS Lambda.
import requests
url = 'http://url.com/file.txt'
r = requests.get(url, allow_redirects=True)
f = open(r, 'rb')  # fails: open() needs a path, not a Response
When you do this kind of operation, it's always good to inspect the entire response of the request you are making. I usually print the dict attribute, which works quite often:
print(response.__dict__)
On the ones I have done, there was a _content field in the response object holding the file bytes. You can then use the io module to wrap those bytes as a file (note that _content is private; the public equivalent is response.content):
file = io.BytesIO(response._content)
This can then be used as a file, just like one returned by the open() function.
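For completeness, a minimal sketch using only the public API (the URL is the placeholder from the question). If the downstream code truly needs a filesystem path, remember that on AWS Lambda only /tmp is writable:
import io
import requests

url = 'http://url.com/file.txt'  # placeholder from the question
response = requests.get(url, allow_redirects=True)
file_like = io.BytesIO(response.content)  # public counterpart of _content
data = file_like.read()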

urllib2.open error in python

I can't get the URL:
base_url = "http://status.aws.amazon.com/"
socket.setdefaulttimeout(30)
htmldata = urllib2.urlopen(base_url)
for url in parser.url_list:
    get_rss_th = threading.Thread(target=parser.get_rss, name="get_rss_th", args=(url,))
    get_rss_th.start()
print htmldata
<addinfourl at 140176301032584 whose fp = <socket._fileobject object at 0x7f7d56a09750>>
When I call htmldata.read() instead (Python error when using urllib.open), I just get a blank screen.
This is Python 2.7.
Whole code: https://github.com/tech-sketch/zabbix_aws_template/blob/master/scripts/AWS_Service_Health_Dashboard.py
The problem is that from the URL link (RSS feed) I can't get any output; the data variable in data = zbx_client.recv(4096) is empty, with no status.
There's no real problem with your code (except for a bunch of indentation errors and syntax errors that apparently aren't in your real code), only with your attempts to debug it.
First, you did this:
print htmldata
That's perfectly fine, but since htmldata is a urllib2 response object, printing it just prints that response object. Which apparently looks like this:
<addinfourl at 140176301032584 whose fp = <socket._fileobject object at 0x7f7d56a09750>>
That doesn't look like particularly useful information, but that's the kind of output you get when you print something that's only really useful for debugging purposes. It tells you what type of object it is, some unique identifier for it, and the key members (in this case, the socket fileobject wrapped up by the response).
Then you apparently tried this:
print htmldata.read()
But you already called read on the same object earlier:
parser.feed(htmldata.read())
When you read() the same file-like object twice, the first time gets everything in the file, and the second time gets everything after everything in the file—that is, nothing.
What you want to do is read() the contents once, into a string, and then you can reuse that string as many times as you want:
contents = htmldata.read()
parser.feed(contents)
print contents
It's also worth noting that, as the urllib2 documentation said right at the top:
See also The Requests package is recommended for a higher-level HTTP client interface.
Using urllib2 can be a big pain, in a lot of ways, and this is just one of the more minor ones. Occasionally you can't use requests because you have to dig deep under the covers of HTTP, or handle some protocol it doesn't understand, or you just can't install third-party libraries, so urllib2 (or urllib.request, as it's renamed in Python 3.x) is still there. But when you don't have to use it, it's better not to. Even pip, which Python bootstraps via ensurepip, bundles its own copy of requests rather than relying on urllib2.
With requests, the normal way to access the contents of a response is with the content (for binary) or text (for Unicode text) properties. You don't have to worry about when to read(); it does it automatically for you, and lets you access the text over and over. So, you can just do this:
import requests
base_url = "http://status.aws.amazon.com/"
response = requests.get(base_url, timeout=30)
parser.feed(response.content) # assuming it wants bytes, not unicode
print response.text
If I use this code:
import urllib2
import socket
base_url = "http://status.aws.amazon.com/"
socket.setdefaulttimeout(30)
htmldata = urllib2.urlopen(base_url)
print(htmldata.read())
I get the page's HTML code.

How to get the raw content of a response in requests with Python?

Trying to get the raw data of the HTTP response content in requests in Python. I am interested in forwarding the response through another channel, which means that ideally the content should be as pristine as possible.
What would be a good way to do this?
If you are using a requests.get call to obtain your HTTP response, you can use the raw attribute of the response. Here is the code from the requests docs. The stream=True parameter in the requests.get call is required for this to work.
>>> r = requests.get('https://github.com/timeline.json', stream=True)
>>> r.raw
<requests.packages.urllib3.response.HTTPResponse object at 0x101194810>
>>> r.raw.read(10)
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'
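Note that r.raw is the stream straight off the socket, so the \x1f\x8b prefix above is the gzip magic number: the server sent the body compressed. A small sketch, assuming you want urllib3 to decompress while you read (same docs URL as above):
r = requests.get('https://github.com/timeline.json', stream=True)
data = r.raw.read(10, decode_content=True)  # urllib3 decompresses as it reads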
After requests.get(), you can use r.content to access the raw response body as bytes.
r = requests.get('https://yourweb.com', stream=True)
r.content
To add to @brien's answer, as stated in the docs:
In general, however, you should use a pattern like this to save what is being streamed to a file:
r = requests.get('https://yourweb.com', stream=True)
with open(filename, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)
Using Response.iter_content will handle a lot of what you would otherwise have to handle when using Response.raw directly. When streaming a download, the above is the preferred and recommended way to retrieve the content. Note that chunk_size can be freely adjusted to a number that may better fit your use cases.
That pattern not only has the advantages described above, but is also a good way to fetch data in environments with limited memory.
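For instance, a hedged sketch that processes a streamed response chunk by chunk without ever holding the whole body in memory (here computing a checksum; the URL is a placeholder):
import hashlib
import requests

r = requests.get('https://yourweb.com/bigfile', stream=True)
digest = hashlib.sha256()
for chunk in r.iter_content(chunk_size=8192):
    digest.update(chunk)  # each chunk is processed, then discarded
print(digest.hexdigest())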

Decompress remote .gz file in Python

I have an issue with Python.
My case: I have a gzipped file from a partner platform (i.e. h..p//....namesite.../xxx).
If I click the link from my browser, it downloads a file such as namefile.xml.gz.
So... if I read this file locally with Python, I can decompress and read it.
Code:
content = gzip.open('namefile.xml.gz', 'rb')
print content.read()
But I can't when I try to read the file from the remote source.
From the remote file I can read only the compressed string, but not decode it.
Code:
response = urllib2.urlopen(url)
encoded = response.read()
print encoded
With this code I can read the compressed string... but I can't decode it with gzip or zlib.
Any advice?
Thanks a lot.
Unfortunately the method @Aya suggests does not work, since GzipFile extensively uses the seek method of the file object (not supported by the response).
So you basically have two options:
Read the contents of the remote file into io.BytesIO, and pass that object to gzip.GzipFile (if the file is small).
Download the file into a temporary file on disk, and use gzip.open.
There is another option (which requires some coding): implement your own reader using the zlib module, as sketched below. It is rather easy, but you will need to know about a magic constant (How can I decompress a gzip stream with zlib?).
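A minimal sketch of that zlib approach, assuming url comes from the question; the magic constant 16 + zlib.MAX_WBITS tells zlib to expect a gzip header:
import zlib
import urllib2

response = urllib2.urlopen(url)
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)  # gzip-wrapped deflate stream
content = decompressor.decompress(response.read())
content += decompressor.flush()  # flush any buffered tail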
If you use Python 3.2 or later, the bug in GzipFile (requiring tell support) is fixed, but apparently they aren't going to backport the fix to Python 2.x.
For Python v3.2 or later, you can use the gzip.GzipFile class to wrap the file object returned by urllib2.urlopen(), with something like this...
import urllib2
import gzip
response = urllib2.urlopen(url)
gunzip_response = gzip.GzipFile(fileobj=response)
content = gunzip_response.read()
print content
...which will transparently decompress the response stream as you read it.

Python urllib download only some of a webpage?

I have a program where I need to open many webpages and download information in them. The information, however, is in the middle of the page, and it takes a long time to get to it. Is there a way to have urllib only retrieve x lines? Or, if nothing else, don't load the information afterwards?
I'm using Python 2.7.1 on Mac OS 10.8.2.
The returned object is a file-like object, and you can use .readline() to read only a partial response:
resp = urllib.urlopen(url)
for i in range(10):
    line = resp.readline()
would read only 10 lines, for example. Note that this doesn't guarantee a faster response: the server may still send the whole page; you simply stop reading it.
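If the interesting data sits at a known byte offset rather than a known line, a similar sketch reads a fixed number of bytes instead (Python 2 urllib, matching the question; url is a placeholder):
import urllib

resp = urllib.urlopen(url)
head = resp.read(4096)  # fetch roughly the first 4 KB
resp.close()            # stop here; the rest of the page is never read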
