Extracting a zipfile to memory?

Extracting a zipfile to memory? - python

How do I extract a zip to memory?
My attempt (returning None on .getvalue()):
from zipfile import ZipFile
from StringIO import StringIO
def extract_zip(input_zip):
return StringIO(ZipFile(input_zip).extractall())

extractall extracts to the file system, so you won't get what you want. To extract a file in memory, use the ZipFile.read() method.
If you really need the full content in memory, you could do something like:
def extract_zip(input_zip):
input_zip=ZipFile(input_zip)
return {name: input_zip.read(name) for name in input_zip.namelist()}

If you work with in-memory archives frequently, I would recommend making a tool. Something like this:
# Works in Python 2 and 3.
try:
import BytesIO
except ImportError:
from io import BytesIO # Python 3
import zipfile
class InMemoryZip(object):
def __init__(self):
# Create the in-memory file-like object for working w/IMZ
self.in_memory_zip = BytesIO()
# Just zip it, zip it
def append(self, filename_in_zip, file_contents):
# Appends a file with name filename_in_zip and contents of
# file_contents to the in-memory zip.
# Get a handle to the in-memory zip in append mode
zf = zipfile.ZipFile(self.in_memory_zip, "a", zipfile.ZIP_DEFLATED, False)
# Write the file to the in-memory zip
zf.writestr(filename_in_zip, file_contents)
# Mark the files as having been created on Windows so that
# Unix permissions are not inferred as 0000
for zfile in zf.filelist:
zfile.create_system = 0
return self
def read(self):
# Returns a string with the contents of the in-memory zip.
self.in_memory_zip.seek(0)
return self.in_memory_zip.read()
# Zip it, zip it, zip it
def writetofile(self, filename):
# Writes the in-memory zip to a physical file.
with open(filename, "wb") as file:
file.write(self.read())
if __name__ == "__main__":
# Run a test
imz = InMemoryZip()
imz.append("testfile.txt", "Make a test").append("testfile2.txt", "And another one")
imz.writetofile("testfile.zip")
print("testfile.zip created")

Probable reasons:
1.This module does not currently handle multi-disk ZIP files.
(OR)
2.Check with StringIO.getvalue() weather Unicode Error is coming up.

Related

python uploading a remote file to GCS , without saving it in the machine [duplicate]

I have managed to get my first python script to work which downloads a list of .ZIP files from a URL and then proceeds to extract the ZIP files and writes them to disk.
I am now at a loss to achieve the next step.
My primary goal is to download and extract the zip file and pass the contents (CSV data) via a TCP stream. I would prefer not to actually write any of the zip or extracted files to disk if I could get away with it.
Here is my current script which works but unfortunately has to write the files to disk.
import urllib, urllister
import zipfile
import urllib2
import os
import time
import pickle
# check for extraction directories existence
if not os.path.isdir('downloaded'):
os.makedirs('downloaded')
if not os.path.isdir('extracted'):
os.makedirs('extracted')
# open logfile for downloaded data and save to local variable
if os.path.isfile('downloaded.pickle'):
downloadedLog = pickle.load(open('downloaded.pickle'))
else:
downloadedLog = {'key':'value'}
# remove entries older than 5 days (to maintain speed)
# path of zip files
zipFileURL = "http://www.thewebserver.com/that/contains/a/directory/of/zip/files"
# retrieve list of URLs from the webservers
usock = urllib.urlopen(zipFileURL)
parser = urllister.URLLister()
parser.feed(usock.read())
usock.close()
parser.close()
# only parse urls
for url in parser.urls:
if "PUBLIC_P5MIN" in url:
# download the file
downloadURL = zipFileURL + url
outputFilename = "downloaded/" + url
# check if file already exists on disk
if url in downloadedLog or os.path.isfile(outputFilename):
print "Skipping " + downloadURL
continue
print "Downloading ",downloadURL
response = urllib2.urlopen(downloadURL)
zippedData = response.read()
# save data to disk
print "Saving to ",outputFilename
output = open(outputFilename,'wb')
output.write(zippedData)
output.close()
# extract the data
zfobj = zipfile.ZipFile(outputFilename)
for name in zfobj.namelist():
uncompressed = zfobj.read(name)
# save uncompressed data to disk
outputFilename = "extracted/" + name
print "Saving extracted file to ",outputFilename
output = open(outputFilename,'wb')
output.write(uncompressed)
output.close()
# send data via tcp stream
# file successfully downloaded and extracted store into local log and filesystem log
downloadedLog[url] = time.time();
pickle.dump(downloadedLog, open('downloaded.pickle', "wb" ))

Below is a code snippet I used to fetch zipped csv file, please have a look:
Python 2:
from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
resp = urlopen("http://www.test.com/file.zip")
myzip = ZipFile(StringIO(resp.read()))
for line in myzip.open(file).readlines():
print line
Python 3:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
# or: requests.get(url).content
resp = urlopen("http://www.test.com/file.zip")
myzip = ZipFile(BytesIO(resp.read()))
for line in myzip.open(file).readlines():
print(line.decode('utf-8'))
Here file is a string. To get the actual string that you want to pass, you can use zipfile.namelist(). For instance,
resp = urlopen('http://mlg.ucd.ie/files/datasets/bbc.zip')
myzip = ZipFile(BytesIO(resp.read()))
myzip.namelist()
# ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']

My suggestion would be to use a StringIO object. They emulate files, but reside in memory. So you could do something like this:
# get_zip_data() gets a zip archive containing 'foo.txt', reading 'hey, foo'
import zipfile
from StringIO import StringIO
zipdata = StringIO()
zipdata.write(get_zip_data())
myzipfile = zipfile.ZipFile(zipdata)
foofile = myzipfile.open('foo.txt')
print foofile.read()
# output: "hey, foo"
Or more simply (apologies to Vishal):
myzipfile = zipfile.ZipFile(StringIO(get_zip_data()))
for name in myzipfile.namelist():
[ ... ]
In Python 3 use BytesIO instead of StringIO:
import zipfile
from io import BytesIO
filebytes = BytesIO(get_zip_data())
myzipfile = zipfile.ZipFile(filebytes)
for name in myzipfile.namelist():
[ ... ]

I'd like to offer an updated Python 3 version of Vishal's excellent answer, which was using Python 2, along with some explanation of the adaptations / changes, which may have been already mentioned.
from io import BytesIO
from zipfile import ZipFile
import urllib.request
url = urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/loc162txt.zip")
with ZipFile(BytesIO(url.read())) as my_zip_file:
for contained_file in my_zip_file.namelist():
# with open(("unzipped_and_read_" + contained_file + ".file"), "wb") as output:
for line in my_zip_file.open(contained_file).readlines():
print(line)
# output.write(line)
Necessary changes:
There's no StringIO module in Python 3 (it's been moved to io.StringIO). Instead, I use io.BytesIO]2, because we will be handling a bytestream -- Docs, also this thread.
urlopen:
"The legacy urllib.urlopen function from Python 2.6 and earlier has been discontinued; urllib.request.urlopen() corresponds to the old urllib2.urlopen.", Docs and this thread.
Note:
In Python 3, the printed output lines will look like so: b'some text'. This is expected, as they aren't strings - remember, we're reading a bytestream. Have a look at Dan04's excellent answer.
A few minor changes I made:
I use with ... as instead of zipfile = ... according to the Docs.
The script now uses .namelist() to cycle through all the files in the zip and print their contents.
I moved the creation of the ZipFile object into the with statement, although I'm not sure if that's better.
I added (and commented out) an option to write the bytestream to file (per file in the zip), in response to NumenorForLife's comment; it adds "unzipped_and_read_" to the beginning of the filename and a ".file" extension (I prefer not to use ".txt" for files with bytestrings). The indenting of the code will, of course, need to be adjusted if you want to use it.
Need to be careful here -- because we have a byte string, we use binary mode, so "wb"; I have a feeling that writing binary opens a can of worms anyway...
I am using an example file, the UN/LOCODE text archive:
What I didn't do:
NumenorForLife asked about saving the zip to disk. I'm not sure what he meant by it -- downloading the zip file? That's a different task; see Oleh Prypin's excellent answer.
Here's a way:
import urllib.request
import shutil
with urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/2015-2_UNLOCODE_SecretariatNotes.pdf") as response, open("downloaded_file.pdf", 'w') as out_file:
shutil.copyfileobj(response, out_file)

I'd like to add my Python3 answer for completeness:
from io import BytesIO
from zipfile import ZipFile
import requests
def get_zip(file_url):
url = requests.get(file_url)
zipfile = ZipFile(BytesIO(url.content))
files = [zipfile.open(file_name) for file_name in zipfile.namelist()]
return files.pop() if len(files) == 1 else files

write to a temporary file which resides in RAM
it turns out the tempfile module ( http://docs.python.org/library/tempfile.html ) has just the thing:
tempfile.SpooledTemporaryFile([max_size=0[,
mode='w+b'[, bufsize=-1[, suffix=''[,
prefix='tmp'[, dir=None]]]]]])
This
function operates exactly as
TemporaryFile() does, except that data
is spooled in memory until the file
size exceeds max_size, or until the
file’s fileno() method is called, at
which point the contents are written
to disk and operation proceeds as with
TemporaryFile().
The resulting file has one additional
method, rollover(), which causes the
file to roll over to an on-disk file
regardless of its size.
The returned object is a file-like
object whose _file attribute is either
a StringIO object or a true file
object, depending on whether
rollover() has been called. This
file-like object can be used in a with
statement, just like a normal file.
New in version 2.6.
or if you're lazy and you have a tmpfs-mounted /tmp on Linux, you can just make a file there, but you have to delete it yourself and deal with naming

Adding on to the other answers using requests:
# download from web
import requests
url = 'http://mlg.ucd.ie/files/datasets/bbc.zip'
content = requests.get(url)
# unzip the content
from io import BytesIO
from zipfile import ZipFile
f = ZipFile(BytesIO(content.content))
print(f.namelist())
# outputs ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']
Use help(f) to get more functions details for e.g. extractall() which extracts the contents in zip file which later can be used with with open.

All of these answers appear too bulky and long. Use requests to shorten the code, e.g.:
import requests, zipfile, io
r = requests.get(zip_file_url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("/path/to/directory")

Vishal's example, however great, confuses when it comes to the file name, and I do not see the merit of redefing 'zipfile'.
Here is my example that downloads a zip that contains some files, one of which is a csv file that I subsequently read into a pandas DataFrame:
from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
import pandas
url = urlopen("https://www.federalreserve.gov/apps/mdrm/pdf/MDRM.zip")
zf = ZipFile(StringIO(url.read()))
for item in zf.namelist():
print("File in zip: "+ item)
# find the first matching csv file in the zip:
match = [s for s in zf.namelist() if ".csv" in s][0]
# the first line of the file contains a string - that line shall de ignored, hence skiprows
df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])
(Note, I use Python 2.7.13)
This is the exact solution that worked for me. I just tweaked it a little bit for Python 3 version by removing StringIO and adding IO library
Python 3 Version
from io import BytesIO
from zipfile import ZipFile
import pandas
import requests
url = "https://www.nseindia.com/content/indices/mcwb_jun19.zip"
content = requests.get(url)
zf = ZipFile(BytesIO(content.content))
for item in zf.namelist():
print("File in zip: "+ item)
# find the first matching csv file in the zip:
match = [s for s in zf.namelist() if ".csv" in s][0]
# the first line of the file contains a string - that line shall de ignored, hence skiprows
df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])

It wasn't obvious in Vishal's answer what the file name was supposed to be in cases where there is no file on disk. I've modified his answer to work without modification for most needs.
from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
def unzip_string(zipped_string):
unzipped_string = ''
zipfile = ZipFile(StringIO(zipped_string))
for name in zipfile.namelist():
unzipped_string += zipfile.open(name).read()
return unzipped_string

Use the zipfile module. To extract a file from a URL, you'll need to wrap the result of a urlopen call in a BytesIO object. This is because the result of a web request returned by urlopen doesn't support seeking:
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile
zip_url = 'http://example.com/my_file.zip'
with urlopen(zip_url) as f:
with BytesIO(f.read()) as b, ZipFile(b) as myzipfile:
foofile = myzipfile.open('foo.txt')
print(foofile.read())
If you already have the file downloaded locally, you don't need BytesIO, just open it in binary mode and pass to ZipFile directly:
from zipfile import ZipFile
zip_filename = 'my_file.zip'
with open(zip_filename, 'rb') as f:
with ZipFile(f) as myzipfile:
foofile = myzipfile.open('foo.txt')
print(foofile.read().decode('utf-8'))
Again, note that you have to open the file in binary ('rb') mode, not as text or you'll get a zipfile.BadZipFile: File is not a zip file error.
It's good practice to use all these things as context managers with the with statement, so that they'll be closed properly.

Avoid date changes in Zipfile.write

Looking at Zipfile module, I'm trying to figure out why the content of zipfile changes when I recreate a file with the same content
Here's a sample code I'm working on:
import os
import hashlib
import zipfile
from io import BytesIO
FILE_PATH = './'
SAMPLE_FILE = "zip_test123.txt"
# create an empty file
new_file = FILE_PATH+"/"+SAMPLE_FILE
try:
open(new_file, 'x')
except FileExistsError:
os.remove(new_file)
open(new_file, 'x')
full_path = os.path.expanduser(FILE_PATH)
# zip it
data = BytesIO()
with zipfile.ZipFile(data, mode='w') as zf:
zf.write(os.path.join(full_path, SAMPLE_FILE), SAMPLE_FILE)
zip_cntn = data.getvalue()
data.close()
print(zip_cntn)
print(hashlib.md5(zip_cntn).hexdigest())
This first creates an empty file, then zip it and prints out the hash of zipped data.
Running this multiple times results in differnt contents/hash, which I think is caused by modification date (my assumption is based on this which shows the Modified date as well)
I'm only interested in zipping the actual contents, and not anything else (e.g. hash should stay the same if I recreate the same content for a given file)
Any suggestion how to achieve this goal/ignore extra info while archiving a file?

function for loading both strings and files on disk?

I have a design question. I have a function loadImage() for loading an image file. Now it accepts a string which is a file path. But I also want to be able to load files which are not on physical disk, eg. generated procedurally. I could have it accept a string, but then how could it know the string is not a file path but file data? I could add an extra boolean argument to specify that, but that doesn't sound very clean. Any ideas?
It's something like this now:
def loadImage(filepath):
file = open(filepath, 'rb')
data = file.read()
# do stuff with data
The other version would be
def loadImage(data):
# do stuff with data
How to have this function accept both 'filepath' or 'data' and guess what it is?

You can change your loadImage function to expect an opened file-like object, such as:
def load_image(f):
data = file.read()
... and then have that called from two functions, one of which expects a path and the other a string that contains the data:
from StringIO import StringIO
def load_image_from_path(path):
with open(path, 'rb') as f:
load_image(f)
def load_image_from_string(s):
sio = StringIO(s)
try:
load_image(sio)
finally:
sio.close()

How about just creating two functions, loadImageFromString and loadImageFromFile?

This being Python, you can easily distinguish between a filename and a data string. I would do something like this:
import os.path as P
from StringIO import StringIO
def load_image(im):
fin = None
if P.isfile(im):
fin = open(im, 'rb')
else:
fin = StringIO(im)
# Read from fin like you would from any open file object
Other ways to do it would be a try block instead of using os.path, but the essence of the approach remains the same.

NameError: global name is not defined

Hello
My error is produced in generating a zip file. Can you inform what I should do?
main.py", line 2289, in get
buf=zipf.read(2048)
NameError: global name 'zipf' is not defined
The complete code is as follows:
def addFile(self,zipstream,url,fname):
# get the contents
result = urlfetch.fetch(url)
# store the contents in a stream
f=StringIO.StringIO(result.content)
length = result.headers['Content-Length']
f.seek(0)
# write the contents to the zip file
while True:
buff = f.read(int(length))
if buff=="":break
zipstream.writestr(fname,buff)
return zipstream
def get(self):
self.response.headers["Cache-Control"] = "public,max-age=%s" % 86400
start=datetime.datetime.now()-timedelta(days=20)
count = int(self.request.get('count')) if not self.request.get('count')=='' else 1000
from google.appengine.api import memcache
memcache_key = "ads"
data = memcache.get(memcache_key)
if data is None:
a= Ad.all().filter("modified >", start).filter("url IN", ['www.koolbusiness.com']).filter("published =", True).order("-modified").fetch(count)
memcache.set("ads", a)
else:
a = data
dispatch='templates/kml.html'
template_values = {'a': a , 'request':self.request,}
path = os.path.join(os.path.dirname(__file__), dispatch)
output = template.render(path, template_values)
self.response.headers['Content-Length'] = len(output)
zipstream=StringIO.StringIO()
file = zipfile.ZipFile(zipstream,"w")
url = 'http://www.koolbusiness.com/list.kml'
# repeat this for every URL that should be added to the zipfile
file =self.addFile(file,url,"list.kml")
# we have finished with the zip so package it up and write the directory
file.close()
zipstream.seek(0)
# create and return the output stream
self.response.headers['Content-Type'] ='application/zip'
self.response.headers['Content-Disposition'] = 'attachment; filename="list.kmz"'
while True:
buf=zipf.read(2048)
if buf=="": break
self.response.out.write(buf)

That is probably zipstream and not zipf. So replace that with zipstream and it might work.

i don't see where you declare zipf?
zipfile? Senthil Kumaran is probably right with zipstream since you seek(0) on zipstream before the while loop to read chunks of the mystery variable.
edit:
Almost certainly the variable is zipstream.
zipfile docs:
class zipfile.ZipFile(file[, mode[, compression[, allowZip64]]])
Open a ZIP file, where file can be either a path to a file (a string) or
a file-like object. The mode parameter
should be 'r' to read an existing
file, 'w' to truncate and write a new
file, or 'a' to append to an existing
file. If mode is 'a' and file refers
to an existing ZIP file, then
additional files are added to it. If
file does not refer to a ZIP file,
then a new ZIP archive is appended to
the file. This is meant for adding a
ZIP archive to another file (such as
python.exe).
your code:
zipsteam=StringIO.StringIO()
create a file-like object using StringIO which is essentially a "memory file" read more in docs
file = zipfile.ZipFile(zipstream,w)
opens the zipfile with the zipstream file-like object in 'w' mode
url = 'http://www.koolbusiness.com/list.kml'
# repeat this for every URL that should be added to the zipfile
file =self.addFile(file,url,"list.kml")
# we have finished with the zip so package it up and write the directory
file.close()
uses the addFile method to retrieve and write the retrieved data to the file-like object and returns it. The variables are slightly confusing because you pass a zipfile to the addFile method which aliases as zipstream (confusing because we are using zipstream as a StringIO file-like object). Anyways, the zipfile is returned, and closed to make sure everything is "written".
It was written to our "memory file", which we now seek to index 0
zipstream.seek(0)
and after doing some header stuff, we finally reach the while loop that will read our "memory-file" in chunks
while True:
buf=zipstream.read(2048)
if buf=="": break
self.response.out.write(buf)

You need to declare:
global zipf
right after your
def get(self):
line. you are modifying a global variable, and this is the only way python knows what you are doing.

Create zip archive for instant download

In a web app I am working on, the user can create a zip archive of a folder full of files. Here here's the code:
files = torrent[0].files
zipfile = z.ZipFile(zipname, 'w')
output = ""
for f in files:
zipfile.write(settings.PYRAT_TRANSMISSION_DOWNLOAD_DIR + "/" + f.name, f.name)
downloadurl = settings.PYRAT_DOWNLOAD_BASE_URL + "/" + settings.PYRAT_ARCHIVE_DIR + "/" + filename
output = "Download " + torrent_name + ""
return HttpResponse(output)
But this has the nasty side effect of a long wait (10+ seconds) while the zip archive is being downloaded. Is it possible to skip this? Instead of saving the archive to a file, is it possible to send it straight to the user?
I do beleive that torrentflux provides this excat feature I am talking about. Being able to zip GBs of data and download it within a second.

Check this Serving dynamically generated ZIP archives in Django

As mandrake says, constructor of HttpResponse accepts iterable objects.
Luckily, ZIP format is such that archive can be created in single pass, central directory record is located at the very end of file:
(Picture from Wikipedia)
And luckily, zipfile indeed doesn't do any seeks as long as you only add files.
Here is the code I came up with. Some notes:
I'm using this code for zipping up a bunch of JPEG pictures. There is no point compressing them, I'm using ZIP only as container.
Memory usage is O(size_of_largest_file) not O(size_of_archive). And this is good enough for me: many relatively small files that add up to potentially huge archive
This code doesn't set Content-Length header, so user doesn't get nice progress indication. It should be possible to calculate this in advance if sizes of all files are known.
Serving the ZIP straight to user like this means that resume on downloads won't work.
So, here goes:
import zipfile
class ZipBuffer(object):
""" A file-like object for zipfile.ZipFile to write into. """
def __init__(self):
self.data = []
self.pos = 0
def write(self, data):
self.data.append(data)
self.pos += len(data)
def tell(self):
# zipfile calls this so we need it
return self.pos
def flush(self):
# zipfile calls this so we need it
pass
def get_and_clear(self):
result = self.data
self.data = []
return result
def generate_zipped_stream():
sink = ZipBuffer()
archive = zipfile.ZipFile(sink, "w")
for filename in ["file1.txt", "file2.txt"]:
archive.writestr(filename, "contents of file here")
for chunk in sink.get_and_clear():
yield chunk
archive.close()
# close() generates some more data, so we yield that too
for chunk in sink.get_and_clear():
yield chunk
def my_django_view(request):
response = HttpResponse(generate_zipped_stream(), mimetype="application/zip")
response['Content-Disposition'] = 'attachment; filename=archive.zip'
return response

Here's a simple Django view function which zips up (as an example) any readable files in /tmp and returns the zip file.
from django.http import HttpResponse
import zipfile
import os
from cStringIO import StringIO # caveats for Python 3.0 apply
def somezip(request):
file = StringIO()
zf = zipfile.ZipFile(file, mode='w', compression=zipfile.ZIP_DEFLATED)
for fn in os.listdir("/tmp"):
path = os.path.join("/tmp", fn)
if os.path.isfile(path):
try:
zf.write(path)
except IOError:
pass
zf.close()
response = HttpResponse(file.getvalue(), mimetype="application/zip")
response['Content-Disposition'] = 'attachment; filename=yourfiles.zip'
return response
Of course this approach will only work if the zip files will conveniently fit into memory - if not, you'll have to use a disk file (which you're trying to avoid). In that case, you just replace the file = StringIO() with file = open('/path/to/yourfiles.zip', 'wb') and replace the file.getvalue() with code to read the contents of the disk file.

Does the zip library you are using allow for output to a stream. You could stream directly to the user instead of temporarily writing to a zip file THEN streaming to the user.

It is possible to pass an iterator to the constructor of a HttpResponse (see docs). That would allow you to create a custom iterator that generates data as it is being requested. However I don't think that will work with a zip (you would have to send partial zip as it is being created).
The proper way, I think, would be to create the files offline, in a separate process. The user could then monitor the progress and then download the file when its ready (possibly by using the iterator method described above). This would be similar what sites like youtube use when you upload a file and wait for it to be processed.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting a zipfile to memory? - python

How do I extract a zip to memory? My attempt (returning None on .getvalue()): from zipfile import ZipFile from StringIO import StringIO def extract_zip(input_zip): return StringIO(ZipFile(input_zip).extractall())

Probable reasons: 1.This module does not currently handle multi-disk ZIP files. (OR) 2.Check with StringIO.getvalue() weather Unicode Error is coming up.

Related

python uploading a remote file to GCS , without saving it in the machine [duplicate]

Avoid date changes in Zipfile.write

function for loading both strings and files on disk?

NameError: global name is not defined

Create zip archive for instant download

Categories

Resources