I'm trying to extract user-submitted zip and tar files to a directory. The documentation for zipfile's extractall method (similarly with tarfile's extractall) states that it's possible for paths to be absolute or contain .. paths that go outside the destination path. Instead, I could use extract myself, like this:
import zipfile

some_path = '/destination/path'
some_zip = '/some/file.zip'
zipf = zipfile.ZipFile(some_zip, mode='r')
for subfile in zipf.namelist():
    zipf.extract(subfile, some_path)
Is this safe? Is it possible for a file in the archive to wind up outside of some_path in this case? If so, what way can I ensure that files will never wind up outside the destination directory?
Note: Starting with Python 2.7.4, this is a non-issue for ZIP archives. Details at the bottom of the answer. This answer focuses on tar archives.
To figure out where a path really points to, use os.path.abspath() (but note the caveat about symlinks as path components). If you normalize a path from your zipfile with abspath and it does not contain the current directory as a prefix, it's pointing outside it.
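For instance, a minimal sketch of that prefix check (is_within is an illustrative helper name; the symlink caveat below still applies):

import os

def is_within(path, base):
    # normalize and require base to be a proper prefix of the result
    base = os.path.abspath(base)
    target = os.path.abspath(os.path.join(base, path))
    return target == base or target.startswith(base + os.sep)

print(is_within('docs/readme.txt', '/destination/path'))   # True
print(is_within('../../etc/passwd', '/destination/path'))  # False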
But you also need to check the value of any symlink extracted from your archive (both tarfiles and unix zipfiles can store symlinks). This is important if you are worried about a proverbial "malicious user" that would intentionally bypass your security, rather than an application that simply installs itself in system libraries.
That's the aforementioned caveat: abspath will be misled if your sandbox already contains a symlink that points to a directory. Even a symlink that points within the sandbox can be dangerous: The symlink sandbox/subdir/foo -> .. points to sandbox, so the path sandbox/subdir/foo/../.bashrc should be disallowed. The easiest way to do so is to wait until the previous files have been extracted and use os.path.realpath(). Fortunately extractall() accepts a generator, so this is easy to do.
Since you ask for code, here's a bit that explicates the algorithm. It prohibits not only the extraction of files to locations outside the sandbox (which is what was requested), but also the creation of links inside the sandbox that point to locations outside the sandbox. I'm curious to hear if anyone can sneak any stray files or links past it.
import tarfile
from os import sep
from os.path import abspath, realpath, dirname, join as joinpath
from sys import stderr

resolved = lambda x: realpath(abspath(x))

def badpath(path, base):
    # joinpath will ignore base if path is absolute
    result = resolved(joinpath(base, path))
    # compare against base + sep so that e.g. /sandbox-evil cannot pass for /sandbox
    return not (result == base or result.startswith(base + sep))

def badlink(info, base):
    # Links are interpreted relative to the directory containing the link
    tip = resolved(joinpath(base, dirname(info.name)))
    return badpath(info.linkname, base=tip)

def safemembers(members):
    base = resolved(".")
    for finfo in members:
        if badpath(finfo.name, base):
            print(finfo.name, "is blocked (illegal path)", file=stderr)
        elif finfo.issym() and badlink(finfo, base):
            print(finfo.name, "is blocked: Symlink to", finfo.linkname, file=stderr)
        elif finfo.islnk() and badlink(finfo, base):
            print(finfo.name, "is blocked: Hard link to", finfo.linkname, file=stderr)
        else:
            yield finfo

ar = tarfile.open("testtar.tar")
ar.extractall(path="./sandbox", members=safemembers(ar))
ar.close()
Edit: Starting with Python 2.7.4, this is a non-issue for ZIP archives: the method ZipFile.extract() prohibits the creation of files outside the sandbox:
Note: If a member filename is an absolute path, a drive/UNC sharepoint and leading (back)slashes will be stripped, e.g.: ///foo/bar becomes foo/bar on Unix, and C:\foo\bar becomes foo\bar on Windows. And all ".." components in a member filename will be removed, e.g.: ../../foo../../ba..r becomes foo../ba..r. On Windows, illegal characters (:, <, >, |, ", ?, and *) [are] replaced by underscore (_).
The tarfile module has not been similarly sanitized, so the above answer still applies.
Contrary to the popular answer, unzipping files safely is not completely solved as of Python 2.7.4. The extractall method is still dangerous and can lead to path traversal, either directly or through the unzipping of symbolic links. Here is my final solution, which should prevent both attacks in all versions of Python, even versions prior to Python 2.7.4 where the extract method was vulnerable:
import zipfile, os

def safe_unzip(zip_file, extract_path='.'):
    base = os.path.realpath(extract_path)
    with zipfile.ZipFile(zip_file, 'r') as zf:
        for member in zf.infolist():
            file_path = os.path.realpath(os.path.join(extract_path, member.filename))
            # compare against base + os.sep so /dest-evil cannot pass for /dest
            if file_path == base or file_path.startswith(base + os.sep):
                zf.extract(member, extract_path)
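For example, a minimal call (the archive and destination paths are illustrative):

safe_unzip('/some/file.zip', '/destination/path')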
Edit 1: Fixed variable name clash. Thanks Juuso Ohtonen.
Edit 2: s/abspath/realpath/g. Thanks TheLizzard
Use ZipFile.infolist()/TarFile.next()/TarFile.getmembers() to get the information about each entry in the archive, normalize the path, open the output file yourself, use ZipFile.open()/TarFile.extractfile() to get a file-like object for the entry, and copy the entry data yourself.
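For ZIP files, a minimal sketch of that manual approach (manual_extract and dest_dir are illustrative names; entries that would land outside the destination are simply skipped):

import os
import shutil
import zipfile

def manual_extract(zip_path, dest_dir):
    base = os.path.realpath(dest_dir)
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            # normalize the entry name and refuse anything that escapes dest_dir
            target = os.path.realpath(os.path.join(base, info.filename))
            if target != base and not target.startswith(base + os.sep):
                continue  # would escape the destination directory
            if info.filename.endswith('/'):  # directory entry
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # copy the entry data ourselves instead of calling extract()
            with zf.open(info) as src, open(target, 'wb') as dst:
                shutil.copyfileobj(src, dst)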
Copy the zipfile to an empty directory. Then use os.chroot to make that directory the root directory. Then unzip there.
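A rough sketch of that idea (extract_in_chroot is a hypothetical helper; os.chroot is Unix-only and requires root privileges, so this is rarely practical):

import os
import shutil
import zipfile

def extract_in_chroot(zip_path, jail_dir):
    # copy the archive into the (empty) jail directory first
    shutil.copy(zip_path, os.path.join(jail_dir, 'archive.zip'))
    pid = os.fork()
    if pid == 0:  # child process: confine itself to the jail, then extract
        os.chroot(jail_dir)
        os.chdir('/')
        with zipfile.ZipFile('archive.zip') as zf:
            zf.extractall()
        os._exit(0)
    os.waitpid(pid, 0)  # parent waits for the extraction to finish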
Alternatively, you can call the unzip utility itself with the -j ("junk paths") flag, which strips the directory components:
import subprocess
filename = '/some/file.zip'
rv = subprocess.call(['unzip', '-j', filename])
I am trying to automatically create an HDF5 structure by using the file paths on my local PC. I want to read through the subdirectories and create a matching HDF5 structure that I can then save files to. Thank you.
You can do this by combining os.walk() and h5py's create_group(). The only complications are handling Linux vs Windows paths (and the drive letter on Windows). Another consideration is relative vs absolute paths. (I used absolute paths, but my example can be modified.) Note: it's a little verbose so you can see what's going on. Here is the example code (for Windows):
import os
import h5py

with h5py.File('SO_73879694.h5', 'w') as h5f:
    cwd = os.getcwd()
    for root, dirs, _ in os.walk(cwd, topdown=True):
        print(f'ROOT: {root}')
        # for Windows, modify root: remove drive letter and replace backslashes:
        grp_name = root[2:].replace('\\', '/')
        print(f'grp_name: {grp_name}\n')
        h5f.create_group(grp_name)
This is actually quite easy to do using HDFql in Python. Here is a complete script that does that:
# import HDFql module (make sure it can be found by the Python interpreter)
import HDFql
# create and use (i.e. open) an HDF5 file named 'my_file.h5'
HDFql.execute("CREATE AND USE FILE my_file.h5")
# get all directories recursively starting from the root of the file system (/)
HDFql.execute("SHOW DIRECTORY / LIKE **")
# iterate result set and create group in each iteration
while HDFql.cursor_next() == HDFql.SUCCESS:
HDFql.execute("CREATE GROUP \"%s\"" % HDFql.cursor_get_char())
I am a very novice coder, and Python is my first (and, practically speaking, only) language. As part of a research job I am charged with manipulating a collection of data analysis scripts, first by getting them to run on my computer. I was able to do this, essentially by removing all lines of code identifying paths, and running the scripts through a Jupyter terminal opened in the directory where the relevant modules and CSV files live, so the script knows where to look (I know that Python defaults to the location of the terminal).
Here are the particular blocks of code whose function I don't understand:
import sys
sys.path.append('C:\Users\Ben\Documents\TRACMIP_Project\mymodules/')
import altdata as altdata
I have replaced the pathname in the original code with the path leading to the directory where the module is; the folder containing all the CSV files that end up being referenced here is also in mymodules.
This works depending on where I open the terminal, but the only way I can get it to work consistently is by opening the terminal in mymodules, which is fine for now but won't work when I need to work by accessing the server remotely. I need to understand better precisely what is being done here, and how it relates to the location of the terminal (all the documentation I've found is overly technical for my knowledge level).
Here is another segment I don't understand
import os.path

csvfile = 'csv/' + model + '_' + exp + '.csv'
if os.path.isfile(csvfile):  # csv file exists
    hcsvfile = open(csvfile)
I get that it's looking for the CSV file here, but I'm not sure how. I'm also not sure why, on some occasions depending on where I open the terminal, it's able to find the module but not the CSV files.
I would love an explanation of what I've presented, but more generally I would like information (or a link to information) explaining paths and how they work in scripts in modules, as well as what are ways of manipulating them. Thanks.
sys.path
This is a simple list of directories where Python will look for modules and packages (.py files and dirs with an __init__.py file; see the modules tutorial). Extending this list allows you to load modules (custom libs, etc.) from non-default locations (usually you need to change it at runtime; for static dirs you can modify the startup script to set the needed environment variables).
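For instance, you can inspect and extend this list at runtime:

import sys

print(sys.path)  # the directories Python currently searches when importing
sys.path.append('/path/to/mymodules')  # hypothetical directory containing altdata.py
# import altdata  # would now succeed regardless of the working directory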
os.path
This module implements some useful functions on pathnames.
... and lets you find out whether a file exists, whether it's a link or a directory, etc.
Why did loading *.csv fail?
Because sys.path is responsible for module loading and only for that. When you use a relative path:
csvfile = 'csv/' + model + '_' + exp + '.csv'
open() will look in the current working directory:
file is either a string or bytes object giving the pathname (absolute or relative to the current working directory)...
You need to use absolute paths, constructing them with the os.path module.
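For instance, a small sketch that builds the CSV path relative to the script's own location instead of the working directory (model and exp stand in for the question's variables):

import os

model, exp = 'somemodel', 'someexp'  # placeholders

# directory containing this script, regardless of where the terminal was opened
script_dir = os.path.dirname(os.path.abspath(__file__))
csvfile = os.path.join(script_dir, 'csv', model + '_' + exp + '.csv')
print(csvfile)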
I agree with cdarke's comment that you are probably running into an issue with backslashes. Replacing the line with:
sys.path.append(r'C:\Users\Ben\Documents\TRACMIP_Project\mymodules')
will likely solve your problem. Details below.
In general, Python treats paths as if they're relative to the current directory (where your terminal is running). When you feed it an absolute path-- which is a path that includes the root directory, like the C:\ in C:\Users\Ben\Documents\TRACMIP_Project\mymodules-- then Python doesn't care about the working directory anymore, it just looks where you tell it to look.
Backslashes are used to make special characters within strings, such as line breaks (\n) and tabs (\t). The snag you've hit is that Python paths are strings first, paths second. So the \U, \B, \D, \T and \m in your path are getting misinterpreted as special characters, messing up Python's path interpretation (in Python 3, \U in a non-raw string is actually a syntax error). If you prefix the string with r, Python will ignore the special meaning of the backslash and just interpret it as a literal backslash (what you want).
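A quick demonstration of the difference (\t and \n are valid escapes, so the mangling is silent here, unlike with \U):

print('C:\temp\new')   # prints "C:", a tab, "emp", a newline, then "ew"
print(r'C:\temp\new')  # raw string: prints C:\temp\new as intended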
The reason it still works if you run the script from the mymodules directory is that Python automatically looks in the working directory for files when asked. sys.path.append(path) tells the interpreter to include that directory when it looks for modules, so that you can use files in that directory no matter where you're running the script. The faulty path will still get added, but it's meaningless: there is no directory where it points, so there's nothing to find there.
As for path manipulation in general, the "safest" way is to use the functions in os.path, which are platform-independent and will give the correct output whether you're working in a Windows or a Unix environment (usually).
EDIT: Forgot to cover the second part. Since Python paths are strings, you can build them using string operations. That's what is happening with the line
csvfile = 'csv/' + model + '_' + exp + '.csv'
Presumably model and exp are strings that appear in the filenames in the csv/ folder. With model = "foo" and exp = "bar", you'd get csv/foo_bar.csv, which is a relative path to a file (that is, relative to your working directory). The code makes sure a file actually exists at that path and then opens it. EDIT: outoftime pointed out that sys.path.append only affects module imports, not opening files, so you'll need to either expand csv/ into an absolute path or always run in its parent directory.
Also, Python is generally forgiving about the direction of slashes in paths, but you should probably not mix them: all backslashes or all forward slashes only. os.path.join will build the path with the right separator for your platform (and os.path.normpath can normalize a mixed one). I'd probably change the line to
csvfile = os.path.join('csv', model + '_' + exp + '.csv')
for consistency's sake.
I have a function that traverses a directory tree searching for files of a designated filetype, and it works just fine; the only problem is that it can be quite slow. Can anyone offer more Pythonic suggestions to potentially speed the process up:
import os

def findbyfiletype(filetype, directory):
    """
    findbyfiletype allows the user to search by two parameters, filetype and directory.

    Example:
    If the user wishes to locate all pdf files within a directory, including subdirectories,
    then the function would be called as follows:

        findbyfiletype(".pdf", "D:\\\\")

    this will return a dictionary of strings where the filename is the key and the file path is the value
    e.g.
        {'file.pdf': 'c:\\folder\\file.pdf'}

    note that both parameters filetype and directory must be enclosed in string double or single quotes
    and the directory parameter must use the backslash escape \\\\ as opposed to \\ as python will throw a string literal error
    """
    indexlist = []              # holds all files in the given directory including sub folders
    FiletypeFilenameList = []   # holds list of all filenames of defined filetype in indexlist
    FiletypePathList = []       # holds path names to individual files of defined filetype

    for root, dirs, files in os.walk(directory):
        for name in files:
            indexlist.append(os.path.join(root, name))
            if filetype in name[-5:]:
                FiletypeFilenameList.append(name)
    for files in indexlist:
        if filetype in files[-5:]:
            FiletypePathList.append(files)

    FileDictionary = dict(zip(FiletypeFilenameList, FiletypePathList))
    del indexlist, FiletypePathList, FiletypeFilenameList
    return FileDictionary
OK, so this is what I ended up with, using a combination of #Ulrich Eckhardt, #Anton and #Cox:
import os
import scandir

def findbyfiletype(filetype, directory):
    FileDictionary = {}
    for root, dirs, files in scandir.walk(directory):
        for name in files:
            if name.endswith(filetype):
                FileDictionary[name] = os.path.join(root, name)
    return FileDictionary
As you can see, it's been refactored, getting rid of the unnecessary lists and creating the dictionary in one step. #Anton, your suggestion of the scandir module helped greatly, reducing the time in one instance by about 97 percent, which nearly blew my mind.
I'm listing #Anton as the accepted answer, as it sums up all that I actually achieved with the refactoring, but #Ulrich Eckhardt and #Cox both get upvotes, as you were both very helpful.
Regards
Instead of os.walk(), you can use the faster scandir module (PEP 471; on Python 3.5+ it is in the standard library as os.scandir(), and os.walk() uses it internally).
Also, a few other tips:
Don't use the arbitrary [-5:]. Use the endswith() string method or os.path.splitext() (see the sketch after this list).
Don't build up two long lists and then make a dict. Build the dict directly.
If escaping backslashes bothers you, use forward slashes like 'c:/folder/file.pdf'. They just work.
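Putting the first two tips together, a sketch of the refactor using a dict comprehension (os.path.splitext keeps the dot, so pass filetype as e.g. ".pdf"):

import os

def findbyfiletype(filetype, directory):
    # build the dict directly; no intermediate lists
    return {name: os.path.join(root, name)
            for root, dirs, files in os.walk(directory)
            for name in files
            if os.path.splitext(name)[1] == filetype}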
walk() can be slow because it tries to cover a lot of cases.
I use a simple variant:
def walk(self, path):
    try:
        entries = (os.path.join(path, x) for x in os.listdir(path))
        for x in entries:
            if os.path.isdir(x):
                self.walk(x)
            elif x.endswith(("jpg", "png", "jpeg")):
                self.lf.append(x)
    except PermissionError:
        pass
That's fast, and the operating system keeps a cache of filesystem data, so a second invocation is even faster.
PS: the walk function is a member of a class, obviously; that's why "self" is there.
EDIT: on NTFS, don't bother with islink(). Updated with try/except.
But this just ignores dirs where you don't have permission. You have to run the script as admin if you want them listed.
If I am to read a number of files in Python 3.2, say 30-40, and I want to keep the file references in a list
(all the files are in a common folder),
is there any way I can open all the files to their respective file handles in the list, without having to individually open every file via the open() function?
This is simple: just use a list comprehension based on your list of file paths. Or, if you only need to access them one at a time, use a generator expression to avoid keeping all forty files open at once.
list_of_filenames = ['/foo/bar', '/baz', '/tmp/foo']
open_files = [open(f) for f in list_of_filenames]
If you want handles on all the files in a certain directory, use the os.listdir function:
import os
open_files = [open(f) for f in os.listdir(some_path)]
I've assumed a simple, flat directory here, but note that os.listdir returns the names of all entries in the given directory, whether they are "real" files or directories. So if you have directories within the directory you're listing, you'll want to filter the results using os.path.isfile:
import os
open_files = [open(f) for f in os.listdir(some_path) if os.path.isfile(f)]
Also, os.listdir only returns the bare filename, rather than the whole path, so if the current working directory is not some_path, you'll want to make absolute paths using os.path.join.
import os
open_files = [open(os.path.join(some_path, f)) for f in os.listdir(some_path)
              if os.path.isfile(os.path.join(some_path, f))]
With a generator expression:
import os
all_files = (open(os.path.join(some_path, f)) for f in os.listdir(some_path))  # note () instead of []
for f in all_files:
    pass  # do something with the open file here.
In all cases, make sure you close the files when you're done with them. If you can upgrade to Python 3.3 or higher, I recommend you use an ExitStack for one more level of convenience.
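A sketch of the ExitStack version, which guarantees every file is closed even if one open() fails partway through (some_path as in the examples above):

import os
from contextlib import ExitStack

some_path = '.'  # stand-in for your folder

with ExitStack() as stack:
    open_files = [stack.enter_context(open(os.path.join(some_path, f)))
                  for f in os.listdir(some_path)
                  if os.path.isfile(os.path.join(some_path, f))]
    # work with open_files here; everything is closed when the block exits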
The os library (and listdir in particular) should provide you with the basic tools you need:
import os
print("\n".join(os.listdir())) # returns all of the files (& directories) in the current directory
Obviously you'll want to call open with them, but this gives you the files in an iterable form (which I think is the crux of the issue you're facing). At this point you can just do a for loop and open them all (or some of them).
Quick caveat: Jon Clements pointed out in the comments of Henry Keiter's answer that you should watch out for directories, which will show up in os.listdir along with files.
Additionally, this is a good time to write in some filtering statements to make sure you only try to open the right kinds of files. You might be thinking you'll only ever have .txt files in a directory now, but someday your operating system (or users) will have a clever idea to put something else in there, and that could throw a wrench in your code.
Fortunately, a quick filter can do that, and you can do it a couple of ways (I'm just going to show a regex filter):
import os, re

scripts = re.compile(r".*\.py$")  # raw string so the backslash isn't treated as an escape
files = [open(x, 'r') for x in os.listdir() if os.path.isfile(x) and scripts.match(x)]
files = map(lambda x: x.read(), files)
print("\n".join(files))
Note that I'm not checking things like whether I have permission to access the file, so if I have the ability to see the file in the directory but not permission to read it then I'll hit an exception.
It probably won't matter for my current utility, but just for good coding practice, I'd like to know whether files in a ZIP file, using the zipfile module, can be accessed using a POSIX-style pathname such as subdir/file.ext, regardless of which operating system the archive was made on or what system my Python script is running on. Or whether, in the case of Windows, the file will be stored or accessed as subdir\file.ext. I read the pydoc for the module, and did some searches here and on Google, but couldn't see anything relevant to this question.
Yes.
You can see these lines from the zipfile module:
# This is used to ensure paths in generated ZIP files always use
# forward slashes as the directory separator, as required by the
# ZIP format specification.
if os.sep != "/" and os.sep in filename:
    filename = filename.replace(os.sep, "/")
And in the Zip specification:
file name: (Variable)
The name of the file, with optional relative path.
The path stored should not contain a drive or
device letter, or a leading slash. All slashes
should be forward slashes '/' as opposed to
backwards slashes '\' for compatibility with Amiga
and UNIX file systems etc.
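So in practice you can always address members with forward slashes, whatever system created the archive or runs your script (the archive name and member path below are illustrative):

import zipfile

with zipfile.ZipFile('archive.zip') as zf:
    print(zf.namelist())                   # names use '/' regardless of the creating OS
    with zf.open('subdir/file.ext') as f:  # POSIX-style path works everywhere
        data = f.read()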
I had the same problem in the zipfile.py module.
Accessing os.path.sep raised AttributeError: module 'posixpath' has no attribute 'sep', so I modified the file in
def _extract_member(self, member, targetpath, pwd):
    """Extract the ZipInfo object 'member' to a physical
    file on the path targetpath.
    """
by replacing os.path.sep with os.sep (which returns the correct value, /, on macOS).
That solves the problem for both the zipfile open and extract methods.