Does Python zipfile always use posixpath filenames regardless of operating system?

Does Python zipfile always use posixpath filenames regardless of operating system? - python

It probably won't matter for my current utility, but just for good coding practice, I'd like to know if files in a ZIP file, using the zipfile module, can be accessed using a POSIX-style pathname such as subdir/file.ext regardless of on which operating system it was made, or on what system my Python script is running. Or if, in the case of Windows, the file will be stored or accessed as subdir\file.ext. I read the pydoc for the module, and did some searches here and on Google, but couldn't see anything relevant to this question.

Yes.
You can see these lines from the zipfile module:
# This is used to ensure paths in generated ZIP files always use
# forward slashes as the directory separator, as required by the
# ZIP format specification.
if os.sep != "/" and os.sep in filename:
filename = filename.replace(os.sep, "/")
And in the Zip specification:
file name: (Variable)
The name of the file, with optional relative path.
The path stored should not contain a drive or
device letter, or a leading slash. All slashes
should be forward slashes '/' as opposed to
backwards slashes '\' for compatibility with Amiga
and UNIX file systems etc.

I have the same problem in the zipfile.py module.
os.path.sep returns {AttributeError}module 'posixpath' has no attribute 'sep' so I modified the file in
def _extract_member(self, member, targetpath, pwd):
"""Extract the ZipInfo object 'member' to a physical
file on the path targetpath.
"""
by replacing os.path.sep by os.sep (which returns the correct value / on a mac operating system).
It solves the problem both for zipfile open and extract methods.

Related

Primer needed in python pathnames

I am a very novice coder, and Python is my first (and, practically speaking, only) language. I am charged as part of a research job with manipulating a collection of data analysis scripts, first by getting them to run on my computer. I was able to do this, essentially by removing all lines of coding identifying paths, and running the scripts through a Jupyter terminal opened in the directory where the relevant modules and CSV files live so the script knows where to look (I know that Python defaults to the location of the terminal).
Here are the particular blocks of code whose function I don't understand
import sys
sys.path.append('C:\Users\Ben\Documents\TRACMIP_Project\mymodules/')
import altdata as altdata
I have replaced the pathname in the original code with the path name leading to the directory where the module is; the file containing all the CSV files that end up being referenced here is also in mymodules.
This works depending on where I open the terminal, but the only way I can get it to work consistently is by opening the terminal in mymodules, which is fine for now but won't work when I need to work by accessing the server remotely. I need to understand better precisely what is being done here, and how it relates to the location of the terminal (all the documentation I've found is overly technical for my knowledge level).
Here is another segment I don't understand
import os.path
csvfile = 'csv/' + model +'_' + exp + '.csv'
if os.path.isfile(csvfile): # csv file exists
hcsvfile = open(csvfile )
I get here that it's looking for the CSV file, but I'm not sure how. I'm also not sure why then on some occasions depending on where I open the terminal it's able to find the module but not the CSV files.
I would love an explanation of what I've presented, but more generally I would like information (or a link to information) explaining paths and how they work in scripts in modules, as well as what are ways of manipulating them. Thanks.

sys.path
This is simple list of directories where python will look for modules and packages (.py and dirs with __init__.py file, look at modules tutorial). Extending this list will allow you to load modules (custom libs, etc.) from non default locations (usually you need to change it in runtime, for static dirs you can modify startup script to add needed enviroment variables).
os.path
This module implements some useful functions on pathnames.
... and allows you to find out if file exists, is it link, dir, etc.
Why you failed loading *.csv?
Because sys.path responsible for module loading and only for this. When you use relative path:
csvfile = 'csv/' + model +'_' + exp + '.csv'
open() will look in current working directory
file is either a string or bytes object giving the pathname (absolute or relative to the current working directory)...
You need to use absolute paths by constucting them with os.path module.

I agree with cdarke's comment that you are probably running into an issue with backslashes. Replacing the line with:
sys.path.append(r'C:\Users\Ben\Documents\TRACMIP_Project\mymodules')
will likely solve your problem. Details below.
In general, Python treats paths as if they're relative to the current directory (where your terminal is running). When you feed it an absolute path-- which is a path that includes the root directory, like the C:\ in C:\Users\Ben\Documents\TRACMIP_Project\mymodules-- then Python doesn't care about the working directory anymore, it just looks where you tell it to look.
Backslashes are used to make special characters within strings, such as line breaks (\n) and tabs (\t). The snag you've hit is that Python paths are strings first, paths second. So the \U, \B, \D, \T and \m in your path are getting misinterpreted as special characters and messing up Python's path interpretation. If you prefix the string with 'r', Python will ignore the special characters meaning of the backslash and just interpret it as a literal backslash (what you want).
The reason it still works if you run the script from the mymodules directory is because Python automatically looks in the working directory for files when asked. sys.path.append(path) is telling the computer to include that directory when it looks for commands, so that you can use files in that directory no matter where you're running the script. The faulty path will still get added, but its meaningless. There is no directory where you point it, so there's nothing to find there.
As for path manipulation in general, the "safest" way is to use the function in os.path, which are platform-independent and will give the correct output whether you're working in a Windows or a Unix environment (usually).
EDIT: Forgot to cover the second part. Since Python paths are strings, you can build them using string operations. That's what is happening with the line
csvfile = 'csv/' + model +'_' + exp + '.csv'
Presumably model and exp are strings that appear in the filenames in the csv/ folder. With model = "foo" and exp = "bar", you'd get csv/foo_bar.csv which is a relative path to a file (that is, relative to your working directory). The code makes sure a file actually exists at that path and then opens it. Assuming the csv/ folder is in the same path as you added in sys.path.append, this path should work regardless of where you run the file, but I'm not 100% certain on that. EDIT: outoftime pointed out that sys.path.append only works for modules, not opening files, so you'll need to either expand csv/ into an absolute path or always run in its parent directory.
Also, I think Python is smart enough to not care about the direction of slashes in paths, but you should probably not mix them. All backslashes or all forward slashes only. os.path.join will normalize them for you. I'd probably change the line to
csvfile = os.path.join('csv\', model + '_' + exp + '.csv')
for consistency's sake.

Accessing file path in the os

The following line, unless I'm mistaken, will grab the absolute path to your directory so you can access files
PATH = os.path.abspath(os.path.join(os.path.dirname(sys.argv[0])))
This is what I've been using typically access files in my current directory when I need to use images etc in the programs i've been writing.
Now, say I do the following since I'm using windows to access a specific image in the directory
image = PATH + "\\" + "some_image.gif"
This is where my question lies, this works on windows, but if I remember correctly "\\" is for windows and this will not work on other OS? I cannot directly test this myself as I don't have other operating systems or I wouldn't have bothered posting. As far as I can tell from where I've looked this isn't mentioned in the documentation.
If this is indeed the case is there a way around this?

Yes, '\\' is just for Windows. You can use os.sep, which will be '\\' on Windows, ':' on classic Mac, '/' on almost everything else, or whatever is appropriate for the current platform.
You can usually get away with using '/'. Nobody's likely to be running your program on anything but Windows or Unix. And Windows will accept '/' pathnames in most cases. But there are many Windows command-line tools that will confuse your path for a flag if it starts with /, and some even if you have a / in the middle, and if you're using \\.\ paths, a / is treated like a regular character rather than a separator, and so on. So you're better off not doing that.
The simple thing to do is just use os.path.join:
image = os.path.join(PATH, "some_image.gif")
As a side note, in your first line, you're already using join—but you don't need it there:
PATH = os.path.abspath(os.path.join(os.path.dirname(sys.argv[0])))
It's perfectly legal to call join with only one argument like this, but also perfectly useless; you just join the one thing with nothing; you will get back exactly what you passed in. Just do this:
PATH = os.path.abspath(os.path.dirname(sys.argv[0]))
One last thing: If you're on Python 3.4+, you may want to consider using pathlib instead of os.path:
PATH = Path(sys.argv[0]).parent.resolve()
image = PATH / "some_image.gif"

Use os.path.join instead of "\\":
os.path.join(PATH, "some_image.gif")
The function will join intelligently the different parts of the path.

PATH = os.path.abspath(os.path.join(os.path.dirname(sys.argv[0])))
image = os.path.join(PATH, "some_image.gif")
os.path.join will intelligently join the arguments using os.sep which uses the OS file separator for you.

How to join with relative paths only?

For a simple web server script, I wrote the following function that resolves the url to the file system.
def resolve(url):
url = url.lstrip('/')
path = os.path.abspath(os.path.join(os.path.dirname(__file__), url))
return path
Here are some example outputs for the __file__ variable being C:\projects\resolve.py.
/index.html => C:\projects\index.html
/\index.html => C:\index.html
/C:\index.html => C:\index.html
The first example is just fine. The url get resolved to a file inside the directory of the script. However, I didn't expect the second and third example. Since the appended path is interpreted as an absolute path, it completely ignores the directory in which the script file lies.
This is a security risk since all files on the file system can be accesses, not just those inside the sub directory of the script. Why does Python's os.path.join allow joining with absolute paths and how can I prevent it?

os.path.join() is not suitable for unsafe input, no. It is entirely deliberate that an absolute path ignores arguments before it; this allows for supporting both absolute and relative paths in a configuration file, say, without having to test the entered path. Just use os.path.join(standard_location, config_path) and it'll do the right thing for you.
Take a look at Flask's safe_join() to handle untrusted filenames:
import posixpath
import os.path
_os_alt_seps = list(sep for sep in [os.path.sep, os.path.altsep]
if sep not in (None, '/'))
def safe_join(directory, filename):
# docstring omitted for brevity
filename = posixpath.normpath(filename)
for sep in _os_alt_seps:
if sep in filename:
raise NotFound()
if os.path.isabs(filename) or \
filename == '..' or \
filename.startswith('../'):
raise NotFound()
return os.path.join(directory, filename)
This uses the posixpath (the POSIX implementation for the platform-agnostic os.path module) to normalise the URL path first; this removes any embedded ../ or ./ path segments, making it a fully normalised relative or absolute path.
Then any alternative separators other than / are excluded; you are not allowed to use /\index.html for example. Last but not least, absolute filenames, or relative filenames are specifically prohibited as well.

Safely extract zip or tar using Python

I'm trying to extract user-submitted zip and tar files to a directory. The documentation for zipfile's extractall method (similarly with tarfile's extractall) states that it's possible for paths to be absolute or contain .. paths that go outside the destination path. Instead, I could use extract myself, like this:
some_path = '/destination/path'
some_zip = '/some/file.zip'
zipf = zipfile.ZipFile(some_zip, mode='r')
for subfile in zipf.namelist():
zipf.extract(subfile, some_path)
Is this safe? Is it possible for a file in the archive to wind up outside of some_path in this case? If so, what way can I ensure that files will never wind up outside the destination directory?

Note: Starting with python 2.7.4, this is a non-issue for ZIP archives. Details at the bottom of the answer. This answer focuses on tar archives.
To figure out where a path really points to, use os.path.abspath() (but note the caveat about symlinks as path components). If you normalize a path from your zipfile with abspath and it does not contain the current directory as a prefix, it's pointing outside it.
But you also need to check the value of any symlink extracted from your archive (both tarfiles and unix zipfiles can store symlinks). This is important if you are worried about a proverbial "malicious user" that would intentionally bypass your security, rather than an application that simply installs itself in system libraries.
That's the aforementioned caveat: abspath will be misled if your sandbox already contains a symlink that points to a directory. Even a symlink that points within the sandbox can be dangerous: The symlink sandbox/subdir/foo -> .. points to sandbox, so the path sandbox/subdir/foo/../.bashrc should be disallowed. The easiest way to do so is to wait until the previous files have been extracted and use os.path.realpath(). Fortunately extractall() accepts a generator, so this is easy to do.
Since you ask for code, here's a bit that explicates the algorithm. It prohibits not only the extraction of files to locations outside the sandbox (which is what was requested), but also the creation of links inside the sandbox that point to locations outside the sandbox. I'm curious to hear if anyone can sneak any stray files or links past it.
import tarfile
from os.path import abspath, realpath, dirname, join as joinpath
from sys import stderr
resolved = lambda x: realpath(abspath(x))
def badpath(path, base):
# joinpath will ignore base if path is absolute
return not resolved(joinpath(base,path)).startswith(base)
def badlink(info, base):
# Links are interpreted relative to the directory containing the link
tip = resolved(joinpath(base, dirname(info.name)))
return badpath(info.linkname, base=tip)
def safemembers(members):
base = resolved(".")
for finfo in members:
if badpath(finfo.name, base):
print >>stderr, finfo.name, "is blocked (illegal path)"
elif finfo.issym() and badlink(finfo,base):
print >>stderr, finfo.name, "is blocked: Symlink to", finfo.linkname
elif finfo.islnk() and badlink(finfo,base):
print >>stderr, finfo.name, "is blocked: Hard link to", finfo.linkname
else:
yield finfo
ar = tarfile.open("testtar.tar")
ar.extractall(path="./sandbox", members=safemembers(ar))
ar.close()
Edit: Starting with python 2.7.4, this is a non-issue for ZIP archives: The method zipfile.extract() prohibits the creation of files outside the sandbox:
Note: If a member filename is an absolute path, a drive/UNC sharepoint and leading (back)slashes will be stripped, e.g.: ///foo/bar becomes foo/bar on Unix, and C:\foo\bar becomes foo\bar on Windows. And all ".." components in a member filename will be removed, e.g.: ../../foo../../ba..r becomes foo../ba..r. On Windows, illegal characters (:, <, >, |, ", ?, and *) [are] replaced by underscore (_).
The tarfile class has not been similarly sanitized, so the above answer still apllies.

Contrary to the popular answer, unzipping files safely is not completely solved as of Python 2.7.4. The extractall method is still dangerous and can lead to path traversal, either directly or through the unzipping of symbolic links. Here was my final solution which should prevent both attacks in all versions of Python, even versions prior to Python 2.7.4 where the extract method was vulnerable:
import zipfile, os
def safe_unzip(zip_file, extract_path='.'):
with zipfile.ZipFile(zip_file, 'r') as zf:
for member in zf.infolist():
file_path = os.path.realpath(os.path.join(extract_path, member.filename))
if file_path.startswith(os.path.realpath(extract_path)):
zf.extract(member, extract_path)
Edit 1: Fixed variable name clash. Thanks Juuso Ohtonen.
Edit 2: s/abspath/realpath/g. Thanks TheLizzard

Use ZipFile.infolist()/TarFile.next()/TarFile.getmembers() to get the information about each entry in the archive, normalize the path, open the file yourself, use ZipFile.open()/TarFile.extractfile() to get a file-like for the entry, and copy the entry data yourself.

Copy the zipfile to an empty directory. Then use os.chroot to make that directory the root directory. Then unzip there.
Alternatively, you can call unzip itself with the -j flag, which ignores the directories:
import subprocess
filename = '/some/file.zip'
rv = subprocess.call(['unzip', '-j', filename])

Special characters in OSX filename ? (Python os.rename)

I am trying to rename some files automatically on OSX with a python script. But I fail to work with special characters like forward slash etc.:
oldname = "/test"
newname = "/test(1\/10)"
os.rename(oldname, newname)
I think I do have an encoding problem. But different tries with re.escape or using UTF-8 unicode encodings havent been successful for me. Would you have a hint?
Thanks!
Marco

What most of the file systems have in common is that they do not allow directory separators (slashes) in filenames.
That said, in Mac OS X you can have file names appear with slashes in finder, you can try replacing slashes with :.

If you're trying to rename the folder '/test' you'll need to run python as root, otherwise you won't have privileges to change stuff in the root. Furthermore the slash in your new name won't work as python will try find a directory "/test(1", so you'll have to let the directory separator go. Also this from the python documentation might be helpful.
Rename the file or directory src to dst. If dst is a directory, OSError will be raised. On Unix, if dst exists and is a file, it will be replaced silently if the user has permission. The operation may fail on some Unix flavors if src and dst are on different filesystems. If successful, the renaming will be an atomic operation (this is a POSIX requirement). On Windows, if dst already exists, OSError will be raised even if it is a file; there may be no way to implement an atomic rename when dst names an existing file. Availability: Unix, Windows.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.