For a simple web server script, I wrote the following function that resolves the url to the file system.
import os

def resolve(url):
    url = url.lstrip('/')
    path = os.path.abspath(os.path.join(os.path.dirname(__file__), url))
    return path
Here are some example outputs for the __file__ variable being C:\projects\resolve.py.
/index.html => C:\projects\index.html
/\index.html => C:\index.html
/C:\index.html => C:\index.html
The first example is just fine. The URL gets resolved to a file inside the directory of the script. However, I didn't expect the second and third examples. Since the appended path is interpreted as an absolute path, the directory in which the script file lies is ignored completely.
This is a security risk, since all files on the file system can be accessed, not just those inside the script's directory and its subdirectories. Why does Python's os.path.join allow joining with absolute paths, and how can I prevent it?
os.path.join() is not suitable for unsafe input, no. It is entirely deliberate that an absolute path ignores arguments before it; this allows for supporting both absolute and relative paths in a configuration file, say, without having to test the entered path. Just use os.path.join(standard_location, config_path) and it'll do the right thing for you.
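To make that behaviour concrete, here is a small sketch. It calls ntpath and posixpath directly (the Windows and POSIX implementations behind os.path) so the results do not depend on the machine it runs on. C:\projects is taken from the question; /srv/app and the file names under it are made-up examples.

```python
import ntpath      # Windows flavour of os.path
import posixpath   # POSIX flavour of os.path

# Hypothetical config-file use case: a relative entry is placed under the
# standard location, an absolute entry replaces it.
relative_entry = posixpath.join("/srv/app", "conf/app.ini")
absolute_entry = posixpath.join("/srv/app", "/etc/app.ini")

# The two surprising inputs from the question, after url.lstrip('/').
rooted = ntpath.join(r"C:\projects", r"\index.html")
with_drive = ntpath.join(r"C:\projects", r"C:\index.html")

print(relative_entry)  # /srv/app/conf/app.ini
print(absolute_entry)  # /etc/app.ini
print(rooted)          # C:\index.html
print(with_drive)      # C:\index.html
```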
Take a look at Flask's safe_join() to handle untrusted filenames:
import posixpath
import os.path

_os_alt_seps = list(sep for sep in [os.path.sep, os.path.altsep]
                    if sep not in (None, '/'))

def safe_join(directory, filename):
    # docstring omitted for brevity
    filename = posixpath.normpath(filename)
    for sep in _os_alt_seps:
        if sep in filename:
            raise NotFound()
    if os.path.isabs(filename) or \
       filename == '..' or \
       filename.startswith('../'):
        raise NotFound()
    return os.path.join(directory, filename)
This uses posixpath (the POSIX implementation of the platform-agnostic os.path module) to normalise the URL path first; this collapses any embedded ../ or ./ path segments, leaving a fully normalised relative or absolute path.
Then any alternative separators other than / are rejected; you are not allowed to use /\index.html, for example. Last but not least, absolute filenames, and relative filenames that still start with .. after normalisation (and would therefore escape the directory), are specifically prohibited as well.
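As a rough illustration of the checks described above, here is a self-contained variant that can be run without Flask. It is a sketch, not Flask's implementation: checked_join is a hypothetical name, it raises ValueError instead of NotFound, it refuses backslashes on every platform, and it uses posixpath throughout, so it does not deal with Windows drive letters.

```python
import posixpath

def checked_join(directory, filename):
    """Join filename onto directory, raising ValueError for unsafe input.

    Hypothetical helper: it mirrors the checks in safe_join() but is not
    Flask's code.
    """
    filename = posixpath.normpath(filename)
    # Stricter than safe_join(): backslashes are refused on every platform.
    if "\\" in filename:
        raise ValueError("backslash in path: %r" % filename)
    if (posixpath.isabs(filename)
            or filename == ".."
            or filename.startswith("../")):
        raise ValueError("path escapes the directory: %r" % filename)
    return posixpath.join(directory, filename)

print(checked_join("/srv/www", "css/../index.html"))  # /srv/www/index.html

for bad in ("../secret", "/etc/passwd", "\\index.html", "a/../../b"):
    try:
        checked_join("/srv/www", bad)
    except ValueError as exc:
        print("rejected:", exc)
```

Note that the check is purely textual: it does not look at the file system, so a symlink inside the directory can still point elsewhere.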
I am writing a simple file server in Python. The filename is provided by the client and should be considered untrusted. How to verify that it corresponds to a file inside the current directory (within it or any of its subdirectories)? Will something like:
pwd=os.getcwd()
if os.path.commonpath((pwd,os.path.abspath(filename))) == pwd:
    open(filename,'rb')
suffice?
Convert the filename to a canonical path using os.path.realpath, get the directory portion, and see if the current directory (in canonical form) is a prefix of that:
import os, os.path

def in_cwd(fname):
    path = os.path.dirname(os.path.realpath(fname))
    return path.startswith(os.getcwd())
By converting fname to a canonical path we handle symbolic links and paths containing ../.
Update
Unfortunately, the above code has a little problem. For example,
'/a/b/cd'.startswith('/a/b/c')
returns True, but we definitely don't want that behaviour here! Fortunately, there's an easy fix: we just need to append os.sep to the paths before performing the prefix test. The new version also handles any OS pathname case-insensitivity issues via os.path.normcase.
import os, os.path

def clean_dirname(dname):
    dname = os.path.normcase(dname)
    return os.path.join(dname, '')

def in_cwd(fname):
    cwd = clean_dirname(os.getcwd())
    path = os.path.dirname(os.path.realpath(fname))
    path = clean_dirname(path)
    return path.startswith(cwd)
Thanks to DSM for pointing out the flaw in the previous code.
Here's a version that's a little more efficient. It uses os.path.commonpath, which is more robust than appending os.sep and doing a string prefix test.
def in_cwd(fname):
    cwd = os.path.normcase(os.getcwd())
    path = os.path.normcase(os.path.dirname(os.path.realpath(fname)))
    return os.path.commonpath((path, cwd)) == cwd
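The difference between the string prefix test and commonpath can be seen in isolation with a small sketch. is_within is a hypothetical helper; it uses posixpath so the output is the same on any machine, and it only compares strings, so it does not resolve symlinks the way realpath does. Both arguments are assumed to be absolute, since commonpath raises ValueError when absolute and relative paths are mixed.

```python
import posixpath

def is_within(base, candidate):
    """True if candidate is base itself or lies below it (pure string check)."""
    base = posixpath.normpath(base)
    candidate = posixpath.normpath(candidate)
    # commonpath compares whole path components, not characters.
    return posixpath.commonpath([base, candidate]) == base

print("/a/b/cd".startswith("/a/b/c"))       # True: the false positive
print(is_within("/a/b/c", "/a/b/cd"))       # False
print(is_within("/a/b/c", "/a/b/c/d.txt"))  # True
print(is_within("/a/b/c", "/a/b/c/../x"))   # False
```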
I have a directory structure like this:
projectfolder/fold1/fold2/fold3/script.py
I'm giving script.py, as a command-line argument, the path of a file that lives in
fold1/fold_temp/myfile.txt
So basically I want to be able to give the path like this:
../../fold_temp/myfile.txt
>>python somepath/pythonfile.py -input ../../fold_temp/myfile.txt
The problem is that I might be given either a full path or a relative path, so I need to tell which one I got and build an absolute path from it.
I already have knowledge of functions related to path.
Question 1
Question 2
Reference questions are giving partial answer but I don't know how to build full path using the functions provided in them.
try os.path.abspath, it should do what you want ;)
Basically it converts any given path to an absolute path you can work with, so you do not need to distinguish between relative and absolute paths, just normalize any of them with this function.
Example:
from os.path import abspath
filename = abspath('../../fold_temp/myfile.txt')
print(filename)
It will output the absolute path to your file.
EDIT:
If you are using Python 3.4 or newer you may also use the resolve() method of pathlib.Path. Be aware that this will return a Path object and not a string. If you need a string you can still use str() to convert it to a string.
Example:
from pathlib import Path
filename = Path('../../fold_temp/myfile.txt').resolve()
print(filename)
A practical example:
sys.argv[0] gives you the name of the current script
os.path.dirname() gives you the relative directory name
Thus the next line gives you the absolute directory of the currently executing script.
cwd = os.path.abspath(os.path.dirname(sys.argv[0]))
Personally, I always use this instead of os.getcwd() since it gives me the script absolute path, independently of the directory from where the script was called.
For Python3, you can use pathlib's resolve functionality to resolve symlinks and .. components.
You need a Path object for that; however, converting between str and Path is very simple.
I recommend for anyone using Python3 to drop os.path and its messy long function names and stick to pathlib Path objects.
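A minimal sketch of that round trip between str and Path, reusing the relative path from the question (the file does not have to exist for resolve() to work in its default, non-strict mode on current Python 3 versions):

```python
from pathlib import Path

raw = "../../fold_temp/myfile.txt"   # str, e.g. taken from the command line
resolved = Path(raw).resolve()       # Path: absolute, ".." collapsed, symlinks followed
as_text = str(resolved)              # back to a plain str for APIs that need one

print(resolved)
print(Path(as_text) == resolved)     # True: the round trip is lossless here
```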
import os

base_dir = os.path.dirname(__file__)
path = input()
if os.path.isabs(path):
    print("input path is absolute")
else:
    path = os.path.join(base_dir, path)
    print("absolute path is %s" % path)
Use os.path.isabs to check whether the input path is absolute or relative; if it is relative, use os.path.join to turn it into an absolute path.
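Since os.path.join already discards the base when the second argument is absolute, the isabs test can be folded into a single helper. The following is only a sketch: to_absolute and its base_dir parameter are hypothetical names, and normpath is added so that the ../ segments are collapsed.

```python
import os

def to_absolute(user_path, base_dir=None):
    """Return user_path as a normalised absolute path.

    base_dir is a hypothetical parameter: the directory that relative input
    is interpreted against (defaults to the current working directory).
    """
    if base_dir is None:
        base_dir = os.getcwd()
    # join() drops base_dir by itself when user_path is already absolute.
    return os.path.normpath(os.path.join(os.path.abspath(base_dir), user_path))

script_dir = os.path.join("projectfolder", "fold1", "fold2", "fold3")
print(to_absolute(os.path.join("..", "..", "fold_temp", "myfile.txt"), script_dir))
```

With the directory layout from the question, the printed path ends in projectfolder/fold1/fold_temp/myfile.txt (using the platform's own separator).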
os.path.sep is the character used by the operating system to separate pathname components.
But when os.path.sep is used in os.path.join(), why does it truncate the path?
Example:
Instead of 'home/python', os.path.join returns '/python':
>>> import os
>>> os.path.join('home', os.path.sep, 'python')
'/python'
I know that os.path.join() inserts the directory separator implicitly.
Where is os.path.sep useful? Why does it truncate the path?
Where is os.path.sep useful?
I suspect that it exists mainly because a variable like this is required in the module anyway (to avoid hardcoding), and if it's there, it might as well be documented. Its documentation says that it is "occasionally useful".
Why does it truncate the path?
From the docs for os.path.join():
If a component is an absolute path, all previous components are thrown away and joining continues from the absolute path component.
and / is an absolute path on *nix systems.
Drop os.path.sep from the os.path.join() call. os.path.join() uses os.path.sep internally.
On your system, os.path.sep == '/' that is interpreted as a root directory (absolute path) and therefore os.path.join('home', '/', 'python') is equivalent to os.path.join('/', 'python') == '/python'. From the docs:
If a component is an absolute path, all previous components are thrown
away and joining continues from the absolute path component.
As correctly given in the docstring of os.path.join -
Join two or more pathname components, inserting '/' as needed. If any component is an absolute path, all previous path components will be discarded.
Same is given in the docs as well -
os.path.join(path, *paths)
Join one or more path components intelligently. The return value is the concatenation of path and any members of *paths with exactly one directory separator (os.sep) following each non-empty part except the last, meaning that the result will only end in a separator if the last part is empty. If a component is an absolute path, all previous components are thrown away and joining continues from the absolute path component.
When you give os.path.sep alone, it is considered as an absolute path to the root directory - / .
Please note that this is the Unix/Linux os.path, which internally is posixpath, though os.path.join() on Windows behaves the same way.
Example -
>>> import os.path
>>> os.path.join.__doc__
"Join two or more pathname components, inserting '/' as needed.\n If any component is an absolute path, all previous path components\n will be discarded."
Here's the snippet of code that is run if you are on a POSIX machine:
posixpath.py
# Join pathnames.
# Ignore the previous parts if a part is absolute.
# Insert a '/' unless the first part is empty or already ends in '/'.

def join(a, *p):
    """Join two or more pathname components, inserting '/' as needed.
    If any component is an absolute path, all previous path components
    will be discarded.  An empty last part will result in a path that
    ends with a separator."""
    sep = _get_sep(a)
    path = a
    try:
        if not p:
            path[:0] + sep  #23780: Ensure compatible data type even if p is null.
        for b in p:
            if b.startswith(sep):
                path = b
            elif not path or path.endswith(sep):
                path += b
            else:
                path += sep + b
    except (TypeError, AttributeError, BytesWarning):
        genericpath._check_arg_types('join', a, *p)
        raise
    return path
Specifically, the lines:
if b.startswith(sep):
    path = b
And, since os.path.sep definitely starts with this character, whenever we encounter it we throw out the portion of the variable path that has already been constructed and start over with the next element in p.
But when os.path.sep is used in os.path.join(), why does it truncate the path?
Quoting directly from the documentation of os.path.join
If a component is an absolute path, all previous components are thrown away and joining continues from the absolute path component.
So when you do:
os.path.join('home', os.path.sep, 'python')
os.path.sep is '/', which is an absolute path, so 'home' is thrown away and you get only '/python' as the output.
This is also clear from the example:
>>> import os
>>> os.path.join('home','/python','kivy')
'/python/kivy'
Where is os.path.sep useful?
os.path.sep (or os.sep) is the character used by the operating system to separate pathname components.
But again quoting from the docs:
Note that knowing this is not sufficient to be able to parse or concatenate pathnames — use os.path.split() and os.path.join() — but it is occasionally useful.
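To put the pieces side by side, here is a short sketch. It uses posixpath (the POSIX flavour of os.path) so the results match the answers above on any machine. The last two lines show two "occasionally useful" cases for the separator itself; they are illustrations chosen for this sketch, not an official list.

```python
import posixpath  # POSIX flavour of os.path, independent of the host OS

truncated = posixpath.join("home", posixpath.sep, "python")
joined = posixpath.join("home", "python")        # no explicit separator needed

# Two places where the separator itself is handy:
as_directory = "home/python" + posixpath.sep     # force a trailing separator
starts_at_root = "/python".startswith(posixpath.sep)

print(truncated)       # /python
print(joined)          # home/python
print(as_directory)    # home/python/
print(starts_at_root)  # True
```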
I'm trying to extract user-submitted zip and tar files to a directory. The documentation for zipfile's extractall method (similarly with tarfile's extractall) states that it's possible for paths to be absolute or contain .. paths that go outside the destination path. Instead, I could use extract myself, like this:
some_path = '/destination/path'
some_zip = '/some/file.zip'
zipf = zipfile.ZipFile(some_zip, mode='r')
for subfile in zipf.namelist():
    zipf.extract(subfile, some_path)
Is this safe? Is it possible for a file in the archive to wind up outside of some_path in this case? If so, what way can I ensure that files will never wind up outside the destination directory?
Note: Starting with python 2.7.4, this is a non-issue for ZIP archives. Details at the bottom of the answer. This answer focuses on tar archives.
To figure out where a path really points to, use os.path.abspath() (but note the caveat about symlinks as path components). If you normalize a path from your zipfile with abspath and it does not contain the current directory as a prefix, it's pointing outside it.
But you also need to check the value of any symlink extracted from your archive (both tarfiles and unix zipfiles can store symlinks). This is important if you are worried about a proverbial "malicious user" that would intentionally bypass your security, rather than an application that simply installs itself in system libraries.
That's the aforementioned caveat: abspath will be misled if your sandbox already contains a symlink that points to a directory. Even a symlink that points within the sandbox can be dangerous: The symlink sandbox/subdir/foo -> .. points to sandbox, so the path sandbox/subdir/foo/../.bashrc should be disallowed. The easiest way to do so is to wait until the previous files have been extracted and use os.path.realpath(). Fortunately extractall() accepts a generator, so this is easy to do.
Since you ask for code, here's a bit that explicates the algorithm. It prohibits not only the extraction of files to locations outside the sandbox (which is what was requested), but also the creation of links inside the sandbox that point to locations outside the sandbox. I'm curious to hear if anyone can sneak any stray files or links past it.
import tarfile
from os.path import abspath, realpath, dirname, join as joinpath
from sys import stderr

resolved = lambda x: realpath(abspath(x))

def badpath(path, base):
    # joinpath will ignore base if path is absolute
    return not resolved(joinpath(base,path)).startswith(base)

def badlink(info, base):
    # Links are interpreted relative to the directory containing the link
    tip = resolved(joinpath(base, dirname(info.name)))
    return badpath(info.linkname, base=tip)

def safemembers(members):
    base = resolved(".")

    for finfo in members:
        if badpath(finfo.name, base):
            print(finfo.name, "is blocked (illegal path)", file=stderr)
        elif finfo.issym() and badlink(finfo,base):
            print(finfo.name, "is blocked: Symlink to", finfo.linkname, file=stderr)
        elif finfo.islnk() and badlink(finfo,base):
            print(finfo.name, "is blocked: Hard link to", finfo.linkname, file=stderr)
        else:
            yield finfo

ar = tarfile.open("testtar.tar")
ar.extractall(path="./sandbox", members=safemembers(ar))
ar.close()
Edit: Starting with python 2.7.4, this is a non-issue for ZIP archives: The method zipfile.extract() prohibits the creation of files outside the sandbox:
Note: If a member filename is an absolute path, a drive/UNC sharepoint and leading (back)slashes will be stripped, e.g.: ///foo/bar becomes foo/bar on Unix, and C:\foo\bar becomes foo\bar on Windows. And all ".." components in a member filename will be removed, e.g.: ../../foo../../ba..r becomes foo../ba..r. On Windows, illegal characters (:, <, >, |, ", ?, and *) [are] replaced by underscore (_).
The tarfile class has not been similarly sanitized, so the above answer still applies.
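The sanitising described in the quoted note can be observed with a throwaway archive. This is a sketch under the assumption of a reasonably recent Python 3; the archive, the sandbox directory and the member name ../evil.txt are all made up for the demonstration.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "demo.zip")
    sandbox = os.path.join(tmp, "sandbox")
    os.mkdir(sandbox)

    # Build an archive whose only member tries to climb out of the sandbox.
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("../evil.txt", "payload")

    # extract() returns the path it actually created.
    with zipfile.ZipFile(archive) as zf:
        extracted = zf.extract("../evil.txt", sandbox)

    landed_in_sandbox = (
        os.path.dirname(os.path.realpath(extracted)) == os.path.realpath(sandbox)
    )
    escaped = os.path.exists(os.path.join(tmp, "evil.txt"))

print(landed_in_sandbox)  # True
print(escaped)            # False
```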
Contrary to the popular answer, unzipping files safely is not completely solved as of Python 2.7.4. The extractall method is still dangerous and can lead to path traversal, either directly or through the unzipping of symbolic links. Here was my final solution which should prevent both attacks in all versions of Python, even versions prior to Python 2.7.4 where the extract method was vulnerable:
import zipfile, os

def safe_unzip(zip_file, extract_path='.'):
    with zipfile.ZipFile(zip_file, 'r') as zf:
        for member in zf.infolist():
            file_path = os.path.realpath(os.path.join(extract_path, member.filename))
            if file_path.startswith(os.path.realpath(extract_path)):
                zf.extract(member, extract_path)
Edit 1: Fixed variable name clash. Thanks Juuso Ohtonen.
Edit 2: s/abspath/realpath/g. Thanks TheLizzard
Use ZipFile.infolist()/TarFile.next()/TarFile.getmembers() to get the information about each entry in the archive, normalize the path, open the file yourself, use ZipFile.open()/TarFile.extractfile() to get a file-like for the entry, and copy the entry data yourself.
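For ZIP archives, the steps above might look like the following sketch. extract_checked is a hypothetical helper, not part of zipfile; it silently skips members that would land outside the destination (raising an error would be an equally reasonable choice), and it compares paths with os.path.commonpath, which raises ValueError on Windows if the two paths are on different drives.

```python
import os
import shutil
import zipfile

def extract_checked(zip_path, dest_dir):
    """Copy each member below dest_dir by hand; return the paths written."""
    dest_root = os.path.realpath(dest_dir)
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Normalise the target and resolve symlinks already on disk.
            target = os.path.realpath(os.path.join(dest_root, info.filename))
            if os.path.commonpath([dest_root, target]) != dest_root:
                continue  # would end up outside dest_dir: skip it
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # Open the entry as a file-like object and copy the data ourselves.
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

Because the data is written with open() rather than by the archive library, every member comes out as a regular file, so the archive cannot create links inside the destination.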
Copy the zipfile to an empty directory. Then use os.chroot to make that directory the root directory. Then unzip there.
Alternatively, you can call unzip itself with the -j flag, which ignores the directories:
import subprocess
filename = '/some/file.zip'
rv = subprocess.call(['unzip', '-j', filename])
It probably won't matter for my current utility, but just for good coding practice, I'd like to know if files in a ZIP file, using the zipfile module, can be accessed using a POSIX-style pathname such as subdir/file.ext regardless of on which operating system it was made, or on what system my Python script is running. Or if, in the case of Windows, the file will be stored or accessed as subdir\file.ext. I read the pydoc for the module, and did some searches here and on Google, but couldn't see anything relevant to this question.
Yes.
You can see these lines from the zipfile module:
# This is used to ensure paths in generated ZIP files always use
# forward slashes as the directory separator, as required by the
# ZIP format specification.
if os.sep != "/" and os.sep in filename:
    filename = filename.replace(os.sep, "/")
And in the Zip specification:
file name: (Variable)
The name of the file, with optional relative path.
The path stored should not contain a drive or
device letter, or a leading slash. All slashes
should be forward slashes '/' as opposed to
backwards slashes '\' for compatibility with Amiga
and UNIX file systems etc.
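A quick way to check this on your own system is an in-memory round trip. The member name subdir/file.ext is the one from the question; the rest is a small sketch.

```python
import io
import zipfile

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("subdir/file.ext", "hello")

# Read the member back using the POSIX-style name.
with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()
    content = zf.read("subdir/file.ext")

print(names)    # ['subdir/file.ext']
print(content)  # b'hello'
```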
I have the same problem in the zipfile.py module.
os.path.sep returns {AttributeError}module 'posixpath' has no attribute 'sep' so I modified the file in
def _extract_member(self, member, targetpath, pwd):
    """Extract the ZipInfo object 'member' to a physical
       file on the path targetpath.
    """
by replacing os.path.sep with os.sep (which has the correct value, /, on macOS).
It solves the problem both for zipfile open and extract methods.