Find the common path prefix of a list of paths


My problem is to find the common path prefix of a given set of files.
I expected os.path.commonprefix to do just that. Unfortunately, the fact that commonprefix lives in os.path is rather misleading, since it actually compares the strings character by character.
The question is, how can this be solved for paths? The issue was briefly mentioned in this (fairly highly rated) answer, but only as a side note, and the proposed solution (appending slashes to the input of commonprefix) has issues in my opinion, since it fails, for instance, for:
os.path.commonprefix(['/usr/var1/log/', '/usr/var2/log/'])
# returns /usr/var but it should be /usr
To prevent others from falling into the same trap, it might be worthwhile to discuss this issue in a separate question: is there a simple, portable solution for this problem that does not rely on nasty checks on the file system (i.e., taking the result of commonprefix, checking whether it is a directory, and if not, returning os.path.dirname of the result)?

It seems that this issue has been corrected in recent versions of Python. New in version 3.5 is the function os.path.commonpath(), which returns the common path instead of the common string prefix.
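A minimal sketch of the difference, using the paths from the question. It calls posixpath (the POSIX flavour of os.path) directly so that the result is the same on every platform; that choice is an assumption made for this example only. Note that commonpath raises ValueError for an empty sequence or a mix of absolute and relative paths.

```python
import posixpath

paths = ['/usr/var1/log/', '/usr/var2/log/']

# Character-by-character string prefix (the behaviour the question complains about)
string_prefix = posixpath.commonprefix(paths)

# Component-wise path prefix (available since Python 3.5)
path_prefix = posixpath.commonpath(paths)

print(string_prefix)  # /usr/var
print(path_prefix)    # /usr
```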

A while ago I ran into this, where os.path.commonprefix is a string prefix and not a path prefix as would be expected. So I wrote the following:
def commonprefix(l):
    # unlike os.path.commonprefix, this version always
    # returns path prefixes, because it compares the
    # paths component by component
    cp = []
    ls = [p.split('/') for p in l]
    ml = min(len(p) for p in ls)
    for i in range(ml):
        s = set(p[i] for p in ls)
        if len(s) != 1:
            break
        cp.append(s.pop())
    return '/'.join(cp)
It could be made more portable by replacing '/' with os.path.sep.
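The portability suggestion could look roughly like the sketch below. The function name common_path_prefix and the sep parameter are hypothetical, introduced only for illustration; sep defaults to os.sep and is passed explicitly in the example so the output does not depend on the platform.

```python
import os

def common_path_prefix(paths, sep=os.sep):
    """Component-wise common prefix; `sep` is a hypothetical parameter."""
    split_paths = [p.split(sep) for p in paths]
    common = []
    # zip stops at the shortest path, so no explicit length check is needed
    for parts in zip(*split_paths):
        if len(set(parts)) != 1:
            break
        common.append(parts[0])
    return sep.join(common)

print(common_path_prefix(['/usr/var1/log/', '/usr/var2/log/'], sep='/'))  # /usr
```

This is purely lexical: it does not normalize the paths or handle Windows drive letters.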

Assuming you want the common directory path, one way is to:
1) Use only directory paths as input. If your input value is a file name, call os.path.dirname(filename) to get its directory path.
2) "Normalize" all the paths so that they are relative to the same thing and don't include double separators. The easiest way to do this is by calling os.path.abspath() to get the path relative to the root. (You might also want to use os.path.realpath() to remove symbolic links.)
3) Add a final separator (found portably with os.path.sep or os.sep) to the end of all the normalized directory paths.
4) Call os.path.dirname() on the result of os.path.commonprefix().
In code (without removing symbolic links):
import os
def common_path(directories):
    norm_paths = [os.path.abspath(p) + os.path.sep for p in directories]
    return os.path.dirname(os.path.commonprefix(norm_paths))
def common_path_of_filenames(filenames):
    return common_path([os.path.dirname(f) for f in filenames])

A robust approach is to split the path into individual components and then find the longest common prefix of the component lists.
Here is an implementation which is cross-platform and can be generalized easily to more than two paths:
import os.path
def components(path):
    '''
    Returns the individual components of the given file path
    string (for the local operating system).
    The returned components, when joined with os.path.join(), point to
    the same location as the original path.
    '''
    components = []
    # The loop guarantees that the returned components can be
    # os.path.joined with the path separator and point to the same
    # location:
    while True:
        (new_path, tail) = os.path.split(path)  # Works on any platform
        components.append(tail)
        if new_path == path:  # Root (including drive, on Windows) reached
            break
        path = new_path
    components.append(new_path)
    components.reverse()  # First component first
    return components
def longest_prefix(iter0, iter1):
    '''
    Returns the longest common prefix of the given two iterables.
    '''
    longest_prefix = []
    # zip replaces Python 2's itertools.izip, which no longer exists in Python 3
    for (elmt0, elmt1) in zip(iter0, iter1):
        if elmt0 != elmt1:
            break
        longest_prefix.append(elmt0)
    return longest_prefix
def common_prefix_path(path0, path1):
    return os.path.join(*longest_prefix(components(path0), components(path1)))
# For Unix:
assert common_prefix_path('/', '/usr') == '/'
assert common_prefix_path('/usr/var1/log/', '/usr/var2/log/') == '/usr'
assert common_prefix_path('/usr/var/log1/', '/usr/var/log2/') == '/usr/var'
assert common_prefix_path('/usr/var/log', '/usr/var/log2') == '/usr/var'
assert common_prefix_path('/usr/var/log', '/usr/var/log') == '/usr/var/log'
# Only for Windows:
# assert common_prefix_path(r'C:\Programs\Me', r'C:\Programs') == r'C:\Programs'

I've made a small python package commonpath to find common paths from a list. Comes with a few nice options.
https://github.com/faph/Common-Path

Related

Python - Extract everything in a filepath, after a certain directory

Let's say I have a folder like this.
/home/user/dev/Project/media/image_dump/images/02_car_folder
Everything after the media directory should be kept. The remaining should be removed.
/media/image_dump/images/02_car_folder
I was originally doing it this way, but as more subdirectories were added to different folders, it started generating invalid file paths:
split_absolute = [os.sep.join(os.path.normpath(y).split(os.sep)[-2:]) for y in absolute_path]
The problem this causes is that once you start going deeper, the media path is cut out of the file path altogether.
So if I went into
media/image_dump/images/02_car_folder/
The file path now becomes this, when it needs to include everything from /media onwards:
/images/02_car_folder
What are some ways to actually handle this? I won't know what users' file paths will be leading up to media, but I know that everything after media is what should be kept, no matter how deep their folders go.

I think you can achieve what you want quite easily using Path.parts:
from pathlib import Path
path = "/home/user/dev/Project/media/image_dump/images/02_car_folder"
parts = Path(path).parts
stripped_path = Path(*parts[parts.index("media"):])
Result:
>>> print(stripped_path)
media/image_dump/images/02_car_folder
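Note that tuple.index raises ValueError when "media" is not one of the components. A possible way to guard against that is sketched below; the helper name strip_before and its marker parameter are hypothetical, and PurePosixPath is used only so the example behaves the same on every platform.

```python
from pathlib import PurePosixPath

def strip_before(path, marker="media"):
    """Drop every component before the first `marker` component.

    Hypothetical helper: returns the path unchanged if `marker` is absent.
    """
    parts = PurePosixPath(path).parts
    if marker not in parts:
        return PurePosixPath(path)
    return PurePosixPath(*parts[parts.index(marker):])

print(strip_before("/home/user/dev/Project/media/image_dump/images/02_car_folder"))
# media/image_dump/images/02_car_folder
```

Because the comparison is done per component, a directory such as multimedia does not count as a match.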

You don't actually need a path-specific library.
Just work with strings:
Note: the weak point of treating paths as strings is that you have to handle many edge cases yourself (for example, when the path is media/blahblah/blahblah2 or /blahblah/blahblah2/media). pathlib handles these cases out of the box.
import os
full_path1 = "/home/user/dev/Project/media/image_dump/images/02_car_folder"
full_path2 = "/home/user/dev/Project/media/image_dump/media/images/02_car_folder"
separator_dir = os.path.sep + "media" + os.path.sep
print(f'Separate by {separator_dir}')
if separator_dir in full_path1:
    separated_path1 = os.path.sep + separator_dir.join(full_path1.split(separator_dir)[1:])
else:
    separated_path1 = full_path1
if separator_dir in full_path2:
    separated_path2 = os.path.sep + separator_dir.join(full_path2.split(separator_dir)[1:])
else:
    separated_path2 = full_path2
print(f'Full path 1 is {full_path1}')
print(f'Full path 2 is {full_path2}')
print(f'Separated path 1 is {separated_path1}')
print(f'Separated path 2 is {separated_path2}')
Output (the first path has one media folder; the second has two, but only the first one is used as the cut point):
Separate by /media/
Full path 1 is /home/user/dev/Project/media/image_dump/images/02_car_folder
Full path 2 is /home/user/dev/Project/media/image_dump/media/images/02_car_folder
Separated path 1 is /image_dump/images/02_car_folder
Separated path 2 is /image_dump/media/images/02_car_folder
Note that, as written, the result drops the media component itself. Prefixing the result with separator_dir instead of os.path.sep keeps it, as the question asks.

You could also use a regex, concise and easy:
path = '/home/user/dev/Project/media/image_dump/images/02_car_folder'
import re
re.search('/media/.*', path).group(0)
Output: '/media/image_dump/images/02_car_folder'
If media might not be present in the path:
m = re.search('/media/.*', path)
m.group(0) if m else None # or any default you want
If you want the first / to be optional if media is at the beginning, use '(?:/|^)media/.*'
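Put together, the two variants above might be wrapped as in the following sketch. The names MEDIA_RE and media_suffix and the default parameter are hypothetical, chosen only for this example.

```python
import re

# Matches 'media/' either at the very start of the string or right after a '/'
MEDIA_RE = re.compile(r'(?:/|^)media/.*')

def media_suffix(path, default=None):
    """Return the part of `path` from the first media/ component on, else `default`."""
    m = MEDIA_RE.search(path)
    return m.group(0) if m else default

print(media_suffix('/home/user/dev/Project/media/image_dump/images/02_car_folder'))
# /media/image_dump/images/02_car_folder
print(media_suffix('media/image_dump'))  # media/image_dump
print(media_suffix('/tmp/other'))        # None
```

A path that ends in /media without a trailing slash does not match this pattern.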

How to get the relative path between two absolute paths in Python using pathlib?

In Python 3, I defined two paths using pathlib, say:
from pathlib import Path
origin = Path('middle-earth/gondor/minas-tirith/castle').resolve()
destination = Path('middle-earth/gondor/osgiliath/tower').resolve()
How can I get the relative path that leads from origin to destination? In this example, I'd like a function that returns ../../osgiliath/tower or something equivalent.
Ideally, I'd have a function relative_path that always satisfies
origin.joinpath(
relative_path(origin, destination)
).resolve() == destination.resolve()
(well, ideally there would be an operator - such that destination == origin / (destination - origin) would always be true)
Note that Path.relative_to is not sufficient in this case, since origin is not a parent of destination. Also, I'm not working with symlinks, so it's safe to assume that there are none if this simplifies the problem.
How can relative_path be implemented?

This is trivially os.path.relpath
import os.path
from pathlib import Path
origin = Path('middle-earth/gondor/minas-tirith/castle').resolve()
destination = Path('middle-earth/gondor/osgiliath/tower').resolve()
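# The expected value is built with os.path.join so the check holds with the
# local separator: '..\\..\\osgiliath\\tower' on Windows, '../../osgiliath/tower' on POSIX
assert os.path.relpath(destination, start=origin) == os.path.join('..', '..', 'osgiliath', 'tower')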
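To get the relative_path function the question asks for, the call can be wrapped so that it returns a path object again. This is only a sketch: it uses posixpath.relpath and PurePosixPath with made-up absolute paths so that the output is the same on every platform, whereas real code would more likely use os.path.relpath and Path.

```python
import posixpath
from pathlib import PurePosixPath

def relative_path(origin, destination):
    """Purely lexical relative path from origin to destination (ignores symlinks)."""
    return PurePosixPath(posixpath.relpath(destination, start=origin))

# Hypothetical absolute paths standing in for the resolved ones in the question
origin = PurePosixPath('/middle-earth/gondor/minas-tirith/castle')
destination = PurePosixPath('/middle-earth/gondor/osgiliath/tower')

rel = relative_path(origin, destination)
print(rel)  # ../../osgiliath/tower
```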

If you'd like your own Python function to convert an absolute path to a relative path:
def absolute_file_path_to_relative(start_file_path, destination_file_path):
    return (start_file_path.count("/") + start_file_path.count("\\") + 1) * (".." + ((start_file_path.find("/") > -1) and "/" or "\\")) + destination_file_path
This assumes that:
1) start_file_path starts with the same root folder as destination_file_path.
2) Types of slashes don't occur interchangeably.
3) You're not using a filesystem that permits slashes in the file name.
Those assumptions may be an advantage or disadvantage, depending on your use case.
Disadvantages: if you're using pathlib, you'll break that module's API flow in your code by mixing in this function; limited use cases; inputs have to be sterile for the filesystem you're working with.
Advantages: runs 202x faster than #AdamSmith's answer (tested on Windows 7, 32-bit)

How/where to use os.path.sep?

os.path.sep is the character used by the operating system to separate pathname components.
But when os.path.sep is used in os.path.join(), why does it truncate the path?
Example:
Instead of 'home/python', os.path.join returns '/python':
>>> import os
>>> os.path.join('home', os.path.sep, 'python')
'/python'
I know that os.path.join() inserts the directory separator implicitly.
Where is os.path.sep useful? Why does it truncate the path?

Where is os.path.sep useful?
I suspect that it exists mainly because a variable like this is required in the module anyway (to avoid hardcoding), and if it's there, it might as well be documented. Its documentation says that it is "occasionally useful".
Why does it truncate the path?
From the docs for os.path.join():
If a component is an absolute path, all previous components are thrown away and joining continues from the absolute path component.
and / is an absolute path on *nix systems.
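A small sketch of this behaviour, calling posixpath (the POSIX implementation behind os.path) directly so that the result does not depend on the platform it is run on:

```python
import posixpath

# The bare separator is an absolute path, so 'home' is discarded
joined_with_sep = posixpath.join('home', posixpath.sep, 'python')

# Without it, join inserts the separator itself
joined_plain = posixpath.join('home', 'python')

print(joined_with_sep)  # /python
print(joined_plain)     # home/python
```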

Drop os.path.sep from the os.path.join() call. os.path.join() uses os.path.sep internally.
On your system, os.path.sep == '/', which is interpreted as the root directory (an absolute path), and therefore os.path.join('home', '/', 'python') is equivalent to os.path.join('/', 'python') == '/python'. From the docs:
If a component is an absolute path, all previous components are thrown
away and joining continues from the absolute path component.

As correctly given in the docstring of os.path.join -
Join two or more pathname components, inserting '/' as needed. If any component is an absolute path, all previous path components will be discarded.
Same is given in the docs as well -
os.path.join(path, *paths)
Join one or more path components intelligently. The return value is the concatenation of path and any members of *paths with exactly one directory separator (os.sep) following each non-empty part except the last, meaning that the result will only end in a separator if the last part is empty. If a component is an absolute path, all previous components are thrown away and joining continues from the absolute path component.
When you pass os.path.sep on its own, it is treated as an absolute path to the root directory, /.
Please note that this is for the Unix/Linux os.path, which internally is posixpath, though the same behavior is seen with os.path.join() on Windows.
Example -
>>> import os.path
>>> os.path.join.__doc__
"Join two or more pathname components, inserting '/' as needed.\n If any component is an absolute path, all previous path components\n will be discarded."

Here's the snippet of code that is run if you are on a POSIX machine:
posixpath.py
# Join pathnames.
# Ignore the previous parts if a part is absolute.
# Insert a '/' unless the first part is empty or already ends in '/'.
def join(a, *p):
    """Join two or more pathname components, inserting '/' as needed.
    If any component is an absolute path, all previous path components
    will be discarded. An empty last part will result in a path that
    ends with a separator."""
    sep = _get_sep(a)
    path = a
    try:
        if not p:
            path[:0] + sep #23780: Ensure compatible data type even if p is null.
        for b in p:
            if b.startswith(sep):
                path = b
            elif not path or path.endswith(sep):
                path += b
            else:
                path += sep + b
    except (TypeError, AttributeError, BytesWarning):
        genericpath._check_arg_types('join', a, *p)
        raise
    return path
Specifically, the lines:
if b.startswith(sep):
    path = b
And, since os.path.sep definitely starts with this character, whenever we encounter it we throw out the portion of the variable path that has already been constructed and start over with the next element in p.

But when os.path.sep is used in os.path.join(), why does it truncate the path?
Quoting directly from the documentation of os.path.join
If a component is an absolute path, all previous components are thrown away and joining continues from the absolute path component.
So when you do:
os.path.join('home', os.path.sep, 'python')
os.path.sep is '/', which is an absolute path, and so 'home' is thrown away and you get only '/python' as the output.
This is also clear from the example:
>>> import os
>>> os.path.join('home','/python','kivy')
'/python/kivy'
Where is os.path.sep useful?
os.path.sep (or os.sep) is the character used by the operating system to separate pathname components.
But again quoting from the docs:
Note that knowing this is not sufficient to be able to parse or concatenate pathnames — use os.path.split() and os.path.join() — but it is occasionally useful.
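One situation where the separator can be handy is a purely lexical "is this path inside that directory" check, where a trailing separator prevents /usr/var from matching /usr/var2. The sketch below is an illustration under assumptions: the helper name is_inside and its sep parameter are hypothetical, both paths are assumed to be already normalized, and posixpath.sep is used so the output is platform-independent.

```python
import posixpath

def is_inside(path, directory, sep=posixpath.sep):
    """Lexical containment check; hypothetical helper, no file system access."""
    # Force exactly one trailing separator so that '/usr/var'
    # is not treated as a prefix of '/usr/var2'
    directory = directory.rstrip(sep) + sep
    return path.startswith(directory)

print(is_inside('/usr/var/log', '/usr/var'))   # True
print(is_inside('/usr/var2/log', '/usr/var'))  # False
```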

replace part of path - python

Is there a quick way to replace part of the path in python?
for example:
old_path='/abc/dfg/ghi/f.txt'
I don't know the beginning of the path (/abc/dfg/), so what I'd really like is to tell Python to keep everything that comes after /ghi/ (inclusive) and replace everything before /ghi/ with /jkl/mno/:
>>> new_path
'/jkl/mno/ghi/f.txt/'

If you're using Python 3.4+, or willing to install the backport, consider using pathlib instead of os.path:
import pathlib
path = pathlib.Path(old_path)
index = path.parts.index('ghi')
new_path = pathlib.Path('/jkl/mno').joinpath(*path.parts[index:])
If you just want to stick with the 2.7 or 3.3 stdlib, there's no direct way to do this, but you can get the equivalent of parts by looping over os.path.split. For example, keeping each path component until you find the first ghi, and then tacking on the new prefix, will replace everything before the last ghi (if you want to replace everything before the first ghi, it's not hard to change things):
import os
path = old_path
new_path = ''
# Caution: this loop never terminates if no component is named 'ghi'
while True:
    path, base = os.path.split(path)
    # new_path starts out empty, so the result keeps a trailing separator
    new_path = os.path.join(base, new_path)
    if base == 'ghi':
        break
new_path = os.path.join('/jkl/mno', new_path)
This is a bit clumsy, so you might want to consider writing a simple function that gives you a list or tuple of the path components, so you can just use index, then join it all back together, as with the pathlib version.
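Such a helper might look like the sketch below. The names path_parts and replace_prefix are hypothetical, and posixpath is used instead of os.path only so that the example gives the same result on every platform.

```python
import posixpath

def path_parts(path):
    """Split a path into its components by repeatedly calling posixpath.split."""
    parts = []
    while True:
        head, tail = posixpath.split(path)
        if head == path:          # reached the root ('/') or the empty string
            if head:
                parts.append(head)
            break
        if tail:                  # skip the empty tail left by a trailing slash
            parts.append(tail)
        path = head
    parts.reverse()
    return parts

def replace_prefix(old_path, marker, new_prefix):
    """Replace everything before the first `marker` component with `new_prefix`."""
    parts = path_parts(old_path)
    index = parts.index(marker)   # raises ValueError if marker is absent
    return posixpath.join(new_prefix, *parts[index:])

print(path_parts('/abc/dfg/ghi/f.txt'))  # ['/', 'abc', 'dfg', 'ghi', 'f.txt']
print(replace_prefix('/abc/dfg/ghi/f.txt', 'ghi', '/jkl/mno'))  # /jkl/mno/ghi/f.txt
```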

>>> import os.path
>>> old_path='/abc/dfg/ghi/f.txt'
First grab the relative path from the starting directory of your choice using os.path.relpath
>>> rel = os.path.relpath(old_path, '/abc/dfg/')
>>> rel
'ghi\\f.txt'
Then add the new first part of the path to this relative path using os.path.join
>>> new_path = os.path.join(r'jkl\mno', rel)
>>> new_path
'jkl\\mno\\ghi\\f.txt'

You can use the index of ghi:
old_path.replace(old_path[:old_path.index("ghi")],"/jkl/mno/")
In [4]: old_path.replace(old_path[:old_path.index("ghi")],"/jkl/mno/" )
Out[4]: '/jkl/mno/ghi/f.txt'

A rather naive approach, but does the job:
Function:
def replace_path(path, frm, to):
    pre, match, post = path.rpartition(frm)
    return ''.join((to if match else pre, match, post))
Example:
>>> s = '/abc/dfg/ghi/f.txt'
>>> replace_path(s, '/ghi/', '/jkl/mno')
'/jkl/mno/ghi/f.txt'
>>> replace_path(s, '/whatever/', '/jkl/mno')
'/abc/dfg/ghi/f.txt'

The following is useful when you want to replace some known base directory in your path.
from pathlib import Path
old_path = Path('/abc/dfg/ghi/f.txt')
old_root = Path('/abc/dfg')
new_root = Path('/jkl/mno')
new_path = new_root / old_path.relative_to(old_root)
# Result: /jkl/mno/ghi/f.txt
I understand that the OP specifically mentioned that the path to the base directory is not known. However, since it is a common task to remove the path to the base directory, and the title of the question ("replace part of the path") is certainly bringing some folks with this subtype of problem here, I am posting it anyway.

I needed to replace an arbitrary number of arbitrary strings in a path,
e.g. replace 'package' with 'foo' in
VERSION_FILE = Path(f'{Path.home()}', 'projects', 'package', 'package', '_version.py')
So I use this call
_replace_path_text(VERSION_FILE, 'package', 'foo')
with this helper (it needs from pathlib import Path):
def _replace_path_text(path, text, replacement):
    parts = list(path.parts)
    new_parts = [part.replace(text, replacement) for part in parts]
    return Path(*new_parts)

What are some specific examples of using the wrong path separator failing? [duplicate]

I'm not able to see the bigger picture here I think; but basically I have no idea why you would use os.path.join instead of just normal string concatenation?
I have mainly used VBScript so I don't understand the point of this function.

Portable
Write filepath manipulations once and it works across many different platforms, for free. The delimiting character is abstracted away, making your job easier.
Smart
You no longer need to worry if that directory path had a trailing slash or not. os.path.join will add it if it needs to.
Clear
Using os.path.join makes it obvious to other people reading your code that you are working with filepaths. People can quickly scan through the code and see at once that it's a filepath. If you decide to construct it yourself, you will likely distract the reader from finding actual problems with your code: "Hmm, some string concats, a substitution. Is this a filepath or what? Gah! Why didn't he use os.path.join?" :)
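A short sketch of the "portable" and "smart" points. It calls the two platform-specific implementations, posixpath and ntpath, directly so that both results can be seen on any machine; normal code would simply call os.path.join. The file names are made up for the example.

```python
import ntpath
import posixpath

# Trailing slash or not, the result has exactly one separator
posix_a = posixpath.join('logs', 'app.txt')
posix_b = posixpath.join('logs/', 'app.txt')

# The Windows implementation inserts a backslash instead
windows = ntpath.join('logs', 'app.txt')

# Naive concatenation has to track the trailing slash by hand
naive = 'logs/' + '/' + 'app.txt'

print(posix_a)  # logs/app.txt
print(posix_b)  # logs/app.txt
print(windows)  # logs\app.txt
print(naive)    # logs//app.txt
```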

It will work on Windows with '\' and on Unix (including Mac OS X) with '/'.
For posixpath, here's the straightforward code:
In [22]: os.path.join??
Type: function
String Form:<function join at 0x107c28ed8>
File: /usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/posixpath.py
Definition: os.path.join(a, *p)
Source:
def join(a, *p):
    """Join two or more pathname components, inserting '/' as needed.
    If any component is an absolute path, all previous path components
    will be discarded."""
    path = a
    for b in p:
        if b.startswith('/'):
            path = b
        elif path == '' or path.endswith('/'):
            path += b
        else:
            path += '/' + b
    return path
I don't have Windows, but the same logic should be there with '\'.

It is OS-independent. If you hardcode your paths as C:\Whatever they will only work on Windows. If you hardcode them with the Unix standard "/" they will only work on Unix. os.path.join detects the operating system it is running under and joins the paths using the correct symbol.
