I feel that there is (should be?) a Python function out there that recursively splits a path string into its constituent files and directories (beyond basename and dirname). I've written one but since I use Python for shell-scripting on 5+ computers, I was hoping for something from the standard library or simpler that I can use on-the-fly.
import os
def recsplit(x):
    if type(x) is str: return recsplit(os.path.split(x))
    else: return (x[0]=='' or x[0] == '.' or x[0]=='/') and x[1:] or \
        recsplit(os.path.split(x[0]) + x[1:])
>>> print recsplit('main/sub1/sub2/sub3/file')
('main', 'sub1', 'sub2', 'sub3', 'file')
Any leads/ideas? ~Thanks~
UPDATE: After all the mucking about with altsep, the currently selected answer doesn't even split on backslashes.
>>> import re, os.path
>>> seps = os.path.sep
>>> if os.path.altsep:
... seps += os.path.altsep
...
>>> seps
'\\/'
>>> somepath = r"C:\foo/bar.txt"
>>> print re.split('[%s]' % (seps,), somepath)
['C:\\foo', 'bar.txt'] # Whoops!! it was splitting using [\/] same as [/]
>>> print re.split('[%r]' % (seps,), somepath)
['C:', 'foo', 'bar.txt'] # after fixing it
>>> print re.split('[%r]' % seps, somepath)
['C:', 'foo', 'bar.txt'] # removed redundant cruft
>>>
Now back to what we ought to be doing:
(end of update)
1. Consider carefully what you are asking for -- you may get what you want, not what you need.
If you have relative paths
r"./foo/bar.txt" (unix) and r"C:foo\bar.txt" (windows)
do you want
[".", "foo", "bar.txt"] (unix) and ["C:foo", "bar.txt"] (windows)
(do notice the C:foo in there) or do you want
["", "CWD", "foo", "bar.txt"] (unix) and ["C:", "CWD", "foo", "bar.txt"] (windows)
where CWD is the current working directory (system-wide on unix, that of C: on windows)?
2. You don't need to faff about with os.path.altsep -- os.path.normpath() will make the separators uniform, and tidy up other weirdnesses like foo/bar/zot/../../whoopsy/daisy/somewhere/else
Solution step 1: unkink your path with one of os.path.normpath() or os.path.abspath().
Step 2: doing unkinked_path.split(os.path.sep) is not a good idea. You should pull it apart with os.path.splitdrive(), then use multiple applications of os.path.split().
Here are some examples of what would happen in step 1 on windows:
>>> os.path.abspath(r"C:/hello\world.txt")
'C:\\hello\\world.txt'
>>> os.path.abspath(r"C:hello\world.txt")
'C:\\Documents and Settings\\sjm_2\\hello\\world.txt'
>>> os.path.abspath(r"/hello\world.txt")
'C:\\hello\\world.txt'
>>> os.path.abspath(r"hello\world.txt")
'C:\\Documents and Settings\\sjm_2\\hello\\world.txt'
>>> os.path.abspath(r"e:hello\world.txt")
'E:\\emoh_ruo\\hello\\world.txt'
>>>
(the current drive is C, the CWD on drive C is \Documents and Settings\sjm_2, and the CWD on drive E is \emoh_ruo)
I'd like to suggest that you write step 2 without the conglomeration of and and or that you have in your example. Write code as if your eventual replacement knows where you live and owns a chainsaw :-)
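The two steps above could be sketched roughly as follows. This is only an illustration: the name split_all and the pathmod parameter are made up here, and pathmod exists only so that the Unix and Windows behaviour can both be tried on any machine.

```python
import ntpath
import os.path
import posixpath


def split_all(path, pathmod=os.path):
    """Return (drive, components) for path, using repeated split() calls."""
    # Step 1: unkink the path, then peel off any drive letter.
    drive, rest = pathmod.splitdrive(pathmod.normpath(path))
    parts = []
    # Step 2: apply split() until nothing more comes off.
    while True:
        head, tail = pathmod.split(rest)
        if head == rest:
            # Reached the root (or an empty string); keep the root if any.
            if head:
                parts.append(head)
            break
        if tail:
            parts.append(tail)
        rest = head
    parts.reverse()
    return drive, parts


print(split_all('main/sub1/sub2/sub3/file', posixpath))
print(split_all(r'C:\foo/bar.txt', ntpath))
```

The root separator is kept as its own element so that absolute and relative paths stay distinguishable; drop it if you only care about the names.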
Use this:
import os
def recSplitPath(path):
    elements = []
    while ((path != '/') and (path != '')):
        path, tail = os.path.split(path)
        elements.insert(0,tail)
    return elements
This turns /for/bar/whatever into ['for','bar','whatever']
path='main/sub1/sub2/sub3/file'
path.split(os.path.sep)
I know this question has been asked many times on this website, but the answers I found miss an important point: they only consider file extensions with one period, like *.png or *.mp3. How do I deal with file names that have two periods, like .tar.gz?
The basic code is:
filename = '/home/lancaster/Downloads/a.ppt'
extention = filename.split('/')[-1]
But obviously, this code does not work with a file like a.tar.gz.
How do I deal with it? Thanks.
Python 3.4
You can now use Path from pathlib. It has many features, one of them is suffix:
>>> from pathlib import Path
>>> Path('my/library/setup.py').suffix
'.py'
>>> Path('my/library.tar.gz').suffix
'.gz'
>>> Path('my/library').suffix
''
If you want to get more than one suffix, use suffixes:
>>> from pathlib import Path
>>> Path('my/library.tar.gar').suffixes
['.tar', '.gar']
>>> Path('my/library.tar.gz').suffixes
['.tar', '.gz']
>>> Path('my/library').suffixes
[]
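If you want the compound extension as a single string, the suffixes can simply be joined. A minimal sketch (full_suffix is a made-up helper name; PurePosixPath is used only so the example behaves the same on every platform):

```python
from pathlib import PurePosixPath


def full_suffix(path):
    """Join every suffix of the final path component into one string."""
    return ''.join(PurePosixPath(path).suffixes)


print(full_suffix('my/library.tar.gz'))   # .tar.gz
print(full_suffix('my/library'))          # (empty string)
print(full_suffix('my/lib-1.2.tar.gz'))   # .2.tar.gz, probably not what you want
```

As the last line shows, any other dot in the file name is counted as a suffix too, so this only helps when you know the names are well behaved.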
There is a built-in function for this in the os.path module. More about os.path.splitext.
In [1]: from os.path import splitext
In [2]: file_name,extension = splitext('/home/lancaster/Downloads/a.ppt')
In [3]: extension
Out[3]: '.ppt'
If you have to find the extension of .tar.gz, .tar.bz2 you have to write a function like this
from os.path import splitext
def splitext_(path):
    for ext in ['.tar.gz', '.tar.bz2']:
        if path.endswith(ext):
            return path[:-len(ext)], path[-len(ext):]
    return splitext(path)
Result
In [4]: file_name,ext = splitext_('/home/lancaster/Downloads/a.tar.gz')
In [5]: ext
Out[5]: '.tar.gz'
Edit
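Generally you can use this function
from os.path import splitext
def splitext_(path):
    parts = path.split('.')
    if len(parts) > 2:
        # treat the last two dot-separated pieces as the extension
        return '.'.join(parts[:-2]), '.' + '.'.join(parts[-2:])
    return splitext(path)
It works for any two-part extension, but it treats the last two dot-separated pieces as the extension whenever the path contains more than one dot, so it misfires on names such as my.report.pdf or on paths with a dot in a directory name.
Working on all files.
In [6]: inputs = ['a.tar.gz', 'b.tar.lzma', 'a.tar.lz', 'a.tar.lzo', 'a.tar.xz','a.png']
In [7]: for file_ in inputs:
   ....:     file_name,extension = splitext_(file_)
   ....:     print(extension)
   ....:
.tar.gz
.tar.lzma
.tar.lz
.tar.lzo
.tar.xz
.png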
The role of a file extension is to tell the viewer (and sometimes the computer) which application to use to handle the file.
Taking your worst-case example in your comments (a.ppt.tar.gz), this is a PowerPoint file that has been tar-balled and then gzipped. So you need to use a gzip-handling program to open it. Using PowerPoint or a tarball-handling program wouldn't work. OK, a clever program that knew how to handle both .tar and .gz files could understand both operations and work with a .tar.gz file - but note that it would do that even if the extension was simply .gz.
The fact that both tar and gzip add their extensions to the original filename, rather than replace them (as zip does) is a convenience. But the base name of the gzip file is still a.ppt.tar.
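To make the layering concrete, here is a small sketch that strips one extension at a time with os.path.splitext, outermost first. The helper name peel_extensions is invented for this illustration.

```python
from os.path import splitext


def peel_extensions(name):
    """Strip one extension at a time, outermost first."""
    layers = []
    root, ext = splitext(name)
    while ext:
        layers.append((root, ext))
        root, ext = splitext(root)
    return layers


for root, ext in peel_extensions('a.ppt.tar.gz'):
    print(ext, '->', root)
```

Each step corresponds to undoing one operation: removing .gz leaves the tarball a.ppt.tar, removing .tar leaves a.ppt, and so on.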
Simplest One:
import os.path
print os.path.splitext("/home/lancaster/Downloads/a.ppt")[1]
# '.ppt'
One possible way is:
Slice at "." => tmp_ext = filename.split('.')[1:]
Result is a list = ['tar', 'gz']
Join them together => extention = ".".join(tmp_ext)
Result is your extension as string = 'tar.gz'
Update: Example:
>>> test = "/test/test/test.tar.gz"
>>> t2 = test.split(".")[1:]
>>> t2
['tar', 'gz']
>>> ".".join(t2)
'tar.gz'
>>> import os
>>> import re
>>> filename = os.path.basename('/home/lancaster/Downloads/a.ppt')
>>> re.findall(r'\.([^.]+)', filename)
['ppt']
>>> filename = os.path.basename('/home/lancaster/Downloads/a.ppt.tar.gz')
>>> re.findall(r'\.([^.]+)', filename)
['ppt', 'tar', 'gz']
With re.findall and Python 3.6 (f-strings):
import re
filename = '/home/Downloads/abc.ppt.tar.gz'
ext = r'\.\w{1,6}'
re.findall(f'{ext}\\b | {ext}$', filename, re.X)
['.ppt', '.tar', '.gz']
filename = '/home/lancaster/Downloads/a.tar.gz'
extention = filename.split('/')[-1]
if '.' in extention:
    extention = extention.split('.')[-1]
    if len(extention) > 0:
        extention = '.' + extention
else:
    extention = ''  # no dot in the file name, so no extension
print(extention)
How could I split this:
C:\my_dir\repo\branch
to:
['C:\my_dir', rest_part_of_string]
where rest_part_of_string can be one string or can be split at every \. I don't care about the rest, I just want the first two elements together.
Python 3.4 has methods for that (note the forward slashes instead of backslashes; alternatively, double the backslashes):
pathlib documentation
# python 3.4
from pathlib import Path
p = Path('C:/my_dir/repo/branch')
print(p.parent)
print(p.name)
For what you need, parts is interesting:
print(p.parts)
# -> ('C:', 'my_dir', 'repo', 'branch')
print('\\'.join(p.parts[:2]), ' -- ', '\\'.join( p.parts[2:]))
# -> C:\my_dir -- repo\branch
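One caveat: the output above is what a non-Windows pathlib flavour produces. With the Windows flavour the first element of parts is the drive plus the root ('C:\\'), so joining by hand would double the separator. A sketch that sidesteps this by rebuilding path objects instead of joining strings (PureWindowsPath is used so it runs the same on any platform):

```python
from pathlib import PureWindowsPath

p = PureWindowsPath(r'C:\my_dir\repo\branch')
# parts[0] holds drive and root together when the Windows flavour is used.
print(p.parts)

# Rebuild paths from slices of parts rather than joining with '\\'.
first = PureWindowsPath(*p.parts[:2])
rest = PureWindowsPath(*p.parts[2:])
print(first, '--', rest)
```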
In Python 2.7 this needs a bit more work:
import os
p = 'C:/my_dir/repo/branch'
def split_path(path):
    parts = []
    while True:
        path, folder = os.path.split(path)
        if folder:
            parts.append(folder)
        else:
            if path:
                parts.append(path)
            break
    parts.reverse()
    return parts
parts = split_path(p)
print('\\'.join(parts[:2]) + ' -- ' + '\\'.join(parts[2:]))
# -> C:\my_dir -- repo\branch
Using regular expression (re module documentation):
>>> import re
>>> print(re.match(r'[^\\]+\\[^\\]+', r'C:\my_dir\repo\branch').group())
C:\my_dir
>>> re.findall(r'[^\\]+\\[^\\]+|.+', r'C:\my_dir\repo\branch')
['C:\\my_dir', '\\repo\\branch']
You could split the path on \ and rejoin based on index:
>>> my_path = r'C:\my_dir\repo\branch'
>>> split_path = ["\\".join(my_path.split("\\")[:2]), "\\".join(my_path.split("\\")[2:])]
>>> split_path
['C:\\my_dir', 'repo\\branch']
>>> x = r'C:\my_dir\repo\branch'
>>> first, last = "\\".join(x.split("\\")[:2]), "\\".join(x.split("\\")[2:])
>>> print first, last
C:\my_dir repo\branch
You need os.path.dirname() (or os.path.split), applied recursively or iteratively, until you cannot go up in the directory hierarchy further.
In general the functions provided by os.path should work better than re-invented wheels, due to better cross-platform support. There are a large number of primitives from which you can build your own path-manipulating function.
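As a rough sketch of the iterative approach described above: keep calling dirname() until it stops changing the path. The function name ancestors and the pathmod parameter are invented for this example; ntpath is passed in explicitly so the Windows path from the question can be tried on any platform.

```python
import ntpath
import os.path


def ancestors(path, pathmod=os.path):
    """List every parent directory of path, nearest first."""
    found = []
    while True:
        parent = pathmod.dirname(path)
        if parent == path:
            # dirname() no longer changes anything: we cannot go up further.
            break
        if parent:
            found.append(parent)
        path = parent
    return found


parents = ancestors(r'C:\my_dir\repo\branch', ntpath)
print(parents)
# The entry just below the root is the "first two elements together".
print(parents[-2])
```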
Here is an example of what I need.
Suppose that we have the following string:
str = "/home/user/folder/MyVeryLongFileName.foo"
I have multiple operations to do on this one:
remove the path (assuming I have its length):
str = str[path_length:]
remove the extension (always 4 characters in my case):
str = str[path_length:-4]
So, right now my string looks like MyVeryLongFileName
Now I would like to limit its size to 15 characters.
Is it possible to do it in the same expression? Or do I have to do it after the 2 previous operations?
If you want only the first 15 characters, then you can slice the string again, like this:
file_name[path_length:-4][:15]
If you really are dealing with filenames, you might want to go with
>>> file_name = "/home/user/folder/MyVeryLongFileName.foo"
>>> import os
>>> print os.path.split(file_name)[1].rpartition(".")[0][:15]
MyVeryLongFileN
Or:
>>> print os.path.basename(file_name).rpartition(".")[0][:15]
MyVeryLongFileN
Also, it would be better to use splitext to get the extension, like this
>>> from os.path import basename, splitext
>>> print splitext(basename(file_name))[0][:15]
MyVeryLongFileN
You can get the filename with this:
>>> print str.split('/')[-1]
MyVeryLongFileName.foo
Remove the extension with:
>>> print str.split('.')[0]
/home/user/folder/MyVeryLongFileName
Limit the file name to 15 characters:
>>> print str.split('/')[-1][:15]
MyVeryLongFileN
This being said, you can always use the bash utils to extract this info. basename is the tool to get the file and dirname to get the path. See Extract filename and extension in bash for more info.
I would do this:
>>> from os.path import splitext, basename
>>> apath = "/home/user/folder/MyVeryLongFileName.foo"
>>> splitext(basename(apath))[0][:15]
'MyVeryLongFileN'
splitext separates the file extension from the rest, and we apply it to the result of basename, which strips the directory part and leaves only the file name. Then we can cut down the remaining string. I would definitely use these functions because, unlike fixed-length slicing, they do not depend on knowing the path length or the extension length in advance.
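For comparison, roughly the same thing can be written with pathlib, where stem is the file name without its final extension. A small sketch (short_stem is a made-up name, and the default limit of 15 just mirrors the question):

```python
from pathlib import PurePosixPath


def short_stem(path, limit=15):
    """File name without directory or final extension, cut to `limit` chars."""
    return PurePosixPath(path).stem[:limit]


print(short_stem('/home/user/folder/MyVeryLongFileName.foo'))
```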
In python, suppose I have a path like this:
/folderA/folderB/folderC/folderD/
How can I get just the folderD part?
Use os.path.normpath, then os.path.basename:
>>> os.path.basename(os.path.normpath('/folderA/folderB/folderC/folderD/'))
'folderD'
The first strips off any trailing slashes, the second gives you the last part of the path. Using only basename gives everything after the last slash, which in this case is ''.
With python 3 you can use the pathlib module (pathlib.PurePath for example):
>>> import pathlib
>>> path = pathlib.PurePath('/folderA/folderB/folderC/folderD/')
>>> path.name
'folderD'
If you want the last folder name where a file is located:
>>> path = pathlib.PurePath('/folderA/folderB/folderC/folderD/file.py')
>>> path.parent.name
'folderD'
You could do
>>> import os
>>> os.path.basename('/folderA/folderB/folderC/folderD')
UPDATE1: If you give this approach /folderA/folderB/folderC/folderD/xx.py, it returns xx.py as the basename, which I guess is not what you want. So you could do this:
>>> import os
>>> path = "/folderA/folderB/folderC/folderD"
>>> if os.path.isdir(path):
...     dirname = os.path.basename(path)
UPDATE2: As lars pointed out, here is a change to accommodate a trailing '/'.
>>> from os.path import normpath, basename
>>> basename(normpath('/folderA/folderB/folderC/folderD/'))
'folderD'
Here is my approach:
>>> import os
>>> print os.path.basename(
...     os.path.dirname('/folderA/folderB/folderC/folderD/test.py'))
folderD
>>> print os.path.basename(
...     os.path.dirname('/folderA/folderB/folderC/folderD/'))
folderD
>>> print os.path.basename(
...     os.path.dirname('/folderA/folderB/folderC/folderD'))
folderC
I was searching for a way to get the name of the last folder in which a file is located, so I just used split twice to get the right part. It's not quite the question, but Google sent me here.
import os
pathname = "/folderA/folderB/folderC/folderD/filename.py"
head, tail = os.path.split(os.path.split(pathname)[0])
print(head + " " + tail)
I like the parts method of Path for this:
grandparent_directory, parent_directory, filename = Path(export_filename).parts[-3:]
log.info(f'{t: <30}: {num_rows: >7} Rows exported to {grandparent_directory}/{parent_directory}/{filename}')
If you use the standard-library pathlib module, it's really simple.
>>> from pathlib import Path
>>> your_path = Path("/folderA/folderB/folderC/folderD/")
>>> your_path.name
'folderD'
Suppose you have the path to a file in folderD.
>>> from pathlib import Path
>>> your_path = Path("/folderA/folderB/folderC/folderD/file.txt")
>>> your_path.name
'file.txt'
>>> your_path.parent.name
'folderD'
In my current projects I often pass the trailing parts of a path to a function, so I use pathlib's Path. To get the n-th part counting from the end, I use:
from typing import Union
from pathlib import Path
def get_single_subpath_part(base_dir: Union[Path, str], n: int) -> str:
    base_dir = Path(base_dir)
    for _ in range(n):
        base_dir = base_dir.parent
    return base_dir.name
path = "/folderA/folderB/folderC/folderD/"
# for getting the last part:
print(get_single_subpath_part(path, 0))
# yields "folderD"
# for the second last
print(get_single_subpath_part(path, 1))
# yields "folderC"
Furthermore, to get the last n+1 parts of a path as a new path, I use:
from typing import Union
from pathlib import Path
def get_n_last_subparts_path(base_dir: Union[Path, str], n: int) -> Path:
    return Path(*Path(base_dir).parts[-n-1:])
path = "/folderA/folderB/folderC/folderD/"
# for getting the last part:
print(get_n_last_subparts_path(path, 0))
# yields a `Path` object of "folderD"
# for second last and last part together
print(get_n_last_subparts_path(path, 1))
# yields a `Path` object of "folderC/folderD"
Note that this function returns a Path object, which can easily be converted to a string (e.g. str(path)).
path = "/folderA/folderB/folderC/folderD/"
last = path.rstrip('/').split('/').pop()  # strip the trailing slash first, otherwise pop() returns ''
str = "/folderA/folderB/folderC/folderD/"
print str.split("/")[-2]
What is the most pythonic way to find presence of every directory name ['spam', 'eggs'] in path e.g. "/home/user/spam/eggs"
Usage example (doesn't work but explains my case):
dirs = ['spam', 'eggs']
path = "/home/user/spam/eggs"
if path.find(dirs):
    print "All dirs are present in the path"
Thanks
set.issubset:
>>> set(['spam', 'eggs']).issubset('/home/user/spam/eggs'.split('/'))
True
Looks like you want something like...:
if all(d in path.split('/') for d in dirs):
    ...
This one-liner style is inefficient since it keeps splitting path for each d (and split makes a list, while a set is better for membership checking). Making it into a 2-liner:
pathpieces = set(path.split('/'))
if all(d in pathpieces for d in dirs):
    ...
splits the path only once and makes each membership check a constant-time set lookup.
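Wrapped up as a reusable function, the idea might look like the sketch below. The name has_all_dirs is made up, and PurePosixPath.parts is swapped in for split('/') so that empty pieces from leading or doubled slashes never appear.

```python
from pathlib import PurePosixPath


def has_all_dirs(path, dirs):
    """True if every name in dirs is a component of path."""
    return set(dirs).issubset(PurePosixPath(path).parts)


print(has_all_dirs('/home/user/spam/eggs', ['spam', 'eggs']))   # True
print(has_all_dirs('/home/user/spam', ['spam', 'eggs']))        # False
print(has_all_dirs('/home/user/spammer', ['spam']))             # False
```

Because whole components are compared, a directory called spammer does not count as containing spam, which a plain substring test would get wrong.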
names = ['spam', 'eggs']
dir = "/home/user/spam/eggs"
# Split into parts
parts = [part for part in dir.split('/') if part != '']
# Rejoin found paths
dirs = ['/' + '/'.join(parts[:n + 1]) for (n, name) in enumerate(parts) if name in names]  # n + 1 so each path includes the matched name
Edit : If you just want to verify whether all dirs exist:
parts = "/home/user/spam/eggs".split('/')
print all(dir in parts for dir in ['spam', 'eggs'])
Maybe this is what you want?
dirs = ['spam', 'eggs']
path = "/home/user/spam/eggs"
present = [dir for dir in dirs if dir in path]
A one-liner using a list comprehension (using textual lookup and not treating names as anything to do with the filesystem; your request is not totally clear to me). It needs import os for os.sep:
import os
[x for x in dirs if x in path.split( os.sep )]