Split path in Python

How could I split this:
C:\my_dir\repo\branch
into:
['C:\my_dir', rest_part_of_string]
where rest_part_of_string can be a single string or be split at every \? I don't care about the rest; I just want the first two elements together.

Python 3.4 and later provide pathlib for this (note the forward slashes instead of backslashes; doubling the backslashes or using a raw string also works).
See the pathlib documentation.
# python 3.4
from pathlib import Path
p = Path('C:/my_dir/repo/branch')
print(p.parent)
print(p.name)
For what you need, parts is the interesting attribute:
print(p.parts)
# -> ('C:', 'my_dir', 'repo', 'branch')  (on Windows the first element is 'C:\\')
print('\\'.join(p.parts[:2]), ' -- ', '\\'.join( p.parts[2:]))
# -> C:\my_dir -- repo\branch
In Python 2.7 this needs a bit more work:
from __future__ import print_function  # make print() behave as in Python 3
import os
p = 'C:/my_dir/repo/branch'
def split_path(path):
    parts = []
    while 1:
        # peel off the last component until nothing is left
        path, folder = os.path.split(path)
        if folder:
            parts.append(folder)
        else:
            if path:
                parts.append(path)
            break
    parts.reverse()
    return parts
parts = split_path(p)
print('\\'.join(parts[:2]), ' -- ', '\\'.join(parts[2:]))
# -> C:\my_dir -- repo\branch
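One caveat with the pathlib approach above: Path picks its path flavour from the operating system the code runs on, so the parts differ between Windows and other systems. A minimal sketch, assuming the input is always a Windows-style path, is to use PureWindowsPath so the drive and separators are parsed the same way everywhere. The helper name split_after_first_dir is made up for illustration.

```python
from pathlib import PureWindowsPath


def split_after_first_dir(path):
    """Return (drive plus first directory, remainder) for a Windows-style path."""
    parts = PureWindowsPath(path).parts  # e.g. ('C:\\', 'my_dir', 'repo', 'branch')
    head = str(PureWindowsPath(*parts[:2]))
    # Avoid PureWindowsPath() with no arguments, which would render as '.'
    rest = str(PureWindowsPath(*parts[2:])) if len(parts) > 2 else ''
    return head, rest


print(split_after_first_dir(r'C:\my_dir\repo\branch'))
```

Because the pure path classes never touch the file system, this also works for paths that do not exist on the machine running the code.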

Using regular expression (re module documentation):
>>> import re
>>> print(re.match(r'[^\\]+\\[^\\]+', r'C:\my_dir\repo\branch').group())
C:\my_dir
>>> re.findall(r'[^\\]+\\[^\\]+|.+', r'C:\my_dir\repo\branch')
['C:\\my_dir', '\\repo\\branch']

You could split the path on \ and rejoin based on index:
>>> my_path = r'C:\my_dir\repo\branch'
>>> parts = my_path.split("\\")
>>> ["\\".join(parts[:2]), "\\".join(parts[2:])]
['C:\\my_dir', 'repo\\branch']
>>> first, last = "\\".join(parts[:2]), "\\".join(parts[2:])
>>> print(first, last)
C:\my_dir repo\branch

You need os.path.dirname() (or os.path.split), applied recursively or iteratively, until you cannot go any further up the directory hierarchy.
In general, the functions provided by os.path should work better than re-invented wheels because of their cross-platform support. There are a large number of primitives from which you can build your own path-manipulating function.
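The iterative dirname() idea could be sketched roughly as follows. This is only an illustration: it uses ntpath (the Windows flavour of os.path, importable on any platform) so that the backslash path from the question is parsed the same way everywhere, and the function name drive_and_first_dir is hypothetical.

```python
import ntpath  # Windows flavour of os.path, importable on every platform


def drive_and_first_dir(path):
    """Walk up with dirname() until the parent of the current path is the root."""
    current = path
    while True:
        parent = ntpath.dirname(current)
        if ntpath.dirname(parent) == parent:  # parent cannot go any higher
            return current
        current = parent


path = r'C:\my_dir\repo\branch'
head = drive_and_first_dir(path)
rest = path[len(head):].lstrip('\\/')  # whatever follows the first two elements
print([head, rest])
```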

Related

Cut out a sequence of files using glob in python

I have a directory with files like img-0001.jpg, img-0005.jpg, img-0006.jpg, ... , img-xxxx.jpg.
What I need is a list of all files starting at 0238, literally img-0238.jpg. The next existing filename is img-0240.jpg.
Right now I use glob to get all filenames:
list_images = glob.glob(path_images + "*.jpg")
Thanks in advance
Edit:
-> The last filename is img-0315.jpg

glob doesn't support regex filtering, but you can filter the list right after you receive all matching files.
Here is how it would look using re:
import glob
import re
list_images = [f for f in glob.glob(path_images + "*.jpg")
               if re.search(r'(?:[1-9]\d{3}|0[3-9]\d{2}|02[4-9]\d|023[89])\.jpg$', f)]
The regular expression verifies that the file name ends with a 4-digit number greater than or equal to 0238. The alternatives are grouped so that \.jpg$ applies to every one of them, not just the last.
You can play around with regular expressions at https://regex101.com/
Basically, we check whether the number:
starts with [1-9] followed by any 3 digits
or starts with 0[3-9] followed by any 2 digits
or starts with 02[4-9] followed by any 1 digit
or starts with 023 followed by either 8 or 9.
But it would probably be easier to do a simple string comparison, which also enforces the upper bound of 0315:
list_images = [f for f in glob.glob(path_images + "*.jpg")
               if "0238" <= f[-8:-4] <= "0315"]

You can use several wildcard patterns to match all files whose number is 023[89], 02[4-9][0-9], 030[0-9] or 031[0-5]:
import glob
import os
list_images = []
for pattern in ('023[89]', '02[4-9][0-9]', '030[0-9]', '031[0-5]'):
    list_images.extend(glob.glob(
        os.path.join(path_images, '*{0}.jpg'.format(pattern))))
Or you can just filter out the ones you don't want:
list_images = [x for x in glob.glob(os.path.join(path_images, "*.jpg"))
               if 238 <= int(x[-8:-4]) <= 315]
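The slice x[-8:-4] assumes exactly four digits right before ".jpg". A slightly more forgiving sketch extracts the number with a regular expression and compares it as an integer. The function name select_range and its parameters are hypothetical, and it takes a plain list of names so it can be tried without any files on disk; in practice you would pass it the result of glob.glob.

```python
import re

# Trailing run of digits immediately before the .jpg extension
_NUMBER = re.compile(r'(\d+)\.jpg$')


def select_range(filenames, first, last):
    """Keep the names whose trailing number lies in [first, last]."""
    selected = []
    for name in filenames:
        match = _NUMBER.search(name)
        if match and first <= int(match.group(1)) <= last:
            selected.append(name)
    return sorted(selected)


names = ['d/img-0001.jpg', 'd/img-0238.jpg', 'd/img-0240.jpg',
         'd/img-0315.jpg', 'd/img-0316.jpg', 'd/notes.txt']
print(select_range(names, 238, 315))
```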

For something like this, you could try the wcmatch library. It's a library that aims to enhance file globbing and wildcard matching.
In this example, we enable brace expansion and demonstrate the pattern by filtering a list of files:
from wcmatch import glob
files = []
# Generate list of files from img-0000.jpg to img-0315.jpg
for x in range(316):
    files.append('path/img-{:04d}.jpg'.format(x))
print(glob.globfilter(files, 'path/img-{0238..0315}.jpg', flags=glob.BRACE))
And we get the following output:
['path/img-0238.jpg', 'path/img-0239.jpg', 'path/img-0240.jpg', 'path/img-0241.jpg', 'path/img-0242.jpg', 'path/img-0243.jpg', 'path/img-0244.jpg', 'path/img-0245.jpg', 'path/img-0246.jpg', 'path/img-0247.jpg', 'path/img-0248.jpg', 'path/img-0249.jpg', 'path/img-0250.jpg', 'path/img-0251.jpg', 'path/img-0252.jpg', 'path/img-0253.jpg', 'path/img-0254.jpg', 'path/img-0255.jpg', 'path/img-0256.jpg', 'path/img-0257.jpg', 'path/img-0258.jpg', 'path/img-0259.jpg', 'path/img-0260.jpg', 'path/img-0261.jpg', 'path/img-0262.jpg', 'path/img-0263.jpg', 'path/img-0264.jpg', 'path/img-0265.jpg', 'path/img-0266.jpg', 'path/img-0267.jpg', 'path/img-0268.jpg', 'path/img-0269.jpg', 'path/img-0270.jpg', 'path/img-0271.jpg', 'path/img-0272.jpg', 'path/img-0273.jpg', 'path/img-0274.jpg', 'path/img-0275.jpg', 'path/img-0276.jpg', 'path/img-0277.jpg', 'path/img-0278.jpg', 'path/img-0279.jpg', 'path/img-0280.jpg', 'path/img-0281.jpg', 'path/img-0282.jpg', 'path/img-0283.jpg', 'path/img-0284.jpg', 'path/img-0285.jpg', 'path/img-0286.jpg', 'path/img-0287.jpg', 'path/img-0288.jpg', 'path/img-0289.jpg', 'path/img-0290.jpg', 'path/img-0291.jpg', 'path/img-0292.jpg', 'path/img-0293.jpg', 'path/img-0294.jpg', 'path/img-0295.jpg', 'path/img-0296.jpg', 'path/img-0297.jpg', 'path/img-0298.jpg', 'path/img-0299.jpg', 'path/img-0300.jpg', 'path/img-0301.jpg', 'path/img-0302.jpg', 'path/img-0303.jpg', 'path/img-0304.jpg', 'path/img-0305.jpg', 'path/img-0306.jpg', 'path/img-0307.jpg', 'path/img-0308.jpg', 'path/img-0309.jpg', 'path/img-0310.jpg', 'path/img-0311.jpg', 'path/img-0312.jpg', 'path/img-0313.jpg', 'path/img-0314.jpg', 'path/img-0315.jpg']
So, we could apply this to a file search:
from wcmatch import glob
list_images = glob.glob('path/img-{0238..0315}.jpg', flags=glob.BRACE)
In this example, we've hard coded the path, but in your example, make sure path_images has a trailing / so that the pattern is constructed correctly. Others have suggested this might be an issue. Print out your pattern to confirm the pattern is correct.

Python Parse through String to create variable

I have a variable that reads in a data file:
dfPort = pd.read_csv(r"E:...\Portfolios\ConsDisc_20160701_Q.csv")
I was hoping to create three variables, portName, inceptionDate and frequency, by parsing the "E:..." string above and taking out the wanted parts, using the underscore as the delimiter between variables. Example after parsing the string:
portName = "ConsDisc"
inceptionDate = "2016-07-01"
frequency = "Q"
Any tips would be appreciated!

You can use os.path.basename, os.path.splitext and str.split:
import os
filename = r'E:...\Portfolios\ConsDisc_20160701_Q.csv'
parts = os.path.splitext(os.path.basename(filename.replace('\\', os.sep)))[0].split('_')
print(parts)
outputs ['ConsDisc', '20160701', 'Q']. You can then manipulate this list as you like, for example extract it into variables with port_name, inception_date, frequency = parts, etc.
The .replace('\\', os.sep) there is used to "normalize" Windows-style backslash-separated paths into whatever is the convention of the system the code is being run on (i.e. forward slashes on anything but Windows :) )
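The question asks for the date as "2016-07-01" rather than "20160701". One way to get there, sketched here with the standard datetime module, is to parse the middle part and format it again. The function name parse_portfolio_filename is made up for illustration, and ntpath is used instead of os.path only so that backslashes are treated as separators on any operating system.

```python
import ntpath  # Windows flavour of os.path, importable on every platform
from datetime import datetime


def parse_portfolio_filename(path):
    """Split 'Name_YYYYMMDD_F.csv' into (name, 'YYYY-MM-DD', frequency)."""
    stem = ntpath.splitext(ntpath.basename(path))[0]
    port_name, raw_date, frequency = stem.split('_')
    # strptime raises ValueError if the middle part is not a valid date
    inception_date = datetime.strptime(raw_date, '%Y%m%d').strftime('%Y-%m-%d')
    return port_name, inception_date, frequency


print(parse_portfolio_filename(r'Portfolios\ConsDisc_20160701_Q.csv'))
```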

import os
def parse_filename(path):
    filename = os.path.basename(path)
    filename_no_ext = os.path.splitext(filename)[0]
    return filename_no_ext.split("_")
path = r"Portfolios\ConsDisc_20160701_Q.csv"
portName, inceptionDate, frequency = parse_filename(path)

Here is an alternative in case you want to store them in a dictionary:
import re
str1 = r"E:...\Portfolios\ConsDisc_20160701_Q.csv"
re.search(r'Portfolios\\(?P<portName>.*)_(?P<inceptionDate>.*)_(?P<frequency>.)', str1).groupdict()
# result
# {'portName': 'ConsDisc', 'inceptionDate': '20160701', 'frequency': 'Q'}

How to cut and limit a string size at the same time?

Here is an example of what I need.
Suppose that we have the following string:
str = "/home/user/folder/MyVeryLongFileName.foo"
I have multiple operations to do on this one:
Remove the path (assuming I have its length):
str = str[path_length:]
Remove the extension (always 4 characters in my case):
str = str[path_length:-4]
So, right now my string looks like MyVeryLongFileName
Now I would like to limit its size to 15 characters.
Is it possible to do it in the same expression, or do I have to do it after the two previous operations?

If you want only the first 15 characters, you can slice the string again, like this:
file_name[path_length:-4][:15]
If you really are dealing with file names, you might want to go with
>>> file_name = "/home/user/folder/MyVeryLongFileName.foo"
>>> import os
>>> print(os.path.split(file_name)[1].rpartition(".")[0][:15])
MyVeryLongFileN
Or:
>>> print(os.path.basename(file_name).rpartition(".")[0][:15])
MyVeryLongFileN
Also, it would be better to use splitext to strip the extension, like this:
>>> from os.path import basename, splitext
>>> print(splitext(basename(file_name))[0][:15])
MyVeryLongFileN

You can get the file name with this:
>>> print(str.split('/')[-1])
MyVeryLongFileName.foo
Remove the extension with (this splits at the first dot anywhere in the path, so it only works if no directory name contains one):
>>> print(str.split('.')[0])
/home/user/folder/MyVeryLongFileName
Limit the file name to 15 characters:
>>> print(str.split('/')[-1][:15])
MyVeryLongFileN
That being said, you can always use the bash utilities to extract this information: basename gets the file name and dirname gets the path. See Extract filename and extension in bash for more info.

I would do this:
>>> from os.path import splitext, basename
>>> apath = "/home/user/folder/MyVeryLongFileName.foo"
>>> splitext(basename(apath))[0][:15]
'MyVeryLongFileN'
splitext separates the file extension from the rest, and we apply it to the result of basename, which splits the path into the base file name and the rest of the path. Then we can cut down the remaining string. I would definitely use these functions because they are much more reliable than manual slicing.
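To illustrate why the os.path functions are the more reliable choice, here is a small sketch that wraps them in a helper and compares the result with a naive split on "." for a path whose directory name contains a dot. The helper name short_name and the second example path are assumptions made for the illustration.

```python
from os.path import basename, splitext


def short_name(path, limit=15):
    """File name without directory or extension, cut to at most `limit` characters."""
    return splitext(basename(path))[0][:limit]


plain = "/home/user/folder/MyVeryLongFileName.foo"
dotted = "/home/user.name/folder/MyVeryLongFileName.foo"  # hypothetical path

print(short_name(plain))
print(short_name(dotted))
# The naive approach cuts at the first dot, which here is inside a directory name
naive = dotted.split('.')[0]
print(naive)
```

Note that splitext only removes the last extension, so a name such as archive.tar.gz would keep its ".tar" part.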

How to get only the last part of a path in Python?

In python, suppose I have a path like this:
/folderA/folderB/folderC/folderD/
How can I get just the folderD part?

Use os.path.normpath, then os.path.basename:
>>> import os
>>> os.path.basename(os.path.normpath('/folderA/folderB/folderC/folderD/'))
'folderD'
The first strips off any trailing slashes, the second gives you the last part of the path. Using only basename gives everything after the last slash, which in this case is ''.
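The difference between the two calls can be seen side by side in this short sketch. It uses posixpath (the POSIX flavour of os.path) only so that the result is the same on every operating system; the variable names are illustrative.

```python
import posixpath  # POSIX flavour of os.path, importable on every platform

path = '/folderA/folderB/folderC/folderD/'

# basename alone returns what follows the last slash, which is empty here
without_normpath = posixpath.basename(path)
# normpath removes the trailing slash first, so basename sees 'folderD'
with_normpath = posixpath.basename(posixpath.normpath(path))

print(repr(without_normpath))
print(repr(with_normpath))
```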

With python 3 you can use the pathlib module (pathlib.PurePath for example):
>>> import pathlib
>>> path = pathlib.PurePath('/folderA/folderB/folderC/folderD/')
>>> path.name
'folderD'
If you want the last folder name where a file is located:
>>> path = pathlib.PurePath('/folderA/folderB/folderC/folderD/file.py')
>>> path.parent.name
'folderD'

You could do
>>> import os
>>> os.path.basename('/folderA/folderB/folderC/folderD')
'folderD'
UPDATE1: If you give this approach /folderA/folderB/folderC/folderD/xx.py, it returns xx.py as the basename, which is probably not what you want. So you could do this:
>>> import os
>>> path = "/folderA/folderB/folderC/folderD"
>>> if os.path.isdir(path):
...     dirname = os.path.basename(path)
...
UPDATE2: As lars pointed out, here is a change to accommodate a trailing '/':
>>> from os.path import normpath, basename
>>> basename(normpath('/folderA/folderB/folderC/folderD/'))
'folderD'

Here is my approach:
>>> import os
>>> print(os.path.basename(
...     os.path.dirname('/folderA/folderB/folderC/folderD/test.py')))
folderD
>>> print(os.path.basename(
...     os.path.dirname('/folderA/folderB/folderC/folderD/')))
folderD
>>> print(os.path.basename(
...     os.path.dirname('/folderA/folderB/folderC/folderD')))
folderC

I was searching for a way to get the name of the last folder in which a file is located, and I just used split twice to get the right part. It's not what the question asks, but Google sent me here.
import os
pathname = "/folderA/folderB/folderC/folderD/filename.py"
head, tail = os.path.split(os.path.split(pathname)[0])
print(head + " " + tail)
# -> /folderA/folderB/folderC folderD

I like the parts method of Path for this:
grandparent_directory, parent_directory, filename = Path(export_filename).parts[-3:]
log.info(f'{t: <30}: {num_rows: >7} Rows exported to {grandparent_directory}/{parent_directory}/{filename}')

If you use the standard-library pathlib module, it's really simple.
>>> from pathlib import Path
>>> your_path = Path("/folderA/folderB/folderC/folderD/")
>>> your_path.name
'folderD'
Suppose you have the path to a file in folderD.
>>> from pathlib import Path
>>> your_path = Path("/folderA/folderB/folderC/folderD/file.txt")
>>> your_path.name
'file.txt'
>>> your_path.parent.name
'folderD'

In my current projects I often pass trailing parts of a path to a function, and therefore use the pathlib module. To get the n-th part in reverse order, I use:
from typing import Union
from pathlib import Path
def get_single_subpath_part(base_dir: Union[Path, str], n: int) -> str:
    if n == 0:
        return Path(base_dir).name
    for _ in range(n):
        base_dir = Path(base_dir).parent
    return base_dir.name
path = "/folderA/folderB/folderC/folderD/"
# for getting the last part:
print(get_single_subpath_part(path, 0))
# yields "folderD"
# for the second last
print(get_single_subpath_part(path, 1))
# yields "folderC"
Furthermore, to get the last n + 1 parts of a path as a single path, I use:
from typing import Union
from pathlib import Path
def get_n_last_subparts_path(base_dir: Union[Path, str], n: int) -> Path:
    return Path(*Path(base_dir).parts[-n-1:])
path = "/folderA/folderB/folderC/folderD/"
# for getting the last part:
print(get_n_last_subparts_path(path, 0))
# yields a `Path` object of "folderD"
# for second last and last part together
print(get_n_last_subparts_path(path, 1))
# yields a `Path` object of "folderC/folderD"
Note that this function returns a Path object, which can easily be converted to a string (e.g. str(path)).

path = "/folderA/folderB/folderC/folderD/"
# strip the trailing slash first; otherwise the last element is an empty string
last = path.rstrip('/').split('/').pop()

path = "/folderA/folderB/folderC/folderD/"
# [-2] because the trailing slash leaves an empty string as the last element
print(path.split("/")[-2])

Recursive splitting of path name (in Python)?

I feel that there is (should be?) a Python function out there that recursively splits a path string into its constituent files and directories (beyond basename and dirname). I've written one but since I use Python for shell-scripting on 5+ computers, I was hoping for something from the standard library or simpler that I can use on-the-fly.
import os
def recsplit(x):
    if type(x) is str: return recsplit(os.path.split(x))
    else: return (x[0]=='' or x[0] == '.' or x[0]=='/') and x[1:] or \
        recsplit(os.path.split(x[0]) + x[1:])
>>> print(recsplit('main/sub1/sub2/sub3/file'))
('main', 'sub1', 'sub2', 'sub3', 'file')
Any leads/ideas? ~Thanks~

UPDATE: After all the mucking about with altsep, the currently selected answer doesn't even split on backslashes.
>>> import re, os.path
>>> seps = os.path.sep
>>> if os.path.altsep:
...     seps += os.path.altsep
...
>>> seps
'\\/'
>>> somepath = r"C:\foo/bar.txt"
>>> print(re.split('[%s]' % (seps,), somepath))
['C:\\foo', 'bar.txt'] # Whoops!! it was splitting using [\/] same as [/]
>>> print(re.split('[%r]' % (seps,), somepath))
['C:', 'foo', 'bar.txt'] # after fixing it
>>> print(re.split('[%r]' % seps, somepath))
['C:', 'foo', 'bar.txt'] # removed redundant cruft
>>>
Now back to what we ought to be doing:
(end of update)
1. Consider carefully what you are asking for -- you may get what you want, not what you need.
If you have relative paths
r"./foo/bar.txt" (unix) and r"C:foo\bar.txt" (windows)
do you want
[".", "foo", "bar.txt"] (unix) and ["C:foo", "bar.txt"] (windows)
(do notice the C:foo in there) or do you want
["", "CWD", "foo", "bar.txt"] (unix) and ["C:", "CWD", "foo", "bar.txt"] (windows)
where CWD is the current working directory (system-wide on unix, that of C: on windows)?
2. You don't need to faff about with os.path.altsep -- os.path.normpath() will make the separators uniform, and tidy up other weirdnesses like foo/bar/zot/../../whoopsy/daisy/somewhere/else
Solution step 1: unkink your path with one of os.path.normpath() or os.path.abspath().
Step 2: doing unkinked_path.split(os.path.sep) is not a good idea. You should pull it apart with os.path.splitdrive(), then use multiple applications of os.path.split().
Here are some examples of what would happen in step 1 on windows:
>>> os.path.abspath(r"C:/hello\world.txt")
'C:\\hello\\world.txt'
>>> os.path.abspath(r"C:hello\world.txt")
'C:\\Documents and Settings\\sjm_2\\hello\\world.txt'
>>> os.path.abspath(r"/hello\world.txt")
'C:\\hello\\world.txt'
>>> os.path.abspath(r"hello\world.txt")
'C:\\Documents and Settings\\sjm_2\\hello\\world.txt'
>>> os.path.abspath(r"e:hello\world.txt")
'E:\\emoh_ruo\\hello\\world.txt'
>>>
(the current drive is C, the CWD on drive C is \Documents and Settings\sjm_2, and the CWD on drive E is \emoh_ruo)
I'd like to suggest that you write step 2 without the conglomeration of and and or that you have in your example. Write code as if your eventual replacement knows where you live and owns a chainsaw :-)
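For what it's worth, steps 1 and 2 might look something like the sketch below. It is one possible reading of the advice rather than a definitive implementation: it uses ntpath (the Windows flavour of os.path, importable on any platform) so the Windows examples behave the same everywhere, it uses normpath rather than abspath so the result does not depend on the current directory, and the function name split_all is hypothetical.

```python
import ntpath  # Windows flavour of os.path, importable on every platform


def split_all(path):
    """Split a path into drive, root separator (if any) and components."""
    # Step 1: unkink the path. Step 2: take the drive off, then split repeatedly.
    drive, rest = ntpath.splitdrive(ntpath.normpath(path))
    parts = []
    while rest:
        head, tail = ntpath.split(rest)
        if head == rest:  # only the root separator is left
            parts.append(head)
            break
        if tail:
            parts.append(tail)
        rest = head
    parts.reverse()
    if drive:
        parts.insert(0, drive)
    return parts


print(split_all(r"C:/hello\world.txt"))
print(split_all(r"C:hello\world.txt"))
```

Keeping the root separator as its own element preserves the difference between the absolute C:\hello and the drive-relative C:hello that point 1 warns about.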

Use this:
import os
def recSplitPath(path):
    elements = []
    while ((path != '/') and (path != '')):
        path, tail = os.path.split(path)
        elements.insert(0, tail)
    return elements
This turns /for/bar/whatever into ['for', 'bar', 'whatever']

import os
path = 'main/sub1/sub2/sub3/file'
path.split(os.path.sep)
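Splitting on os.path.sep only handles the current platform's primary separator. If Python 3.4 or later is available, the parts attribute of pathlib's pure path classes gives the same kind of result straight from the standard library, as in this short sketch (the variable names are illustrative):

```python
from pathlib import PurePosixPath, PureWindowsPath

# Pure paths never touch the file system; they only parse the string
posix_parts = PurePosixPath('main/sub1/sub2/sub3/file').parts
# The Windows flavour accepts both kinds of separator and keeps the drive root
windows_parts = PureWindowsPath(r'C:\foo/bar.txt').parts

print(posix_parts)
print(windows_parts)
```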
