Python - Extract everything in a filepath, after a certain directory - python

Let's say I have a folder like this.
/home/user/dev/Project/media/image_dump/images/02_car_folder
Everything after the media directory should be kept. The remaining should be removed.
/media/image_dump/images/02_car_folder
I was originally doing it this way, but as more subdirectories were added to different folders, it started generating invalid filepaths:
split_absolute = [os.sep.join(os.path.normpath(y).split(os.sep)[-2:]) for y in absolute_path]
The problem this causes is that once you start going deeper, the media path is cut out of the filepath altogether.
So if I went into
media/image_dump/images/02_car_folder/
The filepath now becomes this, when it needs to include everything up to /media.
/images/02_car_folder
What are some ways to actually handle this? I won't know what the users' filepaths leading up to media will be, but I know that everything after media should be kept, no matter how deep their folders go.

I think you can achieve what you want quite easily using Path.parts:
from pathlib import Path
path = "/home/user/dev/Project/media/image_dump/images/02_car_folder"
parts = Path(path).parts
stripped_path = Path(*parts[parts.index("media"):])
Result:
>>> print(stripped_path)
media/image_dump/images/02_car_folder
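If the marker directory might be missing, or the leading slash from the question's desired output should be kept, the same Path.parts idea can be wrapped in a small helper. This is only a sketch: keep_from and its marker parameter are names made up here, and PurePosixPath is used so that the example behaves the same on any OS.

```python
from pathlib import PurePosixPath


def keep_from(path, marker="media"):
    """Return the path from the first `marker` component onward, rooted at '/'.

    Returns None when `marker` is not one of the path components.
    """
    parts = PurePosixPath(path).parts
    if marker not in parts:
        return None
    return PurePosixPath("/", *parts[parts.index(marker):])


print(keep_from("/home/user/dev/Project/media/image_dump/images/02_car_folder"))
# /media/image_dump/images/02_car_folder
print(keep_from("/home/user/dev/Project/other"))
# None
```

Because whole path components are compared, a folder such as multimedia is not mistaken for media.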

Actually, you don't need any path-specific libraries.
Just work with strings:
Note: the weak point of working with paths as strings is that you need to handle many edge cases yourself (for example, if the path is media/blahblah/blahblah2 or /blahblah/blahblah2/media). pathlib handles these cases out of the box.
import os
full_path1 = "/home/user/dev/Project/media/image_dump/images/02_car_folder"
full_path2 = "/home/user/dev/Project/media/image_dump/media/images/02_car_folder"
separator_dir = os.path.sep + "media" + os.path.sep
print(f'Separate by {separator_dir}')
if separator_dir in full_path1:
    separated_path1 = os.path.sep + separator_dir.join(full_path1.split(separator_dir)[1:])
else:
    separated_path1 = full_path1
if separator_dir in full_path2:
    separated_path2 = os.path.sep + separator_dir.join(full_path2.split(separator_dir)[1:])
else:
    separated_path2 = full_path2
print(f'Full path 1 is {full_path1}')
print(f'Full path 2 is {full_path2}')
print(f'Separated path 1 is {separated_path1}')
print(f'Separated path 2 is {separated_path2}')
The first path has one media folder.
The second path has two media folders, but only the first one is used to cut the path.
Separate by /media/
Full path 1 is /home/user/dev/Project/media/image_dump/images/02_car_folder
Full path 2 is /home/user/dev/Project/media/image_dump/media/images/02_car_folder
Separated path 1 is /image_dump/images/02_car_folder
Separated path 2 is /image_dump/media/images/02_car_folder
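Note that the output above drops media itself. If, as in the question, the result should start at /media, a variant based on str.partition could look like the following. It is a sketch under stated assumptions: a hard-coded forward slash as the separator, a cut at the first /media/, and a function name made up for illustration.

```python
def keep_from_media(full_path, marker="/media/"):
    """Cut everything before the first `marker`, keeping the marker itself."""
    _head, found, tail = full_path.partition(marker)
    # `found` is an empty string when the marker does not occur at all,
    # in which case the path is returned unchanged.
    return marker + tail if found else full_path


print(keep_from_media("/home/user/dev/Project/media/image_dump/images/02_car_folder"))
# /media/image_dump/images/02_car_folder
print(keep_from_media("/home/user/dev/Project/media/image_dump/media/images/02_car_folder"))
# /media/image_dump/media/images/02_car_folder
```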

You could also use a regex, concise and easy:
path = '/home/user/dev/Project/media/image_dump/images/02_car_folder'
import re
re.search('/media/.*', path).group(0)
Output: '/media/image_dump/images/02_car_folder'
If the presence of media is unsure:
m = re.search('/media/.*', path)
m.group(0) if m else None # or any default you want
If you want the first / to be optional if media is at the beginning, use '(?:/|^)media/.*'
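Putting the optional-slash pattern and the None fallback together gives something like the sketch below. The function name and the default parameter are made up for illustration.

```python
import re

# Matches "media/" either at the very start or right after a slash,
# then everything up to the end of the line.
MEDIA_RE = re.compile(r'(?:/|^)media/.*')


def from_media(path, default=None):
    """Return the part of `path` starting at media/, or `default` if absent."""
    match = MEDIA_RE.search(path)
    return match.group(0) if match else default


print(from_media('/home/user/dev/Project/media/image_dump/images/02_car_folder'))
# /media/image_dump/images/02_car_folder
print(from_media('media/image_dump'))
# media/image_dump
print(from_media('/home/user/multimedia/clip'))
# None
```

Requiring a slash (or the start of the string) before media keeps a folder such as multimedia from matching.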

Related

Why does root returned from os.walk() contain / as directory separator but os.sep (or os.path.sep) return \ on Win10?

Why does the root element returned from os.walk() show / as the directory separator but os.sep (or os.path.sep) shows \ on Win10?
I'm just trying to create the complete path for a set of files in a folder as follows:
import os
base_folder = "c:/data/MA Maps"
for root, dirs, files in os.walk(base_folder):
    for f in files:
        if f.endswith(".png") and f.find("_N") != -1:
            print(os.path.join(root, f))
print(os.path.sep)
Here's what I get as an output:
c:/data/MA Maps\Map_of_Massachusetts_Nantucket_County.png
c:/data/MA Maps\Map_of_Massachusetts_Norfolk_County.png
\
I understand that some of python's library functions (like open()) will work with mixed path separators (at least on Windows) but relying on that hack really can't be trusted across all libraries. It just seems like the items returned from os.walk() and os.path (.sep or .join()) should yield consistent results based on the operating system being used. Can anyone explain why this inconsistency is happening?
P.S. - I know there is a more consistent library for working with file paths (and lots of other file manipulation) called pathlib that was introduced in python 3.4 and it does seem to fix all this. If your code is being used in 3.4 or beyond, is it best to use pathlib methods to resolve this issue? But if your code is targeted for systems using python before 3.4, what is the best way to address this issue?
Here's a good basic explanation of pathlib: Python 3 Quick Tip: The easy way to deal with file paths on Windows, Mac and Linux
Here's my code & result using pathlib:
import os
from pathlib import Path
# All of this should work properly for any OS. I'm running Win10.
# You can even mix up the separators used (i.e."c:\data/MA Maps") and pathlib still
# returns the consistent result given below.
base_folder = "c:/data/MA Maps"
for root, dirs, files in os.walk(base_folder):
    # This changes the root path provided to one using the current operating system's
    # path separator (\ for Win10).
    root_folder = Path(root)
    for f in files:
        if f.endswith(".png") and f.find("_N") != -1:
            # The / operator, when used with a pathlib object, just concatenates
            # the path segments together using the current operating system path separator.
            print(root_folder / f)
c:\data\MA Maps\Map_of_Massachusetts_Nantucket_County.png
c:\data\MA Maps\Map_of_Massachusetts_Norfolk_County.png
This can even be done more succinctly using only pathlib and list comprehension (with all path separators correctly handled per OS used):
from pathlib import Path
base_folder = "c:/data/MA Maps"
path = Path(base_folder)
files = [item for item in path.iterdir() if item.is_file() and
         str(item).endswith(".png") and
         (str(item).find("_N") != -1)]
for file in files:
    print(file)
c:\data\MA Maps\Map_of_Massachusetts_Nantucket_County.png
c:\data\MA Maps\Map_of_Massachusetts_Norfolk_County.png
This is very Pythonic and at least I feel it is quite easy to read and understand. .iterdir() is really powerful and makes dealing with files and dirs reasonably easy and in a cross-platform way. What do you think?
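One difference worth noting: Path.iterdir() only lists the top-level folder, while os.walk() also descends into subfolders. If the recursive behaviour is wanted, Path.rglob() is one option. The sketch below uses a throwaway temporary directory and made-up file names purely so that it can run anywhere; find_pngs is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def find_pngs(base_folder, marker="_N"):
    """Recursively collect .png files whose name contains `marker`."""
    return [p for p in Path(base_folder).rglob("*.png") if marker in p.name]


# Demo on a temporary tree (the file names are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "Map_of_Nantucket.png").touch()
    (root / "Map_of_Essex.png").touch()
    (root / "sub" / "Map_of_Norfolk.png").touch()
    found_names = sorted(p.name for p in find_pngs(root))

print(found_names)
# ['Map_of_Nantucket.png', 'Map_of_Norfolk.png']
```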
The os.walk function always yields the initial part of the dirpath unchanged from what you pass in to it. It doesn't try to normalize the separators itself, it just keeps what you've given it. It does use the system-standard separators for the rest of the path, as it combines each subdirectory's name to the root directory with os.path.join. You can see the current version of the implementation of the os.walk function in the CPython source repository.
One option for normalizing the separators in your output is to normalize the base path you pass in to os.walk, perhaps using pathlib. If you normalize the initial path, all the output should use the system path separators automatically, since it will be the normalized path that will be preserved through the recursive walk, rather than the non-standard one. Here's a very basic transformation of your first code block to normalize the base_folder using pathlib, while preserving all the rest of the code, in its simplicity. Whether it's better than your version using more of pathlib's features is a judgement call that I'll leave up to you.
import os
from pathlib import Path
base_folder = Path("c:/data/MA Maps") # this will be normalized when converted to a string
for root, dirs, files in os.walk(base_folder):
    for f in files:
        if f.endswith(".png") and f.find("_N") != -1:
            print(os.path.join(root, f))
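For code that has to run without pathlib, os.path.normpath() gives the same kind of normalization for the base folder. The sketch below calls ntpath (the Windows implementation behind os.path) directly so that the result is the same on any OS; on Windows itself, os.path.normpath(base_folder) would be the usual spelling.

```python
import ntpath  # Windows flavour of os.path, importable on every platform

base_folder = "c:/data/MA Maps"
normalized = ntpath.normpath(base_folder)
print(normalized)
# c:\data\MA Maps

# Mixed separators normalize to the same string.
mixed = ntpath.normpath("c:\\data/MA Maps")
print(mixed == normalized)
# True
```

Passing the normalized string to os.walk() on Windows means that every root it yields uses backslashes throughout.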

How to handle long path with spaces in Windows with Python

In the following code, I need to iterate through files in a directory with long names and spaces in paths.
def avg_dmg_acc(path):
    for d in os.listdir(path):
        sub_path = path + '/' + d
        if os.path.isdir(sub_path):
            if d.startswith('Front'):
                for f in os.listdir(sub_path):
                    fpath = r"%s" % sub_path + '/' + f
                    print(fpath)
                    print(os.path.exists(fpath))
                    df = pd.read_csv(fpath)
Then I ran the function providing the argument path:
path = r"./Mid-Con Master dd3d5c56-581c-42e0-acde-04e7feed3bb8/620138 91852327-e08d-4ed1-9774-383c888cb04e/Power End 2d41ba63-dfb9-4984-a5a5-153997fea43a"
avg_dmg_acc(path)
However, I am getting a "file does not exist" error:
File b'./Mid-Con Master dd3d5c56-581c-42e0-acde-04e7feed3bb8/620138 91852327-e08d-4ed1-9774-383c888cb04e/Power End 2d41ba63-dfb9-4984-a5a5-153997fea43a/Front c41f42ce-7158-4371-8cf6-82d1bcf04787/Damage Accumulation f907a97a-6d2d-40f6-ba02-0bc0599b773b.csv' does not exist
As you can see, I am already using r"path" since I read it somewhere it handles spaces in path. Also the path was constructed manually in this version, e.g. sub_path = path + '/' + d but I tried to use os.path.join(path, d) originally and it didn't work. I also tried Path from pathlib since it is the recommended way in Python 3 and still the same. At one point I tried to use os.path.abspath instead of the relative path I am using now with ./ but it still says file not exist.
Why is it not working? Is it because the path is too long or spaces are still not dealt with correctly?
It turns out that the length of the path is causing this problem. I shortened the name of the lowest-level folder one character at a time until os.path.exists(fpath) changed from False to True. I think I will need to rename all the folders before processing.
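Renaming is one way out. Another commonly used workaround on Windows, not tried in the original post, is the extended-length prefix \\?\, which lifts the traditional 260-character MAX_PATH limit for absolute, backslash-separated paths. A minimal sketch, assuming the path is already absolute and is a drive path rather than a network share; the helper name and the sample path are hypothetical:

```python
import ntpath  # Windows flavour of os.path, importable on every platform

# The prefix \\?\ written with escaped backslashes.
EXTENDED_PREFIX = "\\\\?\\"


def to_extended_length(absolute_path):
    """Add the extended-length prefix to an absolute Windows drive path."""
    if absolute_path.startswith(EXTENDED_PREFIX):
        return absolute_path
    # With this prefix Windows no longer converts forward slashes itself,
    # so normalize to backslashes first.
    return EXTENDED_PREFIX + ntpath.normpath(absolute_path)


print(to_extended_length("C:/data/Mid-Con Master/Power End/file.csv"))
# \\?\C:\data\Mid-Con Master\Power End\file.csv
```

On Windows, the relative path from the question would first need os.path.abspath() before being passed to such a helper.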

Calling for relative paths in Python

I have this below Python script that fetches a file from one location and copies that to another Target location. The below code works just fine if I define the paths with the absolute locations.
I would rather define the paths using variables, but when I do, the script does not appear to run. No error is thrown, but the code does not seem to be executed.
Code:
import os
import shutil
Path_from = r'/Users/user/Desktop/report'
Path_to = r'/Users/user/Desktop/report'
# `date` is assumed to be defined earlier in the script
for root, dirs, files in os.walk((os.path.normpath(Path_from)), topdown=False):
    for name in files:
        if name.endswith('{}.txt'.format(date)):
            print("Found")
            SourceFolder = os.path.join(root, name)
            shutil.copy2(SourceFolder, Path_to)
I want to change the code from
Path_from = r'/Users/user/Desktop/report'
to
base_path = /Users/user/Desktop/
Path_from = r'base_path/{}'.format(type)
I would recommend leaving all current-working-directory concerns to the user: if they want to specify a relative path, they can change into the directory it is relative to before invoking Python.
This is what just about every Linux tool and program does. They rarely take a 'base path'; instead, they leave the job of providing valid paths, relative to the current directory or absolute, to the user.
If you're dedicated to the idea of taking another parameter as the relative path, it should be pretty straightforward to do. Your example isn't valid Python syntax, but it's close:
$ cat t.py
from os.path import join
basepath="/tmp"
pathA = "fileA"
pathB = "fileB"
print(join(basepath,pathA))
print(join(basepath,pathB))
Note, however, that os.path.join only prepends basepath to relative paths: if an absolute path is provided at script execution time, basepath is discarded.
You could use a format instead,
basepath="/tmp"
pathA = "fileA"
pathB = "fileB"
print( "{}/{}".format(basepath, pathA) )
print( "{}/{}".format(basepath, pathB) )
But then you're assuming that you know how to join paths on the operating system in question, which is why os.path.join exists.
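To see how os.path.join treats relative and absolute inputs compared with plain formatting, here is a small sketch. It uses posixpath (the POSIX implementation behind os.path) so that the output is the same on every platform; the file names are placeholders.

```python
import posixpath  # POSIX flavour of os.path, importable on every platform

basepath = "/tmp"

# A relative name is appended to the base path.
joined_relative = posixpath.join(basepath, "fileA")
print(joined_relative)
# /tmp/fileA

# An absolute component discards everything before it.
joined_absolute = posixpath.join(basepath, "/etc/hosts")
print(joined_absolute)
# /etc/hosts

# Plain string formatting blindly keeps both parts.
formatted = "{}/{}".format(basepath, "/etc/hosts")
print(formatted)
# /tmp//etc/hosts
```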
If I'm reading this right, you could use pathlib, specifically pathlib.Path. The code would look like this:
from pathlib import Path
import re
import shutil
path_from = Path("/") / "Users" / "user" / "Desktop" # Better IMO
# path_from = Path("/Users/user/Desktop")
path_to = Path("/") / "Users" / "user" / "OtherDesktop"
datename = "whatever"
for x in path_from.glob("*.txt"):
    # stem is whatever is before the extension
    # ex. something.txt -> something
    if re.search(r"{}$".format(datename), x.stem):
        shutil.copy(str(path_from / x.name), str(path_to / x.name))

Count the number of folders in a directory and subdirectories

I've got a script that will accurately tell me how many files are in a directory and the subdirectories within. However, I'm also looking to identify how many folders there are within the same directory and its subdirectories...
My current script:
import os, getpass
from os.path import join, getsize
user = 'Copy of ' + getpass.getuser()
path = "C://Documents and Settings//" + user + "./"
folder_counter = sum([len(folder) for r, d, folder in os.walk(path)])
file_counter = sum([len(files) for r, d, files in os.walk(path)])
print ' [*] ' + str(file_counter) + ' Files were found and ' + str(folder_counter) + ' folders'
This code gives me the print out of: [*] 147 Files were found and 147 folders.
Meaning that the folder_counter isn't counting the right elements. How can I correct this so the folder_counter is correct?
Python 2.7 solution
For a single directory, you can also do:
import os
print len(os.walk('dir_name').next()[1])
which does not walk the whole tree and returns the number of directories directly inside the 'dir_name' directory.
Python 3.x solution
Since many people just want an easy and fast solution, without actually understanding the solution, I edited my answer to include the exact working code for Python 3.x.
In Python 3.x, we have the built-in next() function instead of the .next() method. Thus, the above snippet becomes:
import os
print(len(next(os.walk('dir_name'))[1]))
where dir_name is the directory whose subdirectories you want to count.
I think you want something like:
import os
files = folders = 0
for _, dirnames, filenames in os.walk(path):
  # ^ this idiom means "we won't be using this value"
    files += len(filenames)
    folders += len(dirnames)
print("{:,} files, {:,} folders".format(files, folders))
Note that this only iterates over os.walk once, which will make it much quicker on paths containing lots of files and directories. Running it on my Python directory gives me:
30,183 files, 2,074 folders
which exactly matches what the Windows folder properties view tells me.
Note that your current code calculates the same number twice because the only change is renaming one of the returned values from the call to os.walk:
folder_counter = sum([len(folder) for r, d, folder in os.walk(path)])
#                         ^ here            ^ and here
file_counter = sum([len(files) for r, d, files in os.walk(path)])
#                       ^ vs. here       ^ and here
Despite that name change, you're counting the same value (i.e. in both it's the third of the three returned values that you're using)! Python functions do not know what names (if any at all; you could do print list(os.walk(path)), for example) the values they return will be assigned to, and their behaviour certainly won't change because of it. Per the documentation, os.walk returns a three-tuple (dirpath, dirnames, filenames), and the names you use for that, e.g. whether:
for foo, bar, baz in os.walk(...):
or:
for all_three in os.walk(..):
won't change that.
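For Python 3, the single-pass counting described above might be written as follows. This is a sketch; count_tree is a made-up name, and the temporary directory only exists to give the demo something to count.

```python
import os
import tempfile


def count_tree(path):
    """Return (files, folders) found below `path` in a single os.walk pass."""
    files = folders = 0
    for _dirpath, dirnames, filenames in os.walk(path):
        files += len(filenames)    # third element: files in this directory
        folders += len(dirnames)   # second element: its direct subdirectories
    return files, folders


# Demo tree: <tmp>/a/b/ plus one file <tmp>/a/f.txt
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "a", "b"))
    open(os.path.join(tmp, "a", "f.txt"), "w").close()
    demo_counts = count_tree(tmp)

print("{:,} files, {:,} folders".format(*demo_counts))
# 1 files, 2 folders
```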
If interested only in the number of folders in /input/dir (and not in the subdirectories):
import os
folder_count = 0 # type: int
input_path = "/path/to/your/input/dir" # type: str
for folders in os.listdir(input_path):  # loop over all entries
    if os.path.isdir(os.path.join(input_path, folders)):  # if it's a directory
        folder_count += 1  # increment counter
print("There are {} folders".format(folder_count))
>>> import os
>>> len(list(os.walk('folder_name')))
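According to the os.walk documentation, the first element of each yielded tuple, dirpath, enumerates every directory. Note that this count includes the top directory itself as well as all of its subdirectories.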

How to process files from one subfolder to another in each directory using Python?

I have a basic file/folder structure on the Desktop where the "Test" folder contains "Folder 1", which in turn contains 2 subfolders:
An "Original files" subfolder which contains shapefiles (.shp).
A "Processed files" subfolder which is empty.
I am attempting to write a script which looks into each parent folder (Folder 1, Folder 2 etc) and if it finds an Original Files subfolder, it will run a function and output the results into the Processed files subfolder.
I made a simple diagram to showcase this where if Folder 1 contains the relevant subfolders then the function will run; if Folder 2 does not contain the subfolders then it's simply ignored:
I looked into the following posts but I'm still having some trouble:
python glob issues with directory with [] in name
Getting a list of all subdirectories in the current directory
How to list all files of a directory?
The following is the script, which seems to run happily. The annoying thing is that it doesn't produce an error, so this real noob can't see where the problem is:
import os, sys
from os.path import expanduser
home = expanduser("~")
for subFolders, files in os.walk(home + "\Test\\" + "\*Original\\"):
    if filename.endswith('.shp'):
        output = home + "\Test\\" + "\*Processed\\" + filename
        # do_some_function, output
I guess you mixed something up in your os.walk()-loop.
I just created a simple structure as shown in your question and used this code to get what you're looking for:
import os
root_dir = '/path/to/your/test_dir'
original_dir = 'Original files'
processed_dir = 'Processed files'
for path, subdirs, files in os.walk(root_dir):
    if original_dir in path:
        for file in files:
            if file.endswith('shp'):
                print('original dir: \t' + path)
                print('original file: \t' + path + os.path.sep + file)
                print('processed dir: \t' + os.path.sep.join(path.split(os.path.sep)[:-1]) + os.path.sep + processed_dir)
                print('processed file: ' + os.path.sep.join(path.split(os.path.sep)[:-1]) + os.path.sep + processed_dir + os.path.sep + file)
                print('')
I'd suggest using wildcards in a directory-crawling script only if you are REALLY sure what your directory tree looks like. I'd rather use the full names of the folders to search for, as in my script.
Update: Paths
Whenever you use paths, take care of your path separators - the slashes.
On Windows systems, the backslash is used for that:
C:\any\path\you\name
Most other systems use a normal, forward slash:
/the/path/you/want
In Python, a forward slash can be used directly, without any problem:
path_var = '/the/path/you/want'
...as opposed to backslashes. A backslash is a special character in Python strings. For example, it's used in the newline escape sequence: \n
To make clear that you mean a literal backslash rather than a special character, you either have to "escape" it using another backslash: '\\'. That makes a Windows path look like this:
path_var = 'C:\\any\\path\\you\\name'
...or you could mark the string as a "raw" string (or "literal string") with a preceding r. Note that escape sequences such as \n are then no longer interpreted in that string.
path_var = r'C:\any\path\you\name'
In your comment, you used the example root_dir = home + "\Test\\". The backslash in this string is used as a special character there, so python tries to make sense out of the backslash and the following character: \T. I'm not sure if that has any meaning in python, but \t would be converted to a tab-stop. Either way - that will not resolve to the path you want to use.
I'm wondering why your other example works. In "C:\Users\me\Test\\", the \U and \m should lead to similar errors. And you also mixed single and double backslashes.
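The difference between the escaped, raw, and unescaped spellings can be checked directly. A small sketch (the paths are just examples):

```python
escaped = 'C:\\any\\path\\you\\name'   # each backslash escaped by hand
raw = r'C:\any\path\you\name'          # raw string: backslashes kept as typed
print(escaped == raw)
# True

unescaped = 'C:\temp'                  # \t is read as a single TAB character
print(len(unescaped), len(r'C:\temp'))
# 6 7
print('\t' in unescaped)
# True
```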
That said...
As you take care of your OS path separators and experiment with new paths, also note that Python handles a lot of path-related things for you. For example, if your script reads a directory, as os.walk() does, on my Windows system the separators in the returned paths are already handled correctly. There's no need for me to check that; it's usually only hardcoded strings where you have to take care.
Also: the Python os.path module provides a lot of functions to handle paths, separators and so on. For example, os.path.sep (and os.sep, too) is set to the correct separator for the system Python is running on. You can also build paths using os.path.join().
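As an illustration of building paths with os.path functions instead of splitting and joining on the separator by hand, the "processed" location from the script above could be derived as follows. This is a sketch: the function name and the file name roads.shp are made up, and the folder names are the ones from the question.

```python
import os


def processed_path(original_file, processed_dir="Processed files"):
    """Map <parent>/Original files/<name> to <parent>/<processed_dir>/<name>."""
    original_folder, file_name = os.path.split(original_file)
    parent = os.path.dirname(original_folder)
    return os.path.join(parent, processed_dir, file_name)


source = os.path.join("Test", "Folder 1", "Original files", "roads.shp")
target = processed_path(source)
print(target)
# On Linux/macOS this prints: Test/Folder 1/Processed files/roads.shp
```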
And finally: The home directory
You use expanduser("~") to get the home path of the current user. That should work fine, but if you're using an old Python version, there could be a bug - see: expanduser("~") on Windows looks for HOME first
So check that the home path is resolved correctly, and then build your paths using the power of the os module :-)
Hope that helps!
