Recursive globbing using wildcard without adding extra directory level - python

I am trying to construct absolute file paths with glob using wildcards, using this code:
list_of_files = glob.glob(globable_file_path, recursive=True)
Now I feed this a list of globable file paths. When a path is formatted like this, it works:
\?\Z:\level_1\level_2\**\*12345*.pdf
But the above adds an extra directory level (the \**\ segment), and I cannot change the parent path. So I have been trying things like this:
\?\Z:\level_1\level_2\**12345*.pdf
\?\Z:\level_1\level_2\[**]12345*.pdf
\?\Z:\level_1\level_2\**[12345]*.pdf
But none of these work. How can I glob recursively with wildcards without adding an extra directory level?
The docs show this:
>>> glob.glob('**/*.txt', recursive=True)
['2.txt', 'sub/3.txt']
>>> glob.glob('./**/', recursive=True)
['./', './sub/']
Which suggests you need the extra \ for recursive searching. Is there a way to do it with pathlib or os.path, or some trick I do not know about?

So this works using pathlib:
from pathlib import Path

for path in Path(globable_file_path).rglob("*" + str(log1) + "*.pdf"):
    list_of_files.append(path.__str__())
I used the __str__ method to convert from WindowsPath to string.
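For reference, Path.rglob(pattern) behaves like Path.glob() with "**/" prepended to the pattern, so the recursion happens without splicing an extra \**\ segment into the middle of the parent path. A minimal sketch of the two equivalent approaches (the drive path and search term below are made up for illustration):
import glob
from pathlib import Path

# Illustrative parent path and search term, not the real ones from the question
parent = r"Z:\level_1\level_2"
search_term = "12345"

# pathlib: rglob handles the recursion itself
list_of_files = [str(p) for p in Path(parent).rglob(f"*{search_term}*.pdf")]

# glob: '**' with recursive=True matches zero or more directories, so files
# directly under the parent are included as well (see the docs example above)
same_files = glob.glob(rf"{parent}\**\*{search_term}*.pdf", recursive=True)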

Why does root returned from os.walk() contain / as directory separator but os.sep (or os.path.sep) return \ on Win10?

Why does the root element returned from os.walk() show / as the directory separator but os.sep (or os.path.sep) shows \ on Win10?
I'm just trying to create the complete path for a set of files in a folder as follows:
import os

base_folder = "c:/data/MA Maps"
for root, dirs, files in os.walk(base_folder):
    for f in files:
        if f.endswith(".png") and f.find("_N") != -1:
            print(os.path.join(root, f))
print(os.path.sep)
Here's what I get as an output:
c:/data/MA Maps\Map_of_Massachusetts_Nantucket_County.png
c:/data/MA Maps\Map_of_Massachusetts_Norfolk_County.png
\
I understand that some of Python's library functions (like open()) will work with mixed path separators (at least on Windows), but that hack really can't be relied on across all libraries. It just seems like the items returned from os.walk() and os.path (.sep or .join()) should yield consistent results based on the operating system being used. Can anyone explain why this inconsistency is happening?
P.S. - I know there is a more consistent library for working with file paths (and lots of other file manipulation) called pathlib, introduced in Python 3.4, and it does seem to fix all this. If your code is being used in 3.4 or beyond, is it best to use pathlib methods to resolve this issue? And if your code is targeted at systems running Python before 3.4, what is the best way to address it?
Here's a good basic explanation of pathlib: Python 3 Quick Tip: The easy way to deal with file paths on Windows, Mac and Linux
Here's my code & result using pathlib:
import os
from pathlib import Path

# All of this should work properly for any OS. I'm running Win10.
# You can even mix up the separators used (i.e. "c:\data/MA Maps") and pathlib still
# returns the consistent result given below.
base_folder = "c:/data/MA Maps"
for root, dirs, files in os.walk(base_folder):
    # This converts the root path provided into one using the current operating
    # system's path separator (\ for Win10).
    root_folder = Path(root)
    for f in files:
        if f.endswith(".png") and f.find("_N") != -1:
            # The / operator, when used with a pathlib object, concatenates the
            # path segments using the current operating system's path separator.
            print(root_folder / f)
c:\data\MA Maps\Map_of_Massachusetts_Nantucket_County.png
c:\data\MA Maps\Map_of_Massachusetts_Norfolk_County.png
This can even be done more succinctly using only pathlib and a list comprehension (with all path separators correctly handled for the OS used):
from pathlib import Path

base_folder = "c:/data/MA Maps"
path = Path(base_folder)
files = [item for item in path.iterdir()
         if item.is_file()
         and str(item).endswith(".png")
         and str(item).find("_N") != -1]
for file in files:
    print(file)
c:\data\MA Maps\Map_of_Massachusetts_Nantucket_County.png
c:\data\MA Maps\Map_of_Massachusetts_Norfolk_County.png
This is very Pythonic, and I feel it is quite easy to read and understand. .iterdir() is really powerful and makes dealing with files and dirs reasonably easy, in a cross-platform way. What do you think?
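A roughly equivalent, pattern-based variant of the same filter can also be written with Path.glob; the "*_N*.png" pattern below is just my shorthand for the endswith/find checks above:
from pathlib import Path

# Sketch: "*_N*.png" approximates the endswith(".png") / find("_N") filter
base_folder = Path("c:/data/MA Maps")
for file in base_folder.glob("*_N*.png"):
    print(file)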
The os.walk function always yields the initial part of the dirpath unchanged from what you pass in to it. It doesn't try to normalize the separators itself, it just keeps what you've given it. It does use the system-standard separators for the rest of the path, as it combines each subdirectory's name to the root directory with os.path.join. You can see the current version of the implementation of the os.walk function in the CPython source repository.
One option for normalizing the separators in your output is to normalize the base path you pass in to os.walk, perhaps using pathlib. If you normalize the initial path, all the output should use the system path separators automatically, since it will be the normalized path that will be preserved through the recursive walk, rather than the non-standard one. Here's a very basic transformation of your first code block to normalize the base_folder using pathlib, while preserving all the rest of the code, in its simplicity. Whether it's better than your version using more of pathlib's features is a judgement call that I'll leave up to you.
import os
from pathlib import Path

base_folder = Path("c:/data/MA Maps")  # this will be normalized when converted to a string
for root, dirs, files in os.walk(base_folder):
    for f in files:
        if f.endswith(".png") and f.find("_N") != -1:
            print(os.path.join(root, f))
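For code that has to run on Python versions before 3.4 (no pathlib), a sketch of the same idea using os.path.normpath, which converts the separators of the starting path to the OS-native ones before the walk begins:
import os

# Sketch: normalize the starting path so os.walk preserves native separators
base_folder = os.path.normpath("c:/data/MA Maps")  # 'c:\\data\\MA Maps' on Windows
for root, dirs, files in os.walk(base_folder):
    for f in files:
        if f.endswith(".png") and f.find("_N") != -1:
            print(os.path.join(root, f))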

Listing files with specific endings in folder as variable, using python

I am trying to list files with a specific ending ('.txt') in a folder that was set as a variable, using Python.
I tried to use glob.glob('userFolder/*.txt') to do it:
import os
import glob
userFolder='/homes/myFolder'
glob.glob('userFolder/*.txt')
I got an empty list.
The text userFolder in your glob() call is just part of the string value; it's not related to the variable with the same name. If it were, you could never use something like print or os or any other variable name directly in a string.
You could just use + to concatenate the variable value with the glob pattern:
text_files = glob.glob(userFolder + '/*.txt')
but the better method is to use os.path.join() to handle path construction:
files = glob.glob(os.path.join(userFolder, '*.txt'))
Another option is to use the pathlib module, which has its own glob support:
import pathlib
userFolder = pathlib.Path('/homes/myFolder')
text_files = userFolder.glob('*.txt')
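One thing worth noting: Path.glob() returns a generator rather than a list, so if you need an actual list you have to materialize it yourself, for example:
import pathlib

userFolder = pathlib.Path('/homes/myFolder')
# Path.glob() yields matches lazily; wrap it in list() to get a list
text_files = list(userFolder.glob('*.txt'))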

python import with absolute path

I have the following folder structure and want to find a good way to import Python modules.
project1/test/benchmark/benchmark_project1.py
#in benchmark_project1.py
from project1.test.benchmark import *
My question is how to get rid of project1, since it might be renamed to "project2" or something else. I want to use import with absolute path, but don't know a good way to achieve that.
You can use os.chdir() to change your directory before the import and then change it back afterward; this lets you point the import at the precise directory you want. You can use os.listdir() to get the list of all files in that directory and then simply index them: a loop will get all the modules in the folder, or the right index according to some pattern will give you a specific one. The glob module allows you to select files using shell-style wildcard patterns.
import os
import importlib

cwd = os.getcwd()
new_dir = 'project1/test/benchmark/'
list_dir = os.listdir(new_dir)  # Find all matching files
os.chdir(new_dir)
# Note: importing from the current directory like this relies on '' (the cwd)
# being on sys.path, as it is in an interactive session.
for name in list_dir:  # Import all of them (or pick specific ones)
    if name.endswith('.py'):
        # 'from <variable> import *' would look for a module literally named
        # "module", so import by the string name with importlib instead
        module = importlib.import_module(name[:-3])  # strip the '.py' extension
os.chdir(cwd)
Alternatively, you can add the location to your path instead of changing directories. Take a look at this question for some additional resources.
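A minimal sketch of that sys.path alternative, assuming the layout from the question (the module name benchmark_project1 is taken from the file shown above):
import sys
import importlib

# Put the benchmark directory on the import search path, then import by name
sys.path.insert(0, 'project1/test/benchmark')
benchmark = importlib.import_module('benchmark_project1')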

list of jpeg files in nested subdirectories

I use the following Python code to get a list of jpg files in nested subdirectories of a parent directory.
import glob2,os
all_header_files = glob2.glob(os.path.join('Path/to/parent/directory','/**/*.jpg'))
However, I get nothing. When I cd into the parent directory and use the following Python code instead, I do get the list of jpeg files.
import glob2
all_header_files = glob2.glob('./**/*.jpg')
How can I get the result with the absolute path (first version)?
You have an extra slash.
os.path.join will insert the filepath separators for you, so to get the correct directory you should think of it as:
join('Path/to/parent directory' , '**/*.jpg')
Even more accurately,
parent = os.path.join('Path', 'to', 'parent directory')
os.path.join(parent, '**/*.jpg')
If you are trying to use your Home directory, see os.path.expanduser
In [10]: import os, glob
In [11]: glob.glob(os.path.join('~', 'Downloads', "**/*.sh"))
Out[11]: []
In [12]: glob.glob(os.path.expanduser(os.path.join('~', 'Downloads', "**/*.sh")))
Out[12]:
['/Users/name/Downloads/dir/script.sh']
You should not join a component that has a leading slash, or you'll end up back at the root. You can debug by printing out the resulting path before passing it to glob.
Try to change your code like this (note the dot):
import glob2,os
all_header_files = glob2.glob(os.path.join('Path/to/parent directory','./**/*.jpg'))
os.path.join() joins paths in an intelligent way.
os.path.join('Path/to/anything', '/**/*.jpg')
resolves to '/**/*.jpg', since '/**/*.jpg' is an absolute path and everything before it is discarded.
Change the '/**/*.jpg' to '**/*.jpg' and it should work.
In cases like this, I recommend always trying out the result of a given function in the Python command line. At least, that is how I found the issue here.
The problem with the code you have posted lies in the use of os.path.join.
In the documentation it says for os.path.join(path, *paths):
If a component is an absolute path, all previous components are thrown away and joining continues from the absolute path component.
In your case, the component /**/*.jpg is an absolute path, as it starts with a /. Consequently your initial input /Path/to/parent directory is being truncated by the call to the join function. (https://docs.python.org/3.5/library/os.path.html#os.path.join)
I have locally tested the joining part with Python 3, and for me os.path.join(any_path, "/**/*.pdf") indeed returns the string '/**/*.pdf'.
The fix for this error is:
import glob2,os
all_header_files = glob2.glob(os.path.join('Path/to/parent directory','**/*.jpg'))
This returns the path 'Path/to/parent directory/**/*.jpg'
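As a side note, if glob2 isn't otherwise required, the standard library's glob has supported '**' since Python 3.5 when recursive=True is passed; a sketch with the same illustrative parent path:
import glob
import os

# Sketch: recursive matching with the standard library instead of glob2
parent = 'Path/to/parent directory'
all_header_files = glob.glob(os.path.join(parent, '**', '*.jpg'), recursive=True)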

How can I list the contents of a directory in Python?

Can’t be hard, but I’m having a mental block.
import os
os.listdir("path") # returns list
One way:
import os
os.listdir("/home/username/www/")
Another way:
import glob
glob.glob("/home/username/www/*")
Examples found here.
The glob.glob method above will not list hidden files.
Since I originally answered this question years ago, pathlib has been added to Python. My preferred way to list a directory now usually involves the iterdir method on Path objects:
from pathlib import Path
print(*Path("/home/username/www/").iterdir(), sep="\n")
os.walk can be used if you need recursion:
import os

start_path = '.'  # current directory
for path, dirs, files in os.walk(start_path):
    for filename in files:
        print(os.path.join(path, filename))
glob.glob or os.listdir will do it.
The os module handles all that stuff.
os.listdir(path)
Return a list containing the names of the entries in the directory given by path.
The list is in arbitrary order. It does not include the special entries '.' and
'..' even if they are present in the directory.
Availability: Unix, Windows.
In Python 3.4+, you can use the new pathlib package:
from pathlib import Path
for path in Path('.').iterdir():
    print(path)
Path.iterdir() returns an iterator, which can be easily turned into a list:
contents = list(Path('.').iterdir())
Since Python 3.5, you can use os.scandir.
The difference is that it returns directory entries, not just names. On some OSes like Windows, this means you don't have to call os.path.isdir/isfile to know whether an entry is a file, and that saves CPU time because the stat information is already collected while scanning the directory on Windows.
Example to list a directory and print files bigger than max_value bytes:
for dentry in os.scandir("/path/to/dir"):
    if dentry.stat().st_size > max_value:
        print("{} is biiiig".format(dentry.name))
(read an extensive performance-based answer of mine here)
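As a small illustration of that point (the directory path is made up), a DirEntry exposes is_file()/is_dir() directly, so no separate os.path.isdir/isfile call is needed per entry:
import os

# Sketch: DirEntry caches type information on most platforms, so is_file()
# and is_dir() usually avoid an extra system call per entry
with os.scandir(".") as entries:
    for entry in entries:
        kind = "file" if entry.is_file() else "directory"
        print("{}: {}".format(entry.name, kind))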
The code below will recursively list directories and the files within them. Another option is os.walk:
def print_directory_contents(sPath):
    import os
    for sChild in os.listdir(sPath):
        sChildPath = os.path.join(sPath, sChild)
        if os.path.isdir(sChildPath):
            print_directory_contents(sChildPath)
        else:
            print(sChildPath)
