I have a condition where I need to compare the relative name/path of the file (string) with another path that has wildcards:
files list:
aaa/file1.txt
bbb/ccc/file2.txt
accepted files by the wildcard:
aaa/*
bbb/**/*
I need to pick only those files that match any of the wildcard masks.
aaa/file1.txt equals aaa/* => True
ddd/file3.txt equals aaa/* or bbb/**/* => False
Is it possible to do this with glob or any other module?
edit: I removed all unnecessary details from the question after the discussion.
Answering my own question:
The wcmatch module does what I need:
from wcmatch import pathlib
print(pathlib.PurePath('a/a/file1.txt').globmatch('a/**/*', flags=pathlib.GLOBSTAR))
print(pathlib.PurePath('b/file2.txt').globmatch('b/**/*', flags=pathlib.GLOBSTAR))
print(pathlib.PurePath('c/file3.txt').globmatch('c/*', flags=pathlib.GLOBSTAR))
print(pathlib.PurePath('d/d/file4.txt').globmatch('d/*', flags=pathlib.GLOBSTAR))
True
True
True
False
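If installing wcmatch is not an option, the same check can be approximated with the standard library. The following is only a minimal sketch, assuming forward-slash relative paths and that ** stands for zero or more whole directory levels; match_parts and matches_any are hypothetical helper names, not part of any library.

```python
from fnmatch import fnmatchcase


def match_parts(parts, pats):
    """Return True if the path segments match the pattern segments."""
    if not pats:
        # Pattern exhausted: it is a match only if the path is exhausted too.
        return not parts
    head, rest = pats[0], pats[1:]
    if head == "**":
        # '**' may consume zero or more leading path segments.
        return any(match_parts(parts[i:], rest) for i in range(len(parts) + 1))
    # Any other segment must match exactly one path segment.
    return bool(parts) and fnmatchcase(parts[0], head) and match_parts(parts[1:], rest)


def matches_any(path, masks):
    """Return True if the relative path matches at least one mask."""
    parts = path.split("/")
    return any(match_parts(parts, mask.split("/")) for mask in masks)


files = ["aaa/file1.txt", "bbb/ccc/file2.txt", "ddd/file3.txt"]
masks = ["aaa/*", "bbb/**/*"]
accepted = [f for f in files if matches_any(f, masks)]
print(accepted)  # ['aaa/file1.txt', 'bbb/ccc/file2.txt']
```

Matching segment by segment keeps a single * from crossing a slash, which plain fnmatch would allow.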
I am trying this code:
import glob
temp_path = glob.glob("/FileStorage/user/final/**/**/sample.json")[-1][5:]
sample = (spark.read.json(f"FileStorage:{temp_path}"))
However, when I run this in Databricks, I get this error:
IndexError: list index out of range
When I print glob.glob("/FileStorage/user/final/**/**/sample.json"), the result is an empty list.
The issue is that "/FileStorage/user/final/** /**/sample.json" is probably not the correct pathname for what you are trying to express. What you probably want is:
glob.glob("/FileStorage/user/final/**/sample.json", recursive=True)
You need to remove the space from the pathname and add recursive=True.
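As a small illustration of that advice, the sketch below builds a throwaway directory tree (the 2023/01 folder names are made up for the demo), runs a recursive glob, and checks that the result is non-empty before indexing it, which is what avoids the IndexError.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: <root>/2023/01/sample.json
    nested = os.path.join(root, "2023", "01")
    os.makedirs(nested)
    target = os.path.join(nested, "sample.json")
    with open(target, "w") as fh:
        fh.write("{}")

    # '**' only descends into subdirectories when recursive=True is passed.
    pattern = os.path.join(root, "**", "sample.json")
    matches = sorted(glob.glob(pattern, recursive=True))

    # glob returns an empty list when nothing matches, so guard the [-1].
    latest = matches[-1] if matches else None
    found = latest == target

print(found)  # True
```

glob does not promise any ordering, so the results are sorted before the last one is taken.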
import glob
path = r"C:User/FileStorage/user/final/*"  # path to directory + '*'
for file in glob.iglob(path, recursive=True):
    print(file)
    # if you want to keep only JSON files
    if file.endswith(".json"):
        print(file)
I think that the issue is with the '**'.
If you meant to match exactly one directory level per wildcard, use a single '*' for each level ("/FileStorage/user/final/*/*/sample.json").
If you wanted the search to be recursive or to include hidden files, remove the space after the first '**' and pass recursive=True or include_hidden=True (available since Python 3.11), as needed, when calling glob.glob. For example, glob.glob("/FileStorage/user/final/**/**/sample.json", include_hidden=True) will also return hidden files that are in this path.
If recursive is true, the pattern “**” will match any files and zero
or more directories, subdirectories and symbolic links to directories.
If the pattern is followed by an os.sep or os.altsep then files will
not match.
If include_hidden is true, “**” pattern will match hidden directories.
see documentation here:
https://docs.python.org/3/library/glob.html
Edit:
If this does not work, validate the path in your file system
Suppose I have the following files:
/path/to/file/a.png
/path/to/file/a_match_me.png
/path/to/file/a_dont_match_me.png
I want to match 2 files
/path/to/file/a.png
/path/to/file/a_match_me.png
but not match /path/to/file/a_dont_match_me.png
So I need a pattern, something like /path/to/file/a[zero or one occurrence of _match_me].png
Can I do this using the glob library in Python?
You would have a hard time doing this with the built-in Python glob library, but you could do this with a third party library.
The python library wcmatch can be used for the described case above. Full disclosure, I am the author of the mentioned library.
Below we use the GLOBSTAR flag (G) to match multiple folders with ** and the EXTGLOB flag (E) to use extended glob patterns such as !(), which excludes a file name pattern. We use the globfilter function, which can filter full path names with glob patterns. It is kind of like fnmatch's filter, but works on full paths.
from wcmatch import glob
files = [
"/path/to/file/a.png",
"/path/to/file/a_match_me.png",
"/path/to/file/a_dont_match_me.png"
]
print(glob.globfilter(files, '**/a!(_dont_match_me).png', flags=glob.G | glob.E))
Output
['/path/to/file/a.png', '/path/to/file/a_match_me.png']
You could also glob these files directly from the file system:
from wcmatch import glob
glob.glob('**/a!(_dont_match_me).png', flags=glob.G | glob.E)
Hopefully this helps.
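If a third-party dependency is not wanted, one standard-library workaround is to gather candidates loosely and then filter the file names with a regular expression. This is only a sketch of that idea applied to the list from the question; the regex expresses "a, optionally followed by _match_me, then .png".

```python
import os
import re

files = [
    "/path/to/file/a.png",
    "/path/to/file/a_match_me.png",
    "/path/to/file/a_dont_match_me.png",
]

# 'a', then zero or one occurrence of '_match_me', then '.png'
wanted = re.compile(r"a(_match_me)?\.png")

# fullmatch is applied to the file name only, not to the directory part.
selected = [f for f in files if wanted.fullmatch(os.path.basename(f))]
print(selected)  # ['/path/to/file/a.png', '/path/to/file/a_match_me.png']
```

On a real file system the files list could come from glob.glob('/path/to/file/a*.png') instead of being written out by hand.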
I need to invoke different functionality depending on whether a file name or a file path is passed,
e.g.:
python test.py test1 (invoke different function)
python test.py /home/sai/test1 (invoke different function)
I can get the argument from sys.argv[1], but I am not able to tell whether it is a file name or a file path.
It's tricky, since the name of a file is also a valid relative path, right?
Strictly speaking, you cannot differentiate them.
On the other hand, assuming you would like to detect an absolute path, or a relative path starting with a slash/backslash, you could use os.path.isabs(path). The docs say it checks whether the path starts with a slash on Unix, or with a backslash on Windows after chopping off a potential drive letter:
>>> import os
>>> os.path.isabs('C:\\folder\\name.txt')
True
>>> os.path.isabs('\\folder\\name.txt')
True
>>> os.path.isabs('name.txt')
False
However, this will fail with a relative path not beginning with a slash:
>>> os.path.isabs('folder\\name.txt')
False
The solution that would work with all cases mentioned above, not sensitive to relative paths with slashes or without them, would be to perform a comparison of the tail of path with the path itself using os.path.basename(path). If they are equal it's just a name:
>>> os.path.basename('C:\\folder\\name.txt') == 'C:\\folder\\name.txt'
False
>>> os.path.basename('\\folder\\name.txt') == '\\folder\\name.txt'
False
>>> os.path.basename('folder\\name.txt') == 'folder\\name.txt'
False
>>> os.path.basename('name.txt') == 'name.txt'
True
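Put together, the basename comparison can drive the dispatch the question asks for. The sketch below is illustrative only: handle_name and handle_path are hypothetical stand-ins for the two pieces of functionality.

```python
import os


def handle_name(name):
    # Hypothetical handler for a bare file name.
    return "name:" + name


def handle_path(path):
    # Hypothetical handler for anything that has a directory component.
    return "path:" + path


def dispatch(arg):
    # A bare name is its own basename; a path with directories is not.
    if os.path.basename(arg) == arg:
        return handle_name(arg)
    return handle_path(arg)


print(dispatch("test1"))            # name:test1
print(dispatch("/home/sai/test1"))  # path:/home/sai/test1
```

In the script from the question, the argument would be passed as dispatch(sys.argv[1]).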
You can use isdir() and isfile():
File:
>>> os.path.isdir('a.txt')
False
>>> os.path.isfile('a.txt')
True
Dir:
>>> os.path.isfile('Doc')
False
>>> os.path.isdir('Doc')
True
Did you try
os.path.basename
or
os.path.dirname
Hello, I'm trying to do something like:
// 1. for x in glob.glob('/../../nodes/*/views/assets/js/*.js'):
// 2 .for x in glob.glob('/../../nodes/*/views/assets/js/*/*.js'):
print x
Is there anything I can do to search it recursively?
I already looked into "Use a Glob() to find files recursively in Python?", but os.walk doesn't accept wildcard folders like the one above between nodes and views, and the docs at http://docs.python.org/library/glob.html don't help much.
thanks
Caveat: This will also select any files matching the pattern anywhere beneath the root folder which is nodes/.
import os, fnmatch
def locate(pattern, root_path):
    for path, dirs, files in os.walk(os.path.abspath(root_path)):
        for filename in fnmatch.filter(files, pattern):
            yield os.path.join(path, filename)
As os.walk does not accept wildcards we walk the tree and filter what we need.
js_assets = [js for js in locate('*.js', '/../../nodes')]
The locate function yields an iterator of all files which match the pattern.
Alternative solution: You can try the extended glob which adds recursive searching to glob.
Now you can write a much simpler expression like:
fnmatch.filter( glob.glob('/../../nodes/*/views/assets/js/**/*'), '*.js' )
I answered a similar question here: fnmatch and recursive path match with `**`
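Since Python 3.5 the built-in glob module understands ** itself when recursive=True is passed, so the wildcard folder and the recursive part can go into one pattern. A minimal sketch, using a temporary tree whose folder and file names (blog, vendor, app.js and so on) are invented for the demo:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout mirroring nodes/*/views/assets/js/ from the question.
    js_dir = os.path.join(root, "nodes", "blog", "views", "assets", "js")
    other_dir = os.path.join(root, "nodes", "blog", "lib")
    os.makedirs(os.path.join(js_dir, "vendor"))
    os.makedirs(other_dir)
    for path in (
        os.path.join(js_dir, "app.js"),
        os.path.join(js_dir, "vendor", "lib.js"),
        os.path.join(other_dir, "skip.js"),  # outside views/assets/js
    ):
        open(path, "w").close()

    # '*' picks any node folder; '**' recurses below js/ (zero or more levels).
    pattern = os.path.join(root, "nodes", "*", "views", "assets", "js", "**", "*.js")
    found = sorted(
        os.path.relpath(p, js_dir) for p in glob.glob(pattern, recursive=True)
    )

print(found)  # app.js and vendor/lib.js, but not skip.js
```

Unlike walking everything beneath nodes/, this only returns files under the views/assets/js folders.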
You could use glob2 or formic, both available via easy_install or pip.
GLOB2
FORMIC
You can find them both mentioned here:
Use a Glob() to find files recursively in Python?
I use glob2 a lot, ex:
import glob2
files = glob2.glob(r'C:\Users\**\iTunes\**\*.mp4')
Why don't you split your wild-carded paths into multiple parts, like:
import glob, os

parent_path = glob.glob('/../../nodes/*')
for p in parent_path:
    child_paths = glob.glob(os.path.join(p, './views/assets/js/*.js'))
    for c in child_paths:
        pass  # do something with c
You can replace some of the above with a list of child assets that you want to retrieve.
Alternatively, if your environment provides the find command, that provides better support for this kind of task. If you're in Windows, there may be an analogous program.
Ant has a nice way to select groups of files, most handily using ** to indicate a directory tree. E.g.
**/CVS/* # All files immediately under a CVS directory.
mydir/mysubdir/** # All files recursively under mysubdir
More examples can be seen here:
http://ant.apache.org/manual/dirtasks.html
How would you implement this in python, so that you could do something like:
files = get_files("**/CVS/*")
for file in files:
    print(file)
=>
CVS/Repository
mydir/mysubdir/CVS/Entries
mydir/mysubdir/foo/bar/CVS/Entries
Sorry, this is quite a long time after your OP. I have just released a Python package which does exactly this - it's called Formic and it's available at the PyPI Cheeseshop. With Formic, your problem is solved with:
import formic
fileset = formic.FileSet(include="**/CVS/*", default_excludes=False)
for file_name in fileset.qualified_files():
    print(file_name)
There is one slight complexity: default_excludes. Formic, just like Ant, excludes CVS directories by default (as for the most part collecting files from them for a build is dangerous), so with the defaults the answer to the question would return no files. Setting default_excludes=False disables this behaviour.
As soon as you come across a **, you're going to have to recurse through the whole directory structure, so I think at that point, the easiest method is to iterate through the directory with os.walk, construct a path, and then check if it matches the pattern. You can probably convert to a regex by something like:
import os
import re

def glob_to_regex(pat, dirsep=os.sep):
    # Escape everything, then swap the escaped wildcards for regex equivalents.
    dirsep = re.escape(dirsep)
    regex = (re.escape(pat).replace("\\*\\*" + dirsep, ".*")
                           .replace("\\*\\*", ".*")
                           .replace("\\*", "[^%s]*" % dirsep)
                           .replace("\\?", "[^%s]" % dirsep))
    return re.compile(regex + "$")
(Though note that this isn't that fully featured - it doesn't support [a-z] style glob patterns for instance, though this could probably be added). (The first **/ match is to cover cases like **/CVS matching ./CVS, as well as having just ** to match at the tail.)
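The walk-and-match idea described above can be sketched as a get_files generator. This is an untuned illustration, assuming patterns are written with forward slashes and paths are compared relative to the starting directory; the translator repeats the approach of the function above with the separator fixed to "/", and the demo tree is built in a temporary directory.

```python
import os
import re
import tempfile


def glob_to_regex(pat, dirsep="/"):
    # Same translation idea as above, with '/' as the pattern separator.
    dirsep = re.escape(dirsep)
    regex = (re.escape(pat).replace("\\*\\*" + dirsep, ".*")
                           .replace("\\*\\*", ".*")
                           .replace("\\*", "[^%s]*" % dirsep)
                           .replace("\\?", "[^%s]" % dirsep))
    return re.compile(regex + "$")


def get_files(pattern, root="."):
    """Yield files under root whose relative path matches the pattern."""
    regex = glob_to_regex(pattern)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            rel = rel.replace(os.sep, "/")  # normalise to pattern separator
            if regex.match(rel):
                yield rel


with tempfile.TemporaryDirectory() as root:
    for rel in ("CVS/Repository",
                "mydir/mysubdir/CVS/Entries",
                "mydir/mysubdir/foo/bar/CVS/Entries",
                "mydir/other.txt"):
        full = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()
    result = sorted(get_files("**/CVS/*", root))

print(result)
```

For the demo tree this prints the three CVS entries and leaves out mydir/other.txt. It always walks the whole tree, which is the cost the two-phase approach below tries to avoid.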
However, obviously you don't want to recurse through everything below the current dir when not processing a ** pattern, so I think you'll need a two-phase approach. I haven't tried implementing the below, and there are probably a few corner cases, but I think it should work:
1. Split the pattern on your directory separator, i.e. pat.split('/') -> ['**','CVS','*']
2. Recurse through the directories, and look at the relevant part of the pattern for this level, i.e. n levels deep -> look at pat[n].
3. If pat[n] == '**', switch to the above strategy:
   - Reconstruct the pattern with dirsep.join(pat[n:])
   - Convert to a regex with glob_to_regex()
   - Recursively os.walk through the current directory, building up the path relative to the level you started at. If the path matches the regex, yield it.
4. If pat[n] is not '**' and it is the last element in the pattern, then yield all files/dirs matching glob.glob(os.path.join(curpath, pat[n]))
5. If pat[n] is not '**' and it is NOT the last element in the pattern, then for each directory, check if it matches (with glob) pat[n]. If so, recurse down through it, incrementing depth (so it will look at pat[n+1])
os.walk is your friend. Look at the example in the Python manual
(https://docs.python.org/2/library/os.html#os.walk) and try to build something from that.
To match "**/CVS/*" against a file name you get, you can do something like this:
import fnmatch

def match(pattern, filename):
    if pattern.startswith("**"):
        # drop one '*': fnmatch's '*' already matches across slashes
        return fnmatch.fnmatch(filename, pattern[1:])
    else:
        return fnmatch.fnmatch(filename, pattern)
In fnmatch.fnmatch, "*" matches anything (including slashes).
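One consequence worth checking: after dropping a star, "*/CVS/*" still requires at least one directory in front of CVS, so a top-level CVS/Repository would not match. The variant below is a hedged sketch that also tries the pattern with the leading "**/" removed; it uses fnmatchcase so the result does not depend on the platform's case rules.

```python
from fnmatch import fnmatchcase


def match(pattern, filename):
    if pattern.startswith("**/"):
        # '*/rest' covers one or more leading directories,
        # plain 'rest' covers the zero-directory case.
        return (fnmatchcase(filename, pattern[1:])
                or fnmatchcase(filename, pattern[3:]))
    return fnmatchcase(filename, pattern)


names = [
    "CVS/Repository",
    "mydir/mysubdir/CVS/Entries",
    "mydir/file.txt",
    "CVS/a/b",
]
results = {name: match("**/CVS/*", name) for name in names}
for name, ok in results.items():
    print(name, ok)
```

Note that CVS/a/b is reported as a match as well: because fnmatch's * crosses slashes, the trailing * is not limited to files immediately under CVS as it would be in Ant.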
There's an implementation in the 'waf' build system source code.
http://code.google.com/p/waf/source/browse/trunk/waflib/Node.py?r=10755#471
Maybe this should be wrapped up in a library of its own?
Yup. Your best bet is, as has already been suggested, to work with 'os.walk'. Or, write wrappers around 'glob' and 'fnmatch' modules, perhaps.
os.walk is your best bet for this. I did the example below with .svn because I had that handy, and it worked great:
import os
import re

for (dirpath, dirnames, filenames) in os.walk("."):
    if re.search(r'\.svn$', dirpath):
        for file in filenames:
            print(file)