Detecting case mismatch on filename in Windows (preferably using python)? - python

I have some XML configuration files that we create in a Windows environment but that are deployed on Linux. These configuration files reference each other with file paths. We've had problems with case sensitivity and trailing spaces before, and I'd like to write a script that checks for these problems. We have Cygwin if that helps.
Example:
Let's say I have a reference to the file foo/bar/baz.xml, I'd do this
<someTag fileref="foo/bar/baz.xml" />
Now if we by mistake do this:
<someTag fileref="fOo/baR/baz.Xml " />
It will still work on Windows, but it will fail on Linux.
What I want to do is detect the cases where a file reference in these files doesn't match the real file with respect to case sensitivity.

os.listdir on a directory, in all case-preserving filesystems (including those on Windows), returns the actual case for the filenames in the directory you're listing.
So you need to do this check at each level of the path:
import os

def onelevelok(parent, thislevel):
    for fn in os.listdir(parent):
        if fn.lower() == thislevel.lower():
            return fn == thislevel
    raise ValueError('No %r in dir %r!' % (thislevel, parent))
where I'm assuming that the complete absence of any case variation of a name is a different kind of error, and using an exception for that. For the whole path (assuming no drive letters or UNC paths, which wouldn't translate to Linux anyway):
def allpathok(path):
    # split the path into its individual components
    levels = [l for l in path.split(os.sep) if l]
    if os.path.isabs(path):
        top = os.sep
    else:
        top = '.'
    # build the full parent path for each component
    parents = [top]
    for level in levels:
        parents.append(os.path.join(parents[-1], level))
    return all(onelevelok(p, t)
               for p, t in zip(parents, levels))
You may need to adapt this if, e.g., foo/bar is not to be taken to mean that foo is in the current directory, but somewhere else; or, of course, if UNC paths or drive letters are in fact needed (but as I mentioned, translating them to Linux is not trivial anyway ;-).
Implementation notes: I'm taking advantage of the fact that zip just drops "extra entries" beyond the length of the shortest of the sequences it's zipping, so I don't need to explicitly slice the extra last entry off parents; zip does it for me. all will short-circuit where it can, returning False as soon as it detects a false value, so it's just as good as an explicit loop but faster and more concise.
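To see the per-level check in action, here is a quick self-contained demonstration on a scratch directory (case_matches restates the same compare-against-listdir idea as a standalone helper):

```python
import os
import tempfile

def case_matches(parent, name):
    # compare the claimed spelling against the actual on-disk spelling
    for fn in os.listdir(parent):
        if fn.lower() == name.lower():
            return fn == name
    raise ValueError('No %r in dir %r!' % (name, parent))

# scratch directory just for the demonstration
d = tempfile.mkdtemp()
open(os.path.join(d, "baz.xml"), "w").close()

case_matches(d, "baz.xml")  # True: spelling matches on disk
case_matches(d, "baz.Xml")  # False: same name, wrong case
```

Because listdir reports the preserved on-disk case, this gives the same answer on both case-sensitive and case-insensitive filesystems.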

It's hard to judge what exactly your problem is, but if you apply os.path.normcase along with str.strip before saving your file names, it should solve all your problems.
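For instance, applied to the bad fileref from the question (note that normcase is a no-op on POSIX, while on Windows it lowercases and converts forward slashes to backslashes):

```python
import os.path

ref = "fOo/baR/baz.Xml "                  # the bad reference from the question
cleaned = os.path.normcase(ref.strip())   # drop trailing spaces, then normalise case
```

If every reference is normalised this way before being written, trailing-space and case mismatches can't be saved in the first place.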
As I said in a comment, it's not clear how you are ending up with such a mistake. However, it would be trivial to check for the existing file, as long as you have some sensible convention (all file names are lower case, for example):
try:
    open(fname)
except IOError:
    open(fname.lower())

Related

How to tell if a non-existent path leads to a directory or a file in a cross-platform way in Python?

Here's how I'm currently guessing whether a path leads to a theoretical file or folder:
def __init__(self, path: str, is_dir: bool = None):
    # checks
    if is_dir is None:
        # guess whether the path is pointing to a file or a folder
        is_dir = (path.endswith('/') or path.endswith('\\')
                  or os.path.isdir(path) or '.' not in path)
    self.is_dir = is_dir
    # note: Because both files and folders can, but don't have to, include dots in their names,
    # this check is just a basic guess.
    #
    # Files without extensions that were not yet created will be recognised as folders,
    # and
    # folders with dots in their name that were not yet created will be recognised as files.
    #
    # If you know of a better solution to this problem (which I frankly don't think exists),
    # please implement it here.
Is there a better way that works on both UNIX-likes and Windows? (and possibly even elsewhere?)
(in case that the question is confusing: by theoretical I mean that it might not exist (yet))
Overall, this question is indeed pretty dumb; however, there are a few pointers that can help you guess (not determine):
on common Windows filesystems, you cannot put \ or / in a filename (both are accepted as separators in a path)
on most common Linux filesystems, however, you can put a \ in a filename (only / is used as a separator in a path)
directories are less likely to contain dots in their name (this isn't the case for "hidden" directories on Linux, and you can easily name a directory something like why.would.you.do.this)
If you're still desperate for better confidence:
you could check the path against a list of common file extensions (this will still, however, only provide you with a guess)
reconsider your life decisions
I'm sorry for asking such a dumb question.
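The extension-list idea above can be sketched like this (guess_is_dir and COMMON_EXTS are made-up names, and the list is deliberately incomplete; the result is still only a guess):

```python
import os

# hypothetical, incomplete list of common file extensions
COMMON_EXTS = {'.txt', '.xml', '.py', '.jpg', '.png', '.csv'}

def guess_is_dir(path: str) -> bool:
    # explicit trailing separators, or an existing directory, are strong signals
    if path.endswith(('/', '\\')) or os.path.isdir(path):
        return True
    # a known file extension suggests a file
    _, ext = os.path.splitext(path)
    if ext.lower() in COMMON_EXTS:
        return False
    # fall back to the dot heuristic from the question
    return '.' not in os.path.basename(path)
```

As noted above, names like why.would.you.do.this (a directory) or an extensionless file will still be misclassified; no heuristic can decide this for a path that doesn't exist yet.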

Iterate over infinite files in a directory in Python

I'm using Python 3.3.
If I'm manipulating potentially infinite files in a directory (bear with me; just pretend I have a filesystem that supports that), how do I do that without encountering a MemoryError? I only want the string name of one file to be in memory at a time. I don't want them all in an iterable as that would cause a memory error when there are too many.
Will os.walk() work just fine, since it returns a generator? Or, do generators not work like that?
Is this possible?
If you have a system for naming the files that can be figured out computationally, you can do something like this (it iterates over any number of numbered .txt files, with only one name in memory at a time; you could convert to another calculable naming system to get shorter filenames for large numbers):
import os

def infinite_files(path):
    num = 0
    while True:
        fname = os.path.join(path, str(num) + ".txt")
        if not os.path.exists(fname):
            break
        # perform operations on the file fname here
        num += 1
[My old inapplicable answer is below]
glob.iglob seems to do exactly what the question asks for. [EDIT: It doesn't. It actually seems less efficient than listdir(), but see my alternative solution above.] From the official documentation:
glob.glob(pathname, *, recursive=False)
Return a possibly-empty list of path names that match pathname, which must be a string containing a path specification. pathname can be either absolute (like /usr/src/Python-1.5/Makefile) or relative (like ../../Tools/*/*.gif), and can contain shell-style wildcards. Broken symlinks are included in the results (as in the shell).
glob.iglob(pathname, *, recursive=False)
Return an iterator which yields the same values as glob() without actually storing them all simultaneously.
iglob returns an "iterator which yields" or-- more concisely-- a generator.
Since glob.iglob has the same behavior as glob.glob, you can search with wildcard characters:
import glob

for x in glob.iglob("/home/me/Desktop/*.txt"):
    print(x)  # prints all txt files in that directory
I don't see a way for it to differentiate between files and directories without doing it manually. That is certainly possible, however.
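One way to get both lazy iteration and the file/directory distinction is os.scandir, which yields entries one at a time and exposes is_file()/is_dir() on each. Note the caveat: it was added in Python 3.5, after the 3.3 in the question (the scandir package backports it). A minimal sketch:

```python
import os

def iter_files(path):
    # os.scandir yields directory entries lazily, one at a time
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_file():
                yield entry.name
```

Since this is a generator over a lazy iterator, only one entry is held in memory at a time, which matches the question's constraint.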

What is the fastest method of finding a file in Linux and Windows using Python?

I am writing a plug-in for RawTherapee in Python. I need to extract the version number from a file called 'AboutThisBuild.txt' that may exist anywhere in the directory tree. Although RawTherapee knows where it is installed this data is baked into the binary file.
My plug-in is being designed to collect basic system data when run without any command line parameters for the purpose of short circuiting troubleshooting. By having the version number, revision number and changeset (AKA Mercurial), I can sort out why the script may not be working as expected. OK that is the context.
I have tried a variety of methods, some suggested elsewhere on this site. The main one is using os.walk and fnmatch.
The problem is speed. Searching the entire directory tree is like watching paint dry!
To reduce load I have tried to predict likely hiding places and only traverse these. This is quicker but has the obvious disadvantage of missing some files.
This is what I have at the moment. Tested on Linux but not Windows as yet as I am still researching where the file might be placed.
import fnmatch
import os
import sys

rootPath = ('/usr/share/doc/rawtherapee',
            '~',
            '/media/CoreData/opt/',
            '/opt')
pattern = 'AboutThisBuild.txt'

# Return the first instance of RT found in the paths searched
for CheckPath in rootPath:
    print("\n")
    print(">>>>>>>>>>>>> " + CheckPath)
    print("\n")
    for root, dirs, files in os.walk(CheckPath, True, None, False):
        for filename in fnmatch.filter(files, pattern):
            print(os.path.join(root, filename))
            break
Usually 'AboutThisBuild.txt' is stored in a directory/subdirectory called 'rawtherapee' or has the string somewhere in the directory tree. I had naively thought I could get the 5000-odd directory names and search these for 'rawtherapee', then use os.walk to traverse those directories, but all modules and functions I have looked at collate all files in the directory (again).
Anyone have a quicker method of searching the entire directory tree or am I stuck with this hybrid option?
I am a beginner in Python, but I think I know the simplest way of finding a file in Windows.
import os

for dirpath, subdirs, filenames in os.walk('The directory you wanna search the file in'):
    if 'name of your file with extension' in filenames:
        print(dirpath)
This code will print out the directory of the file you are searching for in the console. All you have to do is get to the directory.
The thing about searching is that it doesn't matter too much how you get there (e.g. cheating). Once you have a result, you can verify it is correct relatively quickly.
You may be able to identify candidate locations fairly efficiently by guessing. For example, on Linux, you could first try looking in these locations (obviously not all are directories, but it doesn't do any harm to os.path.isfile('/;l$/AboutThisBuild.txt'))
$ strings /usr/bin/rawtherapee | grep '^/'
/lib/ld-linux.so.2
/H=!
/;l$
/9T$,
/.ba
/usr/share/rawtherapee
/usr/share/doc/rawtherapee
/themes/
/themes/slim
/options
/usr/share/color/icc
/cache
/languages/default
/languages/
/languages
/themes
/batch/queue
/batch/
/dcpprofiles
/#q=
/N6rtexif16NAISOInterpreterE
If you have it installed, you can try the locate command
If you still don't find it, move on to the brute force method
Here is a rough equivalent of strings using Python:
>>> from string import printable, whitespace
>>> from itertools import groupby
>>> pathchars = set(printable) - set(whitespace)
>>> with open("/usr/bin/rawtherapee") as fp:
...     data = fp.read()
...
>>> for k, g in groupby(data, pathchars.__contains__):
...     if not k: continue
...     g = ''.join(g)
...     if len(g) > 3 and g.startswith("/"):
...         print g
...
/lib64/ld-linux-x86-64.so.2
/^W0Kq[
/pW$<
/3R8
/)wyX
/WUO
/w=H
/t_1
/.badpixH
/d$(
/\$P
/D$Pv
/D$#
/D$(
/l$#
/d$#v?H
/usr/share/rawtherapee
/usr/share/doc/rawtherapee
/themes/
/themes/slim
/options
/usr/share/color/icc
/cache
/languages/default
/languages/
/languages
/themes
/batch/queue.csv
/batch/
/dcpprofiles
/#q=
/N6rtexif16NAISOInterpreterE
It sounds like you need a pure python solution here. If not, other answers will suffice.
In this case, you should traverse the folders using a queue and threads. While some may say Threads are never the solution, Threads are a great way of speeding up when you are I/O bound, which you are in this case. Essentially, you'll os.listdir the current dir. If it contains your file, party like it's 1999. If it doesn't, add each subfolder to the work queue.
If you're clever, you can play with depth first vs breadth first traversal to get the best results.
There is a great example I have used quite successfully at work at http://www.tutorialspoint.com/python/python_multithreading.htm. See the section titled Multithreaded Priority Queue. The example could probably be updated to use thread pools, but that's not necessary.
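The queue-plus-threads traversal described above can be sketched as follows (find_file and num_workers are names I made up; since queue.Queue is FIFO, this is breadth-first by default):

```python
import os
import queue
import threading

def find_file(root, target, num_workers=4):
    # work queue of directories still to be listed
    work = queue.Queue()
    work.put(root)
    found = []
    lock = threading.Lock()

    def worker():
        while True:
            d = work.get()
            if d is None:          # sentinel: shut down
                work.task_done()
                return
            try:
                for name in os.listdir(d):
                    p = os.path.join(d, name)
                    if name == target:
                        with lock:
                            found.append(p)
                    elif os.path.isdir(p) and not os.path.islink(p):
                        work.put(p)  # add each subfolder to the work queue
            except OSError:
                pass               # unreadable directory: skip it
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    work.join()                    # wait until every queued directory is processed
    for _ in threads:
        work.put(None)             # one sentinel per worker
    for t in threads:
        t.join()
    return found
```

Because listing directories is I/O bound, several workers can overlap their waits on the filesystem, which is where the speedup comes from.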

What is the expected behaviour of tarfile.add() when adding an archive to itself?

The question might sound strange because I know I enforce a strange situation. It came up by accident (a bug, one might say), and I even know how to avoid it, so please skip that part.
I would really like to understand the behaviour I see.
The point of the function is to add all files with a given prefix in a directory to an archive. I noticed that even despite a "bug", the program works correctly (sic!). I wanted to understand why.
The code is fairly simple so I allow myself to post whole function:
import os
import tarfile

def pack(prefix, custom_meta_files=[]):
    postfix = 'tgz'
    if prefix[-1] != '.':
        postfix = '.tgz'
    archive = tarfile.open(prefix + postfix, "w:gz")
    files = filter(lambda path: path.startswith(prefix), os.listdir())
    #print('files: {0}'.format(list(files)))
    for file in files:
        print('packing `{0}`'.format(file))
        archive_name = file[len(prefix):]  # skip prefix + dot
        archive.add(file, archive_name)
    not_doubled_metas = set(custom_meta_files) - set(archive.getnames())
    print('metas to add: {0}'.format(not_doubled_metas))
    for meta in not_doubled_metas:
        print('packing `{0}`'.format(meta))
        archive.add(meta)
    print('contents:{0}'.format(archive.getnames()))
As one can notice, I create the archive with the prefix, and then I create the list of files to pack by listing everything in the cwd and filtering it via the lambda. Naturally the archive itself passes the filter. There is also a snippet to add fixed files if the names do not overlap, although that is not important, I think.
So the output from such run is e.g:
packing `ga_run.seq_niche.N30.1.bt0_5K.params`
packing `ga_run.seq_niche.N30.1.bt0_5K.stats`
packing `ga_run.seq_niche.N30.1.bt0_5K.tgz`
metas to add: {'stats.meta'}
packing `stats.meta`
contents:['params', 'stats', 'stats.meta']
So the script tried adding itself; however, it does not appear in the final contents. I do not know what the expected behaviour is, but there is no warning at all and the documentation does not mention anything. I read the parts about methods to add members and searched for "itself" and "same name".
I would assume it is automatically skipped, but I don't know how to actually check that. I would personally have expected a zero-length file to be added as a member, but I understand skipping, as it makes more sense actually.
Question Is it a desired behaviour in tarfile.add() to ignore adding the archive to itself? Where is it said?
Scanning the tarfile.py code from 2.4 through 3.2, all versions have code similar to:
# Skip if somebody tries to archive the archive...
if self.name is not None and os.path.abspath(name) == self.name:
    self._dbg(2, "tarfile: Skipped %r" % name)
    return
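The skip can be confirmed with a small experiment (all paths here are scratch/temp names, not the question's files); the archive is silently dropped from its own member list, exactly as the code above suggests:

```python
import os
import tarfile
import tempfile

# scratch directory just for the demonstration
d = tempfile.mkdtemp()
data = os.path.join(d, "data.txt")
with open(data, "w") as f:
    f.write("hello")

archive_path = os.path.join(d, "out.tgz")
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(data, "data.txt")
    archive.add(archive_path, "out.tgz")  # silently skipped, no error raised
    names = archive.getnames()            # only ["data.txt"]
```

The debug message ("tarfile: Skipped ...") is only emitted at debug level 2 or higher, which is why the question's run showed no warning at all.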

How can I tell if a file is a descendant of a given directory?

On the surface, this is pretty simple, and I could implement it myself easily. Just successively call dirname() to go up each level in the file's path and check each one to see if it's the directory we're checking for.
But symlinks throw the whole thing into chaos. Any directory along the path of either the file or directory being checked could be a symlink, and any symlink could have an arbitrary chain of symlinks to other symlinks. At this point my brain melts and I'm not sure what to do. I've tried writing the code to handle these special cases, but it soon gets too complicated and I assume I'm doing it wrong. Is there a reasonably elegant way to do this?
I'm using Python, so any mention of a library that does this would be cool. Otherwise, this is a pretty language-neutral problem.
Use os.path.realpath and os.path.commonprefix:
os.path.commonprefix(['/the/dir/', os.path.realpath(filename)]) == "/the/dir/"
os.path.realpath will expand any symlinks as well as .. in the filename. os.path.commonprefix is a bit fickle -- it doesn't really test for paths, just plain string prefixes, so you should make sure your directory ends in a directory separator. If you don't, it will claim /the/dirtwo/filename is also in /the/dir.
Python 3.5 has the useful function os.path.commonpath:
Return the longest common sub-path of each pathname in the sequence paths. Raise ValueError if paths contains both absolute and relative pathnames, or if paths is empty. Unlike commonprefix(), this returns a valid path.
So to check if a file is a descendant of a directory, you could do this:
os.path.commonpath(["/the/dir", os.path.realpath(filename)]) == "/the/dir"
Unlike commonprefix, you don't need to worry about whether the inputs have trailing slashes or not. The return value of commonpath always lacks a trailing slash.
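The difference between the two is easy to demonstrate on the sibling-directory pitfall mentioned above (POSIX paths assumed):

```python
import os.path

# commonprefix does plain string matching, so a sibling dir can slip through
os.path.commonprefix(["/the/dir", "/the/dirtwo/file"])  # "/the/dir" -- wrong!
# commonpath compares whole path components instead
os.path.commonpath(["/the/dir", "/the/dirtwo/file"])    # "/the"
```

With commonpath, /the/dirtwo/file correctly fails the descendant test against /the/dir, with or without a trailing slash on the directory.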
Another way to do this in Python 3 is to use pathlib:
from pathlib import Path
is_descendant = Path("/the/dir") in Path(filename).resolve().parents
See documentation for Path.resolve() and Path.parents.
