I have a string which contains user input for a directory address on a linux system. I need to check if it is properly formatted and could be an address in Python 2.6. It's important to note that this is not on the current system so I can't check if it is there using os.path nor can I try to create the directories as the function will be run many times.
These strings will always be absolute paths, so my first thought was to look for a leading slash. From there I wondered about checking if the rest of the string only contains valid characters and does not contain any double slashes. This seems a little clunky, any other ideas?
Sure the question has been edited since writing this but:
There is the os.path.isabs(PATH) which will tell you if the path is absolute or not.
Return True if path is an absolute pathname. On Unix, that means it begins with a slash, on Windows that it begins with a (back)slash after chopping off a potential drive letter.
Related
The problem, and it may not be easily solved with a regex, is that I want to be able to extract a Windows file path from an arbitrary string. The closest that I have been able to come (I've tried a bunch of others) is using the following regex:
[a-zA-Z]:\\([a-zA-Z0-9() ]*\\)*\w*.*\w*
Which picks up the start of the file and is designed to look at patterns (after the initial drive letter) of strings followed by a backslash and ending with a file name, optional dot, and optional extension.
The difficulty is what happens, next. Since the maximum path length is 260 characters, I only need to count 260 characters beyond the start. But since spaces (and other characters) are allowed in file names I would need to make sure that there are no additional backslashes that could indicate that the prior characters are the name of a folder and that what follows isn't the file name, itself.
I am pretty certain that there isn't a perfect solition (the perfect being the enemy of the good) but I wondered if anyone could suggest a "best possible" solution?
Here's the expression I got, based on yours, that allow me to get the path on windows : [a-zA-Z]:\\((?:[a-zA-Z0-9() ]*\\)*).* . An example of it being used is available here : https://regex101.com/r/SXUlVX/1
First, I changed the capture group from ([a-zA-Z0-9() ]*\\)* to ((?:[a-zA-Z0-9() ]*\\)*).
Your original expression captures each XXX\ one after another (eg : Users\ the Users\).
Mine matches (?:[a-zA-Z0-9() ]*\\)*. This allows me to capture the concatenation of XXX\YYYY\ZZZ\ before capturing. As such, it allows me to get the full path.
The second change I made is related to the filename : I'll just match any group of character that does not contain \ (the capture group being greedy). This allows me to take care of strange file names.
Another regex that would work would be : [a-zA-Z]:\\((?:.*?\\)*).* as shown in this example : https://regex101.com/r/SXUlVX/2
This time, I used .*?\\ to match the XXX\ parts of the path.
.*? will match in a non-greedy way : thus, .*?\\ will match the bare minimum of text followed by a back-slash.
Do not hesitate if you have any question regarding the expressions.
I'd also encourage you to try to see how well your expression works using : https://regex101.com . This also has a list of the different tokens you can use in your regex.
Edit : As my previous answer did not work (though I'll need to spend some times to find out exactly why), I looked for another way to do what you want. And I managed to do so using string splitting and joining.
The command is "\\".join(TARGETSTRING.split("\\")[1:-1]).
How does this work : Is plit the original string into a list of substrings, based. I then remove the first and last part ([1:-1]from 2nd element to the one before the last) and transform the resulting list back into a string.
This works, whether the value given is a path or the full address of a file.
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred is a file path
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred\ is a directory path
I know that this is not something that should ever be done, but is there a way to use the slash character that normally separates directories within a filename in Linux?
The answer is that you can't, unless your filesystem has a bug. Here's why:
There is a system call for renaming your file defined in fs/namei.c called renameat:
SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
int, newdfd, const char __user *, newname)
When the system call gets invoked, it does a path lookup (do_path_lookup) on the name. Keep tracing this, and we get to link_path_walk which has this:
static int link_path_walk(const char *name, struct nameidata *nd)
{
struct path next;
int err;
unsigned int lookup_flags = nd->flags;
while (*name=='/')
name++;
if (!*name)
return 0;
...
This code applies to any file system. What's this mean? It means that if you try to pass a parameter with an actual '/' character as the name of the file using traditional means, it will not do what you want. There is no way to escape the character. If a filesystem "supports" this, it's because they either:
Use a unicode character or something that looks like a slash but isn't.
They have a bug.
Furthermore, if you did go in and edit the bytes to add a slash character into a file name, bad things would happen. That's because you could never refer to this file by name :( since anytime you did, Linux would assume you were referring to a nonexistent directory. Using the 'rm *' technique would not work either, since bash simply expands that to the filename. Even rm -rf wouldn't work, since a simple strace reveals how things go on under the hood (shortened):
$ ls testdir
myfile2 out
$ strace -vf rm -rf testdir
...
unlinkat(3, "myfile2", 0) = 0
unlinkat(3, "out", 0) = 0
fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
close(3) = 0
unlinkat(AT_FDCWD, "testdir", AT_REMOVEDIR) = 0
...
Notice that these calls to unlinkat would fail because they need to refer to the files by name.
You could use a Unicode character that displays as / (for example the fraction slash), assuming your filesystem supports it.
It depends on what filesystem you are using. Of some of the more popular ones:
ext3: No
ext4: No
jfs: Yes
reiserfs: No
xfs: No
Only with an agreed-upon encoding. For example, you could agree that % will be encoded as %% and that %2F will mean a /. All the software that accessed this file would have to understand the encoding.
The short answer is: No, you can't. It's a necessary prohibition because of how the directory structure is defined.
And, as mentioned, you can display a unicode character that "looks like" a slash, but that's as far as you get.
In general it's a bad idea to try to use "bad" characters in a file name at all; even if you somehow manage it, it tends to make it hard to use the file later. The filesystem separator is flat-out not going to work at all, so you're going to need to pick an alternative method.
Have you considered URL-encoding the URL then using that as the filename? The result should be fine as a filename, and it's easy to reconstruct the name from the encoded version.
Another option is to create an index - create the output filename using whatever method you like - sequentially-numbered names, SHA1 hashes, whatever - then write a file with the generated filename/URL pair. You can save that into a hash and use it to do a URL-to-filename lookup or vice-versa with the reversed version of the hash, and you can write it out and reload it later if needed.
The short answer is: you must not. The long answer is, you probably can or it depends on where you are viewing it from and in which layer you are working with.
Since the question has Unix tag in it, I am going to answer for Unix.
As mentioned in other answers that, you must not use forward slashes in a filename.
However, in MacOS you can create a file with forward slashes / by:
# avoid doing it at all cost
touch 'foo:bar'
Now, when you see this filename from terminal you will see it as foo:bar
But, if you see it from finder: you will see finder converted it as foo/bar
Same thing can be done the other way round, if you create a file from finder with forward slashes in it like /foobar, there will be a conversion done in the background. As a result, you will see :foobar in terminal but the other way round when viewed from finder.
So, : is valid in the unix layer, but it is translated to or from / in the Mac layers like Finder window, GUI. : the colon is used as the separator in HFS paths and the slash / is used as the separator in POSIX paths
So there is a two-way translation happening, depending on which “layer” you are working with.
See more details here: https://apple.stackexchange.com/a/283095/323181
You can have a filename with a / in Linux and Unix. This is a very old question, but surprisingly nobody has said it in almost 10 years since the question was asked.
Every Unix and Linux system has the root directory named /. A directory is just a special kind of file. Symbolic links, character devices, etc are also special kinds of files. See here for an in depth discussion.
You can't create any other files with a /, but you certainly have one -- and a very important one at that.
I'm writing a script that will take a list of file paths as input. I want the script to make sure the strings in the input file are, or at least appear to be, valid full Windows paths, that include the drive letter.
That being said, what's the best way to ensure that a string starts with any single letter, upper or lowercase, a colon, and a back slash?
I'm guessing the regex would look something like this:
[a-zA-Z]:\, but how do I make sure it check for only one letter and that it's the first 3 characters in the string?
I appreciate it.
The ^ character matches the start of the string. Your character class will currently only match one letter, and you need to escape the \. So your final regex would be:
^[a-zA-Z]:\\
If you just want to check to make sure it starts with a drive letter, you could also use the built-in splitdrive:
drive, path = os.path.splitdrive(filename)
if drive == None:
raise ValueError, 'Filename does not include a drive!'
Edit: Thanks to jme, if you are not on a Windows system, do import ntpath and replace the first line like this:
drive, path = ntpath.splitdrive(filename)
Note: since Python 2.7.8, splitdrive returns a 'drive' for UNC paths too.
Checks that the given path is a valid windows path based on criteria at http://msdn.microsoft.com/en-us/library/aa365247%28v=vs.85%29.aspx. I made this a little while ago out of frustration of not finding a good one online:
r'^(?:[a-zA-Z]:\\|\\\\?|\\\\\?\\|\\\\\.\\)?(?:(?!(CLOCK\$(\\|$)|(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9]| )(?:\..*|(\\|$))|.*\.$))(?:(?:(?![><:/"\\\|\?\*])[\x20-\u10FFFF])+\\?))*$'
So this is what ended up working for me.
is_path = re.match("^[a-zA-Z]:\\)*", file)
So, I'm trying to incorporate os.path.isfile or os.path.exists into my code with success in finding certain regular files(pdf,png) when searching for filenames that begin with a letter.
The file naming standard that I'm using (and can't change due to the user) starts with a number and subsequently can't be found using the same method. Is there a way that I can make these files discoverable by .isfile or .exists?
The files I'm searching for are .txt files.
os.path.isfile("D:\Users\spx9gs\Project Work\Data\21022013AA.txt")
os.path.isfile("D:\Users\spx9gs\Project Work\Data\AA21022013.txt")
Returns:
False
True
You need to use raw strings, or escape your backslashes. In the filename:
"D:\Users\spx9gs\Project Work\Data\21022013AA.txt"
the \210 will be interpreted as an octal escape code so you won't get the correct filename.
Either of these will work:
r"D:\Users\spx9gs\Project Work\Data\21022013AA.txt"
"D:\\Users\\spx9gs\\Project Work\\Data\\21022013AA.txt"
It is inspired by "How to make a valid Windows filename from an arbitrary string?", I've written a function that will take arbitrary string and make it a valid filename.
My function should technically be an answer to this question, but I want to make sure I've not done anything stupid, or overlooked anything, before posting it as an answer.
I wrote this as part of tvnamer - a utility which takes TV episode filenames, and renames them nice and consistently, with an episode pulled from http://www.thetvdb.com - while the source filename must be a valid file, the series name is corrected, and the episode name - so both could contain theoretically any characters. I'm not so much concerned about security as usability - it's mainly to prevent files being renamed .some.series - [01x01].avi and the file "disappearing" (rather than to thwart evil people)
It makes a few assumptions:
The filesystem supports Unicode filenames. HFS+ and NTFS both do, which will cover a majority of users. There is also a normalize_unicode argument to strip out Unicode characters (in tvnamer, this is set via the config XML file)
The platform is either Darwin, Linux, and everything else is treated as Windows
The filename is intended to be visible (not a dotfile like .bashrc) - it would be simple enough to modify the code to allow .abc format filenames, if desired
Things I've (hopefully) handled:
Prepend underscore if filename starts with . (prevents filenames . .. and files from disappearing)
Remove directory separators: / on Linux, and / and : on OS X
Removing invalid Windows filename characters \/:*?"<>| (when on Windows, or forced with windows_safe=True)
Prepend reserved filenames with underscore (COM2 becomes _COM2, NUL becomes _NUL etc)
Optional normalisation of Unicode data, so å becomes a and non-convertable characters are removed
Truncation of filenames over 255 characters on Linux/Darwin, and 32 characters on Windows
The code and a bunch of test-cases can be found and fiddled with at http://gist.github.com/256270. The "production" code can be found in tvnamer/utils.py
Is there any errors with this function? Any conditions I've missed?
One point I've noticed: Under NTFS, some files can not be created in specific directories.
E.G. $Boot in root