Inspired by "How to make a valid Windows filename from an arbitrary string?", I've written a function that takes an arbitrary string and makes it a valid filename.
My function should technically be an answer to this question, but I want to make sure I've not done anything stupid, or overlooked anything, before posting it as an answer.
I wrote this as part of tvnamer, a utility which takes TV episode filenames and renames them nicely and consistently, with the episode name pulled from http://www.thetvdb.com. While the source filename must be a valid file, both the series name and the episode name are corrected, so either could theoretically contain any characters. I'm not so much concerned about security as usability: it's mainly to prevent files being renamed .some.series - [01x01].avi and the file "disappearing" (rather than to thwart evil people).
It makes a few assumptions:
The filesystem supports Unicode filenames. HFS+ and NTFS both do, which will cover a majority of users. There is also a normalize_unicode argument to strip out Unicode characters (in tvnamer, this is set via the config XML file)
The platform is either Darwin or Linux; everything else is treated as Windows
The filename is intended to be visible (not a dotfile like .bashrc) - it would be simple enough to modify the code to allow .abc format filenames, if desired
Things I've (hopefully) handled:
Prepend underscore if filename starts with . (prevents filenames . .. and files from disappearing)
Remove directory separators: / on Linux, and / and : on OS X
Removing invalid Windows filename characters \/:*?"<>| (when on Windows, or forced with windows_safe=True)
Prepend reserved filenames with underscore (COM2 becomes _COM2, NUL becomes _NUL etc)
Optional normalisation of Unicode data, so å becomes a and non-convertible characters are removed
Truncation of filenames over 255 characters on Linux/Darwin, and 32 characters on Windows
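The list above can be sketched roughly as follows. This is not the actual tvnamer code: the function name, the `platform` and `max_len` parameters and the `WINDOWS_RESERVED` constant are all made up for illustration, it is written for Python 3, and the truncation step is deliberately naive.

```python
import unicodedata

# Device names that Windows reserves (hypothetical constant for this sketch).
WINDOWS_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {"COM%d" % i for i in range(1, 10)}
    | {"LPT%d" % i for i in range(1, 10)}
)
WINDOWS_INVALID = '\\/:*?"<>|'


def make_valid_filename_sketch(name, platform="Linux", windows_safe=False,
                               normalize_unicode=False, max_len=255):
    """Illustrative only: apply the listed steps in order."""
    if normalize_unicode:
        # Decompose accented characters, then drop anything non-ASCII.
        name = unicodedata.normalize("NFKD", name)
        name = name.encode("ascii", "ignore").decode("ascii")

    # Anything that is not Linux or Darwin is treated as Windows.
    treat_as_windows = windows_safe or platform not in ("Linux", "Darwin")
    if treat_as_windows:
        invalid = WINDOWS_INVALID
    elif platform == "Darwin":
        invalid = "/:"
    else:
        invalid = "/"
    name = "".join(ch for ch in name if ch not in invalid)

    # Reserved device names are matched on the part before the first dot.
    if treat_as_windows and name.split(".")[0].upper() in WINDOWS_RESERVED:
        name = "_" + name

    # Avoid hidden files and the special names "." and "..".
    if name.startswith("."):
        name = "_" + name

    # Naive truncation; note that this can cut off the extension.
    return name[:max_len]
```

For example, `make_valid_filename_sketch("COM2", platform="Windows")` gives `_COM2`, and a leading-dot name such as `.some.series - [01x01].avi` gets an underscore prepended.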
The code and a bunch of test-cases can be found and fiddled with at http://gist.github.com/256270. The "production" code can be found in tvnamer/utils.py
Are there any errors in this function? Any conditions I've missed?
One point I've noticed: Under NTFS, some files can not be created in specific directories.
E.G. $Boot in root
Very new here, and I am trying to modify some Python code to normalize directory/file names for Windows using regular expressions. I have searched and found lots of code examples, but haven’t quite figured out how to put it all together.
This is what I am trying to accomplish:
I need to remove all invalid Windows characters so directory/file names do not include: < > : " / \ | ? *
Windows also doesn’t seem to like spaces at the end of a directory/file name. Windows also doesn’t like periods at the end of directory names.
So, I need to get rid of ellipses without affecting the extension. To clarify, when I say ellipsis, I am referring to a pattern of three periods, and NOT the single Unicode character “Horizontal Ellipsis (U+2026)”. I have researched and found multiple ways of doing individual parts of this, but I cannot seem to get it all together and playing nice.
return unicode(re.sub(r'[<>:"/\\|?*]', "", filename))
This cleans up the names, but not the pattern of two or more periods.
return unicode(re.sub(r'[<>:"/\\|?*.]', "", filename))
This cleans up the names, but also affects the file extension.
[^\w\-_\. ]
This also seemed to be a viable alternative. It is a bit more restrictive than necessary, but I did find it easy to just keep adding specific characters I wanted to ignore.
\.{2,}
This is the piece I can’t seem to get to integrate with any of these methods. I understand that this should match two or more “.”, but leave a single “.” alone. But there are some situations where I “might” be left with a period at the end of a Windows directory name, which won’t work.
.*[.](?!mp3$)[^.]*$
I searched and found this specific snippet, which looks promising to match/ignore a specific extension. In my case, I want .mp3 left alone. Maybe a different way to go about things. And I think it might eliminate a potential problem of having a period at the end of a directory name.
Thank you for your time!
Edit: Additional Information Added
def normalize_filename(self, filename):
    """Remove invalid characters from filename"""
    return unicode(re.sub(r'[<>:"/\\|?*]', "", filename))

def get_outfile(self):
    """Returns output filename based on song information"""
    destination_dir = os.path.join(self.normalize_filename(self.info["AlbumArtist"]),
                                   self.normalize_filename(self.info["Album"]))
    filename = u"{TrackNumber:02d} - {Title}.mp3".format(**self.info)
    return os.path.join(destination_dir, self.normalize_filename(filename))
This is the relevant code I am trying to modify. The full code basically pulls song artist, album, and track descriptions out of a sqlite database file. Then based on that information, it creates an artist directory, album directory, and a mp3 file.
However, because of Windows naming restrictions, those names need to be normalized/sanitized.
Ideally I would like this to be done with a single re.sub, if it can be done.
return unicode(re.sub(r'[<>:"/\\|?*]', "", filename))
If there is another/better way to make this code work, I am open to it. But with my limited understanding, adding more complexity was beyond me, so I was trying to work within the bounds of what I currently understand. I have done a lot of reading over the past few days, but can’t quite accomplish what I would like to do.
For Example: “Ned’s Atomic Dustbin\ARE YOU NORMAL?\Not Sleeping Around” needs to become C:\Ned’s Atomic Dustbin\ARE YOU NORMAL\Not Sleeping Around.mp3
Another: “Green Day\UNO... DOS... TRÉ!\F*** Time” needs to become C:\Green Day\UNO DOS TRÉ\F Time.mp3
Another: “Incubus\A Crow Left Of The Murder…\Pistola” would become C:\Incubus\A Crow Left Of The Murder\Pistola.mp3
Tricky Example: “System Of A Down\B.Y.O.B.\B.Y.O.B.” to C:\System Of A Down\BYOB\BYOB.mp3. Windows wouldn’t care if it was B.Y.O.B, but the last period is what causes issues. So it would probably be best if the solution eliminated all “.”, except in the extension .mp3.
My answer is totally based on the text below (you typed, of course):
I need to remove all invalid Windows characters so directory/file names do not include: < > : " / \ | ? * Windows also doesn’t seem to like spaces at the end of a directory/file name. Windows also doesn’t like periods at the end of directory names.
So here we go (for file/directory):
unicode(re.sub(r'(\<|\>|\:|\"|\/|\\|\||\?|\*)', '', name))  # name: a file or directory name
Explanation:
\<|\>|\:|\"|\/|\\|\||\?|\* <= matches all of your undesired chars
At this time you will have erased all of your undesired chars EXCEPT the spaces/dots at the end of the name.
For your file_name you can update its variable with
file_name = re.sub(r'( +)$', '', file_name)
( +)$ <= matches one or more spaces at the end of the string.
and you'll be done because there are no more restrictions besides that the name can't contain any spaces at its end (remember we already removed the special chars).
For directories, however, the name can't end with either periods or spaces.
So the best way, in my opinion of course, is to implement a recursive procedure, one that stops only when:
dir_name == re.sub(r'( +|\.+)$', '', dir_name)
and dir_name keeps being updated with dir_name = re.sub(r'( +|\.+)$', '', dir_name) while the above statement is false.
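A minimal sketch of that procedure, written as a loop in Python 3 (so without `unicode()`); the names `INVALID_CHARS` and `clean_dir_name` are made up for this example:

```python
import re

# Characters Windows does not allow in file or directory names.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*]')


def clean_dir_name(dir_name):
    """Remove invalid characters, then strip trailing spaces and dots."""
    dir_name = INVALID_CHARS.sub("", dir_name)
    while True:
        stripped = re.sub(r'( +|\.+)$', '', dir_name)
        if stripped == dir_name:
            # Nothing left to strip: the stop condition described above.
            return dir_name
        dir_name = stripped
```

On these inputs the loop gives the same result as `dir_name.rstrip(" .")`, which may be simpler if you don't need a regex. Note that it only trims the end of the name; periods in the middle (as in B.Y.O.B) are left alone.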
Hope this helps you.
I know that this is not something that should ever be done, but is there a way to use the slash character that normally separates directories within a filename in Linux?
The answer is that you can't, unless your filesystem has a bug. Here's why:
There is a system call for renaming your file defined in fs/namei.c called renameat:
SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
int, newdfd, const char __user *, newname)
When the system call gets invoked, it does a path lookup (do_path_lookup) on the name. Keep tracing this, and we get to link_path_walk which has this:
static int link_path_walk(const char *name, struct nameidata *nd)
{
struct path next;
int err;
unsigned int lookup_flags = nd->flags;
while (*name=='/')
name++;
if (!*name)
return 0;
...
This code applies to any file system. What's this mean? It means that if you try to pass a parameter with an actual '/' character as the name of the file using traditional means, it will not do what you want. There is no way to escape the character. If a filesystem "supports" this, it's because they either:
Use a unicode character or something that looks like a slash but isn't.
They have a bug.
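You can see the same behaviour from user space. The sketch below (Python 3; `try_create` is a made-up helper) tries to create a file whose name contains a slash inside a fresh temporary directory. The slash is treated as a separator, so the call fails because the intermediate directory does not exist.

```python
import os
import tempfile


def try_create(name):
    """Try to create `name` in an empty temp dir; report what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, name)
        try:
            with open(path, "w"):
                pass
        except OSError as exc:
            return type(exc).__name__
        return "created"


print(try_create("foobar"))   # an ordinary name
print(try_create("foo/bar"))  # "foo" is looked up as a directory
```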
Furthermore, if you did go in and edit the bytes to add a slash character into a file name, bad things would happen. That's because you could never refer to this file by name :( since anytime you did, Linux would assume you were referring to a nonexistent directory. Using the 'rm *' technique would not work either, since bash simply expands that to the filename. Even rm -rf wouldn't work, since a simple strace reveals how things go on under the hood (shortened):
$ ls testdir
myfile2 out
$ strace -vf rm -rf testdir
...
unlinkat(3, "myfile2", 0) = 0
unlinkat(3, "out", 0) = 0
fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
close(3) = 0
unlinkat(AT_FDCWD, "testdir", AT_REMOVEDIR) = 0
...
Notice that these calls to unlinkat would fail because they need to refer to the files by name.
You could use a Unicode character that displays as / (for example the fraction slash), assuming your filesystem supports it.
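A small sketch of that substitution in Python 3, assuming the filesystem and the tools reading the name handle Unicode; the helper name is made up:

```python
import unicodedata

FRACTION_SLASH = "\u2044"  # looks like "/", but is a different character


def visually_slashed(name):
    """Replace the real separator with a look-alike character."""
    return name.replace("/", FRACTION_SLASH)


print(unicodedata.name(FRACTION_SLASH))
print(visually_slashed("foo/bar"))
```

Keep in mind that the result only looks like a slash; anything that searches for a literal / in the name will not find one.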
It depends on what filesystem you are using. Of some of the more popular ones:
ext3: No
ext4: No
jfs: Yes
reiserfs: No
xfs: No
Only with an agreed-upon encoding. For example, you could agree that % will be encoded as %% and that %2F will mean a /. All the software that accessed this file would have to understand the encoding.
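A sketch of that particular scheme in Python 3 (the function names are made up). The order matters when encoding: % has to be doubled first, so that the %2F produced for a slash is not itself escaped.

```python
import re


def encode_name(name):
    """Escape % as %% first, then encode / as %2F."""
    return name.replace("%", "%%").replace("/", "%2F")


def decode_name(encoded):
    """Reverse encode_name by scanning left to right."""
    return re.sub(
        r"%(%|2F)",
        lambda m: "%" if m.group(1) == "%" else "/",
        encoded,
    )


original = "50%/2F"
stored = encode_name(original)
assert "/" not in stored
assert decode_name(stored) == original
```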
The short answer is: No, you can't. It's a necessary prohibition because of how the directory structure is defined.
And, as mentioned, you can display a unicode character that "looks like" a slash, but that's as far as you get.
In general it's a bad idea to try to use "bad" characters in a file name at all; even if you somehow manage it, it tends to make it hard to use the file later. The filesystem separator is flat-out not going to work at all, so you're going to need to pick an alternative method.
Have you considered URL-encoding the URL then using that as the filename? The result should be fine as a filename, and it's easy to reconstruct the name from the encoded version.
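For instance, a minimal Python 3 sketch using the standard library (the helper names and the example URL are placeholders). Passing `safe=""` matters, because by default `quote` leaves / unescaped.

```python
from urllib.parse import quote, unquote


def url_to_filename(url):
    """Percent-encode everything that is not unreserved, including /."""
    return quote(url, safe="")


def filename_to_url(filename):
    return unquote(filename)


name = url_to_filename("http://example.com/a b")
print(name)
assert filename_to_url(name) == "http://example.com/a b"
```

Long URLs can still exceed the filesystem's filename length limit, so this works best when the URLs are reasonably short.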
Another option is to create an index - create the output filename using whatever method you like - sequentially-numbered names, SHA1 hashes, whatever - then write a file with the generated filename/URL pair. You can save that into a hash and use it to do a URL-to-filename lookup or vice-versa with the reversed version of the hash, and you can write it out and reload it later if needed.
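A rough sketch of the index idea in Python 3, using SHA1 hex digests as the generated names. The function names, the `.html` extension and the example URL are assumptions for illustration only.

```python
import hashlib
import json


def build_index(urls, ext=".html"):
    """Map a generated, slash-free filename to each URL."""
    index = {}
    for url in urls:
        digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
        index[digest + ext] = url
    return index


def reverse_index(index):
    """URL -> filename lookup, built from the filename -> URL index."""
    return {url: filename for filename, url in index.items()}


index = build_index(["http://example.com/a/b"])
saved = json.dumps(index)          # write this string to a file to persist it
assert json.loads(saved) == index  # and load it back later
```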
The short answer is: you must not. The long answer is that you probably can, depending on where you are viewing it from and which layer you are working in.
Since the question has the Unix tag on it, I am going to answer for Unix.
As mentioned in other answers, you must not use forward slashes in a filename.
However, in MacOS you can create a file with forward slashes / by:
# avoid doing it at all cost
touch 'foo:bar'
Now, when you look at this filename from the terminal, you will see it as foo:bar
But if you look at it from Finder, you will see that Finder has converted it to foo/bar
The same thing works the other way round: if you create a file from Finder with a forward slash in its name, like /foobar, a conversion is done in the background. As a result, you will see :foobar in the terminal, but /foobar when viewed from Finder.
So : is valid in the Unix layer, but it is translated to or from / in Mac layers such as the Finder GUI. The colon : is the separator in HFS paths, and the slash / is the separator in POSIX paths.
So there is a two-way translation happening, depending on which “layer” you are working with.
See more details here: https://apple.stackexchange.com/a/283095/323181
You can have a filename with a / in Linux and Unix. This is a very old question, but surprisingly nobody has said it in almost 10 years since the question was asked.
Every Unix and Linux system has the root directory named /. A directory is just a special kind of file. Symbolic links, character devices, etc are also special kinds of files. See here for an in depth discussion.
You can't create any other files with a /, but you certainly have one -- and a very important one at that.
Trying to save image files in batches. Works nicely, but the list of names for each file sometimes includes apostrophes, and everything stops.
The offending script is:
pic.save(r"C:\Python34\Scripts\{!s}.jpg".format(name))
The apostrophes in the names aren't a problem when I embed them in a url with selenium
browser.get("https://website.com/{!s}".format(name))
or when I print the destination file name, e.g.
print(r"C:\Python34\Scripts\{!s}.jpg".format(name))
Which is fine to turn out like
C:\Python34\Scripts\['It's fine'].jpg
so I assume this kind of problem has something to do with the save function.
The traceback points to the pic.save line and into PIL\Image.py, and reports OSError: [Errno 22] Invalid argument for the save destination.
Using Windows 7 if that matters.
Probably super-novice error, but I've been reading threads and can't figure this out--workaround would be cleaning the list of apostrophes before using it, which would be annoying but acceptable.
Any help appreciated.
---edited to fix double quotes as single, just mistyped when writing this post...doh.
It's not a Python problem but a Windows one, or rather a matter of the file system's naming rules. From MSDN:
Use any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:
The following reserved characters
< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
On UNIX type systems, all except the / would be valid (although most would be a bad idea). A further "character", binary zero 0x00, is invalid on most file systems.
Rules for URLs are different again.
So you are going to have to write a sanitiser for filenames avoiding these characters. A regular expression would probably be the easiest, but you will have to choose replacement characters that don't occur naturally.
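A minimal sketch of such a sanitiser in Python 3. The function name and the choice of underscore as the replacement are assumptions; pick whatever replacement suits your data.

```python
import re

# The reserved characters listed above, plus the NUL byte.
RESERVED = re.compile(r'[<>:"/\\|?*\x00]')


def sanitise(name, replacement="_"):
    """Replace every reserved character with `replacement`."""
    return RESERVED.sub(replacement, name)


print(sanitise('["It\'s fine"]'))
```

Note that the apostrophe is not in the reserved list, so it is left alone, while double quotes are replaced.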
Edit: I was assuming that Error 22 was reporting an invalid filename, but I was wrong, it actually means "The device does not recognise the command".
See https://stackoverflow.com/questions/19870570/pil-giving-oserror-errno-22-when-opening-gif. The accepted reply is rather weird though.
I Google'd "python PIL OSError Errno 22", you might like to try the same and see if any of the conditions apply to you, but clearly you are not alone, if that's any consolation.
Sorry I can't do more.
I'm trying to incorporate os.path.isfile or os.path.exists into my code, and it successfully finds certain regular files (pdf, png) whose filenames begin with a letter.
The file naming standard that I'm using (and can't change, because of the user) starts with a number, and those files can't be found using the same method. Is there a way that I can make these files discoverable by .isfile or .exists?
The files I'm searching for are .txt files.
os.path.isfile("D:\Users\spx9gs\Project Work\Data\21022013AA.txt")
os.path.isfile("D:\Users\spx9gs\Project Work\Data\AA21022013.txt")
Returns:
False
True
You need to use raw strings, or escape your backslashes. In the filename:
"D:\Users\spx9gs\Project Work\Data\21022013AA.txt"
the \210 will be interpreted as an octal escape code so you won't get the correct filename.
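A quick way to see this, sketched in Python 3 with a shortened path. (In Python 3 the full literal from the question would not even compile, because \U starts a Unicode escape, so only the last part of the path is used here.)

```python
plain = "Data\21022013AA.txt"      # \210 is read as a single octal escape
raw = r"Data\21022013AA.txt"       # raw string: the backslash is kept
escaped = "Data\\21022013AA.txt"   # escaped backslash: same as the raw string

assert plain == "Data\x8822013AA.txt"  # octal 210 is character 0x88
assert "\\" not in plain               # the backslash is gone
assert raw == escaped
print(repr(plain))
print(repr(raw))
```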
Either of these will work:
r"D:\Users\spx9gs\Project Work\Data\21022013AA.txt"
"D:\\Users\\spx9gs\\Project Work\\Data\\21022013AA.txt"
I have a string which contains user input for a directory address on a linux system. I need to check if it is properly formatted and could be an address in Python 2.6. It's important to note that this is not on the current system so I can't check if it is there using os.path nor can I try to create the directories as the function will be run many times.
These strings will always be absolute paths, so my first thought was to look for a leading slash. From there I wondered about checking if the rest of the string only contains valid characters and does not contain any double slashes. This seems a little clunky, any other ideas?
I'm sure the question has been edited since I wrote this, but:
There is os.path.isabs(path), which will tell you whether the path is absolute or not.
Return True if path is an absolute pathname. On Unix, that means it begins with a slash, on Windows that it begins with a (back)slash after chopping off a potential drive letter.
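Since the check has to work for a Linux path regardless of the machine it runs on, one option is to use posixpath directly and add the extra checks from the question. This is only a sketch: the function name is made up, and the limits of 255 bytes per component and 4096 bytes per path are typical Linux values that I am assuming here, not something taken from your target system.

```python
import posixpath


def looks_like_linux_abs_path(path, name_max=255, path_max=4096):
    """Purely syntactic check; never touches the filesystem."""
    if not path or "\0" in path:
        return False            # empty, or contains a NUL byte
    if not posixpath.isabs(path):
        return False            # must start with a slash
    if "//" in path:
        return False            # no empty components
    if len(path.encode("utf-8")) >= path_max:
        return False            # assumed overall length limit
    # Every component must fit the assumed per-name limit.
    return all(len(part.encode("utf-8")) <= name_max
               for part in path.split("/"))
```

For example, `/usr/local/bin` passes, while `usr/bin` and `/a//b` are rejected.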