I'm writing a script that will take a list of file paths as input. I want the script to make sure the strings in the input file are, or at least appear to be, valid full Windows paths, that include the drive letter.
That being said, what's the best way to ensure that a string starts with any single letter, upper or lowercase, followed by a colon and a backslash?
I'm guessing the regex would look something like this:
[a-zA-Z]:\, but how do I make sure it checks for only one letter and that these are the first 3 characters in the string?
I appreciate it.
The ^ character matches the start of the string. Your character class will currently only match one letter, and you need to escape the \. So your final regex would be:
^[a-zA-Z]:\\
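As a rough sketch of how that pattern might be used from Python 3 (the helper name starts_with_drive is made up for illustration), note the raw string, which keeps Python from interpreting the backslashes before the regex engine sees them:

```python
import re

# ^ anchors at the start, [a-zA-Z] matches exactly one letter,
# and \\ in the regex matches one literal backslash.
DRIVE_PREFIX = re.compile(r'^[a-zA-Z]:\\')

def starts_with_drive(path):
    """Return True if path begins with a drive letter, a colon and a backslash."""
    return DRIVE_PREFIX.match(path) is not None
```

This only checks the first three characters; it says nothing about whether the rest of the path is valid.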
If you just want to check to make sure it starts with a drive letter, you could also use the built-in splitdrive:
import os

drive, path = os.path.splitdrive(filename)
if not drive:  # splitdrive returns '' (never None) when there is no drive
    raise ValueError('Filename does not include a drive!')
Edit: Thanks to jme, if you are not on a Windows system, import ntpath and replace the splitdrive call like this:
drive, path = ntpath.splitdrive(filename)
Note: since Python 2.7.8, splitdrive returns a 'drive' for UNC paths too.
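A minimal sketch of the ntpath approach, assuming Python 3 (has_drive is a hypothetical helper name). Because ntpath always applies Windows rules, it behaves the same on Linux or macOS:

```python
import ntpath

def has_drive(path):
    """Return True if Windows path rules find a drive (or UNC share) in path."""
    drive, _rest = ntpath.splitdrive(path)
    return bool(drive)  # drive is '' when nothing was found

# splitdrive separates the drive from the remainder of the path.
example = ntpath.splitdrive(r'C:\Users\me\file.txt')
```

Be aware that a drive-relative path such as C:file.txt also has a drive, so this check is looser than the regex, which additionally requires the backslash.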
Checks that the given path is a valid windows path based on criteria at http://msdn.microsoft.com/en-us/library/aa365247%28v=vs.85%29.aspx. I made this a little while ago out of frustration of not finding a good one online:
r'^(?:[a-zA-Z]:\\|\\\\?|\\\\\?\\|\\\\\.\\)?(?:(?!(CLOCK\$(\\|$)|(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9]| )(?:\..*|(\\|$))|.*\.$))(?:(?:(?![><:/"\\\|\?\*])[\x20-\u10FFFF])+\\?))*$'
So this is what ended up working for me.
is_path = re.match(r"^[a-zA-Z]:\\", file)
The problem, and it may not be easily solved with a regex, is that I want to be able to extract a Windows file path from an arbitrary string. The closest that I have been able to come (I've tried a bunch of others) is using the following regex:
[a-zA-Z]:\\([a-zA-Z0-9() ]*\\)*\w*.*\w*
Which picks up the start of the file and is designed to look at patterns (after the initial drive letter) of strings followed by a backslash and ending with a file name, optional dot, and optional extension.
The difficulty is what happens next. Since the maximum path length is 260 characters, I only need to look at 260 characters beyond the start. But since spaces (and other characters) are allowed in file names, I would need to make sure that there are no additional backslashes that could indicate that the prior characters are the name of a folder and that what follows isn't the file name itself.
I am pretty certain that there isn't a perfect solution (the perfect being the enemy of the good), but I wondered if anyone could suggest a "best possible" solution?
Here's the expression I got, based on yours, that allow me to get the path on windows : [a-zA-Z]:\\((?:[a-zA-Z0-9() ]*\\)*).* . An example of it being used is available here : https://regex101.com/r/SXUlVX/1
First, I changed the capture group from ([a-zA-Z0-9() ]*\\)* to ((?:[a-zA-Z0-9() ]*\\)*).
Your original expression captures each XXX\ segment one at a time (e.g. Users\), so the group only ever holds a single segment.
Mine wraps the non-capturing repetition (?:[a-zA-Z0-9() ]*\\)* in one capture group. This allows me to capture the concatenation XXX\YYYY\ZZZ\ as a whole and, as such, to get the full path.
The second change I made is related to the filename: I just match whatever follows the last backslash (the preceding group being greedy). This allows me to take care of strange file names.
Another regex that would work would be : [a-zA-Z]:\\((?:.*?\\)*).* as shown in this example : https://regex101.com/r/SXUlVX/2
This time, I used .*?\\ to match the XXX\ parts of the path.
.*? will match in a non-greedy way : thus, .*?\\ will match the bare minimum of text followed by a back-slash.
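To see what this second expression actually captures, here is a small sketch in Python 3 (find_directory_part is a hypothetical name, and the sample sentence in the usage note is invented):

```python
import re

# Group 1 collects every "segment\" after the drive; the trailing .* then
# takes whatever is left on the line.
PATH_RE = re.compile(r'[a-zA-Z]:\\((?:.*?\\)*).*')

def find_directory_part(text):
    """Return the captured directory portion, or None if no drive is found."""
    match = PATH_RE.search(text)
    return match.group(1) if match else None
```

For the text Open C:\Users\me\notes.txt now, group 1 is Users\me\, but the whole match runs to the end of the line (notes.txt now), because .* cannot tell where the file name stops.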
Do not hesitate to ask if you have any questions regarding the expressions.
I'd also encourage you to check how well your expression works using https://regex101.com . It also has a list of the different tokens you can use in your regex.
Edit: As my previous answer did not work (though I'll need to spend some time to find out exactly why), I looked for another way to do what you want, and I managed to do so using string splitting and joining.
The command is "\\".join(TARGETSTRING.split("\\")[1:-1]).
How this works: I split the original string into a list of substrings on the backslash. I then remove the first and last parts ([1:-1] keeps everything from the 2nd element to the one before the last) and join the resulting list back into a string.
This works, whether the value given is a path or the full address of a file.
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred is a file path
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred\ is a directory path
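A short sketch of the split-and-join command applied to both kinds of input, assuming Python 3 (middle_components is a hypothetical helper name, and the paths are examples only):

```python
def middle_components(path):
    """Drop the drive (first part) and the last part, keeping the folders between."""
    return "\\".join(path.split("\\")[1:-1])

file_path = r"C:\Program Files (x86)\Adobe\Acrobat Distiller\acrbd.exe"
# A raw string cannot end in a backslash, so this one uses escaped backslashes.
dir_path = "C:\\Program Files (x86)\\Adobe\\Acrobat Distiller\\"

from_file = middle_components(file_path)
from_dir = middle_components(dir_path)
```

Both calls give the same folder chain: for the file path the last part dropped is acrbd.exe, and for the directory path it is the empty string after the final backslash.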
I created about 200 csv files in Python and now need to download them all.
I created the files from a single file using:
g = df.groupby("col")
for n,g in df.groupby('col'):
    g.to_csv(n+'stars'+'.csv')
When I try to use this same statement to export to my machine I get a syntax error and I'm not sure what I'm doing wrong:
g = df.groupby("col")
for n,g in df.groupby('col'):
g.to_csv('C:\Users\egagne\Downloads\'n+'stars'+'.csv'')
Error:
File "<ipython-input-27-43a5bfe55259>", line 3
g.to_csv('C:\Users\egagne\Downloads\'n+'stars'+'.csv'')
^
SyntaxError: invalid syntax
I'm in Jupyter lab, so I can download each file individually but I really don't want to have to do that.
You're possibly mixing up integers and strings, and the use of backslash in literals is dangerous anyway. Consider using the following
import os
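and, inside the loop,
    f_name = os.path.join('C:\\', 'Users', 'egagne', 'Downloads', str(n) + 'stars.csv')
    g.to_csv(f_name)
with os.path.join taking care of the backslashes between components for you. Note that the drive needs its own trailing backslash: os.path.join('C:', 'Users') yields the drive-relative path C:Users.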
g.to_csv('C:\Users\egagne\Downloads\'n+'stars'+'.csv'')
needs to be
g.to_csv('C:\\Users\\egagne\\Downloads\\'+n+'stars.csv').
There were two things wrong -- the backslash is an escape character so if you put a ' after it, it will be treated as part of your string instead of a closing quote as you intended it. Using \\ instead of a single \ escapes the escape character so that you can include a backslash in your string.
Also, you did not pair your quotes correctly. n is a variable name but from the syntax highlighting in your question it is clear that it is part of the string. Similarly you can see that stars and .csv are not highlighted as part of a string, and the closing '' should be a red flag that something has gone wrong.
Edit: I addressed what is causing the problem but Ami Tavory's answer is the right one -- though you know this is going to run on windows it is a better practice to use os.path.join() with directory names instead of writing out a path in a string. str(n) is also the right way to go if you are at all unsure about the type of n.
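For illustration, here is a sketch of building those file names with a join instead of string concatenation. It uses ntpath (the Windows flavour of os.path) so the result is the same on any platform; csv_path is a hypothetical helper name and the directory is the one from the question:

```python
import ntpath

def csv_path(group_name):
    """Build the output path for one group; str() guards against non-string keys."""
    return ntpath.join('C:\\', 'Users', 'egagne', 'Downloads',
                       str(group_name) + 'stars.csv')

# A bare 'C:' is drive-relative: join does not insert a backslash after it.
drive_relative = ntpath.join('C:', 'Users')
```

Inside the groupby loop this would be used as g.to_csv(csv_path(n)). On Windows itself, os.path.join behaves the same way as ntpath.join.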
Very new here, and I am trying to modify some python code to normalize directory/file names for Windows using regular expression. I have searched and found lots of code examples, but haven’t quite figured out how to put it all together.
This is what I am trying to accomplish:
I need to remove all invalid Windows characters so directory/file names do not include: < > : " / \ | ? *
Windows also doesn’t seem to like spaces at the end of a directory/file name. Windows also doesn’t like periods at the end of directory names.
So, I need to get rid of ellipses without affecting the extension. To clarify, when I say ellipsis, I am referring to a pattern of three periods, and NOT the single unicode character “Horizontal Ellipsis (U+2026)”. I have researched and found multiple ways of doing individual parts of this, but I cannot seem to get it all together and playing nice.
return unicode(re.sub(r'[<>:"/\\|?*]', "", filename))
This cleans up the names, but not the pattern of two or more periods.
return unicode(re.sub(r'[<>:"/\\|?*.]', "", filename))
This cleans up the names, but also affects the file extension.
[^\w\-_\. ]
This also seemed to be a viable alternative. It is a bit more restrictive than necessary, but I did find it easy to just keep adding specific characters I wanted to ignore.
\.{2,}
This is the piece I can’t seem to get to integrate with any of these methods. I understand that this should match two or more “.”, but leave a single “.” alone. But there are some situations where I “might” be left with a period at the end of a Windows directory name, which won’t work.
.*[.](?!mp3$)[^.]*$
I searched and found this specific snippet, which looks promising to match/ignore a specific extension. In my case, I want .mp3 left alone. Maybe a different way to go about things. And I think it might eliminate a potential problem of having a period at the end of a directory name.
Thank you for your time!
Edit: Additional Information Added
def normalize_filename(self, filename):
    """Remove invalid characters from filename"""
    return unicode(re.sub(r'[<>:"/\\|?*]', "", filename))

def get_outfile(self):
    """Returns output filename based on song information"""
    destination_dir = os.path.join(self.normalize_filename(self.info["AlbumArtist"]),
                                   self.normalize_filename(self.info["Album"]))
    filename = u"{TrackNumber:02d} - {Title}.mp3".format(**self.info)
    return os.path.join(destination_dir, self.normalize_filename(filename))
This is the relevant code I am trying to modify. The full code basically pulls song artist, album, and track descriptions out of a sqlite database file. Then based on that information, it creates an artist directory, album directory, and a mp3 file.
However, because of Windows naming restrictions, those names need to be normalized/sanitized.
Ideally I would like this to be done with a single re.sub, if it can be done.
return unicode(re.sub(r'[<>:"/\\|?*]', "", filename))
If there is another/better way to make this code work, I am open to it. But with my limited understanding, adding more complexity was beyond me, so I was trying to work within the bounds of what I currently understand. I have done a lot of reading over the past few days, but can’t quite accomplish what I would like to do.
For Example: “Ned’s Atomic Dustbin\ARE YOU NORMAL?\Not Sleeping Around” needs to become C:\Ned’s Atomic Dustbin\ARE YOU NORMAL\Not Sleeping Around.mp3
Another: “Green Day\UNO... DOS... TRÉ!\F*** Time” needs to become C:\Green Day\UNO DOS TRÉ\F Time.mp3
Another: “Incubus\A Crow Left Of The Murder…\Pistola” would become C:\Incubus\A Crow Left Of The Murder\Pistola.mp3
Tricky Example: “System Of A Down\B.Y.O.B.\B.Y.O.B.” to C:\System Of A Down\BYOB\BYOB.mp3. Windows wouldn’t care if it was B.Y.O.B, but the last period is what causes issues. So it would probably be best if the solution eliminated all “.”, except on the extension .mp3.
My answer is totally based on the text below (you typed, of course):
I need to remove all invalid Windows characters so directory/file names do not include: < > : " / \ | ? * Windows also doesn’t seem to like spaces at the end of a directory/file name. Windows also doesn’t like periods at the end of directory names.
So here we go (for file/directory):
unicode(re.sub(r'(\<|\>|\:|\"|\/|\\|\||\?|\*)', '', name))  # name is the file or directory name
Explanation:
\<|\>|\:|\"|\/|\\|\||\?|\* <= matches all of your undesired chars
At this time you will have erased all of your undesired chars EXCEPT the spaces/dots at the end of the name.
For your file_name you can update its variable with
file_name = re.sub(r'( +)$', '', file_name)
( +)$ <= matches one or more spaces at the end of the string.
and you'll be done, because there are no more restrictions besides that the name can't contain any spaces at its end (remember we already removed the special chars).
For directories, however, the name can end with neither periods nor spaces.
So the best way, in my opinion of course, is to implement a recursive procedure, one that stops only when:
dir_name == re.sub(r'( +|\.+)$', '', dir_name)
and dir_name keeps being updated with dir_name = re.sub(r'( +|\.+)$', '', dir_name) while the above statement is false.
Hope this helps you.
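Putting the steps above together, here is one possible sketch in Python 3 (so without unicode()). The function name normalize_component is made up, and removing runs of two or more periods is an assumption taken from the question rather than from this answer:

```python
import re

INVALID_CHARS = re.compile(r'[<>:"/\\|?*]')  # characters Windows forbids in names
DOT_RUNS = re.compile(r'\.{2,}')             # "..." style runs, not single dots

def normalize_component(name):
    """Sanitize one directory or file name (a single component, not a whole path)."""
    name = INVALID_CHARS.sub('', name)
    name = DOT_RUNS.sub('', name)
    # rstrip removes any mix of trailing spaces and periods in one call, which
    # is what repeating re.sub(r'( +|\.+)$', '', name) until it stops changing does.
    return name.rstrip(' .')
```

For file names, one option is to run this on the title alone and append the .mp3 extension afterwards, so the extension's period is never touched. Characters such as ! are legal on Windows and are left in place.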
I want to open a file in python 3.5 in its default application, specifically 'screen.txt' in Notepad.
I have searched the internet, and found os.startfile(path) on most of the answers. I tried that with the file's path os.startfile(C:\[directories n stuff]\screen.txt) but it returned an error saying 'unexpected character after line continuation character'. I tried it without the file's path, just the file's name but it still didn't work.
What does this error mean? I have never seen it before.
Please provide a solution for opening a .txt file that works.
EDIT: I am on Windows 7 on a restricted (school) computer.
It's hard to be certain from your question as it stands, but I bet your problem is backslashes.
[EDITED to add:] Or actually maybe it's something simpler. Did you put quotes around your pathname at all? If not, that will certainly not work -- but once you do, you will find that then you need the rest of what I've written below.
In a Windows filesystem, the backslash \ is the standard way to separate directories.
In a Python string literal, the backslash \ is used for putting things into the string that would otherwise be difficult to enter. For instance, if you are writing a single-quoted string and you want a single quote in it, you can do this: 'don\'t'. Or if you want a newline character, you can do this: 'First line.\nSecond line.'
So if you take a Windows pathname and plug it into Python like this:
os.startfile('C:\foo\bar\baz')
then the string actually passed to os.startfile will not contain those backslashes; it will contain a form-feed character (from the \f) and two backspace characters (from the \bs), which is not what you want at all.
You can deal with this in three ways.
1. You can use forward slashes instead of backslashes. Although Windows prefers backslashes in its user interface, forward slashes work too, and they don't have special meaning in Python string literals.
2. You can "escape" the backslashes: two backslashes in a row mean an actual backslash. os.startfile('C:\\foo\\bar\\baz')
3. You can use a "raw string literal". Put an r before the opening single or double quotes. This will make backslashes not get interpreted specially. os.startfile(r'C:\foo\bar\baz')
The last is maybe the nicest, except for one annoying quirk: a backslash before a quote still keeps that quote from ending a raw string literal (so r'don\'t' is legal, though the backslash stays in the string), which means you can't end a raw string literal with a backslash.
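The differences between these spellings can be checked without touching the filesystem. A small sketch, using the same made-up C:\foo\bar\baz path as above:

```python
broken = 'C:\foo\bar\baz'        # \f and \b are interpreted as escape sequences
forward = 'C:/foo/bar/baz'       # forward slashes need no escaping
escaped = 'C:\\foo\\bar\\baz'    # each \\ becomes one real backslash
raw = r'C:\foo\bar\baz'          # raw literal keeps backslashes as typed

# A raw literal cannot end in a backslash, so append an escaped one instead.
trailing = r'C:\foo\bar' + '\\'
```

Here escaped and raw are the same 14-character string, while broken is only 11 characters long and contains no backslash at all.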
The recommended way to open a file with the default program is os.startfile. You can do something a bit more manual using os.system or subprocess though:
os.system(r'start ' + path_to_file)
or
subprocess.Popen('{start} {path}'.format(
start='start', path=path_to_file), shell=True)
Of course, this won't work cross-platform, but it might be enough for your use case.
For example, I created a file "test file.txt" on my D: drive, so the file path is 'D:/test file.txt'.
Now I can open it with its associated program using this script:
import os
os.startfile('d:/test file.txt')
I have a string which contains user input for a directory address on a linux system. I need to check if it is properly formatted and could be an address in Python 2.6. It's important to note that this is not on the current system so I can't check if it is there using os.path nor can I try to create the directories as the function will be run many times.
These strings will always be absolute paths, so my first thought was to look for a leading slash. From there I wondered about checking if the rest of the string only contains valid characters and does not contain any double slashes. This seems a little clunky, any other ideas?
Sure the question has been edited since writing this but:
There is the os.path.isabs(PATH) which will tell you if the path is absolute or not.
Return True if path is an absolute pathname. On Unix, that means it begins with a slash, on Windows that it begins with a (back)slash after chopping off a potential drive letter.
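Since the path being checked belongs to a Linux system that may not be the one running the script, one option is posixpath, which applies the Unix rules regardless of platform. A minimal sketch (looks_like_absolute_posix_dir is a hypothetical name, and rejecting double slashes is the asker's own rule, not something Linux requires):

```python
import posixpath

def looks_like_absolute_posix_dir(path):
    """Purely syntactic check; nothing on disk is read or created."""
    if not posixpath.isabs(path):  # must begin with a slash
        return False
    if '\0' in path:               # NUL can never appear in a Unix path
        return False
    if '//' in path:               # asker's extra rule: no double slashes
        return False
    return True
```

This should also work unchanged on Python 2.6, since posixpath.isabs exists there as well.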