Python regular expression for Windows file path

The problem, and it may not be easily solved with a regex, is that I want to be able to extract a Windows file path from an arbitrary string. The closest that I have been able to come (I've tried a bunch of others) is using the following regex:
[a-zA-Z]:\\([a-zA-Z0-9() ]*\\)*\w*.*\w*
Which picks up the start of the file and is designed to look at patterns (after the initial drive letter) of strings followed by a backslash and ending with a file name, optional dot, and optional extension.
The difficulty is what happens next. Since the maximum path length is 260 characters, I only need to look at 260 characters beyond the start. But since spaces (and other characters) are allowed in file names, I would need to make sure that there are no additional backslashes that could indicate that the prior characters are the name of a folder and that what follows isn't the file name itself.
I am pretty certain that there isn't a perfect solution (the perfect being the enemy of the good) but I wondered if anyone could suggest a "best possible" solution?

Here's the expression I got, based on yours, that allows me to get the path on Windows: [a-zA-Z]:\\((?:[a-zA-Z0-9() ]*\\)*).* . An example of it being used is available here: https://regex101.com/r/SXUlVX/1
First, I changed the capture group from ([a-zA-Z0-9() ]*\\)* to ((?:[a-zA-Z0-9() ]*\\)*).
Your original expression captures each XXX\ segment one after another (e.g. Users\, then the next folder), so the group only holds the last segment it matched.
Mine wraps the repetition in a non-capturing group, (?:[a-zA-Z0-9() ]*\\)*, and captures the whole repetition at once, i.e. the concatenation XXX\YYYY\ZZZ\. As such, it allows me to get the full path.
The second change I made is related to the filename: I just match whatever remains with .* . Because the capture group before it is greedy, it has already consumed everything up to the last backslash it can match, so what remains is the file name. This allows me to take care of strange file names.
Another regex that would work would be : [a-zA-Z]:\\((?:.*?\\)*).* as shown in this example : https://regex101.com/r/SXUlVX/2
This time, I used .*?\\ to match the XXX\ parts of the path.
.*? will match in a non-greedy way : thus, .*?\\ will match the bare minimum of text followed by a back-slash.
Do not hesitate to ask if you have any questions regarding the expressions.
I'd also encourage you to try to see how well your expression works using : https://regex101.com . This also has a list of the different tokens you can use in your regex.
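As a quick check of the two patterns above, here is a minimal sketch that runs both against one sample string. The sample sentence and the helper name directory_part are illustrative assumptions, not part of the original answer. A single passing example says little about arbitrary text: the trailing .* swallows anything that follows the path on the same line, and the first pattern stops at folder names containing characters such as dots or hyphens.

```python
import re

# Pattern 1: folder names limited to letters, digits, parentheses and spaces.
STRICT = re.compile(r'[a-zA-Z]:\\((?:[a-zA-Z0-9() ]*\\)*).*')
# Pattern 2: any characters in a folder name (lazy match up to each backslash).
LOOSE = re.compile(r'[a-zA-Z]:\\((?:.*?\\)*).*')

def directory_part(pattern, text):
    """Return group 1 (the folders after the drive letter), or None if no path is found."""
    match = pattern.search(text)
    return match.group(1) if match else None

# Hypothetical input: a sentence that ends with a Windows path.
sample = r"The report is in C:\Program Files (x86)\Adobe\Acrobat Distiller\acrbd.exe"
print(directory_part(STRICT, sample))
print(directory_part(LOOSE, sample))
```

On this sample both calls return the folder chain with its trailing backslash; the file name itself is matched by .* but not captured.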
Edit: As my previous answer did not work (though I'll need to spend some time to find out exactly why), I looked for another way to do what you want, and I managed to do so using string splitting and joining.
The command is "\\".join(TARGETSTRING.split("\\")[1:-1]).
How this works: I split the original string into a list of substrings on the backslash. I then drop the first and last parts ([1:-1] keeps the 2nd element through the one before the last) and join the resulting list back into a string.
This works, whether the value given is a path or the full address of a file.
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred is a file path
Program Files (x86)\\Adobe\\Acrobat Distiller\\acrbd.exe fred\ is a directory path
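The split-and-join approach can be sketched as below. The function name folders_between and the two sample paths are assumptions for illustration, and the sketch assumes the string holds only the path, not a path embedded in other text.

```python
def folders_between(path):
    """Drop the drive (first part) and the last part, keep the folders in between."""
    parts = path.split("\\")
    return "\\".join(parts[1:-1])

# Hypothetical inputs: the same location given as a file path and as a directory path.
file_path = r"C:\Program Files (x86)\Adobe\Acrobat Distiller\acrbd.exe"
dir_path = "C:\\Program Files (x86)\\Adobe\\Acrobat Distiller\\"  # ends with a backslash

print(folders_between(file_path))
print(folders_between(dir_path))
```

For the directory path, the trailing backslash produces an empty last element, so [1:-1] drops that empty string instead of a file name and both calls give the same folder chain.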

Related

Text between delimiters starting from the end of the string

I'm really new to python and programming in general, and to practice I'm doing projects where I try to tackle problems from my day to day work, so please excuse me if this may be a silly question. I'm trying to combine a group of files located on a remote folder into a single monthly one based on the date, I've already combined files based on date so I think I can do that, but I'm having trouble with the regex to pick the date from the file name string, the string with the filepath is as follows
\\machinename123\main folder\subfolder\2021-01-24.csv
The file name will always have the same format since it's an automated process; only the date in the name changes. I was trying to pick the date from this string using a regex that selects the text between the last \ of the string and the . before the extension, so that I get 2021-01-24 as a result. At the level I'm at, though, regexes are like witchcraft and I don't really know what I'm doing. I've been trying for a few hours with no success. So far the closest I can get by trial and error is (?:[0-9\-]), but this selects all the numbers in the string, including the ones in the machine name. There is also the issue of not knowing why it works the way it works (for example, I know from testing that the ?: works, but I don't understand the theory behind it, so I couldn't replicate it in the future).
How can I make it ignore the other numbers on the string, or more specifically pick only the text between the last \ and the . from the csv, xlsx or whatever the format is?
I'd like the former option better, since it would allow me to learn how to make it do what I need it to do and not get the result by coincidence.
Thanks for any help

Use re.search() to find a pattern: <4 digits>-<2 digits>-<2 digits>.
import re
s = r'\\machinename123\main folder\subfolder\2021-01-24.csv'
m = re.search(r'\d{4}-\d{2}-\d{2}', s).group(0)

You can use the following regex, that summarizes the structure of your full path:
import re
# Two leading backslashes, one or more "name\" segments, then the date and the extension
filename_regex = re.compile(r'^\\\\(?:[\w ]+\\)+(\d{4}-\d{2}-\d{2})\.csv$')
m = filename_regex.match(r"\\machinename123\main folder\subfolder\2021-01-24.csv")
if m is not None:
    print(f'File found with date: {m.groups()[0]}')
else:
    print('Filename not of interest')
The output will be:
File found with date: 2021-01-24
filename_regex accepts a string starting with \\, followed by a repetition of characters (alphanumeric and underscores) and spaces followed by \, but with the final part corresponding to 4 digits, followed by a minus, then 2 digits, a minus again, 2 digits and the string .csv. The regular expression used here to match the date is very simple, but you can use a more complex one if you prefer.
Another, simpler approach would be to use the ntpath library, extracting the name of the file from the full path and applying the regular expression only to the name of the file:
import ntpath
import re
filename_regex = re.compile(r'^(\d{4}-\d{2}-\d{2})\.csv$')
filename = ntpath.basename(r"\\machinename123\main folder\subfolder\2021-01-24.csv")
m = filename_regex.match(filename)
m here will have the same value as before.
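For the more general request in the question (take whatever sits between the last \ and the extension), a regex may not be needed at all. The following is a minimal sketch using only ntpath and datetime from the standard library; the function name date_from_path is an assumption, and the date parsing step assumes the file name is always in YYYY-MM-DD form as described in the question.

```python
import ntpath
from datetime import datetime

def date_from_path(path):
    """Return the date encoded in a file name such as 2021-01-24.csv."""
    filename = ntpath.basename(path)              # text after the last backslash
    stem, _extension = ntpath.splitext(filename)  # split off '.csv', '.xlsx', ...
    # strptime raises ValueError if the stem is not a valid YYYY-MM-DD date.
    return datetime.strptime(stem, "%Y-%m-%d").date()

print(date_from_path(r"\\machinename123\main folder\subfolder\2021-01-24.csv"))
```

Getting a real date object back also makes the later grouping by month straightforward, since the month is available as an attribute of the result.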

How to select an entire entity around a regex without splitting the string first?

My project (unrelated to this question, just context) is a ML classifier, I'm trying to improve it and have found that when I stripped URLS from the text given to it, some of the URLS have been broken by spaces. For example:
https:// twitter.com/username/sta tus/ID
After I remove links that are not broken, I am left with things like www website com. I removed those with the following regular expression in Python:
tweet = re.sub('(www|http).*?(org |net |edu |com |be |tt |me |ms )','',tweet);
I've put a space after every one of them because this happens after the regular strip and text processing (so only working with parts of a URL separated by spaces) and theoretically we should only pick up the remainders of a broken link... not something like
http website strangeTLD .... communication
It's not perfect but it works, however I just thought that I might try to preemptively remove URLS from twitter only, since I know that the spaces that break the regular URL strip will always be in the same places, hoping this improves my classifier accuracy? This will get rid of the string of characters that occurs after a link... specifically pictures, which is a lot of my data.
Specifically, is there a way to select the entity surrounding/after:
pic.twitter.com/
or, in reference to the example I gave earlier, select the entity after the username broken by the space in status (I'm just guessing at this regex)...
http.*?twitter.com/*?/sta tus/
Thank you in advance! And for the record, I was given this dataset to work with; I am not sure why the URLs are almost all broken by spaces.

Yes, what you are talking about is called Positive Lookbehind and works using (?<=...), where the ellipsis should be replaced by what you want to skip.
E.g. if you want to select whatever comes after username in https://twitter.com/username/status/ID, just use
(?<=https:\/\/twitter\.com\/username\/).*
and you will get status/ID, like you can see with this live demo.
In this case I escaped the slashes / with backslashes. That is required in regex flavors that use / as the pattern delimiter; Python's re module does not need it, but the escapes are harmless. I also used the Kleene star operator, i.e. the asterisk, to match any number of occurrences of . (any character), just like you did.
What a positive lookbehind does is specify some mandatory text before the current position of your cursor; in other words, it puts the cursor after the expression you feed it (if the said text exists).
Of course this is not enough in your case, since username won't be a fixed string but a variable one. This is an additional complication, since lookbehinds in Python's re module must have a fixed width.
So you can just skip https://twitter.com/
(?<=https:\/\/twitter\.com\/).*
And then, via Python, create a substring
currentText = "username/status/ID"
result = currentText.split("/",1)[1] # returns status/ID
Test it in this demo (click "Execute"); a simple explanation of how this works is in the answer to this question (in short, you just split the string at the first slash character).
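Put together, the lookbehind and the split can be sketched as follows. The function name path_after_username is an assumption for illustration, and the sketch assumes the URL has already had its stray spaces removed.

```python
import re

# Fixed-width lookbehind: the match starts right after 'https://twitter.com/'.
AFTER_HOST = re.compile(r'(?<=https://twitter\.com/).*')

def path_after_username(url):
    """Return what follows the username, or None if the URL does not fit."""
    match = AFTER_HOST.search(url)
    if match is None:
        return None
    # 'username/status/ID' -> ['username', 'status/ID']
    parts = match.group(0).split("/", 1)
    return parts[1] if len(parts) == 2 else None

print(path_after_username("https://twitter.com/username/status/ID"))
```

The username can have any length here because it is handled by split() rather than by the lookbehind.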
As a sidenote, blanks/spaces aren't allowed in URLs and if necessary are usually encoded as %20 or + (see e.g. this answer). In other words, every URL you got can be safely stripped of spaces before processing, so... why didn't they do it?

Reg.sub regex help in Python to normalize directory/file to play nice with Windows

Very new here, and I am trying to modify some python code to normalize directory/file names for Windows using regular expression. I have searched and found lots of code examples, but haven’t quite figured out how to put it all together.
This is what I am trying to accomplish:
I need to remove all invalid Windows characters so directory/file names do not include: < > : " / \ | ? *
Windows also doesn’t seem to like spaces at the end of a directory/file name. Windows also doesn’t like periods at the end of directory names.
So, I need to get rid of ellipsis without affecting the extension. To clarify, when I say ellipsis, I am referring to a pattern of three periods, and NOT the single unicode character “Horizontal Ellipsis (U+2026)”. I have researched and found multiple ways of doing individual parts of this, but I cannot seem to get it all together and playing nice.
return unicode(re.sub(r'[<>:"/\\|?*]', "", filename))
This cleans up the names, but not the pattern of two or more periods.
return unicode(re.sub(r'[<>:"/\\|?*.]', "", filename))
This cleans up the names, but also affects the file extension.
[^\w\-_\. ]
This also seemed to be a viable alternative. It is a bit more restrictive than necessary, but I did find it easy to just keep adding specific characters I wanted to ignore.
\.{2,}
This is the piece I can’t seem to get to integrate with any of these methods. I understand that this should match two or more “.”, but leave a single “.” alone. But there are some situations where I “might” be left with a period at the end of a Windows directory name, which won’t work.
.*[.](?!mp3$)[^.]*$
I searched and found this specific snippet, which looks promising to match/ignore a specific extension. In my case, I want .mp3 left alone. Maybe a different way to go about things. And I think it might eliminate a potential problem of having a period at the end of a directory name.
Thank you for your time!
Edit: Additional Information Added
def normalize_filename(self, filename):
    """Remove invalid characters from filename"""
    return unicode(re.sub(r'[<>:"/\\|?*]', "", filename))

def get_outfile(self):
    """Returns output filename based on song information"""
    destination_dir = os.path.join(self.normalize_filename(self.info["AlbumArtist"]),
                                   self.normalize_filename(self.info["Album"]))
    filename = u"{TrackNumber:02d} - {Title}.mp3".format(**self.info)
    return os.path.join(destination_dir, self.normalize_filename(filename))
This is the relevant code I am trying to modify. The full code basically pulls song artist, album, and track descriptions out of a sqlite database file. Then based on that information, it creates an artist directory, album directory, and a mp3 file.
However, because of Windows naming restrictions, those names need to be normalized/sanitized.
Ideally I would like this to be done with a single re.sub, if it can be done.
return unicode(re.sub(r'[<>:"/\\|?*]', "", filename))
If there is another/better way to make this code work, I am open to it. But with my limited understanding, adding more complexity was beyond me, so I was trying to work within the bounds of what I currently understand. I have done a lot of reading over the past few days, but can’t quite accomplish what I would like to do.
For Example: “Ned’s Atomic Dustbin\ARE YOU NORMAL?\Not Sleeping Around” needs to become C:\Ned’s Atomic Dustbin\ARE YOU NORMAL\Not Sleeping Around.mp3
Another: “Green Day\UNO... DOS... TRÉ!\F*** Time” needs to become C:\Green Day\UNO DOS TRÉ\F Time.mp3
Another: “Incubus\A Crow Left Of The Murder…\Pistola” would become C:\Incubus\A Crow Left Of The Murder\Pistola.mp3
Tricky Example: “System Of A Down\B.Y.O.B.\B.Y.O.B.” to C:\System Of A Down\BYOB\BYOB.mp3. Windows wouldn’t care if it was B.Y.O.B, but the last period is what causes issues. So it would probably be best if the solution eliminated all “.”, except on the extension .mp3.

My answer is totally based on the text below (you typed, of course):
I need to remove all invalid Windows characters so directory/file names do not include: < > : " / \ | ? * Windows also doesn’t seem to like spaces at the end of a directory/file name. Windows also doesn’t like periods at the end of directory names.
So here we go (for file/directory):
unicode(re.sub(r'(\<|\>|\:|\"|\/|\\|\||\?|\*)', '', file/directory))
Explanation:
(\<|\>|\:|\"|\/|\\|\||\?|\*) <= matches all of your undesired chars
At this time you will have erased all of your undesired chars EXCEPT the spaces/dots at the end of the name.
For your file_name you can update its variable with
file_name = re.sub(r'( +)$', '', file_name)
( +)$ <= matches one or more spaces at the end of the string.
and you'll be done because there are no more restrictions besides that the name can't contain any spaces at its end (remember we already removed the special chars).
For directories, however, the name can end with neither periods nor spaces.
So the best way, in my opinion of course, is to implement a repeated procedure (recursive or a loop), one that stops only when:
dir_name == re.sub(r'( +|\.+)$', '', dir_name)
and dir_name keeps being updated with dir_name = re.sub(r'( +|\.+)$', '', dir_name) while the above statement is false.
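The procedure described above can be sketched as a loop. This is a minimal sketch assuming Python 3 (so no unicode() call); the names clean_file_name and clean_dir_name are illustrative and not part of the original code. It only covers the rules stated in this answer, so periods inside a name (such as the "UNO... DOS..." example from the question) are left untouched.

```python
import re

# Characters Windows does not allow in file or directory names.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*]')
# A run of spaces or a run of periods at the very end of the name.
TRAILING = re.compile(r'( +|\.+)$')

def clean_file_name(name):
    """Remove invalid characters, then trailing spaces."""
    name = INVALID_CHARS.sub('', name)
    return re.sub(r'( +)$', '', name)

def clean_dir_name(name):
    """Remove invalid characters, then strip trailing spaces/periods until stable."""
    name = INVALID_CHARS.sub('', name)
    while True:
        stripped = TRAILING.sub('', name)
        if stripped == name:   # nothing left to strip: done
            return name
        name = stripped

print(clean_dir_name("ARE YOU NORMAL?"))
print(clean_dir_name("B.Y.O.B. . "))
print(clean_file_name("F*** Time.mp3  "))
```

The loop is needed because a name can end in an alternating mix of periods and spaces, and each pass of the pattern removes only one run.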
Hope this helps you.

Regexp to match chords, issue with national accents

I am dealing with this problem. I have a *.txt file containing tens of songs. Each song might consist of
name
lines with chords
lines with lyrics
blank lines
I'm writing a Python script which reads the file line by line. I need to recognise the lines with chords. For that purpose I have decided to use regular expressions, since they look like a playful but strong tool for such tasks. I am new to regexp; I've done this tutorial (which I am rather fond of). I have written something like this
\b ?\(?([AC-Hac-h]{1})(#|##|b|bb)?(is|mi|maj|sus)?\d?[ \n(/\(\))?]
I am not very happy with that, since it does not do the job properly. One of the problems is that the language of the songs uses a lot of accents. The second one: the chords might come in pairs - e.g. C(D), h/e. You can see my approach here.
Note
For better readability in the final script I would split the regexp into several variables and then concatenate them.
Edit
After rereading my question I thought that my goal might not be clear enough. I would like to match different types of chords, for instance:
C, C#, Cis, c#, Cmaj, Cmi, Csus, C7, C#7, Db, Dbsus
Also, sometimes there might be (no more than two) chords next to each other, such as: C7/D7, Cmi(a). The best solution would be to catch those "pairs" together in one match, that is, match C7/D7 rather than C7 and D7. I think that with this additional condition it might be a bit more robust, but if it would be unnecessarily difficult I might go with the (I assume) easier version (meaning: matching C7 and D7 instead of C7/D7) and deal with this later separately.

Your Python script reads the text file line by line and you want to find out with a regular expression if the current line is a line with chords or with other information.
Perhaps it is enough to apply the regular expression ^[\t #()/\dAC-Hac-jmsu]+$ on each line. If the regular expression does not return a match, the line contains characters not being allowed in a line with chords. Perhaps this simple regular expression using only a single character class definition is enough.
But it could be that a line with a name or lyrics also matches the expression above. For your example this is not the case, but it could happen. In that case I would suggest first calling strip() on every line to remove spaces and tabs from its beginning and end, and then applying the following regular expression
^(?:[#()/\dAC-Hac-jmsu]{1,6}[\t ]+?)*[#()/\dAC-Hac-jmsu]{1,6}$
The difference is that each run of characters between spaces or tabs must now be 1 to 6 characters long. Longer runs are not allowed. With this additional rule there may be no more false positives when detecting lines with chords.
The weak point of this detection rule is definitely the letters: a name or lyric consisting only of letters allowed in chords would match too. A solution would be to create a list of all valid chord strings and use them in an OR expression. That would most likely avoid false positives from names or lyrics. With the complete list of chord strings at hand, it is most likely also possible to write the rule more compactly, without listing every chord string in the OR expression.
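The two expressions from this answer can be tried side by side with a small sketch like the one below. The function name is_chord_line and the sample lines are assumptions for illustration. Note that the character class as written contains a and c-j but not b, so chords with a flat such as Db are rejected by both patterns.

```python
import re

# Characters allowed in a chord token, exactly as in the answer's character class.
CHORD_CHARS = r'[#()/\dAC-Hac-jmsu]'

# First rule: the whole line consists of chord characters, spaces and tabs.
SIMPLE = re.compile(r'^[\t #()/\dAC-Hac-jmsu]+$')
# Second rule: tokens of 1 to 6 chord characters separated by spaces or tabs.
STRICT = re.compile(
    r'^(?:' + CHORD_CHARS + r'{1,6}[\t ]+?)*' + CHORD_CHARS + r'{1,6}$'
)

def is_chord_line(line, pattern=STRICT):
    """Strip the line, then test it against the chosen pattern."""
    return pattern.match(line.strip()) is not None

print(is_chord_line("  C7/D7   Cmi(a)  G\n"))          # chord line
print(is_chord_line("Hello darkness, my old friend"))  # lyrics
print(is_chord_line("chemise", SIMPLE), is_chord_line("chemise"))
```

The last line shows the difference between the two rules: a seven-letter word made only of allowed letters passes the first rule but fails the length limit of the second.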

regex regarding symbols in urls

I want to replace consecutive repeated symbols with just one, such as:
this is a dog???
to
this is a dog?
I'm using
str = re.sub(r"([^\s\w])(\s*\1)+", r"\1", str)
however I notice that this might replace symbols in urls that might happen in my text.
like http://example.com/this--is-a-page.html
Can someone give me some advice how to alter my regex?

So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.
Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.
Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind).
So you might come up with something like
(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+
which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
EDIT:
OK, you're already working on the parsed text, but it still might contain URLs.
Then try the following:
result = re.sub(
    r"""(?ix) # case-insensitive, verbose regex
    # Either match a URL
    # (protocol optional (if so, URL needs to start with www or ftp))
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&##/%=~_|$?!:,.]*[A-Z0-9+&##/%=~_|$])
    # or
    |
    # match repeated non-word characters
    (?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""",
    # and replace with both captured groups (one will always be empty)
    r"\g<URL>\g<rpt>", subject)
Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)...
Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can't reference a group in a replacement if it hasn't participated in the match. I give up... :(
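One way around the unmatched-group complaint is to pass a function as the replacement, so the code itself picks whichever group took part in the match. The sketch below assumes Python 3 and adapts the pattern above slightly (flags passed to re.compile, \s* restored, the # escaped inside the character classes); the names URL_OR_REPEAT and collapse_repeats are illustrative. For what it's worth, the re.sub documentation notes that since Python 3.5 unmatched groups in a replacement template are replaced with an empty string, so the template form should also work there.

```python
import re

URL_OR_REPEAT = re.compile(
    r"""
    # Either match a URL (and leave it alone)
    (?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
        [-A-Z0-9+&\#/%=~_|$?!:,.]*[A-Z0-9+&\#/%=~_|$])
    |
    # or match a non-word character repeated one or more times
    (?P<rpt>[^\s\w])(?:\s*(?P=rpt))+
    """,
    re.IGNORECASE | re.VERBOSE,
)

def collapse_repeats(text):
    """Collapse repeated symbols to one, except inside URLs."""
    # Exactly one of the two groups is set for each match.
    return URL_OR_REPEAT.sub(lambda m: m.group("URL") or m.group("rpt"), text)

print(collapse_repeats("this is a dog???"))
print(collapse_repeats("see http://example.com/this--is-a-page.html now!!!"))
```

Because the URL alternative is tried first at each position, the double hyphen inside the address is consumed as part of the URL and never reaches the repeated-symbol branch.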
