Three python regular expressions around underscores - python

I'm helping someone with some file renaming at work using an application that supports Python regular expression syntax. I tried a few expressions found on forums, like ^[^_]+(?=_) for a) below, but it didn't work properly, and neither did some others. So I figured I should reach out to someone who actually knows what they're doing. Thanks for your help.
a) in the first expression I have to find all characters before the first underscore in patterns like this:
cannon_mac_23567_prsln_333
jones_james_343342_prsln_333
smith_john_223462_prsln_333
so, I have to get cannon, jones, and smith
b) in a separate expression I have to find all characters between the first and second underscore. so, I need to find mac, james, and john in the examples above.
c) in the last expression I have to find the first underscore
the way the renaming app works I have to do these regular expressions in three parts, like the above. thanks.

Well, you could do it without regular expressions entirely, as you know your delimiter is the underscore.
Use the str.split and str.index methods.
'smith_john_223462_prsln_333'.split('_')[0]  # extracts 'smith'
'smith_john_223462_prsln_333'.split('_')[1]  # extracts 'john'
'smith_john_223462_prsln_333'.index('_')  # position of the first underscore
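The same idea can be wrapped in a small helper. This is only a sketch: the name split_name is made up for illustration, and it assumes every filename contains at least two underscores. Passing a maxsplit of 2 stops splitting as soon as the first two fields are known.

```python
def split_name(filename):
    """Return (first field, second field, index of the first underscore)."""
    # maxsplit=2 yields exactly three pieces here: field 1, field 2, the rest.
    first, second, _rest = filename.split('_', 2)
    return first, second, filename.index('_')


for name in ('cannon_mac_23567_prsln_333',
             'jones_james_343342_prsln_333',
             'smith_john_223462_prsln_333'):
    print(split_name(name))
```

If a name has fewer than two underscores, the unpacking raises ValueError, which may or may not be what the renaming job wants.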

I'd use:
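1. ^([^_]+)_
2. _([^_]+)_
3. ^[^_]+_
Use re.match for 1 and 3, as it only matches at the beginning of the string; 2 needs re.search, because it starts at the first underscore. In 3, the first underscore is the last character of the match.
[Edit: As Cthulhu pointed out, you might be better off not using regular expressions for this, as it's faster and easier to use string methods]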
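For reference, here is a hedged sketch of how the three lookups could be checked with the re module. The constant names are invented for this example, and all three patterns are anchored variants so that re.match works for each of them; whether the renaming application reads capture groups the same way is an assumption.

```python
import re

# a) text before the first underscore
FIRST_FIELD = re.compile(r'^([^_]+)_')
# b) text between the first and second underscore
SECOND_FIELD = re.compile(r'^[^_]+_([^_]+)_')
# c) the first underscore itself, captured so its position can be read
FIRST_UNDERSCORE = re.compile(r'^[^_]+(_)')

line = 'jones_james_343342_prsln_333'
first = FIRST_FIELD.match(line).group(1)
second = SECOND_FIELD.match(line).group(1)
underscore_at = FIRST_UNDERSCORE.match(line).start(1)
print(first, second, underscore_at)
```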

Right, I misunderstood your question at first. While the str.split would definitely be a more elegant way to solve this, here are three regular expressions to suit your needs. I have no idea whether or not this application of yours will work with them. So take this with a grain of salt.
Please take a look at the re library and the MatchObject.span() for further information.
As a single regex:
import re
line = "cannon_mac_23567_prsln_333"
In [1812]: match = re.match(r"(.+?)(\_)(.+?)\_", line)
In [1813]: match.groups()
Out[1813]: ('cannon', '_', 'mac')
In [1814]: match.span(2)[0]  # start of the second group: the first occurrence of _
Out[1814]: 6
In [1815]: line[6]
Out[1815]: '_'
Separated into a, b, c:
a:
import re
line = "cannon_mac_23567_prsln_333"
In [1707]: match = re.match(r"(.+?)\_", line)
In [1708]: match.groups()
Out[1708]: ('cannon',)
b:
In [1712]: match = re.match(r".+?\_(.+?)\_", line)
In [1713]: match.groups()
Out[1713]: ('mac',)
c: The last one uses re.search for simplicity. MatchObject.span() returns a tuple of positions (start, end)
In [1763]: match = re.search(r"\_", line)
In [1764]: match.span()[0]
Out[1764]: 6
In [1765]: line[6]
Out[1765]: '_'

Related

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that because of the parentheses in (941), the regex is treating 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but first I need help finding the 2nd (941) so that I can then apply the 3rd step to it.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following: (?:...), (941){1}, and escaping the parentheses with \ to make them literal, like this: \(941\), with no useful results. Maybe I am using them wrong.
I just wanted to know if this can be done using regex. I thought it might be useful for others too, or a good share for future viewers.
Thanks!
Assuming:
You want to avoid matching only digits;
Want to match a substring made of word-characters (thus including possible digits);
Try to escape the variable and use it in the regular expression through f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.
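To see what re.escape changes, here is a small sketch using the string from the question. The variable names are chosen for this example only, and the [A-Za-z]+ class is a simplifying assumption about what counts as a word.

```python
import re

s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'

escaped = re.escape(num)                  # backslash-escapes the parentheses
unescaped_hits = re.findall(num, s)       # parentheses act as a capture group
escaped_hits = re.findall(escaped, s)     # parentheses are matched literally

# Capture a purely alphabetic line that directly follows the escaped marker.
words = re.findall(rf'{escaped}\n([A-Za-z]+)\n', s)

print(escaped)
print(unescaped_hits)
print(escaped_hits)
print(words)
```

Without escaping, findall returns only the digits, because the parentheses are read as a group; with escaping, the literal text (941) is matched.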

Python regular expression issue

I'm trying to use the re module in a way that it will return a bunch of characters until a particular string follows an individual character. The re documentation seems to indicate that I can use (?!...) to accomplish this. The example that I'm currently wrestling with:
str_to_search = 'abababsonab, etc'
first = re.search(r'(ab)+(?!son)', str_to_search)
second = re.search(r'.+(?!son)', str_to_search)
first.group() is 'abab', which is what I'm aiming for. However, second.group() returns the entire str_to_search string, despite the fact that I'm trying to make it stop at 'ababa', as the subsequent 'b' is immediately followed by 'son'. Where am I going wrong?
It's not the simplest thing, but you can capture a repeating sequence of "a character not followed by 'son'". This repeated expression should be in a non-capturing group, (?: ... ), so it doesn't mess with your match results. (You'd end up with an extra match group)
Try this:
import re
str_to_search = 'abababsonab, etc'
second = re.search(r'(?:.(?!son))+', str_to_search)
print(second.group())
Output:
ababa
See it here: http://ideone.com/6DhLgN
This should work:
second = re.search(r'(.(?!son))+', str_to_search)
#output: 'ababa'
Not sure what you are trying to do, but:
check out str.partition
'.+?' is the minimal (non-greedy) matcher; a plain '.+' is greedy and grabs everything
read the docs for group(...) and groups(...), especially when passing a group number
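The hints above can be sketched as follows. The original pattern matches the whole string because the greedy .+ first consumes everything and the negative lookahead is then checked only once, at the end of the string, where nothing follows. The example assumes the goal is "everything before the b that precedes son"; the variable names are illustrative.

```python
import re

str_to_search = 'abababsonab, etc'

# Greedy: the lookahead is tested only after .+ has swallowed the whole string.
greedy = re.search(r'.+(?!son)', str_to_search).group()

# Lazy match that stops right before the first 'bson'.
lazy = re.search(r'.+?(?=bson)', str_to_search).group()

# str.partition splits on the first occurrence of the separator.
head, sep, tail = str_to_search.partition('bson')

print(greedy)
print(lazy)
print(head)
```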

Regex - Using * with a set of characters

I'm fairly new at regex, and I've run into a problem that I cannot figure out:
I am trying to match a set of characters that start with an arbitrary number of A-Z, 0-9, and _ characters that can optionally be followed by a number enclosed in a single set of parentheses and can be separated from the original string by a space (or not)
Examples of what this should find:
_ABCD1E
_123FD(13)
ABDF1G (2)
This is my current regex expression:
[A-Z_0-9]+\s*\({0,1}[\d]*\){0,1}
It's finding everything just fine, but a problem exists if I have the following:
_ABCDE )
It should only grab _ABCDE and not the " )" but it currently grabs '_ABCDE )'
Is there some way I can grab the (#) but not get extra characters if that entire pattern does not exist?
If possible, please explain syntax as I am aiming to learn, not just get the answer.
ANSWER: The following code is working for what I needed so far:
[A-Z_0-9]+(\s*\([\d]+\)){0,1}
# or, as has been mentioned, the above can be simplified
# and cleaned up a bit to be
[A-Z_0-9]+(\s*\(\d+\))?
# The [] around \d are unnecessary and {0,1} is equivalent to ?
Adding the parentheses around the (#) pattern allows for the use of ? or {0,1} on the entire pattern. I also changed the [\d]* to be [\d]+ to ensure at least one number inside of the parentheses.
Thanks for the fast answers, all!
Your regex says that each paren (open & closed) may or may not be there, INDEPENDENTLY. Instead, you should say that the number-enclosed-in-parens may or may not be there:
(\([\d]*\)){0,1}
Note that this allows for there to be nothing in the parens; that's what your regex said, but I'm not clear that's what you actually want.
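A quick way to see the difference is to run both patterns against the examples from the question. This is a sketch; the names independent and grouped are made up here, and the grouped pattern uses \d+ so that empty parentheses are not accepted.

```python
import re

# Pattern from the question: each parenthesis is optional on its own.
independent = re.compile(r'[A-Z_0-9]+\s*\({0,1}[\d]*\){0,1}')
# Space, parentheses and number are optional only as one unit.
grouped = re.compile(r'[A-Z_0-9]+(?:\s*\(\d+\))?')

samples = ('_ABCD1E', '_123FD(13)', 'ABDF1G (2)', '_ABCDE )')
independent_hits = [independent.search(text).group() for text in samples]
grouped_hits = [grouped.search(text).group() for text in samples]

print(independent_hits)
print(grouped_hits)
```

Only the last sample differs: the first pattern keeps the stray " )", the grouped one stops after _ABCDE.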
how about
^[A-Z0-9_]+\s*(\([0-9]+\))?$
btw, from your example, the first part accepts not only [A-Z_], but also [0-9]
This seems to do the job.
[0-9A-Z_]+(?:\s*\([0-9]+\))?
It seems like you want the following regex:
^[A-Z\d_]+(\s*\(\d+\))?$
I used a non-capturing group so the optional part does not show up as a separate group in the results:
>>> import re
>>> pattern = r'[A-Z0-9_]+(?:\s*\(\d+\))?'
>>> l = ['_ABCD1E', '_123FD(13)', 'ABDF1G (2)', '_ABCDE )', 'A_B (15)', 'E (345']
>>> [re.search(pattern , i).group() for i in l]
['_ABCD1E', '_123FD(13)', 'ABDF1G (2)', '_ABCDE', 'A_B (15)', 'E']

Accessing matched substrings when substituting using regular expressions in Python

I want to match two regular expressions A and B where A and B appear as 'AB'. I want to then insert a space between A and B so that it becomes 'A B'.
For example, if A = [0-9] and B = !+, I want to do something like the following.
match = re.sub('[0-9]!+', '[0-9] !+', input_string)
But, this obviously does not work as this will replace any matches with a string '[0-9] !+'.
How do I do this in regular expressions (preferably in one line)? Or does this require several tedious steps?
Use the groups!
match = re.sub(r'([0-9])(!+)', r'\1 \2', input_string)
\1 and \2 indicate the first and second parenthesised fragment. The prefix r is used to keep the \ character intact.
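A small runnable sketch of the backreference approach. The sample input is invented, and the named-group variant with \g<name> is shown only as an alternative spelling of the same substitution.

```python
import re

input_string = 'Wow5!!! Only 3! left'

# Numbered backreferences: \1 is the digit, \2 is the run of exclamation marks.
spaced = re.sub(r'([0-9])(!+)', r'\1 \2', input_string)

# The same substitution with named groups and \g<name> references.
named = re.sub(r'(?P<digit>[0-9])(?P<bangs>!+)',
               r'\g<digit> \g<bangs>', input_string)

print(spaced)
print(named)
```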
Suppose the input string is "I have 5G network" and you want whitespace between 5 and G, i.e. whenever there are expressions like G20 or AK47, you want to separate the digits from the letters ("I have 5 G network"). You might try to replace one regular expression with another, something like this:
re.sub(r'\w\d',r'\w \d',input_string)
But this won't work as the substituting string will not retain the string caught by the first regular expression.
Solution:
It can be easily solved by accessing the groups in the regex substitution. This method will work well if you want to add something to the spotted groups.
re.sub(r"(\..*$)", r"_BACK\1", "my_file.jpg")  # 'my_file_BACK.jpg'
re.sub(r'(\d+)', r'<num>\1</num>', "I have 25 cents")  # 'I have <num>25</num> cents'
You can use this method to solve your question as well by capturing two groups instead of one.
re.sub(r"([A-Z])(\d)",r"\1 \2",input_string)
Another way to do it, is by using lambda functions:
re.sub(r"(\w\d)",lambda d: d.group(0)[0]+' '+d.group(0)[1],input_string)
And another way of doing it is by using lookarounds (a lookbehind plus a lookahead):
re.sub(r"(?<=[A-Z])(?=\d)",r" ",input_string)
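The three variants can be compared side by side. This sketch uses an invented sample string; on it all three agree, but the \w\d version is looser because \w also matches digits, so it can split a run of digits.

```python
import re

input_string = 'AK47 and G20'

# Two capture groups, re-inserted with a space between them.
by_groups = re.sub(r'([A-Z])(\d)', r'\1 \2', input_string)

# One group, rebuilt character by character in a callback.
by_lambda = re.sub(r'(\w\d)',
                   lambda d: d.group(0)[0] + ' ' + d.group(0)[1],
                   input_string)

# Zero-width match between an uppercase letter and a digit.
by_lookaround = re.sub(r'(?<=[A-Z])(?=\d)', ' ', input_string)

print(by_groups, by_lambda, by_lookaround, sep='\n')

# Caveat of \w\d: digit pairs match too, so longer numbers get split.
split_digits = re.sub(r'(\w\d)',
                      lambda d: d.group(0)[0] + ' ' + d.group(0)[1],
                      'A123')
print(split_digits)
```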

How to use ? and ?: and : in REGEX for Python?

I understand that
* = "zero or more"
? = "zero or more" ...what's the difference?
Also, ?: << my book uses this, it says its a "subtlety" but I don't know what exactly these do!
As Manu already said, ? means "zero or one time". It is the same as {0,1}.
And by ?:, you probably meant (?:X), where X is some other string. This is called a "non-capturing group".
Normally when you wrap parentheses around something, you group what is matched by those parentheses. For example, the regex .(.).(.) matches any 4 characters (except line breaks) and stores the second character in group 1 and the fourth character in group 2. However, when you do: .(?:.).(.) only the fourth character is stored in group 1; whatever (?:.) matches is consumed, but not "remembered".
A little demo:
import re
m = re.search(r'.(.).(.)', '1234')
print(m.group(1))
print(m.group(2))
# output:
# 2
# 4
m = re.search(r'.(?:.).(.)', '1234')
print(m.group(1))
# output:
# 4
# output:
# 4
You might ask yourself: "why use this non-capturing group at all?". Well, sometimes you want to make an OR between two strings. For example, to match the string "www.google.com" or "www.yahoo.com", you could write www\.google\.com|www\.yahoo\.com, but shorter would be www\.(google|yahoo)\.com, of course. But if you're not going to do something useful with what is being captured by this group (the string "google" or "yahoo"), you might as well use a non-capturing group: www\.(?:google|yahoo)\.com. When the regex engine does not need to "remember" the substring "google" or "yahoo", your app/script will run faster. Of course, it wouldn't make much difference with relatively small strings, but when your regex and string(s) get larger, it probably will.
And for a better example to use non-capturing groups, see Chris Lutz's comment below.
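One place where the choice visibly changes results is re.findall, which returns the group contents when the pattern has a capturing group and the whole match otherwise. A short sketch with an invented sample sentence:

```python
import re

text = 'Visit www.google.com or www.yahoo.com today'

# Capturing group: findall returns only what the group captured.
capturing = re.findall(r'www\.(google|yahoo)\.com', text)

# Non-capturing group: findall returns each whole match.
non_capturing = re.findall(r'www\.(?:google|yahoo)\.com', text)

print(capturing)
print(non_capturing)
```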
?: << my book uses this, it says its a "subtlety" but I don't know what exactly these do!
If that’s indeed what your book says, then I advise getting a better book.
Inside parentheses (more precisely: right after an opening parenthesis), ? has another meaning. It starts a group of options which count only for the scope of the parentheses. ?: is a special case of these options. To understand this special case, you must first know that parentheses create capture groups:
a(.)c
This is a regular expression that matches any three-letter string starting with a and ending with c. The middle character is (more or less) arbitrary. Since you put it in parentheses, you can capture it:
m = re.search('a(.)c', 'abcdef')
print(m.group(1))
This will print b, since m.group(1) captures the content of the first parentheses (group(0) captures the whole hit, here abc).
Now, consider this regular expression:
a(?:.)c
No capture is made here – this is what ?: after an opening parenthesis means. That is, the following code will fail:
print(m.group(1))
Because there is no group 1!
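A minimal sketch of that failure, assuming the match object comes from searching 'abcdef' with the non-capturing pattern. The whole match is still available as group 0; only the numbered group is missing.

```python
import re

m = re.search(r'a(?:.)c', 'abcdef')

whole = m.group(0)      # the whole match is still available
captured = m.groups()   # empty tuple: nothing was captured

try:
    m.group(1)
    failed = False
except IndexError as exc:
    failed = True
    print('IndexError:', exc)

print(whole, captured)
```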
? = zero or one
you use (?:) for grouping w/o saving the group in a temporary variable as you would with ()
? does not mean "zero or more", it means "zero or one".
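The difference between the two quantifiers shows up as soon as the optional character repeats. A short sketch using re.fullmatch (available since Python 3.4); the results dictionary is just a convenience for this example.

```python
import re

results = {}
for text in ('a', 'ab', 'abb'):
    star = bool(re.fullmatch(r'ab*', text))      # b zero or more times
    question = bool(re.fullmatch(r'ab?', text))  # b zero or one time
    results[text] = (star, question)
    print(text, star, question)
```

Both accept 'a' and 'ab'; only the * version accepts 'abb'.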
