Regex - Using * with a set of characters - python

I'm fairly new at regex, and I've run into a problem that I cannot figure out:
I am trying to match strings that start with one or more A-Z, 0-9, and _ characters, optionally followed by a number enclosed in a single pair of parentheses. The parenthesized number may or may not be separated from the first part by a space.
Examples of what this should find:
_ABCD1E
_123FD(13)
ABDF1G (2)
This is my current regex expression:
[A-Z_0-9]+\s*\({0,1}[\d]*\){0,1}
It's finding everything just fine, but a problem exists if I have the following:
_ABCDE )
It should only grab _ABCDE and not the " )" but it currently grabs '_ABCDE )'
Is there some way I can grab the (#) but not get extra characters if that entire pattern does not exist?
If possible, please explain syntax as I am aiming to learn, not just get the answer.
ANSWER: The following code is working for what I needed so far:
[A-Z_0-9]+(\s*\([\d]+\)){0,1}
# or, as has been mentioned, the above can be simplified
# and cleaned up a bit to be
[A-Z_0-9]+(\s*\(\d+\))?
# The [] around \d are unnecessary and {0,1} is equivalent to ?
Adding the parentheses around the (#) pattern allows for the use of ? or {0,1} on the entire pattern. I also changed the [\d]* to be [\d]+ to ensure at least one number inside of the parentheses.
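To see the effect of making the whole (#) part one optional group, here is a minimal sketch that runs the simplified pattern over the examples from the question. The variable names are only for illustration.

```python
import re

# The whole "(number)" part, including any leading whitespace,
# is a single optional group.
pattern = re.compile(r"[A-Z_0-9]+(\s*\(\d+\))?")

samples = ["_ABCD1E", "_123FD(13)", "ABDF1G (2)", "_ABCDE )"]
matches = [pattern.match(s).group() for s in samples]
print(matches)
# ['_ABCD1E', '_123FD(13)', 'ABDF1G (2)', '_ABCDE']
```

Because the group is all-or-nothing, the stray " )" in the last sample is left out of the match.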
Thanks for the fast answers, all!

Your regex says that each paren (open & closed) may or may not be there, INDEPENDENTLY. Instead, you should say that the number-enclosed-in-parens may or may not be there:
(\([\d]*\)){0,1}
Note that this allows for there to be nothing in the parens; that's what your regex said, but I'm not clear that's what you actually want.
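A small sketch of the difference, assuming a few made-up inputs (ABC(12, ABC) and ABC() are not from the question). The first pattern is the one from the question; the second wraps the parenthesized part in one optional group.

```python
import re

# Each paren is optional on its own.
independent = re.compile(r"[A-Z_0-9]+\s*\({0,1}[\d]*\){0,1}")
# The parenthesized number is optional as a unit.
grouped = re.compile(r"[A-Z_0-9]+\s*(\([\d]*\)){0,1}")

samples = ["ABC(12", "ABC)", "ABC()"]
independent_hits = [independent.match(s).group() for s in samples]
grouped_hits = [grouped.match(s).group() for s in samples]
print(independent_hits)  # ['ABC(12', 'ABC)', 'ABC()']
print(grouped_hits)      # ['ABC', 'ABC', 'ABC()']
```

The last sample shows the point about empty parentheses: with [\d]* both patterns accept "()", whereas \d+ would require at least one digit.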

how about
^[A-Z0-9_]+\s*(\([0-9]+\))?$
btw, from your example, the first part accepts not only [A-Z_], but also [0-9]

This seems to do the job.
[0-9A-Z_]+\s*(?:\([0-9]*\))?

It seems like you want the following regex:
^[A-Z\d_]+(\s*\(\d+\))?$
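Note that anchoring with ^ and $ changes the behavior: instead of matching the valid prefix, the pattern rejects the whole string if anything extra follows. A quick sketch, assuming each candidate is tested as a complete string:

```python
import re

anchored = re.compile(r"^[A-Z\d_]+(\s*\(\d+\))?$")

samples = ["_ABCD1E", "_123FD(13)", "ABDF1G (2)", "_ABCDE )"]
results = {s: bool(anchored.search(s)) for s in samples}
print(results)
# {'_ABCD1E': True, '_123FD(13)': True, 'ABDF1G (2)': True, '_ABCDE )': False}
```

Whether that is desirable depends on whether you are validating whole strings or extracting matches from longer text.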

I used a non-capturing group to keep the group out of the match results:
>>> import re
>>> pattern = r'[A-Z_]+\s*(?:\(\d+\)|\d*)'
>>> l = ['_ABCD1E', '_123FD(13)', 'ABDF1G (2)', '_ABCDE )', 'A_B (15)', 'E (345']
>>> [re.search(pattern, i).group() for i in l]
['_ABCD1', '_123', 'ABDF1', '_ABCDE ', 'A_B (15)', 'E ']

Related

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that because of the parenthesis in (941) the regex is taking 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but 1st I needed help in finding the 2nd (941) so then I can apply the 3rd step on that.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following: (?:...), (941){1}, and escaping the parentheses with \ like this: \(941\), with no useful results. Maybe I am using them wrong.
I just wanted to know whether this can be done using regex. I thought it might be useful for others too, or a good share for future viewers.
Thanks!
Assuming:
You want to skip values made up only of digits;
You want to match a substring made of word characters (thus including possible digits);
Then try escaping the variable and using it in the regular expression through an f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.
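To illustrate why the escaping matters, here is a short sketch comparing the raw variable with its re.escape()d form on the string from the question.

```python
import re

s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'

# Unescaped, the parentheses create a capture group around 941,
# so findall returns the group contents only.
unescaped = re.findall(num, s)
# Escaped, the parentheses are matched as literal characters.
escaped = re.findall(re.escape(num), s)

print(unescaped)  # ['941', '941']
print(escaped)    # ['(941)', '(941)']
```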

Need a specific explanation of part of a regex code

I'm developing a calculator program in Python, and need to remove leading zeros from numbers so that calculations work as expected. For example, if the user enters "02+03" into the calculator, the result should be 5. In order to remove these leading zeros in front of digits, I asked a question here and got the following answer.
self.answer = eval(re.sub(r"((?<=^)|(?<=[^\.\d]))0+(\d+)", r"\1\2", self.equation.get()))
I fully understand how the positive lookbehind to the beginning of the string and lookbehind to the non digit, non period character works. What I'm confused about is where in this regex code can I find the replacement for the matched patterns?
I found this online when researching regex expressions.
result = re.sub(pattern, repl, string, count=0, flags=0)
Where is the "repl" in the regex code above? If possible, could somebody please help to explain what the r"\1\2" is used for in this regex also?
Thanks for your help! :)
The "repl" part of the regex is this component:
r"\1\2"
In the "find" part of the regex, group capturing is taking place (ordinarily indicated by "()" characters around content, although this can be overridden by specific arguments).
In Python regex, the syntax used to refer to a numbered captured group (sometimes called a "backreference") is "\n", where "n" is a digit referring to the position of the group in the "find" part of the regex.
So, this regex is returning a string in which the overall content is being replaced specifically by parts of the input string matched by numbered groups.
Note: I don't believe the "\1" part of the "repl" is actually required. I think:
r"\2"
...would work just as well.
Further reading: https://www.regular-expressions.info/brackets.html
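A quick way to check the note about "\1" is to run both replacement strings side by side. This is only a sketch on the sample input "02+03"; it leaves out eval and the self.equation.get() call from the question.

```python
import re

pattern = r"((?<=^)|(?<=[^\.\d]))0+(\d+)"
equation = "02+03"

# Group 1 consists only of lookbehinds, so it always captures an empty string.
first_groups = [m.group(1) for m in re.finditer(pattern, equation)]

with_both = re.sub(pattern, r"\1\2", equation)
only_second = re.sub(pattern, r"\2", equation)

print(first_groups)  # ['', '']
print(with_both)     # 2+3
print(only_second)   # 2+3
```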
Firstly, repl is the string that each match gets replaced with.
To understand \1\2 you need to know what capture grouping is.
Check this video out for basics of Group capturing.
Here, your regex splits every match it finds into groups numbered 1, 2, and so on, because of the parentheses () you have placed in the regex.
In Python, \1, \2 can be used to refer to them (some other regex flavors use $1, $2).
In this case, the regex replaces each match (the leading zeros plus the digits after them) with just the digits after the leading zeros, which are captured by group 2.
Note: \1 is not necessary; it works fine without it.
See example:
>>> import re
>>> s='awd232frr2cr23'
>>> re.sub(r'\d',' ',s)
'awd   frr cr  '
>>>
Explanation:
\d matches a single digit, so every digit is matched and replaced with repl (in this case ' ').

string consists of punctuation

I want to check whether a string contains a continuous run of exclamation marks, question marks, or both.
By continuous, I mean repeated more than 2 times, like below:
#If sentence contains !!!
exc = re.compile(r"(.)\!{2}")
word["cont_exclamation"] = found if exc.search(sent[i]) else not(found)
#If sentence contains ???
reg = re.compile(r"(.)\?{2}")
word["cont_question"] = found if reg.search(sent[i]) else not(found)
But now I want to find exclamation and question marks mixed together, for example hello??! or hey!! or dude!?!
Also, what if I want both ? and ! with more than 2 of either of them?
I don't know regex properly, so any help would be great.
Use the regex '[?!]{3,}', which means match the ? or ! characters 3 or more times (if continuous = more than two times). Escaping is not needed inside a character class.
Add more punctuation characters to the character class as needed.
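A minimal sketch of that character class on the examples from the question. The extra sample "what?" and the names run and found are assumptions added for illustration.

```python
import re

# Three or more consecutive characters, each either ? or !
run = re.compile(r"[?!]{3,}")

samples = ["hello??!", "hey!!", "dude!?!", "what?"]
found = {s: bool(run.search(s)) for s in samples}
print(found)
# {'hello??!': True, 'hey!!': False, 'dude!?!': True, 'what?': False}
```

If "hey!!" should count as well, lower the bound to {2,}.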
try re.compile(r"(.)[\?\!]{2}")
regex = re.compile(r"(.)(\?|\!){2}")
edit: Typing "regex tutorial" into google gives more info than you possibly need. This tutorial looks particularly well-balanced between conciseness and completeness.
Particularly (i.m.o.) useful tricks that are often not mentioned:
use +? and *? to switch from greedy to lazy matching, i.e. match as few characters as possible instead of as many as possible. Example text: #ab# #de# --> #.*?# matches #ab# only (not #ab# #de#)
parentheses create a capture group by default. If you don't want this, you can use (?:...).
Most importantly, comment each regexp with a human-readable explanation. Future-you will be grateful. :-)
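The greedy versus lazy example above can be checked directly; a short sketch:

```python
import re

text = "#ab# #de#"

greedy = re.search(r"#.*#", text).group()   # .* grabs as much as possible
lazy = re.search(r"#.*?#", text).group()    # .*? stops at the first closing #
all_lazy = re.findall(r"#.*?#", text)       # every lazy match in the text

print(greedy)    # #ab# #de#
print(lazy)      # #ab#
print(all_lazy)  # ['#ab#', '#de#']
```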

Three python regular expressions around underscores

I'm helping someone with some file renaming at work using an application that supports python regular expression syntax. I tried a few expressions found on forums like ^[^_]+(?=_) for a) below but it didn't work properly, and some others that didn't work. so, I figured I should reach out to someone who actually knows what they're doing. thanks for your help.
a) in the first expression I have to find all characters before the first underscore in patterns like this:
cannon_mac_23567_prsln_333
jones_james_343342_prsln_333
smith_john_223462_prsln_333
so, I have to get cannon, jones, and smith
b) in a separate expression I have to find all characters between the first and second underscore. so, I need to find mac, james, and john in the examples above.
c) in the last expression I have to find the first underscore
the way the renaming app works I have to do these regular expressions in three parts, like the above. thanks.
Well, you could do it without regular expressions entirely, as you know your delimiter is the underscore.
Use the str.split and str.index methods.
'smith_john_223462_prsln_333'.split('_')[0]  # (to extract smith)
'smith_john_223462_prsln_333'.split('_')[1]  # (to extract john)
'smith_john_223462_prsln_333'.index('_')  # (to get position of first underscore)
I'd use:
1. ^([^_]+)_
2. _([^_]+)_
3. ^[^_]+(_)
Use re.match for the first and third, as it matches at the beginning of the string, and re.search for the second, which starts at an underscore rather than at the beginning.
[Edit: As Cthulhu pointed out, you might be better off not using regular expressions for this, as it's faster and easier to use string methods]
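For reference, a sketch of the three lookups on the sample names from the question, written for plain Python. It uses re.search for the second and third steps, and it is an assumption that the renaming app searches the same way.

```python
import re

names = [
    "cannon_mac_23567_prsln_333",
    "jones_james_343342_prsln_333",
    "smith_john_223462_prsln_333",
]

# a) everything before the first underscore
first_parts = [re.match(r"^([^_]+)_", n).group(1) for n in names]
# b) everything between the first and second underscore
second_parts = [re.search(r"_([^_]+)_", n).group(1) for n in names]
# c) position of the first underscore
underscore_positions = [re.search(r"_", n).start() for n in names]

print(first_parts)           # ['cannon', 'jones', 'smith']
print(second_parts)          # ['mac', 'james', 'john']
print(underscore_positions)  # [6, 5, 5]
```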
Right, I misunderstood your question at first. While the str.split would definitely be a more elegant way to solve this, here are three regular expressions to suit your needs. I have no idea whether or not this application of yours will work with them. So take this with a grain of salt.
Please take a look at the re library and the MatchObject.span() for further information.
As a single regex:
import re
line = "cannon_mac_23567_prsln_333"
In [1812]: match = re.match(r"(.+?)(\_)(.+?)\_", line)
In [1813]: match.groups()
Out[1813]: ('cannon', '_', 'mac')
In [1814]: match.span(2)[0]  # second group, start: the first occurrence of _
Out[1814]: 6
In [1815]: line[6]
Out[1815]: '_'
Separated into a, b, c:
a:
import re
line = "cannon_mac_23567_prsln_333"
In [1707]: match = re.match(r"(.+?)\_", line)
In [1708]: match.groups()
Out[1708]: ('cannon',)
b:
In [1712]: match = re.match(r".+?\_(.+?)\_", line)
In [1713]: match.groups()
Out[1713]: ('mac',)
c: Last one uses re.search for simplicity. MatchObject.span() returns a tuple of position (start, end)
In [1763]: match = re.search(r"\_", line)
In [1764]: match.span()[0]
Out[1764]: 6
In [1765]: line[6]
Out[1765]: '_'

How to use ? and ?: and : in REGEX for Python?

I understand that
* = "zero or more"
? = "zero or more" ...what's the difference?
Also, ?: << my book uses this, it says its a "subtlety" but I don't know what exactly these do!
As Manu already said, ? means "zero or one time". It is the same as {0,1}.
And by ?:, you probably meant (?:X), where X is some other string. This is called a "non-capturing group".
Normally when you wrap parentheses around something, you group what is matched by those parentheses. For example, the regex .(.).(.) matches any 4 characters (except line breaks) and stores the second character in group 1 and the fourth character in group 2. However, when you do: .(?:.).(.) only the fourth character is stored in group 1; everything inside (?:.) is matched, but not "remembered".
A little demo:
import re
m = re.search('.(.).(.)', '1234')
print(m.group(1))
print(m.group(2))
# output:
# 2
# 4
m = re.search('.(?:.).(.)', '1234')
print(m.group(1))
# output:
# 4
You might ask yourself: "why use this non-capturing group at all?". Well, sometimes you want to make an OR between two strings. For example, to match the string "www.google.com" or "www.yahoo.com", you could do: www\.google\.com|www\.yahoo\.com, but shorter would be: www\.(google|yahoo)\.com of course. But if you're not going to do something useful with what is being captured by this group (the string "google" or "yahoo"), you might as well use a non-capturing group: www\.(?:google|yahoo)\.com. When the regex engine does not need to "remember" the substring "google" or "yahoo", your app/script will run faster. Of course, it wouldn't make much difference with relatively small strings, but when your regex and string(s) get larger, it probably will.
And for a better example to use non-capturing groups, see Chris Lutz's comment below.
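One place where the choice is visible in the results, not only in speed, is re.findall, which returns the group contents whenever the pattern has a capturing group. A small sketch with a made-up sample sentence:

```python
import re

text = "see www.google.com and www.yahoo.com"  # sample text, made up for the demo

capturing = re.findall(r"www\.(google|yahoo)\.com", text)
non_capturing = re.findall(r"www\.(?:google|yahoo)\.com", text)

print(capturing)      # ['google', 'yahoo']
print(non_capturing)  # ['www.google.com', 'www.yahoo.com']
```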
?: << my book uses this, it says its a "subtlety" but I don't know what exactly these do!
If that’s indeed what your book says, then I advise getting a better book.
Inside parentheses (more precisely: right after an opening parenthesis), ? has another meaning. It starts a group of options which count only for the scope of the parentheses. ?: is a special case of these options. To understand this special case, you must first know that parentheses create capture groups:
a(.)c
This is a regular expression that matches any three-letter string starting with a and ending with c. The middle character is (more or less) arbitrary. Since you put it in parentheses, you can capture it:
import re
m = re.search('a(.)c', 'abcdef')
print(m.group(1))
This will print b, since m.group(1) captures the content of the first parentheses (group(0) captures the whole hit, here abc).
Now, consider this regular expression:
a(?:.)c
No capture is made here – this is what ?: after an opening parenthesis means. That is, the following code will fail:
m = re.search('a(?:.)c', 'abcdef')
print(m.group(1))
Because there is no group 1!
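A sketch of what "fail" looks like in practice: the match itself succeeds, but asking for group 1 raises IndexError. The try/except is only there so the snippet runs to the end.

```python
import re

m = re.search('a(?:.)c', 'abcdef')

whole = m.group(0)     # the whole hit is still available
captured = m.groups()  # empty tuple: nothing was captured

try:
    m.group(1)
    error = None
except IndexError as exc:
    error = str(exc)

print(whole)     # abc
print(captured)  # ()
print(error)
```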
? = zero or one
you use (?:) for grouping w/o saving the group in a temporary variable as you would with ()
? does not mean "zero or more", it means "zero or one".
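The difference between ? and * is easy to check with re.fullmatch; a minimal sketch using a few made-up candidate strings:

```python
import re

candidates = ["a", "ab", "abb", "abbb"]

# b? allows zero or one b; b* allows any number of b's.
zero_or_one = [s for s in candidates if re.fullmatch(r"ab?", s)]
zero_or_more = [s for s in candidates if re.fullmatch(r"ab*", s)]

print(zero_or_one)   # ['a', 'ab']
print(zero_or_more)  # ['a', 'ab', 'abb', 'abbb']
```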
