This question already has answers here:
Get start location of capturing group within regex pattern
(3 answers)
Python Regex - How to Get Positions and Values of Matches
(4 answers)
Closed 1 year ago.
I am trying to capture the start and end of a capture group for each group found using the finditer() method in re.
For example:
import re
strng = 'move 12345-!'
matches = re.finditer('move ([0-9]+).*?', strng)
for each in matches:
    print(*each.groups())
    print(each.start(), each.end())
This will yield the start and end index position, but of the matched pattern and not specifically the captured group. I essentially want to always capture the number as this will change. The word move will always be an anchor, but I don't want to include that in the position, as I need to capture the actual position of the numbers found within the text document so that I can do slicing for each number found.
Full document might be like:
move 12345-!
move 57496-!
move 96038-!
move 00528-!
And I would capture 57496 as document[18:23], where the start of 57496 is at index 18 and the end is at index 23 (assuming the lines are separated by a single newline). The underlying positions are being used to train a model.
If you don't want move to be part of the match, you can turn it into a positive lookbehind to assert it to the left.
Then you can use each.group() to get the match.
Note that you can omit .*? at the end of the pattern: it is a non-greedy quantifier with nothing after it, so it never matches any characters.
import re
strng = 'move 12345-!'
matches = re.finditer('(?<=move )[0-9]+', strng)
for each in matches:
    print(each.group())
    print(each.start(), each.end())
Output
12345
5 10
>>> import re
>>> strng = "move 12345-!"
>>> matches = re.finditer('move ([0-9]+).*?', strng)
>>> for each in matches:
...     print(each.group(1))
...     print(each.start(1), each.end(1))
...
12345
5 10
>>>
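As a minimal sketch of how the group-specific positions carry over to the full document from the question: the span of group 1 can be passed straight to a slice. The `document` string below is assumed to be the four sample lines joined by single newlines, and the variable names are illustrative only.

```python
import re

# Assumed layout: the four sample lines joined by single "\n" characters.
document = "move 12345-!\nmove 57496-!\nmove 96038-!\nmove 00528-!"

spans = []
for match in re.finditer(r"move ([0-9]+)", document):
    start, end = match.span(1)  # span of capture group 1 only, not the whole match
    # Slicing with the group span gives back exactly the captured number.
    assert document[start:end] == match.group(1)
    spans.append((start, end, match.group(1)))

for start, end, number in spans:
    print(start, end, number)
```

Each tuple holds the start index, the exclusive end index, and the captured digits, so `document[start:end]` can be used directly for slicing.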
This question already has an answer here:
Why Does a Repeated Capture Group Return these Strings?
(1 answer)
Closed 1 year ago.
I have the following regex: '(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$'.
Given the string "/foo/bar/baz", I expect the first captured group to be "/foo/bar". However, I get the following:
>>> import re
>>> regex = re.compile('(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$');
>>> match = regex.match('/foo/bar/baz')
>>> match.group(1)
'/bar'
Why isn't the whole expected group being captured?
Edit: It's worth mentioning that the strings I'm trying to match are parts of URLs. To give you an idea, it's the part of the URL that would be returned from window.location.pathname in javascript, only without file extensions.
This will capture multiple repeated groups:
(/[a-zA-Z]+)*
However, as already discussed in another thread, quoting from @ByteCommander:
If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.
That is why you only see the last match, "/bar". What you can do instead is take advantage of the greedy matching of .* up to the last / via the pattern (/.*)/
regex = re.compile(r'(/.*)/([a-zA-Z]+)\.?$')
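Another way to keep the original character classes, sketched here as one possible variant rather than the answerer's own pattern, is to make the repeated group non-capturing and wrap the whole repetition in a single capturing group. The names `head` and `tail` are illustrative.

```python
import re

# The inner (?:...) group repeats without capturing; the outer (...) group
# captures everything the repetition matched, not just its last iteration.
regex = re.compile(r'((?:/[a-zA-Z]+)*)/([a-zA-Z]+)\.?$')

match = regex.match('/foo/bar/baz')
head, tail = match.group(1), match.group(2)
print(head)
print(tail)
```

Unlike (/.*)/, this variant still restricts every path segment to letters.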
You don't need the * between the two expressions here; also move the first / into the brackets:
>>> regex = re.compile('([/a-zA-Z]+)/([a-zA-Z]+)\.?$')
>>> regex.match('/foo/bar/baz').group(1)
'/foo/bar'
>>>
In this case, you may not need a regex.
You can simply use the split function.
text = "/foo/bar/baz"
"/".join(text.split("/", 3)[:3])
output:
/foo/bar
text.split("/", 3) splits your string up to the third occurrence of /, and then you can join the desired elements.
As suggested by Niel, you can use a negative index to extract anything but the last part from a url (or a path).
In this case the generic approach would be :
text = "/foo/bar/baz/boo/bye"
"/".join(text.split("/", -1)[:-1])
Output:
/foo/bar/baz/boo
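For the generic "everything but the last part" case, a possible alternative to split and join is `str.rpartition`, which splits once at the last separator. This is only a sketch; the names `parent` and `last` are made up for the example.

```python
text = "/foo/bar/baz/boo/bye"

# rpartition splits at the LAST "/" and returns (before, separator, after).
parent, _, last = text.rpartition("/")

print(parent)
print(last)
```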
This question already has answers here:
Python regex module not finding all matches even with overlapping = True
(2 answers)
How to use regex to find all overlapping matches
(5 answers)
Closed 1 year ago.
I'm using regular expressions to find the maximum number of consecutive repeats of a substring in a given string. In the example below, there are 9 consecutive AAGAA substrings. The first method returns the lengths of all the different stretches of consecutive substrings, and the second returns the overall max. Therefore, max(lens) should be equal to val. However, in the method using val there is a match with 10 repeats of AAGAA, even though the original string contains a maximum of only 9.
I've spent a lot of time looking at regex tutorials and regex101.com but I can't figure this out. Where is "(?=((" + re.escape(substring) + ")+))" finding an extra substring?
import re

string='AAGAAAAAAAAGAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAAAAGAA'
substring = 'AAGAA'
# This one is right; returns [1, 1, 9], as desired
sl = len(substring)
regex = re.compile(f'((?:{substring})+)')
lens = [len(m) // sl for m in regex.findall(string)]
# This one is wrong; returns 10, should return 9
pattern = re.compile("(?=((" + re.escape(substring) + ")+))")
matches = re.findall(pattern, string)
val = max(len(m[0]) // len(substring) for m in matches)
You are seeing an extra substring because your second regex finds overlapping matches. It consists only of a lookahead, which does not consume any characters, so a match is attempted at every position in the string. That lets it match not only the single occurrence of AAGAA starting at string[9], but also the run of 10 occurrences starting at string[13]. Because that run overlaps the previous match (both use the A at string[13]), your first regex, which consumes characters as it matches, cannot find it. Having matched the single occurrence at string[9], the first regex resumes its search at string[14], which is already past the start of the run of 10, so the next match it finds is the run of 9 occurrences starting at string[18].
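The same effect can be reproduced on a much smaller input. The sketch below uses a made-up string (a stray "AAGA" followed by three back-to-back repeats), not the one from the question, and records the start position and repeat count of every match for both styles of pattern.

```python
import re

substring = "AAGAA"
# Made-up string: a stray "AAGA" followed by three back-to-back repeats.
string = "AAGA" + substring * 3

unit = re.escape(substring)

# Consuming pattern: each match uses up its characters, so matches cannot overlap.
consuming = [(m.start(), len(m.group(0)) // len(substring))
             for m in re.finditer(f"(?:{unit})+", string)]

# Lookahead-only pattern: zero-width, so a match is attempted at every position.
overlapping = [(m.start(), len(m.group(1)) // len(substring))
               for m in re.finditer(f"(?=((?:{unit})+))", string)]

print(consuming)    # (start, repeats) pairs
print(overlapping)  # (start, repeats) pairs
```

Here the consuming pattern matches at index 0 first, which uses up the A at index 4 that the run of three repeats needs, so it only ever reports a run of two; the lookahead pattern also reports the run of three starting at index 4.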
This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 2 years ago.
I want to capture numbers and number ranges from a list: ["op.15", "Op.16-17", "Op16,17,18"]
match = re.compile(r"\d+[-]?\d+").findall(text)
Gets the correct result
op.15 ['15']
Op.16-17 ['16-17']
Op16,17,18 ['16', '17', '18']
but this doesn't work:
match = re.compile(r"\d+(-)?\d+").findall(text)
op.15 ['']
Op.16-17 ['-']
Op16,17,18 ['', '', '']
What's the issue here? I want to add in alternative values to -, such as "to" i.e. -|to which doesn't work with [].
The documentation for findall in re module says
Return a list of all non-overlapping matches in the string. If one or
more capturing groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group. Empty matches are included in the result.
In your first regex you don't provide any capture groups, so you get back a list of non-overlapping matches, i.e. it returns one or more digits, followed by zero or one hyphen, followed by one or more digits.
In your second regex you change the [ ], which means "match any character in this set", to ( ), which is a capture group. So now you are saying: match one or more digits, then match and capture zero or one hyphen, then match one or more digits.
Since you have now provided a capture group, per the documentation you no longer get the full non-overlapping match back; instead you get only the capture group, i.e. whatever is inside the ( ), which is empty if there was no hyphen and - if there was one.
To fix the issue, use a non-capturing group: r"\d+(?:-)?\d+".
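A short sketch of the difference, extended with the "to" alternative mentioned in the question. The fourth list item is an added example that is not in the original list, and the variable names are illustrative.

```python
import re

# The last item is an added example to exercise the "to" alternative.
items = ["op.15", "Op.16-17", "Op16,17,18", "Op.16to17"]

capturing = re.compile(r"\d+(-|to)?\d+")        # findall returns only the group
non_capturing = re.compile(r"\d+(?:-|to)?\d+")  # findall returns the whole match

results = {}
for text in items:
    results[text] = (capturing.findall(text), non_capturing.findall(text))
    print(text, *results[text])
```

Note that, like the original pattern, this still requires at least two digits in total, so a single-digit number on its own would not match.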
This question already has answers here:
Python non-greedy regexes
(7 answers)
Closed 3 years ago.
I am trying to find all strings that follow a specific pattern in a Python string.
"\[\[Cats:.*\]\]"
However if many occurrences of such pattern exist together on a line in a string it sees the pattern as just one, instead of taking the patterns separately.
strng = '[[Cats: Text1]] said I am in the [[Cats: Text2]]fhg is abnorn'
x = re.findall("\[\[Cats:.*\]\]", strng)
The output gotten is:
['[[Cats: Text1]] said I am in the [[Cats: Text2]]']
instead of
['[[Cats: Text1]]', '[[Cats: Text2]]']
that is, a list with each occurrence as a separate item.
What regex do I use?
"\[\[Cats:.*?\]\]"
Your current regex is greedy: it gobbles up everything from the first opening brackets to the last closing brackets. Making it non-greedy should return all of your results.
The problem is that you are doing a greedy search; add a ? after .* to make the match non-greedy.
Code follows:
import re
strng = '[[Cats: Text1]] said I am in the [[Cats: Text2]]fhg is abnorn'
regex_template = re.compile(r'\[\[Cats:.*?\]\]')
matches = re.findall(regex_template, strng)
print(matches)
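Avoid .* here: it means any character, zero or more times, and it is greedy, so it keeps matching up to the last ]] it can find. A negated character class that cannot cross a closing bracket avoids the problem: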
import re
strng = '''[[Cats: lol, this is 100 % cringe]]
said I am in the [[Cats: lol, this is 100 % cringe]]
fhg is abnorn'''
x = re.findall(r"\[\[Cats: [^\]]+\]\]", strng)
print(x)
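To compare the approaches from the answers above side by side, here is a small sketch run against the string from the question. The variable names `greedy`, `lazy`, and `negated` are illustrative only.

```python
import re

strng = '[[Cats: Text1]] said I am in the [[Cats: Text2]]fhg is abnorn'

# Greedy: .* runs to the last "]]" on the line, giving one long match.
greedy = re.findall(r"\[\[Cats:.*\]\]", strng)

# Non-greedy: .*? stops at the first "]]" it can, giving one match per tag.
lazy = re.findall(r"\[\[Cats:.*?\]\]", strng)

# Negated class: [^\]]* can never cross a "]", so it also stops at the first "]]".
negated = re.findall(r"\[\[Cats:[^\]]*\]\]", strng)

print(greedy)
print(lazy)
print(negated)
```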
This question already has answers here:
Capturing repeating subpatterns in Python regex
(4 answers)
Closed 3 years ago.
I am implementing a method that takes in a regex pattern like r'(\w+/)+end' and a string 'ab/cd/ef/end'. Note that I cannot request the caller of the method to update their pattern format. Within the method, I need to perform an operation that requires extracting all matches of the first capturing group, i.e. ab/, cd/, and ef/.
How do I accomplish this in Python? Something like below returns a tuple of last-matches for each of capturing groups. We have just one in this example, so it returns ('ef/',).
re.match(r'(\w+/)+end', 'ab/cd/ef/end').groups()
By the way, in C#, every capturing group can match multiple strings e.g. Regex.Match("ab/cd/ef/end", @"(\w+/)+end").Groups[1].Captures will return all the three matches for first capturing group (\w+/)+.
If you just want to capture all path names which are followed by a separator, then use the pattern \w+/ with re.findall:
inp = "ab/cd/ef/end"
matches = re.findall(r'\w+/', inp)
print(matches)
['ab/', 'cd/', 'ef/']
If instead you want all path components, whether or not they are followed by a path separator, then we can try:
inp = "ab/cd/ef/end"
matches = re.findall(r'[^/]+', inp)
import re
r = r"(\w+/)(?<!end)"
s = "ab/cd/ef/end"
m = re.finditer(r, s, re.MULTILINE)
for g in m:
    print(g.group())
Example:
https://regex101.com/r/VJ6knI/1
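If the caller's pattern has to stay untouched, one possible two-step sketch with the standard re module is to match with the caller's pattern first and then re-scan only the part of the string that the repeated group covered. This assumes the body of group 1 (here `\w+/`) is known to the method and that the group sits at the start of the pattern; `caller_pattern` and `group_pattern` are hypothetical names.

```python
import re

caller_pattern = r'(\w+/)+end'  # supplied by the caller, left unchanged
group_pattern = r'\w+/'         # assumption: the body of group 1 is known
s = 'ab/cd/ef/end'

m = re.match(caller_pattern, s)
captures = []
if m:
    # For this pattern, the repeated group covers everything from the start
    # of the match up to the end of its last repetition, m.end(1).
    repeated_part = s[m.start():m.end(1)]
    captures = re.findall(group_pattern, repeated_part)

print(captures)
```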