Python re.finditer match.groups() does not contain all groups from match - python

I am trying to use regex in Python to find and print all matching lines from a multiline search.
The text that I am searching through may have the below example structure:
AAA
ABC1
ABC2
ABC3
AAA
ABC1
ABC2
ABC3
ABC4
ABC
AAA
ABC1
AAA
From this I want to retrieve the ABC* lines that occur at least once and are preceded by an AAA.
The problem is that, although the overall match catches what I want:
match = <_sre.SRE_Match object; span=(19, 38), match='AAA\nABC2\nABC3\nABC4\n'>
... I can access only the last match of the group:
match groups = ('AAA\n', 'ABC4\n')
Below is the example code that I use for this problem.
#! python
import sys
import re
import os
string = "AAA\nABC1\nABC2\nABC3\nAAA\nABC1\nABC2\nABC3\nABC4\nABC\nAAA\nABC1\nAAA\n"
print(string)
p_MATCHES = []
p_MATCHES.append( (re.compile('(AAA\n)(ABC[0-9]\n){1,}')) ) #
matches = re.finditer(p_MATCHES[0],string)
for match in matches:
    strout = ''
    gr_iter=0
    print("match = "+str(match))
    print("match groups = "+str(match.groups()))
    for group in match.groups():
        gr_iter+=1
        sys.stdout.write("TEST GROUP:"+str(gr_iter)+"\t"+group) # test output
        if group is not None:
            if group != '':
                strout+= '"'+group.replace("\n","",1)+'"'+'\n'
    sys.stdout.write("\nCOMPLETE RESULT:\n"+strout+"====\n")

Here is your regular expression:
(AAA\r\n)(ABC[0-9]\r\n){1,}
Debuggex Demo
Your goal is to capture all ABC#s that immediately follow AAA. As you can see in this Debuggex demo, all ABC#s are indeed being matched (they're highlighted in yellow). However, since only the "what is being repeated" part
ABC[0-9]\r\n
is being captured (is inside the parentheses), and its quantifier,
{1,}
is not being captured, each repetition overwrites the text captured by the previous one, so every repetition except the final one is discarded. To get them all, you must also capture the quantifier:
AAA\r\n((?:ABC[0-9]\r\n){1,})
Debuggex Demo
I've placed the "what is being repeated" part (ABC[0-9]\r\n) into a non-capturing group. (I've also stopped capturing AAA, as you don't seem to need it.)
The captured text can be split on the newline, and will give you all the pieces as you wish.
(Note that \n by itself doesn't work in Debuggex. It requires \r\n.)
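As a rough Python sketch of the split-on-newline step (the variable names text, pattern and runs are illustrative and not from the question), the outer group captures the whole run and splitlines() breaks it into the individual ABC# entries:

```python
import re

# Same sample data as in the question.
text = "AAA\nABC1\nABC2\nABC3\nAAA\nABC1\nABC2\nABC3\nABC4\nABC\nAAA\nABC1\nAAA\n"

# The outer group captures the whole run; the inner (?:...) group only repeats.
pattern = re.compile(r"AAA\n((?:ABC[0-9]\n)+)")

# One list of ABC# entries per AAA block that has at least one of them.
runs = [m.group(1).splitlines() for m in pattern.finditer(text)]
for run in runs:
    print(run)
# ['ABC1', 'ABC2', 'ABC3']
# ['ABC1', 'ABC2', 'ABC3', 'ABC4']
# ['ABC1']
```

The bare ABC line and the final AAA produce nothing, because ABC lacks a digit and the last AAA is not followed by any ABC# line.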
This is a workaround. Not many regular expression flavors offer the capability of iterating through repeating captures (which ones...?). A more common approach is to loop through the matches and process each one as it is found. Here's an example from Java:
import java.util.regex.*;
public class RepeatingCaptureGroupsDemo {
public static void main(String[] args) {
String input = "I have a cat, but I like my dog better.";
Pattern p = Pattern.compile("(mouse|cat|dog|wolf|bear|human)");
Matcher m = p.matcher(input);
while (m.find()) {
System.out.println(m.group());
}
}
}
Output:
cat
dog
(From http://ocpsoft.org/opensource/guide-to-regular-expressions-in-java-part-1/, about a 1/4 down)
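Since the question is about Python, here is what the same loop might look like there. This is only a sketch of the equivalent idea using re.finditer, not part of the quoted guide; the names animals and found are made up for illustration:

```python
import re

text = "I have a cat, but I like my dog better."
animals = re.compile(r"(mouse|cat|dog|wolf|bear|human)")

# finditer yields one match object per occurrence, like Matcher.find() in Java.
found = [m.group() for m in animals.finditer(text)]
for word in found:
    print(word)
# cat
# dog
```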
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. The links in this answer come from it.

You want the run of consecutive ABC\n lines that follows an AAA\n, matched greedily. You also want only the whole run of consecutive ABC\n lines, not a tuple of that run plus the most recent ABC\n. So in your regex, make the inner subgroup non-capturing.
First write the pattern that represents the whole match:
AAA\n(ABC[0-9]\n)+
Then capture the part you are interested in with (), while making the inner subgroup(s) non-capturing with (?:...):
AAA\n((?:ABC[0-9]\n)+)
You can then use either findall() or finditer(). I find finditer() easier, especially when you are dealing with more than one capture.
finditer:
import re
matches_iter = re.finditer(r'AAA\n((?:ABC[0-9]\n)+)', string)
for m in matches_iter:
    print(m.group(1))
findall, using the original {1,}, which is just a more verbose form of +:
matches_all = re.findall(r'AAA\n((?:ABC[0-9]\n){1,})', string)
for run in matches_all:
    for line in run.splitlines():
        print(line)

Related

Python regex show all characters between span

I want to view everything between span=(179, 331). How can I display this? Thanks in advance.
new_v1 = re.compile(r'Sprzedawca:')
new_v2 = re.compile(r'lp ')
print(new_v1.search(txt))
print(new_v2.search(txt))
Output:
<re.Match object; span=(179, 199), match='Sprzedawca: Nabywca:'>
<re.Match object; span=(328, 331), match='lp '>
This is an example how to match text between start and stop. I chose a simple text, adjust the regexp for your needs:
import re
RE = re.compile(r'(?:Start)(.*)(?:End)')
# re.compile(..., flags=re.DOTALL) to match also newlines
match = RE.search('testStartTextBetween123ABCxyzEndtest')
if match:
    print(match.group(1)) # TextBetween123ABCxyz
There are three groups in the regexp (groups are in parentheses). The first one matches the start-of-text mark, the second one matches everything and the last one matches the end-of-text mark.
The (?: notation means the resulting match is not saved. Only the middle group is saved as the first (and only) matched subgroup. This corresponds to match.group(1).
The function call new_v1.search(txt) returns a match object which has various attributes. You can call its methods to retrieve various facts about the match and the matched text. The simplest way to pull out the text which matched is probably
print(new_v1.search(txt).group(0))
but you could certainly also pull out the start and end attributes and extract the span yourself:
matched = new_v1.search(txt)
print(txt[matched.start():matched.end()])
Demo: https://ideone.com/CFXjx2
Of course, with a trivial regex, the matched text will be exactly the regex itself; perhaps you are actually more interested in where exactly in the string the match was found.
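To get everything from the first marker through the second one, as the question asks, the two match objects can be combined into a single slice. The sketch below uses a made-up txt, since the real text from the question is not shown:

```python
import re

# Hypothetical sample text; the real `txt` from the question is not available.
txt = "Faktura Sprzedawca: Nabywca: ACME Sp. z o.o. lp nazwa"

start = re.search(r"Sprzedawca:", txt)
stop = re.search(r"lp ", txt)

between = None
if start and stop:
    # Slice from the start of the first match to the end of the second match.
    between = txt[start.start():stop.end()]
print(between)  # Sprzedawca: Nabywca: ACME Sp. z o.o. lp 
```

Use start.end() and stop.start() instead if the markers themselves should be left out of the result.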

How to return whole non-latin strings matching a reduplication pattern, such as AAB or ABB

I am working with strings of non-latin characters.
I want to match strings with reduplication patterns, such as AAB, ABB, ABAB, etc.
I tried out the following code:
import re
patternAAB = re.compile(r'\b(\w)\1\w\b')
match = patternAAB.findall(rawtext)
print(match)
However, it returns only the first character of the matched string.
I know this happens because of the capturing parentheses around the first \w.
I tried to add capturing parentheses around the whole matched block, but Python gives
error: cannot refer to an open group at position 7
I also found this method, but it didn't work for me:
patternAAB = re.compile(r'\b(\w)\1\w\b')
match = patternAAB.search(rawtext)
if match:
    print(match.group(1))
How could I match the pattern and return the whole matching string?
# Ex. 哈哈笑
# string matches AAB pattern so my code returns 哈
# but not the entire string
The message:
error: cannot refer to an open group at position 7
is telling you that, once the whole pattern is wrapped in parentheses, \1 refers to that outer group, because its opening parenthesis comes first, and a group cannot be referenced while it is still open. The group you want to backreference is number 2, so this code works:
import re
rawtext = 'abc 哈哈笑 def'
patternAAB = re.compile(r'\b((\w)\2\w)\b')
match = patternAAB.findall(rawtext)
print(match)
Each item in match has both groups:
[('哈哈笑', '哈')]
I also found this method, but didn't work for me:
You were close here as well. You can use match.group(0) to get the full match, not just a group in parentheses. So this code works:
import re
rawtext = 'abc 哈哈笑 def'
patternAAB = re.compile(r'\b(\w)\1\w\b')
match = patternAAB.search(rawtext)
if match:
    print(match.group(0)) # 哈哈笑
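The same group(0) idea extends to the other patterns mentioned in the question. The sketch below is one possible way to write them; the sample text and the patterns dictionary are assumptions for illustration. Note that these simple patterns do not require A and B to differ, so AAA would also count as AAB:

```python
import re

# Hypothetical sample text for illustration.
rawtext = "哈哈笑 笑哈哈 研究研究 abc"

patterns = {
    "AAB": re.compile(r"\b(\w)\1\w\b"),
    "ABB": re.compile(r"\b\w(\w)\1\b"),
    "ABAB": re.compile(r"\b(\w\w)\1\b"),
}

# group(0) is always the whole match, whatever groups the pattern contains.
found = {
    name: [m.group(0) for m in pattern.finditer(rawtext)]
    for name, pattern in patterns.items()
}
print(found)
# {'AAB': ['哈哈笑'], 'ABB': ['笑哈哈'], 'ABAB': ['研究研究']}
```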

Python regex to match after the text and the dot [duplicate]

I am using Python and would like to match all the words after test till a period (full-stop) or space is encountered.
text = "test : match this."
At the moment, I am using :
import re
re.match('(?<=test :).*',text)
The above code doesn't match anything. I need match this as my output.
Everything after test, including test
test.*
Everything after test, without test
(?<=test).*
Example here on regexr.com
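You need to use re.search, since re.match only tries to match at the beginning of the string. To match until a space or period is encountered (the lookbehind includes the space after the colon; without it the match would stop immediately at that space and be empty):
re.search(r'(?<=test : )[^.\s]*',text)
To match all the chars until a period is encountered,
re.search(r'(?<=test :)[^.]*',text)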
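A small sketch of the difference between the two calls, using the text from the question (the variable names at_start and anywhere are illustrative):

```python
import re

text = "test : match this."

# re.match only tries position 0, where the lookbehind cannot succeed.
at_start = re.match(r"(?<=test : ).*", text)

# re.search scans the string and finds the position right after "test : ".
anywhere = re.search(r"(?<=test : )[^.]*", text)

print(at_start)           # None
print(anywhere.group(0))  # match this
```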
In a general case, as the title mentions, you may capture with (.*) pattern any 0 or more chars other than newline after any pattern(s) you want:
import re
p = re.compile(r'test\s*:\s*(.*)')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
    print(m.group(1)) # Print Group 1 value
If you want . to match across multiple lines, compile the regex with re.DOTALL or re.S flag (or add (?s) before the pattern):
p = re.compile(r'test\s*:\s*(.*)', re.DOTALL)
p = re.compile(r'(?s)test\s*:\s*(.*)')
However, it will return match this. (including the trailing period). See also a regex demo.
You can add \. pattern after (.*) to make the regex engine stop before the last . on that line:
test\s*:\s*(.*)\.
Watch out for re.match() since it will only look for a match at the beginning of the string (Avinash already pointed that out, but it is a very important note!)
See the regex demo and a sample Python code snippet:
import re
p = re.compile(r'test\s*:\s*(.*)\.')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
    print(m.group(1)) # Print Group 1 value
If you want to make sure test is matched as a whole word, add \b before it (do not remove the r prefix from the string literal, or '\b' will match a BACKSPACE char!) - r'\btest\s*:\s*(.*)\.'.
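To make the word-boundary point concrete, here is a brief sketch. The input string s is invented for illustration, with test deliberately appearing inside the word attest:

```python
import re

# In a normal string literal '\b' is a single BACKSPACE character;
# in a raw string it is a backslash plus 'b', which re reads as a word boundary.
backspace_len = len('\b')   # 1
boundary_len = len(r'\b')   # 2

# Hypothetical input where "test" also occurs inside another word.
s = "attest : skip this. test : match this."

without_boundary = re.search(r'test\s*:\s*(.*)\.', s).group(1)
with_boundary = re.search(r'\btest\s*:\s*(.*)\.', s).group(1)

print(without_boundary)  # skip this. test : match this
print(with_boundary)     # match this
```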
I don't see why you want to use regex if you're just getting a subset from a string.
This works the same way:
if line.startswith('test:'):
    print(line[5:line.find('.')])
example:
>>> line = "test: match this."
>>> print(line[5:line.find('.')])
match this
Regex is slow, awkward to design, and difficult to debug. There are definitely occasions to use it, but if you just want to extract the text between test: and ., then I don't think this is one of those occasions.
See: https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
For more flexibility (for example if you are looping through a list of strings you want to find at the beginning of a string and then index out) replace 5 (the length of 'test:') in the index with len(str_you_looked_for).
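A possible way to wrap that idea in a helper is sketched below. The function name text_after_prefix and its parameters are hypothetical. Unlike the one-liner above, it also handles a missing stop character, where find() returns -1 and the slice would silently drop the last character:

```python
def text_after_prefix(line, prefix, stop="."):
    """Return the text between `prefix` and the first `stop`, or None."""
    if not line.startswith(prefix):
        return None
    # Look for the stop character only after the prefix.
    end = line.find(stop, len(prefix))
    if end == -1:
        # No stop character: take everything up to the end of the line.
        end = len(line)
    return line[len(prefix):end].strip()


print(text_after_prefix("test : match this.", "test :"))  # match this
print(text_after_prefix("something else", "test :"))      # None
```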

python regex: capturing group within OR

I'm using python and the re module to parse some strings and extract a 4 digits code associated with a prefix. Here are 2 examples of strings I would have to parse:
str1 = "random stuff tokenA1234 more stuff"
str2 = "whatever here tokenB5678 tokenA0123 and more there"
tokenA and tokenB are the prefixes and 1234, 5678, 0123 are the digits I need to grab. token A and B are just an example here. The prefix can be something like an address http://domain.com/ (tokenA) or a string like Id: ('[Ii]d:?\s?') (tokenB).
My regex looks like:
re.findall('.*?(?:tokenA([0-9]{4})|tokenB([0-9]{4})).*?', str1)
When parsing the 2 strings above, I get:
[('1234','')]
[('','5678'),('0123','')]
And I'd like to simply get ['1234'] or ['5678','0123'] instead of a tuple.
How can I modify the regex to achieve that? Thanks in advance.
You get tuples as a result since you have more than 1 capturing group in your regex. See re.findall reference:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
So, the solution is to use only one capturing group.
Since only the tokens differ and the ([0-9]{4}) part is common to both, put the tokens into a non-capturing group with an alternation operator between them:
(?:tokenA|tokenB)([0-9]{4})
^^^^^^^^^^^^^^^^^
The regex means:
(?:tokenA|tokenB) - match but not capture tokenA or tokenB
([0-9]{4}) - match and capture into Group 1 four digits
IDEONE demo:
import re
s = "tokenA1234tokenB34567"
print(re.findall(r'(?:tokenA|tokenB)([0-9]{4})', s))
Result: ['1234', '3456']
Simply do this:
re.findall(r"token[AB](\d{4})", s)
Put A and B inside a character class, so that it matches either A or B.
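Applied to the two strings from the question, the single-group pattern gives flat lists. The second pattern below is only a guess at how the real prefixes mentioned in the question (a URL and an Id: label) might be combined the same way, and its sample string is invented:

```python
import re

str1 = "random stuff tokenA1234 more stuff"
str2 = "whatever here tokenB5678 tokenA0123 and more there"

# One capturing group, so findall returns plain strings rather than tuples.
code_after_token = re.compile(r"(?:tokenA|tokenB)([0-9]{4})")
codes1 = code_after_token.findall(str1)
codes2 = code_after_token.findall(str2)
print(codes1)  # ['1234']
print(codes2)  # ['5678', '0123']

# Assumed real-world prefixes, taken from the examples in the question.
code_after_prefix = re.compile(r"(?:http://domain\.com/|[Ii]d:?\s?)([0-9]{4})")
codes3 = code_after_prefix.findall("see http://domain.com/1234 and Id: 5678")
print(codes3)  # ['1234', '5678']
```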

Regex expression to strip ending of word

I have the following identifiers:
id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD'
id3 = '35358fr1'
id4 = 'as3d99j_br001'
I need a regex to get me the following output:
id1 = '883316040119'
id2 = 'ZWEX01DE9463DB'
id3 = '35358'
id4 = 'as3d99j'
Here is what I have so far --
re.sub(r'_?([a-zA-Z]{2,4}?\d?(00\d)?)$','',vendor_id)
It doesn't work perfectly though, here is what it gives me:
BAD - 883316040119_FRIENDS
GOOD - ZWEX01DE9463DB
GOOD - 35358
GOOD - as3d99j
What would be the correct regular expression to get all of them? For the first one, I basically want to strip the ending if it is only underscores and letters, so 1928h9829_bundle_hd --> 1928h9829.
Please note that I have hundreds of thousands of identifiers here, and it is required that I use a regular expression. I'm not looking for a python split() way to do it, as it wouldn't work.
The way you present your input, I would suggest this simple regex:
^(?:[^_]+(?=_)|\d+)
This can be tweaked if you want to add details to the spec.
To show you a regex demo, just because of the way the site regex101 works, we have to add \n (it assumes we are working on the whole file, rather than one input at a time): DEMO
Explanation
The ^ anchor asserts that we are at the beginning of the string
The non-capture group (?: ... ) matches either
[^_]+(?=_) non-underscore characters (followed by an underscore, not matched)
| OR
\d+ digits
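A quick sketch of how this pattern could be applied to the four identifiers from the question. It keeps the matched prefix rather than substituting away the tail; the names keep_prefix and cleaned are illustrative:

```python
import re

ids = [
    "883316040119_FRIENDS_HD",
    "ZWEX01DE9463DB_DMD",
    "35358fr1",
    "as3d99j_br001",
]

# Either everything before the first underscore, or the leading digits.
keep_prefix = re.compile(r"^(?:[^_]+(?=_)|\d+)")

cleaned = []
for vendor_id in ids:
    m = keep_prefix.match(vendor_id)
    # Fall back to the unchanged identifier if neither alternative matches.
    cleaned.append(m.group(0) if m else vendor_id)
print(cleaned)
# ['883316040119', 'ZWEX01DE9463DB', '35358', 'as3d99j']
```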
This works for the examples:
for id in ids :
    print (id)
883316040119_FRIENDS_HD
ZWEX01DE9463DB_DMD
35358fr1
as3d99j_br001
for id in ids :
    hit = re.sub( r"(_[A-Za-z_]*|_?[A-Za-z]{2,4}?\d?(00\d)?)$", "", id)
    print (hit)
883316040119
ZWEX01DE9463DB
35358
as3d99j
When the tail consists of an underscore followed by letters and underscores, the first alternative is easygoing and strips off any number of underscores and letters; if the tail does not contain an underscore, or contains digits after the underscore, the second alternative demands the pattern from the question: an optional underscore, two to four letters, then an optional digit, then an optional zero-zero-digit.
You are checking for underscore only one possible time, as ? means {0,1}.
r'(_[a-zA-Z]{2,}\d?(00[0-9])?|[a-z]{2,}\d)+$'
The following reproduces your desired results from your input.
I would use the replace method with this regex:
_[^']+|(?!.*_)('[0-9]+)[^']+
and return capturing group 1
Perhaps:
result = re.sub("_[^']+|(?!.*_)('[0-9]+)[^']+", r"\1", subject)
The regex first looks for an underscore. If it finds one, it will match everything up to but not including the next single quote; and that will get removed.
If that doesn't match, the alternative will look for a string that does NOT have an underscore; match and return in capturing group 1 the sequence of digits; and then replace everything after the digits up to but not including the single quote.
This is not a subtraction approach. It just captures the part of the string you want to keep.
The regex is (^[0-9]+)|(^[a-zA-Z0-9]+(?=_)) (i.e. (^\d+)|(^[\d\w]+(?=_)))
import re
id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD'
id3 = '35358fr1'
id4 = 'as3d99j_br001'
ids = [id1, id2, id3, id4]
pattern = re.compile(r"(^[0-9]+)|(^[a-zA-Z0-9]+(?=_))")
for i in ids:
    m = pattern.match(i)
    # match() returns None when nothing matches, so test before calling group()
    print(m.group() if m else "not matched")
output:
883316040119
ZWEX01DE9463DB
35358
as3d99j
