Regex Match Recurring Pattern with Small Variation

I'm trying to match a repeating pattern with regex (in Python 3.9). Each block contains the same kind of data, but some parts repeat a varying number of times (specifically the lines beginning "CLD" and "REF").
I'm trying to match from "LIN" to the end of the line starting "HL" so I can carry out further matching on each block afterwards.
This is an extract of the data I am using...
LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US
SN1**300*PC
PRF*5500015558****01
PID*F****DESCRIPTION01
REF*PK*000000051213
CLD*1*300*PLT71
REF*LS*0079393
HL*3*1*I
LIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US
SN1**64*PC
PRF*5500014695****01
PID*F****DESCRIPTION02
REF*PK*000000051213
CLD*1*24*PLT71
REF*LS*0079393
CLD*1*40*PLT71
REF*LS*0079390
HL*4*1*I
My RegEx so far looks like this (although well short of what I'm trying to achieve)...
LIN.*\nSN.*\nPRF.*\nPID.*\nREF.*\n
However I got stuck at this point due to the varying number of "CLD" & "REF" lines and therefore it stops short of what I need and I'm pretty sure this is not efficient regex...
LIN**SI*ASN*BP*CH11979*VP*1262702*CH*US
SN1**300*PC
PRF*5500015558****01
PID*F****SEAL INTEGRAL
REF*PK*000000051213
LIN**SI*ASN*BP*CH10439*VP*1375541*CH*US
SN1**64*PC
PRF*5500014695****01
PID*F****PUMP AS PRIMING
REF*PK*000000051213
I also experimented with the regex below (from some Googling) to get around the varying occurrences and also be more efficient but it's not working...
LIN(.|\n)*HL.*
Can anyone help me pull this together?

You can use
(?m)^LIN[\w\W]*?\nHL.* # For any Python version
(?m)^LIN(?s:.*?)\nHL.* # For Python 3.6+
See the regex demo.
Details:
(?m) - an inline re.M modifier flag that enables ^ to match any line start position
^ - start of a line
LIN - LIN string
[\w\W]*? / (?s:.*?) - any zero or more chars, as few as possible
\n - a newline
HL - HL
.* - the rest of the line.
See the Python demo:
import re
text = "LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US\nSN1**300*PC\nPRF*5500015558****01\nPID*F****DESCRIPTION01\nREF*PK*000000051213\nCLD*1*300*PLT71\nREF*LS*0079393\nHL*3*1*I\nLIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US\nSN1**64*PC\nPRF*5500014695****01\nPID*F****DESCRIPTION02\nREF*PK*000000051213\nCLD*1*24*PLT71\nREF*LS*0079393\nCLD*1*40*PLT71\nREF*LS*0079390\nHL*4*1*I"
print(re.findall(r'(?m)^LIN(?s:.*?)\nHL.*', text))
Output:
[
'LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US\nSN1**300*PC\nPRF*5500015558****01\nPID*F****DESCRIPTION01\nREF*PK*000000051213\nCLD*1*300*PLT71\nREF*LS*0079393\nHL*3*1*I',
'LIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US\nSN1**64*PC\nPRF*5500014695****01\nPID*F****DESCRIPTION02\nREF*PK*000000051213\nCLD*1*24*PLT71\nREF*LS*0079393\nCLD*1*40*PLT71\nREF*LS*0079390\nHL*4*1*I'
]
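Since the goal is to carry out further matching on each block, one option is to iterate with re.finditer and run a second pattern on every block. This is only a minimal sketch: the names block_re, item_re, cld_re and summary are made up for illustration, and the meaning given to the fields (the value after BP as an item code, the second CLD field as a quantity) is an assumption about the data, not something stated in the question. The sample is a shortened version of the data above.

```python
import re

text = (
    "LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US\n"
    "SN1**300*PC\n"
    "REF*PK*000000051213\n"
    "CLD*1*300*PLT71\n"
    "REF*LS*0079393\n"
    "HL*3*1*I\n"
    "LIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US\n"
    "SN1**64*PC\n"
    "REF*PK*000000051213\n"
    "CLD*1*24*PLT71\n"
    "REF*LS*0079393\n"
    "CLD*1*40*PLT71\n"
    "REF*LS*0079390\n"
    "HL*4*1*I"
)

# Same block pattern as in the answer above.
block_re = re.compile(r"(?m)^LIN(?s:.*?)\nHL.*")
# Assumption: the field after "BP" is the item code.
item_re = re.compile(r"\*BP\*([^*\n]+)")
# Assumption: the second field of a CLD line is a quantity.
cld_re = re.compile(r"(?m)^CLD\*\d+\*(\d+)\*")

summary = []
for block in block_re.finditer(text):
    chunk = block.group()
    item = item_re.search(chunk).group(1)
    quantities = [int(q) for q in cld_re.findall(chunk)]
    summary.append((item, quantities))

print(summary)  # [('ITEM01', [300]), ('ITEM02', [24, 40])]
```

Keeping the block pattern separate from the per-block patterns means the varying number of CLD and REF lines only has to be handled once.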

For varying iterations for CLD and REF, you could repeat 1 or more times matching the lines that start with either of them using an alternation |, and a quantifier + to match at least 1 line.
^LIN.*\nSN.*\nPRF.*\nPID.*\n(?:(?:REF|CLD).*\n)+HL.*
Explanation
^ Start of a line (the example below passes re.MULTILINE)
LIN.*\nSN.*\nPRF.*\nPID.*\n Match the first 4 lines
(?:(?:REF|CLD).*\n)+ Repeat 1+ times matching either REF or CLD, the rest of the line and a newline
HL.* Match HL and the rest of the line
Regex demo | Python demo
For example
import re
s = ("LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US\n"
"SN1**300*PC\n"
"PRF*5500015558****01\n"
"PID*F****DESCRIPTION01\n"
"REF*PK*000000051213\n"
"CLD*1*300*PLT71\n"
"REF*LS*0079393\n"
"HL*3*1*I\n"
"LIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US\n"
"SN1**64*PC\n"
"PRF*5500014695****01\n"
"PID*F****DESCRIPTION02\n"
"REF*PK*000000051213\n"
"CLD*1*24*PLT71\n"
"REF*LS*0079393\n"
"CLD*1*40*PLT71\n"
"REF*LS*0079390\n"
"HL*4*1*I")
regex = r"^LIN.*\nSN.*\nPRF.*\nPID.*\n(?:(?:REF|CLD).*\n)+HL.*"
print(re.findall(regex, s, re.MULTILINE))
Output
[
'LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US\nSN1**300*PC\nPRF*5500015558****01\nPID*F****DESCRIPTION01\nREF*PK*000000051213\nCLD*1*300*PLT71\nREF*LS*0079393\nHL*3*1*I',
'LIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US\nSN1**64*PC\nPRF*5500014695****01\nPID*F****DESCRIPTION02\nREF*PK*000000051213\nCLD*1*24*PLT71\nREF*LS*0079393\nCLD*1*40*PLT71\nREF*LS*0079390\nHL*4*1*I'
]
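If the intermediate lines do not need to be validated, the same grouping can also be done without a regex by walking the lines. The sketch below is an alternative technique, not the pattern from the answer above; split_blocks and its parameters are hypothetical names. It collects everything from a line starting with LIN up to and including the next line starting with HL.

```python
def split_blocks(text, start="LIN", end="HL"):
    """Group lines from a `start` line through the next `end` line."""
    blocks = []
    current = None  # lines of the block being collected, or None outside a block
    for line in text.splitlines():
        if current is None:
            if line.startswith(start):
                current = [line]
        else:
            current.append(line)
            if line.startswith(end):
                blocks.append("\n".join(current))
                current = None
    return blocks

sample = (
    "LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US\n"
    "REF*PK*000000051213\n"
    "CLD*1*300*PLT71\n"
    "REF*LS*0079393\n"
    "HL*3*1*I\n"
    "LIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US\n"
    "CLD*1*24*PLT71\n"
    "REF*LS*0079393\n"
    "CLD*1*40*PLT71\n"
    "REF*LS*0079390\n"
    "HL*4*1*I"
)
blocks = split_blocks(sample)
print(len(blocks))  # 2
```

Unlike the regex above, this does not check that only REF and CLD lines appear before HL, so it is the looser of the two approaches.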

Related

Define a regex to grab groups of matches based on strings that start a line

I am trying to build a regex that will capture groups of lines up to and including lines from ^INS through ^DMG. I am able to exclude INS*Y*G8 through DMG. However, I keep getting catastrophic backtracking.
INS*Y*G8*030**A***AC~
REF*0F*XXXXXXXX~
NM1*IL*1*JWHWWWW*TWDD****34*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
DMG*D8*19700101*M~
**INS*Y*01*030**A***AC~**
REF*0F*XXXXXXXXX~
NM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~
PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
**DMG*D8*20000101*F~**
**INS*Y*19*030**A***AC~**
REF*0F*XXXXXXXXX~
NM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~
PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
**DMG*D8*20000101*F~**
AMT*D2*100~
AMT*FK*100~
AMT*R*50~
AMT*C1*30~
AMT*P3*31~
AMT*B9*32~
NM1*31*1~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
In general, I need a regex that can capture groups of lines given a string that starts a line, up to and including the line that ends the capture group based on another given start of a line.
I have tried this (INS\*Y\*[^G8]+.*)(.*?)(?=DMG) and other variations unsuccessfully. What I am expecting is...
Group 1 should be:
**INS*Y*01*030**A***AC~**
REF*0F*XXXXXXXXX~
NM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~
PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
**DMG*D8*20000101*F~**
Group 2 should be:
**INS*Y*19*030**A***AC~**
REF*0F*XXXXXXXXX~
NM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~
PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~
N3*45874 WHYYWYW WTWYXW~
N4*DYXWHXVYW*NY*88980~
**DMG*D8*20000101*F~**
Thank you for your help.

With your shown samples and attempts, please try the following regex
(?:^|\n)(INS\*(?!Y\*G8)(?:.*\n)+?DMG\S+)
Here is the online demo for the regex used.
Here is the complete code, written and tested in Python 3, using re.findall:
re.findall(r"(?m)(?:^|\n)(INS\*(?!Y\*G8)(?:.*\n)+?DMG\S+)",var)
Output will be as follows with your shown samples:
['INS*Y*01*030**A***AC~**\nREF*0F*XXXXXXXXX~\nNM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~\nPER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\nN3*45874 WHYYWYW WTWYXW~\nN4*DYXWHXVYW*NY*88980~\nDMG*D8*20000101*F~**', 'INS*Y*19*030**A***AC~**\nREF*0F*XXXXXXXXX~\nNM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~\nPER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\nN3*45874 WHYYWYW WTWYXW~\nN4*DYXWHXVYW*NY*88980~\nDMG*D8*20000101*F~**']
Explanation of Used regex:
(?:^|\n) ##In a non-capturing group match starting of value OR new line.
( ##Starting a capturing group from here.
INS\* ##Matching INS followed by literal * here.
(?!Y\*G8) ##Using negative look ahead to make sure Y*G8 is not present.
(?:.*\n)+? ##In a non-capturing group, lazily matching 1 or more whole lines (each up to and including its new line).
DMG\S+ ##Matching DMG followed by continuous non-spaces.
) ##Closing capturing group here.
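To try the pattern end to end, here is a small self-contained sketch. It assumes the ** markers in the question are bold markup rather than data, so the sample lines (a shortened version of the question's data) start directly with INS and DMG; the variable names are illustrative only.

```python
import re

var = (
    "INS*Y*G8*030**A***AC~\n"
    "REF*0F*XXXXXXXX~\n"
    "DMG*D8*19700101*M~\n"
    "INS*Y*01*030**A***AC~\n"
    "REF*0F*XXXXXXXXX~\n"
    "DMG*D8*20000101*F~\n"
    "INS*Y*19*030**A***AC~\n"
    "PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\n"
    "DMG*D8*20000101*F~\n"
    "AMT*D2*100~\n"
)

pattern = r"(?m)(?:^|\n)(INS\*(?!Y\*G8)(?:.*\n)+?DMG\S+)"
groups = re.findall(pattern, var)  # findall returns capture group 1 only

for number, group in enumerate(groups, start=1):
    print("Group", number)
    print(group)
```

The INS*Y*G8 block is skipped by the negative lookahead, and the trailing AMT line is never reached because each match stops at the first DMG line.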

You were a little vague on just what your desired stop & start patterns should be, so I assumed a start pattern of <begin line> **INS and an end pattern of <begin line> **DMG. If that's not exactly correct you can adjust the regex.
The key to this solution is to set the right flags to scan multiple lines correctly. They are:
"g" - global - continue to scan for all matches, not just the first
"s" - single line - . matches newline, so multiple lines can be matched with .+
"m" - multi-line - ^ & $ match at <newline> as well as at the begin & end of string.
"i" - ignore case - May not be necessary, but it can make the patterns shorter.
So, this solution is
r"^\*\*INS.*?^\*\*DMG.*?$"gmsi
...pretty simple. Find a **INS at the beginning of a line, continue to scan any character across multiple lines until you see a **DMG at the beginning of a line, then scan everything up to the next line end.
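The gmsi suffix is regex101 notation rather than Python syntax. In Python there is no "g" flag: re.findall (or re.finditer) already returns every match, and the other three flags map to re.M, re.S and re.I. A minimal sketch of that translation, assuming as this answer does that the ** markers are really part of the data (the sample is a shortened version of the question's data):

```python
import re

data = (
    "INS*Y*G8*030**A***AC~\n"
    "DMG*D8*19700101*M~\n"
    "**INS*Y*01*030**A***AC~**\n"
    "REF*0F*XXXXXXXXX~\n"
    "**DMG*D8*20000101*F~**\n"
    "**INS*Y*19*030**A***AC~**\n"
    "**DMG*D8*20000101*F~**\n"
    "AMT*D2*100~\n"
)

# m -> re.M, s -> re.S, i -> re.I; "g" is implied by findall.
pattern = re.compile(r"^\*\*INS.*?^\*\*DMG.*?$", re.M | re.S | re.I)
matches = pattern.findall(data)

for match in matches:
    print(match)
    print("---")
```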

Edit
If the leading ** in the example for **INS text are markers for bold text in the question, then you could write the pattern as:
^INS\*(?!Y\*G8).*(?:\n(?!INS\*|DMG\*).*)*\nDMG\*.*
See a regex101 demo
You could use a single match starting with **INS and asserting that it is not followed by Y*G8
Then match all following lines that do not start with either **INS or **DMG
^\*\*INS(?!Y\*G8).*(?:\n(?!\*\*(?:INS|DMG)).*)*\n\*\*DMG.*
Explanation
^ Start of a line (the example code passes re.M)
\*\*INS(?!Y\*G8) Match **INS and assert that it is not directly followed by Y*G8
.* Match the whole line
(?: Non capture group to repeat as a whole part
\n Match a newline
(?!\*\*(?:INS|DMG)) Negative lookahead, assert not **INS or **DMG directly to the right
.* Match the whole line
)* Close the non capture group and optionally repeat it to match all lines
\n\*\*DMG.* Match a newline, then **DMG and the rest of the line
Regex demo101.
Example code
import re
pattern = r"^\*\*INS(?!Y\*G8).*(?:\n(?!\*\*(?:INS|DMG)).*)*\n\*\*DMG.*"
s = ("INS*Y*G8*030**A***AC~\n"
"REF*0F*XXXXXXXX~\n"
"NM1*IL*1*JWHWWWW*TWDD****34*XXXXXXXXX~\n"
"N3*45874 WHYYWYW WTWYXW~\n"
"N4*DYXWHXVYW*NY*88980~\n"
"DMG*D8*19700101*M~\n"
"**INS*Y*01*030**A***AC~**\n"
"REF*0F*XXXXXXXXX~\n"
"NM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~\n"
"PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\n"
"N3*45874 WHYYWYW WTWYXW~\n"
"N4*DYXWHXVYW*NY*88980~\n"
"**DMG*D8*20000101*F~**\n"
"**INS*Y*19*030**A***AC~**\n"
"REF*0F*XXXXXXXXX~\n"
"NM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~\n"
"PER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\n"
"N3*45874 WHYYWYW WTWYXW~\n"
"N4*DYXWHXVYW*NY*88980~\n"
"**DMG*D8*20000101*F~**\n"
"AMT*D2*100~\n"
"AMT*FK*100~\n"
"AMT*R*50~\n"
"AMT*C1*30~\n"
"AMT*P3*31~\n"
"AMT*B9*32~\n"
"NM1*31*1~\n"
"N3*45874 WHYYWYW WTWYXW~\n"
"N4*DYXWHXVYW*NY*88980~")
print(re.findall(pattern, s, re.M))
Output
[
'**INS*Y*01*030**A***AC~**\nREF*0F*XXXXXXXXX~\nNM1*IL*1*JWHWWWW*RRWTW****34*XXXXXXXXX~\nPER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\nN3*45874 WHYYWYW WTWYXW~\nN4*DYXWHXVYW*NY*88980~\n**DMG*D8*20000101*F~**',
'**INS*Y*19*030**A***AC~**\nREF*0F*XXXXXXXXX~\nNM1*IL*1*JWHWWWW*GGDFS****34*XXXXXXXXX~\nPER*IP**TE*XXXXXXXXX*AP*XXXXXXXXX~\nN3*45874 WHYYWYW WTWYXW~\nN4*DYXWHXVYW*NY*88980~\n**DMG*D8*20000101*F~**'
]

(Python) How to check a long string against several regex?

I want to ensure that a long string can match with several regex at once.
I have a long multi line string containing a list of files and some content of the file.
DIR1\FILE1.EXT1 CONTENT11
DIR1\FILE1.EXT1 CONTENT12
DIR1\FILE1.EXT1 CONTENT13
DIR1\FILE2.EXT1 CONTENT21
DIR2\FILE3.EXT2 CONTENT31
DIR3\FILE3.EXT2 CONTENT11
The list typically contains hundreds of thousands of lines, sometimes several millions.
I want to check that the list contains predefined couples file/content:
FILE1 CONTENT11
FILE1 CONTENT12
FILE3 CONTENT11
I know that I can check that the string contains all of these couples by matching the string against some regexes
"^\S*FILE1\S*\tCONTENT11$"
"^\S*FILE1\S*\tCONTENT12$"
"^\S*FILE3\S*\tCONTENT11$"
import re
def all_matching(str, rxs):
    res = True
    for rx in rxs:
        p = re.compile(rx, re.M)
        res = res and p.search(str)
    return(res)
input1 = """DIR1\\FILE1.EXT1\tCONTENT11
DIR1\\FILE1.EXT1\tCONTENT12
DIR1\\FILE1.EXT1\tCONTENT13
DIR1\\FILE2.EXT1\tCONTENT21
DIR2\\FILE3.EXT2\tCONTENT31
DIR3\\FILE3.EXT2\tCONTENT11"""
input2 = """DIR1\\FILE1.EXT1\tCONTENT11
DIR1\\FILE1.EXT1\tCONTENT12
DIR1\\FILE1.EXT1\tCONTENT13
DIR1\\FILE2.EXT1\tCONTENT21
DIR2\\FILE3.EXT2\tCONTENT31"""
rxs = [r"^\S*FILE1\S*\tCONTENT11$",r"^\S*FILE1\S*\tCONTENT12$",r"^\S*FILE3\S*\tCONTENT11$"]
if all_matching(input1,rxs):
    print("input1 matches all rxs") # expected
else:
    print("input1 do not match all rxs")
if all_matching(input2,rxs):
    print("input2 matches all rxs")
else:
    print("input2 do not match all rxs") # expected because input2 doesn't match with rxs[2]
ideone is available here
However, as the input string is very long in my case, I'd rather avoid launching search many times...
I feel like it should be possible to change the all_matching function in that way.
Any help will be much appreciated!
EDIT
clarified the problem and provided sample code

You may build a single regex from the regex strings you have that will require all the regexes to find a match in the input string.
The resulting regex will look like
(?m)\A(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT11$)(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT12$)(?=(?:.*\n)*?\S*FILE3\S*\tCONTENT11$)
See the regex demo.
Basically, it will match:
(?m) - a re.M / re.MULTILINE embedded flag option
\A - start of string (not start of a line!), all the lookaheads below will be triggered one by one, checking the string from the start, until one of them fails
(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT11$) - a positive lookahead that, immediately to the right of the current location, requires the presence of
(?:.*\n)*? - 0 or more whole lines (as few as possible: another line is consumed only if the subsequent subpatterns do not match at the current line)
\S* - 0+ non-whitespaces
FILE1 - a string
\S* - 0+ non-whitespaces
\tCONTENT11 - tab and CONTENT11 substring
$ - end of line (since (?m) allows $ to match end of lines)
(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT12$) - a lookahead working similarly as the preceding one, requiring FILE1 and CONTENT12 substrings on the line
(?=(?:.*\n)*?\S*FILE3\S*\tCONTENT11$) - a lookahead working similarly as the preceding one, requiring FILE3 and CONTENT11 substrings on the line.
In Python, it will look like
rxs = [r"^\S*FILE1\S*\tCONTENT11$",r"^\S*FILE1\S*\tCONTENT12$",r"^\S*FILE3\S*\tCONTENT11$"]
pat = re.compile( r"(?m)\A(?=(?:.*\n)*?{})".format(r")(?=(?:.*\n)*?".join([rx[1:] for rx in rxs])) )
Then, the check method will look like
def all_matching(s, pat):
    return pat.search(s)
See full Python demo online.
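For reference, the pieces above can be put together into one runnable sketch using the two inputs from the question. The helper name build_combined is hypothetical, and the sketch assumes every regex in rxs starts with ^ (that first character is stripped, exactly as rx[1:] does above).

```python
import re

rxs = [r"^\S*FILE1\S*\tCONTENT11$",
       r"^\S*FILE1\S*\tCONTENT12$",
       r"^\S*FILE3\S*\tCONTENT11$"]

def build_combined(rxs):
    # One lookahead per regex; each may skip whole lines before matching.
    lookaheads = "".join(r"(?=(?:.*\n)*?{})".format(rx[1:]) for rx in rxs)
    return re.compile(r"(?m)\A" + lookaheads)

def all_matching(s, pat):
    return pat.search(s) is not None

input1 = ("DIR1\\FILE1.EXT1\tCONTENT11\n"
          "DIR1\\FILE1.EXT1\tCONTENT12\n"
          "DIR1\\FILE1.EXT1\tCONTENT13\n"
          "DIR1\\FILE2.EXT1\tCONTENT21\n"
          "DIR2\\FILE3.EXT2\tCONTENT31\n"
          "DIR3\\FILE3.EXT2\tCONTENT11")
# input2 is input1 without its last line, so the FILE3/CONTENT11 pair is missing.
input2 = input1.rsplit("\n", 1)[0]

pat = build_combined(rxs)
print(all_matching(input1, pat))  # True
print(all_matching(input2, pat))  # False
```

Note that this is still one scan per lookahead in the worst case; what it saves is the Python-level loop and the separate compile and search calls.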

How to match numeric characters with no white space following

I need to match lines in text document where the line starts with numbers and the numbers are followed by nothing.... I want to include numbers that have '.' and ',' separating them.
Currently, I have:
p = re.compile('\$?\s?[0-9]+')
for i, line in enumerate(letter):
    m = p.match(line)
    if m is not None:
        print(m)
        print(line)
Which gives me this:
"15,704" and "416" -> this is good, I want this
but also this:
"$40 million...." -> I do not want to match this line or any line where the numbers are followed by words.
I've tried:
p = re.compile('\$?\s?[0-9]+[ \t\n\r\f\v]')
But it doesn't work. One reason is that it turns out there is no white space after the numbers I'm trying to match.
Appreciate any tips or tricks.

If you want to match the whole string with a regex, you have 2 choices:
Either call re.fullmatch(pattern, string) (note full in the function name). It tries to match just the whole string.
Or put a $ anchor at the end of your regex and call re.match(pattern, string). It tries to find a match from the start of the string.
Actually, you could also add ^ at the start of the regex and call re.search(pattern, string), but it would be a very strange combination.
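One practical difference between the two choices shows up when lines still carry their trailing newline (for example when read from a file): $ also matches just before a final newline, while fullmatch requires the pattern to cover the entire string. A small sketch, using a simplified number pattern purely for illustration:

```python
import re

line = "416\n"  # a line as read from a file, trailing newline included

with_dollar = re.match(r"[\d,.]+$", line)        # $ matches before the final \n
with_fullmatch = re.fullmatch(r"[\d,.]+", line)  # the \n is not covered by the pattern
stripped = re.fullmatch(r"[\d,.]+", line.rstrip("\n"))

print(bool(with_dollar), bool(with_fullmatch), bool(stripped))  # True False True
```

So with fullmatch it is worth stripping the line ending first.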
I also have a remark on how you specified your conditions, which may be incomplete: you gave the string $40 million as an example and stated that the only reason to reject it is the space and letters after $40.
So you should actually have written that you want to match a string:
Possibly starting with $.
After the $ there can be a space (maybe, I'm not sure).
Then there can be a sequence of digits, dots or commas.
And nothing more.
And one more remark concerning Python literals: Apparently you have forgotten to prepend the pattern with r.
If you use r-string literal, you do not have to double backslashes inside.
So I think the most natural solution is to call a function devoted just to matching the whole string (i.e. fullmatch), without adding start / end anchors, and the whole script can be:
import re
pat = re.compile(r'(?:\$\s?)?[\d,.]+')
lines = ["416", "15,704", "$40 million"]
for line in lines:
    if pat.fullmatch(line):
        print(line)
Details concerning the regex:
(?: - A non-capturing group.
\$ - Consisting of a $ char.
\s? - And optional space.
)? - End of the non-capturing group, and ? stating that the whole group is optional.
[\d,.]+ - A sequence of digits, commas and dots (note that between [ and ] the dot represents itself, so no backslash is needed).
If you would like to reject strings like 2...5 or 3.,44 (no consecutive dots or commas allowed), change the last part of the above regex to:
[\d]+(?:[,.]?[\d]+)*
Details:
[\d]+ - A sequence of digits.
(?: - A non-capturing group.
[,.]? - An optional comma or dot (at most one).
[\d]+ - Another sequence of digits.
)* - End of the non-capturing group, it may occur several times.
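A quick way to check the stricter variant is to combine it with the optional $ prefix from the first regex and run fullmatch over a few candidates. The candidate list is made up for illustration ("$ 1,000.50" in particular is not from the question).

```python
import re

strict = re.compile(r"(?:\$\s?)?[\d]+(?:[,.]?[\d]+)*")

candidates = ["416", "15,704", "$ 1,000.50", "2...5", "3.,44", "$40 million"]
accepted = [c for c in candidates if strict.fullmatch(c)]

print(accepted)  # ['416', '15,704', '$ 1,000.50']
```

The strings with consecutive separators and the one with trailing words are rejected, because every separator must be followed by at least one digit and fullmatch leaves no room for extra text.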

With a little modification to your code:
import re
letter = ["15,704", "$40 million"]
p = re.compile(r'^\d{1,3}([\.,]\d{3})*$') # Numbers separated by commas or points
for i, line in enumerate(letter):
    m = p.match(line)
    if m:
        print(line)
Output:
15,704

You could use the following regex:
import re
pattern = re.compile(r'^[0-9,.]+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
The pattern ^[0-9,.]+\s*$ matches a line made up only of digits, commas and dots, followed by zero or more whitespace characters. If you want to match only numbers with at most one , or ., use the following pattern: '^\d+[,.]?\d+\s*$'. Code:
import re
pattern = re.compile(r'^\d+[,.]?\d+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
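The pattern ^\d+[,.]?\d+\s*$ matches everything that starts with a group of digits (\d+), followed by an optional , or . ([,.]?), followed by another group of digits and optional trailing whitespace (\s*). Because it contains two \d+ groups, it needs at least two digits, so a single-digit line such as 5 does not match.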

Using Regex to extract a specific word followed by certain syntax (such as parentheses)

I have a very large document containing section references in different formats. I want to extract these references using Python & regex.
Examples of the string formats:
1) Section 23
2) Section 45(3)
3) point (e) of Section 75
4) Sections 21(1), 54(2), 78(1)
Right now, I have the following code:
s = "This is a sample for Section 231"
m = re.search('Section\\W+(\\w+)', s)
m.group(0)
The output is: Section 231
This works perfectly, except that it does not account for the other formatting cases.
Is there any way to indicate that for 231(1), the (1) should also be extracted? Or to include the following section numbers if several others are listed?
I'm also open to using other libraries if you think Regex is not the best in this case. Thank you!

Try:
Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*
Demo
>>> s = 'Sections 21(1), 54(2), 78(1)'
>>> res = re.search(r'Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*', s)
>>> res.group(0)
# => 'Sections 21(1), 54(2), 78(1)'
Explanation:
Sections? matches "Section" with an optional s
\W+(\w+)(\(\w+\))? matches the section number/title (as you did it) and adds optional text in brackets
(, (\w+)(\(\w+\))?)* allows repetition of the section number pattern after a comma and space
EDIT
To exclude Section 1 of Other Book you can use combination of word boundary and negative lookahead:
Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*\b(?! of)
Demo
\b assures that you match until end of a word
(?! of) checks that after the word boundary there is no space followed by of
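One thing worth checking with this version: \b needs a word character on one side, so it cannot match right after a closing parenthesis that is followed by a space or the end of the string. As far as I can tell from tracing the pattern, the engine then backtracks and drops the final parenthesized part. The sketch below shows that behaviour next to a hypothetical variant (not part of the original answer) that replaces \b with the lookahead (?![\w(]); the helper first_match and the sample strings are made up for illustration.

```python
import re

original = re.compile(
    r"Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*\b(?! of)")
# Hypothetical variant: instead of \b, require that neither a word
# character nor "(" follows, so the match cannot end in the middle of a reference.
variant = re.compile(
    r"Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*(?![\w(])(?! of)")

def first_match(pattern, s):
    m = pattern.search(s)
    return m.group(0) if m else None

samples = [
    "Section 1 of Other Book",
    "see Section 23 for details",
    "Section 45(3) applies",
]
for s in samples:
    print(repr(s), "->", first_match(original, s), "|", first_match(variant, s))
```

Both patterns reject "Section 1 of Other Book" and find "Section 23"; on the third sample the original stops at "Section 45" while the variant returns "Section 45(3)".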

There's probably never going to be a catch-all regex for this - however the following is quite close to what you want:
Sections?( *\d+((\(\d+\))*,?(?= *))*)+
Sections? = Section or Sections
( *\d+((\(\d+\))*,?(?= *))*)+ = 1 or more of: 0 or more spaces, then 1 or more digits, optionally followed by one or more groups of digits in parentheses, then optionally a comma and 0 or more spaces.
The 'trailing' space uses a positive lookahead so it isn't included in the match, so you don't need to strip trailing spaces.
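Here is a short sketch that runs this pattern over the four example formats from the question. The list name examples and the use of re.search with group(0) are choices made for illustration.

```python
import re

section_re = re.compile(r"Sections?( *\d+((\(\d+\))*,?(?= *))*)+")

examples = [
    "Section 23",
    "Section 45(3)",
    "point (e) of Section 75",
    "Sections 21(1), 54(2), 78(1)",
]
found = [section_re.search(text).group(0) for text in examples]

for text, match in zip(examples, found):
    print(repr(text), "->", repr(match))
```

Note that the "point (e) of" part of the third example is not captured, which fits the remark above that a catch-all regex is unlikely.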
Try it out

Anchor to End of Last Match

In the process of working on this answer I stumbled on an anomaly with Python's repeating regexes.
Say I'm given a CSV string with an arbitrary number of quoted and unquoted elements:
21, 2, '23.5R25 ETADT', 'description, with a comma'
I want to replace all the ','s outside quotes with '\t'. So I'd like an output of:
21\t2\t'23.5R25 ETADT'\t'description, with a comma'
Since there will be multiple matches in the string naturally I'll use the g regex modifier. The regex I'll use will match characters outside quotes or a quoted string followed by a ',':
('[^']*'|[^',]*),\s*
And I'll replace with:
\1\t
Now the problem is that the regex is searching, not matching, so it can skip characters until it finds a place where it matches. So rather than my desired output I get:
21\t2\t'23.5R25 ETADT'\t'description\twith a comma'
You can see a live example of this behavior here: https://regex101.com/r/sG9hT3/2
Q. Is there a way to anchor a g modified regex to begin matching at the character after the previous match?
For those familiar with Perl's mighty regexes, Perl provides \G, which anchors a match at the position where the previous match ended. So in Perl I could accomplish what I'm asking for with the regex:
\G('[^']*'|[^',]*),\s*
This would force a mismatch within the final quoted element. Because rather than allowing the regex implementation to find a point where the regex matched the \G would force it to begin matching at the first character of:
'description, with a comma'

You can use the following regex with re.search:
,?\s*([^',]*(?:'[^']*'[^',]*)*)
See regex demo (I changed it to ,?[ ]*([^',\n]*(?:'[^'\n]*'[^',\n]*)*) since it is a multiline demo).
Here, the regex matches (in a regex meaning of the word)...
,? - 1 or 0 comma
\s* - 0 or more whitespace
([^',]*(?:'[^']*'[^',]*)*) - Group 1 storing a captured text that consists of...
[^',]* - 0 or more characters other than , and '
(?:'[^']*'[^',]*)* - 0 or more sequences of ...
'[^']*' - a 'string'-like substring containing no apostrophes
[^',]* - 0 or more characters other than , and '.
If you want to use a re.match and store the captured texts inside capturing groups, it is not possible since Python regex engine does not store all the captures in a stack as .NET regex engine does with CaptureCollection.
Also, Python's built-in re module does not support the \G operator, so you cannot anchor any subpattern at the end of the previous successful match here.
As an alternative/workaround, you can use the following Python code to return successive matches and then the rest of the string:
import re
def successive_matches(pattern,text,pos=0):
    ptrn = re.compile(pattern)
    match = ptrn.match(text,pos)
    while match:
        yield match.group()
        if match.end() == pos:
            break
        pos = match.end()
        match = ptrn.match(text,pos)
    if pos < len(text) - 1:
        yield text[pos:]
for matched_text in successive_matches(r"('[^']*'|[^',]*),\s*","21, 2, '23.5R25 ETADT', 'description, with a comma'"):
    print(matched_text)
See IDEONE demo, the output is
21,
2,
'23.5R25 ETADT',
'description, with a comma'
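The same idea can be taken one step further to perform the replacement itself. The sketch below emulates a \G-anchored substitution by calling the compiled pattern's match method with an explicit start position and stopping at the first position where the pattern no longer matches; the function name sub_anchored is hypothetical, and this is an illustration of the workaround rather than a drop-in replacement for re.sub.

```python
import re

def sub_anchored(pattern, repl, text):
    """Replace successive matches, each starting where the previous one ended."""
    compiled = re.compile(pattern)
    pieces = []
    pos = 0
    while pos < len(text):
        match = compiled.match(text, pos)  # anchored at pos, like \G
        if match is None or match.end() == pos:
            break  # no match here (or an empty one): stop replacing
        pieces.append(match.expand(repl))
        pos = match.end()
    pieces.append(text[pos:])  # keep the unmatched remainder untouched
    return "".join(pieces)

line = "21, 2, '23.5R25 ETADT', 'description, with a comma'"
result = sub_anchored(r"('[^']*'|[^',]*),\s*", r"\1\t", line)
print(result)
```

Because matching stops at the final quoted element instead of searching inside it, the comma in 'description, with a comma' is left alone.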
