(Python) How to check a long string against several regex? - python

I want to check that a long string matches several regexes at once.
I have a long multiline string containing a list of files and some of the content of each file.
DIR1\FILE1.EXT1 CONTENT11
DIR1\FILE1.EXT1 CONTENT12
DIR1\FILE1.EXT1 CONTENT13
DIR1\FILE2.EXT1 CONTENT21
DIR2\FILE3.EXT2 CONTENT31
DIR3\FILE3.EXT2 CONTENT11
The list typically contains hundreds of thousands of lines, sometimes several millions.
I want to check that the list contains predefined file/content pairs:
FILE1 CONTENT11
FILE1 CONTENT12
FILE3 CONTENT11
I know that I can check that the string contains all of these pairs by matching it against the following regexes:
"^\S*FILE1\S*\tCONTENT11$"
"^\S*FILE1\S*\tCONTENT12$"
"^\S*FILE3\S*\tCONTENT11$"
import re
def all_matching(s, rxs):
    # s is the long input string (renamed from str to avoid shadowing the builtin)
    res = True
    for rx in rxs:
        p = re.compile(rx, re.M)
        res = res and p.search(s)
    return res
input1 = """DIR1\\FILE1.EXT1\tCONTENT11
DIR1\\FILE1.EXT1\tCONTENT12
DIR1\\FILE1.EXT1\tCONTENT13
DIR1\\FILE2.EXT1\tCONTENT21
DIR2\\FILE3.EXT2\tCONTENT31
DIR3\\FILE3.EXT2\tCONTENT11"""
input2 = """DIR1\\FILE1.EXT1\tCONTENT11
DIR1\\FILE1.EXT1\tCONTENT12
DIR1\\FILE1.EXT1\tCONTENT13
DIR1\\FILE2.EXT1\tCONTENT21
DIR2\\FILE3.EXT2\tCONTENT31"""
rxs = [r"^\S*FILE1\S*\tCONTENT11$",r"^\S*FILE1\S*\tCONTENT12$",r"^\S*FILE3\S*\tCONTENT11$"]
if all_matching(input1, rxs):
    print("input1 matches all rxs") # expected
else:
    print("input1 does not match all rxs")
if all_matching(input2, rxs):
    print("input2 matches all rxs")
else:
    print("input2 does not match all rxs") # expected because input2 doesn't match rxs[2]
ideone is available here
However, as the input string is very long in my case, I'd rather avoid launching search many times...
I feel like it should be possible to change the all_matching function in that way.
Any help will be much appreciated!
EDIT
Clarified the problem and provided sample code.

You may build a single regex from the regex strings you have that will require all the regexes to find a match in the input string.
The resulting regex will look like
(?m)\A(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT11$)(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT12$)(?=(?:.*\n)*?\S*FILE3\S*\tCONTENT11$)
See the regex demo.
Basically, it will match:
(?m) - a re.M / re.MULTILINE embedded flag option
\A - start of string (not start of a line!), all the lookaheads below will be triggered one by one, checking the string from the start, until one of them fails
(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT11$) - a positive lookahead that, immediately to the right of the current location, requires the presence of
(?:.*\n)*? - 0 or more whole lines (but as few as possible; another line is only skipped if the subsequent subpatterns do not match at the current line)
\S* - 0+ non-whitespaces
FILE1 - a string
\S* - 0+ non-whitespaces
\tCONTENT11 - tab and CONTENT11 substring
$ - end of line (since (?m) allows $ to match end of lines)
(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT12$) - a lookahead working similarly as the preceding one, requiring FILE1 and CONTENT12 substrings on the line
(?=(?:.*\n)*?\S*FILE3\S*\tCONTENT11$) - a lookahead working similarly as the preceding one, requiring FILE3 and CONTENT11 substrings on the line.
In Python, it will look like
rxs = [r"^\S*FILE1\S*\tCONTENT11$",r"^\S*FILE1\S*\tCONTENT12$",r"^\S*FILE3\S*\tCONTENT11$"]
pat = re.compile( r"(?m)\A(?=(?:.*\n)*?{})".format(r")(?=(?:.*\n)*?".join([rx[1:] for rx in rxs])) )
Then, the check method will look like
def all_matching(s, pat):
    return pat.search(s)
See full Python demo online.
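Putting the pieces together, here is a minimal self-contained sketch of the combined-lookahead approach, run against the two sample inputs from the question. The helper name build_all_pattern is made up for illustration, and it assumes every input regex starts with ^, as the ones above do.

```python
import re

def build_all_pattern(rxs):
    # Hypothetical helper: assumes each regex in rxs starts with "^".
    # The "^" is stripped because the line-skipping prefix already
    # leaves the engine at the start of a line.
    skip_lines = r"(?:.*\n)*?"
    lookaheads = "".join("(?={}{})".format(skip_lines, rx[1:]) for rx in rxs)
    return re.compile(r"(?m)\A" + lookaheads)

def all_matching(s, pat):
    # A single search call; every lookahead must succeed from the string start.
    return pat.search(s) is not None

rxs = [
    r"^\S*FILE1\S*\tCONTENT11$",
    r"^\S*FILE1\S*\tCONTENT12$",
    r"^\S*FILE3\S*\tCONTENT11$",
]
pat = build_all_pattern(rxs)

input1 = (
    "DIR1\\FILE1.EXT1\tCONTENT11\n"
    "DIR1\\FILE1.EXT1\tCONTENT12\n"
    "DIR1\\FILE1.EXT1\tCONTENT13\n"
    "DIR1\\FILE2.EXT1\tCONTENT21\n"
    "DIR2\\FILE3.EXT2\tCONTENT31\n"
    "DIR3\\FILE3.EXT2\tCONTENT11"
)
# input2 is input1 without its last line, so the FILE3/CONTENT11 pair is missing.
input2 = input1.rsplit("\n", 1)[0]

print(all_matching(input1, pat))  # True
print(all_matching(input2, pat))  # False
```

Each lookahead still scans the string from the start, so this is one search call rather than one pass over the data. Whether it is actually faster than separate searches is worth timing on real input.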

Related

Regex Match Recurring Pattern with Small Variation

I'm trying to match a repeating pattern with regex (in Python 3.9) which contains the same data in general, but some areas have a varying number of iterations (specifically the lines beginning "CLD" and "REF").
I'm trying to match from "LIN" to the end of the line starting "HL" so I can carry out further matching on each iteration afterwards.
This is an extract of the data I am using...
LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US
SN1**300*PC
PRF*5500015558****01
PID*F****DESCRIPTION01
REF*PK*000000051213
CLD*1*300*PLT71
REF*LS*0079393
HL*3*1*I
LIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US
SN1**64*PC
PRF*5500014695****01
PID*F****DESCRIPTION02
REF*PK*000000051213
CLD*1*24*PLT71
REF*LS*0079393
CLD*1*40*PLT71
REF*LS*0079390
HL*4*1*I
My RegEx so far looks like this (although well short of what I'm trying to achieve)...
LIN.*\nSN.*\nPRF.*\nPID.*\nREF.*\n
However I got stuck at this point due to the varying number of "CLD" & "REF" lines and therefore it stops short of what I need and I'm pretty sure this is not efficient regex...
LIN**SI*ASN*BP*CH11979*VP*1262702*CH*US
SN1**300*PC
PRF*5500015558****01
PID*F****SEAL INTEGRAL
REF*PK*000000051213
LIN**SI*ASN*BP*CH10439*VP*1375541*CH*US
SN1**64*PC
PRF*5500014695****01
PID*F****PUMP AS PRIMING
REF*PK*000000051213
I also experimented with the regex below (from some Googling) to get around the varying occurrences and also be more efficient but it's not working...
LIN(.|\n)*HL.*
Can anyone help me pull this together?
You can use
(?m)^LIN[\w\W]*?\nHL.* # For any Python version
(?m)^LIN(?s:.*?)\nHL.* # For Python 3.6+
See the regex demo.
Details:
(?m) - an inline re.M modifier flag that enables ^ to match any line start position
^ - start of a line
LIN - LIN string
[\w\W]*? / (?s:.*?) - any zero or more chars, as few as possible
\n - a newline
HL - HL
.* - the rest of the line.
See the Python demo:
import re
text = "LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US\nSN1**300*PC\nPRF*5500015558****01\nPID*F****DESCRIPTION01\nREF*PK*000000051213\nCLD*1*300*PLT71\nREF*LS*0079393\nHL*3*1*I\nLIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US\nSN1**64*PC\nPRF*5500014695****01\nPID*F****DESCRIPTION02\nREF*PK*000000051213\nCLD*1*24*PLT71\nREF*LS*0079393\nCLD*1*40*PLT71\nREF*LS*0079390\nHL*4*1*I"
print(re.findall(r'(?m)^LIN(?s:.*?)\nHL.*', text))
Output:
[
'LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US\nSN1**300*PC\nPRF*5500015558****01\nPID*F****DESCRIPTION01\nREF*PK*000000051213\nCLD*1*300*PLT71\nREF*LS*0079393\nHL*3*1*I',
'LIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US\nSN1**64*PC\nPRF*5500014695****01\nPID*F****DESCRIPTION02\nREF*PK*000000051213\nCLD*1*24*PLT71\nREF*LS*0079393\nCLD*1*40*PLT71\nREF*LS*0079390\nHL*4*1*I'
]
For varying iterations for CLD and REF, you could repeat 1 or more times matching the lines that start with either of them using an alternation |, and a quantifier + to match at least 1 line.
^LIN.*\nSN.*\nPRF.*\nPID.*\n(?:(?:REF|CLD).*\n)+HL.*
Explanation
^ Start of a line (the code below passes re.MULTILINE, so ^ matches at every line start)
LIN.*\nSN.*\nPRF.*\nPID.*\n Match the first 4 lines
(?:(?:REF|CLD).*\n)+ Repeat 1+ times matching either REF or CLD, the rest of the line and a newline
HL.* Match HL and the rest of the line
Regex demo | Python demo
For example
import re
s = ("LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US\n"
"SN1**300*PC\n"
"PRF*5500015558****01\n"
"PID*F****DESCRIPTION01\n"
"REF*PK*000000051213\n"
"CLD*1*300*PLT71\n"
"REF*LS*0079393\n"
"HL*3*1*I\n"
"LIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US\n"
"SN1**64*PC\n"
"PRF*5500014695****01\n"
"PID*F****DESCRIPTION02\n"
"REF*PK*000000051213\n"
"CLD*1*24*PLT71\n"
"REF*LS*0079393\n"
"CLD*1*40*PLT71\n"
"REF*LS*0079390\n"
"HL*4*1*I")
regex = r"^LIN.*\nSN.*\nPRF.*\nPID.*\n(?:(?:REF|CLD).*\n)+HL.*"
print(re.findall(regex, s, re.MULTILINE))
Output
[
'LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US\nSN1**300*PC\nPRF*5500015558****01\nPID*F****DESCRIPTION01\nREF*PK*000000051213\nCLD*1*300*PLT71\nREF*LS*0079393\nHL*3*1*I',
'LIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US\nSN1**64*PC\nPRF*5500014695****01\nPID*F****DESCRIPTION02\nREF*PK*000000051213\nCLD*1*24*PLT71\nREF*LS*0079393\nCLD*1*40*PLT71\nREF*LS*0079390\nHL*4*1*I'
]
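Since the question mentions further matching on each iteration, here is a minimal sketch of one way to do that with re.finditer on the LIN...HL blocks. The function name summarize is made up, and the field positions are assumptions read off the sample data (the item code as the sixth *-separated field of the LIN line, the quantity as the second number on each CLD line), not a statement about the file format.

```python
import re

text = (
    "LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US\n"
    "SN1**300*PC\n"
    "PRF*5500015558****01\n"
    "PID*F****DESCRIPTION01\n"
    "REF*PK*000000051213\n"
    "CLD*1*300*PLT71\n"
    "REF*LS*0079393\n"
    "HL*3*1*I\n"
    "LIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US\n"
    "SN1**64*PC\n"
    "PRF*5500014695****01\n"
    "PID*F****DESCRIPTION02\n"
    "REF*PK*000000051213\n"
    "CLD*1*24*PLT71\n"
    "REF*LS*0079393\n"
    "CLD*1*40*PLT71\n"
    "REF*LS*0079390\n"
    "HL*4*1*I"
)

# Same block pattern as in the answer above (scoped (?s:...) needs Python 3.6+).
block_rx = re.compile(r"(?m)^LIN(?s:.*?)\nHL.*")
# Assumption: the second number on a CLD line is the quantity of interest.
cld_rx = re.compile(r"(?m)^CLD\*(\d+)\*(\d+)\*")

def summarize(data):
    rows = []
    for block in block_rx.finditer(data):
        chunk = block.group()
        first_line = chunk.split("\n", 1)[0]
        item = first_line.split("*")[5]  # assumed position of the item code
        quantities = [int(qty) for _, qty in cld_rx.findall(chunk)]
        rows.append((item, quantities))
    return rows

for item, quantities in summarize(text):
    print(item, quantities)
# ITEM01 [300]
# ITEM02 [24, 40]
```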

python regex combine patterns with AND and group

I am trying to use regex to match something meets the following conditions:
do not contain a "//" string
contain Chinese characters
pick up those Chinese characters
I read line by line from a file:
# regex is one of the compiled patterns shown below
with open("test.js", "r", encoding="utf-8") as f:
    for line in f:
        matches = regex.findall(line)
        if matches:
            print(matches)
First I tried to match Chinese characters using following pattern:
re.compile(r"[\u4e00-\u9fff]+")
it works and give me the output:
['下载失成功']
['下载失败']
['绑定监听']
['该功能暂未开放']
Then I tried to exclude the "//" with the following pattern and combine it to the above pattern:
re.compile(r"^(?=^(?:(?!//).)*$)(?=.*[\u4e00-\u9fff]+).*$")
it gives me the output:
[' showToastByText("该功能暂未开放");']
which is almost right but what I want is only the Chinese characters part.
I tried to add "()" but just can not pick up the part that I want.
Any advice will be appreciated, thanks :)
You don't need such a complex regex just to reject lines containing // and capture a run of consecutive Chinese characters. To discard lines containing //, the negative lookahead (?!.*//) is enough, and to capture the Chinese text you can use [^\u4e00-\u9fff]*([\u4e00-\u9fff]+). The overall regex becomes:
^(?!.*//)[^\u4e00-\u9fff]*([\u4e00-\u9fff]+)
The Chinese characters are then available in the first capture group.
Explanation of above regex:
^ - Start of string
(?!.*//) - Negative look ahead that will discard the match if // is present in the line anywhere ahead
[^\u4e00-\u9fff]* - Optionally matches zero or more non-Chinese characters
([\u4e00-\u9fff]+) - Captures one or more Chinese characters and puts them in the first capture group.
Demo
Edit: Here is sample code showing how to capture the text from group 1
import re
s = ' showToastByText("该功能暂未开放");'
m = re.search(r'^(?!.*//)[^\u4e00-\u9fff]*([\u4e00-\u9fff]+)',s)
if m:
    print(m.group(1))
Prints,
该功能暂未开放
Online Python Demo
Edit: For extracting multiple occurrences of Chinese characters, as mentioned in the comments
As you want to extract multiple occurrences of Chinese characters, you can first check that the string does not contain // and then use findall to extract all the Chinese text. Here is sample code demonstrating this,
import re
arr = ['showToastByText("该功能暂未开放");','//showToastByText("该功能暂未开放");','showToastByText("未开放");','showToastByText("该功能暂xxxxxx未开放");']
for s in arr:
    # re.search finds // anywhere in the string; re.match would only check the start
    if re.search(r'//', s):
        print(s, ' --> contains // hence not finding')
    else:
        print(s, ' --> ', re.findall(r'[\u4e00-\u9fff]+',s))
Prints,
showToastByText("该功能暂未开放"); --> ['该功能暂未开放']
//showToastByText("该功能暂未开放"); --> contains // hence not finding
showToastByText("未开放"); --> ['未开放']
showToastByText("该功能暂xxxxxx未开放"); --> ['该功能暂', '未开放']
Online Python demo
You don't need a positive lookahead to get the Chinese characters: a lookahead only checks and consumes nothing, so it cannot capture them. Instead, rewrite that part as a lazy .*? that runs until it reaches the desired characters.
As such, using:
^(?=^(?:(?!//).)*$).*?([\u4e00-\u9fff]+).*$
Your first capture group will be the Chinese characters.
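A minimal sketch of how that last pattern could be used line by line; the helper name first_chinese_run is made up for illustration. Note that it returns only the first run of Chinese characters on a line, because the pattern has a single capture group.

```python
import re

# Lookahead rejects any line containing //, then the lazy .*? skips
# ahead to the first run of characters in the \u4e00-\u9fff range.
pat = re.compile(r'^(?=^(?:(?!//).)*$).*?([\u4e00-\u9fff]+).*$')

def first_chinese_run(line):
    # Hypothetical helper: returns the first run, or None if the line
    # contains // or has no characters in the range.
    m = pat.search(line)
    return m.group(1) if m else None

print(first_chinese_run(' showToastByText("该功能暂未开放");'))    # 该功能暂未开放
print(first_chinese_run('//showToastByText("该功能暂未开放");'))   # None
print(first_chinese_run('showToastByText("该功能暂xxxxxx未开放");'))  # 该功能暂
```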

How to match numeric characters with no white space following

I need to match lines in text document where the line starts with numbers and the numbers are followed by nothing.... I want to include numbers that have '.' and ',' separating them.
Currently, I have:
p = re.compile('\$?\s?[0-9]+')
for i, line in enumerate(letter):
    m = p.match(line)
    if m is not None:
        print(m)
        print(line)
Which gives me this:
"15,704" and "416" -> this is good, I want this
but also this:
"$40 million...." -> I do not want to match this line or any line where the numbers are followed by words.
I've tried:
p = re.compile('\$?\s?[0-9]+[ \t\n\r\f\v]')
But it doesn't work. One reason is that it turns out there is no white space after the numbers I'm trying to match.
Appreciate any tips or tricks.
If you want to match the whole string with a regex,
you have 2 choices:
Either call re.fullmatch(pattern, string) (note full in the function name).
It tries to match just the whole string.
Or put $ anchor at the end of your regex and call re.match(pattern, string).
It tries to find a match from the start of the string.
Actually, you could also add ^ at the start of regex and call re.search(pattern, string),
but it would be a very strange combination.
I have also a remark concerning how you specified your conditions, maybe in incomplete
way: You put e.g. $40 million string and stated that the only reason to reject
it is space and letters after $40.
So actually you should have written that you want to match a string:
Possibly starting with $.
After the $ there can be a space (maybe, I'm not sure).
Then there can be a sequence of digits, dots or commas.
And nothing more.
And one more remark concerning Python literals: Apparently you have forgotten to prepend the pattern with r.
If you use r-string literal, you do not have to double backslashes inside.
So I think the most natural solution is to call a function devoted just to
match the whole string (i.e. fullmatch), without adding start / end
anchors and the whole script can be:
import re
pat = re.compile(r'(?:\$\s?)?[\d,.]+')
lines = ["416", "15,704", "$40 million"]
for line in lines:
    if pat.fullmatch(line):
        print(line)
Details concerning the regex:
(?: - A non-capturing group.
\$ - Consisting of a $ char.
\s? - And optional space.
)? - End of the non-capturing group and ? stating that the whole
group is optional.
[\d,.]+ - A sequence of digits, commas and dots (note that between [
and ] the dot represents itself, so no backslash quotation is needed).
If you would like to reject strings like 2...5 or 3.,44 (no consecutive
dots or commas allowed), change the last part of the above regex to:
[\d]+(?:[,.][\d]+)*
Details:
[\d]+ - A sequence of digits.
(?: - A non-capturing group.
[,.] - Either a comma or a dot (single).
[\d]+ - Another sequence of digits.
)* - End of the non-capturing group, it may occur several times.
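To see the stricter variant in action, here is a small sketch that combines the optional $ prefix with the digit-group part described above and checks a few strings with fullmatch. The helper name is_bare_number and the extra test strings are made up for illustration.

```python
import re

# Optional "$" (with an optional space), then digit groups separated
# by a single comma or dot. fullmatch means no anchors are needed.
pat = re.compile(r'(?:\$\s?)?\d+(?:[,.]\d+)*')

def is_bare_number(line):
    # Hypothetical helper: True only if the whole line is a number.
    return pat.fullmatch(line) is not None

for line in ["416", "15,704", "$40 million", "2...5", "3.,44"]:
    print(repr(line), is_bare_number(line))
# '416' True
# '15,704' True
# '$40 million' False
# '2...5' False
# '3.,44' False
```

If the lines come straight from a file they will still carry a trailing newline, so strip them (for example with line.strip()) before calling fullmatch.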
With a little modification to your code:
import re
letter = ["15,704", "$40 million"]
p = re.compile(r'^\d{1,3}([\.,]\d{3})*$') # Numbers separated by commas or points
for i, line in enumerate(letter):
    m = p.match(line)
    if m:
        print(line)
Output:
15,704
You could use the following regex:
import re
pattern = re.compile(r'^[0-9,.]+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
The pattern ^[0-9,.]+\s*$ matches a line made up only of digits, commas and dots, optionally followed by whitespace. If you want to allow at most one , or . use the following pattern: '^\d+[,.]?\d+\s*$', code:
import re
pattern = re.compile(r'^\d+[,.]?\d+\s*$')
lines = ["416", "15,704", "$40 million...."]
for line in lines:
    if pattern.match(line):
        print(line)
Output
416
15,704
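The pattern ^\d+[,.]?\d+\s*$ matches everything that starts with a group of digits (\d+) followed by an optional , or . ([,.]?) followed by another group of digits, with optional trailing whitespace (\s*). Note that it needs at least two digits in total, so a single-digit line such as 7 does not match.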

Python regex to match after the text and the dot [duplicate]

I am using Python and would like to match all the words after test till a period (full-stop) or space is encountered.
text = "test : match this."
At the moment, I am using :
import re
re.match('(?<=test :).*',text)
The above code doesn't match anything. I need match this as my output.
Everything after test, including test
test.*
Everything after test, without test
(?<=test).*
Example here on regexr.com
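You need to use re.search, since re.match only tries to match at the beginning of the string. To match until a space or period is encountered (the lookbehind includes the space after the colon; without it the match would stop at that space immediately and be empty):
re.search(r'(?<=test : )[^.\s]*',text)
To match all the chars until a period is encountered,
re.search(r'(?<=test : )[^.]*',text)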
In a general case, as the title mentions, you may capture with (.*) pattern any 0 or more chars other than newline after any pattern(s) you want:
import re
p = re.compile(r'test\s*:\s*(.*)')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
    print(m.group(1)) # Print Group 1 value
If you want . to match across multiple lines, compile the regex with re.DOTALL or re.S flag (or add (?s) before the pattern):
p = re.compile(r'test\s*:\s*(.*)', re.DOTALL)
p = re.compile(r'(?s)test\s*:\s*(.*)')
However, it will return match this. with the trailing period included. See also a regex demo.
You can add \. pattern after (.*) to make the regex engine stop before the last . on that line:
test\s*:\s*(.*)\.
Watch out for re.match() since it will only look for a match at the beginning of the string (Avinash already pointed that out, but it is a very important note!)
See the regex demo and a sample Python code snippet:
import re
p = re.compile(r'test\s*:\s*(.*)\.')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
    print(m.group(1)) # Print Group 1 value
If you want to make sure test is matched as a whole word, add \b before it (do not remove the r prefix from the string literal, or '\b' will match a BACKSPACE char!) - r'\btest\s*:\s*(.*)\.'.
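As a rough sketch of the same idea wrapped in a reusable function: the name text_after and its label parameter are made up for illustration, and the capture here stops at the first period rather than the last one on the line.

```python
import re

def text_after(label, s):
    # Hypothetical helper: returns the text that follows "label :" up to
    # (not including) the first period, or None if label is not found.
    # re.escape guards against regex metacharacters in label.
    pattern = r'\b' + re.escape(label) + r'\s*:\s*([^.]*)'
    m = re.search(pattern, s)
    return m.group(1) if m else None

print(text_after("test", "test : match this."))      # match this
print(text_after("test", "a contest: not this."))    # None
```

The second call returns None because \b requires test to start at a word boundary, which it does not inside contest.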
I don't see why you want to use regex if you're just getting a subset from a string.
This works the same way:
if line.startswith('test:'):
    print(line[5:line.find('.')])
example:
>>> line = "test: match this."
>>> print(line[5:line.find('.')])
match this
Regex is slow, awkward to design, and difficult to debug. There are definitely occasions to use it, but if you just want to extract the text between test: and ., then I don't think this is one of those occasions.
See: https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
For more flexibility (for example if you are looping through a list of strings you want to find at the beginning of a string and then index out) replace 5 (the length of 'test:') in the index with len(str_you_looked_for).

Anchor to End of Last Match

In the process of working on this answer I stumbled on an anomaly with Python's repeating regexes.
Say I'm given a CSV string with an arbitrary number of quoted and unquoted elements:
21, 2, '23.5R25 ETADT', 'description, with a comma'
I want to replace all the ','s outside quotes with '\t'. So I'd like an output of:
21\t2\t'23.5R25 ETADT'\t'description, with a comma'
Since there will be multiple matches in the string naturally I'll use the g regex modifier. The regex I'll use will match characters outside quotes or a quoted string followed by a ',':
('[^']*'|[^',]*),\s*
And I'll replace with:
\1\t
Now the problem is that the regex is searching, not matching, so it can skip characters until it finds a place where it can match. So rather than my desired output I get:
21\t2\t'23.5R25 ETADT'\t'description\twith a comma'
You can see a live example of this behavior here: https://regex101.com/r/sG9hT3/2
Q. Is there a way to anchor a g modified regex to begin matching at the character after the previous match?
For those familiar with Perl's mighty regexes, Perl provides \G, which anchors a match at the position where the previous match ended. So in Perl I could accomplish what I'm asking for with the regex:
\G('[^']*'|[^',]*),\s*
This would force a mismatch within the final quoted element. Because rather than allowing the regex implementation to find a point where the regex matched the \G would force it to begin matching at the first character of:
'description, with a comma'
You can use the following regex with re.search:
,?\s*([^',]*(?:'[^']*'[^',]*)*)
See regex demo (I change it to ,?[ ]*([^',\n]*(?:'[^'\n]*'[^',\n]*)*) since it is a multiline demo)
Here, the regex matches (in a regex meaning of the word)...
,? - 1 or 0 comma
\s* - 0 or more whitespace
([^',]*(?:'[^']*'[^',]*)*) - Group 1 storing a captured text that consists of...
[^',]* - 0 or more characters other than , and '
(?:'[^']*'[^',]*)* - 0 or more sequences of ...
'[^']*' - a 'string'-like substring containing no apostrophes
[^',]* - 0 or more characters other than , and '.
If you want to use a re.match and store the captured texts inside capturing groups, it is not possible since Python regex engine does not store all the captures in a stack as .NET regex engine does with CaptureCollection.
Also, Python's built-in re module does not support the \G operator, so you cannot anchor any subpattern at the end of the previous successful match here.
As an alternative/workaround, you can use the following Python code to return successive matches and then the rest of the string:
import re
def successive_matches(pattern, text, pos=0):
    ptrn = re.compile(pattern)
    match = ptrn.match(text, pos)  # match() anchors at pos, unlike search()
    while match:
        yield match.group()
        if match.end() == pos:
            break  # zero-width match: stop to avoid looping forever
        pos = match.end()
        match = ptrn.match(text, pos)
    if pos < len(text):
        yield text[pos:]  # whatever is left after the last anchored match
for matched_text in successive_matches(r"('[^']*'|[^',]*),\s*", "21, 2, '23.5R25 ETADT', 'description, with a comma'"):
    print(matched_text)
See IDEONE demo, the output is
21,
2,
'23.5R25 ETADT',
'description, with a comma'
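Building on the same anchored-match loop, here is a minimal sketch that produces the tab-separated output the question asks for. The function name commas_to_tabs is made up for illustration, and it assumes quoted elements never contain an apostrophe, as in the sample.

```python
import re

# One element (quoted, or unquoted without commas) followed by a comma
# and optional whitespace: the same regex as in the question.
ELEMENT = re.compile(r"('[^']*'|[^',]*),\s*")

def commas_to_tabs(text):
    parts = []
    pos = 0
    while True:
        # match() with a pos argument anchors at pos, emulating \G.
        m = ELEMENT.match(text, pos)
        if m is None or m.end() == pos:
            break
        parts.append(m.group(1))
        pos = m.end()
    parts.append(text[pos:])  # the last element has no trailing comma
    return "\t".join(parts)

line = "21, 2, '23.5R25 ETADT', 'description, with a comma'"
print(commas_to_tabs(line))
```

Because each match must start exactly where the previous one ended, the comma inside the last quoted element is never reached by the unquoted alternative, so it is left untouched.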
