Match File Names with Preceding Zeros - python

I'm trying to match exactly match one file in the Printer Spooler folder with RegEx. Basically what I'm going for is match the one file called *[filename].SPL in %windir%\spool\PRINTERS using Python. I thought using a dynamically generated RegEx along the lines of: [matching none or many-zeros] + [filename].SPL
Tried a few Regular Expressions, but always had the issue that the linebreak of the previous file matches as well at regex101.com
File format:
02980.SPL
20980.SPL
00011.SPL
00001.SPL
Expressions I came up with:
[\r\n][^1-9]+1.SPL
[^1-9].*1.SPL

You may use
r'^0*{}\.SPL$'.format(filename)
See the regex demo online.
If filename is 1, the pattern will look like ^0*1\.SPL$ and will match:
^ - start of string
0* - zero or more 0 chars
1\.SPL - 1.SPL substring
$ - end of string.
See the Python demo:
import re
l = ['02980.SPL','20980.SPL','00011.SPL','00001.SPL']
filename=1
rx = re.compile(r'^0*{}\.SPL$'.format(filename))
print([f for f in l if rx.search(f) ])
# => ['00001.SPL']

Related

Regex Match Recurring Pattern with Small Variation

I'm trying to match a repeating pattern with regex (in Python 3.9) which contains the same data in general but there are some areas which have varying iterations (specifically the lines beginning "CLD" and "REF".
I"m trying to match from "LIN" to the end of the line starting "HL" so I can carry out further matching on each iteration after.
This is an extract of the data I am using...
LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US
SN1**300*PC
PRF*5500015558****01
PID*F****DESCRIPTION01
REF*PK*000000051213
CLD*1*300*PLT71
REF*LS*0079393
HL*3*1*I
LIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US
SN1**64*PC
PRF*5500014695****01
PID*F****DESCRIPTION02
REF*PK*000000051213
CLD*1*24*PLT71
REF*LS*0079393
CLD*1*40*PLT71
REF*LS*0079390
HL*4*1*I
My RegEx so far looks like this (although well short of what I'm trying to achieve)...
LIN.*\nSN.*\nPRF.*\nPID.*\nREF.*\n
However I got stuck at this point due to the varying number of "CLD" & "REF" lines and therefore it stops short of what I need and I'm pretty sure this is not efficient regex...
LIN**SI*ASN*BP*CH11979*VP*1262702*CH*US
SN1**300*PC
PRF*5500015558****01
PID*F****SEAL INTEGRAL
REF*PK*000000051213
LIN**SI*ASN*BP*CH10439*VP*1375541*CH*US
SN1**64*PC
PRF*5500014695****01
PID*F****PUMP AS PRIMING
REF*PK*000000051213
I also experimented with the regex below (from some Googling) to get around the varying occurrences and also be more efficient but it's not working...
LIN(.|\n)*HL.*
Can anyone help me pull this together?
You can use
(?m)^LIN[\w\W]*?\nHL.* # For any Python version
(?m)^LIN(?s:.*?)\nHL.* # For Python 3.6+
See the regex demo.
Details:
(?m) - an inline re.M modifier flag that enables ^ to match any line start position
^ - start of a line
LIN - LIN string
[\w\W]*? / (?s:.*?) - any zero or more chars, as few as possible
\n - a newline
HL - HL
.* - the rest of the line.
See the Python demo:
import re
text = "LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US\nSN1**300*PC\nPRF*5500015558****01\nPID*F****DESCRIPTION01\nREF*PK*000000051213\nCLD*1*300*PLT71\nREF*LS*0079393\nHL*3*1*I\nLIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US\nSN1**64*PC\nPRF*5500014695****01\nPID*F****DESCRIPTION02\nREF*PK*000000051213\nCLD*1*24*PLT71\nREF*LS*0079393\nCLD*1*40*PLT71\nREF*LS*0079390\nHL*4*1*I"
print(re.findall(r'(?m)^LIN(?s:.*?)\nHL.*', text))
Output:
[
'LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US\nSN1**300*PC\nPRF*5500015558****01\nPID*F****DESCRIPTION01\nREF*PK*000000051213\nCLD*1*300*PLT71\nREF*LS*0079393\nHL*3*1*I',
'LIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US\nSN1**64*PC\nPRF*5500014695****01\nPID*F****DESCRIPTION02\nREF*PK*000000051213\nCLD*1*24*PLT71\nREF*LS*0079393\nCLD*1*40*PLT71\nREF*LS*0079390\nHL*4*1*I'
]
For varying iterations for CLD and REF, you could repeat 1 or more times matching the lines that start with either of them using an alternation |, and a quantifier + to match at least 1 line.
^LIN.*\nSN.*\nPRF.*\nPID.*\n(?:(?:REF|CLD).*\n)+HL.*
Explanation
^ Start of string
LIN.*\nSN.*\nPRF.*\nPID.*\n Match the first 4 lines
(?:(?:REF|CLD).*\n)+ Repeat 1+ times matching either REF or CLD, the rest of the line and a newline
HL.* Match HL and the rest of the line
Regex demo | Python demo
For example
import re
s = ("LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US\n"
"SN1**300*PC\n"
"PRF*5500015558****01\n"
"PID*F****DESCRIPTION01\n"
"REF*PK*000000051213\n"
"CLD*1*300*PLT71\n"
"REF*LS*0079393\n"
"HL*3*1*I\n"
"LIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US\n"
"SN1**64*PC\n"
"PRF*5500014695****01\n"
"PID*F****DESCRIPTION02\n"
"REF*PK*000000051213\n"
"CLD*1*24*PLT71\n"
"REF*LS*0079393\n"
"CLD*1*40*PLT71\n"
"REF*LS*0079390\n"
"HL*4*1*I")
regex = r"^LIN.*\nSN.*\nPRF.*\nPID.*\n(?:(?:REF|CLD).*\n)+HL.*"
print(re.findall(regex, s, re.MULTILINE))
Output
[
'LIN**SI*ASN*BP*ITEM01*VP*1262702*CH*US\nSN1**300*PC\nPRF*5500015558****01\nPID*F****DESCRIPTION01\nREF*PK*000000051213\nCLD*1*300*PLT71\nREF*LS*0079393\nHL*3*1*I',
'LIN**SI*ASN*BP*ITEM02*VP*1375541*CH*US\nSN1**64*PC\nPRF*5500014695****01\nPID*F****DESCRIPTION02\nREF*PK*000000051213\nCLD*1*24*PLT71\nREF*LS*0079393\nCLD*1*40*PLT71\nREF*LS*0079390\nHL*4*1*I'
]

(Python) How to check a long string against several regex?

I want to ensure that a long string can match with several regex at once.
I have a long multi line string containing a list of files and some content of the file.
DIR1\FILE1.EXT1 CONTENT11
DIR1\FILE1.EXT1 CONTENT12
DIR1\FILE1.EXT1 CONTENT13
DIR1\FILE2.EXT1 CONTENT21
DIR2\FILE3.EXT2 CONTENT31
DIR3\FILE3.EXT2 CONTENT11
The list typically contains hundreds of thousands of lines, sometimes several millions.
I want to check that the list contains predefined couples file/content:
FILE1 CONTENT11
FILE1 CONTENT12
FILE3 CONTENT11
I know that I can check that the string contains all of these couples by matching the string against some regexes
"^\S*FILE1\S*\tCONTENT11$"
"^\S*FILE1\S*\tCONTENT12$"
"^\S*FILE3\S*\tCONTENT11$"
import re
def all_matching(str, rxs):
res = True
for rx in rxs:
p = re.compile(rx, re.M)
res = res and p.search(str)
return(res)
input1 = """DIR1\\FILE1.EXT1\tCONTENT11
DIR1\\FILE1.EXT1\tCONTENT12
DIR1\\FILE1.EXT1\tCONTENT13
DIR1\\FILE2.EXT1\tCONTENT21
DIR2\\FILE3.EXT2\tCONTENT31
DIR3\\FILE3.EXT2\tCONTENT11"""
input2 = """DIR1\\FILE1.EXT1\tCONTENT11
DIR1\\FILE1.EXT1\tCONTENT12
DIR1\\FILE1.EXT1\tCONTENT13
DIR1\\FILE2.EXT1\tCONTENT21
DIR2\\FILE3.EXT2\tCONTENT31"""
rxs = [r"^\S*FILE1\S*\tCONTENT11$",r"^\S*FILE1\S*\tCONTENT12$",r"^\S*FILE3\S*\tCONTENT11$"]
if all_matching(input1,rxs):
print("input1 matches all rxs") # excpected
else:
print("input1 do not match all rxs")
if all_matching(input2,rxs):
print("input2 matches all rxs")
else:
print("input2 do not match all rxs") # expected because input2 doesn't match wirh rxs[2]
ideone is available here
However, as the input string is very long in my case, I'd rather avoid launching search many times...
I feel like it should be possible to change the all_matching function in that way.
Any help will be much appreciated!
EDIT
clarified the problem an provided sample code
You may build a single regex from the regex strings you have that will require all the regexes to find a match in the input string.
The resulting regex will look like
\A(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT11$)(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT12$)(?=(?:.*\n)*?\S*FILE3\S*\tCONTENT11$)
See the regex demo.
Basically, it will match:
(?m) - a re.M / re.MULTILINE embedded flag option
\A - start of string (not start of a line!), all the lookaheads below will be triggered one by one, checking the string from the start, until one of them fails
(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT11$) - a positive lookahead that, immediately to the right of the current location, requires the presence of
(?:.*\n)*? - 0 or more (but as few as possible, the pattern will only be tried if the subsequent subpatterns do not match)
\S* - 0+ non-whitespaces
FILE1 - a string
\S* - 0+ non-whitespaces
\tCONTENT11 - tab and CONTENT11 substring
$ - end of line (since (?m) allows $ to match end of lines)
(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT12$) - a lookahead working similarly as the preceding one, requiring FILE1 and CONTENT12 substrings on the line
(?=(?:.*\n)*?\S*FILE3\S*\tCONTENT11$) - a lookahead working similarly as the preceding one, requiring FILE3 and CONTENT11 substrings on the line.
In Python, it will look like
rxs = [r"^\S*FILE1\S*\tCONTENT11$",r"^\S*FILE1\S*\tCONTENT12$",r"^\S*FILE3\S*\tCONTENT11$"]
pat = re.compile( r"(?m)\A(?=(?:.*\n)*?{})".format(r")(?=(?:.*\n)*?".join([rx[1:] for rx in rxs])) )
Then, the check method will look like
def all_matching(s, pat):
return pat.search(s)
See full Python demo online.

Python regex to match after the text and the dot [duplicate]

I am using Python and would like to match all the words after test till a period (full-stop) or space is encountered.
text = "test : match this."
At the moment, I am using :
import re
re.match('(?<=test :).*',text)
The above code doesn't match anything. I need match this as my output.
Everything after test, including test
test.*
Everything after test, without test
(?<=test).*
Example here on regexr.com
You need to use re.search since re.match tries to match from the beging of the string. To match until a space or period is encountered.
re.search(r'(?<=test :)[^.\s]*',text)
To match all the chars until a period is encountered,
re.search(r'(?<=test :)[^.]*',text)
In a general case, as the title mentions, you may capture with (.*) pattern any 0 or more chars other than newline after any pattern(s) you want:
import re
p = re.compile(r'test\s*:\s*(.*)')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
print(m.group(1)) # Print Group 1 value
If you want . to match across multiple lines, compile the regex with re.DOTALL or re.S flag (or add (?s) before the pattern):
p = re.compile(r'test\s*:\s*(.*)', re.DOTALL)
p = re.compile(r'(?s)test\s*:\s*(.*)')
However, it will retrun match this.. See also a regex demo.
You can add \. pattern after (.*) to make the regex engine stop before the last . on that line:
test\s*:\s*(.*)\.
Watch out for re.match() since it will only look for a match at the beginning of the string (Avinash aleady pointed that out, but it is a very important note!)
See the regex demo and a sample Python code snippet:
import re
p = re.compile(r'test\s*:\s*(.*)\.')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
print(m.group(1)) # Print Group 1 value
If you want to make sure test is matched as a whole word, add \b before it (do not remove the r prefix from the string literal, or '\b' will match a BACKSPACE char!) - r'\btest\s*:\s*(.*)\.'.
I don't see why you want to use regex if you're just getting a subset from a string.
This works the same way:
if line.startswith('test:'):
print(line[5:line.find('.')])
example:
>>> line = "test: match this."
>>> print(line[5:line.find('.')])
match this
Regex is slow, it is awkward to design, and difficult to debug. There are definitely occassions to use it, but if you just want to extract the text between test: and ., then I don't think is one of those occasions.
See: https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
For more flexibility (for example if you are looping through a list of strings you want to find at the beginning of a string and then index out) replace 5 (the length of 'test:') in the index with len(str_you_looked_for).

Extract path from lines in a file using python

I am trying to extract path from a given file which meet some criteria:
Example:
I have a small file with contents something like :
contentsaasdf /net/super/file-1.txt othercontents...
data is in /sample/random/folder/folder2/file-2.txt otherdata...
filename /otherfile/other-3.txt somewording
I want to extract the path's from file which contain file-*.txt in it.
In above example, I need the below path's as output
/net/super/file-1.txt
/sample/random/folder/folder2/file-2.txt
Any suggestions with Python code ?
I am trying regex. But facing issues with multiple folder's, etc. Something like:
FileRegEx = re.compile('.*(file-\\d.txt).*', re.IGNORECASE|re.DOTALL)
You don't need .* just use character classes properly:
r'[\/\w]+file-[^.]+\.txt'
[\/\w]+ will match any combinations of word characters and /. And [^.]+ will match any combination of characters except dot.
Demo:
https://regex101.com/r/ytsZ0D/1
Note that this regex might be kind of general, In that case, if you want to exclude some cases you can use ^ within character class or another proper pattern, based on your need.
Assuming your filenames are white-space separated ...
\\s(\\S+/file-\\d+\\.txt)\\s
\\s - match a white-space character
\\S+ - matches one or more non-whitespace characters
\\d+ - matches one or more digits
\\. - turns the . into a non-interesting period, instead of a match any character
You can avoid the double backslashes using r'' strings:
r'\s(\S+/file-\d+\.txt)\s'
Try this:
import re
re.findall('/.+\.txt', s)
# Output: ['/net/super/file-1.txt', '/sample/random/folder/folder2/file-2.txt', '/otherfile/other-3.txt']
Output:
>>> import re
>>>
>>> s = """contentsaasdf /net/super/file-1.txt othercontents...
... data is in /sample/random/folder/folder2/file-2.txt otherdata...
... filename /otherfile/other-3.txt somewording"""
>>>
>>> re.findall('/.+\.txt', s)
['/net/super/file-1.txt', '/sample/random/folder/folder2/file-2.txt', '/otherfile/other-3.txt']

Anchor to End of Last Match

In the process of working on this answer I stumbled on an anomaly with Python's repeating regexes.
Say I'm given a CSV string with an arbitrary number of quoted and unquoted elements:
21, 2, '23.5R25 ETADT', 'description, with a comma'
I want to replace all the ','s outside quotes with '\t'. So I'd like an output of:
21\t2\t'23.5R25 ETADT'\t'description, with a comma'
Since there will be multiple matches in the string naturally I'll use the g regex modifier. The regex I'll use will match characters outside quotes or a quoted string followed by a ',':
('[^']*'|[^',]*),\s*
And I'll replace with:
\1\t
Now the problem is the regex is searching not matching so it can choose to skip characters until it can match. So rather than my desired output I get:
21\t2\t'23.5R25 ETADT'\t'description\twith a comma'
You can see a live example of this behavior here: https://regex101.com/r/sG9hT3/2
Q. Is there a way to anchor a g modified regex to begin matching at the character after the previous match?
For those familiar with Perl's mighty regexs, Perl provides the \G. Which allows us to retrieve the end of the last position matched. So in Perl I could accomplish what I'm asking for with the regex:
\G('[^']*'|[^',]*),\s*
This would force a mismatch within the final quoted element. Because rather than allowing the regex implementation to find a point where the regex matched the \G would force it to begin matching at the first character of:
'description, with a comma'
You can use the following regex with re.search:
,?\s*([^',]*(?:'[^']*'[^',]*)*)
See regex demo (I change it to ,?[ ]*([^',\n]*(?:'[^'\n]*'[^',\n]*)*) since it is a multiline demo)
Here, the regex matches (in a regex meaning of the word)...
,? - 1 or 0 comma
\s* - 0 or more whitespace
([^',]*(?:'[^']*'[^',]*)*) - Group 1 storing a captured text that consists of...
[^',]* - 0 or more characters other than , and '
(?:'[^']*'[^',]*)* - 0 or more sequences of ...
'[^']*' - a 'string'-like substring containing no apostrophes
[^',]* - 0 or more characters other than , and '.
If you want to use a re.match and store the captured texts inside capturing groups, it is not possible since Python regex engine does not store all the captures in a stack as .NET regex engine does with CaptureCollection.
Also, Python regex does not support \G operator, so you cannot anchor any subpattern at the end of a successful match here.
As an alternative/workaround, you can use the following Python code to return successive matches and then the rest of the string:
import re
def successive_matches(pattern,text,pos=0):
ptrn = re.compile(pattern)
match = ptrn.match(text,pos)
while match:
yield match.group()
if match.end() == pos:
break
pos = match.end()
match = ptrn.match(text,pos)
if pos < len(text) - 1:
yield text[pos:]
for matched_text in successive_matches(r"('[^']*'|[^',]*),\s*","21, 2, '23.5R25 ETADT', 'description, with a comma'"):
print matched_text
See IDEONE demo, the output is
21,
2,
'23.5R25 ETADT',
'description, with a comma'

Categories