In the process of working on this answer I stumbled on an anomaly with Python's repeating regexes.
Say I'm given a CSV string with an arbitrary number of quoted and unquoted elements:
21, 2, '23.5R25 ETADT', 'description, with a comma'
I want to replace all the ','s outside quotes with '\t'. So I'd like an output of:
21\t2\t'23.5R25 ETADT'\t'description, with a comma'
Since there will be multiple matches in the string naturally I'll use the g regex modifier. The regex I'll use will match characters outside quotes or a quoted string followed by a ',':
('[^']*'|[^',]*),\s*
And I'll replace with:
\1\t
Now the problem is the regex is searching not matching so it can choose to skip characters until it can match. So rather than my desired output I get:
21\t2\t'23.5R25 ETADT'\t'description\twith a comma'
You can see a live example of this behavior here: https://regex101.com/r/sG9hT3/2
Q. Is there a way to anchor a g modified regex to begin matching at the character after the previous match?
For those familiar with Perl's mighty regexs, Perl provides the \G. Which allows us to retrieve the end of the last position matched. So in Perl I could accomplish what I'm asking for with the regex:
\G('[^']*'|[^',]*),\s*
This would force a mismatch within the final quoted element. Because rather than allowing the regex implementation to find a point where the regex matched the \G would force it to begin matching at the first character of:
'description, with a comma'
You can use the following regex with re.search:
,?\s*([^',]*(?:'[^']*'[^',]*)*)
See regex demo (I change it to ,?[ ]*([^',\n]*(?:'[^'\n]*'[^',\n]*)*) since it is a multiline demo)
Here, the regex matches (in a regex meaning of the word)...
,? - 1 or 0 comma
\s* - 0 or more whitespace
([^',]*(?:'[^']*'[^',]*)*) - Group 1 storing a captured text that consists of...
[^',]* - 0 or more characters other than , and '
(?:'[^']*'[^',]*)* - 0 or more sequences of ...
'[^']*' - a 'string'-like substring containing no apostrophes
[^',]* - 0 or more characters other than , and '.
If you want to use a re.match and store the captured texts inside capturing groups, it is not possible since Python regex engine does not store all the captures in a stack as .NET regex engine does with CaptureCollection.
Also, Python regex does not support \G operator, so you cannot anchor any subpattern at the end of a successful match here.
As an alternative/workaround, you can use the following Python code to return successive matches and then the rest of the string:
import re
def successive_matches(pattern,text,pos=0):
ptrn = re.compile(pattern)
match = ptrn.match(text,pos)
while match:
yield match.group()
if match.end() == pos:
break
pos = match.end()
match = ptrn.match(text,pos)
if pos < len(text) - 1:
yield text[pos:]
for matched_text in successive_matches(r"('[^']*'|[^',]*),\s*","21, 2, '23.5R25 ETADT', 'description, with a comma'"):
print matched_text
See IDEONE demo, the output is
21,
2,
'23.5R25 ETADT',
'description, with a comma'
Related
I have a text like this;
[Some Text][1][Some Text][2][Some Text][3][Some Text][4]
I want to match [Some Text][2] with this regex;
/\[.*?\]\[2\]/
But it returns [Some Text][1][Some Text][2]
How can i match only [Some Text][2]?
Note : There can be any character in Some Text including [ and ] And the numbers in square brackets can be any number not only 1 and 2. The Some Text that i want to match can be at the beginning of the line and there can be multiple Some Texts
JSFiddle
The \[.*?\]\[2\] pattern works like this:
\[ - finds the leftmost [ (as the regex engine processes the string input from left to right)
.*? - matches any 0+ chars other than line break chars, as few as possible, but as many as needed for a successful match, as there are subsequent patterns, see below
\]\[2\] - ][2] substring.
So, the .*? gets expanded upon each failure until it finds the leftmost ][2]. Note the lazy quantifiers do not guarantee the "shortest" matches.
Solution
Instead of a .*? (or .*) use negated character classes that match any char but the boundary char.
\[[^\]\[]*\]\[2\]
See this regex demo.
Here, .*? is replaced with [^\]\[]* - 0 or more chars other than ] and [.
Other examples:
Strings between angle brackets: <[^<>]*> matches <...> with no < and > inside
Strings between parentheses: \([^()]*\) matches (...) with no ( and ) inside
Strings between double quotation marks: "[^"]*" matches "..." with no " inside
Strings between curly braces: \{[^{}]*} matches "..." with no " inside
In other situations, when the starting pattern is a multichar string or complex pattern, use a tempered greedy token, (?:(?!start).)*?. To match abc 1 def in abc 0 abc 1 def, use abc(?:(?!abc).)*?def.
You could try the below regex,
(?!^)(\[[A-Z].*?\]\[\d+\])
DEMO
I want to ensure that a long string can match with several regex at once.
I have a long multi line string containing a list of files and some content of the file.
DIR1\FILE1.EXT1 CONTENT11
DIR1\FILE1.EXT1 CONTENT12
DIR1\FILE1.EXT1 CONTENT13
DIR1\FILE2.EXT1 CONTENT21
DIR2\FILE3.EXT2 CONTENT31
DIR3\FILE3.EXT2 CONTENT11
The list typically contains hundreds of thousands of lines, sometimes several millions.
I want to check that the list contains predefined couples file/content:
FILE1 CONTENT11
FILE1 CONTENT12
FILE3 CONTENT11
I know that I can check that the string contains all of these couples by matching the string against some regexes
"^\S*FILE1\S*\tCONTENT11$"
"^\S*FILE1\S*\tCONTENT12$"
"^\S*FILE3\S*\tCONTENT11$"
import re
def all_matching(str, rxs):
res = True
for rx in rxs:
p = re.compile(rx, re.M)
res = res and p.search(str)
return(res)
input1 = """DIR1\\FILE1.EXT1\tCONTENT11
DIR1\\FILE1.EXT1\tCONTENT12
DIR1\\FILE1.EXT1\tCONTENT13
DIR1\\FILE2.EXT1\tCONTENT21
DIR2\\FILE3.EXT2\tCONTENT31
DIR3\\FILE3.EXT2\tCONTENT11"""
input2 = """DIR1\\FILE1.EXT1\tCONTENT11
DIR1\\FILE1.EXT1\tCONTENT12
DIR1\\FILE1.EXT1\tCONTENT13
DIR1\\FILE2.EXT1\tCONTENT21
DIR2\\FILE3.EXT2\tCONTENT31"""
rxs = [r"^\S*FILE1\S*\tCONTENT11$",r"^\S*FILE1\S*\tCONTENT12$",r"^\S*FILE3\S*\tCONTENT11$"]
if all_matching(input1,rxs):
print("input1 matches all rxs") # excpected
else:
print("input1 do not match all rxs")
if all_matching(input2,rxs):
print("input2 matches all rxs")
else:
print("input2 do not match all rxs") # expected because input2 doesn't match wirh rxs[2]
ideone is available here
However, as the input string is very long in my case, I'd rather avoid launching search many times...
I feel like it should be possible to change the all_matching function in that way.
Any help will be much appreciated!
EDIT
clarified the problem an provided sample code
You may build a single regex from the regex strings you have that will require all the regexes to find a match in the input string.
The resulting regex will look like
\A(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT11$)(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT12$)(?=(?:.*\n)*?\S*FILE3\S*\tCONTENT11$)
See the regex demo.
Basically, it will match:
(?m) - a re.M / re.MULTILINE embedded flag option
\A - start of string (not start of a line!), all the lookaheads below will be triggered one by one, checking the string from the start, until one of them fails
(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT11$) - a positive lookahead that, immediately to the right of the current location, requires the presence of
(?:.*\n)*? - 0 or more (but as few as possible, the pattern will only be tried if the subsequent subpatterns do not match)
\S* - 0+ non-whitespaces
FILE1 - a string
\S* - 0+ non-whitespaces
\tCONTENT11 - tab and CONTENT11 substring
$ - end of line (since (?m) allows $ to match end of lines)
(?=(?:.*\n)*?\S*FILE1\S*\tCONTENT12$) - a lookahead working similarly as the preceding one, requiring FILE1 and CONTENT12 substrings on the line
(?=(?:.*\n)*?\S*FILE3\S*\tCONTENT11$) - a lookahead working similarly as the preceding one, requiring FILE3 and CONTENT11 substrings on the line.
In Python, it will look like
rxs = [r"^\S*FILE1\S*\tCONTENT11$",r"^\S*FILE1\S*\tCONTENT12$",r"^\S*FILE3\S*\tCONTENT11$"]
pat = re.compile( r"(?m)\A(?=(?:.*\n)*?{})".format(r")(?=(?:.*\n)*?".join([rx[1:] for rx in rxs])) )
Then, the check method will look like
def all_matching(s, pat):
return pat.search(s)
See full Python demo online.
Given a string s = "<foo>abcaaa<bar>a<foo>cbacba<foo>c" I'm trying to write a regular expression which will extract portions of: angle brackets with the text inside and the surrounding text. Like this:
<foo>abcaaa
abcaaa<bar>a
a<foo>cbacba
cbacba<foo>c
So expected output should look like this:
["<foo>abcaaa", "abcaaa<bar>a", "a<foo>cbacba", "cbacba<foo>c"]
I found this question How to find overlapping matches with a regexp? which brought me little bit closer to the desired result but still my regex doesn't work.
regex = r"(?=([a-c]*)\<(\w+)\>([a-c]*))"
Any ideas how to solve this problem?
You can match overlapping content with standard regex syntax by using capturing groups inside lookaround assertions, since those may match parts of the string without consuming the matched substring and hence precluding it from further matches. In this specific example, we match either the beginning of the string or a > as anchor for the lookahead assertion which captures our actual targets:
(?:\A|>)(?=([a-c]*<\w+>[a-c]*))
See regex demo.
In python we then use the property of re.findall() to only return matches captured in groups when capturing groups are present in the expression:
text = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
expr = r'(?:\A|>)(?=([a-c]*<\w+>[a-c]*))'
captures = re.findall(expr, text)
print(captures)
Output:
['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']
You need to set the left- and right-hand boundaries to < or > chars or start/end of string.
Use
import re
text = "<foo>abcaaa<bar>a<foo>cbacba<foo>c"
print( re.findall(r'(?=(?<![^<>])([a-c]*<\w+>[a-c]*)(?![^<>]))', text) )
# => ['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']
See the Python demo online and the regex demo.
Pattern details
(?= - start of a positive lookahead to enable overlapping matches
(?<![^<>]) - start of string, < or >
([a-c]*<\w+>[a-c]*) - Group 1 (the value extracted): 0+ a, b or c chars, then <, 1+ word chars, > and again 0+ a, b or c chars
(?![^<>]) - end of string, < or > must follow immediately
) - end of the lookahead.
You may use this regex code in python:
>>> s = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
>>> reg = r'([^<>]*<[^>]*>)(?=([^<>]*))'
>>> print ( [''.join(i) for i in re.findall(reg, s)] )
['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']
RegEx Demo
RegEx Details:
([^<>]*<[^>]*>): Capture group #1 to match 0 or more characters that are not < and > followed by <...> string.
(?=([^<>]*)): Lookahead to assert that we have 0 or more non-<> characters ahead of current position. We have capture group #2 inside this lookahead.
I am using Python and would like to match all the words after test till a period (full-stop) or space is encountered.
text = "test : match this."
At the moment, I am using :
import re
re.match('(?<=test :).*',text)
The above code doesn't match anything. I need match this as my output.
Everything after test, including test
test.*
Everything after test, without test
(?<=test).*
Example here on regexr.com
You need to use re.search since re.match tries to match from the beging of the string. To match until a space or period is encountered.
re.search(r'(?<=test :)[^.\s]*',text)
To match all the chars until a period is encountered,
re.search(r'(?<=test :)[^.]*',text)
In a general case, as the title mentions, you may capture with (.*) pattern any 0 or more chars other than newline after any pattern(s) you want:
import re
p = re.compile(r'test\s*:\s*(.*)')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
print(m.group(1)) # Print Group 1 value
If you want . to match across multiple lines, compile the regex with re.DOTALL or re.S flag (or add (?s) before the pattern):
p = re.compile(r'test\s*:\s*(.*)', re.DOTALL)
p = re.compile(r'(?s)test\s*:\s*(.*)')
However, it will retrun match this.. See also a regex demo.
You can add \. pattern after (.*) to make the regex engine stop before the last . on that line:
test\s*:\s*(.*)\.
Watch out for re.match() since it will only look for a match at the beginning of the string (Avinash aleady pointed that out, but it is a very important note!)
See the regex demo and a sample Python code snippet:
import re
p = re.compile(r'test\s*:\s*(.*)\.')
s = "test : match this."
m = p.search(s) # Run a regex search anywhere inside a string
if m: # If there is a match
print(m.group(1)) # Print Group 1 value
If you want to make sure test is matched as a whole word, add \b before it (do not remove the r prefix from the string literal, or '\b' will match a BACKSPACE char!) - r'\btest\s*:\s*(.*)\.'.
I don't see why you want to use regex if you're just getting a subset from a string.
This works the same way:
if line.startswith('test:'):
print(line[5:line.find('.')])
example:
>>> line = "test: match this."
>>> print(line[5:line.find('.')])
match this
Regex is slow, it is awkward to design, and difficult to debug. There are definitely occassions to use it, but if you just want to extract the text between test: and ., then I don't think is one of those occasions.
See: https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
For more flexibility (for example if you are looping through a list of strings you want to find at the beginning of a string and then index out) replace 5 (the length of 'test:') in the index with len(str_you_looked_for).
I have been trying to figure out how to include certain word groups and exclude others.I have this string for example
string1="HI:MYDLKJL:ajkld? :JKLJBLKJD:DKJL? app?"
I want to find HI:MYDLKJL:ajkld? and app? but not :JKLJBLKJD:DKJL? because it begins with a : I have made this code but it still includes the :JKLJBLKJD:DKJL? just ignoring the : in the front
match3=re.findall("[A-Za-z]{1,15}[:]{0,1}[A-Za-z]{0,15}[:]{0,1}[A-Za-z]{0,15}[:]{0,1}[A-Za-z]{0,15}[\?]{1}",string1)
The actual pattern is pretty simple to specify. But, you'll also need to specify a look-behind to handle the second term appropriately.
>>> re.findall(r'(?:(?<=\s)|(?<=^))[^:]\S+\?', string1)
['HI:MYDLKJL:ajkld?', 'app?']
The regex means "any expression that does not start with a colon but ends with a question mark".
(?: # lookbehind
(?<=\s) # space
| # OR
(?<=^) # start-of-line metachar
)
[^:] # anything that is not a colon
\S+ # one or more characters that are not a space
\? # literal question mark
A simple word boundary does not work because \b will also match the boundary between : and JKLJBLKJD... no bueno, hence the lookbehind.
Alternate approach
>>> string1="HI:MYDLKJL:ajkld? :JKLJBLKJD:DKJL? app?"
>>> string1.split()
['HI:MYDLKJL:ajkld?', ':JKLJBLKJD:DKJL?', 'app?']
# filter out elements not needed
>>> [s for s in string1.split() if not s.startswith(':')]
['HI:MYDLKJL:ajkld?', 'app?']
Or, using the regex module
>>> string1="HI:MYDLKJL:ajkld? :JKLJBLKJD:DKJL? app?"
>>> regex.findall(r'(?:^|\s):\S+(*SKIP)(*F)|\S+', string1)
['HI:MYDLKJL:ajkld?', 'app?']
(?:^|\s):\S+(*SKIP)(*F) will effectively ignore strings starting with :
(?: means non-capturing group