How to scan for a string literal allowing escaped characters? - python

I would like to parse an input string and determine if it contains a sequence of characters surrounded by double quotes (").
The sequence of characters itself is not allowed to contain further double quotes, unless they are escaped by a backslash, like so: \".
To make things more complicated, the backslashes can be escaped themselves, like so: \\. A double quote preceded by two (or any even number of) backslashes (\\") is therefore not escaped.
And to make it even worse, single non-escaping backslashes (i.e. followed by neither " nor \) are allowed.
I'm trying to solve that with Python's re module.
The module documentation tells us about the pipe operator A|B:
As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.
However, this doesn't work as I expected:
>>> import re
>>> re.match(r'"(\\[\\"]|[^"])*"', r'"a\"')
<_sre.SRE_Match object; span=(0, 4), match='"a\\"'>
The idea of this regex is to first check for an escaped character (\\ or \") and only if that's not found, check for any character that's not " (but it could be a single \).
This can occur an arbitrary number of times and it has to be surrounded by literal " characters.
I would expect the string "a\" not to match at all, but apparently it does.
I would expect \" to match the A part and the B part not to be tested, but apparently it is.
I don't really know how the backtracking works in this very case, but is there a way to avoid it?
I guess it would work if I check first for the initial " character (and remove it from the input) in a separate step.
I could then use the following regular expression to get the content of the string:
>>> re.match(r'(\\[\\"]|[^"])*', r'a\"')
<_sre.SRE_Match object; span=(0, 3), match='a\\"'>
This would include the escaped quote. Since there wouldn't be a closing quote left, I would know that overall, the given string does not match.
Do I have to do it like that or is it possible to solve this with a single regular expression and no additional manual checking?
In my real application, the "-enclosed string is only one part of a larger pattern, so I think it would be simpler to do it all at once in a single regular expression.
I found similar questions, but those don't consider that a single non-escaping backslash can be part of the string: regex to parse string with escaped characters, Parsing for escape characters with a regular expression.

When you use "(\\[\\"]|[^"])*", you match " followed by 0+ sequences of \ followed by either \ or ", or non-", and then followed by a "closing" ". Note that when your input is "a\", the \ is matched by the second alternative branch [^"] (as the backslash is a valid non-").
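To see that explanation in action, here is a minimal sketch using only the pattern and input from the question (the variable names are just for illustration). The first alternative does consume \" at first, but then no closing quote is left, so the engine backtracks and lets [^"] take the lone backslash instead:

```python
import re

# Pattern and input copied from the question.
original = re.compile(r'"(\\[\\"]|[^"])*"')
text = r'"a\"'  # four characters: quote, a, backslash, quote

m = original.match(text)
print(m.group(0))  # the whole input is matched
print(m.group(1))  # last repetition of the group: a lone backslash
```

The captured group shows which branch won on the final repetition: it holds a single backslash, not the two-character escape.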
You need to exclude the \ from the non-":
"(?:[^\\"]|\\.)*"
^^
So, we match ", then either non-" and non-\ (with [^\\"]) or any escape sequence (with \\.), 0 or more times.
However, this regex is not very efficient: the alternation inside the quantified group causes a lot of backtracking. The unrolled version is:
"[^"\\]*(?:\\.[^"\\]*)*"
The last pattern matches:
" - a double quote
[^"\\]* - zero or more characters other than \ and "
(?:\\.[^"\\]*)* - zero or more sequences of
\\. - a backslash followed with any character but a newline
[^"\\]* - zero or more characters other than \ and "
" - a double quote

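As a rough check of the unrolled pattern, the sketch below runs it against a few inputs. The sample strings are made up for illustration, apart from "a\" which is the input from the question:

```python
import re

# Unrolled pattern from the answer above.
string_literal = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

samples = [
    r'"a\"',    # escaped quote, no closing quote left
    r'"a\\"',   # escaped backslash, then the closing quote
    r'"a\"b"',  # escaped quote inside the literal
    r'"a\b"',   # single non-escaping backslash
]
results = {s: string_literal.match(s) is not None for s in samples}
for s, ok in results.items():
    print(s, ok)
```

Only the first sample is rejected. Note that . does not match a newline by default, so a backslash directly followed by a newline would need re.DOTALL; whether that is wanted depends on the input.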

Python match text successfully even when there are 1, 2 and 3 backslash at front of the same regex pattern [duplicate]

My current understanding of the python 3.4 regex library from the language reference does not seem to match up with my experiment results of the module.
My current understanding
The regular expression engine can be thought of as a separate entity with its own programming language that it understands (regex). It just happens to live inside python, among a variety of other languages. As such, python must pass (regex) pattern/code to this independent interpreter, if you will.
For clarity reasons, the following text will use the notion of logical length - which is supposed to represent how long the given string logically is. For example, the special character carriage return \r will have len=1 since it is a single character. However, the 2 distinct characters (backslash followed by an r) \r will have len=2.
Step 1) Let's say we want to match a carriage return \r len=1 in some text.
Step 2) We need to feed the pattern \r len=2 (2 distinct characters) to the regular expression engine.
Step 3) The regular expression engine receives \r len=2 and interprets the pattern as: match special character carriage return \r len=1.
Step 4) It goes ahead and does the magic.
The problem is that the backslash character \ itself is used by the python interpreter as something special - a character meant to escape other stuff (like quotes).
So when we are coding in python and need to express the idea that we need to send the pattern \r len=2 to the internal regular expression interpreter, we must type pattern = '\\r' or alternatively pattern = r'\r' to express \r len=2.
And everything is well... until
I try a couple of experiments involving re.escape
Summary of questions
Point 1) Please confirm/modify my current understanding of the regex engine.
Point 2) Why do these supposedly non-textbook patterns match?
Point 3) What on earth is going on with \\\r from re.escape, and the whole "we have the same string lengths, but we compared unequal, but we ALSO all worked the same in matching a carriage return in the previous re.search test".
You need to understand that each time you write a pattern, it is first interpreted as a Python string literal and only then read and interpreted a second time by the regex engine.
Let's describe what happens:
>>> s='\r'
s contains the character CR.
>>> re.match('\r', s)
<_sre.SRE_Match object; span=(0, 1), match='\r'>
Here the string '\r' is a string that contains CR, so a literal CR is given to the regex engine.
>>> re.match('\\r', s)
<_sre.SRE_Match object; span=(0, 1), match='\r'>
The string is now a literal backslash and a literal r. The regex engine receives these two characters, and since \r is a regex escape sequence that also means CR, you obtain a match again.
>>> re.match('\\\r', s)
<_sre.SRE_Match object; span=(0, 1), match='\r'>
The string contains a literal backslash and a literal CR. The regex engine receives \ and CR, and since \CR isn't a known regex escape sequence, the backslash is ignored and you obtain a match.
Note that for the regex engine, a literal backslash is the escape sequence \\ (so in a pattern string r'\\' or '\\\\')
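The three cases can be put side by side in a small sketch (the dictionary keys are only labels chosen for this example). It also shows why two of the patterns have the same length yet compare unequal: one ends in the letter r, the other in a real CR.

```python
import re

cr = '\r'  # one character: carriage return

patterns = {
    'CR': '\r',               # 1 character
    'backslash, r': '\\r',    # 2 characters, same as r'\r'
    'backslash, CR': '\\\r',  # 2 characters
}

lengths = {name: len(p) for name, p in patterns.items()}
matches = {name: re.match(p, cr) is not None for name, p in patterns.items()}
print(lengths)
print(matches)
print(patterns['backslash, r'] == patterns['backslash, CR'])

# A literal backslash needs the regex escape \\ in the pattern.
backslash_match = re.match(r'\\', '\\')
print(r'\\' == '\\\\', backslash_match is not None)
```

All three patterns match the single CR, even though the strings handed to the engine differ.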

How does the regex "\" character and grouping "()" character work together?

I am trying to see which statements the following pattern matches:
\(*[0-9]{3}\)*-*[0-9]{3}\d\d\d+
I am a little confused because the grouping characters () have a \ before it. Does this mean that the statement must have a ( and )? Would that mean the statements without ( or ) be unmatched?
Statements:
'404-678-2347'
'(123)-1247890'
'456-900-900'
'(678)-2001236'
'404123-1234'
'(404123-123'
Context is important:
re.match(r'\(', content) matches a literal parenthesis.
re.match(r'\(*', content) matches 0 or more literal parentheses, thus making the parens optional (and allowing more than one of them, but that's clearly a bug).
Since the intended behavior isn't "0 or more" but rather "0 or 1", this should probably be written r'\(?' instead.
That said, there's a whole lot about this regex that's silly. I'd consider instead:
[(]?\d{3}[)]?-?\d{6,}
Using [(]? avoids backslashes, and consequently is easier to read whether it's rendered by str() or repr() (which escapes backslashes).
Mixing [0-9] and \d is silly; better to pick one and stick with it.
Using * in place of ? is silly, unless you really want to match (((123))456-----7890.
\d{3}\d\d\d+ matches three digits, then three or more additional digits. Why not just match six or more digits in the first place?
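A short comparison of the two patterns, as a sketch only. The inputs below are hypothetical and are not the statements from the question:

```python
import re

original = re.compile(r'\(*[0-9]{3}\)*-*[0-9]{3}\d\d\d+')
suggested = re.compile(r'[(]?\d{3}[)]?-?\d{6,}')

# Hypothetical inputs chosen to show where the two patterns differ.
samples = ['1234567890', '(123)-4567890', '(((123))---4567890', '123-456-7890']

comparison = {
    s: (original.match(s) is not None, suggested.match(s) is not None)
    for s in samples
}
for s, (orig_ok, new_ok) in comparison.items():
    print(s, orig_ok, new_ok)
```

Both accept the first two inputs and reject the last one. Only the original accepts the input with repeated parentheses and hyphens, which is the effect of using * where ? was meant.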
Normally, the parentheses would act as grouping characters, however regex metacharacters are reduced simply to the raw characters when preceded by a backslash. From the Python docs:
As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
In your case, the statements don't need parentheses in order to match, as each \( and \) in the expression is followed by a *, which means that the previous character can be matched any number of times, including none at all. From the Python docs:
* doesn’t match the literal character *; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.
Thus the statements with or without parentheses around the first 3 digits may match.
Source: https://docs.python.org/2/howto/regex.html

Regular Expression that Includes a Character Only If Another Character Precedes It

I'm new to Stack so not sure if I'm asking this right.
I'm trying to form a regular expression to match all characters except 3 specific ones (%, &, and $), but I want to ignore that exception if a backslash (\) precedes any of those characters. For example, if I have the string
abcd\$&
I would want the regular expression to match
abcd\$
because a backslash precedes the dollar sign, but not match the & because no backslash precedes it.
So far I have:
^[^%$&]+
which matches any string that doesn't have the characters (%, $, or &), but it stops at the backslash rather than include the backslash and the next character.
Thanks in advance!
^([^%$&\\]|\\.)+$
should work.
It also excludes \ from the charset and then allows \ followed by any character.
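One thing worth checking is the trailing $ anchor: with it, the pattern only accepts strings that consist entirely of allowed characters and escapes, so the example string from the question is rejected as a whole. The sketch below compares it with a variation that drops the anchor to get the leading part; that variation is an assumption about what the question wants, not part of the answer.

```python
import re

whole = re.compile(r'^([^%$&\\]|\\.)+$')    # pattern from the answer
prefix = re.compile(r'^(?:[^%$&\\]|\\.)+')  # variation without the trailing $

text = r'abcd\$&'  # example string from the question

whole_match = whole.match(text)
prefix_match = prefix.match(text)
print(whole_match)            # no match: the unescaped & blocks the $ anchor
print(prefix_match.group(0))  # the part before the unescaped &
```

Use the anchored form to validate a complete string, and the unanchored form to extract the longest valid prefix.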

Python Regex: Ignore Escaped Character

Alright, I'm currently using Python's regular expression library to split up the following string into groups of semicolon delimited fields.
'key1:"this is a test phrase"; key2:"this is another test phrase"; key3:"ok this is a gotcha\; but you should get it";'
Regex: \s*([^;]+[^\\])\s*;
I'm currently using the pcre above, which was working fine until I encountered a case where an escaped semicolon is included in one of the phrases as noted above by key3.
How can I modify this expression to only split on the non-escaped semicolons?
The basic version of this is where you want to ignore any ; that's preceded by a backslash, regardless of anything else. That's relatively simple:
\s*([^;]*[^;\\]);
What will make this tricky is if you want escaped backslashes in the input to be treated as literals. For example:
"You may want to split here\\;"
"But not here\;"
If that's something you want to take into account, try this (edited):
\s*((?:[^;\\]|\\.)+);
Why so complicated? Because if escaped backslashes are allowed, then you have to account for things like this:
"0 slashes; 2 slashes\\; 5 slashes\\\\\; 6 slashes\\\\\\;"
Each pair of doubled backslashes would be treated as a literal \. That means a ; would only be escaped if there were an odd number of backslashes before it. So the above input would be grouped like this:
#1: '0 slashes'
#2: '2 slashes\'
#3: '5 slashes\\; 6 slashes\\\'
Hence the different parts of the pattern:
\s* #Whitespace
((?:
[^;\\] #One character that's not ; or \
| #Or...
\\. #A backslash followed by any character, even ; or another backslash
)+); #Repeated one or more times, followed by ;
Requiring a character after a backslash ensures that the second character is always escaped properly, even if it's another backslash.
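Running that pattern over the slashes example gives the three groups described above. This is a small sketch with re.findall; the variable names are only for illustration:

```python
import re

field = re.compile(r'\s*((?:[^;\\]|\\.)+);')
text = r'0 slashes; 2 slashes\\; 5 slashes\\\\\; 6 slashes\\\\\\;'

fields = field.findall(text)
for f in fields:
    print(f)
```

The captured fields still contain the raw escape sequences as they appear in the input. Turning each \\ into a single \ is a separate unescaping step.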
If the string may contain semicolons and escaped quotes (or escaped anything), I would suggest parsing each valid key:"value"; sequence. Like so:
import re
s = r'''
key1:"this is a test phrase";
key2:"this is another test phrase";
key3:"ok this is a gotcha\; but you should get it";
key4:"String with \" escaped quote";
key5:"String with ; unescaped semi-colon";
key6:"String with \\; escaped-escape before semi-colon";
'''
result = re.findall(r'\w+:"[^"\\]*(?:\\.[^"\\]*)*";', s)
print(result)
Note that this correctly handles any escapes within the double quoted string.

How to detect an invalid C escaped string using a regular expression?

I would like to find a regular expression (regex) that detects invalid escapes in a C double-quoted string (where double quotes may only appear escaped).
I consider valid \\ \n \r \" (the test string is using ")
A partial solution to this is to use (?<!\\)\\[^\"\\nr] but this one fails to detect bad escapes like \\\.
Here is a test string that I use to test the matching:
...\n...\\b...\"...\\\\...\\\E...\...\\\...\\\\\..."...\E...
The expression should match the last 6 blocks as invalid; the first 4 are valid. The problem is that my current version finds only 2/5 errors.
(?:^|[^\\])(?:\\\\)*((?:\"|\\(?:[^\"\\nr]|$)))
That's the start of a string, or something that's not a backslash. Then some (possibly zero) properly escaped backslashes, then either an unescaped " or another backslash; if it's another backslash, it must be followed by something that is neither ", \, n, nor r, or the end of the string.
The incorrect escape is captured for you as well.
Try this regular expression:
^(?:[^\\]+|\\[\\rn"])*(\\(?:[^\\rn"]|$))
If you have a match, you have an invalid escape sequence.
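If a single regex is not a hard requirement, a different technique is to scan the string token by token in Python and check each escape separately. The sketch below does that; it is not one of the regexes above, and the function and constant names are hypothetical. It assumes the valid escapes are exactly the four listed in the question.

```python
import re

VALID_AFTER_BACKSLASH = set('\\nr"')  # escapes the question treats as valid

def invalid_tokens(body):
    """Return the invalid escapes and unescaped quotes found in body."""
    bad = []
    # Each match is either a backslash plus the character it escapes, or a
    # bare double quote. Scanning left to right pairs the backslashes up.
    for m in re.finditer(r'\\(.?)|"', body, re.DOTALL):
        token = m.group(0)
        # For a backslash token, group(1) is the escaped character, or ''
        # when the backslash is the last character of the string.
        if token == '"' or m.group(1) not in VALID_AFTER_BACKSLASH:
            bad.append(token)
    return bad

# Test string from the question.
test = r'...\n...\\b...\"...\\\\...\\\E...\...\\\...\\\\\..."...\E...'
found = invalid_tokens(test)
print(found)
```

On the test string this reports six problems, one for each of the last six blocks, because every backslash consumes the character it escapes before the next token is examined.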
