How to detect an invalid C escaped string using a regular expression? - python

I would like to find a regular expression (regex) that detects invalid escapes in a C double-quoted escaped string (one in which double quotes may appear only when escaped).
I consider \\, \n, \r and \" valid (the test string is delimited by ").
A partial solution is (?<!\\)\\[^\"\\nr], but it fails to detect bad escapes like \\\.
Here is a test string that I use to test the matching:
...\n...\\b...\"...\\\\...\\\E...\...\\\...\\\\\..."...\E...
The expression should flag the last 6 blocks as invalid; the first 4 are valid. The problem is that my current version finds only 2/5 errors.

(?:^|[^\\])(?:\\\\)*((?:\"|\\(?:[^\"\\nr]|$)))
This matches the start of the string or a character that is not a backslash, then zero or more properly escaped backslashes (\\ pairs), then either an unescaped " or another backslash. In the latter case, that backslash must be followed either by a character that is none of ", \, n, r, or by the end of the string.
The offending escape is captured in group 1.
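As a rough check, the pattern can be run against the test string from the question with Python's re module. This is only a sketch: the names INVALID_ESCAPE, sample and bad are illustrative, and the raw-string literals are an assumption about how the text reaches Python.

```python
import re

# Pattern from the answer above, written as a Python raw string.
INVALID_ESCAPE = re.compile(r'(?:^|[^\\])(?:\\\\)*((?:\"|\\(?:[^\"\\nr]|$)))')

# Test string from the question (raw, so every backslash is literal).
sample = r'...\n...\\b...\"...\\\\...\\\E...\...\\\...\\\\\..."...\E...'

# With a single capturing group, findall returns group 1 of every match.
bad = INVALID_ESCAPE.findall(sample)

print(len(bad))   # 6
for item in bad:
    print(item)   # the offending escape, or a bare " for an unescaped quote
```

Each match also consumes the character in front of the offending escape, so two invalid escapes that sit directly next to each other (for example \E\E) may be reported only once by findall. For a plain valid/invalid decision, a single search call is enough.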

Try this regular expression:
^(?:[^\\]+|\\[\\rn"])*(\\(?:[^\\rn"]|$))
If you have a match, you have an invalid escape sequence.
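A small sketch of how this second pattern might be used from Python follows. The helper name first_invalid_escape is hypothetical. Note that this pattern only reports the first invalid backslash escape; because [^\\]+ also accepts quote characters, an unescaped " is not flagged.

```python
import re

# Pattern from the answer above: skip over valid content, then capture
# the first backslash that starts an invalid escape.
FIRST_INVALID = re.compile(r'^(?:[^\\]+|\\[\\rn"])*(\\(?:[^\\rn"]|$))')

def first_invalid_escape(text):
    """Return the first invalid escape in text, or None if there is none."""
    m = FIRST_INVALID.search(text)
    return m.group(1) if m else None

print(first_invalid_escape(r'ok\n and \\ and \"'))  # None
print(first_invalid_escape(r'bad \E here'))         # \E
print(first_invalid_escape('trailing \\'))          # a lone backslash at the end
```

The nested quantifier in (?:[^\\]+|...)* can backtrack heavily on long inputs that contain no invalid escape, so this sketch is probably best kept to short strings.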

Related

python equal strings are not equal with new line character in it

I have a password. It contains a newline character (let's omit why for now) and is:
"h2sdf\ndfGd"
This password is in the dict my_dict. When I just print the values of the dict I get "\\" instead of "\": "h2sdf\\ndfGd"! I don't understand why.
When I get it and use it to authenticate to a web server, authentication fails. When I try to compare:
my_dict["password"] == "h2sdf\ndfGd"
it returns False.
But when I just run print(my_dict["password"]) I get h2sdf\ndfGd, which looks identical, but for Python it is not. Why? I am lost.
Check this:
>>> print("h2sdf\ndfGd")
h2sdf
dfGd
>>> print("h2sdf\\ndfGd")
h2sdf\ndfGd
You simply have to escape the backslash in \n by doubling it (\\n), so that it is not turned into a newline.
Characters such as tabs and newlines, which are awkward to type directly in a string literal, are written as escape sequences that start with a backslash.
To indicate that a backslash is not part of an escape sequence (\n, \t, ...), it must itself be escaped with another backslash: \\.
my_dict["password"] == "h2sdf\\ndfGd"
If you don't want to have to escape all your \, you can use a raw string instead.
Raw strings are prefixed with r or R, and treat backslashes \ as literal characters.
my_dict["password"] == r"h2sdf\ndfGd"
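The difference between the three spellings can be seen by comparing lengths and representations. In this sketch the variable stored is a hypothetical stand-in for my_dict["password"], assuming the dict really holds a backslash followed by n rather than a newline.

```python
# Hypothetical stored value: a literal backslash followed by 'n', no newline.
stored = r"h2sdf\ndfGd"

with_newline = "h2sdf\ndfGd"      # \n is a single newline character here
with_backslash = "h2sdf\\ndfGd"   # \\ is one literal backslash

print(len(stored), len(with_newline))   # 11 10
print(stored == with_newline)           # False
print(stored == with_backslash)         # True
print(repr(stored))                     # 'h2sdf\\ndfGd' (repr doubles the backslash)
```

The doubled backslash in the repr output is only a display convention; the string itself still contains a single backslash.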

How to scan for a string literal allowing escaped characters?

I would like to parse an input string and determine if it contains a sequence of characters surrounded by double quotes (").
The sequence of characters itself is not allowed to contain further double quotes, unless they are escaped by a backslash, like so: \".
To make things more complicated, the backslashes can be escaped themselves, like so: \\. A double quote preceded by two (or any even number of) backslashes (\\") is therefore not escaped.
And to make it even worse, single non-escaping backslashes (i.e. followed by neither " nor \) are allowed.
I'm trying to solve that with Python's re module.
The module documentation tells us about the pipe operator A|B:
As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.
However, this doesn't work as I expected:
>>> import re
>>> re.match(r'"(\\[\\"]|[^"])*"', r'"a\"')
<_sre.SRE_Match object; span=(0, 4), match='"a\\"'>
The idea of this regex is to first check for an escaped character (\\ or \") and only if that's not found, check for any character that's not " (but it could be a single \).
This can occur an arbitrary number of times and it has to be surrounded by literal " characters.
I would expect the string "a\" not to match at all, but apparently it does.
I would expect \" to match the A part and the B part not to be tested, but apparently it is.
I don't really know how the backtracking works in this very case, but is there a way to avoid it?
I guess it would work if I check first for the initial " character (and remove it from the input) in a separate step.
I could then use the following regular expression to get the content of the string:
>>> re.match(r'(\\[\\"]|[^"])*', r'a\"')
<_sre.SRE_Match object; span=(0, 3), match='a\\"'>
This would include the escaped quote. Since there wouldn't be a closing quote left, I would know that overall, the given string does not match.
Do I have to do it like that or is it possible to solve this with a single regular expression and no additional manual checking?
In my real application, the "-enclosed string is only one part of a larger pattern, so I think it would be simpler to do it all at once in a single regular expression.
I found similar questions, but those don't consider that a single non-escaping backslash can be part of the string: regex to parse string with escaped characters, Parsing for escape characters with a regular expression.
When you use "(\\[\\"]|[^"])*", you match " followed by 0+ sequences of \ followed by either \ or ", or non-", and then followed by a "closing" ". Note that when your input is "a\", the \ is matched by the second alternative branch [^"] (as the backslash is a valid non-").
You need to exclude the \ from the non-":
"(?:[^\\"]|\\.)*"
      ^^
So, we match ", then either non-" and non-\ (with [^\\"]) or any escape sequence (with \\.), 0 or more times.
However, this regex is not very efficient, because the alternation inside the quantified group causes a lot of backtracking. The unrolled version is:
"[^"\\]*(?:\\.[^"\\]*)*"
The last pattern matches:
" - a double quote
[^"\\]* - zero or more characters other than \ and "
(?:\\.[^"\\]*)* - zero or more sequences of:
    \\. - a backslash followed by any character except a newline
    [^"\\]* - zero or more characters other than \ and "
" - a double quote

How to escape “\” characters in python

I am very new to regular expressions and am trying to match the "\" character using Python.
Normally I can escape "\" like this:
print ("\\");
print ("i am \\nit");
Output:
\
i am \nit
But when I use the same approach in a regex, it doesn't work as I expected:
print (re.findall(r'\\',"i am \\nit"));
It returns this output:
['\\']
Can someone please explain why?
EDIT: The problem is actually how print works with lists and strings. When you print a list, Python shows the representation (repr) of each element rather than the string itself, and the representation of a string containing just a backslash is '\\'. So findall is finding the single backslash correctly, but print isn't displaying it as you'd expect. Try:
>>> print(re.findall(r'\\',"i am \\nit")[0])
\
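The same point can be checked without relying on how print renders things, by looking at the length of the matched string. This is a small sketch; the variable name matches is illustrative.

```python
import re

matches = re.findall(r'\\', "i am \\nit")

print(matches)           # ['\\']  (the list display uses repr for each element)
print(matches[0])        # \       (printing the string itself)
print(len(matches[0]))   # 1       (the match is a single character)
```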
(The following is my original answer, it can be ignored (it's entirely irrelevant), I'd misinterpreted the question initially. But it seems to have been upvoted a bit, so I'll leave it here.)
The r prefix on a string means the string is in "raw" mode, that is, \ are not treated as special characters (it doesn't have anything to do with "regex").
However, r'\' doesn't work, because a raw string can't end with a single backslash, as stated in the docs:
Even in a raw string, string quotes can be escaped with a backslash, but the backslash remains in the string; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes). Specifically, a raw string cannot end in a single backslash (since the backslash would escape the following quote character).
But you actually can use a non-raw string to get a single backslash: "\\".
can someone please explain why
Because re.findall found one match, and the match text consisted of a backslash. It gave you a list with one element, which is a string, which has one character, which is a backslash.
That is written ['\\'] because '\\' is how you write "a string with one backslash", just like you had to do when you wrote the example code print("\\").
Note that you're using two different kinds of string literal here: there's the regular string "a string" and the raw string r"a raw string". Regular string literals observe backslash escaping, so to actually put a backslash in the string, you need to escape it too. Raw string literals treat backslashes like any other character, so you cannot use escape codes to insert special characters, but it is easier to write things like regular expressions, because a backslash that is meant for the regex engine does not have to be doubled just to survive the string literal.
Backslashes in raw strings do not need to be escaped. The one restriction is that a raw string literal cannot end in an odd number of backslashes, because the last backslash would escape the closing quote.
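The difference between regular and raw literals is easy to confirm by measuring a few of them. This is a minimal sketch; the variable names are illustrative only.

```python
regular_backslash = "\\"    # regular literal: one backslash
raw_two = r"\\"             # raw literal: two backslashes
raw_quote = r"\""           # raw literal: a backslash followed by a double quote

print(len(regular_backslash))   # 1
print(len(raw_two))             # 2
print(len(raw_quote))           # 2
print(r"\n" == "\\n")           # True: both are a backslash followed by 'n'
```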

python regex re.compile match

I am trying to match (using regex in python):
http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg
in the following string:
http://www.mymaterialssite.com','http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg','Model Photo'
My code has something like this:
temp="http://www.mymaterialssite.com','http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg','Model Photo'"
dummy=str(re.compile(r'.com'',,''(.*?)'',,''Model Photo').search(str(temp)).group(1))
I do not think that "dummy" is correct, and I am unsure how to escape the single and double quotes in the re.compile call.
I tried googling the problem but couldn't find anything relevant.
I would appreciate any guidance on this.
Thanks.
The easiest way to deal with strings in Python that contain escape characters and quotes is to triple double-quote the string (""") and prefix it with r. For example:
my_str = r"""This string would "really "suck"" to write if I didn't
know how to tell Python to parse it as "raw" text with the 'r' character and
triple " quotes. Especially since I want \n to show up as a backslash followed
by n. I don't want \0 to be the null byte either!"""
The r means "take escape characters as literal". The triple double-quotes (""") prevent single-quotes, double-quotes, and double double-quotes from prematurely ending the string.
EDIT: I expanded the example to include things like \0 and \n. In a normal string (not a raw string) a \ (the escape character) signifies that the next character has special meaning. For example \n means "the newline character". If you literally wanted the character \ followed by n in your string you would have to write \\n, or just use a raw string instead, as I show in the example above.
You can also read about string literals in the Python documentation here:
For beginners: http://docs.python.org/tutorial/introduction.html#strings
Complex explanation: http://docs.python.org/reference/lexical_analysis.html#string-literals
Try triple quotes:
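import re
# re.match anchors at the start of the string, hence the leading and trailing .*
tmp = """.*http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg.*"""
# Named text rather than str so the built-in str type is not shadowed
text = """http://www.mymaterialssite.com','http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg','Model Photo'"""
x = re.match(tmp, text)
if x is not None:
    print(x.group())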
Also you were missing the .* in the beginning of the pattern and at the end. I added that too.
If you use double quotes (which have the same meaning as single quotes in Python), you don't have to escape anything in this case. You can even use a string literal without the r prefix, since there is no backslash in it.
re.compile(".com','(.*?)','Model Photo")
Commas don't need to be escaped, and single quotes don't need to be escaped if you use double quotes to create the string:
>>> dummy=re.compile(r".com','(.*?)','Model Photo").search(temp).group(1)
>>> print(dummy)
http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg
Note that I also removed some unnecessary str() calls, and for future reference if you do ever need to escape single or double quotes (say your string contains both), use a backslash like this:
'.com\',\'(.*?)\',\'Model Photo'
As mykhal pointed out in comments, this doesn't work very nicely with regex because you can no longer use the raw string (r'...') literal. A better solution would be to use triple quoted strings as other answers suggested.
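Putting the advice from these answers together, a Python 3 version of the extraction might look like the sketch below. The names URL_PATTERN, match and url are illustrative, and escaping the dot in .com is a small tightening that the original answers did not include.

```python
import re

temp = "http://www.mymaterialssite.com','http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg','Model Photo'"

# Double quotes around the pattern mean the single quotes need no escaping.
URL_PATTERN = re.compile(r"\.com','(.*?)','Model Photo")

match = URL_PATTERN.search(temp)
# Guard against a missing match instead of calling .group() on None.
url = match.group(1) if match else None
print(url)
```

Checking the result of search before calling group avoids an AttributeError when the input does not have the expected shape.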

Python Regex: Ignore Escaped Character

Alright, I'm currently using Python's regular expression library to split up the following string into groups of semicolon delimited fields.
'key1:"this is a test phrase"; key2:"this is another test phrase"; key3:"ok this is a gotcha\; but you should get it";'
Regex: \s*([^;]+[^\\])\s*;
I'm currently using the pcre above, which was working fine until I encountered a case where an escaped semicolon is included in one of the phrases as noted above by key3.
How can I modify this expression to only split on the non-escaped semicolons?
The basic version of this is where you want to ignore any ; that's preceded by a backslash, regardless of anything else. That's relatively simple:
\s*([^;]*[^;\\]);
What will make this tricky is if you want escaped backslashes in the input to be treated as literals. For example:
"You may want to split here\\;"
"But not here\;"
If that's something you want to take into account, try this (edited):
\s*((?:[^;\\]|\\.)+);
Why so complicated? Because if escaped backslashes are allowed, then you have to account for things like this:
"0 slashes; 2 slashes\\; 5 slashes\\\\\; 6 slashes\\\\\\;"
Each pair of doubled backslashes would be treated as a literal \. That means a ; would only be escaped if there were an odd number of backslashes before it. So the above input would be grouped like this:
#1: '0 slashes'
#2: '2 slashes\'
#3: '5 slashes\\; 6 slashes\\\'
Hence the different parts of the pattern:
\s*                #Whitespace
((?:
    [^;\\]         #One character that's not ; or \
  |                #Or...
    \\.            #A backslash followed by any character, even ; or another backslash
)+);               #Repeated one or more times, followed by ;
Requiring a character after a backslash ensures that the second character is always escaped properly, even if it's another backslash.
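The pattern can be tried on both the input from the question and the backslash example above. This is a sketch only: FIELD, line, fields and slashes are illustrative names, and the inputs are written as raw strings so that every backslash is literal.

```python
import re

# Pattern from this answer: a field is a run of non-; non-\ characters
# or backslash-escaped characters, terminated by an unescaped semicolon.
FIELD = re.compile(r'\s*((?:[^;\\]|\\.)+);')

line = r'key1:"this is a test phrase"; key2:"this is another test phrase"; key3:"ok this is a gotcha\; but you should get it";'
fields = FIELD.findall(line)
print(len(fields))   # 3
for field in fields:
    print(field)

slashes = FIELD.findall(r'0 slashes; 2 slashes\\; 5 slashes\\\\\; 6 slashes\\\\\\;')
print(len(slashes))  # 3
```

The captured fields still contain the raw backslashes from the input; turning \\ into \ and \; into ; would be a separate unescaping step.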
If the string may contain semicolons and escaped quotes (or escaped anything), I would suggest parsing each valid key:"value"; sequence. Like so:
import re
s = r'''
key1:"this is a test phrase";
key2:"this is another test phrase";
key3:"ok this is a gotcha\; but you should get it";
key4:"String with \" escaped quote";
key5:"String with ; unescaped semi-colon";
key6:"String with \\; escaped-escape before semi-colon";
'''
result = re.findall(r'\w+:"[^"\\]*(?:\\.[^"\\]*)*";', s)
print (result)
Note that this correctly handles any escapes within the double quoted string.
