Alright, I'm currently using Python's regular expression library to split up the following string into groups of semicolon delimited fields.
'key1:"this is a test phrase"; key2:"this is another test phrase"; key3:"ok this is a gotcha\; but you should get it";'
Regex: \s*([^;]+[^\\])\s*;
I'm currently using the pcre above, which was working fine until I encountered a case where an escaped semicolon is included in one of the phrases as noted above by key3.
How can I modify this expression to only split on the non-escaped semicolons?
The basic version of this is where you want to ignore any ; that's preceded by a backslash, regardless of anything else. That's relatively simple:
\s*([^;]*[^;\\]);
What will make this tricky is if you want escaped backslashes in the input to be treated as literals. For example:
"You may want to split here\\;"
"But not here\;"
If that's something you want to take into account, try this (edited):
\s*((?:[^;\\]|\\.)+);
Why so complicated? Because if escaped backslashes are allowed, then you have to account for things like this:
"0 slashes; 2 slashes\\; 5 slashes\\\\\; 6 slashes\\\\\\;"
Each pair of doubled backslashes would be treated as a literal \. That means a ; would only be escaped if there were an odd number of backslashes before it. So the above input would be grouped like this:
#1: '0 slashes'
#2: '2 slashes\'
#3: '5 slashes\\; 6 slashes\\\'
Hence the different parts of the pattern:
\s* #Whitespace
((?:
[^;\\] #One character that's not ; or \
| #Or...
\\. #A backslash followed by any character, even ; or another backslash
)+); #Repeated one or more times, followed by ;
Requiring a character after a backslash ensures that the second character is always escaped properly, even if it's another backslash.
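As a rough sketch of how the pattern above behaves in Python: the names FIELD_RE and unescape are made up for illustration, and the unescaping step is only an assumption about what you would do with the fields afterwards.

```python
import re

# Pattern from the answer: a field is a run of characters that are neither
# ; nor \, or backslash-escaped pairs, terminated by an unescaped semicolon.
FIELD_RE = re.compile(r'\s*((?:[^;\\]|\\.)+);')

def unescape(field):
    # Hypothetical helper: replace each backslash-escaped pair with the bare character.
    return re.sub(r'\\(.)', r'\1', field)

s = r'0 slashes; 2 slashes\\; 5 slashes\\\\\; 6 slashes\\\\\\;'
fields = FIELD_RE.findall(s)           # raw fields, escapes still in place
unescaped = [unescape(f) for f in fields]
for f in unescaped:
    print(f)
```

The regex itself only decides where to split; the fields it captures still contain their backslashes, which is why the separate unescaping pass is needed to get the grouping listed above.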
If the string may contain semicolons and escaped quotes (or escaped anything), I would suggest parsing each valid key:"value"; sequence. Like so:
import re
s = r'''
key1:"this is a test phrase";
key2:"this is another test phrase";
key3:"ok this is a gotcha\; but you should get it";
key4:"String with \" escaped quote";
key5:"String with ; unescaped semi-colon";
key6:"String with \\; escaped-escape before semi-colon";
'''
result = re.findall(r'\w+:"[^"\\]*(?:\\.[^"\\]*)*";', s)
print(result)
Note that this correctly handles any escapes within the double quoted string.
I need help interpreting string prefixes and escape characters. I came across them while learning about the arguments to the re.compile() calls below.
a = re.compile(r'^([a-z]|_)*$')
b = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
c = re.compile(r'[=\+/&<>;\'"\?%#$#\,\. \t\r\n]')
What is the meaning of r?
What is the meaning of \', \?, \, and \.?
What is the meaning of \t\r\n ?
What is the meaning of r?
This is the raw prefix for a string literal. Essentially it prevents normal escaping from occurring, leaving in backslashes. A more in depth explanation was given here: https://stackoverflow.com/a/2081708/416500
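What is the meaning of \', \?, \, and \.?
These are characters escaped with backslashes so that the regex treats them literally: \' matches a literal ', \? a literal ?, \, a literal comma, and \. a literal period. The \' also keeps the single quote from ending the Python string literal. Inside a character class ([...]) the ?, the comma, and the period are already literal, so those escapes are harmless but not required.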
What is the meaning of \t\r\n ?
\t is the tab character, \r is a carriage return, and \n is the newline character. These all render as whitespace in most programs, but are stored differently.
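A small sketch to check these points yourself. The names original, simplified, samples, and agree are only for illustration, and the simplified pattern is my assumption of an equivalent class without the optional escapes.

```python
import re

# A normal literal interprets \n as one newline character;
# a raw literal keeps the backslash and the n as two characters.
print(len('\n'), len(r'\n'))

# The character class from the question, unchanged.
original = re.compile(r'[=\+/&<>;\'"\?%#$#\,\. \t\r\n]')
# Assumed-equivalent class with the optional escapes dropped.
simplified = re.compile(r'''[=+/&<>;'"?%#$,. \t\r\n]''')

# Compare both patterns character by character.
samples = list('=+/&<>;\'"?%#$,. \t\r\n') + ['a', '_', '\\']
agree = all(bool(original.search(ch)) == bool(simplified.search(ch))
            for ch in samples)
print(agree)
```

Note that \t, \r, and \n still work inside the raw string because the re module interprets those escapes itself.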
A handy tool for breaking down regex patterns I use a lot is Regex Pal (no affiliation) which lets you hover over parts of the regex to see how it is compiled.
I downloaded the source code of a website. Through downloading the source code, and converting it into a string, many of the characters (like single quotes ('), double quotes ("), angled brackets (<, >), and forward slashes (/)) are now double escaped.
Example:
s = '\\u2018this \\/ that\\u2019'
The text represented in the website, and how I want it represented when printed out, is:
‘this / that’
My first instinct was to use regex to find all instances of 2 backslashes, and replace it with a single backslash, then use str.encode('utf-8').decode('utf-8') to convert the 4 digit escaped Unicode characters into their actual characters:
import re
text = '\\u2018this \\/ that\\u2019'
pattern = r'(\\)\\\1'
double_escapes_removed = re.sub(pattern, '', text)
final_text = text.encode('utf-8').decode('utf-8')
print(final_text) should print ‘this / that’, but the string comes back completely unaltered: \u2018this \/ that\u2019.
I tested the pattern on its own with re.findall(pattern, text), and it successfully found the 3 instances of double backslashes. Beyond that, I have no idea what is going wrong.
This turns out to be a bit difficult. A big part of the issue is that although '\\u2018' is 6 characters, '\u2018' is a representation of a single character, so you can't just replace '\\u' with '\u' and have it work.
This gets you most of the way there without having to manually iterate over escapes with regex:
>>> s.encode('ascii').decode('unicode-escape')
<<< '‘this \\/ that’'
Python 3 does output a warning about '\/' being an invalid unicode escape sequence, so you'd probably want to take care of those first.
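One way to take care of the '\/' sequences first is a plain str.replace before decoding. This is a minimal sketch that assumes the input is pure ASCII and that '\/' is the only escape that is not a valid Python escape; the names cleaned and result are illustrative.

```python
s = '\\u2018this \\/ that\\u2019'

# Turn the escaped forward slash into a plain slash first, so that
# unicode-escape never sees the invalid '\/' sequence.
cleaned = s.replace('\\/', '/')

# Interpret the remaining \uXXXX escapes as actual characters.
result = cleaned.encode('ascii').decode('unicode-escape')
print(result)
```

If the text can contain non-ASCII characters, encode('ascii') will raise UnicodeEncodeError, so this would need a different approach.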
I would like to parse an input string and determine if it contains a sequence of characters surrounded by double quotes (").
The sequence of characters itself is not allowed to contain further double quotes, unless they are escaped by a backslash, like so: \".
To make things more complicated, the backslashes can be escaped themselves, like so: \\. A double quote preceded by two (or any even number of) backslashes (\\") is therefore not escaped.
And to make it even worse, single non-escaping backslashes (i.e. followed by neither " nor \) are allowed.
I'm trying to solve that with Python's re module.
The module documentation tells us about the pipe operator A|B:
As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.
However, this doesn't work as I expected:
>>> import re
>>> re.match(r'"(\\[\\"]|[^"])*"', r'"a\"')
<_sre.SRE_Match object; span=(0, 4), match='"a\\"'>
The idea of this regex is to first check for an escaped character (\\ or \") and only if that's not found, check for any character that's not " (but it could be a single \).
This can occur an arbitrary number of times and it has to be surrounded by literal " characters.
I would expect the string "a\" not to match at all, but apparently it does.
I would expect \" to match the A part and the B part not to be tested, but apparently it is.
I don't really know how the backtracking works in this very case, but is there a way to avoid it?
I guess it would work if I check first for the initial " character (and remove it from the input) in a separate step.
I could then use the following regular expression to get the content of the string:
>>> re.match(r'(\\[\\"]|[^"])*', r'a\"')
<_sre.SRE_Match object; span=(0, 3), match='a\\"'>
This would include the escaped quote. Since there wouldn't be a closing quote left, I would know that overall, the given string does not match.
Do I have to do it like that or is it possible to solve this with a single regular expression and no additional manual checking?
In my real application, the "-enclosed string is only one part of a larger pattern, so I think it would be simpler to do it all at once in a single regular expression.
I found similar questions, but those don't consider that a single non-escaping backslash can be part of the string: regex to parse string with escaped characters, Parsing for escape characters with a regular expression.
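When you use "(\\[\\"]|[^"])*", you match " followed by 0+ sequences of \ followed by either \ or ", or non-", and then followed by a "closing" ". When your input is "a\", the engine first lets \\[\\"] consume \", but then there is no closing " left, so it backtracks and tries the second alternative branch [^"], which matches the \ on its own (as the backslash is a valid non-"). The following " then serves as the closing quote.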
You need to exclude the \ from the non-":
"(?:[^\\"]|\\.)*"
      ^^
So, we match ", then either non-" and non-\ (with [^\\"]) or any escape sequence (with \\.), 0 or more times.
However, this regex is not very efficient, because the alternation inside the quantifier causes a lot of backtracking. The unrolled version is:
"[^"\\]*(?:\\.[^"\\]*)*"
The last pattern matches:
" - a double quote
[^"\\]* - zero or more characters other than \ and "
(?:\\.[^"\\]*)* - zero or more sequences of
\\. - a backslash followed by any character but a newline
[^"\\]* - zero or more characters other than \ and "
" - a double quote
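To compare the two patterns side by side in Python, here is a small sketch. The variable names and the list of candidate strings are mine; fullmatch is used so that the whole input must be a quoted string.

```python
import re

# Pattern from the question: [^"] can also swallow a backslash.
original = re.compile(r'"(\\[\\"]|[^"])*"')
# Unrolled pattern from the answer: backslashes only appear as escape pairs.
unrolled = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

candidates = [r'"a\"', r'"a\"b"', r'"a\\"', r'"a\b"']
for candidate in candidates:
    print(candidate,
          bool(original.fullmatch(candidate)),
          bool(unrolled.fullmatch(candidate)))
```

Only the first candidate, "a\", is treated differently: the original pattern accepts it after backtracking, the unrolled one rejects it. A single non-escaping backslash as in "a\b" is still accepted, because \\. consumes the backslash together with whatever character follows it.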
I am trying to match (using regex in python):
http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg
in the following string:
http://www.mymaterialssite.com','http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg','Model Photo'
My code has something like this:
temp="http://www.mymaterialssite.com','http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg','Model Photo'"
dummy=str(re.compile(r'.com'',,''(.*?)'',,''Model Photo').search(str(temp)).group(1))
I do not think the "dummy" is correct & I am unsure how I "escape" the single and double quotes in the regex re.compile command.
I tried googling for the problem, but I couldn't find anything relevant.
Would appreciate any guidance on this.
Thanks.
The easiest way to deal with strings in Python that contain escape characters and quotes is to triple double-quote the string (""") and prefix it with r. For example:
my_str = r"""This string would "really "suck"" to write if I didn't
know how to tell Python to parse it as "raw" text with the 'r' character and
triple " quotes. Especially since I want \n to show up as a backslash followed
by n. I don't want \0 to be the null byte either!"""
The r means "take escape characters as literal". The triple double-quotes (""") prevent single-quotes, double-quotes, and double double-quotes from prematurely ending the string.
EDIT: I expanded the example to include things like \0 and \n. In a normal string (not a raw string) a \ (the escape character) signifies that the next character has special meaning. For example \n means "the newline character". If you literally wanted the character \ followed by n in your string you would have to write \\n, or just use a raw string instead, as I show in the example above.
You can also read about string literals in the Python documentation here:
For beginners: http://docs.python.org/tutorial/introduction.html#strings
Complex explanation: http://docs.python.org/reference/lexical_analysis.html#string-literals
Try triple quotes:
import re
pattern = """.*http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg.*"""
text = """http://www.mymaterialssite.com','http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg','Model Photo'"""
x = re.match(pattern, text)
if x is not None:
    print(x.group())
Also you were missing the .* in the beginning of the pattern and at the end. I added that too.
If you use double quotes (which have the same meaning as single quotes in Python), you don't have to escape anything in this case. You can even use a string literal without the leading r, since there is no backslash in the pattern:
re.compile(".com','(.*?)','Model Photo")
Commas don't need to be escaped, and single quotes don't need to be escaped if you use double quotes to create the string:
>>> dummy=re.compile(r".com','(.*?)','Model Photo").search(temp).group(1)
>>> print(dummy)
http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg
Note that I also removed some unnecessary str() calls, and for future reference if you do ever need to escape single or double quotes (say your string contains both), use a backslash like this:
'.com\',\'(.*?)\',\'Model Photo'
As mykhal pointed out in comments, this doesn't work very nicely with regex because you can no longer use the raw string (r'...') literal. A better solution would be to use triple quoted strings as other answers suggested.
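Putting the suggestions together, a minimal Python 3 sketch might look like this. The names match and url are illustrative, and escaping the dot as \. is my own tightening so that it only matches a literal period.

```python
import re

temp = "http://www.mymaterialssite.com','http://images.mymaterials.com/images/steel-images/small/steel/steel800/steel800-2.jpg','Model Photo'"

# Double quotes around the pattern mean the single quotes need no escaping;
# the r prefix keeps the backslash in \. for the regex engine.
match = re.search(r"\.com','(.*?)','Model Photo", temp)

# search() returns None when nothing matches, so guard before calling group().
url = match.group(1) if match else None
print(url)
```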
I would like to find a regular expression (regex) that detects invalid escapes in a C double-quoted string (where double quotes may only appear escaped).
I consider \\, \n, \r and \" valid (the test string uses ").
A partial solution to this is to use (?<!\\)\\[^\"\\nr] but this one fails to detect bad escapes like \\\.
Here is a test string that I use to test the matching:
...\n...\\b...\"...\\\\...\\\E...\...\\\...\\\\\..."...\E...
The expression should match the last 6 blocks as invalid; the first 4 are valid. The problem is that my current version finds only 2/5 errors.
(?:^|[^\\])(?:\\\\)*((?:\"|\\(?:[^\"\\nr]|$)))
That's the start of a string, or something that's not a backslash. Then some (possibly zero) properly escaped backslashes, then either an unescaped " or another backslash; if it's another backslash, it must be followed by something that is neither ", \, n, nor r, or the end of the string.
The incorrect escape is captured for you as well.
Try this regular expression:
^(?:[^\\]+|\\[\\rn"])*(\\(?:[^\\rn"]|$))
If you have a match, you have an invalid escape sequence.
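If you want every invalid escape reported rather than just the first, a different technique is to tokenize the string into escape pairs and bare quotes, then check each token. This is a sketch of that alternative, not one of the regexes above; find_invalid_escapes and VALID_ESCAPED are hypothetical names.

```python
import re

# Characters that may legally follow a backslash: \ n r "
VALID_ESCAPED = set('\\nr"')

def find_invalid_escapes(body):
    """Return (offset, text) pairs for invalid escapes and unescaped quotes."""
    problems = []
    # Each match is either a backslash plus the character it escapes
    # (nothing at end of input), or a double quote that was not escaped.
    for m in re.finditer(r'\\(.?)|"', body, re.DOTALL):
        escaped = m.group(1)  # None for a bare quote, '' for a trailing backslash
        if escaped is None or escaped not in VALID_ESCAPED:
            problems.append((m.start(), m.group(0)))
    return problems

test = r'...\n...\\b...\"...\\\\...\\\E...\...\\\...\\\\\..."...\E...'
found = find_invalid_escapes(test)
for offset, text in found:
    print(offset, text)
```

Because finditer scans left to right and consumes each backslash together with the character after it, a run of backslashes is always paired up correctly, so no lookbehind is needed.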