Regular Expression that Includes a Character Only If Another Character Precedes It - python

I'm new to Stack so not sure if I'm asking this right.
I'm trying to form a regular expression to match all characters except 3 specific ones (%, &, and $), but I want to ignore that exception if a backslash (\) precedes any of those characters. For example, if I have the string
abcd\$&
I would want the regular expression to match
abcd\$
because a backslash precedes the dollar sign, but not match the & because no backslash precedes it.
So far I have:
^[^%$&]+
which matches any string that doesn't have the characters (%, $, or &), but it stops at the backslash rather than include the backslash and the next character.
Thanks in advance!

^([^%$&\\]|\\.)+$
should work.
It also excludes \ from the charset and then allows \ followed by any character.
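A quick check of this pattern against the question's example, as a sketch. Because of the trailing $, the pattern only accepts strings that are valid all the way to the end, so the example string (which ends in an unescaped &) is rejected outright. The `prefix` variant without the trailing anchor is an assumption about what the asker wants, not part of the answer.

```python
import re

# The answer's pattern: anchored at both ends, so the whole string must be valid.
whole = re.compile(r'^([^%$&\\]|\\.)+$')
# Hypothetical variant without the trailing $: matches the longest valid prefix.
prefix = re.compile(r'^(?:[^%$&\\]|\\.)+')

sample = r'abcd\$&'       # the string from the question
escaped = r'abcd\$\&'     # same string with the & escaped as well

print(whole.match(sample))            # None: the unescaped & blocks a full match
print(prefix.match(sample).group())   # abcd\$
print(whole.match(escaped).group())   # abcd\$\&
```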

Related

How to replace text within parenthesis in pandas

I have some data with parenthetical I want to remove. I know the following piece of code works to remove the parenthetical. I just want to understand what exactly it's doing. What does the r do? How about the \? I know the .* stands for any number of characters between the parenthesis.
df['name'].str.replace(r"\(.*\)","")
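I've never used pandas before, but the r before the string literal is plain Python: it marks a raw string, so the backslashes reach the regex engine unchanged instead of being treated as string escape sequences. Whether the pattern is interpreted as a regex or as a literal is decided by str.replace itself, not by the r prefix.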
As far as the RegEx, here it is broken down:
\(.*\)
\( Escapes the open parenthesis character to match it literally
.* Matches any character except for line breaks zero or more times
\) Escapes the close parenthesis character to match it literally
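The breakdown above can be tried without pandas by using re.sub from the standard library. This is only a sketch with made-up sample values; it also shows that .* is greedy, so everything from the first ( to the last ) is removed.

```python
import re

# A raw string keeps backslashes as-is; without the r prefix each one is doubled.
same_pattern = r"\(.*\)" == "\\(.*\\)"
print(same_pattern)                                   # True

name = "Widget (discontinued)"                        # hypothetical sample value
cleaned = re.sub(r"\(.*\)", "", name)
print(repr(cleaned))                                  # 'Widget '

# Greedy .* runs from the first "(" to the last ")" on the line.
greedy = re.sub(r"\(.*\)", "", "a (b) c (d)")
print(repr(greedy))                                   # 'a '
```

With pandas itself, to my knowledge versions 2.0 and later treat the pattern as a literal string unless regex=True is passed, i.e. df['name'].str.replace(r"\(.*\)", "", regex=True).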

How to scan for a string literal allowing escaped characters?

I would like to parse an input string and determine if it contains a sequence of characters surrounded by double quotes (").
The sequence of characters itself is not allowed to contain further double quotes, unless they are escaped by a backslash, like so: \".
To make things more complicated, the backslashes can be escaped themselves, like so: \\. A double quote preceded by two (or any even number of) backslashes (\\") is therefore not escaped.
And to make it even worse, single non-escaping backslashes (i.e. followed by neither " nor \) are allowed.
I'm trying to solve that with Python's re module.
The module documentation tells us about the pipe operator A|B:
As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.
However, this doesn't work as I expected:
>>> import re
>>> re.match(r'"(\\[\\"]|[^"])*"', r'"a\"')
<_sre.SRE_Match object; span=(0, 4), match='"a\\"'>
The idea of this regex is to first check for an escaped character (\\ or \") and only if that's not found, check for any character that's not " (but it could be a single \).
This can occur an arbitrary number of times and it has to be surrounded by literal " characters.
I would expect the string "a\" not to match at all, but apparently it does.
I would expect \" to match the A part and the B part not to be tested, but apparently it is.
I don't really know how the backtracking works in this very case, but is there a way to avoid it?
I guess it would work if I check first for the initial " character (and remove it from the input) in a separate step.
I could then use the following regular expression to get the content of the string:
>>> re.match(r'(\\[\\"]|[^"])*', r'a\"')
<_sre.SRE_Match object; span=(0, 3), match='a\\"'>
This would include the escaped quote. Since there wouldn't be a closing quote left, I would know that overall, the given string does not match.
Do I have to do it like that or is it possible to solve this with a single regular expression and no additional manual checking?
In my real application, the "-enclosed string is only one part of a larger pattern, so I think it would be simpler to do it all at once in a single regular expression.
I found similar questions, but those don't consider that a single non-escaping backslash can be part of the string: regex to parse string with escaped characters, Parsing for escape characters with a regular expression.
When you use "(\\[\\"]|[^"])*", you match " followed by 0+ sequences of \ followed by either \ or ", or non-", and then followed by a "closing" ". Note that when your input is "a\", the \ is matched by the second alternative branch [^"] (as the backslash is a valid non-").
You need to exclude the \ from the non-":
"(?:[^\\"]|\\.)*"
      ^^
So, we match ", then either non-" and non-\ (with [^\\"]) or any escape sequence (with \\.), 0 or more times.
However, this regex is not efficient enough, as there is a lot of backtracking going on (caused by the alternation combined with the quantifier). The unrolled version is:
"[^"\\]*(?:\\.[^"\\]*)*"
The last pattern matches:
" - a double quote
[^"\\]* - zero or more characters other than \ and "
(?:\\.[^"\\]*)* - zero or more sequences of
    \\. - a backslash followed by any character but a newline
    [^"\\]* - zero or more characters other than \ and "
" - a double quote

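A small sketch comparing the question's pattern with the two patterns from the answer. The sample strings other than "a\" are my own additions, chosen to cover an escaped quote, an escaped backslash, and a single non-escaping backslash.

```python
import re

original = re.compile(r'"(\\[\\"]|[^"])*"')          # pattern from the question
fixed = re.compile(r'"(?:[^\\"]|\\.)*"')             # backslash excluded from the class
unrolled = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')   # unrolled version

unterminated = r'"a\"'      # the closing quote is escaped, so the string never ends
valid = [r'"a\"b"',         # escaped quote inside
         r'"a\\"',          # escaped backslash, then a real closing quote
         r'"a\b"']          # single non-escaping backslash

print(bool(original.match(unterminated)))   # True: [^"] swallows the backslash
print(bool(fixed.match(unterminated)))      # False
print(bool(unrolled.match(unterminated)))   # False

for s in valid:
    print(s, fixed.match(s).group() == s, unrolled.match(s).group() == s)
```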
How does the regex "\" character and grouping "()" character work together?

I am trying to see which statements the following pattern matches:
\(*[0-9]{3}\)*-*[0-9]{3}\d\d\d+
I am a little confused because the grouping characters () have a \ before it. Does this mean that the statement must have a ( and )? Would that mean the statements without ( or ) be unmatched?
Statements:
'404-678-2347'
'(123)-1247890'
'456-900-900'
'(678)-2001236'
'404123-1234'
'(404123-123'
Context is important:
re.match(r'\(', content) matches a literal parenthesis.
re.match(r'\(*', content) matches 0 or more literal parentheses, thus making the parens optional (and allowing more than one of them, but that's clearly a bug).
Since the intended behavior isn't "0 or more" but rather "0 or 1", this should probably be written r'\(?' instead.
That said, there's a whole lot about this regex that's silly. I'd consider instead:
[(]?\d{3}[)]?-?\d{6,}
Using [(]? avoids backslashes, and consequently is easier to read whether it's rendered by str() or repr() (which escapes backslashes).
Mixing [0-9] and \d is silly; better to pick one and stick with it.
Using * in place of ? is silly, unless you really want to match (((123))-----4567890.
\d{3}\d\d\d+ matches three digits, then three or more additional digits. Why not just match six or more digits in the first place?
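To see how the two patterns behave, here is a sketch that runs both against the question's sample strings. It assumes the separators in those strings are plain hyphens and that re.match (anchored at the start only) is the intended call; the `odd` string is a made-up example.

```python
import re

original = re.compile(r'\(*[0-9]{3}\)*-*[0-9]{3}\d\d\d+')   # pattern from the question
suggested = re.compile(r'[(]?\d{3}[)]?-?\d{6,}')            # pattern proposed above

# Sample strings from the question, assuming plain hyphens as separators.
samples = ['404-678-2347', '(123)-1247890', '456-900-900',
           '(678)-2001236', '404123-1234', '(404123-123']

results = {s: (bool(original.match(s)), bool(suggested.match(s))) for s in samples}
for s, (old, new) in results.items():
    print(s, old, new)

# Repeated parentheses and hyphens are accepted by * but rejected by ?.
odd = '((123))---456789'
print(bool(original.match(odd)), bool(suggested.match(odd)))   # True False
```

Under those assumptions only '(123)-1247890' and '(678)-2001236' match, with either pattern: every other sample runs into a hyphen or the end of the string before six consecutive digits are available.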
Normally, the parentheses would act as grouping characters; however, regex metacharacters are reduced to the plain literal characters when preceded by a backslash. From the Python docs:
As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
In your case, the statements don't need parentheses in order to match, as each \( and \) in the expression is followed by a *, which means that the previous character can be matched any number of times, including none at all. From the Python docs:
* doesn’t match the literal character *; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.
Thus the statements with or without parentheses around the first 3 digits may match.
Source: https://docs.python.org/2/howto/regex.html
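A short sketch of the difference between escaped and unescaped parentheses, using made-up input strings:

```python
import re

# Escaped: \( matches a literal opening parenthesis.
literal_hit = re.match(r'\(', '(404)')
literal_miss = re.match(r'\(', '404')
print(literal_hit is not None, literal_miss is not None)   # True False

# \(* allows zero or more of them, so a string without parentheses still matches.
optional = re.match(r'\(*', '404')
print(repr(optional.group()))                               # ''

# Unescaped: ( ) form a capturing group instead of matching parentheses.
grouped = re.match(r'(\d{3})-', '404-')
print(grouped.group(1))                                     # 404
```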

In regex, what does [\w*] mean?

What does this regex mean?
^[\w*]$
Quick answer: ^[\w*]$ will match a string consisting of a single character, where that character is alphanumeric (letters, numbers), an underscore (_), or an asterisk (*).
Details:
The "\w" means "any word character" which usually means alphanumeric (letters, numbers, regardless of case) plus underscore (_)
The "^" "anchors" to the beginning of a string, and the "$" "anchors" to the end of a string, which means that, in this case, the match must start at the beginning of the string and end at the end of the string.
The [] means a character class, which means "match any character contained in the character class".
It is also worth mentioning that normal quoting and escaping rules for strings make it very difficult to enter regular expressions (all the backslashes would need to be escaped with additional backslashes), so in Python there is a special notation which has its own special quoting rules that allow for all of the backslashes to be interpreted properly, and that is what the "r" at the beginning is for.
Note: Normally an asterisk (*) means "0 or more of the previous thing" but in the example above, it does not have that meaning, since the asterisk is inside of the character class, so it loses its "special-ness".
For more information on regular expressions in Python, the two official references are the re module documentation and the Regular Expression HOWTO.
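The quick answer can be checked with a few candidate strings (the candidates are my own picks):

```python
import re

pattern = re.compile(r'^[\w*]$')

candidates = ['a', '7', '_', '*', 'ab', '**', '-', '']
outcome = {c: bool(pattern.match(c)) for c in candidates}
for c, matched in outcome.items():
    print(repr(c), matched)
```

Only the single-character strings 'a', '7', '_' and '*' match; the asterisk counts because it is literal inside the character class.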
As exhuma said, \w is any word-class character (alphanumeric as Jonathan clarifies).
However because it is in square brackets it will match:
a single alphanumeric character OR
an asterisk (*)
So the whole regular expression matches:
the beginning of a line (^)
followed by either a single alphanumeric character or an asterisk
followed by the end of a line ($)
so the following would match:
blah
z <- matches this line
blah
or
blah
* <- matches this line
blah
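Matching individual lines like this assumes the pattern is applied in multiline mode; in Python that means passing re.MULTILINE, which is an assumption on my part since the answer does not name a flag. A minimal sketch:

```python
import re

text = "blah\nz\nblah\n*\nblah"

# With re.MULTILINE, ^ and $ anchor at every line boundary.
per_line = re.findall(r'^[\w*]$', text, re.MULTILINE)
print(per_line)        # ['z', '*']

# Without the flag, ^ only matches at the very start of the text.
whole_text = re.findall(r'^[\w*]$', text)
print(whole_text)      # []
```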
\w matches a single alphanumeric character or the underscore. The * in your case is also inside the character class, so [\w*] would match any one of [a-zA-Z0-9_*] (the * is interpreted literally).
See http://www.regular-expressions.info/reference.html
To quote:
\d, \w and \s --- Shorthand character classes matching digits, word characters, and whitespace. Can be used inside and outside character classes.
Edit corrected in response to comment
From the beginning of the line, a single word character (letter, number, underscore) or a literal asterisk, until the end of the line.
The square brackets form a character class; round brackets (i.e. "(" and ")") would be used instead if you wanted the matched text captured and returned as a group.
\w is equivalent to [a-zA-Z0-9_]. I don't understand the * after it or the [] around it, because \w is already a class and * in class definitions makes no sense.
As said above, \w means any word character, so you could use it in a context like the one below:
view.aspx?url=[\w]
which allows a single word character as the value of the "url=" parameter; use [\w]+ to allow a whole word.

Looking for a regular expression including alphanumeric + "&" and ";"

Here's the problem:
split=re.compile('\\W*')
This regular expression works fine when dealing with regular words, but there are occasions where I need the expression to include words like k&auml;ytt&auml;j&auml;.
What should I add to the regex to include the & and ; characters?
I would treat the entities as a unit (since they also can contain numerical character codes), resulting in the following regular expression:
(\w|&(#(x[0-9a-fA-F]+|[0-9]+)|[a-z]+);)+
This matches
either a word character (including “_”), or
an HTML entity consisting of
    the character “&”,
    either the character “#” followed by
        the character “x” and at least one hexadecimal digit, or
        at least one decimal digit,
    or at least one letter (= named entity),
    a semicolon
at least once.
/EDIT: Thanks to ΤΖΩΤΖΙΟΥ for pointing out an error.
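A sketch of this pattern in use. The groups are rewritten as non-capturing here (my change, so that findall returns whole matches rather than group contents), and the sample text is made up.

```python
import re

# Same alternatives as the pattern above, with (?:...) instead of (...).
token = re.compile(r'(?:\w|&(?:#(?:x[0-9a-fA-F]+|[0-9]+)|[a-z]+);)+')

text = 'k&auml;ytt&auml;j&auml; &#228; &#xE4; R&D'
tokens = token.findall(text)
print(tokens)
# ['k&auml;ytt&auml;j&auml;', '&#228;', '&#xE4;', 'R', 'D']
```

A bare & that does not start a complete entity (as in R&D) is not absorbed into the word. Note that [a-z]+ only covers lowercase entity names, so something like &Auml; would need the class widened.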
You probably want to approach the problem in reverse, i.e. find all the characters that are not whitespace:
[^ \t\n]*
Or you want to add the extra characters:
[a-zA-Z0-9&;]*
In case you want to match HTML entities, you should try something like:
(\w+|&\w+;)*
You should make a character class that includes the extra characters. For example:
split=re.compile(r'[\w&;]+')
This should do the trick. For your information
\w (lower case 'w') matches word characters (alphanumeric)
\W (capital W) is a negated character class (meaning it matches any non-alphanumeric character)
* matches 0 or more times and + matches one or more times, so * will match anything (even if there are no characters there).
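A sketch of the difference, using a made-up sentence. It splits on \W+ rather than the question's \W* because, to my knowledge, Python 3.7 and later also split on empty matches, which would break the text into single characters.

```python
import re

text = 'k&auml;ytt&auml;j&auml; on sana'

# Splitting on non-word characters tears the entities apart.
pieces = re.split(r'\W+', text)
print(pieces)
# ['k', 'auml', 'ytt', 'auml', 'j', 'auml', 'on', 'sana']

# Collecting runs of word characters plus & and ; keeps them together.
words = re.findall(r'[\w&;]+', text)
print(words)
# ['k&auml;ytt&auml;j&auml;', 'on', 'sana']
```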
Looks like this RegEx did the trick:
split=re.compile('(\\\W+&\\\W+;)*')
Thanks for the suggestions. Most of them worked fine on Reggy, but I don't quite understand why they failed with re.compile.
