Difference between character sets in Python and re2c regular expressions

Character sets in regular expressions are specified using []. Character sets match any one of the enclosed characters. For example, [abc] will match one of 'a', 'b', or 'c'.
I realize there are potentially differences between character sets in Python and re2c regular expressions. I know what is the same in both:
Both accept ranges, for example [a-z] matches all lowercase letters
Both accept inverse sets using [^...] notation
Both accept common alphanumeric and some other characters (spaces, etc.)
But I'm concerned about these possibly being different:
Characters that need to be escaped inside of the character set
Where to place a literal '-' or '^' inside the character set if I want to match that character and not specify an inverse set or a range
Can you explain the difference between Python and re2c character sets?

Looking at the re2c manual link that you provided, it appears that re2c uses the same syntax, just a subset of that syntax.
To address your specific questions about regex syntax:
Characters that need to be escaped inside of the character set.
What characters are you referring to specifically?
Where to place a literal - or ^ inside the character set...
For ^, anywhere but the first position works; for -, put it first or last in the set so that it cannot be read as part of a range.
>>> import re
>>> match_literal_hyphen = "[ab-]"
>>> re.findall(match_literal_hyphen, "abc - def")
['a', 'b', '-']
>>> match_literal_caret = "[a^b]"
>>> re.findall(match_literal_caret, "abc ^ def")
['a', 'b', '^']
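If relying on position feels fragile, escaping the character with a backslash also works in Python. The snippet below is a minimal sketch for Python's re module only (it makes no claim about re2c); the sample text and variable names are made up for illustration.

```python
import re

text = "a-b^c]d"

# Positional approach: hyphen first, caret not first, so neither is special.
positional = re.findall(r"[-^]", text)

# Escaped approach: backslashes make the intent explicit, wherever the
# characters sit inside the set.
escaped = re.findall(r"[\^\-\]]", text)

print(positional)  # ['-', '^']
print(escaped)     # ['-', '^', ']']
```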

I would escape anything that causes confusion:
/[][]/ matches ']' or '['
/[[]]/ matches '[]'
/[]]]/ matches ']]'
/[[[]/ matches '['
/[]/ is an unmatched '[' error
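For the Python side, a quick check of the bracket cases might look like the sketch below. It only exercises Python's re module, and the sample string is invented; the escaped form is usually easier to read than depending on where ']' is placed.

```python
import re

sample = "x[y]z"

# A ']' placed first in the set is taken literally, so this set holds ']' and '['.
unescaped = re.findall(r"[][]", sample)

# Escaping both brackets gives the same set.
escaped = re.findall(r"[\[\]]", sample)

# An "empty" set is rejected: the ']' is read as a member and the set never closes.
try:
    re.compile(r"[]")
    empty_set_error = None
except re.error as exc:
    empty_set_error = str(exc)

print(unescaped, escaped)  # ['[', ']'] ['[', ']']
print(empty_set_error)
```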

Related

How to search for a character string that is an escape sequence with re.search

I wrote the code to check if the escape sequence "\n" is included in the string. However, it behaved unexpectedly, so I would like to know the reason. Why did I get the result of case2?
Case 1
The code below worked. Since r"\n" (reg1) is a string consisting of two characters, '\' and 'n', I think it is correct to search for and match the target string "\n".
import re
reg1 = r"\n"
print(re.search(reg1, "\n"))
# output: <re.Match object; span=(0, 1), match='\n'>
Case 2
I expected the code below to output None, but it didn't. The pattern "\n" (reg2) is the line-feed escape sequence, and I assumed the target string "\n" consisted of two characters, '\' and 'n', so I thought they would not match. However, they did match.
import re
reg2 = "\n"
print(re.search(reg2, "\n"))
# output: <re.Match object; span=(0, 1), match='\n'>

You are correct when it comes to the contents of the strings used for the regexes, but not the targets. The statement:
"\n" consisting of two characters, '\' and 'n', was used as the target string,
is incorrect. The interpretation of a string is not context-sensitive; r"\n" is always 2 characters, and "\n" is always 1. This is covered in the Python Regular Expression HOWTO:
r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline.
This is more easily demonstrated with a non-control character, since a literal "\n" shows up only as a line break. Let's use "þ" (thorn) instead.
Case 1:
re.search(r"\u00FE", "\u00FE")
r"\u00FE" is a string with 6 characters, which compiles to the regex /\u00FE/. This is interpreted as an escape sequence by the regex library itself that matches a thorn character.
"\u00FE" is interpreted by python, producing the string "þ".
/\u00FE/ matches "þ".
Case 2:
re.search("\u00FE", "\u00FE")
"\u00FE" is a string with 1 character, "þ", which compiles to the regex /þ/.
/þ/ matches "þ".
Result: both regexes match. The only difference is that the regex contains an escape sequence in case 1 and a character literal in case 2.
What you seem to have in mind is a raw string for the target:
re.search(r"\u00FE", r"\u00FE")
re.search("\u00FE", r"\u00FE")
Neither of these matches, as neither of the targets contains a thorn character.
If you wanted to match an escape sequence, the escape character must be escaped within the regex:
re.search(r"\\u00FE", r"\u00FE")
re.search("\\\\u00FE", r"\u00FE")
Either of those patterns will result in the regex /\\u00FE/, which matches a string containing the given escape sequence.
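The same reasoning can be checked directly for the original "\n" case. This is a small sketch; the variable names are only for illustration.

```python
import re

# String literals are decoded before the regex engine ever sees them.
assert len(r"\n") == 2   # backslash + 'n'
assert len("\n") == 1    # a single newline character

newline = "\n"        # target containing a real newline
backslash_n = r"\n"   # target containing a backslash followed by 'n'

# Both patterns match a real newline (one as a regex escape, one literally).
both_match = [bool(re.search(p, newline)) for p in (r"\n", "\n")]

# Neither pattern matches the two-character target.
raw_target = [bool(re.search(p, backslash_n)) for p in (r"\n", "\n")]

# Escaping the backslash in the regex matches the two-character sequence.
escaped = re.search(r"\\n", backslash_n)

print(both_match)      # [True, True]
print(raw_target)      # [False, False]
print(escaped.span())  # (0, 2)
```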

Regex to check if it is exactly one single word

I am basically trying to match a string against a pattern (wildcard match).
Please look at this carefully:
* (star) means exactly one word.
This is not a regex pattern; it is a convention.
So, given patterns like these:
*.key - '.key' is preceded by exactly one word (a word containing no dots)
*.key.* - '.key.' is preceded and followed by exactly one word containing no dots
key.* - 'key.' is followed by exactly one word
So,
"door.key" matches "*.key"
"brown.door.key" doesn't match "*.key".
"brown.key.door" matches "*.key.*"
but "brown.iron.key.door" doesn't match "*.key.*"
So, when I encounter a '*' in the pattern, I have to replace it with a regex that means exactly one word (a-zA-Z0-9_). Can anyone please help me do this in Python?

To convert your pattern to a regexp, you first need to make sure each character is interpreted literally and not as a special character. We can do that by inserting a \ in front of any re special character. Those characters can be obtained through sre_parse.SPECIAL_CHARS.
Since you have a special meaning for *, we do not want to escape that one but instead replace it by \w+.
Code
import sre_parse

def convert_to_regexp(pattern):
    # Escape every re special character except '*', which has its own meaning here
    special_characters = set(sre_parse.SPECIAL_CHARS)
    special_characters.remove('*')
    safe_pattern = ''.join('\\' + c if c in special_characters else c for c in pattern)
    # '*' stands for exactly one word
    return safe_pattern.replace('*', '\\w+')
Example
import re
pattern = '*.key'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.key'
re.match(r_pattern, 'door.key') # Match
re.match(r_pattern, 'brown.door.key') # None
And here is an example with escaped special characters
pattern = '*.(key)'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.\\(key\\)'
re.match(r_pattern, 'door.(key)') # Match
re.match(r_pattern, 'brown.door.(key)') # None
Sidenote
If you intend looking for the output pattern with re.search or re.findall, you might want to wrap the re pattern between \b boundary characters.
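As an alternative sketch, the same conversion can be built from the public re.escape function instead of sre_parse, which, as far as I know, has been deprecated since Python 3.11. The function name wildcard_to_regex is hypothetical, and re.fullmatch is used here on the assumption that the whole string must match.

```python
import re

def wildcard_to_regex(pattern):
    # Split on the wildcard, escape each literal piece, and rejoin the
    # pieces with \w+ so that every '*' stands for exactly one word.
    return r'\w+'.join(re.escape(part) for part in pattern.split('*'))

key_rx = re.compile(wildcard_to_regex('*.key'))
mid_rx = re.compile(wildcard_to_regex('*.key.*'))

print(bool(key_rx.fullmatch('door.key')))             # True
print(bool(key_rx.fullmatch('brown.door.key')))       # False
print(bool(mid_rx.fullmatch('brown.key.door')))       # True
print(bool(mid_rx.fullmatch('brown.iron.key.door')))  # False
```

Using fullmatch also rejects trailing text such as 'door.keys', which re.match alone would accept.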

The conversion rules you are looking for go like this:
* is a word, thus: \w+
. is a literal dot: \.
key is and stays a literal string
Plus, your samples indicate you are going to match whole strings, which in turn means your pattern should match from the ^ beginning to the $ end of the string.
Therefore, *.key becomes ^\w+\.key$, *.key.* becomes ^\w+\.key\.\w+$, and so forth.

^ anchors the match at the start of the string.
$ anchors the match at the end of the string.
\s means a whitespace character.
\S means a non-whitespace character.
+ means one or more repetitions of the preceding element.
Now, you want to match just a single word, meaning a string that consists only of non-whitespace characters from start to end. So, the required regular expression is:
^\S+$

You could do it with a combination of "any characters that aren't period" and the start/end anchors.
*.key would be ^[^.]*\.key$, and *.key.* would be ^[^.]*\.key\.[^.]*$
EDIT: As tripleee said, [^.]*, which matches "any number of characters that aren't periods," would allow whitespace characters (which of course aren't periods), so using \w+, "any number of 'word characters'" like the other answers is better.
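The difference between the two character classes shows up as soon as the "word" contains a space. A short sketch, with made-up sample strings and with + instead of * so that an empty word is not accepted:

```python
import re

not_dot = re.compile(r'^[^.]+\.key$')  # "anything but a period"
word = re.compile(r'^\w+\.key$')       # "word characters only"

samples = ['door.key', 'brown.door.key', 'front door.key']
not_dot_hits = [s for s in samples if not_dot.match(s)]
word_hits = [s for s in samples if word.match(s)]

print(not_dot_hits)  # ['door.key', 'front door.key']
print(word_hits)     # ['door.key']
```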

Proper replacement of "beginning" non-alphanumeric characters, in python, using regular expressions

NOTE: This post is not the same as the post "Re.sub not working for me".
That post is about matching and replacing ANY non-alphanumeric substring in a string.
This question is specifically about matching and replacing non-alphanumeric substrings that explicitly show up at the beginning of a string.
The following method attempts to match any non-alphanumeric character string "AT THE BEGINNING" of a string and replace it with a new string "BEGINNING_"
def m_getWebSafeString(self, dirtyAttributeName):
    cleanAttributeName = ''.join(dirtyAttributeName)
    # Deal with beginning of string...
    cleanAttributeName = re.sub('^[^a-zA-z]*', "BEGINNING_", cleanAttributeName)
    # Deal with end of string...
    if "BEGINNING_" in cleanAttributeName:
        print(' ** ** ** D: "{}" ** ** ** C: "{}"'.format(dirtyAttributeName, cleanAttributeName))
PROBLEM DESCRIPTION: The method seems to not only replace non-alphanumeric characters but also incorrectly insert the "BEGINNING_" string at the beginning of every string that is passed into it. In other words...
GOOD RESULT: If the method is passed the string *##$ThisIsMyString1, it correctly returns BEGINNING_ThisIsMyString1
BAD/UNWANTED RESULT: However, if the method is passed the string ThisIsMyString2, it incorrectly (and always) inserts the replacement string (BEGINNING_), even when there are no non-alphanumeric characters, and yields the result BEGINNING_ThisIsMyString2
MY QUESTION: What is the correct way to write the re.sub() line so that it only replaces non-alphanumeric characters at the beginning of the string and does not always insert the replacement string at the beginning of the original input string?

You're matching 0 or more non-alphabetic characters by using the * quantifier, so the pattern always matches: when the string starts with a letter, it matches the empty string at position 0 and the replacement is inserted there. You can replace what you have with
re.sub('^[^a-zA-Z]+', ...)
to ensure that only 1 or more instances are matched.
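A side-by-side run makes the difference between the two quantifiers visible. This is a minimal sketch using the strings from the question:

```python
import re

clean = 'ThisIsMyString2'
dirty = '*##$ThisIsMyString1'

# '*' also matches the empty string at position 0, so a prefix is always inserted.
star_clean = re.sub(r'^[^a-zA-Z]*', 'BEGINNING_', clean)

# '+' needs at least one non-letter, so a clean string is left alone.
plus_clean = re.sub(r'^[^a-zA-Z]+', 'BEGINNING_', clean)
plus_dirty = re.sub(r'^[^a-zA-Z]+', 'BEGINNING_', dirty)

print(star_clean)  # BEGINNING_ThisIsMyString2
print(plus_clean)  # ThisIsMyString2
print(plus_dirty)  # BEGINNING_ThisIsMyString1
```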

replace
re.sub('^[^a-zA-z]*',"BEGINNING_",cleanAttributeName)
with
re.sub('^[^a-zA-Z]+',"BEGINNING_",cleanAttributeName)

There is a more elegant solution. You can use this
re.sub(r'^\W+', 'BEGINNING_', cleanAttributeName)
\W matches any non-word character; for ASCII text this is equivalent to the class [^a-zA-Z0-9_]. Note that the underscore counts as a word character, so a leading _ is not replaced.
>>> re.sub(r'^\W+', 'BEGINNING_', '##$ThisIsMyString1')
'BEGINNING_ThisIsMyString1'
>>> re.sub(r'^\W+', 'BEGINNING_', 'ThisIsMyString2')
'ThisIsMyString2'
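The three character classes that appear in this thread do not treat every character the same way. The sketch below uses an invented input, '_42name', to show where they differ: the range A-z in the question's pattern also covers the characters between 'Z' and 'a' in ASCII, which include the underscore.

```python
import re

s = '_42name'

# Letters only: '_', '4' and '2' are all non-letters, so they are replaced.
letters_only = re.sub(r'^[^a-zA-Z]+', 'BEGINNING_', s)

# The A-z range includes '_', so the negated set does not match it and
# nothing at the start of the string is replaced.
a_to_z_range = re.sub(r'^[^a-zA-z]+', 'BEGINNING_', s)

# \W excludes word characters, and '_' and digits are word characters.
non_word = re.sub(r'^\W+', 'BEGINNING_', s)

print(letters_only)  # BEGINNING_name
print(a_to_z_range)  # _42name
print(non_word)      # _42name
```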

Splitting a string into groups of either a number and a letter or just a single letter

I'm trying to decode a string that looks like this "2a3bc" into "aabbbc" in Python. So the first thing I need to do is to split it up into a list with groups that make sense. In other words: ['2a','3b','c'].
Essentially, match either (1) a number and a letter or (2) just a letter.
I've got this:
re.findall('\d+\S|\s', '2a3bc')
and it returns:
['2a', '3b']
So it's actually missing the c.
Perhaps my regex skills are lacking here; any help is appreciated.

Your current expression could work with a small bugfix: \S is non-whitespace, while \s is whitespace. You're looking for non-whitespace in both cases, so you shouldn't use \s anywhere:
>>> re.findall(r'\d+\S|\S', '2a3bc')
['2a', '3b', 'c']
However, this expression could be shorter: instead of using + for one or more digits, use * for zero or more, since the group might not be preceded by any digits, and you can then get rid of the alternation.
>>> re.findall(r'\d*\S', '2a3bc')
['2a', '3b', 'c']
Again, though, note that \S is simply non-whitespace: that includes letters, digits, and even punctuation. \D, non-digits, has a similar problem: it excludes digits, but includes punctuation. A short, clear regex for this, then, would replace the \S with \w, which matches alphanumeric characters (and the underscore):
>>> re.findall(r'\d*\w', '2a3bc')
['2a', '3b', 'c']
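Since \d* greedily consumes the digits first, this \w ends up matching only letters for input like this one, where every run of digits is followed by a letter. If the string could end in digits, the regex would backtrack and let \w take the last digit; use [a-zA-Z] in place of \w to rule that out.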
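To get from the split to the decoded string the question asks for, capture groups can be used so that findall returns (count, letter) pairs. This is a sketch under the assumption that only ASCII letters are encoded; the function name decode is made up for illustration.

```python
import re

def decode(encoded):
    # Each token is an optional run of digits followed by exactly one letter.
    # A missing count is captured as '' and treated as 1.
    pairs = re.findall(r'(\d*)([a-zA-Z])', encoded)
    return ''.join(letter * int(count or 1) for count, letter in pairs)

print(decode('2a3bc'))  # aabbbc
```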

Replace pairs of characters at start of string with a single character

I only want this done at the start of the string. Some examples (I want to replace "--" with "-"):
"--foo" -> "-foo"
"-----foo" -> "---foo"
"foo--bar" -> "foo--bar"
I can't simply use s.replace("--", "-") because of the third case. I also tried a regex, but I can't get it to work specifically with replacing pairs. I get as far as trying to replace r"^(?:(-){2})+" with r"\1", but that tries to replace the full block of dashes at the start, and I can't figure how to get it to replace only pairs within that block.

Final regex was:
re.sub(r'^(-+)\1', r'\1', "------foo--bar")
^ - match start
(-+) - match at least one -, but...
\1 - an equal number must exist outside the capture group.
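Finally, replace the match with the contents of the group, which cuts the matched run of hyphens in half. With an odd number of leading hyphens, the last one falls outside the match and stays as it is.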
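Wrapped in a small helper (the name halve_leading_pairs is hypothetical), the regex can be checked against the examples from the question:

```python
import re

def halve_leading_pairs(s):
    # (-+) captures k hyphens and \1 requires k more right after them,
    # so the match covers the longest even-length run at the start.
    # Replacing 2k hyphens with the captured k turns every pair into one.
    return re.sub(r'^(-+)\1', r'\1', s)

for sample in ['--foo', '-----foo', 'foo--bar', '------foo--bar']:
    print(sample, '->', halve_leading_pairs(sample))
```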

import re
print(re.sub(r'\--', '', "--foo"))
print(re.sub(r'\--', '', "-----foo"))
Output:
foo
-foo
EDIT this answer is for the OP before it was completely edited and changed.

Here's it all written out for anyone else who comes this way.
>>> foo = '--foo'
>>> bar = '-----foo'
>>> foobar = 'foo--bar'
>>> foobaz = '-----foo--bar'
>>> re.sub('^(-+)\\1', '\\1', foo)
'-foo'
>>> re.sub('^(-+)\\1', '\\1', bar)
'---foo'
>>> re.sub('^(-+)\\1', '\\1', foobar)
'foo--bar'
>>> re.sub('^(-+)\\1', '\\1', foobaz)
'---foo--bar'
The pattern for re.sub() is:
re.sub(pattern, replacement, string)
In this case we want to replace each -- with -. However, the issue comes when there is a -- that we don't want to replace, such as the one in the middle of 'foo--bar'.
Here we only want to match -- at the beginning of a string. In regular expressions for Python, the ^ character, when used in the pattern string, only matches at the beginning of the string, which is just what we were looking for.
Note that the ^ character behaves differently when used within square brackets.
Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
Getting back to what we were talking about: the parentheses in the pattern form a "group", which can then be referenced with \\1, meaning the first group. If there were a second set of parentheses, we could reference that sub-pattern with \\2. The extra \ escapes the backslash inside an ordinary Python string literal. Using \\1 as the replacement puts back one copy of the group, so a leading run of 2n hyphens becomes n. The call can also be written as re.sub(r'^(-+)\1', r'\1', foo); the r prefix makes Python treat the literal as a raw string, so the backslashes do not need to be doubled.
Now that the pattern is all set up, you just make the replacement whatever you want to replace the pattern with, and put in the string that you are searching through.
A link that I keep handy when dealing with regular expressions, is Google's developer's notes on them.
