Proper replacement of "beginning" non-alphanumeric characters, in python, using regular expressions

Proper replacement of "beginning" non-alphanumeric characters, in python, using regular expressions - python

NOTE: This post is not the same as the post "Re.sub not working for me".
That post is about matching and replacing ANY non-alphanumeric substring in a string.
This question is specifically about matching and replacing non-alphanumeric substrings that explicitly show up at the beginning of a string.
The following method attempts to match any non-alphanumeric character string "AT THE BEGINNING" of a string and replace it with a new string "BEGINNING_"
def m_getWebSafeString(self, dirtyAttributeName):
cleanAttributeName = ''.join(dirtyAttributeName)
# Deal with beginning of string...
cleanAttributeName = re.sub('^[^a-zA-z]*',"BEGINNING_",cleanAttributeName)
# Deal with end of string...
if "BEGINNING_" in cleanAttributeName:
print ' ** ** ** D: "{}" ** ** ** C: "{}"'.format(dirtyAttributeName, cleanAttributeName)
PROBLEM DESCRIPTION: The method seems to not only replace non-alphnumeric characters but it also incorrectly inserts the "BEGINNING_" string at the beginning of all strings that are passed into it. In other words...
GOOD RESULT: If the method is passed the string *##$ThisIsMyString1, it correctly returns BEGINNING_ThisIsMyString1
BAD/UNWANTED RESULT: However, if the method is passed the string ThisIsMyString2 it incorrectly (and always) inserts the replacement string (BEGINNING_), even there are no non-alphanumeric characters, and yields the result BEGINNING_ThisIsMyString2
MY QUESTION: What is the correct way to write the re.sub() line so it only replaces those non-alphnumeric characters at the beginning of the string such that it does not always insert the replacement string at the beginning of the original input string?

You're matching 0 or more instances of non-alphabetic characters by using the * quantifier, which means it'll always be picked up by your pattern. You can replace what you have with
re.sub('^[^a-zA-Z]+', ...)
to ensure that only 1 or more instances are matched.

replace
re.sub('^[^a-zA-z]*',"BEGINNING_",cleanAttributeName)
with
re.sub('^[^a-zA-z]+',"BEGINNING_",cleanAttributeName)

There is a more elegant solution. You can use this
re.sub('^\W+', 'BEGINNING_', cleanAttributeName)
\W Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
>>> re.sub('^\W+', 'BEGINNING_', '##$ThisIsMyString1')
'BEGINNING_ThisIsMyString1'
>>> re.sub('^\W+', 'BEGINNING_', 'ThisIsMyString2')
'ThisIsMyString2'

Related

Regex to check if it is exactly one single word

I am basically trying to match string pattern(wildcard match)
Please carefully look at this -
*(star) - means exactly one word .
This is not a regex pattern...it is a convention.
So,if there patterns like -
*.key - '.key.' is preceded by exactly one word(word containing no dots)
*.key.* - '.key.' is preceded and succeeded by exactly one word having no dots
key.* - '.key' preceeds exactly one word .
So,
"door.key" matches "*.key"
"brown.door.key" doesn't match "*.key".
"brown.key.door" matches "*.key.*"
but "brown.iron.key.door" doesn't match "*.key.*"
So, when I encounter a '*' in pattern, I have replace it with a regex so that it means it is exactly one word.(a-zA-z0-9_).Can anyone please help me do this in python?

To convert your pattern to a regexp, you first need to make sure each character is interpreted literally and not as a special character. We can do that by inserting a \ in front of any re special character. Those characters can be obtained through sre_parse.SPECIAL_CHARS.
Since you have a special meaning for *, we do not want to escape that one but instead replace it by \w+.
Code
import sre_parse
def convert_to_regexp(pattern):
special_characters = set(sre_parse.SPECIAL_CHARS)
special_characters.remove('*')
safe_pattern = ''.join(['\\' + c if c in special_characters else c for c in pattern ])
return safe_pattern.replace('*', '\\w+')
Example
import re
pattern = '*.key'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.key'
re.match(r_pattern, 'door.key') # Match
re.match(r_pattern, 'brown.door.key') # None
And here is an example with escaped special characters
pattern = '*.(key)'
r_pattern = convert_to_regexp(pattern) # '\\w+\\.\\(key\\)'
re.match(r_pattern, 'door.(key)') # Match
re.match(r_pattern, 'brown.door.(key)') # None
Sidenote
If you intend looking for the output pattern with re.search or re.findall, you might want to wrap the re pattern between \b boundary characters.

The conversion rules you are looking for go like this:
* is a word, thus: \w+
. is a literal dot: \.
key is and stays a literal string
plus, your samples indicate you are going to match whole strings, which in turn means your pattern should match from the ^ beginning to the $ end of the string.
Therefore, *.key becomes ^\w+\.key$, *.key.* becomes ^\w+\.key\.\w+$, and so forth..
Online Demo: play with it!

^ means a string that starts with the given set of characters in a regular expression.
$ means a string that ends with the given set of characters in a regular expression.
\s means a whitespace character.
\S means a non-whitespace character.
+ means 1 or more characters matching given condition.
Now, you want to match just a single word meaning a string of characters that start and end with non-spaced string. So, the required regular expression is:
^\S+$

You could do it with a combination of "any characters that aren't period" and the start/end anchors.
*.key would be ^[^.]*\.key, and *.key.* would be ^[^.]*\.key\.[^.]*$
EDIT: As tripleee said, [^.]*, which matches "any number of characters that aren't periods," would allow whitespace characters (which of course aren't periods), so using \w+, "any number of 'word characters'" like the other answers is better.

How to find superfluous escapes in regex patterns

How to find and remove all the unneeded backslash escapes in Python regular expressions.
For example in r'\{\"*' all the escapes are unnecessary and has the same meaning as r'{"*'. But in r'\[a-b]\{2}\Z\'\+' removing any of the escapes would change how the regex is interpreted by the regex engine (or cause a syntax error).
Given the pattern, is there an easy, i.e. other than perhaps parsing the whole regex string looking for escapes on non-special characters, way to remove escape patterns programmatically in Python?

Here is the code that I came up with:
from contextlib import redirect_stdout
from io import StringIO
from re import compile, DEBUG, error, MULTILINE, VERBOSE
def unescape(pattern: str, flags: int):
"""Remove any escape that does not change the regex meaning"""
strio = StringIO()
with redirect_stdout(strio):
compile(pattern, DEBUG | flags)
original_debug = strio.getvalue()
index = len(pattern)
while index >= 0:
index -= 1
character = pattern[index]
if character != '\\':
continue
removed_escape = pattern[:index] + pattern[index+1:]
strio = StringIO()
with redirect_stdout(strio):
try:
compile(removed_escape, DEBUG | flags)
except error:
continue
if original_debug == strio.getvalue():
pattern = removed_escape
return pattern
def print_unescaped_raw(regex: str, flags:int=0):
"""Print an unescaped raw-string representation for s."""
print(
("r'%s'" % unescape(regex, flags)
.replace("'", r'\'')
.replace('\n', r'\n'))
)
print_unescaped_raw(r'\{\"*') # r'{"*'
One can also use sre_parse.parse directly, but the SubPatterns and tuples in the result may contain nested SubPatterns. And SubPattern instances don't have __eq__ method defined for them, so a recursive comparison subroutine might be required.
P.S.
Unfortunately, this method does not work with the regex module because in regex you get different debug output for escaped characters:
regex.compile(r'{', regex.DEBUG)
LITERAL MATCH '{'
regex.compile(r'\{', regex.DEBUG)
CHARACTER MATCH '{'
Unlike re that gives:
re.compile(r'{', re.DEBUG)
LITERAL 123
re.compile(r'\{', re.DEBUG)
LITERAL 123

I will not do the whole implementation but I can give you some hints to make a viable heuristic/algo:
Initial Hypothesis: You have for each regex that you are going to modify a list of input strings/expected output strings to validate its behavior
Use this website to have the list of characters that should stay escaped with the backslash \ http://www.rexegg.com/regex-quickstart.html and Create a list of elements that should not be replaced
Parse your regex and replace all the \X where X is a character that is not present in the list generated at the previous step by X
Test your initial regex on its input strings and test your new regex on the same input strings and compare their respective outputs for all the result
If all of your results are the same, then you can use your new/simplified regex.
If at least one of the output is different then you have to throw away your new regex and proceed with local replacements: select randomly (round robin could be used) one of the \X in your initial regex that is not in the list that you have construct at step 1. and replace it by X check the output in comparison to the initial regex output for each input string if it matches you can use that regex and repeat step 5. until it is not possible to progress anymore. however, If the output is different for that replacement remove it from the list of elements you might be able to replace and repeat the step 5 with your previous regex. Do the process until your list of possible local replacement is empty, you can use the new regex instead of the old one.

Replace pairs of characters at start of string with a single character

I only want this done at the start of the sting. Some examples (I want to replace "--" with "-"):
"--foo" -> "-foo"
"-----foo" -> "---foo"
"foo--bar" -> "foo--bar"
I can't simply use s.replace("--", "-") because of the third case. I also tried a regex, but I can't get it to work specifically with replacing pairs. I get as far as trying to replace r"^(?:(-){2})+" with r"\1", but that tries to replace the full block of dashes at the start, and I can't figure how to get it to replace only pairs within that block.

Final regex was:
re.sub(r'^(-+)\1', r'\1', "------foo--bar")
^ - match start
(-+) - match at least one -, but...
\1 - an equal number must exist outside the capture group.
and finally, replace with that number of hyphens, effectively cutting the number of hyphens in half.

import re
print re.sub(r'\--', '',"--foo")
print re.sub(r'\--', '',"-----foo")
Output:
foo
-foo
EDIT this answer is for the OP before it was completely edited and changed.

Here's it all written out for anyone else who comes this way.
>>> foo = '---foo'
>>> bar = '-----foo'
>>> foobar = 'foo--bar'
>>> foobaz = '-----foo--bar'
>>> re.sub('^(-+)\\1', '-', foo)
'-foo'
>>> re.sub('^(-+)\\1', '-', bar)
'---foo'
>>> re.sub('^(-+)\\1', '-', foobar)
'foo--bar'
>>> re.sub('^(-+)\\1', '-', foobaz)
'--foo--bar'
The pattern for re.sub() is:
re.sub(pattern, replacement, string)
therefore in this case we want to replace -- with -. HOWEVER, the issue comes when we have -- that we don't want to replace, given by some circumstances.
In this case we only want to match -- at the beginning of a string. In regular expressions for python, the ^ character, when used in the pattern string, will only match the given pattern at the beginning of the string - just what we were looking for!
Note that the ^ character behaves differently when used within square brackets.
Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'... An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
Getting back to what we were talking about. The parenthesis in the pattern represent a "group," this group can then be referenced with the \\1, meaning the first group. If there was a second set of parenthesis, we could then reference that sub-pattern with \\2. The extra \ is to escape the next slash. This pattern can also be written with re.sub(r'^(-+)\1', '-', foo) forcing python to interpret the string as a raw string, as denoted with the r preceding the pattern, thereby eliminating the need to escape special characters.
Now that the pattern is all set up, you just make the replacement whatever you want to replace the pattern with, and put in the string that you are searching through.
A link that I keep handy when dealing with regular expressions, is Google's developer's notes on them.

How do I match a string without a specific character in Python?

Which regular expression pattern will match a substring not containing a specific character in Python? For example, I have the string "abc,5 * de", and I want to match "abc" and "5 * de" as two substrings, but not the ,.

Use a negated character class that contains all characters you don't want to match.
Something like
[^,]+
See it here on Regexr
The [] denotes the character class and the ^ as first character makes it a negated class.

s = "abc,5 * de"
result = s.split(',')
result[0] # "abc"
result[1] # "5* de"
Regex expressions are not always the only way to solve string problems.

I don't know Python, but with all the regexp engines I know, that would be /[^,]*/. Or if Python has a built-in function to split a string on a regexp, then you could just split on /,/.

Check String for / against Characters in Python

I need to be able to tell the difference between a string that can contain letters and numbers, and a string that can contain numbers, colons and hyphens.
>>> def checkString(s):
... pattern = r'[-:0-9]'
... if re.search(pattern,s):
... print "Matches pattern."
... else:
... print "Does not match pattern."
# 3 Numbers seperated by colons. 12, 24 and minus 14
>>> s1 = "12:24:-14"
# String containing letters and string containing letters/numbers.
>>> s2 = "hello"
>>> s3 = "hello2"
When I run the checkString method on each of the above strings:
>>>checkString(s1)
Matches Pattern.
>>>checkString(s2)
Does not match Pattern.
>>>checkString(s3)
Matches Pattern
s3 is the only one that doesn't do what I want. I'd like to be able to create a regex that allows numbers, colons and hyphens, but excludes EVERYTHING else (or just alphabetical characters). Can anyone point me in the right direction?
EDIT:
Therefore, I need a regex that would accept:
229 // number
187:657 //two numbers
187:678:-765 // two pos and 1 neg numbers
and decline:
Car //characters
Car2 //characters and numbers

you need to match the whole string, not a single character as you do at the moment:
>>> re.search('^[-:0-9]+$', "12:24:-14")
<_sre.SRE_Match object at 0x01013758>
>>> re.search('^[-:0-9]+$', "hello")
>>> re.search('^[-:0-9]+$', "hello2")
To explain regex:
within square brackets (character class): match digits 0 to 9, hyphen and colon, only once.
+ is a quantifier, that indicates that preceding expression should be matched as many times as possible but at least once.
^ and $ match start and end of the string. For one-line strings they're equivalent to \A and \Z.
This way you restrict content of the whole string to be at least one-charter long and contain any permutation of characters from the character class. What you were doing before hand was to search for a single character from the character class within subject string. This is why s3 that contains a digit matched.

SilentGhost's answer is pretty good, but take note that it would also match strings like "---::::" with no digits at all.
I think you're looking for something like this:
'^(-?\d+:)*-?\d+$'
^ Matches the beginning of the line.
(-?\d+:)* Possible - sign, at least one digit, a colon. That whole pattern 0 or many times.
-?\d+ Then the pattern again, at least once, without the colon
$ The end of the line
This will better match the strings you describe.

pattern = r'\A([^-:0-9]+|[A-Za-z0-9])\Z'

Your regular expression is almost fine; you just need to make it match the whole string. Also, as a commenter pointed out, you don't really need a raw string (the r prefix on the string) in this case. Voila:
def checkString(s):
if re.match('[-:0-9]+$', s):
print "Matches pattern."
else:
print "Does not match pattern."
The '+' means "match one or more of the previous expression". (This will make checkString return False on an empty string. If you want True on an empty string, change the '+' to a '*'.) The '$' means "match the end of the string".
re.match means "the string must match the regular expression starting at the first character"; re.search means "the regular expression can match a sequence anywhere inside the string".
Also, if you like premature optimization--and who doesn't!--note that 're.match' needs to compile the regular expression each time. This version compiles the regular expression only once:
__checkString_re = re.compile('[-:0-9]+$')
def checkString(s):
global __checkString_re
if __checkString_re.match(s):
print "Matches pattern."
else:
print "Does not match pattern."

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Proper replacement of "beginning" non-alphanumeric characters, in python, using regular expressions - python

You're matching 0 or more instances of non-alphabetic characters by using the * quantifier, which means it'll always be picked up by your pattern. You can replace what you have with re.sub('^[^a-zA-Z]+', ...) to ensure that only 1 or more instances are matched.

replace re.sub('^[^a-zA-z]*',"BEGINNING_",cleanAttributeName) with re.sub('^[^a-zA-z]+',"BEGINNING_",cleanAttributeName)

Related

Regex to check if it is exactly one single word

How to find superfluous escapes in regex patterns

Replace pairs of characters at start of string with a single character

How do I match a string without a specific character in Python?

Check String for / against Characters in Python

Categories

Resources