How to fix bad escape regex error (python re)

I've been experimenting with re.sub() to change a date from Y-m-d format to m/d/Y. To test this, I defined the starting variable: current_date = "2012-05-26"
I want to convert that date to 05/26/2012.
I tried to do this with a regex rather than datetime, using re.sub as below:
formatted_date = re.sub(r"\d{2,4}-\d{1,2}-\d{1,2}", r"[^a-zA-Z]\d{1,2}/\d{1,2}/\d{2,4}", current_date)
The first regex matches the original Y-M-D format, and the second regex is my attempt to convert it to the format I want. I got the following error:
Traceback (most recent call last):
File "C:\Users\ghub4\AppData\Local\Programs\Python\Python39\lib\sre_parse.py", line 1039, in parse_template
this = chr(ESCAPES[this][1])
KeyError: '\\d'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\Users\ghub4\OneDrive\Desktop\test_sub.py", line 5, in <module>
formatted_date = re.sub(r"\d{2,4}-\d{1,2}-\d{1,2}", r"[^a-zA-Z]\d{1,2}/\d{1,2}/\d{2,4}", current_date)
File "C:\Users\ghub4\AppData\Local\Programs\Python\Python39\lib\re.py", line 210, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "C:\Users\ghub4\AppData\Local\Programs\Python\Python39\lib\re.py", line 327, in _subx
template = _compile_repl(template, pattern)
File "C:\Users\ghub4\AppData\Local\Programs\Python\Python39\lib\re.py", line 318, in _compile_repl
return sre_parse.parse_template(repl, pattern)
File "C:\Users\ghub4\AppData\Local\Programs\Python\Python39\lib\sre_parse.py", line 1042, in parse_template
raise s.error('bad escape %s' % this, len(this))
re.error: bad escape \d at position 9
Full Code:
import re
current_date = "2012-05-26"
formatted_date = re.sub(r"\d{2,4}-\d{1,2}-\d{1,2}", r"[^a-zA-Z]\d{1,2}/\d{1,2}/\d{2,4}", current_date)
print(formatted_date)
I've traced the error to what is probably the second regex, but I'm unsure where position 9 is and how to fix the error. I'm also unsure because of the first error, which reports a KeyError raised by '\\d'. I assume that when the regex is interpreted somewhere in the code, \d is being taken as \\d, and I'm not sure how to prevent that either. I also suspect the second regex may backfire on me, and I'll be working on that after posting this question. How can I correct these errors?

The replacement string for a regex is not a regex in itself, rather it is a string which may contain references to groups captured by the original regex. In your case, you want to capture the year, month and day and then output them in the result string. You do that with () around the values you want to capture, and then refer to the groups by \1, \2, and \3 in the replacement string, with the numbers being assigned in order of the groups being captured. So for your code, you want:
formatted_date = re.sub(r"(\d{2,4})-(\d{1,2})-(\d{1,2})", r"\2/\3/\1", current_date)
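As a minimal sketch of the same idea, the snippet below runs the numbered-group version and, as an alternative not mentioned in the answer, a named-group version that uses \g<name> references. The group names year, month and day are chosen here purely for illustration.

```python
import re

current_date = "2012-05-26"

# Numbered groups: \1 is the year, \2 the month, \3 the day.
numbered = re.sub(r"(\d{2,4})-(\d{1,2})-(\d{1,2})", r"\2/\3/\1", current_date)

# Named groups (illustrative names) referenced with \g<name> in the replacement.
named = re.sub(
    r"(?P<year>\d{2,4})-(?P<month>\d{1,2})-(?P<day>\d{1,2})",
    r"\g<month>/\g<day>/\g<year>",
    current_date,
)

print(numbered)  # 05/26/2012
print(named)     # 05/26/2012
```

Named references tend to be easier to read when the replacement reorders several groups.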

Try grouping your digits. (If your goal is testing: position 9 is the first \d in your second regex. \d is not a valid escape in a replacement string, which accepts group references such as \1 but not character classes.)
formatted_date = re.sub(r"(\d{2,4})-(\d{1,2})-(\d{1,2})",r"\2/\3/\1",current_date)
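To see where position 9 falls, one option (a sketch, assuming Python 3.7 or later, where unknown letter escapes in a replacement raise re.error) is to catch the exception and slice the replacement string at the reported offset. The variable names template and error_pos are illustrative.

```python
import re

template = r"[^a-zA-Z]\d{1,2}/\d{1,2}/\d{2,4}"
error_pos = None

try:
    re.sub(r"\d{2,4}-\d{1,2}-\d{1,2}", template, "2012-05-26")
except re.error as exc:
    # exc.pos is the index into the replacement string where parsing failed.
    error_pos = exc.pos
    print(exc.msg)
    print("text before the error:", template[:error_pos])
    print("offending escape:", template[error_pos:error_pos + 2])
```

The nine characters of [^a-zA-Z] occupy indexes 0 through 8, so the first \d starts at index 9.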

Related

Regex 'sre_constants.error: bad character range' in large regex pattern

The following is the error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/re.py", line 194, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.7/re.py", line 251, in _compile
raise error, v # invalid expression
sre_constants.error: bad character range
This is my object:
>>> re101121=re.compile("""(?i)激[ _]{0,}活[ _]{0,}邮[ _]{0,}箱|(click|clicking)[ _]{1,}[here ]{0,1}to[ _]{1,}verify|stop[ _]{1,}mail[ _]{1,}.{1,16}[ _]{1,}here|(click|clicking|view|update)([ _-]{1,}|\\xc2\\xa0)(on|here|Validate)[^a-z0-9]{1}|(點|点)[ _]{0,}(擊|击)[ _]{0,}(這|这|以)[ _]{0,}(裡|里|下)|DHL[ _]{1,}international|DHL[ _]{1,}Customer[ _]{1,}Service|Online[ _]{1,}Banking|更[ _]{0,}新[ _]{0,}您[ _]{0,}的[ _]{0,}(帐|账)[ _]{0,}户|CONFIRM[ _]{1,}ACCOUNT[ _]{1,}NOW|avoid[ _]{1,}Account[ _]{1,}malfunction|confirm[ _]{1,}this[ _]{1,}request|verify your account IP|Continue to Account security|继[\\s-_]*续[\\s-_]*使[\\s-_]*用|崩[\\s-_]*溃[\\s-_]*信[\\s-_]*息|shipment[\\s]+confirmation|will be shutdown in [0-9]{0,} (hours|days)|DHL Account|保[ ]{0,}留[ ]{0,}密[ ]{0,}码|(Password|password|PASSWORD).*(expired|expiring)|login.*email.*password.*confirm|[0-9]{0,} messages were quarantined|由于.*错误(的)?(送货)?信息|confirm.*(same)? password|keep.*account secure|settings below|loss.*(email|messages)|simply login|quick verification now""")

After minimization, your error boils down to re.compile("""[\\s-_]"""). This is a bad character range indeed; you probably meant the dash to be literal: re.compile(r"[\s\-_]") (always use raw strings for regex, r"..."). Moving the dash to the end of the bracket group works too: r"[\s_-]".
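A quick way to check the three variants side by side is sketched below. The helper name compiles is hypothetical and simply reports whether re.compile accepts a pattern.

```python
import re

def compiles(pattern):
    """Return True if the pattern compiles, False if re.error is raised."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r"[\s-_]"))   # False: \s cannot be the start of a range
print(compiles(r"[\s\-_]"))  # True: the dash is escaped
print(compiles(r"[\s_-]"))   # True: a trailing dash is literal
```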
In the future, try to binary search to find the minimal failing input: remove the right half of the regex. If it still fails, the problem must have been in the left half. Remove the right half of the remaining substring and repeat until you're down to a minimal failing case. This technique doesn't always work when the problem spans both halves, but it can't hurt to try.
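The bisection idea can be automated roughly as follows. This is a sketch of a prefix-based variant rather than the exact halving described above: it assumes Python 3 and assumes the parser reports the first problem it meets when reading left to right, so that every prefix long enough to contain the culprit fails with the same message. The function names and the sample pattern are made up for illustration.

```python
import re

def first_error(pattern):
    """Return the message of the re.error raised on compile, or None."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return exc.msg
    return None

def shortest_failing_prefix(pattern):
    """Binary-search for the shortest prefix that fails like the full pattern."""
    target = first_error(pattern)
    if target is None:
        return None  # the pattern compiles, nothing to minimize
    lo, hi = 0, len(pattern)  # pattern[:lo] does not fail like this, pattern[:hi] does
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if first_error(pattern[:mid]) == target:
            hi = mid
        else:
            lo = mid
    return pattern[:hi]

big = r"stop[ _]+mail|cont[\s-_]*inue|shipment[\s]+confirmation"
print(shortest_failing_prefix(big))
```

The returned prefix ends right at the construct the parser rejects, which narrows the search to its last few characters.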
As mentioned in the comments, it's pretty odd to have such a massive regex as this, but I'll assume you know what you're doing.
As another aside, there are some antipatterns in this regex (pardon the pun) like {0,} which can be simplified to *.

Why is this python exception not ValueError?

The code:
import datetime
TF = "%d-%M-%Y %H:%M"
last= datetime.datetime.strptime( "11/07/10 10:00", TF)
Throws the following exception:
Traceback (most recent call last):
File "strange.py", line 4, in <module>
last= datetime.datetime.strptime( "11/07/10 10:00", TF)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/_strptime.py", line 308, in _strptime
format_regex = _TimeRE_cache.compile(format)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/_strptime.py", line 265, in compile
return re_compile(self.pattern(format), IGNORECASE)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 194, in compile
return _compile(pattern, flags)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 251, in _compile
raise error, v # invalid expression
sre_constants.error: redefinition of group name 'M' as group 5; was group 2
Now I believe my error is that I use %M twice when defining the date format. Here's my query:
I would expect the code to either:
a) accept the fact that you might have the same time value twice in a string (it might be redundant, but so is "monday" if you have the rest of the date)
b) raise a ValueError saying that the same field shouldn't be used more than once.
This looks like something very different. What's going on?

ValueError is used when "a built-in operation or function receives an argument that has the right type but an inappropriate value" (docs). In this case, that would mean passing TF as a malformed string containing an invalid directive (try with %K, for example).
Here you used valid directives, but as your error mentions, you failed in the SRE parsing part: each %x directive is turned into a named regex group, and you defined the same group twice. The regex parser fails because group M cannot match two different parts of the string, and it can't "guess" which one you mean.

Nothing directly detected the specific error you made.
The datetime module turned your strptime format into a regular expression to do the actual parsing, without analyzing it (or having any need to analyze it) in sufficient detail to notice the duplicated field. This resulted in an invalid regular expression, and the re module rightfully threw an error - one which I'd consider to be closer to a SyntaxError than a ValueError. The datetime module passed this on without trying to figure out the source of the problem.

You have 2 errors:
The format does not match, because you are using "-" in TF but passed "/" in the date string. The string must use the same separators as the format.
You passed the minutes symbol twice ('M' = minutes, 'm' = months).
This is the correct solution for you:
import datetime
TF = "%d-%m-%y %H:%M"
last= datetime.datetime.strptime( "11-07-10 10:00", TF)
good luck!
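Since a malformed format can surface either as ValueError or, as in the question, as an error from the re module, a defensive wrapper might catch both. The sketch below assumes Python 3, and the helper name parse_or_none is hypothetical.

```python
import re
from datetime import datetime

def parse_or_none(text, fmt):
    """Parse text with fmt; return None if the format or the text is rejected."""
    try:
        return datetime.strptime(text, fmt)
    except (ValueError, re.error):
        # ValueError: bad directive or text that does not match the format.
        # re.error: the format produced an invalid internal regex.
        return None

good = parse_or_none("11-07-10 10:00", "%d-%m-%y %H:%M")
bad = parse_or_none("11/07/10 10:00", "%d-%M-%Y %H:%M")

print(good)  # 2010-07-11 10:00:00
print(bad)   # None
```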

What is the RegEx pattern for 24-06-2015 10:15:45: Aditya Krishnakant:?

What is the RegEx pattern for 24-06-2015 10:15:45: Aditya Krishnakant:
If you look at a WhatsApp chat transcript, it looks like a mess. The purpose of this code is to print each message sent by a person on a new line (for better readability). This is my code:
import re
f = open("wa_chat.txt", "r")
match = re.findall(r'(\d{2})\:(\d{2})\:(\d{4})\s(\d{2})\:(\d{2})\:(\d{2})\:\s(\w)\s(\w)\:', f)
for content in match:
    print(f.readlines(), '\n')
f.close()
I am getting the following error message:
Traceback (most recent call last):
File "whatsapp.py", line 4, in <module>
match = re.findall(r'(\d{2})\:(\d{2})\:(\d{4})\s(\d{2})\:(\d{2})\:(\d{2})\:\s(\w)\s(\w)\:', f)
File "/usr/lib/python2.7/re.py", line 177, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
Where am I going wrong?

For some reason you're putting \: where - should be. Also, instead of \s you can be more specific and just use a space. You can be more specific with those kinds of things because you know exactly what the format is. Your other big problem is that you're only using \w, which only matches one alphanumeric character, when you should use \w+, matching the whole word. Lastly, your actual error is coming from the fact that you're passing in a file object instead of the string containing its contents, i.e. f.read(). Here's some code that should work:
import re
with open("wa_chat.txt", "r") as f:
    text = f.read()  # findall needs a string, not the file object
match = re.findall(r'(\d{2})-(\d{2})-(\d{4}) (\d{2}):(\d{2}):(\d{2}): (\w+) (\w+):', text)
print(match)  # or do whatever you want with it
Note that match will be a list of tuples since you wanted to use grouping.
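To illustrate the shape of the result without needing the chat file, the sketch below applies the same pattern to an inline sample. The message text in the sample is invented, and only the header format comes from the question.

```python
import re

# Invented two-line sample in the "dd-mm-yyyy hh:mm:ss: First Last:" format.
sample = (
    "24-06-2015 10:15:45: Aditya Krishnakant: hello\n"
    "24-06-2015 10:16:02: Aditya Krishnakant: second message"
)

pattern = r'(\d{2})-(\d{2})-(\d{4}) (\d{2}):(\d{2}):(\d{2}): (\w+) (\w+):'
matches = re.findall(pattern, sample)

for day, month, year, hour, minute, second, first, last in matches:
    print(f"{year}-{month}-{day} {hour}:{minute}:{second} {first} {last}")
```

Each tuple holds the eight captured groups in order, so unpacking them in the loop header keeps the body readable.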

How to re-match a group that did not capture anything?

I'm trying to parse a string in which a certain section can either be enclosed between " or ' or not be enclosed at all. However, I'm struggling finding a syntax that works when no quotation marks are there at all.
See the following (simplified) example:
>>> print re.match(r'\w(?P<quote>(\'|"))?\w', 'f"oo').group('quote')
"
>>> print re.match(r'\w(?P<quote>(\'|"))?\w', 'foo').group('quote')
None
>>> print re.match(r'\w(?P<quote>(\'|"))?\w(?P=quote)', 'f"o"o').group('quote')
"
>>> print re.match(r'\w(?P<quote>(\'|"))?\w(?P=quote)', 'foo').group('quote')
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "<string>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
'NoneType' object has no attribute 'group'
The desired result for the last attempt should be None as the second command in the example.

Based on the suggestions I got to another question, I was able to produce a slightly different regex that provides the correct answers:
>>> re.match(r'\w(?P<quote>[\'"]?)\w(?P=quote)\w', 'foo').group('quote')
u''
>>> re.match(r'\w(?P<quote>[\'"]?)\w(?P=quote)\w', 'f"o"o').group('quote')
u'"'
>>> re.match(r'\w(?P<quote>[\'"]?)\w(?P=quote)\w', 'f\'o\'o').group('quote')
u"'"
The trick was really to use a quantifier on the character matched rather than on the entire group.
[ The leading and trailing \w in this example are just there to prevent the regex from matching the full string (as an unquoted string). In the real scenario this was not needed, as this match is part of a larger regex with earlier and later groups. ]
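The contrast between the two placements of the quantifier can be sketched in Python 3 as follows. The variable names optional_group and optional_char are illustrative, and the first pattern uses a character class in place of the original alternation.

```python
import re

# Quantifier on the group: when the group does not take part in the match,
# the backreference (?P=quote) has nothing to refer to and the match fails.
optional_group = re.compile(r'\w(?P<quote>[\'"])?\w(?P=quote)')

# Quantifier inside the group: the group always takes part, possibly
# capturing an empty string, so the backreference matches the empty string.
optional_char = re.compile(r'\w(?P<quote>[\'"]?)\w(?P=quote)\w')

print(optional_group.match('foo'))                         # None
print(repr(optional_char.match('foo').group('quote')))     # ''
print(repr(optional_char.match('f"o"o').group('quote')))   # '"'
```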

Python re "bogus escape error"

I've been experimenting with the search method of Python's re module. cur is the input from a Tkinter entry widget. Whenever I enter a "\" into the entry widget, it throws this error. I'm not all too sure what the error is or how to deal with it. Any insight would be much appreciated.
cur is a string
tup[0] is also a string
Snippet:
se = re.search(cur, tup[0], flags=re.IGNORECASE)
The error:
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Python26\Lib\Tkinter.py", line 1410, in __call__
return self.func(*args)
File "C:\Python26\Suite\quidgets7.py", line 2874, in quick_links_results
self.quick_links_results_s()
File "C:\Python26\Suite\quidgets7.py", line 2893, in quick_links_results_s
se = re.search(cur, tup[0], flags=re.IGNORECASE)
File "C:\Python26\Lib\re.py", line 142, in search
return _compile(pattern, flags).search(string)
File "C:\Python26\Lib\re.py", line 245, in _compile
raise error, v # invalid expression
error: bogus escape (end of line)

"bogus escape (end of line)" means that your pattern ends with a backslash. This has nothing to do with Tkinter. You can duplicate the error pretty easily in an interactive shell:
>>> import re
>>> pattern="foobar\\"
>>> re.search(pattern, "foobar")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/re.py", line 142, in search
return _compile(pattern, flags).search(string)
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/re.py", line 241, in _compile
raise error, v # invalid expression
sre_constants.error: bogus escape (end of line)
The solution? Make sure your pattern doesn't end with a single backslash.
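If the user's text is meant to be searched for literally rather than interpreted as a regex, one option not mentioned in the answer is to pass it through re.escape first. The sketch below assumes Python 3, and the variable names are illustrative.

```python
import re

user_input = "foobar\\"  # the text typed by the user, ending in one backslash

# Used directly as a pattern, the trailing backslash is an incomplete escape.
try:
    re.search(user_input, "foobar\\")
    raised = False
except re.error:
    raised = True

# re.escape backslash-escapes special characters so the text matches literally.
literal = re.search(re.escape(user_input), "xx FOOBAR\\ yy", flags=re.IGNORECASE)

print(raised)            # True
print(literal.group(0))  # FOOBAR\
```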

The solution to this issue is to use a raw string as the replacement text. The following won't work:
re.sub('this', 'This \\', 'this is a text')
It will throw the error: bogus escape (end of line)
But the following will work just fine:
re.sub('this', r'This \\', 'this is a text')
Now, the question is how do you convert a string generated during program runtime into a raw string in Python. You can find a solution for this here. But I prefer using a simpler method to do this:
def raw_string(s):
    # Python 2 only: the 'string-escape' codec and the unicode type
    # do not exist in Python 3.
    if isinstance(s, str):
        s = s.encode('string-escape')
    elif isinstance(s, unicode):
        s = s.encode('unicode-escape')
    return s
The above method can convert only ascii and unicode strings into raw strings. Well, this has been working great for me till date :)
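For replacement text built at runtime, two alternatives that also work in Python 3 are sketched here: pass a function as the replacement, whose return value is inserted without escape processing, or double the backslashes before handing the string to re.sub. These are alternatives to the helper above, not a description of it, and the variable names are illustrative.

```python
import re

replacement = 'This \\'  # runtime text ending in a single backslash

# Option 1: a function replacement is used as-is, with no escape processing.
via_function = re.sub('this', lambda m: replacement, 'this is a text')

# Option 2: double every backslash so the template parser reads them as literals.
via_escaping = re.sub('this', replacement.replace('\\', '\\\\'), 'this is a text')

print(via_function)  # This \ is a text
print(via_escaping)  # This \ is a text
```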

If you are trying to search for "cur" in "tup[0]" you should do this through "try:... except:..." block to catch invalid pattern:
try:
    se = re.search(cur, tup[0], flags=re.IGNORECASE)
except re.error as e:
    # print to stdout or any status widget in your gui
    print("Your search pattern is not valid.")
    # Some details for error:
    print(e)
    # Or some other code for default action.

The first parameter to re.search is the pattern to search for, so if 'cur' ends with a backslash, it is an invalid escape sequence. You've probably swapped your arguments around (I don't know what tup[0] is, but is it your pattern?), and it should be like this:
se = re.search(tup[0], cur, flags=re.IGNORECASE)
You very rarely use user input as a pattern, unless you're building a regular-expression search feature, in which case you might want to show the error to the user instead.
HTH.
EDIT:
The error it is reporting is that you're using an escape character right before the end of the line (which is what bogus escape (end of line) means). In other words, your pattern ends with a backslash, which is not a valid pattern. The escape character (backslash) must be followed by another character, and it removes or adds special meaning to that character (I'm not sure exactly how Python does it; POSIX makes groups by adding an escape to parentheses, whereas Perl removes the group effect by escaping them). For example, \* matches a literal asterisk, whereas * matches the preceding character 0 or more times.
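The difference between an escaped and an unescaped asterisk can be checked with a short sketch like this one, which uses re.fullmatch from Python 3.4 and later.

```python
import re

escaped_matches_star = re.fullmatch(r"a\*", "a*") is not None    # literal asterisk
escaped_matches_run = re.fullmatch(r"a\*", "aaa") is not None    # not a repetition
unescaped_matches_run = re.fullmatch(r"a*", "aaa") is not None   # zero or more "a"

print(escaped_matches_star)   # True
print(escaped_matches_run)    # False
print(unescaped_matches_run)  # True
```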
