Having some issues with re.sub - python

In my program I'm parsing Japanese definitions, and I need to take a few things out. There are three kinds of brackets whose contents I need to remove: 「text」 (text) 《text》
To take out things between 「」 I've been doing
sentence = re.sub('「[^)]*」','', sentence)
The problem with this is, for some reason if there are parentheses within 「」 it will not replace anything. Also, I've tried using the same code for the other two things like
sentence = re.sub('([^)]*)','', sentence)
sentence = re.sub('《[^)]*》','', sentence)
but it doesn't work for some reason. There isn't an error or anything, it just doesn't replace anything.
How can I make this work, or is there some better way of doing this?
EDIT:
I'm having a slight problem with another part of this though. Before I replace anything I check the length to make sure it's over a certain length.
parse = re.findall(r'「[^」]*」','', match.text)
if len(str(parse)) > 8:
    sentence = re.sub(r'「[^」]*」','', match.text)
This seems to be causing an error now:
Traceback (most recent call last):
File "C:/Users/Dominic/PycharmProjects/untitled9/main.py", line 48, in <module>
parse = re.findall(r'「[^」]*」','', match.text)
File "C:\Python34\lib\re.py", line 206, in findall
return _compile(pattern, flags).findall(string)
File "C:\Python34\lib\re.py", line 275, in _compile
bypass_cache = flags & DEBUG
TypeError: unsupported operand type(s) for &: 'str' and 'int'
I sort of understand what's causing this, but I don't understand why it's not working just from that slight change. I know the re.sub part is fine; it's just the first two lines that are causing the problems.
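A likely cause, going by the traceback: re.findall takes (pattern, string, flags) and has no replacement argument, so the '' ends up in the string slot and match.text in the flags slot. A minimal sketch of the corrected check, written for Python 3, where text is a made-up stand-in for match.text:

```python
import re

# Hypothetical stand-in for match.text from the question.
text = '「とても長い説明文です」残り'
pattern = r'「[^」]*」'

# re.findall(pattern, string, flags=0): unlike re.sub, there is no replacement argument.
parse = re.findall(pattern, text)

# Same length check as in the question (it measures the printed list, not the match itself).
if len(str(parse)) > 8:
    sentence = re.sub(pattern, '', text)
else:
    sentence = text

print(sentence)
```

If the intent is to measure the bracketed text itself, checking the length of each item in parse may be closer to what is wanted than len(str(parse)).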

You should read a tutorial on regular expressions so you understand what your regexps do.
The regexp '「[^)]*」' matches a 「, then any run of characters that are not a closing parenthesis, then a 」. So it can never match when there is a ) between the brackets. You need this:
sentence = re.sub(r'「[^」]*」','', sentence)
The second regexp has an additional problem: Parentheses have a special meaning (when they are not inside square brackets), so to match parentheses you need to write \( and \). So you need this:
'\([^)]*\)'
Finally: You should always use raw strings for your Python regexps. It doesn't happen to make a difference in this case, but it very often does, and the bugs are maddening to spot. E.g., use:
r'\([^)]*\)'
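Putting the pieces above together, a minimal sketch that removes all three kinds of brackets might look like this. The function name and sample sentence are made up, the third pattern is written by analogy with the first, and the parentheses are assumed to be ASCII ones as in the question:

```python
import re

def strip_bracketed(sentence):
    """Remove 「...」, (...) and 《...》 spans, delimiters included."""
    sentence = re.sub(r'「[^」]*」', '', sentence)   # stop at the closing 」
    sentence = re.sub(r'\([^)]*\)', '', sentence)   # escaped literal parentheses
    sentence = re.sub(r'《[^》]*》', '', sentence)   # stop at the closing 》
    return sentence

# Hypothetical sample with parentheses nested inside 「」.
sample = '猫「ねこ(neko)」は(注)動物《どうぶつ》です'
print(strip_bracketed(sample))
```

Each negated character class stops at its own closing delimiter, which is why parentheses inside 「」 no longer get in the way.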

sentence = re.sub(ur'「[^」]*」','', sentence)
You need to change the negated character class so that it stops at 」 instead of ).
You should use a Unicode string when dealing with such characters (the u prefix in Python 2; in Python 3 a plain r'...' string is already Unicode and the ur prefix is a syntax error). If there is a ) inside the brackets, your pattern 「[^)]*」 will fail, because [^)]* instructs the regex to stop when it finds ).

Related

Regex 'sre_constants.error: bad character range' in large regex pattern

The following is the error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/re.py", line 194, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.7/re.py", line 251, in _compile
raise error, v # invalid expression
sre_constants.error: bad character range
This is my object:
>>> re101121=re.compile("""(?i)激[ _]{0,}活[ _]{0,}邮[ _]{0,}箱|(click|clicking)[ _]{1,}[here ]{0,1}to[ _]{1,}verify|stop[ _]{1,}mail[ _]{1,}.{1,16}[ _]{1,}here|(click|clicking|view|update)([ _-]{1,}|\\xc2\\xa0)(on|here|Validate)[^a-z0-9]{1}|(點|点)[ _]{0,}(擊|击)[ _]{0,}(這|这|以)[ _]{0,}(裡|里|下)|DHL[ _]{1,}international|DHL[ _]{1,}Customer[ _]{1,}Service|Online[ _]{1,}Banking|更[ _]{0,}新[ _]{0,}您[ _]{0,}的[ _]{0,}(帐|账)[ _]{0,}户|CONFIRM[ _]{1,}ACCOUNT[ _]{1,}NOW|avoid[ _]{1,}Account[ _]{1,}malfunction|confirm[ _]{1,}this[ _]{1,}request|verify your account IP|Continue to Account security|继[\\s-_]*续[\\s-_]*使[\\s-_]*用|崩[\\s-_]*溃[\\s-_]*信[\\s-_]*息|shipment[\\s]+confirmation|will be shutdown in [0-9]{0,} (hours|days)|DHL Account|保[ ]{0,}留[ ]{0,}密[ ]{0,}码|(Password|password|PASSWORD).*(expired|expiring)|login.*email.*password.*confirm|[0-9]{0,} messages were quarantined|由于.*错误(的)?(送货)?信息|confirm.*(same)? password|keep.*account secure|settings below|loss.*(email|messages)|simply login|quick verification now""")
After minimization, your error boils down to re.compile("""[\\s-_]"""). This is indeed a bad character range: a dash between two items in a character class denotes a range, and \s cannot be a range endpoint. You probably meant the dash to be literal: re.compile(r"[\s\-_]") (always use raw strings for regexes, r"..."). Moving the dash to the end of the character class works too: r"[\s_-]".
In the future, try to binary search to find the minimal failing input: remove the right half of the regex. If it still fails, the problem must have been in the left half. Remove the right half of the remaining substring and repeat until you're down to a minimal failing case. This technique doesn't always work when the problem spans both halves, but it can't hurt to try.
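The halving idea can be sketched roughly as follows. Instead of cutting the raw string in half (which can split a group in two), this version bisects a list of alternatives, and it assumes exactly one alternative is broken. The helper names and the shortened parts list are made up for illustration:

```python
import re

def compiles(pattern):
    """Return True if the pattern compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

def find_failing_alternative(alternatives):
    """Bisect a list of alternatives, assuming exactly one of them is broken."""
    lo, hi = 0, len(alternatives)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if compiles("|".join(alternatives[lo:mid])):
            lo = mid   # left half is fine, so the culprit is on the right
        else:
            hi = mid   # left half already fails
    return alternatives[lo]

# Hypothetical, shortened stand-ins for pieces of the big pattern.
parts = [r"DHL[ _]{1,}international", r"Online[ _]{1,}Banking",
         r"继[\s-_]*续", r"simply login"]

culprit = find_failing_alternative(parts)
print(culprit)
print(compiles(r"[\s\-_]"), compiles(r"[\s_-]"))
```

Keeping the pattern as a list of alternatives that is joined with | at the end also makes a regex of this size easier to maintain.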
As mentioned in the comments, it's pretty odd to have such a massive regex as this, but I'll assume you know what you're doing.
As another aside, there are some antipatterns in this regex (pardon the pun) like {0,} which can be simplified to *.
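To illustrate the aside above, here is a small check that the brace forms and the shorthand quantifiers accept the same strings ({0,} is *, {1,} is +, {0,1} is ?). The pattern is a made-up fragment in the style of the one in the question, not part of it:

```python
import re

# Brace quantifiers as used in the question's style...
verbose = re.compile(r"stop[ _]{1,}mail[ _]{0,}s{0,1}")
# ...and the equivalent shorthand forms.
concise = re.compile(r"stop[ _]+mail[ _]*s?")

samples = ["stop mail", "stop__mail s", "stopmail", "stop mails", "stop mail  ss"]

verbose_results = [bool(verbose.fullmatch(s)) for s in samples]
concise_results = [bool(concise.fullmatch(s)) for s in samples]
print(verbose_results == concise_results)
```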

Read until an only partially known line - Python

I need to get (parse) from a device its whole output.
My solution was: 1) Determine what the last line of its output looks like
2) Use the code below to read the output until the last line (which is a roundabout way of saying: read the whole output)
last_line = "text of the last line"
read_until(last_line)
3) Technical detail: make it the return value of get_output() so that it can be passed on to a parse_result() function.
The problem is: The last line might take various forms and only its rough format is known. For example it might say: {"diag":"hdd_id", "status":"0"}. However, both "diag" and "status" might take other values than "hdd_id" and "0".
What can I do to make the "text of the last line" more universal so that the read_until() stops for every value of "diag" and "status"? (given that the output always includes words "diag" and "status")
What I tried: Using regular expressions. Defining last_line = re('"status":"."}') making use of the fact that . in regular expression means any value. What I get though is TypeError: 'module' object is not callable.
It also wouldn't make much sense to convert that regular expression to a string by str(re('"status":"."}')) since, as far as I understand regular expressions, it wouldn't mean any particular string (due to .).
You should read (again) the re chapter from the Python standard library manual.
The correct usage is:
import re
...
eof = re.compile(r'\s*\{\s*"diag":.*,\s*"status":.*\}') # compile the regex
...
The above expression uses \s* to allow for optional whitespace in the line. You can remove it if you know that whitespace cannot occur.
You can then use it with the telnetlib Python module, but with expect instead of read_until, because the latter searches for a string and not a regex:
index, match, text = tn.expect([eof])
Here, index will be the index of the matched regex (here 0), match the match object, and text the full text including the last line.
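As a quick sanity check that needs no device, the regex can be tried against sample output. This sketch assumes Python 3, where telnetlib works with bytes (hence the rb prefix), and the device output shown is invented; only the shape of the last line comes from the question. Note also that telnetlib was deprecated in Python 3.11 and removed in 3.13.

```python
import re

# Bytes version of the pattern, since Python 3's telnetlib reads bytes.
eof = re.compile(rb'\s*\{\s*"diag":.*,\s*"status":.*\}')

# Hypothetical device output; only the last line's format is from the question.
output = b'checking disks...\nall tests done\n{"diag":"hdd_id", "status":"0"}\n'

match = eof.search(output)
print(match.group().strip())

# Different values for "diag" and "status" are accepted as well.
print(eof.search(b'{"diag":"fan_speed", "status":"17"}') is not None)
```

The same compiled object is what would be passed in the list given to tn.expect.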

How to make the re.search() try a best attempt approach

This is the text file sb.txt
JOHN:ENGINEER:35?:
Now this is the piece of code that tries to perform a regex search on the above line.
biodata1 = re.search(r'([\w\W])+?:([\w\W])+?:([\w\W])+?:',line)
Now I get a proper output for biodata1.group(1), biodata1.group(2) and biodata1.group(3).
If however, I modify the file by removing ":" from the end
JOHN:ENGINEER:35?
and run the script again, I get the following error which makes sense since group(3) didn't match successfully
Traceback (most recent call last):
File "dictionary.py", line 26, in <module>
print('re.search(r([\w\W])+?:([\w\W])+?:([\w\W])+? '+biodata1.group(1)+' '+biodata1.group(2)+' '+biodata1.group(3)) # STMT1
AttributeError: 'NoneType' object has no attribute 'group'
But group(1) and group(2) should've still matched "N" and "R" respectively. Is there any way to avoid this error and take a best-attempt approach to the regex, so that it doesn't fail and at least prints biodata1.group(1) and biodata1.group(2)?
I tried editing the output statement so that it doesn't print biodata1.group(3), but that didn't work.
I think you misunderstand what has happened. Your entire regular expression has failed to match and therefore there is no match object.
Where it says AttributeError: 'NoneType' object has no attribute 'group' it's trying to tell you that biodata1 is None. None is the return you get from re.search when it fails to match.
To be clear, there's no way to get a "best match". What you're asking for is that re should make a decision as to what you really want. If you want groups to be optional, you need to make them optional.
Depending on what you actually want you can try the regexes:
r'([\w\W])+?:([\w\W])+?:([\w\W])+?:?'
or
r'([\w\W])+?:([\w\W])+?:(([\w\W])+?:)?'
Which respectively make the last : and the entire last group optional.
You'll have to modify the regex to instruct it on what exactly is optional and what isn't. Python regexes don't have this concept of partial matches. One possibility is to change it to
biodata1 = re.search(r'([\w\W])+?:(?:([\w\W])+?:(?:([\w\W])+?:)?)?',line)
Where you allow 1, 2 or 3 groups to match. In this case, any groups that don't match will return None (not the empty string) when you do match.group(X)
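A short sketch of how the nested optional groups behave on the two lines from the question, including the None check that avoids the AttributeError. The variable names strict and pattern are made up:

```python
import re

# Original pattern: all three colons are required.
strict = re.compile(r'([\w\W])+?:([\w\W])+?:([\w\W])+?:')
# Nested optional groups: 1, 2 or 3 fields may match.
pattern = re.compile(r'([\w\W])+?:(?:([\w\W])+?:(?:([\w\W])+?:)?)?')

for line in ('JOHN:ENGINEER:35?:', 'JOHN:ENGINEER:35?'):
    m = pattern.search(line)
    if m is None:
        # re.search returns None when the whole pattern fails to match.
        print(line, '-> no match at all')
    else:
        # Groups inside an optional part that did not match are None.
        print(line, '->', m.groups())
```

Because each group wraps a single repeated character, it captures only the last character of its field, which is why the values are "N" and "R" rather than the whole words.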
What a regex does is it matches exactly what you provided. There is no best try or anything like that.
If you want some part of your match to be optional you need to declare it using the ? operator. So in your case your regex would need to look like this:
biodata1 = re.search(r'([\w\W])+?:([\w\W])+?:([\w\W])+?:?',line)
Also, note that +? is the lazy form of + (one or more, as few as possible), so it is not the same as * (zero or more). If empty fields are acceptable, though, you could use * and just do this:
biodata1 = re.search(r'([\w\W])*:([\w\W])*:([\w\W])*:?',line)
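A quick way to see how +? and * behave is to try them on tiny inputs. The strings here are made up purely for illustration:

```python
import re

lazy_plus = re.match(r'a+?', 'aaa').group()    # one or more, as few as possible
greedy_star = re.match(r'a*', 'aaa').group()   # zero or more, as many as possible
plus_no_a = re.match(r'a+?', 'bbb')            # needs at least one 'a'
star_no_a = re.match(r'a*', 'bbb').group()     # happily matches nothing

print(repr(lazy_plus), repr(greedy_star), plus_no_a, repr(star_no_a))
```

So switching the fields to * changes what is accepted: an empty field between two colons would then match.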

Passing string with (accidental) escape character loses character even though it's a raw string

I have a function with a python doctest that fails because one of the test input strings has a backslash that's treated like an escape character even though I've encoded the string as a raw string.
My doctest looks like this:
>>> infile = [ "Todo: fix me", "/** todo: fix", "* me", "*/", r"""//\todo stuff to fix""", "TODO fix me too", "toDo bug 4663" ]
>>> find_todos( infile )
['fix me', 'fix', 'stuff to fix', 'fix me too', 'bug 4663']
And the function, which is intended to extract the todo texts from a single line following some variation over a todo specification, looks like this:
todos = list()
for line in infile:
    print line
    if todo_match_obj.search( line ):
        todos.append( todo_match_obj.search( line ).group( 'todo' ) )
And the regular expression called todo_match_obj is:
r"""(?:/{0,2}\**\s?todo):?\s*(?P<todo>.+)"""
A quick conversation with my ipython shell gives me:
In [35]: print "//\todo"
// odo
In [36]: print r"""//\todo"""
//\todo
And, just in case the doctest implementation uses stdout (I haven't checked, sorry):
In [37]: sys.stdout.write( r"""//\todo""" )
//\todo
My regex-foo is not high by any standards, and I realize that I could be missing something here.
EDIT: Following Alex Martelli's answer, I would like suggestions on what regular expression would actually match the blasted r"""//\todo fix me""". I know that I did not originally ask for someone to do my homework, and I will accept Alex's answer as it really did answer my question (or confirm my fears). But I promise to upvote any good solutions to my problem here :)
EDITEDIT: for reference, a bug has been filed with the kodos project: bug #437633
I'm using Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15)
Thank you for reading this far (If you skipped directly down here, I understand)
Read your original regex carefully:
r"""(?:/{0,2}\**\s?todo):?\s*(?P<todo>.+)"""
It matches: zero to two slashes, then 0+ stars, then 0 or 1 "whitespace characters" (blanks, tabs etc), then the literal characters 'todo' (and so on).
Your rawstring is:
r"""//\todo stuff to fix"""
so there's a literal backslash between the slashes and the 'todo', therefore of course the regex doesn't match it. It can't -- nowhere in that regex are you expressing any desire to optionally match a literal backslash.
Edit:
A RE pattern, very close to yours, that would accept and ignore an optional backslash right before the 't' would be:
r"""(?:/{0,2}\**\s?\\?todo):?\s*(?P<todo>.+)"""
note that the backslash does have to be repeated, to "escape itself", in this case.
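A small sketch comparing the two patterns on the raw string from the question, written for Python 3. The names original and relaxed are made up. Note that the difference shows up with match, which is anchored at the start of the line; search, as used in the question's code, may begin matching after the backslash:

```python
import re

original = re.compile(r"""(?:/{0,2}\**\s?todo):?\s*(?P<todo>.+)""")
relaxed = re.compile(r"""(?:/{0,2}\**\s?\\?todo):?\s*(?P<todo>.+)""")

line = r"""//\todo stuff to fix"""

# Anchored at the start: the backslash blocks the original pattern.
print(original.match(line))
# The optional escaped backslash lets the relaxed pattern through.
print(relaxed.match(line).group('todo'))
# search can skip the leading //\ and start at "todo".
print(original.search(line).group('todo'))
```

Since search already copes with the backslash, a doctest that still fails may have a different cause than the regex itself.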
This gets even more strange as I venture down the road of doctests.
Consider this python script.
If you uncomment the lines 22 and 23, the script passes just fine, as the method returns True, which is both asserted and explicitly compared.
But if you run the file as it stands in the link, the doctest will fail with the message:
% python doctest_test.py
**********************************************************************
File "doctest_test.py", line 3, in __main__.doctest_test
Failed example:
doctest_test( r"""// odo""" )
Exception raised:
Traceback (most recent call last):
File "/usr/lib/python2.6/doctest.py", line 1241, in __run
compileflags, 1) in test.globs
File "<doctest __main__.doctest_test[0]>", line 1, in <module>
doctest_test( r"""// odo""" )
File "doctest_test.py", line 14, in doctest_test
assert input_string == compare_string
AssertionError
**********************************************************************
1 items had failures:
1 of 1 in __main__.doctest_test
***Test Failed*** 1 failures.
Can someone enlighten me here?
I'm still using python 2.6.4 for this.
I'm placing this answer under 'community wiki', as it really does not reputation-wise relate to the question.
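One possible explanation, offered as a guess that is consistent with the failed example above (it shows whitespace where \t was written): the doctest lives inside a regular docstring, so Python turns \t into a tab before doctest ever sees the r prefix on the inner string. The doctest documentation recommends raw docstrings when examples contain backslashes. A small Python 3 sketch with made-up function names:

```python
import doctest

def plain_docstring():
    """
    >>> r"//\todo".count("t")
    1
    """

def raw_docstring():
    r"""
    >>> r"//\todo".count("t")
    1
    """

def failed_examples(func):
    """Run the examples in func's docstring quietly and return how many failed."""
    test = doctest.DocTestParser().get_doctest(
        func.__doc__, {}, func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    runner.run(test, out=lambda text: None)  # discard the failure report
    return runner.failures

print(failed_examples(plain_docstring))  # the \t was already turned into whitespace
print(failed_examples(raw_docstring))    # the backslash survives
```

Doubling the backslash inside a regular docstring would be another way to get the same effect.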

Python's Regular Expression Source String Length

In Python Regular Expressions,
re.compile("x"*50000)
gives me OverflowError: regular expression code size limit exceeded
but the following one does not raise any error, although it hits 100% CPU and took 1 minute on my PC
>>> re.compile(".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000)
<_sre.SRE_Pattern object at 0x03FB0020>
Is that normal?
Should I assume, ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 is shorter than "x"*50000?
Tested on Python 2.6, Win32
UPDATE 1:
It looks like ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 could be reduced to .*?
So, how about this one?
re.compile(".*?x"*50000)
It does compile, and if that one can also be reduced to ".*?x", it should match the string "abcx" or "x" alone, but it does not match.
So, am I missing something?
UPDATE 2:
My point is not to find the maximum length of a regex source string. I'd like to know the reason why "x"*50000 is caught by the overflow handler but ".*?x"*50000 is not.
It does not make sense to me, that's why.
Is something missing in the overflow check, is this just fine, or is it really overflowing something?
Any hints/opinions will be appreciated.
The difference is that ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 can be reduced to ".*?", while "x"*50000 has to generate 50000 nodes in the FSM (or a similar structure used by the regex engine).
EDIT: Ok, I was wrong. It's not that smart. The reason why "x"*50000 fails, but ".*?x"*50000 doesn't is that there is a limit on size of one "code item". "x"*50000 will generate one long item and ".*?x"*50000 will generate many small items. If you could split the string literal somehow without changing the meaning of the regex, it would work, but I can't think of a way to do that.
You want to match 50000 "x"s, correct? If so, here is an alternative without regex:
if "x"*50000 in mystring:
    print "found"
If you want to match 50000 "x"s using regex, you can use a counted repetition:
>>> pat=re.compile("x{50000}")
>>> pat.search(s)
<_sre.SRE_Match object at 0xb8057a30>
On my system the count can be at most 65535:
>>> pat=re.compile("x{65536}")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/re.py", line 188, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.6/re.py", line 241, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/lib/python2.6/sre_compile.py", line 529, in compile
groupindex, indexgroup
RuntimeError: invalid SRE code
>>> pat=re.compile("x{65535}")
>>>
I don't know if there are tweaks in Python we can use to increase that limit though.
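For reference, both approaches from this answer can be sketched in Python 3 as below. The test string is made up, and the limits reported in this thread come from Python 2.6, so they may well differ on other versions:

```python
import re

# Made-up test string: 50000 x's with something on either side.
s = "a" + "x" * 50000 + "b"

# Plain substring test, no regex involved.
found_plain = "x" * 50000 in s

# Counted repetition keeps the pattern source short.
pat = re.compile("x{50000}")
m = pat.search(s)

print(found_plain, m.end() - m.start())
```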
