pyPEG2 parsing of newlines

I'm trying to use pyPEG2 to translate MoinMoin markup to Markdown, and I need to pay attention to newlines in certain cases. However, I can't even get my newline parsing tests to work. I'm new to pyPEG and my Python is rusty. Please bear with me.
Here's the code:
#!/usr/local/bin/python3
from pypeg2 import *
import re
class Newline(List):
    grammar = re.compile(r'\n')
parse("\n", Newline)
parse("""
""", Newline)
This results in:
Traceback (most recent call last):
  File "./pyPegNewlineTest.py", line 7, in <module>
    parse("\n", Newline)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pypeg2/__init__.py", line 667, in parse
    t, r = parser.parse(text, thing)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pypeg2/__init__.py", line 794, in parse
    raise r
  File "<string>", line 2
    ^
SyntaxError: expecting match on \n
It's as if pypeg is inserting an empty line after the \n.
Trying other options such as
grammar = re.compile(r'\n', re.MULTILINE)
grammar = re.compile(r'\r\n|\r|\n', re.MULTILINE)
grammar = contiguous(re.compile(r'\r\n|\r|\n', re.MULTILINE))
and various combinations of those don't change the error message (although I don't think I tried all combinations). Changing Newline to subclass str instead of List doesn't change the error either.
Update
I have figured out that pypeg is stripping the newline before parsing it:
#!/usr/local/bin/python3
from pypeg2 import *
import re
class Newline(str):
    grammar = contiguous(re.compile(r'a'))
parse("\na", Newline)
parse("""
a""", Newline)
print("Success, of a sort.")
Running this results in:
Success, of a sort.
If I override the Newline's parse method I don't even see the newline. The first thing it gets is the "a". This is consistent with what I'm seeing elsewhere. pypeg strips all leading whitespace, even when you specify contiguous.
So, that's what's happening. Not sure what to do about it.

Yes, by default pyPEG removes whitespace, including newlines, before it tries to match.
This is easily configurable through the optional whitespace argument of the parse() function, e.g.:
parse("\na", Newline, whitespace=re.compile(r"[ \t\r]"))
With this setting, spaces, tabs and carriage returns are still skipped, but newlines (\n) are not.
For this example, the parser now correctly reports a syntax error, because the leading newline is no longer skipped:
SyntaxError: expecting match on a
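To see why the choice of whitespace pattern matters, here is a minimal standard-library sketch of the skipping step. It is a simplified model, not pyPEG2's actual code: the helper skip() is hypothetical, and DEFAULT_WS assumes that pyPEG2's default whitespace pattern is \s+ (which also matches \n).

```python
import re

# Assumption: pyPEG2's default whitespace pattern behaves like \s+,
# which matches newlines as well as spaces and tabs.
DEFAULT_WS = re.compile(r"(?m)\s+")
# The pattern from the answer above: space, tab and carriage return only.
CUSTOM_WS = re.compile(r"[ \t\r]")


def skip(whitespace, text):
    """Hypothetical helper: strip leading matches of `whitespace`,
    the way a whitespace-skipping parser would before matching a grammar."""
    while True:
        m = whitespace.match(text)
        if not m or m.end() == 0:
            return text
        text = text[m.end():]


print(repr(skip(DEFAULT_WS, "\na")))    # newline is consumed
print(repr(skip(CUSTOM_WS, " \t\na")))  # newline survives for the grammar
```

With the second pattern the newline is still in the input when the grammar is tried, so a grammar such as re.compile(r'\n') has something to match.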

Related

Different lines between two files, when one line contains trailing whitespace (Python, difflib)

I want to compare two text files in Python, and return the lines that are different. My attempt uses difflib, but I'm open to other suggestions. I need to get the lines that are different, as well as the lines that appear in one file but not the other. Order is somewhat important, but if a good solution exists that doesn't take order into consideration, I can let go of that.
The problem is that one file has lines with multiple trailing \t and \n characters, while the other doesn't; I don't want to consider that a difference. For other files, the first file has only \n and the other file has \t characters at the end. The lines contain elements that are separated by tabs or spaces, so those are important; I just don't care about the trailing \t and \n characters.
My solution:
from difflib import Differ
with open(file_path) as actual:
    with open(test_file_path) as test:
        differ = Differ()
        for line in differ.compare(actual.readlines(), test.readlines()):
            if line.startswith('-'):
                log.error('EXPECTED: {}'.format(line[2:]))
            if line.startswith('+'):
                log.error('TEST FILE: {}'.format(line[2:]))
I expect the output to show EXPECTED and TEST FILE lines when there's a difference, and just EXPECTED or just TEST FILE when one contains a line the other doesn't. Right now, I'm seeing a lot of the following types of errors:
00:02:40: ERROR EXPECTED: Issuer Type OBal Net WAC OTerm WAM Age GrossCpn HighRemTerm Grp
00:02:40: ERROR TEST FILE: Issuer Type OBal Net WAC OTerm WAM Age GrossCpn HighRemTerm Grp
As you can see (if you highlight it), the first line contains a number of spaces after 'Grp' and the other line doesn't. I want to consider these two lines the same.
I've tried to explicitly specify the tabs and line breaks:
actual_file = actual.readlines()
expected_file = []
for line in actual_file:
    if line[-1] == '\n':
        expected_file.append(line.rstrip('\n').rstrip('\t') + '\n')
    else:
        expected_file.append(line.rstrip('\t'))
However, it (a) slows the process down quite a bit, and (b) is required for every file type in a different way, since some files have trailing tabs followed by line breaks, some have just line breaks, and some have nothing at all. If there's no better way, I can strip every line of every trailing tab and linebreak, but it seems like a lot of processing power (I have to run a lot of files) for something that seems fairly easy to resolve.

Take a look at string.rstrip() here: https://docs.python.org/2/library/string.html#string.rstrip
string.rstrip() (or, equivalently, the str method s.rstrip(), which is the only form available in Python 3) should do exactly what you need: it strips whitespace off the end of a string while leaving \t and \n characters before the end alone.
Check it out:
>>> import string
>>> s = "This \t is \t a \t line \t\t\t\n\n\n"
>>> print(s)
This is a line
>>>
>>> s = string.rstrip(s)
>>> s
'This \t is \t a \t line'
>>> print(s)
This is a line
>>>
Hope this helps!
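Applied to the original difflib code, the idea could look like the sketch below. The function name diff_ignoring_trailing_whitespace and the sample lines are made up for illustration; the only library calls used are str.rstrip() and difflib.Differ.compare().

```python
from difflib import Differ


def diff_ignoring_trailing_whitespace(expected_lines, test_lines):
    """Return only the '- ' and '+ ' lines of a Differ comparison,
    after stripping trailing whitespace (spaces, tabs, newlines)."""
    expected = [line.rstrip() for line in expected_lines]
    test = [line.rstrip() for line in test_lines]
    return [line for line in Differ().compare(expected, test)
            if line.startswith(('- ', '+ '))]


# Hypothetical sample data: same content, different trailing whitespace.
expected = ["Issuer Type OBal\t\t\n", "row 1\n", "only in expected\n"]
test = ["Issuer Type OBal\n", "row 1\t\n"]

for line in diff_ignoring_trailing_whitespace(expected, test):
    print(line)
```

Because every line is stripped the same way, it no longer matters whether a file ends its lines with tabs, a newline, both, or nothing.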

What is the RegEx pattern for 24-06-2015 10:15:45: Aditya Krishnakant:?

What is the regex pattern for a message header such as 24-06-2015 10:15:45: Aditya Krishnakant:?
If you look at a WhatsApp chat transcript, it looks like a mess. The purpose of this code is to print each message sent by a person on a new line (for better readability). This is my code:
import re
f = open("wa_chat.txt", "r")
match = re.findall(r'(\d{2})\:(\d{2})\:(\d{4})\s(\d{2})\:(\d{2})\:(\d{2})\:\s(\w)\s(\w)\:', f)
for content in match:
    print(f.readlines(), '\n')
f.close()
I am getting the following error message:
Traceback (most recent call last):
  File "whatsapp.py", line 4, in <module>
    match = re.findall(r'(\d{2})\:(\d{2})\:(\d{4})\s(\d{2})\:(\d{2})\:(\d{2})\:\s(\w)\s(\w)\:', f)
  File "/usr/lib/python2.7/re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
Where am I going wrong?

For some reason you're putting \: where - should be. Also, instead of \s you can be more specific and just use a space. You can be more specific with those kinds of things because you know exactly what the format is. Your other big problem is that you're only using \w, which only matches one alphanumeric character, when you should use \w+, matching the whole word. Lastly, your actual error is coming from the fact that you're passing in a file object instead of the string containing its contents, i.e. f.read(). Here's some code that should work:
import re
f = open("wa_chat.txt", 'r')
match = re.findall(r'(\d{2})-(\d{2})-(\d{4}) (\d{2}):(\d{2}):(\d{2}): (\w+) (\w+):', f.read())
print match #or do whatever you want with it
Note that match will be a list of tuples since you wanted to use grouping.
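To get from the matches to "one message per line", one option is to use the start position of each header to slice the text. The following is a sketch under assumptions: the sample transcript (including the second sender, John Smith) is invented, and split_messages is a hypothetical helper, not part of the original code.

```python
import re

# Same header pattern as in the answer above.
HEADER = re.compile(
    r'(\d{2})-(\d{2})-(\d{4}) (\d{2}):(\d{2}):(\d{2}): (\w+) (\w+):')


def split_messages(text):
    """Cut the transcript at the start of every message header."""
    starts = [m.start() for m in HEADER.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


# Hypothetical transcript with two messages run together on one line.
sample = ("24-06-2015 10:15:45: Aditya Krishnakant: hi there "
          "24-06-2015 10:16:02: John Smith: hello")

for message in split_messages(sample):
    print(message)
```

In the real script, sample would be replaced by the contents of the file, i.e. f.read().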

Doctest Involving Escape Characters

I have a function fix(), a helper for an output function that writes strings to a text file.
def fix(line):
    """
    returns the corrected line, with all apostrophes prefixed by an escape character
    >>> fix('DOUG\'S')
    'DOUG\\\'S'
    """
    if '\'' in line:
        return line.replace('\'', '\\\'')
    return line
Turning on doctests, I get the following error:
Failed example:
    fix('DOUG'S')
Exception raised:
    Traceback (most recent call last):
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/doctest.py", line 1254, in __run
        compileflags, 1) in test.globs
      File "<doctest convert.fix[0]>", line 1
        fix('DOUG'S')
        ^
No matter what combination of \ and ' characters I use, the doctest doesn't seem to want to work, even though the function itself works perfectly. I suspect it is a result of the doctest being inside a docstring. Any tips on how to resolve this?

Is this what you want?
def fix(line):
    r"""
    returns the corrected line, with all apostrophes prefixed by an escape character
    >>> fix("DOUG\'S")
    "DOUG\\'S"
    >>> fix("DOUG'S") == r"DOUG\'S"
    True
    >>> fix("DOUG'S")
    "DOUG\\'S"
    """
    return line.replace("'", r"\'")
import doctest
doctest.testmod()
Raw strings are your friend...

First, this is what happens if you actually call your function in the interactive interpreter:
>>> fix("Doug's")
"Doug\\'s"
Note that you don't need to escape single quotes in double-quoted strings, and that Python does not do this in the representation of the resulting string either; only the backslash gets escaped.
This means the correct docstring should be (untested!):
"""
returns the corrected line, with all apostrophes prefixed by an escape character
>>> fix("DOUG'S")
"DOUG\\\\'S"
"""
I'd use a raw string literal for this docstring to make this more readable:
r"""
returns the corrected line, with all apostrophes prefixed by an escape character
>>> fix("DOUG'S")
"DOUG\\'S"
"""
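The reason the original doctest failed can be seen by looking at what the docstring actually contains once Python has processed its escape sequences. This is a small sketch; the two function names are placeholders, and fix() is never called here.

```python
def plain_docstring():
    """
    >>> fix('DOUG\'S')
    """


def raw_docstring():
    r"""
    >>> fix('DOUG\'S')
    """


# In the plain docstring, \' has already been turned into ' by the time
# doctest reads it, so the example is no longer valid Python.
plain_line = plain_docstring.__doc__.strip()
raw_line = raw_docstring.__doc__.strip()

print(plain_line)  # >>> fix('DOUG'S')
print(raw_line)    # >>> fix('DOUG\'S')
```

The first printed line is exactly the fix('DOUG'S') that shows up in the doctest failure report.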

Python re "bogus escape error"

I've been messing around with the search function of Python's re module. cur is the input from a Tkinter entry widget. Whenever I enter a "\" into the entry widget, it throws this error. I'm not at all sure what the error means or how to deal with it. Any insight would be much appreciated.
cur is a string
tup[0] is also a string
Snippet:
se = re.search(cur, tup[0], flags=re.IGNORECASE)
The error:
Exception in Tkinter callback
Traceback (most recent call last):
  File "C:\Python26\Lib\Tkinter.py", line 1410, in __call__
    return self.func(*args)
  File "C:\Python26\Suite\quidgets7.py", line 2874, in quick_links_results
    self.quick_links_results_s()
  File "C:\Python26\Suite\quidgets7.py", line 2893, in quick_links_results_s
    se = re.search(cur, tup[0], flags=re.IGNORECASE)
  File "C:\Python26\Lib\re.py", line 142, in search
    return _compile(pattern, flags).search(string)
  File "C:\Python26\Lib\re.py", line 245, in _compile
    raise error, v # invalid expression
error: bogus escape (end of line)

"bogus escape (end of line)" means that your pattern ends with a backslash. This has nothing to do with Tkinter. You can duplicate the error pretty easily in an interactive shell:
>>> import re
>>> pattern="foobar\\"
>>> re.search(pattern, "foobar")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/re.py", line 241, in _compile
    raise error, v # invalid expression
sre_constants.error: bogus escape (end of line)
The solution? Make sure your pattern doesn't end with a single backslash.
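If the text typed into the widget is meant to be searched for literally rather than as a regular expression, one way to guarantee a valid pattern is re.escape(). That is a different technique from the advice above, shown here as a sketch; user_input stands in for the widget contents and is an assumption.

```python
import re

# Stand-in for what the entry widget returns when the user types: foobar\
user_input = "foobar\\"

# Used directly as a pattern, the trailing backslash is an incomplete escape.
try:
    re.search(user_input, "foobar\\baz")
    failed = False
except re.error:
    failed = True

# re.escape() backslash-escapes special characters, so the text is
# matched literally, trailing backslash included.
literal = re.escape(user_input)
match = re.search(literal, "FOOBAR\\baz", flags=re.IGNORECASE)

print(failed)
print(match.group(0))
```

The exact wording of the error message differs between Python versions, so the sketch only checks that re.error is raised.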

The solution to this issue is to use a raw string as the replacement text. The following won't work:
re.sub('this', 'This \\', 'this is a text')
It will throw the error: bogus escape (end of line)
But the following will work just fine:
re.sub('this', r'This \\', 'this is a text')
Now, the question is how to get the same effect for a string generated at runtime, since the r'...' prefix applies only to string literals in source code. You can find a solution for this here. But I prefer using a simpler method (Python 2 only, because it relies on the string-escape codec and the unicode type):
def raw_string(s):
    if isinstance(s, str):
        s = s.encode('string-escape')
    elif isinstance(s, unicode):
        s = s.encode('unicode-escape')
    return s
The above method handles only ASCII (str) and unicode strings. It has been working well for me so far :)
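For Python 3, where the string-escape codec does not exist, a possible alternative is to pass a function as the replacement: re.sub() inserts the function's return value as-is, without interpreting backslashes in it. The helper name sub_literal is made up for this sketch.

```python
import re


def sub_literal(pattern, replacement, text):
    """Replace matches of `pattern` with `replacement` taken literally.

    When the replacement argument of re.sub() is a function, its return
    value is inserted without processing backslash escapes.
    """
    return re.sub(pattern, lambda match: replacement, text)


print(sub_literal('this', 'This \\', 'this is a text'))

# Equivalent alternative: double every backslash in the replacement string.
doubled = 'This \\'.replace('\\', '\\\\')
print(re.sub('this', doubled, 'this is a text'))
```

Both calls produce the same text, with a single backslash after "This".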

If you are trying to search for cur in tup[0], wrap the call in a try/except block to catch an invalid pattern:
try:
    se = re.search(cur, tup[0], flags=re.IGNORECASE)
except re.error, e:
    # print to stdout or any status widget in your gui
    print "Your search pattern is not valid."
    # Some details for error:
    print e
    # Or some other code for default action.

The first parameter to re.search is the pattern to search for, so if cur ends with a backslash, the pattern ends in an incomplete escape sequence and is invalid. You've probably swapped your arguments (I don't know what tup[0] is, but is it your pattern?), and the call should look like this:
se = re.search(tup[0], cur, flags=re.IGNORECASE)
User input is very rarely used as a pattern (unless you're building a regular-expression search feature, in which case you might want to show the error to the user instead).
HTH.
EDIT:
The error being reported is that an escape character appears right before the end of the pattern (which is what bogus escape (end of line) means); that is, your pattern ends with a backslash, which is not valid. An escape character (backslash) must be followed by another character, and it removes or adds special meaning to that character (not sure exactly how Python does it; POSIX makes groups by escaping the parentheses, Perl removes the group effect by escaping them). For example, \* matches a literal asterisk, whereas * matches the preceding character 0 or more times.

Passing string with (accidental) escape character loses character even though it's a raw string

I have a function with a python doctest that fails because one of the test input strings has a backslash that's treated like an escape character even though I've encoded the string as a raw string.
My doctest looks like this:
>>> infile = [ "Todo: fix me", "/** todo: fix", "* me", "*/", r"""//\todo stuff to fix""", "TODO fix me too", "toDo bug 4663" ]
>>> find_todos( infile )
['fix me', 'fix', 'stuff to fix', 'fix me too', 'bug 4663']
And the function, which is intended to extract the todo texts from a single line following some variation over a todo specification, looks like this:
todos = list()
for line in infile:
    print line
    if todo_match_obj.search( line ):
        todos.append( todo_match_obj.search( line ).group( 'todo' ) )
And the regular expression called todo_match_obj is:
r"""(?:/{0,2}\**\s?todo):?\s*(?P<todo>.+)"""
A quick conversation with my ipython shell gives me:
In [35]: print "//\todo"
// odo
In [36]: print r"""//\todo"""
//\todo
And, just in case the doctest implementation uses stdout (I haven't checked, sorry):
In [37]: sys.stdout.write( r"""//\todo""" )
//\todo
My regex-fu is not strong by any standards, and I realize that I could be missing something here.
EDIT: Following Alex Martelli's answer, I would like suggestions on what regular expression would actually match the blasted r"""//\todo fix me""". I know that I did not originally ask for someone to do my homework, and I will accept Alex's answer as it really did answer my question (or confirm my fears). But I promise to upvote any good solutions to my problem here :)
EDITEDIT: for reference, a bug has been filed with the kodos project: bug #437633
I'm using Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15)
Thank you for reading this far (If you skipped directly down here, I understand)

Read your original regex carefully:
r"""(?:/{0,2}\**\s?todo):?\s*(?P<todo>.+)"""
It matches: zero to two slashes, then 0+ stars, then 0 or 1 "whitespace characters" (blanks, tabs etc), then the literal characters 'todo' (and so on).
Your rawstring is:
r"""//\todo stuff to fix"""
so there's a literal backslash between the slashes and the 'todo', therefore of course the regex doesn't match it. It can't -- nowhere in that regex are you expressing any desire to optionally match a literal backslash.
Edit:
A RE pattern, very close to yours, that would accept and ignore an optional backslash right before the 't' would be:
r"""(?:/{0,2}\**\s?\\?todo):?\s*(?P<todo>.+)"""
note that the backslash does have to be repeated, to "escape itself", in this case.
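The two patterns can be compared on the input from the question with a short script. This is a sketch under assumptions: the flags of todo_match_obj are not shown in the question, so re.IGNORECASE is assumed here, and match() is used instead of search() so that the pattern is anchored at the start of each line.

```python
import re

# Assumption: the original pattern was compiled case-insensitively.
ORIGINAL = re.compile(
    r"""(?:/{0,2}\**\s?todo):?\s*(?P<todo>.+)""", re.IGNORECASE)
AMENDED = re.compile(
    r"""(?:/{0,2}\**\s?\\?todo):?\s*(?P<todo>.+)""", re.IGNORECASE)

infile = ["Todo: fix me", "/** todo: fix", "* me", "*/",
          r"""//\todo stuff to fix""", "TODO fix me too", "toDo bug 4663"]


def find_todos(lines, pattern):
    """Collect the 'todo' group of every line the pattern matches
    from the start of the line."""
    found = []
    for line in lines:
        m = pattern.match(line)
        if m:
            found.append(m.group('todo'))
    return found


print(find_todos(infile, ORIGINAL))
print(find_todos(infile, AMENDED))
```

Only the amended pattern picks up 'stuff to fix' when anchored at the start of the line. With search(), as in the question's loop, the original pattern can also succeed on that line, because the match is free to begin at the 't' after the backslash.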

This gets even more strange as I venture down the road of doctests.
Consider this python script.
If you uncomment the lines 22 and 23, the script passes just fine, as the method returns True, which is both asserted and explicitly compared.
But if you run the file as it stands in the link, the doctest will fail with the message:
% python doctest_test.py
**********************************************************************
File "doctest_test.py", line 3, in __main__.doctest_test
Failed example:
doctest_test( r"""// odo""" )
Exception raised:
Traceback (most recent call last):
File "/usr/lib/python2.6/doctest.py", line 1241, in __run
compileflags, 1) in test.globs
File "<doctest __main__.doctest_test[0]>", line 1, in <module>
doctest_test( r"""// odo""" )
File "doctest_test.py", line 14, in doctest_test
assert input_string == compare_string
AssertionError
**********************************************************************
1 items had failures:
1 of 1 in __main__.doctest_test
***Test Failed*** 1 failures.
Can someone enlighten me here?
I'm still using python 2.6.4 for this.
I'm placing this answer under 'community wiki', as it really does not reputation-wise relate to the question.
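One possible explanation, assuming the docstring in the linked script is not itself a raw string: the r prefix inside a doctest example only takes effect when doctest compiles the example, but by then the enclosing docstring has already turned \t into a tab. The sketch below uses doctest.DocTestParser to show the source text doctest extracts; the function names are placeholders.

```python
import doctest


def plain_docstring():
    """
    >>> len(r"//\todo")
    7
    """


def raw_docstring():
    r"""
    >>> len(r"//\todo")
    7
    """


parser = doctest.DocTestParser()
# .source is the code doctest would compile for the first example.
plain_source = parser.get_examples(plain_docstring.__doc__)[0].source
raw_source = parser.get_examples(raw_docstring.__doc__)[0].source

print(repr(plain_source))  # the backslash-t is gone, replaced by whitespace
print(repr(raw_source))    # the backslash-t is intact
```

This is consistent with the failure report above, which shows r"""// odo""" rather than r"""//\todo""". Marking the docstring itself as raw (r"""...""") keeps the backslash.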
