Python re search on binary strings - python

I am trying to use python re to find binary substrings but I get a somewhat puzzling error.
Here is a small example to demonstrate the issue (python3):
import re
memory = b"\x07\x00\x42\x13"
query1 = (7).to_bytes(1, byteorder="little", signed=False)
query2 = (42).to_bytes(1, byteorder="little", signed=False)
# Works
for match in re.finditer(query1, memory):
    print(match.group(0))
# Causes error
for match in re.finditer(query2, memory):
    print(match.group(0))
The first loop correctly prints b'\x07' while the second gives the following error:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/lib/python3.7/re.py", line 230, in finditer
return _compile(pattern, flags).finditer(string)
File "/usr/lib/python3.7/re.py", line 286, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/lib/python3.7/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "/usr/lib/python3.7/sre_parse.py", line 930, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "/usr/lib/python3.7/sre_parse.py", line 426, in _parse_sub
not nested and not items))
File "/usr/lib/python3.7/sre_parse.py", line 651, in _parse
source.tell() - here + len(this))
re.error: nothing to repeat at position
For context: I am trying to find specific integers within the memory space of a program, in a similar fashion to tools like Cheat Engine. This is being done using Python scripts within gdb.
-- Note 1 --
I have a suspicion that this may be related to the fact that 42 is representable in ascii as * while 7 is not. For example if you print the query strings you get:
>>> print(query1)
b'\x07'
>>> print(query2)
b'*'
-- Note 2 --
Actually it looks like this is unrelated to whether the string is representable in ascii. If you run:
import re
memory = b"\x07\x00\x42\x13"
for i in range(256):
    query = i.to_bytes(1, byteorder="little", signed=False)
    try:
        for match in re.finditer(query, memory):
            pass
    except re.error:
        print(str(i) + " failed -- as ascii: " + chr(i))
It gives:
40 failed -- as ascii: (
41 failed -- as ascii: )
42 failed -- as ascii: *
43 failed -- as ascii: +
63 failed -- as ascii: ?
91 failed -- as ascii: [
92 failed -- as ascii: \
All of the failed bytes correspond to characters which are special in re syntax. This makes me think that python re interprets the query bytes as a regex pattern rather than as a literal substring. I guess that is not entirely unreasonable, but still odd.
Actually, in writing this question I've found a solution, which is to first wrap the query in re.escape(query); this inserts a \ before each special character. I will still post this question in case it may be helpful to others, or in case anyone has more to add.

Decimal 42 is 0x2A, which corresponds to *, a special regex character. You can instead use
re.finditer(re.escape(query2), memory)
which escapes the query (converting * to \*) and finds the literal character * in the string.
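Putting this together, a cheat-engine-style scan can be sketched as a small helper (the name find_int and the default 4-byte little-endian width are my own assumptions, not from the question):

```python
import re

def find_int(memory, value, width=4):
    # Escape the byte pattern so values like 42 (b'*') are matched
    # literally instead of being parsed as regex syntax.
    pattern = re.escape(value.to_bytes(width, byteorder="little", signed=False))
    return [m.start() for m in re.finditer(pattern, memory)]

memory = b"\x07\x00\x42\x13" + (42).to_bytes(4, "little") + b"\x00"
print(find_int(memory, 42))     # [4] -- the 4-byte 42 starts at offset 4
print(find_int(memory, 7, 1))   # [0]
```

In a gdb script the memory bytes would come from something like gdb.selected_inferior().read_memory(). Note also that for a plain substring search, bytes.find() in a loop avoids re (and the escaping issue) entirely.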

Related

Python lex - TypeError: Unknown text

I'm trying to write a simple lex parser. The code is currently:
from ply import lex

tokens = (
    'COMMENT',
    'OTHER'
)

t_COMMENT = r'^\#.*\n'
t_OTHER = r'^[^\#].*\n'

def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))

lex.lex()
lex.input(yaml)
for tok in iter(lex.token, None):
    print repr(tok.type), repr(tok.value)
But it fails to parse this simple input file:
# This is a real comment
#And this one also
#/*
# *
# *Variable de feeu
# */
ma_var: True
It is done, over, kaput
With the following output:
'COMMENT' '# This is a real comment\n'
Traceback (most recent call last):
File "parser_adoc.py", line 62, in <module>
main2()
File "parser_adoc.py", line 57, in main2
for tok in iter(lex.token, None):
File "/usr/lib/python2.7/site-packages/ply/lex.py", line 384, in token
newtok = self.lexerrorf(tok)
File "parser_adoc.py", line 44, in t_error
raise TypeError("Unknown text '%s'" % (t.value,))
TypeError: Unknown text '#And this one also
#/*
# *
# *Variable de feeu
# */
ma_var: True
this is done
'
So in summary, I defined 2 regex:
One for line beginning with #
One for lines beginning not with #
But it's not working.
I don't understand what's wrong with my regex.
Could you help?
Simon
In python regexes (which PLY uses), ^ refers to the beginning of the string, not the beginning of the line, unless multi-line mode has been set. So since both of your rules start with ^, they can only match on the first line.
You could fix this by wrapping your regexes in (?m:...), which enables multi-line mode, but that's not even necessary here. Instead you can just remove the ^ from the beginning of your rules and it will work as you intend. Since both of your rules always match the entire line, the next token will always start at the beginning of the line - no need to anchor them.
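The anchoring behaviour can be illustrated with the stdlib re module alone (a small demonstration of the point, not the PLY fix itself):

```python
import re

text = "# comment one\nma_var: True\n# comment two\n"

# Default mode: ^ only matches at the very start of the string,
# so only the first comment is found.
print(re.findall(r'^\#.*\n', text))       # ['# comment one\n']

# Multi-line mode: ^ matches at the start of every line.
print(re.findall(r'(?m:^\#.*\n)', text))  # ['# comment one\n', '# comment two\n']

# No anchor at all: works here because each match consumes a whole line,
# so the next match necessarily starts at a line boundary.
print(re.findall(r'\#.*\n', text))        # ['# comment one\n', '# comment two\n']
```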

What is the Python version of this?

How would you do this in python?
(It goes through a file, extracts the string between author": " and ", and the string between text": " and \, and then prints them to their respective files.)
Here is an example string before it goes through this:
{"text": "Love this series!\ufeff", "time": "Hace 11 horas", "author": "HasRah", "cid": "UgyvXmvSiMjuDrOQn-l4AaABAg"}
#!/bin/bash
cat html.txt | awk -F 'author": "' {'print $2'} | cut -d '"' -f1 >> users.txt
cat html.txt | awk -F 'text": "' {'print $2'} | cut -d '\' -f1 >> comments.txt
I tried to do it like this in python (Didn't work):
import re
start = '"author": "'
end = '", '
st = open("html.txt", "r")
s = st.readlines()
u = re.search('%s(.*)%s' % (start, end), s).group(1)
#print u.group(1)
Not sure if I'm close.
I get this error code:
Traceback (most recent call last):
File "test.py", line 9, in <module>
u = re.search('%s(.*)%s' % (start, end), s).group(1)
File "/usr/lib/python2.7/re.py", line 146, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or buffer
Before getting into any of this: As chepner pointed out in a comment, this input looks like, and therefore is probably intended to be, JSON. Which means you shouldn't be parsing it with regular expressions; just parse it as JSON:
>>> s = ''' {"text": "Love this series!\ufeff", "time": "Hace 11 horas", "author": "HasRah", "cid": "UgyvXmvSiMjuDrOQn-l4AaABAg"}'''
>>> import json
>>> obj = json.loads(s)
>>> obj['author']
'HasRah'
Actually, it's not clear whether your input is a JSON file (a file containing one JSON text), or a JSONlines file (a file containing a bunch of lines, each of which is a JSON text with no embedded newlines).1
For the former, you want to parse it like this:
obj = json.load(st)
For the latter, you want to loop over the lines, and parse each one like this:
for line in st:
    obj = json.loads(line)
… or, alternatively, you can get a JSONlines library off PyPI.
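Putting the JSONlines approach together, here is a sketch that mirrors the two shell pipelines (io.StringIO stands in for open("html.txt") so the example is self-contained; treating the file as one JSON object per line is an assumption):

```python
import io
import json

# Stand-in for open("html.txt"), using data shaped like the question's.
st = io.StringIO(
    '{"text": "Love this series!\ufeff", "author": "HasRah"}\n'
    '{"text": "Nice video", "author": "SomeUser"}\n'
)

authors, comments = [], []
for line in st:
    obj = json.loads(line)
    authors.append(obj["author"])
    # The bash pipeline cut the text at the first backslash; stripping the
    # trailing \ufeff character here is a rough equivalent.
    comments.append(obj["text"].rstrip("\ufeff"))

print(authors)   # ['HasRah', 'SomeUser']
print(comments)  # ['Love this series!', 'Nice video']
```

To write the results out, join each list with newlines and write to users.txt and comments.txt, just as the >> redirections did.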
But meanwhile, if you want to understand what's wrong with your code:
The error message is telling you the problem, although maybe not in the user-friendliest way:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/re.py", line 148, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
As the docs for search make clear:
re.search(pattern, string, flags=0)
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance…
You haven't passed it a string, you've passed it a list of strings. That's the whole point of readlines, after all.
There are two obvious fixes here.
First, you could read the whole file into a single string, instead of reading it into a list of strings:
s = st.read()
u = re.search('%s(.*)%s' % (start, end), s).group(1)
Alternatively, you could loop over the lines, trying to match each one. And, if you do this, you still don't need readlines(), because a file is already an iterable of lines:
for line in st:
    u = re.search('%s(.*)%s' % (start, end), line).group(1)
While we're at it, if any of your lines don't match the pattern, this is going to raise an AttributeError. After all, search returns None if there's no match, but then you're going to try to call None.group(1).
There are two obvious fixes here as well.
You could handle that error:
try:
    u = re.search('%s(.*)%s' % (start, end), line).group(1)
except AttributeError:
    pass
… or you could check whether you got a match:
m = re.search('%s(.*)%s' % (start, end), line)
if m:
    u = m.group(1)
1. In fact, there are at least two other formats that are nearly, but not quite, identical to JSONlines. I think that if you only care about reading, not creating files, and you don't have any numbers, you can parse all of them with a loop around json.loads or with a JSONlines library. But if you know who created the file, and know that they intended it to be, say, NDJ rather than JSONlines, you should read the docs on NDJ, or get a library made for NDJ, rather than just trusting that some guy on the internet thinks it's OK to treat it as JSONlines.

How to prevent truncating of string in unit test python

I am doing a unit test in Python for my program and I would like to do an assertEquals test.
My code looks something like this:
class UnitTest(unittest.TestCase):
    def test_parser(self):
        self.assertEquals(parser, "some long string", "String is not equal")
However, as my string is too long, I get something like testing[471 chars]0 != testing[473 chars]. I want to see the exact difference between the two strings instead of the truncated ones.
Anyone has an idea how to counter this problem?
To replace [... chars] and [truncated]... with actual characters (no matter how long, and no matter what the type of the compared values are), add this to your *_test.py file:
if 'unittest.util' in __import__('sys').modules:
    # Show full diff in self.assertEqual.
    __import__('sys').modules['unittest.util']._MAX_LENGTH = 999999999
Indeed, as other answers have noted, setting self.maxDiff = None doesn't help here: it doesn't make the [... chars] disappear. However, that setting does help with other types of long diffs, so my recommendation is to do both.
So, I landed on this question because I had an issue where I was using assertEqual() and self.maxDiff = None wouldn't cause the full output to be displayed. Tracing through, it turned out that because the types of the two objects were different (one was a list, one was a generator), the code path that would make use of self.maxDiff wasn't used. So, if you run into the issue where you need the full diff and self.maxDiff isn't working, ensure the types of your two compared objects are the same.
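A minimal sketch of that pitfall: a list never compares equal to a generator, so the type-specific diff code path (and therefore maxDiff) never runs; materialize the generator first.

```python
import io
import unittest

class Demo(unittest.TestCase):
    def test_same_types_diff(self):
        self.maxDiff = None
        gen = (i for i in range(3))
        # A list is never == a generator, regardless of contents:
        self.assertNotEqual([0, 1, 2], gen)
        # Converting first lets assertEqual (and maxDiff) work as expected:
        self.assertEqual([0, 1, 2], list(gen))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(Demo)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(result.wasSuccessful())  # True
```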
unittest.TestCase.assertEqual tries to give you the actual difference between the strings while at the same time making the text fit on your screen.
To do this it truncates the common sections, so sections that have no differences are truncated by replacing them with [<count> chars] chunks:
>>> case.assertEqual('foo' * 200, 'foo' * 100 + 'bar' + 'foo' * 99)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.6/unittest/case.py", line 821, in assertEqual
assertion_func(first, second, msg=msg)
File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.6/unittest/case.py", line 1194, in assertMultiLineEqual
self.fail(self._formatMessage(msg, standardMsg))
File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.6/unittest/case.py", line 666, in fail
raise self.failureException(msg)
AssertionError: 'foof[291 chars]oofoofoofoofoofoofoofoofoofoofoofoofoofoofoofo[255 chars]ofoo' != 'foof[291 chars]oofoobarfoofoofoofoofoofoofoofoofoofoofoofoofo[255 chars]ofoo'
Diff is 1819 characters long. Set self.maxDiff to None to see it.
In the above example, the two strings share a long prefix, which has been shortened by replacing 291 characters with [291 chars] in both prefixes. They also share a long postfix, again shortened in both locations by replacing text with [255 chars].
The actual difference is still being displayed, right in the middle.
Of course, when you make that difference too long, then even the difference is truncated:
>>> case.assertEqual('foo' * 200, 'foo' * 80 + 'bar' * 30 + 'foo' * 80)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.6/unittest/case.py", line 821, in assertEqual
assertion_func(first, second, msg=msg)
File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.6/unittest/case.py", line 1194, in assertMultiLineEqual
self.fail(self._formatMessage(msg, standardMsg))
File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.6/unittest/case.py", line 666, in fail
raise self.failureException(msg)
AssertionError: 'foof[231 chars]oofoofoofoofoofoofoofoofoofoofoofoofoofoofoofo[315 chars]ofoo' != 'foof[231 chars]oofoobarbarbarbarbarbarbarbarbarbarbarbarbarba[285 chars]ofoo'
Diff is 1873 characters long. Set self.maxDiff to None to see it.
Here, the common postfix is starting to differ, but the start of the difference is still visible, and should help you figure out where the text went wrong.
If this is still not enough, you can either increase or eliminate the diff limit. Set the TestCase.maxDiff attribute to a higher number (the default is 80 * 8, i.e. 8 lines of 80 characters), or set it to None to remove the limit altogether:
self.maxDiff = None
Note that unless your string contains newlines, the diff is likely to be unreadable:
AssertionError: 'foof[231
chars]oofoofoofoofoofoofoofoofoofoofoofoofoofoofoofo[315 chars]ofoo'
!= 'foof[231 chars]oofoobarbarbarbarbarbarbarbarbarbarbarbarbarba[285
chars]ofoo'
- foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo
?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoobarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarfoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo
?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In that case it may be more useful to wrap your input and output texts:
from textwrap import wrap
self.maxDiff = None
self.assertEquals(wrap(parser), wrap("some long string"), "String is not equal")
just so you get better and more readable diff output:
>>> from textwrap import wrap
>>> case.assertEqual(wrap('foo' * 200), wrap('foo' * 80 + 'bar' * 30 + 'foo' * 80))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.6/unittest/case.py", line 821, in assertEqual
assertion_func(first, second, msg=msg)
File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.6/unittest/case.py", line 1019, in assertListEqual
self.assertSequenceEqual(list1, list2, msg, seq_type=list)
File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.6/unittest/case.py", line 1001, in assertSequenceEqual
self.fail(msg)
File "/Users/mjpieters/Development/Library/buildout.python/parts/opt/lib/python3.6/unittest/case.py", line 666, in fail
raise self.failureException(msg)
AssertionError: Lists differ: ['foo[244 chars]oofoofoofoofoofoofoofoofoofoofoofoofoofoofoof'[336 chars]foo'] != ['foo[244 chars]oofoobarbarbarbarbarbarbarbarbarbarbarbarbarb'[306 chars]foo']
First differing element 3:
'foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoof'
'foofoofoofoofoofoofoofoofoofoobarbarbarbarbarbarbarbarbarbarbarbarbarb'
['foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoof',
'oofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofo',
'ofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo',
- 'foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoof',
- 'oofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofo',
+ 'foofoofoofoofoofoofoofoofoofoobarbarbarbarbarbarbarbarbarbarbarbarbarb',
+ 'arbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarfoofoofoofoofoofoofo',
'ofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo',
'foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoof',
'oofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofo',
- 'ofoofoofoofoofoofoofoofoofoofoofoofoofoo']
+ 'ofoofoofoo']

How to use variable in python regex?

I am trying to process user input in regex from variable. After a lot of searching I have come up with following:
Explanation of the code variables:
step is a string used as input for the regex,
e.g.
replace|-|space ,
replace|*|null,
replace|/|\|squot|space
b is a list of elements; an element is fetched and modified as per the regex.
i is an integer received from another function, used as an index into the list b.
I process the above string to get an array, then use the last element of the array as the substitution string.
The first element is deleted as it is not required.
All other elements need to be replaced with the substitution string.
def replacer(step, i, b):
    steparray = step.split('|')
    del steparray[0]
    final = steparray.pop()
    if final == "space":
        subst = u" "
    elif final == "squot":
        subst = u"'"
    elif final == "dquot":
        subst = u"\""
    else:
        subst = u"%s" % final
    for input in xrange(0, len(steparray)):
        test = steparray[input]
        regex = re.compile(ur'%s' % test)
        b[i] = re.sub(regex, subst, b[i])
    print b[i]
However, when I run above code, following error is shown:
File "CSV_process.py", line 78, in processor
replacer(step,i,b)
File "CSV_process.py", line 115, in replacer
regex = re.compile(ur'%s'%test)
File "/usr/lib/python2.7/re.py", line 190, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.7/re.py", line 242, in _compile
raise error, v # invalid expression
sre_constants.error: nothing to repeat
I tried a lot but don't understand how the regex works. Please help with this error.
My final requirement is to take a special character from user input and replace it with another character (again from user input).
PS: Also, my code does not have 242 lines, yet the error is reported on line 242. Is the error occurring past the end of the array in the for loop?
Some special characters like * should be escaped to match literally.
>>> import re
>>> re.compile('*')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\re.py", line 194, in compile
return _compile(pattern, flags)
File "C:\Python27\lib\re.py", line 251, in _compile
raise error, v # invalid expression
sre_constants.error: nothing to repeat
Using re.escape, you can escape them:
>>> print(re.escape('*'))
\*
>>> re.compile(re.escape('*'))
<_sre.SRE_Pattern object at 0x000000000273DF10>
BTW, if you want to simply replace them, regular expression is not necessary. Why don't you use str.replace?
replaced_string = string_object.replace(old, new)
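Applying that to the question's replacer, here is a Python 3 sketch using str.replace (keeping the original's keyword handling, where an unrecognized keyword such as null is substituted literally):

```python
def replacer(step, i, b):
    # "replace|-|space": drop the leading "replace", pop the substitution
    # keyword, and replace the remaining tokens literally -- no regex needed.
    parts = step.split('|')[1:]
    subst = {"space": " ", "squot": "'", "dquot": '"'}.get(parts[-1], parts[-1])
    for old in parts[:-1]:
        b[i] = b[i].replace(old, subst)
    return b[i]

b = ["foo-bar*baz"]
print(replacer("replace|-|space", 0, b))  # foo bar*baz
print(replacer("replace|*|null", 0, b))   # foo barnullbaz
```

Note that the question's replace|/|\|squot|space case would still need an escaping convention for | itself, since split('|') cannot distinguish a literal pipe from a separator.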

Python kludge to read UCS-2 (UTF-16?) as ASCII

I'm in a little over my head on this one, so please pardon my terminology in advance.
I'm running this using Python 2.7 on Windows XP.
I found some Python code that reads a log file, does some stuff, then displays something.
What, that's not enough detail? Ok, here's a simplified version:
#!/usr/bin/python
import re
import sys

class NotSupportedTOCError(Exception):
    pass

def filter_toc_entries(lines):
    while True:
        line = lines.next()
        if re.match(r""" \s*
                .+\s+ \| (?#track)
                \s+.+\s+ \| (?#start)
                \s+.+\s+ \| (?#length)
                \s+.+\s+ \| (?#start sec)
                \s+.+\s*$ (?#end sec)
                """, line, re.X):
            lines.next()
            break
    while True:
        line = lines.next()
        m = re.match(r"""
                ^\s*
                (?P<num>\d+)
                \s*\|\s*
                (?P<start_time>[0-9:.]+)
                \s*\|\s*
                (?P<length_time>[0-9:.]+)
                \s*\|\s*
                (?P<start_sector>\d+)
                \s*\|\s*
                (?P<end_sector>\d+)
                \s*$
                """, line, re.X)
        if not m:
            break
        yield m.groupdict()

def calculate_mb_toc_numbers(eac_entries):
    eac = list(eac_entries)
    num_tracks = len(eac)
    tracknums = [int(e['num']) for e in eac]
    if range(1, num_tracks + 1) != tracknums:
        raise NotSupportedTOCError("Non-standard track number sequence: %s", tracknums)
    leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
    offsets = [(int(x['start_sector']) + 150) for x in eac]
    return [1, num_tracks, leadout_offset] + offsets

f = open(sys.argv[1])
mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart
The code works fine as long as the log file is "simple" text (I'm tempted to say ASCII although that may not be precise/accurate - for e.g. Notepad++ indicates it's ANSI).
However, the script doesn't work on certain log files (in these cases, Notepad++ says "UCS-2 Little Endian").
I get the following error:
Traceback (most recent call last):
File "simple.py", line 55, in <module>
mb_toc_urlpart = "%20".join(str(x) for x in calculate_mb_toc_numbers(filter_
toc_entries(f)))
File "simple.py", line 49, in calculate_mb_toc_numbers
leadout_offset = int(eac[-1]['end_sector']) + 150 + 1
IndexError: list index out of range
This log works
This log breaks
I believe it's the encoding that's breaking the script because if I simply do this at a command prompt:
type ascii.log > scrubbed.log
and then run the script on scrubbed.log, the script works fine (this is actually fine for my purposes since there's no loss of important information and I'm not writing back to a file, just printing to the console).
One workaround would be to "scrub" the log file before passing it to Python (e.g. using the type pipe trick above to a temporary file and then have the script run on that), but I would like to have Python "ignore" the encoding if it's possible. I'm also not sure how to detect what type of log file the script is reading so I can act appropriately.
I'm reading this and this but my eyes are still spinning around in their head, so while that may be my longer term strategy, I'm wondering if there's an interim hack I could use.
codecs.open() will allow you to open a file using a specific encoding, and it will produce unicodes. You can try a few, going from most likely to least likely (or the tool could just always produce UTF-16LE but ha ha fat chance).
Also, "Unicode In Python, Completely Demystified".
works.log appears to be encoded in ASCII:
>>> data = open('works.log', 'rb').read()
>>> all(d < '\x80' for d in data)
True
breaks.log appears to be encoded in UTF-16LE -- it starts with the 2 bytes '\xff\xfe'. None of the characters in breaks.log are outside the ASCII range:
>>> data = open('breaks.log', 'rb').read()
>>> data[:2]
'\xff\xfe'
>>> udata = data.decode('utf16')
>>> all(d < u'\x80' for d in udata)
True
If these are the only two possibilities, you should be able to get away with the following hack. Change your mainline code from:
f = open(sys.argv[1])
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(f)))
print mb_toc_urlpart
to this:
f = open(sys.argv[1], 'rb')
data = f.read()
f.close()
if data[:2] == '\xff\xfe':
    data = data.decode('utf16').encode('ascii')
# ilines is a generator which produces newline-terminated strings
ilines = (line + '\n' for line in data.splitlines())
mb_toc_urlpart = "%20".join(
    str(x) for x in calculate_mb_toc_numbers(filter_toc_entries(ilines)))
print mb_toc_urlpart
Python 2.x expects normal strings to be ASCII (or at least single-byte). Try this:
Put this at the top of your Python source file:
from __future__ import unicode_literals
And change all the str to unicode.
[edit]
And as Ignacio Vazquez-Abrams wrote, try codecs.open() to open the input file.
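In Python 3 the whole kludge collapses to sniffing the BOM and choosing an encoding up front (open_log is a hypothetical helper name; only the two encodings seen in this question are handled):

```python
import codecs

def open_log(path):
    # UTF-16 files start with a byte-order mark; the ASCII logs don't.
    with open(path, 'rb') as f:
        head = f.read(2)
    if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        encoding = 'utf-16'   # decoding strips the BOM automatically
    else:
        encoding = 'ascii'
    return open(path, encoding=encoding)
```

The returned file iterates over unicode lines, so the rest of the script (the regex matching in filter_toc_entries) works unchanged on either kind of log.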
