UnicodeDecodeError with seek() and read() - python

I am following example code in Programming Python, and something confuses me. Here's the code that writes a simple string to a file and then reads it back:
>>> data = 'sp\xe4m' # data to your script
>>> data, len(data) # 4 unicode chars, 1 nonascii
('späm', 4)
>>> data.encode('utf8'), len(data.encode('utf8')) # bytes written to file
(b'sp\xc3\xa4m', 5)
>>> f = open('test', mode='w+', encoding='utf8') # use text mode, encoded
>>> f.write(data)
>>> f.flush()
>>> f.seek(0); f.read(1) # ascii bytes work
's'
>>> f.seek(2); f.read(1) # as does 2-byte nonascii
'ä'
>>> data[3] # but offset 3 is not 'm' !
'm'
>>> f.seek(3); f.read(1)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 0:
unexpected code byte
What confuses me is this: why does this UnicodeDecodeError happen if the data string is UTF-8 encoded? Reading with a plain f.read() works fine, but when I use seek() to jump and then read(1), this error shows up.

Seeking within a file will move the read pointer by bytes, not by characters. The .read() call expects to be able to read whole characters instead. Because UTF-8 uses multiple bytes for any unicode codepoint beyond the ASCII character set, you cannot just seek into the middle of a multi-byte UTF-8 codepoint and expect .read() to work.
The U+00E4 codepoint (the glyph ä) is encoded to two bytes, C3 and A4. In the file, this means there are now 5 bytes, representing s, p, the hex bytes C3 and A4, then m.
By seeking to position 3, you moved the file pointer to the A4 byte, and calling .read() then fails because without the preceding C3 byte, there is not enough context to decode the character. This raises the UnicodeDecodeError; the A4 byte is unexpected, as a lone continuation byte is not a valid UTF-8 sequence.
Seek to position 4 instead:
>>> f.seek(4); f.read(1)
'm'
Better still, don't seek around in UTF-8 data, or open the file in binary mode and decode manually.
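As a rough illustration of the binary-mode approach, here is a minimal Python 3 sketch. The helper name byte_offsets is hypothetical, and an in-memory io.BytesIO stands in for a file opened with mode 'rb'. It computes the byte offset at which each character starts, then seeks only to those offsets and decodes by hand.

```python
import io

def byte_offsets(text, encoding="utf-8"):
    """Return the byte offset at which each character of text starts."""
    offsets = []
    position = 0
    for char in text:
        offsets.append(position)
        position += len(char.encode(encoding))
    return offsets

data = "sp\xe4m"
raw = data.encode("utf-8")        # 5 bytes: s, p, C3, A4, m
offsets = byte_offsets(data)      # character index -> byte offset

stream = io.BytesIO(raw)          # stands in for open('test', 'rb')

# Seek to the start of the 4th character and decode one byte manually.
stream.seek(offsets[3])
last = stream.read(1).decode("utf-8")

# Read the whole 2-byte sequence for the 3rd character.
stream.seek(offsets[2])
middle = stream.read(offsets[3] - offsets[2]).decode("utf-8")
```

Seeking to any offset that is not in the computed list (byte 3 here) lands inside a multi-byte sequence, which is exactly what triggers the error in the question.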

Related

Using ASCII number to character in python

I am trying to print a list of dicts to file that's encoded in latin-1. Each field is to be separated by an ASCII character 254 and the end of line should be ASCII character 20.
When I try to use a character that is greater than 128 I get "UnicodeDecodeError: 'ascii' codec can't decode byte 0xfe in position 12: ordinal not in range(128)"
This is my current code. Could someone help me with how to encode an ASCII char 254 and how to add an end of line ASCII char 20 when using DictWriter?
Thanks
my Code:
with codecs.open("test.dat", "w", "ISO-8859-1") as outputFile:
    delimiter = (chr(254))
    keys = file_dict[0].keys()
    dict_writer = csv.DictWriter(outputFile, keys, delimiter=delimiter)
    dict_writer.writeheader()
    for value in file_dict:
        dict_writer.writerow(value)
ASCII only contains character codes 0-127.
Codes in the range 128-255 are not defined in ASCII, only in encodings that extend it, such as the Windows "ANSI" code pages or latin-1, and in Unicode.
In your case the string is probably being double-encoded somewhere, which fails.
It works if you use the standard built-in open function without specifying a codec:
with open("test.dat", "w") as outputFile: # omit the codec stuff here
    delimiter = (chr(254))
    keys = file_dict[0].keys()
    dict_writer = csv.DictWriter(outputFile, keys, delimiter=delimiter)
    dict_writer.writeheader()
    for value in file_dict:
        dict_writer.writerow(value)
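The answer above targets Python 2. For comparison, a hedged Python 3 sketch follows. It assumes that "character 254" and "character 20" are decimal values (0xFE and 0x14), and the helper name dump_rows and the sample row are made up for illustration. In Python 3 the csv module works on text, so the encoding is given to the file object and the line ending is set with the lineterminator parameter.

```python
import csv
import io

def dump_rows(rows, delimiter="\xfe", lineterminator="\x14"):
    """Write rows (a list of dicts) as latin-1 bytes with custom separators."""
    buffer = io.BytesIO()  # stands in for open("test.dat", "wb")
    # newline="" stops the text layer from translating line endings.
    text = io.TextIOWrapper(buffer, encoding="latin-1", newline="")
    writer = csv.DictWriter(
        text,
        fieldnames=list(rows[0].keys()),
        delimiter=delimiter,            # chr(254), assumed decimal
        lineterminator=lineterminator,  # chr(20), assumed decimal
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    text.flush()
    return buffer.getvalue()

# Hypothetical sample data, not taken from the question.
encoded = dump_rows([{"name": "caf\xe9", "qty": 2}])
```

With a real file, the equivalent would be open("test.dat", "w", encoding="latin-1", newline="") passed straight to csv.DictWriter.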

Unicode in python 3

I want to convert a string that contains Unicode escape sequences to ordinary text. For example, the file "input.txt" contains the string '\u0057\u0068\u0061\u0074', and I want to know what it means. If the string is entered in the code like this:
s = '\u0057\u0068\u0061\u0074'
b = s.encode('utf-8')
print(b)
it works perfectly, but if I do the same with the file I get this result: b'\\u0057\\u0068\\u0061\\u0074'.
How can I fix this problem? I'm on Windows 8, and the file's encoding is 'windows-1251'.
If your file contains those unicode escape sequences, then you can use the unicode_escape “codec” to interpret them after you read the file contents as a string.
>>> s = r'\u0057\u0068\u0061\u0074'
>>> print(s)
\u0057\u0068\u0061\u0074
>>> s.encode('utf-8').decode('unicode_escape')
'What'
Or, you can just read a bytes string directly and decode that:
with open('file.txt', 'rb') as f:
    print(f.read().decode('unicode_escape'))
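One caveat worth sketching: the unicode_escape codec treats the remaining bytes as latin-1, so a windows-1251 file that also contains real Cyrillic text may come out garbled. The alternative below is not the codec-based method from the answer; it is a plain regular-expression substitution that expands only \uXXXX sequences in an already decoded string. The function name expand_u_escapes is hypothetical.

```python
import re

# Matches a backslash, the letter u, and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def expand_u_escapes(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return _U_ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), text)

# Text as it might look after reading the file with encoding='windows-1251'.
from_file = r"\u0057\u0068\u0061\u0074"
decoded = expand_u_escapes(from_file)

# Non-ASCII characters that are not escaped pass through untouched.
mixed = expand_u_escapes("привет " + r"\u0057")
```

Other escape forms such as \n or \x41 are deliberately left alone by this sketch.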

Extract first 20 bytes in a binary header

I'm trying to learn how to do this in Python. I played around with the pseudocode below but couldn't come up with anything worth a penny:
with open(file, "rb") as f:
    byte = f.read(20) # read the first 20 bytes?
    while byte != "":
        print f.read(1)
In the end I'd like to end up with code capable of the following: https://stackoverflow.com/a/2538034/2080223
But I'm of course interested in learning how to get there, so any pointers would be much appreciated!
Very close:
with open(file, "rb") as f:
    byte = f.read(20) # read the first 20 bytes? *Yes*
will indeed read the first 20 bytes.
But
    while byte != "":
        print f.read(1) # print a single byte?
will (as you expect) read a single byte and print it, but it will print it forever, since your loop condition will always be true.
It's not clear what you want to do here, but if you just want to print a single byte, removing the while loop will do that:
print f.read(1)
If you want to print single bytes until the end of file, consider:
while True:
    byte = f.read(1)
    if byte == "": break
    print byte
Alternatively, if you're looking for specific bytes within the first 20 you read into byte, you can use iterable indexing:
with open(file, "rb") as f:
    byte = f.read(20)
    print byte[0] # First byte of the 20 bytes / first byte of the file
    print byte[1] # Second byte of the 20 bytes / ...
    # ...
Or as Lucas suggests in the comments, you could iterate over the string byte (it's a string by the way, that's returned from read()):
with open(file, "rb") as f:
    byte = f.read(20)
    for b in byte:
        print b
You may also be interested in the position of each byte and its hexadecimal value (for values like 0x0a, 0x0d, etc.). Because each b is a one-character string here, ord() converts it to an integer for the %02x format:
with open(file, "rb") as f:
    byte = f.read(20)
    for i,b in enumerate(byte):
        print "%02d: %02x" % (i, ord(b))
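The snippets above are Python 2. A rough Python 3 equivalent might look like the sketch below, where read() in binary mode returns a bytes object and iterating over it yields integers, so no ord() call is needed. The function name dump_header and the sample bytes are assumptions for illustration, and io.BytesIO stands in for a real file opened with "rb".

```python
import io

def dump_header(stream, count=20):
    """Read up to count bytes and format each as 'position: hex value'."""
    header = stream.read(count)
    # In Python 3, iterating over bytes yields ints in the range 0-255.
    return ["%02d: %02x" % (index, value) for index, value in enumerate(header)]

# Hypothetical sample: 8 signature-like bytes followed by zero padding.
sample = io.BytesIO(b"\x89PNG\r\n\x1a\n" + bytes(20))
lines = dump_header(sample)

for line in lines:
    print(line)
```

Note that in Python 3 the end-of-file check from the earlier loop would compare against b"" rather than "".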

Why is the first line longer?

I'm using Python to read a txt document with:
f = open(path,"r")
for line in f:
    line = line.decode('utf8').strip()
    length = len(line)
    firstLetter = line[:1]
It seems to work, but the first line's length is always longer by 1.
For example:
the first line is "XXXX", where X denotes a Chinese character;
then length will be 5, not 4,
and firstLetter will be nothing.
But from the second line onward, it works properly.
Thanks!
You have a UTF-8 BOM at the start of your file. Don't faff about inspecting the first character. Instead of the utf8 encoding, use the utf_8_sig encoding with either codecs.open() or your_byte_string.decode() ... this sucks up the BOM if it exists and you don't see it in your code.
>>> bom8 = u'\ufeff'.encode('utf8')
>>> bom8
'\xef\xbb\xbf'
>>> bom8.decode('utf8')
u'\ufeff'
>>> bom8.decode('utf_8_sig')
u'' # removes the BOM
>>> 'abcd'.decode('utf_8_sig')
u'abcd' # doesn't care if no BOM
>>>
You are probably getting the Byte Order Mark (BOM) as the first character on the first line.
Information about dealing with it is here

Python write to file

I've got a little problem here.
I'm converting binary to ascii, in order to compress data.
All seems to work fine, but when I convert '11011011' to ascii and try to write it into file, I keep getting error
UnicodeEncodeError: 'charmap' codec can't encode character '\xdb' in position 0: character maps to <undefined>
Here's my code:
byte = ""
handleR = open(self.getInput())
handleW = open(self.getOutput(), 'w')
file = handleR.readlines()
for line in file:
    for a in range(0, len(line)):
        chunk = result[ord(line[a])]
        for b in chunk:
            if (len(byte) < 8):
                byte+=str(chunk[b])
            else:
                char = chr(eval('0b'+byte))
                print(byte, char)
                handleW.write(char)
                byte = ""
handleR.close()
handleW.close()
Any help appreciated,
Thank You
I think you want:
handleR = open(self.getInput(), 'rb')
handleW = open(self.getOutput(), 'wb')
That will ensure you're reading and writing byte streams. Also, you can parse binary strings without eval:
char = chr(int(byte, 2))
And of course, it would be faster to use bit manipulation. Instead of appending to a string, you can use << (left shift) and | (bitwise or).
EDIT: For the actual writing, you can use:
handleW.write(bytes([char]))
This creates and writes a bytes object from a list consisting of a single number.
EDIT 2: Correction, it should be:
handleW.write(bytes([int(byte, 2)]))
There is no need to use chr.
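The bit-manipulation suggestion could be sketched roughly as follows. The function name pack_bits is hypothetical, and padding the final partial byte with zero bits on the right is an assumption, since the question does not say how leftover bits should be handled.

```python
def pack_bits(bits):
    """Pack a string of '0' and '1' characters into bytes, 8 bits per byte."""
    packed = bytearray()
    current = 0   # bits accumulated so far for the current byte
    count = 0     # how many bits are in current
    for bit in bits:
        # Shift the accumulated bits left and OR in the new bit.
        current = (current << 1) | (1 if bit == "1" else 0)
        count += 1
        if count == 8:
            packed.append(current)
            current = 0
            count = 0
    if count:
        # Assumption: pad an incomplete last byte with zeros on the right.
        packed.append(current << (8 - count))
    return bytes(packed)

single = pack_bits("11011011")
padded = pack_bits("110110111")
```

The result can be written directly to a file opened with 'wb', for example handleW.write(pack_bits(bit_string)), with no chr() or text encoding involved.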
