Converting widechars to system ANSI encoding in Python


I am currently trying to make my screen reader work better with Becky! Internet Mail. The problem I am facing is related to its list view. This control is not Unicode aware, but the items are custom drawn on screen, so to someone looking at it the content of all fields looks okay regardless of encoding. When accessed via MSAA or UIA, however, basic ANSI characters and mails encoded with the code page set for non-Unicode programs have their text correct, whereas mails encoded in Unicode do not.
Samples of the text :
Zażółć gęślą jaźń
is represented by:
ZaĹĽĂłĹ‚Ä‡ gÄ™Ĺ›lÄ… jaĹşĹ„
In this case it is damaged CP1250 as per answer below.
However:
⚠️
is represented by:
âš ď¸Ź
⏰
is represented by:
âŹ°
and
高生旺
is represented by:
é«ç”źć—ş
I've just assumed that these strings are damaged beyond repair, however when unicode beta support in windows 10 is enabled they are exposed correctly.
Is it possible to simulate this behavior in Python?
The solution needs to work in both Python 2 and 3.
At the moment I am simply replacing known combinations of these characters with their proper representations, but it is not a very good solution, because the lists of replacements and characters to replace need to be updated with each newly discovered character.

Your UTF-8 is being decoded as cp1250.
What I did in python3 is this:
orig = "Zażółć gęślą jaźń"
wrong = "ZaĹĽĂłĹ‚Ä‡ gÄ™Ĺ›lÄ… jaĹşĹ„"
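for enc in range(437, 1300):
    try:
        res = orig.encode().decode(f"cp{enc}")
        if res == wrong:
            print('FOUND', res, enc)
    except (LookupError, UnicodeDecodeError):
        # unknown code page number, or a byte undefined in that code page
        pass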
...and the result was the 1250 codepage.
So your solution shall be:
import sys
def restore(garbaged):
    # python 3
    if sys.version_info.major > 2:
        return garbaged.encode('cp1250').decode()
    # python 2
    else:
        # is it a string
        try:
            return garbaged.decode('utf-8').encode('cp1250')
        # or is it unicode
        except UnicodeEncodeError:
            return garbaged.encode('cp1250')
EDIT:
The reason why "高生旺" can not be recovered from é«ç”źć—ş:
"高生旺".encode('utf-8') is b'\xe9\xab\x98\xe7\x94\x9f\xe6\x97\xba'.
The problem is the \x98 part. In cp1250 no character is assigned to that byte value. If you try this:
"高生旺".encode('utf-8').decode('cp1250')
You will get this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 2: character maps to <undefined>
The way to get "é«ç”źć—ş" is:
"高生旺".encode('utf-8').decode('cp1250', 'ignore')
But the ignore part is critical, it causes data loss:
'é«ç”źć—ş'.encode('cp1250') is b'\xe9\xab\xe7\x94\x9f\xe6\x97\xba'.
If you compare these two:
b'\xe9\xab\xe7\x94\x9f\xe6\x97\xba'
b'\xe9\xab\x98\xe7\x94\x9f\xe6\x97\xba'
you will see that the \x98 character is missing so when you try to restore the original content, you will get a UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte.
If you try this:
'é«ç”źć—ş'.encode('cp1250').decode('utf-8', 'backslashreplace')
The result will be '\\xe9\\xab生旺'. \xe9\xab\x98 could be decoded to 高, from \xe9\xab it is not possible.
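A minimal sketch of the round trip discussed above, assuming the damaged text arrives as a Unicode string and that the wrong code page was cp1250. The name try_restore and its None return value are illustrative choices, not part of any library. It is written for Python 3; the same two method calls also work on Python 2 unicode objects.

```python
def try_restore(garbled, wrong_codec="cp1250"):
    """Undo a UTF-8 -> wrong_codec mis-decode; return None if bytes were lost."""
    try:
        return garbled.encode(wrong_codec).decode("utf-8")
    except UnicodeError:
        # Either a character is not in wrong_codec, or the recovered
        # bytes are no longer valid UTF-8 (for example a dropped \x98).
        return None


polish = "Zażółć gęślą jaźń"
damaged = polish.encode("utf-8").decode("cp1250")
print(try_restore(damaged) == polish)   # True

lossy = "高生旺".encode("utf-8").decode("cp1250", "ignore")
print(try_restore(lossy))               # None
```

Returning None instead of raising lets a caller fall back to the original text when the string cannot be repaired.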

Related

Python utf-8 encoding not following unicode rules

Background: I've got a byte file that is encoded using unicode. However, I can't figure out the right method to get Python to decode it to a string. Sometimes it uses 1-byte ASCII text. The majority of the time it uses 2-byte "plain latin" text, but it can possibly contain any unicode character. So my python program needs to be able to decode that and handle it. Unfortunately byte_string.decode('unicode') isn't a thing, so I need to specify another encoding scheme. Using Python 3.9
I've read through the Python docs on unicode and utf-8. If Python uses unicode for its strings, and utf-8 as default, this should be pretty straightforward, yet I keep getting incorrect decodes.
If I understand how unicode works, the most significant byte is the character code, and the least significant byte is the lookup value in the decode table. So I would expect 0x00_41 to decode to "A",
0x00_F2 => ò
0x65_03_01 => é (e with combining acute accent).
I wrote a short test file to experiment with these byte combinations, and I'm running into a few situations that I don't understand (despite extensive reading).
Example code:
def main():
    print("Starting MAIN...")
    vrsn_bytes = b'\x76\x72\x73\x6E'
    serato_bytes = b'\x00\x53\x00\x65\x00\x72\x00\x61\x00\x74\x00\x6F'
    special_bytes = b'\xB2\xF2'
    combining_bytes = b'\x41\x75\x64\x65\x03\x01'
    print(f"vrsn_bytes: {vrsn_bytes}")
    print(f"serato_bytes: {serato_bytes}")
    print(f"special_bytes: {special_bytes}")
    print(f"combining_bytes: {combining_bytes}")
    encoding_method = 'utf-8' # also tried latin-1 and cp1252
    vrsn_str = vrsn_bytes.decode(encoding_method)
    serato_str = serato_bytes.decode(encoding_method)
    special_str = special_bytes.decode(encoding_method)
    combining_str = combining_bytes.decode(encoding_method)
    print(f"vrsn_str: {vrsn_str}")
    print(f"serato_str: {serato_str}")
    print(f"special_str: {special_str}")
    print(f"combining_str: {combining_str}")
    return True
if __name__ == '__main__':
    print("Starting Command Line Experiment!")
    if not main():
        print("\n Command Line Test FAILED!!")
    else:
        print("\n Command Line Test PASSED!!")
Issue 1: utf-8 encoding. As the experiment is written, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 0: invalid start byte
I don't understand why this fails to decode, according to the unicode decode table, 0x00B2 should be "SUPERSCRIPT TWO". In fact, it seems like anything above 0x7F returns the same UnicodeDecodeError.
I know that some encoding schemes only support 7 bits, which is what seems like is happening, but utf-8 should support not only 8 bits, but multiple bytes.
If I changed encoding_method to encoding_method = 'latin-1' which extends the original ascii 128 characters to 256 characters (up to 0xFF), then I get a better output:
vrsn_str: vrsn
serato_str: Serato
special_str: ²ò
combining_str: Aude
However, this encoding is not handling the 2-byte codes properly. \x00_53 should be S, not �S, and none of the encoding methods I'll mention in this post handle the combining acute accent after Aude properly.
So far I've tried many different encoding methods, but the ones that come closest are unicode_escape, latin-1, and cp1252. While I expect utf-8 to be what I'm supposed to use, it does not behave the way it's described in the Python doc linked above.
Any help is appreciated. Besides trying more methods, I don't understand why this isn't decoding according to the table in link 3.
UPDATE:
After some more reading, and seeing your responses, I understand why you're so confused. I'm going to explain further so that hopefully this helps someone in the future.
The byte file that I'm decoding is not mine (which is why the encoding does not make sense). What I see now is that the bytes represent the code point, not the byte representation of the unicode character.
For example: I want 0x00_F2 to translate to ò. But the actual UTF-8 byte representation of ò is 0xC3_B2. What I have is the integer representation of the code point. So instead of decoding, what I actually need to do is convert 0x00F2 to an integer (242); then I can use chr(242) to get the unicode character.
I don't know why the file was written this way, and I can't change it. But I see now why the decoding wasn't working. Hopefully this helps someone avoid the frustration I went through figuring this out.
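A minimal sketch of that conversion, assuming every character in a field is stored as a 2-byte big-endian code point value (fields that mix 1-byte and 2-byte characters are not handled). decode_code_points is a hypothetical helper name.

```python
def decode_code_points(data):
    """Interpret bytes as consecutive 2-byte big-endian code point values."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    return "".join(
        chr(int.from_bytes(data[i:i + 2], "big"))
        for i in range(0, len(data), 2)
    )


serato_bytes = b'\x00\x53\x00\x65\x00\x72\x00\x61\x00\x74\x00\x6F'
print(decode_code_points(serato_bytes))   # Serato
print(decode_code_points(b'\x00\xF2'))    # ò

# For these values the built-in UTF-16 big-endian codec gives the same text.
print(serato_bytes.decode('utf-16-be'))   # Serato
```

If the whole field really is 2-byte big-endian, the built-in codec is the simpler choice; the manual loop only makes the chr() step explicit.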
Thanks!

This isn't actually a python issue, it's how you're encoding the character. To convert a unicode codepoint to utf-8, you do not simply get the bytes from the codepoint position.
For example, the code point U+2192 is →. The actual binary representation in utf-8 is: 0xE28692, or 11100010 10000110 10010010
As we can see, this is 3 bytes, not 2 as we'd expect if we only used the position. To get correct behavior, you can either do the encoding by hand, or use a converter such as this one:
https://onlineunicodetools.com/convert-unicode-to-binary
This will let you input a unicode character and get the utf-8 binary representation.
To get correct output for ò, we need to use 0xC3B2.
>>> s = b'\xC3\xB2'
>>> print(s.decode('utf-8'))
ò
The reason why you can't use the direct binary representation is the header bits in each byte. In utf-8, a code point can take 1, 2, 3, or 4 bytes. For example, to signify a 1-byte code point, the first bit is encoded as a 0. This means that we can only store 2^7 1-byte code points. So, the codepoint U+0080, which is a control character, must be encoded as a 2-byte character, namely 11000010 10000000.
For this character, the first byte begins with the header 110, while the second byte begins with the header 10. This means that the data for the codepoint is stored in the last 5 bits of the first byte and the last 6 bits of the second byte. If we combine those, we get
00010 000000, which is equivalent to 0x80.
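The bit layout described above can be sketched by hand for the 2-byte case. This is only an illustration of the header bits; encode_two_byte is a hypothetical helper, and real code should simply call str.encode('utf-8').

```python
def encode_two_byte(code_point):
    """Hand-encode a code point in the range U+0080..U+07FF as UTF-8."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the 2-byte range is handled here")
    first = 0b11000000 | (code_point >> 6)         # header 110 + top 5 bits
    second = 0b10000000 | (code_point & 0b111111)  # header 10 + low 6 bits
    return bytes([first, second])


print(encode_two_byte(0x80))   # b'\xc2\x80'
print(encode_two_byte(0xF2))   # b'\xc3\xb2'
print(encode_two_byte(0xF2) == chr(0xF2).encode('utf-8'))   # True
```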

"ascii" codec can't encode characters in position 0-2: ordinal not in range(128)

I am using python 2.7 and used Chinese characters in my code, so...
# coding = utf-8
and the problem is part of my code, as follows:
def fileoutput():
    global percent_shown
    date = str(datetime.datetime.now()).decode('utf-8')
    with open("result.txt","a") as datafile:
        datafile.write(date+" "+str(percent_shown.get()))
percent_shown is a string that includes Chinese characters
When I run it, I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
How to fix it? Thanks

As per PEP 263, the coding declaration must match the regular expression r"^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)" so you need to get rid of the space between "coding" and the equal sign:
# coding=utf-8
This declaration tells Python that the .py file itself is utf-8 encoded, but it doesn't change the rest of the program. It is useful if you are writing unicode literals, but you still need to make your strings unicode objects explicitly for things to work.
Since you haven't shown us what you are trying to print, I found some Chinese characters to demonstrate. I have no idea what they mean... so apologies to anyone I insult!
foo = u"学而设" # Good! you've got a unicode string
bar = "学而设" # Bad! you've got a utf-8 encoded string that python
               # thinks is ascii
I think you can fix your program with a few tweaks. First, don't try to decode datetime.now(). It's just ASCII. It didn't change its return type just because you declared the source file encoding. Second, use the codecs module to open the file with the encoding you want (I'm assuming it's utf-8). Now, since you are working with unicode strings, you can write them directly to the file.
import codecs
import datetime
def fileoutput():
    date = unicode(datetime.datetime.now())
    with codecs.open("result.txt","a", encoding="utf-8") as datafile:
        datafile.write(date+" "+percent_shown.get())
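For comparison, a minimal sketch of the same idea using io.open, which accepts an encoding argument on Python 2.6+ and is the built-in open on Python 3. The function name append_result, the sample label, and the temporary path are assumptions made for illustration only.

```python
import datetime
import io
import os
import tempfile


def append_result(path, label):
    """Append a timestamp and a Unicode label to a UTF-8 text file."""
    stamp = u"{0}".format(datetime.datetime.now())
    with io.open(path, "a", encoding="utf-8") as datafile:
        datafile.write(stamp + u" " + label + u"\n")


path = os.path.join(tempfile.mkdtemp(), "result.txt")
append_result(path, u"学而设 50%")
with io.open(path, encoding="utf-8") as check:
    print(check.read().endswith(u"学而设 50%\n"))   # True
```

The point is the same as with codecs.open: keep the text as unicode inside the program and let the file object do the encoding.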

You can't have whitespace before the = in your coding comment. Try:
# coding=utf-8
See the regular expression in: https://www.python.org/dev/peps/pep-0263/
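The effect of the extra space can be checked directly against the regular expression quoted from PEP 263 in the answer above. declared_encoding is a hypothetical helper name used only for this check.

```python
import re

# Pattern as quoted from PEP 263 in the answer above.
CODING_RE = re.compile(r"^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(line):
    """Return the encoding named in a coding comment, or None if no match."""
    match = CODING_RE.match(line)
    return match.group(1) if match else None


print(declared_encoding("# coding = utf-8"))          # None
print(declared_encoding("# coding=utf-8"))            # utf-8
print(declared_encoding("# -*- coding: utf-8 -*-"))   # utf-8
```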

Python decoding issue with Chinese characters

I'm using Python 3.5, and I'm trying to take a block of byte text that may or may not contain special Chinese characters and output it to a file. It works for entries that do not contain Chinese characters, but breaks when they do. The Chinese characters are always a person's name, and are always in addition to the English spelling of their name. The text is JSON formatted and needs to be decoded before I can load it. The decoding seems to go fine and doesn't give me any errors. When I try and write the decoded text to a file it gives me the following error message:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 14-18: character maps to undefined
Here is an example of the raw data that I get before I do anything to it:
b' "isBulkRecipient": "false",\r\n "name": "Name in, English \xef'
b'\xab\x62\xb6\xe2\x15\x8a\x8b\x8a\xee\xab\x89\xcf\xbc\x8a",\r\n
Here is the code that I am using:
recipientData = json.loads(recipientContent.decode('utf-8', 'ignore'))
recipientName = recipientData['signers'][0]['name']
pprint(recipientName)
with open('envelope recipient list.csv', 'a', newline='') as fp:
    a = csv.writer(fp, delimiter=',')
    csvData = [[recipientName]]
    a.writerows(csvData)
The recipientContent is obtained from an API call. I do not need to have the Chinese characters in the output file. Any advice will be greatly appreciated!
Update:
I've been doing some manual workarounds for each entry that breaks, and came across other entries that didn't contain Chinese characters but had special characters from other languages, and they broke the program as well. The special characters are only in the name field. So a name could be something like "Ałex", where it is a mixture of normal and special characters. Before I decode the string that contains this information, I am able to print it to the screen and it looks like this: b'name": "A\xc5ex",\r\n
But after I decode it as utf-8, it gives me an error if I try to output it. The error message is: UnicodeEncodeError: 'charmap' codec can't encode character '\u0142' in position 2: character maps to <undefined>
I looked up what \u0142 was and it is the ł special character.

The error you're getting is when you're writing to the file.
In Python 3.x, when you open() in text mode (the default) without specifying an encoding=, Python will use an encoding most suitable to your locale or language settings.
If you're on Windows, this will use the charmap codec to map to your language encoding.
Although you could just write bytes straight to a file, you're doing the right thing by decoding first. As others have said, you should really decode using the encoding specified by the web server. You could also use the Python Requests module, which does this for you. (Your example doesn't decode as UTF-8, so I assume your example isn't correct.)
To solve your immediate error, simply pass an encoding to open(), which supports the characters you have in your data. Unicode in UTF-8 encoding is the obvious choice. Therefore, you should change your code to read:
with open('envelope recipient list.csv', 'a', encoding='utf-8', newline='') as fp:
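A minimal Python 3 sketch of the difference, assuming cp1252 as the locale encoding that open() would otherwise pick on a Western European Windows setup. The sample name and the temporary file path are made up for the demonstration.

```python
import csv
import os
import tempfile

name = "A\u0142ex"  # "Ałex"

# cp1252 has no mapping for U+0142, which is what the 'charmap' error reports.
try:
    name.encode("cp1252")
except UnicodeEncodeError as exc:
    print("cp1252 failed:", exc.reason)

# With an explicit UTF-8 encoding the same name is written without error.
path = os.path.join(tempfile.mkdtemp(), "recipients.csv")
with open(path, "a", encoding="utf-8", newline="") as fp:
    csv.writer(fp, delimiter=",").writerows([[name]])

with open(path, encoding="utf-8", newline="") as fp:
    rows = list(csv.reader(fp))
print(rows)   # [['Ałex']]
```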

Warning: shotgun solution ahead
Assuming you just want to get rid of all foreign characters in your file (that is, they are not important for your future processing of the other fields), you can simply ignore all non-ASCII characters. Replace
recipientData = json.loads(recipientContent.decode('utf-8', 'ignore'))
with
recipientData = json.loads(recipientContent.decode('ascii', 'ignore'))
This removes all non-ASCII characters before further processing.
I called it a shotgun solution because it might not work correctly under certain circumstances:
Obviously, if the non-ASCII characters need to be kept for future use.
If b'\' or b" bytes appear, for example as part of a UTF-16 character.
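A short illustration of what the 'ignore' handler does, using a made-up byte string rather than the asker's data:

```python
raw = b'{"name": "A\xc5\x82ex \xe9\xab\x98"}'

as_utf8 = raw.decode("utf-8")
as_ascii = raw.decode("ascii", "ignore")

print(as_utf8)    # {"name": "Ałex 高"}
print(as_ascii)   # {"name": "Aex "}
```

Every byte above 0x7F is dropped silently, so "Ałex" becomes "Aex" with no indication that anything was lost.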

Add this line to your code :
from __future__ import unicode_literals

Identify garbage unicode string using python

My script reads data from a csv file; the file can contain multiple strings of English or non-English words.
Sometimes the file has garbage strings. I want to identify those strings, skip them, and process the others.
doc = codecs.open(input_text_file, "rb",'utf_8_sig')
fob = csv.DictReader(doc)
for row, entry in enumerate(fob):
    if is_valid_unicode_str(entry['Name']):
        process_further
def is_valid_unicode_str(value):
    try:
        function
        return True
    except UnicodeEncodeError:
        return False
csv input:
"Name"
"Ã¨Â¢â€¹Ã¨Â¢âdcx€¹Ã¤Â¸Å½Ã¦Å“â€¹Ã¥Ââ€¹Ã¤Â»Â¬Ã§â€ÂµÃ¥ÂÂÃ¥â€¢â€"
"元大寶來證券"
"John Dove"
I want to define a function is_valid_unicode_str() which will identify the garbage strings so that only the valid ones are processed.
I tried to use decode, but it doesn't fail when decoding the garbage strings:
value.decode('utf8')
The expected output is the Chinese and English strings, which should be processed.
Could you please guide me on how I can implement a function to filter valid Unicode strings?

(ftfy developer here)
I've figured out that the text is likely to be '袋袋与朋友们电子商'. I had to guess at the characters 友, 子, and 商, because some unprintable characters are missing from the string in your question. When guessing, I picked the most common character from the small number of possibilities. And I don't know where the "dcx" goes or why it's there.
Google Translate is not very helpful here but it seems to mean something about e-commerce.
So here's everything that happened to your text:
It was encoded as UTF-8 and decoded incorrectly as sloppy-windows-1252, twice
It had the letters "dcx" inserted into the middle of a UTF-8 sequence
Characters that don't exist in windows-1252 -- with byte values 81, 8d, 8f, 90, and 9d -- were removed
A non-breaking space (byte value a0) was removed from the end
If just the first problem had happened, ftfy.fix_text_encoding would be able to fix it. It's possible that the remaining problems just happened while you were trying to get the string onto Stack Overflow.
So here's my recommendation:
Find out who keeps decoding the data incorrectly as sloppy-windows-1252, and get them to decode it as UTF-8 instead.
If you end up with a string like this again, try ftfy.fix_text_encoding on it.
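To illustrate only the first problem in the list above (UTF-8 decoded as windows-1252, possibly more than once), here is a standard-library sketch. It uses Python's strict cp1252 codec rather than ftfy's sloppy-windows-1252, so it gives up on strings that contain the byte values 81, 8d, 8f, 90, or 9d, and it would not repair the damaged sample from the question. undo_mojibake is a hypothetical helper, not an ftfy function.

```python
def undo_mojibake(text, max_rounds=3):
    """Reverse repeated UTF-8 -> cp1252 mis-decodes while that stays possible."""
    for _ in range(max_rounds):
        try:
            repaired = text.encode("cp1252").decode("utf-8")
        except UnicodeError:
            break  # not cp1252-encodable, or not valid UTF-8: stop here
        if repaired == text:
            break  # plain ASCII round-trips unchanged
        text = repaired
    return text


original = "元大"
mangled = original.encode("utf-8").decode("cp1252")   # first wrong decode
mangled = mangled.encode("utf-8").decode("cp1252")    # second wrong decode

print(undo_mojibake(mangled) == original)   # True
print(undo_mojibake("John Dove"))           # John Dove
print(undo_mojibake(original) == original)  # True
```

Correct Chinese text is left alone because it cannot be encoded as cp1252 in the first place, which is the same observation the detection heuristic further down relies on.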

You have Mojibake strings; text encoded to one (correct) codec, then decoded as another.
In this case, your text was decoded with the Windows 1252 codepage; the U+20AC EURO SIGN in the text is typical of CP1252 Mojibakes. The original encoding could be one of the GB* family of Chinese encodings, or a multiple roundtrip UTF-8 - CP1252 Mojibake. Which one I cannot determine, I cannot read Chinese, nor do I have your full data; CP1252 Mojibakes include un-printable characters like 0x81 and 0x8D bytes that might have gotten lost when you posted your question here.
I'd install the ftfy project; it won't fix GB* encodings (I requested the project add support), but it includes a new codec called sloppy-windows-1252 that'll let you reverse an erroneous decode with that codec:
>>> import ftfy # registers extra codecs on import
>>> text = u'Ã¨Â¢â€¹Ã¨Â¢âdcx€¹Ã¤Â¸Å½Ã¦Å“â€¹Ã¥Ââ€¹Ã¤Â»Â¬Ã§â€ÂµÃ¥ÂÂÃ¥â€¢â€'
>>> print text.encode('sloppy-windows-1252').decode('gb2312', 'replace')
猫垄�姑�⑩dcx�盲赂沤忙��姑ヂ�姑ぢ宦�р�得ヂ�氓�⑩�
>>> print text.encode('sloppy-windows-1252').decode('gbk', 'replace')
猫垄鈥姑�⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦�р�得ヂ�氓鈥⑩�
>>> print text.encode('sloppy-windows-1252').decode('gb18030', 'replace')
猫垄鈥姑⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦р�得ヂ氓鈥⑩�
>>> print text.encode('sloppy-windows-1252').decode('utf8', 'ignore').encode('sloppy-windows-1252').decode('utf8', 'replace')
袋�dcx与朋�们���
The � U+FFFD REPLACEMENT CHARACTER shows the decoding wasn't entirely successful, but that could be due to the fact that your copied string here is missing anything not printable or using the 0x81 or 0x8D bytes.
You can try to fix your data this way; from the file data, try to decode to one of the GB* codecs after encoding to sloppy-windows-1252, or roundtrip from UTF-8 twice and see what fits best.
If that's not good enough (you cannot fix the data) you can use the ftfy.badness.sequence_weirdness() function to try and detect the issue:
>>> from ftfy.badness import sequence_weirdness
>>> sequence_weirdness(text)
9
>>> sequence_weirdness(u'元大寶來證券')
0
>>> sequence_weirdness(u'John Dove')
0
Mojibakes score high on the sequence weirdness scale. You could try to find an appropriate threshold for your data, above which you'd call the data most likely corrupted.
However, I think we can use a non-zero return value as a starting point for another test. English text should score 0 on that scale, and so should Chinese text. Chinese mixed with English can still score over 0, but you could not then encode that Chinese text to the CP-1252 codec while you can with the broken text:
from ftfy.badness import sequence_weirdness
def is_valid_unicode_str(text):
    if not sequence_weirdness(text):
        # nothing weird, should be okay
        return True
    try:
        text.encode('sloppy-windows-1252')
    except UnicodeEncodeError:
        # Not CP-1252 encodable, probably fine
        return True
    else:
        # Encodable as CP-1252, Mojibake alert level high
        return False

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3'

I have an Excel spreadsheet that I'm reading in that contains some £ signs.
When I try to read it in using the xlrd module, I get the following error:
x = table.cell_value(row, col)
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.
How can I fix this, and read the £ signs in correctly?
--- UPDATE ---
Some kind readers have suggested that I don't need to decode it at all, or that I can just encode it to Latin-1 when I need to. The problem with this is that I need to write the data to a CSV file eventually, and it seems to object to the raw strings.
If I don't encode or decode the data at all, then this happens (after I've added the string to an array called items):
for item in items:
    #item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)
  File "clean_up_barnet.py", line 104, in <module>
    cleancsv.writerow(item)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 43: ordinal not in range(128)
I get the same error even if I uncomment the Latin-1 line.

A very easy way around all the "'ascii' codec can't encode character…" issues with csvwriter is to instead use unicodecsv, a drop-in replacement for csvwriter.
Install unicodecsv with pip and then you can use it in the exact same way, eg:
import unicodecsv
file = open('users.csv', 'w')
w = unicodecsv.writer(file)
for user in User.objects.all().values_list('first_name', 'last_name', 'email', 'last_login'):
    w.writerow(user)

For what it's worth: I'm the author of xlrd.
Does xlrd produce unicode?
Option 1: Read the Unicode section at the bottom of the first screenful of xlrd doc: This module presents all text strings as Python unicode objects.
Option 2: print type(text), repr(text)
You say """If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.""" Of course if you write UTF-8-encoded text to a device that's expecting latin1, it will be garbled. What did you expect?
You say in your edit: """I get the same error even if I uncomment the Latin-1 line""". This is very unlikely -- much more likely is that you got a slightly different error (mentioning the latin1 codec instead of the ascii codec) in a different source line (the uncommented latin1 line instead of the writerow line). Reading error messages carefully aids understanding.
Your problem here is that in general your data is NOT encodable in latin1; very little real-world data is. Your POUND SIGN is encodable in latin1, but that's not all your non-ASCII data. The problematic character is U+2022 BULLET which is not encodable in latin1.
It would have helped you get a better answer sooner if you had mentioned up front that you were working on Mac OS X ... the usual suspect for a CSV-suitable encoding is cp1252 (Windows), not mac-roman.

Your code snippet says x.decode, but you're getting an encode error -- meaning x is Unicode already, so, to "decode" it, it must first be turned into a string of bytes (and that's where the default codec ascii comes up and fails). In your text you then say "if I rewrite it to x.encode"... which seems to imply that you do know x is Unicode.
So what IS it you're doing -- and what is it you mean to be doing -- encoding a unicode x to get a coded string of bytes, or decoding a string of bytes into a unicode object?
I find it unfortunate that you can call encode on a byte string, and decode on a unicode object, because I find it seems to lead users to nothing but confusion... but at least in this case you seem to manage to propagate the confusion (at least to me;-).
If, as it seems, x is unicode, then you never want to "decode" it -- you may want to encode it to get a byte string with a certain codec, e.g. latin-1, if that's what you need for some kind of I/O purposes (for your own internal program use I recommend sticking with unicode all the time -- only encode/decode if and when you absolutely need, or receive, coded byte strings for input / output purposes).

x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
Look closely: You got a UnicodeEncodeError (an encode error) from calling the decode method.
The reason for this is that decode is intended to convert from a byte sequence (str) to a unicode object. But, as John said, xlrd already uses Unicode strings, so x is already a unicode object.
In this situation, Python 2.x assumes that you meant to decode a str object, so it "helpfully" creates one for you. But in order to convert a unicode to a str, it needs an encoding, and chooses ASCII because it's the lowest common denominator of character encodings. Your code effectively gets interpreted as
x = x.encode('ascii').decode("ISO-8859-1")
which fails because x contains a non-ASCII character.
Since x is already a unicode object, the decode is unnecessary. However, now you run into the problem that the Python 2.x csv module doesn't support Unicode. You have to convert your data to str objects.
for item in items:
    item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)
This would be correct, except that you have the • character (U+2022 BULLET) in your data, and Latin-1 can't represent it. There are several ways around this problem:
Write x.encode('latin-1', 'ignore') to remove the bullet (or other non-Latin-1 characters).
Write x.encode('latin-1', 'replace') to replace the bullet with a question mark.
Replace the bullets with a Latin-1 character like * or ·.
Use a character encoding that does contain all the characters you need.
These days, UTF-8 is widely supported, so there is little reason to use any other encoding for text files.
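The options listed above can be compared side by side. This is a Python 3 sketch with a made-up sample string; on Python 2 the same encode calls apply to unicode objects.

```python
text = "\u2022 \xa3 5"   # BULLET, space, POUND SIGN, space, digit 5

print(text.encode("latin-1", "ignore"))               # b' \xa3 5'
print(text.encode("latin-1", "replace"))              # b'? \xa3 5'
print(text.replace("\u2022", "*").encode("latin-1"))  # b'* \xa3 5'
print(text.encode("utf-8"))                           # b'\xe2\x80\xa2 \xc2\xa3 5'
```

Only the UTF-8 output keeps every character; the three Latin-1 variants each lose or alter the bullet.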

xlrd works with Unicode, so the string you get back is a Unicode string. The £-sign has code point U+00A3, so the representation of said string should be u'\xa3'. This has been read in correctly; it is the string that you should be working with throughout your program.
When you write this (abstract, Unicode) string somewhere, you need to choose an encoding. At that point, you should .encode it into that encoding, say latin-1.
>>> book = xlrd.open_workbook( "test.xls" )
>>> sh = book.sheet_by_index( 0 )
>>> x = sh.cell_value( 0, 0 )
>>> x
u'\xa3'
>>> print x
£
# sample outputs (for e.g. writing to a file)
>>> x.encode( "latin-1" )
'\xa3'
>>> x.encode( "utf-8" )
'\xc2\xa3'
# garbage, because x is already Unicode
>>> x.decode( "ascii" )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
>>>

Working with xlrd, I have in a line ...xl_data.find(str(cell_value))... which gives the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". All the suggestions in the forums have been useless for my German words. But changing it to ...xl_data.find(cell.value)... gives no error. So I suppose that wrapping xlrd values in str() before passing them to certain calls is what triggers the encoding problem.
