I want to write some strings that are not in English (they are in the Azeri language) to a file. Even if I do UTF-8 encoding, I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-12: ordinal not in range(128)
The piece of code that writes to the file is the following:
t_w = text_list[y].encode('utf-8')
new_file.write(t_w.decode('utf-8'))
new_file.write('\n')
EDIT
Even if I change the code to:
t_w = text_list[y].encode('ascii',errors='ignore')
new_file.write(t_w)
new_file.write('\n')
I get the following error:
TypeError: write() argument must be str, not bytes
From what I can tell, t_w.decode(...) turns your UTF-8 bytes back into a Unicode string, and writing that to a plain file makes Python implicitly encode it with the default ASCII codec, which can't represent some Azeri characters. There is no need to decode the string, because you want to write it to the file as UTF-8; omit the .decode(...) part:
new_file.write(t_w)
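That said, the TypeError in your edit suggests Python 3, where the cleanest fix is to skip the encode/decode round trip entirely and open the file with an explicit encoding. A minimal sketch; the file name and sample data are placeholders:
# Hypothetical sketch for Python 3 (suggested by the TypeError above).
text_list = [u'Azərbaycan', u'Bakı']       # placeholder sample strings
with open('output.txt', 'w', encoding='utf-8') as new_file:
    for t_w in text_list:
        new_file.write(t_w)                # plain str; the file object encodes
        new_file.write('\n')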
I am trying to replace text. Unfortunately, the main string is stored as type unicode, but the string which describes the text to be replaced is stored as type string. Below is a reproducible example:
mystring = u'Bunch of text with non-standard character in the name Rubén'
old = 'Rubén'
new = u'newtext'
mystring.replace(old, new)
This throws an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)
I get the same error when I try to convert old to unicode with unicode(old). Several answers solve the problem for specific characters, but I cannot find a generic solution.
You need to convert the old value to Unicode with an explicit codec. What that codec is depends entirely on how you sourced old.
If it is a string literal in the source code, use the source code encoding. Python won't accept your source file unless you specified a valid codec at the top in a comment; see PEP 263.
Pasting your old definition into a terminal will use your terminal codec (the terminal sends Python encoded bytes as you paste).
If the data is sourced from anywhere else, you'll need to determine the encoding from that source. For HTTP data, check the Content-Type header for a charset parameter, for example.
Then decode:
old = old.decode(encoding)
When you use unicode(old) without an explicit codec, or try to use the bytestring in unicode.replace(), Python uses the default codec, ASCII.
Demo in my terminal, configured to use UTF-8:
>>> import sys
>>> sys.stdin.encoding # reflects the detected terminal codec
'UTF-8'
>>> old = 'Rubén'
>>> old # shows encoded data in python string literal form
'Rub\xc3\xa9n'
>>> old.decode('utf8') # unicode string literal form
u'Rub\xe9n'
>>> print old.decode('utf8') # string value written to the terminal
Rubén
>>> mystring = u'Bunch of text with non-standard character in the name Rubén'
>>> new = u'newtext'
>>> mystring.replace(old, new)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
>>> mystring.replace(old.decode('utf8'), new)
u'Bunch of text with non-standard character in the name newtext'
Generally speaking, you want to decode early, encode late; make your data flow a Unicode Sandwich. As soon as you receive text, decode it all to Unicode values, and don't encode again until the data is leaving your program.
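For instance, a minimal Python 2 sketch of that sandwich; the file name and its UTF-8 encoding are assumptions for illustration:
import io

# Decode early: io.open yields unicode as it reads.
with io.open('names.txt', encoding='utf-8') as f:
    text = f.read()

# Middle of the sandwich: operate only on unicode values.
text = text.replace(u'Rub\xe9n', u'newtext')

# Encode late: io.open encodes unicode on write.
with io.open('names.txt', 'w', encoding='utf-8') as f:
    f.write(text)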
I am trying to convert a CSV file to a JSON file. The code runs fine, but when it reaches the statement:
json.dump(DictName, out_file)
I get the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 15: invalid start byte
Would someone please be able to help?
TIA.
I found the solution. While parsing the string, I converted the string to unicode using the unicode() function:
unicode(stringname, errors='replace')
and it replaced all the erroneous symbols.
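For illustration, a hedged Python 2 sketch of how that workaround might fit into the CSV-to-JSON conversion; the file names and the two-column layout are assumptions, not the asker's actual code:
import csv
import json

DictName = {}
with open('input.csv', 'rb') as csv_file:        # placeholder file name
    for key, value in csv.reader(csv_file):      # assumes two columns per row
        # errors='replace' swaps undecodable bytes such as 0x92
        # for U+FFFD instead of raising UnicodeDecodeError.
        DictName[unicode(key, errors='replace')] = unicode(value, errors='replace')

with open('output.json', 'w') as out_file:       # placeholder file name
    json.dump(DictName, out_file)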
I have a text file containing strings such as "aBiyukÙwa" and "varcasÙva". When I decode one of them in the Python interpreter using the following code, it works fine and produces u'aBiyuk\xd9wa':
"aBiyukÙwa".decode("utf-8")
But when I read the file in a Python program using the codecs module, as in the following code, it throws a UnicodeDecodeError.
file = codecs.open('/home/abehl/TokenOutput.wx', 'r', 'utf-8')
for row in file:   # the UnicodeDecodeError is raised while iterating
    pass
Following is the error message:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd9 in position 8: invalid continuation byte
Any ideas what is causing this strange behavior?
Your file is not encoded in UTF-8. Find out what it is encoded in, and then use that.
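For what it's worth, 0xd9 is 'Ù' in Latin-1, which matches the strings you show, so Latin-1 is a reasonable first guess. A sketch; the encoding is an assumption to verify, not a certainty:
import codecs

# Assumption: the file is Latin-1 (where 0xd9 = Ù). Check the output.
file = codecs.open('/home/abehl/TokenOutput.wx', 'r', 'latin-1')
for row in file:
    print row.rstrip()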
I am retrieving a list of names from a webservice using a client I've written in Python. Upon retrieving the list, I encode each name to unicode and then print each of them to stdout. When I get to the name "Ólafur Jóhann Ólafsson", I get the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)
Since I cannot know what the encoding will be, how do I convert all of these strings to unicode? Or can you suggest a better way to handle this problem?
First of all, you decode data to Unicode (the absence of encoding) when reading from a file, pipe, socket, terminal, etc.; and encode Unicode to an appropriate byte encoding when sending/persisting data. I suspect this is the root of your problem.
The web service should declare the encoding in the headers or in the data received. print normally encodes Unicode automatically to the terminal's encoding (discovered through sys.stdout.encoding) or, in the absence of that, plain ASCII. If the characters in your data are not supported by the target encoding, you'll get a UnicodeEncodeError.
Since that is not the error you received, you should post some code so we can see what you are doing. Most likely, you are encoding a byte string instead of decoding. Here's an example of this:
>>> data = '\xc2\xbd' # UTF-8 encoded 1/2 symbol.
>>> data.encode('cp437')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\dev\python\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
What I did here is call encode on a byte string. Since encode requires a Unicode string, Python used the default ascii encoding to decode the byte string to Unicode first, before encoding to cp437.
Fix this by decoding instead of encoding the data, then print will encode to stdout automatically. As long as your terminal supports the characters in the data, it will display properly:
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> print data.decode('utf8') # implicit encode to sys.stdout.encoding
½
>>> print data.decode('utf8').encode('cp437') # explicit encode.
½
The UnicodeDammit module from BeautifulSoup can automagically detect the encoding.
from BeautifulSoup import UnicodeDammit
u = UnicodeDammit("Ólafur Jóhann Ólafsson")
print u.unicode
print u.originalEncoding
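If you are on the newer bs4 distribution of BeautifulSoup, the same helper is available with slightly renamed attributes; a sketch, assuming bs4 is installed:
from bs4 import UnicodeDammit

u = UnicodeDammit("Ólafur Jóhann Ólafsson")
print(u.unicode_markup)       # the decoded text
print(u.original_encoding)    # the detected source encoding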
This page may help you: http://wiki.python.org/moin/PrintFails
The problem, I guess, is that you need to print those names to the console. Do you really need to? Or is this just a test environment? If you use the console only for testing, you might switch to other tools, such as unit tests, to check exactly which values you get.
I'm working on a series of parsers where I get a bunch of tracebacks from my unit tests like:
File "c:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 112: character maps to <undefined>
The files are opened with open() with no extra arguments. Can I pass extra arguments to open(), or use something in the codecs module, to open these files differently?
This came up with code that was written in Python 2 and converted to 3 with the 2to3 tool.
UPDATE: It turns out this is the result of feeding a zip file into the parser. The unit test actually expects this to happen; the parser should recognize it as something that can't be parsed. So I need to change my exception handling, which I'm doing now.
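For what it's worth, a sketch of what that exception-handling change might look like; the zipfile check, the names, and the cp1252 choice are illustrative assumptions, not the actual test code:
import zipfile

def parse_text_file(path):
    # Reject zip archives up front instead of letting the decode blow up.
    if zipfile.is_zipfile(path):
        raise ValueError('%s is a zip archive, not parseable text' % path)
    try:
        with open(path, encoding='cp1252') as f:
            return f.read()
    except UnicodeDecodeError:
        raise ValueError('%s cannot be decoded as cp1252' % path)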
Byte 0x81 is unassigned in Windows-1252 (aka cp1252). It is assigned to the U+0081 HIGH OCTET PRESET (HOP) control character in Latin-1 (aka ISO 8859-1). I can reproduce your error in Python 3.1 like this:
>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
or with an actual file:
>>> open('test.txt', 'wb').write(b'\x81\n')
2
>>> open('test.txt').read()
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte
Now to treat this file as Latin-1 you pass the encoding argument, like codeape suggested:
>>> open('test.txt', encoding='latin-1').read()
'\x81\n'
Beware that there are differences between the Windows-1252 and Latin-1 encodings, e.g. Latin-1 doesn't have “smart quotes”. If the file you're processing is a text file, ask yourself what that \x81 is doing in it.
You can relax the error handling.
For instance:
f = open(filename, encoding="...", errors="replace")
Or:
f = open(filename, encoding="...", errors="ignore")
See the docs.
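For instance, a quick demonstration with the byte from the traceback; 'test.txt' is just a scratch file (Python 3):
open('test.txt', 'wb').write(b'abc\x81def')
print(open('test.txt', encoding='cp1252', errors='replace').read())   # abc\ufffddef, shown as abc�def
print(open('test.txt', encoding='cp1252', errors='ignore').read())    # abcdef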
EDIT:
But are you certain that the problem is in reading the file? Could it be that the exception happens when something is written to the console? Check http://wiki.python.org/moin/PrintFails
All files are "not Unicode". Unicode is an internal representation which must be encoded. You need to determine for each file what encoding has been used, and specify that where necessary when the file is opened.
As the traceback and error message indicate, the file in question is NOT encoded in cp1252.
If it is encoded in latin1, the "\x81" that it is complaining about is a C1 control character that doesn't even have a name (in Unicode). Consider it extremely unlikely that latin1 is the valid encoding.
You say "some of the files are parsed with xml.dom.minidom" -- parsed successfully or unsuccessfully?
A valid XML file should declare its encoding (default is UTF-8) in the first line, and you should not need to specify an encoding in your code. Show us the code that you are using to do the xml.dom.minidom parsing.
"others read directly as iterables" -- sample code please.
Suggestion: try opening some of each type of file in your browser. Then click View and click Character Encoding (Firefox) or Encoding (Internet Explorer). What encoding has the browser guessed [usually reliably]?
Other possible encoding clues: What languages are used in the text in the files? Where did you get the files from?
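If you want a programmatic guess to complement the browser trick, the third-party chardet library (the detector UnicodeDammit can use under the hood) returns a confidence-scored guess; a sketch, assuming chardet is installed and with a placeholder file name:
import chardet

with open('mystery.dat', 'rb') as f:
    guess = chardet.detect(f.read())
print(guess)    # e.g. {'encoding': 'windows-1252', 'confidence': 0.73, ...}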
Note: please edit your question with clarifying information; don't answer in the comments.