UnicodeDecodeError while processing Accented words - python

I have a Python script which reads a YAML file (runs on an embedded system). Without accents, the script runs normally on my development machine and on the embedded system. But accented words make it crash with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
only in the embedded environment.
The YAML sample:
data: ã
The snippet which reads the YAML:
with open(YAML_FILE, 'r') as stream:
    try:
        data = yaml.load(stream)
Tried a bunch of solutions without success.
Versions: Python 3.6, PyYAML 3.12

The codec that is reading your bytes has been set to ASCII. This restricts you to byte values between 0 and 127.
Accented characters are encoded with byte values outside this range (in UTF-8, ã is the two bytes 0xc3 0xa3), so you're getting a decoding error.
A UTF-8 codec decodes ASCII as well as UTF-8, because ASCII is a (very small) subset of UTF-8, by design.
If you can change your codec to a UTF-8 decoder, it should work.
In general, you should always specify how a byte stream is to be decoded to text; otherwise the stream can be ambiguous.
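To make the failure concrete, here is a minimal sketch that reproduces it without PyYAML. It assumes the file on disk is saved as UTF-8, which is what the 0xc3 byte in the traceback suggests; the variable names are illustrative only.

```python
# Bytes as they would sit on disk if the YAML sample is saved as UTF-8 (an assumption).
raw = "data: \u00e3\n".encode("utf-8")  # "\u00e3" is the accented a from the sample

# The accented character occupies two bytes in UTF-8; find the first non-ASCII byte.
first_bad_offset = next(i for i, b in enumerate(raw) if b > 127)

try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc  # same error class the question reports

# The UTF-8 codec handles both the ASCII part and the accent.
text = raw.decode("utf-8")

print(first_bad_offset, hex(raw[first_bad_offset]))
print(ascii_error)
print(ascii(text))  # ascii() keeps the output printable on an ASCII-only terminal
```

The offset and byte value reported by the ASCII decoder line up with the message in the question, which is a good hint that the input really is UTF-8.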

You can specify the codec to use when dumping data with PyYAML, but there is no way to specify a codec in PyYAML when you load. However, PyYAML accepts Unicode text as input, and you can explicitly specify which codec to use when opening the file for reading. That codec is then used to decode the file's contents into text (the file is opened as a text file with 'r', which is the default mode for open()).
import yaml
YAML_FILE = 'input.yaml'
with open(YAML_FILE, encoding='utf-8') as stream:
    data = yaml.safe_load(stream)
Please note that you should almost never have to use yaml.load(), which is documented to be unsafe. Use yaml.safe_load() instead.
To dump data in the same format you loaded it use:
import sys
yaml.safe_dump(data, sys.stdout, allow_unicode=True, encoding='utf-8',
               default_flow_style=False)
The default_flow_style=False is needed to avoid the flow-style curly braces, and allow_unicode=True is necessary or else you get data: "\xE3" (i.e. escape sequences for non-ASCII characters).

Related

python: writing ★ in a file

I am trying to use:
text = "★"
file.write(text)
in Python 3, but I get this error message:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0: ordinal not in range(128)
How can I print the symbol ★ in a file in python? This is the same symbol that is being used as star ratings.
By default open uses the platform default encoding (see docs):
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
This might not be an encoding that supports non-ASCII characters, as you noticed yourself. If you know that you want UTF-8, it is always a good idea to provide it explicitly:
with open(filename, encoding='utf-8', mode='w') as file:
    file.write(text)
Using the with context manager also makes sure the file handle is closed even if you forget to close it or an exception is raised before you get to close it.
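As a rough sketch of the above, the following checks which encoding open() would fall back to and then writes the star with an explicit encoding. The temporary path and variable names are made up for illustration.

```python
import locale
import os
import tempfile

# What open() falls back to when no encoding is given. A value such as
# "ANSI_X3.4-1968" or "ascii" here would explain the error in the question.
default_encoding = locale.getpreferredencoding(False)

text = "\u2605"  # the star character from the question

# A throwaway path, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "stars.txt")

with open(path, mode="w", encoding="utf-8") as f:
    f.write(text)

# Read the file back as raw bytes and as decoded text.
with open(path, mode="rb") as f:
    raw = f.read()
with open(path, encoding="utf-8") as f:
    round_tripped = f.read()

print(default_encoding, raw, ascii(round_tripped))
```

Passing the same encoding when reading the file back matters just as much as passing it when writing.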

How to open an ASCII text file gracefully

It is confusing when I open a file with Python. By the way I'm using python3.4.
First it's a log file (a huge file that is appended to any time), so iconv is not possible.
Info1 file is ASCII text.
demo git:master ❯ file 1.log
1.log: ASCII text, with very long lines
Info2 ipython opens it with default encoding of 'UTF-8':
In [1]: f = open('1.log')
In [2]: f.encoding
Out[2]: 'UTF-8'
THEN
First when I open('1.log', encoding='utf-8', mode='r')
ERROR: 'utf-8' codec can't decode byte 0xb1 in position 6435: invalid start byte
Second when I open('1.log', encoding='ascii', mode='r')
ERROR: 'ascii' codec can't decode byte 0xe9 in position 6633: ordinal not in range(128)
How can I gracefully handle this file with every line read?
This is my demo on GitHub.
I tried a few different combinations of encodings and I was able to get all the way through the log file by simply changing the encoding in your script to latin1, so the line open('1.log', encoding='utf-8', mode='r') becomes open('1.log', encoding='latin1', mode='r').
It's probably Windows CP 1252 or Latin 1. Try opening it with:
open('1.log', 'r', encoding='latin-1')
Looks like it's not an ASCII file. The encoding guess made by the file command is often inaccurate. Try chardet, which will detect the encoding for you.
Then
import chardet
# Open in binary mode: chardet.detect() expects bytes, not decoded text.
with open('1.log', 'rb') as filepointer:
    charset_detected = chardet.detect(filepointer.read())
Keep in mind that this can take a very very long time. Before you try that I recommend you manually cycle through the obvious encodings first.
Try UTF16 and UTF32. Then try the Windows encodings. Here is a list of several encodings.
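Cycling through candidate encodings by hand can be sketched as below. This is only one possible approach: the helper name decode_with_fallback, the candidate list, and the sample bytes are assumptions for illustration, not something taken from the asker's log file.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Not reached while "latin-1" is in the list: it maps all 256 byte values.
    raise ValueError("no candidate encoding matched")


# A made-up log line containing the two bytes named in the error messages.
sample = b"temp \xb1 2, caf\xe9\n"
text, used = decode_with_fallback(sample)

print(used, ascii(text))
```

Because latin-1 never fails, it has to come last in the list; a successful decode with it only proves that no error was raised, not that the text is right. For a large log read line by line, the same helper could be applied to each line of a file opened in binary mode.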

Python 3 unicode to utf-8 on file

I am trying to parse through a log file, but the file format is always in unicode. My usual process that I would like to automate:
I pull file up in notepad
Save as...
change encoding from unicode to UTF-8
Then run python program on it
So this is the process I would like to automate in Python 3.4. Pretty much just changed the file to UTF-8 or something like open(filename,'r',encoding='utf-8') although this exact line was throwing me this error when I tried to call read() on it:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
It would be EXTREMELY helpful if I could convert the entire file (like in my first scenario) or just open the whole thing in UTF-8 that way I don't have to str.encode (or something like that) every time I analyze a string.
Anybody been through this and know which method I should use and how to do it?
EDIT:
In the Python 3 REPL, I did
>>> f = open('file.txt','r')
>>> f
<_io.TextIOWrapper name='file.txt' mode='r' encoding='cp1252'>
So now my python code in my program opens the file with open('file.txt','r',encoding='cp1252'). I am running a lot of regex looking through this file though and it isn't picking it up (I think because it isn't utf-8). So I just have to figure out how to switch from cp1252 to UTF-8. Thank you @Mark Ransom
What notepad considers Unicode is utf16 to Python. Windows "Unicode" files start with a byte order mark (BOM) of FF FE, which indicates little-endian UTF-16. This is why you get the following when using utf8 to decode the file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
To convert to UTF-8, you could use:
with open('log.txt',encoding='utf16') as f:
    data = f.read()
with open('utf8.txt','w',encoding='utf8') as f:
    f.write(data)
Note that many Windows editors like a UTF-8 signature at the beginning of the file, or may assume ANSI instead. ANSI is really the local language locale. On US Windows it is cp1252, but it varies for other localized builds. If you open utf8.txt and it still looks garbled, use encoding='utf-8-sig' when writing instead.
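The byte-level picture behind this answer can be sketched without touching the disk. The sample text is made up, and the bytes are built by hand so the result does not depend on the byte order of the machine running it.

```python
import codecs

text = "log entry\n"

# What a Windows "Unicode" file looks like: an FF FE byte order mark
# followed by little-endian UTF-16 data.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc  # fails on the very first byte, 0xff

# The "utf-16" codec reads the BOM to pick the byte order and then drops it.
decoded = raw.decode("utf-16")

plain_utf8 = decoded.encode("utf-8")          # no signature
with_signature = decoded.encode("utf-8-sig")  # UTF-8 preceded by EF BB BF

print(raw[:2], utf8_error)
print(with_signature[:3], plain_utf8)
```

The only difference between the two UTF-8 outputs is the three-byte signature that some Windows editors look for.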

How to open an ascii-encoded file as UTF8?

My files are in US-ASCII and a command like a = file( 'main.html') and a.read() loads them as an ASCII text. How do I get it to load as UTF8?
The problem I am trying to solve is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 38: ordinal not in range(128)
I was using the content of the files for templating as in template_str.format(attrib=val). But the string to interpolate is of a superset of ASCII.
Our team's version control and text editors do not care about the encoding. So how do I handle it in the code?
You are opening files without specifying an encoding, which means that Python uses the default value (ASCII).
You need to decode the byte-string explicitly, using the .decode() function:
template_str = template_str.decode('utf8')
Your val variable you tried to interpolate into your template is itself a unicode value, and python wants to automatically convert your byte-string template (read from the file) into a unicode value too, so that it can combine both, and it'll use the default encoding to do so.
Did I mention already you should read Joel Spolsky's article on Unicode and the Python Unicode HOWTO? They'll help you understand what happened here.
A solution working in Python2:
import codecs
fo = codecs.open('filename.txt', 'r', 'ascii')
content = fo.read() ## returns unicode
assert type(content) == unicode
fo.close()
utf8_content = content.encode('utf-8')
assert type(utf8_content) == str
I suppose that you are sure that your files are encoded in ASCII. Are you? :) As ASCII is included in UTF-8, you can decode this data using UTF-8 without expecting problems. However, when you are sure that the data is just ASCII, you should decode the data using just ASCII and not UTF-8.
"How do I get it to load as UTF8?"
I believe you mean "How do I get it to load as unicode?". Just decode the data using the ASCII codec and, in Python 2.x, the resulting data will be of type unicode. In Python 3, the resulting data will be of type str.
You will have to read about this topic in order to learn how to perform this kind of decoding in Python. Once understood, it is very simple.
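For comparison, here is a small Python 3 sketch of the same idea (the answers above target Python 2). The template text is made up for illustration; it shows that ASCII bytes decode to the same str under either codec, and that interpolation works once the template is text.

```python
# Pure ASCII bytes, as they would be read from a file opened in binary mode.
raw = b"Hello, {attrib}!"

as_ascii = raw.decode("ascii")
as_utf8 = raw.decode("utf-8")  # ASCII is a subset of UTF-8, so this is identical

# Once the template is text, interpolating a non-ASCII value is unproblematic.
# "\xae" is the registered sign mentioned in the traceback.
rendered = as_utf8.format(attrib="\xae")

print(type(as_ascii).__name__, as_ascii == as_utf8, ascii(rendered))
```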

Python Decoding Unicode files with 'ÆØÅ'

I read in some data from a Danish text file, but I can't seem to find a way to decode it.
The original text is "dør" but in the raw text file it's stored as "d√∏r"
So I tried the obvious
InputData = "d√∏r"
print InputData.decode('iso-8859-1')
sadly resulting in the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-6: ordinal not in range(128)
UTF-8 gives the same error.
(using Python 2.6.5)
How can I decode this text so the printed message would be "dør"?
C3 B8 is the UTF-8 encoding for "ø". You need to read the file in UTF-8 encoding:
import codecs
codecs.open(myfile, encoding='utf-8')
The reason that you're getting a UnicodeEncodeError is that you're trying to output the text and Python doesn't know what encoding your terminal is in, so it defaults to ascii. To fix this issue, use sys.stdout = codecs.getwriter('utf8')(sys.stdout) or use the environment variable PYTHONIOENCODING="utf-8".
Note that this will give you the text as unicode objects; if everything else in your program is str then you're going to run into compatibility issues. Either convert everything to unicode or (probably easier) re-encode the file into Latin-1 using ustr.encode('iso-8859-1'), but be aware that this will break if anything is outside the Latin-1 codepage. It might be easier to convert your program to use str in utf-8 encoding internally.
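The claim that C3 B8 is the UTF-8 form of "ø" can be checked with a short Python 3 sketch. The guess that the garbled form comes from viewing those bytes as Mac OS Roman is an assumption on my part, not something stated in the question.

```python
original = "d\u00f8r"                  # the Danish word from the question
utf8_bytes = original.encode("utf-8")  # b'd\xc3\xb8r'

# Assumption: the garbled form appears when a viewer interprets the
# UTF-8 bytes as Mac OS Roman.
garbled = utf8_bytes.decode("mac_roman")

# Undoing the wrong step and applying the right one recovers the word.
repaired = garbled.encode("mac_roman").decode("utf-8")

print(utf8_bytes, ascii(garbled), ascii(repaired))
```

If the assumption holds, the file itself is fine and only needs to be read as UTF-8, as the answer says.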
