BLUF: Why is the decode() method on a bytes object failing to decode ç?
I am receiving a UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position..... Upon tracking down the character, it turns out to be ç. So when I get to reading the response from the server:
import http.client

conn = http.client.HTTPConnection(host='something.com')
conn.request('GET', url='/some/json')
resp = conn.getresponse()
content = resp.read().decode()  # throws UnicodeDecodeError
I am unable to get the content. If I just do content = resp.read() it succeeds, and I can write the bytes to a file opened with wb, but then wherever the ç should be, the file contains the raw byte 0xE7. Even if I open the file in Notepad++ and set the encoding to UTF-8, the character only shows as the hex value.
Why am I not able to decode this UTF-8 character from an HTTPResponse? Am I not correctly writing it to file either?
When you have issues with encoding/decoding, you should take a look at the UTF-8 Encoding Debugging Chart.
If you look up the byte 0xE7 in that chart, you find that the corresponding Windows-1252 character is ç, which shows that the response is actually encoded as CP1252, not UTF-8.
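A minimal sketch of the fix, assuming the server really does send CP1252 (the response's Content-Type header should confirm which charset it declares, if any):
import http.client

conn = http.client.HTTPConnection('something.com')
conn.request('GET', '/some/json')
resp = conn.getresponse()
# The charset the server claims to use, if declared, is here:
print(resp.getheader('Content-Type'))
# 0xE7 is ç in CP1252 (and in Latin-1), so decode with that instead:
content = resp.read().decode('cp1252')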
Related
I am trying to open a Windows PE file and alter some strings in the resource section.
f = open(r'c:\test\file.exe', 'rb')  # raw string, since '\t' would otherwise mean a tab
file = f.read()
if b'A'*10 in file:
    s = file.replace(b'A'*10, newstring)
In the resource section I have a string that is just:
AAAAAAAAAA
And I want to replace that with something else. When I read the file I get:
\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A
I have tried opening the file as UTF-16 and decoding it as UTF-16, but then I run into an error:
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 1604-1605: illegal encoding
Everyone I have seen with the same issue fixed it by decoding as UTF-16. I am not sure why that doesn't work for me.
If the resource inside the binary file is encoded as UTF-16, you shouldn't change the encoding of the file; encode your search string to match instead.
Try this:
f = open('c:\\test\\file.exe', 'rb')
file = f.read()
unicode_str = u'AAAAAAAAAA'
# Encode as UTF-16-LE: plain 'UTF-16' would prepend a BOM (\xff\xfe),
# which does not appear in front of strings inside the resource section.
encoded_str = unicode_str.encode('UTF-16-LE')
if encoded_str in file:
    s = file.replace(encoded_str, new_utf_string.encode('UTF-16-LE'))
Keep in mind that inside a binary file everything is encoded.
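A quick illustration of why the exact UTF-16 variant matters (my aside, not part of the original answer): plain 'UTF-16' prepends a byte order mark, which will not be present in front of a string in the middle of a file.
>>> u'AA'.encode('UTF-16')     # BOM (\xff\xfe) first, then the code units
b'\xff\xfeA\x00A\x00'
>>> u'AA'.encode('UTF-16-LE')  # no BOM; matches the raw in-file bytes
b'A\x00A\x00'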
I am trying to open a Wikipedia database dump file in Python 3. I unpacked the file on Linux with the gzip command, and I try to open it with this code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
with open('dump.sql', 'r') as file:
    for i in file:
        print(i)
But it gives me this error:
File "/usr/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 250-251: invalid continuation byte
The Linux command file -i dump.sql reports a utf-8 charset. Where could the problem be?
I found more info here, but this file is from 4.7.2017, so that cannot be the problem:
The dumps may contain non-Unicode (UTF8) characters in older text revisions due to lenient charset validation in the earlier MediaWiki releases (2004 or so). For instance, zhwiki-20130102-langlinks.sql.gz contained some copy and pasted iso8859-1 "ö" characters; as the langlinks table is generated on parsing, a null edit or forcelinkupdate to the page was enough to fix it.
So how can I process wikipedia database dumps files in python?
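One hedged option (my sketch, not an answer from the original thread): since the note above says the stray bytes come from lenient charset validation in old revisions, you can tell Python how to handle the occasional malformed byte instead of aborting:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# errors='replace' substitutes U+FFFD for undecodable bytes;
# use errors='ignore' to drop them silently instead.
with open('dump.sql', 'r', encoding='utf-8', errors='replace') as file:
    for i in file:
        print(i)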
I am working with some software that is generating an error when trying to create a pdf from html that contains non-ascii characters. I have created a much simpler program to reproduce the problem and help me understand what is going on.
#!/usr/bin/python
#coding=utf8
from __future__ import unicode_literals
import pdfkit
from pyPdf import PdfFileWriter, PdfFileReader
f = open('test.html','r')
html = f.read()
print html
pdfkit.from_string(html, 'gen.pdf')
f.close()
Running this program results in:
<html>
<body>
<h1>ر</h1>
</body>
</html>
Traceback (most recent call last):
File "./testerror.py", line 10, in <module>
pdfkit.from_string(html, 'gen.pdf')
File "/usr/local/lib/python2.7/dist-packages/pdfkit/api.py", line 72, in from_string
return r.to_pdf(output_path)
File "/usr/local/lib/python2.7/dist-packages/pdfkit/pdfkit.py", line 136, in to_pdf
input = self.source.to_s().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 18: ordinal not in range(128)
I tried adding a replace statement to strip the problem character, but that also resulted in an error:
Traceback (most recent call last):
File "./testerror.py", line 9, in <module>
html = html.replace('ر','-')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 18: ordinal not in range(128)
I am afraid I don't understand ascii / utf-8 encoding very well. If anyone could help me understand what is going on here, that would be great! I am not sure if this is a problem in the pdf library, or if this is a result of my ignorance of encodings :)
Reading the pdfkit source code, it appears that pdfkit.from_string expects its first argument to be unicode, not str, so it's up to you to properly decode html. To do that, you must know what encoding your test.html file uses. Once you know that, you just have to proceed:
with open('test.html') as f:
    html = f.read().decode('<your-encoding-name-here>')
pdfkit.from_string(html, 'gen.pdf')
Note that str.decode(<encoding>) will return a unicode string and unicode.encode(<encoding>) will return a byte string; in other words, you decode from a byte string to unicode, and you encode from unicode to a byte string.
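For instance (a Python 2 illustration matching the question's environment; the byte values assume the file on disk is UTF-8):
>>> raw = '\xd8\xb1'            # str: the two UTF-8 bytes of U+0631
>>> text = raw.decode('utf-8')  # unicode: u'\u0631', i.e. the character ر
>>> text.encode('utf-8')        # back to a byte string
'\xd8\xb1'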
In your case you can also use codecs.open(path, mode, encoding) instead of open() plus explicit decoding, i.e.:
import codecs
with codecs.open('test.html', encoding='<your-encoding-name-here>') as f:
    html = f.read()  # `codecs` will do the decoding behind the scenes
As a side note:
'r' (read; "read binary" for codecs, but that's an implementation detail) is the default mode when opening a file, so there is no need to specify it at all;
using files as context managers (with open(path) as f: ...) makes sure the file will be properly closed. While CPython will usually close opened files when the file objects get collected, this is an implementation detail and is not guaranteed by the language, so do not rely on it.
Also, the HTML should declare its charset:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
The question seems to be Python 2 specific. However, I had a similar issue with Python 3 in a Flask + Apache/mod_wsgi environment on Ubuntu 22.04 when passing a non-ASCII string to the header or footer via the from_string options (e.g. document = pdfkit.from_string(html, False, options={"header-left": "é"})). I then got the error UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128). The problem was the missing locale setting for WSGIDaemonProcess in the Apache/VirtualHost configuration. I solved it by passing locale=C.UTF-8, as shown below.
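The directive in full (myapp, myuser, mygroup and the python-home path are placeholders for your own setup):
WSGIDaemonProcess myapp user=myuser group=mygroup threads=5 locale=C.UTF-8 python-home=/path/to/myapp/venv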
import requests
test = requests.get("https://www.hipstercode.com/")
outfile = open("./settings.txt", "w")
test.encoding = 'ISO-8859-1'
outfile.write(str(test.text))
The error that I'm getting is:
File "C:/Users/Bamba/PycharmProjects/Requests/Requests/Requests.py", line 8, in <module>
outfile.write(str(test.text))
File "C:\Users\Bamba\AppData\Local\Programs\Python\Python35\lib\encodings\cp1255.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xef' in position 0: character maps to <undefined>
So, it looks like the response contains something you can't encode in cp1255, the locale encoding your output file was opened with.
If encoding the text yourself is OK for you, try:
import requests

test = requests.get("https://www.hipstercode.com/")
# Open in binary mode and encode explicitly, so the locale's default
# codec (cp1255 here) is never involved:
with open("./settings.txt", "wb") as outfile:
    outfile.write(test.text.encode('ISO-8859-1'))
If you're getting an error while encoding, you simply cannot encode losslessly. The options you have are described in the str.encode docs: https://docs.python.org/3/library/stdtypes.html#str.encode
I.e., you can
outfile.write(test.text.encode('ISO-8859-1', 'replace'))
to handle errors without losing most of the sense of text written in something that doesn't fit into ISO-8859-1.
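A small illustration of the difference between the error handlers (my example; the euro sign is simply a character that ISO-8859-1 lacks):
>>> '€'.encode('ISO-8859-1', 'replace')  # unencodable characters become '?'
b'?'
>>> '€'.encode('ISO-8859-1', 'ignore')   # ...or are dropped entirely
b''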
File "/usr/lib/python3.1/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 805: invalid start byte
Hi, I get this exception. How do I catch it and continue reading my files when it occurs?
My program has a loop that reads a text file line-by-line and tries to do some processing. However, some files I encounter may not be text files, or have lines that are not properly formatted (foreign language etc). I want to ignore those lines.
The following is not working:
import re
import sys

for line in sys.stdin:
    if line != "":
        try:
            matched = re.match(searchstuff, line, re.IGNORECASE)
            print(matched)
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue
Look at http://docs.python.org/py3k/library/codecs.html. When you open the codecs stream, you probably want to use the additional argument errors='ignore'.
In Python 3, sys.stdin is by default opened as a text stream (see http://docs.python.org/py3k/library/sys.html), and has strict error checking.
You need to reopen it as an error-tolerant utf-8 stream. Something like this will work:
sys.stdin = codecs.getreader('utf8')(sys.stdin.detach(), errors='ignore')
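Putting it together with the loop from the question (a complete sketch with the needed imports; searchstuff is assumed to be defined elsewhere in your program):
import codecs
import re
import sys

# Reopen stdin as a UTF-8 reader that silently drops malformed bytes.
sys.stdin = codecs.getreader('utf8')(sys.stdin.detach(), errors='ignore')

for line in sys.stdin:
    if line != "":
        matched = re.match(searchstuff, line, re.IGNORECASE)
        print(matched)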