Python 3.4, decoding error for byte '\x93'

I am reading a file in binary mode into a buffer, performing some replacements on that buffer, and then inserting the buffered data into a MySQL database. Under Python 2.7 this worked well: every byte was inserted regardless of encoding.
Code:
with open(binfile, 'rb') as fd_bin:
    bin_data = fd_bin.read()

bin_data = bin_data.replace('"', '\\"')
db_cursor.execute("INSERT INTO table BIN_DATA values {}".format(bin_data))
When I used Python 3.4, the data needed to be decoded, so I used:
bin_data = fd_bin.read()
bin_data = bin_data.decode('utf-8') # error at this line
The second line produced this error:
    bindata = bindata.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 1: invalid start byte
I tried the latin-1 and iso-8859-1 codecs, but they insert some extra bytes in places: when I fetch the data back from the database it is not the same, whereas it was under Python 2.7.
How can I insert or decode that data regardless of encoding scheme? I can't skip or ignore any bytes.
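For what it's worth, a minimal Python 3 sketch of the binary-safe route (assuming a DB-API driver such as PyMySQL; the table and column names below are placeholders, and binfile is the path from the question). Passing the bytes as a query parameter lets the driver do the escaping, so nothing ever has to be decoded:
import pymysql

conn = pymysql.connect(host='localhost', user='user', password='pass', db='mydb')

with open(binfile, 'rb') as fd_bin:
    bin_data = fd_bin.read()          # stays bytes; no .decode() needed

with conn.cursor() as cur:
    # Parameter binding escapes arbitrary bytes, including quotes and \x93.
    cur.execute("INSERT INTO bin_table (BIN_DATA) VALUES (%s)", (bin_data,))
conn.commit()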

Related

Parse big JSON file with font-encoding cp1252

I have to handle a big JSON file (approx. 47 GB), and ijson seems to be the solution.
However, when I want to go through the objects I get the following error:
byggesag = (o for o in objects if o["h�ndelse"] == 'Byggesag')
^
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xe6 in position 12: invalid continuation byte
Here is the code I am using so far:
import ijson

with open("C:/Path/To/Json/JSON_20220703180000.json", "r", encoding="cp1252") as json_file:
    objects = ijson.items(json_file, 'SagList.item')
    byggesag = (o for o in objects if o['hændelse'] == 'Byggesag')
How can I deal with the encoding of the input file?
The problem is with the Python script itself, which is saved as cp1252 while Python expects it to be UTF-8. You seem to be dealing with the input JSON file correctly (though you won't be able to tell until you can actually run your script).
First, note that the error is a SyntaxError, which probably happens when you are loading your script/module.
Secondly, note how in the first bit of code you shared hændelse appears somewhat scrambled, and Python is complaining that utf-8 cannot handle byte 0xe6. This is because the character æ (U+00E6, https://www.compart.com/de/unicode/U+00E6) is encoded as 0xe6 in cp1252, which isn't a valid utf8 byte sequence; hence the error.
To solve it, save your Python script with UTF-8 encoding, or declare that it is saved as cp1252 (see https://peps.python.org/pep-0263/ for reference).
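For illustration, the cp1252 declaration goes on the first or second line of the script, in the form PEP 263 expects:
# -*- coding: cp1252 -*-
import ijson
# ... rest of the script; alternatively, re-save the file as UTF-8 and drop this line.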

UnicodeDecodeError while processing Accented words

I have a Python script which reads a YAML file (it runs on an embedded system). Without accents the script runs normally, both on my development machine and on the embedded system, but accented words make it crash with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
only in the embedded environment.
The YAML sample:
data: ã
The snippet which reads the YAML:
with open(YAML_FILE, 'r') as stream:
    try:
        data = yaml.load(stream)
Tried a bunch of solutions without success.
Versions: Python 3.6, PyYAML 3.12
The codec that is reading your bytes has been set to ASCII. This restricts you to byte values between 0 and 127.
The encoded representation of accented characters falls outside this range, so you're getting a decoding error.
A UTF-8 codec decodes ASCII as well as UTF-8, because ASCII is a (very small) subset of UTF-8, by design.
If you can change your codec to be a UTF-8 decode, it should work.
In general, you should always specify how you will decode a byte stream to text, otherwise, your stream could be ambiguous.
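A tiny demonstration of the difference (a sketch; 'ã' is the character from the YAML sample):
b = 'ã'.encode('utf-8')        # b'\xc3\xa3' -- two bytes, both above 127
print(b.decode('utf-8'))       # 'ã' -- UTF-8 round-trips fine

try:
    b.decode('ascii')          # fails: 0xc3 is not in range(128)
except UnicodeDecodeError as e:
    print(e)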
You can specify the codec to use when dumping data with PyYAML, but there is no way to specify the codec in PyYAML when you load. However, PyYAML handles unicode input, and you can explicitly specify which codec to use when opening the file for reading; that codec is then used to return the text (you open the file as a text file with 'r', which is the default for open()).
import yaml

YAML_FILE = 'input.yaml'
with open(YAML_FILE, encoding='utf-8') as stream:
    data = yaml.safe_load(stream)
Please note that you should almost never have to use yaml.load(), which is documented to be unsafe, use yaml.safe_load() instead.
To dump data in the same format you loaded it use:
import sys

yaml.safe_dump(data, sys.stdout, allow_unicode=True, encoding='utf-8',
               default_flow_style=False)
default_flow_style=False is needed to avoid the flow-style curly braces, and allow_unicode=True is necessary or else you get data: "\xE3" (i.e. escape sequences for unicode characters).

Ignore UnicodeEncodeError when saving utf8 file

I have the full text of a novel, as one large string, that I downloaded from Project Gutenberg. I am trying to save it to my computer, but I'm getting a UnicodeEncodeError and I don't know how to fix or ignore it.
from urllib import request
# Get the text
response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
# Decode it using utf8
raw = response.read().decode('utf8')
# Save the file
file = open('corpora/canon_texts/' + 'test', 'w')
file.write(raw)
file.close()
This gives me the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
First, I tried to remove the BOM at the beginning of the file:
# We have to get rid of the pesky Byte Order Mark before we save it
raw = raw.replace(u'\ufeff', '')
but I get the same error, just with a different position number:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 7863-7864: character maps to <undefined>
If I look in that area I can't find the offending characters, so I don't know what to remove:
raw[7850:7900]
just prints out:
' BALLENA, Spanish.\r\n PEKEE-NUEE-'
which doesn't look like it would be a problem.
So then I tried to skip the bad lines with a try statement:
file = open('corpora/canon_texts/' + 'test', 'w')
try:
    file.write(raw)
except UnicodeEncodeError:
    pass
file.close()
but this skips the entire text, giving me a file of 0 size.
How can I fix this?
EDIT:
A couple people have noted that '\ufeff' is utf16. I tried switching to utf16:
# Get the text
response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
# Decode it using utf16
raw = response.read().decode('utf-16')
But I can't even download the data before I get this error:
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x0a in position 1276798: truncated data
SECOND EDIT:
I also tried decoding with utf-8-sig as suggested in u'\ufeff' in Python string because that includes BOM, but then I'm back to this error when I try to save it:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 7863-7864: character maps to <undefined>
Decoding and re-encoding a file just to save it to disk is pointless. Just write out the bytes you have downloaded, and you will have the file on disk:
raw = response.read()
with open('corpora/canon_texts/' + 'test', 'wb') as outfile:
    outfile.write(raw)
This is the only reliable way to write to disk exactly what you downloaded.
Sooner or later you'll want to read in the file and work with it, so let's consider your error. You didn't provide a full stack trace (always a bad idea), but your error is during encoding, not decoding. The decoding step succeeded. The error must be arising on the line file.write(raw), which is where the text gets encoded for saving. But to what encoding is it being converted? Nobody knows, because you opened file without specifying an encoding! The encoding you're getting depends on your location, OS, and probably the tides and weather forecast. In short: Specify the encoding.
text = response.read().decode('utf8')
with open('corpora/canon_texts/' + 'test', 'w', encoding="utf-8") as outfile:
    outfile.write(text)
U+FEFF is for UTF-16. Try that instead.
.decode(encoding="utf-8", errors="strict") offers error handling as a built-in feature:
The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers.
Probably the safest option is
decode("utf8", errors='backslashreplace')
which will escape the undecodable bytes with a backslash, so you have a record of what failed to decode.
Conveniently, your Moby Dick text contains no backslashes, so it will be quite easy to check what characters are failing to decode.
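As a quick sketch of that check (re-using the urlopen call from the question): decode with backslashreplace and collect the \xNN escapes it produced; since the book itself contains no backslashes, every escape marks a byte that failed to decode.
from urllib import request

response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
text = response.read().decode('utf8', errors='backslashreplace')

# Each undecodable byte becomes a literal '\xNN' sequence in the text.
bad = {chunk[:2] for chunk in text.split('\\x')[1:]}
print(bad)    # hex codes of the bytes that did not decode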
What is strange about this text is that the website says it is utf-8, but \ufeff is the BOM for utf-16. Decoding as utf-16, it looks like you're just having trouble with the very last character, 0x0a (a newline), which can probably safely be dropped with
decode("utf-16", errors='ignore')

Can't write Unicode text using cx_Oracle

I'm working on a Python script that reads in a CSV writes the contents out to an Oracle database using cx_Oracle. So far I've been getting the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 1369: ordinal not in range(128)
Evidently, cx_Oracle is trying to convert a Unicode character to ASCII and it's not working.
A few clarifying points:
I'm using Python 3.4.3
The CSV file is encoded in UTF-8 and is being opened like open('all.csv', encoding='utf8')
I'm using NVARCHAR2 fields for text in the database and the NLS_NCHAR_CHARACTERSET is set to AL16UTF16. The NLS_CHARACTERSET is WE8MSWIN1252 but from what I understand that shouldn't be relevant since I'm using NVARCHAR2.
I've tried setting the NLS_LANG environment variable to things like .AL16UTF16, _.AL16UTF16 and AMERICAN_AMERICA.WE8MSWIN1252 per this post, but I still get the same error.
Given that I'm reading a UTF-8 file and trying to write to a Unicode-encoded table, can anyone think of why cx_Oracle would still be trying to convert my data to ASCII?
I'm able to produce the error with this code:
import csv
import datum  # the data abstraction library mentioned below

field_map = {
    ...
}

with open('all.csv', encoding='utf8') as f:
    reader = csv.DictReader(f)
    out_rows = []
    i = 0
    for row in reader:
        if i == 1000:
            break
        out_row = {}
        for field, source_field in field_map.items():
            out_val = row[source_field]
            out_row[field] = out_val
        out_rows.append(out_row)
        i += 1

out_db = datum.connect('oracle-stgeom://user:pass@db')
out_table = out_db['service_requests']
out_table.write(out_rows, chunk_size=10000)
The datum module is a data abstraction library I'm working on. The function responsible for writing to Oracle table is found here.
The full traceback is:
File "C:\Projects\311\write.py", line 64, in <module>
out_table.write(out_rows, chunk_size=10000)
File "z:\datum\datum\table.py", line 89, in write
self._child.write(rows, from_srid=from_srid, chunk_size=chunk_size)
File "z:\datum\datum\oracle_stgeom\table.py", line 476, in write
self._c.executemany(None, val_rows)
UnicodeEncodeError: 'ascii' codec can't encode character '\xa0' in position 1361: ordinal not in range(128)
Check the value of the "encoding" and "nencoding" attributes on the connection. This value is set by calling OCI routines that check the environment variables NLS_LANG and NLS_NCHAR. It looks like this value is US-ASCII or equivalent. When writing to the database, cx_Oracle takes the text and gets a byte string by encoding in the encoding the Oracle client is expecting. Note that this is unrelated to the database encoding. In general, for best performance, it is a good idea to match the database and client encodings -- but if you don't, Oracle will quite happily convert between the two, provided all of the characters used can be represented in both character sets!
Note that if the value of NLS_LANG is invalid it is essentially ignored. AL16UTF16 is one such invalid entry! So set it to the value you would like (such as .AL32UTF8) and check the value of encoding and nencoding on the connection until you get what you want.
Note as well that unless you state otherwise, all strings bound via cx_Oracle to the database are assumed to be in the normal encoding, not the NCHAR encoding. You can override this by using cursor.setinputsizes() and specifying that the input type is NCHAR, FIXED_NCHAR or LONG_NCHAR.
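A sketch of both points together (the connection string, table and column names are placeholders; NLS_LANG has to be in the environment before the Oracle client is first initialised):
import os
os.environ['NLS_LANG'] = '.AL32UTF8'   # set before importing/connecting

import cx_Oracle

conn = cx_Oracle.connect('user', 'pass', 'host/service')
print(conn.encoding, conn.nencoding)   # confirm what the client actually picked up

cur = conn.cursor()
# Tell cx_Oracle the bind variable targets an NCHAR/NVARCHAR2 column.
cur.setinputsizes(cx_Oracle.NCHAR)
cur.execute("INSERT INTO service_requests (description) VALUES (:1)",
            ['non\xa0breaking space'])
conn.commit()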

protobuf decode error using python

When I try to decode a stream into a protobuf using Python, I get this error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xaa in position 1
My code just reads data from a file and uses ParseFromString to decode it.
f = open('ds.resp', 'rb')
eof = False
data = f.read(1024*1024)
eof = not data
if not eof:
    entity = dataRes_pb2.dataRes()
    entity.ParseFromString(data)
    print entity
The data in the file was downloaded and saved from an HTTP request. It seems the data is not utf-8 encoded, so I used chardet.detect() and found it is ISO-8859-2.
The problem is, it seems that ParseFromString() needs the data to be utf-8 encoded (I am not sure). If I convert the data from ISO-8859-2 to utf-8, I get another error:
google.protobuf.message.DecodeError: Truncated message
How to correctly decode the data? Anybody have some advice?
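Not an authoritative answer, but a minimal sketch of the bytes-only path (the dataRes message and ds.resp file are taken from the question): ParseFromString() takes the raw bytes exactly as SerializeToString() produced them, with no text codec involved, and a "Truncated message" error generally means the bytes passed in are not a complete serialized message.
import dataRes_pb2

with open('ds.resp', 'rb') as f:
    data = f.read()                 # read the whole file as bytes; never .decode()

entity = dataRes_pb2.dataRes()
entity.ParseFromString(data)        # expects a complete serialized dataRes message
print(entity)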
