protobuf decode error using Python

When I try to decode a stream into a protobuf using Python, I get this error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xaa in position 1
My code just reads data from a file and uses ParseFromString() to decode it:
f = open('ds.resp', 'rb')
data = f.read(1024 * 1024)
eof = not data
if not eof:
    entity = dataRes_pb2.dataRes()
    entity.ParseFromString(data)
    print entity
The data in the file was downloaded and saved from an HTTP request. It seems that the data is not UTF-8 encoded, so I used chardet.detect() and found that it reports ISO-8859-2.
The problem is, it seems that ParseFromString() needs the data to be UTF-8 encoded (I am not sure). If I convert the data from ISO-8859-2 to UTF-8, I get another error:
google.protobuf.message.DecodeError: Truncated message
How do I correctly decode the data? Does anybody have some advice?
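For what it's worth, protobuf wire data is binary, not text: ParseFromString() expects the raw serialized bytes, and no charset conversion should be applied to them. chardet will happily misreport binary data as some legacy codec like ISO-8859-2, and the "Truncated message" error appears because the ISO-8859-2 to UTF-8 conversion corrupted the bytes. A minimal sketch of parsing the untouched bytes (assuming ds.resp holds exactly one complete serialized dataRes message and the HTTP body was not gzipped):
from google.protobuf.message import DecodeError
import dataRes_pb2  # the generated module from the question

# Read the raw bytes; never decode or transcode them
with open('ds.resp', 'rb') as f:
    data = f.read()

entity = dataRes_pb2.dataRes()
try:
    entity.ParseFromString(data)
    print(entity)
except DecodeError as e:
    # A DecodeError on the untouched bytes means the file itself is not
    # a complete serialized message (e.g. a compressed or cut-short body)
    print('parse failed:', e)
If the raw bytes still raise UnicodeDecodeError, the usual culprit is a field declared as string in the .proto that actually carries non-UTF-8 data; declaring that field as bytes avoids the UTF-8 check.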

Related

pd.read_excel throws UnicodeDecodeError

I am trying to read data from Excel into pandas. The file comes from an API and is not saved locally (access to the file needs special permissions, so I don't want to save it). When I try to read the Excel data from a file object
with open('path_to_file') as file:
    re = pd.read_excel(file)
I get the error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x98 in position 10: invalid start byte
When I pass the path in place of the file object, everything works fine:
re = pd.read_excel('path-to-exactly-the-same-file')
Is there a way to read excel by pandas without saving it and inputting path?
The part that was missing was 'rb' in open:
with open('path_to_file', 'rb') as file:
    re = pd.read_excel(file)
This treats the file as binary. Idea taken from UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte.
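If the data comes straight from an API and never touches disk, the same idea works with an in-memory bytes buffer. A minimal sketch, assuming a requests response whose body is the workbook (the URL and auth details here are placeholders):
import io
import pandas as pd
import requests

# Hypothetical endpoint returning an .xlsx body; adjust URL/auth as needed
resp = requests.get('https://example.com/api/report.xlsx')
resp.raise_for_status()

# BytesIO gives pandas a binary file-like object, so no utf-8 decoding occurs
df = pd.read_excel(io.BytesIO(resp.content))
print(df.head())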

CSV to bytes to DF to bypass UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte?

I have a CSV that I previously read into a dataframe without issue, but which now gives me the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
df = pd.read_csv(r'\\blah\blah2\csv.csv')
I tried this:
df = pd.read_csv(r'\\blah\blah2\csv.csv', encoding = 'utf-8-sig')
but that gave me this error: UnicodeDecodeError: 'utf-8-sig' codec can't decode byte 0xff in position 10423: invalid start byte
So then I tried 'utf-16', but that gave me this error: UnicodeError: UTF-16 stream does not start with BOM
Then I tried this:
with open(r'\\blah\blah2\csv.csv', 'rb') as f:
    contents = f.read()
and that worked, but I need that csv as a dataframe, so then I tried:
new_df = pd.DataFrame.to_string(contents)
but I got this error: AttributeError: 'bytes' object has no attribute 'columns'
Could someone please help me get my dataframe?
Thank you.
UPDATE:
This fixed it. It read the csv into a dataframe without the unicode errors.
df = pd.read_csv(r'\\blah\blah2\csv.csv', encoding='latin1')
Try to find the correct encoding with the code below:
# import the chardet library
import chardet

# use the detect method to find the encoding
# 'rb' means read in the file as binary
with open(your_file, 'rb') as file:
    print(chardet.detect(file.read()))
However, it is not guaranteed to find the encoding, since the file may mix encodings or languages; but if the file uses a single encoding, chardet will usually report it.
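For reference, detect() returns a dict whose 'encoding' key you can pass straight to pd.read_csv; the values below are illustrative:
# e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}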
Run pip install chardet (or pip3 install chardet) if you don't have it installed.
EDIT1:
The following is another way to find the right encoding; it may help if the above didn't:
import pandas as pd
from encodings.aliases import aliases

alias_values = set(aliases.values())
for value in alias_values:
    try:
        df = pd.read_csv(your_file, encoding=value)  # or pd.read_excel
        print(value)
    except Exception:
        continue
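Putting the pieces together: a minimal sketch that reads the bytes once, guesses the encoding with chardet, and builds the dataframe from the same bytes via io.BytesIO (the latin1 fallback matches the fix from the update above):
import io
import chardet
import pandas as pd

path = r'\\blah\blah2\csv.csv'
with open(path, 'rb') as f:
    contents = f.read()

# Guess the encoding from the raw bytes; 'encoding' may be None if detection fails
guess = chardet.detect(contents)
df = pd.read_csv(io.BytesIO(contents), encoding=guess['encoding'] or 'latin1')
print(df.head())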

Ignore UnicodeEncodeError when saving utf8 file

I have a large string of a novel that I downloaded from Project Gutenberg. I am trying to save it to my computer, but I'm getting a UnicodeEncodeError and I don't know how to fix or ignore it.
from urllib import request
# Get the text
response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
# Decode it using utf8
raw = response.read().decode('utf8')
# Save the file
file = open('corpora/canon_texts/' + 'test', 'w')
file.write(raw)
file.close()
This gives me the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
First, I tried to remove the BOM at the beginning of the file:
# We have to get rid of the pesky Byte Order Mark before we save it
raw = raw.replace(u'\ufeff', '')
but I get the same error, just with a different position number:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 7863-7864: character maps to <undefined>
If I look in that area I can't find the offending characters, so I don't know what to remove:
raw[7850:7900]
just prints out:
' BALLENA, Spanish.\r\n PEKEE-NUEE-'
which doesn't look like it would be a problem.
So then I tried to skip the bad lines with a try statement:
file = open('corpora/canon_texts/' + 'test', 'w')
try:
    file.write(raw)
except UnicodeEncodeError:
    pass
file.close()
but this skips the entire text, giving me a file of 0 size.
How can I fix this?
EDIT:
A couple of people have noted that '\ufeff' suggests UTF-16. I tried switching to UTF-16:
# Get the text
response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
# Decode it using utf16
raw = response.read().decode('utf-16')
But I can't even download the data before I get this error:
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x0a in position 1276798: truncated data
SECOND EDIT:
I also tried decoding with utf-8-sig as suggested in u'\ufeff' in Python string because that includes BOM, but then I'm back to this error when I try to save it:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 7863-7864: character maps to <undefined>
Decoding and re-encoding a file just to save it to disk is pointless. Just write out the bytes you have downloaded, and you will have the file on disk:
raw = response.read()
with open('corpora/canon_texts/' + 'test', 'wb') as outfile:
    outfile.write(raw)
This is the only reliable way to write to disk exactly what you downloaded.
Sooner or later you'll want to read in the file and work with it, so let's consider your error. You didn't provide a full stack trace (always a bad idea), but your error is during encoding, not decoding. The decoding step succeeded. The error must be arising on the line file.write(raw), which is where the text gets encoded for saving. But to what encoding is it being converted? Nobody knows, because you opened file without specifying an encoding! The encoding you're getting depends on your location, OS, and probably the tides and weather forecast. In short: Specify the encoding.
text = response.read().decode('utf8')
with open('corpora/canon_texts/' + 'test', 'w', encoding="utf-8") as outfile:
    outfile.write(text)
U+FEFF is the byte order mark used by UTF-16. Try that encoding instead.
.decode(encoding="utf-8", errors="strict") offers error handling as a built-in feature:
The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers.
Probably the safest option is
decode("utf8", errors='backslashreplace')
which will escape encoding errors with a backslash, so you have a record of what failed to decode.
Conveniently, your Moby Dick text contains no backslashes, so it will be quite easy to check what characters are failing to decode.
What is strange about this text is that the website says it is UTF-8, but \ufeff is the BOM for UTF-16. Decoding as UTF-16, it looks like you're just having trouble with the very last byte, 0x0a (an ASCII line feed), which can probably safely be dropped with
decode("utf-16", errors='ignore')

UnicodeDecodeError: 'utf8' codec can't decode byte 0xea [duplicate]

This question already has answers here:
How to determine the encoding of text
(16 answers)
Closed 6 years ago.
I have a CSV file that I'm uploading via an HTML form to a Python API. The API looks like this:
@app.route('/add_candidates_to_db', methods=['GET', 'POST'])
def add_candidates():
    file = request.files['csv_file']
    x = io.StringIO(file.read().decode('UTF8'), newline=None)
    csv_input = csv.reader(x)
    for row in csv_input:
        print(row)
I found the part of the file that causes the issue: it contains the Í character.
I get this error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 1317: invalid continuation byte
I thought I was decoding it with .decode('UTF8'), or is the error happening before that, with file.read()?
How do I fix this?
Edit: I have control of the file. I am creating the CSV file myself by pulling data (sometimes this data has strange characters).
On the server side, I'm reading each row in the file and inserting it into a database.
Your data is not UTF-8; it contains errors. You say that you are generating the data, so the ideal solution is to generate better data.
Unfortunately, sometimes we are unable to get high-quality data, or we have servers that give us garbage and we have to sort it out. For these situations, we can use less strict error handling when decoding text.
Instead of:
file.read().decode('UTF8')
You can use:
file.read().decode('UTF8', 'replace')
This will make it so that any “garbage” characters (anything which is not correctly encoded as UTF-8) will get replaced with U+FFFD, which looks like this:
�
You say that your file has the Í character, but you are probably viewing the file using an encoding other than UTF-8. Is your file supposed to contain Í, or is it just mojibake? Maybe you can figure out what the character is supposed to be, and from that, you can figure out what encoding your data uses if it's not UTF-8.
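As one quick bit of detective work (assuming the problem byte really is the 0xea from the error), you can check what that byte means in a few common single-byte encodings; if Í is the character you expect, mac_roman becomes a plausible candidate:
for enc in ('latin1', 'cp1252', 'mac_roman'):
    print(enc, bytes([0xEA]).decode(enc))
# latin1 ê
# cp1252 ê
# mac_roman Í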
It seems that your file is not encoded in UTF-8. You can try reading the file with all the encodings that Python understands and check which ones let you read the entire content of the file. Try this script:
from codecs import open
encodings = [
"ascii",
"big5",
"big5hkscs",
"cp037",
"cp424",
"cp437",
"cp500",
"cp720",
"cp737",
"cp775",
"cp850",
"cp852",
"cp855",
"cp856",
"cp857",
"cp858",
"cp860",
"cp861",
"cp862",
"cp863",
"cp864",
"cp865",
"cp866",
"cp869",
"cp874",
"cp875",
"cp932",
"cp949",
"cp950",
"cp1006",
"cp1026",
"cp1140",
"cp1250",
"cp1251",
"cp1252",
"cp1253",
"cp1254",
"cp1255",
"cp1256",
"cp1257",
"cp1258",
"euc_jp",
"euc_jis_2004",
"euc_jisx0213",
"euc_kr",
"gb2312",
"gbk",
"gb18030",
"hz",
"iso2022_jp",
"iso2022_jp_1",
"iso2022_jp_2",
"iso2022_jp_2004",
"iso2022_jp_3",
"iso2022_jp_ext",
"iso2022_kr",
"latin_1",
"iso8859_2",
"iso8859_3",
"iso8859_4",
"iso8859_5",
"iso8859_6",
"iso8859_7",
"iso8859_8",
"iso8859_9",
"iso8859_10",
"iso8859_13",
"iso8859_14",
"iso8859_15",
"iso8859_16",
"johab",
"koi8_r",
"koi8_u",
"mac_cyrillic",
"mac_greek",
"mac_iceland",
"mac_latin2",
"mac_roman",
"mac_turkish",
"ptcp154",
"shift_jis",
"shift_jis_2004",
"shift_jisx0213",
"utf_32",
"utf_32_be",
"utf_32_le",
"utf_16",
"utf_16_be",
"utf_16_le",
"utf_7",
"utf_8",
"utf_8_sig",
]
for encoding in encodings:
    try:
        with open(file, encoding=encoding) as f:
            f.read()
        print('Seemingly working encoding: {}'.format(encoding))
    except:
        pass
where file is the filename of your file.

Python 3.4, Decoding error for byte '\x93'

I am simply reading a file in binary mode into a buffer, performing some replacements on that buffer, and then inserting the buffered data into a MySQL database. Every byte was inserted regardless of encoding. With Python 2.7 this worked well.
CODE:
with open(binfile, 'rb') as fd_bin:
    bin_data = fd_bin.read()
    bin_data = bin_data.replace('"', '\\"')
    db_cursor.execute("INSERT INTO table BIN_DATA values {}".format(bin_data))
When I used Python 3.4, the data needed to be decoded, so I used:
bin_data = fd_bin.read()
bin_data = bin_data.decode('utf-8') # error at this line
The second line produced this error:
bin_data = bin_data.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 1: invalid start byte
I tried the latin-1 and iso-8859-1 decoding schemes, but they insert some extra bytes in places; when I fetch the data from the database, it is not the same as what was inserted, though it was with Python 2.7.
How can I insert or decode that data regardless of encoding scheme? I can't skip or ignore the bytes.
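The usual way around this in Python 3 is to skip decoding entirely and hand the driver raw bytes through a parameterized query, letting it do the escaping. A minimal sketch, assuming a BLOB column and a DB-API driver such as PyMySQL (the connection details, table, and column names are placeholders):
import pymysql

conn = pymysql.connect(host='localhost', user='user',
                       password='secret', database='mydb')

with open(binfile, 'rb') as fd_bin:
    bin_data = fd_bin.read()  # keep it as bytes; never decode

with conn.cursor() as cur:
    # The %s placeholder lets the driver escape the raw bytes safely,
    # so no manual quoting (and no decoding) is needed
    cur.execute("INSERT INTO bin_table (bin_data) VALUES (%s)", (bin_data,))
conn.commit()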
