Parsing a binary file to CSV file format - python

I need to parse a .bin file into a specific CSV format (already given).
CSV format:
Hex Addr,Byte,Package,Mnemonic,Short Description,Tlm Type,Tlm Conversion (EU/lsb),Eng. Units (EU)
0x420,32,COMMAND_TLM,CMD_STATUS,Command Status,uint8,1,0/OK 1/BAD_APID 2/BAD_OPCODE 3/BAD_DATA 7/NO_CMD_DATA 8/CMD_SRVC_OVERRUN 9/CMD_APID_OVERRUN 12/TABLES_BUSY 13/FLASH_NOT_ARMED 14/THRUSTERS_NOT_ENABLED 15/ATT_ERR_TOO_HIGH 16/ASYNC_REFUSED
0x421,33,COMMAND_TLM,CMD_REJECT_STATUS,Command Reject Status,uint8,1,0/OK 1/BAD_APID 2/BAD_OPCODE 3/BAD_DATA 7/NO_CMD_DATA 8/CMD_SRVC_OVERRUN 9/CMD_APID_OVERRUN 12/TABLES_BUSY 13/FLASH_NOT_ARMED 14/THRUSTERS_NOT_ENABLED 15/ATT_ERR_TOO_HIGH 16/ASYNC_REFUSED
0x422,34,COMMAND_TLM,CMD_ACCEPT_COUNT,Command Accept Count,uint8,1,none
0x423,35,COMMAND_TLM,CMD_REJECT_COUNT,Command Reject Count,uint8,1,none
There are 8 columns in the CSV file.
After reading my .bin file with:
import struct
data = open("telemetry.bin", "rb").read()
the type of data is <class 'bytes'>.
I can't figure out which encoding it is. I also tried
encoding_1 = json.detect_encoding(data)
which gives me utf-16-le. I tried to decode it like
data = data.decode('utf16')
but it throws an error of
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 10-11: illegal UTF-16 surrogate
Any help would be appreciated.
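A .bin telemetry file is raw binary, not encoded text, so no codec will decode it; the usual approach is to unpack each field with struct at the offset the CSV spec gives. A minimal sketch, assuming the "Byte" column is a plain offset into the packet and uint8 maps to struct format "B" (fabricated placeholder bytes stand in for the real file):

```python
import struct

# Placeholder for: data = open("telemetry.bin", "rb").read()
data = bytes(range(64))  # fabricated bytes so the sketch is self-contained

# Offsets and types lifted from the "Byte" and "Tlm Type" columns above;
# whether the offsets are absolute or packet-relative is an assumption.
FIELDS = [
    (32, "CMD_STATUS", "B"),         # uint8
    (33, "CMD_REJECT_STATUS", "B"),  # uint8
    (34, "CMD_ACCEPT_COUNT", "B"),   # uint8
    (35, "CMD_REJECT_COUNT", "B"),   # uint8
]

values = {}
for offset, name, fmt in FIELDS:
    # unpack_from reads at the given offset without slicing the buffer
    (values[name],) = struct.unpack_from(fmt, data, offset)
```

Each decoded value can then be written out with csv.writer alongside the static columns from the spec.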

Related

pd.read_excel throws UnicodeDecodeError

I am trying to read data from Excel into pandas. The file comes from an API and is not saved (access to the file needs special permissions, so I don't want to save it). When I try to read the Excel from a file handle
with open('path_to_file') as file:
re = pd.read_excel(file)
I get the error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x98 in position
10: invalid start byte
When I input the path in place of the file handle, everything works fine:
re = pd.read_excel('path-to-exactly-the-same-file')
Is there a way to read excel by pandas without saving it and inputting path?
The part that was missing was 'rb' in open:
with open('path_to_file', 'rb') as file:
re = pd.read_excel(file)
to treat the file as binary. Idea taken from UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
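For the original goal (reading the workbook without ever saving it), the raw response bytes can also be wrapped in io.BytesIO, since read_excel accepts any binary file-like object. A sketch where a tiny in-memory workbook stands in for the API response:

```python
import io
import pandas as pd

# Build a tiny workbook in memory to stand in for the API response bytes
buf = io.BytesIO()
pd.DataFrame({'a': [1, 2]}).to_excel(buf, index=False)
resp_bytes = buf.getvalue()

# No temp file needed: wrap the raw bytes and hand them to read_excel
df = pd.read_excel(io.BytesIO(resp_bytes))
```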

Pandas read_excel: 'utf-8' codec can't decode byte 0xa8 in position 14: invalid start byte

Trying to read an MS Excel file, version 2016. The file contains several sheets with data. It was downloaded from a database and opens correctly in MS Office. In the example below I changed the file name.
EDIT: the file contains Russian and English words. Most probably it uses the Latin-1 encoding, but encoding='latin-1' does not help.
import pandas as pd
with open('1.xlsx', 'r', encoding='utf8') as f:
data = pd.read_excel(f)
Result:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 14: invalid start byte
Without encoding='utf8':
'charmap' codec can't decode byte 0x9d in position 622: character maps to <undefined>
P.S. The task is to process 52 files and merge the data in every sheet with the corresponding sheets across the 52 files, so please no suggestions to do it by hand.
The problem is that the original requester is calling read_excel with a file handle as the first argument. As the last responder demonstrated, the first argument should be a string containing the filename.
I ran into this same error using:
df = pd.read_excel(open("file.xlsx",'r'))
but correct is:
df = pd.read_excel("file.xlsx")
Most probably you're using Python 3; in Python 2 this wouldn't happen.
xlsx files are binary (actually zipped XML), so you need to open them in binary mode. Use this call to open:
open('1.xlsx', 'rb')
There's no full traceback, but I imagine the UnicodeDecodeError comes from the file object, not from read_excel(). That happens because the stream of bytes can contain anything, but we don't want decoding to happen too soon; read_excel() must receive raw bytes and be able to process them.
Most probably the problem is the Russian characters.
charmap is the default decoding method used when no encoding is specified.
Since utf-8 and latin-1 do not help, try to read this file not with
pd.read_excel(f)
but
pd.read_table(f)
or even just
f.readline()
in order to check which character raises the exception, and then delete that character (or characters).
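A quicker way to find the offending character than reading line by line: decode the raw bytes once and inspect the UnicodeDecodeError, which carries the failing position. A small sketch with fabricated bytes standing in for the file:

```python
raw = b'abc\xa8def'  # stand-in for the file's raw bytes

bad_pos = bad_byte = None
try:
    raw.decode('utf-8')
except UnicodeDecodeError as err:
    # err.start is the offset of the first byte that failed to decode
    bad_pos, bad_byte = err.start, raw[err.start]
    print('invalid byte %#x at offset %d' % (bad_byte, bad_pos))
```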
Pandas supports an encoding parameter for reading your Excel file.
In your case you can use:
df = pd.read_excel('your_file.xlsx', encoding='utf-8')
or, if you want the system-specific encoding without any surprise:
import sys
df = pd.read_excel('your_file.xlsx', encoding=sys.getfilesystemencoding())

UnicodeDecodeError: 'utf8' codec can't decode byte 0xea [duplicate]

This question already has answers here:
How to determine the encoding of text
I have a CSV file that I'm uploading via an HTML form to a Python API
The API looks like this:
@app.route('/add_candidates_to_db', methods=['GET', 'POST'])
def add_candidates():
    file = request.files['csv_file']
    x = io.StringIO(file.read().decode('UTF8'), newline=None)
    csv_input = csv.reader(x)
    for row in csv_input:
        print(row)
I found the part of the file that causes the issue. In my file it has Í character.
I get this error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 1317: invalid continuation byte
I thought I was decoding it with .decode('UTF8') or is the error happening before that with file.read()?
How do I fix this?
Edit: I have control of the file. I am creating the CSV file myself by pulling data (sometimes this data has strange characters).
On the server side, I'm reading each row in the file and inserting it into a database.
Your data is not UTF-8, it contains errors. You say that you are generating the data, so the ideal solution is to generate better data.
Unfortunately, sometimes we are unable to get high-quality data, or we have servers that give us garbage and we have to sort it out. For these situations, we can use less strict error handling when decoding text.
Instead of:
file.read().decode('UTF8')
You can use:
file.read().decode('UTF8', 'replace')
This will make it so that any “garbage” characters (anything which is not correctly encoded as UTF-8) will get replaced with U+FFFD, which looks like this:
�
You say that your file has the Í character, but you are probably viewing the file using an encoding other than UTF-8. Is your file supposed to contain Í, or is it just mojibake? Maybe you can figure out what the character is supposed to be, and from that, you can figure out what encoding your data uses if it's not UTF-8.
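A quick sketch of the replacement behavior with a fabricated byte sequence (0xea starts a three-byte UTF-8 sequence, so followed by plain ASCII it is invalid):

```python
raw = b'Candidate: \xea name'       # fabricated row with one bad byte
text = raw.decode('UTF8', 'replace')
# The undecodable byte becomes U+FFFD; the rest of the row survives intact
print(text)
```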
It seems that your file is not encoded in utf8. You can try reading the file with all the encodings Python understands and check which ones let you read the entire content. Try this script:
from codecs import open
encodings = [
"ascii",
"big5",
"big5hkscs",
"cp037",
"cp424",
"cp437",
"cp500",
"cp720",
"cp737",
"cp775",
"cp850",
"cp852",
"cp855",
"cp856",
"cp857",
"cp858",
"cp860",
"cp861",
"cp862",
"cp863",
"cp864",
"cp865",
"cp866",
"cp869",
"cp874",
"cp875",
"cp932",
"cp949",
"cp950",
"cp1006",
"cp1026",
"cp1140",
"cp1250",
"cp1251",
"cp1252",
"cp1253",
"cp1254",
"cp1255",
"cp1256",
"cp1257",
"cp1258",
"euc_jp",
"euc_jis_2004",
"euc_jisx0213",
"euc_kr",
"gb2312",
"gbk",
"gb18030",
"hz",
"iso2022_jp",
"iso2022_jp_1",
"iso2022_jp_2",
"iso2022_jp_2004",
"iso2022_jp_3",
"iso2022_jp_ext",
"iso2022_kr",
"latin_1",
"iso8859_2",
"iso8859_3",
"iso8859_4",
"iso8859_5",
"iso8859_6",
"iso8859_7",
"iso8859_8",
"iso8859_9",
"iso8859_10",
"iso8859_13",
"iso8859_14",
"iso8859_15",
"iso8859_16",
"johab",
"koi8_r",
"koi8_u",
"mac_cyrillic",
"mac_greek",
"mac_iceland",
"mac_latin2",
"mac_roman",
"mac_turkish",
"ptcp154",
"shift_jis",
"shift_jis_2004",
"shift_jisx0213",
"utf_32",
"utf_32_be",
"utf_32_le",
"utf_16",
"utf_16_be",
"utf_16_le",
"utf_7",
"utf_8",
"utf_8_sig",
]
for encoding in encodings:
    try:
        with open(file, encoding=encoding) as f:
            f.read()
        print('Seemingly working encoding: {}'.format(encoding))
    except UnicodeDecodeError:
        pass
where file is the filename of your file.

Python convert csv to xlsx, however getting UnicodeDecodeError

I am trying to convert a csv file to a .xlsx file using PyExcel.
Here is some example data I have in the CSV file.
1.34805E+12,STANDARD,Jose,Sez,,La Pica, 16 o,Renedo de Piélagos,,39470,Spain,,No,No,1231800,2
I am having issues with special characters. If there are none, the line
merge_all_to_a_book(glob.glob("uploadorders.csv"), "uploadorders.xlsx")
runs with no problems; however, if the file does have special characters such as
Piélagos
or
Lücht
I get this error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 26: invalid continuation byte
I am unsure what to do about this; I have resorted to downloading the file and re-saving it in Excel.
You get the UnicodeDecodeError because the encoding Python uses to read the CSV is different from the one used to save the file.
Try to save the file as UTF-8 or use the correct encoding to read it: https://docs.python.org/2/howto/unicode.html
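The same workaround can be scripted instead of done by hand in Excel, assuming the CSV really is Latin-1 (which the é/ü characters suggest; check with chardet if unsure): transcode it to UTF-8 before calling merge_all_to_a_book. A sketch with a fabricated file standing in for uploadorders.csv:

```python
# Fabricated Latin-1 CSV standing in for the real uploadorders.csv
with open('uploadorders.csv', 'wb') as f:
    f.write('1.34805E+12,STANDARD,Jose,Sez,Renedo de Piélagos\n'.encode('latin-1'))

# Transcode Latin-1 -> UTF-8 in place (assumption: source really is Latin-1)
with open('uploadorders.csv', 'r', encoding='latin-1') as src:
    content = src.read()
with open('uploadorders.csv', 'w', encoding='utf-8') as dst:
    dst.write(content)

# Now the original call should no longer raise UnicodeDecodeError:
# merge_all_to_a_book(glob.glob("uploadorders.csv"), "uploadorders.xlsx")
```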

protobuf decode error using python

When I try to decode a stream into a protobuf using Python, I get this error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xaa in position 1
My code just reads data from a file and uses ParseFromString to decode it:
f = open('ds.resp', 'rb')
eof = False
data = f.read(1024*1024)
eof = not data
if not eof:
    entity = dataRes_pb2.dataRes()
    entity.ParseFromString(data)
    print entity
The data in the file was downloaded and saved from an HTTP request. It seems the data is not UTF-8 encoded, so I used chardet.detect() and found that it is ISO-8859-2.
The problem is that it seems ParseFromString() needs the data to be UTF-8 encoded (I am not sure). If I convert the data from ISO-8859-2 to UTF-8, I get another error:
google.protobuf.message.DecodeError: Truncated message
How to correctly decode the data? Anybody have some advice?
