Python: convert CSV to XLSX, however getting UnicodeDecodeError

I am trying to convert a csv file to a .xlsx file using PyExcel.
Here is some example data I have in the CSV file.
1.34805E+12,STANDARD,Jose,Sez,,La Pica, 16 o,Renedo de Piélagos,,39470,Spain,,No,No,1231800,2
I am having issues with special characters. If there are none, the line
merge_all_to_a_book(glob.glob("uploadorders.csv"), "uploadorders.xlsx")
runs with no problems; however, if the file contains special characters such as
Piélagos
or
Lücht
I get this error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 26: invalid continuation byte
I am unsure what to do about this; so far I have resorted to downloading the file and re-saving it in Excel.

You get the UnicodeDecodeError because the encoding Python uses to read the CSV differs from the one the file was saved with.
Save the file as UTF-8, or read it with the correct encoding: https://docs.python.org/2/howto/unicode.html
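A minimal sketch of the re-encoding approach, assuming the CSV was exported as Latin-1/Windows-1252 (byte 0xe9 is 'é' in both — adjust `source_encoding` if your export uses something else). The problem file is written inline so the example runs on its own:

```python
# Sketch: re-encode the CSV to UTF-8 before handing it to pyexcel.
# Assumption: the file was saved as Latin-1/cp1252.
source_encoding = "latin-1"

# Simulate the problematic file so the example is self-contained.
row = "1.34805E+12,STANDARD,Jose,Sez,,La Pica, 16 o,Renedo de Piélagos,,39470,Spain,,No,No,1231800,2\n"
with open("uploadorders.csv", "wb") as f:
    f.write(row.encode(source_encoding))

# Read with the real encoding, write back as UTF-8.
with open("uploadorders.csv", "r", encoding=source_encoding) as src:
    text = src.read()
with open("uploadorders-utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(text)

# Now pyexcel can read it without a UnicodeDecodeError:
# merge_all_to_a_book(glob.glob("uploadorders-utf8.csv"), "uploadorders.xlsx")
```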

Related

Parsing a binary file to CSV file format

I need to parse a .bin file into a specific CSV format (already given).
CSV format:
Hex Addr,Byte,Package,Mnemonic,Short Description,Tlm Type,Tlm Conversion (EU/lsb),Eng. Units (EU)
0x420,32,COMMAND_TLM,CMD_STATUS,Command Status,uint8,1,0/OK 1/BAD_APID 2/BAD_OPCODE 3/BAD_DATA 7/NO_CMD_DATA 8/CMD_SRVC_OVERRUN 9/CMD_APID_OVERRUN 12/TABLES_BUSY 13/FLASH_NOT_ARMED 14/THRUSTERS_NOT_ENABLED 15/ATT_ERR_TOO_HIGH 16/ASYNC_REFUSED
0x421,33,COMMAND_TLM,CMD_REJECT_STATUS,Command Reject Status,uint8,1,0/OK 1/BAD_APID 2/BAD_OPCODE 3/BAD_DATA 7/NO_CMD_DATA 8/CMD_SRVC_OVERRUN 9/CMD_APID_OVERRUN 12/TABLES_BUSY 13/FLASH_NOT_ARMED 14/THRUSTERS_NOT_ENABLED 15/ATT_ERR_TOO_HIGH 16/ASYNC_REFUSED
0x422,34,COMMAND_TLM,CMD_ACCEPT_COUNT,Command Accept Count,uint8,1,none
0x423,35,COMMAND_TLM,CMD_REJECT_COUNT,Command Reject Count,uint8,1,none
There are 8 columns in the CSV file.
After reading my .bin file with:
import struct
data = open("telemetry.bin", "rb").read()
The type of data is <class 'bytes'>.
I can't figure out which encoding it is. I also tried
encoding_1 = json.detect_encoding(data)
which gives me utf-16-le. I tried to decode with
data = data.decode('utf16')
but it throws
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 10-11: illegal UTF-16 surrogate
Any help would be appreciated.
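No codec will help here: a binary telemetry dump isn't text, so it should be parsed field by field with struct rather than decoded. A hypothetical sketch, assuming the spec's "Byte" column is each uint8 field's offset within a frame (the frame below is a stand-in, not real telemetry):

```python
import struct

# Stand-in for one frame of telemetry.bin; each byte's value equals
# its offset here, so the result is easy to follow.
frame = bytes(range(64))

# (offset, mnemonic) pairs from the "Byte" column of the CSV spec.
fields = [
    (32, "CMD_STATUS"),
    (33, "CMD_REJECT_STATUS"),
    (34, "CMD_ACCEPT_COUNT"),
    (35, "CMD_REJECT_COUNT"),
]

# "B" = one unsigned byte (uint8), matching the spec's Tlm Type column.
row = {name: struct.unpack_from("B", frame, offset)[0]
       for offset, name in fields}
print(row)
```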

Pandas read _excel: 'utf-8' codec can't decode byte 0xa8 in position 14: invalid start byte

Trying to read an MS Excel file, version 2016. The file contains several sheets with data. It was downloaded from a database and opens correctly in MS Office. In the example below I changed the file name.
EDIT: the file contains Russian and English words. Most probably it uses the Latin-1 encoding, but encoding='latin-1' does not help.
import pandas as pd
with open('1.xlsx', 'r', encoding='utf8') as f:
    data = pd.read_excel(f)
Result:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 14: invalid start byte
Without encoding='utf8':
'charmap' codec can't decode byte 0x9d in position 622: character maps to <undefined>
P.S. The task is to process 52 files and merge the data in every sheet with the corresponding sheets across the 52 files. So please, no suggestions to do it by hand.
The problem is that the original asker passes a file handle to read_excel as the first argument; it should be a string containing the filename.
I ran into this same error using:
df = pd.read_excel(open("file.xlsx",'r'))
but correct is:
df = pd.read_excel("file.xlsx")
Most probably you're using Python 3 (in Python 2 this wouldn't happen).
.xlsx files are binary (actually zipped XML), so you need to open them in binary mode. Use this call to open:
open('1.xlsx', 'rb')
There's no full traceback, but the UnicodeDecodeError most likely comes from the file object, not from read_excel(): a text-mode handle tries to decode the compressed bytes as they are read, whereas read_excel() needs the raw bytes untouched.
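A self-contained sketch of the fix; a demo workbook is written first so the snippet runs on its own (writing .xlsx needs openpyxl, pandas' default engine):

```python
import pandas as pd

# Build a small workbook so the example is self-contained.
pd.DataFrame({"name": ["Piélagos", "Москва"]}).to_excel("demo.xlsx", index=False)

# Wrong: text mode decodes the zipped bytes and raises UnicodeDecodeError.
# with open("demo.xlsx", "r", encoding="utf8") as f:
#     pd.read_excel(f)

# Right: binary mode — or simply pass the filename instead of a handle.
with open("demo.xlsx", "rb") as f:
    df = pd.read_excel(f)
print(df)
```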
Most probably the problem is the Russian characters.
charmap is the default codec used when no encoding is specified.
If utf-8 and latin-1 both fail, try reading the file not with
pd.read_excel(f)
but with
pd.read_table(f)
or even just
f.readline()
to find out which character raises the exception, and remove that symbol or symbols.
Older versions of pandas support an encoding parameter on read_excel.
In your case you can use:
df = pd.read_excel('your_file.xlsx', encoding='utf-8')
or, if you want the system default with no surprises:
import sys
df = pd.read_excel('your_file.xlsx', encoding=sys.getfilesystemencoding())

Python decoding error on Excel with xlrd

I know this is a recurring subject, but I'm facing an encoding/decoding error when trying to parse an Excel file (.xlsx) opened with xlrd:
value = sheet.cell(row,col).value
value = value.decode('utf-8')  # also tried cp1252 and iso-8859-15
WARNING: 'ascii' codec can't encode character u'\xe9' in position xx: ordinal not in range(128)
The xlrd docs say that from Excel 97 onwards, text in Excel spreadsheets has been stored as Unicode, so decoding should not even be necessary.
Any idea what should be done?
P.S. My Excel file contains some é and à characters.
Still using Python 2? :(
If what you're trying to do is convert from unicode to a UTF-8 encoded str, you need value.encode('utf-8'), not decode. Calling .decode() on a unicode value makes Python 2 first encode it with the default ascii codec, which is exactly what produces that 'ascii' codec can't encode character error.
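A small illustration (it runs on Python 3 as well; the value is the kind of unicode text xlrd hands back):

```python
# xlrd already returns unicode text, so encode it to get UTF-8 bytes.
value = u'Pi\xe9lagos'             # what sheet.cell(row, col).value gives you
utf8_bytes = value.encode('utf-8')  # encode, not decode
print(utf8_bytes)
```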

Reading large excel files in Python and UnicodeDecodeError:

I am new to Python and I'm trying to read a large Excel file. I converted my .xlsx file to CSV to work with pandas. I wrote the code below:
import pandas as pd
pd.read_csv('filepath.csv')
df = csv.parse("Sheet")
df.head()
But it gives this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 28: character maps to <undefined>
Can you please tell me why it gives this error? Or do you have any advice for reading large Excel files? I also tried the openpyxl module, but I couldn't use read_only because of my Python version (I am using Python 2.7.8).
Save the Excel file as a Unicode Text file with Microsoft Excel, then open it with:
df = pd.read_csv(filename, sep='\t', encoding='utf-16-le')
print(df.head())
Try with
pd.read_csv('filepath.csv', encoding='utf-8')
There are many other encodings, such as encoding='iso-8859-1', encoding='cp1252', or encoding='latin1'; choose whichever matches your file.
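The trial-and-error can be scripted. A sketch that writes a cp1252-encoded sample file first so it runs on its own (the candidate list is illustrative, not exhaustive):

```python
import pandas as pd

# Simulate a file saved with a Windows encoding.
with open("filepath.csv", "wb") as f:
    f.write("name\nPiélagos\n".encode("cp1252"))

# Try candidate encodings until one decodes cleanly.
df = None
for enc in ("utf-8", "cp1252", "latin1"):
    try:
        df = pd.read_csv("filepath.csv", encoding=enc)
        print("decoded with", enc)
        break
    except UnicodeDecodeError:
        continue
```

Note that latin1 never raises (every byte maps to some character), so put it last: it is a fallback that may silently produce mojibake rather than an error.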

Read csv with foreign language using pandas

I am reading a CSV which has the following format:
ID,Name,Description
Now some of my descriptions contain foreign-language characters like
,
I would like to read this file into a dataframe if possible. When I use read_csv etc., I get encoding errors. I tried
csv = pd.read_csv('foo.csv',encoding='utf-8')
but it throws encoding error
'charmap' codec can't decode byte 0x9d
Is there any way to read the file while keeping the character set intact, so I can keep analysing?
If not, what would be the way to get such a file into a dataframe or array-like structure?
If keeping the characters is not possible, can such lines/words be ignored so the rest of the data is read? Help appreciated.
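One option, assuming the bytes come from a Windows code page, is simply encoding='cp1252' (byte 0x9d is valid there). If the true encoding is unknown, pandas 1.3+ adds an encoding_errors parameter to read_csv that replaces undecodable bytes instead of failing. A sketch with a simulated problem file:

```python
import pandas as pd

# Simulate a CSV containing a byte (0x9d) that is invalid UTF-8.
with open("foo.csv", "wb") as f:
    f.write(b"ID,Name,Description\n1,A,caf\x9d\n")

# pandas >= 1.3: keep reading, replacing bad bytes with U+FFFD
# so the rest of the data survives.
df = pd.read_csv("foo.csv", encoding="utf-8", encoding_errors="replace")
print(df["Description"][0])
```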
