Reading large Excel files in Python and UnicodeDecodeError

I am new to Python and I'm trying to read a large Excel file. I converted my xlsx file to CSV to work with pandas. I wrote the code below:
import pandas as pd
pd.read_csv('filepath.csv')
df = csv.parse("Sheet")
df.head()
But it gives this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 28: character maps to <undefined>
Can you please tell me why it gives this error? Or do you have any advice on reading large Excel files? I also tried the openpyxl module, but I couldn't use read_only because of my Python version (I am using Python 2.7.8).

Save the Excel file as a Unicode Text file from Microsoft Excel (this format is UTF-16LE with tab separators), then open it with:
df = pd.read_csv(filename, sep='\t', encoding='utf-16-le')
print(df.head())

Try:
pd.read_csv('filepath.csv', encoding='utf-8')
There are many other encodings, such as encoding='iso-8859-1', encoding='cp1252', or encoding='latin1'. Choose the one that matches how your file was actually saved.
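When the right encoding is unknown, the suggestions above can be tried mechanically. A minimal sketch (the function name and candidate list are placeholders, not from the question):

```python
import pandas as pd

def read_csv_any_encoding(path, candidates=('utf-8', 'cp1252', 'latin1')):
    """Try each candidate encoding in turn; return the DataFrame and
    the first encoding that decodes the file cleanly."""
    for enc in candidates:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings decoded %s' % path)
```

Note that latin1 never raises a UnicodeDecodeError (every byte maps to a character), so it should come last: it guarantees a result but may silently mis-map characters.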

Related

Importing a file from a subfolder with read_csv : how to get it to work with engine='c' ? (UnicodeDecodeError)

I am trying to use pandas to read a CSV file which is in a subfolder of the current folder. I am on a Windows PC.
If I run:
df=pd.read_csv("subfolder//file.csv")
I get:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 16: invalid start byte
If I run:
df=pd.read_csv("subfolder//file.csv", engine='python')
It works.
Why?
Isn't there a way to use the C engine? It's meant to be faster.
This might be because read_csv decodes the file as UTF-8 by default, while your file is clearly in a different encoding. To detect the encoding on Windows, you can look at this:
Get encoding of a file in Windows
Once you have found the file's encoding, pass it to read_csv via the encoding argument. For this error, a single-byte Windows encoding is a plausible guess (0xb2 is a printable character in cp1252), e.g.:
df = pd.read_csv("subfolder/file.csv", encoding="cp1252")
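To see that the C engine itself is fine once the encoding is right, here is a self-contained sketch that re-creates the symptom; the filename and data are made up for illustration, and the guess is that engine='python' only happened to work because it decoded with a different default.

```python
import pandas as pd

# Re-create the symptom: a cp1252-encoded file. Byte 0xb2 is the
# superscript-two character in cp1252 but an invalid start byte in UTF-8.
with open('file.csv', 'wb') as f:
    f.write('area,value\nm\xb2,3\n'.encode('cp1252'))

# With the encoding given explicitly, the default (fast) C engine parses
# the file; no fallback to engine='python' is needed.
df = pd.read_csv('file.csv', encoding='cp1252', engine='c')
print(df['area'][0])
```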

Pandas read_excel: 'utf-8' codec can't decode byte 0xa8 in position 14: invalid start byte

I am trying to read an MS Excel file (version 2016) that contains several sheets of data. The file was downloaded from a database and opens correctly in MS Office. In the example below I changed the file name.
EDIT: the file contains Russian and English words. Latin-1 seemed the most likely encoding, but encoding='latin-1' does not help.
import pandas as pd
with open('1.xlsx', 'r', encoding='utf8') as f:
    data = pd.read_excel(f)
Result:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 14: invalid start byte
Without encoding='utf8':
'charmap' codec can't decode byte 0x9d in position 622: character maps to <undefined>
P.S. The task is to process 52 files and merge the data in every sheet with the corresponding sheets across the 52 files, so please no suggestions to fix things by hand.
The problem is that the original asker is calling read_excel with a file handle as the first argument; it should be a string containing the filename.
I ran into this same error using:
df = pd.read_excel(open("file.xlsx",'r'))
but correct is:
df = pd.read_excel("file.xlsx")
Most probably you're using Python3. In Python2 this wouldn't happen.
xlsx files are binary (actually a zip archive containing XML), so you need to open them in binary mode. Use this call to open:
open('1.xlsx', 'rb')
There's no full traceback, but the UnicodeDecodeError most likely comes from the file object, not from read_excel(). A text-mode handle tries to decode the stream of bytes as it reads, which fails on binary content; read_excel() needs to receive the raw, undecoded bytes so it can process them itself.
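The point about binary mode can be demonstrated without Excel at all. This sketch builds a minimal zip archive (which is what an .xlsx really is) and shows what a binary-mode handle delivers; the filename is made up:

```python
import zipfile

# An .xlsx file is really a zip archive containing XML parts, so its
# content is arbitrary bytes, not decodable text.
with zipfile.ZipFile('demo.xlsx', 'w') as z:
    z.writestr('xl/worksheets/sheet1.xml', '<worksheet/>')

# A binary-mode handle ('rb') passes those raw bytes through untouched,
# which is what read_excel() needs; a text-mode handle ('r') would try
# to decode them on the fly and can raise UnicodeDecodeError first.
with open('demo.xlsx', 'rb') as f:
    header = f.read(2)
print(header)  # zip archives always begin with b'PK'
```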
Most probably the problem is the Russian characters.
charmap is the default codec used on Windows when no encoding is specified.
If utf-8 and latin-1 do not help, then try to read the file not with
pd.read_excel(f)
but with
pd.read_table(f)
or even just
f.readline()
in order to check which symbol raises the exception, and delete that symbol (or symbols).
Older versions of pandas supported an encoding argument to read_excel (it has since been deprecated and removed, so this only applies to old releases). In your case you can use:
df = pd.read_excel('your_file.xlsx', encoding='utf-8')
or, if you want the system-specific default without any surprises:
import sys
df = pd.read_excel('your_file.xlsx', encoding=sys.getfilesystemencoding())
(note that sys.getfilesystemencoding() is a function call, not a string, so it must not be wrapped in quotes).

What am I doing wrong when I am trying to read my csv file in python?

I am using Spyder through the Anaconda bundle on a Macbook and keep getting this error when I use the below commands
import pandas as pd
file = ('/Users/JDMacBook/.spyder-py3/US Mass Shootings.csv')
df = pd.read_csv(file)
print(df.head())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 87: invalid continuation byte
Sorry if this is a duplicate -- I googled and youtube'd and even stackoverflowed the crap out of this question, but I can't seem to figure this out. Can you please help this newbie?
If the file you are trying to process is https://github.com/bruno78/python-capstone-project/blob/master/mj-1982-2016-US-mass-shootings.csv there is a spurious ghost byte on line 55 which needs to be removed in order for the file to be properly decoded.
Line 55 describes the Trolley Square shooting so there is a third-party source (viz. Wikipedia) where you can verify the correct orthography of the shooter's name.
import pandas as pd
file = '/Users/JDMacBook/.spyder-py3/US Mass Shootings.csv'
data = pd.read_csv(file, encoding='cp1252')
Try this. The error shows the file is not valid UTF-8, which is already pandas' default, so encoding='utf-8' alone will not help; byte 0xd1 is a printable character in single-byte Windows encodings such as cp1252 (or latin-1), so one of those is a likely match.
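If, as in the spurious-byte answer above, the file is valid UTF-8 except for stray bytes, newer pandas (1.3+) offers encoding_errors as an alternative to hand-editing. A sketch with a made-up filename and a deliberately corrupted byte:

```python
import pandas as pd

# Create a file that is UTF-8 except for one stray 0xd1 byte, the same
# kind of corruption the error message reports.
with open('mass_shootings.csv', 'wb') as f:
    f.write(b'name,year\nTalovi\xd1c,2007\n')

# pandas 1.3+ can substitute undecodable bytes with U+FFFD instead of
# raising, which at least lets the rest of the file load.
df = pd.read_csv('mass_shootings.csv', encoding='utf-8',
                 encoding_errors='replace')
print(df['name'][0])  # the bad byte shows up as the replacement character
```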

Python convert csv to xlsx, however getting UnicodeDecodeError

I am trying to convert a csv file to a .xlsx file using PyExcel.
Here is some example data I have in the CSV file.
1.34805E+12,STANDARD,Jose,Sez,,La Pica, 16 o,Renedo de Piélagos,,39470,Spain,,No,No,1231800,2
I am having issues with special characters. If there are none, the line
merge_all_to_a_book(glob.glob("uploadorders.csv"), "uploadorders.xlsx")
runs without problems; however, if the data does have special characters such as
Piélagos
or
Lücht
I get this error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 26: invalid continuation byte
I am unsure what to do about this; so far I have resorted to downloading the file and re-saving it in Excel.
You get the UnicodeDecodeError because the encoding Python uses to read the CSV is different from the one used to save the file.
Try saving the file as UTF-8, or use the correct encoding to read it: https://docs.python.org/2/howto/unicode.html
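The re-saving step can also be scripted, avoiding the manual Excel round-trip. A sketch assuming the source file is Latin-1 (0xe9 is 'é' there, which matches names like "Piélagos"); the sample data is a stand-in for the real file:

```python
# Stand-in for the downloaded CSV, saved in Latin-1.
with open('uploadorders.csv', 'wb') as f:
    f.write('id,city\n1,Renedo de Piélagos\n'.encode('latin-1'))

# Re-encode it to UTF-8 so pyexcel's merge_all_to_a_book can read it.
with open('uploadorders.csv', 'r', encoding='latin-1') as src:
    text = src.read()
with open('uploadorders-utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(text)

# Now the conversion can run on the re-encoded copy:
# merge_all_to_a_book(glob.glob("uploadorders-utf8.csv"), "uploadorders.xlsx")
```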

UnicodeDecodeError: ('utf-8' codec) while reading a csv file [duplicate]

This question already has answers here:
UnicodeDecodeError when reading CSV file in Pandas with Python
(25 answers)
Closed 5 years ago.
What I am trying to do: read a CSV into a dataframe, make changes in one column, write the changed values back to the same CSV (to_csv), then read that CSV again to make another dataframe. On the second read I get an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte
my code is
import pandas as pd
df = pd.read_csv("D:\ss.csv")
df.columns #o/p is Index(['CUSTOMER_MAILID', 'False', 'True'], dtype='object')
df['True'] = df['True'] + 2 #making changes to one column of type float
df.to_csv("D:\ss.csv") #updating that .csv
df1 = pd.read_csv("D:\ss.csv") #again trying to read that csv
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte
So please suggest how I can avoid the error and be able to read that CSV into a dataframe again.
I know I am missing an encoding or decoding argument somewhere while reading and writing the CSV, but I don't know exactly what should be changed, so I need help.
Known encoding
If you know the encoding of the file you want to read in,
you can use
pd.read_csv('filename.txt', encoding='encoding')
These are the possible encodings:
https://docs.python.org/3/library/codecs.html#standard-encodings
Unknown encoding
If you do not know the encoding, you can try to use chardet; however, this is not guaranteed to work, as it is essentially educated guesswork.
import chardet
import pandas as pd
with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or f.readline() if the file is large

pd.read_csv('filename.csv', encoding=result['encoding'])
Is that error happening on your first read of the data, or on the second read after you write it out and read it back in again? My guess is that it's actually happening on the first read of the data, because your CSV has an encoding that isn't UTF-8.
Try opening that CSV file in Notepad++, or Excel, or LibreOffice. Does your data source have the ç (C with cedilla) character in it? If it does, then that 0xE7 byte you're seeing is probably the ç encoded in either Latin-1 or Windows-1252 (called "cp1252" in Python).
Looking at the documentation for the Pandas read_csv() function, I see it has an encoding parameter, which should be the name of the encoding you expect that CSV file to be in. So try adding encoding="cp1252" to your read_csv() call, as follows:
df = pd.read_csv(r"D:\ss.csv", encoding="cp1252")
Note that I added the character r in front of the filename, so that it will be considered a "raw string" and backslashes won't be treated specially. That way you don't get a surprise when you change the filename from ss.csv to new-ss.csv, where the string D:\new-ss.csv would be read as D, :, newline character, e, w, etc.
Anyway, try that encoding parameter on your first read_csv() call and see if it works. (It's only a guess, since I don't know your actual data. If the data file isn't private and isn't too large, try posting the data file so we can see its contents -- that would let us do better than just guessing.)
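Putting that advice together, the whole round trip can be made stable by writing the corrected file back as UTF-8, so only the first read needs the special encoding. A self-contained sketch with stand-in data (byte 0xe7 is 'ç' in cp1252, matching the reported error; the filename and row are made up):

```python
import pandas as pd

# Stand-in for D:\ss.csv, saved in cp1252 (0xe7 = 'ç').
with open('ss.csv', 'wb') as f:
    f.write('CUSTOMER_MAILID,False,True\nfran\xe7ois@x.com,0,1.5\n'
            .encode('cp1252'))

df = pd.read_csv('ss.csv', encoding='cp1252')       # first read needs the hint
df['True'] = df['True'] + 2                         # the column edit
df.to_csv('ss.csv', index=False, encoding='utf-8')  # rewrite as UTF-8

df1 = pd.read_csv('ss.csv')  # second read now works with the default encoding
print(df1['True'][0])
```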
One simple solution is you can open the csv file in an editor like Sublime Text and save it with 'utf-8' encoding. Then we can easily read the file through pandas.
The method above, importing chardet and detecting the file's encoding, works:
import chardet
import pandas as pd

with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or f.readline() if the file is large

pd.read_csv('filename.csv', encoding=result['encoding'])
Yes, you'll get this error. I worked around the problem by opening the CSV file in Notepad++ and changing the encoding through the Encoding menu -> Convert to UTF-8, then saving the file and running the Python program over it again.
Another solution is to use the codecs module in Python for encoding and decoding files; I haven't used that myself.
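For completeness, the codecs-module approach mentioned above might look like this sketch (the input file is a stand-in, and cp1252 is a guess at the source encoding):

```python
import codecs

# Stand-in input file saved in cp1252 (0xe7 = 'ç').
with open('ss.csv', 'wb') as f:
    f.write('fran\xe7ois\n'.encode('cp1252'))

# Stream-decode from cp1252 and re-encode to UTF-8, line by line.
with codecs.open('ss.csv', 'r', encoding='cp1252') as src, \
     codecs.open('ss-utf8.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)
```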
I am new to Python. I ran into this exact issue when I manually changed the extension of my Excel file to .csv and tried to read it with read_csv. However, when I opened the Excel file and saved it as a CSV file instead, it worked.
