Read csv with foreign language using pandas - python

I am reading a CSV file that has the following format:
ID,Name,Description
Some of my descriptions contain foreign-language characters like
,
I would like to read this file into a dataframe if possible. When I use read_csv and similar functions, I get encoding errors. I tried
csv = pd.read_csv('foo.csv',encoding='utf-8')
but it throws an encoding error:
'charmap' codec can't decode byte 0x9d
Is there any way to read the file while keeping the character set, so that I can continue analysing it?
If not, what would be the way to get such a file into a dataframe or array-like structure?
If keeping the characters is not possible, can such lines or words be ignored while the rest of the data is read? Help appreciated.
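One possible approach, sketched here with only the standard library: decode the raw bytes with a short list of candidate encodings and, if none works, replace the undecodable bytes instead of failing. The helper name decode_with_fallback, the candidate list, and the sample bytes are assumptions made for illustration, not part of the question. Once an encoding is found to work, the same name can be passed to pd.read_csv through its encoding argument.

```python
import csv
import io

def decode_with_fallback(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding_used) for the first candidate that decodes raw."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: keep the readable parts, mark bad bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (with replacement)"

# Hypothetical sample: a UTF-8 encoded row containing curly quotes; the
# closing quote ends in byte 0x9d, the byte named in the error message.
raw = "ID,Name,Description\n1,Ana,She said \u201chola\u201d\n".encode("utf-8")

text, used = decode_with_fallback(raw)
rows = list(csv.reader(io.StringIO(text)))
print(used)
print(rows[1])
```

Byte 0x9d is undefined in cp1252 but is a legal continuation byte in UTF-8, so a 'charmap' error on that byte is often a hint that the file is UTF-8 and was read with a Windows code page.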

Related

Working with a pandas dataframe read from 'UTF-16' encoded csv file

I'm working with a dataset in CSV format in Python. The file looks like a normal CSV file, but when I tried to read it using pandas, I got this error message:
In: df = pandas.read_csv(filename)
Out: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Then I used the chardet library and learned that the file is encoded as UTF-16. I tried to read the file again as shown:
df = pandas.read_csv(filename,encoding = 'UTF-16')
Now I was able to read the file. But when I tried to view it using df.head(), I got the output shown below:
In: df.head()
Out: Col1\tCol2\tCol3
val1\tval2\tval3
All the columns were combined into one, separated by \t. How can I modify this CSV file so that it behaves like a typical UTF-8 encoded CSV file?
Edit:
When I added sep=None in pd.read_csv() I was able to get the columns separated. Also, pd.read_csv() took noticeably longer to read the file with sep=None than with the default value for sep.
Can somebody explain why these two things happened?
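A likely explanation, offered as a sketch rather than a definitive answer: the file is tab-separated, so the default sep=',' finds no commas and each line ends up in a single column. With sep=None, pandas falls back to its slower Python parsing engine and guesses the delimiter with Python's csv.Sniffer, which would account for both the correct columns and the extra time. The standard-library example below shows that sniffing step on a made-up sample shaped like the output above. With pandas, passing sep='\t' together with encoding='UTF-16' should avoid the guessing entirely.

```python
import csv
import io

# Hypothetical sample mimicking the file: tab-separated text saved as UTF-16.
raw = "Col1\tCol2\tCol3\nval1\tval2\tval3\n".encode("utf-16")

# The byte order mark written by encode() tells decode() the byte order.
text = raw.decode("utf-16")

# csv.Sniffer inspects the text to guess the delimiter; this extra pass is
# the kind of work that an explicit separator avoids.
dialect = csv.Sniffer().sniff(text, delimiters=",;\t")
print(repr(dialect.delimiter))

# Parsing with the tab delimiter stated explicitly.
rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
print(rows)
```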

while decode the csv file gives wrong data

I want to decode a CSV file, but the result contains the wrong data.
Example: in the CSV file I have BP1-R241, but after decoding the file I get BP1+AC0-R241.
If a column contains characters such as -, /, \ or *, an extra sequence like +AC0 appears in the output.
How can I rectify this?
My code:
import base64
data = 'Y29kZSxxdWFudGl0eSxsb2NhdGlvbgoxMjM0NTY2NDMsMSxCUDErQUMwLVIyNDEKMTIzNDUsMixCUDErQUMwLVIyNDEKMTIzNDU2LDMsQlAxK0FDMC1SMjQxCnEyMzIzNDM1NDY1Niw0LEJQMStBQzAtUjI0MQpkc2Zkc2YsNSxCUDErQUMwLVIyNDEKMjMzNDU2LDYsQlAxK0FDMC1SMjQxCmRkZnNkZiw3LEJQMStBQzAtUjI0MQozNTQ2NzgsOCxCUDErQUMwLVIyNDEKMTIzNDU2Nyw5LEJQMStBQzAtUjI0MQoyMzQ1NjcsMTAsQlAxK0FDMC1SMjQxCml1NjU0MzIsMTEsQlAxK0FDMC1SMjQxCmpoZ2ZkLDEyLEJQMStBQzAtUjI0MQp4Y3ZmZ2JobiwxMyxCUDErQUMwLVIyNDEKY2ZjZ2hqaywxNCxCUDErQUMwLVIyNDEKc2RmZ2hqLDE1LEJQMStBQzAtUjI0MQphc2RmZ2hqLDE2LEJQMStBQzAtUjI0MQpzYWRmZ2hqaywxNyxCUDErQUMwLVIyNDEKc2RzZHNkc2QsMTgsQlAxK0FDMC1SMjQxCjExMjIzMzQ0LDE5LEJQMStBQzAtUjI0MQoxMTIyMzM0NDIsMjAsQlAxK0FDMC1SMjQxClRFU1QxMjMsMjEsQlAxK0FDMC1SMjQxCg=='
data = base64.b64decode(data).decode('utf-8')
Output:
code,quantity,location
123456643,1,BP1+AC0-R241
12345,2,BP1+AC0-R241
123456,3,BP1+AC0-R241
q23234354656,4,BP1+AC0-R241
dsfdsf,5,BP1+AC0-R241
233456,6,BP1+AC0-R241
ddfsdf,7,BP1+AC0-R241
354678,8,BP1+AC0-R241
1234567,9,BP1+AC0-R241
234567,10,BP1+AC0-R241
iu65432,11,BP1+AC0-R241
jhgfd,12,BP1+AC0-R241
xcvfgbhn,13,BP1+AC0-R241
cfcghjk,14,BP1+AC0-R241
sdfghj,15,BP1+AC0-R241
asdfghj,16,BP1+AC0-R241
sadfghjk,17,BP1+AC0-R241
sdsdsdsd,18,BP1+AC0-R241
11223344,19,BP1+AC0-R241
112233442,20,BP1+AC0-R241
TEST123,21,BP1+AC0-R241
The data you've pasted in simply contains BP1+AC0-R241, there's no way around it.
The problem is not in decoding, it's in wherever you get that data from.
Googling "+AC0" leads me to this thread, which says:
The data in your file is encoded as UTF-7 (http://en.wikipedia.org/wiki/UTF-7), instead of the more usual ascii/latin-1 or UTF-8. Each of the +ACI- sequences encodes one double quote character.
Are you sure you've exported the file as UTF-8, not UTF-7?
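If the data really is UTF-7, decoding it with Python's built-in utf-7 codec instead of utf-8 should restore the original characters. A minimal sketch, assuming a short made-up payload built the same way as the one in the question (UTF-7 text that was then base64-encoded):

```python
import base64

# Hypothetical short sample standing in for the long base64 string above.
payload = base64.b64encode(b"code,quantity,location\n12345,2,BP1+AC0-R241\n")

raw = base64.b64decode(payload)
as_utf8 = raw.decode("utf-8")  # leaves the "+AC0-" escape untouched
as_utf7 = raw.decode("utf-7")  # turns "+AC0-" back into "-"

print(as_utf8.splitlines()[1])
print(as_utf7.splitlines()[1])
```

This only repairs the symptom on the reading side; exporting the file as UTF-8 in the first place remains the cleaner fix.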

Pandas read_excel: 'utf-8' codec can't decode byte 0xa8 in position 14: invalid start byte

I am trying to read an MS Excel file, version 2016. The file contains several sheets with data. It was downloaded from a database and opens correctly in MS Office. In the example below I changed the file name.
EDIT: the file contains Russian and English words. It most probably uses the Latin-1 encoding, but encoding='latin-1' does not help.
import pandas as pd
with open('1.xlsx', 'r', encoding='utf8') as f:
    data = pd.read_excel(f)
Result:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 14: invalid start byte
Without encoding='utf8':
'charmap' codec can't decode byte 0x9d in position 622: character maps to <undefined>
P.S. The task is to process 52 files, merging the data in every sheet with the corresponding sheets across the 52 files. So please, no advice that involves manual work.
The problem is that the original poster is calling read_excel with a text-mode file handle as the first argument. As demonstrated by the last responder, the first argument can simply be a string containing the filename.
I ran into this same error using:
df = pd.read_excel(open("file.xlsx",'r'))
but the correct call is:
df = pd.read_excel("file.xlsx")
Most probably you're using Python3. In Python2 this wouldn't happen.
xlsx files are binary (they are actually XML, but packed into a compressed zip archive), so you need to open them in binary mode. Use this call to open the file:
open('1.xlsx', 'rb')
There's no full traceback, but I imagine the UnicodeDecodeError comes from the file object, not from read_excel(). That happens because the stream of bytes can contain anything, but we don't want decoding to happen too soon; read_excel() must receive raw bytes and be able to process them.
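To illustrate why text mode fails, here is a small standard-library sketch. It builds a tiny zip archive in memory as a stand-in for an .xlsx file (an assumption for illustration; the member name and content are made up, and this is not a valid workbook) and shows that the bytes start with the zip signature rather than with text.

```python
import io
import zipfile

# Build a tiny zip archive in memory as a stand-in for an .xlsx file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")

raw = buffer.getvalue()
print(raw[:4])  # zip local file header signature, not readable text
print(zipfile.is_zipfile(io.BytesIO(raw)))

# Binary access keeps the bytes intact, so the archive can be reopened.
with zipfile.ZipFile(io.BytesIO(raw)) as archive:
    names = archive.namelist()
print(names)
```

Opening such a file with open(..., 'r') asks Python to decode these bytes as text before read_excel ever sees them, which is where the UnicodeDecodeError comes from. Passing the filename, or a handle opened with 'rb', leaves the bytes untouched.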
Most probably the problem is caused by the Russian characters.
The 'charmap' codec is the default decoder used when no encoding is specified (typically a Windows code page).
If neither utf-8 nor latin-1 helps, then try reading this file not with
pd.read_excel(f)
but with
pd.read_table(f)
or even just
f.readline()
in order to find which character raises the exception, and then delete that character or characters.
Depending on your pandas version, read_excel may accept an encoding argument.
In your case you can use:
df = pd.read_excel('your_file.xlsx', encoding='utf-8')
or, if you want to match the system's setting without any surprises, you can use:
df = pd.read_excel('your_file.xlsx', encoding=sys.getfilesystemencoding())  # requires import sys

Reading erroneous data from csv file using read_csv from pandas

I am trying to read data from a huge CSV file. It raises this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 13: invalid start byte. Is there any way to just skip the lines that cause this exception? Out of millions of lines only a handful are affected, and I can't delete them manually. I tried adding error_bad_lines=False, but that did not solve the problem. I am using Python 3.6.1, installed through Anaconda 4.4.0, on a Mac if that helps. Please help; I am new to this.
Seems to me that there are some non-ascii characters in your file that cannot be decoded. Pandas accepts an encoding as an argument for read_csv (if that helps):
my_file = pd.read_csv('Path/to/file.csv', encoding = 'encoding')
The default encoding is None, in which case pandas assumes UTF-8, which is why you might be getting those errors. See the list of standard Python encodings and try "ISO-8859-1" (aka 'latin1') or maybe 'utf8' to start.
Pandas does allow you to specify rows to skip when reading a csv, but you would need to know the index of those rows, which in your case would be very difficult.
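As far as I know, error_bad_lines only covers rows with the wrong number of fields, not bytes that fail to decode, which would explain why it did not help. One way to skip the offending lines without knowing their indices is to pre-filter the raw bytes. A minimal standard-library sketch follows; the helper name split_decodable and the sample data are made up for illustration. For a file with millions of lines, the same loop could run over a handle opened with 'rb' instead of an in-memory bytes object.

```python
import csv
import io

def split_decodable(raw, encoding="utf-8"):
    """Return (text, skipped): cleanly decoded lines and the count of bad ones."""
    kept, skipped = [], 0
    for line in raw.splitlines(keepends=True):
        try:
            kept.append(line.decode(encoding))
        except UnicodeDecodeError:
            skipped += 1
    return "".join(kept), skipped

# Hypothetical sample: one row contains byte 0xae, which is not valid UTF-8 here.
raw = b"id,name\n1,alpha\n2,Acme\xae Corp\n3,gamma\n"

text, skipped = split_decodable(raw)
rows = list(csv.reader(io.StringIO(text)))
print(skipped)
print(rows)
```

The filtered text can then be handed to pd.read_csv via io.StringIO. Two alternatives that keep every row: reading with encoding='latin1', which maps every byte to some character, or, in newer pandas releases (1.3 and later, to my knowledge), passing encoding_errors='replace' or 'ignore'. Note that this line-by-line filter assumes no quoted field spans several lines.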

Python convert csv to xlsx, however getting UnicodeDecodeError

I am trying to convert a csv file to a .xlsx file using PyExcel.
Here is some example data I have in the CSV file.
1.34805E+12,STANDARD,Jose,Sez,,La Pica, 16 o,Renedo de Piélagos,,39470,Spain,,No,No,1231800,2
I am having issues with special characters. If there are none, the line
merge_all_to_a_book(glob.glob("uploadorders.csv"), "uploadorders.xlsx")
has no problems. However, if the data does contain special characters such as
Piélagos
or
Lücht
I get this error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 26: invalid continuation byte
I am unsure what to do about this. For now I have resorted to downloading the file and re-saving it in Excel.
You get the UnicodeDecodeError because the encoding Python uses to read the CSV is different from the one that was used to save the file.
Try to save the file as UTF-8 or use the correct encoding to read it: https://docs.python.org/2/howto/unicode.html
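Byte 0xe9 is "é" in Latin-1 (and in cp1252), so one plausible reading is that the CSV was saved in one of those encodings. A minimal sketch of re-encoding such data as UTF-8 before conversion, assuming Latin-1 and using a made-up sample row:

```python
# Hypothetical sample row saved by a tool that used Latin-1: "é" is byte 0xe9.
latin1_bytes = "La Pica,Renedo de Pi\u00e9lagos,39470,Spain\n".encode("latin-1")

try:
    latin1_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    # 0xe9 announces a multi-byte UTF-8 sequence, but the next byte is "l".
    failed = True

# Decode with the encoding the file was written in, then re-encode as UTF-8.
text = latin1_bytes.decode("latin-1")
utf8_bytes = text.encode("utf-8")

print(failed)
print(utf8_bytes.decode("utf-8"), end="")
```

For a file on disk, the same idea is to read it with open(path, encoding='latin-1') and write the text back out with open(new_path, 'w', encoding='utf-8'), then run the conversion on the new file.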
