Decoding a CSV file gives wrong data - Python

I want to decode a CSV file, but it gives the wrong data.
Example: in the CSV file I have BP1-R241, but after decoding the file it gives BP1+AC0-R241.
If a column contains characters such as -, /, \, or *, a sequence like +AC0 is added.
How can I rectify this?
My code:
import base64
data = 'Y29kZSxxdWFudGl0eSxsb2NhdGlvbgoxMjM0NTY2NDMsMSxCUDErQUMwLVIyNDEKMTIzNDUsMixCUDErQUMwLVIyNDEKMTIzNDU2LDMsQlAxK0FDMC1SMjQxCnEyMzIzNDM1NDY1Niw0LEJQMStBQzAtUjI0MQpkc2Zkc2YsNSxCUDErQUMwLVIyNDEKMjMzNDU2LDYsQlAxK0FDMC1SMjQxCmRkZnNkZiw3LEJQMStBQzAtUjI0MQozNTQ2NzgsOCxCUDErQUMwLVIyNDEKMTIzNDU2Nyw5LEJQMStBQzAtUjI0MQoyMzQ1NjcsMTAsQlAxK0FDMC1SMjQxCml1NjU0MzIsMTEsQlAxK0FDMC1SMjQxCmpoZ2ZkLDEyLEJQMStBQzAtUjI0MQp4Y3ZmZ2JobiwxMyxCUDErQUMwLVIyNDEKY2ZjZ2hqaywxNCxCUDErQUMwLVIyNDEKc2RmZ2hqLDE1LEJQMStBQzAtUjI0MQphc2RmZ2hqLDE2LEJQMStBQzAtUjI0MQpzYWRmZ2hqaywxNyxCUDErQUMwLVIyNDEKc2RzZHNkc2QsMTgsQlAxK0FDMC1SMjQxCjExMjIzMzQ0LDE5LEJQMStBQzAtUjI0MQoxMTIyMzM0NDIsMjAsQlAxK0FDMC1SMjQxClRFU1QxMjMsMjEsQlAxK0FDMC1SMjQxCg=='
data = base64.b64decode(data).decode('utf-8')
Output:
code,quantity,location
123456643,1,BP1+AC0-R241
12345,2,BP1+AC0-R241
123456,3,BP1+AC0-R241
q23234354656,4,BP1+AC0-R241
dsfdsf,5,BP1+AC0-R241
233456,6,BP1+AC0-R241
ddfsdf,7,BP1+AC0-R241
354678,8,BP1+AC0-R241
1234567,9,BP1+AC0-R241
234567,10,BP1+AC0-R241
iu65432,11,BP1+AC0-R241
jhgfd,12,BP1+AC0-R241
xcvfgbhn,13,BP1+AC0-R241
cfcghjk,14,BP1+AC0-R241
sdfghj,15,BP1+AC0-R241
asdfghj,16,BP1+AC0-R241
sadfghjk,17,BP1+AC0-R241
sdsdsdsd,18,BP1+AC0-R241
11223344,19,BP1+AC0-R241
112233442,20,BP1+AC0-R241
TEST123,21,BP1+AC0-R241

The data you've pasted simply contains BP1+AC0-R241; there's no way around that.
The problem is not in the decoding; it's in wherever you got that data from.
Googling "+AC0" leads to this thread, and in particular this:
The data in your file is encoded as UTF-7 (http://en.wikipedia.org/wiki/UTF-7), instead of the more usual ascii/latin-1 or UTF-8. Each of the +ACI- sequences encodes one double quote character.
Are you sure you've exported the file as UTF-8, not UTF-7?
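If re-exporting as UTF-8 is not an option, the mangling can also be undone in Python itself: the standard library ships a utf-7 codec, so decoding the bytes with that codec turns each +AC0- sequence back into a hyphen. A minimal sketch, reusing the data variable from the question:

import base64

raw = base64.b64decode(data)  # data is the base64 string from the question
text = raw.decode('utf-7')    # '+AC0-' decodes back to '-', giving BP1-R241

This only helps if the file really is UTF-7 throughout; the cleaner fix is still to export it as UTF-8 at the source.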

Related

Issue trying to decode utf-8 encoded json in python

I have a JSON file that I am trying to read.
The JSON includes the following text, which I am trying to decode with the code below it.
Silikon-Ersatzpl\u00ef\u00bf\u00bdttchen Regenlichtsensor
import json

with open("file_name", encoding="utf-8") as file:
    pdf_labels = json.loads(file.read())
When I try to load it with the json module and specify utf-8 encoding, I get some weird results.
"\u00ef\u00bf\u00bd" will become "�" instead of a desired "ä"
The desired output should look like the following.
Silikon-Ersatzplättchen Regenlichtsensor
Please don't be harsh, this is my first question :)
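A note on what those escapes mean: \u00ef\u00bf\u00bd decodes to the three characters ï¿½, which are the Latin-1 reading of the UTF-8 bytes for U+FFFD, the replacement character. A quick sketch of that round trip:

# 'ï¿½' is what json.loads returns for the escapes in the file
s = "\u00ef\u00bf\u00bd"
print(s.encode('latin-1').decode('utf-8'))  # '\ufffd', the replacement character

In other words, the ä was already replaced with U+FFFD before the JSON file was written, so the original character cannot be recovered from this file alone.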

Working with a pandas dataframe read from 'UTF-16' encoded csv file

I'm working with a CSV dataset in Python. The file looks like a normal CSV file, but when I tried to read it using pandas, it produced an error message saying:
In: df = pandas.read_csv(filename)
Out: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Then I used the chardet library and learned that the encoding of the file is UTF-16. I tried to read the file again as shown:
df = pandas.read_csv(filename,encoding = 'UTF-16')
Now I was able to read the file, but when I viewed it using df.head() I got the output shown below:
In: df.head()
Out: Col1\tCol2\tCol3
val1\tval2\tval3
All the columns got combined into one, separated by \t. How can I make this CSV file work the way a typical UTF-8 encoded CSV file works?
Edit:
When I added sep=None to pd.read_csv(), the columns were separated correctly. However, pd.read_csv() took noticeably longer to read the file with sep=None than with the default value of sep.
Can somebody explain why these two things happened?
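A likely explanation for both observations: with sep=None, read_csv falls back to the slower Python parsing engine and sniffs the delimiter from the file, which is why it separates the columns correctly but takes longer. Since the df.head() output above shows \t between the columns, a sketch that names the delimiter explicitly (assuming the file really is tab-delimited) avoids the sniffing entirely:

import pandas

# UTF-16 file with tab separators; naming sep keeps the fast C engine in play
df = pandas.read_csv(filename, encoding='UTF-16', sep='\t')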

Any way to get correct conversion for unicode text format data to csv in python?

I am accessing a dataset that lives on an FTP server. After downloading the data, I used pandas to read it as CSV, but I got an encoding error. The file has a .csv extension, but when I opened it with MS Excel, the data was in Unicode Text format. I want to convert those datasets stored in Unicode Text format. How can I make this happen? Any idea how to get this done?
My attempt:
from ftplib import FTP
import os

def mydef():
    defaultIP = ''
    username = 'cat'
    password = 'cat'
    ftp = FTP(defaultIP, user=username, passwd=password)
    ftp.dir()
    filenames = ftp.nlst()
    for filename in filenames:  # was `for filename in files`, a NameError
        local_filename = os.path.join('C:\\Users\\me', filename)
        with open(local_filename, 'wb') as file:
            ftp.retrbinary('RETR ' + filename, file.write)
    ftp.quit()
Then I tried this to get the correct encoding:
mydef.encode('utf-8').splitlines()
but this is not working for me. I used this solution.
Here is an output snippet of the above code:
b'\xff\xfeF\x00L\x00O\x00W\x00\t\x00C\x00T\x00Y\x00_\x00R\x00P\x00T\x00\t\x00R\x00E\x00P\x00O\x00R\x00T\x00E\x00R\x00\t\x00C\x00T\x00Y\x00_\x00P\x00T\x00N\x00\t\x00P\x00A\x00R\x00T\x00N\x00E\x00R\x00\t\x00C\x00O\x00M\x00M\x00O\x00D\x00I\x00T\x00Y\x00\t\x00D\x00E\x00S\x00C\x00R\x00I\x00P\x00T\x00I\x00O\x00N\x00\t'
Expected output:
The expected output of this dataset should be normal CSV data, such as common trade data, but the encoding doesn't work for me. I tried different encodings to get a correct conversion of the CSV data, but none of them worked. How can I make this work? Any idea how to get this done? Thanks.
EDIT: I had to change this: now I remove two bytes at the beginning (the BOM) and one byte at the end, because the data is incomplete (every character needs two bytes).
It seems it is not UTF-8 but UTF-16 with a BOM.
If I remove the first two bytes (the BOM, Byte Order Mark) and the last byte at the end, because it is incomplete (every character needs two bytes), and use decode('utf-16-le'):
b'F\x00L\x00O\x00W\x00\t\x00C\x00T\x00Y\x00_\x00R\x00P\x00T\x00\t\x00R\x00E\x00P\x00O\x00R\x00T\x00E\x00R\x00\t\x00C\x00T\x00Y\x00_\x00P\x00T\x00N\x00\t\x00P\x00A\x00R\x00T\x00N\x00E\x00R\x00\t\x00C\x00O\x00M\x00M\x00O\x00D\x00I\x00T\x00Y\x00\t\x00D\x00E\x00S\x00C\x00R\x00I\x00P\x00T\x00I\x00O\x00N\x00'.decode('utf-16-le')
then I get
'FLOW\tCTY_RPT\tREPORTER\tCTY_PTN\tPARTNER\tCOMMODITY\tDESCRIPTION'
EDIT: meanwhile I also found Python - Decode UTF-16 file with BOM.
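Building on that finding, a minimal sketch for loading one of the downloaded files directly with pandas; the utf-16 codec consumes the BOM by itself, so no bytes need to be stripped by hand (the tab separator is an assumption based on the \t-separated header dump above):

import pandas as pd

# 'utf-16' strips the BOM automatically; sep='\t' matches the header dump
df = pd.read_csv(local_filename, encoding='utf-16', sep='\t')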

UnicodeDecodeError: ('utf-8' codec) while reading a csv file [duplicate]

This question already has answers here:
UnicodeDecodeError when reading CSV file in Pandas with Python
(25 answers)
Closed 5 years ago.
What I am trying to do: read a CSV into a dataframe, make changes in a column, write the changed values back to the same CSV (to_csv), and then read that CSV again into another dataframe. On that second read I am getting an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte
My code is:
import pandas as pd
df = pd.read_csv("D:\ss.csv")
df.columns #o/p is Index(['CUSTOMER_MAILID', 'False', 'True'], dtype='object')
df['True'] = df['True'] + 2 #making changes to one column of type float
df.to_csv("D:\ss.csv") #updating that .csv
df1 = pd.read_csv("D:\ss.csv") #again trying to read that csv
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte
So please suggest how I can avoid the error and be able to read that CSV into a dataframe again.
I know I am missing something like encoding='some codec' while reading and writing the CSV,
but I don't know exactly what should be changed, so I need help.
Known encoding
If you know the encoding of the file you want to read in, you can use
pd.read_csv('filename.txt', encoding='encoding')
These are the possible encodings:
https://docs.python.org/3/library/codecs.html#standard-encodings
Unknown encoding
If you do not know the encoding, you can try to use chardet; however, this is not guaranteed to work. It is more guesswork.
import chardet
import pandas as pd

with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or f.readline() if the file is large

pd.read_csv('filename.csv', encoding=result['encoding'])
Is that error happening on your first read of the data, or on the second read after you write it out and read it back in again? My guess is that it's actually happening on the first read of the data, because your CSV has an encoding that isn't UTF-8.
Try opening that CSV file in Notepad++, or Excel, or LibreOffice. Does your data source have the ç (C with cedilla) character in it? If it does, then that 0xE7 byte you're seeing is probably the ç encoded in either Latin-1 or Windows-1252 (called "cp1252" in Python).
Looking at the documentation for the Pandas read_csv() function, I see it has an encoding parameter, which should be the name of the encoding you expect that CSV file to be in. So try adding encoding="cp1252" to your read_csv() call, as follows:
df = pd.read_csv(r"D:\ss.csv", encoding="cp1252")
Note that I added the character r in front of the filename, so that it will be considered a "raw string" and backslashes won't be treated specially. That way you don't get a surprise when you change the filename from ss.csv to new-ss.csv, where the string D:\new-ss.csv would be read as D, :, newline character, e, w, etc.
Anyway, try that encoding parameter on your first read_csv() call and see if it works. (It's only a guess, since I don't know your actual data. If the data file isn't private and isn't too large, try posting the data file so we can see its contents -- that would let us do better than just guessing.)
One simple solution is to open the CSV file in an editor like Sublime Text and save it with UTF-8 encoding. Then pandas can read the file easily.
The method above, importing chardet and then detecting the file's encoding, works:
import chardet
import pandas as pd

with open('filename.csv', 'rb') as f:
    result = chardet.detect(f.read())  # or f.readline() if the file is large

pd.read_csv('filename.csv', encoding=result['encoding'])
Yes, you'll get this error. I worked around this problem by opening the CSV file in Notepad++ and changing the encoding through the Encoding menu -> Convert to UTF-8, then saving the file and running the Python program over it again.
Another solution is to use the codecs module in Python for encoding and decoding files, but I haven't used that.
I am new to Python. I ran into this exact issue when I manually changed the extension of my Excel file to .csv and tried to read it with read_csv. However, if I opened the Excel file and saved it as a CSV file instead, it worked.

Read csv with foreign language using pandas

I am reading a CSV which has the following format:
ID,Name,Description
Now some of the descriptions contain foreign-language characters.
I would like to read this file into a dataframe if possible. When I use read_csv etc., I get encoding errors. I tried
csv = pd.read_csv('foo.csv',encoding='utf-8')
but it throws an encoding error:
'charmap' codec can't decode byte 0x9d
Is there any way to read the file while keeping the character set, so I can keep analysing?
If not, what would be the way to get such a file into a dataframe or array-like structure?
If keeping the characters is not possible, can such lines/words be ignored and the rest of the data read? Help appreciated.
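One hedged option for the ignore-the-bad-characters route: newer pandas versions (1.3 and later) accept an encoding_errors parameter, so undecodable bytes are replaced instead of raising an error. A sketch, assuming the file is mostly UTF-8:

import pandas as pd

# Bad bytes become U+FFFD instead of raising a decode error (pandas >= 1.3)
df = pd.read_csv('foo.csv', encoding='utf-8', encoding_errors='replace')

Alternatively, encoding='latin-1' never fails to decode (every byte maps to some character), so it reads everything, at the cost of possibly mis-rendering the foreign characters.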
