What is the difference between utf-8 and utf-8-sig? - python

I am trying to encode Bangla words in Python using a pandas DataFrame. As the encoding, utf-8 does not work but utf-8-sig does. I know utf-8-sig includes a BOM (byte order mark). But why is it called utf-8-sig, and how does it work?

"sig" in "utf-8-sig" is the abbreviation of "signature" (i.e. signature utf-8 file).
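Using utf-8-sig to read a file treats a leading BOM as metadata that explains how to interpret the file and skips it, instead of returning it as part of the file contents. Using utf-8-sig to write a file puts the BOM at the start of the output.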
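A minimal sketch of that difference, using only the standard library. The sample string is just an illustration. A common reason utf-8-sig "works" where utf-8 seems not to is that some programs, such as Excel, rely on the BOM to recognise UTF-8, but that is an assumption about the asker's setup, not something stated in the question.

```python
import codecs

text = "বাংলা"  # illustrative Bangla sample, not taken from the question

plain = text.encode("utf-8")       # no BOM
signed = text.encode("utf-8-sig")  # BOM bytes EF BB BF prepended

# Writing with utf-8-sig adds exactly the three BOM bytes in front.
print(signed == codecs.BOM_UTF8 + plain)

# Reading with utf-8-sig strips the BOM; plain utf-8 keeps it as U+FEFF.
decoded_sig = signed.decode("utf-8-sig")
decoded_plain = signed.decode("utf-8")
print(decoded_sig == text)
print(decoded_plain == "\ufeff" + text)

# utf-8-sig also accepts input that has no BOM at all.
print(plain.decode("utf-8-sig") == text)
```

All four comparisons print True, so utf-8-sig is safe for reading whether or not the BOM is present.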

Related

UTF-8 Encoding in python gets transformed to ASCII?

I'm attempting to do something very simple: read a file in ascii or utf-8-sig and save it as utf-8. However, when I run the function below and then run file filename.json on Linux, it always shows the file as ASCII. I have tried using codecs, with no luck either. The only way I can get it to work is to replace utf-8 with utf-8-sig, but that leaves the file with a BOM. I've searched for solutions and found some that remove the leading characters, but after doing that the file becomes ASCII again. I have tried everything here: Convert UTF-8 with BOM to UTF-8 with no BOM in Python
def file_converter(file_path):
    s = open(file_path, mode='r', encoding='ascii').read()
    open(file_path, mode='w', encoding='utf-8').write(s)
Files that only contain characters below U+0080 encode to exactly the same bytes as either ASCII or UTF-8 (this was one of the compatibility goals of UTF-8). file detects the file as ASCII, and it is, but it's also UTF-8, and will decode correctly as UTF-8 (just like any ASCII file will). So nothing at all is wrong.
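This is easy to check for yourself. A small sketch, with made-up sample strings, showing that ASCII-only text produces identical bytes under both encodings, and that the bytes only differ from ASCII once a character at or above U+0080 appears:

```python
ascii_only = '{"name": "example"}'  # illustrative content, all below U+0080

# Same bytes either way, so a tool like `file` has nothing to tell them apart.
same_bytes = ascii_only.encode("ascii") == ascii_only.encode("utf-8")
print(same_bytes)

with_accent = '{"name": "café"}'    # contains é (U+00E9)
utf8_bytes = with_accent.encode("utf-8")

# Non-ASCII characters become multi-byte sequences with byte values >= 0x80.
has_high_bytes = any(b >= 0x80 for b in utf8_bytes)
print(has_high_bytes)
```

Both lines print True. Only a file like the second one can be reported as UTF-8 rather than ASCII.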

python/pyspark - Reading special characters from csv and writing it back to the file

I am reading a csv file which has some of the values in a column like this -
MÉXICO
ATLÁNTICO
I am reading the file with encoding = 'utf8', but after processing, the values are changed as shown below:
M�XICO
ATL�NTICO
What can I do to retain the original values from the input file?
Edit: I also tried utf-16 and ISO-8859-1, but neither helped.
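Your input file may not be UTF-8 encoded. The � characters are U+FFFD replacement characters, which appear when bytes are not valid in the encoding used to decode them.
You can convert the file to UTF-8 before reading it. That should fix your issue.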
Here is a stack-overflow link to convert CSV from non utf8 to utf8 encoding.
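In case the link is unavailable, here is a rough sketch of such a conversion in plain Python. The function name convert_to_utf8 and the file names in the usage comment are hypothetical, and the source encoding has to be known or guessed by you; nothing in the question tells us what it actually is.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding):
    """Re-encode a text file to UTF-8 (hypothetical helper).

    source_encoding is an assumption supplied by the caller. A wrong guess
    may raise UnicodeDecodeError or silently produce wrong characters.
    """
    raw = Path(src).read_bytes()
    text = raw.decode(source_encoding)
    # Write bytes directly so line endings are left untouched.
    Path(dst).write_bytes(text.encode("utf-8"))


# Example call with made-up file names and a guessed encoding:
# convert_to_utf8("input.csv", "input_utf8.csv", "cp1252")
```

After conversion, reading the new file with encoding='utf8' should give back the accented characters, provided the guessed source encoding was right.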

pypy3 - UnicodeDecodeError when reading a csv file [duplicate]

Ok, so python3 and unicode. I know that all python3 strings are actually unicode strings and all python3 code is stored as utf-8. But how does python3 read text files? Does it assume that they are encoded in utf-8? Do I need to call decode('utf-8') when reading a text file? What about pandas read_csv() and to_csv()?
Python's built-in function open() has an optional parameter encoding:
encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
Analogous parameters can be found in pandas:
pandas.read_csv(): encoding: str, default None. Encoding to use for UTF when reading/writing (ex. ‘utf-8’).
Series.to_csv(): encoding: string, optional. A string representing the encoding to use if the contents are non-ascii, for python versions prior to 3.
DataFrame.to_csv(): encoding: string, optional. A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.
Do I need to call decode('utf-8') when reading a text file?
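No. In text mode, open() decodes the file for you using the given (or default) encoding, so the strings you read are already decoded. What you do need is to make sure the file really is UTF-8: try reading it with encoding='utf-8', and if that raises UnicodeDecodeError, the file is in some other encoding.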
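The "try reading it" check can be sketched like this. The helper name is_utf8 is made up for illustration. It only tells you whether the bytes are valid UTF-8, not whether UTF-8 is what the author intended.

```python
def is_utf8(path):
    """Return True if the whole file decodes as UTF-8 (hypothetical helper)."""
    try:
        with open(path, encoding="utf-8") as f:
            # Iterating over the file forces every line to be decoded.
            for _ in f:
                pass
    except UnicodeDecodeError:
        return False
    return True
```

If this returns False, pass the file's real encoding to open() or to the encoding parameter of pandas.read_csv() instead.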

Which encoding to open utf-8 csv file in Python which opens correctly in Excel with Windows (ANSI)

I have a database export in csv which is UTF8 encoded.
When I open it in Excel, I have to choose Windows (ANSI) on opening in order for special characters (é, è, à for instance) to display correctly.
If I use Python pandas to open the csv file specifying UTF8 encoding, it does not seem to be decoded correctly (the é, è, à characters are not displayed correctly):
StŽphanie
FrŽdŽrique
GŽraldine
How should I correctly read this file with Python pandas?
Thanks a lot
This encoding is Windows-1252, referred to as "cp1252" by Python. ANSI is a misnomer; it's completely unrelated to the organisation.
Try:
import pandas
with open("filepath.csv", encoding="cp1252") as f:
    df = pandas.read_csv(f)
The solution was actually to use latin1 encoding in my case:
Stéphanie
Frédérique
Géraldine
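For what it's worth, cp1252 and latin1 agree on characters like é (byte 0xE9) and only differ in the range 0x80 to 0x9F, so for many Western European files either one decodes the accented letters the same way. A small sketch with a made-up byte string (it is not the asker's actual file):

```python
raw = "Stéphanie".encode("cp1252")  # b'St\xe9phanie', illustrative bytes

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin1")
print(as_cp1252 == as_latin1)

# The same bytes are not valid UTF-8, so strict decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

This prints True and then False: both single-byte encodings give the same text here, while the strict UTF-8 decode is rejected.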

Python: Use the codecs module or use string function decode?

I have a text file that is encoded in UTF-8. I'm reading it in to analyze and plot some data. I would like the file to be read in as ascii. Would it be best to use the codecs module or use the builtin string decode method? Also, the file is divided up as a csv, so could the csv module also be a valid solution?
Thanks for your help.
Do you mean that your file is encoded in UTF-8? ("Unicode" is not an encoding... Required reading: http://www.joelonsoftware.com/articles/Unicode.html) I'm not 100% sure but I think you should be able to read a UTF-8 encoded file with the csv module, and you can convert the strings which contain special characters to Python's unicode strings (edit: if you need to) after reading.
There are a few examples of using csv with UTF-8 encoded data at http://docs.python.org./library/csv.html#csv-examples; it might help you to look at them.
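On Python 3 this is simpler than the answer above suggests, because the decoding happens when the file is opened in text mode. A minimal sketch, assuming Python 3 and a file that really is UTF-8; the function name read_utf8_csv is hypothetical:

```python
import csv


def read_utf8_csv(path):
    """Read a UTF-8 encoded CSV into a list of rows (hypothetical helper)."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.reader(f))
```

Each row comes back as a list of ordinary str values, so neither codecs nor an explicit decode() call is needed.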
