Python: Use the codecs module or use string function decode?

I have a text file that is encoded in UTF-8. I'm reading it in to analyze and plot some data. I would like the file to be read in as ASCII. Would it be best to use the codecs module or the built-in string decode method? Also, the file is formatted as a CSV, so could the csv module also be a valid solution?
Thanks for your help.

Do you mean that your file is encoded in UTF-8? ("Unicode" is not an encoding... Required reading: http://www.joelonsoftware.com/articles/Unicode.html) I'm not 100% sure but I think you should be able to read a UTF-8 encoded file with the csv module, and you can convert the strings which contain special characters to Python's unicode strings (edit: if you need to) after reading.
There are a few examples of using csv with UTF-8 encoded data at http://docs.python.org/library/csv.html#csv-examples; it might help you to look at them.
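A minimal sketch along those lines (Python 2 era, matching the question; the filename data.csv is a made-up placeholder): read the file through the csv module and decode each field to unicode afterwards.
import csv

# The csv module hands back UTF-8 byte strings; decode each cell after reading.
with open("data.csv", "rb") as f:
    for row in csv.reader(f):
        row = [cell.decode("utf-8") for cell in row]
        print(row)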

Related

Convert encoding of a csv file from utf-8 to shift-jis in python

I have a CSV file with utf-8 encoding. I want to convert it to a shift-jis CSV file using Python code. Is it possible? How can I do it?
This sounds like a task for codecs (it is part of the standard library). Two codecs.open calls can be used if you just want to change the encoding, in the following way:
import codecs

with codecs.open("file1.csv", "r", encoding="utf_8") as fin:
    with codecs.open("file2.csv", "w", encoding="shift_jis") as fout:
        fout.write(fin.read())
The above code assumes that you have a UTF-8 encoded file file1.csv, that you want to create a shift_jis encoded file2.csv, and that you have enough free RAM to load the whole file. Be warned that in Standard Encodings the following shift_jis encodings are available:
shift_jis
shift_jis_2004
shift_jisx0213
I do not know the difference between them, so you would need to determine yourself which one you actually need to use.
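If the file is too large to hold in memory, the same conversion can be done in chunks; a minimal sketch along those lines (the 64K chunk size and the filenames are arbitrary placeholders, not part of the original answer):
import codecs

# Stream the conversion instead of reading the whole file into RAM.
with codecs.open("file1.csv", "r", encoding="utf_8") as fin:
    with codecs.open("file2.csv", "w", encoding="shift_jis") as fout:
        while True:
            chunk = fin.read(65536)  # up to 64K characters per iteration
            if not chunk:
                break
            fout.write(chunk)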

UTF-16 decoding fails when reading from csv

Trying to read a csv that contains some UTF-16 strings. When I print these strings as extracted from the csv they don't decode to cyrillic/japanese/whatever as they should; instead they just print the encoded utf-16. Yet when I copy/paste the strings and print them directly, there's no problem.
import pandas as pd

data = pd.read_csv('stuff.csv')
for index, row in data.iterrows():
    print('\u0423\u043a\u0440\u0430\u0438\u043d\u0430')
    print(row[1])
outputs:
Украина
\u0423\u043a\u0440\u0430\u0438\u043d\u0430
What am I missing? Note that some of the CSV is plain ASCII, so I can't just set the encoding to utf-16 for the whole csv.
Edit: I'm trying to conditionally decode the strings where utf-16 is detected. Tried both the string taken from the csv and the copy/pasted string:
print(bytearray(row[1].encode()).decode('utf-16'))
print(b'\u0423\u043a\u0440\u0430\u0438\u043d\u0430'.decode('utf-16'))
For some reason it decodes to chinese characters:
畜㐰㌲畜㐰愳畜㐰〴畜㐰〳畜㐰㠳畜㐰搳畜㐰〳
畜㐰㌲畜㐰愳畜㐰〴畜㐰〳畜㐰㠳畜㐰搳畜㐰〳
Assuming you actually have \u escapes in the file, you can use the Python ast module to get access to the interpreter's actual parser:
from ast import literal_eval
...
print(literal_eval('"'+row[1]+'"'))
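Since the asker wants to convert only the rows where escapes are present, one possible way to apply this conditionally is sketched below (the backslash-u check is a made-up heuristic for illustration, not part of the original answer):
from ast import literal_eval

def maybe_unescape(value):
    # Convert only strings that literally contain \u escape sequences;
    # everything else is returned unchanged. Heuristic for illustration only.
    if isinstance(value, str) and '\\u' in value:
        return literal_eval('"' + value + '"')
    return value

# e.g. print(maybe_unescape(row[1]))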
pandas.read_csv has an encoding argument.
Try data = pd.read_csv('stuff.csv', encoding='utf-16')

What is the difference between utf-8 and utf-8-sig?

I am trying to encode Bangla words in Python using a pandas dataframe. As the encoding type, utf-8 is not working but utf-8-sig is. I know utf-8-sig includes a BOM (byte order mark). But why is it called utf-8-sig and how does it work?
"sig" in "utf-8-sig" is the abbreviation of "signature" (i.e. signature utf-8 file).
Using utf-8-sig to read a file will treat the BOM as metadata that explains how to interpret the file, instead of as part of the file contents.
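A small sketch of the practical difference (the filename bom.txt is a placeholder): writing with utf-8-sig prepends the three BOM bytes, reading with plain utf-8 keeps the BOM as a '\ufeff' character, and reading with utf-8-sig strips it.
# Write text with a BOM, then read it back under both codecs.
with open("bom.txt", "w", encoding="utf-8-sig") as f:
    f.write("hello")

with open("bom.txt", "rb") as f:
    print(f.read())            # b'\xef\xbb\xbfhello' -- BOM bytes + text

with open("bom.txt", encoding="utf-8") as f:
    print(repr(f.read()))      # '\ufeffhello' -- BOM survives as a character

with open("bom.txt", encoding="utf-8-sig") as f:
    print(repr(f.read()))      # 'hello' -- BOM treated as metadata and stripped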

Which encoding is in use by csv.DictReader when reading csv?

I have a csv file saved encoded as UTF-8.
It contains non-ascii chars [umlauts].
I am reading the file using:
csv.DictReader(<file>,delimiter=<delimiter>).
My questions are:
In which encoding is the file being read?
I noticed that in order to refer to the strings as utf-8 I need to perform:
str.decode('utf-8')
Is there a better approach than reading the file in one encoding and then converting it to another, i.e. utf-8?
[Python version: 2.7]
In Python 2.7, the csv module does not apply any decoding: it opens the file in binary mode and returns byte strings.
Use https://github.com/jdunck/python-unicodecsv, which decodes on the fly.
Use it like:
with open("myfile.csv", 'rb') as my_file:
r = unicodecsv.DictReader(my_file, encoding='utf-8')
r will yield dicts of unicode strings. It's important that the source file is opened in binary mode.
How about using instances and classes in order to achieve this?
You can store the shared dictionary at the class level and also make it load Unicode text files, and even detect their encoding, with or without use of BOM file masks.
A long time ago I wrote a simple library which overrides the default open() with one that is Unicode aware.
If you do import tendo.unicode you will be able to change the way the csv library loads files too.
If your files do not have a BOM header the library will assume UTF-8 instead of the old ASCII. You can even specify another fallback encoding if you want.

UnicodeEncodeError while writing data to an xml file

My aim is to write an XML file with a few tags whose values are in a regional language. I'm using Python to do this and IDLE (the Python GUI) for programming.
When I try to write the words to an xml file it gives the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
For now, I'm not using any xml writer library; instead, I'm opening a file "test.xml" and writing the data into it. This error is encountered by the line:
f.write(data)
If I replace the above write statement with a print statement, it prints the data properly in the Python shell.
I'm reading the data from an Excel file which is not in the UTF-8, 16, or 32 encoding formats. It's in some other format. cp1252 is reading the data properly.
Any help in getting this data written to an XML file would be highly appreciated.
You should .decode your incoming cp1252 to get Unicode strings, and .encode them in utf-8 (by far the preferred encoding for XML) at the time you write, i.e.
f.write(unicodedata.encode('utf-8'))
where unicodedata is obtained by .decode('cp1252') on the incoming bytestrings.
It's possible to put lipstick on it by using the codecs module of the standard Python library to open the input and output files each with their proper encodings in lieu of plain open, but what I show is the underlying mechanism (and it's often, though not invariably, clearer and more explicit to apply it directly, rather than indirectly via codecs -- a matter of style and taste).
What does matter is the general principle: translate your input strings to unicode as soon as you can right after you obtain them, use unicode throughout your processing, and translate them back to byte strings as late as you can just before you output them. This gives you the simplest, most straightforward life!-)
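To spell out the codecs-based variant mentioned above, a rough sketch (Python 2 era, matching the question; the input filename and the processing step are placeholders):
import codecs

# Decode early: codecs.open yields unicode objects decoded from cp1252.
with codecs.open("input.txt", "r", encoding="cp1252") as fin:
    text = fin.read()

# ... process `text` as unicode here ...

# Encode late: the output stream encodes unicode to UTF-8 as it writes.
with codecs.open("test.xml", "w", encoding="utf-8") as fout:
    fout.write(text)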
