I have a CSV file saved with UTF-8 encoding.
It contains non-ASCII characters (umlauts).
I am reading the file using:
csv.DictReader(<file>, delimiter=<delimiter>)
My questions are:
In which encoding is the file being read?
I noticed that in order to treat the strings as UTF-8 I need to call:
str.decode('utf-8')
Is there a better approach than reading the file in one encoding and then converting it to another, i.e. UTF-8?
[Python version: 2.7]
In Python 2.7, the csv module does not apply any decoding: it expects the file to be opened in binary mode and returns byte strings.
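If you want to stay with the standard library, a minimal sketch (assuming a hypothetical myfile.csv and a comma delimiter) is to decode each field by hand after parsing:

import csv

with open("myfile.csv", "rb") as f:
    for row in csv.DictReader(f, delimiter=","):
        # decode every key and value from UTF-8 byte strings to unicode
        row = {k.decode("utf-8"): v.decode("utf-8") for k, v in row.items()}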
A cleaner approach is https://github.com/jdunck/python-unicodecsv, which decodes on the fly.
Use it like this:
import unicodecsv

with open("myfile.csv", 'rb') as my_file:
    r = unicodecsv.DictReader(my_file, encoding='utf-8')
r will yield dicts of Unicode strings. It's important that the source file is opened in binary mode.
How about using instances and classes to achieve this?
You can store the shared dictionary at the class level and also make it load Unicode text files, even detecting their encoding, with or without BOM markers.
A long time ago I wrote a simple library that overrides the default open() with one that is Unicode-aware.
If you do import tendo.unicode, you will change the way the csv library loads files too.
If your files do not have a BOM header, the library will assume UTF-8 instead of the old ASCII default. You can even specify another fallback encoding if you want.
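A minimal sketch of what that looks like (assuming the third-party tendo package is installed via pip install tendo, and going by the library's own description that the import alone patches open()):

import tendo.unicode  # the import alone overrides the built-in open()

with open("myfile.csv") as f:  # decoded as UTF-8 when no BOM is present
    text = f.read()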
Related
I have a CSV file with UTF-8 encoding. I want to convert it to a Shift-JIS CSV file using Python. Is it possible? How can I do it?
This sounds like a task for codecs (part of the standard library). Two codecs.open calls can be used if you just want to change the encoding, in the following way:
import codecs

with codecs.open("file1.csv", "r", encoding="utf_8") as fin:
    with codecs.open("file2.csv", "w", encoding="shift_jis") as fout:
        fout.write(fin.read())
The above code assumes that you have a UTF-8 encoded file file1.csv, want to create a Shift-JIS encoded file2.csv, and have enough free RAM to load the whole file (a streaming variant is sketched after the list below). Be warned that among the Standard Encodings the following Shift-JIS variants are available:
shift_jis
shift_jis_2004
shift_jisx0213
I do not know the difference between them, so you would need to find out yourself which one you actually need to use.
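If the file is too large to load at once, a line-by-line variant (my own sketch, using the same hypothetical file names) streams it instead:

import codecs

with codecs.open("file1.csv", "r", encoding="utf_8") as fin:
    with codecs.open("file2.csv", "w", encoding="shift_jis") as fout:
        # read and write one line at a time to keep memory usage flat
        for line in fin:
            fout.write(line)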
I have an exercise to write a script that converts UTF-16 files to UTF-8, so I wanted an example file with UTF-16 encoding. The problem is that the encoding Python reports for every file is 'cp1250' (no matter the format, .csv or .txt). What am I missing here? I also have example files from the Internet, but Python recognizes them as cp1250. Even when I save a file as UTF-8, Python reports cp1250.
This is the code I use:
with open('FILE') as f:
    print(f.encoding)
open does not detect a file's encoding; f.encoding simply reports your system's default encoding, which is what the file will be decoded with. To open the file in something else, you have to say so explicitly.
To actually convert a file, try something like
with open('input', encoding='cp1252') as infile, open('output', 'w', encoding='utf-16le') as outfile:
    for line in infile:
        outfile.write(line)
Converting a legacy 8-bit file to Unicode isn't really useful because it only exercises a small subset of the character set. See if you can find a good "hello world" sample file. https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html is one for UTF-8.
I have a database export in CSV which is UTF-8 encoded.
When I open it in Excel, I have to choose Windows (ANSI) on opening in order to see special characters displayed correctly (é, è, à for instance).
If I use Python pandas to open the CSV file specifying UTF-8 encoding, it does not seem to get decoded correctly (é, è, à characters are not displayed correctly):
StŽphanie
FrŽdŽrique
GŽraldine
How should I correctly read this file with Python pandas?
Thanks a lot.
This encoding is Windows-1252, referred to as "cp1252" by Python. ANSI is a misnomer; it's completely unrelated to the organisation.
Try:
with open("filepath.csv", encoding="cp1252") as f:
pandas.read_csv(f)
The solution was actually to use latin1 encoding in my case:
Stéphanie
Frédérique
Géraldine
I wish to be able to read an SRT file with Python 3.
These files can be found here:
http://www.opensubtitles.org/
With info here:
http://en.wikipedia.org/wiki/SubRip
SubRip supports any encoding: ASCII or Unicode, for example.
If I understand correctly, I need to specify which decoder to use when I call Python's read function. So am I right in saying that I need to know how the file is encoded in order to make this judgement? If so, how do I establish that for each file, given a hundred such files from different sources and in different languages?
Ultimately I would prefer to convert the files so that they are all in UTF-8 encoding to start with. But some of these files might be in some obscure encoding for all I know.
Please help,
Barry
You could use the charade package (formerly chardet) to detect the encoding.
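A minimal sketch (charade exposes the same detect() API as chardet; the file name is hypothetical):

import charade  # or: import chardet

with open("movie.srt", "rb") as f:
    raw = f.read()
guess = charade.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
text = raw.decode(guess["encoding"])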
You can check for a byte order mark (BOM) at the start of each .srt file to test for the encoding. However, this probably won't work for all files, as the BOM is not required and is only defined for the UTF encodings anyway. A check can be performed like this:
testStr = b'\xff\xfeOtherdata'
if testStr[0:2] == b'\xff\xfe':
    print('UTF-16 Little Endian')
elif testStr[0:2] == b'\xfe\xff':
    print('UTF-16 Big Endian')
#...
What you probably want to do is simply open your file, decode whatever you pull out of it into Unicode, work with the Unicode representation until you are ready to output, and then encode it back again. See this talk for more information and code samples that might be relevant.
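That pattern looks roughly like this (a sketch; the file names and the UTF-8 guess are assumptions, not from the question):

with open("movie.srt", "rb") as f:
    raw = f.read()
text = raw.decode("utf-8")  # decode to Unicode as early as possible
# ... work with text as Unicode here ...
with open("movie-out.srt", "wb") as f:
    f.write(text.encode("utf-8"))  # encode back to bytes only on output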
There's also a decent library for handling SRT files:
https://pypi.python.org/pypi/pysrt
You can specify the encoding when opening and writing SRT files.
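For example (a sketch based on pysrt's documented open/save interface; the file names and encodings are placeholders):

import pysrt

subs = pysrt.open("movie.srt", encoding="iso-8859-1")  # decode on read
subs.save("movie-utf8.srt", encoding="utf-8")          # re-encode on write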
I have a text file that is encoded in UTF-8. I'm reading it in to analyze and plot some data. I would like the file to be read in as ASCII. Would it be best to use the codecs module or the built-in string decode method? Also, the file is divided up as a CSV, so could the csv module also be a valid solution?
Thanks for your help.
Do you mean that your file is encoded in UTF-8? ("Unicode" is not an encoding. Required reading: http://www.joelonsoftware.com/articles/Unicode.html.) I'm not 100% sure, but I think you should be able to read a UTF-8 encoded file with the csv module, and you can convert the strings that contain special characters to Python's unicode strings (if you need to) after reading.
There are a few examples of using csv with UTF-8 encoded data at http://docs.python.org/library/csv.html#csv-examples; it might help you to look at them.
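Those examples center on a pattern like this (a sketch adapted from the docs: Python 2's csv module cannot read Unicode directly, so you re-encode around it):

import csv

def unicode_csv_reader(unicode_csv_data, **kwargs):
    # the csv module in Python 2 cannot handle Unicode input,
    # so encode each line to UTF-8 before parsing
    utf8_data = (line.encode('utf-8') for line in unicode_csv_data)
    for row in csv.reader(utf8_data, **kwargs):
        # decode each parsed cell back to a unicode string
        yield [cell.decode('utf-8') for cell in row]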