With Python and openpyxl, I get this result when I try to read the sheet names:
[u'Janvier ', u'F\xe9vrier']
The code is :
self.classeur = openpyxl.load_workbook('/users/utilisateur/Desktop/Historique.xlsx')
print self.classeur.get_sheet_names()
What can I do to get Février?
In OOXML all strings are unicode. How these appear on your command line depends on a lot of things, but mainly on the configuration of your computer. As the string is unicode, you will need to encode it to the local encoding, assuming that encoding can display the non-ASCII characters.
Try:
print(s.encode("utf8"))
Note this only affects what you see. If you want to process the content or edit the file, just keep things as unicode.
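To illustrate the display-versus-content distinction (Python 3 syntax; the escaped form in the question is Python 2's list repr):

```python
# A minimal sketch: the list display in the question shows each string's
# repr, which escapes non-ASCII characters; the string content is intact.
# Python 3's ascii() reproduces Python 2's escaped repr.
name = 'F\xe9vrier'      # the same text openpyxl returns for the sheet name
print(ascii(name))       # 'F\xe9vrier' -- what the list display showed
print(name)              # Février     -- the actual text
```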
Related
I have a database export in csv which is UTF8 encoded.
When I open it in Excel, I have to choose Windows (ANSI) at opening in order to see the special characters (é, è, à for instance) displayed correctly.
If I use Python pandas to open the csv file specifying UTF8 encoding, it does not seem to get decoded correctly (the é, è, à characters are not displayed correctly):
StŽphanie
FrŽdŽrique
GŽraldine
How should I correctly read this file with Python pandas?
Thanks a lot
This encoding is Windows-1252, referred to as "cp1252" by Python. "ANSI" is a misnomer; the encoding has nothing to do with the American National Standards Institute.
Try:
import pandas

with open("filepath.csv", encoding="cp1252") as f:
    df = pandas.read_csv(f)
The solution was actually to use latin1 encoding in my case:
Stéphanie
Frédérique
Géraldine
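For what it's worth, latin1 and cp1252 agree on the accented letters in question, which is why either works here. A quick sketch with an illustrative byte string:

```python
# Sketch: the byte 0xE9 decodes to 'é' under both latin1 and cp1252;
# the two encodings differ only in the 0x80-0x9F range.
raw = b'St\xe9phanie'
print(raw.decode('latin1'))   # Stéphanie
print(raw.decode('cp1252'))   # Stéphanie
```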
I am trying to write strings containing special characters, such as Chinese characters and French accents, to a csv file. At first I was getting the classic Unicode encode error and looked online for a solution. Many resources told me to use .encode('utf-8', errors='ignore') to solve the problem, but that places raw bytes in the file. In my code shown below, I instead have the function that appends to the csv file open it with utf-8 encoding. This makes the program run without error; however, when I open up the Excel document I see that instead of "é" and "蒋" being added to the file, I see "Ã©" and "è’‹".
import csv

def appendToCSV(specialCharacter):
    with open('myCSVFile.csv', 'a', newline='', encoding='utf-8') as csvFile:
        csvFileWriter = csv.writer(csvFile)
        csvFileWriter.writerow([specialCharacter])

appendToCSV('é')
appendToCSV('蒋')
I would like to display the characters in the Excel document exactly as shown; any help would be appreciated. Thank you.
Use utf-8-sig for the encoding. Excel requires the byte order mark (BOM) signature or it will interpret the file in the local ANSI encoding.
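A minimal sketch of the suggested fix, reusing the hypothetical file name from the question; the only change is the encoding argument:

```python
import csv

# Writing with utf-8-sig prepends the UTF-8 byte order mark (EF BB BF),
# which Excel uses to recognise the file as UTF-8.
with open('myCSVFile.csv', 'w', newline='', encoding='utf-8-sig') as f:
    csv.writer(f).writerow(['é', '蒋'])

with open('myCSVFile.csv', 'rb') as f:
    print(f.read(3))   # b'\xef\xbb\xbf' -- the BOM signature
```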
I'm pretty sure your Excel worksheet is set to use "Latin 1". Try switching that setting to UTF-8.
Note:
>>> x = "蒋"
>>> bs = x.encode()
>>> bs
b'\xe8\x92\x8b'
>>> bs.decode("latin")
'è\x92\x8b'
And:
>>> x = 'é'
>>> bs = x.encode()
>>> bs.decode('latin-1')
'Ã©'
I am running Win7 x64 and I have Python 2.7.5 x64 installed. I am using Wing IDE 101 4.1.
For some reason, encoding is messed up.
special_str = "sauté"
print special_str
# sautÃ©
special_str
# 'saut\xc3\xa9'
I don't understand why, when I try to print it, it comes out weird. When I write it to a notepad text file, it comes out right ("sauté"). The problem is that when I use BeautifulSoup on the string, it comes out containing the weird string "saut├⌐", and then when I output it back into a csv file, I end up with an html chunk containing that weird bit. Help!
You need to declare the encoding of the source file so Python can properly decode your string literals.
You can do this with a special comment at the top of the file (first or second line).
# coding:<coding>
where <coding> is the encoding used when saving the file, for example utf-8.
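This also explains the "saut├⌐" the question mentions: it is what the UTF-8 bytes of "é" look like on a console using IBM code page 437. A sketch (Python 3 syntax):

```python
# Sketch of the mojibake from the question: the UTF-8 encoding of "é"
# is two bytes, which a code page 437 console renders as "├⌐".
utf8_bytes = 'sauté'.encode('utf-8')
print(utf8_bytes)                   # b'saut\xc3\xa9'
print(utf8_bytes.decode('cp437'))   # saut├⌐
```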
I am attempting to write a line to a text file in Python 2.7, and have the following code:
# -*- coding: utf-8 -*-
...
f = open(os.path.join(os.path.dirname(__file__), 'output.txt'), 'w')
f.write('Smith’s BaseBall Cap')  # Note the strangely shaped apostrophe
However, in output.txt, I get Smith’s BaseBall Cap instead. I'm not sure how to correct this encoding problem. Any protips for this sort of issue?
You have declared your file to be encoded with UTF-8, so your byte-string literal is in UTF-8. The curly apostrophe is U+2019. In UTF-8, this is encoded as three bytes, \xE2\x80\x99. Those three bytes are written to your output file. Then, when you examine the output file, it is interpreted as something other than UTF-8, and you see the three incorrect characters instead.
Interpreted as Windows-1252, those three bytes display as ’.
Your file is a correct UTF-8 file, but you are viewing it incorrectly.
There are a couple possibilities, but the first one to check is that the output file actually contains what you think it does. Are you sure you're not viewing the file with the wrong encoding? Some editors have an option to choose what encoding you're viewing the file in. The editor needs to know the file's encoding, and if it interprets the file as being in some other encoding than UTF-8, it will display the wrong thing even though the contents of the file are correct.
When I run your code (on Python 2.6) I get the correct output in the file. Another thing to try: use the codecs module to open the file for UTF-8 writing: f = codecs.open("file.txt", "w", "utf-8"). Then declare the string as a unicode string with u'Smith’s BaseBall Cap'.
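A sketch of that suggestion (codecs.open accepts the same call in Python 3; the file name is the one from the answer):

```python
import codecs

# codecs.open encodes unicode text to UTF-8 on write; the curly
# apostrophe U+2019 becomes the three bytes E2 80 99 on disk.
f = codecs.open('file.txt', 'w', 'utf-8')
f.write(u'Smith\u2019s BaseBall Cap')
f.close()

with open('file.txt', 'rb') as f:
    print(b'\xe2\x80\x99' in f.read())   # True
```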
This is my code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import xlrd, sys, re

data = xlrd.open_workbook('a.xls', encoding_override="utf-8")
a = data.sheets()[0]
s = ''
for i in range(a.nrows):
    if 9 < i < 20:
        # stage
        print a.row_values(i)[1].decode('shift_jis') + '\n'
but it shows:
????
????????
??????
????
????
????
????????
So what can I do? Thanks.
Background: In a "modern" (Excel 97-2003) XLS file, text is effectively stored as Unicode. In older files, text is stored as 8-bit strings, and a "codepage" record tells how it is encoded e.g. the integer 1252 corresponds to the encoding known as cp1252 or windows-1252. In either case, xlrd presents extracted text as unicode objects.
Please insert this line into your code:
print data.biff_version, data.codepage, data.encoding
If you have a new file, you should see
80 1200 utf_16_le
In any case, please edit your question to report the outcome.
Problem 1: encoding_override is required ONLY if the file is an old file AND you know/suspect that the codepage record is omitted or wrong. It is ignored if the file is a new file. Do you really know that the file is pre-Excel-97 and the text is encoded in UTF-8? If so, it can only have been created by some seriously deluded 3rd-party software, and Excel will blow up if you try to open it with Excel; visit the author with a baseball bat. Otherwise, don't use encoding_override.
Problem 2: You should have unicode objects. To display them, you need to encode (not decode) them from unicode to str using a suitable encoding. It is very surprising that print unicode_object.decode('shift-jis') doesn't raise an exception and prints question marks.
To help understand this, please change your code to be like this:
text = a.row_values(i)[1]
print i, repr(text)
print repr(text.decode('shift-jis'))
and report the outcome.
So that we can help you choose an appropriate encoding (if any), tell us what version of what operating system you are using, and what the following display:
print sys.stdout.encoding
import locale
print locale.getpreferredencoding()
Further reading:
(1) the xlrd documentation (section on Unicode, right up the front) ... included in the distribution, or get the latest commit here.
(2) the Python Unicode HOWTO.
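The encode-versus-decode point in Problem 2 can be sketched like this (Python 3 syntax; the Japanese text is illustrative, not taken from the file):

```python
# xlrd hands back unicode text; to show it on a byte-oriented Shift-JIS
# console you encode it. A lossless round trip confirms the text survives.
text = u'\u884c\u304d\u307e\u3059'      # "行きます", illustrative only
sjis_bytes = text.encode('shift_jis')   # bytes for a Shift-JIS console
print(sjis_bytes.decode('shift_jis') == text)   # True
```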
Why isn't your encoding override on open shift-jis?
data = xlrd.open_workbook('a.xls',encoding_override="shift-jis")
If the file is really Shift-JIS, there are lots of byte sequences (well frankly, almost all of them) that don't overlap with valid UTF-8 sequences. If you are getting illegal characters (?) and your file is really UTF-8 and you want to output Shift-JIS, might I suggest that your output shell (for print; writing to a file would probably be fine) can't handle the encoding.
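To illustrate the non-overlap claim (Python 3 syntax; the character is the one from the earlier csv question):

```python
# The UTF-8 bytes of a typical CJK character are not a complete valid
# Shift-JIS sequence, so decoding with the wrong codec raises an error
# rather than silently producing plausible text.
utf8_bytes = '蒋'.encode('utf-8')     # b'\xe8\x92\x8b'
try:
    utf8_bytes.decode('shift_jis')
except UnicodeDecodeError:
    print('not valid Shift-JIS')
```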