I am reading a .csv file which is UTF-8 encoded.
I want to create an index and rewrite the csv.
The index is built from a running number and the first letter of a word.
Python 2.7.10, Ubuntu Server
#!/usr/bin/env python
# -*- coding: utf-8 -*-

counter = 0
tempDict = {}
with open(modifiedFile, "wb") as newFile:
    with open(originalFile, "r") as file:
        for row in file:
            myList = row.split(",")
            toId = str(myList[0])
            if toId not in tempDict:
                tempDict[toId] = counter
                myId = str(toId[0]) + str(counter)
                myList.append(myId)
                counter += 1
            else:
                myId = str(toId[0]) + str(tempDict[toId])
                myList.append(myId)
            # and then I write everything into the csv
            for i, j in enumerate(myList):
                if i < 6:
                    newFile.write(str(j).strip())
                    newFile.write(",")
                else:
                    newFile.write(str(j).strip())
                    newFile.write("\n")
The problem is the following.
When a word starts with a fancy letter, such as
Č
É
Ā
...
The id I create starts with a ?, not with the first letter of the word.
The strange part is that within the csv I create, the words with the fancy letters are written correctly. There are no ? or other symbols that would indicate a wrong encoding.
Why is that?
By all means, you should not be learning Python 2 unless there is a specific legacy C extension that you need.
Python 3 makes major changes to the unicode/bytes handling that removes (most) implicit behavior and makes errors visible. It's still good practice to use open('filename', encoding='utf-8') since the default encoding is environment- and platform-dependent.
Indeed, running your program in Python 3 should fix it without any changes. But here's where your bug lies:
toId = str(myList[0])
This is a no-op, since myList[0] is already a str.
myId = str(toId[0]) + str(counter)
This is the bug: toId is a str (a byte string) containing UTF-8-encoded data. Indexing it with toId[0] takes only the first byte, and for a letter like Č that byte is just half of a two-byte character, so it cannot be written out as valid UTF-8 on its own. Never index or slice UTF-8 byte data; decode it to unicode first.
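A quick Python 2 session makes the problem visible (assuming a UTF-8 terminal and source encoding):

>>> s = 'Č'                # a byte string holding two UTF-8 bytes
>>> s
'\xc4\x8c'
>>> s[0]                   # the first byte only - not a valid character by itself
'\xc4'
>>> s.decode('utf-8')[0]   # decode first, then take the first character
u'\u010c'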
with open(originalFile, "r") as file:
This is a style error, since it masks the built-in function file.
There are two changes to make this run under Python 2.
Change open(filename, mode) to io.open(filename, mode, encoding='utf-8').
Stop calling str() on strings, since that actually attempts to encode them (in ASCII!).
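Putting both changes together, a minimal Python 2 sketch of the loop might look like this (modifiedFile and originalFile as in the question; the per-column write step is simplified to a single join):

import io

counter = 0
tempDict = {}
with io.open(modifiedFile, "w", encoding="utf-8") as newFile:
    with io.open(originalFile, "r", encoding="utf-8") as source:
        for row in source:
            myList = row.split(u",")
            toId = myList[0]  # already unicode - no str() call
            if toId not in tempDict:
                tempDict[toId] = counter
                counter += 1
            myId = toId[0] + unicode(tempDict[toId])  # first character, not first byte
            myList.append(myId)
            newFile.write(u",".join(part.strip() for part in myList) + u"\n")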
But you really should switch to Python 3.
There are a few pieces new to 2.6 and 2.7 that are intended to bridge the gap to 3, and one of them is the io module, which behaves in all the nice new ways: Unicode files and universal newlines.
~$ python2.7 -c 'import io,sys;print(list(io.open(sys.argv[1],encoding="u8")))' <(printf $'\xc3\x84\r\n\xc3\xb9\r\n')
[u'\xc4\n', u'\xf9\n']
~$ python3 -c 'import sys;print(list(open(sys.argv[1],encoding="u8")))' <(printf $'\xc3\x84\r\n\xc3\xb9\r\n')
['Ä\n', 'ù\n']
This can be useful to write software for both 2 and 3. Again, the encoding argument is optional but on all platforms the default encoding is environment-dependent, so it's good to be specific.
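One low-ceremony way to get the same text-mode behaviour in code that must run on both versions (a sketch; 'data.csv' is a placeholder):

from io import open  # harmless on Python 3, where the built-in open already is io.open

with open('data.csv', encoding='utf-8') as f:
    for line in f:
        pass  # line is a text (unicode) string under both 2.7 and 3.x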
In Python 2.x, strings are byte strings by default - str() returns a byte string. Use unicode() instead.
Besides, you should open the file with UTF-8 encoding through codecs.open() rather than the built-in open().
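A minimal Python 2 sketch of that advice ('input.csv' is a placeholder):

import codecs

counter = 0
with codecs.open('input.csv', 'r', encoding='utf-8') as f:
    for row in f:
        toId = row.split(u',')[0]
        myId = toId[0] + unicode(counter)  # toId[0] is now a full character, not a byte
        counter += 1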
Related
This is my code for creating the string to be written ('result' is the variable that holds the final text):
def writing(initial, tag, name, desc):
    fileobj = open('file_name.yml', 'a+')
    begin = initial + ":0 "
    n_name = '"§' + tag + name + '§!"'
    begin_d = initial + "_desc:0 "
    n_desc = '"§3' + desc + '§!"'
    title = ' ' + begin + n_name
    descript = ' ' + begin_d + n_desc
    result = title + '\n' + descript
    print()
    fileobj.close()
    return result
This is my code for actually writing it into the file:
text = writing(initial, tag, name, desc)
override = inserter(fileobj, country, text)
fileobj.close()
fileobj = open('file_name.yml','w+')
fileobj.write(override)
fileobj.close()
(P.S.: inserter is a function which works perfectly. It returns a longer string to be written into the file.)
I have tried this with .txt and .yml files, but in both cases, instead of §, this is what takes its place: \xA7 (I cannot copy the actual text onto the internet as it changes into the correct character. It does, however, appear as \xA7 in the file.) Everything else is unaffected, and the code runs fine.
Do let me know if I can improve the question in any way.
You're running into a problem called character encoding. There are two parts to the problem - first is to get the encoding you want in the file, the second is to get the OS to use the same encoding.
The most flexible and common encoding is UTF-8, because it can handle any Unicode character while remaining backwards compatible with the very old 7-bit ASCII character set. Most Unix-like systems like Linux will handle it automatically.
fileobj = open('file_name.yml','w+',encoding='utf-8')
You can set the PYTHONIOENCODING environment variable to utf-8 to make it the default for the standard streams, but that does not change the default used by open(), so keep passing encoding explicitly.
Windows operating systems are a little trickier because they'll rarely assume UTF-8, especially if it's a Microsoft program opening the file. There's a magic byte sequence called a BOM that will trigger Microsoft to use UTF-8 if it's at the beginning of a file. Python can add that automatically for you:
fileobj = open('file_name.yml','w+',encoding='utf_8_sig')
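To see what the BOM does, a quick check (a sketch; same hypothetical file name as above):

with open('file_name.yml', 'w', encoding='utf_8_sig') as f:
    f.write('§')

with open('file_name.yml', 'rb') as f:
    print(f.read())  # b'\xef\xbb\xbf\xc2\xa7': the BOM, then the UTF-8 bytes for §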
I'm learning Python and tried to make a hangman game (in Portuguese it's called "forca"; I didn't know the game's name in English). For those who aren't familiar with this game, the player must discover a secret word by guessing one letter at a time.
In my code, I get a collection of secret words which is imported from a txt file using the following code:
words_bank = open('palavras.txt', 'r')
words = []
for line in words_bank:
    words.append(line.strip().lower())
words_bank.close()
print(words)
The output of print(words) is ['ma\xc3\xa7\xc3\xa3', 'a\xc3\xa7a\xc3\xad', 'tucum\xc3\xa3'], but if I try print('maçã, açaí, tucumã') to check the special characters, everything is printed correctly. It looks like the issue is in encoding (or decoding... I'm still reading lots of articles to really understand it) the special characters from files.
The content of line 1 of my code is # coding: utf-8 because after some research I found out that I have to specify the Unicode format that is required for the text to be encoded/decoded. Before adding it, I was receiving the following message when running the code:
File "path/forca.py", line 12
SyntaxError: Non-ASCII character '\xc3' in file path/forca.py on line 12, but no encoding declared
Line 12 content: print('maçã, açaí, tucumã')
Things that I've already tried:
adding encode='utf-8' as a parameter in open('palavras.txt', 'r')
adding decode='utf-8' as a parameter in open('palavras.txt', 'r')
the same as above but with latin1
substituting the content of line 1 with # coding: latin1
My OS is Ubuntu 20.04 LTS, my IDE is VS Code.
Nothing works!
I don't know what to search for or what to do anymore.
SOLUTION HERE
Thanks to the help given by the friends above, I was able to find out that the real problem was the combination of the VS Code extension (Code Runner) and the python alias version on Ubuntu 20.04 LTS.
Code Runner is set to run code in the terminal in my setup, so when it called python, the alias pointed to Python 2.7.x. To overcome this I used this thread to set Python 3 as the default.
It's done! Whenever python is called, both in the terminal and in VS Code with Code Runner, all special characters work just fine.
Thanks everybody for your time and your help =)
This only happens when using Python 2.x.
The error is probably because you're printing a list, not the items in the list.
When you call print(words) (words is a list), Python invokes a special function called repr on the list object. The list then builds a summary representation by calling repr on each child in the list, and creates a neat string visualisation.
repr(string) returns an ASCII representation (with escapes) rather than a version suitable for your terminal.
Instead, try:
for x in words:
    print(x)
Note: the relevant option for open is encoding, e.g.
open('myfile.txt', encoding='utf-8')
You should always, always pass the encoding option to open - on Linux and macOS, Python 3 will usually assume UTF-8 (the locale encoding), while on Windows it will fall back to a legacy 8-bit code page.
UTF-8 only becomes the default everywhere when UTF-8 mode is on by default (planned for Python 3.15, per PEP 686).
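Applied to the question's code, that might look like this (same file name as in the question):

words = []
with open('palavras.txt', 'r', encoding='utf-8') as words_bank:
    for line in words_bank:
        words.append(line.strip().lower())

for word in words:
    print(word)  # prints each word, not the list's repr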
See Python 2.x vs 3.x behaviour:
Py2
>>> print ['maçã', 'açaí', 'tucumã']
['ma\xc3\xa7\xc3\xa3', 'a\xc3\xa7a\xc3\xad', 'tucum\xc3\xa3']
>>> repr('maçã')
"'ma\\xc3\\xa7\\xc3\\xa3'"
>>> print 'maçã'
maçã
Py3
>>> print(['maçã', 'açaí', 'tucumã'])
['maçã', 'açaí', 'tucumã']
>>> repr('maçã')
"'maçã'"
I was trying to take the content of a text file and map it into a json file, but I noticed that Python automatically turned the Kurdish (Sorani) text into UTF-8 byte escapes. Can someone explain why Python does this and how I can prevent the conversion?
You can test it with the code below:
def readText():
    # test.txt contains Kurdish Sorani characters (an article)
    # Sorani example: ڕۆژتان باش بەڕێزان. من ناوم ڕەنجە.
    with open('test.txt', 'r') as context:
        data = context.readlines()
        return data

print(readText())
I'm running Python 2.x on Ubuntu 14.x. Python 2.x does this! Python 3.x does not convert it and works just fine.
You are seeing the repr output: readlines returns a list, and lists display the repr representation of their contents. Once you actually print the strings themselves you will see the actual str output. (You are also using Python 2:)
In [11]: out = readText()
In [12]: print out
['\xda\x95\xdb\x86\xda\x98\xd8\xaa\xd8\xa7\xd9\x86 \xd8\xa8\xd8\xa7\xd8\xb4 \xd8\xa8\xdb\x95\xda\x95\xdb\x8e\xd8\xb2\xd8\xa7\xd9\x86. \xd9\x85\xd9\x86 \xd9\x86\xd8\xa7\xd9\x88\xd9\x85 \xda\x95\xdb\x95\xd9\x86\xd8\xac\xdb\x95. ']
In [13]: print out[0]
ڕۆژتان باش بەڕێزان. من ناوم ڕەنجە.
I'm going to take a stab here and guess that you are reading the output in a terminal of some sort, and when Python writes to the terminal it's trying to display in ASCII.
If you set your PYTHONIOENCODING environment variable to UTF-8 this can sometimes solve the issue - it depends on other variables as well.
So, if you're on a UNIX-like system, try this in your terminal: export PYTHONIOENCODING=UTF-8
Or, for Windows, set PYTHONIOENCODING=UTF-8.
Then, try running your script again and see if you get the correct characters printed.
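If you can't change the environment, a rough Python 2 in-script equivalent (a sketch using only the standard library) is to wrap stdout in a UTF-8 writer:

import codecs
import sys

# Python 2: replace stdout with a wrapper that encodes unicode output as UTF-8
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
print u'ڕۆژتان باش'  # now encodes to UTF-8 regardless of the detected terminal encoding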
More information can be found here: How to print UTF-8 Encoded Text to the console in Python3
I want to write a non-ascii character, let's say →, to standard output. The tricky part seems to be that some of the data that I want to concatenate to that string is read from json. Consider the following simple json document:
{"foo":"bar"}
I include this because if I just want to print → then it seems enough to simply write:
print("→")
and it will do the right thing in python2 and python3.
So I want to print the value of foo together with my non-ascii character →. The only way I found to do this such that it works in both Python 2 and Python 3 is:
getattr(sys.stdout, 'buffer', sys.stdout).write(data["foo"].encode("utf8")+u"→".encode("utf8"))
or
getattr(sys.stdout, 'buffer', sys.stdout).write((data["foo"]+u"→").encode("utf8"))
It is important not to miss the u in front of →, because otherwise a UnicodeDecodeError will be thrown by Python 2.
Using the print function like this:
print((data["foo"]+u"→").encode("utf8"), file=(getattr(sys.stdout, 'buffer', sys.stdout)))
doesn't seem to work, because Python 3 will complain TypeError: 'str' does not support the buffer interface.
Did I find the best way or is there a better option? Can I make the print function work?
The most concise I could come up with is the following, which you may be able to make more concise with a few convenience functions (or even replacing/overriding the print function):
# -*- coding=utf-8 -*-
import codecs
import os
import sys

# if you include the -*- coding line, you can use this
output = 'bar' + u'→'
# otherwise, use this
output = 'bar' + b'\xe2\x86\x92'.decode('utf-8')

if sys.stdout.encoding == 'UTF-8':
    print(output)
else:
    output += os.linesep
    if sys.version_info[0] >= 3:
        sys.stdout.buffer.write(bytes(output.encode('utf-8')))
    else:
        codecs.getwriter('utf-8')(sys.stdout).write(output)
The best option is using the -*- encoding line, which allows you to use the actual character in the file. But if for some reason, you can't use the encoding line, it's still possible to accomplish without it.
This (both with and without the encoding line) works on Linux (Arch) with python 2.7.7 and 3.4.1.
It also works if the terminal's encoding is not UTF-8. (On Arch Linux, I just change the encoding by using a different LANG environment variable.)
LANG=zh_CN python test.py
It also sort of works on Windows, which I tried with 2.6, 2.7, 3.3, and 3.4. By sort of, I mean I could get the '→' character to display only on a mintty terminal. On a cmd terminal, that character would display as 'ΓåÆ' - which is what the UTF-8 bytes E2 86 92 look like when interpreted as code page 437. (There may be something simple I'm missing there.)
If you don't need to print to sys.stdout.buffer, then the following should print fine to sys.stdout. I tried it in both Python 2.7 and 3.4, and it seemed to work fine:
# -*- coding=utf-8 -*-
print("bar" + u"→")
I am trying to load a file saved as UTF-8 into Python (version 2.6.6) which contains 14 different languages. I am using the Python codecs module to decode the txt file.
import codecs
f = open('C:/temp/list_test.txt', 'r')
for lines in f:
    line = filter_str(lines.decode("utf-8"))
This all works well. I parse the entire file and then want to export
14 different language files. The problem that I can't understand is the following
I use the following code for output:
malangout = codecs.open("C:/temp/polish.txt", 'w', 'utf-8', 'surrogateescape')
for item in lang_dic['English']:
    temp = lang_dic[lang1][item]
    malangout.write(temp + '\n')
malangout.close()
Example:
Language: Polish
Expected output: Dziennik zakłóceń
Actual output: Dziennik zak‚óceƒ
The string is stored as is:
u'Dziennik zak\u201a\xf3ce\u0192'
I have tried many encodings from the Python docs (7.8 codecs). Any information would help at this point.
The string is stored as is:
u'Dziennik zak\u201a\xf3ce\u0192'
Well, that's a problem since
In [25]: print(u'Dziennik zak\u201a\xf3ce\u0192')
Dziennik zak‚óceƒ
in contrast to
In [26]: print(u'Dziennik zak\u0142\xf3ce\u0144')
Dziennik zakłóceń
So it looks like the unicode you are storing is incorrect. Are you sure it is correct in C:/temp/list_test.txt? That is, does list_test.txt contain
In [28]: u'Dziennik zak\u201a\xf3ce\u0192'.encode('utf-8')
Out[28]: 'Dziennik zak\xe2\x80\x9a\xc3\xb3ce\xc6\x92'
or
In [27]: u'Dziennik zak\u0142\xf3ce\u0144'.encode('utf-8')
Out[27]: 'Dziennik zak\xc5\x82\xc3\xb3ce\xc5\x84'
?
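One quick way to check is to read the raw bytes directly in binary mode (path as in the question):

with open('C:/temp/list_test.txt', 'rb') as f:
    print(repr(f.read(60)))  # e.g. '\xc5\x82' for ł if the file really is UTF-8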
PS. You may want to change
temp + '\n'
to
temp + u'\n'
to make it clear you are adding two unicode strings together to form a unicode string.
The two lines above have the same result in Python 2, but in Python 3 adding unicode (str) and byte strings together raises a TypeError. Even though in Python 3 '\n' is already unicode, I think the challenge in transitioning to Python 3 will be in changing one's mental attitude toward mixing unicode and bytes. In Python 2 the conversion is silently attempted for you; in Python 3 it is disallowed.
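For example, in Python 3 (the exact wording of the message varies slightly by version):

>>> u'Dziennik zak\u0142\xf3ce\u0144' + b'\n'
Traceback (most recent call last):
  ...
TypeError: can only concatenate str (not "bytes") to str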