I'm using Python 2 and I'm trying to put many words from a French dictionary into a set object, but I always have an encoding problem with the words that have accents.
This is my main code (this part reads a text file):
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
from sets import Set
with open('.../test_unicode.txt', 'r') as word:
    lines = word.readlines()
print(lines)
And this is the result of my print:
['\xc3\xa9l\xc3\xa9phants\n', 'bonjour\n', '\xc3\xa9l\xc3\xa8ves\n']
This is my text file for this example:
éléphants
bonjour
élèves
Next, this is the continuation of my main code, which puts the words into a Python set:
dict_word = Set()
for line in lines:
    print(line)
    dict_word.add(line[:-1].upper()) # get rid of the '\n'
print(dict_word)
This is the result of my print:
Set(['\xc3\xa9L\xc3\xa8VES', 'BONJOUR', '\xc3\xa9L\xc3\xa9PHANTS'])
What I want is this output:
Set(['ÉLÈVES', 'BONJOUR', 'ÉLÉPHANTS'])
But I can't figure out a way to get this result. I tried many things, including putting the line '# -*- encoding: utf-8 -*-' at the top of my file. I also tried 'with codecs.open()', but it didn't work either.
Thanks!
In Python 2 you can use the codecs module to read the file with an encoding. Remember that the repr representation of a unicode string looks funky (it starts with a u and escapes the non-ASCII characters), but the actual string is indeed unicode.
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
from sets import Set
import codecs
with codecs.open('test.txt', encoding='utf-8') as word:
    lines = [line.strip() for line in word.readlines()]
# since you print the list, it shows you the repr of its values
print(lines)
# but they really are unicode
for line in lines:
    print(line)
The output shows the unicode repr when printing the list, but the real string when printing the strings themselves.
[u'\xe9l\xe9phants', u'bonjour', u'\xe9l\xe8ves']
éléphants
bonjour
élèves
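If you then want the uppercase set the question asks for, here is a minimal follow-up sketch (reusing the lines list read above); note that .upper() only maps é to É when called on unicode strings, not on byte strings:
dict_word = Set()
for line in lines:
    # unicode.upper() knows about accented characters, str.upper() does not
    dict_word.add(line.upper())
print(dict_word)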
The reason is probably that you read the file using the wrong encoding.
In Python 3 you would simply switch:
from with open('.../test_unicode.txt', 'r') as word:
to with open('.../test_unicode.txt', 'r', encoding="utf-8") as word:
In Python 2, it seems you can do something like this: Backporting Python 3 open(encoding="utf-8") to Python 2
I.e. use io.open (you have to import io first), and specify encoding="utf-8". I would have expected this to work with codecs.open as well, if you specify that same keyword argument.
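For illustration, a minimal sketch of that io.open approach (untested; the path is kept exactly as in the question):
from sets import Set
import io

with io.open('.../test_unicode.txt', 'r', encoding='utf-8') as word:
    # io.open returns unicode strings, so .upper() handles the accents correctly
    dict_word = Set(line.strip().upper() for line in word)

print(dict_word)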
You can try to infer the input encoding with chardet:
from sets import Set
import chardet
with open('.../test_unicode.txt', 'rb') as word:
    bin_data = word.read()  # raw bytes

enc = chardet.detect(bin_data)  # guess the encoding
lines = bin_data.decode(enc['encoding']).splitlines()
print(lines)
Related
I'm writing a fairly simple script that transforms European Portuguese input into Brazilian Portuguese -- so there are a lot of accented characters such as á, é, À, ç, etc.
Basically, the goal is to find words in the text from a list and replace them with the BR words from a second list.
Here's the code:
#-*- coding: latin-1 -*-
listapt=["gestão","utilizador","telemóvel"]
listabr=["gerenciamento", "usuário", "celular"]
while True:
    # this is all because I need to be able to input multiple lines of text, seems to be working fine
    print("Insert text")
    lines = []
    while True:
        line = raw_input()
        if line != "FIM":
            lines.append(line)
        else:
            break
    text = '\n'.join(lines)

    for word in listapt:
        if word in text:
            num = listapt.index(word)
            wordbr = listabr[num]
            print(word + " --> " + wordbr)  # just to show what changes were made
            text = text.replace(word, wordbr)

    print(text)
I run the code on Windows using IDLE and by double-clicking on the .py file.
The code works fine when using IDLE, but does not match and replace characters when double-clicking the .py file.
Here's why the code works as expected in IDLE but not from CMD or by double-clicking:
Your code is UTF-8 encoded, not latin-1 encoded
IDLE always works in UTF-8 "input/output" mode.
On Windows, CMD/double-clicking will use a non-UTF-8 8-bit locale.
When your code compares the input to the hardcoded strings, it does so at the byte level. In IDLE, it's comparing UTF-8 to hardcoded UTF-8. In CMD, it's comparing non-UTF-8 8-bit bytes to hardcoded UTF-8 (if you were on a stock macOS, it would also work).
The way to fix this is to make sure you're comparing "apples with apples". You could do this by converting everything to the same encoding, e.g. converting the input you read to UTF-8 so it matches the hardcoded strings. The better solution is to convert all [byte] strings to Unicode strings (strings with no encoding). If you were on Python 3, this would all be automatic.
On Python 2.x, you need to do three things:
Prefix all sourcecode strings with u to make them Unicode strings:
listapt=[u"gestão",u"utilizador",u"telemóvel"]
listabr=[u"gerenciamento",u"usuário", u"celular"]
...
if line != u"FIM":
Alternatively, add from __future__ import unicode_literals to avoid changing all your code.
Use the correct coding header for the encoding of your file. I suspect your header should read utf-8. E.g.
#-*- coding: utf-8 -*-
Convert the result of raw_input to Unicode. This must be done with the detected encoding of the standard input:
import sys
line = raw_input().decode(sys.stdin.encoding)
By the way, a better way to model the list of words to replace is to use a dict. The keys are the original words and the values are the replacements. E.g.
words = { u"telemóvel": u"celular" }
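For illustration, a minimal sketch of the dict-based version (names hypothetical, assuming text is already a unicode string as described above):
words = {u"gestão": u"gerenciamento",
         u"utilizador": u"usuário",
         u"telemóvel": u"celular"}

for pt_word, br_word in words.items():
    if pt_word in text:
        print(pt_word + u" --> " + br_word)  # just to show what changed
        text = text.replace(pt_word, br_word)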
I don't see that problem over here.
Based on your use of raw_input, it seems like you're using Python 2.x
This may be because I'm copy-pasting off of Stack Overflow and have a different dev environment than you.
Try running your script under the latest Python 3 interpreter, as well as removing the "#-*- coding:" line.
This should either hit UnicodeDecodeError issues a lot sooner in your code, or work fine.
The problem you have here is Python 2.x getting confused while trying to translate between byte sequences (what Python 2.x strings contain, e.g. raw file contents) and human-meaningful text (unicode, e.g. for displaying Chinese characters to the user), because it makes incorrect assumptions about how human-readable text was encoded into the byte sequences seen in the Python strings.
It's a detail that Python 3 attempts to address a lot better/less ambiguously.
First try executing the code below, it should resolve the issue:
# -*- coding: latin-1 -*-
listapt=[u"gestão",u"utilizador",u"telemóvel"]
listabr=[u"gerenciamento",u"usuário", u"celular"]
lines = []
line = raw_input()
line = line.decode('latin-1')
if line != "FIM":
    lines.append(line)

text = u'\n'.join(lines)
for word in listapt:
    if word in text:
        print("Hello")
        num = listapt.index(word)
        print(num)
        wordbr = listabr[num]
        print(wordbr)
I am using polyglot to tokenize text in the Burmese language. Here is what I am doing.
from polyglot.text import Text
blob = u"""
ထိုင္းေရာက္ျမန္မာလုပ္သားမ်ားကို လုံၿခဳံေရး အေၾကာင္းျပၿပီး ထိုင္းရဲဆက္လက္ဖမ္းဆီး၊ ဧည့္စာရင္းအေၾကာင္းျပ၍ ဒဏ္ေငြ႐ိုက္
"""
text = Text(blob)
When I do:
print(text.words)
It outputs in the following format:
[u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c', u'\u1000\u1039\u103b', u'\u1019', u'\u1014\u1039', u'\u1019\u102c', u'\u101c\u102f', u'\u1015\u1039', u'\u101e\u102c\u1038', u'\u1019\u103a\u102c\u1038', u'\u1000\u102d\u102f', u'\u101c\u102f\u1036', u'\u107f', u'\u1001\u1033\u1036\u1031', u'\u101b\u1038', u'\u1021\u1031\u107e', u'\u1000\u102c', u'\u1004\u1039\u1038\u103b', u'\u1015\u107f', u'\u1015\u102e\u1038', u'\u1011\u102d\u102f', u'\u1004\u1039\u1038', u'\u101b\u1032', u'\u1006', u'\u1000\u1039', u'\u101c', u'\u1000\u1039', u'\u1016', u'\u1019\u1039\u1038', u'\u1006\u102e\u1038', u'\u104a', u'\u1027', u'\u100a\u1037\u1039', u'\u1005\u102c', u'\u101b', u'\u1004\u1039\u1038', u'\u1021\u1031\u107e', u'\u1000\u102c', u'\u1004\u1039\u1038\u103b', u'\u1015', u'\u104d', u'\u1012', u'\u100f\u1039\u1031', u'\u1004\u103c\u1090\u102d\u102f', u'\u1000\u1039']
What output is this? I am not sure why the output looks like this. How can I convert it back to a format I can make sense of?
I had also tried the following:
text.words[1].decode('unicode-escape')
but it throws an error saying: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
That is the way Python 2 prints a list. It is debugging output (see repr()) that unambiguously indicates the content of a list. u'' indicates a Unicode string, and \uxxxx indicates the Unicode code point U+xxxx. The output is all ASCII, so it works on any terminal. If you print the strings in the list directly, they will display correctly, provided your terminal supports the characters being printed. Example:
words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
print words
for word in words:
    print word
Output:
[u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
ထို
င္းေ
ရာ
To reemphasize, your terminal must be configured with an encoding that supports the Unicode code points (ideally, UTF-8), and use a font that supports the characters as well. Otherwise, you can print the text to a file in UTF-8 encoding, and view the file in an editor that supports UTF-8 and has fonts that support the characters:
import io
with io.open('example.txt','w',encoding='utf8') as f:
    for word in words:
        f.write(word + u'\n')
Switch to Python 3, and things get simpler. It defaults to displaying the characters if the terminal supports it, but you can still get the debugging output as well:
words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
print(words)
print(ascii(words))
Output:
['ထို', 'င္းေ', 'ရာ']
['\u1011\u102d\u102f', '\u1004\u1039\u1038\u1031', '\u101b\u102c']
Looks like your terminal is unable to handle the UTF-8 encoded Unicode. Try saving the output by encoding each token into utf-8 as follows.
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from polyglot.text import Text
blob = u"""
ထိုင္းေရာက္ျမန္မာလုပ္သားမ်ားကို လုံၿခဳံေရး အေၾကာင္းျပၿပီး ထိုင္းရဲဆက္လက္ဖမ္းဆီး၊ ဧည့္စာရင္းအေၾကာင္းျပ၍ ဒဏ္ေငြ႐ိုက္
"""
text = Text(blob)
with open('output.txt', 'a') as the_file:
    for word in text.words:
        the_file.write(word.encode("utf-8"))
        the_file.write("\n")
I downloaded the file 'pi_million_digits.txt' from here:
https://github.com/ehmatthes/pcc/blob/master/chapter_10/pi_million_digits.txt
I then used this code to open and read it:
filename = 'pi_million_digits.txt'
with open(filename) as file_object:
    lines = file_object.readlines()

pi_string = ''
for line in lines:
    pi_string += line.strip()
print(pi_string[:52] + "...")
print(len(pi_string))
However, the output produced is correct apart from the fact that it is preceded by some strange symbols: "ï»¿3.141..."
What causes these strange symbols? I am stripping the lines, so I'd expect such symbols to be removed.
It looks like you're opening a file with a UTF-8 encoded Byte Order Mark using the ISO-8859-1 encoding (presumably because this is the default encoding on your OS).
If you open it as bytes and read the first line, you should see something like this:
>>> next(open('pi_million_digits.txt', 'rb'))
b'\xef\xbb\xbf3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
… where \xef\xbb\xbf is the UTF-8 encoding of the BOM. Opened as ISO-8859-1, it looks like what you're getting:
>>> next(open('pi_million_digits.txt', encoding='iso-8859-1'))
'ï»¿3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
… and opening it as UTF-8 shows the actual BOM character U+FEFF:
>>> next(open('pi_million_digits.txt', encoding='utf-8'))
'\ufeff3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
To strip the mark out, use the special encoding utf-8-sig:
>>> next(open('pi_million_digits.txt', encoding='utf-8-sig'))
'3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
The use of next() in the examples above is just for demonstration purposes. In your code, you just need to add the encoding argument to your open() line, e.g.
with open(filename, encoding='utf-8-sig') as file_object:
    # ... etc.
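If you happen to be running Python 2, where the built-in open() has no encoding parameter, io.open accepts the same argument; a minimal sketch:
import io

with io.open(filename, encoding='utf-8-sig') as file_object:
    lines = file_object.readlines()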
I am rather new to Python, but since my native language includes some nasty umlauts, I have to dive into the nightmare that is encoding right at the start.
I read joelonsoftware's text on encoding and understand the difference between codepoints and actual renderings of letters (and the connection between unicode and encodings).
To get me out of trouble I found 3 ways to deal with umlauts, but I can't decide which of them suits which situation.
Could someone shed some light on this? I want to be able to write text to a file, read from it (or sqlite3), and output text, all with readable umlauts...
Thanks a lot!
# -*- coding: utf-8 -*-
import codecs
# using just u + string
with open("testutf8.txt", "w") as f:
    f.write(u"Österreichs Kapitän")
with open("testutf8.txt", "r") as f:
    print f.read()

# using encode/decode
s = u'Österreichs Kapitän'
sutf8 = s.encode('UTF-8')
with open('encode_utf-8.txt', 'w') as f2:
    f2.write(sutf8)
with open('encode_utf-8.txt', 'r') as f2:
    print f2.read().decode('UTF-8')

# using codec
with codecs.open("testcodec.txt", "w", "utf-8") as f3:
    f3.write(u"Österreichs Kapitän")
with codecs.open("testcodec.txt", "r", "utf-8") as f3:
    print f3.read()
EDIT:
I tested this (content of file is 'Österreichs Kapitän'):
with codecs.open("testcodec.txt", "r","utf-8") as f3:
    s = f3.read()
    print s
    s = s.replace(u"ä", u"ü")
    print s
Do I have to use u'string' (unicode) everywhere in my code? I found out that if I just use a plain string (without the 'u'), the replacement of umlauts didn't work...
As a general rule of thumb, you typically want to decode an encoded string as early as possible, then manipulate it as a unicode object and finally encode it as late as possible (before writing it to a file e.g.).
So e.g.:
with codecs.open("testcodec.txt", "r","utf-8") as f3:
    s = f3.read()

# modify s here

with codecs.open("testcodec.txt", "w", "utf-8") as f3:
    f3.write(s)
As to your question, which way is the best to do it: I don't think there is a difference between using the codecs library or using encode/decode manually. It is a matter of preference, either works.
Simply using open, as in your first example, does not work as python will then try to encode the string using the default codec (which is ASCII, if you didn't change it).
Regarding the question whether you should use unicode strings everywhere:
In principle, yes. If you create a string s = 'asdf' it has type str (you can check this with type(s)), and if you do s2 = u'asdf' it has type unicode.
And since it is better to always manipulate unicode objects, the latter is recommended.
If you don't want to always have to append the 'u' in front of a string, you can use the following import:
from __future__ import unicode_literals
Then you can do s = 'asdf' and s will have type unicode. In Python 3 this is the default, so the import is only needed in Python 2.
For potential gotchas you can take a look at Any gotchas using unicode_literals in Python 2.6?. Basically you don't want to mix utf-8 encoded strings and unicode strings.
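A quick interpreter sketch of the type difference and the effect of the import (Python 2):
>>> type('asdf')
<type 'str'>
>>> type(u'asdf')
<type 'unicode'>
>>> from __future__ import unicode_literals
>>> type('asdf')
<type 'unicode'>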
I have a problem comparing a string from a file with a string I entered in the program. I should get that they are equal, but even if I use decode('utf-8') I get that they are not equal. Here's the code:
final = open("info", 'r')
exported = open("final",'w')
lines = final.readlines()
for line in lines:
    if line == "Wykształcenie i praca": #error
        print "ok"
and this is how I save the file that I try to read:
comm_p = bs4.BeautifulSoup(comm)
comm_f.write(comm_p.prettify().encode('utf-8'))
for string in comm_p.strings:
    #print repr(string).encode('utf-8')
    save = string.encode('utf-8')  # this is how I save
    info.write(save)
    info.write("\n")
info.close()
and at the top of the file I have # -*- coding: utf-8 -*-
Any ideas?
This should do what you need:
# -*- coding: utf-8 -*-
import io

with io.open('info', encoding='utf-8') as final:
    lines = final.readlines()

for line in lines:
    if line.strip() == u"Wykształcenie i praca": #error
        print "ok"
You need to open the file with the right encoding, and since your string is not ascii, you should mark it as unicode.
First, you need some basic knowledge about encodings. This is a good place to start. You don't have to read everything right now, but try to get as far as you can.
About your current problem:
You're reading a UTF-8 encoded file (probably), but you're reading it as an ASCII file. open() doesn't do any conversion for you.
So what you need to do (at least):
use codecs.open("info", "r", encoding="utf-8") to read the file
use Unicode strings for comparison: if line.rstrip() == u"Wykształcenie i praca":
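Put together, a minimal sketch of those two points (with the trailing newline from the file stripped before comparing):
# -*- coding: utf-8 -*-
import codecs

with codecs.open("info", "r", encoding="utf-8") as final:
    for line in final:
        if line.rstrip() == u"Wykształcenie i praca":
            print "ok"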
It is likely that the difference is a '\n' character.
readlines doesn't strip '\n' - see Best method for reading newline delimited files in Python and discarding the newlines?
In general it is not a good idea to put a Unicode string directly in your code; it would be better to read it from a resource file.
Use unicode for string comparison:
>>> s = u'Wykształcenie i praca'
>>> s == u'Wykształcenie i praca'
True
>>>
When it comes to strings, unicode is the smartest move :)