How to encode/decode this file in Python?

I am planning to make a little Python game that randomly prints keys (English) from a dictionary, and the user has to input the value (in German). If the value is correct, it prints 'correct' and continues. If the value is wrong, it prints 'wrong' and breaks.
I thought this would be an easy task but I got stuck on the way. My problem is I do not know how to print the German characters. Let's say I have a file 'dictionary.txt' with this text:
cat:Katze
dog:Hund
exercise:Übung
solve:lösen
door:Tür
cheese:Käse
And I have this code just to test what the output looks like:
# -*- coding: UTF-8 -*-
words = {}  # empty dictionary

with open('dictionary.txt') as my_file:
    for line in my_file.readlines():
        if len(line.strip()) > 0:  # ignoring blank lines
            elem = line.split(':')  # split on ":"
            words[elem[0]] = elem[1].strip()  # appending elements to dictionary

print words
Obviously the result of the print is not as expected:
{'cheese': 'K\xc3\xa4se', 'door': 'T\xc3\xbcr',
'dog': 'Hund', 'cat': 'Katze', 'solve': 'l\xc3\xb6sen',
'exercise': '\xc3\x9cbung'}
So where do I add the encoding and how do I do it?
Thank you!

You are looking at byte string values, printed as repr() results because they are contained in a dictionary. String representations can be re-used as Python string literals and non-printable and non-ASCII characters are shown using string escape sequences. Container values are always represented with repr() to ease debugging.
Thus, the string 'K\xc3\xa4se' contains two non-ASCII bytes with hex values C3 and A4, the UTF-8 encoding of the U+00E4 codepoint (ä).
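For instance, decoding that byte string interactively (Python 2) shows the relationship; printing the decoded value on a UTF-8 terminal then gives the actual character:
>>> 'K\xc3\xa4se'.decode('utf8')
u'K\xe4se'
>>> print 'K\xc3\xa4se'.decode('utf8')
Käse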
You should decode the values to unicode objects:
with open('dictionary.txt') as my_file:
    for line in my_file:  # just loop over the file
        if line.strip():  # ignoring blank lines
            key, value = line.decode('utf8').strip().split(':')
            words[key] = value
or better still, use codecs.open() to decode the file as you read it:
import codecs

with codecs.open('dictionary.txt', 'r', 'utf8') as my_file:
    for line in my_file:
        if line.strip():  # ignoring blank lines
            key, value = line.strip().split(':')
            words[key] = value
Printing the resulting dictionary will still use repr() results for the contents, so now you'll see u'cheese': u'K\xe4se' instead, because \xe4 is the escape code for Unicode point 00E4, the ä character. Print individual words if you want the actual characters to be written to the terminal:
print words['cheese']
But now you can compare these values with other data that you decoded (provided you know its correct encoding), manipulate them, and encode them again to whatever target codec you need. print does this automatically, for example, when writing unicode values to your terminal.
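A minimal sketch of that comparison for the quiz described in the question (my own example, assuming a UTF-8 terminal and the decoded words dictionary from above): decode the user's input so both sides are unicode objects before comparing.
answer = raw_input('Translate "cheese": ').decode('utf8')
if answer == words['cheese']:
    print 'correct'
else:
    print 'wrong'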
You may want to read up on Unicode and Python:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder

This is how you should do it.
def game(input, answer):
    if input == answer:
        sentence = "You got it!"
        return sentence
    elif input != answer:
        wrong = "sorry, wrong answer"
        return wrong
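Hypothetical usage, combining this helper with the decoded dictionary and decoded terminal input from the answer above:
print game(raw_input('door? ').decode('utf8'), words['door'])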

Related

Is there a way to store 4-byte UTF-8 encoded characters from a file as two characters in Python?

My understanding of encoding/decoding isn't the best so apologies if any of this is confusing:
I'm modding a Javascript app. It searches an index file to find a word's index then accesses the word's entry in the dictionary file using its index. So if the index for the word is 100, then the word's definition appears at dict[100]. The files are loaded in the Javascript app into variables using response.text(). This seems to render 4-byte utf-8 encoded characters as two separate characters. For instance: 𥻗 and 𪧘 are four bytes in utf-8 so I think they're appearing as �� (like they do in cmd). The current indices account for this, but since I'm updating the entries in the dictionary, I need to update the indices. Is there a way in Python to decode 4-byte utf-8 encoded characters as two characters? My current solution is to read the old_index and old_dict files in Python and manually add an extra character whenever the index fails to find the entry. I'm suspecting I need to switch languages for a more elegant solution.
EDIT: I feel like explaining my goal made this confusing. The crux of the matter is I'm trying to find a way to count 4-byte utf-8 encoded characters twice. This can probably be done by going character by character and checking the size of its encoding in utf-8.
with open(r"data\dict.txt", "r", encoding="utf-8") as f:
dict = f.read()
for char in dict:
byteArray = char.encode("utf-8")
if len(byteArray) == 4:
idx += 2
else:
idx += 1
Unicode characters with code point greater than or equal to 0x10000 have 4-byte representations in utf-8.
See the Wikipedia article on UTF-8.
So:
with open(r"data\dict.txt", "r", encoding="utf-8") as f:
s = f.read()
idx = 0
for char in s:
idx += 2 if ord(char) >= 0x10000 else 1
I hope that this code is sufficiently "elegant" to justify the use of Python :-)
I changed the variable name from "dict" to "s" since "dict" is the name of a built-in type.
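If the goal is to reproduce the JavaScript-style indices directly, an alternative sketch (my assumption: the JavaScript side counts UTF-16 code units, so characters at or above U+10000 count twice) is to measure the length of the UTF-16 encoding:
with open(r"data\dict.txt", "r", encoding="utf-8") as f:
    s = f.read()

# utf-16-le uses exactly 2 bytes per code unit and adds no BOM,
# so the number of code units is half the byte length
idx = len(s.encode("utf-16-le")) // 2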

Regex conflict for certain characters (ISO-8859-1 Windows-1252)

all - I'm trying to perform a regex on a bunch of science data, converting certain special symbols into ASCII-friendly characters. For example, I want to replace 'µ'(UTF-8 \xc2\xb5) to the string 'micro', and '±' with '+/-'. I cooked up a python script to do this, which looks like this:
import re

def stripChars(string):
    outString = (re.sub(r'\xc2\xb5+', 'micro', string))   # Metric 'micro (10^-6)' (Greek 'mu') letter
    outString = (re.sub(r'\xc2\xb1+', '+/-', outString))  # Scientific 'Plus-Minus' symbol
    return outString
However, for these two specific characters, I'm getting strange results. I dug into it a bit, and it looks like I'm suffering from the bug described here, in which certain characters come out wrong because they are UTF data being interpreted as Windows-1252 (or ISO 8859-1).
I grepped the relevant data, and found that it is returning the erroneous result there as well (e.g. the 'µ' appears as 'µ'). However, elsewhere in the same data set there are entries in which the same symbol is displayed correctly. This may be due to a bug in the system which collected the data in the first place. The real weirdness is that it seems my current code only catches the incorrect version, letting the correct one pass through.
In any case, I'm really stuck on how to proceed. I need to be able to come up with a series of regex substitutions which will catch both the correct and incorrect versions of these characters, but the identifier for the correct version is failing in this case.
I must admit, I'm still fairly junior to programming, and anything more than the most basic regex is still like black magic to me. This problem seems a bit more intractable than any I've had to tackle before, and that's why I bring it to here to get some more eyes on it.
Thanks!
If your input data is encoded as UTF-8, your code should work. Here’s a complete program that works for me. It assumes the input is UTF-8 and simply operates on the raw bytes, not converting to or from Unicode. Note that I removed the + from the end of each input regex; that would accept one or more of the last character, which you probably didn’t intend.
import re

def stripChars(s):
    s = (re.sub(r'\xc2\xb5', 'micro', s))  # micro
    s = (re.sub(r'\xc2\xb1', '+/-', s))    # plus-or-minus
    return s

f_in = open('data')
f_out = open('output', 'w')

for line in f_in:
    print(type(line))
    line = stripChars(line)
    f_out.write(line)
If your data is encoded some other way (see for example this question for how to tell), this version will be more useful. You can specify any encoding for input and output. It decodes to internal Unicode on reading, acts on that when replacing, then encodes on writing.
import codecs
import re

encoding_in = 'iso8859-1'
encoding_out = 'ascii'

def stripChars(s):
    s = (re.sub(u'\u00B5', 'micro', s))  # micro
    s = (re.sub(u'\u00B1', '+/-', s))    # plus-or-minus
    return s

f_in = codecs.open('data-8859', 'r', encoding_in)
f_out = codecs.open('output', 'w', encoding_out)

for uline in f_in:
    uline = stripChars(uline)
    f_out.write(uline)
Note that it will raise an exception if it tries to write non-ASCII data with an ASCII encoding. The easy way to avoid this is to just write UTF-8, but then you may not notice uncaught characters. You can catch the exception and do something graceful. Or you can let the program crash and update it for the character(s) you’re missing.
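One graceful option (a sketch, not part of the original code) is to open the output file with errors='replace', so unconvertible characters are written as ? instead of raising:
# replaces anything that cannot be encoded in encoding_out with '?'
f_out = codecs.open('output', 'w', encoding_out, errors='replace')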
OK: as you are using Python 2, you read the file as byte strings, and your code should successfully translate all UTF-8 encoded occurrences of µ (U+00B5) or ± (U+00B1).
This is consistent with what you later say:
my current code only catches the incorrect version, letting the correct one pass through
This is in fact perfectly correct. Let us first look at what exactly happens for µ. µ is u'\u00b5'; it is encoded in utf-8 as '\xc2\xb5', and in Latin1 or cp1252 as '\xb5'. As 'Â' is U+00C2, its Latin1 or cp1252 code is 0xc2. That means that a µ character correctly encoded in utf-8 will read as 'Âµ' on a Windows-1252 system. And when it looks correct, it is because it is not utf-8 encoded but Latin1 encoded.
It looks like you are trying to process a file where some parts are utf-8 encoded while others are Latin1 (or cp1252) encoded. You really should try to fix that in the system that is collecting the data, because it can cause hard-to-recover trouble.
The good news is that it can be fixed here, because you only want to process 2 non-ASCII characters: you just have to replace the utf-8 version as you do, and then in a second pass replace the Latin1 version. The code could be (no need for regexes here):
def stripChars(string):
    outString = string.replace('\xc2\xb5', 'micro')    # Metric 'micro (10^-6)' (Greek 'mu') letter in utf-8
    outString = outString.replace('\xb5', 'micro')     # Metric 'micro (10^-6)' (Greek 'mu') letter in Latin1
    outString = outString.replace('\xc2\xb1', '+/-')   # Scientific 'Plus-Minus' symbol in utf-8
    outString = outString.replace('\xb1', '+/-')       # Scientific 'Plus-Minus' symbol in Latin1
    return outString
For reference: Latin1, AKA ISO-8859-1, maps every byte below 256 directly to the Unicode codepoint with the same value. Windows code page 1252 (cp1252 in Python) is a Windows variation of Latin1 where some bytes normally unused in Latin1 are used for higher codepoints. For example € (U+20AC) is encoded as '\x80' in cp1252, while it does not exist at all in Latin1.
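A quick interactive illustration of that difference (Python 2):
>>> u'\u20ac'.encode('cp1252')
'\x80'
>>> '\x80'.decode('cp1252')
u'\u20ac'
>>> '\x80'.decode('latin1')   # in Latin1, 0x80 is just a C1 control character, not the euro sign
u'\x80'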

Running Python 2.7 Code With Unicode Characters in Source

I want to run a Python source file that contains unicode (utf-8) characters in the source. I am aware of the fact that this can be done by adding the comment # -*- coding: utf-8 -*- in the beginning. However, I wish to do it without using this method.
One way I could think of was writing the unicode strings in escaped form. For example,
Edit: Updated Source. Added Unicode comments.
# Printing naïve and 男孩
def fxn():
    print 'naïve'
    print '男孩'

fxn()
becomes
# Printing na\xc3\xafve and \xe7\x94\xb7\xe5\xad\xa9
def fxn():
    print 'na\xc3\xafve'
    print '\xe7\x94\xb7\xe5\xad\xa9'

fxn()
I have two questions regarding the above method.
How do I convert the first code snippet, using Python, into the equivalent that follows it? That is, only the unicode sequences should be written in escaped form.
Is the method foolproof considering only unicode (utf-8) characters are used? Is there something that can go wrong?
Your idea is generally sound, but it will break in Python 3 and will cause a headache when you are manipulating and writing your strings in Python 2.
It's a good idea to use Unicode strings, not regular strings when dealing with non-ASCII.
Instead, you can encode your characters as Unicode (not UTF-8) escape sequences in Unicode strings.
u'na\xefve'
u'\u7537\u5b69'
Note the u prefix.
Your code is now encoding agnostic.
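If you want to produce those escape sequences rather than type them by hand, one option (run interactively, so the source-encoding question does not arise) is the unicode_escape codec:
>>> print u'naïve'.encode('unicode_escape')
na\xefve
>>> print u'男孩'.encode('unicode_escape')
\u7537\u5b69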
If you only use byte strings, and save your source file encoded as UTF-8, your byte strings will contain UTF-8-encoded data. No need for the coding statement (although it's REALLY strange that you don't want to use it... it's just a comment). The coding statement lets Python know the encoding of the source file, so it can decode Unicode strings correctly (u'xxxxx'). If you have no Unicode strings, it doesn't matter.
For your questions, no need to convert to escape codes. If you encode the file as UTF-8, you can use the more readable characters in your byte strings.
FYI, that won't work for Python 3, because byte strings cannot contain non-ASCII in that version.
That said, here's some code that will convert your example as requested. It reads the source assuming it is encoded in UTF-8, then uses a regular expression to locate all non-ASCII characters. It passes them through a conversion function to generate the replacement. This should be safe, since non-ASCII can only be used in string literals and constants in Python 2. Python 3, however, allows non-ASCII in variable names so this wouldn't work there.
import io
import re

def escape(m):
    char = m.group(0).encode('utf8')
    return ''.join(r'\x{:02x}'.format(ord(b)) for b in char)

with io.open('sample.py', encoding='utf8') as f:
    content = f.read()

new_content = re.sub(r'[^\x00-\x7f]', escape, content)

with io.open('sample_new.py', 'w', encoding='utf8') as f:
    f.write(new_content)
Result:
# Printing na\xc3\xafve and \xe7\x94\xb7\xe5\xad\xa9
def fxn():
    print 'na\xc3\xafve'
    print '\xe7\x94\xb7\xe5\xad\xa9'

fxn()
Question 1: try to use:
print u'naïve'
print u'长者'
Question 2: if you type the sentences with a keyboard and Chinese input software, everything should be OK. But if you copy and paste sentences from some web pages, you should consider other encoding formats such as GBK, GB2312 and GB18030.
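A small sketch of that point (my own example): the same characters are different byte sequences under different encodings, so pasted data has to be decoded with the encoding it was actually saved in.
text = u'\u957f\u8005'   # u'长者', written with escapes so no coding declaration is needed
print repr(text.encode('utf-8'))   # 3 bytes per character
print repr(text.encode('gbk'))     # 2 bytes per character
print text.encode('gbk').decode('gbk') == text   # True: same text either way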
This snippet of Python 3 should convert your program correctly to work in Python 2.
def convertchar(char): # converts individual characters
    if 32 <= ord(char) <= 126 or char == "\n": return char # if normal character, return it
    h = hex(ord(char))[2:]
    if ord(char) < 256: # if unprintable ASCII
        h = "0"*(2-len(h)) + h
        return "\\x" + h
    elif ord(char) < 65536: # if short unicode
        h = "0"*(4-len(h)) + h
        return "\\u" + h
    else: # if long unicode
        h = "0"*(8-len(h)) + h
        return "\\U" + h

def converttext(text): # converts a chunk of text
    newtext = ""
    for char in text:
        newtext += convertchar(char)
    return newtext

def convertfile(oldfilename, newfilename): # converts a file
    oldfile = open(oldfilename, "r")
    oldtext = oldfile.read()
    oldfile.close()
    newtext = converttext(oldtext)
    newfile = open(newfilename, "w")
    newfile.write(newtext)
    newfile.close()

convertfile("FILE_TO_BE_CONVERTED", "FILE_TO_STORE_OUTPUT")
First a simple remark: as you are using byte strings in a Python 2 script, the # -*- coding: utf-8 -*- declaration has simply no effect. It only helps to convert the source byte string to a unicode string if you had written:
# -*- coding: utf-8 -*-
...
utxt = u'naïve'  # source code is the byte string 'na\xc3\xafve'
                 # but utxt must become the unicode string u'na\xefve'
At most, it might be interpreted by clever editors to automatically use a utf8 charset.
Now for the actual question. Unfortunately, what you are asking for is not really trivial: identifying what is in a comment and what is in a string in a source file requires a Python parser... And AFAIK, if you use the parser from the ast module you will lose your comments, except for docstrings.
But in Python 2, non-ASCII characters are only allowed in comments and literal strings! So if the source file is a correct Python 2 script containing no unicode string literal (*), you can safely transform any non-ASCII character into its Python escape representation.
A possible Python function that reads a raw source file from a file object and writes the escaped version to another file object could be:
def src_encode(infile, outfile):
    while True:
        c = infile.read(1)
        if len(c) < 1: break        # stop on end of file
        if ord(c) > 127:            # transform high characters
            c = "\\x{:2x}".format(ord(c))
        outfile.write(c)
A nice property is that it works whatever encoding you use, provided the source file is acceptable to a Python interpreter and does not contain high characters in unicode literals (*), and the converted file will behave exactly the same as the original one...
(*) A problem will arise if you use unicode literals in an encoding other than Latin1, because the above function will behave as if the file contained the declaration # -*- coding: Latin1 -*-: u'é' will be translated correctly as u'\xe9' if the original encoding is latin1, but as u'\xc3\xa9' (not what is expected...) if the original encoding is utf8, and I cannot imagine a way to process both byte string literals and unicode string literals correctly without fully parsing the source file...

How to read Chinese files?

I'm stuck with all this confusing encoding stuff. I have a file containing Chinese subtitles. I actually believe it is UTF-8, because opening it as UTF-8 in Notepad++ gives me a very good result. If I set gb2312, the Chinese part is still fine, but I see some UTF-8 sequences not being converted.
The goal is to loop through the text in the file and count how many times the different chars come up.
import os
import re
import io

character_dict = {}
for dirname, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        if "srt" in filename:
            import codecs
            f = codecs.open(filename, 'r', 'gb2312', errors='ignore')
            s = f.read()
            # deleting {}
            s = re.sub('{[^}]+}', '', s)
            # deleting every line that does not start with a chinese char
            s = re.sub(r'(?m)^[A-Z0-9a-z].*\n?', '', s)
            # delete non chinese chars
            s = re.sub(r'[\s\.A-Za-z0-9\?\!\\/\-\"\,\*]', '', s)
            #print s
            s = s.encode('gb2312')
            print s
            for c in s:
                #print c
                pass
This will actually give me the complete Chinese text. But when I print out the loop at the bottom I just get question marks instead of the single characters.
Also note I said it is UTF-8, but I have to use gb2312 for the encoding and as the setting in my gnome-terminal. If I set it to UTF-8 in the code I just get trash, no matter if I set my terminal to UTF-8 or gb2312. So maybe this file is not UTF-8 after all!?
In any case s contains the full Chinese text. Why can't I loop it?
Please help me to understand this. It is very confusing for me and the docs are getting me nowhere. And google just leads me to similar problems that somebody solves, but there is no explanation so far that helped me understand this.
gb2312 is a multi-byte encoding. If you iterate over a bytestring encoded with it, you will be iterating over the bytes, not over the characters you want to be counting (or printing). You probably want to do your iteration on the unicode string before encoding it. If necessary, you can encode the individual codepoints (characters) to their own bytestrings for output:
# don't do s = s.encode('gb2312')

for c in s:                    # iterate over the unicode codepoints
    print c.encode('gb2312')   # encode them individually for output, if necessary
You are printing individual bytes. GB2312 is a multi-byte encoding, and each codepoint uses 2 bytes. Printing those bytes individually won't produce valid output, no.
The solution is to not encode from Unicode to bytes when printing. Loop over the Unicode string instead:
# deleting {}
s = re.sub('{[^}]+}', '', s)
# deleting every line that does not start with a chinese char
s = re.sub(r'(?m)^[A-Z0-9a-z].*\n?', '', s)
# delete non chinese chars
s = re.sub(r'[\s\.A-Za-z0-9\?\!\\/\-\"\,\*]', '', s)
#print s

# No `s.encode()`!
for char in s:
    print char
You could encode each character individually:
for char in s:
    print char.encode('gb2312')
But if you have your console / IDE / terminal correctly configured, you should be able to print directly without errors, especially since your print s.encode('gb2312') produces correct output.
You also appear to be confusing UTF-8 (an encoding) with the Unicode standard. UTF-8 can be used to represent Unicode data in bytes. GB2312 is an encoding too, and can be used to represent a (subset of) Unicode text in bytes.
You may want to read up on Python and Unicode:
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
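Since the stated goal is counting how often each character occurs, here is a minimal sketch of that step (the file name is hypothetical), operating on the decoded unicode string and only encoding for terminal output:
import codecs
from collections import Counter

with codecs.open('subs.srt', 'r', 'gb2312', errors='ignore') as f:   # hypothetical file name
    text = f.read()                    # a unicode string

counts = Counter(text)                 # one entry per codepoint
for char, n in counts.most_common(10):
    print char.encode('gb2312'), n     # encode individually for a gb2312 terminal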

Comparing strings not working

I have a list of article titles that I store in a text file and load into a list. I'm trying to compare the current title with all the titles that are in that list like so
def duplicate(entry):
    for line in posted_titles:
        print 'Comparing'
        print entry.title
        print line
        if line.lower() == entry.title.lower():
            print 'found duplicate'
            return True
    return False
My problem is, this never returns true. When it prints out identical strings for entry.title and line, it won't flag them as equal. Is there a string compare method or something I should be using?
Edit
After looking at the representations of the strings with repr(), the strings that are being compared look like this:
u"Some Article Title About Things And Stuff - Publisher Name"
'Some Article Title About Things And Stuff - Publisher Name'
It would help even more if you had provided an actual example.
In any case, your problem is the mix of the two string types in Python 2: entry.title is apparently a unicode string (denoted by the u before the quotes), while line is a normal str (or vice versa).
For all characters that are equally represented in both formats (ASCII characters and probably a few more), the equality comparison will be successful. For other characters it won’t:
>>> 'Ä' == u'Ä'
False
When doing the comparison in the reversed order, IDLE actually gives a warning here:
>>> u'Ä' == 'Ä'
Warning (from warnings module):
File "__main__", line 1
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False
You can get a unicode string from a normal string by using str.decode and supplying the original encoding. For example latin1 in my IDLE:
>>> 'Ä'.decode('latin1')
u'\xc4'
>>> 'Ä'.decode('latin1') == u'Ä'
True
If you know it’s utf-8, you could also specify that. For example the following file saved with utf-8 will also print True:
# -*- coding: utf-8 -*-
print('Ä'.decode('utf-8') == u'Ä')
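Applied to the function from the question, a minimal sketch (assuming the titles file was saved as UTF-8 and that line is the byte string; if it is the other way around, decode entry.title instead):
def duplicate(entry):
    for line in posted_titles:
        # decode the byte string so both sides are unicode before comparing
        if line.decode('utf-8').strip().lower() == entry.title.lower():
            return True
    return False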
== is fine for string comparison. Make sure you are dealing with strings:
if str(line).lower() == str(entry.title).lower():
Another possible syntax is the boolean expression str1 is str2, but note that is tests object identity rather than equality.
