Python JSON preserve encoding - python

I have a file like this:
aarónico
aaronita
ababol
abacá
abacería
abacero
ábaco
#more words, with no ascii chars
When i read and print that file to the console, it prints exactly the same, as expected, but when i do:
f.write(json.dumps({word: Lookup(line)}))
This is saved instead:
{"aar\u00f3nico": ["Stuff"]}
When i expected:
{"aarónico": ["Stuff"]}
I need to get the same when i jason.loads() it, but i don't know where or how to do the encoding or if it's needed to get it to work.
EDIT
This is the code that saves the data to a file:
with open(LEMARIO_FILE, "r") as flemario:
with open(DATA_FILE, "w") as f:
while True:
word = flemario.readline().strip()
if word == "":
break
print word #this is correct
f.write(json.dumps({word: RAELookup(word)}))
f.write("\n")
And this one loads the data and returns the dictionary object:
with open(DATA_FILE, "r") as f:
while True:
new = f.readline().strip()
if new == "":
break
print json.loads(new) #this is not
I cannot lookup the dictionaries if the keys are not the same as the saved ones.
EDIT 2
>>> import json
>>> f = open("test", "w")
>>> f.write(json.dumps({"héllö": ["stuff"]}))
>>> f.close()
>>> f = open("test", "r")
>>> print json.loads(f.read())
{u'h\xe9ll\xf6': [u'stuff']}
>>> "héllö" in {u'h\xe9ll\xf6': [u'stuff']}
False

This is normal and valid JSON behaviour. The \uxxxx escape is also used by Python, so make sure you don't confuse python literal representations with the contents of the string.
Demo in Python 3.3:
>>> import json
>>> print('aar\u00f3nico')
aarónico
>>> print(json.dumps('aar\u00f3nico'))
"aar\u00f3nico"
>>> print(json.loads(json.dumps('aar\u00f3nico')))
aarónico
In python 2.7:
>>> import json
>>> print u'aar\u00f3nico'
aarónico
>>> print(json.dumps(u'aar\u00f3nico'))
"aar\u00f3nico"
>>> print(json.loads(json.dumps(u'aar\u00f3nico')))
aarónico
When reading and writing from and to files, and when specifying just raw byte strings (and "héllö" is a raw byte string) then you are not dealing with Unicode data. You need to learn about the differences between encoded and Unicode data first. I strongly recommend you read at least 2 of the following 3 articles:
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
You were lucky with your "héllö" python raw byte string representation, Python managed to decode it automatically for you. The value read back from the file is perfectly normal and correct:
>>> print u'h\xe9ll\xf6'
héllö

Related

Error when reading UTF-8 characters with python

I have the following function in python, which takes a string as argument and returns the same string in ASCII (e.g. "alçapão" -> "alcapao"):
def filt(word):
dic = { u'á':'a',u'ã':'a',u'â':'a' } # the whole dictionary is too big, it is just a sample
new = ''
for l in word:
new = new + dic.get(l, l)
return new
It is supposed to "filter" all strings in a list that I read from a file using this:
lines = []
with open("to-filter.txt","r") as f:
for line in f:
lines.append(line.strip())
lines = [filt(l) for l in lines]
But I get this:
filt.py:9: UnicodeWarning: Unicode equal comparison failed to convert
both arguments to Unicode - interpreting them as being unequal
new = new + dic.get(l, l)
and the strings filtered have characters like '\xc3\xb4' instead of ASCII characters. What should I do?
You're mixing and matching Unicodes strs and regular (byte) strs.
Use the io module to open and decode your text file to Unicodes as it's read:
with io.open("to-filter.txt","r", encoding="utf-8") as f:
this assumes your to-filter.txt file is UTF-8 encoded.
You can also shrink your file read into an array with just:
with io.open("to-filter.txt","r", encoding="utf-8") as f:
lines = f.read().splitlines()
lines is now a list of Unicode strings.
Optional
It looks like you're trying to convert non-ASCII characters to their closest ASCII equivalent. The easy way to this is:
import unicodedata
def filt(word):
return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')
What this does is:
Decomposes each character into their component parts. For example, ã can be expressed as a single Unicode char (U+00E3 'LATIN SMALL LETTER A WITH TILDE') or as two Unicode characters: U+0061 'LATIN SMALL LETTER A' + U+0303 'COMBINING TILDE'.
Encode component parts to ASCII. Non ASCII parts (those with code points greater than U+007F), will be ignored.
Decode back to a Unicode str for convenience.
Tl;dr
Your code is now:
import unicodedata
def filt(word):
return unicodedata.normalize('NFKD', word).encode('ascii', errors='ignore').decode('ascii')
with io.open("to-filter.txt","r", encoding="utf-8") as f:
lines = f.read().splitlines()
lines = [filt(l) for l in lines]
Python 3.x
Although not strictly necessarily, remove io from open()
The root of your problem is that you're not reading Unicode strings from the file, you're reading byte strings. There are three ways to fix this, first is to open the file with the io module as suggested by another answer. The second is to convert each string as you read it:
with open("to-filter.txt","r") as f:
for line in f:
lines.append(line.decode('utf-8').strip())
The third way is to use Python 3, which always reads text files into Unicode strings.
Finally, there's no need to write your own code to turn accented characters into plain ASCII, there's a package unidecode to do that.
from unidecode import unidecode
print(unidecode(line))

Unicode in python 3

I want to convert string, which contains Unicode numbers to usual text. For example, file "input.txt" contains string '\u0057\u0068\u0061\u0074,' and I want to know what does it mean. If string is input in the code like:
s = '\u0057\u0068\u0061\u0074'
b = s.encode('utf-8')
print(b)
it works perfectly, but if I want to do the same with file I get this result b'\\u0057\\u0068\\u0061\\u0074'.
How to fix this problem? Windows 8, encoding of files are 'windows-1251'.
If your file contains those unicode escape sequences, then you can use the unicode_escape “codec” to interpret them after you read the file contents as a string.
>>> s = r'\u0057\u0068\u0061\u0074'
>>> print(s)
\u0057\u0068\u0061\u0074
>>> s.encode('utf-8').decode('unicode_escape')
'What'
Or, you can just read a bytes string directly and decode that:
with open('file.txt', 'br') as f:
print(f.read().decode('unicode_escape'))

Storing a random byte string in Python

For my project, I need to be able to store random byte strings in a file and read the byte string again later. For example, I want to store randomByteString from the following code:
>>> from os import urandom
>>> randomByteString=urandom(8)
>>> randomByteString
b'zOZ\x84\xfb\xceM~'
What would be the proper way to do this?
Edit: Forgot to mention that I also want to store 'normal' string alongside the byte strings.
Code like:
>>> fh = open("e:\\test","wb")
>>> fh.write(randomByteString)
8
>>> fh.close()
Operate the file as binary mode. Also, you could do it in a better manner if the file operations are near one place (Thanks to #Blender):
>>> with open("e:\\test","wb") as fh:
fh.write(randomByteString)
Update: if you want to strong normal strings, you could encode it and then write it like:
>>> "test".encode()
b'test'
>>> fh.write("test".encode())
Here the fh means the same file handle opened previously.
Works just fine. You can't expect the output to make much sense though.
>>> import os
>>> with open("foo.txt", "wb") as fh:
... fh.write(os.urandom(8))
...
>>> fh.close()
>>> with open("foo.txt", "r") as fh:
... for line in fh.read():
... print line
...
^J^JM-/
^O
R
M-9
J
~G

Why is the first line longer?

i'm using python to read a txt document with:
f = open(path,"r")
for line in f:
line = line.decode('utf8').strip()
length = len(line)
firstLetter = line[:1]
it seems to work, but the first line's length is always longer by... 1
for example:
the first line is "XXXX" where X denotes a chinese character
then length will be 5, but not 4
and firstLetter will be nothing
but when it goes to the second and after lines,it works properly
tks~
You have a UTF-8 BOM at the start of your file. Don't faff about inspecting the first character. Instead of the utf8 encoding, use the utf_8_sig encoding with either codecs.open() or your_byte_string.decode() ... this sucks up the BOM if it exists and you don't see it in your code.
>>> bom8 = u'\ufeff'.encode('utf8')
>>> bom8
'\xef\xbb\xbf'
>>> bom8.decode('utf8')
u'\ufeff'
>>> bom8.decode('utf_8_sig')
u'' # removes the BOM
>>> 'abcd'.decode('utf_8_sig')
u'abcd' # doesn't care if no BOM
>>>
You are probably getting the Byte Order Mark (BOM) as the first character on the first line.
Information about dealing with it is here

Python file input string: how to handle escaped unicode characters?

In a text file (test.txt), my string looks like this:
Gro\u00DFbritannien
Reading it, python escapes the backslash:
>>> file = open('test.txt', 'r')
>>> input = file.readline()
>>> input
'Gro\\u00DFbritannien'
How can I have this interpreted as unicode? decode() and unicode() won't do the job.
The following code writes Gro\u00DFbritannien back to the file, but I want it to be Großbritannien
>>> input.decode('latin-1')
u'Gro\\u00DFbritannien'
>>> out = codecs.open('out.txt', 'w', 'utf-8')
>>> out.write(input)
You want to use the unicode_escape codec:
>>> x = 'Gro\\u00DFbritannien'
>>> y = unicode(x, 'unicode_escape')
>>> print y
Großbritannien
See the docs for the vast number of standard encodings that come as part of the Python standard library.
Use the built-in 'unicode_escape' codec:
>>> file = open('test.txt', 'r')
>>> input = file.readline()
>>> input
'Gro\\u00DFbritannien\n'
>>> input.decode('unicode_escape')
u'Gro\xdfbritannien\n'
You may also use codecs.open():
>>> import codecs
>>> file = codecs.open('test.txt', 'r', 'unicode_escape')
>>> input = file.readline()
>>> input
u'Gro\xdfbritannien\n'
The list of standard encodings is available in the Python documentation: http://docs.python.org/library/codecs.html#standard-encodings

Categories