python2.7 - reading a dictionary from a .txt file riddled with unicode


I enrolled into a Chinese Studies course some time ago, and I thought it'd be a great exercise for me to write a flashcard program in python. I'm storing the flash card lists in a dictionary in a .txt file, so far without trouble. The real problems kick in when I try to load the file, encoded in utf-8, into my program. An excerpt of my code:
import codecs
f = codecs.open(('list.txt'),'r','utf-8')
quiz_list = eval(f.read())
quizy = str(quiz_list).encode('utf-8')
print quizy
Now, if for example list.txt consists of:
{'character1':'男人'}
what is printed is actually
{'character1': '\xe7\x94\xb7\xe7\x86\xb1'}
Obviously there are some serious encoding issues here, but I cannot for the life of me understand where these occur. I am working with a terminal which supports utf-8, so not the standard cmd.exe: this is not the problem. Reading a normal list.txt without the curly dict-bits returns the chinese characters without a problem, so my guess is I'm not handling the dictionary part correctly. Any thoughts would be greatly appreciated!

There's nothing wrong with your encoding... Look at this:
>>> d = {1:'男人'}
>>> d[1]
'\xe7\x94\xb7\xe4\xba\xba'
>>> print d[1]
男人
Printing a string is one thing; printing its representation is another.

str(quiz_list) calls repr() on each value in the dictionary (here quiz_list['character1']), which produces an escaped ASCII representation of the string. If you print quiz_list['character1'] on its own, you'll see that the Python string holds the correct characters.
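To see the difference without relying on a particular terminal, here is a minimal sketch. It is written for Python 3, where ascii() gives roughly the escaped view that Python 2's repr() produced. load_quiz is a hypothetical helper (not from the question), and ast.literal_eval is swapped in for eval() because it only accepts literals.

```python
import ast


def load_quiz(text):
    """Parse a dict literal from text; hypothetical helper, not from the question."""
    # literal_eval only accepts Python literals, so it is safer than eval().
    return ast.literal_eval(text)


quiz_list = load_quiz("{'character1': '男人'}")

# The value itself holds the two expected characters.
value = quiz_list['character1']

# Converting the whole container escapes each element, much like
# str(dict) did in Python 2.
escaped_view = ascii(quiz_list)
```

Printing value shows the characters; printing escaped_view shows backslash escapes, even though the underlying data is identical.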

Related

How to distinguish between a correct and a botched unicode encoded string in Python?

I have string data in various languages. Parts of the strings have gone through some wrong encoding/decoding while others are correct, and I need to fix the wrong ones.
Here's an example for the German word "Zubehör":
correct = "ZUBEHÖR"
incorrect = "ZUBEHÃ\x96R"
I already found out that I can correct the errors like this:
incorrect.encode("raw_unicode_escape").decode("utf8")
However using this on the correct strings yields an error. I could iterate over all strings and use a try-statement, but I don't know if this will work reliably and I'd like to know a more elegant way.
Also while the \x96 is written out when printing it's actually only one character:
incorrect[-3]
Out[34]: 'Ã'
incorrect[-2]
Out[33]: '\x96'
How can I reliably only find those strings that have these odd unicode characters in them like ZUBEHÃ\x96R?
EDIT:
Here's something else I stumbled upon while experimenting:
When I do incorrect.encode("raw_unicode_escape") then the result is b'ZUBEH\xc3\x96R'.
But when I do this with e.g. a cyrillic word like this:
"Персонализированные".encode("raw_unicode_escape")
Then the result is b'\\u041f\\u0435\\u0440\\u0441\\u043e\\u043d\\u0430\\u043b\\u0438\\u0437\\u0438\\u0440\\u043e\\u0432\\u0430\\u043d\\u043d\\u044b\\u0435'
Why am I getting \x-escapes in the first case and \u-escapes in the second case while doing the exact same thing?
And why can I .decode("utf8") back the \x-escapes into a readable format but not the \u-escapes?

You should try the fixes-text-for-you library (ftfy):
>>> import ftfy
>>> ftfy.fix_text("ZUBEHÃ\x96R")
'ZUBEHÖR'
It operates line by line, so if your text mixes clean and corrupt strings on separate lines, ftfy can probably handle it.
Note: This is not an exact science.
The way ftfy works involves a lot of educated guesses.
The tool is very well made, but it may not guess correctly in all cases you have.
If you can, it is always better to fix the errors at the source (i.e. make sure all text is correctly decoded in the first place).
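For comparison, the try-based round trip from the question can be sketched as below. This is only an illustration under the assumption that the damage is always "UTF-8 bytes decoded as Latin-1"; repair_mojibake is a hypothetical name, and unlike ftfy it applies no heuristics. The last two lines also show why raw_unicode_escape produced different escapes: it writes code points below 256 as single bytes and everything else as \uXXXX.

```python
def repair_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1 (hypothetical helper)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or its bytes are
        # not valid UTF-8: treat it as already correct.
        return text


fixed = repair_mojibake("ZUBEH\u00c3\x96R")      # the botched string
untouched = repair_mojibake("ZUBEH\u00d6R")      # the correct string
cyrillic = repair_mojibake("Персонализированные")

# Code points below 256 become raw bytes; higher ones become \uXXXX text.
low = "ZUBEH\u00c3\x96R".encode("raw_unicode_escape")
high = "\u041f".encode("raw_unicode_escape")
```

A correct string whose Latin-1 bytes happen to form valid UTF-8 would still be changed by this, which is one reason a heuristic tool can be the safer choice.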

Is there a way to get around unicode issues when using win32api/com modules in python 3?

I've looked around and haven't found anything just yet. I'm going through emails in an inbox and checking for a specific word set. It works on most emails, but some of them don't parse. I checked the broken emails using:
print (msg.Body.encode('utf8'))
and my problem messages all start with b', like this:
b'\xe6\xa0\xbc\xe6\xb5\xb4\xe3\xb9\xac\xe6\xa0\xbc\xe6\x85\xa5\xe3\xb9\xa4\xe0\xa8\x8d\xe6\xb4\xbc\xe7\x91\xa5\xe2\x81\xa1\xe7\x91\x
I think this is forcing Python to read the body as bytes, but I'm not sure. Either way, no matter what encoding I try after the b, I get nothing but garbage text.
I've tried other encoding methods, as well as decoding first, but I'm just getting a ton of attribute errors.
import win32api
import win32com.client
import datetime
import os
import time
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
dater = datetime.date.today() - datetime.timedelta(days = 1)
dater = str(dater.strftime("%m-%d-%Y"))
print (dater)
#for folders in outlook.folders:
# print(folders)
Receipt = outlook.folders[8]
print(Receipt)
Ritems = Receipt.folders["Inbox"]
Rmessage = Ritems.items
for msg in Rmessage:
    if (msg.Class == 46 and msg.CreationTime.strftime("%m-%d-%Y") == dater):
        print (msg.CreationTime)
        print (msg.Subject)
        print (msg.Body.encode('utf8'))
        print ('..............................')
End result is to have the message printed out in the console, or at least give Python a way to read it so I can find the text I'm looking for in the body.

The byte literal posted in the question is valid UTF-8. First two characters are U+683C and U+6D74 from the CJK Unified Ideographs block, U+4E00 - U+9FFF.
Since you don't know the source encoding, there is no way to be completely sure, but chances are that the email body is just Han characters encoded in UTF-8 (see "Determine the encoding of text in Python"). If you cannot see the UTF-8 characters correctly, check your terminal or display character set.
That said, you should get the fundamentals of character representation right. Randomly encoding or decoding is unlikely to solve anything. I would suggest you begin by reading Spolsky's introduction to Unicode and then move on to Batchelder on Unicode in Python.

As martineau said, the encoding I was searching for was UTF-16. The other messages were encoded using UTF-8. So a simple mail scrape turned out to be an excellent lesson in encoding, as well as in message Classes (off topic). Thanks for the help.
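One plausible reading of that UTF-16 finding, using only the first nine bytes of the literal posted in the question: the body was plain single-byte text whose bytes were interpreted as UTF-16 code units, which yields CJK-looking characters. The sketch below assumes little-endian UTF-16 and an ASCII/UTF-8 payload; neither is confirmed by the thread.

```python
# First nine bytes of the literal printed in the question.
sample = b'\xe6\xa0\xbc\xe6\xb5\xb4\xe3\xb9\xac'

# What print(msg.Body.encode('utf8')) started from: a str of CJK characters.
misread_body = sample.decode('utf-8')

# Turn the characters back into UTF-16 code units (assumed little-endian)
# and read the underlying bytes as text.
recovered = misread_body.encode('utf-16-le').decode('utf-8')
print(recovered)
```

Under those assumptions the same round trip could be tried on msg.Body directly for the affected messages.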

python :same character, different behavior

I'm generating file names from a list pulled from a Postgres DB with Python 2.7.9. Some words in this list contain special characters. Normally I use ''.join() to build the name and pass it to my loader, but there is just one name that won't be recognized. The .py file is set to utf-8 coding, but the words are in Portuguese, which I think is stored in latin-1 coding.
from pydub import AudioSegment
from pydub.playback import play
templist = ['+ Orégano','- Búfala','+ Rúcola']
count_ins = (len(templist)-1)
while (count_ins >= 0 ):
    kot_istructions = AudioSegment.from_ogg('/home/effe/voice_orders/Voz/'+"".join(templist[count_ins])+'.ogg')
    count_ins-=1
    play(kot_istructions)
The first two files are loaded:
/home/effe/voice_orders/Voz/+ Orégano.ogg
/home/effe/voice_orders/Voz/- Búfala.ogg
The third should be:
/home/effe/voice_orders/Voz/+ Rúcola.ogg
But python is trying to load
/home/effe/voice_orders/Voz/+ R\xc3\xbacola.ogg
Why just this one? I've tried to use normalize() to remove the accent, but since this is a string the method didn't work.
Print works well, as does the DB update. Only the file name creation doesn't work as expected.
Suggestions?

It seems the root cause might be that the encoding of these names is inconsistent within your database.
If you run:
>>> 'R\xc3\xbacola'.decode('utf-8')
You get
u'R\xfacola'
which is in fact a Python unicode, correctly representing the name. So, what should you do? Although it's a really unclean programming style, you could play .encode()/.decode() whackamole, where you try to decode the raw string from your db using utf-8, and failing that, latin-1. It would look something like this:
try:
    clean_unicode = dirty_string.decode('utf-8')
except UnicodeDecodeError:
    clean_unicode = dirty_string.decode('latin-1')
As a general rule, always work with clean unicode objects within your own source, and only convert to an encoding when saving it out. Also, don't let people insert data into a database without specifying the encoding; enforcing that will stop you from having this problem in the first place.
Hope that helps!
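Applied to a whole list of names, the same fallback might look like the sketch below. It uses Python 3 bytes objects, to_text is a hypothetical helper, and the mixed-encoding byte strings are invented stand-ins for what the database might return (an assumption, since the question does not show the raw values).

```python
def to_text(raw, encodings=("utf-8", "latin-1")):
    """Decode bytes by trying each encoding in order (hypothetical helper)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the encodings matched")


# Assumed raw values: two stored as UTF-8, one stored as Latin-1.
names = [b"+ Or\xc3\xa9gano", b"- B\xfafala", b"+ R\xc3\xbacola"]

# Decode first, then build the path from clean text.
paths = ["/home/effe/voice_orders/Voz/" + to_text(name) + ".ogg" for name in names]
```

Because latin-1 accepts any byte sequence, it has to come last in the list; anything placed after it would never be tried.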

Solved: it was a problem with the file. Deleting it and building it again did the job.

Convert into malayalama text in python 3.3.2

Hi, my code is like this(python 3.3.2)
fw = codecs.open('outputfile.txt','w')
if((unidata[i]==U'\u0d46' and unidata[i-1]==U'\u0d28') and (unidata[i+1]==U'\u0d24') and (unidata[i+2]==U'\u0d4d')):
    print ('code 1')
    if(var==1):
        x=unidata[0:i-1]+U'\u0d7b'+ ' + '+U'\u0d0e'+unidata[i+1:len(unidata)]
        first_word=unidata[0:i-1]+U'\u0d7b'
        fw.write(str(first_word.encode('UTF-8')))
output in file is like this:
(b'\xe0\xb4\xb0\xe0\xb4\xbe\xe0\xb4\xae\xe0\xb5\xbb')
Actual output should be:
രാമൻ
How to resolve this?

This works:
fw = open("myunicodefile.txt", "wb")  # binary mode, because we write encoded bytes
fw.write(first_word.encode('UTF-8'))
fw.close()
But I think you are asking about the strings inside the file.
Yes, that is what the encoded bytes look like after being converted with str():
"\xe0\xb4\xb0\xe0\xb4\xbe\xe0\xb4\xae\xe0\xb5\xbb"
These are the UTF-8 bytes of the word. To see them as Malayalam in a text editor, the file must be opened in Unicode (UTF-8) mode.
If you use Python to read that file, open it and decode the contents from UTF-8, for example:
fr = open("mytext.txt", "rb")  # binary mode, so read() returns bytes
data = fr.read()
text = data.decode("utf-8")
print(text)
This will print the Malayalam text.
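In Python 3 the encoding step can also be left to open() itself. The following is a minimal sketch, assuming the goal is a UTF-8 text file; it writes to a temporary directory so it does not touch real files, and first_word is spelled with the four code points that correspond to the bytes shown in the question.

```python
import os
import tempfile

# The word from the question, written as explicit code points.
first_word = "\u0d30\u0d3e\u0d2e\u0d7b"

path = os.path.join(tempfile.mkdtemp(), "outputfile.txt")

# Let the file object do the encoding; write the str itself.
with open(path, "w", encoding="utf-8") as fw:
    fw.write(first_word)

# Reading with the same encoding returns the original text.
with open(path, "r", encoding="utf-8") as fr:
    round_trip = fr.read()

# What the question's code wrote instead: the repr of a bytes object.
wrong = str(first_word.encode("utf-8"))
```

Here wrong is ordinary text that starts with b' followed by backslash escapes, which matches the output seen in the file.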

unicode deconversion issues and solutions
I'm giving the link because they explain it better than I can, and there are additional function definitions there as well. I think number 3 on the directly linked page helps you, though.

Python doesn't interpret UTF8 correctly

I know similar questions have been asked a million times, but despite reading through many of them I can't find a solution that applies to my situation.
I have a django application, in which I've created a management script. This script reads some text files, and outputs them to the terminal (it will do more useful stuff with the contents later, but I'm still testing it out) and the characters come out with escape sequences like \xc3\xa5 instead of the intended å. Since that escape sequence means Ã¥, which is a common misinterpretation of å because of encoding problems, I suspect there are at least two places where this is going wrong. However, I can't figure out where - I've checked all the possible culprits I can think of:
The terminal encoding is UTF-8; echo $LANG gives en_US.UTF-8
The text files are encoded in UTF-8; file * in the directory where they reside results in all entries being listed as "UTF-8 Unicode text" except one, which does not contain any non-ASCII characters and is listed as "ASCII text". Running iconv -f ascii -t utf8 thefile.txt > utf8.txt on that file yields another file with ASCII text encoding.
The Python scripts are all UTF-8 (or, in several cases, ASCII with no non-ASCII characters). I tried inserting a comment in my management script with some special characters to force it to save as UTF-8, but it did not change the behavior. The above observations on the text files apply on all Python script files as well.
The Python script that handles the text files has # -*- encoding: utf-8 -*- at the top; the only line preceding that is #!/usr/bin/python3, but I've tried both changing to .../python for Python 2.7 or removing it entirely to leave it up to Django, without results.
According to the documentation, "Django natively supports Unicode data", so I "can safely pass around Unicode strings" anywhere in the application.
I really can't think of anywhere else to look for a non-UTF-8 link in the chain. Where could I possibly have missed a setting to change to UTF-8?
For completeness: I'm reading from the files with lines = file.readlines() and printing with the standard print() function. No manual encoding or decoding happens at either end.
UPDATE:
In response to questions in comments:
print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding) yields ('ascii', 'UTF-8', None) for all files.
I started compiling an SSCCE, and quickly found that the problem is only there if I try to print the value in a tuple. In other words, print(lines[0].strip()) works fine, but print(lines[0].strip(), lines[1].strip()) does not. Adding .decode('utf-8') yields a tuple where both strings are marked with a prepending u and \xe5 (the correct escape sequence for å) instead of the odd characters before - but I can't figure out how to print them as regular strings, with no escape characters. I've tested another call to .decode('utf-8') as well as wrapping in str() but both fail with UnicodeEncodeError complaining that \xe5 can't be encoded in ascii. Since a single string works correctly, I don't know what else to test.
SSCCE:
# -*- coding: utf-8 -*-
import os, sys
for root,dirs,files in os.walk('txt-songs'):
    for filename in files:
        with open(os.path.join(root,filename)) as f:
            print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding)
            lines = f.readlines()
            print(lines[0].strip()) # works
            print(lines[0].strip(), lines[1].strip()) # does not work

The big problem here is that you're mixing up Python 2 and Python 3. In particular, you've written Python 3 code, and you're trying to run it in Python 2.7. But there are a few other problems along the way. So, let me try to explain everything that's going wrong.
I started compiling an SSCCE, and quickly found that the problem is only there if I try to print the value in a tuple. In other words, print(lines[0].strip()) works fine, but print(lines[0].strip(), lines[1].strip()) does not.
The first problem here is that the str of a tuple (or any other collection) includes the repr, not the str, of its elements. The simple way to solve this problem is to not print collections. In this case, there is really no reason to print a tuple at all; the only reason you have one is that you've built it for printing. Just do something like this:
print '({}, {})'.format(lines[0].strip(), lines[1].strip())
In cases where you already have a collection in a variable, and you want to print out the str of each element, you have to do that explicitly. You can print the repr of the str of each with this:
print tuple(map(str, my_tuple))
… or print the str of each directly with this:
print '({})'.format(', '.join(map(str, my_tuple)))
Notice that I'm using Python 2 syntax above. That's because if you actually used Python 3, there would be no tuple in the first place, and there would also be no need to call str.
You've got a Unicode string. In Python 3, unicode and str are the same type. But in Python 2, it's bytes and str that are the same type, and unicode is a different one. So, in 2.x, you don't have a str yet, which is why you need to call str.
And Python 2 is also why print(lines[0].strip(), lines[1].strip()) prints a tuple. In Python 3, that's a call to the print function with two strings as arguments, so it will print out two strings separated by a space. In Python 2, it's a print statement with one argument, which is a tuple.
If you want to write code that works the same in both 2.x and 3.x, you either need to avoid ever printing more than one argument, or use a wrapper like six.print_, or do a from __future__ import print_function, or be very careful to do ugly things like adding in extra parentheses to make sure your tuples are tuples in both versions.
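The container-versus-element difference described above can be checked without printing anything. This is a small sketch for Python 3 (where repr() keeps printable non-ASCII characters but still adds quotes); the two sample lines are made-up stand-ins for the output of f.readlines().

```python
# Stand-ins for f.readlines(); the actual file contents are not in the question.
lines = ["bl\u00e5b\u00e6r\n", "sm\u00f8r\n"]

pair = (lines[0].strip(), lines[1].strip())

# str() of a tuple is built from the repr() of each element, quotes included.
as_tuple = str(pair)

# Formatting the elements yourself uses their str() instead.
as_text = "({})".format(", ".join(pair))
```

On Python 2 the repr of each element would additionally contain backslash escapes, which is the effect seen in the question.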
So, in 3.x, you've got str objects and you just print them out. In 2.x, you've got unicode objects, and you're printing out their repr. You can change that to print out their str, or to avoid printing a tuple in the first place… but that still won't help anything.
Why? Well, printing anything, in either version, just calls str on it and then passes it to sys.stdout.write. But in 3.x, str means unicode, and sys.stdout is a TextIOWrapper; in 2.x, str means bytes, and sys.stdout is a binary file.
So, the pseudocode for what ultimately happens is (first line for 3.x, second line for 2.x):
sys.stdout.wrapped_binary_file.write(s.encode(sys.stdout.encoding, sys.stdout.errors))
sys.stdout.write(s.encode(sys.getdefaultencoding()))
And, as you saw, those will do different things, because:
print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding) yields ('ascii', 'UTF-8', None)
You can simulate Python 3 here by using an io.TextIOWrapper or codecs.StreamWriter and then using print >>f, … or f.write(…) instead of print, or you can explicitly encode all your unicode objects like this:
print '({})'.format(', '.join(element.encode('utf-8') for element in my_tuple))
But really, the best way to deal with all of these problems is to run your existing Python 3 code in a Python 3 interpreter instead of a Python 2 interpreter.
If you want or need to use Python 2.7, that's fine, but you have to write Python 2 code. If you want to write Python 3 code, that's great, but you have to run Python 3.3. If you really want to write code that works properly in both, you can, but it's extra work, and takes a lot more knowledge.
For further details, see What's New In Python 3.0 (the "Print Is A Function" and "Text Vs. Data Instead Of Unicode Vs. 8-bit" sections), although that's written from the point of view of explaining 3.x to 2.x users, which is backward from what you need. The 3.x and 2.x versions of the Unicode HOWTO may also help.

For completeness: I'm reading from the files with lines = file.readlines() and printing with the standard print() function. No manual encoding or decoding happens at either end.
In Python 3.x, the standard print function just writes Unicode to sys.stdout. Since that's an io.TextIOWrapper, its write method is equivalent to this:
self.wrapped_binary_file.write(s.encode(self.encoding, self.errors))
So one likely problem is that sys.stdout.encoding does not match your terminal's actual encoding.
And of course another is that your shell's encoding does not match your terminal window's encoding.
For example, on OS X, I create a myscript.py like this:
print('\u00e5')
Then I fire up Terminal.app, create a session profile with encoding "Western (ISO Latin 1)", create a tab with that session profile, and do this:
$ export LANG=en_US.UTF-8
$ python3 myscript.py
… and I get exactly the behavior you're seeing.
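The same mismatch can be reproduced in memory, without changing any terminal settings. In this sketch an io.BytesIO stands in for the terminal's byte stream (an illustrative substitution, not how sys.stdout is actually constructed), and decoding the bytes as Latin-1 plays the role of the misconfigured terminal window.

```python
import io

# A text layer that encodes to UTF-8, on top of an in-memory byte stream.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="utf-8", errors="strict")

stream.write("\u00e5")
stream.flush()

# The bytes that would be sent to the terminal.
raw = buffer.getvalue()

# What a terminal set to ISO Latin 1 would display for those bytes.
seen_by_latin1_terminal = raw.decode("latin-1")
```

The two bytes of the UTF-8 encoding show up as two separate Latin-1 characters, which is the pattern mentioned in the question.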

It seems from your comment that you are using python-2 and not python-3.
If you are using python-3, it's worth reading the unicode howto guide on reading/writing to understand what python is doing.
The basic flow of encoding is:
DECODE from encoding to unicode -> Processing -> Encode from unicode to encoding
In python3 the bytes are decoded to strings and strings are encoded to bytes.
The bytes to string decoding is handled for you with open().
[..] the built-in open() function can return a file-like object that
assumes the file’s contents are in a specified encoding and accepts
Unicode parameters for methods such as read() and write(). This works
through open()'s encoding and errors parameters [..]
So to read in unicode from a utf-8 encoded file you should be doing this:
# python-3
with open('utf8.txt', mode='r', encoding='utf-8') as f:
    lines = f.readlines() # returns unicode
If you want similar functionality using python-2, you can use codecs.open():
# python-2
import codecs
with codecs.open('utf8.txt', mode='r', encoding='utf-8') as f:
    lines = f.readlines() # returns unicode
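If a single spelling for both versions is preferred, io.open is another option: it is the same function as the built-in open() on Python 3 and is also available on Python 2.6 and later. The sketch below first writes two known UTF-8 lines to a temporary file (a made-up fixture, so the example does not depend on an existing utf8.txt) and then reads them back as text.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "utf8.txt")

# Write known UTF-8 bytes so the example is self-contained.
with io.open(path, "wb") as f:
    f.write(b"\xc3\xa5\n\xc3\xb8\n")

# Decoding happens inside the file object; readlines() returns text.
with io.open(path, mode="r", encoding="utf-8") as f:
    lines = f.readlines()
```

Each element of lines is then a text string, so no manual decode call is needed before processing it.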
