I'm generating file names from a list pulled out of a Postgres DB with Python 2.7.9. The list contains words with special characters. Normally I use ''.join() to build the name and pass it to my loader, but there is one name that won't be recognized. The .py file is set to utf-8 coding, but the words are in Portuguese, so I think they are in latin-1 coding.
from pydub import AudioSegment
from pydub.playback import play

templist = ['+ Orégano', '- Búfala', '+ Rúcola']
count_ins = len(templist) - 1
while count_ins >= 0:
    kot_istructions = AudioSegment.from_ogg('/home/effe/voice_orders/Voz/' + "".join(templist[count_ins]) + '.ogg')
    count_ins -= 1
    play(kot_istructions)
The first two files are loaded:
/home/effe/voice_orders/Voz/+ Orégano.ogg
/home/effe/voice_orders/Voz/- Búfala.ogg
The third should be:
/home/effe/voice_orders/Voz/+ Rúcola.ogg
But python is trying to load
/home/effe/voice_orders/Voz/+ R\xc3\xbacola.ogg
Why just this one? I've tried to use normalize() to remove the accent, but since this is a string the method didn't work.
Print works fine, as does the DB update. Only the file name creation doesn't work as expected.
Suggestions?
It seems the root cause might be that the encoding of these names is inconsistent within your database.
If you run:
>>> 'R\xc3\xbacola'.decode('utf-8')
You get
u'R\xfacola'
which is in fact a Python unicode, correctly representing the name. So, what should you do? Although it's a really unclean programming style, you could play .encode()/.decode() whackamole, where you try to decode the raw string from your db using utf-8, and failing that, latin-1. It would look something like this:
try:
    clean_unicode = dirty_string.decode('utf-8')
except UnicodeDecodeError:
    clean_unicode = dirty_string.decode('latin-1')
As a general rule, always work with clean unicode objects within your own source, and only convert to an encoding on saving it out. Also, don't let people insert data into a database without specifying the encoding, as that will stop you from having this problem in the first place.
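For example, a minimal Python 2 sketch of that decode-at-the-boundary pattern, using the fallback decoder above (to_unicode is just an illustrative name):

# -*- coding: utf-8 -*-

def to_unicode(raw):
    """Decode bytes from the DB: try UTF-8 first, fall back to Latin-1."""
    if isinstance(raw, unicode):
        return raw  # already clean
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('latin-1')

# Decode once at the boundary, work in unicode, encode only on output.
name = to_unicode('+ R\xc3\xbacola')       # -> u'+ R\xfacola'
path = u'/home/effe/voice_orders/Voz/' + name + u'.ogg'
path_bytes = path.encode('utf-8')          # encode only when hitting the filesystem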
Hope that helps!
Solved: it was a problem with the file. Deleting it and building it again did the job.
Related
I've looked around and haven't found anything just yet. I'm going through emails in an inbox and checking for a specific word set. It works on most emails, but some of them don't parse. I checked the broken emails using:
print (msg.Body.encode('utf8'))
and my problem messages all start with b'.
like this
b'\xe6\xa0\xbc\xe6\xb5\xb4\xe3\xb9\xac\xe6\xa0\xbc\xe6\x85\xa5\xe3\xb9\xa4\xe0\xa8\x8d\xe6\xb4\xbc\xe7\x91\xa5\xe2\x81\xa1\xe7\x91\x
I think this is forcing Python to read the body as bytes, but I'm not sure. Either way, after the b', no matter what encoding I try, I get nothing but garbage text.
I've also tried other encoding methods, as well as decoding first, but I just get a ton of attribute errors.
import win32api
import win32com.client
import datetime
import os
import time

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
dater = datetime.date.today() - datetime.timedelta(days=1)
dater = str(dater.strftime("%m-%d-%Y"))
print(dater)

#for folders in outlook.folders:
#    print(folders)

Receipt = outlook.folders[8]
print(Receipt)
Ritems = Receipt.folders["Inbox"]
Rmessage = Ritems.items
for msg in Rmessage:
    if msg.Class == 46 and msg.CreationTime.strftime("%m-%d-%Y") == dater:
        print(msg.CreationTime)
        print(msg.Subject)
        print(msg.Body.encode('utf8'))
        print('..............................')
End result is to have the message printed out in the console, or at least give Python a way to read it so I can find the text I'm looking for in the body.
The byte literal posted in the question is valid UTF-8. The first two characters are U+683C and U+6D74 from the CJK Unified Ideographs block (U+4E00–U+9FFF).
Since you don't know the source encoding, there is no way to be completely sure, but chances are the email body is just Han characters encoded in UTF-8 (see Determine the encoding of text in Python). If you are not able to see the UTF-8 characters correctly, you should check your terminal or display character set.
That said, you should get the fundamentals of character representation right. Randomly encoding or decoding is hardly going to solve anything. I would suggest you begin by reading Spolsky's introduction to Unicode and then move on to Batchelder on Unicode in Python.
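If you'd rather make an educated guess than eyeball byte values, the third-party chardet package (one of the options behind the link above) can help; a sketch, assuming chardet is installed:

import chardet  # third-party: pip install chardet

raw = b'\xe6\xa0\xbc\xe6\xb5\xb4\xe3\xb9\xac\xe6\xa0\xbc\xe6\x85\xa5'  # bytes from the question
guess = chardet.detect(raw)
print(guess)  # something like {'encoding': 'utf-8', 'confidence': 0.99, ...}
if guess['encoding'] is not None:
    print(raw.decode(guess['encoding']))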
As martineau said, the proper encoding I was searching for was utf-16. The other messages were encoded using utf-8. So a simple mail scrape turned out to be an excellent lesson in encodings, as well as message Classes (off topic). Thanks for the help.
I'm trying to read some information from an Excel file using the xlrd module. This works fine most of the time, but when the script encounters any Scandinavian letters, it stops. I've been reading several posts about unicode and encoding, but I must admit I'm not familiar with it.
The cell I'm reading contains text (a string) and is being read as unicode (as normal with xlrd). One example of a value that fails is Glørmestervej, which is read by xlrd as u'Gl\xf8rmestervej'. If I try to print the variable, the script stops. I've had most success by encoding the value with latin1:
print cellValue.encode("latin1")
which gives the result Glormestervej, but with a KeyError.
How do I get the variable to become a string with ø instead of \xf8? The reason is that I need to use it as an input to another service and it does not seem to work using unicode.
Regards, Torbjørn
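(A minimal Python 2 sketch of the distinction in play here, using the value from the question; which byte encoding the downstream service expects is an assumption you'd have to verify:)

# -*- coding: utf-8 -*-
cell_value = u'Gl\xf8rmestervej'            # what xlrd hands back: a unicode object

utf8_bytes = cell_value.encode('utf-8')     # 'Gl\xc3\xb8rmestervej'
latin1_bytes = cell_value.encode('latin1')  # 'Gl\xf8rmestervej'

print utf8_bytes    # shows Glørmestervej in a UTF-8 terminal
print latin1_bytes  # shows Glørmestervej in a Latin-1 terminal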
I'm happy to say the problem has been solved; in fact there wasn't any error after all. There were some permission issues with the user that I used for calling the service in which the variable was used. Thank you for your response!
I am using the django_countries module for the countries list. The problem is that a couple of countries have special characters, like 'Åland Islands' and 'Saint Barthélemy'.
I am calling this method to get the country name:
country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name
I know that country_label is a lazily translated proxy object from django.utils, but it is not giving the right name; rather, it gives 'Ã…land Islands'. Any suggestions?
Django stores unicode strings as code points and identifies the string as unicode for further processing.
UTF-8 encodes each code point as one to four 8-bit bytes, so the unicode string Django works with needs to be encoded from code point notation to its UTF-8 byte notation at some point.
In the case of Åland Islands, what seems to be happening is that something takes the UTF-8 byte encoding and interprets the bytes as code points when converting the string.
The string django_countries returns is most likely u'\xc5land Islands', where \xc5 is the code point notation of Å. In UTF-8 byte notation, \xc5 becomes \xc3\x85, where \xc3 and \x85 are each an 8-bit byte. See:
http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=xc5&mode=hex
Or you can use country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name.encode('utf-8') to go from u'\xc5land Islands' to '\xc3\x85land Islands'
If you then take each of those bytes and use them as code points, you'll see they give you these characters: Ã…
See: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=xc3&mode=hex
And: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=x85&mode=hex
See code snippet with html notation of these characters.
<div id="test">Ã
Å</div>
So I'm guessing you have two different encodings in your application. One way to get from u'\xc5land Islands' to u'\xc3\x85land Islands' would be to encode to UTF-8 (which converts u'\xc5' to '\xc3\x85') and then decode back to unicode from iso-8859-1, which would give u'\xc3\x85land Islands'. But since that's not in the code you're providing, I'm guessing it happens somewhere between the moment you set country_label and the moment your output is displayed improperly; either automatically because of encoding settings, or through an explicit assignment somewhere.
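A quick interactive sketch of that round trip (Python 2; note it takes cp1252, the Windows superset of latin-1, to display \x85 as the visible '…'):

>>> label = u'\xc5land Islands'             # what django_countries returns
>>> utf8_bytes = label.encode('utf-8')      # '\xc3\x85land Islands'
>>> mojibake = utf8_bytes.decode('cp1252')  # misread the bytes as code points
>>> print mojibake
Ã…land Islands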
FIRST EDIT:
To set the encoding for your app, add # -*- coding: utf-8 -*- at the top of your .py file and <meta charset="UTF-8"> in the <head> of your template.
And to get a unicode string from a django.utils.functional.proxy object, you can call unicode() on it. Like this:
country_label = unicode(fields.Country(form.cleaned_data.get('country')[0:2]).name)
SECOND EDIT:
One other way to figure out where the problem is would be to use force_bytes (https://docs.djangoproject.com/en/1.8/ref/utils/#module-django.utils.encoding) Like this:
from django.utils.encoding import force_bytes
country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name
forced_country_label = force_bytes(country_label, encoding='utf-8', strings_only=False, errors='strict')
But since you already tried many conversions without success, maybe the problem is more complex. Can you share your version of django_countries, Python and your django app language settings?
What you can also do is look directly in your django_countries package (it should be in your Python directory): find the file data.py and open it to see what it looks like. Maybe the data itself is corrupted.
Try:
from __future__ import unicode_literals #Place as first import.
AND / OR
country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name.encode('latin1').decode('utf8')
Just this week I encountered a similar encoding error. I believe the problem is that the machine's encoding differs from the one Python uses. Try adding this to your .bashrc or .zshrc:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
Then, open up a new terminal and run the Django app again.
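To verify what Python picks up from the environment, a quick check (these functions exist in both Python 2 and 3):

import locale
import sys

print(sys.getdefaultencoding())        # Python's internal default codec
print(locale.getpreferredencoding())   # what LANG / LC_ALL imply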
I have an sqlite database that was populated by an external program. I'm trying to read the data with Python. When I attempt to read the data I get the following error:
OperationalError: Could not decode to UTF-8
If I open the database in SQLite Manager and look at the data in the offending record(s) using the built-in browse and search, it looks fine. However, if I export the table as CSV, I notice the character £ in the offending records has become Â£.
If I read the CSV in Python, the £ in the offending records is still read as Â£, but that's not a problem; I can parse it manually. However, I need to be able to read the data directly from the database, without the intermediate step of converting to CSV.
I have looked at some answers online for similar questions. So far I have tried setting text_factory = str, and I have also tried changing the datatype of the column from TEXT to BLOB using SQLite Manager, but I still get the error.
My code below results in the OperationalError: Could not decode to UTF-8
import sqlite3

conn = sqlite3.connect('test.db')
conn.text_factory = str
curr = conn.cursor()
curr.execute('''SELECT xml_dump FROM hands_1 LIMIT 5000 , 5001''')
row = curr.fetchone()
All the records above 5000 in the database have this character problem and hence produce the error.
Any help appreciated.
Python is trying to be helpful by converting pieces of text (stored as bytes in a database) into a python str object for you. In order to do this conversion, python has to guess what letter each byte (or group of bytes) returned by your query represents. The default guess is an encoding called utf-8. Obviously, this guess is wrong in your case.
The solution is to give python a little hint as to how to do the mapping from bytes to letters (i.e., unicode characters). You've already come close with the line
conn.text_factory = str
However (based on your response in the comments above), since you are using python 3, str is the default text factory, so that line will do nothing new for you (see the docs).
What happens behind the scenes with this line is that python tries to convert the bytes returned by the query using the str function, kind of like:
your_string = str(the_bytes, 'utf-8') # actually uses `conn.text_factory`, not `str`
...but you want a different encoding where 'utf-8' is. Since you can't change the default encoding of the str function, you will have to mimic it some other way. You can use a one-off nameless function called a lambda for this:
conn.text_factory = lambda x: str(x, 'latin1')
Now when the database is handing the bytes to python, python will try to map them to letters using the 'latin1' scheme instead of the 'utf-8' scheme. Of course, I don't know if latin1 is the correct encoding of your data. Realistically, you will have to try a handful of encodings to find the right one. I would try the following first:
'iso-8859-1'
'utf-16'
'utf-32'
'latin1'
You can find a more complete list here.
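For example, a small sketch that fetches the raw bytes once and tries each candidate in turn (table and column names taken from the question; this assumes Python 3, where text_factory = bytes skips decoding entirely):

import sqlite3

conn = sqlite3.connect('test.db')
conn.text_factory = bytes    # hand rows back as raw bytes, no decoding
curr = conn.cursor()
curr.execute('SELECT xml_dump FROM hands_1 LIMIT 5000, 5001')
raw = curr.fetchone()[0]

for enc in ('utf-8', 'iso-8859-1', 'utf-16', 'utf-32', 'latin1'):
    try:
        raw.decode(enc)
        print(enc, 'decodes cleanly')
    except UnicodeDecodeError as err:
        print(enc, 'fails:', err)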
Another option is to simply let the bytes coming out of the database remain as bytes. Whether this is a good idea for you depends on your application. You can do it by setting:
conn.text_factory = bytes
If the text in the database is actually mostly encoded in UTF-8, but you're still seeing this error (Could not decode to UTF-8), then the problem may be that one or more rows have bogus data that is not valid UTF-8. By default, Python's decode() function throws an exception when it sees text like that. If you are in this situation and want to simply ignore these errors, you can set up a text_factory like this:
conn = sqlite3.connect('my-database.db')
conn.text_factory = lambda b: b.decode(errors = 'ignore')
I enrolled into a Chinese Studies course some time ago, and I thought it'd be a great exercise for me to write a flashcard program in python. I'm storing the flash card lists in a dictionary in a .txt file, so far without trouble. The real problems kick in when I try to load the file, encoded in utf-8, into my program. An excerpt of my code:
import codecs

f = codecs.open('list.txt', 'r', 'utf-8')
quiz_list = eval(f.read())
quizy = str(quiz_list).encode('utf-8')
print quizy
Now, if for example list.txt consists of:
{'character1':'男人'}
what is printed is actually
{'character1': '\xe7\x94\xb7\xe4\xba\xba'}
Obviously there are some serious encoding issues here, but I cannot for the life of me understand where they occur. I am working with a terminal which supports utf-8, not the standard cmd.exe, so that is not the problem. Reading a normal list.txt without the curly dict bits returns the Chinese characters without a problem, so my guess is that I'm not handling the dictionary part correctly. Any thoughts would be greatly appreciated!
There's nothing wrong with your encoding... Look at this:
>>> d = {1:'男人'}
>>> d[1]
'\xe7\x94\xb7\xe4\xba\xba'
>>> print d[1]
男人
Printing a unicode string is one thing; printing its representation is another.
str(quiz_list) calls repr() on each value, producing an ASCII-only representation of the string with escape codes. If you just print quiz_list['character1'], you'll see the actual characters instead of their escape codes.
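A short sketch of the difference (Python 2):

# -*- coding: utf-8 -*-
quiz_list = {'character1': '男人'}

print quiz_list                # repr of the dict: {'character1': '\xe7\x94\xb7\xe4\xba\xba'}
print quiz_list['character1']  # the value itself: 男人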