django countries encoding is not giving correct name - python

I am using the django_countries module for a countries list; the problem is that there are a couple of countries with special characters, like 'Åland Islands' and 'Saint Barthélemy'.
I am calling this method to get the country name:
country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name
I know that country_label is a lazily translated proxy object from django.utils, but it is not giving the right name; instead it gives 'Ã…land Islands'. Any suggestions, please?

Python (and therefore Django) stores a unicode string as a sequence of code points and identifies the string as unicode for further processing.
UTF-8 encodes each code point as one to four 8-bit bytes, so the unicode string that's being used by Django needs to be encoded to its UTF-8 byte notation at some point, and decoded back again later.
In the case of Åland Islands, what seems to be happening is that something takes the UTF-8 byte encoding and interprets the bytes as code points when converting the string.
The string django_countries returns is most likely u'\xc5land Islands', where \xc5 is the Unicode code point of Å. In UTF-8 byte notation, \xc5 becomes \xc3\x85, where \xc3 and \x85 are each an 8-bit byte. See:
http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=xc5&mode=hex
Or you can use country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name.encode('utf-8') to go from u'\xc5land Islands' to '\xc3\x85land Islands'
If you then take each byte and use it as a code point, you'll see it gives you these characters: Ã…
See: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=xc3&mode=hex
And: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=x85&mode=hex
See this snippet with the HTML character references for those two code points:
<div id="test">&#xC3;&#x85;</div>
So I'm guessing you have two different encodings in your application. One way to get from u'\xc5land Islands' to u'\xc3\x85land Islands' would be, in a UTF-8 environment, to encode to UTF-8 (which converts u'\xc5' to '\xc3\x85') and then decode back to unicode from iso-8859-1, which would give u'\xc3\x85land Islands'. But since that's not in the code you're providing, I'm guessing it happens somewhere between the moment you set country_label and the moment your output is displayed - either automatically because of encoding settings, or through an explicit assignment somewhere.
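The mangling can be reproduced directly. A minimal sketch, in Python 3 syntax for brevity (the thread itself is about Python 2): encode 'Å' to UTF-8, then misread the bytes as cp1252 - the Windows superset of Latin-1 that browsers commonly fall back to - and you get exactly the 'Ã…' from the question. (Strict ISO-8859-1 maps 0x85 to an invisible control character; the visible '…' is cp1252's rendering of that byte.)

```python
# The correct string: 'Å' is code point U+00C5.
good = u'\xc5land Islands'

# Its UTF-8 encoding turns U+00C5 into the two bytes 0xC3 0x85.
utf8_bytes = good.encode('utf-8')
print(utf8_bytes)  # b'\xc3\x85land Islands'

# Misinterpreting those bytes as cp1252 (0xC3 -> 'Ã', 0x85 -> '…')
# reproduces the broken display from the question.
print(utf8_bytes.decode('cp1252'))  # 'Ã…land Islands'
```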
FIRST EDIT:
To set the encoding for your app, add # -*- coding: utf-8 -*- at the top of your .py file and <meta charset="UTF-8"> in the <head> of your template.
And to get a unicode string from a django.utils.functional proxy object you can call unicode() on it. Like this:
country_label = unicode(fields.Country(form.cleaned_data.get('country')[0:2]).name)
SECOND EDIT:
One other way to figure out where the problem is would be to use force_bytes (https://docs.djangoproject.com/en/1.8/ref/utils/#module-django.utils.encoding) Like this:
from django.utils.encoding import force_bytes
country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name
forced_country_label = force_bytes(country_label, encoding='utf-8', strings_only=False, errors='strict')
But since you already tried many conversions without success, maybe the problem is more complex. Can you share your version of django_countries, Python and your django app language settings?
You can also look directly in your django_countries package (it should be in your Python site-packages directory): find the file data.py and open it to see what it looks like. Maybe the data itself is corrupted.

Try:
from __future__ import unicode_literals  # Place as the first import.
AND / OR
country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name.encode('latin1').decode('utf8')
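If the string really is doubly encoded as described earlier, the encode('latin1').decode('utf8') trick reverses it. A sketch (Python 3 syntax) of why the round trip works:

```python
good = u'\xc5land Islands'                         # the correct text
mojibake = good.encode('utf-8').decode('latin-1')  # how it got mangled

# Re-encoding with latin-1 restores the original UTF-8 bytes unchanged
# (latin-1 maps code points 0-255 straight to bytes 0-255), so decoding
# those bytes as UTF-8 recovers the text.
fixed = mojibake.encode('latin-1').decode('utf-8')
print(fixed == good)  # True
```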

Just this week I encountered a similar encoding error. I believe the problem is that the machine's encoding differs from the one Python uses. Try adding this to your .bashrc or .zshrc:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
Then, open up a new terminal and run the Django app again.

Related

python: same character, different behavior

I'm generating file names from a list pulled out of a Postgres DB with Python 2.7.9. In this list there are words with special characters. Normally I use ''.join() to build the name and fire it at my loader, but there is just one name that won't be recognized. The .py is set for UTF-8 coding, but the words are in Portuguese - Latin-1 coding, I think.
from pydub import AudioSegment
from pydub.playback import play
templist = ['+ Orégano','- Búfala','+ Rúcola']
count_ins = len(templist) - 1
while count_ins >= 0:
    kot_istructions = AudioSegment.from_ogg(
        '/home/effe/voice_orders/Voz/' + ''.join(templist[count_ins]) + '.ogg')
    count_ins -= 1
    play(kot_istructions)
The first two files are loaded:
/home/effe/voice_orders/Voz/+ Orégano.ogg
/home/effe/voice_orders/Voz/- Búfala.ogg
The third should be:
/home/effe/voice_orders/Voz/+ Rúcola.ogg
But python is trying to load
/home/effe/voice_orders/Voz/+ R\xc3\xbacola.ogg
Why just this one? I've tried to use normalize() to remove the accent but since this is a string the method didn't work.
Print works fine, as does the DB update; only the file name creation doesn't work as expected.
Suggestions?
It seems the root cause might be that the encoding of these names is inconsistent within your database.
If you run:
>>> 'R\xc3\xbacola'.decode('utf-8')
You get
u'R\xfacola'
which is in fact a Python unicode, correctly representing the name. So, what should you do? Although it's a really unclean programming style, you could play .encode()/.decode() whackamole, where you try to decode the raw string from your db using utf-8, and failing that, latin-1. It would look something like this:
try:
    clean_unicode = dirty_string.decode('utf-8')
except UnicodeDecodeError:
    clean_unicode = dirty_string.decode('latin-1')
As a general rule, always work with clean unicode objects within your own source, and only convert to an encoding on saving it out. Also, don't let people insert data into a database without specifying the encoding, as that will stop you from having this problem in the first place.
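That fallback can be wrapped in a small runnable helper (Python 3 syntax; the function name is my own):

```python
def decode_fallback(raw):
    # Try UTF-8 first; fall back to Latin-1, which accepts any byte value.
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('latin-1')

print(decode_fallback(b'R\xc3\xbacola'))  # UTF-8 bytes   -> 'Rúcola'
print(decode_fallback(b'R\xfacola'))      # Latin-1 bytes -> 'Rúcola'
```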
Hope that helps!
Solved: it was a problem with the file. Deleting it and building it again did the job.

pygtk spinbutton "greek" floating point

I'm trying to use the data collected by a form in a sqlite query. In this form I've made a spin button which accepts any numeric input (i.e. either 2,34 or 2.34) and sends it in the form 2,34, which Python sees as a str.
I've already tried to float() the value but it doesn't work. It seems to be a locale problem but somehow locale.setlocale(locale.LC_ALL, '') is unsupported (says WinXP).
All these happen even though I haven't set anything to greek (language, locale, etc) but somehow Windows does its magic.
Can someone help?
PS: Of course my script starts with # -*- coding: utf-8 -*- so as to have anything in greek (even comments) in the code.
AFAIK, WinXP supports setlocale just fine.
If you want to do locale-aware conversions, try using locale.atof('2,34') instead of float('2,34').
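A sketch of the locale-aware conversion. The locale name is an assumption and varies by platform ('de_DE.UTF-8' on Linux, something like 'deu_deu' on Windows), so a plain string-replacement fallback is included for machines where it isn't installed:

```python
import locale

def parse_decimal(text):
    # Try a comma-decimal locale; the name 'de_DE.UTF-8' is platform-specific
    # and may not be installed everywhere.
    try:
        locale.setlocale(locale.LC_NUMERIC, 'de_DE.UTF-8')
        return locale.atof(text)
    except locale.Error:
        # Fallback: normalise the decimal separator by hand.
        return float(text.replace(',', '.'))

print(parse_decimal('2,34'))  # 2.34
```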

Python and UTF-8: kind of confusing

I am on Google App Engine with Python 2.5. My application has to deal with multiple languages, so I have to deal with UTF-8.
I have done lots of googling but don't get what I want.
1. What's the usage of # -*- coding: utf-8 -*- ?
2. What is the difference between
s = u'Witaj świecie'
s = 'Witaj świecie'
'Witaj świecie' is a utf-8 string.
3. When I save the .py file as 'utf-8', do I still need the u before every string?
u'blah' turns it into a different kind of string (type unicode rather than type str): it makes it a sequence of unicode code points. Without it, it is a sequence of bytes. Only bytes can be written to disk or to a network stream, but you generally want to work in unicode (although Python, and some libraries, will do some of the conversion for you); the encoding (utf-8) is the translation between the two. So yes, you should use the u in front of all your literals; it will make your life much easier. See Pragmatic Unicode for a better explanation.
The coding line tells Python what encoding your file is in, so that Python can understand it. Again, reading from disk gives bytes - but Python wants to see the characters. In Py2, the default encoding for code is ASCII, so the coding line lets you put things like ś directly in your .py file in the first place - other than that, it doesn't change how your code works.
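The bytes/code-points distinction is easy to see by comparing lengths (Python 3 syntax, where the u prefix is optional but still legal):

```python
text = u'Witaj \u015bwiecie'   # 'Witaj świecie' as unicode code points
raw = text.encode('utf-8')     # the same text as UTF-8 bytes

print(len(text))  # 13 code points
print(len(raw))   # 14 bytes: 'ś' (U+015B) encodes as two bytes, 0xC5 0x9B
```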

SyntaxError: Non-ASCII character '\xa3' in file when function returns '£'

Say I have a function:
def NewFunction():
    return '£'
I want to print some stuff with a pound sign in front of it, but an error is raised when I try to run this program; this error message is displayed:
SyntaxError: Non-ASCII character '\xa3' in file 'blah' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Can anyone inform me how I can include a pound sign in my return function? I'm basically using it in a class and it's within the '__str__' part that the pound sign is included.
I'd recommend reading that PEP the error gives you. The problem is that your code is trying to use the ASCII encoding, but the pound symbol is not an ASCII character. Try using UTF-8 encoding. You can start by putting # -*- coding: utf-8 -*- at the top of your .py file. To get more advanced, you can also define encodings on a string by string basis in your code. However, if you are trying to put the pound sign literal in to your code, you'll need an encoding that supports it for the entire file.
Adding the following two lines at the top of my .py script worked for me (first line was necessary):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
First add the # -*- coding: utf-8 -*- line to the beginning of the file and then use u'foo' for all your non-ASCII unicode data:
def NewFunction():
    return u'£'
or use the magic available since Python 2.6 to make it automatic:
from __future__ import unicode_literals
The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the non-ASCII character.
If you want to return U+00A3 then you can say
return u'\u00a3'
which represents this character in pure ASCII by way of a Unicode escape sequence. If you want to return a byte string containing the literal byte 0xA3, that's
return b'\xa3'
(where in Python 2 the b is implicit; but explicit is better than implicit).
The linked PEP in the error message instructs you exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". If the encoding is UTF-8, that would be
# coding=utf-8
or the Emacs-compatible
# -*- encoding: utf-8 -*-
If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow character-encoding tag has a tag info page with more information and some troubleshooting tips.
In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#a3 shows 21 possible interpretations for the byte 0xA3 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. But in fact, I would guess you are actually using Latin-1, so you should have
# coding: latin-1
as the first or second line of your source file. Anyway, without knowledge of which character the byte is supposed to represent, a human would not be able to guess this, either.
A caveat: coding: latin-1 will definitely remove the error message (because there are no byte sequences which are not technically permitted in this encoding), but might produce completely the wrong result when the code is interpreted if the actual encoding is something else. You really have to know the encoding of the file with complete certainty when you declare the encoding.
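The difference between the two return values discussed above can be checked directly (Python 3 syntax):

```python
pound = u'\u00a3'                # '£'
print(pound.encode('latin-1'))   # b'\xa3'     - the single byte from the error
print(pound.encode('utf-8'))     # b'\xc2\xa3' - two bytes under UTF-8
```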
Adding the following two lines in the script solved the issue for me.
#!/usr/bin/python
# coding=utf-8
Hope it helps !
You're probably trying to run a Python 3 file with the Python 2 interpreter. Currently (as of 2019), the python command defaults to Python 2 when both versions are installed, on Windows and most Linux distributions.
But in case you're indeed working on a Python 2 script, a solution not yet mentioned on this page is to re-save the file in UTF-8-with-BOM encoding. That adds three special bytes to the start of the file which explicitly inform the Python interpreter (and your text editor) about the file's encoding.
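The "three special bytes" are the UTF-8 byte order mark (BOM). Python's 'utf-8-sig' codec adds and strips them automatically; a small sketch of what a "UTF-8 with BOM" file contains:

```python
import codecs

text = u'£ sign'
# Simulate a file saved as "UTF-8 with BOM": the BOM bytes come first.
saved = codecs.BOM_UTF8 + text.encode('utf-8')
print(saved[:3])                  # b'\xef\xbb\xbf'

# The 'utf-8-sig' codec strips the BOM when decoding.
print(saved.decode('utf-8-sig'))  # '£ sign'
```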

Django: Non-ASCII character

My Django View/Template is not able to handle special characters. The simple view below fails because of the ñ. I get below error:
Non-ASCII character '\xf1' in file
def test(request):
    return HttpResponse('español')
Is there some general setting that I need to set? It would be weird if I had to handle all strings separately: non-American letters are pretty common!
EDIT
This is in response to the comments below. It still fails :(
I added the coding comment to my view and the meta info to my html, as suggested by Gabi.
Now my example above doesn't give an error, but the ñ is displayed incorrectly.
I tried return render_to_response('tube/mysite.html', {"s": 'español'}). No error, but it doesn't display (it does if s = 'hello'). The other information on the html page displays fine.
I tried hardcoding 'español' into my HTML and that fails:
UnicodeDecodeError 'utf8' codec can't decode byte 0xf.
I tried with the u in front of the string:
SyntaxError (unicode error) 'utf8' codec can't decode byte 0xf1
Does this help at all??
Do you have this at the beginning of your script:
# -*- coding: utf-8 -*-
...?
See this: http://www.python.org/dev/peps/pep-0263/
EDIT: For the second problem, it's about the html encoding. Put this in the head of your html page (you should send the response as an html page, otherwise I don't think you will be able to output that character correctly):
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Insert at the top of views.py
# -*- coding: utf-8 -*-
And add "u" before your string
my_str = u"plus de détails"
Solved!
You need the coding comment Gabi mentioned and also the unicode "u" prefix before your string:
return HttpResponse(u'español')
The best page I found on the web explaining all the ASCII/Unicode mess is this one :
http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror
Enjoy!
Set DEFAULT_CHARSET to 'utf-8' in your settings.py file.
I was struggling with the same issue as #dkgirl, yet despite making all of the changes suggested here I still could not get constant strings that I'd defined in settings.py that contain ñ to show up in pages rendered from my templates.
Instead I replaced every instance of "utf-8" in my python code from the above solutions to "ISO-8859-1" (Latin-1). It works fine now.
Odd since everything seems to indicate that ñ is supported by utf-8 (and in fact I'm still using utf-8 in my templates). Perhaps this is an issue only on older Django versions? I'm running 1.2 beta 1.
Any other ideas what may have caused the problem? Here's my old traceback:
Traceback (most recent call last):
File "manage.py", line 4, in
import settings # Assumed to be in the same directory.
File "C:\dev\xxxxx\settings.py", line 53
('es', ugettext(u'Espa±ol') ),
SyntaxError: (unicode error) 'utf8' codec can't decode byte 0xf1 in position 0:
unexpected end of data
ref from: https://docs.djangoproject.com/en/1.8/ref/unicode/
"If your code only uses ASCII data, it’s safe to use your normal strings, passing them around at will, because ASCII is a subset of UTF-8.
Don’t be fooled into thinking that if your DEFAULT_CHARSET setting is set to something other than 'utf-8' you can use that other encoding in your bytestrings! DEFAULT_CHARSET only applies to the strings generated as the result of template rendering (and email). Django will always assume UTF-8 encoding for internal bytestrings. The reason for this is that the DEFAULT_CHARSET setting is not actually under your control (if you are the application developer). It’s under the control of the person installing and using your application – and if that person chooses a different setting, your code must still continue to work. Ergo, it cannot rely on that setting.
In most cases when Django is dealing with strings, it will convert them to Unicode strings before doing anything else. So, as a general rule, if you pass in a bytestring, be prepared to receive a Unicode string back in the result."
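The quoted rule is easy to demonstrate outside Django (Python 3 syntax): a Latin-1 bytestring containing ñ is simply not valid UTF-8, which is exactly the decode error from the question.

```python
latin1_bytes = u'Espa\xf1ol'.encode('latin-1')   # b'Espa\xf1ol'

try:
    latin1_bytes.decode('utf-8')                 # 0xf1 is an invalid sequence
except UnicodeDecodeError:
    print('0xf1 is not valid UTF-8 on its own')

print(latin1_bytes.decode('latin-1'))            # 'Español'
```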
The thing about encoding is that apart from declaring to use UTF-8 (via <meta> and the project's settings.py file) you should of course respect your declaration: make sure your files are saved using UTF-8 encoding.
The reason is simple: you tell the interpreter to do IO using a specific charset. When you didn't save your files with that charset, the interpreter will get lost.
Some IDEs and editors will use Latin-1 (ISO-8859-1) by default, which explains why Ryan's answer could work. It's not a valid solution to the original question as asked, though, just a quick fix.
