Django: Non-ASCII character - python

My Django View/Template is not able to handle special characters. The simple view below fails because of the ñ. I get below error:
Non-ASCII character '\xf1' in file
def test(request):
    return HttpResponse('español')
Is there some general setting that I need to set? It would be weird if I had to handle all strings separately: non-American letters are pretty common!
EDIT
This is in response to the comments below. It still fails :(
I added the coding comment to my view and the meta info to my html, as suggested by Gabi.
Now my example above doesn't give an error, but the ñ is displayed incorrectly.
I tried return render_to_response('tube/mysite.html', {"s": 'español'}). No error, but it doesn't display (it does if s = hello). The other information on the html page displays fine.
I tried hardcoding 'español' into my HTML and that fails:
UnicodeDecodeError 'utf8' codec can't decode byte 0xf.
I tried with the u in front of the string:
SyntaxError (unicode error) 'utf8' codec can't decode byte 0xf1
Does this help at all??

Do you have this at the beginning of your script:
# -*- coding: utf-8 -*-
...?
See this: http://www.python.org/dev/peps/pep-0263/
EDIT: For the second problem, it's about the html encoding. Put this in the head of your html page (you should send the request as an html page, otherwise I don't think you will be able to output that character correctly):
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Insert at the top of views.py
# -*- coding: utf-8 -*-
And add "u" before your string
my_str = u"plus de détails"
Solved!

You need the coding comment Gabi mentioned and also use the unicode "u" sign before your string :
return HttpResponse(u'español')
The best page I found on the web explaining all the ASCII/Unicode mess is this one :
http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror
Enjoy!

Set DEFAULT_CHARSET to 'utf-8' in your settings.py file.

I was struggling with the same issue as @dkgirl, yet despite making all of the changes suggested here I still could not get constant strings that I'd defined in settings.py that contain ñ to show up in pages rendered from my templates.
Instead I replaced every instance of "utf-8" in my python code from the above solutions to "ISO-8859-1" (Latin-1). It works fine now.
Odd since everything seems to indicate that ñ is supported by utf-8 (and in fact I'm still using utf-8 in my templates). Perhaps this is an issue only on older Django versions? I'm running 1.2 beta 1.
Any other ideas what may have caused the problem? Here's my old traceback:
Traceback (most recent call last):
File "manage.py", line 4, in
import settings # Assumed to be in the same directory.
File "C:\dev\xxxxx\settings.py", line 53
('es', ugettext(u'Espa±ol') ),
SyntaxError: (unicode error) 'utf8' codec can't decode byte 0xf1 in position 0:
unexpected end of data

ref from: https://docs.djangoproject.com/en/1.8/ref/unicode/
"If your code only uses ASCII data, it’s safe to use your normal strings, passing them around at will, because ASCII is a subset of UTF-8.
Don’t be fooled into thinking that if your DEFAULT_CHARSET setting is set to something other than 'utf-8' you can use that other encoding in your bytestrings! DEFAULT_CHARSET only applies to the strings generated as the result of template rendering (and email). Django will always assume UTF-8 encoding for internal bytestrings. The reason for this is that the DEFAULT_CHARSET setting is not actually under your control (if you are the application developer). It’s under the control of the person installing and using your application – and if that person chooses a different setting, your code must still continue to work. Ergo, it cannot rely on that setting.
In most cases when Django is dealing with strings, it will convert them to Unicode strings before doing anything else. So, as a general rule, if you pass in a bytestring, be prepared to receive a Unicode string back in the result."

The thing about encoding is that apart from declaring to use UTF-8 (via <meta> and the project's settings.py file) you should of course respect your declaration: make sure your files are saved using UTF-8 encoding.
The reason is simple: you tell the interpreter to do IO using a specific charset. When you didn't save your files with that charset, the interpreter will get lost.
Some IDEs and editors use Latin-1 (ISO-8859-1) by default, which explains why Ryan's answer could work. It is a quick fix, though, not a valid solution to the original question.
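The mismatch described above can be reproduced in a few lines. This is only a minimal sketch, assuming the file was saved as Latin-1 while the interpreter was told to expect UTF-8; the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch: what the interpreter sees when a file declared as UTF-8
# was actually saved as Latin-1 (ISO-8859-1).

text = u"espa\xf1ol"  # \xf1 is the code point of ñ

saved_as_latin1 = text.encode("latin-1")  # ñ is the single byte \xf1
saved_as_utf8 = text.encode("utf-8")      # ñ becomes two bytes: \xc3\xb1

try:
    saved_as_latin1.decode("utf-8")
    latin1_reads_as_utf8 = True
except UnicodeDecodeError:
    # Same family of error as in the tracebacks above: the lone byte
    # 0xf1 is not valid UTF-8.
    latin1_reads_as_utf8 = False

# Reading the bytes with the encoding they were really saved in works.
round_trip = saved_as_utf8.decode("utf-8")
```

So either re-save the file as UTF-8, or declare the encoding the file really uses.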

Related

django countries encoding is not giving correct name

I am using django_countries module for countries list, the problem is there are couple of countries with special characters like 'Åland Islands' and 'Saint Barthélemy'.
I am calling this method to get the country name:
country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name
I know that country_label is a lazily translated proxy object from django.utils, but instead of the right name it gives 'Ã…land Islands'. Any suggestions, please?
Django stores unicode string using code points and identifies the string as unicode for further processing.
UTF-8 encodes each code point as one to four 8-bit bytes, so the unicode string that Django uses has to be converted from code point notation to its UTF-8 byte notation at some point.
In the case of Åland Islands, what seems to be happening is that something takes the UTF-8 byte encoding and interprets each byte as a code point when converting the string.
The string django_countries returns is most likely u'\xc5land Islands', where \xc5 is the Unicode code point notation of Å. In UTF-8 byte notation \xc5 becomes \xc3\x85, where each of \xc3 and \x85 is an 8-bit byte. See:
http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=xc5&mode=hex
Or you can use country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name.encode('utf-8') to go from u'\xc5land Islands' to '\xc3\x85land Islands'
If you take then each byte and use them as code points, you'll see it'll give you these characters: Ã…
See: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=xc3&mode=hex
And: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=x85&mode=hex
See code snippet with html notation of these characters.
<div id="test">ÅÅ</div>
So I'm guessing you have two different encodings in your application. One way to get from u'\xc5land Islands' to u'\xc3\x85land Islands' would be to encode to UTF-8, which converts u'\xc5' to '\xc3\x85', and then decode those bytes as ISO-8859-1, which gives u'\xc3\x85land Islands'. But since that's not in the code you're providing, I'm guessing it happens somewhere between the moment you set country_label and the moment your output is displayed incorrectly, either automatically because of encoding settings or through an explicit assignment somewhere.
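The round trip described above can be sketched as follows. I'm assuming here that the misreading codec is Windows-1252, since that is the code page in which byte 0x85 displays as '…'; with strict ISO-8859-1 the second character would be an invisible control character instead. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# Sketch: how u'\xc5land Islands' turns into 'Ã…land Islands' and back.

original = u"\xc5land Islands"         # Åland Islands, one code point for Å

utf8_bytes = original.encode("utf-8")  # Å is now the two bytes \xc3\x85

# Misreading: each UTF-8 byte is treated as its own character.
garbled = utf8_bytes.decode("cp1252")  # assumption: cp1252 is the wrong codec in play

# Repair: undo the wrong decode, then decode with the right codec.
repaired = garbled.encode("cp1252").decode("utf-8")
```

The repair step only works when the wrong codec is known, which is why finding where the misreading happens is the better fix.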
FIRST EDIT:
To set the encoding for your app, add # -*- coding: utf-8 -*- at the top of your py file and <meta charset="UTF-8"> in the <head> of your template.
And to get unicode string from a django.utils.functional.proxy object you can call unicode(). Like this:
country_label = unicode(fields.Country(form.cleaned_data.get('country')[0:2]).name)
SECOND EDIT:
One other way to figure out where the problem is would be to use force_bytes (https://docs.djangoproject.com/en/1.8/ref/utils/#module-django.utils.encoding) Like this:
from django.utils.encoding import force_bytes
country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name
forced_country_label = force_bytes(country_label, encoding='utf-8', strings_only=False, errors='strict')
But since you already tried many conversions without success, maybe the problem is more complex. Can you share your version of django_countries, Python and your django app language settings?
What you can also do is look directly in your django_countries package (it should be in your Python directory), find the file data.py and open it to see what it looks like. Maybe the data itself is corrupted.
try:
from __future__ import unicode_literals #Place as first import.
AND / OR
country_label = fields.Country(form.cleaned_data.get('country')[0:2]).name.encode('latin1').decode('utf8')
Just this week I encountered a similar encoding error. I believe the problem is that the machine's encoding differs from the one Python uses. Try adding this to your .bashrc or .zshrc:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
Then, open up a new terminal and run the Django app again.

Is there an easy way to make unicode work in python?

I'm trying to deal with unicode in python 2.7.2. I know there is the .encode('utf-8') thing but 1/2 the time when I add it, I get errors, and 1/2 the time when I don't add it I get errors.
Is there any way to tell python - what I thought was an up-to-date & modern language to just use unicode for strings and not make me have to fart around with .encode('utf-8') stuff?
I know... python 3.0 is supposed to do this, but I can't use 3.0 and 2.7 isn't all that old anyways...
For example:
url = "http://en.wikipedia.org//w/api.php?action=query&list=search&format=json&srlimit=" + str(items) + "&srsearch=" + urllib2.quote(title.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
Update
If I remove all my .encode statements from all my code and add # -*- coding: utf-8 -*- to the top of my file, right under the #!/usr/bin/python then I get the following, same as if I didn't add the # -*- coding: utf-8 -*- at all.
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py:1250: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
return ''.join(map(quoter, s))
Traceback (most recent call last):
File "classes.py", line 583, in <module>
wiki.getPage(title)
File "classes.py", line 146, in getPage
url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=json&rvprop=content&rvlimit=1&titles=" + urllib2.quote(title)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1250, in quote
return ''.join(map(quoter, s))
KeyError: u'\xf1'
I'm not manually typing in any strings; I'm parsing HTML and JSON from websites. So the scripts/bytestreams/whatever they are, are all created by python.
Update 2: I can move the error along, but it just keeps coming up in new places. I was hoping python would be a useful scripting tool, but it looks like after 3 days of no luck I'll just try a different language. It's a shame, since python is preinstalled on OS X. I've marked as correct the answer that fixed the one instance of the error I posted.
This is a very old question but just wanted to add one partial suggestion. While I sympathise with the OP's pain - having gone through it a lot myself - here's one (partial) answer to make things "easier". Put this at the top of any Python 2.7 script:
from __future__ import unicode_literals
This will at least ensure that your own literal strings default to unicode rather than str.
There is no way to make unicode "just work" apart from using unicode strings everywhere and immediately decoding any encoded string you receive. The problem is that you MUST ALWAYS keep straight whether you're dealing with encoded or unencoded data, or use tools that keep track of it for you, or you're going to have a bad time.
Python 2 does some things that are problematic for this: it makes str the "default" rather than unicode for things like string literals, it silently coerces str to unicode when you add the two, and it lets you call .encode() on an already-encoded string to double-encode it. As a result, there are a lot of python coders and python libraries out there that have no idea what encodings they're designed to work with, but are nonetheless designed to deal with some particular encoding since the str type is designed to let the programmer manage the encoding themselves. And you have to think about the encoding each time you use these libraries since they don't support the unicode type themselves.
In your particular case, the first error tells you you're dealing with encoded UTF-8 data and trying to double-encode it, while the 2nd tells you you're dealing with UNencoded data. It looks like you may have both. You should really find and fix the source of the problem (I suspect it has to do with the silent coercion I mentioned above), but here's a hack that should fix it in the short term:
encoded_title = title
if isinstance(encoded_title, unicode):
    encoded_title = title.encode('utf-8')
If this is in fact a case of silent coercion biting you, you should be able to easily track down the problem using the excellent unicode-nazi tool:
python -Werror -municodenazi myprog.py
This will give you a traceback right at the point unicode leaks into your non-unicode strings, instead of leaving you to troubleshoot this exception way down the road from the actual problem. See my answer on this related question for details.
Yes, define your unicode data as unicode literals:
>>> u'Hi, this is unicode: üæ'
u'Hi, this is unicode: üæ'
You usually want to use \uxxxx unicode escapes or set a source code encoding. The following line at the top of your module, for example, sets the encoding to UTF-8:
# -*- coding: utf-8 -*-
Read the Python Unicode HOWTO for the details, such as default encodings and such (the default source code encoding, for example, is ASCII).
As for your specific example, your title is not a Unicode literal but a python byte string, and python is trying to decode it to unicode for you just so you can encode it again. This fails because the default codec for such implicit conversions is ASCII:
>>> 'å'.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Encoding only applies to actual unicode strings, so a byte string needs to be explicitly decoded:
>>> 'å'.decode('utf-8').encode('utf-8')
'\xc3\xa5'
If you are used to Python 3, then unicode literals in Python 2 (u'') are the new default string type in Python 3, while regular (byte) strings in Python 2 ('') are the same as bytes objects in Python 3 (b'').
If you have errors both with and without the encode call on title, you have mixed data. Test the title and encode as needed:
if isinstance(title, unicode):
    title = title.encode('utf-8')
You may want to find out what produces the mixed unicode / byte string titles though, and correct that source to always produce one or the other.
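As a sketch of that "encode as needed" step applied to the URL-quoting case from the question: the helper name to_utf8_bytes is my own, not part of any library, and the import fallback is only there so the snippet also runs on Python 3.

```python
# -*- coding: utf-8 -*-
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib2 import quote       # Python 2, as used in the question


def to_utf8_bytes(value):
    """Return UTF-8 bytes for text; pass byte strings through untouched.

    Hypothetical helper: it assumes any byte string it receives is
    already UTF-8 encoded.
    """
    if isinstance(value, bytes):    # on Python 2, bytes is str
        return value
    return value.encode("utf-8")


title = u"espa\xf1ol"               # text, e.g. parsed from JSON
quoted = quote(to_utf8_bytes(title))

# An already-encoded title goes through the same call without being
# encoded a second time.
quoted_again = quote(to_utf8_bytes(title.encode("utf-8")))
```

Funnelling every value through one helper at the output boundary keeps the type check in a single place instead of scattering .encode() calls around.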
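Be sure that title in your title.encode("utf-8") is of type unicode, and don't use str("İŞşĞğÖöÜü").
Use unicode("ĞğıIİiÖöŞşcçÇ", "utf-8") in your stringifiers (without the encoding argument, Python 2 falls back to ASCII and fails on these characters).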
Actually, the easiest way to make Python work with unicode is to use Python 3, where everything is unicode by default.
Unfortunately, there are not many libraries written for P3, as well as some basic differences in coding & keyword use. That's the problem I have: the libraries I need are only available for P 2.7, and I don't know enough to convert them to P 3. :(

django + unicode constant errors

I built a django site last year that utilises both a dashboard and an API for a client.
They are, on occasion, putting unicode information (usually via a Microsoft keyboard and a single quote character!) into the database.
It's fine to change this one instance for everything, but what I constantly get is something like this error when a new character is added that I haven't "converted":
UnicodeDecodeError at /xx/xxxxx/api/xxx.json
'ascii' codec can't decode byte 0xeb in position 0: ordinal not in range(128)
The issue is actually that I need to be able to convert this unicode (from the model) into HTML.
# if a char breaks the system, replace it here (duplicate line)
text = unicode(str(text).replace('\xa3', '&pound;'))
I duplicate this line here, but it just breaks otherwise.
Tearing my hair out because I know this is straight forward and I'm doing something remarkably silly somewhere.
Have searched elsewhere and realised that while my issue is not new, I can't find the answer elsewhere.
I assume that text is unicode (which seems a safe assumption, as \xa3 is the unicode for the £ character).
I'm not sure why you need to encode it at all, seeing as the text will be converted to utf-8 on output in the template, and all browsers are perfectly capable of displaying that. There is likely another point further down the line where something (probably your code, unfortunately) is assuming ASCII, and the implicit conversion is breaking things.
In that case, you could just do this:
text = text.encode('ascii', 'xmlcharrefreplace')
which converts the non-ASCII characters into HTML/XML entities like &#163;.
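A quick sketch of what that error handler produces (the sample string is made up):

```python
# -*- coding: utf-8 -*-
price = u"\xa3100"  # £100 as a unicode string

# Characters outside ASCII are replaced by numeric character references.
html_safe = price.encode("ascii", "xmlcharrefreplace")

# Plain ASCII input passes through unchanged.
unchanged = u"100".encode("ascii", "xmlcharrefreplace")
```

The result is a pure-ASCII byte string, so any later code that assumes ASCII will no longer choke on it.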
Tell the JSON-decoder that it shall decode the json-file as unicode. When using the json module directly, this can be done using this code:
json.JSONDecoder(encoding='utf8').decode(
json.JSONEncoder(encoding='utf8').encode('blä'))
If the JSON decoding takes place via some other modules (django, ...) maybe you can pass the information through this other module into the json stuff.

Compile Syntax Error: non ASCII letters in a string

I have a python file that contains a long string of HTML. When I compile & run this file/script I get this error:
SyntaxError: Non-ASCII character '\x92' in file C:\Users...\GlobalVars.py on line 2509, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
I have followed the instructions and gone to the url suggested. But putting something like this at the top of my script still doesn't work:
#!/usr/bin/python
# -*- coding: latin-1 -*-
What do you think I can do to stop this compiler error from occurring?
First, in order to prevent problems like the one specified in the question, you should never use any encoding other than utf-8 for python source code.
This is the correct header to use
#! /usr/bin/env python
# -*- coding: utf-8 -*-
Now you have to convert the file from whatever encoding you may have to utf-8, probably your current text editor is able to do that.
If you wonder why I say this remember that it is impossible for a text editor to safely guess your non-unicode encoding because there is no BOM for non-unicode. For this reason most decent editors are using UTF-8 as default even when encoding is not specified. And BTW, the encoding specified in the python file header is for Python only, most editors ignore what you wrote there.
Also, as you can see Python is trying to decode a character above 128 using ASCII (not latin-1), this is supposed to fail. I am not sure why this happens but I don't even care too much because there is a much better way to solve the problem.
It must be at the top of the script that has the non-ASCII text, and it must match the actual encoding of the file. \x92 is CP1252, not Latin-1.
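To illustrate the point about \x92, here is a small sketch of how that one byte reads under the three encodings mentioned in this thread (variable names are mine):

```python
# -*- coding: utf-8 -*-
byte = b"\x92"

as_cp1252 = byte.decode("cp1252")   # right single quotation mark, U+2019
as_latin1 = byte.decode("latin-1")  # U+0092, an invisible C1 control character

try:
    byte.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False              # a lone 0x92 is not valid UTF-8
```

A coding: latin-1 declaration therefore lets the file compile, but the character you get is not the apostrophe you see in a Windows editor.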
If you just want to get rid of this error without getting into the details (which you can get from the other answers on this page), you can do the following:
1) Copy your code and paste it in Notepad++
2) Select Encoding -> Encode in UTF-8
3) Select View -> Show Symbol -> Show All Characters
Now you can see which symbol is causing the issue (x92 will be visible). Replace or remove it to solve the problem.
Found this and hope it's helpful to the next person:
http://www.sitepoint.com/forums/showthread.php?567734-Anyone-know-what-this-error-means
Code point 0x92 (146 decimal) is the right single quotation mark, or
apostrophe (’) in Windows-1252. It's an invalid character in ISO 8859
and in UTF-8, since the 0x80-0x9F range is reserved for C1 control
characters.
Not sure if I'm busting copyright. If so please remove the blockquote.
The encoding declaration indicates that you think the file is in latin-1 encoding, but the python interpreter is finding that a char at or very near line 2509 in GlobalVars.py that is not what you think it is.
You should first confirm the encoding of GlobalVars.py. Is it really latin-1?
Next, you should check the characters near line 2509. Are they also latin-1, or were they cut and pasted from a web page or somewhere else (maybe there are UTF-8 chars mixed up in there)?
If you have chars in your source file that aren't what you think they are, then you may need to clean up the file before going any further.
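One way to do that check without an editor is to scan the file's raw bytes. This is a rough sketch; find_non_ascii is a name I made up, and the column it reports is a zero-based byte offset within the line.

```python
import io


def find_non_ascii(path):
    """Return (line_number, column, byte_value) for each non-ASCII byte."""
    hits = []
    # Binary mode: we want the bytes exactly as stored, with no decoding.
    with io.open(path, "rb") as handle:
        for line_number, line in enumerate(handle, 1):
            # bytearray yields integers on both Python 2 and Python 3.
            for column, byte in enumerate(bytearray(line)):
                if byte > 0x7F:
                    hits.append((line_number, column, byte))
    return hits
```

Calling find_non_ascii on the offending file (GlobalVars.py in the question) and printing each tuple, for example with hex() on the byte value, shows which bytes are present and so which encoding the file is most likely in.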
add these lines on top of your code
#! /usr/bin/env python
# -*- coding: utf-8 -*-
An easy workaround solution if your file is really in latin-1 is to change the html string with its representation.
Afaik:
\x92 => 146 in decimal => Æ => &AElig;
If your character is not Æ, then your file is not encoded in latin-1 ;-) (and you might want to check whether utf-8/cp1252 works better as a quick win)
EDIT:
Of course, you want to check your ACTUAL file encoding before trying. I might be wrong, not 100% sure \x92 is Æ in Iso8859-1 : according to this page, it doesn't seem defined.

Python Unicode CSV export (using Django)

I'm using a Django app to export a string to a CSV file. The string is a message that was submitted through a front end form. However, I've been getting this error when a unicode single quote is provided in the input.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019'
in position 200: ordinal not in range(128)
I've been trying to convert the unicode to ascii using the code below, but still get a similar error.
UnicodeEncodeError: 'ascii' codec can't encode characters in
position 0-9: ordinal not in range(128)
I've sifted through dozens of websites and learned a lot about unicode, however, I'm still not able to convert this unicode to ascii. I don't care if the algorithm removes the unicode characters. The commented lines indicate some various options I've tried, but the error persists.
import csv
import unicodedata
...
#message = unicode( unicodedata.normalize(
# 'NFKD',contact.message).encode('ascii','ignore'))
#dmessage = (contact.message).encode('utf-8','ignore')
#dmessage = contact.message.decode("utf-8")
#dmessage = "%s" % dmessage
dmessage = contact.message
csv_writer.writerow([
    dmessage,
])
Does anyone have any advice on removing unicode characters so I can export them to CSV? This seemingly easy problem has kept my head spinning. Any help is much appreciated.
Thanks,
Joe
You can't encode the Unicode character u'\u2019' (U+2019 Right Single Quotation Mark) into ASCII, because ASCII doesn't have that character in it. ASCII is only the basic Latin alphabet, digits and punctuation; you don't get any accented letters or ‘smart quotes’ like this character.
So you will have to choose another encoding. Now normally the sensible thing to do would be to export to UTF-8, which can hold any Unicode character. Unfortunately for you, if your target users are using Office (and they probably are), they're not going to be able to read UTF-8-encoded characters in CSV. Instead Excel will read the files using the system default code page for that machine (also misleadingly known as the ‘ANSI’ code page), and end up with mojibake like â€™ instead of ’.
So that means you have to guess the user's system default code page if you want the characters to show up correctly. For Western users, that will be code page 1252. Users with non-Western Windows installs will see the wrong characters, but there's nothing you can do about that (other than organise a letter-writing campaign to Microsoft to just drop the stupid nonsense with ANSI already and use UTF-8 like everyone else).
Code page 1252 can contain U+2019 (’), but obviously there are many more characters it can't represent. To avoid getting UnicodeEncodeError for those characters you can use the ignore argument (or replace to replace them with question marks).
dmessage= contact.message.encode('cp1252', 'ignore')
alternatively, to give up and remove all non-ASCII characters, so that everyone gets an equally bad experience regardless of locale:
dmessage= contact.message.encode('ascii', 'ignore')
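To compare the options side by side, here is a small sketch with a made-up message that contains characters cp1252 can represent (’ and €) and some it cannot (two CJK characters):

```python
# -*- coding: utf-8 -*-
message = u"It\u2019s 20\u20ac, \u4e2d\u6587"

# cp1252 keeps the quote and the euro sign; the CJK characters are
# dropped with 'ignore' or turned into '?' with 'replace'.
cp1252_ignore = message.encode("cp1252", "ignore")
cp1252_replace = message.encode("cp1252", "replace")

# ASCII with 'ignore' drops every non-ASCII character.
ascii_ignore = message.encode("ascii", "ignore")
```

Which variant to hand to csv_writer.writerow is a trade-off between readable output for Western Excel users and uniform (lossy) output for everyone.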
Encoding is a pain, but if you're working in django have you tried smart_unicode(str) from django.utils.encoding? I find that usually does the trick.
The only other option I've found is to use the built-in python encode() and decode() for strings, but you have to specify the encoding for those and honestly, it's a pain.
[caveat: I'm not a djangoist; django may have a better solution].
General non-django-specific answer:
If you have a smallish number of known non-ASCII characters and there are user-acceptable ASCII equivalents for them, you can set up a translation table and use the unicode.translate method:
smashcii = {
    0x2019 : u"'",
    # etc
    }
smashed = input_string.translate(smashcii)
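A slightly fuller sketch of the same idea, extending the table above with a few more entries of my own choosing and dropping whatever is still non-ASCII afterwards (the function name smash is hypothetical):

```python
# -*- coding: utf-8 -*-
smashcii = {
    0x2018: u"'",  # left single quotation mark
    0x2019: u"'",  # right single quotation mark
    0x201C: u'"',  # left double quotation mark
    0x201D: u'"',  # right double quotation mark
    0x2013: u"-",  # en dash
}


def smash(text):
    """Map known characters to ASCII stand-ins, then drop the rest."""
    return text.translate(smashcii).encode("ascii", "ignore")


smashed = smash(u"\u201cIt\u2019s fine\u201d \u2013 caf\xe9")
```

Characters without an entry in the table, such as the é here, are silently removed by the 'ignore' handler, so add mappings for anything you want to keep in some form.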
