I have a model that has a regular text field, that needs to be able to accept user pasted text data into it which may contain scientific symbols, specifically lowercase delta δ. The users will be entering the data through the model admin.
I'm using a mysql backend, and the encoding is set to Latin-1. Changing the DB encoding is not an option for me.
what i would like to do, for simplicity's sake, is have the admin form scrub the input text, much like sanitation or validation, but to escape the characters such as δ to their HTML representation,so that i can store them in the DB without having to convert to Unicode and then back again.
What utilities are available to do this? I've looked at escape() and conditional_escape(), but they do not seem to do what i want them to (not escape the special characters) and the django.utils.encoding.force_text() will encode everything, but my data will render as its Unicode representation if i do that.
The site runs on django 1.10 and python 2.7.x
any help or thoughts are much appreciated.
As part of the save method or view that receives the request.POST data, you can escape it, encode it to ascii with xmlcharrefreplace, and then decode it back from bytes to a string:
raw_str = "this is a string with δ problematic chars"
result = html.escape(raw_str).encode("ascii", "xmlcharrefreplace").decode()
print(result) # 'this is a string with δ problematic chars'
Gets the job done since you can't change the encoding, though not nearly as clean as just getting to live in UTF-8. Good luck!
Related
I'm writing SQL to a file on a server this way:
import codecs
f = codecs.open('translate.sql',mode='a',encoding='utf8',errors='strict')
and then writing SQL statements like this:
query = (u"""INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(%s,#last_story_id,%s,'%s');
""" % (kw.get('to'), lookup.get(q), kw.get(q)))
f.write(query)
I have confirmed that the text was okay when I pulled it. Here is the data from the dictionary (kw) passed out to a webpage:
46:埼玉県
47:熊谷市
42:お散歩デモ
It appears correct (I want it to be utf8 escaped).
But the file.write output is garbage (encoding problems):
INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(279,#last_story_id,62,'ãã©ã³ãã£ã¢ããã'); )
/* updating the story text on old story_id */
UPDATE story_question_response
SET answer = '大å¦ã®ããã·ã§ã¯ãã¦å¦çãæ¬å¤§éç½ã®è¢«ç½å°(岩æçã®å¤§è¹æ¸¡å¸)ã«æ´¾é£ãããããã¦ã¯ç¾å°ã®å¤ç¥ãã®ãæ$
WHERE story_id = 65591
AND question_id = 41
AND group_id = 276;
using an explicit decode gives an error:
f.write(query.decode('utf8'))
I don't know what else to try.
Question: What am I doing wrong, in writing a utf8 file?
We don't have enough information to be sure, but I'd give decent odds that your file is actually perfectly valid UTF-8, and you're just viewing it as if it were something else.
For example, on Windows, if you open a file in Notepad, by default, it will only treat it as UTF-8 if it starts with a UTF-8 BOM (which no valid file ever should, but Microsoft likes them anyway); otherwise, it will treat it as whatever your default code page is. Which is probably some Latin-1 derivative like CP1252.
So, your string of kana and kanji ends up encoded as a bunch of three-byte UTF-8 sequences like '\xe6\xad\xa9'. Then, that gets displayed in Notepad as whatever each of those bytes happen to mean in CP1252, like æ© (note that there's an invisible character between the two visible ones).
As a general rule, whenever you see weirdly-accented versions of lowercase A and E every 2 or 3 characters, that almost always means you've interpreted some CJK UTF-8 as some Latin-1-derived character set, because UTF-8 uses \xE3 through \xED as the prefix bytes for most CJK characters, and Latin-1 has accented lowercase A and E characters in that range. (Similarly, weirdly-accented capital A versions usually mean European or symbolic UTF-8 interpreted as Latin-1, especially when you've got stray Âs inserted into what looks like otherwise valid or almost-valid European text. If you look at the charts, you should be able to tell why.)
Assuming your input is utf8, you should probably use the following code to generate the query:
query = (u"""INSERT INTO story_question_response
(group_id, story_id, question_id, answer )
VALUES
(%s,#last_story_id,%s,'%s');
""" % (kw.get('to').decode('utf8'), lookup.get(q).decode('utf8'), kw.get(q).decode('utf8')))
I would also suggest trying to output the contents of kw and lookup to some log file to debug this issue.
You should use encode on objects of class unicode, and decode on objects of class str in python.
You should escape any string you insert into SQL statement to prevent nasty SQL injections.
The code above doesn't include such escaping, so be careful.
I built a django site last year that utilises both a dashboard and an API for a client.
They are, on occasion, putting unicode information (usually via a Microsoft keyboard and a single quote character!) into the database.
It's fine to change this one instance for everything, but what I constantly get is something like this error when a new character is added that I haven't "converted":
UnicodeDecodeError at /xx/xxxxx/api/xxx.json
'ascii' codec can't decode byte 0xeb in position 0: ordinal not in range(128)
The issue is actually that I need to be able to convert this unicode (from the model) into HTML.
# if a char breaks the system, replace it here (duplicate line)
text = unicode(str(text).replace('\xa3', '£'))
I duplicate this line here, but it just breaks otherwise.
Tearing my hair out because I know this is straight forward and I'm doing something remarkably silly somewhere.
Have searched elsewhere and realised that while my issue is not new, I can't find the answer elsewhere.
I assume that text is unicode (which seems a safe assumption, as \xa3 is the unicode for the £ character).
I'm not sure why you need to encode it at all, seeing as the text will be converted to utf-8 on output in the template, and all browsers are perfectly capable of displaying that. There is likely another point further down the line where something (probably your code, unfortunately) is assuming ASCII, and the implicit conversion is breaking things.
In that case, you could just do this:
text = text.encode('ascii', 'xmlcharrefreplace')
which converts the non-ASCII characters into HTML/XML entities like £.
Tell the JSON-decoder that it shall decode the json-file as unicode. When using the json module directly, this can be done using this code:
json.JSONDecoder(encoding='utf8').decode(
json.JSONEncoder(encoding='utf8').encode('blä'))
If the JSON decoding takes place via some other modules (django, ...) maybe you can pass the information through this other module into the json stuff.
I am not able to save special characters, such as θæŋ after making re.search.
I am saving to the Django model Textfield. in Admin page, Instead of θæŋkfəli, i am getting
\xce\xb8\xc3\xa6\xc5\x8bkf\xc9\x99li
Is it error of re.search?
Is it error of Admin?
Am i saving wrongly?
How to search for needed part of string an save it in the Textfield of the model with 'θæŋ' characters?
lines='title="Listen to audio" /></a><span class="pr">/<span class="unicode">ˈ</span>θæŋkfəli/</span> <span class="fl">adverb' #the string which i wan to save exactly as it is, Django saves it correctly
liness=smart_str(lines, encoding='utf-8', strings_only=False, errors='replace') # saves correctly
linesu=smart_unicode(lines, encoding='utf-8', strings_only=False, errors='replace') # saves correctly
After trying to seach part of string θæŋkfəli Django does not save it in needed special characters. Instead of θæŋkfəli, I am getting "\xce\xb8\xc3\xa6\xc5\x8bkf\xc9\x99li"
stryc=re.compile('<span\s*class=\"pr\">\s*/\s*<span\s*class="unicode\">(?P<Pronun>.*)<span\s*class=\"fl\">', re.DOTALL)
#\s+/\s+<span class=\"unicode\">\s+[\\a-zA-Z0-9\s]+/\s+</span> '
strys=re.search(stryc, linesu)
Pronun=stryWordcs.groups('Pronun')
text=Pronun.encode('utf-8') # does not covert unicode to letters
Pronun=smart_str(Pronun, encoding='utf-8', strings_only=False, errors='replace') # also does not covert unicode to letters
a=Pronunciation(field=Pronun) # or field=text
a.save()
# Pronun= "θæŋkfəli", nevertheless it is saved as \xce\xb8\xc3\xa6\xc5\x8bkf\xc9\x99li or in unicode
if i do not use smart_str, i am getting "\u03b8\xe6\u014bkf\u0259li"
if i try to search in lines or liness i am not able to find θæŋkfəli due special character ˈ (small stick on top =\xcb\x88 = \u02c8)
regards,
gintare
When you see "\xce\xb8\xc3\xa6\xc5\x8bkf\xc9\x99li, what you're seeing is in fact the same as θæŋkfəli, but in hexadecimal notation. Similarly, you could represent the same set of characters as
U+03B8 U+00E6 U+014BkfU+0259li
This is because the text is stored as a unicode string. To see that this is actually the same, try copying your text (with the special characters) into the top box in this conversion website and hitting convert. Python is able to handle unicode but depending on what you're using to display the characters it's going to come out differently.
I'm not entirely sure what your question is. If you're concerned about your regular expression being able to accurately match the unicode characters, python's re module has an option to make it work differently with unicode.
If you're concerned about how this text is being displayed, this is going to vary based on how you're trying to display it, and you'll need to be more specific about your problem.
I can assure you though, that Django is storing your string just fine.
I have a list that I need to send through a URL to a third party vendor. I don't know what language they are using.
The list prints out like this:
[u'1', u'6', u'5']
I know that the u encodes the string in utf-8 right? So a couple of questions.
Can I send a list through a URL?
Will the u's show up on the other end when going through the URL?
If so, how do I remove them?
I am not sure what keywords to search to help me out, so any resources would be helpful too.
Can I send a list through a URL?
No. A URL is just text. If you want a way to package structured information in it, you'll have to agree that with the provider you're talking to.
One standard encoding for structure in URLs, that might or might not be what you need, is the use of multiple parameters with the same name in a query string. This format comes from HTML form submissions:
http://www.example.com/script?par=1&par=6&par=5
might be considered to represent a parameter par with a three-item list as its value. Or maybe not, it's up to the receiver to decide. For example in a PHP application you would have had to name the parameter par[] to get it to accept the array value.
I know that the u encodes the string in utf-8 right?
No. a u'...' string is a native Unicode string, where each index represents a whole character and not a byte in any particular encoding. If you want UTF-8 bytes, write u'...'.encode('utf-8') before URL-encoding. UTF-8 is a good default choice, but again: what encoding the receiving side wants is up to that application.
Will the u's show up on the other end when going through the URL?
u is part of the literal representation of the string, just the same as the ' quotes themselves. They are not part of the string value and would not be echoed by print or when joined into other strings, unless you deliberately asked for the literal representation by calling repr.
u'' is not utf-8, its python unicode strings for python 2.x
To send it through url, you need to encode them with utf8 like .encode('utf-8'), and also need to urlencode, and list cannot send it through URL, you need to make it as string.
Basically, you need to do it in following steps
python list -> unicode string -> utf8 string -> url encode -> send it through proper urllib api
Incorrect. unicode literals use Python's internal encoding, decided when it was compiled.
You can't send anything "through" URLs. Pick a protocol instead. And encode before sending, probably to UTF-8.
The user entered the word
éclair
into the search box.
Showing results 1 - 10 of about 140 for �air.
Why does it show the weird question mark?
I'm using Django to display it:
Showing results 1 - 10 of about 140 for {{query|safe}}
It's an encoding problem. Most likely your form or the output page is not UTF-8 encoded.
This article is very good reading on the issue: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
You need to check the encoding of
the HTML page where the user input the word
the HTML page you are using to output the word
the multi-byte ability of the functions you use to work with the string (though that probably isn't a problem in Python)
If the search is going to apply to a data base, you will need to check the encoding of the database connection, as well as the encoding of your tables and columns.
This is the result when you interpret data that is not encoded in UTF-8 as UTF-8 encoded.
The interpreter expects from the code point of your first character of the word éclair a multibyte encoded character with a length of three characters, consumes the next two characters but can’t decode it (probably invalid byte sequence). For this case the REPLACEMENT CHARACTER � (U+FFFD) is shown.
So in your case you just need to really encode your data with UTF-8.
You are serving the page with the wrong character encoding (charset). Check that you are using the same encoding throughout all your application (for example UTF-8). This includes:
HTTP headers from web server (Content-Type: text/html;charset=utf-8)
Communication with database (i.e SET NAMES 'utf-8')
It would also be good to check your browser encoding setting.
I second the responses above. Some other things from the top of my head:
If you're using e.g. MySQL database, then it could be good to create your database using:
CREATE DATABASE x CHARACTER SET UTF8
You can also check this: http://docs.djangoproject.com/en/dev/ref/settings/#default-charset