I'm trying to compare two strings: the first one, s1, comes from mongoengine, and the second one, s2, comes from a Django HTTP request.
They look like this:
>>> s1 = product_model.Product.objects.get(pk=1).name
>>> s1
u'Product \xe4 asdf'
>>> s2 = request.POST['name']
>>> s2
'Product \xc3\xa4 asdf'
They contain the same letter, the Swedish 'ä', but mongoengine's (s1) is a Python unicode string and Django's (s2) is a Python bytestring containing UTF-8 encoded characters.
I can easily solve this by, e.g., converting the Python unicode string to a bytestring:
>>> s1.encode('utf-8') == s2
True
But I would think that the best practice is to have all Python strings in my system encoded the same way, correct?
How can I tell Django to use Python unicode strings instead? Or how can I tell MongoEngine to use UTF-8 encoded Python bytestrings?
The Django docs say:
General string handling
Whenever you use strings with Django – e.g., in database lookups,
template rendering or anywhere else – you have two choices for
encoding those strings. You can use Unicode strings, or you can use
normal strings (sometimes called “bytestrings”) that are encoded using
UTF-8.
In Python 3, the logic is reversed, that is normal strings are
Unicode, and when you want to specifically create a bytestring, you
have to prefix the string with a ‘b’. As we are doing in Django code
from version 1.5, we recommend that you import unicode_literals from
the __future__ library in your code. Then, when you specifically want
to create a bytestring literal, prefix the string with ‘b’.
Python 2 legacy:
my_string = "This is a bytestring"
my_unicode = u"This is an Unicode string"
Python 2 with unicode literals or Python 3:
from __future__ import unicode_literals
my_string = b"This is a bytestring"
my_unicode = "This is an Unicode string"
If you are in Python 2, you can try that. As I said in the comment:
I would not suggest working with encoded strings. As these slides say
(farmdev.com/talks/unicode): "Decode early, Unicode everywhere, encode
late". So I would suggest you tell Django to use unicode strings,
but I am not a Django expert, sorry. My approach: s1 ==
s2.decode("utf8"), so you have both Unicode strings to work with
Hope it works
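The "decode early, Unicode everywhere, encode late" rule could look roughly like this. It is only a minimal sketch in Python 3 syntax, not Django or mongoengine code: the byte string stands in for data that arrived undecoded, and the helper name to_text is hypothetical.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


s1 = "Product \xe4 asdf"        # text, as returned by the database layer
s2 = b"Product \xc3\xa4 asdf"   # UTF-8 bytes, as received from outside

# Decode at the boundary, then compare text with text.
same = to_text(s1) == to_text(s2)

# Encode again only when the data leaves the program.
encoded_on_output = to_text(s2).encode("utf-8")
```

With this approach the rest of the code never has to know which side originally supplied bytes.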
EDIT: I suppose you are using Django's HttpRequest, so from the docs:
HttpRequest.encoding
A string representing the current encoding used
to decode form submission data (or None, which means the
DEFAULT_CHARSET setting is used). You can write to this attribute to
change the encoding used when accessing the form data. Any subsequent
attribute accesses (such as reading from GET or POST) will use the new
encoding value. Useful if you know the form data is not in the
DEFAULT_CHARSET encoding.
I have a list of variables with unicode characters, some of them for chemicals such as ozone gas, e.g. 'O\u2083'. All of them are stored in a SQLite database, which is read by Python code to produce O3. However, when I read the value I get 'O\\u2083'. The SQLite database is created from a CSV file that contains the string 'O\u2083', among others. I understand that \u2083 is not being stored in the SQLite database as a single unicode character but as 6 unicode characters (which would be \,u,2,0,8,3). Is there any way to recognize unicode characters in this context? My first idea is to write a function that recognizes such sets of characters and replaces them with the unicode characters. Is there anything like this already implemented?
SQLite allows you to read/write Unicode text directly. u'O\u2083' is two characters u'O' and u'\u2083' (your question has a typo: 'u\2083' != '\u2083').
I understand that u\2083 is not being stored in sqlite database as unicode character but as 6 unicode characters (which would be u,\,2,0,8,3)
Don't confuse u'u\2083' and u'\u2083': the latter is a single character while the former is 4-character sequence: u'u', u'\x10' ('\20' is interpreted as octal in Python), u'8', u'3'.
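The difference is easy to check interactively. A small sketch in Python 3 syntax (where every plain literal is already a Unicode string); the variable names are just for illustration:

```python
four_chars = "u\2083"   # 'u', the octal escape \20 (U+0010), then '8' and '3'
one_char = "\u2083"     # a single code point, SUBSCRIPT THREE

lengths = (len(four_chars), len(one_char))
pieces = list(four_chars)
```

Here lengths is (4, 1), and pieces shows the stray control character that the octal escape produced.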
If you save a single Unicode character u'\u2083' into a SQLite database, it is stored as a single Unicode character (the internal representation of Unicode inside the database is irrelevant as long as the abstraction holds).
On Python 2, if there is no from __future__ import unicode_literals at the top of the module then 'abc' string literal creates a bytestring instead of a Unicode string -- in that case both 'u\2083' and '\u2083' are sequences of bytes, not text characters (\uxxxx is not recognized as a unicode escape sequence inside bytestrings).
If you have a byte string (length 7), decode the Unicode escape.
>>> s = 'O\u2083'
>>> len(s)
7
>>> s
'O\\u2083'
>>> print(s)
O\u2083
>>> u = s.decode('unicode-escape')
>>> len(u)
2
>>> u
u'O\u2083'
>>> print(u)
O₃
Caveat: Your console/IDE used to print the character needs to use an encoding that supports the character or you'll get a UnicodeEncodeError when printing. The font must support the symbol as well.
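On Python 3 a str has no decode method, so the same idea needs one extra step through bytes. A minimal sketch, assuming the stored text is plain ASCII apart from the \uXXXX escapes:

```python
s = "O\\u2083"   # seven characters: O, backslash, u, 2, 0, 8, 3

# Go through bytes first, then let the unicode_escape codec
# interpret the backslash sequence.
u = s.encode("ascii").decode("unicode_escape")
```

If the text can contain non-ASCII characters already, the encode("ascii") step fails, which is a useful signal that the data is not in the form assumed here.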
It's important to remember everything is bytes. To pull bytes into something useful to you, you kind of have to know what encoding is used when you pull in data. There are too many ambiguous cases to determine encoding by analyzing the data. When you send data out of your program, it's all back out to bytes again. Depending on whether you're using Python 2.x or 3.x you'll have a very different experience with Unicode and Python.
You can, however, attempt encoding and simply do a "replace" on errors. For example, the_string.encode("utf-8","replace") will try to encode as utf-8 and will replace problems with a ? -- You could also anticipate problem characters and replace them beforehand, but that gets unmanageable quickly. Take a look at the codecs classes for more replacement options.
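To see the error handlers in action it helps to pick a target encoding that cannot represent the character, since UTF-8 can encode nearly everything. A short sketch in Python 3 syntax using ASCII as the target:

```python
text = "O\u2083"

replaced = text.encode("ascii", "replace")             # unencodable character becomes ?
as_entity = text.encode("ascii", "xmlcharrefreplace")  # becomes an XML character reference
as_escape = text.encode("ascii", "backslashreplace")   # becomes a backslash escape
ignored = text.encode("ascii", "ignore")               # character is silently dropped
```

Only the last three handlers in this list lose or transform data in a recognisable way; "replace" leaves no trace of what the original character was.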
Hello, I am experimenting with Python and lxml, and I am stuck on the problem of extracting data from a web page that contains windows-1250 characters like ž and ć.
tree = html.fromstring(new.text,parser=hparser)
title = tree.xpath('//strong[text()="Title"]')
opis[g] = opis[g].tail.encode('utf-8')[2:]
I get text responses containing something like this:
\xc2\x9ea
instead of characters. Then I have a problem storing the text in the database.
So how can I accomplish this? I tried putting 'windows-1250' instead of utf8, without success. Can I convert this back to the original characters somehow?
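Try:
text = "\xc2\x9ea"
# undo the UTF-8 step to recover the raw byte 0x9e, then read it as windows-1250
print text.decode('utf-8').encode('latin-1').decode('windows-1250').encode('utf-8')
Output:
ža
And save nice chars in your DB.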
If encoding to UTF-8 results in b'\xc2\x9ea', then that means the original string was '\x9ea'. Whether lxml didn't do things correctly, or something happened on your end (perhaps a parser configuration issue), the fact is that you get the equivalent of this (Python 3.x syntax):
>>> '\x9ea'.encode('utf-8')
b'\xc2\x9ea'
How do you fix it? One error-prone way would be to encode as something other than UTF-8 that can properly handle the characters. It's error-prone because while something might work in one case, it might not in another. You could instead map each character to its ordinal and work with the ordinals (formatted as two hex digits, so that values below 0x10 keep their leading zero):
>>> list(map((lambda n: format(n, '02x')), map(ord, '\x9ea')))
['9e', '61']
That gets us somewhere because the bytes type has a fromhex method that can decode a string containing hexadecimal values to the equivalent byte values:
>>> bytes.fromhex(''.join(map((lambda n: format(n, '02x')), map(ord, '\x9ea'))))
b'\x9ea'
You can use decode('cp1250') on the result of that to get ža, which I believe is the string you wanted. If you are using Python 2.x, the equivalent would be
from binascii import unhexlify
unhexlify(u''.join(map((lambda n: format(n, '02x')), map(ord, u'\x9ea'))))
Note that this is highly destructive as it forces all characters in a Unicode string to be interpreted as bytes. For this reason, it should only be used on strings containing Unicode characters that fit in a single byte. If you had something like '\x9e\u724b\x61', that code would result in joining ['9e', '724b', '61'] as '9e724b61', and interpreting that using a single-byte character set such as CP1250 would result in something like 'žrKa'.
For that reason, more reliable code would replace ord with a function that throws an exception if 0 <= ord(ch) < 0x100 is false, but I'll leave that for you to code.
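One possible way to write that check, as a rough sketch in Python 3 syntax. The function name reinterpret_bytes is made up for this example, and cp1250 is only the default because it is the encoding from the question:

```python
def reinterpret_bytes(text, encoding="cp1250"):
    """Treat each character of text as one raw byte and decode those bytes."""
    for ch in text:
        # Refuse anything that cannot be a single byte value.
        if not 0 <= ord(ch) < 0x100:
            raise ValueError("character %r does not fit in a single byte" % ch)
    return bytes(ord(ch) for ch in text).decode(encoding)


fixed = reinterpret_bytes("\x9ea")
```

A string such as '\x9e\u724b\x61' now raises ValueError instead of being silently turned into the wrong text. Building the bytes object directly from the ordinals also avoids the detour through a hex string.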
I have a JSON file like this:
{
'errNum': 0,
'retData': {
'city': "武汉"
}
}
import json
content = json.loads(result)  # assuming the JSON text is in a variable named result
cityname = content['retData']['city']
print cityname
After that, I got this output: \u6b66\u6c49
I know it's the unicode escape for the Chinese characters 武汉, but its type is str:
isinstance(cityname,str) is True.
So how can I convert this str to unicode so that the output will be 武汉?
I have also tried these solutions:
>>> u'\u6b66\u6c49'
u'\u6b66\u6c49'
>>> print u'\u6b66\u6c49'
武汉
>>> print '\u6b66\u6c49'.decode()
\u6b66\u6c49
>>> print '\u6b66\u6c49'
\u6b66\u6c49
I searched for information about ASCII, Unicode and UTF-8, and about encode and decode, but I still cannot understand it. It is crazy!
I need some help. Thanks!
Perhaps this answer comes five years too late, but since I had a similar issue while building a preprocessor for the Japanese language, here is the answer I found.
Note that ensure_ascii is a parameter of json.dumps, not json.loads, so add the flag when you serialize the content back out:
output = json.dumps(content, ensure_ascii=False)
This fixed my issue.
Your json contains escaped unicode characters. You can decode them into actual unicode characters using the unicode_escape codec:
print cityname.decode('unicode_escape')
Note that, while this will usually work, depending on the source of the unicode escaping you could have problems with characters outside the Basic Multilingual Plane (U+0000 to U+FFFF). A convenient quote from user @bobince, taken from a comment:
Note that ... there are a number of different formats that use \u
escapes - Python unicode literals (which unicode-escape handles), Java
properties, JavaScript string literals, JSON, and so on. It is
important to know which one you are dealing with because they all have
slightly different rules about what other escapes are valid.
unicode-escape may or may not be a valid way of parsing that data
depending on where it comes from.
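To illustrate the point about differing rules, here is a small sketch in Python 3 syntax that compares the JSON parser with the unicode_escape codec. The sample strings are made up for the example; the second one is a non-BMP character written the way JSON writes it, as a surrogate pair.

```python
import json

# Escapes inside valid JSON text are resolved by the parser itself.
raw = '{"retData": {"city": "\\u6b66\\u6c49"}}'
city = json.loads(raw)["retData"]["city"]

# A character outside the BMP, written as a JSON surrogate pair.
escaped = "\\ud83d\\ude00"
via_json = json.loads('"' + escaped + '"')                     # pair is combined
via_codec = escaped.encode("ascii").decode("unicode_escape")   # pair stays split
```

The JSON parser combines the pair into one code point, while unicode_escape yields two lone surrogates, so when the data really is JSON, parsing it as JSON is the safer choice.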
I have a unicode object like
x = u"a & 日本語: enči hallöle"
and want to convert it into a latin-1 string with html-entities like
"a &amp; &#26085;&#26412;&#35486;: en&#269;i hallöle"
The reason behind this is that I want my users to be able to enter unicode data, but my legacy database where I need to save my data only accepts latin-1 strings. (The "ö" should not be converted, but the other special characters must be converted.)
Any idea which module to use here? I searched through the encoding module, looked up some codecs, tried a bit with methods of unicode objects, but came to no sensible solution.
Use the "xmlcharrefreplace" option of unicode.encode, but note that it won't translate & to &amp; for you:
>>> x = "a & 日本語: enči hallöle".decode("utf-8")
>>> x.replace("&", "&amp;").encode("latin-1", "xmlcharrefreplace")
'a &amp; &#26085;&#26412;&#35486;: en&#269;i hall\xf6le'
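For reference, a sketch of the same conversion in Python 3 syntax, plus the reverse direction for when the data is read back. It assumes the stored value is later decoded as latin-1; html.unescape is the standard-library function that turns the character references back into text.

```python
import html

x = "a & 日本語: enči hallöle"

# Escape the ampersand first, then let the error handler turn everything
# latin-1 cannot represent into numeric character references.
encoded = x.replace("&", "&amp;").encode("latin-1", "xmlcharrefreplace")

# Reading back: decode the latin-1 bytes and resolve the references.
restored = html.unescape(encoded.decode("latin-1"))
```

The "ö" survives as the single latin-1 byte 0xF6, and restored is equal to the original x.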
Just encode your string to UTF-8; that should be safe.
>>> x.encode("UTF-8")
'a & \xc3\xa6\xc2\x97\xc2\xa5\xc3\xa6\xc2\x9c\xc2\xac\xc3\xa8\xc2\xaa\xc2\x9e: en\xc3\x84\xc2\x8di hall\xc3\x83\xc2\xb6le'
In my database, I have stored some UTF-8 characters. E.g. 'α' in the "name" field
Via Django ORM, when I read this out, I get something like
>>> p.name
u'\xce\xb1'
>>> print p.name
Î±
I was hoping for 'α'.
After some digging, I think if I did
>>> a = 'α'
>>> a
'\xce\xb1'
So when Python is trying to display '\xce\xb1' I get alpha, but when it's trying to display u'\xce\xb1', it's double encoding?
Why did I get u'\xce\xb1' in the first place? Is there a way I can just get back '\xce\xb1'?
Thanks. My UTF-8 and unicode handling knowledge really needs some help...
Try putting the unicode prefix u before your string, e.g. u'YOUR_ALFA_CHAR', and check your database encoding, because Django always supports UTF-8.
What you seem to have is the individual bytes of a UTF-8 encoded string interpreted as unicode codepoints. You can "decode" your string out of this strange form with:
p.name = ''.join(chr(ord(x)) for x in p.name)
or perhaps
p.name = ''.join(chr(ord(x)) for x in p.name).decode('utf8')
One way to get your strings "encoded" into this form is
''.join(unichr(ord(x)) for x in '\xce\xb1')
although I have a feeling your strings actually got in this state by different components of your system disagreeing on the encoding in use.
You will probably have to fix the source of your bad "encoding" rather than just fixing the data currently in your database. And the code above might be okay to convert your bad data once, but I would advise you don't insert this code into your Django app.
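For a one-off cleanup of this kind, the same repair can be sketched in Python 3 syntax as below. The helper name repair_utf8_mojibake is hypothetical, and the sketch assumes the damage is exactly "UTF-8 bytes read as Latin-1"; anything that does not fit that pattern is returned unchanged.

```python
def repair_utf8_mojibake(text):
    """Undo UTF-8 bytes that were mistakenly decoded as Latin-1.

    Returns the text unchanged when the round trip is not possible.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = "\xce\xb1"   # two code points, U+00CE and U+00B1
fixed = repair_utf8_mojibake(broken)
```

Text that is already correct, such as a real alpha or a word like "café", fails one of the two steps and is left alone, which makes the helper reasonably safe to run over mixed data.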
The problem is that p.name was not correctly stored and/or read in from the database.
Unicode small alpha is U+03B1, and p.name should have displayed as u'\u03b1' or, if you were using a Unicode-capable terminal, the actual alpha symbol itself may have been printed in quotes. Note the difference between u'\xce\xb1' and u'\u03b1': the former is a two-character string and the latter is a single-character string. The bytes CE B1 are the UTF-8 encoding of U+03B1, so each UTF-8 byte has ended up as a separate code point.
You can turn any byte sequence into internal unicode representation through the decode function:
print '\xce\xb1'.decode('utf-8')
This allows you to import a byte sequence from any source and then turn it into a Python unicode string.
Reference: http://docs.python.org/library/stdtypes.html#string-methods
Try converting the encoding with p.name.encode('latin-1'). Here's a demonstration:
>>> print u'\xce\xb1'
Î±
>>> print u'\xce\xb1'.encode('latin-1')
α
>>> print '\xce\xb1'
α
>>> '\xce\xb1' == u'\xce\xb1'.encode('latin1')
True
For more information, see str.encode and Standard Encodings.