Python 3 utf8 value decode to string - python

Hi, I am using Python 3 and I want to decode a UTF-8 value to a string.
Here is my code so far:
s1 = '\u54c7'
print(chr(ord(s1))) # print 哇
This works if the input is a single character, but how do I convert a whole string?
s2 = '\u300c\u54c7\u54c8\u54c8!!\u300d'
print(chr(ord(s2))) # Error! I want to print "「哇哈哈!!」"
Thanks
Edit:
Hi all, I have updated the question.
Suppose the string I receive is s3, as below, and I use replace to change its format,
but printing s3 does not show "哇哈哈!!".
If I initialize s4 with '\u54c7\u54c8\u54c8!!' and print s4,
it looks correct. So how can I fix s3?
s3 = '&#x54c7;&#x54c8;&#x54c8;!!'
s3 = s3.replace("&#x","\\u").replace(";","") # s3 = \u54c7\u54c8\u54c8!!
s4 = '\u54c7\u54c8\u54c8!!'
print(s3) # \u54c7\u54c8\u54c8!!
print(s4) # 哇哈哈!!

If you are in fact using Python 3, you don't need to do anything. You can just print the string. You can also copy and paste the characters directly into a Python string literal and it will work.
'「哇哈哈!!」' == '\u300c\u54c7\u54c8\u54c8!!\u300d'
Regarding the updated question, the difference is escaping. When you type a string literal, the parser replaces certain sequences of characters (escape sequences) with characters that can't easily be typed or displayed. The string is not stored as the series of characters you see in the source, but as a sequence of values created from items like 'a', ';', and '\300'. Note that each of those has a len of 1 because each is a single character. A replace call at runtime, by contrast, produces a real backslash followed by 'u' and hex digits, and nothing parses that text as an escape sequence again.
To actually convert those values you could use eval, the answer provided by Iron Fist, or find a library that converts the string you have. I would suggest the last since the rules surrounding such things can be complex and rarely are covered by simple replacements. I don't recognize the particular pattern of escaping, so I cannot recommend anything, sorry.
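As one possible alternative to eval for the backslash form, here is a minimal sketch that converts literal \uXXXX text with a regular expression. The helper name decode_u_escapes is made up for illustration, and the sketch assumes that only four-digit \u sequences appear in the input.

```python
import re

# Hypothetical helper (not from the answers above): turns literal
# backslash-u text such as \u54c7 into the character it names.
# Assumes only 4-hex-digit \uXXXX sequences need handling.
def decode_u_escapes(text):
    return re.sub(
        r'\\u([0-9a-fA-F]{4})',
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

# Built at runtime: a real backslash followed by 'u' and hex digits.
s3 = '\\u54c7\\u54c8\\u54c8!!'
# Written as a literal: the parser has already resolved the escapes.
s4 = '\u54c7\u54c8\u54c8!!'

print(len(s3), len(s4))            # 20 5
print(decode_u_escapes(s3) == s4)  # True
```

The length difference is the point: s3 holds 20 ordinary characters, while s4 holds 5, three of which are the CJK characters themselves.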

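Regarding your s3 string, this looks like HTML character references (text in HTML format), so use the proper tool from the html package. On older Python 3 versions you could go through html.parser:
>>> s3 = '&#x54c7;&#x54c8;&#x54c8;!!'
>>> from html.parser import HTMLParser
>>>
>>> p = HTMLParser()
>>>
>>> p.unescape(s3)  # deprecated since Python 3.4, removed in Python 3.9
'哇哈哈!!'
Or, more simply (and the only one of the two that still works on Python 3.9+), with html.unescape: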
>>> import html
>>>
>>> html.unescape(s3)
'哇哈哈!!'
Quoting from Python docs on html.unescape:
html.unescape(s)
Convert all named and numeric character references (e.g. &gt;, &#62;, &#x3e;) in the string s to the corresponding Unicode characters.
...
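A small sketch of html.unescape applied to the full string from the question, including the corner brackets. The input is written here as hexadecimal character references, which is an assumption about what the original data looked like.

```python
import html

# Assumed input: the question's text written as hexadecimal character references.
encoded = '&#x300c;&#x54c7;&#x54c8;&#x54c8;!!&#x300d;'
decoded = html.unescape(encoded)
print(decoded)  # 「哇哈哈!!」

# Named, decimal and hexadecimal references all decode the same way.
print(html.unescape('&gt; &#62; &#x3e;'))  # > > >
```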

Related

Does Python3 still need raw string in regular expression?

In Python 2, when dealing with regular expressions we use r'expression'. Do we still need to prepend "r" in Python 3, given that Python 3 uses Unicode by default?
Yes. Backslash escape sequences are still present in Python 3 strings, so raw strings prefixed with r still make a difference, as shown in this simple example:
>>> s = 'hello\n'
>>> raw = r'hello\n'
>>> s
'hello\n'
>>> raw
'hello\\n'
>>> print(s)
hello
>>> print(raw)
hello\n
Raw strings are still useful for writing characters like \ without escaping them. This is generally useful in regular expressions, Windows paths, etc.
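To make the regex case concrete, here is a short sketch using \b, which means "word boundary" to the re module but "backspace" to the string-literal parser. The sample text is made up for illustration.

```python
import re

text = 'a cat sat'

# Raw literal: the regex engine receives backslash + 'b' (word boundary).
raw_match = re.search(r'\bcat\b', text)
# Plain literal: '\b' is turned into a backspace character (U+0008) first.
plain_match = re.search('\bcat\b', text)

print(raw_match is not None)   # True
print(plain_match is None)     # True
print(len(r'\b'), len('\b'))   # 2 1
```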

Unicode in python with Django and MongoEngine

I'm trying to compare two strings, the first one, s1, comes from mongoengine and the second one, s2, comes from a Django http request.
They look like this:
>>> s1 = product_model.Product.objects.get(pk=1).name
>>> s1
u'Product \xe4 asdf'
>>> s2 = request.POST['name']
>>> s2
'Product \xc3\xa4 asdf'
They have the same letter in them, the Swedish 'ä', but MongoEngine's (s1) is a Python unicode string and Django's (s2) is a Python bytestring containing UTF-8-encoded characters.
I can easily solve this by e.g. converting the Python unicode string to be a byte string
>>> s1.encode('utf-8') == s2
True
But I would like to think that the best practice is to have all the strings in my system represented the same way, correct?
How can I tell Django to use Python unicode strings instead? Or how can I tell MongoEngine to use unicode encoded Python bytestrings?
The Django docs say:
General string handling
Whenever you use strings with Django – e.g., in database lookups,
template rendering or anywhere else – you have two choices for
encoding those strings. You can use Unicode strings, or you can use
normal strings (sometimes called “bytestrings”) that are encoded using
UTF-8.
In Python 3, the logic is reversed, that is normal strings are
Unicode, and when you want to specifically create a bytestring, you
have to prefix the string with a ‘b’. As we are doing in Django code
from version 1.5, we recommend that you import unicode_literals from
the __future__ library in your code. Then, when you specifically want
to create a bytestring literal, prefix the string with ‘b’.
Python 2 legacy:
my_string = "This is a bytestring"
my_unicode = u"This is an Unicode string"
Python 2 with unicode literals or Python 3:
from __future__ import unicode_literals
my_string = b"This is a bytestring"
my_unicode = "This is an Unicode string"
If you are on Python 2, you can try that. As I said in the comment:
I would not suggest working with encoded strings. As these slides say
(farmdev.com/talks/unicode): "Decode early, Unicode everywhere, encode
late". So I would suggest telling Django to use unicode strings,
but I am not a Django expert, sorry. My approach: s1 ==
s2.decode("utf8"), so you have two Unicode strings to work with.
Hope it works.
EDIT: I suppose you are using Django's HttpRequest, so from the docs:
HttpRequest.encoding
A string representing the current encoding used
to decode form submission data (or None, which means the
DEFAULT_CHARSET setting is used). You can write to this attribute to
change the encoding used when accessing the form data. Any subsequent
attribute accesses (such as reading from GET or POST) will use the new
encoding value. Useful if you know the form data is not in the
DEFAULT_CHARSET encoding.
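The "decode early, Unicode everywhere, encode late" advice can be sketched in Python 3 terms, where text is str and encoded data is bytes. The values mirror the question; the variable names are illustrative and no Django or MongoEngine call is involved.

```python
# Python 3 sketch of "decode early, Unicode everywhere, encode late".
from_database = 'Product \xe4 asdf'        # already text (str)
from_request = b'Product \xc3\xa4 asdf'    # UTF-8 encoded bytes

# Decode at the boundary, then compare text with text.
request_text = from_request.decode('utf-8')
print(request_text == from_database)  # True

# Encode only when the data leaves the program again.
outgoing = request_text.encode('utf-8')
print(outgoing == from_request)  # True
```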

Convert Unicode string to UTF-8, and then to JSON

I want to encode a string in UTF-8 and view the corresponding UTF-8 bytes individually. In the Python REPL the following seems to work fine:
>>> unicode('©', 'utf-8').encode('utf-8')
'\xc2\xa9'
Note that I’m using U+00A9 COPYRIGHT SIGN as an example here. The '\xC2\xA9' looks close to what I want — a string consisting of two separate code points: U+00C2 and U+00A9. (When UTF-8-decoded, it gives back the original string, '\xA9'.)
Then, I want the UTF-8-encoded string to be converted to a JSON-compatible string. However, the following doesn’t seem to do what I want:
>>> import json; json.dumps('\xc2\xa9')
'"\\u00a9"'
Note that it generates a string containing U+00A9 (the original symbol). Instead, I need the UTF-8-encoded string, which would look like "\u00C2\u00A9" in valid JSON.
TL;DR How can I turn '©' into "\u00C2\u00A9" in Python? I feel like I’m missing something obvious — is there no built-in way to do this?
If you really want "\u00c2\u00a9" as the output, give json a Unicode string as input.
>>> print json.dumps(u'\xc2\xa9')
"\u00c2\u00a9"
You can generate this Unicode string from the raw bytes:
s = unicode('©', 'utf-8').encode('utf-8')
s2 = u''.join(unichr(ord(c)) for c in s)
I think what you really want is "\xc2\xa9" as the output, but I'm not sure how to generate that yet.
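For comparison, a Python 3 sketch of the same idea: encode to UTF-8, map each byte to the code point with the same number, and pass the result to json.dumps. Using Latin-1 for the byte-to-code-point step is a choice made here for brevity, not something taken from the answer above.

```python
import json

utf8_bytes = '©'.encode('utf-8')           # b'\xc2\xa9'
# Latin-1 maps every byte 0x00-0xFF to the code point of the same value.
as_code_points = utf8_bytes.decode('latin-1')
result = json.dumps(as_code_points)
print(result)  # "\u00c2\u00a9"
```

json.dumps escapes non-ASCII characters by default (ensure_ascii=True), which is why the two code points come out as \u escapes.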

Greek encoding in PYTHON

I'm trying to store a string and then tokenize it with nltk in Python. But I can't understand why, after tokenizing it (which creates a list), I can't see the strings in the list.
Can anyone help me, please?
Here is the code:
#a="Γεια σου"
#b=nltk.word_tokenize(a)
#b
['\xc3\xe5\xe9\xe1', '\xf3\xef\xf5']
I just want to be able to see the content of the list as readable text.
Thanks in advance.
You are using Python 2, where unprefixed quotes denote a byte as opposed to a character string (if you're not sure about the difference, read this). Either switch to Python 3, where this has been fixed, or prefix all character strings with u and print the strings (as opposed to showing their repr, which differs in Python 2.x):
>>> import nltk
>>> a = u'Γεια σου'
>>> b = nltk.word_tokenize(a)
>>> print(u'\n'.join(b))
Γεια
σου
You can see the strings. The characters are represented by escape sequences because of your terminal encoding settings. Configure your terminal to accept input, and present output, in UTF-8.
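The byte values shown in the question ('\xc3\xe5\xe9\xe1') match 'Γεια' in ISO-8859-7 rather than in UTF-8. That is an inference from the values, not something the question states. Under that assumption, a Python 3 sketch of turning such byte strings back into readable text:

```python
# Byte strings as shown in the question's output (Python 3 bytes literals).
tokens = [b'\xc3\xe5\xe9\xe1', b'\xf3\xef\xf5']

# Assumption: the bytes are ISO-8859-7 (Greek) encoded.
words = [t.decode('iso-8859-7') for t in tokens]

print(words)             # ['Γεια', 'σου']
print('\n'.join(words))
```

If the data were actually UTF-8, the same pattern would apply with 'utf-8' as the codec name.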

Python raw strings and unicode : how to use Web input as regexp patterns?

EDIT : This question doesn't really make sense once you have picked up what the "r" flag means. More details here.
For people looking for a quick answer, I added one below.
If I enter a regexp manually in a Python script, I can use 4 combinations of flags for my pattern strings :
p1 = "pattern"
p2 = u"pattern"
p3 = r"pattern"
p4 = ru"pattern"
I have a bunch of unicode strings coming from a Web form input and want to use them as regexp patterns.
I want to know what process I should apply to the strings so I can expect similar result from the usage of the manual form above. Something like :
import re
assert re.match(p1, some_text) == re.match(someProcess1(web_input), some_text)
assert re.match(p2, some_text) == re.match(someProcess2(web_input), some_text)
assert re.match(p3, some_text) == re.match(someProcess3(web_input), some_text)
assert re.match(p4, some_text) == re.match(someProcess4(web_input), some_text)
What would be someProcess1 to someProcessN and why ?
I suppose that someProcess2 doesn't need to do anything while someProcess1 should do some unicode conversion to the local encoding. For the raw string literals, I am clueless.
Apart from possibly having to encode Unicode properly (in Python 2.*), no processing is needed because there is no specific type for "raw strings" -- it's just a syntax for literals, i.e. for string constants, and you don't have any string constants in your code snippet, so there's nothing to "process".
Note the following in your first example:
>>> p1 = "pattern"
>>> p2 = u"pattern"
>>> p3 = r"pattern"
>>> p4 = ur"pattern" # it's ur"", not ru"" btw
>>> p1 == p2 == p3 == p4
True
While these constructs look different, they all do the same thing: they create a string object (p1 and p3 a str, and p2 and p4 a unicode object in Python 2.x) containing the value "pattern". The u, r and ur prefixes just tell the parser how to interpret the following quoted string, namely as unicode text (u) and/or raw text (r), in which backslashes are not treated as escape characters. In the end it doesn't matter how a string was created, raw or not; internally it is stored the same way.
When you get text as input, you have to distinguish (in Python 2.x) whether it is a unicode object or a str object. If you want to work with unicode content, you should work only with unicode objects internally and convert all str objects to unicode (either with str.decode() or with the u'text' syntax for hard-coded text). If you instead encode it to your local encoding, you will run into problems with Unicode symbols.
A different approach would be to use Python 3, whose str object supports Unicode directly and stores everything as Unicode, so you simply don't need to care about the encoding.
"r" flags just prevent Python from interpreting "\" in a string. Since the Web doesn't care about what kind of data it carries, your web input will be a bunch of bytes you are free to interpret the way you want.
So to address this problem:
be sure you use Unicode (e.g. UTF-8) all along the way
when you get the string, it will be Unicode, and "\n", "\t" and "\a" will be literal characters, so you don't need to care about whether to escape them or not.
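A minimal sketch of that point: a pattern that arrives at runtime already contains real backslashes, so it is equal to the raw literal a programmer would have typed. The web_input value below is simulated with a plain literal; no actual web framework is involved.

```python
import re

# Simulated web input: the six characters a user would type, i.e.
# backslash, 'd', '+', backslash, '.', 'x'. A real request would deliver
# the same characters without any literal syntax being involved.
web_input = '\\d+\\.x'

# The same pattern typed as a raw literal in source code.
literal_pattern = r'\d+\.x'

print(web_input == literal_pattern)  # True: the two strings are equal

some_text = '42.x'
print(re.match(web_input, some_text).group())  # 42.x

# If the input should match literally instead of acting as a regex,
# re.escape neutralises the special characters.
print(re.match(re.escape('1+1'), '1+1').group())  # 1+1
```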
