This question already has answers here:
byte string vs. unicode string. Python
(2 answers)
Closed 3 years ago.
I'm a Python noob. I was reading through some documentation and I came across something that baffled me.
What is the difference between byte strings and Unicode strings in Python, especially in terms of what goes in and what comes out?
Please explain in the simplest terms possible.
N.B.: I use Python 3.x.
I searched around and found that byte strings hold raw bytes (integers from 0 to 255). A bytes literal can contain only ASCII characters directly, punctuation included; any other byte value has to be written as an escape such as \xe2. Unicode strings can contain, well, all Unicode characters.
In Python 2.x, byte strings are written like ordinary strings, while Unicode strings take a "u" prefix.
a = 'foobar'    # byte string
b = u'foo-bar'  # unicode string
In Python 3.x it is the other way around: ordinary strings are Unicode, and byte strings take a "b" prefix.
a = b'foobar'   # byte string
b = 'foo-bar'   # unicode string
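To make the input/output side concrete, here is a minimal Python 3 sketch of moving between the two types. The sample text and the choice of UTF-8 are assumptions made for illustration; any encoding that both sides agree on works the same way.

```python
# Python 3: str holds Unicode text, bytes holds raw byte values.
text = 'naïve café'              # sample text, chosen only for illustration
data = text.encode('utf-8')      # str -> bytes (what you write to a file or socket)

print(type(text).__name__, len(text))   # character count
print(type(data).__name__, len(data))   # byte count; each accented letter takes 2 bytes

# Indexing bytes yields integers, indexing str yields one-character strings.
first_byte = data[0]
first_char = text[0]

# bytes -> str (what you do with data you read in)
round_trip = data.decode('utf-8')
```

The usual pattern is to decode bytes as soon as they enter the program, work with str throughout, and encode again only when writing output.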
This question already has answers here:
Escaped Unicode to Emoji in Python
(1 answer)
How to encode Python 3 string using \u escape code?
(1 answer)
Closed 1 year ago.
I was looking at https://r12a.github.io/app-conversion/ and I see that they have a "JS/Java/C" section. I was wondering if anyone had the code for that in python. I can't seem to find it. Thanks!
Edit: code
b = '😀'
txt = b.encode('utf-8')
From "How to work with surrogate pairs in Python?" (linked from the duplicate "Escaped Unicode to Emoji in Python"):
If you see '\ud83d\ude4f' Python string (2 characters) then there is a bug upstream. Normally, you shouldn't get such string. If you get one and you can't fix upstream that generates it; you could fix it using surrogatepass error handler:
>>> "\uD83D\uDE00".encode('utf-16', 'surrogatepass').decode('utf-16')
'😀'
Original Answer
Perhaps you're looking for ord()?
Given a string representing one Unicode character, return an integer representing the Unicode code point of that character. For example, ord('a') returns the integer 97 and ord('€') (Euro sign) returns 8364. This is the inverse of chr().
>>> hex(ord("😀"))
'0x1f600'
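If the goal is the JS/Java/C-style output from that converter page, one possible approach is to encode each non-ASCII character as UTF-16 and print every 16-bit code unit as a \uXXXX escape. This is only a sketch: the function name to_utf16_escapes is made up for this example, and leaving ASCII characters unescaped is an assumption about the desired output.

```python
def to_utf16_escapes(text):
    # Hypothetical helper: ASCII passes through unchanged; every other
    # character becomes one or two backslash-u escapes, one per UTF-16 code unit.
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
            continue
        # 'surrogatepass' lets lone surrogates through instead of raising.
        units = ch.encode('utf-16-be', 'surrogatepass')
        for i in range(0, len(units), 2):
            code_unit = int.from_bytes(units[i:i + 2], 'big')
            out.append('\\u{:04X}'.format(code_unit))
    return ''.join(out)


print(to_utf16_escapes('😀'))   # prints \uD83D\uDE00
```

Characters outside the Basic Multilingual Plane come out as a surrogate pair, which is what languages with UTF-16 strings expect.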
This question already has answers here:
getting bytes from unicode string in python
(6 answers)
Closed 2 years ago.
I have a string which I get from a function
>>> example = Some_function()
Some_function returns a very long string mixing Unicode and ASCII characters, like 'gn1\ud123a\ud123\ud123\ud123\ud919\ud123\ud123'
My problem is that when I try to convert this Unicode string to bytes, I get an error saying that \ud919 cannot be encoded by UTF-8. I tried:
>>> further=bytes(example,encoding='utf-8')
Note: I cannot ignore this \ud919. Is there a way to solve this problem, or a way to convert 'gn1\ud123a\ud123\ud123\ud123\ud919\ud123\ud123' to 'gn1\ud123a\ud123\ud123\ud123\\ud919\ud123\ud123' so that \ud919 is treated as plain text rather than as a Unicode escape?
Depending on the Python version, show the output of:
Python 2.x : print type(unicode_string), repr(unicode_string)
Python 3.x : print(type(unicode_string), ascii(unicode_string))
\ud919 is a lone surrogate code point, which UTF-8 refuses to encode by default. Use the surrogatepass error handler:
>>> 'gn1\ud123a\ud123\ud123\ud123\ud919\ud123\ud123'.encode('utf-8', 'surrogatepass')
b'gn1\xed\x84\xa3a\xed\x84\xa3\xed\x84\xa3\xed\x84\xa3\xed\xa4\x99\xed\x84\xa3\xed\x84\xa3'
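As a rough comparison, the sketch below runs two error handlers over a shortened version of the string (shortened only for readability). surrogatepass keeps the surrogate as raw bytes that can be decoded back, while backslashreplace turns it into the literal text \ud919, which is closer to the second form the question asks about. Which one fits depends on what the receiving side expects.

```python
s = 'gn1\ud123a\ud919'   # shortened sample; \ud919 is the lone surrogate

# Option 1: encode the surrogate as bytes; reversible with the same handler.
passed = s.encode('utf-8', 'surrogatepass')
restored = passed.decode('utf-8', 'surrogatepass')

# Option 2: replace whatever cannot be encoded with a textual backslash escape.
escaped = s.encode('utf-8', 'backslashreplace')

print(passed)
print(escaped)
```

Note that the bytes produced by surrogatepass are not valid UTF-8 in the strict sense, so other tools may reject them.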
This question already has answers here:
What does a leading `\x` mean in a Python string `\xaa`
(2 answers)
Closed 7 years ago.
text="\xe2\x80\x94"
print re.sub(r'(\\(?<=\\)x[a-z0-9]{2})+',"replacement_text",text)
output is —
How can I handle the hexadecimal characters in this situation?
Your input doesn't have backslashes. It has 3 bytes, the UTF-8 encoding for the U+2014 EM DASH character:
>>> text = "\xe2\x80\x94"
>>> len(text)
3
>>> text[0]
'\xe2'
>>> text.decode('utf8')
u'\u2014'
>>> print text.decode('utf8')
—
You either need to match those UTF-8 bytes directly, or decode from UTF-8 to unicode and match the codepoint. The latter is preferable; always try to deal with text as Unicode, so that each character is a single unit to match rather than a sequence of several bytes.
Also note that Python's repr() output (which is used implicitly when echoing in the interactive interpreter or when printing lists, dicts or other containers) uses \xhh escape sequences to represent any non-printable character. For UTF-8 byte strings, that includes anything outside the ASCII range. You could just replace anything outside that range with:
re.sub(r'[\x80-\xff]+', "replacement_text", text)
Take into account that this'll match multiple UTF-8-encoded characters in a row, and replace these together as a group!
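The session above is Python 2. A rough Python 3 equivalent of both options might look like the sketch below; it assumes the data arrives as a bytes object, and the variable names are chosen only for this example.

```python
import re

raw = b"\xe2\x80\x94"            # the three UTF-8 bytes of U+2014 EM DASH

# Option 1 (preferred): decode first, then match the code point in a str.
text = raw.decode("utf-8")
replaced_text = re.sub("\u2014", "replacement_text", text)

# Option 2: stay in bytes and replace any run of non-ASCII bytes.
# Both the pattern and the replacement must be bytes here.
replaced_raw = re.sub(rb"[\x80-\xff]+", b"replacement_text", raw)

print(replaced_text)
print(replaced_raw)
```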
Your input is three bytes written as hex escapes, not the literal characters "\xe2\x80\x94".
\x is just the way to say that the following two characters are hexadecimal digits giving a single byte value.
This was explained in this post.
This question already has answers here:
What does a leading `\x` mean in a Python string `\xaa`
(2 answers)
Closed 7 years ago.
What is the difference between \x and \u escape sequences in python? (Apart from the fact that \x uses the syntax \xXX and \u uses \uXXXX). print('\xa5') gives the output as '¥' in script mode and so does print('\u00a5'), so how is one different from the other, apart from the syntax used?
The most important difference is that \uXXXX accepts 4 hexadecimal digits and is therefore suitable for higher numbers (and therefore can be used to refer to special characters that are not in ASCII or your current code page). It can therefore only be used in unicode strings:
u'\u0123'
The older \xXX can be used in both unicode strings and byte strings (str in Python 2), but only for values up to 255:
u'\u0123\x20'
'\x20'
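A small Python 3 check of the same point: inside a str, \xXX and \uXXXX name the same code point when the value fits in two hex digits, while inside a bytes literal \xXX is a single byte. The encodings used below are only there to show how that one code point maps to bytes.

```python
# In a str, the three escape forms produce the same one-character string.
same = '\xa5' == '\u00a5' == '\U000000a5' == chr(0xA5)

# \u reaches code points above 255, which \x cannot express.
wide = '\u0123'

# In a bytes literal, \xa5 is one byte with value 0xA5.
one_byte = b'\xa5'

# How the YEN SIGN code point becomes bytes depends on the encoding.
as_latin1 = '\xa5'.encode('latin-1')   # one byte
as_utf8 = '\xa5'.encode('utf-8')       # two bytes

print(same, len(wide), one_byte[0], as_latin1, as_utf8)
```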
This question already has an answer here:
python get unicode string size
(1 answer)
Closed 8 years ago.
How would I get the character count of the below in python?
s = 'הוא אוסף אתכם מחר בשלוש וחצי.'
Char count: 29
Char length: 52
len(s) = 52
? = 29
Decode your byte string (according to whatever encoding it's in, utf-8 maybe); the len of the resulting Unicode string is what you're after.
In fact, best practice is to decode inputs as soon as possible, deal only with actual text (i.e., unicode in Python 2; it's just the way ordinary strings are in Python 3) in your code, and if need be encode just as you're outputting again.
Byte strings should be handled in your program only if it's specifically about byte strings (e.g., controlling or monitoring some hardware device); far more programs are about text, and thus, except where indispensable at some I/O boundaries, they should be exclusively dealing with text strings (spelled unicode in Python 2 :-).
But if you do want to keep s as a bytestring nevertheless,
len(s.decode('utf-8'))
(or whatever other encoding you're using to represent text as byte strings) should still do what you request.
Use a unicode string
s = 'הוא אוסף אתכם מחר בשלוש וחצי.'
len(s) #52
s = u'הוא אוסף אתכם מחר בשלוש וחצי.'
len(s) #29
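The snippets above are Python 2. In Python 3 an ordinary string literal is already Unicode, so a sketch of the same comparison (assuming UTF-8 for the byte count) would be:

```python
s = 'הוא אוסף אתכם מחר בשלוש וחצי.'

char_count = len(s)                    # number of Unicode code points
byte_count = len(s.encode('utf-8'))    # number of bytes once encoded as UTF-8

print(char_count, byte_count)   # prints 29 52
```

Each Hebrew letter takes two bytes in UTF-8, while the spaces and the final period take one, which accounts for the gap between the two numbers.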