Emoji to unicode [duplicate] - python

This question already has answers here:
Escaped Unicode to Emoji in Python
(1 answer)
How to encode Python 3 string using \u escape code?
(1 answer)
Closed 1 year ago.
I was looking at https://r12a.github.io/app-conversion/ and I see that they have a "JS/Java/C" section. I was wondering if anyone had the code for that in python. I can't seem to find it. Thanks!
Edit: code
b = '😀'
txt = b.encode('utf-8')

From How to work with surrogate pairs in Python? (linked from duplicate Escaped Unicode to Emoji in Python )
If you see a '\ud83d\ude4f' Python string (2 characters), then there is a bug upstream. Normally, you shouldn't get such a string. If you get one and you can't fix the upstream code that generates it, you could fix it using the surrogatepass error handler:
>>> "\uD83D\uDE00".encode('utf-16', 'surrogatepass').decode('utf-16')
'😀'
Original Answer
Perhaps you're looking for ord()?
Given a string representing one Unicode character, return an integer representing the Unicode code point of that character. For example, ord('a') returns the integer 97 and ord('€') (Euro sign) returns 8364. This is the inverse of chr().
>>> hex(ord("😀"))
'0x1f600'
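For the "JS/Java/C" style escapes the question asks about, one possible approach is to encode the text as UTF-16 and write each 16-bit code unit as \uXXXX. This is only a sketch, assuming Python 3; the helper name to_utf16_escapes is made up for illustration, and it escapes every character, including plain ASCII.

```python
def to_utf16_escapes(text):
    # Hypothetical helper: one \uXXXX escape per UTF-16 code unit.
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "".join("\\u{:04X}".format(unit) for unit in units)


print(to_utf16_escapes("\U0001F600"))  # \uD83D\uDE00
print(to_utf16_escapes("a"))           # \u0061
```

Characters outside the Basic Multilingual Plane, such as the emoji above, come out as a surrogate pair, which matches how JavaScript and Java source escapes represent them.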

Related

How to convert unicode string to bytes Python [duplicate]

This question already has answers here:
getting bytes from unicode string in python
(6 answers)
Closed 2 years ago.
I have a string which I get from a function
>>> example = Some_function()
Some_function returns a very long string that mixes Unicode and ASCII characters, like 'gn1\ud123a\ud123\ud123\ud123\ud919\ud123\ud123'
My problem is that when I try to convert this Unicode string to bytes, I get an error saying that \ud919 cannot be encoded by utf-8. I tried:
>>> further=bytes(example,encoding='utf-8')
Note: I cannot ignore this \ud919. Is there a way to solve this problem, or how can I convert 'gn1\ud123a\ud123\ud123\ud123\ud919\ud123\ud123' to 'gn1\ud123a\ud123\ud123\ud123\\ud919\ud123\ud123' so that \ud919 is treated as a plain string and not as Unicode?
Inspect the string first; the call depends on the Python version.
Python 2.x: print type(unicode_string), repr(unicode_string)
Python 3.x: print(type(unicode_string), ascii(unicode_string))
\ud919 is a lone surrogate code point, so the default (strict) UTF-8 encoder refuses to convert it. Use the surrogatepass error handler:
>>> 'gn1\ud123a\ud123\ud123\ud123\ud919\ud123\ud123'.encode('utf-8', 'surrogatepass')
b'gn1\xed\x84\xa3a\xed\x84\xa3\xed\x84\xa3\xed\x84\xa3\xed\xa4\x99\xed\x84\xa3\xed\x84\xa3'
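As a hedged sketch of the two options raised in this thread (Python 3 assumed, with a shortened sample string chosen for illustration): surrogatepass keeps the surrogate as raw bytes and can be reversed, while the standard backslashreplace handler writes it out as the literal text \ud919, which is closer to what the question asks for.

```python
sample = 'gn1\ud123a\ud919'  # shortened stand-in for the string in the question

# Option 1: surrogatepass encodes the lone surrogate and round-trips.
raw = sample.encode('utf-8', 'surrogatepass')
restored = raw.decode('utf-8', 'surrogatepass')
print(restored == sample)  # True

# Option 2: backslashreplace turns the unencodable surrogate into the
# six ASCII characters \ud919; encodable characters are left alone.
escaped = sample.encode('utf-8', 'backslashreplace')
print(escaped)  # b'gn1\xed\x84\xa3a\\ud919'
```

Note that the second option is lossy in the sense that the bytes no longer distinguish an original surrogate from a literal backslash sequence.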

extract some arabic/persian (unicode) words with regex using python [duplicate]

This question already has answers here:
Python and regular expression with Unicode
(2 answers)
Closed 2 years ago.
I need to extract some specific names in Arabic/Persian (something like proper nouns in English), using python re library.
example (the word "شرکت" means "company" and we want to extract what the company name is):
input: شرکت تست گستران خلیج فارس
output: تست گستران خلیج فارس
I've seen [this answer] and it would be fine to replace "university" with "شرکت" in that example but I don't understand how to find the keywords by regex with Arabic Unicode when it's not possible to use that in this way:
re.match("شرکت", "\u0634\u0631\u06A9\u062A") # returns None
In Python 2, a plain string literal is a byte string, so neither non-ASCII letters pasted into the source nor \u escapes are parsed as Unicode. You have to be explicit about it:
re.match(u"شرکت", u"\u0634\u0631\u06A9\u062A")
Otherwise, the Arabic text is turned into its encoded bytes, which are different from the Unicode code points, and the string on the right keeps literal backslashes, since Python 2 does not recognize \u as an escape in byte strings.
Another option is to import from the future. In Python 3 every string literal is parsed as Unicode, which makes the u"..." prefix largely redundant, and Python 2 can opt in to the same behavior:
from __future__ import unicode_literals
This makes string literals parse as Unicode without the u"" prefix.
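For the extraction itself, a minimal sketch in Python 3 could look like the following. The pattern is an assumption: it takes everything after the keyword and the following whitespace as the company name, which may be too greedy for real text. The variable names are illustrative only.

```python
import re

# The keyword from the question, written with escapes.
keyword = "\u0634\u0631\u06A9\u062A"
company_name = "تست گستران خلیج فارس"
text = keyword + " " + company_name

# Every str literal in Python 3 is Unicode, so the keyword can be
# placed in the pattern directly; (.+) captures the rest of the line.
match = re.match(re.escape(keyword) + r"\s+(.+)", text)
extracted = match.group(1) if match else None
print(extracted == company_name)  # True
```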

Byte string and unicode string inputs [duplicate]

This question already has answers here:
byte string vs. unicode string. Python
(2 answers)
Closed 3 years ago.
I'm a Python noob. I was reading through some documentation and I came across something that baffled me.
What is the difference between byte strings and Unicode strings in Python? Especially in terms of what is being input and what is output.
Please explain using the simplest terms possible.
N.B : I use python 3.x
I searched around and found that byte strings are sequences of raw bytes (values 0 to 255); a byte string literal can only spell out ASCII characters directly, so any other character must be encoded first. Unicode strings can contain, well, all Unicode characters.
In Python 2.x, byte strings are written like ordinary strings, while Unicode strings have a "u" prefix.
a = 'foobar'  # byte string
b = u'foo-bar'  # unicode string
In Python 3.x it is the other way around: byte strings take a "b" prefix and ordinary strings are Unicode.
a = b'foobar'  # byte string
b = 'foo-bar'  # unicode string
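To make the difference concrete in Python 3, here is a small sketch; the sample text and the choice of UTF-8 are assumptions made for illustration. A str counts characters, a bytes object counts encoded bytes, and encode/decode convert between the two.

```python
text = "na\u00efve \u20ac"   # str: 'naïve €', 7 Unicode characters
data = text.encode("utf-8")  # bytes: the same text encoded as UTF-8

print(type(text).__name__, len(text))  # str 7
print(type(data).__name__, len(data))  # bytes 10
print(data.decode("utf-8") == text)    # True

# Indexing differs too: str yields a character, bytes yields an integer.
print(text[0], data[0])                # n 110
```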

Printing unicode character from number [duplicate]

This question already has answers here:
Convert an int value to unicode
(4 answers)
Closed 5 years ago.
In Python I'm trying to print out the character corresponding to a given code like this:
a = 159
print unichr(a)
It prints out a strange symbol. Can anyone explain why?
# To get the numerical value of a character. This gives me 49.
ord("1")
# To get the character from a numerical value. This gives me "1".
chr(49)
It is possible that the number you are trying to convert represents a special character, in which case Python is likely to have shown it by its hex value instead. To see the hex value of a character's code point:
hex(ord("1"))
If that is not the case, it may have been displayed with some other stand-in, since it is (hypothetically) a special character.
The character at code point 159 (U+009F) is Application Program Command. It is a control character and is not considered a graphic character.
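One way to check this from Python is the standard unicodedata module. The sketch below assumes Python 3, where chr replaces Python 2's unichr.

```python
import unicodedata

ch = chr(159)  # Python 3 spelling of unichr(159)

print(repr(ch))                        # '\x9f'
print(unicodedata.category(ch))        # Cc (control character)
print(ch.isprintable())                # False

# For comparison, code point 49 is the digit "1".
print(unicodedata.category(chr(49)))   # Nd (decimal digit)
```

Since the character has no glyph, what appears on screen depends on the terminal and font.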

Escape sequence in python [duplicate]

This question already has answers here:
What does a leading `\x` mean in a Python string `\xaa`
(2 answers)
Closed 7 years ago.
What is the difference between \x and \u escape sequences in python? (Apart from the fact that \x uses the syntax \xXX and \u uses \uXXXX). print('\xa5') gives the output as '¥' in script mode and so does print('\u00a5'), so how is one different from the other, apart from the syntax used?
The most important difference is that \uXXXX takes 4 hexadecimal digits and can therefore express code points up to U+FFFF, so it can refer to special characters that are not in ASCII or your current code page. In Python 2 it can only be used in unicode strings (in Python 3 every str is a Unicode string):
u'\u0123'
The older \xXX takes exactly 2 hexadecimal digits, so it only reaches code points up to 255. In Python 2 it can be used in both unicode strings and str strings:
u'\u0123\x20'
'\x20'
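In Python 3, where the question's print calls come from, a short sketch shows that both escapes produce the same character in a str, and that only the range differs. The variable names are illustrative.

```python
yen_x = "\xa5"    # 2 hex digits: code points up to 0xFF
yen_u = "\u00a5"  # 4 hex digits: code points up to 0xFFFF

print(yen_x == yen_u)                      # True
print(hex(ord("\u0123")))                  # 0x123, out of reach for \x
print(len("\U0001F600"))                   # 1, \U takes 8 hex digits

# In a bytes literal, \xXX is a single byte rather than a code point.
print(len(b"\xa5"))                        # 1
print(yen_x.encode("utf-8"))               # b'\xc2\xa5'
```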
