Escape sequence in python [duplicate] - python

This question already has answers here:
What does a leading `\x` mean in a Python string `\xaa`
(2 answers)
Closed 7 years ago.
What is the difference between \x and \u escape sequences in python? (Apart from the fact that \x uses the syntax \xXX and \u uses \uXXXX). print('\xa5') gives the output as '¥' in script mode and so does print('\u00a5'), so how is one different from the other, apart from the syntax used?

The most important difference is that \uXXXX takes four hexadecimal digits and can therefore express code points up to U+FFFF, including characters that are not in ASCII or your current code page. It is only recognized in Unicode strings (u'...' in Python 2; every plain str literal in Python 3):
u'\u0123'
The older \xXX takes exactly two hexadecimal digits, so it only reaches values up to 255 (0xFF). It can be used in both unicode strings and byte strings (str in Python 2, bytes in Python 3):
u'\u0123\x20'
'\x20'
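To make the comparison concrete, here is a small sketch assuming Python 3 (the variable names are only illustrative). It shows that \xa5 and \u00a5 produce the same one-character string, while inside a bytes literal \xa5 is a single raw byte:

```python
# In a Python 3 str literal, all three escapes name the same code point U+00A5.
yen_x = '\xa5'
yen_u = '\u00a5'
yen_U = '\U000000a5'
print(yen_x == yen_u == yen_U)      # True
print(len(yen_x), hex(ord(yen_x)))  # 1 0xa5

# \x stops at 0xFF; higher code points need \u (or \U above U+FFFF).
print(hex(ord('\u0123')))           # 0x123

# In a bytes literal, \xa5 is one raw byte, not a character.
raw = b'\xa5'
print(raw == yen_x.encode('latin-1'))  # True
print(yen_x.encode('utf-8'))           # b'\xc2\xa5'
```

So for code points up to 255 the two escapes are interchangeable in a str; the difference only shows up above that range or in bytes literals.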

Related

Emoji to unicode [duplicate]

This question already has answers here:
Escaped Unicode to Emoji in Python
(1 answer)
How to encode Python 3 string using \u escape code?
(1 answer)
Closed 1 year ago.
I was looking at https://r12a.github.io/app-conversion/ and I see that they have a "JS/Java/C" section. I was wondering if anyone had the code for that in python. I can't seem to find it. Thanks!
Edit: code
b = '😀'
txt = b.encode('utf-8')
From How to work with surrogate pairs in Python? (linked from the duplicate Escaped Unicode to Emoji in Python):
If you see a Python string such as '\ud83d\ude4f' (2 characters), there is a bug upstream; normally you should not get such a string. If you get one and cannot fix the upstream code that generates it, you can repair it with the surrogatepass error handler:
>>> "\uD83D\uDE00".encode('utf-16', 'surrogatepass').decode('utf-16')
'😀'
Original Answer
Perhaps you're looking for ord()?
Given a string representing one Unicode character, return an integer representing the Unicode code point of that character. For example, ord('a') returns the integer 97 and ord('€') (Euro sign) returns 8364. This is the inverse of chr().
>>> hex(ord("😀"))
'0x1f600'
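If the goal is the "JS/Java/C" style of output, where each UTF-16 code unit is written as \uXXXX, one possible approach is to encode to UTF-16 and format each 16-bit unit. This is a sketch assuming Python 3; to_utf16_escapes is a hypothetical helper name, not something from the linked converter:

```python
def to_utf16_escapes(text):
    # Hypothetical helper: one \uXXXX escape per UTF-16 code unit,
    # so characters above U+FFFF come out as a surrogate pair.
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "".join("\\u{:04X}".format(unit) for unit in units)

escaped = to_utf16_escapes("\U0001F600")  # U+1F600, the emoji from the question
print(escaped)                            # \uD83D\uDE00
print(to_utf16_escapes("a"))              # \u0061
```

The result for U+1F600 is the same surrogate pair that the surrogatepass example above decodes back into the emoji.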

How to ignore backslashes as escape characters in Python? [duplicate]

This question already has answers here:
How to write string literals in Python without having to escape them?
(6 answers)
Closed 7 months ago.
I know this is similar to many other questions regarding backslashes, but it deals with a specific problem that has yet to be addressed. Is there a mode that completely disables backslashes as escape characters in a print statement? I need this for ASCII art, because it is very difficult to get the positioning right when every backslash must be doubled.
print('''
/\\/\\/\\/\\/\\
\\/\\/\\/\\/\\/
''')
Prefix the string with r (for "raw") and it will be interpreted literally, without escape substitutions:
>>> # Your original
>>> print('''
... /\\/\\/\\/\\/\\
... \\/\\/\\/\\/\\/
... ''')
/\/\/\/\/\
\/\/\/\/\/
>>> # as a raw string instead
>>> print(r'''
... /\\/\\/\\/\\/\\
... \\/\\/\\/\\/\\/
... ''')
/\\/\\/\\/\\/\\
\\/\\/\\/\\/\\/
Raw strings are often used for regular expressions, where doubling every backslash gets tedious. A few other prefix letters exist: f (format strings, which behave differently), b (a literal bytes object instead of a string), and u, which designated Unicode strings in Python 2 and is accepted in Python 3.3+ only for compatibility, with no effect.
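Applied to the ASCII art from the question, a raw triple-quoted string lets the art be typed with single backslashes exactly as it should appear. A minimal sketch (the name art is just illustrative):

```python
# With the r prefix, every backslash is kept as typed, so no doubling is needed.
art = r'''
/\/\/\/\/\
\/\/\/\/\/
'''
print(art)

# The same string written without the r prefix needs every backslash doubled.
escaped = '\n/\\/\\/\\/\\/\\\n\\/\\/\\/\\/\\/\n'
print(art == escaped)  # True
```

One caveat: a raw string literal cannot end with an odd number of backslashes, so r'abc\' is a syntax error. In the art above each line's trailing backslash is followed by a newline, which is why it works.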

Byte string and unicode string inputs [duplicate]

This question already has answers here:
byte string vs. unicode string. Python
(2 answers)
Closed 3 years ago.
I'm a Python noob. I was reading through some documentation and I came across something that baffled me.
What is the difference between byte strings and Unicode strings in Python? Especially in terms of what is input and what is output.
Please explain using the simplest terms possible
N.B : I use python 3.x
I searched around and found that byte strings hold raw byte values (0-255), so only ASCII characters can be written directly in a byte string literal and everything else has to be escaped. Unicode strings can contain, well, all Unicode characters.
In python 2.x, byte strings are written much like ordinary strings while unicode strings have a prefixed "u".
a = 'foobar' (byte string)
b = u'foo-bar' (unicode string)
It's written the opposite way for python 3.x
a = b'foobar' (byte string)
b = 'foo-bar' (unicode string)
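As a rough illustration of how the two types relate in Python 3, the sketch below converts between them with encode() and decode(). The sample text and the choice of UTF-8 are assumptions made for the example:

```python
text = "foo-bar \xa5"        # str: a sequence of Unicode code points
data = text.encode("utf-8")  # bytes: a sequence of integers in range 0-255

print(type(text).__name__, len(text))  # str 9
print(type(data).__name__, len(data))  # bytes 10 (U+00A5 takes 2 bytes in UTF-8)
print(data)                            # b'foo-bar \xc2\xa5'
print(text[0], data[0])                # f 102
print(data.decode("utf-8") == text)    # True
```

Indexing a str gives a one-character str, while indexing bytes gives an integer, which is often the quickest way to tell which type you are holding.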

utf-8 encoding and greek characters [duplicate]

This question already has answers here:
Working with UTF-8 encoding in Python source [duplicate]
(2 answers)
How to output a utf-8 string list as it is in python?
(4 answers)
Closed 6 years ago.
While I managed to get all the data that I need and save it to a CSV file, the output I get is in UTF-8 format, which is normal (correct me if I'm wrong).
TBH I've already "played" with the .encode() and .decode() option without any results.
here is my code
brands=[name.text for name in Unibrands]
here is the output
u'Spirulina \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ae'
And this is the desired output
u'Spirulina Ελληνική'
That string is already fine; you're seeing its repr, which escapes certain characters so that the result is safe to copy and paste directly into Python source code (which in Python 2.x means it may contain only printable ASCII characters). For example, \u0395 represents the code point U+0395 GREEK CAPITAL LETTER EPSILON. You see this form because printing a list (or other container) always shows the repr of its contents. If you print the string directly instead, you should see the appropriate glyphs rather than the escaped form:
>>> print(u'Spirulina \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ae')
Spirulina Ελληνική
You could also consider upgrading to a newer Python version: Python 3 no longer escapes printable non-ASCII characters in the repr (PEP 3138), and it accepts Unicode characters in source files by default.
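For comparison, here is a sketch of the same distinction in Python 3, where repr() keeps printable non-ASCII characters and the built-in ascii() gives the escaped form. The variable names are illustrative, and the output is kept ASCII-only so it prints on any console:

```python
brand = "Spirulina \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ae"

as_repr = repr(brand)    # Python 3: Greek letters are kept as-is inside the quotes
as_ascii = ascii(brand)  # non-ASCII characters are escaped, as in the Python 2 repr

print(as_repr == "'" + brand + "'")  # True
print(as_ascii)  # 'Spirulina \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ae'
print(len(brand))  # 18
```

Either way the string itself is unchanged; only the way it is displayed differs.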

python regex: how to remove hex dec characters from string [duplicate]

This question already has answers here:
What does a leading `\x` mean in a Python string `\xaa`
(2 answers)
Closed 7 years ago.
text="\xe2\x80\x94"
print re.sub(r'(\\(?<=\\)x[a-z0-9]{2})+',"replacement_text",text)
output is —
How can I handle the hexadecimal characters in this situation?
Your input doesn't have backslashes. It has 3 bytes, the UTF-8 encoding for the U+2014 EM DASH character:
>>> text = "\xe2\x80\x94"
>>> len(text)
3
>>> text[0]
'\xe2'
>>> text.decode('utf8')
u'\u2014'
>>> print text.decode('utf8')
—
You either need to match those UTF-8 bytes directly, or decode from UTF-8 to unicode and match the codepoint. The latter is preferable; always try to deal with text as Unicode to simplify how many characters you have to transform at a time.
Also note that Python's repr() output (which is used implicitly when echoing in the interactive interpreter or when printing lists, dicts or other containers) uses \xhh escape sequences to represent any non-printable character. For UTF-8 byte strings, that includes anything outside the ASCII range. You could just replace anything outside that range with:
re.sub(r'[\x80-\xff]+', "replacement_text", text)
Take into account that this'll match multiple UTF-8-encoded characters in a row, and replace these together as a group!
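The session above is Python 2. A rough Python 3 equivalent of both options, assuming the input arrives as a bytes object, might look like this (the surrounding a and b are filler added only to make the replacement visible):

```python
import re

raw = b"a\xe2\x80\x94b"  # bytes containing the UTF-8 encoding of U+2014 EM DASH

# Option 1 (preferred): decode to str, then match the code point itself.
text = raw.decode("utf-8")
cleaned_text = re.sub("\u2014", "replacement_text", text)
print(cleaned_text)  # areplacement_textb

# Option 2: stay in bytes and match every byte outside the ASCII range.
cleaned_bytes = re.sub(rb"[\x80-\xff]+", b"replacement_text", raw)
print(cleaned_bytes)  # b'areplacement_textb'
```

Note that in Python 3 the pattern and the input must be the same type: a str pattern for str input, a bytes pattern for bytes input.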
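Your input contains three bytes written in hex notation, not the literal characters of "\xe2\x80\x94" (a backslash, an x, and so on).
\x is just the way to say that the following two characters should be interpreted as a hexadecimal value.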
This was explained in this post.
