Remove escape character from string [duplicate] - python

I would like to turn this string:
a = '\\a'
into this one
b = '\a'
It doesn't seem like there is an obvious way to do this with replace?
To be more precise, I want to change the escaping of the backslash to escaping the character a.

The character '\a' is the ASCII BEL character, chr(7).
To do the conversion in Python 2:
from __future__ import print_function
a = '\\a'
c = a.decode('string-escape')
print(repr(a), repr(c))
output
'\\a' '\x07'
And for future reference, in Python 3:
a = '\\a'
b = bytes(a, encoding='ascii')
c = b.decode('unicode-escape')
print(repr(a), repr(c))
This gives identical output to the above snippet.
In Python 3, if you were working with bytes objects you'd do something like this:
a = b'\\a'
c = bytes(a.decode('unicode-escape'), 'ascii')
print(repr(a), repr(c))
output
b'\\a' b'\x07'
As Antti Haapala mentions, this simple strategy for Python 3 won't work if the source string contains unicode characters too. In that case, please see his answer for a more robust solution.

On Python 2 you can use
>>> '\\a'.decode('string_escape')
'\x07'
Note how \a is repr'd as \x07.
If the string is a unicode string that also contains extended characters, you need to encode it to a bytestring first; otherwise the default encoding (ascii!) is used to convert the unicode object to a bytestring.
However, this codec doesn't exist in Python 3, and things are considerably more complicated. You can use the unicode-escape codec to decode, but it breaks if the source string also contains non-ASCII characters:
>>> '\\aäầ'.encode().decode('unicode_escape')
'\x07Ã¤áº§'
The resulting string doesn't consist of the original Unicode characters but of their UTF-8 bytes decoded as latin-1. The solution is to re-encode to latin-1 and then decode as utf8 again:
>>> '\\aäầ\u1234'.encode().decode('unicode_escape').encode('latin1').decode()
'\x07äầሴ'
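The round trip above can be wrapped in a small helper. This is a minimal sketch for Python 3; the name unescape_text is made up for illustration, and it assumes the escape sequences in the input only denote ASCII characters such as \a or \n (an escaped \xe4 or \u1234 would make one of the two re-coding steps raise a UnicodeError).

```python
def unescape_text(s):
    """Process backslash escapes in s while keeping literal non-ASCII text intact."""
    # unicode_escape treats every input byte as latin-1, so literal non-ASCII
    # characters come out as mojibake after this first decode.
    mangled = s.encode("utf-8").decode("unicode_escape")
    # Re-encoding as latin-1 recovers the original UTF-8 bytes, which can
    # then be decoded properly.
    return mangled.encode("latin-1").decode("utf-8")


print(repr(unescape_text("\\aäầ")))  # '\x07äầ'
```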

Unescape string is what I searched for to find this:
>>> a = r'\a'
>>> a.encode().decode('unicode-escape')
'\x07'
>>> '\a'
'\x07'
That's the way to do it with unicode. Since you're in Python 2 and may not be using unicode, you may actually want this one:
>>> a.decode('string-escape')
'\x07'
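Another way to avoid the byte-level pitfalls in Python 3 is to decode only the escape sequences and leave all other text untouched. The following is a sketch, not a complete implementation: the names ESCAPE_RE and decode_escapes are hypothetical, and the pattern covers the common escapes but deliberately leaves out \N{...} named escapes.

```python
import re

# Matches \UXXXXXXXX, \uXXXX, \xXX, octal escapes and single-character escapes.
ESCAPE_RE = re.compile(
    r"\\(?:U[0-9a-fA-F]{8}|u[0-9a-fA-F]{4}|x[0-9a-fA-F]{2}|[0-7]{1,3}|[\\'\"abfnrtv])"
)


def decode_escapes(text):
    """Replace each recognised escape sequence in text with the character it denotes."""
    def _decode(match):
        # The matched sequence is pure ASCII, so unicode_escape is safe here.
        return match.group(0).encode("ascii").decode("unicode_escape")

    return ESCAPE_RE.sub(_decode, text)


print(repr(decode_escapes("\\aäầ \\u1234")))  # '\x07äầ ሴ'
```

Because only the matched sequences pass through the codec, literal non-ASCII characters and escapes above U+00FF can appear in the same string.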

Related

Python: convert unicode character to corresponding Unicode string

How do I convert a unicode character 'ב' to its corresponding Unicode character string '\u05d1' in Python?
I asked the opposite question a few days ago:
Python: convert unicode string to corresponding Unicode character
You can do something like,
>>> x
'ב'
>>> x.encode('ascii', 'backslashreplace').decode('utf-8')
'\\u05d1'
From the docs:
The errors parameter is the same as the parameter of the decode()
method but supports a few more possible handlers. As well as 'strict',
'ignore', and 'replace' (which in this case inserts a question mark
instead of the unencodable character), there is also
'xmlcharrefreplace' (inserts an XML character reference),
backslashreplace (inserts a \uNNNN escape sequence) and namereplace
(inserts a \N{...} escape sequence).
Something like this works
>>> hex(ord('ב'))
'0x5d1'
Python Specific Encodings:
unicode_escape - Encoding suitable as the contents of a Unicode
literal in ASCII-encoded Python source code, except that quotes are
not escaped.
'ב'.encode('unicode-escape').decode() ### '\\u05d1'
print('ב'.encode('unicode-escape').decode()) ### \u05d1
I prefer my own answer which is clean and simple:
import json
json.dumps(unicode_character)  # for 'ב' this returns '"\\u05d1"', double quotes included
decoded_string = "ב"
encoded_string = decoded_string.encode("utf-8")
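For comparison, here is a short Python 3 sketch of the approaches mentioned in the answers above, plus a manual version built from ord(). The variable names are illustrative, and the manual format only covers code points up to U+FFFF.

```python
import json

ch = "ב"

# Error handler: every non-ASCII character becomes a backslash escape.
via_handler = ch.encode("ascii", "backslashreplace").decode("ascii")
# Python-specific codec: also escapes control characters and backslashes.
via_codec = ch.encode("unicode_escape").decode("ascii")
# JSON string encoding: the result is wrapped in double quotes.
via_json = json.dumps(ch)
# Manual formatting from the code point (BMP characters only).
via_ord = "\\u{:04x}".format(ord(ch))

print(via_handler, via_codec, via_json, via_ord)
# \u05d1 \u05d1 "\u05d1" \u05d1
```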

String has unicode code points embedded, how to convert? Python 3 [duplicate]

I'm getting back from a library what looks to be an incorrect unicode string:
>>> title
u'Sopet\xc3\xb3n'
Now, those two hex escapes there are the UTF-8 encoding for U+00F3 LATIN SMALL LETTER O WITH ACUTE. So far as I understand, a unicode string in Python should have the actual character, not the UTF-8 encoding for the character, so I think this is incorrect and presumably a bug either in the library or in my input, right?
The question is, how do I (a) recognize that I have UTF-8 encoded text in my unicode string, and (b) convert this to a proper unicode string?
I'm stumped on (a), as there's nothing wrong, encoding-wise, about that original string (i.e., both are valid characters in their own right, u'\xc3\xb3' == Ã³, but they're not what's supposed to be there).
It looks like I can achieve (b) by eval()ing that repr() output minus the "u" in front to get a str and then decoding the str with UTF-8:
>>> eval(repr(title)[1:]).decode("utf-8")
u'Sopet\xf3n'
>>> print eval(repr(title)[1:]).decode("utf-8")
Sopetón
But that seems a bit kludgy. Is there an officially-sanctioned way to get the raw data out of a unicode string and treat that as a regular string?
a) Try running it through the conversion below and see whether it succeeds.
b)
>>> u'Sopet\xc3\xb3n'.encode('latin-1').decode('utf-8')
u'Sopet\xf3n'
You should use:
>>> title.encode('raw_unicode_escape')
Python2:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape'))
Python3:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape').decode('utf8'))
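For part (a), one option is simply to attempt the latin-1/UTF-8 round trip and keep the result only if it succeeds. A sketch in Python 3 follows; fix_mojibake is a hypothetical name, and the check is a heuristic: a string of latin-1 characters that happens to form valid UTF-8 would be converted as well.

```python
def fix_mojibake(text):
    """Undo UTF-8-decoded-as-latin-1 damage in text when that is possible."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either text holds characters outside latin-1, or its bytes are not
        # valid UTF-8; in both cases leave it unchanged.
        return text


print(fix_mojibake("Sopet\xc3\xb3n"))  # Sopetón
print(fix_mojibake("Sopetón"))         # Sopetón (already correct, left alone)
```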

python regex: how to remove hex dec characters from string [duplicate]

text="\xe2\x80\x94"
print re.sub(r'(\\(?<=\\)x[a-z0-9]{2})+',"replacement_text",text)
output is —
How can I handle the hexadecimal characters in this situation?
Your input doesn't have backslashes. It has 3 bytes, the UTF-8 encoding for the U+2014 EM DASH character:
>>> text = "\xe2\x80\x94"
>>> len(text)
3
>>> text[0]
'\xe2'
>>> text.decode('utf8')
u'\u2014'
>>> print text.decode('utf8')
—
You either need to match those UTF-8 bytes directly, or decode from UTF-8 to unicode and match the codepoint. The latter is preferable; always try to deal with text as Unicode to simplify how many characters you have to transform at a time.
Also note that Python's repr() output (which is used implicitly when echoing in the interactive interpreter or when printing lists, dicts or other containers) uses \xhh escape sequences to represent any non-printable character. For UTF-8 strings, that includes anything outside the ASCII range. You could just replace anything outside that range with:
re.sub(r'[\x80-\xff]+', "replacement_text", text)
Take into account that this'll match multiple UTF-8-encoded characters in a row, and replace these together as a group!
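In Python 3 the same idea can be expressed by decoding first and then matching code points instead of bytes. This is a sketch under the assumption that the data arrives as UTF-8 bytes; the variable names are illustrative.

```python
import re

raw = b"\xe2\x80\x94"       # UTF-8 bytes for U+2014 EM DASH
text = raw.decode("utf-8")  # a str holding a single character

# Replace each run of non-ASCII code points with the replacement text.
cleaned = re.sub(r"[^\x00-\x7f]+", "replacement_text", "a" + text + text + "b")

print(len(text), cleaned)  # 1 areplacement_textb
```

As with the byte-level pattern, two adjacent dashes are replaced together as one group.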
Your input is in hex, not an actual "\xe2\x80\x94".
\x is just the way to say that the following characters should be interpreted in hex.
This was explained in this post.

How to use Python convert a unicode string to the real string [duplicate]

I have used Python to get some info through urllib2, but the info is unicode string.
I've tried something like below:
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print unicode(a).encode("gb2312")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a.encode("utf-8").decode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print u""+a
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print str(a).decode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print str(a).encode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a.decode("utf-8").encode("gb2312")
but all results are the same:
\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728
And I want to get the following Chinese text:
方法,删除存储在
You need to convert the string to a unicode string.
First of all, the backslashes in a are auto-escaped:
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a # Prints \u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728
a # Prints '\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728'
So playing with the encoding / decoding of this escaped string makes no difference.
You can either use a unicode literal or convert the string into a unicode string.
To use a unicode literal, just add a u in front of the string:
a = u"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
To convert an existing string into a unicode string, you can call unicode, with unicode_escape as the encoding parameter:
print unicode(a, encoding='unicode_escape') # Prints 方法,删除存储在
I bet you are getting the string from a JSON response, so the second method is likely to be what you need.
BTW, the unicode_escape encoding is a Python specific encoding which is used to "produce a string that is suitable as Unicode literal in Python source code".
Where are you getting this data from? Perhaps you could share the method by which you are downloading and extracting it.
Anyway, it kind of looks like a remnant of some JSON encoded string? Based on that assumption, here is a very hacky (and not entirely serious) way to do it:
>>> a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
>>> a
'\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728'
>>> s = '"{}"'.format(a)
>>> s
'"\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728"'
>>> import json
>>> json.loads(s)
u'\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728'
>>> print json.loads(s)
方法,删除存储在
This involves recreating a valid JSON encoded string by wrapping the string held in a in double quotes, then decoding the JSON string into a Python unicode string.
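In Python 3, where unicode() and the print statement no longer exist, the two approaches above might look like the following sketch. It assumes the input contains only ASCII characters and \uXXXX escapes (and, for the JSON route, no unescaped double quotes); the variable names are illustrative.

```python
import json

# Raw string: the backslashes are literal, as in data read from a response.
a = r"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"

# Route 1: interpret the escapes with the unicode_escape codec.
via_codec = a.encode("ascii").decode("unicode_escape")

# Route 2: wrap the text in double quotes and parse it as a JSON string.
via_json = json.loads('"{}"'.format(a))

print(via_codec == via_json, len(via_codec))  # True 8
```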

How can I print the decimal representation of a unicode string?

I am trying to compare unicode strings in Python. Since a lot of the symbols look similar and some may contain non-printable characters, I am having trouble debugging where my comparisons are failing. Is there a way to take a string of unicode characters and print their unicode codes? i.e.:
>>> unicode_print('❄')
'\u2744'
You can encode that string with some other encoding:
>>> s = '❄'
>>> s.encode() # "utf8" by default
b'\xe2\x9d\x84'
And for the output you specified, the unicode_escape codec does the job:
>>> s.encode("unicode_escape")
b'\\u2744'
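Since the question title asks for decimal values, here is a small Python 3 sketch that lists each character's code point in decimal and in U+XXXX form, along with its Unicode name. The function name describe is hypothetical.

```python
import unicodedata


def describe(s):
    """Return (decimal code point, U+XXXX label, Unicode name) for each character of s."""
    return [
        (ord(ch), "U+{:04X}".format(ord(ch)), unicodedata.name(ch, "<unnamed>"))
        for ch in s
    ]


print(describe("❄"))  # [(10052, 'U+2744', 'SNOWFLAKE')]
```

Printing this for both sides of a failing comparison makes look-alike or non-printable characters easy to spot.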
