python regex: how to remove hex dec characters from string [duplicate] - python

This question already has answers here:
What does a leading `\x` mean in a Python string `\xaa`
(2 answers)
Closed 7 years ago.
text="\xe2\x80\x94"
print re.sub(r'(\\(?<=\\)x[a-z0-9]{2})+',"replacement_text",text)
output is —
How can I handle the hexadecimal escape characters in this situation?

Your input doesn't have backslashes. It has 3 bytes, the UTF-8 encoding for the U+2014 EM DASH character:
>>> text = "\xe2\x80\x94"
>>> len(text)
3
>>> text[0]
'\xe2'
>>> text.decode('utf8')
u'\u2014'
>>> print text.decode('utf8')
—
You either need to match those UTF-8 bytes directly, or decode from UTF-8 to unicode and match the codepoint. The latter is preferable; always try to deal with text as Unicode to simplify how many characters you have to transform at a time.
Also note that Python's repr() output (which is used implicitly when echoing in the interactive interpreter or when printing lists, dicts or other containers) uses \xhh escape sequences to represent any non-printable character. For UTF-8 byte strings, that includes anything outside the ASCII range. You could just replace anything outside that range with:
re.sub(r'[\x80-\xff]+', "replacement_text", text)
Take into account that this'll match multiple UTF-8-encoded characters in a row, and replace these together as a group!
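For readers on Python 3, the two options described above (match the UTF-8 bytes, or decode and match code points) might look like the sketch below. The variable names are illustrative and not from the question; the bytes literal stands in for the asker's Python 2 string.

```python
import re

raw = b"\xe2\x80\x94"  # the three UTF-8 bytes for U+2014 EM DASH

# Option 1: match the bytes directly with a bytes pattern.
bytes_result = re.sub(rb"[\x80-\xff]+", b"replacement_text", raw)

# Option 2 (usually preferable): decode first, then match code points.
text = raw.decode("utf-8")
text_result = re.sub(r"[^\x00-\x7f]+", "replacement_text", text)

print(bytes_result)  # b'replacement_text'
print(text_result)   # replacement_text
```

As with the pattern above, both versions replace a whole run of non-ASCII characters as one group.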

Your input consists of bytes written with hex escapes; it does not contain the literal characters \, x, e, 2 and so on.
\x is just the way to say that the following two characters should be interpreted as a hexadecimal byte value.
This was explained in this post.
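A small Python 3 sketch of that difference, assuming a simplified version of the asker's pattern: the regex only matches when the string really contains backslashes, which is the case for a raw string but not for a string written with \x escapes.

```python
import re

escaped = "\xe2\x80\x94"   # three characters; the escapes are resolved by the parser
literal = r"\xe2\x80\x94"  # twelve characters: backslash, x, e, 2, ...

print(len(escaped), len(literal))  # 3 12

# Simplified form of the pattern from the question (an assumption, not the original).
pattern = r"(?:\\x[0-9a-f]{2})+"

print(re.sub(pattern, "X", literal))             # X
print(re.sub(pattern, "X", escaped) == escaped)  # True: nothing to match
```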


How to calculate the number of all elements including escape sequences in a string? [duplicate]

This question already has answers here:
How do I get the string representation of a variable in python?
(4 answers)
Closed last month.
I have a string, and I have to count all elements in this string.
str = '\r\n\r\n\r\n \r\n \xa0\xa0\r\nIntroduction\r\n\r\n\r\nHello\r\n\r\nWorld\r\nProblems...\r\nHow to calculate numbers...\r\nConclusion\r\n\r\n\r\n\xa0\r\n\r\nHello world.'
These elements contain numbers, letters, escape sequences, whitespaces, commas, etc.
Is there any way to count all elements in this kind of string in Python?
I know that len() and count() cannot help. And I also tried some regex methods like re.findall(r'.', str), but it cannot find elements like \n and also can only find \r instead of \ and r.
Edit:
To be more clear, I want to count \n as 2, not 1, and also \xa0 as 4, not 1.
\ is a special character in Python string literals, so you have to escape each backslash, as in str = '\\r\\n ', or use a raw string, as in str = r'\r\n '. After that, len() counts \ as an independent character.
Python compiles your string literal into a Python string where escape sequences such as \n are replaced with the character they stand for (in this case the Unicode U+000A newline). len would count this 2-character sequence as a single character.
By the time your code sees this string, the original escape sequence from the literal is gone. But the repr representation adds escape sequences back, so you could take the length of that. Keep in mind that repr also adds the surrounding quotes, so subtract 2 if you only want the contents.
>>> s = '\r\n\r\n\r\n \r\n \xa0\xa0\r\nIntroduction\r\n\r\n\r\nHello\r\n\r\nWorld\r\nProblems...\r\nHow to calculate numbers...\r\nConclusion\r\n\r\n\r\n\xa0\r\n\r\nHello world.'
>>> print(len(s))
123
>>> print(len(repr(s)))
170
This isn't going to be 100% accurate because there is more than one way to construct a unicode character in a literal string. For instance "\n" and "\x0a" both decode to the same newline character and there is no way to know which form it came from.
Alternately, you could use "raw" strings that do not escape the characters. So, r"\n" is length 2.
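A shorter example may make the different counts easier to compare. This is only a sketch on a made-up string; `repr(s)[1:-1]` strips the quotes that repr adds, and the `unicode_escape` codec is shown as an alternative that produces no quotes in the first place.

```python
s = '\r\n \xa0Hello'       # escapes are resolved when the literal is parsed
raw = r'\r\n \xa0Hello'    # raw literal keeps every backslash

print(len(s))                            # 9
print(len(repr(s)))                      # 16 (includes the two quotes)
print(len(repr(s)[1:-1]))                # 14
print(len(s.encode('unicode_escape')))   # 14
print(len(raw))                          # 14
```

The counts agree here because repr and the raw literal happen to spell each escape the same way; as noted above, that is not guaranteed for every literal.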

Python unicode strings

I'm a Python newbie and I'm trying to make one script that writes some strings in a file if there's a difference. Problem is that original string has some characters in \uNNNN Unicode format and I cannot convert the new string to the same Unicode format.
The original string I'm trying to compare: \u00A1 ATENCI\u00D3N! \u25C4
New string is received as: ¡ ATENCIÓN! ◄
And this the code
str = u'¡ ATENCIÓN! ◄'
print(str)
str1 = str.encode('unicode_escape')
print (str1)
str2 = str1.decode()
print (str2)
And the result is:
¡ ATENCIÓN! ◄
b'\\xa1 ATENCI\\xd3N! \\u25c4'
\xa1 ATENCI\xd3N! \u25c4
So, how can I get \xa1 ATENCI\xd3N! \u25c4 converted to \u00A1 ATENCI\u00D3N! \u25C4 as this is the only Unicode format I can save?
Note: Cases of characters in strings also need to be the same for comparison.
The issue is, according to the docs (read down a little bit, between the escape sequences tables), the \u, \U, and \N Unicode escape sequences are only recognized in string literals. That means that once the literal is evaluated in memory, such as in a variable assignment:
s = "\u00A1 ATENCI\u00D3N! \u25C4"
any attempt to encode it with str.encode('unicode_escape') produces a bytes object that uses the shortest escape available, \x for code points below 256 and \u only above that:
b'\\xa1 ATENCI\\xd3N! \\u25c4'
Using
b'\\xa1 ATENCI\\xd3N! \\u25c4'.decode("unicode_escape")
will convert it back to '¡ ATENCIÓN! ◄'. This uses the actual (intended) representation of the characters, and not the \uXXXX escape sequences of the original string s.
So, what you should do is not mess around with encoding and decoding things. Observe:
print("\u00A1 ATENCI\u00D3N! \u25C4" == '¡ ATENCIÓN! ◄')
True
That's all the comparison you need to do.
For further reading, you may be interested in:
How to work with surrogate pairs in Python?
Encodings and Unicode from the Python docs.
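If the file really must contain uppercase \uXXXX escapes for every non-ASCII character, as the question asks, one possible approach is to format the code points by hand. The helper name `to_u_escapes` is hypothetical, and the use of \UXXXXXXXX for characters beyond U+FFFF is an assumption; the target format may expect something else there (surrogate pairs, for example).

```python
def to_u_escapes(text):
    """Write each non-ASCII character as an uppercase 4-digit hex escape."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 0x80:
            parts.append(ch)                         # plain ASCII stays as-is
        elif code <= 0xFFFF:
            parts.append("\\u{:04X}".format(code))
        else:
            # Assumption: 8-digit form for characters outside the BMP.
            parts.append("\\U{:08X}".format(code))
    return "".join(parts)

new = "¡ ATENCIÓN! ◄"
print(to_u_escapes(new))  # \u00A1 ATENCI\u00D3N! \u25C4
```

The result can then be compared character by character with the escaped text read from the file.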

Byte string and unicode string inputs [duplicate]

This question already has answers here:
byte string vs. unicode string. Python
(2 answers)
Closed 3 years ago.
I'm a Python noob. I was reading through some documentation and I came across something that baffled me.
What is the difference between byte strings and Unicode strings in Python? Especially in terms of what is input and what is output.
Please explain using the simplest terms possible
N.B : I use python 3.x
I searched around and found that byte strings hold raw bytes (integers from 0 to 255); a bytes literal can contain only ASCII characters directly, and any other byte must be written as an escape such as \xe2. Unicode strings can contain, well, all Unicode characters.
In Python 2.x, byte strings are written like ordinary strings while Unicode strings have a "u" prefix.
a = 'foobar'    # byte string
b = u'foo-bar'  # unicode string
In Python 3.x it is the other way around: ordinary strings are Unicode and byte strings have a "b" prefix.
a = b'foobar'   # byte string
b = 'foo-bar'   # unicode string
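In Python 3 terms, the difference shows up as soon as the text contains a non-ASCII character. The sketch below uses an arbitrary example word and UTF-8 as the encoding; both are illustrative choices.

```python
text = "café"                # str: a sequence of Unicode code points
data = text.encode("utf-8")  # bytes: a sequence of integers from 0 to 255

print(len(text))   # 4
print(len(data))   # 5, because é takes two bytes in UTF-8
print(data)        # b'caf\xc3\xa9'
print(data[3])     # 195, indexing bytes gives an integer
print(data.decode("utf-8") == text)  # True
```

Input and output boundaries (files, sockets) deal in bytes; encode when writing and decode when reading, and keep str everywhere in between.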

Remove escape character from string [duplicate]

This question already has answers here:
Process escape sequences in a string in Python
(8 answers)
Closed last month.
I would like to turn this string:
a = '\\a'
into this one
b = '\a'
It doesn't seem like there is an obvious way to do this with replace?
To be more precise, I want to change the escaping of the backslash to escaping the character a.
The character '\a' is the ASCII BEL character, chr(7).
To do the conversion in Python 2:
from __future__ import print_function
a = '\\a'
c = a.decode('string-escape')
print(repr(a), repr(c))
output
'\\a' '\x07'
And for future reference, in Python 3:
a = '\\a'
b = bytes(a, encoding='ascii')
c = b.decode('unicode-escape')
print(repr(a), repr(c))
This gives identical output to the above snippet.
In Python 3, if you were working with bytes objects you'd do something like this:
a = b'\\a'
c = bytes(a.decode('unicode-escape'), 'ascii')
print(repr(a), repr(c))
output
b'\\a' b'\x07'
As Antti Haapala mentions, this simple strategy for Python 3 won't work if the source string contains Unicode characters too. In that case, please see his answer for a more robust solution.
On Python 2 you can use
>>> '\\a'.decode('string_escape')
'\x07'
Note how \a is repr'd as \x07.
If the string is a unicode string that also contains extended characters, you need to encode it to a bytestring explicitly first; otherwise the default encoding (ascii!) is used to convert the unicode object to a bytestring.
However, this codec doesn't exist in Python 3, and things are very much more complicated. You can use the unicode-escape to decode but it is very broken if the source string contains unicode characters too:
>>> '\\aäầ'.encode().decode('unicode_escape')
'\x07Ã¤áº§'
The resulting string doesn't consist of Unicode characters but bytes decoded as latin-1. The solution is to re-encode to latin-1 and then decode as utf8 again:
>>> '\\aäầ\u1234'.encode().decode('unicode_escape').encode('latin1').decode()
'\x07äầሴ'
Unescape string is what I searched for to find this:
>>> a = r'\a'
>>> a.encode().decode('unicode-escape')
'\x07'
>>> '\a'
'\x07'
That's the way to do it with unicode. Since you're in Python 2 and may not be using unicode, you may actually want this one:
>>> a.decode('string-escape')
'\x07'
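For Python 3 strings that mix escape sequences with non-ASCII text, another option is to decode only the escape sequences and leave everything else untouched. This is a sketch, not the method from any of the answers above: the names `ESCAPE_RE` and `decode_escapes` are made up here, and the pattern deliberately ignores \N{...} escapes.

```python
import re

# Matches only well-formed escapes, so every match is pure ASCII.
ESCAPE_RE = re.compile(
    r"""
    \\ (?: U[0-9a-fA-F]{8}   # 8-digit hex escape
         | u[0-9a-fA-F]{4}   # 4-digit hex escape
         | x[0-9a-fA-F]{2}   # 2-digit hex escape
         | [0-7]{1,3}        # octal escape
         | [\\'"abfnrtv]     # single-character escapes
       )
    """,
    re.VERBOSE,
)

def decode_escapes(text):
    """Replace each escape sequence in text with the character it names."""
    def _decode(match):
        return match.group(0).encode("ascii").decode("unicode_escape")
    return ESCAPE_RE.sub(_decode, text)

print(repr(decode_escapes('\\a')))     # '\x07'
print(repr(decode_escapes('\\aäầ')))   # '\x07äầ'
```

Because non-ASCII characters never pass through the codec, no latin-1 round trip is needed.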

How can I print the decimal representation of a unicode string?

I am trying to compare unicode strings in Python. Since a lot of the symbols look similar and some may contain non-printable characters, I am having trouble debugging where my comparisons are failing. Is there a way to take a string of unicode characters and print their unicode codes? i.e.:
>>> unicode_print('❄')
'\u2744'
You can encode that string with some other encoding:
>>> s = '❄'
>>> s.encode() # "utf8" by default
b'\xe2\x9d\x84'
And for the output you specified, the unicode_escape codec gives the escaped form (as a bytes object):
>>> s.encode("unicode_escape")
b'\\u2744'
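For debugging comparisons, it can also help to list each character with its decimal and hexadecimal code point. The function name `describe` below is hypothetical; `ord` and `unicodedata.name` are standard library calls.

```python
import unicodedata

def describe(text):
    """Return (char, decimal code point, U+XXXX form, Unicode name) per character."""
    return [
        (ch, ord(ch), "U+{:04X}".format(ord(ch)), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

for row in describe('❄'):
    print(row)  # ('❄', 10052, 'U+2744', 'SNOWFLAKE')
```

Running it on both strings of a failing comparison shows exactly which positions differ, including invisible characters, which appear with their names or as "<unnamed>".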
