Remove bad character "\xC2" from Python string

I have the following code:
import sys

string_msg = '\x80\x01\x00\x00\x00\x00\x53\x58\x00\x1C\x00\x00\x00\x08\x00\x01\x00\x74\x00\x00\x00\x0A\x00\x54\x00\x00\x00\x03'
print(string_msg)
if sys.version < '3':
    print(":".join("{:02x}".format(ord(c)) for c in string_msg))
else:
    print(":".join("{:02x}".format(c) for c in string_msg.encode()))
In python 2, the result is:
80:01:00:00:00:00:53:58:00:1c:00:00:00:08:00:01:00:74:00:00:00:0a:00:54:00:00:00:03
But in python 3, the result is:
c2:80:01:00:00:00:00:53:58:00:1c:00:00:00:08:00:01:00:74:00:00:00:0a:00:54:00:00:00:03
Right now I need to run this code in Python 3, so I want to drop the first byte in order to get rid of the "c2" and everything should be OK. But I have tried many pieces of code I found on this forum, such as:
string_msg = string_msg[1:]
string_msg.replace('\xC2', '')
string_msg = ''.join([i if ord(i) < 130 else '' for i in string_msg])
The result is always the same:
01:00:00:00:00:53:58:00:1c:00:00:00:08:00:01:00:74:00:00:00:0a:00:54:00:00:00:03
That removes the second byte 80 as well. So my question is: how can I remove just the first byte c2, and why is the second byte also removed when I try to do that?

The issue is that string_msg is a bytestring on Python 2 but, despite looking the same, a Unicode string on Python 3. A byte b'\x80' is a completely different concept from a Unicode codepoint u'\x80': the same Unicode codepoint can be represented using different bytes in different encodings, and vice versa the same byte may represent different characters in different encodings.
If string_msg is a sequence of bytes then use a b'' literal:
data = b'\x80\x01\x00\x00\x00\x00\x53\x58\x00\x1C\x00\x00\x00\x08'
print(":".join(map("{:02x}".format, bytearray(data))))
# -> 80:01:00:00:00:00:53:58:00:1c:00:00:00:08
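On Python 3.8+ there is also bytes.hex() with a separator, which gives the same dump without the format/join dance (just an alternative, not required):
data = b'\x80\x01\x00\x00\x00\x00\x53\x58\x00\x1C\x00\x00\x00\x08'
print(data.hex(':'))
# -> 80:01:00:00:00:00:53:58:00:1c:00:00:00:08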

You can convert text made up of the first 256 Unicode code points to the corresponding byte values by encoding it as ISO 8859-1 (latin-1):
>>> '\x80'.encode('latin-1')
b'\x80'
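Applied to your case, a minimal sketch (assuming string_msg is the accidentally decoded str from the question): encode it back to bytes with latin-1 instead of the default UTF-8, and the spurious c2 never appears.
string_msg = '\x80\x01\x00\x00\x00\x00\x53\x58\x00\x1C\x00\x00\x00\x08'
data = string_msg.encode('latin-1')  # recovers the original byte values 1:1
print(":".join("{:02x}".format(b) for b in data))
# -> 80:01:00:00:00:00:53:58:00:1c:00:00:00:08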

Related

Does python provide a list of special characters for uuencoding?

I can find the uuencode character mappings on wikipedia. Does python have a means to loop through this list?
for x in [builtin_uuencode_mappings]:
    print(x)
I would like to focus on special characters such as "!@#$" and so on.
Python already has built-in support for encoding and decoding uuencoded messages.
from codecs import encode  # decode also works
print(encode(b"my message", 'uu'))
# -> b'begin 666 <data>\n*;7D@;65S<V%G90  \n \nend\n'
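For completeness, decoding works the same way, so the call round-trips (a quick sketch):
from codecs import encode, decode
print(decode(encode(b"my message", 'uu'), 'uu'))
# -> b'my message'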
Internally Python uses the binascii module to encode or decode the message line by line. We can use that to encode a single byte or even all bytes in range(64) (because uuencoding transforms 6 bits into one ASCII character: 2**6 == 64).
To generate all the necessary bit patterns we can count from 0 to 63 and shift each value two bits to the left. That way the highest 6 bits count from 0 to 63. Then it's just a matter of converting that into Python bytes, uuencoding them and extracting the actual character.
In python2
from binascii import b2a_uu

for byte in range(64):
    pattern = chr(byte << 2)  # str and bytes are identical in python2
    encoded = b2a_uu(pattern)
    character = encoded[1]  # encoded[0] is the character count in that line
    print "{:2} -> {!r}".format(byte, character)
In python3 the first part is a little bit ugly.
from binascii import b2a_uu

for byte in range(64):
    pattern = bytes([byte << 2])  # chr().encode() will not work!
    encoded = b2a_uu(pattern)
    character = chr(encoded[1])
    print(f"{byte:2} -> {character!r}")
Thanks to Mark Ransom who explained why the bit shifting actually works.

How can I convert utf8 code number to unicode code number in Python3

I want to generate a list of all UTF-8 characters.
I wrote the code below, but it didn't work well.
I think that is because chr() expects a Unicode code point, but I gave it a UTF-8 code number.
It seems I have to convert the UTF-8 code number to a Unicode code point, but I don't know how.
How can I do that? Or do you know a better way?
def utf8_2byte():
    characters = []
    # first byte range: [C2-DF]
    for first in range(0xC2, 0xDF + 1):
        # second byte range: [80-BF]
        for second in range(0x80, 0xBF + 1):
            num = (first << 8) + second
            line = [hex(num), chr(num)]
            characters.append(line)
    return characters
I expect:
# UTF8 code number, UTF8 character
[0xc380,À]
[0xc381,Á]
[0xc382,Â]
actually:
[0xc380,쎀]
[0xc381,쎁]
[0xc382,쎂]
In Python 3, chr takes Unicode code points, not UTF-8. U+C380 is in the Hangul range. Instead you can use bytearray for the decode:
>>> bytearray((0xc3, 0x80)).decode('utf-8')
'À'
There are other methods too, like struct or ctypes. Anything that assembles the raw bytes and decodes them as UTF-8 will do.
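Putting that together for the function in the question, a minimal sketch (assuming you only want the 2-byte range, as in the original code): build the two bytes and decode them instead of feeding the combined number to chr().
def utf8_2byte():
    characters = []
    # first byte range: [C2-DF], second byte range: [80-BF]
    for first in range(0xC2, 0xDF + 1):
        for second in range(0x80, 0xBF + 1):
            num = (first << 8) + second
            char = bytes([first, second]).decode('utf-8')
            characters.append([hex(num), char])
    return characters

print(utf8_2byte()[64:67])  # the entries for 0xc380, 0xc381, 0xc382
# -> [['0xc380', 'À'], ['0xc381', 'Á'], ['0xc382', 'Â']]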
Unicode is a character set, while UTF-8 is an encoding: an algorithm for converting code points from Unicode to bytes at the machine level and vice versa.
The code point 0xC380 is 쎀 in the Unicode standard.
The bytes 0xC3 0x80 are À when you decode them using the UTF-8 encoding.
>>> s = "쎀"
>>> hex(ord(s))
'0xc380'
>>> b = bytes.fromhex("C3 80")
>>> b
b'\xc3\x80'
>>> b.decode("utf8")
'À'
>>> bytes((0xc3, 0x80)).decode("utf8")
'À'

Bytes in a unicode Python string

In Python 2, Unicode strings may contain both unicode and bytes:
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I understand that this is absolutely not something one should write in his own code, but this is a string that I have to deal with.
The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).
My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).
Encoding it to UTF-8 yields
>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'
Which then decoded from UTF-8 gives the initial string with bytes in them, which is not good:
>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I found a hacky way to solve the problem, however:
>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!
This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?
In Python 2, Unicode strings may contain both unicode and bytes:
No, they may not. They contain Unicode characters.
Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).
There is no way to look at the string and tell that the \xd0 byte is supposed to be part of some UTF-8 encoded character, or if it actually stands for that Unicode character by itself.
However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 as they were.
(In response to the comments above): this code converts everything that looks like utf8 and leaves other codepoints as is:
import re

a = u'\u0420\u0443\u0441 utf:\xd0\xb5\xd0\xba bytes:bl\xe4\xe4'

def convert(s):
    try:
        return s.group(0).encode('latin1').decode('utf8')
    except:
        return s.group(0)

a = re.sub(r'[\x80-\xFF]+', convert, a)
print a.encode('utf8')
Result:
Рус utf:ек bytes:blää
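The same latin1/utf8 trick carries over to Python 3 if you end up with such a mis-decoded str there; a minimal sketch (variable names are mine, not from the answer above):
import re

a = '\u0420\u0443\u0441 utf:\xd0\xb5\xd0\xba bytes:bl\xe4\xe4'

def convert(match):
    try:
        # reinterpret the run of U+0080..U+00FF code points as raw bytes, then decode as UTF-8
        return match.group(0).encode('latin1').decode('utf8')
    except UnicodeError:
        return match.group(0)  # leave runs that are not valid UTF-8 alone

print(re.sub(r'[\x80-\xFF]+', convert, a))
# -> Рус utf:ек bytes:blää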
The problem is that your string is not actually in any one consistent encoding. Your example string:
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
is mixing Python's internal representation of unicode strings with utf-8 encoded text. If we just consider the 'special' characters:
>>> orig = u'\u0435\u043a'
>>> bytes = u'\xd0\xb5\xd0\xba'
>>> print orig
ек
>>> print bytes
ек
But you say, bytes is utf-8 encoded:
>>> print bytes.encode('utf-8')
ек
>>> print bytes.encode('utf-8').decode('utf-8')
ек
Wrong! But what about:
>>> bytes = '\xd0\xb5\xd0\xba'
>>> print bytes
ек
>>> print bytes.decode('utf-8')
ек
Hurrah.
So. What does this mean for me? It means you're (probably) solving the wrong problem. What you should be asking us/trying to figure out is why your strings are in this form to begin with and how to avoid it/fix it before you have them all mixed up.
You should convert unichrs to chrs, then decode them.
u'\xd0' == u'\u00d0' is True
$ python
>>> import re
>>> a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
>>> re.sub(r'[\000-\377]*', lambda m:''.join([chr(ord(i)) for i in m.group(0)]).decode('utf8'), a)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'
r'[\000-\377]*' will match unichrs u'[\u0000-\u00ff]*'
u'\xd0\xb5\xd0\xba' == u'\u00d0\u00b5\u00d0\u00ba'
You are using utf8-encoded bytes as unicode code points (this is the PROBLEM).
I solve the problem by treating those mistaken unichars as the corresponding bytes.
I search for all these mistaken unichars, convert them to chars, then decode them.
If I'm wrong, please tell me.
You've already got an answer, but here's a way to unscramble UTF-8-like Unicode sequences that is less likely to decode latin-1 Unicode sequences in error. The re.sub function:
Matches Unicode characters < U+0100 that resemble valid UTF-8 sequences (ref: RFC 3629).
Encodes the Unicode sequence into its equivalent latin-1 byte sequence.
Decodes the sequence using UTF-8 back into Unicode.
Replaces the original UTF-8-like sequence with the matching Unicode character.
Note this could still match a Unicode sequence if just the right characters appear next to each other, but it is much less likely.
import re
# your example
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
# printable Unicode characters < 256.
a += ''.join(chr(n) for n in range(32,256)).decode('latin1')
# a few UTF-8 characters decoded as latin1.
a += ''.join(unichr(n) for n in [2**7-1,2**7,2**11-1,2**11]).encode('utf8').decode('latin1')
# Some non-BMP characters
a += u'\U00010000\U0010FFFF'.encode('utf8').decode('latin1')
print repr(a)
# Unicode codepoint sequences that resemble UTF-8 sequences.
p = re.compile(ur'''(?x)
\xF0[\x90-\xBF][\x80-\xBF]{2} | # Valid 4-byte sequences
[\xF1-\xF3][\x80-\xBF]{3} |
\xF4[\x80-\x8F][\x80-\xBF]{2} |
\xE0[\xA0-\xBF][\x80-\xBF] | # Valid 3-byte sequences
[\xE1-\xEC][\x80-\xBF]{2} |
\xED[\x80-\x9F][\x80-\xBF] |
[\xEE-\xEF][\x80-\xBF]{2} |
[\xC2-\xDF][\x80-\xBF] # Valid 2-byte sequences
''')
def replace(m):
    return m.group(0).encode('latin1').decode('utf8')
print
print repr(p.sub(replace,a))
Output:
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\xc2\x80\xdf\xbf\xe0\xa0\x80\xf0\x90\x80\x80\xf4\x8f\xbf\xbf'
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\x80\u07ff\u0800\U00010000\U0010ffff'
I solved it by
unicodeText.encode("utf-8").decode("unicode-escape").encode("latin1")

How can I determine the byte length of a utf-8 encoded string in Python?

I am working with Amazon S3 uploads and am having trouble with key names being too long. S3 limits the length of the key by bytes, not characters.
From the docs:
The name for a key is a sequence of Unicode characters whose UTF-8 encoding is at most 1024 bytes long.
I also attempt to embed metadata in the file name, so I need to be able to calculate the current byte length of the string using Python to make sure the metadata does not make the key too long (in which case I would have to use a separate metadata file).
How can I determine the byte length of the utf-8 encoded string? Again, I am not interested in the character length... rather the actual byte length used to store the string.
def utf8len(s):
    return len(s.encode('utf-8'))
Works fine in Python 2 and 3.
Use the string 'encode' method to convert from a character-string to a byte-string, then use len() like normal:
>>> s = u"¡Hola, mundo!"
>>> len(s)
13 # characters
>>> len(s.encode('utf-8'))
14 # bytes
Encoding the string and using len on the result works great, as other answers have shown. It does need to build a throw-away copy of the string - if you're working with very large strings this might not be optimal (I don't consider 1024 bytes to be large though). The structure of UTF-8 allows you to get the length of each character very easily without even encoding it, although it might still be easier to encode a single character. I present both methods here, they should give the same result.
def utf8_char_len_1(c):
    codepoint = ord(c)
    if codepoint <= 0x7f:
        return 1
    if codepoint <= 0x7ff:
        return 2
    if codepoint <= 0xffff:
        return 3
    if codepoint <= 0x10ffff:
        return 4
    raise ValueError('Invalid Unicode character: ' + hex(codepoint))

def utf8_char_len_2(c):
    return len(c.encode('utf-8'))

utf8_char_len = utf8_char_len_1

def utf8len(s):
    return sum(utf8_char_len(c) for c in s)
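A quick sanity check of the two helpers (my own sample string, Python 3, where iterating a str yields whole code points):
sample = 'a\u00e9\u20ac\U0001F600'  # 1-, 2-, 3- and 4-byte characters in UTF-8
assert utf8_char_len_1('\u20ac') == utf8_char_len_2('\u20ac') == 3
assert utf8len(sample) == len(sample.encode('utf-8')) == 10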

Truncating unicode so it fits a maximum size when encoded for wire transfer

Given a Unicode string and these requirements:
The string be encoded into some byte-sequence format (e.g. UTF-8 or JSON unicode escape)
The encoded string has a maximum length
For example, the iPhone push service requires JSON encoding with a maximum total packet size of 256 bytes.
What is the best way to truncate the string so that it re-encodes to valid Unicode and that it displays reasonably correctly?
(Human language comprehension is not necessary—the truncated version can look odd e.g. for an orphaned combining character or a Thai vowel, just as long as the software doesn't crash when handling the data.)
See Also:
Related Java question: How do I truncate a java string to fit in a given number of bytes, once UTF-8 encoded?
Related Javascript question: Using JavaScript to truncate text to a certain size
def unicode_truncate(s, length, encoding='utf-8'):
    encoded = s.encode(encoding)[:length]
    return encoded.decode(encoding, 'ignore')
Here is an example for a Unicode string where each character is represented with 2 bytes in UTF-8 and that would've crashed if the split Unicode code point wasn't ignored:
>>> unicode_truncate(u'абвгд', 5)
u'\u0430\u0431'
One of UTF-8's properties is that it is easy to resync, that is, to find the Unicode character boundaries in the encoded bytestream. All you need to do is cut the encoded string at the maximum length, then walk backwards from the end removing any bytes that are > 127 -- those are part of, or the start of, a multibyte character.
As written, that is too simple -- it will erase back to the last ASCII character, possibly the whole string. What we actually need to check for is a truncated two-byte (starting with 110yyyxx), three-byte (1110yyyy) or four-byte (11110zzz) sequence at the cut point.
Here is a Python 2.6 implementation in clear code. Optimization should not be an issue -- regardless of length, we only check the last 1-4 bytes.
# coding: UTF-8

def decodeok(bytestr):
    try:
        bytestr.decode("UTF-8")
    except UnicodeDecodeError:
        return False
    return True

def is_first_byte(byte):
    """return if the UTF-8 #byte is the first byte of an encoded character"""
    o = ord(byte)
    return ((0b10111111 & o) != o)

def truncate_utf8(bytestr, maxlen):
    u"""
    >>> us = u"ウィキペディアにようこそ"
    >>> s = us.encode("UTF-8")
    >>> trunc20 = truncate_utf8(s, 20)
    >>> print trunc20.decode("UTF-8")
    ウィキペディ
    >>> len(trunc20)
    18
    >>> trunc21 = truncate_utf8(s, 21)
    >>> print trunc21.decode("UTF-8")
    ウィキペディア
    >>> len(trunc21)
    21
    """
    L = maxlen
    for x in xrange(1, 5):
        if is_first_byte(bytestr[L-x]) and not decodeok(bytestr[L-x:L]):
            return bytestr[:L-x]
    return bytestr[:L]

if __name__ == '__main__':
    # unicode doctest hack
    import sys
    reload(sys)
    sys.setdefaultencoding("UTF-8")
    import doctest
    doctest.testmod()
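For Python 3, a minimal sketch of the same idea (not the answer's original code: indexing bytes already yields ints, so ord() and the setdefaultencoding hack go away, and the boundary test checks for continuation bytes directly):
def decodeok(data):
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def is_first_byte(byte):
    # continuation bytes look like 0b10xxxxxx; anything else starts a character
    return (byte & 0b11000000) != 0b10000000

def truncate_utf8(bytestr, maxlen):
    for x in range(1, 5):
        if is_first_byte(bytestr[maxlen - x]) and not decodeok(bytestr[maxlen - x:maxlen]):
            return bytestr[:maxlen - x]
    return bytestr[:maxlen]

s = "ウィキペディアにようこそ".encode("utf-8")
print(truncate_utf8(s, 20).decode("utf-8"))  # -> ウィキペディ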
This will do for UTF-8, if you prefer to do it with a regex.
import re

partial = "\xc2\x80\xc2\x80\xc2"
re.sub("([\xf0-\xf7][\x80-\xbf]{0,2}|[\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$", "", partial)
"\xc2\x80\xc2\x80"
It covers UTF-8 strings from U+0080 (2 bytes) to U+10FFFF (4 bytes).
It is really straightforward, just like the UTF-8 algorithm itself.
From U+0080 to U+07FF, 2 bytes are needed: 110yyyxx 10xxxxxx.
That means if you see only a single byte like 110yyyxx (0b11000000 to 0b11011111, i.e. [\xc0-\xdf]) at the end, it is a partial character.
From U+0800 to U+FFFF, 3 bytes are needed: 1110yyyy 10yyyyxx 10xxxxxx.
If you see only 1 or 2 of those bytes at the end, it is a partial character; it will match the pattern [\xe0-\xef][\x80-\xbf]{0,1}.
From U+10000 to U+10FFFF, 4 bytes are needed: 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx.
If you see only 1 to 3 of those bytes at the end, it is a partial character; it will match the pattern [\xf0-\xf7][\x80-\xbf]{0,2}.
Update: if you only need the Basic Multilingual Plane, you can drop the last pattern. This will do:
re.sub("([\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$", "", partial)
Let me know if there is any problem with that regex.
For JSON formatting (unicode escape, e.g. \uabcd), I am using the following algorithm to achieve this:
Encode the Unicode string into the backslash-escape format which it would eventually be in the JSON version
Truncate 3 bytes more than my final limit
Use a regular expression to detect and chop off a partial encoding of a Unicode value
So (in Python 2.5), with some_string and a requirement to cut to around 100 bytes:
import re

# Given some_string is a long string with arbitrary Unicode data.
encoded_string = some_string.encode('unicode_escape')
partial_string = re.sub(r'([^\\])\\(u|$)[0-9a-f]{0,3}$', r'\1', encoded_string[:103])
final_string = partial_string.decode('unicode_escape')
Now final_string is back in Unicode but guaranteed to fit within the JSON packet later. I truncated to 103 because a purely-Unicode message would be 102 bytes encoded.
Disclaimer: Only tested on the Basic Multilingual Plane. Yeah yeah, I know.
Find a cut point that lands on a character boundary: if the byte just past the cut is a UTF-8 continuation byte (high bits 10), the cut is in the middle of a character, so back up and try again until it is not.
mxlen = 255
encoded = toolong.encode("utf8")
# back up while the cut would split a character, i.e. while the byte at the
# cut position is a UTF-8 continuation byte (0b10xxxxxx)
while mxlen < len(encoded) and encoded[mxlen] & 0xc0 == 0x80:
    mxlen -= 1
truncated_string = encoded[:mxlen].decode("utf8")
