I want to remove characters with encodings larger than 3 bytes, because when I upload my CSV data to the Amazon Mechanical Turk system, it asks me to:
Your CSV file needs to be UTF-8 encoded and cannot contain characters
with encodings larger than 3 bytes. For example, some non-English
characters are not allowed (learn more).
To overcome this problem,
I want to write a filter_max3bytes function in Python 3 that removes those characters.
x = 'below ð\x9f~\x83,'
y = remove_max3byes(x) # y=="below ~,"
Then I will apply the function before saving it to a CSV file, which is UTF-8 encoded.
This post is related to my problem, but it uses Python 2 and the solution did not work for me.
Thank you!
None of the characters in your string seems to take 3 bytes in UTF-8:
x = 'below ð\x9f~\x83,'
Anyway, the way to remove them, if there were any, would be:
filtered_x = ''.join(char for char in x if len(char.encode('utf-8')) < 3)
For example (with such characters):
>>> x = 'abcd漢字efg'
>>> ''.join(char for char in x if len(char.encode('utf-8')) < 3)
'abcdefg'
BTW, you can verify that your original string does not have 3-byte encodings by doing the following:
>>> for char in 'below ð\x9f~\x83,':
... print(char, [hex(b) for b in char.encode('utf-8')])
...
b ['0x62']
e ['0x65']
l ['0x6c']
o ['0x6f']
w ['0x77']
['0x20']
ð ['0xc3', '0xb0']
['0xc2', '0x9f']
~ ['0x7e']
['0xc2', '0x83']
, ['0x2c']
EDIT: A wild guess
I believe the OP asks the wrong question and the question is in fact whether the character is printable. I'll assume anything Python displays as \x<number> is not printable, so this solution should work:
x = 'below ð\x9f~\x83,'
filtered_x = ''.join(char for char in x if not repr(char).startswith("'\\x"))
Result:
'below ð~,'
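If that guess is right, Python 3's str.isprintable() expresses the same idea without inspecting repr; a minimal sketch:
>>> x = 'below ð\x9f~\x83,'
>>> ''.join(ch for ch in x if ch.isprintable())  # drops the \x9f and \x83 control characters, keeps the printable ð
'below ð~,'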
Although it is stated only indirectly, the website only allows characters from the Basic Multilingual Plane (BMP), that is, Unicode code points U+0000 through U+FFFF. In UTF-8, anything above U+FFFF takes four bytes to encode:
>>> '\uffff'.encode('utf8')
b'\xef\xbf\xbf'
>>> '\U00010000'.encode('utf8')
b'\xf0\x90\x80\x80'
This filters out Unicode code points above U+FFFF:
>>> test_string = 'abc马克😀' # emoticon is U+1F600
>>> ''.join(c for c in test_string if ord(c) < 0x10000)
'abc马克'
When encoded (note three bytes for each Chinese character):
>>> ''.join(c for c in test_string if ord(c) < 0x10000).encode('utf8')
b'abc\xe9\xa9\xac\xe5\x85\x8b'
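Wrapped up under the name the question asks for, a small sketch of that filter (anything outside the BMP, i.e. anything needing more than 3 UTF-8 bytes, is dropped):
def filter_max3bytes(s):
    # Keep only characters whose UTF-8 encoding takes at most 3 bytes,
    # i.e. code points in the Basic Multilingual Plane (U+0000..U+FFFF).
    return ''.join(c for c in s if ord(c) <= 0xFFFF)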
According to the UTF-8 standard, characters with Unicode code points below U+0800 use at most two bytes in the encoding. So just remove any character at or above U+0800. This code copies all characters that take at most two bytes and leaves out the others.
def remove_max3byes(x):
    return ''.join(c for c in x if ord(c) < 0x800)
As a comment pointed out, your example string has no characters that take more than two bytes. But this command at the REPL
remove_max3byes(chr(0x07ff))
gives
'\u07ff'
and this command
remove_max3byes(chr(0x0800))
gives
''
Both are as wanted.
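Since the question also mentions saving the result to a UTF-8 encoded CSV file, here is a minimal sketch of that last step (the file name and sample rows are just placeholders), applying the filter to every cell before writing:
import csv

def remove_max3byes(x):
    return ''.join(c for c in x if ord(c) < 0x800)

rows = [['below ð\x9f~\x83,', 'ok']]  # placeholder data
with open('out.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow([remove_max3byes(cell) for cell in row])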
Related
I am working with an HTML string in Python that contains non-English characters that is represented in the string by 16-bit unicode hex values. The string reads:
"Skr\u00E4ddarev\u00E4gen"
When properly converted, the string should read "Skräddarevägen". How do I ensure that the Unicode hex values get correctly encoded/decoded on output and read with the correct accents?
(Note, I'm using Requests and Pandas and the encoding in both is set to utf-8)
Thanks in advance!
In Python 3, the following cases come up:
If you pick up your string from an HTML file, you have to read the HTML file in using the correct encoding.
If you have your string literally in Python 3 code, it is already Unicode (code points, not bytes) in memory.
To write the string out to a file, you have to specify the encoding you want when opening the file.
If you are using Python 3 and that is literally the content of the string, it "just works":
>>> s = "Skr\u00E4ddarev\u00E4gen"
>>> s
'Skräddarevägen'
If you have that string as raw data, you have to decode it. If it is a Unicode string you'll have to encode it to bytes first. The final result will be Unicode. If you already have a byte string, skip the encode step.
>>> s = r"Skr\u00E4ddarev\u00E4gen"
>>> s
'Skr\\u00E4ddarev\\u00E4gen'
>>> s.encode('ascii').decode('unicode_escape')
'Skräddarevägen'
If you are on Python 2, you'll need to decode, plus print to see it properly:
>>> s = "Skr\u00E4ddarev\u00E4gen"
>>> s
'Skr\\u00E4ddarev\\u00E4gen'
>>> s.decode('unicode_escape')
u'Skr\xe4ddarev\xe4gen'
>>> print s.decode('unicode_escape')
Skräddarevägen
From your display, it is hard to be sure what is in the string. Assuming that it is the 24 characters displayed, I believe the last line of the following answers your question.
s = "Skr\\u00E4ddarev\\u00E4gen"
print(len(s))
for c in s: print(c, end=' ')
print()
print(eval("'"+s+"'"))
print(eval("'"+s+"'").encode('utf-8'))
This prints
24
S k r \ u 0 0 E 4 d d a r e v \ u 0 0 E 4 g e n
Skräddarevägen
b'Skr\xc3\xa4ddarev\xc3\xa4gen'
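A less eval-dependent way to get the same result (a sketch that assumes, as above, that the string really contains literal backslash-u sequences and no stray quote characters, since it simply treats the text as a JSON string literal):
>>> import json
>>> s = "Skr\\u00E4ddarev\\u00E4gen"
>>> json.loads('"' + s + '"')
'Skräddarevägen'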
I have the following code:
import sys

string_msg = '\x80\x01\x00\x00\x00\x00\x53\x58\x00\x1C\x00\x00\x00\x08\x00\x01\x00\x74\x00\x00\x00\x0A\x00\x54\x00\x00\x00\x03'
print(string_msg)
if sys.version < '3':
    print(":".join("{:02x}".format(ord(c)) for c in string_msg))
else:
    print(":".join("{:02x}".format(c) for c in string_msg.encode()))
In python 2, the result is:
80:01:00:00:00:00:53:58:00:1c:00:00:00:08:00:01:00:74:00:00:00:0a:00:54:00:00:00:03
But in python 3, the result is:
c2:80:01:00:00:00:00:53:58:00:1c:00:00:00:08:00:01:00:74:00:00:00:0a:00:54:00:00:00:03
Right now I need to run this code in Python 3, so I have to remove the first byte in order to get rid of the "c2" and everything would be OK. But when I try to do that with the various pieces of code I found on this forum, such as:
string_msg = string_msg[1:]
string_msg.replace('\xC2', '')
string_msg = ''.join([i if ord(i) < 130 else '' for i in string_msg])
The result is always the same:
01:00:00:00:00:53:58:00:1c:00:00:00:08:00:01:00:74:00:00:00:0a:00:54:00:00:00:03
The second byte, 80, is removed as well. So my question is: how can I remove just the first byte c2, and why is the second byte also removed when I try to do that?
The issue is that string_msg is a bytestring on Python 2 but, despite looking the same, it is a Unicode string on Python 3. A byte b'\x80' is a completely different concept from the Unicode code point u'\x80': the same code point can be represented by different bytes in different encodings, and conversely the same byte may represent different characters in different encodings.
If string_msg is a sequence of bytes then use b'' literal:
data = b'\x80\x01\x00\x00\x00\x00\x53\x58\x00\x1C\x00\x00\x00\x08'
print(":".join(map("{:02x}".format, bytearray(data))))
# -> 80:01:00:00:00:00:53:58:00:1c:00:00:00:08
You can convert text in the first 256 characters to its naive byte value by encoding as ISO 8859-1.
3>> '\x80'.encode('latin-1')
b'\x80'
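Combining that with the hexdump above, a minimal sketch (assuming the message really is meant as raw bytes) that reproduces the Python 2 output on Python 3 by encoding the text as latin-1, so each code point below 256 maps back to the single byte you intended:
import sys

string_msg = '\x80\x01\x00\x00\x00\x00\x53\x58\x00\x1C\x00\x00\x00\x08'
if sys.version_info[0] < 3:
    data = string_msg                    # already bytes on Python 2
else:
    data = string_msg.encode('latin-1')  # one byte per code point, no extra 0xC2
print(":".join("{:02x}".format(b) for b in bytearray(data)))
# -> 80:01:00:00:00:00:53:58:00:1c:00:00:00:08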
In Python 2, Unicode strings may contain both unicode and bytes:
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I understand that this is absolutely not something one should write in his own code, but this is a string that I have to deal with.
The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).
My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).
Encoding it to UTF-8 yields
>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'
Which then decoded from UTF-8 gives the initial string with bytes in them, which is not good:
>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I found a hacky way to solve the problem, however:
>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!
This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?
In Python 2, Unicode strings may contain both unicode and bytes:
No, they may not. They contain Unicode characters.
Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).
There is no way to look at the string and tell that the \xd0 byte is supposed to be part of some UTF-8 encoded character, or if it actually stands for that Unicode character by itself.
However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 as they were.
(In response to the comments above): this code converts everything that looks like utf8 and leaves other codepoints as is:
a = u'\u0420\u0443\u0441 utf:\xd0\xb5\xd0\xba bytes:bl\xe4\xe4'
def convert(s):
    try:
        return s.group(0).encode('latin1').decode('utf8')
    except UnicodeDecodeError:
        return s.group(0)

import re
a = re.sub(r'[\x80-\xFF]+', convert, a)
print a.encode('utf8')
Result:
Рус utf:ек bytes:blää
The problem is that your string is not actually encoded in a specific encoding. Your example string:
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
is mixing Python's internal representation of Unicode strings with UTF-8 encoded text. If we just consider the 'special' characters:
>>> orig = u'\u0435\u043a'
>>> bytes = u'\xd0\xb5\xd0\xba'
>>> print orig
ек
>>> print bytes
ек
But you say, bytes is utf-8 encoded:
>>> print bytes.encode('utf-8')
ек
>>> print bytes.encode('utf-8').decode('utf-8')
ек
Wrong! But what about:
>>> bytes = '\xd0\xb5\xd0\xba'
>>> print bytes
ек
>>> print bytes.decode('utf-8')
ек
Hurrah.
So. What does this mean for me? It means you're (probably) solving the wrong problem. What you should be asking us/trying to figure out is why your strings are in this form to begin with and how to avoid it/fix it before you have them all mixed up.
You should convert unichrs to chrs, then decode them.
u'\xd0' == u'\u00d0' is True
$ python
>>> import re
>>> a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
>>> re.sub(r'[\000-\377]*', lambda m:''.join([chr(ord(i)) for i in m.group(0)]).decode('utf8'), a)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'
r'[\000-\377]*' will match unichrs u'[\u0000-\u00ff]*'
u'\xd0\xb5\xd0\xba' == u'\u00d0\u00b5\u00d0\u00ba'
You use utf8 encoded bytes as unicode code points (this is the PROBLEM)
I solve the problem by treating those mistaken unichars as the corresponding bytes
I search all these mistaken unichars, and convert them to chars, then decode them.
If I'm wrong, please tell me.
You've already got an answer, but here's a way to unscramble UTF-8-like Unicode sequences that is less likely to decode latin-1 Unicode sequences in error. The re.sub function:
Matches Unicode characters < U+0100 that resemble valid UTF-8 sequences (ref: RFC 3629).
Encodes the Unicode sequence into its equivalent latin-1 byte sequence.
Decodes the sequence using UTF-8 back into Unicode.
Replaces the original UTF-8-like sequence with the matching Unicode character.
Note this could still match a Unicode sequence if just the right characters appear next to each other, but it is much less likely.
import re
# your example
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
# printable Unicode characters < 256.
a += ''.join(chr(n) for n in range(32,256)).decode('latin1')
# a few UTF-8 characters decoded as latin1.
a += ''.join(unichr(n) for n in [2**7-1,2**7,2**11-1,2**11]).encode('utf8').decode('latin1')
# Some non-BMP characters
a += u'\U00010000\U0010FFFF'.encode('utf8').decode('latin1')
print repr(a)
# Unicode codepoint sequences that resemble UTF-8 sequences.
p = re.compile(ur'''(?x)
\xF0[\x90-\xBF][\x80-\xBF]{2} | # Valid 4-byte sequences
[\xF1-\xF3][\x80-\xBF]{3} |
\xF4[\x80-\x8F][\x80-\xBF]{2} |
\xE0[\xA0-\xBF][\x80-\xBF] | # Valid 3-byte sequences
[\xE1-\xEC][\x80-\xBF]{2} |
\xED[\x80-\x9F][\x80-\xBF] |
[\xEE-\xEF][\x80-\xBF]{2} |
[\xC2-\xDF][\x80-\xBF] # Valid 2-byte sequences
''')
def replace(m):
    return m.group(0).encode('latin1').decode('utf8')
print
print repr(p.sub(replace,a))
Output:
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\xc2\x80\xdf\xbf\xe0\xa0\x80\xf0\x90\x80\x80\xf4\x8f\xbf\xbf'
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\x80\u07ff\u0800\U00010000\U0010ffff'
I solved it by
unicodeText.encode("utf-8").decode("unicode-escape").encode("latin1")
I currently have the following code
def removeControlCharacters(line):
    i = 0
    for c in line:
        if (c < chr(32)):
            line = line[:i - 1] + line[i+1:]
        i += 1
    return line
This just does not work if there is more than one character to be deleted.
There are hundreds of control characters in unicode. If you are sanitizing data from the web or some other source that might contain non-ascii characters, you will need Python's unicodedata module. The unicodedata.category(…) function returns the unicode category code (e.g., control character, whitespace, letter, etc.) of any character. For control characters, the category always starts with "C".
This snippet removes all control characters from a string.
import unicodedata

def remove_control_characters(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")
Examples of unicode categories:
>>> from unicodedata import category
>>> category('\r') # carriage return --> Cc : control character
'Cc'
>>> category('\0') # null character ---> Cc : control character
'Cc'
>>> category('\t') # tab --------------> Cc : control character
'Cc'
>>> category(' ') # space ------------> Zs : separator, space
'Zs'
>>> category(u'\u200A') # hair space -------> Zs : separator, space
'Zs'
>>> category(u'\u200b') # zero width space -> Cf : control character, formatting
'Cf'
>>> category('A') # letter "A" -------> Lu : letter, uppercase
'Lu'
>>> category(u'\u4e21') # 両 ---------------> Lo : letter, other
'Lo'
>>> category(',') # comma -----------> Po : punctuation
'Po'
>>>
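A quick usage example of remove_control_characters, applied to a string containing some of the categories shown above:
>>> remove_control_characters('abc\x00\r\ndef\u200b!')
'abcdef!'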
You could use str.translate with the appropriate map, for example like this:
>>> mpa = dict.fromkeys(range(32))
>>> 'abc\02de'.translate(mpa)
'abcde'
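That map only covers code points below 32; if you want the same translate trick to drop every Unicode control character, the map can be built from unicodedata (a sketch; the table is slow to build but reusable):
>>> import sys, unicodedata
>>> control_map = dict.fromkeys(cp for cp in range(sys.maxunicode + 1)
...                             if unicodedata.category(chr(cp))[0] == 'C')
>>> 'abc\x02\u200bde'.translate(control_map)
'abcde'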
Anyone interested in a regex character class that matches any Unicode control character may use [\x00-\x1f\x7f-\x9f].
You may test it like this:
>>> import unicodedata, re, sys
>>> all_chars = [chr(i) for i in range(sys.maxunicode)]
>>> control_chars = ''.join(c for c in all_chars if unicodedata.category(c) == 'Cc')
>>> expanded_class = ''.join(c for c in all_chars if re.match(r'[\x00-\x1f\x7f-\x9f]', c))
>>> control_chars == expanded_class
True
So to remove the control characters using re just use the following:
>>> re.sub(r'[\x00-\x1f\x7f-\x9f]', '', 'abc\02de')
'abcde'
This is the easiest, most complete, and most robust way I am aware of. It does require an external dependency, however. I consider it to be worth it for most projects.
pip install regex
import regex as rx

def remove_control_characters(s):
    return rx.sub(r'\p{C}', '', s)
\p{C} is the Unicode character property for control characters, so you can leave it up to the Unicode Consortium which of the million-plus available code points should be considered control. There are also other extremely useful character properties I frequently use, for example \p{Z} for any kind of whitespace.
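For instance, the two properties can be combined to drop control characters and normalize every kind of whitespace to a plain space in one pass (a small sketch using the regex module installed above):
>>> import regex as rx
>>> rx.sub(r'\p{C}+', '', rx.sub(r'\p{Z}+', ' ', 'a\u2003b\x00\x1fc'))
'a bc'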
Your implementation is wrong because the value of i is incorrect. However that's not the only problem: it also repeatedly uses slow string operations, meaning that it runs in O(n²) instead of O(n). Try this instead:
return ''.join(c for c in line if ord(c) >= 32)
And for Python 2, with the builtin translate:
import string
all_bytes = string.maketrans('', '') # String of 256 characters with (byte) value 0 to 255
line.translate(all_bytes, all_bytes[:32]) # All bytes < 32 are deleted (the second argument lists the bytes to delete)
You modify the line while iterating over it. Something like ''.join([x for x in line if ord(x) >= 32])
filter(string.printable[:-5].__contains__,line)
I've tried all the above and it didn't help. In my case, I had to remove Unicode 'LRM' chars:
Finally I found this solution that did the job:
df["AMOUNT"] = df["AMOUNT"].str.encode("ascii", "ignore")
df["AMOUNT"] = df["AMOUNT"].str.decode('UTF-8')
Given a Unicode string and these requirements:
The string be encoded into some byte-sequence format (e.g. UTF-8 or JSON unicode escape)
The encoded string has a maximum length
For example, the iPhone push service requires JSON encoding with a maximum total packet size of 256 bytes.
What is the best way to truncate the string so that it re-encodes to valid Unicode and that it displays reasonably correctly?
(Human language comprehension is not necessary—the truncated version can look odd e.g. for an orphaned combining character or a Thai vowel, just as long as the software doesn't crash when handling the data.)
See Also:
Related Java question: How do I truncate a java string to fit in a given number of bytes, once UTF-8 encoded?
Related Javascript question: Using JavaScript to truncate text to a certain size
def unicode_truncate(s, length, encoding='utf-8'):
    encoded = s.encode(encoding)[:length]
    return encoded.decode(encoding, 'ignore')
Here is an example with a Unicode string in which each character takes 2 bytes in UTF-8; decoding would have crashed if the split Unicode code point were not ignored:
>>> unicode_truncate(u'абвгд', 5)
u'\u0430\u0431'
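The same guard also helps for characters outside the BMP; a quick sketch (the emoji needs 4 UTF-8 bytes, but only 2 of them survive the cut, so it is dropped):
>>> unicode_truncate(u'ab\U0001F600', 4)
u'ab'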
One of UTF-8's properties is that it is easy to resync, that is, to find the Unicode character boundaries in the encoded bytestream. All you need to do is cut the encoded string at the maximum length, then walk backwards from the end removing any bytes that are > 127 -- those are part of, or the start of, a multibyte character.
As written, that is too simple -- it would erase back to the last ASCII character, possibly the whole string. What we actually need is to check that no truncated two-byte (starting with 110yyyxx), three-byte (1110yyyy), or four-byte (11110zzz) sequence is left dangling at the end.
Python 2.6 implementation in clear code. Optimization should not be an issue -- regardless
of length, we only check the last 1-4 bytes.
# coding: UTF-8

def decodeok(bytestr):
    try:
        bytestr.decode("UTF-8")
    except UnicodeDecodeError:
        return False
    return True

def is_first_byte(byte):
    """return if the UTF-8 #byte is the first byte of an encoded character"""
    o = ord(byte)
    return ((0b10111111 & o) != o)

def truncate_utf8(bytestr, maxlen):
    u"""
    >>> us = u"ウィキペディアにようこそ"
    >>> s = us.encode("UTF-8")
    >>> trunc20 = truncate_utf8(s, 20)
    >>> print trunc20.decode("UTF-8")
    ウィキペディ
    >>> len(trunc20)
    18
    >>> trunc21 = truncate_utf8(s, 21)
    >>> print trunc21.decode("UTF-8")
    ウィキペディア
    >>> len(trunc21)
    21
    """
    L = maxlen
    for x in xrange(1, 5):
        if is_first_byte(bytestr[L-x]) and not decodeok(bytestr[L-x:L]):
            return bytestr[:L-x]
    return bytestr[:L]

if __name__ == '__main__':
    # unicode doctest hack
    import sys
    reload(sys)
    sys.setdefaultencoding("UTF-8")
    import doctest
    doctest.testmod()
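A rough Python 3 port of the same idea (just a sketch, not the original author's code; indexing a bytes object already yields ints there, and the sys.setdefaultencoding hack is no longer needed):
def truncate_utf8(data, maxlen):
    """Cut a UTF-8 byte string at maxlen bytes without leaving a partial character."""
    cut = data[:maxlen]
    for back in range(1, 5):          # a UTF-8 character is at most 4 bytes long
        if back > len(cut):
            break
        byte = cut[-back]
        if byte < 0x80:               # plain ASCII byte: nothing can be dangling
            break
        if byte >= 0xC0:              # lead byte of a multi-byte sequence
            try:
                cut[-back:].decode("utf-8")
            except UnicodeDecodeError:
                cut = cut[:-back]     # the sequence is incomplete, drop it
            break
    return cut

# truncate_utf8("ウィキペディアにようこそ".encode("utf-8"), 20) -> 18 bytes, "ウィキペディ"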
This will do for UTF-8, if you like to do it with a regex.
import re
partial="\xc2\x80\xc2\x80\xc2"
re.sub("([\xf6-\xf7][\x80-\xbf]{0,2}|[\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$","",partial)
"\xc2\x80\xc2\x80"
It covers UTF-8 strings from U+0080 (2 bytes) to U+10FFFF (4 bytes).
It's really straightforward, just like the UTF-8 algorithm itself.
From U+0080 to U+07FF, 2 bytes are needed: 110yyyxx 10xxxxxx.
That means if you see only one byte like 110yyyxx (0b11000000 to 0b11011111) at the end,
i.e. [\xc0-\xdf], it is a partial character.
From U+0800 to U+FFFF, 3 bytes are needed: 1110yyyy 10yyyyxx 10xxxxxx.
If you see only 1 or 2 of those bytes at the end, it is a partial character.
That case matches the pattern [\xe0-\xef][\x80-\xbf]{0,1}
From U+10000 to U+10FFFF, 4 bytes are needed: 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx.
If you see only 1 to 3 of those bytes at the end, it is a partial character.
That case matches the pattern [\xf0-\xf7][\x80-\xbf]{0,2}
Update:
If you only need the Basic Multilingual Plane, you can drop the last pattern. This will do:
re.sub("([\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$","",partial)
Let me know if there is any problem with that regex.
For JSON formatting (unicode escape, e.g. \uabcd), I am using the following algorithm to achieve this:
Encode the Unicode string into the backslash-escape format which it would eventually be in the JSON version
Truncate 3 bytes more than my final limit
Use a regular expression to detect and chop off a partial encoding of a Unicode value
So (in Python 2.5), with some_string and a requirement to cut to around 100 bytes:
# Given some_string is a long string with arbitrary Unicode data.
encoded_string = some_string.encode('unicode_escape')
partial_string = re.sub(r'([^\\])\\(u|$)[0-9a-f]{0,3}$', r'\1', encoded_string[:103])
final_string = partial_string.decode('unicode_escape')
Now final_string is back in Unicode but guaranteed to fit within the JSON packet later. I truncated to 103 because a purely-Unicode message would be 102 bytes encoded.
Disclaimer: Only tested on the Basic Multilingual Plane. Yeah yeah, I know.
Check the last byte of the truncated, encoded string. If its top two bits are both set, it is the lead byte of a multi-byte character that got cut off, so back up and try again until that is no longer the case.
mxlen = 255
while toolong.encode("utf8")[mxlen-1] & 0xc0 == 0xc0:
    mxlen -= 1
truncated_string = toolong.encode("utf8")[0:mxlen].decode("utf8")