In Python 2, Unicode strings may contain both unicode and bytes:
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I understand that this is absolutely not something one should write in one's own code, but this is a string that I have to deal with.
The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).
My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).
Encoding it to UTF-8 yields
>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'
Decoding that back from UTF-8 gives the initial string with the bytes still in it, which is not good:
>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I found a hacky way to solve the problem, however:
>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!
This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?
In Python 2, Unicode strings may contain both unicode and bytes:
No, they may not. They contain Unicode characters.
Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).
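A quick REPL check makes this concrete:
>>> u'\xd0' == u'\u00d0'
True
>>> print u'\xd0'
Ð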
There is no way to look at the string and tell that the \xd0 byte is supposed to be part of some UTF-8 encoded character, or if it actually stands for that Unicode character by itself.
However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 through as they are.
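For illustration, here is a minimal sketch of that per-character idea (my own code, not the asker's; it assumes the runs of low code points really are valid UTF-8, and will raise UnicodeDecodeError otherwise):
def fix_mixed(s):
    # Collect runs of code points < 256, reinterpret them as bytes via
    # latin-1, and decode those bytes as UTF-8; pass real Unicode through.
    out, pending = [], []
    for ch in s:
        if ord(ch) < 256:
            pending.append(ch)
        else:
            if pending:
                out.append(u''.join(pending).encode('latin1').decode('utf8'))
                pending = []
            out.append(ch)
    if pending:
        out.append(u''.join(pending).encode('latin1').decode('utf8'))
    return u''.join(out)
>>> fix_mixed(u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'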
(In response to the comments above): this code converts everything that looks like utf8 and leaves other codepoints as is:
a = u'\u0420\u0443\u0441 utf:\xd0\xb5\xd0\xba bytes:bl\xe4\xe4'
def convert(s):
    try:
        return s.group(0).encode('latin1').decode('utf8')
    except UnicodeDecodeError:
        return s.group(0)
import re
a = re.sub(r'[\x80-\xFF]+', convert, a)
print a.encode('utf8')
Result:
Рус utf:ек bytes:blää
The problem is that your string is not actually encoded in a specific encoding. Your example string:
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
is mixing Python's internal representation of unicode strings with UTF-8 encoded text. If we just consider the 'special' characters:
>>> orig = u'\u0435\u043a'
>>> bytes = u'\xd0\xb5\xd0\xba'
>>> print orig
ек
>>> print bytes
ÐµÐº
But you say, bytes is utf-8 encoded:
>>> print bytes.encode('utf-8')
ÐµÐº
>>> print bytes.encode('utf-8').decode('utf-8')
ÐµÐº
Wrong! But what about:
>>> bytes = '\xd0\xb5\xd0\xba'
>>> print bytes
ек
>>> print bytes.decode('utf-8')
ек
Hurrah.
So. What does this mean for me? It means you're (probably) solving the wrong problem. What you should be asking us/trying to figure out is why your strings are in this form to begin with and how to avoid it/fix it before you have them all mixed up.
You should convert unichrs to chrs, then decode them.
u'\xd0' == u'\u00d0' is True
$ python
>>> import re
>>> a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
>>> re.sub(r'[\000-\377]*', lambda m:''.join([chr(ord(i)) for i in m.group(0)]).decode('utf8'), a)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'
r'[\000-\377]*' will match the unichrs u'[\u0000-\u00ff]*'
u'\xd0\xb5\xd0\xba' == u'\u00d0\u00b5\u00d0\u00ba'
The string uses UTF-8 encoded bytes as Unicode code points (this is the PROBLEM).
I solve the problem by treating those mistaken unichars as the corresponding bytes:
I search for all these mistaken unichars, convert them to chars, then decode them.
If I'm wrong, please tell me.
You've already got an answer, but here's a way to unscramble UTF-8-like Unicode sequences that is less likely to decode latin-1 Unicode sequences in error. The re.sub function:
Matches Unicode characters < U+0100 that resemble valid UTF-8 sequences (ref: RFC 3629).
Encodes the Unicode sequence into its equivalent latin-1 byte sequence.
Decodes the sequence using UTF-8 back into Unicode.
Replaces the original UTF-8-like sequence with the matching Unicode character.
Note this could still match a Unicode sequence if just the right characters appear next to each other, but it is much less likely.
import re
# your example
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
# printable Unicode characters < 256.
a += ''.join(chr(n) for n in range(32,256)).decode('latin1')
# a few UTF-8 characters decoded as latin1.
a += ''.join(unichr(n) for n in [2**7-1,2**7,2**11-1,2**11]).encode('utf8').decode('latin1')
# Some non-BMP characters
a += u'\U00010000\U0010FFFF'.encode('utf8').decode('latin1')
print repr(a)
# Unicode codepoint sequences that resemble UTF-8 sequences.
p = re.compile(ur'''(?x)
\xF0[\x90-\xBF][\x80-\xBF]{2} | # Valid 4-byte sequences
[\xF1-\xF3][\x80-\xBF]{3} |
\xF4[\x80-\x8F][\x80-\xBF]{2} |
\xE0[\xA0-\xBF][\x80-\xBF] | # Valid 3-byte sequences
[\xE1-\xEC][\x80-\xBF]{2} |
\xED[\x80-\x9F][\x80-\xBF] |
[\xEE-\xEF][\x80-\xBF]{2} |
[\xC2-\xDF][\x80-\xBF] # Valid 2-byte sequences
''')
def replace(m):
    return m.group(0).encode('latin1').decode('utf8')
print
print repr(p.sub(replace,a))
Output:
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\xc2\x80\xdf\xbf\xe0\xa0\x80\xf0\x90\x80\x80\xf4\x8f\xbf\xbf'
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\x80\u07ff\u0800\U00010000\U0010ffff'
I solved it by
unicodeText.encode("utf-8").decode("unicode-escape").encode("latin1")
Related
I want to generate a list of all UTF-8 characters. I wrote the code below, but it didn't work well. I suspect that is because chr() expects a Unicode code point, while I gave it a UTF-8 byte value. I think I have to convert the UTF-8 byte value to a Unicode code point, but I don't know how. How can I do that? Or is there a better way?
def utf8_2byte():
    characters = []
    # first byte range: [C2-DF]
    for first in range(0xC2, 0xDF + 1):
        # second byte range: [80-BF]
        for second in range(0x80, 0xBF + 1):
            num = (first << 8) + second
            line = [hex(num), chr(num)]
            characters.append(line)
    return characters
I expect:
# UTF8 code number, UTF8 character
[0xc380,À]
[0xc381,Á]
[0xc382,Â]
actually:
[0xc380,쎀]
[0xc381,쎁]
[0xc382,쎂]
In Python 3, chr takes Unicode code points, not UTF-8 byte values, and U+C380 is in the Hangul range. Instead, you can use bytearray for the decode:
>>> bytearray((0xc3, 0x80)).decode('utf-8')
'À'
There are other methods also, like struct or ctypes. Anything that assembles the native bytes and then decodes them as UTF-8 will do.
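For instance, a minimal sketch of the struct route (my own example, assuming Python 3):
>>> import struct
>>> struct.pack('BB', 0xC3, 0x80).decode('utf-8')
'À'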
Unicode is a character set, while UTF-8 is an encoding: an algorithm for converting code points from Unicode into machine-level bytes and back.
The code point 0xC380 is 쎀 in the Unicode standard.
The bytes 0xC3 0x80 decode to À when you use the UTF-8 encoding.
>>> s = "쎀"
>>> hex(ord(s))
'0xc380'
>>> b = bytes.fromhex("C3 80")
>>> b
b'\xc3\x80'
>>> b.decode("utf8")
'À'
>>> bytes((0xc3, 0x80)).decode("utf8")
'À'
I'm trying to make a program (Python 2.7) that iterates through Japanese characters and returns/yields them in a printable format, but I cannot convert the hexadecimal numbers (3040-309F) into a format that can print the characters. I have found that a u'\u....' literal works, but when I attempt to build the string at runtime with unicode('\u3040'), it is different from u'\u3040'. The code explains it better.
>>> s1 = u'\u309d'
>>> s2 = unicode("\u209d")
>>> print type(s1) == type(s2)
True
>>> print s1 == s2
False
>>> print s1, s2
ゝ \u209d
I have tried using UTF-8 and latin-1 for s2 as the second argument, but it does nothing. Also, I found that you can do u'\u{0}'.format(u'3040'), but I cannot make u'3040' in my iterator, and u'\u{0}'.format(unicode('3040')) raises an error.
In byte string literals, the \uhhhh escape sequence is not interpreted, so you get 6 literal characters instead.
Converting that to Unicode only decodes the string as ASCII data, not as a Python escape sequence.
You could decode from the unicode_escape encoding instead:
>>> "\u209d".decode('unicode_escape')
u'\u209d'
>>> print "\u209d".decode('unicode_escape')
There are several downsides to this, however. Any other \ escape sequences also get decoded:
>>> '\\n'
'\\n'
>>> '\\n'.decode('unicode_escape')
u'\n'
so you may have to double the backslashes first so that those literal backslashes come out intact:
>>> '\\n'.replace('\\', '\\\\').decode('unicode_escape')
u'\\n'
But be very careful that you are not in fact trying to treat JSON data as Python string literals. JSON also uses the same escape sequence format but should instead be treated as JSON; decode with json.loads() instead:
>>> import json
>>> json.loads('"\u209d"')
u'\u209d'
(Python 3.3.2) I have to unescape some non-ASCII escaped characters returned by a call to re.escape(). I see here and here methods that don't work. I'm working in a 100% UTF-8 environment.
# pure ASCII string : ok
mystring = "a\n" # expected unescaped string : "a\n"
cod = codecs.getencoder('unicode_escape')
print( cod(mystring) )
# non ASCII string : method #1
mystring = "€\n"
# equivalent to : mystring = codecs.unicode_escape_decode(mystring)
cod = codecs.getdecoder('unicode_escape')
print(cod(mystring))
# RESULT = ('â\x82¬\n', 5) INSTEAD OF ("€\n", 2)
# non ASCII string : method #2
mystring = "€\n"
mystring = bytes(mystring, 'utf-8').decode('unicode_escape')
print(mystring)
# RESULT = â\x82¬ INSTEAD OF "€\n"
Is this a bug ? Have I misunderstood something ?
Any help would be appreciated !
PS: I edited my post thanks to Michael Foukarakis' remark.
I guess the actual string you need to process is mystring = €\\n?
mystring = "€\n" # that's 2 char, "€" and new line
mystring = "€\\n" # that's 3 char, "€", "\" and "n"
I don't really understand what goes wrong inside Python 3's encode() and decode(), but my friend solved this problem while we were writing some tools.
What we did is bypass the encode("utf_8") call after the escape procedure is done.
>>> "€\\n".encode("utf_8")
b'\xe2\x82\xac\\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape")
'â\x82¬\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape").encode("utf_8")
b'\xc3\xa2\xc2\x82\xc2\xac\n' # we don't want this
>>> bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")])
b'\xe2\x82\xac\n' # what we really need
>>> str(bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")]), "utf_8")
'€\n'
We can see that, though the result of decode("unicode_escape") looks weird, the bytes object actually contains the correct bytes of your string (in UTF-8 encoding), in this case "\xe2\x82\xac\n".
So we do not print the str object directly, nor do we use encode("utf_8"); we use ord() to create the bytes object b'\xe2\x82\xac\n'.
And you can get the correct str from this bytes object, just by putting it into str().
BTW, the tool my friend and I wanted to make is a wrapper that lets the user input a C-like string literal and converts the escape sequences automatically.
User input: \n\x61\x62\n\x20\x21 # 20 characters, which represent 6 chars semantically
output: # \n
ab # \x61\x62\n
! # \x20\x21
That's a handy tool for users to input non-printable characters in a terminal.
Our final tool is:
#!/usr/bin/env python3
import sys
for line in sys.stdin:
    sys.stdout.buffer.write(bytes([ord(char) for char in line[:-1].encode().decode('unicode_escape')]))
    sys.stdout.flush()
You seem to misunderstand encodings. To be protected against common errors, we usually encode a string when it leaves our application, and decode it when it comes in.
Firstly, let's look at the documentation for unicode_escape, which states:
Produce[s] a string that is suitable as Unicode literal in Python source code.
Here is what you would get from the network or a file that claims its contents are Unicode escaped:
b'\\u20ac\\n'
Now, you have to decode this to use it in your app:
>>> s = b'\\u20ac\\n'.decode('unicode_escape')
>>> s
'€\n'
and if you wanted to write it back to, say, a Python source file:
with open('/tmp/foo', 'wb') as fh: # binary mode
    fh.write(b'print("' + s.encode('unicode_escape') + b'")')
import string
printable = string.printable
printable = printable + '€'
def cod(c):
    return c.encode('unicode_escape').decode('ascii')

def unescape(s):
    return ''.join(c if ord(c) >= 32 and c in printable else cod(c) for c in s)
mystring = "€\n"
print(unescape(mystring))
Unfortunately string.printable only includes ASCII characters. You can make a copy as I did here and extend it with any Unicode characters that you'd like, such as €.
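For instance, a character that is not in the extended printable set gets escaped (my own example):
>>> print(unescape("€\nЧ"))
€\n\u0427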
Here is my problem... I have a "normal" string like:
Hello World
And unlike all the other questions I have found, I WANT to print it as its Unicode code point escape value!
The output I am looking for is something like this:
\u0015\u0123
If anyone has an idea :)
You are encoding ASCII codepoints only. UTF-8 is a superset of ASCII; any ASCII codepoints are encoded to the same bytes as ASCII would use. What you are printing is correct: that is UTF-8.
Use some non-ASCII codepoints to see the difference:
>>> 'Hello world with an em-dash: \u2014\n'.encode('utf8')
b'Hello world with an em-dash: \xe2\x80\x94\n'
Python will just use the characters themselves when it shows you a bytes value with printable ASCII bytes in it. Any byte value that is not printable is shown as a \x.. escape code, or a single-character escape sequence if there is one (\n for newline).
From your example output, on the other hand, you seem to be expecting to output Python unicode literal escape codes:
>>> '\u0015\u0123'
'\x15ģ'
Since U+0123 is printable, Python 3 just shows it; the non-printable U+0015 (NEGATIVE ACKNOWLEDGE) is a codepoint in the 0x00-0xFF range and is shown using the shorter \x.. escape notation.
To show only unicode escape sequences for your text, you need to process it character by character:
>>> input_text = 'Hello World!'
>>> print(''.join('\\u{:04x}'.format(ord(c)) for c in input_text))
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064\u0021
>>> input_text = 'Hello world with an em-dash: \u2014\n'
>>> print(''.join('\\u{:04x}'.format(ord(c)) for c in input_text))
\u0048\u0065\u006c\u006c\u006f\u0020\u0077\u006f\u0072\u006c\u0064\u0020\u0077\u0069\u0074\u0068\u0020\u0061\u006e\u0020\u0065\u006d\u002d\u0064\u0061\u0073\u0068\u003a\u0020\u2014\u000a
It is important to stress that this is not UTF-8, however.
You can use ord to turn the encoded bytes into numbers and use string formatting to display their hex values.
>>> s = u'Hello World \u0664\u0662'
>>> print s
Hello World ٤٢
>>> print ''.join('\\x%02X' % ord(c) for c in s.encode('utf-8'))
\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x20\xD9\xA4\xD9\xA2
Given a Unicode string and these requirements:
The string be encoded into some byte-sequence format (e.g. UTF-8 or JSON unicode escape)
The encoded string has a maximum length
For example, the iPhone push service requires JSON encoding with a maximum total packet size of 256 bytes.
What is the best way to truncate the string so that it re-encodes to valid Unicode and that it displays reasonably correctly?
(Human language comprehension is not necessary—the truncated version can look odd e.g. for an orphaned combining character or a Thai vowel, just as long as the software doesn't crash when handling the data.)
See Also:
Related Java question: How do I truncate a java string to fit in a given number of bytes, once UTF-8 encoded?
Related Javascript question: Using JavaScript to truncate text to a certain size
def unicode_truncate(s, length, encoding='utf-8'):
    encoded = s.encode(encoding)[:length]
    return encoded.decode(encoding, 'ignore')
Here is an example for a Unicode string where each character is represented with 2 bytes in UTF-8 and that would've crashed if the split Unicode code point wasn't ignored:
>>> unicode_truncate(u'абвгд', 5)
u'\u0430\u0431'
One of UTF-8's properties is that it is easy to resync, that is, to find the Unicode character boundaries in the encoded bytestream. All you need to do is cut the encoded string at the maximum length, then walk backwards from the end removing any bytes that are > 127 -- those are part of, or the start of, a multibyte character.
As written, that is too simple -- it would erase back to the last ASCII character, possibly the whole string. What we actually need to check for is a truncated two-byte (starting with 110yyyxx), three-byte (1110yyyy), or four-byte (11110zzz) sequence at the cut point.
Below is a Python 2.6 implementation in clear code. Optimization should not be an issue: regardless of length, we only check the last 1-4 bytes.
# coding: UTF-8

def decodeok(bytestr):
    try:
        bytestr.decode("UTF-8")
    except UnicodeDecodeError:
        return False
    return True

def is_first_byte(byte):
    """return if the UTF-8 #byte is the first byte of an encoded character"""
    o = ord(byte)
    return ((0b10111111 & o) != o)

def truncate_utf8(bytestr, maxlen):
    u"""
    >>> us = u"ウィキペディアにようこそ"
    >>> s = us.encode("UTF-8")
    >>> trunc20 = truncate_utf8(s, 20)
    >>> print trunc20.decode("UTF-8")
    ウィキペディ
    >>> len(trunc20)
    18
    >>> trunc21 = truncate_utf8(s, 21)
    >>> print trunc21.decode("UTF-8")
    ウィキペディア
    >>> len(trunc21)
    21
    """
    L = maxlen
    for x in xrange(1, 5):
        if is_first_byte(bytestr[L-x]) and not decodeok(bytestr[L-x:L]):
            return bytestr[:L-x]
    return bytestr[:L]

if __name__ == '__main__':
    # unicode doctest hack
    import sys
    reload(sys)
    sys.setdefaultencoding("UTF-8")
    import doctest
    doctest.testmod()
This will do for UTF-8, if you like to do it with a regex.
import re
partial="\xc2\x80\xc2\x80\xc2"
re.sub("([\xf6-\xf7][\x80-\xbf]{0,2}|[\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$","",partial)
"\xc2\x80\xc2\x80"
It covers UTF-8 strings from U+0080 (2 bytes) up to U+10FFFF (4 bytes).
It follows straightforwardly from the UTF-8 algorithm:
From U+0080 to U+07FF, 2 bytes are needed: 110yyyxx 10xxxxxx.
That means if you see only the single byte 110yyyxx (0b11000000 to 0b11011111) at the end, i.e. [\xc0-\xdf], it is a partial sequence.
From U+0800 to U+FFFF, 3 bytes are needed: 1110yyyy 10yyyyxx 10xxxxxx.
If you see only the first 1 or 2 of them at the end, it is a partial sequence.
It will match this pattern: [\xe0-\xef][\x80-\xbf]{0,1}
From U+10000 to U+10FFFF, 4 bytes are needed: 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx.
If you see only the first 1 to 3 of them at the end, it is a partial sequence.
It will match this pattern: [\xf0-\xf4][\x80-\xbf]{0,2}
Update:
If you only need the Basic Multilingual Plane, you can drop the last pattern. This will do:
re.sub("([\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$","",partial)
Let me know if there is any problem with that regex.
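As a quick sanity check (my own, Python 2), feeding the full pattern a few truncated tails:
>>> import re
>>> pat = "([\xf0-\xf4][\x80-\xbf]{0,2}|[\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$"
>>> re.sub(pat, "", "\xc2\x80\xc2")          # lone 2-byte lead byte
'\xc2\x80'
>>> re.sub(pat, "", "\xe3\x82\xa6\xe3\x82")  # 3-byte sequence cut after 2 bytes
'\xe3\x82\xa6'
>>> re.sub(pat, "", "\xf0\x9f\x98")          # 4-byte sequence cut after 3 bytes
''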
For JSON formatting (unicode escape, e.g. \uabcd), I am using the following algorithm to achieve this:
Encode the Unicode string into the backslash-escape format which it would eventually be in the JSON version
Truncate 3 bytes more than my final limit
Use a regular expression to detect and chop off a partial encoding of a Unicode value
So (in Python 2.5), with some_string and a requirement to cut to around 100 bytes:
# Given some_string is a long string with arbitrary Unicode data.
encoded_string = some_string.encode('unicode_escape')
partial_string = re.sub(r'([^\\])\\(u|$)[0-9a-f]{0,3}$', r'\1', encoded_string[:103])
final_string = partial_string.decode('unicode_escape')
Now final_string is back in Unicode but guaranteed to fit within the JSON packet later. I truncated to 103 because a purely-Unicode message would be 102 bytes encoded.
Disclaimer: Only tested on the Basic Multilingual Plane. Yeah yeah, I know.
Check the last byte of the truncated, encoded string. If its top two bits are both set, it is the lead byte of a multibyte UTF-8 character whose continuation bytes were cut off, so back up and try again until you find a safe cut point.
mxlen = 255
encoded = toolong.encode("utf8")
while ord(encoded[mxlen - 1]) & 0xc0 == 0xc0:
    mxlen -= 1
truncated_string = encoded[0:mxlen].decode("utf8")