After parsing a webpage served as UTF-8, I end up with characters that I cannot manipulate, although they are readable by means of print.
>>> print data
Ａ　Ｄｅｕｃｅ
>>> data
u'\uff21\u3000\uff24\uff45\uff55\uff43\uff45'
How can I get this into a decent encoding using Python?
I would like to obtain
>>> my_variable
'A Deuce'
(I mean being able to store that text in a variable as a "regular" string.)
I saw several solutions related to this topic but did not find a relevant answer (most were based on encoding/decoding to other charsets).
This functionality is built into the unicodedata module:
>>> import unicodedata
>>> unicodedata.normalize('NFKC', data)
u'A Deuce'
With a little help from this answer:
>>> table = dict([(x + 0xFF00 - 0x20, unichr(x)) for x in xrange(0x21, 0x7F)] + [(0x3000, unichr(0x20))])
>>> data.translate(table)
u'A Deuce'
The translate method takes a dictionary that maps one Unicode code point to another. In this case, it maps the full-width Latin alphabet (which is essentially part of the ASCII character set shifted up to the range 0xFF01-0xFF5E) to the "normal" ASCII character set. For example, 0xFF21 (full-width A) maps to 0x41 (ASCII A), 0xFF22 (full-width B) maps to 0x42 (ASCII B), etc.
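If you are on Python 3, a minimal sketch of the same table (assuming only the full-width block and the ideographic space need mapping) could look like this:
# Python 3 sketch: map full-width forms (0xFF01-0xFF5E) down to ASCII
# (0x21-0x7E), and the ideographic space (0x3000) to a plain space.
table = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
table[0x3000] = 0x20

data = '\uff21\u3000\uff24\uff45\uff55\uff43\uff45'
print(data.translate(table))  # A Deuce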
Consider using Python 3, which has better printing support for Unicode characters. Here's a sample:
>>> s=u'\uff21\u3000\uff24\uff45\uff55\uff43\uff45'
>>> print(s)
Ａ　Ｄｅｕｃｅ
>>> s
'Ａ\u3000Ｄｅｕｃｅ'
>>> import unicodedata as ud
>>> ud.name('\u3000')
'IDEOGRAPHIC SPACE'
>>> print(ascii(s))
'\uff21\u3000\uff24\uff45\uff55\uff43\uff45'
While reading this tutorial, I came across the following difference between the __unicode__ and __str__ methods:
Due to this difference, there’s yet another dunder method in the mix for controlling string conversion in Python 2: __unicode__. In Python 2, __str__ returns bytes, whereas __unicode__ returns characters.
How exactly are "character" and "byte" defined here? For example, in C a char is one byte, so wouldn't a char equal a byte? Or does this refer to (potentially multi-byte) Unicode characters? For example, take the following:
Ω (omega symbol)
03 A9 or u'\u03a9'
In python, would this be considered one character (Ω) and two bytes, or two characters(03 A9) and two bytes? Or maybe I am confusing the difference between char and character ?
In Python, u'\u03a9' is a string consisting of the single Unicode character Ω (U+03A9). The internal representation of that string is an implementation detail, so it doesn't make sense to ask about the bytes involved.
One source of ambiguity is a string like 'é', which could either be the single character U+00E9 or the two-character string U+0065 U+0301.
>>> len(u'\u00e9'); print(u'\u00e9')
1
é
>>> len(u'\u0065\u0301'); print(u'\u0065\u0301')
2
é
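If you need to reconcile the two forms, a small sketch using unicodedata normalization:
>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u0065\u0301')  # compose to one code point
u'\xe9'
>>> unicodedata.normalize('NFD', u'\u00e9')        # decompose to two
u'e\u0301'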
The two-byte sequence '\xce\xa9', however, can be interpreted as the UTF-8 encoding of U+03A9.
>>> u'\u03a9'.encode('utf-8')
'\xce\xa9'
>>> '\xce\xa9'.decode('utf-8')
u'\u03a9'
In Python 3, that would be (with UTF-8 being the default encoding scheme)
>>> '\u03a9'.encode()
b'\xce\xa9'
>>> b'\xce\xa9'.decode()
'Ω'
Other byte sequences can be decoded to U+03A9 as well:
>>> b'\xff\xfe\xa9\x03'.decode('utf16')
'Ω'
>>> b'\xff\xfe\x00\x00\xa9\x03\x00\x00'.decode('utf32')
'Ω'
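To connect this back to the quoted tutorial, here is a minimal Python 2 sketch of the __str__/__unicode__ distinction (the class name is purely illustrative):
class Omega(object):
    def __unicode__(self):
        return u'\u03a9'                      # characters: a unicode object

    def __str__(self):
        return unicode(self).encode('utf-8')  # bytes: a str object

o = Omega()
print repr(unicode(o))  # u'\u03a9'
print repr(str(o))      # '\xce\xa9'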
I have an output of 'spannkr \xc3\xa4ftig, da\xc3\x9f unser' in Python. How do I replace this with umlauts?
The German characters are already there, but encoded as UTF-8. If you want to see the umlauts etc. in the interpreter, you can decode the bytes to str:
>>> bs = b'spannkr \xc3\xa4ftig, da\xc3\x9f unser'
>>> s = bs.decode('utf-8')
>>> print(s)
spannkr äftig, daß unser
It's possible that you are dealing with a str that somehow contains utf-8 encoded data. In this case you need to perform an extra step:
>>> s = 'spannkr \xc3\xa4ftig, da\xc3\x9f unser'
>>> bs = s.encode('raw-unicode-escape') # encode to bytes without double-encoding
>>> print(bs)
b'spannkr \xc3\xa4ftig, da\xc3\x9f unser'
>>> decoded = bs.decode('utf-8')
>>> print(decoded)
spannkr äftig, daß unser
There isn't an easy way to distinguish incorrectly embedded spaces from the spaces between words. You would need some kind of spellchecker or natural-language tool, as in the rough sketch below.
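As a rough sketch of that idea, you could rejoin adjacent fragments whenever the result appears in a known vocabulary (the wordlist here is purely illustrative; a real spellchecker would do this better):
vocab = {'spannkräftig', 'daß', 'unser'}  # hypothetical wordlist

def rejoin(text):
    words = text.split(' ')
    out, i = [], 0
    while i < len(words):
        pair = words[i] + words[i + 1] if i + 1 < len(words) else ''
        if pair.strip(',.') in vocab:  # the two fragments form a known word
            out.append(pair)
            i += 2
        else:
            out.append(words[i])
            i += 1
    return ' '.join(out)

print(rejoin('spannkr äftig, daß unser'))  # spannkräftig, daß unser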
I have a Unicode string as a result: u'splunk>\xae\uf001'
How can I get the substring 'uf001' as a simple string in Python?
The characters uf001 are not actually present in the string, so you can't just slice them off. You can do
repr(s)[-6:-1]
or
'u' + hex(ord(s[-1]))[2:]
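For example, with the string from the question:
>>> s = u'splunk>\xae\uf001'
>>> repr(s)[-6:-1]
'uf001'
>>> 'u' + hex(ord(s[-1]))[2:]
'uf001'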
Since you want the actual string (as seen from the comments), just get the last character with the [-1] index. Example:
>>> a = u'splunk>\xae\uf001'
>>> print(a)
splunk>®
>>> a[-1]
'\uf001'
>>> print(a[-1])
If you want the Unicode escape representation (\uf001), then take repr(a[-1]). Example:
>>> repr(a[-1])
"'\\uf001'"
\uf001 is a single Unicode character (not a sequence of six characters), so you can directly get that character as above.
You see \uf001 because you are looking at the repr() of the string; if you print it or use it somewhere else (writing it to a file, etc.), it will be the actual \uf001 character, as the sketch below shows.
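A small sketch of that point, writing the string to a UTF-8 file (the filename is arbitrary):
import io

s = u'splunk>\xae\uf001'
with io.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(s)  # the file now contains the real U+F001 code point

with io.open('out.txt', encoding='utf-8') as f:
    print(f.read()[-1] == u'\uf001')  # True: one character, not "uf001"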
u'...' is how a Unicode string is represented in Python 2 source code. The REPL uses this representation by default to display unicode objects:
>>> u'splunk>\xae\uf001'
u'splunk>\xae\uf001'
>>> print(u'splunk>\xae\uf001')
splunk>®
>>> print(u'splunk>\xae\uf001'[-1])
If your terminal is not configured to display Unicode, or if you are on a narrow build (as is likely for Python 2 on Windows), then the result may be different.
A Unicode string is an immutable sequence of Unicode code points in Python. len(u'\uf001') == 1: it does not contain the five characters uf001. You could also write it using the literal character, u'' (on Python 2 it is necessary to declare the character encoding of your source file if you use non-ASCII literals):
>>> u'\uf001' == u''
True
It is just a different way to represent exactly the same Unicode character (a single codepoint in this case).
Note: some user-perceived characters may span several Unicode code points, e.g.:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'ё')
u'\u0435\u0308'
>>> print(unicodedata.normalize('NFKD', u'ё'))
ё
Does anyone know how to get a chr-to-hex conversion where the output is always two digits?
For example, if my conversion yields 0x1, I need to convert that to 0x01, since I am concatenating a long hex string.
The code that I am using is:
hexStr += hex(ord(byteStr[i]))[2:]
You can use string formatting for this purpose:
>>> "0x{:02x}".format(13)
'0x0d'
>>> "0x{:02x}".format(131)
'0x83'
Edit: Your code suggests that you are trying to convert a string to a hex-string representation. There is a much easier way to do this (Python 2.x):
>>> "abcd".encode("hex")
'61626364'
An alternative (that also works in Python 3.x) is the function binascii.hexlify().
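For instance, on Python 3, where hexlify takes bytes and returns bytes:
>>> import binascii
>>> binascii.hexlify(b'abcd')
b'61626364'
>>> binascii.hexlify(b'abcd').decode()
'61626364'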
You can use the format function:
>>> format(10, '02x')
'0a'
You won't need to remove the 0x part with that (like you did with the [2:])
If you're using Python 3.6 or higher, you can also use f-strings:
v = 10
s = f"0x{v:02x}"
print(s)
output:
0x0a
The syntax for the braces part is identical to str.format(), except you use the variable's name directly. See https://www.python.org/dev/peps/pep-0498/ for more.
htmlColor = "#%02X%02X%02X" % (red, green, blue)
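A quick check with illustrative values:
>>> red, green, blue = 255, 0, 128
>>> "#%02X%02X%02X" % (red, green, blue)
'#FF0080'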
The standard module binascii may also be the answer, especially when you need to convert a longer string:
>>> import binascii
>>> binascii.hexlify('abc\n')
'6162630a'
Use format instead of using the hex function:
>>> mychar = ord('a')
>>> hexstring = '%.2X' % mychar
You can also change the number "2" to the number of digits you want, and the "X" to "x" to choose between upper- and lowercase representation of the hex digits.
Many consider this the old %-style formatting in Python, but I like it because the format-string syntax is the same as that used by other languages, such as C and Java.
The simplest way (I think) is:
your_str = '0x%02X' % 10
print(your_str)
will print:
0x0A
The number after the % is converted to hex inside the string. I think it's clearer this way, and for people coming from a C background (like me) it feels more like home.
In Python 2, Unicode strings may contain both unicode and bytes:
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I understand that this is absolutely not something one should write in one's own code, but this is a string that I have to deal with.
The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).
My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).
Encoding it to UTF-8 yields
>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'
Decoding that from UTF-8 then gives back the initial string, bytes and all, which is no good:
>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I found a hacky way to solve the problem, however:
>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!
This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?
In Python 2, Unicode strings may contain both unicode and bytes:
No, they may not. They contain Unicode characters.
Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).
There is no way to look at the string and tell that the \xd0 byte is supposed to be part of some UTF-8 encoded character, or if it actually stands for that Unicode character by itself.
However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (using ord to convert it to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 through unchanged.
(In response to the comments above:) This code converts everything that looks like UTF-8 and leaves other code points as-is:
import re

a = u'\u0420\u0443\u0441 utf:\xd0\xb5\xd0\xba bytes:bl\xe4\xe4'

def convert(m):
    # Re-encode the matched run of U+0080-U+00FF code points as the raw
    # bytes they stand for, then try to decode those bytes as UTF-8.
    try:
        return m.group(0).encode('latin1').decode('utf8')
    except UnicodeError:
        return m.group(0)  # not valid UTF-8; leave the run unchanged

a = re.sub(r'[\x80-\xFF]+', convert, a)
print a.encode('utf8')
Result:
Рус utf:ек bytes:blää
The problem is that your string is not actually encoded in a specific encoding. Your example string:
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
is mixing Python's internal representation of Unicode strings with UTF-8 encoded text. If we consider just the 'special' characters:
>>> orig = u'\u0435\u043a'
>>> bytes = u'\xd0\xb5\xd0\xba'
>>> print orig
ек
>>> print bytes
ÐµÐº
But, you say, bytes is UTF-8 encoded:
>>> print bytes.encode('utf-8')
ÐµÐº
>>> print bytes.encode('utf-8').decode('utf-8')
ÐµÐº
Wrong! But what about:
>>> bytes = '\xd0\xb5\xd0\xba'
>>> print bytes
ек
>>> print bytes.decode('utf-8')
ек
Hurrah.
So, what does this mean for me? It means you're (probably) solving the wrong problem. What you should be asking, and trying to figure out, is why your strings are in this form to begin with and how to avoid it or fix it before they all get mixed up.
You should convert unichrs to chrs, then decode them.
u'\xd0' == u'\u00d0' is True
$ python
>>> import re
>>> a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
>>> re.sub(r'[\000-\377]*', lambda m:''.join([chr(ord(i)) for i in m.group(0)]).decode('utf8'), a)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'
r'[\000-\377]*' will match runs of unichrs in the range u'\u0000'-u'\u00ff'
u'\xd0\xb5\xd0\xba' == u'\u00d0\u00b5\u00d0\u00ba'
You are using UTF-8 encoded bytes as Unicode code points (this is the PROBLEM)
I solve the problem by treating those mistaken unichars as the corresponding bytes
I find all these mistaken unichars, convert them to chars, then decode them.
If I'm wrong, please tell me.
You've already got an answer, but here's a way to unscramble UTF-8-like Unicode sequences that is less likely to decode latin-1 Unicode sequences by mistake. The re.sub function:
Matches Unicode characters < U+0100 that resemble valid UTF-8 sequences (ref: RFC 3629).
Encodes the Unicode sequence into its equivalent latin-1 byte sequence.
Decodes the sequence using UTF-8 back into Unicode.
Replaces the original UTF-8-like sequence with the matching Unicode character.
Note this could still match a Unicode sequence if just the right characters appear next to each other, but it is much less likely.
import re
# your example
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
# printable Unicode characters < 256.
a += ''.join(chr(n) for n in range(32,256)).decode('latin1')
# a few UTF-8 characters decoded as latin1.
a += ''.join(unichr(n) for n in [2**7-1,2**7,2**11-1,2**11]).encode('utf8').decode('latin1')
# Some non-BMP characters
a += u'\U00010000\U0010FFFF'.encode('utf8').decode('latin1')
print repr(a)
# Unicode codepoint sequences that resemble UTF-8 sequences.
p = re.compile(ur'''(?x)
\xF0[\x90-\xBF][\x80-\xBF]{2} | # Valid 4-byte sequences
[\xF1-\xF3][\x80-\xBF]{3} |
\xF4[\x80-\x8F][\x80-\xBF]{2} |
\xE0[\xA0-\xBF][\x80-\xBF] | # Valid 3-byte sequences
[\xE1-\xEC][\x80-\xBF]{2} |
\xED[\x80-\x9F][\x80-\xBF] |
[\xEE-\xEF][\x80-\xBF]{2} |
[\xC2-\xDF][\x80-\xBF] # Valid 2-byte sequences
''')
def replace(m):
    return m.group(0).encode('latin1').decode('utf8')
print
print repr(p.sub(replace,a))
Output:
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\xc2\x80\xdf\xbf\xe0\xa0\x80\xf0\x90\x80\x80\xf4\x8f\xbf\xbf'
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\x80\u07ff\u0800\U00010000\U0010ffff'
I solved it by
unicodeText.encode("utf-8").decode("unicode-escape").encode("latin1")