I have a list of Unicode code points that I need to convert into characters in Python 2.7:
U+0021
U+0022
U+0023
.......
U+0024
How can I do that?
This regular expression will replace all U+nnnn sequences with the corresponding Unicode character:
import re
s = u'''\
U+0021
U+0022
U+0023
.......
U+0024
'''
s = re.sub(ur'U\+([0-9A-F]{4})',lambda m: unichr(int(m.group(1),16)),s)
print(s)
Output:
!
"
#
.......
$
Explanation:
unichr gives the character of a codepoint, e.g. unichr(0x21) == u'!'.
int('0021',16) converts a hexadecimal string to an integer.
lambda m: expression is an anonymous function that receives the regex match.
It is equivalent to def func(m): return expression, but written inline.
re.sub matches a pattern and sends each match to a function that returns the replacement. In this case, the pattern is U+hhhh where h is a hexadecimal digit, and the replacement function converts the hexadecimal digit string into a Unicode character.
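For readers on Python 3, the same substitution might be sketched as follows. The helper name replace_codepoints is hypothetical, and accepting 4 to 6 hex digits in either case is an assumption that goes slightly beyond the pattern in the answer above.

```python
import re

# Matches "U+" followed by 4 to 6 hex digits (assumption: upper or lower case).
CODEPOINT_RE = re.compile(r'U\+([0-9A-Fa-f]{4,6})')

def replace_codepoints(text):
    """Replace each U+hhhh sequence in text with the character it names."""
    return CODEPOINT_RE.sub(lambda m: chr(int(m.group(1), 16)), text)

print(replace_codepoints('U+0021 U+0022 U+0023 U+0024'))
```

Note that chr() raises ValueError for values above 0x10FFFF, so malformed input such as U+FFFFFF would need separate handling.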
In case anyone using Python 3 wonders how to do this, I'll leave this answer here for reference (I didn't realize at first that the question was about Python 2.7).
Just use the built-in function chr():
char = chr(0x2474)
print(char)
Output:
⑴
Remember that the digits in the Unicode notation U+WXYZ stand for the hexadecimal number WXYZ, which in Python is written as 0xWXYZ.
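When the code points arrive as text rather than as numeric literals, one possible approach in Python 3 is to strip the U+ prefix and parse the rest as hexadecimal. This is a minimal sketch; the helper name char_from_notation and the sample list are made up for illustration.

```python
def char_from_notation(code):
    """Convert a string such as 'U+2474' into the character it names."""
    if not code.upper().startswith('U+'):
        raise ValueError('expected U+ notation, got %r' % code)
    # Everything after the prefix is a hexadecimal number.
    return chr(int(code[2:], 16))

codes = ['U+0021', 'U+0022', 'U+0023', 'U+0024']
chars = [char_from_notation(c) for c in codes]
print(chars)
```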
The code below takes every Unicode string in the list and encodes it as an ASCII byte string, dropping any characters that are not ASCII:
for item in items:
    print(item.encode('ascii', 'ignore'))
For example:
>>> a = u'aaa'
>>> a.encode('ascii', 'ignore')
'aaa'
This converts from Unicode to ASCII, which I think is what you want.
I'm using the argparse library in Python 3 to read in Unicode strings from a command line parameter. Often those strings contain "ordinary" Unicode characters (extended Latin, etc.), but sometimes--particularly when the characters belong to a right-to-left script--it's easier to encode the strings as Unicode code points, like \u0644. But argparse treats these designators as a sequence of characters, and does not convert them into the character they designate. For instance, if a command line parameter is
... -a "abc\u06d2d" ...
then what I get in the argparse variable is
"abc\u06d2d"
rather than the expected
"abcےd"
(the character between the 'c' and 'd' is the yeh barree). Of course both outcomes are logical, it's just that the second one is the one I want.
I tried to reproduce this in an interpreter, but under most circumstances Python 3 automagically converts a string literal like "abc\u06d2d" into "abcےd". Not so when I read the string using argparse...
I came up with a function to do the conversion, see below. But I feel like I'm missing something much simpler. Is there an easier way to do this conversion? (Obviously I could use str.startswith(), or regexes to match the entire thing, rather than going character by character, but the code below is really just an illustration. It seems like I shouldn't have to create my own function to do this at all, especially since in some circumstances it seems to happen automatically.)
---------My code to do this follows---------
def ParseString2Unicode(sInString):
    r"""Return a version of sInString in which any Unicode code points of the form
          \uXXXX   (X = hex digit)
    have been converted into their corresponding Unicode characters.
    Example:
          "\u0064b\u0065"
    becomes
          "dbe"
    """
    sOutString = ""
    while sInString:
        if len(sInString) >= 6 and \
           sInString[0] == "\\" and \
           sInString[1] == "u" and \
           sInString[2] in "0123456789ABCDEF" and \
           sInString[3] in "0123456789ABCDEF" and \
           sInString[4] in "0123456789ABCDEF" and \
           sInString[5] in "0123456789ABCDEF":
            #If we get here, the first 6 characters of sInString represent
            # a Unicode code point, like "\u0065"; convert it into a char:
            sOutString += chr(int(sInString[2:6], 16))
            sInString = sInString[6:]
        else:
            #Strip a single char:
            sOutString += sInString[0]
            sInString = sInString[1:]
    return sOutString
What you may want to look at is the raw_unicode_escape encoding.
>>> len(b'\\uffff')
6
>>> b'\\uffff'.decode('raw_unicode_escape')
'\uffff'
>>> len(b'\\uffff'.decode('raw_unicode_escape'))
1
So, the function would be:
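def ParseString2Unicode(sInString):
    try:
        # Encoding with raw_unicode_escape (rather than utf-8) keeps any
        # non-ASCII characters already in the string intact on the way back.
        encoded = sInString.encode('raw_unicode_escape')
        return encoded.decode('raw_unicode_escape')
    except UnicodeError:
        return sInString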
This, however, also matches other unicode escape sequences, like \Uxxxxxxxx. If you just want to match \uxxxx, use a regex, like so:
import re
escape_sequence_re = re.compile(r'\\u[0-9a-fA-F]{4}')
def _escape_sequence_to_char(match):
    return chr(int(match[0][2:], 16))
def ParseString2Unicode(sInString):
    return re.sub(escape_sequence_re, _escape_sequence_to_char, sInString)
A concise, flexible way of handling this would be to use a regular expression:
import re
def ParseString2Unicode(sInString):
    return re.sub(
        r"\\u([0-9A-Fa-f]{4})",
        lambda m: chr(int(m[1], 16)),
        sInString
    )
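Since the strings in the question come from argparse, one way to wire a conversion like this in is the type= parameter of add_argument, which argparse calls on the raw command-line string. The following is only a sketch: the function name unescape_u is hypothetical, and the argument list passed to parse_args stands in for a real command line.

```python
import argparse
import re

def unescape_u(value):
    r"""Replace \uXXXX sequences in value with the characters they name."""
    return re.sub(r'\\u([0-9A-Fa-f]{4})',
                  lambda m: chr(int(m.group(1), 16)),
                  value)

parser = argparse.ArgumentParser()
# argparse applies the type= callable to the string it read from the command line.
parser.add_argument('-a', type=unescape_u)

# Simulates: prog -a "abc\u06d2d" (the raw string keeps the backslash literal).
args = parser.parse_args(['-a', r'abc\u06d2d'])
print(args.a)
```

With this arrangement the rest of the program only ever sees the converted string.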
string.maketrans("","")
gives
\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13
\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~
\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90
\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2
\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4
\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9
\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde
\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed
\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff
What does this mean?
And how does it help in removing punctuation in a string with the following call:
import string
myStr.translate(string.maketrans("",""), string.punctuation)
I'll take some liberties, since Python 2 muddles the line between strings and bytes. There are 256 possible byte values, ranging from 0 to 255. You can get the one-byte string for each value by using chr(). So, all the bytes from 0 to 255 look like this:
>>> ''.join(map(chr, range(256)))
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\
x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;
<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80
\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93
\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6
\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9
\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc
\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf
\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2
\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
string.maketrans(from, to) creates a string of 256 characters, where the characters in from will be replaced by to. For example, string.maketrans('ab01', 'AB89') will return the string from above, but a will be replaced by A, b by B, 0 by 8 and 1 by 9.
>>> string.maketrans('ab01', 'AB89')
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\
x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./8923456789:;
<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`ABcdefghijklmnopqrstuvwxyz{|}~\x7f\x80
\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93
\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6
\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9
\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc
\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf
\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2
\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
Effectively, string.maketrans('', '') == ''.join(map(chr, range(256))).
This string serves as a map: when passed to str.translate(), it replaces multiple characters in a single pass over your string. For the example map above, all characters remain the same, except that every a turns into A, b into B, etc. If you do myStr.translate(string.maketrans('', '')), you simply don't change anything in myStr.
Finally, translate() has one additional argument, deletechars. If you pass a string for that argument, translate() will translate all characters according to the mapping you provide, but it will drop any characters that appear in deletechars. So, putting it all together, myStr.translate(string.maketrans('', ''), string.punctuation) does not change any character in the string, but in the process drops every character in string.punctuation. Effectively, you have removed the punctuation in the output string.
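For comparison, in Python 3 the deletechars argument is gone from str.translate(); the characters to delete are instead passed as the third argument of str.maketrans(). A short sketch, with a made-up sample string:

```python
import string

# The third argument of str.maketrans lists characters to delete.
delete_punctuation = str.maketrans('', '', string.punctuation)

my_str = "Hello, world! It's 5 o'clock."
print(my_str.translate(delete_punctuation))
```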
string.maketrans(intab, outtab) returns a translation table that maps each character in the intab string into the character at the same position in the outtab string.
tran_table = string.maketrans(intab, outtab)
print myStr.translate(tran_table)
The code above will then translate myStr using the table you created. In your case the table maps every character to itself, because both arguments are empty.
Python 2.7's string.maketrans() returns a 256-character byte string, like your result, which can be used with string.translate().
string.translate(s, table) translates each character c in s into table[ord(c)]. So \x00 is translated into table[0], and so on. In your case, it's just returning an identity table.
It should be noted that string.translate is deprecated in Python 2.7, and from Python 3.1 onwards these functions are replaced by bytes.maketrans(), bytes.translate(), and the corresponding methods for str and bytearray.
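The table[ord(c)] lookup described above can be checked directly in Python 3 with bytes.maketrans(). This is a small illustration and the sample byte strings are arbitrary:

```python
# Python 3: translation tables for bytes are built with bytes.maketrans.
identity = bytes.maketrans(b'', b'')
table = bytes.maketrans(b'ab01', b'AB89')

print(len(table))                         # one entry per possible byte value
print(identity == bytes(range(256)))      # the empty mapping is the identity table
print(table[ord('a')] == ord('A'))        # lookup is table[byte value]
print(b'ab01 xyz'.translate(table))
print(b'a,b.c!'.translate(None, b',.!'))  # delete bytes without translating
```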
I have a py3 string that includes escaped UTF-8 sequences, such as "Company\\ffffffc2\\ffffffae", which I would like to convert to the correct UTF-8 string (in this example "Company®", since the escaped sequence is c2 ae). I've tried
print (bytes("Company\\\\ffffffc2\\\\ffffffae".replace(
"\\\\ffffff", "\\x"), "ascii").decode("utf-8"))
result: Company\xc2\xae
print (bytes("Company\\\\ffffffc2\\\\ffffffae".replace (
"\\\\ffffff", "\\x"), "ascii").decode("unicode_escape"))
result: CompanyÂ®
(wrong, since the characters are treated separately, but they should be treated together).
If I do
print (b"Company\xc2\xae".decode("utf-8"))
It gives the correct result.
Company®
How can I achieve that programmatically (i.e. starting from a py3 str)?
A simple solution is:
import ast
test_in = "Company\\\\ffffffc2\\\\ffffffae"
test_out = ast.literal_eval("b'''" + test_in.replace('\\\\ffffff','\\x') + "'''").decode('utf-8')
print(test_out)
However it will fail if there is a triple quote ''' in the input string itself.
The following code does not have this problem, but it is not as simple as the first one.
In the first step the string is split on a regular expression. Because the pattern contains a capturing group, the items at even indices (0, 2, ...) are the ASCII parts, e.g. "Company", and each item at an odd index is one escaped UTF-8 byte, e.g. "\\\\ffffffc2". Each substring is converted to bytes according to its meaning in the input string. Finally all parts are joined together and decoded from bytes to a string.
import re
REGEXP = re.compile(r'(\\\\ffffff[0-9a-f]{2})', flags=re.I)
def convert(estr):
    def split(estr):
        for i, substr in enumerate(REGEXP.split(estr)):
            if i % 2:
                yield bytes.fromhex(substr[-2:])
            elif substr:
                yield bytes(substr, 'ascii')
    return b''.join(split(estr)).decode('utf-8')
test_in = "Company\\\\ffffffc2\\\\ffffffae"
print(convert(test_in))
The code could be optimized: ASCII parts do not need the encode/decode round trip, and consecutive hex codes could be concatenated before decoding.
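One way that optimization might look is sketched below: a regex matches each run of consecutive escaped bytes, only that run is decoded, and the surrounding text is left untouched. The names RUN_RE, BYTE_RE and convert_runs are hypothetical, and the sketch assumes every run is valid UTF-8 (otherwise decode raises UnicodeDecodeError).

```python
import re

# One or more consecutive escaped bytes, e.g. \\ffffffc2\\ffffffae
RUN_RE = re.compile(r'(?:\\\\ffffff[0-9a-f]{2})+', flags=re.I)
# A single escaped byte; the group captures its two hex digits.
BYTE_RE = re.compile(r'\\\\ffffff([0-9a-f]{2})', flags=re.I)

def convert_runs(estr):
    def decode_run(match):
        # Join the hex pairs of the whole run, then decode them together.
        hex_digits = ''.join(BYTE_RE.findall(match.group(0)))
        return bytes.fromhex(hex_digits).decode('utf-8')
    return RUN_RE.sub(decode_run, estr)

test_in = "Company\\\\ffffffc2\\\\ffffffae"
print(convert_runs(test_in))
```

Because the unescaped parts never pass through bytes, input that already contains non-ASCII characters is also preserved.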
I have a unicode string as a result : u'splunk>\xae\uf001'
How can I get the substring 'uf001'
as a simple string in python?
The characters uf001 are not actually present in the string, so you can't just slice them off. You can do
repr(s)[-6:-1]
or
'u' + hex(ord(s[-1]))[2:]
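The hex() variant does not zero-pad, so a character such as \xae would come out as uae rather than u00ae. A possible alternative in Python 3 uses a format specification; the helper name codepoint_label is made up for this sketch.

```python
def codepoint_label(ch):
    """Return the code point of one character as text, e.g. 'uf001'."""
    # 04x pads to at least four lowercase hex digits.
    return 'u{:04x}'.format(ord(ch))

s = 'splunk>\xae\uf001'
print(codepoint_label(s[-1]))
print(codepoint_label(s[-2]))
```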
Since you want the actual character (as seen from the comments), just take the last character with index [-1]. Example:
>>> a = u'splunk>\xae\uf001'
>>> print(a)
splunk>®
>>> a[-1]
'\uf001'
>>> print(a[-1])
If you want the escaped representation (\uf001), then take repr(a[-1]). Example:
>>> repr(a[-1])
"'\\uf001'"
\uf001 is a single Unicode character (not several characters), so you can get it directly as shown above.
You see \uf001 because you are looking at the result of repr() on the string. If you print it, or use it somewhere else (for example, write it to a file), it will be the actual U+F001 character.
u'' is how a Unicode string literal is written in Python 2 source code. The REPL uses this representation by default to display unicode objects:
>>> u'splunk>\xae\uf001'
u'splunk>\xae\uf001'
>>> print(u'splunk>\xae\uf001')
splunk>®
>>> print(u'splunk>\xae\uf001'[-1])

If your terminal is not configured to display Unicode or if you are on a narrow build (e.g., it is likely for Python 2 on Windows) then the result may be different.
A Unicode string is an immutable sequence of Unicode codepoints in Python. len(u'\uf001') == 1: it does not contain uf001 (5 characters) in it. You could write it as u'' (it is necessary to declare the character encoding of your source file on Python 2 if you use non-ASCII characters):
>>> u'\uf001' == u''
True
It is just a different way to represent exactly the same Unicode character (a single codepoint in this case).
Note: some user-perceived characters may span several Unicode codepoints e.g.:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'ё')
u'\u0435\u0308'
>>> print(unicodedata.normalize('NFKD', u'ё'))
ё
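The difference between one user-perceived character and the number of codepoints can be seen by comparing normalization forms. A brief Python 3 sketch using the same letter:

```python
import unicodedata

composed = unicodedata.normalize('NFC', '\u0435\u0308')   # single codepoint U+0451
decomposed = unicodedata.normalize('NFD', '\u0451')       # base letter + combining mark

print(len(composed), len(decomposed))
# The two strings render the same but do not compare equal as sequences.
print(composed == decomposed)
# Normalizing both to the same form makes the comparison meaningful.
print(unicodedata.normalize('NFC', decomposed) == composed)
```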
How do you dynamically create single char hex values?
For instance, I tried
a = "ff"
"\x{0}".format(a)
and
a = "ff"
"\x" + a
I ultimately was looking for something like
\xff
However, neither of the combinations above appears to work.
Additionally, I was originally using chr to obtain single char hex representations of integers but I noticed that chr(63) would return ? (as that is its ascii representation).
Is there another function aside from chr that will return chr(63) as \x_ _ where _ _ is its single char hex representation? In other words, a function that only produces single char hex representations.
When you write \x{0}, Python treats \x as the start of a hex escape and expects the next two characters to be hexadecimal digits, but here they are not. The table of escape sequences in the Python documentation says:
\xhh    Character with hex value hh    (4, 5)
(4) Unlike in Standard C, exactly two hex digits are required.
(5) In a string literal, hexadecimal and octal escapes denote the byte with the given value; it is not necessary that the byte encodes a character in the source character set. In a Unicode literal, these escapes denote a Unicode character with the given value.
So, you have to escape \ in \x, like this
print "\\x{0}".format(a)
# \xff
Try str.decode with 'hex' encoding:
In [204]: a.decode('hex')
Out[204]: '\xff'
Besides, chr already returns a single-character string; only the way it is displayed differs, so you don't need to worry about how the string prints:
In [219]: c = chr(31)
In [220]: c
Out[220]: '\x1f'
In [221]: print c #invisible printout
In [222]:
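For readers on Python 3, where str.decode('hex') is not available, the distinction between the actual byte, the corresponding character, and the escaped text might be sketched like this (the variable names are illustrative only):

```python
a = "ff"

as_byte = bytes.fromhex(a)         # the actual single byte
as_char = chr(int(a, 16))          # the character U+00FF
as_text = "\\x{0}".format(a)       # four characters: backslash, x, f, f

print(as_byte, len(as_byte))
print(repr(as_char), len(as_char))
print(as_text, len(as_text))

# Escaped text for any integer 0..255, including printable ones such as 63 ('?').
print("\\x{:02x}".format(63))
```

The last line answers the chr(63) case: formatting the integer always yields the \xhh text, whereas chr() yields the character itself.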