Python utf-8 character range - python

I am working with a UTF-8 encoded text file and reading its contents with Python. After reading the content, I split the text into an array of characters.
import codecs
with codecs.open(fullpath, 'r', encoding='utf8') as f:
    text = f.read()
# Split the 'text' to characters
Now I iterate over each character. First I get its numeric code point value, then I run some code on it.
numerialValue = ord(char)
I have noticed that some of these characters are beyond the range I expected.
Expected max value - FFFF.
Actual character value - 1D463.
I translated this code to Python. The original source code is in C#, where the value '\u1D463' is an invalid character.
I am confused.

It seems you escaped your Unicode code-point (U+1D463) with \u instead of \U. The former expects four hex digits, where the latter expects eight hex digits. According to Microsoft Visual Studio:
The condition was ch == '\u1D463'
When I used this literal in the Python interpreter, it did not complain. It happily decodes the first four hex digits as one character and prints the trailing 3 as an ordinary character when run in cmd:
>>> print('\u1D463')
ᵆ3
You got "Expected max value - FFFF. Actual character value - 1D463" because you're using the wrong Unicode escape; use \U0001D463 instead of \u1D463. The largest code point \u can express is \uFFFF, while \U can express any code point up to \U0010FFFF, the top of the Unicode range. Notice the leading zeros in \U0001D463: \U takes exactly eight hex digits and \u takes exactly four hex digits:
>>> '\U1D463'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-6: truncated \UXXXXXXXX escape
>>> '\uFF'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-3: truncated \uXXXX escape
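As a minimal sketch of the difference (assuming Python 3; the variable names are only illustrative), the two escapes can be compared directly:

```python
# '\u' consumes exactly four hex digits, so '\u1D463' is two characters:
# U+1D46 followed by the literal digit '3'.
wrong = '\u1D463'
# '\U' consumes exactly eight hex digits and can reach code points above U+FFFF.
right = '\U0001D463'

print(len(wrong), [hex(ord(c)) for c in wrong])   # 2 ['0x1d46', '0x33']
print(len(right), hex(ord(right)))                # 1 0x1d463

# chr() builds the same character from the integer, with no escape needed.
assert right == chr(0x1D463)
```

When the code point is already available as a number, chr() avoids the digit-count question entirely.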

Related

Can't do ASCII with 'u' character in python

I'm trying to do an ascii image in python but gives me this error
File "main.py", line 1
teste = print('''
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 375-376: truncated \UXXXXXXXX escape
I think it's because of the U character. Why does that happen, and is there any way to solve it?
ASCII image
You've got \U in your string, which is being interpreted as the beginning of a Unicode ordinal escape (it expects it to be followed by 8 hex characters representing a single Unicode ordinal).
You could double the escape, making it \\U, but that would make it harder to see the image in the code itself. The simplest approach is to make it a raw string that ignores all escapes save escapes applied to the quote character, by putting an r immediately before the literal:
teste = print(r'''
Note the r immediately after the (, before the '''.
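A small sketch of the two options, using a made-up one-line fragment of art rather than the original image (which is not shown here):

```python
# In a normal literal, a backslash followed by U has to be doubled by hand.
escaped = '\\U/'

# In a raw literal the backslash is left alone, so the art can be pasted as-is.
raw = r'\U/'

# Both spellings produce the same three characters: backslash, U, slash.
assert escaped == raw
print(raw)
```

One caveat that applies to raw strings in general: a raw literal cannot end in an odd number of backslashes, so art whose last character is a backslash needs a trailing space or newline before the closing quotes.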

How can I add Unicode character in status text by using Tweepy?

I want to update my status with the Chinese character 我, whose code point is U+6211. I did the same thing I do when adding an emoji to a status ("\U0006211"), but it didn't work. Is it possible to post text that is not English?
The error that I got:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-8: truncated \UXXXXXXXX escape
When I use print("\u6211") it prints the correct character, but it shows up as a "?", so I think it should work if you have the font.
The \U escape sequence [...] expects eight hex digits
https://docs.python.org/3/howto/unicode.html#unicode-literals-in-python-source-code
You're providing 7 instead of 8 digits. You can simply append an extra 0: "\U00006211"
Alternatively, you can use the \u escape sequence: "\u6211"
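To avoid counting digits by hand, a small helper can pick the right escape form for a given code point. This is only a sketch; unicode_escape is a hypothetical name, not part of Python or Tweepy:

```python
def unicode_escape(code_point: int) -> str:
    """Return the source-code escape for a code point (hypothetical helper)."""
    if code_point <= 0xFFFF:
        # Fits in four hex digits: use the short \uXXXX form.
        return '\\u{:04X}'.format(code_point)
    # Otherwise pad to eight hex digits for the \UXXXXXXXX form.
    return '\\U{:08X}'.format(code_point)

print(unicode_escape(0x6211))    # \u6211
print(unicode_escape(0x1F600))   # \U0001F600

# All three spellings denote the same one-character string.
assert '\u6211' == '\U00006211' == chr(0x6211)
```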

How to convert network hex string to integer in python

I have a hexadecimal string which I have received from the network (an SSL client). It is as follows: "\x7f\xab\xff". This string is actually the length of something. I need to calculate the integer value of a sub-string of the above string (based on starting and ending position). How do I do that?
I tried struct.pack and unpack. It did not work
I tried doing a split based on \x, and that gave some UTF error
I tried converting the string to a raw string. Even that did not work
r"\xff\x7f".replace('\x' , '')
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \xXX escape
>>> "\xff\x7f"[1]
'\x7f'
>>> "\xff\x7f"[0]
'ÿ'
The following worked for me:
>>> str_val = r'\xff\x7f'
>>> int_val = int(str_val.replace('\\x', ''), 16)
>>> int_val
65407
Don't forget that the backslash character is literally appearing in the string. So to target it, you need to declare '\\x', not just '\x'!
Note that this also assumes the 1st octet, '\xff', is higher/more significant than the second, '\x7f'. It's possible the source of this value wants you to treat the second value as the more significant one.
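If the value actually arrives as raw bytes from the socket rather than as text containing backslashes (an assumption, since the question only shows the escaped form), int.from_bytes can decode a slice directly. The helper name length_field is hypothetical:

```python
data = b'\x7f\xab\xff'  # the example payload from the question, as bytes

def length_field(buf: bytes, start: int, end: int, byteorder: str = 'big') -> int:
    """Interpret buf[start:end] as an unsigned integer (hypothetical helper)."""
    return int.from_bytes(buf[start:end], byteorder)

print(length_field(data, 0, 3))            # 8367103  (0x7fabff)
print(length_field(data, 1, 3))            # 44031    (0xabff)
print(length_field(data, 1, 3, 'little'))  # 65451    (0xffab)
```

Network protocols usually send the most significant byte first, which is why 'big' is the default here; pass 'little' if the source uses the opposite order.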

How to list Amharic (Unicode) code points in python 3.6

I want a list containing the Amharic alphabet from Unicode. The character range is U+1200 to U+1399. I am using Windows 8. I encountered SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-5: truncated \UXXXXXXXX escape.
I tried this:
[print(c) for c in u'U1399']
How can I list the characters?
To print the characters from U+1200 to U+1399, I would use a for loop with an int control variable. It's easy enough to convert numbers to characters using chr().
The integer value 0x1200 (i.e. 1200 in hexadecimal) can be converted to the Unicode code point U+1200 like so: chr(0x1200) == '\u1200'.
Similarly for 0x1201, 0x1202, ... 0x1399.
Note that we use .isprintable() to filter out some of the useless entries.
print(' '.join(chr(x) for x in range(0x1200, 0x139A) if chr(x).isprintable()))
or
for x in range(0x1200, 0x139A):
    if chr(x).isprintable():
        print(hex(x), chr(x))
Note that the code samples require Python3.
Your posted code doesn't produce any errors at all:
>>> [print(c) for c in u'U1399']
U
1
3
9
9
[None, None, None, None, None]
It also doesn't have any non-ASCII characters in it.
You probably wanted to use a Unicode backslash escape. And your problem is probably more like this:
>>> u'\U1399'
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-5: truncated \UXXXXXXXX escape
The reason is that—as the error message implies—a \U escape requires 8 hex digits, and you've only provided 4. So:
>>> u'\U00001399'
'᎙'
But there's a different escape sequence, \u (notice the lowercase u), which takes only 4 digits:
>>> u'\u1399'
'᎙'
If you're using Python 2.7, and possibly even with Python 3 on Windows, you may not see that nice output, but instead something with backslash escapes in it. But if you print that string, you will see the right character.
The full details for \U and \u escapes (and other escapes) are documented in String and Bytes literals (make sure to switch to the Python version you're actually using, because the details can be different, especially between 2.x and 3.x), but usually you don't need to know much more than explained above.
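As a variation on the loop approach, the standard unicodedata module can skip unassigned code points and show each character's official name. This is a sketch; named_chars is a hypothetical helper, and which code points count as assigned depends on the Unicode database shipped with your Python version:

```python
import unicodedata

def named_chars(first: int, last: int):
    """Yield (code point, character, name) for assigned characters in [first, last]."""
    for cp in range(first, last + 1):
        ch = chr(cp)
        # unicodedata.name returns the default (None here) for unnamed code points.
        name = unicodedata.name(ch, None)
        if name is not None:
            yield cp, ch, name

# Print the first few entries of the range discussed above.
for cp, ch, name in named_chars(0x1200, 0x1205):
    print('U+{:04X} {} {}'.format(cp, ch, name))
```

To build the list the question asks for, something like [ch for _, ch, _ in named_chars(0x1200, 0x1399)] should work.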

How to initialize a UTF-16 in code?

Using Python3 to minimize the pain when dealing with Unicode, I can print a UTF-8 character as such:
>>> print (u'\u1010')
တ
But when trying to do the same with UTF-16, let's say U+20000, u'\u20000' is the wrong way to initialize the character:
>>> print (u'\u20000')
  0
>>> print (list(u'\u20000'))
['\u2000', '0']
It is read as two characters instead.
I've also tried the big U, i.e. u'\U20000', but it throws an escape error:
>>> print (u'\U20000')
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-6: truncated \UXXXXXXXX escape
A big U prefix outside the string didn't work either:
>>> print (U'\u20000')
 0
>>> print (U'\U20000')
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-6: truncated \UXXXXXXXX escape
These are not UTF-8 and UTF-16 literals, but just unicode literals, and they mean the same:
>>> print(u'\u1010')
တ
>>> print(u'\U00001010')
တ
>>> print(u'\u1010' == u'\U00001010')
True
The second form just allows you to specify a code point above U+FFFF.
How to do this the easiest way: encode your source file as UTF-8 (or UTF-16), and then you can just write u"တ" and u"𠀀".
UTF-8 and UTF-16 are ways to encode those to bytes. To be technical, in UTF-8 that would be "\xf0\xa0\x80\x80" (which I would probably write as u"𠀀".encode("utf-8")).
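A short sketch (Python 3) of how one code point maps to different byte sequences under each encoding:

```python
ch = '\U00020000'   # a single code point above U+FFFF

utf8 = ch.encode('utf-8')
utf16 = ch.encode('utf-16-be')   # big-endian, no byte order mark

print(len(ch))       # 1
print(utf8.hex())    # f0a08080
print(utf16.hex())   # d840dc00 (a surrogate pair)

# Decoding either byte sequence gives back the same one-character string.
assert utf8.decode('utf-8') == utf16.decode('utf-16-be') == ch
```

The string itself is neither UTF-8 nor UTF-16; those names only apply once it has been encoded to bytes.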
As @Mark Ransom commented, Python's \U notation requires exactly eight hex digits to work.
Therefore, the Python code to use is:
u"\U00020000"
as listed on this page:
Python source code u"\U00020000"
