How to convert a network hex string to an integer in Python

I have a hexadecimal string which I received from the internet (an SSL client): "\x7f\xab\xff". This string is actually the length of something. I need to calculate the integer value of a substring of the above string (based on starting and ending position). How do I do that?
I tried struct.pack and unpack; it did not work.
I tried splitting on \x, and that gave some UTF error.
I tried converting the string to a raw string; even that did not work:
>>> r"\xff\x7f".replace('\x', '')
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \xXX escape
>>> "\xff\x7f"[1]
'\x7f'
>>> "\xff\x7f"[0]
'ÿ'

The following worked for me:
>>> str_val = r'\xff\x7f'
>>> int_val = int(str_val.replace('\\x', ''), 16)
>>> int_val
65407
Don't forget that the backslash character is literally appearing in the string. So to target it, you need to declare '\\x', not just '\x'!
Note that this also assumes the first octet, '\xff', is more significant than the second, '\x7f' (big-endian order). It's possible the source of this value wants you to treat the second octet as the more significant one.
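If the data arrives as actual bytes from the network (as in the original question) rather than as a literal backslash-escape string, no string manipulation is needed at all: int.from_bytes (Python 3) or struct.unpack reads a slice of the bytes directly. A minimal sketch, assuming big-endian (network) byte order:

```python
import struct

data = b"\x7f\xab\xff"  # raw bytes as received from the SSL socket

# Interpret the first two bytes as a big-endian unsigned integer
length = int.from_bytes(data[0:2], byteorder="big")
print(length)  # 0x7fab == 32683

# Equivalent with struct: ">H" means big-endian unsigned 16-bit
(length2,) = struct.unpack(">H", data[0:2])
print(length2)  # 32683
```

Slicing `data[start:end]` covers the "sub-string based on starting and ending position" part of the question; `int.from_bytes` accepts a slice of any length, while struct format codes are fixed-width.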

Related

Python converting code page character number to unicode

By default, print(chr(195)) displays the Unicode character at code point 195 ("Ã").
How do I print chr(195) as it appears in code page 1251, i.e. "Г"?
I tried print(chr(195).decode('cp1252')) and various .encode methods.
You cannot store a 'raw' byte value 0xC3 in a string (and if you did, you should not have: raw, "unparsed" binary data belongs in a byte array). The proper way to convert from a raw byte array is indeed .decode('cp1251'):
>>> print (b'\xc3'.decode('cp1251'))
Г
However, if you already have it in a string, then the easiest approach is to first convert from a string to a bytes object using the one-to-one "encoding" Latin-1:
>>> s = 'Ãamma'
>>> print(s.encode('latin1').decode('cp1251'))
Гamma
In Python 3, chr(n) returns a Unicode string, which can only be encoded. Use bytes to create byte strings that can be decoded:
>>> bytes([195])
b'\xc3'
>>> bytes([195]).decode('cp1251')
'Г'
>>> bytes([195,196,197])
b'\xc3\xc4\xc5'
>>> bytes([195,196,197]).decode('cp1251')
'ГДЕ'
You can also URL-encode the result with urllib (Python 2 syntax shown; in Python 3 use urllib.parse.quote_plus):
import urllib
print urllib.quote_plus(u'Гamma'.encode('cp1251'))
Also remember: if you are using international strings in Python 2, make sure to include the u prefix on the string literal you are parsing.
s = u"whateverhere"

How to list Amharic (Unicode) code points in python 3.6

I want a list containing the Amharic alphabet as Unicode characters. The character range is U+1200 to U+1399. I am using Windows 8. I encountered: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-5: truncated \UXXXXXXXX escape.
I tried this:
[print(c) for c in u'U1399']
How can I list the characters?
To print the characters from U+1200 to U+1399, I would use a for loop with an int control variable. It's easy enough to convert numbers to characters using chr().
The integer value 0x1200 (i.e. 1200 in hexadecimal) can be converted to the Unicode code point U+1200 like so: chr(0x1200) == '\u1200'.
Similarly for 0x1201, 0x1202, ..., 0x1399.
Note that we use .isprintable() to filter out some of the unprintable (unassigned) entries.
print(' '.join(chr(x) for x in range(0x1200, 0x139A) if chr(x).isprintable()))
or
for x in range(0x1200, 0x139A):
    if chr(x).isprintable():
        print(hex(x), chr(x))
Note that these code samples require Python 3.
Your posted code doesn't produce any errors at all:
>>> [print(c) for c in u'U1399']
U
1
3
9
9
[None, None, None, None, None]
It also doesn't have any non-ASCII characters in it.
You probably wanted to use a Unicode backslash escape. And your problem is probably more like this:
>>> u'\U1399'
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-5: truncated \UXXXXXXXX escape
The reason is that—as the error message implies—a \U escape requires 8 hex digits, and you've only provided 4. So:
>>> u'\U00001399'
'᎙'
But there's a different escape sequence, \u (notice the lowercase u), which takes only 4 digits:
>>> u'\u1399'
'᎙'
If you're using Python 2.7, and possibly even with Python 3 on Windows, you may not see that nice output, but instead something with backslash escapes in it. But if you print that string, you will see the right character.
The full details for \U and \u escapes (and other escapes) are documented in String and Bytes literals (make sure to switch to the Python version you're actually using, because the details can be different, especially between 2.x and 3.x), but usually you don't need to know much more than explained above.
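Combining the two answers above, the list the question asks for can be built with a comprehension over the code-point range (a sketch; the endpoints 0x1200 and 0x1399 come straight from the question):

```python
# Ethiopic characters from U+1200 through U+1399, skipping unassigned
# (unprintable) code points in the range
amharic = [chr(cp) for cp in range(0x1200, 0x139A) if chr(cp).isprintable()]

print(amharic[0])    # ሀ (U+1200, the first Ethiopic syllable)
print(len(amharic))  # fewer than 0x19A entries, since gaps are skipped
```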

Python utf-8 character range

I work with a text file encoded in UTF-8 and read its contents with Python. After reading the content, I split the text into a character array.
import codecs
with codecs.open(fullpath,'r',encoding='utf8') as f:
text = f.read()
# Split the 'text' to characters
Now I'm iterating over each character: first converting it to its numeric code point, then running some code on it.
numericValue = ord(char)
I have noticed that some of those characters are beyond the expected range.
Expected max value: FFFF.
Actual character value: 1D463.
I translated this code to Python from C#, where '\u1D463' is an invalid character literal.
I am confused.
It seems you escaped your Unicode code point (U+1D463) with \u instead of \U. The former expects four hex digits, whereas the latter expects eight. According to Microsoft Visual Studio:
The condition was ch == '\u1D463'
When I used this literal in the Python interpreter, it doesn't complain: it happily consumes the first four hex digits as the escape, and the trailing 3 prints as a literal character when run in cmd:
>>> print('\u1D463')
ᵆ3
You got the result Expected max value - FFFF. Actual character value - 1D463 because you're using the wrong Unicode escape: use \U0001D463 instead of \u1D463. \u takes exactly four hex digits (so at most \uFFFF), while \U takes exactly eight hex digits, though the largest valid code point is \U0010FFFF. Notice the leading zeros in \U0001D463:
>>> '\U1D463'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-6: truncated \UXXXXXXXX escape
>>> '\uFF'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-3: truncated \uXXXX escape
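To see why the original check failed, note that in Python 3 a character outside the Basic Multilingual Plane is still a single character; ord() simply returns a value above 0xFFFF. A small sketch (Python 3.3+):

```python
ch = '\U0001D463'  # MATHEMATICAL ITALIC SMALL V, outside the BMP

print(ord(ch))             # 119907 (0x1d463) -- above the 0xFFFF "expected max"
print(hex(ord(ch)))        # 0x1d463
print(len(ch))             # 1: one character, even though it needs 4 bytes in UTF-8
print(ch.encode('utf-8'))  # b'\xf0\x9d\x91\xa3'
```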

Convert Unicode UCS-4 into UTF-8

I have a value like u'\U00000958' coming back from a database and I want to convert this string to UTF-8. I try something like this:
cp = u'\\U00000958'
value = cp.decode('unicode-escape').encode('utf-8')
print 'Value: ' + value
I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position
0: ordinal not in range(128)
What can I do to properly convert this value?
More detail: I'm on Python 2.7.10, which uses UCS-2.
For Unicode issues, it often helps to specify Python 2 versus Python 3, and also how you came to get a particular representation.
It is not clear from the first sentence what the actual value is, as opposed to how it is displayed. It is unclear whether the value like u'\\U00000958' is a 1-character unicode string, a 10-character unicode string, a 14-character (ascii) byte string, or something else. Use len and type to be sure of what you have.
By trying to decode cp, you are implying that you know that cp is bytes, but in what encoding? The error message says that it is not ascii bytes; 0xe0 is a typical start byte for utf-8 encoding. The following interaction
>>> s = "u'\\U00000958'"
>>> se = eval(s)
>>> se
u'\u0958'
>>> se.encode(encoding='utf-8')
'\xe0\xa5\x98'
>>>
suggests to me that cp, starting with \xe0, is three UTF-8-encoded bytes, and that u'\\U00000958' is an evaluable representation of its Unicode decoding.
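In Python 3 the distinction between the two possibilities is explicit, so they are easy to tell apart. A sketch, assuming the database really returned the ten-character escape string rather than the character itself:

```python
# Case 1: a literal backslash-escape string (10 characters: \ U 0 0 0 0 0 9 5 8)
escaped = '\\U00000958'
ch = escaped.encode('ascii').decode('unicode_escape')
print(ch)       # क़ (DEVANAGARI LETTER QA, U+0958)
print(len(ch))  # 1

# Case 2: you already have the real one-character string; just encode it
utf8 = ch.encode('utf-8')
print(utf8)     # b'\xe0\xa5\x98' -- the same three bytes shown in the answer above
```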

Why do a string and a list behave differently with Unicode (UTF-8)? How can '-' cause an error?

>>> final = []
>>> for a in range(65535):
	final.append([a, chr(a)])

>>> file = open('1.txt', 'w', encoding='utf-8')
>>> file.write(str(final))
960881
>>> file.close()
>>> final = ''
>>> for a in range(65535):
	final += '%d -------- %s' % (a, chr(a))

>>> file = open('2.txt', 'w', encoding='utf-8')
>>> file.write(final)
Traceback (most recent call last):
File "<pyshell#29>", line 1, in <module>
file.write(final)
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 873642: surrogates not allowed
As you can see, 1.txt is saved. Why does writing the second file (the string) raise an error?
From Wikibooks:
Unicode and ISO/IEC 10646 do not assign actual characters to any of the code points in the D800–DFFF range — these code points only have meaning when used in surrogate pairs. Hence an individual code point from a surrogate pair does not represent a character and is invalid unless used in a surrogate pair […]
So I'd say chr(0xd800) already is invalid and I guess Python just doesn't check it for speed reasons. But the UTF-8 encoder does check it and complains.
The reason it works for the first file is that wrapping the string in a list and using str on that list leads to repr-ing the string:
>>> str( chr(0xd800) )
'\ud800'
>>> str([chr(0xd800)])
"['\\ud800']"
Note the double backslash in the list version. Instead of one "invalid character" \ud800 it's the six valid characters \, u, d, 8, 0 and 0. And those can be encoded.
The codepoints U+D800 through U+DFFF are reserved for surrogate pairs and can already be seen in the error message
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 873642: surrogates not allowed
You can't write characters in that range. It's only used for UTF-16 to encode codepoints outside the BMP (i.e. > 65535).
Note that Unicode is not a 16-bit charset, so going up to 65535 is not enough. To print all the Unicode characters you need to go all the way up to U+10FFFF, excluding the surrogate range. It's also easier to use UTF-32 for this instead.
I am not aware of a good way to convert UTF-16 to UTF-8; however, you could apply this method when reading the file if you do NOT require a 100% accurate representation:
f = open(filename, encoding='utf-8', errors='replace')
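To make the second loop from the question succeed, either skip the surrogate code points when building the string, or tell the encoder how to handle unencodable characters. A sketch (the file name mirrors the question):

```python
# Option 1: skip the surrogate range U+D800..U+DFFF when building the string,
# since lone surrogates cannot be encoded in UTF-8
parts = []
for a in range(65536):
    if 0xD800 <= a <= 0xDFFF:
        continue
    parts.append('%d -------- %s' % (a, chr(a)))
final = ''.join(parts)

with open('2.txt', 'w', encoding='utf-8') as f:
    f.write(final)  # succeeds: no lone surrogates remain

# Option 2: let the encoder substitute '?' for unencodable characters
print(chr(0xD800).encode('utf-8', errors='replace'))  # b'?'
```

Option 1 preserves every encodable character exactly; Option 2 is the write-side analogue of the errors='replace' read shown above, trading accuracy for never raising.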
