Unicode, Bytes, and String to Integer Conversion

I am writing a program that deals with letters from a foreign alphabet. The program takes as input the Unicode number of a character, written in hexadecimal. For example, 062A is the number assigned in Unicode to one such character.
I first ask the user to input the number that corresponds to a specific letter, e.g. 062A. I am now attempting to turn that number into a 16-bit integer that Python can decode, so that I can print the character back to the user.
example:
for \u0394
print(bytes([0x94, 0x03]).decode('utf-16'))
however when I am using
int('062A', '16')
I receive this error:
ValueError: invalid literal for int() with base 10: '062A'
I know it is because I am using A in the string, however that is the unicode for the symbol. Can anyone help me?

however when I am using int('062A', '16'), I receive this error:
ValueError: invalid literal for int() with base 10: '062A'
No, you aren't:
>>> int('062A', '16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object cannot be interpreted as an integer
It's exactly as it says. The problem is not the '062A', but the '16'. The base should be specified directly as an integer, not a string:
>>> int('062A', 16)
1578
If you want the character at the corresponding Unicode code point, then converting through bytes and UTF-16 is too much work. Just ask for it directly using chr, for example:
>>> chr(int('0394', 16))
'Δ'
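Putting the two steps together, a minimal sketch for Python 3 might look like the following. The function name char_from_hex is made up for illustration; only int and chr come from the answer above.

```python
def char_from_hex(code_point_hex):
    """Return the character for a hexadecimal code point string such as '062A'."""
    # int(..., 16) parses the hex digits; note that the base is an int, not '16'.
    code_point = int(code_point_hex, 16)
    # chr maps the code point number straight to the character.
    return chr(code_point)


print(char_from_hex('062A'))  # ARABIC LETTER TEH
print(char_from_hex('0394'))  # GREEK CAPITAL LETTER DELTA
```

Both int and chr raise ValueError on bad input (non-hex text, or a number outside the Unicode range), so user input is probably best wrapped in a try/except.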

Related

str.isdigit() behaviour when handling strings

Assuming the following:
>>> square = '²' # Superscript Two (Unicode U+00B2)
>>> cube = '³' # Superscript Three (Unicode U+00B3)
Curiously:
>>> square.isdigit()
True
>>> cube.isdigit()
True
OK, let's convert those "digits" to integer:
>>> int(square)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '²'
>>> int(cube)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '³'
Oooops!
Could someone please explain what behavior I should expect from the str.isdigit() method when handling strings?

str.isdigit doesn't claim to be related to parsability as an int. It reports a simple Unicode property, namely whether each character is a decimal character or a digit of some sort:
str.isdigit()
Return True if all characters in the string are digits and there is at least one character, False otherwise. Digits include decimal characters and digits that need special handling, such as the compatibility superscript digits. This covers digits which cannot be used to form numbers in base 10, like the Kharosthi numbers. Formally, a digit is a character that has the property value Numeric_Type=Digit or Numeric_Type=Decimal.
In short, str.isdigit is thoroughly useless for detecting valid numbers. The correct solution to checking if a given string is a legal integer is to call int on it, and catch the ValueError if it's not a legal integer. Anything else you do will be (badly) reinventing the same tests the actual parsing code in int() performs, so why not let it do the work in the first place?
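A minimal sketch of the "call int and catch the ValueError" approach, assuming Python 3. The helper name parse_int and the choice to return None on failure are illustrative, not part of any library:

```python
def parse_int(text):
    """Return int(text), or None when text is not a legal integer literal."""
    try:
        return int(text)
    except ValueError:
        return None


for candidate in ['42', '-5', '²', 'abc']:
    # isdigit and int disagree on '²' (digit, not parsable) and '-5' (parsable, not all digits).
    print(repr(candidate), candidate.isdigit(), parse_int(candidate))
```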
Side-note: You're using the term "utf-8" incorrectly. UTF-8 is a specific way of encoding Unicode, and only applies to raw binary data. Python's str is an "idealized" Unicode text type; it has no encoding (under the hood, it's stored encoded as one of ASCII, latin-1, UCS-2, UCS-4, and possibly also UTF-8, but none of that is visible at the Python layer outside of indirect measurements like sys.getsizeof, which only hints at the underlying encoding by letting you see how much memory the string consumes). The characters you're talking about are simple Unicode characters above the ASCII range, they're not specifically UTF-8.

Bytearray conversion, integer is required error on python3

Python 3 asks for an integer when I try to add the byte 0x00 to a bytearray:
>>> command = bytearray()
>>> command.extend(chr(0x00))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: an integer is required

Bytearrays consist of either bytes (b'\x00') or byte-sized ints (0x00). The result of chr(0x00) is a unicode string, however.
You can feed bytearray.extend with either a) a bytes string or b) an iterable of byte-sized integers. Both of these represent "sequence of bytes", which a bytearray is. Also, both can be used with hex notation.
command.extend(b'\x00')
command.extend([0x00])
In case you want to add a single integer, you can also use bytearray.append:
command.append(0x00)
Since a string is an iterable, bytearray.extend tries to append its elements. These are also strings, however. Hence, the error that an integer was expected.
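The options above can be sketched together as follows, assuming Python 3. The byte values are arbitrary examples:

```python
command = bytearray()

# extend accepts a bytes object ...
command.extend(b'\x00')
# ... or any iterable of ints in range(256).
command.extend([0x01, 0x02])
# append adds exactly one int.
command.append(0xff)
# Text has to be encoded to bytes first; a str is never accepted directly.
command.extend('A'.encode('ascii'))

print(command)
```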

How do I encode hexadecimal to base64 in python?

If I try to do:
from base64 import b64encode
b64encode('ffffff')
I get this error:
Traceback (most recent call last):
  File "<pyshell#13>", line 1, in <module>
    base64.b64encode('ffffff')
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/base64.py", line 58, in b64encode
    encoded = binascii.b2a_base64(s, newline=False)
TypeError: a bytes-like object is required, not 'str'
Because it said bytes-like object I then tried this:
b64encode(bytes('ffffff'))
Which failed.
Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    b64encode(bytes('ffffff'))
TypeError: string argument without an encoding
Finally, using the .encode('utf-8') function:
b64encode('ffffff'.encode('utf-8'))
This gives the incorrect output b'ZmZmZmZm'; the correct base64 encoding is ////.
I already know how to decode b64 to hex so don't say how to do that.
Edit: This question got flagged for being the same as converting hex strings to hex bytes. This involves base64.

To fully go from the string ffffff to base64 of the hex value, you need to run it through some encoding and decoding, using the codecs module:
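import codecs
# Decode the hex string into raw bytes (avoid naming the variable "hex", which shadows the builtin)
raw = codecs.decode('ffffff', 'hex')
# Encode the raw bytes as base64 (returns bytes, with a trailing newline)
codecs.encode(raw, 'base64')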
For an odd-length string like 0xfffff you need to put a zero at the beginning of the hex string (0x0fffff), otherwise python will give you an error.
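As another possible route in Python 3, bytes.fromhex and base64.b64encode do the same two steps without codecs. The sketch below is not the answer's method; the function name hex_to_base64 is hypothetical, and padding odd-length input with a leading zero is an assumption taken from the remark above:

```python
from base64 import b64encode


def hex_to_base64(hex_string):
    """Convert a string of hex digits to its base64 text representation."""
    # Assumption: an odd number of digits is padded with a leading zero.
    if len(hex_string) % 2:
        hex_string = '0' + hex_string
    raw = bytes.fromhex(hex_string)         # 'ffffff' -> b'\xff\xff\xff'
    return b64encode(raw).decode('ascii')   # b'////' -> '////'


print(hex_to_base64('ffffff'))
```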

Here's an alternative to using codecs.
This one is a bit less readable, but works great and hopefully teaches you how codecs, hex and integers work. (Word of caution: it runs on odd-length strings too, but zip silently drops the final unpaired character.)
import struct
s = 'ffffff'
b''.join([struct.pack('B', int(''.join(x), 16)) for x in zip(s[0::2], s[1::2])])
Which should give you b'\xff\xff\xff'.
Your main problem is probably that you think 'ffffff' represents the values 255, 255, 255. It doesn't. It is still a string containing the letter f six times. So you need to parse the string representation of hex into actual byte values. We can do this by first passing the string through int(), which can interpret hex in string form.
You will need to convert each pair of ff individually by doing int('ff', 16), which tells Python to interpret the string as a base-16 integer (hex number).
And then convert that integer into a bytes like object representing that integer. That's where struct.pack comes in. It's meant for exactly this.
struct.pack('B', 255) # 255 is given to us by int('ff', 16)
Essentially, 'B' tells Python to pack the value 255 into a 1-byte object; in this case that gives us b'\xff', which is your end goal. Now do this for every pair of characters in your original data.
This is more of a manual approach where you'll iterate over 2 characters in the string at a time, and use the above description to bundle them into what you expect them to be. Or just use codecs, either way works.
Expanded version of the above oneliner:
import struct
hex_string = 'ffffff'
result = b''
for pair in zip(hex_string[0::2], hex_string[1::2]):
    value = int(''.join(pair), 16)
    result += struct.pack('B', value)
At the very least, I hope this explains how hex works on a practical level, and how the computer interprets our human-readable version of bits and bytes.

Error type in my work about python

#!/usr/bin/env python
# coding=utf8
value=input("please enter value:")
result=hex(value)
r=hex(0xffff-result)
print r
print result
TypeError: unsupported operand type(s) for -: 'int' and 'str'
I have been studying Python for a few days and tried this exercise. I can't understand what type '0xffff' is: is it a str or an int? And is it right for 'result' to be a str?

hex returns a string containing the hexadecimal representation of the given value. You don't need to convert the value to hex, though, since 0xffff is just an int literal.
Don't use input in Python 2; use raw_input to get a string, then explicitly (try to) convert the string to the value of the desired type.
value = raw_input("Please enter a value: ")
r = 0xffff - int(value)
print r # Print the result as a decimal value
print hex(r) # Print the result as a hexadecimal value

When you are just beginning with Python you can get a huge amount of help by just running an interactive Python session and typing code in. Just enter the command
python
The code you enter can be expressions or statements. Expressions are automatically evaluated, and the result printed (unless it happens to be None). Statements are executed.
>>> value=input("please enter value:")
please enter value:2134
>>> value
'2134'
The quotes around the value flag it as a string.
>>> result=hex(value)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object cannot be interpreted as an integer
If you look at the documentation for the hex built-in function you will see that it takes an integer argument.
>>> hex(int(value))
'0x856'
So you have now got a hexadecimal string. Let's store it.
>>> result = hex(int(value))
>>> r=hex(0xffff-result)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for -: 'int' and 'str'
>>> result
'0x856'
The problem here is that you are trying to subtract a hex string from an integer, because 0xffff is in fact an integer value. So all you actually need to do is subtract the (integer) value of your input from 0xffff. Then you presumably want to convert the result to hexadecimal.
>>> 0xffff - int(value)
63401
>>> hex(0xffff - int(value))
'0xf7a9'
By going through this interactive experimental methodology you save yourself considerable time by learning instantly what works and what doesn't work (and often, in the latter case, why not). So you are then much better placed to write your complete program.
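Collected into a small Python 3 script, the result of that session might look like the sketch below. The function name subtract_from_ffff is invented for illustration, and the fixed string '2134' stands in for what input() would return:

```python
def subtract_from_ffff(text):
    """Parse text as a decimal integer and subtract it from 0xffff."""
    value = int(text)        # input() returns a str in Python 3
    return 0xffff - value    # 0xffff is just an int literal (65535)


r = subtract_from_ffff('2134')
print(r)       # the result as a decimal value
print(hex(r))  # the result as a hexadecimal string
```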

Unknown format code 'f' for object of type 'unicode'

can someone tell me what is wrong with this code...
def format_money_value(num):
    return u'{0:.2f}'.format(num)
It gives me the following error:
Unknown format code 'f' for object of type 'unicode'
I'm running Django 1.5
Thank you

In your case num is a unicode string, which does not support the f format modifier:
>>> '{0:.2f}'.format(u"5.0")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Unknown format code 'f' for object of type 'unicode'
You can fix the error making the conversion to float yourself:
>>> '{0:.2f}'.format(float(u"5.0"))
'5.00'
As pointed out by mgilson, when you do '{0:.2f}'.format(num), the format method of the string calls num.__format__(".2f"). This results in an error for str or unicode, because they don't know how to handle this format specifier. Note that the meaning of f is left to the implementation of the object. For numeric types it means to convert the number to a floating point string representation, but other objects may have different conventions.
If you used the % formatting operator the behaviour is different, because in that case %f calls __float__ directly to obtain a floating point representation of the object.
Which means that when using %-style formatting f does have a specific meaning, which is to convert to a floating point string representation.
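The difference between the two mechanisms can be seen with a small experiment. This is a sketch for Python 3 (the answer above was written against Python 2), and the Price class is a hypothetical type that defines __float__ but no custom __format__:

```python
class Price:
    """Hypothetical numeric-like type that only knows how to become a float."""

    def __init__(self, amount):
        self.amount = amount

    def __float__(self):
        return float(self.amount)


price = Price("5.0")

# %-formatting converts the operand via __float__ before formatting it.
percent_result = '%.2f' % price
print(percent_result)

# str.format delegates to the object's __format__; Price inherits
# object.__format__, which rejects any non-empty format spec.
try:
    format_result = '{0:.2f}'.format(price)
except TypeError as exc:
    format_result = None
    print('format failed:', exc)
```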

What .format() does
The str.format method calls the __format__() method of the value's type. That means
<type>.__format__(<value>, <spec>)
This method takes a value of that type as its first argument and a format spec that the type supports
as its second one. For example,
str.__format__('1', 's')
int.__format__(1, 'f')
float.__format__(1.00, 'f')
str.__format__ accepts a str (or an instance of a str subclass) as its first argument. The spec must be a format code that is valid for that type. The following will raise an error
str.__format__('1', 'f')
ValueError: Unknown format code 'f' for object of type 'str'
since floating point formatting is not a suitable format type for a string. Likewise, the following will raise an error too
float.__format__(1.00, 's')
ValueError: Unknown format code 's' for object of type 'float'
since float is a numeric type and cannot be formatted as a string. But the following are all valid:
float.__format__(1.00, 'g')
float.__format__(1.00, 'f')
Similarly, the following will raise an exception
float.__format__(1.00, 'd')
ValueError: Unknown format code 'd' for object of type 'float'
since 'd' is an integer format, and formatting a float as an integer would lose the fractional part. Formatting an int as a float loses nothing, so it is a valid conversion:
int.__format__(1, 'f')
So .format() is limited to the specs that are available for the type being formatted. You must convert your value as @Bakuriu showed:
'{0:.2f}'.format(float(u"5.0"))
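The combinations listed above can be checked in one go. This is a sketch for Python 3 using the built-in format(), which calls type(value).__format__(value, spec); the helper name try_format is made up for illustration:

```python
def try_format(value, spec):
    """Return format(value, spec), or the error text if the type rejects the spec."""
    try:
        return format(value, spec)
    except ValueError as exc:
        return 'ValueError: {}'.format(exc)


cases = [('1', 's'), (1, 'f'), (1.0, 'f'), (1.0, 'g'),   # accepted
         ('1', 'f'), (1.0, 's'), (1.0, 'd')]             # rejected
for value, spec in cases:
    print(repr(value), repr(spec), '->', try_format(value, spec))
```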

The scenario where you are re-formatting a string (unicode or otherwise) as a float string is not very safe. You should first convert the string to a numeric representation and only if that succeeds should you format it as a string again. You are not in control of the data that comes into your program so you should be sure to validate it coming in.
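If you choose to leave it as a string, you still have to convert it inside the formatting call. Note that isnumeric() is only true for strings made up entirely of numeric characters, so a value such as u"5.0" takes the fallback branch:
return u"{:.2f}".format(float(num)) if num.isnumeric() else u"{}".format(num)
If you have already converted the string into a numeric format, you can adjust your formatter like this (NumberTypes stands for a tuple of the numeric types you accept, for example (int, float)):
return u"{:.2f}".format(num) if isinstance(num, NumberTypes) else u"{}".format(num)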
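One way to follow the "convert first, format only if that succeeds" advice is sketched below for Python 3. Falling back to the plain text of the value is an assumption about the desired behaviour, not something the question specifies:

```python
def format_money_value(num):
    """Format num with two decimals if it converts to float, else return it as text."""
    try:
        return '{0:.2f}'.format(float(num))
    except (TypeError, ValueError):
        # Assumption: unconvertible input is passed through unformatted.
        return '{}'.format(num)


for raw in ['5.0', 3, 'n/a', None]:
    print(repr(raw), '->', format_money_value(raw))
```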

I faced a similar problem to the OP's when num was a numerical answer returned by one of my libraries, whose implementation details I had forgotten:
>>> num
-4132.56528214700
>>> u'{0:.4g}'.format(num)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Unknown format code 'g' for object of type 'unicode'
I was really puzzled because num behaved like a float, but after testing @Bakuriu's solution, I found out it wasn't a float:
>>> type(num)
<class 'sympy.core.numbers.Float'>
So @Bakuriu's solution was right on target for my case:
>>> u'{0:.4g}'.format(float(num))
u'-4133'
Therefore, the error can be due to types that display and calculate like floats but aren't really floats.
