str.isdigit() behaviour when handling strings - python

Assuming the following:
>>> square = '²' # Superscript Two (Unicode U+00B2)
>>> cube = '³' # Superscript Three (Unicode U+00B3)
Curiously:
>>> square.isdigit()
True
>>> cube.isdigit()
True
OK, let's convert those "digits" to integer:
>>> int(square)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '²'
>>> int(cube)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '³'
Oooops!
Could someone please explain what behavior I should expect from the str.isdigit() method when handling strings?

str.isdigit doesn't claim to be related to parsability as an int. It's reporting a simple Unicode property, is it a decimal character or digit of some sort:
str.isdigit()
Return True if all characters in the string are digits and there is at least one character, False otherwise. Digits include decimal characters and digits that need special handling, such as the compatibility superscript digits. This covers digits which cannot be used to form numbers in base 10, like the Kharosthi numbers. Formally, a digit is a character that has the property value Numeric_Type=Digit or Numeric_Type=Decimal.
In short, str.isdigit is thoroughly useless for detecting valid numbers. The correct solution to checking if a given string is a legal integer is to call int on it, and catch the ValueError if it's not a legal integer. Anything else you do will be (badly) reinventing the same tests the actual parsing code in int() performs, so why not let it do the work in the first place?
Side-note: You're using the term "utf-8" incorrectly. UTF-8 is a specific way of encoding Unicode, and only applies to raw binary data. Python's str is an "idealized" Unicode text type; it has no encoding (under the hood, it's stored encoded as one of ASCII, latin-1, UCS-2, UCS-4, and possibly also UTF-8, but none of that is visible at the Python layer outside of indirect measurements like sys.getsizeof, which only hints at the underlying encoding by letting you see how much memory the string consumes). The characters you're talking about are simple Unicode characters above the ASCII range, they're not specifically UTF-8.

Related

How to get the maximal unicode value in Python? [duplicate]

From the Python's documentation of the chr built-in function, the maximum value that chr accepts is 1114111 (in decimal) or 0x10FFFF (in base 16). And in fact
>>> chr(1114112)
Traceback (most recent call last):
File "<pyshell#20>", line 1, in <module>
chr(1114112)
ValueError: chr() arg not in range(0x110000)
My first question is the following, why exactly that number? The second question is: if this number changes, is it possible to know from a Python command the maximum value accepted by chr?
Use sys.maxunicode:
An integer giving the value of the largest Unicode code point, i.e. 1114111 (0x10FFFF in hexadecimal).
On my Python 2.7 UCS-2 build the maximum Unicode character supported by unichr() is 0xFFFF:
>>> import sys
>>> sys.maxunicode
65535
but Python 3.3 and newer switched to a new internal storage format for Unicode strings, and the maximum is now always 0x10FFFF. See PEP 393.
0x10FFFF is the maximum Unicode codepoint as defined in the Unicode standard. Quoting the Wikipedia article on Unicode:
Unicode defines a codespace of 1,114,112 code points in the range 0 to 10FFFF.

Unicode, Bytes, and String to Integer Conversion

I am writing a program that is dealing with letters from a foreign alphabet. The program is taking the input of a number that is associated with the unicode number for a character. For example 062A is the number assigned in unicode for that character.
I first ask the user to input a number that corresponds to a specific letter, i.e. 062A. I am now attempting to turn that number into a 16-bit integer that can be decoded by python to print the character back to the user.
example:
for \u0394
print(bytes([0x94, 0x03]).decode('utf-16'))
however when I am using
int('062A', '16')
I receive this error:
ValueError: invalid literal for int() with base 10: '062A'
I know it is because I am using A in the string, however that is the unicode for the symbol. Can anyone help me?
however when I am using int('062A', '16'), I receive this error:
ValueError: invalid literal for int() with base 10: '062A'
No, you aren't:
>>> int('062A', '16')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' object cannot be interpreted as an integer
It's exactly as it says. The problem is not the '062A', but the '16'. The base should be specified directly as an integer, not a string:
>>> int('062A', 16)
1578
If you want to get the corresponding numbered Unicode code point, then converting through bytes and UTF-16 is too much work. Just directly ask using chr, for example:
>>> chr(int('0394', 16))
'Δ'

TensorFlow Setup Issue [duplicate]

Apparently, the following is the valid syntax:
b'The string'
I would like to know:
What does this b character in front of the string mean?
What are the effects of using it?
What are appropriate situations to use it?
I found a related question right here on SO, but that question is about PHP though, and it states the b is used to indicate the string is binary, as opposed to Unicode, which was needed for code to be compatible from version of PHP < 6, when migrating to PHP 6. I don't think this applies to Python.
I did find this documentation on the Python site about using a u character in the same syntax to specify a string as Unicode. Unfortunately, it doesn't mention the b character anywhere in that document.
Also, just out of curiosity, are there more symbols than the b and u that do other things?
Python 3.x makes a clear distinction between the types:
str = '...' literals = a sequence of Unicode characters (Latin-1, UCS-2 or UCS-4, depending on the widest character in the string)
bytes = b'...' literals = a sequence of octets (integers between 0 and 255)
If you're familiar with:
Java or C#, think of str as String and bytes as byte[];
SQL, think of str as NVARCHAR and bytes as BINARY or BLOB;
Windows registry, think of str as REG_SZ and bytes as REG_BINARY.
If you're familiar with C(++), then forget everything you've learned about char and strings, because a character is not a byte. That idea is long obsolete.
You use str when you want to represent text.
print('שלום עולם')
You use bytes when you want to represent low-level binary data like structs.
NaN = struct.unpack('>d', b'\xff\xf8\x00\x00\x00\x00\x00\x00')[0]
You can encode a str to a bytes object.
>>> '\uFEFF'.encode('UTF-8')
b'\xef\xbb\xbf'
And you can decode a bytes into a str.
>>> b'\xE2\x82\xAC'.decode('UTF-8')
'€'
But you can't freely mix the two types.
>>> b'\xEF\xBB\xBF' + 'Text with a UTF-8 BOM'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
The b'...' notation is somewhat confusing in that it allows the bytes 0x01-0x7F to be specified with ASCII characters instead of hex numbers.
>>> b'A' == b'\x41'
True
But I must emphasize, a character is not a byte.
>>> 'A' == b'A'
False
In Python 2.x
Pre-3.0 versions of Python lacked this kind of distinction between text and binary data. Instead, there was:
unicode = u'...' literals = sequence of Unicode characters = 3.x str
str = '...' literals = sequences of confounded bytes/characters
Usually text, encoded in some unspecified encoding.
But also used to represent binary data like struct.pack output.
In order to ease the 2.x-to-3.x transition, the b'...' literal syntax was backported to Python 2.6, in order to allow distinguishing binary strings (which should be bytes in 3.x) from text strings (which should be str in 3.x). The b prefix does nothing in 2.x, but tells the 2to3 script not to convert it to a Unicode string in 3.x.
So yes, b'...' literals in Python have the same purpose that they do in PHP.
Also, just out of curiosity, are there
more symbols than the b and u that do
other things?
The r prefix creates a raw string (e.g., r'\t' is a backslash + t instead of a tab), and triple quotes '''...''' or """...""" allow multi-line string literals.
To quote the Python 2.x documentation:
A prefix of 'b' or 'B' is ignored in
Python 2; it indicates that the
literal should become a bytes literal
in Python 3 (e.g. when code is
automatically converted with 2to3). A
'u' or 'b' prefix may be followed by
an 'r' prefix.
The Python 3 documentation states:
Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
The b denotes a byte string.
Bytes are the actual data. Strings are an abstraction.
If you had multi-character string object and you took a single character, it would be a string, and it might be more than 1 byte in size depending on encoding.
If took 1 byte with a byte string, you'd get a single 8-bit value from 0-255 and it might not represent a complete character if those characters due to encoding were > 1 byte.
TBH I'd use strings unless I had some specific low level reason to use bytes.
From server side, if we send any response, it will be sent in the form of byte type, so it will appear in the client as b'Response from server'
In order get rid of b'....' simply use below code:
Server file:
stri="Response from server"
c.send(stri.encode())
Client file:
print(s.recv(1024).decode())
then it will print Response from server
The answer to the question is that, it does:
data.encode()
and in order to decode it(remove the b, because sometimes you don't need it)
use:
data.decode()
Here's an example where the absence of b would throw a TypeError exception in Python 3.x
>>> f=open("new", "wb")
>>> f.write("Hello Python!")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface
Adding a b prefix would fix the problem.
It turns it into a bytes literal (or str in 2.x), and is valid for 2.6+.
The r prefix causes backslashes to be "uninterpreted" (not ignored, and the difference does matter).
In addition to what others have said, note that a single character in unicode can consist of multiple bytes.
The way unicode works is that it took the old ASCII format (7-bit code that looks like 0xxx xxxx) and added multi-bytes sequences where all bytes start with 1 (1xxx xxxx) to represent characters beyond ASCII so that Unicode would be backwards-compatible with ASCII.
>>> len('Öl') # German word for 'oil' with 2 characters
2
>>> 'Öl'.encode('UTF-8') # convert str to bytes
b'\xc3\x96l'
>>> len('Öl'.encode('UTF-8')) # 3 bytes encode 2 characters !
3
You can use JSON to convert it to dictionary
import json
data = b'{"key":"value"}'
print(json.loads(data))
{"key":"value"}
FLASK:
This is an example from flask. Run this on terminal line:
import requests
requests.post(url='http://localhost(example)/',json={'key':'value'})
In flask/routes.py
#app.route('/', methods=['POST'])
def api_script_add():
print(request.data) # --> b'{"hi":"Hello"}'
print(json.loads(request.data))
return json.loads(request.data)
{'key':'value'}
b"hello" is not a string (even though it looks like one), but a byte sequence. It is a sequence of 5 numbers, which, if you mapped them to a character table, would look like h e l l o. However the value itself is not a string, Python just has a convenient syntax for defining byte sequences using text characters rather than the numbers itself. This saves you some typing, and also often byte sequences are meant to be interpreted as characters. However, this is not always the case - for example, reading a JPG file will produce a sequence of nonsense letters inside b"..." because JPGs have a non-text structure.
.encode() and .decode() convert between strings and bytes.
bytes(somestring.encode()) is the solution that worked for me in python 3.
def compare_types():
output = b'sometext'
print(output)
print(type(output))
somestring = 'sometext'
encoded_string = somestring.encode()
output = bytes(encoded_string)
print(output)
print(type(output))
compare_types()

Identify unicode characters that can't be printed

I need to be able to determine (or predict) when a unicode character won't be printable. For instance, if I print this unicode character under default settings, it prints fine:
>>> print(u'\ua62b')
ꘫ
But if I print another unicode character, it prints as a stupid, weird square:
>>> print(u'\ua62c')
꘬
I really need to be able to determine before a character is printed if it will display like this as an ugly square (or sometimes as an anonymous blank). What causes this, and how can I predict it?
While it's not very easy to tell if the terminal running your script (or the font your terminal is using) is able to render a given character correctly, you can at least check that the character actually has a representation.
The character \ua62b is defined as VAI SYLLABLE NDOLE DO, whereas the character \ua62c has no definition, hence why it may be rendered as a square or other generic symbol.
One way to check if a character is defined is to use the unicodedata module:
>>> import unicodedata
>>> unicodedata.name(u"\ua62b")
'VAI SYLLABLE NDOLE DO'
>>> unicodedata.name(u"\ua62c")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
As you can see above, a ValueError is raised for the \ua62c character because it isn't defined.
Another method is to check the category of the character. If it is Cn then the character is not assigned:
>>> import unicodedata
>>> unicodedata.category(u"\ua62b")
'Lo'
>>> unicodedata.category(u"\ua62c")
'Cn'

What exactly is the binary literal in Python 3? [duplicate]

Apparently, the following is the valid syntax:
b'The string'
I would like to know:
What does this b character in front of the string mean?
What are the effects of using it?
What are appropriate situations to use it?
I found a related question right here on SO, but that question is about PHP though, and it states the b is used to indicate the string is binary, as opposed to Unicode, which was needed for code to be compatible from version of PHP < 6, when migrating to PHP 6. I don't think this applies to Python.
I did find this documentation on the Python site about using a u character in the same syntax to specify a string as Unicode. Unfortunately, it doesn't mention the b character anywhere in that document.
Also, just out of curiosity, are there more symbols than the b and u that do other things?
Python 3.x makes a clear distinction between the types:
str = '...' literals = a sequence of Unicode characters (Latin-1, UCS-2 or UCS-4, depending on the widest character in the string)
bytes = b'...' literals = a sequence of octets (integers between 0 and 255)
If you're familiar with:
Java or C#, think of str as String and bytes as byte[];
SQL, think of str as NVARCHAR and bytes as BINARY or BLOB;
Windows registry, think of str as REG_SZ and bytes as REG_BINARY.
If you're familiar with C(++), then forget everything you've learned about char and strings, because a character is not a byte. That idea is long obsolete.
You use str when you want to represent text.
print('שלום עולם')
You use bytes when you want to represent low-level binary data like structs.
NaN = struct.unpack('>d', b'\xff\xf8\x00\x00\x00\x00\x00\x00')[0]
You can encode a str to a bytes object.
>>> '\uFEFF'.encode('UTF-8')
b'\xef\xbb\xbf'
And you can decode a bytes into a str.
>>> b'\xE2\x82\xAC'.decode('UTF-8')
'€'
But you can't freely mix the two types.
>>> b'\xEF\xBB\xBF' + 'Text with a UTF-8 BOM'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
The b'...' notation is somewhat confusing in that it allows the bytes 0x01-0x7F to be specified with ASCII characters instead of hex numbers.
>>> b'A' == b'\x41'
True
But I must emphasize, a character is not a byte.
>>> 'A' == b'A'
False
In Python 2.x
Pre-3.0 versions of Python lacked this kind of distinction between text and binary data. Instead, there was:
unicode = u'...' literals = sequence of Unicode characters = 3.x str
str = '...' literals = sequences of confounded bytes/characters
Usually text, encoded in some unspecified encoding.
But also used to represent binary data like struct.pack output.
In order to ease the 2.x-to-3.x transition, the b'...' literal syntax was backported to Python 2.6, in order to allow distinguishing binary strings (which should be bytes in 3.x) from text strings (which should be str in 3.x). The b prefix does nothing in 2.x, but tells the 2to3 script not to convert it to a Unicode string in 3.x.
So yes, b'...' literals in Python have the same purpose that they do in PHP.
Also, just out of curiosity, are there
more symbols than the b and u that do
other things?
The r prefix creates a raw string (e.g., r'\t' is a backslash + t instead of a tab), and triple quotes '''...''' or """...""" allow multi-line string literals.
To quote the Python 2.x documentation:
A prefix of 'b' or 'B' is ignored in
Python 2; it indicates that the
literal should become a bytes literal
in Python 3 (e.g. when code is
automatically converted with 2to3). A
'u' or 'b' prefix may be followed by
an 'r' prefix.
The Python 3 documentation states:
Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
The b denotes a byte string.
Bytes are the actual data. Strings are an abstraction.
If you had multi-character string object and you took a single character, it would be a string, and it might be more than 1 byte in size depending on encoding.
If took 1 byte with a byte string, you'd get a single 8-bit value from 0-255 and it might not represent a complete character if those characters due to encoding were > 1 byte.
TBH I'd use strings unless I had some specific low level reason to use bytes.
From server side, if we send any response, it will be sent in the form of byte type, so it will appear in the client as b'Response from server'
In order get rid of b'....' simply use below code:
Server file:
stri="Response from server"
c.send(stri.encode())
Client file:
print(s.recv(1024).decode())
then it will print Response from server
The answer to the question is that, it does:
data.encode()
and in order to decode it(remove the b, because sometimes you don't need it)
use:
data.decode()
Here's an example where the absence of b would throw a TypeError exception in Python 3.x
>>> f=open("new", "wb")
>>> f.write("Hello Python!")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface
Adding a b prefix would fix the problem.
It turns it into a bytes literal (or str in 2.x), and is valid for 2.6+.
The r prefix causes backslashes to be "uninterpreted" (not ignored, and the difference does matter).
In addition to what others have said, note that a single character in unicode can consist of multiple bytes.
The way unicode works is that it took the old ASCII format (7-bit code that looks like 0xxx xxxx) and added multi-bytes sequences where all bytes start with 1 (1xxx xxxx) to represent characters beyond ASCII so that Unicode would be backwards-compatible with ASCII.
>>> len('Öl') # German word for 'oil' with 2 characters
2
>>> 'Öl'.encode('UTF-8') # convert str to bytes
b'\xc3\x96l'
>>> len('Öl'.encode('UTF-8')) # 3 bytes encode 2 characters !
3
You can use JSON to convert it to dictionary
import json
data = b'{"key":"value"}'
print(json.loads(data))
{"key":"value"}
FLASK:
This is an example from flask. Run this on terminal line:
import requests
requests.post(url='http://localhost(example)/',json={'key':'value'})
In flask/routes.py
#app.route('/', methods=['POST'])
def api_script_add():
print(request.data) # --> b'{"hi":"Hello"}'
print(json.loads(request.data))
return json.loads(request.data)
{'key':'value'}
b"hello" is not a string (even though it looks like one), but a byte sequence. It is a sequence of 5 numbers, which, if you mapped them to a character table, would look like h e l l o. However the value itself is not a string, Python just has a convenient syntax for defining byte sequences using text characters rather than the numbers itself. This saves you some typing, and also often byte sequences are meant to be interpreted as characters. However, this is not always the case - for example, reading a JPG file will produce a sequence of nonsense letters inside b"..." because JPGs have a non-text structure.
.encode() and .decode() convert between strings and bytes.
bytes(somestring.encode()) is the solution that worked for me in python 3.
def compare_types():
output = b'sometext'
print(output)
print(type(output))
somestring = 'sometext'
encoded_string = somestring.encode()
output = bytes(encoded_string)
print(output)
print(type(output))
compare_types()

Categories