Getting the attributes of integers and floats [duplicate] - python

This question already has answers here:
Why doesn't 2.__add__(3) work in Python?
(2 answers)
Closed 8 years ago.
I was playing with the Python interpreter (Python 3.2.3) and tried the following:
>>> dir(1)
This gave me all the attributes and methods of the int object. Next I tried:
>>> 1.__class__
However this threw an exception:
File "<stdin>", line 1
1.__class__
^
SyntaxError: invalid syntax
When I tried out the same with a float I got what I expected:
>>> 2.0.__class__
<class 'float'>
Why do int and float literals behave differently?

It's probably a consequence of the parsing algorithm used. A simple mental model is that the tokenizer attempts to match all the token patterns there are, and recognizes the longest match it finds. At a lower level, the tokenizer works character by character and makes each decision based only on the current state and the input character – there shouldn't be any backtracking or re-reading of input.
After joining patterns with common prefixes – in this case, the pattern for int literals and the integral part of the pattern of float literals – what happens in the tokenizer is that it:
Reads the 1, and enters the state that indicates "reading either a float or an int literal"
Reads the ., and enters the state "reading a float literal"
Reads the _, which cannot be part of a float literal. The tokenizer emits 1. as a float literal token.
Carries on tokenizing starting with the _, and eventually emits __class__ as an identifier token.
Aside: This tokenizing approach is also the reason why common languages have the syntax restrictions they have. E.g. identifiers contain letters, digits, and underscores, but cannot start with a digit. If that were allowed, 123abc could be intended as either an identifier, or the integer 123 followed by the identifier abc. A lex-like tokenizer would recognize this as the former since it leads to the longest single token, but nobody likes having to keep details like this in their head when trying to read code. Or when trying to write and debug the tokenizer, for that matter.
The parser then tries to process the token stream:
<FloatLiteral: '1.'> <Identifier: '__class__'>
In Python, a literal directly followed by an identifier – without an operator between the tokens – makes no sense, so the parser bails. This also means that the reason why Python would complain about 123abc being invalid syntax isn't the tokenizer error "the character a isn't valid in an integer literal", but the parser error "the identifier abc cannot directly follow the integer literal 123".
The reason why the tokenizer can't recognize the 1 as an int literal is that the character that makes it leave the float-or-int state determines what it just read. If it's ., it was the start of a float literal, which might continue afterwards. If it's something else, it was a complete int literal token.
It's not possible for the tokenizer to "go back" and re-read the previous input as something else. In fact, the tokenizer is at too low a level to care about what an "attribute access" is and handle such ambiguities.
Now, your second example is valid because the tokenizer knows a float literal can only have one . in it. More precisely: the first . makes it transition from the float-or-int state to the float state. In this state, it only expects digits (or an E for scientific/engineering notation, a j for complex numbers…) to continue the float literal. The first character that's not a digit etc. (i.e. the second .) is definitely no longer part of the float literal, and the tokenizer can emit the finished token. The token stream for your second example will thus be:
<FloatLiteral: '2.0'> <Operator: '.'> <Identifier: '__class__'>
Which, of course, the parser then recognizes as valid Python. Now we also know enough to see why the suggested workarounds help. In Python, separating tokens with whitespace is optional – unlike, say, in Lisp. Conversely, whitespace does separate tokens. (That is, no tokens except string literals may contain whitespace; it's merely skipped between tokens.) So the code:
1 .__class__
is always tokenized as
<IntLiteral: '1'> <Operator: '.'> <Identifier: '__class__'>
And since a closing parenthesis cannot appear in an int literal, this:
(1).__class__
gets read as this:
<Operator: '('> <IntLiteral: '1'> <Operator: ')'> <Operator: '.'> <Identifier: '__class__'>
The above implies that, amusingly, the following is also valid:
1..__class__ # => <type 'float'>
The decimal part of a float literal is optional, and the second . read will make the preceding input be recognized as one.
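The longest-match model described in this answer can be sketched with a small toy tokenizer. This is only an illustration under simplifying assumptions, not CPython's actual tokenizer: the names TOKEN_PATTERNS and toy_tokenize are made up for this example, and the patterns cover just the few token kinds discussed here.

```python
import re

# Hypothetical token patterns for illustration only. At each position every
# pattern is tried and the longest match wins (earlier entries win ties).
TOKEN_PATTERNS = [
    ("FloatLiteral", re.compile(r"\d+\.\d*|\.\d+")),
    ("IntLiteral", re.compile(r"\d+")),
    ("Identifier", re.compile(r"[A-Za-z_]\w*")),
    ("Operator", re.compile(r"[().]")),
]


def toy_tokenize(source):
    """Split source into (kind, text) pairs using longest-match rules."""
    tokens = []
    pos = 0
    while pos < len(source):
        # Whitespace separates tokens but is not a token itself.
        if source[pos].isspace():
            pos += 1
            continue
        best = None
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.match(source, pos)
            if match and (best is None or match.end() > best[1].end()):
                best = (kind, match)
        if best is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()  # never go back and re-read earlier input
    return tokens


for example in ["1.__class__", "2.0.__class__", "1 .__class__",
                "(1).__class__", "1..__class__"]:
    print(example, "->", toy_tokenize(example))
```

In this sketch, 1.__class__ yields a float literal directly followed by an identifier, which is the stream a parser would reject, while each of the other inputs has an explicit . operator token between the literal and the identifier.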

It is a tokenization issue... the . is parsed as the beginning of the fractional part of a floating point number.
You can use
(1).__class__
to avoid the problem

Because if there's a . after a number, Python thinks you're creating a float. When it then encounters something that isn't a digit, it throws an error.
However, in a float, Python doesn't expect another . to be part of the value, hence the result! It works. :)
How do we get the attributes, then?
You can easily wrap it in parentheses. For example, see this console session:
>>> (1).__class__
<type 'int'>
Now, Python knows that you're not trying to make a float, but to refer to the int itself.
Bonus: putting a blank space after the number works as well.
>>> 1 .__class__
<type 'int'>
Also, if you only want to get the __class__, type(1) will do it for you.
Hope this helps!

Or you can even do this:
>>> getattr(1 , '__class__')
<type 'int'>

You need parentheses to surround the number:
>>> (1).__class__
<type 'int'>
Otherwise, Python sees the . after the number and it tries to interpret the whole thing as a float.
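The workarounds from the answers above can be checked against the real interpreter. The sketch below uses only the standard library; the helper name parses and the list of candidates are made up for this example.

```python
import ast


def parses(source):
    """Return True if source is accepted as a Python expression."""
    try:
        ast.parse(source, mode="eval")
    except SyntaxError:
        return False
    return True


candidates = [
    "1.__class__",              # rejected: '1.' is read as a float literal
    "2.0.__class__",            # float literal, then attribute access
    "(1).__class__",            # parentheses end the int literal
    "1 .__class__",             # whitespace ends the int literal
    "1..__class__",             # '1.' is a float, second '.' is the operator
    "getattr(1, '__class__')",  # no literal-dot ambiguity at all
]

for source in candidates:
    if parses(source):
        print(f"{source!r:28} -> {eval(source)}")
    else:
        print(f"{source!r:28} -> SyntaxError")
```

Only the first candidate should be reported as a syntax error; the exact error message varies between Python versions.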

Replace Unicode code point with actual character using regex

I have a large file where any unicode character that wasn't in UTF-8 got replaced by its code point in angle brackets (e.g. the "👍" was converted to "<U+0001F44D>"). Now I want to revert this with a regex substitution.
I've tried to accomplish this with
re.sub(r'<U\+([A-F0-9]+)>',r'\U\1', str)
but obviously this won't work because we cannot insert the group into this unicode escape.
What's the best/easiest way to do this? I found many questions trying to do the exact opposite but nothing useful to 're-encode' these code points as actual characters...
When you have the code point of a character as a number, you can call chr(number) to get the character for that number (ord does the reverse).
Because we have a string of hex digits, we first need to read it as an int with base 16.
Both of those together:
>>> chr(int("0001F44D", 16))
'👍'
However, now we have a small function rather than a fixed replacement string. A quick search shows that you can pass a function to re.sub; it is called with each match object and its return value is used as the replacement.
Now we get:
re.sub(r'<U\+([A-F0-9]+)>', lambda x: chr(int(x.group(1), 16)), my_str)
PS: Don't name your string str; you'll shadow the built-in str type.
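Put together as a runnable sketch: the function name restore_code_points and the sample text are made up for illustration, and the pattern is the one from the question, so it only matches uppercase hex digits.

```python
import re

# Matches placeholders such as <U+0001F44D> and captures the hex digits.
CODE_POINT = re.compile(r"<U\+([A-F0-9]+)>")


def restore_code_points(text):
    """Replace each <U+XXXX> placeholder with the character it names."""
    # re.sub calls the function once per match and inserts its return value.
    return CODE_POINT.sub(lambda m: chr(int(m.group(1), 16)), text)


sample = "Looks good <U+0001F44D>, caf<U+00E9> at 5?"
print(restore_code_points(sample))
```

Note that chr raises ValueError for numbers outside the Unicode range, so a malformed placeholder with too many digits would need separate handling.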

Ignoring escape sequences

I'm using Python 2.6 and I have a variable which contains a string (I have sent it through sockets and now I want to do something with it).
The problem is that I get the following error:
TypeError: file() argument 1 must be encoded string without NULL bytes, not str
After looking it up, I found that the problem is probably that the string I'm sending contains '\0'. It isn't a string literal that I can just edit by doubling the backslash or adding an 'r' prefix, so is there a way to tell Python to ignore the escape sequences and treat the whole thing as a plain string?
(For example - I don't want python to treat the sequence \0 as a null char, but rather I want it to be treated as a backslash char followed by a zero char)
Judging by the comments, this looks like a misuse of the PIL/Pillow API: the Image.open function was given file data where it requires a file name.
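The distinction this answer draws, a file name versus the file's contents, can be illustrated with the standard library alone. The sketch below is an assumption about the asker's situation, since the question does not show the failing code, and the variable names are hypothetical. It is written for Python 3, where the failure surfaces as a different exception than the Python 2.6 TypeError quoted in the question.

```python
import io

# Hypothetical payload, standing in for the bytes received over the socket.
received = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"

# Treating the data as a file *name* fails, because a path cannot contain
# a NUL character.
try:
    open(received.decode("latin-1"))
except (ValueError, TypeError) as exc:
    path_error = exc
else:
    path_error = None

# Wrapping the data in a file-like object lets it be read as a file instead.
stream = io.BytesIO(received)
signature = stream.read(8)

print("opening data as a path failed with:", path_error)
print("first bytes read from the stream:", signature)
```

Pillow's Image.open also accepts a file object, so passing such a stream rather than the raw data is likely the fix here, assuming the received bytes really are image data.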

How to escape special char

I have the following code to handle Chinese characters and other special characters in a PowerPoint file, because I would like to use the content of the ppt as the file name to save under.
If the text contains a special character, an exception is thrown, so I use the following code to handle it.
It works fine under Python 2.7, but when I run it with Python 3.0 it gives me the following error:
if not (char in '<>:"/\|?*'):
TypeError: 'in <string>' requires string as left operand, not int
I Googled the error message but I don't understand how to resolve it. I know the code if not (char in '<>:"/\|?*'): is to convert the character to ASCII code number, right?
Is there any example to fix my problem in Python 3?
def rm_invalid_char(self,str):
    final=""
    dosnames=['CON', 'PRN', 'AUX', 'NUL', 'COM1', 'COM2', 'COM3', 'COM4', 'COM5', 'COM6', 'COM7', 'COM8', 'COM9', 'LPT1', 'LPT2', 'LPT3', 'LPT4', 'LPT5', 'LPT6', 'LPT7', 'LPT8', 'LPT9']
    for char in str:
        if not (char in '<>:"/\|?*'):
            if ord(char)>31:
                final+=char
    if final in dosnames:
        #oh dear...
        raise SystemError('final string is a DOS name!')
    elif final.replace('.', '')=='':
        print ('final string is all periods!')
        pass
    return final
Simple: use this
re.escape(YourStringHere)
From the docs:
Return string with all non-alphanumerics backslashed; this is useful
if you want to match an arbitrary literal string that may have regular
expression metacharacters in it.
You are passing an iterable whose first element is an integer (232) to rm_invalid_char(). The problem does not lie with this function, but with the caller.
Some debugging is in order: right at the beginning of rm_invalid_char(), you should do print(repr(str)): you will not see a string, contrary to what is expected by rm_invalid_char(). You must fix this until you see the string that you were expecting, by adjusting the code before rm_invalid_char() is called.
The problem is likely due to how Python 2 and Python 3 handle strings (in Python 2, str objects are strings of bytes, while in Python 3, they are strings of characters).
I'm curious why there is something in "str" that is acting like an integer - something strange is going on with the input.
However, I suspect if you:
Change the name of your str value to something else, e.g. char_string
Right after for char in char_string coerce whatever your input is to a string
then the problem you describe will be solved.
You might also consider adding a random bit to the end of your generated file name so you don't have to worry about colliding with the DOS reserved names.
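The diagnosis in these answers can be sketched as follows. This is an assumption about the cause, namely that the caller passes a bytes object, which yields integers when iterated in Python 3; the question does not show the calling code. The UTF-8 default, the module-level constant names, and the switch from SystemError to ValueError are choices made for this sketch, not part of the original function.

```python
INVALID_CHARS = set('<>:"/\\|?*')
DOS_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{n}" for n in range(1, 10)}
    | {f"LPT{n}" for n in range(1, 10)}
)


def rm_invalid_char(name, encoding="utf-8"):
    """Strip characters that are not allowed in a Windows file name."""
    # Iterating over bytes yields ints in Python 3, which is what makes
    # "char in '<>:...'" raise TypeError. Decode to str first.
    # The encoding is an assumption; use whatever the data really is.
    if isinstance(name, bytes):
        name = name.decode(encoding)
    final = "".join(
        ch for ch in name if ch not in INVALID_CHARS and ord(ch) > 31
    )
    if final in DOS_NAMES:
        raise ValueError("final string is a DOS name!")
    return final


print(list(b"\xe8\xa1\xa8"))                      # bytes iterate as ints
print(rm_invalid_char("表格: 1/2".encode("utf-8")))  # decoded, then filtered
```

Decoding at the boundary keeps the filter itself working on text only, which matches the advice above to fix the caller rather than the function.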

Tell a raw string (r'') from a regular string ('')?

I'm currently building a tool that will have to match filenames against a pattern. For convenience, I intend to provide both lazy matching (in a glob-like fashion) and regexp matching. For example, the following two snippets would eventually have the same effects:
@mylib.rule('static/*.html')
def myfunc():
    pass
@mylib.rule(r'^static/([^/]+)\.html')
def myfunc():
    pass
AFAIK r'' is only useful to the Python parser and it actually creates a standard str instance after parsing (the only difference being that it keeps the \).
Is anybody aware of a way to tell one from another?
I would hate to have to provide two alternate decorators for the same purpose or, worse, resort to manually parsing the string to determine if it's a regexp or not.
You can't tell them apart. Every raw string literal could also be written as a standard string literal (possibly requiring more quoting) and vice versa. Apart from this, I'd definitely give different names to the two decorators, since they do different things.
Example (CPython):
>>> a = r'^static/([^/]+)\.html'; b = '^static/([^/]+)\.html'
>>> a is b
True
So in this particular example, the raw string literal and the standard string literal even result in the same string object.
You can't tell whether a string was defined as a raw string after the fact. Personally, I would in fact use a separate decorator, but if you don't want to, you could use a named parameter (e.g. @rule(glob="*.txt") for globs and @rule(re=r".+\.txt") for regex).
Alternatively, require users to provide a compiled regular expression object if they want to use a regex, e.g. @rule(re.compile(r".+\.txt")) -- this is easy to detect because its type is different.
The term "raw string" is confusing because it sounds like it is a special type of string - when in fact, it is just a special syntax for literals that tells the compiler to do no interpretation of '\' characters in the string. Unfortunately, the term was coined to describe this compile-time behavior, but many beginners assume it carries some special runtime characteristics.
I prefer to call them "raw string literals", to emphasize that it is their definition of a string literal using a don't-interpret-backslashes syntax that is what makes them "raw". Both raw string literals and normal string literals create strings (or strs), and the resulting variables are strings like any other. The string created by a raw string literal is equivalent in every way to the same string defined non-raw-ly using escaped backslashes.
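Both points from these answers can be sketched in a few lines: a raw literal and an escaped regular literal produce equal str objects, and a compiled pattern is easy to detect by type. The helper make_matcher is a hypothetical stand-in for whatever the decorator would do internally, and using fnmatch for the glob case is an assumption.

```python
import fnmatch
import re

# The type of compiled patterns, obtained without relying on its public name.
PATTERN_TYPE = type(re.compile(""))


def make_matcher(pattern):
    """Return a file name predicate for a glob string or a compiled regex."""
    if isinstance(pattern, PATTERN_TYPE):
        return lambda name: pattern.search(name) is not None
    return lambda name: fnmatch.fnmatchcase(name, pattern)


# A raw literal and the equivalent escaped literal are the same string.
same = r"^static/([^/]+)\.html" == "^static/([^/]+)\\.html"

is_glob_match = make_matcher("static/*.html")("static/index.html")
is_regex_match = make_matcher(
    re.compile(r"^static/([^/]+)\.html")
)("static/index.html")

print(same, is_glob_match, is_regex_match)
```

With this design the user states the intent explicitly by calling re.compile, so nothing has to be guessed from the contents of the string.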
