So in our lecture slide on assembly we had:
High-level language data types of C, A, and B determine the correct
circuit from among several choices (integer, floating point) to use to
perform “+” operation
Now in languages like Python, I do not specify the type of a variable. I was wondering: how does the language compile (interpret, I think, is what it does) down to assembly and choose the right circuit?
Thank you
At the interpreter level it's fairly easy to tell the difference between an integer (34), a floating point number (34.24), and a string ("Thirty-Four"). The full list of types can be seen at https://docs.python.org/3/library/stdtypes.html .
Once the type is known, it's easy to tell what operation is needed. A separate method (__add__) is defined for each class, and the interpreter (written in C for standard Python) will do the arithmetic. C is typed, so it's (comparatively) easy for the compiler to translate it to machine code.
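As a rough illustration (a sketch of the dispatch idea, not how CPython implements it internally), each built-in type carries its own __add__, and + resolves to the method attached to the operand's type:

print((34).__add__(34))            # int.__add__: integer addition -> 68
print((34.0).__add__(0.24))        # float.__add__: floating-point addition -> 34.24
print("Thirty-".__add__("Four"))   # str.__add__: concatenation -> 'Thirty-Four'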
Every Python variable is a reference to an object. That object includes the type information of the variable. For instance, just walk through a few of the possibilities as we repeatedly reassign the value and type of x "on the fly":
for x in [1, 1.0, "1", [1]]:
    print(x, type(x))
Output:
1 <class 'int'>
1.0 <class 'float'>
1 <class 'str'>
[1] <class 'list'>
If you're wondering how Python can tell that 1 is an int and 1.0 is a float, that's obvious from the input string. A language processor typically contains a tokenizer that can discriminate language tokens, and another module that interprets those tokens within the language syntax. int and float objects have different token formats ... as do strings, punctuation, identifiers, and any other language elements.
If you want to learn more about that level of detail, research how to parse a computer language: most of the techniques are applicable to most languages.
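If you want to see the tokenizer at work, CPython's standard tokenize module will show you the token stream. Note that both 34 and 34.24 come out as NUMBER tokens here; the int/float split is made a step later, when each literal's text is turned into an object:

import io
import tokenize

source = 'x = 34 + 34.24'
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# NAME 'x', OP '=', NUMBER '34', OP '+', NUMBER '34.24', NEWLINE, ENDMARKER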
As n.m. commented below your post, variables do not have a type in Python. Values do.
As far as how integer vs float is determined when you type the following:
x = 1.5
y = 2
This is determined during the parsing stage. Compiled and interpreted languages actually start off in the same manner.
The general flow when code is sent to an interpreter/compiler is as follows:
[source code] --> lexical analyzer --> [tokens] --> parser --> [abstract syntax tree] -->
The parser step examines tokens like 'x' '=' '1.5' and looks for patterns which indicate different types of literals like ints, floats, and strings. By the time the actual interpreter/compiler gets the abstract syntax tree (tree representation of your program), it already knows that the value stored in x (1.5) is a float.
So just to be clear, this part of the process is conceptually the same for interpreters and compilers.
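You can watch this happen with CPython's ast module (Python 3.8+, where literals show up as ast.Constant nodes): by the time the syntax tree exists, the literals already carry their concrete types:

import ast

tree = ast.parse('x = 1.5\ny = 2')
for node in ast.walk(tree):
    if isinstance(node, ast.Constant):
        print(node.value, type(node.value))
# 1.5 <class 'float'>
# 2 <class 'int'>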
There are operators that represent a built-in data type
Examples
"" represents a string
{} represents a dictionary
How can I make my own operators represent a user-defined data type?
You can't do that in Python, or in most other languages. You would have to change the Python parser and its source code, so this is sadly not possible.
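One way to see that the mapping from literal syntax to built-in types is wired into the grammar itself (so there is no hook left over for user-defined classes) is to look at the parse tree:

import ast

# The parser turns each literal form directly into a fixed node/type:
print(ast.dump(ast.parse('""', mode='eval')))  # a Constant holding a str
print(ast.dump(ast.parse('{}', mode='eval')))  # a Dict node -> builtin dict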
For example, if I write var1 = 'ROB', is the string stored as 3 memory locations, R, O, and B, each with its own address, with the variable var1 pointing to the memory location of R? Then how does it point to O and B?
And do other strings – for example: var2 = 'BOB' – point to the same B and O in memory that var1 refers to?
How strings are stored is an implementation detail, but in practice, on the CPython reference interpreter, they're stored as a C-style array of characters. So if the R is at address x, then O is at x+1 (or +2 or +4, depending on the largest ordinal value in the string), and B is at x+2 (or +4 or +8). Because the letters are stored consecutively, knowing where R is (and a flag in the str that says how big each character's storage is) is enough to locate O and B.
'BOB' is at a completely different address, y, and its O and B are contiguous as well. The OB in 'ROB' is utterly unrelated to the OB in 'BOB'.
There is a confusing aspect to this. If you index into the strings and check the id of the result, it will seem like 'O' has the same address in both strings. But that's only because:
1. Indexing into a string returns a new string, unrelated to the one being indexed, and
2. CPython caches length-one strings in the latin-1 range, so 'O' is a singleton (no matter how you make it, you get back the cached string).
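You can check both effects in a CPython session (this is CPython-specific behaviour, as noted above):

var1 = 'ROB'
var2 = 'BOB'
print(var1[1] is var2[1])    # True: indexing returns the cached 1-char 'O'
print(id(var1) == id(var2))  # False: the parent strings are separate objects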
I'll note that the actual str internals in modern Python are even more complicated than I covered above; a single string might store the same data in up to three different encodings in the same object (the canonical form, and cached version(s) for working with specific Python C APIs). It's not visible from the Python level aside from checking the size with sys.getsizeof though, so it's not worth worrying about in general.
If you really want to head off into the weeds, feel free to read PEP 393: Flexible String Representation which elaborates on the internals of the new str object structure adopted in CPython 3.3.
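The per-character width is visible through sys.getsizeof; the exact byte counts vary by CPython version, but the size jumps as soon as the string needs wider code units:

import sys

print(sys.getsizeof('abc'))           # 1 byte per character (latin-1 range)
print(sys.getsizeof('ab\u0394'))      # 2 bytes per character (needs UCS-2)
print(sys.getsizeof('ab\U0001F600'))  # 4 bytes per character (needs UCS-4)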
This is only a partial answer:
var1 is a name that refers to a string object 'ROB'.
var2 is a name that refers to another string object 'BOB'.
How a string object stores the individual characters, and whether different string objects share the same memory, I cannot answer right now in more detail than "sometimes" and "it depends". It has to do with string interning, which may or may not be used.
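For what it's worth, CPython lets you request interning explicitly via sys.intern, in which case equal strings are guaranteed to share one object:

import sys

a = sys.intern('ROB')
b = sys.intern('ROB')
print(a is b)   # True: both names refer to the single interned object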
Everything in Python is an object. Even a number is an object:
>>> a=1
>>> type(a)
<class 'int'>
>>> a.real
1
I tried the following, because we should be able to access class members of an object:
>>> type(1)
<class 'int'>
>>> 1.real
File "<stdin>", line 1
1.real
^
SyntaxError: invalid syntax
Why does this not work?
Yes, an integer literal is an object in Python. To summarize: the parser needs to be able to understand that it is dealing with an object of type integer, while the statement 1.real confuses the parser into reading a float 1. followed by the word real, and it therefore raises a syntax error.
To test this you can also try
>>> (1).real
1
as well as
>>> 1.0.real
1.0
So in the case of 1.real, Python is interpreting the . as a decimal point.
Edit
BasicWolf puts it nicely too: 1. is being interpreted as the floating-point representation of 1, so 1.real is equivalent to writing (1.)real, with no attribute-access operator (i.e. period/full stop) in between. Hence the syntax error.
Further edit
As mgilson alludes to in their comment: the parser can handle access to an int's attributes and methods, but only as long as the statement makes it clear that it is being given an int and not a float.
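A few forms that keep the tokenizer from swallowing the dot as part of a float literal:

print((1).real)   # parentheses close the expression before the dot
print(1 .real)    # a space ends the number token, so the dot is attribute access
x = 1
print(x.real)     # via a name there is no ambiguity at all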
A language is usually built in three layers.
When you provide a program to a language, it first has to "read" the program. Then it builds what it has read into something it can work with. And finally it runs that thing as "a program" and (hopefully) prints a result.
The problem here is that the first part of Python, the part that reads programs, is confused. It's confused because it's not clever enough to know the difference between
1.234
and
1.letters
What seems to be happening is that it thinks you were trying to type a number like 1.234 but made a mistake and typed letters instead(!).
So this has nothing to do with what 1 "really is" and whether or not it is an object. All that kind of logic happens in the second and third stages I described earlier, when Python tries to build and then run the program.
What you've uncovered is just a strange (but interesting!) wrinkle in how Python reads programs.
[I'd call it a bug, but it's probably like this for a reason. It turns out that some things are hard for computers to read. Python is probably designed so that it's easy (fast) for the computer to read programs. Fixing this "bug" would probably make the part of Python that reads programs slower or more complicated. So it's probably a trade-off.]
Although the behaviour with 1.real seems illogical, it is expected due to the language specification: Python interprets 1. as a float (see floating point literals). But as @mutzmatron pointed out, (1).real works, because the expression in brackets is a valid Python object.
Update: Note the following pitfalls (attribute access binds more tightly than +):
>>> 1 + 2j.real   # because 2j.real == 0.0
1.0
>>> 1 + 2j.imag   # because 2j.imag == 2.0
3.0
You can still access the real attribute of 1 through getattr:
>>> hasattr(1, 'real')
True
>>> getattr(1, 'real')
1
I'm writing a mapping class which persists to disk. I currently allow only str keys, but it would be nice if I could use a couple more types: hopefully anything that is hashable (i.e. the same requirements as the builtin dict), but more reasonably I would accept strings, unicode, ints, and tuples of these types.
To that end I would like to derive a deterministic serialization scheme.
Option 1 - Pickling the key
The first thought I had was to use the pickle (or cPickle) module to serialize the key, but I noticed that the outputs from pickle and cPickle do not match each other:
>>> import pickle
>>> import cPickle
>>> def dumps(x):
... print repr(pickle.dumps(x))
... print repr(cPickle.dumps(x))
...
>>> dumps(1)
'I1\n.'
'I1\n.'
>>> dumps('hello')
"S'hello'\np0\n."
"S'hello'\np1\n."
>>> dumps((1, 2, 'hello'))
"(I1\nI2\nS'hello'\np0\ntp1\n."
"(I1\nI2\nS'hello'\np1\ntp2\n."
Is there any implementation/protocol combination of pickle which is deterministic for some set of types (e.g. can only use cPickle with protocol 0)?
Option 2 - Repr and ast.literal_eval
Another option is to use repr to dump and ast.literal_eval to load. I have written a function to determine if a given key would survive this process (it is rather conservative on the types it allows):
def is_reprable_key(key):
    return type(key) in (int, str, unicode) or (
        type(key) == tuple and all(is_reprable_key(x) for x in key))
The question for this method is whether repr itself is deterministic for the types that I have allowed here. I believe it would not survive the 2/3 version barrier, due to the change in str/unicode literals. It also would not work for integers where 2**32 - 1 < x < 2**64 when jumping between 32- and 64-bit platforms. Are there any other conditions (i.e. do strings serialize differently under different conditions in the same interpreter)? Edit: I'm just trying to understand the conditions under which this breaks down, not necessarily overcome them.
Option 3: Custom repr
Another option which is likely overkill is to write my own repr which flattens out the things of repr which I know (or suspect may be) a problem. I just wrote an example here: http://gist.github.com/423945
(If this all fails miserably then I can store the hash of the key along with the pickle of both the key and value, then iterate across rows that have a matching hash looking for one that unpickles to the expected key, but that really does complicate a few other things and I would rather not do it. Edit: it turns out that the builtin hash is not deterministic across platforms either. Scratch that.)
Any insights?
Important note: repr() is not deterministic if a dictionary or set type is embedded in the object you are trying to serialize. The keys could be printed in any order.
For example print repr({'a':1, 'b':2}) might print out as {'a':1, 'b':2} or {'b':2, 'a':1}, depending on how Python decides to manage the keys in the dictionary.
After reading through much of the source (of CPython 2.6.5) for the implementation of repr for the basic types I have concluded (with reasonable confidence) that repr of these types is, in fact, deterministic. But, frankly, this was expected.
I believe that the repr method is susceptible to nearly all of the same cases under which the marshal method would break down (longs > 2**32 can never be an int on a 32-bit machine, the output is not guaranteed to stay the same between versions or interpreters, etc.).
My solution for the time being has been to use the repr method and write a comprehensive test suite to make sure that repr returns the same values on the various platforms I am using.
In the long run the custom repr function would flatten out all platform/implementation differences, but is certainly overkill for the project at hand. I may do this in the future, however.
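A minimal version of such a check (a hypothetical test sketch, using ast.literal_eval as the inverse of repr) might look like:

import ast

# Sanity check for the repr-based scheme: every allowed key must
# round-trip through repr() and back unchanged.
for key in [1, 'hello', (1, 2, 'hello'), ('nested', (3, 4))]:
    assert ast.literal_eval(repr(key)) == key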
"Any value which is an acceptable key for a builtin dict" is not feasible: such values include arbitrary instances of classes that don't define __hash__ or comparisons, implicitly using their id for hashing and comparison purposes, and the ids won't be the same even across runs of the very same program (unless those runs are strictly identical in all respects, which is very tricky to arrange -- identical inputs, identical starting times, absolutely identical environment, etc, etc).
For strings, unicodes, ints, and tuples whose items are all of these kinds (including nested tuples), the marshal module could help (within a single version of Python: marshaling code can and does change across versions). E.g.:
>>> marshal.dumps(23)
'i\x17\x00\x00\x00'
>>> marshal.dumps('23')
't\x02\x00\x00\x0023'
>>> marshal.dumps(u'23')
'u\x02\x00\x00\x0023'
>>> marshal.dumps((23,))
'(\x01\x00\x00\x00i\x17\x00\x00\x00'
This is Python 2; Python 3 would be similar (except that all the representations of these byte strings would have a leading b, but that's a cosmetic issue, and of course u'23' becomes invalid syntax and '23' becomes a Unicode string). You can see the general idea: a leading byte represents the type, such as u for Unicode strings, i for integers, ( for tuples; then, for containers, comes the number of items (as a little-endian integer) followed by the items themselves, and integers are serialized into a little-endian form. marshal is designed to be portable across platforms (for a given version; not across versions) because it's used as the underlying serialization in compiled bytecode files (.pyc or .pyo).
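A quick sanity check of the property this relies on (stable output for equal keys within one Python version; shown here in Python 3 syntax):

import marshal

key = (1, '23', (2, 3))
blob = marshal.dumps(key)
assert marshal.loads(blob) == key                 # round-trips cleanly
assert blob == marshal.dumps((1, '23', (2, 3)))   # equal keys, identical bytes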
You mention a few requirements in the question, and I think you might want to be a little more clear on these. So far I gather:
- You're building an SQLite backend to what is basically a dictionary.
- You want to allow the keys to be more than the basestring type (which types?).
- You want it to survive the Python 2 -> Python 3 barrier.
- You want to support large integers above 2**32 as keys.
- You want the ability to store an unbounded number of values (because you don't want hash collisions).
So, are you trying to build a general 'this can do it all' solution, or just trying to solve an immediate problem to continue on within a current project? You should spend a little more time to come up with a clear set of requirements.
Using a hash seemed like the best solution to me, but then you complain that you're going to have multiple rows with the same hash implying you're going to be storing enough values to even worry about the hash.
In the Python documentation and on mailing lists I see that values are sometimes "cast", and sometimes "coerced".
Cast is explicit. Coerce is implicit.
The examples in Python would be:
from ctypes import cast, POINTER, c_float

cast(2, POINTER(c_float))  # cast
1.0 + 2                    # coerce
1.0 + float(2)             # conversion
Cast really only comes up in the C FFI. What is typically called casting in C or Java is referred to as conversion in Python, though it often gets called casting because of its similarity to those other languages. In pretty much every language that I have experience with (including Python), coercion is implicit type changing.
I think "casting" shouldn't be used for Python; there are only type conversion, but no casts (in the C sense). A type conversion is done e.g. through int(o) where the object o is converted into an integer (actually, an integer object is constructed out of o). Coercion happens in the case of binary operations: if you do x+y, and x and y have different types, they are coerced into a single type before performing the operation. In 2.x, a special method __coerce__ allows object to control their coercion.