I am trying to test a basic premise in python and it always fails and I can't figure out why.
My sys.argv looks like this:
['test.py', 'test']
And my code looks like this:
if len(sys.argv) > 1 and sys.argv[1] is 'test':
    print 'Test mode'
But the test is never true. I am sure that I am missing something really simple here, but I can't figure out what it is.
As mentioned above, the main problem is your test comparison. Using is is different from using ==: is checks whether two names refer to the very same object, while == compares values. In this case, you can verify that the two strings are not the same object by checking their ids:
import sys
print id(sys.argv[1])
print id('test')
My output:
140335994263232
140335994263424
As they point to different objects, is returns False (while == compares the strings' contents and therefore returns True).
The issue at work here is the concept of interning. When you hardcode two identical strings into your source, the strings are interned and the two share an object ID (this explains @SamMussmann's very valid point below). But when you pass a string in via argv, a new object is created, so the comparison to an identical hardcoded string in your code returns False. The best explanation I have found so far is here, where both Alex Martelli and Jon Skeet (two very reputable sources) explain interning and when strings are interned. From these explanations, it does seem that since the data from argv is external to the program, the values aren't interned, and therefore have different object IDs than if they were both literals in the source.
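In short, the fix is to compare values rather than identities. A minimal sketch of the corrected test from the question:

import sys

if len(sys.argv) > 1 and sys.argv[1] == 'test':  # == compares string values
    print 'Test mode'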
One additional point of interest (unrelated to the issue at hand but pertinent to the is discussion) is the caching that is done with numbers. The numbers from -5 to 256 are cached, meaning that is comparisons with equal numbers in that range will be True, regardless of how they are calculated:
In [1]: 256 is 255 + 1
Out[1]: True
In [2]: 257 is 256 + 1
Out[2]: False
In [3]: -5 is -4 - 1
Out[3]: True
In [4]: -6 is -5 - 1
Out[4]: False
So Python 3.6.2 has some weird behavior in its assignment of ids for integer values.
For any integer value in the range [-5, 256], any variable assigned a given value is also assigned the same id as any other variable assigned the same value. This effect can be seen below.
>>> a, b = -5, -5
>>> id(a), id(b)
(1355597296, 1355597296)
>>> a, b = -6, -6
>>> id(a), id(b)
(2781041259312, 2781041260912)
In fact, to see the ID pairs in action, you can just run this simple program that prints out the number and id in the range that I'm talking about...
for val in range(-6, 258):
    print(format(val, ' 4d'), ':', format(id(val), '11x'))
If you add some other variables with values outside this range, you will see the ids of the boundary values (i.e. -6 and 257) change within the Python interpreter, but never those of the values inside the range.
This means (at least to me) that Python has taken the liberty to hardcode the addresses of variables that hold values in a seemingly arbitrary range of numbers.
In practice, this can be a little dangerous for a beginning Python learner: since the ids assigned are the same within what is a normal range of operation for beginners, they may be inclined to use logic that might get them in trouble, even though it seemingly works and makes sense...
One possible (though a bit odd) problem might be printing an incrementing number:
a = 0
b = 10
while a is not b:
    a = a + 1
    print(a)
This logic, though not written in the standard Pythonic way, works and is fine as long as b is in the range of statically defined numbers [-5, 256].
However, as soon as b is raised out of this range, we see the same strange behavior. In this case, it actually throws the code into an infinite loop.
I know that using 'is' to compare values is really not a good idea, but the 'is' operator produces inconsistent results here, which is not immediately obvious to someone new to the language and would be especially confusing for new programmers who mistakenly used this method.
So my question is...
a) Why (was Python written to behave this way), and
b) Should it be changed?
p.s. In order to properly demonstrate the range in a usable script, I had to do some odd tweaks that really are improper code. However, I still hold my argument, since my method would not show any results if this odd glitch didn't exist.
for val in range(-6, 300):
    a = int(float(val))
    b = int(float(val))
    print(format(a, ' 4d'), format(id(a), '11x'), ':', format(b, ' 4d'), format(id(b), '11x'), ':', a is b)
    val = val + 1
The int(float(val)) is necessary to force Python to give each value a new address/id rather than a pointer to the cached object it would otherwise reuse.
This is documented behavior of Python:
The current implementation keeps an array of integer objects for all integers between -5 and 256, when you create an int in that range you actually just get back a reference to the existing object.
source
It helps to save memory and to make operations a bit faster.
It is implementation-specific. For example, IronPython has a range between -1000 and 1000 in which it re-uses integers.
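For example, in a CPython interactive session (each assignment on its own line, so nothing is reused from within a single code block):

>>> x = 256
>>> y = 256
>>> x is y   # 256 is inside the cached range
True
>>> x = 257
>>> y = 257
>>> x is y   # 257 is outside it, so two separate objects are created
False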
Given:
>>> a,b=2,3
>>> c,d=3,2
>>> def f(x,y): print(x,y)
I have an existing (as in cannot be changed) two-positional-parameter function where I want the positional arguments to always be in ascending order; i.e., f(2,3) no matter which two arguments I use (f(a,b) is the same as f(c,d) in the example).
I know that I could do:
>>> f(*sorted([c,d]))
2 3
Or I could do:
>>> f(*((a,b) if a<b else (b,a)))
2 3
(Note the need for tuple parentheses in this form, because , has lower precedence than the ternary...)
Or,
def my_f(a,b):
    return f(a,b) if a<b else f(b,a)
All these seem kinda kludgy. Is there another syntax that I am missing?
Edit
I missed an 'old school' Python two-member-tuple method: index a two-member tuple using the fact that True == 1 and False == 0:
>>> f(*((a,b),(b,a))[a>b])
2 3
Also:
>>> f(*{True:(a,b), False:(b,a)}[a<b])
2 3
Edit 2
The reason for this silly exercise: numpy.isclose has the following usage note:
For finite values, isclose uses the following equation to test whether
two floating point values are equivalent.
absolute(a - b) <= (atol + rtol * absolute(b))
The above equation is not symmetric in a and b, so that isclose(a, b)
might be different from isclose(b, a) in some rare cases.
I would prefer that not happen.
I am looking for the fastest way to make sure that arguments to numpy.isclose are in a consistent order. That is why I am shying away from f(*sorted([c,d]))
I implemented my solution in case anyone else is looking for it:
def sort(f):
    def wrapper(*args):
        return f(*sorted(args))
    return wrapper

@sort
def f(x, y):
    print(x, y)

f(3, 2)
# prints: 2 3
Also, since @Tadhg McDonald-Jensen mentioned that you may not be able to change the function yourself, you could wrap the function as such:
my_func = sort(f)
You mention that your use-case is np.isclose. However, your approach isn't a good way to solve the real issue - though that's understandable given the poor argument naming of that function: it sort of implies that both arguments are interchangeable. If it were numpy.isclose(measured, expected, ...) (or something like it), it would be much clearer.
For example if you expect the value 10 and measure 10.51 and you allow for 5% deviation, then in order to get a useful result you must use np.isclose(10.51, 10, ...), otherwise you would get wrong results:
>>> import numpy as np
>>> measured = 10.51
>>> expected = 10
>>> err_rel = 0.05
>>> err_abs = 0.0
>>> np.isclose(measured, expected, err_rel, err_abs)
False
>>> np.isclose(expected, measured, err_rel, err_abs)
True
It's clear that the first one gives the correct result, because the actual measured value is not within the tolerance of the expected value. That's because the relative uncertainty is an "attribute" of the expected value, not of the value you compare it with!
So solving this issue by "sorting" the parameters is just wrong. That's a bit like swapping the numerator and denominator for a division because the denominator contains zeros and dividing by zero could give NaN, Inf, a Warning or an Exception... it definitely avoids the problem, but only by giving an incorrect result (the comparison isn't perfect, because with division it will almost always give a wrong result; with isclose it's rare).
This was a somewhat artificial example designed to trigger that behaviour and most of the time it's not important if you use measured, expected or expected, measured but in the few cases where it does matter you can't solve it by swapping the arguments (except when you have no "expected" result, but that rarely happens - at least it shouldn't).
There was some discussion about this topic when math.isclose was added to the python library:
Symmetry (PEP 485)
[...]
Which approach is most appropriate depends on what question is being asked. If the question is: "are these two numbers close to each other?", there is no obvious ordering, and a symmetric test is most appropriate.
However, if the question is: "Is the computed value within x% of this known value?", then it is appropriate to scale the tolerance to the known value, and an asymmetric test is most appropriate.
[...]
This proposal [for math.isclose] uses a symmetric test.
So if your test falls into the first category and you like a symmetric test - then math.isclose could be a viable alternative (at least if you're dealing with scalars):
math.isclose(a, b, *, rel_tol=1e-09, abs_tol=0.0)
[...]
rel_tol is the relative tolerance – it is the maximum allowed difference between a and b, relative to the larger absolute value of a or b. For example, to set a tolerance of 5%, pass rel_tol=0.05. The default tolerance is 1e-09, which assures that the two values are the same within about 9 decimal digits. rel_tol must be greater than zero.
[...]
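For instance, plugging in the numbers from the example above, math.isclose gives the same answer regardless of argument order:

>>> import math
>>> math.isclose(10.51, 10, rel_tol=0.05)
True
>>> math.isclose(10, 10.51, rel_tol=0.05)
True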
Just in case this answer couldn't convince you and you still want to use a sorted approach - then you should order by the absolute value of your arguments (i.e. *sorted([a, b], key=abs)). Otherwise you might get surprising results when comparing negative numbers:
>>> np.isclose(-10.51, -10, err_rel, err_abs) # -10.51 is smaller than -10!
False
>>> np.isclose(-10, -10.51, err_rel, err_abs)
True
For only two elements in the tuple, the second one is the preferred idiom -- in my experience. It's fast, readable, etc.
No, there isn't really another syntax. There's also
(min(a,b), max(a,b))
... but this isn't particularly superior to the other methods; merely another way of expressing it.
Note after comment by dawg:
A class with custom comparison operators could return the same object for both min and max.
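To illustrate that caveat, here is a toy sketch (the class is made up purely for demonstration): if comparisons never report either operand as smaller or larger, min and max both hand back the first argument, i.e. the very same object.

class Incomparable:
    # Neither operand ever compares as smaller or larger.
    def __lt__(self, other):
        return False
    def __gt__(self, other):
        return False

a, b = Incomparable(), Incomparable()
print(min(a, b) is max(a, b))  # True: both return the first argument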
I am using Anaconda (Python 3.6).
In the interactive mode, I did object identity test for positive integers >256:
# Interactive test 1
>>> x = 1000
>>> y = 1000
>>> x is y
False
Clearly, large integers (>256) written on separate lines are not reused in interactive mode.
But if we write the assignment in one line, the large positive integer object is reused:
# Interactive test 2
>>> x, y = 1000, 1000
>>> x is y
True
That is, in interactive mode, writing the integer assignments on one line or on separate lines makes a difference for reusing the integer objects (>256). For integers in [-5, 256] (as described at https://docs.python.org/2/c-api/int.html), the caching mechanism ensures that only one object is created, whether the assignments are on the same line or on different lines.
Now let's consider small negative integers less than -5 (any negative integer beyond the range [-5, 256] would serve the purpose); surprising results come out:
# Interactive test 3
>>> x, y = -6, -6
>>> x is y
False # inconsistent with the large positive integer 1000
>>> -6 is -6
False
>>> id(-6), id(-6), id(-6)
(2280334806256, 2280334806128, 2280334806448)
>>> a = b =-6
>>> a is b
True # different result from a, b = -6, -6
Clearly, this demonstrates an inconsistency in object identity tests between large positive integers (>256) and small negative integers (<-5). And for small negative integers (<-5), writing in the form a, b = -6, -6 versus a = b = -6 also makes a difference (in contrast, it doesn't matter which form is used for large integers). Any explanations for these strange behaviors?
For comparison, let's move on to an IDE run (I am using PyCharm with the same Python 3.6 interpreter). I run the following script:
# IDE test case
x = 1000
y = 1000
print(x is y)
It prints True, different from the interactive run. Thanks to @Ahsanul Haque, who already gave a nice explanation of the inconsistency between the IDE run and the interactive run. But it still remains to answer my question about the inconsistency between large positive integers and small negative integers in the interactive run.
Only one copy of a particular constant is created for a particular piece of source code and reused if it is needed again. So, in PyCharm, you are getting True for x is y.
But, in the interpreter, things are different. Here, only one line/statement runs at once. A particular constant is created for each new line. It is not reused in the next line. So, x is not y here.
But if you initialize on the same line, you get the same behavior (reusing the same constant):
>>> x,y = 1000, 1000
>>> x is y
True
>>> x = 1000
>>> y = 1000
>>> x is y
False
>>>
Edit:
A block is a piece of Python program text that is executed as a unit.
In an IDE, the whole module gets executed at once, i.e. the whole module is one block. But in interactive mode, each instruction is actually a block of code that is executed at once.
As I said earlier, a particular constant is created once for a block of code and reused if it reappears in that block of code again.
This is the main difference between the IDE and the interpreter.
Then why does the interpreter give the same output as the IDE for smaller numbers? This is where integer caching comes into consideration.
If the numbers are small, they are cached and reused in the next code block as well. So we get the same id, just as in the IDE.
But if they are bigger, they are not cached; rather, a new copy is created. So, as expected, the ids are different.
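For example, a number inside the cached range comes back as the very same object even across separate interactive statements, unlike the 1000 example above:

>>> x = 10
>>> y = 10
>>> x is y
True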
Hope this makes sense now.
When you run 1000 is 1000 in the interactive shell or as part of a bigger script, CPython generates bytecode like this:
In [3]: dis.dis('1000 is 1000')
...:
1 0 LOAD_CONST 0 (1000)
2 LOAD_CONST 0 (1000)
4 COMPARE_OP 8 (is)
6 RETURN_VALUE
What it does is:
Loads two constants (LOAD_CONST pushes co_consts[consti] onto the stack -- docs)
Compares them using is (True if operands refer to the same object; False otherwise)
Returns the result
As CPython only creates one Python object for a constant used in a code block, 1000 is 1000 will result in a single integer constant being created:
In [4]: code = compile('1000 is 1000', '<string>', 'single') # code object
In [5]: code.co_consts # constants used by the code object
Out[5]: (1000, None)
According to the bytecode above, Python will load that same object twice and compare it with itself, so the expression will evaluate to True:
In [6]: eval(code)
Out[6]: True
The results are different for -6, because -6 is not immediately recognized as a constant:
In [7]: ast.dump(ast.parse('-6'))
Out[7]: 'Module(body=[Expr(value=UnaryOp(op=USub(), operand=Num(n=6)))])'
-6 is an expression negating the value of the integer literal 6.
Nevertheless, the bytecode for -6 is -6 is virtually the same as the first bytecode sample:
In [8]: dis.dis('-6 is -6')
1 0 LOAD_CONST 1 (-6)
2 LOAD_CONST 2 (-6)
4 COMPARE_OP 8 (is)
6 RETURN_VALUE
So Python loads two -6 constants and compares them using is.
How does the -6 expression become a constant? CPython has a peephole optimizer, capable of optimizing simple expressions involving constants by evaluating them right after the compilation, and storing the results in the table of constants.
As of CPython 3.6, folding unary operations is handled by fold_unaryops_on_constants in Python/peephole.c. In particular, - (unary minus) is evaluated by PyNumber_Negative that returns a new Python object (-6 is not cached). After that, the newly created object is inserted to the consts table. However, the optimizer does not check whether the result of the expression can be reused, so the results of identical expressions end up being distinct Python objects (again, as of CPython 3.6).
To illustrate this, I'll compile the -6 is -6 expression:
In [9]: code = compile('-6 is -6', '<string>', 'single')
There're two -6 constants in the co_consts tuple
In [10]: code.co_consts
Out[10]: (6, None, -6, -6)
and they have different memory addresses
In [11]: [id(const) for const in code.co_consts if const == -6]
Out[11]: [140415435258128, 140415435258576]
Of course, this means that -6 is -6 evaluates to False:
In [12]: eval(code)
Out[12]: False
For the most part, the explanation above remains valid in the presence of variables. When executed in the interactive shell, these three lines
>>> x = 1000
>>> y = 1000
>>> x is y
False
are parts of three different code blocks, so the 1000 constant won't be reused. However, if you put them all in one code block (like a function body) the constant will be reused.
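For example, wrapping the same assignments in a function body makes them one code block, so CPython reuses the 1000 constant and the test returns True:

>>> def f():
...     x = 1000
...     y = 1000
...     return x is y
...
>>> f()
True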
In contrast, the x, y = 1000, 1000 line is always executed as one code block (even in the interactive shell), and therefore CPython always reuses the constant. In x, y = -6, -6, the -6 isn't reused, for the reasons explained in the first part of my answer.
x = y = -6 is trivial. Since there's exactly one Python object involved, x is y would return True even if you replaced -6 with something else.
To complement Ahsanul Haque's answer, try this in any IDE:
x = 1000
y = 1000
print(x is y)
print('\ninitial id x: ', id(x))
print('initial id y: ', id(y))
x = 2000
print('\nid x after change value: ', id(x))
print('id y after change x value: ', id(y))
initial id x: 139865953872336
initial id y: 139865953872336
id x after change value: 139865953872304
id y after change x value: 139865953872336
Very likely you will see the same id for x and y. Then run the same code in the interpreter, and the ids will be different:
>>> x = 1000
>>> y = 1000
>>> id(x)
139865953870576
>>> id(y)
139865953872368
See Here.
I need code to replace this:
import _mysql
a = "111"
a = _mysql.escape_string(a)
"a" is always gonna be a number between 1 and 1000+
and thus maybe there is a more secure way to "cleaning up" the "a" string in this example for mysql and etc..
rather than relying on
_mysql.escape_string()
function.
which we have no idea what it even does. or how it works. perhaps would be slower than something that we can invent given that all we are working is a number between 1 and 1000+
RE-PHRASING THE QUESTION:
How do I ask Python whether the string is at most a 4-digit number?
Check if it's a number:
>>> "1234".isdigit()
True
>>> "ABCD".isdigit()
False
Check its length:
>>> 1 <= len("1234") <= 4
True
>>> 1 <= len("12345") <= 4
False
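Putting the two checks together (a small helper sketch; the function name is just for illustration):

def is_small_number(s):
    # True if s consists only of digits and is at most 4 characters long.
    return s.isdigit() and 1 <= len(s) <= 4

print(is_small_number("111"))    # True
print(is_small_number("12345"))  # False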
escape_string won't clean your string. From the docs:
"escape_string(s) -- quote any SQL-interpreted characters in string s.
Use connection.escape_string(s), if you use it at all. _mysql.escape_string(s) cannot handle character sets. You are probably better off using connection.escape(o) instead, since it will escape entire sequences as well as strings."
The is operator compares the memory addresses of two objects, and returns True if they're the same. Why, then, does it not work reliably with strings?
Code #1
>>> a = "poi"
>>> b = "poi"
>>> a is b
True
Code #2
>>> ktr = "today is a fine day"
>>> ptr = "today is a fine day"
>>> ktr is ptr
False
I have created two strings whose content is the same, but they live at different memory addresses. Why is the output of the is operator not consistent?
I believe it has to do with string interning. In essence, the idea is to store only a single copy of each distinct string, to increase performance on some operations.
Basically, the reason why a is b works is because (as you may have guessed) there is a single immutable string that is referenced by Python in both cases. When a string is large (and some other factors that I don't understand, most likely), this isn't done, which is why your second example returns False.
EDIT: And in fact, the odd behavior seems to be a side-effect of the interactive environment. If you take your same code and place it into a Python script, both a is b and ktr is ptr return True.
a="poi"
b="poi"
print a is b # Prints 'True'
ktr = "today is a fine day"
ptr = "today is a fine day"
print ktr is ptr # Prints 'True'
This makes sense, since it'd be easy for Python to parse a source file and look for duplicate string literals within it. If you create the strings dynamically, then it behaves differently even in a script.
a="p" + "oi"
b="po" + "i"
print a is b # Oddly enough, prints 'True'
ktr = "today is" + " a fine day"
ptr = "today is a f" + "ine day"
print ktr is ptr # Prints 'False'
As for why a is b still results in True, perhaps the allocated string is small enough to warrant a quick search through the interned collection, whereas the other one is not?
is is identity testing. It will work on some strings (because of caching/interning) but not on others; in the example below, str is NOT the same object as ptr. [thanks erykson]
See this code:
>>> import dis
>>> def fun():
... str = 'today is a fine day'
... ptr = 'today is a fine day'
... return (str is ptr)
...
>>> dis.dis(fun)
2 0 LOAD_CONST 1 ('today is a fine day')
3 STORE_FAST 0 (str)
3 6 LOAD_CONST 1 ('today is a fine day')
9 STORE_FAST 1 (ptr)
4 12 LOAD_FAST 0 (str)
15 LOAD_FAST 1 (ptr)
18 COMPARE_OP 8 (is)
21 RETURN_VALUE
>>> id(str)
26652288
>>> id(ptr)
27604736
#hence this comparison returns false: ptr is str
Notice the IDs of str and ptr are different.
BUT:
>>> x = "poi"
>>> y = "poi"
>>> id(x)
26650592
>>> id(y)
26650592
#hence this comparison returns true : x is y
The ids of x and y are the same. Hence the is operator works on "ids" (identity), not on "equalities" (value).
See the link below for a discussion of when and why Python will allocate a different memory location for identical strings (read the question as well).
When does python allocate new memory for identical strings
Also, sys.intern in Python 3.x and intern in Python 2.x should help you allocate the strings at the same memory location, regardless of the size of the string.
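For example, interning both dynamically built strings makes the identity test succeed (a quick check in CPython, Python 3 shown):

>>> import sys
>>> a = sys.intern("today is" + " a fine day")
>>> b = sys.intern("today is a f" + "ine day")
>>> a is b
True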
is is not the same as ==.
Basically, is checks if the two objects are the same, while == compares the values of those objects (strings, like everything in python, are objects).
So you should use is when you really know what objects you're looking at (i.e. you've made the objects yourself, or are comparing with None, as the question comments point out), and you want to know whether two variables reference the exact same object in memory.
In your examples, however, you're looking at str objects that python is handling behind the scenes, so without diving deep into how python works, you don't really know what to expect. You would have the same problem with ints or floats. Other answers do a good job of explaining the "behind the scenes" stuff (string interning), but you mostly shouldn't have to worry about it in day-to-day programming.
Note that this is a CPython-specific optimization. If you want your code to be portable, you should not rely on it. For example, in PyPy:
>>>> a = "hi"
>>>> b = "hi"
>>>> a is b
False
It's also worth pointing out that a similar thing happens for small integers
>>> a = 12
>>> b = 12
>>> a is b
True
which again you should not rely on, because other implementations might not include this optimization.
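If you need a check that behaves the same on every implementation, compare values with == and reserve is for cases like comparing with None:

>>> a = 12
>>> b = 12
>>> a == b    # value comparison: portable, and almost always what you want
True
>>> a is None
False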