Odd Python ID assignment for int values ==> inconsistent 'is' operation [duplicate]

This question already has answers here:
"is" operator behaves unexpectedly with integers
(11 answers)
The `is` operator behaves unexpectedly with non-cached integers
(2 answers)
Closed 5 years ago.
So Python 3.6.2 has some weird behavior with its assignment of ids for integer values.
For any integer in the range [-5, 256], every variable assigned a given value gets the same ID as every other variable with that value. The effect can be seen below.
>>> a, b = -5, -5
>>> id(a), id(b)
(1355597296, 1355597296)
>>> a, b = -6, -6
>>> id(a), id(b)
(2781041259312, 2781041260912)
In fact, to see the ID pairs in action, you can run this simple program that prints the number and id for the range I'm talking about:
for val in range(-6, 258):
    print(format(val, ' 4d'), ':', format(id(val), '11x'))
If you add other variables with values outside this range, you will see the ids of the boundary values (i.e. -6 and 257) change within the Python interpreter, but the ids of the values inside the range never change.
This means (at least to me) that Python has taken the liberty of hardcoding the addresses of the objects holding values in this seemingly arbitrary range of numbers.
In practice, this can be a little dangerous for a beginning Python learner: since the IDs assigned are the same within what is a normal range of operation for beginners, they may be inclined to use logic that could get them in trouble, even though it seemingly works and makes sense...
One possible (though a bit odd) problem might be printing an incrementing number:
a = 0
b = 10
while a is not b:
    a = a + 1
    print(a)
This logic, though not the standard Pythonic way, works and is fine as long as b is in the range of statically defined numbers [-5, 256].
However, as soon as b is raised out of this range, we see the same strange behavior. In this case, it actually throws the code into an infinite loop.
I know that using 'is' to compare values is really not a good idea, but it produces inconsistent results: the problem is not immediately obvious to someone new to the language, and it would be especially confusing for new programmers who mistakenly relied on this method.
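For reference, a value-comparison version of the same loop (b = 1000 is just an arbitrary value outside the cached range) terminates for any b:
a = 0
b = 1000          # any value, cached or not
while a != b:     # != compares values, not object identity
    a = a + 1
    print(a)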
So my question is...
a) Why (was Python written to behave this way), and
b) Should it be changed?
p.s. In order to properly demonstrate the range in a usable script, I had to do some odd tweaks that really are improper code. However, I stand by my argument, since my method would show no results if this odd behavior didn't exist.
for val in range(-6, 300):
    a = int(float(val))
    b = int(float(val))
    print(format(a, ' 4d'), format(id(a), '11x'), ':', format(b, ' 4d'), format(id(b), '11x'), ':', a is b)
    val = val + 1
The int(float(val)) is necessary to force Python to construct a new int object for each assignment, rather than handing back a reference to the object it is already accessing.

This is documented behavior of Python:
The current implementation keeps an array of integer objects for all integers between -5 and 256; when you create an int in that range you actually just get back a reference to the existing object.
source
It helps to save memory and to make operations a bit faster.
It is implementation-specific. For example, IronPython re-uses integers in a range between -1000 and 1000.
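A small sketch makes the cache boundary visible on CPython (this is implementation-specific and not guaranteed by the language; int(...) on a string is used only to force the integers to be constructed at run time):
a = int("256")   # built at run time, but 256 is inside the cache
b = int("256")
print(a is b)    # True on CPython: both names refer to the cached object
c = int("257")   # 257 falls outside the cache
d = int("257")
print(c is d)    # False on CPython: two distinct objects
print(c == d)    # True: == compares values and is the correct check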

Related

Python sum, why not strings? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
Python has a built in function sum, which is effectively equivalent to:
def sum2(iterable, start=0):   # assumes: import operator (and, on Python 3, from functools import reduce)
    return start + reduce(operator.add, iterable)
for all types of parameters except strings. It works for numbers and lists, for example:
sum([1,2,3], 0) = sum2([1,2,3],0) = 6 #Note: 0 is the default value for start, but I include it for clarity
sum({888:1}, 0) = sum2({888:1},0) = 888
Why were strings specially left out?
sum( ['foo','bar'], '') # TypeError: sum() can't sum strings [use ''.join(seq) instead]
sum2(['foo','bar'], '') = 'foobar'
I seem to remember discussions in the Python list for the reason, so an explanation or a link to a thread explaining it would be fine.
Edit: I am aware that the standard way is to do "".join. My question is why the option of using sum for strings was banned, while there was no ban for, say, lists.
Edit 2: Although I believe this is not needed given all the good answers I got, the question is: Why does sum work on an iterable containing numbers or an iterable containing lists but not an iterable containing strings?
Python tries to discourage you from "summing" strings. You're supposed to join them:
"".join(list_of_strings)
It's a lot faster, and uses much less memory.
A quick benchmark:
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = reduce(operator.add, strings)'
100 loops, best of 3: 8.46 msec per loop
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = "".join(strings)'
1000 loops, best of 3: 296 usec per loop
Edit (to answer OP's edit): As to why strings were apparently "singled out", I believe it's simply a matter of optimizing for a common case, as well as enforcing best practice: you can join strings much faster with ''.join, so explicitly forbidding strings on sum points this out to newbies.
BTW, this restriction has been in place "forever", i.e., since sum was added as a built-in function (rev. 32347).
You can in fact use sum(..) to concatenate strings, if you use the appropriate starting object! Of course, if you go this far you have already understood enough to use "".join(..) anyway.
>>> class ZeroObject(object):
...     def __add__(self, other):
...         return other
...
>>> sum(["hi", "there"], ZeroObject())
'hithere'
Here's the source: http://svn.python.org/view/python/trunk/Python/bltinmodule.c?revision=81029&view=markup
In the builtin_sum function we have this bit of code:
/* reject string values for 'start' parameter */
if (PyObject_TypeCheck(result, &PyBaseString_Type)) {
    PyErr_SetString(PyExc_TypeError,
        "sum() can't sum strings [use ''.join(seq) instead]");
    Py_DECREF(iter);
    return NULL;
}
Py_INCREF(result);
So.. that's your answer.
It's explicitly checked in the code and rejected.
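The check is easy to reproduce from Python itself; a quick sketch (the message is the one set in the C code quoted above):
try:
    sum(['foo', 'bar'], '')
except TypeError as e:
    print(e)  # sum() can't sum strings [use ''.join(seq) instead]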
From the docs:
The preferred, fast way to concatenate a sequence of strings is by calling ''.join(sequence).
By making sum refuse to operate on strings, Python has encouraged you to use the correct method.
Short answer: Efficiency.
Long answer: The sum function has to create an object for each partial sum.
Assume that the amount of time required to create an object is directly proportional to the size of its data. Let N denote the number of elements in the sequence to sum.
doubles are always the same size, which makes sum's running time O(1)×N = O(N).
int (formerly known as long) is arbitrary-length. Let M denote the absolute value of the largest sequence element. Then sum's worst-case running time is lg(M) + lg(2M) + lg(3M) + ... + lg(NM) = N×lg(M) + lg(N!) = O(N log N).
For str (where M = the length of the longest string), the worst-case running time is M + 2M + 3M + ... + NM = M×(1 + 2 + ... + N) = O(N²).
Thus, summing strings would be much slower than summing numbers.
str.join does not allocate any intermediate objects. It preallocates a buffer large enough to hold the joined strings, and copies the string data. It runs in O(N) time, much faster than sum.
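To make the asymptotics visible, here is a hedged timing sketch (written for Python 3, where reduce lives in functools; exact numbers vary by machine and interpreter version, but reduce-based concatenation should roughly quadruple per doubling of n while join roughly doubles):
import operator
import timeit
from functools import reduce

for n in (10000, 20000, 40000):
    strings = ["a"] * n
    t_add = timeit.timeit(lambda: reduce(operator.add, strings), number=5)
    t_join = timeit.timeit(lambda: "".join(strings), number=5)
    print("n=%6d  reduce: %.4fs  join: %.4fs" % (n, t_add, t_join))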
The Reason Why
@dan04 has an excellent explanation for the costs of using sum on large lists of strings.
The missing piece as to why str is not allowed for sum is that many, many people were trying to use sum for strings, and not many use sum for lists and tuples and other O(n**2) data structures. The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance, and not allow strings to be used with sum.
Edit: Moved the parts about immutability to history.
Basically, it's a question of preallocation. When you use a statement such as
sum(["a", "b", "c", ..., ])
and expect it to work similarly to a reduce statement, the generated code looks something like
v1 = "" + "a"       # must allocate v1 and set its size to len("") + len("a")
v2 = v1 + "b"       # must allocate v2 and set its size to len(v1) + len("b")
...
res = v9999 + "$"   # must allocate res and set its size to len(v9999) + len("$")
In each of these steps a new string is created, which already introduces some copying overhead as the strings get longer and longer. But that may not even be the main point. What's more important is that every new string on each line must be allocated at its specific size. I don't know whether it must allocate on every iteration of the reduce (there may be heuristics, and Python might over-allocate a bit here and there for reuse), but at several points the new string will be large enough that this no longer helps, and Python must allocate again, which is rather expensive.
A dedicated method like join, however, has the job of figuring out the real size of the string before it starts, so in theory it only needs to allocate once, at the beginning, and then just fill that new string, which is much cheaper than the other solution.
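A pure-Python sketch of that strategy (join_like is a hypothetical helper, not CPython's actual implementation, and it assumes the pieces are plain ASCII so each character is one byte):
def join_like(parts):
    total = sum(len(p) for p in parts)   # first pass: compute the final size
    buf = bytearray(total)               # a single up-front allocation
    pos = 0
    for p in parts:                      # second pass: copy each piece in
        raw = p.encode('ascii')
        buf[pos:pos + len(raw)] = raw
        pos += len(raw)
    return buf.decode('ascii')

print(join_like(["foo", "bar"]))  # foobar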
I don't know why, but this works!
import operator
from functools import reduce  # reduce is a builtin in Python 2

def sum_of_strings(list_of_strings):
    return reduce(operator.add, list_of_strings)

Why does a temporary variable computed in a Python for-loop use so much memory? [duplicate]

This question already has answers here:
Python string interning
(2 answers)
About the changing id of an immutable string
(5 answers)
Closed 3 years ago.
The following two snippets are equivalent, but the first one takes about 700 MB of memory while the latter takes only about 100 MB (via the Windows Task Manager). What happens here?
def a():
    lst = []
    for i in range(10**7):
        t = "a"
        t = t * 2
        lst.append(t)
    return lst

_ = a()
def a():
    lst = []
    for i in range(10**7):
        t = "a" * 2
        lst.append(t)
    return lst

_ = a()
@vurmux presented the right reason for the different memory usage: string interning, but some important details seem to be missing.
The CPython implementation interns some strings during compilation, e.g. "a"*2 - for more info about how/why "a"*2 gets interned, see this SO post.
Clarification: As @MartijnPieters has correctly pointed out in his comment, the important thing is whether the compiler performs constant folding (i.e. evaluates the multiplication of two constants like "a"*2) or not. If constant folding is done, the resulting constant is used and all elements in the list are references to the same object; otherwise not. Even though all string constants here do get interned (constant folding is performed and the resulting string interned), it was sloppy to speak only of interning: constant folding is the key, since it also explains the behavior for types that have no interning at all, for example floats (if we used t = 42*2.0).
Whether constant folding has happened can be easily verified with the dis module (I call your second version a2()):
>>> import dis
>>> dis.dis(a2)
...
4 18 LOAD_CONST 2 ('aa')
20 STORE_FAST 2 (t)
...
As we can see, the multiplication isn't performed at run time; instead, the result of the multiplication (computed at compile time) is loaded directly, so the resulting list consists of references to the same object (the constant loaded with 18 LOAD_CONST 2):
>>> len({id(s) for s in a2()})
1
There, only 8 bytes per reference are needed, which means about 80 MB of memory (+ overallocation of the list + memory needed for the interpreter).
In Python 3.7, constant folding isn't performed if the resulting string would have more than 4096 characters, so replacing "a"*2 with "a"*4097 leads to the following byte code:
>>> dis.dis(a1)
...
4 18 LOAD_CONST 2 ('a')
20 LOAD_CONST 3 (4097)
22 BINARY_MULTIPLY
24 STORE_FAST 2 (t)
...
Now the multiplication isn't precalculated, and the references in the resulting list will be to different objects.
The optimizer is not yet smart enough to recognize that t is actually "a" in t = t*2; otherwise it would be able to perform the constant folding. For now, the resulting byte code for your first version (I call it a1()) is:
...
5 22 LOAD_CONST 3 (2)
24 LOAD_FAST 2 (t)
26 BINARY_MULTIPLY
28 STORE_FAST 2 (t)
...
and it will return a list with 10^7 different objects (all of them equal) inside:
>>> len({id(s) for s in a1()})
10000000
i.e. you will need about 56 bytes per string (sys.getsizeof returns 51, but because the pymalloc memory allocator is 8-byte aligned, 5 bytes will be wasted) + 8 bytes per reference (assuming a 64-bit CPython version), thus about 610 MB (+ overallocation of the list + memory needed for the interpreter).
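The arithmetic can be checked with sys.getsizeof (a sketch; exact sizes depend on the CPython version and platform):
import sys

obj = sys.getsizeof("a" * 2)     # typically 51 bytes on 64-bit CPython
aligned = (obj + 7) // 8 * 8     # pymalloc rounds up to a multiple of 8 -> 56
per_item = aligned + 8           # plus one 8-byte reference held by the list
print(per_item * 10**7 / 2**20)  # roughly 610 MiB for 10**7 distinct strings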
You can enforce the interning of the string via sys.intern:
import sys

def a1_interned():
    lst = []
    for i in range(10**7):
        t = "a"
        t = t * 2
        # here ensure that the string object gets interned;
        # the returned value is the interned version
        t = sys.intern(t)
        lst.append(t)
    return lst
And really, we can now see not only that less memory is needed, but also that the list holds references to the same object (see it online for a slightly smaller size (10^5) here):
>>> len({id(s) for s in a1_interned()})
1
>>> all(s == "aa" for s in a1_interned())
True
String interning can save a lot of memory, but it is sometimes tricky to understand whether/why a string gets interned or not. Calling sys.intern explicitly eliminates this uncertainty.
The existence of additional temporary objects referenced by t is not the problem: CPython uses reference counting for memory management, so an object gets deleted as soon as there are no references to it, without any involvement of the garbage collector, which in CPython is only used to break up cycles (unlike, for example, Java's GC, since Java doesn't use reference counting). Thus temporary variables really are temporary: those objects cannot accumulate and so cannot have any impact on memory usage.
The problem with the temporary variable t is only that it prevents the peephole optimization during compilation, which is performed for "a"*2 but not for t*2.
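To confirm the difference without the Task Manager, here is a sketch using the standard tracemalloc module (with 10**5 elements to keep it quick; the ratio between the two results, not the absolute numbers, is the point):
import tracemalloc

def with_temp(n):
    lst = []
    for _ in range(n):
        t = "a"
        t = t * 2            # evaluated at run time: a fresh object each pass
        lst.append(t)
    return lst

def folded(n):
    lst = []
    for _ in range(n):
        lst.append("a" * 2)  # folded to the constant 'aa' at compile time
    return lst

for fn in (with_temp, folded):
    tracemalloc.start()
    data = fn(10**5)                             # keep the list alive
    size, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(fn.__name__, size)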
This difference exists because of string interning in the Python interpreter:
String interning is the method of caching particular strings in memory as they are instantiated. The idea is that, since strings in Python are immutable objects, only one instance of a particular string is needed at a time. By storing an instantiated string in memory, any future references to that same string can be directed to refer to the singleton already in existence, instead of taking up new memory.
Let me show it in a simple example:
>>> t1 = 'a'
>>> t2 = t1 * 2
>>> t2 is 'aa'
False
>>> t1 = 'a'
>>> t2 = 'a'*2
>>> t2 is 'aa'
True
When you use the first variant, Python's string interning is not used, so the interpreter creates additional internal variables to store temporary data. It can't optimize multi-line code this way.
I am not a Python guru, but I think the interpreter works this way:
t = "a"
t = t * 2
In the first line it creates an object for t. In the second line it creates a temporary object for the t on the right of the = sign and writes the result to a third place in memory (with GC invoked later). So the second variant should use at least 3 times less memory than the first.
P.S. You can read more about string interning here.

Compound boolean logic in python if

I am trying to test a basic premise in Python and it always fails, and I can't figure out why.
My sys.argv looks like this:
['test.py', 'test']
And my code looks like this:
if len(sys.argv) > 1 and sys.argv[1] is 'test':
    print 'Test mode'
But the test is never true. I am sure that I am missing something really simple here, but I can't figure out what it is.
As mentioned above, the main reason is your test comparison. Using is is different from using ==: is checks whether two names refer to the very same object, while == checks whether their values are equal. In this case, you can verify that they are not the same object by checking their ids:
import sys
print id(sys.argv[1])
print id('test')
My output:
140335994263232
140335994263424
As they point to different objects, they will not be identical under is (but == compares the strings themselves, which returns True).
The issue at work here is the concept of interning. When you hardcode two identical strings into your source, the strings are interned and the two will share an object ID (this explains @SamMussmann's very valid point below). But when you pass a string in via argv, a new object is created, making the comparison to an identical hardcoded string in your code return False. The best explanation I have found so far is here, where both Alex Martelli and Jon Skeet (two very reputable sources) explain interning and when strings are interned. From these explanations, it does seem that since the data from argv is external to the program, the values aren't interned, and therefore have different object IDs than if they were both literals in the source.
One additional point of interest (unrelated to the issue at hand but pertinent to the is discussion) is the caching that is done with numbers. The numbers from -5 to 256 are cached, meaning that is comparisons with equal numbers in that range will be True, regardless of how they are calculated:
In [1]: 256 is 255 + 1
Out[1]: True
In [2]: 257 is 256 + 1
Out[2]: False
In [3]: -5 is -4 - 1
Out[3]: True
In [4]: -6 is -5 - 1
Out[4]: False
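For completeness, the minimal fix for the original snippet is to compare values rather than identities (keeping the question's Python 2 print syntax):
import sys

# == compares the string values; `is` compares object identity
if len(sys.argv) > 1 and sys.argv[1] == 'test':
    print 'Test mode'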
