Python: understanding iterators and `join()` better

The join() method accepts an iterable as a parameter. However, I was wondering why, having:
text = 'asdfqwer'
This:
''.join([c for c in text])
Is significantly faster than:
''.join(c for c in text)
The same occurs with long strings (e.g. text * 10000000).
Watching the memory footprint of both executions with long strings, I think they both create one and only one list of chars in memory, and then join them into a string. So I am guessing the difference lies only in how join() creates this list out of the generator versus how the Python interpreter does the same thing when it sees [c for c in text]. But, again, I am just guessing, so I would like somebody to confirm or deny my guesses.

The join method reads its input twice: once to determine how much memory to allocate for the resulting string object, and again to perform the actual join. Passing a list is faster than passing a generator object, because join must first drain the generator into a temporary sequence so that it can iterate over it twice.
A list comprehension is not simply a generator object wrapped in a list() call; it is compiled into a specialized loop that builds the list directly. So constructing the list up front is faster than having join create it from a generator object. Generator objects are optimized for memory efficiency, not speed.
Of course, a string is already an iterable object, so you could just write ''.join(text). (Though, again, this is not as fast as creating the list explicitly from the string.)
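A quick way to check this yourself is a micro-benchmark along these lines (just a sketch; absolute timings will vary by machine and CPython version):

import timeit

text = 'asdfqwer' * 100000

print(timeit.timeit(lambda: ''.join([c for c in text]), number=100))  # list comprehension
print(timeit.timeit(lambda: ''.join(c for c in text), number=100))    # generator expression
print(timeit.timeit(lambda: ''.join(text), number=100))               # the string itself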

Related

How to confirm whether my answer has O(1) space complexity and modifies the array in place? (Reverse a String Leetcode question)

I was doing a problem on Leetcode - here is the problem:
Write a function that reverses a string. The input string is given as an array of characters char[].
Do not allocate extra space for another array, you must do this by modifying the input array in-place with O(1) extra memory.
You may assume all the characters consist of printable ASCII characters.
My solution is
def reverseString(s):
    """
    Do not return anything, modify s in-place instead.
    """
    temp = ""
    for index, value in enumerate(s):
        temp += value
    s.clear()
    for i in "".join(reversed(temp)):
        s.append(i)

reverseString(["h", "e", "l", "l", "o"])
My solution works and is accepted by Leetcode. It also passes all the test cases. However, I am still new to the concepts of time and space complexity, and I am not sure whether my solution meets the O(1) requirement and modifies the array in place. If someone could confirm whether it does, and also teach me how to verify this myself, that would be helpful. Thank you!
O(1) means you use a constant amount of extra memory to solve the problem, regardless of the input size. By contrast, the complexity is higher when the extra memory you use grows with the problem's data size. For example, if the size of the string array you want to reverse is x and you use y = ax + b extra memory, that is O(n); y = ax^2 + bx + c is O(n^2). In your solution, temp grows to the same length as the input, so you are using O(n) extra space, not O(1).
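For reference, here is a minimal sketch of a genuinely O(1)-space, in-place reverse using two pointers (variable names are illustrative):

def reverseString(s):
    """Reverse the list of characters s in place with O(1) extra memory."""
    left, right = 0, len(s) - 1
    while left < right:
        s[left], s[right] = s[right], s[left]  # swap the ends, move inward
        left += 1
        right -= 1

chars = ["h", "e", "l", "l", "o"]
reverseString(chars)
print(chars)  # ['o', 'l', 'l', 'e', 'h']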

How is string.join(str_list, '') implemented under the hood in Python?

I know that concatenating two strings using the += operator makes a new copy of the old string and then concatenates the new string to that, resulting in quadratic time complexity.
This answer gives a nice time comparison between the += operation and string.join(str_list, ''). It looks like the join() method runs in linear time (correct me if I am wrong). Out of curiosity, I wanted to know how the string.join(str_list, '') method is implemented in Python, given that strings are immutable objects.
It's implemented in C, so Python-level immutability doesn't get in the way: the C code allocates the result buffer once and writes the characters into it directly. You can find the relevant source here: unicodeobject.c
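To get a feel for the strategy, here is a rough Python-level sketch of the two-pass approach (illustrative only: the real implementation is C code in unicodeobject.c, and this sketch works at the byte level with a bytearray, since Python strings themselves are immutable):

def join_sketch(sep, parts):
    # Pass 0: materialize the input so it can be traversed twice.
    enc = [p.encode() for p in parts]
    sep_b = sep.encode()
    # Pass 1: compute the total size and allocate the buffer once.
    total = sum(len(b) for b in enc) + len(sep_b) * max(len(enc) - 1, 0)
    buf = bytearray(total)
    # Pass 2: copy each piece into place.
    pos = 0
    for i, b in enumerate(enc):
        if i:
            buf[pos:pos + len(sep_b)] = sep_b
            pos += len(sep_b)
        buf[pos:pos + len(b)] = b
        pos += len(b)
    return buf.decode()

print(join_sketch('-', ['a', 'b', 'c']))  # a-b-c

Because the buffer is allocated once and filled in place, the total work is linear in the length of the output.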

Does Python automatically optimize/cache function calls?

I'm relatively new to Python, and I keep seeing examples like:
def max_wordnum(texts):
    count = 0
    for text in texts:
        if len(text.split()) > count:
            count = len(text.split())
    return count
Is the repeated len(text.split()) somehow optimized away by the interpreter/compiler in Python, or will this just take twice the CPU cycles of storing len(text.split()) in a variable?
Duplicate expressions are not "somehow optimized away". Use a local variable to capture and re-use a result that is known not to change and takes non-trivial time to compute, or wherever a variable increases clarity.
In this case, it is impossible for Python to know that text.split() is pure: a pure function is one with no side effects that always returns the same value for a given input.
Trivially, Python, being a dynamically-typed language, doesn't even know the type of text before it actually gets a value, so generalized optimization of this kind is not possible. (Some classes may provide their own internal caching, but that is a digression.)
Even a statically-typed language like C# won't, and can't, optimize away general method calls, because there is no enforceable guarantee of purity in C# either. (What if the method returned a different value on the second call, or wrote to the console?)
Haskell, a purely functional language, does have the option of not evaluating the call twice, but it is a different language with different rules.
Even if Python did optimize this (which it doesn't), the duplicated expression is copy/paste and more difficult to maintain, so creating a variable to hold the result of a complex computation is always a good idea.
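For example, a minimal rewrite of the function above that caches the computed length in a local variable:

def max_wordnum(texts):
    count = 0
    for text in texts:
        n = len(text.split())  # compute once, reuse
        if n > count:
            count = n
    return count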
A better idea yet is to use max with a generator expression in this case:
return max(len(text.split()) for text in texts)
This is also faster.
Also note that len(text.split()) creates a whole list just so you can count its items. A cheaper way, if words are separated by exactly one space, is to count the spaces:
return max(text.count(" ") for text in texts) + 1
If there can be more than one space between words, use a regex with finditer to avoid creating lists:
return max(sum(1 for _ in re.finditer(r"\s+", text)) for text in texts) + 1
Note the 1 added at the end in both versions to correct the count: the number of separators is one less than the number of words.
As an aside, even if the value isn't cached, you can still use complex expressions in loops with range:
for i in range(len(text.split())):
The range object is created once at the start, and the expression is only evaluated once (unlike a C for loop, for instance, whose condition is re-evaluated on every iteration).
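A quick way to see this, using a hypothetical noisy_len helper that announces each call:

def noisy_len(seq):
    print("evaluating len()")  # announces every call
    return len(seq)

text = "one two three"
for i in range(noisy_len(text.split())):  # "evaluating len()" prints only once
    print(i)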

Reverse string time and space complexity

I have written different Python functions to reverse a given string, but I couldn't figure out which one of them is more efficient. Can someone point out the differences between these algorithms using time and space complexities?
def reverse_1(s):
    result = ""
    for i in s:
        result = i + result
    return result

def reverse_2(s):
    return s[::-1]
There are already some solutions out there, but I couldn't find their time and space complexity. I would like to know how much space s[::-1] will take.
Without even trying to benchmark it (you can do that easily), reverse_1 will be dead slow because of several things:
an explicit Python-level loop
constantly prepending a character to the string, creating a full copy each time.
So: slow because of the interpreted loop, O(n*n) time complexity because of the string copies, and O(n) space complexity because it uses extra memory for the temporary strings (which are hopefully garbage-collected as the loop runs).
On the other hand, s[::-1]:
doesn't use a visible loop
returns a string without needing to convert from/to a list
uses compiled code from the Python runtime
So you cannot beat it in terms of time and space complexity, or raw speed.
If you want an alternative, you can use:
''.join(reversed(s))
but that will be slower than s[::-1] (it has to create a list so join can build a string back). It becomes interesting when transformations other than simple reversal are required.
Note that unlike in C or C++ (as far as the analogy holds for strings), it is not possible to reverse a Python string with O(1) space complexity, because of string immutability: you need twice the memory, since string operations cannot be done in place. (An in-place reverse can be done on a list of characters, but the str <=> list conversions themselves use O(n) memory.)
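A hypothetical micro-benchmark sketch along these lines (absolute numbers will vary by machine and CPython version):

import timeit

def reverse_1(s):
    result = ""
    for c in s:
        result = c + result
    return result

s = "abcdefgh" * 1000

print(timeit.timeit(lambda: reverse_1(s), number=100))          # quadratic time
print(timeit.timeit(lambda: ''.join(reversed(s)), number=100))  # linear, builds a list first
print(timeit.timeit(lambda: s[::-1], number=100))               # linear, fastest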

Converting list comprehension notation into a for loop

I've been seeing many solutions with list comprehensions, and I'm wondering whether it's possible to convert them into for-loop notation.
For example, if I have the list comprehension:
radix = [radix_sort(i, 0) for i in lst]
and I write it as:
for i in lst:
    radix_sort(i, 0)
do I get the same output? What differentiates the two? Is a list comprehension more efficient than a conventional for loop?
A list comprehension creates a list—that's the whole point of it. But your loop doesn't create anything at all. So no, you're not going to get the same "output".
The equivalent loop is:
radix = []
for i in lst:
    radix.append(radix_sort(i, 0))
The list comprehension is defined to mean almost exactly the same thing as this. It may run a bit faster, at least in CPython,* but it will have the same effect.
If that radix_sort returns a copy of the list in sorted order, your loop was doing a lot of work for no effect. But now, as with the list comprehension, you're saving the result of all that work.
If, on the other hand, that radix_sort sorts the list in-place and returns nothing, then both the list comprehension and the explicit loop with append are highly misleading, and you should just use the loop without append.
* For example, in a comprehension there is no way to access the list being built (radix) until the looping is done, so the compiler can make some assumptions and use a faster way of appending to the list.
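To make the returns-a-copy case concrete, here is a runnable sketch with a stand-in radix_sort (hypothetical; the real radix_sort may behave differently):

def radix_sort(seq, digit):
    # Stand-in that RETURNS a sorted copy; a real radix sort
    # might instead sort in place and return None.
    return sorted(seq)

lst = [[3, 1, 2], [9, 7, 8]]

radix = [radix_sort(i, 0) for i in lst]  # comprehension

radix2 = []                              # equivalent explicit loop
for i in lst:
    radix2.append(radix_sort(i, 0))

assert radix == radix2 == [[1, 2, 3], [7, 8, 9]]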
