Time complexity for two different solutions - python

I want to understand the difference in time complexity between these two solutions.
The task itself is not relevant, but if you're curious, here's the link with the explanation.
This is my first solution. It scores 100% in correctness but 0% in performance:
def solution(s, p, q):
    dna = {'A': 1, 'C': 2, 'G': 3, 'T': 4}
    result = []
    for i in range(len(q)):
        least = 4
        for c in set(s[p[i]:q[i] + 1]):
            if least > dna[c]:
                least = dna[c]
        result.append(least)
    return result
This is the second solution. It scores 100% in both correctness and performance:
def solution(s, p, q):
    result = []
    for i in range(len(q)):
        if 'A' in s[p[i]:q[i] + 1]:
            result.append(1)
        elif 'C' in s[p[i]:q[i] + 1]:
            result.append(2)
        elif 'G' in s[p[i]:q[i] + 1]:
            result.append(3)
        else:
            result.append(4)
    return list(result)
Now this is how I see it. In both solutions I iterate over the length of q, and on each iteration I slice a portion of the string with a length between 1 and 100,000.
Here's where I get confused. In my first solution, on each iteration I slice a portion of the string once and create a set from it to remove all the duplicates. The set can have a length between 1 and 4, so iterating through it must be very quick, and I iterate through it only once per query.
In the second solution, on each iteration I slice a portion of the string up to three times and scan it, in the worst case three times, each scan over up to 100,000 characters.
Then why is the second solution faster? How can the first have a time complexity of O(n*m) and the second O(n+m)?
I thought it was because of the in and for operators, but I tried the same second solution in JavaScript with the indexOf method and it still gets 100% in performance. But why? I could understand it if in Python the in and for operators had different implementations and worked differently behind the scenes, but in JS the indexOf method is just going to apply a for loop. Then isn't that the same as just doing the for loop directly inside my function? Shouldn't that be O(n*m) time complexity?

You haven't specified how the performance rating is obtained, but in any case the second algorithm is clearly better, mainly because it uses the in operator, which under the hood calls a function implemented in C, far more efficient than interpreted Python. More on this topic here.
Also, I'm fairly sure the Python interpreter is not smart enough to slice the string only once and then reuse the same portion for the other tests in the second algorithm.
Creating the set in the first algorithm also looks like a costly operation.
Lastly, maybe the performance ratings aren't based on asymptotic complexity, but rather on measured execution time over different test strings?
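One way to settle it is to measure the per-query primitives directly. A minimal timeit sketch (the test string is made up; it models typical data where every base appears early in the slice):

from timeit import timeit

s = 'ACGT' * 25_000  # 100,000 characters; every base occurs near the start

# Solution 1's per-query work: set() must always scan the whole slice.
print(timeit(lambda: min(set(s)), number=100))

# Solution 2's per-query work: 'in' short-circuits at the first match.
print(timeit(lambda: 'A' in s, number=100))

On data like this, the second line is orders of magnitude faster, because it stops at the first 'A' instead of touching all 100,000 characters.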

I think the difference in complexity can easily be showcased on an example.
Consider the following input:
s = 'ACGT' * 1000000
# = 'ACGTACGTACGTACGTACGTACGTACGTACGTACGT...ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT'
p = [0]
q = [3999999]
Algorithm 2 very quickly checks that 'A' is in s[0:4000000] (it's the first character - no need to iterate through the whole string to find it!).
Algorithm 1, on the other hand, must iterate through the whole string s[0:4000000] to build the set {'A','C','G','T'}, because iterating through the whole string is the only way to check that there isn't a fifth distinct character hidden somewhere in the string.
Important note
I said algorithm 2 should be fast on this example, because the test 'A' in ... doesn't need to iterate through the whole string to find 'A' if 'A' is at the beginning of the string. However, note a possible important difference in complexity between 'A' in s and 'A' in s[0:4000000]. The problem is that creating a slice of the string might cost time (and memory) if it's copying the string. Instead of slicing, you should use s.find('A', 0, 4000000), which is guaranteed not to build a copy. For more information on this:
Documentation on string.find
Stackoverflow: Time complexity of string slice
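To make that concrete, here is a sketch of the second solution rewritten with str.find so that no slice copies are created (the logic is unchanged; I haven't run it against the original grader):

def solution(s, p, q):
    result = []
    for i in range(len(q)):
        # str.find(sub, start, end) scans s[start:end] without copying it;
        # the end bound is exclusive, and -1 means the character is absent.
        if s.find('A', p[i], q[i] + 1) != -1:
            result.append(1)
        elif s.find('C', p[i], q[i] + 1) != -1:
            result.append(2)
        elif s.find('G', p[i], q[i] + 1) != -1:
            result.append(3)
        else:
            result.append(4)
    return result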

Related

Run time difference for "in" searching through "list" and "set" using Python

My understanding of list and set in Python is mainly that a list allows duplicates, is ordered, and has position information. I found that while searching for whether an element is within a list, the runtime is much faster if I convert the list to a set first. For example, I wrote some code to find the longest consecutive sequence in a list. Using the numbers 0 to 9999 as an example, the longest consecutive sequence has length 10000. While using a list:
from datetime import datetime

start_time = datetime.now()
nums = list(range(10000))
longest = 0
for number in nums:
    if number - 1 not in nums:
        length = 0
        # Search whether number + length is also in the list
        while number + length in nums:
            length += 1
        longest = max(length, longest)
end_time = datetime.now()
timecost = 'Duration: {}'.format(end_time - start_time)
print(timecost)
The run time for the above code is "Duration: 0:00:01.481939".
After adding only one line to convert the list to a set (the third line below):
from datetime import datetime

start_time = datetime.now()
nums = list(range(10000))
nums = set(nums)
longest = 0
for number in nums:
    if number - 1 not in nums:
        length = 0
        # Search whether number + length is also in the set (was a list)
        while number + length in nums:
            length += 1
        longest = max(length, longest)
end_time = datetime.now()
timecost = 'Duration: {}'.format(end_time - start_time)
print(timecost)
The run time for the above code using a set is now "Duration: 0:00:00.005138", many times shorter than searching through a list. Could anyone help me understand the reason for that? Thank you!
This is a great question.
The issue with arrays is that there is no smarter way to search in some array a besides just comparing every element one by one.
Sometimes you'll get lucky and get a match on the first element of a.
Sometimes you'll get unlucky and not find a match until the last element of a, or perhaps none at all.
On average, you'll have to search half the elements of the array each time.
This is said to have a "time complexity" of O(len(a)), or colloquially, O(n). This means the time taken by the algorithm (searching for a value in the array) is linearly proportional to the size of the input (the number of elements in the array to be searched). This is why it's called "linear search". Oh, your array got 2x bigger? Well, your searches just got 2x slower. 1000x bigger? 1000x slower.
Arrays are great, but they're đź’© for searching if the element count gets too high.
Sets are clever. In Python, they're implemented as if they were a dictionary with keys and no values. Like dictionaries, they're backed by a data structure called a hash table.
A hash table uses the hash of a value as a quick way to get a "summary" of an object. This "summary" is then used to narrow down its search, so it only needs to linearly search a very small subset of all the elements. Searching in a hash table has a time complexity of O(1). Notice that there's no "n" or len(the_set) in there. That's because the time taken to search in a hash table does not grow as the size of the hash table grows. So it's said to have constant time complexity.
By analogy, you only search the dairy aisle when you're looking for milk. You know the hash value of milk (say, its aisle) is "dairy" and not "deli", so you don't have to waste any time searching elsewhere for milk.
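To make the analogy concrete, here's a toy sketch of the bucketing idea (real hash tables are far more sophisticated, with resizing and collision handling, but the principle is the same):

# A toy "hash table" with 8 buckets: membership tests only scan one bucket.
buckets = [[] for _ in range(8)]

def add(value):
    buckets[hash(value) % 8].append(value)

def contains(value):
    # Only the single bucket the hash points at is searched linearly.
    return value in buckets[hash(value) % 8]

for n in range(100):
    add(n)
print(contains(42), contains(1000))  # True False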
A natural follow-up question is "then why don't we always use sets?". Well, there's a trade-off.
As you mentioned, sets can't contain duplicates, so if you want to store two of something, it's a non-starter.
Sets are also unordered (it's dicts, not sets, that preserve insertion order since Python 3.7), so if you care about the order of elements, they won't do either.
Sets generally have a larger CPU/memory overhead, which adds up when using many sets containing small numbers of elements.
Also, it's possible that because of fancy CPU features (like CPU caches and branch prediction), linear searching through small arrays can actually be faster than the hash-based look-up in sets.
I'd recommend you do some further reading into data structures and algorithms. This stuff is quite language-independent. Now that you know that set and dict use a hash table behind the scenes, you can look up resources that cover hash tables in any language, and that should help. There are also some Python-centric resources, like https://www.interviewcake.com/concept/python/hash-map

Most efficient way to get first value that startswith of large list

I have a very large list with over 100M strings. An example of that list looks as follows:
l = ['1,1,5.8067',
     '1,2,4.9700',
     '2,2,3.9623',
     '2,3,1.9438',
     '2,7,1.0645',
     '3,3,8.9331',
     '3,5,2.6772',
     '3,7,3.8107',
     '3,9,7.1008']
I would like to get the first string that starts with e.g. '3'.
To do so, I have used filter with a lambda, followed by next() to get the first item:
next(filter(lambda i: i.startswith('3,'), l))
Out[1]: '3,3,8.9331'
Considering the size of the list, this strategy unfortunately still takes a relatively long time for a process I have to do over and over again. I was wondering if someone could come up with an even faster, more efficient approach. I am open to alternative strategies.
I have no way of testing it myself, but it is possible that if you join all the strings with a character that does not appear in any of them:
concat_list = '$'.join(l)
then a simple .find('$3,') would be faster. This is likely if all the strings are relatively short, since the whole text is now in one contiguous place in memory.
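A sketch of that idea, including recovering the full matching string from the offset (I've added a sentinel '$' at each end so the first and last entries can match too; this assumes '$' never occurs in the data):

l = ['1,1,5.8067', '1,2,4.9700', '2,2,3.9623', '3,3,8.9331', '3,5,2.6772']

# Join once up front; reuse concat_list for every subsequent query.
concat_list = '$' + '$'.join(l) + '$'

start = concat_list.find('$3,')           # one C-level scan
if start != -1:
    end = concat_list.index('$', start + 1)
    print(concat_list[start + 1:end])     # '3,3,8.9331'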
If the number of unique letters in the text is small, you can use the Abrahamson-Kosaraju method and get a time complexity of practically O(n).
Another approach is to use joblib: create n threads where the i-th thread checks elements i + k * n; when one finds the pattern, it stops the others. The running time is then roughly that of the naive algorithm divided by n.
Since your actual strings consist of relatively short tokens (such as 301) after splitting the strings by tabs, you can build a dict with each possible prefix of the first token as a key, so that subsequent lookups take only O(1) average time complexity.
Build the dict with values of the list in reverse order so that the first value in the list that start with each distinct character will be retained in the final dict:
d = {s[:i + 1]: s for s in reversed(l) for i in range(len(s.split('\t')[0]))}
so that given:
l = ['301\t301\t51.806763\n', '301\t302\t46.970094\n',
     '301\t303\t39.962393\n', '301\t304\t18.943836\n',
     '301\t305\t11.064584\n', '301\t306\t4.751911\n']
d['3'] will return '301\t301\t51.806763'.
If you only need to test each of the first tokens as a whole, rather than prefixes, you can simply make the first tokens as the keys instead:
d = {s.split('\t')[0]: s for s in reversed(l)}
so that d['301'] will return '301\t301\t51.806763'.

Is the time-complexity of iterative string append actually O(n^2), or O(n)?

I am working on a problem out of CTCI.
The third problem of chapter 1 has you take a string such as
'Mr John Smith '
and asks you to replace the intermediary spaces with %20:
'Mr%20John%20Smith'
The author offers this solution in Python, calling it O(n):
def urlify(string, length):
    '''function replaces single spaces with %20 and removes trailing spaces'''
    counter = 0
    output = ''
    for char in string:
        counter += 1
        if counter > length:
            return output
        elif char == ' ':
            output = output + '%20'
        elif char != ' ':
            output = output + char
    return output
My question:
I understand that this is O(n) in terms of scanning through the actual string from left to right. But aren't strings in Python immutable? If I have a string and I add another string to it with the + operator, doesn't it allocate the necessary space, copy over the original, and then copy over the appending string?
If I have a collection of n strings each of length 1, then that takes:
1 + 2 + 3 + 4 + 5 + ... + n = n(n+1)/2
or O(n^2) time, yes? Or am I mistaken in how Python handles appending?
Alternatively, if you'd be willing to teach me how to fish: How would I go about finding this out for myself? I've been unsuccessful in my attempts to Google an official source. I found https://wiki.python.org/moin/TimeComplexity but this doesn't have anything on strings.
In CPython, the standard implementation of Python, there's an implementation detail that makes this usually O(n), implemented in the code the bytecode evaluation loop calls for + or += with two string operands. If Python detects that the left argument has no other references, it calls realloc to attempt to avoid a copy by resizing the string in place. This is not something you should ever rely on, because it's an implementation detail and because if realloc ends up needing to move the string frequently, performance degrades to O(n^2) anyway.
Without the weird implementation detail, the algorithm is O(n^2) due to the quadratic amount of copying involved. Code like this would only make sense in a language with mutable strings, like C++, and even in C++ you'd want to use +=.
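You can see both effects with a quick experiment (a minimal sketch; absolute numbers vary by machine and CPython version):

from timeit import timeit

def build_concat(n):
    out = ''
    for _ in range(n):
        out += 'x'   # usually amortized O(1) in CPython, but not guaranteed
    return out

def build_join(n):
    return ''.join('x' for _ in range(n))   # guaranteed O(n)

print(timeit(lambda: build_concat(10_000), number=100))
print(timeit(lambda: build_join(10_000), number=100))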
The author relies on an optimization that happens to be there, but is not explicitly dependable. strA = strB + strC is typically O(n), making the function O(n^2). However, it is pretty easy to make sure the whole process is O(n): use a list and join at the end:
output = []
# ... loop thing
output.append('%20')
# ...
output.append(char)
# ...
return ''.join(output)
In a nutshell, the append operation is amortized O(1) (although you can make it strictly O(1) by pre-allocating the list to the right size), making the loop O(n).
And then the join is also O(n), but that's okay because it is outside the loop.
I found this snippet of text on Python Speed > Use the best algorithms and fastest tools:
String concatenation is best done with ''.join(seq) which is an O(n) process. In contrast, using the '+' or '+=' operators can result in an O(n^2) process because new strings may be built for each intermediate step. The CPython 2.4 interpreter mitigates this issue somewhat; however, ''.join(seq) remains the best practice
For future visitors: since this is a CTCI question, learning the urllib package is not required here; specifically, as per the OP and the book, this question is about arrays and strings.
Here's a more complete solution, inspired by @njzk2's pseudocode:
text = 'Mr John Smith'  # length 13
special_str = '%20'

def URLify(text, text_len, special_str):
    url = []
    for i in range(text_len):        # O(n)
        if text[i] == ' ':
            url.append(special_str)  # append() is O(1)
        else:
            url.append(text[i])      # O(1)
    print(url)
    return ''.join(url)              # O(n)

print(URLify(text, 13, '%20'))

Performance issues in Burrows-Wheeler in python

I was trying to implement the Burrows-Wheeler transform in Python. (This is one of the assignments in an online course, but I hope I have done enough work to be qualified to ask for help.)
The algorithm works as follows. Take a string which ends with a special character ($ in my case) and create all cyclic rotations of this string. Sort all these strings alphabetically, with the special character always less than any other character. After this, take the last character of each string.
This gave me a one-liner:
''.join([i[-1] for i in sorted([text[i:] + text[0:i] for i in xrange(len(text))])])
Which is correct and reasonably fast for reasonably big strings (which is enough to solve the problem):
60 000 chars - 16 secs
40 000 chars - 07 secs
25 000 chars - 02 secs
But when I tried to process a really huge string with a few million chars, I failed (it took too much time to process).
I assume that the problem is with storing too many strings in the memory.
Is there any way to overcome this?
P.S. I just want to point out that although this might look like a homework problem, my solution already passes the grader and I am just looking for a way to make it faster. Also, I am not spoiling the fun for other people, because if they would like to find a solution, the wiki article has one which is similar to mine. I also checked this question, which sounds similar but answers a harder question: how to decode a string encoded with this algorithm.
It takes a long time to make all those string slices with long strings. It's at least O(N^2) (since you create N strings of N length, and each one has to be copied into memory taking its source data from the original), which destroys the overall performance and makes the sorting irrelevant. Not to mention the memory requirement!
Instead of actually slicing the string, the next thought is to order the i values you use to create the cyclic strings, in order of how the resulting string would compare - without actually creating it. This turns out to be somewhat tricky. (Removed/edited some stuff here that was wrong; please see @TimPeters' answer.)
The approach I've taken here is to bypass the standard library - which makes it difficult (though not impossible) to compare those strings 'on demand' - and do my own sorting. The natural choice of algorithm here is radix sort, since we need to consider the strings one character at a time anyway.
Let's get set up first. I am writing code for version 3.2, so season to taste. (In particular, in 3.3 and up, we could take advantage of yield from.) I am using the following imports:
from random import choice
from timeit import timeit
from functools import partial
I wrote a general-purpose radix sort function like this:
def radix_sort(values, key, step=0):
    if len(values) < 2:
        for value in values:
            yield value
        return
    bins = {}
    for value in values:
        bins.setdefault(key(value, step), []).append(value)
    for k in sorted(bins.keys()):
        for r in radix_sort(bins[k], key, step + 1):
            yield r
Of course, we don't need to be general-purpose (our 'bins' can only be labelled with single characters, and presumably you really mean to apply the algorithm to a sequence of bytes ;) ), but it doesn't hurt. Might as well have something reusable, right? Anyway, the idea is simple: we handle a base case, and then we drop each element into a "bin" according to the result from the key function, and then we pull values out of the bins in sorted bin order, recursively sorting each bin's contents.
The interface requires that key(value, n) gives us the nth "radix" of value. So for simple cases, like comparing strings directly, that could be as simple as lambda v, n: v[n]. Here, though, the idea is to compare indices into the string, according to the data in the string at that point (considered cyclically). So let's define a key:
def bw_key(text, value, step):
    return text[(value + step) % len(text)]
Now the trick to getting the right results is to remember that we're conceptually joining up the last characters of the strings we aren't actually creating. If we consider the virtual string made using index n, its last character is at index n - 1, because of how we wrap around - and a moment's thought will confirm to you that this still works when n == 0 ;) . [However, when we wrap forwards, we still need to keep the string index in-bounds - hence the modulo operation in the key function.]
This is a general key function that needs to be passed in the text to which it will refer when transforming the values for comparison. That's where functools.partial comes in - you could also just mess around with lambda, but this is arguably cleaner, and I've found it's usually faster, too.
Anyway, now we can easily write the actual transform using the key:
def burroughs_wheeler_custom(text):
    return ''.join(text[i - 1] for i in radix_sort(range(len(text)), partial(bw_key, text)))
    # Notice I've dropped the square brackets; this means I'm passing a generator
    # expression to `join` instead of a list comprehension. In general, this is
    # a little slower, but uses less memory. And the underlying code uses lazy
    # evaluation heavily, so :)
Nice and pretty. Let's see how it does, shall we? We need a standard to compare it against:
def burroughs_wheeler_standard(text):
    return ''.join([i[-1] for i in sorted([text[i:] + text[:i] for i in range(len(text))])])
And a timing routine:
def test(n):
    data = ''.join(choice('abcdefghijklmnopqrstuvwxyz') for i in range(n)) + '$'
    custom = partial(burroughs_wheeler_custom, data)
    standard = partial(burroughs_wheeler_standard, data)
    assert custom() == standard()
    trials = 1000000 // n
    custom_time = timeit(custom, number=trials)
    standard_time = timeit(standard, number=trials)
    print("custom: {} standard: {}".format(custom_time, standard_time))
Notice the math I've done to decide on a number of trials, inversely related to the length of the test string. This should keep the total time used for testing in a reasonably narrow range - right? ;) (Wrong, of course, since we established that the standard algorithm is at least O(N^2).)
Let's see how it does (*drumroll*):
>>> imp.reload(burroughs_wheeler)
<module 'burroughs_wheeler' from 'burroughs_wheeler.py'>
>>> burroughs_wheeler.test(100)
custom: 4.7095093091438684 standard: 0.9819262643716229
>>> burroughs_wheeler.test(1000)
custom: 5.532266880287807 standard: 2.1733253807396977
>>> burroughs_wheeler.test(10000)
custom: 5.954826800612864 standard: 42.50686064849015
Whoa, that's a bit of a frightening jump. Anyway, as you can see, the new approach adds a ton of overhead on short strings, but enables the actual sorting to be the bottleneck instead of string slicing. :)
Just adding a bit to @KarlKnechtel's spot-on response.
First, the "standard way" to speed cyclic-permutation extraction is just to paste two copies together and index directly into that. After:
N = len(text)
text2 = text * 2
then the cyclic permutation starting at index i is just text2[i: i+N], and character j in that permutation is just text2[i+j]. No need for pasting together two slices, or for modulus (%) operations.
Second, the builtin sort() can be used for this, although:
It's funky ;-)
For strings with few distinct characters (compared to the length of the string) Karl's radix sort will almost certainly be faster.
As proof-of-concept, here's a drop-in replacement for that part of Karl's code (although this sticks to Python 2):
def burroughs_wheeler_custom(text):
    N = len(text)
    text2 = text * 2

    class K:
        def __init__(self, i):
            self.i = i

        def __lt__(a, b):
            i, j = a.i, b.i
            for k in xrange(N):  # use `range()` in Python 3
                if text2[i + k] < text2[j + k]:
                    return True
                elif text2[i + k] > text2[j + k]:
                    return False
            return False  # they're equal

    inorder = sorted(range(N), key=K)
    return "".join(text2[i + N - 1] for i in inorder)
Note that the builtin sort()'s implementation computes the key exactly once for each element in its input, and does save those results for the duration of the sort. In this case, the results are lazy little K instances that just remember the starting index, and whose __lt__ method compares one character pair at a time until "less than!" or "greater than!" is resolved.
I agree with the previous answer: string/list slicing in Python becomes a bottleneck when performing huge algorithmic computations. The idea is to avoid slicing.
[EDIT: not only slicing, but also list indexing. If you use array.array instead of lists, the execution time is halved. Indexing arrays is straightforward; indexing lists is a more complicated process.]
Here is a more functional solution to your problem.
The idea is to have a generator that will act as a slicer (rslice). It's a similar idea to itertools.islice, but it wraps around to the beginning of the string when it reaches the end, and it stops before reaching the start position you specified when creating it. With this trick you are not copying any substrings in memory, so in the end you only have pointers moving over your string without creating copies everywhere.
So we create a list containing pairs of [rslice, last char of the slice]
and sort it using the rslice as the key (see the cf comparison function).
When it's sorted, you only need to collect, for each element in the list, the second element (the last character of the slice, stored previously).
from itertools import izip

def cf(i1, i2):
    # Grab the first element of each pair (a lambda) and call it to get the generator
    for i, j in izip(i1[0](), i2[0]()):
        if i < j: return -1
        elif i > j: return 1
    return 0

def rslice(cad, pos):  # Slice that rotates through the string (a generator)
    pini = pos
    lc = len(cad)
    while pos < lc:
        yield cad[pos]
        pos += 1
    pos = 0
    while pos < pini - 1:
        yield cad[pos]
        pos += 1

def lambdagen(start, cad):  # Closure to hold a generator
    return lambda: rslice(cad, start)

def bwt(txt):
    lt = len(txt)
    arry = list(txt) + [None]
    # What we keep in the list is the generator for the rotating slice, plus the
    # last character of the slice, so we save the time of going through the whole
    # string to get the last character
    l = [(lambdagen(0, arry), None)] + [(lambdagen(i, arry), arry[i - 1]) for i in range(1, lt + 1)]
    l.sort(cmp=cf)  # We sort using our cf function
    return [i[1] for i in l]

print bwt('Text I want to apply BTW to :D')
# ['D', 'o', 'y', 't', 'o', 'W', 't', 'I', ' ', ' ', ':', ' ', 'B', None, 'T', 'w', ' ',
#  'T', 'p', 'a', 't', 't', 'p', 'a', 'x', 'n', ' ', ' ', ' ', 'e', 'l']
EDIT: Using arrays (execution time halved):
import array

def bwt(txt):
    lt = len(txt)
    arry = array.array('h', [ord(i) for i in txt])
    arry.append(-1)
    l = [(lambdagen(0, arry), None)] + [(lambdagen(i, arry), arry[i - 1]) for i in range(1, lt + 1)]
    l.sort(cmp=cf)
    return [i[1] for i in l]

Python lists - code & algorithm

I need some help with Python, a programming language that is new to me.
So, lets say that I have this list:
list = [3, 1, 4, 9, 8, 2]
And I would like to sort it, but without using the built-in function "sort", otherwise where's all the fun and the studying in here? I want to code as simple and as basic as I can, even if it means to work a bit harder. Therefore, if you want to help me and to offer me some of ideas and code, please, try to keep them very "basic".
Anyway, back to my problem: In order to sort this list, I've decided to compare every number in the list to the last number. First, I'll check 3 against 2. If 3 is smaller than 2 (which is false), then do nothing.
Next, check if 1 is smaller than 2 (which is true); then swap this number's position with the first element.
On the next run, it will again check whether the number is smaller than the last number in the list. But this time, if the number is smaller, it will swap places with the second number (and on the third run with the third number, if it's smaller, of course),
and so on and so on.
In the end, the function will return the sorted list.
Hope you've understood it.
I want to use a recursive function to make the task a bit interesting, but still basic.
Therefore, I thought about this code:
def func(list):
    if not list:
        for i in range(len(list)):
            if list[-1] > lst[i]:
                # have no idea what to write here in order to change the locations
                i = i + 1
                # return func(lst[i+1:])?
    return list
2 questions:
1. How can I change the locations? Using pop/remove and then insert?
2. I don't know where to put the recursive part, and whether I've written it correctly (I think I haven't). The recursive part is the second "#" comment, the first "return".
What do you think? How can I improve this code? What's wrong?
Thanks a lot!
Oh man, sorting. That's one of the most popular problems in programming, with many, many solutions that differ a little in every language. Anyway, the most straightforward algorithm is, I guess, bubble sort. However, it's not very efficient, so it's mostly used for educational purposes. If you want to try something more efficient and common, go for quicksort. I believe it's the most popular sorting algorithm. In Python, however, the default algorithm is a bit different - read here. And like I've said, there are many, many more sorting algorithms around the web. A minimal bubble sort is sketched below for reference.
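(A minimal bubble sort sketch; O(n^2), educational only:)

def bubble_sort(items):
    n = len(items)
    for i in range(n):
        # After pass i, the last i elements are already in their final place.
        for j in range(n - 1 - i):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
    return items

print(bubble_sort([3, 1, 4, 9, 8, 2]))  # [1, 2, 3, 4, 8, 9]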
Now, to answer your specific questions: in Python, overwriting an item in a list is as simple as
list[-1] = list[i]
and swapping two items can be done with a temporary variable:
tmp = list[-1]
list[-1] = list[i]
list[i] = tmp
As to recursion - I don't think it's a good idea to use it, a simple while/for loop is better here.
Maybe you can try a quicksort this way:
def quicksort(array, up, down):
    # start sorting in your array from down to up:
    # is array[up] < array[down]? if yes, switch
    # do it until up <= down
    # then call quicksort recursively:
    #   with the array, middle, up
    #   with the array, down, middle
    # where middle is the value found when the first sort ended
You can check this link: Quicksort on Wikipedia.
It is nearly the same logic.
Hope it will help!
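To flesh that pseudocode out, here is one possible concrete version (a sketch using the common Lomuto partition scheme rather than the exact up/down switching described above, so treat the details as illustrative):

def quicksort(array, down, up):
    # Sorts array[down:up + 1] in place.
    if down >= up:
        return
    pivot = array[up]
    middle = down
    for i in range(down, up):
        if array[i] < pivot:
            array[middle], array[i] = array[i], array[middle]
            middle += 1
    array[middle], array[up] = array[up], array[middle]
    quicksort(array, down, middle - 1)  # left part, below the pivot
    quicksort(array, middle + 1, up)    # right part, above the pivot

nums = [3, 1, 4, 9, 8, 2]
quicksort(nums, 0, len(nums) - 1)
print(nums)  # [1, 2, 3, 4, 8, 9]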
The easiest way to swap the two list elements is by using “parallel assignment”:
list[-1], list[i] = list[i], list[-1]
It doesn't really make sense to use recursion for this algorithm. If you call func(lst[i+1:]), that makes a copy of those elements of the list, and the recursive call operates on the copy, and then the copy is discarded. You could make func take two arguments: the list and i+1.
But your code is still broken. The not list test is incorrect, and the i = i + 1 is incorrect. What you are describing sounds like a variation of selection sort where you're doing a bunch of extra swapping.
Here's how a selection sort normally works.
Find the smallest of all elements and swap it into index 0.
Find the smallest of all remaining elements (all indexes greater than 0) and swap it into index 1.
Find the smallest of all remaining elements (all indexes greater than 1) and swap it into index 2.
And so on.
To simplify, the algorithm is this: find the smallest of all remaining (unsorted) elements, and append it to the list of sorted elements. Repeat until there are no remaining unsorted elements.
We can write it in Python like this:
def func(elements):
    for firstUnsortedIndex in range(len(elements)):
        # elements[0:firstUnsortedIndex] are sorted
        # elements[firstUnsortedIndex:] are not sorted
        bestIndex = firstUnsortedIndex
        for candidateIndex in range(bestIndex + 1, len(elements)):
            if elements[candidateIndex] < elements[bestIndex]:
                bestIndex = candidateIndex
        # Now bestIndex is the index of the smallest unsorted element
        elements[firstUnsortedIndex], elements[bestIndex] = elements[bestIndex], elements[firstUnsortedIndex]
        # Now elements[0:firstUnsortedIndex+1] are sorted, so it's safe to increment firstUnsortedIndex
    # Now all elements are sorted.
Test:
>>> testList = [3, 1, 4, 9, 8, 2]
>>> func(testList)
>>> testList
[1, 2, 3, 4, 8, 9]
If you really want to structure this so that recursion makes sense, here's how. Find the smallest element of the list. Then call func recursively, passing all the remaining elements. (Thus each recursive call passes one less element, eventually passing zero elements.) Then prepend that smallest element onto the list returned by the recursive call. Here's the code:
def func(elements):
    if len(elements) == 0:
        return elements
    bestIndex = 0
    for candidateIndex in range(1, len(elements)):
        if elements[candidateIndex] < elements[bestIndex]:
            bestIndex = candidateIndex
    return [elements[bestIndex]] + func(elements[0:bestIndex] + elements[bestIndex + 1:])
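Note that, unlike the in-place version above, this one returns a new sorted list, so you use it like this:
>>> func([3, 1, 4, 9, 8, 2])
[1, 2, 3, 4, 8, 9]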
