Removing duplicates from a string in Python without using an additional buffer - python

I want to resolve this problem in Python:
given a string (without spaces), remove the duplicates without using an additional buffer.
I have the following code:
def removedup(st):
    temp = []
    for i in range(len(st)):
        if st[i] not in temp:
            temp.append(st[i])
    return temp
which returns a list without duplicates.
1- This code is O(n^2), right?
2- How can I do the same without using an additional buffer in Python (I mean, without using a list)? Maybe I can use a string instead of a list, but I am not sure whether that adds complexity. Also, strings in Python are immutable, so I can't index into one to change a character in place (like in C++ or Java).
What is the best way to resolve this in Python? I know that there are some questions here that look like duplicates, but my question is more Python related (solving this without an additional buffer).
Thank you!

1) Yes.
2) Well
return set(st)
..is by far the simplest way to uniquify a string (or any iterable). I don't know if you consider this an "additional buffer" or not. Some extra memory needs to be allocated for another object any way you do it, since strings are immutable as you say.
This of course does not preserve order, and if that's an issue there's always the super-obvious:
from collections import OrderedDict
return ''.join(OrderedDict.fromkeys(st))
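As a minimal sketch of the order-preserving idea above, assuming Python 3.7 or later (where a plain dict keeps insertion order), the same thing can be written without the import. The function name unique_chars is only illustrative:

```python
def unique_chars(text):
    # dict.fromkeys keeps the first occurrence of each character, in order
    return ''.join(dict.fromkeys(text))


print(unique_chars("abracadabra"))  # abrcd
```

This still allocates a dict and a new string, so it is not buffer-free; it trades that memory for O(n) time.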

0) You have to use at least one additional buffer: as you mentioned, Python strings are immutable, and the result has to be returned somehow. So internally at least one buffer is already used (even if you bind it to the same name).
You can, of course, use a string as the buffer, with string + string, string += string, or even string[:n-1] + string[n:], but because of immutability each of these internally creates a new object every time.
You can use some other, mutable, sequence instead of a string, and then it would work.
1) Yes, your code is O(N**2) in the worst case (all symbols are unique), because each "st[i] not in temp" test is a linear scan of the list. It is O(N) in the best case (all symbols are the same symbol).
2) Assuming that you use a list instead of a string, you could do something like this:
def dup_remove(lst):
    i = 0
    n = len(lst)
    while i < n:
        if lst[i] in lst[:i]:
            del lst[i]
            n -= 1
        else:
            i += 1
    return lst
It's O(N**2) in the worst case, since both the membership test and del are linear, but it does not use any separately managed buffer, which is what you wanted in the first place. For practical purposes, though, I think the solution with OrderedDict is the better choice.

Another way to do it is with a loop over a slice copy of the list.
# O(n ^ 2)
for item in input_list[:]:
    index = input_list.index(item) + 1
    while index < len(input_list):
        if input_list[index] == item:
            del input_list[index]
        else:
            index += 1
Since slice creates a copy, if you truly want a solution without any internal buffers, this will do.
# O(n ^ 2)
i = 0
while i < len(input_list):
    j = i + 1
    while j < len(input_list):
        if input_list[j] == input_list[i]:
            del input_list[j]
            # Don't increment j here since the next item
            # after the deleted one will move to index j
        else:
            j += 1
    i += 1
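A variation on the loop above, offered only as a sketch: instead of calling del for every duplicate (each del shifts the tail of the list), keep a write index and compact the list once. It assumes the characters are already in a list, and the name dedupe_in_place is made up for illustration:

```python
def dedupe_in_place(chars):
    write = 0  # chars[:write] always holds the unique items found so far
    for read in range(len(chars)):
        ch = chars[read]
        # Scan the kept prefix by index, so no slice copy is created
        seen = False
        for k in range(write):
            if chars[k] == ch:
                seen = True
                break
        if not seen:
            chars[write] = ch
            write += 1
    del chars[write:]  # drop the leftover tail in one step
    return chars


letters = list("abracadabra")
dedupe_in_place(letters)
print(''.join(letters))  # abrcd
```

It is still O(n^2) comparisons in the worst case, but each item is moved at most once.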

1) I'm not sure.
2) A very efficient way is coded below. Note that I don't use any additional package. I don't even use lists, just a string!
def removeDuplicate(text):
    i = 0
    while i < len(text) - 1:
        j = i + 1
        while j < len(text):
            if text[j] == text[i]:
                # Strings are immutable, so build a new string without index j
                text = text[0:j] + text[j+1:]
                # Don't increment j here since the next item
                # after the deleted one will move to index j
            else:
                j += 1
        i += 1
    return text

Related

Am I doing this right? Removing items from python list - is there room for optimization?

I have 2 lists. I want to remove all items from the first list which contain strings from the second list. Right now I am using the classic two-loop approach: first I loop over a copy of the main list, and then for every item I check whether it contains any string from the 2nd list. If a string is found, I delete the item, and I can end the 2nd loop with break, since no more lookup is needed (we're going to remove this line anyway). This works just fine: as you can see, I am iterating over a copy of the list, so removing elements is not a problem.
Here is the code:
intRemoved = 0
sublen = len(mylist) + 1
halflen = sublen / 2
for i, line in enumerate(mylist[:], 1):
    for item in REM:
        if item.encode('utf8').upper() in line.text.encode('utf8').upper():
            if i < halflen:
                linepos = i
            else:
                linepos = (sublen - i) * -1
            mylist.remove(line)
            intRemoved += 1
            break
I also need to know how many lines I removed (intRemoved) and each removed line's position in the list (counted from the beginning or from the end of the list, which is why it is split in half). Positive numbers indicate the removed line's position from the beginning of the file, negative numbers its position from the end.
Ahh, yes, and I am ignoring the case. That's why there is .upper().
Now, since I am in no way pro, I just need to know if I am doing it right, performance-wise? Am I doing something that's bad for performance? Is there a way to optimize this?
Thanx,
D.
As you have been told, you should probably ask this on Code Review. Anyhow, I am pretty sure using sets and the intersection operation is going to be much faster.
Take a look here: https://docs.python.org/2/library/stdtypes.html#set
Calling encode is unnecessary. Calling upper on every iteration is not ideal. Copying the list for iterating is expensive. Removing is even more expensive, since one has to search for the element and then shift the elements after it. Counting intRemoved by hand is not the best way, either.
sublen = len(mylist) + 1
halflen = sublen / 2
filtered_list = []
# Upper-case the search strings once, outside the loop
rem_upper = [item.upper() for item in REM]
for i, line in enumerate(mylist, 1):
    text = line.text.upper()
    if any(item in text for item in rem_upper):
        if i < halflen:
            linepos = i
        else:
            linepos = (sublen - i) * -1
    else:
        filtered_list.append(line)
intRemoved = len(mylist) - len(filtered_list)
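To try the filtering approach above end to end, here is a small self-contained sketch for Python 3. The Line type, the sample data, and the function name filter_lines are all hypothetical stand-ins for the objects in the question, and the positions are collected in a list instead of being overwritten in linepos:

```python
from collections import namedtuple

# Hypothetical stand-in for the question's line objects (anything with .text)
Line = namedtuple("Line", "text")


def filter_lines(lines, banned):
    """Return (kept_lines, removed_positions), ignoring case."""
    banned_upper = [word.upper() for word in banned]
    sublen = len(lines) + 1
    halflen = sublen // 2  # integer division, like "/" in Python 2
    kept, removed_positions = [], []
    for i, line in enumerate(lines, 1):
        text = line.text.upper()
        if any(word in text for word in banned_upper):
            # Positive: counted from the start; negative: counted from the end
            removed_positions.append(i if i < halflen else -(sublen - i))
        else:
            kept.append(line)
    return kept, removed_positions


lines = [Line("Hello world"), Line("spam and eggs"),
         Line("nothing"), Line("More SPAM")]
kept, removed = filter_lines(lines, ["spam"])
print(removed)  # [-3, -1]
```

The number of removed lines is then just len(removed).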

Sorting for index values using a binary search function in python

I am being tasked with designing a Python function that returns the index of a given item inside a given list. It is called binary_sort(l, item), where l is a list (unsorted or sorted) and item is the item whose index you're looking for.
Here's what I have so far, but it can only handle sorted lists:
def binary_search(l, item, issorted=False):
    templist = list(l)
    templist.sort()
    if l == templist:
        issorted = True
    i = 0
    j = len(l)-1
    if item in l:
        while i != j + 1:
            m = (i + j)//2
            if l[m] < item:
                i = m + 1
            else:
                j = m - 1
        if 0 <= i < len(l) and l[i] == item:
            return(i)
    else:
        return(None)
How can I modify this so it will return the index of a value in an unsorted list if it is given an unsorted list and a value as parameters?
Binary search (you probably misnamed it: the algorithm above is not called "binary sort") requires ordered sequences to work.
It simply can't work on an unordered sequence, since it is the ordering that allows it to throw away at least half of the items in each search step.
On the other hand, since you are allowed to use the list.sort method, that seems to be the way to go: calling l.sort() will sort your target list before starting the search operations, and the algorithm will work. Keep in mind that the index you get back is then the item's position in the sorted list, not in the original order.
As a side note, avoid naming anything just l in a program. It may be a nice name for a list for someone with a background in mathematics who is used to doing things on paper, but on screen l is hard to distinguish from 1 and makes for poor source code reading. Good names in this case could be sequence, lst, or data. (list should be avoided as well, since it would shadow the Python built-in with the same name.)
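If the index in the original, unsorted list is what you need, one possible sketch is to sort the indices by value and binary-search those with the standard bisect module. The function name find_index is an assumption for illustration, not something from the question:

```python
from bisect import bisect_left


def find_index(data, item):
    """Return the index of item in the (possibly unsorted) list, or None."""
    # Indices of data, ordered by the values they point at (stable sort)
    order = sorted(range(len(data)), key=data.__getitem__)
    keys = [data[k] for k in order]
    pos = bisect_left(keys, item)  # binary search on the sorted values
    if pos < len(keys) and keys[pos] == item:
        return order[pos]          # map back to the original position
    return None


data = [7, 3, 9, 1, 5]
print(find_index(data, 9))  # 2
print(find_index(data, 4))  # None
```

Note that the sort costs O(n log n) and extra memory, so for a single lookup a plain linear scan such as data.index(item) is cheaper; this only pays off when the sorted order is reused for many searches.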

trouble with bubble in python

I'm writing in python and I thought I might try to use recursion to create a bubble sort. My idea is that, since the rightmost element is always sorted after every iteration (list[-1]), I add that element to another call of bubblesort for the rest of my elements (bubbleSort(list[:-1])). Here is my code:
def bubbleSort(list):
    sorted = True
    i = 0
    if len(list) <= 1:
        return list
    while i < len(list) - 1:
        if list[i] > list[i+1]:
            temp = list[i+1]
            list[i+1] = list[i]
            list[i] = temp
            sorted = False
        i = i + 1
    if sorted:
        return list
    else:
        endElement = list[-1]
        return bubbleSort(list[:-1]) + [endElement]
However, it only ever returns the first iteration of the sort, despite it running through every iteration (I used print inside of the code to see if it was running). The recursion is necessary: I know how to do it without it. It's just the recursion part that messes up anyways.
Your intuition is correct. In fact, your code works for me (once the method contents are indented): http://codepad.org/ILCH1k2z
Depending on your particular installation of Python, you may be running into issues due to your variable name. In Python, list is not a reserved word, but it is a built-in name (the constructor for lists), and your parameter shadows it inside the function. In general, it is not considered good form to reuse a built-in name as a variable name. Try renaming your variable and see if your code runs correctly.
Python programs are structured by indentation, not by braces as in C-like languages. I think this is the problem with your code.
Try to indent it like this:
#!/usr/bin/env python
def bubble(lst):
    sorted = True
    for i in range(len(lst) - 1):
        if lst[i] > lst[i + 1]:
            temp = lst[i]
            lst[i] = lst[i + 1]
            lst[i + 1] = temp
            sorted = False
    if sorted:
        return lst
    else:
        return bubble(lst[:-1]) + [lst[-1]]
Also, you shouldn't use built-in names like list for variable names.
The test for whether the list has 1 or fewer elements is also unnecessary, because in that case the loop body never runs.
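One thing the versions above share is that they swap elements in the caller's list before slicing it. A sketch that works on a copy instead, so the argument is left untouched, could look like this (Python 3; the names bubble_sort, items, and swapped are my own):

```python
def bubble_sort(items):
    items = list(items)  # copy, so the caller's list is not modified
    swapped = False
    for i in range(len(items) - 1):
        if items[i] > items[i + 1]:
            # Tuple assignment swaps without a temp variable
            items[i], items[i + 1] = items[i + 1], items[i]
            swapped = True
    if not swapped:
        return items
    # After one pass the largest item is last; recurse on the rest
    return bubble_sort(items[:-1]) + [items[-1]]


print(bubble_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```

The recursion goes one level deep per pass, so very long lists can hit Python's recursion limit.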

for-in loop's upper limit changing in each loop

How can I update the upper limit of a loop in each iteration? In the following code, List is shortened in each iteration. However, the lenList in the for-in loop is not updated, even though I defined lenList as global. Any ideas how to solve this? (I'm using Python 2.sthg)
Thanks!
def similarity(List):
    import difflib
    lenList = len(List)
    for i in range(1,lenList):
        import numpy as np
        global lenList
        a = List[i]
        idx = [difflib.SequenceMatcher(None, a, x).ratio() for x in List]
        z = idx > .9
        del List[z]
        lenList = len(List)
X = ['jim','jimmy','luke','john','jake','matt','steve','tj','pat','chad','don']
similarity(X)
Looping over indices is bad practice in python. You may be able to accomplish what you want like this though (edited for comments):
def similarity(alist):
    position = 0
    while position < len(alist):
        item = alist[position]
        position += 1
        # code here that modifies alist
A list will evaluate True if it has any entries, or False when it is empty. In this way you can consume a list that may grow during the manipulation of its items.
Additionally, if you absolutely have to have indices, you can get those as well:
for idx, item in enumerate(alist):
    # code here, where item is the actual list entry, and
    # idx is the 0-based index of the item in the list.
    pass
In Python 2.6 and later you can even pass an optional start parameter to enumerate to control the starting value of idx.
The issue here is that range() is only evaluated once, at the start of the loop, and produces a range object (or a list in 2.x) at that time. You can't then change the range. Not to mention that numbers are immutable, so you are assigning a new value to lenList, but that doesn't affect any earlier uses of it.
The best solution is to change the way your algorithm works not to rely on this behaviour.
The range is an object which is constructed before the first iteration of your loop, so you are iterating over the values in that object. You would instead need to use a while loop, although as Lattyware and g.d.d.c point out, it would not be very Pythonic.
What you are effectively looping over in the above code is a list which was generated once, before the first iteration.
You could just as well have written the above as
li = range(1,lenList)
for i in li:
    ... your code ...
Changing lenList after li has been created has no effect on li.
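A quick way to see the difference for yourself, as a small Python 3 sketch (the function names are made up for the demonstration): the for loop ignores the rebound limit, while the while loop re-checks it on every pass.

```python
def visited_by_for(limit, new_limit):
    visited = []
    for i in range(limit):      # range(limit) is built once, here
        visited.append(i)
        limit = new_limit       # rebinding the name does not touch the range
    return visited


def visited_by_while(limit, new_limit):
    visited = []
    i = 0
    while i < limit:            # the condition is re-evaluated every pass
        visited.append(i)
        limit = new_limit
        i += 1
    return visited


print(visited_by_for(10, 3))    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(visited_by_while(10, 3))  # [0, 1, 2]
```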
This problem will become quite a lot easier with one small modification to how your function works: instead of removing similar items from the existing list, create and return a new one with those items omitted.
For the specific case of just removing similarities to the first item, this simplifies down quite a bit, and removes the need to involve Numpy's fancy indexing (which you weren't actually using anyway, because of a missing call to np.array):
import difflib
def similarity(lst):
    a = lst[0]
    # Keep only the items that are NOT similar to a
    return [a] + \
        [x for x in lst[1:] if difflib.SequenceMatcher(None, a, x).ratio() <= .9]
From this basis, repeating it for every item in the list can be done recursively - you need to pass the list comprehension at the end back into similarity, and deal with receiving an empty list:
def similarity(lst):
    if not lst:
        return []
    a = lst[0]
    return [a] + similarity(
        [x for x in lst[1:] if difflib.SequenceMatcher(None, a, x).ratio() <= .9])
Also note that importing inside a function, and naming a variable list (shadowing the built-in list) are both practices worth avoiding, since they can make your code harder to follow.
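If the recursion feels heavy, the same idea can be sketched iteratively: keep a word only if it is not too similar to any word already kept. The name drop_similar and the threshold parameter are my own additions; the 0.9 default is the cut-off from the question.

```python
import difflib


def drop_similar(words, threshold=0.9):
    kept = []
    for word in words:
        # Compare against every word kept so far; any close match rejects it
        if all(difflib.SequenceMatcher(None, k, word).ratio() <= threshold
               for k in kept):
            kept.append(word)
    return kept


# 'jim' vs 'jimmy' has ratio 2*3/(3+5) = 0.75, so it survives the 0.9 cut-off
print(drop_similar(['jim', 'jimmy', 'luke']))                 # ['jim', 'jimmy', 'luke']
print(drop_similar(['jim', 'jimmy', 'luke'], threshold=0.7))  # ['jim', 'luke']
```

It builds a new list rather than deleting from the input, so the loop bounds never change underneath you.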

Remove items from a list while iterating without using extra memory in Python

My problem is simple: I have a long list of elements that I want to iterate through and check every element against a condition. Depending on the outcome of the condition I would like to delete the current element of the list, and continue iterating over it as usual.
I have read a few other threads on this matter. Two solutions seem to be proposed: either make a dictionary out of the list (which implies making a copy of all the data, which in my case already fills all the RAM), or walk the list in reverse (which breaks the concept of the algorithm I want to implement).
Is there any better or more elegant way than this to do it?
def walk_list(list_of_g):
    g_index = 0
    while g_index < len(list_of_g):
        g_current = list_of_g[g_index]
        if subtle_condition(g_current):
            list_of_g.pop(g_index)
        else:
            g_index = g_index + 1
li = [ x for x in li if condition(x)]
and also
li = filter(condition,li)
Thanks to Dave Kirby
Here is an alternative answer for if you absolutely have to remove the items from the original list, and you do not have enough memory to make a copy - move the items down the list yourself:
def walk_list(list_of_g):
    to_idx = 0
    for g_current in list_of_g:
        if not subtle_condition(g_current):
            list_of_g[to_idx] = g_current
            to_idx += 1
    del list_of_g[to_idx:]
This will move each item (actually a pointer to each item) exactly once, so will be O(N). The del statement at the end of the function will remove any unwanted items at the end of the list, and I think Python is intelligent enough to resize the list without allocating memory for a new copy of the list.
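Here is the same compaction as a self-contained sketch you can run, with the condition passed in as a parameter. The names remove_in_place and should_remove are hypothetical, chosen for the example:

```python
def remove_in_place(items, should_remove):
    write = 0
    for item in items:
        if not should_remove(item):
            # Safe while iterating: write never gets ahead of the read position
            items[write] = item
            write += 1
    del items[write:]  # chop off the stale tail in one operation


numbers = list(range(10))
remove_in_place(numbers, lambda n: n % 3 == 0)
print(numbers)  # [1, 2, 4, 5, 7, 8]
```

The list object stays the same, so any other reference to it sees the filtered contents.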
Removing items from a list is expensive, since Python has to copy all the items above g_index down one place. If the number of items you want to remove is proportional to the length of the list N, then your algorithm is going to be O(N**2). If the list is long enough to fill your RAM then you will be waiting a very long time for it to complete.
It is more efficient to create a filtered copy of the list, either using a list comprehension as Marcelo showed, or use the filter or itertools.ifilter functions:
g_list = filter(not_subtle_condition, g_list)
If you do not need to use the new list and only want to iterate over it once, then it is better to use ifilter since that will not create a second list:
for g_current in itertools.ifilter(not_subtle_condition, g_list):
    # do stuff with g_current
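For readers on Python 3, where itertools.ifilter no longer exists, a rough equivalent is sketched below: the built-in filter is already lazy there, and itertools.filterfalse yields the items for which the test is false. The predicate is_multiple_of_three is just a stand-in for subtle_condition:

```python
from itertools import filterfalse


def is_multiple_of_three(n):
    # Stand-in for the question's subtle_condition
    return n % 3 == 0


numbers = list(range(10))

# Lazy: no second list is built, items are produced one at a time
survivors = []
for n in filterfalse(is_multiple_of_three, numbers):
    survivors.append(n)  # "do stuff" with each item that is kept

print(survivors)  # [1, 2, 4, 5, 7, 8]
```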
The built-in filter function is made just to do this:
list_of_g = filter(lambda x: not subtle_condition(x), list_of_g)
How about this?
[x for x in list_of_g if not subtle_condition(x)]
It returns a new list that leaves out the items matching subtle_condition.
For simplicity, use a list comprehension:
def walk_list(list_of_g):
    return [g for g in list_of_g if not subtle_condition(g)]
Of course, this doesn't alter the original list, so the calling code would have to be different.
If you really want to mutate the list (rarely the best choice), walking backwards is simpler:
def walk_list(list_of_g):
    # Start at the last valid index, len - 1, and walk down to 0
    for i in xrange(len(list_of_g) - 1, -1, -1):
        if subtle_condition(list_of_g[i]):
            del list_of_g[i]
Sounds like a really good use case for the filter function.
def should_be_removed(element):
    return element > 5
a = range(10)
# filter keeps the items for which the function is true, so negate the test
a = filter(lambda element: not should_be_removed(element), a)
This, however, does not delete items from the list while iterating (nor do I recommend doing that). If, for memory-space or other performance reasons, you really need to, you can do the following:
i = 0
while i < len(a):
    if should_be_removed(a[i]):
        del a[i]
    else:
        i+=1
print a
If you perform a reverse iteration, you can remove elements on the fly without affecting the next indices you'll visit:
numbers = range(20)
# remove all numbers that are multiples of 3
l = len(numbers)
for i, n in enumerate(reversed(numbers)):
    if n % 3 == 0:
        del numbers[l - i - 1]
print numbers
The enumerate(reversed(numbers)) is just a stylistic choice. You may use a range if that's more legible to you:
l = len(numbers)
for i in range(l-1, -1, -1):
    n = numbers[i]
    if n % 3 == 0:
        del numbers[i]
If you need to travel the list in order, you can reverse it in place with .reverse() before and after the reversed iteration. This won't duplicate your list either.
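The snippets above are Python 2. A Python 3 version of the reverse-index loop, wrapped in a function purely for illustration (the name remove_multiples_of_three is mine), might look like this:

```python
def remove_multiples_of_three(numbers):
    # Walk indices from the end so deletions never shift unvisited items
    for i in range(len(numbers) - 1, -1, -1):
        if numbers[i] % 3 == 0:
            del numbers[i]


numbers = list(range(20))  # range() alone is not a list in Python 3
remove_multiples_of_three(numbers)
print(numbers)
```

Each del still shifts the items after it, so this saves memory rather than time.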
