Related
How can I check if a list has any duplicates and return a new list without duplicates?
The common approach to get a unique collection of items is to use a set. Sets are unordered collections of distinct objects. To create a set from any iterable, you can simply pass it to the built-in set() function. If you later need a real list again, you can similarly pass the set to the list() function.
The following example should cover whatever you are trying to do:
>>> t = [1, 2, 3, 1, 2, 3, 5, 6, 7, 8]
>>> list(set(t))
[1, 2, 3, 5, 6, 7, 8]
>>> s = [1, 2, 3]
>>> list(set(t) - set(s))
[8, 5, 6, 7]
As you can see from the example result, the original order is not maintained. As mentioned above, sets themselves are unordered collections, so the order is lost. When converting a set back to a list, an arbitrary order is created.
Maintaining order
If order is important to you, then you will have to use a different mechanism. A very common solution for this is to rely on OrderedDict to keep the order of keys during insertion:
>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]
Starting with Python 3.7, the built-in dictionary is guaranteed to maintain the insertion order as well, so you can also use that directly if you are on Python 3.7 or later (or CPython 3.6):
>>> list(dict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]
Note that this may have some overhead of creating a dictionary first, and then creating a list from it. If you don’t actually need to preserve the order, you’re often better off using a set, especially because it gives you a lot more operations to work with. Check out this question for more details and alternative ways to preserve the order when removing duplicates.
Finally note that both the set as well as the OrderedDict/dict solutions require your items to be hashable. This usually means that they have to be immutable. If you have to deal with items that are not hashable (e.g. list objects), then you will have to use a slow approach in which you will basically have to compare every item with every other item in a nested loop.
In Python 2.7, the new way of removing duplicates from an iterable while keeping it in the original order is:
>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
In Python 3.5, the OrderedDict has a C implementation. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5.
In Python 3.6, the regular dict became both ordered and compact. (This feature is holds for CPython and PyPy but may not present in other implementations). That gives us a new fastest way of deduping while retaining order:
>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
In Python 3.7, the regular dict is guaranteed to both ordered across all implementations. So, the shortest and fastest solution is:
>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
It's a one-liner: list(set(source_list)) will do the trick.
A set is something that can't possibly have duplicates.
Update: an order-preserving approach is two lines:
from collections import OrderedDict
OrderedDict((x, True) for x in source_list).keys()
Here we use the fact that OrderedDict remembers the insertion order of keys, and does not change it when a value at a particular key is updated. We insert True as values, but we could insert anything, values are just not used. (set works a lot like a dict with ignored values, too.)
>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> s = []
>>> for i in t:
if i not in s:
s.append(i)
>>> s
[1, 2, 3, 5, 6, 7, 8]
If you don't care about the order, just do this:
def remove_duplicates(l):
return list(set(l))
A set is guaranteed to not have duplicates.
To make a new list retaining the order of first elements of duplicates in L:
newlist = [ii for n,ii in enumerate(L) if ii not in L[:n]]
For example: if L = [1, 2, 2, 3, 4, 2, 4, 3, 5], then newlist will be [1, 2, 3, 4, 5]
This checks each new element has not appeared previously in the list before adding it.
Also it does not need imports.
There are also solutions using Pandas and Numpy. They both return numpy array so you have to use the function .tolist() if you want a list.
t=['a','a','b','b','b','c','c','c']
t2= ['c','c','b','b','b','a','a','a']
Pandas solution
Using Pandas function unique():
import pandas as pd
pd.unique(t).tolist()
>>>['a','b','c']
pd.unique(t2).tolist()
>>>['c','b','a']
Numpy solution
Using numpy function unique().
import numpy as np
np.unique(t).tolist()
>>>['a','b','c']
np.unique(t2).tolist()
>>>['a','b','c']
Note that numpy.unique() also sort the values. So the list t2 is returned sorted. If you want to have the order preserved use as in this answer:
_, idx = np.unique(t2, return_index=True)
t2[np.sort(idx)].tolist()
>>>['c','b','a']
The solution is not so elegant compared to the others, however, compared to pandas.unique(), numpy.unique() allows you also to check if nested arrays are unique along one selected axis.
In this answer, there will be two sections: Two unique solutions, and a graph of speed for specific solutions.
Removing Duplicate Items
Most of these answers only remove duplicate items which are hashable, but this question doesn't imply it doesn't just need hashable items, meaning I'll offer some solutions which don't require hashable items.
collections.Counter is a powerful tool in the standard library which could be perfect for this. There's only one other solution which even has Counter in it. However, that solution is also limited to hashable keys.
To allow unhashable keys in Counter, I made a Container class, which will try to get the object's default hash function, but if it fails, it will try its identity function. It also defines an eq and a hash method. This should be enough to allow unhashable items in our solution. Unhashable objects will be treated as if they are hashable. However, this hash function uses identity for unhashable objects, meaning two equal objects that are both unhashable won't work. I suggest you override this, and changing it to use the hash of an equivalent mutable type (like using hash(tuple(my_list)) if my_list is a list).
I also made two solutions. Another solution which keeps the order of the items, using a subclass of both OrderedDict and Counter which is named 'OrderedCounter'. Now, here are the functions:
from collections import OrderedDict, Counter
class Container:
def __init__(self, obj):
self.obj = obj
def __eq__(self, obj):
return self.obj == obj
def __hash__(self):
try:
return hash(self.obj)
except:
return id(self.obj)
class OrderedCounter(Counter, OrderedDict):
'Counter that remembers the order elements are first encountered'
def __repr__(self):
return '%s(%r)' % (self.__class__.__name__, OrderedDict(self))
def __reduce__(self):
return self.__class__, (OrderedDict(self),)
def remd(sequence):
cnt = Counter()
for x in sequence:
cnt[Container(x)] += 1
return [item.obj for item in cnt]
def oremd(sequence):
cnt = OrderedCounter()
for x in sequence:
cnt[Container(x)] += 1
return [item.obj for item in cnt]
remd is non-ordered sorting, while oremd is ordered sorting. You can clearly tell which one is faster, but I'll explain anyways. The non-ordered sorting is slightly faster, since it doesn't store the order of the items.
Now, I also wanted to show the speed comparisons of each answer. So, I'll do that now.
Which Function is the Fastest?
For removing duplicates, I gathered 10 functions from a few answers. I calculated the speed of each function and put it into a graph using matplotlib.pyplot.
I divided this into three rounds of graphing. A hashable is any object which can be hashed, an unhashable is any object which cannot be hashed. An ordered sequence is a sequence which preserves order, an unordered sequence does not preserve order. Now, here are a few more terms:
Unordered Hashable was for any method which removed duplicates, which didn't necessarily have to keep the order. It didn't have to work for unhashables, but it could.
Ordered Hashable was for any method which kept the order of the items in the list, but it didn't have to work for unhashables, but it could.
Ordered Unhashable was any method which kept the order of the items in the list, and worked for unhashables.
On the y-axis is the amount of seconds it took.
On the x-axis is the number the function was applied to.
I generated sequences for unordered hashables and ordered hashables with the following comprehension: [list(range(x)) + list(range(x)) for x in range(0, 1000, 10)]
For ordered unhashables: [[list(range(y)) + list(range(y)) for y in range(x)] for x in range(0, 1000, 10)]
Note there is a step in the range because without it, this would've taken 10x as long. Also because in my personal opinion, I thought it might've looked a little easier to read.
Also note the keys on the legend are what I tried to guess as the most vital parts of the implementation of the function. As for what function does the worst or best? The graph speaks for itself.
With that settled, here are the graphs.
Unordered Hashables
(Zoomed in)
Ordered Hashables
(Zoomed in)
Ordered Unhashables
(Zoomed in)
Very late answer.
If you don't care about the list order, you can use *arg expansion with set uniqueness to remove dupes, i.e.:
l = [*{*l}]
Python3 Demo
A colleague have sent the accepted answer as part of his code to me for a codereview today.
While I certainly admire the elegance of the answer in question, I am not happy with the performance.
I have tried this solution (I use set to reduce lookup time)
def ordered_set(in_list):
out_list = []
added = set()
for val in in_list:
if not val in added:
out_list.append(val)
added.add(val)
return out_list
To compare efficiency, I used a random sample of 100 integers - 62 were unique
from random import randint
x = [randint(0,100) for _ in xrange(100)]
In [131]: len(set(x))
Out[131]: 62
Here are the results of the measurements
In [129]: %timeit list(OrderedDict.fromkeys(x))
10000 loops, best of 3: 86.4 us per loop
In [130]: %timeit ordered_set(x)
100000 loops, best of 3: 15.1 us per loop
Well, what happens if set is removed from the solution?
def ordered_set(inlist):
out_list = []
for val in inlist:
if not val in out_list:
out_list.append(val)
return out_list
The result is not as bad as with the OrderedDict, but still more than 3 times of the original solution
In [136]: %timeit ordered_set(x)
10000 loops, best of 3: 52.6 us per loop
Another way of doing:
>>> seq = [1,2,3,'a', 'a', 1,2]
>> dict.fromkeys(seq).keys()
['a', 1, 2, 3]
Simple and easy:
myList = [1, 2, 3, 1, 2, 5, 6, 7, 8]
cleanlist = []
[cleanlist.append(x) for x in myList if x not in cleanlist]
Output:
>>> cleanlist
[1, 2, 3, 5, 6, 7, 8]
I had a dict in my list, so I could not use the above approach. I got the error:
TypeError: unhashable type:
So if you care about order and/or some items are unhashable. Then you might find this useful:
def make_unique(original_list):
unique_list = []
[unique_list.append(obj) for obj in original_list if obj not in unique_list]
return unique_list
Some may consider list comprehension with a side effect to not be a good solution. Here's an alternative:
def make_unique(original_list):
unique_list = []
map(lambda x: unique_list.append(x) if (x not in unique_list) else False, original_list)
return unique_list
All the order-preserving approaches I've seen here so far either use naive comparison (with O(n^2) time-complexity at best) or heavy-weight OrderedDicts/set+list combinations that are limited to hashable inputs. Here is a hash-independent O(nlogn) solution:
Update added the key argument, documentation and Python 3 compatibility.
# from functools import reduce <-- add this import on Python 3
def uniq(iterable, key=lambda x: x):
"""
Remove duplicates from an iterable. Preserves order.
:type iterable: Iterable[Ord => A]
:param iterable: an iterable of objects of any orderable type
:type key: Callable[A] -> (Ord => B)
:param key: optional argument; by default an item (A) is discarded
if another item (B), such that A == B, has already been encountered and taken.
If you provide a key, this condition changes to key(A) == key(B); the callable
must return orderable objects.
"""
# Enumerate the list to restore order lately; reduce the sorted list; restore order
def append_unique(acc, item):
return acc if key(acc[-1][1]) == key(item[1]) else acc.append(item) or acc
srt_enum = sorted(enumerate(iterable), key=lambda item: key(item[1]))
return [item[1] for item in sorted(reduce(append_unique, srt_enum, [srt_enum[0]]))]
If you want to preserve the order, and not use any external modules here is an easy way to do this:
>>> t = [1, 9, 2, 3, 4, 5, 3, 6, 7, 5, 8, 9]
>>> list(dict.fromkeys(t))
[1, 9, 2, 3, 4, 5, 6, 7, 8]
Note: This method preserves the order of appearance, so, as seen above, nine will come after one because it was the first time it appeared. This however, is the same result as you would get with doing
from collections import OrderedDict
ulist=list(OrderedDict.fromkeys(l))
but it is much shorter, and runs faster.
This works because each time the fromkeys function tries to create a new key, if the value already exists it will simply overwrite it. This wont affect the dictionary at all however, as fromkeys creates a dictionary where all keys have the value None, so effectively it eliminates all duplicates this way.
I've compared the various suggestions with perfplot. It turns out that, if the input array doesn't have duplicate elements, all methods are more or less equally fast, independently of whether the input data is a Python list or a NumPy array.
If the input array is large, but contains just one unique element, then the set, dict and np.unique methods are costant-time if the input data is a list. If it's a NumPy array, np.unique is about 10 times faster than the other alternatives.
It's somewhat surprising to me that those are not constant-time operations, too.
Code to reproduce the plots:
import perfplot
import numpy as np
import matplotlib.pyplot as plt
def setup_list(n):
# return list(np.random.permutation(np.arange(n)))
return [0] * n
def setup_np_array(n):
# return np.random.permutation(np.arange(n))
return np.zeros(n, dtype=int)
def list_set(data):
return list(set(data))
def numpy_unique(data):
return np.unique(data)
def list_dict(data):
return list(dict.fromkeys(data))
b = perfplot.bench(
setup=[
setup_list,
setup_list,
setup_list,
setup_np_array,
setup_np_array,
setup_np_array,
],
kernels=[list_set, numpy_unique, list_dict, list_set, numpy_unique, list_dict],
labels=[
"list(set(lst))",
"np.unique(lst)",
"list(dict(lst))",
"list(set(arr))",
"np.unique(arr)",
"list(dict(arr))",
],
n_range=[2 ** k for k in range(23)],
xlabel="len(array)",
equality_check=None,
)
# plt.title("input array = [0, 1, 2,..., n]")
plt.title("input array = [0, 0,..., 0]")
b.save("out.png")
b.show()
You could also do this:
>>> t = [1, 2, 3, 3, 2, 4, 5, 6]
>>> s = [x for i, x in enumerate(t) if i == t.index(x)]
>>> s
[1, 2, 3, 4, 5, 6]
The reason that above works is that index method returns only the first index of an element. Duplicate elements have higher indices. Refer to here:
list.index(x[, start[, end]])
Return zero-based index in the list of
the first item whose value is x. Raises a ValueError if there is no
such item.
Best approach of removing duplicates from a list is using set() function, available in python, again converting that set into list
In [2]: some_list = ['a','a','v','v','v','c','c','d']
In [3]: list(set(some_list))
Out[3]: ['a', 'c', 'd', 'v']
You can use set to remove duplicates:
mylist = list(set(mylist))
But note the results will be unordered. If that's an issue:
mylist.sort()
Try using sets:
import sets
t = sets.Set(['a', 'b', 'c', 'd'])
t1 = sets.Set(['a', 'b', 'c'])
print t | t1
print t - t1
One more better approach could be,
import pandas as pd
myList = [1, 2, 3, 1, 2, 5, 6, 7, 8]
cleanList = pd.Series(myList).drop_duplicates().tolist()
print(cleanList)
#> [1, 2, 3, 5, 6, 7, 8]
and the order remains preserved.
This one cares about the order without too much hassle (OrderdDict & others). Probably not the most Pythonic way, nor shortest way, but does the trick:
def remove_duplicates(item_list):
''' Removes duplicate items from a list '''
singles_list = []
for element in item_list:
if element not in singles_list:
singles_list.append(element)
return singles_list
Reduce variant with ordering preserve:
Assume that we have list:
l = [5, 6, 6, 1, 1, 2, 2, 3, 4]
Reduce variant (unefficient):
>>> reduce(lambda r, v: v in r and r or r + [v], l, [])
[5, 6, 1, 2, 3, 4]
5 x faster but more sophisticated
>>> reduce(lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r, l, ([], set()))[0]
[5, 6, 1, 2, 3, 4]
Explanation:
default = (list(), set())
# user list to keep order
# use set to make lookup faster
def reducer(result, item):
if item not in result[1]:
result[0].append(item)
result[1].add(item)
return result
reduce(reducer, l, default)[0]
There are many other answers suggesting different ways to do this, but they're all batch operations, and some of them throw away the original order. That might be okay depending on what you need, but if you want to iterate over the values in the order of the first instance of each value, and you want to remove the duplicates on-the-fly versus all at once, you could use this generator:
def uniqify(iterable):
seen = set()
for item in iterable:
if item not in seen:
seen.add(item)
yield item
This returns a generator/iterator, so you can use it anywhere that you can use an iterator.
for unique_item in uniqify([1, 2, 3, 4, 3, 2, 4, 5, 6, 7, 6, 8, 8]):
print(unique_item, end=' ')
print()
Output:
1 2 3 4 5 6 7 8
If you do want a list, you can do this:
unique_list = list(uniqify([1, 2, 3, 4, 3, 2, 4, 5, 6, 7, 6, 8, 8]))
print(unique_list)
Output:
[1, 2, 3, 4, 5, 6, 7, 8]
You can use the following function:
def rem_dupes(dup_list):
yooneeks = []
for elem in dup_list:
if elem not in yooneeks:
yooneeks.append(elem)
return yooneeks
Example:
my_list = ['this','is','a','list','with','dupicates','in', 'the', 'list']
Usage:
rem_dupes(my_list)
['this', 'is', 'a', 'list', 'with', 'dupicates', 'in', 'the']
Using set :
a = [0,1,2,3,4,3,3,4]
a = list(set(a))
print a
Using unique :
import numpy as np
a = [0,1,2,3,4,3,3,4]
a = np.unique(a).tolist()
print a
Without using set
data=[1, 2, 3, 1, 2, 5, 6, 7, 8]
uni_data=[]
for dat in data:
if dat not in uni_data:
uni_data.append(dat)
print(uni_data)
The Magic of Python Built-in type
In python, it is very easy to process the complicated cases like this and only by python's built-in type.
Let me show you how to do !
Method 1: General Case
The way (1 line code) to remove duplicated element in list and still keep sorting order
line = [1, 2, 3, 1, 2, 5, 6, 7, 8]
new_line = sorted(set(line), key=line.index) # remove duplicated element
print(new_line)
You will get the result
[1, 2, 3, 5, 6, 7, 8]
Method 2: Special Case
TypeError: unhashable type: 'list'
The special case to process unhashable (3 line codes)
line=[['16.4966155686595', '-27.59776154691', '52.3786295521147']
,['16.4966155686595', '-27.59776154691', '52.3786295521147']
,['17.6508629295574', '-27.143305738671', '47.534955022564']
,['17.6508629295574', '-27.143305738671', '47.534955022564']
,['18.8051102904552', '-26.688849930432', '42.6912804930134']
,['18.8051102904552', '-26.688849930432', '42.6912804930134']
,['19.5504702331098', '-26.205884452727', '37.7709192714727']
,['19.5504702331098', '-26.205884452727', '37.7709192714727']
,['20.2929416861422', '-25.722717575124', '32.8500163147157']
,['20.2929416861422', '-25.722717575124', '32.8500163147157']]
tuple_line = [tuple(pt) for pt in line] # convert list of list into list of tuple
tuple_new_line = sorted(set(tuple_line),key=tuple_line.index) # remove duplicated element
new_line = [list(t) for t in tuple_new_line] # convert list of tuple into list of list
print (new_line)
You will get the result :
[
['16.4966155686595', '-27.59776154691', '52.3786295521147'],
['17.6508629295574', '-27.143305738671', '47.534955022564'],
['18.8051102904552', '-26.688849930432', '42.6912804930134'],
['19.5504702331098', '-26.205884452727', '37.7709192714727'],
['20.2929416861422', '-25.722717575124', '32.8500163147157']
]
Because tuple is hashable and you can convert data between list and tuple easily
below code is simple for removing duplicate in list
def remove_duplicates(x):
a = []
for i in x:
if i not in a:
a.append(i)
return a
print remove_duplicates([1,2,2,3,3,4])
it returns [1,2,3,4]
Here's the fastest pythonic solution comaring to others listed in replies.
Using implementation details of short-circuit evaluation allows to use list comprehension, which is fast enough. visited.add(item) always returns None as a result, which is evaluated as False, so the right-side of or would always be the result of such an expression.
Time it yourself
def deduplicate(sequence):
visited = set()
adder = visited.add # get rid of qualification overhead
out = [adder(item) or item for item in sequence if item not in visited]
return out
Recently I noticed that when I am converting a list to set the order of elements is changed and is sorted by character.
Consider this example:
x=[1,2,20,6,210]
print(x)
# [1, 2, 20, 6, 210] # the order is same as initial order
set(x)
# set([1, 2, 20, 210, 6]) # in the set(x) output order is sorted
My questions are -
Why is this happening?
How can I do set operations (especially set difference) without losing the initial order?
A set is an unordered data structure, so it does not preserve the insertion order.
This depends on your requirements. If you have an normal list, and want to remove some set of elements while preserving the order of the list, you can do this with a list comprehension:
>>> a = [1, 2, 20, 6, 210]
>>> b = set([6, 20, 1])
>>> [x for x in a if x not in b]
[2, 210]
If you need a data structure that supports both fast membership tests and preservation of insertion order, you can use the keys of a Python dictionary, which starting from Python 3.7 is guaranteed to preserve the insertion order:
>>> a = dict.fromkeys([1, 2, 20, 6, 210])
>>> b = dict.fromkeys([6, 20, 1])
>>> dict.fromkeys(x for x in a if x not in b)
{2: None, 210: None}
b doesn't really need to be ordered here – you could use a set as well. Note that a.keys() - b.keys() returns the set difference as a set, so it won't preserve the insertion order.
In older versions of Python, you can use collections.OrderedDict instead:
>>> a = collections.OrderedDict.fromkeys([1, 2, 20, 6, 210])
>>> b = collections.OrderedDict.fromkeys([6, 20, 1])
>>> collections.OrderedDict.fromkeys(x for x in a if x not in b)
OrderedDict([(2, None), (210, None)])
In Python 3.6, set() now should keep the order, but there is another solution for Python 2 and 3:
>>> x = [1, 2, 20, 6, 210]
>>> sorted(set(x), key=x.index)
[1, 2, 20, 6, 210]
Remove duplicates and preserve order by below function
def unique(sequence):
seen = set()
return [x for x in sequence if not (x in seen or seen.add(x))]
How to remove duplicates from a list while preserving order in Python
Answering your first question, a set is a data structure optimized for set operations. Like a mathematical set, it does not enforce or maintain any particular order of the elements. The abstract concept of a set does not enforce order, so the implementation is not required to. When you create a set from a list, Python has the liberty to change the order of the elements for the needs of the internal implementation it uses for a set, which is able to perform set operations efficiently.
In mathematics, there are sets and ordered sets (osets).
set: an unordered container of unique elements (Implemented)
oset: an ordered container of unique elements (NotImplemented)
In Python, only sets are directly implemented. We can emulate osets with regular dict keys (3.7+).
Given
a = [1, 2, 20, 6, 210, 2, 1]
b = {2, 6}
Code
oset = dict.fromkeys(a).keys()
# dict_keys([1, 2, 20, 6, 210])
Demo
Replicates are removed, insertion-order is preserved.
list(oset)
# [1, 2, 20, 6, 210]
Set-like operations on dict keys.
oset - b
# {1, 20, 210}
oset | b
# {1, 2, 5, 6, 20, 210}
oset & b
# {2, 6}
oset ^ b
# {1, 5, 20, 210}
Details
Note: an unordered structure does not preclude ordered elements. Rather, maintained order is not guaranteed. Example:
assert {1, 2, 3} == {2, 3, 1} # sets (order is ignored)
assert [1, 2, 3] != [2, 3, 1] # lists (order is guaranteed)
One may be pleased to discover that a list and multiset (mset) are two more fascinating, mathematical data structures:
list: an ordered container of elements that permits replicates (Implemented)
mset: an unordered container of elements that permits replicates (NotImplemented)*
Summary
Container | Ordered | Unique | Implemented
----------|---------|--------|------------
set | n | y | y
oset | y | y | n
list | y | n | y
mset | n | n | n*
*A multiset can be indirectly emulated with collections.Counter(), a dict-like mapping of multiplicities (counts).
You can remove the duplicated values and keep the list order of insertion with one line of code, Python 3.8.2
mylist = ['b', 'b', 'a', 'd', 'd', 'c']
results = list({value:"" for value in mylist})
print(results)
>>> ['b', 'a', 'd', 'c']
results = list(dict.fromkeys(mylist))
print(results)
>>> ['b', 'a', 'd', 'c']
As denoted in other answers, sets are data structures (and mathematical concepts) that do not preserve the element order -
However, by using a combination of sets and dictionaries, it is possible that you can achieve wathever you want - try using these snippets:
# save the element order in a dict:
x_dict = dict(x,y for y, x in enumerate(my_list) )
x_set = set(my_list)
#perform desired set operations
...
#retrieve ordered list from the set:
new_list = [None] * len(new_set)
for element in new_set:
new_list[x_dict[element]] = element
Building on Sven's answer, I found using collections.OrderedDict like so helped me accomplish what you want plus allow me to add more items to the dict:
import collections
x=[1,2,20,6,210]
z=collections.OrderedDict.fromkeys(x)
z
OrderedDict([(1, None), (2, None), (20, None), (6, None), (210, None)])
If you want to add items but still treat it like a set you can just do:
z['nextitem']=None
And you can perform an operation like z.keys() on the dict and get the set:
list(z.keys())
[1, 2, 20, 6, 210]
One more simpler way can be two create a empty list ,let's say "unique_list" for adding the unique elements from the original list, for example:
unique_list=[]
for i in original_list:
if i not in unique_list:
unique_list.append(i)
else:
pass
This will give you all the unique elements as well as maintain the order.
Late to answer but you can use Pandas, pd.Series to convert list while preserving the order:
import pandas as pd
x = pd.Series([1, 2, 20, 6, 210, 2, 1])
print(pd.unique(x))
Output:
array([ 1, 2, 20, 6, 210])
Works for a list of strings
x = pd.Series(['c', 'k', 'q', 'n', 'p','c', 'n'])
print(pd.unique(x))
Output
['c' 'k' 'q' 'n' 'p']
An implementation of the highest score concept above that brings it back to a list:
def SetOfListInOrder(incominglist):
from collections import OrderedDict
outtemp = OrderedDict()
for item in incominglist:
outtemp[item] = None
return(list(outtemp))
Tested (briefly) on Python 3.6 and Python 2.7.
In case you have a small number of elements in your two initial lists on which you want to do set difference operation, instead of using collections.OrderedDict which complicates the implementation and makes it less readable, you can use:
# initial lists on which you want to do set difference
>>> nums = [1,2,2,3,3,4,4,5]
>>> evens = [2,4,4,6]
>>> evens_set = set(evens)
>>> result = []
>>> for n in nums:
... if not n in evens_set and not n in result:
... result.append(n)
...
>>> result
[1, 3, 5]
Its time complexity is not that good but it is neat and easy to read.
It's interesting that people always use 'real world problem' to make joke on the definition in theoretical science.
If set has order, you first need to figure out the following problems.
If your list has duplicate elements, what should the order be when you turn it into a set? What is the order if we union two sets? What is the order if we intersect two sets with different order on the same elements?
Plus, set is much faster in searching for a particular key which is very good in sets operation (and that's why you need a set, but not list).
If you really care about the index, just keep it as a list. If you still want to do set operation on the elements in many lists, the simplest way is creating a dictionary for each list with the same keys in the set along with a value of list containing all the index of the key in the original list.
def indx_dic(l):
dic = {}
for i in range(len(l)):
if l[i] in dic:
dic.get(l[i]).append(i)
else:
dic[l[i]] = [i]
return(dic)
a = [1,2,3,4,5,1,3,2]
set_a = set(a)
dic_a = indx_dic(a)
print(dic_a)
# {1: [0, 5], 2: [1, 7], 3: [2, 6], 4: [3], 5: [4]}
print(set_a)
# {1, 2, 3, 4, 5}
We can use collections.Counter for this:
# tested on python 3.7
>>> from collections import Counter
>>> lst = ["1", "2", "20", "6", "210"]
>>> for i in Counter(lst):
>>> print(i, end=" ")
1 2 20 6 210
>>> for i in set(lst):
>>> print(i, end=" ")
20 6 2 1 210
You can remove the duplicated values and keep the list order of insertion, if you want
lst = [1,2,1,3]
new_lst = []
for num in lst :
if num not in new_lst :
new_lst.append(num)
# new_lst = [1,2,3]
don't use 'sets' for removing duplicate if 'order' is something you want,
use sets for searching i.e.
x in list
takes O(n) time
where
x in set
takes O(1) time *most cases
Here's an easy way to do it:
x=[1,2,20,6,210]
print sorted(set(x))
I was reading sets in python http://www.python-course.eu/sets_frozensets.php and got confusion that whether the elements of sets in python must be mutable or immutable? Because in the definition section they said "A set contains an unordered collection of unique and immutable objects." If it is true than how can a set contain the list as list is mutable?
Can someone clarify my doubt?
>>> x = [x for x in range(0,10,2)]
>>> x
[0, 2, 4, 6, 8] #This is a list x
>>> my_set = set(x) #Here we are passing list x to create a set
>>> my_set
set([0, 8, 2, 4, 6]) #and here my_set is a set which contain the list.
>>>
When you pass the set() constructor built-in any iterable, it builds a set out of the elements provided in the iterable. So when you pass set() a list, it creates a set containing the objects within the list - not a set containing the list itself, which is not permissible as you expect because lists are mutable.
So what matters is that the objects inside your list are immutable, which is true in the case of your linked tutorial as you have a list of (immutable) strings.
>>> set(["Perl", "Python", "Java"])
set[('Java', 'Python', 'Perl')]
Note that this printing formatting doesn't mean your set contains a list, it is just how sets are represented when printed. For instance, we can create a set from a tuple and it will be printed the same way.
>>> set((1,2,3))
set([1, 2, 3])
In Python 2, sets are printed as set([comma-separated-elements]).
You seem to be confusing initialising a set with a list:
a = set([1, 2])
with adding a list to an existing set:
a = set()
a.add([1, 2])
the latter will throw an error, where the former initialises the set with the values from the list you provide as an argument. What is most likely the cause for the confusion is that when you print a from the first example it looks like:
set([1, 2])
again, but here [1, 2] is not a list, just the way a is represented:
a = set()
a.add(1)
a.add(2)
print(a)
gives:
set([1, 2])
without you ever specifying a list.
Why does Python have the built-in function reversed?
Why not just use x[::-1] instead of reversed(x)?
Edit: #TanveerAlam pointed out that reversed is not actually a function, but rather a class, despite being listed on the page Built-in Functions.
>>> a= [1,2,3,4,5,6,7,8,9,10]
>>> a[::-1]
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
>>> reversed(a)
<listreverseiterator object at 0x10dbf5390>
The first notation is generating the reverse eagerly; the second is giving you a reverse iterator, which is possibly cheaper to acquire, as it has potential to only generate elements as needed
reversed returns a reverse iterator.
[::-1] asks the object for a slice
Python objects try to return what you probably expect
>>> [1, 2, 3][::-1]
[3, 2, 1]
>>> "123"[::-1]
'321'
This is convenient - particularly for strings and tuples.
But remember the majority of code doesn't need to reverse strings.
The most important role of reversed() is making code easier to read and understand.
The fact that it returns an iterator without creating a new sequence is of secondary importance
From the docs
PEP 322: Reverse Iteration A new built-in function, reversed(seq)(),
takes a sequence and returns an iterator that loops over the elements
of the sequence in reverse order.
>>>
>>> for i in reversed(xrange(1,4)):
... print i
...
3
2
1
Compared to extended slicing, such as range(1,4)[::-1], reversed() is
easier to read, runs faster, and uses substantially less memory.
Note that reversed() only accepts sequences, not arbitrary iterators.
If you want to reverse an iterator, first convert it to a list with
list().
>>>
>>> input = open('/etc/passwd', 'r')
>>> for line in reversed(list(input)):
... print line
...
reversed return a reverse iterator.
x[::-1] return a list.
In [1]: aaa = [1,2,3,4,5]
In [4]: aaa[::-1]
Out[4]: [5, 4, 3, 2, 1]
In [5]: timeit(aaa[::-1])
1000000 loops, best of 3: 206 ns per loop
In [6]: reversed(aaa)
Out[6]: <listreverseiterator at 0x104310d50>
In [7]: timeit(reversed(aaa))
10000000 loops, best of 3: 182 ns per loop
First of all, reversed is not a built-in function.
>>> type(reversed)
<type 'type'>
It's a class which itrates over a sequence and gives a reverse order of sequence.
Try:
>>> help(reversed)
Help on class reversed in module __builtin__:
class reversed(object)
| reversed(sequence) -> reverse iterator over values of the sequence
And when we pass a parameter to it, it acts as a iterator,
>>> l = [1, 2, 3, 4]
>>> obj = reversed(l)
>>> obj
<listreverseiterator object at 0x0220F950>
>>> obj.next()
4
>>> obj.next()
3
>>> obj.next()
2
>>> obj.next()
1
>>> obj.next()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
whereas a slice operation returns the whole list which is not memory efficient for larger lists.
That is why, in Python 2, we have range (which returns whole list) as well as xrange (which generates each element on every iteration).
>>> l[::-1]
[4, 3, 2, 1]
aList = [123, 'xyz', 'zara', 'abc', 'xyz'];
print type(reversed(aList))
bList = [123, 'xyz', 'zara', 'abc', 'xyz'];
print type(bList[::-1])
Output:
<type 'listreverseiterator'>
<type 'list'>
The reversed function returns a reverse iterator. The [::-1] returns a list.
Because reversed returns an iterator.
Recently I noticed that when I am converting a list to set the order of elements is changed and is sorted by character.
Consider this example:
x=[1,2,20,6,210]
print(x)
# [1, 2, 20, 6, 210] # the order is same as initial order
set(x)
# set([1, 2, 20, 210, 6]) # in the set(x) output order is sorted
My questions are -
Why is this happening?
How can I do set operations (especially set difference) without losing the initial order?
A set is an unordered data structure, so it does not preserve the insertion order.
This depends on your requirements. If you have an normal list, and want to remove some set of elements while preserving the order of the list, you can do this with a list comprehension:
>>> a = [1, 2, 20, 6, 210]
>>> b = set([6, 20, 1])
>>> [x for x in a if x not in b]
[2, 210]
If you need a data structure that supports both fast membership tests and preservation of insertion order, you can use the keys of a Python dictionary, which starting from Python 3.7 is guaranteed to preserve the insertion order:
>>> a = dict.fromkeys([1, 2, 20, 6, 210])
>>> b = dict.fromkeys([6, 20, 1])
>>> dict.fromkeys(x for x in a if x not in b)
{2: None, 210: None}
b doesn't really need to be ordered here – you could use a set as well. Note that a.keys() - b.keys() returns the set difference as a set, so it won't preserve the insertion order.
In older versions of Python, you can use collections.OrderedDict instead:
>>> a = collections.OrderedDict.fromkeys([1, 2, 20, 6, 210])
>>> b = collections.OrderedDict.fromkeys([6, 20, 1])
>>> collections.OrderedDict.fromkeys(x for x in a if x not in b)
OrderedDict([(2, None), (210, None)])
In Python 3.6, set() now should keep the order, but there is another solution for Python 2 and 3:
>>> x = [1, 2, 20, 6, 210]
>>> sorted(set(x), key=x.index)
[1, 2, 20, 6, 210]
Remove duplicates and preserve order by below function
def unique(sequence):
seen = set()
return [x for x in sequence if not (x in seen or seen.add(x))]
How to remove duplicates from a list while preserving order in Python
Answering your first question, a set is a data structure optimized for set operations. Like a mathematical set, it does not enforce or maintain any particular order of the elements. The abstract concept of a set does not enforce order, so the implementation is not required to. When you create a set from a list, Python has the liberty to change the order of the elements for the needs of the internal implementation it uses for a set, which is able to perform set operations efficiently.
In mathematics, there are sets and ordered sets (osets).
set: an unordered container of unique elements (Implemented)
oset: an ordered container of unique elements (NotImplemented)
In Python, only sets are directly implemented. We can emulate osets with regular dict keys (3.7+).
Given
a = [1, 2, 20, 6, 210, 2, 1]
b = {2, 6}
Code
oset = dict.fromkeys(a).keys()
# dict_keys([1, 2, 20, 6, 210])
Demo
Replicates are removed, insertion-order is preserved.
list(oset)
# [1, 2, 20, 6, 210]
Set-like operations on dict keys.
oset - b
# {1, 20, 210}
oset | b
# {1, 2, 5, 6, 20, 210}
oset & b
# {2, 6}
oset ^ b
# {1, 5, 20, 210}
Details
Note: an unordered structure does not preclude ordered elements. Rather, maintained order is not guaranteed. Example:
assert {1, 2, 3} == {2, 3, 1} # sets (order is ignored)
assert [1, 2, 3] != [2, 3, 1] # lists (order is guaranteed)
One may be pleased to discover that a list and multiset (mset) are two more fascinating, mathematical data structures:
list: an ordered container of elements that permits replicates (Implemented)
mset: an unordered container of elements that permits replicates (NotImplemented)*
Summary
Container | Ordered | Unique | Implemented
----------|---------|--------|------------
set | n | y | y
oset | y | y | n
list | y | n | y
mset | n | n | n*
*A multiset can be indirectly emulated with collections.Counter(), a dict-like mapping of multiplicities (counts).
You can remove the duplicated values and keep the list order of insertion with one line of code, Python 3.8.2
mylist = ['b', 'b', 'a', 'd', 'd', 'c']
results = list({value:"" for value in mylist})
print(results)
>>> ['b', 'a', 'd', 'c']
results = list(dict.fromkeys(mylist))
print(results)
>>> ['b', 'a', 'd', 'c']
As denoted in other answers, sets are data structures (and mathematical concepts) that do not preserve the element order -
However, by using a combination of sets and dictionaries, it is possible that you can achieve wathever you want - try using these snippets:
# save the element order in a dict:
x_dict = dict(x,y for y, x in enumerate(my_list) )
x_set = set(my_list)
#perform desired set operations
...
#retrieve ordered list from the set:
new_list = [None] * len(new_set)
for element in new_set:
new_list[x_dict[element]] = element
Building on Sven's answer, I found using collections.OrderedDict like so helped me accomplish what you want plus allow me to add more items to the dict:
import collections
x=[1,2,20,6,210]
z=collections.OrderedDict.fromkeys(x)
z
OrderedDict([(1, None), (2, None), (20, None), (6, None), (210, None)])
If you want to add items but still treat it like a set you can just do:
z['nextitem']=None
And you can perform an operation like z.keys() on the dict and get the set:
list(z.keys())
[1, 2, 20, 6, 210]
One more simpler way can be two create a empty list ,let's say "unique_list" for adding the unique elements from the original list, for example:
unique_list=[]
for i in original_list:
if i not in unique_list:
unique_list.append(i)
else:
pass
This will give you all the unique elements as well as maintain the order.
Late to answer but you can use Pandas, pd.Series to convert list while preserving the order:
import pandas as pd
x = pd.Series([1, 2, 20, 6, 210, 2, 1])
print(pd.unique(x))
Output:
array([ 1, 2, 20, 6, 210])
Works for a list of strings
x = pd.Series(['c', 'k', 'q', 'n', 'p','c', 'n'])
print(pd.unique(x))
Output
['c' 'k' 'q' 'n' 'p']
An implementation of the highest score concept above that brings it back to a list:
def SetOfListInOrder(incominglist):
from collections import OrderedDict
outtemp = OrderedDict()
for item in incominglist:
outtemp[item] = None
return(list(outtemp))
Tested (briefly) on Python 3.6 and Python 2.7.
In case you have a small number of elements in your two initial lists on which you want to do set difference operation, instead of using collections.OrderedDict which complicates the implementation and makes it less readable, you can use:
# initial lists on which you want to do set difference
>>> nums = [1,2,2,3,3,4,4,5]
>>> evens = [2,4,4,6]
>>> evens_set = set(evens)
>>> result = []
>>> for n in nums:
... if not n in evens_set and not n in result:
... result.append(n)
...
>>> result
[1, 3, 5]
Its time complexity is not that good but it is neat and easy to read.
It's interesting that people always use 'real world problem' to make joke on the definition in theoretical science.
If set has order, you first need to figure out the following problems.
If your list has duplicate elements, what should the order be when you turn it into a set? What is the order if we union two sets? What is the order if we intersect two sets with different order on the same elements?
Plus, set is much faster in searching for a particular key which is very good in sets operation (and that's why you need a set, but not list).
If you really care about the index, just keep it as a list. If you still want to do set operation on the elements in many lists, the simplest way is creating a dictionary for each list with the same keys in the set along with a value of list containing all the index of the key in the original list.
def indx_dic(l):
dic = {}
for i in range(len(l)):
if l[i] in dic:
dic.get(l[i]).append(i)
else:
dic[l[i]] = [i]
return(dic)
a = [1,2,3,4,5,1,3,2]
set_a = set(a)
dic_a = indx_dic(a)
print(dic_a)
# {1: [0, 5], 2: [1, 7], 3: [2, 6], 4: [3], 5: [4]}
print(set_a)
# {1, 2, 3, 4, 5}
We can use collections.Counter for this:
# tested on python 3.7
>>> from collections import Counter
>>> lst = ["1", "2", "20", "6", "210"]
>>> for i in Counter(lst):
>>> print(i, end=" ")
1 2 20 6 210
>>> for i in set(lst):
>>> print(i, end=" ")
20 6 2 1 210
You can remove the duplicated values and keep the list order of insertion, if you want
lst = [1,2,1,3]
new_lst = []
for num in lst :
if num not in new_lst :
new_lst.append(num)
# new_lst = [1,2,3]
don't use 'sets' for removing duplicate if 'order' is something you want,
use sets for searching i.e.
x in list
takes O(n) time
where
x in set
takes O(1) time *most cases
Here's an easy way to do it:
x=[1,2,20,6,210]
print sorted(set(x))