python3: Getting unordered elements after using set. Why, how? [duplicate]

Recently I noticed that when I convert a list to a set, the order of the elements is changed and appears sorted by character.
Consider this example:
x=[1,2,20,6,210]
print(x)
# [1, 2, 20, 6, 210] # the order is same as initial order
set(x)
# set([1, 2, 20, 210, 6]) # in the set(x) output order is sorted
My questions are -
Why is this happening?
How can I do set operations (especially set difference) without losing the initial order?

A set is an unordered data structure, so it does not preserve the insertion order.
This depends on your requirements. If you have a normal list and want to remove some set of elements while preserving the order of the list, you can do this with a list comprehension:
>>> a = [1, 2, 20, 6, 210]
>>> b = set([6, 20, 1])
>>> [x for x in a if x not in b]
[2, 210]
If you need a data structure that supports both fast membership tests and preservation of insertion order, you can use the keys of a Python dictionary, which starting from Python 3.7 is guaranteed to preserve the insertion order:
>>> a = dict.fromkeys([1, 2, 20, 6, 210])
>>> b = dict.fromkeys([6, 20, 1])
>>> dict.fromkeys(x for x in a if x not in b)
{2: None, 210: None}
b doesn't really need to be ordered here – you could use a set as well. Note that a.keys() - b.keys() returns the set difference as a set, so it won't preserve the insertion order.
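A quick demonstration of that caveat, reusing a and b from above (the resulting set's iteration order is arbitrary, so the output may vary between runs and implementations):
>>> a.keys() - b.keys()
{2, 210}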
In older versions of Python, you can use collections.OrderedDict instead:
>>> a = collections.OrderedDict.fromkeys([1, 2, 20, 6, 210])
>>> b = collections.OrderedDict.fromkeys([6, 20, 1])
>>> collections.OrderedDict.fromkeys(x for x in a if x not in b)
OrderedDict([(2, None), (210, None)])

Sets do not keep order in any Python version; it is dicts (CPython 3.6+, guaranteed from Python 3.7) that preserve insertion order, not set(). But there is another solution that works in both Python 2 and 3 (note that key=x.index makes it O(n²)):
>>> x = [1, 2, 20, 6, 210]
>>> sorted(set(x), key=x.index)
[1, 2, 20, 6, 210]

Remove duplicates and preserve order with the function below:
def unique(sequence):
    seen = set()
    return [x for x in sequence if not (x in seen or seen.add(x))]
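For example, on the question's data:
>>> unique([1, 2, 20, 6, 210, 2, 1])
[1, 2, 20, 6, 210]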
How to remove duplicates from a list while preserving order in Python

Answering your first question: a set is a data structure optimized for set operations. Like a mathematical set, it does not enforce or maintain any particular order of the elements; the abstract concept of a set does not require order, so the implementation is not required to provide it. When you create a set from a list, Python is free to rearrange the elements to suit the internal hash-table implementation it uses, which is what makes set operations efficient.
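For a concrete sense of what drives the apparent "sorting" (a CPython implementation detail, not something to rely on): small integers hash to themselves, and a set's iteration order follows the slots those hash values occupy in the internal table, so for small ints the result can merely look sorted.
>>> [hash(v) for v in [1, 2, 20, 6, 210]]  # CPython: small ints hash to themselves
[1, 2, 20, 6, 210]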

In mathematics, there are sets and ordered sets (osets).
set: an unordered container of unique elements (Implemented)
oset: an ordered container of unique elements (NotImplemented)
In Python, only sets are directly implemented. We can emulate osets with regular dict keys (3.7+).
Given
a = [1, 2, 20, 6, 210, 2, 1]
b = {2, 5, 6}
Code
oset = dict.fromkeys(a).keys()
# dict_keys([1, 2, 20, 6, 210])
Demo
Replicates are removed, insertion-order is preserved.
list(oset)
# [1, 2, 20, 6, 210]
Set-like operations on dict keys.
oset - b
# {1, 20, 210}
oset | b
# {1, 2, 5, 6, 20, 210}
oset & b
# {2, 6}
oset ^ b
# {1, 5, 20, 210}
Details
Note: an unordered structure does not preclude ordered elements. Rather, maintained order is not guaranteed. Example:
assert {1, 2, 3} == {2, 3, 1} # sets (order is ignored)
assert [1, 2, 3] != [2, 3, 1] # lists (order is guaranteed)
One may be pleased to discover that a list and a multiset (mset) are two more fascinating mathematical data structures:
list: an ordered container of elements that permits replicates (Implemented)
mset: an unordered container of elements that permits replicates (NotImplemented)*
Summary
Container | Ordered | Unique | Implemented
----------|---------|--------|------------
set | n | y | y
oset | y | y | n
list | y | n | y
mset | n | n | n*
*A multiset can be indirectly emulated with collections.Counter(), a dict-like mapping of multiplicities (counts).
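For instance, a minimal multiset emulation with Counter (multiset intersection maps to the & operator, taking minimum counts):
>>> from collections import Counter
>>> mset = Counter([1, 2, 20, 6, 210, 2, 1])
>>> mset
Counter({1: 2, 2: 2, 20: 1, 6: 1, 210: 1})
>>> mset & Counter([1, 1, 2])
Counter({1: 2, 2: 1})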

You can remove the duplicated values and keep the insertion order with one line of code (Python 3.8.2):
mylist = ['b', 'b', 'a', 'd', 'd', 'c']
results = list({value: "" for value in mylist})
print(results)
# ['b', 'a', 'd', 'c']
results = list(dict.fromkeys(mylist))
print(results)
# ['b', 'a', 'd', 'c']

As noted in other answers, sets are data structures (and mathematical concepts) that do not preserve the element order.
However, by using a combination of sets and dictionaries, you can achieve whatever you want - try these snippets:
# save the element order in a dict:
x_dict = {x: i for i, x in enumerate(my_list)}
x_set = set(my_list)
# perform the desired set operations
...
# retrieve an ordered list from the resulting set:
new_list = sorted(new_set, key=x_dict.get)
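For example, a set difference with the original order restored (a minimal sketch reusing the names above; new_set stands in for the result of whatever set operation you performed):
my_list = [1, 2, 20, 6, 210]
x_dict = {x: i for i, x in enumerate(my_list)}
new_set = set(my_list) - {6, 20, 1}
new_list = sorted(new_set, key=x_dict.get)
print(new_list)  # [2, 210]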

Building on Sven's answer, I found that using collections.OrderedDict like so helped me accomplish what you want, plus it lets me add more items to the dict:
import collections
x=[1,2,20,6,210]
z=collections.OrderedDict.fromkeys(x)
z
OrderedDict([(1, None), (2, None), (20, None), (6, None), (210, None)])
If you want to add items but still treat it like a set you can just do:
z['nextitem']=None
And you can perform an operation like z.keys() on the dict and get the set:
list(z.keys())
[1, 2, 20, 6, 210]

One simpler way is to create an empty list, say "unique_list", and append the unique elements from the original list to it, for example:
unique_list = []
for i in original_list:
    if i not in unique_list:
        unique_list.append(i)
This will give you all the unique elements as well as maintain the order.

Late to answer, but you can use pandas: convert the list to a pd.Series and deduplicate it with pd.unique, which preserves the order:
import pandas as pd
x = pd.Series([1, 2, 20, 6, 210, 2, 1])
print(pd.unique(x))
Output:
array([ 1, 2, 20, 6, 210])
Works for a list of strings
x = pd.Series(['c', 'k', 'q', 'n', 'p','c', 'n'])
print(pd.unique(x))
Output
['c' 'k' 'q' 'n' 'p']

An implementation of the top-scoring concept above that brings it back to a list:
def SetOfListInOrder(incominglist):
    from collections import OrderedDict
    outtemp = OrderedDict()
    for item in incominglist:
        outtemp[item] = None
    return list(outtemp)
Tested (briefly) on Python 3.6 and Python 2.7.

In case you have a small number of elements in the two initial lists on which you want to do the set difference, instead of using collections.OrderedDict, which complicates the implementation and makes it less readable, you can use:
# initial lists on which you want to do set difference
>>> nums = [1,2,2,3,3,4,4,5]
>>> evens = [2,4,4,6]
>>> evens_set = set(evens)
>>> result = []
>>> for n in nums:
...     if n not in evens_set and n not in result:
...         result.append(n)
...
>>> result
[1, 3, 5]
Its time complexity is not that good (the linear "n not in result" check makes it quadratic in the worst case), but it is neat and easy to read.

It's interesting that people often use 'real-world problems' to make jokes about the definitions of theoretical science.
If set had an order, you would first need to resolve the following problems: if your list has duplicate elements, what should the order be when you turn it into a set? What is the order when we union two sets? What is the order when we intersect two sets with a different order on the same elements?
Plus, a set is much faster for searching for a particular key, which is very good for set operations (and that's why you need a set, not a list).
If you really care about the index, just keep it as a list. If you still want to do set operations on the elements of many lists, the simplest way is to create a dictionary for each list, with the same keys as in the set, each mapped to a list of all the indices of that key in the original list.
def indx_dic(l):
    dic = {}
    for i in range(len(l)):
        if l[i] in dic:
            dic[l[i]].append(i)
        else:
            dic[l[i]] = [i]
    return dic
a = [1,2,3,4,5,1,3,2]
set_a = set(a)
dic_a = indx_dic(a)
print(dic_a)
# {1: [0, 5], 2: [1, 7], 3: [2, 6], 4: [3], 5: [4]}
print(set_a)
# {1, 2, 3, 4, 5}

We can use collections.Counter for this:
# tested on python 3.7
>>> from collections import Counter
>>> lst = ["1", "2", "20", "6", "210"]
>>> for i in Counter(lst):
...     print(i, end=" ")
1 2 20 6 210
>>> for i in set(lst):
...     print(i, end=" ")
20 6 2 1 210

You can remove the duplicated values and keep the insertion order, if you want:
lst = [1, 2, 1, 3]
new_lst = []
for num in lst:
    if num not in new_lst:
        new_lst.append(num)
# new_lst = [1, 2, 3]
Don't use sets for removing duplicates if order is something you want; use sets for searching: x in list takes O(n) time, whereas x in set takes O(1) time in most cases.
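A rough way to see that gap with timeit (a sketch; absolute numbers vary by machine):
import timeit

data = list(range(100000))
as_set = set(data)

# membership in a list scans elements one by one; a set hashes straight to the bucket
print(timeit.timeit('99999 in data', globals=globals(), number=100))    # linear scan per test
print(timeit.timeit('99999 in as_set', globals=globals(), number=100))  # hash lookup per test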

Here's an easy way to do it (note that this returns the elements in sorted order, not the original list order):
x = [1, 2, 20, 6, 210]
print(sorted(set(x)))  # [1, 2, 6, 20, 210]

Related

Is there a better method to remove duplicate values from a list, using a dictionary instead of set(), to preserve the values' insertion order? [duplicate]

How do I remove duplicates from a list, while preserving order? Using a set to remove duplicates destroys the original order.
Is there a built-in or a Pythonic idiom?
Here you have some alternatives: http://www.peterbe.com/plog/uniqifiers-benchmark
Fastest one:
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]
Why assign seen.add to seen_add instead of just calling seen.add? Python is a dynamic language, and resolving seen.add each iteration is more costly than resolving a local variable. seen.add could have changed between iterations, and the runtime isn't smart enough to rule that out. To play it safe, it has to check the object each time.
If you plan on using this function a lot on the same dataset, perhaps you would be better off with an ordered set: http://code.activestate.com/recipes/528878/
O(1) insertion, deletion and member-check per operation.
(Small additional note: seen.add() always returns None, so the or above is there only as a way to attempt a set update, and not as an integral part of the logical test.)
The best solution varies by Python version and environment constraints:
Python 3.7+ (and most interpreters supporting 3.6, as an implementation detail):
Plain dict is insertion-ordered: the behavior first appeared in PyPy 2.5.0, was adopted in CPython 3.6 as an implementation detail, and became a language guarantee in Python 3.7. It is even more efficient than collections.OrderedDict (itself implemented in C as of CPython 3.5). So the fastest solution, by far, is also the simplest:
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(dict.fromkeys(items)) # Or [*dict.fromkeys(items)] if you prefer
[1, 2, 0, 3]
Like list(set(items)) this pushes all the work to the C layer (on CPython), but since dicts are insertion ordered, dict.fromkeys doesn't lose ordering. It's slower than list(set(items)) (takes 50-100% longer typically), but much faster than any other order-preserving solution (takes about half the time of hacks involving use of sets in a listcomp).
Important note: The unique_everseen solution from more_itertools (see below) has some unique advantages in terms of laziness and support for non-hashable input items; if you need these features, it's the only solution that will work.
Python 3.5 (and all older versions if performance isn't critical)
As Raymond pointed out, in CPython 3.5 where OrderedDict is implemented in C, ugly list comprehension hacks are slower than OrderedDict.fromkeys (unless you actually need the list at the end - and even then, only if the input is very short). So on both performance and readability the best solution for CPython 3.5 is the OrderedDict equivalent of the 3.6+ use of plain dict:
>>> from collections import OrderedDict
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(OrderedDict.fromkeys(items))
[1, 2, 0, 3]
On CPython 3.4 and earlier, this will be slower than some other solutions, so if profiling shows you need a better solution, keep reading.
Python 3.4 and earlier, if performance is critical and third-party modules are acceptable
As @abarnert notes, the more_itertools library (pip install more_itertools) contains a unique_everseen function that is built to solve this problem without any unreadable (not seen.add) mutations in list comprehensions. This is the fastest solution too:
>>> from more_itertools import unique_everseen
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(unique_everseen(items))
[1, 2, 0, 3]
Just one simple library import and no hacks.
The module is adapting the itertools recipe unique_everseen which looks like:
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element
but unlike the itertools recipe, it supports non-hashable items (at a performance cost; if all elements in iterable are non-hashable, the algorithm becomes O(n²), vs. O(n) if they're all hashable).
Important note: Unlike all the other solutions here, unique_everseen can be used lazily; the peak memory usage will be the same (eventually, the underlying set grows to the same size), but if you don't listify the result, you just iterate it, you'll be able to process unique items as they're found, rather than waiting until the entire input has been deduplicated before processing the first unique item.
Python 3.4 and earlier, if performance is critical and third party modules are unavailable
You have two options:
Copy and paste in the unique_everseen recipe to your code and use it per the more_itertools example above
Use ugly hacks to allow a single listcomp to both check and update a set to track what's been seen:
seen = set()
[x for x in seq if x not in seen and not seen.add(x)]
at the expense of relying on the ugly hack:
not seen.add(x)
which relies on the fact that set.add is an in-place method that always returns None so not None evaluates to True.
Note that all of the solutions above are O(n) (save calling unique_everseen on an iterable of non-hashable items, which is O(n²), while the others would fail immediately with a TypeError), so all solutions are performant enough when they're not the hottest code path. Which one to use depends on which versions of the language spec/interpreter/third-party modules you can rely on, whether or not performance is critical (don't assume it is; it usually isn't), and most importantly, readability (because if the person who maintains this code later ends up in a murderous mood, your clever micro-optimization probably wasn't worth it).
In CPython 3.6+ (and all other Python implementations starting with Python 3.7+), dictionaries are ordered, so the way to remove duplicates from an iterable while keeping it in the original order is:
>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
In Python 3.5 and below (including Python 2.7), use the OrderedDict. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5 (when it gained a C implementation; prior to 3.5 it's still the clearest solution, though not the fastest).
>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
Not to kick a dead horse (this question is very old and already has lots of good answers), but here is a solution using pandas that is quite fast in many circumstances and is dead simple to use.
import pandas as pd
my_list = [0, 1, 2, 3, 4, 1, 2, 3, 5]
>>> pd.Series(my_list).drop_duplicates().tolist()
# Output:
# [0, 1, 2, 3, 4, 5]
In Python 3.7 and above, dictionaries are guaranteed to remember their key insertion order. The answer to this question summarizes the current state of affairs.
The OrderedDict solution thus becomes obsolete and without any import statements we can simply issue:
>>> lst = [1, 2, 1, 3, 3, 2, 4]
>>> list(dict.fromkeys(lst))
[1, 2, 3, 4]
sequence = ['1', '2', '3', '3', '6', '4', '5', '6']
unique = []
[unique.append(item) for item in sequence if item not in unique]
unique → ['1', '2', '3', '6', '4', '5']
from itertools import groupby
[key for key, _ in groupby(sortedList)]
The list doesn't even have to be sorted, the sufficient condition is that equal values are grouped together.
Edit: I assumed that "preserving order" implies that the list is actually ordered. If this is not the case, then the solution from MizardX is the right one.
Community edit: This is however the most elegant way to "compress duplicate consecutive elements into a single element".
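For example, grouping a list after sorting it, and the consecutive-compression behavior on unsorted input:
>>> from itertools import groupby
>>> [k for k, _ in groupby(sorted([1, 2, 1, 3, 3, 2]))]
[1, 2, 3]
>>> [k for k, _ in groupby('AAABBBCCA')]
['A', 'B', 'C', 'A']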
If you want to maintain the order, you can try this:
list1 = ['b', 'c', 'd', 'b', 'c', 'a', 'a']
list2 = list(set(list1))
list2.sort(key=list1.index)
print(list2)
Or, similarly, you can do this:
list1 = ['b', 'c', 'd', 'b', 'c', 'a', 'a']
list2 = sorted(set(list1), key=list1.index)
print(list2)
You can also do this:
list1 = ['b', 'c', 'd', 'b', 'c', 'a', 'a']
list2 = []
for i in list1:
    if i not in list2:
        list2.append(i)
print(list2)
It can also be written like this:
list1 = ['b', 'c', 'd', 'b', 'c', 'a', 'a']
list2 = []
[list2.append(i) for i in list1 if i not in list2]
print(list2)
Just to add another (very performant) implementation of such functionality from an external module [1]: iteration_utilities.unique_everseen:
>>> from iteration_utilities import unique_everseen
>>> lst = [1,1,1,2,3,2,2,2,1,3,4]
>>> list(unique_everseen(lst))
[1, 2, 3, 4]
Timings
I did some timings (Python 3.6) and these show that it's faster than all other alternatives I tested, including OrderedDict.fromkeys, f7 and more_itertools.unique_everseen:
%matplotlib notebook
from iteration_utilities import unique_everseen
from collections import OrderedDict
from more_itertools import unique_everseen as mi_unique_everseen
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

def iteration_utilities_unique_everseen(seq):
    return list(unique_everseen(seq))

def more_itertools_unique_everseen(seq):
    return list(mi_unique_everseen(seq))

def odict(seq):
    return list(OrderedDict.fromkeys(seq))

from simple_benchmark import benchmark

b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: list(range(2**i)) for i in range(1, 20)},
              'list size (no duplicates)')
b.plot()
And just to make sure I also did a test with more duplicates just to check if it makes a difference:
import random
b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(1, 20)},
              'list size (lots of duplicates)')
b.plot()
And one containing only one value:
b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: [1]*(2**i) for i in range(1, 20)},
              'list size (only duplicates)')
b.plot()
In all of these cases the iteration_utilities.unique_everseen function is the fastest (on my computer).
This iteration_utilities.unique_everseen function can also handle unhashable values in the input (however with O(n²) performance instead of the O(n) performance when the values are hashable).
>>> lst = [{1}, {1}, {2}, {1}, {3}]
>>> list(unique_everseen(lst))
[{1}, {2}, {3}]
[1] Disclaimer: I'm the author of that package.
For another very late answer to another very old question:
The itertools recipes have a function that does this, using the seen set technique, but:
Handles a standard key function.
Uses no unseemly hacks.
Optimizes the loop by pre-binding seen.add instead of looking it up N times. (f7 also does this, but some versions don't.)
Optimizes the loop by using ifilterfalse, so you only have to loop over the unique elements in Python, instead of all of them. (You still iterate over all of them inside ifilterfalse, of course, but that's in C, and much faster.)
Is it actually faster than f7? It depends on your data, so you'll have to test it and see. If you want a list in the end, f7 uses a listcomp, and there's no way to do that here. (You can directly append instead of yielding, or you can feed the generator into the list function, but neither one can be as fast as the LIST_APPEND inside a listcomp.) At any rate, usually, squeezing out a few microseconds is not going to be as important as having an easily-understandable, reusable, already-written function that doesn't require DSU when you want to decorate.
As with all of the recipes, it's also available in more-itertools.
If you just want the no-key case, you can simplify it as:
import itertools  # Python 2: ifilterfalse lives here

def unique(iterable):
    seen = set()
    seen_add = seen.add
    for element in itertools.ifilterfalse(seen.__contains__, iterable):
        seen_add(element)
        yield element
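For reference, in Python 3 the same simplification uses itertools.filterfalse (ifilterfalse was the Python 2 name):
from itertools import filterfalse

def unique(iterable):
    seen = set()
    seen_add = seen.add
    for element in filterfalse(seen.__contains__, iterable):
        seen_add(element)
        yield element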
For non-hashable types (e.g. lists of lists), based on MizardX's:
def f7_noHash(seq):
    seen = set()
    return [x for x in seq if str(x) not in seen and not seen.add(str(x))]
pandas users should check out pandas.unique.
>>> import pandas as pd
>>> lst = [1, 2, 1, 3, 3, 2, 4]
>>> pd.unique(lst)
array([1, 2, 3, 4])
The function returns a NumPy array. If needed, you can convert it to a list with the tolist method.
Here is a simple way to do it:
list1 = ["hello", " ", "w", "o", "r", "l", "d"]
sorted(set(list1), key=list1.index)
That gives the output:
["hello", " ", "w", "o", "r", "l", "d"]
Borrowing the recursive idea used in defining Haskell's nub function for lists, this would be a recursive approach:
def unique(lst):
    # list(...) around filter keeps this working on Python 3, where filter is lazy
    return [] if lst == [] else [lst[0]] + unique(list(filter(lambda x: x != lst[0], lst[1:])))
e.g.:
In [118]: unique([1,5,1,1,4,3,4])
Out[118]: [1, 5, 4, 3]
I tried it for growing data sizes and saw sub-linear time-complexity (not definitive, but suggests this should be fine for normal data).
In [122]: %timeit unique(np.random.randint(5, size=(1)))
10000 loops, best of 3: 25.3 us per loop
In [123]: %timeit unique(np.random.randint(5, size=(10)))
10000 loops, best of 3: 42.9 us per loop
In [124]: %timeit unique(np.random.randint(5, size=(100)))
10000 loops, best of 3: 132 us per loop
In [125]: %timeit unique(np.random.randint(5, size=(1000)))
1000 loops, best of 3: 1.05 ms per loop
In [126]: %timeit unique(np.random.randint(5, size=(10000)))
100 loops, best of 3: 11 ms per loop
I also think it's interesting that this could be readily generalized to uniqueness by other operations. Like this:
import operator
def unique(lst, cmp_op=operator.ne):
    return [] if lst == [] else [lst[0]] + unique(list(filter(lambda x: cmp_op(x, lst[0]), lst[1:])), cmp_op)
For example, you could pass in a function that uses the notion of rounding to the same integer as if it was "equality" for uniqueness purposes, like this:
def test_round(x, y):
    return round(x) != round(y)
then unique(some_list, test_round) would provide the unique elements of the list where uniqueness no longer means traditional equality (which is implied by any sort of set-based or dict-key-based approach to this problem), but instead means taking only the first element that rounds to K, for each possible integer K that the elements might round to, e.g.:
In [6]: unique([1.2, 5, 1.9, 1.1, 4.2, 3, 4.8], test_round)
Out[6]: [1.2, 5, 1.9, 4.2, 3]
In CPython 2, you can reference a list comprehension as it is being built via the symbol '_[1]' in locals() (an implementation detail that does not work in Python 3). For example, the following function unique-ifies a list of elements without changing their order by referencing its own list comprehension:
def unique(my_list):
    return [x for x in my_list if x not in locals()['_[1]']]
Demo:
l1 = [1, 2, 3, 4, 1, 2, 3, 4, 5]
l2 = [x for x in l1 if x not in locals()['_[1]']]
print l2
Output:
[1, 2, 3, 4, 5]
Eliminate duplicate values in a sequence while preserving the order of the remaining items, using a general-purpose generator function:
# for a hashable sequence
def remove_duplicates(items):
    seen = set()
    for item in items:
        if item not in seen:
            yield item
            seen.add(item)
a = [1, 5, 2, 1, 9, 1, 5, 10]
list(remove_duplicates(a))
# [1, 5, 2, 9, 10]
# for unhashable sequence
def remove_duplicates(items, key=None):
    seen = set()
    for item in items:
        val = item if key is None else key(item)
        if val not in seen:
            yield item
            seen.add(val)
a = [ {'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 1, 'y': 2}, {'x': 2, 'y': 4}]
list(remove_duplicates(a, key=lambda d: (d['x'],d['y'])))
# [{'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 2, 'y': 4}]
1. These solutions are fine…
For removing duplicates while preserving order, the excellent solution(s) proposed elsewhere on this page:
seen = set()
[x for x in seq if not (x in seen or seen.add(x))]
and variation(s), e.g.:
seen = set()
[x for x in seq if x not in seen and not seen.add(x)]
are indeed popular because they are simple, minimalistic, and deploy the correct hashing for optimal efficiency. The main complaint about these seems to be that using the invariant None "returned" by the method seen.add(x) as a constant (and therefore excess/unnecessary) value in a logical expression - just for its side effect - is hacky and/or confusing.
2. …but they waste one hash lookup per iteration.
Surprisingly, given the amount of discussion and debate on this topic, there is actually a significant improvement to the code that seems to have been overlooked. As shown, each "test-and-set" iteration requires two hash lookups: the first to test membership x not in seen and then again to actually add the value seen.add(x). Since the first operation guarantees that the second will always be successful, there is a wasteful duplication of effort here. And because the overall technique here is so efficient, the excess hash lookups will likely end up being the most expensive proportion of what little work remains.
3. Instead, let the set do its job!
Notice that the examples above only call set.add with the foreknowledge that doing so will always result in an increase in set membership. The set itself never gets a chance to reject a duplicate; our code snippet has essentially usurped that role for itself. The use of explicit two-step test-and-set code is robbing set of its core ability to exclude those duplicates itself.
4. The single-hash-lookup code:
The following version cuts the number of hash lookups per iteration in half—from two down to just one.
seen = set()
[x for x in seq if len(seen) < len(seen.add(x) or seen)]
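A quick check of the trick on the question's data (len(seen) on the left is evaluated before the add on the right, so the length comparison detects whether the add actually grew the set):
>>> seq = [1, 2, 20, 6, 210, 2, 1]
>>> seen = set()
>>> [x for x in seq if len(seen) < len(seen.add(x) or seen)]
[1, 2, 20, 6, 210]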
If you need a one-liner, then maybe this would help:
from functools import reduce  # needed on Python 3
reduce(lambda x, y: x + y if y[0] not in x else x, map(lambda x: [x], lst))
... should work, but correct me if I'm wrong.
MizardX's answer gives a good collection of multiple approaches.
This is what I came up with while thinking aloud (note that it keeps the last occurrence of each duplicate, not the first):
mylist = [x for i, x in enumerate(mylist) if x not in mylist[i+1:]]
You could do a sort of ugly list comprehension hack.
[l[i] for i in range(len(l)) if l.index(l[i]) == i]
A relatively effective approach for sorted numpy arrays:
import numpy as np
b = np.array([1, 3, 3, 8, 12, 12, 12])
np.hstack([b[0], [x[0] for x in zip(b[1:], b[:-1]) if x[0] != x[1]]])
Outputs:
array([ 1, 3, 8, 12])
l = [1,2,2,3,3,...]
n = []
n.extend(ele for ele in l if ele not in set(n))
A generator expression that uses the O(1) lookup of a set to determine whether or not to include an element in the new list. (Caveat: set(n) is rebuilt for every element, which makes this quadratic overall; keep a single seen set outside the expression if performance matters.)
A simple recursive solution:
def uniquefy_list(a):
    return uniquefy_list(a[1:]) if a[0] in a[1:] else [a[0]] + uniquefy_list(a[1:]) if len(a) > 1 else [a[0]]
This will preserve order and run in O(n) time. The idea is to create a hole wherever a duplicate is found and sink it down to the bottom, using a read pointer and a write pointer. Whenever a duplicate is found, only the read pointer advances; the write pointer stays on the duplicate entry to overwrite it.
def deduplicate(l):
    count = {}
    (read, write) = (0, 0)
    while read < len(l):
        if l[read] in count:
            read += 1
            continue
        count[l[read]] = True
        l[write] = l[read]
        read += 1
        write += 1
    return l[0:write]
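For example:
>>> deduplicate([1, 2, 1, 3, 1, 4])
[1, 2, 3, 4]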
x = [1, 2, 1, 3, 1, 4]

# brute-force method
arr = []
for i in x:
    if i not in arr:
        arr.append(i)

# recursive method
tmp = []
def remove_duplicates(j=0):
    if j < len(x):
        if x[j] not in tmp:
            tmp.append(x[j])
        remove_duplicates(j + 1)
remove_duplicates()
One-liner list comprehension:
values_non_duplicated = [value for index, value in enumerate(values) if value not in values[ : index]]
If you routinely use pandas, and aesthetics is preferred over performance, then consider the built-in function pandas.Series.drop_duplicates:
import pandas as pd
import numpy as np
uniquifier = lambda alist: pd.Series(alist).drop_duplicates().tolist()
# from the chosen answer
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

alist = np.random.randint(low=0, high=1000, size=10000).tolist()
print(uniquifier(alist) == f7(alist))  # True
Timing:
In [104]: %timeit f7(alist)
1000 loops, best of 3: 1.3 ms per loop
In [110]: %timeit uniquifier(alist)
100 loops, best of 3: 4.39 ms per loop
A solution without using imported modules or sets:
text = "ask not what your country can do for you ask what you can do for your country"
sentence = text.split(" ")
noduplicates = [sentence[i] for i in range(0, len(sentence)) if sentence[i] not in sentence[:i]]
print(noduplicates)
Gives output:
['ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you']
An in-place method
This method is quadratic, because we have a linear lookup into the list for every element of the list (to which we must add the cost of rearranging the list because of the dels).
That said, it is possible to operate in place if we start from the end of the list and proceed toward the origin, removing each term that is present in the sub-list at its left.
This idea in code is simply
for i in range(len(l)-1, 0, -1):
    if l[i] in l[:i]:
        del l[i]
A simple test of the implementation
In [91]: from random import randint, seed
In [92]: seed('20080808') ; l = [randint(1,6) for _ in range(12)] # Beijing Olympics
In [93]: for i in range(len(l)-1,0,-1):
    ...:     print(l)
    ...:     print(i, l[i], l[:i], end='')
    ...:     if l[i] in l[:i]:
    ...:         print(': remove', l[i])
    ...:         del l[i]
    ...:     else:
    ...:         print()
    ...: print(l)
[6, 5, 1, 4, 6, 1, 6, 2, 2, 4, 5, 2]
11 2 [6, 5, 1, 4, 6, 1, 6, 2, 2, 4, 5]: remove 2
[6, 5, 1, 4, 6, 1, 6, 2, 2, 4, 5]
10 5 [6, 5, 1, 4, 6, 1, 6, 2, 2, 4]: remove 5
[6, 5, 1, 4, 6, 1, 6, 2, 2, 4]
9 4 [6, 5, 1, 4, 6, 1, 6, 2, 2]: remove 4
[6, 5, 1, 4, 6, 1, 6, 2, 2]
8 2 [6, 5, 1, 4, 6, 1, 6, 2]: remove 2
[6, 5, 1, 4, 6, 1, 6, 2]
7 2 [6, 5, 1, 4, 6, 1, 6]
[6, 5, 1, 4, 6, 1, 6, 2]
6 6 [6, 5, 1, 4, 6, 1]: remove 6
[6, 5, 1, 4, 6, 1, 2]
5 1 [6, 5, 1, 4, 6]: remove 1
[6, 5, 1, 4, 6, 2]
4 6 [6, 5, 1, 4]: remove 6
[6, 5, 1, 4, 2]
3 4 [6, 5, 1]
[6, 5, 1, 4, 2]
2 1 [6, 5]
[6, 5, 1, 4, 2]
1 5 [6]
[6, 5, 1, 4, 2]

Removing duplicates in lists

How can I check if a list has any duplicates and return a new list without duplicates?
The common approach to get a unique collection of items is to use a set. Sets are unordered collections of distinct objects. To create a set from any iterable, you can simply pass it to the built-in set() function. If you later need a real list again, you can similarly pass the set to the list() function.
The following example should cover whatever you are trying to do:
>>> t = [1, 2, 3, 1, 2, 3, 5, 6, 7, 8]
>>> list(set(t))
[1, 2, 3, 5, 6, 7, 8]
>>> s = [1, 2, 3]
>>> list(set(t) - set(s))
[8, 5, 6, 7]
As you can see from the example result, the original order is not maintained. As mentioned above, sets themselves are unordered collections, so the order is lost. When converting a set back to a list, an arbitrary order is created.
Maintaining order
If order is important to you, then you will have to use a different mechanism. A very common solution for this is to rely on OrderedDict to keep the order of keys during insertion:
>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]
Starting with Python 3.7, the built-in dictionary is guaranteed to maintain the insertion order as well, so you can also use that directly if you are on Python 3.7 or later (or CPython 3.6):
>>> list(dict.fromkeys(t))
[1, 2, 3, 5, 6, 7, 8]
Note that this may have some overhead of creating a dictionary first, and then creating a list from it. If you don’t actually need to preserve the order, you’re often better off using a set, especially because it gives you a lot more operations to work with. Check out this question for more details and alternative ways to preserve the order when removing duplicates.
Finally note that both the set as well as the OrderedDict/dict solutions require your items to be hashable. This usually means that they have to be immutable. If you have to deal with items that are not hashable (e.g. list objects), then you will have to use a slow approach in which you will basically have to compare every item with every other item in a nested loop.
In Python 2.7, the new way of removing duplicates from an iterable while keeping it in the original order is:
>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
In Python 3.5, the OrderedDict has a C implementation. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5.
In Python 3.6, the regular dict became both ordered and compact. (This holds for CPython and PyPy but may not be present in other implementations.) That gives us a new fastest way of deduping while retaining order:
>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
In Python 3.7, the regular dict is guaranteed to be ordered across all implementations. So, the shortest and fastest solution is:
>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
It's a one-liner: list(set(source_list)) will do the trick.
A set is something that can't possibly have duplicates.
Update: an order-preserving approach is two lines:
from collections import OrderedDict
OrderedDict((x, True) for x in source_list).keys()
Here we use the fact that OrderedDict remembers the insertion order of keys, and does not change it when a value at a particular key is updated. We insert True as values, but we could insert anything; values are just not used. (set works a lot like a dict with ignored values, too.) In Python 3, wrap the result in list(...), since .keys() returns a view rather than a list.
>>> t = [1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> t
[1, 2, 3, 1, 2, 5, 6, 7, 8]
>>> s = []
>>> for i in t:
...     if i not in s:
...         s.append(i)
>>> s
[1, 2, 3, 5, 6, 7, 8]
If you don't care about the order, just do this:
def remove_duplicates(l):
    return list(set(l))
A set is guaranteed to not have duplicates.
To make a new list retaining the order of first elements of duplicates in L:
newlist = [ii for n,ii in enumerate(L) if ii not in L[:n]]
For example: if L = [1, 2, 2, 3, 4, 2, 4, 3, 5], then newlist will be [1, 2, 3, 4, 5]
This checks each new element has not appeared previously in the list before adding it.
Also it does not need imports.
There are also solutions using Pandas and Numpy. They both return a numpy array, so you have to call .tolist() if you want a list.
t=['a','a','b','b','b','c','c','c']
t2= ['c','c','b','b','b','a','a','a']
Pandas solution
Using Pandas function unique():
import pandas as pd
pd.unique(t).tolist()
>>>['a','b','c']
pd.unique(t2).tolist()
>>>['c','b','a']
Numpy solution
Using numpy function unique().
import numpy as np
np.unique(t).tolist()
>>>['a','b','c']
np.unique(t2).tolist()
>>>['a','b','c']
Note that numpy.unique() also sorts the values, so the list t2 is returned sorted. If you want the order preserved, do as in this answer:
_, idx = np.unique(t2, return_index=True)
np.array(t2)[np.sort(idx)].tolist()
>>>['c','b','a']
The solution is not so elegant compared to the others; however, compared to pandas.unique(), numpy.unique() also allows you to check whether nested arrays are unique along one selected axis.
In this answer, there will be two sections: Two unique solutions, and a graph of speed for specific solutions.
Removing Duplicate Items
Most of these answers only remove duplicate items that are hashable, but the question doesn't require only hashable items, meaning I'll offer some solutions which don't require hashable items.
collections.Counter is a powerful tool in the standard library which could be perfect for this. There's only one other solution which even has Counter in it. However, that solution is also limited to hashable keys.
To allow unhashable keys in Counter, I made a Container class, which will try to get the object's default hash function, but if that fails, it will use its identity function. It also defines an __eq__ and a __hash__ method. This should be enough to allow unhashable items in our solution. Unhashable objects will be treated as if they were hashable. However, this hash function uses identity for unhashable objects, meaning two equal objects that are both unhashable won't work. I suggest you override this, changing it to use the hash of an equivalent hashable type (like using hash(tuple(my_list)) if my_list is a list).
I also made two solutions: one that doesn't keep order, and another that keeps the order of the items by using a subclass of both OrderedDict and Counter, named 'OrderedCounter'. Now, here are the functions:
from collections import OrderedDict, Counter
class Container:
    def __init__(self, obj):
        self.obj = obj
    def __eq__(self, obj):
        return self.obj == obj
    def __hash__(self):
        try:
            return hash(self.obj)
        except TypeError:
            return id(self.obj)

class OrderedCounter(Counter, OrderedDict):
    'Counter that remembers the order elements are first encountered'
    def __repr__(self):
        return '%s(%r)' % (self.__class__.__name__, OrderedDict(self))
    def __reduce__(self):
        return self.__class__, (OrderedDict(self),)

def remd(sequence):
    cnt = Counter()
    for x in sequence:
        cnt[Container(x)] += 1
    return [item.obj for item in cnt]

def oremd(sequence):
    cnt = OrderedCounter()
    for x in sequence:
        cnt[Container(x)] += 1
    return [item.obj for item in cnt]
remd is the non-ordered variant, while oremd is the order-preserving one. You can clearly tell which one is faster, but I'll explain anyway: the non-ordered variant is slightly faster, since it doesn't store the order of the items.
Now, I also wanted to show the speed comparisons of each answer. So, I'll do that now.
Which Function is the Fastest?
For removing duplicates, I gathered 10 functions from a few answers. I calculated the speed of each function and put it into a graph using matplotlib.pyplot.
I divided this into three rounds of graphing. A hashable is any object which can be hashed, an unhashable is any object which cannot be hashed. An ordered sequence is a sequence which preserves order, an unordered sequence does not preserve order. Now, here are a few more terms:
Unordered Hashable was for any method which removed duplicates, which didn't necessarily have to keep the order. It didn't have to work for unhashables, but it could.
Ordered Hashable was for any method which kept the order of the items in the list, but it didn't have to work for unhashables, but it could.
Ordered Unhashable was any method which kept the order of the items in the list, and worked for unhashables.
On the y-axis is the amount of seconds it took.
On the x-axis is the number the function was applied to.
I generated sequences for unordered hashables and ordered hashables with the following comprehension: [list(range(x)) + list(range(x)) for x in range(0, 1000, 10)]
For ordered unhashables: [[list(range(y)) + list(range(y)) for y in range(x)] for x in range(0, 1000, 10)]
Note there is a step in the range because without it, this would've taken 10x as long, and because, in my personal opinion, it looks a little easier to read.
Also note that the keys in the legend are what I tried to guess as the most vital parts of each function's implementation. As for which function does worst or best, the graphs speak for themselves.
With that settled, here are the graphs.
[Graphs omitted: "Unordered Hashables", "Ordered Hashables", and "Ordered Unhashables", each with a zoomed-in view.]
Very late answer.
If you don't care about the list order, you can use *arg expansion with set uniqueness to remove dupes, i.e.:
l = [*{*l}]
A colleague sent me the accepted answer as part of his code for a code review today.
While I certainly admire the elegance of the answer in question, I am not happy with the performance.
I have tried this solution (I use set to reduce lookup time)
def ordered_set(in_list):
    out_list = []
    added = set()
    for val in in_list:
        if val not in added:
            out_list.append(val)
            added.add(val)
    return out_list
To compare efficiency, I used a random sample of 100 integers - 62 were unique
from random import randint
x = [randint(0,100) for _ in xrange(100)]
In [131]: len(set(x))
Out[131]: 62
Here are the results of the measurements
In [129]: %timeit list(OrderedDict.fromkeys(x))
10000 loops, best of 3: 86.4 us per loop
In [130]: %timeit ordered_set(x)
100000 loops, best of 3: 15.1 us per loop
Well, what happens if set is removed from the solution?
def ordered_set(inlist):
    out_list = []
    for val in inlist:
        if val not in out_list:
            out_list.append(val)
    return out_list
The result is not as bad as with the OrderedDict, but still more than 3 times slower than the set-based solution
In [136]: %timeit ordered_set(x)
10000 loops, best of 3: 52.6 us per loop
Another way of doing:
>>> seq = [1,2,3,'a', 'a', 1,2]
>>> dict.fromkeys(seq).keys()
['a', 1, 2, 3]
Simple and easy:
myList = [1, 2, 3, 1, 2, 5, 6, 7, 8]
cleanlist = []
[cleanlist.append(x) for x in myList if x not in cleanlist]
Output:
>>> cleanlist
[1, 2, 3, 5, 6, 7, 8]
I had a dict in my list, so I could not use the above approach. I got the error:
TypeError: unhashable type:
So if you care about order and/or some items are unhashable, then you might find this useful:
def make_unique(original_list):
    unique_list = []
    [unique_list.append(obj) for obj in original_list if obj not in unique_list]
    return unique_list
Some may consider list comprehension with a side effect to not be a good solution. Here's an alternative:
def make_unique(original_list):
    unique_list = []
    # list() forces the lazy map object to be evaluated on Python 3
    list(map(lambda x: unique_list.append(x) if (x not in unique_list) else False, original_list))
    return unique_list
All the order-preserving approaches I've seen here so far either use naive comparison (with O(n²) time complexity at best) or heavy-weight OrderedDict/set+list combinations that are limited to hashable inputs. Here is a hash-independent O(n log n) solution:
Update added the key argument, documentation and Python 3 compatibility.
# from functools import reduce <-- add this import on Python 3
def uniq(iterable, key=lambda x: x):
    """
    Remove duplicates from an iterable. Preserves order.
    :type iterable: Iterable[Ord => A]
    :param iterable: an iterable of objects of any orderable type
    :type key: Callable[A] -> (Ord => B)
    :param key: optional argument; by default an item (A) is discarded
    if another item (B), such that A == B, has already been encountered and taken.
    If you provide a key, this condition changes to key(A) == key(B); the callable
    must return orderable objects.
    """
    # Enumerate the list to restore order later; reduce the sorted list; restore order
    def append_unique(acc, item):
        return acc if key(acc[-1][1]) == key(item[1]) else acc.append(item) or acc
    srt_enum = sorted(enumerate(iterable), key=lambda item: key(item[1]))
    return [item[1] for item in sorted(reduce(append_unique, srt_enum, [srt_enum[0]]))]
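For example, it keeps first occurrences and, unlike the set-based approaches, also works on unhashable items (remember to enable the reduce import on Python 3):
>>> uniq([[1], [2], [1], [3]])
[[1], [2], [3]]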
If you want to preserve the order, and not use any external modules here is an easy way to do this:
>>> t = [1, 9, 2, 3, 4, 5, 3, 6, 7, 5, 8, 9]
>>> list(dict.fromkeys(t))
[1, 9, 2, 3, 4, 5, 6, 7, 8]
Note: this method preserves the order of appearance, so, as seen above, nine comes after one because that was the first time it appeared. This, however, is the same result as you would get by doing
from collections import OrderedDict
ulist=list(OrderedDict.fromkeys(l))
but it is much shorter, and runs faster.
This works because each time fromkeys encounters a key that already exists, it simply overwrites the value. This doesn't affect the dictionary at all, since fromkeys creates a dictionary in which all keys have the value None; it effectively eliminates all duplicates this way while keeping the first-insertion order.
I've compared the various suggestions with perfplot. It turns out that, if the input array doesn't have duplicate elements, all methods are more or less equally fast, independently of whether the input data is a Python list or a NumPy array.
If the input array is large, but contains just one unique element, then the set, dict and np.unique methods are constant-time if the input data is a list. If it's a NumPy array, np.unique is about 10 times faster than the other alternatives.
It's somewhat surprising to me that those are not constant-time operations, too.
Code to reproduce the plots:
import perfplot
import numpy as np
import matplotlib.pyplot as plt
def setup_list(n):
    # return list(np.random.permutation(np.arange(n)))
    return [0] * n

def setup_np_array(n):
    # return np.random.permutation(np.arange(n))
    return np.zeros(n, dtype=int)

def list_set(data):
    return list(set(data))

def numpy_unique(data):
    return np.unique(data)

def list_dict(data):
    return list(dict.fromkeys(data))

b = perfplot.bench(
    setup=[
        setup_list,
        setup_list,
        setup_list,
        setup_np_array,
        setup_np_array,
        setup_np_array,
    ],
    kernels=[list_set, numpy_unique, list_dict, list_set, numpy_unique, list_dict],
    labels=[
        "list(set(lst))",
        "np.unique(lst)",
        "list(dict(lst))",
        "list(set(arr))",
        "np.unique(arr)",
        "list(dict(arr))",
    ],
    n_range=[2 ** k for k in range(23)],
    xlabel="len(array)",
    equality_check=None,
)
# plt.title("input array = [0, 1, 2,..., n]")
plt.title("input array = [0, 0,..., 0]")
b.save("out.png")
b.show()
You could also do this:
>>> t = [1, 2, 3, 3, 2, 4, 5, 6]
>>> s = [x for i, x in enumerate(t) if i == t.index(x)]
>>> s
[1, 2, 3, 4, 5, 6]
The reason the above works is that the index method returns only the first index of an element; duplicate elements have higher indices. Refer to here:
list.index(x[, start[, end]])
Return zero-based index in the list of
the first item whose value is x. Raises a ValueError if there is no
such item.
A simple approach to removing duplicates from a list (note that it does not preserve order) is to use the built-in set() function, then convert that set back into a list:
In [2]: some_list = ['a','a','v','v','v','c','c','d']
In [3]: list(set(some_list))
Out[3]: ['a', 'c', 'd', 'v']
You can use set to remove duplicates:
mylist = list(set(mylist))
But note the results will be unordered. If that's an issue:
mylist.sort()
Try using sets (note: the sets module is Python 2 only; in modern Python, use the built-in set type):
import sets
t = sets.Set(['a', 'b', 'c', 'd'])
t1 = sets.Set(['a', 'b', 'c'])
print t | t1
print t - t1
Another good approach could be to use pandas:
import pandas as pd
myList = [1, 2, 3, 1, 2, 5, 6, 7, 8]
cleanList = pd.Series(myList).drop_duplicates().tolist()
print(cleanList)
#> [1, 2, 3, 5, 6, 7, 8]
and the order remains preserved.
This one cares about the order without too much hassle (no OrderedDict & others). Probably not the most Pythonic way, nor the shortest, but it does the trick:
def remove_duplicates(item_list):
    ''' Removes duplicate items from a list '''
    singles_list = []
    for element in item_list:
        if element not in singles_list:
            singles_list.append(element)
    return singles_list
Reduce variant with order preserved:
Assume that we have a list:
l = [5, 6, 6, 1, 1, 2, 2, 3, 4]
Reduce variant (inefficient):
>>> from functools import reduce  # needed on Python 3
>>> reduce(lambda r, v: v in r and r or r + [v], l, [])
[5, 6, 1, 2, 3, 4]
5x faster, but more sophisticated:
>>> reduce(lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r, l, ([], set()))[0]
[5, 6, 1, 2, 3, 4]
Explanation:
default = (list(), set())
# use list to keep order
# use set to make lookup faster

def reducer(result, item):
    if item not in result[1]:
        result[0].append(item)
        result[1].add(item)
    return result

reduce(reducer, l, default)[0]
There are many other answers suggesting different ways to do this, but they're all batch operations, and some of them throw away the original order. That might be okay depending on what you need, but if you want to iterate over the values in the order of the first instance of each value, and you want to remove the duplicates on-the-fly versus all at once, you could use this generator:
def uniqify(iterable):
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item
This returns a generator/iterator, so you can use it anywhere that you can use an iterator.
for unique_item in uniqify([1, 2, 3, 4, 3, 2, 4, 5, 6, 7, 6, 8, 8]):
    print(unique_item, end=' ')
print()
Output:
1 2 3 4 5 6 7 8
If you do want a list, you can do this:
unique_list = list(uniqify([1, 2, 3, 4, 3, 2, 4, 5, 6, 7, 6, 8, 8]))
print(unique_list)
Output:
[1, 2, 3, 4, 5, 6, 7, 8]
You can use the following function:
def rem_dupes(dup_list):
    yooneeks = []
    for elem in dup_list:
        if elem not in yooneeks:
            yooneeks.append(elem)
    return yooneeks
Example:
my_list = ['this','is','a','list','with','dupicates','in', 'the', 'list']
Usage:
rem_dupes(my_list)
['this', 'is', 'a', 'list', 'with', 'dupicates', 'in', 'the']
Using set:
a = [0, 1, 2, 3, 4, 3, 3, 4]
a = list(set(a))
print(a)
Using unique:
import numpy as np
a = [0, 1, 2, 3, 4, 3, 3, 4]
a = np.unique(a).tolist()
print(a)
Without using set
data = [1, 2, 3, 1, 2, 5, 6, 7, 8]
uni_data = []
for dat in data:
    if dat not in uni_data:
        uni_data.append(dat)
print(uni_data)
The Magic of Python's Built-in Types
In Python, it is very easy to handle complicated cases like this using only the built-in types.
Let me show you how!
Method 1: General Case
The way (1 line of code) to remove duplicate elements from a list while keeping the original order:
line = [1, 2, 3, 1, 2, 5, 6, 7, 8]
new_line = sorted(set(line), key=line.index) # remove duplicated element
print(new_line)
You will get the result
[1, 2, 3, 5, 6, 7, 8]
Method 2: Special Case
TypeError: unhashable type: 'list'
The special case for processing unhashable elements (3 lines of code):
line=[['16.4966155686595', '-27.59776154691', '52.3786295521147']
,['16.4966155686595', '-27.59776154691', '52.3786295521147']
,['17.6508629295574', '-27.143305738671', '47.534955022564']
,['17.6508629295574', '-27.143305738671', '47.534955022564']
,['18.8051102904552', '-26.688849930432', '42.6912804930134']
,['18.8051102904552', '-26.688849930432', '42.6912804930134']
,['19.5504702331098', '-26.205884452727', '37.7709192714727']
,['19.5504702331098', '-26.205884452727', '37.7709192714727']
,['20.2929416861422', '-25.722717575124', '32.8500163147157']
,['20.2929416861422', '-25.722717575124', '32.8500163147157']]
tuple_line = [tuple(pt) for pt in line] # convert list of list into list of tuple
tuple_new_line = sorted(set(tuple_line),key=tuple_line.index) # remove duplicated element
new_line = [list(t) for t in tuple_new_line] # convert list of tuple into list of list
print (new_line)
You will get the result :
[
['16.4966155686595', '-27.59776154691', '52.3786295521147'],
['17.6508629295574', '-27.143305738671', '47.534955022564'],
['18.8051102904552', '-26.688849930432', '42.6912804930134'],
['19.5504702331098', '-26.205884452727', '37.7709192714727'],
['20.2929416861422', '-25.722717575124', '32.8500163147157']
]
This works because tuples are hashable, and converting data between list and tuple is easy.
The code below is a simple way of removing duplicates from a list:
def remove_duplicates(x):
    a = []
    for i in x:
        if i not in a:
            a.append(i)
    return a

print(remove_duplicates([1, 2, 2, 3, 3, 4]))
It returns [1, 2, 3, 4].
Here's the fastest Pythonic solution compared to the others listed in the replies.
Using the implementation details of short-circuit evaluation allows the use of a list comprehension, which is fast enough. visited.add(item) always returns None as a result, which is evaluated as False, so the right side of or would always be the result of such an expression.
Time it yourself:
def deduplicate(sequence):
    visited = set()
    adder = visited.add  # get rid of qualification overhead
    out = [adder(item) or item for item in sequence if item not in visited]
    return out
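For example:
>>> deduplicate([1, 2, 1, 3, 2])
[1, 2, 3]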

Fast sorting of large nested lists

I am looking to find out the likelihood of parameter combinations using Monte Carlo Simulation.
I've got 4 parameters and each can have about 250 values.
I have randomly generated 250,000 scenarios for each of those parameters using some probability distribution function.
I now want to find out which parameter combinations are the most likely to occur.
To achieve this I have started by filtering out any duplicates from my 250,000 randomly generated samples in order to reduce the length of the list.
I then iterated through this reduced list and checked how many times each scenario occurs in the original 250,000 long list.
I have a large list of 250,000 items which contains lists, as such :
a = [[1,2,5,8],[1,2,5,8],[3,4,5,6],[3,4,5,7],....,[3,4,5,7]]# len(a) is equal to 250,000
I want to find a fast and efficient way of having each list within my list only occurring once.
The end goal is to count the occurrences of each list within list a.
so far I've got:
'''Removing duplicates from list a and storing this as a new list temp'''
b_set = set(tuple(x) for x in a)
temp = [ list(x) for x in b_set ]
temp.sort(key = lambda x: a.index(x) )
''' I then iterate through each of my possible lists (i.e. temp) and count how many times they occur in a'''
most_likely_dict = {}
for scenario in temp:
    freq = list(scenario_list).count(scenario)
    most_likely_dict[str(scenario)] = freq
at the moment it takes a good 15 minutes to perform ... Any suggestion on how to turn that into a few seconds would be greatly appreciated !!
You can take out the sorting part, as the final result is a dictionary which will be unordered in any case, then use a dict comprehension:
>>> a = [[1,2],[1,2],[3,4,5],[3,4,5], [3,4,5]]
>>> a_tupled = [tuple(i) for i in a]
>>> b_set = set(a_tupled)
>>> {repr(i): a_tupled.count(i) for i in b_set}
{'(1, 2)': 2, '(3, 4, 5)': 3}
calling list on your tuples will add more overhead, but you can if you want to
>>> {repr(list(i)): a_tupled.count(i) for i in b_set}
{'[3, 4, 5]': 3, '[1, 2]': 2}
Or just use a Counter:
>>> from collections import Counter
>>> Counter(tuple(i) for i in a)
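which, for the same a as above, gives:
Counter({(3, 4, 5): 3, (1, 2): 2})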
Alternatively, a dict comprehension that counts each item directly (quadratic, since count rescans the whole list for every item):
{str(item): a.count(item) for item in a}
Input:
a = [[1,2,5,8],[1,2,5,8],[3,4,5,6],[3,4,5,7],[3,4,5,7]]
Output:
{'[3, 4, 5, 6]': 1, '[1, 2, 5, 8]': 2, '[3, 4, 5, 7]': 2}

Sorting list of lists by a third list of specified non-sorted order

I have a list:
[['18411971', 'kinase_2', 36], ['75910712', 'unnamed...', 160], ...
It is about 60 entries long, and each entry is a list with three values.
I want to sort this bigger list by the first value in an order specified by another list which has them in the desired order.
The usual idiom is to sort using a key:
>>> a = [[1,2],[2,10,10],[3,4,'fred']]
>>> b = [2,1,3]
>>> sorted(a,key=lambda x: b.index(x[0]))
[[2, 10, 10], [1, 2], [3, 4, 'fred']]
This can have performance issues, though-- if the keys are hashable, this will probably be faster for long lists:
>>> order_dict = dict(zip(b, range(len(b))))
>>> sorted(a,key=lambda x: order_dict[x[0]])
[[2, 10, 10], [1, 2], [3, 4, 'fred']]
How about:
inputlist = [['18411971', 'kinase_2', 36], ['75910712', 'unnamed...', 160], ... # obviously not valid syntax
auxinput = aux = ['75910712', '18411971', ...] # ditto
keyed = { sublist[0]:sublist for sublist in inputlist }
result = [keyed[item] for item in auxinput]
There is no need to use sorting here. For large lists this would be faster, because it's O(n) rather than O(n * log n).
In case the keys aren't unique, it is possible to use some variant of an ordered dict (e.g. defaultdict(list) as per Niklas B's suggestion) to build the keyed representation.
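A sketch of that defaultdict variant, assuming the same hypothetical inputlist/auxinput names as above, for the case where first elements may repeat:
from collections import defaultdict

keyed = defaultdict(list)
for sublist in inputlist:
    keyed[sublist[0]].append(sublist)

# all sublists sharing a key stay grouped, in auxinput's order
result = [entry for item in auxinput for entry in keyed[item]]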
So, if I understand you right, you have your sample input list:
a = [['18411971', 'kinase_2', 36], ['75910712', 'unnamed...', 160], ...
and you want to sort this using an extra list that mentions the order in which the first elements of the sub-lists are to occur in the output:
aux = ['75910712', '18411971', ...]
If that's right, I think the result can be achieved using something like:
sorted(a, key = lambda x: aux.index(x[0]))

Converting a list to a set changes element order

Recently I noticed that when I am converting a list to set the order of elements is changed and is sorted by character.
Consider this example:
x=[1,2,20,6,210]
print(x)
# [1, 2, 20, 6, 210] # the order is same as initial order
set(x)
# set([1, 2, 20, 210, 6]) # in the set(x) output order is sorted
My questions are -
Why is this happening?
How can I do set operations (especially set difference) without losing the initial order?
A set is an unordered data structure, so it does not preserve the insertion order.
This depends on your requirements. If you have a normal list, and want to remove some set of elements while preserving the order of the list, you can do this with a list comprehension:
>>> a = [1, 2, 20, 6, 210]
>>> b = set([6, 20, 1])
>>> [x for x in a if x not in b]
[2, 210]
If you need a data structure that supports both fast membership tests and preservation of insertion order, you can use the keys of a Python dictionary, which starting from Python 3.7 is guaranteed to preserve the insertion order:
>>> a = dict.fromkeys([1, 2, 20, 6, 210])
>>> b = dict.fromkeys([6, 20, 1])
>>> dict.fromkeys(x for x in a if x not in b)
{2: None, 210: None}
b doesn't really need to be ordered here – you could use a set as well. Note that a.keys() - b.keys() returns the set difference as a set, so it won't preserve the insertion order.
In older versions of Python, you can use collections.OrderedDict instead:
>>> a = collections.OrderedDict.fromkeys([1, 2, 20, 6, 210])
>>> b = collections.OrderedDict.fromkeys([6, 20, 1])
>>> collections.OrderedDict.fromkeys(x for x in a if x not in b)
OrderedDict([(2, None), (210, None)])
Note that set() does not keep insertion order in any Python version - in CPython 3.6+ it is dict, not set, that preserves it - but here is another solution that works in both Python 2 and 3:
>>> x = [1, 2, 20, 6, 210]
>>> sorted(set(x), key=x.index)
[1, 2, 20, 6, 210]
Remove duplicates and preserve order with the function below (seen.add returns None, so the or clause both records the element and evaluates falsy):
def unique(sequence):
    seen = set()
    return [x for x in sequence if not (x in seen or seen.add(x))]
How to remove duplicates from a list while preserving order in Python
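For illustration, a quick run of unique() on the question's list with some duplicates appended (a hypothetical input):
>>> unique([1, 2, 20, 6, 210, 2, 1])
[1, 2, 20, 6, 210]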
Answering your first question: a set is a data structure optimized for set operations. Like a mathematical set, it does not enforce or maintain any particular order of its elements; since the abstract concept of a set does not guarantee order, the implementation is not required to either. When you create a set from a list, Python is free to reorder the elements to suit the internal hash-based implementation, which is what lets it perform set operations efficiently.
In mathematics, there are sets and ordered sets (osets).
set: an unordered container of unique elements (Implemented)
oset: an ordered container of unique elements (NotImplemented)
In Python, only sets are directly implemented. We can emulate osets with regular dict keys (3.7+).
Given
a = [1, 2, 20, 6, 210, 2, 1]
b = {2, 5, 6}
Code
oset = dict.fromkeys(a).keys()
# dict_keys([1, 2, 20, 6, 210])
Demo
Replicates are removed; insertion order is preserved.
list(oset)
# [1, 2, 20, 6, 210]
Set-like operations on dict keys.
oset - b
# {1, 20, 210}
oset | b
# {1, 2, 5, 6, 20, 210}
oset & b
# {2, 6}
oset ^ b
# {1, 5, 20, 210}
Details
Note: an unordered structure does not preclude ordered elements. Rather, it means that order is simply not guaranteed to be maintained. Example:
assert {1, 2, 3} == {2, 3, 1} # sets (order is ignored)
assert [1, 2, 3] != [2, 3, 1] # lists (order is guaranteed)
One may be pleased to discover that the list and the multiset (mset) are two more fascinating mathematical data structures:
list: an ordered container of elements that permits replicates (Implemented)
mset: an unordered container of elements that permits replicates (NotImplemented)*
Summary
Container | Ordered | Unique | Implemented
----------|---------|--------|------------
set | n | y | y
oset | y | y | n
list | y | n | y
mset | n | n | n*
*A multiset can be indirectly emulated with collections.Counter(), a dict-like mapping of multiplicities (counts).
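To illustrate that last point, a short sketch of Counter acting as a multiset; the operators below are standard collections.Counter semantics:
from collections import Counter

m1 = Counter([1, 1, 2, 3])  # the multiset {1, 1, 2, 3}
m2 = Counter([1, 2, 2])     # the multiset {1, 2, 2}

m1 + m2  # Counter({1: 3, 2: 3, 3: 1})  sum: counts add
m1 - m2  # Counter({1: 1, 3: 1})        difference: negative counts are dropped
m1 & m2  # Counter({1: 1, 2: 1})        intersection: minimum of counts
m1 | m2  # Counter({1: 2, 2: 2, 3: 1})  union: maximum of counts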
You can remove the duplicated values and keep the insertion order with one line of code (tested on Python 3.8.2):
mylist = ['b', 'b', 'a', 'd', 'd', 'c']
results = list({value: "" for value in mylist})
print(results)
# ['b', 'a', 'd', 'c']
results = list(dict.fromkeys(mylist))
print(results)
# ['b', 'a', 'd', 'c']
As noted in other answers, sets are data structures (and mathematical concepts) that do not preserve the element order.
However, by using a combination of sets and dictionaries, you can achieve whatever you want - try using these snippets:
# save the element order in a dict:
x_dict = {x: i for i, x in enumerate(my_list)}
x_set = set(my_list)
# perform desired set operations
...
# retrieve an ordered list from the resulting set by sorting on the
# saved positions (a fixed-size list indexed by original position
# breaks once elements have been removed):
new_list = sorted(new_set, key=x_dict.get)
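A concrete run of the snippet above, using the question's list (removing {20, 6} is an arbitrary example of a set operation):
my_list = [1, 2, 20, 6, 210]
x_dict = {x: i for i, x in enumerate(my_list)}

new_set = set(my_list) - {20, 6}            # any set operation works here
new_list = sorted(new_set, key=x_dict.get)  # restore the original relative order
print(new_list)  # [1, 2, 210]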
Building on Sven's answer, I found that using collections.OrderedDict like so helped me accomplish what you want, and also let me add more items to the dict:
import collections
x=[1,2,20,6,210]
z=collections.OrderedDict.fromkeys(x)
z
OrderedDict([(1, None), (2, None), (20, None), (6, None), (210, None)])
And you can get the elements back from the dict's keys, set-like but in order:
list(z.keys())
[1, 2, 20, 6, 210]
If you want to add items but still treat it like a set you can just do:
z['nextitem'] = None
Another simple way is to create an empty list, say "unique_list", and append to it the unique elements from the original list:
unique_list = []
for i in original_list:
    if i not in unique_list:
        unique_list.append(i)
This will give you all the unique elements as well as maintain the order.
Late to answer, but you can use pandas (pd.Series plus pd.unique) to deduplicate a list while preserving the order:
import pandas as pd
x = pd.Series([1, 2, 20, 6, 210, 2, 1])
print(pd.unique(x))
Output:
array([ 1, 2, 20, 6, 210])
It works for a list of strings too:
x = pd.Series(['c', 'k', 'q', 'n', 'p','c', 'n'])
print(pd.unique(x))
Output:
['c' 'k' 'q' 'n' 'p']
An implementation of the highest-scored answer's concept above that brings the result back to a list:
def SetOfListInOrder(incominglist):
    from collections import OrderedDict
    outtemp = OrderedDict()
    for item in incominglist:
        outtemp[item] = None
    return list(outtemp)
Tested (briefly) on Python 3.6 and Python 2.7.
In case you have a small number of elements in the two initial lists on which you want to do the set-difference operation, then instead of using collections.OrderedDict, which complicates the implementation and makes it less readable, you can use:
# initial lists on which you want to do set difference
>>> nums = [1,2,2,3,3,4,4,5]
>>> evens = [2,4,4,6]
>>> evens_set = set(evens)
>>> result = []
>>> for n in nums:
...     if n not in evens_set and n not in result:
... result.append(n)
...
>>> result
[1, 3, 5]
Its time complexity is not as good (the n not in result check scans the result list each time), but it is neat and easy to read.
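If the inputs grow large, here is a minimal sketch of an O(n) variant that replaces the list membership test with an auxiliary seen set:
nums = [1, 2, 2, 3, 3, 4, 4, 5]
evens_set = {2, 4, 6}

seen = set()
result = []
for n in nums:
    if n not in evens_set and n not in seen:  # both checks are O(1) on average
        seen.add(n)
        result.append(n)
print(result)  # [1, 3, 5]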
It's interesting that people always use 'real world problems' to poke fun at the definitions in theoretical science.
If sets had order, you would first need to figure out the following problems: if your list has duplicate elements, what should the order be when you turn it into a set? What is the order when we union two sets? What is the order when we intersect two sets that contain the same elements in different orders?
Plus, a set is much faster at looking up a particular key, which is exactly what set operations need (and that's why you want a set, not a list).
If you really care about the index, just keep the data as a list. If you still want to do set operations on the elements of many lists, the simplest way is to create, for each list, a dictionary that maps each key in the set to a list of all that key's indices in the original list.
def indx_dic(l):
    dic = {}
    for i in range(len(l)):
        if l[i] in dic:
            dic[l[i]].append(i)
        else:
            dic[l[i]] = [i]
    return dic
a = [1,2,3,4,5,1,3,2]
set_a = set(a)
dic_a = indx_dic(a)
print(dic_a)
# {1: [0, 5], 2: [1, 7], 3: [2, 6], 4: [3], 5: [4]}
print(set_a)
# {1, 2, 3, 4, 5}
We can use collections.Counter for this (as a dict subclass, it preserves insertion order on Python 3.7+):
# tested on python 3.7
>>> from collections import Counter
>>> lst = ["1", "2", "20", "6", "210"]
>>> for i in Counter(lst):
...     print(i, end=" ")
...
1 2 20 6 210
>>> for i in set(lst):
...     print(i, end=" ")
...
20 6 2 1 210
You can remove the duplicated values and keep the insertion order, if you want:
lst = [1, 2, 1, 3]
new_lst = []
for num in lst:
    if num not in new_lst:
        new_lst.append(num)
# new_lst = [1, 2, 3]
Don't use sets for removing duplicates if order is something you want; use sets for searching, i.e.
x in list
takes O(n) time, whereas
x in set
takes O(1) time in most cases.
Here's an easy way to do it - though note this returns the elements in sorted order, not the original list order:
x = [1, 2, 20, 6, 210]
print(sorted(set(x)))
# [1, 2, 6, 20, 210]
