Python: Choosing elements from a list using a list [duplicate]

I need to choose some elements from a given list, knowing their indices. Let's say I would like to create a new list containing the elements with indices 1, 2, and 5 from the given list [-2, 1, 5, 3, 8, 5, 6]. What I did is:
a = [-2,1,5,3,8,5,6]
b = [1,2,5]
c = [ a[i] for i in b]
Is there any better way to do it? something like c = a[b] ?

You can use operator.itemgetter:
from operator import itemgetter
a = [-2, 1, 5, 3, 8, 5, 6]
b = [1, 2, 5]
print(itemgetter(*b)(a))
# Result:
(1, 5, 5)
Or you can use numpy:
import numpy as np
a = np.array([-2, 1, 5, 3, 8, 5, 6])
b = [1, 2, 5]
print(list(a[b]))
# Result:
[1, 5, 5]
But really, your current solution is fine. It's probably the neatest out of all of them.

Alternatives:
>>> list(map(a.__getitem__, b))
[1, 5, 5]
>>> import operator
>>> operator.itemgetter(*b)(a)
(1, 5, 5)
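One edge case worth knowing about itemgetter (not mentioned in the answers above): with a single index it returns a bare value rather than a 1-tuple, so code that always expects a tuple needs to guard for it:
from operator import itemgetter

a = [-2, 1, 5, 3, 8, 5, 6]
print(itemgetter(*[1, 2, 5])(a))  # (1, 5, 5) -- a tuple
print(itemgetter(*[1])(a))        # 1 -- a bare scalar, not (1,)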

Another solution could be via pandas Series:
import pandas as pd
a = pd.Series([-2, 1, 5, 3, 8, 5, 6])
b = [1, 2, 5]
c = a[b]
You can then convert c back to a list if you want:
c = list(c)

Basic and not very extensive testing comparing the execution time of the five supplied answers:
import operator
import numpy as np

def numpyIndexValues(a, b):
    na = np.array(a)
    nb = np.array(b)
    out = list(na[nb])
    return out

def mapIndexValues(a, b):
    out = map(a.__getitem__, b)
    return list(out)

def getIndexValues(a, b):
    out = operator.itemgetter(*b)(a)
    return out

def pythonLoopOverlap(a, b):
    c = [a[i] for i in b]
    return c

multipleListItemValues = lambda searchList, ind: [searchList[i] for i in ind]
using the following input:
a = range(0, 10000000)
b = range(500, 500000)
The simple Python loop was the quickest, with the lambda operation a close second; mapIndexValues and getIndexValues were consistently pretty similar, while the numpy method was significantly slower when it had to convert the lists to numpy arrays first. If the data is already in numpy arrays, the numpyIndexValues method with the numpy.array conversion removed is quickest.
numpyIndexValues -> time:1.38940598 (when converted the lists to numpy arrays)
numpyIndexValues -> time:0.0193445 (using numpy array instead of python list as input, and conversion code removed)
mapIndexValues -> time:0.06477512099999999
getIndexValues -> time:0.06391049500000001
multipleListItemValues -> time:0.043773591
pythonLoopOverlap -> time:0.043021754999999995
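The answer doesn't show its timing code; a minimal harness along these lines (my reconstruction, not the author's exact code) produces comparable output:
import timeit

a = list(range(0, 10000000))
b = list(range(500, 500000))

for name, fn in [("numpyIndexValues", numpyIndexValues),
                 ("mapIndexValues", mapIndexValues),
                 ("getIndexValues", getIndexValues),
                 ("multipleListItemValues", multipleListItemValues),
                 ("pythonLoopOverlap", pythonLoopOverlap)]:
    t = timeit.timeit(lambda: fn(a, b), number=1)
    print("{} -> time:{}".format(name, t))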

Here's a simpler way:
a = [-2,1,5,3,8,5,6]
b = [1,2,5]
c = [e for i, e in enumerate(a) if i in b]

I'm sure this has already been considered: if the number of indices in b is small and constant, one could just write the result like:
c = [a[b[0]]] + [a[b[1]]] + [a[b[2]]]
Or even simpler, if the indices themselves are constants...
c = [a[1]] + [a[2]] + [a[5]]
Or if there is a consecutive range of indices...
c = a[1:3] + [a[5]]

List comprehension is clearly the most immediate and easiest to remember - in addition to being quite pythonic!
In any case, among the proposed solutions it is not the fastest (I ran my tests on Windows using Python 3.8.3):
import timeit
from itertools import compress
import random
from operator import itemgetter
import pandas as pd
import numpy as np
__N_TESTS__ = 10_000
vector = [str(x) for x in range(100)]
filter_indeces = sorted(random.sample(range(100), 10))
filter_boolean = random.choices([True, False], k=100)
# Different ways of selecting elements given indices

# list comprehension
def f1(v, f):
    return [v[i] for i in f]

# itemgetter
def f2(v, f):
    return itemgetter(*f)(v)

# using pandas.Series
# this is immensely slow
def f3(v, f):
    return list(pd.Series(v)[f])

# using map and __getitem__
def f4(v, f):
    return list(map(v.__getitem__, f))

# using enumerate!
def f5(v, f):
    return [x for i, x in enumerate(v) if i in f]

# using numpy array
def f6(v, f):
    return list(np.array(v)[f])
print("{:30s}:{:f} secs".format("List comprehension", timeit.timeit(lambda:f1(vector, filter_indeces), number=__N_TESTS__)))
print("{:30s}:{:f} secs".format("Operator.itemgetter", timeit.timeit(lambda:f2(vector, filter_indeces), number=__N_TESTS__)))
print("{:30s}:{:f} secs".format("Using Pandas series", timeit.timeit(lambda:f3(vector, filter_indeces), number=__N_TESTS__)))
print("{:30s}:{:f} secs".format("Using map and __getitem__", timeit.timeit(lambda: f4(vector, filter_indeces), number=__N_TESTS__)))
print("{:30s}:{:f} secs".format("Enumeration (Why anyway?)", timeit.timeit(lambda: f5(vector, filter_indeces), number=__N_TESTS__)))
My results are:
List comprehension :0.007113 secs
Operator.itemgetter :0.003247 secs
Using Pandas series :2.977286 secs
Using map and __getitem__ :0.005029 secs
Enumeration (Why anyway?) :0.135156 secs
Numpy :0.157018 secs

Static indexes and small list?
Don't forget that if the list is small and the indexes don't change, as in your example, sometimes the best thing is to use sequence unpacking:
_,a1,a2,_,_,a3,_ = a
The performance is much better and you can also save one line of code:
%timeit _,a1,a2,_,_,a3,_ = a
10000000 loops, best of 3: 154 ns per loop
%timeit itemgetter(*b)(a)
1000000 loops, best of 3: 753 ns per loop
%timeit [ a[i] for i in b]
1000000 loops, best of 3: 777 ns per loop
%timeit map(a.__getitem__, b)
1000000 loops, best of 3: 1.42 µs per loop

My answer does not use numpy or python collections.
One trivial way to find elements would be as follows:
a = [-2, 1, 5, 3, 8, 5, 6]
b = [1, 2, 5]
c = [i for i in a if i in b]
Note that this filters by value, not by index: it only reproduces [1, 5, 5] here because the values at indices 1, 2 and 5 happen to appear in b as well. Drawback: this method may not work for larger lists; using numpy is recommended for larger lists.

Kind of pythonic way:
c = [x for x in a if a.index(x) in b]
Beware that list.index returns the index of the first occurrence only, so this breaks as soon as a contains duplicate values.
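A quick illustration of the failure mode (hypothetical data of my own, not from the original answer):
a = [7, 7, 9]
b = [1]
print([x for x in a if a.index(x) in b])  # [] -- but a[1] is 7, so [7] was expected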

The results for the latest pandas==1.4.2 as of June 2022 are as follows.
Note that indexing a Series with a bare tuple of positions is no longer possible (use .iloc instead), and the benchmark results are faster than before.
import timeit
import pandas as pd
print(pd.__version__)
# 1.4.2
pd.Series([-2, 1, 5, 3, 8, 5, 6])[1, 2, 5]
# KeyError: 'key of type tuple not found and not a MultiIndex'
pd.Series([-2, 1, 5, 3, 8, 5, 6]).iloc[[1, 2, 5]].tolist()
# [1, 5, 5]
def extract_multiple_elements():
    return pd.Series([-2, 1, 5, 3, 8, 5, 6]).iloc[[1, 2, 5]].tolist()
__N_TESTS__ = 10_000
t1 = timeit.timeit(extract_multiple_elements, number=__N_TESTS__)
print(round(t1, 3), 'seconds')
# 1.035 seconds

Related

Is there, a better method to remove duplicates values from a list when using a dictionary instead of set(), to preserve the values insertion order? [duplicate]

How do I remove duplicates from a list, while preserving order? Using a set to remove duplicates destroys the original order.
Is there a built-in or a Pythonic idiom?
Here you have some alternatives: http://www.peterbe.com/plog/uniqifiers-benchmark
Fastest one:
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]
Why assign seen.add to seen_add instead of just calling seen.add? Python is a dynamic language, and resolving seen.add each iteration is more costly than resolving a local variable. seen.add could have changed between iterations, and the runtime isn't smart enough to rule that out. To play it safe, it has to check the object each time.
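If you want to measure the effect of the pre-binding yourself, a minimal timeit sketch (timings are machine- and version-dependent):
import timeit

setup = "data = list(range(100000)) * 2"

attribute_lookup = """
seen = set()
[x for x in data if not (x in seen or seen.add(x))]
"""

pre_bound = """
seen = set()
seen_add = seen.add
[x for x in data if not (x in seen or seen_add(x))]
"""

print("seen.add each time:", timeit.timeit(attribute_lookup, setup, number=20))
print("pre-bound seen_add:", timeit.timeit(pre_bound, setup, number=20))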
If you plan on using this function a lot on the same dataset, perhaps you would be better off with an ordered set: http://code.activestate.com/recipes/528878/
O(1) insertion, deletion and member-check per operation.
(Small additional note: seen.add() always returns None, so the or above is there only as a way to attempt a set update, and not as an integral part of the logical test.)
The best solution varies by Python version and environment constraints:
Python 3.7+ (and most interpreters supporting 3.6, as an implementation detail):
First introduced in PyPy 2.5.0, and adopted in CPython 3.6 as an implementation detail, before being made a language guarantee in Python 3.7, plain dict is insertion-ordered, and even more efficient than the (also C implemented as of CPython 3.5) collections.OrderedDict. So the fastest solution, by far, is also the simplest:
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(dict.fromkeys(items)) # Or [*dict.fromkeys(items)] if you prefer
[1, 2, 0, 3]
Like list(set(items)) this pushes all the work to the C layer (on CPython), but since dicts are insertion ordered, dict.fromkeys doesn't lose ordering. It's slower than list(set(items)) (takes 50-100% longer typically), but much faster than any other order-preserving solution (takes about half the time of hacks involving use of sets in a listcomp).
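A rough way to check the relative costs on your own machine (a sketch; the absolute numbers will vary):
import timeit

setup = "items = list(range(10000)) * 2"
print("set:          ", timeit.timeit("list(set(items))", setup, number=1000))
print("dict.fromkeys:", timeit.timeit("list(dict.fromkeys(items))", setup, number=1000))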
Important note: The unique_everseen solution from more_itertools (see below) has some unique advantages in terms of laziness and support for non-hashable input items; if you need these features, it's the only solution that will work.
Python 3.5 (and all older versions if performance isn't critical)
As Raymond pointed out, in CPython 3.5 where OrderedDict is implemented in C, ugly list comprehension hacks are slower than OrderedDict.fromkeys (unless you actually need the list at the end - and even then, only if the input is very short). So on both performance and readability the best solution for CPython 3.5 is the OrderedDict equivalent of the 3.6+ use of plain dict:
>>> from collections import OrderedDict
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(OrderedDict.fromkeys(items))
[1, 2, 0, 3]
On CPython 3.4 and earlier, this will be slower than some other solutions, so if profiling shows you need a better solution, keep reading.
Python 3.4 and earlier, if performance is critical and third-party modules are acceptable
As @abarnert notes, the more_itertools library (pip install more_itertools) contains a unique_everseen function that is built to solve this problem without any unreadable (not seen.add(x)) mutations in list comprehensions. This is the fastest solution too:
>>> from more_itertools import unique_everseen
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(unique_everseen(items))
[1, 2, 0, 3]
Just one simple library import and no hacks.
The module adapts the itertools recipe unique_everseen, which looks like:
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element
but unlike the itertools recipe, it supports non-hashable items (at a performance cost; if all elements in iterable are non-hashable, the algorithm becomes O(n²), vs. O(n) if they're all hashable).
Important note: Unlike all the other solutions here, unique_everseen can be used lazily; the peak memory usage will be the same (eventually, the underlying set grows to the same size), but if you don't listify the result, you just iterate it, you'll be able to process unique items as they're found, rather than waiting until the entire input has been deduplicated before processing the first unique item.
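For example, a sketch of consuming it lazily, stopping early without deduplicating the whole input (the endless generator here is my illustration, not from the original answer):
from itertools import islice
from more_itertools import unique_everseen

def numbers():
    # an endless, repetitive stream
    while True:
        yield from (1, 2, 0, 1, 3, 2)

# take the first three unique items without exhausting the stream
print(list(islice(unique_everseen(numbers()), 3)))  # [1, 2, 0]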
Python 3.4 and earlier, if performance is critical and third party modules are unavailable
You have two options:
Copy and paste the unique_everseen recipe into your code and use it per the more_itertools example above
Use ugly hacks to allow a single listcomp to both check and update a set to track what's been seen:
seen = set()
[x for x in seq if x not in seen and not seen.add(x)]
at the expense of relying on the ugly hack:
not seen.add(x)
which relies on the fact that set.add is an in-place method that always returns None so not None evaluates to True.
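To see why the hack evaluates the way it does:
>>> seen = set()
>>> print(seen.add(42))   # add() returns None
None
>>> not seen.add(43)      # not None -> True
True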
Note that all of the solutions above are O(n) (save calling unique_everseen on an iterable of non-hashable items, which is O(n²), while the others would fail immediately with a TypeError), so all solutions are performant enough when they're not the hottest code path. Which one to use depends on which versions of the language spec/interpreter/third-party modules you can rely on, whether or not performance is critical (don't assume it is; it usually isn't), and most importantly, readability (because if the person who maintains this code later ends up in a murderous mood, your clever micro-optimization probably wasn't worth it).
In CPython 3.6+ (and all other Python implementations starting with Python 3.7+), dictionaries are ordered, so the way to remove duplicates from an iterable while keeping it in the original order is:
>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
In Python 3.5 and below (including Python 2.7), use the OrderedDict. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5 (when it gained a C implementation; prior to 3.5 it's still the clearest solution, though not the fastest).
>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
Not to kick a dead horse (this question is very old and already has lots of good answers), but here is a solution using pandas that is quite fast in many circumstances and is dead simple to use.
import pandas as pd
my_list = [0, 1, 2, 3, 4, 1, 2, 3, 5]
>>> pd.Series(my_list).drop_duplicates().tolist()
# Output:
# [0, 1, 2, 3, 4, 5]
In Python 3.7 and above, dictionaries are guaranteed to remember their key insertion order. The answer to this question summarizes the current state of affairs.
The OrderedDict solution thus becomes obsolete and without any import statements we can simply issue:
>>> lst = [1, 2, 1, 3, 3, 2, 4]
>>> list(dict.fromkeys(lst))
[1, 2, 3, 4]
sequence = ['1', '2', '3', '3', '6', '4', '5', '6']
unique = []
[unique.append(item) for item in sequence if item not in unique]
# unique is now ['1', '2', '3', '6', '4', '5']
# (a list comprehension used purely for its side effect; a plain for loop is cleaner)
from itertools import groupby
[ key for key,_ in groupby(sortedList)]
The list doesn't even have to be sorted, the sufficient condition is that equal values are grouped together.
Edit: I assumed that "preserving order" implies that the list is actually ordered. If this is not the case, then the solution from MizardX is the right one.
Community edit: This is however the most elegant way to "compress duplicate consecutive elements into a single element".
I think if you want to maintain the order, you can try this:
list1 = ['b', 'c', 'd', 'b', 'c', 'a', 'a']
list2 = list(set(list1))
list2.sort(key=list1.index)
print(list2)
Or similarly you can do this:
list1 = ['b', 'c', 'd', 'b', 'c', 'a', 'a']
list2 = sorted(set(list1), key=list1.index)
print(list2)
You can also do this:
list1 = ['b', 'c', 'd', 'b', 'c', 'a', 'a']
list2 = []
for i in list1:
    if i not in list2:
        list2.append(i)
print(list2)
It can also be written as this:
list1 = ['b', 'c', 'd', 'b', 'c', 'a', 'a']
list2 = []
[list2.append(i) for i in list1 if i not in list2]
print(list2)
Just to add another (very performant) implementation of such a functionality from an external module¹: iteration_utilities.unique_everseen:
>>> from iteration_utilities import unique_everseen
>>> lst = [1,1,1,2,3,2,2,2,1,3,4]
>>> list(unique_everseen(lst))
[1, 2, 3, 4]
Timings
I did some timings (Python 3.6) and these show that it's faster than all other alternatives I tested, including OrderedDict.fromkeys, f7 and more_itertools.unique_everseen:
%matplotlib notebook

from iteration_utilities import unique_everseen
from collections import OrderedDict
from more_itertools import unique_everseen as mi_unique_everseen

def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

def iteration_utilities_unique_everseen(seq):
    return list(unique_everseen(seq))

def more_itertools_unique_everseen(seq):
    return list(mi_unique_everseen(seq))

def odict(seq):
    return list(OrderedDict.fromkeys(seq))

from simple_benchmark import benchmark

b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: list(range(2**i)) for i in range(1, 20)},
              'list size (no duplicates)')
b.plot()
And just to make sure I also did a test with more duplicates just to check if it makes a difference:
import random
b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(1, 20)},
              'list size (lots of duplicates)')
b.plot()
And one containing only one value:
b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: [1]*(2**i) for i in range(1, 20)},
              'list size (only duplicates)')
b.plot()
In all of these cases the iteration_utilities.unique_everseen function is the fastest (on my computer).
This iteration_utilities.unique_everseen function can also handle unhashable values in the input (however with O(n²) performance instead of the O(n) performance when the values are hashable).
>>> lst = [{1}, {1}, {2}, {1}, {3}]
>>> list(unique_everseen(lst))
[{1}, {2}, {3}]
¹ Disclaimer: I'm the author of that package.
For another very late answer to another very old question:
The itertools recipes have a function that does this, using the seen set technique, but:
Handles a standard key function.
Uses no unseemly hacks.
Optimizes the loop by pre-binding seen.add instead of looking it up N times. (f7 also does this, but some versions don't.)
Optimizes the loop by using filterfalse (ifilterfalse in Python 2), so you only have to loop over the unique elements in Python, instead of all of them. (You still iterate over all of them inside filterfalse, of course, but that's in C, and much faster.)
Is it actually faster than f7? It depends on your data, so you'll have to test it and see. If you want a list in the end, f7 uses a listcomp, and there's no way to do that here. (You can directly append instead of yielding, or you can feed the generator into the list function, but neither one can be as fast as the LIST_APPEND inside a listcomp.) At any rate, usually, squeezing out a few microseconds is not going to be as important as having an easily-understandable, reusable, already-written function that doesn't require DSU when you want to decorate.
As with all of the recipes, it's also available in more-itertools.
If you just want the no-key case, you can simplify it as:
from itertools import filterfalse  # itertools.ifilterfalse in Python 2

def unique(iterable):
    seen = set()
    seen_add = seen.add
    for element in filterfalse(seen.__contains__, iterable):
        seen_add(element)
        yield element
For non-hashable types (e.g. list of lists), based on MizardX's:
def f7_noHash(seq):
    seen = set()
    return [x for x in seq if str(x) not in seen and not seen.add(str(x))]
pandas users should check out pandas.unique.
>>> import pandas as pd
>>> lst = [1, 2, 1, 3, 3, 2, 4]
>>> pd.unique(lst)
array([1, 2, 3, 4])
The function returns a NumPy array. If needed, you can convert it to a list with the tolist method.
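So the round trip back to a list is a one-liner, using the ndarray tolist method mentioned above:
>>> pd.unique(lst).tolist()
[1, 2, 3, 4]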
A reduce variant, about 5x faster but more sophisticated:
>>> from functools import reduce  # Python 3; reduce is a builtin in Python 2
>>> l = [5, 6, 6, 1, 1, 2, 2, 3, 4]
>>> reduce(lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r, l, ([], set()))[0]
[5, 6, 1, 2, 3, 4]
Explanation:
default = (list(), set())
# use the list to keep the order
# use the set to make lookups faster

def reducer(result, item):
    if item not in result[1]:
        result[0].append(item)
        result[1].add(item)
    return result

>>> reduce(reducer, l, default)[0]
[5, 6, 1, 2, 3, 4]
[5, 6, 1, 2, 3, 4]
Here is a simple way to do it:
list1 = ["hello", " ", "w", "o", "r", "l", "d"]
sorted(set(list1), key=list1.index)
that gives the output:
["hello", " ", "w", "o", "r", "l", "d"]
Borrowing the recursive idea used in defining Haskell's nub function for lists, this is a recursive approach (in Python 3, filter returns an iterator, so it must be wrapped in list() before the recursive call):
def unique(lst):
    return [] if lst == [] else [lst[0]] + unique(list(filter(lambda x: x != lst[0], lst[1:])))
e.g.:
In [118]: unique([1,5,1,1,4,3,4])
Out[118]: [1, 5, 4, 3]
I tried it for growing data sizes and saw roughly linear scaling (not definitive, but it suggests this should be fine for normal data).
In [122]: %timeit unique(np.random.randint(5, size=(1)))
10000 loops, best of 3: 25.3 us per loop
In [123]: %timeit unique(np.random.randint(5, size=(10)))
10000 loops, best of 3: 42.9 us per loop
In [124]: %timeit unique(np.random.randint(5, size=(100)))
10000 loops, best of 3: 132 us per loop
In [125]: %timeit unique(np.random.randint(5, size=(1000)))
1000 loops, best of 3: 1.05 ms per loop
In [126]: %timeit unique(np.random.randint(5, size=(10000)))
100 loops, best of 3: 11 ms per loop
I also think it's interesting that this could be readily generalized to uniqueness by other operations. Like this:
import operator

def unique(lst, cmp_op=operator.ne):
    return [] if lst == [] else [lst[0]] + unique(list(filter(lambda x: cmp_op(x, lst[0]), lst[1:])), cmp_op)
For example, you could pass in a function that uses the notion of rounding to the same integer as if it was "equality" for uniqueness purposes, like this:
def test_round(x, y):
    return round(x) != round(y)
then unique(some_list, test_round) would provide the unique elements of the list where uniqueness no longer meant traditional equality (which is implied by using any sort of set-based or dict-key-based approach to this problem) but instead meant to take only the first element that rounds to K for each possible integer K that the elements might round to, e.g.:
In [6]: unique([1.2, 5, 1.9, 1.1, 4.2, 3, 4.8], test_round)
Out[6]: [1.2, 5, 1.9, 4.2, 3]
You can reference a list comprehension as it is being built by the symbol '_[1]'. (Warning: this relies on a CPython 2 implementation detail and does not work in Python 3.) For example, the following function unique-ifies a list of elements without changing their order by referencing its list comprehension.
def unique(my_list):
    return [x for x in my_list if x not in locals()['_[1]']]
Demo (Python 2):
l1 = [1, 2, 3, 4, 1, 2, 3, 4, 5]
l2 = [x for x in l1 if x not in locals()['_[1]']]
print l2
Output:
[1, 2, 3, 4, 5]
Eliminate duplicate values in a sequence while preserving the order of the remaining items, using a general-purpose generator function.
# for a sequence of hashable items
def remove_duplicates(items):
    seen = set()
    for item in items:
        if item not in seen:
            yield item
            seen.add(item)

a = [1, 5, 2, 1, 9, 1, 5, 10]
list(remove_duplicates(a))
# [1, 5, 2, 9, 10]

# for a sequence of unhashable items
def remove_duplicates(items, key=None):
    seen = set()
    for item in items:
        val = item if key is None else key(item)
        if val not in seen:
            yield item
            seen.add(val)

a = [{'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 1, 'y': 2}, {'x': 2, 'y': 4}]
list(remove_duplicates(a, key=lambda d: (d['x'], d['y'])))
# [{'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 2, 'y': 4}]
1. These solutions are fine…
For removing duplicates while preserving order, the excellent solution(s) proposed elsewhere on this page:
seen = set()
[x for x in seq if not (x in seen or seen.add(x))]
and variation(s), e.g.:
seen = set()
[x for x in seq if x not in seen and not seen.add(x)]
are indeed popular because they are simple, minimalistic, and deploy the correct hashing for optimal efficiency. The main complaint about these seems to be that using the invariant None "returned" by method seen.add(x) as a constant (and therefore excess/unnecessary) value in a logical expression—just for its side-effect—is hacky and/or confusing.
2. …but they waste one hash lookup per iteration.
Surprisingly, given the amount of discussion and debate on this topic, there is actually a significant improvement to the code that seems to have been overlooked. As shown, each "test-and-set" iteration requires two hash lookups: the first to test membership x not in seen and then again to actually add the value seen.add(x). Since the first operation guarantees that the second will always be successful, there is a wasteful duplication of effort here. And because the overall technique here is so efficient, the excess hash lookups will likely end up being the most expensive proportion of what little work remains.
3. Instead, let the set do its job!
Notice that the examples above only call set.add with the foreknowledge that doing so will always result in an increase in set membership. The set itself never gets a chance to reject a duplicate; our code snippet has essentially usurped that role for itself. The use of explicit two-step test-and-set code is robbing set of its core ability to exclude those duplicates itself.
4. The single-hash-lookup code:
The following version cuts the number of hash lookups per iteration in half—from two down to just one.
seen = set()
[x for x in seq if len(seen) < len(seen.add(x) or seen)]
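A quick sanity check that the single-lookup version matches the classic two-lookup one (example data of my own choosing):
seq = [3, 1, 3, 2, 1, 5]

seen = set()
classic = [x for x in seq if not (x in seen or seen.add(x))]

seen = set()
one_lookup = [x for x in seq if len(seen) < len(seen.add(x) or seen)]

assert classic == one_lookup == [3, 1, 2, 5]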
If you need a one-liner then maybe this would help:
from functools import reduce  # Python 3
reduce(lambda x, y: x + y if y[0] not in x else x, map(lambda x: [x], lst))
... should work, but correct me if I'm wrong
MizardX's answer gives a good collection of multiple approaches.
This is what I came up with while thinking aloud:
mylist = [x for i, x in enumerate(mylist) if x not in mylist[i+1:]]
Note that this keeps the last occurrence of each duplicate rather than the first, so the resulting order differs from the other answers.
You could do a sort of ugly list comprehension hack.
[l[i] for i in range(len(l)) if l.index(l[i]) == i]
Relatively efficient approach for sorted numpy arrays:
import numpy as np

b = np.array([1, 3, 3, 8, 12, 12, 12])
np.hstack([b[0], [x[0] for x in zip(b[1:], b[:-1]) if x[0] != x[1]]])
Outputs:
array([ 1, 3, 8, 12])
l = [1,2,2,3,3,...]
n = []
n.extend(ele for ele in l if ele not in set(n))
A generator expression that uses a set membership test to decide whether to include an element in the new list. Beware, though, that set(n) is rebuilt for every element here, so this form is actually quadratic; build the set once outside the expression to get the advertised O(1) lookups.
A simple recursive solution (note that it fails on an empty list, and Python's recursion limit caps the input size):
def uniquefy_list(a):
    return uniquefy_list(a[1:]) if a[0] in a[1:] else [a[0]] + uniquefy_list(a[1:]) if len(a) > 1 else [a[0]]
This will preserve order and run in O(n) time. Basically the idea is to create a hole wherever a duplicate is found and sink it down to the bottom. It makes use of a read pointer and a write pointer: whenever a duplicate is found, only the read pointer advances, and the write pointer stays on the duplicate entry to overwrite it.
def deduplicate(l):
    count = {}
    (read, write) = (0, 0)
    while read < len(l):
        if l[read] in count:
            read += 1
            continue
        count[l[read]] = True
        l[write] = l[read]
        read += 1
        write += 1
    return l[0:write]
x = [1, 2, 1, 3, 1, 4]

# brute force method
arr = []
for i in x:
    if i not in arr:
        arr.append(i)

# recursive method
tmp = []
def remove_duplicates(j=0):
    if j < len(x):
        if x[j] not in tmp:
            tmp.append(x[j])
        i = j + 1
        remove_duplicates(i)
remove_duplicates()
One liner list comprehension:
values_non_duplicated = [value for index, value in enumerate(values) if value not in values[ : index]]
If you routinely use pandas, and aesthetics is preferred over performance, then consider the built-in function pandas.Series.drop_duplicates:
import pandas as pd
import numpy as np
uniquifier = lambda alist: pd.Series(alist).drop_duplicates().tolist()
# from the chosen answer
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

alist = np.random.randint(low=0, high=1000, size=10000).tolist()
print(uniquifier(alist) == f7(alist))  # True
Timing:
In [104]: %timeit f7(alist)
1000 loops, best of 3: 1.3 ms per loop
In [110]: %timeit uniquifier(alist)
100 loops, best of 3: 4.39 ms per loop
A solution without using imported modules or sets:
text = "ask not what your country can do for you ask what you can do for your country"
sentence = text.split(" ")
noduplicates = [sentence[i] for i in range(0, len(sentence)) if sentence[i] not in sentence[:i]]
print(noduplicates)
Gives output:
['ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you']
An in-place method
This method is quadratic, because we have a linear lookup into the list for every element of the list (to that we have to add the cost of rearranging the list because of the del statements).
That said, it is possible to operate in place if we start from the end of the list and proceed toward the origin, removing each term that is present in the sub-list at its left.
This idea in code is simply
for i in range(len(l)-1, 0, -1):
    if l[i] in l[:i]:
        del l[i]
A simple test of the implementation
In [91]: from random import randint, seed
In [92]: seed('20080808') ; l = [randint(1,6) for _ in range(12)] # Beijing Olympics
In [93]: for i in range(len(l)-1,0,-1):
    ...:     print(l)
    ...:     print(i, l[i], l[:i], end='')
    ...:     if l[i] in l[:i]:
    ...:         print(': remove', l[i])
    ...:         del l[i]
    ...:     else:
    ...:         print()
    ...: print(l)
[6, 5, 1, 4, 6, 1, 6, 2, 2, 4, 5, 2]
11 2 [6, 5, 1, 4, 6, 1, 6, 2, 2, 4, 5]: remove 2
[6, 5, 1, 4, 6, 1, 6, 2, 2, 4, 5]
10 5 [6, 5, 1, 4, 6, 1, 6, 2, 2, 4]: remove 5
[6, 5, 1, 4, 6, 1, 6, 2, 2, 4]
9 4 [6, 5, 1, 4, 6, 1, 6, 2, 2]: remove 4
[6, 5, 1, 4, 6, 1, 6, 2, 2]
8 2 [6, 5, 1, 4, 6, 1, 6, 2]: remove 2
[6, 5, 1, 4, 6, 1, 6, 2]
7 2 [6, 5, 1, 4, 6, 1, 6]
[6, 5, 1, 4, 6, 1, 6, 2]
6 6 [6, 5, 1, 4, 6, 1]: remove 6
[6, 5, 1, 4, 6, 1, 2]
5 1 [6, 5, 1, 4, 6]: remove 1
[6, 5, 1, 4, 6, 2]
4 6 [6, 5, 1, 4]: remove 6
[6, 5, 1, 4, 2]
3 4 [6, 5, 1]
[6, 5, 1, 4, 2]
2 1 [6, 5]
[6, 5, 1, 4, 2]
1 5 [6]
[6, 5, 1, 4, 2]

calculations using both a value from a list and the index of that value?

I need to perform a calculation on a list using both the values of that list and the index of those values, e.g.:
a = [1,2,3,4,5]
b = a[1]*1
but for all values of a, to get:
b = [1,4,9,16,25]
Is this possible to do without a for loop? I could put it in one, but it would change the structure of my code for the worse. This is meant to be used by a wide variety of users with varying skill levels, so I'm willing to sacrifice speed for ease of readability.
You can use a list comprehension with the enumerate() function, which provides the index as well as the value of each element, simultaneously. Example -
b = [elem * (i+1) for i, elem in enumerate(a)]
Demo -
>>> a = [1,2,3,4,5]
>>> b = [elem * (i+1) for i, elem in enumerate(a)]
>>> b
[1, 4, 9, 16, 25]
If it's fine to use the numpy library, a faster solution would most probably be numpy's vectorised multiplication. Example -
In [1]: import numpy as np
In [2]: a = [1,2,3,4,5]
In [3]: a_arr = np.array(a)
In [4]: b = a_arr * np.arange(1,a_arr.shape[0]+1)
In [5]: b
Out[5]: array([ 1, 4, 9, 16, 25])
Though the numpy method may be a bit more complex to understand than the list comprehension.
I recommend using the built-in map function to keep your code simple.
a = [1, 2, 3, 4, 5]
b = list(map(lambda x: x*x, a))
print(b)
# [1, 4, 9, 16, 25]
(Note that this computes x*x, which matches the desired output only because a happens to equal [1, 2, 3, 4, 5].) If you want to generate the input dynamically, use the built-in range function:
b = list(map(lambda x: x*x, range(1, 6)))
print(b)
# [1, 4, 9, 16, 25]

number of values in a list greater than a certain number

I have a list of numbers and I want to count how many of them meet a certain criterion. I can use a list comprehension (or a list comprehension in a function), but I am wondering if someone has a shorter way.
# list of numbers
j=[4,5,6,7,1,3,7,5]
#list comprehension of values of j > 5
x = [i for i in j if i>5]
#value of x
len(x)
#or function version
def length_of_list(list_of_numbers, number):
    x = [i for i in list_of_numbers if i > number]
    return len(x)
length_of_list(j, 5)
is there an even more condensed version?
You could do something like this:
>>> j = [4, 5, 6, 7, 1, 3, 7, 5]
>>> sum(i > 5 for i in j)
3
It might initially seem strange to add True to True this way, but I don't think it's unpythonic; after all, bool is a subclass of int in all versions since 2.3:
>>> issubclass(bool, int)
True
You can create a smaller intermediate result like this:
>>> j = [4, 5, 6, 7, 1, 3, 7, 5]
>>> len([1 for i in j if i > 5])
3
If you are otherwise using numpy, you can save a few strokes, but I don't think it gets much faster or more compact than senderle's answer.
import numpy as np
j = np.array(j)
sum(j > 5)
A (somewhat) different way:
from functools import reduce  # Python 3
reduce(lambda acc, x: acc + (1 if x > 5 else 0), j, 0)
If you are using NumPy (as in ludaavic's answer), for large arrays you'll probably want to use NumPy's sum function rather than Python's builtin sum for a significant speedup -- e.g., a >1000x speedup for 10 million element arrays on my laptop:
>>> import numpy as np
>>> ten_million = 10 * 1000 * 1000
>>> x, y = (np.random.randn(ten_million) for _ in range(2))
>>> %timeit sum(x > y) # time Python builtin sum function
1 loops, best of 3: 24.3 s per loop
>>> %timeit (x > y).sum() # wow, that was really slow! time NumPy sum method
10 loops, best of 3: 18.7 ms per loop
>>> %timeit np.sum(x > y) # time NumPy sum function
10 loops, best of 3: 18.8 ms per loop
(above uses IPython's %timeit "magic" for timing)
A different way of counting, using the bisect module (which requires the list to be sorted; note that sort() mutates j in place):
>>> from bisect import bisect
>>> j = [4, 5, 6, 7, 1, 3, 7, 5]
>>> j.sort()
>>> b = 5
>>> index = bisect(j, b)  # find the insertion point for b
>>> print(len(j) - index)
3
I'll add a map and filter version because why not.
sum(map(lambda x:x>5, j))
sum(1 for _ in filter(lambda x:x>5, j))
You can do it like this using a function:
l = [34, 56, 78, 2, 3, 5, 6, 8, 45, 6]
print("The list : " + str(l))

def count_greater30(l):
    count = 0
    for i in l:
        if i > 30:
            count = count + 1
    return count

print("Count greater than 30 is : " + str(count_greater30(l)))
This is a little bit longer but the detailed solution for beginners:
from functools import reduce
from statistics import mean
two_dim_array = [[1, 5, 7, 3, 2], [2, 4, 1, 6, 8]]
# convert two dimensional array to one dimensional array
one_dim_array = reduce(list.__add__, two_dim_array)
arithmetic_mean = mean(one_dim_array)
exceeding_count = sum(i > arithmetic_mean for i in one_dim_array)

Python Add Elements of a List (or Set, or whatever Data Type is appropriate)

Is there an easy way to add the members of two equally sized lists (or tuples, or whatever data type would work best)?
I have, for example a and b with 2 elements:
a = (0, 10)
b = (0, -10)
I want to add them and get as result:
result = (0, 0)
NOT (0, 10, 0, -10)
You can do this in one line in Python:
list(map(sum, zip(A, B)))
Example:
>>> B = [1, 2, 3, 4]
>>> C = [1, 2, 4, 8]
>>> list(map(sum, zip(B, C)))
[2, 4, 7, 12]
(In Python 2, map returns a list directly, so the list() call isn't needed there.)
Three options:
>>> [a+b for (a, b) in zip(A, B)]
>>> list(map(int.__add__, A, B))
>>> list(map(sum, zip(A, B)))
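One caveat that applies to all three (my addition, not part of the original answer): zip stops at the shorter input, so lists of unequal length are silently truncated. On Python 3.10+ you can opt into an error instead:
>>> list(map(sum, zip([0, 10], [0, -10, 99])))  # the extra 99 is silently dropped
[0, 0]
>>> list(zip([0, 10], [0, -10, 99], strict=True))  # Python 3.10+
Traceback (most recent call last):
  ...
ValueError: zip() argument 2 is longer than argument 1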
If you want to operate on lists of numbers, use numpy:
>>> a = [1,2]
>>> b = [1,2]
>>> import numpy as np
>>> np.add(a,b)
array([2, 4])
>>>
You can use numpy to add the lists:
np.add(np.array([-1.2, 1.2]), np.array([1, 3]))
and the result is:
array([-0.2, 4.2])
ANSWER = []
for index in range(0, len(A)):
    ANSWER.append(A[index] + B[index])
Yes, just do this:
map(sum, zip(A, B))
or, (actually clearly faster):
u = map(lambda a, b: a + b, A, B)
Timing examples (Python 2, where map is eager and returns a list; in Python 3 wrap the map calls in list() to force the work):
A = range(3000)
B = range(3000)
for i in range(15000):
    u = map(lambda a, b: a + b, A, B)  # takes about 7.2 seconds
    # u = map(sum, zip(A, B))          # takes about 11 seconds
    # u = map(int.__add__, A, B)       # (Edward Loper) consistently ~0.5 sec slower than lambda

Python: Finding differences between elements of a list

Given a list of numbers, how does one find the difference between each i-th element and its (i+1)-th?
Is it better to use a lambda expression or maybe a list comprehension?
For example:
Given a list t=[1,3,6,...], the goal is to find a list v=[2,3,...] because 3-1=2, 6-3=3, etc.
>>> t
[1, 3, 6]
>>> [j-i for i, j in zip(t[:-1], t[1:])] # or use itertools.izip in py2k
[2, 3]
The other answers are correct but if you're doing numerical work, you might want to consider numpy. Using numpy, the answer is:
v = numpy.diff(t)
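For the example above:
>>> import numpy as np
>>> t = [1, 3, 6]
>>> np.diff(t)
array([2, 3])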
If you don't want to use numpy nor zip, you can use the following solution:
>>> t = [1, 3, 6]
>>> v = [t[i+1]-t[i] for i in range(len(t)-1)]
>>> v
[2, 3]
Starting in Python 3.10, with the new pairwise function it's possible to slide through pairs of elements and thus map on rolling pairs:
from itertools import pairwise
[y-x for (x, y) in pairwise([1, 3, 6, 7])]
# [2, 3, 1]
The intermediate result being:
pairwise([1, 3, 6, 7])
# [(1, 3), (3, 6), (6, 7)]
You can use itertools.tee and zip to efficiently build the result:
from itertools import tee
# python2 only:
#from itertools import izip as zip
def differences(seq):
    iterable, copied = tee(seq)
    next(copied)
    for x, y in zip(iterable, copied):
        yield y - x
Or using itertools.islice instead:
from itertools import islice
def differences(seq):
    nexts = islice(seq, 1, None)
    for x, y in zip(seq, nexts):
        yield y - x
You can also avoid using the itertools module:
def differences(seq):
    iterable = iter(seq)
    prev = next(iterable)
    for element in iterable:
        yield element - prev
        prev = element
All these solutions work in constant space if you don't need to store all the results, and they support infinite iterables.
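For instance, a small sketch of feeding one of these generators an infinite iterable (my example, using the last differences definition above):
from itertools import count, islice

diffs = differences(count(0, 3))  # differences of 0, 3, 6, 9, ...
print(list(islice(diffs, 5)))     # [3, 3, 3, 3, 3]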
Here are some micro-benchmarks of the solutions:
In [12]: L = range(10**6)
In [13]: from collections import deque
In [15]: %timeit deque(differences_tee(L), maxlen=0)
10 loops, best of 3: 122 ms per loop
In [16]: %timeit deque(differences_islice(L), maxlen=0)
10 loops, best of 3: 127 ms per loop
In [17]: %timeit deque(differences_no_it(L), maxlen=0)
10 loops, best of 3: 89.9 ms per loop
And the other proposed solutions:
In [18]: %timeit [x[1] - x[0] for x in zip(L[1:], L)]
10 loops, best of 3: 163 ms per loop
In [19]: %timeit [L[i+1]-L[i] for i in range(len(L)-1)]
1 loops, best of 3: 395 ms per loop
In [20]: import numpy as np
In [21]: %timeit np.diff(L)
1 loops, best of 3: 479 ms per loop
In [35]: %%timeit
...: res = []
...: for i in range(len(L) - 1):
...: res.append(L[i+1] - L[i])
...:
1 loops, best of 3: 234 ms per loop
Note that:
zip(L[1:], L) is equivalent to zip(L[1:], L[:-1]) since zip already terminates on the shortest input, however it avoids a whole copy of L.
Accessing single elements by index is very slow, because every index access is a method call in Python
numpy.diff is slow because it has to first convert the list to a ndarray. Obviously if you start with an ndarray it will be much faster:
In [22]: arr = np.array(L)
In [23]: %timeit np.diff(arr)
100 loops, best of 3: 3.02 ms per loop
I would suggest using
v = np.diff(t)
this is simple and easy to read.
But if you want v to have the same length as t then
v = np.diff([t[0]] + t) # for python 3.x
or
v = np.diff(t + [t[-1]])
FYI: this will only work for lists.
for numpy arrays
v = np.diff(np.append(t[0], t))
Using the := walrus operator available in Python 3.8+:
>>> t = [1, 3, 6]
>>> prev = t[0]; [-prev + (prev := x) for x in t[1:]]
[2, 3]
A functional approach:
>>> import operator
>>> a = [1, 3, 5, 7, 11, 13, 17, 21]
>>> list(map(operator.sub, a[1:], a[:-1]))
[2, 2, 2, 4, 2, 4, 4]
Using a generator:
>>> import operator, itertools
>>> g1, g2 = itertools.tee((x*x for x in range(5)), 2)
>>> list(map(operator.sub, itertools.islice(g1, 1, None), g2))
[1, 3, 5, 7]
Using indices:
>>> [a[i+1] - a[i] for i in range(len(a)-1)]
[2, 2, 2, 4, 2, 4, 4]
Ok. I think I found the proper solution:
v = [x[0]-x[1] for x in zip(t[1:],t[:-1])]
I suspect this is what the numpy diff command does anyway, but just for completeness you can simply difference the sub-vectors:
from numpy import array as a
a(x[1:])-a(x[:-1])
In addition, I wanted to add these solutions to generalizations of the question:
Solution with periodic boundaries
Sometimes with numerical integration you will want to difference a list with periodic boundary conditions (so the first element calculates the difference to the last). In this case the numpy.roll function is helpful:
v-np.roll(v,1)
Solutions with zero prepended
Another numpy solution (just for completeness) is to use
numpy.ediff1d(v)
This works like numpy.diff, but only on a vector (it flattens the input array). It offers the ability to prepend or append numbers to the resulting vector. This is useful when handling accumulated fields, as is often the case for fluxes in meteorological variables (e.g. rain, latent heat etc.), when you want a resulting list of the same length as the input variable, with the first entry untouched.
Then you would write
np.ediff1d(v,to_begin=v[0])
Of course, you can also do this with the np.diff command, in this case though you need to prepend zero to the series with the prepend keyword:
np.diff(v,prepend=0.0)
All the above solutions return a vector that is the same length as the input.
You can also convert the difference into an easily readable transition matrix using
v = t.reshape((c,r)).T - t.T
Where c = number of items in the list and r = 1 since a list is basically a vector or a 1d array.
My way:
>>> v = [1, 2, 3, 4, 5]
>>> [v[i] - v[i-1] for i, value in enumerate(v[1:], 1)]
[1, 1, 1, 1]
