I am looking for the fastest way to output the index of the first difference of two arrays in Python. For example, let's take the following two arrays:
test1 = [1, 3, 5, 8]
test2 = [1]
test3 = [1, 3]
Comparing test1 and test2, I would like to output 1, while the comparison of test1 and test3 should output 2.
In other words, I am looking for an equivalent to the statement:
import numpy as np
np.where(np.where(test1 == test2, test1, 0) == '0')[0][0]
with varying array lengths.
Any help is appreciated.
For lists this works:
from itertools import zip_longest
def find_first_diff(list1, list2):
    for index, (x, y) in enumerate(zip_longest(list1, list2,
                                               fillvalue=object())):
        if x != y:
            return index
zip_longest pads the shorter list with None or with a provided fill value. The standard zip does not work if the difference is caused by different list lengths rather than actual different values in the lists.
On Python 2 use izip_longest.
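To see why plain zip is not enough here, a small sketch may help. The helper names first_diff_zip and first_diff_zip_longest are made up for this illustration; the second one is just the function above with the fill value spelled out.

```python
from itertools import zip_longest

def first_diff_zip(list1, list2):
    # Plain zip stops at the shorter list, so it only sees the common prefix.
    for index, (x, y) in enumerate(zip(list1, list2)):
        if x != y:
            return index
    return None  # no difference found within the common prefix

def first_diff_zip_longest(list1, list2):
    # A fresh object() does not compare equal to ordinary list elements,
    # so the first padded position counts as a difference.
    sentinel = object()
    for index, (x, y) in enumerate(zip_longest(list1, list2, fillvalue=sentinel)):
        if x != y:
            return index
    return None  # the lists are identical

test1 = [1, 3, 5, 8]
test2 = [1]
test3 = [1, 3]

print(first_diff_zip(test1, test2))          # None: the length difference is missed
print(first_diff_zip_longest(test1, test2))  # 1
print(first_diff_zip_longest(test1, test3))  # 2
```

Returning None for identical lists is a choice made for this sketch only.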
Updated: Created a unique fill value to avoid potential problems with None as a list value. object() is unique:
>>> o1 = object()
>>> o2 = object()
>>> o1 == o2
False
This pure Python approach might be faster than a NumPy solution. This depends on the actual data and other circumstances.
Converting a list into a NumPy array also takes time. This might actually
take longer than finding the index with the function above. If you are not
going to use the NumPy array for other calculations, the conversion
might cause considerable overhead.
NumPy always searches the full array. If the difference comes early,
you do a lot more work than you need to.
NumPy creates a bunch of intermediate arrays. This costs memory and time.
NumPy needs to construct intermediate arrays with the maximum length.
Comparing many small with very large arrays is unfavorable here.
In general, in many cases NumPy is faster than a pure Python solution.
But each case is a bit different and there are situations where pure
Python is faster.
With NumPy arrays (which will be faster for big arrays), you could compare the overlapping parts first and then check the lengths, slicing the longer array to the length of the shorter. Something like the following:
import numpy as np
# test1 and test2 must be NumPy arrays here, e.g. np.array(some_list)
n = min(len(test1), len(test2))
x = np.where(test1[:n] != test2[:n])[0]
if len(x) > 0:
    ans = x[0]
elif len(test1) != len(test2):
    ans = n
else:
    ans = None
EDIT - despite this being voted down I will leave my answer up here in case someone else needs to do something similar.
If the starting arrays are large and already NumPy arrays, then this is the fastest method. Also, I had to modify Andy's code to get it to work. In order: 1. my suggestion, 2. Paidric's (now removed but the most elegant), 3. Andy's accepted answer, 4. zip (non-NumPy), 5. vanilla Python without zip as per @leekaiinthesky
0.1ms, 9.6ms, 0.6ms, 2.8ms, 2.3ms
If the conversion to ndarray is included in timeit, then the non-NumPy, non-zip method is fastest:
7.1ms, 17.1ms, 7.7ms, 2.8ms, 2.3ms
and even more so if the difference between the two lists is at around index 1,000 rather than 10,000
7.1ms, 17.1ms, 7.7ms, 0.3ms, 0.2ms
import timeit

setup = """
import numpy as np
from itertools import zip_longest

list1 = [1 for i in range(10000)] + [4, 5, 7]
list2 = [1 for i in range(10000)] + [4, 4]
test1 = np.array(list1)
test2 = np.array(list2)

def find_first_diff(l1, l2):
    for index, (x, y) in enumerate(zip_longest(l1, l2, fillvalue=object())):
        if x != y:
            return index

def findFirstDifference(list1, list2):
    minLength = min(len(list1), len(list2))
    for index in range(minLength):
        if list1[index] != list2[index]:
            return index
    return minLength
"""

# one statement string per method, in the order listed above
fn = ["""
n = min(len(test1), len(test2))
x = np.where(test1[:n] != test2[:n])[0]
if len(x) > 0:
    ans = x[0]
elif len(test1) != len(test2):
    ans = n
else:
    ans = None""",
"""
x = np.where(np.in1d(list1, list2) == False)[0]
if len(x) > 0:
    ans = x[0]
else:
    ans = None""",
"""
x = test1
y = np.resize(test2, x.shape)
x = np.where(np.where(x == y, x, 0) == 0)[0]
if len(x) > 0:
    ans = x[0]
else:
    ans = None""",
"""
ans = find_first_diff(list1, list2)""",
"""
ans = findFirstDifference(list1, list2)"""]

for f in fn:
    print(timeit.timeit(f, setup, number = 1000))
Here is one way to do it:
from itertools import izip  # Python 2; on Python 3 use the built-in zip instead
def compare_lists(lista, listb):
    """
    Compare two lists and return the first index where they differ. If
    they are equal, return the list length.
    """
    for position, (a, b) in enumerate(izip(lista, listb)):
        if a != b:
            return position
    return min([len(lista), len(listb)])
The algorithm is simple: zip (or in this case, the more efficient izip) the two lists, then compare them element by element.
The enumerate function gives the index position, which we can return if a discrepancy is found.
If we exit the for loop without returning, one of two things has happened:
The two lists are identical. In this case, we want to return the length of either list.
The lists are of different lengths and they are equal up to the length of the shorter list. In this case, we want to return the length of the shorter list.
In either case, the min(...) expression is what we want.
This function has a bug: if you compare two empty lists, it returns 0, which seems wrong. I'll leave it to you to fix it as an exercise.
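One possible way to approach that exercise is sketched below. It assumes that "no difference" should be reported as None; that return convention is an assumption, not something the answer specifies, and compare_lists_or_none is a hypothetical name.

```python
def compare_lists_or_none(lista, listb):
    """Return the first index where the lists differ, or None if they are equal."""
    for position, (a, b) in enumerate(zip(lista, listb)):
        if a != b:
            return position
    if len(lista) == len(listb):
        return None  # identical lists, including two empty lists
    return min(len(lista), len(listb))  # one list is a prefix of the other

print(compare_lists_or_none([], []))                # None
print(compare_lists_or_none([1, 3, 5, 8], [1, 3]))  # 2
print(compare_lists_or_none([1, 2], [1, 2]))        # None
```

Callers then have to test the result with "is None" rather than truthiness, because 0 is a valid index.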
The fastest algorithm would compare every element up to the first difference and no more. So iterating through the two lists pairwise like that would give you this:
def findFirstDifference(list1, list2):
    minLength = min(len(list1), len(list2))
    for index in xrange(minLength):
        if list1[index] != list2[index]:
            return index
    return minLength # the two lists agree where they both have values, so return the next index
Which gives the output you want:
print findFirstDifference(test1, test3)
> 2
Thanks for all of your suggestions, I just found a much simpler way for my problem which is:
import numpy as np
x = np.array(test1)
y = np.resize(np.array(test2), x.shape)
np.where(np.where(x == y, x, 0) == 0)[0][0]
Here's an admittedly not very pythonic, numpy-free stab:
b = list(zip(test1, test2))
c = 0
while b:
    if b[0][0] != b[0][1]:
        break
    b = b[1:]
    c = c + 1
print(c)
For Python 3.x:
def first_diff_index(ls1, ls2):
    l = min(len(ls1), len(ls2))
    return next((i for i in range(l) if ls1[i] != ls2[i]), l)
(On Python 2.x, replace range with xrange.)
When having a list of lists in which all the sublists are ordered e.g.:
[[1,3,7,20,31], [1,2,5,6,7], [2,4,25,26]]
what is the fastest way to get the union of these lists without having duplicates in it and also get an ordered result?
So the resulting list should be:
[1, 2, 3, 4, 5, 6, 7, 20, 25, 26, 31]
I know I could just union them all without duplicates and then sort them but are there faster ways (like: do the sorting while doing the union) built into python?
Is the proposed answer faster than executing the following algorithm pairwise with all the sublists?
You can use heapq.merge for this:
def mymerge(v):
    from heapq import merge
    last = None
    for a in merge(*v):
        if a != last:  # remove duplicates
            last = a
            yield a
print(list(mymerge([[1,3,7,20,31], [1,2,5,6,7], [2,4,25,26]])))
# [1, 2, 3, 4, 5, 6, 7, 20, 25, 26, 31]
The asymptotically best approach to the problem is to use a priority queue, such as the one implemented in heapq.merge() (thanks to @kaya3 for pointing this out).
However, in practice, a number of things can go wrong. For example, the constant factors in the complexity analysis can be large enough that a theoretically optimal approach is slower in real-life scenarios.
This is fundamentally dependent on the implementation.
For example, Python suffers some speed penalty for explicit looping.
So, let's consider a couple of approaches and how they perform for some concrete inputs.
Just to give you some idea of the numbers we are discussing, here are a few approaches:
merge_sorted(), which uses the most naive approach: flatten the sequences, reduce them to a set() (removing duplicates) and sort the result as required
import itertools
def merge_sorted(seqs):
    return sorted(set(itertools.chain.from_iterable(seqs)))
merge_heapq(), which is essentially @arshajii's answer. Note that the itertools.groupby() variation is slightly (less than ~1%) faster.
import heapq
def i_merge_heapq(seqs):
    last_item = None
    for item in heapq.merge(*seqs):
        if item != last_item:
            yield item
            last_item = item
def merge_heapq(seqs):
    return list(i_merge_heapq(seqs))
merge_bisect_set() is substantially the same algorithm as merge_sorted() except that the result is now constructed explicitly using the efficient bisect module for sorted insertions. Since sorted() does fundamentally the same thing without the explicit Python loop, this is not going to be faster.
import itertools
import bisect
def merge_bisect_set(seqs):
    result = []
    for item in set(itertools.chain.from_iterable(seqs)):
        bisect.insort(result, item)
    return result
merge_bisect_cond() is similar to merge_bisect_set(), but now the non-repeating constraint is enforced explicitly by checking the final list. However, this is much more expensive than just using set() (in fact it is so slow that it was excluded from the plots).
def merge_bisect_cond(seqs):
    result = []
    for item in itertools.chain.from_iterable(seqs):
        if item not in result:
            bisect.insort(result, item)
    return result
merge_pairwise() explicitly implements the theoretically efficient algorithm, similar to what you outlined in your question.
def join_sorted(seq1, seq2):
    result = []
    i = j = 0
    len1, len2 = len(seq1), len(seq2)
    while i < len1 and j < len2:
        if seq1[i] < seq2[j]:
            result.append(seq1[i])
            i += 1
        elif seq1[i] > seq2[j]:
            result.append(seq2[j])
            j += 1
        else:  # seq1[i] == seq2[j]
            result.append(seq1[i])
            i += 1
            j += 1
    # at most one of the two sequences still has items left
    if i < len1:
        result.extend(seq1[i:])
    elif j < len2:
        result.extend(seq2[j:])
    return result
def merge_pairwise(seqs):
    result = []
    for seq in seqs:
        result = join_sorted(result, seq)
    return result
merge_loop() implements a generalization of the above, where the pass is now done only once over all sequences, instead of pairwise.
def merge_loop(seqs):
    result = []
    lengths = list(map(len, seqs))
    idxs = [0] * len(seqs)
    while any(idx < length for idx, length in zip(idxs, lengths)):
        # smallest item among the current heads of all unfinished sequences
        item = min(
            seq[idx]
            for idx, seq, length in zip(idxs, seqs, lengths) if idx < length)
        result.append(item)
        # advance every sequence whose head equals the item just emitted
        for i, (idx, seq, length) in enumerate(zip(idxs, seqs, lengths)):
            if idx < length and seq[idx] == item:
                idxs[i] += 1
    return result
By generating the input using:
import random
def gen_input(n, m=100, a=None, b=None):
    if a is None and b is None:
        b = 2 * n * m
        a = -b
    return tuple(
        tuple(sorted(set(random.randint(int(a), int(b)) for _ in range(n))))
        for __ in range(m))
one can plot the timings for varying n:
Note that, in general, the performances will vary for different values of n (the size of each sequence) and m (the number of sequences), but also of a and b (the minimum and the maximum number generated).
For brevity, this was not explored in this answer, but feel free to play around with it here, which also includes some other implementations, notably some tentative speed-ups with Cython that were only partially successful.
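The itertools.groupby() variation mentioned for merge_heapq() is not spelled out in the answer, so here is a minimal sketch of what it could look like. The name merge_heapq_groupby is made up for this illustration, and the tiny timing loop is only meant to show how one might compare implementations, not to reproduce the plots.

```python
import heapq
import itertools
import timeit

def merge_heapq_groupby(seqs):
    # heapq.merge yields all items in sorted order (the inputs must be sorted);
    # groupby then collapses each run of equal items into a single key.
    return [key for key, _group in itertools.groupby(heapq.merge(*seqs))]

def merge_sorted(seqs):
    # Baseline from above: flatten, deduplicate with set(), then sort.
    return sorted(set(itertools.chain.from_iterable(seqs)))

seqs = [[1, 3, 7, 20, 31], [1, 2, 5, 6, 7], [2, 4, 25, 26]]
print(merge_heapq_groupby(seqs))  # [1, 2, 3, 4, 5, 6, 7, 20, 25, 26, 31]
print(merge_heapq_groupby(seqs) == merge_sorted(seqs))  # True

# Rough timing on this tiny input; real comparisons need larger inputs.
for func in (merge_sorted, merge_heapq_groupby):
    print(func.__name__, timeit.timeit(lambda: func(seqs), number=1000))
```

Checking that two implementations agree before timing them is a cheap guard against benchmarking a wrong result.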
You can make use of sets in Python 3.
mylist = [[1,3,7,20,31], [1,2,5,6,7], [2,4,25,26]]
mynewlist = mylist[0] + mylist[1] + mylist[2]
print(sorted(set(mynewlist)))
[1, 2, 3, 4, 5, 6, 7, 20, 25, 26, 31]
First, merge all the sub-lists using list addition.
Then convert the result into a set, which removes all the duplicates. A set has no guaranteed order, so sort it explicitly.
sorted() returns a list, which gives your desired output.
Hope it answers your question.
I have two sorted lists containing float values. The first contains the values I am interested in (l1) and the second list contains values I want to search (l2). However, I am not looking for exact matches; I tolerate differences based on a function. Since I have to do this search very often (>>100000) and the lists can be quite large (~5000 and ~200000 elements), I am really interested in runtime. At first, I thought I could somehow use numpy.isclose(), but my tolerance is not fixed; it depends on the value of interest. Several nested for loops work, but are really slow. I am sure that there is some efficient way to do this.
#check if two floats are close enough to match
def matching(mz1, mz2):
    if abs( (1-mz1/mz2) * 1000000) <= 2:
        return True
    return False
#imagine another huge for loop around everything
l1 = [132.0317, 132.8677, 132.8862, 133.5852, 133.7507]
l2 = [132.0317, 132.0318, 132.8678, 132.8861, 132.8862, 133.5851999, 133.7500]
d = {i:[] for i in l1}
for i in l1:
    for j in l2:
        if matching(i, j):
            d[i].append(j)
fyi: As an alternative to the matching function, I could also create a dictionary first, mapping the values of interest from l1 to the window (min ,max) I would allow. e.g. {132.0317:(132.0314359366, 132.0319640634), ...}, but I think checking for each value from l2 if it lies within one of the windows from this dictionary would be even slower...
This would be how to generate the dictionary containing min/max values for each value from l1:
def calcMinMaxMZ(mz, delta_ppm=2):
    minmz = mz - (mz * +delta_ppm) / 1000000
    maxmz = mz - (mz * -delta_ppm) / 1000000
    return minmz, maxmz
minmax_d = {mz: calcMinMaxMZ(mz, delta_ppm=2) for mz in l1}
The result may be a dictionary like this:
d = {132.0317: [132.0317, 132.0318], 132.8677: [132.8678], 132.8862: [132.8862, 132.8861], 133.5852: [133.5851999], 133.7507: []} But I actually do much more, when there is a match.
Any help is appreciated!
I re-implemented the for loop using itertools. For it to work, the inputs must be sorted. For the benchmark, I generated 1000 items from <130.0, 135.0> for l1 and 100_000 items from <130.0, 135.0> for l2:
from timeit import timeit
from itertools import tee
from random import uniform
#check if two floats are close enough to match
def matching(mz1, mz2):
    if abs( (1-mz1/mz2) * 1000000) <= 2:
        return True
    return False
#imagine another huge for loop around everything
l1 = sorted([uniform(130.00, 135.00) for _ in range(1000)])
l2 = sorted([uniform(130.00, 135.00) for _ in range(100_000)])
def method1():
    d = {i:[] for i in l1}
    for i in l1:
        for j in l2:
            if matching(i, j):
                d[i].append(j)
    return d
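def method2():
    iter_2, last_match = tee(iter(l2))
    d = {}
    for i in l1:
        d.setdefault(i, [])
        found = False
        while True:
            j = next(iter_2, None)
            if j is None:
                break
            if matching(i, j):
                d[i].append(j)
                if not found:
                    # remember the position of the first match for the next l1 value
                    iter_2, last_match = tee(iter_2)
                    found = True
            else:
                if found:
                    break
        # restart the scan of l2 from the remembered position
        iter_2, last_match = tee(last_match)
    return d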
print(timeit(lambda: method1(), number=1))
print(timeit(lambda: method2(), number=1))
If you transpose your formula to produce a range of mz2 values for a given mz1, you could use a binary search to find the first match in the sorted l2 list, then work your way up sequentially until you reach the end of the range.
def getRange(mz1):
    minimum = mz1/(1+2/1000000)
    maximum = mz1/(1-2/1000000)
    return minimum,maximum
l1 = [132.0317, 132.8677, 132.8862, 133.5852, 133.7507]
l2 = [132.0317, 132.0318, 132.8678, 132.8862, 132.8861, 133.5851999, 133.7500]
l2 = sorted(l2)
from bisect import bisect_left
d = { mz1:[] for mz1 in l1 }
for mz1 in l1:
    lo,hi = getRange(mz1)
    i = bisect_left(l2,lo)  # first l2 value that is >= lo
    while i < len(l2) and l2[i]<= hi:
        d[mz1].append(l2[i])
        i += 1
Sorting l2 will cost O(NlogN) and the dictionary creation will cost O(MlogN) where N is len(l2) and M is len(l1). You will only be applying the tolerance/range formula M times instead of N*M times which should save a lot of processing.
Your lists are already sorted, so you can maybe use a paradigm similar to the "merge" part of merge sort: keep track of the current element of both lists with idx1 and idx2, and when one of them is acceptable, process it and advance only that index.
d = {i:[] for i in l1}
idx1, idx2 = 0, 0
while idx1 < len(l1):
    # check the bound first so that l2[idx2] is never read past the end
    while idx2 < len(l2) and matching(l1[idx1], l2[idx2]):
        d[l1[idx1]].append(l2[idx2])
        idx2 += 1
    idx1 += 1
# {132.0317: [132.0317, 132.0318], 132.8677: [132.8678], 132.8862: [132.8862, 132.8861], 133.5852: [133.5851999], 133.7507: []}
This is O(len(l1) + len(l2)), since it executes exactly once for each element of both lists.
The big caveat here is that this never "steps back" - if the current element of l1 matches the current element of l2 but the next element of l1 would also match the current element of l2, then the latter does not get listed. Fixing this might require adding some sort of "look-back" functionality (which would drive the complexity class up by a magnitude of n in the worst case, but would still be quicker than iterating through both lists repeatedly). However, it does work for your given dataset.
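The "look-back" idea can be sketched as follows. This is only one possible way to do it, under the assumptions that both lists are sorted in ascending order and contain positive values; match_sorted and the ppm parameter are names introduced for this illustration. A start index only ever moves forward past l2 values that are too small, while a second index scans the current window without moving start.

```python
def matching(mz1, mz2, ppm=2):
    # Same tolerance test as in the question.
    return abs((1 - mz1 / mz2) * 1000000) <= ppm

def match_sorted(l1, l2):
    d = {v: [] for v in l1}
    start = 0  # first index in l2 that may still match the current or a later l1 value
    for v in l1:
        # Permanently skip l2 values that are too small for v
        # (and therefore for every later, larger v).
        while start < len(l2) and l2[start] < v and not matching(v, l2[start]):
            start += 1
        # Scan forward from start without moving it, so the next l1 value
        # can still "look back" at the same l2 values.
        k = start
        while k < len(l2) and matching(v, l2[k]):
            d[v].append(l2[k])
            k += 1
    return d

l1 = [132.0317, 132.8677, 132.8862, 133.5852, 133.7507]
l2 = sorted([132.0317, 132.0318, 132.8678, 132.8862, 132.8861, 133.5851999, 133.7500])
print(match_sorted(l1, l2))
```

Each l2 value is skipped at most once, and the inner scan only touches values that actually match, so the work grows with the list lengths plus the number of matches reported.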
I have a curious case, and after some time I have not come up with an adequate solution.
Say you have two lists and you need to find items that have the same index.
x = [1,4,5,7,8]
y = [1,3,8,7,9]
I am able to get the indices of the items that appear in both lists at the same index by using the following:
matches = [i for i, (a,b) in enumerate(zip(x,y)) if a==b]
This would return:
[0, 3]
I am able to get a simple intersection of both lists with the following (and in many other ways, this is just an example):
intersected = set(x) & set(y)
This would return this set:
{1, 7, 8}
Here's the question. I'm looking for ideas on how to get a list of the items (as in the second result) that appear in both lists but are not in the same position, excluding the matches above.
In other words, I'm looking for items in x that also appear in y but do not share the same index.
The desired result would be the index position of "8" in y, or [2]
Thanks in advance
You're so close: iterate through y; look for a value that is in x, but not at the same position:
offset = [i for i, a in enumerate(y) if a in x and a != x[i] ]
Including the suggested upgrade from pault, with respect to Martijn's comment ... the pre-processing reduces the complexity, in case of large lists:
>>> both = set(x) & set(y)
>>> offset = [i for i, a in enumerate(y) if a in both and a != x[i] ]
As PaulT pointed out, this is still quite readable at OP's posted level.
I'd create a dictionary of indices for the first list, then use that to test if the second value is a) in that dictionary, and b) the current index is not present:
def non_matching_indices(x, y):
    x_indices = {}
    for i, v in enumerate(x):
        x_indices.setdefault(v, set()).add(i)
    return [i for i, v in enumerate(y) if i not in x_indices.get(v, {i})]
The above takes O(len(x) + len(y)) time; a single full scan through the one list, then another full scan through the other, where each test to include i is done in constant time.
You really don't want to use a value in x containment test here, because that requires a scan (a loop) over x to see if that value is really in the list or not. That takes O(len(x)) time, and you do that for each value in y, which means that the function takes O(len(x) * len(y)) time.
You can see the speed differences when you run a time trial with a larger list filled with random data:
>>> import random, timeit
>>> def using_in_x(x, y):
...     return [i for i, a in enumerate(y) if a in x and a != x[i]]
...
>>> x = random.sample(range(10**6), 1000)
>>> y = random.sample(range(10**6), 1000)
>>> for f in (using_in_x, non_matching_indices):
...     timer = timeit.Timer("f(x, y)", f"from __main__ import f, x, y")
...     count, total = timer.autorange()
...     print(f"{f.__name__:>20}: {total / count * 1000:6.3f}ms")
...
          using_in_x: 10.468ms
non_matching_indices:  0.630ms
So with two lists of 1000 numbers each, if you use value in x testing, you easily take 15 times as much time to complete the task.
x = [1,4,5,7,8]
y = [1,3,8,7,9]
result = []
for e in x:
    if e in y and x.index(e) != y.index(e):
        result.append((x.index(e), y.index(e), e))
print(result) #gives tuple with x_position,y_position,value
This version goes item by item through the first list and checks whether the item is also in the second list. If it is, it compares the indices for the found item in both lists and if they are different then it stores both indices and the item value as a tuple with three values in the result list.
The goal is to find the unique number in an array which contains identical numbers except one. Speed is of the essence as arrays can be huge. My code below works for smaller arrays but times out for large arrays. How to improve the algo? See input / output example below:
Input = [1, 1, 1, 1, 2, 1, 1, 1, 1]
Output = 2
def find_uniq(arr):
    result = [x for x in arr if arr.count(x) ==1]
    return result[0]
Your current solution is quadratic!
You can bring this down to linear, using collections.Counter in association with next (very handy when you don't want the entire list being built only to be thrown away). The counts are precomputed and then the first unique value found is returned.
from collections import Counter
def find_uniq(arr):
    c = Counter(arr)
    return next(x for x in arr if c[x] == 1)
next shines here because the goal is to return the first unique value found. next works with a generator, and only the first item is returned (further computation halts).
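A small sketch of that behaviour, with one addition that is not part of the answer above: passing a default as the second argument to next, so that an input without any unique value returns None instead of raising StopIteration. The name find_uniq_or_none is made up for this example.

```python
from collections import Counter

def find_uniq_or_none(arr):
    counts = Counter(arr)  # one pass to count every value
    # The generator is consumed lazily; next() stops at the first hit.
    # The second argument is returned instead of raising StopIteration
    # when no value occurs exactly once.
    return next((x for x in arr if counts[x] == 1), None)

print(find_uniq_or_none([1, 1, 1, 1, 2, 1, 1, 1, 1]))  # 2
print(find_uniq_or_none([1, 1, 2, 2]))                 # None
```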
Using NumPy:
np.where(np.bincount(arr) == 1)
Sometimes the stupidest solution is not the worst:
def find_uniq(arr):
    count = arr.count
    for x in arr:
        if count(x) == 1:
            return x
    return None
Depending on where the first unique item is, this can be faster or slower than Coldspeed's Counter-based solution (which seems mostly stable). In the worst case, it's only marginally faster than the OP's solution but eats less memory. In the best case it's obviously the winner ;)
I think this is the best solution because it iterates over the Counter dictionary, not the original array (which will likely be longer) when trying to find the unique number/numbers. It will work with a Python list of integers or floats as the function input
from collections import Counter
def find_uniq(arr):
    return next(k for k,v in Counter(arr).items()
                if v == 1)
or if you may have more than one number with a count of 1:
from collections import Counter
def find_uniq(arr):
    return [k for k,v in Counter(arr).items()
            if v == 1]
Here's a way to find it quickly:
def find_uniq(arr):
    members = list(set(arr))
    for member in members:
        if arr.count(member) == 1:
            return member
    return None #if there aren't any uniques
You already have @cᴏʟᴅsᴘᴇᴇᴅ's answer, which is the fastest generic solution to this problem. It is almost linear for any sorted or unsorted array.
from collections import Counter
def find_uniq(arr):
    c = Counter(arr)
    return next(x for x in arr if c[x] == 1)
Since the question is tagged algorithm, I want to show how you can still solve the problem with linear time complexity without any module, just your own algorithm:
def find_uniq(arr):
    i = 0
    while i < len(arr):
        j = i + 1
        pivot = arr[i] # store arr[i] so you don't have to access array again to read its value
        while j < len(arr) and pivot == arr[j]:
            j += 1
        if j - i == 1: # if previous loop never executed
            return pivot
        i = j
    return None
About its complexity: This is not only O(n), it's also Θ(n) since i = j makes sure you don't access any array element more than once. Spatial complexity is O(1).
About its correctness: It works only if the input array is sorted! So here is the big deal. If you can build your array already sorted, this algorithm will be faster than Counter and anything else. The problem is that if the array is not already sorted, you'll have to sort it. arr.sort() uses the Timsort algorithm, whose worst case is O(log(N!)), that is O(n log n), with O(n) worst-case space complexity. Therefore, in the worst case (array not already sorted) the whole code will be O(n log n). Still better than O(n^2), though.
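For an input that is not known to be sorted, the sort-then-scan combination described above might look like the sketch below. The function names are made up for this illustration, and sorting a copy with sorted() rather than calling arr.sort() in place is a choice made here so the caller's list is not modified.

```python
def find_uniq_sorted(sorted_arr):
    # Same run-scanning idea as above: a run of length 1 is the unique value.
    i = 0
    while i < len(sorted_arr):
        j = i + 1
        while j < len(sorted_arr) and sorted_arr[j] == sorted_arr[i]:
            j += 1
        if j - i == 1:
            return sorted_arr[i]
        i = j
    return None

def find_uniq_any_order(arr):
    # sorted() returns a new list, so the caller's data is left untouched;
    # this costs O(n log n) time and O(n) extra memory.
    return find_uniq_sorted(sorted(arr))

print(find_uniq_any_order([1, 1, 1, 1, 2, 1, 1, 1, 1]))  # 2
print(find_uniq_any_order([3, 3, 0, 3]))                 # 0
```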
R_dic = dict()
R = [1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8]
R_dic = {i: R.count(i) for i in R if i not in R_dic} #dict comprehension
print(R_dic)
{1: 2, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 1}
Now check the keys and values in the dictionary as per your condition.
Is there any faster way to calculate this value in Python:
len([x for x in my_list if x in other_list])
I tried to use sets, since the lists' elements are unique, but I noticed no difference.
I'm working with big lists, so even the slightest improvement counts.
A simple way is to find the shorter list, then use that with set.intersection, e.g.:
a = range(100)
b = range(50)
fst, snd = (a, b) if len(a) < len(b) else (b, a)
len(set(fst).intersection(snd))
I think a generator expression like so would be fast
sum(1 for i in my_list if i in other_list)
Otherwise a set intersection is about as fast as it will get
From https://wiki.python.org/moin/TimeComplexity, set intersection for two sets s and t has time complexity:
Average - O(min(len(s), len(t)))
Worst - O(len(s) * len(t))
len([x for x in my_list if x in other_list]) has complexity O(len(my_list) * len(other_list)), i.e. O(n^2) for lists of similar size, which is equivalent to the worst case for set.intersection().
If you use set.intersection(), you only need to convert one of the lists to a set first.
So len(set(my_list).intersection(other_list)) should on average be faster than the nested list comprehension.
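A quick sketch comparing the two expressions on made-up data. The two counts only agree because the elements of my_list are unique, as stated in the question; with duplicates in my_list the list comprehension would count each occurrence.

```python
my_list = list(range(0, 1000, 2))     # even numbers below 1000 (made-up data)
other_list = list(range(0, 1000, 3))  # multiples of 3 below 1000 (made-up data)

# Original approach: each `x in other_list` test scans other_list.
slow_count = len([x for x in my_list if x in other_list])

# Only my_list is converted to a set; intersection() accepts any iterable.
fast_count = len(set(my_list).intersection(other_list))

print(slow_count, fast_count)  # 167 167
```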
You could try using the filter function. Since you mentioned you're working with huge lists, ifilter from the itertools module would be a good option (Python 2; on Python 3 the built-in filter is already lazy):
from itertools import ifilter
my_set = set(range(100))
other_set = set(range(50))
for item in ifilter(lambda x: x in other_set, my_set):
    print item
The idea is to sort the two lists first and then traverse them as if we wanted to merge them, in order to find the elements of the first list that also belong to the second list. This way we have an O(n log n) algorithm.
def mycount(l, m):
    # l and m must already be sorted, e.g. via l.sort() and m.sort()
    i, j, counter = 0, 0, 0
    while i < len(l) and j < len(m):
        if l[i] == m[j]:
            counter += 1
            i += 1
        elif l[i] < m[j]:
            i += 1
        else:
            j += 1
    return counter
From local tests it's 100 times faster than len([x for x in a if x in b]) when working with lists of 10000 elements.
Considering that the list elements are unique, the common elements will have a frequency of two in the union of the two lists. They will also be adjacent when we sort this union. So the following is also valid:
def mycount(l, m):
    s = sorted(l + m)
    return sum(s[i] == s[i + 1] for i in xrange(len(s) - 1))
Similarly, we can use a counter:
from collections import Counter
def mycount(l, m):
    c = Counter(l + m)  # count over the union of both lists
    return sum(v == 2 for v in c.itervalues())