Performance of checking "expanded list" equality - python

I need to check whether a given list is equal to the result of substituting some lists for some elements in another list. Concretely, I have a dictionary, say f = {'o': ['a', 'b'], 'l': ['z'], 'x': ['y']} and a list list1 = ['H', 'e', 'l', 'l', 'o'], so I want to check if some list2 is equal to ['H', 'e', 'z', 'z', 'a', 'b'].
Below, I first write a function apply to compute the image of list1 under f. Then, it suffices to write list2 == apply(list1, f). Since this function will be called thousands of times in my program, I need to make it very fast. Therefore, I thought of the second function below, which should be faster but turns out not to be. So my questions (detailed below) are: why? And: is there a faster method?
First function:
def apply(l, f):
    result = []
    for x in l:
        if x in f:
            result.extend(f[x])
        else:
            result.append(x)
    return result
Second function:
def apply_equal(list1, f, list2):
    i = 0
    for x in list1:
        if x in f:
            sublist = f[x]
            length = len(sublist)
            if list2[i:i + length] != sublist:
                return False
            i += length
        else:
            if list2[i] != x:
                return False
            i += 1
    return i == len(list2)
I thought the second method would be faster since it does not construct the image of the first list under the function and then check it against the second list. Instead, it checks equality "on the fly" without building a new list. So I was surprised to see that it is not faster (and is even a bit slower). For the record: list1, list2, and the lists that are values of the dictionary are all small (typically under 50 elements), as is the number of keys in the dictionary.
So my questions are: why isn't the second method faster? And: are there ways to do this faster (possibly using other data structures)?
Edit in response to the comments: list1 and list2 will most often be different, but f may be common to some of them. Typically, there could be around 100,000 checks in batches of around 50 consecutive checks with a common f. The elements of the lists are short strings. It is expected that all checks return True (so the whole lists have to be iterated over).

Without proper data for benchmarking it's hard to say, so I tested various solutions with various "sizes" of data.
Replacing result.extend(f[x]) with result += f[x] always made it faster.
This was faster for longer lists (using itertools.chain):
from itertools import chain

list2 == list(chain.from_iterable(map(f.get, list1, zip(list1))))
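In case the map/zip trick above looks opaque: zip(list1) yields 1-tuples, which map passes to f.get as the default value, so keys missing from f map to themselves wrapped in a tuple. A more explicit spelling of the same idea (a sketch, not benchmarked) would be:

from itertools import chain

def apply_chain(l, f):
    # f.get(x, (x,)) returns the substitution list, or the 1-tuple (x,) for keys not in f
    return list(chain.from_iterable(f.get(x, (x,)) for x in l))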
If the data allows it, explicitly storing all possible keys and always accessing with f[x] would speed it up. That is, set f[k] = [k] for all "missing" keys in advance, so you don't have to check with in or use get.
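A minimal sketch of that preprocessing, assuming the set of possible elements is known in advance and that copying f is acceptable (the helper names here are made up for illustration):

def complete_mapping(f, alphabet):
    # give every element not already in f a one-element substitution list containing itself
    g = dict(f)
    for k in alphabet:
        g.setdefault(k, [k])
    return g

def apply_total(l, g):
    # no membership test or get() needed: every element is a key of g
    result = []
    for x in l:
        result += g[x]
    return result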

You need to use profiling tools like scalene to see what's slow in your code; don't try to guess.
In case you want to read it, I was able to produce an even slower version based on your idea of stopping as soon as possible, while keeping the first, readable apply implementation:
def apply(l, f):
    for x in l:
        if x in f:
            yield from f[x]
        else:
            yield x

def apply_equal(l1, f, l2):
    return all(left == right for left, right in zip(apply(l1, f), l2, strict=True))
Beware that it needs Python 3.10 for zip's strict=True.
As the comments noted, speed depends heavily on your data here: constructing the whole list may look faster on small datasets, but halting early may win on bigger lists.

Related

Finding all the elements in a list between two elements (not using index, and with wrap around)

I'm trying to figure out a way to find all the elements that appear between two list elements (inclusive) - but to do it without reference to position, and instead with reference to the elements themselves. It's easier to explain with code:
I have a list like this:
['a','b','c','d','e']
And I want a function that would take, two arguments corresponding to elements eg. f('a','d'), and return the following:
['a','b','c','d']
I'd also like it to wrap around, eg. f('d','b'):
['d','e','a','b']
I'm not sure how to go about coding this. One hacky way I've thought of is duplicating the list in question (['a','b','c','d','e','a','b','c','d','e']) and then looping through it and flagging when the first element appears and when the last element does and then discarding the rest - but it seems like there would be a better way. Any suggestions?
def foo(a, b):
    s, e = [a.index(x) for x in b]
    if s <= e:
        return a[s:e+1]
    else:
        return a[s:] + a[:e+1]
print(foo(['a','b','c','d','e'], ['a', 'd'])) # --> ['a', 'b', 'c', 'd']
print(foo(['a','b','c','d','e'], ['d', 'b'])) # --> ['d', 'e', 'a', 'b']
So the following obviously needs error handling as indicated below; also, note that the index() function only returns the index of the first occurrence. You have not specified how you want to handle duplicate elements in the list.
def f(mylist, elem1, elem2):
    posn_first = mylist.index(elem1)   # what if it's not in the list?
    posn_second = mylist.index(elem2)  # ditto
    if posn_first <= posn_second:
        return mylist[posn_first:posn_second+1]
    else:
        return mylist[posn_first:] + mylist[:posn_second+1]
This would be a simple approach, given you always want to use the first appearance of the element in the list (note that it does not handle the wrap-around case):
def get_wrapped_values(input_list, start_element, end_element):
    return input_list[input_list.index(start_element): input_list.index(end_element)+1]

Python: is there a temporary pop method?

Is there a method like pop that temporarily removes, say an element of a list, without permanently changing the original list?
One that would do the following:
list = [1,2,3,4]
newpop(list, 0) returns [2,3,4]
list is unchanged
I come from R, where I would just do c(1,2,3,4)[-4] if I wanted to temporarily remove the last element of some list, so forgive me if I'm thinking backwards here.
I know I could write a function like the following:
def newpop(list, index):
    return list[:index] + list[index+1:]
but it seems overly complex. Any tips would be appreciated; I'm trying to learn to think more Python and less R.
I am probably taking the "temporary" bit way too literally, but you could define a context manager to pop the item from the list and insert it back in when you are done working with the list:
from contextlib import contextmanager

@contextmanager
def out(lst, idx):
    x = lst.pop(idx)     # enter 'with': temporarily remove the element
    yield                # inside 'with'; no need for 'as' here
    lst.insert(idx, x)   # leave 'with': put it back
lst = list("abcdef")
with out(lst, 2):
    print(lst)
    # ['a', 'b', 'd', 'e', 'f']
print(lst)
# ['a', 'b', 'c', 'd', 'e', 'f']
Note: This does not create a copy of the list. All changes you do to the list during with will reflect in the original, up to the point that inserting the element back into the list might fail if the index is no longer valid.
Also note that popping the element and then putting it back into the list will have complexity up to O(n) depending on the position, so from a performance point of view this does not really make sense either, except if you want to save memory on copying the list.
More akin to your newpop function, and probably more practical, you could use a list comprehension with enumerate to create a copy of the list without the offending position. This does not create the two temporary slices and might also be a bit more readable and less prone to off-by-one mistakes.
def without(lst, idx):
    return [x for i, x in enumerate(lst) if i != idx]

print(without(lst, 2))
# ['a', 'b', 'd', 'e', 'f']
You could also change this to return a generator expression by simply changing the [...] to (...), i.e. return (x for ...). This will then be sort of a read-only "view" on the list without creating an actual copy.
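For illustration, a sketch of that generator variant (the name without_iter is made up here):

def without_iter(lst, idx):
    # lazily yields every element except the one at position idx
    return (x for i, x in enumerate(lst) if i != idx)

print(list(without_iter(lst, 2)))
# ['a', 'b', 'd', 'e', 'f']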

Is it possible to extract intersection list that contains duplicate values?

I want to get an intersection of lists where duplication is not eliminated.
I am also hoping for a fast method that does not use explicit loops.
Below was my attempt, but this method failed because duplicates were removed.
a = ['a','b','c','f']
b = ['a','b','b','o','k']
tmp = list(set(a) & set(b))
>>> tmp
['b', 'a']
I want the result to be ['a', 'b', 'b'].
In this use case, a is fixed while b varies, and the idea is to extract the values of a from b. Is there a way to extract the list of common values without removing duplicate values?
A solution could be
good = set(a)
result = [x for x in b if x in good]
There are two loops here: one is the set-building loop of set (which is implemented in C, roughly a hundred times faster than whatever you can do in Python); the other is the comprehension, which runs in the interpreter.
The first loop is done to avoid a linear search in a for each element of b (if a becomes big this can be a serious problem).
Note that using filter instead is probably not going to gain much (if anything) because despite the filter loop being in C, for each element it will have to get back to the interpreter to call the filtering function.
Note that if you care about speed then Python is probably not a good choice... for example, PyPy might be better here, and in that case just writing an optimal algorithm explicitly should be fine (avoiding re-searching a for duplicates when they are consecutive in b, as happens in your example):
good = set(a)
res = []
i = 0
while i < len(b):
    x = b[i]
    if x in good:
        while i < len(b) and b[i] == x:  # is?
            res.append(x)
            i += 1
    else:
        i += 1
Of course in performance optimization the only real way is to try and measure with real data on the real system... guessing works less and less as technology advances and becomes more complicated.
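For instance, a minimal timeit sketch (just an illustration, using the a and b from the question) to compare the comprehension and filter variants:

import timeit

good = set(a)
print(timeit.timeit(lambda: [x for x in b if x in good], number=100_000))
print(timeit.timeit(lambda: list(filter(lambda x: x in good, b)), number=100_000))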
If you insist on not using for explicitly then this will work:
>>> list(filter(a.__contains__, b))
['a', 'b', 'b']
But directly calling magic methods like __contains__ is not a recommended practice to the best of my knowledge, so consider this instead:
>>> list(filter(lambda x: x in a, b))
['a', 'b', 'b']
And if you want to improve the lookup in a from O(n) to O(1) then create a set of it first:
>>> a_set = set(a)
>>> list(filter(lambda x: x in a_set, b))
['a', 'b', 'b']
>>> a = ['a','b','c','f']
>>> b = ['a','b','b','o','k']
>>> items = set(a)
>>> found = [i for i in b if i in items]
>>> items
{'f', 'a', 'c', 'b'}
>>> found
['a', 'b', 'b']
This should do the job.
I guess it's not faster than a loop, and in the end you probably still need a loop to expand the result. Anyway...
from collections import Counter
a = ['a','a','b','c','f']
b = ['a','b','b','o','k']
count_b = Counter(b)
count_ab = Counter(set(b)-set(a))
count_b - count_ab
#=> Counter({'a': 1, 'b': 2})
I mean, if res holds the resulting Counter, you need to expand it back into a list:
[ val for sublist in [ [s] * n for s, n in res.items() ] for val in sublist ]
#=> ['a', 'b', 'b']
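As a side note, collections.Counter already provides an elements() method that performs exactly this expansion, so the last step could also be written as (a small sketch on the Counter computed above):
# res is the Counter computed above, e.g. Counter({'a': 1, 'b': 2})
list(res.elements())
#=> ['a', 'b', 'b']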
It isn't clear how duplicates are handled when performing an intersection of lists which contain duplicate elements, as you have given only one test case and its expected result, and you did not explain duplicate handling.
Going by your expected output, the common elements are 'a' and 'b', and the intersection lists 'a' with multiplicity 1 and 'b' with multiplicity 2. Note that 'a' occurs once in both a and b, while 'b' occurs twice in b: each common element appears in the intersection with the maximum multiplicity it has in any of the input lists.
The answer is yes, although a loop may still run implicitly even if your code does not explicitly use any loop statements; the algorithm is inherently iterative.
Step 1: Create the intersection set, Intersect, which does not contain duplicates (you have already done that). Convert it to a list to keep indexing.
Step 2: Create a second list, IntersectD. For each common element, compute Freq, the maximum number of occurrences of that element across the lists, using count. Then use Intersect and Freq to append each element Intersect[k] a number of times given by its corresponding Freq[k].
An example code with 3 lists would be
a = ['a','b','c','1','1','1','1','2','3','o']
b = ['a','b','b','o','1','o','1']
c = ['a','a','a','b','1','2']

intersect = list(set(a) & set(b) & set(c))  # 3-set case
intersectD = []
for k in range(len(intersect)):
    cmn = intersect[k]
    freq = max(a.count(cmn), b.count(cmn), c.count(cmn))  # 3-set case
    for i in range(freq):  # Can be done with itertools
        intersectD.append(cmn)

>>> intersectD
['b', 'b', 'a', 'a', 'a', '1', '1', '1', '1']
For cases involving more than two lists, freq for this common element can be computed using a more complex set intersection and max expression. If using a list of lists, freq can be computed using an inner loop. You can also replace the inner i-loop with an itertools expression from How can I count the occurrences of a list item?.
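For illustration, here is a sketch of that generalization for a list of lists, using itertools.repeat in place of the inner i-loop (an assumption about what such a rewrite could look like, not code from the linked question):

from itertools import repeat

lists = [a, b, c]
intersect = set.intersection(*map(set, lists))

intersectD = []
for cmn in intersect:
    freq = max(lst.count(cmn) for lst in lists)  # highest multiplicity in any of the lists
    intersectD.extend(repeat(cmn, freq))         # append cmn freq times without an explicit inner loop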

Check number not a sum of 2 ints on a list

Given a list of integers, I want to check a second list and remove from the first only those which cannot be made from the sum of two numbers from the second. So given a = [3,19,20] and b = [1,2,17], I'd want [3,19].
Seems like a cinch with two nested loops, except that I've gotten stuck with the break and continue statements.
Here's what I have:
def myFunction(list_a, list_b):
    for i in list_a:
        for a in list_b:
            for b in list_b:
                if a + b == i:
                    break
            else:
                continue
            break
        else:
            continue
        list_a.remove(i)
    return list_a
I know what I need to do, just the syntax seems unnecessarily confusing. Can someone show me an easier way? TIA!
You can do it like this:
In [13]: from itertools import combinations
In [15]: [item for item in a if item in [sum(i) for i in combinations(b,2)]]
Out[15]: [3, 19]
combinations will give all possible pairs from b, from which we get the list of sums; then we just check whether each value of a is present in it.
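As a side note (not part of the original answer), the list of sums above is rebuilt for every element of a; computing the sums once into a set avoids that repeated work:

from itertools import combinations

sums = {x + y for x, y in combinations(b, 2)}
result = [item for item in a if item in sums]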
Edit
If you don't want to use itertools, write a function for it, like this:
def comb(s):
    for i, v1 in enumerate(s):
        for j in range(i+1, len(s)):
            yield [v1, s[j]]

result = [item for item in a if item in [sum(i) for i in comb(b)]]
Comments on code:
It's very dangerous to delete elements from a list while iterating over it. Perhaps you could append items you want to keep to a new list, and return that.
Your current algorithm is O(nm^2), where n is the size of list_a, and m is the size of list_b. This is pretty inefficient, but a good start to the problem.
There are also a lot of unnecessary continue and break statements, which can lead to complicated code that is hard to debug.
You also put everything into one function. Consider splitting each task into its own function, for example dedicating one function to finding pairs and another to checking each item in list_a against list_b. This is a way of breaking the problem into smaller problems and using them to solve the bigger one.
Overall I think your function is doing too much, and the logic could be condensed into much simpler code by breaking down the problem.
Another approach:
Since I found this task interesting, I decided to try it myself. My outlined approach is illustrated below.
1. You can first check if a list has a pair of a given sum in O(n) time using hashing:
def check_pairs(lst, sums):
    lookup = set()
    for x in lst:
        current = sums - x
        if current in lookup:
            return True
        lookup.add(x)
    return False
2. Then you could use this function to check, for each number in list_a, whether any pair in list_b sums to it:
def remove_first_sum(list_a, list_b):
    new_list_a = []
    for x in list_a:
        check = check_pairs(list_b, x)
        if check:
            new_list_a.append(x)
    return new_list_a
This keeps the numbers in list_a that can be written as the sum of two numbers in list_b.
3. The above can also be written with a list comprehension:
def remove_first_sum(list_a, list_b):
    return [x for x in list_a if check_pairs(list_b, x)]
Both of which work as follows:
>>> remove_first_sum([3,19,20], [1,2,17])
[3, 19]
>>> remove_first_sum([3,19,20,18], [1,2,17])
[3, 19, 18]
>>> remove_first_sum([1,2,5,6],[2,3,4])
[5, 6]
Note: Each check_pairs call runs in O(m) time, where m is the size of list_b, so checking all of list_a is O(n·m), which doesn't require anything too complicated. However, it also uses O(m) extra auxiliary space, because a set is kept to record which items have been seen.
You can do it by first creating all possible sum combinations, then filtering out elements which don't belong to that combination list
Define the input lists
>>> a = [3,19,20]
>>> b = [1,2,17]
Next we will compute all possible sums of two elements:
>>> y = [i+j for k,j in enumerate(b) for i in b[k+1:]]
Next we apply a function to every element of list a and check whether it is present in the list calculated above. The map function can be used with an if/else expression; map will yield None whenever the else branch is taken. To cater for this, we can filter the list to remove the None values:
>>> list(filter(None, map(lambda x: x if x in y else None,a)))
The above operation will output:
[3, 19]
You can also write a one-liner by combining all these lines into one, but I don't recommend this.
You can try something like this:
a = [3,19,20]
b = [1,2,17,5]
n_m_s = []
data = [n_m_s.append(i+j) for i in b for j in b if i+j in a]
print(set(n_m_s))
print("after remove")
final_data = []
for j, i in enumerate(a):
    if i not in n_m_s:
        final_data.append(i)
print(final_data)
output:
{19, 3}
after remove
[20]

most pythonic way to order a sublist from an ordered list

If I have sublist A: ['E','C','W'], what is the most pythonic way to order the sublist according to the order of master list M: ['C','B','W','E','K']?
My solution seems rather rudimentary. I am curious if there is a more 'pythonic' way to get the same result.
ORDER = ['C','B','W','E','K']
possibilities = ['E','C','W']
possibilities_in_order = []
for x in ORDER:
    if x in possibilities:
        possibilities_in_order.append(x)
>>> order = ['C','B','W','E','K']
>>> possibilities = ['E','C','W']
>>> possibilities_in_order = sorted(possibilities, key=order.index)
>>> possibilities_in_order
['C', 'W', 'E']
How this works: for each element in possibilities, order.index(element) is called, and the list is simply sorted by those respective positions.
More details: Built-in Functions → sorted.
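As an aside (not part of the original answer), order.index performs a linear scan for every element; if the master list were large, precomputing a rank dictionary would avoid the repeated scans:

rank = {value: position for position, value in enumerate(order)}
possibilities_in_order = sorted(possibilities, key=lambda x: rank[x])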
possibilities.sort(key=lambda x : ORDER.index(x))
Here's a linear-time solution:
posset = set(possibilities)
[letter for letter in order if letter in posset]
This filters the master list for only the members of the sublist. It's O(n) because it only traverses the master list once, and will perform well if the sublist is close in size to the master list.
This also assumes that possibilities has no duplicates. You can handle that if necessary, although it will make the code a bit more complex.
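For example, one way to handle duplicates in possibilities (a sketch, not part of the original answer) is to count them with collections.Counter and repeat each master-list element accordingly:

from collections import Counter

counts = Counter(possibilities)
# repeat each element of the master list as many times as it occurs in possibilities
[letter for letter in order for _ in range(counts[letter])]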
