SortedSet ValueError not in list - python

I am using a SortedSet from the sortedcontainers library. The set contains Match objects, which define a start attribute that is used for sorting:
class Match:
    def __lt__(self, other):
        return self.start < other.start
Matches are constantly added (SortedSet.add) and discarded (SortedSet.discard) from the set.
Matches may have the same start. Matches may see their start changed while existing in the set.
Everything seems to work without any issues until I get the following error when trying to discard a match with sortedset.discard(match):
ValueError: <Match: X vs Y> not in list
The match is present in the set, as match in sortedset returns True. Not that it should matter, since discard is supposed to remove quietly.
I have absolutely no idea why this is happening, and I have been trying to figure out a solution for a couple of days with no success. I would provide more information if I had any clue of what might be wrong, but I am just clueless. Please ask for any information you need and I will deliver.

Matches may have the same start. Matches may see their start changed while existing in the set.
This violates the constraints that containers use to track their elements.
You can't change the sort order of an element while it is in a container. The container has no way of knowing that the sort order has changed.
And for a total ordering, distinct objects must compare as distinct.
If you read the documentation for sortedset, it clearly states:
The hash and total ordering of values must not change while they are stored in the sorted set.
There doesn't even exist a total ordering for your elements, because there exist x != y for which neither x < y nor x > y is true.
You haven't mentioned __hash__ at all.
I am not sure how you would expect to solve these problems, since I am not familiar with your code, and solving them may require some redesign work. However, these problems are what is causing the unexpected behavior in your program.
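For illustration, here is a minimal sketch that reproduces the error by mutating the sort key in place (the __init__ and the concrete start values are assumptions, not from the question):

from sortedcontainers import SortedSet

class Match:
    def __init__(self, start):
        self.start = start
    def __lt__(self, other):
        return self.start < other.start

s = SortedSet([Match(1), Match(5), Match(9)])
m = s[2]        # the Match whose start is 9
m.start = 0     # changing the key while the object is stored in the set

# Membership still reports True because it is checked against an internal
# hash set, but the internal sorted list is now out of order, so the
# bisection used by discard() looks in the wrong place:
print(m in s)   # True
s.discard(m)    # raises ValueError: <Match object at 0x...> not in list

The safe pattern is to discard the object first, change its start, and only then add it back.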

Related

Efficient reverse order comparison of huge growing list in Python

In Python, my goal is to maintain a unique list of points (complex scalars, rounded) while steadily creating new ones with a function, as in this pseudocode:
list_of_points = []
while True:
    # generate new point according to some rule
    z = generate()
    # check whether this point is already there
    if z not in list_of_points:
        list_of_points.append(z)
    if some_condition:
        break
Now list_of_points can become potentially huge (like 10 million entries or even more) during the process and duplicates are quite frequent. In fact about 50% of the time, a newly created point is already somewhere in the list. However, what I know is that oftentimes the already existing point is near the end of the list. Sometimes it is in the "bulk" and only very occasionally it can be found near the beginning.
This brought me to the idea of doing the search in reverse order. But how would I do this most efficiently (in terms of raw speed), given my potentially large list that grows during the process? Is the list container even the best choice here?
I managed to gain some performance by doing this:
list_of_points = []
while True:
    # generate new point according to some rule
    z = generate()
    # check very end of list
    if z in list_of_points[-10:]:
        continue
    # check deeper into the list
    if z in list_of_points[-100:-10]:
        continue
    # check the rest
    if z not in list_of_points[:-100]:
        list_of_points.append(z)
    if some_condition:
        break
Apparently, this is not very elegant. Using a second, FIFO-type container (collections.deque) instead gives about the same speedup.
Your best bet might be to use a set instead of a list. Python sets use hashing to insert items, so membership checks are very fast. And you can skip the step of checking whether an item is already in the collection by simply trying to add it: if it is already in the set, it won't be added, since duplicates are not allowed.
Stealing your pseudocode example:
set_of_points = set()  # note: {} would create a dict, not a set
while True:
    # get size of set
    a = len(set_of_points)
    # generate new point according to some rule
    z = generate()
    # try to add z to the set
    set_of_points.add(z)
    b = len(set_of_points)
    # if a == b, z was not added, and thus already existed in the set
    if some_condition:
        break
Use a set. This is what sets are for. Ah - you already have an answer saying that. So my other comment: this part of your code appears to be incorrect:
# check the rest
if z not in list_of_points[100:]:
    list_of_points.append(z)
In context, I believe you meant to write list_of_points[:-100] there instead. You already checked the last 100, but, as is, you're skipping checking the first 100 instead.
But even better, use plain list_of_points. As the list grows longer, the cost of possibly doing 100 redundant comparisons becomes trivial compared to the cost of copying len(list_of_points) - 100 elements into the slice.
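A rough way to see the copying overhead (a sketch; absolute timings will vary by machine):

import timeit

setup = "pts = list(range(1_000_000))"
# plain membership test: scans the list but copies nothing
print(timeit.timeit("pts[-1] in pts", setup=setup, number=10))
# sliced membership test: copies ~a million references first, then scans
print(timeit.timeit("pts[-1] in pts[:-100]", setup=setup, number=10))

Both do an O(n) scan; the second additionally builds a new list of len(pts) - 100 elements on every iteration.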

Python - How to speed up nested for loop with function and multiple return values

I am writing Python code to compute whether there is any fuzzy match between two strings. If there is a match, I have to store both strings and the average match value. The strings to be compared come from a list that runs into thousands of entries.
The issue is that the code is taking too long to execute. To speed it up, I looked at the other answers here, but none of them had multiple return values from the inner function in the loop. Looking for optimized code here...
from fuzzywuzzy import fuzz

tokens = ['abc', 'bcd', 'abe', 'efg', 'opq']
valid_list = ['acb', 'abc', 'abf', 'bcd', 'rts', 'xyz']
match_tokens, potential_entry, ag_match = [], [], []

def get_match(token, chk_str):
    avg_value = (fuzz.ratio(token, chk_str) + fuzz.partial_ratio(token, chk_str)
                 + fuzz.token_sort_ratio(token, chk_str)
                 + fuzz.token_set_ratio(token, chk_str)) / 4
    if int(avg_value) > 70:
        return token, chk_str, int(avg_value)
    return 0, 0, 0

for i in tokens:
    for j in valid_list:
        token, valid_entry, avg_match = get_match(i, j)
        if token != 0:
            potential_entry.append(valid_entry)
            match_tokens.append(token)
            ag_match.append(avg_match)
The main obvious thing I can see is that you could short-circuit out of the fuzzy checks when one of them makes it clear the pair cannot be a valid match.
So instead of computing them all in one line, compute them individually, and check whether the running result is already below the threshold before getting the other ratios; prioritise the ratio you'd expect to provide the clearest answer first (see the sketch after the list below).
Also, consider:
- using a single list of objects to avoid having to append to three parallel lists
- using sets for your tokens and valid list to ensure no duplicate checks are done
- not casting avg_value to an integer for the if comparison; the cast doesn't buy anything here
- adding an explicit i == j check that returns a 100% ratio before doing any of the other checks
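As a concrete sketch of the short-circuit idea (same fuzz functions as the question; the running-total bound and the threshold parameter are my own framing):

from fuzzywuzzy import fuzz

CHECKS = (fuzz.ratio, fuzz.partial_ratio,
          fuzz.token_sort_ratio, fuzz.token_set_ratio)

def get_match(token, chk_str, threshold=70):
    if token == chk_str:
        return token, chk_str, 100   # identical strings: skip all ratio work
    total = 0
    for k, check in enumerate(CHECKS, start=1):
        total += check(token, chk_str)
        # even if every remaining ratio scored a perfect 100, could the
        # average still clear the threshold? if not, bail out early
        best_possible = (total + 100 * (len(CHECKS) - k)) / len(CHECKS)
        if best_possible <= threshold:
            return 0, 0, 0
    return token, chk_str, total / len(CHECKS)

This keeps essentially the same accept/reject behavior as averaging all four ratios (modulo the int cast, dropped per the list above), while skipping the more expensive token_* calls whenever the cheap ratio already rules a pair out.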

Idiomatic way to match against a list of munch.Munch objects?

I am using the openstack shade library to manage our openstack stacks. One task is to list all stacks owned by a user (for example, to then allow deleting them).
The shade library call list_stacks() returns a list of munch.Munch objects, and basically I want to identify that stack object that has either an 'id' or 'name' matching some user provided input.
I came up with this code here:
def __find_stack(self, connection, stack_info):
    stacks = connection.list_stacks()
    for stack in stacks:
        if stack_info in stack.values():
            return stack
    return None
But it feels clumsy, and I am wondering if there is a more idiomatic way to solve this in Python. (stack_info is a simple string, either the "name" or "id"; in other words, it might match either entry within the "dict" values of the munched stack objects.)
As my comment suggests, I don't really think there is something to improve.
However, performance-wise, you could use filter to push the loop down to C level, which may be beneficial if there are a lot of stacks.
Readability-wise, I don't think that you would gain much.
def __find_stack(self, connection, stack_info):
    stacks = connection.list_stacks()
    return list(filter(lambda stack: stack_info in stack.values(), stacks))
However, this approach is not short-circuited. Your original code stops when it finds a match, and this one will not, so you will get every match if more than one exists (or an empty list in case there is no match).
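If you do want the early exit back, a generator expression with next() is a common idiom; this sketch is equivalent to the original loop, including the None default:

def __find_stack(self, connection, stack_info):
    # the generator is consumed lazily, so next() stops at the first match
    return next(
        (stack for stack in connection.list_stacks()
         if stack_info in stack.values()),
        None,
    )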

Python: Why Lists do not have a find method?

I was trying to write an answer to this question and was quite surprised to find out that there is no find method for lists; lists have only the index method (strings have both find and index).
Can anyone tell me the rationale behind that?
Why do strings have both?
I don't know why (maybe it is buried in some PEP somewhere), but I do know two very basic "find" mechanisms for lists: list.index() and the in operator. You can always make use of these two to find your items. (Also the re module, etc.)
I think the rationale for not having separate 'find' and 'index' methods is that they're not different enough. Both would return the same thing when the sought item exists in the list (this is true of the two string methods); they differ only when the sought item is not in the list/string, and you can trivially build either one of find/index from the other. If you're coming from other languages, it may seem bad manners to raise and catch exceptions for a non-error condition that you could easily test for, but in Python it's often considered more pythonic to shoot first and ask questions later, er, to use exception handling instead of tests like this (example: Better to 'try' something and catch the exception or test if its possible first to avoid an exception?).
I don't think it's a good idea to build 'find' out of 'index' and 'in', like
if foo in my_list:
    foo_index = my_list.index(foo)
else:
    foo_index = -1  # or do whatever else you want
because both in and index will require an O(n) pass over the list.
Better to build 'find' out of 'index' and try/except, like:
try:
    foo_index = my_list.index(foo)
except ValueError:
    foo_index = -1  # or do whatever else you want
Now, as to why list was built this way (with only index), and string was built the other way (with separate index and find)... I can't say.
The "find" method for lists is index.
I do consider the inconsistency between string.find and list.index to be unfortunate, both in name and behavior: string.find returns -1 when no match is found, where list.index raises ValueError. This could have been designed more consistently. The only irreconcilable difference between these operations is that string.find searches for a string of items, where list.index searches for exactly one item (which, alone, doesn't justify using different names).
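The difference in failure behavior is easy to demonstrate (an interpreter-session sketch):

>>> "abcdef".find("z")     # str.find signals a miss with a sentinel
-1
>>> "abcdef".index("z")    # str.index raises instead
Traceback (most recent call last):
  ...
ValueError: substring not found
>>> [1, 2, 3].index(9)     # list.index behaves like str.index
Traceback (most recent call last):
  ...
ValueError: 9 is not in list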

Python: Nested for loops or "next" statement

I'm a rookie hobbyist and I nest for loops when I write Python, like so:
d = {
    key1: {subkey_or_value1: value2},
    ...
    keyn: {subkey_or_valuen: valuen+1},
}
for key in d:
    for subkey_or_value in d[key]:
        # do it to it
I'm aware of a next built-in that would accomplish the same goal in one line (I asked a question about how to use it, but I didn't quite understand the answer).
So to me, a nested for loop is much more readable. Why, then, do people use next? I read somewhere that Python is a dynamically-typed and interpreted language, and that because + both concatenates strings and sums numbers, it must check variable types on each loop iteration in order to know what the operators do. Does using next prevent this in some way, speeding up execution, or is it just a matter of style/preference?
next is precious to advance an iterator when necessary, without that advancement controlling an explicit for loop. For example, if you want "the first item in S that's greater than 100", next(x for x in S if x > 100) will give it to you, no muss, no fuss, no unneeded work (as everything terminates as soon as a suitable x is located) -- and you get an exception (StopIteration) if unexpectedly no x matches the condition. If a no-match is expected and you want None in that case, next((x for x in S if x > 100), None) will deliver that. For this specific purpose, it might be clearer to you if next was actually named first, but that would betray its much more general use.
Consider, for example, the task of merging multiple sequences (e.g., a union or intersection of sorted sequences -- say, sorted files, where the items are lines). Again, next is just what the doctor ordered, because none of the sequences can dominate over the others by controlling a "main" for loop. So, assuming for simplicity no duplicates can exist (a condition that's not hard to relax if needed), you keep pairs (currentitem, itsfile) in a list controlled by heapq, and the merging becomes easy... but only thanks to the magic of next to advance the correct file once its item has been used, and that file only.
import heapq

def merge(*theopentextfiles):
    theheap = []
    for afile in theopentextfiles:
        theitem = next(afile, '')
        if theitem:
            theheap.append((theitem, afile))
    heapq.heapify(theheap)
    while theheap:
        theitem, afile = heapq.heappop(theheap)
        yield theitem
        theitem = next(afile, '')
        if theitem:
            heapq.heappush(theheap, (theitem, afile))
Just try to do anything anywhere this elegant without next...!-)
One could go on for a long time, but the two use cases "advance an iterator by one place (without letting it control a whole for loop)" and "get just the first item from an iterator" account for most important uses of next.
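For instance (a tiny usage sketch; any iterators of sorted, duplicate-free lines work in place of open files):

a = iter(["apple\n", "cherry\n"])
b = iter(["banana\n", "date\n"])
print(list(merge(a, b)))
# ['apple\n', 'banana\n', 'cherry\n', 'date\n']

(The standard library also offers heapq.merge, which does the same job, but the hand-rolled version above is exactly the kind of place where next earns its keep.)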
