faster membership testing in Python than set()

I have to check the presence of millions of elements (20-30 character strings) against a collection containing 10-100k of those elements. Is there a faster way of doing that in Python than set()?
import sys

# load ids
ids = set(x.strip() for x in open(idfile))

for line in sys.stdin:
    id = line.strip()
    if id in ids:
        # print fastq
        print id
        # update ids
        ids.remove(id)

set is as fast as it gets.
However, if you rewrite your code to create the set once, and not change it, you can use the frozenset built-in type. It's exactly the same except immutable.
If you're still having speed problems, you need to speed your program up in other ways, such as by using PyPy instead of CPython.
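For example, a minimal sketch of the frozenset variant (reusing idfile from the question; it matches the original loop except that the set is never mutated):

import sys

# build the lookup structure once; frozenset guarantees it never changes
ids = frozenset(x.strip() for x in open(idfile))

for line in sys.stdin:
    id = line.strip()
    if id in ids:
        # print fastq
        print id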

As I noted in my comment, what's probably slowing you down is that you're sequentially checking each line from sys.stdin for membership of your 'master' set. This is going to be really, really slow, and doesn't allow you to make use of the speed of set operations. As an example:
#!/usr/bin/env python
import random
# create two million-element sets of random numbers
a = set(random.sample(xrange(10000000),1000000))
b = set(random.sample(xrange(10000000),1000000))
# a intersection b
c = a & b
# a difference c
d = list(a - c)
print "set d is all remaining elements in a not common to a intersection b"
print "length of d is %s" % len(d)
The above runs in ~6 wallclock seconds on my five-year-old machine, and it's testing for membership in larger sets than you require (unless I've misunderstood you). Most of that time is actually taken up creating the sets, so you won't even have that overhead. The fact that the strings you refer to are long isn't relevant here; creating a set creates a hash table, as agf explained. I suspect (though again, it's not clear from your question) that if you can get all your input data into a set before you do any membership testing, it'll be a lot faster than reading it in one item at a time and then checking for set membership.

You should try to split your data to make the search faster. A tree structure would let you find out very quickly whether the data is present or not.
For example, start with a simple map that links the first letter to all the keys starting with that letter; that way you don't have to search all the keys, only a smaller part of them.
This would look like:
ids = {}
for id in open(idfile):
    id = id.strip()  # strip the newline, as in the original code
    ids.setdefault(id[0], set()).add(id)

for line in sys.stdin:
    id = line.strip()
    if id in ids.get(id[0], set()):
        # print fastq
        print id
        # update ids
        ids[id[0]].remove(id)
Creation will be a bit slower, but the search should be much faster (I would expect 20 times faster, if the first character of your keys is well distributed and not always the same).
This is a first step; you could do the same thing with the second character and so on. The search would then just be walking the tree letter by letter...

As mentioned by urschrei, you should "vectorize" the check.
It is faster to check for the presence of a million elements once (as that is done in C) than to do the check for one element a million times.
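For instance, a hedged sketch of what that looks like for the original question's setup (ids and idfile as defined there); note it trades the original streaming behaviour for one bulk set intersection, which runs in C:

import sys

ids = set(x.strip() for x in open(idfile))
# one bulk intersection instead of millions of single membership tests
seen = set(line.strip() for line in sys.stdin)
for id in ids & seen:
    print id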

Related

Python - convert list into dictionary in order to reduce complexity

Let's say I have a big list:
word_list = [elt.strip() for elt in open("bible_words.txt", "r").readlines()]
# complexity O(n) --> proportional to list length n
I have learned that hash function used for building up dictionaries allows lookup to be much faster, like so:
word_dict = dict((elt, 1) for elt in word_list)
# complexity O(1) --> constant
Using word_list, what is the recommended, most efficient way to reduce the complexity of my code?
The code from the question does just one thing: fills all words from a file into a list. The complexity of that is O(n).
Filling the same words into any other type of container will still have at least O(n) complexity, because it has to read all of the words from the file and it has to put all of the words into the container.
What is different with a dict?
Finding out whether something is in a list has O(n) complexity, because the algorithm has to go through the list item by item and check whether it is the sought item. The item can be found at position 0, which is fast, or it could be the last item (or not in the list at all), which makes it O(n).
In a dict, data is organized in "buckets". When a key:value pair is saved to a dict, the hash of the key is calculated and that number is used to identify the bucket into which the data is stored. Later on, when the key is looked up, hash(key) is calculated again to identify the bucket, and then only that bucket is searched. There is typically only one key:value pair per bucket, so the search can be done in O(1).
For more details, see the article about DictionaryKeys on python.org.
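To make the bucket idea concrete, a toy illustration (real dicts mask the hash with the table size, which is a power of two, and handle collisions by probing; the numbers here are illustrative only):

key = 'genesis'
table_size = 8
bucket = hash(key) % table_size  # the same key always lands in the same bucket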
How about a set?
A set is something like a dictionary with only keys and no values. The question contains this code:
word_dict = dict((elt, 1) for elt in word_list)
That is obviously a dictionary which does not need values, so a set would be more appropriate.
BTW, there is no need to first create word_list as a list and then convert it to a set or dict. The first step can be skipped:
set_of_words = {elt.strip() for elt in open("bible_words.txt", "r").readlines()}
Are there any drawbacks?
Always ;)
A set does not have duplicates. So counting how many times a word is in the set will never return 2. If that is needed, don't use a set.
A set is not ordered. There is no way to check which was the first word in the set. If that is needed, don't use a set.
Objects saved to sets have to be hashable, which kind-of implies that they are immutable. If it was possible to modify the object, then its hash would change, so it would be in the wrong bucket and searching for it would fail. Anyway, str, int, float, and tuple objects are immutable, so at least those can go into sets.
Writing to a set is probably going to be a bit slower than writing to a list. Still O(n), but a slower O(n), because it has to calculate hashes and organize into buckets, whereas a list just dumps one item after another. See timings below.
Reading everything from a set is also going to be a bit slower than reading everything from a list.
All of these apply to dict as well as to set.
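A quick illustration of the first and third drawbacks:

>>> words = {'begat', 'begat', 'lo'}
>>> len(words)            # the duplicate 'begat' is silently dropped
2
>>> {['a', 'list']}       # mutable objects are not hashable
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'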
Some examples with timings
Writing to list vs. set:
>>> timeit.timeit('[n for n in range(1000000)]', number=10)
0.7802875302271843
>>> timeit.timeit('{n for n in range(1000000)}', number=10)
1.025623542189976
Reading from list vs. set:
>>> timeit.timeit('989234 in values', setup='values=[n for n in range(1000000)]', number=10)
0.19846207875508526
>>> timeit.timeit('989234 in values', setup='values={n for n in range(1000000)}', number=10)
3.5699193290383846e-06
So, writing to a set seems to be about 30% slower, but finding an item in the set is thousands of times faster when there are thousands of items.

Which one is faster: iterating over a set and iterating over a list

Say that I have a list of strings and a set of the same strings:
l = [str1, str2, str3, str4, ...]
s = set([str1, str2, str3, str4, ...])
I need to run a string comparison with a phrase that I have: comparephrase
I need to iterate over all the elements in the list or the set and generate a ratio between comparephrase and the compared string. I know that set() is faster when we are doing a membership test. However, I am not doing a membership test but comparing my phrase with the strings in the list/set. Does set() still offer faster speed? If so, why? It seems to me that this set is actually a set with a list inside. Wouldn't that take a long time, since we'd be iterating over the list within the set?
Iterating over a list is much, much faster than iterating over a set.
The currently accepted answer is using a very small set and list and hence, the difference is negligible there.
The following code explains it:
>>> import timeit
>>> l = [ x*x for x in range(1, 400)]
>>> s = set(l)
>>> timeit.timeit("for i in s: pass", "from __main__ import s")
12.152284085999781
>>> timeit.timeit("for i in l: pass", "from __main__ import l")
5.460189446001095
>>> timeit.timeit("if 567 in l: pass", "from __main__ import l")
6.0497558240003855
>>> timeit.timeit("if 567 in s: pass", "from __main__ import s")
0.04609546199935721
I do not know what makes set iteration slow, but the fact is evident from the output above.
A Python set is optimized for equality tests and duplicate removal, and thus implements a hash table underneath. I believe this would make it very slightly slower than a list, if you have to compare every element to comparephrase; lists are very good for iterating over every element one after the other. The difference is probably going to be negligible in almost any case, though.
I've run some tests with timeit, and (while list performs slightly faster) there is no significant difference:
>>> import timeit
>>> # For the set
>>> timeit.timeit("for i in s: pass", "s = set([1,4,7,10,13])")
0.20565616500061878
>>> # For the list
>>> timeit.timeit("for i in l: pass", "l = [1,4,7,10,13]")
0.19532391999928223
These values stay very much the same (0.20 vs. 0.19) even when trying multiple times.
However, the overhead of creating the sets can be significant.
The test in the accepted answer is not really representative as Aditya Shaw stated.
Let me roughly explain the technical differences between iterating over lists and sets in an easy way.
Iterating lists
Lists have their elements organized by an order and an index "by design", which makes them easy to iterate.
Access by index is fast because it relies on a few cheap memory-read operations.
Iterating sets
Sets iterate more slowly because their elements are accessed via hashes.
Imagine a big tree with many branches, where each leaf is an element. A hash is translated into "directions" for traversing the branches until the leaf is reached.
Finding a leaf or element is still fast, but slower than a simple index access on a list.
Sets have no linked items, so an iteration cannot jump to the "next" item as easily as a list can. It has to start from the root of the tree on each iteration.
Sets (and dicts) are the opposite of lists.
Each kind has a primary use case.
Lists and sets swap their advantages when it comes to finding elements directly.
Does a list contain an element?
The list has to iterate over all its elements until it finds a match. That depends heavily on the list size.
If the element is near the beginning, it gets found quite fast even on large lists.
If an element is not in the list or near the end, the list gets iterated completely or until a match near the end.
Does a set contain an element?
It just has to traverse, say, 5 branches to see whether there's a leaf there.
Even in large sets, the number of branches to traverse is relatively low.
Why not make and use a universal collection type?
If a set had an internal list index, the set's operations would be slower because the list would have to be updated and/or checked.
If a list had an internal set to find items faster, list operations would be slower because the hashes would have to be updated and/or checked.
Data found via the hash and via the index could also become inconsistent because of duplicate management, and maintaining both structures increases memory usage.

list membership test or set

Is it more efficient to check if an item is already in a list before adding it:
for word in open('book.txt','r').read().split():
    if word in list:
        pass
    else:
        list.append(word)
or to add everything and then run set() on it? Like this:
for word in open('book.txt','r').read().split():
    list.append(word)
list = set(list)
If the ultimate intention is to construct a set, construct it directly and don't bother with the list:
words = set(open('book.txt','r').read().split())
This will be simple and efficient.
Just as your original code, this has the downside of first reading the entire file into memory. If that's an issue, this can be solved by reading one line at a time:
words = set(word for line in open('book.txt', 'r') for word in line.split())
(Thanks @Steve Jessop for the suggestion.)
Definitely don't take the first approach in your question, unless you know the list to be short, as it will need to scan the entire list on every single word.
A set is a hash table while a list is an array. set membership tests are O(1) while list membership tests are O(n). If anything, you should be filtering the list using a set, not filtering a set using a list.
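A hedged sketch of that direction (the names here are illustrative, not from the question):

stopwords = set(['thee', 'thou', 'unto'])        # the small set: O(1) lookups
words = open('book.txt', 'r').read().split()     # the big list
kept = [w for w in words if w not in stopwords]  # filter the list via the set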
It's worth testing to find out; but I frequently use comprehensions to filter my lists, and I find that works well; particularly if the code is experimental and subject to change.
l = list( open( 'book.txt', 'r').read().split() )
unique_l = list(set( l ))
# maybe something else:
good_l = [ word for word in l if not word in naughty_words ]
I have heard that this helps with efficiency; but as I said, a test tells more.
The algorithm with word in list is an expensive operation. Why? Because, to see if an item is in the list, you have to check every item in the list. Every time. It's a Shlemiel the painter algorithm. Every lookup is O(n), and you do it n times. There's no startup cost, but it gets expensive very quickly. And you end up looking at each item way more than one time - on average, len(list)/2 times.
Looking to see if things are in the set, is (usually) MUCH cheaper. Items are hashed, so you calculate the hash, look there, and if it's not there, it's not in the set - O(1). You do have to create the set the first time, so you'll look at every item once. Then you look at each item one more time to see if it's already in your set. Still overall O(n).
So, doing list(set(mylist)) is definitely preferable to your first solution.
@NPE's answer doesn't close the file explicitly. It's better to use a context manager:
with open('book.txt','r') as fin:
    words = set(fin.read().split())
For normal text files this is probably adequate. If it's an entire DNA sequence for example you probably don't want to read the entire file into memory at once.
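For that case, a line-at-a-time sketch that never holds the whole file in memory:

words = set()
with open('book.txt', 'r') as fin:
    for line in fin:
        words.update(line.split())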

python data type to track duplicates

I often keep track of duplicates with something like this:
processed = set()
for big_string in strings_generator:
    if big_string not in processed:
        processed.add(big_string)
        process(big_string)
I am dealing with massive amounts of data so don't want to maintain the processed set in memory. I have a version that uses sqlite to store the data on disk, but then this process runs much slower.
To cut down on memory use what do you think of using hashes like this:
processed = set()
for big_string in string_generator:
    key = hash(big_string)
    if key not in processed:
        processed.add(key)
        process(big_string)
The drawback is I could lose data through occasional hash collisions.
1 collision in 1 billion hashes would not be a problem for my use.
I tried the md5 hash but found generating the hashes became a bottleneck.
What would you suggest instead?
I'm going to assume you are hashing web pages. You have to hash at most 55 billion web pages (and that measure almost certainly overlooks some overlap).
You are willing to accept a less than one-in-a-billion chance of collision. This means that if we look at a hash function whose number of collisions is close to what we would get if the hash were truly random[^1], we want a hash range of size (55*10^9)*10^9. That is log2((55*10^9)*10^9) = 66 bits.
[^1]: since the hash can be considered to be chosen at random for this purpose,
p(collision) = (occupied range)/(total range)
Since there is a speed issue, but no real cryptographic concern, we can use a >66-bit non-cryptographic hash with the nice collision-distribution property outlined above.
It looks like we are looking for the 128-bit version of the Murmur3 hash. People have been reporting speed increases upwards of 12x comparing Murmur3_128 to MD5 on a 64-bit machine. You can use this library to do your speed tests. See also this related answer, which:
shows speed test results in the range of Python's str hash, whose speed you have already deemed acceptable elsewhere – though Python's hash is a 32-bit hash, leaving you only 2^32/(10^9) (that is, only 4) values stored with a less than one-in-a-billion chance of collision.
spawned a library of python bindings that you should be able to use directly.
Finally, I hope to have outlined the reasoning that could allow you to compare with other functions of varied size should you feel the need for it (e.g. if you up your collision tolerance, if the size of your indexed set is smaller than the whole Internet, etc, ...).
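As a sketch of what that looks like with the mmh3 bindings (assuming that package is installed; its exact API may vary between versions):

import mmh3  # pip install mmh3

processed = set()
for big_string in strings_generator:
    key = mmh3.hash128(big_string)  # 128-bit non-cryptographic hash
    if key not in processed:
        processed.add(key)
        process(big_string)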
You have to decide which is more important: space or time.
If time, then you need to create unique representations of your large_item which take as little space as possible (probably some str value) that is easy (i.e. quick) to calculate and will not have collisions, and store them in a set.
If space, find the quickest disk-backed solution you can and store the smallest possible unique value that will identify a large_item.
So either way, you want small unique identifiers -- depending on the nature of large_item this may be a big win, or not possible.
Update
they are strings of html content
Perhaps a hybrid solution then: Keep a set in memory of the normal Python hash, while also keeping the actual html content on disk, keyed by that hash; when you check to see if the current large_item is in the set and get a positive, double-check with the disk-backed solution to see if it's a real hit or not, then skip or process as appropriate. Something like this:
import dbf
on_disk = dbf.Table('/tmp/processed_items', 'hash N(17,0); value M')
index = on_disk.create_index(lambda rec: rec.hash)
fast_check = set()

def slow_check(hashed, item):
    matches = on_disk.search((hashed,))
    for record in matches:
        if item == record.value:
            return True
    return False

for large_item in many_items:
    hashed = hash(large_item)  # only calculate once
    if hashed not in fast_check or not slow_check(hashed, large_item):
        on_disk.append((hashed, large_item))
        fast_check.add(hashed)
        process(large_item)
FYI: dbf is a module I wrote which you can find on PyPI
If many_items already resides in memory, you are not creating another copy of the large_item. You are just storing a reference to it in the processed set.
If many_items is a file or some other generator, you'll have to look at other alternatives.
E.g. if many_items is a file, perhaps you can store a pointer to the item's position in the file instead of the actual item.
As you have already seen, there are a few options, but unfortunately none of them can fully address the situation, partly because:
Memory constraints: you cannot store every object in memory.
There is no perfect hash function, and for a huge data set the chance of collision is real.
Better hash functions (md5) are slower.
Using a database like sqlite would actually make things slower.
Reading the following excerpt from your question:
I have a version that uses sqlite to store the data on disk, but then this process runs much slower.
I feel that working on this angle might help you marginally. Here is how:
Use tmpfs to create a ramdisk. tmpfs has several advantages over other ramdisk implementations because it supports swapping less-used pages to swap space.
Store the sqlite database on the ramdisk.
Change the size of the ramdisk and profile your code to check your performance.
I suppose you already have working code to save your data in sqlite. You only need to define a tmpfs and use its path to store your database.
Caveat: This is a Linux-only solution.
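A minimal sketch, assuming a tmpfs mount already exists at /mnt/ramdisk (the mount point, database path, and schema here are illustrative):

import sqlite3

# e.g. the mount was created beforehand with:
#   mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk
conn = sqlite3.connect('/mnt/ramdisk/processed.db')
conn.execute('CREATE TABLE IF NOT EXISTS processed (big_string TEXT PRIMARY KEY)')

def is_new(big_string):
    # INSERT OR IGNORE changes no rows when the key is already present
    cur = conn.execute('INSERT OR IGNORE INTO processed VALUES (?)',
                       (big_string,))
    return cur.rowcount == 1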
a bloom filter? http://en.wikipedia.org/wiki/Bloom_filter
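To expand on that: a Bloom filter answers "definitely not seen" or "probably seen" from a fixed-size bit array, trading rare false positives (but never false negatives) for very low memory use. A minimal sketch, with illustrative (untuned) sizing:

import hashlib

class BloomFilter(object):
    # Size the bit array and hash count from your expected item count
    # and acceptable false-positive rate; these defaults are arbitrary.
    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k indexes from the two halves of one SHA-1 digest
        # (double hashing), a common trick to avoid k separate hashes.
        digest = hashlib.sha1(item.encode('utf-8')).hexdigest()
        h1, h2 = int(digest[:20], 16), int(digest[20:], 16)
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

A "big_string in bloom" hit would then be double-checked (or simply accepted, given the stated one-in-a-billion tolerance), while a miss is guaranteed to be new.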
well you can always decorate large_item with a processed flag. Or something similar.
You can give the str type's __hash__ function a try.
In [1]: hash('http://stackoverflow.com')
Out[1]: -5768830964305142685
It's definitely not a cryptographic hash function, but with a little luck you won't have too many collisions. It works as described here: http://effbot.org/zone/python-hash.htm.
I suggest you profile standard Python hash functions and choose the fastest: they are all "safe" against collisions enough for your application.
Here are some benchmarks for hash, md5 and sha1:
In [37]: very_long_string = 'x' * 1000000
In [39]: %timeit hash(very_long_string)
10000000 loops, best of 3: 86 ns per loop
In [40]: from hashlib import md5, sha1
In [42]: %timeit md5(very_long_string).hexdigest()
100 loops, best of 3: 2.01 ms per loop
In [43]: %timeit sha1(very_long_string).hexdigest()
100 loops, best of 3: 2.54 ms per loop
md5 and sha1 are comparable in speed. hash is 20k times faster for this string and it does not seem to depend much on the size of the string itself.
How does your sqlite version work? If you insert all your strings into a database table and then run the query "select distinct big_string from table_name", the database should optimize it for you.
Another option for you would be to use Hadoop.
Another option could be to split the strings into partitions such that each partition is small enough to fit in memory. Then you only need to check for duplicates within each partition. The formula you use to decide the partition will choose the same partition for each duplicate. The easiest way is to just look at the first few characters of the string, e.g.:
from collections import defaultdict

d = defaultdict(int)
for big_string in strings_generator:
    d[big_string[:4]] += 1
print d
Now you can decide on your partitions, go through the generator again, and write each big_string to a file whose name contains the start of the big_string. Then you can just use your original method on each file and loop through all the files.
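A sketch of that second pass (assuming the generator can be recreated for another run, that the 4-character prefixes are safe to use in filenames, and that one string per line is a valid encoding for your data):

out_files = {}
for big_string in strings_generator:  # a fresh generator for the second pass
    prefix = big_string[:4]
    if prefix not in out_files:
        out_files[prefix] = open('partition_%s.txt' % prefix, 'w')
    out_files[prefix].write(big_string + '\n')
for f in out_files.values():
    f.close()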
This can be handled much more easily by performing simpler checks first, then investigating those cases with more elaborate checks. The example below contains extracts of your code, but it performs the expensive checks on much smaller sets of data, by first grouping on a simple property that is cheap to check. If you find that (filesize, checksum) pairs are not discriminating enough, you can easily swap in a different cheap, yet rigorous, check.
# Need to define the following functions
def GetFileSize(filename):
    pass

def GenerateChecksum(filename):
    pass

def LoadBigString(filename):
    pass

# Returns a list of duplicate pairs.
def GetDuplicates(filename_list):
    duplicates = list()
    # Maps a quick hash to the list of filenames that produced it.
    filename_lists_by_quick_checks = dict()
    for filename in filename_list:
        quickcheck = GetQuickCheck(filename)
        if not filename_lists_by_quick_checks.has_key(quickcheck):
            filename_lists_by_quick_checks[quickcheck] = list()
        filename_lists_by_quick_checks[quickcheck].append(filename)
    for quickcheck, filename_list in filename_lists_by_quick_checks.iteritems():
        big_strings = GetBigStrings(filename_list)
        duplicates.extend(GetBigstringDuplicates(big_strings))
    return duplicates

def GetBigstringDuplicates(strings_generator):
    duplicates = list()
    processed = set()
    for big_string in strings_generator:
        if big_string not in processed:
            processed.add(big_string)
            process(big_string)
        else:
            duplicates.append(big_string)
    return duplicates

# Returns a tuple containing (filesize, checksum).
def GetQuickCheck(filename):
    return (GetFileSize(filename), GenerateChecksum(filename))

# Returns a list of big_strings from a list of filenames.
def GetBigStrings(file_list):
    big_strings = list()
    for filename in file_list:
        big_strings.append(LoadBigString(filename))
    return big_strings

What is the best way to store set data in Python?

I have a list of data in the following form:
[(id_1, description, id_type), (id_2, description, id_type), ..., (id_n, description, id_type)]
The data are loaded from files that belong to the same group. In each group there could be multiples of the same id, each coming from different files. I don't care about the duplicates, so I thought that a nice way to store all of this would be to throw it into a Set type. But there's a problem.
Sometimes for the same id the descriptions can vary slightly, as follows:
IPI00110753
Tubulin alpha-1A chain
Tubulin alpha-1 chain
Alpha-tubulin 1
Alpha-tubulin isotype M-alpha-1
(Note that this example is taken from the uniprot protein database.)
I don't care if the descriptions vary. I cannot throw them away because there is a chance that the protein database I am using will not contain a listing for a certain identifier. If this happens I will want to be able to display the human readable description to the biologists so they know roughly what protein they are looking at.
I am currently solving this problem by using a dictionary type. However I don't really like this solution because it uses a lot of memory (I have a lot of these ID's). This is only an intermediary listing of them. There is some additional processing the ID's go through before they are placed in the database so I would like to keep my data-structure smaller.
I have two questions really. First, will I get a smaller memory footprint using the set type (over the dictionary type) for this, or should I use a sorted list where I check on every insert whether the ID already exists, or is there a third solution that I haven't thought of? Second, if the set type is the better answer, how do I key it to look at just the first element of the tuple instead of the whole thing?
Thank you for reading my question,
Tim
Update
Based on some of the comments I received, let me clarify a little. Most of what I do with the data structure is insert into it. I only read it twice: once to annotate it with additional information,* and once to insert it into the database. However, down the line there may be additional annotation that is done before I insert into the database. Unfortunately I don't know if that will happen at this time.
Right now I am looking into storing this data in a structure that is not based on a hash table (i.e. a dictionary). I would like the new structure to be fairly quick on insertion, but reading it can be linear since I only really do it twice. I am trying to move away from the hash table to save space. Is there a better structure, or is a hash table about as good as it gets?
*The information is a list of Swiss-Prot protein identifiers that I get by querying uniprot.
Sets don't have keys. The element is the key.
If you think you want keys, you have a mapping. More-or-less by definition.
Sequential list lookup can be slow, even using a binary search. Mappings use hashes and are fast.
Are you talking about a dictionary like this?
{ 'id1': [ ('description1a', 'type1'), ('description1b', 'type1') ],
  'id2': [ ('description2', 'type2') ],
  ...
}
This sure seems minimal. ID's are only represented once.
Perhaps you have something like this?
{ 'id1': ( ('description1a', 'description1b'), 'type1' ),
  'id2': ( ('description2',), 'type2' ),
  ...
}
I'm not sure you can find anything more compact unless you resort to using the struct module.
I'm assuming the problem you're trying to solve by cutting down on memory use is the address-space limit of your process. Additionally, you're searching for a data structure that allows fast insertion and reasonable sequential read-out.
Use few structures other than strings (str)
The question you ask is how to structure your data in one process to use less memory. The one canonical answer to this (as long as you still need associative lookups) is to use as few structures other than Python strings (str, not unicode) as possible. A Python hash (dictionary) stores the references to your strings fairly efficiently (it is not a b-tree implementation).
However I think that you will not get very far with that approach, since what you face are huge datasets that might eventually just exceed the process address space and the physical memory of the machine you're working with altogether.
Alternative Solution
I would propose a different solution that does not involve changing your data structure to something that is harder to insert or interpret.
Split your information up across multiple processes, each holding whatever data structure is convenient for its part.
Implement inter-process communication with sockets, so that processes might reside on other machines altogether.
Try to divide your data so as to minimize inter-process communication (I/O is glacially slow compared to CPU cycles).
The advantages of the approach I outline are:
You get to use two or more cores of a machine fully for performance.
You are not limited by the address space of one process, or even the physical memory of one machine.
There are numerous packages and approaches to distributed processing, some of which are:
linda
processing
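As a rough sketch of the sharding idea using the standard library's multiprocessing module instead of those packages (routing each record by hash so every worker owns a disjoint shard; records and the per-shard handling are illustrative assumptions):

from multiprocessing import Process, Queue

NUM_WORKERS = 4

def worker(inbox):
    shard = {}  # this worker's slice of the id -> (descriptions, type) mapping
    for id, description, id_type in iter(inbox.get, None):  # None = shutdown
        shard.setdefault(id, (set(), id_type))[0].add(description)
    # ... annotate the shard and write it to the database here ...

queues = [Queue() for _ in range(NUM_WORKERS)]
workers = [Process(target=worker, args=(q,)) for q in queues]
for w in workers:
    w.start()
for record in records:  # records: the (id, description, id_type) tuples
    queues[hash(record[0]) % NUM_WORKERS].put(record)  # route by id hash
for q in queues:
    q.put(None)
for w in workers:
    w.join()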
If you're doing an n-way merge with removing duplicates, the following may be what you're looking for.
This generator will merge any number of sources. Each source must be a sequence.
The key must be in position 0. It yields the merged sequence one item at a time.
def merge( *sources ):
    keyPos = 0
    for s in sources:
        s.sort()
    while any( [len(s) > 0 for s in sources] ):
        topEnum = enumerate([ s[0][keyPos] if len(s) > 0 else None for s in sources ])
        top = [ t for t in topEnum if t[1] is not None ]
        top.sort( key=lambda a: a[1] )
        src, key = top[0]
        #print src, key
        yield sources[ src ].pop(0)
This generator removes duplicates from a sequence.
def unique( sequence ):
    keyPos = 0
    seqIter = iter(sequence)
    curr = seqIter.next()
    for next in seqIter:
        if next[keyPos] == curr[keyPos]:
            # might want to create a sub-list of matches
            continue
        yield curr
        curr = next
    yield curr
Here's a script which uses these functions to produce a resulting sequence which is the union of all the sources with duplicates removed.
for u in unique( merge( source1, source2, source3, ... ) ):
    print u
The complete set of data in each sequence must exist in memory once because we're sorting in memory. However, the resulting sequence does not actually exist in memory. Indeed, it works by consuming the other sequences.
How about using {id: (description, id_type)} dictionary? Or {(id, id_type): description} dictionary if (id,id_type) is the key.
Sets in Python are implemented using hash tables. In earlier versions, they were actually implemented using dictionaries, but that has changed AFAIK. The only thing you save by using a set would then be the size of a pointer for each entry (the pointer to the value).
To use only a part of a tuple for the hash code, you'd have to subclass tuple and override the __hash__ method:
class ProteinTuple(tuple):
    def __new__(cls, m1, m2, m3):
        return tuple.__new__(cls, (m1, m2, m3))

    def __hash__(self):
        return hash(self[0])
Keep in mind that you pay for the extra function call to __hash__ in this case, because otherwise it would be a C method.
I'd go for Constantin's suggestions and take out the id from the tuple and see how much that helps.
It's still murky, but it sounds like you have several lists of [(id, description, type), ...].
The id's are unique within a list and consistent between lists.
You want to create a UNION: a single list, where each id occurs once, with possibly multiple descriptions.
For some reason, you think a mapping might be too big. Do you have any evidence of this? Don't over-optimize without actual measurements.
This may be (if I'm guessing correctly) the standard "merge" operation from multiple sources.
source1.sort()
source2.sort()
result = []
while len(source1) > 0 or len(source2) > 0:
    if len(source1) == 0:
        result.append( source2.pop(0) )
    elif len(source2) == 0:
        result.append( source1.pop(0) )
    elif source1[0][0] < source2[0][0]:
        result.append( source1.pop(0) )
    elif source2[0][0] < source1[0][0]:
        result.append( source2.pop(0) )
    else:
        # keys are equal: consume both so each id occurs only once
        result.append( source1.pop(0) )
        source2.pop(0)  # check this one to see if the description is different
This assembles a union of two lists by sorting and merging. No mapping, no hash.
