Lexicographical Sorting of Word List - python

I need to merge and sort lists of 100,000+ words lexicographically. I currently do it with a slightly modified bubble sort, but at O(n^2) it takes quite a while. Are there any faster algorithms for sorting lists of words? I'm using Python, but if there is a language that can handle this better I'm open to suggestions.

Use the built-in sort() list method:
>>> words = [ 'baloney', 'aardvark' ]
>>> words.sort()
>>> print words
['aardvark', 'baloney']
It uses an O(n lg(n)) sort1: Timsort, which is a modified merge sort that is highly tuned for speed.
1 As pointed out in the comments, this refers to the number of element comparisons, not the number of low-level operations. Since the elements in this case are strings, and comparing two strings takes min{|S1|, |S2|} character comparisons, the total complexity is O(n lg(n) * |S|) where |S| is the length of the longest string being sorted. This is true of all comparison sorts, however -- the true number of operations varies depending on the cost of the element-comparison function for the type of elements being sorted. Since all comparison sorts use the same comparison function, you can just ignore this subtlety when comparing the algorithmic complexity of these sorts amongst each other.

Any O(n log n) comparison sort will probably do better than bubble sort, but it will still cost O(n log n * |S|) because of the string comparisons.
However, sorting strings can be done in O(n * |S|), where |S| is the average string length, using a trie and a simple DFS.
High-level pseudocode:
1. Create a trie from your collection.
2. Do a DFS on the trie, appending each string to the output list when you reach a terminal node.
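A minimal sketch of that idea in Python (sort_with_trie and the END sentinel are illustrative names, not from any library; with a fixed alphabet the per-node child sort is a constant factor, so the overall cost stays around O(n * |S|)):
# Sketch: lexicographic sort via a trie + DFS.
# Duplicates are preserved via a per-node count.
def sort_with_trie(words):
    END = object()                          # sentinel key marking end-of-word
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = node.get(END, 0) + 1

    out = []
    def dfs(node, prefix):
        if END in node:
            out.extend([prefix] * node[END])
        for ch in sorted(k for k in node if k is not END):
            dfs(node[ch], prefix + ch)
    dfs(root, '')
    return out

print(sort_with_trie(['baloney', 'aardvark', 'aard']))
# ['aard', 'aardvark', 'baloney']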

Related

Most efficient way to get first value that startswith of large list

I have a very large list with over 100M strings. An example of that list looks as follows:
l = ['1,1,5.8067',
     '1,2,4.9700',
     '2,2,3.9623',
     '2,3,1.9438',
     '2,7,1.0645',
     '3,3,8.9331',
     '3,5,2.6772',
     '3,7,3.8107',
     '3,9,7.1008']
I would like to get the first string that starts with e.g. '3'.
To do so, I have used a lambda iterator followed by next() to get the first item:
next(filter(lambda i: i.startswith('3,'), l))
Out[1]: '3,3,8.9331'
Considering the size of the list, this strategy unfortunately still takes a relatively long time for a process I have to repeat over and over. I was wondering if someone could come up with an even faster, more efficient approach. I am open to alternative strategies.
I have no way of testing it myself, but it is possible that joining all the strings with a character that does not occur in any of them:
concat_list = '$'.join(l)
and then using a simple .find('$3,') would be faster, especially if the strings are relatively short, since the whole text then sits in one contiguous block of memory.
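A quick sketch of that idea (the '$' separator is assumed not to occur in any string; a leading separator is prepended so the very first string can also be matched):
# Sketch: concatenate once, then search with str.find().
sep = '$'                              # assumed absent from every string
concat_list = sep + sep.join(l)        # leading sep so the first string is matchable too
pos = concat_list.find(sep + '3,')
if pos != -1:
    end = concat_list.find(sep, pos + 1)
    first_match = concat_list[pos + 1:end if end != -1 else None]
    print(first_match)                 # '3,3,8.9331' for the example list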
If the number of unique letters in the text is small, you can use the Abrahamson-Kosaraju method and get a time complexity of practically O(n).
Another approach is to use joblib: create n workers, where the i-th worker checks the items at indices i + k * n; when one of them finds the pattern it stops the others, so the running time is roughly that of the naive algorithm divided by n.
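A rough sketch of that idea with joblib is below. Caveats: joblib has no built-in way to cancel the remaining workers once one finds a hit, so this version lets every worker finish its own stride and then takes the hit with the smallest original index; with the default process backend the list also has to be shipped to each worker, which can easily eat the gains for 100M strings. The helper names are illustrative.
from joblib import Parallel, delayed

def first_match_in_stride(l, prefix, start, step):
    # Scan indices start, start+step, start+2*step, ... and return the
    # first (index, value) whose value starts with prefix, else None.
    for idx in range(start, len(l), step):
        if l[idx].startswith(prefix):
            return (idx, l[idx])
    return None

def parallel_first_startswith(l, prefix, n_jobs=4):
    hits = Parallel(n_jobs=n_jobs)(
        delayed(first_match_in_stride)(l, prefix, i, n_jobs)
        for i in range(n_jobs)
    )
    hits = [h for h in hits if h is not None]
    return min(hits)[1] if hits else None   # hit with the smallest index wins

# parallel_first_startswith(l, '3,') -> '3,3,8.9331' for the example list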
Since your actual strings consist of relatively short tokens (such as 301) after splitting the strings by tabs, you can build a dict keyed by every possible prefix of the first token, so that subsequent lookups take only O(1) average time.
Build the dict from the list in reverse order so that, for each distinct prefix, the first value in the list that starts with it is the one retained in the final dict:
d = {s[:i + 1]: s for s in reversed(l) for i in range(len(s.split('\t')[0]))}
so that given:
l = ['301\t301\t51.806763\n', '301\t302\t46.970094\n',
     '301\t303\t39.962393\n', '301\t304\t18.943836\n',
     '301\t305\t11.064584\n', '301\t306\t4.751911\n']
d['3'] will return '301\t301\t51.806763'.
If you only need to test each of the first tokens as a whole, rather than prefixes, you can simply make the first tokens as the keys instead:
d = {s.split('\t')[0]: s for s in reversed(l)}
so that d['301'] will return '301\t301\t51.806763'.

String searching algorithm - Complexity of string matching

I am trying to solve the 'string search' problem, but the answers on many sites seem complex ('naive string search' is O(m(n-m+1))). What's the issue with my algorithm below? It appears to have worst-case complexity O(n), while KMP is also O(n), so I must definitely be wrong somewhere, but where?
def find(s1, s2):
    size = len(s1)
    index = 0
    while (index != len(s2)):
        if s2[index : index + size] == s1:
            print 'Pattern found at index %s' % (index)
            index += size
        else:
            index += 1
OK, so I was assuming s2[index : index + size] == s1 to be O(1), when it is actually O(n); so now my original question becomes:
Why aren't the hashes of the two strings calculated and compared? If both hashes are equal, the strings should be equal.
I don't see how they can collide. Isn't that dependent on the hashing algorithm? MD5, for example, has known breaks.
Original question
I don't think your code has complexity O(n), but rather O(mn). The check s2[index : index + size] == s1 is not constant time: in the worst case it needs to do len(s1) comparisons of characters.
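To see the O(mn) behaviour concretely, here is a small illustrative count (not from the original post) of how many characters the slice comparison inspects on a worst-case input, where the text is one repeated character and the pattern only mismatches at its last character:
m, n = 100, 10000
s1 = 'a' * (m - 1) + 'b'          # pattern that never matches
s2 = 'a' * n                      # text of repeated 'a's

comparisons = 0
index = 0
while index != len(s2):
    window = s2[index:index + m]
    # s2[index:index+m] == s1 walks the characters until the first mismatch
    k = 0
    while k < min(len(window), m) and window[k] == s1[k]:
        k += 1
    comparisons += k + 1
    index += 1
print(comparisons)                # on the order of m * n (~10**6 here), not n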
Hashing
Here's Wikipedia's definition of a hash function:
A hash function is any function that can be used to map data of
arbitrary size to data of fixed size. The values returned by a hash
function are called hash values, hash codes, digests, or simply
hashes. One use is a data structure called a hash table, widely used in computer software for rapid data lookup.
Here we run into the first problem with this approach. A hash function takes in a value of arbitrary size, and returns a value of a fixed size. Following the pigeonhole principle, there is at least one hash with multiple values, probably more. As a quick example, imagine your hash function always produces an output that is one byte long. That means there are 256 possible outputs. After you've hashed 257 items, you'll always be certain there are at least 2 items with the same hash. To avoid this for as long as possible, a good hash function will map inputs over all possible outputs as uniformly as possible.
So if the hashes aren't equal, you can be sure the strings aren't equal, but not vice versa. Two different strings can have the same hash.
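Here is a tiny, purely illustrative demonstration of that pigeonhole argument, using MD5 truncated to a single byte as a deliberately small "hash":
import hashlib
from collections import defaultdict

def tiny_hash(s):
    # A deliberately bad hash: keep only the first byte of the MD5 digest,
    # so there are just 256 possible outputs.
    return hashlib.md5(s.encode()).digest()[0]

buckets = defaultdict(list)
for i in range(300):                 # 300 inputs, only 256 possible outputs
    buckets[tiny_hash('string%d' % i)].append('string%d' % i)

collisions = {h: v for h, v in buckets.items() if len(v) > 1}
print(len(collisions) > 0)           # True: some distinct strings share a hash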

Python heapq vs sorted speed for pre-sorted lists

I have a reasonably large number n=10000 of sorted lists of length k=100 each. Since merging two sorted lists takes linear time, I would imagine it's cheaper to recursively merge the sorted lists with heapq.merge(), doing O(nk) work at each of the log(n) levels of a merge tree, than to sort the entire thing at once with sorted() in O(nk log(nk)) time.
However, the sorted() approach seems to be 17-44x faster on my machine. Is the implementation of sorted() that much faster than heapq.merge() that it outstrips the asymptotic time advantage of the classic merge?
import itertools
import heapq

data = [range(n*8000, n*8000+10000, 100) for n in range(10000)]

# Approach 1
for val in heapq.merge(*data):
    test = val

# Approach 2
for val in sorted(itertools.chain(*data)):
    test = val
CPython's list.sort() uses an adaptive merge sort, which identifies natural runs in the input, and then merges them "intelligently". It's very effective at exploiting many kinds of pre-existing order. For example, try sorting range(N)*2 (in Python 2) for increasing values of N, and you'll find the time needed grows linearly in N.
So the only real advantage of heapq.merge() in this application is lower peak memory use if you iterate over the results (instead of materializing an ordered list containing all the results).
In fact, list.sort() is taking more advantage of the structure in your specific data than the heapq.merge() approach. I have some insight into this, because I wrote Python's list.sort() ;-)
(BTW, I see you already accepted an answer, and that's fine by me - it's a good answer. I just wanted to give a bit more info.)
ABOUT THAT "more advantage"
As discussed a bit in comments, list.sort() plays lots of engineering tricks that may cut the number of comparisons needed over what heapq.merge() needs. It depends on the data. Here's a quick account of what happens for the specific data in your question. First define a class that counts the number of comparisons performed (note that I'm using Python 3, so have to account for all possible comparisons):
class V(object):
    def __init__(self, val):
        self.val = val

    def __lt__(a, b):
        global ncmp
        ncmp += 1
        return a.val < b.val

    def __eq__(a, b):
        global ncmp
        ncmp += 1
        return a.val == b.val

    def __le__(a, b):
        raise ValueError("unexpected comparison")

    __ne__ = __gt__ = __ge__ = __le__
sort() was deliberately written to use only < (__lt__). It's more of an accident in heapq (and, as I recall, even varies across Python versions), but it turns out .merge() only required < and ==. So those are the only comparisons the class defines in a useful way.
Then changing your data to use instances of that class:
data = [[V(i) for i in range(n*8000, n*8000+10000, 100)]
        for n in range(10000)]
Then run both methods:
ncmp = 0
for val in heapq.merge(*data):
    test = val
print(format(ncmp, ","))

ncmp = 0
for val in sorted(itertools.chain(*data)):
    test = val
print(format(ncmp, ","))
The output is kinda remarkable:
43,207,638
1,639,884
So sorted() required far fewer comparisons than merge(), for this specific data. And that's the primary reason it's much faster.
LONG STORY SHORT
Those comparison counts looked too remarkable to me ;-) The count for heapq.merge() looked about twice as large as I thought reasonable.
Took a while to track this down. In short, it's an artifact of the way heapq.merge() is implemented: it maintains a heap of 3-element list objects, each containing the current next value from an iterable, the 0-based index of that iterable among all the iterables (to break comparison ties), and that iterable's __next__ method. The heapq functions all compare these little lists (instead of just the iterables' values), and list comparison always goes thru the lists first looking for the first corresponding items that are not ==.
So, e.g., asking whether [0] < [1] first asks whether 0 == 1. It's not, so then it goes on to ask whether 0 < 1.
Because of this, each < comparison done during the execution of heapq.merge() actually does two object comparisons (one ==, the other <). The == comparisons are "wasted" work, in the sense that they're not logically necessary to solve the problem - they're just "an optimization" (which happens not to pay in this context!) used internally by list comparison.
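You can see that directly with the V class above (an illustrative check, not part of the original answer):
# Comparing two 1-element lists of V instances does two element
# comparisons: one __eq__ (to find the first mismatching position)
# followed by one __lt__ (to decide the ordering).
ncmp = 0
[V(0)] < [V(1)]
print(ncmp)   # 2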
So in some sense it would be fairer to cut the report of heapq.merge() comparisons in half. But it's still way more than sorted() needed, so I'll let it drop now ;-)
sorted uses an adaptive mergesort that detects sorted runs and merges them efficiently, so it gets to take advantage of all the same structure in the input that heapq.merge gets to use. Also, sorted has a really nice C implementation with a lot more optimization effort put into it than heapq.merge.

Python: time/platform independent fast hash for big sets of pairs of ints

I would like a time- and platform-independent hash function for big sets of pairs of integers in Python, one that is also fast and has (almost certainly) no collisions. (Hm, what else would you want a hash to be, but anyway...)
What I have so far is to use hashlib.md5 on the string representation of the sorted list:
my_set = set([(1,2), (0,3), (1,3)])  # the input set, size 1...10^6
import hashlib

def MyHash(my_set):
    my_lst = sorted(my_set)
    my_str = str(my_lst)
    return hashlib.md5(my_str).hexdigest()
my_set contains between 1 and 10^5 pairs, and each int is between 0 and 10^6. In total, I have about 10^8 such sets on which the hash should be almost certainly unique.
Does this sound reasonable, or is there a better way of doing it?
On my example set with 10^6 pairs in the list, this takes about 2.5 sec, so improvements on the time would be welcome, if possible. Almost all of the time is spent computing the string of the sorted list, so a big part of the question is:
Is the string of a sorted list of tuples of integers in python stable among versions and platforms? Is there a better/faster way of obtaining a stable string representation?
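For what it's worth, here is one possible direction, sketched under the assumption that every integer fits in 32 bits (they are at most 10^6 here): serialize the sorted pairs to a fixed-width binary layout with struct.pack and hash those bytes. The byte layout is fully specified by the format string, so it does not depend on repr() details of any Python version or platform; the function name set_hash is just an illustrative choice.
import hashlib
import struct

def set_hash(my_set):
    # Sort for a canonical order, then pack each (a, b) pair as two
    # unsigned 32-bit little-endian ints.
    packed = b''.join(struct.pack('<II', a, b) for a, b in sorted(my_set))
    return hashlib.md5(packed).hexdigest()

# set_hash({(1, 2), (0, 3), (1, 3)}) gives the same digest everywhere.
This skips building the large intermediate string, though whether it is actually faster than str() would need to be measured on the real data.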

Remove repeated elements from a list in O(n log n) time

I need an O(n log n) algorithm to remove repeated elements from a list. I know I could use a set, for example, but I need an algorithm of this specific complexity and I have no idea how to code it. I currently have the code below, but I don't know its complexity, although I believe it is not n log n.
def removing(a):
    for e in a:
        if e in a[a.index(e)+1:]:
            a.remove(e)
    return a
The exercise says it wants an O(n*log(n)) algorithm, and says nothing about sorting the list before.
Since it's an assignment, I'll just give you a hint: you can use the set solution and get O(n log n) worst-case performance by using an OrderedDict instead of a regular set (what the keys map to does not matter; you can map them all to None or some other arbitrary value).
If the order of the elements in the resulting list does not matter, the simplest solution is to sort first and then iterate, skipping elements where a[i] == a[i+1]. The result is a sorted list of all unique elements, obtained in O(n log n).
Good Luck.
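A minimal sketch of that sort-then-scan idea (the function name is illustrative; sorting dominates, so the whole thing is O(n log n)):
def remove_repeated(a):
    # Sort (O(n log n)), then keep each element only if it differs from
    # the previous one (O(n)). The original order is not preserved.
    a = sorted(a)
    result = []
    for x in a:
        if not result or result[-1] != x:
            result.append(x)
    return result

print(remove_repeated([1, 2, 3, 4, 6, 6, 2, 3, 1]))   # [1, 2, 3, 4, 6]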
x = [1, 2, 3, 4, 6, 6, 2, 3, 1]
dic = {}
for i in x:
    dic[i] = 0
print dic.keys()
You can try this.
