I've got a list of strings, for example: ['Lion','Rabbit','Sea Otter','Monkey','Eagle','Rat']
I'm trying to find out the total number of possible combinations of these items, where item order matters and the total string length, when all strings are concatenated with comma separators, does not exceed a given limit.
So, for max total string length 14, I would need to count combinations such as (not exhaustive list):
Lion
Rabbit
Eagle,Lion
Lion,Eagle
Lion,Eagle,Rat
Eagle,Lion,Rat
Sea Otter,Lion
etc...
but it would not include combinations where the total string length is more than the 14 character limit, such as Sea Otter,Monkey
I know for this pretty limited sample it wouldn't be that hard to manually calculate or determine with a few nested loops, but the actual use case will be a list of a couple hundred strings and a much longer character limit, meaning the number of nested iterations to write manually would be extremely confusing...
I tried to work through writing this via Python's itertools, but keep getting lost as none of the examples I'm finding come close enough to what I'm needing, especially with the limited character length (not limited number of items) and the need to allow repeated combinations in different orders.
Any help getting started would be great.
You can use itertools.combinations in a list comprehension that creates tuples of at most 3 items (more items will by definition exceed 14 characters combined), while filtering by total string length:
import itertools
lst = ['Lion','Rabbit','Sea Otter','Monkey','Eagle','Rat']
sorted_lst = sorted(lst, key=len)
#find max number of items
items_limit = len(lst) + 1  #fallback if even the full list fits within the limit
for n, i in enumerate(sorted_lst):
    if len(','.join(sorted_lst[:n+1])) > 14:
        items_limit = n+1
        break
[x for l in range(1, items_limit) for x in itertools.combinations(lst, l) if len(','.join(x))<15]
PS. Use itertools.permutations if you need permutations (as in your sample output); your question asks about combinations.
You can use a recursive generator function:
s, m_s = ['Lion','Rabbit','Sea Otter','Monkey','Eagle','Rat'], 14
def get_combos(d, c = []):
    yield ','.join(c) #yield back valid combination
    for i in range(len(d)):
        if d[i] not in c and len(','.join(c+[d[i]])) <= m_s:
            yield from get_combos(d[:i]+d[i+1:], c+[d[i]]) #found a new valid combination
        if len(d[i]) <= m_s:
            yield from get_combos(d[:i]+d[i+1:], [d[i]]) #ignore running combo and replace with new single string
        yield from get_combos(d[:i]+d[i+1:], c) #ignore string at current iteration of `for` loop and keep the running combination
vals = [v for v in set(get_combos(s)) if v] #drop the empty string yielded by the initial call
print(vals)
Output:
['Rat,Rabbit', 'Lion,Rabbit', 'Eagle', 'Eagle,Lion,Rat', 'Monkey', 'Rabbit,Lion', 'Rat,Eagle', 'Sea Otter', 'Rat,Lion,Eagle', 'Monkey,Rat', 'Rabbit,Monkey', 'Sea Otter,Rat', 'Rabbit', 'Lion,Sea Otter', 'Rabbit,Eagle', 'Rat,Eagle,Lion', 'Rat,Sea Otter', 'Lion,Monkey', 'Eagle,Lion', 'Eagle,Rat', 'Lion,Eagle,Rat', 'Rat', 'Lion,Rat,Eagle', 'Eagle,Rabbit', 'Rat,Lion', 'Monkey,Eagle', 'Lion,Eagle', 'Eagle,Monkey', 'Monkey,Lion', 'Rat,Monkey', 'Sea Otter,Lion', 'Rabbit,Rat', 'Monkey,Rabbit', 'Eagle,Rat,Lion', 'Lion,Rat', 'Lion']
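If only the count is needed, the arrangements do not have to be built as strings. The following is a minimal sketch, assuming each string may be used at most once per arrangement and that the limit is inclusive (as in the example with 14). `count_orderings` is a hypothetical helper name, not something from the answers above.

```python
def count_orderings(words, max_len):
    """Count ordered selections of distinct words whose comma-joined length is <= max_len."""
    def walk(used, length):
        total = 0
        for i, w in enumerate(words):
            if i in used:
                continue
            # a comma is only needed once something has already been picked
            new_len = length + len(w) + (1 if used else 0)
            if new_len <= max_len:
                # count this arrangement, then every longer one that extends it
                total += 1 + walk(used | {i}, new_len)
        return total
    return walk(frozenset(), 0)

print(count_orderings(['Lion', 'Rabbit', 'Sea Otter', 'Monkey', 'Eagle', 'Rat'], 14))  # 36
```

The search stops extending an arrangement as soon as it would exceed the limit, but the running time still grows with the number of valid arrangements, so for a couple hundred strings and a long limit it may remain impractical.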
I have a large (50k-100k) set of strings mystrings. Some of the strings in mystrings may be exact substrings of others, and I would like to collapse these (discard the substring and only keep the longest). Right now I'm using a naive method, which has O(N^2) complexity.
unique_strings = set()
for s in sorted(mystrings, key=len, reverse=True):
    keep = True
    for us in unique_strings:
        if s in us:
            keep = False
            break
    if keep:
        unique_strings.add(s)
Which data structures or algorithms would make this task easier and not require O(N^2) operations? Libraries are ok, but I need to stay pure Python.
Finding a substring in a set():
name = set()
name.add('Victoria Stuart') ## add single element
name.update(('Carmine Wilson', 'Jazz', 'Georgio')) ## add multiple elements
name
{'Jazz', 'Georgio', 'Carmine Wilson', 'Victoria Stuart'}
me = 'Victoria'
if str(name).find(me) != -1:  ## find() returns -1 when there is no match
    print('{} in {}'.format(me, name))
# Victoria in {'Jazz', 'Georgio', 'Carmine Wilson', 'Victoria Stuart'}
That's pretty easy -- but somewhat problematic, if you want to return the matching string:
for item in name:
    if item.find(me):
        print(item)
'''
Jazz
Georgio
Carmine Wilson
'''
print(str(name).find(me))
# 39 ## character offset for match (i.e., not a string)
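As you can see, the loop above prints every item except the one we want. str.find() returns the offset of the match (0 for 'Victoria Stuart', which is falsy) and -1 (which is truthy) when there is no match, so the condition is effectively inverted; compare the result with -1, or use the in operator, instead.
It's probably easier to use regex (regular expressions):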
import re
for item in name:
    if re.match(me, item):
        full_name = item
        print(item)
# Victoria Stuart
print(full_name)
# Victoria Stuart
for item in name:
    if re.search(me, item):
        print(item)
# Victoria Stuart
From the Python docs:
search() vs. match()
Python offers two different primitive operations based on regular
expressions: re.match() checks for a match only at the beginning of
the string, while re.search() checks for a match anywhere in the
string ...
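For a plain substring test, neither find() nor a regex is strictly needed. A small sketch of the return values involved, using the same names as above (the variable `matches` is introduced here only for illustration):

```python
names = {'Jazz', 'Georgio', 'Carmine Wilson', 'Victoria Stuart'}
me = 'Victoria'

# str.find() returns the index of the match, or -1 when there is none
print('Victoria Stuart'.find(me))  # 0  (falsy, although it is a match)
print('Jazz'.find(me))             # -1 (truthy, although it is not a match)

# the `in` operator expresses the substring test directly
matches = [item for item in names if me in item]
print(matches)  # ['Victoria Stuart']
```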
A naive approach:
1. sort strings by length, longest first    # O(N*log_N)
2. foreach string:    # O(N)
3. insert each suffix into tree structure: first letter -> root, and so on.
   # O(L) or O(L^2) depending on string slice implementation, L: string length
4. if inserting the entire string (the longest suffix) creates a new
   leaf node, keep it!
Total: O[N*(log_N + L)] or O[N*(log_N + L^2)]
This is probably far from optimal, but should be significantly better than O(N^2) for large N (number of strings) and small L (average string length).
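The steps above can be sketched in pure Python with nested dicts standing in for the tree. This is only an illustration under assumptions: `collapse_substrings` is a hypothetical name, and instead of checking for a new leaf it walks the tree first to see whether the string is already present as a path.

```python
def collapse_substrings(strings):
    """Keep only the strings that are not substrings of a longer kept string."""
    root = {}   # suffix trie: each node maps a character to a child node
    kept = []
    for s in sorted(set(strings), key=len, reverse=True):
        # If s can be walked from the root, it is a prefix of some stored
        # suffix, i.e. a substring of a string that was already kept.
        node = root
        for ch in s:
            node = node.get(ch)
            if node is None:
                break
        else:
            continue  # whole string matched: discard it
        kept.append(s)
        # insert every suffix of the kept string
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node.setdefault(ch, {})
    return kept

print(sorted(collapse_substrings(['abc', 'bc', 'abcd', 'xyz', 'y'])))  # ['abcd', 'xyz']
```

Inserting all suffixes costs O(L^2) time and memory per kept string, so this trades memory for speed when strings are short.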
You could also iterate through the strings in descending order by length and add all substrings of each string to a set, and only keep those strings that are not in the set. The algorithmic big O should be the same as for the worst case above (O[N*(log_N + L^2)]), but the implementation is much simpler:
seen_strings, keep_strings = set(), set()
for s in sorted(mystrings, key=len, reverse=True):
    if s not in seen_strings:
        keep_strings.add(s)
        l = len(s)
        # the end index must run up to l, otherwise substrings that
        # include the last character are never recorded
        for start in range(l):
            for end in range(start+1, l+1):
                seen_strings.add(s[start:end])
In the meantime I came up with this approach.
from Bio.trie import trie
unique_strings = set()
suffix_tree = trie()
for s in sorted(mystrings, key=len, reverse=True):
    if suffix_tree.with_prefix(s) == []:
        unique_strings.add(s)
        for i in range(len(s)):
            suffix_tree[s[i:]] = 1
The good: ≈15 minutes --> ≈20 seconds for the data set I was working with. The bad: introduces biopython as a dependency, which is neither lightweight nor pure python (as I originally asked).
You can presort the strings and create a dictionary that maps strings to positions in the sorted list. Then you can loop over the list of strings (O(N)) and suffixes (O(L)) and set those entries to None that exist in the position-dict (O(1) dict lookup and O(1) list update). So in total this has O(N*L) complexity where L is the average string length.
strings = sorted(mystrings, key=len, reverse=True)
index_map = {s: i for i, s in enumerate(strings)}
unique = set()
for i, s in enumerate(strings):
    if s is None:
        continue
    unique.add(s)
    for k in range(1, len(s)):
        try:
            index = index_map[s[k:]]
        except KeyError:
            pass
        else:
            if strings[index] is None:
                break
            strings[index] = None
Testing on the following sample data gives a speedup factor of about 21:
import random
from string import ascii_lowercase
mystrings = [''.join(random.choices(ascii_lowercase, k=random.randint(1, 10)))
             for __ in range(1000)]
mystrings = set(mystrings)
I have "n" strings as input, and I split each one into its possible subsequences, stored in a list as shown below.
If the input is: aa, b, aa
I create a list like the one below (each inner list holds the subsequences of one string):
aList = [['a', 'a', 'aa'], ['b'], ['a', 'a', 'aa']]
I would like to find the combinations of palindromes across the lists in aList.
For example, there are 5 possible palindromes here: aba, aba, aba, aba, aabaa
This can be achieved with a brute-force algorithm using the code below:
import itertools
d = []
count = 0
def isPalindrome(x):
    if x == x[::-1]: return True
    else: return False
for I in itertools.product(*aList):
    a = (''.join(I))
    if isPalindrome(a):
        if a not in d:
            d.append(a)
        count += 1
But this approach is resulting in a timeout when the number of strings and the length of the string are bigger.
Is there a better approach to the problem ?
Second version
This version uses a set called seen, to avoid testing combinations more than once.
Note that your function isPalindrome() can be simplified to a single expression, so I removed it and just did the test in-line to avoid the overhead of an unnecessary function call.
import itertools
aList = [['a', 'a', 'aa'], ['b'], ['a', 'a', 'aa']]
d = []
seen = set()
for I in itertools.product(*aList):
    if I not in seen:
        seen.add(I)
        a = ''.join(I)
        if a == a[::-1]:
            d.append(a)
print('d: {}'.format(d))
The current approach has the disadvantage that most of the generated candidates are thrown away once the palindrome check fails.
One idea is that once you pick a string from one side, you can immediately check whether there is a corresponding string in the last group.
For example, let's say that your search space is this:
[["a","b","c"], ... , ["b","c","d"]]
We can see that if you pick "a" first, there is no "a" in the last group, and this excludes all the possible solutions that would otherwise be tried.
For larger input you could probably get some time gain by grabbing words from the first array, and compare them with the words of the last array to check that these pairs still allow for a palindrome to be formed, or that such a combination can never lead to one by inserting arrays from the remaining words in between.
This way you probably cancel out a lot of possibilities, and this method can be repeated recursively, once you have decided that a pair is still in the running. You would then save the common part of the two words (when the second word is reversed of course), and keep the remaining letters separate for use in the recursive part.
Depending on which of the two words was longer, you would compare the remaining letters with words from the array that is next from the left or from the right.
This should bring a lot of early pruning in the search tree. You would thus not perform the full Cartesian product of combinations.
I have also written the function to get all substrings from a given word, which you probably already had:
def allsubstr(str):
    return [str[i:j+1] for i in range(len(str)) for j in range(i, len(str))]
def getpalindromes_trincot(aList):
    def collectLeft(common, needle, i, j):
        if i > j:
            return [common + needle + common[::-1]] if needle == needle[::-1] else []
        results = []
        for seq in aRevList[j]:
            if seq.startswith(needle):
                results += collectRight(common+needle, seq[len(needle):], i, j-1)
            elif needle.startswith(seq):
                results += collectLeft(common+seq, needle[len(seq):], i, j-1)
        return results
    def collectRight(common, needle, i, j):
        if i > j:
            return [common + needle + common[::-1]] if needle == needle[::-1] else []
        results = []
        for seq in aList[i]:
            if seq.startswith(needle):
                results += collectLeft(common+needle, seq[len(needle):], i+1, j)
            elif needle.startswith(seq):
                results += collectRight(common+seq, needle[len(seq):], i+1, j)
        return results
    aRevList = [[seq[::-1] for seq in seqs] for seqs in aList]
    return collectRight('', '', 0, len(aList)-1)
# sample input and call:
input = ['already', 'days', 'every', 'year', 'later']
aList = [allsubstr(word) for word in input]
result = getpalindromes_trincot(aList)
I did a timing comparison with the solution that martineau posted. For the sample data I have used, this solution is about 100 times faster:
See it run on repl.it
Another Optimisation
Some gain could also be found in not repeating the search when the first array has several entries with the same string, like the 'a' in your example data. The results that include the second 'a' will obviously be the same as for the first. I did not code this optimisation, but it might be an idea to improve the performance even more.
Using itertools, I have all the possible permutations of a given list of numbers, but if the list is as follows:
List=[0,0,0,0,3,6,0,0,5,0,0]
itertools does not "know" that iterating the zeros is wasted work, for example the following iterations will appear in the results:
List=[0,3,0,0,0,6,0,0,5,0,0]
List=[0,3,0,0,0,6,0,0,5,0,0]
They are the same, but itertools just takes the first zero (for example) and moves it to the fourth place in the list, and vice versa.
The question is: how can I permute only some selected numbers and leave others, such as zero, alone? It can be with or without itertools.
Voilà - it works now - after getting the permutations of the "meat", I further get all possible combinations for the "0"s positions and yield
one permutation for each possible set of "0 positions" for each permutation
of the non-0s:
from itertools import permutations, combinations
def permut_with_pivot(sequence, pivot=0):
    pivot_indexes = set()
    seq_len = 0
    def yield_non_pivots():
        nonlocal seq_len
        for i, item in enumerate(sequence):
            if item != pivot:
                yield item
            else:
                pivot_indexes.add(i)
            seq_len = i + 1
    def fill_pivots(permutation):
        for pivot_positions in combinations(range(seq_len), len(pivot_indexes)):
            sequence = iter(permutation)
            yield tuple((pivot if i in pivot_positions else next(sequence)) for i in range(seq_len))
    for permutation in permutations(yield_non_pivots()):
        for filled_permutation in fill_pivots(permutation):
            yield filled_permutation
(I've used Python 3's "nonlocal" keyword - if you are still on Python 2.7,
you will have to take another approach, like making seq_len be a list with a single item you can then replace in the inner function)
My second try (the working one is actually the 3rd)
This is a naive approach that just keeps a cache of the already "seen" permutations - it saves on the work done to each permutation but not on the work to generate all possible permutations:
from itertools import permutations
def non_repeating_permutations(seq):
    seen = set()
    for permutation in permutations(seq):
        hperm = hash(permutation)
        if hperm in seen:
            continue
        seen.add(hperm)
        yield permutation
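A different way to avoid the wasted work is to never generate the duplicates in the first place. The following is a sketch of a standard backtracking technique over value counts, not the approach used in the answers here; `distinct_permutations` is a hypothetical name.

```python
from collections import Counter

def distinct_permutations(seq):
    """Yield each distinct ordering of seq exactly once."""
    counts = Counter(seq)   # how many copies of each value are still unused
    n = len(seq)

    def build(prefix):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        # choose each *distinct* remaining value once per position,
        # so swapping two equal items never creates a new branch
        for value in counts:
            if counts[value] > 0:
                counts[value] -= 1
                prefix.append(value)
                yield from build(prefix)
                prefix.pop()
                counts[value] += 1

    yield from build([])

print(len(list(distinct_permutations([0, 0, 0, 0, 3, 6, 0, 0, 5, 0, 0]))))  # 990
```

For the list in the question this visits 990 orderings (11!/8!) rather than the 39,916,800 that itertools.permutations would produce.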
Append each result to a List. Now you'll have every single possible combination and then do the following:
list(set(your_big_list))
Set will narrow down the list down to only the unique permutations.
I'm not wholly sure if this is the problem you're trying to solve or you're worried about performance.
Regardless just made an account so I thought I'd try to contribute something
Your question is unclear, but if you are trying to list the permutations without having 0 in your output, you can do it as follows:
from itertools import permutations
def perms( listy ):
    return permutations( [i for i in listy if i!=0])
I have been working on a fast, efficient way to solve the following problem, but as of yet, I have only been able to solve it using a rather slow, nested-loop solution. Anyway, here is the description:
So I have a string of length L, let's say 'BBBX'. I want to find all possible strings of length L, starting from 'BBBX', that differ in at most 2 positions and at least 0 positions. On top of that, when building the new strings, new characters must be selected from a specific alphabet.
I guess the size of the alphabet doesn't matter, so let's say in this case the alphabet is ['B', 'G', 'C', 'X'].
So, some sample output would be, 'BGBG', 'BGBC', 'BBGX', etc. For this example with a string of length 4 with up to 2 substitutions, my algorithm finds 67 possible new strings.
I have been trying to use itertools to solve this problem, but I am having a bit of difficulty finding a solution. I try to use itertools.combinations(range(4), 2) to find all the possible positions. I am then thinking of using product() from itertools to build all of the possibilities, but I am not sure if there is a way I could connect it somehow to the indices from the output of combinations().
Here's my solution.
The first for loop tells us how many replacements we will perform. (0, 1 or 2 - we go through each)
The second loop tells us which letters we will change (by their indexes).
The third loop goes through all of the possible letter changes for those indexes. There's some logic to make sure we actually change the letter (changing "C" to "C" doesn't count).
import itertools
def generate_replacements(lo, hi, alphabet, text):
    for count in range(lo, hi + 1):
        for indexes in itertools.combinations(range(len(text)), count):
            for letters in itertools.product(alphabet, repeat=count):
                new_text = list(text)
                actual_count = 0
                for index, letter in zip(indexes, letters):
                    if new_text[index] == letter:
                        continue
                    new_text[index] = letter
                    actual_count += 1
                if actual_count == count:
                    yield ''.join(new_text)
for text in generate_replacements(0, 2, 'BGCX', 'BBBX'):
    print(text)
Here's its output:
BBBX GBBX CBBX XBBX BGBX BCBX BXBX BBGX BBCX BBXX BBBB BBBG BBBC GGBX
GCBX GXBX CGBX CCBX CXBX XGBX XCBX XXBX GBGX GBCX GBXX CBGX CBCX CBXX
XBGX XBCX XBXX GBBB GBBG GBBC CBBB CBBG CBBC XBBB XBBG XBBC BGGX BGCX
BGXX BCGX BCCX BCXX BXGX BXCX BXXX BGBB BGBG BGBC BCBB BCBG BCBC BXBB
BXBG BXBC BBGB BBGG BBGC BBCB BBCG BBCC BBXB BBXG BBXC
Not tested much, but it does find 67 for the example you gave. The easy way to connect the indices to the products is via zip():
def sub(s, alphabet, minsubs, maxsubs):
    from itertools import combinations, product
    origs = list(s)
    alphabet = set(alphabet)
    for nsubs in range(minsubs, maxsubs + 1):
        for ix in combinations(range(len(s)), nsubs):
            prods = [alphabet - set(origs[i]) for i in ix]
            s = origs[:]
            for newchars in product(*prods):
                for i, char in zip(ix, newchars):
                    s[i] = char
                yield "".join(s)
count = 0
for s in sub('BBBX', 'BGCX', 0, 2):
    count += 1
    print(s)
print(count)
Note: the major difference from FogleBird's is that I posted first - LOL ;-) The algorithms are very similar. Mine constructs the inputs to product() so that no substitution of a letter for itself is ever attempted; FogleBird's allows "identity" substitutions, but counts how many valid substitutions are made and then throws the result away if any identity substitutions occurred. On longer words and a larger number of substitutions, that can be much slower (potentially the difference between len(alphabet)**nsubs and (len(alphabet)-1)**nsubs times around the ... in product(): loop).
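The figure of 67 can be cross-checked without generating anything: choosing k positions out of L and giving each one of the other (A - 1) letters yields C(L, k) * (A - 1)^k strings for exactly k substitutions. A small sketch, assuming every character of the original string belongs to the alphabet; `count_variants` is a hypothetical helper name.

```python
from math import comb

def count_variants(length, alphabet_size, lo, hi):
    """Number of strings differing from the original in lo..hi positions."""
    return sum(comb(length, k) * (alphabet_size - 1) ** k
               for k in range(lo, hi + 1))

print(count_variants(4, 4, 0, 2))  # 1 + 12 + 54 = 67
```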
The Problem
I've been playing around with different ways (in Python 2.7) to extract a list of (word, frequency) tuples from a corpus, or list of strings, and comparing their efficiency. As far as I can tell, in the normal case with an unsorted list, the Counter method from the collections module is superior to anything I've come up with or found elsewhere, but it doesn't seem to take much advantage of the benefits of a pre-sorted list and I've come up with methods that beat it easily in this special case. So, in short, is there any built-in way to inform Counter of the fact that a list is already sorted to further speed it up?
(The next section is on unsorted lists where Counter works magic; you may want to skip towards the end where it loses its charm when dealing with sorted lists.)
Unsorted input lists
One approach that doesn't work
The naive approach would be to use sorted([(word, corpus.count(word)) for word in set(corpus)]), but that one reliably gets you into runtime problems as soon as your corpus is a few thousand items long - not surprisingly since you're running through the entire list of n words m many times, where m is the number of unique words.
Sorting the list + local search
So what I tried to do instead before I found Counter was make sure that all searches are strictly local by first sorting the input list (I also have to remove digits and punctuation marks and convert all entries into lower case to avoid duplicates like 'foo', 'Foo', and 'foo:').
#Natural Language Toolkit, for access to corpus; any other source for a long text will do, though.
import nltk
# nltk corpora come as a class of their own, as I understand it presenting to the
# outside as a unique list but underlyingly represented as several lists, with no more
# than one ever loaded into memory at any one time, which is good for memory issues
# but rather not so for speed so let's disable this special feature by converting it
# back into a conventional list:
corpus = list(nltk.corpus.gutenberg.words())
import string
drop = string.punctuation+string.digits
def wordcount5(corpus, Case=False, lower=False, StrippedSorted=False):
    '''function for extracting word frequencies out of a corpus. Returns an alphabetic list
    of tuples consisting of words contained in the corpus with their frequencies.
    Default is case-insensitive, but if you need separate entries for upper and lower case
    spellings of the same words, set option Case=True. If your input list is already sorted
    and stripped of punctuation marks/digits and/or all lower case, you can accelerate the
    operation by a factor of 5 or so by declaring so through the options "StrippedSorted" and "lower".'''
    # you can ignore the following 6 lines for now, they're only relevant with a pre-processed input list
    if lower or Case:
        if StrippedSorted:
            sortedc = corpus
        else:
            sortedc = sorted([word.replace('--',' ').strip(drop)
                              for word in sorted(corpus)])
    # here we sort and purge the input list in the default case:
    else:
        sortedc = sorted([word.lower().replace('--',' ').strip(drop)
                          for word in sorted(corpus)])
    # start iterating over the (sorted) input list:
    scindex = 0
    # create a list:
    freqs = []
    # identify the first token:
    currentword = sortedc[0]
    length = len(sortedc)
    while scindex < length:
        wordcount = 0
        # increment a local counter while the tokens == currentword
        while scindex < length and sortedc[scindex] == currentword:
            scindex += 1
            wordcount += 1
        # store the current word and final score when a) a new word appears or
        # b) the end of the list is reached
        freqs.append((currentword, wordcount))
        # if a): update currentword with the current token
        if scindex < length:
            currentword = sortedc[scindex]
    return freqs
Finding collections.Counter
This is much better, but still not quite as fast as using the Counter class from the collections module, which creates a dictionary of {word: frequency of word} entries (we still have to do the same stripping and lowering, but no sorting):
from collections import Counter
cnt = Counter()
for word in [token.lower().strip(drop) for token in corpus]:
    cnt[word] += 1
# optionally, if we want to have the same output format as before,
# we can do the following which negligibly adds in runtime:
wordfreqs = sorted([(word, cnt[word]) for word in cnt])
On the Gutenberg corpus with appr. 2 million entries, the Counter method is roughly 30% faster on my machine (5 seconds as opposed to 7.2), which is mostly explained through the sorting routine which eats around 2.1 seconds (if you don't have and don't want to install the nltk package (Natural Language Toolkit) which offers access to this corpus, any other adequately long text appropriately split into a list of strings at word level will show you the same.)
Comparing performance
With my idiosyncratic method of timing using the tautology as a conditional to delay execution, this gives us for the counter method:
>>> import time
>>> if 1:
...     start = time.time()
...     cnt = Counter()
...     for word in [token.lower().strip(drop) for token in corpus if token not in [" ", ""]]:
...         cnt[word] += 1
...     time.time()-start
...     cntgbfreqs = sorted([(word, cnt[word]) for word in cnt])
...     time.time()-start
...
4.999882936477661
5.191655874252319
(We see that the last step, that of formatting the results as a list of tuples, takes up less than 5% of the total time.)
Compared to my function:
>>> if 1:
...     start = time.time()
...     gbfreqs = wordcount5(corpus)
...     time.time()-start
...
7.261770963668823
Sorted input lists - when Counter 'fails'
However, as you may have noticed, my function allows to specify that the input is already sorted, stripped of punctuational garbage, and converted to lowercase. If we already have created such a converted version of the list for some other operations, using it (and declaring so) can very much speed up the operation of my wordcount5:
>>> sorted_corpus = sorted([token.lower().strip(drop) for token in corpus if token not in [" ", ""]])
>>> if 1:
...     start = time.time()
...     strippedgbfreqs2 = wordcount5(sorted_corpus, lower=True, StrippedSorted=True)
...     time.time()-start
...
0.9050078392028809
Here, we've reduced runtime by a factor of appr. 8 by not having to sort the corpus and convert the items. Of course the latter is also true when feeding Counter with this new list, so expectably it's also a bit faster, but it doesn't seem to take advantage of the fact that it is sorted and now takes twice as long as my function where it was 30% faster before:
>>> if 1:
...     start = time.time()
...     cnt = Counter()
...     for word in sorted_corpus:
...         cnt[word] += 1
...     time.time()-start
...     strippedgbfreqs = [(word, cnt[word]) for word in cnt]
...     time.time()-start
...
1.9455058574676514
2.0096349716186523
Of course, we can use the same logic I used in wordcount5 - incrementing a local counter until we run into a new word and only then storing the last word with the current state of the counter, and resetting the counter to 0 for the next word - only using Counter as storage, but the inherent efficiency of the Counter method seems lost, and performance is within the range of my function's for creating a dictionary, with the extra burden of converting to a list of tuples now looking more troublesome than it used to when we were processing the raw corpus:
>>> def countertest():
...     start = time.time()
...     sortedcnt = Counter()
...     c = 0
...     length = len(sorted_corpus)
...     while c < length:
...         wcount = 0
...         word = sorted_corpus[c]
...         while c < length and sorted_corpus[c] == word:
...             wcount+=1
...             c+=1
...         sortedcnt[word] = wcount
...         if c < length:
...             word = sorted_corpus[c]
...     print time.time()-start
...     result = sorted([(word, sortedcnt[word]) for word in sortedcnt])
...     print time.time()-start  # must come before the return to be executed
...     return result
...
>>> strippedbgcnt = countertest()
0.920727014542
1.08029007912
(The similarity of the results is not really surprising since we're in effect disabling Counter's own methods and abusing it as a store for values obtained with the very same methodology as before.)
Therefore, my question: Is there a more idiomatic way to inform Counter that its input list is already sorted and to make it thus keep the current key in memory rather than looking it up anew every time it - predictably - encounters the next token of the same word? In other words, is it possible to improve performance on a pre-sorted list further by combining the inherent efficiency of the Counter/dictionary class with the obvious benefits of a sorted list, or am I already scratching on a hard limit with .9 seconds for tallying a list of 2M entries?
There probably isn't a whole lot of room for improvement - I get times of around .55 seconds when doing the simplest thing I can think of that still requires iterating through the same list and checking each individual value, and .25 for set(corpus) without a count, but maybe there's some itertools magic out there that would help to get close to those figures?
(Note: I'm a relative novice to Python and to programming in general, so excuse if I've missed something obvious.)
Edit Dec. 1:
Another thing, besides the sorting itself, which makes all of my methods above slow, is converting every single one of 2M strings into lowercase and stripping them of whatever punctuation or digits they may include. I have tried before to shortcut that by counting the unprocessed strings and only then converting the results and removing duplicates while adding up their counts, but I must have done something wrong for it made things ever so slightly slower. I therefore reverted to the previous versions, converting everything in the raw corpus, and now can't quite reconstruct what I did there.
If I try it now, I do get an improvement from converting the strings last. I'm still doing it by looping over a list (of results). What I did was write a couple of functions that would between them convert the keys in the output of J.F. Sebastian's winning defaultdict method (of format [("word", int), ("Word", int), ("word2", int), ...]) into lowercase and strip them of punctuation, and collapse the counts for all keys that were left identical after that operation (code below). The advantage is that we're now handling a list of roughly 50k entries as opposed to the > 2M in the corpus. This way I'm now at 1.25 seconds for going from the corpus (as a list) to a case insensitive word count ignoring punctuation marks on my machine, down from about 4.5 with the Counter method and string conversion as a first step. But maybe there's a dictionary-based method for what I'm doing in sum_sorted() as well?
Code:
def striplast(resultlist, lower_or_Case=False):
    """function for string conversion of the output of any of the `count_words*` methods"""
    if lower_or_Case:
        strippedresult = sorted([(entry[0].strip(drop), entry[1]) for entry in resultlist])
    else:
        strippedresult = sorted([(entry[0].lower().strip(drop), entry[1]) for entry in resultlist])
    strippedresult = sum_sorted(strippedresult)
    return strippedresult
def sum_sorted(inputlist):
    """function for collapsing the counts of entries left identical by striplast()"""
    ilindex = 0
    freqs = []
    currentword = inputlist[0][0]
    length = len(inputlist)
    while ilindex < length:
        wordcount = 0
        while ilindex < length and inputlist[ilindex][0] == currentword:
            wordcount += inputlist[ilindex][1]
            ilindex += 1
        if currentword not in ["", " "]:
            freqs.append((currentword, wordcount))
        if ilindex < length and inputlist[ilindex][0] > currentword:
            currentword = inputlist[ilindex][0]
    return freqs
from collections import defaultdict
def count_words_defaultdict2(words, loc=False):
    """modified version of J.F. Sebastian's winning method, added a final step collapsing
    the counts for words identical except for punctuation and digits and case (for the
    latter, unless you specify that you're interested in a case-sensitive count by setting
    l(ower_)o(r_)c(ase) to True) by means of striplast()."""
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    if loc:
        return striplast(sorted(d.items()), lower_or_Case=True)
    else:
        return striplast(sorted(d.items()))
I made some first attempts at using groupby to do the job currently done by sum_sorted(), and/or striplast(), but I couldn't quite work out how to trick it into summing [entry[1]] for a list of entries in count_words' results sorted by entry[0]. The closest I got was:
# "i(n)p(ut)list", toylist for testing purposes:
list(groupby(sorted([(entry[0].lower().strip(drop), entry[1]) for entry in iplist])))
# returns:
[(('a', 1), <itertools._grouper object at 0x1031bb290>), (('a', 2), <itertools._grouper object at 0x1031bb250>), (('a', 3), <itertools._grouper object at 0x1031bb210>), (('a', 5), <itertools._grouper object at 0x1031bb2d0>), (('a', 8), <itertools._grouper object at 0x1031bb310>), (('b', 3), <itertools._grouper object at 0x1031bb350>), (('b', 7), <itertools._grouper object at 0x1031bb390>)]
# So what I used instead for striplast() is based on list comprehension:
list(sorted([(entry[0].lower().strip(drop), entry[1]) for entry in iplist]))
# returns:
[('a', 1), ('a', 2), ('a', 3), ('a', 5), ('a', 8), ('b', 3), ('b', 7)]
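The groupby attempt above groups on the whole (word, count) tuple, so every distinct pair ends up in its own group. Passing a key function that looks only at the word lets the counts be summed per word. A minimal sketch of that idea, assuming the input is a list of (word, count) pairs like the toy list above; `sum_sorted_groupby` is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter

def sum_sorted_groupby(pairs):
    """Collapse (word, count) pairs so that each word appears once with its total."""
    # groupby only merges adjacent items, hence the sort by word first
    ordered = sorted(pairs, key=itemgetter(0))
    return [(word, sum(count for _, count in group))
            for word, group in groupby(ordered, key=itemgetter(0))]

iplist = [('a', 1), ('a', 2), ('a', 3), ('a', 5), ('a', 8), ('b', 3), ('b', 7)]
print(sum_sorted_groupby(iplist))  # [('a', 19), ('b', 10)]
```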
Given a sorted list of words as you mention, have you tried the traditional Pythonic approach of itertools.groupby?
from itertools import groupby
some_data = ['a', 'a', 'b', 'c', 'c', 'c']
count = dict( (k, sum(1 for i in v)) for k, v in groupby(some_data) ) # or
count = {k:sum(1 for i in v) for k, v in groupby(some_data)}
# {'a': 2, 'c': 3, 'b': 1}
To answer the question from the title: Counter, dict, defaultdict, OrderedDict are hash-based types: to look up an item they compute a hash for a key and use it to get the item. They even support keys that have no defined order as long as they are hashable i.e., Counter can't take advantage of pre-sorted input.
The measurements show that sorting the input words takes longer than counting the words with a dictionary-based approach and sorting the result combined:
sorted 3.19
count_words_Counter 2.88
count_words_defaultdict 2.45
count_words_dict 2.58
count_words_groupby 3.44
count_words_groupby_sum 3.52
Also, counting words in already-sorted input with groupby() takes only a fraction of the time needed to sort the input in the first place, and it is faster than the dict-based approaches.
from collections import Counter, defaultdict
from itertools import groupby
import nltk
def count_words_Counter(words):
    return sorted(Counter(words).items())
def count_words_groupby(words):
    return [(w, len(list(gr))) for w, gr in groupby(sorted(words))]
def count_words_groupby_sum(words):
    return [(w, sum(1 for _ in gr)) for w, gr in groupby(sorted(words))]
def count_words_defaultdict(words):
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return sorted(d.items())
def count_words_dict(words):
    d = {}
    for w in words:
        try:
            d[w] += 1
        except KeyError:
            d[w] = 1
    return sorted(d.items())
def _count_words_freqdist(words):
    # note: .items() returns words sorted by word frequency (decreasing order)
    # (same as `Counter.most_common()`)
    # so the code sorts twice (the second time in lexicographical order)
    return sorted(nltk.FreqDist(words).items())
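For a quick check without nltk, here is a rough timing sketch on synthetic data. It is not the benchmark that produced the numbers above: the vocabulary size, the word count, and the helper names count_defaultdict and count_groupby_presorted are assumptions chosen for illustration, and absolute timings will differ by machine.

```python
import random
import timeit
from collections import Counter, defaultdict
from itertools import groupby

def count_defaultdict(words):
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return sorted(d.items())

def count_groupby_presorted(sorted_words):
    # Valid only when the input is already sorted.
    return [(w, sum(1 for _ in gr)) for w, gr in groupby(sorted_words)]

# Synthetic data: 100,000 draws from a 5,000-word vocabulary (an assumption,
# not the Gutenberg corpus used for the timings above).
random.seed(0)
vocabulary = ["w%04d" % i for i in range(5000)]
words = [random.choice(vocabulary) for _ in range(100000)]
presorted = sorted(words)

# All approaches must agree before their timings are worth comparing.
assert count_defaultdict(words) == count_groupby_presorted(presorted)
assert count_defaultdict(words) == sorted(Counter(words).items())

for label, func in [
    ("sorted", lambda: sorted(words)),
    ("defaultdict + sort result", lambda: count_defaultdict(words)),
    ("groupby on presorted input", lambda: count_groupby_presorted(presorted)),
]:
    best = min(timeit.repeat(func, number=1, repeat=3))
    print("%-28s %.3f s" % (label, best))
```

Taking the minimum of several repeats reduces noise from other processes; the groupby line deliberately excludes the cost of sorting, to isolate the presorted case.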
To reproduce the results, run this code.
Note: It is 3 times faster if nltk's lazy sequence of words is converted to a list (WORDS = list(nltk.corpus.gutenberg.words())), but the relative performance is the same:
sorted 1.22
count_words_Counter 0.86
count_words_defaultdict 0.48
count_words_dict 0.54
count_words_groupby 1.49
count_words_groupby_sum 1.55
The results are similar to those in "Python - Is a dictionary slow to find frequency of each character?".
If you want to normalize the words (remove punctuation, make them lowercase, etc.), see the answers to "What is the most efficient way in Python to convert a string to all lowercase stripping out all non-ascii alpha characters?". Some examples:
from string import ascii_letters, ascii_lowercase, maketrans  # Python 2
def toascii_letter_lower_genexpr(s, _letter_set=ascii_lowercase):
    """
    >>> toascii_letter_lower_genexpr("ABC,-.!def")
    'abcdef'
    """
    return ''.join(c for c in s.lower() if c in _letter_set)
def toascii_letter_lower_genexpr_set(s, _letter_set=set(ascii_lowercase)):
    return ''.join(c for c in s.lower() if c in _letter_set)
def toascii_letter_lower_translate(s,
        table=maketrans(ascii_letters, ascii_lowercase * 2),
        deletechars=''.join(set(maketrans('', '')) - set(ascii_letters))):
    # Python 2 str.translate(table, deletechars): map letters, drop the rest
    return s.translate(table, deletechars)
def toascii_letter_lower_filter(s, _letter_set=set(ascii_letters)):
    # In Python 2, filter() on a str returns a str
    return filter(_letter_set.__contains__, s).lower()
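The translate-based version above relies on Python 2's two-argument str.translate(). A possible Python 3 counterpart is sketched below; it is not from the original benchmark, and the names _DeleteMissing, _TABLE and toascii_letter_lower_translate_py3 are made up here. It uses a dict subclass whose __missing__ returns None, which str.translate() treats as "delete this character".

```python
from string import ascii_letters

class _DeleteMissing(dict):
    """Translation table that deletes every code point it has no entry for."""
    def __missing__(self, codepoint):
        return None  # str.translate() drops characters mapped to None

# Map each ASCII letter to its lowercase form; everything else is deleted.
_TABLE = _DeleteMissing((ord(c), c.lower()) for c in ascii_letters)

def toascii_letter_lower_translate_py3(s, table=_TABLE):
    return s.translate(table)

print(toascii_letter_lower_translate_py3("ABC,-.!def"))  # abcdef
```

Whether this keeps the speed advantage measured for the Python 2 version is not something the timings in this answer can confirm.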
To count and normalize the words simultaneously:
from collections import defaultdict
from itertools import imap  # Python 2
def combine_counts(items):
    # merge counts of words that became equal after normalization
    d = defaultdict(int)
    for word, count in items:
        d[word] += count
    return d.iteritems()
def clean_words_in_items(clean_word, items):
    return ((clean_word(word), count) for word, count in items)
def normalize_count_words(words):
    """Normalize then count words."""
    return count_words_defaultdict(imap(toascii_letter_lower_translate, words))
def count_normalize_words(words):
    """Count then normalize words."""
    freqs = count_words_defaultdict(words)
    freqs = clean_words_in_items(toascii_letter_lower_translate, freqs)
    return sorted(combine_counts(freqs))
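The two orderings produce the same counts but call the normalizer a different number of times: once per token when normalizing first, once per distinct word when counting first. A minimal Python 3 sketch of that idea, with a stand-in clean() function and a toy word list that are assumptions for illustration:

```python
from collections import Counter, defaultdict

def clean(word):
    # Stand-in normalizer (an assumption): keep ASCII letters, lowercased.
    return ''.join(c for c in word.lower() if 'a' <= c <= 'z')

def normalize_then_count(words):
    # clean() runs once per token.
    return sorted(Counter(map(clean, words)).items())

def count_then_normalize(words):
    # clean() runs once per distinct word; counts that collapse onto the
    # same normalized word are merged afterwards.
    merged = defaultdict(int)
    for word, count in Counter(words).items():
        merged[clean(word)] += count
    return sorted(merged.items())

words = ["The", "the", "the,", "cat", "Cat.", "sat", "the"]
assert normalize_then_count(words) == count_then_normalize(words)
print(count_then_normalize(words))  # [('cat', 2), ('sat', 1), ('the', 4)]
```

On a real corpus the number of distinct words is usually far smaller than the number of tokens, which is a plausible reason for counting first to be cheaper.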
Results
I've updated the benchmark to measure various combinations of count_words*() and toascii*() functions (5x4 pairs not shown):
toascii_letter_lower_filter 0.954 usec small
toascii_letter_lower_genexpr 2.44 usec small
toascii_letter_lower_genexpr_set 2.19 usec small
toascii_letter_lower_translate 0.633 usec small
toascii_letter_lower_filter 124 usec random 2000
toascii_letter_lower_genexpr 197 usec random 2000
toascii_letter_lower_genexpr_set 121 usec random 2000
toascii_letter_lower_translate 7.73 usec random 2000
sorted 1.28 sec
count_words_Counter 941 msec
count_words_defaultdict 501 msec
count_words_dict 571 msec
count_words_groupby 1.56 sec
count_words_groupby_sum 1.64 sec
count_normalize_words 622 msec
normalize_count_words 2.18 sec
The fastest methods:
normalize words - toascii_letter_lower_translate()
count words (presorted input) - groupby()-based approach
count words - count_words_defaultdict()
count and normalize - counting the words first and then normalizing them is faster: count_normalize_words()
Latest version of the code: count-words-performance.py.
One source of inefficiency in the OP's code (which several answers fixed without commenting on) is the over-reliance on intermediate lists. There is no reason to create a temporary list of millions of words just to iterate over them, when a generator will do.
So instead of
cnt = Counter()
for word in [token.lower().strip(drop) for token in corpus]:
    cnt[word] += 1
it should be just
cnt = Counter(token.lower().strip(drop) for token in corpus)
And if you really want to sort the word counts alphabetically (what on earth for?), replace this
wordfreqs = sorted([(word, cnt[word]) for word in cnt])
with this:
wordfreqs = sorted(cnt.items()) # In Python 2: cnt.iteritems()
This should remove much of the inefficiency around the use of Counter (or any dictionary class used in a similar way).
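Put together, the two changes look like this. The corpus and drop values below are toy stand-ins for the question's variables, chosen only to make the snippet runnable (Python 3):

```python
from collections import Counter

# Toy stand-ins for the question's `corpus` and `drop` (assumed values).
corpus = ["The", "cat,", "the", "Cat", "sat."]
drop = ".,"

# A generator expression feeds Counter one word at a time; no temporary list.
cnt = Counter(token.lower().strip(drop) for token in corpus)

# items() already yields (word, count) pairs, so it can be sorted directly.
wordfreqs = sorted(cnt.items())
print(wordfreqs)  # [('cat', 2), ('sat', 1), ('the', 2)]
```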