Reconstruct input string given ngrams of that string - python

Given a string, e.g. 'i am a string'.
I can generate the n-grams of this string like so, using the nltk package, where n varies over a specified range.
from nltk import ngrams
s = 'i am a string'
for n in range(1, 3):
    for grams in ngrams(s.split(), n):
        print(grams)
Gives the output:
('i',)
('am',)
('a',)
('string',)
('i', 'am')
('am', 'a')
('a', 'string')
Is there a way to 'reconstruct' the original string using combinations of the generated ngrams? Or, in the words of a commenter below, is there a way to divide the sentence into consecutive word sequences where each sequence has a maximum length of k (in this case k = 2)?
[('i'), ('am'), ('a'), ('string')]
[('i', 'am'), ('a'), ('string')]
[('i'), ('am', 'a'), ('string')]
[('i'), ('am'), ('a', 'string')]
[('i', 'am'), ('a', 'string')]
The question is similar to this one, though with an additional layer of complexity.
Working solution - adapted from here.
I have a working solution, but it's really slow for longer strings.
import itertools
from nltk import ngrams

def get_ngrams(s, min_=1, max_=4):
    token_lst = []
    for n in range(min_, max_):
        for idx, grams in enumerate(ngrams(s.split(), n)):
            token_lst.append(' '.join(grams))
    return token_lst

def to_sum_k(s):
    for len_ in range(1, len(s.split()) + 1):
        for i in itertools.permutations(get_ngrams(s), r=len_):
            if ' '.join(i) == s:
                print(i)

to_sum_k('a b c')

EDIT:
This answer was based on the assumption that the question was to reconstruct an unknown unique string based on its ngrams. I'll leave it active for anyone interested in this problem. The actual answer for the actual problem as clarified in the comments can be found here.
EDIT END
In general no. Consider e.g. the case n = 2 and s = "a b a b". Then your ngrams would be
[("a",), ("b",), ("a", "b"), ("b", "a")]
The set of strings that generate this set of ngrams, however, would be all strings matched by
(ab(a|(ab)*a?))|(ba(b|(ba)*b?))
Or n = 2, s = "a b c a b d a", where "c" and "d" may be arbitrarily ordered within the generating strings. E.g. "a b d a b c a" would also be a valid string. In addition the same issue as above arises and an arbitrary number of strings can generate the set of ngrams.
That being said there exists a way to test whether a set of ngrams uniquely identifies a string:
Consider your set of ngrams as a description of a non-deterministic state machine. Each ngram can be defined as a chain of states where the individual tokens are the transitions. As an example, for the ngrams [("a", "b", "c"), ("c", "d"), ("a", "d", "b")] we would build the following state machine:
0 ->(a) 1 ->(b) 2 ->(c) 3
0 ->(c) 3 ->(d) 4
0 ->(a) 1 ->(d) 5 ->(b) 6
Now perform a determinization of this state-machine. Iff there exists a unique string that can be reconstructed from the ngrams, the state-machine will have a longest transition-chain that doesn't contain any cycles and contains all ngrams we built the original state-machine from. In this case the original string is simply the individual state-transitions of this path joined back together. Otherwise there exist multiple strings that can be built from the provided ngrams.

While my previous answer assumed that the problem was to find an unknown string based on its ngrams, this answer will deal with the problem of finding all ways to construct a given string using its ngrams.
Assuming repetitions are allowed, the solution is fairly simple: generate all possible number sequences summing to the length of the original string, with no number larger than n, and use these to create the ngram combinations:
import numpy

def generate_sums(l, n, intermediate):
    if l == 0:
        yield intermediate
    elif l < 0:
        return
    else:
        for i in range(1, n + 1):
            yield from generate_sums(l - i, n, intermediate + [i])

def combinations(s, n):
    words = s.split(' ')
    for c in generate_sums(len(words), n, [0]):
        cs = numpy.cumsum(c)
        yield [words[l:u] for (l, u) in zip(cs, cs[1:])]
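A quick usage sketch (my addition, not part of the original answer), applied to the question's example string:
# list all decompositions of the example into consecutive word
# sequences of length at most 2
for combo in combinations('i am a string', 2):
    print(combo)
# prints, among others:
# [['i'], ['am'], ['a'], ['string']]
# [['i', 'am'], ['a', 'string']]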
EDIT:
As pointed out by @norok2 (thanks for the work) in the comments, it seems to be faster to use alternative cumsum implementations instead of the one provided by numpy for this use case.
END EDIT
If repetitions are not allowed things become a little bit more tricky. In this case we can use a non-deterministic finite automaton as defined in my previous answer and build our sequences based on traversals of the automaton:
def build_state_machine(s, n):
    next_state = 1
    transitions = {}
    for ng in ngrams(s.split(' '), n):
        state = 0
        for word in ng:
            if (state, word) not in transitions:
                transitions[(state, word)] = next_state
                next_state += 1
            state = transitions[(state, word)]
    return transitions

def combinations(s, n):
    transitions = build_state_machine(s, n)
    # each entry: (current state, set of states at which an ngram was terminated,
    #              finished ngrams so far, words of the ngram currently being built)
    states = [(0, set(), [], [])]
    for word in s.split(' '):
        new_states = []
        for state, term_visited, path, cur_elem in states:
            if state not in term_visited:
                new_states.append((0, term_visited | {state}, path + [tuple(cur_elem)], []))
            if (state, word) in transitions:
                new_states.append((transitions[(state, word)], term_visited, path, cur_elem + [word]))
        states = new_states
    return [path + [tuple(cur_elem)] if state != 0 else path
            for (state, term_visited, path, cur_elem) in states
            if state not in term_visited]
As an example, the following state machine would be generated for the string "a b a" (the diagram is not reproduced here). In that diagram, red connections indicate a switch to the next ngram and need to be handled separately (the second if in the loop), since they can only be traversed once.

Related

Given an integer, add operators between digits to get n and return list of correct answers

Here is the problem I'm trying to solve:
Given an int, ops, n, create a function(int, ops, n) and slot operators between the digits of int to create equations that evaluate to n. Return a list of all possible answers. Importing functions is not allowed.
For example,
function(111111, '+-%*', 11) => [1*1+11/1-1 = 11, 1*1/1-1+11 =11, ...]
The question recommended using interleave(str1, str2) where interleave('abcdef', 'ab') = 'aabbcdef' and product(str1, n) where product('ab', 3) = ['aaa','aab','abb','bbb','aba','baa','bba'].
I have written interleave(str1, str2) which is
def interleave(str1, str2):
    lsta, lstb, result = list(str1), list(str2), ''
    while lsta and lstb:
        result += lsta.pop(0)
        result += lstb.pop(0)
    if lsta:
        for i in lsta:
            result += i
    else:
        for i in lstb:
            result += i
    return result
However, I have no idea how to code the product function. I assume it has to do something with recursion, so I'm trying to add 'a' and 'b' for every product.
def product(str1, n):
    if n == 1:
        return []
    else:
        return [product(str1, n-1)] + [str1[0]]
Please help me understand how to solve this question (not only the product itself).
General solution
Assuming your implementation of interleave is correct, you can use it together with product (see my suggested implementation below) to solve the problem with something like:
def f(i, ops, n):
    int_str = str(i)
    retval = []
    for seq_len in range(1, len(int_str)):
        for op_seq in r_prod(ops, seq_len):
            eq = interleave(int_str, op_seq)
            if eval(eq) == n:
                retval.append(eq)
    return retval
The idea is that you interleave the digits of your string with your operators in varying order. Basically I do that with all possible operator sequences of length seq_len, which varies from 1 to the maximum, namely the number of digits - 1 (see assumptions below!). Then you use the built-in function eval to evaluate the expression returned by interleave for a specific sequence of operators and compare the result with the desired number, n. If the expression evaluates to n, you append it to the return array retval (initially empty). After you have evaluated all the expressions for all possible operator sequences (see assumptions!) you return the array.
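A small usage sketch (my addition; it assumes the interleave() from the question and the r_prod() shown at the end of this answer are both in scope):
# find all operator placements that make the digits of 111 evaluate to 3
print(f(111, '+-', 3))   # expected to contain '1+1+1'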
Assumptions
It's not clear whether you can use the same operator multiple times or if you're allowed to omit using some. I assumed you can use the same operator many times and that you're allowed to omit using an operator. Hence, the r_prod was used (as suggested by your question). In case of such restrictions, you will want to use permutations (of possibly varying length) of the group of operators.
Secondly, I assumed that your implementation of the interleave function is correct. It is not clear if, for example, interleave("112", "*") should return both "1*12" and "11*2" or just "1*12" like your implementation does. In the case both should be returned, then you should also iterate over the possible ways the same ordered sequence of operators can be interleaved with the provided digits. I omitted that, because I saw that your function always returns a single string.
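If both placements should indeed be returned, a rough sketch of that variant could look like the following (my illustration; the function name and approach are not part of the original answer):
from itertools import combinations

def all_interleavings(digits, ops):
    # yield every way to slot the ordered operator sequence `ops`
    # into the gaps between the characters of `digits`
    gaps = len(digits) - 1
    for positions in combinations(range(gaps), len(ops)):
        out, op_iter = [], iter(ops)
        for k, digit in enumerate(digits):
            out.append(digit)
            if k in positions:
                out.append(next(op_iter))
        yield ''.join(out)

# list(all_interleavings("112", "*")) -> ['1*12', '11*2']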
Product implementation
If you look at the itertools docs you can see the equivalent code for the function itertools.product. Using that you'd have:
def product(*args, repeat=1):
    pools = [tuple(pool) for pool in args] * repeat
    result = [[]]
    for pool in pools:
        result = [x + [y] for x in result for y in pool]
    for prod in result:
        yield tuple(prod)

a = ["".join(x) for x in product('ab', repeat=3)]
print(a)
Which prints ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba', 'bbb'] -- what I guess is what you're after.
A more specific (assuming iterable is a string), less efficient, but hopefully more understandable solution would be:
def prod(string, r):
    if r < 1:
        return None
    retval = list(string)
    for i in range(r - 1):
        temp = []
        for l in retval:
            for c in string:
                temp.append(l + c)
        retval = temp
    return retval
The idea is simple. The second parameter r gives you the length of the strings you want to produce. The characters in the string give you the elements from which you build the string. Hence, you first generate a string of length 1 that starts with each possible character. Then for each of those strings you generate new strings by concatenating the old string with all of the possible characters.
For example, given a pool of characters "abc", you'll first generate strings "a", "b", and "c". Then you'll replace string "a" with strings "aa", "ab", and "ac". Similarly for "b" and "c". You repeat this process r-1 times to get all possible strings of length r generated by drawing with replacement from the pool "abc".
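For instance, a usage sketch I'm adding for illustration:
# all length-3 strings drawn with replacement from the pool 'ab'
print(prod('ab', 3))
# ['aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba', 'bbb']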
I'd think it would be a good idea for you to try to implement the prod function recursively. You can see my ugly solution below, but I'd suggest you stop reading this now and try to do it without looking at my suggestion first.
SPOILER BELOW
def r_prod(string, r):
    if r == 1:
        return list(string)
    else:
        return [c + s for c in string for s in r_prod(string, r - 1)]

Finding regular expression with at least one repetition of each letter

From any *.fasta DNA sequence (only 'ACTG' characters) I must find all sequences which contain at least one repetition of each letter.
For example, from the sequence 'AAGTCCTAG' I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iterating on each letter).
I have no clue how to do that in Python 2.7. I was trying with regular expressions, but that did not find every variant.
How can I achieve that?
You could find all substrings of length 4+, and then down select from those to find only the shortest possible combinations that contain one of each letter:
s = 'AAGTCCTAG'
def get_shortest(s):
    l, b = len(s), set('ATCG')
    options = [s[i:j+1] for i in range(l) for j in range(i, l) if (j+1) - i > 3]
    return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]

print(get_shortest(s))
Output:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
This is another way you can do it. Maybe not as fast and nice as chrisz's answer, but maybe a little simpler for beginners to read and understand.
DNA = 'AAGTCCTAG'
toSave = []
for i in range(len(DNA)):
    letters = ['A', 'G', 'T', 'C']
    j = i
    seq = []
    while len(letters) > 0 and j < len(DNA):
        seq.append(DNA[j])
        try:
            letters.remove(DNA[j])
        except:
            pass
        j += 1
    if len(letters) == 0:
        toSave.append(seq)
print(toSave)
Since the substring you are looking for may be of almost any length, a FIFO queue (sliding window) seems to work. Append one letter at a time and check whether there is at least one of each letter; if so, yield the window. Then remove letters from the front and keep checking until the window is no longer valid.
def find_agtc_seq(seq_in):
    chars = 'AGTC'
    cur_str = []
    for ch in seq_in:
        cur_str.append(ch)
        while all(map(cur_str.count, chars)):
            yield "".join(cur_str)
            cur_str.pop(0)

seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
    print(substr)
That seems to result in the substrings you are looking for:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
I really wanted to create a short answer for this, so this is what I came up with!
See code in use here
s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)
while c <= len(s):
    x, c = s[:c], c + 1
    if all(l in x for l in d):
        print(x)
        s, c = s[1:], len(d)
It works as follows:
c is set to the length of the string of characters we need to find (d = 'ACGT').
The while loop looks at successively longer prefixes of s for as long as c does not exceed the length of s, increasing c by 1 on each iteration.
If every character in our string d (ACGT) exists in the current prefix, we print the result, reset c to its default value, and slice one character off the start of s.
The loop continues until the string s is shorter than d.
Result:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
To get the output in a list instead (see code in use here):
s = 'AAGTCCTAG'
d = 'ACGT'
c, r = len(d), []
while c <= len(s):
    x, c = s[:c], c + 1
    if all(l in x for l in d):
        r.append(x)
        s, c = s[1:], len(d)
print(r)
Result:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to find repeated sequences.
from itertools import groupby
import numpy as np

def find_repeats(input_list, n_repeats):
    flagged_items = []
    for item in input_list:
        # Create itertools.groupby object
        groups = groupby(str(item))
        # Create list of tuples: (character, number of repeats)
        result = [(label, sum(1 for _ in group)) for label, group in groups]
        # Extract just the number of repeats
        char_lens = np.array([x[1] for x in result])
        # Append to flagged items
        if any(char_lens >= n_repeats):
            flagged_items.append(item)
    # Return flagged items
    return flagged_items

#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']
find_repeats(test_list, n_repeats=2)  # Returns ['aatcg', 'ctagg']

Algorithm for finding the possible palindromic strings in a list containing a list of possible subsequences

I have "n" strings as input, which I separate into their possible subsequences, stored in a list like below.
If the input is: aa, b, aa
I create a list like the one below (each inner list holding the subsequences of one string):
aList = [['a', 'a', 'aa'], ['b'], ['a', 'a', 'aa']]
I would like to find the combinations of palindromes across the lists in aList.
For example, there would be 5 possible palindromes here: aba, aba, aba, aba, aabaa.
This could be achieved by a brute-force algorithm using the code below:
import itertools

d = []
count = 0

def isPalindrome(x):
    if x == x[::-1]:
        return True
    else:
        return False

for I in itertools.product(*aList):
    a = ''.join(I)
    if isPalindrome(a):
        if a not in d:
            d.append(a)
        count += 1
But this approach is resulting in a timeout when the number of strings and the length of the string are bigger.
Is there a better approach to the problem ?
Second version
This version uses a set called seen, to avoid testing combinations more than once.
Note that your function isPalindrome() can be simplified to a single expression, so I removed it and just did the test in-line to avoid the overhead of an unnecessary function call.
import itertools

aList = [['a', 'a', 'aa'], ['b'], ['a', 'a', 'aa']]

d = []
seen = set()
for I in itertools.product(*aList):
    if I not in seen:
        seen.add(I)
        a = ''.join(I)
        if a == a[::-1]:
            d.append(a)

print('d: {}'.format(d))
The current approach has the disadvantage that most of the generated combinations are thrown away only after they have been checked for being a palindrome.
One idea is that once you pick a candidate from the first group, you can immediately check whether there is a corresponding candidate in the last group.
For example, let's say that your space is this:
[["a","b","c"], ... , ["b","c","d"]]
We can see that if you pick "a" as the first pick, there is no "a" in the last group, and this excludes all the solutions that would otherwise have been tried.
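A minimal sketch of that first check (my illustration, not the answerer's code; the middle group ['x'] just stands in for the elided "...", and single-character picks keep the check simple):
# a palindrome must end with the reverse of its first pick, and the ending
# comes from the last group, so picks whose reverse is absent there are
# discarded immediately (exact only for picks of equal length)
space = [['a', 'b', 'c'], ['x'], ['b', 'c', 'd']]
first_group, last_group = space[0], space[-1]
viable_first_picks = [w for w in first_group if w[::-1] in last_group]
print(viable_first_picks)   # ['b', 'c']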
For larger input you could probably get some time gain by grabbing words from the first array and comparing them with the words of the last array, to check that these pairs still allow for a palindrome to be formed, or that such a combination can never lead to one by inserting words from the remaining arrays in between.
This way you probably cancel out a lot of possibilities, and this method can be repeated recursively, once you have decided that a pair is still in the running. You would then save the common part of the two words (when the second word is reversed of course), and keep the remaining letters separate for use in the recursive part.
Depending on which of the two words was longer, you would compare the remaining letters with words from the array that is next from the left or from the right.
This should bring a lot of early pruning in the search tree. You would thus not perform the full Cartesian product of combinations.
I have also written the function to get all substrings from a given word, which you probably already had:
def allsubstr(str):
    return [str[i:j+1] for i in range(len(str)) for j in range(i, len(str))]
def getpalindromes_trincot(aList):
    def collectLeft(common, needle, i, j):
        if i > j:
            return [common + needle + common[::-1]] if needle == needle[::-1] else []
        results = []
        for seq in aRevList[j]:
            if seq.startswith(needle):
                results += collectRight(common + needle, seq[len(needle):], i, j - 1)
            elif needle.startswith(seq):
                results += collectLeft(common + seq, needle[len(seq):], i, j - 1)
        return results

    def collectRight(common, needle, i, j):
        if i > j:
            return [common + needle + common[::-1]] if needle == needle[::-1] else []
        results = []
        for seq in aList[i]:
            if seq.startswith(needle):
                results += collectLeft(common + needle, seq[len(needle):], i + 1, j)
            elif needle.startswith(seq):
                results += collectRight(common + seq, needle[len(seq):], i + 1, j)
        return results

    aRevList = [[seq[::-1] for seq in seqs] for seqs in aList]
    return collectRight('', '', 0, len(aList) - 1)

# sample input and call:
input = ['already', 'days', 'every', 'year', 'later']
aList = [allsubstr(word) for word in input]
result = getpalindromes_trincot(aList)
I did a timing comparison with the solution that martineau posted. For the sample data I have used, this solution is about 100 times faster:
See it run on repl.it
Another Optimisation
Some gain could also be found in not repeating the search when the first array has several entries with the same string, like the 'a' in your example data. The results that include the second 'a' will obviously be the same as for the first. I did not code this optimisation, but it might be an idea to improve the performance even more.
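One way to do that (a sketch I'm adding; the answer explicitly leaves this optimisation uncoded) is to deduplicate the first group before the search, bearing in mind that duplicate palindromes would then no longer be reported:
# remove repeated entries from the first group while preserving order,
# so identical picks such as the two 'a' entries are only explored once
seen = set()
aList[0] = [w for w in aList[0] if not (w in seen or seen.add(w))]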

Python - build new string of specific length with n replacements from specific alphabet

I have been working on a fast, efficient way to solve the following problem, but so far I have only been able to solve it using a rather slow, nested-loop solution. Anyway, here is the description:
So I have a string of length L, let's say 'BBBX'. I want to find all possible strings of length L, starting from 'BBBX', that differ in at most 2 positions and at least 0 positions. On top of that, when building the new strings, new characters must be selected from a specific alphabet.
I guess the size of the alphabet doesn't matter, so lets say in this case the alphabet is ['B', 'G', 'C', 'X'].
So, some sample output would be, 'BGBG', 'BGBC', 'BBGX', etc. For this example with a string of length 4 with up to 2 substitutions, my algorithm finds 67 possible new strings.
I have been trying to use itertools to solve this problem, but I am having a bit of difficulty finding a solution. I try to use itertools.combinations(range(4), 2) to find all the possible positions. I am then thinking of using product() from itertools to build all of the possibilities, but I am not sure if there is a way I could connect it somehow to the indices from the output of combinations().
Here's my solution.
The first for loop tells us how many replacements we will perform. (0, 1 or 2 - we go through each)
The second loop tells us which letters we will change (by their indexes).
The third loop goes through all of the possible letter changes for those indexes. There's some logic to make sure we actually change the letter (changing "C" to "C" doesn't count).
import itertools

def generate_replacements(lo, hi, alphabet, text):
    for count in range(lo, hi + 1):
        for indexes in itertools.combinations(range(len(text)), count):
            for letters in itertools.product(alphabet, repeat=count):
                new_text = list(text)
                actual_count = 0
                for index, letter in zip(indexes, letters):
                    if new_text[index] == letter:
                        continue
                    new_text[index] = letter
                    actual_count += 1
                if actual_count == count:
                    yield ''.join(new_text)

for text in generate_replacements(0, 2, 'BGCX', 'BBBX'):
    print text
Here's its output:
BBBX GBBX CBBX XBBX BGBX BCBX BXBX BBGX BBCX BBXX BBBB BBBG BBBC GGBX
GCBX GXBX CGBX CCBX CXBX XGBX XCBX XXBX GBGX GBCX GBXX CBGX CBCX CBXX
XBGX XBCX XBXX GBBB GBBG GBBC CBBB CBBG CBBC XBBB XBBG XBBC BGGX BGCX
BGXX BCGX BCCX BCXX BXGX BXCX BXXX BGBB BGBG BGBC BCBB BCBG BCBC BXBB
BXBG BXBC BBGB BBGG BBGC BBCB BBCG BBCC BBXB BBXG BBXC
Not tested much, but it does find 67 for the example you gave. The easy way to connect the indices to the products is via zip():
def sub(s, alphabet, minsubs, maxsubs):
    from itertools import combinations, product
    origs = list(s)
    alphabet = set(alphabet)
    for nsubs in range(minsubs, maxsubs + 1):
        for ix in combinations(range(len(s)), nsubs):
            prods = [alphabet - set(origs[i]) for i in ix]
            s = origs[:]
            for newchars in product(*prods):
                for i, char in zip(ix, newchars):
                    s[i] = char
                yield "".join(s)

count = 0
for s in sub('BBBX', 'BGCX', 0, 2):
    count += 1
    print s
print count
Note: the major difference from FogleBird's is that I posted first - LOL ;-) The algorithms are very similar. Mine constructs the inputs to product() so that no substitution of a letter for itself is ever attempted; FogleBird's allows "identity" substitutions, but counts how many valid substitutions are made and then throws the result away if any identity substitutions occurred. On longer words and a larger number of substitutions, that can be much slower (potentially the difference between len(alphabet)**nsubs and (len(alphabet)-1)**nsubs times around the ... in product(): loop).
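(Concretely, for the example here, with a 4-letter alphabet and 2 substitutions, that is 4**2 = 16 versus 3**2 = 9 iterations of the inner loop for each pair of indices; the gap widens with larger alphabets and more substitutions.)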

Is there a way to make collections.Counter (Python2.7) aware that its input list is sorted?

The Problem
I've been playing around with different ways (in Python 2.7) to extract a list of (word, frequency) tuples from a corpus, or list of strings, and comparing their efficiency. As far as I can tell, in the normal case with an unsorted list, the Counter method from the collections module is superior to anything I've come up with or found elsewhere, but it doesn't seem to take much advantage of the benefits of a pre-sorted list, and I've come up with methods that beat it easily in this special case. So, in short, is there any built-in way to inform Counter of the fact that a list is already sorted, to further speed it up?
(The next section is on unsorted lists, where Counter works magic; you may want to skip towards the end, where it loses its charm when dealing with sorted lists.)
Unsorted input lists
One approach that doesn't work
The naive approach would be to use sorted([(word, corpus.count(word)) for word in set(corpus)]), but that one reliably gets you into runtime problems as soon as your corpus is a few thousand items long - not surprisingly, since you're running through the entire list of n words m times, where m is the number of unique words.
Sorting the list + local search
So what I tried to do instead before I found Counter was make sure that all searches are strictly local by first sorting the input list (I also have to remove digits and punctuation marks and convert all entries into lower case to avoid duplicates like 'foo', 'Foo', and 'foo:').
#Natural Language Toolkit, for access to corpus; any other source for a long text will do, though.
import nltk
# nltk corpora come as a class of their own, as I understand it presenting to the
# outside as a unique list but underlyingly represented as several lists, with no more
# than one ever loaded into memory at any one time, which is good for memory issues
# but rather not so for speed so let's disable this special feature by converting it
# back into a conventional list:
corpus = list(nltk.corpus.gutenberg.words())

import string
drop = string.punctuation + string.digits

def wordcount5(corpus, Case=False, lower=False, StrippedSorted=False):
    '''function for extracting word frequencies out of a corpus. Returns an alphabetic list
    of tuples consisting of words contained in the corpus with their frequencies.
    Default is case-insensitive, but if you need separate entries for upper and lower case
    spellings of the same words, set option Case=True. If your input list is already sorted
    and stripped of punctuation marks/digits and/or all lower case, you can accelerate the
    operation by a factor of 5 or so by declaring so through the options "StrippedSorted" and "lower".'''
    # you can ignore the following lines for now, they're only relevant with a pre-processed input list
    if lower or Case:
        if StrippedSorted:
            sortedc = corpus
        else:
            sortedc = sorted([word.replace('--', ' ').strip(drop)
                              for word in sorted(corpus)])
    # here we sort and purge the input list in the default case:
    else:
        sortedc = sorted([word.lower().replace('--', ' ').strip(drop)
                          for word in sorted(corpus)])
    # start iterating over the (sorted) input list:
    scindex = 0
    # create a list:
    freqs = []
    # identify the first token:
    currentword = sortedc[0]
    length = len(sortedc)
    while scindex < length:
        wordcount = 0
        # increment a local counter while the tokens == currentword
        while scindex < length and sortedc[scindex] == currentword:
            scindex += 1
            wordcount += 1
        # store the current word and final score when a) a new word appears or
        # b) the end of the list is reached
        freqs.append((currentword, wordcount))
        # if a): update currentword with the current token
        if scindex < length:
            currentword = sortedc[scindex]
    return freqs
Finding collections.Counter
This is much better, but still not quite as fast as using the Counter class from the collections module, which creates a dictionary of {word: frequency of word} entries (we still have to do the same stripping and lowering, but no sorting):
from collections import Counter
cnt = Counter()
for word in [token.lower().strip(drop) for token in corpus]:
    cnt[word] += 1

# optionally, if we want to have the same output format as before,
# we can do the following which negligibly adds in runtime:
wordfreqs = sorted([(word, cnt[word]) for word in cnt])
On the Gutenberg corpus with approx. 2 million entries, the Counter method is roughly 30% faster on my machine (5 seconds as opposed to 7.2), which is mostly explained by the sorting routine, which eats around 2.1 seconds. (If you don't have and don't want to install the nltk package (Natural Language Toolkit), which offers access to this corpus, any other adequately long text appropriately split into a list of strings at word level will show you the same.)
Comparing performance
With my idiosyncratic method of timing using the tautology as a conditional to delay execution, this gives us for the counter method:
import time
>>> if 1:
...     start = time.time()
...     cnt = Counter()
...     for word in [token.lower().strip(drop) for token in corpus if token not in [" ", ""]]:
...         cnt[word] += 1
...     time.time() - start
...     cntgbfreqs = sorted([(word, cnt[word]) for word in cnt])
...     time.time() - start
...
4.999882936477661
5.191655874252319
(We see that the last step, that of formatting the results as a list of tuples, takes up less than 5% of the total time.)
Compared to my function:
>>> if 1:
...     start = time.time()
...     gbfreqs = wordcount5(corpus)
...     time.time() - start
...
7.261770963668823
Sorted input lists - when Counter 'fails'
However, as you may have noticed, my function allows you to specify that the input is already sorted, stripped of punctuation, and converted to lowercase. If we have already created such a converted version of the list for some other operation, using it (and declaring so) can very much speed up the operation of my wordcount5:
>>> sorted_corpus = sorted([token.lower().strip(drop) for token in corpus if token not in [" ", ""]])
>>> if 1:
...     start = time.time()
...     strippedgbfreqs2 = wordcount5(sorted_corpus, lower=True, StrippedSorted=True)
...     time.time() - start
...
0.9050078392028809
Here, we've reduced runtime by a factor of roughly 8 by not having to sort the corpus and convert the items. Of course the latter also holds when feeding Counter with this new list, so as expected it's a bit faster too, but it doesn't seem to take advantage of the fact that the list is sorted, and now takes twice as long as my function, where it was 30% faster before:
>>> if 1:
...     start = time.time()
...     cnt = Counter()
...     for word in sorted_corpus:
...         cnt[word] += 1
...     time.time() - start
...     strippedgbfreqs = [(word, cnt[word]) for word in cnt]
...     time.time() - start
...
1.9455058574676514
2.0096349716186523
Of course, we can use the same logic I used in wordcount5 - incrementing a local counter until we run into a new word, and only then storing the last word with the current state of the counter and resetting it to 0 for the next word - while using Counter purely as storage. But then the inherent efficiency of the Counter method seems lost: performance lands within the range of my function's for creating a dictionary, and the extra burden of converting to a list of tuples now looks more troublesome than it did when we were processing the raw corpus:
>>> def countertest():
...     start = time.time()
...     sortedcnt = Counter()
...     c = 0
...     length = len(sorted_corpus)
...     while c < length:
...         wcount = 0
...         word = sorted_corpus[c]
...         while c < length and sorted_corpus[c] == word:
...             wcount += 1
...             c += 1
...         sortedcnt[word] = wcount
...         if c < length:
...             word = sorted_corpus[c]
...     print time.time() - start
...     return sorted([(word, sortedcnt[word]) for word in sortedcnt])
...     print time.time() - start
...
>>> strippedbgcnt = countertest()
0.920727014542
1.08029007912
(The similarity of the results is not really surprising since we're in effect disabling Counter's own methods and abusing it as a store for values obtained with the very same methodology as before.)
Therefore, my question: Is there a more idiomatic way to inform Counter that its input list is already sorted and to make it thus keep the current key in memory rather than looking it up anew every time it - predictably - encounters the next token of the same word? In other words, is it possible to improve performance on a pre-sorted list further by combining the inherent efficiency of the Counter/dictionary class with the obvious benefits of a sorted list, or am I already scratching on a hard limit with .9 seconds for tallying a list of 2M entries?
There probably isn't a whole lot of room for improvement - I get times of around .55 seconds when doing the simplest thing I can think of that still requires iterating through the same list and checking each individual value, and .25 for set(corpus) without a count, but maybe there's some itertools magic out there that would help to get close to those figures?
(Note: I'm a relative novice to Python and to programming in general, so excuse if I've missed something obvious.)
Edit Dec. 1:
Another thing, besides the sorting itself, which makes all of my methods above slow, is converting every single one of 2M strings into lowercase and stripping them of whatever punctuation or digits they may include. I have tried before to shortcut that by counting the unprocessed strings and only then converting the results and removing duplicates while adding up their counts, but I must have done something wrong for it made things ever so slightly slower. I therefore reverted to the previous versions, converting everything in the raw corpus, and now can't quite reconstruct what I did there.
If I try it now, I do get an improvement from converting the strings last. I'm still doing it by looping over a list (of results). What I did was write a couple of functions that between them convert the keys in the output of J.F. Sebastian's winning defaultdict method (of the format [("word", int), ("Word", int), ("word2", int), ...]) into lowercase, strip them of punctuation, and collapse the counts for all keys left identical after that operation (code below). The advantage is that we're now handling a list of roughly 50k entries as opposed to the > 2M in the corpus. This way I'm now at 1.25 seconds for going from the corpus (as a list) to a case-insensitive word count ignoring punctuation marks on my machine, down from about 4.5 with the Counter method and string conversion as a first step. But maybe there's a dictionary-based method for what I'm doing in sum_sorted() as well?
Code:
from collections import defaultdict

def striplast(resultlist, lower_or_Case=False):
    """function for string conversion of the output of any of the `count_words*` methods"""
    if lower_or_Case:
        strippedresult = sorted([(entry[0].strip(drop), entry[1]) for entry in resultlist])
    else:
        strippedresult = sorted([(entry[0].lower().strip(drop), entry[1]) for entry in resultlist])
    strippedresult = sum_sorted(strippedresult)
    return strippedresult

def sum_sorted(inputlist):
    """function for collapsing the counts of entries left identical by striplast()"""
    ilindex = 0
    freqs = []
    currentword = inputlist[0][0]
    length = len(inputlist)
    while ilindex < length:
        wordcount = 0
        while ilindex < length and inputlist[ilindex][0] == currentword:
            wordcount += inputlist[ilindex][1]
            ilindex += 1
        if currentword not in ["", " "]:
            freqs.append((currentword, wordcount))
        if ilindex < length and inputlist[ilindex][0] > currentword:
            currentword = inputlist[ilindex][0]
    return freqs

def count_words_defaultdict2(words, loc=False):
    """modified version of J.F. Sebastian's winning method, added a final step collapsing
    the counts for words identical except for punctuation and digits and case (for the
    latter, unless you specify that you're interested in a case-sensitive count by setting
    l(ower_)o(r_)c(ase) to True) by means of striplast()."""
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    if loc:
        return striplast(sorted(d.items()), lower_or_Case=True)
    else:
        return striplast(sorted(d.items()))
I made some first attempts at using groupby to do the job currently done by sum_sorted() and/or striplast(), but I couldn't quite work out how to trick it into summing entry[1] over a list of entries in count_words' results sorted by entry[0]. The closest I got was:
# "i(n)p(ut)list", toylist for testing purposes:
list(groupby(sorted([(entry[0].lower().strip(drop), entry[1]) for entry in iplist])))
# returns:
[(('a', 1), <itertools._grouper object at 0x1031bb290>), (('a', 2), <itertools._grouper object at 0x1031bb250>), (('a', 3), <itertools._grouper object at 0x1031bb210>), (('a', 5), <itertools._grouper object at 0x1031bb2d0>), (('a', 8), <itertools._grouper object at 0x1031bb310>), (('b', 3), <itertools._grouper object at 0x1031bb350>), (('b', 7), <itertools._grouper object at 0x1031bb390>)]
# So what I used instead for striplast() is based on list comprehension:
list(sorted([(entry[0].lower().strip(drop), entry[1]) for entry in iplist]))
# returns:
[('a', 1), ('a', 2), ('a', 3), ('a', 5), ('a', 8), ('b', 3), ('b', 7)]
Given a sorted list of words as you mention, have you tried the traditional Pythonic approach of itertools.groupby?
from itertools import groupby
some_data = ['a', 'a', 'b', 'c', 'c', 'c']
count = dict( (k, sum(1 for i in v)) for k, v in groupby(some_data) ) # or
count = {k:sum(1 for i in v) for k, v in groupby(some_data)}
# {'a': 2, 'c': 3, 'b': 1}
To answer the question from the title: Counter, dict, defaultdict, and OrderedDict are hash-based types: to look up an item they compute a hash for the key and use it to get the item. They even support keys that have no defined order, as long as they are hashable; in other words, Counter can't take advantage of pre-sorted input.
The measurements show that sorting the input words takes longer than counting the words with a dictionary-based approach and sorting the result combined:
sorted 3.19
count_words_Counter 2.88
count_words_defaultdict 2.45
count_words_dict 2.58
count_words_groupby 3.44
count_words_groupby_sum 3.52
Also, counting the words in already sorted input with groupby() takes only a fraction of the time it takes to sort the input in the first place, and is faster than the dict-based approaches.
def count_words_Counter(words):
    return sorted(Counter(words).items())

def count_words_groupby(words):
    return [(w, len(list(gr))) for w, gr in groupby(sorted(words))]

def count_words_groupby_sum(words):
    return [(w, sum(1 for _ in gr)) for w, gr in groupby(sorted(words))]

def count_words_defaultdict(words):
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return sorted(d.items())

def count_words_dict(words):
    d = {}
    for w in words:
        try:
            d[w] += 1
        except KeyError:
            d[w] = 1
    return sorted(d.items())

def _count_words_freqdist(words):
    # note: .items() returns words sorted by word frequency (decreasing order)
    # (same as `Counter.most_common()`)
    # so the code sorts twice (the second time in lexicographical order)
    return sorted(nltk.FreqDist(words).items())
To reproduce the results, run this code.
Note: It is 3 times faster if nltk's lazy sequence of words is converted to a list (WORDS = list(nltk.corpus.gutenberg.words())), but relative performance is the same:
sorted 1.22
count_words_Counter 0.86
count_words_defaultdict 0.48
count_words_dict 0.54
count_words_groupby 1.49
count_words_groupby_sum 1.55
The results are similar to Python - Is a dictionary slow to find frequency of each character?.
If you want to normalize the words (remove punctuation, make them lowercase, etc); see answers to What is the most efficient way in Python to convert a string to all lowercase stripping out all non-ascii alpha characters?. Some examples:
def toascii_letter_lower_genexpr(s, _letter_set=ascii_lowercase):
    """
    >>> toascii_letter_lower_genexpr("ABC,-.!def")
    'abcdef'
    """
    return ''.join(c for c in s.lower() if c in _letter_set)

def toascii_letter_lower_genexpr_set(s, _letter_set=set(ascii_lowercase)):
    return ''.join(c for c in s.lower() if c in _letter_set)

def toascii_letter_lower_translate(s,
        table=maketrans(ascii_letters, ascii_lowercase * 2),
        deletechars=''.join(set(maketrans('', '')) - set(ascii_letters))):
    return s.translate(table, deletechars)

def toascii_letter_lower_filter(s, _letter_set=set(ascii_letters)):
    return filter(_letter_set.__contains__, s).lower()
To count and normalize the words simultaneously:
def combine_counts(items):
    d = defaultdict(int)
    for word, count in items:
        d[word] += count
    return d.iteritems()

def clean_words_in_items(clean_word, items):
    return ((clean_word(word), count) for word, count in items)

def normalize_count_words(words):
    """Normalize then count words."""
    return count_words_defaultdict(imap(toascii_letter_lower_translate, words))

def count_normalize_words(words):
    """Count then normalize words."""
    freqs = count_words_defaultdict(words)
    freqs = clean_words_in_items(toascii_letter_lower_translate, freqs)
    return sorted(combine_counts(freqs))
Results
I've updated the benchmark to measure various combinations of count_words*() and toascii*() functions (5x4 pairs not shown):
toascii_letter_lower_filter 0.954 usec small
toascii_letter_lower_genexpr 2.44 usec small
toascii_letter_lower_genexpr_set 2.19 usec small
toascii_letter_lower_translate 0.633 usec small
toascii_letter_lower_filter 124 usec random 2000
toascii_letter_lower_genexpr 197 usec random 2000
toascii_letter_lower_genexpr_set 121 usec random 2000
toascii_letter_lower_translate 7.73 usec random 2000
sorted 1.28 sec
count_words_Counter 941 msec
count_words_defaultdict 501 msec
count_words_dict 571 msec
count_words_groupby 1.56 sec
count_words_groupby_sum 1.64 sec
count_normalize_words 622 msec
normalize_count_words 2.18 sec
The fastest methods:
normalize words - toascii_letter_lower_translate()
count words (presorted input) - groupby()-based approach
count words - count_words_defaultdict()
it is faster first to count the words and then to normalize them - count_normalize_words()
Latest version of the code: count-words-performance.py.
One source of inefficiency in the OP's code (which several answers fixed without commenting on) is the over-reliance on intermediate lists. There is no reason to create a temporary list of millions of words just to iterate over them, when a generator will do.
So instead of
cnt = Counter()
for word in [token.lower().strip(drop) for token in corpus]:
    cnt[word] += 1
it should be just
cnt = Counter(token.lower().strip(drop) for token in corpus)
And if you really want to sort the word counts alphabetically (what on earth for?), replace this
wordfreqs = sorted([(word, cnt[word]) for word in cnt])
with this:
wordfreqs = sorted(cnt.items()) # In Python 2: cnt.iteritems()
This should remove much of the inefficiency around the use of Counter (or any dictionary class used in a similar way).
