Kinda recursive word counting - Python

Today is one of those days where all of my knowledge in programming seems to be failing horribly, and no amount of coffee administered via IV is helping the situation.
I am presented with a list of phrases; here are some as an example:
"tax policies when emigrating from uk"
"shipping to scotland from california"
"immigrating to sweden"
"shipping good to australia"
"shipping good to new zealand"
"how to emigrate to california from the uk"
"shipping services from london to usa"
"cost of shipping from usa to uk"
Now I need to start doing word frequency analysis on this. Thankfully, in Python this is pretty simple; I constructed the following function to take this list and give back a Counter of the most common words.
from collections import Counter

def count(phrases):
    counter = Counter()
    for phrase in phrases:
        for word in phrase.split(" "):
            counter[word] += 1
    return counter
This rocks, because now I can easily acquire the most common words from the phrase list like so: count(phrases).most_common(5).
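For the sample phrases above, that call produces something like the following (the exact order of equal counts may vary):

print(count(phrases).most_common(5))
# e.g. [('to', 7), ('from', 5), ('shipping', 5), ('uk', 3), ('california', 2)]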
Now it becomes harder. Say I set an arbitrary depth, let's say 5. The most popular word in that list (that isn't a glue word, e.g. from, to, or and) is shipping. I now need to take the word shipping and count again across all the phrases that contain shipping; again, mostly kind of simple:
def filter_for_word(word, phrases):
    return filter(lambda x: word in x, phrases)

count(filter_for_word("shipping", phrases))
This is where it starts to get hairy: I need to keep going down and down the results until I hit my depth, and then I need to be able to display this information along with the most common phrases together.
I started trying to do this with the following function, but I simply cannot get my head around the next few steps to bind the content together and display it in a good structure and format.
def dive(depth, num, phrases):
    phrase_tree = {}
    for word, value in dict(count(phrases).most_common(num)).items():
        phrase_tree[word] = [value, {}]
    current = phrase_tree
    while True:
        if depth == 0:
            return phrase_tree
        for word in current:
            current[word][1] = {key: [v, {}] for (key, v) in count(filter_for_word(word, phrases)).most_common(num)}
        # debug!!
        return current
If anyone could help me bring this all together, I would greatly appreciate it.

def filter_for_words(words, phrases):
    # Successively narrow the phrase list to phrases containing every word.
    # (A list comprehension rather than chained lazy filters, which in
    # Python 3 would all close over the final value of `word`.)
    for word in words:
        phrases = [p for p in phrases if word in p]
    return phrases

def dive(depth, num, phrases, phrase_tree=None, f_words=None):
    if phrase_tree is None:
        # top-level call: seed the tree with the overall most common words
        phrase_tree = {}
        for word, value in count(phrases).most_common(num):
            phrase_tree[word] = [value, {}]
    if f_words is None:
        f_words = []
    if depth == 0:
        return phrase_tree
    for word in phrase_tree:
        words = f_words[:]
        words.append(word)
        child_tree = {key: [v, {}] for (key, v) in count(filter_for_words(words, phrases)).most_common(num)}
        phrase_tree[word][1] = child_tree
        dive(depth - 1, num, phrases, child_tree, words)
    return phrase_tree
Not efficient, but it should work.
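To display the result, a small recursive printer over the nested {word: [count, children]} structure works (a quick sketch; the indentation width and ordering are arbitrary choices):

def show(tree, indent=0):
    # Walk the nested {word: [count, children]} dict, most common words first.
    for word, (value, children) in sorted(tree.items(), key=lambda kv: -kv[1][0]):
        print("  " * indent + "%s (%d)" % (word, value))
        show(children, indent + 1)

show(dive(3, 5, phrases))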

Related

Search the string - benchmark time - Python

For my project I want to write a program that searches for a word in a string/long document in Python.
If the word is not in the string/document, I have to search for approximate matches.
For example, the word "brain":
Deletions: rain bain brin bran brai ...
Substitutions: train grain blain bryin ...
I already have deletion and substitution functions, but I am not sure how to search for the word in brute-force runtime / benchmark runtime:
string = "hereharewereworeherareteartoredeardareearrearehrerheasereseersearrah"
# the string can be much longer
Pattern = "ware"
# the output should have 4 deletions and 6 substitutions
# string0 is Pattern, string1 is the word we compare; if it matches the type, append to the list
Deletions = []

def deletions(string0, string1):
    deletionlist = []
    # build the list of single-character deletions of string0 (by slicing,
    # not str.replace, which would remove every occurrence of the character)
    for i in range(len(string0)):
        deletionlist.append(string0[:i] + string0[i+1:])
    # compare string1 with its first or last character dropped
    if string1[1:] in deletionlist:
        Deletions.append(string1[1:])
        return 1
    elif string1[:-1] in deletionlist:
        Deletions.append(string1[:-1])
        return 1
Substitutions = []

def substitutions(string0, string1):
    if len(string0) == len(string1):
        sublist = []
        # build the list of single-character deletions of string0
        for i in range(len(string0)):
            sublist.append(string0[:i] + string0[i+1:])
        # if deleting one position from string1 yields one of those strings,
        # treat string1 as a substitution match
        for j in range(len(string1)):
            if string1[:j] + string1[j+1:] in sublist:
                Substitutions.append(string1)
                break
The best approach is the Levenshtein algorithm; you can calculate the distance between two words or sentences (how many character edits it takes to convert one into the other), or a similarity ratio if you like:
>>> import Levenshtein
>>> Levenshtein.distance( 'hello, guys', 'hello, girls' )
3
>>> Levenshtein.ratio( 'hello, guys', 'hello, girls' )
0.782608695652174
You may check the details of the implementation and other info here: https://en.wikipedia.org/wiki/Levenshtein_distance
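If you would rather not depend on the third-party package, the distance is straightforward to compute yourself with dynamic programming; here is a minimal sketch of the standard Wagner-Fischer algorithm:

def levenshtein(a, b):
    # Classic edit-distance dynamic programming, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

print(levenshtein('hello, guys', 'hello, girls'))  # 3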

Improved anagram generator

I'm trying to create a function in Python that will generate anagrams of a given word. I'm not just looking for code that will rearrange the letters aimlessly; all the options given must be real words. I currently have a solution, which to be honest I took mostly from a YouTube video, but it is very slow for my purpose and can only provide one-word responses to a single word given. It uses a 400,000-word dictionary, called "dict.txt", to compare against the words it is going through.
My goal is to get this code to mimic how well this website's code works:
https://wordsmith.org/anagram/
I could not find the JavaScript code when reviewing the network activity using Google Chrome's developer tools, so I believe the code probably runs in the backend, possibly using Node.js. That would perhaps make it faster than Python, but given how much faster it is, I believe there is more to it than just the programming language. I assume they are using some type of search algorithm rather than going through each line one by one like I am. I also like the fact that their response is not limited to a single word, but can break up the word given to provide more options to the user. For example, an anagram of "anagram" is "nag a ram".
Any suggestions or ideas would be appreciated.
Thank you.
import random

def init_words(filename):
    words = {}
    with open(filename) as f:
        for line in f:
            word = line.strip()
            words[word] = 1
    return words

def init_anagram_dict(words):
    anagram_dict = {}
    for word in words:
        sorted_word = ''.join(sorted(list(word)))
        if sorted_word not in anagram_dict:
            anagram_dict[sorted_word] = []
        anagram_dict[sorted_word].append(word)
    return anagram_dict

def find_anagrams(word, anagram_dict):
    key = ''.join(sorted(list(word)))
    if key in anagram_dict:
        return set(anagram_dict[key]).difference(set([word]))
    return set([])

# This is the first function called.
def make_anagram(user_word):
    x = str(user_word)
    lower_user_word = str.lower(x)
    word_dict = init_words('dict.txt')
    result = find_anagrams(lower_user_word, init_anagram_dict(word_dict.keys()))
    list_result = list(result)
    count = len(list_result)
    if count > 0:
        random_num = random.randint(0, count - 1)
        anagram_value = list_result[random_num]
        return ('An anagram of %s is %s. Would you like me to search for another word?' % (lower_user_word, anagram_value))
    else:
        return ("Sorry, I could not find an anagram for %s." % (lower_user_word))
You can build a dictionary of anagrams by grouping words by their sorted text. All words that have the same sorted text are anagrams of each other:
from collections import defaultdict

with open("/usr/share/dict/words", "r") as wordFile:
    words = wordFile.read().split("\n")

anagrams = defaultdict(list)
for word in words:
    anagrams["".join(sorted(word))].append(word)

aWord = "spear"
result = anagrams["".join(sorted(aWord))]
print(aWord, result)
# spear ['asper', 'parse', 'prase', 'spaer', 'spare', 'spear']
Using 235,000 words, the response time is instantaneous.
In order to obtain multiple words forming an anagram of the specified word, you will need to get into combinatorics. A recursive function is probably the easiest way to go about it:
from itertools import combinations, product
from collections import Counter, defaultdict

with open("/usr/share/dict/words", "r") as wordFile:
    words = wordFile.read().split("\n")

anagrams = defaultdict(set)
for word in words:
    anagrams["".join(sorted(word))].add(word)
counters = {w: Counter(w) for w in anagrams}

minLen = 2  # minimum word length

def multigram(word, memo=dict()):
    sWord = "".join(sorted(word))
    if sWord in memo:
        return memo[sWord]
    result = anagrams[sWord]
    wordCounts = counters.get(sWord, Counter())
    for size in range(minLen, len(word) - minLen + 1):
        seen = set()
        for combo in combinations(word, size):
            left = "".join(sorted(combo))
            if left in seen or seen.add(left):
                continue
            left = multigram(left, memo)
            if not left:
                continue
            right = multigram("".join((wordCounts - Counter(combo)).elements()), memo)
            if not right:
                continue
            result.update(a + " " + b for a, b in product(left, right))
    memo[sWord] = list(result)
    return memo[sWord]
Performance is good up to 12-character words. Beyond that, the exponential nature of combinations starts to take a heavy toll:
result = multigram("spear")
print(result)
# ['parse', 'asper', 'spear', 'er spa', 're spa', 'se rap', 'er sap', 'sa per', 're asp', 'ar pes', 'se par', 'pa ers', 're sap', 'er asp', 'as per', 'spare', 'spaer', 'as rep', 'sa rep', 'ra pes', 'pa ser', 'es rap', 'es par', 'prase']
len(multigram("mulberries")) # 15986 0.1 second 10 letters
len(multigram("raspberries")) # 60613 0.2 second 11 letters
len(multigram("strawberries")) # 374717 1.3 seconds 12 letters
len(multigram("tranquillizer")) # 711491 7.6 seconds 13 letters
len(multigram("communications")) # 10907666 52.2 seconds 14 letters
In order to avoid any delay, you can convert the function to an iterator. This allows you to get the first few anagrams without having to generate them all:
def iMultigram(word,prefix=""):
sWord = "".join(sorted(word))
seen = set()
for anagram in anagrams.get(sWord,[]):
full = prefix+anagram
if full in seen or seen.add(full): continue
yield full
wordCounts = counters.get(sWord,Counter(word))
for size in reversed(range(minLen,len(word)-minLen+1)): # longest first
for combo in combinations(sWord,size):
left = "".join(sorted(combo))
if left in seen or seen.add(left): continue
for left in iMultigram(left,prefix):
right = "".join((wordCounts-Counter(combo)).elements())
for full in iMultigram(right,left+" "):
if full in seen or seen.add(full): continue
yield full
from itertools import islice
list(islice(iMultigram("communications"),5)) # 0.0 second
# ['communications', 'cinnamomic so ut', 'cinnamomic so tu', 'cinnamomic os ut', 'cinnamomic os tu']

How to convert a list of words to adjacency lists in an efficient way

Say I have a list of words. If a word's last letter is the same as another word's first letter, then we can connect them together. We don't connect a word with itself. The input elements are distinct.
Example: apple - elephant - tower - rank
I implemented it as this.
def transform(lst):
    graph = []
    for picked in lst:
        link = []
        i = lst.index(picked)
        rest = lst[:i] + lst[i+1:]
        for compare in rest:
            if picked[-1] == compare[0]:
                link.append(compare)
        if len(link) != 0:
            graph.append(link)
    return graph
I don't know if I can still improve it.
EDIT:

I think I should change

if len(link) != 0:
    graph.append(link)

to

graph.append(link)

Otherwise, the order of the adjacency lists will get mixed up.
You should start by identifying the two things you're grouping by here: ending letters and starting letters. Drop all the words into two dicts, keyed by each, and you'll end up with much faster lookups. list.index is a killer for efficiency, with each lookup costing O(n).
from collections import defaultdict

startswith, endswith = defaultdict(list), defaultdict(list)
wordlist = ['apple', 'elephant', 'tower', 'rank']

for word in wordlist:
    startswith[word[0]].append(word)
    endswith[word[-1]].append(word)
Then it should be a fairly simple graph traversal problem.
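For instance, each word's adjacency list then falls out of a single dict lookup (a sketch building on the startswith dict above; the variable name adjacency is my own):

adjacency = {word: [w for w in startswith[word[-1]] if w != word]
             for word in wordlist}
print(adjacency)
# {'apple': ['elephant'], 'elephant': ['tower'], 'tower': ['rank'], 'rank': []}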
I hope I understand what you're talking about.
def transform(lst):
    graph = []
    for picked in lst:
        if len(graph) == 0:
            graph.append(picked)
        else:
            # last letter of the last word added to the chain
            ch = graph[-1][-1]
            world_ch = [i for i in lst if (i not in graph) and (i[0] == ch)]
            if len(world_ch) == 0:
                break
            else:
                graph.append(world_ch[0])
    return graph

lista = ['apple', 'carloh', 'horse', 'apple', 'elephant', 'tower', 'rank']
print(str(transform(lista)))
# ['apple', 'elephant', 'tower', 'rank']
print(str([[i] for i in transform(lista)]))
# [['apple'], ['elephant'], ['tower'], ['rank']]

Finding common letters in list of words

I am new to Python; this is what I am trying to achieve:
letter = 4
word = "Demo Deer Deep Deck Cere Reep Creep Creeps"
split_word = word.split()
I am trying to achieve words that can be formed by any 4 common letters for example:
Deer Deep Reep [these can be formed by 4 letters d, e, r & p]
Creep Cere Reep [these can be formed by 4 letters c, r, e, p]
Is there an easy way to do this in Python without using regex?
You'll need to do this in a few steps, I think.
First, you need to figure out what your set of letters is. You could use the whole alphabet, but I'd recommend against that if you can avoid it. I'd try using a set:
letter_pool = set([ltr.lower() for ltr in word if ltr != " "])
Next, you need to iterate through all combinations of four letters in your pool and check which words can be formed with them. This is why it's probably better not to use the whole alphabet; this is a lot of combinations. In the example below, I'm storing the results in a dictionary keyed by the combination of letters, but you could modify that to suit your needs.
import itertools

results = {}
for combination in itertools.combinations(letter_pool, letter):  # in this case, letter = 4
    results[combination] = []
    for wrd in split_word:
        for character in wrd:
            if character.lower() not in combination:
                break
        else:
            results[combination].append(wrd)
    if len(results[combination]) == 0:
        del results[combination]
Note the for-else syntax; it means if the loop does not break, the code in the else clause executes. Basically, for a given combination of letters, this code checks each word and determines if it is composed of just those letters. If it is, it stores that information. If a given combination doesn't form any words, its entry in the dictionary is removed (to save on memory). Note that this is a pretty naive solution and will not scale well.
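As a standalone illustration of that for-else behavior (a toy example, separate from the answer's code):

# The else clause runs only when the for loop completes without hitting break.
for ch in "dome":
    if ch not in "demo":
        break
else:
    print("'dome' uses only letters found in 'demo'")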
If you want to print out the results, you can do:
for key in results:
    print(", ".join(results[key]), "[ formed by " + str(key) + "]")
You can achieve this using sets and itertools.combinations:
word = "Demo Deer Deep Deck Cere Reep Creep Creeps"
split_word = word.split()
from itertools import combinations
letters = {s for it in split_word for s in it.lower()}
out = dict()
for n in range(len(letters)):
out[n] = {''.join(letters_subset): [word
for word in split_word
if set(word.lower()).issubset(letters_subset)]
for letters_subset in combinations(letters, n)}
out[n] = {k: v for k, v in out[n].items() if len(v) > 0}
# Print output
for n, d in out.items():
for k, v in d.items():
print('{}:\t{}\t{}'.format(n, k, v))

Python- Code Optimization Help- Find all dictionary-valid anagrams of a word

I solved the problem using this (horribly inefficient) method:
import itertools

def createList(word, wordList):
    # Made a set, because permutations returns duplicates when letters repeat.
    # Returns all permutations if they're in the wordList.
    return set([''.join(item) for item in itertools.permutations(word) if ''.join(item) in wordList])

def main():
    a = open('C:\\Python32\\megalist.txt', 'r+')
    wordList = [line.strip() for line in a]
    maximum = 0
    length = 0
    maxwords = ""
    for words in wordList:
        permList = createList(words, wordList)
        length = len(permList)
        if length > maximum:
            maximum = length
            maxwords = permList
    print(maximum, maxwords)
It took something like 10 minutes to find the five-letter word that has the most dictionary-valid anagrams. I want to run this on words without a letter constraint, but it would take a ludicrous amount of time. Is there any way to optimize this?
The following seems to work OK on a smallish dictionary. By sorting the letters in the word, it becomes easy to test whether two words are anagrams. From that starting point, it's just a matter of accumulating the words in some way. It wouldn't be hard to modify this to report all matches, rather than just the first one.
If you do need to add constraints on the number of letters, the use of iterators is a convenient way to filter out some words.
import collections

def wordIterator(dictionaryFilename):
    with open(dictionaryFilename, 'r') as f:
        for line in f:
            word = line.strip()
            yield word

def largestAnagram(words):
    d = collections.defaultdict(list)
    for word in words:
        sortedWord = str(sorted(word))
        d[hash(sortedWord)].append(word)
    maxKey = max(d.keys(), key=lambda k: len(d[k]))
    return d[maxKey]

iter = wordIterator('C:\\Python32\\megalist.txt')
# iter = (word for word in iter if len(word) == 5)
print(largestAnagram(iter))
Edit:
In response to the comment, the hash(sortedWord) is a space-saving optimization, possibly premature in this case, to reduce sortedWord back to an integer, because we don't really care about the key, so long as we can always uniquely recover all the relevant anagrams. It would have been equally valid to just use sortedWord as the key.
The key keyword argument to max lets you find the maximum element in a collection based on a key function. So the statement maxKey = max(d.keys(), key=lambda k: len(d[k])) is a succinct Python expression for answering the query: given the keys in the dictionary, which key has the associated value with maximum length? That call to max in that context could have been written (much more verbosely) as valueWithMaximumLength(d), where that function was defined as:
def valueWithMaximumLength(dictionary):
    maxKey = None
    for k, v in dictionary.items():
        if not maxKey or len(dictionary[maxKey]) < len(v):
            maxKey = k
    return maxKey
wordList should be a set.
Testing membership in a list requires you to scan through all the elements of the list checking whether they are the word you have generated. Testing membership in a set does not (on average) depend on the size of the set.
The next obvious optimisation is that once you have tested a word you can remove all its permutations from wordList, since they will give exactly the same set in createList. This is a very easy operation if everything is done with sets -- indeed, you simply use a binary minus.
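Put together, the suggestion might look like this (a sketch reusing the asker's createList; the variable names are my own):

with open('C:\\Python32\\megalist.txt', 'r+') as a:
    wordSet = set(line.strip() for line in a)  # set membership tests are O(1) on average

remaining = set(wordSet)
best = set()
while remaining:
    word = remaining.pop()
    permList = createList(word, wordSet)  # each 'in' check is now constant time
    remaining -= permList                 # the binary minus: drop all permutations at once
    if len(permList) > len(best):
        best = permList
print(len(best), best)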
There is no need to do ALL permutations - it's a waste of memory and CPU.
So, first of all, your dictionary should be kept in a binary tree, like this:

e.g. Dict = ['alex', 'noise', 'mother', 'xeal', 'apple', 'google', 'question']

Step 1: find the "middle" word for the dictionary, e.g. "mother", because "m"
is somewhere in the middle of the English alphabet
(this step is not necessary, but it helps to balance the tree a bit)

Step 2: Build the binary tree:

          mother
         /      \
        /        \
    alex          noise
        \        /     \
         \      /       \
        apple  question  xeal
             \
              \
            google
Step 3: start looking for an anagram by permutations:

alex: 1. "a"... (going deeper into the binary tree, we find the word 'apple')
         1.1 # of course, we should omit 'apple', because len('apple') != len('alex')
             # but that's enough for an example:
      2. Check if we can build the word "pple" from "lex" ("a" is already reserved!)
         -- there is no "p" in "lex", so skip it, GOTO 1
      ...
      1. "l"... - nothing begins with "l"...
      1. "e"... - nothing begins with "e"...
      1. "x" - going deeper, "xeal" found
      2. Check if "eal" can be built from "ale" ("x" is already reserved):

         for letter in "eal":
             if letter not in "ale":
                 return False
         return True
That's it :) This algorithm should work much faster.
EDIT:
Check the bintrees package to avoid spending time on a binary tree implementation.
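The same prefix-pruning idea can also be sketched without an explicit tree, using a sorted word list and the standard bisect module (my own illustration of the pruning, not the answerer's implementation):

import bisect
from collections import Counter

def valid_anagrams(word, sorted_words):
    # Depth-first search over prefixes, pruning any prefix that no
    # dictionary word starts with. sorted_words must be sorted.
    results = set()

    def extend(prefix, remaining):
        i = bisect.bisect_left(sorted_words, prefix)
        if i == len(sorted_words) or not sorted_words[i].startswith(prefix):
            return  # prune: nothing in the dictionary starts with this prefix
        if not remaining:
            if sorted_words[i] == prefix:
                results.add(prefix)
            return
        for ch in sorted(remaining):  # try each distinct remaining letter
            extend(prefix + ch, remaining - Counter(ch))

    extend("", Counter(word))
    return results

words = sorted(['alex', 'noise', 'mother', 'xeal', 'apple', 'google', 'question'])
print(valid_anagrams('axel', words))  # {'xeal'}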
