Python - Recursive word list

I'm trying to make an anagram algorithm, but I'm stuck once I get to the recursive part. Let me know if any more information is needed.
My code:
def ana_words(words, letter_count):
    """Return all the anagrams using the given letters and allowed words.
    - letter_count has 26 keys (one per lowercase letter),
    and each value is a non-negative integer.
    #type words: list[str]
    #type letter_count: dict[str, int]
    #rtype: list[str]
    """
    anagrams_list = []
    if not letter_count:
        return [""]
    for word in words:
        if not _within_letter_count(word, letter_count):
            continue
        new_letter_count = dict(letter_count)
        for char in word:
            new_letter_count[char] -= 1
        # recursive function
        var1 = ana_words(words[1:], new_letter_count)
        sorted_word = ''.join(word)
        for i in var1:
            sorted_word = ''.join([word, i])
        anagrams_list.append(sorted_word)
    return anagrams_list
Words is a list of words from a file, and letter count is a dictionary of characters (already in lower case). the list of words in words is also in lowercase already.
Input: print ana_words('dormitory')
Output I'm getting:
['dirtyroom', 'dotoi', 'doori', 'dormitory', 'drytoori', 'itorod', 'ortoidry', 'rodtoi', 'roomidry', 'rootidry', 'torodi']
Output I want:
['dirty room', 'dormitory', 'room dirty']
Link to word list: https://1drv.ms/t/s!AlfWKzBlwHQKbPj9P_pyKdmPwpg

Without knowing your words list it is hard to tell why it is including the 'wrong' entries. Trying with just
words = ['room','dirty','dormitory']
Returns the correct entries.
If you want spaces between the words, you need to change
sorted_word = ''.join([word, i])
to
sorted_word = ' '.join([word, i])
(Note the added space)
Incidentally, if you want to solve this problem more efficiently, storing the words in a trie data structure can help (https://en.wikipedia.org/wiki/Trie)

Question errors:
You are saying:
Words is a list of words from a file, and letter count is a dictionary of characters (already in lower case). the list of words in words is also in lowercase already.
But you are actually calling the function in a different way:
print ana_words('dormitory')
This is not right.
Checking if a dictionary's values are all 0:
if not letter_count: doesn't do what you expect. It is only true when the dictionary is empty, and yours always has 26 keys. To check whether all the values are 0, use if not any(letter_count.values()): this obtains the values, checks whether any of them is different from 0, and then negates the answer.
Joining words:
The str.join(arg1) method is not for joining two words; it joins the elements of the iterable passed as arg1, using the string as the separator. In your case the iterable is a string of characters and the separator is empty, so the result is the same word.
''.join('Hello')
>>> 'Hello'
The second time you use it, the iterable is a list, and it joins word with an element of var1, which is a list of words, so that's fine apart from the missing space. The problem is that you are not doing anything with sorted_word inside the loop; only the value from the last iteration is used. The anagrams_list.append(sorted_word) should be inside the loop, and sorted_word = ''.join(word) should be deleted.
Other errors:
Aside from all these errors, you never check whether the letter counts have reached 0 to stop the recursion.
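Putting the points from both answers together, a corrected version might look like the sketch below. The helper within_letter_count is a stand-in for the question's _within_letter_count, which was not shown, and recursing on the full word list instead of words[1:] is an assumption about the intended behaviour. It also assumes the word list contains no empty strings.

```python
from collections import Counter


def within_letter_count(word, letter_count):
    """Return True if `word` can be spelled from the remaining letters.

    Hypothetical stand-in for the question's _within_letter_count helper.
    """
    needed = Counter(word)
    return all(letter_count.get(ch, 0) >= n for ch, n in needed.items())


def ana_words(words, letter_count):
    """Return every phrase built from `words` that uses all the letters."""
    # Base case: every count is zero, so the phrase is complete.
    if not any(letter_count.values()):
        return [""]
    anagrams = []
    for word in words:
        if not within_letter_count(word, letter_count):
            continue
        remaining = dict(letter_count)
        for ch in word:
            remaining[ch] -= 1
        # Append inside the loop, joining with a space; strip() removes
        # the trailing space left by the empty base-case string.
        for rest in ana_words(words, remaining):
            anagrams.append((word + " " + rest).strip())
    return anagrams


letters = dict(Counter("dormitory"))
print(sorted(ana_words(["room", "dirty", "dormitory"], letters)))
# ['dirty room', 'dormitory', 'room dirty']
```

Passing the whole list on each recursive call lets a later word be followed by an earlier one, which is what produces both 'dirty room' and 'room dirty'.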

Related

How to count occurences of word in string that stil works with periods and endings

So I was recently working on this function:
# counts owls
def owl_count(text):
    # sets all text to lowercase
    text = text.lower()
    # sets text to list
    text = text.split()
    # saves indices of owl in list
    indices = [i for i, x in enumerate(text) if x == "owl"]
    # counts occurrences of owl in text
    owl_count = len(indices)
    # returns owl count and indices
    return owl_count, indices
My goal was to count how many times "owl" occurs in the string and save the indices of it. The issue I kept running into was that it would not count "owls" or "owl." I tried splitting it into a list of individual characters but I couldn't find a way to search for three consecutive elements in the list. Do you guys have any ideas on what I could do here?
PS. I'm definitely a beginner programmer so this is probably a simple solution.
thanks!
If you don't want to use a large library like NLTK, you can match words that start with 'owl' instead of requiring them to equal 'owl':
indices = [i for i, x in enumerate(text) if x.startswith("owl")]
In this case words like 'owlowlowl' will match too; to tokenize real-world text properly you would use something like NLTK.
Python has built-in support for this. This kind of string matching falls under regular expressions, which you can study in more detail later.
import re
a_string = "your string"
substring = "substring that you want to check"
# the substring is treated as a pattern; use re.escape(substring) if it may contain regex metacharacters
matches = re.finditer(substring, a_string)
matches_positions = [match.start() for match in matches]
print(matches_positions)
finditer() returns an iterator of match objects, and start() returns the starting index of each match.
Simply put, this gives the indices of all occurrences of the substring in the string.
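As a rough sketch of how the two suggestions could be combined, the function below keeps the original return shape (count plus word indices) but matches any token that contains the word, so "owls" and "owl." are counted. The word parameter is an addition for illustration. Note that this also matches tokens such as "bowl".

```python
import re


def owl_count(text, word="owl"):
    """Return (count, indices) of whitespace-separated tokens containing `word`."""
    # re.escape keeps the word literal; IGNORECASE replaces the lower() call.
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    tokens = text.split()
    indices = [i for i, token in enumerate(tokens) if pattern.search(token)]
    return len(indices), indices


print(owl_count("The owl saw two owls. Owl!"))
# (3, [1, 4, 5])
```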

Grouping together words in a string which have common substrings of some length

I want to display the words in a string that share a common substring.
For example, if the given string is
str = "the games are lame"
and the words have to be grouped together on the basis of common substring of length 3, so output should be
the
games, lame
are
since common substring of length 3 is "ame".
I proceeded by converting the string to list say "lista" using split() and made another list say "listb" with all the possible substrings of length 3, like
the, gam, gme, ges, ame, aes, mes, are, lam, lme, ame
then I checked the "listb" for duplicate items('ame') and on basis of them compared with items in "lista" like so
for items in duplicate:
    for item in lista:
        if items in item and item not in listc:
            listc.append(item)
Now, I have a "listc" with items that have common substring of length 3 but I can't figure out how to group them as needed in output. Also if "str" contains more words with common substring "listc" will also have those common words.
I don't know if I should have proceeded in this way and can't seem to figure out how to group items from "listc" as needed in output.
Here is a solution
str_ = "the games are lame"
# first I get a list of all the words
words = str_.split()
# words >>> ['the', 'games', 'are', 'lame']
groups = []
# This variable will contain the list of groups of words
# For each word
for word in words:
    found = False
    # Get the first word of each group
    other_words = [x[0] for x in groups]
    # Loop through the word and get all substrings of 3 characters
    for i in range(len(word)):
        substring = word[i:i+3]
        # Eliminates the substrings that don't have the correct length
        if len(substring) != 3:
            continue
        try:
            # try to find the substring in a group and get the corresponding index of that group
            index = [substring in other_word for other_word in other_words].index(True)
            found = True
            # Add the word to the group
            groups[index].append(word)
            # stop here so the word is not added again for another matching substring
            break
        except ValueError:
            continue
    # If we don't find a group for the word, we create a new group with that word in it
    if not found:
        groups.append([word])
# groups >>> [['the'], ['games', 'lame'], ['are']]
# Now print the groups
for group in groups:
    print(", ".join(group))
output :
the
games, lame
are
I think you're creating a lot of lists there and this can be quite confusing.
If you want to use a purely logical approach without using libraries designed for sequence matching, such as difflib, you can first define a function that compares two strings; then you separate your sentence into a words list and perform a double iteration (nested) through that list comparing all possible pairs.
If the strings match they will be printed on the same line separated by commas otherwise on a new line.
In the following function I've also added a parameter for the length of the substring that you want to match, set to 3 by default to stay in line with your question:
# This function compares two strings and returns them in a tuple if they contain the
# same substring of len_substring characters.
def string_matcher(string_a, string_b, len_substring = 3):
    # the +1 makes sure the last substring of string_a is checked as well
    for i in range(len(string_a)-len_substring+1):
        if string_a[i:i+len_substring] in string_b:
            return string_a, string_b
    return None
string = "the games are lame"
words = string.split()
output = ""
# Making a double iteration over the words list and calling string_matcher for each pair.
for i in range(len(words)-1):
    output = output+words[i]
    for j in range(i+1, len(words)):
        try:
            word_a, word_b = string_matcher(words[i], words[j])
            output = output+", "+word_b
        except TypeError:
            pass
    output = output+"\n"
print(output)
The program prints out:
the
games, lame
are
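Another way to look at the grouping, sketched here under the assumption that a word joins the first group it shares any length-n substring with: keep a set of substrings per group and test for a non-empty intersection. The names substrings and group_words are made up for this illustration.

```python
def substrings(word, n=3):
    """Return the set of all length-n substrings of `word`."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def group_words(sentence, n=3):
    """Group words that share at least one substring of length n."""
    groups = []  # list of (substring set, member words) pairs
    for word in sentence.split():
        subs = substrings(word, n)
        for group_subs, members in groups:
            if subs & group_subs:
                members.append(word)
                group_subs.update(subs)  # the group now covers this word too
                break
        else:
            # no existing group shares a substring, so start a new one
            groups.append((subs, [word]))
    return [members for _, members in groups]


for group in group_words("the games are lame"):
    print(", ".join(group))
```

Words shorter than n have no substrings of that length, so each of them ends up in a group of its own.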

Return first word in sentence? [duplicate]

Here's the question I have to answer for school:
For the purposes of this question, we will define a word as ending a sentence if that word is immediately followed by a period. For example, in the text “This is a sentence. The last sentence had four words.”, the ending words are ‘sentence’ and ‘words’. In a similar fashion, we will define the starting word of a sentence as any word that is preceded by the end of a sentence. The starting words from the previous example text would be “The”. You do not need to consider the first word of the text as a starting word. Write a program that has:
An endwords function that takes a single string argument. This function must return a list of all sentence ending words that appear in the given string. There should be no duplicate entries in the returned list and the periods should not be included in the ending words.
The code I have so far is:
def startwords(astring):
    mylist = astring.split()
    if mylist.endswith('.') == True:
        return my list
but I don't know if I'm using the right approach. I need some help
There are several issues with your code. The following would be a simple approach. Create a list of bigrams and pick the second token of each bigram where the first token ends with a period:
def startwords(astring):
    mylist = astring.split() # a list! Has no 'endswith' method
    bigrams = zip(mylist, mylist[1:])
    return [b[1] for b in bigrams if b[0].endswith('.')]
zip and list comprehensions are two things worth reading up on.
mylist = astring.split()
if mylist.endswith('.')
That cannot work; one of the reasons is that mylist is a list, which has no endswith method.
Another answer fixed your approach so let me propose a regular expression solution:
import re
print(re.findall(r"\.\s*(\w+)","This is a sentence. The last sentence had four words."))
This matches every word that follows a dot and optional whitespace.
Result: ['The']
def endwords(astring):
    mylist = astring.split('.')
    temp_words = [x.rpartition(" ")[-1] for x in mylist if len(x) > 1]
    return list(set(temp_words))
This creates a set so there are no duplicates. It loops over the list of sentences (split by "."), splits each sentence into words, and takes the last word with [-1].
print(set([x.split()[-1] for x in s.split(".") if len(x.split()) > 0]))
The if is needed because splitting on "." leaves an empty string after the final period, and indexing an empty list raises an IndexError.
This works as well:
print(set([x.split()[len(x.split())-1] for x in s.split(".") if len(x.split()) > 0]))
This is one way to do it ->
#!/usr/bin/env python
sentence = 'This is a sentence. The last sentence had four words.'
uniq_end_words = set()
for word in sentence.split():
    if '.' in word:
        # check if period (.) is at the end
        if '.' == word[len(word) -1]:
            uniq_end_words.add(word.rstrip('.'))
print(list(uniq_end_words))
Output (list of all the end words in a given sentence; the order of a set may vary) ->
['words', 'sentence']
If your input string has a period inside one of its words (say the last word), something like this ->
'I like the documentation of numpy.random.rand.'
The output would be - ['numpy.random.rand']
And for input string 'I like the documentation of numpy.random.rand a lot.'
The output would be - ['lot']
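For reference, here is one possible sketch of both functions from the assignment, assuming words are separated by whitespace and a sentence ends at a token whose last character is a period. The helper _unique is made up for this example; it removes duplicates while keeping first-seen order.

```python
def _unique(items):
    """Drop duplicates while keeping first-seen order (hypothetical helper)."""
    return list(dict.fromkeys(items))


def endwords(text):
    """Return the words that are immediately followed by a period."""
    tokens = text.split()
    words = (t.rstrip(".") for t in tokens if t.endswith("."))
    return _unique(w for w in words if w)


def startwords(text):
    """Return the words that directly follow a sentence-ending word."""
    tokens = text.split()
    words = (cur.rstrip(".") for prev, cur in zip(tokens, tokens[1:])
             if prev.endswith("."))
    return _unique(w for w in words if w)


text = "This is a sentence. The last sentence had four words."
print(endwords(text))    # ['sentence', 'words']
print(startwords(text))  # ['The']
```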

How do I get a program to print the number of words in a sentence and each word in order

I need to print how many characters there are in a sentence the user specifies, print how many words there are in a sentence the user specifies and print each word, the number of letters in the word, and the first and last letter in the word. Can this be done?
I want you to take your time to understand what is going on in the code below, and I suggest you read these resources.
http://docs.python.org/3/library/re.html
http://docs.python.org/3/library/functions.html#len
http://docs.python.org/3/library/functions.html
http://docs.python.org/3/library/stdtypes.html#str.split
import re
def count_letter(word):
    """(str) -> int
    Return the number of letters in a word.
    >>> count_letter('cat')
    3
    >>> count_letter('cat1')
    3
    """
    return len(re.findall('[a-zA-Z]', word))
if __name__ == '__main__':
    sentence = input('Please enter your sentence: ')
    # replace every non-word character with a space, then split into words
    words = re.sub(r"[^\w]", " ", sentence).split()
    # The number of characters in the sentence.
    print(len(sentence))
    # The number of words in the sentence.
    print(len(words))
    # Print all the words in the sentence, the number of letters, the first
    # and last letter.
    for i in words:
        print(i, count_letter(i), i[0], i[-1])
Please enter your sentence: hello user
10
2
hello 5 h o
user 4 u r
Please read Python's string documentation, it is self explanatory. Here is a short explanation of the different parts with some comments.
We know that a sentence is composed of words, each of which is composed of letters. What we have to do first is to split the sentence into words. Each entry in this list is a word, and each word is stored in a form of a succession of characters and we can get each of them.
sentence = "This is my sentence"
# split the sentence
words = sentence.split()
# use len() to obtain the number of elements (words) in the list words
print('There are {} words in the given sentence'.format(len(words)))
# go through each word
for word in words:
    # len() counts the number of elements again,
    # but this time it's the chars in the string
    print('There are {} characters in the word "{}"'.format(len(word), word))
    # python is a 0-based language, in the sense that the first element is indexed at 0
    # you can go backward in an array too using negative indices.
    #
    # However, notice that the last element is at -1 and second to last is -2,
    # it can be a little bit confusing at the beginning when we know that the second
    # element from the start is indexed at 1 and not 2.
    print('The first being "{}" and the last "{}"'.format(word[0], word[-1]))
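The same steps can also be wrapped in a function that returns the values instead of printing them, which makes them easier to check. This is only a sketch; the name describe_sentence and the shape of the returned dictionary are choices made for this example.

```python
def describe_sentence(sentence):
    """Summarise a sentence: character count, word count, per-word details."""
    words = sentence.split()
    return {
        "characters": len(sentence),
        "word_count": len(words),
        # one tuple per word: (word, length, first letter, last letter)
        "words": [(w, len(w), w[0], w[-1]) for w in words],
    }


print(describe_sentence("hello user"))
```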
We don't do your homework for you on stack overflow... but I will get you started.
The most important method you will need is one of these two (depending on the version of python):
Python3.X - input([prompt]),.. If the prompt argument is present, it is written
to standard output without a trailing newline. The function then
reads a line from input, converts it to a string (stripping a
trailing newline), and returns that. When EOF is read, EOFError is
raised. http://docs.python.org/3/library/functions.html#input
Python2.X raw_input([prompt]),... If the prompt argument is
present, it is written to standard output without a trailing newline.
The function then reads a line from input, converts it to a string
(stripping a trailing newline), and returns that. When EOF is read,
EOFError is raised. http://docs.python.org/2.7/library/functions.html#raw_input
You can use them like
>>> my_sentance = raw_input("Do you want us to do your homework?\n")
Do you want us to do your homework?
yes
>>> my_sentance
'yes'
As you can see, the text that was typed is stored in the my_sentance variable.
To get the number of characters in a string, you need to understand that a string is a sequence of characters, much like a list. So if you want to know the number of characters you can use:
len(s),... Return the length (the number of items) of an object.
The argument may be a sequence (string, tuple or list) or a mapping
(dictionary). http://docs.python.org/3/library/functions.html#len
I'll let you figure out how to use it.
Finally you're going to need to use a built in function for a string:
str.split([sep[, maxsplit]]),...Return a list of the words in the
string, using sep as the delimiter string. If maxsplit is given, at
most maxsplit splits are done (thus, the list will have at most
maxsplit+1 elements). If maxsplit is not specified or -1, then there
is no limit on the number of splits (all possible splits are made).
http://docs.python.org/2/library/stdtypes.html#str.split

Python- Remove all words that contain other words in a list

I have a list populated with words from a dictionary. I want to find a way to remove all words, only considering root words that form at the beginning of the target word.
For example, the word "rodeo" would be removed from the list because it contains the English-valid word "rode." "Typewriter" would be removed because it contains the English-valid word "type." However, the word "snicker" is still valid even if it contains the word "nick" because "nick" is in the middle and not at the beginning of the word.
I was thinking something like this:
for line in wordlist:
if line.find(...) --
but I want that "if" statement to then run through every single word in the list checking to see if its found and, if so, remove itself from the list so that only root words remain. Do I have to create a copy of wordlist to traverse?
So you have two lists: the list of words you want to check and possibly remove, and a list of valid words. If you like, you can use the same list for both purposes, but I'll assume you have two lists.
For speed, you should turn your list of valid words into a set. Then you can very quickly check to see if any particular word is in that set. Then, take each word, and check whether any of its prefixes exists in the set of valid words. Since "a" and "I" are valid words in English, will you remove all valid words starting with 'a', or will you have a rule that sets a minimum length for the prefix?
I am using the file /usr/share/dict/words from my Ubuntu install. This file has all sorts of odd things in it; for example, it seems to contain every letter by itself as a word. Thus "k" is in there, "q", "z", etc. None of these are words as far as I know, but they are probably in there for some technical reason. Anyway, I decided to simply exclude anything shorter than three letters from my valid words list.
Here is what I came up with:
# build valid list from /usr/share/dict/words
wfile = "/usr/share/dict/words"
# strip before measuring so the trailing newline is not counted as a letter
valid = set(line.strip() for line in open(wfile) if len(line.strip()) >= 3)
lst = ["ark", "booze", "kite", "live", "rodeo"]
def subwords(word):
    # yield every proper prefix of word, longest first
    for i in range(len(word) - 1, 0, -1):
        w = word[:i]
        yield w
newlst = []
for word in lst:
    # uncomment these for debugging to make sure it works
    # print("subwords", [w for w in subwords(word)])
    # print("valid subwords", [w for w in subwords(word) if w in valid])
    if not any(w in valid for w in subwords(word)):
        newlst.append(word)
print(newlst)
If you are a fan of one-liners, you could do away with the for loop and use a list comprehension:
newlst = [word for word in lst if not any(w in valid for w in subwords(word))]
I think that's more terse than it should be, and I like being able to put in the print statements to debug.
Hmm, come to think of it, it's not too terse if you just add another function:
def keep(word):
    return not any(w in valid for w in subwords(word))
newlst = [word for word in lst if keep(word)]
Python can be easy to read and understand if you make functions like this, and give them good names.
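To try the idea without a system dictionary file, here is a small self-contained sketch. The valid_words list is a made-up stand-in for the real dictionary, and the names proper_prefixes, keep_root_words and min_len are introduced only for this example.

```python
def proper_prefixes(word):
    """Yield every prefix of `word` that is shorter than the word itself."""
    for i in range(1, len(word)):
        yield word[:i]


def keep_root_words(candidates, valid_words, min_len=3):
    """Keep candidates that do not start with a valid word of min_len or more letters."""
    valid = {w for w in valid_words if len(w) >= min_len}
    return [w for w in candidates
            if not any(p in valid for p in proper_prefixes(w))]


# Hypothetical miniature dictionary, just for the demonstration.
valid_words = ["rode", "type", "nick", "boo", "kit"]
candidates = ["rodeo", "typewriter", "snicker", "live"]
print(keep_root_words(candidates, valid_words))
# ['snicker', 'live']
```

"snicker" survives because "nick" sits in the middle of the word and is never one of its prefixes.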
I'm assuming that you only have one list from which you want to remove any elements that have prefixes in that same list.
#Important assumption here... wordlist is sorted
base=wordlist[0] #consider the first word in the list
for word in wordlist: #loop through the entire list checking if
    if not word.startswith(base): # the word we're considering starts with the base
        print(base) #If not... we have a new base, print the current
        base=word # one and move to this new one
    #else word starts with base
    #don't output word, and go on to the next item in the list
print(base) #finish by printing the last base
EDIT: Added some comments to make the logic more obvious
I find jkerian's answer to be the best (assuming only one list) and I would like to explain why.
Here is my version of the code (as a function):
wordlist = ["a","arc","arcane","apple","car","carpenter","cat","zebra"]
def root_words(wordlist):
    result = []
    base = wordlist[0]
    for word in wordlist:
        if not word.startswith(base):
            result.append(base)
            base=word
    result.append(base)
    return result
print(root_words(wordlist))
As long as the word list is sorted (you could do this in the function if you wanted to), this will get the result in a single pass. This is because when you sort the list, all words that start with another word in the list come directly after that root word. For example, anything that falls between "arc" and "arcane" in your particular list will also be eliminated because of the root word "arc".
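A variant that does the sorting inside the function, as suggested above, might look like this. It is a sketch rather than a drop-in replacement: it returns a new list and also handles an empty input.

```python
def root_words(wordlist):
    """Return the words that do not start with another word in the list."""
    if not wordlist:
        return []
    ordered = sorted(wordlist)
    result = [ordered[0]]
    for word in ordered[1:]:
        # After sorting, any word that extends a kept root follows that
        # root before the next root appears, so comparing against the
        # most recently kept root is enough.
        if not word.startswith(result[-1]):
            result.append(word)
    return result


print(root_words(["arcane", "car", "arc", "carpenter", "cat", "zebra"]))
# ['arc', 'car', 'cat', 'zebra']
```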
You can use the built-in filter function with a lambda for this. I think it'll make your life a lot easier.
words = ['rode', 'nick'] # this is the list of all the words that you have.
# I'm using 'rode' and 'nick' as they're in your example
listOfWordsToTry = ['rodeo', 'snicker']
def validate(w):
    for word in words:
        if w.startswith(word):
            return False
    return True
# in Python 3 filter() returns an iterator, so wrap it in list() to get a list
wordsThatDontStartWithValidEnglishWords = \
    list(filter(lambda x : validate(x), listOfWordsToTry))
This should work for your purposes, unless I misunderstand your question.
Hope this helps
I wrote an answer that assumes two lists, the list to be pruned and the list of valid words. In the discussion around my answer, I commented that maybe a trie solution would be good.
What the heck, I went ahead and wrote it.
You can read about a trie here:
http://en.wikipedia.org/wiki/Trie
For my Python solution, I basically used dictionaries. A key is a sequence of symbols, and each symbol goes into a dict, with another Trie instance as the data. A second dictionary stores "terminal" symbols, which mark the end of a "word" in the Trie. For this example, the "words" are actually words, but in principle the words could be any sequence of hashable Python objects.
The Wikipedia example shows a trie where the keys are letters, but can be more than a single letter; they can be a sequence of multiple letters. For simplicity, my code uses only a single symbol at a time as a key.
If you add both the word "cat" and the word "catch" to the trie, then there will be nodes for 'c', 'a', and 't' (and also the second 'c' in "catch"). At the node level for 'a', the dictionary of "terminals" will have 't' in it (thus completing the coding for "cat"), and likewise at the deeper node level of the second 'c' the dictionary of terminals will have 'h' in it (completing "catch"). So, adding "catch" after "cat" just means two additional nodes (for 't' and the second 'c') and one more entry in the terminals dictionary. The trie structure makes a very efficient way to store and index a really large list of words.
def _pad(n):
    return " " * n
class Trie(object):
    def __init__(self):
        self.t = {} # dict mapping symbols to sub-tries
        self.w = {} # dict listing terminal symbols at this level
    def add(self, word):
        if 0 == len(word):
            return
        cur = self
        for ch in word[:-1]: # add all symbols but terminal
            if ch not in cur.t:
                cur.t[ch] = Trie()
            cur = cur.t[ch]
        ch = word[-1]
        cur.w[ch] = True # add terminal
    def prefix_match(self, word):
        if 0 == len(word):
            return False
        cur = self
        for ch in word[:-1]: # check all symbols but last one
            # If you check the last one, you are not checking a prefix,
            # you are checking whether the whole word is in the trie.
            if ch in cur.w:
                return True
            if ch not in cur.t:
                return False
            cur = cur.t[ch] # walk down the trie to next level
        return False
    def debug_str(self, nest, s=None):
        "print trie in a convenient nested format"
        lst = []
        s_term = "".join(ch for ch in self.w)
        if 0 == nest:
            lst.append(object.__str__(self))
            lst.append("--top--: " + s_term)
        else:
            tup = (_pad(nest), s, s_term)
            lst.append("%s%s: %s" % tup)
        for ch, d in self.t.items():
            lst.append(d.debug_str(nest+1, ch))
        return "\n".join(lst)
    def __str__(self):
        return self.debug_str(0)
t = Trie()
# Build valid list from /usr/share/dict/words, which has every letter of
# the alphabet as words! Only take 2-letter words and longer.
wfile = "/usr/share/dict/words"
for line in open(wfile):
    word = line.strip()
    if len(word) >= 2:
        t.add(word)
# add valid 1-letter English words
t.add("a")
t.add("I")
lst = ["ark", "booze", "kite", "live", "rodeo"]
# "ark" starts with "a"
# "booze" starts with "boo"
# "kite" starts with "kit"
# "live" is good: "l", "li", "liv" are not words
# "rodeo" starts with "rode"
newlst = [w for w in lst if not t.prefix_match(w)]
print(newlst) # prints: ['live']
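For comparison, the same prefix check can be sketched with plain nested dictionaries instead of a class. This is a different layout from the Trie class above: each node is a dict keyed by character, and a marker key (END, a name chosen for this example) flags the end of a word. The tiny word list is a made-up stand-in for the dictionary file.

```python
END = "$end"  # hypothetical marker key; longer than one character, so it cannot clash with a letter


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def has_proper_prefix(trie, word):
    """Return True if some stored word is a prefix of `word` and shorter than it."""
    node = trie
    for ch in word[:-1]:  # skip the last character so the whole word is never matched
        if ch not in node:
            return False
        node = node[ch]
        if END in node:
            return True
    return False


trie = build_trie(["a", "boo", "kit", "rode"])
lst = ["ark", "booze", "kite", "live", "rodeo"]
print([w for w in lst if not has_proper_prefix(trie, w)])
# ['live']
```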
I don't want to provide an exact solution, but I think there are two key functions in Python that will help you greatly here.
The first, jkerian mentioned: string.startswith() http://docs.python.org/library/stdtypes.html#str.startswith
The second: filter() http://docs.python.org/library/functions.html#filter
With filter, you could write a conditional function that will check to see if a word is the base of another word and return true if so.
For each word in the list, you would need to iterate over all of the other words and evaluate the conditional using filter, which could return the proper subset of root words.
I only had one list - and I wanted to remove any word from it that was a prefix of another.
Here is a solution that should run in O(N log N) time and O(M) space, where N is the length of the input list and M is the size of the returned list. The runtime is dominated by the sorting.
l = sorted(your_list)
removed_prefixes = [l[g] for g in range(0, len(l)-1) if not l[g+1].startswith(l[g])] + l[-1:]
If the list is sorted then the item at index N is a prefix if it begins the item at index N+1.
At the end it appends the last item of the original sorted list, since by definition it is not a prefix.
Handling it last also allows us to iterate over an arbitrary number of indexes w/o going out of range.
If you have the banned list hardcoded in another list:
banned = tuple(banned_prefixes)
removed_prefixes = [ i for i in your_list if not i.startswith(banned)]
This relies on the fact that startswith accepts a tuple. It probably runs in something close to N * M where N is elements in list and M is elements in banned. Python could conceivably be doing some smart things to make it a bit quicker. If you are like OP and want to disregard case, you will need .lower() calls in places.
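Both variants can be tried on a small list. The sketch below wraps the sorted-neighbour check in a function (the name remove_prefix_words is made up here) and shows the tuple form of startswith with a hypothetical banned list.

```python
def remove_prefix_words(words):
    """Drop every word that is a prefix of another word in the list."""
    ordered = sorted(words)
    # A word is a prefix of some other word exactly when it is a prefix
    # of its immediate successor in sorted order.
    kept = [w for w, nxt in zip(ordered, ordered[1:]) if not nxt.startswith(w)]
    return kept + ordered[-1:]  # the last word cannot be a prefix of a later one


print(remove_prefix_words(["car", "carpenter", "cat", "arc", "arcane"]))
# ['arcane', 'carpenter', 'cat']

# str.startswith accepts a tuple of prefixes.
banned = ("rode", "type")  # hypothetical banned prefixes
print([w for w in ["rodeo", "snicker", "typewriter"] if not w.startswith(banned)])
# ['snicker']
```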
