I have 10 arbitrary letters and need to find the longest word in a words file that can be built from them.
I started learning regular expressions (RE) only recently and can't seem to find a suitable pattern.
My first idea was to use a character set, [10 chars], but it also matches repeated occurrences of the included characters, and I don't know how to avoid that.
I started learning Python recently too, before RE; maybe RE is not needed and this can be solved without it.
Using a "for this in that:" loop seems inappropriate, but maybe itertools (which I'm not familiar with) can do it easily.
I guess the solution is known even to novice programmers/scripters, but not to me.
Thanks
I'm guessing this is something like finding possible words given a set of Scrabble tiles, so that a character can be repeated only as many times as it is repeated in the original list.
The trick is to efficiently test each character of each word in your word file against a set containing your source letters. For each character, if found in the test set, remove it from the test set and proceed; otherwise, the word is not a match, and go on to the next word.
Python has a nice function all for testing a set of conditions based on elements in a sequence. all has the added feature that it will "short-circuit", that is, as soon as one item fails the condition, then no more tests are done. So if your first letter of your candidate word is 'z', and there is no 'z' in your source letters, then there is no point in testing any more letters in the candidate word.
My first shot at writing this was simply:
matches = []
for word in wordlist:
    testset = set(letters)
    if all(c in testset for c in word):
        matches.append(word)
Unfortunately, the bug here is that if the source letters contained a single 'm', a word with several 'm's would erroneously match, since each 'm' would separately match the given 'm' in the source testset. So I needed to remove each letter as it was matched.
I took advantage of the fact that set.remove(item) returns None, which Python treats as a Boolean False, and expanded my generator expression used in calling all. For each c in word, if it is found in testset, I want to additionally remove it from testset, something like (pseudo-code, not valid Python):
all(c in testset and "remove c from testset" for c in word)
Since set.remove returns a None, I can replace the quoted bit above with "not testset.remove(c)", and now I have a valid Python expression:
all(c in testset and not testset.remove(c) for c in word)
Now we just need to wrap that in a loop that checks each word in the list (be sure to build a fresh testset before checking each word, since our all test has now become a destructive test):
for word in wordlist:
    testset = set(letters)
    if all(c in testset and not testset.remove(c) for c in word):
        matches.append(word)
The final step is to sort the matches by descending length. We can pass a key function to sort. The builtin len would be good, but that would sort by ascending length. To change it to a descending sort, we use a lambda to give us not len, but -1 * len:
matches.sort(key=lambda wd: -len(wd))
Now you can just print out the longest word, at matches[0], or iterate over all matches and print them out.
(I was surprised that this brute force approach runs so well. I used the 2of12inf.txt word list, containing over 80,000 words, and for a list of 10 characters, I get back the list of matches in about 0.8 seconds on my little 1.99GHz laptop.)
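One caveat with the loop above: set(letters) collapses duplicate source letters, so if the 10 letters contain two 'l's, a word such as "well" would be rejected. A minimal sketch of a variant that keeps the counts, assuming Python 3 and using collections.Counter instead of a set (the function name longest_matches and the sample data are made up for illustration):

```python
from collections import Counter

def longest_matches(letters, wordlist):
    """Return the words buildable from `letters`, longest first."""
    available = Counter(letters)
    # Counter subtraction drops non-positive counts, so an empty result
    # means the word needs nothing beyond the available letters.
    matches = [w for w in wordlist if not (Counter(w) - available)]
    return sorted(matches, key=len, reverse=True)

# Hypothetical sample data, not taken from any real word list.
sample_words = ["we", "wall", "well", "lewd"]
result = longest_matches("ewll", sample_words)
```

This trades the short-circuiting of all for simpler bookkeeping; on a word list of this size the difference is unlikely to matter much.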
I think this code will do what you are looking for:
>>> words = open('file.txt').read()
>>> max(len(word) for word in set(words.split()))
If you require more sophisticated tokenising, for example if you're not using Latin text, you should use NLTK:
>>> import nltk
>>> words = open('file.txt').read()
>>> max(len(word) for word in set(nltk.word_tokenize(words)))
I assume you are trying to find out what is the longest word that can be made from your 10 arbitrary letters.
You can keep your 10 arbitrary letters in a dict along with the frequency they occur.
e.g., your 4 (using 4 instead of 10 for simplicity) arbitrary letters are: e, w, l, l. This would be in a dict as:
{'e':1, 'w':1, 'l':2}
Then for each word in the text file, see if all of the letters for that word can be found in your dict of arbitrary letters. If so, then that is one of your candidate words.
So:
we
wall
well
All of the letters in "well" can be found in your dict of arbitrary letters (as can those in "we", while "wall" fails because there is no 'a'), so save each matching word and its length for comparison against other words.
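The dict-of-frequencies idea can be sketched roughly as follows, assuming Python 3. The helper names letter_counts and can_build are invented for this example, and the three-word list stands in for the real words file:

```python
def letter_counts(text):
    """Build a {letter: frequency} dict like {'e': 1, 'w': 1, 'l': 2}."""
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    return counts

def can_build(word, available):
    """True if every letter of `word` occurs often enough in `available`."""
    needed = letter_counts(word)
    return all(available.get(ch, 0) >= n for ch, n in needed.items())

available = letter_counts("ewll")
candidates = [w for w in ["we", "wall", "well"] if can_build(w, available)]
longest = max(candidates, key=len)
```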
I am making a word game program where I get a list of ~80,000 words from a text file, then use those words as a lexicon of words to choose from. The user requests a word of a certain length which is then given to them scrambled. They then guess words that are of the same length or less and that use the same letters in the same amount or less. I have this list comprehension in order to get all the words from the lexicon that are subsets of the scrambled word and are also in the lexicon. However it allows more occurrences of letters than appear in the original word. For example: If the scrambled word was 'minute', then 'in' should be a correct answer but 'inn' should not. The way I have it written now allows that though. Here is the list comprehension:
correct_answers = [
    word for word in word_list
    if set(word).issubset(random_length_word)
    and word in word_list
    and len(word) <= len(random_length_word)]
So I'm looking for something like issubset but that only allows the same number of letters or less. Hopefully that makes sense. Thanks in advance.
I wrote a function that does this for playing the Countdown letters game. I called the desired input a "subset-anagram", but there's probably a better technical term for it.
Essentially, what you're looking for is a multiset (from word) that is a subset of another multiset (from random_length_word). You can do this with collections.Counter, but I actually found it much faster to do it a different way: make a list out of random_length_word, then remove each character of word. It's probably faster due to the overhead of creating new Counter objects.
def is_subset_anagram(str1, str2):
    """
    Check if str1 is a subset-anagram of str2.
    Return true if str2 contains at least as many of each char as str1.
    >>> is_subset_anagram('bottle', 'belott') # Just enough
    True
    >>> is_subset_anagram('bottle', 'belot') # less
    False
    >>> is_subset_anagram('bottle', 'bbeelloott') # More
    True
    """
    list2 = list(str2)
    try:
        for char in str1:
            list2.remove(char)
    except ValueError:
        return False
    return True
>>> [w for w in ['in', 'inn', 'minute'] if is_subset_anagram(w, 'minute')]
['in', 'minute']
For what it's worth, here's the Counter implementation:
from collections import Counter
def is_subset_anagram(str1, str2):
    delta = Counter(str1) - Counter(str2)
    return not delta
This works because Counter.__sub__() produces a multiset, that is, counts less than 1 are removed.
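To make the subtraction behaviour concrete, here is a small illustration using the 'minute' example from the question (Python 3; the variable names are just for this sketch):

```python
from collections import Counter

# 'in' needs nothing beyond what 'minute' offers, so the difference is empty.
delta_ok = Counter("in") - Counter("minute")

# 'inn' needs two 'n's but 'minute' has only one, so one 'n' is left over.
delta_bad = Counter("inn") - Counter("minute")

is_ok = not delta_ok      # an empty Counter is falsy
is_bad = bool(delta_bad)  # a non-empty Counter is truthy
```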
Your approach loses the information, how often a certain character appears, because set(answer) does not contain this information any more.
Anyway, I think you are over-complicating things with your approach. There is a more efficient way for checking whether an answer is correct, instead of creating a list of all possible answers:
We could just check whether the answer has matching character frequencies with any of the words in word_list. More specifically, "matching character frequencies" means that every character appears no more often in the answer than in the candidate from the word list.
Getting the character frequencies of a string is a classic job for collections.Counter.
Checking that the character frequencies match means that every character in the answer has a count less than or equal to its count in the word.
Finally, checking that an answer is correct means that this condition is true for any of the words in word_list.
from collections import Counter
from typing import List
def correct_answer(word_list: List[str], answer: str) -> bool:
    return any(
        all(
            # this checks that each char occurs no more often in the answer than in the word
            Counter(answer)[character] <= Counter(word)[character]
            for character in Counter(answer).keys()
        )
        for word in word_list
    )
This is more efficient than your approach, because it takes way less memory space. Thanks to any and all being short-circuit, it is also quite time-efficient.
So I have this huge list of strings in Hebrew and English, and I want to extract from them only those in Hebrew, but couldn't find a regex example that works with Hebrew.
I have tried the stupid method of comparing every character:
import string
data = []
for s in slist:
    found = False
    for c in string.ascii_letters:
        if c in s:
            found = True
    if not found:
        data.append(s)
And it works, but it is of course very slow and my list is HUGE.
Instead of this, I tried comparing only the first letter of the string to string.ascii_letters which was much faster, but it only filters out those that start with an English letter, and leaves the "mixed" strings in there. I only want those that are "pure" Hebrew.
I'm sure this can be done much better... Help, anyone?
P.S: I prefer to do it within a python program, but a grep command that does the same would also help
To check if a string contains any ASCII letters (i.e. non-Hebrew), use:
re.search('[' + string.ascii_letters + ']', s)
If this returns a match object (rather than None), your string is not pure Hebrew.
This one should work:
import re
data = [s for s in slist if re.match('^[a-zA-Z ]+$', s)]
This will pick all the strings that consist of lowercase and uppercase English letters and spaces. If the strings are allowed to contain digits or punctuation marks, the allowed characters should be included into the regex.
Edit: Just noticed that it keeps only the English-only strings, but you need it to do the opposite. You can try this instead:
data = [s for s in slist if not re.match('^.*[a-zA-Z].*$', s)]
This will discard any string that contains at least one English letter.
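The patterns above work by excluding Latin letters. If "pure Hebrew" should instead mean "only Hebrew characters and whitespace", a positive test against the Unicode Hebrew block (U+0590 to U+05FF) is another option. A minimal sketch, assuming Python 3 strings; the name is_pure_hebrew and the sample list are invented for illustration, and digits or punctuation would need to be added to the character class if they are allowed:

```python
import re

# One or more characters from the Hebrew block or whitespace, nothing else.
HEBREW_ONLY = re.compile(r"[\u0590-\u05FF\s]+")

def is_pure_hebrew(s):
    return HEBREW_ONLY.fullmatch(s) is not None

# Sample strings written with escapes: "shalom", "hello",
# "shalom world" (mixed) and "shalom olam" (two Hebrew words).
slist = [
    "\u05e9\u05dc\u05d5\u05dd",
    "hello",
    "\u05e9\u05dc\u05d5\u05dd world",
    "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd",
]
data = [s for s in slist if is_pure_hebrew(s)]
```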
Python has extensive Unicode support. It depends on what you're asking for. Is a Hebrew word one that contains only Hebrew characters and whitespace, or is it simply a word that contains no Latin characters? Either way, you can do so directly. Just create the criteria set and test for membership.
Note that testing for membership in a set is much faster than iteration through string.ascii_letters.
Please note that I do not speak Hebrew, so I may have missed a letter or two of the alphabet.
import string
def is_hebrew(word):
    hebrew = set("אבגדהוזחטיכךלמנס עפצקרשתםןףץ"+string.whitespace)
    for char in word:
        if char not in hebrew:
            return False
    return True
def contains_latin(word):
    return any(char in set("abcdefghijklmnopqrstuvwxyz") for char in word.lower())
# a generator expression like this is a terser way of expressing the
# above concept.
hebrew_words = [word for word in words if is_hebrew(word)]
non_latin_words = [word for word in words if not contains_latin(word)]
Another option would be to create a dictionary of Hebrew words:
hebrew_words = {...}
And then you iterate through the list of words and compare them against this dictionary, ignoring case. This will work much faster than the other approaches (O(n), where n is the length of your list of words).
The downside is that you need to get all or most Hebrew words from somewhere. I think it's possible to find such a list on the web in CSV or some other form. Parse it and put it into a Python dictionary.
However, this only makes sense if you need to parse such lists of words very often and quite quickly. Another problem is that the dictionary may not contain every Hebrew word, in which case the result will not be completely correct.
Try this:
>>> import re
>>> filter(lambda x: re.match(r'^[^\w]+$',x),s)
I am new to Python and need some help with trying to come up with a text content analyzer that will help me find 7 things within a text file:
Total word count
Total count of unique words (without case and special characters interfering)
The number of sentences
Average words in a sentence
Find common used phrases (a phrase of 3 or more words used over 3 times)
A list of words used, in order of descending frequency (without case and special characters interfering)
The ability to accept input from STDIN, or from a file specified on the command line
So far I have this Python program to print total word count:
with open('/Users/name/Desktop/20words.txt', 'r') as f:
    p = f.read()
    words = p.split()
    wordCount = len(words)
    print "The total word count is:", wordCount
So far I have this Python program to print unique words and their frequency: (it's not in order and sees words such as: dog, dog., "dog, and dog, as different words)
file=open("/Users/name/Desktop/20words.txt", "r+")
wordcount={}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k, v in wordcount.items():
    print k, v
Thank you for any help you can give!
Certainly the most difficult part is identifying the sentences. You could use a regular expression for this, but there might still be some ambiguity, e.g. with names and titles, that have a dot followed by an upper case letter. For words, too, you can use a simple regex, instead of using split. The exact expression to use depends on what qualifies as a "word". Finally, you can use collections.Counter for counting all of those instead of doing this manually. Use str.lower to convert either the text as a whole or the individual words to lowercase.
This should help you get started:
import re, collections
text = """Sentences start with an upper-case letter. Do they always end
with a dot? No! Also, not each dot is the end of a sentence, e.g. these two,
but this is. Still, some ambiguity remains with names, like Mr. Miller here."""
sentence = re.compile(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", re.S)
sentences = collections.Counter(sentence.findall(text))
for n, s in sentences.most_common():
    print n, s
word = re.compile(r"\w+")
words = collections.Counter(word.findall(text.lower()))
for n, w in words.most_common():
    print n, w
For "more power", you could use some natural language toolkit, but this might be a bit much for this task.
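The "common phrases" item from the question is not covered by the snippet above. One possible sketch, assuming Python 3 and treating a phrase as a run of n consecutive words (so phrases may span sentence boundaries, which is a simplification); the function name common_phrases and its thresholds are made up for this example:

```python
import re
from collections import Counter

def common_phrases(text, n=3, min_count=2):
    """Return (phrase, count) pairs for n-word runs seen at least min_count times."""
    words = re.findall(r"\w+", text.lower())
    # zip over n shifted copies of the word list to get consecutive n-word runs
    ngrams = zip(*(words[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in ngrams)
    return [(p, c) for p, c in counts.most_common() if c >= min_count]

sample = "The cat sat. The cat sat again. A dog ran."
phrases = common_phrases(sample)
```

For the question's "used over 3 times" requirement, min_count=4 would be the matching setting.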
If you know what characters you want to avoid, you can use str.strip to remove these characters from the extremities.
word = word.strip().strip("'").strip('"')...
This will remove the occurrence of these characters on the extremities of the word.
This probably isn't as efficient as using some NLP library, but it can get the job done.
str.strip Docs
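As a rough illustration of the strip idea combined with a frequency count, assuming Python 3 and that string.punctuation covers the characters you want to drop (the sample text is invented):

```python
import string
from collections import Counter

raw = 'dog dog. "dog dog, Cat'

# strip() only removes characters from the two ends of each token,
# so an apostrophe inside a word such as "don't" is kept.
cleaned = [w.strip(string.punctuation).lower() for w in raw.split()]
counts = Counter(cleaned)
by_frequency = counts.most_common()
```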
I decided to write a little application in python to help me learn to type using the dvorak keyboard layout. In my algorithms class, we discussed trees, and tries, and implemented an autocomplete function.
I grabbed a word list from this site. Then I loaded all the words in it into a trie, (which surprisingly only took about a third of a second) and now I am trying to figure out how to make words that are relevant.
I currently am maintaining a priority queue to keep track of which letters the user is typing wrongly the most, and so I remove say 3 letters from this queue to start. If I wanted all the words that started with each of these letters, I could do this, and then probably just filter out all words that don't have any of the other letters that the user types wrongly the most.
Is it possible to efficiently (or maybe even not efficiently) get a list of all words with the letters from the priority queue in them, and then filter it so that I get the word that will be the biggest challenge to the typist?
I was able to do this with characters, but the words present an interesting challenge, because the nature of the trie only gets words that have prefixes that start with the letters we have in the queue.
Do you need a trie here at all? I think you either don't need any advanced structure, or you need something else.
How many words do you want to process? If it takes only a third of a second to load them into a trie, then it will not take much longer to just go through all of them and choose whatever you want. You will have to do this every time, but if it's just 1/3 of a second, it will not be a problem.
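A plain scan of that kind might look like the sketch below (Python 3). The function name words_with_letters and the tiny word list are hypothetical; the real list would come from the word file:

```python
def words_with_letters(words, letters):
    """Return the words that contain every one of the given letters."""
    wanted = set(letters)
    # set <= set is a subset test: all wanted letters must occur in the word
    return [w for w in words if wanted <= set(w)]

sample_words = ["king", "fish", "dish"]
hits = words_with_letters(sample_words, "sh")
```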
You could re-calculate the TRIE to hold all the sub strings (on top of the real words themselves) as well, where the end of the sub string points to the real word in the TRIE.
This way you can use the code you already have and apply it to sub strings.
Okay. The solution I came up with combines @shapiro-yaacov's answer with code I wrote.
I scrapped the trie, and used a thing with bins for each letter. Each word is put into a bin for each letter, and then the algorithm adds up letters to find which words have the most wanted letters. I also take a 10th of a point away from words for each letter that I don't want, to encourage my program to give reasonable words, because if I simply were to add up all words with the most letters, I would get huge words.
Here is my Words.py:
import string
import random
import operator
class Bin:
    """
    A bin is a container that stores words given in a dictionary file.
    It is designed to retrieve all words in this file with the given letters.
    The words are stored in this container in an array and when new words get added,
    the container automatically adds the word to the words list,
    and places them into as many bins as need be.
    For example,
    >>> bin=Bin("words.txt") #get all words from words.txt
    >>> bin.addWord("about")
    now, the bins for a, b, o, u, t will have a pointer to "about".
    Now imagine the bin has the words "king", "fish", and "dish" in it.
    >>> d=bin.getWordsWithLetters("sh")
    >>> print d
    ["fish", "dish"]
    """
    def __init__(self, wordsFile):
        """initialize the container from the given file,
        if None, just initialize an empty container.
        """
        self.bins={}
        for i in string.ascii_lowercase+".'&": #these are the letters I need.
            self.bins[i]=[] #initialize an empty list for each bin.
        if wordsFile is None:
            return
        with open(wordsFile) as words:
            for i in words:
                self.addWord(i.strip("\n"))
    def addWord(self, word):
        for i in word:
            self.bins[i].append(word) #add the word to the bin for each letter in that word.
    def getWordsWithLetters(self, lrs):
        """Gets best word that has the letters lrs in it.
        For example, if abcdef is given, and the words [has, babe, shame] are there,
        [babe] would be returned because it is the word with the maximum return,
        since it contains b,a,e."""
        words=[]
        for i in lrs:
            words+=self.bins[i]
        #Now we go through the words, and calculate the score of each word.
        #a score is calculated by adding up the number of times a letter from lrs appears in each word.
        # Then we will subtract out the number of
        for index, item in enumerate(words):
            score=random.randint(0,10) #give some randomness for the typing thing.
            #print(score)
            #score = 0 #to make it deterministic.
            base=score
            itCounts={}
            for i in lrs:
                itCounts[i]=False
            for letter in item:
                if letter in lrs and (not itCounts[letter]):
                    score+=1
                    itCounts[letter]= True
                else:
                    score-=.1
            words[index] = (item, score)
        words = sorted(words, key=operator.itemgetter(1), reverse=True)
        w=[]
        for i in words:
            if i[1] > base:
                w.append(i[0])
        return w[:50]
I am trying to import the alphabet but split it so that each character is a separate element of an array rather than part of one string. Splitting it works, but when I try to use it to find how many characters are in an inputted word I get the error 'TypeError: Can't convert 'list' object to str implicitly'. Does anyone know how I would go about solving this? Any help appreciated. The code is below.
import string
alphabet = string.ascii_letters
print (alphabet)
splitalphabet = list(alphabet)
print (splitalphabet)
x = 1
j = year3wordlist[x].find(splitalphabet)
k = year3studentwordlist[x].find(splitalphabet)
print (j)
EDIT: Sorry, my explanation is kinda bad, I was in a rush. What I am wanting to do is count each individual letter of a word because I am coding a spelling bee program. For example, if the correct word is 'because', and the user who is taking part in the spelling bee has entered 'becuase', I want the program to count the characters and location of the characters of the correct word AND the user's inputted word and compare them to give the student a mark - possibly by using some kind of point system. The problem I have is that I can't simply say if it is right or wrong, I have to award 1 mark if the word is close to being right, which is what I am trying to do. What I have tried to do in the code above is split the alphabet and then use this to try and find which characters have been used in the inputted word (the one in year3studentwordlist) versus the correct word (year3wordlist).
There is a much simpler solution if you use the in keyword. You don't even need to split the alphabet in order to check if a given character is in it:
import string
year3wordlist = ['asdf123', 'dsfgsdfg435']
total_sum = 0
for word in year3wordlist:
    word_sum = 0
    for char in word:
        if char in string.ascii_letters:
            word_sum += 1
    total_sum += word_sum
# Number of characters that are ASCII letters:
# total_sum == 12
# Length of all characters in all words:
# sum([len(w) for w in year3wordlist]) == 18
EDIT:
Since the OP comments he is trying to create a spelling bee contest, let me try to answer more specifically. The distance between a correctly spelled word and a similar string can be measured in many different ways. One of the most common ways is called 'edit distance' or 'Levenshtein distance'. This represents the number of insertions, deletions or substitutions that would be needed to rewrite the input string into the 'correct' one.
You can find that distance implemented in the Python-Levenshtein package. You can install it via pip:
$ sudo pip install python-Levenshtein
And then use it like this:
from __future__ import division
import Levenshtein
correct = 'because'
student = 'becuase'
distance = Levenshtein.distance(correct, student) # distance == 2
mark = ( 1 - distance / len(correct)) * 10 # mark == 7.14
The last line is just a suggestion on how you could derive a grade from the distance between the student's input and the correct answer.
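If installing the package is not an option, the same distance can be computed with a short dynamic-programming routine. This is a sketch of the standard Levenshtein recurrence in Python 3, not the package's own implementation; the function name edit_distance is made up for this example:

```python
def edit_distance(a, b):
    """Number of insertions, deletions or substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

distance = edit_distance("because", "becuase")
mark = (1 - distance / len("because")) * 10
```

Swapping two adjacent letters counts as two substitutions here, which is why 'becuase' scores a distance of 2.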
I think what you need is join:
>>> "".join(splitalphabet)
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
join is a method of str, so you can do
''.join(splitalphabet)
or
str.join('', splitalphabet)
To convert the list splitalphabet to a string, so you can use it with the find() function you can use separator.join(iterable):
"".join(splitalphabet)
Using it in your code:
j = year3wordlist[x].find("".join(splitalphabet))
I don't know why half the answers are telling you how to put the split alphabet back together...
To count the number of characters in a word that appear in the splitalphabet, do it the functional way:
count = len([c for c in word if c in splitalphabet])
import string
# making letters a set makes "ch in letters" very fast
letters = set(string.ascii_letters)
def letters_in_word(word):
    return sum(ch in letters for ch in word)
Edit: it sounds like you should look at Levenshtein edit distance:
from Levenshtein import distance
distance("because", "becuase") # => 2
While join recreates the string from the split list, you do not have to do that, because you can call find on the original string (alphabet). However, I do not think that is what you are trying to do. Note that the find you are attempting looks for splitalphabet (actually alphabet) within year3wordlist[x], which will always fail (result -1).
If what you are trying to do is to get the indices of all the letters of the word list within the alphabet, then you would need to handle it as
for each letter in the word of the word list, determine the index within alphabet.
j = []
for c in word:
    j.append(alphabet.find(c))
print j
On the other hand if you are attempting to find the index of each character within the alphabet within the word, then you need to loop over splitalphabet to get an individual character to find within the word. That is
l = []
for c in splitalphabet:
    j = word.find(c)
    if j != -1:
        l.append((c, j))
print l
This gives the list of tuples showing those characters found and the index.
I just saw that you talk about counting the number of letters. I am not sure what you mean by this as len(word) gives the number of characters in each word while len(set(word)) gives the number of unique characters. On the other hand, are you saying that your word might have non-ascii characters in it and you want to count the number of ascii characters in that word? I think that you need to be more specific in what you want to determine.
If what you are doing is attempting to determine if the characters are all alphabetic, then all you need to do is use the isalpha() method on the word. You can either say word.isalpha() and get True or False or check each character of word to be isalpha()
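A small sketch of both variants, assuming Python 3; the helper name non_alpha_positions is invented for this example:

```python
def non_alpha_positions(word):
    """Indices of the characters in word that are not alphabetic."""
    return [i for i, ch in enumerate(word) if not ch.isalpha()]

whole_word_ok = "because".isalpha()           # checks the word in one call
bad_positions = non_alpha_positions("becau5e")  # checks character by character
```

Note that isalpha() also returns False for an empty string, and that it accepts non-ASCII letters, not just those in string.ascii_letters.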