For my project I want to write a Python program that searches for a word in a string or long document.
If the word is not in the string/document, I have to search for approximate matches.
For example, for the word “brain”:
Deletions: rain, bain, brin, bran, brai.
Substitutions: train, grain, blain, bryin ...
I already have deletion and substitution functions, but I am not sure how to search the string for these matches by brute force and benchmark the runtime.
string = "hereharewereworeherareteartoredeardareearrearehrerheasereseersearrah"
# the string can be much longer
Pattern = "ware"
# the output should have 4 deletions and 6 substitutions
# string0 is Pattern, string1 is the word we compare; if it is of this type, append it to the list
Deletions = []
def deletions(string0, string1):
    deletionlist = []
    # build the list of deletion words
    for i in range(len(string0)):
        deletionlist.append(string0.replace(string0[i], ""))
    # delete first character and last
    if string1[1:] in deletionlist:
        Deletions.append(string1[1:])
        return 1
    elif string1[:-1] in deletionlist:
        if len(string1[:-1]) == 1:
            Deletions.append(string1[:-1])
            return 1
Substitutions = []
def subsitutions(string0, string1):
    if len(string0) == len(string1):
        sublist = []
        # build the list of deletion words
        for i in range(len(string0)):
            sublist.append(string0.replace(string0[i], ""))
        for j in range(len(string1)):
            if string1.replace(string1[j], "") in sublist:
                Substitutions.append(string1)
                break
A good fit is the Levenshtein algorithm. It calculates the distance between two words or sentences (the minimum number of single-character insertions, deletions, or substitutions needed to convert one into the other), or a similarity ratio, if you like:
>>> import Levenshtein
>>> Levenshtein.distance( 'hello, guys', 'hello, girls' )
3
>>> Levenshtein.ratio( 'hello, guys', 'hello, girls' )
0.782608695652174
You may check the details of the implementation and other info here: https://en.wikipedia.org/wiki/Levenshtein_distance
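As a rough sketch of how this applies to the original question (finding approximate matches of Pattern inside a long string), the snippet below implements the distance in pure Python and slides windows over the text by brute force. The names levenshtein and approximate_matches and the max_dist parameter are my own and are not part of the Levenshtein package; treat it as a baseline to benchmark against rather than a tuned solution.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def approximate_matches(text: str, pattern: str, max_dist: int = 1) -> list:
    """Brute force: test every window whose length is within max_dist of the pattern."""
    found = set()
    n = len(pattern)
    for width in range(max(1, n - max_dist), n + max_dist + 1):
        for start in range(len(text) - width + 1):
            window = text[start:start + width]
            if levenshtein(window, pattern) <= max_dist:
                found.add(window)
    return sorted(found)


print(approximate_matches("hereharewere", "ware"))
```

Every window is compared from scratch, so the running time grows with the text length times the square of the pattern length. That is usually fine for short patterns and is easy to time with timeit.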
I want to search a text and count how many times selected words occur. For simplicity, I'll say the text is "Does it fit?" and the words I want to count are "it" and "fit".
I've written the following code:
mystring = 'Does it fit?'
search_words = 'it', 'fit'
for sw in search_words:
    frequency = {}
    count = mystring.count(sw.strip())
    output = (sw + ',{}'.format(count))
    print(output)
The output is
it,2
fit,1
because the code counts the 'it' in 'fit' towards the total for 'it'.
The output I want is
it,1
fit,1
I've tried changing line 5 to count = mystring.count('\\b'+sw+'\\b'.strip()) but the count is then zero for each word. How can I get this to work?
search_words is a tuple rather than a list, but that is not the problem. Here's a way to do it, though:
bad_chars = [';', ':', '!', "*","?","."]
res = {}
for word in ["it","fit"]:
    res[word] = 0
    string = ''.join((filter(lambda i: i not in bad_chars, "does it fit?")))
    for i in string.split(" "):
        if word == i: res[word] += 1
print(res)
By using str.count() you were checking whether that string occurs inside another string. In this case 'it' also occurs inside 'fit', so you were getting 2 occurrences of 'it'.
Here the words are compared directly, after removing punctuation/special characters.
Output:
{'it': 1, 'fit': 1}
The issue with the regex pattern that you have tried implementing in your original post is with str.count() rather than the pattern itself.
str.count() returns the count of non-overlapping occurrences of the str passed as a parameter within the str that the method is applied to, so 'lots of love'.count('lo') will return 2. However, str.count() only identifies substrings using string literals and will not work with regular expression patterns.
The below solution using your original pattern and the built in re module should work nicely for you.
import re
mystring = 'Does it fit?'
search_words = 'it', 'fit'
results = dict()
for sw in search_words:
    count = re.findall(rf'\b{sw}\b', mystring)
    results[sw] = 0 if not count else len(count)
for k, v in results.items():
    print(f'{k}, {v}')
If you want to get matches from search_words regardless of their case - e.g for each occurrence of the substrings 'Fit', 'FIT', 'fIt' etc. present in mystring to be included in the count stored in results['fit'] - you can achieve this by changing the line:
count = re.findall(rf'\b{sw}\b', mystring)
to
count = re.findall(rf'\b{sw}\b', mystring, re.IGNORECASE)
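For reference, the same idea can be wrapped in a small function. This is only a sketch: the name count_whole_words and the ignore_case parameter are my own, and the call to re.escape is an addition so that search words containing regex metacharacters (such as "c++") are treated literally.

```python
import re


def count_whole_words(text, words, ignore_case=False):
    """Count whole-word occurrences of each search word using \\b word boundaries."""
    flags = re.IGNORECASE if ignore_case else 0
    return {
        # re.escape keeps any special characters in the search word literal
        w: len(re.findall(rf"\b{re.escape(w)}\b", text, flags))
        for w in words
    }


print(count_whole_words("Does it fit?", ["it", "fit"]))
```

Note that \b only matches between a word character and a non-word character, so this works best for search words that start and end with a letter, digit, or underscore.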
Try this:
def count_words(string, *args):
    words = string.split()
    search_words = args
    frequency_dict = {}
    for i in range(len(words)):
        if words[i][-1] == '?':
            words[i] = words[i][:-1]
    for word in search_words:
        frequency_dict[word] = words.count(word)
    for word, count in frequency_dict.items():
        print(f'{word}, {count}')
You can call it like this:
count_words('Does it it it fit fit it?', 'it', 'fit')
And the output is:
it, 4
fit, 2
I'm trying to create a function in Python that will generate anagrams of a given word. I'm not just looking for code that rearranges the letters aimlessly: all the options given must be real words. I currently have a solution (most of the code, to be honest, comes from a YouTube video), but it is very slow for my purpose and can only provide one-word responses to a single given word. It checks candidates against a 400,000-word dictionary file called "dict.txt".
My goal is to get this code to mimic how well this website's code works:
https://wordsmith.org/anagram/
I could not find the JavaScript code when reviewing the network activity using Google Chrome's developer tools, so I believe the code probably runs on the server, possibly using Node.js. This would perhaps make it faster than Python, but given how much faster it is, I believe there is more to it than just the programming language. I assume they are using some type of search algorithm rather than just going through each line one by one like I am. I also like the fact that their response is not limited to a single word, but can break up the given word to provide more options to the user. For example, an anagram of "anagram" is "nag a ram".
Any suggestions or ideas would be appreciated.
Thank you.
import random

def init_words(filename):
    words = {}
    with open(filename) as f:
        for line in f:
            word = line.strip()
            words[word] = 1
    return words
def init_anagram_dict(words):
    anagram_dict = {}
    for word in words:
        sorted_word = ''.join(sorted(list(word)))
        if sorted_word not in anagram_dict:
            anagram_dict[sorted_word] = []
        anagram_dict[sorted_word].append(word)
    return anagram_dict
def find_anagrams(word, anagram_dict):
    key = ''.join(sorted(list(word)))
    if key in anagram_dict:
        return set(anagram_dict[key]).difference(set([word]))
    return set([])
#This is the first function called.
def make_anagram(user_word):
    x = str(user_word)
    lower_user_word = str.lower(x)
    word_dict = init_words('dict.txt')
    result = find_anagrams(lower_user_word, init_anagram_dict(word_dict.keys()))
    list_result = list(result)
    count = len(list_result)
    if count > 0:
        random_num = random.randint(0,count -1)
        anagram_value = list_result[random_num]
        return ('An anagram of %s is %s. Would you like me to search for another word?' %(lower_user_word, anagram_value))
    else:
        return ("Sorry, I could not find an anagram for %s." %(lower_user_word))
You can build a dictionary of anagrams by grouping words by their sorted text. All words that have the same sorted text are anagrams of each other:
from collections import defaultdict
with open("/usr/share/dict/words","r") as wordFile:
    words = wordFile.read().split("\n")
anagrams = defaultdict(list)
for word in words:
    anagrams["".join(sorted(word))].append(word)
aWord = "spear"
result = anagrams["".join(sorted(aWord))]
print(aWord,result)
# spear ['asper', 'parse', 'prase', 'spaer', 'spare', 'spear']
Using 235,000 words, the response time is instantaneous.
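The lookup is fast because the index is built once and then reused, whereas the code in the question re-reads the file and rebuilds the dictionary on every call. A minimal sketch of that separation follows; the function names and the tiny SAMPLE_WORDS list are illustrative stand-ins for a real word file.

```python
from collections import defaultdict


def build_anagram_index(words):
    """Group words by their sorted letters; do this once at startup."""
    index = defaultdict(list)
    for word in words:
        index["".join(sorted(word))].append(word)
    return index


def anagrams_of(word, index):
    """Look up the anagrams of a word, excluding the word itself."""
    word = word.lower()
    key = "".join(sorted(word))
    # .get avoids adding empty entries to the defaultdict on a miss
    return [w for w in index.get(key, []) if w != word]


# Hypothetical sample data standing in for dict.txt
SAMPLE_WORDS = ["spear", "spare", "parse", "pears", "reaps",
                "listen", "silent", "enlist"]
INDEX = build_anagram_index(SAMPLE_WORDS)

print(anagrams_of("spear", INDEX))
print(anagrams_of("Listen", INDEX))
```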
In order to obtain multiple words forming an anagram of the specified word, you will need to get into combinatorics. A recursive function is probably the easiest way to go about it:
from itertools import combinations,product
from collections import Counter,defaultdict
with open("/usr/share/dict/words","r") as wordFile:
    words = wordFile.read().split("\n")
anagrams = defaultdict(set)
for word in words:
    anagrams["".join(sorted(word))].add(word)
counters = { w:Counter(w) for w in anagrams }
minLen = 2 # minimum word length
def multigram(word,memo=dict()):
    sWord = "".join(sorted(word))
    if sWord in memo: return memo[sWord]
    result = anagrams[sWord]
    wordCounts = counters.get(sWord,Counter())
    for size in range(minLen,len(word)-minLen+1):
        seen = set()
        for combo in combinations(word,size):
            left = "".join(sorted(combo))
            if left in seen or seen.add(left): continue
            left = multigram(left,memo)
            if not left: continue
            right = multigram("".join((wordCounts-Counter(combo)).elements()),memo)
            if not right: continue
            result.update(a+" "+b for a,b in product(left,right) )
    memo[sWord] = list(result)
    return memo[sWord]
Performance is good up to 12-character words. Beyond that, the exponential nature of combinations starts to take a heavy toll.
result = multigram("spear")
print(result)
# ['parse', 'asper', 'spear', 'er spa', 're spa', 'se rap', 'er sap', 'sa per', 're asp', 'ar pes', 'se par', 'pa ers', 're sap', 'er asp', 'as per', 'spare', 'spaer', 'as rep', 'sa rep', 'ra pes', 'pa ser', 'es rap', 'es par', 'prase']
len(multigram("mulberries")) # 15986 0.1 second 10 letters
len(multigram("raspberries")) # 60613 0.2 second 11 letters
len(multigram("strawberries")) # 374717 1.3 seconds 12 letters
len(multigram("tranquillizer")) # 711491 7.6 seconds 13 letters
len(multigram("communications")) # 10907666 52.2 seconds 14 letters
In order to avoid any delay, you can convert the function to an iterator. This will allow you to get the first few anagrams without having to generate them all:
def iMultigram(word,prefix=""):
    sWord = "".join(sorted(word))
    seen = set()
    for anagram in anagrams.get(sWord,[]):
        full = prefix+anagram
        if full in seen or seen.add(full): continue
        yield full
    wordCounts = counters.get(sWord,Counter(word))
    for size in reversed(range(minLen,len(word)-minLen+1)): # longest first
        for combo in combinations(sWord,size):
            left = "".join(sorted(combo))
            if left in seen or seen.add(left): continue
            for left in iMultigram(left,prefix):
                right = "".join((wordCounts-Counter(combo)).elements())
                for full in iMultigram(right,left+" "):
                    if full in seen or seen.add(full): continue
                    yield full
from itertools import islice
list(islice(iMultigram("communications"),5)) # 0.0 second
# ['communications', 'cinnamomic so ut', 'cinnamomic so tu', 'cinnamomic os ut', 'cinnamomic os tu']
I'm trying to write an algorithm that, given a bunch of letters, returns all the words that can be constructed from those letters. For instance, given 'car' it should return a list containing [arc, car, a, etc...], and from that list it should return the best Scrabble word. The problem is finding the list that contains all the words.
I've got a giant line-delimited txt dictionary file, and I've tried this so far:
from collections import Counter

def find_optimal(bunch_of_letters: str):
    words_to_check = []
    c1 = Counter(bunch_of_letters.lower())
    for word in load_words():
        c2 = Counter(word.lower())
        if c2 & c1 == c2:
            words_to_check.append(word)
    max_word = max_word_value(words_to_check)
    return max_word,calc_word_value(max_word)
max_word_value - returns the word with the maximum value of the list given
calc_word_value - returns the word's score in scrabble.
load_words - return a list of the dictionary.
I'm currently using counters to do the trick, but each search takes about 2.5 seconds and I don't know how to optimize it. Any thoughts?
Try this:
def find_optimal(bunch_of_letters):
    bunch_of_letters = ''.join(sorted(bunch_of_letters))
    words_to_check = [word for word in load_words() if ''.join(sorted(word)) in bunch_of_letters]
    max_word = max_word_value(words_to_check)
    return max_word, calc_word_value(max_word)
I've just used (or at least tried to use) a list comprehension. Essentially, words_to_check will (hopefully!) be a list of all of the words in your text file that can be built from the given letters.
On a side note, if you don't want to use a gigantic text file for the words, check out enchant!
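One caveat with the substring test above: it only matches when the word's sorted letters sit next to each other in the sorted rack, so with the letters "abc" the word "ac" would be missed. A possible alternative, sketched below under the assumption that the dictionary fits in memory, is to build one Counter per dictionary word ahead of time and reuse them for every search. The names build_word_counters and playable_words are hypothetical, and the word list is a stand-in for load_words().

```python
from collections import Counter


def build_word_counters(words):
    """Precompute letter counts once, outside the search function."""
    return [(w, Counter(w.lower())) for w in words]


def playable_words(letters, word_counters):
    """Return the words whose letters are all available in the rack."""
    available = Counter(letters.lower())
    n = len(letters)
    # Counter subtraction keeps only positive counts, so an empty result
    # means the word needs nothing beyond what the rack provides.
    return [w for w, c in word_counters if len(w) <= n and not (c - available)]


# Hypothetical stand-in for the dictionary returned by load_words()
WORD_COUNTERS = build_word_counters(["arc", "car", "a", "cat", "race", "ra", "ac"])

print(playable_words("car", WORD_COUNTERS))
print(playable_words("abc", WORD_COUNTERS))
```

Most of the per-search cost in the question comes from constructing a Counter for every dictionary word on every call, so moving that work into a one-time step is likely where the savings are.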
from itertools import permutations
theword = 'car' # or we can use input('Type in a word: ')
mylist = [permutations(theword, i)for i in range(1, len(theword)+1)]
for generator in mylist:
    for word in generator:
        print(''.join(word))
        # instead of .join just print (word) for tuple
Output:
c
a
r
ca
cr
...
ar rc ra car cra acr arc rca rac
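This will give us all the possible arrangements (i.e. permutations) of the letters, for every length from 1 up to the full word.
If you want to check whether a generated word is an actual word in the English dictionary, we can use the enchant library:
import enchant
d = enchant.Dict("en_US")
# mylist holds generators that are already exhausted, so build the permutations again
for i in range(1, len(theword) + 1):
    for letters in permutations(theword, i):
        word = ''.join(letters)
        print(d.check(word), word)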
Conclusion:
If we want to generate all the full-length permutations of the word, we use this code:
from itertools import combinations, permutations, product
word = 'word' # or we can use input('Type in a word: ')
solution = permutations(word, 4)
for i in solution:
    print(''.join(i)) # just print(i) if you want a tuple
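Putting the two pieces together, here is a small sketch that generates the permutations of every length and keeps only those found in a word set. The dictionary argument is a plain Python set used as a stand-in for enchant or a word file, and the function name is my own.

```python
from itertools import permutations


def words_from_letters(letters, dictionary):
    """Return the dictionary words that can be spelled from the given letters."""
    found = set()
    for size in range(1, len(letters) + 1):
        for perm in permutations(letters, size):
            candidate = "".join(perm)
            if candidate in dictionary:
                found.add(candidate)
    return sorted(found)


print(words_from_letters("car", {"arc", "car", "a", "cat"}))
```

The number of permutations grows factorially with the number of letters, so this is only practical for short inputs.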
For each occurrence of a certain word, I need to display the context by showing about 5 words preceding and following the occurrence of the word.
Example output for the word 'stranger' in a text file of content when you enter occurs('stranger', 'movie.txt'):
My code so far:
def occurs(word, filename):
    infile = open(filename,'r')
    lines = infile.read().splitlines()
    infile.close()
    wordsString = ''.join(lines)
    words = wordsString.split()
    print(words)
    for i in range(len(words)):
        if words[i].find(word):
            #stuck here
I'd suggest slicing words depending on i:
print(words[i-5:i+6])
(This would go where your comment is)
Alternatively, to print as shown in your example:
print("...", " ".join(words[i-5:i+6]), "...")
To account for the word being in the first 5:
if i > 5:
    print("...", " ".join(words[i-5:i+6]), "...")
else:
    print("...", " ".join(words[0:i+6]), "...")
Additionally, find is not doing what you think it is. If find() doesn't find the string, it returns -1, which evaluates to True when used in an if statement, and a match at the very start of the string returns 0, which evaluates to False. Try:
if word in words[i].lower():
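To make the difference concrete, the sketch below compares the two tests side by side. The helper names are made up for the illustration.

```python
def contains_via_find(token, word):
    """The test from the question: relies on the truthiness of str.find()."""
    return bool(token.find(word))


def contains_via_in(token, word):
    """The suggested test: a plain substring check."""
    return word in token


for token in ["stranger", "a-stranger", "friend"]:
    print(token, contains_via_find(token, "stranger"), contains_via_in(token, "stranger"))
```

With find(), an exact match at index 0 is reported as False and a complete miss (-1) is reported as True, which is the opposite of what the loop needs.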
This retrieves the index of every occurrence of the word in words, which is a list of all words in the file. Then slicing is used to get a list of the matched word and the 5 words before and after.
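def occurs(word, filename):
    with open(filename, 'r') as infile:
        lines = infile.read().splitlines()
    # join with a space so the last word of one line does not fuse with the first word of the next
    wordsString = ' '.join(lines)
    words = wordsString.split()
    matches = [i for i, w in enumerate(words) if w.lower().find(word) != -1]
    for m in matches:
        # clamp the start so a match in the first 5 words does not produce a negative index
        l = " ".join(words[max(m-5, 0):m+6])
        print(f"... {l} ...")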
Consider the more_itertools.adjacent tool.
Given
import more_itertools as mit
s = """\
But we did not answer him, for he was a stranger and we were not used to, strangers and were shy of them.
We were simple folk, in our village, and when a stranger was a pleasant person we were soon friends.
"""
word, distance = "stranger", 5
words = s.splitlines()[0].split()
Demo
neighbors = list(mit.adjacent(lambda x: x == word, words, distance))
" ".join(word for bool_, word in neighbors if bool_)
# 'him, for he was a stranger and we were not used'
Details
more_itertools.adjacent returns an iterable of tuples, i.e. (bool, item) pairs. A True boolean is returned for the items that satisfy the predicate and for the items within the given distance of them. Example:
>>> neighbors
[(False, 'But'),
...
(True, 'a'),
(True, 'stranger'),
(True, 'and'),
...
(False, 'to,')]
Neighboring words are filtered from the results given a distance from the target word.
Note: more_itertools is a third-party library. Install it with pip install more_itertools.
Whenever I see rolling views of files, I think collections.deque
import collections
def occurs(needle, fname):
    with open(fname) as f:
        lines = f.readlines()
    words = iter(''.join(lines).split())
    view = collections.deque(maxlen=11)
    # prime the deque
    for _ in range(10): # leaves an 11-length deque with 10 elements
        view.append(next(words, ""))
    for w in words:
        view.append(w)
        if view[5] == needle:
            yield list(view.copy())
Note that this approach intentionally does not handle any edge cases for occurrences of needle in the first 5 words or the last 5 words of the file. The question is ambiguous as to whether matching the third word should give the first through ninth words, or something different.
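If you do want matches near the start or end, one possible interpretation is to return whatever neighbours exist. The sketch below does that by padding the word stream with None sentinels on both sides. The name occurs_padded and the radius parameter are my own, and it takes a list of words rather than a filename to keep the example self-contained.

```python
from collections import deque
from itertools import chain


def occurs_padded(needle, words, radius=5):
    """Yield the words around each exact match, truncated at the edges."""
    pad = [None] * radius
    view = deque(maxlen=2 * radius + 1)
    for w in chain(pad, words, pad):
        view.append(w)
        # Once the deque is full, its middle element is the candidate word.
        if len(view) == view.maxlen and view[radius] == needle:
            yield [x for x in view if x is not None]


text = "a stranger came to the village and nobody spoke to the stranger"
for context in occurs_padded("stranger", text.split(), radius=2):
    print(" ".join(context))
```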
Let us suppose I have the following paragraph:
"This is the first sentence. This is the second sentence? This is the third
sentence!"
I need to create a function that will only return the number of sentences under a given character count. If it is less than one sentence, it will return all characters of the first sentence.
For example:
>>> reduce_paragraph(100)
"This is the first sentence. This is the second sentence? This is the third
sentence!"
>>> reduce_paragraph(80)
"This is the first sentence. This is the second sentence?"
>>> reduce_paragraph(50)
"This is the first sentence."
>>> reduce_paragraph(5)
"This "
I started off with something like this, but I can't seem to figure out how to finish it:
endsentence = ".?!"
sentences = itertools.groupby(text, lambda x: any(x.endswith(punct) for punct in endsentence))
for number,(truth, sentence) in enumerate(sentences):
    if truth:
        first_sentence = previous+''.join(sentence).replace('\n',' ')
    previous = ''.join(sentence)
Processing sentences is very difficult, due to the syntactic constructs of the English language. As someone has already pointed out, issues like abbreviations will cause unending headaches even for the best regexer.
You should consider the Natural Language Toolkit (NLTK), specifically the punkt module. It is a sentence tokenizer and it will do the heavy lifting for you.
Here's how you could use the punkt module mentioned by @BigHandsome to truncate the paragraph:
from nltk.tokenize.punkt import PunktSentenceTokenizer
def truncate_paragraph(text, maxnchars,
                       tokenize=PunktSentenceTokenizer().span_tokenize):
    """Truncate the text to at most maxnchars number of characters.
    The result contains only full sentences unless maxnchars is less
    than the first sentence length.
    """
    sentence_boundaries = tokenize(text)
    last = None
    for start_unused, end in sentence_boundaries:
        if end > maxnchars:
            break
        last = end
    return text[:last] if last is not None else text[:maxnchars]
Example
text = ("This is the first sentence. This is the second sentence? "
        "This is the third\n sentence!")
for limit in [100, 80, 50, 5]:
    print(truncate_paragraph(text, limit))
Output
This is the first sentence. This is the second sentence? This is the third
sentence!
This is the first sentence. This is the second sentence?
This is the first sentence.
This
If we ignore the natural language issues (i.e. we only need an algorithm that returns complete chunks delineated by ".?!", where the total length is less than k), then the following elementary approach will work:
def sentences_upto(paragraph, k):
    sentences = []
    current_sentence = ""
    stop_chars = ".?!"
    for i, c in enumerate(paragraph):
        current_sentence += c
        if(c in stop_chars):
            sentences.append(current_sentence)
            current_sentence = ""
        if(i == k):
            break
    return sentences
Your itertools solution can be completed like this:
import itertools

def sentences_upto_2(paragraph, size):
    stop_chars = ".?!"
    sentences = itertools.groupby(paragraph, lambda x: any(x.endswith(punct) for punct in stop_chars))
    for k, s in sentences:
        ss = "".join(s)
        size -= len(ss)
        if not k:
            if size < 0:
                return
            yield ss
You can break down this problem into simpler steps:
Given a paragraph, split it into sentences
Figure out how many sentences we can join together while staying under the character limit
If we can fit at least one sentence, then join those sentences together.
If the first sentence was too long, take the first sentence and truncate it.
Sample code (not tested):
import re

def reduce_paragraph(para, max_len):
    # Split into list of sentences
    # A sentence is a sequence of characters ending with ".", "?", or "!".
    # (Splitting on this zero-width match needs Python 3.7 or later.)
    sentences = re.split(r"(?<=[\.?!])", para)
    # Figure out how many sentences we can have and stay under max_len
    num_sentences = 0
    total_len = 0
    for s in sentences:
        total_len += len(s)
        if total_len > max_len:
            break
        num_sentences += 1
    if num_sentences > 0:
        # We can fit at least one sentence, so return whole sentences
        return ''.join(sentences[:num_sentences])
    else:
        # Return a truncated first sentence
        return sentences[0][:max_len]
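Since the sample above is marked as untested, here is a compact variant of the same steps that can be checked against the four examples from the question. The findall pattern is my own choice: it treats any run of text ending in ".", "?" or "!" as a sentence, so abbreviations will still be split incorrectly.

```python
import re


def reduce_paragraph(text, max_len):
    """Keep as many whole sentences as fit in max_len characters."""
    # Each piece ends in ".", "?" or "!"; a trailing unterminated run is kept too.
    pieces = re.findall(r"[^.?!]*[.?!]|[^.?!]+$", text)
    kept, total = [], 0
    for piece in pieces:
        if total + len(piece) > max_len:
            break
        kept.append(piece)
        total += len(piece)
    # Fall back to a truncated first sentence when not even one fits.
    return "".join(kept) if kept else text[:max_len]


PARAGRAPH = ("This is the first sentence. This is the second sentence? "
             "This is the third\nsentence!")
for limit in (100, 80, 50, 5):
    print(repr(reduce_paragraph(PARAGRAPH, limit)))
```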