Using difflib.get_close_matches to replace word in string

difflib.get_close_matches can return a single close match when I supply the sample string and a list of candidate words. How can I use that close match to replace the token it matched in the string?
# difflibQuestion.py
import difflib
word = ['Summerdalerise', 'Winterstreamrise']
line = 'I went up to Winterstreamrose.'
result = difflib.get_close_matches(line,word,n=1)
print(result)
Output:
['Winterstreamrise']
I want to produce the line:
I went up to Winterstreamrise.
For many lines and words.
I have checked the docs but can't find any reference to the string index of the match found by difflib.get_close_matches; the other module classes and functions return lists.
I Googled "python replace word in line using difflib" etc. I can't find any reference to anyone else asking/writing about it. It would seem a common scenario to me.
This example is of course a simplified version of my 'real world' scenario, which may be of help, since I am dealing with table data rather than lines:
Surname, First names, Street Address, Town, Job Description
My 'words' are a large list of street base names, e.g. MAIN, EVERY, EASY, LOVERS (without the Road, Street, Lane), so difflib.get_close_matches could be used to substitute the string in column x (the 'line') with the closest matching 'word'.
However, I would appreciate anyone suggesting an approach to either of these examples.

You could try something like this:
import difflib
possibilities = ['Summerdalerise', 'Winterstreamrise']
line = 'I went up to Winterstreamrose.'
newWords = []
for word in line.split():
    result = difflib.get_close_matches(word, possibilities, n=1)
    newWords.append(result[0] if result else word)
result = ' '.join(newWords)
print(result)
Output:
I went up to Winterstreamrise
Explanation:
The docs show a first argument named word, and there is no suggestion that get_close_matches() has any awareness of sub-words within this argument; rather, it reports on the closeness of a match between this word atomically and the list of possibilities supplied as the second argument.
We can add the awareness of words within line by splitting it into a list of such words which we iterate over, calling get_close_matches() for each word separately and modifying the word in our result only if there is a match.
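The split-and-join approach above drops the trailing period, because 'Winterstreamrose.' is replaced as a whole token. A minimal sketch of one way to keep punctuation, assuming words can be identified with the regular expression \w+ (the function name replace_close_words and the cutoff parameter default are illustrative choices, not part of the original answer):

```python
import difflib
import re

def replace_close_words(line, possibilities, cutoff=0.6):
    """Replace each word in `line` with its closest match, if there is one."""
    def substitute(match):
        word = match.group(0)
        close = difflib.get_close_matches(word, possibilities, n=1, cutoff=cutoff)
        # Keep the original word when nothing is close enough.
        return close[0] if close else word

    # \w+ matches runs of letters, digits and underscores, so punctuation
    # and whitespace between words are left untouched.
    return re.sub(r'\w+', substitute, line)

possibilities = ['Summerdalerise', 'Winterstreamrise']
print(replace_close_words('I went up to Winterstreamrose.', possibilities))
```

For the table scenario, the same function could be applied to a single column value instead of a whole line.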

Related

How to understand the flaw in my simple three part python code?

My Python exercise in 'classes' is as follows:
You have been recruited by your friend, a linguistics enthusiast, to create a utility tool that can perform analysis on a given piece of text. Complete the class "analyzedText" with the following methods:
Constructor (__init__) - This method should take the argument text, make it lowercase and remove all punctuation. Assume only the following punctuation is used: period (.), exclamation mark (!), comma (,), and question mark (?). Assign this newly formatted text to a new attribute called fmtText.
freqAll - This method should create and return dictionary of all unique words in the text along with the number of times they occur in the text. Each key in the dictionary should be the unique word appearing in the text and the associated value should be the number of times it occurs in the text. Create this dictionary from the fmtText attribute.
This was my code:
class analysedText(object)
    def __init__ (self, text):
        formattedText = text.replace('.',' ').replace(',',' ').replace('!',' ').replace('?',' ')
        formattedText = formattedText.lower()
        self.fmtText = formattedText
    def freqAll(self):
        wordList = self.fmtText.split(' ')
        wordDict = {}
        for word in set(wordList):
            wordDict[word] = wordList(word)
        return wordDict
I get errors on both of these and I can't seem to figure it out after a lot of little adjustments. I suspect the issue in the first part is when I try to assign a value to the newly formatted text but I cannot think of a workable solution. As for the second part, I am at a complete loss - I was wrongfully confident my answer was correct but I received a fail error when I ran it through the classroom's code cell to test it.
On the assumption that by 'errors' you mean a TypeError, this is caused by the line wordDict[word] = wordList(word).
wordList is a list, and by using parentheses you're telling Python that you want to call that list as a function, which it cannot do.
According to your task, you are instead to count the occurrences of each word in the list, which you can do with the .count() method. This method returns the total number of occurrences of an element in a list.
With this modification, (this is assuming you want wordDict to contain a dictionary with the word as the key, and the occurrence as the value) your freqAll function would look something like this:
def freqAll(self):
    wordList = self.fmtText.split()
    wordDict = {}
    for word in set(wordList):
        wordDict[word] = wordList.count(word) # wordList.count(word) returns the number of times the string word appears as an element in wordList
    return wordDict
You could also achieve the same thing with the collections.Counter class (which means you have to import collections).
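As a rough illustration of the collections.Counter route, here is a minimal sketch. The class name AnalysedText is just a placeholder for the exercise's class, and punctuation is removed outright (rather than replaced with spaces), which is one reading of the exercise text:

```python
from collections import Counter

class AnalysedText(object):
    def __init__(self, text):
        # Strip the four punctuation marks named in the exercise,
        # then lowercase the result.
        for mark in '.!,?':
            text = text.replace(mark, '')
        self.fmtText = text.lower()

    def freqAll(self):
        # split() with no argument collapses runs of whitespace, so no
        # empty strings end up being counted as words.
        return dict(Counter(self.fmtText.split()))

sample = AnalysedText('Hello, hello world!')
print(sample.freqAll())
```

Counter does the counting in a single pass over the word list, whereas calling .count() inside a loop rescans the list once per unique word.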

Derive words from string based on key words

I have a string (text_string) from which I want to find words based on my so called key_words. I want to store the result in a list called expected_output.
The expected output is always the word after the keyword (the number of spaces between the keyword and the output word doesn't matter). The expected_output word is then all characters until the next space.
Please see the example below:
text_string = "happy yes_no!?. why coding without paus happy yes"
key_words = ["happy","coding"]
expected_output = ['yes_no!?.', 'without', 'yes']
expected_output explanation:
yes_no!?. (since it comes after happy. All signs are included until the next space.)
without (since it comes after coding. the number of spaces surronding the word doesn't matter)
yes (since it comes after happy)
You can solve it using a regex, for example:
import re
expected_output = re.findall(r'(?:{0})\s+?([^\s]+)'.format('|'.join(key_words)), text_string)
Explanation
(?:{0}) takes your key_words list and builds a non-capturing group containing all the words in the list.
\s+? lazily matches the whitespace after a keyword, up to the next character that isn't a space.
([^\s]+) captures the text right after the keyword, up to the next whitespace.
Note: if you run this many times, e.g. inside a loop, use re.compile on the regex string beforehand to improve performance.
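A minimal sketch of the precompiled variant mentioned in the note. The helper name compile_keyword_pattern is hypothetical, and re.escape is an addition that guards against keywords containing regex metacharacters:

```python
import re

def compile_keyword_pattern(key_words):
    """Build the pattern once so it can be reused across many strings."""
    # re.escape keeps characters such as '.' or '+' in a keyword literal.
    alternatives = '|'.join(re.escape(word) for word in key_words)
    return re.compile(r'(?:{0})\s+(\S+)'.format(alternatives))

pattern = compile_keyword_pattern(["happy", "coding"])
text_string = "happy yes_no!?. why coding without paus happy yes"
print(pattern.findall(text_string))
```

Note that this pattern, like the original, has no word boundary before the keyword, so a word such as "unhappy" would also trigger a match.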
We will use re module of Python to split your strings based on whitespaces.
Then, the idea is to go over each word, and look if that word is part of your keywords. If yes, we set take_it to True, so that next time the loop is processed, the word will be added to taken which stores all the words you're looking for.
import re
def find_next_words(text, keywords):
    take_it = False
    taken = []
    for word in re.split(r'\s+', text):
        if take_it:
            taken.append(word)
        take_it = word in keywords
    return taken
print(find_next_words("happy yes_no!?. why coding without paus happy yes", ["happy", "coding"]))
results in ['yes_no!?.', 'without', 'yes']

Matching if any keyword from a list is present in a string

I have a list of keywords. A sample is:
['IO', 'IO Combination','CPI Combos']
Now what I am trying to do is see if any of these keywords is present in a string. For example, if my string is: there is a IO competition coming in Summer 2018. So for this example since it contains IO, it should identify that but if the string is there is a competition coming in Summer 2018 then it should not identify any keywords.
I wrote this Python code but it also identifies IO in competition:
if any(word.lower() in string_1.lower() for word in keyword_list):
    print('FOUND A KEYWORD IN STRING')
I also want to identify which keyword was identified in the string (if any present). What is the issue in my code and how can I make sure that it matches only complete words?
Regex solution
You'll need to implement word boundaries here:
import re
keywords = ['IO', 'IO Combination','CPI Combos']
words_flat = "|".join(r'\b{}\b'.format(word) for word in keywords)
rx = re.compile(words_flat)
string = "there is a IO competition coming in Summer 2018"
match = rx.search(string)
if match:
    print("Found: {}".format(match.group(0)))
else:
    print("Not found")
Here, your list is joined with | and \b on both sides.
Afterwards, you may search with re.search() which prints "Found: IO" in this example.
Even shorter with a direct comprehension:
rx = re.compile("|".join(r'\b{}\b'.format(word) for word in keywords))
Non-regex solution
Please note that you can even use a non-regex solution for single words, you just have to reorder your comprehension and use split() like
found = any(word in keywords for word in string.split())
if found:
    # do sth. here
Notes
The latter has the drawback that strings like
there is a IO. competition coming in Summer 2018
# ---^---
won't work while they do count as a "word" in the regex solution (hence the approaches yield different results). Additionally, because of the split() function, combined phrases like CPI Combos cannot be found. The regex solution also has the advantage of supporting case-insensitive matching (just pass flags=re.IGNORECASE).
It really depends on your actual requirements.
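To also report which keywords were found, ignoring case, something along these lines might work. This is a sketch, not the answerer's code: the function name find_keywords is hypothetical, and sorting the keywords longest-first is an added assumption so that 'IO Combination' is preferred over the shorter 'IO' when both could match at the same position:

```python
import re

def find_keywords(text, keywords):
    """Return every whole-word keyword occurrence found in `text`."""
    # Longest keywords first, so multi-word phrases win over their prefixes.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile(
        r'\b(?:{})\b'.format('|'.join(re.escape(k) for k in ordered)),
        re.IGNORECASE,
    )
    return pattern.findall(text)

keywords = ['IO', 'IO Combination', 'CPI Combos']
print(find_keywords("there is a IO competition coming in Summer 2018", keywords))
print(find_keywords("there is a competition coming in Summer 2018", keywords))
```

The returned strings are the text as it appears in the input, so with re.IGNORECASE they may differ in case from the entries in the keyword list.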
def find_index(mylist, mystring):
    for index, key in enumerate(mylist):
        if key.find(mystring) != -1:
            return index
    return None
It loops over your list and, for every item, checks whether your string is contained in the item. find() returns -1 when the string is not found, so any other value means it is contained, and in that case you get the index of the item where it was found with the help of enumerate().

python print all words from list of a line

I am trying to print all the keywords from a list that appear on each line. I have a list keywords=['bike','car','home']
and a text file:
every one has bike and car
i have a bike
i am trying to get a car and home
Coding:
keywords=['bike','car','home']
with open('qwe.txt','r') as file:
    for line in file:
        for key in keywords:
            if key in line:
                a = key
                break
            else:
                a = 'nil'
        print a
My output prints only one key, not all the keys present on the line!
My expected output is:
bike car
bike
car home
Instead, I now get:
bike
bike
car
How can I print all the keys from the lines? Please help! Answers will be appreciated!
You have two separate problems here, both of which need to be fixed.
First, as pointed out by @jonrsharpe, you break as soon as you find the first match, which means you're specifically telling Python to stop after the first match, so it's doing exactly what you ask. Just remove that break.
Second, as pointed out by @hasan, even without the break, for each key, you're replacing a, either with the new key, or with nil. So, you're just going to print out either home or nil every time. What you want to do is accumulate all of the matching keys. Like this:
matches = []
for key in keywords:
    if key in line:
        matches.append(key)
if matches:
    print ' '.join(matches)
else:
    print 'nil'
From your comment:
Small problem: if I have a new line such as 'car and bike', it prints 'bike car' instead of 'car bike'!
Think about what you're asking Python to do: You're going through the keys in keywords, and checking whether each one is found in line. So of course the order in which they're found will be the order they appear in keywords.
If you want them in the order they appear in the line instead, there are two options.
First, you can search through the line, looking for matches in keywords, instead of searching for keywords, looking for matches in line. If you only want to match complete words, and want duplicates to show up multiple times, this is dead simple:
for word in line.split():
    if word in keywords:
        matches.append(word)
If you want to match partial words (e.g., your existing code finds car in "this program is designed for scaring small children", but the code I just gave will not), you can search all substrings instead of all words:
for i in range(len(line)):
    for key in keywords:
        if line[i:].startswith(key):
            matches.append(key)
If you only want to find each word once, you can check if word not in matches before appending it.
And so on. Whatever you want to add, you have to think through what the rule is before you can turn it into code, but usually it won't be very hard.
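As an illustration of the "order of appearance, each word once" rule described above, here is a minimal sketch written as a standalone function (the name keywords_in_order is hypothetical; it matches complete words only):

```python
def keywords_in_order(line, keywords):
    """Return the keywords found in `line`, in the order they appear, once each."""
    matches = []
    for word in line.split():
        # Only whole words count, and repeats are skipped.
        if word in keywords and word not in matches:
            matches.append(word)
    return matches

print(keywords_in_order('car and bike and car', ['bike', 'car', 'home']))
```

Dropping the `word not in matches` check would make duplicates show up once per occurrence instead.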
You're assigning a to a new keyword each time, rather than storing each one you see. Maybe make a new list each time you look at a new line and append the words you find:
keywords=['bike','car','home']
with open('qwe.txt','r') as file:
    for line in file:
        a = []
        for key in keywords:
            if key in line:
                a.append(key)
        if len(a) > 0:
            print ' '.join(a)
You might also see if a list comprehension can build that array for you in a single line. Haven't tried it but it might be possible. Good luck!
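A list comprehension does appear to work for this. The sketch below is written for Python 3 (print as a function) and uses an in-memory list of lines instead of the file, purely so it is self-contained; matching_keywords is a hypothetical helper name:

```python
keywords = ['bike', 'car', 'home']

def matching_keywords(line, keywords):
    # Substring test, same as the loop version: keeps keyword-list order.
    return [key for key in keywords if key in line]

lines = [
    'every one has bike and car',
    'i have a bike',
    'i am trying to get a car and home',
]
for line in lines:
    found = matching_keywords(line, keywords)
    print(' '.join(found) if found else 'nil')
```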
If you don't mind the order, you can make your keywords a set and search the intersection with the set of words in line:
keywords = set(['bike','car','home'])
with open('qwe.txt', 'r') as file:
    for line in file:
        print ' '.join(keywords & set(line.split())) or 'nil'
This is faster when your lines and keywords are big, since set intersection uses hash lookups instead of comparing every keyword against every word.
Example
input
every one has bike and car
i have a bike
i am trying to get a car and home
i don't have any
output
car bike
bike
car home
nil

regular expression dictionary [Google type search and match with regular expressions]

EDIT: One of the main problems with the code below is due to storing regular expression objects in dictionaries, and how to access them to see if they can match another string. But I will still leave my previous question because I think there's probably an easy way to do all of this.
I would like to find a method in python which knows how to return a boolean of whether or not two strings are referring to the same thing. I know that this is difficult, if not completely absurd in programming, but I am looking into dealing with this problem using a dictionary of alternative strings that refer to the same thing.
Here are some examples, since I know this doesn't make a whole lot of sense without them.
If I give the string:
'breakingBad.Season+01 Episode..02'
Then I would like it to match the string:
'Breaking Bad S01E02'
Or 'three.BuCkets+of H2O' can match '3 buckets of water'
I know this is nearly impossible to do with regard to '3' and 'water' etc. being synonymous, but I am willing to provide these as dictionaries of relevant regular expression synonyms to the function if need be.
I have a feeling that there is a much simpler way to do this in python, as there always is, but here is what I have so far:
import re
def check_if_match(given_string, string_to_match, alternative_dictionary):
    print 'matching: ', given_string, ' against: ', string_to_match
    # split the string into its parts with pretty much any special character
    list_of_given_strings = re.split(' |\+|\.|;|,|\*|\n', given_string)
    print 'List of words retrieved from given string: '
    print list_of_given_strings
    check = False
    counter = 0
    for i in range(len(list_of_given_strings)):
        m = re.search(list_of_given_strings[i], string_to_match, re.IGNORECASE)
        m_alt = None
        try:
            m_alt = re.search(alternative_dictionary[list_of_given_strings[i]], string_to_match, re.IGNORECASE)
        except KeyError:
            pass
        if m or m_alt:
            if counter == len(list_of_given_strings)-1: check = True
            else: counter += 1
            print list_of_given_strings[i], ' found to match'
        else:
            print list_of_given_strings[i], ' did not match'
            break
    return check
string1 = 'breaking Bad.Season+01 Episode..02'
other_string_to_check = 'Breaking.Bad.S01+E01'
# make a dictionary of synonyms - here we should be saying that "S01" is equivalent to "Season 01"
alternative_dict = {re.compile(r'S[0-9]',flags=re.IGNORECASE):re.compile(r'Season [0-9]',flags=re.IGNORECASE),\
                    re.compile(r'E[0-9]',flags=re.IGNORECASE):re.compile(r'Episode [0-9]',flags=re.IGNORECASE)}
print check_if_match(string1, other_string_to_check, alternative_dict)
print
# another try
string2 = 'three.BuCkets+of H2O'
other_string_to_check2 = '3 buckets of water'
alternative_dict2 = {'H2O':'water', 'three':'3'}
print check_if_match(string2, other_string_to_check2, alternative_dict2)
This returns:
matching: breaking Bad.Season+01 Episode..02 against: Breaking.Bad.S01+E01
List of words retrieved from given string:
['breaking', 'Bad', 'Season', '01', 'Episode', '', '02']
breaking found to match
Bad found to match
Season did not match
False
matching: three.BuCkets+of H2O against: 3 buckets of water
List of words retrieved from given string:
['three', 'BuCkets', 'of', 'H2O']
three found to match
BuCkets found to match
of found to match
H2O found to match
True
I realize this probably means I am getting something wrong with the dictionary keys and values, but I feel like I am getting further away from a simple pythonic solution that has probably already been created.
Anyone have any thoughts?
I was tinkering with it and found some interesting things:
It might have to do with the way you are breaking up your initial words into lists
matching: breaking Bad.Season 1.Episode.1 against: Breaking.Bad.S1+E1
List of words retrieved from given string:
['breaking', 'Bad', 'Season', '1', 'Episode', '1']
I think you want it to be ..., 'Season 1', ... instead of having 'Season' and '1' be separate entries in the list.
You specify S[0-9], which only consumes a single digit; use S[0-9]+ if you need to cover double digits.
You are right about your regular expressions being stored in dictionaries; the mapping only applies in one direction. I was fiddling with the code (unfortunately I don't remember exactly what it was) by mapping r'Season [0-9]' to r'S[0-9]' instead of vice versa, and it was able to match Season.
Suggestions
Instead of mapping, have an equivalence class for each string type (e.g. title, season, episode) and have some matcher code for that.
Separate the parse and compare steps. Parse each string individually into a common format or object and then do a comparison
You might need to implement some sort of state machine to know that you are processing a season and expect to see a number in a particular format right after it.
You may want to use a third party tool instead; I've heard good things about Renamer
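The "parse, then compare" suggestion could be sketched roughly as follows. Everything here is an assumption made for illustration: the SYNONYMS table, the function names normalize and same_thing, and the choice of 'sNN'/'eNN' as the canonical form (short forms such as S01 are assumed to be zero-padded already):

```python
import re

# Hypothetical synonym table; the entries are illustrative only.
SYNONYMS = {'three': '3', 'h2o': 'water'}

def _pad(prefix):
    """Build a re.sub callback that rewrites a number as e.g. 's01'."""
    def repl(match):
        return '{}{:02d}'.format(prefix, int(match.group(1)))
    return repl

def normalize(text, synonyms=SYNONYMS):
    """Parse a free-form string into one canonical, separator-free string."""
    text = text.lower()
    # Rewrite long forms such as 'season+01' or 'episode..02' to 's01' / 'e02'.
    text = re.sub(r'season\W*(\d+)', _pad('s'), text)
    text = re.sub(r'episode\W*(\d+)', _pad('e'), text)
    # Keep only alphanumeric runs, mapping each through the synonym table.
    tokens = re.findall(r'[a-z0-9]+', text)
    return ''.join(synonyms.get(token, token) for token in tokens)

def same_thing(a, b):
    # The comparison step is trivial once both sides are normalized.
    return normalize(a) == normalize(b)

print(same_thing('breakingBad.Season+01 Episode..02', 'Breaking Bad S01E02'))
print(same_thing('three.BuCkets+of H2O', '3 buckets of water'))
```

Joining the tokens without separators is a deliberate simplification so that 'breakingBad' and 'Breaking Bad' compare equal; it would also treat unrelated strings that happen to concatenate identically as the same.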
