I am trying to print all the keywords from a list that appear on each line of a file. I mean: I have a list keywords=['bike','car','home']
and I have a text file:
every one has bike and car
i have a bike
i am trying to get a car and home
Coding:
keywords=['bike','car','home']
with open('qwe.txt','r') as file:
    for line in file:
        for key in keywords:
            if key in line:
                a = key
                break
            else:
                a = 'nil'
        print a
My output prints only one key per line, not all the keys from the list that are present on it!
I mean my expected output is:
bike car
bike
car home
Instead, I now get:
bike
bike
car
How can I print all the keys from the lines? Please help! Answers will be appreciated!
You have two separate problems here, both of which need to be fixed.
First, as pointed out by @jonrsharpe, you break as soon as you find the first match, which means you're specifically telling Python to stop after the first match, so it's doing exactly what you ask. Just remove that break.
Second, as pointed out by @hasan, even without the break, for each key, you're replacing a, either with the new key, or with nil. So, you're just going to print out either home or nil every time. What you want to do is accumulate all of the matching keys. Like this:
matches = []
for key in keywords:
    if key in line:
        matches.append(key)
if matches:
    print(' '.join(matches))
else:
    print('nil')
From your comment:
small problem,if i have new line as 'car and bike',then its first printing 'bike car' instead of 'car bike'!
Think about what you're asking Python to do: You're going through the keys in keywords, and checking whether each one is found in line. So of course the order in which they're found will be the order they appear in keywords.
If you want them in the order they appear in the line instead, there are two options.
First, you can search through the line, looking for matches in keywords, instead of searching for keywords, looking for matches in line. If you only want to match complete words, and want duplicates to show up multiple times, this is dead simple:
for word in line.split():
    if word in keywords:
        matches.append(word)
If you want to match partial words (e.g., your existing code finds car in "this program is designed for scaring small children", but the code I just gave will not), you can search all substrings instead of all words:
for i in range(len(line)):
    for key in keywords:
        if line[i:].startswith(key):
            matches.append(key)
If you only want to find each word once, you can check if word not in matches before appending it.
And so on. Whatever you want to add, you have to think through what the rule is before you can turn it into code, but usually it won't be very hard.
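For example, the rule "partial matches, each keyword at most once, in the order they first appear in the line" could be sketched by sorting the matching keywords by their first position. This is only one possible reading of the requirement, and the function name keywords_in_line_order is made up for illustration:

```python
def keywords_in_line_order(line, keywords):
    # Pair each keyword that occurs in the line with the index of its
    # first occurrence (substring match, like `key in line`).
    positions = [(line.find(key), key) for key in keywords if key in line]
    # Sorting the pairs orders the keywords by where they first appear.
    return [key for _, key in sorted(positions)]


keywords = ['bike', 'car', 'home']
print(' '.join(keywords_in_line_order('car and bike', keywords)) or 'nil')
```

Because each keyword is looked up once, a keyword that occurs twice in the line is still reported only once.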
You're assigning a to a new keyword each time, rather than storing each one you see. Maybe make a new list each time you look at a new line and append the words you find:
keywords=['bike','car','home']
with open('qwe.txt','r') as file:
    for line in file:
        a = []
        for key in keywords:
            if key in line:
                a.append(key)
        if len(a) > 0:
            print(' '.join(a))
You might also see if a list comprehension can build that list for you in a single line. Haven't tried it but it might be possible. Good luck!
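A list comprehension does work for this. Here is a minimal sketch; the helper name matches_for and the in-memory list of lines (standing in for the file) are assumptions made so the example runs on its own:

```python
keywords = ['bike', 'car', 'home']


def matches_for(line):
    # Keep every keyword that occurs somewhere in the line,
    # in the order the keywords are listed.
    return [key for key in keywords if key in line]


# Sample lines standing in for the contents of the text file.
for line in ['every one has bike and car', 'i have a bike']:
    print(' '.join(matches_for(line)))
```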
If you don't mind the order, you can make your keywords a set and search the intersection with the set of words in line:
keywords = set(['bike','car','home'])
with open('qwe.txt', 'r') as file:
    for line in file:
        print(' '.join(keywords & set(line.split())) or 'nil')
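This can be faster when your lines and keyword lists are big, since set membership tests take constant time on average, so you don't have to scan the whole keyword list for every word. Note that it only matches whole words, and the order of the output is arbitrary.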
Example
input
every one has bike and car
i have a bike
i am trying to get a car and home
i don't have any
output
car bike
bike
car home
nil
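If the order does matter, the set can still be used for the fast lookups while the words are collected in line order. This is a sketch of that variation, not part of the approach above; the name whole_word_matches is made up for illustration:

```python
def whole_word_matches(line, keywords):
    keyword_set = set(keywords)   # fast membership tests
    seen = set()                  # keywords already reported for this line
    ordered = []
    for word in line.split():
        if word in keyword_set and word not in seen:
            seen.add(word)
            ordered.append(word)
    return ordered


print(' '.join(whole_word_matches('every one has bike and car',
                                  ['bike', 'car', 'home'])) or 'nil')
```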
My Python exercise in 'classes' is as follows:
You have been recruited by your friend, a linguistics enthusiast, to create a utility tool that can perform analysis on a given piece of text. Complete the class "analyzedText" with the following methods:
Constructor (__init__) - This method should take the argument text, make it lowercase and remove all punctuation. Assume only the following punctuation is used: period (.), exclamation mark (!), comma (,), and question mark (?). Assign this newly formatted text to a new attribute called fmtText.
freqAll - This method should create and return a dictionary of all unique words in the text along with the number of times they occur in the text. Each key in the dictionary should be the unique word appearing in the text and the associated value should be the number of times it occurs in the text. Create this dictionary from the fmtText attribute.
This was my code:
class analysedText(object):
    def __init__ (self, text):
        formattedText = text.replace('.',' ').replace(',',' ').replace('!',' ').replace('?',' ')
        formattedText = formattedText.lower()
        self.fmtText = formattedText
    def freqAll(self):
        wordList = self.fmtText.split(' ')
        wordDict = {}
        for word in set(wordList):
            wordDict[word] = wordList(word)
        return wordDict
I get errors on both of these and I can't seem to figure it out after a lot of little adjustments. I suspect the issue in the first part is when I try to assign a value to the newly formatted text but I cannot think of a workable solution. As for the second part, I am at a complete loss - I was wrongfully confident my answer was correct but I received a fail error when I ran it through the classroom's code cell to test it.
On the assumption that by 'errors' you mean a TypeError, this is caused by the line wordDict[word] = wordList(word).
wordList is a list, and by using parentheses you're telling Python that you want to call that list as a function, which it cannot do.
According to your task, you are instead to find the number of occurrences of each word in the list, which you could achieve with the .count() method. This method returns the total number of occurrences of an element in a list. (Feel free to read more about it in the Python documentation for list.count.)
With this modification (assuming you want wordDict to be a dictionary with the word as the key and the number of occurrences as the value), your freqAll function would look something like this:
def freqAll(self):
    wordList = self.fmtText.split()
    wordDict = {}
    for word in set(wordList):
        # wordList.count(word) returns the number of times the string word appears as an element in wordList
        wordDict[word] = wordList.count(word)
    return wordDict
You could also achieve the same task with the collections.Counter class (this means you have to import collections), which you can read more about in the Python standard library documentation.
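A minimal sketch of the Counter variant follows. It keeps the class name from the question and assumes that "remove all punctuation" means deleting the four listed characters rather than replacing them with spaces; adjust that if your exercise expects otherwise:

```python
from collections import Counter


class analysedText(object):
    def __init__(self, text):
        # Assumption: punctuation is deleted, not replaced by a space.
        for mark in '.,!?':
            text = text.replace(mark, '')
        self.fmtText = text.lower()

    def freqAll(self):
        # Counter tallies each word; dict() turns it into a plain dictionary.
        return dict(Counter(self.fmtText.split()))


print(analysedText('Hello, hello world!').freqAll())
```

Calling split() with no argument also avoids the empty strings that split(' ') produces when two spaces are adjacent.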
difflib.get_close_matches can return a single close match when I supply the sample string and the candidate words. How can I use that close match to replace the string token it was found for?
# difflibQuestion.py
import difflib
word = ['Summerdalerise', 'Winterstreamrise']
line = 'I went up to Winterstreamrose.'
result = difflib.get_close_matches(line,word,n=1)
print(result)
Output:
['Winterstreamrise']
I want to produce the line:
I went up to Winterstreamrise.
For many lines and words.
I have checked the docs:
I can't find any reference to the string index of the match found by difflib.get_close_matches
the other module classes and functions return lists
I Googled "python replace word in line using difflib" etc. I can't find any reference to anyone else asking/writing about it. It would seem a common scenario to me.
This example is of course a simplified version of my 'real world' scenario. Which may be of help. Since I am dealing more with table data (rather than line)
Surname, First names, Street Address, Town, Job Description
And my 'words' are a large list of street base names eg MAIN, EVERY, EASY, LOVERS (without the Road, Street, Lane) So my difflib.get_close_matches could be used to substitute the string of column x 'line' with the closest match 'word'.
However I would appreciate anyone suggesting an approach to either of these examples.
You could try something like this:
import difflib
possibilities = ['Summerdalerise', 'Winterstreamrise']
line = 'I went up to Winterstreamrose.'
newWords = []
for word in line.split():
    result = difflib.get_close_matches(word, possibilities, n=1)
    newWords.append(result[0] if result else word)
result = ' '.join(newWords)
print(result)
Output:
I went up to Winterstreamrise
Explanation:
The docs show a first argument named word, and there is no suggestion that get_close_matches() has any awareness of sub-words within this argument; rather, it reports on the closeness of a match between this word atomically and the list of possibilities supplied as the second argument.
We can add the awareness of words within line by splitting it into a list of such words which we iterate over, calling get_close_matches() for each word separately and modifying the word in our result only if there is a match.
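Splitting on whitespace drops the trailing period, because 'Winterstreamrose.' is replaced as a whole. If the punctuation should survive, one option is to substitute only the runs of word characters. This is a sketch of that idea; the function name replace_close_matches and its cutoff parameter are made up for illustration, and the cutoff is simply passed through to get_close_matches:

```python
import difflib
import re


def replace_close_matches(line, possibilities, cutoff=0.6):
    def fix(match):
        word = match.group(0)
        close = difflib.get_close_matches(word, possibilities, n=1, cutoff=cutoff)
        # Keep the original word when nothing is close enough.
        return close[0] if close else word

    # Only runs of word characters are replaced; punctuation and spacing stay.
    return re.sub(r'\w+', fix, line)


print(replace_close_matches('I went up to Winterstreamrose.',
                            ['Summerdalerise', 'Winterstreamrise']))
```

The same function could be applied to a single column of table data by passing the cell text as line.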
I am trying to create a function to take in a string and return how many times a word in it has been used (with the word) as a dictionary. I also want it to look for a specific list of words to search up the string when provided and return the frequency of the words in the given list found in the string.
Example,
stringfunc = "I went to school today, to learn!"
print(wordfunc(stringfunc))
should return
{'i':1 , 'went':1, 'to':2, 'school':1, 'today':1, 'learn':1}
And,
stringfunc = "I went to school today, to learn!"
print(wordfunc(stringfunc,wordlist=["I", "feel", "Great"]))
should return
{'i':1, 'feel':0, 'great':0}
This is what I have so far
def wordfunc(stringfunc,wordlist=[]):
    count_dict = dict()
    stringfunc=stringfunc.lower() # i want it to be case insensitive
    word = stringfunc.split()
    for i in range(len(word)):
        x = ord(word[i][-1]) # in the next few lines I am trying to get rid of special characters
        if (not(x>=97 and x<=112) or (x>=65 and x<= 90)):
            word[i]=word[i][:-1] # if a word ends with , or ! i want it to discount last character
    for i in wordlist:
        if (i not in word):
            count_dict[i]=0
        else:
            count_dict[i]=word.count(i)
    return count_dict
When I try
stringfunc = "I went to school today, to learn!"
print(wordfunc(stringfunc,wordlist=["I", "feel", "Great"]))
I get
{'I':1, 'feel':0, 'Great':0} # i can't get a lower case i don't know why
and when I try
stringfunc = "I went to school today, to learn!"
print(wordfunc(stringfunc))
I get an empty dictionary {}
Can you help me identify my error? Thanks!
You "can't get lower case" because you didn't program it. If the input supplies wordlist, then you blithely accept whatever is there. In the given case, you have two words capitalized, so that's what comes out. Instead, you need to convert every element of wordlist to lower case, just as you did with the input string.
BTW, do not give misleading names to variables: stringfunc is not a function.
The main loop will be much easier to read if you quit playing games with ASCII code values. Instead, simply use the string method isalpha. If this is new to you, then I strongly recommend that you repeat your tutorial on string processing; you missed some useful things that you will now recognize.
That said, also look up the collections module, notably the Counter type. Once you've cleaned out all but letters and spaces in your input string, you can do the main processing with
count_dict = Counter(stringfunc.split())
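Putting those suggestions together, a sketch of the whole function might look like the following. The parameter name text and the choice to replace every non-letter with a space are assumptions, not something the question specifies:

```python
from collections import Counter


def wordfunc(text, wordlist=None):
    # Lower-case the input and turn everything that is not a letter
    # or whitespace into a space.
    cleaned = ''.join(ch if ch.isalpha() or ch.isspace() else ' '
                      for ch in text.lower())
    counts = Counter(cleaned.split())
    if wordlist is None:
        # No word list given: report every word in the text.
        return dict(counts)
    # Lower-case the requested words too; a Counter returns 0 for missing keys.
    return {word.lower(): counts[word.lower()] for word in wordlist}


sentence = "I went to school today, to learn!"
print(wordfunc(sentence))
print(wordfunc(sentence, wordlist=["I", "feel", "Great"]))
```

Using None as the default also avoids the mutable default argument wordlist=[] from the original.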
I have 2 strings, "loss of gene" and "aquaporin protein". I want to find whether both occur in a line of my file, within 5 words of each other.
Any ideas? I have searched extensively but cannot find anything.
Also, since these are multi-word strings, I cannot use abs(array.index) for the two (which was possible with single words).
Thanks
You could try the following approach:
First sanitise your text by converting it to lowercase, keeping only the word characters (letters, digits and underscores) and enforcing one space between each word.
Next, search for each of the phrases in the resulting text and keep a note of the starting index and the length of the phrase matched. Sort this index list.
Next make sure that all of the phrases were present in the text by making sure all found indexes are not -1.
If all are found count the number of words between the end of the first phrase, and the start of the last phrase. To do this take a text slice starting from the end of the first phrase to the start of the second phrase, and split it into words.
Script as follows:
import re
text = "The Aquaporin protein, sometimes 'may' exhibit a big LOSS of gene."
text = ' '.join(re.findall(r'\b(\w+)\b', text.lower()))
indexes = sorted((text.find(x), len(x)) for x in ['loss of gene', 'aquaporin protein'])
if all(i[0] != -1 for i in indexes) and len(text[indexes[0][0] + indexes[0][1] : indexes[-1][0]].split()) <= 5:
    print("matched")
To extend this to work on a file with a list of phrases, the following approach could be used:
import re
log = 'loss of gene'
phrases = ['aquaporin protein', 'another protein']
with open('input.txt') as f_input:
    for number, line in enumerate(f_input, start=1):
        # Sanitise the line
        text = ' '.join(re.findall(r'\b(\w+)\b', line.lower()))
        # Only process lines containing 'loss of gene'
        log_index = text.find(log)
        if log_index != -1:
            for phrase in phrases:
                phrase_index = text.find(phrase)
                if phrase_index != -1:
                    if log_index < phrase_index:
                        start, end = (log_index + len(log), phrase_index)
                    else:
                        start, end = (phrase_index + len(phrase), log_index)
                    if len(text[start:end].split()) <= 5:
                        print("line {} matched - {}".format(number, phrase))
                        break
This would give you the following kind of output:
line 1 matched - aquaporin protein
line 5 matched - another protein
Note, this will only spot one phrase pair per line.
I am not completely sure if this is what you want, but I'll give it a shot!
In Python, you can use "in" to check if a string is in another string. I am going to assume you already have a way to store a line from a file:
"loss of gene" in fileLine -> returns boolean (either True or False)
With this you can check if "loss of gene" and "aquaporin protein" are in your line from your file. Once you have confirmed that they are both there you can check their proximity by splitting the line of text into a list as so:
wordsList = fileLine.split()
If in your text file you have the string:
"The aquaporin protein sometimes may exhibit a loss of gene"
After splitting it becomes:
["The","aquaporin","protein","sometimes","may","exhibit","a","loss","of","gene"]
I'm not sure if that is a valid sentence but for the sake of example let's ignore it :P
Once you have the line of text split into a list of words and confirmed the words are in there, you can get their proximity with the index function that comes with lists in python!
wordsList.index("protein") -> returns index 2
After finding what index "protein" is at you can check what index "loss" is at, then subtract them to find out if they are within a 5 word proximity.
You can use the index function to discern if "loss of gene" comes before or after "aquaporin protein". If "loss of gene" comes first, index "gene" and "aquaporin" and subtract those indexes. If "aquaporin protein" comes first, index "protein" and "loss" and subtract those indexes.
You will have to do a bit more to ensure that you subtract indexes correctly if the words come in different orders, but this should cover the meat of the problem. Good luck Chahat!
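The index-based idea above could be sketched roughly as follows. The function names find_phrase and within_proximity are made up for illustration, and the sketch assumes the line has no punctuation attached to the words (otherwise strip it first):

```python
def find_phrase(words, phrase_words):
    # Return the index of the first word of the phrase, or -1 if absent.
    n = len(phrase_words)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == phrase_words:
            return i
    return -1


def within_proximity(line, phrase_a, phrase_b, max_gap=5):
    words = line.lower().split()
    a = phrase_a.lower().split()
    b = phrase_b.lower().split()
    ia, ib = find_phrase(words, a), find_phrase(words, b)
    if ia == -1 or ib == -1:
        return False
    # Count the words between the end of the earlier phrase
    # and the start of the later one.
    if ia <= ib:
        gap = ib - (ia + len(a))
    else:
        gap = ia - (ib + len(b))
    return 0 <= gap <= max_gap


line = "The aquaporin protein sometimes may exhibit a loss of gene"
print(within_proximity(line, "loss of gene", "aquaporin protein"))
```

Searching for the whole phrase as a sub-list avoids the problem that list.index only finds single words.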
I'm stuck on a simple problem. I've got a dictionary of words in the English language, and a sample text that is to be checked. I've got to check every word in the sample against the dictionary, and the code I'm using is wrong.
for word in checkList: # iterates through every word in the sample
    if word not in refDict: # checks if word is not in the dictionary
        print word # just to see if it's recognizing misspelled words
The only problem is, as it goes through the loop it prints out every word, not just the misspelled ones. Can someone explain this and offer a solution possibly? Thank you so much!
The snippet you have is functional. See for example:
>>> refDict = {'alpha':1, 'bravo':2, 'charlie':3, 'delta':4}
>>> s = 'he said bravo to charlie O\'Brian and jack Alpha'
>>> for word in s.split():
...     if word not in refDict:
...         print(repr(word))  # by temporarily using repr() we can see exactly
...                            # what the words are like
...
'he'
'said'
'to'
"O'Brian"
'and'
'jack'
'Alpha' # note how Alpha was not found in refDict (u/l case difference)
Therefore, the dictionary contents must differ from what you think, or the words in checkList are not exactly as they appear (e.g. with extra whitespace or different capitalization; see the use of repr() (*) in the print statement to help identify cases of the former).
Debugging suggestion: focus on the first word from checkList (or the first that you expect to be found in the dictionary). Then, for this word only, print it in detail, with its length and with a bracket on either side, for both the word from checkList and the corresponding key in the dictionary.
(*) repr() was a suggestion from John Machin. Instead I often use brackets or other characters as in print('[' + word + ']'), but repr() is more exacting in its output.
Consider stripping your words of any whitespace that might be there, and changing all the words of both sets to the same case. Like this:
word.strip().lower()
That way you can make sure you're comparing apples to apples.
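Applied to the whole check, that normalisation might look like this sketch. The function name misspelled and the sample data are made up; it assumes the reference words are the keys of the dictionary (or the items of any other iterable):

```python
def misspelled(check_list, ref_words):
    # Normalise the reference words once and keep them in a set.
    known = set(word.strip().lower() for word in ref_words)
    # Normalise each checked word the same way before comparing.
    return [word for word in check_list if word.strip().lower() not in known]


refDict = {'alpha': 1, 'bravo': 2}
print(misspelled(['Alpha ', 'bravo', 'jack'], refDict))
```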
Clearly "word not in refDict" always evaluates to True. This is probably because the contents of refDict or checkList are not what you think they are. Are they both tuples or lists of strings?
The code you have would work if the keys in refDict are the correctly spelt words. If the correctly spelt words are the values in your dict then you need something like this:
for word in checkList:
    if word not in refDict.values():
        print(word)
Is there a reason your dictionary is stored as a mapping as opposed to a list or a set? A Python dict contains name-value pairs; for example, I could use this mapping: {"dog":23, "cat":45, "pony":67} to store an index of a word and the page number it is found on in some book. In your case, your dict is a mapping of what to what?
Are the words in the refDict the keys or the values?
Your code will only see keys: e.g.:
refDict = { 'w':'x', 'y':'z' }
for word in [ 'w','x','y','z' ]:
    if word not in refDict:
        print(word)
prints:
x
z
Otherwise you want:
if word not in refDict.values():
Of course this rather assumes that your dictionary is an actual python dictionary which seems an odd way to store a list of words.
Your refDict is probably wrong. The in keyword checks if the value is in the keys of the dictionary. I believe you've put your words in as values.
I'd propose using a set instead of a dictionary.
knownwords = set(["dog", "cat"])  # set() takes a single iterable, not several arguments
knownwords.add("apple")
text = "The dog eats an apple."
for word in text.split(" "):
    # to ignore case word is converted to lowercase
    if word.lower() not in knownwords:
        print(word)
# The
# eats
# an
# apple. <- doesn't work because of the dot
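The trailing dot can be handled by stripping punctuation from both ends of each word before the lookup. A minimal sketch, with the function name unknown_words made up for illustration:

```python
import string

known_words = {"dog", "cat", "apple"}


def unknown_words(text, known):
    result = []
    for word in text.split():
        # Remove leading/trailing punctuation and ignore case for the lookup.
        bare = word.strip(string.punctuation).lower()
        if bare and bare not in known:
            result.append(word)
    return result


print(unknown_words("The dog eats an apple.", known_words))
```

Note that strip only touches the ends of the word, so an apostrophe inside a word such as "don't" is left alone.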