nltk comparing tokens ( == returning false when "true")

I think I may be fundamentally confused about something in python or nltk. I'm generating a list of tokens from a paper abstract, and attempting to see if a search word is contained by the tokens. I do know about concordance, but it doesn't work well with my intended use of the comparison.
Here is my code:
def tokenize(text):
    tokens = nltk.word_tokenize(text.get_text())
    return tokens

def search_abstract_single_word(tokens, keyword):
    match = 0
    for token in tokens:
        if token == keyword:
            match += 1
    return match

def search_file_single_word(abstract_list, keyword):
    matches = list()
    for item in abstract_list:
        tokens = tokenize(item)
        match = search_abstract_single_word(tokens, keyword)
        matches.append(match)
    return matches
I've confirmed that the tokens and keyword being passed in are correct, but match (and thus the entire list of matches) always evaluates to zero. My understanding was that word_tokenize returns a list of strings, so I don't see why, for example, when token = computer and keyword = computer, token == keyword does not return True and increment match.
EDIT: In a standalone class/main method this code does appear to work. However, the code is being called from a tkinter window like so:
self.keywords = ""
....
self.keywords_box = Text(self.Frame2)
....
self.Submit = Button(master)
self.Submit.configure(command=self.submit)
....
#triggered by submit button
def submit(self):
    self.keywords += self.keywords_box.get("1.0", END)
#triggered by run button after keyword saved
def run(self):
    search_input = self.keywords
    ....
    #use pandas to read excel file, create abstracts, and store
    ....
    matches = search_file_single_word(abstract_list, search_input)
    for match in matches:
        self.output_box.insert(END, match)
        self.output_box.insert(END, '\n')
Because print(keyword) produced the expected output when I inserted it into search_file_single_word, I had assumed the value was being passed correctly. Is it actually just passing the tkinter property along and refusing to evaluate it against the token?

Moral of the story: be careful with options. textbox.get("1.0", END) returns the text with a trailing newline character, and "string" != "string\n", so the keyword never equals any token. Solution found in answer to this post
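A minimal sketch of the fix, using plain strings so it runs without a tkinter window. The helper name count_matches and the sample tokens are made up for illustration; the only point is that the keyword needs its trailing newline removed before comparison. In the tkinter code itself, reading up to the index "end-1c" instead of END should avoid picking up the newline in the first place.

```python
def count_matches(tokens, keyword):
    """Count tokens equal to the keyword, ignoring whitespace around the keyword."""
    keyword = keyword.strip()  # drops the trailing "\n" that Text.get("1.0", END) includes
    return sum(1 for token in tokens if token == keyword)


# Hypothetical data: tokens as word_tokenize might return them,
# and a keyword as it would come back from the Text widget.
tokens = ["the", "computer", "is", "a", "computer"]
raw_keyword = "computer\n"

print(raw_keyword == "computer")           # False: the newline makes the strings differ
print(count_matches(tokens, raw_keyword))  # 2
```

Stripping inside the search function keeps the comparison robust even if the keyword arrives from somewhere other than the text box.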

Related

Converting a file in list format into a dictionary with multiple conditions. (python)

Disclaimer: sorry if I have not explained my issue clearly. The terminology is still new to me. Thank you in advance for reading.
I have a function named
def pluralize(word)
The aim is to pluralize all nouns within a file. The output I want is: {'plural': word_in_plural, 'status': x}
word_in_plural is the pluralized version of the input argument (word), and x is a string that can have one of the following values: 'empty_string', 'proper_noun', 'already_in_plural', 'success'.
My code so far looks like this:
filepath = '/proper_noun.txt'

def pluralize(word):
    # read the file in as a list each time the function is called
    with open(filepath) as f:
        proper_nouns = [line.strip() for line in f]
    # result dictionary; both values are filled in by the conditions below
    dictionary = {'plural': '', 'status': ''}
    # if word is an empty string, return word_in_plural = '' and x = 'empty_string'
    if word == '':
        dictionary['plural'] = ''
        dictionary['status'] = 'empty_string'
        return dictionary
What you can see above is my attempt at writing a condition that returns the specified values if the word is an empty string.
The next goal is to create a condition so that if word is already in plural (assuming it ends with 's', 'es', 'ies', etc.), the function returns a dictionary with the values word_in_plural = word and x = 'already_in_plural'. So the input word remains untouched, e.g. (input: apartments, output: apartments).
if word ### is already in plural (ends with a plural suffix), function returns a dictionary with values word_in_plural = word and x = 'already_in_plural'
Any ideas on how to read the last characters of the string to implement the rules? I also very much doubt my logic.
Thank you for your input, SOF community.

You can index the word by -1 to get its last character. You can slice a string to get the last two [-2:] or last three [-3:] characters:
last_char = word[-1]
last_three_char = word[-3:]
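As an alternative to slicing, str.endswith accepts a tuple of suffixes, which maps directly onto the "already in plural" check. The sketch below is only one way to arrange the conditions: the PLURAL_ENDINGS tuple, the proper_nouns parameter (standing in for the list read from the file), and the "add an s" rule for the success case are all assumptions, since the question does not spell them out.

```python
# Assumed suffixes, taken from the question's "ends with 's', 'es', 'ies'".
# Note that "es" and "ies" already end in "s", so ("s",) alone would behave the same.
PLURAL_ENDINGS = ("s", "es", "ies")


def pluralize(word, proper_nouns=()):
    """Return {'plural': ..., 'status': ...} following the rules described in the question."""
    if word == "":
        return {"plural": "", "status": "empty_string"}
    if word in proper_nouns:
        return {"plural": word, "status": "proper_noun"}
    if word.endswith(PLURAL_ENDINGS):  # same test as word[-1] == "s" here
        return {"plural": word, "status": "already_in_plural"}
    # Assumption: the simplest possible rule, just append "s".
    return {"plural": word + "s", "status": "success"}


print(pluralize("apartments"))  # {'plural': 'apartments', 'status': 'already_in_plural'}
print(pluralize("apartment"))   # {'plural': 'apartments', 'status': 'success'}
```

A suffix check like this will misclassify singular words that happen to end in "s" (for example "bus"), so the doubt about the logic expressed in the question is justified.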

Remove certain characters if they are not in a specific location in a string #python

I am trying to figure out the following function from my Python class. I've gotten the code to remove the three letters, but from exactly where they don't want me to, i.e. it removes WGU from the first line, where it's supposed to stay, but not from WGUJohn.
# Complete the function to remove the word WGU from the given string
# ONLY if it's not the first word and return the new string
def removeWGU(mystring):
    #if mystring[0]!= ('WGU'):
        #return mystring.strip('WGU')
    #if mystring([0]!= 'WGU')
        #return mystring.split('WGU')
    # Student code goes here

# expected output: WGU Rocks
print(removeWGU('WGU Rocks'))
# expected output: Hello, John
print(removeWGU('Hello, WGUJohn'))

Check this one:
def removeWGU(mystring):
    s = mystring.split()
    if s[0] == "WGU":
        return mystring
    else:
        return mystring.replace("WGU","")

print(removeWGU('WGU Rocks'))
print(removeWGU('Hello, WGUJohn'))

def removeWGU(mystring):
    # [:1] instead of [0] so that an empty string does not raise IndexError
    return mystring[:1] + mystring[1:].replace("WGU","")
The other responses I have seen wouldn't work in an edge case where the text contains multiple occurrences of "WGU", one of them at the beginning, such as:
print(removeWGU("WGU, something else, another WGU..."))
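To make the edge case concrete, here is a small runnable check of the slicing approach. It uses mystring[:1] rather than mystring[0] so that an empty string is handled; that change is my own assumption about what is wanted, not part of the original exercise.

```python
def removeWGU(mystring):
    """Remove every "WGU" except one that starts at the very first character."""
    # Keeping the first character aside means a leading "WGU" can never match
    # inside the remainder, because its "W" is no longer part of it.
    return mystring[:1] + mystring[1:].replace("WGU", "")


print(removeWGU("WGU Rocks"))                            # WGU Rocks
print(removeWGU("Hello, WGUJohn"))                       # Hello, John
print(removeWGU("WGU, something else, another WGU..."))  # WGU, something else, another ...
```

This protects only a "WGU" at index 0. If the string could start with whitespace before the first word, the leading "WGU" would be removed as well, so the input would need stripping first.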

Python: use a list index as a function argument

I'm trying to use list indices as arguments for a function that performs regex searches and substitutions over some text files. The different search patterns have been assigned to variables and I've put the variables in a list that I want to feed the function as it loops through a given text.
When I call the function using a list index as an argument nothing happens (the program runs, but no substitutions are made in my text files), however, I know the rest of the code is working because if I call the function with any of the search variables individually it behaves as expected.
When I give the print function the same list index as I'm trying to use to call my function it prints exactly what I'm trying to give as my function argument, so I'm stumped!
search1 = re.compile(r'pattern1')
search2 = re.compile(r'pattern2')
search3 = re.compile(r'pattern3')
searches = ['search1', 'search2', 'search2']
i = 0
for …
…
def fun(find)
…
fun(searches[i])
if i <= 2:
i += 1
…
As mentioned, if I use fun(search1) the script edits my text files as wished. Likewise, if I add the line print(searches[i]) it prints search1 (etc.), which is what I'm trying to give as an argument to fun.
Being new to Python and programming, I have a limited investigative skill set. After poking around as best I could, I ran print(searches.index(search1)) and got a "pattern1 is not in list" error, so my leading (and only) theory is that I'm giving my function the actual regular expression rather than the variable it's stored in.
Much thanks for any forthcoming help!

Try changing your searches list to [search1, search2, search3] instead of ['search1', 'search2', 'search2']. The quoted version holds plain strings, not the compiled regex objects.
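The difference can be seen in a short sketch. The patterns, the sample text, and the body of fun are invented for illustration, since the question elides them; the point is only that a quoted name is treated as a literal pattern, while the unquoted name refers to the compiled object.

```python
import re

# Hypothetical patterns standing in for the question's pattern1, pattern2.
search1 = re.compile(r"\d+")
search2 = re.compile(r"[aeiou]")

as_strings = ["search1", "search2"]  # just the text "search1", "search2"
as_patterns = [search1, search2]     # the compiled pattern objects


def fun(find, text):
    """Replace every match of `find` in `text` with '#'."""
    return re.sub(find, "#", text)


text = "room 101"
print(fun(as_strings[0], text))   # room 101  (the literal text "search1" never occurs)
print(fun(as_patterns[0], text))  # room #
```

Printing searches[i] shows search1 in the first case because that is the content of the string, which is why the print check in the question looked correct.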

Thanks to all for the help. eyl327's comment that I should use a list or dictionary to store my regular expressions pointed me in the right direction.
However, because I was using regex in my search patterns, I couldn't get it to work until I also created a list of compiled expressions (discovered via this thread on stored regex strings).
I very much appreciate juanpa.arrivillaga's point that I should have provided an MRE (please forgive me; with a highly limited skill set, this in itself can be hard to do). I'll just give an excerpt of a slightly amended version of my actual code demonstrating the answer (once again, please forgive its long-windedness; I'm not presently able to do anything more elegant):
…
# put regex search patterns in a list
rawExps = ['search pattern 1', 'search pattern 2', 'search pattern 3']
# create a new list of compiled search patterns
compiledExps = [regex.compile(expression, regex.V1) for expression in rawExps]
i = 0
storID = 0
newText = ""
for file in filepathList:
    for expression in compiledExps:
        with open(file, 'r') as text:
            thisText = text.read()
            lines = thisText.splitlines()
            setStorID = regex.search(compiledExps[i], thisText)
            if setStorID is not None:
                storID = int(setStorID.group())
            for line in lines:
                def idSub(find):
                    global storID
                    global newText
                    match = regex.search(find, line)
                    if match is not None:
                        newLine = regex.sub(find, str(storID), line) + "\n"
                        newText = newText + newLine
                        storID = plus1(int(storID), 1)
                    else:
                        newLine = line + "\n"
                        newText = newText + newLine
                # list index number can be used as an argument in the function call
                idSub(compiledExps[i])
        if i <= 2:
            i += 1
        write()
        newText = ""
    i = 0

switch multiple words in 1 string with variable, python

class Cleaner:
    def __init__(self, forbidden_word = "frack"):
        """ Set the forbidden word """
        self.word = forbidden_word

    def clean_line(self, line):
        """Clean up a single string, replacing the forbidden word by *beep!*"""
        found = line.find(self.word)
        if found != -1 :
            return line[:found] + "*beep!*" + line[found+len(self.word):]
        return line

    def clean(self, text):
        for i in range(len(text)):
            text[i] = self.clean_line(text[i])

example_text = [
    "What the frack! I am not going",
    "to honour that question with a response.",
    "In fact, I think you should",
    "get the fracking frack out of here!",
    "Frack you!"
]
Hi everyone. The issue with this code is that when I run it, I get the following result:
What the *beep!*! I am not going
to honour that question with a response.
In fact, I think you should
get the *beep!*ing frack out of here!
Frack you!
On the second-to-last line, one of the "frack"s is not being changed.
I have tried using the "if ... in line" approach, but this doesn't work with variables. So how do I use an if statement that checks a variable instead of a string literal, and also changes every word that needs changing?
PS. It's exam practice; I didn't write the code myself.
The expected outcome should be:
What the *beep!*! I am not going
to honour that question with a response.
In fact, I think you should
get the *beep!*ing *beep!* out of here!
Frack you!

That's because line.find(...) will only return the first result, which you then replace with "*beep!*" and then return, thus missing other matches.
Either use find iteratively, passing in the appropriate start index each time until the start index exceeds the length of the line, or use Python's replace method to do all of that for you.
I'd recommend replacing:
found = line.find(self.word)
if found != -1 :
    return line[:found] + "*beep!*" + line[found+len(self.word):]
return line
with
return line.replace(self.word, "*beep!*")
Which will automatically find all matches and do the replacement.
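For completeness, the first option (calling find repeatedly with a moving start index) could look roughly like the sketch below. It is written as a standalone function rather than a method, and the parameter names and defaults are my own; in practice replace does the same job in one call.

```python
def clean_line(line, word="frack", beep="*beep!*"):
    """Replace every occurrence of `word` in `line` by calling find in a loop."""
    if not word:
        return line  # an empty search word would never advance the start index
    pieces = []
    start = 0
    while True:
        found = line.find(word, start)  # search only from `start` onwards
        if found == -1:
            pieces.append(line[start:])  # no more matches: keep the remainder
            break
        pieces.append(line[start:found])  # text before the match
        pieces.append(beep)               # the replacement
        start = found + len(word)         # continue after the match
    return "".join(pieces)


print(clean_line("get the fracking frack out of here!"))
# get the *beep!*ing *beep!* out of here!
```

Like replace, this is case-sensitive, so "Frack you!" is left unchanged, which matches the expected output in the question.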

Search in matrix - python

I need to write a function that will search for words in a matrix. For the moment I'm trying to search line by line to see if the word is there. This is my code:
def search(p):
    w=[]
    for i in p:
        w.append(i)
    s=read_wordsearch() #This is my matrix full of letters
    for line in s:
        l=[]
        for letter in line:
            l.append(letter)
            if w==l:
                return True
            else:
                pass
This code works only if my word begins in the first position of a line.
For example, I have this matrix:
[[a,f,l,y],[h,e,r,e],[b,n,o,i]]
I want to find the word "fly" but can't, because my code only finds words like "here" or "her" that begin in the first position of a line.
Any help, hint, or advice would be appreciated. (And sorry if my English is bad.)

You can convert each line in the matrix to a string and try to find the search word in it.
def search(p):
    s=read_wordsearch()
    for line in s:
        if p in ''.join(line):
            return True

I'll give you a tip for searching within a text for a word. I think you will be able to extrapolate to your data matrix.
s = "xxxxxxxxxhiddenxxxxxxxxxxx"
target = "hidden"
# + 1 so that a match at the very end of s is also checked
for i in range(len(s) - len(target) + 1):
    if s[i:i+len(target)] == target:
        print("Found it at index", i)
        break
If you want to search for words of any length, for instance because you have a list of possible solutions:
s = "xxxxxxxxxhiddenxxxtreasurexxxxxxxx"
targets = ["hidden","treasure"]
for i in range(len(s)):
    # j runs to len(s) inclusive so that substrings ending at the last character are covered
    for j in range(i + 1, len(s) + 1):
        if s[i:j] in targets:
            print("Found", s[i:j], "at index", i)

def search(p):
    w = ''.join(p)
    s=read_wordsearch() #This is my matrix full of letters
    for line in s:
        word = ''.join(line)
        if word.find(w) >= 0:
            return True
    return False
Edit: there are already a lot of string functions available in Python. You just need to use strings to be able to use them.

Join the characters in each inner list to create a word, then search it with in.
def search(word, data):
    return any(word in ''.join(characters) for characters in data)

data = [['a','f','l','y'], ['h','e','r','e'], ['b','n','o','i']]
if search('fly', data):
    print('found')
data contains the matrix, and characters is the name of each individual inner list. any stops as soon as it finds the first match (short circuit).
