I've been looking into developing a Discord bot that can reply to messages by reading their contents and checking whether they appear in a list.
My problem is that I need a reliable way to get Python to look for certain words in a text, check whether they appear in a given list, and output the words that are detected.
I've managed to get it partially working myself with the following code:
if any(word in text in list):
    print("Word Spotted")
I would really appreciate some help.
Here's some code that does something like what you're describing. Really, though, it sounds like you need to spend a significant amount of time working through some basic Python tutorials before you will be able to implement this.
import re

key_words = {'foo', 'bar', 'baz'}
typed_str = 'You are such a Foo BAR!'
print(key_words & set(re.findall('[a-z]+', typed_str.lower())))
I'm not sure exactly what is being asked, but here are some things to consider (in no particular order) if you are building a bot that takes in raw user input:
capitalization sensitivity
spell check
understanding intent simplistically
If your environment allows access to libraries, you might consider checking out TextBlob. The following commands will give you the functionality needed for the example below.
pip install textblob
python -m textblob.download_corpora
core function
from textblob import TextBlob, Word
import copy

def score_intent(rawstring, keywords, weights=None, threshold=0.01, debug=False):
    """
    rawstring: string of text with words that you want to detect
    keywords: list of words that you are looking for
    weights: (optional) dictionary with relative weights of words you want
    threshold: spellcheck confidence threshold
    debug: boolean for extra print statements to help debug
    """
    allwords = TextBlob(rawstring).words
    allwords = [w.upper() for w in allwords]
    keywords = [k.upper() for k in keywords]
    processed_input_as_list = spellcheck_subject_matter_specific(rawstring, keywords, threshold=threshold, debug=debug)
    common_words = intersection(processed_input_as_list, keywords)
    intent_score = len(common_words)
    if weights:
        for special_word in weights.keys():
            if special_word.upper() in common_words:
                # the minus one is so we don't double count a word
                intent_score = intent_score + weights[special_word] - 1
    if debug:
        print("intent score: %s" % intent_score)
        print("words of interest found in text: {}".format(common_words))
    # you could return common_words and score intent based on the list:
    # return common_words, intent_score
    return common_words
utilities for intersection & spellchecking
def intersection(a, b):
    """
    a and b are lists
    function returns a list that is the intersection of the two
    """
    return list(set(a) & set(b))

def spellcheck_subject_matter_specific(rawinput, subject_matter_vector, threshold=0.01, capitalize=True, debug=False):
    """
    rawinput: all the text that you want to check for spelling
    subject_matter_vector: only the words that are worth spellchecking for
        (since the function can be sort of sensitive, it might correct words
        that you don't want to correct)
    threshold: the spell check confidence needed to update the word to the correct spelling
    capitalize: boolean determining if you want the returned words to be capitalized
    """
    new_input = copy.copy(rawinput)
    for w in TextBlob(rawinput).words:
        spellchecked_vec = w.spellcheck()
        if debug:
            print("Word: %s" % w)
            print("Spellchecked Guesses & Confidences: %s" % spellchecked_vec)
            print("Only spellchecked confidences greater than {} and in this list {} will be included".format(threshold, subject_matter_vector))
        corrected_words = [z[0].upper() for z in spellchecked_vec if z[1] > threshold]
        important_words = intersection(corrected_words, subject_matter_vector)
        for new_word in important_words:
            new_input = new_input + ' ' + new_word
    inputBlob = TextBlob(new_input)
    processed_input = inputBlob.words
    if capitalize:
        processed_input = [word.upper() for word in processed_input]
    return processed_input
Usage Example
discord_str = "Hi, i want to talk about codee and pYtHon"
words2detect = ["python","code"]
score_intent(rawstring=discord_str,keywords=words2detect,threshold=0.01,debug=True)
output
intent score: 2
words of interest found in text: ['PYTHON', 'CODE']
Python - Identify certain keywords in a user's input, to then lead to an answer. For example, the user inputs "There is no display on my phone".
The keywords 'display' and 'phone' would link to a set of solutions.
I just need help finding a general idea of how to identify keywords and then lead to a set of solutions. I would appreciate any help.
Use the NLTK library and import its stopwords.
Write code that removes a word from your text if it is in the stopword list; what remains is the filtered output.
Also, you can make a negative list file containing all the words, apart from stopwords, that you want to remove, and extend the stopword list with these words before running the filtering step. That should give you much cleaner output. A minimal sketch follows.
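A minimal sketch of that stopword-filtering idea (assuming the NLTK data has been downloaded with nltk.download('stopwords') and nltk.download('punkt'); the extra_noise list is a stand-in for the "negative list" file described above):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# extend the standard stopword list with a custom "negative list"
extra_noise = ['please', 'hello']  # hypothetical extra words to drop
stop_words = set(stopwords.words('english')) | set(extra_noise)

text = "There is no display on my phone"
filtered = [w for w in word_tokenize(text) if w.lower() not in stop_words]
print(filtered)  # ['display', 'phone']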
A simple way, if you don't want to use any external libraries, would be the following.
def bool_to_int(flags):  # renamed from `list` to avoid shadowing the built-in
    # treat the list of 0/1 flags as bits and pack them into one integer
    num = 0
    for k, v in enumerate(flags):
        if v == 1:
            num += 2 ** k
    return num

def take_action(code):
    if code == 1:
        pass  # do this
    elif code == 2:
        pass  # do this
    ...  # more branches, one per keyword combination

keywords = ['display', 'phone', ...]  # extend with the rest of your keywords

list_of_words = data.split(" ")  # data is the raw user input string
code = [0] * len(keywords)
for i in list_of_words:
    if i in keywords:
        idx = keywords.index(i)
        code[idx] = 1
code = bool_to_int(code)
take_action(code)
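For example, with the question's own sentence both keyword bits get set and the packed code is 3 (a small illustration reusing bool_to_int from above):

keywords = ['display', 'phone']
data = "There is no display on my phone"

code = [0] * len(keywords)
for word in data.split(" "):
    if word in keywords:
        code[keywords.index(word)] = 1

print(code)               # [1, 1]
print(bool_to_int(code))  # 1*2**0 + 1*2**1 == 3, so take_action(3) handles this case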
I have a sentence called "myString". What I am trying to do is create a dictionary from the sentence where the first character of each word is a key of the dictionary ('white' gives 'w'), and all words starting with that character are the values for that key ('w': ['white', 'with']).
I have already written some Python code. I want to know which code snippet is better, or whether there is a better approach to this problem, like a dictionary comprehension.
Output I want to generate:
{'w': ['white', 'with', 'well'], 'h': ['hats', 'hackers', 'hackers', 'hackable', 'hacker', 'hired'], ...}
myString = "White hats are hackers employed with the efforts of
keeping data safe from other hackers by looking for loopholes and
hackable areas This type of hacker typically gets paid quite well and
receives no jail time due to the consent of the company that hired
them"
counterDict = {}
for word in myString.lower().split():
fChar = word[0]
if fChar not in counterDict:
counterDict[fChar] = []
counterDict[fChar].append(word)
print(counterDict)
Using the dict.setdefault method (note that counterDict.get(word, []).append(word) would not work here: it looks up the whole word instead of its first character, and the fresh list returned on a miss is never stored back in the dictionary):

counterDict = {}
for word in myString.lower().split():
    fChar = word[0]
    counterDict.setdefault(fChar, []).append(word)
print(counterDict)
collections.defaultdict()
import collections

counterDict = collections.defaultdict(list)
for word in myString.lower().split():
    fChar = word[0]
    counterDict[fChar].append(word)
print(counterDict)
collections.defaultdict() + list comprehension

import collections

counterDict = collections.defaultdict(list)
[counterDict[word[0]].append(word) for word in myString.lower().split()]
print(counterDict)
You can use dict comprehension for assigning default values to the counterDict, and then append:
myString = "White hats are hackers employed with the efforts of
keeping data safe from other hackers by looking for loopholes and
hackable areas This type of hacker typically gets paid quite well and
receives no jail time due to the consent of the company that hired
them"
new_string = myString.split()
counterDict = {i[0].lower():[] for i in new_string}
for i in new_string:
counterDict[i[0].lower()].append(i)
This should work for your purpose:
from collections import defaultdict

counter_dict = defaultdict(list)
word_list = [(word[0], word) for word in my_string.lower().split()]  # pair each word's first letter with the word
for letter, word in word_list:
    counter_dict[letter].append(word)
In case you like one-liners and are obsessed with comprehensions (like me), you can combine a dictionary comprehension with a list comprehension:

new_string = myString.lower().split()  # helps readability
counterDict = {i[0]: [z for z in new_string if z[0] == i[0]] for i in new_string}

Note that this re-scans the whole word list once per word, so it is quadratic; that's fine for a sentence, but the defaultdict version scales better.
The desired result is either a function or a way to find where a sentence occurs within a list of strings.

sentence = 'The cat went to the pool yesterday'
structure = ['The cat went,', 'to the pool yesterday.', 'I wonder if you realize the effect you are having on me. It hurts. A lot.']

for example

def findsentence(sentence, list_of_strings):
    # do something to get the output: a vector of positions where the sentence is found in the string list
    return output

findsentence(sentence, structure)
> (0, 1)  # because the phrase is split across the list...

Caution!!
The challenge is not to find the sentence exactly. Look at the example: the sentence is partly in structure position 0 and partly in structure position 1.
So this is not a simple string-manipulation problem.
Use the following:

sentence = "foo sam bar go"
structure = ["rq", "foo sam", "bar go", "ca", "da"]

def findsentencelist(sentence, list_of_strings):
    l = []
    for item in list_of_strings:
        if item in sentence:
            l.append(list_of_strings.index(item))
    return l

print(findsentencelist(sentence, structure))

Hopefully this will help you, Yahli.
EDIT:
There is a problem with your variables. Your sentence MUST be a string, not a list. Edit your variables and try this function again :)
SECOND EDIT:
I think I've finally understood what you're trying to do. Let me know if this one works better.
THIRD EDIT:
Jesus, hopefully this one solves your problem. Let me know if it did the trick :)
I just remove punctuation in structure to make it work:

sentence = 'The cat went to the pool yesterday'
structure = ['The cat went,', 'to the pool yesterday.', 'I wonder if you realize the effect you are having on me. It hurts. A lot.', 'Life is too short as it is. In short, she had a cushion job.']

import string

def findsentence(sentence, list_of_strings):
    # str.translate with a maketrans table strips punctuation (Python 3;
    # the Python 2 spelling was s.translate(None, string.punctuation))
    table = str.maketrans('', '', string.punctuation)
    return tuple(i for i, s in enumerate(list_of_strings) if s.translate(table) in sentence)

print(findsentence(sentence, structure))
# (0, 1)
After removing the punctuation, you can use this code to get the indices:

for i, j in enumerate(structure):
    if j in sentence:
        print(i)

Hope this solves your problem. There are quite a few other solutions, as Python is flexible.
I found text-summarization code on GitHub, and I want to turn this program into a Tkinter program. My problem is getting a value from a class method via a Button widget and showing the result in a Text widget. How do I get the value of the summarize method in this code using a Tkinter button? I usually only use functions and procedures, never classes and methods. This code already runs in the interpreter.
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords

class NaiveSummarizer:

    def summarize(self, input, num_sentences):
        punt_list = ['.', ',', '!', '?']
        summ_sentences = []
        sentences = sent_tokenize(input)
        lowercase_sentences = [sentence.lower() for sentence in sentences]
        # print(lowercase_sentences)

        s = list(input)
        ts = ''.join([o for o in s if o not in punt_list]).split()
        lowercase_words = [word.lower() for word in ts]
        words = [word for word in lowercase_words if word not in stopwords.words()]
        word_frequencies = FreqDist(words)
        most_frequent_words = [pair[0] for pair in word_frequencies.most_common(100)]

        # add sentences with the most frequent words
        for word in most_frequent_words:
            for i in range(0, len(lowercase_sentences)):
                if len(summ_sentences) < num_sentences:
                    if (lowercase_sentences[i] not in summ_sentences and word in lowercase_sentences[i]):
                        summ_sentences.append(sentences[i])
                        break

        # reorder the selected sentences to match their order in the input
        summ_sentences.sort(key=lambda s1: input.find(s1))
        return " ".join(summ_sentences)
if __name__ == "__main__":
naivesum = NaiveSummarizer()
text='''
To see a world in a grain of sand,
And a heaven in a wild flower,
Hold infinity in the palm of your hand,
And eternity in an hour.
A robin redbreast in a cage
Puts all heaven in a rage.
A dove-house fill'd with doves and pigeons
Shudders hell thro' all its regions.
'''
text2 = '''
You conclude with the aphorism of Hippocrates, "Qui gravi morbo correpti dolores non sentiunt, us mens aegrotat" (Those who do not perceive that they are wasted by serious illness are sick in mind), and suggest that I am in need of medicine not only to conquer my malady, but even more, to sharpen my senses for the condition of my inner self. I would fain give you an answer such as you deserve, fain reveal myself to you entirely, but I do not know how to set about it. Hardly do I know whether I am still the same person to whom your precious letter is addressed.
'''
print(naivesum.summarize(text2,3))
print(naivesum.summarize(text,2))
You cannot use the summarize function directly as a callback for your button; instead, you should wrap it in another function that calls summarize and then displays the result in the Entry widget.
First, you have to add a text variable to your widget so you can read and write the text, like this:
self.outputvar = StringVar()
Entry(self, textvariable=self.outputvar)
Now you can add a callback function to your button, like in your other question, doing this:
def callback(self):
    text = "lorem ipsum"  # get actual input text, maybe from another widget
    num = 3  # get proper value for whatever this is for
    result = self.summarize(text, num)  # call summarize
    self.outputvar.set(result)  # show results in your widget
Alternatively, you could use a Text widget; here, inserting text is handled differently though.
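For reference, here is a minimal self-contained sketch of how the pieces could fit together (the widget layout and names are my illustrative assumptions, not part of the original answer; NaiveSummarizer is the class from the question):

import tkinter as tk

class SummarizerApp(tk.Frame):
    def __init__(self, master, summarizer):
        super().__init__(master)
        self.summarizer = summarizer
        self.pack()
        # input box for the raw text
        self.input_text = tk.Text(self, height=10, width=60)
        self.input_text.pack()
        # button whose callback wraps summarize()
        tk.Button(self, text="Summarize", command=self.on_summarize).pack()
        # output variable + entry to display the result
        self.outputvar = tk.StringVar()
        tk.Entry(self, textvariable=self.outputvar, width=80).pack()

    def on_summarize(self):
        raw = self.input_text.get("1.0", tk.END)
        self.outputvar.set(self.summarizer.summarize(raw, 2))

root = tk.Tk()
SummarizerApp(root, NaiveSummarizer())
root.mainloop()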
Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up Python, and I'm working on a project which hopes to automatically analyze poems, including detecting the form of the poem. I.e., if it found a 10-syllable line with a 0101010101 stress pattern, it would declare that it's iambic pentameter. A poem with a 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems, which are listed below the program.
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit

...

def cmuform():
    tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
    d = cmudict.dict()
    text = nltk.Text(tokens)
    words = [w.lower() for w in text]
    regexp = "[A-Za-z]+"
    exp = re.compile(regexp)

    def nsyl(word):
        lowercase = word.lower()
        if lowercase not in d:
            return 0
        else:
            first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
            second = ''.join(first)
            third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
            return third
            # return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])

    sum1 = 0
    for a in words:
        if exp.match(a):
            print(a, nsyl(a), end=' ')
            sum1 = sum1 + len(str(nsyl(a)))
    print("\nTotal syllables:", sum1)
I guess the output that I want would look something like this:
1101111101
0101111001
1101010111
The first problem is that I lost the line breaks during tokenization, and I really need the line breaks to be able to identify form. This should not be too hard to deal with, though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllable count of the line will probably decrease.
In addition, the CMU dictionary often says that there is stress on a word ('1') when there is not ('0'), which is why the output looks like this: 1101111101, when it should be the stress of iambic pentameter: 0101010101.
So how would I add some fudging factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01s when the CMU dictionary is not going to output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.
Welcome to Stack Overflow. I'm not that familiar with Python, but I see you have not received many answers yet, so I'll try to help you with your queries.
First, some advice: you'll find that if you focus your questions, your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
split the corpus into lines
for each line, find the syllable length and stress pattern
classify the stress patterns
You'll find that the first step is a single function call in Python:
corpus.split("\n")
and can remain in the main function, but the second step would be better placed in its own function (a rough sketch follows), and the third step would need to be split up itself, and would probably be better tackled with an object-oriented approach. If you're in academia, you might be able to convince the CS faculty to lend you a postgrad for a couple of months to help you, instead of some workshop requirement.
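As a rough illustration of step 2, here is a minimal sketch (my own, not the poster's code; it assumes the CMU data has been downloaded with nltk.download('cmudict'), keeps only the first listed pronunciation, and silently skips out-of-dictionary words, which are exactly the weaknesses discussed below):

from nltk.corpus import cmudict

d = cmudict.dict()

def line_stress(line):
    """Return the concatenated stress digits for one line, e.g. '0101010101'."""
    stresses = []
    for word in line.lower().split():
        word = word.strip('.,;:!?()"\'')
        if word in d:
            # take the first pronunciation; fold secondary stress '2' into '1'
            phones = d[word][0]
            digits = ''.join(ch for ph in phones for ch in ph if ch.isdigit())
            stresses.append(digits.replace('2', '1'))
    return ''.join(stresses)

for line in corpus.split("\n"):  # corpus: the raw poem text, as in the question
    print(line_stress(line))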
Now to your other questions:
Not losing line breaks: as @ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary or change existing ones; look on their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying an input corpus when it doesn't exactly match the pattern: you might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look at (a sketch of the first one follows this list):
Levenshtein distance: gives you a measure of how different two strings are, as the number of changes needed to turn one string into the other. Pros: easy to implement. Cons: not normalized; a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement, but it gives you normalized values (0.0 = completely different, 1.0 = the same) and is suitable for classifying the stress patterns. A CS postgrad or final-year undergrad should not have too much trouble with it (hint hint).
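For illustration, a minimal Levenshtein sketch applied to the question's stress strings (the division by pattern length at the end is my addition, to address the "not normalized" caveat above):

def levenshtein(a, b):
    """Number of single-character edits (insert, delete, substitute) turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

iambic_pentameter = '0101010101'
observed = '1101111101'  # the noisy CMU-derived pattern from the question
dist = levenshtein(observed, iambic_pentameter)
print(dist, dist / len(iambic_pentameter))  # 3 0.3 -> a reasonably close match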
I think those were all your questions. Hope this helps a bit.
To preserve newlines, parse line by line before sending each line to the cmu parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 when nltk returns 1 (it looks like nltk already returns 0 for some words that would never get stressed, like "the"). So you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick the ones that look like known forms (a small sketch of this permutation idea follows below).
For non-dictionary words, I'd also fudge it the same way: figure out the number of syllables (the dumbest way would be by counting the vowels) and permute all possible stresses. Maybe add some more rules like "ea is a single syllable, trailing e is silent"...
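A small sketch of that permutation idea (the helper and its example inputs are illustrative, not the answerer's code): give each word a set of possible stress strings, generate every combination, and score the candidates against the known forms.

from itertools import product

def stress_candidates(per_word_options):
    """per_word_options: one set of possible stress strings per word.
    Yields every possible stress pattern for the line."""
    for combo in product(*per_word_options):
        yield ''.join(combo)

# hypothetical example: an ambiguous monosyllable followed by a fixed iamb
options = [{'0', '1'}, {'01'}]
print(sorted(stress_candidates(options)))  # ['001', '101']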
I've never worked with other kinds of fuzzy matching, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.
This is my first post on Stack Overflow, and I'm a Python newbie, so please excuse any deficits in code style.
But I too am attempting to extract accurate metre from poems, and the code included in this question helped me, so I post what I came up with that builds on that foundation. It is one way to extract the stress as a single string, correct with a 'fudging factor' for the cmudict bias, and not lose words that are not in cmudict.
import nltk
from nltk.corpus import cmudict

prondict = cmudict.dict()

#
# parseStressOfLine(line)
# function that takes a line,
# parses it for stress,
# corrects the cmudict bias toward 1,
# and returns two strings:
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
    stress = ""
    stress_no_punct = ""
    print(line)
    tokens = [words.lower() for words in nltk.word_tokenize(line)]
    for word in tokens:
        word_punct = strip_punctuation_stressed(word.lower())
        word = word_punct['word']
        punct = word_punct['punct']
        # print(word)
        if word not in prondict:
            # if the word is not in the dictionary,
            # add it to the string that includes punctuation
            stress = stress + "*" + word + "*"
        else:
            zero_bool = True
            for s in prondict[word]:
                # oppose the cmudict bias toward 1:
                # search for a zero in the array returned from prondict;
                # if it exists, use that pronunciation
                # print(strip_letters(s), word)
                if strip_letters(s) == "0":
                    stress = stress + "0"
                    stress_no_punct = stress_no_punct + "0"
                    zero_bool = False
                    break
            if zero_bool:
                stress = stress + strip_letters(prondict[word][0])
                stress_no_punct = stress_no_punct + strip_letters(prondict[word][0])
        if len(punct) > 0:
            stress = stress + "*" + punct + "*"
    return {'stress': stress, 'stress_no_punct': stress_no_punct}

# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
    # define punctuations
    punctuations = '!()-[]{};:"\,<>./?##$%^&*_~'
    my_str = word
    # remove punctuations from the string, collecting them separately
    no_punct = ""
    punct = ""
    for char in my_str:
        if char not in punctuations:
            no_punct = no_punct + char
        else:
            punct = punct + char
    return {'word': no_punct, 'punct': punct}

# CONVERT a cmudict pronunciation into just its stress digits
def strip_letters(ls):
    # print("strip_letters")
    nm = ''
    for ws in ls:
        # print("ws", ws)
        for ch in list(ws):
            # print("ch", ch)
            if ch.isdigit():
                nm = nm + ch
                # print("add to nm", nm, type(nm))
    return nm

# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print(parseStressOfLine(line))
line = "Apollo play'd the midwife's part;"
print(parseStressOfLine(line))
line = "Into the world Corinna fell,"
print(parseStressOfLine(line))
"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}